0% found this document useful (0 votes)
37 views

Predicting Code Coverage Without Execution

Uploaded by

bwq1004
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
37 views

Predicting Code Coverage Without Execution

Uploaded by

bwq1004
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

Predicting Code Coverage without Execution

Michele Tufano, Shubham Chandel, Anisha Agarwal, Neel Sundaresan, Colin Clement
Microsoft
Redmond, WA, USA
{mitufano, schandel, anisagarwal, neels, coclement}@microsoft.com

Abstract Focal Method {m}


Code coverage is a widely used metric for quan- public String foo(int x){
if(x == 0){
tifying the extent to which program elements, return "zero";
such as statements or branches, are executed } else if(x > 0){
return "positive";
arXiv:2307.13383v1 [cs.SE] 25 Jul 2023

during testing. Calculating code coverage is } else {


return "negative";
resource-intensive, requiring code building and }
execution with additional overhead for the in- return "impossible";}
strumentation. Furthermore, computing cover- Test Case {t}
age of any snippet of code requires the whole
program context. Using Machine Learning to public void testFoo() {
String res = foo(2);
amortize this expensive process could lower Assert.isEqual("positive", res);}
the cost of code coverage by requiring only
Coverage-Annotated Method {cov(m, t)}
the source code context, and the task of code
coverage prediction can be a novel benchmark > public String foo(int x){
for judging the ability of models to understand > if(x == 0){
code. We propose a novel benchmark task ! return "zero";

called Code Coverage Prediction for Large Lan- > } else if(x > 0){
> return "positive";
guage Models (LLMs). We formalize this task
! } else {
to evaluate the capability of LLMs in under-
! return "negative";
standing code execution by determining which
! }
lines of a method are executed by a given test - return "impossible";}
case and inputs. We curate and release a dataset
we call C OVERAGE E VAL by executing tests Figure 1: Given a focal method m, that is a method
and code from the HumanEval dataset and col- under test, and a test case t covering that method, the
lecting code coverage information. We report code coverage obtained by t on m can be represented
the performance of four state-of-the-art LLMs as the coverage-annotated method cov(m, t), where >
used for code-related tasks, including OpenAI’s represents executed statements, ! represents statements
GPT-4 and GPT-3.5-Turbo, Google’s BARD, not executed, and - represents unreachable code.
and Anthropic’s Claude, on the Code Coverage
Prediction task. Finally, we argue that code cov-
erage as a metric and pre-training data source For example, coverage is one of the metrics con-
are valuable for overall LLM performance on sidered by the Federal Aviation Administration
software engineering tasks. (FAA) for safety certification of avionic equipment,
as documented in DO-178B (Johnson, 1998) and
1 Introduction
DO-178C (Rierson, 2017). Test coverage is also a
Software testing is an essential part of the soft- requirement in the automotive safety standard ISO
ware life-cycle which aims at detecting bugs in a 26262 Road Vehicles - Functional Safety (Palin
program prior to shipping new versions. Code cov- et al., 2011).
erage is a widely used metric which estimates the Given a focal method m, which is executed di-
quality of testing, providing some confidence that rectly by the test case t, code coverage measures
the system will operate conforming to the specified the number of statements that have been executed
requirements. Several standards require a specific (i.e., covered) by the test t. Figure 1 shows an ex-
level of code coverage for software systems before ample of a focal method m (method under test)
they are allowed to be deployed. tested by t. The coverage obtained by t on m
is represented in the coverage-annotated method coverage prediction techniques and LLM code un-
cov(m, t), where executed statements are marked derstanding.
with > while missed (i.e., uncovered statements) We evaluate the performance of four state-of-the-
with ! and unreachable code (i.e., dead code) with art LLMs widely employed for code-related tasks:
- . From this representation, several quantitative OpenAI’s GPT-4 and GPT-3.5, Google’s BARD,
coverage metrics can be computed, such as func- and Anthropic’s Claude. Our ultimate goal is to
tional, statement, branch, and path coverage. gain insights into the capabilities of LLMs in pre-
Code coverage is computed by instrumenting the dicting code coverage, offering a promising alter-
code and running the test suite while monitoring native to execution-based coverage measurement
the code execution. This process is expensive, since in various scenarios. This approach proves advan-
it requires building and executing code, especially tageous when the costs associated with program
for large software projects or when code coverage building and execution are prohibitive, when code
is computed multiple times. Additionally, it is not coverage needs to be invoked multiple times, when
possible to measure code coverage for a snippet of only code snippets are available (e.g., in server-side
code without the availability of the entire program scenarios), or when errors in the project prevent
which contains the given snippet. This situation complete builds. Additionally, this task introduces
happens when only partial code is available, for a novel metric for assessing code understanding
example within a commit log/diff, or when only and serves as a valuable (pre-)training objective.
partial code is transmitted to a server, for security By training models to excel in this task, we be-
and/or networking reasons. lieve we can enhance their overall performance on
code-related tasks.
While Large Language Models (LLMs) have
This paper makes the following contributions:
gained prominence in code-related tasks and
demonstrated impressive results in areas such as • Code Coverage Prediction Task: We propose
code generation and test generation, it remains un- a novel task to assess the capability of LLMs
clear to what extent these models truly understand in understanding code execution by accurately
code execution (Liu et al., 2023). The task of ac- predicting executed lines of a method based
curately determining which lines of a method are on a given test case and inputs.
executed based on a given test case and its inputs
requires a deep understanding of the underlying • Evaluation of State-of-the-Art LLMs: We eval-
code execution dynamics. This motivates the need uate four prominent LLMs (GPT-4, GPT-3.5,
for a dedicated task, referred to as Code Coverage BARD, and Claude) on the Code Coverage
Prediction, which specifically evaluates the capa- Prediction task, providing insights into their
bility of LLMs in comprehending code execution. performance and understanding of code exe-
Further, a model capable of this task is indepen- cution.
dently useful as it can amortize the expensive code
• Curated Dataset: We curate a comprehen-
coverage computation process, or function in cases
sive dataset (C OVERAGE E VAL) of coverage-
where normal code coverage is not possible to com-
annotated methods and test cases, derived
pute.
from the HumanEval dataset. This dataset
In this paper we formalize the Code Coverage is openly available on GitHub1 (Microsoft,
Prediction task, with the primary objective of evalu- 2023) enabling further research and advance-
ating the capability of LLMs in understanding code ment in code coverage prediction techniques.
execution by accurately determining which lines of
a method are executed based on a given test case. 2 Background
To facilitate evaluation, we have curated a compre-
Code coverage is a measure of the degree to which
hensive dataset named C OVERAGE E VAL, consist-
a test suite exercises a software system (Ivanković
ing of coverage-annotated methods. This dataset is
et al., 2019). Code coverage is commonly com-
created by executing tests and code from the Hu-
puted by means of instrumentation. This technique
manEval dataset, allowing us to collect valuable
inserts instrumentation code in various locations
code coverage information. We have organized
within the code or binaries of the program under
and made this curated dataset available on GitHub,
enabling researchers to explore and advance code 1
https://github.com/microsoft/coverage-eval
test, in order to monitor its execution. This inserted sequence of statements Sm t given the focal method

code provides counters to record which function m and a test case t. Formally, this problem can be
or statement of the program have been executed defined in terms of inputs and expected output:
by the test suite. Inserting these additional state- Input
ments within the original code leads to execution
overhead, which can be significant especially for • Focal Method: m
large software programs (Tikir and Hollingsworth, • Test Case: t
2002).
The most common coverage metric is computed Output
at statement level, where statement refers to a syn- t = s∗ , s∗ , . . . , s∗
• Sm 1 2 n
tactic unit of code (e.g., assignment, invocation,
or
assertion), often matching a single line of code.
The coverage indicates whether a statement has t = c ,c ,...,c
• Cm 1 2 n
been executed or not, and aggregated metrics can
be computed at function/program level to measure Specifically, the output can be either the
coverage-annotated sequence of statements Sm t ,
the amount of statements covered by a test suite.
or the sequence of coverage symbols Cm , whicht
In the example in Figure 1, the test case t executes
four statements in m, which constitutes ∼ 44% can then combined with the original sequence of
statement coverage for the method m. statements Sm = s1 , s2 , . . . , sn , to obtain the
coverage-annotated sequence of statements Sm t =
Given statement coverage information, other
∗ ∗ ∗
s1 , s2 , . . . , sn comprising the coverage cov(m, t).
coverage criteria and metrics can be obtained by
means of static analysis. Statement coverage infor- This final step is performed by aligning the two
mation regarding control structure (e.g., if-else sequences and obtaining s∗i = ci + si , where the +
and case statements) can be used to compute operation refers to string concatenation.
branch coverage, which measure how many log- Let us take as example the focal method m and
ical branches in the program have been executed. test case t in Figure 1. The model is expected to
In the example in Figure 1 only one branch is ex- predict either the coverage-annotated sequence of
t or the sequence of coverage sym-
ecuted (i.e., else if (x > 0) ), while the other statements Sm
two branches are missed by the test case t. bols: > > ! > > ! ! ! -.
In the remainder of this paper we will focus on
3.1 Coverage Prediction for Pre-Training
statement coverage, from which other coverage
criteria can be obtained. We propose that the code coverage prediction task
introduced in our paper can serve as a valuable pre-
3 Code Coverage Prediction Task training task for LLMs focused on code generation.
While current pre-training tasks, such as Masked
Given a method under test (focal method) m, Language Modeling (MLM) help models under-
composed of n statements Sm = s1 , s2 , . . . , sn , stand code syntax and semantics by analyzing vast
and a test case t which exercises the method m, amounts of raw text representing code, our pro-
the coverage-annotated focal method cov(m, t) is posed task enables the model to learn about code
composed of a sequence of n statements Sm t =
execution, which is not technically discoverable by
s∗1 , s∗2 , . . . , s∗n , where each statement s∗i repre- source code text alone.
sents the coverage-annotated statement of si in To accomplish this pre-training, we suggest
m. Specifically, s∗i is marked with one of the augmenting the training data with extensive cov-
three possible coverage symbols c ∈ {>, !, −}, erage logs obtained from Continuous Integra-
where the symbol > identifies statements that have tion/Continuous Deployment (CI/CD) pipelines.
been executed by t, the symbol ! identifies state- These logs contain valuable information about code
ments that have been missed by t, and the sym- coverage from regression tests executed during pull
bol − identifies statements that are unreachable. requests or commits.
This defines a sequence of n coverage symbols By exposing the models to these coverage logs
Cm t = c , c , . . . , c , where c ∈ {>, !, −}.
1 2 n i during pre-training, they can learn to associate test
We define the Code Coverage Prediction Task as cases and inputs with the specific lines of code that
the problem of predicting the coverage-annotated are executed. This pre-training approach enhances
the models’ understanding of how different parts and/or simple mathematics (Chen et al., 2021).
of the code are exercised by various test scenar- Each code solution in the dataset includes a func-
ios. Consequently, the models can acquire a deeper tion signature, a docstring containing the problem
comprehension of the relationships between inputs, description, a function body, and several unit tests.
tests, and code execution, leading to improved code We extend the HumanEval dataset to include cov-
generation capabilities. erage, calculated using the function body and the
Integrating coverage prediction as a pre-training respective unit tests.
task could enable models to learn from real-world
test scenarios, capturing the nuances of code execu- 4.2 Coverage Analysis
tion in practical settings. This real-world exposure
In this section, we describe the steps taken to ana-
should enhances the models’ ability to generate
lyze the code coverage on the HumanEval dataset
code that aligns with actual testing practices.
and create our C OVERAGE E VAL dataset.
Furthermore, incorporating coverage prediction
Each code solution in the HumanEval dataset
as a pre-training task opens up possibilities for
is accompanied by a single test case, which in-
transfer learning. Models pre-trained on coverage
cludes multiple asserts designed to test the cor-
prediction can be fine-tuned on downstream tasks,
rectness of the code solution based on the given
such as bug detection or test case generation, where
problem’s functional requirements. These asserts
understanding code execution is crucial. The mod-
cover various inputs, scenarios, and code state-
els’ pre-existing knowledge of code coverage can
ments/branches. To enhance the dataset and in-
provide a solid foundation for these related tasks,
crease the complexity of each data point, we split
potentially improving their overall performance.
the single test case into multiple test cases, each
containing a single assert. This splitting process
4 C OVERAGE E VAL Dataset
allows us to generate additional method-test pairs,
In addition to proposing the code coverage predic- as well as making each data point more challenging.
tion task, this paper also introduces C OVERAGE E- The original test case may cover most of the lines
VAL , a dataset specifically designed for evaluating and branches in the method, but each individual
LLMs on this task. This section outlines the pro- assert covers only a subset of them.
cess of curating this dataset, which begins with By performing this split, we create a more di-
the HumanEval dataset (Chen et al., 2021). By verse set of method-test pairs within the dataset.
executing test cases from the HumanEval dataset, Each individual test case invokes the focal method
we gather code coverage information. To create once and covers a subset of the statements and
C OVERAGE E VAL, we parse the code coverage logs branches within the method. This enables us to
generated during the execution of the test cases. evaluate the LLMs’ ability to predict code coverage
This parsing step enables us to extract the relevant at a more granular level, going beyond the overall
coverage annotations. We then carefully structure coverage of the method. It also adds complexity
and export the dataset in a format that facilitates its to the task, as predicting coverage for each assert
use and evaluation by researchers and practitioners requires a deeper understanding of the code and its
alike. potential execution paths.
By curating this dataset, we aim to provide a Subsequently, we execute the extracted test cases
standardized benchmark for evaluating LLMs on individually with pytest. During the execution,
the code coverage prediction task. The availabil- we also enable the coverage computation using
ity of C OVERAGE E VAL enables researchers to ex- coverage.py. To do so, we run the following com-
plore and advance code understanding, fostering mand: coverage run -m pytest <test_name>
innovation and enabling the development of more where <test_name> is each individual test in the
effective models. dataset.
Next, for each test case t, we analyze the cor-
4.1 HumanEval responding coverage report obtained by the test
The HumanEval dataset consists of 164 hand- execution in order to extract the annotated cover-
written problems and their code solutions, where age cov(m, t). The coverage report marks each
each problem is a programming task involving source code line in the file with coverage informa-
language comprehension, reasoning, algorithms tion, specifying whether the statement has been
executed or not. Problems Solutions Tests
Coverage Symbols

We automatically parse this report and extract Executed (>) Missed (!) Unreachable (-)
158 164 1160 20037 1734 0
the corresponding annotated coverage cov(m, t).
At the end of this process, we obtained a dataset Table 1: C OVERAGE E VAL statistics.
where each data point is formed by a triplet d =
{m, t, cov(m, t)}. 5 Evaluating LLMs

4.3 Data Format In this section, we present our evaluation of state-


of-the-art Language Models (LLMs) for the pro-
The C OVERAGE E VAL dataset maintains the struc- posed task of Code Coverage Prediction. We se-
ture of the HumanEval dataset, with the addition lected four highly regarded LLMs that are not only
of coverage information for each test. Each record popular for code generation but also widely used
corresponds to a unique problem and contains the for other Natural Language (NL) tasks. The LLMs
following fields: we employed for this evaluation are OpenAI’s GPT-
4 and GPT-3.5, Google’s BARD, and Anthropic’s
• Problem ID: A unique ID for the problem
Claude.
• Problem: The name of the method written to GPT-3.5 (Brown et al., 2020) and GPT-4 (Ope-
solve the problem nAI, 2023) are large language models developed
by OpenAI which are Transformer-style models
• Method: The method contents, including a (Vaswani et al., 2017) pre-trained to predict the
function signature, a docstring with the details next token in a document. Both models were then
of the problem, and the function body. fine-tuned using Reinforcement Learning from Hu-
man Feedback (RLHF) (Christiano et al., 2017).
• Tests: A list of unit tests for the problem. Each GPT-4 improves over the predecessor by accept-
item in the list includes the unique ID of the ing as input both images and text (multimodal
test and the code of the test. We have also model) and producing text as output. BARD is
added coverage information for each test in a conversational AI developed by Google based
the following two forms: on LaMDA(Thoppilan et al., 2022) a Transformer-
1. Coverage: The code of the method, with based language models trained on dialogue (Adi-
each line annotated with > , ! or - for wardana et al., 2020). Anthropic Claude is a 52-
code that is executed, missed or unreach- billion-parameter LLM developed by Anthropic.
able by the given test. Claude was pretrained on a large text corpus and
finetuned with "RL from AI Feedback" (RLAIF),
2. Coverage Sequence: A list of equal
where AI feedback are steered by a small set of
length to the number of lines in the
principles drawn from a "constitution" defined by
method, where each value in the list is
humans (Bai et al., 2022).
> , ! or - , depending on the status of
the respective line of code in the method. 5.1 Experimental Design
Figure 3 (Appendix) shows a sample record from When evaluating the LLMs on the code coverage
the C OVERAGE E VAL dataset. C OVERAGE E VAL prediction task, we designed the experiments to
is available to the public via GitHub (Microsoft, assess their performance on non-trivial coverage
2023). sequences while progressively providing more in-
Table 1 reports the statistics for the C OVER - formation and examples.
AGE E VAL dataset in terms of number of problems, First, we filtered out data points d =
code solutions, tests, and coverage symbols. The {m, t, cov(m, t)} where the coverage sequence is
discrepancy between number of problems and so- trivial consisting exclusively of the symbol > .
lutions is explained by the fact that some problems These cases represent methods with no branches
have multiple solutions. It is also worth noting that or where the test case covers every statement in
while our dataset currently does not contain any un- the focal method. Although these data points are
reachable code (-), we have proactively considered included in the C OVERAGE E VAL dataset, we ex-
the potential presence of unreachable code while cluded them from this specific evaluation. The sub-
designing the task. set of data points containing only trivial symbols is
reported in our online appendix. It’s important to System NL Prompt
note that no data points in the dataset has a cover-
age sequence consisting solely of ! or - symbols. You are a terminal. Instruction:
When user runs:
After this filtering step, we were left with 478 data coverage run -m pytest code.py
points on which we evaluated the LLMs. then you'll cat the file code.py,
with each line starting with either of the two symbols below:
The prompt used to evaluate the LLMs was de-
signed to include the following sections: > if the line is executed
! is the line is not executed

Example output:
• System NL prompt: a prompt providing a > line1
! line2
natural language description of the task, aimed > line3
at conveying the task to the LLM. ...
> linen

You job is to figure out which line will be executed


• Examples: zero, one, or multiple examples of given different test cases.
the task.
Examples
(anaconda3-2020.11) cat code.py
• Focal Method m and Test Case t. def split_words(txt):
...

In terms of the System NL prompt, our evalua- (anaconda3-2020.11) cat test.py


def test():
tion involved experimenting with various prompts assert split_words("Hello,world!") == ["Hello","world!"]
assert True
and descriptions. We achieved the most favorable
outcomes by utilizing a system prompt that emu- (anaconda3-2020.11) coverage run -m pytest test.py
> def split_words(txt):
lates a terminal environment (e.g., python terminal).
> if " " in txt:
Within this prompt, we instructed the LLM to gen- ! return txt.split()
erate the code coverage output based on a given test > elif "," in txt:
case and method. For OpenAI models, we included > return txt.replace(',',' ').split()
this prompt in the specific system prompt section, ! else:
...
while for BARD and Claude, we incorporated it as
the initial part of the prompt. Focal Method m + Test Case t
To comprehensively assess the LLMs’ perfor- (anaconda3-2020.11) cat code.py
def <focal_method>
mance, we conducted evaluations using different ...
numbers of examples for the code coverage predic-
(anaconda3-2020.11) cat test.py
tion task. Specifically, we employed zero-shot, one- def test():
shot, and multi-shot prompting approaches. This ...

allowed us to examine the impact of example avail- (anaconda3-2020.11) coverage run -m pytest test.py
ability on the models’ performance and their ability
to generalize the task across various methods. Figure 2: Code Coverage Prediction Task Prompt: (i)
System NL Prompt instruct the LLM to operate as in a
When selecting examples for evaluating cover- terminal environment; (ii) zero, one, or multiple exam-
age on a particular method mi , we took care to ples of the coverage prediction task may be shown; (iii)
prevent data leakage and encourage the LLMs to the current focal method m and test case t are provided
generalize their predictions to other methods. To
achieve this, we randomly sampled a data point
{mj , t, cov(m, t)} where mj ̸= mi when provid- 5.2 Evaluation Metrics
ing examples. In this section we describe the evaluation metrics.
Finally, the prompt provides a focal method m Given the method m, the test case t, and the se-
and a corresponding test case t for which we ex- quence of coverage symbols Cm t = c ,c ,...,c ,
1 2 n
pected the model to predict the code coverage. Fig- where ci ∈ {>, !, −}, the model generates a
ure 2 shows an example of the prompt we designed. predicted sequence of coverage symbols Ĉm t =

Inference is performed on all the LLMs with ĉ1 , ĉ2 , . . . , ĉn . We consider the following metrics
temperature and topp set to 0, and generating one to evaluate the performances of our proposed ap-
sample. proach.
zero-shot one-shot multi-shot
Model
Match Stmt Branch Match Stmt Branch Match Stmt Branch
OpenAI GPT-4 (gpt-4) 25.75 84.47 20.16 22.85 90.71 22.65 30.04 90.5 22.5
OpenAI GPT-3.5 (gpt-3.5-turbo) 0 39.87 8.33 8.17 76.53 17.17 11.03 82.29 17.9
Google BARD (text-bison-001) 0 81.27 17.21 1.87 86.93 19.63 21.56 85.66 20.52
Anthropic Claude (claude-1.3) 3.9 84.47 20.07 4.83 83.21 19.16 6.88 55.7 12.23

Table 2: LLMs performances on the Code Coverage Prediction Task. The table reports the percentages of predicted
coverage sequences that match the ground truth (Match), the percentage of correct coverage symbols for statements
(Stmt), and specifically for branches (Branch). Evaluation performed for zero-shot, one-shot, and multi-shot.

5.2.1 Perfect Sequence Match ples are provided in the prompt. Notably, the other
The perfect sequence match metric counts the num- LLMs achieve low exact matches with zero-shot
t exactly
ber of times that the predicted sequence Ĉm prompting (between 0 and 4%), suggesting that
matches (symbol-by-symbol) the target coverage these foundational models may not have been ex-
sequence Cm t . This represents the case where the posed to coverage logs during their training or
model predicts the coverage with perfect accuracy that. The second best-performing model is Google
for all the statements and branches. BARD, with an exact sequence match reaching
21.5% with multi-shot prompting.
5.2.2 Statement Correctness Regarding the percentage of correct coverage
The statement correctness metric measures the per- statements (see Stmt), most models demonstrate
centage of statements for which the execution pre- improvement as more examples are included in
diction is correct. This is equivalent to the per- the prompt. OpenAI GPT-4 obtain the overall best
centage of symbols in the predicted sequence that scores between 84% and 90% of statement correct-
match the target sequence. ness.
When considering only statements involved in
5.2.3 Branch Correctness
branches (e.g., if-else, while), it becomes evi-
The branch correctness metric measures the per- dent that there is a significant drop in correct pre-
centage of branch-specific statements for which dictions. In fact, the best performing model, Ope-
the execution prediction is correct. The branch nAI GPT-4, accurately predicts a modest 22% of
correctness only considers the symbols associated these symbols when one- and multi-shot is used
with branch statements. It measures the percentage for prompting. It is important to note that this sub-
of symbols in the predicted sequence (associated set of statements, which are intricately connected
with branches) that match the symbols in the target to branches, presents a greater challenge for eval-
sequence. uation because the LLM must reason about the
boolean conditions that determine which branch
6 Results is covered. Consequently, accurately predicting
Table 2 presents the performance of different LLMs coverage symbols within this context requires the
on the Code Coverage Prediction task. The table model to possess a profound understanding of the
showcases the percentage of predicted coverage conditional logic that guides program execution.
sequences that match the ground trught (Match), Despite the surprisingly strong results of Ope-
the percentage of correct coverage symbols for all nAI GPT-4 on the Code Coverage Prediction task,
the statements (Stmt), and the percentage of correct it should be noted that the model still fails to gener-
coverage symbols when only considering branch ate the correct coverage for more than 70% of the
statements (Branch). Evaluation performances are method-test pairs in the C OVERAGE E VAL dataset.
computed using zero-shot, one-shot, and multi-shot This emphasizes that LLMs have a long way to go
prompting. in developing a deep understanding of code execu-
OpenAI GPT-4 demonstrates the highest perfor- tion.
mance on this task, achieving 24.75% exact match We believe that in order to enhance code gen-
with zero-shot prompting and improving to 30% eration results, these LLMs should gain a com-
with multi-shot prompting, where up to 6 exam- prehensive understanding of code execution under
different inputs and test cases. Therefore, we assert 7.3 Live Coverage
that our dataset and proposed task can contribute
Live Unit Testing, integrated into various IDEs, al-
to the advancement of LLMs towards this goal.
lows developers to receive real-time feedback on
7 Discussion& Applications the impact of code changes on existing tests and
identifies whether newly added or modified code
LLMs trained to excel on the Code Coverage Pre- is covered by existing tests. In this scenario, the
diction task could offer a promising alternative to Code Coverage Prediction approach can be applied
traditional execution-based code coverage measure- by replacing the actual execution of test cases with
ment in various scenarios. In this section, we dis- an AI inference call to predict the coverage on the
cuss several use case scenarios where this approach modified or newly added methods. This provides
can be valuable and beneficial. developers with immediate feedback on code cov-
erage without the need for executing the entire test
7.1 Expensive Build & Execution suite. By utilizing LLM-based models for code
For large software projects with millions of lines coverage prediction, developers can streamline the
of code and numerous dependencies, the build and testing process and receive timely insights into the
execution process can be time-consuming and ex- coverage of their code changes.
pensive. In such cases, developers may want to
analyze the code coverage obtained by newly writ- 8 Conclusion
ten tests without waiting for the lengthy build phase.
By leveraging LLMs trained on the Code Coverage In this paper, we introduced the novel task of Code
Prediction task, developers can predict the coverage Coverage Prediction, which aims to assess the ca-
obtained by the new tests on existing methods with- pabilities of Large Language Models (LLMs) in
out the need to build the entire project or execute understanding code execution by accurately pre-
the tests. This enables developers to quickly as- dicting the lines of code that are executed based
sess whether additional tests are required to cover on given test cases. We curated a comprehen-
missed lines or branches in the methods, saving sive dataset named C OVERAGE E VAL, consisting of
valuable time and resources. coverage-annotated methods derived from the Hu-
manEval dataset. This dataset enables researchers
7.2 Limited Code Availability to explore and advance code coverage prediction
Traditional code coverage computation requires the techniques and LLM code understanding.
complete source code of the codebase to be avail- We evaluated the performance of four state-of-
able for instrumentation and execution. However, the-art LLMs, namely OpenAI’s GPT-4 and GPT-
there are scenarios where only a partial view of the 3.5, Google’s BARD, and Anthropic’s Claude, on
code is accessible, making code coverage computa- the Code Coverage Prediction task. The results
tion impossible using traditional methods. demonstrated that GPT-4 achieved the highest per-
In cases where limited code availability poses a formance, with 10.46% exact match with zero-shot
challenge, the Code Coverage Prediction approach prompting and 24.48% with multi-shot prompting.
can be employed. For example, when utilizing an However, none of the models, including GPT-4,
AI code generation service from an IDE, develop- achieved high accuracy in predicting code cover-
ers may transmit only a partial view of the code age, indicating that LLMs still have a long way
to the server where the AI model resides. In this to go in developing a deep understanding of code
scenario, the server can use the proposed approach execution.
to predict the code coverage of the AI-generated The Code Coverage Prediction task serves as a
test cases on the given method. This enables esti- valuable metric for assessing code understanding
mation of the code coverage without the need for and can potentially contribute to the enhancement
the entire codebase, addressing privacy concerns of LLMs’ overall performance on code-related
and network limitations. The predicted code cover- tasks. By training models to excel in this task,
age can then be used to make informed decisions, we can improve their ability to comprehend code
such as generating additional tests if coverage is execution dynamics, which is crucial for tasks such
insufficient or transmitting the generated tests to as code generation and test generation.
the user if coverage is satisfactory.
References Microsoft. 2023. Coverage-eval. https://github.
com/microsoft/coverage-eval.
Daniel Adiwardana, Minh-Thang Luong, David R So,
Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, OpenAI. 2023. Gpt-4 technical report.
Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu,
et al. 2020. Towards a human-like open-domain chat- Rob Palin, David Ward, Ibrahim Habli, and Roger Riv-
bot. arXiv preprint arXiv:2001.09977. ett. 2011. Iso 26262 safety cases: Compliance and
assurance.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu,
Amanda Askell, Jackson Kernion, Andy Jones, Leanna Rierson. 2017. Developing safety-critical soft-
Anna Chen, Anna Goldie, Azalia Mirhoseini, ware: a practical guide for aviation software and
Cameron McKinnon, et al. 2022. Constitutional DO-178C compliance. CRC Press.
ai: Harmlessness from ai feedback. arXiv preprint
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam
arXiv:2212.08073.
Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng,
Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie
2022. Lamda: Language models for dialog applica-
Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
tions. arXiv preprint arXiv:2201.08239.
Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, et al. 2020. Language models are few-shot Mustafa M Tikir and Jeffrey K Hollingsworth. 2002.
learners. Advances in neural information processing Efficient instrumentation for code coverage test-
systems, 33:1877–1901. ing. ACM SIGSOFT Software Engineering Notes,
27(4):86–96.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming
Yuan, Henrique Ponde de Oliveira Pinto, Jared Ka- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
plan, Harri Edwards, Yuri Burda, Nicholas Joseph, Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Greg Brockman, Alex Ray, Raul Puri, Gretchen Kaiser, and Illia Polosukhin. 2017. Attention is all
Krueger, Michael Petrov, Heidy Khlaaf, Girish Sas- you need. In Advances in neural information pro-
try, Pamela Mishkin, Brooke Chan, Scott Gray, cessing systems, pages 5998–6008.
Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz
Kaiser, Mohammad Bavarian, Clemens Winter,
Philippe Tillet, Felipe Petroski Such, Dave Cum-
mings, Matthias Plappert, Fotios Chantzis, Eliza-
beth Barnes, Ariel Herbert-Voss, William Hebgen
Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie
Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain,
William Saunders, Christopher Hesse, Andrew N.
Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan
Morikawa, Alec Radford, Matthew Knight, Miles
Brundage, Mira Murati, Katie Mayer, Peter Welinder,
Bob McGrew, Dario Amodei, Sam McCandlish, Ilya
Sutskever, and Wojciech Zaremba. 2021. Evaluating
large language models trained on code.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Mar-


tic, Shane Legg, and Dario Amodei. 2017. Deep
reinforcement learning from human preferences. Ad-
vances in neural information processing systems, 30.

Marko Ivanković, Goran Petrović, René Just, and Gor-


don Fraser. 2019. Code coverage at google. In Pro-
ceedings of the 2019 27th ACM Joint Meeting on
European Software Engineering Conference and Sym-
posium on the Foundations of Software Engineering,
pages 955–963.

Leslie A Johnson. 1998. Do-178b. Software Considera-


tions in Airborne Systems and Equipment Certifica-
tion, Crosstalk Magazine.

Chenxiao Liu, Shuai Lu, Weizhu Chen, Daxin Jiang,


Alexey Svyatkovskiy, Shengyu Fu, Neel Sundare-
san, and Nan Duan. 2023. Code execution
with pre-trained language models. arXiv preprint
arXiv:2305.05383.
A C OVERAGE E VAL Example B Deployed Systems
We deploy our approach in two systems covering
Problem: rounded_avg some of the use cases described in the paper.

ID = 104 B.1 System A - Live Coverage


def rounded_avg(n, m): Figure 4 shows the deployment of System A, which
"""SYNTH You are given two positive integers n and m,
and your task is to compute the average of the integers provides live coverage prediction for developers
from n through m (including n and m).
Round the answer to the nearest integer directly into their IDE. System A supports the sce-
and convert that to binary. nario where a developer is writing tests for a given
If n is greater than m, return -1.
Example: method (e.g., Fibonacci(n)) in their codebase.
rounded_avg(1, 5) => "0b11"
rounded_avg(7, 5) => -1
System A provides live coverage information (bot-
rounded_avg(10, 20) => "0b1111" tom of Figure 4) where lines covered by the tests
rounded_avg(20, 33) => "0b11010"
""" are marked with > and highlighted in green and
if m < n: the line missed are marked with ! and highlighted
return -1
summation = 0 in red.
for i in range(n, m+1):
summation += i The benefits provided by System A are the fol-
return bin(round(summation/(m - n + 1))) lowing: (i) no need to build the entire codebase;
Test Cases (ii) no need to execute the tests; (iii) live and
lightweight coverage prediction.
def test_658():
assert rounded_avg(185,546) == "0b101101110"
assert True B.2 System B - Test Generation with
Coverage
def test_659():
assert rounded_avg(362,496) == "0b110101101" Figure 5 shows the deployment of System B, which
assert True
provides Test Suites with a coverage guarantee.
def test_660():
assert rounded_avg(560,851) == "0b1011000010" System B supports the scenario where a developer
assert True is requesting test cases for a given method and
would like to obtain a certain degree of coverage
Coverage-Annotated Method {test_id = 660} on the method under test. Once the method is
> def rounded_avg(n, m): transmitted to the Test Generation Service, the Test
> if m < n: Generation Model (i.e., an AI-based test generation
! return -1 tool or any other tool) outputs a first batch of test
> summation = 0
case candidates. The Coverage Prediction Model
> for i in range(n, m+1):
> summation += i
analyzes these tests and the method under test, and
> return bin(round(summation/(m - n + 1))) predicts the coverage that these tests achieve on
the method. If the coverage is satisfactory (w.r.t.
Coverage Sequence {test_id = 660} a given criteria and threshold) the tests are trans-
[ > , > , ! , > , > , > , > ] mitted to the IDE and shown to the developer. If
the tests do not meet the criteria in terms of cover-
Figure 3: Example record from the C OVERAGE E VAL age, the Test Generation Service requests additional
dataset. This record is for the rounded_avg problem. tests from the Test Generation Model (optionally,
We have shown 3 of the unit tests, as well as sample providing the specific lines/branches which still
coverage annotation data from one unit test. where > need to be covered).
represents executed statements, ! missing statements,
The benefits provided by System B are the fol-
and - unreachable code.
lowing: (i) automated test generation with cover-
age guarantees; (ii) lightweight generation without
need of build and test execution on the user side.
User Space Server Space

Method
Coverage Prediction
Model

Tests

Coverage Annotated Method

Figure 4: System A - Live Coverage


User Space Server Space

Test Generation
Model

Method
to be Tested

Request
More Tests

Coverage Prediction
Model

Generated
Tests
YES Satisfactory NO
Coverage?

Figure 5: System B - Test Generation with Coverage

You might also like