
FOCUS: GENERATIVE AI FOR SOFTWARE ENGINEERING (GAI FOR SE)

Generative AI to Generate Test Data Generators

Benoit Baudry, Khashayar Etemadi, Sen Fang, Yogya Gamage, Yi Liu, Yuxin Liu, Martin Monperrus, Javier Ron, André Silva, and Deepika Tiwari, KTH Royal Institute of Technology

Digital Object Identifier 10.1109/MS.2024.3418570. Date of publication 28 June 2024; date of current version 10 October 2024. This work is licensed under a Creative Commons Attribution 4.0 License; see https://creativecommons.org/licenses/by/4.0/.

// High-quality data is essential for designing effective software test suites. We propose three original methods for using large language models to generate representative test data, which fit the domain of the program under test and are culturally adequate. //

SOFTWARE TESTS REQUIRE data that are realistic but not real. For example, banking applications cannot be tested with actual customer names and addresses. In these situations, developers rely on fake data generators, also known as fakers, to generate test data to be used in automated tests.

Introduction

Fakers exist in all programming languages. For example, the faker gem and java-faker are popular faking libraries for the Ruby and Java languages. Faking libraries usually include generators for names, phone numbers, and addresses. The development of test data generators is challenging as they must consider several constraints. For example, name generators must capture the cultural sphere into which the system under test is being deployed. In many Spanish-speaking countries, a family name generator must output two names separated by a space. Another constraint relates to humor, as fakers have been proven to be a strong vector of healthy humor for bonding software development teams.1 For an English-speaking developer, character names from Star Trek or Seinfeld are more exciting test data than John Doe, and there is support for this in faking libraries. Hence, the most advanced faking libraries contain data generators for specific languages, idioms, and cultures. These faking libraries are under constant evolution to stay in tune with testing constraints and the testing culture of the time.

Our intuition is that large language models (LLMs) are powerful tools for supporting developers in generating high-quality faking data. LLMs are unique systems that possibly encode 1) domain expertise, 2) testing fluency, and 3) cultural literacy. Domain expertise is key in testing because most interesting data constraints come from the domain. For example, a French mobile phone number generator might output “08 790 60 001.” This would be incorrect, as a French number must start with either “06” or “07” and be split every two digits, for example, “06 79 06 00 01.” For test fakers to engage developers, they should generate data that are both valid with respect to domain constraints and contain references to their language and culture. Finally, test data generators must be executable and, in some cases, readily integrable into existing testing frameworks and their conventions. Our key intuition is that the generative power of LLMs can help master these three key aspects and be used for the generation of fake test data.

In this article, we study the original task of using LLMs for producing fake test data. To the best of our knowledge, this promising area has never been studied.
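As a concrete illustration of the French mobile number constraint discussed above, the following is a minimal handwritten sketch in Java; the class name and structure are ours, not from the article, and a real faker would draw on locale data rather than hardcoded rules.

import java.util.Random;

// Illustrative sketch (hypothetical class): enforces the two French
// mobile-number constraints named in the text, i.e., the number starts
// with "06" or "07" and is written as five groups of two digits.
public class FrenchMobileNumberGenerator {
    private static final Random RANDOM = new Random();

    public static String generate() {
        StringBuilder number = new StringBuilder(RANDOM.nextBoolean() ? "06" : "07");
        for (int group = 0; group < 4; group++) {
            number.append(String.format(" %02d", RANDOM.nextInt(100)));
        }
        return number.toString(); // for example, "06 79 06 00 01"
    }

    public static void main(String[] args) {
        System.out.println(generate());
    }
}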
We fully implement an approach based on state-of-the-art LLM techniques for generating test data. To assess the feasibility of our approach, we curate real-world test data generation scenarios. For example, we use our approach to generate fake movie character names to help in the testing of a market-leading streaming service in China. We systematically assess the ability of the LLM to generate

1. test data that are fit for testing and are culturally adequate
2. executable code that synthesizes fake data
3. end-to-end code that is interoperable with state-of-the-art test data fakers.

To evaluate our approach, we have prompted the LLM 63 times to generate test data. The results indicate that LLMs are indeed able to generate fake test data that are realistic, compliant with data constraints, and readily usable in a testing context. When prompted for executable code and not only data, the LLM produces executable test data generators, ready to be used in test cases. To maximize ease of use and integration within test suites, it is important that the LLM has knowledge about existing faking frameworks; our experiments have also validated this aspect. In addition to these technical assessments of the generated data, we have also assessed the qualitative aspects of the generated test data. Our results indicate that LLMs are able to capture key cultural dimensions, including language and humor, as part of the fake test data.

To sum up, our contributions are as follows:

• an approach and prototype implementation to use LLMs for generating high-quality fake data
• empirical evidence that LLMs are able to understand domain constraints such that the generated data are domain adequate, culturally fit, and humorous.

Test Data Generation With LLMs

Testing aims at exercising a software system realistically, without the system being deployed to an actual production environment. Instead of using production values in testing scenarios, developers rely on hardcoded data or fake data produced by so-called test fakers. In this work, we focus on generating test fakers, either in the form of pure data or in the form of test modules that can be reused by developers to generate test data. In modern development, test fakers are typically provided as reusable faking libraries (see “Faking Libraries”).

FAKING LIBRARIES

The goal of a faking library is to generate realistic fake data, which are used as a substitute for real data within software tests. Fakers contain a rich collection of domain- and locale-specific data, such as for the generation of user names or the generation of Chinese dishes. The first faker, an open source library called Data::Faker introduced in 2005, produces fake data to test PERL programs. Its six generators provide data related to companies, dates, and times, entities on the Internet such as e-mail or IP addresses, Western names of persons, phone numbers, and United States-specific street addresses. Data::Faker is designed to be flexible such that developers can extend it to define custom data generators. Over the years, multiple open source faking libraries have emerged and are actively developed for all major programming languages, including Ruby, Python, Java, JavaScript, Rust, Haskell, and C++.

In addition to conventional fakers, such as e-mail generators, the developers of faking libraries incorporate data generators with strong cultural and humorous references.1 When used within a test case, a quote from Futurama is likely as effective a string input as is Lorem ipsum text, with the added benefit that it is amusing to a developer who encounters it. Furthermore, good locale support within a faker can be helpful for developers who need test inputs in their native language or to verify the internationalization of their system.
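To make the sidebar concrete, here is a minimal usage sketch of the java-faker library discussed in this article; the calls shown follow the java-faker API to the best of our knowledge, and the exact output depends on the library version and its locale data.

import com.github.javafaker.Faker;
import java.util.Locale;

// Locale-aware fake data with java-faker: the locale selects
// culturally appropriate data where the library provides it.
public class FakerDemo {
    public static void main(String[] args) {
        Faker faker = new Faker(new Locale("fr"));
        System.out.println(faker.name().fullName());         // a French-style full name
        System.out.println(faker.address().streetAddress()); // a street address for the fr locale
        System.out.println(faker.phoneNumber().cellPhone()); // a phone number pattern for the locale
    }
}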
Overview

Figure 1 summarizes the key steps of our approach for generating fake test data with LLMs. First, we design prompts, which state the testing domain and the cultural constraints as well as the programming language that the test generators should use. We now illustrate the concept with a realistic use case for our approach: testing a system for public administration. Such a system requires fake addresses that fulfill country-specific constraints, such as the language for street names or the specificities of postal codes. We propose three types of prompts, with different levels of complexity for the LLM-generated test generators, referred to as M1, M2, and M3. The M1 prompt asks the LLM to directly generate pure test data (for example, addresses in Lisbon for a rental agency), with no code involved. M2 directs the LLM to generate a program that generates data (for example, a Java program that generates addresses in Quimperlé for the French tax agency). With M3, we prompt the LLM to generate a program that generates fake data and that aligns with a specific faking library (for example, an address generator pluggable within Faker.js, to be used by a real estate company in Boston).

The second step is applicable to the outputs for M2 and M3. For both cases, we execute the data generation code, as presented in Figure 1. Since the M2 and M3 prompts produce programs that generate data, it is necessary to actually run them to obtain the test data. Finally, the generated test data are used as input data within test cases for the system under test, such as the public administration software system.

The prompts can be written in English or in the local language of the system under test. Per our experiments, we recommend using the local language to maximize cultural adequacy, such as getting proper postal code or phone number formats. If the quality of the generated data is not satisfactory due to limited linguistic training resources, the English language can be employed as a fallback means of prompting.

M1: Directly Generate Test Data

In this mode, we use the ability of the LLM to generate pure test data. The outputs of M1 are directly used as inputs to test a software system. The core foundation of M1 is to craft a prompt that states 1) the application domain of the system under test, 2) the expected natural language and cultural sphere, and 3) the expected number of items. For example, the prompt “生成十个中国武汉的假地址。” asks for fake addresses in Wuhan, China, that can be used as test data for the social security system for Wuhan residents. The expected outcome is a list of 10 addresses that align with the Chinese address format and district names in Wuhan. Figure 1 presents an equivalent Portuguese prompt for addresses in Lisbon. A test harness takes the generated list of items and feeds it into the system under test, as sketched below.
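The harness step can be as simple as the following sketch, where the two sample addresses are taken from the M1 output excerpted in Figure 1; the postal-code check and the call into the system under test are illustrative placeholders of our own.

import java.util.List;
import java.util.regex.Pattern;

// Sketch of a test harness: validate each M1-generated address against a
// loose domain constraint, then feed it to the system under test.
public class M1HarnessSketch {
    // Portuguese postal codes follow the NNNN-NNN format.
    private static final Pattern POSTAL_CODE = Pattern.compile("\\d{4}-\\d{3}");

    public static void main(String[] args) {
        List<String> generated = List.of(
                "Rua da Misericórdia, Lisboa 1200-257",
                "Rua de São Bento, Lisboa 1200-821");
        for (String address : generated) {
            if (!POSTAL_CODE.matcher(address).find()) {
                throw new AssertionError("Invalid postal code in: " + address);
            }
            // systemUnderTest.register(address); // hypothetical call into the SUT
            System.out.println("Valid test input: " + address);
        }
    }
}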
M2: Generate Executable Test Data Generators

Beyond raw test data generation, LLMs can also be employed to produce executable code, which can generate random fake testing data. We refer to this mode as M2. This executable code is then integrated by developers into their test suite to generate fake data.

[Figure 1 shows three example prompts and their outputs: an M1 prompt in Portuguese (“Gera 10 moradas realistas em Lisboa.”) yielding fake Lisbon addresses such as “Rua da Misericórdia, Lisboa 1200-257” and “Rua de São Bento, Lisboa 1200-821”; an M2 prompt in French (“Générer un générateur de données de test en Java qui produit 10 adresses réalistes à Quimperlé, sans utiliser de bibliothèque de test.”) yielding a Java AdresseGenerator program with arrays of towns and postal codes; and an M3 prompt in English (“Generate a Faker.js style faker that can produce at least 10 addresses in Boston.”) yielding a Faker.js extension. English translations of the first two prompts: “Generate 10 realistic addresses in Lisbon.” and “Generate a test generator in Java that produces 10 realistic addresses in Quimperlé without using a faking library.”]

FIGURE 1. An overview of our approach for generating test data generators embedded in application domains and cultural spheres. We design three prompt types with the goal of generating test data. The prompts request an output formatted as realistic fake data (M1), an automated data generator in a specific programming language (M2), or an automated data generator tailored to a specific faking library (M3). The output of the LLM is either fake data (M1) or a fake data generator (M2 and M3). Our culturally diverse team of authors analyzes the adequacy of this output to evaluate the ability of the LLM to generate test data that are domain adequate and culturally adequate.



Figure 1 shows an example prompt for M2. The prompt has three main sections. First, a message guides the LLM to “synthesize a test data generator without using any library.” Second, the prompt specifies the target programming language in which the data generator should be synthesized, such as “Java.” The third component mentions the type and the cultural context of the data that should be generated, for example, “the program should produce addresses in Quimperlé.” M2 prompts leverage the assumed capability of LLMs to 1) understand the testing domain and 2) generate complete executable code.2
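Figure 1 excerpts the AdresseGenerator that GPT-4 produced for the Quimperlé prompt. The following is a runnable sketch in its spirit, reusing the town and postal-code arrays visible in the figure; the street names and the town/postal-code pairing are illustrative, not guaranteed to be geographically exact.

import java.util.Random;

// Sketch of an M2-style generated program: self-contained, no faking
// library, prints 10 fake addresses around Quimperlé.
public class AdresseGenerator {
    private static final String[] VILLES =
            { "Quimperlé", "Mellac", "Rédéné", "Clohars-Carnoët" };
    private static final String[] CODES_POSTAUX =
            { "29300", "29380", "29350", "29310" }; // pairing with towns is illustrative
    private static final String[] RUES =
            { "Impasse des Marronniers", "Avenue de la Liberté", "Rue du Port" };
    private static final Random RANDOM = new Random();

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            int ville = RANDOM.nextInt(VILLES.length);
            System.out.printf("%d %s, %s %s%n",
                    1 + RANDOM.nextInt(120),
                    RUES[RANDOM.nextInt(RUES.length)],
                    CODES_POSTAUX[ville],
                    VILLES[ville]);
        }
    }
}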
M3: Generate Complete Interoperable Test Fakers

In the M3 mode, we aim at synthesizing end-to-end test data generators on top of existing faking libraries (see “Faking Libraries”). The main motivation for this mode is to minimize the effort of integrating the test generators in an existing test suite. To that end, M3 prompts are more effective than M2 prompts. This productivity boost happens thanks to the benefits of software reuse, here in the context of faking libraries.

Figure 1 shows an example prompt used for M3. The prompt asks the LLM to create a JavaScript test data generator that specifically uses the Faker.js library. It also specifies the type of data that should be produced by the test data generator, such as “10 addresses in Boston.”

Experimental Methodology

With the three modes of prompting described previously, we generate test data and test data generators for various application domains. To this end, we draw upon the diverse backgrounds and expertise of the coauthors and guide the LLM to generate test data generators for applications in Chinese, Farsi, Portuguese, Sinhalese, French, Hindi, Spanish, and English. We select a frontier LLM, GPT-4, for our experiments and study its performance under the strictest conditions of a zero-shot setup. We devise three research questions (RQs) to evaluate our novel approach for test data synthesis with LLMs.

• RQ1: To what extent is the LLM able to generate high-quality and domain-adequate data?
• RQ2: To what extent is the LLM able to generate executable code that synthesizes fake data?
• RQ3: To what extent is the LLM able to generate end-to-end, interoperable test data fakers?

Our experimental artifacts can be found at https://github.com/ASSERT-KTH/lollm.

RQ1: Domain Adequacy

In this RQ, we assess the ability of the LLM to generate high-quality test data that are appropriate for the specified application domain. To validate quality, we examine the outputs from the LLM, leveraging the cultural diversity of the authors, who originate from three continents, are fluent in seven mother tongues, and are balanced over genders. This diversity allows us to match one or several authors’ cultural backgrounds with the cultural context of the studied domain to validate the cultural adequacy of the automatically generated test data. The matched experts check whether the synthesized fake data 1) are realistic, 2) are appropriate with respect to the semantics of the application domain, and 3) are consistent with the cultural dimensions specified in the prompt.

RQ2: Executability

In RQ2, we assess the ability of the LLM to synthesize executable code that generates fake data. To do that, we write M2 and M3 prompts and run the code produced by the LLM. We check whether this generated code can successfully be executed to completion. We also check for domain adequacy per the rule described in RQ1.

RQ3: Interoperability

For the final RQ, we evaluate the ability of the LLM to generate accurate and high-quality end-to-end test data fakers with the M3 mode. We start with the same evaluation criteria for adequacy and executability per RQ1 and RQ2. Additionally, we select one open source project that uses a faking library in its test suite. We replace the original faker with the LLM-generated version. Finally, we run the full test suite of the project to verify that all the tests still pass with the generated fake data.

Experimental Results

We have prompted the LLM 63 times, in eight different natural languages, and within 10 application domains. For the sake of brevity, we focus on a subset of domains and prompts in each RQ. The curious reader can browse our appendix repository for more fake test data generators.
RQ1: Data Adequacy

When prompting the LLM to generate pure test data, we discover a high level of cultural adequacy in five cases and an absence of adequacy in two cases. We now discuss the cultural context and the adequacy of the LLM output for two domains.

Case Study: Adequacy of Chinese Data for Testing a Streaming Application. Here, we are testing a streaming application, such as Netflix. We prompt GPT-4 to generate 10 suitable names for the Chinese TV series My Own Swordsman, using the M1, M2, and M3 prompting modes. Next, three Chinese coauthors assess the generated names with respect to their cultural adequacy. According to our evaluation, all three modes of prompting can instruct GPT-4 to generate 10 fake names for My Own Swordsman. Specifically, the names generated by each prompt align with the background of the show and display full cultural adequacy. For example, eight names from the M1 prompt are suitable for our TV show, such as 风流剑痴, 清风子, and 月影红. From our Chinese analysts’ view, 风流剑痴, which means “a charming man who is passionate about swordsmanship,” is considered the best one while also being highly consistent with the mix of ancient culture and humor that characterizes the show.

Recently, several Chinese LLMs have been developed by the research divisions of companies such as Baidu and Alibaba. For further evaluation, we employ the same M1 prompt with two Chinese LLMs, ERNIE Bot and Qwen. We find that GPT-4 performs better than these two LLMs. Overall, although GPT-4 is not a Chinese LLM, it is the better choice for Chinese software testers if they want to obtain relevant fake data with respect to the Chinese language and culture.

Case Study: Adequacy for Low-Resource Languages. In this case study, we are testing a travel application, such as TripAdvisor. We request GPT-4 to generate tourist attractions in Sri Lanka in Sinhalese, with the M1 and M2 prompting modes. We observe that the generated results often include nonexistent places: the locations in the generated outputs are completely hallucinated. We believe that the primary reason for this poor performance lies in the limited training of GPT-4 with Sinhalese text. To produce a more satisfactory output, the model would require training on a large volume of Sinhalese data, which is likely missing in the OpenAI training dataset. Overall, because of poor tokenization and the lack of training data in Sinhalese, the generated data are of low quality. This case study highlights the limitation of our approach for low-resource languages. However, for all the other application domains with high-resource languages, we observe strong domain adequacy, including Chinese, French, Hindi, Portuguese, and Spanish.

ANSWER TO RQ1

LLMs are able to generate high-quality test data. Our evaluation of 63 carefully designed case studies indicates that LLMs successfully capture the application domain of the test data as well as the cultural and linguistic constraints associated with it. This is good since software systems are designed and embedded in countries and cultures all over the world while all being tested with the same rigor.

RQ2: Executability

We now focus on M2 and M3 prompts to evaluate the executability of the code generated by the LLM. We have performed 17 M2 and 25 M3 prompts, and in 29/42 cases, we obtain executable code. We now discuss two interesting case studies.

Case Study: Portuguese Food and Wine Pairing. We aim to generate Ruby code that produces random fake data to test a food recommendation system such as Vivino. In this context, software developers expect the data to be realistic and correspond to culturally adequate suggestions. Ideally, the data constraints are explicit predicates in code that can be checked. We study the extent to which GPT-4 is able to generate an executable data constraint related to pairings between Portuguese food and wine, using the faker Ruby library (M3).

Listing 1 shows a snippet of the test data generator synthesized by GPT-4 that implements this data constraint. Within Portuguese dining, red wines are typically paired with meat, while white wines are paired with fish. On line 25 of the listing, the wine type is checked at runtime against the food type. The pair is kept if it complies with the wine-pairing constraint. This example shows that the capability of the model is twofold: 1) it is aware of wine-pairing conventions, and 2) it is able to embed wine-pairing constraints in code. In total, we had success in generating executable faker code, and remarkably, we also found executable data constraints regarding food and wine types embedded in the generated code.

LISTING 1. A wine-pairing test data generator, generated by an LLM, with an embedded wine-pairing data constraint. In this example, we prompt GPT-4 to generate Ruby code with method M3: “(…) Please create a custom test data generator that generates wine pairings between Portuguese wines and Portuguese foods.”
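Listing 1 itself is Ruby; for consistency with the other sketches in this article, the following transliterates the same idea to Java: the pairing constraint is an explicit, checkable predicate, and generation only emits pairs that satisfy it. The wine and dish names are illustrative.

import java.util.List;
import java.util.Random;

// Sketch of a wine-pairing generator with an embedded data constraint:
// red wines pair with meat, white wines with fish.
public class WinePairingGenerator {
    private static final List<String> RED_WINES = List.of("Douro Tinto", "Dão Tinto");
    private static final List<String> WHITE_WINES = List.of("Vinho Verde", "Alvarinho");
    private static final List<String> MEAT_DISHES = List.of("Cozido à Portuguesa", "Leitão");
    private static final List<String> FISH_DISHES = List.of("Bacalhau à Brás", "Sardinhas Assadas");
    private static final Random RANDOM = new Random();

    // The domain constraint as an explicit predicate that tests can check.
    static boolean isValidPairing(String wine, String dish) {
        return (RED_WINES.contains(wine) && MEAT_DISHES.contains(dish))
                || (WHITE_WINES.contains(wine) && FISH_DISHES.contains(dish));
    }

    static String[] generatePairing() {
        boolean red = RANDOM.nextBoolean();
        List<String> wines = red ? RED_WINES : WHITE_WINES;
        List<String> dishes = red ? MEAT_DISHES : FISH_DISHES;
        return new String[] { wines.get(RANDOM.nextInt(wines.size())),
                              dishes.get(RANDOM.nextInt(dishes.size())) };
    }

    public static void main(String[] args) {
        String[] pair = generatePairing();
        assert isValidPairing(pair[0], pair[1]); // holds by construction
        System.out.println(pair[0] + " with " + pair[1]);
    }
}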
Case Study: Data Constraint in Farsi Poetry—Testing Applications Using Right-to-Left Scripts. In this case study, we are testing a web publishing application that should support right-to-left scripts with constraints on the size of each line. For this, we employ the M2 mode to synthesize executable Java code that generates Farsi poetry in Masnavi style.3 This type of poetry is written from right to left, and the lines of the poem should have approximately the same length. For this experiment, we use the following M2 prompt: “Generate a Java program without using any library that generates Farsi poem in Masnavi style as test data.” The result is a Java application that successfully executes and generates two lines in Farsi. The text is written in Farsi, which means it is right to left as expected. It also consists of lines with almost the same length. One limitation is that the generated text does not follow the rhythmic patterns of Farsi poetry, but we consider this constraint beyond the scope of the considered domain adequacy. Overall, this case study suggests that the LLM is able to generate executable code that produces proper Farsi text. The generated Farsi text can be useful for testing web applications displaying right-to-left text.

ANSWER TO RQ2

We perform 42 prompts to assess the executability of test data generators produced by LLMs. The results indicate that LLMs are able to synthesize ready-to-use programs for generating test data. LLMs are able to reconcile the dual constraints of generating adequate test data in the considered domain and generating source code that compiles and executes in a given programming language.

RQ3: Compatibility With Existing Faking Libraries

For this RQ, we prompt the LLM to extend an existing faking library and integrate this extended library into the test suite of a real-world Java project. We target the test suite of a project called sakai, which is an open source, feature-rich learning management system. sakai already uses the java-faker library in multiple test classes for generating fake names and placeholder text inputs. For example, lines 25 and 26 of Listing 2 show how the faker is used within the test class ElasticSearchTest to generate a fake name for a Resource object, such as Jane Doe key keyboard. Then, this object is used for testing the search implementation within the test case testGetSearchSuggestions (lines 30–36) to obtain search suggestions that contain the strings key and keyboard from an ElasticSearch service. The assertion on line 35 verifies that the suggestion list includes the recently created resource name, Jane Doe key keyboard.
We prompt the LLM in M3 mode to generate a java-faker-style generator that produces character names and quotes from the TV show Merlin. Per our expectations, the LLM generates code that follows the structure of the java-faker library, specifically a faker class called Merlin.java (lines 1 to 20 in Listing 2) and a merlin.yml file containing character names and quotes. Moreover, the two generated files follow the same pattern as the existing generators within java-faker. Next, we extend java-faker with these two new files and replace the existing java-faker version in sakai with this extended version. We update the test class ElasticSearchTest to generate fake names from characters in Merlin, as illustrated on lines 26 and 27 of Listing 2. Finally, we compile the project and execute the test suite, which now uses this extended faker.

LISTING 2. Lines 1–20: An extension of the java-faker library generated by the LLM to produce characters and quotes from the TV show Merlin. Lines 23–37: An excerpt from a test case in the project sakai that uses the java-faker library to generate fake resource names. We replace the existing call to generate fake names (line 26) with names from characters in Merlin (line 27).
straints of data faking.
ated faker, and five assertions assess ers while engaging them even more
behavior using these fake data. We with their tasks.

Assorted Remarks

In this section, we summarize the important insights from our experiments, which may fuel further research in the application of generative AI for fake test data generation.

LLM Selection

Our approach is designed to work with off-the-shelf LLMs. Thanks to instruction tuning, LLMs are able to follow natural language instructions, which, in our case, include the description of an application domain as well as essential cultural framing for the test data generation. Our findings indicate that general-purpose LLMs, such as GPT-4, are suitable for this task, and we believe that open source LLMs would also demonstrate good capabilities. In a low-resource language or cultural setting, the choice of the LLM may result in higher performance variability, as evidenced by our experiments with Sinhalese (see the “Case Study: Adequacy for Low-Resource Languages” section).

LLM Randomness


The code and data generated by LLMs are not deterministic, with randomness increasing when a high temperature is used. This is natural nondeterminism in the test data that are produced. Therefore, as long as the generated test data are valid, the randomness in LLMs’ output is a strength of our approach. As discussed in the “Experimental Results” section, our 63 experiments show the general validity of the generated test data. Overall, we recommend embracing the randomness of LLMs to generate diverse test data while also ensuring, with appropriate techniques, that the generated data comply with the required constraints.
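One way to implement this recommendation is sketched below under assumptions of our own; the constraint and sample data reuse the French mobile-number example from the introduction. The idea is to accept whatever the LLM generates but gate every datum behind an explicit domain predicate before it enters the test suite.

import java.util.List;
import java.util.regex.Pattern;

// Sketch: keep the diversity of LLM outputs while filtering out data
// that violate the domain constraint (here, the French mobile format).
public class GeneratedDataGate {
    private static final Pattern FRENCH_MOBILE = Pattern.compile("0[67]( \\d{2}){4}");

    static List<String> keepValid(List<String> llmOutputs) {
        return llmOutputs.stream()
                .filter(s -> FRENCH_MOBILE.matcher(s).matches())
                .toList();
    }

    public static void main(String[] args) {
        // "08 790 60 001" is the invalid example from the introduction.
        List<String> outputs = List.of("06 79 06 00 01", "08 790 60 001");
        System.out.println(keepValid(outputs)); // keeps only the valid number
    }
}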
with LLMs and machine learning generating test data, with attention to
Harness LLMs in Test Suites

In case study 1 (the “Case Study: Adequacy of Chinese Data for Testing a Streaming Application” section), we observed that the test data produced using the M1-level prompt are of better quality compared to those generated by the M2- and M3-level prompts. This disparity might be attributed to the increased complexity imposed by the M2- and M3-level prompts, potentially complicating the task for LLMs. This suggests an innovative approach: instead of employing LLMs to create a generator for generating test data, LLMs themselves could be directly utilized for generating test data in various test scenarios. This approach simplifies the process, increasing the validity of the generated data and making responsible use of generative AI for test data generation.

Related Work

LLMs have found applications in various phases of software engineering, from the generation of specifications to the maintenance of legacy software.2 Yet there are caveats to the use of LLMs in software engineering tasks owing to their unpredictability and issues such as potential data leakage.4 Ouyang et al.5 highlight that the nondeterminism of LLMs can negatively impact code generation, producing semantically and syntactically different and potentially incorrect code based on hyperparameter configurations. In the same vein, Poesia et al.6 propose an approach to enforce constraints on the code generated by LLMs, including syntax and variable typing. In this article, we leverage the creative power of LLMs to synthesize wide-ranging test data and their programming power to produce executable test data generators.

Several studies have experimented with LLMs and machine learning models as tools to aid software testing, including the automated generation of unit tests.7 MockSniffer8 uses machine learning to recommend components that may be substituted by mocks within unit tests. QTypist9 uses LLMs to generate context-aware text inputs for testing mobile application interfaces. Tan and colleagues explore the use of recurrent neural networks10 and LLMs11 for generating synthetic representative test data for the Norwegian national population registry. Generative adversarial networks have been used to anonymize test data in the health-care domain.12 Similar to these studies, we utilize LLMs to automatically produce realistic synthetic test data. A core novelty of our work is the use of LLMs in the context of fakers to automatically generate executable code that produces test data.

Many researchers have explored the cultural (in)adequacies exhibited by LLM outputs. Cao et al.13 discover that ChatGPT performs poorly in non-American contexts. Naous et al.14 analyze the cultural adaptability of LLMs, concluding that Arabic LLMs default to Western cultures. Chen et al.15 report on the inadequate performance of several LLMs in understanding Chinese humor, including the detection of punchlines. In this work, we have explored diverse dimensions of cultural adequacy within test data. While the LLM performs very well at data generation, we observe that cultural adequacy varies depending on the task and the language used for prompting.

In this article, we have addressed the original problem statement of generating test data generators with LLMs. To validate this far-reaching idea, we have performed an in-depth study into the capabilities of LLMs for generating test data, with attention to both hard software testing requirements and soft cultural requirements, such as cultural adequacy. Our experimental results clearly indicate that current LLMs are able to succeed in this task. Over 63 prompts, we successfully obtain a large majority of good data generators that can execute and produce valuable test data. As a final proof of concept, we integrate an LLM-generated data faker into the test suite of a mature and well-tested Java project. The complete success of this proof of concept indicates that LLM-generated test fakers can support serious and engaging software testing. Overall, our research opens a promising avenue for the use of generative models for generating data that are both adequate for testing and culturally relevant. In future work, we envision using other types of prompting techniques, such as few-shot and chain-of-thought strategies, to improve the capability of the test data fakers generated by LLMs. It would also be interesting to compare the quality, diversity, and cost of LLM-generated fake data against manually crafted fakers.
ABOUT THE AUTHORS

BENOIT BAUDRY is a professor in software technology at the Université de Montréal, Montréal, QC H3T 1J4, Canada. His research interests include automated software engineering, software diversity, and software testing. He favors exploring code execution over code on disk. Baudry received his Ph.D. in software engineering from the University of Rennes, France. Contact him at [email protected].

KHASHAYAR ETEMADI is a Ph.D. student at KTH Royal Institute of Technology, 100 44 Stockholm, Sweden. His research interests include explainable software bots, ML4SE, and program analysis. Etemadi received his M.Sc. degree in software engineering from Sharif University of Technology, Iran. Contact him at [email protected].

SEN FANG is a Ph.D. student at North Carolina State University, NC 27606 USA. Fang received his master’s degree in electronics and communication engineering from Central China Normal University in 2020. His research interests include the intersection of software engineering and machine learning, particularly large language models for code. Contact him at [email protected].

YOGYA GAMAGE is a research engineer at KTH Royal Institute of Technology, 100 44 Stockholm, Sweden. Her research interests include software hardening, software supply chain security, and program repair. Gamage received her bachelor’s degree in computer science and engineering from the University of Moratuwa, Sri Lanka. Contact her at [email protected].

YI LIU is a research assistant at KTH Royal Institute of Technology, 100 44 Stockholm, Sweden. Her research interests include the intersection between blockchain and security. Liu received her bachelor’s degree from Sichuan University and is now pursuing a master’s degree at Stockholm University. Contact her at [email protected].

YUXIN LIU is a Ph.D. student at the KTH Royal Institute of Technology, 100 44 Stockholm, Sweden. Her research interests include software engineering, software analysis, and software package management. Yuxin received her M.Sc. in software engineering from the Harbin Institute of Technology. Contact her at [email protected].

MARTIN MONPERRUS is a professor of software technology and the 2024 Ph.D. Supervisor of the Year at the KTH Royal Institute of Technology, 100 44 Stockholm, Sweden. His research interests include software engineering with a current focus on automatic program repair, AI on code, and program hardening. Monperrus received his Ph.D. from the University of Rennes. He is a Senior Member of IEEE. Contact him at https://www.monperrus.net/martin/.

ANDRÉ SILVA is a Ph.D. student at the KTH Royal Institute of Technology, 11428 Stockholm, Sweden. His research interests include the intersection of automatic program repair and machine learning. Silva received his M.Sc. in computer science from the Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal. He is a Graduate Student Member of IEEE. Contact him at [email protected].


JAVIER RON is a Ph.D. student at the KTH Royal Institute of Technology, 100 44 Stockholm, Sweden. His research interests include the intersection of software engineering, distributed systems, and game development. Ron received his M.Sc. degree from the KTH Royal Institute of Technology. Contact him at [email protected].

DEEPIKA TIWARI is a Ph.D. student at KTH Royal Institute of Technology, 100 44 Stockholm, Sweden, working on software testing. Her research interests include automatic software test generation, production monitoring, and software humor. Tiwari received her master’s degree in computer applications from GGSIPU, India. Contact her at [email protected].

References
1. D. Tiwari, T. Toady, M. Monperrus, and B. Baudry, “With great humor comes great developer engagement,” in Proc. IEEE/ACM 46th Int. Conf. Softw. Eng. Softw. Eng. Soc. (ICSE-SEIS), 2024, pp. 1–11, doi: 10.1145/3639475.3640099.
2. I. Ozkaya, “Application of large language models to software engineering tasks: Opportunities, risks, and implications,” IEEE Softw., vol. 40, no. 3, pp. 4–8, 2023, doi: 10.1109/MS.2023.3248401.
3. W. M. Thackston, A Millennium of Classical Persian Poetry: A Guide to the Reading & Understanding of Persian Poetry From the Tenth to the Twentieth Century. Ibex Publishers, Inc., 1994.
4. J. Sallou, T. Durieux, and A. Panichella, “Breaking the silence: The threats of using LLMs in software engineering,” in Proc. IEEE/ACM 46th Int. Conf. Softw. Eng. New Ideas Emerg. Results (ICSE-NIER), 2024, pp. 102–106, doi: 10.1145/3639476.3639764.
5. S. Ouyang, J. M. Zhang, M. Harman, and M. Wang, “LLM is like a box of chocolates: The non-determinism of ChatGPT in code generation,” 2023, arXiv:2308.02828.
6. G. Poesia et al., “Synchromesh: Reliable code generation from pre-trained language models,” 2022, arXiv:2201.11227.
7. M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,” IEEE Trans. Softw. Eng., vol. 50, no. 1, pp. 85–105, Jan. 2024, doi: 10.1109/TSE.2023.3334955.
8. H. Zhu et al., “MockSniffer: Characterizing and recommending mocking decisions for unit tests,” in Proc. 35th IEEE/ACM Int. Conf. Automated Softw. Eng., 2020, pp. 436–447, doi: 10.1145/3324884.3416539.
9. Z. Liu et al., “Fill in the blank: Context-aware automated text input generation for mobile GUI testing,” in Proc. IEEE/ACM 45th Int. Conf. Softw. Eng. (ICSE), Piscataway, NJ, USA: IEEE Press, 2023, pp. 1355–1367.
10. R. Behjati, E. Arisholm, M. Bedregal, and C. Tan, “Synthetic test data generation using recurrent neural networks: A position paper,” in Proc. IEEE/ACM 7th Int. Workshop Realizing Artif. Intell. Synergies Softw. Eng. (RAISE), Piscataway, NJ, USA: IEEE Press, 2019, pp. 22–27, doi: 10.1109/RAISE.2019.00012.
11. C. Tan, R. Behjati, and E. Arisholm, “Enhancing synthetic test data generation with language models using a more expressive domain-specific language,” in Proc. IFIP Int. Conf. Testing Softw. Syst., Cham, Switzerland: Springer-Verlag, 2023, pp. 21–39.
12. E. Piacentino and C. Angulo, “Generating fake data using GANs for anonymizing healthcare data,” in Proc. Int. Work-Conf. Bioinf. Biomed. Eng., Cham, Switzerland: Springer-Verlag, 2020, pp. 406–417.
13. Y. Cao, L. Zhou, S. Lee, L. Cabello, M. Chen, and D. Hershcovich, “Assessing cross-cultural alignment between ChatGPT and human societies: An empirical study,” 2023, arXiv:2303.17466.
14. T. Naous, M. J. Ryan, and W. Xu, “Having beer after prayer? Measuring cultural bias in large language models,” 2023, arXiv:2305.14456.
15. Y. Chen, Z. Li, J. Liang, Y. Xiao, B. Liu, and Y. Chen, “Can pre-trained language models understand Chinese humor?” in Proc. 16th ACM Int. Conf. Web Search Data Mining, 2023, pp. 465–480, doi: 10.1145/3539597.3570431.

