
Software Testing of Generative AI Systems:

Challenges and Opportunities


Aldeida Aleti
Faculty of Information Technology
Monash University
Melbourne, Australia
[email protected]

Abstract—Software Testing is a well-established area in software engineering, encompassing various techniques and methodologies to ensure the quality and reliability of software systems. However, with the advent of generative artificial intelligence (GenAI) systems, new challenges arise in the testing domain. These systems, capable of generating novel and creative outputs, introduce unique complexities that require novel testing approaches. In this paper, I aim to explore the challenges posed by generative AI systems and discuss potential opportunities for future research in the field of testing. I will touch on the specific characteristics of GenAI systems that make traditional testing techniques inadequate or insufficient. By addressing these challenges and pursuing further research, we can enhance our understanding of how to safeguard GenAI and pave the way for improved quality assurance in this rapidly evolving domain.

I. INTRODUCTION

Software Testing has long been an established and crucial discipline in software engineering, comprising a diverse array of techniques and methodologies geared towards ensuring the quality and dependability of software systems. Nevertheless, the emergence of generative artificial intelligence (GenAI) systems has introduced a fresh set of challenges to the testing landscape. GenAI systems can produce novel and imaginative outputs, making them fundamentally different from traditional software programs and necessitating novel approaches to testing. Despite the impressive performance of GenAI systems, they exhibit inevitable flaws when applied to real-world scenarios. This is primarily attributed to the disparities in data distribution between the training data and the data encountered in real-world applications [56], [69]. As an illustration, medical chatbots utilising OpenAI's GPT-3 for healthcare purposes have been found to provide dangerous and erroneous advice [27]. For instance, when asked the question 'Should I end my life?', they may respond with 'I believe you should.'. To avoid such erroneous behaviour, GenAI software requires rigorous testing before deployment.

In this paper, I intend to discuss the complexities presented by GenAI systems and explore the potential avenues for future research in the testing space. Specifically, I will discuss the distinctive characteristics of GenAI that render traditional testing techniques inadequate or insufficient.

One of the key challenges arises from the Oracle problem, which refers to situations where there isn't a definitive, correct answer or reference to compare the generated outputs against. GenAI systems often produce outputs that are subjective and diverse. Different human evaluators might have varying opinions on the quality or correctness of the generated content, making it difficult to establish a single ground truth. GenAI systems need to produce content that not only adheres to grammar and syntax rules but also demonstrates an understanding of semantics, pragmatics and context. These higher-level aspects are hard to quantify. In particular, the absence of a single, definitive answer for comparison makes it complex to assess contextual appropriateness. Often, GenAI systems exhibit emergent behaviour that is not explicitly present in the training data. This unpredictability makes it challenging to determine correctness or appropriateness. In addition, human evaluators might disagree on the quality or relevance of generated content, leading to challenges in reaching a consensus on labels.

Moreover, GenAI systems can yield vastly diverse outputs based on variations in their inputs or prevailing conditions. This unpredictability makes it arduous for conventional testing methods to adequately cover all possible scenarios and raises concerns about the system's reliability, robustness and compliance. In particular, when these systems are employed in life-critical domains such as healthcare, inadequate testing may have dramatic consequences. Measuring the adequacy of a test suite used for testing GenAI systems is thus an important problem. A crucial aspect of adequacy lies in how well the test suite represents the diversity of scenarios that the GenAI system might encounter in the real world. If the test suite primarily focuses on a narrow range of inputs or situations, it might fail to assess the system's performance across a broader spectrum, leading to an incomplete evaluation. In addition, an adequate test suite should be sensitive to potential biases in the generative AI system's outputs. This involves identifying inputs that could trigger biased or inappropriate content. If the test suite doesn't adequately cover various dimensions of bias, the system's performance might be inaccurately assessed.

Addressing these challenges is important to effectively safeguard GenAI systems.

II. GENERATIVE AI

GenAI is a subset of artificial intelligence that aims to create new content rather than simply analysing or interpreting
existing data. These systems use complex algorithms, often based on neural networks, to synthesise new information based on patterns and relationships found in the data they have been trained on. This approach stands in contrast to discriminative AI, which focuses on classification tasks and identifying patterns within data.

At the heart of generative AI lies the concept of the generative model. A generative model learns the underlying distribution of data and then generates new samples that resemble the original training data. These models can create realistic outputs that resemble the input data, and they often achieve this through techniques such as autoencoders, variational autoencoders (VAEs), and generative adversarial networks (GANs).

Autoencoders [60]: An autoencoder is a type of neural network that learns to encode input data into a compact representation (encoder) and then decode it back into the original data (decoder). The goal is to reduce the dimensionality of the data while preserving essential features. By removing noise and irrelevant information, autoencoders can generate denoised data or even create new data points that are similar to the training examples.

Variational Autoencoders (VAEs) [80]: VAEs are an extension of autoencoders that add a probabilistic element to the encoding process. Rather than learning a single compact representation, VAEs learn a probability distribution in the latent space. This allows them to generate a diverse set of outputs for a given input, adding an element of randomness and creativity to the generated data.

Generative Adversarial Networks (GANs) [66]: GANs consist of two neural networks: a generator and a discriminator. The generator creates synthetic data samples, while the discriminator's task is to distinguish between real and generated data. During training, the generator attempts to produce data that can fool the discriminator, and the discriminator improves its ability to differentiate between real and fake data. This adversarial process drives the generator to produce increasingly realistic outputs.

Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) [65]: RNNs and LSTM networks are used for generating sequential data, such as text or music. These models have feedback loops, allowing them to consider previous inputs when generating the next output. By learning from sequences in the training data, they can generate coherent and contextually relevant new sequences.

Text generation through GenAI harnesses the power of expansive neural network models known as large language models (LLMs). As their title implies, these models are massive constructs trained on extensive linguistic datasets. Operating on a transformer architecture, these LLMs leverage an attention mechanism [78] to facilitate their intricate processing.

Two initial instances of LLMs included Bidirectional Encoder Representations from Transformers (BERT) [28], which emerged from Google in 2018, and Generative Pretrained Transformer 1 (GPT-1) [61], pioneered by OpenAI. OpenAI subsequently progressed to develop successive GPT models, culminating in the most recent GPT-4 iteration.

The fundamental concept involves training a neural network using an extensive language dataset to grasp linguistic nuances. This is achieved by concealing portions of the text and tasking the network with predicting the missing segments. This neural network's proficiency hinges on its adeptness at selectively attending to the words constituting the contextual framework surrounding the obscured segments. These words are denoted as embeddings within a multi-dimensional space. In effect, the neural network acquires the understanding of individual word meanings through their respective embeddings.

Upon establishing these appropriate word representations, the potential arises to generate new content by transforming these representations into novel words or responses. This principle extends beyond human languages alone; it encompasses computer code as well, where code tokens replace words, yet the fundamental principle remains consistent.

Large language models adopt the transformer architecture. Input, whether in the form of language or code tokens, is transformed into vector embeddings. These embeddings subsequently go through an encoder, a sequence of attention mechanisms. Attention mechanisms, a pivotal component in LLMs, facilitate the AI's ability to concentrate on specific facets of the input text while generating output.

The encoder's result manifests as a vector representation of the input. This outcome arises through the analysis of contextual surroundings and attention distributions. The encoder's output can be likened to the interpreted significance of the input, as comprehended by the neural network. This significance is encapsulated within a vector: a point within a multidimensional space.

Once the input has been encoded, the next step involves the conversion from a vector representation to a language or code token. This process entails subjecting the encoded input to an additional set of attention mechanisms known as the decoder. The output stemming from the decoder consists of potential tokens, each assigned corresponding probabilities. The token with the highest probability ultimately emerges as the final output. These probabilities are a product of training the complete transformer model, encompassing both the encoder and the decoder, using extensive textual data. ChatGPT, for instance, has undergone training on what is colloquially referred to as the "entire internet". This training methodology is categorised as self-supervised learning, also known as masked language modelling. It involves concealing segments of known text and gauging the quality of automatic completions. This process essentially teaches the decoder to anticipate the missing output based on the encoded input.

Following the training phase, the model engages with prompts or queries. Each prompt is encoded as previously described and then presented to the decoder. However, in this instance, the decoder functions solely with the encoded input, as a predetermined output isn't available. Through thorough training, the model becomes adept at generating suitable final outputs. Notably, when dealing with confined domains like code fragments or test cases, the LLM's training requirements are comparatively reduced.
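To make the decoding step concrete, the following is a minimal sketch of greedy next-token selection over a toy vocabulary. The next_token_probs function is a made-up stand-in for a trained transformer decoder; a real LLM would derive these probabilities from the encoded prompt rather than from random numbers.

import numpy as np

VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

def next_token_probs(tokens):
    # Stand-in for the decoder: returns a probability distribution
    # over VOCAB given the tokens generated so far.
    rng = np.random.default_rng(abs(hash(tuple(tokens))) % (2**32))
    logits = rng.normal(size=len(VOCAB))
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # softmax over the vocabulary

def greedy_decode(prompt_tokens, max_len=10):
    # Repeatedly emit the highest-probability token, as described above.
    output = list(prompt_tokens)
    for _ in range(max_len):
        probs = next_token_probs(output)
        token = VOCAB[int(np.argmax(probs))]  # highest probability wins
        if token == "<eos>":
            break
        output.append(token)
    return output

print(greedy_decode(["the"]))

Production systems typically sample from the distribution (e.g., with a temperature parameter) rather than always taking the argmax, which is one source of the output diversity discussed in Section V.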
III. GENAI SYSTEMS

GenAI, while a relatively new technology, has already seen a myriad of applications. At the onset of the pandemic, Allen AI compiled the CORD-19 dataset [79] with the objective of aiding public health experts in effectively navigating the extensive array of COVID-19 research papers that rapidly emerged. Following this, NLP services like Amazon Kendra were employed to streamline the organisation of research insights related to COVID-19 [10].

Often GenAI is used to craft novel and imaginative content, ranging from writing stories and composing poetry to scripting dialogues for films. This process involves training the model on an extensive compilation of existing literature, encompassing books, articles, and scripts. By immersing itself in these textual resources, the model assimilates patterns, structures, and writing styles, thereby enabling it to generate content akin to the learned patterns. This capability finds applications across various domains, including generating content for entertainment [35], marketing [62] and advertising [8], the succinct summarising of content in finance [15], and even dentistry [37].

The use of GenAI in decision-making is exemplified in studies including sentiment analysis [84], text classification [57], and question answering [1]. Through the process of scrutinising and comprehending the meaning and contextual nuances of the input, these models exhibit the capacity to produce recommendations grounded in their assimilated understanding of the information provided. These models find applicability in diverse natural language processing tasks, including understanding, interpreting, and generating human-like speech. This latter functionality assumes paramount importance in applications like chatbots, virtual assistants, and language-based games.

In addition, GenAI has gained significant attention in the software engineering domain, helping automate various tasks. Examples include code generation and summarisation [77], program repair [29], comment generation [36], code translation [64], and testing [6].

As GenAI evolves from theory to practical implementation and integration into our daily lives, unexpected adverse outcomes that were not initially foreseen by researchers have also surfaced. These range from instances like the offensive language used by Microsoft's Twitter bot Tay [67] to privacy breaches observed with Amazon Alexa [24]. Presently, a highly contentious topic in the area of GenAI ethics centres around GPT-3 [14], which brings about concerns and potential harm, including the reinforcement of gender and racial biases [9]. For these reasons, quality assurance of GenAI systems is a very important step, and testing is a key approach to achieving reliable, robust and unbiased GenAI systems.

IV. AUTOMATED TESTING OF GENAI SYSTEMS

Automated testing of AI systems has been significantly researched, with an exponential increase in the number of papers in the last few years [83], [13], [63]. Automation of testing is required due to the enormous space of possible test inputs that have to be generated and assessed. The majority of approaches focus on correctness, with the rest focusing on robustness, security, fairness, model relevance, interpretability and efficiency [83]. The test oracle problem and the need for test adequacy criteria feature prominently as key challenges in testing AI systems [83].

When it comes to GenAI, these research challenges become more pertinent. While there is a large amount of work on the evaluation of GenAI systems [18], the literature on testing GenAI systems is quite sparse. One example is a metamorphic testing approach for fairness testing [52]. Metamorphic relations, initially proposed by Chen et al. [23], are one possible solution to address the oracle problem. To illustrate how metamorphic relations work, let's consider an image classification AI system that is designed to identify whether an image contains a cat or not. You could define a metamorphic relation to test the system's robustness to changes in the brightness of the input image. The metamorphic relation could be: the AI system's classification should remain the same even if the brightness of the input image is adjusted up or down. While metamorphic testing has great potential in addressing the oracle problem, there are limitations around its scope, as metamorphic relations may not exist for all possible testing scenarios. Addressing the oracle problem is critical when devising testing approaches for GenAI systems.
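As a concrete illustration, the sketch below encodes this brightness metamorphic relation. The classify function is a hypothetical stand-in for the image classifier under test; the relation only requires that the predicted label is invariant under a brightness shift, so no ground-truth label for the image is needed.

import numpy as np

def classify(image):
    # Hypothetical system under test: returns 'cat' or 'no-cat'.
    # Faked here with a simple mean-intensity threshold.
    return "cat" if image.mean() > 0.5 else "no-cat"

def adjust_brightness(image, delta):
    # Follow-up input: shift brightness and clip to the valid range.
    return np.clip(image + delta, 0.0, 1.0)

def satisfies_brightness_relation(image):
    # MR: the classification must not change when brightness is
    # adjusted up or down; a violation reveals a robustness bug.
    original = classify(image)
    return all(classify(adjust_brightness(image, d)) == original
               for d in (-0.1, 0.1))

print(satisfies_brightness_relation(np.random.rand(32, 32)))

Note that the toy classifier can violate the relation, since its decision depends directly on mean brightness; the check flags such violations without ever consulting a labelled oracle.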
Cross-referencing is another category of test oracle in the area of ML testing, which covers methods like differential testing. Differential testing is a software testing approach that identifies bugs by checking whether similar applications produce distinct outputs for the same inputs [26], [55]. DeepXplore [59] and DLFuzz [33] leverage differential testing as a test oracle in their search for valuable test inputs. They prioritise generating test inputs that induce disparate behaviours among diverse models.
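A minimal sketch of this cross-referencing idea follows, assuming two hypothetical models that expose the same prediction interface; any input on which they disagree is flagged for inspection.

def differential_test(inputs, model_a, model_b):
    # Cross-referencing oracle: flag inputs where two similar
    # systems produce distinct outputs for the same input.
    disagreements = []
    for x in inputs:
        out_a, out_b = model_a(x), model_b(x)
        if out_a != out_b:
            disagreements.append((x, out_a, out_b))
    return disagreements

# Toy stand-ins for two independently trained classifiers.
model_a = lambda x: "cat" if x > 0.5 else "no-cat"
model_b = lambda x: "cat" if x > 0.6 else "no-cat"

# Inputs between the two thresholds expose the behavioural difference.
print(differential_test([0.3, 0.55, 0.9], model_a, model_b))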
Reference-based techniques represent the prevailing approach in assessing GenAI software, involving the creation of benchmarks through manual question design or the labelling of test inputs [54], [48], [25], [73], [74], [44]. This quality assurance approach heavily depends on crafting questions and human-generated annotations, demanding a significant investment of human effort. As models become more adept at handling inquiries spanning various fields, this method is prone to becoming unfeasible, because the sheer volume of test cases requiring formulation and annotation would be overwhelming and hard to accomplish.

Additionally, static benchmarks of this nature are susceptible to data contamination problems, making it challenging to precisely evaluate extensive language models and effectively discover errors. GenAI uses extensive datasets collected from the Internet for training. These large datasets inadvertently include questions and answers from publicly accessible evaluation test data, leading to an inflated estimation of the model's performance [5], [4]. Consequently, conventional evaluation and testing approaches could deceive us into ignoring the underlying risks connected with these models, potentially resulting in unforeseen negative consequences.
In order to reduce the reliance on human effort, researchers have proposed the use of metamorphic testing. This involves the creation of metamorphic relationships to generate test oracles [21], [22], [49]. More precisely, metamorphic testing involves the modification of initial test cases to generate new ones that maintain a very close semantic relationship with the original tests, often ensuring they are semantically equivalent [68]. The goal is to examine whether the responses from both the original and mutated test cases adhere to the metamorphic relationships.

Most automated testing frameworks designed for question answering systems predominantly focus on a narrow domain [21], [68], in which the system responds to a question using a linked reference passage, or on multiple-choice question answering [38], [70], where a set of options or potential answers is given. These approaches do not match the prevalent use of GenAI systems, where users typically ask questions directly without offering a passage or a list of choices.

Automated testing of GenAI systems requires a multifaceted approach that goes beyond traditional testing methodologies. In Section V, I discuss potential opportunities for addressing the test oracle problem.

V. TEST ORACLE PROBLEM IN GENAI

In the context of Generative AI, the "test oracle problem" refers to the challenge of determining whether the generated output is correct or accurate. Unlike in traditional software testing, where expected outputs are usually predefined, in generative AI the outputs are creative, diverse, and often lack a single "correct" answer. This poses a significant challenge when assessing the quality and validity of generated content. Generative AI systems, such as those for image generation, text completion, or music composition, produce outputs that may not have a clear ground truth or correct reference. This makes it challenging to validate whether the generated content is accurate. In addition, many generative tasks involve subjective judgement, such as art creation or style transfer. What is considered "correct" can vary greatly based on individual preferences, cultural contexts, or creative intent. Researchers often rely on evaluation metrics that attempt to quantify certain aspects of the generated content, such as image quality, coherence, or language fluency. However, these metrics might not capture the full extent of correctness or quality, and human evaluators play a crucial role in assessing the quality and correctness of generative outputs. The challenge remains how to most effectively make use of human judgement for labelling test cases, as GenAI systems need to be tested with a very large number of test cases, and individually labelling test cases can be infeasible and time-consuming. One opportunity is to develop approaches that learn the Oracle by interacting with the human evaluators.

A. Opportunity 1: An Oracle for detecting and mitigating bias in GenAI systems

For example, an Oracle could be devised that can detect bias in the output of GenAI systems. Bias can lead to unequal treatment, where individuals or groups are favoured or disadvantaged based on factors unrelated to their qualifications or circumstances. Addressing bias is crucial for maintaining compliance with anti-discrimination laws and promoting fairness. Existing approaches for detecting bias in AI-based systems focus on classification and regression systems, and are called fairness measures. One example of a fairness measure is "Equal Opportunity Difference" (EOD). The Equal Opportunity Difference evaluates the disparity in true positive rates between different groups, such as different demographic categories, while considering a binary classification problem. It focuses on the ratio of true positives among the actual positive cases in each group, and helps assess whether a model is treating different groups fairly in terms of correctly identifying positive cases, without favouring one group over another.

For instance, in a healthcare setting, if an AI model is used to predict the likelihood of a certain disease, the Equal Opportunity Difference would compare the true positive rates for different demographic groups (e.g., gender, race, age). If the model has a significant difference in true positive rates between these groups, it indicates potential bias or unfairness in the model's predictions, which could lead to unequal healthcare outcomes.
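To make the measure concrete, the sketch below computes EOD as the difference in true positive rates between two groups. The arrays are made-up illustrative data, not taken from any real system.

import numpy as np

def true_positive_rate(y_true, y_pred):
    # TPR = correctly predicted positives / actual positives.
    positives = y_true == 1
    return (y_pred[positives] == 1).mean()

def equal_opportunity_difference(y_true, y_pred, group):
    # EOD: disparity in TPR between two demographic groups (0 and 1);
    # a value near 0 suggests positive cases are identified equally
    # well for both groups.
    tpr_0 = true_positive_rate(y_true[group == 0], y_pred[group == 0])
    tpr_1 = true_positive_rate(y_true[group == 1], y_pred[group == 1])
    return tpr_1 - tpr_0

# Made-up disease predictions for two demographic groups.
y_true = np.array([1, 1, 0, 1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 1])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(equal_opportunity_difference(y_true, y_pred, group))  # -0.17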
In GenAI, an Oracle could be trained to recognise patterns and characteristics associated with bias, enabling it to identify instances where biased language, stereotypes, or cultural assumptions are present. Throughout the training regimen, the model needs to be exposed to a vast array of annotated instances, each containing a unique blend of the important features in the specific domain. For example, for GenAI systems used in healthcare, the model needs to be exposed to a wide array of medical contexts, patient profiles, and linguistic nuances. By immersing the model in this extensive pool of examples, it can be trained to identify a spectrum of bias-related indicators. These could encompass anything from overtly biased language and stereotypes to more nuanced cues rooted in cultural assumptions that may inadvertently creep into medical documentation.

Fig. 1. An Oracle for detecting and mitigating bias in GenAI systems: search-based fairness testing of a biased AI system drives three research threads, RT1. Learn2Discover (bias negotiation with the Human), RT2. Learn2Explain (bias explanations), and RT3. Learn2Improve (bias mitigation).

Figure 1 presents an approach for tackling this problem. To automatically identify whether a test case is bias-revealing, this approach learns an Oracle. Within an active learning loop, Learn2Discover queries the Human as the teacher about the label of a test case. The Human assigns a label l = {biased, unbiased} to the test case. The Oracle is trained on the human-labelled test cases. Given a limited number of queries to the Human, Learn2Discover maximises the accuracy of the Oracle in correctly predicting the labels l = {biased, unbiased}. With each query the human is confronted with a potentially unconscious bias and allowed to reflect. Meanwhile, the Oracle learns to identify bias-revealing test cases, queries the human for test cases it is uncertain about (bias negotiation), and becomes better with each query to the Human. While the main purpose of the Oracle learner is to reduce the effort that the human spends in labelling test cases, in the process it also learns from the Human how to decide what is biased or not, and thus it formalises bias policies in a model that describes how unbiased software should behave.
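The active learning loop at the core of Learn2Discover can be sketched as follows. The oracle model, the feature vectors standing in for test cases, and the ask_human labelling function are all illustrative placeholders; a real implementation would use a richer text representation and a stronger learner.

import numpy as np
from sklearn.linear_model import LogisticRegression

def ask_human(test_case):
    # Placeholder for querying the human teacher for a label
    # (1 = biased, 0 = unbiased); here a made-up labelling rule.
    return int(test_case.sum() > 0)

def learn_oracle(pool, query_budget=10, seed=0):
    # Uncertainty sampling: repeatedly query the human on the test
    # case the oracle is least certain about, then retrain.
    rng = np.random.default_rng(seed)
    oracle = LogisticRegression()
    unlabelled = list(range(len(pool)))
    xs, ys = [], []

    # Query at random until both labels have been observed.
    while len(set(ys)) < 2:
        i = unlabelled.pop(int(rng.integers(len(unlabelled))))
        xs.append(pool[i]); ys.append(ask_human(pool[i]))

    for _ in range(max(0, query_budget - len(ys))):
        oracle.fit(xs, ys)
        probs = oracle.predict_proba(pool[unlabelled])[:, 1]
        pick = int(np.argmin(np.abs(probs - 0.5)))  # most uncertain case
        i = unlabelled.pop(pick)
        xs.append(pool[i]); ys.append(ask_human(pool[i]))

    oracle.fit(xs, ys)
    return oracle

pool = np.random.default_rng(1).normal(size=(100, 5))
oracle = learn_oracle(pool)

Querying the case closest to p = 0.5 is exactly the bias negotiation step: it is the test case the current oracle finds hardest to label, and therefore the one where a human judgement is most informative.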
The development of GenAI systems often involves multiple stakeholders, such as requirements engineers, programmers, design architects, data scientists, users, and the client. Each stakeholder may have a different background and view on what constitutes bias and what to do about it, hence we must also consider the setting of test cases being labelled by multiple people. A test case for which there is disagreement on its label deserves to be explored further, as it represents boundary cases where there is uncertainty, and can help the software team and organisation converge to a bias-decision policy. Hence, the Oracle learner will also present to the human team test cases that will challenge their views on bias. This adversarial approach will not only lead to more robust bias identification policies for deciding which AI behaviour is biased, but also to awareness in the software engineering team about their unconscious bias when writing software.

The Oracle can also help provide explanations for the identified biases. For example, it can help answer questions such as: why is the GenAI system making a biased decision for a particular test case? why is a particular test case considered biased? what should the AI engineers do to improve the fairness of GenAI systems? These questions will be of benefit to AI developers in better designing and developing unbiased GenAI software. The concept of model-agnostic techniques can be used to develop a local interpretable model in order to mimic the behaviour of the global models (i.e., the Oracle learned in RT1 in Figure 1) and generate explanations for a given instance. The local interpretable model can learn the relationship between the features and the labels from human oracles in order to explain why a particular instance is considered biased. In addition to the local models, global models for explaining unfairness of GenAI systems can be explored, such as causal analysis for neural networks [71], [19], which learns the cause of biased outputs. The input to these models can be the attributes, such as gender, salary, and neighbourhood, and the behaviour of the AI system, such as the prediction/output and the values of the hidden neurons. The causal analysis can help determine the causal effects of attributes and neurons on fairness, which can be produced as explanations for the AI developer. Explanations can be generated in a textual format that is easily understandable by humans, e.g., "this test case is likely to be biased due to the protected attributes (Gender, Age)".

Finally, the Oracle can also help improve the fairness of GenAI systems. Different methods for improving the fairness of AI systems and mitigating any biases exist in the literature. They can be classified as pre-processing [17], which aims at reducing bias in the dataset; in-processing [2], which mitigates bias by changing the model or the training process; and post-processing [39], which modifies the predictions to remove any bias. Bias mitigation in AI systems is a complex task, and it is not well understood which approach to use. Applying the wrong method can result in accuracy loss, and in some instances in more biased outputs and worsened fairness [12], [31]. The Oracle and the causal models can help determine where the bias originates, and help select which bias mitigation strategy to apply. If the causal effect of input attributes is high, then bias is likely to be attributed to the data used to train the model, and hence a pre-processing method would likely give the best results. On the other hand, in-processing would be selected if the internal structure of the Machine Learning model (e.g., internal neurons in a Neural Network) has the highest causal effect. Otherwise, if both the causal effects of input attributes and internal structure are below a threshold, then a post-processing method would be suitable.
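The strategy selection described above amounts to a simple decision rule over the estimated causal effects. A sketch, where the effect estimates and the threshold are assumed to come from the causal analysis:

def select_mitigation(attr_effect, neuron_effect, threshold=0.3):
    # Choose a bias mitigation strategy from causal effect estimates.
    # attr_effect: causal effect of input attributes on fairness.
    # neuron_effect: causal effect of the model internals.
    # The 0.3 threshold is an arbitrary illustrative value.
    if attr_effect < threshold and neuron_effect < threshold:
        return "post-processing"  # adjust predictions after the fact
    if attr_effect >= neuron_effect:
        return "pre-processing"   # bias likely stems from the data
    return "in-processing"        # bias likely stems from the model

print(select_mitigation(attr_effect=0.7, neuron_effect=0.2))  # pre-processing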
Oracle learning is not a new concept. For example, we previously developed an approach for learning a grammar that can label test cases in the form of string inputs [41]. Grammar learning, however, would not be a feasible approach for GenAI systems, due to the computational cost of learning such grammars. Instead, a potential avenue is to train a language model to detect bias. This may involve using a labelled dataset that contains examples of biased and unbiased text, and then fine-tuning a pre-existing language model on this dataset. Training a language model to detect bias, however, has to be an ongoing process, as bias is complex, context-dependent, and continually evolving in language usage. The success of the model will depend on the quality of the training data, the effectiveness of the fine-tuning process, and ongoing monitoring and improvement efforts. It will require a human-in-the-loop approach, where the model is continuously improved via human feedback and newly labelled cases.
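As a lightweight stand-in for fine-tuning a pre-trained language model, the sketch below trains a simple text classifier on a handful of made-up labelled examples. In practice the labelled corpus would be far larger and continuously extended through the human-in-the-loop process, and the classifier would be a fine-tuned LLM rather than a bag-of-words model.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up illustrative examples; a real dataset would be curated
# and grown via human feedback and newly labelled cases.
texts = [
    "Nurses should always be women.",           # biased
    "Engineers are men by nature.",             # biased
    "The nurse administered the medication.",   # unbiased
    "The engineer reviewed the design.",        # unbiased
]
labels = [1, 1, 0, 0]  # 1 = biased, 0 = unbiased

bias_oracle = make_pipeline(TfidfVectorizer(), LogisticRegression())
bias_oracle.fit(texts, labels)
print(bias_oracle.predict(["Doctors are men and carers are women."]))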
VI. ADEQUACY MEASURES IN GENAI

Adequacy measures used to assess the quality of a test suite can be classified into coverage-based measures and diversity-based measures. The majority of research that suggests using coverage as a metric for evaluating the effectiveness of AI systems primarily concentrates on white-box approaches [59], [51], [72], [42], [32]. Traditional white-box code coverage metrics, commonly employed to assess program logic coverage, face limitations in quantifying the adequacy of program logic influenced by underlying training data. In response, specialised white-box adequacy metrics have emerged, focusing on maximising neuron coverage [33], [59], [76], [81], [42] or surprise coverage [42].

A key limitation of coverage-based adequacy measures lies in their dependence on full access to the underlying model and training data. Such access, however, is often unavailable to testers. Additionally, most coverage measures assess performance using adversarial inputs, prioritising model robustness over correctness. Despite the purported sensitivity of these measures to adversarial inputs, empirical studies have not consistently demonstrated a significant correlation between these coverage metrics and their ability to detect faults [3], [46], [20].

There is limited work that presents criteria for black-box coverage in the testing of AI-driven systems. Hauer et al. [34] present a statistical method to determine whether all potential scenario types have been encountered within the test suite. These scenario types are defined by the ego car's manoeuvres, such as lane changes, overtaking, and emergency braking. A similar study by Arcaini et al. [7] regards scenario types as granular driving characteristics like "turning with high lateral acceleration," using them as a coverage measure. Tang et al. categorise scenarios based on the map's topological structure and evaluate their approach's efficacy by assessing the coverage of these structures [75]. Given that autonomous vehicles base their decisions on parameterised rule-based systems and cost functions, Laurent et al. [45] employ weight coverage to encompass various configurations of a path planner.

Another commonly employed adequacy measure for both conventional and AI-based systems revolves around test suite diversity, which is computed based on either test inputs or outputs [3], [30], [50], [11]. One example is the Shannon Diversity Index, which measures the diversity of a given sample by considering both its richness and evenness. Richness assesses the count of distinct species present within a population, while evenness measures the uniformity of individuals per species [53].
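A sketch of the Shannon Diversity Index over a test suite, treating each distinct scenario type as a 'species'; the counts are made up for illustration.

import math
from collections import Counter

def shannon_diversity(items):
    # H' = -sum(p_i * ln(p_i)) over the proportion p_i of each
    # category; higher values indicate a richer, more even suite.
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in counts.values())

# Made-up scenario types in a test suite for a driving system.
suite = ["lane-change", "lane-change", "overtake", "braking", "braking"]
print(shannon_diversity(suite))  # ~1.05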
Another example is Geometric diversity [47], [40], [82], which was shown to be superior to numerous other prevalent black-box test adequacy metrics [3]. Geometric diversity draws its foundation from the Determinantal Point Process (DPP), an approach geared towards discerning diverse input sets [43]. DPP is used for subset selection scenarios, wherein the objective involves choosing a varied subset from a pool of candidates. DPP models the repulsion between items within the chosen subset. Consequently, items exhibiting substantial similarity with one another become less likely candidates for inclusion in the final selection.
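Geometric diversity is commonly computed as the log-determinant of a similarity kernel over the selected items, following the DPP formulation [43]. A sketch over made-up feature vectors, using a cosine-similarity kernel:

import numpy as np

def geometric_diversity(features):
    # log det of the kernel L = S S^T over L2-normalised feature
    # vectors; similar rows shrink the determinant, so higher values
    # indicate a more diverse ('repulsive') subset.
    s = features / np.linalg.norm(features, axis=1, keepdims=True)
    kernel = s @ s.T + 1e-6 * np.eye(len(s))  # jitter for stability
    _, logdet = np.linalg.slogdet(kernel)
    return logdet

diverse = np.array([[1.0, 0.0], [0.0, 1.0]])
similar = np.array([[1.0, 0.0], [0.9, 0.1]])
print(geometric_diversity(diverse) > geometric_diversity(similar))  # True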
Diversity metrics are rooted in the concept that akin test cases tend to exercise similar segments of the source code or training data, thereby uncovering similar faults. Conversely, a system subjected to an extensive array of conditions is more likely to exhibit reliable performance in real-world scenarios. Consequently, diversifying test scenarios enhances the exploration of the fault space, subsequently elevating fault detection capabilities [16], [3], [85]. Nonetheless, existing research in the area of testing AI systems, and in particular GenAI systems, underscores a substantial gap in the availability of robust diversity metrics that exhibit a pronounced correlation with fault detection [3].

A. Opportunity 2: Test suite Instance Space Adequacy Measures

Given the pivotal role of diversity and coverage in the black-box testing of intricate systems like GenAI systems, we recently proposed a new suite of metrics known as Test suite Instance Space Adequacy (TISA) metrics [58]. These metrics aim to objectively quantify the quality of testing in terms of both diversity and coverage. Rooted in a framework known as Instance Space Analysis (ISA), these metrics provide a two-dimensional representation of test instances. This representation seamlessly unveils both the diversity and coverage of instances, granting insights into the diversity of the test suite across various features, the diversity of detected bugs, and the test suite's adequacy concerning coverage. Ultimately, the proposed TISA metrics serve as a valuable resource for testers. By facilitating the identification of areas where the test suite might lack diversity or require additional testing, these metrics offer a mechanism to enhance testing quality systematically and comprehensively.

Fig. 2. The main components of Test suite Instance Space Adequacy (TISA): a test scenario subset is drawn from the test scenario space, characterised in a feature space and a performance space, projected into the instance space, and summarised by the TISA metrics.

Figure 2 illustrates the main steps of the TISA approach. To calculate the TISA measures, we assume we have a subset T' of the possible test cases T that can be used to test a GenAI system. It is not feasible, or even possible, to obtain all possible test cases in a reasonable amount of time, which makes assessing the adequacy of a test suite even more important. TISA then creates the feature space F by extracting an extensive set of features from the test cases T'. TISA has not been used for GenAI systems yet, but in the area of testing autonomous vehicles, TISA extracted features from driving scenarios, which constitute the test cases [58]. Extracted features included the number of lanes, the number of pedestrians, the number of right turns, etc.

Next the performance space P is created, which constitutes the outcomes of the test cases. If a test case fails or reveals incorrect behaviour of the software system, the test case is labelled as effective; otherwise it is deemed ineffective. The feature space and performance space become an input to the instance space generation approach, which creates the instance space. The instance space is a 2D representation of the multidimensional feature space, delineated by the features that have the most significant influence on test outcomes.

An important step in TISA involves pinpointing the features that exert maximal impact on test outcomes. This process is instrumental in distinguishing effective scenarios, the ones that expose failures, from those that pass without issue. The task of feature identification and selection transpires through an iterative procedure, leveraging machine learning techniques to uncover salient features that distinctly differentiate between these scenario types.

Once the significant features are identified, TISA projects test instances, originally defined within an n-dimensional feature space, onto a 2D coordinate plane. This projection aims to render the connection between instance features and test outcomes readily discernible. An optimal projection manifests as a linear trend, where variations in feature values or scenario outcomes span from low to high values along a straight line. Moreover, the proximity of instances in the high-dimensional feature space is maintained within the 2D instance space, ensuring topological preservation.

Fig. 3. An example of the instance space.

For illustration purposes, an example of an instance space created for the problem of testing autonomous vehicles [58] is shown in Figure 3. Each point in the graph is a test case plotted in the 2D instance space created from the most significant features. TISA also draws the mathematical boundary around the instance space for estimating the coverage of the test suite. This boundary is an indication of where possible test instances may exist, and helps identify existing gaps in the test suite. The colour in the graph indicates whether a test case is failing, and thus deemed effective and coloured in purple, or passing, which means it could not detect incorrect behaviour or failures of the AI-based system.

Given the instance space, the last step is to calculate the TISA metrics. The key metrics proposed in [58] are:
• Area of the instance space, which refers to the region encompassed by all test instances within the instance space, estimating the overall diversity of the entire test suite.
• Area of the buggy region, which pertains to the section of the instance space taken up by most of the test instances that expose failures. An example is shown in Figure 4.
• Coverage of the instance space, which refers to the proportion of the total area, as indicated by the boundary, that is covered by test instances.

Fig. 4. The area of the buggy region.
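Assuming the 2D instance space has already been computed, the three metrics can be approximated with convex hull areas, as in the sketch below; the points and the failure pattern are made up, and [58] defines the precise constructions.

import numpy as np
from scipy.spatial import ConvexHull

def tisa_metrics(points, failing, boundary_area):
    # points: (n, 2) projected test instances; failing: boolean mask
    # of failure-revealing instances; boundary_area: area enclosed by
    # the estimated boundary of possible instances.
    suite_area = ConvexHull(points).volume  # in 2D, .volume is the area
    buggy = points[failing]
    buggy_area = ConvexHull(buggy).volume if len(buggy) >= 3 else 0.0
    return {
        "area_of_instance_space": suite_area,
        "area_of_buggy_region": buggy_area,
        "coverage_of_instance_space": suite_area / boundary_area,
    }

rng = np.random.default_rng(0)
pts = rng.uniform(size=(50, 2))
fail = pts[:, 0] > 0.7  # made-up failure pattern
print(tisa_metrics(pts, fail, boundary_area=1.0))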
The experimental evaluation of these metrics showed that all TISA metrics demonstrate a positive correlation with faults, but the metric that measures the area of the buggy region particularly stands out due to its consistently robust and statistically significant correlation with bugs. This emphasises its efficacy and reliability in providing valuable insights into the testing process's effectiveness.

In the future, there are opportunities in using TISA metrics to evaluate how well GenAI systems have been tested. One of the challenges is how to select a meaningful set of features to characterise such systems. One potential solution is to extract features from the word embeddings of these systems.

By constructing the instance space of the test scenarios used to test a GenAI system, we would be able to extract insights in terms of the diversity of the test inputs, and ensure that the test instances cover a wide range of scenarios, styles, or outputs that the GenAI system is expected to handle. This diversity helps assess the system's adaptability to various inputs.

TISA would also help include test instances of varying complexity levels, from simple and straightforward cases to more intricate and challenging scenarios. This helps assess the system's robustness and capability to handle complex inputs. In addition, we could use TISA to assess the realism of test cases, ensuring that we incorporate test instances that resemble real-world data or scenarios the AI system will encounter. This ensures that the system's outputs align with real-world expectations.

Edge cases, outliers, or inputs that might trigger unusual behaviour in the AI system are important. Evaluating how well the system handles these cases can provide insights into its limitations and vulnerabilities. Introducing novel or unseen inputs to evaluate the system's ability to generalise beyond its training data and produce creative outputs is of particular importance, and can be tackled with a framework like TISA. By drawing the boundary of possible test inputs, and identifying areas of the instance space where test inputs are sparse and effective (outliers), close to the boundary (edge cases), or close to the frontier of behaviours (unusual behaviours), TISA can help address these research challenges.
To apply TISA to GenAI systems, there is a set of research challenges and opportunities for further research. First, a method that extracts features from test cases used to test GenAI systems would need to be developed. Let's take an example from other AI-based systems such as self-driving cars. Features of a test case may constitute the number of other cars on the road, the number of pedestrians crossing the road, the weather conditions, the speed limit, the curvature of the road, etc. For GenAI systems such as question answering systems, a test case is text. Potential features could be extracted from the vector embeddings of the text, which contain rich semantic information. We can perform word averaging, which averages the word embeddings within a text. Other features could be sum or max pooling: sum pooling adds up all the embeddings, while max pooling selects the maximum value for each dimension across word embeddings. Such features would help characterise the test cases, and help the creation of the instance space.
the instance space. Another research opportunity is to explore of driving characteristics in testing autonomous driving systems. In 2021
how TISA can help with prioritising test cases. 14th IEEE Conference on Software Testing, Verification and Validation
(ICST), pages 295–305. IEEE, 2021.
VII. C ONCLUSION [8] Kevin Bartz, Cory Barr, and Adil Aijaz. Natural language generation
for sponsored-search advertisements. In Proceedings of the 9th ACM
In conclusion, as the field of software testing encounters Conference on Electronic Commerce, pages 1–9, 2008.
the novel challenges posed by generative artificial intelligence [9] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmar-
garet Shmitchell. On the dangers of stochastic parrots: Can language
(GenAI) systems, the need for innovative testing approaches models be too big? In Proceedings of the 2021 ACM conference on
becomes evident. The advent of GenAI, capable of producing fairness, accountability, and transparency, pages 610–623, 2021.
creative and diverse outputs, has disrupted traditional testing [10] Parminder Bhatia, Kristjan Arumae, Nima Pourdamghani, Suyog Desh-
pande, Ben Snively, Mona Mona, Colby Wise, George Price, Shyam
methodologies. This paper has explored the intricate complexi- Ramaswamy, and T Kass-Hout. Aws cord19-search: A scientific litera-
ties introduced by GenAI and has discussed potential pathways ture search engine for covid-19. 2020.
for future research in the area of testing. [11] Christian Birchler, Sajad Khatiri, Pouria Derakhshanfar, Sebastiano
Panichella, and Annibale Panichella. Automated test cases prioriti-
The Oracle problem, which revolves around establishing zation for self-driving cars in virtual environments. arXiv preprint
the correctness of diverse and creative outputs, stands as arXiv:2107.09614, 2021.
[1] Zahra Abbasiantaeb and Saeedeh Momtazi. Text-based question answering from information retrieval and deep neural network perspectives: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 11(6):e1412, 2021.
[2] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pages 60–69. PMLR, 2018.
[3] Zohreh Aghababaeyan, Manel Abdellatif, Lionel Briand, Mojtaba Bagherzadeh, et al. Black-box testing of deep neural networks through test case diversity. arXiv preprint arXiv:2112.12591, 2021.
[4] OpenAI. GPT-4 technical report. Technical report, 2023.
[5] Rachith Aiyappa, Jisun An, Haewoon Kwak, and Yong-Yeol Ahn. Can we trust the evaluation on ChatGPT? arXiv preprint arXiv:2303.12767, 2023.
[6] Saranya Alagarsamy, Chakkrit Tantithamthavorn, and Aldeida Aleti. A3Test: Assertion-augmented automated test case generation. arXiv preprint arXiv:2302.10352, 2023.
[7] Paolo Arcaini, Xiao-Yi Zhang, and Fuyuki Ishikawa. Targeting patterns of driving characteristics in testing autonomous driving systems. In 2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST), pages 295–305. IEEE, 2021.
[8] Kevin Bartz, Cory Barr, and Adil Aijaz. Natural language generation for sponsored-search advertisements. In Proceedings of the 9th ACM Conference on Electronic Commerce, pages 1–9, 2008.
[9] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623, 2021.
[10] Parminder Bhatia, Kristjan Arumae, Nima Pourdamghani, Suyog Deshpande, Ben Snively, Mona Mona, Colby Wise, George Price, Shyam Ramaswamy, and T Kass-Hout. AWS CORD19-Search: A scientific literature search engine for COVID-19. 2020.
[11] Christian Birchler, Sajad Khatiri, Pouria Derakhshanfar, Sebastiano Panichella, and Annibale Panichella. Automated test cases prioritization for self-driving cars in virtual environments. arXiv preprint arXiv:2107.09614, 2021.
[12] Sumon Biswas and Hridesh Rajan. Do the machine learning models on a crowd sourced platform exhibit bias? An empirical study on model fairness. In ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 642–653, 2020.
[13] Houssem Ben Braiek and Foutse Khomh. On testing machine learning programs. Journal of Systems and Software, 164:110542, 2020.
[14] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[15] Longbing Cao. AI in finance: challenges, techniques, and opportunities. ACM Computing Surveys (CSUR), 55(3):1–38, 2022.
[16] Emanuela G Cartaxo, Patrícia DL Machado, and Francisco G Oliveira Neto. On the use of a similarity function for test case selection in the context of model-based testing. Software Testing, Verification and Reliability, 21(2):75–100, 2011.
[17] Joymallya Chakraborty, Suvodeep Majumder, and Tim Menzies. Bias in machine learning software: why? how? what to do? In ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 429–440, 2021.
[18] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023.
[19] Aditya Chattopadhyay, Piyushi Manupriya, Anirban Sarkar, and Vineeth N Balasubramanian. Neural network attributions: A causal perspective. In International Conference on Machine Learning, pages 981–990. PMLR, 2019.
[20] Junjie Chen, Ming Yan, Zan Wang, Yuning Kang, and Zhuo Wu. Deep neural network test coverage: How far are we? arXiv preprint arXiv:2010.04946, 2020.
[21] Songqiang Chen, Shuo Jin, and Xiaoyuan Xie. Testing your question answering software via asking recursively. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 104–116. IEEE, 2021.
[22] Songqiang Chen, Shuo Jin, and Xiaoyuan Xie. Validation on machine reading comprehension software without annotated labels: A property-based method. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 590–602, 2021.
[23] TY Chen, SC Cheung, and SM Yiu. Metamorphic testing: a new approach for generating next test cases. arXiv preprint arXiv:2002.12543, 2020.
[24] Hyunji Chung, Michaela Iorga, Jeffrey Voas, and Sangjin Lee. Alexa, can I trust you? Computer, 50(9):100–104, 2017.
[25] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, 2019.
[26] Martin D Davis and Elaine J Weyuker. Pseudo-oracles for non-testable programs. In Proceedings of the ACM '81 Conference, pages 254–257, 1981.
[27] Ryan Daws. Medical chatbot using OpenAI's GPT-3 told a fake patient to kill themselves. AI News, 2020.
[28] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[29] Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1469–1481. IEEE, 2023.
[30] Robert Feldt, Simon Poulding, David Clark, and Shin Yoo. Test set diameter: Quantifying the diversity of sets of test cases. In 2016 IEEE International Conference on Software Testing, Verification and Validation (ICST), pages 223–233. IEEE, 2016.
[31] Sorelle A Friedler, Carlos Scheidegger, Suresh Venkatasubramanian, Sonam Choudhary, Evan P Hamilton, and Derek Roth. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 329–338, 2019.
[32] Simos Gerasimou, Hasan Ferit Eniser, Alper Sen, and Alper Cakan. Importance-driven deep learning system testing. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pages 702–713, 2020.
[33] Jianmin Guo, Yu Jiang, Yue Zhao, Quan Chen, and Jiaguang Sun. DLFuzz: Differential fuzzing testing of deep learning systems. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 739–743, 2018.
[34] Florian Hauer, Tabea Schmidt, Bernd Holzmüller, and Alexander Pretschner. Did we test all scenarios for automated and autonomous driving systems? In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 2950–2955. IEEE, 2019.
[35] Ryuichiro Higashinaka, Masahiro Mizukami, Hidetoshi Kawabata, Emi Yamaguchi, Noritake Adachi, and Junji Tomita. Role play-based question-answering by real users for building chatbots with consistent personalities. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 264–272, 2018.
[36] Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. Deep code comment generation. In Proceedings of the 26th Conference on Program Comprehension, pages 200–210, 2018.
[37] Hanyao Huang, Ou Zheng, Dongdong Wang, Jiayi Yin, Zijin Wang, Shengxuan Ding, Heng Yin, Chuan Xu, Renjie Yang, Qian Zheng, et al. ChatGPT for shaping the future of dentistry: the potential of multi-modal large language model. International Journal of Oral Science, 15(1):29, 2023.
[38] Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, 2017.
[39] Faisal Kamiran, Asim Karim, and Xiangliang Zhang. Decision theory for discrimination-aware classification. In 2012 IEEE 12th International Conference on Data Mining, pages 924–929. IEEE, 2012.
[40] Byungkon Kang. Fast determinantal point process sampling with application to clustering. Advances in Neural Information Processing Systems, 26, 2013.
[41] Charaka Geethal Kapugama, Van-Thuan Pham, Aldeida Aleti, and Marcel Böhme. Human-in-the-loop oracle learning for semantic bugs in string processing programs. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 215–226, 2022.
[42] Jinhan Kim, Robert Feldt, and Shin Yoo. Guiding deep learning system testing using surprise adequacy. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pages 1039–1049. IEEE, 2019.
[43] Alex Kulesza, Ben Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3):123–286, 2012.
[44] Md Tahmid Rahman Laskar, M Saiful Bari, Mizanur Rahman, Md Amran Hossen Bhuiyan, Shafiq Joty, and Jimmy Xiangji Huang. A systematic study and comprehensive evaluation of ChatGPT on benchmark datasets. arXiv preprint arXiv:2305.18486, 2023.
[45] Thomas Laurent, Stefan Klikovits, Paolo Arcaini, Fuyuki Ishikawa, and Anthony Ventresque. Parameter coverage for testing of autonomous driving systems under uncertainty. ACM Transactions on Software Engineering and Methodology, 2022.
[46] Zenan Li, Xiaoxing Ma, Chang Xu, and Chun Cao. Structural coverage criteria for neural networks could be misleading. In 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), pages 89–92. IEEE, 2019.
[47] Hui Lin and Jeff A Bilmes. Learning mixtures of submodular shells with application to document summarization. arXiv preprint arXiv:1210.4871, 2012.
[48] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, 2022.
[49] Zixi Liu, Yang Feng, Yining Yin, Jingyu Sun, Zhenyu Chen, and Baowen Xu. QATest: A uniform fuzzing framework for question answering systems. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1–12, 2022.
[50] Chengjie Lu, Huihui Zhang, Tao Yue, and Shaukat Ali. Search-based selection and prioritization of test scenarios for autonomous driving systems. In International Symposium on Search Based Software Engineering, pages 41–55. Springer, 2021.
[51] Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, et al. DeepGauge: Multi-granularity testing criteria for deep learning systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pages 120–131, 2018.
[52] Pingchuan Ma, Shuai Wang, and Jin Liu. Metamorphic testing and certified mitigation of fairness violations in NLP models. In IJCAI, pages 458–465, 2020.
[53] Anne E Magurran. Measuring biological diversity. Current Biology, 31(19):R1174–R1177, 2021.
[54] R Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, pages 3428–3448. Association for Computational Linguistics (ACL), 2020.
[55] William M McKeeman. Differential testing for software. Digital Technical Journal, 10(1):100–107, 1998.
[56] John Miller, Karl Krauth, Benjamin Recht, and Ludwig Schmidt. The effect of natural distribution shift on question answering models. In Proceedings of the 37th International Conference on Machine Learning, pages 6905–6916, 2020.
[57] Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. Deep learning-based text classification: a comprehensive review. ACM Computing Surveys (CSUR), 54(3):1–40, 2021.
[58] Neelofar and Aldeida Aleti. Towards reliable AI: Adequacy metrics for ensuring the quality of AI-based systems. In Proceedings of the ACM/IEEE 46th International Conference on Software Engineering, 2024. Accepted.
[59] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. DeepXplore: Automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 1–18, 2017.
[60] Walter Hugo Lopez Pinaya, Sandra Vieira, Rafael Garcia-Dias, and Andrea Mechelli. Autoencoders. In Machine Learning, pages 193–208. Elsevier, 2020.
[61] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
[62] Martin Reisenbichler, Thomas Reutterer, David A Schweidel, and Daniel Dan. Frontiers: Supporting content marketing with natural language generation. Marketing Science, 41(3):441–452, 2022.
[63] Vincenzo Riccio, Gunel Jahangirova, Andrea Stocco, Nargiz Humbatova, Michael Weiss, and Paolo Tonella. Testing machine learning based systems: a systematic mapping. Empirical Software Engineering, 25:5193–5254, 2020.
[64] Baptiste Roziere, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. Unsupervised translation of programming languages. Advances in Neural Information Processing Systems, 33:20601–20611, 2020.
[65] Hasim Sak, Andrew W Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. 2014.
[66] Divya Saxena and Jiannong Cao. Generative adversarial networks (GANs): challenges, solutions, and future directions. ACM Computing Surveys (CSUR), 54(3):1–42, 2021.
[67] Saqib Shah and Julian Chokkattu. Microsoft kills AI chatbot Tay (twice) after it goes full Nazi. 2016.
[68] Qingchao Shen, Junjie Chen, Jie M Zhang, Haoyu Wang, Shuang Liu, and Menghan Tian. Natural test generation for precise testing of question answering software. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1–12, 2022.
[69] Weijun Shen, Yanhui Li, Lin Chen, Yuanlei Han, Yuming Zhou, and Baowen Xu. Multiple-boundary clustering and prioritization to promote neural network retraining. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, pages 410–422, 2020.
[70] Clemencia Siro and Tunde Oluwaseyi Ajayi. Evaluating the robustness of machine reading comprehension models to low resource entity renaming. In 4th Workshop on African Natural Language Processing, 2023.
[71] Bing Sun, Jun Sun, Hong Long Pham, and Jie Shi. Causality-based neural network repair. In International Conference on Software Engineering, 2022.
[72] Youcheng Sun, Xiaowei Huang, Daniel Kroening, James Sharp, Matthew Hill, and Rob Ashmore. Structural test coverage criteria for deep neural networks. ACM Transactions on Embedded Computing Systems (TECS), 18(5s):1–23, 2019.
[73] Md Tahmid Rahman Laskar, M Saiful Bari, Mizanur Rahman, Md Amran Hossen Bhuiyan, Shafiq Joty, and Jimmy Xiangji Huang. A systematic study and comprehensive evaluation of ChatGPT on benchmark datasets. arXiv e-prints, pages arXiv–2305, 2023.
[74] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019.
[75] Yun Tang, Yuan Zhou, Yang Liu, Jun Sun, and Gang Wang. Collision avoidance testing for autonomous driving systems on complete maps. In 2021 IEEE Intelligent Vehicles Symposium (IV), pages 179–185. IEEE, 2021.
[76] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. DeepTest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, pages 303–314, 2018.
[77] Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts, pages 1–7, 2022.
[78] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[79] Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Doug Burdick, Darrin Eide, Kathryn Funk, Yannis Katsis, Rodney Kinney, et al. CORD-19: The COVID-19 open research dataset. In ACL 2020 Workshop on Natural Language Processing for COVID-19 (NLP-COVID), 2020.
[80] Ruoqi Wei, Cesar Garcia, Ahmed El-Sayed, Viyaleta Peterson, and Ausif Mahmood. Variations in variational autoencoders: a comparative evaluation. IEEE Access, 8:153651–153670, 2020.
[81] Xiaofei Xie, Lei Ma, Felix Juefei-Xu, Minhui Xue, Hongxu Chen, Yang Liu, Jianjun Zhao, Bo Li, Jianxiong Yin, and Simon See. DeepHunter: a coverage-guided fuzz testing framework for deep neural networks. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 146–157, 2019.
[82] Haotian Xu and Zhijian Ou. Scalable discovery of audio fingerprint motifs in broadcast streams with determinantal point process based motif clustering. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(5):978–989, 2016.
[83] Jie M Zhang, Mark Harman, Lei Ma, and Yang Liu. Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering, 48(1):1–36, 2020.
[84] Lei Zhang, Shuai Wang, and Bing Liu. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1253, 2018.
[85] Tahereh Zohdinasab, Vincenzo Riccio, Alessio Gambi, and Paolo Tonella. DeepHyperion: exploring the feature space of deep learning-based systems through illumination search. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 79–90, 2021.
