
2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE)

Improving Test Case Generation for REST APIs Through Hierarchical Clustering

978-1-6654-0337-5/21/$31.00 ©2021 IEEE | DOI: 10.1109/ASE51524.2021.9678586
Dimitri Stallenberg, Mitchell Olsthoorn, Annibale Panichella
Delft University of Technology, Delft, The Netherlands
[email protected] [email protected] [email protected]

Abstract—With the ever-increasing use of web APIs in modern-day applications, it is becoming more important to test the system as a whole. In the last decade, tools and approaches have been proposed to automate the creation of system-level test cases for these APIs using evolutionary algorithms (EAs). One of the limiting factors of EAs is that the genetic operators (crossover and mutation) are fully randomized, potentially breaking promising patterns in the sequences of API requests discovered during the search. Breaking these patterns has a negative impact on the effectiveness of the test case generation process. To address this limitation, this paper proposes a new approach that uses Agglomerative Hierarchical Clustering (AHC) to infer a linkage tree model, which captures, replicates, and preserves these patterns in new test cases. We evaluate our approach, called LT-MOSA, by performing an empirical study on 7 real-world benchmark applications w.r.t. branch coverage and real-fault detection capability. We also compare LT-MOSA with the two existing state-of-the-art white-box techniques (MIO, MOSA) for REST API testing. Our results show that LT-MOSA achieves a statistically significant increase in test target coverage (i.e., lines and branches) compared to MIO and MOSA in 4 and 5 out of 7 applications, respectively. Furthermore, LT-MOSA discovers 27 and 18 unique real-faults that are left undetected by MIO and MOSA, respectively.

Index Terms—system-level testing, test case generation, machine learning, search-based software engineering

I. INTRODUCTION

Over the last decade, the software landscape has been characterized by the shift from large monolithic applications to component-based systems, such as microservices. These systems, together with their many diverse client applications, make heavy use of web APIs for communication. Web APIs are almost ubiquitous today and rely on well-established communication standards such as SOAP [1] and REST [2]. The shift towards component-based systems makes it increasingly important to test the system as a whole, since many different components have to work together. Manually writing system-level test cases is, however, time-consuming and error-prone [3], [4].

For these reasons, researchers have come up with different techniques to automate the generation of test cases. One class of such techniques is search-based software testing. Recent advances have shown that search-based approaches can achieve high code coverage [5], also compared to manually-written test cases [6], and are able to detect unknown bugs [7], [8], [9]. Search-based test case generation uses Evolutionary Algorithms (EAs) to evolve a pool of test cases through randomized genetic operators, namely mutations and crossover/recombination. More precisely, test cases are encoded as chromosomes, while statements (i.e., method calls) and test data are encoded as the genes [10]. In the context of REST API testing, a test case is a sequence of API requests (i.e., HTTP requests and SQL commands) on specific resources [11], [9].

REpresentational State Transfer (REST) APIs deal with states. Each individual request changes the state of the API, and therefore, its execution result depends on the state of the application (i.e., the previously executed requests). Listing 1 shows an example of HTTP requests made to a REST API that manages products. In the example, the first request authenticates the client to the API with the given username and password. In return, the client receives a token that can be used to make subsequent requests. The second request creates a new product by specifying the id, the price, and the token. The price is then updated in the third request and the changes are retrieved in the last request.

Listing 1: Motivating example of patterns in API requests.

1 POST authenticate?user=admin&password=pwd
2 POST product?id=1&price=10.99&token={key}
3 UPDATE product/1?price=8.99&token={key}
4 GET product/1?token={key}

The example above contains patterns of HTTP requests that strongly depend on the previous ones. The GET request cannot retrieve a product that does not exist, and therefore, cannot be successfully executed without request 2. Similarly, the UPDATE request cannot be executed before the product is created. Lastly, requests 2, 3, and 4 all depend on request 1 for the authentication token. Hence, HTTP requests should not be executed in any random order [12].

Test generation tools rely on EAs to build up sequences of HTTP requests iteratively through genetic operators, i.e., crossover and mutation [9], [11], [13]. While these operators can successfully create promising sequences of HTTP requests, they do not directly recognize and preserve them when creating new test cases [14]. For example, the genetic operators may remove request 2 from the test case in Listing 1, breaking requests 3 and 4 unintentionally.

In this paper, we argue that detecting and preserving patterns

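The dependencies among the requests of Listing 1 can be made concrete with a small sketch. The dependency map and action names below are our illustration (not part of the paper's approach); it shows how deleting request 2, as a randomized mutation might do, invalidates requests 3 and 4:

```python
# Illustrative sketch: a test case as an ordered list of API actions, plus a
# checker for the dependencies described in Listing 1 (assumed names).
depends_on = {
    "POST product": {"POST authenticate"},  # needs the auth token
    "UPDATE product/1": {"POST authenticate", "POST product"},
    "GET product/1": {"POST authenticate", "POST product"},
}

def is_valid(test_case):
    """Every request must be preceded by all requests it depends on."""
    seen = set()
    for action in test_case:
        if not depends_on.get(action, set()) <= seen:
            return False
        seen.add(action)
    return True

original = ["POST authenticate", "POST product",
            "UPDATE product/1", "GET product/1"]
# A randomized mutation that deletes request 2 breaks requests 3 and 4:
mutated = [a for a in original if a != "POST product"]

print(is_valid(original))  # True
print(is_valid(mutated))   # False
```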
of HTTP requests, hereafter referred to as linkage structures, improves the effectiveness of the test case generation process. We propose a new approach that uses Agglomerative Hierarchical Clustering (AHC) to infer these linkage structures from automatically generated test cases in the context of REST API testing. In particular, AHC generates a Linkage Tree (LT) model from the test cases that are the closest to reaching uncovered test targets (i.e., lines and branches). This model is used by the genetic operators to determine which sequences of HTTP requests should not be broken up and should be replicated in new tests.

To evaluate the feasibility and effectiveness of our approach, we implemented a prototype based on EvoMaster, the state-of-the-art test case generation tool for Java-based REST APIs. We performed an empirical study with 7 benchmark web/enterprise applications from the EvoMaster Benchmark (EMB) dataset. We compared our approach against the two state-of-the-art algorithms for system-level test generation implemented in EvoMaster, namely Many Independent Objective (MIO) and Many Objective Search Algorithm (MOSA).

Our results show that LT-MOSA covers significantly more test targets in 4 and 5 out of the 7 applications compared to MIO and MOSA, respectively. On average, LT-MOSA covered 11.7% more test targets than MIO (with a max improvement of 66.5%) and 8.5% more than MOSA (with a max improvement of 37.5%). Furthermore, LT-MOSA could detect, on average, 27 and 18 unique real-faults that were not detected by MIO and MOSA, respectively.

In summary, we make the following contributions:
1) A novel approach that uses Agglomerative Hierarchical Clustering to learn and preserve linkage structures embedded in REST API test cases.
2) An empirical evaluation of the proposed approach against the two state-of-the-art algorithms (MIO and MOSA) on a benchmark of 7 web/enterprise applications.
3) A full replication package including code and results [15].

The remainder of this paper is organized as follows. Section II summarizes the related work and background concepts. Section III introduces our approach, called LT-MOSA, and gives a detailed breakdown of how it works. Section IV describes the setup of our empirical study. Section V discusses the obtained results and presents our findings. Section VI discusses the threats to validity and Section VII draws conclusions and identifies possible directions for future work.

II. BACKGROUND AND RELATED WORK

This section provides an overview of basic concepts and related work in search-based software testing, REST API testing, test case generation, and linkage learning.

A. Search-based software testing

Search-based software testing has become a widely used and effective method of automating the generation of test cases and test data [16], [17]. Automatic test case generation significantly reduces the time needed for testing and debugging applications (e.g., [18]), and it has been successfully used in industry (e.g., [19], [20]). Popular tools for automatically generating test cases include EvoSuite [21], for unit testing, and Sapienz [20], [22], for Android testing.

Evolutionary Algorithms (EAs) are one of the most commonly used classes of meta-heuristics in search-based testing. EAs have been used to generate both test data [16] and test cases [10]; the latter include test data, method sequences, and assertions. EAs are inspired by the biological process of evolution. An EA initializes and evolves a population of randomly generated individuals (test cases). These individuals are then evaluated based on a predefined fitness function. The individuals with the best fitness values are selected for reproduction through crossover (swapping elements between two individuals) and random mutation (small changes to individuals). The offspring test cases resulting from the reproduction are then evaluated. Finally, the population for the next generation is created by selecting the best individuals across the previous population and the newly generated tests (elitism). This loop (reproduction, evaluation, and selection) continues until a stopping condition has been reached. The final test suite is created based on the population's best individuals.

B. REST API Testing

A REpresentational State Transfer (REST) API is oriented around resources. This differs from a command-oriented API such as, for example, the Remote Procedure Call (RPC) standard. A REST API request performs an action on a specific resource. These actions are encoded by the different methods defined in the HTTP protocol. Common HTTP methods include GET, HEAD, POST, PUT, PATCH, and DELETE. These actions are performed on an endpoint, which is the location of the resource. An example would be performing a GET action on the /user/3 endpoint to retrieve the information of the user with user id 3. Another example would be performing a POST action on the /user/ endpoint to create a new user.

With the recent rise in popularity of REST APIs over the last decade, it is becoming more important to test this critical communication layer. There are two different ways system-level API testing can be approached: black-box and white-box testing. Black-box testing frameworks (e.g., RESTTESTGEN [23], EvoMaster black-box [11]) examine the functionality of the system without looking at its internals. In contrast, white-box testing approaches rely on the internal structure of the system and measure the adequacy of the tests based on coverage criteria (e.g., branch coverage). This allows the algorithm to easily identify which paths have been covered and which have not. Prior studies show that white-box techniques achieve better results than their black-box counterparts [11]. Additionally, white-box techniques allow integrating SQL databases into the test case generation process [9].

C. Test Case Generation for REST APIs

EvoMaster is a tool that aims to generate system-level test cases for REST APIs. It internally uses evolutionary

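The generational loop described in Section II-A (initialize, evaluate, select, recombine, mutate, keep elites) can be sketched as follows. This is our toy illustration on a bit-string problem, not EvoMaster code:

```python
# Minimal sketch of a generational EA with single-point crossover, mutation,
# tournament selection, and elitism (illustrative toy problem).
import random

random.seed(42)

TARGET = [1] * 10  # toy goal: evolve a bit string of all ones

def fitness(ind):
    return sum(1 for a, b in zip(ind, TARGET) if a == b)

def tournament(pop, k=2):
    # Pick the fitter of k random individuals.
    return max(random.sample(pop, k), key=fitness)

def crossover(p1, p2):
    point = random.randrange(1, len(p1))  # single-point crossover
    return p1[:point] + p2[point:]

def mutate(ind, rate=0.1):
    return [1 - g if random.random() < rate else g for g in ind]

pop = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
for generation in range(50):
    offspring = [mutate(crossover(tournament(pop), tournament(pop)))
                 for _ in range(len(pop))]
    # Elitism: the next generation keeps the best of parents and offspring.
    pop = sorted(pop + offspring, key=fitness, reverse=True)[:len(pop)]

best = max(pop, key=fitness)
print(fitness(best))
```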
algorithms to evolve the test cases iteratively. At the time of writing, EvoMaster provides two EAs. The first algorithm is the Many Independent Objective (MIO) algorithm proposed by Arcuri et al. [13]. The second algorithm is a variant of the Many-Objective Sorting Algorithm (MOSA) proposed by Panichella et al. [24]. Both of these algorithms are specifically designed for test case generation and consider the peculiarities of these systems. The pseudo-code of these algorithms can be found in the respective papers.

1) MIO: The Many Independent Objective (MIO) algorithm is an evolutionary algorithm that aims to improve the scalability of many-objective search algorithms for programs with a very large number of testing targets (in the order of millions) [13]. It works under the assumptions that: (i) each target can be independently optimized; (ii) targets can be strongly related, for example when nested, or completely independent; (iii) not all targets can be covered. Based on these assumptions, MIO maintains a separate population for each target of the System Under Test (SUT). The fitness function for each population consists solely of the objective of that population. So, each population compares and ranks the individuals based on a single testing target. At the start of the algorithm, all populations are empty. In each iteration, the algorithm either samples a random new test case with a certain probability or samples a test case from one of the populations with an uncovered target. This test case is then added to all populations with an uncovered target and evaluated independently. The population from which a test case is sampled is chosen based on the number of times a test case has been sampled from that population before. Each time a population is sampled, a counter is incremented. If a test case with a better fitness value is added to the population, the counter is reset to zero. The sampling mechanism chooses the population with the lowest counter. This makes sure that the algorithm will not get stuck on an unreachable target. When a population reaches a certain predefined size, it removes the test case with the worst fitness value. At the end of the algorithm, a test suite is built with the best test case from each population.

2) MOSA: The Many Objective Sorting Algorithm (MOSA) is an evolutionary algorithm that focuses on optimizing multiple objectives (e.g., branches) at the same time [24]. It adapts the NSGA-II algorithm, one of the most popular multi-objective search algorithms [25]. In MOSA, a test case is represented as a chromosome. Each testing target (e.g., branch, line) in the code corresponds to a separate objective measuring the distance (of a given test) toward reaching that target. The fitness of the test cases is measured according to a vector of scalar values that represent these different objectives. Since MOSA has many different objectives, it uses two preference criteria to determine which test cases should be selected and evolved first: (i) minimal distance to the uncovered target; (ii) test case length. More precisely, it first looks for the subset of the Pareto front that contains test cases with a minimum distance for each uncovered objective. When multiple test cases are equally close to covering the same target, the smallest test case is selected. In each generation, an archive collects the test cases that cover previously uncovered targets. The archive is updated every time a newly generated test case covers new targets or covers already covered targets but with fewer statements.

3) Comparison: Both MIO and MOSA produce good results in both unit and system-level tests. In the context of system-level testing, Arcuri [13] showed that MIO achieves the best average results, but there are web/enterprise applications in which MOSA achieves higher coverage. In unit testing, Campos et al. [5] showed that MOSA (and its variants) achieves overall better coverage than MIO. Therefore, in this paper, we consider both MIO and MOSA as they excel in different scenarios and are the state-of-the-art in test case generation for REST APIs.

Notice that an extension of MOSA, called DynaMOSA [26], has been proposed in the related literature for unit testing. Compared to MOSA, DynaMOSA organizes the coverage targets (e.g., branches) of a given code unit into a global hierarchy based on their structural dependencies. Then, the list of search objectives is updated dynamically based on their structural dependencies and the previously covered targets. While previous studies in unit testing showed that DynaMOSA outperforms its predecessor MOSA [26], [5], it cannot be applied to REST APIs, as no global hierarchy exists across the coverage targets of different microservices or functions/classes within the same microservice¹.

4) Chromosome Representation: Test cases in both search algorithms included in EvoMaster are represented by two genes: an action gene and an input gene. The action gene represents the structure and order of the HTTP requests in the test case. EvoMaster extracts these actions from the Swagger/OpenAPI documentation that has to be provided for each system under test. An action gene consists of the HTTP method and the REST endpoint. An example of an action gene would be POST /authentication.

The input gene represents the input data for the HTTP request. An example of this input data would be the username and password that are required by the /authentication endpoint. This input data is sampled from the source code of the SUT.

D. Linkage Learning in EAs

Linkage learning refers to a large body of work in the evolutionary computation community that aims to infer linkage structures present in promising individuals [27]. Linkage structures are groups of "good" genes that contribute to the fitness of a given population. Accurate inference of linkage structures has been used to design "competent" genetic operators [28] for numerical problems. These operators are designed to replicate rather than break groups of genes (patterns) into the offspring. To learn linkage structures from numerical chromosomes, researchers have used different unsupervised machine learning algorithms. BOA [29] constructs a Bayesian Network and

¹Microservices are loosely coupled and deployable independently.

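The two-gene test-case representation described in Section II-C4 can be sketched with illustrative types. The class and field names below are our assumptions; EvoMaster's actual classes differ:

```python
# Sketch (assumed types) of the action-gene/input-gene representation.
from dataclasses import dataclass, field

@dataclass
class ActionGene:
    method: str    # HTTP method, e.g. "POST"
    endpoint: str  # REST endpoint, e.g. "/authentication"

@dataclass
class InputGene:
    params: dict = field(default_factory=dict)  # input data for the request

@dataclass
class TestCase:
    # A test case is an ordered sequence of (action, input) pairs.
    actions: list = field(default_factory=list)

tc = TestCase(actions=[
    (ActionGene("POST", "/authentication"),
     InputGene({"username": "admin", "password": "pwd"})),
    (ActionGene("GET", "/user/3"), InputGene()),
])
print(len(tc.actions))          # 2
print(tc.actions[0][0].method)  # POST
```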
creates new numerical chromosomes using the joint distribution encoded by the network. DSMGA [30] uses Dependency Structure Matrix (DSM) clustering and applies the crossover by exchanging gene clusters between parent chromosomes. 3LO [31] employs local optimization as an alternative method for linkage learning.

Two state-of-the-art EAs for numerical problems are LTGA [32] and GOMEA [33]. Both algorithms use clustering to infer linkage trees, representing the linkage structures between genes (problem variables) using tree-like structures. GOMEA uses agglomerative hierarchical clustering as a faster and more efficient way to learn linkage trees [33]. GOMEA uses gene-pool optimal mixing to create new solutions by applying a local search within the recombination procedure. More precisely, it creates offspring solutions from one single parent by iteratively replicating (copying) gene clusters from different donors. In each iteration, the new solution is evaluated; if its fitness improves, the replicated genes are kept; otherwise, the change is reverted.

Linkage models have been successfully applied to evolutionary algorithms for numerical [34], permutation [35], and binary optimization problems [32], [36] with fixed-length chromosomes. However, test cases for REST APIs are characterized by a more complex structure [11]: each test is a sequence of HTTP requests towards a RESTful service, each with input data, such as HTTP headers, URL parameters, and payloads for POST/PUT/PATCH methods. Besides, a test case might include SQL data and commands for microservices that use databases [9]. Finally, test cases have a variable size, and their lengths can also vary throughout the generations. Therefore, we need to tailor existing linkage learning methods according to the test case characteristics discussed above.

III. APPROACH

This section presents our approach, called LT-MOSA, for system-level test case generation, which incorporates and tailors linkage learning into MOSA [24]. We selected MOSA as the base algorithm to apply linkage learning because it evolves a single population of test cases, which is a requirement for the learning process. Additionally, MOSA has proven to be very competitive in the context of RESTful API testing [13], unit testing [37], [5], and DNN testing [38].

Algorithm 1 outlines the pseudo-code of LT-MOSA. The parts where LT-MOSA deviates from MOSA are highlighted with a blue color in the original paper.

Algorithm 1: LT-MOSA
Input:
  Coverage targets Ω = {ω1, . . . , ωn}
  Population size M
  Frequency K for updating the linkage tree model
Result: A test suite T
 1 begin
 2   P ← RANDOM-POPULATION(M)
 3   archive ← UPDATE-ARCHIVE(∅, P)
 4   Fronts ← PREFERENCE-SORTING(P)
 5   while not (stop_condition) do
 6     L ← LEARN-LINKAGE-MODEL(Fronts[0], K)
 7     P′ ← ∅
 8     for index = 1..M do
 9       parent ← TOURNAMENT-SELECTION(P)
10       if apply_recombination then
11         donor ← TOURNAMENT-SELECTION(P)
12         offspring ← LINKAGE-RECOMB(parent, donor, L)
13         offspring ← MUTATION(offspring)
14       else
15         offspring ← MUTATION(parent)
16       P′ ← P′ ∪ {offspring}
17       archive ← UPDATE-ARCHIVE(archive, offspring)
18     R ← P ∪ P′
19     Fronts ← PREFERENCE-SORTING(R)
20     P ← ENVIRONMENTAL-SELECTION(Fronts, M)
21   T ← archive

LT-MOSA starts by initializing the population P and computing the corresponding objective scores (line 2). Each test case is composed of HTTP calls (actions) and SQL commands (database actions) [9]. The RANDOM-POPULATION function also executes the generated tests and computes their objective scores using the branch distance [39]. The branch distance is a well-known heuristic in search-based testing that measures how far each test case is from reaching a given coverage target (e.g., branch). Then, the test cases are sorted into sub-dominance fronts using the preference sorting algorithm [24] (line 4). The test cases within the first front (Front[0]) are the closest ones in P to reaching the coverage targets and, therefore, the fittest individuals to consider for model learning.

Afterwards, the population P is evolved through subsequent generations within the loop in lines 5-20. Each generation starts by training a linkage tree model on the first non-dominated front (line 6) with the goal of learning patterns of HTTP and SQL actions that strongly contribute to the "optimality" of the population. We discuss the learning procedure in detail in Section III-A. Once the linkage tree model is obtained, LT-MOSA selects the fittest test cases using tournament selection (lines 9 and 11) and creates an offspring population P′ by using a linkage-based recombination [33] (line 12) and mutation [9] (lines 13 and 15). The linkage-based recombination is a specialized crossover that relies on the linkage tree model to decide which patterns of genes (HTTP requests) can be copied into the offspring test cases. We describe the linkage-based recombination operator in Section III-B.

LT-MOSA adds the newly generated tests to the offspring population (line 16), executes them, and updates the archive (line 17) in case new coverage targets have been reached. The generation ends by selecting the best M test cases across the existing population P and the offspring population P′. This selection is made by combining the two populations into one single pool R of size 2 × M (line 18), applying the preference sorting (line 19), and selecting M solutions from the non-dominated fronts starting from Front[0] until reaching the population size M (line 20).

The search stops when the termination criteria are met (condition in line 5); the final test suite is then composed of all test cases that have been stored in the archive throughout

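The UPDATE-ARCHIVE step of Algorithm 1, as described in the surrounding text (keep the shortest test case per covered target), can be sketched as follows. The function signature and data shapes are our assumptions, not the authors' code:

```python
# Sketch of UPDATE-ARCHIVE: the archive maps each covered target to the
# shortest test case (here, a list of requests) reaching it.
def update_archive(archive, test_case, covered_targets):
    """archive: dict mapping target id -> test case (list of requests)."""
    for target in covered_targets:
        best = archive.get(target)
        if best is None or len(test_case) < len(best):
            archive[target] = test_case
    return archive

archive = {}
update_archive(archive, ["auth", "create", "get"], {"b1", "b2"})
update_archive(archive, ["auth", "get"], {"b1"})  # shorter test for b1
print(archive["b1"])  # ['auth', 'get']
print(archive["b2"])  # ['auth', 'create', 'get']
```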
the search. Note that LT-MOSA updates the archive in each generation by storing the shortest test case covering each target ωi. Finally, the list of objectives is updated such that the search focuses only on the targets (branches) that are left uncovered.

In the following sub-sections, we detail the key novel ingredients of LT-MOSA, namely the linkage model learning (LT) (Section III-A), the linkage-based recombination operator (Section III-B), and the mutation operator (Section III-C).

A. Linkage tree learning

In this section, we describe the main changes we applied to traditional linkage learning to adapt it to our context, i.e., test case generation for RESTful APIs.

1) Linkage Encoding: The first problem we had to solve is encoding the test cases into discrete vectors of equal length, which can be interpreted and analyzed via hierarchical clustering. To this aim, we opted for encoding test cases as binary vectors whose entries denote the presence (or absence) of the possible HTTP requests. Given an SUT (software under test), there are N possible HTTP requests to the available APIs. This information can be extracted from the Swagger/OpenAPI definition [11], which is a widely-used tool for REST API documentation. A Swagger definition contains the HTTP operations available for each API endpoint. Each operation contains both fixed and variable parts. The fixed part includes the type of operation (POST, GET, PUT, PATCH, or DELETE), the IP address or URL of the target API, and the HTTP headers. For the variable part, the Swagger definition includes information about the input data (e.g., string, double, date, etc.) that can vary. Therefore, for each API endpoint, we identify the available HTTP operations, hereafter referred to as actions, by parsing the Swagger definition.

Let S = {S1, . . . , SN} be the set of N HTTP actions available for the target SUT. We encode each test case T as a binary string of size N as follows:

E(T) = ⟨e1, . . . , eN⟩, where ei = 1 if Si ∈ T and ei = 0 if Si ∉ T    (1)

In other words, each element ei in the encoded vector E(T) is set to 1 if the test case T contains the action Si, and to 0 otherwise. The linkage model is trained on the binary-encoded vectors rather than on the original test cases. This encoding is used to determine, via statistical analysis, which groups of HTTP actions often appear together within the fittest test cases, and which ones never occur together. This information is used to create more efficient recombination operators.

2) Linkage Model Training: In this paper, we use agglomerative hierarchical clustering (AHC) over other techniques (e.g., Bayesian Networks) for linkage tree learning, because prior studies show that AHC is more efficient [33]. In particular, we apply the UPGMA (unweighted pair group method with arithmetic mean) algorithm [40]. In each iteration, UPGMA merges the two clusters that are most similar based on the average distance across the data points (genes in our case) in the two clusters. The similarity between two HTTP action genes Si and Sj is computed using the mutual information, as suggested by Thierens and Bosman [33]:

MI(Si, Sj) = H(Si) + H(Sj) − H(Si, Sj)    (2)

where H(·) denotes the information entropy [41].

Fig. 1: Example of a linkage tree model and the Family Of Subsets (FOS). [Figure: a binary tree with leaves S1, S2, S3, S4, . . . ; internal nodes cluster {S1, S2} and {S3, S4}, up to the root {S1, S2, S3, S4, . . . }; the vertical axis indicates merge distance.]

Note that LT-MOSA infers the linkage tree for the most promising part of the population, i.e., the first non-dominated front (line 6 in Algorithm 1). Furthermore, the training process is applied to the encoded test cases according to the schema described in Section III-A1 rather than to the actual test cases. Hence, the linkage tree obtained with UPGMA captures the hierarchical relationships between HTTP actions in our case.

For example, let us consider the linkage tree depicted in Fig. 1. In the example, the set of actions S = {S1, S2, S3, S4, . . . } (the root of the tree) is partitioned into two clusters: {S1, S2} and {S3, S4, . . . }; each sub-cluster can be further divided into sub-clusters until reaching the leaf nodes. In general, the linkage tree has N leaves and N − 1 internal nodes. The root node contains all HTTP actions of the SUT. Each internal node divides the set of HTTP actions into two mutually exclusive clusters (the child nodes). Finally, the leaves contain the individual HTTP actions, which are the starting point of the UPGMA algorithm.

The linkage tree nodes are often referred to as the Family of Subsets (F) in the related literature [32], [33]. Each node (or subset) F′ ∈ F with |F′| > 2 has two mutually exclusive subsets (or child nodes) Fx and Fy such that Fx ∩ Fy = ∅ and Fx ∪ Fy = F′. Each subset F′ ∈ F represents a cluster of HTTP actions that often appear together and characterize the best test cases in the population. Therefore, the recombination operator should be applied by preserving these subsets (patterns) when creating new offspring tests. The next subsection describes the subset-preserving recombination operator we implemented in LT-MOSA.

The computational complexity of UPGMA is O(N × M²), where N is the number of genes and M is the population size. To reduce this overhead, the linkage tree learning procedure is not applied in each generation. Instead, the linkage tree model is re-trained every K generations (line 6 of Algorithm 1).
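The encoding of Eq. (1) and the mutual-information measure of Eq. (2) can be sketched as follows. The action set and the toy front are our illustration; the pair with the highest mutual information is the first one a UPGMA-style clustering would merge:

```python
# Sketch: binary encoding E(T) per Eq. (1) and pairwise mutual information
# per Eq. (2), computed over a toy non-dominated front (assumed data).
import math
from itertools import combinations

ACTIONS = ["S1", "S2", "S3", "S4"]  # the N available HTTP actions

def encode(test_case):
    """Eq. (1): e_i = 1 iff action S_i appears in the test case."""
    return [1 if a in test_case else 0 for a in ACTIONS]

def entropy(columns):
    """Joint entropy H over one or more binary columns."""
    counts = {}
    for row in zip(*columns):
        counts[row] = counts.get(row, 0) + 1
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mutual_information(xi, xj):
    """Eq. (2): MI(Si, Sj) = H(Si) + H(Sj) - H(Si, Sj)."""
    return entropy([xi]) + entropy([xj]) - entropy([xi, xj])

# Toy Front[0]: S1 and S2 always co-occur, S3 and S4 vary independently.
front = [["S1", "S2"], ["S1", "S2", "S4"], ["S3"], ["S1", "S2", "S3", "S4"]]
encoded = [encode(t) for t in front]
cols = list(zip(*encoded))  # one column of bits per action

mi = {(a, b): mutual_information(cols[i], cols[j])
      for (i, a), (j, b) in combinations(enumerate(ACTIONS), 2)}
# S1 and S2 share the most information, so they would be merged first.
print(max(mi, key=mi.get))  # ('S1', 'S2')
```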

B. Linkage-based Recombination

MOSA creates offspring tests using the single-point crossover [24]. This crossover is the classic recombination operator used in genetic algorithms [42] and in test case generation [10], [21]. It generates two offspring by randomly swapping statements between two parent tests T1 and T2. As argued in Section I, exchanging statements between test cases in a randomized manner can break the gene patterns (HTTP actions) that characterize the fittest individuals. Randomized recombination is also disruptive toward building good partial solutions (building blocks), negatively affecting the overall convergence [33].

Therefore, LT-MOSA uses a linkage-based recombination operator rather than the classical single-point crossover to preserve the patterns of HTTP actions identified by the linkage tree model. The recombination operator generates only one offspring starting from two existing test cases, called parent and donor. Both test cases are selected from the current population P as indicated in lines 9 and 11. The offspring is created by copying all genes (HTTP actions with input data) from the parent and further injecting only some genes from the donor. These genes are selected by exploiting the linkage tree model trained according to Section III-A.

More precisely, we first identify the gene patterns (i.e., the subsets F′ ∈ F) that the donor contains. This is done by iterating across all subsets in the linkage tree model F and identifying the subsets F′ ⊂ F that appear in the encoded vector (see Section III-A1) of the donor. LT-MOSA randomly selects one of the identified subsets in F′ and inserts it into the offspring. The injection point is randomly chosen, and the selected genes (HTTP actions with test data) are inserted into the offspring in the exact order in which they appear in the donor. If the donor does not contain any subset according to the linkage tree (i.e., F′ = ∅), then the offspring is generated by applying the traditional single-point crossover. This fallback is safe in the latter case, since the linkage tree model could not identify any useful gene pattern within the donor.

C. Mutation

In MOSA, each test case is mutated with a probability pm = 1/L, where L is the test case length [24]. This reflects the existing guidelines in evolutionary computation [43], [44], which suggest using a mutation probability pm inversely proportional to the size of the chromosome.

In recent years, Arcuri [13] improved the mutation operator in the context of system-level test case generation by using a variable mutation rate. Indeed, the mutation operator in MIO [13] increases the number of mutations applied to each test case from 1 (start of the search) up to 10 (end of the search) in linear incremental steps. The importance of a large mutation rate for RESTful API testing has also been confirmed by a recent study [9].

Based on these observations, LT-MOSA uses the same mutation rate as MIO (i.e., a mutation rate increasing from 1 up to 10 mutations) rather than the fixed mutation rate of MOSA.

IV. EMPIRICAL STUDY

This section details the empirical study that we carried out to evaluate the effectiveness of the proposed solution, called LT-MOSA, and to compare it with the state-of-the-art algorithms (MIO, MOSA) w.r.t. the following testing criteria: (i) code (line and branch) coverage and (ii) fault detection capability.

A. Benchmark

This study uses the EvoMaster Benchmark (EMB)² version 1.0.1. This benchmark was specifically created as a set of web/enterprise applications for evaluating the test case generation algorithms implemented in EvoMaster. We selected this benchmark since it has been widely used in the literature to assess test case generation approaches for REST APIs [11], [13].

In this study, we used five real-world open-source Java web applications and two artificial Java web applications. CatWatch is a metrics dashboard for GitHub organizations. Features-Service is a REST microservice for managing feature models of products. OCVN (Open Contracting Vietnam) is a visual data analytics platform for the Vietnam public procurement data. ProxyPrint is a platform for comparing and making requests to print-shops. Scout-API is a RESTful web service for the hosted monitoring service "Scout". NCS (Numerical Case Study) is an artificial application containing numerical examples. SCS (String Case Study) is an artificial application containing string manipulation code examples. We use NCS and SCS since they have been designed for assessing test generation tools. These artificial web applications allow us to cover many different scenarios (e.g., deceptive branches [13]). Compared to previous studies [11], [13], we added the OCVN application as it is the largest real-world system in the benchmark. We additionally removed the rest-news application as it contains artificial examples that are used for classroom teaching.

² https://github.com/EMResearch/EMB/releases/tag/v1.0.1

Table I summarizes the main characteristics of the applications in the benchmark, such as the number of classes, the number of test coverage targets, and the number of endpoints included in the service. This benchmark contains a total of 1655 classes with around 20 000 test coverage targets and 440 endpoints, not including tests or third-party libraries.

EvoMaster requires a test driver for the application under test. This test driver contains a controller that is responsible for starting, resetting, and stopping the SUT. We used the test drivers available in the EMB benchmark for the web applications used in this study.

B. Research Questions

Our empirical evaluation aims to answer the following research questions:
RQ1 How does LT-MOSA perform compared to the state-of-the-art approaches with regard to code coverage?
RQ2 How effective is LT-MOSA compared to the state-of-the-art approaches in detecting real-faults?
TABLE I: Web applications from the EvoMaster Benchmark (EMB) used in the empirical study. Reports the number of Java classes, test coverage targets (i.e., lines and branches), and the number of endpoints.

Application        Classes  Coverage Targets  Endpoints
CatWatch                69              2182         23
Features-Service        23               513         18
NCS                     10               652          7
OCVN                   548              8010        258
ProxyPrint              68              3758         74
Scout-API               75              3449         49
SCS                     13               865         11
Total                 1655            19 429        440

RQ3 How effective is LT-MOSA at covering test targets over time compared to the state-of-the-art approaches?

The first two research questions aim to evaluate whether preserving patterns in HTTP requests through linkage learning can improve the effectiveness of test case generation for REST APIs by reaching a higher coverage and detecting more faults. The last research question aims to assess whether our approach, LT-MOSA, is more efficient in covering these test targets, by measuring how many test targets are covered at different times within the search budget.

C. Baseline

To answer our research questions, we compare LT-MOSA with the two state-of-the-art search-based test case generation algorithms for REST APIs as baselines:
• Many Independent Objective (MIO) is the state of the art for REST API testing, and it is the default search algorithm in EvoMaster. MIO aims to improve the scalability of many-objective search algorithms for programs with a very large number of testing targets (see Section II-C1).
• Many-Objective Sorting Algorithm (MOSA) is the base algorithm we use to build and design LT-MOSA. Therefore, we want to assess whether our approach outperforms its predecessor. Furthermore, MOSA has proven to be very competitive in the context of REST API testing (see Section II-C2).

D. Prototype Tool

We have implemented LT-MOSA in a prototype tool that extends EvoMaster, an automated system-level test case generation framework. In particular, we implemented the approach as described in Section III within EvoMaster.

The variant of MOSA implemented in EvoMaster differs from the original algorithm proposed by Panichella et al. [24]. The EvoMaster variant does not use the crossover operator but merely relies on the mutation operator to create new test cases. Therefore, we implemented the single-point crossover as described in [24] and adapted it to the encoding schema used for representing REST API requests in EvoMaster. See Section II-C2 for more details.

We chose EvoMaster because it already implements the state-of-the-art test case generation algorithms, and it is publicly available on GitHub. Besides, EvoMaster implements testability transformations to improve the guidance for search-based algorithms [45] and can handle SQL databases [9].

E. Parameter Setting

For this study, we have chosen to adopt the default search algorithm parameter values set by EvoMaster. It has been empirically shown [46] that although parameter tuning has an impact on the effectiveness of a search algorithm, the default values, which are commonly used in the literature, provide reasonable and acceptable results. Thus, this section only lists a few of the most important search parameters and their values.

1) Search Budget: We chose a search budget (stopping condition) based on time instead of the number of executed tests. This choice was made because search time provides the fairest comparison, given that we consider different kinds of algorithms with diverse internal routines (also in terms of computational complexity). Additionally, practitioners will often only allocate a certain amount of time for the algorithm to run. The search budget for all algorithms was set to 30 minutes, as this strikes a balance between giving the algorithms enough time to explore the search space and keeping the study feasible to execute. If an algorithm covers all its test objectives, it stops prematurely. Note that running time is considered a less biased stopping condition than counting the number of executed tests, since not all tests have the same running time [6], [9], [13], [21]. We further discuss this aspect in the threats to validity.

2) MIO parameters: For MIO, we used the default settings as provided in the original papers by Arcuri [47], [13].
• Population size: We use the default population size of 10 individuals per testing target. Notice that MIO uses separate populations for the different targets.
• Mutation: We use the default number of mutations applied to sampled individuals, which linearly increases from 1 to 10 by the end of the search.
• F: We use the default percentage of time after which a focused search should start, i.e., 0.5.
• Pr: We use the default probability of 0.5 for sampling at random instead of sampling from one of the populations. This value linearly increases/decreases based on the consumed search budget and the value of F.

3) MOSA parameters: For MOSA, we used the default settings described in the original paper by Panichella et al. [26].
• Population size: 50 individuals (test cases).
• Mutation: We use the uniform mutation, which either changes the test case structure (adding, deleting, or replacing API requests) or the input data. Test structure and test data mutation are equally probable, i.e., each has a 50% probability of being applied. The mutation probability for each statement/data gene is equal to 1/n, where n is the number of statements in the test case.
• Recombination Operator: We use the single-point crossover with a crossover probability of 0.75.
• Selection: We use tournament selection with the default tournament size of 10.

4) LT-MOSA parameters: For LT-MOSA, we used the same parameters as for the MOSA algorithm, except for the mutation operator, for which we use the mutation described in Section III-C. Additionally, we use the following parameter values for the linkage learning model:
• Frequency: We re-train the linkage tree model every 10 generations. Based on a preliminary experiment that we performed, this provides a balance between incurring too much overhead (< 10) and relying on an outdated model (> 10).
• Recombination Operator: We use the linkage-based recombination with a probability of 0.75.

F. Real-fault Detection

To find out the number of unique faults that the search algorithms can detect, EvoMaster checks the status codes returned by the HTTP requests for 5xx server errors, as an indicator of a fault. Since web applications handle many different clients, it is not desirable for the application to crash or exit when an error occurs, as this would also impact the other clients. Thus, web applications return a status code in the 5xx range, indicating that an error has occurred on the server's side. EvoMaster keeps track of the last executed statement in the SUT (excluding third-party libraries) when a 5xx status code is returned, to distinguish between different errors that happen on the same endpoint.

G. Experimental Protocol

For each web application, all three search algorithms (MOSA, MIO, LT-MOSA) are separately executed, and the resulting number of covered test targets is recorded. Since all three search algorithms used in the study are randomized, we can expect a fair amount of variation in the results. To mitigate this, we repeated every experiment 20 times, with different random seeds, and computed the median results. In total, we performed 420 executions: three search algorithms for seven web applications with 20 repetitions each. With each execution taking 30 minutes, the total execution time is 8.75 days of consecutive running time.

To determine whether the results (i.e., code coverage and fault detection capability) of the three different algorithms are statistically significant, we use the unpaired Wilcoxon rank-sum test [48] with a threshold of 0.05. This is a non-parametric statistical test that determines whether two data distributions are significantly different. Since we have three different data distributions, one for each search algorithm, we perform the Wilcoxon test pairwise between each configuration pair: (i) LT-MOSA and MOSA; (ii) LT-MOSA and MIO. We combine this with the Vargha-Delaney statistic [49] to measure the effect size of the result, which determines how large the difference between the two configurations is.

To determine how the configurations compare in terms of efficiency, we analyze the code coverage at different points in time. While the effectiveness measures the code coverage only at the end of the allocated time, we also want to analyze how the algorithms perform during the search. One way to quantify the efficiency of an algorithm is by plotting the number of covered test targets at predefined intervals during the search process, in a so-called convergence graph. We collected the number of targets that have been covered for every generation of each independent run. To express the efficiency of the experimented algorithms as a single scalar value, we computed the overall convergence rate as the Area Under the Curve (AUC) delimited by the convergence graph. This metric is normalized by dividing the AUC in each run by the maximum possible AUC per application³.

³ This corresponds to the area of the box with a height equal to the maximum code coverage and a width equal to the search budget.

TABLE II: Median number of covered test targets.

Application            MIO               MOSA              LT-MOSA
                  Median     IQR     Median     IQR     Median     IQR
CatWatch         1173.00   12.75    1177.00  132.00    1215.50  161.75
Features-Service  488.00   72.25     455.50   33.25     478.00    5.00
NCS               622.50    1.25     623.00    4.00     622.00    3.25
OCVN             2421.50  374.75    2931.50  271.00    4031.50  338.75
ProxyPrint       1485.50   16.25    1501.00   78.25    1602.50   59.00
Scout-API        1727.50   54.75    1707.00   69.00    1826.50   33.25
SCS               853.00    5.50     852.00    8.00     853.00    3.00

V. RESULTS

This section details the results of the empirical study with the aim of answering our research questions.

A. RQ1: Code Coverage

Table II reports the median and inter-quartile range (IQR) of the number of test targets covered by MIO, MOSA, and LT-MOSA for each of the seven applications.

From Table II, we observe that LT-MOSA achieved the highest median value (avg. +334.75 targets) for four out of the seven applications, while MOSA and MIO each achieved the highest median value (+10.00 and +0.50 targets, respectively) for one out of the seven applications. The largest increase in code coverage is observable for OCVN, for which LT-MOSA covered +1100.00 more targets. For SCS, both LT-MOSA and MIO covered the same number of targets (853.00). For both artificial applications, namely NCS and SCS, the difference between the search algorithms is minimal (≤ 1).

In terms of variability (IQR), there is no clear trend with regard to the applications under test and/or the search approaches. For example, in some cases, the winning configuration (LT-MOSA on CatWatch) has the highest IQR by a significant margin (161.75 vs. 132.00 or 12.75). On Scout-API, LT-MOSA yields the lowest IQR by a significant margin (33.25 vs. 69.00 or 54.75). Within and across each search algorithm, the IQR varies.

Table III reports the statistical significance (p-value), calculated with the Wilcoxon test, of the difference between the number of targets covered by LT-MOSA and the two baselines, MIO and MOSA. It also reports the magnitude of the differences according to the Vargha-Delaney Â12 statistic.
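For reference, the Vargha-Delaney Â12 statistic used throughout estimates the probability that an observation drawn from one group is larger than one drawn from the other (0.5 meaning no difference). A minimal self-contained sketch, not tied to any statistics library:

```python
def a12(group_a, group_b):
    """Vargha-Delaney A12 effect size: probability that a random
    observation from group_a exceeds one from group_b, with ties
    counted as 0.5. A12 = 0.5 indicates no difference."""
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in group_a
        for b in group_b
    )
    return wins / (len(group_a) * len(group_b))
```

For example, two identical samples yield 0.5, while a sample that is uniformly larger than the other yields 1.0 (a "large" effect).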
TABLE III: Statistical results (p-value and Â12) for the covered test targets (RQ1). Significant p-values (i.e., p-value < 0.05) are marked gray.

Application        LT-MOSA vs MIO          LT-MOSA vs MOSA
                  p-value  Â12             p-value  Â12
CatWatch          <0.01    0.87 (Large)    0.04     0.66 (Small)
Features-Service   0.34    0.54            <0.01    0.83 (Large)
NCS                0.86    0.49            0.90     0.38 (Small)
OCVN              <0.01    1.00 (Large)    <0.01    1.00 (Large)
ProxyPrint        <0.01    1.00 (Large)    <0.01    0.86 (Large)
Scout-API         <0.01    0.96 (Large)    <0.01    0.96 (Large)
SCS                0.86    0.40 (Small)    0.50     0.50

TABLE IV: Median number of detected real-faults.

Application            MIO             MOSA            LT-MOSA
                  Median   IQR     Median   IQR     Median   IQR
CatWatch           13.00   0.25     12.00   2.00     13.50   2.25
Features-Service   17.00   0.00     17.00   0.00     18.00   0.50
NCS                    0   0.00         0   0.00         0   0.00
OCVN               34.00   5.25     37.50   5.25     48.00   3.50
ProxyPrint         32.50   1.00     33.00   1.00     34.00   0.25
Scout-API          54.50   3.75     60.00   1.50     64.00   3.00
SCS                    0   0.00         0   0.00         0   0.00

From Table III, we can observe that for the non-artificial web applications, LT-MOSA achieves a significantly higher code coverage than MIO in four out of five applications, with a large effect size (Â12 statistic). LT-MOSA significantly outperforms MOSA in all five applications; the effect size is large in four applications and small for CatWatch. For the two artificial applications, NCS and SCS, there is no statistical difference between the results of LT-MOSA and the two baselines (MIO and MOSA). This confirms our preliminary results reported in Table II. Moreover, the difference between LT-MOSA and MIO is not significant for Features-Service. Finally, in none of the applications in our benchmark did either of the baselines achieve a significantly larger coverage than LT-MOSA.

In summary, LT-MOSA achieves significantly higher (in most cases) or equal code coverage when applied to REST APIs as compared to both MIO and MOSA.

B. RQ2: Fault Detection Capability

Table IV reports the median number of real-faults (and the corresponding IQR) detected by MIO, MOSA, and LT-MOSA for each of the seven applications.

We observe that for both artificial applications, NCS and SCS, the number of faults detected by any search algorithm is zero. This is because these artificial applications are not designed to fail softly by returning 5xx status codes. For the open-source applications, LT-MOSA detects the largest number of faults (avg. +3.40 faults) in all five cases. The largest increase in fault-detection rate is observable for the OCVN application, with +10.5 more faults detected by LT-MOSA than the baselines. It is noteworthy that the largest difference between LT-MOSA and the baselines is on the OCVN application, which is the application with by far the most classes (i.e., 548) and endpoints (i.e., 258) in our benchmark. This could be explained by the fact that LT-MOSA also achieved a much higher code coverage for this application. However, the difference in detected faults for OCVN is larger than for the other applications in the benchmark, which could indicate that LT-MOSA is especially effective for testing large REST APIs. The faults detected by LT-MOSA are a superset of the faults detected by MIO and MOSA. These newly discovered faults originate from the additional coverage that LT-MOSA achieves.

TABLE V: Statistical results (p-value and Â12) for the detected real-faults (RQ2). Significant p-values (i.e., p-value < 0.05) are marked gray.

Application        LT-MOSA vs MIO          LT-MOSA vs MOSA
                  p-value  Â12             p-value  Â12
CatWatch          <0.01    0.70 (Large)    <0.01    0.84 (Large)
Features-Service  <0.01    0.92 (Large)    <0.01    0.98 (Large)
NCS                   -    -                   -    -
OCVN              <0.01    0.99 (Large)    <0.01    0.92 (Large)
ProxyPrint        <0.01    0.89 (Large)    0.03     0.66 (Small)
Scout-API         <0.01    1.00 (Large)    <0.01    0.91 (Large)
SCS                   -    -                   -    -

TABLE VI: Median normalized AUC for the number of covered test targets. The highest values are marked in gray.

Application        MIO    MOSA   LT-MOSA
CatWatch           0.77   0.78      0.78
Features-Service   0.78   0.75      0.82
NCS                0.99   0.99      0.99
OCVN               0.50   0.50      0.66
ProxyPrint         0.87   0.82      0.88
Scout-API          0.84   0.81      0.86
SCS                0.95   0.96      0.96

Table V reports the results of the statistical test, namely the Wilcoxon test, applied to the number of faults detected by LT-MOSA and the two baselines, MIO and MOSA. It also reports the magnitude of the differences (if any) obtained with the Vargha-Delaney Â12 statistic. Significant p-values (i.e., p-value < 0.05) are highlighted in gray. From Table V, we can observe that LT-MOSA detects a significantly higher number of faults than MIO and MOSA in all non-artificial applications. The effect size (Â12) is large in all comparisons, except for ProxyPrint, where the effect size is small when comparing LT-MOSA and MOSA. Since none of the algorithms detected any faults in the artificial applications, Table V does not report any p-value or Â12 statistics for these applications.

In summary, we can conclude that LT-MOSA detects more faults than the state-of-the-art approaches, namely MIO and MOSA, for all applications in our benchmark.
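The normalized convergence AUC defined in Section IV-G can be sketched as follows. This is an illustrative implementation using the trapezoidal rule; the sampling and integration details of the actual experiments may differ:

```python
def normalized_auc(times, covered, budget, max_targets):
    """Normalized Area Under the Curve for one coverage-over-time
    run. `times` (seconds) and `covered` (targets) are parallel,
    sorted samples; the curve is extended flat to the end of the
    budget. The AUC is divided by the box budget * max_targets,
    so a run covering everything instantly scores 1.0."""
    # Extend the last observed coverage to the full search budget.
    ts = list(times) + [budget]
    cs = list(covered) + [covered[-1]]
    # Trapezoidal integration over consecutive samples.
    auc = sum(
        (ts[i + 1] - ts[i]) * (cs[i] + cs[i + 1]) / 2.0
        for i in range(len(ts) - 1)
    )
    return auc / (budget * max_targets)
```

Under this normalization, an algorithm that reaches a given coverage earlier accumulates more area and therefore a higher score, which is exactly the efficiency notion compared in Table VI.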
[Fig. 2 is a line plot of covered targets over time; x-axis: Time (seconds); y-axis: Number of Covered Branches; legend: LT-MOSA, MIO, MOSA.]
Fig. 2: Average number of targets covered by our approach (LT-MOSA) and the baselines (MOSA, MIO) for OCVN.

C. RQ3: Code Coverage over Time

Table VI reports the median Area Under the Curve (AUC) related to the number of targets covered over time by MIO, MOSA, and LT-MOSA for each of the seven applications. The AUC indicates how efficient the search algorithms are at reaching a certain code coverage. For more information on how the AUC is calculated and normalized, see Section IV-G. Table VI highlights (in gray) the search algorithm that achieved the highest AUC value.

We observe that for the open-source applications, LT-MOSA has the highest AUC (avg. +0.06) in four out of five applications, with the largest difference (+0.16) in the OCVN application. For CatWatch, both MOSA and LT-MOSA have the same AUC (i.e., 0.78). From Tables II and III, however, we can see that LT-MOSA covers significantly more targets (+38.5) after 30 minutes of search budget. This means that MOSA reaches a higher coverage in the beginning but loses to LT-MOSA over time.

Fig. 2 shows the (median) number of targets covered over time by the different search algorithms for OCVN, which is the largest application in our benchmark. In the beginning of the experiment (0-500 seconds), MIO and LT-MOSA perform roughly equally. After the first 500 seconds, LT-MOSA outperforms MIO. This results in a much larger AUC value (+0.16) for LT-MOSA compared to MIO, as indicated in Table VI. We conclude that LT-MOSA significantly outperforms both MOSA and MIO in terms of effectiveness and efficiency on this application. To reaffirm this, we can observe in Fig. 2 that MIO never reaches 2700 covered targets, MOSA takes 1311 seconds to reach that many targets, and LT-MOSA does so in just 713 seconds, almost half the time of MOSA.

For the two artificial applications, NCS and SCS, the difference in AUC between the three search algorithms is very small (≤ 0.01). From Table II, we can also see that LT-MOSA covers one target more than MOSA on SCS and one target less on NCS. However, they both yield the same AUC, i.e., 0.96 (SCS) and 0.99 (NCS). These results are in line with the results from RQ1.

In summary, we can conclude that LT-MOSA achieves higher AUC values than the baselines, i.e., it covers more targets and in less time.

VI. THREATS TO VALIDITY

This section discusses the potential threats to the validity of the study performed in this paper.

Threats to construct validity. We rely on well-established metrics in software testing to compare the different test case generation approaches, namely code coverage, fault detection capability, and running time. As a stopping condition for the search, we measured the search budget in terms of running time (i.e., 30 minutes) rather than considering the number of executed tests or HTTP requests. Given that the different algorithms in the comparison use different genetic operators, with different overheads, execution time provides a fairer measure of time allocation.

Threats to external validity. An important threat regards the number of web services in our benchmark. We selected seven web/enterprise applications from the EMB benchmark. The benchmark has been widely used in the related literature on testing for REST APIs. The applications are diverse in terms of size, application domain, and purpose. Further experiments on a larger set of web/enterprise applications would increase the confidence in the generalizability of our study. A larger empirical evaluation is part of our future agenda.

Threats to conclusion validity are related to the randomized nature of EAs. To minimize this risk, we have performed each experiment 20 times with different random seeds. We have followed the best practices for running experiments with randomized algorithms as laid out in well-established guidelines [50] and analyzed the possible impact of different random seeds on our results. We used the unpaired Wilcoxon rank-sum test and the Vargha-Delaney Â12 effect size to assess the significance and magnitude of our results.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we have used agglomerative hierarchical clustering to learn a linkage tree model that captures promising patterns of HTTP requests in automatically generated system-level test cases. We proposed a novel algorithm, called LT-MOSA, that extends state-of-the-art approaches by tailoring and incorporating linkage learning within its genetic operators. Linkage learning helps to preserve and replicate patterns of API requests that depend on each other.

We implemented LT-MOSA in EvoMaster and evaluated it on seven web applications from the EMB benchmark. Our results show that LT-MOSA significantly improves code coverage and can detect more faults than two state-of-the-art approaches in REST API testing, namely MIO [13] and MOSA [24]. This suggests that using unsupervised machine learning (agglomerative hierarchical clustering in our case) is a very promising research direction.
Based on our promising results, there are multiple potential directions for future work. In this paper, we used the UPGMA algorithm for hierarchical clustering; therefore, we intend to investigate more learning algorithms within the hierarchical clustering category. We also plan to investigate other categories of machine learning methods as alternatives to hierarchical clustering, such as Bayesian networks [29]. Furthermore, LT-MOSA uses a fixed parameter K for the linkage learning frequency; we plan to investigate alternative, more adaptive mechanisms to decide whether the linkage tree model needs to be retrained or not. Finally, we intend to implement and apply linkage learning to unit-test case generation as well.

REFERENCES

[1] F. Curbera, M. Duftler, R. Khalaf, W. Nagy, N. Mukhi, and S. Weerawarana, "Unraveling the web services web: an introduction to SOAP, WSDL, and UDDI," IEEE Internet Computing, vol. 6, no. 2, pp. 86–93, 2002.
[2] R. T. Fielding, Architectural Styles and the Design of Network-based Software Architectures. University of California, Irvine, 2000, vol. 7.
[3] V. Lenarduzzi and A. Panichella, "Serverless testing: Tool vendors' and experts' points of view," IEEE Software, vol. 38, no. 1, pp. 54–60, 2020.
[4] V. Lenarduzzi, J. Daly, A. Martini, S. Panichella, and D. A. Tamburri, "Toward a technical debt conceptualization for serverless computing," IEEE Software, vol. 38, no. 1, pp. 40–47, 2020.
[5] J. Campos, Y. Ge, N. Albunian, G. Fraser, M. Eler, and A. Arcuri, "An empirical evaluation of evolutionary algorithms for unit test suite generation," Information and Software Technology, vol. 104, pp. 207–235, 2018.
[6] A. Panichella and U. R. Molina, "Java unit testing tool competition—fifth round," in 2017 IEEE/ACM 10th International Workshop on Search-Based Software Testing (SBST). IEEE, 2017, pp. 32–38.
[7] G. Fraser and A. Arcuri, "1600 faults in 100 projects: automatically finding faults while achieving high coverage with EvoSuite," Empirical Software Engineering, vol. 20, no. 3, pp. 611–639, 2015.
[8] S. Shamshiri, R. Just, J. M. Rojas, G. Fraser, P. McMinn, and A. Arcuri, "Do automatically generated unit tests find real faults? An empirical study of effectiveness and challenges (T)," in 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2015, pp. 201–211.
[9] A. Arcuri and J. P. Galeotti, "Handling SQL databases in automated system test generation," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 29, no. 4, pp. 1–31, 2020.
[10] P. Tonella, "Evolutionary testing of classes," ACM SIGSOFT Software Engineering Notes, vol. 29, no. 4, pp. 119–128, 2004.
[11] A. Arcuri, "RESTful API automated test case generation with EvoMaster," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 28, no. 1, pp. 1–37, 2019.
[12] M. Zhang, B. Marculescu, and A. Arcuri, "Resource-based test case generation for RESTful web services," in Proceedings of the Genetic and Evolutionary Computation Conference, 2019.
[18] M. Soltani, A. Panichella, and A. Van Deursen, "Search-based crash reproduction and its impact on debugging," IEEE Transactions on Software Engineering, vol. 46, no. 12, pp. 1294–1317, 2018.
[19] S. Ali, M. Z. Iqbal, A. Arcuri, and L. C. Briand, "Generating test data from OCL constraints with search techniques," IEEE Transactions on Software Engineering, vol. 39, no. 10, pp. 1376–1402, 2013.
[20] N. Alshahwan, X. Gao, M. Harman, Y. Jia, K. Mao, A. Mols, T. Tei, and I. Zorin, "Deploying search based software engineering with Sapienz at Facebook," in International Symposium on Search Based Software Engineering. Springer, 2018, pp. 3–45.
[21] G. Fraser and A. Arcuri, "EvoSuite: automatic test suite generation for object-oriented software," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, 2011, pp. 416–419.
[22] K. Mao, M. Harman, and Y. Jia, "Sapienz: Multi-objective automated testing for Android applications," in Proceedings of the 25th International Symposium on Software Testing and Analysis, 2016, pp. 94–105.
[23] E. Viglianisi, M. Dallago, and M. Ceccato, "RestTestGen: automated black-box testing of RESTful APIs," in 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST). IEEE, 2020, pp. 142–152.
[24] A. Panichella, F. M. Kifetew, and P. Tonella, "Reformulating branch coverage as a many-objective optimization problem," in Proceedings of the International Conference on Software Testing, Verification and Validation (ICST'15), Graz, Austria, 2015, pp. 1–10.
[25] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
[26] A. Panichella, F. M. Kifetew, and P. Tonella, "Automated test case generation as a many-objective optimisation problem with dynamic selection of the targets," IEEE Transactions on Software Engineering, vol. 44, no. 2, pp. 122–158, Feb. 2018.
[27] R. A. Watson, G. S. Hornby, and J. B. Pollack, "Modeling building-block interdependency," in International Conference on Parallel Problem Solving from Nature. Springer, 1998, pp. 97–106.
[28] M. Pelikan and D. E. Goldberg, "Escaping hierarchical traps with competent genetic algorithms," in Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation, 2001, pp. 511–518.
[29] M. Pelikan, D. E. Goldberg, E. Cantú-Paz et al., "BOA: The Bayesian optimization algorithm," in Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, vol. 1. Citeseer, 1999, pp. 525–532.
[30] T.-L. Yu, D. E. Goldberg, K. Sastry, C. F. Lima, and M. Pelikan, "Dependency structure matrix, genetic algorithms, and effective recombination," Evolutionary Computation, vol. 17, no. 4, pp. 595–626, 2009.
[31] M. W. Przewozniczek and M. M. Komarnicki, "Empirical linkage learning," IEEE Transactions on Evolutionary Computation, vol. 24, no. 6, pp. 1097–1111, 2020.
[32] D. Thierens, "The linkage tree genetic algorithm," in International Conference on Parallel Problem Solving from Nature. Springer, 2010, pp. 264–273.
[33] D. Thierens and P. A. Bosman, "Optimal mixing evolutionary algorithms," in Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, 2011, pp. 617–624.
[34] A. Bouter, T. Alderliesten, C. Witteveen, and P. A. Bosman, "Exploiting linkage information in real-valued optimization with the real-valued
[13] A. Arcuri, “Test suite generation with the many independent objective gene-pool optimal mixing evolutionary algorithm,” in Proceedings of
(mio) algorithm,” Information and Software Technology, vol. 104, pp. the Genetic and Evolutionary Computation Conference, 2017, pp. 705–
195–206, 2018. 712.
[14] R. A. Watson and T. Jansen, “A building-block royal road where [35] P. A. Bosman, N. H. Luong, and D. Thierens, “Expanding from discrete
crossover is provably essential,” in Proceedings of the 9th annual cartesian to permutation gene-pool optimal mixing evolutionary algo-
conference on Genetic and evolutionary computation, 2007, pp. 1452– rithms,” in Proceedings of the Genetic and Evolutionary Computation
1459. Conference 2016, 2016, pp. 637–644.
[15] D. Stallenberg, M. Olsthoorn, and A. Panichella, “Replication package of [36] M. Olsthoorn and A. Panichella, “Multi-objective test case selection
"Improving Test Case Generation for REST APIs Through Hierarchical through linkage learning-based crossover,” in International Symposium
Clustering",” 2021. on Search Based Software Engineering. Springer, 2021.
[16] P. McMinn, “Search-based software test data generation: a survey,” [37] A. Panichella, F. M. Kifetew, and P. Tonella, “A large scale empirical
Software testing, Verification and reliability, vol. 14, no. 2, pp. 105– comparison of state-of-the-art search-based test case generators,” Infor-
156, 2004. mation and Software Technology, vol. 104, pp. 236–256, 2018.
[17] S. Anand, E. K. Burke, T. Y. Chen, J. Clark, M. B. Cohen, [38] F. U. Haq, D. Shin, L. C. Briand, T. Stifter, and J. Wang, “Automatic
W. Grieskamp, M. Harman, M. J. Harrold, P. McMinn, A. Bertolino test suite generation for key-points detection dnns using many-objective
et al., “An orchestrated survey of methodologies for automated software search,” arXiv preprint arXiv:2012.06511, 2020.
test case generation,” Journal of Systems and Software, vol. 86, no. 8, [39] B. Korel, “Automated software test data generation,” IEEE Transactions
pp. 1978–2001, 2013. on Software Engineering, vol. 16, no. 8, pp. 870–879, 1990.

[40] R. Sokal and C. D. Michener, "A statistical method for evaluating systematic relationships," University of Kansas Science Bulletin, vol. 38, pp. 1409–1438, 1958.
[41] C. E. Shannon, "A mathematical theory of communication," ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.
[42] S. Sivanandam and S. Deepa, "Genetic algorithms," in Introduction to Genetic Algorithms. Springer, 2008, pp. 15–37.
[43] J. D. Schaffer, R. Caruana, L. J. Eshelman, and R. Das, "A study of control parameters affecting online performance of genetic algorithms for function optimization," in Proceedings of the 3rd International Conference on Genetic Algorithms, 1989, pp. 51–60.
[44] J. E. Smith and T. C. Fogarty, "Adaptively parameterised evolutionary systems: Self-adaptive recombination and mutation in a genetic algorithm," in International Conference on Parallel Problem Solving from Nature. Springer, 1996, pp. 441–450.
[45] A. Arcuri and J. P. Galeotti, "Testability transformations for existing APIs," in 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST). IEEE, 2020, pp. 153–163.
[46] A. Arcuri and G. Fraser, "Parameter tuning or default values? An empirical investigation in search-based software engineering," Empirical Software Engineering, vol. 18, no. 3, pp. 594–623, 2013.
[47] A. Arcuri, "Many independent objective (MIO) algorithm for test suite generation," in Proceedings of the International Symposium on Search Based Software Engineering (SSBSE'17). Paderborn, Germany: Springer International Publishing, 2017, pp. 3–17.
[48] W. J. Conover, Practical Nonparametric Statistics. John Wiley & Sons, 1998, vol. 350.
[49] A. Vargha and H. D. Delaney, "A critique and improvement of the CL common language effect size statistics of McGraw and Wong," Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101–132, 2000.
[50] A. Arcuri and L. Briand, "A hitchhiker's guide to statistical tests for assessing randomized algorithms in software engineering," Software Testing, Verification and Reliability, vol. 24, no. 3, pp. 219–250, 2014.
