
AGA: An Accelerated Greedy Additional Algorithm for Test Case Prioritization

Feng Li, Jianyi Zhou, Yinzhu Li, Dan Hao, Lu Zhang

Abstract—In recent years, many test case prioritization (TCP) techniques have been proposed to speed up the process of fault detection. However, little work has taken the efficiency problem of these techniques into account. In this paper, we target the Greedy Additional (GA) algorithm, which has been widely recognized to be effective but less efficient, and try to improve its efficiency while preserving effectiveness. In our Accelerated GA (AGA) algorithm, we use some extra data structures to reduce redundant data accesses in the GA algorithm, and thus the time complexity is reduced from O(m²n) to O(kmn) when n > m, where m is the number of test cases, n is the number of program elements, and k is the iteration number. Moreover, we observe the impact of iteration numbers on prioritization efficiency on our dataset and propose to use a specific iteration number in the AGA algorithm to further improve the efficiency. We conducted experiments on 55 open-source subjects. In particular, we implemented each TCP algorithm with two widely-used input formats, adjacency matrix and adjacency list. Since a TCP algorithm with the adjacency matrix is less efficient than the algorithm with the adjacency list, the result analysis is mainly conducted based on TCP algorithms with the adjacency list. The results show that AGA achieves a 5.95X speedup ratio over GA on average, while it achieves the same average effectiveness as GA in terms of the Average Percentage of Faults Detected (APFD). Moreover, we conducted an industrial case study on 22 subjects collected from Baidu, and found that the average speedup ratio of AGA over GA is 44.27X, which indicates the practical usage of AGA in real-world scenarios.
Note: This is a preprint of the accepted paper “Feng Li, Jianyi Zhou, Yinzhu Li, Dan Hao, and Lu Zhang. AGA: An Accelerated Greedy Additional Algorithm for Test Case Prioritization. IEEE Transactions on Software Engineering, 2021”, which can be accessed at https://ieeexplore.ieee.org/document/9662236.

Index Terms—Test Case Prioritization, Additional Strategy, Acceleration

Feng Li, Jianyi Zhou, Dan Hao, and Lu Zhang are with the Institute of Software, School of Computer Science, Peking University, Beijing, China, and the Key Laboratory of High Confidence Software Technologies (Peking University), MoE. Dan Hao is the corresponding author. E-mail: {lifeng2014, zhoujianyi, haodan, zhanglucs}@pku.edu.cn
Yinzhu Li is with the Baidu Online Network Technology (Beijing) Co., Ltd. E-mail: [email protected]

1 INTRODUCTION

Test case prioritization (abbreviated as TCP) [1], [2], [3], [4], [5], [6] is proposed to schedule the execution order of test cases so as to detect faults as early as possible. To address this problem, a large number of TCP techniques have been proposed in the literature.

Among these TCP techniques, the Greedy Additional (GA) algorithm has received much attention since it was proposed in 1999 [5], due to its widely recognized effectiveness [7], [8], [9], [10]. In particular, the GA algorithm iteratively selects the next test case that covers the largest number of elements (e.g., methods, branches, statements) that have not been covered by previously selected test cases. When the selected test cases cover all elements, this GA algorithm deals with the remaining unselected test cases with any prioritization technique (e.g., the Greedy Total algorithm [5], which schedules these test cases based on the descending order of the number of total covered program elements). Later in 2002, Elbaum et al. [3] slightly modified this algorithm by reordering the remaining test cases with the GA strategy again after resetting all the elements to be “uncovered”. This GA algorithm repeats the GA strategy until all the test cases are selected, and thus its effectiveness is no worse than that of the original GA algorithm [5]. Therefore, the GA algorithm proposed by Elbaum et al. [3] is taken as the default GA algorithm by most researchers in TCP and in this paper¹. Moreover, the original GA algorithm is called the GA-first algorithm for distinction. Note that we target GA rather than GA-first in this paper because the former is more widely used in the literature. Although researchers have put dedicated efforts into TCP and have proposed a large number of TCP techniques since then, the GA approach [3] remains one of the most effective strategies in terms of fault-detection rate [7], [8], [10], which is usually measured by the average percentage of faults detected (abbreviated as APFD). In other words, none of the existing TCP techniques can always outperform GA [3] in terms of effectiveness.

1. Without further clarification, the GA algorithm used in this paper refers to the one proposed by Elbaum et al. [3].

Besides effectiveness, time cost is widely recognized as another important issue influencing the application of an approach [11], [12], [13], [14], especially considering the limited available time. In particular, the time cost of TCP, called TCP efficiency in this paper, refers to how much time a TCP approach consumes. As reported, Google [15] runs 800K builds and 150M tests every day (the same tests are run many times). If a TCP approach consumes much more time on prioritization, the time left for test running will be reduced to a large extent.


Furthermore, software modification occurs dramatically frequently, so that regression testing consumes about 80% of the testing cost [16]. For example, Google developers modify source code one time per second on average [15]. To improve the efficiency of regression testing, it is necessary to apply TCP more than once, because frequent code modification may hamper the effectiveness of TCP [17]. That is, considering the practical application of TCP, including the GA algorithm, both effectiveness and efficiency are important.

However, existing TCP approaches, including the GA algorithm, suffer from the efficiency problem; e.g., previous work shows that most existing TCP approaches cannot deal with large-scale application scenarios [13], [15], [18]. Furthermore, some work [13], [15], [18] points out that the GA algorithm spends a dramatically long time on prioritization. Note that in the 20-year history of GA, no approach has been proposed to improve its efficiency while preserving the high effectiveness.

In this paper, we make the first attempt to accelerate the GA algorithm and maintain the effectiveness. In particular, we analyze the efficiency problem of the GA algorithm and propose to accelerate the GA algorithm through two enhancements. The proposed algorithm is called the Accelerated Greedy Additional (abbreviated as AGA) algorithm. First, many redundant data accesses occur during prioritization in GA. Whenever a test case is selected, the GA algorithm scans the coverage information of all test cases to mark elements covered by this selected test case and calculates the number of unmarked elements covered by each unselected test case. Such scanning is inefficient and may contain many redundant data accesses. Therefore, we design some extra data structures (e.g., indices) to summarize the coverage information of each test case in the AGA algorithm. Suppose m, n, and k are the number of test cases, the number of elements, and the number of iterations of the GA strategy (called the iteration number in this paper), respectively; given n > m (which is true in most cases), the time complexity of our AGA algorithm is O(kmn), while the time complexity of the GA algorithm is O(m²n). The value of k determines to what extent the former is superior to the latter. In practice, k is usually much smaller than m, and in our approach, k is fixed as a constant (by the second part below), so our O(kmn) is superior to O(m²n).

Second, the GA algorithm proposed by Elbaum et al. [3] repeats the GA strategy multiple times in TCP, and thus the iteration number is usually larger than 1. Intuitively, when an element has been covered enough times, the probability that it still contains faults is low, so the remaining iterations may not contribute to the effectiveness but only decrease TCP efficiency. Therefore, we investigated this relation empirically and applied it to modify the GA algorithm to improve efficiency while preserving effectiveness. To sum up, our AGA algorithm consists of two parts, time complexity reduction and iteration number reduction. Note that the theoretical improvement is rather important and gives clear assurance of high efficiency under any situation (especially in the first part of AGA). Also, our simple technique with theoretical improvement is meaningful in practice and can illustrate the simple nature of the problem.

We conducted controlled experiments using 55 open-source projects from GitHub (whose total lines of code are from 1,621 to 177,546). Because the algorithm input (program coverage) has two kinds of format, adjacency matrix and adjacency list, we conduct our experiments on both of them, which is discussed in Section 2.1. In the experiments, we studied the contributions of the two parts of AGA separately, and found that both of them improve the efficiency to a large extent. Furthermore, we investigated the effectiveness and efficiency of AGA by comparing it with GA. The results showed that on average the speedup ratio of AGA over GA is 5.95X and 27.72X on the two input formats, which is a very large improvement. We also find that the average APFD of AGA and GA is the same, and Analysis of Covariance (ANCOVA) [19] shows no significant difference between them. Moreover, the effect size (Cohen's d) also indicates a small effect.

We also empirically compared AGA with FAST [18], which focuses on the TCP efficiency problem. As FAST [18] targets a different problem, improving the time efficiency by sacrificing effectiveness, such a comparison in terms of efficiency may be a bit unfair to our AGA approach. Surprisingly, the results showed that the average speedup ratio of AGA over FAST is 4.29X (with significant difference and medium effect), which means AGA even outperforms the technique that sacrifices effectiveness to achieve high efficiency. Also, the average APFD difference by which AGA exceeds FAST is 0.1702, and ANCOVA shows that the difference is statistically significant. Moreover, the effect size (Cohen's d) also indicates a huge effect.

We further performed an industrial case study at Baidu, a famous Internet service provider with over 600M monthly active users. In particular, we compared the performance of AGA and GA on 22 subjects of Baidu. In this industrial case study, the average speedup ratio of AGA over GA is 44.27X and 61.43X on the two input formats, which indicates the usefulness of AGA in real-world large-scale scenarios. Also, AGA is faster than FAST on all 22 subjects and achieves a 4.58X speedup ratio on average, and the difference is statistically significant with a very large effect. Due to commercial constraints, we cannot access the source code of these projects, and the developers at Baidu also do not record the fault positions in the history, which are necessary to calculate the APFD results. So, we did not compare the effectiveness of these approaches in this study.

The contributions of this work are summarized as below.
• The first attempt to improve the efficiency of GA while preserving its effectiveness, since GA is believed to have high effectiveness. In particular, we resolve the efficiency issue of GA through theoretical improvement, which gives clear assurance of high efficiency under any situation.
• An approach to accelerating the widely-known GA algorithm through two parts, time complexity reduction and iteration number reduction. With the former, the complexity is reduced from O(m²n) to O(kmn) given n > m, which is theoretically proved; with the latter, the corresponding AGA algorithm is more efficient and can be as competitive as GA regarding effectiveness, which is empirically shown. In fact, although it seems like an easy-to-implement algorithm, in the broad literature, nobody had realized this optimization and the subsequent reduction of complexity. Therefore, this paper is the first to systematically analyze this problem and to propose and evaluate the optimization approach, which is helpful for the community.

• Large-scale experiments on 55 open-source projects demonstrating the effectiveness and efficiency of our AGA approach, compared with the GA algorithm.
• An empirical comparison of AGA with FAST, which improves time efficiency but decreases effectiveness.
• An industrial case study on 22 subjects from Baidu, which indicates the practical usage of AGA in real-world scenarios.

2 TIME COMPLEXITY REDUCTION

In this section, we review the Greedy Additional (GA) algorithm by an example (in Section 2.1). By analyzing its time complexity (in Section 2.2), we propose to accelerate GA through extra-defined data structures (in Section 2.3). Such modification improves the efficiency of GA so that the time complexity becomes O(kmn) (given n > m), whereas the complexity of GA is O(m²n), where n is the number of program elements (e.g., statements, branches, methods) covered by the test suite, m is the number of test cases in the test suite, and k is the iteration number.

2.1 Example

Table 1 presents an example showing the coverage information of a test suite. This test suite consists of five test cases (i.e., T1, T2, ..., and T5), and it covers five program elements (i.e., E1, E2, ..., and E5). A common representation form of coverage information is the adjacency matrix, which is shown in Table 1(a): "✓" represents that the test case covers the corresponding program element, while "×" represents the opposite. Another representation form of coverage information is the adjacency list, which is shown in Table 1(b). In our example, the two forms represent totally the same information.

Table 1: An Example

(a) Adjacency Matrix

              E1   E2   E3   E4   E5
  T1          ✓    ✓    ✓    ×    ×
  T2          ×    ×    ✓    ✓    ✓
  T3          ✓    ✓    ×    ×    ×
  T4          ×    ×    ✓    ✓    ×
  T5          ×    ×    ×    ×    ✓

(b) Adjacency List

  Test Cases   Covered Elements
  T1           E1, E2, E3
  T2           E3, E4, E5
  T3           E1, E2
  T4           E3, E4
  T5           E5

If we take the adjacency matrix as input, the GA algorithm runs as follows. First, no element has been covered yet, and the algorithm scans the whole table to calculate the number of elements covered by each test case. Then it chooses T1 or T2, since both of them cover the most elements. Suppose this algorithm chooses T1; then T2, T3, T4, and T5 remain unselected. As the selected test case T1 covers elements E1, E2, and E3, the remaining elements E4 and E5 are still uncovered. The algorithm scans the whole table again to find that T2, T3, T4, and T5 cover 2, 0, 1, and 1 of the 2 uncovered elements, respectively. So, the GA algorithm chooses T2 as the next test case. Now, all elements have been covered, and the GA algorithm [3] starts another iteration by resetting all elements to "uncovered". Finally, the test execution sequence produced by the GA algorithm is "T1, T2, T3, T4, T5". On the other hand, provided the adjacency list as input, GA runs similarly and produces the same output.

2.2 Analysis of the GA Algorithm

In this section, we analyze the time complexity of the GA algorithm through its general implementation. Suppose the coverage information is recorded in a table like Table 1(a). The GA algorithm first scans the whole table to find the line with the most "✓" entries and selects the corresponding test case into the prioritized sequence. When a test case is selected and added to the sequence, the GA algorithm scans the whole table to find the "✓"s whose corresponding element is covered by the latest selected test case; these "✓"s are replaced by "×"s. The GA algorithm repeats the preceding process until all the entries in the table are "×"s or all the test cases have been selected. In the latter case the termination condition is satisfied and the GA algorithm ends by producing a prioritized test suite; otherwise, GA reuses the initial table, replacing "✓"s with "×"s for each already selected test case, and repeats the preceding process again.

Suppose there are m test cases in the given test suite to be prioritized and n program elements covered by the test suite. The GA algorithm needs to scan the whole table m times, and thus the time complexity is O(m²n), as shown by previous work [3], [7], [8]. However, many accesses of the table are redundant. First and most importantly, every time the coverage table is updated, the GA algorithm recalculates the total "✓" entries of each unselected test case, without reusing the previous calculation. Second, none of the accesses to "×"s in the table is necessary, because the GA algorithm never updates them in the process. Third, in order to find the elements covered by the latest selected test case, the GA algorithm scans all elements in the table, which is also unnecessary. Let us illustrate the preceding redundant accesses with the example. When T1 is selected first, the GA algorithm scans Row T1 and finds three "✓"s. Among the five accesses (i.e., E1, E2, ..., E5), the accesses of E4 and E5 are redundant. Then, the GA algorithm changes the state of E1, E2, and E3 in the other four test cases from "✓" to "×". During this process, it is also not necessary to access the state "×". Then, the GA algorithm scans the whole table to select the next test case, but this process can be optimized by analyzing the updated columns and the previous calculation of the total number of "✓"s of each test case. To sum up, due to such a large number of redundant accesses in the GA algorithm, it is possible to reduce its time cost and improve its efficiency.
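To make the repeated scanning concrete, the following is a minimal Python sketch of the general GA implementation described above. It is our own rendering for illustration, not the authors' released code; the function and variable names are ours. Every selection rescans the whole m-by-n table, which is exactly the O(m²n) behavior analyzed here.

def ga_prioritize(matrix):
    """Greedy Additional (GA) over a coverage matrix.

    matrix[i][j] is True iff test case i covers element j. Every
    selection rescans the whole m-by-n table, hence O(m^2 * n).
    """
    m, n = len(matrix), len(matrix[0])
    selected, covered, order = [False] * m, [False] * n, []
    while len(order) < m:
        best, best_gain = -1, -1
        for i in range(m):                      # scan all test cases ...
            if selected[i]:
                continue
            gain = sum(1 for j in range(n)      # ... and all elements
                       if matrix[i][j] and not covered[j])
            if gain > best_gain:                # '>' keeps the topmost on ties
                best, best_gain = i, gain
        if best_gain == 0 and any(covered):
            covered = [False] * n               # all elements covered:
            continue                            # reset and start a new iteration
        selected[best] = True
        order.append(best)
        for j in range(n):
            if matrix[best][j]:
                covered[j] = True
    return order

On the Table 1 data, this sketch reproduces the order "T1, T2, T3, T4, T5" derived above.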
If we take the adjacency list as input, a similar analysis can be done.

First, the accesses of "×"s to find the covered elements in one row are avoided, while more time is spent on finding all test cases that cover a specific element (through scanning of the whole list). As a result, the overall time complexity remains O(m²n). Second, many accesses of the list are redundant, too. Following our previous analysis, the time efficiency can be improved by reducing these unnecessary operations.

2.3 Improvement of Time Complexity

To reduce such redundant accesses, we propose the AGA_C approach, which defines extra data structures that reuse previous information collected during execution. In particular, we use a list to record the total number of elements covered by each test case and dynamically update it during prioritization, in order to alleviate the scanning of the coverage table. We also use forward and inverted indices to save the data accesses of "×" entries in the table.

Algorithm 1: The AGA_C algorithm
  Input: Coverage information M
  Output: Prioritized test cases P
  1   Initialize TC, HS, HC, FI, and II from M; t = 0
  2   Set P as an empty list
  3   while t < m do
  4       Find the largest value in TC whose corresponding test case t has not been selected (with the use of HS)
  5       if no test case can be selected then
  6           Restore TC to its original value
  7           Restore HC to its original value
  8           continue
  9       end
  10      Add t into P
  11      Mark t as selected in HS
  12      forall j in FI[t] do
  13          if HC[j] is "uncovered" then
  14              Mark j as "covered" in HC
  15              forall i in II[j] do
  16                  Decrease TC[i] by 1
  17              end
  18          end
  19      end
  20      t = t + 1
  21  end
  22  Return P

Our AGA_C algorithm is shown in Algorithm 1. Line 1 initializes several data structures. TC is a list of length m recording the number of elements covered by each test case. In our example, TC is [3, 3, 2, 2, 1] from Table 1 by definition. HS is a list of length m recording whether each test case has been selected. HC is a list of length n recording whether each element has been covered by previous test cases. FI are forward indices that index all elements covered by each test case, while II are inverted indices that index all test cases that cover each element. From Table 1, in our example, FI records that T1 covers [E1, E2, E3], T2 covers [E3, E4, E5], etc. II records that E1 is covered by [T1, T3], E2 is covered by [T1, T3], etc. Line 2 initializes P as the empty list. Then, from Line 3 to Line 21, the algorithm selects m test cases in turn. First, it chooses the largest value in TC whose test case t is marked unselected in HS. The algorithm adds t to the prioritized list P and marks it in HS. In our example, in the first loop, T1 is selected (since it covers the most program elements), marked in HS, and added to P. Then, for every element j in FI[t] that is marked uncovered in HC, the algorithm marks it as covered, and for every test case i in II[j], the algorithm decreases TC[i] by 1. In our example, in the first loop, E1, E2, and E3 are marked covered and the updated TC is [0, 2, 0, 1, 1]. Finally, the algorithm continues to select the next test case by repeating the process. As shown from Line 5 to Line 9, if all elements have been covered by the selected test cases, the algorithm completes the current iteration and restores the original TC to start the next iteration. In our example, after T1 and T2 are selected, the original TC is restored. The total number of iterations is called the iteration number.

Furthermore, we analyze the time complexity of our AGA_C algorithm. All initialization operations consume O(mn) time. Each calculation of the maximum value in TC consumes O(m) time, which leads to O(m²) time in total. The number of times TC is updated equals the number of element entries in FI (also equal to the number of test-case entries in II) in an iteration, which is the number of "✓" entries in the coverage matrix. So, in each iteration, the algorithm updates TC up to O(mn) times, and the total time for updating TC is O(kmn), where k is the iteration number. Generally speaking, the number of elements is often larger than the number of test cases, which means n > m. So, according to the definition of Big-O notation, the total time complexity O(kmn + m²) can be simplified as O(kmn + m²) = O(kmn + mn) = O((k + 1)mn) = O(kmn), where k is the iteration number. Note that in most cases n > m obviously holds, which can also be verified by the subject statistics in this paper (given by Table 6). For other special cases, the original time complexity O(kmn + m²) is still a large improvement.

In addition, in our algorithm, we use more storage space to maintain the extra data structures in order to improve the time complexity. So, it is necessary to analyze the space complexity, too. In GA, the coverage information (adjacency matrix/list) takes O(mn) space, and additional O(1) space is used to store temporary variables in the algorithm, which means the overall space complexity of GA is O(mn). In AGA, the same O(mn) space is used to store the coverage table, while TC, HS, HC, FI, and II need O(m), O(m), O(n), O(mn), and O(mn) space, respectively. So, the overall space complexity of AGA is O(mn), which is the same as GA, with the only difference lying in the constant factor.
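As an illustration of how the index structures remove the redundant accesses, below is a compact Python sketch of Algorithm 1. It is our own rendering with our own naming (fi, ii, tc, hs, hc stand for FI, II, TC, HS, HC); the authors' released artifact may differ in detail. Only the elements in FI[t] and the test cases in II[j] are touched when a test case is selected, which is where the O(kmn) bound comes from; the max over TC contributes the O(m²) term.

def aga_c_prioritize(fi):
    """AGA_C (Algorithm 1) sketch: GA with index structures.

    fi is the forward index: fi[i] lists the elements covered by
    test i (i.e., the adjacency-list input).
    """
    m = len(fi)
    ii = {}                                    # inverted index: element -> tests
    for i, elems in enumerate(fi):
        for j in elems:
            ii.setdefault(j, []).append(i)
    tc_orig = [len(elems) for elems in fi]     # TC: uncovered elements per test
    tc = list(tc_orig)
    hs = [False] * m                           # HS: test already selected?
    hc = dict.fromkeys(ii, False)              # HC: element already covered?
    order = []
    while len(order) < m:
        # largest TC among unselected tests; the topmost test wins ties
        best = max((i for i in range(m) if not hs[i]),
                   key=lambda i: (tc[i], -i))
        if tc[best] == 0 and any(hc.values()):
            tc = list(tc_orig)                 # restore TC and HC:
            hc = dict.fromkeys(hc, False)      # start the next iteration
            continue
        hs[best] = True
        order.append(best)
        for j in fi[best]:                     # only the covered elements ...
            if not hc[j]:
                hc[j] = True
                for i in ii[j]:                # ... and only the tests touching them
                    tc[i] -= 1
    return order

# Table 1 as an adjacency list (T1..T5 over E1..E5, zero-indexed):
fi = [[0, 1, 2], [2, 3, 4], [0, 1], [2, 3], [4]]
print(aga_c_prioritize(fi))                    # [0, 1, 2, 3, 4], i.e., T1..T5

Running the sketch on the Table 1 data reproduces the walkthrough above, including the update of TC to [0, 2, 0, 1, 1] after T1 is selected.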
3 ITERATION NUMBER REDUCTION

From Section 2, we obtain a new approach with time complexity O(kmn), where k is the iteration number. In practice, k is often much smaller than m in most projects, because usually many test cases are needed to cover all elements in an iteration. However, in the worst case, k may be equal to m, indicating that the worst-case time complexity of our AGA algorithm is still the same as that of the GA algorithm.

To further improve the efficiency of the GA algorithm, especially in the worst case, in this section we discuss the impact of the iteration number and introduce another modification adopted in our AGA algorithm. Finally, we present an experiment to evaluate the impact of the iteration number on the GA algorithm.

3.1 Modification with Iteration Number Reduction

Let us re-examine the definition of "an iteration". In this paper, the process of selecting some test cases from covering 0 elements to covering all possible elements, and then resetting them to be "uncovered", is called "an iteration". Intuitively, the iteration number may have a large impact on the time cost of the GA algorithm. The difference between the GA-first algorithm [5] and the GA algorithm [3] also indicates the influence of such an iteration number. Moreover, between these two algorithms, there exist many other potential algorithms, depending on how many times the GA strategy is used (i.e., the iteration number of the GA strategy) and what strategy is used to deal with the remaining unselected test cases (e.g., the Greedy Total strategy, which schedules test cases based on the descending order of the number of total covered program elements).

n = k × l    (1)

Here, we define the average number of test cases selected in one iteration as l, so we can deduce Formula (1). According to Formula (1), if a large number of test cases are selected in one iteration, the total iteration number of this project is small; if few test cases are selected in one iteration, the total iteration number of this project is large. As our goal is to improve efficiency while preserving effectiveness, projects with a small iteration number are already efficient enough, and the time complexity O(kmn) can be reduced to O(mn). For those projects with a large iteration number, it is necessary to optimize the iteration number to some extent.

In fact, every time a program element is covered, the probability that it still contains faults decreases. After many iterations, all elements have been covered enough times. On one hand, if all faults have been revealed after these iterations, the remaining iterations are useless for detecting faults and only increase the time cost. On the other hand, if several faults still remain after many iterations, they are supposed to be hard to reveal, and the remaining iterations may only reveal them by chance, intuitively. So, we conjecture that after some iterations, the effectiveness of GA just fluctuates along with the remaining iterations.

Based on the above reasoning, we introduce another component of the proposed AGA algorithm, AGA_I. AGA_I reduces the time cost by reducing the iteration number. Different from the GA algorithm, AGA_I does not repeat applying the GA strategy until all the test cases are prioritized, but stops when the specified iteration number is reached. Regarding the remaining unselected test cases, AGA_I applies other less costly techniques (e.g., the Greedy Total technique (GT) [5], which is usually used in previous work and also in this paper). Take Table 1 as an example: the original iteration number is 2. If we reduce it to 1, AGA_I does not repeat the additional strategy after selecting T1 and T2, and prioritizes the remaining test cases using GT.
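The stopping rule can be sketched as follows. This is again our own illustration with hypothetical names (aga_i_prioritize, max_iterations); for clarity, the inner loop uses naive rescanning rather than the index structures of Section 2.3, which the full AGA algorithm would use.

def greedy_total(fi, remaining):
    """Greedy Total (GT): descending total coverage; the stable sort
    keeps the topmost test on ties."""
    return sorted(remaining, key=lambda i: -len(fi[i]))

def aga_i_prioritize(fi, max_iterations=10):
    """AGA_I sketch: run the GA strategy for at most max_iterations
    iterations, then order the remaining tests by Greedy Total."""
    m = len(fi)
    order, hs = [], [False] * m
    for _ in range(max_iterations):
        covered = set()                        # fresh "uncovered" state
        while True:
            gains = {i: len(set(fi[i]) - covered)
                     for i in range(m) if not hs[i]}
            if not gains:
                return order                   # every test already selected
            best = min(gains, key=lambda i: (-gains[i], i))
            if gains[best] == 0:
                break                          # iteration done: all covered
            hs[best] = True
            order.append(best)
            covered.update(fi[best])
    remaining = [i for i in range(m) if not hs[i]]
    return order + greedy_total(fi, remaining)

On the Table 1 data with max_iterations = 1, this yields T1 and T2 by the GA strategy and then T3, T4, T5 by GT, matching the example above.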
3.2 Experiment

We conjecture that AGA_I does not influence the effectiveness (e.g., APFD) much but can improve efficiency (i.e., time cost) a lot. To verify our conjecture, we design an experiment to investigate how the iteration number impacts TCP in terms of both effectiveness and efficiency.

Specifically, we use the same setup as the comprehensive experiments in Section 5. More details about the subjects, faults, implementation and supporting tools, and measurement are given in Section 5.

We applied the GA algorithm to all subjects and recorded, for each project, the total number of iterations in which the GA strategy is applied during the process, which is denoted as k. Then we applied to each project k modified GA algorithms, each of which is denoted as algorithm algo_i (1 ≤ i ≤ k), recording their APFD values and time spent during prioritization. In particular, algorithm algo_i repeats the GA strategy i times and prioritizes the remaining unselected test cases by the Greedy Total strategy [5]. Note that algorithm algo_1 is actually the GA-first algorithm, whereas algorithm algo_k is actually the GA algorithm.

Due to the space limit, we only present some statistics of the experimental results in Table 2, that is, the minimum, maximum, average, and quartiles (Q1, Q2, Q3); the detailed results are given on the website of this project. From Table 2, the average iteration number among all open-source subjects is 29.20. Table 2 also presents the ratio between the time cost of the GA approach and that of the GA-first approach [5]. The big gap between the maximal and minimal time ratios indicates the influence of the iteration number. To better analyze the relationship between iteration number and time cost, we put detailed results in Appendix A, where we draw a line chart of iteration number and time cost for each project. Note that in order to see the trend, we only present the projects whose iteration number is no less than 20 (k ≥ 20). The plots also support our claim that the iteration number contributes much to the time cost. As k is the coefficient of the time complexity, it largely determines the actual efficiency in practice, so we think there is a large space to reduce the time cost.

The last row of Table 2 presents the APFD range of each project over different iteration numbers, that is, the highest APFD value minus the lowest APFD value. From the quartiles, we conclude that although some outliers exist, most of the APFD ranges are very small. Moreover, the average APFD range is only 0.0085 among all open-source subjects, indicating that little fluctuation of APFD occurs as the iteration number varies.

To sum up, we have two main observations. First, along with the increase of the iteration number, the time cost also increases, indicating that the iteration number contributes much to the time cost. Second, the APFD value varies little when the iteration number varies, which means a too-large iteration number contributes little to the APFD value. These two observations also verify our conjectures above.

As discussed above, projects with small iteration numbers are already efficient enough when using AGA_C, so we need to decide a proper reduced iteration number for projects with a large iteration number. In fact, this reduced iteration number is not fixed, which means it can be adjusted for specific usage.

Table 2: Statistics of the Impact of Iteration Number (55 open-source projects)

                                  min.     Q1       Q2       Q3       max.     ave.
  Iteration Number                1        6        10       16.5     679      29.20
  Time_GA / Time_GA-first (*)     1.00     1.21     1.57     1.84     17.56    2.14
  APFD range (**)                 0.0000   0.0004   0.0013   0.0039   0.1328   0.0085

  (*) The time ratio between the GA algorithm and the GA-first algorithm.
  (**) The highest APFD minus the lowest APFD among all iteration numbers.

In this paper, we determine this value based on some heuristics. On one hand, although we conjecture that there is no need to conduct too many iterations to detect faults, we still prefer to choose a relatively high value to ensure the effectiveness. On the other hand, if we assume that every time an element is covered, the probability that it still contains faults decreases to half of the original probability, then, given that the initial probability is 1, we need to cover an element 10 times to reduce the probability to less than 1‰ ((1/2)^10 = 1/1024). As a result, in the remainder of this paper, we implement our AGA approach by using 10 as the reduced iteration number.

Finding: The iteration number has a large influence on the efficiency of the GA algorithm, while it impacts effectiveness little. In this paper we set the iteration number to be 10 in implementing the AGA approach.

Note that this finding is confirmed empirically on our dataset and may be biased considering the diversity of different datasets. However, the constraint on k does reduce the overall time complexity from O(kmn + m²) to O(mn + m²). When n > m, which is general in most cases, the reduction is from O(kmn) to O(mn).

3.3 Discussion on the Chosen Iteration Number

In this paper, we set the iteration number to be 10 in implementing AGA through some heuristics. Here we discuss the influence of this choice. First, we analyzed the APFD results of the GA algorithm with various iteration numbers (i.e., algorithms algo_i (1 ≤ i ≤ k) in Section 6.1). In particular, for each project we recorded the highest APFD value (denoted as APFD_max) among these algorithms, and found the smallest iteration number r whose corresponding APFD value is no smaller than APFD_max × 99%. Surprisingly, the smallest iteration number r for all projects is no larger than 10, which indicates that only several iterations are enough for maintaining the original effectiveness, even in projects with an iteration number up to 679. Second, although we set the iteration number to be 10 in this paper, it may not be the best choice. We respectively applied algo_8, algo_9, algo_10, algo_11, and algo_12 to all projects with k > 8, as in Section 6.1, and found that the gap between the maximum and minimum APFD values of these algorithms is 0.0006 on average, which means that there might be many possible choices of the reduced iteration number in practice. In other words, the value of k in our evaluation is decided by reasoning, but it can take various values, depending on the choices of developers. For example, they can use historical faults or seeded faults to empirically decide the value of k.

4 RESEARCH METHOD

To investigate the performance of our proposed AGA approach, we design comprehensive experiments. In this section, we briefly introduce each component of our experiments and their intentions.
1) The main experiment of this paper is designed to confirm the contributions of our approach, and thus we investigate the improvement of AGA and its components (i.e., AGA_I and AGA_C) over the GA algorithm. In particular, this experiment is conducted on 55 open-source subjects. Details of this experiment are given in Sections 5 and 6. Note that the experiment in Section 3.2 also shares the same setup, and RQ1 complements the experiment in Section 3.2. This part of the experiment can show the superiority of AGA on widely-used open-source subjects.
2) Although we aim to improve the efficiency of GA, we are also curious about how AGA performs compared with other TCP techniques. Specifically, FAST targets the TCP efficiency problem and its goal is close to ours. Therefore, we first compare AGA with FAST, and then with other representative TCP techniques, including ART-D, GA-S, and GE. This experiment is in Section 7. This part of the experiment can show that AGA even outperforms techniques that aim to reduce TCP time cost while sacrificing effectiveness.
3) To show the practical usage of our approach, we conduct an industrial case study at Baidu, which is a famous Internet service provider with over 600M monthly active users. Specifically, we compare AGA with GA, FAST, ART-D, GA-S, and GE, respectively, and the experiment is in Section 8. This part of the experiment can show that AGA also works well in real-world industrial applications, and we received positive feedback from Baidu.

5 EVALUATION DESIGN

We conducted experiments to evaluate our AGA approach. The experiments were performed on a server whose CPU is Intel(R) Xeon(R) E5-2683 2.10GHz with 132GB memory and whose operating system is Ubuntu 16.04.5 LTS. To make a fair comparison of time cost, we conducted all experiments on a single thread without parallel execution.

In order to make our results more reliable and let readers reuse the artefacts, we share our data, analysis scripts, and detailed data tables online. They are publicly available on our website: https://github.com/Spiridempt/AGA, and also on figshare: https://figshare.com/s/cf8cc6ba9259c0e0754d.

5.1 Research Questions

As our AGA approach consists of two parts, time complexity reduction (AGA_C) and iteration number reduction (AGA_I), the first two research questions are to investigate their impacts separately.

Note that the first research question also complements the experiment in Section 3.2. The third research question is designed to investigate the performance of the whole AGA approach by comparing it with the GA algorithm. To investigate the influence of coverage type, the fourth research question is designed to investigate whether AGA can also improve the efficiency of GA with method coverage.
To sum up, this experiment is to answer the following four research questions.
RQ1: How does our reduction of the iteration number perform compared with the GA algorithm in terms of efficiency?
RQ2: How does our reduction of the time complexity perform compared with the GA algorithm in terms of efficiency?
RQ3: How does our AGA approach perform compared with the GA algorithm in terms of effectiveness and efficiency?
RQ4: Can our AGA approach also improve the efficiency when method coverage is used?

5.2 Subjects and Faults

Subjects. In this work, we use 55 open-source projects in total. Among these projects, 33 are widely used in prior work [17], [20], [21]; the others are the most popular subjects selected from GitHub according to the number of stars. Specifically, we target GitHub subjects whose primary programming language is Java and order them according to the number of stars in Jan 2019. Then, we check the first 100 subjects and keep only the ones that are code repositories and for which the required tools (e.g., Maven, Clover, PIT, which are explained in Section 5.3) could work. All the open-source projects used in this work are written in Java, and their number of lines of code ranges from 1,621 to 254,284. Each of these projects has a test suite written in the JUnit Testing Framework. The detailed information is given in Appendix B (Table 6). It is worth noting that compared with the experimental datasets used in recent TCP work [18], [22], [23], our dataset is larger and contains more large-scale projects, which can make our experimental results more reliable and convincing.

Faults. As existing work [24], [25], [26] has demonstrated mutation faults to be suitable for software testing experimentation, and mutation faults are widely used in prior work [7], [17], [22], [27], [28], [29], [30], [31] to evaluate test case prioritization, we use a widely-used mutation testing tool, PIT [32], to generate mutants for all open-source subjects. In particular, for each subject, first, we generate all mutants. Second, we keep the mutants that are killed by at least one failing test case². Third, we construct one mutation group for each subject containing all the remaining mutation faults, which is also consistent with previous work [12].

2. That is, the subject and the mutant produce different outputs on at least one test case.

5.3 Implementation and Supporting Tools

We used Clover [33] to collect code coverage information, including both statement coverage and method coverage, for each open-source subject. In this work, most experiments are conducted on statement coverage because it is the most studied test case prioritization granularity and its low-efficiency problem is severe; in other words, the number of statements is larger than the number of methods and branches. Additionally, we also design a research question to investigate whether AGA still improves efficiency in the scenario of method coverage. The implementation code and all scripts used in this work are written in Python.

In prior work on coverage-based test case prioritization, some work takes the adjacency matrix as input [12], while other work uses the adjacency list [18]. In this work, in order to make a more general comparison, on one hand, we utilize the GA implementation in [18], which is a relatively efficient implementation and uses the adjacency list as input, and we implement AGA based on the adjacency list. On the other hand, we implement GA and AGA based on the adjacency matrix, too. Due to the space limit, in the experimental results, we only report the results based on the adjacency list [18], which can be more reliable; the detailed results based on the adjacency matrix are put on the website.

It is worth mentioning that in our experiments, when ties happen (i.e., more than one test case has the same number of covered elements), AGA/GA selects the topmost test case in the test list (given by developers).

5.4 Compared Prioritization Approaches

Besides the proposed AGA approach and the GA approach [3], in this study we also implemented the GA-first approach proposed by Rothermel et al. [5]. The GA-first approach [5] applies the greedy additional strategy only in the first iteration, and deals with the remaining test cases by another prioritization approach, e.g., the Greedy Total approach in this paper, which schedules these test cases based on the descending order of the number of covered program elements.

5.5 Measurement

In this study, similar to existing work [3], [5], we used the Average Percentage of Faults Detected (APFD) to measure the effectiveness of TCP approaches. Formula (2) presents how to calculate the APFD value for a subject with n tests and m faults, where TF_i represents the position in the test suite of the first test case that detects the i-th fault.

APFD = 1 − (TF_1 + TF_2 + ... + TF_m) / (n × m) + 1 / (2n)    (2)

Besides, we used the total time spent during the TCP process to measure the efficiency of a TCP approach. For a fair comparison, we included the preparation time for a TCP approach, i.e., the time spent in constructing the extra data structures in the AGA approach.
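Formula (2) translates directly into code. As a concrete reference, the following is a small Python sketch of the computation (our own helper with assumed input conventions; it is not part of the paper's released tooling):

def apfd(order, faults):
    """APFD per Formula (2).

    order:  prioritized list of test-case ids (length n).
    faults: one set of detecting test-case ids per fault (length m).
    """
    n, m = len(order), len(faults)
    position = {t: idx + 1 for idx, t in enumerate(order)}  # 1-based TF_i
    tf_sum = sum(min(position[t] for t in detecting) for detecting in faults)
    return 1.0 - tf_sum / (n * m) + 1.0 / (2 * n)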

5.6 Threats

The internal threats to validity mainly lie in the implementation of the studied approaches and the scripts used in the experiments. To reduce this threat, the first two authors reviewed all the implementation and scripts used in this work. Also, to improve the reliability of our work, we reuse some implementation code from previous work [18] to reduce the threats.

The external threats to validity mainly lie in the subjects and faults. To reduce the former threat, we used 55 widely used open-source subjects in our study, which consist of 33 previously used subjects [17], [20] and 22 popular subjects selected from GitHub. At the same time, as AGA is a general approach, it is not biased towards the chosen projects. Note that because the second part of our approach (iteration number reduction) is empirically verified on our dataset, the large dataset itself also addresses the threat that our approach may be biased. Also, some prior work [34], [35] shows that the relative performance of different test case prioritization techniques on mutation faults may not strongly correlate with the performance on real faults, depending upon the attributes of the studied subjects, but we follow the common practice of using mutation faults for open-source projects, following the preceding TCP work [7], [17], [22], [27], [28], [29], [30], [31]. Additionally, to complement this experiment, in Section 7 we also evaluate our approach on real faults. In the future, we plan to conduct an extensive study using more projects with more real faults. In addition, in this paper, we only target the GA algorithm and compare AGA with it. On one hand, it is widely accepted that GA remains one of the most effective strategies in terms of fault-detection rate [7], [8], [10]. On the other hand, the results of a recent work [12] show that other black-box techniques that do not use coverage information (e.g., [36], [37]) are often less effective than GA. At the same time, we also design another experiment in Section 7 to compare AGA with other representative prioritization techniques. Additionally, most of our experiments are conducted on statement coverage because of its wide usage and severe low-efficiency problem. In fact, our analysis of AGA is regardless of the scale of the coverage matrix, and our theoretical improvement is general for all types of coverage. We also include RQ4 to empirically verify our improvement on method coverage. Another minor threat is induced by the diversity of the used subjects, which may lead to misleading statistics in our results. To address this threat, besides reporting the mean and median values, we also draw violin plots to learn the data distribution, which are shown on our website.

6 RESULTS AND ANALYSIS

In this section, we analyze the experimental results on open-source projects and answer the four research questions.

6.1 RQ1: Efficiency of Iteration Number Reduction

In this section, we further investigate the efficiency improvement of the iteration number reduction. According to Section 3.2, we implement our approach with iteration number reduction alone by setting k = 10 and call this implementation AGA_I. In other words, in this subsection, we assess the contribution of iteration number reduction alone (without the time complexity reduction).

The results on the 55 open-source projects are given in Table 7 (Appendix C)³, where the projects are sorted in ascending order of source lines of code (SLOC) and the first two columns present the results for RQ1. Time_GA presents the time cost of the GA approach, whereas Time_I represents that of AGA_I. The speedup ratio of AGA_I over GA is 1.08X. It is apparent that most subjects have a small iteration number in GA (less than or slightly more than 10). Therefore, AGA_I does not improve the efficiency much for them. However, for those subjects with a large iteration number, AGA_I could reduce their time cost.

3. Due to the space limit, we put the results of several research questions into one table and put the table in Appendix C.

To statistically check the differences between AGA_I and GA, we adopt hypothesis testing. We first use the Shapiro-Wilk test [38] to check the normality of residuals, and the p-values in AGA_I and GA are 9.416 × 10^-16 and 5.239 × 10^-16, which reject the hypothesis that they are normally distributed. Therefore, we need to adopt a non-parametric test. As we need to include project size as a control variable, the Wilcoxon rank sum test [39] cannot be used. We resort to proportional odds regression [40], which is a class of generalized linear models and is equivalent to the Wilcoxon rank sum test when there is a single binary covariate. We introduce a variable "group" representing AGA_I and GA and take project size as a control variable. The results show that the p-value of "group" is 1.380 × 10^-6, indicating a significant difference between AGA_I and GA, and the effect size (Cohen's d [41]) is 0.274 (medium effect). Here, because statistical tests of normality (e.g., the Shapiro-Wilk test) might be impacted by characteristics of the data, we additionally draw the normal probability plots and put them on our website. Note that this applies to all normality checks in the remainder of the paper.
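For readers who want to replicate this kind of analysis, the two building blocks used repeatedly in this section — the Shapiro-Wilk normality check and Cohen's d — can be sketched as follows. This is our own sketch using SciPy/NumPy, not the paper's released scripts; the proportional odds regression with a size covariate would additionally require an ordinal-regression package and is omitted here.

import numpy as np
from scipy import stats

def shapiro_p(residuals):
    """p-value of the Shapiro-Wilk normality test; a small value
    (e.g., < 0.05) argues for a non-parametric comparison."""
    return stats.shapiro(residuals).pvalue

def cohens_d(a, b):
    """Cohen's d effect size with a pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled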

6.2 RQ2: Efficiency of Time Complexity Reduction

The extra data structures defined in our AGA_C approach do not affect the prioritization results, but reduce the time complexity of prioritization. In this section, we compared AGA_C with GA only in terms of time cost. Note that we did not implement AGA_I in this research question.

The results are given by the first five columns (except the third column) of Table 7 (Appendix C), where Time_GA presents the time cost of the GA approach and Time_C represents that of AGA_C. Moreover, we mark the results of Time_C with X only if Time_C < Time_GA. The last row summarizes the total number of subjects where AGA_C outperforms the GA approach. The results show that in most projects (48 out of 55 open-source subjects), the time cost of AGA_C is lower than that of the GA approach [3], which confirms our previous theoretical analysis in Section 2. As we can see, in smaller subjects, the differences between GA and AGA_C are very small, which may be caused by precision errors resulting from calculation or the operating system. In larger subjects, their differences are very large, which indicates the efficiency of AGA_C.

In order to make our experiments comprehensive, we compared AGA_C with the GA-first approach, whose time cost is given by the fourth column, Time_GAF, of Table 7 (Appendix C). In 36 open-source subjects, AGA_C is even more efficient than the GA-first approach, which applies the time-consuming additional strategy for only one iteration.

In general, the efficiency improvement of AGA_C is usually very large. In particular, if we define Time_GA/Time_C as the speedup ratio of AGA_C over GA for a project, the average speedup ratio is 4.37X. As small time costs may yield biased speedup ratios, and also in order to show the performance of AGA in projects with different sizes, we classify all 55 projects into small-size, middle-size, and large-size, according to the SLOC. The small-size projects (S1 to S22 in Table 7 (Appendix C)) all have less than 5,000 SLOC, the middle-size projects (S23 to S41) all have 5,000-20,000 SLOC, and the other large-size projects have more than 20,000 SLOC. The results show that the average speedup ratio in the three categories is 2.16X, 4.65X, and 7.44X, respectively. So, the reduction of time complexity (AGA_C) performs well, especially in projects with large sizes. In order to give a deeper view into the distribution and variation of speedup ratios, we further present the violin plot with an included box plot in Figure 1. The X-axis represents all projects and the projects in the three categories, respectively. We put the violin plots and box plots together to better present the distributions. From the plots, the speedup ratio of large-size projects tends to be slightly larger than that of small-size projects. Moreover, from the plot of large-size projects, several projects have very large speedup ratios because their scale is also large.

[Figure 1: Speedup Ratios Distribution of AGA_C over GA on Open-Source Projects. Violin plots with embedded box plots of the speedup ratio; X-axis categories: Total (n=55), Small-size (n=22), Middle-size (n=19), Large-size (n=14).]

To statistically check the differences between AGA_C and GA, we perform hypothesis testing similar to the above. We first use the Shapiro-Wilk test [38] to check the normality of residuals, and the p-values in AGA_C and GA are 4.207 × 10^-15 and 5.239 × 10^-16, which reject the hypothesis that they are normally distributed. We also use proportional odds regression [40] and include project size as a control variable. The results show that the p-value of "group" is 0.038, indicating a significant difference between AGA_C and GA, and the effect size (Cohen's d [41]) is 0.234 (medium effect).

Besides, we also calculate the speedup ratios of AGA_C over GA-first for a more complete comparison. The average speedup ratio is 3.01X, and the average speedup ratio in the three categories is 1.26X, 3.31X, and 5.35X, respectively. This shows our AGA approach is also superior to GA-first.

To statistically check the differences between AGA_C and GA-first, we perform a procedure similar to the above. We first use the Shapiro-Wilk test to check the normality of residuals, and the p-values in AGA_C and GA-first are 4.207 × 10^-15 and 3.828 × 10^-16, which reject the hypothesis that they are normally distributed. We also use proportional odds regression [40] and include project size as a control variable. The results show that the p-value of "group" is 0.399, indicating no significant difference between AGA_C and GA-first, and the effect size (Cohen's d) is 0.208 (medium effect).

Provided the adjacency matrix as input, we also implemented GA and AGA_C, and the detailed results are on our website. Specifically, the average speedup ratio of AGA_C over GA is 24.18X, and the average speedup ratio in the three categories is 5.47X, 28.16X, and 48.19X, respectively.

Besides, the speedup ratios of the AGA_C approach vary a lot in different projects. On 19 open-source subjects, AGA_C is less efficient than GA-first. On the one hand, the iteration numbers of these projects are high, so that AGA_C becomes a bit costly. On the other hand, in the only iteration of GA-first, few test cases are needed to cover all statements and they are selected fast, so that GA-first is efficient on these projects.

To sum up, AGA_C addresses the high-complexity problem of GA well and successfully reduces its time complexity. For any project and any scale of coverage matrix, our approach could improve the efficiency a lot.

Conclusion to RQ2: The time complexity reduction strategy used in our AGA approach demonstrates great efficiency improvement compared to GA. Specifically, the average speedup ratio of AGA_C over GA is 4.37X/24.18X on the two types of input.

6.3 RQ3: Comparison with Greedy Additional Approaches

In this section, we compare the effectiveness and efficiency between the proposed AGA approach and two Greedy Additional approaches (including both GA and GA-first), whose results are given by the first nine columns (except the third and fifth columns) of Table 7 (Appendix C), where APFD_AGA and Time_AGA represent the APFD results and time cost of the AGA approach whose iteration number is set to be 10. Moreover, when the GA approach [3] does not outperform the corresponding AGA approach, i.e., APFD_AGA ≥ APFD_GA or Time_AGA < Time_GA, the corresponding results of the AGA approach are marked with X.

6.3.1 Effectiveness

The proposed AGA approach has the same or better APFD performance as the GA approach in 51 out of 55 open-source subjects, and the average APFD value of AGA is 0.8870, which is the same as GA. On some subjects (e.g., the open-source project whose ID is S44), the AGA approach does not outperform the GA approach, but their APFD difference is usually very small (e.g., 0.0021 for this subject). We also make extra comparisons of AGA and GA-first and find that AGA has the same or better APFD performance as GA-first in 45 out of 55 open-source subjects, and their average APFD values are the same. On 14 projects, neither the AGA approach nor the GA approach outperforms the GA-first approach, but their differences are small.

small number of times being covered should have higher better on large projects. On the other hand, the adjacency
priority, but in later iterations, this information is ignored. lists in some projects are very dense, which takes much time
Moreover, we statistically analyze whether the AGA approach and the Greedy Additional approaches have a significant difference in their APFD values. First, we conduct the Shapiro-Wilk test to check the normality of residuals. The p-values of AGA, GA, and GAF are 0.328, 0.298, and 0.283, indicating we cannot reject the hypothesis that they are normally distributed. Therefore, we can use parametric tests in the following. We use Bartlett's test [42] to check the homogeneity of variance, and the p-value is 0.880, indicating we cannot reject the hypothesis that they have equal variance. Then, as we need to take project size as a control variable (covariate), we use Analysis of Covariance (ANCOVA) [19], a parametric test that works on two or more groups to check whether different groups have the same means. The p-value is 0.641, indicating we cannot reject that they have the same means. Then, pairwise ANCOVA tests show that the p-values of AGA vs. GA, AGA vs. GAF, and GA vs. GAF are 0.981, 0.427, and 0.414. In other words, the probability that AGA is as competitive as GA is more than 98%. Then, we employ Cohen's d [41] to compute the effect size (ES), and the results for AGA vs. GA, AGA vs. GAF, and GA vs. GAF are 0.005, 0.151, and 0.156, which are all small effects. Furthermore, we conduct Tukey's range test [43] to check the 95% confidence intervals for all pairwise differences, and the results are [-0.022, 0.022], [-0.030, 0.015], and [-0.030, 0.014].
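For readers who want to reproduce this kind of analysis, the following Python sketch outlines the same pipeline with SciPy and statsmodels. It is a minimal illustration under assumed data (a hypothetical apfd_results.csv with columns apfd, group, and size), not the authors' scripts.

    import pandas as pd
    import statsmodels.api as sm
    from scipy.stats import shapiro, bartlett
    from statsmodels.formula.api import ols
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Hypothetical input: one row per (approach, project) pair.
    df = pd.read_csv("apfd_results.csv")   # columns: apfd, group, size

    # ANCOVA model with project size as the covariate [19]
    model = ols("apfd ~ C(group) + size", data=df).fit()

    # 1. Shapiro-Wilk test [38] on the model residuals
    print(shapiro(model.resid).pvalue)     # > 0.05: cannot reject normality

    # 2. Bartlett's test [42] for homogeneity of variance across groups
    samples = [g["apfd"].to_numpy() for _, g in df.groupby("group")]
    print(bartlett(*samples).pvalue)

    # 3. ANCOVA table: does 'group' matter once size is controlled for?
    print(sm.stats.anova_lm(model, typ=2))

    # 4. Tukey's range test [43]: pairwise 95% confidence intervals
    print(pairwise_tukeyhsd(df["apfd"], df["group"]))

    # If normality were rejected, the paper instead fits a proportional
    # odds regression [40] with 'group' and project size as predictors.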
6.3.2 Efficiency

According to Table 7 (Appendix C), in almost all subjects (i.e., 44 out of 55), the time cost of AGA is much lower than that of the GA approach. On average, the speedup ratio of AGA over GA is 5.95X. Moreover, the speedup ratios in small-size, middle-size, and large-size projects are 2.26X, 6.69X, and 10.76X, respectively. To learn the distribution of speedup ratios in small-size, middle-size, and large-size projects, we also present the violin plot with included box plot in Figure 2. From this figure, most middle-size and large-size projects achieve higher speedup ratios than small-size projects. Moreover, AGA achieves very large speedup ratios on some large-size projects. So, AGA scales up well in large-size projects. Furthermore, we compared the time cost of the AGA approach with the GA-first approach, which requires less time than the GA approach, and find that the AGA approach even outperforms the GA-first approach in 37 open-source subjects. The average speedup ratio is 3.95X, and the average speedup ratios in the three categories are 1.36X, 4.39X, and 7.44X, respectively. Here, we notice that the speedup ratio of AGA over other approaches is sometimes less than 1 (e.g., S3, S4, S7). In fact, the overall time complexity analysis is meaningful only when the parameters are large enough. In our dataset, some projects have a relatively small m value. In this case, although O(mn) seems small, the constant factor is not negligible compared to m. In other words, the preliminary data structure setup consumes much time and impacts the overall running time in some cases. This is also consistent with the empirical results that AGA performs better on large projects. On the other hand, the adjacency lists in some projects are very dense, which takes much time in the preparation of the data structure and further leads to a large constant factor. For example, S7 and S42 have relatively small m values (45 and 34) and dense adjacency lists.

[Figure 2: Speedup Ratios Distribution of AGA over GA on Open-Source Projects. Violin plots with embedded box plots for Total (n=55), Small-size (n=22), Middle-size (n=19), and Large-size (n=14); y-axis: Speedup_ratio.]
Provided the adjacency matrix as input, we also implemented GA and AGA. The average speedup ratio of AGA over GA is 27.72X, and the average speedup ratios in the three categories are 5.84X, 35.47X, and 51.59X, respectively.

To sum up, not surprisingly, the speedup ratio of AGA is higher than that of AGA_C and AGA_I alone. After combining AGA_I and AGA_C, our whole AGA approach obtains more efficient results while preserving high effectiveness. At the same time, the proposed AGA approach is demonstrated to be efficient especially on large-scale projects. In fact, the surprisingly high efficiency of the AGA approach also indicates that many redundant accesses of data exist, and that they are ubiquitous in most projects.

Conclusion to RQ3: The AGA approach requires much less time in prioritization than the GA approach, and the average speedup ratio is 5.95X and 27.72X on the two types of input. Also, AGA is as competitive as the latter in terms of APFD values (with no significant difference). This means that we achieve our goal in this paper and it has promising use in practice.
6.4 RQ4: Performance on Method Coverage

In the previous research questions, we focus on statement-level coverage because it is the most studied coverage criterion and its low-efficiency problem is more severe than other granularities. In this section, we collect the method-level coverage for each of our 55 subjects and compare the efficiency of AGA and GA. The results are shown in Table 3. For each subject, we report the running time (in seconds) of GA and AGA.

According to Table 3, in almost all subjects, the time cost of AGA is much lower than that of GA.
On average, the speedup ratio of AGA over GA is 6.02X. Moreover, the speedup ratios in small-size, middle-size, and large-size projects are 2.28X, 7.32X, and 10.13X, respectively. Compared to the results on statement coverage in Section 6.3, the speedup ratios are almost the same for all projects and for projects of different sizes. This confirms that AGA also works well on method coverage.

In fact, the complexity analysis of GA and our AGA approach is based on a general (0,1) matrix, regardless of the meaning behind it. In other words, the type of program element (e.g., statement, method) does not affect any aspect of AGA, which means our approach works on any coverage criterion and has a stable improvement.
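To make the two input formats concrete, the toy sketch below shows the same coverage relation once as a (0,1) adjacency matrix and once as adjacency lists; the representation changes the constant factors but not the algorithm. This is our illustration, and the element ids are hypothetical.

    # Coverage of three test cases over five program elements; whether an
    # element is a statement or a method does not matter to the algorithm.
    matrix = [
        [1, 1, 0, 0, 1],  # t0
        [0, 1, 1, 0, 0],  # t1
        [0, 0, 1, 1, 1],  # t2
    ]

    # The equivalent adjacency-list input: per test, the covered element ids.
    adj_list = [[e for e, bit in enumerate(row) if bit] for row in matrix]
    print(adj_list)  # [[0, 1, 4], [1, 2], [2, 3, 4]]

For sparse coverage, the list form stores one entry per covered pair instead of one cell per (test, element) pair; for dense coverage (e.g., S7 and S42 discussed in Section 6.3.2), building the lists adds noticeable preparation cost.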
Conclusion to RQ4: The AGA approach also works on method-level coverage. Specifically, the average speedup ratio of AGA over GA is 6.02X.

7 EMPIRICAL COMPARISON WITH REPRESENTATIVE PRIORITIZATION TECHNIQUES

In this section, we present an experiment comparing AGA with some representative prioritization techniques. In particular, as FAST targets the TCP efficiency problem and thus is closest to our goal, we first present the comparison study with FAST in Section 7.1. Then we present the comparison study with other representative TCP techniques in Section 7.2.
7.1 Comparison with FAST time cost. As the APFD results and time cost of AGA is
already given by the eighth and ninth columns, we use
In this section, we investigate the performance of AGA
column WinAPFD and column WinTime to show whether
with its most related work FAST [18]. In particular, FAST
APFDAGA ≥ APFDFAST and TimeAGA < TimeFAST ,
is proposed as a TCP approach to address the general
respectively.
TCP efficiency problem by sacrificing the TCP effective-
Regarding to APFD values, the AGA approach is much
ness, and it is shown to be more efficient than other TCP
better than FAST in all subjects. More specifically, the differ-
techniques [18]. Note that there is no other work in the
ences between them are from 0.0456 to 0.3039, and 0.1702 on
literature focusing on the same objective as ours, and thus
average. To statistically check their differences, we follow
we compare AGA against FAST. However, AGA and FAST
the similar procedure as above. We first use Shapiro-Wilk
target at slightly different goals: FAST approach focuses on
test to check the normality of residuals, and the p-value in
the efficiency problem of test prioritization, not specific to
AGA and FAST is 0.328 and 0.137, which cannot reject the
GA approaches. Although FAST targets a different goal, it is
hypothesis that they are normally distributed. Then, taken
still interesting to learn how AGA performs compared with
project size as a control variable, the Analysis of Covariance
FAST in terms of time cost since both AGA and FAST can be
(ANCOVA) shows that p-value < 2 ∗ 10−16 , indicating the
viewed as addressing the efficiency problem. However, as
statistically significant difference between AGA and FAST.
FAST improves efficiency while sacrifices effectiveness, the
Moreover, the effect size (Cohen’s d) is 2.96 (huge effect) and
comparison in terms of time cost is a bit “unfair” for AGA.
Tukey’s range test shows that the 95% confidence interval
In this study, we compare the performance of AGA and
of their difference is [0.149, 0.192]. To sum up, AGA signi-
FAST on both the 55 open-source projects used in Section 5
ficantly outperforms FAST in terms of APFD because FAST
and Defects4J [44], which is the largest real-fault benchmark
algorithms are designed to sacrifice prioritization accuracy
(i.e., a set of projects with reproducible real bugs) widely
to achieve high efficiency by using hash signatures.
used in test case prioritization [35], [45], [46], [47], [48]
Regarding to the time cost, the time cost of AGA outper-
and fault localization [49], [50], [51], [52], [53]. For ease of
forms FAST on 52 out of 55 open-source subjects, and the
understanding, we present the results of the former subjects
speedup ratio of AGA over FAST is 4.29X. To statistically
with seeded faults and the results of the latter subjects with
real faults separately. 4
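To convey the flavor of such similarity-based prioritization, here is a deliberately simplified minhash-style sketch: each test's coverage set is compressed into a fixed-size hash signature, and the next test scheduled is the one that looks least similar to those already selected. This is our own illustrative reconstruction under assumed parameters (16 hash slots, farthest-first selection); FAST's actual algorithms, sampling rules, and parameters differ.

    import random

    random.seed(0)
    NUM_HASHES = 16
    MASK = (1 << 32) - 1
    # One random affine hash per signature slot (a common minhash construction)
    HASHES = [(random.randrange(1, MASK), random.randrange(MASK))
              for _ in range(NUM_HASHES)]

    def signature(covered):
        """Minhash signature of a (non-empty) set of covered-element ids."""
        return [min(((a * e + b) & MASK) for e in covered) for a, b in HASHES]

    def similarity(s1, s2):
        """Estimated Jaccard similarity of two coverage sets."""
        return sum(x == y for x, y in zip(s1, s2)) / NUM_HASHES

    def prioritize(tests):
        """tests: dict test_id -> set of covered elements."""
        sigs = {t: signature(c) for t, c in tests.items()}
        order = [max(tests, key=lambda t: len(tests[t]))]  # widest test first
        remaining = set(tests) - set(order)
        while remaining:
            # Pick the test least similar to the already-selected ones.
            nxt = min(remaining, key=lambda t: max(similarity(sigs[t], sigs[s])
                                                   for s in order))
            order.append(nxt)
            remaining.remove(nxt)
        return order

    print(prioritize({"t0": {0, 1, 4}, "t1": {1, 2}, "t2": {2, 3, 4}}))

The compression into signatures is exactly where information can be lost, which is the source of the effectiveness gap measured below.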
7.1.1 FAST Results on Seeded Faults

The results of FAST are shown in the tenth and twelfth columns of Table 7 (Appendix C). Due to space limit, we do not present the results of all five FAST algorithms, but the largest APFD value and smallest time cost among them for each subject. Note that usually a FAST algorithm cannot achieve both the largest APFD value and the smallest time cost. As the APFD results and time cost of AGA are already given by the eighth and ninth columns, we use column WinAPFD and column WinTime to show whether APFD_AGA ≥ APFD_FAST and Time_AGA < Time_FAST, respectively.

Regarding APFD values, the AGA approach is much better than FAST in all subjects. More specifically, the differences between them range from 0.0456 to 0.3039, and are 0.1702 on average. To statistically check their differences, we follow the similar procedure as above. We first use the Shapiro-Wilk test to check the normality of residuals, and the p-values of AGA and FAST are 0.328 and 0.137, which cannot reject the hypothesis that they are normally distributed. Then, taking project size as a control variable, the Analysis of Covariance (ANCOVA) shows that the p-value is < 2 × 10^-16, indicating a statistically significant difference between AGA and FAST. Moreover, the effect size (Cohen's d) is 2.96 (huge effect), and Tukey's range test shows that the 95% confidence interval of their difference is [0.149, 0.192]. To sum up, AGA significantly outperforms FAST in terms of APFD because FAST algorithms are designed to sacrifice prioritization accuracy to achieve high efficiency by using hash signatures.

Regarding the time cost, AGA outperforms FAST on 52 out of 55 open-source subjects, and the speedup ratio of AGA over FAST is 4.29X.
Table 3: Results of Open-Source Subjects (Method-Level; times in seconds)

Project Time_GA Time_AGA | Project Time_GA Time_AGA | Project Time_GA Time_AGA
S1 0.0030 0.0031 S2 0.0011 0.0010 S3 0.0014 0.0013
S4 0.0021 0.0116 S5 0.0051 0.0014 S6 0.0019 0.0007
S7 0.0007 0.0012 S8 0.0032 0.0028 S9 0.2587 0.0504
S10 0.0145 0.0081 S11 0.0158 0.0080 S12 0.0027 0.0020
S13 0.0068 0.0048 S14 0.0158 0.0076 S15 0.0036 0.0008
S16 0.0106 0.0024 S17 0.0741 0.0146 S18 0.0047 0.0013
S19 0.2907 0.1007 S20 0.0024 0.0023 S21 0.0098 0.0064
S22 0.0111 0.0055 S23 0.0502 0.0147 S24 0.0035 0.0031
S25 1.1046 0.1654 S26 0.0131 0.0043 S27 0.0041 0.0020
S28 0.1945 0.0347 S29 1.2958 0.3499 S30 0.0624 0.0177
S31 0.4390 0.0495 S32 0.3495 0.0589 S33 0.1714 0.0162
S34 0.8556 0.0824 S35 0.2642 0.0330 S36 31.6469 3.8975
S37 0.7977 0.0594 S38 0.5507 0.0445 S39 6.6268 0.3601
S40 0.6570 0.0963 S41 0.3203 0.0463 S42 0.0011 0.0006
S43 0.2057 0.0211 S44 1.3687 0.1088 S45 0.0202 0.0034
S46 0.0182 0.0044 S47 0.0183 0.0045 S48 0.0010 0.0003
S49 0.0230 0.0183 S50 0.0011 0.0007 S51 1.0812 0.0883
S52 0.2423 0.0652 S53 0.1631 0.0369 S54 15.7745 1.1856
S55 190.9669 2.9963

To statistically check their differences, we follow the similar procedure as above. We first use the Shapiro-Wilk test to check the normality of residuals, and the p-values of AGA and FAST are 4.92 × 10^-15 and 4.05 × 10^-15, which reject the hypothesis that they are normally distributed. Therefore, we use the proportional odds regression [40] and include project size as a control variable. The results show that the p-value of "group" is 4.250 × 10^-4, indicating a significant difference between AGA and FAST, and the effect size (Cohen's d) is 0.286 (medium effect). That is, the proposed AGA is more efficient than FAST (with a 4.29X speedup ratio). This is a surprising result because AGA can even be faster than a technique that is designed to sacrifice effectiveness to reduce time cost. We also present the violin plot with included box plot in Figure 3. On larger projects, the speedup ratios are smaller, which means FAST also scales up well on large-size projects, whereas it is less efficient than AGA.

7.1.2 FAST Results on Real Faults

Besides, as FAST is evaluated on some subjects of Defects4J [44] in the previous work [18], we apply the AGA approach to these subjects by reusing their artifact package (including subjects and code) for fair comparison. Moreover, we add the experiment on Mockito, which is also in Defects4J but does not appear in the experiment of FAST. Defects4J is the largest real-fault benchmark, so this experiment complements the previous experiments on seeded faults and can evaluate AGA on real faults. The comparison results are given by Table 4, where WinAPFD and WinTime show whether the proposed AGA approach outperforms FAST in terms of APFD and time cost, respectively. From this table, AGA is more effective than the FAST algorithms on 5 out of 6 projects, and it achieves better time efficiency on all 6 projects (with 5.24X as the average speedup ratio), which indicates the superiority of AGA. Also, from this experiment, we show that AGA is superior on real faults, too.
[Figure 3: Speedup Ratios Distribution of AGA over FAST on Open-Source Projects. Violin plots with embedded box plots for Total (n=55), Small-size (n=22), Middle-size (n=19), and Large-size (n=14); y-axis: Speedup_ratio.]
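The violin-with-box layout used in Figures 2 to 4 can be reproduced with standard matplotlib calls; the sketch below uses synthetic speedup data purely for illustration.

    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(0)
    groups = {
        "Total\nn=55": rng.lognormal(1.2, 0.6, 55),
        "Small-size\nn=22": rng.lognormal(0.6, 0.4, 22),
        "Middle-size\nn=19": rng.lognormal(1.4, 0.5, 19),
        "Large-size\nn=14": rng.lognormal(1.9, 0.5, 14),
    }
    data = list(groups.values())
    pos = range(1, len(data) + 1)

    fig, ax = plt.subplots()
    ax.violinplot(data, positions=pos, showextrema=False)
    ax.boxplot(data, positions=pos, widths=0.15)  # box embedded in the violin
    ax.set_xticks(list(pos))
    ax.set_xticklabels(groups.keys())
    ax.set_ylabel("Speedup_ratio")
    plt.show()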
Actually, it is worth pointing out that, as the authors of FAST [18] stated, no single FAST algorithm can be the best, which means the most effective algorithm in FAST may lead to somewhat higher time cost, and the most efficient algorithm in FAST may lead to somewhat lower APFD values. That is, the results of FAST in Table 7 (Appendix C) and Table 4 are not the results of one FAST algorithm, but the best results of all FAST algorithms. Moreover, even compared with these results, AGA is still promising considering both effectiveness and efficiency.
Table 4: Results of Some Defects4J Projects

Projects | FAST APFD | FAST Time | AGA APFD | WinAPFD | AGA Time | WinTime | APFD range | GAF APFD | GAF Time
Closure | 0.5219 | 11.7302 | 0.4347 | – | 2.0408 | X | 0.0006 | 0.4354 | 9.4078
Math | 0.5471 | 5.2600 | 0.6992 | X | 0.9586 | X | 0.0000 | 0.6992 | 7.4710
Lang | 0.5627 | 0.4280 | 0.6094 | X | 0.1513 | X | 0.0000 | 0.6094 | 0.2150
Time | 0.5463 | 2.8633 | 0.5469 | X | 0.5042 | X | 0.0034 | 0.5436 | 0.9870
Chart | 0.5264 | 5.1456 | 0.7128 | X | 0.9030 | X | NA | 0.7128 | 5.0767
Mockito | 0.5197 | 2.7311 | 0.5975 | X | 0.4537 | X | 0.0014 | 0.5961 | 2.7539
Total | | | | 5 | | 6 | | |
Considering the advantages of AGA over FAST, it is interesting to analyze the secrets behind this observation. The FAST approach achieves its efficiency improvement by using algorithms from the big data domain to summarize the key information in coverage, but suffers from effectiveness loss to some extent because some information is missing in the summarization. AGA consists of two parts, time complexity reduction and iteration number reduction. In particular, the former part uses some extra data structures (e.g., indices) to summarize the coverage information of each test case, i.e., the statements covered by each test. With these data structures, AGA does not need to scan the coverage table whenever a test case is selected, and thus the time cost of AGA is reduced while its effectiveness is maintained. To sum up, FAST suffers from effectiveness loss because it uses simplified information, while AGA does not, because it uses the same information as before but in an easy-to-access way.
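A minimal sketch of this idea, under simplifying assumptions (a single greedy pass is shown, and the per-iteration reset of GA is only indicated in a comment; this is our illustration, not the paper's optimized implementation): besides each test's element list, we keep an inverted index from elements to the tests covering them and a per-test counter of still-uncovered elements, so selecting a test only updates the tests that share its newly covered elements instead of rescanning the whole coverage table.

    def additional_greedy(tests):
        """One pass of the 'additional' greedy strategy.

        tests: list of lists, tests[t] = ids of elements covered by test t.
        Gains are maintained incrementally through an inverted index
        instead of rescanning the coverage table per selection.
        """
        num = len(tests)
        index = {}                              # element id -> covering tests
        for t, elems in enumerate(tests):
            for e in set(elems):
                index.setdefault(e, []).append(t)
        gain = [len(set(e)) for e in tests]     # uncovered elements per test
        covered = set()
        selected = [False] * num
        order = []
        while len(order) < num:
            best = max((t for t in range(num) if not selected[t]),
                       key=gain.__getitem__)
            selected[best] = True
            order.append(best)
            for e in set(tests[best]):
                if e not in covered:            # e becomes covered now,
                    covered.add(e)
                    for t in index[e]:          # so it stops counting here
                        gain[t] -= 1
            # When every unselected test's gain drops to 0, the GA algorithm
            # starts a new iteration by resetting `covered` and recomputing
            # `gain`; omitted here for brevity.
        return order

    print(additional_greedy([[0, 1, 4], [1, 2], [2, 3, 4]]))  # [0, 2, 1]

Each (test, element) pair is decremented at most once per pass, so one iteration costs O(mn) overall, consistent with the O(kmn) bound discussed in the paper.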
Moreover, similar to Table 2, we compute the gaps between the highest and lowest APFD among all iteration numbers for the Defects4J subjects, which are shown in the column "APFD range" of Table 4. The range of "Chart" is marked as "NA" because it has only one iteration. As we can see, the gaps are extremely small, which also confirms the conclusion in Section 3.

Additionally, the column "GAF" of Table 4 shows the results of GA-first. AGA is much more efficient than GAF while achieving larger APFD, which is consistent with the conclusion in Section 6.3.
Conclusion: Surprisingly, AGA can achieve a 4.29X speedup ratio compared to FAST, which targets improving time efficiency while sacrificing effectiveness. At the same time, the experimental results show that AGA is significantly better than FAST in terms of APFD values, and the average difference between them is 0.1702.

7.2 Comparison with other TCP Techniques

Although only FAST has a close goal to ours, to better evaluate AGA, we also compare it with more representative TCP techniques. In particular, in this study we use the following TCP techniques, whose input is only coverage information and which have been widely used in the literature [18], [31].
• ART-D [10] is a family of adaptive random-based TCP techniques guided by coverage information. At each iteration, a candidate set is dynamically created by randomly picking test cases from the set of not-yet-prioritized test cases as long as they can increase coverage. The test case in the candidate set that is the farthest away from the set of prioritized test cases is selected (see the sketch after this list).
• GA-S (Additional Spanning) [54] is a variant of GA that at each iteration picks the test case that covers the largest number of uncovered elements among those in the "spanning set". Here, an element subsumes another if covering the former guarantees covering the latter; the spanning set denotes the subset of non-subsumed elements.
• GE [8] is a genetic algorithm, which is a representative of search-based prioritization techniques and is evaluated to be effective. In each iteration, it uses a fitness function to select individuals and then applies crossover and mutation operators to generate new individuals. Specifically, an individual (a sequence) is encoded as an array where each value indicates the position of a test case; the fitness function is defined by Baker's linear ranking algorithm [55]; the crossover operator selects two parents, and each of the two offspring is formed by combining the first several values in one parent and the remaining values in the other parent; the mutation operator randomly selects two values in an individual and exchanges their positions.
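As promised above, a simplified sketch of the ART-D selection loop, under our own assumed choices (Jaccard distance on coverage sets and a min-distance-to-set criterion; the exact definitions in [10] differ in details):

    import random

    def jaccard_distance(a, b):
        union = a | b
        return 1.0 - len(a & b) / len(union) if union else 0.0

    def art_d(tests, seed=0):
        """tests: dict test_id -> set of covered elements."""
        rng = random.Random(seed)
        remaining = set(tests)
        order = [rng.choice(sorted(remaining))]
        remaining.discard(order[0])
        while remaining:
            # Candidate set: sample tests while they still add coverage.
            candidates, cand_cov = [], set()
            for t in rng.sample(sorted(remaining), len(remaining)):
                if tests[t] - cand_cov:
                    candidates.append(t)
                    cand_cov |= tests[t]
            if not candidates:
                candidates = sorted(remaining)
            # Distance of a candidate to the prioritized set = min distance
            # to any selected test; pick the candidate farthest away.
            best = max(candidates,
                       key=lambda t: min(jaccard_distance(tests[t], tests[s])
                                         for s in order))
            order.append(best)
            remaining.discard(best)
        return order

    print(art_d({"t0": {0, 1, 4}, "t1": {1, 2}, "t2": {2, 3, 4}}))

The repeated distance computations between candidates and all prioritized tests are what make ART-D so much slower than AGA in the measurements below.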
In this section, we reuse the implementations of ART-D, GA-S, and GE from [18], [22] and compare them with AGA on the 55 open-source projects. Considering the randomness of these techniques, each of them is run 10 times. The remaining setting of this experiment is the same as Section 5. Due to the space limit of Table 7 (Appendix C), we put the results in Table 5. In Table 5, each row represents one project, and the running time and APFD of AGA, ART-D, GA-S, and GE are shown separately.

The average speedup ratio of AGA over ART-D is 144.58X. Moreover, in all 55 projects, the APFD values of AGA are larger than those of ART-D, and the average APFD difference is 0.1384. That is, AGA always outperforms ART-D in terms of both effectiveness and efficiency. The average speedup ratio of AGA over GA-S is 182.27X. In 54 out of 55 projects, the APFD values of AGA are larger than those of GA-S, and the average APFD difference is 0.0708. The average speedup ratio of AGA over GE is 285.91X. In 50 out of 55 projects, the APFD values of AGA are larger than those of GE, and the average APFD difference is 0.0459. That is, compared with the three TCP techniques, our proposed AGA achieves both effectiveness and efficiency. Moreover, the time cost and APFD values of the compared TCP techniques distribute in a larger range than AGA, indicating that the latter can achieve stably promising performance.
Conclusion: As AGA aims to largely improve TCP efficiency while preserving the high effectiveness of GA, it outperforms ART-D, GA-S, and GE in terms of both efficiency and effectiveness.

8 INDUSTRIAL CASE STUDY

To show the practical usage of our approach, we then conducted an industrial case study as follows.

Baidu is a famous Internet service provider with over 600M monthly active users. In their regression testing infrastructure, test case prioritization is frequently needed, and they have been adopting the Greedy Additional (GA) strategy for a long time because of its simple idea and relatively high effectiveness. However, they often complain about the long running time of GA, which deviates from the original intention of test case prioritization, that is, to accelerate the process of detecting faults.

To check the performance of our AGA approach in real-world scenarios, we collected 22 versions of five industrial projects from Baidu, each of which is taken as a subject in this study. More specifically, these subjects were collected from Dec. 2017 to Feb. 2018 and Oct. 2018 to Nov. 2018, and all of them are written in C. As shown in the first three columns in Table 8 (Appendix D), we summarize the SLOC and the number of test cases of each subject. The SLOCs range from 20K to 500K, while the numbers of test cases range from 202 to 4,246. Besides, we used C-Cover [56] to collect statement coverage for each industrial subject.

In Table 8 (Appendix D), we report the time cost of GA, AGA, and FAST, respectively. When the time cost of AGA is less than that of GA, we mark it with X. As we can see, in all 22 subjects, the time cost of AGA is much lower than that of GA, and the speedup ratio is 44.27X on average. In general, our AGA approach is demonstrated to be efficient on industrial subjects from Baidu. For example, for the subject I1, its original prioritization time is larger than 29,000 seconds, which may be unbearable in practice. However, through AGA, the prioritization time is reduced to less than 360 seconds. On the other hand, the surprisingly high efficiency of AGA also indicates the ubiquitous existence of many redundant accesses of data in industrial projects. We also present the violin plot with included box plot in Figure 4 to show the distribution and variation. As we can see, on most projects, AGA has a large improvement compared to GA.

Provided the adjacency matrix as input, the average speedup ratio of AGA over GA is 61.43X.

After we reported the results, developers in Baidu verified that (1) the time cost of our implementation of GA is close to their inner implementations, and (2) the speedup ratio is significant and our technique improves their prioritization efficiency, because their implementation only works on small projects, not large projects.

In addition, we also compared our approach with FAST, and the experimental setup is the same as in Section 7. Surprisingly, AGA outperforms FAST again. Specifically, on all 22 subjects, AGA is faster than FAST, and the average speedup ratio is 4.58X. To statistically check their differences, we follow the similar procedure as above. We first use the Shapiro-Wilk test to check the normality of residuals, and the p-values of AGA and FAST are 5 × 10^-4 and 2.5 × 10^-3, which reject the hypothesis that they are normally distributed. Similarly to the above, proportional odds regression [40] is used, and we introduce a variable "group" representing AGA and FAST and take project size as a control variable. The results show that the p-value of "group" is 1.35 × 10^-6, indicating a significant difference between AGA and FAST, and the effect size (Cohen's d) is 1.28 (very large effect). We also present the violin plot with included box plot in Figure 4. Again, we find that on most projects, AGA has a large improvement over FAST. In [18], the authors proposed FAST to solve the scalability problem of TCP techniques at the cost of a decrease in effectiveness. Their approach is evaluated to be efficient when the project size grows rapidly. However, our AGA approach is even more efficient than FAST, which means AGA may scale up better and is practical in real-world scenarios. Also, recall that when we compared AGA with FAST on open-source projects, the p-value was larger and the effect size was smaller than here; we conjecture that this is due to the relatively small sizes of open-source projects.

[Figure 4: Speedup Ratios Distribution of AGA over GA and FAST on Industrial Projects. Violin plots with embedded box plots for AGA over GA (n=22) and AGA over FAST (n=22); y-axis: Speedup_ratio.]

Additionally, besides FAST, which targets the TCP efficiency problem, we also compare AGA with other, more general TCP techniques, as we have done in Section 7. Specifically, we run ART-D, GA-S, and GE on the 22 subjects. Considering the randomness of these techniques, each of them is run 10 times. The results are shown in Table 8 (Appendix D). As we can see, these techniques are much slower than AGA, and even than GA. The average speedup ratios of AGA over ART-D, GA-S, and GE are 993.37X, 4230.53X, and 123.25X, respectively.

It is worth noting that, on one hand, Baidu is sensitive to the positions of detected faults in history, thus these positions are not available to us. On the other hand, they only provide coverage data after desensitization, and we do not have access to the source code of these subjects (to create mutants) due to the confidentiality policy. As a result, we cannot compare the effectiveness of the three approaches in terms of APFD on industrial subjects.
Table 5: Comparison with Other TCP Techniques on Open-Source Subjects

Project | AGA Time | AGA APFD | ART-D Time | ART-D APFD | GA-S Time | GA-S APFD | GE Time | GE APFD
S1 0.0157 0.9070 0.0832 0.8440 0.5878 0.8812 0.0650 0.8747
S2 0.0058 0.8380 0.0224 0.7508 0.3312 0.7870 0.0660 0.8373
S3 0.0222 0.8848 0.0196 0.7681 0.1593 0.8108 0.0900 0.8705
S4 0.0789 0.8509 0.0478 0.5933 0.1079 0.6555 0.4440 0.8061
S5 0.0089 0.8527 0.1361 0.7405 0.6358 0.7662 0.9430 0.8310
S6 0.0046 0.8101 0.0957 0.6909 0.2560 0.6800 0.9140 0.7935
S7 0.0076 0.9059 0.0150 0.7711 0.1116 0.8255 0.0970 0.8855
S8 0.0180 0.8898 0.0752 0.7459 2.7672 0.8153 0.1160 0.8985
S9 0.3363 0.9144 53.9033 0.8821 33.7447 0.8995 38.4580 0.8949
S10 0.0247 0.9518 0.5895 0.8009 0.9681 0.9017 0.8060 0.9172
S11 0.0553 0.8766 0.4818 0.7640 10.6250 0.7950 0.4070 0.8501
S12 0.0106 0.8864 0.0708 0.7589 0.3043 0.8375 0.2970 0.8667
S13 0.0385 0.8615 0.1763 0.7816 2.0406 0.7667 0.2000 0.7783
S14 0.0385 0.9188 0.6426 0.7800 0.8556 0.8779 0.6880 0.9011
S15 0.0055 0.8582 0.1225 0.7285 0.2513 0.7117 0.7360 0.8470
S16 0.0152 0.8031 0.3687 0.6578 0.9745 0.6509 3.0660 0.7760
S17 0.1317 0.9183 7.8831 0.8316 7.8195 0.8893 2.7440 0.8785
S18 0.0226 0.9028 0.1431 0.7783 1.5687 0.8039 0.3120 0.8950
S19 0.6995 0.9033 34.4428 0.7553 374.3919 0.8381 4.1710 0.8528
S20 0.0469 0.8013 0.2180 0.7021 36.2190 0.7293 0.3410 0.8185
S21 0.0248 0.8328 0.4520 0.7156 1.5459 0.7676 1.5660 0.7899
S22 0.0215 0.8642 0.2999 0.7660 0.8994 0.8151 0.5080 0.8570
S23 0.0962 0.8198 4.8919 0.6979 4.2931 0.7782 3.6190 0.7866
S24 0.0187 0.9858 0.0420 0.9213 12.3507 0.9857 0.1230 0.9860
S25 1.0579 0.8401 70.1140 0.8099 387.4319 0.8426 3.5190 0.8283
S26 0.0294 0.8339 0.4066 0.6026 0.3047 0.6927 2.6380 0.7833
S27 0.0303 0.9614 0.0254 0.7695 0.2441 0.8579 0.2200 0.9501
S28 0.1132 0.9164 11.6980 0.7642 4.6810 0.8734 8.5760 0.8534
S29 2.3900 0.9490 130.3359 0.8254 7955.7540 0.9180 3.7060 0.9131
S30 0.1804 0.9617 6.9994 0.8572 28.4835 0.9202 1.6830 0.9285
S31 0.3170 0.9426 46.5734 0.8342 15.9675 0.9191 11.7620 0.9065
S32 0.1270 0.8911 22.8351 0.7165 3.1369 0.8231 53.4990 0.7696
S33 0.0921 0.8662 11.0712 0.6612 5.5717 0.7356 23.6020 0.7568
S34 0.7903 0.9328 120.8907 0.8271 128.1646 0.8861 21.5730 0.8780
S35 0.3631 0.9467 32.4553 0.7277 20.0292 0.8694 5.1690 0.9105
S36 23.7849 0.9371 2,397.2400 0.8207 1,976.4397 0.8597 11.6880 0.8570
S37 0.3582 0.8507 33.6262 0.6876 23.6568 0.7621 1,979.6000 0.7165
S38 0.2794 0.8657 115.8419 0.7753 35.4853 0.7747 337.0780 0.8072
S39 2.2078 0.9545 1,114.6899 0.7931 265.5048 0.9289 168.0300 0.7733
S40 0.5942 0.9244 118.3193 0.7621 17.4111 0.8672 112.3900 0.8156
S41 0.1562 0.9106 19.2647 0.7195 4.2101 0.8454 40.8410 0.8116
S42 0.0287 0.8569 0.0123 0.7409 0.2401 0.8176 0.1300 0.8649
S43 0.2558 0.8924 32.1075 0.7331 38.5073 0.7915 24.0450 0.8208
S44 0.8465 0.9240 162.8027 0.8437 147.3191 0.9090 34.0820 0.8864
S45 0.0597 0.8464 0.8793 0.6822 7.4841 0.7811 4.6150 0.7918
S46 0.0858 0.8656 3.7066 0.6834 17.5147 0.7494 4.0400 0.8003
S47 0.0629 0.8750 0.7825 0.6884 1.0181 0.7588 3.6900 0.8225
S48 0.0025 0.7939 0.0095 0.6455 0.0386 0.7089 0.0790 0.7873
S49 0.1120 0.8009 0.8459 0.6463 5.6348 0.7600 4.3070 0.7884
S50 0.0047 0.8517 0.0164 0.5108 0.0585 0.6855 0.1830 0.8559
S51 1.1036 0.8671 203.2114 0.6595 505.1426 0.8080 61.9450 0.7261
S52 0.4052 0.9542 26.1336 0.8454 200.9616 0.8889 20.1360 0.9233
S53 0.2298 0.8710 13.8199 0.6956 32.9077 0.8101 33.7620 0.7768
S54 2.6648 0.9089 3,373.5142 0.7578 252.3093 0.8478 13,384.3880 0.7902
S55 18.1040 0.9544 55,120.6523 0.8613 4536.0047 0.9292 13,516.8050 0.8747
Conclusion: Our AGA approach achieves a 44.27X speedup ratio compared to GA. AGA even outperforms FAST in terms of time efficiency (4.58X), and the difference is statistically significant. This indicates that AGA is practical in real-world scenarios.

9 DISCUSSION

Space comparison. From the space complexity analysis in Section 2, AGA consumes at most twice more space than GA, which is acceptable in practice. Moreover, AGA does not require high-performance servers; e.g., the time cost of AGA on the two largest open-source projects (i.e., commons-math and camel-core) is only 153.42s and 187.82s on a personal computer with an Intel Core i5 CPU and 8GB memory, almost the same as in Table 7 (Appendix C).

Impact of seeded faults and real faults. Previous work [24], [57], [58] has explored the relationship between seeded faults and real faults; they may have different characteristics, which has potential influence on the evaluation of test case prioritization, fault localization, etc. In this paper, we evaluate our approach on both 55 open-source subjects with seeded faults and the Defects4J dataset with real faults. The high performance of AGA on both of them illustrates its superiority well.

Discussion on other TCP approaches. Researchers have put dedicated efforts into TCP and have proposed a large number of TCP techniques. Many approaches take information other than coverage (e.g., test inputs, test outputs, mutants) as input, so they are in different dimensions. However, even taking all kinds of approaches into consideration, the GA approach remains one of the most effective strategies in terms of fault-detection rate [7], [8], [10], [18]. So, we target GA in this paper, and AGA can be better than other approaches.

10 RELATED WORK

Test case prioritization has attracted much attention since this problem was raised at the end of the 20th century, and the work on test case prioritization can be classified into prioritization algorithms [8], [10], [59], [60], [61], [62], coverage criteria used in prioritization [3], [5], [28], [63], [64], [65], [66], [67], [68], measurements used to estimate prioritization effectiveness [2], [5], [69], and empirical studies [1], [3], [5], [12], [31], [70], [71], [72], [73], [74]. Moreover, a number of surveys on test case prioritization are also given in the literature [75], [76], [77]. For example, Catal et al. [76] conducted a systematic study of TCP techniques in 2001-2011, including 120 papers published in that time period. Due to the space limit, we do not list all the prioritization work here, but introduce some very recently published work. Di Nucci et al. [78] proposed a Hypervolume-based Genetic Algorithm to prioritize test cases using multiple test coverage criteria. Azizi et al. [79] proposed a graph-based framework to map the prioritization problem to a graph traversal algorithm. Chen et al. [80] gave an adaptive random sequences approach based on clustering techniques using black-box information. Different from them, our work targets the effective GA algorithm and attempts to solve its efficiency problem.

Moreover, some researchers have noticed the efficiency problem of TCP and begun to work on it. Henard et al. [12] said, "if prioritization takes too long, then it eats into the time available to run the prioritized test suite." That is, for large software, it is necessary to take the scalability of TCP into consideration. Marijan et al. [81] proposed ROCKET to prioritize test cases based on historical failure data, test execution time, and domain-specific heuristics to improve efficiency in the scenario of continuous integration. Knauss et al. [82] proposed to analyze the correlation between test failures and source code changes to rapidly prioritize test cases. Elbaum et al. [13] introduced two techniques that use readily available test execution history data to determine which test cases are worth executing and execute them with higher priority. Recently, Miranda et al. [18] introduced the FAST techniques to provide similarity-based test case prioritization with scalable improvements. Our work is related to the above work because all of them target the TCP efficiency problem. However, the above work either does not take advantage of coverage information, which results in lower effectiveness, or addresses the efficiency problem alone while sacrificing effectiveness rather than balancing the two. That is, to the best of our knowledge, none of the existing work can improve the efficiency of GA while maintaining its widely recognized effectiveness. Our work achieves this goal, and AGA is particularly advantageous for large-scale industrial projects.

11 CONCLUSIONS

In this paper, we make a deep analysis of the Greedy Additional (GA) algorithm for the test case prioritization (TCP) problem and propose AGA to improve its efficiency while preserving effectiveness. On one hand, we find the redundant data accesses in GA and use extra data structures to cut them down, which leads to an optimized time complexity from O(m²n) to O(kmn) given n > m, where m is the number of test cases, n is the number of program elements, and k is the iteration number. On the other hand, we notice the impact of iteration numbers on the effectiveness and efficiency of GA and propose to reduce the iteration number to a relatively small value to improve efficiency while preserving effectiveness. Overall, we achieve an O(mn) algorithm for prioritization.

We performed comprehensive experiments on 55 open-source projects to show the effectiveness and efficiency of AGA. On one hand, AGA achieves the same average effectiveness as the GA approach, whose performance is considered to be high; at the same time, the efficiency of AGA is much higher than that of GA. Specifically, our AGA approach achieves 5.95X/27.72X speedup ratios over GA on average on the two input formats. On the other hand, compared with FAST, which was recently proposed to solve the TCP efficiency problem while sacrificing effectiveness to some extent, AGA achieves 0.1702 higher APFD values on average, and, surprisingly, the average speedup ratio of AGA over FAST is 4.29X.

Additionally, we conducted an industrial case study on 22 industrial subjects, collected from Baidu, which is a famous Internet service provider with over 600M monthly
active users. The experimental results show that the average speedup ratios of AGA over GA and FAST are 44.27X/61.43X and 4.58X (with significant difference and very large effect), respectively.

To the best of our knowledge, this is the first attempt at alleviating the efficiency problem of the Greedy Additional TCP approach while maintaining its effectiveness. It is worth noting that the efficiency of a TCP algorithm is especially important when software becomes larger, that is to say, in real-world scenarios. Our empirical evidence indicates that AGA is particularly advantageous for large-scale industrial projects.

ACKNOWLEDGMENTS

The authors would like to thank all the reviewers for their valuable comments and suggestions. This work was supported by the National Natural Science Foundation of China under Grant No. 61872008.

REFERENCES

[1] Gregg Rothermel, Roland H. Untch, Chengyun Chu, and Mary Jean Harrold. Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering, 27(10):929–948, 2001.
[2] Sebastian Elbaum, Alexey Malishevsky, and Gregg Rothermel. Incorporating varying test costs and fault severities into test case prioritization. In Proceedings of the 23rd International Conference on Software Engineering, pages 329–338. IEEE Computer Society, 2001.
[3] Sebastian Elbaum, Alexey G Malishevsky, and Gregg Rothermel. Test case prioritization: A family of empirical studies. IEEE Transactions on Software Engineering, 28(2):159–182, 2002.
[4] Xiao Qu, Myra B Cohen, and Gregg Rothermel. Configuration-aware regression testing: an empirical study of sampling and prioritization. In Proceedings of the 2008 International Symposium on Software Testing and Analysis, pages 75–86. ACM, 2008.
[5] Gregg Rothermel, Roland H Untch, Chengyun Chu, and Mary Jean Harrold. Test case prioritization: An empirical study. In Proceedings of the 1999 IEEE International Conference on Software Maintenance, pages 179–188. IEEE, 1999.
[6] W Eric Wong, Joseph R Horgan, Saul London, and Hiralal Agrawal. A study of effective regression testing in practice. In Proceedings of the Eighth International Symposium On Software Reliability Engineering, pages 264–274. IEEE, 1997.
[7] Lingming Zhang, Dan Hao, Lu Zhang, Gregg Rothermel, and Hong Mei. Bridging the gap between the total and additional test-case prioritization strategies. In Proceedings of the 2013 International Conference on Software Engineering, pages 192–201. IEEE Press, 2013.
[8] Zheng Li, Mark Harman, and Robert M Hierons. Search algorithms for regression test case prioritization. IEEE Transactions on Software Engineering, 33(4):225–237, 2007.
[9] Shen Lin. Computer solutions of the traveling salesman problem. Bell System Technical Journal, 44(10):2245–2269, 1965.
[10] Bo Jiang, Zhenyu Zhang, Wing Kwong Chan, and TH Tse. Adaptive random test case prioritization. In Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, pages 233–244. IEEE Computer Society, 2009.
[11] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 2009.
[12] Christopher Henard, Mike Papadakis, Mark Harman, Yue Jia, and Yves Le Traon. Comparing white-box and black-box test prioritization. In 2016 IEEE/ACM 38th International Conference on Software Engineering, pages 523–534. IEEE, 2016.
[13] Sebastian Elbaum, Gregg Rothermel, and John Penix. Techniques for improving regression testing in continuous integration development environments. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 235–245. ACM, 2014.
[14] Mika V Mäntylä, Bram Adams, Foutse Khomh, Emelie Engström, and Kai Petersen. On rapid releases and software testing: a case study and a semi-systematic literature review. Empirical Software Engineering, 20(5):1384–1425, 2015.
[15] Atif Memon, Zebao Gao, Bao Nguyen, Sanjeev Dhanda, Eric Nickell, Rob Siemborski, and John Micco. Taming google-scale continuous testing. In Proceedings of the 39th International Conference on Software Engineering: Software Engineering in Practice Track, pages 233–242. IEEE Press, 2017.
[16] Ashish Kumar. Development at the speed and scale of google. QCon San Francisco, 2010.
[17] Yafeng Lu, Yiling Lou, Shiyang Cheng, Lingming Zhang, Dan Hao, Yangfan Zhou, and Lu Zhang. How does regression test prioritization perform in real-world software evolution? In 2016 IEEE/ACM 38th International Conference on Software Engineering, pages 535–546. IEEE, 2016.
[18] Breno Miranda, Emilio Cruciani, Roberto Verdecchia, and Antonia Bertolino. Fast approaches to scalable similarity-based test case prioritization. In Proceedings of the 40th International Conference on Software Engineering, pages 222–232. ACM, 2018.
[19] Ronald Aylmer Fisher. Statistical methods for research workers. In Breakthroughs in Statistics, pages 66–70. Springer, 1992.
[20] Qi Luo, Kevin Moran, Lingming Zhang, and Denys Poshyvanyk. How do static and dynamic test case prioritization techniques perform on modern software systems? an extensive study on github projects. IEEE Transactions on Software Engineering, 45(11):1054–1080, 2018.
[21] Jianyi Zhou, Junjie Chen, and Dan Hao. Parallel test prioritization. ACM Transactions on Software Engineering and Methodology, 31(1):1–50, 2021.
[22] Junjie Chen, Yiling Lou, Lingming Zhang, Jianyi Zhou, Xiaoleng Wang, Dan Hao, and Lu Zhang. Optimizing test prioritization via test distribution analysis. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 656–667. ACM, 2018.
[23] Song Wang, Jaechang Nam, and Lin Tan. Qtep: quality-aware test case prioritization. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pages 523–534. ACM, 2017.
[24] René Just, Darioush Jalali, Laura Inozemtseva, Michael D Ernst, Reid Holmes, and Gordon Fraser. Are mutants a valid substitute for real faults in software testing? In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 654–665. ACM, 2014.
[25] James H Andrews, Lionel C Briand, and Yvan Labiche. Is mutation an appropriate tool for testing experiments? In Proceedings of the 27th International Conference on Software Engineering, pages 402–411. ACM, 2005.
[26] Hyunsook Do and Gregg Rothermel. On the use of mutation faults in empirical assessments of test case prioritization techniques. IEEE Transactions on Software Engineering, 32(9):733–752, 2006.
[27] Yiling Lou, Dan Hao, and Lu Zhang. Mutation-based test-case prioritization in software evolution. In 2015 IEEE 26th International Symposium on Software Reliability Engineering, pages 46–57. IEEE, 2015.
[28] Hong Mei, Dan Hao, Lingming Zhang, Lu Zhang, Ji Zhou, and Gregg Rothermel. A static approach to prioritizing junit test cases. IEEE Transactions on Software Engineering, 38(6):1258–1275, 2012.
[29] Md Junaid Arafeen and Hyunsook Do. Test case prioritization using requirements-based clustering. In 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, pages 312–321. IEEE, 2013.
[30] Hyunsook Do, Siavash Mirarab, Ladan Tahvildari, and Gregg Rothermel. The effects of time constraints on test case prioritization: A series of controlled experiments. IEEE Transactions on Software Engineering, 36(5):593–617, 2010.
[31] Qi Luo, Kevin Moran, and Denys Poshyvanyk. A large-scale empirical comparison of static and dynamic test case prioritization techniques. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 559–570. ACM, 2016.
[32] Pit mutation testing. http://pitest.org/, 2021. Accessed: 2021.
[33] atlassian / clover – bitbucket. https://bitbucket.org/atlassian/clover/src/default/, 2021. Accessed: 2021.
[34] Rahul Gopinath, Carlos Jensen, and Alex Groce. Mutations: How close are they to real faults? In 2014 IEEE 25th International Symposium on Software Reliability Engineering, pages 189–200. IEEE, 2014.
[35] Qi Luo, Kevin Moran, Denys Poshyvanyk, and Massimiliano Di Penta. Assessing test case prioritization on real faults and mutants. In 2018 IEEE International Conference on Software Maintenance and Evolution, pages 240–251. IEEE, 2018.
[36] Mike Papadakis, Christopher Henard, and Yves Le Traon. Sampling program inputs with mutation analysis: Going beyond combinatorial interaction testing. In 2014 IEEE Seventh International Conference on Software Testing, Verification and Validation, pages 1–10. IEEE, 2014.
[37] Justyna Petke, Shin Yoo, Myra B Cohen, and Mark Harman. Efficiency and early fault detection with lower and higher strength combinatorial interaction testing. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pages 26–36. ACM, 2013.
[38] Samuel Sanford Shapiro and Martin B Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52(3/4):591–611, 1965.
[39] Henry B Mann and Donald R Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50–60, 1947.
[40] Peter McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society: Series B (Methodological), 42(2):109–127, 1980.
[41] Jacob Cohen. Statistical power analysis for the behavioral sciences. Academic Press, 2013.
[42] Maurice Stevenson Bartlett. Properties of sufficiency and statistical tests. Proceedings of the Royal Society of London. Series A-Mathematical and Physical Sciences, 160(901):268–282, 1937.
[43] John W Tukey. Comparing individual means in the analysis of variance. Biometrics, pages 99–114, 1949.
[44] René Just, Darioush Jalali, and Michael D Ernst. Defects4j: A database of existing faults to enable controlled testing studies for java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, pages 437–440. ACM, 2014.
[45] David Paterson, Gregory M Kapfhammer, Gordon Fraser, and Phil McMinn. Using controlled numbers of real faults and mutants to empirically evaluate coverage-based test case prioritization. In Proceedings of the 13th International Workshop on Automation of Software Test, pages 57–63, 2018.
[46] Md Abu Hasan, Md Abdur Rahman, and Md Saeed Siddik. Test case prioritization based on dissimilarity clustering using historical data analysis. In International Conference on Information, Communication and Computing Technology, pages 269–281. Springer, 2017.
[47] Tanzeem Bin Noor and Hadi Hemmati. A similarity-based approach for test case prioritization using historical failure data. In 2015 IEEE 26th International Symposium on Software Reliability Engineering, pages 58–68. IEEE, 2015.
[48] Alireza Haghighatkhah, Mika Mäntylä, Markku Oivo, and Pasi Kuvaja. Test case prioritization using test similarities. In International Conference on Product-Focused Software Process Improvement, pages 243–259. Springer, 2018.
[49] Xia Li, Wei Li, Yuqun Zhang, and Lingming Zhang. Deepfl: Integrating multiple fault diagnosis dimensions for deep fault localization. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 169–180, 2019.
[50] Xia Li and Lingming Zhang. Transforming programs and tests in tandem for fault localization. Proceedings of the ACM on Programming Languages, 1(OOPSLA):1–30, 2017.
[51] Spencer Pearson, José Campos, René Just, Gordon Fraser, Rui Abreu, Michael D Ernst, Deric Pang, and Benjamin Keller. Evaluating and improving fault localization. In 2017 IEEE/ACM 39th International Conference on Software Engineering, pages 609–620. IEEE, 2017.
[52] Jeongju Sohn and Shin Yoo. Fluccs: Using code and change metrics to improve fault localization. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 273–283. ACM, 2017.
[53] Mengshi Zhang, Xia Li, Lingming Zhang, and Sarfraz Khurshid. Boosting spectrum-based fault localization using pagerank. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 261–272, 2017.
[54] Martina Marré and Antonia Bertolino. Using spanning sets for coverage testing. IEEE Transactions on Software Engineering, 29(11):974–984, 2003.
[55] James Edward Baker. Adaptive selection methods for genetic algorithms. In Proceedings of an International Conference on Genetic Algorithms and Their Applications, volume 1. Hillsdale, New Jersey, 1985.
[56] Bullseye testing technology. http://www.bullseye.com/, 2021. Accessed: 2021.
[57] Mike Papadakis, Donghwan Shin, Shin Yoo, and Doo-Hwan Bae. Are mutation scores correlated with real fault detection? a large scale empirical study on the relationship between mutants and real faults. In 2018 IEEE/ACM 40th International Conference on Software Engineering, pages 537–548. IEEE, 2018.
[58] Murial Daran and Pascale Thévenod-Fosse. Software error analysis: A real case study involving real faults and mutations. ACM SIGSOFT Software Engineering Notes, 21(3):158–171, 1996.
[59] Gordon Fraser and Franz Wotawa. Test-case prioritization with model-checkers. In 25th conference on IASTED International, 2007.
[60] Shin Yoo, Mark Harman, Paolo Tonella, and Angelo Susi. Clustering test cases to achieve effective and scalable prioritisation incorporating expert knowledge. In Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, pages 201–212. ACM, 2009.
[61] Ripon K Saha, Lingming Zhang, Sarfraz Khurshid, and Dewayne E Perry. An information retrieval approach for regression test prioritization based on program changes. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 1, pages 268–279. IEEE, 2015.
[62] Zengkai Ma and Jianjun Zhao. Test case prioritization based on analysis of program structure. In 2008 15th Asia-Pacific Software Engineering Conference, pages 471–478. IEEE, 2008.
[63] Sebastian Elbaum, Alexey G Malishevsky, and Gregg Rothermel. Prioritizing test cases for regression testing, volume 25. ACM, 2000.
[64] Hyunsook Do, Gregg Rothermel, and Alex Kinneer. Empirical studies of test case prioritization in a junit testing environment. In 15th International Symposium on Software Reliability Engineering, pages 113–124. IEEE, 2004.
[65] James A Jones and Mary Jean Harrold. Test-suite reduction and prioritization for modified condition/decision coverage. IEEE Transactions on Software Engineering, 29(3):195–209, 2003.
[66] Lingming Zhang, Ji Zhou, Dan Hao, Lu Zhang, and Hong Mei. Prioritizing junit test cases in absence of coverage information. In 2009 IEEE International Conference on Software Maintenance, pages 19–28. IEEE, 2009.
[67] Bogdan Korel, Luay Ho Tahat, and Mark Harman. Test prioritization using system models. In 21st IEEE International Conference on Software Maintenance, pages 559–568. IEEE, 2005.
[68] Lijun Mei, Zhenyu Zhang, WK Chan, and TH Tse. Test case prioritization for regression testing of service-oriented business applications. In Proceedings of the 18th International Conference on World Wide Web, pages 901–910. ACM, 2009.
[69] Gregory M Kapfhammer and Mary Lou Soffa. Using coverage effectiveness to evaluate test suite prioritizations. In Proceedings of the 1st ACM International Workshop on Empirical Assessment of Software Engineering Languages and Technologies: held in conjunction with the 22nd IEEE/ACM International Conference on Automated Software Engineering, pages 19–20. ACM, 2007.
[70] Donghwan Shin, Shin Yoo, Mike Papadakis, and Doo-Hwan Bae. Empirical evaluation of mutation-based test case prioritization techniques. Software Testing, Verification and Reliability, 29(1-2):e1695, 2019.
[71] Hyunsook Do, Siavash Mirarab, Ladan Tahvildari, and Gregg Rothermel. An empirical study of the effect of time constraints on the cost-benefits of regression testing. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 71–82. ACM, 2008.
[72] Dan Hao, Lu Zhang, and Hong Mei. Test-case prioritization: achievements and challenges. Frontiers of Computer Science, 10(5):769–777, 2016.
[73] Michael G Epitropakis, Shin Yoo, Mark Harman, and Edmund K Burke. Empirical evaluation of pareto efficient multi-objective regression test case prioritisation. In Proceedings of the 2015 International Symposium on Software Testing and Analysis, pages 234–245. ACM, 2015.
[74] Dan Hao, Lu Zhang, Lei Zang, Yanbo Wang, Xingxia Wu, and Tao Xie. To be optimal or not in test-case prioritization. IEEE Transactions on Software Engineering, 42(5):490–505, 2015.
[75] Shin Yoo and Mark Harman. Regression testing minimization, selection and prioritization: a survey. Software Testing, Verification and Reliability, 22(2):67–120, 2012.
[76] Cagatay Catal and Deepti Mishra. Test case prioritization: a systematic mapping study. Software Quality Journal, 21(3):445–478, 2013.
[77] Sanjukta Mohanty, Arup Abhinna Acharya, and Durga Prasad Mohapatra. A survey on model based test case prioritization. International Journal of Computer Science and Information Technologies, 2(3):1042–1047, 2011.
[78] Dario Di Nucci, Annibale Panichella, Andy Zaidman, and Andrea De Lucia. A test case prioritization genetic algorithm guided by the hypervolume indicator. IEEE Transactions on Software Engineering, 2018.
[79] Maral Azizi and Hyunsook Do. Graphite: A greedy graph-based technique for regression test case prioritization. In 2018 IEEE International Symposium on Software Reliability Engineering Workshops, pages 245–251. IEEE, 2018.
[80] Jinfu Chen, Lili Zhu, Tsong Yueh Chen, Dave Towey, Fei-Ching Kuo, Rubing Huang, and Yuchi Guo. Test case prioritization for object-oriented software: An adaptive random sequence approach based on clustering. Journal of Systems and Software, 135:107–125, 2018.
[81] Dusica Marijan, Arnaud Gotlieb, and Sagar Sen. Test case prioritization for continuous regression testing: An industrial case study. In 2013 IEEE International Conference on Software Maintenance, pages 540–543. IEEE, 2013.
[82] Eric Knauss, Miroslaw Staron, Wilhelm Meding, Ola Söder, Agneta Nilsson, and Magnus Castell. Supporting continuous integration by code-churn based test selection. In Proceedings of the Second International Workshop on Rapid Continuous Software Engineering, pages 19–25. IEEE Press, 2015.

Feng Li received his B.S. degree from Peking University in 2018. He is currently a Ph.D. candidate in the School of Computer Science at Peking University. His research interests include software testing and analysis.

Jianyi Zhou received his B.S. degree in 2014 and M.S. degree in 2017, both from Beihang University. He is currently a Ph.D. candidate in the School of Computer Science at Peking University. His research interests include software testing and analysis.

Yinzhu Li received her M.S. degree in Computer Science and Technology in 2012 from Tianjin Normal University. She is now an employee at Baidu Online Network Technology (Beijing) Co., Ltd., mainly working on automation testing. Her research interest is intelligent testing, including test case selection, test case generation, and fault localization.

Dan Hao is an associate professor at the School of Computer Science, Peking University, P.R. China. She received her Ph.D. in Computer Science from Peking University in 2008 and her B.S. in Computer Science from the Harbin Institute of Technology in 2002. She was a program co-chair of ASE 2021 and SANER 2022, a general co-chair of SPLC 2018, and has served on the program committees of many prestigious conferences (e.g., ICSE, FSE, ASE, and ISSTA). Her current research interests include software testing and debugging.

Lu Zhang is a professor at the School of Computer Science, Peking University, P.R. China. He received both his Ph.D. and B.Sc. in Computer Science from Peking University in 2000 and 1995, respectively. He was a postdoctoral researcher at Oxford Brookes University and the University of Liverpool, UK. He served on the program committees of many prestigious conferences, such as FSE, OOPSLA, ISSTA, and ASE. He was a program co-chair of SCAM 2008 and a program co-chair of ICSME 2017. He has been on the editorial boards of the Journal of Software Maintenance and Evolution: Research and Practice and Software Testing, Verification and Reliability. His current research interests include software testing and analysis, program comprehension, software maintenance and evolution, software reuse and component-based software development, and service computing.
APPENDIX A
CHARTS OF ITERATION NUMBER AND TIME COST

To better analyze the relationship between iteration number and time cost, we put the detailed results mentioned in Section 6.1 here. We draw a line chart of iteration number and time cost for each project. Note that, in order to see the trend, we only present the projects whose iteration number is no less than 20 (k ≥ 20). As the charts show, all projects follow a similar trend. In some projects, the first several iterations cost more time than the later ones. This is reasonable because, along with the decrease of the number of remaining test cases (m), prioritization becomes faster. The plots also support our claim that the iteration number contributes much to the time cost. As k is the coefficient of the O(kmn) time complexity, it largely determines the actual efficiency in practice, so we believe there is large room to reduce the time cost further.

[Figure: line charts of time cost in seconds (y-axis) against iteration number (x-axis), one panel per project with k ≥ 20: blueflood, camel-core, jopt-simple, languagetool, commons-math, la4j-new, mapdb-mapdb-1.0.9, jsprit, jsoup, rome-1.5.0, assertj-core, la4j, and commons-dbcp.]
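To make the per-iteration measurement concrete, the following minimal Python sketch records the wall-clock time of each iteration of a greedy-additional-style loop. It is an illustration only, not the paper's implementation: the set-based coverage representation and the randomly generated toy data are simplifying assumptions (the actual algorithms operate on adjacency matrices or adjacency lists).

import random
import time

def greedy_additional_with_timing(coverage):
    # Order tests with the greedy additional strategy, recording the
    # wall-clock time of each iteration; one iteration ends when no
    # remaining test adds new coverage and "uncovered" is reset.
    all_elems = set().union(*coverage.values())
    remaining = set(coverage)            # IDs of not-yet-selected tests
    order, iter_times = [], []
    while remaining:
        start = time.perf_counter()
        uncovered = set(all_elems)
        progressed = False
        while remaining:
            # pick the test covering the most still-uncovered elements
            best = max(remaining, key=lambda t: len(coverage[t] & uncovered))
            if not coverage[best] & uncovered:
                break                    # no additional coverage: reset
            order.append(best)
            remaining.discard(best)
            uncovered -= coverage[best]
            progressed = True
        if not progressed:               # leftover tests cover nothing at all
            order.extend(sorted(remaining))
            remaining.clear()
        iter_times.append(time.perf_counter() - start)
    return order, iter_times

# Toy data: 200 tests, each covering 20 of 500 elements at random.
coverage = {t: set(random.sample(range(500), 20)) for t in range(200)}
order, iter_times = greedy_additional_with_timing(coverage)
print(len(iter_times), "iterations; first iteration times:",
      ["%.4f s" % s for s in iter_times[:5]])

On such data the first iterations typically take longest, since each greedy scan still considers all m remaining test cases, which matches the trend described above.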
APPENDIX B
BASIC INFORMATION OF OPEN-SOURCE SUBJECTS

Table 6 shows some basic information of our 55 open-source subjects. Specifically, for each subject, we present the source lines of code (SLOC), test lines of code (TLOC), number of test cases (#Test cases), and number of mutants (#Mutants), respectively. The projects are sorted in ascending order of source lines of code.
APPENDIX C
RESULTS OF OPEN-SOURCE SUBJECTS

Due to the space limit, we show the complete results on open-source subjects in Table 7. The subjects are sorted in ascending order of source lines of code (SLOC). The first three columns present the results for RQ1, the first five columns present the results for RQ2, the first nine columns present the results for RQ3, and the last four columns present the comparison results with FAST. The detailed analysis can be found in Sections 6 and 7.
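For reference, the APFD values in Table 7 follow the standard definition: with T test cases and F faults, APFD = 1 − (TF_1 + · · · + TF_F)/(T · F) + 1/(2T), where TF_i is the 1-based position of the first test case that detects fault i. The minimal Python sketch below computes it on a toy fault matrix; the names and data are illustrative (not our evaluation code), and it assumes every fault is detected by at least one executed test, as the standard formula does.

def apfd(order, faults_detected_by):
    # order: list of test IDs in execution order
    # faults_detected_by: test ID -> set of fault IDs it detects
    position = {t: i + 1 for i, t in enumerate(order)}   # 1-based ranks
    all_faults = set().union(*faults_detected_by.values())
    T, F = len(order), len(all_faults)
    # TF_i: position of the first test in the order detecting fault i
    tf_sum = sum(min(position[t] for t in order
                     if f in faults_detected_by.get(t, set()))
                 for f in all_faults)
    return 1 - tf_sum / (T * F) + 1 / (2 * T)

# Toy example: 5 tests, 4 faults; prints 0.75.
faults = {"t1": {"f1"}, "t2": {"f1", "f3"}, "t3": set(),
          "t4": {"f2"}, "t5": {"f4"}}
print(round(apfd(["t2", "t4", "t5", "t1", "t3"], faults), 4))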

APPENDIX D
RESULTS OF INDUSTRIAL SUBJECTS

Due to the space limit, we present the complete results on industrial subjects in Table 8. For each subject, we present its SLOC, #Test cases, and the time cost of GA, AGA, FAST, ART-D, GA-S, and GE, respectively. The detailed analysis can be found in Section 8.
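As a worked example of the speedup figures analyzed in Section 8: dividing the GA time by the AGA time for subject I1 in Table 8 gives 29,278.9102 / 359.9679 ≈ 81.3X, and averaging such per-subject ratios over all 22 subjects yields the 44.27X average speedup of AGA over GA reported for the industrial study.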

Table 6: Basic Information for Open-Source Subjects


ID Subjects SLOC TLOC #Test Cases #Mutants

S1 DiskLruCache 780 1,030 61 152


S2 gson-fire 895 726 36 520
S3 gson-fire-v2 1,178 952 47 202
S4 jumblr 1,489 1,243 103 167
S5 java-apns 1,503 1,724 87 412
S6 jasmine-maven-plugin 1,671 1,931 102 561
S7 java-uuid-generator 1,790 2,388 45 346
S8 gdx-artemis-master 1,851 1,492 35 961
S9 jopt-simple 1,924 5,903 727 1,677
S10 protoparser 2,153 3,227 171 864
S11 jackson-datatype-guava 2,217 1,035 73 845
S12 jackson-datatype-guava-v2 2,366 1,327 80 320
S13 JActor 2,542 4,418 65 56
S14 spring-retry 2,765 3,419 185 351
S15 scribe-java 2,808 2,536 99 563
S16 metrics-core 2,835 2,194 150 1,656
S17 javapoet 2,986 4,399 332 973
S18 low-gc-membuffers 3,184 9,782 51 780
S19 lambdaj-master 3,634 4,914 265 3,399
S20 LastCalc-0.1 4,522 581 32 2,499
S21 stream-lib 4,835 3,806 141 3,811
S22 webbit 4,914 8,463 131 349
S23 commons-pool 5,206 8,232 272 633
S24 redline-smalltalk-master 5,648 480 43 3,450
S25 la4j 7,086 4,050 625 5,023
S26 redline-smalltalk 7,212 2,414 240 833
S27 nv-websocket-client 7,351 657 73 277
S28 joss 8,078 6,035 531 1,289
S29 raml-java-parser-master 8,696 3,005 192 4,506
S30 raml-java-parser 8,788 5,061 197 1,288
S31 la4j-v2 9,272 4,035 799 3,141
S32 commons-io 9,980 19,189 1,081 7,773
S33 streamex 10,427 7,906 450 3,958
S34 jsoup 10,507 12,037 666 3,157
S35 commons-dbcp 11,592 8,752 560 2,601
S36 rome-1.5.0 11,647 2,705 475 4,929
S37 assertj-core 13,361 53,059 2,470 4,571
S38 vraptor-archive 16,910 16,213 1,130 7,245
S39 mapdb-mapdb-1.0.9 17,589 35,873 1,776 876
S40 RoaringBitmap 17,807 21,494 1,148 21,319
S41 blueflood 19,517 15,774 961 1,854
S42 lanterna 20,682 7,724 34 344
S43 jackson-core 21,320 10,924 376 6,215
S44 jsprit 23,073 18,373 1,250 12,350
S45 hivemall 28,569 3,975 150 6,557
S46 asterisk-java 30,495 4,263 217 3,226
S47 asterisk-java-v2 31,074 4,258 217 921
S48 restcountries 31,324 468 40 113
S49 chukwa 32,654 8,051 131 569
S50 ews-java-api 45,313 1,328 90 1,782
S51 languagetool 47,589 20,778 719 26,662
S52 OpenTripPlanner-otp-0.20.0 64,718 14,207 379 7,325
S53 hbase-1.2.2 66,630 17,385 434 1,781
S54 commons-math 86,748 90,798 5,082 84,476
S55 camel-core 120,248 134,036 5,623 13,005

Total 912,045 633,085 31,454 262,295


Table 7: Results of Open-Source Subjects


Project TimeGA TimeI TimeGAF TimeC APFDGAF APFDGA APFDAGA TimeAGA APFDFAST WinAPFD TimeFAST WinTime
(Column groups: RQ1, RQ2, RQ3, and the comparison with FAST, as described in Appendix C.)
S1 0.0197 0.0197 0.0069 0.0157 X 0.8809 0.9070 0.9070 X 0.0157 X 0.8164 X 0.0717 X
S2 0.0072 0.0072 0.0030 0.0058 X 0.8369 0.8380 0.8380 X 0.0058 X 0.7164 X 0.0348 X
S3 0.0074 0.0074 0.0038 0.0222 0.8868 0.8848 0.8848 X 0.0222 0.6916 X 0.0228 X
S4 0.0140 0.0140 0.0058 0.0789 0.8505 0.8509 0.8509 X 0.0789 0.7183 X 0.0140
S5 0.0314 0.0314 0.0163 0.0089 X 0.8527 0.8527 0.8527 X 0.0089 X 0.7285 X 0.0542 X
S6 0.0115 0.0115 0.0149 0.0046 X 0.8101 0.8101 0.8101 X 0.0046 X 0.6731 X 0.0315 X
S7 0.0045 0.0045 0.0027 0.0076 0.9045 0.9059 0.9059 X 0.0076 0.7531 X 0.0163 X
S8 0.0202 0.0202 0.0092 0.0180 X 0.8913 0.8898 0.8898 X 0.0180 X 0.6771 X 0.1102 X
S9 1.7455 1.7105 1.7891 0.5288 X 0.9144 0.9144 0.9144 X 0.3363 X 0.8688 X 1.8445 X
S10 0.0871 0.0871 0.0471 0.0247 X 0.9514 0.9518 0.9518 X 0.0247 X 0.7639 X 0.1326 X
S11 0.0975 0.0975 0.0464 0.0553 X 0.8741 0.8766 0.8766 X 0.0553 X 0.6758 X 0.3315 X
S12 0.0243 0.0243 0.0090 0.0108 X 0.8854 0.8864 0.8864 X 0.0106 X 0.6693 X 0.0452 X
S13 0.0383 0.0383 0.0158 0.0385 0.8500 0.8615 0.8615 X 0.0385 0.7584 X 0.1346 X
S14 0.0876 0.0870 0.0327 0.0407 X 0.9188 0.9188 0.9188 X 0.0385 X 0.6149 X 0.1315 X
S15 0.0221 0.0221 0.0117 0.0055 X 0.8582 0.8582 0.8582 X 0.0055 X 0.7052 X 0.0381 X
S16 0.0723 0.0723 0.0357 0.0152 X 0.8010 0.8031 0.8031 X 0.0152 X 0.6293 X 0.0793 X
S17 0.3706 0.3693 0.2144 0.1357 X 0.9114 0.9183 0.9183 X 0.1317 X 0.7128 X 0.7061 X
S18 0.0323 0.0323 0.0167 0.0226 X 0.9026 0.9028 0.9028 X 0.0226 X 0.7688 X 0.1115 X
S19 1.9204 1.9201 1.5941 0.7017 X 0.9003 0.9033 0.9033 X 0.6995 X 0.6955 X 3.6103 X
S20 0.0655 0.0655 0.0336 0.0469 X 0.8014 0.8013 0.8013 X 0.0469 X 0.6636 X 0.3444 X
S21 0.0942 0.0942 0.0416 0.0248 X 0.8353 0.8328 0.8328 X 0.0248 X 0.6847 X 0.1313 X
S22 0.0413 0.0413 0.0217 0.0215 X 0.8642 0.8642 0.8642 X 0.0215 X 0.7375 X 0.1019 X
S23 0.3202 0.3202 0.2099 0.0962 X 0.8189 0.8198 0.8198 X 0.0962 X 0.6742 X 0.4346 X
S24 0.0239 0.0217 0.0079 0.0198 X 0.9850 0.9858 0.9858 X 0.0187 X 0.9143 X 0.0919 X
S25 6.9852 2.7593 1.0409 4.1901 X 0.7135 0.8450 0.8401 1.0579 X 0.7500 X 5.7828 X
S26 0.0786 0.0786 0.0397 0.0294 X 0.8322 0.8339 0.8339 X 0.0294 X 0.6079 X 0.0435 X
S27 0.0095 0.0095 0.0041 0.0331 0.9614 0.9614 0.9614 X 0.0303 0.6707 X 0.0201
S28 0.5833 0.5819 0.3976 0.1179 X 0.9153 0.9164 0.9164 X 0.1132 X 0.7159 X 0.5634 X
S29 8.5669 8.5669 5.0644 2.3900 X 0.9482 0.9490 0.9490 X 2.3900 X 0.7234 X 14.0143 X
S30 0.4128 0.4128 0.2461 0.1804 X 0.9620 0.9617 0.9617 X 0.1804 X 0.7599 X 0.9181 X
S31 2.0359 1.9270 0.9671 0.4384 X 0.9511 0.9426 0.9426 X 0.3170 X 0.6473 X 1.5359 X
S32 1.0693 1.0649 0.8031 0.1415 X 0.8889 0.8911 0.8911 X 0.1270 X 0.7007 X 0.3410 X
S33 0.6033 0.6033 0.5780 0.0921 X 0.8659 0.8662 0.8662 X 0.0921 X 0.6105 X 0.4881 X
S34 5.7624 5.6923 6.0388 0.8250 X 0.9277 0.9328 0.9328 X 0.7903 X 0.7515 X 4.2321 X
S35 1.5128 1.3176 0.6824 0.5911 X 0.9163 0.9473 0.9467 0.3631 X 0.7872 X 1.5940 X
S36 210.0549 123.6547 32.9429 46.6052 X 0.8644 0.9418 0.9371 23.7849 X 0.8092 X 129.8628 X
S37 5.2729 3.2186 4.6264 2.4364 X 0.8508 0.8507 0.8507 X 0.3582 X 0.6925 X 1.0458 X
S38 3.6650 3.6650 3.9989 0.2794 X 0.8657 0.8657 0.8657 X 0.2794 X 0.7608 X 1.2374 X
S39 42.3862 35.0646 17.5059 5.1276 X 0.8679 0.9545 0.9545 X 2.2078 X 0.8279 X 11.6840 X
S40 4.0109 4.0064 3.0029 0.6129 X 0.9198 0.9244 0.9244 X 0.5942 X 0.6993 X 1.6539 X
S41 0.9968 0.9890 0.7100 0.1797 X 0.9040 0.9106 0.9106 X 0.1562 X 0.7181 X 0.3880 X
S42 0.0034 0.0034 0.0035 0.0287 0.8574 0.8569 0.8569 X 0.0287 0.6954 X 0.0267
S43 1.4103 1.4103 1.4142 0.2558 X 0.8913 0.8924 0.8924 X 0.2558 X 0.6681 X 1.5863 X
S44 7.2241 5.9892 4.4672 1.7918 X 0.9159 0.9261 0.9240 0.8465 X 0.7696 X 4.2414 X
S45 0.1297 0.1297 0.1095 0.0597 X 0.8466 0.8464 0.8464 X 0.0597 X 0.6643 X 0.2205 X
S46 0.2842 0.2842 0.2118 0.0858 X 0.8648 0.8656 0.8656 X 0.0858 X 0.6642 X 0.5231 X
S47 0.1310 0.1310 0.0754 0.0629 X 0.8751 0.8750 0.8750 X 0.0629 X 0.7217 X 0.1188 X
S48 0.0025 0.0025 0.0016 0.0025 0.7939 0.7939 0.7939 X 0.0025 0.6446 X 0.0064 X
S49 0.1545 0.1545 0.0867 0.1120 X 0.7997 0.8009 0.8009 X 0.1120 X 0.6620 X 0.2295 X
S50 0.0071 0.0070 0.0029 0.0049 X 0.8476 0.8517 0.8517 X 0.0047 X 0.7722 X 0.0069 X
S51 8.5532 8.4018 8.6768 1.2770 X 0.8571 0.8671 0.8671 X 1.1036 X 0.6025 X 5.8874 X
S52 1.5933 1.5887 1.5922 0.4241 X 0.9530 0.9542 0.9542 X 0.4052 X 0.7708 X 1.8314 X
S53 1.0478 1.0459 0.9088 0.2427 X 0.8673 0.8710 0.8710 X 0.2298 X 0.6630 X 0.8469 X
S54 100.2525 98.7254 78.0955 3.5015 X 0.9089 0.9089 0.9089 X 2.6648 X 0.7524 X 7.8017 X
S55 1,288.1519 1,236.9016 734.9618 32.4581 X 0.9516 0.9544 0.9544 X 18.1040 X 0.8269 X 88.3705 X
Total 48 0.8870 0.8870 51 48 55 52

Table 8: Results of Industrial Subjects


Subject* SLOC** #Test Cases GA AGA FAST ART-D GA-S GE
(SLOC and #Test Cases give basic information; the remaining columns report the time cost of each technique in seconds.)
I1 >500K 4,246 29,278.9102 359.9679 X 1,860.1473 543,106.2852 2,680,036.2615 54,969.0830
I2 >200K 2,546 3,018.6473 89.9239 X 398.8814 32,938.6045 315,888.7090 13,887.3160
I3 >200K 2,566 3,228.2772 86.0066 X 417.8356 30,458.8672 304,555.6710 14,124.1435
I4 >200K 2,550 2,833.4841 80.5940 X 383.9494 24,944.4404 265,881.1139 19,345.1543
I5 >200K 2,556 3,289.5958 94.0641 X 428.5125 31,799.7539 366,902.7648 8,424.3798
I6 >500K 4,123 22,118.0296 329.4710 X 1,439.6848 402,039.4240 1,766,206.5003 49,274.3782
I7 >500K 4,139 21,963.5968 336.3432 X 1,600.3634 411,725.1937 2,390,410.2541 54,897.2351
I8 >200K 2,529 4,250.2729 89.2680 X 446.4625 36,610.5757 461,096.5509 3,857.2345
I9 >500K 4,134 22,057.8564 335.8682 X 1,450.5679 28,328.2207 2,091,910.4123 37,817.4141
I10 >200K 2,542 3,238.5423 96.6740 X 418.7254 769,960.0223 265,087.9653 7,134.1514
I11 >500K 4,133 23,749.9149 348.1437 X 1,531.0934 398,946.3216 2,537,854.1564 39,417.0345
I12 >500K 4,137 22,194.6776 342.6023 X 1,466.4241 398,254.6365 2,016,031.3451 38,741.9410
I13 >500K 4,128 22,545.8684 362.5389 X 1,470.3869 446,056.7049 2,018,768.3295 49,287.1451
I14 >200K 2,234 571.9417 22.2583 X 85.0108 4,999.5140 37,081.3254 487.0905
I15 >500K 2,201 6,517.1065 190.7537 X 926.5795 71,541.1456 513,769.5738 19,481.4108
I16 >20K 202 7.4382 3.5816 X 9.7204 87.4167 601.2848 42.7104
I17 >200K 2,216 599.1948 16.0822 X 85.3307 12,411.9608 32,268.7012 7,015.4581
I18 >20K 299 11.6980 2.2721 X 10.5942 83.9378 988.3095 38.6094
I19 >500K 3,993 21,482.4772 335.6216 X 1,750.2093 444,997.4857 2,295,089.0447 64,510.4519
I20 >200K 2,206 586.5093 18.7069 X 87.0280 6,905.6778 75,574.2453 1,048.8951
I21 >20K 281 8.0470 1.8397 X 9.1955 34.1523 610.4776 19.9627
I22 >500K 4,034 24,446.3671 335.9041 X 1,778.7107 466,512.4680 2,636,222.8890 52,941.8715
Total >6,860K 61,995 22
* We hide project names due to the confidentiality policy.
** We report only a rough scale of SLOC due to the confidentiality policy.
