AGA: An Accelerated Greedy Additional Algorithm for Test Case Prioritization
Abstract—In recent years, many test case prioritization (TCP) techniques have been proposed to speed up the process of fault detection. However, little work has taken the efficiency problem of these techniques into account. In this paper, we target the Greedy Additional (GA) algorithm, which has been widely recognized to be effective but less efficient, and try to improve its efficiency while preserving effectiveness. In our Accelerated GA (AGA) algorithm, we use some extra data structures to reduce redundant data accesses in the GA algorithm, and thus the time complexity is reduced from O(m²n) to O(kmn) when n > m, where m is the number of test cases, n is the number of program elements, and k is the iteration number. Moreover, we observe the impact of iteration
numbers on prioritization efficiency on our dataset and propose to use a specific iteration number in the AGA algorithm to further improve the efficiency. We conducted experiments on 55 open-source subjects. In particular, we implemented each TCP algorithm with two kinds of widely-used input formats, adjacency matrix and adjacency list. Since a TCP algorithm with the adjacency matrix is less efficient than the same algorithm with the adjacency list, the result analysis is mainly conducted based on TCP algorithms with the adjacency list. The results show that AGA achieves a 5.95X speedup ratio over GA on average, while it achieves the same average effectiveness as GA in terms of the Average Percentage of Faults Detected (APFD). Moreover, we conducted an industrial case study on 22 subjects collected from Baidu, and found that the average speedup ratio of AGA over GA is 44.27X, which indicates the practical usage of AGA in real-world scenarios.
Note: This is a preprint of the accepted paper “Feng Li, Jianyi Zhou, Yinzhu Li, Dan Hao, and Lu Zhang. AGA: An Accelerated
Greedy Additional Algorithm for Test Case Prioritization. IEEE Transactions on Software Engineering, 2021”, which can be
accessed at https://ieeexplore.ieee.org/document/9662236.
testing consumes about 80% of the testing cost [16]. For example, Google developers modify source code one time per second on average [15]. To improve the efficiency of regression testing, it is necessary to apply TCP more than once, because frequent code modification may hamper the effectiveness of TCP [17]. That is, considering the practical application of TCP, including the GA algorithm, both effectiveness and efficiency are important.

However, existing TCP approaches, including the GA algorithm, suffer from the efficiency problem; e.g., previous work shows that most existing TCP approaches cannot deal with large-scale application scenarios [13], [15], [18]. Furthermore, some work [13], [15], [18] points out that the GA algorithm spends a dramatically long time on prioritization. Note that in the 20-year history of GA, no approach has been proposed to improve its efficiency while preserving its high effectiveness.

In this paper, we make the first attempt to accelerate the GA algorithm while maintaining its effectiveness. In particular, we analyze the efficiency problem of the GA algorithm and propose to accelerate the GA algorithm through two enhancements. The proposed algorithm is called the Accelerated Greedy Additional (abbreviated as AGA) algorithm.

First, many redundant data accesses occur during prioritization in GA. Whenever a test case is selected, the GA algorithm scans the coverage information of all test cases to mark elements covered by this selected test case and calculates the number of unmarked elements covered by each unselected test case. Such scanning is inefficient and may contain many redundant data accesses. Therefore, we design some extra data structures (e.g., indices) to summarize the coverage information of each test case in the AGA algorithm. Suppose that m, n, and k are the number of test cases, the number of elements, and the number of iterations of the GA strategy (called the iteration number in this paper), respectively. Given n > m (which is true in most cases), the time complexity of our AGA algorithm is O(kmn), while the time complexity of the GA algorithm is O(m²n). The value of k determines to what extent the former is superior to the latter. In practice, k is usually much smaller than m, and in our approach, k is fixed as a constant (by the second part below), so our O(kmn) is superior to O(m²n).

Second, the GA algorithm proposed by Elbaum et al. [3] repeats the GA strategy multiple times in TCP and thus the iteration number is usually larger than 1. Intuitively, when an element has been covered enough times, the probability that it still contains faults is low, so the remaining iterations may not contribute to the effectiveness but only decrease TCP efficiency. Therefore, we investigated this relation empirically and applied it to modify the GA algorithm to improve efficiency while preserving effectiveness. To sum up, our AGA algorithm consists of two parts, time complexity reduction and iteration number reduction. Note that the theoretical improvement is rather important and gives a clear assurance of high efficiency under any situation (especially in the first part of AGA). Also, our simple technique with theoretical improvement is meaningful in practice and illustrates the simple nature of the problem.

We conducted controlled experiments using 55 open-source projects from GitHub (whose total lines of code range from 1,621 to 177,546). Because the algorithm input (program coverage) has two kinds of formats, adjacency matrix and adjacency list, we conduct our experiments on both of them, which is discussed in Section 2.1. In the experiments, we studied the contributions of the two parts of AGA separately, and found that both of them improve the efficiency to a large extent. Furthermore, we investigated the effectiveness and efficiency of AGA by comparing it with GA. The results showed that on average the speedup ratio of AGA over GA is 5.95X and 27.72X on the two input formats, which is a very large improvement. We also found that the average APFD of AGA and GA is the same, and Analysis of Covariance (ANCOVA) [19] shows no significant difference between them. Moreover, the effect size (Cohen's d) also indicates a small effect.

We also empirically compared AGA with FAST [18], which focuses on the TCP efficiency problem. As FAST [18] targets a different problem, improving the time efficiency by sacrificing effectiveness, such a comparison in terms of efficiency may be a bit unfair to our AGA approach. Surprisingly, the results showed that the average speedup ratio of AGA over FAST is 4.29X (with significant difference and medium effect), which means AGA even outperforms a technique that sacrifices effectiveness to achieve high efficiency. Also, the average APFD difference by which AGA exceeds FAST is 0.1702, and ANCOVA shows that the difference is statistically significant. Moreover, the effect size (Cohen's d) also indicates a huge effect.

We further performed an industrial case study in Baidu, a famous Internet service provider with over 600M monthly active users. In particular, we compared the performance of AGA and GA on 22 subjects of Baidu. In this industrial case study, the average speedup ratio of AGA over GA is 44.27X and 61.43X on the two input formats, which indicates the usefulness of AGA in real-world large-scale scenarios. Also, AGA is faster than FAST on all 22 subjects and achieves a 4.58X speedup ratio on average, and the difference is statistically significant with a very large effect. Due to commercial constraints, we cannot access the source code of these projects, and the developers in Baidu also do not record the fault positions in the history, which are necessary to calculate the APFD results. So, we did not compare the effectiveness of these approaches in this study.

The contributions of this work are summarized as below.
• The first attempt to improve the efficiency of GA while preserving its effectiveness, since GA is believed to have high effectiveness. In particular, we resolve the efficiency issue of GA through a theoretical improvement, which gives a clear assurance of high efficiency under any situation.
• An approach to accelerating the widely-known GA algorithm through two parts, including time complexity reduction and iteration number reduction. With the former, the complexity is reduced from O(m²n) to O(kmn) given n > m, which is theoretically proved; with the latter, the corresponding AGA algorithm is more efficient and can be as competitive as GA regarding effectiveness, which is empirically shown. In fact, although it seems like an easy-to-implement algorithm, in the broad literature, nobody has realized this optimization and the subsequent reduction of complexity. Therefore, this paper is the first to systematically analyze this
problem and propose and evaluate the optimization approach, which is helpful for the community.
• Large-scale experiments on 55 open-source projects demonstrating the effectiveness and efficiency of our AGA approach, compared with the GA algorithm.
• An empirical comparison of AGA with FAST, which improves time efficiency but decreases effectiveness.
• An industrial case study on 22 subjects from Baidu, which indicates the practical usage of AGA in real-world scenarios.

2 TIME COMPLEXITY REDUCTION
In this section, we review the Greedy Additional (GA) algorithm by an example (in Section 2.1). By analyzing its time complexity (in Section 2.2), we propose to accelerate GA through extra-defined data structures (in Section 2.3). Such modification improves the efficiency of GA so that the time complexity becomes O(kmn) (given n > m), whereas the complexity of GA is O(m²n), where n is the number of program elements (e.g., statements, branches, methods) covered by the test suite, m is the number of test cases in the test suite, and k is the iteration number.

2.1 Example
Table 1 presents an example showing the coverage information of a test suite. This test suite consists of five test cases (i.e., T1, T2, . . . , and T5) and covers five program elements (i.e., E1, E2, . . . , and E5). A common representation form of coverage information is the adjacency matrix, which is shown in Table 1(a). "✓" represents that the test case covers the corresponding program element, while "×" represents the opposite. Another representation form of coverage information is the adjacency list, which is shown in Table 1(b). In our example, the two forms represent exactly the same information.

Table 1: An Example

(a) Adjacency Matrix

Test Cases | E1 | E2 | E3 | E4 | E5
T1         | ✓  | ✓  | ✓  | ×  | ×
T2         | ×  | ×  | ✓  | ✓  | ✓
T3         | ✓  | ✓  | ×  | ×  | ×
T4         | ×  | ×  | ✓  | ✓  | ×
T5         | ×  | ×  | ×  | ×  | ✓

(b) Adjacency List

Test Cases | Covered Elements
T1         | E1, E2, E3
T2         | E3, E4, E5
T3         | E1, E2
T4         | E3, E4
T5         | E5

If we take the adjacency matrix as input, the GA algorithm runs as follows. First, no element has been covered yet, so the algorithm scans the whole table to calculate the number of elements covered by each test case. Then it chooses T1 or T2, since both of them cover the most elements. Suppose that this algorithm chooses T1; then T2, T3, T4, and T5 remain unselected. As the selected test case T1 covers elements E1, E2, and E3, the remaining elements E4 and E5 are uncovered. The algorithm scans the whole table again to find that T2, T3, T4, and T5 cover 2, 0, 1, and 1 of the 2 uncovered elements, respectively. So, the GA algorithm chooses T2 as the next test case. Now, all elements have been covered and the GA algorithm [3] starts another iteration by resetting all elements to "uncovered". Finally, the test execution sequence produced by the GA algorithm is "T1, T2, T3, T4, T5". On the other hand, provided the adjacency list as input, GA runs similarly and produces the same output.
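To make the two formats concrete, the Table 1 example can be encoded in Python as follows (a minimal sketch; the variable names are ours, not the paper's implementation):

```python
tests = ["T1", "T2", "T3", "T4", "T5"]
elements = ["E1", "E2", "E3", "E4", "E5"]

# (a) Adjacency matrix: rows are test cases, columns are elements,
# 1 means "covered" and 0 means "not covered".
coverage_matrix = [
    [1, 1, 1, 0, 0],  # T1
    [0, 0, 1, 1, 1],  # T2
    [1, 1, 0, 0, 0],  # T3
    [0, 0, 1, 1, 0],  # T4
    [0, 0, 0, 0, 1],  # T5
]

# (b) Adjacency list: each test case maps to the elements it covers.
coverage_list = {
    "T1": ["E1", "E2", "E3"],
    "T2": ["E3", "E4", "E5"],
    "T3": ["E1", "E2"],
    "T4": ["E3", "E4"],
    "T5": ["E5"],
}
```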
2.2 Analysis of the GA Algorithm
In this section, we analyze the time complexity of the GA algorithm through its general implementation. Suppose the coverage information is recorded in a table like Table 1(a). The GA algorithm first scans the whole table to find the row with the most "✓" entries and selects the corresponding test case into the prioritized sequence. When a test case is selected and added to the sequence, the GA algorithm scans the whole table to find the "✓"s whose corresponding element is covered by the latest selected test case. These "✓"s are replaced by "×"s. The GA algorithm repeats the preceding process until all the entries in the table are "×"s or all the test cases have been selected. In the latter case the termination condition is satisfied and the GA algorithm ends by producing a prioritized test suite; otherwise, GA reuses the initial table, replacing "✓"s with "×"s for each selected test case, and repeats the preceding process again.
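The following is a minimal Python sketch of this general implementation (function and variable names are ours, and the guard for tests that cover nothing at all is our addition); the repeated full-table scans in the selection and marking steps are what drive the O(m²n) cost analyzed below:

```python
def greedy_additional(matrix):
    """Baseline GA: matrix[i][j] == 1 iff test i covers element j."""
    m, n = len(matrix), len(matrix[0])
    work = [row[:] for row in matrix]   # working copy, scanned repeatedly
    selected = [False] * m
    order = []
    while len(order) < m:
        # Scan the whole table to find the unselected test covering the
        # most still-uncovered elements -- O(mn) per selection.
        best, best_count = -1, -1
        for i in range(m):
            if not selected[i]:
                count = sum(work[i])
                if count > best_count:
                    best, best_count = i, count
        if best_count == 0:
            # All elements covered: start a new iteration by resetting
            # the rows of the remaining unselected tests.
            if all(sum(matrix[i]) == 0 for i in range(m) if not selected[i]):
                order.extend(i for i in range(m) if not selected[i])
                break
            for i in range(m):
                if not selected[i]:
                    work[i] = matrix[i][:]
            continue
        selected[best] = True
        order.append(best)
        # Scan the whole table again to mark the newly covered elements.
        for j in range(n):
            if matrix[best][j] == 1:
                for i in range(m):
                    work[i][j] = 0
    return order
```

On the Table 1 example, greedy_additional(coverage_matrix) returns [0, 1, 2, 3, 4], i.e., "T1, T2, T3, T4, T5". Ties are broken by the lowest index, matching the tie-breaking rule described in Section 5.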
Suppose that there are m test cases in the given test suite to be prioritized and n program elements covered by the test suite. The GA algorithm needs to scan the whole table m times and thus the time complexity is O(m²n), as shown by previous work [3], [7], [8]. However, many accesses of the table are redundant. First and most importantly, every time the coverage table is updated, the GA algorithm recalculates the total "✓" entries of each unselected test case, without reusing the previous calculation. Second, none of the accesses to "×"s in the table is necessary, because the GA algorithm never updates them in the process. Third, in order to find the elements covered by the latest selected test case, the GA algorithm scans all elements in the table, which is also unnecessary. Let us illustrate the preceding redundant accesses with the example. When T1 is selected first, the GA algorithm scans Row T1 and finds three "✓"s. Among the five accesses (i.e., E1, E2, . . ., E5), the accesses of E4 and E5 are redundant. Then, the GA algorithm changes the state of E1, E2, and E3 in the other four test cases from "✓" to "×". During this process, it is also unnecessary to access the state "×". Then, the GA algorithm scans the whole table to select the next test case, but this process can be optimized by analyzing the updated columns and the previous calculation of the total number of "✓"s covered by each test case. To sum up, due to such a large number of redundant accesses in the GA algorithm, it is possible to reduce its time cost and improve its efficiency.

If we take the adjacency list as input, a similar analysis can be done. First, the accesses of "×"s to find covered
elements in one row are reduced, while more time is spent on finding all test cases that cover a specific element (through scanning the whole list). As a result, the overall time complexity remains O(m²n). Second, many accesses of the list are redundant, too. Following our previous analysis, the time efficiency can be improved by reducing these unnecessary operations.

2.3 Improvement of Time Complexity
To reduce such redundant accesses, we propose the AGA_C approach, which defines extra data structures that reuse previous information collected during execution. In particular, we use a list to record the total number of elements covered by each test case and dynamically update it during prioritization, in order to alleviate the scanning of the coverage table. We also use forward and inverted indices to avoid the data accesses of "×" entries in the table.

Algorithm 1: AGA_C algorithm
  Input: Coverage information M
  Output: Prioritized test cases P
   1  Initialize TC, HS, HC, FI, and II from M; t = 0;
   2  Set P as an empty list;
   3  while t < m do
   4      Find the largest value in TC whose corresponding test case t has not been selected (with the use of HS);
   5      if no test case can be selected then
   6          Restore TC to its original value;
   7          Restore HC to its original value;
   8          continue;
   9      end
  10      Add t into P;
  11      Mark t as selected in HS;
  12      forall j in FI[t] do
  13          if HC[j] is "uncovered" then
  14              Mark j as "covered" in HC;
  15              forall i in II[j] do
  16                  Decrease TC[i] by 1;
  17              end
  18          end
  19      end
  20      t = t + 1;
  21  end
  22  Return P;

Our AGA_C algorithm is shown in Algorithm 1. Line 1 initializes several data structures. TC is a list of length m recording the number of elements covered by each test case; in our example, TC is [3, 3, 2, 2, 1] from Table 1 by definition. HS is a list of length m recording whether each test case has been selected. HC is a list of length n recording whether each element has been covered by previous test cases. FI are forward indices that index all elements covered by each test case, while II are inverted indices that index all test cases that cover each element. From Table 1, in our example, FI records that T1 covers [E1, E2, E3], T2 covers [E3, E4, E5], etc.; II records that E1 is covered by [T1, T3], E2 is covered by [T1, T3], etc. Line 2 initializes P as the empty list. Then, from Line 3 to Line 21, the algorithm selects m test cases in turn. First, it chooses the largest value in TC whose test case t is marked unselected in HS. The algorithm adds t to the prioritized list P and marks it in HS. In our example, in the first loop, T1 is selected (since it covers the most program elements), marked in HS, and added to P. Then, for every element j in FI[t] that is marked uncovered in HC, the algorithm marks it as covered, and for every test case i in II[j], the algorithm subtracts 1 from TC[i]. In our example, in the first loop, E1, E2, and E3 are marked covered and the updated TC is [0, 2, 0, 1, 1]. Finally, the algorithm continues to select the next test case by repeating the process. As shown in Line 5 to Line 9, if all elements have been covered by selected test cases, the algorithm completes the current iteration and restores the original TC to start the next iteration. In our example, after T1 and T2 are selected, the original TC is restored. The total number of iterations is called the iteration number.
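A compact Python transcription of Algorithm 1 may make the bookkeeping concrete (a sketch under the paper's definitions; the names and the guard for tests that cover nothing at all are ours):

```python
def aga_c(coverage_list, m, n):
    """AGA_C sketch: coverage_list[i] is the list of element indices
    covered by test i (the adjacency-list input of Table 1(b))."""
    FI = [list(cov) for cov in coverage_list]      # forward indices
    II = [[] for _ in range(n)]                    # inverted indices
    for i, cov in enumerate(coverage_list):
        for j in cov:
            II[j].append(i)
    TC0 = [len(cov) for cov in coverage_list]      # original TC values
    TC = TC0[:]                                    # remaining-coverage counts
    HS = [False] * m                               # selected flags
    HC = [False] * n                               # covered flags
    P = []
    while len(P) < m:
        # Pick the unselected test with the largest TC value
        # (ties broken by the lowest index, i.e., the topmost test).
        best = max((i for i in range(m) if not HS[i]), key=lambda i: TC[i])
        if TC[best] == 0:
            if all(TC0[i] == 0 for i in range(m) if not HS[i]):
                P.extend(i for i in range(m) if not HS[i])
                break
            TC = TC0[:]                            # restore TC and HC to start
            HC = [False] * n                       # the next iteration
            continue
        P.append(best)
        HS[best] = True
        for j in FI[best]:                         # only "covered" entries touched
            if not HC[j]:
                HC[j] = True
                for i in II[j]:
                    TC[i] -= 1
    return P
```

On the Table 1 example (m = 5, n = 5, coverage lists [[0,1,2],[2,3,4],[0,1],[2,3],[4]]), aga_c returns [0, 1, 2, 3, 4], the same order as GA. The max over TC costs O(m) per selection, and the TC updates walk only "✓" entries, which yields the complexity analyzed next.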
Furthermore, we analyze the time complexity of our AGA_C algorithm. All initialization operations consume O(mn) time. Each calculation of the maximum value in TC consumes O(m) time, which leads to O(m²) time in total. The number of updates of TC equals the number of elements in FI (also equal to the number of test cases in II) in an iteration, which is the number of "✓" entries in the coverage matrix. So, in each iteration, the algorithm updates TC up to O(mn) times, and the total time for updating TC is O(kmn), where k is the iteration number. Generally speaking, the number of elements is often larger than the number of test cases, which means n > m. So, according to the definition of Big-O notation, the total time complexity O(kmn + m²) can be simplified as O(kmn + m²) = O(kmn + mn) = O((k + 1)mn) = O(kmn), where k is the iteration number. Note that in most cases n > m obviously holds, which can also be verified by the subject statistics in this paper (given by Table 6). For other special cases, the original time complexity O(kmn + m²) is still a large improvement.

In addition, our algorithm uses more storage space to maintain the extra data structures in order to improve the time complexity, so it is necessary to analyze the space complexity, too. In GA, the coverage information (adjacency matrix/list) takes O(mn) space, and additional O(1) space is used to store temporary variables in the algorithm, which means the overall space complexity of GA is O(mn). In AGA, the same O(mn) space is used to store the coverage table, while TC, HS, HC, FI, and II need O(m), O(m), O(n), O(mn), and O(mn) space, respectively. So, the overall space complexity of AGA is O(mn), which is the same as GA, with the only difference lying in the constant factor.

3 ITERATION NUMBER REDUCTION
From Section 2, we obtain a new approach with time complexity O(kmn), where k is the iteration number. In practice, k is often much smaller than m in most projects, because usually many test cases are needed to cover all elements in an iteration. However, in the worst case, k may be equal to m, indicating that the worst-case time complexity of our AGA algorithm is still the same as that of the GA algorithm. To further improve the efficiency of the GA algorithm, especially in the worst case, in this section, we discuss
the impact of the iteration number and introduce another modification adopted in our AGA algorithm. Finally, we present an experiment to evaluate the impact of the iteration number on the GA algorithm.

3.1 Modification with Iteration Number Reduction
Let us re-examine the definition of "an iteration". In this paper, the process of selecting some test cases from covering 0 elements to covering all possible elements, and then resetting them to be "uncovered", is called "an iteration". Intuitively, the iteration number may have a large impact on the time cost of the GA algorithm. The difference between the GA-first algorithm [5] and the GA algorithm [3] also indicates the influence of such an iteration number. Moreover, between these two algorithms, there exist many other potential algorithms, depending on how many times the GA strategy is used (i.e., the iteration number of the GA strategy) and what strategy is used to deal with the remaining unselected test cases (e.g., the Greedy Total strategy, which schedules test cases in descending order of the number of total covered program elements).

m = k × l    (1)

Here, we define the average number of test cases selected in one iteration as l, so we can deduce Formula (1). According to Formula (1), if a large number of test cases are selected in one iteration, the total iteration number of this project is small; if few test cases are selected in one iteration, the total iteration number of this project is large. As our goal is to improve efficiency while preserving effectiveness, projects with a small iteration number are already efficient enough, and the time complexity O(kmn) can be reduced to O(mn). For those projects with a large iteration number, it is necessary to optimize the iteration number to some extent.

In fact, every time a program element is covered, the probability that it still contains faults decreases. After many iterations, all elements have been covered enough times. On one hand, if all faults have been revealed after these iterations, the remaining iterations are useless for detecting faults and only increase the time cost. On the other hand, if several faults still exist after many iterations, they are supposed to be hard to reveal, and the remaining iterations may, intuitively, only reveal them by chance. So, we conjecture that after some iterations, the effectiveness of GA just fluctuates along with the remaining iterations.

Based on the above reasoning, we introduce another component of the proposed AGA algorithm, AGA_I. AGA_I reduces the time cost by reducing the iteration number. Different from the GA algorithm, AGA_I does not repeat applying the GA strategy until all the test cases are prioritized, but stops when the specified iteration number is reached. For the remaining unselected test cases, AGA_I applies other less costly techniques (e.g., the Greedy Total technique (GT) [5], which is commonly used in previous work and also in this paper). Take Table 1 as an example: the original iteration number is 2. If we reduce it to 1, AGA_I does not repeat the additional strategy after selecting T1 and T2, and prioritizes the remaining test cases using GT.
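A sketch of this modification, reusing the AGA_C data structures from Section 2.3 (the function name, the k_max parameter, and the stable sort for the Greedy Total fallback are our illustrative choices):

```python
def aga_i(coverage_list, m, n, k_max=10):
    """AGA_I sketch: cap the number of GA iterations at k_max, then order
    the leftover tests by Greedy Total (descending total coverage)."""
    FI = [list(cov) for cov in coverage_list]
    II = [[] for _ in range(n)]
    for i, cov in enumerate(coverage_list):
        for j in cov:
            II[j].append(i)
    TC0 = [len(cov) for cov in coverage_list]
    TC, HS, HC = TC0[:], [False] * m, [False] * n
    P, iteration = [], 1
    while len(P) < m and iteration <= k_max:
        best = max((i for i in range(m) if not HS[i]), key=lambda i: TC[i])
        if TC[best] == 0:
            if all(TC0[i] == 0 for i in range(m) if not HS[i]):
                break
            TC, HC = TC0[:], [False] * n   # start the next iteration
            iteration += 1
            continue
        P.append(best)
        HS[best] = True
        for j in FI[best]:
            if not HC[j]:
                HC[j] = True
                for i in II[j]:
                    TC[i] -= 1
    # Greedy Total fallback: remaining tests in descending order of total
    # coverage; Python's stable sort keeps the developer-given order on ties.
    rest = sorted((i for i in range(m) if not HS[i]), key=lambda i: -TC0[i])
    return P + rest
```

On the Table 1 example with k_max = 1, the sketch selects T1 and T2 with the additional strategy and then appends T3, T4, T5 via GT, reproducing the behavior described above.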
3.2 Experiment
We conjecture that AGA_I does not influence the effectiveness (e.g., APFD) much but can improve efficiency (i.e., time cost) a lot. To verify our conjecture, we design an experiment to investigate how the iteration number impacts TCP in terms of both effectiveness and efficiency.

Specifically, we use the same setup as the comprehensive experiments in Section 5. More details about the subjects, faults, implementation and supporting tools, and measurement are given in Section 5.

We applied the GA algorithm to all subjects, and recorded the total number of iterations the GA strategy is applied during the process for each project, which is denoted as k. Then we applied to each project k modified GA algorithms, each of which is denoted as algorithm algo_i (1 ≤ i ≤ k), recording their APFD values and time spent during prioritization. In particular, algorithm algo_i repeats the GA strategy i times and prioritizes the remaining unselected test cases by the Greedy Total strategy [5]. Note that algorithm algo_1 is actually the GA-first algorithm, whereas algorithm algo_k is actually the GA algorithm.

Due to the space limit, we only present some statistics of the experimental results in Table 2, that is, minimum, maximum, average, and quartiles (Q1, Q2, Q3); the detailed results are given on the website of this project. From the eighth column, the average iteration number among all open-source subjects is 29.20. The ninth to the fourteenth columns present the ratio between the time cost of the GA approach and that of the GA-first approach [5]. The big gap between the maximal and minimal time ratios indicates the influence of the iteration number. To better analyze the relationship between iteration number and time cost, we put detailed results in Appendix A. We draw a line chart of iteration number and time cost for each project. Note that in order to see the trend, we only present the projects whose iteration number is no less than 20 (k ≥ 20). The plots also support our claim that the iteration number contributes much to the time cost. As k is the coefficient of the time complexity, it largely determines the actual efficiency in practice, so we think there is a large space to reduce the time complexity.

The last six columns in Table 2 present the APFD range of each project with different iteration numbers, that is, the highest APFD value minus the lowest APFD value. From the quartiles, we conclude that although some outliers exist, most of the APFD ranges are very small, and the average APFD range is only 0.0085 among all open-source subjects, indicating that little fluctuation of APFD occurs as the iteration number varies.

To sum up, we have two main observations. First, along with the increase of the iteration number, the time cost also increases, indicating that the iteration number contributes much to the time cost. Second, the APFD value varies little when the iteration number varies, which means a too-large iteration number contributes little to the APFD value. These two observations also verify our conjectures in Section 3.

As we discussed in Section 3, projects with small iteration numbers are efficient enough by using AGA_C, so we need to decide a proper reduced iteration number for projects with a large iteration number. In fact, this reduced iteration number is not fixed, which means it can be adjusted for
specific usage. In this paper, we determine this value from some heuristics. On one hand, although we conjecture that there is no need to conduct too many iterations to detect faults, we still prefer to choose a relatively high value to ensure the effectiveness. On the other hand, if we assume that every time an element is covered, the probability that it still contains faults decreases to half of the original probability, given that the initial probability is 1, we need to cover an element 10 times to reduce the probability to less than 1‰ ((1/2)¹⁰ = 1/1024). As a result, in the remainder of this paper, we implement our AGA approach by using 10 as the reduced iteration number.
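This arithmetic can be checked in one loop (the halving assumption is the paper's; the code is only an illustration):

```python
# Smallest t with (1/2)**t < 1/1000: the number of times an element must
# be covered before its assumed residual fault probability drops below
# 1 per mille under the halving assumption.
t, p = 0, 1.0
while p >= 1e-3:
    p /= 2
    t += 1
print(t, p)  # -> 10 0.0009765625  (= 1/1024)
```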
Finding: The iteration number has a large influence on the efficiency of the GA algorithm, while it impacts effectiveness little. In this paper we set the iteration number to be 10 in implementing the AGA approach.

Note that this finding is confirmed empirically on our dataset and may have bias considering the diversity of different datasets. However, the constraint on k does reduce the overall time complexity from O(kmn + m²) to O(mn + m²). When n > m, which is general in most cases, the reduction is from O(kmn) to O(mn).

3.3 Discussion on the Chosen Iteration Number
In this paper, we set the iteration number to be 10 in implementing AGA through some heuristics. Here we discuss the influence of this choice. First, we analyzed the APFD results of the GA algorithm with various iteration numbers (i.e., algorithms algo_i (1 ≤ i ≤ k) in Section 6.1). In particular, for each project we recorded the highest APFD value (denoted as APFD_max) among these algorithms, and found the smallest iteration number r whose corresponding APFD value is no smaller than APFD_max × 99%. Surprisingly, the smallest iteration number r for all projects is no larger than 10, which indicates that only several iterations are enough for maintaining the original effectiveness, even in projects with an iteration number up to 679. Second, although we set the iteration number to be 10 in this paper, it may not be the best choice. We respectively applied algo_8, algo_9, algo_10, algo_11, and algo_12 to all projects with k > 8 as in Section 6.1, and found that the gap between the maximum and minimum APFD values of these algorithms is 0.0006 on average, which means that there might be many possible choices of the reduced iteration number in practice. In other words, the value of k in our evaluation is decided by reasoning, but it can have various values, depending on the choices of developers. For example, they can use historical faults or seeded faults to empirically decide the value of k.

4 RESEARCH METHOD
To investigate the performance of our proposed AGA approach, we design comprehensive experiments. In this section, we briefly introduce each component of our experiments and their intentions.

1) The main experiment of this paper is designed to confirm the contributions of our approach, and thus we investigate the improvement of AGA and its components (i.e., AGA_I and AGA_C) over the GA algorithm. In particular, this experiment is conducted on 55 open-source subjects. Details of this experiment are given in Sections 5 and 6. Note that the experiment in Section 3.2 also shares the same setup, and RQ1 complements the experiment in Section 3.2. This part of the experiment can show the superiority of AGA on widely-used open-source subjects.

2) Although we aim to improve the efficiency of GA, we are also curious about how AGA performs compared with other TCP techniques. Specifically, FAST targets the TCP efficiency problem and its goal is close to ours. Therefore, we first compare AGA with FAST, and then with other representative TCP techniques, including ART-D, GA-S, and GE. This experiment is in Section 7. This part of the experiment can show that AGA even outperforms techniques that aim to reduce TCP time cost while sacrificing effectiveness.

3) To show the practical usage of our approach, we conduct an industrial case study at Baidu, which is a famous Internet service provider with over 600M monthly active users. Specifically, we compare AGA with GA, FAST, ART-D, GA-S, and GE, respectively, and the experiment is in Section 8. This part of the experiment can show that AGA also works well in real-world industrial applications, and we received positive feedback from Baidu.

5 EVALUATION DESIGN
We conducted experiments to evaluate our AGA approach. The experiments were performed on a server whose CPU is Intel(R) Xeon(R) E5-2683 2.10GHz with 132GB memory and whose operating system is Ubuntu 16.04.5 LTS. To make a fair comparison of time cost, we conducted all experiments on a single thread without parallel execution.

In order to make our results more reliable and let readers reuse the artefacts, we share our data, analysis scripts, and detailed data tables online. They are publicly available on our website: https://github.com/Spiridempt/AGA, and also on figshare: https://figshare.com/s/cf8cc6ba9259c0e0754d.

5.1 Research Questions
As our AGA approach consists of two parts, time complexity reduction (AGA_C) and iteration number reduction (AGA_I), the first two research questions are to investigate their impacts separately. Note that the first research question also complements the experiment in Section 3.2.
The third research question is designed to investigate the performance of the whole AGA approach by comparing it with the GA algorithm. To investigate the influence of coverage type, the fourth research question is designed to investigate whether AGA can also improve the efficiency of GA with method coverage.

To sum up, this experiment is to answer the following four research questions.
RQ1: How does our reduction of the iteration number perform compared with the GA algorithm in terms of efficiency?
RQ2: How does our reduction of the time complexity perform compared with the GA algorithm in terms of efficiency?
RQ3: How does our AGA approach perform compared with the GA algorithm in terms of effectiveness and efficiency?
RQ4: Can our AGA approach also improve the efficiency when method coverage is used?

5.2 Subjects and Faults
Subjects. In this work, we use 55 open-source projects in total. Among these projects, 33 are widely used in prior work [17], [20], [21]; the others are the most popular subjects selected from GitHub according to the number of stars. Specifically, we target GitHub subjects whose primary programming language is Java and order them according to the number of stars in Jan 2019. Then, we check the first 100 subjects and keep only the ones that are code repositories and on which the required tools (e.g., Maven, Clover, PIT, which are explained in Section 5.3) could work. All the open-source projects used in this work are written in Java, and their numbers of lines of code range from 1,621 to 254,284. Each of these projects has a test suite written in the JUnit testing framework. The detailed information is given in Appendix B (Table 6). It is worth noting that compared with the experimental datasets used in recent TCP work [18], [22], [23], our dataset is larger and contains more large-scale projects, which can make our experimental results more reliable and convincing.

Faults. As existing work [24], [25], [26] has demonstrated mutation faults to be suitable for software testing experimentation, and mutation faults are widely used in prior work [7], [17], [22], [27], [28], [29], [30], [31] to evaluate test case prioritization, we use a widely-used mutation testing tool, PIT [32], to generate mutants for all open-source subjects. In particular, for each subject, first, we generate all mutants. Second, we keep the mutants that are killed by at least one failing test case². Third, we construct one mutation group for each subject containing all the remaining mutation faults, which is also consistent with previous work [12].

… branches. Additionally, we also design a research question to investigate whether AGA still improves efficiency in the scenario of method coverage. The implementation code and all scripts used in this work are written in Python.

In prior work on coverage-based test case prioritization, some takes the adjacency matrix as input [12], while some uses the adjacency list [18]. In this work, in order to make a more general comparison, on one hand, we utilize the GA implementation in [18], which is a relatively efficient implementation and uses the adjacency list as input, and we implement AGA based on the adjacency list. On the other hand, we implement GA and AGA based on the adjacency matrix, too. Due to the space limit, in the experimental results, we only report the results based on the adjacency list [18], which can be more reliable; the detailed results based on the adjacency matrix are put on the website.

It is worth mentioning that in our experiments, when ties happen (i.e., more than one test case has the same number of covered elements), AGA/GA selects the topmost test case in the test list (given by developers).

5.4 Compared Prioritization Approaches
Besides the proposed AGA approach and the GA approach [3], in this study we also implemented the GA-first approach proposed by Rothermel et al. [5]. The GA-first approach [5] applies the greedy additional strategy only in the first iteration, and deals with the remaining test cases by another prioritization approach, e.g., the Greedy Total approach in this paper, which schedules these test cases in descending order of the number of covered program elements.

5.5 Measurement
In this study, similar to existing work [3], [5], we used the Average Percentage of Faults Detected (APFD) to measure the effectiveness of TCP approaches. Formula (2) presents how to calculate APFD values for a subject with n tests and m faults. Typically, TF_i represents the position in the test suite of the first test case that detects the i-th fault.
APFD = 1 − (TF_1 + TF_2 + … + TF_m) / (n × m) + 1 / (2n)    (2)
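Formula (2) translates directly into code (a sketch; the function and argument names are ours):

```python
def apfd(first_detect_positions, n):
    """Formula (2): first_detect_positions lists TF_i, the 1-based position
    of the first test that detects fault i, for a suite of n tests."""
    m = len(first_detect_positions)
    return 1.0 - sum(first_detect_positions) / (n * m) + 1.0 / (2 * n)
```

For instance, for the Table 1 suite (n = 5), one fault first detected by the 2nd test and another by the 4th gives apfd([2, 4], 5) = 1 − 6/10 + 1/10 = 0.5.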
Besides, we used the total time spent during the TCP process to measure the efficiency of a TCP approach. For a fair comparison, we included the preparation time of a TCP approach, i.e., the time spent in constructing the extra data structures in the AGA approach.
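One way to realize this measurement is to time a whole prioritization call, so that internal preparation (e.g., building TC, FI, and II) is included; a sketch assuming a prioritizer with the signature of the aga_c sketch above (time.perf_counter is our choice of clock, not necessarily the paper's):

```python
import time

def measure_tcp_time(prioritizer, coverage_list, m, n):
    """Wall-clock cost of one TCP run, including any preparation the
    prioritizer performs internally (e.g., AGA's index construction)."""
    start = time.perf_counter()
    order = prioritizer(coverage_list, m, n)
    elapsed = time.perf_counter() - start
    return order, elapsed
```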
Our subjects include previously used subjects [17], [20] and 22 popular subjects selected from GitHub. At the same time, as AGA is a general approach, it is not biased towards the chosen projects. Note that because the second part of our approach (iteration number reduction) is empirically verified on our dataset, the large dataset itself also addresses the threat that our approach may be biased. Also, some prior work [34], [35] shows that the relative performance of different test case prioritization techniques on mutation faults may not strongly correlate with their performance on real faults, depending upon the attributes of the studied subjects, but we follow the common practice of using mutation faults for open-source projects, following the preceding TCP work [7], [17], [22], [27], [28], [29], [30], [31]. Additionally, to complement this experiment, in Section 7, we also evaluate our approach on real faults. In the future, we plan to conduct an extensive study using more projects with more real faults. In addition, in this paper, we only target the GA algorithm and compare AGA with it. On one hand, it is widely accepted that GA remains one of the most effective strategies in terms of fault-detection rate [7], [8], [10]. On the other hand, the results of a recent work [12] show that other black-box techniques that do not use coverage information (e.g., [36], [37]) are often less effective than GA. At the same time, we also design another experiment in Section 7 to compare AGA with other representative prioritization techniques. Additionally, most of our experiments are conducted on statement coverage because of its wide usage and severe low-efficiency problem. In fact, our analysis of AGA is independent of the scale of the coverage matrix, and our theoretical improvement is general for all types of coverage. We also include RQ4 to empirically verify our improvement on method coverage. Another minor threat is induced by the diversity of the used subjects, which may lead to misleading statistics in our results. To address this threat, besides reporting the mean and median values, we also draw violin plots to learn the data distribution, which are shown on our website.

6 RESULTS AND ANALYSIS
In this section, we analyze the experimental results on open-source projects and answer the four research questions.

6.1 RQ1: Efficiency of Iteration Number Reduction
In this section, we further investigate the efficiency improvement of the iteration number reduction. According to Section 3.2, we implement our approach with iteration number reduction alone by setting k = 10 and call this implementation AGA_I. In other words, in this subsection, we assess the contribution of iteration number reduction alone (without the time complexity reduction).

The results on the 55 open-source projects are given in Table 7 (Appendix C)³, where the projects are sorted in ascending order of source lines of code (SLOC) and the first two columns present the results for RQ1. Time_GA presents the time cost of the GA approach, whereas Time_I represents that of AGA_I. The speedup ratio of AGA_I over GA is 1.08X. It is apparent that most subjects have a small iteration number in GA (less than or slightly more than 10). Therefore, AGA_I does not improve the efficiency much for them. However, for those subjects with a large iteration number, AGA_I could reduce their time cost.

³ Due to the space limit, we put the results of several research questions into one table and put the table in Appendix C.
To statistically check the differences between AGA_I and GA, we adopt hypothesis testing. We first use the Shapiro-Wilk test [38] to check the normality of residuals; the p-values for AGA_I and GA are 9.416 × 10⁻¹⁶ and 5.239 × 10⁻¹⁶, which reject the hypothesis that they are normally distributed. Therefore, we need to adopt a non-parametric test. As we need to include project size as a control variable, the Wilcoxon rank sum test [39] cannot be used. We resort to proportional odds regression [40], which is a class of generalized linear models and is equivalent to the Wilcoxon rank sum test when there is a single binary covariate. We introduce a variable "group" representing AGA_I and GA and take project size as a control variable. The results show that the p-value of "group" is 1.380 × 10⁻⁶, indicating a significant difference between AGA_I and GA, and the effect size (Cohen's d [41]) is 0.274 (medium effect). Here, because statistical tests of normality (e.g., the Shapiro-Wilk test) might be impacted by characteristics of the data, we additionally draw normal probability plots and put them on our website. Note that this applies to all normality checks in the rest of the paper.
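The normality check and the effect size can be computed as below (a sketch with SciPy; the proportional odds regression itself is not reproduced here — it could be fit with, e.g., statsmodels' OrderedModel, which is our suggestion rather than the paper's stated tooling):

```python
import math
from scipy import stats

def shapiro_p(values):
    """Shapiro-Wilk normality test; a small p-value rejects normality."""
    return stats.shapiro(values).pvalue

def cohens_d(a, b):
    """Cohen's d between two samples, using a pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled
```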
6.2 RQ2: Efficiency of Time Complexity Reduction
The extra data structures defined in our AGA_C approach do not affect the prioritization results, but reduce the time complexity of prioritization. In this section, we compare AGA_C with GA only in terms of time cost. Note that we did not implement AGA_I in this research question.

The results are given by the first five columns (except the third column) of Table 7 (Appendix C), where Time_GA presents the time cost of the GA approach and Time_C represents that of AGA_C. Moreover, we mark the results of Time_C with X only if Time_C < Time_GA. The last row summarizes the total number of subjects where AGA_C outperforms the GA approach. The results show that in most projects (48 out of 55 open-source subjects), the time cost of AGA_C is lower than that of the GA approach [3], which confirms our previous theoretical analysis in Section 2. As we can see, in smaller subjects, the differences between GA and AGA_C are very small, which may be caused by precision errors resulting from calculation or the operating system. In larger subjects, their differences are very large, which indicates the efficiency of AGA_C.

In order to make our experiments comprehensive, we compared AGA_C with the GA-first approach, whose time cost is given by the fourth column Time_GAF of Table 7 (Appendix C). In 36 open-source subjects, AGA_C is even more efficient than the GA-first approach, which applies the time-consuming additional strategy for only one iteration.

In general, the efficiency improvement of AGA_C is usually very large. In particular, if we define Time_GA/Time_C as the speedup ratio of AGA_C over GA for a project, the average speedup ratio is 4.37X. As small time costs may yield biased speedup ratios, and also in order to show the performance of AGA in projects with different sizes, we classify all 55 projects into small-size, middle-size, and large-size,
according to the SLOC. The small-size projects (S1 to S22 in Table 7 (Appendix C)) all have less than 5,000 SLOC, the middle-size projects (S23 to S41) all have 5,000-20,000 SLOC, and the other large-size projects have more than 20,000 SLOC. The results show that the average speedup ratio in the three categories is 2.16X, 4.65X, and 7.44X, respectively. So, the reduction of time complexity (AGA_C) performs well, especially in projects with large sizes. In order to give a deeper view into the distribution and variation of speedup ratios, we further present the violin plot with an included box plot in Figure 1. The X-axis represents all projects and the projects in the three categories, respectively. We put the violin plots and box plots together to better present the distributions. From the plots, the speedup ratio of large-size projects tends to be slightly larger than that of small-size projects. Moreover, from the plot of large-size projects, several projects have very large speedup ratios because their scale is also large.

Figure 1: Speedup Ratios Distribution of AGA_C over GA on Open-Source Projects. [Violin plots with embedded box plots for Total (n=55), Small-size (n=22), Middle-size (n=19), and Large-size (n=14) projects; X-axis: categories of different sizes; Y-axis: speedup ratio.]

To statistically check the differences between AGA_C and GA, we perform hypothesis testing similar to the above. We first use the Shapiro-Wilk test [38] to check the normality of residuals; the p-values for AGA_C and GA are 4.207 × 10⁻¹⁵ and 5.239 × 10⁻¹⁶, which reject the hypothesis that they are normally distributed. We also use the proportional odds regression [40] and include project size as a control variable. The results show that the p-value of "group" is 0.038, indicating a significant difference between AGA_C and GA, and the effect size (Cohen's d [41]) is 0.234 (medium effect).

Besides, we also calculate the speedup ratios of AGA_C over GA-first for a more complete comparison. The average speedup ratio is 3.01X, and the average speedup ratio in the three categories is 1.26X, 3.31X, and 5.35X, respectively. This shows our AGA approach is also superior to GA-first.

To statistically check the differences between AGA_C and GA-first, we perform a similar procedure as above. We first use the Shapiro-Wilk test to check the normality of residuals; the p-values for AGA_C and GA-first are 4.207 × 10⁻¹⁵ and 3.828 × 10⁻¹⁶, which reject the hypothesis that they are normally distributed. We also use the proportional odds regression [40] and include project size as a control variable. The results show that the p-value of "group" is 0.399, indicating no significant difference between AGA_C and GA-first, and the effect size (Cohen's d) is 0.208 (medium effect).

Provided the adjacency matrix as input, we also implemented GA and AGA_C, and the detailed results are on our website. Specifically, the average speedup ratio of AGA_C over GA is 24.18X, and the average speedup ratio in the three categories is 5.47X, 28.16X, and 48.19X, respectively.

Besides, the speedup ratios of the AGA_C approach vary a lot in different projects. On 19 open-source subjects AGA_C is less efficient than GA-first. On the one hand, the iteration numbers of these projects are high, so AGA_C becomes a bit costly. On the other hand, in the only iteration of GA-first, few test cases are needed to cover all statements and they are selected fast, so GA-first is efficient on these projects.

To sum up, AGA_C addresses the high-complexity problem of GA well and successfully reduces its time complexity. For any project and any scale of coverage matrix, our approach can improve the efficiency a lot.

6.3 RQ3: Comparison with Greedy Additional Approaches
In this section, we compare the effectiveness and efficiency of the proposed AGA approach and two Greedy Additional approaches (including both GA and GA-first), whose results are given by the first nine columns (except the third and fifth columns) of Table 7 (Appendix C), where APFD_AGA and Time_AGA represent the APFD results and time cost of the AGA approach whose iteration number is set to be 10. Moreover, when the GA approach [3] does not outperform the corresponding AGA approach, i.e., APFD_AGA ≥ APFD_GA or Time_AGA < Time_GA, the corresponding results of the AGA approach are marked with X.

6.3.1 Effectiveness
The proposed AGA approach has the same or better APFD performance as the GA approach in 51 out of 55 open-source subjects, and the average APFD value of AGA is 0.8870, which is the same as GA. On some subjects (e.g., the open-source project whose ID is S44), the AGA approach does not outperform the GA approach, but their APFD difference is usually very small (e.g., 0.0021 for this subject). We also make extra comparisons of AGA and GA-first and find that AGA has the same or better APFD performance as GA-first in 45 out of 55 open-source subjects, and their average APFD values are the same. On 14 projects, neither the AGA approach nor the GA approach outperforms the GA-first approach, but their differences are small. Through our analysis, we suspect that after the first iteration, although all elements have been covered, the numbers of times that each element is covered still differ. This means test cases with a
small number of times being covered should have higher priority, but in later iterations, this information is ignored.

Moreover, we statistically analyze whether the AGA approach and the Greedy Additional approaches have significant differences in their APFD values. First, we conduct the Shapiro-Wilk test to check the normality of residuals. The p-values of AGA, GA, and GAF are 0.328, 0.298, and 0.283, indicating that we cannot reject the hypothesis that they are normally distributed. Therefore, we can use parametric tests in the following. Taking project size as a control variable, an overall ANCOVA indicates that we cannot reject that they have the same means. Then, pairwise ANCOVA tests show that the p-values of AGA vs. GA, AGA vs. GAF, and GA vs. GAF are 0.981, 0.427, and 0.414. In other words, the probability that AGA is as competitive as GA is more than 98%. Then, we employ Cohen's d [41] to compute the effect size (ES), and the results for AGA vs. GA, AGA vs. GAF, and GA vs. GAF are 0.005, 0.151, and 0.156, which are all small effects. Furthermore, we conduct Tukey's range test [43] to check the 95% confidence intervals for all pairwise differences, and the results are [-0.022, 0.022], [-0.030, 0.015], and [-0.030, 0.014].

6.3.2 Efficiency
According to Table 7 (Appendix C), in almost all subjects (i.e., 44 out of 55), the time cost of AGA is much lower than that of the GA approach. On average, the speedup ratio of AGA over GA is 5.95X. Moreover, the speedup ratios in small-size, middle-size, and large-size projects are 2.26X, 6.69X, and 10.76X, respectively. To learn the distribution of speedup ratios in small-size, middle-size, and large-size projects, we also present the violin plot with an included box plot in Figure 2. From this figure, most middle-size and large-size projects achieve higher speedup ratios than small-size projects. Moreover, AGA achieves very large speedup ratios on some large-size projects. So, AGA scales up well in large-size projects. Furthermore, we compared the time cost of the AGA approach with the GA-first approach, which requires less time than the GA approach, and found that the AGA approach even outperforms the GA-first approach in 37 open-source subjects. The average speedup ratio is 3.95X, and the average speedup ratio in the three categories is 1.36X, 4.39X, and 7.44X, respectively. Here, we notice that the speedup ratio of AGA over the other approaches is sometimes less than 1 (e.g., S3, S4, S7). In fact, the overall time complexity analysis is meaningful only when the parameters are large enough. In our dataset, some projects have a relatively small m value. In this case, although O(mn) seems to be small, its coefficient is not negligible compared to m. In other words, the preliminary data structure setup consumes much time and impacts the overall running time in some cases. This is also consistent with the empirical result that AGA performs better on large projects. On the other hand, the adjacency lists in some projects are very dense, which takes much time in the preparation of data structures and further leads to a large coefficient. For example, S7 and S42 have relatively small m values (45 and 34) and dense adjacency lists.

Figure 2: Speedup Ratios Distribution of AGA over GA on Open-Source Projects. [Violin plots with embedded box plots for Total (n=55), Small-size (n=22), Middle-size (n=19), and Large-size (n=14) projects; X-axis: categories of different sizes; Y-axis: speedup ratio.]

Provided the adjacency matrix as input, we also implemented GA and AGA. The average speedup ratio of AGA over GA is 27.72X, and the average speedup ratio in the three categories is 5.84X, 35.47X, and 51.59X, respectively.

To sum up, not surprisingly, the speedup ratio of AGA is higher than those of AGA_C and AGA_I. After combining AGA_I and AGA_C, our whole AGA approach obtains more efficient results while preserving high effectiveness. At the same time, the proposed AGA approach is demonstrated to be efficient especially on large-scale projects. In fact, the surprisingly high efficiency of the AGA approach also indicates the existence of many redundant data accesses, which are ubiquitous in most projects.

Conclusion to RQ3: The AGA approach requires much less time in prioritization than the GA approach, and the average speedup ratio is 5.95X and 27.72X on the two types of input. Also, AGA is as competitive as the latter in terms of APFD values (with no significant difference). This means that we achieve our goal in this paper and it has promising use in practice.

6.4 RQ4: Performance on Method Coverage
In the previous research questions, we focused on statement-level coverage because it is the most studied coverage criterion and its low-efficiency problem is more severe than that of other granularities. In this section, we collect the method-level coverage for each of our 55 subjects and compare the efficiency of AGA and GA. The results are shown in Table 3. For each subject, we report the running time (in seconds) of GA and AGA.

According to Table 3, in almost all subjects, the time cost of AGA is much lower than that of GA. On average, the
speedup ratio of AGA over GA is 6.02X. Moreover, the speedup ratios in small-size, middle-size, and large-size projects are 2.28X, 7.32X, and 10.13X, respectively. Compared to the results on statement coverage in Section 6.3, the speedup ratios are almost the same for all projects and for projects of different sizes. This confirms that AGA also works well on method coverage.

In fact, the complexity analysis of GA and our AGA approach is based on a general (0,1) matrix, regardless of the meaning behind it. In other words, the type of program element (e.g., statement, method) does not affect any aspect of AGA, which means our approach works on any coverage and has a stable improvement.

Conclusion to RQ4: The AGA approach also works on method-level coverage. Specifically, the average speedup ratio of AGA over GA is 6.02X.

7 EMPIRICAL COMPARISON WITH REPRESENTATIVE PRIORITIZATION TECHNIQUES
In this section, we present an experiment comparing AGA with some representative prioritization techniques. In particular, as FAST targets the TCP efficiency problem and is thus closest to our goal, we first present the comparison study with FAST in Section 7.1. Then we present the comparison study with other representative TCP techniques in Section 7.2.

7.1 Comparison with FAST
In this section, we investigate the performance of AGA against its most related work, FAST [18]. In particular, FAST is proposed as a TCP approach to address the general TCP efficiency problem by sacrificing TCP effectiveness, and it is shown to be more efficient than other TCP techniques [18]. Note that there is no other work in the literature focusing on the same objective as ours, and thus we compare AGA against FAST. However, AGA and FAST target slightly different goals: the FAST approach focuses on the efficiency problem of test prioritization, not specifically on GA approaches. Although FAST targets a different goal, it is still interesting to learn how AGA performs compared with FAST in terms of time cost, since both AGA and FAST can be viewed as addressing the efficiency problem. However, as FAST improves efficiency while sacrificing effectiveness, the comparison in terms of time cost is a bit "unfair" for AGA.

In this study, we compare the performance of AGA and FAST on both the 55 open-source projects used in Section 5 and Defects4J [44], which is the largest real-fault benchmark (i.e., a set of projects with reproducible real bugs) widely used in test case prioritization [35], [45], [46], [47], [48] and fault localization [49], [50], [51], [52], [53]. For ease of understanding, we present the results of the former subjects with seeded faults and the results of the latter subjects with real faults separately.

The FAST approach borrows algorithms commonly used in the big data domain to find similar items and con- […] coverage as input. As WB approaches have the same input as us and are much faster than BB approaches, we compare our work with the WB approaches [18]. The WB approaches include five algorithms, FAST-pw, FAST-all, FAST-1, FAST-log, and FAST-sqrt, whose difference lies in how many test cases are randomly selected for prioritization at a time. In this section, we implemented this family, and for each subject, we compared the best results of this family with AGA. Specifically, according to prior work [18], none of the algorithms in the FAST family always performs the best. Therefore, to show the superiority of our approach, we run all FAST algorithms and select the best one for each project. In other words, when comparing APFD, we keep the highest APFD, and when comparing time cost, we keep the lowest time cost. Moreover, due to the randomness in FAST, for each subject we applied each of these approaches 10 times and used their median effectiveness and efficiency results. Regarding the time cost, the same as in Section 5, we measure the efficiency of a TCP approach by including its preparation time, i.e., the preparation time used in FAST⁴.
7.1 Comparison with FAST time cost. As the APFD results and time cost of AGA is
already given by the eighth and ninth columns, we use
In this section, we investigate the performance of AGA
column WinAPFD and column WinTime to show whether
with its most related work FAST [18]. In particular, FAST
APFDAGA ≥ APFDFAST and TimeAGA < TimeFAST ,
is proposed as a TCP approach to address the general
respectively.
TCP efficiency problem by sacrificing the TCP effective-
Regarding to APFD values, the AGA approach is much
ness, and it is shown to be more efficient than other TCP
better than FAST in all subjects. More specifically, the differ-
techniques [18]. Note that there is no other work in the
ences between them are from 0.0456 to 0.3039, and 0.1702 on
literature focusing on the same objective as ours, and thus
average. To statistically check their differences, we follow
we compare AGA against FAST. However, AGA and FAST
the similar procedure as above. We first use Shapiro-Wilk
target at slightly different goals: FAST approach focuses on
test to check the normality of residuals, and the p-value in
the efficiency problem of test prioritization, not specific to
AGA and FAST is 0.328 and 0.137, which cannot reject the
GA approaches. Although FAST targets a different goal, it is
hypothesis that they are normally distributed. Then, taken
still interesting to learn how AGA performs compared with
project size as a control variable, the Analysis of Covariance
FAST in terms of time cost since both AGA and FAST can be
(ANCOVA) shows that p-value < 2 ∗ 10−16 , indicating the
viewed as addressing the efficiency problem. However, as
statistically significant difference between AGA and FAST.
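For clarity, the following sketch shows how this per-subject baseline can be assembled: the median over the 10 runs is taken for each FAST variant, and then the best median APFD and the best median time across the five variants are kept. It is illustrative only; the variable names and data layout are hypothetical.

```python
# Illustrative sketch of assembling the per-subject FAST baseline;
# the data layout is hypothetical: runs[subject][variant] is a list
# of (apfd, seconds) pairs, one pair per run (10 runs here).
from statistics import median

VARIANTS = ["FAST-pw", "FAST-all", "FAST-1", "FAST-log", "FAST-sqrt"]

def best_of_family(runs):
    """Median over the 10 runs of each variant, then the best
    median APFD and best median time across the five variants."""
    best = {}
    for subject, by_variant in runs.items():
        med_apfd = {v: median(a for a, _ in by_variant[v]) for v in VARIANTS}
        med_time = {v: median(s for _, s in by_variant[v]) for v in VARIANTS}
        best[subject] = (max(med_apfd.values()), min(med_time.values()))
    return best
```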
7.1.1 FAST Results on Seeded Faults

The results of FAST are shown in the tenth and twelfth columns of Table 7 (Appendix C). Due to the space limit, we do not present the results of all five FAST algorithms, but only the largest APFD value and the smallest time cost among them for each subject. Note that usually a FAST algorithm cannot achieve both the largest APFD value and the smallest time cost. As the APFD results and time cost of AGA are already given in the eighth and ninth columns, we use column WinAPFD and column WinTime to show whether APFD_AGA ≥ APFD_FAST and Time_AGA < Time_FAST, respectively.

Regarding APFD values, the AGA approach is much better than FAST in all subjects. More specifically, the differences between them range from 0.0456 to 0.3039, with 0.1702 on average. To statistically check their differences, we follow a similar procedure as above. We first use the Shapiro-Wilk test to check the normality of residuals; the p-values for AGA and FAST are 0.328 and 0.137, which cannot reject the hypothesis that they are normally distributed. Then, taking project size as a control variable, the Analysis of Covariance (ANCOVA) shows a p-value < 2 × 10⁻¹⁶, indicating a statistically significant difference between AGA and FAST. Moreover, the effect size (Cohen's d) is 2.96 (huge effect), and Tukey's range test shows that the 95% confidence interval of their difference is [0.149, 0.192]. To sum up, AGA significantly outperforms FAST in terms of APFD, because the FAST algorithms are designed to sacrifice prioritization accuracy to achieve high efficiency by using hash signatures.
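For readers who wish to replicate this analysis, the following sketch shows how the procedure above (Shapiro-Wilk, ANCOVA with project size as a covariate, Cohen's d, and Tukey's range test) can be carried out with SciPy and statsmodels. It is illustrative rather than our actual analysis script, and the data-frame column names ('apfd', 'technique', 'size') are assumptions.

```python
# Illustrative sketch of the statistical procedure above, using SciPy
# and statsmodels; not the actual analysis script, and the data-frame
# columns ('apfd', 'technique', 'size') are assumed names.
import numpy as np
import pandas as pd
from scipy.stats import shapiro
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_apfd(df: pd.DataFrame):
    """df: one row per subject and technique ('AGA' or 'FAST'),
    with the subject's APFD and project size (e.g., SLOC)."""
    # 1. Shapiro-Wilk normality check per technique.
    for tech, group in df.groupby("technique"):
        _, p = shapiro(group["apfd"])
        print(f"Shapiro-Wilk p-value for {tech}: {p:.3f}")

    # 2. ANCOVA: APFD explained by technique, with size as a covariate.
    model = smf.ols("apfd ~ C(technique) + size", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))

    # 3. Cohen's d with a pooled standard deviation.
    a = df.loc[df["technique"] == "AGA", "apfd"]
    b = df.loc[df["technique"] == "FAST", "apfd"]
    pooled = np.sqrt(((len(a) - 1) * a.var() + (len(b) - 1) * b.var())
                     / (len(a) + len(b) - 2))
    print("Cohen's d:", (a.mean() - b.mean()) / pooled)

    # 4. Tukey's range test (95% confidence interval of the difference).
    print(pairwise_tukeyhsd(df["apfd"], df["technique"], alpha=0.05))
```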
Regarding the time cost, AGA outperforms FAST on 52 out of the 55 open-source subjects, and the speedup ratio of AGA over FAST is 4.29X. To statistically check their differences, we conduct the same analysis as above.

⁴ The previous work FAST [18] separated the total running time into preparation time and prioritization time in its evaluation. However, preparation happens only once in BB approaches but not in WB approaches, because the input of BB approaches is test code. Given updated source code but out-of-date coverage information (from the previous version), we need not prioritize again, as the TCP results would not change. Otherwise, with updated coverage information, the whole process (including preparation) has to be repeated.
is missing in the summarization. AGA consists of two parts, time complexity reduction and iteration number reduction. In particular, the former part uses some extra data structures (e.g., indices) to summarize the coverage information of each test case, i.e., the statements covered by each test. With these data structures, AGA does not need to scan the whole coverage table whenever a test case is selected; thus the time cost of AGA is reduced while its effectiveness is maintained. To sum up, FAST suffers from effectiveness loss because it uses simplified information, while AGA does not, because it uses the same information as before but in an easier-to-access form.
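To make this concrete, the following Python sketch (illustrative, not the authors' implementation) shows how per-test coverage sets plus an element-to-tests inverted index let the additional greedy strategy update only the affected tests after each selection, instead of rescanning the whole coverage table; the reset logic that starts a further iteration once all elements are covered is omitted for brevity.

```python
# Illustrative sketch (not the authors' implementation) of the idea:
# summarize each test's coverage in extra data structures so that
# selecting a test updates only the tests sharing covered elements,
# instead of rescanning the whole coverage table.
from collections import defaultdict

def greedy_additional_indexed(coverage):
    """coverage[t] is the set of program elements covered by test t."""
    m = len(coverage)
    remaining = [set(c) for c in coverage]   # still-uncovered elements per test
    count = [len(c) for c in remaining]      # additional coverage of each test
    tests_of = defaultdict(set)              # inverted index: element -> tests
    for t, elems in enumerate(remaining):
        for e in elems:
            tests_of[e].add(t)

    order, selected = [], set()
    while len(order) < m:
        best = max((t for t in range(m) if t not in selected),
                   key=lambda t: count[t])
        if count[best] == 0:
            break  # all elements covered; full GA/AGA would reset
                   # the coverage information here and continue
        order.append(best)
        selected.add(best)
        for e in list(remaining[best]):      # elements newly covered by `best`
            for t in tests_of.pop(e):        # touch only the affected tests
                remaining[t].discard(e)
                count[t] -= 1
    return order
```

Because every (test, element) pair is touched at most once when it becomes covered, the total update cost is bounded by the size of the coverage data rather than by a full table scan per selection.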
Moreover, similar to Table 2, we compute the gaps between the highest and lowest APFD among all iteration numbers for the Defects4J subjects, which are shown in column “APFD range” of Table 4. The range of “Chart” is marked as “NA” because it has only one iteration. As we can see, the gaps are extremely small, which also confirms the conclusion in Section 3.

Additionally, column “GAF” of Table 4 shows the results of GA-first. AGA is much more efficient than GAF while achieving larger APFD, which is consistent with the conclusion in Section 6.3.

Conclusion: Surprisingly, AGA achieves a 4.29X speedup ratio compared to FAST, which targets improving time efficiency while sacrificing effectiveness. At the same time, the experimental results show that AGA is significantly better than FAST in terms of APFD values, and the average difference between them is 0.1702.

• GA-S [54] is a variant of the GA algorithm that selects the test case covering the largest number of uncovered elements among those in the “spanning set”. Here, an element subsumes another if covering the former guarantees covering the latter (e.g., covering any statement of a basic block guarantees covering the block's other statements); the spanning set denotes the subset of non-subsumed elements.

• GE [8] is a genetic algorithm, which is a representative of search-based prioritization techniques and has been evaluated to be effective. In each iteration, it uses a fitness function to select individuals and then applies crossover and mutation operators to generate new individuals (one way to realize these operators is sketched below). Specifically, an individual (a sequence) is encoded as an array where each value indicates the position of a test case; the fitness function is defined by Baker's linear ranking algorithm [55]; the crossover operator selects two parents, and each of the two offspring is formed by combining the first several values of one parent with the remaining values of the other parent; the mutation operator randomly selects two values in an individual and exchanges their positions.
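The sketch below shows one way to realize the described encoding and operators while keeping each individual a valid permutation of test indices; it is an illustration rather than the exact implementation of [8], and the selection-pressure parameter is an assumption.

```python
# One way to realize the described GE operators while keeping each
# individual a valid permutation of test indices; an illustration,
# not the exact implementation of [8]. The selection pressure `sp`
# is an assumed parameter.
import random

def linear_ranking_probs(m, sp=1.5):
    """Baker's linear ranking: selection probability by rank
    (rank 0 = worst, rank m-1 = best), with 1 <= sp <= 2."""
    return [(2 - sp + 2 * (sp - 1) * r / (m - 1)) / m for r in range(m)]

def crossover(p1, p2):
    """Offspring = a random-length prefix of p1, followed by the
    remaining tests in the order they appear in p2."""
    cut = random.randint(1, len(p1) - 1)
    prefix = p1[:cut]
    taken = set(prefix)
    return prefix + [t for t in p2 if t not in taken]

def mutate(individual):
    """Exchange the values at two randomly chosen positions."""
    i, j = random.sample(range(len(individual)), 2)
    individual[i], individual[j] = individual[j], individual[i]

# The two offspring of a pair are produced symmetrically:
# child1 = crossover(p1, p2); child2 = crossover(p2, p1)
```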
In this section, we reuse the implementations of ART-D, GA-S, and GE from [18], [22] and compare them with AGA on the 55 open-source projects. Considering the randomness of these techniques, each of them is run 10 times. The remaining settings of this experiment are the same as in Section 5. Due to the space limit of Table 7 (Appendix C), we put the results in Table 5, where each row represents one project, and the running time and APFD of AGA, ART-D, GA-S, and GE are shown separately.

8 INDUSTRIAL CASE STUDY
active users. The experimental results show that the average speedup ratios of AGA over GA and FAST are 44.27X/61.43X and 4.58X (with a significant difference and a very large effect), respectively.

To the best of our knowledge, this is the first attempt to alleviate the efficiency problem of the Greedy Additional TCP approach while maintaining its effectiveness. It is worth noting that the efficiency of a TCP algorithm is especially important as software becomes larger, that is to say, in real-world scenarios. Our empirical evidence indicates that AGA is particularly advantageous for large-scale industrial projects.

ACKNOWLEDGMENTS

The authors would like to thank all the reviewers for their valuable comments and suggestions. This work was supported by the National Natural Science Foundation of China under Grant No. 61872008.
REFERENCES

[1] Gregg Rothermel, Roland H. Untch, Chengyun Chu, and Mary Jean Harrold. Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering, 27(10):929–948, 2001.
[2] Sebastian Elbaum, Alexey Malishevsky, and Gregg Rothermel. Incorporating varying test costs and fault severities into test case prioritization. In Proceedings of the 23rd International Conference on Software Engineering, pages 329–338. IEEE Computer Society, 2001.
[3] Sebastian Elbaum, Alexey G Malishevsky, and Gregg Rothermel. Test case prioritization: A family of empirical studies. IEEE Transactions on Software Engineering, 28(2):159–182, 2002.
[4] Xiao Qu, Myra B Cohen, and Gregg Rothermel. Configuration-aware regression testing: an empirical study of sampling and prioritization. In Proceedings of the 2008 International Symposium on Software Testing and Analysis, pages 75–86. ACM, 2008.
[5] Gregg Rothermel, Roland H Untch, Chengyun Chu, and Mary Jean Harrold. Test case prioritization: An empirical study. In Proceedings of the 1999 IEEE International Conference on Software Maintenance, pages 179–188. IEEE, 1999.
[6] W Eric Wong, Joseph R Horgan, Saul London, and Hiralal Agrawal. A study of effective regression testing in practice. In Proceedings of the Eighth International Symposium on Software Reliability Engineering, pages 264–274. IEEE, 1997.
[7] Lingming Zhang, Dan Hao, Lu Zhang, Gregg Rothermel, and Hong Mei. Bridging the gap between the total and additional test-case prioritization strategies. In Proceedings of the 2013 International Conference on Software Engineering, pages 192–201. IEEE Press, 2013.
[8] Zheng Li, Mark Harman, and Robert M Hierons. Search algorithms for regression test case prioritization. IEEE Transactions on Software Engineering, 33(4):225–237, 2007.
[9] Shen Lin. Computer solutions of the traveling salesman problem. Bell System Technical Journal, 44(10):2245–2269, 1965.
[10] Bo Jiang, Zhenyu Zhang, Wing Kwong Chan, and TH Tse. Adaptive random test case prioritization. In Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, pages 233–244. IEEE Computer Society, 2009.
[11] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 2009.
[12] Christopher Henard, Mike Papadakis, Mark Harman, Yue Jia, and Yves Le Traon. Comparing white-box and black-box test prioritization. In 2016 IEEE/ACM 38th International Conference on Software Engineering, pages 523–534. IEEE, 2016.
[13] Sebastian Elbaum, Gregg Rothermel, and John Penix. Techniques for improving regression testing in continuous integration development environments. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 235–245. ACM, 2014.
[14] Mika V Mäntylä, Bram Adams, Foutse Khomh, Emelie Engström, and Kai Petersen. On rapid releases and software testing: a case study and a semi-systematic literature review. Empirical Software Engineering, 20(5):1384–1425, 2015.
[15] Atif Memon, Zebao Gao, Bao Nguyen, Sanjeev Dhanda, Eric Nickell, Rob Siemborski, and John Micco. Taming Google-scale continuous testing. In Proceedings of the 39th International Conference on Software Engineering: Software Engineering in Practice Track, pages 233–242. IEEE Press, 2017.
[16] Ashish Kumar. Development at the speed and scale of Google. QCon San Francisco, 2010.
[17] Yafeng Lu, Yiling Lou, Shiyang Cheng, Lingming Zhang, Dan Hao, Yangfan Zhou, and Lu Zhang. How does regression test prioritization perform in real-world software evolution? In 2016 IEEE/ACM 38th International Conference on Software Engineering, pages 535–546. IEEE, 2016.
[18] Breno Miranda, Emilio Cruciani, Roberto Verdecchia, and Antonia Bertolino. FAST approaches to scalable similarity-based test case prioritization. In Proceedings of the 40th International Conference on Software Engineering, pages 222–232. ACM, 2018.
[19] Ronald Aylmer Fisher. Statistical methods for research workers. In Breakthroughs in Statistics, pages 66–70. Springer, 1992.
[20] Qi Luo, Kevin Moran, Lingming Zhang, and Denys Poshyvanyk. How do static and dynamic test case prioritization techniques perform on modern software systems? An extensive study on GitHub projects. IEEE Transactions on Software Engineering, 45(11):1054–1080, 2018.
[21] Jianyi Zhou, Junjie Chen, and Dan Hao. Parallel test prioritization. ACM Transactions on Software Engineering and Methodology, 31(1):1–50, 2021.
[22] Junjie Chen, Yiling Lou, Lingming Zhang, Jianyi Zhou, Xiaoleng Wang, Dan Hao, and Lu Zhang. Optimizing test prioritization via test distribution analysis. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 656–667. ACM, 2018.
[23] Song Wang, Jaechang Nam, and Lin Tan. QTEP: quality-aware test case prioritization. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pages 523–534. ACM, 2017.
[24] René Just, Darioush Jalali, Laura Inozemtseva, Michael D Ernst, Reid Holmes, and Gordon Fraser. Are mutants a valid substitute for real faults in software testing? In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 654–665. ACM, 2014.
[25] James H Andrews, Lionel C Briand, and Yvan Labiche. Is mutation an appropriate tool for testing experiments? In Proceedings of the 27th International Conference on Software Engineering, pages 402–411. ACM, 2005.
[26] Hyunsook Do and Gregg Rothermel. On the use of mutation faults in empirical assessments of test case prioritization techniques. IEEE Transactions on Software Engineering, 32(9):733–752, 2006.
[27] Yiling Lou, Dan Hao, and Lu Zhang. Mutation-based test-case prioritization in software evolution. In 2015 IEEE 26th International Symposium on Software Reliability Engineering, pages 46–57. IEEE, 2015.
[28] Hong Mei, Dan Hao, Lingming Zhang, Lu Zhang, Ji Zhou, and Gregg Rothermel. A static approach to prioritizing JUnit test cases. IEEE Transactions on Software Engineering, 38(6):1258–1275, 2012.
[29] Md Junaid Arafeen and Hyunsook Do. Test case prioritization using requirements-based clustering. In 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, pages 312–321. IEEE, 2013.
[30] Hyunsook Do, Siavash Mirarab, Ladan Tahvildari, and Gregg Rothermel. The effects of time constraints on test case prioritization: A series of controlled experiments. IEEE Transactions on Software Engineering, 36(5):593–617, 2010.
[31] Qi Luo, Kevin Moran, and Denys Poshyvanyk. A large-scale empirical comparison of static and dynamic test case prioritization techniques. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 559–570. ACM, 2016.
[32] PIT mutation testing. http://pitest.org/, 2021. Accessed: 2021.
[33] Atlassian Clover – Bitbucket. https://bitbucket.org/atlassian/clover/src/default/, 2021. Accessed: 2021.
[34] Rahul Gopinath, Carlos Jensen, and Alex Groce. Mutations: How close are they to real faults? In 2014 IEEE 25th International Symposium on Software Reliability Engineering, pages 189–200. IEEE, 2014.
[35] Qi Luo, Kevin Moran, Denys Poshyvanyk, and Massimiliano Di Penta. Assessing test case prioritization on real faults and mutants. In 2018 IEEE International Conference on Software Maintenance and Evolution, pages 240–251. IEEE, 2018.
[36] Mike Papadakis, Christopher Henard, and Yves Le Traon. Sampling program inputs with mutation analysis: Going beyond combinatorial interaction testing. In 2014 IEEE Seventh International Conference on Software Testing, Verification and Validation, pages 1–10. IEEE, 2014.
[37] Justyna Petke, Shin Yoo, Myra B Cohen, and Mark Harman. Efficiency and early fault detection with lower and higher strength combinatorial interaction testing. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pages 26–36. ACM, 2013.
[38] Samuel Sanford Shapiro and Martin B Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52(3/4):591–611, 1965.
[39] Henry B Mann and Donald R Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50–60, 1947.
[40] Peter McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society: Series B (Methodological), 42(2):109–127, 1980.
[41] Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. Academic Press, 2013.
[42] Maurice Stevenson Bartlett. Properties of sufficiency and statistical tests. Proceedings of the Royal Society of London, Series A: Mathematical and Physical Sciences, 160(901):268–282, 1937.
[43] John W Tukey. Comparing individual means in the analysis of variance. Biometrics, pages 99–114, 1949.
[44] René Just, Darioush Jalali, and Michael D Ernst. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, pages 437–440. ACM, 2014.
[45] David Paterson, Gregory M Kapfhammer, Gordon Fraser, and Phil McMinn. Using controlled numbers of real faults and mutants to empirically evaluate coverage-based test case prioritization. In Proceedings of the 13th International Workshop on Automation of Software Test, pages 57–63, 2018.
[46] Md Abu Hasan, Md Abdur Rahman, and Md Saeed Siddik. Test case prioritization based on dissimilarity clustering using historical data analysis. In International Conference on Information, Communication and Computing Technology, pages 269–281. Springer, 2017.
[47] Tanzeem Bin Noor and Hadi Hemmati. A similarity-based approach for test case prioritization using historical failure data. In 2015 IEEE 26th International Symposium on Software Reliability Engineering, pages 58–68. IEEE, 2015.
[48] Alireza Haghighatkhah, Mika Mäntylä, Markku Oivo, and Pasi Kuvaja. Test case prioritization using test similarities. In International Conference on Product-Focused Software Process Improvement, pages 243–259. Springer, 2018.
[49] Xia Li, Wei Li, Yuqun Zhang, and Lingming Zhang. DeepFL: Integrating multiple fault diagnosis dimensions for deep fault localization. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 169–180, 2019.
[50] Xia Li and Lingming Zhang. Transforming programs and tests in tandem for fault localization. Proceedings of the ACM on Programming Languages, 1(OOPSLA):1–30, 2017.
[51] Spencer Pearson, José Campos, René Just, Gordon Fraser, Rui Abreu, Michael D Ernst, Deric Pang, and Benjamin Keller. Evaluating and improving fault localization. In 2017 IEEE/ACM 39th International Conference on Software Engineering, pages 609–620. IEEE, 2017.
[52] Jeongju Sohn and Shin Yoo. FLUCCS: Using code and change metrics to improve fault localization. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 273–283. ACM, 2017.
[53] Mengshi Zhang, Xia Li, Lingming Zhang, and Sarfraz Khurshid. Boosting spectrum-based fault localization using PageRank. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 261–272, 2017.
[54] Martina Marré and Antonia Bertolino. Using spanning sets for coverage testing. IEEE Transactions on Software Engineering, 29(11):974–984, 2003.
[55] James Edward Baker. Adaptive selection methods for genetic algorithms. In Proceedings of an International Conference on Genetic Algorithms and Their Applications, volume 1. Hillsdale, New Jersey, 1985.
[56] Bullseye testing technology. http://www.bullseye.com/, 2021. Accessed: 2021.
[57] Mike Papadakis, Donghwan Shin, Shin Yoo, and Doo-Hwan Bae. Are mutation scores correlated with real fault detection? A large scale empirical study on the relationship between mutants and real faults. In 2018 IEEE/ACM 40th International Conference on Software Engineering, pages 537–548. IEEE, 2018.
[58] Muriel Daran and Pascale Thévenod-Fosse. Software error analysis: A real case study involving real faults and mutations. ACM SIGSOFT Software Engineering Notes, 21(3):158–171, 1996.
[59] Gordon Fraser and Franz Wotawa. Test-case prioritization with model-checkers. In 25th IASTED International Conference, 2007.
[60] Shin Yoo, Mark Harman, Paolo Tonella, and Angelo Susi. Clustering test cases to achieve effective and scalable prioritisation incorporating expert knowledge. In Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, pages 201–212. ACM, 2009.
[61] Ripon K Saha, Lingming Zhang, Sarfraz Khurshid, and Dewayne E Perry. An information retrieval approach for regression test prioritization based on program changes. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 1, pages 268–279. IEEE, 2015.
[62] Zengkai Ma and Jianjun Zhao. Test case prioritization based on analysis of program structure. In 2008 15th Asia-Pacific Software Engineering Conference, pages 471–478. IEEE, 2008.
[63] Sebastian Elbaum, Alexey G Malishevsky, and Gregg Rothermel. Prioritizing Test Cases for Regression Testing, volume 25. ACM, 2000.
[64] Hyunsook Do, Gregg Rothermel, and Alex Kinneer. Empirical studies of test case prioritization in a JUnit testing environment. In 15th International Symposium on Software Reliability Engineering, pages 113–124. IEEE, 2004.
[65] James A Jones and Mary Jean Harrold. Test-suite reduction and prioritization for modified condition/decision coverage. IEEE Transactions on Software Engineering, 29(3):195–209, 2003.
[66] Lingming Zhang, Ji Zhou, Dan Hao, Lu Zhang, and Hong Mei. Prioritizing JUnit test cases in absence of coverage information. In 2009 IEEE International Conference on Software Maintenance, pages 19–28. IEEE, 2009.
[67] Bogdan Korel, Luay Ho Tahat, and Mark Harman. Test prioritization using system models. In 21st IEEE International Conference on Software Maintenance, pages 559–568. IEEE, 2005.
[68] Lijun Mei, Zhenyu Zhang, WK Chan, and TH Tse. Test case prioritization for regression testing of service-oriented business applications. In Proceedings of the 18th International Conference on World Wide Web, pages 901–910. ACM, 2009.
[69] Gregory M Kapfhammer and Mary Lou Soffa. Using coverage effectiveness to evaluate test suite prioritizations. In Proceedings of the 1st ACM International Workshop on Empirical Assessment of Software Engineering Languages and Technologies, held in conjunction with the 22nd IEEE/ACM International Conference on Automated Software Engineering, pages 19–20. ACM, 2007.
[70] Donghwan Shin, Shin Yoo, Mike Papadakis, and Doo-Hwan Bae. Empirical evaluation of mutation-based test case prioritization techniques. Software Testing, Verification and Reliability, 29(1-2):e1695, 2019.
[71] Hyunsook Do, Siavash Mirarab, Ladan Tahvildari, and Gregg Rothermel. An empirical study of the effect of time constraints on the cost-benefits of regression testing. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 71–82. ACM, 2008.
[72] Dan Hao, Lu Zhang, and Hong Mei. Test-case prioritization: achievements and challenges. Frontiers of Computer Science, 10(5):769–777, 2016.
[73] Michael G Epitropakis, Shin Yoo, Mark Harman, and Edmund K Burke. Empirical evaluation of Pareto efficient multi-objective regression test case prioritisation. In Proceedings of the 2015 International Symposium on Software Testing and Analysis, pages 234–245. ACM, 2015.
[74] Dan Hao, Lu Zhang, Lei Zang, Yanbo Wang, Xingxia Wu, and Tao Xie. To be optimal or not in test-case prioritization. IEEE Transactions on Software Engineering, 42(5):490–505, 2015.
[75] Shin Yoo and Mark Harman. Regression testing minimization, selection and prioritization: a survey. Software Testing, Verification and Reliability, 22(2):67–120, 2012.
[76] Cagatay Catal and Deepti Mishra. Test case prioritization: a systematic mapping study. Software Quality Journal, 21(3):445–478, 2013.
[77] Sanjukta Mohanty, Arup Abhinna Acharya, and Durga Prasad Mohapatra. A survey on model based test case prioritization. International Journal of Computer Science and Information Technologies, 2(3):1042–1047, 2011.
[78] Dario Di Nucci, Annibale Panichella, Andy Zaidman, and Andrea De Lucia. A test case prioritization genetic algorithm guided by the hypervolume indicator. IEEE Transactions on Software Engineering, 2018.
[79] Maral Azizi and Hyunsook Do. Graphite: A greedy graph-based technique for regression test case prioritization. In 2018 IEEE International Symposium on Software Reliability Engineering Workshops, pages 245–251. IEEE, 2018.
[80] Jinfu Chen, Lili Zhu, Tsong Yueh Chen, Dave Towey, Fei-Ching Kuo, Rubing Huang, and Yuchi Guo. Test case prioritization for object-oriented software: An adaptive random sequence approach based on clustering. Journal of Systems and Software, 135:107–125, 2018.
[81] Dusica Marijan, Arnaud Gotlieb, and Sagar Sen. Test case prioritization for continuous regression testing: An industrial case study. In 2013 IEEE International Conference on Software Maintenance, pages 540–543. IEEE, 2013.
[82] Eric Knauss, Miroslaw Staron, Wilhelm Meding, Ola Söder, Agneta Nilsson, and Magnus Castell. Supporting continuous integration by code-churn based test selection. In Proceedings of the Second International Workshop on Rapid Continuous Software Engineering, pages 19–25. IEEE Press, 2015.

Dan Hao is an associate professor at the School of Computer Science, Peking University, P.R. China. She received her Ph.D. in Computer Science from Peking University in 2008 and her B.S. in Computer Science from the Harbin Institute of Technology in 2002. She was a program co-chair of ASE 2021 and SANER 2022, a general co-chair of SPLC 2018, and has served on the program committees of many prestigious conferences (e.g., ICSE, FSE, ASE, and ISSTA). Her current research interests include software testing and debugging.
APPENDIX A
CHARTS OF ITERATION NUMBER AND TIME COST

To better analyze the relationship between iteration number and time cost, we put the detailed results of Section 6.1 here. We draw a line chart of iteration number and time cost for each subject; as the iteration number becomes smaller, the prioritization also becomes faster. The plots also support our claim that the iteration number contributes much to the time cost. As k is the coefficient in the O(kmn) time complexity, it largely determines the actual efficiency in practice, so we think there is a large space to reduce the time complexity.

(Figure: line charts of time cost in seconds against iteration number for jopt−simple, languagetool, commons−math, la4j−new, mapdb−mapdb−1.0.9, jsprit, jsoup, rome−1.5.0, assertj−core, la4j, and commons−dbcp.)

APPENDIX B
BASIC INFORMATION OF OPEN-SOURCE SUBJECTS

Table 6 shows some basic information of our 55 open-source subjects. Specifically, for each subject, we present the source lines of code (SLOC), test lines of code (TLOC), number of test cases (#Test cases), and number of mutants (#Mutants).

APPENDIX C
RESULTS OF OPEN-SOURCE SUBJECTS

Due to the space limit, we show the complete results on the open-source subjects in Table 7. The subjects are sorted in ascending order of source lines of code (SLOC). The first three columns present the results for RQ1, the first five columns present the results for RQ2, the first nine columns present the results for RQ3, and the last four columns present the comparison results with FAST. The detailed analysis can be found in Sections 6 and 7.

APPENDIX D
RESULTS OF INDUSTRIAL SUBJECTS

Due to the space limit, we present the complete results on the industrial subjects in Table 8. For each subject, we present its SLOC, #Test cases, and the time cost of GA, AGA, FAST, ART-D, GA-S, and GE, respectively. The detailed analysis can be found in Section 8.