

Searching for Better Test Case Prioritization Schemes: a Case Study of AI-assisted Systematic Literature Review

Zhe Yu · Jeffrey C. Carver · Gregg Rothermel · Tim Menzies



arXiv:1909.07249v2 [cs.SE] 2 Nov 2019

Abstract Given the large numbers of publications in the SE field, it is difficult to keep
current with the latest developments. In theory, AI tools could assist in finding relevant work
but those AI tools have primarily been tested/validated in simulations rather than actual
literature reviews. Accordingly, using a realistic case study, this paper assesses how well
machine learning algorithms can help with literature reviews.
The target of this case study is to identify test case prioritization techniques for auto-
mated UI testing; specifically from 8,349 papers on IEEE Xplore. This corpus was studied
with an incrementally updated human-in-the-loop active learning text miner. Using that AI
tool, in three hours, we found 242 relevant papers from which we identified 12 techniques
representing the state-of-the-art in test case prioritization when source code information is
not available.
The foregoing results were validated by having six graduate students manually explore
the same corpus. Using data from that validation study, we determined that without AI tools,
this task would have taken 53 hours and would have found 27 additional papers. That is, with 6% of
the effort of manual methods, our AI tools achieved a 90% recall. Significantly, the same 12
state-of-the-art test case prioritization techniques were found by both the AI study and the
manual study. That is, the 27/242 papers missed by the AI study would not have changed
our conclusions. Hence, this study endorses the use of machine learning algorithms to assist
future literature reviews.

Keywords Systematic Literature Review · Test Case Prioritization · Software Engineering ·


Active Learning · Primary Study Selection

Zhe Yu
Department of Computer Science, North Carolina State University, Raleigh, NC, USA. E-mail: [email protected]
Jeffrey C. Carver
Department of Computer Science, University of Alabama, Box 870290, 101 Houser Hall, Tuscaloosa, AL
35487-0290, USA. E-mail: [email protected]
Gregg Rothermel,
Department of Computer Science, North Carolina State University, Raleigh, NC, USA. E-mail:
[email protected]
Tim Menzies
Department of Computer Science, North Carolina State University, Raleigh, NC, USA. E-mail: [email protected]

1 Introduction

New papers are being published every day, and in increasing numbers. Knowing what other
researchers have done to address a problem has become as important as, if not more important
than, providing a novel solution. However, it is also increasingly difficult to stay current
with what other researchers are doing. For example, when searching for work on test case
prioritization (TCP) on IEEE Xplore1 , 2,704 results would have been returned in 2009, while
in 2019, that number has grown to 8,349. As a result, finding an efficient way to conduct
literature reviews and extract useful information from thousands of papers has become a
crucial problem for researchers.
To address this problem, software engineering researchers have introduced Systematic
Literature Reviews (SLRs) [17, 30]. Following a set of guidelines, researchers conducting
SLRs manually examine all the papers relevant to some research questions and summarize
the research area. Other researchers can then get a general idea about current activity in
their field of interest by reading the published SLRs. However, SLRs are not conducted
frequently because of their labor-intensive and time-consuming nature [56]. As a result, when
researchers explore a specific problem, they often find that existing SLRs are outdated and
they need to carry out their own literature reviews.
We (the authors of this article) were faced with the same problem when we were exploring
the state-of-the-art in applying TCP techniques to automated UI tests [55]. There was no avail-
able SLR summarizing these TCP techniques; thus, we needed to conduct our own literature
review. One tedious task we faced in doing this was to find all the relevant TCP papers
from among 8,349 search results on IEEE Xplore, using their titles and abstracts. In our estimation, this would require around 53 human hours. To reduce this time, we applied a state-of-the-art human-in-the-loop machine learning algorithm called FASTREAD [56, 58] to assist in selecting relevant papers. By considering and labeling 470 papers suggested by the machine learning algorithm in three hours, we included 242 relevant papers for full-text reviewing. The algorithm indicated that those 242 relevant papers constituted 91% of all the relevant papers in the 8,349 search results; finding more, however, would require much more human effort. This was an impressive result because 50 human hours, which is 94% of the original time required, can be saved by sacrificing 10% recall (missing 24 relevant papers).

Fig. 1: Human hours required to retrieve 90% of the relevant papers from the search result (Linear Screen vs. FASTREAD).
Because this was the first time FASTREAD was applied in a real literature review, we
wished to validate its result. Enlisting six other graduate students to manually screen a subset
of the candidate papers and perform full-text reviewing of the missing relevant papers from
FASTREAD, we found:

– The FASTREAD selection process included 85% of the relevant papers, with human errors contributing to 5% of the missing papers. That said, FASTREAD was responsible for missing 10% of the relevant papers, which is very close to its estimation of having missed 9% of the relevant papers.
– The distributions of the missing relevant papers are very different from the distribution of papers that FASTREAD included. This suggests that FASTREAD introduced a sampling bias into the relevant papers it identified. On the other hand, with the missing relevant papers added into the included papers, the distribution of the included papers remained roughly unchanged. As a result, the overall conclusions of the literature review were not affected by the missing relevant papers.

1 https://ieeexplore.ieee.org
In conclusion, FASTREAD was able to include 90% of the relevant papers, as it claimed, and
the 242 papers it included were sufficient for our literature review. The set of TCP techniques
identified through the literature review was used as a baseline in our published work on
prioritizing automated UI tests [55]. Therefore, we believe that, in our literature review, saving 50 hours of relevant-paper selection was worth omitting 10% of the relevant papers, and this may extend to other situations in which SLRs are conducted. When conducting studies such as systematic mappings, however, where the actual numbers and distributions matter, AI assistants such as FASTREAD may need to be avoided until the bias issue has been resolved.
The main contributions of this paper are as follows:
– We conducted the first systematic literature review using FASTREAD. This is also among
the very few SLRs assisted by machine learning algorithms.
– We validated the use of FASTREAD in an SLR and observed its strengths and weaknesses.
– We scripted our entire SLR process and provide it in a public GitHub repo (https://github.com/fastread/SLR_on_TCP) so that other researchers can use the data for reproduction and for improvements on machine-learning-assisted primary study selection.
The remainder of this paper explores the research questions shown in Table 1. Section 2
presents background material and related work. Section 3 reports on the SLR process and
answers RQ1 with SLR results. Section 4 reports the selection results of FASTREAD and
answers RQ2. Section 5 concludes the paper and discusses the potential future work.

2 Background and Related Work

In this section, we provide background information on systematic literature reviews, and


introduce the FASTREAD algorithm.

2.1 Systematic Literature Reviews

Systematic Literature Reviews (SLRs) have become a well established and widely applied
review method in Software Engineering since Kitchenham, Dybå, and Jørgensen first adopted
them to support evidence-based software engineering in 2004 and 2005 [17,30]. SLRs employ
a defined search strategy and inclusion/exclusion criteria to identify as much of the relevant
literature as possible. As a result, compared to traditional literature reviews, an SLR
provides thorough, unbiased and valuable summaries of the existing information on specific
research questions. However, SLRs also require much more human effort than traditional
reviews (weeks to months of work as reported in [56]); therefore, SLRs cannot be conducted
or updated very frequently. Primary study selection, where thousands of candidate papers
must be reviewed by humans to find the dozens of relevant papers to be included in the SLR,
is one of the most time-consuming steps in conducting SLRs [4].

Table 1: Research questions and main motivations

RQ1: What are the state-of-the-art TCP techniques that can be applied to automated UI testing?
    Motivation: Through exploring RQ1a-d, we aim to identify TCP techniques that prioritize for failure detection rate, use real failure data, and do not rely on source code information.

RQ1a: What are the prioritization goals?
    Motivation: Identify the goals of the TCP techniques, e.g., higher fault detection rate, failure detection rate, or code coverage.

RQ1b: What types of data are used?
    Motivation: Identify the various data sources used in TCP papers.

RQ1c: What information is used to prioritize the test cases?
    Motivation: Identify the sources of independent variables (features) used in different TCP techniques.

RQ1d: What methods do not rely on source code information?
    Motivation: Identify the algorithms and strategies applied by TCP techniques when no source code, changes in source code, or requirements information is available.

RQ2: What are the wins and losses of using FASTREAD to guide the selection of relevant papers?
    Motivation: Study the use of FASTREAD for human effort reduction in paper selection.

RQ2a: What percentage of relevant papers does FASTREAD actually retrieve?
    Motivation: Determine whether using FASTREAD, user-defined target recall can be achieved by screening only a small portion of the search results.

RQ2b: What information is missing in the final report because of FASTREAD?
    Motivation: Validate whether retrieving and analyzing a certain percentage of relevant papers significantly shifts the conclusions of the SLR.

2.2 Machine Learning Algorithms for Primary Study Selection

The problem of how to efficiently find the dozens of relevant papers among thousands of
candidates is categorized as one type of information retrieval problem called total recall, and
has been studied for years [8, 9, 23, 37, 53, 56, 58]. With the goal of optimizing the cost for
achieving very high recall—as close as practicable to 100%—with a human assessor in the
loop [24], the total recall problem can be described as follows [57]:

The Total Recall Problem:


Given a set of candidates E , in which only a small fraction R ⊂ E are positive,
each candidate x ∈ E can be inspected to reveal its label as positive (x ∈ R) or
negative (x ∉ R) at a cost. Starting with the labeled set L = ∅, the task is to inspect
and label as few candidates as possible (min |L|) while achieving very high recall
(max |L ∩ R|/|R|).

Active learning based approaches, where machine learning algorithms work alongside
humans to learn from human classifications and suggest what needs to be reviewed by
humans next, are widely applied in solving total recall problems [24]. The key idea be-
hind active learning is that a machine learning algorithm can train faster (i.e. using less
data) if it is allowed to choose the data from which it learns [45]. The experience in total
recall problems explored to date is that such active learners outperform supervised and
semi-supervised learners and can significantly reduce the effort required to achieve high
recall [8–12, 23, 49–53, 56, 58]. To understand active learning, consider the decision plane between the positive and negative data points shown in Figure 2. Suppose we want to find more positive data points and we had access to the model shown in Figure 2. One tactic would be to inspect the unlabeled data points that fall into the region of red circles in this figure, as far as possible from the green squares (this tactic is called certainty sampling). Another tactic would be to verify the position of the boundary; i.e., to inspect the unlabeled data points that are closest to the boundary (this tactic is called uncertainty sampling). Besides the query tactics, state-of-the-art approaches [56, 58] follow the general framework shown in Figure 3 and consider the problem of how to stop the inspection at a target recall and how to efficiently correct human errors. When simulated with reverse-engineered primary study selection datasets, these active learning based algorithms can retrieve 90-95% of relevant papers by reviewing only 5-20% of the search results [58]. This could save weeks of work for humans who might otherwise need to screen thousands of papers.

Fig. 2: Separating positive data points (red circles) from negative ones (green squares).
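To make the two query tactics concrete, the following minimal Python sketch (an illustration only, not the implementation evaluated in [56, 58]) ranks unlabeled documents with a linear SVM's decision function over TF-IDF features; the tiny corpus and the already-screened labels are hypothetical.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["test case prioritization for regression testing",   # hypothetical corpus
        "deep learning for image recognition",
        "prioritizing ui tests by failure history",
        "database query optimization"]
labels = {0: 1, 1: 0}                      # indices already screened by the human (1 = relevant)

X = TfidfVectorizer().fit_transform(docs)
screened = sorted(labels)
unlabeled = [i for i in range(len(docs)) if i not in labels]

clf = LinearSVC().fit(X[screened], [labels[i] for i in screened])
scores = clf.decision_function(X[unlabeled])

# Certainty sampling: most confidently relevant first.
certainty = [unlabeled[i] for i in np.argsort(scores)[::-1]]
# Uncertainty sampling: closest to the decision boundary first.
uncertainty = [unlabeled[i] for i in np.argsort(np.abs(scores))]
print(certainty, uncertainty)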
Despite the foregoing fact, many years have passed while few machine learning algo-
rithms have actually been applied in real systematic literature reviews. To the best of our
knowledge, only Xiong et al. [54] have employed a machine learning aided primary study
selection in a systematic review. Even though the machine learning algorithm applied in-
volved a combination of supervised and unsupervised learning (not active learning), they
succeeded in reducing the cost of primary study selection by around 85%. This motivates the
study in this article: we would like to conduct a systematic literature review by utilizing a
state-of-the-art active learning approach (FASTREAD [56, 58]) to perform primary study
selection. We chose to apply FASTREAD to assist the SLR in our case study because in
previous work, (1) it outperformed other approaches in terms of inclusion rate [56], (2) its recall estimation gives the user confidence about how much is missed when the selection stops [58], and (3) it has also been shown effective in solving other software engineering problems [57], such as inspecting software security vulnerabilities [59] and finding self-admitted technical debt [21].

2.3 FASTREAD

FASTREAD is an active learning tool2 that helps reduce the cost of primary study selection
in SLRs [56, 58]. Consider a primary study selection with

– E : the set of all candidate papers from the search results.


– R: the set of relevant papers to be included (R ⊂ E ).
– L: the set of papers already reviewed and classified by humans (L ⊂ E ).
– LR = L ∩ R: the set of included papers.

Instead of reviewing and classifying all candidate papers in a random order, a primary study
selection with FASTREAD follows the procedure used in active learning frameworks for
total recall problems as shown in Figure 3, and benefits from three features:

2 https://github.com/fastread/src

Fig. 3: Active learning framework for total recall problems. Steps: (1) Feature Extraction, (2) Initial Sampling, (3) Human Oracle, (4) Train Model, (5) Error Prediction, (6) Double Check, (7) Recall Estimation, (8) Query Strategy.

1. Higher inclusion rate: FASTREAD incrementally trains/updates a machine learning


model (in Step 4) on the human classification results (L and LR from Step 3). With the
help of the machine learning model, FASTREAD dynamically adjusts the order of papers
to be reviewed and classified by humans next (in Step 8) so that relevant papers will be
reviewed and included by humans earlier.
2. Recall estimation: FASTREAD estimates the total number of relevant papers |RE| ≈ |R| (see footnote 3) with a semi-supervised learning algorithm (in Step 7). The human can then stop the primary study selection process once the estimate indicates that a pre-determined target recall Trec has been reached, i.e., when |LR|/|RE| ≥ Trec (a small check illustrating this stopping rule appears after this list).
3. Human error correction: FASTREAD also predicts which papers are most likely to
have been misclassified by humans (in Step 5). Humans can double check those papers
(in Step 6) to correct those errors efficiently.
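To make the stopping rule concrete, here is a minimal check (an illustrative sketch of our own, using the numbers that arise later in Section 3.2.2):

# Stop screening once the estimated recall reaches the target (Trec = 0.9 in this study).
def should_stop(n_included, estimated_total_relevant, target_recall=0.9):
    return n_included >= target_recall * estimated_total_relevant

# In the case study, screening stopped when 242 papers were included and
# FASTREAD estimated 266 relevant papers in total: 242/266 ≈ 0.91 >= 0.90.
print(should_stop(242, 266))  # True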
The version of FASTREAD that we implemented for our case study is shown in Algorithm 1.

3 Case Study: A Systematic Literature Review on Test Case Prioritization with FASTREAD

Changes in a version of a software system may affect the behavior of that system. Regression
testing is performed to ensure that changes do not adversely affect the behavior [5]. As a
regression test suite grows with the size of a software system, software developers need to
wait increasingly longer times before they can get useful feedback on their latest commits. In
practice these times can be quite long; for example, Elbaum et al. [18] report on a test suite
of software with 20,000 lines of code that requires 7 weeks to run.
3 Here, |RE| ≈ |R| means that |RE| is an estimate of the value of |R|.

Algorithm 1: Pseudo Code for FASTREAD [58] Implemented in the SLR

Input : E, the candidate paper set (search results)
        Trec, target recall
        N1, batch size
        N2, threshold of query strategy
        Q, search query for BM25 to boost initial selection
Output: L, screened papers
        LR, included relevant papers

 1  L ← ∅;
 2  LR ← ∅;
 3  |RE| ← ∞;
    // Keep screening until target recall Trec has been achieved.
 4  while |LR| < Trec |RE| do
        // Start training when first relevant paper is found
 5      if |LR| ≥ 1 then
            // Alleviate bias in negative training examples
 6          Lpre ← Presume(L, E \ L);
 7          CL ← Train(Lpre);
            // Estimate #relevant papers
 8          |RE| ← SEMI(CL, E, L, LR);
            // Select unscreened papers for human to screen
 9          X ← Query(CL, E \ L, LR);
10      else
            // Select unscreened papers for human to screen by BM25 ranking
11          X ← argsort(BM25(E \ L, Q))[: N1];
        // Human screens selected papers
12      foreach x ∈ X do
            // Include paper if relevant
13          if Screen(x) then
14              LR ← LR ∪ {x};
            // Add paper into screened set
15          L ← L ∪ {x};
16  return L, LR;

17  Function Presume(L, E \ L)
        // Randomly sample |L| points from E \ L, presume those as non-relevant
18      return L ∪ Random(E \ L, |L|);

19  Function Train(Lpre)
        // Train linear SVM with balanced class weighting
20      CL ← SVM(Lpre, kernel=linear, class_weight=balanced);
21      if |LR| ≥ N2 then
            // Aggressive undersampling
22          LI ← Lpre \ LR;
23          tmp ← LI[argsort(CL.decision_function(LI))[: |LR|]];
24          CL ← SVM(LR ∪ tmp, kernel=linear);
25      return CL;

26  Function Query(CL, E \ L, LR)
27      if |LR| ≥ N2 then
            // Certainty sampling (highest predicted relevance first)
28          X ← argsort(CL.decision_function(E \ L))[:: −1][: N1];
29      else
            // Uncertainty sampling (closest to the decision boundary first)
30          X ← argsort(abs(CL.decision_function(E \ L)))[: N1];
31      return X;

32  Function Screen(x)
33      if human thinks x is relevant then
34          return True;
35      else
36          return False;

 1  Function SEMI(CL, E, L, LR)
 2      |RE|last ← 0;
 3      ¬L ← E \ L;
 4      foreach x ∈ E do
 5          D(x) ← CL.decision_function(x);
 6          if x ∈ LR then
 7              Y(x) ← 1;
 8          else
 9              Y(x) ← 0;
10      |RE| ← Σ_{x∈E} Y(x);
11      while |RE| ≠ |RE|last do
            // Fit and transform Logistic Regression
12          LReg ← LogisticRegression(D, Y);
13          Y ← TemporaryLabel(LReg, ¬L, Y);
14          |RE|last ← |RE|;
            // Estimation based on temporary labels
15          |RE| ← Σ_{x∈E} Y(x);
16      return |RE|;

17  Function TemporaryLabel(LReg, ¬L, Y)
18      count ← 0;
19      target ← 1;
20      can ← ∅;
        // Sort ¬L by descending order of LReg(x)
21      ¬L ← SortBy(¬L, LReg);
22      foreach x ∈ ¬L do
23          count ← count + LReg(x);
24          can ← can ∪ {x};
25          if count ≥ target then
26              Y(can[0]) ← 1;
27              target ← target + 1;
28              can ← ∅;
29      return Y;
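For readers who want to experiment, the following is a simplified, runnable Python sketch of the selection loop in Algorithm 1, assuming scikit-learn and a TF-IDF representation of titles and abstracts. It illustrates only the presumed-negative sampling, aggressive undersampling, and query-strategy switch: the screen callback stands in for the human oracle, the BM25 seeding and SEMI recall estimator are replaced by caller-supplied seeds and a fixed screening budget, and all names are hypothetical rather than taken from the official FASTREAD code.

import random
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def select_papers(candidates, screen, seed_ids, n1=10, n2=30, budget=200):
    """candidates: list of 'title. abstract' strings; screen(i) -> bool is the human oracle."""
    X = TfidfVectorizer(stop_words="english").fit_transform(candidates)
    screened, included = set(), set()

    # Seed with a few papers found by keyword search and screened by the human.
    for i in seed_ids:
        screened.add(i)
        if screen(i):
            included.add(i)

    while len(screened) < min(budget, len(candidates)):
        unl = [i for i in range(len(candidates)) if i not in screened]
        if included:
            # Presume |screened| random unlabeled papers to be non-relevant (Presume()).
            presumed = random.sample(unl, min(len(screened), len(unl)))
            train_ids = list(screened) + presumed
            y = [i in included for i in train_ids]
            clf = SVC(kernel="linear", class_weight="balanced").fit(X[train_ids], y)
            if len(included) >= n2:
                # Aggressive undersampling: keep only the most confidently negative examples.
                neg = [i for i in train_ids if i not in included]
                keep = np.argsort(clf.decision_function(X[neg]))[:len(included)]
                neg = [neg[k] for k in keep]
                ids = list(included) + neg
                clf = SVC(kernel="linear").fit(X[ids], [i in included for i in ids])
                # Certainty sampling: highest predicted relevance first.
                order = np.argsort(clf.decision_function(X[unl]))[::-1]
            else:
                # Uncertainty sampling: closest to the decision boundary first.
                order = np.argsort(np.abs(clf.decision_function(X[unl])))
            batch = [unl[k] for k in order[:n1]]
        else:
            batch = unl[:n1]  # no relevant paper found yet: fall back to the given order
        for i in batch:
            screened.add(i)
            if screen(i):
                included.add(i)
    return screened, included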

Software engineering researchers have explored various techniques for improving the
cost-effectiveness of regression testing. Test case prioritization (TCP) is one such technique;
it schedules test cases for execution in an order that attempts to increase their effectiveness
at meeting some performance goal [43]. Unlike other techniques such as test case selection,
TCP techniques use the entire test suite and reduce testing cost by achieving parallelization
of testing and debugging activities [14]. By retaining all test cases, TCP techniques do not
run the risk of omitting some important test cases.
There are four attributes that distinguish papers on TCP techniques:

– What is prioritized for: the goal of the TCP technique. Common goals include increased
fault detection rate [43], increased failure detection rate [26], and the faster achievement
of coverage [33]. Different performance goals lead to different evaluation criteria.
– What information is used: TCP techniques rely on different types of information to
schedule test cases, e.g., they may rely on data showing the coverage of source code for
each test case [28] or on the execution history of test cases in previous runs [29].

Fig. 4: To test a page shown at the left (a), programmers write a test description (b) which is converted to test code (c).

(a) UI test: a screenshot of the search page (image not reproduced).

(b) Test description (Test search for 1+1):
  ● Given the user has entered search term "1+1"
  ● When the user clicks on the search button
  ● Then single result is shown for "2"

(c) Test code:
  @Given("^the user entered search term '(.*?)'$")
  public void searchFor(String searchTerm){
      document.getElementById("searchInput").sendKeys(searchTerm);
  }

  @When("^the user clicks on the search button$")
  public void clickSearchButton(){
      document.getElementById("searchButton").click();
  }

  @Then("^Single result is shown for '(.*?)'$")
  public void assertSingleResult(String searchResult){
      assertTrue(document.getElementById("searchResult").innerHTML==searchResult);
  }

– What method is applied: different methods can be applied to utilize the foregoing infor-
mation to achieve the goals of techniques, such as search-based algorithms [39], greedy
algorithms [43], or supervised learning algorithms [26].
– Data type for evaluation: what type of data is applied to evaluate different TCP techniques.
Most important, are the data synthesized with injected faults [15] or based on real testing
results with failure and fault information [48]?
Automated user interface (UI) testing leads to one special case of regression testing.
Compared to unit tests, automated UI tests are more expensive to write and maintain. Worse
still, since automated UI tests are expressed in terms of actions taken by a browser user
agent, failures do not have a straightforward relationship to the underlying application code
or architecture. Figure 4 shows how one automated UI test case is designed to exercise a UI
performing a simple search on the string “1+1”. In this example, the test designer wishes to
test the search function by (a) verifying that when a user inputs “1+1” and clicks the search
button, a result of “2” will appear. To automate this UI test, the test designer would first (c)
define the test code for a set of scenarios, then (b) write the automated UI test case with the
pre-defined scenarios and expected input and output. In this way, the test designer does not
need to know what code will be executed when an automated UI test is executed, and the
pre-defined scenarios can be reused in designing other automated UI test cases. As a result,
when prioritizing these automated UI tests, neither source code information nor the mapping between a test case failure and a fault in the codebase is available.
We conducted this SLR to identify research papers on TCP techniques that can be
applied to automated UI testing. The requirements for such papers are: (1) they must present
techniques that prioritize for failure detection rate, (2) they must use real failure data, and (3)
they cannot rely on source code information. While (3) is essential for such techniques, (1)
and (2) can be relaxed if too few techniques satisfy all the requirements. The SLR investigated
papers from January 1, 1956 to January 1, 2019 and included the following phases, which we
go on to describe in turn:
– Planning,
– Execution,
– Validation, and

– Reporting.

3.1 Phase 1—planning

In this phase, we specified research questions, search strategy, inclusion and exclusion criteria,
classification of papers, and threats to validity.

3.1.1 Research questions (RQ1)

We established four research questions by which to identify the primary papers that explore
test case prioritization. Research questions and their primary motivations are shown in Table 1
as RQ1a-d.

3.1.2 Search strategy

With Boolean operator OR to link the synonyms of the main terms and Boolean operator
AND to combine the main terms, the search string we applied is as follows:
[software AND test AND (rank* OR optimi* OR prioriti*)].
We executed this search string in the IEEE Xplore4 database to find papers containing the
keywords in their titles and abstracts. We chose to search IEEE Xplore because it covers a
large portion of the software engineering publications and is the only database we know of
in which thousands of search results can be downloaded automatically with their titles and
abstracts.

3.1.3 Inclusion and exclusion criteria

This review included papers published between 1956 and 2018. Papers from peer-reviewed
journals, conferences, and workshops were considered. We excluded papers that were not
related to test case prioritization in the context of software engineering, such as papers on
test case selection or fault localization. The inclusion criteria (IC) and exclusion criteria (EC)
are as follows:
IC 1 Primary papers on TCP.
IC 2 Secondary papers on TCP.
EC 1 Primary papers on test case selection or test suite reduction only.
EC 2 Primary papers on test case generation only.
EC 3 Primary papers on fault localization.
Primary study selection was performed by the first author alone. FASTREAD was applied
to help this process include 90% of the relevant papers (Trec = 0.9, N1 = 10, N2 = 30 ).
We targeted 90% recall because the creators of FASTREAD suggest that 90-95% recall is
appropriate, given that the cost required to reach higher recall increases exponentially [56, 58].
Whether 90% recall is in fact sufficient will be examined further in Section 4.

3.1.4 Classification of papers

Classification was also performed by the first author alone. The papers were classified
according to the properties and categories listed in Table 2. Details on each category will be
4 https://ieeexplore.ieee.org

Table 2: Classification of papers

Property                              Categories
RQ1a: What is prioritized for         Fault detection rate, failure detection rate, coverage, none.
RQ1b: Data type for evaluation        No fault or failure, injected faults, real fault or failure.
RQ1c: What information is used        Source code, change, requirement, history, test case, feedback.
RQ1d: What method is applied          History-based, test case-based, feedback-based.

provided in Section 3.3. The categories for RQ1a and RQ1c are non-exclusive. For example,
one paper may utilize both source code and history information.

3.1.5 Threats to validity

There are two major validity threats to this systematic literature review:
1. Only one data source: we searched for papers in one data source (IEEE Xplore) because
retrieving search results in other databases would have been inordinately expensive.
Therefore, TCP papers in journals or conference proceedings that are not indexed by
IEEE Xplore were not included in this SLR study.
2. Primary study selection with FASTREAD: this is the first SLR study conducted with
FASTREAD, applied to a single case, and the extent to which results will generalize
cannot be determined.

3.2 Phase 2—execution

3.2.1 Search

After applying the search string discussed in Section 3.1.2 in IEEE Xplore, we obtained a
result of 8,381 candidate papers. These 8,381 papers were downloaded automatically with
their title, abstract, pdf link, and publication year information. Among the 8,381 papers, 32
were not research papers (e.g., they were editorials or prefaces) and were thus excluded. The
search process, including the design of the search string and retrieval of all the search results,
required approximately 1 hour.

3.2.2 Primary study selection

Following the instructions for using the FASTREAD tool5 , we performed the following steps
to select the primary studies with a target recall Trec = 90%:
1. We loaded the search results of 8,349 papers with their titles and abstracts into FAS-
TREAD.
2. We searched for keywords “test prioritization” and screened the first ten results by
reading the titles and abstracts. Ten papers were included as relevant.
3. Given that |LR | = 10 ≥ 1, when the Next button is selected, an SVM model is trained
based on the ten screened papers and suggestions for uncertainty sampling and certainty
sampling are provided. Because |LR | = 10 < 30, the papers suggested during uncertainty
sampling were screened.
5 https://github.com/fastread/src

Fig. 5: Interface of the FASTREAD tool when the study selection stopped.

4. After 20 more papers were reviewed based on uncertainty sampling (the SVM model was
retrained and suggestions were updated for every 10 papers screened by the author), 30
relevant and zero non-relevant papers were screened. Because |LR | = 30 ≥ 30, certainty
sampling was applied for the rest of the papers.
5. Finally, when 440 more papers had been screened based on certainty sampling, 242
relevant papers were found among the 470 reviewed ones (|LR | = 242 and |L| = 470), as
shown in Figure 5. Meanwhile, the estimated recall was |LR |/|RE | = 242/266 = 91%.
This was the first time the estimated recall reached the target recall (Trec = 90%). The
selection thus stopped and the results were exported.
The primary study selection process with FASTREAD required approximately three hours of
effort by one person.

3.2.3 Full-text review and paper classification

The 242 papers identified in the search were reviewed in full text and classified according to
Table 2 in Section 3.1.4. Among the 242 papers, three were determined to be not relevant to
our research questions based on their full text and were thus excluded. The paper classification
process required approximately 40 hours of effort by one person.

3.3 Phase 3—reporting

In this phase, the analytical results of the systematic literature review are discussed based
on research questions RQ1a – RQ1d. These results were collected through a full-text review of
the 239 papers identified in the search. The distribution of publication years on the selected
papers is shown in Figure 6.

3.3.1 RQ1a: What are the prioritization goals?

Figure 7 shows the distribution of the prioritization goals for the TCP techniques each paper
considered. Different goals can overlap when one technique prioritizes for multiple goals and these are evaluated by multiple performance metrics.

Fig. 6: Distribution of publication years

Fig. 7: Distribution of prioritization goals

From the distribution in Figure 7, we
observe the following:
– Fault detection rate: Most of the papers (close to 75%) present techniques that prioritize
test cases to achieve higher fault detection rates. An improved rate of fault detection during
testing can provide faster feedback on the system under test and let software engineers
begin correcting faults earlier than might otherwise be possible [43]. The most popular
performance metric for evaluating early fault detection is the average percentage of faults detected (APFD), proposed by Rothermel et al. [43] in 2001. APFD is calculated as (1):

  APFD = 1 − (TF1 + TF2 + · · · + TFm)/(n·m) + 1/(2n)    (1)

  where TFi is the position, in the prioritized order, of the first test case that reveals fault i, m is the total number of faults revealed by the test suite, and n is the total number of test cases in the test suite. Ranging from 0% to 100%, higher APFD values mean better ordering of test cases in terms of early fault detection (a small worked example appears after this list).
– Failure detection rate: A few (seven) papers present techniques that prioritize test cases
to achieve higher failure detection rate [1, 26, 46–48, 61, 62]. This is a compromise prioriti-
zation target when failure to fault mapping information is not available. Usually, APFD
can also be applied to measure the failure detection rate [34].
– Coverage: Some (13%) of the TCP techniques presented in the papers attempt to order
test cases such that higher coverage can be achieved earlier. Here, coverage can refer
to requirements coverage [27], statement coverage [33], decision coverage [33], block
coverage [33], branch coverage [43], and so forth.
– None: 23 papers present TCP techniques without explicitly assigning them a goal.
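As the worked example promised above (our own illustrative numbers, not data from any surveyed paper): for a prioritized suite of n = 5 test cases in which m = 3 faults are first revealed at positions 1, 2, and 4, Equation (1) gives APFD = 1 − 7/15 + 1/10 ≈ 0.63. A short sketch:

# Worked APFD example (illustrative numbers).
def apfd(first_reveal_positions, n):
    """first_reveal_positions: 1-based position of the first test revealing each fault."""
    m = len(first_reveal_positions)
    return 1 - sum(first_reveal_positions) / (n * m) + 1 / (2 * n)

# Faults 1-3 are first revealed by the tests at positions 1, 2, and 4 in a suite of 5 tests:
print(apfd([1, 2, 4], n=5))   # 1 - 7/15 + 1/10 = 0.633...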

3.3.2 RQ1b: What types of data are used?

Fig. 8: Distribution of data types

Figure 8 shows the distribution of the type of data applied to evaluate the TCP methods. From
the distribution in Figure 8, we observe the following:

– No fault or failure: For those papers evaluating TCP techniques only by coverage and
those without evaluation, data with fault or failure information is not required. Thus 53 out
of 239 of the papers analyzed do not use data providing fault or failure information.
– Injected faults: Most (close to 75%) of the papers evaluate TCP techniques by manually
injecting faults to see how early these injected faults can be detected by the prioritized test
suite. Most of the papers presenting TCP techniques that prioritize for fault detection rate
use this type of data.
– Real faults or failures: A few (10) papers apply data with real faults or failures to
evaluate the fault/failure detection rate of TCP techniques. Most of these papers (seven
out of 10) are prioritizing for failure detection rate (having no failure to fault mapping) [1,
26, 46–48, 61, 62] while three out of 10 papers state that their data contain real fault
information [13, 35, 41].

Table 3: Relation between data type and prioritization goal

Goal \ Data               No fault or failure   Injected faults   Real faults or failures   Total
Coverage only or none              53                  0                     0                53
Fault detection rate                0                176                     3               179
Failure detection rate              0                  0                     7                 7
Total                              53                176                    10               239

Fig. 9: Distribution of information used

Table 3 shows the relation between data type and prioritization goal. As the Table shows,
there is a strong correlation between the data type used for TCP technique evaluation and the
prioritization goal:
– When no fault or failure information is available in the data, techniques can be evaluated
only by coverage.
– Most real-world data contains only failure information; in these cases, TCP techniques can be
evaluated only by failure detection rate.

3.3.3 RQ1c: What information is used to prioritize the test cases?

TCP techniques utilize various information to reorder test suites for their prioritization goals.
Figure 9 shows the distribution of each category of information being used by the techniques
presented in the 239 papers. The six categories of information we analyzed are listed in the
table as follows:
– Source code: source code under test. About half of the analyzed TCP methods extract
features from the source code being tested, e.g. software metrics [3], code coverage [16].
– Changes: code change from the prior build. Around 16% of the analyzed papers present
TCP techniques that utilize information about “what has been changed in the source code”
to decide which test cases should be executed earlier [31].
– Requirement: requirements properties. Around 10% of the analyzed papers present TCP
techniques that utilize this type of information, such as customer-assigned priority on
requirements, requirement volatility, or developer-perceived implementation complexity
of requirements, for prioritization [44].

– History: execution results (pass/fail/skip) from previous runs. About 30% of the analyzed
papers present TCP techniques that utilize history information such as the fault/failure
exposing potential of each test case to reorder the test cases [20].
– Test cases: information about the test cases, e.g. test descriptions, test code, etc. Some
papers (17) present TCP techniques that utilize this type of information to calculate the
similarity between test cases, then prioritize the test cases based on the similarity and other
types of information [60].
– Feedback: execution results (pass/fail/skip) of test cases on current run. A few (10)
papers present TCP techniques that learn from the results of already executed test cases to
dynamically re-prioritize test cases that have not yet been executed [7].

Table 4: Relation between data type and information used

XXX Data
X No fault or Injected Real faults or
InformationXXXX failure faults failures
White box 53 152 6 211
Black box 0 24 4 28
53 176 10 239

Table 4 shows the relationship between data type and information used. For our conve-
nience, we categorize the information used as:
– White box if source code, code change information, or requirements information, which
are not available when prioritizing test cases for automated UI testing, are used;
– Black box if only history, test case, and feedback information, which are available when
prioritizing test cases for automated UI testing, are used.
In most of the papers (211 out of 239) that we surveyed, the TCP techniques presented
require white box information; only 28 of 239 present TCP techniques that rely on black
box information (which is necessary when prioritizing automated UI tests). Interestingly,
when considering only the papers in which real faults or failures are utilized, a higher ratio of
papers (four out of 10) explore TCP techniques that use only black box information. This
suggests that it is more common than not that white box information is unavailable when
prioritizing test cases in a real-world testing scenario, just as is the case for the automated UI
test case prioritization problem.

3.3.4 RQ1d: What methods do not rely on source code information?

Around half of the papers that present TCP techniques present coverage-based methods, which prioritize test cases in orders that reach maximum coverage with minimum testing cost by using greedy or search-based algorithms [2]. However, these coverage-based
methods are not applicable when white box information (source code, changes, requirements)
is not available. Because only black box information is available when prioritizing automated
UI test cases, we are more interested in TCP techniques that are applied for black box testing.
Figure 10 shows the distribution of the papers that utilize such techniques. The techniques
applied for black box testing are:
– History-based: only history information is utilized. The history-based techniques presented in 13 out of 28 of the papers apply different metrics extracted from past test execution results to predict fault/failure exposing potential, and to prioritize test cases [22] (a small sketch of this idea appears after this list).

Fig. 10: Distribution of methods applied for black box testing

– Test case-based: test case information and history information is utilized. The test case-
based techniques presented in seven out of 28 of the papers utilize test case information to
calculate the similarity between test cases, and then prioritize the test cases based on both
test case similarity and history information [6].
– Feedback-based: feedback and history information is utilized. The feedback-based tech-
niques presented in eight out of 28 of the papers utilize the execution results (pass/fail)
in a current test case run to dynamically reorder those test cases that have not yet been
executed [42].
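As a small sketch of the history-based idea mentioned above (our own illustration, in the spirit of ordering test cases by the ratio of failed to executed runs, e.g., [22]; the test names and execution history are hypothetical):

# Sketch of a history-based prioritization: order tests by historical failure rate.
def prioritize_by_failure_rate(history):
    """history: dict mapping test name -> list of past results, True meaning 'failed'."""
    def failure_rate(test):
        runs = history[test]
        return sum(runs) / len(runs) if runs else 0.0
    # Execute the historically most failure-prone tests first.
    return sorted(history, key=failure_rate, reverse=True)

history = {                      # hypothetical execution history (pass=False, fail=True)
    "login_test":  [False, True,  True,  True],
    "search_test": [False, False, False, False],
    "upload_test": [True,  False, False, True],
}
print(prioritize_by_failure_rate(history))
# ['login_test', 'upload_test', 'search_test']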

3.3.5 RQ1: What are the state-of-the-art TCP techniques that can be applied to automated
UI testing?

Summarizing from RQ1a-RQ1d, only four papers [1, 26, 48, 62] dealt with the same
scenario in which automated UI testing applies: prioritizing for failure detection rate, using
real failure data and only black box information. However, techniques that prioritize test
cases for fault detection rate can also be applied to prioritize them for failure detection rate in
automated UI testing, as long as they do not rely on source code information. As a result, we
summarized the 28 black box papers and identified 12 state-of-the-art TCP techniques that
can be applied to automated UI testing, after accounting for similar algorithms and removing
inapplicable ones. The 12 TCP techniques thus identified are grouped by the information
they use and are listed in Table 5. In our prior work [55] we then applied these techniques
as baselines and compared them against a proposed new TCP algorithm on datasets used in
automated UI testing at LexisNexis.
Aside from identifying the state-of-the-art TCP techniques that can be applied to auto-
mated UI testing, there are other informative findings that can be derived from the systematic
literature review:
– Most existing papers on TCP techniques utilize white box information and are evaluated
by data with injected faults. This type of evaluation has clear drawbacks—the injected
faults may not resemble real faults and the white box information may not be available in
practice.

Table 5: Black Box Test Case Prioritization Techniques and the Information They Utilize (execution history, test case description, and/or feedback)

ID   Technique Description
B1   Execute test cases in ascending order of time since last failure [25, 32].
B2   Execute test cases in descending order of number of times failed / number of times executed [1, 22, 48].
B3   Execute test cases in descending order of exponential decay metrics [29].
B4   Execute test cases in descending order of ROCKET metrics [36].
B5   Execute test cases in descending order of the Mahalanobis distance of each test case to the origin (0,0) when considering two metrics—time since last execution and failure rate [1].
C1   Execute test cases in ascending order of the estimated test case runtime [40].
D1   Supervised learning with Simple History (SH) [26].
D2   Supervised learning with All History (AH) [26].
D3   Supervised learning with Weighted History (WH) [26].
E1   Dynamic test case prioritization with co-failure information [62].
E2   Dynamic test case prioritization with flipping history [7].
E3   Dynamic test case prioritization with rules mined from failure history [42].

– On the other hand, fewer papers rely on data that involves real faults or failures (10 out
of 239) or present TCP techniques that utilize only black box information (28 out of
239). Only four papers present techniques that prioritize test cases based on black box
information on real failures. This is an under-explored area where results can be directly
(or most easily) applied to industry.

Therefore, scenarios similar to the one that holds for automated UI testing are under-explored
in the TCP literature. Specifically, there is a dearth of papers that (a) use real failure/fault
data and that (b) assume the information starved black box scenario defined above.

4 Validation

We now address RQ2, by first validating the primary study selection results of FASTREAD by
having six other graduate students manually screen the candidate papers, and then analyzing
the missing information with a full-text review of the relevant papers included by the six
students but missed by FASTREAD.

4.1 RQ2a: What percentage of relevant papers did FASTREAD actually retrieve?

Considering the prohibitive cost involved in manually screening 8,349 papers, we validate
our results on only a subset of the candidate papers. By searching in IEEE Xplore with the
following search string
[software AND test AND prioriti*],
a validation set of 783 papers was retrieved. Of the 470 papers that had been screened with FASTREAD, 318 were in the validation set; of the 242 papers classified as relevant by FASTREAD, 237 were in the validation set.
Each paper in the validation set was manually screened by at least two graduate students.
A third student was asked to screen the paper if the screening results from the first two were
inconsistent. A majority vote was then used to determine the final screening results of papers
(293 relevant papers) in the validation set. After that, full-text validation was applied to the
papers identified by the majority vote that were not identified by FASTREAD. This full-text
validation required the students to spend six hours to review 70 papers; they found that 39 of
these were relevant. These full-text validation results were treated as the ground truth for the
validation set.
Table 6 summarizes the validation results of FASTREAD, the majority vote, and the
ground truth with labels explained in Table 7. From Table 6 we can derive the performance
of FASTREAD on the validation set:
– Human Precision = 234/(234+3) = 0.99
– FASTREAD Precision = 234/(237+81) = 0.74
– Recall = FASTREAD Recall × Human Recall = (234+12)/273 × 234/(234+12) = 0.90 × 0.95 = 0.85
– Cost = (237+81)/783 = 0.41
Here, the recall involved in selecting primary studies consists of two parts—FASTREAD
recall and human recall. The FASTREAD recall on the validation set is 90%, which is
very close to its estimation of 91% recall and is the same as the target recall Trec = 90%.
Therefore, we conclude that the recall estimation of FASTREAD was accurate in this SLR
study.
As for the performance of manual screening with majority votes, the following calcula-
tions apply:
– Precision = 259/(259+34) = 0.88
– Recall = 259/(259+14) = 0.95
– Cost = (783×2 + 174)/783 = 2.22 (each of the 783 papers was screened by two students, and 174 additional screenings by a third student were needed to break ties)
This data shows that the human working with FASTREAD (the first author) achieved the same
recall as, but higher precision (99%) than, the majority vote results of the other six humans
(89%). This probably occurred because the author designed the inclusion and exclusion
criteria and had a better understanding of which papers are relevant to the research questions
RQ1a-d. This result suggests that, although employing more human reviewers for relevant
paper selection can effectively reduce the time required for that process, more cost-effective
and precise results can be achieved if only the human planning the SLR is employed for the
primary study selection, which leads to less unnecessary full-text review effort.

4.2 RQ2b: What information is missing in the final report because of FASTREAD?

To determine what information is lost by excluding the 39 relevant papers not discovered by
FASTREAD and the human reviewer, we analyzed these papers in the same manner as the

Table 6: Primary Study Selection Validation Results

                          FASTREAD                    Majority Vote
                   yes     no    ignored           yes     no          Total
Ground Truth yes   234     12       27             259     14           273
Ground Truth no      3     69      438              34    476           510
Total              237     81      465             293    490           783

Table 7: Labels of Table 6

FASTREAD / yes:      Papers suggested by FASTREAD and included by human
FASTREAD / no:       Papers suggested by FASTREAD but excluded by human
FASTREAD / ignored:  Papers ignored by FASTREAD
Majority Vote / yes: Papers included by two humans
Majority Vote / no:  Papers excluded by two humans
Ground Truth / yes:  Papers included by full-text validation
Ground Truth / no:   Papers excluded by full-text validation or by both FASTREAD and majority vote

239 papers that were initially included, and classified them based on prioritization goals, data
types, information used, and method applied for black box testing. As shown in Figure 11,
we observe:
– Distributions of the missing papers into categories are quite different from those obtained
for the 239 papers identified as relevant by FASTREAD. This suggests that when using
FASTREAD, a bias could be introduced in terms of which relevant papers will be retrieved.
This is probably caused by the imbalance of categories in the training data of FASTREAD,
e.g., when 70% of the training data (relevant papers found) uses injected faults, it is likely
that FASTREAD would predict that a paper using injected faults has a higher probability
of being relevant than otherwise.
– Distributions of all 278 ground-truth relevant papers into categories are still similar to those obtained for the
239 papers identified as relevant by FASTREAD – especially with respect to the rankings
of number of papers in each category. This suggests that, despite the bias introduced
by FASTREAD, the overall conclusions of the SLR are still representative when using
the 90% relevant papers selected with FASTREAD. However, in studies like systematic
mappings where the exact values of distributions matter, such biases should be avoided by
manually screening all of the search results.
– While one more paper presenting a black box TCP technique [38] was identified in the
papers omitted by FASTREAD, that technique is similar to the D2 technique [26] listed in
Table 5; thus, including that paper did not add any new information to the conclusions of
the SLR case study.

4.3 RQ2: What are the costs and benefits associated with using FASTREAD to guide the
selection of relevant papers?

To summarize, in answer to RQ2, the benefits associated with the use of FASTREAD to
guide the selection of relevant papers are as follows:
– With the help of FASTREAD, 85% of the relevant papers were included with only 470/8349 = 6% of the candidate papers screened; this saved approximately 50 hours of work.

(a) Prioritization goals   (b) Data types
(c) Information used   (d) Methods applied for black box testing

Fig. 11: Comparing the distribution of papers in each category, considering (1) the 239 papers considered relevant by FASTREAD (Percentage), (2) the 39 relevant papers missed by FASTREAD (Percentage in missing), and (3) the 278 ground truth relevant papers (Percentage in all).

– The recall of FASTREAD when the selection stopped was close to the target recall based
on the validation result. This suggests that a researcher may be able to choose a level
of recall at which to stop the selection with the help of the recall estimation given by
FASTREAD.
– The cost of primary study selection was reduced to a reasonable level (three hours for 470
papers), so that it was possible to employ only one human (the one who planned the SLR)
to select relevant papers. This reduced the human error rate for the selection process.

The costs associated with use of FASTREAD to guide the selection of relevant papers are as
follows:

– There were more missed relevant papers when applying FASTREAD with a target recall
lower than 100%. The higher the target recall is, the higher the cost will be [58]. Re-
searchers need to consider the tradeoff between the screening effort they are willing to
spend and the recall they can achieve with FASTREAD.
– Using FASTREAD can introduce a sampling bias into the included relevant papers. This
may not always affect SLR studies (e.g. the conclusions of the case study SLR in this paper
remained unchanged) but it should be avoided in studies such as systematic mappings.

5 Conclusions and Future Work

This paper investigated 242 papers on test case prioritization published in conference proceed-
ings and journals through a systematic literature review process. The 242 papers were selected
by manually screening 470 of the 8,349 candidate papers with the help of FASTREAD.
This systematic literature review study was conducted for two reasons: (1) to investigate the
costs and benefits of selecting relevant papers by using a machine learning tool as represented
by FASTREAD; and (2) to determine the state-of-the-art in research on TCP techniques that
can be applied to automated UI testing.
Regarding the first point, FASTREAD reduced the effort required for paper selection
by 1 − 470/8349 = 94% (from 53 hours to three hours). Based on the validation results in
which six other humans screened a subset of 783 candidate papers, the FASTREAD selection
process included 85% of the relevant papers with human errors contributing to 5% of the
missing papers. Given the large reduction of human effort on primary study selection with
only 10% loss on recall, and the fact that the missing relevant papers did not affect the final
conclusions of the case study SLR, this data supports the suggestion that FASTREAD can
be used to cost-effectively select primary studies in SLRs. However, we did find that using
FASTREAD can introduce a sampling bias in the included relevant papers. Thus, when
conducting systematic mapping studies, it may be best to avoid using FASTREAD.
Regarding the second point, this systematic literature review identified 12 state-of-the-art
TCP techniques that rely only on black box information, and that can be applied directly
to prioritize automated UI tests. These 12 techniques have since been used as baselines in
another already published work [55]. Meanwhile, this SLR also found that TCP techniques
that (1) utilize only black box information (no source code, change, or requirements), (2)
prioritize for higher failure detection rate, and (3) are validated on real world failure data are
under-explored and require more research attention, because these techniques could better
meet the needs of large organizations such as Google [19] and Cisco [36].
As for future work, we intend to encourage other software engineering researchers to
conduct systematic literature reviews using FASTREAD. We also intend to find ways to
improve the efficiency (higher recall and lower cost) of our machine learning assisted primary
study selection approach through simulations on SLR datasets, including the one from this study. Finally,
we will attempt to alleviate the sampling bias introduced by FASTREAD. A possible solution
in this context may be to replace FASTREAD’s learner with some instance-based classifiers
such as K-Nearest Neighbors.
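Purely as a sketch of that direction (an untested idea, not something evaluated in this paper), an instance-based learner from scikit-learn could be swapped in for the linear SVM in the Train function of Algorithm 1:

# Sketch: replacing the linear SVM in Train() with an instance-based learner (untested idea).
from sklearn.neighbors import KNeighborsClassifier

def train_knn(X_labeled, y_labeled, k=5):
    # Distance-weighted KNN; predict_proba could replace the SVM decision function
    # when ranking unscreened papers for certainty/uncertainty sampling.
    clf = KNeighborsClassifier(n_neighbors=min(k, len(y_labeled)), weights="distance")
    return clf.fit(X_labeled, y_labeled)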

Acknowledgement

We thank Tianpei Xia, Huy Tu, Jianfeng Chen, Xueqi Yang, Rui Shu and Fahmid Morshed
Fahid for their effort in validating the FASTREAD results.

References

1. Aman, H., Tanaka, Y., Nakano, T., Ogasawara, H., Kawahara, M.: Application of Mahalanobis-Taguchi
method and 0-1 programming method to cost-effective regression testing. In: 2016 42nd Euromicro
Conference on Software Engineering and Advanced Applications (SEAA), pp. 240–244. IEEE (2016)
2. Bian, Y., Li, Z., Zhao, R., Gong, D.: Epistasis-based ACO for regression test case prioritization. IEEE
Transactions on Emerging Topics in Computational Intelligence 1(3), 213–223 (2017)

3. Carlson, R., Do, H., Denton, A.: A clustering approach to improving test case prioritization: An industrial
case study. In: Software Maintenance (ICSM), 2011 27th IEEE International Conference on, pp. 382–391.
IEEE (2011)
4. Carver, J.C., Hassler, E., Hernandes, E., Kraft, N.A.: Identifying barriers to the systematic literature
review process. In: 2013 ACM/IEEE International Symposium on Empirical Software Engineering and
Measurement, pp. 203–212. IEEE (2013)
5. Catal, C., Mishra, D.: Test case prioritization: a systematic mapping study. Software Quality Journal
21(3), 445–478 (2013)
6. Chen, J., Bai, Y., Hao, D., Xiong, Y., Zhang, H., Zhang, L., Xie, B.: Test case prioritization for compilers:
A text-vector based approach. In: Software Testing, Verification and Validation (ICST), 2016 IEEE
International Conference on, pp. 266–277. IEEE (2016)
7. Cho, Y., Kim, J., Lee, E.: History-based test case prioritization for failure information. In: 2016 23rd
Asia-Pacific Software Engineering Conference (APSEC), pp. 385–388. IEEE (2016)
8. Cormack, G.V., Grossman, M.R.: Evaluation of machine-learning protocols for technology-assisted review
in electronic discovery. In: Proceedings of the 37th international ACM SIGIR conference on Research &
development in information retrieval, pp. 153–162. ACM (2014)
9. Cormack, G.V., Grossman, M.R.: Autonomy and reliability of continuous active learning for technology-
assisted review. arXiv preprint arXiv:1504.06868 (2015)
10. Cormack, G.V., Grossman, M.R.: Engineering quality and reliability in technology-assisted review. In:
Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 75–84. ACM (2016)
11. Cormack, G.V., Grossman, M.R.: Scalability of continuous active learning for reliable high-recall text
classification. In: Proceedings of CIKM 2016, pp. 1039–1048. ACM (2016)
12. Cormack, G.V., Grossman, M.R.: Navigating imprecision in relevance assessments on the road to total
recall: Roger and me. In: The International ACM SIGIR Conference, pp. 5–14 (2017)
13. Di Nardo, D., Alshahwan, N., Briand, L., Labiche, Y.: Coverage-based test case prioritisation: An
industrial case study. In: 2013 IEEE Sixth International Conference on Software Testing, Verification and
Validation, pp. 302–311. IEEE (2013)
14. Do, H., Mirarab, S., Tahvildari, L., Rothermel, G.: The effects of time constraints on test case prioritization:
A series of controlled experiments. IEEE Transactions on Software Engineering 36(5), 593–617 (2010)
15. Do, H., Rothermel, G.: A controlled experiment assessing test case prioritization techniques via mutation
faults. In: Software Maintenance, 2005. ICSM’05. Proceedings of the 21st IEEE International Conference
on, pp. 411–420. IEEE (2005)
16. Do, H., Rothermel, G., Kinneer, A.: Empirical studies of test case prioritization in a junit testing environ-
ment. In: Software Reliability Engineering, 2004. ISSRE 2004. 15th International Symposium on, pp.
113–124. IEEE (2004)
17. Dyba, T., Kitchenham, B.A., Jorgensen, M.: Evidence-based software engineering for practitioners. IEEE
Software 22(1), 58–65 (2005). DOI 10.1109/MS.2005.6
18. Elbaum, S., Kallakuri, P., Malishevsky, A., Rothermel, G., Kanduri, S.: Understanding the effects of
changes on the cost-effectiveness of regression testing techniques. Software Testing, Verification and
Reliability 13(2), 65–83 (2003). DOI 10.1002/stvr.263. URL https://onlinelibrary.wiley.
com/doi/abs/10.1002/stvr.263
19. Elbaum, S., Rothermel, G., Penix, J.: Techniques for improving regression testing in continuous integration
development environments. In: Proceedings of the 22Nd ACM SIGSOFT International Symposium on
Foundations of Software Engineering, FSE 2014, pp. 235–245. ACM, New York, NY, USA (2014).
DOI 10.1145/2635868.2635910. URL http://doi.acm.org/10.1145/2635868.2635910
20. Engström, E., Runeson, P., Ljung, A.: Improving regression testing transparency and efficiency with
history-based prioritization–an industrial case study. In: Software Testing, Verification and Validation
(ICST), 2011 IEEE Fourth International Conference on, pp. 367–376. IEEE (2011)
21. Fahid, F.M., Yu, Z., Menzies, T.: Better technical debt detection via surveying. arXiv preprint
arXiv:1905.08297 (2019)
22. Fazlalizadeh, Y., Khalilian, A., Azgomi, M.A., Parsa, S.: Prioritizing test cases for resource constraint en-
vironments using historical test case performance data. In: Computer Science and Information Technology,
2009. ICCSIT 2009. 2nd IEEE International Conference on, pp. 190–195. IEEE (2009)
23. Grossman, M.R., Cormack, G.V.: The Grossman-Cormack glossary of technology-assisted review, with
foreword by John M. Facciola, U.S. Magistrate Judge. Federal Courts Law Review 7(1), 1–34 (2013)
24. Grossman, M.R., Cormack, G.V., Roegiest, A.: TREC 2016 Total Recall Track overview. In: TREC (2016)
25. Hemmati, H., Fang, Z., Mantyla, M.V.: Prioritizing manual test cases in traditional and rapid release
environments. In: Software Testing, Verification and Validation (ICST), 2015 IEEE 8th International
Conference on, pp. 1–10. IEEE (2015)
26. Hemmati, H., Sharifi, F.: Investigating nlp-based approaches for predicting manual test case failure. In:
Software Testing, Verification and Validation (ICST), 2018 IEEE 11th International Conference on, pp.
309–319. IEEE (2018)

27. Hsu, Y.C., Peng, K.L., Huang, C.Y.: A study of applying severity-weighted greedy algorithm to software
test case prioritization during testing. In: Industrial Engineering and Engineering Management (IEEM),
2014 IEEE International Conference on, pp. 1086–1090. IEEE (2014)
28. Jun, W., Yan, Z., Chen, J.: Test case prioritization technique based on genetic algorithm. In: Internet
Computing & Information Services (ICICIS), 2011 International Conference on, pp. 173–175. IEEE
(2011)
29. Kim, J.M., Porter, A.: A history-based test prioritization technique for regression testing in resource
constrained environments. In: Proceedings of the 24th international conference on software engineering,
pp. 119–129. ACM (2002)
30. Kitchenham, B.A., Dyba, T., Jorgensen, M.: Evidence-based software engineering. In: Proceedings of the
26th international conference on software engineering, pp. 273–281. IEEE Computer Society (2004)
31. Kumar, H., Chauhan, N.: A coupling effect based test case prioritization technique. In: Computing for
Sustainable Global Development (INDIACom), 2015 2nd International Conference on, pp. 1341–1345.
IEEE (2015)
32. Kwon, J.H., Ko, I.Y.: Cost-effective regression testing using bloom filters in continuous integration
development environments. In: Asia-Pacific Software Engineering Conference (APSEC), 2017 24th, pp.
160–168. IEEE (2017)
33. Li, Z., Harman, M., Hierons, R.M.: Search algorithms for regression test case prioritization. IEEE
Transactions on Software Engineering 33(4) (2007)
34. Liang, J., Elbaum, S., Rothermel, G.: Redefining prioritization: Continuous prioritization for continuous
integration. In: Proceedings of the 40th International Conference on Software Engineering, ICSE
’18, pp. 688–698. ACM, New York, NY, USA (2018). DOI 10.1145/3180155.3180213. URL http://doi.acm.org/10.1145/3180155.3180213
35. Lou, Y., Hao, D., Zhang, L.: Mutation-based test-case prioritization in software evolution. In: Software
Reliability Engineering (ISSRE), 2015 IEEE 26th International Symposium on, pp. 46–57. IEEE (2015)
36. Marijan, D., Gotlieb, A., Sen, S.: Test case prioritization for continuous regression testing: An industrial
case study. In: Software Maintenance (ICSM), 2013 29th IEEE International Conference on, pp. 540–543.
IEEE (2013)
37. Miwa, M., Thomas, J., O'Mara-Eves, A., Ananiadou, S.: Reducing systematic review workload through
certainty-based screening. Journal of biomedical informatics 51, 242–253 (2014)
38. Noor, T.B., Hemmati, H.: Test case analytics: Mining test case traces to improve risk-driven testing. In:
2015 IEEE 1st International Workshop on Software Analytics (SWAN), pp. 13–16. IEEE (2015)
39. Öztürk, M.M.: Adapting code maintainability to bat-inspired test case prioritization. In: INnovations
in Intelligent SysTems and Applications (INISTA), 2017 IEEE International Conference on, pp. 67–72.
IEEE (2017)
40. Park, H., Ryu, H., Baik, J.: Historical value-based approach for cost-cognizant test case prioritization to
improve the effectiveness of regression testing. In: Secure System Integration and Reliability Improvement,
2008. SSIRI’08. Second International Conference on, pp. 39–46. IEEE (2008)
41. Paterson, D., Kapfhammer, G.M., Fraser, G., McMinn, P.: Using controlled numbers of real faults and
mutants to empirically evaluate coverage-based test case prioritization. In: Proceedings of the International
Workshop on Automation of Software Test (AST 2018). IEEE (2018)
42. Pradhan, D., Wang, S., Ali, S., Yue, T., Liaaen, M.: Remap: Using rule mining and multi-objective search
for dynamic test case prioritization. In: Software Testing, Verification and Validation (ICST), 2018 IEEE
11th International Conference on, pp. 46–57. IEEE (2018)
43. Rothermel, G., Untch, R.H., Chu, C., Harrold, M.J.: Prioritizing test cases for regression testing. IEEE
Transactions on Software Engineering 27(10), 929–948 (2001)
44. Salehie, M., Li, S., Tahvildari, L., Dara, R., Li, S., Moore, M.: Prioritizing requirements-based regression
test cases: A goal-driven practice. In: 2011 15th European Conference on Software Maintenance and
Reengineering, pp. 329–332. IEEE (2011)
45. Settles, B.: Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6(1),
1–114 (2012)
46. Strandberg, P.E., Afzal, W., Ostrand, T.J., Weyuker, E.J., Sundmark, D.: Automated system-level regression
test prioritization in a nutshell. IEEE Software 34(4), 30–37 (2017)
47. Strandberg, P.E., Sundmark, D., Afzal, W., Ostrand, T.J., Weyuker, E.J.: Experience report: automated
system level regression test prioritization using multiple factors. In: 2016 IEEE 27th International
Symposium on Software Reliability Engineering (ISSRE), pp. 12–23. IEEE (2016)
48. Tsai, W.T., Bai, X., Chen, Y., Zhou, X.: Web service group testing with windowing mechanisms. In: IEEE
International Workshop on Service-Oriented System Engineering (SOSE’05), pp. 213–218. IEEE (2005)
49. Wallace, B.C., Dahabreh, I.J.: Class probability estimates are unreliable for imbalanced data (and how to
fix them). In: Data Mining (ICDM), 2012 IEEE 12th International Conference on, pp. 695–704. IEEE
(2012)

50. Wallace, B.C., Dahabreh, I.J., Moran, K.H., Brodley, C.E., Trikalinos, T.A.: Active literature discovery for
scoping evidence reviews: How many needles are there? In: KDD workshop on data mining for healthcare
(KDD-DMH) (2013)
51. Wallace, B.C., Small, K., Brodley, C.E., Trikalinos, T.A.: Active learning for biomedical citation screening.
In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data
mining, pp. 173–182. ACM (2010)
52. Wallace, B.C., Small, K., Brodley, C.E., Trikalinos, T.A.: Who should label what? Instance allocation in
multiple expert active learning. In: SDM, pp. 176–187. SIAM (2011)
53. Wallace, B.C., Trikalinos, T.A., Lau, J., Brodley, C., Schmid, C.H.: Semi-automated screening of biomed-
ical citations for systematic reviews. BMC bioinformatics 11(1), 1 (2010)
54. Xiong, Z., Liu, T., Tse, G., Gong, M., Gladding, P., Smaill, B.H., Stiles, M., Gillis, A., Zhao, J.: A machine
learning aided systematic review and meta-analysis of the relative risk of atrial fibrillation in patients with
diabetes mellitus. Frontiers in physiology 9, 835 (2018)
55. Yu, Z., Fahid, F., Menzies, T., Rothermel, G., Patrick, K., Cherian, S.: Terminator: Better automated UI
test case prioritization. In: Proceedings of the 2019 27th ACM Joint Meeting on European Software
Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE
2019, pp. 883–894. ACM, New York, NY, USA (2019). DOI 10.1145/3338906.3340448. URL http://doi.acm.org/10.1145/3338906.3340448
56. Yu, Z., Kraft, N.A., Menzies, T.: Finding better active learners for faster literature reviews. Empirical
Software Engineering (2018). DOI 10.1007/s10664-017-9587-0. URL https://doi.org/10.1007/s10664-017-9587-0
57. Yu, Z., Menzies, T.: Total recall, language processing, and software engineering. In: Proceedings of the
4th ACM SIGSOFT International Workshop on NLP for Software Engineering, pp. 10–13. ACM (2018)
58. Yu, Z., Menzies, T.: FAST2: An intelligent assistant for finding relevant papers. Expert Systems with
Applications 120, 57–71 (2019)
59. Yu, Z., Theisen, C., Williams, L., Menzies, T.: Improving vulnerability inspection efficiency using active
learning. IEEE Transactions on Software Engineering pp. 1–1 (2019). DOI 10.1109/TSE.2019.2949275
60. Zhang, X., Xie, X., Chen, T.Y.: Test case prioritization using adaptive random sequence with category-
partition-based distance. In: Software Quality, Reliability and Security (QRS), 2016 IEEE International
Conference on, pp. 374–385. IEEE (2016)
61. Zhou, Z.Q.: Using coverage information to guide test case selection in adaptive random testing. In: 2010
IEEE 34th Annual Computer Software and Applications Conference Workshops, pp. 208–213. IEEE
(2010)
62. Zhu, Y., Shihab, E., Rigby, P.C.: Test re-prioritization in continuous testing environments. In: 2018 IEEE
International Conference on Software Maintenance and Evolution (ICSME), pp. 69–79. IEEE (2018)
