
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 49, NO. 4, APRIL 2023

Flakify: A Black-Box, Language Model-Based Predictor for Flaky Tests

Sakina Fatima, Taher A. Ghaleb, and Lionel Briand, Fellow, IEEE

Abstract—Software testing assures that code changes do not adversely affect existing functionality. However, a test case can be flaky,
i.e., passing and failing across executions, even for the same version of the source code. Flaky test cases introduce overhead to software
development as they can lead to unnecessary attempts to debug production or testing code. Besides rerunning test cases multiple times,
which is time-consuming and computationally expensive, flaky test cases can be predicted using machine learning (ML) models, thus
reducing the wasted cost of re-running and debugging these test cases. However, the state-of-the-art ML-based flaky test case predictors
rely on pre-defined sets of features that are either project-specific, i.e., inapplicable to other projects, or require access to production
code, which is not always available to software test engineers. Moreover, given the non-deterministic behavior of flaky test cases, it can be
challenging to determine a complete set of features that could potentially be associated with test flakiness. Therefore, in this article, we
propose Flakify, a black-box, language model-based predictor for flaky test cases. Flakify relies exclusively on the source code of test
cases, thus not requiring (a) access to production code (black-box), (b) rerunning test cases, or (c) pre-defining features. To this end, we
employed CodeBERT, a pre-trained language model, and fine-tuned it to predict flaky test cases using the source code of test cases. We
evaluated Flakify on two publicly available datasets (FlakeFlagger and IDoFT) for flaky test cases and compared our technique with the
FlakeFlagger approach, the best state-of-the-art ML-based, white-box predictor for flaky test cases, using two different evaluation
procedures: (1) cross-validation and (2) per-project validation, i.e., prediction on new projects. Flakify achieved F1-scores of 79% and
73% on the FlakeFlagger dataset using cross-validation and per-project validation, respectively. Similarly, Flakify achieved F1-scores of
98% and 89% on the IDoFT dataset using the two validation procedures, respectively. Further, Flakify surpassed FlakeFlagger by 10 and
18 percentage points (pp) in terms of precision and recall, respectively, when evaluated on the FlakeFlagger dataset, thus reducing the
cost bound to be wasted on unnecessarily debugging test cases and production code by the same percentages (corresponding to
reduction rates of 25% and 64%). Flakify also achieved significantly higher prediction results when used to predict test cases on new
projects, suggesting better generalizability over FlakeFlagger. Our results further show that a black-box version of FlakeFlagger is not a
viable option for predicting flaky test cases.

Index Terms—Flaky tests, software testing, black-box testing, natural language processing, CodeBERT

1 INTRODUCTION

SOFTWARE testing is an essential activity to assure software dependability. When a test case fails, it usually indicates that recent code changes were incorrect. However, it has been observed, in many environments, that test cases can be non-deterministic, passing and failing across executions, even for the same version of the source code. These test cases are referred to as flaky test cases [1], [2], [3]. Flaky test cases can introduce overhead to software development, since they require developers to either (a) debug the production or testing code looking for a bug that might not really exist, or (b) rerun a failed test case multiple times to check if it would eventually pass, thus suggesting that the failure is not due to recent code changes but to the test case itself.

Previous research has investigated the common reasons behind test flakiness, such as concurrency, resource leakage, and test smells. The conventional approach to detect flaky test cases is to rerun them numerous times [4], [5], which is in most practical cases computationally expensive [6] or even impossible. To address this issue, recent studies have proposed approaches using machine learning (ML) models to predict flaky test cases without rerunning them [7], [8], [9], thus proposing a much more scalable and practical solution. Despite significant progress, these approaches (a) rely on production code, which is not always accessible by software test engineers or a scalable solution, or (b) employ project-specific features as flaky test case predictors, which makes them inapplicable to other projects. Moreover, these approaches rely on a limited set of pre-defined features, extracted from the source code of test cases and the system under test. However, when evaluated on realistic datasets, these approaches yield a relatively low accuracy (F1-scores in the range 19%-66%), thus suggesting they may not capture enough information about test flakiness. Finding additional features that could potentially be associated with flaky test cases, preferably based on test code only (black-box), is therefore a research challenge.

Sakina Fatima and Taher A. Ghaleb are with the School of EECS, University of Ottawa, Ottawa, ON K1N 6N5, Canada. E-mail: {sfati077, tghaleb}@uottawa.ca.
Lionel Briand is with the School of EECS, University of Ottawa, Ottawa, ON K1N 6N5, Canada, and also with the SnT Centre for Security, Reliability and Trust, University of Luxembourg, 4365 Esch-sur-Alzette, Luxembourg. E-mail: [email protected].
Manuscript received 23 December 2021; revised 19 August 2022; accepted 20 August 2022. Date of publication 24 August 2022; date of current version 18 April 2023.
This work was supported in part by a research grant from Huawei Technologies Canada and Mitacs Canada, and in part by the Canada Research Chair and Discovery Grant programs of the Natural Sciences and Engineering Research Council of Canada (NSERC).
(Corresponding author: Taher A. Ghaleb.)
Recommended for acceptance by A. Zaidman.
Digital Object Identifier no. 10.1109/TSE.2022.3201209

In this paper, we propose Flakify (Flaky Test Classify), a generic language model-based solution for predicting flaky test cases. Flakify is black-box as it relies exclusively on the source code of test cases (test methods), thus not requiring access to the production code of the system under test. This is important as production code is not always (entirely) accessible to test engineers due, for example, to outsourcing software testing to a third party. Further, analyzing production code may raise many scalability and practicality issues, especially when applied to large industrial systems using multiple programming languages. In addition, Flakify does not require the definition of features—which are necessarily incomplete—to be used as predictors for flaky test cases. Instead, we used CodeBERT [10], a pre-trained language model, and fine-tuned it to classify test cases as flaky or not based on their source code. To improve Flakify, we further pre-processed test code to remove potentially irrelevant information. We evaluated Flakify on two different datasets: the FlakeFlagger dataset, containing 21,661 test cases collected from 23 Java projects, and the IDoFT dataset, containing 3,862 test cases collected from 312 Java projects. To do this, we used two different evaluation procedures: (1) cross-validation and (2) per-project validation, i.e., prediction on new projects. Our results were compared to FlakeFlagger [7], the best state-of-the-art ML-based predictor for flaky test cases. Specifically, our evaluation addresses the following research questions.

- RQ1: How accurately can Flakify predict flaky test cases? Flakify achieved promising prediction results when evaluated using two different datasets. In particular, based on cross-validation, Flakify achieved a precision of 70%, a recall of 90%, and an F1-score of 79% on the FlakeFlagger dataset, and a precision of 99%, a recall of 96%, and an F1-score of 98% on the IDoFT dataset. Flakify yielded slightly worse results when predicting flaky tests on new projects, with a precision of 72%, a recall of 85%, and an F1-score of 73% on the FlakeFlagger dataset, and a precision of 91%, a recall of 88%, and an F1-score of 89% on the IDoFT dataset.
- RQ2: How does Flakify compare to the state-of-the-art predictors for flaky test cases? The best performing model of Flakify achieved a significantly higher precision (70% versus 60%) and recall (90% versus 72%) on the FlakeFlagger dataset in predicting flaky test cases than FlakeFlagger, the best state-of-the-art, white-box approach for predicting flaky test cases. Hence, with Flakify, the cost of debugging test cases and production code is reduced by 10 and 18 percentage points (pp) (a reduction rate of 25% and 64%), respectively, when compared to FlakeFlagger. Moreover, our results show that a black-box version of FlakeFlagger is not a viable option for predicting flaky test cases. Specifically, FlakeFlagger became 39 pp less precise with 20 pp less recall when only black-box features were used as predictors for flaky test cases.
- RQ3: How does test case pre-processing improve Flakify? Retaining only code statements that are related to a selected set of test smells improved the precision, recall, and F1-score of Flakify by 5 pp and 6 pp on the FlakeFlagger and IDoFT datasets, respectively. The goal was to address a limitation of CodeBERT (and all other language models), which leads to only considering the first 512 tokens in the test source code. This result also confirms the previously reported association of test smells with flaky test cases [7], [9], [11].

Overall, this paper makes the following contributions.

- A generic, black-box, language model-based flaky test case predictor, which does not require rerunning test cases.
- An ML-based classifier that predicts flaky test cases on the basis of test code without requiring the definition of features.
- An Abstract Syntax Tree (AST)-based technique for statically detecting and only retaining statements that match eight test smells in the test code, thus enhancing the application of language models.

The rest of this paper is organized as follows. Section 2 provides background about flaky test cases and language models. Section 3 presents our black-box approach for predicting flaky test cases. Section 4 evaluates our approach, reports experimental results, and discusses the implications of our research. Section 5 discusses the validity threats to our results. Section 6 reviews and contrasts related work. Finally, Section 7 concludes the paper and suggests future work.

2 BACKGROUND

In this section, we describe flaky test cases, their root causes, their practical impact, and the strategies to detect them. In addition, we describe pre-trained language models and how they can potentially contribute to predicting flaky test cases.

2.1 Flaky Test Cases

In software testing, a flaky test refers to test cases that intermittently fail and pass across executions, even for the same version of the source code, i.e., non-deterministically behaving test cases [1]. Flaky test cases lead to many problems during software testing, by producing unreliable results and wasting time and computational resources. A flaky test can also fail for different reasons across executions, making it difficult to identify which failures are actually related to faults in the system under test.

Flaky test cases have been reported to be a significant problem in practice at many companies, including Google, Huawei, Microsoft, SAP, Spotify, Mozilla, and Facebook [12], [13], [14], [15]. As reported by Google, almost 16% of their 4.2 million test cases are flaky [6]. Microsoft has also reported that 26% of 3.8 k build failures were due to flaky test cases. Many studies have been conducted to study flaky test cases, their causes, and the solutions to address them [1], [2], [4], [7], [8], [9], [11], [16]. Prominent causes of flaky test cases include asynchronous waits, test order dependency, concurrency, resource leakage, and incorrect test inputs or outputs. In addition, flaky test cases were found to be associated with other factors, such as test smells, which are further discussed below.

TABLE 1
Test Smells Used by FlakeFlagger [7]

Test Smell           Description
Indirect Testing     A test interacts with the class under test using methods from other classes
Eager Testing        A test performs multiple checks for various functionalities
Test Run War         A test allocates files or resources that might be used by other test cases
Conditional Logic    A test uses a conditional if statement
Fire and Forget      A test launches background threads or processes
Mystery Guest        A test accesses external resources
Assertion Roulette   A test performs multiple assertions
Resource Optimism    A test accesses external resources without checking their existence

2.2 Flaky Test Case Detection

The most common approach for detecting flaky test cases is by rerunning test cases numerous times to check whether they behave consistently across executions [4], [5]. Though effective, this approach is computationally expensive and not practical in many situations, for example in continuous integration contexts, where builds are submitted automatically and frequently to perform regression testing. To mitigate such cost, other approaches attempted to detect flaky test cases without relying on rerunning them. To that end, characteristics of test cases, such as execution history, coverage information, and static test features, were used to predict whether a test case is flaky or not. Prediction models were built using ML and Natural Language Processing (NLP) techniques [7], [8], [9]. Such techniques require training ML models with pre-defined sets of features used as indicators for test flakiness. Such features commonly present practical limitations, such as (a) their reliance on production code, which is not always accessible or efficiently analyzable by test engineers, and (b) their limited capacity to capture the actual structure or behavior of test cases, such as the use of language keywords [8] or the presence of test smells [7], [9], [11] in test code.

After identifying potentially flaky test cases, developers can focus their investigation on them and, hence, attempt to fix the code statements causing such flakiness. Developers may also choose to rerun those specific test cases many more times to verify that they are actually flaky [17]. This is a reasonable undertaking, since test cases predicted as flaky normally represent a small percentage of the entire test suite. This, in turn, significantly eliminates a large part of the effort and time required to investigate or rerun test cases whenever a failure occurs [7].

2.3 Test Smells

Test smells are inappropriate design or implementation choices made by developers while writing test cases [18]. Though test smells might not harm the functionality of a test case, previous research has reported that they tend to be associated with test flakiness. Test smells were further employed to classify whether a test case is flaky or not. For example, the test smells in Table 1 were part of the features used by Alshammari et al. [7] to predict test flakiness. Camara et al. [9] also used a more comprehensive set of test smells for flaky test case prediction. Results showed that Sleepy Test and Assertion Roulette are among the test smells that are highly associated with flaky test cases.

2.4 Pre-Trained Language Models

Much research has been carried out in the field of NLP for developing pre-trained language models. Language models estimate the probability of different linguistic units, i.e., words, symbols, and sequences of them, occurring in a given sentence. There are many language models proposed in the literature, such as BERT [19], ELMo [20], XLNet [21], RoBERTa [22], and VideoBERT [23]. These models were pre-trained, using self-supervised learning, on a large corpus of unlabelled data. For example, BERT was pre-trained using a large dataset of English text collected from books and Wikipedia, whereas VideoBERT was pre-trained using a large dataset of instructional videos collected from YouTube.

Pre-trained language models are often further fine-tuned using a specific, labelled dataset to train neural networks for performing various NLP tasks, such as text classification and entity recognition [24], relation extraction [25], sentence tagging, or next sentence prediction [19]. For example, BERT was fine-tuned to perform sentiment analysis [26], [27], trained on labelled datasets to assign sentiment tags, i.e., positive, negative, or neutral, to a given text. Fine-tuning requires initializing a language model with the same parameters used for pre-training, and then further training the model using labeled data related to a specific task.

Language models usually employ multi-layer transformers as a model architecture to perform many computations in parallel [28]. Transformer models adopt positional embedding to vectorize individual words by considering their positions in a given sequence of words. Thus, unlike Recurrent Neural Networks (RNNs) [29] and Long Short-Term Memory (LSTM) [30], transformer models do not require looking at past hidden states to capture dependencies with previous words in a sequence of words.

Given the wide popularity of language models in various NLP applications, researchers have attempted to apply these language models to programming languages. However, when BERT, for example, was used for detecting architectural tactics in source code [31], e.g., recognizing software design patterns, the results were relatively worse compared to those obtained when BERT was used for natural language text. To address this issue, recent work proposed pre-training language models on source code written in many programming languages in addition to natural language text [10], [32], [33], [34]. These models are well suited for fine-tuning to perform tasks related to source code. CodeBERT [10] is an example of a language model that was pre-trained on both natural and programming languages.
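For illustration, the fine-tuning workflow described above can be sketched with the HuggingFace transformers API. This is only an illustrative sketch, not part of the paper: the model name, example sentences, and labels below are placeholders, and Flakify's own fine-tuning setup is described later in Section 3.1.3.

```python
# Illustrative sketch of fine-tuning a pre-trained language model for binary
# text classification, assuming the HuggingFace "transformers" library and
# PyTorch. Model name, inputs, and labels are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # weights initialized from pre-training

batch = tokenizer(
    ["great build, all checks pass", "this test keeps failing randomly"],
    padding=True, truncation=True, return_tensors="pt")

# One fine-tuning step on labelled examples (labels are made up here).
outputs = model(**batch, labels=torch.tensor([1, 0]))
outputs.loss.backward()  # gradients update the pre-trained weights
```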

2.4.1 CodeBERT

CodeBERT [10] is a language model that was pre-trained on a large, unlabeled dataset containing English text as well as source code written in six different programming languages, namely Java, JavaScript, Python, Ruby, PHP, and Go, obtained from the CodeSearchNet corpus [35]. CodeBERT takes, as input, source code statements and natural language sentences, which are then tokenized using the WordPiece [36] tokenizer. Similar to BERT and RoBERTa, CodeBERT uses a multi-layer bidirectional transformer [28] as model architecture. This transformer is composed of six layers, each of which contains 12 self-attention heads capturing word relationships, a hidden state, and a 768-dimensional vector as the output of each layer.

CodeBERT also employed Masked Language Modeling (MLM) [19] and Replaced Token Detection (RTD) [37] during pre-training, allowing it to take tokens from random positions and mask them with special tokens, which are later used to predict the original tokens. As a result, each token is assigned a vector representation containing information about the token and its position in a given code. The final output of CodeBERT is a single vector representation aggregating all individual vector representations. This vector representation can further be fine-tuned to perform various tasks, e.g., classification. For example, to evaluate the performance of CodeBERT, it was fine-tuned to perform two tasks: (1) code search, i.e., retrieving the most relevant code to a given natural language text; and (2) code documentation, i.e., generating a natural language description for a given source code. Moreover, CodeBERT was also adopted to perform classification tasks, such as bug prediction [38] and vulnerability detection [39].

2.4.2 Other Models for Programming Languages

As mentioned above, many language models for programming languages were recently proposed. For example, GraphCodeBERT [32] was pre-trained on the inherent structure of source code and its data flow showing variable dependencies. Similar to CodeBERT, GraphCodeBERT was used for code search, in addition to code translation and refinement as well as clone detection. Another model for programming languages is TreeBERT [34], which was pre-trained using AST representations of Java and Python source code. TreeBERT was used for code documentation, similar to CodeBERT, in addition to code summarization. There is also CuBERT [33], a programming language model pre-trained using Python source code. CuBERT was used for classification tasks, such as classifying exceptions and variable misuses.

Despite the capabilities of these models, CodeBERT has been the most commonly used language model, and we selected it to address our objectives for several reasons presented below.

- The pre-trained CodeBERT model is publicly available (https://huggingface.co/microsoft/CodeBERT-base).
- Unlike GraphCodeBERT, CodeBERT does not take into consideration the data flow in a given source code, which might not be easy to capture using test code only. For example, unlike local variables, if a global or external variable is used by a test case, GraphCodeBERT cannot identify the type and value of that variable when analyzing test code only.
- Unlike TreeBERT, which requires converting source code into ASTs, CodeBERT only requires source code as input.
- Unlike CuBERT, which was only pre-trained on Python source code without comments, CodeBERT was pre-trained on multiple programming languages using both source code and natural language comments.

3 BLACK-BOX FLAKY TEST CASE PREDICTOR

This section describes our black-box solution for predicting flaky test cases. This is motivated by making such predictions scalable, as white-box analysis of the production source code, especially in the context of large systems, is often not a viable solution.

3.1 CodeBERT for Flaky Test Case Prediction

In this paper, we propose Flakify, a black-box solution for predicting whether a test case is flaky or not. Flakify relies solely on the source code of a test case and does not require rerunning it multiple times. The source code of test cases, i.e., Java test methods, includes the method declaration, body, and its associated Javadoc comments. While several studies have proposed ML techniques to predict flaky test cases, such techniques rely on pre-defined features extracted not only from the source code of test cases but also from that of the system under test. However, results [7], [8], [9] suggest those features may not be enough, and finding additional features that could potentially be associated with flaky test cases remains a research challenge given their non-deterministic behavior. Therefore, we employed CodeBERT, the pre-trained language model described above, to perform a binary classification of test cases as Flaky or Non-Flaky. CodeBERT does not require defining features, as it automatically identifies patterns based on the syntax and semantics of a given test code.

CodeBERT starts by converting the source code of a test case into a list of tokens, each of which is converted into an integer vector representation. Finally, an aggregated vector representation is generated as an output of CodeBERT, which is further fine-tuned to classify test cases as Flaky or Non-Flaky. Fig. 1 presents an example of how the source code of a test case is converted into tokens and then into integer vector representations.

3.1.1 Source Code Tokenization

To transform the source code into tokens, the source code of test cases is tokenized by the WordPiece [36] tokenizer using a pre-generated vocabulary file containing the vocabulary of both English and the programming languages used for model pre-training. However, uncommon words, i.e., those that do not exist in the vocabulary file, are separated into several sub-words. For example, the CodeBERT tokenizer splits 'assertThat' into 'assert' and '##that', where '##' denotes that a token represents a sub-word. Then, if a token is not found in the vocabulary file, the unknown token, <UNK>, is used. For each input, two special tokens, [CLS] and [SEP], are added. Eventually, for a given source code, the tokenizer generates a sequence of tokens in the form of [CLS], c1, c2, ..., cn, [SEP], where ci is a code token. The [CLS] token plays an important role in the classification of flaky test cases, as it contains the aggregated vector representation of all the vector representations of the tokens of a given test case. On the basis of that aggregated vector representation, our model classifies a test case as Flaky or Non-Flaky. [SEP] is just used to mark the end of the sequence of tokens. The tokenizer also adds a special marker in front of each word that is preceded by a whitespace in a statement.

Fig. 1. The process of converting the source code of a test case into a sequence of tokens, where each token is assigned an input index (id) and attention mask. Dots '....' are used to save space, since the actual length is 512. The input id of each token refers to a 768-dimensional vector representation.

3.1.2 Converting Tokens into Vector Representations

Once the source code tokens are generated, each token, including sub-word, special, and unknown tokens, is mapped to an index, e.g., id 34603 for "Test" in Fig. 1, based on the position and context of each word in a given input. Each token is described by a 768-dimensional integer vector generated during CodeBERT pre-training. Using token padding, the same token length is given to the code of all test cases used as input, e.g., "1" in Fig. 1. However, CodeBERT has a limit of 512 tokens per input. As a result, any token sequence exceeding that limit is truncated, which might lead to removing code statements with potentially relevant information about test flakiness. In addition to the input ids matching tokens, another list of attention masks is generated, containing ones and zeros to help the model distinguish between code tokens, which should be given attention, and extra tokens added for padding. Finally, for each test case, token vectors are aggregated to form one vector characterizing the [CLS] token, which is also represented using a 768-sized vector referred to by the first input index '0'.
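As an illustration of this tokenization step, the following is a minimal sketch assuming the HuggingFace transformers library and the public microsoft/codebert-base checkpoint (the paper does not prescribe a specific API). Note that the released checkpoint ships a RoBERTa-style byte-pair tokenizer whose <s> and </s> special tokens play the [CLS] and [SEP] roles described above; the Java test method is an invented example.

```python
# Minimal sketch, assuming HuggingFace "transformers" and PyTorch; the Java
# test method below is an invented example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

test_code = """@Test
public void test_example() throws Exception {
    File f = new File("data/config.json");
    assertTrue(f.exists());
}"""

# Add special tokens, pad/truncate to CodeBERT's 512-token limit, and build the
# attention mask (1 = real token, 0 = padding), as described in Section 3.1.2.
encoding = tokenizer(
    test_code,
    max_length=512,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)

print(encoding["input_ids"].shape)                                    # (1, 512)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])[:8])  # first tokens
print(encoding["attention_mask"][0][:8])                              # mask values
```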
3.1.3 Fine-Tuning CodeBERT for Flaky Test Classification

CodeBERT was pre-trained with a huge number of parameters, enabling it to recognize the source code structure. As a result, if CodeBERT were to be trained from scratch on our dataset, it would result in over-fitting. To avoid that, CodeBERT, similar to other language models [40], needs to be fine-tuned using data representative of the problem at hand. To do this, we employed CodeBERT as pre-trained and used its outputs, on our dataset, to train a Feedforward Neural Network (FNN) to perform binary classification of test cases as flaky or non-flaky, as shown in Fig. 2.

Fig. 2. Fine-tuning CodeBERT for classifying test cases as Flaky or not.

The output of CodeBERT, i.e., the aggregated vector representation of the [CLS] token, is then fed as input to a trained FNN to classify test cases as flaky or not. The FNN contains an input layer of 768 neurons, a hidden layer of 512 neurons, and an output layer with two neurons. We used ReLU [41] as an activation function, which helps to speed up training; it outputs its input directly if it is positive and zero otherwise. Then, we added a dropout layer [42] to eliminate some neurons randomly from the network, by resetting their weights to zero during the training phase to prevent model over-fitting [43]. We used the Softmax function to compute the probability of a test case being Flaky or Non-Flaky. We used a learning rate of 10^-5 with the AdamW optimizer [44] and employed a batch size of two due to computational limitations. Using this configuration, we further trained CodeBERT on our training and validation datasets, which enabled the selection of improved parameter values for weights and biases through back propagation. We then evaluated the model, with the obtained weights, using a test dataset.
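A minimal PyTorch sketch of this classification head is shown below, using the layer sizes and optimizer settings given above. It is not the authors' exact implementation: the dropout probability and the commented training step are assumptions, and AutoModel refers to the HuggingFace transformers API.

```python
# Minimal sketch of the fine-tuning setup described above (not the paper's
# replication code). Assumes PyTorch and HuggingFace "transformers"; the
# dropout probability is an illustrative assumption.
import torch
import torch.nn as nn
from transformers import AutoModel

class FlakifyClassifier(nn.Module):
    def __init__(self, dropout: float = 0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("microsoft/codebert-base")
        self.fc1 = nn.Linear(768, 512)   # 768-neuron input layer -> 512-neuron hidden layer
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(512, 2)     # two outputs: Flaky / Non-Flaky

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = out.last_hidden_state[:, 0, :]   # aggregated [CLS] representation
        hidden = self.dropout(self.relu(self.fc1(cls_vector)))
        return self.fc2(hidden)          # logits; softmax applied at prediction time

model = FlakifyClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # learning rate from the text
loss_fn = nn.CrossEntropyLoss()

# One training step on a batch of two tokenized test cases (batch size as above):
#   logits = model(batch["input_ids"], batch["attention_mask"])
#   loss = loss_fn(logits, labels)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   probabilities = torch.softmax(logits, dim=-1)   # P(Flaky), P(Non-Flaky)
```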
3.2 Identifying Test Smells

As indicated above, the 512-token length limit induced by CodeBERT truncates longer test code, which leads to losing potentially relevant information about test flakiness. Therefore, we pre-processed the source code of test cases to reduce their token length by only retaining information believed to be more relevant to test flakiness. To this end, for test cases exceeding the token length limit, we retained only code statements that match at least one of the eight test smells that were used by FlakeFlagger [7] as predictors for flaky test cases. We also retained the method declaration and the associated Javadoc, since the signature and natural language description, if any, of the test case might contain key terms or phrases that are likely associated with test flakiness, e.g., "...failures...unnecessary..." or "thread-safe".

There exist several open source tools available for detecting test smells [45]. However, these tools, e.g., tsDetect [46] and JNose Test [47], either rely on production code for detecting test smells or do not detect all test smells that are potentially relevant to test flakiness [7]. While Alshammari et al. [7] detect all the eight test smells shown in Table 1, their technique does so by running test cases and requiring access to the production code for smell detection. Though we were inspired by the heuristics used by Alshammari et al. to detect test smells, given that our approach aims to be black-box, we developed an entirely different technique that detects test smells statically, relying exclusively on test code without requiring to run test cases. Flakify detects all targeted test smells and can be easily extended to detect additional test smells. We used an Abstract Syntax Tree (AST) [48] parser, provided by the Eclipse JDT library (https://www.eclipse.org/jdt), to statically traverse any given test code and retain statements that match any of the targeted test smells. Using this library, each Java file in a test suite is parsed and converted into AST nodes representing different code elements, e.g., method declaration or invocation. Then, an AST visitor is used to traverse those AST nodes. We extended the AST visitor to check the AST nodes related to method declarations and apply heuristics (described below) to detect and retain code statements that match at least one test smell. Such statements are extracted as part of the pre-processed code.

Fig. 3. Example of pre-processing the source code of a test case, which leads to reducing the number of tokens from 62 down to 43.

Fig. 3 gives an example of a Java test method, test_example, and how it is pre-processed. As we can see, test_example has seven different statements, four of them having test smells. In particular, test_example contains the following test smells: Fire and Forget (line 5 – launching a thread), Conditional Test (line 7 – if condition), and Assertion Roulette (lines 8 and 10 – multiple assertions). As a result, our technique retains only these four statements, which in turn leads to reducing the token length from 62 to 43 (a 31% reduction rate). We expect our test code pre-processing to help improve the classification performance, since it mitigates the random truncation of code statements induced by CodeBERT.
3.2.1 Heuristics for Detecting Test Smells

To detect test smells in test code, we followed the same detection heuristics as those used by Alshammari et al. [7]. However, different from this work, which extracts test smell information dynamically from the test and production code (code coverage), we detected test smells statically by analyzing the test code only. To this end, we used an Abstract Syntax Tree (AST) [48] parser, provided by the Eclipse JDT library (https://www.eclipse.org/jdt), to traverse any given test code and retain statements that match, according to our heuristics, any of the targeted test smells. Using this library, each Java test file in the test suite is parsed and converted into AST nodes representing different code elements, e.g., method declaration or invocation. While parsing Java test files, not all types are necessarily resolved due to missing production code. We describe below the heuristics used to identify each of the eight test smells presented in Table 1. For each test case, i.e., test method, we analyzed each statement to check whether it matches one of the targeted test smells. If so, we retain that statement as part of the pre-processed test code and otherwise exclude that statement. For some test smells, we added flags, i.e., a Java line comment appended to the end of each statement matching the test smell, to help our fine-tuned model learn about the association of these statements with test flakiness. The test smells used in this work were detected as described below.
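Before the individual heuristics, the following deliberately simplified, line-based Python sketch illustrates the statement-retention idea. It is not the paper's implementation, which uses an Eclipse JDT AST visitor in Java and resolves AST node types; the regular expressions only approximate four of the heuristics listed below, and the helper name is invented.

```python
# Simplified sketch of statement-level smell matching (illustrative only; the
# actual technique parses Java test files into ASTs with Eclipse JDT).
import re

# Rough patterns for four of the eight smells described below:
# Conditional Logic, Fire and Forget, Mystery Guest, Assertion Roulette.
SMELL_PATTERNS = [
    re.compile(r"\bif\s*\("),                                               # Conditional Logic
    re.compile(r"\bThread\b|\bRunnable\b|java\.util\.concurrent"),          # Fire and Forget
    re.compile(r"java\.io\.File|java\.sql|javax\.sql|java\.net|javax\.net"),# Mystery Guest
    re.compile(r"\bassert\w*\s*\(|\bfail\s*\("),                            # Assertion Roulette
]

def preprocess_test_method(source: str) -> str:
    """Keep the declaration/Javadoc plus statements matching at least one smell."""
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith(("@", "/**", "*", "public", "}")):
            kept.append(line)  # annotations, Javadoc, method signature, closing braces
        elif any(p.search(stripped) for p in SMELL_PATTERNS):
            kept.append(line)  # statement matching at least one targeted smell
    return "\n".join(kept)
```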
- Indirect Testing: We check whether a statement invokes a method that belongs to a class other than the test class or the production class under test. Since our approach is black-box, i.e., no access to production code, the production class name is extracted from the test class name by removing the word 'Test'. This is a commonly used coding convention, but our approach can easily be adapted to other coding conventions in practice [49]. Any statement that is found to invoke such methods is retained and the '//IT' flag is added.
- Eager Testing: We check whether a test case invokes more than one method belonging to the production class under test, as it can introduce confusion about what exactly a test method is testing [45]. If this is the case, we retain the statements invoking these methods, adding the '//ET' flag.
- Test Run War: We check whether a statement accesses static variables that are not declared as final, as the value of such variables could be changed by other test cases in different test executions, especially when a test case is order-dependent, which can then cause resource interference during test case execution [7]. Any statement that is found to use one of these variables is retained, adding the '//RW' flag.
- Conditional Logic: We check whether a statement contains an if condition. If so, we retain if statements, including their logical expressions. The presence of conditional statements makes test case behavior dependent on their logical expressions, thus making them unpredictable [45]. For the statements inside the then and else blocks, we only retain those that match one of the eight test smells.
- Fire and Forget: We check whether a statement invokes a method that launches a thread by checking if the invoked method belongs to the java.lang.Thread class, java.lang.Runnable interface, or java.util.concurrent package. Thread-related statements make test cases prone to synchronization issues during their execution [11]. If this test smell is present, we retain that statement.

- Mystery Guest: We check whether a statement invokes a method that accesses external resources, such as the file system (via java.io.File), database system (via java.sql, javax.sql, or javax.persistence), or network (via java.net or javax.net). Such external resources can introduce stability and performance issues during test case execution [11]. Any statement that is found to use methods that belong to one of these classes or packages is retained.
- Assertion Roulette: We check whether a statement performs one of the following assertion mechanisms: assertArrayEquals, assertEquals, assertFalse, assertNotNull, assertNotSame, assertNull, assertSame, assertThat, assertTrue, and fail. If so, the statement is retained. Multiple assert statements in a test method make it difficult to identify the cause of the failure if just one of the asserts fails [9].
- Resource Optimism: We check whether a statement accesses the file system (java.io.File) without checking if the path (for either a file or directory) exists. Doing so makes optimistic assumptions about the availability of resources, thus causing non-deterministic behavior of the test case [46]. We check the test initialization method (usually named setUp or containing the @Before annotation) for any path checking method, including getPath(), getAbsolutePath(), or getCanonicalPath(). If no such checking is present, the statement is retained, adding the '//RO' flag.
4 VALIDATION

This section reports on the experiments we conducted to evaluate how accurate Flakify is in predicting flaky test cases and how it compares to FlakeFlagger as a baseline. We discuss the research questions we address, the datasets used, and the experiment design. Then, we present the results for each research question and discuss their practical implications.

4.1 Research Questions

- RQ1: How accurately can Flakify predict flaky test cases? The performance of ML-based flaky test predictors can be influenced by the data used for training and the underlying modeling methodology. In this RQ, we evaluate Flakify on two distinct datasets, which differ in terms of numbers of projects, ratios of flaky and non-flaky test cases, and the way flaky test cases were detected. In addition, predicting flaky test cases can be influenced by project-specific information used during model training, which is not available for new projects. Therefore, we evaluate Flakify using two different procedures: 10-fold cross-validation and per-project validation. The former mixes test cases from all projects together to perform model training and testing, whereas the latter tests the model on every project such that no information from that project was used as part of model training.
- RQ2: How does Flakify compare to the state-of-the-art predictors for flaky test cases? Many solutions have been proposed to predict flaky test cases. In this RQ, we compare the performance of our best performing model of Flakify (with test case pre-processing) to two versions (white-box and black-box) of FlakeFlagger, the best flaky test case predictor to date.
  RQ2.1: How accurate is Flakify for flaky test case prediction compared to the best white-box ML-based solution? White-box prediction of flaky test cases requires access to production code, which is not (easily) accessible by software test engineers in many contexts. We assess whether Flakify achieves results that are at least comparable to the best white-box flaky test case predictor. Specifically, we compare the accuracy of the best performing model of Flakify with FlakeFlagger [7], the best white-box solution currently available, on the dataset used by FlakeFlagger. Our motivation is to determine whether black-box solutions, based on CodeBERT, can compete with the state-of-the-art, white-box ones. We compare the results of Flakify and FlakeFlagger on the dataset on which FlakeFlagger was evaluated, hereafter referred to as the FlakeFlagger dataset. We also performed a per-project validation of Flakify compared against FlakeFlagger to assess their relative capability to predict test cases in new projects.
  RQ2.2: How accurate is Flakify for black-box flaky test case prediction compared to the best ML-based solution? Existing black-box flaky test case prediction solutions rely on a limited set of features that are sometimes project-specific or applicable only to a certain programming language, e.g., Java [8], since they were trained on features capturing the keywords of that language. Besides not being generic, the accuracy of these solutions has been shown to be very low compared to white-box solutions [7]. Therefore, we compare the accuracy of Flakify with a black-box version of FlakeFlagger, obtained by excluding the features related to production code, such as code coverage features (see Table 2).
- RQ3: How does test case pre-processing improve Flakify? The token length limitation of CodeBERT may lead to unintentionally removing relevant information about flaky test cases, which could then impact prediction accuracy. We assess whether the accuracy of Flakify is improved when training the model using pre-processed test cases containing only code statements related to test smells, as opposed to the entire test case code. We fully realize that we may be missing test smells or unintentionally removing relevant statements. But our motivation is to assess the benefits, if any, of our approach to reduce the number of tokens used as input to CodeBERT. We performed this analysis on both the FlakeFlagger and the IDoFT datasets.

4.2 Datasets Collection and Processing

To evaluate Flakify, we used two publicly available datasets for flaky test cases. The first dataset is the FlakeFlagger dataset [7]. The second dataset is the International Dataset of Flaky Tests (IDoFT, https://mir.cs.illinois.edu/flakytests), which comprises many datasets for flaky test cases used by previous studies on flaky test case prediction [5], [50], [51], [52], [53], [54].

TABLE 2
FlakeFlagger Features

Category    Feature                   Description
Black-Box   Presence of Test Smells   See Table 1
Black-Box   Test Lines of Code        Number of lines of code in the body of the test method
Black-Box   Number of Assertions      Number of assertions checked by the test
Black-Box   Execution Time            Running time for the test execution
Black-Box   Libraries                 Number of external libraries used by the test
White-Box   Source Covered Classes    Number of production classes covered by each test
White-Box   Source Covered Lines      Number of lines covered by the test, counting only production code
White-Box   Covered Lines             Number of lines of code covered by the test
White-Box   Covered Lines Churn       Churn of covered lines in the past 5, 10, 25, 50, 75, 100, 500, and 10,000 commits
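For illustration, the black-box/white-box split in Table 2 and the information-gain-based feature selection used for the FlakeFlagger baseline (described in Section 4.3.1) could be sketched as follows. This is not from the replication package: the file and column names are hypothetical, pandas and scikit-learn are assumed, and information gain is approximated here with mutual information.

```python
# Hypothetical sketch: derive a black-box feature subset and apply an
# information-gain-style cut-off. Column names are invented placeholders.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

features = pd.read_csv("flakeflagger_features.csv")   # hypothetical feature table
labels = features.pop("is_flaky")                     # 1 = Flaky, 0 = Non-Flaky

WHITE_BOX = ["covered_classes", "covered_lines_production",
             "covered_lines", "covered_lines_churn"]
black_box_features = features.drop(columns=WHITE_BOX, errors="ignore")

# Keep only features whose estimated information gain is at least 0.01
# (mutual information is used as a stand-in estimate).
ig = mutual_info_classif(black_box_features, labels, discrete_features="auto")
selected = black_box_features.columns[ig >= 0.01]
print(list(selected))
```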

FlakeFlagger Dataset. It is provided by Alshammari et al. [7], containing flakiness information about 22,236 test cases collected from 23 GitHub projects. These projects have different test suite sizes, ranging from 55 to 6,267 (with a median of 430) test cases per project. All projects in the FlakeFlagger dataset are written in Java and use Maven as a build system, and each test case is a Java test method. The dataset contains the source code of each test case and the corresponding features that were computed to train FlakeFlagger. Also, test cases in the dataset were assigned labels indicating whether they are Flaky or Non-Flaky, which were determined by executing each test case 10,000 times.

When we analyzed the dataset, we identified 453 test cases with missing source code when intersecting test cases in a provided CSV file (called processed_data, at https://github.com/AlshammariA/FlakeFlagger/blob/main/flakiness-predicter/result/processed_data.csv) with those in a provided folder (called original_tests, at https://github.com/AlshammariA/FlakeFlagger/tree/main/flakiness-predicter/input_data/original_tests) containing their source code. In addition, we identified 122 test cases, in the original_tests folder, with empty source code, which we found out were not written in Java (https://github.com/AlshammariA/FlakeFlagger/pull/4). Therefore, we excluded these test cases from our dataset, since they do not add any valuable information regarding our flakiness prediction evaluation. Nine of these test cases were labeled as flaky, three with missing source code and six with an empty method body. After excluding test cases with missing and empty code, we obtained 21,661 test cases for our experiments. We compared Flakify and FlakeFlagger using this updated dataset. To pre-process the source code of the test cases (see Section 3.2), we cloned the GitHub repository of each project and extracted the Java classes defining the methods of test cases.

There are 802 test cases in the dataset that are labeled as Flaky (with a median of 19 flaky test cases per project), whereas 20,859 test cases are Non-Flaky. About 4% of all test cases exceed the 512-token limit of CodeBERT when converted into tokens, including 14% of the flaky test cases.

IDoFT Dataset. This dataset contains 3,742 Flaky test cases from 314 different Java projects, collected in different ways, i.e., different runtime environments with different numbers of runs to detect test flakiness. However, we were unable to obtain the test code of 474 test cases (from 2 projects) due to missing GitHub repositories or commits, leaving us with 3,268 Flaky test cases from 312 projects. Given that the IDoFT dataset contains no test cases categorized as Non-Flaky, we used the fixed versions of 1,263 flaky test cases, from 174 projects, to obtain non-flaky test cases, as recommended by the IDoFT maintainers (https://github.com/TestingResearchIllinois/IDoFT/issues/566). To do so, we relied on the provided links to pull requests (https://mir.cs.illinois.edu/flakytests/fixed.html) used for fixing flaky test cases to collect the corresponding code changes. However, of the 1,263 fixed flaky test cases, we found only 594 flaky test cases, from 126 projects, in which the test case code was changed to fix test flakiness. Based on our analysis, the other flaky test cases were fixed in other ways, such as changing the order of test case execution, test configuration, or production code. Such flaky tests are out of the scope of this paper, since we consider only test cases whose test code was fixed, e.g., causes of flakiness related to test smells or other test characteristics. As a result, we added the 594 Non-Flaky (fixed) tests to the 3,268 Flaky test cases to end up with an updated dataset of 3,862 test cases. Limitations regarding the causes of flakiness we could not detect are discussed in Section 5. About 13% of all test cases exceed the 512-token limit of CodeBERT when converted into tokens.

We made the updated datasets of FlakeFlagger and IDoFT, including their pre-processed test cases, publicly available in our replication package [55].

4.3 Experiment Design

4.3.1 Baseline

We used the FlakeFlagger approach as a baseline against which we compare the results achieved by Flakify. To this end, we reran the experiments conducted by Alshammari et al. [7] to reproduce the prediction results of FlakeFlagger using their provided replication package (https://github.com/AlshammariA/FlakeFlagger). FlakeFlagger was trained and tested using a combination of white-box and black-box features listed in Table 2. These features were selected based on their Information Gain (IG), i.e., only features having an IG ≥ 0.01 were selected for training. Besides reproducing the original results of FlakeFlagger, we also reran the experiments using black-box features only, which was done by excluding all features that required access to production code.

Comparing Flakify with FlakeFlagger is performed on the FlakeFlagger dataset only, as running FlakeFlagger on the IDoFT dataset requires extracting the features, both dynamic and static, needed to train FlakeFlagger. To do so, we must access the project's production code and then successfully execute thousands of test cases across hundreds of project versions.

4.3.2 Training and Testing Prediction Models

Training and testing Flakify were conducted using two different procedures, performed independently on the two datasets described above, as follows.

1st Procedure (Cross-Validation). In this procedure, we evaluated Flakify similarly to how FlakeFlagger was originally assessed. Specifically, we used a 10-fold stratified cross-validation to ensure our model is trained and tested in a valid and unbiased way. For that, we allocated 90% of the test cases for training and 10% for testing our model in each fold. However, different from FlakeFlagger, we employed 20% of the training dataset as a validation dataset, which is required for fine-tuning CodeBERT. Using the validation dataset, we calculated the training and validation loss, which helped obtain optimal weights and stop the training early enough to avoid overfitting.

Given that both of the datasets we used are highly imbalanced—Flaky test cases represent only 3.7% of all test cases in the FlakeFlagger dataset and Non-Flaky test cases represent only 15% of the IDoFT dataset—we balanced Flaky and Non-Flaky test cases in the training and validation datasets of FlakeFlagger and IDoFT. Different from FlakeFlagger, which used the synthetic minority oversampling technique (SMOTE) [56], we used random oversampling [57], which adds random copies of the minority class to the dataset. We were unable to use SMOTE, since it requires vector-based features, whereas our model takes the source code of test cases (text) as input [10], [38], as opposed to pre-defined features like FlakeFlagger. Similar to FlakeFlagger, we also performed our experiments using undersampling, but this led to lower accuracy. We did not balance the testing dataset, to ensure that our model is only tested on the actual set of test cases. This prevents overestimating the accuracy of the model and reflects real-world scenarios where flaky test cases are rarer than non-flaky test cases [7].

2nd Procedure (Per-Project Validation). In this procedure, we evaluated Flakify in a way that yields more realistic results when we predict test cases on a new project, thus evaluating the generalizability of Flakify across projects. To do this, we performed a per-project validation of Flakify on both datasets. In particular, for every project in each dataset, we trained Flakify on the other projects and tested it on that project. This allowed us to evaluate how accurate Flakify is in predicting flaky test cases in one project without including any data from that project during training. We also performed this analysis for FlakeFlagger, on the FlakeFlagger dataset, for the sake of comparison.

4.3.3 Evaluation Metrics

To evaluate the performance of our approach, we used standard evaluation metrics for ML classifiers, including Precision (the ability of a classification model to precisely predict flaky test cases), Recall (the ability of a model to predict all flaky test cases), and the F1-Score (the harmonic mean of precision and recall) [58]. For the per-project validation of Flakify, we computed the overall precision, recall, and F1-score using the prediction results of all projects in the FlakeFlagger and IDoFT datasets. We also computed these metrics individually for those projects that have both Flaky and Non-Flaky test cases, specifically 23 FlakeFlagger projects and 126 IDoFT projects, along with descriptive statistics, such as the mean, median, min, max, and 25% and 75% quantiles. We used Fisher's exact test [59] to assess how significant the difference in proportions of correctly classified test cases is between two independent experiments. Note that precision, recall, and F1-score are computed based on such proportions.

In practice, test cases classified as Flaky must be addressed by re-running them multiple times or by fixing the root causes of flakiness [6], [12], [60]. Precisely predicting flakiness is therefore important as otherwise time and resources are wasted on re-running and attempting to debug many test cases that are believed to be flaky but are not [16], [61]. According to our industry partner, Huawei Canada, and a Google technical report [6], each flaky test case has to be investigated and re-run by developers. Hence, when we multiply the number of predicted flaky test cases, we proportionally increase the resources associated with re-running and investigating such flaky test cases. Therefore, we assume that the wasted cost of unnecessarily re-running and debugging test cases is inversely proportional to precision:

    Test Debugging Cost ∝ 1 − Precision    (1)

On the other hand, it is also important not to miss too many flaky test cases as otherwise time is bound to be wasted on futile attempts to find and fix non-existent bugs in the production code. Thus, we assume that the wasted cost of unnecessarily finding and fixing non-existent bugs in the production code is inversely proportional to recall:

    Code Debugging Cost ∝ 1 − Recall    (2)

We acknowledge that the above metrics are surrogate measures for cost and that there are significant differences between individual flaky tests; however, they are reasonable and useful approximations on large test suites for the purpose of comparing classification techniques. We used FlakeFlagger as a baseline to compute the reduction rate of test and code debugging costs, by dividing the difference in cost between Flakify and FlakeFlagger by the cost of FlakeFlagger.

TABLE 3
Results of Flakify (Using Full Code and Pre-Processed Code) Compared to FlakeFlagger (White-Box and Black-Box Versions)

Approach      Dataset                Model                Precision  Recall  F1-Score
Flakify       FlakeFlagger dataset   Full code            65%        85%     74%
Flakify       FlakeFlagger dataset   Pre-processed code   70%        90%     79%
Flakify       IDoFT dataset          Full code            98%        95%     92%
Flakify       IDoFT dataset          Pre-processed code   99%        96%     98%
FlakeFlagger  FlakeFlagger dataset   White-box version    60%        72%     65%
FlakeFlagger  FlakeFlagger dataset   Black-box version    21%        52%     30%

TABLE 4
Summary of the Per-Project Prediction Results of Flakify on the FlakeFlagger and IDoFT Datasets

Dataset               Metric     Min   25%   Mean  Median  75%   Max
FlakeFlagger dataset  Precision  6%    58%   72%   79%     91%   100%
FlakeFlagger dataset  Recall     1%    87%   85%   95%     100%  100%
FlakeFlagger dataset  F1-Score   2%    63%   73%   83%     94%   100%
IDoFT dataset         Precision  66%   100%  91%   100%    100%  100%
IDoFT dataset         Recall     14%   94%   88%   100%    100%  100%
IDoFT dataset         F1-Score   25%   95%   89%   100%    100%  100%

The higher results achieved by Flakify on the IDoFT dataset are probably due to the fact that the IDoFT dataset contains many more flaky test cases than FlakeFlagger, which helped during model training. Moreover, the non-flaky test cases in the IDoFT dataset were labeled based on developers' fixes addressing the causes of flakiness in the test code, unlike the non-flaky test cases in the FlakeFlagger dataset, whose labels were based on 10,000 runs performed by Alshammari et al. [7], which may not have been enough to fully expose test flakiness. This also helped during model training of Flakify.

Table 4 reports the per-project prediction results of Flakify on the FlakeFlagger dataset. Overall, as expected, Flakify achieved slightly lower precision (72%), recall (85%), and F1-score (73%) than the cross-validation results on the FlakeFlagger dataset. Similarly, Flakify achieved slightly worse precision (91%), recall (88%), and F1-score (89%) on the IDoFT dataset. Table 5 shows descriptive statistics for the per-project prediction results of Flakify for individual projects of the FlakeFlagger dataset (due to space limitations, we provide individual per-project prediction results of Flakify on the IDoFT dataset in our replication package [55]). Our analysis of individual per-project prediction results revealed a high performance of Flakify on the majority of projects. This result suggests that Flakify helps build models that are generalizable across projects, thus making it applicable to new projects where no historical information about test flakiness exists. In short, Flakify is capable of learning about test flakiness through data collected from other projects to predict flaky test cases in new projects.

4.4.2 RQ2 Results
Table 3 presents the prediction results of Flakify, using both full code and pre-processed test code, and FlakeFlagger, using both white-box and black-box versions, for the FlakeFlagger dataset.

RQ2.1 results. For FlakeFlagger, we obtained results close to those reported in the original study, with a slight decrease in F1-score (1%), which is likely due to removing test cases with missing test code. Flakify achieved much better results, with a precision of 70% (+10 pp), a recall of 90% (+18 pp), and an F1-score of 79% (+14 pp). These results clearly show that Flakify, though being black-box and relying exclusively on test code, significantly surpasses FlakeFlagger in accurately predicting flaky test cases. Statistically, the proportion of correctly predicted test cases using Flakify is significantly higher than that obtained with FlakeFlagger (Fisher-exact p-value < 0.0001).

The number of true positives obtained by FlakeFlagger was 574, whereas Flakify increased that number to 721. This indicates that Flakify can potentially reduce the test debugging cost by 10 pp, as defined above, when compared to FlakeFlagger (a reduction rate of 25%). Similarly, Flakify reduces the number of false negatives to 81 from 227 with FlakeFlagger, thus decreasing the code debugging cost by 18 pp, as defined above (a reduction rate of 64%).
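The statistical comparisons above rely on Fisher's exact test over counts of correctly versus incorrectly predicted test cases. The sketch below shows how such a comparison can be run with SciPy; the 2x2 counts are placeholders for illustration only, since the full contingency tables are not reproduced here:

    from scipy.stats import fisher_exact

    # Hypothetical 2x2 table: rows = technique (Flakify, FlakeFlagger),
    # columns = (correctly predicted, incorrectly predicted) test cases.
    table = [[9200, 800],
             [8700, 1300]]
    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    print(odds_ratio, p_value)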
Table 5 shows the comparison of per-project prediction results between Flakify and FlakeFlagger. Overall, Flakify achieves a high accuracy, with a precision of 72% (+57 pp), a recall of 85% (+71 pp), and an F1-score of 73% (+66 pp), which, once again, significantly outperforms FlakeFlagger. Looking at the individual prediction results of the projects, we observe that the accuracy of Flakify is largely consistent across projects, with a few exceptions, whereas FlakeFlagger performed poorly on the majority of projects. Further, Flakify performs better than FlakeFlagger for almost all projects except two: incubator-dubbo and spring-boot, where both techniques fare poorly.

To understand the reasons behind such degraded performance for these two projects, we performed a hierarchical clustering of the 23 projects.
TABLE 5
Results of the Per-Project Prediction for Flakify and FlakeFlagger on the FlakeFlagger Dataset

                    Precision               Recall                  F1-Score
Project             Flakify  FlakeFlagger   Flakify  FlakeFlagger   Flakify  FlakeFlagger
achilles            100%     0%             100%     0%             100%     0%
activiti            80%      2%             90%      94%            85%      4%
alluxio             99%      100%           100%     13%            99%      24%
ambari              75%      39%            95%      61%            84%      47%
assertj-core        25%      0%             100%     0%             40%      0%
commons-exec        25%      0%             100%     0%             40%      0%
elastic-job-lite    50%      0%             100%     0%             60%      0%
handlebars.java     30%      0%             100%     0%             50%      0%
hbase               79%      72%            98%      33%            88%      45%
hector              100%     0%             93%      0%             96%      0%
http-request        88%      0%             88%      0%             88%      0%
httpcore            74%      7%             90%      4%             81%      5%
incubator-dubbo     6%       7%             16%      32%            9%       12%
java-websocket      95%      0%             95%      0%             95%      0%
logback             85%      0%             81%      0%             83%      0%
ninja               100%     0%             100%     0%             100%     0%
okhttp              78%      100%           85%      2%             81%      4%
orbit               88%      0%             100%     0%             93%      0%
spring-boot         40%      9%             1%       3%             2%       4%
undertow            75%      7%             85%      43%            79%      12%
wildfly             65%      6%             91%      26%            76%      10%
wro4j               88%      1%             100%     19%            94%      3%
zxing               100%     0%             50%      0%             66%      0%
Overall             72%      15%            85%      14%            73%      7%

For every project, we trained models on all other projects and tested them on that project.

We used different metrics that capture the characteristics of each project, such as the number of test cases, number of flaky test cases, and frequency of test smells in each project. However, our clustering results were inconclusive, thus revealing no significant differences between the two projects and the other projects.
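The sketch below illustrates the kind of project-level clustering analysis just described; the metric values, the standardization step, and the choice of Ward linkage are illustrative assumptions rather than the exact configuration used in the study:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.stats import zscore

    # Rows: projects; columns: illustrative project-level metrics
    # (number of test cases, number of flaky test cases, test smell frequency).
    project_metrics = np.array([
        [1200, 35, 0.18],
        [4500, 410, 0.32],
        [800, 12, 0.09],
        [2300, 95, 0.21],
    ])
    standardized = zscore(project_metrics, axis=0)
    merge_tree = linkage(standardized, method="ward")
    clusters = fcluster(merge_tree, t=2, criterion="maxclust")
    print(clusters)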
As reported by Alshammari et al. [7], each project can have distinct characteristics, e.g., environmental setup and testing paradigm, that make it difficult to develop a general-purpose flaky test case predictor. For example, the spring-boot project has the highest number of flaky test cases among all projects, representing 20% of all flaky test cases in the dataset. This, in turn, can influence model training when the model was tested for spring-boot. In addition, the variation in prediction results can be a result of a possible mislabeling of test cases as Flaky and Non-Flaky in some projects, since some test cases may still exhibit flakiness if executed more than 10,000 times, for example. Finally, test flakiness can also occur due to the use of network APIs or dependency conflicts [17], which were not taken into account when predicting flaky test cases.

RQ2.2 results. As shown in Table 3, we observe a considerable decline in the accuracy of the black-box version of FlakeFlagger when compared to its original, white-box version, i.e., 39 pp less precise with a 35 pp decrease in F1-score. Specifically, black-box FlakeFlagger correctly predicted a significantly lower proportion of test cases than both Flakify and the original, white-box version of FlakeFlagger (Fisher-exact p-values < 0.0001). As a possible explanation, based on the results of FlakeFlagger regarding the importance of features in predicting flaky test cases [7], the majority of features having high IG values were based on source code coverage. Hence, removing those features, to make FlakeFlagger black-box, is expected to significantly decrease its prediction power. The difference in accuracy between Flakify and the black-box version of FlakeFlagger is rather striking, with a large improvement of +49% in F1-score (Fisher-exact p-value < 0.0001). FlakeFlagger is therefore not a viable black-box option to predict flaky test cases.
4.4.3 RQ3 Results
With no code pre-processing, 898 (4%) of the test cases of the FlakeFlagger dataset and 505 (13%) of the test cases of the IDoFT dataset were truncated by CodeBERT to generate tokens of size 512. Such arbitrary code truncation is likely to affect how accurately Flakify can predict flaky test cases. Pre-processing test cases (see Section 3.2) led to reducing the number of test cases being truncated to only 40 (from 898) in the FlakeFlagger dataset and 87 (from 505) in the IDoFT dataset, a large difference. As a result, we observe in Table 3 that, with pre-processed test cases, Flakify predicted flaky test cases with 5 pp higher F1-score on the FlakeFlagger dataset and 6 pp higher F1-score on the IDoFT dataset. This corresponds to a significantly higher proportion of correctly predicted test cases (Fisher-exact p-value = 0.0008) for the FlakeFlagger dataset. In practice, the impact of pre-processing is expected to vary depending on the token length distribution of test cases. This result suggests that retaining statements related to test smells in the test code contributed to making Flakify more accurate, which also confirms the association of test smells with flaky test cases reported by prior research [9].
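The truncation counts above can be checked with the publicly available microsoft/codebert-base tokenizer from HuggingFace. The sketch below is our own illustration (the helper name and the sample test are hypothetical), not the pipeline used in the study:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

    def exceeds_codebert_limit(test_source, limit=512):
        # Count the tokens CodeBERT would receive, including its special
        # tokens, without silently truncating the input.
        ids = tokenizer.encode(test_source, add_special_tokens=True, truncation=False)
        return len(ids) > limit

    sample_test = """
    @Test
    public void testRetriesOnTimeout() throws Exception {
        Thread.sleep(1000);
        assertEquals(expected, client.fetch());
    }
    """
    print(exceeds_codebert_limit(sample_test))

Applying such a check before and after the smell-based pre-processing makes the reduction in truncated test cases directly measurable.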
4.5 Discussion
More Accurate Predictions With Easily Accessible Information. Our results showed that our black-box prediction of flaky test cases performs significantly better than a white-box, state-of-the-art approach. This not only enables test engineers to predict flaky test cases without rerunning test cases, but also without accessing the production code of the system under test, a significant practical advantage in many contexts. The highest accuracy of Flakify was achieved by only retaining relevant code statements matching eight test smells. Yet, there is still room for improvement in terms of accuracy, which could be achieved by retaining more relevant statements based on additional test smells. For example, retaining code statements related to other common flakiness causes [16], such as concurrency and randomness, could further improve flaky test case predictions. However, the more code statements we retain, the more tokens to be considered by CodeBERT, which might lead to many test cases exceeding their token length limit, thus truncating other useful information. Hence, retaining additional code statements is a trade-off and should carefully be performed in balance with the resulting token length of test cases. Moreover, building a white-box flaky test predictor, by considering both production and test code, is not always technically feasible, since the production code is not always available to test engineers and, when possible, code coverage can be expensive and not scalable on large systems, especially in a continuous integration context. Considering the production code also makes it impractical to build language model-based predictors for flaky test cases, given the token length limitation of language models in general, and CodeBERT in particular. Nevertheless, future research should assess the practicability of white-box, model-based flaky test prediction, and should investigate further code pre-processing methods to make the use of language models more applicable in practice.
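As a sketch of how statements related to additional flakiness causes could be retained alongside the eight test smells, the snippet below matches lines against illustrative keyword patterns for concurrency and randomness; the patterns are our own assumptions and not the heuristics used in Flakify:

    import re

    # Illustrative keyword patterns for two common flakiness causes.
    EXTRA_PATTERNS = {
        "concurrency": re.compile(r"\b(Thread|synchronized|CountDownLatch|ExecutorService|sleep)\b"),
        "randomness":  re.compile(r"\b(Random|Math\.random|UUID\.randomUUID|nextInt)\b"),
    }

    def retain_statements(test_source):
        # Keep only lines that match at least one pattern, mirroring the
        # smell-based pre-processing but for additional flakiness causes.
        kept = [line for line in test_source.splitlines()
                if any(p.search(line) for p in EXTRA_PATTERNS.values())]
        return "\n".join(kept)

As discussed above, any such extension must be weighed against the resulting token length of the pre-processed test cases.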
Practical Implications of Imperfect Prediction Results. Though Flakify surpassed the best state-of-the-art solution in predicting flaky test cases, both in terms of precision and recall, a precision of 70% is still not satisfactory, since misclassifying non-flaky test cases as flaky leads to additional, unnecessary cost, e.g., attempting to fix the test cases incorrectly predicted as flaky. Also, with a recall of 90%, we miss 10% of flaky test cases, leading to wasted debugging cost. If we assume that precision should be prioritized over recall, we can increase the former by restricting flaky test case predictions to those test cases with the highest prediction confidence, at the expense of a lower recall. For example, this can be achieved by adjusting the classification threshold for flaky test cases to 0.60 or 0.70, instead of the default threshold of 0.50. Nevertheless, given that the predicted probabilities generated by the neural network in Flakify are overconfident due to the use of the Softmax function in the last layer [62], i.e., probabilities are either close to 0.0 or 1.0, we were unable to perform such analysis. Therefore, future research should employ techniques for calibrating the predicted probabilities [63] and enable threshold adjustments when classifying flaky test cases.
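The threshold adjustment discussed above is straightforward once calibrated class probabilities are available; the probability values below are hypothetical and only illustrate why overconfident Softmax outputs defeat this kind of filtering:

    def predict_flaky(prob_flaky, threshold=0.5):
        # Raise the threshold (e.g., 0.6 or 0.7) to trade recall for precision.
        return prob_flaky >= threshold

    calibrated_probs = [0.93, 0.64, 0.55, 0.08]     # hypothetical, calibrated values
    print([predict_flaky(p, threshold=0.7) for p in calibrated_probs])

    overconfident_probs = [1.00, 1.00, 0.99, 0.00]  # what an uncalibrated Softmax tends to produce
    print([predict_flaky(p, threshold=0.7) for p in overconfident_probs])

With overconfident probabilities, almost no prediction falls between the default and the raised threshold, so the filter changes little, which is why calibration is suggested first.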
Deployment of a Flaky Test Case Predictor in Practice. Flakify can be deployed in Continuous Integration (CI) environments to help detect flaky test cases. One could argue that the CI build history can be used as a reference to conclude whether a test case is flaky or not. However, regular test case executions across builds may not entirely solve the problem, since differences in test case verdicts, i.e., pass or fail, can be due to differences in builds rather than flakiness. Therefore, test engineers can use the prediction results obtained from Flakify to fix test cases that are predicted as flaky, e.g., by eliminating the presence of test smells, or otherwise rerun them a larger number of times, using the same code version, to verify whether a test case is actually flaky or not. More specifically, Flakify helps test engineers focus their attention on a small subset of test cases that are most likely to be flaky in a CI build. As our results show, Flakify significantly reduces the cost of debugging test and production code, both in terms of human effort and execution time. This makes Flakify an important strategy in practice to achieve scalability, especially when applied to large test suites. Moreover, the test smell detection capability of Flakify helps to inform test engineers about possible causes of flakiness that need to be addressed.

5 THREATS TO VALIDITY
This section discusses the potential threats to the validity of our reported results.

5.1 Construct Validity
Construct threats to validity are concerned with the degree to which our analyses measure what we claim to analyze. In our study, to pre-process test cases, we used heuristics to retain code statements that match at least one of the eight test smells shown in Table 1. However, our heuristics might have missed some code statements having test smells, and this could have led to suboptimal results when applying our approach. To mitigate this issue, though our approach to identify test smells is entirely different, we relied on the same heuristics as those used by Alshammari et al. [7]. These heuristics assume commonly used coding conventions that might not be followed in all test suites. For example, we assumed that the test class name contains the production class name with the word 'Test'. However, such heuristics can easily be adapted to other coding conventions in practice. We also manually checked a random sample of test cases to verify that pre-processed code contains, as expected, only test smell-related code statements and does not dismiss any of them. We have made the tool we developed to detect test smells publicly available in our replication package [55].
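One possible encoding of the naming convention mentioned above is sketched below; it is our own simplified illustration, and real test suites may deviate from it:

    def matches_production_class(test_class, production_class):
        # Convention assumed by the heuristics: the test class name embeds the
        # production class name plus the word 'Test' (as prefix or suffix).
        return test_class in (production_class + "Test", "Test" + production_class) \
            or (production_class in test_class and "Test" in test_class)

    print(matches_production_class("HttpClientTest", "HttpClient"))   # True
    print(matches_production_class("TestHttpClient", "HttpClient"))   # True
    print(matches_production_class("ClientSpec", "HttpClient"))       # False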
5.2 Internal Validity
Internal threats to validity are concerned with the ability to draw conclusions from our experimental results. In our study, we used CodeBERT to perform a binary classification of test cases as Flaky or Non-Flaky. However, due to the token length limit of CodeBERT, the source code of some test cases was truncated, possibly leading to discarding relevant information about test flakiness. To mitigate this issue, we pre-processed the source code of test cases to retain only code statements related to test smells. Doing so not only reduced the token length of test cases, but also improved the prediction power of our approach. However, our pre-processing may not be perfect or complete, as it can lead to losing other relevant information. Future research should investigate whether retaining additional relevant information about flaky test cases leads to improving prediction results, e.g., statements related to common flakiness causes, such as synchronous or platform-dependent operations.

Moreover, our prediction results were compared with those of FlakeFlagger. But FlakeFlagger used white-box features, whereas our approach is black-box, and the comparison may not be entirely meaningful. To mitigate this issue, we also compared our results with a black-box version of FlakeFlagger in which we removed any features requiring access to production code. In both cases, our approach obtained significantly higher prediction results than FlakeFlagger. We did not compare our results with other black-box approaches, e.g., vocabulary-based [8], since they are project-specific and did not achieve good results on the FlakeFlagger dataset [7].

Finally, in our analysis, the cost of debugging the production or testing code assumes that test engineers address all test cases predicted as flaky. However, test engineers may choose to ignore a flaky test case, either by removing or skipping it, thus not introducing any cost. Yet, we believe that every flaky test case should be carefully addressed by test engineers, since ignoring test cases can lead to other kinds of costs, such as overlooked system faults.

5.3 External Validity
External threats are concerned with the ability to generalize our results. Our study is based on data collected by Alshammari et al. [7], which was obtained by rerunning test cases 10,000 times. Such data is of course not perfect, as some test cases that were not found to be flaky could have been if rerun more times. To mitigate this threat, we used the same dataset for comparing Flakify with the baseline approach, FlakeFlagger. We also filtered out test cases which, to our surprise, had no source code in the dataset. Further, the FlakeFlagger and IDoFT datasets contain test cases from projects that are exclusively written in Java, which might affect the generalizability of our results. To mitigate this issue, we used CodeBERT, which was trained on six programming languages. Hence, we believe our approach would be applicable to projects written in other programming languages as well, given an appropriate tool to identify test smells.

Moreover, CodeBERT was pre-trained on production source code only, i.e., source code related to test suites was not part of pre-training, making it unable to recognize test-specific structure and vocabulary, e.g., assertions. This can potentially increase token length, since test-specific key terms are decomposed into multiple tokens instead of one. For example, CodeBERT converts assertEquals into three tokens: assert, ##equal, and ##s, rather than just one token. Our pre-processing of the source code of test cases helped to mitigate the issue of token length; yet, future work should aim at pre-training CodeBERT on test code in addition to production code.
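This decomposition can be observed directly with the public microsoft/codebert-base tokenizer; the exact subword pieces printed may differ from the WordPiece-style rendering used above, but the key term is still split into several pieces rather than one token:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    print(tokenizer.tokenize("assertEquals"))  # multiple subword pieces, not a single token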
FlakeFlagger. We also filtered out test cases which, to our improvement. However, some of the significant features
surprise, had no source code in the dataset. Further, the Fla- required access to production files which, as discussed
keFlagger and IDoFT datasets contain test cases from proj- above, are not always accessible by test engineers or may
ects that are exclusively written in Java, which might affect not be computable in a scalable way in many practical con-
the generalizability of our results. To mitigate this issue, we texts. Further, when only black-box features (see Table 2)
used CodeBERT, which was trained on six programming were used, the F1-score decreased by 35 pp. In contrast, our
languages. Hence, we believe our approach would be appli- approach achieved more accurate prediction results, with
cable to projects written in other programming languages as an F1-score of 0.79, while using test code only, thus offering
well, given an appropriate tool to identify test smells. a favorable black-box alternative.
Moreover, CodeBERT was pre-trained on production In addition, Pontillo et al. [11] proposed an approach to
source code only, i.e., source code related to test suites was identify the most important factors associated with flaky
not part of pre-training, making it unable to recognize test- test cases using the iDFlakies dataset [5]. They used logistic
specific structure and vocabulary, e.g., assertions. This can regression to model flaky test cases using features that were
potentially increase token length, since test-specific key statically computed using production code, e.g., code cover-
terms are decomposed into multiple tokens instead of one. age, and test code, e.g., test smells. They found that code
For example, CodeBERT converts assertEquals into complexity (both production and test code), assertions, and
three tokens: assert, ##equal, and ##s, rather than just test smells are associated with test flakiness.
one token. Our pre-processing of the source code of test Another approach was proposed by Pinto et al. [8] in
cases helped to mitigate the issue of token length; yet, future which Java keywords were extracted from test code and
work should aim at pre-training CodeBERT on test code in employed as vocabulary features to predict test flakiness.
addition to production code. Further, their study relied on the dataset of DeFlaker [4], in
Finally, the IDoFT dataset has shown that a significant which test cases were re-run less than 100 times to establish
number of test cases are flaky due to reasons unrelated to the ground truth. Despite high accuracy results (F1-score =
the test code. In situations where this is common, this is 0.95) on their dataset, their approach achieved much worse
obviously a limitation of any black-box approach like results (F1-score = 0.19) when using the dataset provided by
Flakify relying exclusively on test code. In our evaluation, Alshammari et al. [7]. In addition, their models were lan-
we did not consider such flaky test cases, but rather those guage- and project-specific, since most of the significant fea-
whose causes of flakiness were in the test code, which were tures for predicting flaky test cases were related to Java
confirmed and manually fixed by developers, and thus keywords, e.g., throws, or specific variable names, e.g., id. In
In contrast, while our approach relies exclusively on test code, it builds a generic model to predict flakiness, based on features that are neither language- nor project-dependent, and achieved much better prediction results when using the FlakeFlagger dataset used by Alshammari et al. [7].

Moreover, Haben et al. [15] and Camara et al. [64] replicated the study by Pinto et al. using other datasets containing projects written in other programming languages, e.g., Python. They found that vocabulary-based approaches are not generalizable, especially when performing inter-project flaky test case predictions, since new vocabulary is needed for any new project or programming language. Haben et al. also showed that combining the vocabulary-based features with code coverage features does not significantly improve the prediction accuracy of such an approach.

In summary, unlike the ML-based approaches above, our approach is generic, black-box, and language model-based, thus not requiring access to production code or pre-definition of features. Instead, our approach relies solely on test code to predict whether a test case is flaky or not.

6.2 Flaky Test Case Prediction Using Test Smells
Camara et al. [9] proposed an approach for predicting test flakiness using test smells as prediction features. These features require access to the production code and can be extracted using tsDetect [46], a tool for detecting test smells, that was applied to the DeFlaker dataset [4]. Their study yielded a relatively high prediction accuracy (F1-score = 0.83). Alshammari et al. [7] also relied on test smells as part of their features for predicting flaky test cases. However, the information gain of test smell features tended to be much lower than code coverage features, suggesting they are less significant flaky test case predictors. In Flakify, we also relied on the test smells used by Alshammari et al. [7]. However, they were not used as features but to exclusively retain relevant test code statements for fine-tuning our CodeBERT model. Doing so improved the accuracy of Flakify, thus reducing the cost of rerunning or debugging test cases.

6.3 Flaky Test Detection at Run Time
Memon et al. [65] used a simple dynamic pattern matching approach to detect flaky test cases at GOOGLE by simply searching for certain textual patterns in test execution logs, e.g., pass-fail-pass, to identify whether a test case is flaky or not. The accuracy of detecting flaky test cases using this approach was 90%. Similarly, Kowalczyk et al. [66] detected flaky test cases at APPLE by analyzing the behavior of test cases using two scores: Flip rate, which measures the rate at which a test case alternates between pass and fail, and Entropy, which quantifies the uncertainty of a test case. An aggregated value of these two scores was used to generate flakiness ranks for test cases, which were then used to represent test flakiness, distributed across the test cases in different services at APPLE. This technique marked 44% of test failures as flaky with less than 1% loss in fault detection.
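To clarify the two scores just described, the sketch below computes a flip rate and a Shannon entropy from a pass/fail history; the definitions are approximated from the description above, and Kowalczyk et al.'s exact formulation may differ:

    from math import log2

    def flip_rate(verdicts):
        # Fraction of consecutive runs whose verdict changed (pass <-> fail).
        flips = sum(1 for a, b in zip(verdicts, verdicts[1:]) if a != b)
        return flips / max(len(verdicts) - 1, 1)

    def entropy(verdicts):
        # Shannon entropy (in bits) of the pass/fail distribution.
        p = sum(verdicts) / len(verdicts)
        return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

    history = [1, 1, 0, 1, 1, 1, 0, 1]  # 1 = pass, 0 = fail (hypothetical run history)
    print(flip_rate(history), entropy(history))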
The above approaches require test cases to be executed many times to determine whether they are flaky, which is often not practical for large industrial projects. Unlike these approaches, Flakify is able to predict flaky test cases without executing them, relying exclusively on test code.

Bell et al. [4] proposed DeFlaker, a tool for detecting flaky test cases using coverage information about code changes. In particular, a test case is labeled as flaky if it fails and does not cover any changed code. Out of 4,846 test failures, DeFlaker was able to label 39 pp of them as flaky, with a 95.5% recall and a false positive rate of 1.5%, outperforming the default way of detecting flaky test cases, i.e., by rerunning test cases using Maven [67]. Different from DeFlaker, Lam et al. [5] proposed iDFlakies, which detects test flakiness by re-running test cases in random orders. This framework was used to construct a dataset containing 422 flaky test cases, with almost half of them being order-dependent.

The above approaches either depend on rerunning test cases multiple times, execution history (not available for new test cases), or production code, e.g., coverage information. In contrast, Flakify does not require repeated executions of test cases or any information about the production code, including code coverage.

7 CONCLUSION
In this paper, we proposed Flakify, a black-box solution for predicting flaky test cases using only the source code of test cases, as opposed to the system under test. Further, it does not require to rerun test cases multiple times and does not entail the definition of features for ML prediction.

We used CodeBERT, a pre-trained language model, and fine-tuned it to classify test cases as flaky or not based exclusively on test source code. We evaluated our work on two distinct datasets, namely the FlakeFlagger and IDoFT datasets, using two different evaluation procedures: (1) cross-validation and (2) per-project validation, i.e., prediction on new projects. In addition, we pre-processed this source code by retaining only code statements that match eight test smells, which are expected to be associated with test flakiness. This aimed at addressing a limitation of CodeBERT (and related language models), which can only process 512 tokens per test case. We evaluated our approach in comparison with both white-box and black-box versions of FlakeFlagger, the best state-of-the-art, ML-based flaky test case predictor. The main results of our study are summarized as follows:

- Flakify achieves promising results on two different datasets (FlakeFlagger and IDoFT) and under two different evaluation procedures, one assuming Flakify predicts test cases from a new project and the other one simply relying on cross-validation.
- When predicting test cases in new projects, the accuracy of Flakify is slightly lower but still close to cross-validation results.
- With cross-validation, Flakify reduces by 10 pp and 18 pp the cost bound to be wasted by the original, white-box version of FlakeFlagger due to unnecessarily debugging test cases and production code, respectively.
- Similar to cross-validation results, Flakify also significantly outperforms FlakeFlagger when predicting flaky test cases in new projects, for which the model was not trained.
- A black-box version of FlakeFlagger is not a viable option to predict flaky test cases as it is too inaccurate.
- When retaining only code statements related to test smells, Flakify predicted flaky test cases with 5 pp and 6 pp higher F1-score on the FlakeFlagger and IDoFT datasets, respectively.

Overall, existing public datasets [4], [5], [7], [15] are not fully adequate to appropriately evaluate flaky test case prediction approaches, since the ratio of flaky test cases tends to be very low. In addition, flaky test cases in these datasets were detected by rerunning test cases numerous times while monitoring their behavior across executions, a technique that may be inaccurate. Further, many open source projects nowadays adopt Continuous Integration (CI), which provides extensive test execution histories. Given the frequency of test executions in CI and the high workload on CI servers, test cases might expose further flakiness behaviors due to causes that may not be revealed when running test cases on machines dedicated to test execution [68], [69]. Therefore, we plan in the future to build a larger dataset of flaky test cases in a CI context.

Last, a significant proportion of flaky tests can be due to problems in the production code and cannot be addressed by black-box models. Therefore, in the future, we need to devise light-weight and scalable approaches to address such causes of flakiness.

ACKNOWLEDGMENTS
The experiments conducted in this work were enabled in part by WestGrid (https://www.westgrid.ca) and Compute Canada (https://www.computecanada.ca). Moreover, we are grateful to the authors of FlakeFlagger and the maintainers of the IDoFT dataset, who have responded to our multiple inquiries for clarifications about the datasets.

REFERENCES
[1] B. Zolfaghari, R. M. Parizi, G. Srivastava, and Y. Hailemariam, "Root causing, detecting, and fixing flaky tests: State of the art and future roadmap," Softw.: Pract. Exp., vol. 51, no. 5, pp. 851-867, 2021.
[2] Q. Luo, F. Hariri, L. Eloussi, and D. Marinov, "An empirical analysis of flaky tests," in Proc. 22nd ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2014, pp. 643-653.
[3] M. Eck, F. Palomba, M. Castelluccio, and A. Bacchelli, "Understanding flaky tests: The developer's perspective," in Proc. 27th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2019, pp. 830-840.
[4] J. Bell, O. Legunsen, M. Hilton, L. Eloussi, T. Yung, and D. Marinov, "DeFlaker: Automatically detecting flaky tests," in Proc. IEEE/ACM 40th Int. Conf. Softw. Eng., 2018, pp. 433-444.
[5] W. Lam, R. Oei, A. Shi, D. Marinov, and T. Xie, "iDFlakies: A framework for detecting and partially classifying flaky tests," in Proc. 12th IEEE Conf. Softw. Testing, Validation Verification, 2019, pp. 312-322.
[6] J. Micco, "Advances in continuous integration testing at Google," 2018. [Online]. Available: https://research.google/pubs/pub46593
[7] A. Alshammari, C. Morris, M. Hilton, and J. Bell, "FlakeFlagger: Predicting flakiness without rerunning tests," in Proc. IEEE/ACM 43rd Int. Conf. Softw. Eng., 2021, pp. 1572-1584.
[8] G. Pinto, B. Miranda, S. Dissanayake, M. d'Amorim, C. Treude, and A. Bertolino, "What is the vocabulary of flaky tests?," in Proc. 17th Int. Conf. Mining Softw. Repositories, 2020, pp. 492-502.
[9] B. Camara, M. Silva, A. Endo, and S. Vergilio, "On the use of test smells for prediction of flaky tests," in Proc. Braz. Symp. Systematic Autom. Softw. Testing, 2021, pp. 46-54.
[10] Z. Feng et al., "CodeBERT: A pre-trained model for programming and natural languages," in Proc. Findings Assoc. Comput. Linguistics: Empir. Methods Natural Lang. Process., 2020, pp. 1536-1547.
[11] V. Pontillo, F. Palomba, and F. Ferrucci, "Toward static test flakiness prediction: A feasibility study," in Proc. 5th Int. Workshop Mach. Learn. Techn. Softw. Qual. Evol., 2021, pp. 19-24.
[12] C. Ziftci and D. Cavalcanti, "De-flake your tests: Automatically locating root causes of flaky tests in code at Google," in Proc. IEEE Int. Conf. Softw. Maintenance Evol., 2020, pp. 736-745.
[13] W. Lam, P. Godefroid, S. Nath, A. Santhiar, and S. Thummalapenta, "Root causing flaky tests in a large-scale industrial setting," in Proc. 28th ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2019, pp. 101-111.
[14] T. Bach, A. Andrzejak, and R. Pannemans, "Coverage-based reduction of test execution time: Lessons from a very large industrial project," in Proc. IEEE Int. Conf. Softw. Testing, Verification Validation Workshops, 2017, pp. 3-12.
[15] G. Haben, S. Habchi, M. Papadakis, M. Cordy, and Y. L. Traon, "A replication study on the usability of code vocabulary in predicting flaky tests," in Proc. IEEE/ACM 18th Int. Conf. Mining Softw. Repositories, 2021, pp. 219-229.
[16] O. Parry, G. M. Kapfhammer, M. Hilton, and P. McMinn, "A survey of flaky tests," ACM Trans. Softw. Eng. Methodol., vol. 31, no. 1, pp. 1-74, 2021.
[17] O. Parry, G. M. Kapfhammer, M. Hilton, and P. McMinn, "Surveying the developer experience of flaky tests," in Proc. IEEE/ACM Int. Conf. Softw. Eng.: Softw. Eng. Pract., 2022, pp. 253-262.
[18] A. Van Deursen, L. Moonen, A. Van Den Bergh, and G. Kok, "Refactoring test code," in Proc. 2nd Int. Conf. Extreme Program. Flexible Processes Softw. Eng., 2001, pp. 92-95.
[19] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2019, pp. 4171-4186.
[20] M. E. Peters et al., "Deep contextualized word representations," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2018, pp. 2227-2237.
[21] Z. Yang et al., "XLNet: Generalized autoregressive pretraining for language understanding," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 5753-5763.
[22] Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692.
[23] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, "VideoBERT: A joint model for video and language representation learning," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 7464-7473.
[24] D. Nadeau and S. Sekine, "A survey of named entity recognition and classification," Lingvisticae Investigationes, vol. 30, no. 1, pp. 3-26, 2007.
[25] N. Bach and S. Badaskar, "A review of relation extraction," Literature Rev. Lang. Statist., vol. II, no. 2, pp. 1-15, 2007.
[26] H. Xu, B. Liu, L. Shu, and P. S. Yu, "BERT post-training for review reading comprehension and aspect-based sentiment analysis," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2019, pp. 2324-2335.
[27] C. Sun, X. Qiu, Y. Xu, and X. Huang, "How to fine-tune BERT for text classification?," in Proc. China Nat. Conf. Chin. Comput. Linguistics, 2019, pp. 194-206.
[28] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998-6008.
[29] D. Mandic and J. Chambers, Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. Hoboken, NJ, USA: Wiley, 2001.
[30] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735-1780, 1997.
[31] J. Keim, A. Kaplan, A. Koziolek, and M. Mirakhorli, "Does BERT understand code? An exploratory study on the detection of architectural tactics in code," in Proc. Eur. Conf. Softw. Archit., 2020, pp. 220-228.
[32] D. Guo et al., "GraphCodeBERT: Pre-training code representations with data flow," in Proc. 9th Int. Conf. Learn. Representations, 2021, pp. 1-18.
[33] A. Kanade, P. Maniatis, G. Balakrishnan, and K. Shi, "Learning and evaluating contextual embedding of source code," in Proc. Int. Conf. Mach. Learn., 2020, pp. 5110-5121.
[34] X. Jiang, Z. Zheng, C. Lyu, L. Li, and L. Lyu, "TreeBERT: A tree-based pre-trained model for programming language," Proc. 37th Conf. Uncertainty Artif. Intell., Mach. Learn. Res., vol. 161, pp. 54-63, 2021.
[35] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, "CodeSearchNet challenge: Evaluating the state of semantic code search," 2019, arXiv:1909.09436.
[36] Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," 2016, arXiv:1609.08144.
[37] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, "ELECTRA: Pre-training text encoders as discriminators rather than generators," in Proc. 8th Int. Conf. Learn. Representations, 2020, pp. 1-18.
[38] C. Pan, M. Lu, and B. Xu, "An empirical study on software defect prediction using CodeBERT model," Appl. Sci., vol. 11, no. 11, 2021, Art. no. 4793.
[39] J. Wu, "Literature review on vulnerability detection using NLP technology," 2021, arXiv:2104.11230.
[40] J. Howard and S. Ruder, "Universal language model fine-tuning for text classification," in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 328-339.
[41] A. F. A., "Deep learning using rectified linear units (ReLU)," 2018, arXiv:1803.08375.
[42] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929-1958, 2014.
[43] S. El Anigri, M. M. Himmi, and A. Mahmoudi, "How BERT's dropout fine-tuning affects text classification?," in Proc. Int. Conf. Bus. Intell., 2021, pp. 130-139.
[44] Z. Yao, A. Gholami, S. Shen, M. Mustafa, K. Keutzer, and M. Mahoney, "ADAHESSIAN: An adaptive second order optimizer for machine learning," Proc. AAAI Conf. Artif. Intell., vol. 35, no. 12, pp. 10665-10673, 2021.
[45] W. Aljedaani et al., "Test smell detection tools: A systematic mapping study," in Proc. Eval. Assessment Softw. Eng., 2021, pp. 170-180.
[46] A. Peruma, K. Almalki, C. D. Newman, M. W. Mkaouer, A. Ouni, and F. Palomba, "TsDetect: An open source test smells detection tool," in Proc. 28th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2020, pp. 1650-1654.
[47] T. Virgínio et al., "JNose: Java test smell detector," in Proc. 34th Braz. Symp. Softw. Eng., 2020, pp. 564-569.
[48] R. E. Noonan, "An algorithm for generating abstract syntax trees," Comput. Lang., vol. 10, no. 3/4, pp. 225-236, 1985.
[49] A. Panichella, S. Panichella, G. Fraser, A. A. Sawant, and V. J. Hellendoorn, "Revisiting test smells in automatically generated tests: Limitations, pitfalls, and opportunities," in Proc. IEEE Int. Conf. Softw. Maintenance Evol., 2020, pp. 523-533.
[50] A. Wei, P. Yi, T. Xie, D. Marinov, and W. Lam, "Probabilistic and systematic coverage of consecutive test-method pairs for detecting order-dependent flaky tests," in Proc. Int. Conf. Tools Algorithms Construction Anal. Syst., 2021, pp. 270-287.
[51] W. Lam, S. Winter, A. Wei, T. Xie, D. Marinov, and J. Bell, "A large-scale longitudinal study of flaky tests," Proc. ACM Program. Lang., vol. 4, no. OOPSLA, pp. 1-29, 2020.
[52] W. Lam, S. Winter, A. Astorga, V. Stodden, and D. Marinov, "Understanding reproducibility and characteristics of flaky tests through test reruns in java projects," in Proc. IEEE 31st Int. Symp. Softw. Rel. Eng., 2020, pp. 403-413.
[53] W. Lam, A. Shi, R. Oei, S. Zhang, M. D. Ernst, and T. Xie, "Dependent-test-aware regression testing techniques," in Proc. 29th ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2020, pp. 298-311.
[54] A. Shi, W. Lam, R. Oei, T. Xie, and D. Marinov, "iFixFlakies: A framework for automatically fixing order-dependent flaky tests," in Proc. 27th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2019, pp. 545-555.
[55] Flakify: A Black-Box, Language Model-based Predictor for Flaky Tests - Replication Package, 2022. [Online]. Available: https://doi.org/10.5281/zenodo.6994692
[56] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321-357, 2002.
[57] P. Branco, L. Torgo, and R. P. Ribeiro, "A survey of predictive modeling on imbalanced domains," ACM Comput. Surv., vol. 49, no. 2, pp. 1-50, 2016.
[58] C. Goutte and E. Gaussier, "A probabilistic interpretation of precision, recall and f-score, with implication for evaluation," in Proc. Eur. Conf. Inf. Retrieval, 2005, pp. 345-359.
[59] M. Raymond and F. Rousset, "An exact test for population differentiation," Evolution, vol. 49, pp. 1280-1283, 1995.
[60] J. Micco, "Flaky tests at Google and how we mitigate them," 2016. [Online]. Available: https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html
[61] A. Memon et al., "Taming Google-scale continuous testing," in Proc. IEEE/ACM 39th Int. Conf. Softw. Eng.: Softw. Eng. Pract. Track, 2017, pp. 233-242.
[62] G. Melotti, C. Premebida, J. J. Bird, D. R. Faria, and N. Gonçalves, "Probabilistic object classification using CNN ML-MAP layers," 2020, arXiv:2005.14565.
[63] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On calibration of modern neural networks," in Proc. Int. Conf. Mach. Learn., 2017, pp. 1321-1330.
[64] B. H. P. Camara, M. A. G. Silva, A. T. Endo, and S. R. Vergilio, "What is the vocabulary of flaky tests? An extended replication," in Proc. IEEE/ACM 29th Int. Conf. Prog. Comprehension, 2021, pp. 444-454.
[65] A. Memon and J. Micco, "How flaky tests in continuous integration," 2016. [Online]. Available: https://www.youtube.com/watch?v=CrzpkF1-VsA
[66] E. Kowalczyk, K. Nair, Z. Gao, L. Silberstein, T. Long, and A. Memon, "Modeling and ranking flaky tests at Apple," in Proc. IEEE/ACM 42nd Int. Conf. Softw. Eng.: Softw. Eng. Pract., 2020, pp. 110-119.
[67] Identifying and analyzing flaky tests in maven and gradle builds, 2019. Accessed: Jan. 11, 2021. [Online]. Available: https://gradle.com/blog/flaky-tests
[68] T. A. Ghaleb, D. A. da Costa, Y. Zou, and A. E. Hassan, "Studying the impact of noises in build breakage data," IEEE Trans. Softw. Eng., vol. 47, no. 09, pp. 1998-2011, Sep. 2021.
[69] J. Lampel, S. Just, S. Apel, and A. Zeller, "When life gives you oranges: Detecting and diagnosing intermittent job failures at mozilla," in Proc. 29th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2021, pp. 1381-1392.

Sakina Fatima received the Erasmus Mundus Joint master's degree in dependable software systems from the University of St Andrews, U.K., and Maynooth University, Ireland. She is currently working toward the PhD degree with the School of EECS, University of Ottawa and a member of Nanda Lab. In 2019, she was awarded the French Government Medal and the National University of Ireland prize for distinction in collaborative degrees. Her research interests include automated software testing, natural language processing and applied machine learning.

Taher A. Ghaleb received the BSc degree in information technology from Taiz University, Yemen, in 2008, the MSc degree in computer science from the King Fahd University of Petroleum and Minerals, Saudi Arabia, in 2016, and the PhD degree in computing from Queen's University, Canada, in 2021. He is a postdoctoral research fellow with the School of EECS, University of Ottawa, Canada. During his PhD, he held an Ontario Trillium Scholarship, a highly prestigious award for doctoral students. He worked as a research/teaching assistant. His research interests include continuous integration, software testing, mining software repositories, applied data science and machine learning, program analysis, and empirical software engineering.

Lionel Briand (Fellow, IEEE) is professor of software engineering and has shared appointments between (1) School of Electrical Engineering and Computer Science, University of Ottawa, Canada and (2) The SnT centre for Security, Reliability, and Trust, University of Luxembourg. He is the head of the SVV Department, SnT Centre and a Canada research chair in Intelligent Software Dependability and Compliance (Tier 1). He received an ERC Advanced Grant, the most prestigious European individual research award, and has conducted applied research in collaboration with industry for more than 25 years, including projects in the automotive, aerospace, manufacturing, financial, and energy domains. He was elevated to the grades of ACM fellow, granted the IEEE Computer Society Harlan Mills Award (2012), the IEEE Reliability Society Engineer-of-the-year Award (2013), and the ACM SIGSOFT Outstanding Research Award (2022) for his work on software verification and testing. His research interests include testing and verification, search-based software engineering, model-driven development, requirements engineering, and empirical software engineering.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.