Flakify A Black-Box Language Model-Based Predictor For Flaky Tests
Flakify A Black-Box Language Model-Based Predictor For Flaky Tests
4, APRIL 2023
Abstract—Software testing assures that code changes do not adversely affect existing functionality. However, a test case can be flaky,
i.e., passing and failing across executions, even for the same version of the source code. Flaky test cases introduce overhead to software
development as they can lead to unnecessary attempts to debug production or testing code. Besides rerunning test cases multiple times,
which is time-consuming and computationally expensive, flaky test cases can be predicted using machine learning (ML) models, thus
reducing the wasted cost of re-running and debugging these test cases. However, the state-of-the-art ML-based flaky test case predictors
rely on pre-defined sets of features that are either project-specific, i.e., inapplicable to other projects, or require access to production
code, which is not always available to software test engineers. Moreover, given the non-deterministic behavior of flaky test cases, it can be
challenging to determine a complete set of features that could potentially be associated with test flakiness. Therefore, in this article, we
propose Flakify, a black-box, language model-based predictor for flaky test cases. Flakify relies exclusively on the source code of test
cases, thus not requiring to (a) access to production code (black-box), (b) rerun test cases, (c) pre-define features. To this end, we
employed CodeBERT, a pre-trained language model, and fine-tuned it to predict flaky test cases using the source code of test cases. We
evaluated Flakify on two publicly available datasets (FlakeFlagger and IDoFT) for flaky test cases and compared our technique with the
FlakeFlagger approach, the best state-of-the-art ML-based, white-box predictor for flaky test cases, using two different evaluation
procedures: (1) cross-validation and (2) per-project validation, i.e., prediction on new projects. Flakify achieved F1-scores of 79% and
73% on the FlakeFlagger dataset using cross-validation and per-project validation, respectively. Similarly, Flakify achieved F1-scores of
98% and 89% on the IDoFT dataset using the two validation procedures, respectively. Further, Flakify surpassed FlakeFlagger by 10 and
18 percentage points (pp) in terms of precision and recall, respectively, when evaluated on the FlakeFlagger dataset, thus reducing the
cost bound to be wasted on unnecessarily debugging test cases and production code by the same percentages (corresponding to
reduction rates of 25% and 64%). Flakify also achieved significantly higher prediction results when used to predict test cases on new
projects, suggesting better generalizability over FlakeFlagger. Our results further show that a black-box version of FlakeFlagger is not a
viable option for predicting flaky test cases.
Index Terms—Flaky tests, software testing, black-box testing, natural language processing, CodeBERT
1 INTRODUCTION testing code looking for a bug that might not really exist, or
(b) rerun a failed test case multiple times to check if it would
testing is an essential activity to assure software
S OFTWARE
dependability. When a test case fails, it usually indicates
that recent code changes were incorrect. However, it has
eventually pass, thus suggesting that the failure is not due to
recent code changes but to the test case itself.
Previous research has investigated the common reasons
been observed, in many environments, that test cases can be
behind test flakiness, such as concurrency, resource leakage,
non-deterministic, passing and failing across executions,
and test smells. The conventional approach to detect flaky test
even for the same version of the source code. These test cases
cases is to rerun them numerous times [4], [5], which is in
are referred to as flaky test cases [1], [2], [3]. Flaky test cases
most practical cases computationally expensive [6] or even
can introduce overhead to software development, since they
impossible. To address this issue, recent studies have pro-
require developers to either (a) debug the production or
posed approaches using machine learning (ML) models to
predict flaky test cases without rerunning them [7], [8], [9],
Sakina Fatima and Taher A. Ghaleb are with the School of EECS, Univer- thus proposing a much more scalable and practical solution.
sity of Ottawa, Ottawa, ON K1N 6N5, Canada. Despite significant progress, these approaches (a) rely on pro-
E-mail: {sfati077, tghaleb}@uottawa.ca. duction code, which is not always accessible by software test
Lionel Briand is with the School of EECS, University of Ottawa, Ottawa,
ON K1N 6N5, Canada, and also with the SnT Centre for Security, Reliabil- engineers or a scalable solution, or (b) employ project-specific
ity and Trust, University of Luxembourg, 4365 Esch-sur-Alzette, Luxembourg. features as flaky test case predictors, which makes them inap-
E-mail: [email protected]. plicable to other projects. Moreover, these approaches rely on
Manuscript received 23 December 2021; revised 19 August 2022; accepted 20 a limited set of pre-defined features, extracted from the source
August 2022. Date of publication 24 August 2022; date of current version 18 code of test cases and the system under test. However, when
April 2023.
This work was supported in part by research grant from Huawei Technologies
evaluated on realistic datasets, these approaches yield a rela-
Canada, Mitacs Canada, and in part by the Canada Research Chair and Dis- tively low accuracy (F1-scores in the range 19%-66%), thus
covery Grant programs of the Natural Sciences and Engineering Research suggesting they may not capture enough information about
Council of Canada (NSERC). test flakiness. Finding additional features that could poten-
(Corresponding author: Taher A. Ghaleb.)
Recommended for acceptance by A. Zaidman. tially be associated with flaky test cases, preferably based on
Digital Object Identifier no. 10.1109/TSE.2022.3201209 test code only (black-box), is therefore a research challenge.
0098-5589 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tps://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on February 12,2024 at 09:20:51 UTC from IEEE Xplore. Restrictions apply.
FATIMA ET AL.: FLAKIFY: A BLACK-BOX, LANGUAGE MODEL-BASED PREDICTOR FOR FLAKY TESTS 1913
In this paper, we propose Flakify (Flaky Test Classify), a recall, and F1-score of Flakify by 5 pp and 6 pp on the
generic language model-based solution for predicting flaky FlakeFlagger and IDoFT datasets, respectively. The
test cases. Flakify is black-box as it relies exclusively on the goal was to address a limitation of CodeBERT (and all
source code of test cases (test methods), thus not requiring other language models), which leads to only consid-
access to the production code of the system under test. This ering the first 512 tokens in the test source code. This
is important as production code is not always (entirely) result also confirms the previously reported associa-
accessible to test engineers due, for example, to outsourcing tion of test smells with flaky test cases [7], [9], [11].
software testing to a third-party. Further, analyzing produc- Overall, this paper makes the following contributions.
tion code may raise many scalability and practicality issues,
especially when applied to large industrial systems using A generic, black-box, language model-based flaky
multiple programming languages. In addition, Flakify does test case predictor, which does not require rerunning
not require the definition of features—which are necessarily test cases.
incomplete—to be used as predictors for flaky test cases. An ML-based classifier that predicts flaky test cases
Instead, we used CodeBERT [10], a pre-trained language on the basis of test code without requiring the defini-
model, and fine-tuned it to classify test cases as flaky or not tion of features.
based on their source code. To improve Flakify, we further An Abstract Syntax Tree (AST)-based technique for
pre-processed test code to remove potentially irrelevant statically detecting and only retaining statements
information. We evaluated Flakify on two different datasets: that match eight test smells in the test code, thus
the FlakeFlagger dataset, containing 21,661 test cases col- enhancing the application of language models.
lected from 23 Java projects, and the IDoFT dataset, contain- The rest of this paper is organized as follows. Section 2
ing 3,862 test cases collected from 312 Java projects. To do provides background about flaky test cases and language
this, we used two different evaluation procedures: (1) cross- models. Section 3 presents our black-box approach for pre-
validation and (2) per-project validation, i.e., prediction on dicting flaky test cases. Section 4 evaluates our approach,
new projects. Our results were compared to FlakeFlagger [7], reports experimental results, and discusses the implications
the best state-of-the-art ML-based predictor for flaky test of our research.. Section 5 discusses the validity threats to
cases. Specifically, our evaluation addresses the following our results. Section 6 reviews and contrasts related work.
research questions. Finally, Section 7 concludes the paper and suggests future
work.
RQ1: How accurately can Flakify predict flaky test cases?
Flakify achieved promising prediction results 2 BACKGROUND
when evaluated using two different datasets. In par-
In this section, we describe flaky test cases, their root causes,
ticular, based on cross-validation, Flakify achieved a
their practical impact, and the strategies to detect them. In
precision of 70%, a recall of 90%, and an F1-score of
addition, we describe pre-trained language models and
79% on the FlakeFlagger dataset, and a precision of
how they can potentially contribute to predicting flaky test
99%, a recall of 96%, and an F1-score of 98% on the
cases.
IDoFT dataset. Flakify yielded slightly worse results
when predicting flaky tests on new projects, with a
precision of 72%, a recall of 85%, and an F1-score of 2.1 Flaky Test Cases
73% on the FlakeFlagger dataset, and a precision of In software testing, a flaky test refers to test cases that inter-
91%, a recall of 88%, and an F1-score of 89% on the mittently fail and pass across executions, even for the same
IDoFT dataset. version of the source code, i.e., non-deterministically behav-
RQ2: How does Flakify compare to the state-of-the-art ing test cases [1]. Flaky test cases lead to many problems
predictors for flaky test cases? during software testing, by producing unreliable results
The best performing model of Flakify achieved a and wasting time and computational resources. A flaky test
significantly higher precision (70% versus 60%) and can also fail for different reasons across executions, making
recall (90% versus 72%) on the FlakeFlagger dataset it difficult to identify which failures are actually related to
in predicting flaky test cases than FlakeFlagger, the faults in the system under test.
best state-of-the-art, white-box approach for predict- Flaky test cases have been reported to be a significant
ing flaky test cases. Hence, with Flakify, the cost of problem in practice at many companies including Google,
debugging test cases and production code is reduced Huawei, Microsoft, SAP, Spotify, Mozilla, and Facebook [12],
by 10 and 18 percentage points (pp) (a reduction rate [13], [14], [15]. As reported by Google, almost 16% of their 4.2
of 25% and 64%), respectively, when compared to million test cases are flaky [6]. Microsoft has also reported
FlakeFlagger. Moreover, our results show that a that 26% of 3.8 k build failures were due to flaky test cases.
black-box version of FlakeFlagger is not a viable Many studies have been conducted to study flaky test cases,
option for predicting flaky test cases. Specifically, their causes, and the solutions to address them [1], [2], [4],
FlakeFlagger became 39 pp less precise with 20 pp [7], [8], [9], [11], [16]. Prominent causes of flaky test cases
less recall when only black-box features were used include asynchronous waits, test order dependency, concur-
as predictors for flaky test cases. rency, resource leakage, and incorrect test inputs or outputs.
RQ3: How does test case pre-processing improve Flakify? In addition, flaky test cases were found to be associated with
Retaining only code statements that are related to other factors, such as test smells, which are further discussed
a selected set of test smells improved the precision, below.
Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on February 12,2024 at 09:20:51 UTC from IEEE Xplore. Restrictions apply.
1914 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 49, NO. 4, APRIL 2023
TABLE 1
Test Smells Used by FlakeFlagger [7]
such as the file system (via java.io.File), data- Many solutions have been proposed to predict
base system (via java.sql, javax.sql, or flaky test cases. In this RQ, we compare the perfor-
javax.persistence), or network (via java.net mance of our best performing model of Flakify (with
or javax.net). Such external resources can intro- test case pre-processing) to two versions (white-box
duce stability and performance issues during test and black-box) of FlakeFlagger, the best flaky test
case execution [11]. Any statement that is found to case predictor to date.
use methods that belong to one of these classes or RQ2.1: How accurate is Flakify for flaky test case pre-
packages is retained. diction compared to the best white-box ML-based solution?
Assertion Roulette: We check whether a statement per- White-box prediction of flaky test cases requires
forms one of the following assertion mechanisms, access to production code, which is not (easily) acces-
including assertArrayEquals, assertEquals, sible by software test engineers in many contexts. We
assertFalse, assertNotNull, assertNot- assess whether Flakify achieves results that are at
Same, assertNull, assertSame, assertThat, least comparable to the best white-box flaky test case
assertTrue, and fail. If so, the statement is predictor. Specifically, we compare the accuracy of
retained. Multiple assert statements in a test method the best performing model of Flakify with FlakeFlag-
makes it difficult to identify the cause of the failure if ger [7], the best white-box solution currently avail-
just one of the asserts fails [9]. able, on the dataset used by FlakeFlagger. Our
Resource Optimism: We check whether a statement motivation is to determine whether black-box solu-
accesses the file system (java.io.File) without tions, based on CodeBERT, can compete with the
checking if the path (for either a file or directory) state-of-the-art, white-box ones. We compare the
exists. Doing so makes optimistic assumptions about results of Flakify and FlakeFlagger on the dataset on
the availability of resources, thus causing non-deter- which FlakeFlagger was evaluated, hereafter referred
ministic behavior of the test case [46]. We check the to as the FlakeFlagger dataset. We also performed a
test initialization method (usually named as setUp per-project validation of Flakify compared against
or containing the @Before annotation) for any path FlakeFlagger to assess their relative capability to pre-
checking method, including getPath(), getAbso- dict test cases in new projects.
lutePath(), or getCanonicalPath(). If no RQ2.2: How accurate is Flakify for black-box flaky test
such checking is present, the statement is retained, case prediction compared to the best ML-based solution?
adding the ‘//RO’ flag. Existing black-box flaky test case prediction solutions
rely on a limited set of features that are sometimes
project-specific or applicable only to a certain pro-
4 VALIDATION gramming language, e.g., Java [8], since they were
This section reports on the experiments we conducted to trained on features capturing the keywords of that
evaluate how accurate is Flakify in predicting flaky test language. Besides not being generic, the accuracy of
cases and how it compares to FlakeFlagger as a baseline. these solutions has shown to be very low compared to
We discuss the research questions we address, the datasets white-box solutions [7]. Therefore, we compare the
used, and the experiment design. Then, we present the accuracy of Flakify with a black-box version of Flake-
results for each research question and discuss their practical Flagger, by excluding the features related to produc-
implications. tion code, such as code coverage features (see Table 2).
RQ3: How does test case pre-processing improve Flakify?
4.1 Research Questions The token length limitation of CodeBERT may lead to
unintentionally removing relevant information about
RQ1: How accurately can Flakify predict flaky test cases? flaky test cases, which could then impact prediction
The performance of ML-based flaky test predictors accuracy. We assess whether the accuracy of Flakify is
can be influenced by the data used for training and improved when training the model using pre-proc-
the underlying modeling methodology. In this RQ, essed test cases containing only code statements
we evaluate Flakify on two distinct datasets, which related to test smells, as opposed to the entire test case
differ in terms of numbers of projects, ratios of flaky code. We fully realize that we may be missing test
and non-flaky test cases, and the way flaky test cases smells or unintentionally removing relevant state-
were detected. In addition, predicting flaky test cases ments. But our motivation is to assess the benefits, if
can be influenced by project-specific information any, of our approach to reduce the number of tokens
used during model training, which is not available used as input to CodeBERT. We performed this analy-
for new projects. Therefore, we evaluate Flakify using sis on both the FlakeFlagger and the IDoFT datasets.
two different procedures: 10-fold cross-validation
and per-project validation. The former mixes test 4.2 Datasets Collection and Processing
cases from all projects together to perform model To evaluate Flakify, we used two publicly available datasets
training and testing, whereas the later tests the model for flaky test cases. The first dataset is the FlakeFlagger data-
on every project such that no information from that set [7]. The second dataset is the International Dataset of
project was used as part of model training. Flaky Tests (IDoFT),4 which comprises many datasets for
RQ2: How does Flakify compare to the state-of-the-art
predictors for flaky test cases? 4. https://mir.cs.illinois.edu/flakytests
Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on February 12,2024 at 09:20:51 UTC from IEEE Xplore. Restrictions apply.
FATIMA ET AL.: FLAKIFY: A BLACK-BOX, LANGUAGE MODEL-BASED PREDICTOR FOR FLAKY TESTS 1919
TABLE 2
FlakeFlagger Features
flaky test cases used by previous studies on flaky test case numbers of runs to detect test flakiness. However, we were
prediction [5], [50], [51], [52], [53], [54]. unable to obtain the test code of 474 test cases (from 2 proj-
FlakeFlagger Dataset. It is provided by Alshammari et al. ects) due to missing GitHub repositories or commits, leav-
[7], containing flakiness information about 22,236 test cases ing us with 3,268 Flaky test cases from 312 projects. Given
collected from 23 GitHub projects. These projects have dif- that the IDoFT dataset contains no test cases categorized as
ferent test suite sizes, ranging from 55 to 6,267 (with a Non-Flaky, we used the fixed versions of 1,263 flaky test
median of 430) test cases per project. All projects in the Fla- cases, from 174 projects, to obtain non-flaky test cases, as
keFlagger dataset are written in Java and use Maven as a recommended by the IDoFT maintainers.8 To do so, we
build system, and each test case is a Java test method. The relied on the provided links to pull requests9 used for fixing
dataset contains the source code of each test case and the flaky test cases to collect the corresponding code changes.
corresponding features that were computed to train Flake- However, of the 1,263 fixed flaky test cases, we found only
Flagger. Also, test cases in the dataset were assigned labels 594 flaky test cases, from 126 projects, in which the test case
indicating whether they are Flaky or Non-Flaky, which were code is changed to fix test flakiness. Based on our analysis,
determined by executing each test case 10,000 times. the other flaky test cases were fixed in other ways, such as
When we analyzed the dataset, we identified 453 test changing the order of test case execution, test configuration,
cases with missing source code when intersecting test cases or production code. Such flaky tests are out of the scope of
in a provided CSV file (called processed_data5) with those in this paper, since we consider only test cases whose test code
a provided folder (called original_tests6) containing their was fixed, e.g., causes of flakiness related to test smells or
source code. In addition, we identified 122 test cases, in the other test characteristics. As a result, we added the 594 Non-
original_tests folder, with empty source code, which we Flaky (fixed) tests to the 3,268 Flaky test cases to end up with
found out were not written in Java.7 Therefore, we excluded an updated dataset of 3,862 test cases. Limitations, regard-
these test cases from our dataset, since they do not add any ing the causes of flakiness we could not detect, are dis-
valuable information regarding our flakiness prediction cussed in Section 5. About 13% of all test cases exceed the
evaluation. Nine of these test cases were labeled as flaky, 512 limit of CodeBERT when converted into tokens.
three with missing source code and six with empty method We made the updated datasets of FlakeFlagger and
body. After excluding test cases with missing and empty IDoFT, including their pre-processed test cases, publicly
code, we obtained 21,661 test cases for our experiments. We available in our replication package [55].
compared Flakify and FlakeFlagger using this updated
dataset. To pre-process the source code of the test cases (see 4.3 Experiment Design
Section 3.2), we cloned the GitHub repository of each project 4.3.1 Baseline
and extracted the Java classes defining the methods of test We used the FlakeFlagger approach as a baseline against
cases. which we compare the results achieved by Flakify. To this
There are 802 test cases in the dataset that are labeled as end, we reran the experiments conducted by Alshammari
Flaky (with a median of 19 flaky test cases per project), et al. [7] to reproduce the prediction results of FlakeFlagger
whereas 20,859 test cases are Non-Flaky. About 4% of all test using their provided replication package.10 FlakeFlagger
cases exceed the 512 limit of CodeBERT when converted was trained and tested using a combination of white-box
into tokens, including 14% of the flaky test cases. and black-box features listed in Table 2. These features were
IDoFT Dataset. This dataset contains 3,742 Flaky test cases selected based on their Information Gain (IG), i.e., only fea-
from 314 different Java projects, and collected using differ- tures having an IG 0.01 were selected for training. Besides
ent ways, i.e., different runtime environments with different reproducing the original results of FlakeFlagger, we also
reran the experiments using black-box features only, which
was done by excluding all features that required access to
5. https://github.com/AlshammariA/FlakeFlagger/blob/main/
flakiness-predicter/result/processed_data.csv
6. https://github.com/AlshammariA/FlakeFlagger/tree/main/ 8. https://github.com/TestingResearchIllinois/IDoFT/issues/566
flakiness-predicter/input_data/original_tests 9. https://mir.cs.illinois.edu/flakytests/fixed.html
7. https://github.com/AlshammariA/FlakeFlagger/pull/4 10. https://github.com/AlshammariA/FlakeFlagger
Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on February 12,2024 at 09:20:51 UTC from IEEE Xplore. Restrictions apply.
1920 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 49, NO. 4, APRIL 2023
production code. Comparing Flakify with FlakeFlagger is flaky test cases), Recall (the ability of a model to predict all
performed on the FlakeFlagger dataset only, as running Fla- flaky test cases), and the F1-Score (the harmonic mean of pre-
keFlagger on the IDoFT dataset requires extracting features, cision and recall) [58]. For the per-project validation of Fla-
both dynamic and static, needed to train FlakeFlagger. To kify, we computed the overall precision, recall, and F1-score
do so, we must access the project’s production code and using the prediction results of all projects in the FlakeFlagger
then successfully execute thousands of test cases across and IDoFT datasets. We also computed these metrics indi-
hundreds of project versions. vidually for those projects that have both Flaky and Non-Flaky
test cases, specifically 23 FlakeFlagger projects and 126
4.3.2 Training and Testing Prediction Models IDoFT projects, along with descriptive statistics, such as
mean, median, min, max, 25% and 75% quantiles. We used
Training and testing Flakify were conducted using two dif-
Fisher’s exact test [59] to assess how significant is the differ-
ferent procedures, performed independently on the two
ence in proportions of correctly classified test cases between
datasets describe above, as follows.
two independent experiments. Note that precision, recall,
1st Procedure (Cross-Validation). In this procedure, we
and F1-score are computed based on such proportions.
evaluated Flakify similarly to how FlakeFlagger was origi-
In practice, test cases classified as Flaky must be addressed
nally assessed. Specifically, we used a 10-fold stratified
by re-running them multiple times or by fixing the root
cross-validation to ensure our model is trained and tested
causes of flakiness [6], [12], [60]. Precisely predicting flaki-
in a valid and unbiased way. For that, we allocated 90% of
ness is therefore important as otherwise time and resources
the test cases for training and 10% for testing our model
are wasted on re-running and attempting to debug many test
in each fold. However, different from FlakeFlagger, we
cases that are believed to be flaky but are not [16], [61].
employed 20% of the training dataset as a validation data-
According to our industry partner, Huawei Canada, and a
set, which is required for fine-tuning CodeBERT. Using the
Google technical report [6], each flaky test case has to be
validation dataset, we calculated the training and validation
investigated and re-run by developers. Hence, when we
loss, which helped obtain optimal weights and stop the
multiply the number of predicted flaky test cases, we propor-
training early enough to avoid overfitting.
tionally increase the resources associated with re-running
Given that both of the datasets we used are highly imbal-
and investigating such flaky test cases. Therefore, we assume
anced—Flaky test cases represent only 3.7% of all test cases
that the wasted cost of unnecessarily re-running and debug-
in the FlakeFlagger dataset and Non-Flaky test cases repre-
ging test cases is inversely proportional to precision
sent only 15% of the IDoFT dataset—we balanced Flaky and
Non-Flaky test cases in the training and validation datasets Test Debugging Cost / 1 Precision: (1)
of FlakeFlagger and IDoFT. Different from FlakeFlagger,
which used the synthetic minority oversampling technique On the other hand, it is also important not to miss too
(SMOTE) [56], we used random oversampling [57], which many flaky test cases as otherwise time is bound to be
adds random copies of the minority class to the dataset. We wasted on futile attempts to find and fix non-existent bugs
were unable to use SMOTE, since it requires vector-based in the production code. Thus, we assume that the wasted
features, whereas our model takes the source code of test cost of unnecessarily finding and fixing non-existent bugs
cases (text) as input [10], [38], as opposed to pre-defined fea- in the production code is inversely proportional to recall
tures like FlakeFlagger. Similar to FlakeFlagger, we also per-
formed our experiments using undersampling but this led Code Debugging Cost / 1 Recall: (2)
to lower accuracy. We did not balance the testing dataset to
ensure that our model is only tested on the actual set of test We acknowledge that the above metrics are surrogate
cases. This prevents overestimating the accuracy of the measures for cost and that there are significant differences
model and reflects real-world scenarios where flaky test between individual flaky tests; however, they are reasonable
cases are rarer than non-flaky test cases [7]. and useful approximations on large test suites for the pur-
2nd Procedure (Per-Project Validation). In this procedure, pose of comparing classification techniques. We used Flake-
we evaluated Flakify in a way that yields more realistic Flagger as baseline to compute the reduction rate of test and
results when we predict test cases on a new project, thus code debugging costs, by dividing the difference in cost
evaluating the generalizability of Flakify across projects. To between Flakify and FlakeFlagger by the cost of FlakeFlagger.
do this, we performed a per-project validation of Flakify on
both datasets. In particular, for every project in each dataset, 4.4 Results
we trained Flakify on the other projects and tested it on that 4.4.1 RQ1 Results
project. This allowed us to evaluate how accurate Flakify is
Table 3 shows the prediction results (in terms of precision,
in predicting flaky test cases in one project without includ-
recall, and F1-score) of Flakify using both the full and pre-
ing any data from that project during training. We also per-
processed test code from the FlakeFlagger and IDoFT data-
formed this analysis for FlakeFlagger, on the FlakeFlagger
sets, based on cross-validation. Overall, Flakify achieved
dataset, for the sake of comparison.
promising prediction results using both datasets, with a
precision of 70%, a recall of 90%, and an F1-score of 79% on
4.3.3 Evaluation Metrics the FlakeFlagger dataset, and a precision of 99%, a recall of
To evaluate the performance of our approach, we used stan- 96%, and an F1-score of 98% on the IDoFT dataset. The
dard evaluation metrics for ML classifiers, including Preci- higher results achieved by Flakify on the IDoFT dataset
sion (the ability of a classification model to precisely predict over those achieved on the FlakeFlagger dataset is probably
Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on February 12,2024 at 09:20:51 UTC from IEEE Xplore. Restrictions apply.
FATIMA ET AL.: FLAKIFY: A BLACK-BOX, LANGUAGE MODEL-BASED PREDICTOR FOR FLAKY TESTS 1921
TABLE 3
Results of Flakify (using Full Code and Pre-Processed) Compared to FlakeFlagger (White-Box and Black-Box Versions)
TABLE 4
Summary of the Per-Project Prediction Results of Flakify on the FlakeFlagger and IDoFT Datasets
due to the fact that the IDoFT dataset contains many more RQ2.1 results. For FlakeFlagger, we obtained results close
flaky test cases than FlakeFlagger, which helped during to those reported in the original study, with a slight decrease
model training. Moreover, the non-flaky test cases in the in F1-score (1%), which is likely due to removing test cases
IDoFT dataset were labeled based on developer’s fixes with missing test code. Flakify achieved much better results
addressing the causes of flakiness in the test code, unlike with a precision of 70% (þ10 pp), a recall of 90% (þ18 pp),
the non-flaky test cases in the FlakeFlagger dataset whose and an F1-score of 79% (þ14 pp). These results clearly show
labels were based on 10,000 runs performed by Alshammari that Flakify, though being black-box and relying exclusively
et al. [7], which may not have been enough to fully expose on test code, significantly surpasses FlakeFlagger in accu-
test flakiness. This also helped during model training of rately predicting flaky test cases. Statistically, the proportion
Flakify. of correctly predicted test cases using Flakify is significantly
Table 4 reports the per-project prediction results of higher than that obtained with FlakeFlagger (Fisher-exact p-
Flakify on the FlakeFlagger dataset. Overall, as expected, value < 0:0001).
Flakify achieved slightly lower precision (72%), recall (85%), The number of true positives obtained by FlakeFlagger was
and F1-score (73%) than the cross-validation results on the 574, whereas Flakify increased that number to 721. This indi-
FlakeFlagger dataset. Similarly, Flakify achieved slightly cates that Flakify can potentially reduce the test debugging
worse precision (91%), recall (88%), and F1-score (89%) on cost by 10 pp, as defined above, when compared to FlakeFlag-
the IDoFT dataset. Table 5 shows descriptive statistics for the ger (a reduction rate of 25%). Similarly, Flakify reduces the
per-project prediction results of Flakify for individual proj- number of false negatives to 81 from 227 with FlakeFlagger,
ects of the FlakeFlagger dataset (due to space limitations, thus decreasing the code debugging cost by 18 pp, as defined
we provide individual per-project prediction results of above (a reduction rate of 64%).
Flakify on the IDoFT dataset in our replication package [55]). Table 5 shows the comparison of per-project prediction
Our analysis of individual per-project prediction results results between Flakify and FlakeFlagger. Overall,
revealed a high performance of Flakify on the majority of Flakify achieves a high accuracy, with a precision of 72%
projects. This result suggests that Flakify helps build models (þ57 pp)), a recall of 85% (þ71 pp)), and an F1-score of 73%
that are generalizable across projects, thus making it applica- (þ66 pp)), which, once again, significantly outperforms Fla-
ble to new projects where no historical information about keFlagger. Looking at the individual prediction results of
test flakiness exists. In short, Flakify is capable to learn about the projects, we observe that the accuracy of Flakify is
test flakiness through data collected from other projects to largely consistent across projects, with a few exceptions,
predict flaky test cases in new projects. whereas FlakeFlagger performed poorly on the majority of
projects. Further, Flakify performs better than FlakeFlagger
for almost all projects except two: incubator-dubbo and
4.4.2 RQ2 Results spring-boot where both techniques fare poorly.
Table 3 presents the prediction results of Flakify, using both To understand the reasons behind such degraded perfor-
full code and pre-processed test code, and FlakeFlagger, mance for these two projects, we performed a hierarchical
using both white-box and black-box versions, for the Flake- clustering of the 23 projects. We used different metrics that
Flagger dataset. capture the characteristics of each project, such as the
Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on February 12,2024 at 09:20:51 UTC from IEEE Xplore. Restrictions apply.
1922 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 49, NO. 4, APRIL 2023
TABLE 5
Results of the Per-Project Prediction for Flakify and FlakeFlagger on the FlakeFlagger Dataset
For every project, we trained models on all other projects and tested them on that project.
number of test cases, number of flaky test cases, and fre- decrease its prediction power. The difference in accuracy
quency of test smells in each project. However, our cluster- between Flakify and the black-box version of FlakeFlagger
ing results were inconclusive, thus revealing no significant is rather striking, with a large improvement of +49% in F1-
differences between the two projects and the other projects. score (Fisher-exact p-value < 0:0001). FlakeFlagger is there-
As reported by Alshammari et al. [7], each project can have fore not a viable black-box option to predict flaky test cases.
distinct characteristics, e.g., environmental setup and test-
ing paradigm, that make it difficult to develop a general- 4.4.3 RQ3 Results
purpose flaky test case predictor. For example, the
spring-boot project has the highest number of flaky test With no code pre-processing, 898 (4%) of the test cases of the
cases among all projects, representing 20% of all flaky test FlakeFlagger dataset and 505 (13%) of the test cases of the
cases in the dataset. This, in turn, can influence model train- IDoFT dataset were truncated by CodeBERT to generate
ing when the model was tested for spring-boot. In addi- tokens of size 512. Such arbitrary code truncation is likely to
tion, the variation in prediction results can be a result of a affect how accurately Flakify can predict flaky test cases. Pre-
possible mislabeling of test cases as Flaky and Non-Flaky in processing test cases (see Section 3.2) led to reducing the num-
some projects, since some test cases may still exhibit flaki- ber of test cases being truncated to only 40 (from 898) in the
ness behavior if executed more than 10,000 executions, for FlakeFlagger dataset and 87 (from 505) in the IDoFT dataset, a
example. Finally, test flakiness can also occur due to the use large difference. As a result, we observe in Table 3 that, with
of network APIs or dependency conflicts [17], which were pre-processed test cases, Flakify predicted flaky test cases
not taken into account when predicting flaky test cases. with 5 pp higher F1-score on the FlakeFlagger dataset and 6
RQ2.2 results.As shown in Table 3, we observe a consid- pp higher F1-score on the IDoFT dataset. This corresponds to
erable decline in the accuracy for the black-box version of a significantly higher proportion of correctly predicted test
FlakeFlagger when compared to its original, white-box ver- cases (Fisher-exact p-value ¼ 0:0008) for the FlakeFlagger
sion, i.e., 39 pp less precise with a 54 pp decrease in dataset. In practice, the impact of pre-processing is expected
F1-score. Specifically, black-box FlakeFlagger correctly pre- to vary depending on the token length distribution of test
dicted a significantly lower proportion of test cases than cases. This result suggests that retaining statements related to
both Flakify and the original, white-box version of Flake- test smells in the test code contributed to making Flakify more
Flagger (Fisher-exact p-values < 0:0001). As a possible accurate, which also confirms the association of test smells
explanation, based on the results of FlakeFlagger regarding with flaky test cases reported by prior research [9].
the importance of features in predicting flaky test cases [7],
the majority of features having high IG values were based 4.5 Discussion
on source code coverage. Hence, removing those features, More Accurate Predictions With Easily Accessible Information.
to make FlakeFlagger black-box, is expected to significantly Our results showed that our black-box prediction of flaky
Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on February 12,2024 at 09:20:51 UTC from IEEE Xplore. Restrictions apply.
FATIMA ET AL.: FLAKIFY: A BLACK-BOX, LANGUAGE MODEL-BASED PREDICTOR FOR FLAKY TESTS 1923
test cases performs significantly better than a white-box, the problem, since differences in test case verdicts, i.e., pass
state-of-the-art approach. This not only enables test engi- or fail, can be due to differences in builds rather than flaki-
neers to predict flaky test cases without rerunning test cases, ness. Therefore, test engineers can use the prediction results
but also without accessing the production code of the sys- obtained from Flakify to fix test cases that are predicted as
tem under test, a significant practical advantage in many flaky, e.g., by eliminating the presence of test smells, or oth-
contexts. The highest accuracy of our Flakify was achieved erwise rerun them a larger number of times, using the same
by only retaining relevant code statements matching eight code version, to verify whether a test case is actually flaky
test smells. Yet, there is still room for improvement in terms or not. More specifically, Flakify helps test engineers focus
of accuracy, which could be achieved by retaining more rel- their attention on a small subset of test cases that are most
evant statements based on additional test smells. For exam- likely to be flaky in a CI build. As our results show,
ple, retaining code statements related to other common Flakify significantly reduces the cost of debugging test and
flakiness causes [16], such as concurrency and randomness, production code, both in terms of human effort and execu-
could further improve flaky test case predictions. However, tion time. This makes Flakify an important strategy in prac-
the more code statements we retain, the more tokens to be tice to achieve scalability, especially when applied to large
considered by CodeBERT, which might lead to many test test suites. Moreover, the test smell detection capability of
cases exceeding their token length limit, thus truncating Flakify helps to inform test engineers about possible causes
other useful information. Hence, retaining additional code of flakiness that need to be addressed.
statements is a trade-off and should carefully be performed
in balance with the resulting token length of test cases. 5 THREATS TO VALIDITY
Moreover, building a white-box flaky test predictor, by con-
sidering both production and test code, is not always techni- This section discusses the potential threats to the validity of
cally feasible, since the production code is not always our reported results.
available to test engineers and, when possible, code cover-
age can be expensive and not scalable on large systems, 5.1 Construct Validity
especially in a continuous integration context. Considering Construct threats to validity are concerned with the degree
the production code also makes it impractical to build lan- to which our analyses measure what we claim to analyze. In
guage model-based predictors for flaky test cases, given the our study, to pre-process test cases, we used heuristics to
token length limitation of language models in general, and retain code statements that match at least one of the eight
CodeBERT in particular. Nevertheless, future research test smells shown in Table 1. However, our heuristics might
should assess the practicability of white-box, model-based have missed some code statements having test smells and
flaky test prediction, and should investigate further code this could have led to suboptimal results when applying
pre-processing methods to make the use of language mod- our approach. To mitigate this issue, though our approach
els more applicable in practice. to identify test smells is entirely different, we relied on the
Practical Implications of Imperfect Prediction Results. same heuristics as those used by Alshammari et al. [7].
Though Flakify surpassed the best state-of-the-art solution These heuristics assume commonly used coding conven-
in predicting flaky test cases, both in terms of precision and tions that might not be followed in all test suites. For exam-
recall, a precision of 70% is still not satisfactory, since mis- ple, we assumed that the test class name contains the
classifying non-flaky test cases as flaky leads to additional, production class name with the word ‘Test’. However, such
unnecessary cost, e.g., attempting to fix the test cases incor- heuristics can easily be adapted to other coding conventions
rectly predicted as flaky. Also, with a recall of 90%, we miss in practice. We also manually checked a random sample of
10% of flaky test cases, leading to wasted debugging cost. If test cases to verify that pre-processed code contains, as
we assume that precision should be prioritized over recall, expected, only test smells-related code statements and does
we can increase the former by restricting flaky test case pre- not dismiss any of them. We have made the tool we devel-
dictions to those test cases with highest prediction confi- oped to detect test smells publicly available in our replica-
dence, at the expense of a lower recall. For example, this can tion package [55].
be achieved by adjusting the classification threshold for
flaky test cases to 0.60 or 0.70, instead of the default thresh- 5.2 Internal Validity
old of 0.50. Nevertheless, given that the predicted probabili- Internal threats to validity are concerned with the ability to
ties generated by the neural network in Flakify are over draw conclusions from our experimental results. In our
confident due to the use of the Softmax function in the last study, we used CodeBERT to perform a binary classification
layer [62], i.e., probabilities are either close to 0.0 or 1.0, we of test cases as Flaky or Non-Flaky. However, due to the
were unable to perform such analysis. Therefore, future token length limit of CodeBERT, the source code of some
research should employ techniques for calibrating the pre- test cases was truncated, possibly leading to discarding rele-
dicted probabilities [63] and enable threshold adjustments vant information about test flakiness. To mitigate this issue,
when classifying flaky test cases. we pre-processed the source code of test cases to retain only
Deployment of a Flaky Test Case Predictor in Practice. code statements related to test smells. Doing so did not only
Flakify can be deployed in Continuous Integration (CI) reduce the token length of test cases, but also improved the
environments to help detect flaky test cases. One could prediction power of our approach. However, our pre-proc-
argue that the CI build history can be used as reference to essing may not be perfect or complete as it can lead to losing
conclude whether a test case is flaky or not. However, regu- other relevant information. Future research should investi-
lar test case executions across builds may not entirely solve gate whether retaining additionally relevant information to
Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on February 12,2024 at 09:20:51 UTC from IEEE Xplore. Restrictions apply.
1924 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 49, NO. 4, APRIL 2023
flaky test cases leads to improving prediction results, e.g., considered in this paper as non-flaky. This helped during
statements related to common flakiness causes, such as syn- model training of Flakify on this dataset, which resulted in
chronous or platform-dependent operations. a higher prediction accuracy than those on the FlakeFlagger
Moreover, our prediction results were compared with dataset.
those of FlakeFlagger. But FlakeFlagger used white-box fea-
tures, whereas our approach is black-box and the compari- 6 RELATED WORK
son may not be entirely meaningful. To mitigate this issue,
we also compared our results with a black-box version of Flaky test detection has been an active area of research
FlakeFlagger in which we removed any features requiring where many techniques were proposed to detect flaky test
access to production code. In both cases, our approach cases [16]. Overall, these techniques can be classified into
obtained significantly higher prediction results than Flake- two groups: dynamic techniques, which require executing
Flagger. We did not compare our results with other black- test cases to determine whether they are flaky or not, and
box approaches, e.g., vocabulary-based [8], since they are static techniques, which rely only on the source code of test
project-specific and did not achieve good results on the Fla- cases or the system under test. In this section, we review the
keFlagger dataset [7]. flaky test detection techniques while comparing and con-
Finally, in our analysis, the cost of debugging the pro- trasting them to our approach.
duction or testing code assumes that test engineers address
all test cases predicted as flaky. However, test engineers 6.1 ML-Based Flaky Test Case Prediction
may choose to ignore a flaky test case, either by removing A common approach to detect flaky test cases is to re-run
or skipping it, thus not introducing any cost. Yet, we believe test cases multiple times [1], [16], which is computationally
that every flaky test case should be carefully addressed by expensive. To address this issue, recent research has pro-
test engineers, since ignoring test cases can lead to other posed the use of ML techniques for predicting flaky test
kinds of costs, such as overlooked system faults. cases, enabling test engineers to re-run only those test cases
that are predicted to be flaky, thus reducing the cost of
5.3 External Validity unnecessary debugging of test cases or production code.
External threats are concerned with the ability to generalize Alshammari et al. [7] proposed an innovative approach
our results. Our study is based on data collected by Alsham- to predict flaky test cases using dynamically computed fea-
mari et al. [7], which was obtained by rerunning test cases tures capturing code coverage, execution history, and test
10,000 times. Such data is of course not perfect as some test smells. They re-ran test cases 10,000 times to identify
cases that were not found to be flaky could have been if whether a test case was flaky or not and thus establish a
rerun more times. To mitigate this threat, we used the same ground truth. Their prediction model predicted flaky test
dataset for comparing Flakify with the baseline approach, cases with an F1-score of 0.65, leaving significant room for
FlakeFlagger. We also filtered out test cases which, to our improvement. However, some of the significant features
surprise, had no source code in the dataset. Further, the Fla- required access to production files which, as discussed
keFlagger and IDoFT datasets contain test cases from proj- above, are not always accessible by test engineers or may
ects that are exclusively written in Java, which might affect not be computable in a scalable way in many practical con-
the generalizability of our results. To mitigate this issue, we texts. Further, when only black-box features (see Table 2)
used CodeBERT, which was trained on six programming were used, the F1-score decreased by 35 pp. In contrast, our
languages. Hence, we believe our approach would be appli- approach achieved more accurate prediction results, with
cable to projects written in other programming languages as an F1-score of 0.79, while using test code only, thus offering
well, given an appropriate tool to identify test smells. a favorable black-box alternative.
Moreover, CodeBERT was pre-trained on production In addition, Pontillo et al. [11] proposed an approach to
source code only, i.e., source code related to test suites was identify the most important factors associated with flaky
not part of pre-training, making it unable to recognize test- test cases using the iDFlakies dataset [5]. They used logistic
specific structure and vocabulary, e.g., assertions. This can regression to model flaky test cases using features that were
potentially increase token length, since test-specific key statically computed using production code, e.g., code cover-
terms are decomposed into multiple tokens instead of one. age, and test code, e.g., test smells. They found that code
For example, CodeBERT converts assertEquals into complexity (both production and test code), assertions, and
three tokens: assert, ##equal, and ##s, rather than just test smells are associated with test flakiness.
one token. Our pre-processing of the source code of test Another approach was proposed by Pinto et al. [8] in
cases helped to mitigate the issue of token length; yet, future which Java keywords were extracted from test code and
work should aim at pre-training CodeBERT on test code in employed as vocabulary features to predict test flakiness.
addition to production code. Further, their study relied on the dataset of DeFlaker [4], in
Finally, the IDoFT dataset has shown that a significant which test cases were re-run less than 100 times to establish
number of test cases are flaky due to reasons unrelated to the ground truth. Despite high accuracy results (F1-score =
the test code. In situations where this is common, this is 0.95) on their dataset, their approach achieved much worse
obviously a limitation of any black-box approach like results (F1-score = 0.19) when using the dataset provided by
Flakify relying exclusively on test code. In our evaluation, Alshammari et al. [7]. In addition, their models were lan-
we did not consider such flaky test cases, but rather those guage- and project-specific, since most of the significant fea-
whose causes of flakiness were in the test code, which were tures for predicting flaky test cases were related to Java
confirmed and manually fixed by developers, and thus keywords, e.g., throws, or specific variable names, e.g., id. In
Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on February 12,2024 at 09:20:51 UTC from IEEE Xplore. Restrictions apply.
FATIMA ET AL.: FLAKIFY: A BLACK-BOX, LANGUAGE MODEL-BASED PREDICTOR FOR FLAKY TESTS 1925
contrast, while our approach relies exclusively on test code, Bell et al. [4] proposed DeFlaker, a tool for detecting
it builds a generic model to predict flakiness, based on fea- flaky test cases using coverage information about code
tures that are neither language- nor project-dependent, and changes. In particular, a test case is labeled as flaky if it fails
achieved much better prediction results when using the Fla- and does not cover any changed code. Out of 4,846 test fail-
keFlagger dataset used by Alshammari et al. [7]. ures, DeFlaker was able to label 39 pp of them as flaky, with
Moreover, Haben et al. [15] and Camara et al. [64] repli- a 95.5% recall and a false positive rate of 1.5%, outperform-
cated the study by Pinto et al. using other datasets contain- ing the default way of detecting flaky test cases, i.e., by
ing projects written in other programming languages, e.g., rerunning test cases using Maven [67]. Different from
Python. They found that vocabulary-based approaches are DeFlaker, Lam et al. [5] proposed iDFlakies, which detects
not generalizable, especially when performing inter-project test flakiness by re-running test cases in random orders.
flaky test case predictions, since new vocabulary is needed This framework was used to construct a dataset containing
for any new project or programming language. Haben et al. 422 flaky test cases, with almost half of them being order-
also showed that combining the vocabulary-based features dependent.
with code coverage features does not significantly improve The above approaches either depend on rerunning test
the prediction accuracy of such an approach. cases multiple times, execution history (not available for
In summary, unlike the ML-based approaches above, our new test cases), or production code, e.g., coverage informa-
approach is generic, black-box, and language model-based, tion. In contrast, Flakify does not require repeated execu-
thus not requiring access to production code or pre-defini- tions of test cases or any information about the production
tion of features. Instead, our approach relies solely on test code, including code coverage.
code to predict whether a test case is flaky or not.
A black-box version of FlakeFlagger is not a viable [11] V. Pontillo, F. Palomba, and F. Ferrucci, “Toward static test flaki-
ness prediction: A feasibility study,” in Proc. 5th Int. Workshop
option to predict flaky test cases as it is too inaccurate. Mach. Learn. Techn. Softw. Qual. Evol., 2021, pp. 19–24.
When retaining only code statements related to test [12] C. Ziftci and D. Cavalcanti, “De-flake your tests: Automatically
smells, Flakify predicted flaky test cases with 5 pp locating root causes of flaky tests in code at Google,” in Proc. IEEE
and 6 pp higher F1-score on the FlakeFlagger and Int. Conf. Softw. Maintenance Evol., 2020, pp. 736–745.
[13] W. Lam, P. Godefroid, S. Nath, A. Santhiar, and S. Thummala-
IDoFT datasets, respectively. penta, “Root causing flaky tests in a large-scale industrial setting,”
Overall, existing public datasets [4], [5], [7], [15] are not fully adequate to appropriately evaluate flaky test case prediction approaches, since the ratio of flaky test cases tends to be very low. In addition, flaky test cases in these datasets were detected by rerunning test cases numerous times while monitoring their behavior across executions, a technique that may be inaccurate.

Further, many open source projects nowadays adopt Continuous Integration (CI), which provides extensive test execution histories. Given the frequency of test executions in CI and the high workload on CI servers, test cases might expose further flakiness behaviors due to causes that may not be revealed when running test cases on machines dedicated to test execution [68], [69]. Therefore, we plan in the future to build a larger dataset of flaky test cases in a CI context.

Last, a significant proportion of flaky tests can be due to problems in the production code and cannot be addressed by black-box models. Therefore, in the future, we need to devise light-weight and scalable approaches to address such causes of flakiness.
ACKNOWLEDGMENTS

The experiments conducted in this work were enabled in part by WestGrid (https://www.westgrid.ca) and Compute Canada (https://www.computecanada.ca). Moreover, we are grateful to the authors of FlakeFlagger and the maintainers of the IDoFT dataset, who have responded to our multiple inquiries for clarifications about the datasets.
REFERENCES

[1] B. Zolfaghari, R. M. Parizi, G. Srivastava, and Y. Hailemariam, "Root causing, detecting, and fixing flaky tests: State of the art and future roadmap," Softw.: Pract. Exp., vol. 51, no. 5, pp. 851–867, 2021.
[2] Q. Luo, F. Hariri, L. Eloussi, and D. Marinov, "An empirical analysis of flaky tests," in Proc. 22nd ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2014, pp. 643–653.
[3] M. Eck, F. Palomba, M. Castelluccio, and A. Bacchelli, "Understanding flaky tests: The developer's perspective," in Proc. 27th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2019, pp. 830–840.
[4] J. Bell, O. Legunsen, M. Hilton, L. Eloussi, T. Yung, and D. Marinov, "DeFlaker: Automatically detecting flaky tests," in Proc. IEEE/ACM 40th Int. Conf. Softw. Eng., 2018, pp. 433–444.
[5] W. Lam, R. Oei, A. Shi, D. Marinov, and T. Xie, "iDFlakies: A framework for detecting and partially classifying flaky tests," in Proc. 12th IEEE Conf. Softw. Testing, Validation Verification, 2019, pp. 312–322.
[6] J. Micco, "Advances in continuous integration testing at Google," 2018. [Online]. Available: https://research.google/pubs/pub46593
[7] A. Alshammari, C. Morris, M. Hilton, and J. Bell, "FlakeFlagger: Predicting flakiness without rerunning tests," in Proc. IEEE/ACM 43rd Int. Conf. Softw. Eng., 2021, pp. 1572–1584.
[8] G. Pinto, B. Miranda, S. Dissanayake, M. d'Amorim, C. Treude, and A. Bertolino, "What is the vocabulary of flaky tests?," in Proc. 17th Int. Conf. Mining Softw. Repositories, 2020, pp. 492–502.
[9] B. Camara, M. Silva, A. Endo, and S. Vergilio, "On the use of test smells for prediction of flaky tests," in Proc. Braz. Symp. Systematic Autom. Softw. Testing, 2021, pp. 46–54.
[10] Z. Feng et al., "CodeBERT: A pre-trained model for programming and natural languages," in Proc. Findings Assoc. Comput. Linguistics: Empir. Methods Natural Lang. Process., 2020, pp. 1536–1547.
[11] V. Pontillo, F. Palomba, and F. Ferrucci, "Toward static test flakiness prediction: A feasibility study," in Proc. 5th Int. Workshop Mach. Learn. Techn. Softw. Qual. Evol., 2021, pp. 19–24.
[12] C. Ziftci and D. Cavalcanti, "De-flake your tests: Automatically locating root causes of flaky tests in code at Google," in Proc. IEEE Int. Conf. Softw. Maintenance Evol., 2020, pp. 736–745.
[13] W. Lam, P. Godefroid, S. Nath, A. Santhiar, and S. Thummalapenta, "Root causing flaky tests in a large-scale industrial setting," in Proc. 28th ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2019, pp. 101–111.
[14] T. Bach, A. Andrzejak, and R. Pannemans, "Coverage-based reduction of test execution time: Lessons from a very large industrial project," in Proc. IEEE Int. Conf. Softw. Testing, Verification Validation Workshops, 2017, pp. 3–12.
[15] G. Haben, S. Habchi, M. Papadakis, M. Cordy, and Y. L. Traon, "A replication study on the usability of code vocabulary in predicting flaky tests," in Proc. IEEE/ACM 18th Int. Conf. Mining Softw. Repositories, 2021, pp. 219–229.
[16] O. Parry, G. M. Kapfhammer, M. Hilton, and P. McMinn, "A survey of flaky tests," ACM Trans. Softw. Eng. Methodol., vol. 31, no. 1, pp. 1–74, 2021.
[17] O. Parry, G. M. Kapfhammer, M. Hilton, and P. McMinn, "Surveying the developer experience of flaky tests," in Proc. IEEE/ACM Int. Conf. Softw. Eng.: Softw. Eng. Pract., 2022, pp. 253–262.
[18] A. Van Deursen, L. Moonen, A. Van Den Bergh, and G. Kok, "Refactoring test code," in Proc. 2nd Int. Conf. Extreme Program. Flexible Processes Softw. Eng., 2001, pp. 92–95.
[19] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2019, pp. 4171–4186.
[20] M. E. Peters et al., "Deep contextualized word representations," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2018, pp. 2227–2237.
[21] Z. Yang et al., "XLNet: Generalized autoregressive pretraining for language understanding," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 5753–5763.
[22] Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692.
[23] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, "VideoBERT: A joint model for video and language representation learning," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 7464–7473.
[24] D. Nadeau and S. Sekine, "A survey of named entity recognition and classification," Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.
[25] N. Bach and S. Badaskar, "A review of relation extraction," Literature Rev. Lang. Statist., vol. II, no. 2, pp. 1–15, 2007.
[26] H. Xu, B. Liu, L. Shu, and P. S. Yu, "BERT post-training for review reading comprehension and aspect-based sentiment analysis," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2019, pp. 2324–2335.
[27] C. Sun, X. Qiu, Y. Xu, and X. Huang, "How to fine-tune BERT for text classification?," in Proc. China Nat. Conf. Chin. Comput. Linguistics, 2019, pp. 194–206.
[28] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[29] D. Mandic and J. Chambers, Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. Hoboken, NJ, USA: Wiley, 2001.
[30] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[31] J. Keim, A. Kaplan, A. Koziolek, and M. Mirakhorli, "Does BERT understand code? An exploratory study on the detection of architectural tactics in code," in Proc. Eur. Conf. Softw. Archit., 2020, pp. 220–228.
[32] D. Guo et al., "GraphCodeBERT: Pre-training code representations with data flow," in Proc. 9th Int. Conf. Learn. Representations, 2021, pp. 1–18.
[33] A. Kanade, P. Maniatis, G. Balakrishnan, and K. Shi, "Learning and evaluating contextual embedding of source code," in Proc. Int. Conf. Mach. Learn., 2020, pp. 5110–5121.
[34] X. Jiang, Z. Zheng, C. Lyu, L. Li, and L. Lyu, "TreeBERT: A tree-based pre-trained model for programming language," Proc. 37th Conf. Uncertainty Artif. Intell., Mach. Learn. Res., vol. 161, pp. 54–63, 2021.
[35] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, "CodeSearchNet challenge: Evaluating the state of semantic code search," 2019, arXiv:1909.09436.
[36] Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," 2016, arXiv:1609.08144.
[37] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, "ELECTRA: Pre-training text encoders as discriminators rather than generators," in Proc. 8th Int. Conf. Learn. Representations, 2020, pp. 1–18.
[38] C. Pan, M. Lu, and B. Xu, "An empirical study on software defect prediction using CodeBERT model," Appl. Sci., vol. 11, no. 11, 2021, Art. no. 4793.
[39] J. Wu, "Literature review on vulnerability detection using NLP technology," 2021, arXiv:2104.11230.
[40] J. Howard and S. Ruder, "Universal language model fine-tuning for text classification," in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 328–339.
[41] A. F. A., "Deep learning using rectified linear units (ReLU)," 2018, arXiv:1803.08375.
[42] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[43] S. El Anigri, M. M. Himmi, and A. Mahmoudi, "How BERT's dropout fine-tuning affects text classification?," in Proc. Int. Conf. Bus. Intell., 2021, pp. 130–139.
[44] Z. Yao, A. Gholami, S. Shen, M. Mustafa, K. Keutzer, and M. Mahoney, "ADAHESSIAN: An adaptive second order optimizer for machine learning," Proc. AAAI Conf. Artif. Intell., vol. 35, no. 12, pp. 10665–10673, 2021.
[45] W. Aljedaani et al., "Test smell detection tools: A systematic mapping study," in Proc. Eval. Assessment Softw. Eng., 2021, pp. 170–180.
[46] A. Peruma, K. Almalki, C. D. Newman, M. W. Mkaouer, A. Ouni, and F. Palomba, "TsDetect: An open source test smells detection tool," in Proc. 28th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2020, pp. 1650–1654.
[47] T. Virgínio et al., "JNose: Java test smell detector," in Proc. 34th Braz. Symp. Softw. Eng., 2020, pp. 564–569.
[48] R. E. Noonan, "An algorithm for generating abstract syntax trees," Comput. Lang., vol. 10, no. 3/4, pp. 225–236, 1985.
[49] A. Panichella, S. Panichella, G. Fraser, A. A. Sawant, and V. J. Hellendoorn, "Revisiting test smells in automatically generated tests: Limitations, pitfalls, and opportunities," in Proc. IEEE Int. Conf. Softw. Maintenance Evol., 2020, pp. 523–533.
[50] A. Wei, P. Yi, T. Xie, D. Marinov, and W. Lam, "Probabilistic and systematic coverage of consecutive test-method pairs for detecting order-dependent flaky tests," in Proc. Int. Conf. Tools Algorithms Construction Anal. Syst., 2021, pp. 270–287.
[51] W. Lam, S. Winter, A. Wei, T. Xie, D. Marinov, and J. Bell, "A large-scale longitudinal study of flaky tests," Proc. ACM Program. Lang., vol. 4, no. OOPSLA, pp. 1–29, 2020.
[52] W. Lam, S. Winter, A. Astorga, V. Stodden, and D. Marinov, "Understanding reproducibility and characteristics of flaky tests through test reruns in Java projects," in Proc. IEEE 31st Int. Symp. Softw. Rel. Eng., 2020, pp. 403–413.
[53] W. Lam, A. Shi, R. Oei, S. Zhang, M. D. Ernst, and T. Xie, "Dependent-test-aware regression testing techniques," in Proc. 29th ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2020, pp. 298–311.
[54] A. Shi, W. Lam, R. Oei, T. Xie, and D. Marinov, "iFixFlakies: A framework for automatically fixing order-dependent flaky tests," in Proc. 27th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2019, pp. 545–555.
[55] Flakify: A Black-Box, Language Model-based Predictor for Flaky Tests – Replication Package, 2022. [Online]. Available: https://doi.org/10.5281/zenodo.6994692
[56] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.
[57] P. Branco, L. Torgo, and R. P. Ribeiro, "A survey of predictive modeling on imbalanced domains," ACM Comput. Surv., vol. 49, no. 2, pp. 1–50, 2016.
[58] C. Goutte and E. Gaussier, "A probabilistic interpretation of precision, recall and F-score, with implication for evaluation," in Proc. Eur. Conf. Inf. Retrieval, 2005, pp. 345–359.
[59] M. Raymond and F. Rousset, "An exact test for population differentiation," Evolution, vol. 49, pp. 1280–1283, 1995.
[60] J. Micco, "Flaky tests at Google and how we mitigate them," 2016. [Online]. Available: https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html
[61] A. Memon et al., "Taming Google-scale continuous testing," in Proc. IEEE/ACM 39th Int. Conf. Softw. Eng.: Softw. Eng. Pract. Track, 2017, pp. 233–242.
[62] G. Melotti, C. Premebida, J. J. Bird, D. R. Faria, and N. Gonçalves, "Probabilistic object classification using CNN ML-MAP layers," 2020, arXiv:2005.14565.
[63] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On calibration of modern neural networks," in Proc. Int. Conf. Mach. Learn., 2017, pp. 1321–1330.
[64] B. H. P. Camara, M. A. G. Silva, A. T. Endo, and S. R. Vergilio, "What is the vocabulary of flaky tests? An extended replication," in Proc. IEEE/ACM 29th Int. Conf. Prog. Comprehension, 2021, pp. 444–454.
[65] A. Memon and J. Micco, "How flaky tests in continuous integration," 2016. [Online]. Available: https://www.youtube.com/watch?v=CrzpkF1-VsA
[66] E. Kowalczyk, K. Nair, Z. Gao, L. Silberstein, T. Long, and A. Memon, "Modeling and ranking flaky tests at Apple," in Proc. IEEE/ACM 42nd Int. Conf. Softw. Eng.: Softw. Eng. Pract., 2020, pp. 110–119.
[67] Identifying and analyzing flaky tests in Maven and Gradle builds, 2019. Accessed: Jan. 11, 2021. [Online]. Available: https://gradle.com/blog/flaky-tests
[68] T. A. Ghaleb, D. A. da Costa, Y. Zou, and A. E. Hassan, "Studying the impact of noises in build breakage data," IEEE Trans. Softw. Eng., vol. 47, no. 9, pp. 1998–2011, Sep. 2021.
[69] J. Lampel, S. Just, S. Apel, and A. Zeller, "When life gives you oranges: Detecting and diagnosing intermittent job failures at Mozilla," in Proc. 29th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2021, pp. 1381–1392.

Sakina Fatima received the Erasmus Mundus Joint master's degree in dependable software systems from the University of St Andrews, U.K., and Maynooth University, Ireland. She is currently working toward the PhD degree with the School of EECS, University of Ottawa, and is a member of Nanda Lab. In 2019, she was awarded the French Government Medal and the National University of Ireland prize for distinction in collaborative degrees. Her research interests include automated software testing, natural language processing, and applied machine learning.

Taher A. Ghaleb received the BSc degree in information technology from Taiz University, Yemen, in 2008, the MSc degree in computer science from the King Fahd University of Petroleum and Minerals, Saudi Arabia, in 2016, and the PhD degree in computing from Queen's University, Canada, in 2021. He is a postdoctoral research fellow with the School of EECS, University of Ottawa, Canada. During his PhD, he held an Ontario Trillium Scholarship, a highly prestigious award for doctoral students. He worked as a research/teaching assistant. His research interests include continuous integration, software testing, mining software repositories, applied data science and machine learning, program analysis, and empirical software engineering.

Lionel Briand (Fellow, IEEE) is a professor of software engineering and has shared appointments between (1) the School of Electrical Engineering and Computer Science, University of Ottawa, Canada and (2) the SnT Centre for Security, Reliability, and Trust, University of Luxembourg. He is the head of the SVV Department, SnT Centre, and a Canada Research Chair in Intelligent Software Dependability and Compliance (Tier 1). He received an ERC Advanced Grant, the most prestigious European individual research award, and has conducted applied research in collaboration with industry for more than 25 years, including projects in the automotive, aerospace, manufacturing, financial, and energy domains. He was elevated to the grade of ACM Fellow, granted the IEEE Computer Society Harlan Mills Award (2012), the IEEE Reliability Society Engineer-of-the-Year Award (2013), and the ACM SIGSOFT Outstanding Research Award (2022) for his work on software verification and testing. His research interests include testing and verification, search-based software engineering, model-driven development, requirements engineering, and empirical software engineering.