
Information and Software Technology 172 (2024) 107476

Contents lists available at ScienceDirect

Information and Software Technology


journal homepage: www.elsevier.com/locate/infsof

CriticalFuzz: A critical neuron coverage-guided fuzz testing framework for deep neural networks

Tongtong Bai a, Song Huang a,∗, Yifan Huang b, Xingya Wang a,c,d, Chunyan Xia a, Yubin Qu a,e, Zhen Yang a

a College of Command and Control Engineering, Army Engineering University of PLA, Nanjing, Jiangsu, China
b Nanyang Technological University, Singapore
c College of Computer and Information Engineering (College of Artificial Intelligence), Nanjing Tech University, Nanjing, Jiangsu, China
d State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu, China
e School of Information Engineering, Jiangsu College of Engineering and Technology, Nantong, Jiangsu, China

ARTICLE INFO

Keywords:
Critical neurons
Fuzz testing
Critical neuron coverage
Deep neural network

ABSTRACT

Context: Deep neural networks (DNNs) have been widely deployed in safety-critical domains, such as autonomous cars and healthcare, where erroneous behaviors can lead to serious accidents, so testing DNNs is extremely important. Neuron coverage-guided fuzz testing (NCFT) has become an effective whitebox approach for testing DNNs: it iteratively generates new test cases under the guidance of neuron coverage to explore the different logics of a DNN, and it has found numerous defects. However, existing NCFT approaches ignore that neurons play distinct roles in the final output of a DNN. Given an input, only a fraction of the neurons determines the final output of the DNN; these neurons hold the essential logic of the DNN.

Objective: To ensure the quality of DNNs and improve testing efficiency, NCFT should first cover the neurons containing the major logic of the DNN.

Method: In this paper, we propose critical neurons, which hold the essential logic of a DNN. To prioritize the detection of potential defects in critical neurons, we propose a fuzz testing framework, named CriticalFuzz, which mainly contains energy-based test case generation and critical neuron coverage criteria. Energy-based test case generation produces test cases that are more likely to cover critical neurons and involves energy-based seed selection, a power schedule, and seed mutation. Critical neuron coverage serves as a feedback mechanism that guides CriticalFuzz to prioritize the coverage of critical neurons. To evaluate the significance of critical neurons and the performance of CriticalFuzz, we conducted experiments on popular DNNs and datasets.

Results: The experimental results show that (1) the critical neurons have a 100% impact on the output of the models, while the non-critical neurons have a lesser effect; (2) CriticalFuzz is effective in achieving 100% coverage of critical neurons and covering 10 classes of critical neurons, outperforming both DeepHunter and TensorFuzz; and (3) CriticalFuzz exhibits exceptional error detection capability, successfully identifying thousands of errors across 10 diverse error classes within DNNs.

Conclusion: The critical neurons defined in this paper hold more significant logic of the DNN than non-critical neurons. CriticalFuzz can preferentially cover critical neurons, thereby improving the efficiency of the NCFT process. Additionally, CriticalFuzz is capable of identifying a greater number of errors, thus enhancing the reliability and effectiveness of NCFT.

1. Introduction

Deep neural networks (DNNs) have achieved significant improvements and have been applied to safety-critical fields, such as self-driving cars [1] and automatic medical diagnosis [2,3]. However, DNNs still exhibit erroneous behaviors, which may result in heavy losses [4–7]. In particular, in the autonomous driving field, researchers have found thousands of error behaviors in DNN-driven autonomous vehicles [5,8–10]. Therefore, DNNs need to be sufficiently tested

∗ Corresponding author.
E-mail addresses: [email protected] (T. Bai), [email protected] (S. Huang), [email protected] (Y. Huang), [email protected] (X. Wang), [email protected] (C. Xia), [email protected] (Y. Qu), [email protected] (Z. Yang).

https://doi.org/10.1016/j.infsof.2024.107476
Received 6 November 2023; Received in revised form 21 April 2024; Accepted 21 April 2024
Available online 24 April 2024
0950-5849/© 2024 Elsevier B.V. All rights reserved.

to ensure its correctness. To automatically produce massive test cases, researchers have proposed multiple mutation methods, including affine transformations [8], generating driving scenes with various weather conditions [9], replacing the background region [11], etc. Moreover, the metamorphic relation [12] is used to alleviate the oracle problem [8,13]. The metamorphic relation is constructed based on the properties of the DNN under test. The DNN has an error behavior when the outputs of the DNN for the original input and the mutated input do not satisfy the metamorphic relation. In addition, to automatically explore the different decision logics of a DNN, researchers have proposed neuron coverage-guided fuzz testing (NCFT) [5,7]. NCFT employs mutation techniques to generate novel test cases. Additionally, certain NCFT approaches utilize metamorphic relations to alleviate the oracle problem, thereby enhancing the level of test automation. The core of NCFT is neuron coverage, which is used as a test criterion to guide NCFT to generate interesting seeds. The neuron coverage criterion is based on the assumption that neurons contain the logic of neural networks and was proposed to measure the inner behaviors of neural networks [14]. Moreover, multiple other coverage criteria based on neuron coverage have been proposed to explore the different logics of DNNs [15]. Although Harel-Canada et al. reject the hypothesis that neuron coverage is strongly and positively correlated with defect detection, this does not mean that neurons do not hold the logic of the DNN [16]. In addition, a number of interpretability methods for DNNs point out that neurons learn the features of samples through training [17,18]. Therefore, the assumption that neurons hold the logic of the neural network remains valid.

In traditional software, the major logic in the code decides the final output. To ensure the quality of software, the essential logic needs to be tested first, as it can cause serious harm if it hides bugs. Unlike traditional software, the major logic of a DNN is difficult to obtain due to the lack of interpretability of DNNs. Although a neuron holds logic of the DNN, it is difficult to determine the importance of this logic to the output of the DNN.

The DNN is a data-driven program that automatically gains logic from a large amount of training data. The neurons that have learned knowledge from more training data may hold more important logic. Therefore, this paper proposes critical neurons, which are neurons that are activated by many training samples. Critical neurons hold more important logic of the DNN than other neurons. In addition, a neuron contains different logic for different classes. In order to more accurately find the major logic of each output class, this paper defines class-critical neurons. To validate the importance of the critical neurons, we evaluate the impact of critical neurons on the final output of the DNN. The experimental results indicate that the critical neurons can decide the output of the DNN. In addition, the critical neurons hold more important logic than non-critical neurons in the DNN.

In order to balance the effectiveness and efficiency of testing, the critical neurons should be prioritized for testing. However, existing NCFT approaches do not take into account that different neurons contain logic of different importance. These NCFT approaches aim to cover all neurons and cannot prioritize the coverage of critical neurons. Moreover, a neuron has different degrees of influence on the output of the DNN for inputs of different classes. To explore the decision logic of the DNN at a fine granularity, NCFT should cover the critical neurons of each class. Therefore, a fuzz testing framework that can preferentially cover critical neurons is in dire need. This paper proposes a critical neuron coverage-guided fuzz testing framework, namely CriticalFuzz. In order to cover the critical neurons, we define critical neuron coverage. To ensure coverage of each class of critical neurons, we present class-critical neuron coverage. CriticalFuzz is able to obtain seeds that preferentially cover critical neurons based on the guidance of critical neuron coverage. Specifically, to improve the efficiency and effectiveness of covering critical neurons, we design an energy-based test case generation strategy that consists of energy-based seed selection, power schedule, and seed mutation. The energy-based seed selection and power schedule are proposed for the first time in order to generate test cases that can cover critical neurons with high probability. The energy of a seed is calculated based on whether it has covered the critical neurons. If the seed has covered certain critical neurons, more energy is allocated to it in order to cover more critical neurons. The energy-based seed selection strategy not only chooses seeds with few mutation times but also chooses seeds that can cover critical neurons as soon as possible. The power schedule method distributes more power to seeds that have more energy, and a seed with more power has more mutation chances. Moreover, we implemented CriticalFuzz and evaluated it on four popular DNN models. The experimental results demonstrate that CriticalFuzz achieves full 100% coverage of critical neurons, encompassing 10 classes of critical neurons, while also excelling in error detection by uncovering over a thousand faults across 10 distinct error classes within DNNs. Furthermore, comparative analysis shows that CriticalFuzz surpasses both DeepHunter [5] and TensorFuzz [7] in terms of efficiency during equivalent periods of testing.

The contributions of this paper can be summarized as follows.

• Critical neurons. We define the critical neurons and class-critical neurons to grasp the neurons containing the major logic of the DNN. Meanwhile, we evaluate the importance of critical neurons for the output result of the DNN.
• Framework. We design the first critical neuron coverage-guided fuzz testing framework, namely CriticalFuzz, with the goal of preferentially covering critical neurons. CriticalFuzz is mainly composed of an energy-based test case generation strategy and critical neuron coverage criteria. The energy-based test case generation strategy can generate test cases that have a greater chance to cover critical neurons. The critical neuron coverage criteria can guide CriticalFuzz to save seeds covering critical neurons for the next iteration. We have implemented CriticalFuzz and made the source code publicly available.1
• Evaluation. We conduct multiple experiments to evaluate CriticalFuzz, and the results show that CriticalFuzz is effective and efficient in covering critical neurons and revealing the error behaviors of DNNs.

2. Background

Traditional software testing often involves using coverage-based fuzz testing as a method to test the decision logic inside the program. This approach mainly includes test case generation and code coverage analysis [19] (see Fig. 1(a)). The process of test case generation involves seed selection and mutation. For example, a seed is selected based on whether the path covered by the seed is closer to the target path. Many mutation strategies have been designed and implemented, e.g., block operations, arithmetic operations, and semantic-preserving mutation. Moreover, code coverage filters the generated test cases in order to retain valuable test cases as seeds to be added to the seed queue for subsequent iterations. One form of code coverage strategy is saving test cases that can cover deeper paths as seeds. For example, in Fig. 1(a), when the existing seeds can execute the judgment statement that determines whether variable B is greater than 8, the code coverage strategy will select as a seed the test case that can execute the judgment statement determining whether variable C is greater than 12. Coverage-guided fuzz testing can improve test performance by automatically exploring different logics in the program to discover new bugs.

Different from traditional software, which consists of logical code, the DNN possesses a black-box nature and is complex, which makes it hard for coverage-guided fuzz testing designed for traditional software to cover the decision logic of the DNN. The proposal of neuron coverage [14]

1 https://github.com/Jackie-Bai888/criticalfuzz.git


Fig. 1. The coverage-guided fuzz testing for traditional software and the DNN.

enables coverage-guided fuzz testing to automatically explore the logic of the DNN using neuron coverage as feedback guidance (see Fig. 1(b)). With the introduction of various categories of neuron coverage criteria, NCFT can use more coverage criteria to test the DNN [5,7,20]. In addition, for the automatic judgment of test results, traditional fuzzers have clear test oracles for automatically judging abnormal program behavior (e.g., crashes). But, due to the complexity of the DNN, it is difficult for DNN fuzzers to automatically judge the test results. Therefore, NCFT uses mutation methods that do not change semantics, such as affine transformation and pixel transformation, to generate test cases in order to automatically judge the test results of the DNN. If there is a difference between the output of the DNN for the original input and the mutated input, it is considered that the DNN has a bug. Moreover, in order to enhance the effectiveness of the test, traditional fuzzers can select vital paths for priority coverage. However, existing NCFT approaches do not take into account the distinctness of individual neurons, thereby preventing the prioritized coverage of the essential logic within the DNN. For example, in Fig. 1(b), the majority of the training data activated the khaki neurons, while a smaller portion of the training data activated the green neurons. Consequently, the primary logic of the DNN is predominantly encoded within the khaki neurons. To enhance testing efficiency, NCFT should prioritize the coverage of the khaki neurons. In other words, in order to enhance the efficiency of fuzz testing for DNNs, it is advisable to identify crucial neurons and prioritize their coverage using NCFT.

3. Approach

3.1. Motivation

In order to ensure the quality of the DNN and reduce testing losses, testers need to prioritize testing the major logic of the DNN. Previous studies have shown that the neurons inside the DNN hold the logic of the DNN [17,18]. In order to test the major logic of the DNN first, it is necessary to identify discrepancies among neurons. The logic of the DNN determines the final output result, so neurons with varying levels of logic have distinct impacts on the ultimate output of the DNN. Intuitively, the neurons containing the major logic determine the output of the DNN, while other neurons have little impact on the output of the neural network. Hence, we mask some neurons (i.e., set the output of the neurons to zero) in trained DNN models to find the neurons that can influence the DNN output, as shown in Table 1. The second column shows the proportion of masked neurons to all neurons of the DNN models. The third column represents the proportion of inconsistent output results of the DNN model for the same 1000 input images after masking some neurons. We found that some neurons can affect the output of the DNN, while others do not. For example, for LeNet4, an initial masking of 37.68% of the neurons resulted in 100% inconsistent outputs; subsequently, masking the remaining 62.32% of the neurons decreased the occurrence of inconsistent outputs to 38.4%. Especially for VGG16, 39.8% of the neurons have no impact on the model's output. To explore the differences between individual neurons in a more fine-grained way, we compared the output of the model after masking a single neuron. For example, in Fig. 2, the recognition result of handwritten digit 5 is correct using LeNet4 with the third neuron of the second convolutional layer masked, but LeNet4 recognizes the handwritten digit 5 as 3 after masking the fourth neuron of the second convolutional layer. These phenomena indicate that the major logic of the DNN is hidden in a few neurons. Motivated by this discovery, the present study introduces a novel concept termed the critical neuron, a neuron that contains major logic of the DNN. This definition will be expounded upon in Section 3.2. Moreover, to enhance the efficiency of testing the DNN, the neurons that contain the major logic of the DNN should be covered first. Therefore, we propose CriticalFuzz, a critical neuron coverage-guided fuzz testing framework, in Section 3.3.

3.2. Critical neurons

The logic of the DNN is established through the training process, and upon completion of training, the DNN's logic becomes finalized. During training, in addition to the structure of the DNN, the training data determines the logic of the DNN. Therefore, we use the training


Table 1
The output inconsistency rate of four DNN models after masking neurons.

Model           Masked neurons   Inconsistent rate
LeNet4 [21]     37.68%           100%
                62.32%           38.40%
LeNet5 [21]     43.80%           100%
                56.20%           30.90%
ResNet20 [22]   29.65%           100%
                70.35%           70.80%
VGG16 [23]      60.20%           100%
                39.80%           0%

Fig. 2. The result of handwritten digit recognition using LeNet4 with partial neuron masking.

data and the output of neurons to find the neurons that contain the major logic of the DNN. To assess the significance of the logic within neurons, we propose the criticality level of a neuron in the DNN and identify the critical neurons based on this level.

Definition 1 (Criticality Level). For a neuron of the DNN, the criticality level of the neuron measures the importance of the neuron for the DNN. If a neuron is activated by a large amount of training data, its criticality level is large.

The criticality level of a neuron is calculated as follows:

cl(n, T) = |{x ∣ x ∈ T ∧ f(out(n, x) > t)}| / |T|   (1)

where n is a neuron of the DNN, T denotes the training data, x denotes a sample of the training data, out(n, x) denotes the output of the neuron n for a certain input x, t represents the predefined threshold for considering a neuron to be activated (the neuron is activated when out(n, x) is greater than the threshold t [14]), and f(⋅) denotes whether the neuron is activated: f(⋅) = 1 if the neuron is activated, and f(⋅) = 0 otherwise.

Definition 2 (Critical Neuron). A neuron is a critical neuron if its criticality level is larger than a predefined threshold ε.

The critical neuron is represented as:

cn(n, T) = 1 if cl(n, T) > ε, and 0 otherwise   (2)

where ε is the predefined threshold that determines whether a neuron is a critical neuron; cn(n, T) = 1 denotes that the neuron is a critical neuron, and vice versa. It is important to highlight that the threshold ε should not be confused with the threshold t in Eq. (1). The threshold t determines whether the neuron is activated, while the threshold ε determines whether the neuron is a critical neuron.

The critical neurons contained in a DNN are given by:

dcn(N, T) = {n ∣ n ∈ N ∧ cn(n, T) = 1}   (3)

where N denotes all neurons of the DNN and T is the training dataset.

The output of the DNN may have multiple classes for a dataset. For a DNN, the same output class should have similar decision logic. For example, for multiple handwritten digit input images that are predicted as digit 5, the decision logic of the DNN should be consistent. In order to explore the logic inside a single output class of the DNN at a finer granularity, we propose class-critical neurons. In the context of a DNN, the class-critical neurons are defined as the set of critical neurons that are essential for the DNN to output a specific class. The class-critical neurons can be mathematically represented as follows:

ccn(N, Tc) = {n ∣ n ∈ N ∧ cn(n, Tc) = 1}   (4)

where c is a specific output class of the DNN, Tc represents the training data of class c rather than the entire training dataset, N denotes all neurons of the DNN, and cn(n, Tc) is defined in Eq. (2).

3.3. Overview of CriticalFuzz

The essential logic of the DNN is stored in critical neurons. In order to enhance the efficiency of testing, critical neurons should be covered first. However, the existing NCFT treats all neurons equally and does not prioritize the coverage of critical neurons. To address this limitation, this paper introduces a novel approach called CriticalFuzz, as depicted in Fig. 3. CriticalFuzz comprises three essential components. First, the critical neurons of the DNN are obtained. Then, we propose energy-based test case generation, encompassing energy-based seed selection, power schedule, and seed mutation; the energy of each seed is distributed during the initial phases of CriticalFuzz. Finally, we further propose the critical neuron coverage criteria to guide the fuzzing process towards covering critical neurons.

Algorithm 1 The overall process of CriticalFuzz
Require: S: Seed set, T: Training dataset, DNN: Target pre-trained neural network
Ensure: F: Failed test set, U: Test set covering critical neurons
1: Nc ← getCriticalNeurons(T, DNN)
2: E ← calculateSeedSetEnergy(S, Nc, DNN)
3: S ← initializeSeedSet(S, E)
4: while s ← selectSeed(S) do
5:   p ← schedulePower(s)
6:   S′ ← mutate(s, p)
7:   for s′ ∈ S′ do
8:     if isFailed(s′, DNN) then
9:       F.append(s′)
10:    else if isCriticalCoverGain(s′, N) then
11:      U.append(s′)
12:      e ← calculateEnergy(s′, Nc, DNN)
13:      s′ ← initializeSeed(s′, e)
14:      S.append(s′)
15:    end if
16:  end for
17: end while
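For concreteness, Eqs. (1)–(3) can be sketched in a few lines of NumPy. The dense activation-matrix layout and the function names below are our own illustrative assumptions, not the paper's released implementation:

```python
import numpy as np

def criticality_levels(activations, t):
    """Eq. (1): fraction of training samples that activate each neuron.

    activations: array of shape (num_samples, num_neurons), holding out(n, x)
    for every neuron n and training sample x; t: activation threshold.
    """
    activated = activations > t          # f(out(n, x) > t) per sample and neuron
    return activated.mean(axis=0)        # cl(n, T) for each neuron

def critical_neurons(activations, t, eps):
    """Eqs. (2)-(3): indices of neurons whose criticality level exceeds eps."""
    cl = criticality_levels(activations, t)
    return np.flatnonzero(cl > eps)      # dcn(N, T)

# toy example: 4 training samples, 3 neurons
acts = np.array([[0.9, 0.1, 0.6],
                 [0.8, 0.0, 0.7],
                 [0.7, 0.2, 0.1],
                 [0.9, 0.0, 0.0]])
print(critical_neurons(acts, t=0.5, eps=0.6))  # → [0]: only neuron 0 fires in 4/4 samples
```

With t = 0.5 the per-neuron criticality levels are 1.0, 0.0, and 0.5, so only the first neuron clears ε = 0.6; Eq. (4) is the same computation restricted to the training samples of one class.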


Fig. 3. Overview of CriticalFuzz.

Algorithm 1 presents the process by which CriticalFuzz fuzz-tests critical neurons. It takes the initial seed set S, the training dataset T, and the target pre-trained neural network as inputs, and outputs the failed test set F and the test set U covering critical neurons. To cover class-critical neurons, we sample test cases from each class to form the initial seed set S. For a DNN, we first obtain the critical neuron set Nc based on the training dataset T and the pre-trained model DNN, as introduced in Section 3.2 (Line 1). To construct the seed set S, we calculate and distribute energy to each seed (Lines 2–3). Then, CriticalFuzz selects a seed s and schedules power in each iteration based on the energy of the seed (Lines 4–5). To generate new test cases S′, we mutate the seed s based on the power p (Line 6). The energy calculation and energy-based test case generation are introduced in Section 3.4.1. For every new test case s′, we ascertain whether it triggers errors in the DNN. If errors are indeed caused, s′ is included in the failed test set F (Lines 8–9). Furthermore, we evaluate whether s′ increases the critical neuron coverage criterion; if it does, s′ is appended to the set U and, after energy is distributed to it, to the set S (Lines 10–14). The critical neuron coverage criterion is described in Section 3.5.

3.4. Energy-based test case generation

To expedite the coverage of critical neurons, CriticalFuzz requires the generation of biased test cases that preferentially cover these neurons. However, the test cases generated by current NCFT are scattered and fail to prioritize the coverage of critical neurons. To address this issue, this paper proposes an energy strategy and adopts a mutation method put forward by DeepHunter [5] for test case generation.

3.4.1. Energy strategy

To prioritize the coverage of critical neurons, more energy should be allocated to seeds that have the potential to cover more critical neurons. To achieve this, an energy strategy is proposed for seed selection, and seeds with more energy are assigned a greater number of mutations based on a power schedule.

Energy Calculation. The essential logic of a DNN is comprised of the logic embedded within all critical neurons, thus establishing a stronger connection between critical neurons than between critical and non-critical neurons. Intuitively, if a seed activates more critical neurons, it has a higher probability of covering new critical neurons, so we need to give it more energy. Consequently, CriticalFuzz assigns more energy to seeds that have covered critical neurons. The energy is calculated as:

en(s) = 1 − (1/|N|) Σ_{i=1}^{|N|} (f(out(n_i, s) > t) − cn(n_i, T))²   (5)

where N is the set of all neurons in the DNN, n_i denotes the ith neuron of N, cn(n_i, T) indicates whether n_i is a critical neuron as in Eq. (2), and f(out(n_i, s) > t) indicates whether n_i is activated as in Eq. (1). When generating test cases covering a single class of critical neurons, we replace T with Tc, the training data of class c, as indicated in Eq. (4).

Energy-based Seed Selection. DeepHunter [5] proposes a seed selection strategy that probabilistically selects a seed based on the number of times it has been fuzzed: if a seed has been fuzzed few times, the probability of the seed being selected is high. This selection strategy increases the diversity of newly generated test cases. However, CriticalFuzz not only needs to consider the variety of the new test cases but also has to improve the probability that new test cases cover new critical neurons. Therefore, this paper proposes the energy-based seed selection strategy. For a seed s, the selection probability is calculated as follows:

sp(s) = (1 + en(s) − t(s)/δ) / 2 if t(s) < (1 − p_min) × δ, and p_min otherwise   (6)

where en(s) denotes the energy of the seed s as in Eq. (5), t(s) indicates the number of times the seed s has been fuzzed, δ controls the proportion of probability reduction, and p_min denotes the minimum probability with which a seed can be selected. When the seed s has been fuzzed a small number of times, the selection probability is based on the energy and the frequency of fuzzing; otherwise, the selection probability is set to the minimum value. This seed selection strategy ensures both the diversity of newly generated test cases and an increased likelihood of covering new critical neurons.

Power Schedule. After seed selection, it is imperative to maximize the potential of the chosen seed in order to generate new test cases that effectively cover a greater number of critical neurons. If the seed possesses more energy, it is expected to produce a greater number of novel test cases through mutation, increasing the probability of covering more critical neurons. Therefore, CriticalFuzz uses the power schedule to distribute power to the seed. Seeds with more power will generate more new test cases through mutation. Note that the concepts of power and energy in relation to a seed are distinct: power represents the number of mutations of the seed, while energy represents the probability of the seed covering new critical neurons. The power of a seed is calculated as follows:

po(s) = en(s) × m_max   (7)

where m_max is the maximum number of mutations and is used to balance the efficiency and effectiveness of the mutation. If the value of m_max is large, the duration of the mutation process will increase, resulting in a decrease in the efficiency of the testing procedure. Conversely, if the value of m_max is small, fewer new critical neurons will be covered, leading to a reduction in the effectiveness of the testing process.

3.4.2. Mutation method

The mutation method in CriticalFuzz is the metamorphic mutation proposed by Xie et al. in DeepHunter [5]. The reason CriticalFuzz chooses this method is that it provides multiple image transformation methods and uses a metamorphic relation to automatically determine whether the output of the DNN is correct. The mutation method adopts eight image transformations in two categories, as shown in Table 2. The difference between pixel transformation and affine transformation is that pixel transformation changes the pixel


Table 2
Image transformation methods.

Image transformation categories   Image transformation methods
Pixel transformation              image contrast, image brightness, image blur, and image noise
Affine transformation             image translation, image scaling, image shearing, and image rotation

value, while affine transformation moves the pixels of the image. The metamorphic relation is that if the semantics of the original seed s and the mutated test case s′ are the same, then the outputs of s and s′ should also be the same in a DNN. If the output of the DNN before and after seed mutation does not satisfy the metamorphic relation, the DNN has errors. To ensure semantic invariance, the metamorphic mutation uses a constraint function as follows:

cf(s, s′): m_p(s, s′) ≤ 255 if n_p(s, s′) < α × sz(s), and m_p(s, s′) ≤ β × 255 otherwise   (8)

where s is a seed, s′ denotes a new test case generated by mutating s, m_p(s, s′) is the maximum value of the changed pixels between s and s′, n_p(s, s′) represents the number of changed pixels, sz(s) denotes the total number of pixels of seed s, and 0 < α, β < 1 restrict the number and value of changed pixels, respectively. If fewer pixels are changed, the semantics are assumed to remain the same and the range of changes in the value of each pixel may be large; if a large number of pixels change, the semantics are maintained by limiting the range of pixel value changes.

3.5. Coverage criteria

Coverage criteria are used to guide NCFT in generating seeds that can cover more neurons. If newly generated test cases increase the coverage criteria, they are preserved as seeds to be utilized in subsequent iterations. Implementing coverage criteria can accelerate the NCFT process. However, the existing coverage criteria do not distinguish neurons, resulting in the inability to save the seeds that could cover critical neurons. Therefore, we propose the critical neuron coverage.

Definition 3 (Critical Neuron Coverage). For an input x, the critical neuron coverage can be defined as follows:

CNCov(T, x) = |{n_d ∣ ∀x ∈ T, f(out(n_d, x) > t)}| / |dcn|   (9)

where T = {x1, x2, …} is a set of test inputs, dcn denotes the set of critical neurons as in Eq. (3), n_d represents a critical neuron in dcn, and f(⋅) denotes whether the neuron is activated as in Eq. (1).

As Eq. (4) mentions, the class-critical neurons of the DNN are different for different classes. We need to ensure that the critical neurons of each class are covered. Therefore, we propose class-critical neuron coverage as follows:

CCCov(T, x) = |{n_c ∣ ∀x ∈ T, f(out(n_c, x) > t)}| / |ccn|   (10)

Table 3
Datasets and DNN models used in our study.

Dataset   DNN model   Accuracy   Number of parameters
MNIST     LeNet4      98.87%     25,010
          LeNet5      98.96%     44,426
CIFAR     ResNet20    91.74%     274,442
          VGG16       92.98%     15,001,418

RQ2 (Coverage): Can CriticalFuzz cover the critical neurons? How efficient is CriticalFuzz in covering critical neurons? We show the number of critical neurons CriticalFuzz covers and compare it with the baseline approaches.
RQ3 (Error Detection): Can CriticalFuzz find erroneous behaviors of DNNs? How effective is CriticalFuzz at detecting erroneous behaviors?

4.2. Datasets and DNN models

We selected two popular datasets, MNIST [24] and CIFAR-10 [25], to evaluate CriticalFuzz. MNIST is a handwritten digit dataset with 60 000 training and 10 000 test images; each image is a 28 × 28 grayscale image and contains a digit from 0 to 9. CIFAR-10 consists of 60 000 32 × 32 color images in 10 classes (e.g., airplane, bird, ship), with 6000 images per class; it includes 50 000 training images and 10 000 test images. For the MNIST dataset, we select LeNet4 and LeNet5 [21] for digit classification tasks. For the CIFAR dataset, we study ResNet20 [22] and VGG16 [23] to classify images. The accuracy and parameter counts of these DNN models are shown in Table 3.

4.3. Baseline approaches

To evaluate the effectiveness of CriticalFuzz, we compare it with DeepHunter and TensorFuzz. The comparison covers the ability to cover critical neurons and to detect misbehavior. DeepHunter contains multiple testing criteria; since the coverage granularity of critical neuron coverage is a single neuron, we adopt the neuron coverage testing criterion in DeepHunter. In order to ensure the validity of the comparison results, the mutation method of the baseline approaches is the same as that of CriticalFuzz, as described in Section 3.4.2.

4.4. Hyperparameters

As described in Section 3.5, t is used to determine whether a neuron is activated or not, and it affects the final number of critical neurons obtained. According to previous work [15,16], t ∈ {0, 0.2, 0.5, 0.75}. In order to get an appropriate number of critical neurons, we set
where 𝑐𝑐𝑛 denotes a set of class-critical neurons as formula (4), 𝑛𝑐 the values of 𝑡 to 0.2 and 0.5. Moreover, different DNN models have
represents a class-critical neuron in 𝑐𝑐𝑛. different numbers of neurons, the value of 𝑡 should be varied according
to the size of the DNN model. If the DNN model is large and the 𝑡 is
4. Experiment design set small, the final obtained critical neurons will contain unimportant
neurons. LeNet4 and LeNet5 contain fewer neurons, while ResNet20
4.1. Research questions and VGG16 contain more neurons. Therefore, we set the neuron acti-
vation threshold 𝑡 of 0.2 in LeNet4 and LeNet5, and 0.50 in ResNet20
RQ1(Critical Neurons): Do critical neurons hold the essential decision and VGG16. For different models, the critical neuron threshold 𝜀 needs
logic of DNN? In this RQ, we aim to explore the influence of critical to be dynamically adjusted. Notably, the permissible range for both 𝜀
neurons on the output of the DNN. In order to highlight the importance and 𝑡 spans from 0 to 1. Moreover, both 𝜀 and 𝑡 are hyperparameters
of critical neurons, we compare the impact of critical neurons and related to neuronal coverage. Consequently, we determine the value of
non-critical neurons on the DNN output results. 𝜀 by emulating the values of 𝑡 as suggested by DeepXplore [14], the

T. Bai et al. Information and Software Technology 172 (2024) 107476

Table 4
The proportion of critical neurons to all neurons in four DNN models.
Model 𝜀 Total | Output classes of DNN: 0 1 2 3 4 5 6 7 8 9
LeNet4 0.25 64.49% | 19.57% 22.46% 18.84% 10.87% 16.67% 16.67% 20.29% 23.91% 20.29% 17.39%
LeNet4 0.50 54.35% | 8.70% 12.32% 8.70% 7.97% 9.42% 8.70% 11.59% 16.67% 11.59% 10.14%
LeNet4 0.75 37.68% | 4.35% 7.97% 3.62% 3.62% 5.07% 4.35% 7.25% 6.52% 6.52% 5.07%
LeNet5 0.25 71.32% | 19.77% 24.42% 19.38% 15.12% 18.99% 15.50% 17.83% 15.89% 20.54% 18.99%
LeNet5 0.50 60.47% | 12.40% 18.60% 11.24% 8.14% 13.95% 7.75% 10.47% 9.30% 13.18% 12.02%
LeNet5 0.75 43.80% | 7.36% 12.79% 5.43% 5.43% 7.75% 4.65% 7.36% 4.26% 5.43% 7.36%
ResNet20 0.25 68.43% | 50.17% 47.07% 46.57% 47.74% 47.07% 47.24% 46.65% 49.50% 50.75% 48.83%
ResNet20 0.50 50% | 31.91% 30.23% 27.47% 29.15% 27.97% 31.74% 31.91% 32.33% 33.84% 30.65%
ResNet20 0.75 29.65% | 18.26% 18.17% 15.16% 16.75% 16.33% 17.17% 19.26% 17.76% 18.26% 17.09%
VGG16 0.25 81.92% | 47.70% 42.89% 51.27% 47.38% 48.60% 49.50% 50.03% 46.67% 48.47% 48.04%
VGG16 0.50 74.62% | 37.21% 32.20% 36.70% 35.00% 36.93% 38.12% 36.39% 34.09% 38.21% 36.30%
VGG16 0.75 60.20% | 26.10% 23.56% 24.72% 24.32% 26.33% 28.22% 25.30% 25.60% 28.40% 26.55%
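As a concrete illustration, the mutation constraint 𝑐𝑓 of Eq. (8) can be sketched in Python as follows. This is a minimal NumPy sketch under our own naming; it is not taken from the CriticalFuzz implementation.

```python
import numpy as np

def satisfies_constraint(seed, mutant, alpha=0.02, beta=0.2):
    """Check the metamorphic mutation constraint cf(s, s') of Eq. (8).

    seed, mutant: uint8 image arrays of identical shape (pixel values 0..255).
    alpha bounds the fraction of changed pixels; beta bounds the change size.
    """
    diff = np.abs(mutant.astype(np.int32) - seed.astype(np.int32))
    n_changed = int(np.count_nonzero(diff))   # n_p(s, s')
    max_change = int(diff.max())              # m_p(s, s')
    size = seed.size                          # sz(s)
    if n_changed < alpha * size:
        # Few pixels changed: each may change by up to the full 255 range.
        return max_change <= 255
    # Many pixels changed: every change must stay within beta * 255.
    return max_change <= beta * 255
```

With 𝛼 = 0.02 and 𝛽 = 0.2, the values used in the experiments (Section 4.4), a mutant that changes more than 2% of the pixels is accepted only if no pixel moves by more than 0.2 × 255 = 51 grey levels.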

originator of the neuron coverage metric. To compare the performance of CriticalFuzz under different critical neuron thresholds 𝜀, we set 𝜀 to 0.25, 0.50 and 0.75. The 𝑚𝑚𝑎𝑥 is set to 200 in all experiments. We set 𝛼 = 0.02 and 𝛽 = 0.2 following DeepHunter. The mutation constraint parameter 𝐿∞ of TensorFuzz is set to 1.

5. Result analysis

5.1. Answer to RQ1: Critical neurons

In this RQ, we first obtain the critical neurons of the DNN models according to Definitions 1 and 2. The proportion of critical neurons to all neurons of the DNN models is shown in Table 4. The third column shows the proportion of total critical neurons to all neurons of the DNN models. The proportion of critical neurons in all DNN models is less than 100%, which means that only part of the neurons in the DNN models are critical neurons. In particular, when 𝜀 is 0.75, the fraction of critical neurons is less than 50%. The last 10 columns show the proportion of class-critical neurons in each class to all neurons of the DNN models. The number of class-critical neurons differs across classes, which indicates that different classes contain different critical neurons. In addition, for LeNet4 and LeNet5, the proportion of critical neurons in each class is less than 50%. In other words, the number of critical neurons in each class is smaller than the number of non-critical neurons. ResNet20 and VGG16 have a higher proportion of critical neurons in each class compared to LeNet4 and LeNet5, but most of them are still less than 50%. Furthermore, the proportion of critical neurons and class-critical neurons in all DNN models decreases as 𝜀 increases, as defined by Definition 2. Note that the total set of critical neurons of each model is the union of the class-critical neurons of all classes.

To verify whether the critical neurons contain the major logic of the DNN, we modify the output of the critical neurons to zero and check whether the predicted result is inconsistent with the original result. Intuitively, a critical neuron holds the major logic of the DNN if the output of the DNN changes after the modification, and vice versa. In this experiment, we sample 1000 training images to verify whether the critical neurons contain the major logic of the DNN. In order to verify that critical neurons contain more important logic of the DNN than the other neurons (non-critical neurons), we also set the output of the non-critical neurons to zero and compare the inconsistent rate of the DNN output with that of the critical neurons.

Table 5 shows the inconsistent rate after modifying all neurons in the critical and non-critical sets. To sufficiently verify the significance of critical neurons, we modify neurons in each configuration of 𝜀. For each critical neuron threshold 𝜀, we show the inconsistent rate of critical neurons (Inc.CN) and the inconsistent rate of non-critical neurons (Inc.NCN).

Table 5
The inconsistent rate of four DNN models after modifying the output of neurons to zero (Inc.CN / Inc.NCN).
Model | 𝜀 = 0.25 | 𝜀 = 0.50 | 𝜀 = 0.75
LeNet4 | 100% / 19.30% | 100% / 28.70% | 100% / 38.40%
LeNet5 | 100% / 9.40% | 100% / 30.90% | 100% / 30.90%
ResNet20 | 100% / 70.80% | 100% / 70.80% | 100% / 70.80%
VGG16 | 100% / 0% | 100% / 0% | 100% / 0%

For all models and all values of 𝜀, the inconsistent rate of critical neurons is 100% in Table 5, which indicates that the critical neurons hold the major logic of the DNN model. In addition, we can find that the inconsistent rate of non-critical neurons is less than 40%, except for ResNet20, which indicates that non-critical neurons hold only a small amount of the DNN logic. For ResNet20, although the inconsistent rate of non-critical neurons is higher than in the other models, it is still lower than 100%, and thus lower than the inconsistent rate of critical neurons. These results demonstrate that the critical neurons contain more of the DNN models' logic than the non-critical neurons.

In order to confirm at a finer granularity that the critical neurons of each class contain the important logic of that class in the DNN, we modify the output of the class-critical neurons of each class to zero and check whether the predicted result is inconsistent with the original result. Unlike before, the input here is the input data corresponding to that class. For example, for LeNet4, when we verify whether the class-critical neurons of class 1 contain the important logic, we input the images containing the handwritten digit 1.

Table 6 shows the inconsistent rate after modifying critical neurons and non-critical neurons of different classes in the four DNN models. For each class, Table 6 lists the inconsistent rate of the class-critical neurons first, followed by that of the non-critical neurons. For all classes and all 𝜀, the inconsistent rate after modifying the critical neurons of each class is 100%, despite each class having fewer critical neurons. In particular, for 𝜀 = 0.75, the proportion of critical neurons is small, but the model still makes errors after modifying the output of the critical neurons. For example, when 𝜀 is 0.75, the proportion of critical neurons of class 0 is only 4.35% in LeNet4, but the model still exhibits 100% error behavior after modifying the output of the critical neurons. Moreover, although non-critical neurons outnumber critical neurons in most classes, most of the inconsistent rates after modifying non-critical neurons are 0%. The result indicates that the critical neurons of a class hold more important logic of that class than the non-critical neurons in the DNN models. Therefore, we should spend more effort covering critical neurons.

Answer to RQ1: The critical neurons hold more significant logic than non-critical neurons in DNN.
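The zero-ablation procedure behind Tables 5 and 6 can be sketched as follows. This is an illustrative NumPy toy with a randomly initialized two-layer network, written by us for exposition; it is not the experimental code used on the four DNN models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: x -> ReLU(W1 @ x) -> logits = W2 @ h.
W1 = rng.normal(size=(32, 64))
W2 = rng.normal(size=(10, 32))

def predict(x, zeroed=()):
    """Forward pass that forces the output of selected hidden neurons to zero."""
    h = np.maximum(W1 @ x, 0.0)
    h[list(zeroed)] = 0.0          # the zero-ablation step
    return int(np.argmax(W2 @ h))

def inconsistent_rate(inputs, zeroed):
    """Fraction of inputs whose predicted class flips after ablation."""
    flips = sum(predict(x) != predict(x, zeroed) for x in inputs)
    return flips / len(inputs)
```

Zeroing the critical set and the non-critical set separately, then comparing the two resulting rates, is the Inc.CN versus Inc.NCN comparison of Table 5; restricting the inputs to one class gives the class-wise comparison of Table 6.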


Table 6
The inconsistent rate of different classes in four DNN models after modifying the output of neurons to zero. In each class column, the first value is the inconsistent rate after ablating the class-critical neurons and the second is that of the non-critical neurons (Inc.CN / Inc.NCN).
Model 𝜀 | classes 0 1 2 3 4 5 6 7 8 9
LeNet4 0.25 | 100%/0% 100%/0% 100%/100% 100%/100% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0%
LeNet4 0.50 | 100%/0% 100%/0% 100%/100% 100%/100% 100%/0% 100%/0% 100%/100% 100%/0% 100%/0% 100%/0%
LeNet4 0.75 | 100%/0% 100%/0% 100%/100% 100%/100% 100%/0% 100%/0% 100%/100% 100%/0% 100%/0% 100%/100%
LeNet5 0.25 | 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/100% 100%/0% 100%/0% 100%/0%
LeNet5 0.50 | 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/100% 100%/100% 100%/0% 100%/100%
LeNet5 0.75 | 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/100% 100%/100% 100%/0% 100%/100%
ResNet20 0.25 | 100%/100% 100%/100% 100%/0% 100%/0% 100%/0% 100%/100% 100%/100% 100%/100% 100%/100% 100%/100%
ResNet20 0.50 | 100%/100% 100%/100% 100%/0% 100%/0% 100%/0% 100%/100% 100%/100% 100%/100% 100%/100% 100%/100%
ResNet20 0.75 | 100%/100% 100%/100% 100%/0% 100%/0% 100%/0% 100%/100% 100%/100% 100%/100% 100%/100% 100%/100%
VGG16 0.25 | 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0%
VGG16 0.50 | 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0%
VGG16 0.75 | 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0% 100%/0%

Table 7
The critical neuron coverage of four DNN models in five minutes by CriticalFuzz.
Model 𝜀 = 0.25 𝜀 = 0.50 𝜀 = 0.75
Init. CriticalFuzz Init. CriticalFuzz Init. CriticalFuzz
LeNet4 62.4% 98.7% 82.2% 100% 97.9% 100%
LeNet5 67.4% 99.1% 84.2% 100% 96.5% 100%
ResNet20 63.3% 92.7% 79.2% 98.3% 90.7% 100%
VGG16 73.1% 77.9% 84.5% 92.0% 92.3% 94.9%
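The critical neuron coverage of Eq. (9) reported in Table 7 can be computed as sketched below. The activation-matrix interface is our assumption for illustration; CriticalFuzz's internal representation may differ.

```python
import numpy as np

def critical_neuron_coverage(activations, critical_ids, t=0.2):
    """CNCov of Eq. (9).

    activations: array of shape (num_inputs, num_neurons) holding the
        neuron outputs out(n, x) for every input in the test set T.
    critical_ids: indices of the critical neurons (the set dcn).
    t: activation threshold; a neuron counts as covered if at least one
       input drives its output above t.
    """
    crit = np.asarray(activations)[:, list(critical_ids)]
    covered = (crit > t).any(axis=0)        # covered by at least one input
    return covered.sum() / len(critical_ids)
```

Restricting the inputs to a single class and passing the class-critical set 𝑐𝑐𝑛 instead of 𝑑𝑐𝑛 yields the class-critical neuron coverage of Eq. (10).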

5.2. Answer to RQ2: Coverage

In this RQ, we need to explore whether CriticalFuzz can cover more critical neurons. From the definition of critical neurons, the smaller the hyperparameter 𝜀, the more critical neurons can be obtained. Therefore, we set the value of 𝜀 to 0.25 in this experiment to collect more critical neurons.

Critical neuron coverage. The proportion of critical neurons covered by CriticalFuzz is shown in Table 7, where the column Init. shows the critical neuron coverage achieved by the initial seeds. The larger the 𝜀, the greater the critical neuron coverage of Init. The reason is that the larger 𝜀 is, the fewer critical neurons are obtained, and hence the higher the critical neuron coverage of the DNN is according to Formula (9).

From Table 7, we can observe that CriticalFuzz can cover new critical neurons for different DNN models. After five minutes, the critical neurons covered by CriticalFuzz reached more than 90% of all critical neurons in the DNN, except when 𝜀 was 0.25 in VGG16. Nevertheless, CriticalFuzz also covered new critical neurons when 𝜀 was 0.25 in VGG16, and the critical neuron coverage increased from 73.1% to 77.9%. Moreover, CriticalFuzz can cover all critical neurons of a DNN model; e.g., the critical neuron coverage is 100% when 𝜀 is 0.50 in LeNet4 and LeNet5.

To ensure that CriticalFuzz adequately tests the logic of each class of the DNN, we calculate the class-critical neuron coverage of CriticalFuzz for each class in the DNN according to Formula (10). The experiment results of CriticalFuzz are shown in Table 8. Each of the four DNNs has 10 output classes. For each class, we calculated the class-critical neuron coverage of both the initial seeds and CriticalFuzz over a five-minute period.

From Table 8, we can observe that CriticalFuzz can increase the critical neuron coverage of multiple classes in the DNN. Especially for LeNet4 and LeNet5, CriticalFuzz can increase the critical neuron coverage of every class, and the increase is large. For example, CriticalFuzz covers all critical neurons of each class when 𝜀 is 0.5 in LeNet4. For ResNet20 and VGG16, although CriticalFuzz cannot cover new critical neurons in all classes for every 𝜀, it can still cover multiple classes of critical neurons. Especially for ResNet20, CriticalFuzz can cover all classes of critical neurons when 𝜀 is 0.75. The possible reason why CriticalFuzz does not cover all classes of critical neurons is that the network structure of ResNet20 and VGG16 is complex compared to LeNet4 and LeNet5, which prevents CriticalFuzz from exploring the critical neurons of all classes at the same time.

Efficiency of CriticalFuzz. To investigate the efficiency of CriticalFuzz, we compare the critical neuron coverage of CriticalFuzz with that of the other two baseline approaches. The initial seeds can affect the result of fuzz testing. To ensure the correctness of the comparison results, we use the same initial seeds in all three approaches. The experiment results are shown in Fig. 4.

From Fig. 4, we can observe that CriticalFuzz covers more critical neurons than DeepHunter and TensorFuzz at any value of 𝜀. For LeNet4, LeNet5, and ResNet20, the final critical neuron coverage of CriticalFuzz reaches 100%, while in the other two approaches it is at most 98.9%. In particular, CriticalFuzz covers more critical neurons than DeepHunter and TensorFuzz when 𝜀 is 0.25. For VGG16, the growth rate of CriticalFuzz's critical neuron coverage is low, but it is still higher than that of the other two approaches. For example, when 𝜀 is 0.75, CriticalFuzz increases the critical neuron coverage by 2.6% compared to the initial seeds, while the other two methods increase it by only 0.2% and 1.6%, respectively.

In order to fully validate the efficiency of CriticalFuzz in covering class-critical neurons, we calculate the class-critical neuron coverage of the baseline approaches for each class and compare the results with CriticalFuzz. The comparison results are shown in Fig. 5. The numerical labels in the figure represent the class-critical neuron coverage of CriticalFuzz.

For LeNet4 and LeNet5, we can discover that DeepHunter and TensorFuzz focus on specific classes of critical neurons, whereas CriticalFuzz has the capability to cover all classes of critical neurons. For example, in the case of LeNet5, when the threshold 𝜀 is set at 0.25, TensorFuzz and DeepHunter target critical neurons in classes 1, 4 and 8, while CriticalFuzz is able to cover critical neurons across all classes. For ResNet20 and VGG16, CriticalFuzz covers more classes than TensorFuzz and DeepHunter, although it cannot cover all classes of critical neurons. Particularly in the case of VGG16, with a threshold 𝜀 value of 0.25, CriticalFuzz targets two classes of critical neurons, whereas DeepHunter and TensorFuzz focus on only one class. In addition to the breadth of class-critical neuron coverage, we also contrast the depth of coverage of CriticalFuzz, TensorFuzz, and DeepHunter. Note that in order to compare the depth of coverage, we compare the number of critical neurons covered in the classes that can be covered by all three NCFT methods. For all tested DNN models, our analysis reveals that CriticalFuzz demonstrates comparable effectiveness to the other baseline approaches in terms of covering a single class of critical neurons. For instance, in the case of ResNet20, with a threshold 𝜀 of 0.75, both


Table 8
The class-critical neuron coverage of four DNN models in five minutes by CriticalFuzz.
Model 𝜀 Approach | classes 0 1 2 3 4 5 6 7 8 9
LeNet4 0.25 Init. | 59.3% 58.1% 69.2% 66.7% 60.9% 60.9% 53.6% 75.8% 57.1% 62.5%
LeNet4 0.25 CriticalFuzz | 100.0% 96.8% 100.0% 93.3% 100.0% 100.0% 100.0% 100.0% 96.4% 100.0%
LeNet4 0.50 Init. | 91.7% 82.4% 100.0% 81.8% 76.9% 75.0% 68.8% 91.3% 68.8% 85.7%
LeNet4 0.50 CriticalFuzz | 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0%
LeNet4 0.75 Init. | 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 90.0% 100.0% 88.9% 100.0%
LeNet4 0.75 CriticalFuzz | 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0%
LeNet5 0.25 Init. | 66.7% 79.4% 60.0% 66.7% 69.4% 62.5% 65.2% 78.0% 56.6% 69.4%
LeNet5 0.25 CriticalFuzz | 100.0% 100.0% 100.0% 97.4% 100.0% 100.0% 100.0% 100.0% 98.1% 95.9%
LeNet5 0.50 Init. | 87.5% 95.8% 75.9% 81.0% 80.6% 80.0% 85.2% 95.8% 73.5% 87.1%
LeNet5 0.50 CriticalFuzz | 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0%
LeNet5 0.75 Init. | 100.0% 97.0% 100.0% 85.7% 100.0% 100.0% 94.7% 100.0% 92.9% 94.7%
LeNet5 0.75 CriticalFuzz | 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0%
ResNet20 0.25 Init. | 63.3% 50.0% 48.9% 71.2% 64.4% 74.8% 76.8% 64.0% 61.9% 57.3%
ResNet20 0.25 CriticalFuzz | 99.3% 99.8% 99.8% 100.0% 64.4% 100.0% 99.8% 64.0% 100.0% 100.0%
ResNet20 0.50 Init. | 82.2% 66.8% 64.0% 87.1% 83.2% 89.2% 89.0% 76.9% 79.0% 74.9%
ResNet20 0.50 CriticalFuzz | 100.0% 100.0% 100.0% 100.0% 83.2% 100.0% 99.7% 99.7% 100.0% 100.0%
ResNet20 0.75 Init. | 91.3% 82.5% 75.1% 95.0% 94.4% 97.6% 97.4% 87.7% 96.3% 90.2%
ResNet20 0.75 CriticalFuzz | 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0%
VGG16 0.25 Init. | 79.8% 78.2% 69.7% 72.4% 75.3% 67.1% 63.0% 79.5% 77.0% 69.4%
VGG16 0.25 CriticalFuzz | 79.8% 78.2% 69.7% 100.0% 75.3% 67.1% 63.0% 99.8% 77.0% 69.4%
VGG16 0.50 Init. | 89.8% 91.2% 86.1% 81.0% 83.1% 75.5% 72.6% 94.1% 86.9% 84.4%
VGG16 0.50 CriticalFuzz | 99.9% 99.9% 86.1% 81.0% 83.1% 99.2% 100.0% 99.9% 86.9% 84.4%
VGG16 0.75 Init. | 97.1% 97.7% 98.2% 85.8% 91.0% 83.6% 78.9% 98.9% 95.8% 95.7%
VGG16 0.75 CriticalFuzz | 100.0% 100.0% 100.0% 100.0% 91.0% 83.6% 78.9% 100.0% 95.8% 100.0%

Fig. 4. Critical neuron coverage of four DNN models in five minutes by CriticalFuzz, DeepHunter and TensorFuzz.
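All three fuzzers compared in Fig. 4 follow the seed-keeping rule described in Section 3.5: a mutant is retained as a new seed only if it is a valid mutation and raises the guiding coverage criterion. A minimal sketch of this generic NCFT loop follows; the mutate, coverage_gain and is_valid callables are placeholders of our own, not the actual CriticalFuzz API.

```python
import random

def fuzz(initial_seeds, mutate, coverage_gain, is_valid, budget=1000):
    """Minimal coverage-guided loop: a mutant becomes a new seed only if it
    satisfies the mutation constraint and raises the guiding coverage."""
    seeds = list(initial_seeds)
    new_seeds = []
    for _ in range(budget):
        seed = random.choice(seeds)
        mutant = mutate(seed)
        if not is_valid(seed, mutant):
            continue                   # violates the constraint of Eq. (8)
        if coverage_gain(mutant) > 0:  # covers previously uncovered neurons
            seeds.append(mutant)
            new_seeds.append(mutant)
    return new_seeds
```

In CriticalFuzz, the coverage_gain role is played by critical neuron coverage, which is what steers the search toward the critical neurons rather than toward neurons in general.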


Fig. 5. Class neuron coverage for each class by CriticalFuzz, DeepHunter and TensorFuzz in five minutes.

CriticalFuzz and DeepHunter achieve full coverage of critical neurons for class 1. Similarly, in the context of LeNet5, with a threshold 𝜀 of 0.5, both CriticalFuzz and TensorFuzz achieve complete coverage of critical neurons for class 5.

Answer to RQ2: CriticalFuzz can prioritize covering the critical neurons of DNN, and has a wider coverage range and higher efficiency compared to DeepHunter and TensorFuzz.

5.3. Answer to RQ3: Error detection

To verify the ability of CriticalFuzz to find errors, we compared the number of errors found by CriticalFuzz and the two baseline approaches. The definition of an error in this paper is that the model outputs differ for the initial seed and the mutated seed, as defined by DeepHunter. For the initial seeds, we selected correct and identical seeds for the three approaches.

Table 9 shows the number of errors found by CriticalFuzz, DeepHunter and TensorFuzz. CriticalFuzz detects more errors than the other approaches for the same time budget and 𝜀, especially for LeNet4. This result
efficiency compared to DeepHunter and TensorFuzz.
pHunter and TensorFuzz. CriticalFuzz detects more errors than other
approaches at the same time and 𝜀, especially for LeNet4. This result


Table 9
The number of errors detected in five minutes by CriticalFuzz, DeepHunter and TensorFuzz.
Model Approach | 𝜀 = 0.25 | 𝜀 = 0.50 | 𝜀 = 0.75
LeNet4 CriticalFuzz | 3322 | 5704 | 9064
LeNet4 DeepHunter | 139 | 30 | 83
LeNet4 TensorFuzz | 28 | 27 | 27
LeNet5 CriticalFuzz | 1359 | 2273 | 2450
LeNet5 DeepHunter | 41 | 84 | 227
LeNet5 TensorFuzz | 301 | 16 | 9
ResNet20 CriticalFuzz | 2677 | 2438 | 3456
ResNet20 DeepHunter | 907 | 95 | 181
ResNet20 TensorFuzz | 7 | 122 | 2707
VGG16 CriticalFuzz | 451 | 585 | 475
VGG16 DeepHunter | 241 | 37 | 82
VGG16 TensorFuzz | 100 | 44 | 87
Total CriticalFuzz | 7809 | 11 000 | 15 445
Total DeepHunter | 1328 | 246 | 573
Total TensorFuzz | 436 | 209 | 2830

indicates that CriticalFuzz is more effective for testing DNNs. The purpose of CriticalFuzz is to cover more critical neurons, which hide more of the DNN's logic than other neurons. That CriticalFuzz finds more errors indicates that the critical neurons are weaker than the others. In addition, the larger 𝜀 is, the more DNN logic the acquired critical neurons hide, according to formula (4). From the last rows of Table 9, we can find that the total number of errors found by CriticalFuzz increases as 𝜀 increases. This result suggests that neurons holding more logical information are more vulnerable.

The diversity of errors detected by a testing method indicates that the testing method can explore more of the different logic of a DNN. Therefore, we classified the errors detected by CriticalFuzz and counted the number of errors contained in each type. Note that the error types here refer to the DNN output classes that are detected to have errors by NCFT. For example, error type 1 means that NCFT found a DNN error that misclassified data of class 1 into another class. Furthermore, we compare the error detection performance of CriticalFuzz, DeepHunter and TensorFuzz.

Fig. 6 shows the error types and the number of errors detected by the three fuzz testing approaches for each error type under different 𝜀 (i.e., 0.25, 0.50 and 0.75). The numerical symbols from 0 to 9 located above the figure indicate the specific error types. Because the DNN models under evaluation have 10 output classes, there exist 10 distinct error types. For LeNet4 and LeNet5, CriticalFuzz is able to detect errors in every error type, while DeepHunter and TensorFuzz only find errors in a few error types. For example, for LeNet4, the numbers of error types in which CriticalFuzz, DeepHunter and TensorFuzz detect errors are 10, 3 and 1, respectively, when 𝜀 is 0.25. For ResNet20 and VGG16, CriticalFuzz does not detect every type of error for every 𝜀, but it detects more error types containing errors than DeepHunter and TensorFuzz. Therefore, CriticalFuzz can find more categories of errors than DeepHunter and TensorFuzz. The principle is that each output class of a DNN has different critical neurons, and CriticalFuzz explores the major logic corresponding to each class when multiple types of input are used for neuron coverage. Moreover, the discovery that DeepHunter and TensorFuzz identified fewer error types suggests that their focus was primarily on covering more neurons, without taking into account the distinctions between individual neurons during the testing process. In addition, we can discover that the number of error types found by CriticalFuzz increases as 𝜀 increases. The possible reason is that the number of critical neurons in each class decreases as 𝜀 increases, which may leave CriticalFuzz with more energy to explore different classes of critical neurons within the same amount of time. In relation to the number of errors identified within each error category, CriticalFuzz may not consistently surpass the performance of DeepHunter and TensorFuzz. However, the effectiveness of CriticalFuzz extends beyond detecting individual errors for each error type. For example, when considering ResNet20 with a threshold 𝜀 of 0.75 and error type 6, while TensorFuzz identifies 2707 errors, CriticalFuzz also detects 146 errors, a figure close to the error count revealed by DeepHunter. Particularly in the case of LeNet4 and LeNet5, when 𝜀 is greater than 0.25, CriticalFuzz discovers a much greater number of errors than DeepHunter and TensorFuzz for the same type of error. These results indicate that CriticalFuzz can explore different logics of a DNN, and compared to the other two baseline approaches, it explores more of them.

Answer to RQ3: CriticalFuzz is able to detect a larger number and more categories of errors than DeepHunter and TensorFuzz.

6. Discussion

6.1. The significance of critical neurons

In order to ascertain the significance of critical neurons, we identify the critical neurons associated with various values of 𝜀 and subsequently modify the output of these critical neurons to zero. Based on the findings of RQ1, it is evident that critical neurons exert a complete influence of 100% on the output of the DNN models. Conversely, non-critical neurons, despite outnumbering critical neurons, exhibit a relatively low proportion of impact on the output of the DNN models. Furthermore, in order to explore in a more fine-grained way whether the class-critical neurons of a certain class contain the essential logic of that class in the DNN models, we modify the class-critical neurons of this class. Building upon the findings of RQ1, this further proves that these class-critical neurons possess significant logical information when compared to non-critical neurons within the same class.

6.2. The effectiveness of CriticalFuzz

To check the effectiveness of CriticalFuzz, we use it to cover critical neurons and class-critical neurons of the DNN models. From the results of RQ2, we can find that CriticalFuzz is capable of efficiently covering a significant proportion of critical neurons within a brief timeframe, and in some cases it can even achieve complete coverage. Additionally, CriticalFuzz demonstrates the ability to comprehensively cover class-critical neurons across various classes in the DNN models. In addition, to verify the efficiency of CriticalFuzz, we compare its performance with other fuzz testing approaches for DNNs, i.e., DeepHunter and TensorFuzz. From the results of RQ2, we can find that CriticalFuzz can cover a greater number of critical neurons and more kinds of class-critical neurons compared to the other two baseline approaches. Additionally, we conducted further verification to determine whether CriticalFuzz is capable of identifying errors in DNN models. From the results of RQ3, CriticalFuzz is able to find a greater number and variety of errors than the other two baseline approaches.

Critical neuron coverage is the guiding criterion of CriticalFuzz. Previous work [16,26,27] found limitations in neuron coverage, so we discuss the validity of critical neuron coverage. Harel-Canada et al. [16] have refuted the notion that there is a positive correlation between neuron coverage and error detection. However, 10 out of 12 experimental results from Table 7 and Table 9 indicate that as critical neuron coverage increases with the increase in parameter 𝜀, more errors are detected. There exist two experimental outcomes that deviate from the majority. For example, for VGG16, we observed that increasing critical neuron coverage from 92% (𝜀 = 0.5) to 94.9% (𝜀 = 0.75) did not result in a higher number of errors. This discrepancy may be attributed to
explore different classes of critical neurons within the same amount of in a higher number of errors. This discrepancy may be attributed to


Fig. 6. The number of errors for each error type detected by CriticalFuzz, DeepHunter and TensorFuzz in five minutes.

differences in the critical neurons identified at different 𝜀 values, thereby preventing a proportional rise in defect detections with increased coverage. Additionally, the bias of neuron coverage towards specific output classes, as highlighted by Harel-Canada et al., was also supported by our RQ2 experiment results. Furthermore, we confirm that our proposed critical neuron coverage holds the ability to encompass a broader range of critical neurons. Therefore, critical neuron coverage should be a better choice for NCFT methods.

6.3. Future works

The experiments in this paper have proven that critical neurons hold the major logic of DNN models, and that CriticalFuzz is an efficient and effective approach for covering critical neurons. There are some suggestions to further extend critical neurons and CriticalFuzz.

Firstly, we verify that critical neurons hold the essential logic of DNN models for classification tasks. In the future, we will validate critical neurons in DNN models for other tasks, for example, object recognition, semantic segmentation, etc.

Secondly, CriticalFuzz mainly aims to cover critical neurons, and has discovered many errors. However, while non-critical neurons hold the minor logic of the DNN, the non-critical neurons also need to be explored in order to ensure the safety of the DNN. Therefore, in the future, we will further optimize CriticalFuzz so that it can give the tester a choice of the type of neurons that need to be covered and can dynamically adjust its own energy to cover the target neurons.


Thirdly, CriticalFuzz employs metamorphic mutation as a means of conducting mutation and automatically evaluating the output of DNN models. Nevertheless, the utilization of metamorphic mutation imposes constraints on the range of seed variations, thereby diminishing the diversity of newly generated test cases. Moving forward, our intention is to enhance the range of seed variations while simultaneously automating the evaluation of DNN output, thereby fostering the generation of more diverse test cases.

6.4. Threats to validity

Test subject and dataset selection. The selection of the DNN models and datasets is a threat to validity. Deep neural network models encompass a variety of architectures, such as convolutional neural networks (CNN) [28] and transformers [29], the latter being particularly notable for its exceptional performance in natural language processing tasks. However, existing neuron coverage metrics are mostly designed for conventional fully connected and convolutional layers, and they may not have been adjusted to suit distinctive architectures like the transformer models. Moreover, to adequately assess the performance of CriticalFuzz, it is imperative to compare it against other baseline methods using the same set of models. Therefore, we mitigate the risk by choosing four widely recognized CNN models, which are also utilized by the other baseline methods. In addition, we perform experiments on two widely adopted open-source datasets to avoid the potential impact of datasets.

Hyperparameters. Another threat is the configurable hyperparameters in the critical neuron definition. We set different values for the same parameter and conduct corresponding experiments to eliminate the potential impact of hyperparameters. In each experiment, the hyperparameter 𝜀 is assigned values of 0.25, 0.5, and 0.75 for obtaining critical neurons. The hyperparameter 𝑡 is adjusted based on the number of neurons in the DNN models to activate and cover critical neurons. For the other hyperparameters in the energy-based test case generation, we set the value of the parameter following DeepHunter to ensure the rationality of the

the quality of test data, and define source-level mutation operators and model-level mutation operators. Kim et al. [39] proposed a surprise adequacy criterion for sample inputs to improve the accuracy of DNN models. The surprise adequacy criterion contains likelihood-based surprise adequacy and distance-based surprise adequacy.

To consolidate efficiency and effectiveness, some researchers used fuzz testing to find errors and explore the different states of a DNN. DLFuzz [20] is the first fuzz testing framework to expose incorrect behaviors of DNNs. DLFuzz can select neurons through multiple neuron selection strategies. TensorFuzz [7] used an approximate nearest neighbor algorithm [40] to guide the fuzz testing. However, the mutation methods and coverage criteria of these fuzz testing approaches are too limited. DeepHunter [5] proposed a metamorphic mutation strategy and implemented five coverage criteria and four seed selection strategies. Ex2 [41] used Monte Carlo tree search [42] to prioritize test inputs in fuzz testing. Different from these existing fuzz testing approaches, our proposed CriticalFuzz prioritizes covering critical neurons to ensure the quality of the essential decision logic of the DNN. In addition, the energy-based test case generation we propose can quickly cover essential neurons.

7.2. Coverage criteria

The coverage criteria are used to measure the adequacy of testing and to guide fuzz testing in DNNs. Neuron coverage [14] is the first white-box testing metric for DNNs, which is used to estimate the amount of deep learning logic explored by a set of test inputs. DeepGauge [15] proposed neuron-level coverage criteria and layer-level coverage criteria. The neuron-level coverage criteria contain KMNCov, NBCov, and SNACov. The layer-level coverage criteria contain TKNCov and TKNPat. In order to make the coverage criteria describe the decision logic of DNNs more accurately, Xie et al. [43] propose structure-based neuron path coverage and activation-based neuron path coverage. In addition to neuron coverage criteria, some researchers proposed other coverage criteria. DeepCT [44] firstly used combinatorial testing [45] to test
value.
DNN and proposed a set of combinatorial testing coverage criteria
to examine the interaction among neurons. Inspired by MC/DC test
7. Related work
criterion [46], Sun et al. [47] proposed a family of four test criteria,
which can achieve both intensive testing and computational feasibility.
7.1. Testing deep neural networks
However, these existing coverage criteria do not take into account the
different effects of different neurons on the final output of the DNN. To
DiffChaser [30] proposed a black-box testing framework for detect-
cover the essential decision logic of DNN, we propose critical neuron
ing disagreements between version variants of a DNN. DeepXplore [14]
coverage. Moreover, in order to cover the critical neurons of each class,
originally proposed a white-box testing framework to test DNN, and
we propose class-critical neuron coverage.
jointly maximize neuron coverage and maximize different behaviors of
similar DNNs. ADAPT [31] is a new white-box testing technique for
deep neural networks that proposes a parameterized neuron-selection 8. Conclusion
strategy and provides an online learning algorithm to adjust its param-
eters effectively. Due to the advent of neuron coverage, DeepTest [8] In this paper, we define the critical neurons and propose a critical
proposed to automatically test different parts of the DNN logic by neuron coverage-guided fuzz testing framework, CriticalFuzz. To guide
maximizing neuron coverage. In order to make the generated scene CriticalFuzz to cover the critical neurons, we propose the energy-
more realistic, DeepRoad [9] used UNIT [32] to change the weather based test case generation and the critical neuron coverage criteria.
in test cases. TACTIC [33] employed MUNIT [34] to generate multiple The energy-based test case generation consists of energy-based seed
different environmental conditions and (1+1) Evolution Strategy [35] selection, power schedule and seed mutation.
to identify the critical environmental conditions. DeepBillboard [10] To ensure the accuracy of the definition of critical neurons, we
generated physical-world adversarial billboard tests for real-world driv- evaluate the impact of critical neurons on DNN prediction. The results
ing under various weather conditions. To evaluate the robustness of show that the critical neurons hold the major logic of the DNN. Besides,
DNN about image background region changes, DeepBackground [11] CriticalFuzz has been evaluated on four popular DNN models and two
defined a series of domain-specific metamorphic relations and automat- widely-used datasets. In addition, we compare the CriticalFuzz and two
ically replaced the image background. DeepConcolic [36] developed fuzz testing frameworks in terms of the efficiency of covering critical
the first concolic testing [37] method for DNN to improve coverage. neurons and the capability of detecting error behaviors. The experiment
DeepCheck [38] used symbolic execution to analyze the DNN, and con- results show that CriticalFuzz can cover more critical neurons at the
tained two validation techniques. In addition to measuring the quality same time and detect more number and diversity errors compared to
of DNN, some researchers proposed multiple approaches to test the other fuzz testing frameworks. Moreover, CriticalFuzz covers critical
quality of testing data for promoting the development of DNN. DeepMu- neurons of each class in a more fine-grained way, and finds the errors
tation [6] proposed a mutation testing framework for DNN to measure hidden in each class of DNN.
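For illustration, the classic neuron coverage metric [14] surveyed in Section 7.2 can be sketched in a few lines of NumPy. This is a minimal sketch with illustrative names (`neuron_coverage`, the per-layer activation dictionary, the default threshold), not the implementation used in CriticalFuzz or in any of the cited tools; it assumes activations have already been extracted per layer, and follows the common convention of min-max scaling activations per input before thresholding.

```python
import numpy as np

def neuron_coverage(activations, threshold=0.5):
    """Fraction of neurons whose scaled activation exceeds `threshold`
    for at least one test input.

    `activations` maps a layer name to an array of shape
    (num_inputs, num_neurons).
    """
    covered, total = 0, 0
    for acts in activations.values():
        # Min-max scale each input's activations to [0, 1] within the layer.
        lo = acts.min(axis=1, keepdims=True)
        hi = acts.max(axis=1, keepdims=True)
        scaled = (acts - lo) / np.maximum(hi - lo, 1e-12)
        # A neuron counts as covered if any test input pushes it above the
        # threshold; `any(axis=0)` takes the union over all test inputs.
        covered += int((scaled > threshold).any(axis=0).sum())
        total += acts.shape[1]
    return covered / total

# Toy example: 2 test inputs, one layer of 3 neurons.
acts = {"dense_1": np.array([[0.1, 0.9, 0.2],
                             [0.3, 0.2, 0.1]])}
print(neuron_coverage(acts))  # prints 0.6666666666666666 (2 of 3 covered)
```

Criteria such as KMNCov [15] refine this scheme by partitioning each neuron's activation range into buckets, while the critical neuron coverage proposed in this paper instead restricts attention to the neurons that carry the essential decision logic of the DNN.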

T. Bai et al. Information and Software Technology 172 (2024) 107476

CRediT authorship contribution statement

Tongtong Bai: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Song Huang: Writing – review & editing, Supervision, Conceptualization. Yifan Huang: Project administration, Methodology, Investigation, Formal analysis, Conceptualization. Xingya Wang: Supervision, Project administration, Methodology. Chunyan Xia: Visualization, Validation, Resources. Yubin Qu: Writing – review & editing, Software, Methodology. Zhen Yang: Visualization, Validation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Raw experimental data of CriticalFuzz are available at Mendeley Data. DOI: 10.17632/n83vfckc8b.1.

Acknowledgements

We would like to thank the anonymous reviewers for their constructive comments. This research was funded by the Open Project Program of the State Key Laboratory for Novel Software Technology (Nanjing University) (KFKT2022B10) and the Nantong Science and Technology Project (No. JC2023070).

References

[1] Brian K.S. Isaac-Medina, et al., Unmanned aerial vehicle visual detection and tracking using deep neural networks: A performance benchmark, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[2] Andre Esteva, Katherine Chou, Serena Yeung, Nikhil Naik, Ali Madani, Ali Mottaghi, Yun Liu, Eric Topol, Jeff Dean, Richard Socher, Deep learning-enabled medical computer vision, NPJ Digit. Med. 4 (1) (2021) 5.
[3] Asmaa Abbas, Mohammed M. Abdelsamea, Mohamed Medhat Gaber, Classification of COVID-19 in chest X-ray images using DeTraC deep convolutional neural network, Appl. Intell. 51 (2021) 854–864.
[4] Xing Xu, et al., What machines see is not what they get: Fooling scene text recognition models with adversarial text images, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[5] Xiaofei Xie, Lei Ma, Felix Juefei-Xu, Minhui Xue, Hongxu Chen, Yang Liu, Jianjun Zhao, Bo Li, Jianxiong Yin, Simon See, DeepHunter: A coverage-guided fuzz testing framework for deep neural networks, in: Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2019, pp. 146–157.
[6] Lei Ma, et al., DeepMutation: Mutation testing of deep learning systems, in: 2018 IEEE 29th International Symposium on Software Reliability Engineering, ISSRE, IEEE, 2018.
[7] Augustus Odena, Catherine Olsson, David Andersen, Ian Goodfellow, TensorFuzz: Debugging neural networks with coverage-guided fuzz testing, in: International Conference on Machine Learning, PMLR, 2019, pp. 4901–4911.
[8] Yuchi Tian, Kexin Pei, Suman Jana, Baishakhi Ray, DeepTest: Automated testing of deep-neural-network-driven autonomous cars, in: Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 303–314.
[9] Mengshi Zhang, et al., DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems, in: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018.
[10] Husheng Zhou, Wei Li, Zelun Kong, Junfeng Guo, Yuqun Zhang, Bei Yu, Lingming Zhang, Cong Liu, DeepBillboard: Systematic physical-world testing of autonomous driving systems, in: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 2020, pp. 347–358.
[11] Zhiyi Zhang, et al., DeepBackground: Metamorphic testing for deep-learning-driven image recognition systems accompanied by background-relevance, Inf. Softw. Technol. 140 (2021) 106701.
[12] Tsong Y. Chen, Shing C. Cheung, Shiu Ming Yiu, Metamorphic testing: a new approach for generating next test cases, 2020, arXiv preprint arXiv:2002.12543.
[13] Quang-Hung Luu, et al., A sequential metamorphic testing framework for understanding automated driving systems, 2022, arXiv preprint arXiv:2206.03075.
[14] Kexin Pei, et al., DeepXplore: Automated whitebox testing of deep learning systems, in: Proceedings of the 26th Symposium on Operating Systems Principles, 2017.
[15] Lei Ma, et al., DeepGauge: Multi-granularity testing criteria for deep learning systems, in: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018.
[16] Fabrice Harel-Canada, et al., Is neuron coverage a meaningful measure for testing deep neural networks? in: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020.
[17] B. Zhou, Y. Sun, D. Bau, A. Torralba, Revisiting the importance of individual units in CNNs via ablation, 2018, arXiv preprint arXiv:1806.02891.
[18] A. Nguyen, J. Yosinski, J. Clune, Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks, 2016, arXiv preprint arXiv:1602.03616.
[19] Marcel Böhme, et al., Directed greybox fuzzing, in: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017.
[20] Jianmin Guo, et al., DLFuzz: Differential fuzzing testing of deep learning systems, in: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018.
[21] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[22] Kaiming He, et al., Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[23] Karen Simonyan, Andrew Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv preprint arXiv:1409.1556.
[24] Yann LeCun, Corinna Cortes, The MNIST database of handwritten digits, 1998.
[25] Alex Krizhevsky, Vinod Nair, Geoffrey Hinton, The CIFAR-10 dataset, 2014, http://www.cs.toronto.edu/kriz/cifar.html.
[26] Z. Li, X. Ma, C. Xu, et al., Structural coverage criteria for neural networks could be misleading, in: 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results, ICSE-NIER, IEEE, 2019, pp. 89–92.
[27] J. Sekhon, C. Fleming, Towards improved testing for deep learning, in: 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results, ICSE-NIER, IEEE, 2019, pp. 85–88.
[28] Zewen Li, et al., A survey of convolutional neural networks: analysis, applications, and prospects, IEEE Trans. Neural Netw. Learn. Syst. 33 (12) (2021) 6999–7019.
[29] Yi Tay, et al., Efficient transformers: A survey, ACM Comput. Surv. 55 (6) (2022) 1–28.
[30] Xiaofei Xie, et al., DiffChaser: Detecting disagreements for deep neural networks, in: International Joint Conferences on Artificial Intelligence Organization, 2019.
[31] Seokhyun Lee, et al., Effective white-box testing of deep neural networks with adaptive neuron-selection strategy, in: Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2020.
[32] Ming-Yu Liu, Thomas Breuel, Jan Kautz, Unsupervised image-to-image translation networks, Adv. Neural Inf. Process. Syst. 30 (2017).
[33] Zhong Li, et al., Testing DNN-based autonomous driving systems under critical environmental conditions, in: International Conference on Machine Learning, PMLR, 2021.
[34] Xun Huang, et al., Multimodal unsupervised image-to-image translation, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018.
[35] I. Rechenberg, Simulationsmethoden in der Medizin und Biologie, 1978, pp. 83–114.
[36] Youcheng Sun, et al., Concolic testing for deep neural networks, in: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018.
[37] Koushik Sen, Darko Marinov, Gul Agha, CUTE: A concolic unit testing engine for C, ACM SIGSOFT Softw. Eng. Notes 30 (5) (2005) 263–272.
[38] Divya Gopinath, et al., Symbolic execution for deep neural networks, 2018, arXiv preprint arXiv:1807.10439.
[39] Jinhan Kim, Robert Feldt, Shin Yoo, Guiding deep learning system testing using surprise adequacy, in: 2019 IEEE/ACM 41st International Conference on Software Engineering, ICSE, IEEE, 2019.
[40] Marius Muja, David G. Lowe, Scalable nearest neighbor algorithms for high dimensional data, IEEE Trans. Pattern Anal. Mach. Intell. 36 (11) (2014) 2227–2240.
[41] Aoshuang Ye, et al., Ex2: Monte Carlo tree search-based test inputs prioritization for fuzz testing deep neural networks, Int. J. Intell. Syst. 37 (12) (2022) 11966–11984.
[42] Cameron B. Browne, et al., A survey of Monte Carlo tree search methods, IEEE Trans. Comput. Intell. AI Games 4 (1) (2012) 1–43.


[43] Xiaofei Xie, et al., NPC: Neuron path coverage via characterizing decision logic of deep neural networks, ACM Trans. Softw. Eng. Methodol. (TOSEM) 31 (3) (2022) 1–27.
[44] Lei Ma, et al., DeepCT: Tomographic combinatorial testing for deep learning systems, in: 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering, SANER, IEEE, 2019.
[45] Changhai Nie, Hareton Leung, A survey of combinatorial testing, ACM Comput. Surv. 43 (2) (2011) 1–29.
[46] Kelly J. Hayhurst, A Practical Tutorial on Modified Condition/Decision Coverage, DIANE Publishing, 2001.
[47] Youcheng Sun, et al., Testing deep neural networks, 2018, arXiv preprint arXiv:1803.04792.
