Automated Testing and Safety Analysis of Deep Neural Networks
1. Safe AI Systems
Automated Testing and Safety Analysis
of Deep Neural Networks
Lionel Briand, FACM, FIEEE, FRSC
http://www.lbriand.info
2. Affiliations & Expertise
• Canada Research Chair (Tier 1), University of Ottawa, Canada
• Director, Lero, Research Ireland centre for software research
• Software Engineering (SE)
• AI4SE, e.g., test automation, requirements QA
• SE4AI, e.g., assurance of AI-enabled systems
3. Objectives
• Provide a software engineering perspective on the quality assurance and
safety of deep-learning models and systems.
• Provide an integrated overview of recent work
• Intuitive level
• Not a survey
5. Software Testing
[Diagram: software under test mapping inputs to outputs, annotated with the main testing concerns: generation or selection strategy; effectiveness and cost; test adequacy; level of testing; information access; automated failure detection; root cause analysis; analysis (safety, security, …); risk assessment. Overarching question: what is the impact of AI?]
6. What’s Different with AI Components?
• AI: traditional ML, deep learning, reinforcement learning, generative models
• No source code captures the model’s behaviour
• No specifications for ML components
• Behaviour acquired through training based on data
• Models are never perfectly accurate
• This has a significant impact on how such components can be tested and verified
7. AI System Trustworthiness
• Uncertainty
• Robustness
• Safety
• Security
• Bias and Fairness
• Transparency and explainability
• …
8. AI Safety Verification
• Ideally:
• Absolute safety guarantees
• Practical assumptions (about input space, model, properties …)
• Scalable (e.g., to large models and systems)
• But
• We cannot have it all (e.g., high dimensionality, many parameters)
• Safety is not a model property, but rather a system one
9. Formal Analysis vs. Testing
• Formal analysis, e.g., reachability analysis
• Focus on robustness to perturbations
• Guarantees about models (not systems) under restrictive assumptions (e.g.,
model architecture, input space shape, reasonable over-approximation of
output space)
• Scalability issues
• Testing-based analysis, e.g., search-based testing
• Heuristics for input space exploration
• No guarantees, but no restrictive assumptions and more scalable
10. Testing Levels
• Testing is still the main practical mechanism through which to gain trust
• Levels: AI Model (e.g., Deep Neural Networks), integration, system
• Integration: Issues that arise when multiple models and components are
integrated
• System: Test the system in its target environment, in-field or simulated
12. Key-points Detection Testing with
Simulation in the Loop
• DNNs used for key-points detection in images
• Testing: Find test suite that causes DNN to
poorly predict as many key-points as possible
within time budget
• Evaluate safety from testing results
• Images generated by a simulator
[Example image: ground-truth vs. predicted key-point positions.]
Ul Haq et al., ISSTA, 2021
13. Example Application
• Drowsiness or gaze detection based on interior camera monitoring the driver
• In the drowsiness or gaze detection problem, each Key-Point (KP) may be highly
important for safety
• Each KP leads to one test objective
• For our subject DNN, we have 27 test objectives
• Goal: Cause the DNN to mispredict as many key-points as possible
• Solution: Many-objective search algorithms (based on genetic algorithms) combined
with simulator
14. Overview
[Diagram: the input generator (search) sends an input vector to the simulator, which produces a test image and the actual key-point positions; the DNN predicts key-point positions from the image; the fitness calculator compares predicted and actual positions and returns a fitness score (error value) to the search, which retains the most critical test inputs.]
15. Safety through Explanation
• Regression trees to predict model accuracy based on simulation parameters
• Enable detailed analysis to find the root causes of high Normalized Error (NE) values, e.g., a shadow on the location of KP26 causes high NE values
• Regression trees show excellent accuracy and are interpretable
• Amenable to risk analysis, gaining useful safety insights, and contingency plans at run-time
Image characteristics condition | NE
M = 9 ∧ P < 18.41 | 0.04
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ Y < 17.06 | 0.26
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ 17.06 ≤ Y < 19 | 0.71
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ Y ≥ 19 | 0.36
Representative rules derived from the decision tree for KP26
(M: Model-ID, P: Pitch, R: Roll, Y: Yaw, NE: Normalized Error)
[Images: (A) a test image satisfying the first condition (NE = 0.013); (B) a test image satisfying the third condition (NE = 0.89).]
16. Real Images
• Many real images are usually available, but they are unlabeled
• Labeling costs can be significant
• Test selection requires a different approach than with a simulator:
minimization
17. DNN Test Selection
• We want to test a DNN model with a fixed labeling and test budget.
• How can we automatically select an optimal test subset to test DNNs?
• DeepGD: Multi-objective search algorithm based on diversity and uncertainty
scores for black-box test set selection.
[Diagram: a black-box (BB) test selection method takes a set of test inputs T and returns a subset S ⊆ T within the labeling budget.]
18. Diversity-driven Test Selection
• Access to DNN internals, and sometimes to the training set, is not realistic in many practical settings.
• Scalable and practical selection
• The more diverse, the more likely test inputs are to reveal faults.
• Black-box approach based on measuring the diversity of test inputs.
19. Extracting Image Features
• VGG16 is a convolutional neural network trained on a
subset of the ImageNet dataset, a collection of over 14
million images belonging to 22,000 categories.
20. Geometric Diversity (GD)
• Given a dataset X and its corresponding feature vectors V, the
geometric diversity of a subset S ⊆ X is defined as the hyper-volume of
the parallelepiped spanned by the rows of Vs, i.e., feature vectors of
items in S, where the larger the volume, the more diverse is the feature
space of S
Aghababaeyan et al., IEEE TSE, 2023
21. Pareto Front Optimization
• DeepGD: Black-box test selection to detect as many diverse faults as possible
for a given test budget
• Search-based Approach: Multi-Objective Genetic Search (NSGA-II)
• Two objectives (Max):
• Diversity (Geometric diversity)
• Uncertainty (e.g., Gini)
[Plot: candidate subsets evaluated on GD score vs. Gini score, with the Pareto front highlighted.]
Aghababaeyan et al., ACM TOSEM, 2024
22. Model Test Adequacy
• For an arbitrary test suite, measured accuracy may be high. But can we trust it?
• Is testing sufficiently complete? (adequate)
• Assess adequacy before labelling.
• Requirements of a test adequacy assessment approach:
• Practical (cost, effort)
• Accurately estimates fault detection capability of test suite
23. TEASMA
• Post-training mutation analysis (DeepMutation++) and nonlinear regression
• Predicting Fault Detection Rate (FDR) from Mutation Score (MS)
MS: Mutation Score, FDR: Fault Detection Rate, T: Test set
Abbasishahkoo et al., IEEE TSE, 2024
24. TEASMA: Results
• Accurate prediction of the fault detection rate based on post-training mutants (which modify biases and weights …)
• Non-linear relationships, varies across models
• Alternative coverage metrics to post-training
mutation scores:
• Distance-based Surprise Coverage (DSC)
• Likelihood-based Surprise Coverage (LSC)
• Input Distribution Coverage (IDC)
• Trade-off between accuracy and computation time
25. Differential Testing of Models
• DiffGAN: Image generation approach to generate test inputs that reveal
behavioral disagreements between DNN models (i.e., triggering inputs)
• Applications:
• Selection of alternative models
• Understand differences between original and updated models
• Select model predictions at run-time
[Diagram: the same input X1 is fed to DNN 1 and DNN 2, yielding outputs Y1 and Y2; a disagreement between Y1 and Y2 makes X1 a triggering input.]
27. Results
• Effective at generating triggering inputs
• Generates diverse triggering inputs
• Triggering inputs are useful to train accurate ML models (e.g.,
Random Forest based on image features) for DNN selection
• For a given input, at run-time, which model to trust the most?
28. Testing Fine-Tuned DNNs
• DNNs are typically deployed across various
contexts
• They usually need to be fine-tuned and re-tested
• Contexts differ in terms of data distributions
(data drift)
• How can we minimize the effort of re-testing
(i.e., labeling …)?
29. Problem Definition
• Rank and select test inputs for a fine-tuned model
• Use information from both the fine-tuned model (MT) and the pre-trained model (MS)
30. MetaSel
• MetaSel leverages the relationship between a fine-tuned model and its
pre-trained counterpart to estimate the probability that an unlabeled
test input will be mispredicted by the fine-tuned model.
Abbasishahkoo et al., 2025
31. Random Forest Features
• The output of the logit layer of MS (LS).
• The output of the logit layer of MT (LT).
• The difference between LS and LT.
• The outcome of differential testing by comparing output labels of MS and
MT.
• The ODIN score of the given input calculated based on MS.
• The ODIN score of the given input calculated based on MT.
32. MetaSel: Results
• Comparison baselines that use the target model only: probability-based uncertainty, surprise adequacy …
• MetaSel consistently outperforms all baselines in selecting test subsets that
contain a higher number of inputs misclassified by the fine-tuned DNN model.
• MetaSel’s efficiency remains acceptable from a practical standpoint, even for
subjects with large input sets featuring numerous output classes, and deep,
complex DNN architectures.
• MetaSel proves to be a robust solution, consistently maintaining its
effectiveness across a wide range of severity levels in distribution shift
between source and target input sets.
33. Explainability and Safety
• All ML-enabled systems and components fail under specific
conditions.
• Understanding and characterizing these conditions is key to evaluating risks.
• Identify rare situations, mitigation mechanisms
• It can also help focus the re-training of models.
34. SAFE
[Pipeline: error-inducing images → data preprocessing → feature extraction → dimensionality reduction → clustering into root-cause clusters → unsafe-set selection → retraining → improved DNN.]
• Image inputs
• Black-box
• No need to extend the DNN (LRP)
• Reduces training time and memory usage
• Dimensionality reduction: PCA, UMAP
• Density-based clustering: DBSCAN
Attaoui et al., ACM TOSEM, 2022-2023
35. SAFE: Results
• Experiments with humans have shown that these clusters help correctly determine root causes by inspecting only a few images.
• Re-training based on clusters is also effective and efficient
37. Autonomous Driving Systems
• AI-Enabled ADSs are systems that sense their environment and navigate
autonomously. They process data from sensors (e.g., cameras, LiDAR) and
use AI-based components to make driving decisions.
38. ADS Testing
• Aim: Automate testing of ADSs in a scalable and
practical way.
• Challenges:
• Scenario space: Open context (environment)
• Test oracle: Automated detection of behavioral failures (not
just safety violations)
• No (complete) specifications
• Many safety requirements
• Expensive simulations
39. System Testing via Physics-based
Simulation
[Diagram: the ADAS (system under test) exchanges test inputs and time-stamped test outputs with a Matlab/Simulink simulator, whose model covers the physical plant (vehicle, sensors, actuators), other cars, pedestrians, and the environment (weather, roads, traffic signs).]
40. Reinforcement Learning for ADS
Testing (MORLOT)
• Goal: RL agent makes sequential changes to the environment (simulator), in a
realistic manner, with the objective of triggering safety violations
• Train a reinforcement learning (RL) agent to do so
• Challenge: Test many safety requirements simultaneously
• Combine RL with many-objective search to effectively address all requirements
at the same time
Ul Haq et al. ICSE 2023
41. Simple Example
Consider an RL agent that aims to cause the ego vehicle to violate safety requirements (e.g., collision) by performing a sequence of actions via the vehicle in front
[Illustration: the ego vehicle following the vehicle in front, with the possible actions of the vehicle in front.]
42. Reinforcement Learning for ADAS
Testing (MORLOT)
[Diagram: the RL agent chooses an action for the vehicle in front; the RL environment, which wraps the CARLA simulator and the ADAS under test (fed with images, location, … and producing control commands), returns the state/next state (conditions about the environment, e.g., collision) and a reward (e.g., based on the distance from the vehicle in front).]
Ul Haq et al. ICSE 2023
43. MORLOT: MOS and RL
[Diagram: MORLOT couples many-objective search with reinforcement learning (Q-learning); each safety requirement contributes a fitness value, which is turned into the RL reward (e.g., distance), and the RL agent applies actions (e.g., increase speed) to the environment.]
Ul Haq et al. 2023
44. Example Violation
• Violation: Ego Vehicle collides with vehicle in front
• Vehicle-in-front slows down suddenly and then moves to the right
• Possible reason: Model was not trained with such situations
[Screenshots: car view and top view of the collision.]
45. Surrogate Models
• Simulation computation time is still a major obstacle to large-scale testing
• Surrogate model: Model that mimics the simulator, to a certain extent, while
being much less computationally expensive
• Research: Combine search with surrogate modeling to decrease the
computational cost of testing (Ul Haq et al. ICSE 2022)
[Surrogate model types considered: Polynomial Regression (PR), Radial Basis Function (RBF), Kriging (KR).]
46. COCOMEGA
• Another attempt at ADS testing with a simulator in the loop
• Goals:
• Effective search of the scenario space
• Automated failure detection without (complete) specifications
• Failures: Not just safety violations but subtle undesirable behaviors
• Smart combination of:
• Cooperative co-evolutionary algorithm
• Metamorphic testing
Yousefizadeh et al. IEEE TSE 2025
47. Metamorphic Testing
• Definition: Testing is driven by differences in system behavior under varied input
transformations.
• Metamorphic relations (MR): relationships between a sequence of inputs and their
respective outputs.
• Changing input i to i’ implies a predictable change in output, unless there is a failure
• Example: If you add a pedestrian to the field of view, the ego-vehicle should slow
down.
49. Co-evolution
• Evolutionary algorithm
• Co-evolution: Co-operating subpopulations
• More efficient search in lower-dimensional subspaces
• Enables parallelism
• Our case:
• Scenarios: Ego vehicle, the trajectory, the environment
(e.g., the weather), and other dynamic (e.g., vehicles
and pedestrians) and static (e.g., obstacles) objects.
• Perturbations: Attribute changes, addition, deletion, or
replacement of objects in the scenario.
• Search objectives: Violations of Metamorphic
Relations, Diversity
50. Results
• Significantly outperforms random search and genetic algorithms in detecting severe MR violations.
• Much more efficient: Detects MR violations
much faster
• High scenario diversity.
• Experts confirm MR violations entail safety
risks
52. Problems
• Safety Concerns: Growing use of Deep Reinforcement Learning (DRL) systems to
solve real-world challenges, especially in safety-critical domains.
• Lack of Guarantees: DRL agents, focused on maximizing rewards, may inadvertently
violate safety protocols, leading to potentially dangerous situations.
• Testing Limitations: How to effectively and efficiently evaluate DRL agents due to the
probabilistic nature of environments, vast state spaces, …
53. Solutions
• A comprehensive two-phase solution toward the safety of Reinforcement Learning (RL)
(Zolfagharian et al., IEEE TSE 2023-2025):
• Pre-Deployment Testing: Rigorous testing is essential to identify and mitigate unsafe
behaviors in RL agents before deployment.
• Runtime Safety Monitoring: Continuous safety monitoring ensures ongoing protection by
predicting and mitigating potential safety violations during operation.
• STARLA: A search-based testing approach for RL agents, with a focus on detecting unsafe
behaviors
• SMARLA: A runtime safety monitoring approach that predicts and prevents potential safety
violations during RL agent operation
54. STARLA: Testing RL Agents
Goal: Maximize fault detection
Data:
• We record episodes (i.e., sequences of states and actions that could be made by the RL agent)
• Generate random episodes by running the agent in the test environment
• Sample episodes from later training steps of the RL agent
• Label: Functional fault seen in the environment
55. STARLA: Testing RL Agents
ML-based fault prediction model:
• Predict faulty episodes without executing
• Abstract states to increase learnability of ML
• Abstraction: State abstraction is a way to reduce the state space. It can be defined as a mapping from the original state space to the abstract state space, A: S → S_A, where the abstract state space S_A is smaller than the original state space S
Used as a fitness function to guide the search toward faulty episodes
56. Approach – Q* State Abstraction
• We rely on a Q-irrelevance state abstraction technique.
• Q*(s,a) is the predicted overall reward (expected reward) from state s when the agent selects action a in state s
• We consider two states to be in the same abstraction class when the expected overall reward is the same for all actions in both states
• The abstraction level can be adjusted with the parameter d
Q*(s1, a) / d = Q*(s2, a) / d,  ∀a ∈ A
58. Approach – SMARLA Training
Features: Presence and absence of the abstract states
Episodes | Abstract state 1 | Abstract state 2 | … | Abstract state n | Label
E1 = [(S1,a1), (S2,a2), …, (Sj,aj)] | 0 | 1 | … | 0 | Safe
E2 = [(S'1,a'1), (S'2,a'2), …, (S'i,a'i)] | 1 | 1 | … | 1 | Unsafe
59. Approach – SMARLA Training
• We need a lightweight ML model that can accurately classify RL episodes as safe or unsafe, since we
target resource-constrained edge devices.
• Therefore, we exclude DNN models and choose Random Forest as our machine learning model because it:
• Can handle a large number of features effectively
• Provides efficient and timely classification results
• Is robust to overfitting, ensuring reliable performance
60. Summary
• Testing, safety analysis, and re-training of DNN and RL models
• With or without simulation in the loop
• Focus on images and classification with DNNs
• Understand the differences between DNN models and combine them
• Efficiently testing fine-tuned models
• Testing systems (e.g., safety is a system property)
• Safety analysis and decisions through explanations
• Scalability: Surrogate models, co-evolutionary algorithms, …
62. Other Related Works
• Determining the safety boundaries of ML-based systems
• Safety monitoring of learned components using safety metric forecasting
• …
• See references
63. Safe AI Systems
Automated Testing and Safety Analysis
of Deep Neural Networks
Lionel Briand, FACM, FIEEE, FRSC
http://www.lbriand.info
64. Selected References (1)
• Ul Haq et al. "Automatic Test Suite Generation for Key-points Detection DNNs Using Many-
Objective Search" ACM International Symposium on Software Testing (ISSTA), 2021
• Ul Haq et al. " Can Offline Testing of Deep Neural Networks Replace Their Online Testing?,
Empirical Software Engineering, 2021
• Ul Haq et al., “Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and
Many-Objective Optimization” IEEE/ACM ICSE 2022
• Ul Haq et al., “Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled
Systems”, IEEE/ACM ICSE 2023
• Fahmy et al. "Supporting DNN Safety Analysis and Retraining through Heatmap-based
Unsupervised Learning" IEEE Transactions on Reliability, Special section on Quality Assurance
of Machine Learning Systems, 2021
• Fahmy et al. "Simulator-based explanation and debugging of hazard-triggering events in DNN-
based safety-critical systems”, ACM TOSEM, 2022
65. Selected References (2)
• Attaoui et al., “Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering”,
ACM TOSEM, 2022
• Attaoui et al., “Supporting Safety Analysis of Image-processing DNNs through Clustering-based Approaches”,
ACM TOSEM, 2024
• Aghababaeyan et al., “Black-Box Testing of Deep Neural Networks through Test Case Diversity”, IEEE TSE 2023
• Aghababaeyan et al., “DeepGD: A multi-objective black-box test selection approach for deep neural networks”,
ACM TOSEM, 2024
• Zolfagharian et al., “Search-Based Testing Approach for Deep Reinforcement Learning Agents”, IEEE TSE, 2023
• Zolfagharian et al., “SMARLA: A Safety Monitoring Approach for Deep Reinforcement Learning Agents”, IEEE TSE,
2024
• Sharifi et al., “Identifying the Hazard Boundary of ML-enabled Autonomous Systems Using Cooperative Co-
Evolutionary Search”, IEEE TSE, 2024
• Abbasishahkoo et al., "TEASMA: A Practical Approach for the Test Assessment of Deep Neural Networks using Mutation Analysis", IEEE TSE, 2024
66. Selected References (3)
• Attaoui et al., “Search-based DNN Testing and Retraining with
GAN-enhanced Simulations”, IEEE TSE 2025
• Yousefizadeh et al., “Using Cooperative Co-evolutionary Search to
Generate Metamorphic Test Cases for Autonomous Driving
Systems”, IEEE TSE 2025
• Aghababaeyan et al., “DiffGAN: A Test Generation Approach for
Differential Testing of Deep Neural Networks”
• Abbasishahkoo et al., “MetaSel: A Test Selection Approach for
Fine-tuned DNN Models”, ArXiv 2025