Safe AI Systems
Automated Testing and Safety Analysis
of Deep Neural Networks
Lionel Briand, FACM, FIEEE, FRSC
http://www.lbriand.info
Affiliations & Expertise
• Canada Research Chair (Tier 1), University of Ottawa, Canada
• Director, Lero, Research Ireland centre for software research
• Software Engineering (SE)
• AI4SE, e.g., test automation, requirements QA
• SE4AI, e.g., assurance of AI-enabled systems
Objectives
• Provide a software engineering perspective on the quality assurance and
safety of deep-learning models and systems.
• Provide an integrated overview of recent work
• Intuitive level
• Not a survey
Introduction
Software Testing
(Diagram: software maps inputs to outputs; the surrounding testing concerns are the generation or selection strategy, effectiveness and cost, test adequacy, level of testing, information access, automated failure detection, root cause analysis, analysis (safety, security …), risk assessment, and, for each of them, the impact of AI.)
What’s Different with AI Components?
• AI: traditional ML, deep learning, reinforcement learning, generative models
• No source code captures the model’s behaviour
• No specifications for ML components
• Behaviour acquired through training based on data
• Models are never perfectly accurate
• This has a significant impact
AI System Trustworthiness
• Uncertainty
• Robustness
• Safety
• Security
• Bias and Fairness
• Transparency and explainability
• …
AI Safety Verification
• Ideally:
• Absolute safety guarantees
• Practical assumptions (about input space, model, properties …)
• Scalable (e.g., to large models and systems)
• But
• We cannot have it all (e.g., high dimensionality, many parameters)
• Safety is not a model property but a system one
Formal Analysis vs. Testing
• Formal analysis, e.g., reachability analysis
• Focus on robustness to perturbations
• Guarantees about models (not systems) under restrictive assumptions (e.g.,
model architecture, input space shape, reasonable over-approximation of
output space)
• Scalability issues
• Testing-based analysis, e.g., search-based testing
• Heuristics for input space exploration
• No guarantees, but no restrictive assumptions and more scalable
Testing Levels
• Testing is still the main practical mechanism through which to gain trust
• Levels: AI Model (e.g., Deep Neural Networks), integration, system
• Integration: Issues that arise when multiple models and components are
integrated
• System: Test the system in its target environment, in-field or simulated
Model Testing & Analysis
Key-points Detection Testing with
Simulation in the Loop
• DNNs used for key-points detection in images
• Testing: Find test suite that causes DNN to
poorly predict as many key-points as possible
within time budget
• Evaluate safety from testing results
• Images generated by a simulator
(Figure: ground-truth vs. predicted key-points on a simulator-generated image.)
Ul Haq et al., ISSTA, 2021
Example Application
• Drowsiness or gaze detection based on interior camera monitoring the driver
• In the drowsiness or gaze detection problem, each Key-Point (KP) may be highly
important for safety
• Each KP leads to one test objective
• For our subject DNN, we have 27 test objectives
• Goal: Cause the DNN to mispredict as many key-points as possible
• Solution: Many-objective search algorithms (based on genetic algorithms) combined
with simulator
Overview
(Diagram: a search-based input generator produces input vectors for the simulator; the simulator renders a test image that is fed to the DNN; a fitness calculator compares actual and predicted key-point positions to compute a fitness score (error value), which guides the search toward the most critical test inputs.)
Safety through Explanation
• Regression trees to predict model accuracy based on simulation parameters
• Enable detailed analysis to find the root causes of high Normalized Error (NE) values; e.g., a shadow on the location of KP26 is the cause of high NE values
• Regression trees show excellent accuracy and are interpretable
• Amenable to risk analysis, useful safety insights, and contingency plans at run-time (a sketch of such a tree follows the table below)
Representative rules derived from the decision tree for KP26
(M: Model-ID, P: Pitch, R: Roll, Y: Yaw, NE: Normalized Error)

Image characteristics condition                    NE
M = 9 ∧ P < 18.41                                  0.04
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ Y < 17.06         0.26
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ 17.06 ≤ Y < 19    0.71
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ Y ≥ 19            0.36

(A) A test image satisfying the first condition (NE = 0.013); (B) a test image satisfying the third condition (NE = 0.89).
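As an illustration of the idea above, a minimal sketch of deriving such interpretable rules with an off-the-shelf regression tree; the data here is synthetic and purely illustrative (the actual study fits trees on simulator parameters and measured NE values):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-ins: columns M (Model-ID), P (Pitch), R (Roll), Y (Yaw),
# and a placeholder Normalized Error per generated test image.
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(0, 12, 500), rng.uniform(-30, 30, (500, 3))])
y = rng.uniform(0, 1, 500)

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
print(export_text(tree, feature_names=["M", "P", "R", "Y"]))  # human-readable rules
```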
Real Images
• Many images are usually available, but they are unlabeled
• Labeling costs can be significant
• Test selection therefore requires a different approach than with a simulator: minimization
DNN Test Selection
• We want to test a DNN model with a fixed labeling and test budget.
• How can we automatically select an optimal test subset to test DNNs?
• DeepGD: Multi-objective search algorithm based on diversity and uncertainty
scores for black-box test set selection.
(Diagram: a black-box (BB) test selection method takes a set of test inputs T and returns a subset S ⊆ T.)
Diversity-driven Test Selection
• Access to DNN internals, and sometimes to the training set, is not realistic in many practical settings.
• Scalable and practical selection
• The more diverse test inputs are, the more likely they are to reveal faults.
• Black-box approach based on measuring the diversity of test inputs.
Extracting Image Features
• VGG16 is a convolutional neural network trained on a
subset of the ImageNet dataset, a collection of over 14
million images belonging to 22,000 categories.
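A minimal sketch of how such features could be extracted, assuming a standard pretrained VGG16 from TensorFlow/Keras (the exact extraction pipeline used in the work may differ):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# Pretrained VGG16 without its classification head; global average pooling
# turns each image into a single 512-dimensional feature vector.
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

def extract_features(images: np.ndarray) -> np.ndarray:
    """images: (n, 224, 224, 3) RGB batch -> (n, 512) feature matrix."""
    return model.predict(preprocess_input(images.astype("float32")))
```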
Geometric Diversity (GD)
• Given a dataset X and its corresponding feature vectors V, the geometric diversity of a subset S ⊆ X is defined as the hyper-volume of the parallelepiped spanned by the rows of V_S, i.e., the feature vectors of the items in S. The larger the volume, the more diverse the feature space of S (see the sketch below).

Aghababaeyan et al., IEEE TSE, 2023
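A minimal sketch of computing GD from the feature matrix via the Gram determinant, which gives the squared hyper-volume of that parallelepiped (an assumption consistent with the definition above; the paper's exact computation may differ):

```python
import numpy as np

def geometric_diversity(features: np.ndarray) -> float:
    """features: (n, d) matrix V_S whose rows are the feature vectors of
    the items in S. The hyper-volume of the parallelepiped they span is
    sqrt(det(V_S @ V_S.T))."""
    gram = features @ features.T
    det = max(np.linalg.det(gram), 0.0)  # guard against round-off negatives
    return float(np.sqrt(det))
```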
Pareto Front Optimization
• DeepGD: Black-box test selection to detect as many diverse faults as possible
for a given test budget
• Search-based Approach: Multi-Objective Genetic Search (NSGA-II)
• Two objectives (Max):
• Diversity (Geometric diversity)
• Uncertainty (e.g., Gini)
(Plot: Pareto front of candidate subsets in the objective space spanned by GD score and Gini score.)

Aghababaeyan et al., IEEE TSE, 2024
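The uncertainty objective can be illustrated with a small sketch, assuming the Gini score is computed over the DNN's softmax output (higher means the model is less certain, hence a more interesting test input):

```python
import numpy as np

def gini_score(softmax_output) -> float:
    """Gini impurity of a softmax output: 1 - sum_i p_i^2."""
    p = np.asarray(softmax_output, dtype=float)
    return float(1.0 - np.sum(p ** 2))

print(gini_score([0.9, 0.05, 0.05]))  # ~0.185: confident prediction
print(gini_score([0.4, 0.3, 0.3]))    # 0.66: uncertain prediction
```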
Model Test Adequacy
• Measured accuracy may be high for an arbitrary test suite. But can we trust it?
• Is testing sufficiently complete? (adequate)
• Assess adequacy before labeling.
• Requirements of a test adequacy assessment approach:
• Practical (cost, effort)
• Accurately estimates fault detection capability of test suite
TEASMA
• Post-training mutation analysis (DeepMutation++) and nonlinear regression
• Predicting Fault Detection Rate (FDR) from Mutation Score (MS)
(Diagram legend: MS = Mutation Score, FDR = Fault Detection Rate, T = test set)
Abbasishahkoo et al., IEEE TSE, 2024
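A minimal sketch of the prediction step, assuming (MS, FDR) pairs have been measured on sampled test subsets (synthetic numbers here, illustrative only):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Illustrative (MS, FDR) pairs from sampled test subsets.
ms = np.linspace(0.1, 1.0, 30).reshape(-1, 1)
fdr = np.clip(ms.ravel() ** 1.5, 0.0, 1.0)  # placeholder nonlinear trend

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(ms, fdr)
print(model.predict([[0.8]]))  # estimated FDR for a new test set with MS = 0.8
```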
TEASMA: Results
• Accurate prediction of fault detection rate based on post-training mutants (which modify biases and weights …)
• Non-linear relationship that varies across models
• Alternative coverage metrics to post-training
mutation scores:
• Distance-based Surprise Coverage (DSC)
• Likelihood-based Surprise Coverage (LSC)
• Input Distribution Coverage (IDC)
• Trade-off between accuracy and computation time
Differential Testing of Models
• DiffGAN: Image generation approach to generate test inputs that reveal
behavioral disagreements between DNN models (i.e., triggering inputs)
• Applications:
• Selection of alternative models
• Understand differences between original and updated models
• Select model predictions at run-time
(Diagram: the same input X1 is fed to DNN 1 and DNN 2, which produce disagreeing outputs Y1 and Y2.)
DiffGAN
(Diagram: DiffGAN architecture.)
Aghababaeyan et al., 2024
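A minimal sketch of the disagreement check at the heart of differential testing, assuming each model exposes a `predict` method returning class scores (the detection step only; DiffGAN itself searches a GAN's latent space to generate such inputs):

```python
import numpy as np

def is_triggering(model_a, model_b, x) -> bool:
    """True if the two DNNs assign different labels to input x,
    i.e., x reveals a behavioral disagreement between them."""
    label_a = int(np.argmax(model_a.predict(x)))
    label_b = int(np.argmax(model_b.predict(x)))
    return label_a != label_b
```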
Results
• Effective at generating triggering inputs
• Generates diverse triggering inputs
• Triggering inputs are useful to train accurate ML models (e.g.,
Random Forest based on image features) for DNN selection
• For a given input, at run-time, which model to trust the most?
Testing Fine-Tuned DNNs
• DNNs are typically deployed across various
contexts
• They usually need to be fine-tuned and re-tested
• Contexts differ in terms of data distributions
(data drift)
• How can we minimize the effort of re-testing
(i.e., labeling …)?
Problem Definition
• Rank and select test inputs for a fine-tuned model
• Use information from both the fine-tuned model (MT) and the pre-trained model (MS)
MetaSel
• MetaSel leverages the relationship between a fine-tuned model and its
pre-trained counterpart to estimate the probability that an unlabeled
test input will be mispredicted by the fine-tuned model.
Abbasishahkoo et al., 2025
Random Forest Features
• The output of the logit layer of MS (LS).
• The output of the logit layer of MT (LT).
• The difference between LS and LT.
• The outcome of differential testing by comparing output labels of MS and
MT.
• The ODIN score of the given input calculated based on MS.
• The ODIN score of the given input calculated based on MT.
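A minimal sketch of assembling that feature vector, assuming the logit outputs and ODIN scores have been precomputed (names are illustrative, not MetaSel's actual code):

```python
import numpy as np

def metasel_features(logits_s, logits_t, odin_s, odin_t) -> np.ndarray:
    """Per-input features mirroring the list above: LS, LT, LS - LT,
    the differential-testing outcome, and the two ODIN scores."""
    logits_s = np.asarray(logits_s, dtype=float)
    logits_t = np.asarray(logits_t, dtype=float)
    disagree = float(np.argmax(logits_s) != np.argmax(logits_t))
    return np.concatenate(
        [logits_s, logits_t, logits_s - logits_t, [disagree, odin_s, odin_t]]
    )
```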
MetaSel: Results
• Baselines of comparison using the target model only: Probability-based
uncertainty, surprise adequacy …
• MetaSel consistently outperforms all baselines in selecting test subsets that
contain a higher number of inputs misclassified by the fine-tuned DNN model.
• MetaSel’s efficiency remains acceptable from a practical standpoint, even for
subjects with large input sets featuring numerous output classes, and deep,
complex DNN architectures.
• MetaSel proves to be a robust solution, consistently maintaining its
effectiveness across a wide range of severity levels in distribution shift
between source and target input sets.
Explainability and Safety
• All ML-enabled systems and components fail under specific
conditions.
• Understanding and characterizing what these conditions are is key to evaluating risks.
• Identify rare situations, mitigation mechanisms
• It can also help focus the re-training of models.
SAFE
(Pipeline: error-inducing images → data preprocessing → feature extraction → dimensionality reduction → clustering into root-cause clusters → unsafe-set selection → retraining → improved DNN)

• Image inputs
• Black-box
• No need to extend the DNN (as with LRP-based approaches)
• Reduces training time and memory usage
• Dimensionality reduction: PCA, UMAP
• Density-based clustering: DBSCAN

Attaoui et al., ACM TOSEM, 2022-2023
SAFE: Results
• Experiments with humans have shown that these clusters help correctly determine root causes by inspecting only a few images.
• Re-training based on clusters is also effective and efficient.
System Testing
Autonomous Driving Systems
• AI-Enabled ADSs are systems that sense their environment and navigate
autonomously. They process data from sensors (e.g., cameras, LiDAR) and
use AI-based components to make driving decisions.
ADS Testing
• Aim: Automate testing of ADSs in a scalable and
practical way.
• Challenges:
• Scenario space: Open context (environment)
• Test oracle: Automated detection of behavioral failures (not
just safety violations)
• No (complete) specifications
• Many safety requirements
• Expensive simulations
System Testing via Physics-based
Simulation
(Diagram: the ADAS (SUT) sends test inputs to, and receives time-stamped test outputs from, a Matlab/Simulink simulator whose model covers the physical plant (vehicle / sensors / actuators), other cars, pedestrians, and the environment (weather / roads / traffic signs).)
Reinforcement Learning for ADS
Testing (MORLOT)
• Goal: RL agent makes sequential changes to the environment (simulator), in a
realistic manner, with the objective of triggering safety violations
• Train a reinforcement learning (RL) agent to do so
• Challenge: Test many safety requirements simultaneously
• Combine RL with many-objective search to effectively address all requirements
at the same time
Ul Haq et al. ICSE 2023
Simple Example
Consider an RL agent that aims to cause the ego vehicle to violate safety requirements (e.g., collision) by performing a sequence of actions via the vehicle in front
(Diagram: the ego vehicle follows the vehicle in front; the possible actions apply to the vehicle in front.)
Reinforcement Learning for ADAS
Testing (MORLOT)
(Diagram: the RL agent controls the vehicle in front inside the RL environment, a CARLA simulator running the ADAS under test; the state/next state captures conditions about the environment (e.g., collision), the control inputs include images and locations, and the reward is based, e.g., on the distance from the vehicle in front.)

Ul Haq et al., ICSE 2023
MORLOT: MOS and RL
(Diagram: many-objective search assigns safety requirements to the reinforcement learning component (Q-learning); the RL agent applies actions to the environment (e.g., increase speed), receives rewards (e.g., distance), and the resulting fitness values feed back into the search.)

Ul Haq et al., 2023
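For concreteness, a minimal sketch of the tabular Q-learning update at the core of such an agent (the generic textbook rule, not MORLOT's exact implementation); `Q` is assumed to be a dict mapping each state to a per-action value dict:

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values())  # greedy value of the next state
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
```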
Example Violation
• Violation: Ego Vehicle collides with vehicle in front
• Vehicle-in-front slows down suddenly and then moves to the right
• Possible reason: Model was not trained with such situations
(Video stills: car view and top view of the collision.)
Surrogate Models
• Simulation computation time is still a major obstacle to large-scale testing
• Surrogate model: Model that mimics the simulator, to a certain extent, while
being much less computationally expensive
• Research: Combine search with surrogate modeling to decrease the
computational cost of testing (Ul Haq et al. ICSE 2022)
Surrogate model families: Polynomial Regression (PR), Radial Basis Function (RBF), Kriging (KR)
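A minimal sketch of the surrogate idea using the RBF family above, with scipy's RBFInterpolator as a stand-in and synthetic data (in the actual work, the surrogate is fit on expensive simulator runs and the most critical candidates are re-checked in the simulator):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(1)
X = rng.uniform(size=(50, 3))  # scenario parameters from past simulator runs
y = rng.uniform(size=50)       # e.g., minimum time-to-collision per run

surrogate = RBFInterpolator(X, y)  # cheap approximation of the simulator

# Screen many candidate scenarios cheaply; only the most critical ones
# (lowest predicted time-to-collision) would be replayed in the simulator.
candidates = rng.uniform(size=(1000, 3))
most_critical = candidates[np.argsort(surrogate(candidates))[:10]]
```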
COCOMEGA
• Another attempt at ADS testing with a simulator in the loop
• Goals:
• Effective search of the scenario space
• Automated failure detection without (complete) specifications
• Failures: Not just safety violations but subtle undesirable behaviors
• Smart combination of:
• Cooperative co-evolutionary algorithm
• Metamorphic testing
Yousefizadeh et al. IEEE TSE 2025
Metamorphic Testing
• Definition: Testing is driven by differences in system behavior under varied input
transformations.
• Metamorphic relations (MR): relationships between a sequence of inputs and their
respective outputs.
• Changing input i to i’ implies a predictable change in output, unless there is a failure
• Example: If you add a pedestrian to the field of view, the ego-vehicle should slow down (see the sketch below).
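A minimal sketch of checking this MR, assuming hypothetical helpers `run_scenario` (executes a scenario and returns ego-vehicle telemetry with a `min_speed` field) and `add_pedestrian` (the input transformation):

```python
def violates_pedestrian_mr(run_scenario, scenario, add_pedestrian,
                           min_slowdown_kmh=2.0):
    """MR: adding a pedestrian to the field of view should make the ego
    vehicle slow down. Returns True when the follow-up run does NOT slow
    down by at least min_slowdown_kmh, i.e., the MR is violated."""
    source = run_scenario(scenario)                     # source test case
    follow_up = run_scenario(add_pedestrian(scenario))  # follow-up test case
    return follow_up.min_speed > source.min_speed - min_slowdown_kmh
```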
Co-evolution
• Evolutionary algorithm
• Co-evolution: cooperating subpopulations
• More efficient search over lower-dimensional subproblems
• Enables parallelism
• Our case:
• Scenarios: Ego vehicle, the trajectory, the environment
(e.g., the weather), and other dynamic (e.g., vehicles
and pedestrians) and static (e.g., obstacles) objects.
• Perturbations: Attribute changes, addition, deletion, or
replacement of objects in the scenario.
• Search objectives: Violations of Metamorphic
Relations, Diversity
Results
• Significantly outperforms random search and genetic algorithms in detecting severe MR violations.
• Much more efficient: detects MR violations much faster.
• High scenario diversity.
• Experts confirm that MR violations entail safety risks.
Testing and Run-Time
Monitoring of Reinforcement
Learning Agents
Problems
• Safety Concerns: Growing use of Deep Reinforcement Learning (DRL) systems to
solve real-world challenges, especially in safety-critical domains.
• Lack of Guarantees: DRL agents, focused on maximizing rewards, may inadvertently
violate safety protocols, leading to potentially dangerous situations.
• Testing Limitations: How to effectively and efficiently evaluate DRL agents due to the
probabilistic nature of environments, vast state spaces, …
Solutions
• A Two-Phase Comprehensive Solution toward safety of Reinforcement Learning (RL)
(Zolfagharian et al., IEEE TSE 2023-2025):
• Pre-Deployment Testing: Rigorous testing is essential to identify and mitigate unsafe
behaviors in RL agents before deployment.
• Runtime Safety Monitoring: Continuous safety monitoring ensures ongoing protection by
predicting and mitigating potential safety violations during operation.
• STARLA: A search-based testing approach for RL agents, with a focus on detecting unsafe
behaviors
• SMARLA: A runtime safety monitoring approach that predicts and prevents potential safety
violations during RL agent operation
STARLA: Testing RL Agents
Goal: Maximize fault detection

Data:
• We record episodes (i.e., sequences of states and actions that could be taken by the RL agent)
• Generate random episodes by running the agent in the test environment
• Sample episodes from later training steps of the RL agent
• Label: functional fault observed in the environment
STARLA: Testing RL Agents
ML-based fault prediction model:
• Predicts faulty episodes without executing them
• Abstracts states to increase the learnability of the ML model
• Abstraction: state abstraction reduces the state space; it can be defined as a mapping A: S → S_A from the original state space to the abstracted state space, where the abstract state space S_A is smaller than the original state space S
• The model is used as a fitness function to guide the search toward faulty episodes
Approach – Q* State Abstraction
• We rely on a Q-irrelevance state abstraction technique.
• Q*(s,a) is the predicted overall reward (expected reward) from state s when the agent selects action a in state s.
• We consider two states to be in the same abstraction class when the expected overall rewards for all actions in both states agree up to the abstraction level d:

⌊Q*(s1,a)/d⌋ = ⌊Q*(s2,a)/d⌋, ∀a ∈ A

• The abstraction level can be changed via the parameter d (a sketch follows).
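A minimal sketch of this abstraction, assuming the floor-based binning above (the paper's exact rounding may differ):

```python
import numpy as np

def abstraction_class(q_values, d: float) -> tuple:
    """Map a concrete state to its abstraction class. q_values holds
    Q*(s, a) for every action a in A; states whose binned Q-values agree
    for all actions share a class, and a larger d gives a coarser abstraction."""
    return tuple(np.floor(np.asarray(q_values, dtype=float) / d).astype(int))

# Two states whose Q-values fall in the same d-wide bins share a class.
print(abstraction_class([1.02, 3.9], d=0.5) == abstraction_class([1.21, 3.7], d=0.5))  # True
```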
SMARLA: Safety Monitoring
(Plot: the monitor's estimated probability of a safety violation, from 0% to 100%, over time steps 1 to 20; the violation is predicted several steps before the crash occurs.)
Approach – SMARLA Training
Features: presence (1) or absence (0) of each abstract state in an episode

Episodes                                    | Abstract state 1 | Abstract state 2 | … | Abstract state n | Label
E1 = [(S1,a1), (S2,a2), …, (Sj,aj)]         |        0         |        1         | … |        0         | Safe
E2 = [(S’1,a’1), (S’2,a’2), …, (S’i,a’i)]   |        1         |        1         | … |        1         | Unsafe
Approach – SMARLA Training
• We need a lightweight ML model that can accurately classify RL episodes as safe or unsafe, since we
target resource-constrained edge devices.
• Therefore, we exclude DNN models and choose Random Forest as our machine learning model because it:
• Can handle a large number of features effectively
• Provides efficient and timely classification results
• Is robust to overfitting, ensuring reliable performance
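A minimal sketch of the training step, mirroring the feature table above with toy data (real feature matrices have one binary column per abstract state and many recorded episodes):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One row per episode, one binary column per abstract state
# (1 = the abstract state occurs somewhere in the episode).
X = np.array([[0, 1, 0],    # E1
              [1, 1, 1],    # E2
              [0, 0, 1],
              [1, 0, 1]])
y = np.array([0, 1, 0, 1])  # 0 = safe, 1 = unsafe

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict_proba([[1, 1, 0]]))  # estimated violation probability for a new episode
```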
Summary
• Testing, safety analysis, and re-training DNN and RL models
• With or without simulation in the loop
• Focus on images and classification with DNNs
• Understand the differences between DNN models and combine them
• Efficiently testing fine-tuned models
• Testing systems (since safety is a system property)
• Safety analysis and decisions through explanations
• Scalability: Surrogate models, co-evolutionary algorithms, …
Not Addressed
• Simulation fidelity
• Inputs other than images
• Other properties than safety
• Generative models
Other Related Works
• Determining the safety boundaries of ML-based systems
• Safety monitoring of learned components using safety metric forecasting
• …
• See references
Safe AI Systems
Automated Testing and Safety Analysis
of Deep Neural Networks
Lionel Briand, FACM, FIEEE, FRSC
http://www.lbriand.info
Selected References (1)
• Ul Haq et al. "Automatic Test Suite Generation for Key-points Detection DNNs Using Many-
Objective Search" ACM International Symposium on Software Testing (ISSTA), 2021
• Ul Haq et al. " Can Offline Testing of Deep Neural Networks Replace Their Online Testing?,
Empirical Software Engineering, 2021
• Ul Haq et al., “Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and
Many-Objective Optimization” IEEE/ACM ICSE 2022
• Ul Haq et al., “Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled
Systems”, IEEE/ACM ICSE 2023
• Fahmy et al. "Supporting DNN Safety Analysis and Retraining through Heatmap-based
Unsupervised Learning" IEEE Transactions on Reliability, Special section on Quality Assurance
of Machine Learning Systems, 2021
• Fahmy et al. "Simulator-based explanation and debugging of hazard-triggering events in DNN-
based safety-critical systems”, ACM TOSEM, 2022
Selected References (2)
• Attaoui et al., “Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering”,
ACM TOSEM, 2022
• Attaoui et al., “Supporting Safety Analysis of Image-processing DNNs through Clustering-based Approaches”,
ACM TOSEM, 2024
• Aghababaeyan et al., “Black-Box Testing of Deep Neural Networks through Test Case Diversity”, IEEE TSE 2023
• Aghababaeyan et al., “DeepGD: A multi-objective black-box test selection approach for deep neural networks”,
ACM TOSEM, 2024
• Zolfagharian et al., “Search-Based Testing Approach for Deep Reinforcement Learning Agents”, IEEE TSE, 2023
• Zolfagharian et al., “SMARLA: A Safety Monitoring Approach for Deep Reinforcement Learning Agents”, IEEE TSE,
2024
• Sharifi et al., “Identifying the Hazard Boundary of ML-enabled Autonomous Systems Using Cooperative Co-Evolutionary Search”, IEEE TSE, 2024
• Abbasishahkoo et al., “TEASMA: A Practical Approach for the Test Assessment of Deep Neural Networks using Mutation Analysis”, IEEE TSE, 2024
Selected References (3)
• Attaoui et al., “Search-based DNN Testing and Retraining with
GAN-enhanced Simulations”, IEEE TSE 2025
• Yousefizadeh et al., “Using Cooperative Co-evolutionary Search to
Generate Metamorphic Test Cases for Autonomous Driving
Systems”, IEEE TSE 2025
• Aghababaeyan et al., “DiffGAN: A Test Generation Approach for
Differential Testing of Deep Neural Networks”
• Abbasishahkoo et al., “MetaSel: A Test Selection Approach for
Fine-tuned DNN Models”, ArXiv 2025
Backup
Correlation with Faults
(Plot: correlation between geometric diversity and faults.)
Estimating Faults in DNNs
#Clusters ~ #Faults
Faults ≠ mispredictions
Simulation+DNN Examples
• High-fidelity simulators: Carla, LGSVL
• DNN-based ADAS: Pylot (many DNN models) with Carla; Apollo (20 DNN models) with LGSVL