
Object detection from 3D images of digital twin assets for transport infrastructure management using CLIP and an Explainable Deep Learning Approach

by
Ramin Udash

May 20, 2024

Submission: March 20, 2024


Supervisors: Prof. Dr.-Ing. Hendro Wicaksono, Dr. Vijaya Annas

Constructor University — School of Computer Science and Engineering


Statutory Declaration

Family Name, Given/First Name Udash, Ramin

Matriculation number 30005353


What kind of thesis are you submitting (Bachelor-, Master- or PhD-Thesis): Bachelor

English: Declaration of Authorship

I hereby declare that the thesis submitted was created and written solely by myself
without any external support. Any sources, direct or indirect, are marked as such. I am
aware of the fact that the contents of the thesis in digital form may be revised with
regard to usage of unauthorized aid as well as whether the whole or parts of it may be
identified as plagiarism. I do agree my work to be entered into a database for it to be
compared with existing sources, where it will remain in order to enable further
comparisons with future theses. This does not grant any rights of reproduction and
usage, however.

This document was neither presented to any other examination board nor has it been
published.

German: Erklärung der Autorenschaft (Urheberschaft)

Ich erkläre hiermit, dass die vorliegende Arbeit ohne fremde Hilfe ausschließlich von
mir erstellt und geschrieben worden ist. Jedwede verwendeten Quellen, direkter oder
indirekter Art, sind als solche kenntlich gemacht worden. Mir ist die Tatsache bewusst,
dass der Inhalt der Thesis in digitaler Form geprüft werden kann im Hinblick darauf, ob
es sich ganz oder in Teilen um ein Plagiat handelt. Ich bin damit einverstanden, dass
meine Arbeit in einer Datenbank eingegeben werden kann, um mit bereits
bestehenden Quellen verglichen zu werden und dort auch verbleibt, um mit
zukünftigen Arbeiten verglichen werden zu können. Dies berechtigt jedoch nicht zur
Verwendung oder Vervielfältigung.

Diese Arbeit wurde noch keiner anderen Prüfungsbehörde vorgelegt noch wurde sie
bisher veröffentlicht.

18.05.2024
___________________________________________________________________
Date, Signature
Acknowledgments
I would like to express my sincere gratitude to my supervisor, Prof. Dr.-Ing. Hendro Wicaksono, and
co-supervisor, Dr. Vijaya Annas, for their invaluable guidance, unwavering support, and profound
insights throughout this research endeavor. Their patience, encouragement, and dedication to excellence
have been instrumental in shaping this work. Despite their demanding schedules, they consistently found the
time to provide constructive feedback and steer me in the right direction, for which I am truly grateful.

I am deeply indebted to my family for their unconditional love, understanding, and moral support,
which have been the driving force behind my academic pursuits. Their belief in me has been a constant
source of motivation, enabling me to tackle challenges with resilience and determination.

I would also like to extend my heartfelt appreciation to the researchers whose groundbreaking work has
inspired and informed this study. Their contributions have laid the foundation for this research and
have been instrumental in advancing the field of explainable artificial intelligence and its applications
in the domain of transport infrastructure management.

Completing this bachelor’s thesis has been a transformative journey, marking my first steps into the
world of academia. I am grateful for the opportunity to learn from my esteemed supervisor, Prof.
Dr. -Ing. Hendro Wicaksono, who has imparted invaluable lessons in conducting and communicating
research effectively. This experience has ignited a passion for lifelong learning, and I look forward to
continuing my academic pursuits with the same dedication and enthusiasm.

To all those who have supported me throughout this endeavor, I extend my heartfelt gratitude. Your
contributions have been instrumental in helping me realize my academic goals and have paved the way
for future explorations in this exciting field. Ramin

Contents

1 Introduction 4
1.1 Background and problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Research objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Literature Review 6
2.1 Digital Twins in the Transport Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Object Detection with CLIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 How CLIP Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Multi-view Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 CLIP Batch-Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.4 Zero-Shot Detection in CLIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Explainable Deep Learning (xAI) approaches . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Grad-CAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 LIME . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.3 Patch Detection with CLIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3 Methodology 15
3.1 Data Acquisition and Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Sourcing 3D assets of transport industry objects . . . . . . . . . . . . . . . . . . 15
3.1.2 Bounding sphere and Multi-view rendering . . . . . . . . . . . . . . . . . . . . . 15
3.2 Brief overview of Research Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4 Implementation and Outcomes 18


4.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1.1 Tools used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1.2 Installation of dependencies and libraries . . . . . . . . . . . . . . . . . . . . . . 18
4.1.3 Running the CLIP-ViT-B/32 model . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Explainability Techniques Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2.1 GradCAM implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2.2 LIME implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.3 Patch Detection for Explainability . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5 Discussion and Interpretation 24


5.1 Interpretation of results and performance analysis . . . . . . . . . . . . . . . . . . . . . 24
5.2 Challenges and Limitations with our approach . . . . . . . . . . . . . . . . . . . . . . . 26
5.3 Practical Implications and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . 27

6 Conclusion 28

7 References 29

1. Introduction

1.1 Background and problem statement


During recent years, the digital twin has emerged as a crucial concept in various industries. Originally
introduced by NASA in 2012, the digital twin is defined as "an integrated multiphysics, multiscale,
probabilistic simulation of an as-built vehicle or system that uses the best available physical models,
sensor updates, fleet history, etc., to mirror the life of its corresponding physical counterpart" [1]. This
concept combines expertise from diverse fields, including computer science, data science, and indus-
trial engineering, to enable the seamless integration of data between physical and virtual environments.

In the context of transportation infrastructure, the digital twin has proven to be a valuable tool for
effectively addressing the challenges faced in infrastructure engineering. By reflecting the performance
of real-world products through virtual space simulation, the digital twin can leverage the increasing
amount of equipment monitoring data generated by sensor and computing technology[2]. Furthermore,
the importance of 3D object classification in digital twins has been emphasized, as it enables the effi-
cient generation of a digital twin using CAD models[3]. Additionally, an ontology-based methodology
has been proposed for effective data management in digital twin applications, transforming conceptual
knowledge into a minimum data model for flexible database management.

One of the common data types that needs to be integrated into the digital twin of transport in-
frastructure is 3D images. These 3D images must be annotated and linked to the semantic structure
of the ontology, ensuring that the contents of the 3D scene are interpreted based on the ontology
definition and seamlessly integrated into the digital twin. In this research, the focus will be on object
detection from 3D images of digital twin assets for transport infrastructure management, utilizing the
CLIP (Contrastive Language-Image Pre-training) model and an explainable deep learning approach[4].
By leveraging the power of CLIP and explainable deep learning techniques, the aim is to develop a
robust and interpretable system for accurately detecting and classifying objects within the 3D images
of the digital twin, ultimately enhancing the management and maintenance of transport infrastructure.

1.2 Research objectives


This thesis deals with explainable deep learning methods, with a specific focus on object detection
within 3D images of digital twin assets, particularly within the domain of transport infrastructure
management. The primary objective is to elevate the precision and dependability of object detection
and identification within intricate 3D environments, thus facilitating more informed decision-making
in the realm of transport infrastructure maintenance and planning. By harnessing digital twin technol-
ogy—a dynamic, virtual representation of physical assets—this research addresses critical challenges
inherent in managing expansive transport systems, including roads, bridges, and tunnels, through the
application of advanced computational models.

The crux of the research problem lies in understanding the complexities of interpreting 3D spatial
data using deep learning models, which, while potent, often operate as ’black boxes’ with limited
transparency in their decision-making processes. This opacity poses significant hurdles for stakehold-
ers reliant on these models for pivotal infrastructure decisions. Therefore, this study explores the

integration of explainable artificial intelligence (XAI) techniques into deep learning frameworks to
provide clearer insights into model predictions[6]. Specifically, it utilizes the CLIP neural network
developed by OpenAI, which enables multimodal learning, incorporating both textual and visual in-
formation for enhanced understanding and interpretation[5].

Through empirical investigation, this thesis seeks not only to push the boundaries of object detec-
tion within 3D digital twins but also to contribute to the broader space of explainable AI by furnishing
methodologies and insights that enhance the interpretability and trustworthiness of deep learning mod-
els in transport infrastructure contexts. Moreover, by shedding light on the inner workings of these
models, this research strives to instill confidence and foster trust among stakeholders, thereby strength-
ening the efficacy of decision-making processes in the realm of transport infrastructure management.
The key objectives of this research are:

1. To investigate the application of CLIP (Contrastive Language-Image Pre-training) for object


detection in 3D images of digital twin assets for transport infrastructure.

2. To develop an explainable deep learning approach for object detection and classification in the
3D images, providing insights into the decision-making process.
3. To integrate the object detection and classification results into the digital twin structure, enabling
seamless data management and infrastructure monitoring.

4. To evaluate the performance and effectiveness of the proposed approach in improving transport
infrastructure management and maintenance.

2. Literature Review

2.1 Digital Twins in the Transport Industry


Digital twins are gaining significant importance in the transport industry due to their ability to create
virtual replicas of physical transportation assets and systems, enabling improved planning, monitoring,
and maintenance.

In the transport sector, digital twins can be applied to various assets such as roads, bridges, tunnels,
vehicles, and entire transportation networks. By integrating real-time data from sensors and other
sources, digital twins can provide a comprehensive view of the current state and performance of these
assets, allowing for predictive maintenance, optimized asset utilization, and enhanced decision-making
processes[7].

2.2 Object Detection with CLIP


CLIP (Contrastive Language-Image Pre-training) is a multimodal neural network that has shown
promising results for object detection tasks, particularly in scenarios where labeled data is limited[4].
By leveraging its pretraining on a massive dataset of image-text pairs, CLIP can establish connections
between visual and textual representations, enabling zero-shot or few-shot object detection capabili-
ties[8]. In the context of our research on object detection from 3D images of digital twin assets for
transport infrastructure management, CLIP’s multimodal nature and ability to leverage large-scale
pretraining make it a compelling approach to explore.

2.2.1 How CLIP Works


CLIP, which stands for Contrastive Language-Image Pre-training, is a groundbreaking deep learning
model developed by OpenAI[5]. It represents a significant advancement in the field of multimodal
learning, enabling the joint understanding of images and text. Traditional methods often treated
image and text understanding as separate tasks, but CLIP combines both modalities into a single
unified framework. At its core, CLIP utilizes a transformer-based architecture, similar to models such
as BERT and GPT, which have achieved remarkable success in natural language processing tasks[4].
The transformer architecture allows CLIP to efficiently process both images and text by leveraging
self-attention mechanisms.

CLIP consists of two main components:


1. Vision Transformer (ViT): This component processes images by dividing them into smaller
patches and passing them through a series of transformer layers. Each patch is embedded into a
high-dimensional space, allowing the model to capture spatial relationships and features within
the image[5].
2. Text Transformer: In parallel, CLIP processes textual descriptions or prompts associated with
the images. These textual inputs are encoded using the same transformer architecture, resulting
in a text embedding that captures semantic information[5].
A general overview of how CLIP works can be seen in figure 2.1, adapted from the CLIP research
paper:

Figure 2.1: CLIP pretraining and prediction overview

Contrastive Learning
CLIP uses contrastive learning to align the embeddings of images and
text in a shared semantic space. Contrastive learning aims to bring similar pairs closer together while
pushing dissimilar pairs apart. CLIP achieves this alignment through a large-scale pre-training process,
where it learns to associate images and corresponding textual descriptions.
During pre-training, CLIP is presented with a vast dataset of images and associated text from
diverse sources, such as the internet. The model learns to predict whether a given image-text pair is
a match (positive sample) or a mismatch (negative sample)[4]. Through this process, CLIP learns to
understand the semantics of both images and text and captures the rich relationships between them.

Cosine Similarity
CLIP utilizes cosine similarity as a measure to quantify the alignment between the visual and textual
representations learned by the model[9]. Cosine similarity is a metric that calculates the cosine of the
angle between two non-zero vectors, providing a value between -1 and 1, where 1 indicates that the
vectors point in the same direction, and -1 indicates that they are diametrically opposed[10].

In the context of CLIP, the model generates visual and textual embeddings for the input image and
text, respectively. The cosine similarity between these embeddings is then computed to determine the
degree of similarity or relevance between the image and the textual description.

The mathematical formula for cosine similarity between two vectors, A and B, is given by:
\[
\text{cosine similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
\]

Where:
• A · B represents the dot product of the two vectors.

• ∥A∥ and ∥B∥ represent the L2 norms (Euclidean lengths) of vectors A and B, respectively.
In CLIP, the cosine similarity is calculated between the visual embedding of the input image and the
textual embedding of the target description or class. The higher the cosine similarity score, the more
relevant the image is to the given text, according to the model’s learned representations.
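
To make the formula concrete, the following minimal PyTorch sketch computes the cosine similarity between two example vectors, both directly from the definition and with the built-in helper (the vectors are arbitrary placeholders):

```python
import torch
import torch.nn.functional as F

# Two toy vectors standing in for an image embedding and a text embedding.
A = torch.tensor([1.0, 2.0, 3.0])
B = torch.tensor([2.0, 4.0, 6.0])

manual = (A @ B) / (A.norm() * B.norm())                       # A·B / (||A|| ||B||) -> 1.0 (parallel vectors)
builtin = F.cosine_similarity(A.unsqueeze(0), B.unsqueeze(0))  # same value via the library helper
print(manual.item(), builtin.item())
```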

This cosine similarity metric plays a crucial role in CLIP’s zero-shot and few-shot capabilities, as
it allows the model to associate visual and textual representations without requiring extensive labeled
data for specific object classes or categories[10].

The image encoder and text encoder within CLIP are trained to maximize the cosine similarity between
the embeddings of matching image-text pairs and minimize the cosine similarity between non-matching

pairs. This is achieved by optimizing a contrastive loss function, which is a variant of the InfoNCE
loss used in self-supervised learning. The contrastive loss for a batch of N image-text pairs can be
formulated as:

\[
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\!\left(\operatorname{sim}(I_i, T_i)/\tau\right)}{\sum_{j=1}^{N} \exp\!\left(\operatorname{sim}(I_i, T_j)/\tau\right)}
\]

where:
• Ii and Ti are the embeddings of the i-th image and text in the batch, respectively.
• sim(Ii , Tj ) is the cosine similarity between the i-th image and j-th text embeddings.

• τ is a temperature parameter that scales the logits.


The objective is to maximize the similarity between matching image-text pairs (Ii, Ti) while minimizing
the similarity between non-matching pairs (Ii, Tj) for j ≠ i [9].
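
The loss above (the image-to-text direction; CLIP averages it with the symmetric text-to-image term) can be written compactly as a cross-entropy over similarity logits. The following is a minimal PyTorch sketch, not the original training code; the function name and batch size are illustrative:

```python
import torch
import torch.nn.functional as F

def image_to_text_infonce(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """InfoNCE loss over a batch of N matching image-text pairs (the formula above)."""
    image_emb = F.normalize(image_emb, dim=-1)   # normalise so the dot product is cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / tau        # sim(I_i, T_j) / tau, shape (N, N)
    targets = torch.arange(len(logits))          # the matching pair sits on the diagonal
    return F.cross_entropy(logits, targets)      # = -(1/N) * sum_i log softmax_ii

# Toy usage with random 512-dimensional embeddings (the width of CLIP ViT-B/32).
loss = image_to_text_infonce(torch.randn(8, 512), torch.randn(8, 512))
```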

Zero-shot Learning
One of the most remarkable capabilities of CLIP is its ability to generalize to unseen tasks and concepts
through zero-shot learning. By learning a unified representation space for images and text, CLIP can
perform tasks without the need for task-specific training data[5]. This means that CLIP can understand
and generate responses to textual prompts or queries even for tasks it has never been explicitly trained
on.

Applications
CLIP has demonstrated impressive performance across various applications, including image classi-
fication, object detection, image retrieval, and natural language understanding. Its versatility and
generalization capabilities make it particularly well-suited for tasks that require multimodal under-
standing, such as image captioning, visual question answering, and content-based image retrieval.

Conclusion
In summary, CLIP represents a significant milestone in the field of multimodal learning, enabling
models to understand both images and text in a unified framework. By leveraging contrastive learning
and a transformer-based architecture, CLIP achieves state-of-the-art performance across a wide range
of tasks, making it a powerful tool for researchers and practitioners alike in fields such as computer
vision, natural language processing, and AI.

2.2.2 Multi-view Rendering


CLIP is an image-based neural network, so to effectively leverage its capabilities for 3D assets, we need
to generate multiple rendered views or image instances from the 3D asset model. This multi-view ren-
dering approach allows us to capture the 3D object from different angles and perspectives, providing
CLIP with a diverse set of visual representations that can be associated with textual descriptions.

The rendering process begins with the selection of a suitable 3D file format. The widely-adopted
Wavefront .obj format is an excellent choice due to its lightweight nature, ease of reading and writing,
and compatibility with various rendering engines. This format not only defines the geometry of the
3D object but also references material files (.mtl) and texture files (.jpg, .png, .bmp, etc.), allowing for
realistic and detailed rendering.

The multi-view rendering process with CLIP involves generating multiple views of the 3D object
from different angles and perspectives. Each view is then associated with a textual description, which
can be generated automatically or provided manually. These textual descriptions serve as anchors,
allowing CLIP to establish meaningful connections between the visual and textual representations of
the 3D object.

Figure 2.2: Multi-view render of a bridge

The rendering parameters play a crucial role in determining the quality and characteristics of
the generated views. These parameters may include camera position, lighting conditions, material
properties, and rendering settings such as resolution, anti-aliasing, and post-processing effects. By
carefully adjusting these parameters, developers can create a diverse set of views that capture the es-
sential features and details of the 3D object from various angles and under different lighting conditions.

Once the multi-view rendering process is complete, the resulting image-text pairs can be fed into
the CLIP model for training or inference tasks. CLIP’s powerful language-vision capabilities enable a
wide range of applications, such as 3D object recognition, retrieval, and captioning, as well as more
advanced tasks like 3D scene understanding, object manipulation, and content creation.

The combination of multi-view rendering and CLIP opens up exciting possibilities for enhancing
human-computer interaction, enabling natural language-based control and manipulation of 3D ob-
jects, and facilitating the creation of immersive and interactive virtual environments. By leveraging
the strengths of both techniques, researchers and developers can push the boundaries of computer
vision, graphics, and multimedia, unlocking new avenues for innovation and exploration.

2.2.3 CLIP Batch-Predictions


To perform batch predictions with CLIP on multiple image instances rendered from a 3D asset model,
we need to leverage CLIP’s capability to encode and process batches of images simultaneously. Since
CLIP is an image-based neural network, it requires visual inputs in the form of image tensors or ar-
rays[11].

Load the 3D asset model: First, we load the 3D asset model in a suitable format, such as Wave-
front .obj or other compatible formats supported by rendering libraries like PyTorch3D.

Multi-view rendering: We generate multiple rendered views or image instances of the 3D asset model
from different angles, perspectives, and lighting conditions. This can be achieved using rendering
libraries like PyTorch3D, which provide tools for setting up cameras, lighting, and rendering parame-
ters. The number of views generated can vary based on the desired level of detail and computational

resources available.

Preprocess images: The rendered image instances need to be preprocessed to match the input re-
quirements of the CLIP model. This typically involves resizing, normalization, and conversion to
tensor format compatible with PyTorch or other deep learning frameworks.

Batch creation: The preprocessed image instances are then combined into batches, which can be
efficiently processed by CLIP[11]. The batch size should be chosen based on the available computa-
tional resources and memory constraints.

CLIP encoding: The batches of image instances are fed into the CLIP model’s encode_image method,
which encodes the visual information into high-dimensional feature vectors or embeddings.

Text encoding: In parallel, the text descriptions or labels associated with the 3D asset model are
tokenized and encoded using CLIP’s encode_text method, resulting in text embeddings.

Similarity computation: The image embeddings and text embeddings are compared using a similarity
metric, such as cosine similarity, to determine the relevance or match between each image instance
and the text description[9].

Prediction and analysis: The similarity scores can be analyzed and interpreted to make predictions,
classifications, or rankings based on the task at hand. For example, in a classification task, the image
instances with the highest similarity scores to the target class labels can be identified as the predicted
classes.

By leveraging CLIP’s ability to process batches of images and associate them with text descriptions,
we can efficiently perform predictions and analyses on multiple rendered views of a 3D asset model.
This approach allows us to capture the 3D object’s visual characteristics from various angles and per-
spectives, enhancing the model’s understanding and enabling more accurate and robust predictions.
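
A condensed sketch of this pipeline, using the official clip package, is shown below. The file pattern for the rendered views and the candidate labels are placeholders, not the actual dataset used in this thesis:

```python
import glob
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Multi-view renders of one 3D asset (placeholder paths) and candidate class prompts.
view_paths = sorted(glob.glob("renders/bridge_view_*.png"))
images = torch.stack([preprocess(Image.open(p)) for p in view_paths]).to(device)
labels = ["a bridge", "a tunnel", "a road", "a train"]
text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(images)          # (num_views, 512)
    text_emb = model.encode_text(text)              # (num_labels, 512)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sims = image_emb @ text_emb.T                   # cosine similarities, (num_views, num_labels)

probs = (100.0 * sims).softmax(dim=-1)              # per-view class probabilities
print(labels[probs.mean(dim=0).argmax().item()])    # aggregate the views into one prediction
```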

2.2.4 Zero-Shot Detection in CLIP


Zero-shot detection with CLIP on multiple image instances rendered from a 3D asset model involves
leveraging CLIP’s ability to align visual and textual representations in a shared embedding space. This
approach allows us to detect and localize objects within the rendered images without requiring any
prior training on the specific 3D asset or its associated classes.

The process begins by generating a diverse set of rendered views or image instances from the 3D
asset model, capturing the object from various angles, perspectives, and lighting conditions. These
image instances are then preprocessed and fed into CLIP’s image encoder, which maps them into
high-dimensional visual embeddings.

Simultaneously, we provide CLIP with textual descriptions or class labels related to the objects or
concepts we want to detect within the rendered images. These text inputs are encoded into corre-
sponding textual embeddings using CLIP’s text encoder.

To perform zero-shot detection, we compute the similarity scores between the visual embeddings of
the rendered image instances and the textual embeddings of the target class labels or descriptions.
This can be done using cosine similarity or other distance metrics in the shared embedding space[8].

The key idea is that CLIP’s pretraining on a vast dataset of image-text pairs has enabled it to learn
rich visual-semantic associations. As a result, rendered image regions or patches that align well with
the provided textual descriptions will have high similarity scores, indicating the presence and location
of the target objects within the image[5].

By sliding a window or patch across the rendered image instances and computing the similarity scores
between the local visual embeddings and the target textual embeddings, we can effectively localize and

detect the objects of interest. This process can be repeated for multiple textual descriptions or class
labels, enabling the detection of various objects within the same set of rendered images[4].

The zero-shot detection capability of CLIP is particularly valuable in scenarios where annotated train-
ing data for the specific 3D asset or object classes is scarce or unavailable. It allows us to leverage
CLIP’s broad knowledge acquired during pretraining and apply it to novel objects or domains without
the need for additional fine-tuning or data collection efforts.

2.3 Explainable Deep Learning (xAI) approaches


Explainable Artificial Intelligence (XAI) refers to techniques that make the decision-making process of AI
models, especially complex deep neural networks like CLIP, more transparent, interpretable, and un-
derstandable to humans. XAI addresses the opaque nature of many state-of-the-art AI models by
revealing the underlying reasoning and factors influencing their outputs[12]. This enhances trust, ac-
countability, and understanding of these systems, crucial in high-stakes domains like healthcare and
finance.

Within our research on object detection of 3D assets in the transport industry, XAI can provide
insights into the specific image features or regions that the AI model considers most relevant for
identifying what an object is. By applying XAI methods, we can better understand a particular
model’s performance and limitations and judge its reliability in comparison to other models[14].
This can potentially lead to an accurate framework for object detection of 3D assets within
digital twin environments, ultimately improving logistical efficiency and remote virtual
supply chain management.

2.3.1 Grad-CAM
GradCAM is a technique used to generate visual explanations for the predictions made by convolutional
neural networks (CNNs). It highlights the important regions in the input image that contribute the
most to the CNN’s output prediction[16]. In the context of object detection from 3D images of digital
twin assets for transport infrastructure management using an explainable deep learning approach,
GradCAM can be utilized as follows:
1. GradCAM computes the gradients of the target output class with respect to the feature maps
of a specific convolutional layer in the CNN. These gradients are then combined with the corre-
sponding feature map activations to produce a coarse localization map highlighting the important
regions in the image for the target class[16].
2. For 3D object detection tasks, GradCAM can be applied to the 2D convolutional layers of the
CNN that process the projected 2D views of the 3D data (e.g., rendered images or depth maps).
The resulting heatmaps can reveal the regions in the 2D views that the model focuses on for
detecting and classifying different types of transport infrastructure assets, such as bridges, roads,
or tunnels.
3. By visualizing these heatmaps, we can gain insights into the model’s decision-making process
and ensure that it is attending to the relevant features and regions of the 3D assets. This can
help identify potential biases, errors, or limitations in the model’s predictions, and guide further
improvements or fine-tuning of the deep learning approach[15].
Grad-CAM uses neuron importance weights to determine the contribution of each feature map in the
convolutional layer to the final prediction.

The neuron importance weight $\alpha_k^c$ for a feature map k and class c is calculated as [19]:

\[
\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}
\]

Where:

• $y^c$ is the score for class c (before the softmax).
• $A^k$ is the k-th feature map of the convolutional layer.
• i and j are the spatial indices of the feature map.
• Z is a normalization factor equal to the number of elements in $A^k$.

The Grad-CAM heatmap is then computed as a weighted combination of the feature maps, using
the neuron importance weights [16]:

\[
L^{c}_{\text{Grad-CAM}} = \operatorname{ReLU}\!\left(\sum_k \alpha_k^c A^k\right)
\]

In the context of CLIP, the Grad-CAM technique can be applied to visualize the important regions in
the input image that contribute to the model’s prediction for a given text prompt. The neuron im-
portance weights would be calculated based on the gradients of the cosine similarity score between the
image and text embeddings with respect to the feature maps of the chosen convolutional layer in the
image encoder[17].
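
A minimal sketch of this idea (not necessarily the implementation used later in this thesis) is given below, assuming the ResNet-50 CLIP variant ("RN50"), whose last convolutional stage provides the feature maps $A^k$; the image path and text prompt are placeholders, and the hook-based bookkeeping is one of several ways to capture activations and gradients:

```python
import torch
import torch.nn.functional as F
import clip
from PIL import Image

device = "cpu"                          # matches the CPU-only setup used in this thesis
model, preprocess = clip.load("RN50", device=device)

activations, gradients = {}, {}

def save_activation(module, inputs, output):
    activations["value"] = output                                    # feature maps A^k, (1, C, H, W)
    output.register_hook(lambda grad: gradients.update(value=grad))  # dy^c / dA^k

model.visual.layer4.register_forward_hook(save_activation)

image = preprocess(Image.open("bridge_view.png")).unsqueeze(0).to(device)  # placeholder rendering
text = clip.tokenize(["a photo of a bridge"]).to(device)

image_emb = model.encode_image(image)
text_emb = model.encode_text(text)
score = F.cosine_similarity(image_emb, text_emb)   # y^c: the image-text similarity score
model.zero_grad()
score.backward()

A, dA = activations["value"], gradients["value"]
alpha = dA.mean(dim=(2, 3), keepdim=True)                 # neuron importance weights (spatially averaged gradients)
cam = F.relu((alpha * A).sum(dim=1, keepdim=True))        # weighted combination of feature maps + ReLU
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalise to [0, 1] for heatmap overlay
```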

2.3.2 LIME
LIME (Local Interpretable Model-Agnostic Explanations) is a technique used to explain the predictions
of any machine learning model, including object detection models like Faster R-CNN and YOLO[20].

LIME works by approximating the complex machine learning model (e.g., object detector) with an
interpretable model locally around the prediction to be explained. It does this by perturbing the input
data (e.g., 3D image) and observing how the predictions change. The interpretable model (e.g., linear
regression) is then trained on the perturbed data, weighted by its proximity to the instance being
explained[20].

For object detection tasks, LIME can be used to explain the probability or confidence score of de-
tecting a particular object class in a given image region. The key steps would be:

1. Segment the input 3D image into superpixels or image patches.

2. Perturb the superpixels/patches and get the model predictions for each perturbed instance.

3. Weight the perturbed instances based on their similarity to the original instance.

4. Train an interpretable model (e.g., sparse linear model) on the weighted instances to approximate
the object detector locally.

5. Use the interpretable model to highlight the superpixels/patches that contributed most to the
object detection prediction.

By visualizing these highlighted regions, we can understand which parts of the 3D image the object
detector focused on for detecting different infrastructure assets like bridges or tunnels. This can help
identify potential biases, errors or reveal insights about the model’s decision-making process[21].

LIME generates local explanations by optimizing the following objective function[20]:

\[
\xi(x) = \operatorname*{arg\,min}_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g)
\]

Where:

• x is the instance being explained.

• f is the original black-box model.
• G is the class of interpretable models (e.g., linear models).
• g is the interpretable model used to approximate f locally.
• L is a measure of how unfaithful g is in approximating f locally.

• πx is the proximity measure that defines the locality around x.


• Ω(g) is a complexity measure of the interpretable model g.
The proximity measure πx defines the locality around the instance x. A common choice for πx is the
exponential kernel[20]:
\[
\pi_x(z) = \exp\!\left(-\frac{D(x, z)^2}{\sigma^2}\right)
\]

Where:
• D is a distance metric (e.g., Euclidean distance).
• σ is a kernel width parameter that controls the size of the locality.

The proximity measure πx assigns higher weights to instances z that are closer to x, ensuring that
the interpretable model g focuses on approximating the original model f in the local neighborhood of
x.

The LIME technique can be applied to CLIP to generate local explanations for its predictions. In this
case, the black-box model f would be the CLIP model, and the interpretable model g could be a linear
model or decision tree that approximates CLIP’s behavior locally around a specific input image and
text pair.
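
One way this could be set up in practice is with the lime package's image explainer, wrapping CLIP's zero-shot scoring as the black-box classifier; the prompt list, image path, and hyperparameters below are illustrative placeholders:

```python
import numpy as np
import torch
import clip
from PIL import Image
from lime import lime_image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
prompts = ["a photo of a bridge", "a photo of a tunnel", "a photo of a road"]  # placeholder classes
text_tokens = clip.tokenize(prompts).to(device)

def clip_predict(images: np.ndarray) -> np.ndarray:
    """LIME passes perturbed images as an (N, H, W, 3) array; return (N, num_classes) probabilities."""
    batch = torch.stack([preprocess(Image.fromarray(img.astype(np.uint8))) for img in images]).to(device)
    with torch.no_grad():
        image_emb = model.encode_image(batch)
        text_emb = model.encode_text(text_tokens)
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        return (100.0 * image_emb @ text_emb.T).softmax(dim=-1).cpu().numpy()

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    np.array(Image.open("bridge_view.png").convert("RGB")),   # placeholder rendering
    clip_predict, top_labels=1, num_samples=1000)
img, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=False)
# `mask` marks the superpixels that contributed most to CLIP's top prediction.
```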

The main challenge with LIME for object detection is determining the appropriate representation and
perturbation approach for the 3D data. However, recent works like SODEx have proposed methods to
adapt LIME for explaining object detectors on 2D image data[22].

2.3.3 Patch Detection with CLIP


Patch Detector is a method that leverages CLIP’s multimodal capabilities to detect objects in images
by attending to relevant image patches or regions[23]. Here’s how Patch Detector with CLIP can be
applied:
1. The input image is divided into a grid of patches or regions.
2. For each patch, CLIP’s image encoder processes the patch independently to obtain a visual
representation.

3. CLIP’s text encoder encodes the query text describing the object of interest (e.g., ”car”, ”person”,
etc.) to obtain a textual representation.
4. The similarity between each patch’s visual representation and the textual representation is com-
puted, typically using a cosine similarity metric[9].
5. The patches with the highest similarity scores are considered as potential object detections, as
they align most closely with the query text.
6. Post-processing steps like non-maximum suppression can be applied to refine the detections and
remove duplicates or overlapping regions[10].

By visualizing the patches with high similarity scores, we can gain insights into the regions of the image
that CLIP attends to for detecting the specified object class[23]. This can help understand CLIP’s
decision-making process and identify potential biases or limitations in its object detection capabilities.

The patchify function used to split an image into patches takes the following arguments:[4]

• image_path: The path to the input image file.

• resolution: The desired resolution to resize the image.
• patch_size: The size of the square patches to be extracted from the image.
• patch_stride (optional): The stride or step size for extracting patches. If not provided, it
defaults to patch_size.

The function first loads the image using the load_image helper function and converts it to a PyTorch
tensor using transforms.ToTensor()[4].

The key step in the patchify function is the use of the unfold operation from PyTorch. The unfold
operation extracts sliding local blocks (patches) from a batched input tensor. In this case, the input
tensor is the image tensor, and the unfold operation is applied along the height and width dimensions
(dimensions 1 and 2) of the tensor.

The unfold operation is performed twice, first along the height dimension and then along the width
dimension, to extract square patches from the image. The unfold operation is defined as:

patches = img_tensor.unfold(1, patch_size, patch_stride) \
                    .unfold(2, patch_size, patch_stride)

After applying the unfold operation twice (along height and width), the resulting tensor has dimensions
(channels, num_patches_h, num_patches_w, patch_size, patch_size), where the last two
dimensions represent the height and width of the patches.

The next step is to reshape the tensor to have dimensions (num_patches, channels, patch_size,
patch_size) using the following line:[4]

patches = patches.reshape(3, -1, patch_size, patch_size).permute(1, 0, 2, 3)

This line first reshapes the tensor to have dimensions (channels, num_patches, patch_size,
patch_size), and then permutes the dimensions to (num_patches, channels, patch_size, patch_size).

The resulting patches tensor contains all the square patches extracted from the input image, with
each patch represented as a tensor of shape (channels, patch_size, patch_size). These patches
can then be processed by CLIP’s image encoder to obtain visual embeddings, which can be compared
with the textual embeddings of the target object or class using cosine similarity.
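
A sketch of this scoring step is shown below. It assumes a patches tensor of shape (num_patches, 3, patch_size, patch_size) with values in [0, 1], as produced by the unfold/reshape steps above; the prompt is a placeholder, and the normalisation constants are those of the official CLIP preprocessing pipeline:

```python
import torch
import torch.nn.functional as F
import clip

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# CLIP's input normalisation (mean/std of the official preprocessing pipeline).
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def score_patches(patches: torch.Tensor, prompt: str) -> torch.Tensor:
    """Return one cosine-similarity score per patch for the given text prompt."""
    inputs = F.interpolate(patches, size=224, mode="bicubic", align_corners=False)
    inputs = (inputs - mean) / std
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        patch_emb = F.normalize(model.encode_image(inputs.to(device)), dim=-1)
        text_emb = F.normalize(model.encode_text(text), dim=-1)
    return (patch_emb @ text_emb.T).squeeze(1)      # (num_patches,)

# Usage: scores = score_patches(patches, "a photo of a bridge"); best_patch = scores.argmax()
# The highest-scoring patch can then be outlined or visualised as the detection.
```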

By visualizing the patches with high cosine similarity scores, the Patch Detection method can identify
the relevant regions in the input image corresponding to the specified object or class[9].

The key advantage of Patch Detector with CLIP is that it leverages CLIP’s multimodal represen-
tations learned from large-scale pretraining, enabling zero-shot or few-shot object detection without
requiring extensive labeled data for specific object classes. However, it may not match the performance
of fully supervised object detectors and can be computationally expensive for dense patch grids.

3. Methodology

3.1 Data Acquisition and Preprocessing


We first took steps to acquire the necessary data for this research, which in our case involves obtaining
3D assets of transport industry objects. These 3D assets allow us to test and evaluate our proposed
object classification and explainable AI (xAI) methods in an applicable setting.

Acquiring high-quality and diverse 3D models of various transport infrastructure components, such as
bridges, tunnels, roads, and vehicles, is crucial for ensuring the robustness and overall generalization of
our approach. By sourcing these 3D assets from online repositories and databases, we aim to gather a
comprehensive dataset that represents the wide range of objects encountered in the transport industry.

The 3D nature of these assets allows us to leverage their inherent three-dimensional characteristics,
enabling us to explore and develop techniques for object classification and detection from multiple
viewpoints and angles. Furthermore, the availability of 3D data provides an opportunity to investigate
and apply explainable AI methods, which can offer insights into the decision-making process of our
models and enhance interpretability, a critical aspect in the transport infrastructure domain.

The groundwork for evaluating the performance and explainability of our proposed methods was laid
by starting with a diverse and representative collection of 3D assets. This should ultimately contribute
to the development of more reliable and trustworthy object classification and detection systems for the
transport industry.

3.1.1 Sourcing 3D assets of transport industry objects


To obtain the necessary data for our research, we sourced 3D models of various transport industry
objects, including infrastructure assets such as bridges, tunnels, roads, and vehicles. These 3D models
were acquired from a variety of online repositories and databases that provide free or commercially
available 3D assets for research and development purposes.

The process of sourcing these 3D assets involved searching through relevant online platforms, eval-
uating the quality and suitability of the available models, and ensuring that they meet the necessary
licensing requirements for our research. Additionally, we aimed to gather a diverse set of 3D mod-
els representing different types of transport infrastructure and vehicles to ensure the robustness and
generalizability of our approach.

3.1.2 Bounding sphere and Multi-view rendering


Once the 3D assets were obtained, we employed a technique called bounding sphere computation to
determine the optimal viewpoints for rendering multiple 2D image instances from each 3D model. The
bounding sphere is a spherical volume that encompasses the entire 3D object, and its center serves as
a reference point for generating viewpoints around the object.

By calculating the bounding sphere for each 3D asset, we were able to systematically position vir-
tual cameras at various angles and distances from the object, ensuring comprehensive coverage of its
different perspectives. This process, known as multi-view rendering, involved rendering 2D image in-
stances of the 3D model from multiple viewpoints, capturing its appearance from different angles and

orientations.

The multi-view rendering approach allowed us to generate a rich dataset of 2D image instances for each
3D asset, providing a diverse set of visual representations that can be used for training and evaluating
our object detection and classification models. This technique is particularly valuable in the context
of our research, as it enables us to leverage the 3D nature of the transport industry objects while
also generating the necessary 2D image data required for deep learning-based object detection and
classification tasks.
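
One way to realize this step, sketched below with PyTorch3D (one of the rendering libraries mentioned in Section 2.2.3), is to approximate the bounding sphere from the mesh's bounding box and place the virtual cameras on a sphere at a fixed relative distance around the centred object; the file name, number of views, and camera/light settings are illustrative, and the mesh is assumed to carry materials/textures:

```python
import torch
from pytorch3d.io import load_objs_as_meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, MeshRasterizer, MeshRenderer, PointLights,
    RasterizationSettings, SoftPhongShader, look_at_view_transform,
)

device = torch.device("cpu")
mesh = load_objs_as_meshes(["asset.obj"], device=device)   # placeholder 3D asset with .mtl/.jpg textures

# Approximate the bounding sphere from the axis-aligned bounding box.
bbox = mesh.get_bounding_boxes()[0]                 # (3, 2): per-axis min/max
center = bbox.mean(dim=1)                           # sphere centre
radius = (bbox[:, 1] - bbox[:, 0]).norm() / 2       # half the box diagonal (upper bound of the radius)

# Centre the mesh, then distribute cameras around it at a distance proportional to the radius.
mesh = mesh.offset_verts((-center).expand(mesh.verts_packed().shape[0], 3))
azimuths = torch.arange(0, 360, 45, dtype=torch.float32)   # 8 views around the object
elevations = torch.full_like(azimuths, 30.0)
R, T = look_at_view_transform(dist=float(2.5 * radius), elev=elevations, azim=azimuths)
cameras = FoVPerspectiveCameras(R=R, T=T, device=device)

renderer = MeshRenderer(
    rasterizer=MeshRasterizer(cameras=cameras,
                              raster_settings=RasterizationSettings(image_size=224)),
    shader=SoftPhongShader(device=device, cameras=cameras,
                           lights=PointLights(device=device,
                                              location=[[0.0, float(3 * radius), 0.0]])),
)
views = renderer(mesh.extend(len(azimuths)))        # (8, 224, 224, 4) RGBA renders, one per viewpoint
```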

3.2 Brief overview of Research Methodology


To approach this research topic on object detection from 3D images of digital twin assets for trans-
port infrastructure management using CLIP and an explainable deep learning approach, the following
research methodology, simplified here for clarity, was employed:

1. Familiarization with Pre-trained CLIP Model


(a) Get familiar with a pre-trained CLIP model that is based on the Vision Transformer ViT-
B/32 architecture.
(b) Experiment with the pre-trained CLIP model to understand its capabilities and limitations
in object detection tasks.
2. CLIP Object Detection for 3D Image Instances
(a) Collect 3D asset files of transport industry objects
(b) Run multi-view rendering of the 3D assets collected to retrieve 2D image instances

(c) Develop a pipeline to input 3D image instances of transport industry objects into the pre-
trained CLIP model.
(d) Perform object detection and classification using the CLIP model on these image instances.
(e) Analyze the performance of the CLIP model in accurately classifying these objects via Batch
Prediction and similarity scores.

3. Explainable AI (xAI) Methods over CLIP


(a) Apply various explainable AI (xAI) methods, such as Grad-CAM, LIME, and similar ap-
plicable approaches, to the CLIP model’s object detection and classification results.
(b) Provide interpretability and insights into the CLIP model’s decision-making process for the
object detection and classification tasks.
(c) Understand the strengths, limitations, and trade-offs of each xAI method in the context of
this research.
4. Limitations and Challenges of xAI Methods
(a) Critically analyze the limitations and challenges associated with the application of xAI
methods to the CLIP model’s object detection and classification results.
(b) Identify the potential drawbacks or shortcomings of the xAI methods in providing compre-
hensive and reliable explanations for the model’s decisions.
(c) Explore ways to address these limitations and enhance the interpretability of the overall
system.

5. Performance Evaluation and Final Steps


(a) Evaluate the overall performance and effectiveness of the proposed approach in improving
transport infrastructure management and maintenance.
(b) Provide an overview of the next steps with regard to the end state of this thesis.

Throughout this study, the following research practices have been employed:
1. Literature review: Comprehensive review of the state-of-the-art methods and techniques in the
relevant domains, including digital twins, 3D object detection, CLIP, and explainable deep learn-
ing.
2. Experimental evaluation: Extensive experimentation with various tools, libraries, and datasets
to assess the performance and feasibility of the proposed approach.
3. Documentation and organization: Detailed documentation of the research process, including
code, data sources, and project management practices.
4. Collaboration and feedback: Engagement with domain experts, industry partners, and the re-
search community to gather feedback and refine the research direction.

By following these research methodologies, the study aims to contribute to the advancement
of digital twin technology in the transportation infrastructure domain, leveraging the power of
CLIP and explainable deep learning to enhance the management and maintenance of critical
infrastructure assets.

4. Implementation and Outcomes

4.1 Experimental setup


For our experimental setup, we utilized an M1 MacBook (8 GB RAM), running both the rendering and
CLIP on the CPU. The primary tools and software employed in our research were as follows:

4.1.1 Tools used


1. Python: The programming language of choice for our implementation, renowned for its extensive
libraries and community support in the field of machine learning and computer vision.
2. PyTorch: A widely-used open-source machine learning framework that provides efficient tensor
computations and supports GPU acceleration.
3. CLIP (Contrastive Language-Image Pre-training): The state-of-the-art multimodal model de-
veloped by OpenAI, which serves as the foundation for our object detection and classification
tasks.
4. Jupyter Notebook: An interactive computing environment that facilitated code development,
visualization, and documentation throughout our research process.
5. Blender: A free and open-source 3D creation suite used for importing and exporting 3D models
in various formats.

4.1.2 Installation of dependencies and libraries


To set up the required environment for our experiments, we installed the following dependencies and
libraries using Python’s package manager, pip:
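
The exact listing is not reproduced here; the block below is a plausible reconstruction of the environment setup and of the notebook's import cell, based on the modules enumerated next (package versions omitted):

```python
# Reconstructed setup: the third-party packages were installed with pip, roughly
#   pip install torch torchvision matplotlib pillow numpy scipy ftfy regex tqdm
#   pip install git+https://github.com/openai/CLIP.git
# followed by an import cell along these lines.
import json
import math
import os
import urllib.request

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import grad
import torchvision.transforms as transforms

import clip                     # OpenAI's CLIP model
from scipy import ndimage
```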

These libraries and modules provide essential functionality for various tasks, including:
1. matplotlib.pyplot: A plotting library for creating static, animated, and interactive visualizations
in Python.
2. PIL (Python Imaging Library): A library for opening, manipulating, and saving images.

3. numpy: A fundamental package for scientific computing in Python, providing support for large,
multi-dimensional arrays and matrices.
4. os and json: Standard Python libraries for interacting with the operating system and handling
JSON data, respectively.
5. torch: PyTorch, an open-source machine learning library for building and training neural net-
works.
6. torchvision: A package consisting of popular datasets, model architectures, and image transfor-
mations for computer vision tasks.
7. torch.autograd: PyTorch’s module for automatic differentiation, enabling efficient computation
of gradients.
8. torch.nn.functional: A module providing tensor operations and functional utilities for building
neural networks in PyTorch.
9. urllib.request: A Python module for retrieving data across the web.
10. clip: The Contrastive Language-Image Pre-training (CLIP) model developed by OpenAI, used
for multimodal tasks involving images and text.
11. scipy.ndimage: A submodule of SciPy (Scientific Python) that provides functions for multi-
dimensional image processing.
12. torch.nn: PyTorch’s module for constructing and manipulating neural networks.
13. math: A built-in Python module providing access to mathematical functions.
These libraries and modules were carefully selected and installed to ensure that our experiments could
leverage the necessary functionality for tasks such as image processing, neural network construction
and training, data manipulation, and visualization.

It’s important to note that the specific versions of these libraries may vary depending on the re-
quirements of your project and the compatibility with other dependencies. It’s recommended to
document the exact versions used to ensure reproducibility and consistency across different computing
environments.

4.1.3 Running the CLIP-ViT-B/32 model
The following Python code demonstrates how we set up and ran the CLIP ViT-B/32 model for our
initial experiments. We loaded the pre-trained CLIP model and processed an input image, in this case,
a picture of a dog. The model then computed the cosine similarity scores between the image’s visual
representation and various textual descriptions, including "A picture of a Dog", "A cat on a table",
and "A picture of London at night" [4]. By comparing these similarity scores, we could evaluate the
model’s ability to correctly associate the image with the relevant textual description.
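
The original notebook cell is not reproduced here; a minimal sketch of the described experiment looks as follows (the image path is a placeholder):

```python
import torch
import clip
from PIL import Image

device = "cpu"   # the experiments ran on an M1 MacBook CPU
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)   # placeholder image of a dog
prompts = ["A picture of a Dog", "A cat on a table", "A picture of London at night"]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    cosine_sims = (image_emb @ text_emb.T).squeeze(0)   # one similarity score per prompt

for prompt, sim in zip(prompts, cosine_sims.tolist()):
    print(f"{prompt}: {sim:.3f}")
```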

In this initial experiment, we expected the cosine similarity score between the image and the text
"A picture of a Dog" to be the highest, indicating that the model correctly recognized the image as depicting a dog.
The scores for ”A cat on a table” and ”A picture of London at night” should be lower, reflecting the
model’s ability to differentiate between different object categories.

Figure 4.1: (a) A picture of a dog; (b) cosine similarity chart.

This setup allowed us to validate the CLIP model’s performance on a simple image classification
task and provided a baseline for further experiments involving more complex object detection and
classification scenarios in the context of our research on transport infrastructure assets.

4.2 Explainability Techniques Implementation
After validating the CLIP ViT-B/32 model’s performance on the initial image classification task,
we proceeded to apply various explainable AI (xAI) techniques to enhance the interpretability and
transparency of the model’s predictions. Specifically, we employed Grad-CAM, LIME, and Patch
Detector methods to analyze the model’s decision-making process on the 2D image instances generated
from the 3D assets of transport industry objects.

4.2.1 GradCAM implementation


Grad-CAM (Gradient-weighted Class Activation Mapping) was utilized to generate visual explanations
by highlighting the important regions in the input images that contributed most to the model’s output
predictions. By applying Grad-CAM to the 2D renderings of 3D transport infrastructure assets,
we could identify the specific areas or features that the model focused on for object detection and
classification tasks. The following images show the results after running the Grad-CAM method on
various transport infrastructure assets.

(a) Train (b) Tunnel

(c) Bridge (d) Rail Track

(e) Rail Bridge (f) Road

Figure 4.2: GradCAM visualizations with passed input text below them. The heatmap shows which
areas were responsible for the respective textual language description

4.2.2 LIME implementation
Additionally, we implemented LIME (Local Interpretable Model-Agnostic Explanations) to approxi-
mate the complex CLIP model with an interpretable model locally around the predictions of interest.
This technique provided insights into the model’s decision-making process by analyzing the impact of
perturbing different regions or superpixels in the input images on the model’s outputs.

The following images show the results after running the LIME method on various transport infras-
tructure assets.

(a) Bridge (b) Tunnel

(c) Rail Bridge (d) Car

Figure 4.3: LIME visualizations with passed input text below them.

4.2.3 Patch Detection for Explainability
Furthermore, we employed the Patch Detector approach, which involved dividing the input images
into patches and computing the similarity between each patch’s visual representation and the textual
representation of the target object class. By visualizing the patches with high similarity scores, we
could identify the relevant regions corresponding to the specified transport infrastructure components,
such as bridges, tunnels, or roads.

The following images show the results after running the Patch Detection method on various transport
infrastructure assets.

(a) Train (b) Tunnel

(c) Rail Bridge (d) Car

Figure 4.4: Patch Detection across transport industry 3D assets. The patch with red outline had the
highest cosine similarity score.

5. Discussion and Interpretation

5.1 Interpretation of results and performance analysis


To evaluate the effectiveness of the employed explainable AI (xAI) methods in enhancing the inter-
pretability of the CLIP model’s predictions, we conducted a comprehensive performance analysis. In
our analysis, we found Patch Detection to be the most effective method for enhancing the explainability
of the CLIP model, offering clear insights into the model’s decision-making process.

Patch Detector proved to be reliable since, with multi-view renderings of the same 3D asset, it con-
sistently identified the same parts of the 3D asset file with the highest relevance for prediction. This
consistency across different orientations and viewpoints of the same object provided confidence in the
model’s decision-making process and highlighted the relevant features that contributed to the object
detection and classification tasks.

(a) Train- Rendering no. 1 (b) Train- Rendering no. 2

(c) Car- Rendering no. 1 (d) Car- Rendering no. 2

Figure 5.1: Patch Detection across multiple renderings of the same 3D asset file. The patch with red outline
had the highest cosine similarity score with input text ”Train” and ”Car” respectively.

While GradCAM provided some insights into the important regions of the input images, its explainabil-
ity did not seem as reliable as that of Patch Detector. The visual explanations generated by GradCAM
were mostly sensible but sometimes less precise or focused, making it challenging to pinpoint the exact
features or regions that the model relied upon for its predictions. An example can be observed in
figure 5.2 below. Here, it can be seen that for the 3D asset of an old bridge, only one particular
section was highlighted with the highest intensity while the other portions of the bridge were left
out. The same is the case with the 3D asset of a tunnel.

Figure 5.2: Grad-CAM visualization for bridge and tunnel. Panels: (a) Tunnels, (b) City Intersection, (c) Train.

For some visualizations, such as the one shown in Figure 5.2 (c), the heatmap appears somewhat plausible,
but it is still not reliable enough to provide complete transparency into the decision-making of the CLIP model.

Additionally, LIME proved to be ineffective as an explainer for CLIP in our experiments, highlighting
the need for alternative approaches to achieve comprehensive interpretability in this context. The
approximations provided by LIME did not accurately capture the model’s decision-making process,
potentially due to the complexity of the CLIP architecture and the multimodal nature of the task.
Figure 5.3: LIME visualization for tunnels and city intersection. Panels: (a) Tunnels, (b) City Intersection, (c) Bridge.

Overall, our performance analysis revealed that Patch Detector was the most suitable and reliable
method for enhancing the explainability of the CLIP model in the context of object detection and
classification of transport infrastructure assets. Its ability to consistently identify relevant regions
across multiple viewpoints and its clear visual representations made it a valuable tool for understand-
ing the model’s decision-making process and ensuring that it was attending to the appropriate features
of the 3D assets.

5.2 Challenges and Limitations with our approach


While our research approach of using CLIP and explainable AI techniques for object detection and
classification of transport infrastructure assets from 3D data showed promising results, we encountered
some challenges and limitations that need to be addressed for its effective practical use in digital twin
environments.

One significant challenge was the computational overhead and time required to run CLIP-based object
detection quickly. Since CLIP is a large pre-trained model, processing and rendering 3D assets into
2D image instances for object detection can be time-consuming, especially when dealing with complex
3D models or large-scale digital twin environments[24]. This limitation could hinder the real-time
performance and scalability of our approach within a digital twin environment.

Additionally, our current implementation of CLIP-based object detection works effectively with only
a few instances of a 3D model within multiple image frames. This limitation arises from the fact that
CLIP is primarily designed for image classification tasks, and our approach involves processing 2D
renderings of 3D assets. Extending this approach to handle multiple instances of objects within a 3D
environment could require additional techniques, such as instance segmentation or object tracking,
which may further increase the computational complexity.

Furthermore, we observed a trade-off between the accuracy of the pre-trained CLIP model and the
computational resources required. While higher-capacity pre-trained models, such as CLIP-ViT-L/14,
generally provided better accuracy in object detection and classification tasks, they also demanded sig-
nificantly more computational resources and time to run. This trade-off highlights the need for efficient
model optimization and deployment strategies to balance accuracy and computational requirements,
especially in resource-constrained environments or real-time applications[24].

Another limitation of our approach is the reliance on 2D image instances generated from 3D as-
sets. While this approach leverages the strengths of CLIP and allows for the application of explainable
AI techniques, it may not fully capture the rich 3D information and spatial relationships present in
the original 3D data. Exploring techniques that can directly process 3D data or leverage 3D rep-
resentations could potentially enhance the accuracy and interpretability of our object detection and
classification pipeline[24].

Despite these challenges and limitations, our research provides a solid foundation for exploring
the use of multimodal models like CLIP and explainable AI techniques in the context of transport
infrastructure management. Addressing these limitations through further research and development
could pave the way for more efficient, scalable, and interpretable object detection and classification
systems tailored to the unique requirements of the transport industry.

5.3 Practical Implications and Future Directions
Backed by the explainable AI methods that provided various ways of understanding which portions of
the image data CLIP used for object detection, we can conclude that CLIP is indeed a reliable deep
learning model for object detection of 3D assets within the transport industry. Explainable methods
like Patch Detector reliably pinpointed crucial features across various renderings, enhancing the inter-
pretability and trustworthiness of the model’s predictions.

One of the key advantages of our CLIP-based approach is its ability to perform reliable batch predic-
tions for object detection tasks in the transport industry. By leveraging CLIP’s zero-shot capabilities,
our method does not require extensive pretraining or fine-tuning on domain-specific data, making it a
lightweight and fast solution that can be run without the need for powerful GPU resources. This char-
acteristic makes our approach particularly attractive for applications where computational resources
are limited or where rapid deployment is necessary.
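To make this concrete, the following sketch shows what such a zero-shot batch prediction can look like: a batch of 2D renderings is classified against a set of candidate text prompts in a single forward pass, without any task-specific training. The file names and prompts are placeholders, not the assets used in this thesis.

```python
import torch
import clip
from PIL import Image

# Illustrative sketch: rendering file names and prompts are placeholders.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

renderings = ["bridge_01.png", "tunnel_03.png", "train_07.png"]
prompts = ["a bridge", "a tunnel", "a road", "a train", "a car"]

images = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in renderings]).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    # One forward pass for the whole batch: no fine-tuning or task-specific training required.
    logits_per_image, _ = model(images, text)
    probs = logits_per_image.softmax(dim=-1)

for name, p in zip(renderings, probs):
    print(f"{name}: {prompts[int(p.argmax())]} ({p.max().item():.2%})")
```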

Furthermore, our experiments demonstrated that CLIP can achieve high accuracy in object detec-
tion tasks, comparable to or even surpassing traditional object detection models. This performance,
combined with the interpretability provided by explainable AI techniques, makes our approach a com-
pelling solution for transport infrastructure management, where both accuracy and transparency are
crucial. Looking ahead, several future research directions can be explored to further enhance and refine
our approach:

1. Fine-tuning CLIP for Object Detection: While CLIP’s zero-shot capabilities are valuable, fine-
tuning the model on a dataset containing annotated images from the transport infrastructure
domain could potentially enhance its ability to detect relevant objects accurately. This fine-
tuning process could leverage transfer learning techniques and domain-specific data to adapt
CLIP to the unique characteristics of transport assets.
2. Comparative Analysis with Traditional Object Detection Models: Conducting a comprehensive
comparative analysis between CLIP-based object detection and traditional object detection mod-
els like YOLO or Faster R-CNN could provide valuable insights into the strengths and weaknesses
of each approach. Such an analysis could inform the selection of the most suitable method for
specific use cases within the transport industry.
3. Exploring 3D Object Detection: While our current approach focuses on 2D image instances gen-
erated from 3D assets, exploring techniques that can directly process 3D data or leverage 3D
representations could potentially enhance the accuracy and interpretability of our object detec-
tion pipeline. This direction could involve integrating CLIP with 3D deep learning architectures
or developing new multimodal models tailored for 3D object detection tasks.
4. Integration with Digital Twin Environments: Seamlessly integrating our CLIP-based object
detection approach into digital twin environments could enable real-time monitoring, analysis,
and decision-making for transport infrastructure management. This integration would require
addressing challenges related to computational efficiency, scalability, and real-time performance,
potentially leveraging edge computing or cloud-based solutions.
By pursuing these future research directions, we can further refine and enhance our approach, unlocking
the full potential of multimodal models like CLIP and explainable AI techniques for reliable, interpretable,
and efficient object detection and classification in the transport infrastructure domain.

6. Conclusion

The primary goal of this study was to investigate the efficacy of applying explainable AI approaches
for object detection of 3D assets in the transport infrastructure domain. The utilization of these tech-
niques could lead to more effective, precise, and interpretable object detection systems, which have
substantial implications for transport infrastructure management and maintenance.

In our study, we employed the CLIP (Contrastive Language-Image Pre-training) model, a state-of-
the-art multimodal approach, for object detection tasks on 2D image instances generated from 3D
assets of transport infrastructure components and vehicles. CLIP demonstrated remarkable accuracy
in identifying and classifying these objects, leveraging its zero-shot capabilities and large-scale pre-
training on image-text pairs.

Additionally, we incorporated explainable AI techniques, such as Grad-CAM, LIME, and Patch De-
tector, to provide visual explanations and insights into the model’s decision-making process. These
techniques highlighted the relevant regions or features in the input images that contributed most to
CLIP’s predictions, enhancing the interpretability and transparency of the object detection system.

By presenting a framework that combines the excellent predictive performance of CLIP with the
interpretability provided by explainable AI techniques, our research makes a significant contribution
to the field of object detection for transport infrastructure management. It also contributes to the
broader topic of explainable AI by demonstrating its applicability and potential in the transport in-
dustry.

However, our study does have some limitations. Future research should consider expanding the dataset
to include a wider variety of 3D assets and transport infrastructure components, as well as incorporating
additional performance metrics for comprehensive model evaluation. The accuracy and explainability
of the object detection system can also be improved by investigating new explainable AI techniques
and exploring more advanced or ensemble model architectures.

Furthermore, the integration of our approach into digital twin environments and real-time monitoring
systems for transport infrastructure could be a valuable direction for future research, addressing chal-
lenges related to computational efficiency, scalability, and real-time performance.

This study highlights the significant potential of merging explainable and non-explainable AI algo-
rithms in the field of object detection for transport infrastructure management. It serves as a foun-
dation for developing more accurate, reliable, and interpretable AI systems tailored to the unique
requirements of the transport industry, ultimately contributing to improved decision-making, asset
maintenance, and overall operational efficiency.

7. References

1. Allen, B. D. (2021). Digital Twins and Living Models at NASA. NASA Technical Reports Server. https://ntrs.nasa.gov/citations/20210023699 [This source provides the original definition of digital twins by NASA and their history in the aerospace industry.]
2. Wu, J., Wang, X., Dang, Y., & Lv, Z. (2022). Digital twins and artificial intelligence in transportation infrastructure: Classification, application, and future research directions. Transportation Research Part C: Emerging Technologies, 139, 103646. https://doi.org/10.1016/j.trc.2022.103646 [This source discusses the applications of digital twins and AI in transportation infrastructure, including the importance of 3D object classification.]
3. Tao, F., Zhang, M., & Nee, A. Y. C. (2019). Digital Twin Driven Smart Manufacturing. Academic Press. [This book provides a comprehensive overview of digital twins and their applications in various industries, including manufacturing.]
4. CLIP official GitHub repository. https://github.com/openai/CLIP [This is the official GitHub repository for CLIP published by OpenAI.]
5. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR. https://openai.com/index/clip/
6. Salahuddin, Z., Woodruff, H. C., Chatterjee, A., & Lambin, P. (2022). Transparency of deep neural networks for medical image analysis: A review of interpretability methods. Artificial Intelligence in Medicine, 101, 102217. https://doi.org/10.1016/j.artmed.2021.102217
7. Kušić, K., Schumann, R., & Ivanjko, E. (2023). A digital twin in transportation: Real-time synergy of traffic data streams and simulation for virtualizing motorway dynamics. Journal of Systems Architecture, 127, 102546. https://www.sciencedirect.com/science/article/pii/S1383762122003160
8. Fine-Tuning the CLIP Foundation Model for Image Classification. Alexander Thamm blog. https://www.alexanderthamm.com/en/blog/fine-tuning-the-clip-foundation-model-for-image-classification/
9. What is Cosine Similarity? A Comprehensive Guide. DataStax. https://www.datastax.com/guides/what-is-cosine-similarity
10. Glossary entry on "Cosine Similarity". LearnDataSci. https://www.learndatasci.com/glossary/cosine-similarity/
11. Zhang, B., et al. Long-CLIP: Unlocking the Long-Text Capability of CLIP. https://arxiv.org/html/2403.15378v1
12. Transparency of deep neural networks for medical image analysis: A review of interpretability methods. https://www.sciencedirect.com/science/article/pii/S0010482521009057
13. Interpretability techniques: state of the art. Management Solutions. https://www.managementsolutions.com/sites/default/files/minisite/static/22959b0f-b3da-47c8-9d5c-80ec3216552b/iax/pdf/explainable-artificial-intelligence-en-04.pdf
14. Wren, H. What is AI transparency? A comprehensive guide. Zendesk. https://www.zendesk.de/blog/ai-transparency/

15. Advanced AI explainability for PyTorch (pytorch-grad-cam). https://github.com/jacobgil/pytorch-grad-cam
16. Grad-CAM: Visualize class activation maps with Keras, TensorFlow, and Deep Learning. PyImageSearch. https://pyimagesearch.com/2020/03/09/grad-cam-visualize-class-activation-maps-with-keras-tensorflow-and-deep-learning/
17. Explain network predictions using Grad-CAM. MathWorks. https://www.mathworks.com/help/deeplearning/ref/gradcam.html
18. Pixel Attribution (Saliency Maps). Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/pixel-attribution.html
19. Gradient-Weighted Class Activation Mapping. ScienceDirect Topics. https://www.sciencedirect.com/topics/computer-science/gradient-weighted-class-activation-mapping
20. Local Interpretable Model-Agnostic Explanations (LIME). C3 AI glossary. https://c3.ai/glossary/data-science/lime-local-interpretable-model-agnostic-explanations/
21. LIME GitHub repository. https://github.com/marcotcr/lime
22. Surrogate Object Detection Explainer (SODEx) with YOLOv4 and LIME. https://www.mdpi.com/2504-4990/3/3/33
23. Zero-Shot Object Detection with OpenAI's CLIP. Pinecone. https://www.pinecone.io/learn/series/image-search/zero-shot-object-detection-clip/
24. Learning Free Open-world 3D Scene Representations from 2D Dense CLIP. ICCV 2023 Workshop on OpenSUN3D. https://openaccess.thecvf.com/content/ICCV2023W/OpenSUN3D/papers

