
Chapter 1

Introduction

1.1 Computer Vision

Artificial Intelligence (AI) is reshaping the global landscape, altering not only our interactions but also
our perceptions of the world around us. AI has emerged as one of the most significant disruptors of the
21st century, fundamentally altering economies, societies, and professions [1]. AI can be applied in a
plethora of sectors including transportation, education, healthcare and construction.
As depicted in Fig. 1.1, among the many branches of AI, one of the most impactful is Computer Vision
(CV), a subfield that equips computers with the capacity to analyse and extract meaningful insights from
visual data, such as digital images and videos, thereby opening new frontiers in image processing across
a wide range of applications [2]. CV comprises a variety of subfields such as object recognition, object
detection, video tracking and object segmentation.

Figure 1.1: Relationship between AI and CV [2]

CV’s versatility is reflected in its vast applications across numerous industries, revolutionising sectors
like healthcare and transportation. In healthcare, CV empowers professionals to acquire, process,
and analyse both static and dynamic medical images in real-time, improving diagnostic accuracy and
enabling earlier disease detection. This, in turn, improves treatment outcomes while lowering total costs
for all stakeholders [3].
In transportation, CV plays a pivotal role in autonomous driving, since it allows for real-time environ-
mental perception. By enabling vehicles to perceive and interpret the surrounding environment including

pedestrians, other vehicles, and traffic signals, CV allows for informed decision-making, significantly en-
hancing traffic safety. Continuous research is being conducted to optimise these algorithms and increase
the reliability of CV systems in diverse driving conditions [4].

1.2 Object Detection

Object detection (OD) is one of the most important tasks in image processing and computer vision as
it combines two key components: object classification and object localisation [5]. Image classifica-
tion assigns an image one or multiple labels that correspond to a category, whereas object localisation
identifies the position of one or more objects within an image and draws bounding boxes around them
[6]. Because of its dual nature, OD models can identify both what objects are in an image and where
they are located, making it a more complex task than simple categorisation.
Machines, unlike humans, are unable to distinguish and identify objects in the real world by default.
Advanced object detection techniques are needed to bridge this gap [7]. Despite its potential, several
challenges remain, including:

• Number of categories: To identify several classes in a single image, OD models require high-
quality, annotated datasets, which are generally scarce.

• Intra-class variation: Illumination, occlusion, position, and viewpoint significantly alter the ap-
pearance of objects, making detection complex and error-prone.

• Computational efficiency: Modern OD models require substantial computational power, which
can hinder their implementation in real-time environments [8].

• Size and arrangement: Models may struggle when objects occupy a small portion of the image
(< 5%) or are closely packed together, making it difficult to differentiate between objects [9].

Research in object detection has primarily focused on three categories: objectness detection,
which seeks to identify any object regardless of its class; salient OD, which uses human attention
mechanisms to highlight objects of interest; and category-specific OD, which aims to detect and lo-
calise predefined object categories in each image [10], as Fig. 1.2 demonstrates. The latter approach is
the primary focus of this research, where the goal is to accurately identify and localise specific objects
within complex environments.
Deep learning-based OD algorithms are typically divided into two-stage and one-stage detectors.
Two-stage detectors, like R-CNN and its variants, separate the localisation and classification tasks by
first generating region proposals and then refining these through a secondary network that predicts the
class of each object and adjusts the bounding box coordinates accordingly [11]. These models are
more accurate but computationally costly, making them better suited for applications that require great
precision.
In contrast, one-stage detectors like YOLO (You Only Look Once) apply a single neural network over
the entire image to directly predict object locations and classes. YOLO divides the image into a grid

Figure 1.2: The three research directions in object detection [10]

and assigns each cell the responsibility of detecting objects centred within it. It then generates anchor
boxes of varying scales, scores them for objectness, and predicts class probabilities and bounding
box refinements. By eliminating the need for separate proposal generation, one-stage detectors are
significantly faster and better suited for real-time applications [11]. YOLO, introduced in 2015, has been
continuously developed and improved through several versions, each iteration addressing limitations of
its predecessors to enhance accuracy and efficiency. These advancements will be covered in further
detail in the following sections.

1.3 Objectives and Deliverables

The objective of this work is to refine and apply a computer vision algorithm, specifically YOLO,
to accurately detect and identify common structures along Portuguese highways. The ultimate goal is
to evaluate the feasibility of installing photovoltaic panels in these locations, leveraging the algorithm’s
capabilities to optimise infrastructure analysis and decision-making.

1.4 Thesis Outline

Chapter 2

Literature Review

2.1 Artificial Neural Networks

Artificial Neural Networks (ANNs) are computational models inspired by the structure and function of
the human brain. Just like the brain processes information through billions of interconnected neurons,
ANNs consist of nodes (neurons) connected in a web-like structure. These artificial neurons process
data, learn patterns, and make predictions based on input data, drawing parallels to how biological
neurons work [12].
In biological systems, neurons receive signals through dendrites, process them in the cell body, and
transmit them via axons to other neurons. Similarly, artificial neurons receive input signals, apply weights
to modify the significance of each input, and pass the result through an activation function to produce an
output.
The primary goal of ANNs is to emulate the brain’s ability to learn from experience. Unlike traditional
algorithms that require explicit programming of all rules, ANNs develop their own rules of behaviour by
learning from data. This is achieved through learning algorithms that adjust the synaptic weights
(connections between neurons) to minimise the error in predictions [13].
The neurons in an ANN are interconnected and organised into layers, forming the basic structure of
the network. These layers include:

• Input Layer: This layer receives the raw data directly from the external environment. Each neuron
in the input layer corresponds to a feature in the dataset.

• Hidden Layers: Positioned between the input and output layers, these layers process and extract
intermediate representations from the data. Each hidden layer applies transformations to the in-
put using weighted connections and activation functions, enabling the network to learn complex
patterns.

• Output Layer: This layer generates the final predictions or classifications based on the processed
information from the hidden layers [14].

Figure 2.1: General ANN architecture [15]

2.1.1 Components

To better understand how ANNs work, Fig. 2.2 represents an example of a neuron architecture.

Figure 2.2: Example of a neuron architecture [15]

This neuron includes the following components:

1. Inputs (x1, x2, ..., xn): These represent the data fed into the neuron, which could come from external sensory systems or other neurons within the network.

2. Weights (w1, w2, ..., wn): Synaptic weights modify the significance of each input, emulating the
way synapses between biological neurons strengthen or weaken signals. Weights can either am-
plify or attenuate the input values.

3. Bias (bj ): The bias is an additional parameter that helps the neuron adjust its output independently
of the input signals, functioning like a threshold or intercept.

4. Net Input: The neuron computes the net input by summing the weighted inputs and adding the
bias. Mathematically, this is expressed as:

u = \sum_{i=1}^{n} w_i x_i + b_j    (2.1)

Here, u represents the net input to the neuron [16].

5. Activation Function: After computing the net input, the neuron applies an activation function to
introduce non-linearity. This enables the network to learn complex patterns in data. More detailed
information about the activation functions can be found in section 2.2.
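To make these components concrete, the following minimal sketch (plain NumPy; the values are illustrative and not tied to any particular network) evaluates the net input of Equation 2.1 for a single neuron and passes it through a ReLU activation:

```python
import numpy as np

def neuron_output(x, w, b):
    """Single artificial neuron: weighted sum of inputs plus bias (Eq. 2.1),
    followed by a ReLU activation."""
    u = np.dot(w, x) + b          # net input
    return np.maximum(0.0, u)     # activation (see Section 2.2)

# Illustrative inputs, weights, and bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, 0.4])
b = -0.2
print(neuron_output(x, w, b))     # 1.28
```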

2.1.2 Applications and Limitations

ANNs have found diverse applications across various sectors due to their ability to model complex
patterns and relationships in data. They are widely used for tasks such as handwriting recognition, noise
filtering, semantic parsing, question answering, and stock market prediction.
However, ANNs have notable limitations. One major drawback is their extensive use of parameters,
which makes the network computationally expensive. Additionally, ANNs treat all input features as inde-
pendent, which means they fail to account for spatial relationships between neighbouring features. This
is a critical limitation for tasks like object recognition and segmentation, where the spatial arrangement
of pixels in an image carries significant information [17].
To address these challenges, Convolutional Neural Networks were developed.

2.2 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a specialised class of Artificial Neural Networks (ANNs)
designed to handle complex image processing tasks. While ANNs aim to mimic the human brain’s neural
system, CNNs enhance this approach by introducing convolutional layers, making them particularly
effective for visual tasks like object detection and classification [18]. CNNs excel in feature extraction
and representation learning, and their layered architecture has become the dominant approach for object
detection, outperforming traditional methods in accuracy and efficiency [19].
The success of a CNN relies on training with large annotated datasets, where the model optimises
its weights to minimise the error between predictions and ground truth. For instance, a CNN model used
for ground object detection from aerial disaster imagery identified key assets, such as roofs, vehicles
and flooded zones, achieving a mean average precision of 80.69% for high-altitude and 74.48% for low-
altitude images [20].
Such real-world applications highlight the life-saving potential of CNNs, underscoring the importance
of optimising CNN architectures to maximise their impact in critical scenarios.

2.2.1 Structure

The main difference between a CNN and an ANN is the use of alternating convolutional and pooling layers that act as feature extractors, followed by one or more fully connected layers that generate the final predictions.
During the initial training phase, CNNs learn to associate specific features with their corresponding
labels through a process of backpropagation and optimisation, using a large dataset of labelled images
that contain the objects of interest [21–23].
Training involves a non-convex, high-dimensional optimisation problem, where the network’s
weights are iteratively adjusted to minimise error, aiming for an optimal point that enables accurate
classification or prediction in unseen images. Accurate model performance is critically dependent on

this tuning process, which can be improved by employing effective optimisation algorithms [24]. Addi-
tionally, designing the arrangement of CNN components is crucial for creating new architectures that
enhance performance.

Convolutional Layer

The Convolutional layer is the core building block of CNNs. It applies filters (small, learnable matrices)
to the input image, performing a convolution operation that extracts specific features, such as edges,
shapes, and textures. This operation is fundamental in signal processing, image processing, and CV,
enabling the detection of patterns in data.
Mathematically, the convolution operation is denoted by the symbol ∗. For discrete input signals, it is
defined as:


(f ∗ g)[n] = \sum_{m=-\infty}^{\infty} f[m] \, g[n-m] \, \Delta m    (2.2)

where f represents the input signal (e.g., an image) and g is the filter or kernel, n is the position of
the output signal and ∆m is the sampling interval.
For 2D data, such as images, the convolution operation is generalised as:


h[i, j] = \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} g[m, n] \cdot f[i-m, j-n]    (2.3)

where f represents the input image matrix, g is the kernel matrix, and h is the resulting output matrix
(or feature map) after convolution. The indices i and j correspond to the positions in the output matrix
h, while m and n represent the positions within the kernel.
If the kernel size is 3 × 3, the indices m and n range from -1 to 1, as the kernel slides across the
image. This operation effectively combines the kernel values with the corresponding region of the input
image to produce the feature map, where each value represents the presence or intensity of the detected
feature at that location. This process is illustrated in Fig. 2.3.

Figure 2.3: Convolution operation [22]
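As a concrete illustration of Equation 2.3, the sketch below implements a direct 2D convolution in "valid" mode with plain NumPy. Note that most deep learning libraries actually compute cross-correlation (the same operation without flipping the kernel); the image and kernel values here are purely illustrative.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Direct 2D convolution (Eq. 2.3), 'valid' mode: the flipped kernel
    slides over the image and each output value is a weighted sum."""
    k = np.flipud(np.fliplr(kernel))          # flip kernel for true convolution
    H, W = image.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)   # simple vertical-edge filter
print(conv2d_valid(image, edge_kernel).shape)       # (3, 3)
```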

To handle edge cases where the kernel extends beyond the boundaries of the image, padding is
applied by adding extra pixels (commonly set to zero) around the edges of the input. This technique helps
preserve the spatial dimensions of the image as it passes through the convolutional layers, ensuring that
important boundary features are not lost.

Additionally, the stride determines the step size by which the kernel moves across the input image,
influencing the size of the output feature map.
The size of the output feature map depends on several factors: the dimensions of the input image,
the filter size, the stride, and the amount of padding. It can be calculated using the formula:

Output size = \frac{W - F + 2P}{S} + 1    (2.4)

where W is the width (or height) of the input image, F is the width (or height) of the filter, P is the
amount of padding, and S is the stride [23].
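As a quick worked example of Equation 2.4 (the input and filter sizes below are illustrative, not tied to any specific network):

```python
def conv_output_size(W, F, P, S):
    """Spatial output size of a convolutional layer, Eq. (2.4)."""
    return (W - F + 2 * P) // S + 1

# A 224x224 input with a 3x3 filter, padding 1 and stride 1 keeps its size,
# while stride 2 roughly halves it.
print(conv_output_size(224, 3, 1, 1))  # 224
print(conv_output_size(224, 3, 1, 2))  # 112
```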
Each convolutional layer comprises multiple kernels (filters) of size F × F × D, where D is the filter
depth, which matches the depth of the input feature map to ensure proper computation. As the number
of kernels increases, the network’s capacity to capture complex and diverse features also increases.
However, this comes at the cost of higher computational complexity, necessitating a balance between
feature extraction capabilities and computational efficiency [22].

Pooling Layer

After feature maps are generated by the convolutional layer, a pooling layer is added to downsample
the spatial dimensions of these feature maps, making the CNN more efficient and less computationally
intensive. Pooling is crucial for minimising data redundancy while retaining significant features, which
helps prevent overfitting.
The most common pooling methods are Max Pooling, Min Pooling, and Average Pooling. In each
case, a sliding window moves across the input, and either the maximum, minimum, or average value
within the window is calculated, as illustrated in Fig. 2.4. However, these traditional methods have
limitations: average pooling considers all values equally, potentially reducing strong activations, while
max pooling can ignore meaningful but weaker signals, erasing parts of the input data [25].

Figure 2.4: Max Pooling operation [25]
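The following minimal sketch reproduces the 2×2 max pooling operation of Fig. 2.4 with stride 2 using plain NumPy (the feature-map values are illustrative):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep the largest value in each window."""
    H, W = feature_map.shape
    pooled = feature_map[:H - H % 2, :W - W % 2]   # drop odd remainder rows/cols
    return pooled.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 6, 5, 1],
               [7, 2, 9, 8],
               [3, 0, 4, 6]], dtype=float)
print(max_pool_2x2(fm))   # [[6. 5.]
                          #  [7. 9.]]
```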

To address these drawbacks, advanced pooling techniques have been developed. One approach
involves stochastic pooling, which randomly selects activations within each window based on a multi-
nomial distribution, thus retaining a broader range of features [26]. Another is Deep Generalized Max
Pooling, which balances frequent and rare activations, ensuring a more even feature representation [27].
Additionally, alpha-pooling introduces a trainable parameter that adapts the pooling type dynamically,
outperforming traditional pooling methods in specific tasks like image recognition [28].

Activation Layer

Activation functions play a critical role in decision-making within neural networks, allowing them to
learn complex features from input data. Choosing the appropriate activation function can significantly
speed up the training process by facilitating the learning of complex patterns from raw inputs [29]. With-
out activation functions, each layer’s output would simply be a linear transformation of the previous layer’s
input, limiting the network’s ability to capture non-linear relationships that reflect real-world complexity.

Introducing non-linearity through activation functions is essential, as real-world phenomena are often
non-linear. The most common activation functions include the Rectified Linear Unit (ReLU), sigmoid
function, and hyperbolic tangent (tanh). ReLU is widely used for its simplicity and efficiency in pre-
venting neuron saturation, which helps CNNs learn complex features [23]. Figure 2.5 is a graphical
representation of these functions.

Figure 2.5: Graphical representation of ReLU, sigmoid and tanh functions [23]

Mathematically, ReLU is defined as:

ReLU(x) = \max(0, x)    (2.5)

Here, x represents the input to the neuron. ReLU outputs the maximum of 0 and x, effectively
ignoring negative values while retaining positive ones. This enables efficient training by allowing only
significant features to pass through.

The sigmoid function is defined as:

Sigmoid(x) = \frac{1}{1 + e^{-x}}    (2.6)

Mapping inputs to values between 0 and 1, the sigmoid function is useful for binary classification.
However, it can lead to vanishing gradients, which may slow down training.

Similar to sigmoid, tanh maps inputs to values between -1 and 1, allowing for stronger gradient
signals and improved convergence in certain applications.

In addition to these, the softmax function is often applied in the output layer for multi-class classifica-
tion tasks. Softmax converts output values into probabilities across classes, making it suitable for tasks
requiring probability distributions.
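The sketch below evaluates these activation functions on a small input vector (plain NumPy; the values are purely illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                # Eq. (2.5)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # Eq. (2.6)

def softmax(z):
    e = np.exp(z - np.max(z))                # subtract max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(sigmoid(x))    # approx. [0.119 0.5   0.953]
print(np.tanh(x))    # approx. [-0.964 0.    0.995]
print(softmax(x))    # class probabilities that sum to 1
```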

Overall, the choice of activation function in CNNs has a profound impact on model performance and
convergence. Selecting the appropriate function is critical in designing an effective CNN [23].

Fully Connected Layer

A fully connected layer is typically located at the end of a CNN, performing a global operation that
connects each neuron to every neuron in the previous layer. This dense layer is crucial for classification,
as it combines all extracted features and determines which class each image belongs to, bridging the
feature extraction and decision-making stages.
For example, in a three-class classification task, the output layer would contain three neurons, each
representing one class. The values from these neurons represent the probability of the image belonging
to each class [22].

2.2.2 Architectures

Throughout the years, several CNN architectures have been developed, each bringing unique fea-
tures and performance improvements that have shaped the field of computer vision. Here’s an overview
of some of the most significant architectures:

• LeNet: Developed by Yann LeCun in 1998, LeNet was one of the earliest CNN architectures
designed for handwritten digit recognition. It consists of five alternating convolution and pooling
layers, followed by two fully connected layers. An improved version, LeNet-5, achieved more than
98% accuracy on the MNIST dataset but was limited by hardware constraints and dataset size,
making it unsuitable for complex tasks [30].

• AlexNet: Proposed by Krizhevsky et al. in 2012, AlexNet marked a significant breakthrough,
becoming the dividing line between traditional and deep learning approaches. Comprising five
convolutional layers and three fully connected layers, it was the first deep CNN to adopt modern
techniques like ReLU activation and dropout. Achieving state-of-the-art results on the ImageNet
dataset for visual recognition and classification, AlexNet sparked a rapid growth in deep learning
research and applications [31].

• VGG: Developed by Simonyan and Zisserman in 2014, Visual Geometry Group (VGG) introduced
the use of small 3x3 convolutional filters, instead of the 5x5 previously used, which increased depth
while reducing computational cost and the number of training parameters.

• GoogLeNet: Developed in 2014, GoogLeNet, also known as InceptionNet, introduced the concept
of inception modules, which combine filters of multiple sizes to capture information at different
scales. By using global average pooling instead of fully connected layers, GoogLeNet reduced
the number of parameters, improving computational efficiency and enabling deeper networks with
sparse connections that reduce redundant computation by skipping uninformative feature maps.

• ResNet: Created in 2015, ResNet (Residual Network) addressed the “degradation problem,”
where deeper networks reach a point where adding layers no longer improves accuracy. ResNet

added residual connections that let the network skip over some layers. This way, the network can
stay accurate even at deep levels (up to 152 layers) [22].

As observed, there is a trend toward increasing the number of layers in CNNs, which allows for
more complex feature extraction and greater predictive power. Numerous variations of the previously
discussed architectures have been developed to address specific challenges, highlighting the versatility
and importance of CNN architectures. These architectures form the backbone of several algorithms,
including YOLO.

2.3 Performance Metrics

Various algorithms have been developed for object detection in images and videos, and their perfor-
mance is typically evaluated using specific metrics that measure speed and accuracy. These metrics
provide an objective way to compare models and understand their effectiveness across different tasks.
The speed at which a model processes images is commonly measured in Frames Per Second
(FPS), indicating how many images the model can handle per second. Higher FPS values correspond
to faster models, which are essential for real-time applications [32].
To evaluate the accuracy of object detection models, the primary metric used is mean Average
Precision (mAP). This metric summarises the precision of a model across all object categories, offering
a single score to compare models. Before explaining mAP, it’s important to define some foundational
concepts:

• True Positive (TP): A correct detection of a ground-truth bounding box.

• False Positive (FP): An incorrect detection, either of a non-existent object or a misplaced detection
of an existing object.

• False Negative (FN): An undetected ground-truth bounding box.

In object detection, True Negatives (TN) do not apply due to the vast number of possible bounding
boxes that should not be detected in any given image [33].

Precision and Recall

The assessment of object detection performance primarily relies on precision (P) and recall (R).
Precision is the model’s ability to recognise only relevant objects, reflecting the proportion of correct
predictions out of all detections, defined as:

P = \frac{TP}{TP + FP}    (2.7)

Recall represents the fraction of relevant instances that the model successfully detects, calculated
based on TPs and FNs [34]:

R = \frac{TP}{TP + FN}    (2.8)

There is often a trade-off between precision and recall. Increasing detected objects (higher recall)
may result in more false positives, reducing precision. To capture this balance, the Average Precision
(AP) metric uses a precision-recall curve, which measures the area under the curve to provide a combined assessment of precision and recall [33].
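A minimal sketch of Equations 2.7 and 2.8, using hypothetical detection counts at a fixed confidence threshold:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts, Eqs. (2.7) and (2.8)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Illustrative counts: 80 correct detections, 20 spurious detections,
# 40 missed ground-truth boxes.
print(precision_recall(80, 20, 40))   # (0.8, 0.666...)
```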

Mean Average Precision

Object detection models are generally tasked with identifying and localising multiple categories within
an image. The AP metric calculates each category’s average precision separately, then takes the mean
across all categories, known as mean Average Precision (mAP). This approach provides a more com-
prehensive evaluation by assessing performance for each object class, ensuring that the model’s perfor-
mance is balanced across categories [35].

Intersection over Union

In addition to detecting objects, object detection models aim to accurately localise objects by pre-
dicting bounding boxes. Intersection over Union (IoU) is a critical component of the AP metric, as it
measures the quality of predicted bounding boxes. IoU is the ratio of the overlapping area between
the predicted and ground truth bounding boxes to the total combined area of both, as illustrated in Fig.
2.6. This metric evaluates how well the predicted box matches the true location, contributing to a more
precise assessment of object detection models [35].

Figure 2.6: (a) How IoU is calculated; (b) examples of three different IoU values for different box loca-
tions. [35]

The correctness of a detection is determined by a predefined threshold (t) for IoU overlap. A de-
tection is considered correct if the IoU meets or exceeds this threshold, meaning IoU ≥ t; otherwise,
it is classified as incorrect. This threshold helps to quantify the overlap needed for a detection to be
considered valid, ensuring consistent accuracy in object localisation across varying scenarios [34].
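The following sketch computes the IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates (the coordinates are hypothetical):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two partially overlapping boxes (pixel coordinates, purely illustrative):
print(iou((0, 0, 100, 100), (50, 50, 150, 150)))   # approx. 0.143
# With a threshold of t = 0.5, this detection would be counted as incorrect.
```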

Non-Maximum Suppression

Non-Maximum Suppression (NMS) is a post-processing technique used in object detection to reduce
overlapping bounding boxes, thereby improving detection quality. Object detection algorithms often
generate multiple bounding boxes around the same object, each with different confidence scores. NMS
helps filter out these redundant boxes by retaining only the bounding box with the highest confidence
score for each object. This process ensures that the final detection output is precise and free from un-
necessary overlaps. Figure 2.7 illustrates the typical output of an object detection model with multiple
overlapping bounding boxes, along with the refined result after applying NMS [35].

Figure 2.7: Non-Maximum Suppression [35]
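A self-contained sketch of greedy NMS follows (the boxes, scores, and threshold are illustrative; the IoU helper is the same as in the earlier sketch):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, discard boxes that overlap
    it by more than iou_thresh, and repeat on the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 260, 260)]
scores = [0.90, 0.80, 0.75]
print(nms(boxes, scores))   # [0, 2] -- the second box is suppressed as a duplicate
```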

2.4 You Only Look Once (YOLO)

YOLO, proposed by Redmon et al. in 2015 [36], is a groundbreaking object detection algorithm that
leverages CNNs to detect objects in real-time. As a single-stage detection method, YOLO achieves
real-time performance on standard GPUs by combining region proposal and classification into a unified
neural network. This innovative architecture significantly reduces computation time, making YOLO one
of the fastest object detection models available. Its design divides an input image into a grid, with
each cell predicting bounding boxes and class probabilities directly, allowing for end-to-end learning and
inference.
Among object detection algorithms, YOLO stands out for its exceptional balance of speed and accu-
racy, enabling reliable identification of objects in images at high speeds. Since its inception, the YOLO
family has evolved through multiple iterations, each addressing limitations of earlier versions and incor-
porating advancements to enhance performance [35], as illustrated in Fig. 2.8.

Figure 2.8: Timeline of YOLO versions from 2015 to 2024 [37]

The versatility of YOLO has made it invaluable in domains where both accuracy and speed are
critical, as illustrated in Fig. 2.9.

Figure 2.9: Bibliometric network visualization of the main YOLO Applications [35]

In agriculture, YOLO models are employed to detect and classify crops, pests, and diseases, enabling
precision agriculture techniques. These applications optimise farming operations, improve productivity,
and reduce input costs [38]. In healthcare, YOLO has significantly impacted diagnostic processes,
assisting in tasks such as lesion detection, brain tumour segmentation, skin lesion classification, and
personal protective equipment detection. These applications demonstrate YOLO’s adaptability to various
challenges in medical imaging and diagnostics [39].
In the realm of surveillance and security, YOLO has proven invaluable for real-time monitoring and
rapid identification of suspicious activities [40]. Detecting unwanted human actions, especially in low-
light conditions and varying poses, remains a complex task. By integrating YOLO models into surveil-
lance systems, security personnel can monitor environments more effectively and respond promptly to
potential threats, enhancing public safety [41].
YOLO has also supported public health measures, such as face mask detection and social distancing
monitoring during pandemics, ensuring compliance with health regulations [42].
In industrial settings, YOLO is employed for surface inspection processes to detect defects and
anomalies, ensuring quality control in manufacturing. These systems can be seamlessly integrated
into production lines, optimising efficiency and reducing operational costs [43].

2.4.1 Evolution

YOLOv1

YOLOv1, introduced in 2015, revolutionised object detection by unifying the detection steps into a
single-stage process. Unlike previous methods that relied on region proposal networks, YOLO predicts
all bounding boxes simultaneously, significantly reducing computation time and enabling real-time per-

formance.

To achieve this, YOLOv1 divides the input image into an S×S grid. Each grid cell predicts B bounding
boxes, each associated with a confidence score (Pc), while each cell also predicts C class probabilities. The confidence score reflects both the likelihood that the box contains an object and the accuracy of the bounding box’s coordinates. For each bounding box, five values are predicted: Pc, bx, by, bh, bw, where bx and by are the
coordinates of the box’s centre relative to the grid cell, and bh and bw are the height and width of the box
relative to the full image. The total output is a tensor of dimensions S×S×(B×5+C), which can optionally
be processed using NMS to eliminate duplicate detections.

Figure 2.10 illustrates a simplified output vector for a 3×3 grid with three classes and a single bound-
ing box per grid cell. In this case, the output tensor has dimensions 3×3×8.

Figure 2.10: YOLO output prediction [35]
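The output dimensionality follows directly from S, B, and C, as the short sketch below shows (the 7×7 grid with 2 boxes and 20 classes corresponds to the configuration reported in the original YOLO paper):

```python
def yolo_v1_output_shape(S, B, C):
    """Output tensor dimensions of YOLOv1: S x S x (B*5 + C)."""
    return (S, S, B * 5 + C)

print(yolo_v1_output_shape(3, 1, 3))    # (3, 3, 8)  -- the toy example in Fig. 2.10
print(yolo_v1_output_shape(7, 2, 20))   # (7, 7, 30) -- the original PASCAL VOC setting
```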

YOLOv1 achieved an AP of 63.4% on the PASCAL VOC2007 dataset, demonstrating its effectiveness
as a real-time detector.

The YOLOv1 architecture consists of 24 convolutional layers for feature extraction followed by 2
fully connected layers that predict bounding box coordinates and probabilities. Its novel full-image
one-shot regression approach made YOLO significantly faster than existing object detection methods.
However, it faced notable limitations that impacted its localisation accuracy:

• Grid Cell Constraint: YOLOv1 could detect at most two objects per grid cell for a given class,
limiting its ability to handle overlapping or nearby objects of the same class.

• Aspect Ratio Challenges: It struggled to predict objects with aspect ratios not seen in the training
data.

• Coarse Features: Downsampling layers caused YOLO to learn coarse object features, leading to
lower localisation accuracy.

Despite these shortcomings, YOLOv1’s speed and simplicity paved the way for subsequent versions, each addressing these limitations and improving performance [35].

YOLOv2

YOLOv2, introduced by Joseph Redmon and Ali Farhadi at CVPR 2017 [44], improved upon YOLOv1
by introducing key enhancements that preserved its speed while significantly increasing accuracy and
expanding its capacity to detect up to 9000 object categories. The main innovations of YOLOv2 include:

1. Batch Normalisation: Applied to all convolutional layers, batch normalisation improved training
efficiency and reduced overfitting, enhancing the model’s stability.

2. High-Resolution Classifier: The model was fine-tuned on high-resolution input images (448×448),
improving performance on larger image resolutions compared to YOLOv1.

3. Fully Convolutional Architecture: Dense layers were removed, making YOLOv2 fully convolu-
tional and allowing it to handle inputs of varying sizes.

4. Anchor Boxes: Predefined anchor boxes were introduced to match common object shapes and
sizes. These enabled YOLOv2 to predict multiple bounding boxes per grid cell, improving flexibility
and accuracy.

5. Dimension Clusters: To optimise anchor box shapes, k-means clustering was used to identify the
most suitable prior boxes from the training data, improving bounding box predictions.

6. Direct Location Prediction: Bounding box coordinates were predicted relative to the grid cell,
following YOLOv1’s approach. Each grid cell predicted five bounding boxes with confidence scores
and location values, refining localisation accuracy.

7. Finer-Grained Features: YOLOv2 preserved more detailed spatial information by reducing down-
sampling and incorporating additional feature maps. This enhanced its ability to detect smaller
objects and improve overall precision.

8. Multi-Scale Training: By randomly resizing input images between 320×320 and 608×608 during
training, YOLOv2 became robust to varying input sizes, improving performance across different
resolutions.

These improvements resulted in a significant boost in AP, with YOLOv2 achieving 78.67% on the
PASCAL VOC2007 dataset, compared to the 63.4% AP achieved by YOLOv1.
The backbone architecture of YOLOv2, Darknet-19, featured 19 convolutional layers and 5 max-
pooling layers, providing a more efficient and powerful feature extraction process.

YOLOv3

YOLOv3, introduced by Joseph Redmon and Ali Farhadi in 2018 [45], introduced significant enhance-
ments to the YOLO framework, bringing it on par with state-of-the-art methods while preserving real-time
performance. The major advancements in YOLOv3 include:

1. Bounding Box Prediction: YOLOv3 predicts four coordinates for each bounding box, similar to
YOLOv2. However, it adds an objectness score for each bounding box using logistic regression.
This score identifies the anchor box with the highest overlap with the ground truth (set to 1) and
marks others as 0, improving object localisation accuracy.

2. Class Prediction: YOLOv3 replaced the softmax function with binary cross-entropy loss, treating
classification as a multilabel problem. This approach allows the model to assign multiple labels to
the same bounding box, such as "Person" and "Man", making it more versatile in complex scenar-
ios.

3. New Backbone Architecture: YOLOv3 introduced Darknet-53, a deeper and more robust feature
extractor with 53 convolutional layers and residual connections. By replacing max-pooling layers
with strided convolutions, Darknet-53 achieves better feature representation without sacrificing
efficiency.

4. Multi-Scale Predictions: To detect objects of varying sizes, YOLOv3 generates predictions at
three different scales. By combining feature maps of low, medium, and high resolution, the model
improves detection performance for small, medium, and large objects.

5. Bounding Box Priors: Like YOLOv2, YOLOv3 uses k-means clustering to optimise anchor box
dimensions. However, it employs three anchor boxes per scale, compared to five in YOLOv2,
aligning detection with multi-scale predictions.

6. Enhanced Receptive Field: A modified spatial pyramid pooling (SPP) block in the backbone
architecture expands the receptive field, improving the model’s ability to capture contextual infor-
mation.

When YOLOv3 was released, the evaluation benchmark shifted from PASCAL VOC to Microsoft
COCO. On this dataset, YOLOv3-SPP achieved an AP of 36.2% and an AP50 of 60.6%, operating at
20 FPS. This made YOLOv3 twice as fast as competing methods while maintaining state-of-the-art
accuracy.

Backbone, neck, and head

At this time, the architecture of object detectors started to be described in three parts: the backbone,
the neck, and the head. Each part plays a distinct role in the object detection pipeline:
Backbone: The backbone is responsible for extracting features from the input image. Typically, it is
a CNN pre-trained on large-scale image classification datasets like ImageNet. The backbone captures
hierarchical features at different scales: lower-level features, such as edges and textures, are extracted
in the earlier layers, while deeper layers capture higher-level features, such as object parts and semantic
information.
Neck: The neck serves as an intermediary between the backbone and the head. It aggregates
and refines the features extracted by the backbone, enhancing spatial and semantic information across

different scales. Commonly, the neck incorporates additional convolutional layers or mechanisms to
improve the representation of multi-scale features.
Head: The head is the final stage of an object detector, tasked with making predictions based on
the refined features provided by the neck. It typically consists of one or more task-specific subnetworks
that handle classification, localisation, and sometimes additional tasks such as instance segmentation
or pose estimation. After predictions are made, a post-processing step like NMS removes overlapping
predictions, retaining only the most confident detections [35].

YOLOv4

YOLOv4, introduced in April 2020 by Alexey Bochkovskiy and colleagues [46], marked a significant
evolution in the YOLO family. Despite the change in authorship, YOLOv4 adhered to the foundational
principles of its predecessors, maintaining a focus on achieving high accuracy while preserving real-time
performance. YOLOv4 introduced substantial architectural changes, delivering remarkable performance
improvements [47]. The development process explored various enhancements, categorised as Bag of
Freebies (BoF) and Bag of Specials (BoS):
BoF: These methods improve model training without affecting inference time. Common examples
include advanced data augmentation techniques that enhance the robustness of the model during train-
ing.
BoS: These methods slightly increase inference cost but deliver significant accuracy gains. Exam-
ples include techniques that optimise feature extraction or enhance prediction precision [35].
Table 2.1 provides a summary of the specific BoF and BoS methods implemented in YOLOv4.

Table 2.1: YOLOv4 BoF and BoS comparison [47]

Category: Bag-of-specials
Backbone: Multi-input weighted residual connections; Cross-stage partial connections; Mish activation
Detector: Distance-IoU NMS; Spatial attention module (SAM); Mish activation; Spatial pyramid pooling block; Path aggregation network (PAN)

Category: Bag-of-freebies
Backbone: Class label smoothing; Data augmentation (mosaic, CutMix); Regularisation (DropBlock)
Detector: Cross mini-batch normalization (CmBN); Data augmentation (mosaic); Multiple anchors for single ground truth; Elimination of grid sensitivity; Cosine annealing scheduler; Random training shapes; Optimal hyperparameters; CIoU loss

Evaluated on the Microsoft COCO dataset, YOLOv4 achieved an AP of 43.5% and an AP50 of 65.7%,
operating at over 50 FPS. These results demonstrated its ability to outperform competitors in both accu-
racy and speed, solidifying YOLOv4’s status as a state-of-the-art real-time object detection model.

YOLOv5

YOLOv5, released in 2020 by Glen Jocher, the founder and CEO of Ultralytics, introduced significant
advancements in object detection. While many of its improvements were inspired by YOLOv4, YOLOv5
was developed in PyTorch instead of Darknet, offering easier integration with modern machine learning
pipelines and greater flexibility for researchers and developers. The key features of YOLOv5 include:
AutoAnchor: YOLOv5 incorporated an Ultralytics algorithm called AutoAnchor, which automated the
anchor box selection process. This integration allowed the network to automatically learn optimal anchor
box dimensions for a given dataset during training, accelerating the process and improving accuracy.
Scalability: YOLOv5 introduced five scaled versions—YOLOv5n (nano), YOLOv5s (small), YOLOv5m
(medium), YOLOv5l (large), and YOLOv5x (extra-large). These variants offered flexibility by varying the
width and depth of convolutional modules to suit different hardware capabilities and application require-
ments, from resource-constrained devices to high-performance GPUs [35].
On the MS COCO dataset, YOLOv5x achieved an AP score of 50.7% using a 640-pixel input image
size. The model demonstrated exceptional processing speed, achieving 200 FPS. When evaluated with
a larger input size of 1536 pixels, YOLOv5 achieved an AP score of 55.8%, showcasing its ability to
accurately detect objects even at higher resolutions [47].

YOLOv6

In September 2022, the Meituan Vision AI Department introduced YOLOv6 [48], a version specifi-
cally designed for industrial deployment scenarios. This iteration incorporated significant architectural
improvements aimed at optimising speed, accuracy, and efficiency, making it ideal for real-time object
detection tasks. The main innovations in YOLOv6 are:

• CSPDarknet Backbone: YOLOv6 introduced CSPDarknet as its new backbone, enhancing the
extraction of features from input images. This architecture surpassed previous versions in speed
and efficiency, providing better performance for real-time applications.

• Feature Pyramid Network (FPN): The integration of an FPN allowed YOLOv6 to process features
at multiple scales, improving detection accuracy. The FPN combines information from both low-
level (detailed) and high-level (semantic) features, enabling the model to detect small and large
objects more effectively within complex scenes.

• Decoupled Head Architecture: YOLOv6 separated the tasks of classification and bounding box
regression into independent layers. This decoupled head design allowed the model to handle each
task more accurately, leading to better overall predictions.

When evaluated on the MS COCO test-dev 2017 subset, the YOLOv6L model achieved an AP of
52.5% and an AP50 of 70%, maintaining a processing speed of approximately 50 FPS. This balance of
high accuracy and speed demonstrates YOLOv6’s capability for real-world, real-time deployment [47].

YOLOv7

YOLOv7, introduced in July 2022 by the authors of YOLOv4 [49], set a new benchmark for object
detection by achieving exceptional speed and accuracy. It outperformed all known object detectors
across a range of processing speeds, from 5 FPS to 160 FPS, making it suitable for both resource-
constrained and high-performance applications. Like YOLOv4, YOLOv7 was trained exclusively on the
MS COCO dataset without relying on pre-trained backbones, demonstrating its robustness [35]. The key
innovations in YOLOv7 were:

• E-ELAN Backbone: YOLOv7 introduced the Extended Efficient Layer Aggregation Network (E-
ELAN) as its backbone. This architecture enhances the model’s ability to aggregate and reuse
features across layers, improving its capability to learn complex patterns without increasing com-
putational cost. E-ELAN ensures efficient processing while maintaining accuracy, making the back-
bone highly effective for real-time applications [50].

• Model Scaling: YOLOv7 implemented a uniform scaling strategy to adjust the depth (number
of layers) and width (number of channels) of network blocks. This allows for creating models of
different sizes while maintaining the same overall structure. This scaling ensures adaptability to
various hardware setups, from low-power devices to high-performance GPUs.

• BoF: YOLOv7 integrated several BoF techniques, which improved training accuracy without af-
fecting inference speed. These methods, while increasing training time, ensured that the model
delivered high performance during deployment [35].

YOLOv7 underwent rigorous testing on the MS COCO test-dev 2017 subset, where the YOLOv7E6
variant achieved an AP of 55.9% and an AP50 (IoU threshold of 0.5) of 73.5%. These results solidified
YOLOv7’s position as a state-of-the-art detector, offering unparalleled speed and accuracy for real-time
object detection tasks [47].

YOLOv8

YOLOv8, released in January 2023 by Ultralytics, marked another significant advancement in the
YOLO series. It introduced five scaled versions: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x,
making it adaptable to a wide range of hardware and applications. Expanding beyond object detection,
YOLOv8 supports multiple vision tasks, including segmentation, pose estimation, tracking, and classifi-
cation [35]. The main changes in YOLOv8 were:

• Anchor-Free Design: Unlike its predecessors, YOLOv8 adopts an anchor-free approach, predict-
ing the object’s centre rather than relying on predefined anchor boxes. This innovation eliminates
the reliance on box shapes that may not align with custom dataset distributions, improving adapt-
ability across various datasets. It also reduces the number of redundant box predictions, making
the post-processing step faster and more efficient.

• Advanced Training Techniques: YOLOv8 leverages enhanced training routines, including on-
line image augmentation methods such as mosaic augmentation. These techniques improve the
model’s ability to detect objects in diverse conditions and complex spatial arrangements, making it
robust for real-world applications like surveillance and autonomous navigation.

• Optimised Neck Architecture: YOLOv8 improves on YOLOv5 by modifying the neck compo-
nent. Instead of enforcing uniform channel dimensions, YOLOv8 directly concatenates features,
reducing the model’s parameter count and tensor size. This optimised design enhances efficiency
without sacrificing performance [47].

• User-Friendly Features: YOLOv8 prioritises usability with a command-line interface and a well-
structured Python package. These features simplify deployment, enabling researchers and devel-
opers to integrate YOLOv8 seamlessly into various machine learning pipelines.

When evaluated on the MS COCO test-dev 2017 subset, the YOLOv8x model achieved an AP of
53.9% with a 640-pixel input size (compared to 50.7% for YOLOv5 on the same input size) with a speed
of 280 FPS. These results highlight YOLOv8’s improvements in both accuracy and processing speed,
showcasing its capability for advanced tasks across domains like healthcare, robotics, and security [35].
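As an illustration of the command-line and Python interfaces mentioned above, the following sketch shows how a pretrained YOLOv8 model is commonly loaded and run with the Ultralytics package; the image file name is hypothetical, and the exact API should be checked against the official Ultralytics documentation:

```python
# pip install ultralytics   (assumed; see the official documentation)
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                          # nano variant; s/m/l/x scale it up
results = model.predict("highway.jpg", conf=0.25)   # hypothetical input image

for result in results:                      # one Results object per image
    for box in result.boxes:                # detections remaining after NMS
        print(box.xyxy, box.conf, box.cls)  # coordinates, confidence, class id
```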

YOLOv9

YOLOv9, introduced by Wang et al. in February 2024 [51], continues the evolution of the YOLO series
by focusing on enhancing lightweight architectures without compromising accuracy. YOLOv9 introduces
two key innovations:

• Programmable Gradient Information (PGI) Framework: PGI addresses the information bottle-
neck problem often encountered in deep neural networks. It ensures reliable gradient propagation
during training, enabling more effective learning. This innovation allows for the integration of deep
supervision mechanisms into lightweight architectures, significantly improving accuracy. By incor-
porating PGI, YOLOv9 enhances the learning capacity of both lightweight and deep architectures,
leading to better prediction accuracy.

• Generalised Efficient Layer Aggregation Network (GELAN): GELAN further optimises YOLOv9
by improving the efficiency of feature aggregation, enabling the model to achieve better perfor-
mance with fewer parameters and reduced computational overhead.

YOLOv9 delivers notable advancements in lightweight object detection. On the MS COCO dataset,
it demonstrates a 0.6% improvement in AP compared to YOLOv8. Alongside this increase in accu-
racy, YOLOv9 achieves significant parameter reduction and computational efficiency, making it a highly
competitive choice for resource-constrained applications [47].

YOLOv10

YOLOv10, developed by researchers at Tsinghua University and released in May 2024 [52], marks
a significant leap forward in real-time OD. This version addresses one of the most critical challenges

in OD: achieving a balance between accuracy and computational efficiency. By introducing innovative
training strategies and architectural modifications, YOLOv10 continues to push the boundaries of what
is achievable in real-time detection. The core innovation of YOLOv10 lies in its “Consistent Dual As-
signments” approach, which enhances training by providing rich supervision without relying on compu-
tationally intensive NMS during inference. This strategy allows YOLOv10 to maintain high precision and
recall while significantly reducing computational overhead [47]. Additionally, YOLOv10 offers a range of
model variants, enabling adaptability across diverse hardware setups and application requirements.
Extensive evaluations demonstrate that YOLOv10 outperforms its predecessors and other state-of-
the-art (SOTA) models in the critical accuracy–efficiency trade-off. By eliminating the need for NMS and
optimising the training pipeline, YOLOv10 achieves superior results while maintaining computational
efficiency. A comparison of YOLOv10 with earlier YOLO versions and other SOTA models is illustrated
in Fig. 2.11.

Figure 2.11: Comparison with other models in terms of latency-accuracy (left) and size-accuracy (right).
[52]

A summary of the main contributions of each YOLO version can be found in Table 2.2.

Table 2.2: Summary of YOLO versions, their contributions, and the frameworks used [47].

Version Year Contributions Framework


v1 2015 Single-stage object detector Darknet
v2 2016 Multi-scale training, dimension clustering Darknet
v3 2018 SPP block, Darknet-53 backbone Darknet
v4 2020 Mish activation, CSPDarknet-53 backbone Darknet
v5 2020 Anchor-free detection, SWISH activation, PANet PyTorch
v6 2022 Self-attention, anchor-free OD PyTorch
v7 2022 Transformers, E-ELAN reparameterisation PyTorch
v8 2023 GANs, anchor-free detection PyTorch
v9 2024 PGI and GELAN PyTorch
v10 2024 Consistent dual assignments for NMS-free training PyTorch

2.4.2 Advantages

• Real-Time Detection: YOLO’s single-pass architecture enables it to detect and localise objects
in real-time, making it suitable for time-sensitive applications such as autonomous driving, video

surveillance, and robotics.

• Computational Efficiency: The single-stage detection framework eliminates the need for region
proposal steps, allowing YOLO to achieve high processing speeds while maintaining low compu-
tational resource requirements.

• Accuracy and Context Awareness: YOLO’s unified architecture processes the entire image at
once, considering global context rather than isolated regions. This holistic approach leads to
improved localisation and classification accuracy.

• Versatility and Adaptability: The modular design of YOLO enables easy customisation for spe-
cific applications. It can be fine-tuned to detect objects in specialised domains, such as medical
imaging, agriculture, or industrial inspection.

• Reduced False Positives: YOLO’s ability to incorporate global information reduces the occur-
rence of false positives, ensuring reliable detections even in challenging scenarios with overlapping
or densely packed objects [39].

2.4.3 Limitations

• Large Dataset Requirement: One of the primary drawbacks of YOLO is its dependence on a sub-
stantial dataset of annotated images for effective training. Collecting and labelling these datasets
can be both time-intensive and costly, especially for specialised domains where high-quality anno-
tations are critical.

• Sensitivity to Object Scale: YOLO’s performance is significantly affected by the scale of objects
in the input image. It may struggle with objects that are either too large or too small relative to the
grid cell size, leading to false positives or false negatives.

• Small Objects: YOLO often has difficulty detecting smaller objects accurately. Since the sys-
tem divides the image into a grid, small objects may fail to occupy sufficient grid cells for proper
localisation, reducing detection accuracy for these cases.

• Occluded Objects: Detecting partially visible or occluded objects remains a challenge for YOLO.
If an object is obscured by another, the predicted bounding box may not be well-defined, which
affects both localisation and classification performance.

• Limited Generalisation for Diverse Object Classes: YOLO can struggle with generalising to
diverse object classes, particularly when trained on a finite dataset. It may fail to detect objects
that significantly differ from those in the training data, limiting its effectiveness in highly diverse or
dynamic environments [39].

2.5 Data Augmentation

As highlighted in the previous section, one of the major limitations of YOLO and other CV algorithms
is their dependence on adequate and representative training data. Without sufficient diversity and quality
in the dataset, even the most sophisticated models struggle to generalise effectively, leading to poor
performance on unseen data. Conversely, large and diverse datasets can significantly enhance model
robustness, often enabling simpler algorithms to achieve competitive results.
Data augmentation addresses this challenge by artificially expanding the training dataset. By gen-
erating additional samples from existing data or creating synthetic examples from scratch, data aug-
mentation helps improve the generalisation capabilities of machine learning models. Recent research
demonstrates that effective augmentation strategies can mitigate the drawbacks of limited datasets, re-
ducing the need for highly complex architectures and boosting the performance of models trained on
smaller, less representative datasets.
The success of a model often correlates with the diversity and quality of its training data. On one
hand, the training samples must be varied enough to prepare the model for different deviations in image
appearance, noise, and distortions. On the other hand, the augmentations must maintain sufficient
quality to ensure that the model performs well on standard images [53].

2.5.1 Techniques

Data augmentation fundamentally tackles overfitting at its root: the training dataset. It operates on
the premise that more information can be extracted from the original dataset through augmentations.
These techniques can be broadly classified into two categories:
Data warping involves applying transformations to existing images to increase dataset diversity while
preserving the original labels. These techniques are computationally efficient and widely used for aug-
menting training datasets in a straightforward manner. A summary of the commonly applied methods is
presented in table 2.3.

Table 2.3: Common image augmentation methods [54]

Method - Description
Flipping: Flip the image horizontally, vertically, or both.
Rotation: Rotate the image at an angle.
Scaling Ratio: Increase or reduce the image size.
Noise Injection: Add noise into the image.
Color Space: Change the image color channels.
Contrast: Change the image contrast.
Translation: Move the image horizontally, vertically, or both.
Cropping: Crop a sub-region of the image.
Erasing: Delete one or more sub-regions of the image.
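As a sketch of how several of these warping methods can be chained in practice, the example below uses torchvision transforms (assumed to be installed; parameters and file names are illustrative). For object detection, the bounding-box labels must be transformed consistently with the image, which dedicated libraries such as Albumentations support.

```python
from PIL import Image
from torchvision import transforms

# Chain of warping-style augmentations from Table 2.3 (illustrative parameters)
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                    # flipping
    transforms.RandomRotation(degrees=15),                     # rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),      # colour space / contrast
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # scaling + cropping
])

image = Image.open("sample.jpg")   # hypothetical training image
augmented = augment(image)         # a new, randomly transformed sample
augmented.save("sample_aug.jpg")
```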

While these techniques can significantly enhance the diversity of training datasets, they come with
potential drawbacks:

• Boundary Effects: Techniques like translation or rotation may result in parts of the image moving
outside the frame. Padding strategies can mitigate this issue by preserving spatial continuity.

• Data Distribution Alignment: Augmentations must reflect real-world variations relevant to the
target task. Mismatched transformations can degrade model performance [54].

Oversampling generates entirely new synthetic samples, applying complex transformations. Exam-
ples of this include:

• Mosaic: Combines multiple images into a single augmented image, helping models learn from
varied object contexts and scales within a single frame.

• CutMix: Replaces random patches of one image with patches from another image, blending the
labels of both. This approach encourages the model to focus on less obvious features and prevents
overfitting.

• MixUp: Creates synthetic images by linearly interpolating between two images and their corresponding labels (a minimal sketch is shown after this list).

• Generative Approaches: Techniques like GANs create entirely new, realistic samples based on
the original dataset, significantly expanding the diversity of training data.
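
To make the MixUp idea concrete, the sketch below blends two images and their one-hot labels with a
coefficient drawn from a Beta distribution, as in the original classification-oriented formulation; object-
detection pipelines typically keep the bounding boxes of both source images rather than interpolating them.

    import numpy as np

    def mixup(image_a, label_a, image_b, label_b, alpha=0.2):
        """Blend two samples with a coefficient drawn from Beta(alpha, alpha)."""
        lam = np.random.beta(alpha, alpha)
        mixed_image = lam * image_a + (1.0 - lam) * image_b
        mixed_label = lam * label_a + (1.0 - lam) * label_b  # soft label
        return mixed_image, mixed_label

    # Dummy example: two 64x64 grayscale images with one-hot labels for a 3-class task.
    img_a, img_b = np.random.rand(64, 64), np.random.rand(64, 64)
    lbl_a, lbl_b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
    mixed_img, mixed_lbl = mixup(img_a, lbl_a, img_b, lbl_b)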

Figure 2.12 demonstrates the transformations applied to the original dataset and the correspond-
ing augmented outputs, showcasing how data augmentation techniques enhance training diversity and
model robustness.

Figure 2.12: Examples of data augmentation techniques: (a) synthetic data created with GANs [55]; (b) example of CutMix [53].

These categories are not mutually exclusive and are often combined to create robust datasets tai-
lored to the specific needs of the application [55].

2.6 Previous Case Studies

A significant number of YOLO-based projects have already been developed, particularly in industrial
settings.
The study [56] focuses on improving the detection of solar panels in satellite imagery using CV
algorithms, with a particular emphasis on YOLOv9 and YOLOv10 architectures. The primary goal is
to accurately identify solar panels in various environmental contexts, addressing the growing need for
automated renewable energy monitoring tools. The dataset used comprises 3,480 satellite images,
categorised based on backgrounds such as rooftops or ground-level environments, ensuring a diverse
range of scenarios for model evaluation.
Through extensive testing, the authors highlight the superiority of YOLOv9e, which achieves a mAP
of 0.74. This model demonstrates its ability to effectively distinguish solar panels from complex back-
grounds. YOLOv10, although still in its early stages, shows promising results with potential for further
optimisation, offering an alternative for high-accuracy object detection tasks.
In [57] the authors aimed to improve YOLOv10 for detecting urban road cracks, addressing common
challenges such as noisy data, poor image clarity, and the small size of defects. The enhanced version,
YOLOv10-D-CBAM, incorporated several improvements to reduce computational complexity and to en-
hance feature extraction. The model was trained and validated on extensive datasets covering diverse
road conditions and defects to ensure robustness across different scenarios.
The YOLOv10-D-CBAM model achieved a precision of 99.6%, recall of 95.6%, and a mAP of 99.6%
at a 0.5 IoU threshold, representing significant improvements over YOLOv10 (precision: 92.4%, recall:
77.9%, mAP: 77.4%) and YOLOv5 (precision: 81.6%, recall: 75.6%, mAP: 72.3%).
The study [58] evaluates and compares the performance of YOLOv9 and YOLOv10 models for car
detection. The objective is to assess the accuracy and efficiency of these OD algorithms, focusing on
their applicability to tasks like surveillance, traffic management, and autonomous vehicle systems. Both
models were trained and tested using a dataset of 1,176 images, where the images were annotated with
bounding boxes for cars.
YOLOv10 demonstrated superior performance in terms of accuracy, achieving a precision of 96.1%
and a recall of 94.0%, compared to YOLOv9’s 88.7% precision and 79.4% recall. However, YOLOv9
had faster inference times, making it more suitable for real-time applications. The results underline
YOLOv10’s advantages for high-accuracy tasks and YOLOv9’s strengths in scenarios requiring rapid
processing.
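
The comparisons summarised in this section follow a common train-validate workflow. As a rough
sketch of how such an experiment could be reproduced with the Ultralytics API, the snippet below trains
a pretrained YOLOv9e model and reads back the validation metrics; dataset.yaml is a hypothetical data
configuration, and the exact metric attribute names may differ between library versions.

    from ultralytics import YOLO  # assumes the ultralytics package is installed

    # "dataset.yaml" is a hypothetical config pointing at annotated train/val images.
    model = YOLO("yolov9e.pt")  # pretrained weights, as used in [56]
    model.train(data="dataset.yaml", epochs=100, imgsz=640)

    # Validation reports the metrics quoted in the studies above.
    metrics = model.val()
    print("precision:", metrics.box.mp)     # mean precision over classes
    print("recall:   ", metrics.box.mr)     # mean recall over classes
    print("mAP@0.5:  ", metrics.box.map50)  # mAP at IoU threshold 0.5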

Bibliography

[1] A. G. P. H. T. S. S. S. Clint Andrews, Keith Cooke and N. Wright. Ai in planning: Opportunities and
challenges and how to prepare. Technical report, American Planning Association, 2022. [Online]
Available: https://planning.org/publications/document/9255930/.

[2] R. Marasinghe, T. Yigitcanlar, S. Mayere, T. Washington, and M. Limb. Computer vision applica-
tions for urban planning: A systematic review of opportunities and constraints. Sustainable Cities
and Society, page 105047, 2023. [Online] Available: https://www.sciencedirect.com/science/
article/pii/S2210670723006571.

[3] J. Olveres, G. González, F. Torres, J. C. Moreno-Tagle, E. Carbajal-Degante, A. Valencia-Rodríguez,
N. Méndez-Sánchez, and B. Escalante-Ramírez. What is new in computer vision and artificial
intelligence in medical image analysis applications. Quantitative imaging in medicine and surgery,
11(8):3830, 2021. [Online] Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8245941/.

[4] K. Tan, J. Wu, H. Zhou, Y. Wang, and J. Chen. Integrating advanced computer vision and ai
algorithms for autonomous driving systems. Journal of Theory and Practice of Engineering Science,
4(01):41–48, 2024. [Online] Available: https://centuryscipub.com/index.php/jtpes/article/
view/427.

[5] H. Harzallah, F. Jurie, and C. Schmid. Combining efficient object localization and image classifica-
tion. In 2009 IEEE 12th international conference on computer vision, pages 237–244. IEEE, 2009.
[Online] Available: https://ieeexplore.ieee.org/abstract/document/5459257.

[6] D. T. Delight and V. Karunakaran. A comprehensive analysis of methodologies used for object
detection and localization. In 2021 7th International Conference on Advanced Computing and
Communication Systems (ICACCS), volume 1, pages 448–453. IEEE, 2021. [Online] Available:
https://ieeexplore.ieee.org/abstract/document/9441916.

[7] K. Khurana and R. Awasthi. Techniques for object recognition in images and multi-object detection.
International journal of advanced research in Computer Engineering & Technology (IJARCET), 2
(4):1383–1388, 2013.

[8] B. More and S. Bhosale. A comprehensive survey on object detection using deep learning. Revue
d’Intelligence Artificielle, 37(2), 2023.

[9] J. Kaur and W. Singh. Tools, techniques, datasets and application areas for object detection in an
image: a review. Multimedia Tools and Applications, 81(27):38297–38351, 2022. [Online] Available:
https://link.springer.com/article/10.1007/s11042-022-13153-y.

[10] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu. Advanced deep-learning techniques for salient
and category-specific object detection: A survey. IEEE Signal Processing Magazine, 35(1):84–
100, 2018. doi: 10.1109/MSP.2017.2749125. [Online] Available: https://ieeexplore.ieee.org/
abstract/document/8253582.

[11] B. Karbouj, G. A. Topalian-Rivas, and J. Krüger. Comparative performance evaluation of one-stage
and two-stage object detectors for screw head detection and classification in disassembly processes.
Procedia CIRP, 122:527–532, 2024. [Online] Available: https://www.sciencedirect.com/science/
article/pii/S2212827124001021.

[12] R. Dastres and M. Soori. Artificial neural network systems. International Journal of Imaging and
Robotics (IJIR), 21(2):13–25, 2021.

[13] O. A. Montesinos López, A. Montesinos López, and J. Crossa. Fundamentals of Artificial Neural
Networks and Deep Learning, pages 379–425. Springer International Publishing, Cham, 2022.
ISBN 978-3-030-89010-0. doi: 10.1007/978-3-030-89010-0_10. [Online] Available: https://doi.
org/10.1007/978-3-030-89010-0_10.

[14] M. Thorat, S. Pandit, and S. Balote. Artificial neural network: A brief study. Asian Journal For
Convergence In Technology (AJCT) ISSN-2350-1146, 8(3):12–16, 2022.

[15] F. Bre, J. M. Gimenez, and V. D. Fachinotti. Prediction of wind pressure coefficients on building
surfaces using artificial neural networks. Energy and Buildings, 158:1429–1441, 2018.

[16] M. Zakaria, A. Mabrouka, and S. Sarhan. Artificial neural network: a brief overview. neural net-
works, 1:2, 2014.

[17] A. Katal and N. Singh. Artificial neural network: Models, applications, and challenges. Innovative
Trends in Computational Intelligence, pages 235–257, 2022.

[18] S. C. Patel. Survey on different object detection and segmentation methods. International Journal
of Innovative Science and Research Technology, 1(IJISRT):608–611, 2021.

[19] Z. Deng and A. Li. Object detection algorithms based on convolutional neural networks. High-
lights in Science, Engineering and Technology, 81:243–251, 2024. doi: https://doi.org/10.54097/
vyfg4e34.

[20] Y. Pi, N. D. Nath, and A. H. Behzadan. Convolutional neural networks for object detection
in aerial imagery for disaster response and recovery. Advanced Engineering Informatics, 43:
101009, 2020. [Online] Available: https://www.sciencedirect.com/science/article/abs/pii/
S1474034619305828.

[21] L. Shen, Z. Lin, and Q. Huang. Relay backpropagation for effective learning of deep convolutional
neural networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The
Netherlands, October 11–14, 2016, Proceedings, Part VII 14, pages 467–482. Springer, 2016.
[Online] Available: https://arxiv.org/abs/1512.05830.

[22] X. Zhao, L. Wang, Y. Zhang, X. Han, M. Deveci, and M. Parmar. A review of convolutional neural
networks in computer vision. Artificial Intelligence Review, 57(4):99, 2024. [Online] Available:
https://link.springer.com/article/10.1007/s10462-024-10721-6.

[23] M. Krichen. Convolutional neural networks: A survey. Computers, 12(8):151, 2023. [Online]
Available: https://www.mdpi.com/2073-431X/12/8/151.

[24] E. M. Dogo, O. Afolabi, N. Nwulu, B. Twala, and C. Aigbavboa. A comparative analysis of gradient
descent-based optimization algorithms on convolutional neural networks. In 2018 international
conference on computational techniques, electronics and mechanical systems (CTEMS), pages
92–99. IEEE, 2018. [Online] Available: https://ieeexplore.ieee.org/document/8769211.

[25] N.-I. Galanis, P. Vafiadis, K.-G. Mirzaev, and G. A. Papakostas. Convolutional neural networks:
A roundup and benchmark of their pooling layer variants. Algorithms, 15(11):391, 2022. [Online]
Available: https://www.mdpi.com/1999-4893/15/11/391.

[26] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural
networks. arXiv preprint arXiv:1301.3557, 2013. [Online] Available: https://arxiv.org/abs/
1301.3557.

[27] V. Christlein, L. Spranger, M. Seuret, A. Nicolaou, P. Král, and A. Maier. Deep generalized
max pooling. In 2019 International conference on document analysis and recognition (ICDAR),
pages 1090–1096. IEEE, 2019. [Online] Available: https://ieeexplore.ieee.org/abstract/
document/8978110.

[28] H. Eom and H. Choi. Alpha-integration pooling for convolutional neural networks. arXiv preprint
arXiv:1811.03436, 2018. [Online] Available: https://arxiv.org/abs/1811.03436.

[29] G. Habib and S. Qureshi. Optimization and acceleration of convolutional neural networks: A survey.
Journal of King Saud University-Computer and Information Sciences, 34(7):4244–4268, 2022. [On-
line] Available: https://www.sciencedirect.com/science/article/pii/S1319157820304845?
fr=RR-2&ref=pdf_download&rr=8d8c905dcf035be8.

[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [Online] Available: https://
ieeexplore.ieee.org/abstract/document/726791.

[31] M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S. Nasrin, B. C. Van Esesn, A. A. S.
Awwal, and V. K. Asari. The history began from alexnet: A comprehensive survey on deep learning
approaches. arXiv preprint arXiv:1803.01164, 2018. [Online] Available: https://arxiv.org/abs/
1803.01164.

[32] R. Mehta and C. Ozturk. Object detection at 200 frames per second. In Proceedings of the Euro-
pean Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018.

[33] R. Padilla, S. L. Netto, and E. A. Da Silva. A survey on performance metrics for object-detection
algorithms. In 2020 international conference on systems, signals and image processing (IWS-
SIP), pages 237–242. IEEE, 2020. [Online] Available: https://ieeexplore.ieee.org/abstract/
document/9145130.

[34] J. Kaur and W. Singh. Tools, techniques, datasets and application areas for object detection in an
image: a review. Multimedia Tools and Applications, 81(27):38297–38351, 2022. [Online] Available:
https://link.springer.com/article/10.1007/s11042-022-13153-y.

[35] J. Terven, D.-M. Córdova-Esparza, and J.-A. Romero-González. A comprehensive review of yolo
architectures in computer vision: From yolov1 to yolov8 and yolo-nas. Machine Learning and
Knowledge Extraction, 5(4):1680–1716, 2023.

[36] J. Redmon. You only look once: Unified, real-time object detection. In Pro-
ceedings of the IEEE conference on computer vision and pattern recognition, 2016.
[Online] Available: https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/
Redmon_You_Only_Look_CVPR_2016_paper.html.

[37] R. Sapkota, R. Qureshi, M. Flores-Calero, C. Badgujar, U. Nepal, A. Poulose, P. Zeno, U. Bhanu
Prakash Vaddevolu, P. Yan, M. Karkee, et al. Yolov10 to its genesis: A decadal and comprehensive
review of the you only look once series. Available at SSRN 4874098, 2024.

[38] C. M. Badgujar, A. Poulose, and H. Gan. Agricultural object detection with you look only once (yolo)
algorithm: A bibliometric and systematic literature review. arXiv preprint arXiv:2401.10379, 2024.
[Online] Available: https://arxiv.org/abs/2401.10379.

[39] M. G. Ragab, S. J. Abdulkader, A. Muneer, A. Alqushaibi, E. H. Sumiea, R. Qureshi, S. M. Al-Selwi,
and H. Alhussian. A comprehensive systematic review of yolo for medical object detection (2018
to 2023). IEEE Access, 2024. [Online] Available: https://ieeexplore.ieee.org/abstract/
document/10494845.

[40] M. A. Arroyo, M. T. I. Ziad, H. Kobayashi, J. Yang, and S. Sethumadhavan. Yolo: frequently resetting
cyber-physical systems for security. In Autonomous Systems: Sensors, Processing, and Security
for Vehicles and Infrastructure 2019, volume 11009, pages 166–183. SPIE, 2019.

[41] N. Bordoloi, A. K. Talukdar, and K. K. Sarma. Suspicious activity detection from videos using
yolov3. In 2020 IEEE 17th India Council International Conference (INDICON), pages 1–5. IEEE,
2020. [Online] Available: https://ieeexplore.ieee.org/abstract/document/9342230.

[42] R. Kolpe, S. Ghogare, M. Jawale, P. William, and A. Pawar. Identification of face mask and social
distancing using yolo algorithm based on machine learning approach. In 2022 6th International
conference on intelligent computing and control systems (ICICCS), pages 1399–1403. IEEE, 2022.
[Online] Available: https://ieeexplore.ieee.org/abstract/document/9788241.

[43] D.-L. Pham, T.-W. Chang, et al. A yolo-based real-time packaging defect detection system. Procedia
Computer Science, 217:886–894, 2023. [Online] Available: https://www.sciencedirect.com/
science/article/pii/S1877050922023638.

[44] J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017. [On-
line] Available: https://openaccess.thecvf.com/content_cvpr_2017/html/Redmon_YOLO9000_
Better_Faster_CVPR_2017_paper.html.

[45] J. Redmon. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[46] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao. Yolov4: Optimal speed and accuracy of object
detection. arXiv preprint arXiv:2004.10934, 2020. [Online] Available: https://arxiv.org/abs/
2004.10934.

[47] M. Hussain and R. Khanam. In-depth review of yolov1 to yolov10 variants for enhanced photovoltaic
defect detection. In Solar, volume 4, pages 351–386. MDPI, 2024. [Online] Available: https:
//www.mdpi.com/2673-9941/4/3/16#B140-solar-04-00016.

[48] C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, W. Nie, et al. Yolov6: A single-
stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976, 2022.
[Online] Available: https://arxiv.org/abs/2209.02976.

[49] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao. Yolov7: Trainable bag-of-freebies sets new
state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 7464–7475, 2023. [Online] Available: https://
openaccess.thecvf.com/content/CVPR2023/html/Wang_YOLOv7_Trainable_Bag-of-Freebies_
Sets_New_State-of-the-Art_for_Real-Time_Object_Detectors_CVPR_2023_paper.html.

[50] M. Hussain. Yolo-v1 to yolo-v8, the rise of yolo and its complementary nature toward digital
manufacturing and industrial defect detection. Machines, 11(7):677, 2023. [Online] Available:
https://www.mdpi.com/2075-1702/11/7/677.

[51] C.-Y. Wang, I.-H. Yeh, and H.-Y. Mark Liao. Yolov9: Learning what you want to learn us-
ing programmable gradient information. In European Conference on Computer Vision, pages
1–21. Springer, 2025. [Online] Available: https://link.springer.com/chapter/10.1007/
978-3-031-72751-1_1.

[52] A. Wang, H. Chen, L. Liu, K. Chen, Z. Lin, J. Han, and G. Ding. Yolov10: Real-time end-to-end
object detection. arXiv preprint arXiv:2405.14458, 2024. [Online] Available: https://arxiv.org/
abs/2405.14458.

[53] A. Mumuni and F. Mumuni. Data augmentation: A comprehensive survey of modern approaches.
Array, 16:100258, 2022.

[54] S. Yang, W. Xiao, M. Zhang, S. Guo, J. Zhao, and F. Shen. Image data augmentation for deep
learning: A survey. arXiv preprint arXiv:2204.08610, 2022.

[55] C. Shorten and T. M. Khoshgoftaar. A survey on image data augmentation for deep learning.
Journal of big data, 6(1):1–48, 2019.

[56] S. E. Droguett and C. N. Sanchez. Solar panel detection on satellite images: From faster r-cnn to
yolov10.

[57] P. Zhang, H. Chen, J. Gao, L. Ma, and R. He. Improved yolov10 for high-precision road defect
detection. In 2024 4th International Conference on Computer Science and Blockchain (CCSB),
pages 79–83. IEEE, 2024.

[58] F. B. K. Ardaç and P. Erdogmus. Car object detection: Comparative analysis of yolov9 and yolov10
models. In 2024 Innovations in Intelligent Systems and Applications Conference (ASYU), pages
1–6. IEEE, 2024.

