Object Detection in Drone Imagery Using Convolutional Neural Networks
PUBLISHER
Loughborough University
LICENCE
CC BY-NC-ND 4.0
REPOSITORY RECORD
Wang, Guoxu. 2023. “Object Detection in Drone Imagery Using Convolutional Neural Networks”.
Loughborough University. https://doi.org/10.26174/thesis.lboro.24435160.v1.
Object Detection In Drone Imagery using
Convolutional Neural Networks
by
Guoxu Wang
A Doctoral Thesis
Doctor of Philosophy
of
Loughborough University
May 2023
Drones, also known as Unmanned Aerial Vehicles (UAVs), are lightweight aircraft
that can fly without a pilot on board. Equipped with high-resolution cameras and
ample data storage capacity, they can capture visual information for subsequent
processing by humans to gather vital information. Drone imagery provides a
unique viewpoint that humans cannot access by other means, and the captured
images can be valuable for both manual processing and automated image ana-
lysis. However, detecting and recognising objects in drone imagery using computer
vision-based methods is difficult because the object appearances differ from those
typically used to train object detection and recognition systems. Additionally,
drones are often flown at high altitudes, which makes the captured objects appear
small. Furthermore, various adverse imaging conditions may occur during flight,
such as noise, illumination changes, motion blur, object occlusion, background
clutter, and camera calibration issues, depending on the drone hardware used, in-
terference in flight paths, changing environmental conditions, and regional climate
conditions. These factors make the automated computer-based analysis of drone
footage challenging.
In the past, conventional machine-based object detection methods were widely
used to identify objects in images captured by cameras of all types. These methods
involved using feature extractors to extract an object’s features and then using
an image classifier to learn and classify the object’s features, enabling the learn-
ing system to infer objects based on extracted features from an unknown object.
However, the feature extractors used in traditional object detection methods were
based on handcrafted features decided by humans (i.e. feature engineering was
required), making it challenging to achieve robustness of feature representation
and affecting classification accuracy. Addressing this challenge, Deep Neural Net-
work (DNN) based learning provides an alternative approach to detect objects
in images. Convolutional Neural Networks (CNNs) are a type of DNN that can
extract millions of high-level features of objects and can be effectively trained for object detection and classification. The aim of the research presented in this thesis is
to optimally design, develop and extensively investigate the performance of CNN
based object detection and recognition models that can be efficiently used on drone
imagery.
One significant achievement of this work is the successful utilization of the
state-of-the-art CNNs, such as SSD, Faster R-CNN and YOLO (versions 5s, 5m,
5l, 5x, 7), to generate innovative DNN-based models. We show that these models
are highly effective in detecting and recognising Ghaf trees, multiple tree types (i.e., Ghaf, Acacia and Date Palm trees) and in detecting litter. Mean Average Precision (mAP@0.5) values ranging from 70%-92% were obtained, depending
on the application and the CNN architecture utilised.
The thesis places a strong emphasis on developing systems that can effectively
perform under practical constraints and variations in images. As a result, several
robust computer vision applications have been developed through this research,
which are currently being used by the collaborators and stakeholders.
Guoxu Wang,
May 2023
Acknowledgements
Guoxu Wang,
May 2023
Contents
Abstract iii
Acknowledgements v
1 Introduction 1
1.1 Contextual Background and Motivation . . . . . . . . . . . . . . . . 1
1.2 Research Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Research Aim and Objectives . . . . . . . . . . . . . . . . . . . . . 6
1.4 Original Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Literature Review 11
2.1 Classical Object Detection Methods . . . . . . . . . . . . . . . . . . 11
2.1.1 Sliding Window-based Method . . . . . . . . . . . . . . . . . 11
2.1.2 Region Proposal-based Method . . . . . . . . . . . . . . . . 13
2.2 Object Detection in Aerial Imagery Using Machine Learning . . . . 14
2.3 Object Detection in Aerial Imagery Using Deep Learning . . . . . . 20
3 Theoretical Background 25
3.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Fundamentals of Neural Networks . . . . . . . . . . . . . . 28
3.2.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . 35
3.3 CNN-based object detection . . . . . . . . . . . . . . . . . . . . . . 45
3.3.1 Faster Region-based Convolutional Neural Network . . . . . 45
3.3.2 Single Shot Multibox Detector . . . . . . . . . . . . . . . . . 46
3.3.3 You Only Look Once . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Quantitative Performance Comparison Methods . . . . . . . . . . . 54
4 Ghaf Tree Detection Using Deep Neural Networks 57
4.1 Introduction to Ghaf Tree Detection . . . . . . . . . . . . . . . . . 57
4.2 Proposed Approach to the Ghaf Tree Detection . . . . . . . . . . . 58
4.2.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Experimental Results and Analysis . . . . . . . . . . . . . . . . . . 59
4.3.1 Quantitative Performance Comparison . . . . . . . . . . . . 60
4.3.2 Visual Performance Comparison . . . . . . . . . . . . . . . . 61
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 The visual performance comparison of Ghaf tree detector models
derived from DNN architectures, (a) SSD, (b) Faster R-CNN, (c)
YOLO-V5s, (d) YOLO-V5m, (e) YOLO-V5l, and (f) YOLO-V5x . 62
4.4 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 63
4.4 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 64
4.5 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 66
4.5 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 67
4.6 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 69
4.6 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 70
4.7 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 72
4.7 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 73
6.1 The results of litter detection in drone imagery using the SSD,
Faster R-CNN, YOLO-V5s, YOLO-V5m, YOLO-V5l and YOLO-
V5x based models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 106
6.2 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 107
6.3 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 108
6.3 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 109
6.4 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 110
6.4 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 111
6.5 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 113
6.5 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 114
6.6 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 116
6.6 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 117
6.7 Detecting very small objects of litter using YOLO-V5x based model 118
6.8 Test results of single-class litter detection models in desert campsites 121
6.9 Test results of single-class litter detection models in desert campsites 121
6.10 Testing results of YOLO-V5l based two class litter detection model 124
6.11 Testing results of YOLO-V7l based two class litter detection model 125
6.12 Testing results of YOLO-V5l based two class litter detection model 126
6.13 Testing results of YOLO-V7l based two class litter detection model 126
6.14 Testing results of YOLO-V5l based two class litter detection model 127
6.15 Testing results of YOLO-V7l based two class litter detection model 127
6.1 Number of labelled litters and human-made items in each data subset 102
6.2 Performance and comparison of DNN based litter detection models 104
6.3 Illustration of a conceptual/architectural comparison of the two
Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . 122
6.4 Performance comparison of YOLO-V5 and YOLO-V7 models . . . . 123
6.5 Performance comparison of YOLO-V5 and YOLO-V7 models . . . . 123
6.6 Performance comparison of YOLO-V5 and YOLO-V7 models . . . . 123
List of Abbreviations
AI Artificial Intelligence
MSE Mean Squared Error
NN Neural Network
Chapter 1
Introduction
The Ghaf tree, also known as the tree of life by local people in Bahrain and much of Arabia, is a drought-resilient tree capable of withstanding the extremely harsh conditions of a desert environment [1]. The Ghaf tree, scientifically known as
Prosopis cineraria [2], can survive in extremely dry and hot weather for hundreds
of years with no artificial irrigation required. In the United Arab Emirates (UAE)
particularly, the Ghaf was declared a national tree in 2008 due to its historical and
national importance [3] [4]. The leaves of Ghaf trees have historically been used
as food for camels, while its tender leaves are still used in the UAE to make salads
and for various medicinal purposes. Like any other natural entity in the environment, Ghaf trees have, in recent years, increasingly become threatened by ever-expanding human activity in the UAE as a result of urbanisation and
infrastructural development projects. Given the arid environment in which the
Ghaf trees exist, aerial surveillance systems such as Unmanned Aerial Vehicles
(UAV) based imagery are naturally the preferred monitoring mechanism for aerial
monitoring of habitats in such environments.
One of the oldest fruit trees in the Arabian Peninsula, the Middle East, and
North Africa is the date palm tree. It is a major fruit crop grown in arid locations
around the world and is a vital part of the Gulf region's crop/food production systems, given the region's arid climate. Approximately 90% of the world's dates are pro-
duced in the Arabian Peninsula. The fruits, bark, and leaves of date palm trees
are the most often used portions. Date palm tree fruits can be considered a food
with great nutritional value and numerous possible health advantages [5]. The
remaining portions of date palm trees can be used to make cosmetics, building
materials, and paper [6], among other things. As a result, surveying date palm
trees, which includes measuring their quantity, determining their location, pat-
tern, and distribution, is critical for predicting and forecasting production levels
and plantation management. Thani J. has successfully demonstrated the monocular detection of palm trees and published the work in a research paper [7].
Due to the depth of its roots, the Acacia tree has a great drought tolerance
and can flourish without water for long periods. It has a medium amount of salt
tolerance as well. Acacia trees can be found in sand and gravel valleys, as well
as on the medium-altitude slopes of mountains. They’re frequent and extensive
throughout the UAE’s eastern regions, including in the Hafeet Mountain area. The
Acacia tree has numerous advantages. It is a good source of animal feed since it
increases milk production, especially in semi-arid environments. During a drought,
when all other sources of protein and energy are scarce, it is a valuable supply
of protein and energy. The acacia tree’s branches are a great source of nutrients,
containing 38 percent raw protein, phosphorus, and the calories needed to provide
energy to animals [8]. Therefore, monitoring acacia trees, paying attention to their
growth conditions and growing environment, is crucial for predicting production
levels and planning routine maintenance.
With the increase of tourism in the desert areas of the Gulf region, the detection and removal of litter left behind by visitors at popular tourist sites is becoming an increasingly important environmental problem to resolve. Litter im-
pacts desert ecology and endangers wildlife and natural habitats. Typically, litter detection and pickup are done by humans who physically visit popular tourist destinations in search of litter left behind. However, given the terrain is difficult to navigate on foot or in vehicles, searching large areas for litter using surveillance ground vehicles is impractical. Further, wind can spread lightweight litter far beyond popular tourist sites and make searching more difficult.
The use of lightweight Unmanned Aerial Vehicles (UAVs), or drones [9], in practical application areas has increased in popularity recently due to advances in aerial photography, digital aerial displays, and the potential they provide for low-cost
surveillance of large areas. Visual data captured by high-resolution, high-quality,
digital cameras mounted on drones are increasingly used in computer vision ap-
plications. Drones equipped with high-resolution cameras can capture images over
large areas, providing a bird’s-eye view similar to satellite imaging [10], but at a
relatively lower altitude, and hence without being challenged by cloud cover. Fur-
ther, the cost of aerial image capture with drones is much lower than that of aerial
image capture with satellites.
A drone is a lightweight aircraft that does not have a pilot onboard. Drones can be operated automatically through a pre-programmed flight planning system/software or, alternatively, can be manually controlled by a
ground pilot. Various practical tasks such as surveillance [11], aerial mapping [12],
infrastructure inspection [13], search and rescue [14], precision agriculture [15], and
ecological monitoring [16] are now made possible with the use of drones. Drones
flying at a controlled height can provide vital visual aerial surveillance data for
above application areas, either to be processed manually by humans or processed
automatically by computers.
Due to factors such as high altitude flying, noise in images, lighting changes,
motion-induced object blur, object occlusion, and background clutter, detecting
objects in images captured by drones, either by humans or computers, can be ex-
ceedingly challenging, especially considering the small size of the objects involved.
Although drone camera technology has advanced significantly in recent years, minimising the impact of the above challenges, so as to improve the potential for information gathering from drone images, is still a challenging task. Additionally,
light-weight UAV-based (i.e., Drone) imagery and sensing has in the recent past
been utilised in detecting and mapping woody species’ encroachment in subalpine
grassland [17], estimating carbon stock for native desert shrubs [18], and several
other desert and forest monitoring applications [19–22]. While UAV imagery has
enabled large scale high resolution and fast landscape mapping, the use of this
significant imagery data is still largely limited to offline use, with much more to
be realised for real-time applications [21] [23] [24].
Due to the large amounts of visual data captured in drone applications, hu-
man/manual processing of such data is time-consuming, and is often prohibitive.
Automatic object detection methods based on machine learning have traditionally
been used to detect objects in images captured by cameras to support a wide range
of application domains. The traditional machine learning based methods of object
detection and recognition can be divided into three stages: image pre-processing,
feature extraction, and classification. The aim is to extract object features from an
image/object using a feature extractor and then learn and classify these features,
using a classifier.
Moranduzzo and Melgani used a Histogram of Oriented Gradients (HOG) as a
feature descriptor to represent the features of a car based on its shape in 2014 [25].
To detect cars in images, they used support vector machines (SVMs) [26] to learn
and classify the features of cars and other objects. Another example is the use
of Haar-like features [27] in conjunction with an AdaBoost classifier [28] to detect
suspicious objects in drone video for military surveillance [29]. These traditional
object detection methods use handcrafted feature descriptors, which are determ-
ined by humans based on their experience and judgment about which characterist-
ics uniquely define the object for a specific application domain. Hence the problem
of selecting the right features to detect and identify objects in various scenarios
is complex. Describing such features effectively is difficult for humans because the object detection mechanisms of the human psycho-visual system are largely inexplicable.
Although a large number of object detection and recognition approaches based on traditional machine learning have been presented in the literature, for the above reason they will not be a focus of the research presented in this thesis.
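To make this classical pipeline concrete, the following minimal sketch (illustrative only, and not drawn from the cited works) pairs a HOG feature extractor with a linear SVM using scikit-image and scikit-learn; the patch size and the training arrays are hypothetical placeholders:

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(patch):
    # Handcrafted feature engineering: histograms of oriented gradients
    # computed over a fixed-size grayscale patch.
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

def train_patch_classifier(patches, labels):
    # patches: list of equally sized grayscale patches (object / background),
    # labels: 1 for the object class, 0 for background.
    features = np.array([hog_features(p) for p in patches])
    classifier = LinearSVC()
    classifier.fit(features, labels)
    return classifier

def classify_patch(classifier, patch):
    # Infer whether an unseen patch contains the object.
    return classifier.predict([hog_features(patch)])[0]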
Deep learning is an alternative and much more effective approach to solving the problems mentioned above. It can be thought of as enabling computers to mimic the
high-level behaviours and operations of the human psycho-visual system. Several
recent studies have addressed the problem of prior feature selection in traditional
machine learning-based object detection systems using deep learning-based ap-
proaches [30] [31], thereby addressing the challenges of a human having to select
the features that will optimise an object recognition task. More relevant to the
research conducted in this thesis is the use of a Convolutional Neural Network
(CNN) as a deep learning architecture to create models that can be used for ob-
ject detection and classification [32–34]. CNNs can extract and learn high-level
features from millions of objects without the need for prior feature selection. Sev-
eral CNN-based object detection methods have been proposed in the literature
since 2014. These methods are broadly classified into two types: single-stage and
two-stage methods. You Only Look Once (YOLO) [35] and Single Shot MultiBox
Detector (SSD) [36] are two popular approaches for single-stage object detectors.
The region-based CNN family, which consists of R-CNN [37], Fast R-CNN [38],
and Faster R-CNN [39], are widely used as two-stage detectors. However, the
detection speed and accuracy of the two categories differ.
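As an illustration of how such detectors are typically exercised (a minimal inference sketch, not one of the models developed in this thesis), torchvision provides reference implementations of the two-stage family; the image path below is a placeholder, and the weights are generic COCO-pretrained weights rather than drone-specific ones:

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Two-stage detector: a region proposal network followed by per-region
# classification and box refinement.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = Image.open("drone_frame.jpg").convert("RGB")  # hypothetical image
with torch.no_grad():
    prediction = model([to_tensor(image)])[0]

# Keep only confident detections.
keep = prediction["scores"] > 0.5
boxes = prediction["boxes"][keep]    # (x1, y1, x2, y2) in pixels
labels = prediction["labels"][keep]  # COCO category indices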
Deep learning-based object detection and classification methods, based on
DNN architectures such as, YOLO, SSD, R-CNN, Fast R-CNN, and Faster R-
CNN, have been used in the literature [40]. They are primarily used to detect
standard objects such as vehicles, cars, buildings, animals etc., captured by hand-
held cameras or cameras set at the angles and heights of typical surveillance cam-
eras. In such a system, the pretrained networks for common object detection,
readily available to download from the internet and use, only need to be retrained
with a small number of additional images, for fine-tuning the network weights. Un-
fortunately, when dealing with drone images, which tend to record objects with
the camera looking straight down (or bird’s eye view), or at small angles to the
vertical, a significant amount of additional training is required to fine-tune the
weights to achieve object detection as seen from a drone. Although there have
been few attempts to detect objects in video footage captured by drones using
CNN-based methods [41] [42], such research is also limited to data collected under
controlled conditions. It has thus been demonstrated to work only under certain
practical constraints. For example, the detected objects must not overlap, the
contrast with the background must be clear, blurry objects may not be detected,
object detection may not work in scenes with significant lighting changes, and
so on. As a result, the research contributions of such work to solving practical
problems is limited. A further limitation of existing work is that its scientific rigour and contribution are limited, as in the majority of such work the DNN has been treated as a ‘black box’ used to solve a practical problem. Hence, some of the model design details and decisions on parameter selection do not follow a scientific approach and have not resulted in optimal outcomes.
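A common way of performing the additional training referred to above is to keep the pretrained backbone and replace only the detection head before retraining on drone imagery; the sketch below shows this pattern for torchvision's Faster R-CNN, where the class count and the data loader are hypothetical placeholders rather than the configurations used in this thesis:

import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 2  # hypothetical: background + one drone-specific class

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
# Replace the COCO classification head with one sized for the new dataset.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

optimiser = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
model.train()
for images, targets in data_loader:  # data_loader: hypothetical labelled drone data
    # targets: one dict per image with "boxes" and "labels" tensors
    loss_dict = model(images, targets)     # returns the individual loss terms
    loss = sum(loss_dict.values())
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()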
This thesis proposes novel object detection models that can cope with real-
world challenges of object detection and recognition, resulting from aerial images
captured from drones. We follow a rigorous experimental design and scientific
decision-making process, based on knowledge of the structures of Deep Neural Net-
work architectures we use to create our models and the domain knowledge of the
application sectors acquired through the expertise of our research collaborators.
The models developed have been practically implemented and have been success-
fully field tested for accuracy and performance. All design details are presented,
making significant original contributions to the subject area of DNN based object
detection and recognition, in general, and drone imaging, in particular.
Given the above, the research conducted within the scope of this thesis aims to make the necessary alterations, optimisations, and improvements to existing DNN architectures to enable their effective use in novel designs of object detection and
recognition models. We demonstrate that the proposed novel models are able to
detect and recognise objects within drone video footage, captured in challenging
real-world environments and conditions. Each method has been designed, imple-
mented, and tested to ensure practical relevance and use in the field. The field
test results have been used to refine all designs, backed up by well-defined train-
ing methods and rigorous analysis. The experimental results obtained, when using
object detection and recognition models developed with all state-of-the-art Deep
Neural Network models, are quantitatively and qualitatively compared, to identify
the best methods that can be used in desert areas, having challenging practical
limitations.
Design, develop, implement, rigorously test and compare novel Deep Neural
Network-based models for object detection and object classification in aerial im-
ages captured with drones in desert areas.
• Design and develop, novel models for detecting and recognising Ghaf trees
in drone imagery, under real-world conditions, using state-of-the-art Deep
Neural Network architectures, optimise their performance based on feedback,
rigorously compare their performance and recommend the best models and
approaches adopted for their design;
• Design and develop, novel models for detecting and recognising multiple
tree types in drone imagery (e.g., Ghaf trees, Date Palm trees, and Aca-
cia trees), under real-world conditions, using state-of-the-art Deep Neural
Network Architectures, optimise their performance based on feedback, rig-
orously compare their performance and recommend the best models and
approaches adopted for their design;
• Design and develop, novel models for detecting and recognising litter in
drone imagery captured in natural and campsite desert areas, under real-
world conditions, using state-of-the-art Deep Neural Network architectures,
optimise their performance based on feedback, rigorously compare their per-
formance and recommend the best models and approaches adopted for their
design.
Chapter-3 presents the theoretical background relevant to the object detection and recognition models presented in this thesis. Chapter-4 presents the design, devel-
opment, implementation details and testing results of novel CNN models presented
in this thesis for detection and recognition of Ghaf trees. Chapter-5 extends the
work of Chapter-4 to present novel CNN models for multiple tree detection and re-
cognition, namely Ghaf, Acacia and Palm trees. Chapter-6 presents novel models
for litter detection in drone imagery, based on state-of-the-art CNN architectures.
Finally, Chapter-7 summarises and concludes the findings of the research presented
in this thesis and recommends possible improvements and further work.
Chapter 2
Literature Review
The different methods for detecting and classifying objects in digital images that have been proposed in the literature are summarised in this chapter. The sliding window and region proposal methods, two standard classical object detection techniques, are described in Section 2.1. Sections 2.2 and 2.3 introduce machine learning and deep learning-based object detection approaches developed and presented in the literature. The review also helps to highlight research gaps in object detection, particularly in aerial imagery, which helps to justify the proposed study's research focus.
this method is not suitable for real-time object detection applications because it
is considered an exhaustive search, and thus finding objects in an image can take
a significant amount of time [49].
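A rough sketch of the exhaustive scan helps to illustrate the cost; the window and step sizes below are hypothetical and not taken from [49]:

def sliding_windows(image_height, image_width, window=64, step=8):
    # Enumerate every candidate window position at a single scale; each window
    # must be passed to the classifier, which is what makes the search slow.
    for y in range(0, image_height - window + 1, step):
        for x in range(0, image_width - window + 1, step):
            yield x, y, window, window

# A 4000 x 3000 drone image with a 64-pixel window and an 8-pixel step already
# produces roughly 180,000 candidate windows at one scale.
print(sum(1 for _ in sliding_windows(3000, 4000)))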
used a greedy algorithm to calculate similarity values between regions and their
neighbours, before passing them to the classifier. After completing the detection
process, they analysed the experimental results. They claimed that their proposed
method could reduce the execution time of searching for objects in an image, but
the performance may degrade when detecting overlapping objects.
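The greedy, similarity-based merging of regions described above underpins selective-search style proposal generation; a minimal usage sketch with OpenCV's implementation is given below, assuming the opencv-contrib-python package (which provides the ximgproc module) and a placeholder image path:

import cv2

image = cv2.imread("drone_frame.jpg")  # hypothetical input image

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()  # trades proposal quality for speed

# Each proposal is an (x, y, w, h) rectangle; typically only the first few
# hundred proposals are passed on to the classifier.
rects = ss.process()
proposals = rects[:500]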
In 2014, Zitnick and Dollár [55] proposed a method for generating bounding boxes
for a set of object proposals by considering edge-based possible object regions.
They used a structural edge detector algorithm to predict object boundaries and
a greedy algorithm to group the edges [56] [57]. Finally, they assigned a score to each set of edges using a scoring function to rank and define possible object regions. According to the experimental results, their proposed method can detect overlapping objects but does not achieve the speed required for real-time
object detection applications.
In 2017, Huang et al. [58] presented an algorithm for generating object propos-
als to detect ships in remote sensing images. The core of their proposed method
is to generate a set of object proposals using edge detection and structured forest
methods. They then used morphological processing to eliminate object proposals that could have been caused by edge-detection false positives. Finally, each target proposal's edge results are fed into a classifier to identify ships. According to the experimental results, their proposed method outperforms other methods under varying illumination and interference conditions when detecting ships in remote sensing images with the sea as the background. However, their proposed method has
limitations when another object, such as a cloud, obscures parts of the ship.
Other methods for generating object proposals, besides those mentioned above,
include Geodesic Object Proposals (GOP) [59], Multiscale Combinatorial Grouping (MCG) [60], and so on.
Figure 2.2: The workflow of object detection implemented using machine learning
Conditional Random Fields (CRF) [63] and other classifiers can be used to learn a set of features for objects in images.
In 2016, Redmon et al. presented a novel object detection algorithm called You
Only Look Once (YOLO). YOLO divides the input image into a grid and predicts
the object’s class and bounding box location for each cell in the grid. This way,
YOLO can detect multiple objects in an image at once, making it a fast and
efficient algorithm. YOLO uses a deep convolutional neural network to extract
features from the image, and the last layer of the network predicts the class prob-
abilities and bounding box coordinates. They claim that their method achieves
state-of-the-art accuracy while maintaining real-time performance. YOLO is now the most popular and effective network in the object detection area.
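Because the YOLO-V5 family is used extensively in the later chapters, a minimal inference sketch is included here; it assumes the publicly available ultralytics/yolov5 torch.hub entry point and a placeholder image path, and uses generic pretrained weights rather than the models trained in this thesis:

import torch

# Load the small pretrained YOLO-V5 variant from the public hub entry point.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

results = model("drone_frame.jpg")      # hypothetical drone image
detections = results.pandas().xyxy[0]   # one row per box: coordinates, confidence, class
print(detections[["xmin", "ymin", "xmax", "ymax", "confidence", "name"]])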
In 2017, He et al. [64] presented a new object detection method called Mask
R-CNN, an extension of the Faster R-CNN algorithm. Mask R-CNN adds a branch
to the Faster R-CNN network to generate a binary mask for each detected object
in addition to the bounding box and class prediction. The mask branch uses a fully
convolutional network to predict the object mask pixel by pixel. They achieved
state-of-the-art performance on several object detection benchmarks, including
COCO [65] and PASCAL VOC [66].
In 2019, Ghiasi et al. proposed an object detection algorithm named NAS-
FPN [67], which is a combination of Neural Architecture Search (NAS) and Feature
Pyramid Networks (FPN). NAS is used to search for the optimal network archi-
tecture, and FPN is used to extract features at different scales from the input
image. They claimed that their proposed algorithm can achieve state-of-the-art
performance while using fewer parameters than previous methods.
In summary, these object detection methods based on machine learning have
significantly advanced the field of computer vision, enabling the detection of ob-
jects in images captured from aerial viewpoints, which can be used for various
applications such as environmental protection, surveillance, traffic monitoring,
and disaster response.
Tuermer et al. [68] presented a new method for the detection of cars in dense
urban areas in 2013. The first stage of their proposed method involves selecting
road locations in urban areas from a road database. This stage is used to prevent
the car detector from being confused by the similarity of other objects that may
resemble a car, such as a sunroof. They then used HOG feature descriptors to
represent shape-based car features in the second stage. Finally, they trained and
classified the car feature vectors using AdaBoost as a classifier. They tested their
proposed method on test images taken by a drone over downtown Munich and
discovered that it achieved 70% accuracy in detecting cars in dense urban areas.
Meanwhile, Cheng et al. [69] created a framework for detecting ships in remote sensing images. A challenging aspect of the ship detection task is the change in the appearance and background of
the ship in the image. Therefore, they decided to convert the panchromatic im-
age to a pseudo hyper-spectral form and rearrange spatially adjacent pixels into
vectors. The size of each vector produced during the conversion process adds ad-
ditional contextual information that can magnify the difference between the vessel
and the background. They then used the HOG feature descriptor to represent the
ship's shape. Finally, the ship's feature vector is sent to AdaBoost for classification, allowing the algorithm to learn and classify objects. According to the experi-
mental results, their proposed method could detect ships when they were not close
to each other, but it still missed some detections when detecting the ships close
to land.
In summary, classical machine learning based approaches to object detection based on HOG features have been presented above. The results obtained by these algorithms were considered excellent at the time they were published, achieving accuracy rates of approximately 70%. As with many other classical machine learning based approaches to object detection, these algorithms lack depth of feature extraction, and their classification accuracy is constrained by the limited feature set. Novel deep learning based approaches to object detection address these limitations.
Texture is a critical feature widely used in aerial image target detection using machine learning algorithms. Texture features are used to describe local patterns within an object's surface. Zhong and Wang [74] published a method for detecting urban areas in remote sensing images in 2007. They divided the input images (training and testing) into non-overlapping 16 × 16 blocks at the start of the
proposed method. The researchers then calculated five multi-scale texture features
around each block to capture the general statistical properties of urban areas. The
five texture features are grey-level appearance texture features, Gray-level co-
occurrence matrix (GLCM) [75], Gabor texture features [76], gradient direction
features [77], and line length features. Finally, they used multiple Conditional
Random Fields (CRFs) as base classifiers for each block to learn and classify
feature vectors. They analysed the experimental results and claimed that the proposed model can outperform a single CRF in terms of detection accuracy while avoiding the over-fitting problem.
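For reference, GLCM texture descriptors of the kind used in these studies can be computed with scikit-image; the sketch below is illustrative (the distance, angle and property choices are assumptions, and older scikit-image releases spell the functions greycomatrix/greycoprops):

import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(patch):
    # patch: a small grayscale block (e.g. 16 x 16) with integer levels 0-255.
    glcm = graycomatrix(patch, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    # Summarise the co-occurrence matrix with a few common texture statistics.
    props = ("contrast", "homogeneity", "energy", "correlation")
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])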
Senaras et al. [78] used texture features to represent building features in optical
satellite images in 2013. They used a multi-classifier approach known as Fuzzy
Stacked Generalisation (FSG) [79] to classify objects in the proposed system. To
generate the final detection, the detection results of multiple classifiers are com-
bined. They used GLCM as a texture feature descriptor in the feature extraction
stage, combining it with shape features to represent the features of buildings.
After using GLCM for segmentation, they analysed the detection results. They
discovered that their proposed method could detect buildings of various sizes but
had problems detecting buildings with textures similar to the background. The
same year, Aytekin et al. [80] presented an algorithm for detecting runways in
satellite images. The proposed algorithm is based on texture-based pattern recognition operations to detect runways. They began by dividing the input satellite image into 32 × 32-
pixel non-overlapping image patches. The texture features, image intensity mean,
and standard deviation were then used to characterise the runway. They used
six texture features to represent the texture properties of the runway in the tex-
ture representation, including Zernike moments [81], circular-Mellin features [82],
GLCM, Fourier power spectrum [83], wavelet analysis [84], and Gabor filter [85].
The six features were then concatenated into a feature vector for classification
using the AdaBoost algorithm. They analysed the experimental results and con-
cluded that incorporating texture features can improve detection accuracy for
runway detection applications.
Cirneanu et al. [86] presented a method for detecting flooded areas in drone
images in 2015. They used texture features to detect flooded areas. They selected
sample images of flooded areas, extracted their features using the LBP feature
extraction method during the model training phase, and used the LBP feature
descriptor to extract features from an RGB input image’s red and green channels.
As the blue channel contains little information about the flooded areas, it was not chosen to represent the texture information of the flooded area.
In addition, they converted the RGB input image to HSV colour space and only
selected the H component. The texture features of the H component are then
extracted using the LBP feature extractor. The average histogram vector for the
three colour components (R, G, and H) was then computed. The same operations
as in the training phase are performed on the test images in the test phase, and the
flooded areas are classified by comparing the Euclidean distance of the training
image histogram vector to the test image. They tested their proposed method
and claimed it could detect flooded areas in drone images with high accuracy.
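An LBP descriptor of the kind used above can be sketched as follows with scikit-image; the neighbourhood parameters are illustrative assumptions rather than the settings of [86]:

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(channel, points=8, radius=1):
    # Encode each pixel by comparing it with its circular neighbourhood, then
    # summarise the channel by a normalised histogram of the LBP codes.
    codes = local_binary_pattern(channel, points, radius, method="uniform")
    n_bins = points + 2  # number of distinct "uniform" patterns
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist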
Moranduzzo et al. [87] published their findings in the same year, proposing a multi-class classification method for detecting multiple objects in drone images. They started by dividing the raw drone image into tiles. The LBP feature extractor is then used to extract texture features from each tile. The chi-square histogram distance between each tile's LBP feature vector and those of the tiles in the training dataset is then calculated to measure similarity. Experimental results show that their proposed method performs well only when detecting large objects.
Aguilar et al. [88] published their findings in 2017. They proposed an algorithm
for detecting pedestrians in drone images. Firstly, they divided all the images into
two categories: training (70 percent) and testing (30 percent). The first set was
used for training, and the second was used to evaluate the proposed method’s
performance. Secondly, they used Haar-like and LBP feature descriptors to rep-
resent pedestrian features. Finally, they fed the combination of these two feature
vectors into AdaBoost for object recognition and classification. They examined
the experimental results and concluded that their proposed method could detect
pedestrians even when they were not approaching.
The papers presented above attempt to use further features and different feature combinations, followed by classification, for object detection, obtaining very good levels of accuracy. However, they are limited by the small number of features being exploited within the object detection stage, requiring a human to be involved in deciding which features should be used for a given object detection task (i.e., the need to do feature engineering), and using a selected classifier in the final stage of object detection. If an unlimited number of features can be exploited, the feature engineering is performed automatically by a computer and optimised for the detection task, and the best approach to classification is decided based on exhaustive investigations, one could achieve much higher levels of accuracy. This is the focus of Deep Neural Network based object detection approaches.
Although many attempts have been made to detect objects in images captured
by drones using distance-based and machine learning-based methods, these meth-
ods are designed by humans who manually select feature descriptors based on the
application domain. The problem of selecting the correct features to represent
object type information for accurate classification still needs to be solved. The
proposed algorithms must be improved to account for small object size, overlap
or proximity between objects, illumination changes, and background contrast.
One of the deep learning networks extensively employed in object detection devel-
opment is the Convolutional Neural Network (CNN). A CNN can extract count-
less high-level characteristics from objects without needing feature pre-selection
during the initial stage of object detection. As a result, in this section, object detection methods for images acquired from aerial perspectives are reviewed. These
approaches were developed and implemented using deep learning algorithms, par-
ticularly CNN.
Zeggada and Melgani [90] introduced a multi-class classification method in
2016 for identifying eight objects in urban settings using aerial photography. They
began by creating a grid of identical tiles from the supplied image. Then, to extract
the key features of objects in each tile image, they used CNN without a fully
connected layer as a feature extractor. Finally, they used a Radial Basis Function
Neural Network (RBFNN) to deduce the class of each object using the features
of each tile. According to their analysis of the experimental results, the proposed
strategy outperformed the methods employing just an LBP feature descriptor for
detecting various classes of objects.
A CNN-based object detection approach for detecting several items in aerial
photos was also presented by Radovic et al. in 2017 [91]. Their study’s objective
was to create an application for object detection that could be used to survey
urban environments to carry out more research in the domain of city transporta-
tion. The open-source "YOLO" object detection and classification platform was the foundation for the CNN used in the suggested method. Twenty-four convolutional layers, two fully connected layers, and one final detection layer made up the YOLO network design. The input image was segmented into S × S grid cells by YOLO. Then, each grid cell forecasted bounding boxes and gave each bounding box an object class likelihood. The trial findings demonstrated that their suggested strategy had a 97.8 percent accuracy rate for identifying vehicles and buses in urban areas in aerial photographs. The method used a pre-trained initial YOLO version to detect common objects.
To locate and count olive ridley sea turtles in aerial photos, Gray et al. [92]
used a CNN. They put drone photos taken during a maritime survey in Ostional,
Costa Rica, during a significant nesting episode, to the test. Two groups were
created from all the photographs. The olive ridley sea turtles in the first group of
photographs were manually labelled with bounding boxes for training the CNN.
The images of the olive ridley sea turtles in the second group of images were saved
for evaluating the performance of the trained CNN model. The test image was
partitioned into a grid of tiles once the CNN had been trained. Each tile image was
sent to the trained model with a size of 100 × 100 pixels to identify olive ridley sea
turtles. They analysed the trial findings and found that the number of sea turtles identified using the trained CNN model was equal to the number counted manually. Saqib et al. [93] built a real-time object detection
program for a drone in 2018, to find aquatic life. Their research’s objectives were
to recognise and count the dolphins and stingrays in real-time. They put the
technique into practice, using a faster R-CNN to identify both animals. The faster
R-CNN discovered objects by creating feature maps of potential object regions.
In the second stage, a network classifier categorised the potential object regions
into different types of objects. They examined the detection data and concluded
that while the proposed method’s detection accuracy was quite good, its detection
speed required improvement.
In 2019, Rohan et al. [94] presented a real-time object detection application for
use in a drone for detecting fixed and moving objects. They used an SSD object
detector to create a real-time object detection program, which was based on the
VGG-16 network architecture but did not use the fully linked layer. Instead,
they expanded the network’s convolutional layers, enabling the SSD to extract
more features at di↵erent feature map scales. According to their analysis of the
experimental findings, the proposed real-time object detection program achieved
98 percent accuracy when detecting just one class of object in photos. However,
the accuracy decreased when applied to recognising several classes of objects.
Meanwhile, in 2019, Hong et al. [95] focused on detecting birds in drone imagery
to examine the performance of five CNN-object detection approaches, including
Faster R-CNN, Region-based Fully Convolutional Network (R-FCN) [96], SSD,
RetinaNet [97], and YOLO. Both detection speed and accuracy were evaluated
to compare the performance of the five CNN-based object detection techniques.
After analysing the detection findings, they concluded that Faster R-CNN had the highest detection accuracy, while YOLO had the fastest detection speed.
In addition to using the already-existing CNN-based object detection tech-
niques, other researchers used CNN’s performance to create their network. Long
et al. [98] proposed a feature-fusing deep network (FFDN) for recognising small
objects in aerial photos in 2019. The envisioned network had three main parts. To
learn the deep hierarchical properties of objects, they used convolutional restric-
ted Boltzmann machines in the first component. The second component employed
conditional random fields to create the spatial relationship between neighbouring
objects and backgrounds. In the third component, a deep sparse auto-encoder
combined the results of the first and second components. After examining the
experimental data, they asserted that their suggested approach might strengthen
CNN’s feature representation while enabling the detection of small objects against
a challenging background.
The literature presented above shows that existing deep neural network-based
Chapter 3
Theoretical Background
The theory, procedures, and methodologies of the research described in this thesis
are supported by the scientific background information provided in this chapter.
The foundational theoretical ideas of the machine learning approach are presented
in the first section. The background of deep learning is discussed in the second half,
with a focus on CNN to provide crucial background information on well-known
Convolutional Neural Networks (CNN) to be used in the contributory chapters of
this thesis, Chapters 4, 5 and 6.
$D_{train} = \{(x_i, y_i)\}_{i=1}^{N_{train}} = \{(x_1, y_1), \ldots, (x_{N_{train}}, y_{N_{train}})\}$  (3.1a)
The two primary stages of supervised learning are the training and testing
phases, as depicted in Figure 3.1. Prior to the training phase, it is necessary to
prepare the training data and their corresponding labels (ground truth). Feature
the prediction of the input data with the actual response to change the network
weights.
Deep Belief Networks (DBNs), Recurrent Neural Networks (RNNs), Deep
Feedforward Networks (DFNs), Generative Adversarial Networks (GANs), and
Convolutional Neural Networks (CNNs) are some of the various deep neural net-
works that have been developed over the past two decades. Among these networks,
CNNs are the dominant deep neural network that performs best on visual iden-
tification tasks, and they are the main subject of the remaining sections of this
chapter.
A basic NN architecture consists of three types of layers: the input, hidden, and output layers. Each neuron in the layer below is fully
connected to every neuron in the layer above it, and each layer contains multiple
neurons. The input layer’s input must be a vector. When an image is used as
the input, it must be flattened into a one-dimensional vector. For instance, if the
input image is a grayscale image with dimensions of 13 × 13 pixels, the flattened vector's dimension should be 1 × 169, and the input layer should have 169
neurons that are fully connected to every other neuron in the layer above.
Figure 3.3 illustrates an example of an NN architecture using a grayscale image
with a 13 × 13 pixel input size. As shown, the NN contains one hidden layer and
one output layer in addition to the input layer with 169 neurons. The output layer
has ten neurons that recognise the type of handwritten digit in the input image.
Figure 3.3: Neural network architecture for an input image of size 13 x 13 pixels
with one hidden layer and ten neurons in the output layer from [30][31]
In a NN, there may be more than one hidden layer. A NN might have one
or more hidden layers for handling more complex issues and data. However, the
architecture of the NN will become more complex with more hidden layers. The
output layer is responsible for creating the NN’s output. The number of neurons
in the output layer must correspond to the task performed by the
neural network. For instance, if the neural network is classifying a handwritten
digit (0-9), the 10 neurons in the output layer should correspond to each class, as
shown in Figure 3.3.
Each neuron in a NN computes its output by multiplying each input ($x_i$) by its weight ($w_i$), adding a bias ($b$) to the resulting weighted sum, and then passing the result through an activation function ($f$) to produce an output ($Y'$). The following equation can be used to calculate the output $Y'$ of a neuron with n inputs:
$Y' = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$  (3.2)
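Equation (3.2) translates directly into code; a small NumPy sketch with arbitrary example values:

import numpy as np

def neuron_output(x, w, b, f):
    # Y' = f(sum_i w_i * x_i + b), as in Equation (3.2).
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])            # example inputs
w = np.array([0.4, 0.1, -0.7])            # example weights
b = 0.2                                   # example bias
relu = lambda z: np.maximum(0.0, z)       # example activation function
print(neuron_output(x, w, b, relu))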
Activation Function
The function that converts the weighted total of each neuron’s inputs into output is
called an activation function. The activation function may also be referred to as a
transfer function. It is used to introduce nonlinearity into a neuron’s output value.
NNs use a variety of activation functions, including softmax, sigmoid, hyperbolic
tangent, ReLU, and leaky ReLU. These activation functions are briefly described
in the sub-section that follows.
Sigmoid: The sigmoid activation function is a logistic function that takes any
real value as input and produces an output value between 0 and 1. As shown
in Figure 3.5, the sigmoid activation function has an S-shaped curve. From the
curve, it can be observed that when the input value is very large or positive, the
sigmoid activation function will map it towards 1. Similarly, when the input value
is very low or negative, the sigmoid activation function will map it closer to 0.
The formula for the sigmoid activation function is as follows:
$\sigma(z) = \frac{1}{1 + e^{-z}}$  (3.3)
Hyperbolic tangent (tanh): The tanh activation function maps any real-valued input to an output between -1 and 1. Its formula is as follows:
$\tanh(z) = \frac{2}{1 + e^{-2z}} - 1$  (3.4)
ReLU: Due to its computational simplicity, the ReLU activation function has
become one of the most used activations in NNs, particularly in CNNs. This func-
tion converts all negative numbers to zero, leaving all positive values unchanged.
ReLU stands for Rectified Linear Unit. Figure 3.7 displays the ReLU activation
function curve. The formula for the ReLU activation function is as follows:
$ReLU(z) = \max(0, z)$  (3.5)
Leaky ReLU: The leaky ReLU activation function improves upon the ReLU
activation function by addressing the ”dying ReLU” problem. This function at-
tempts to convert any negative values that were changed to zero in the ReLU
activation function to non-zero values by multiplying all negative values by a
small constant number m, known as the leak factor, typically set to 0.01. The
curve for the leaky ReLU activation function is shown in Figure 3.8. As a result,
the Leaky ReLU activation function's formula is as follows:
$LeakyReLU(z) = \begin{cases} z, & z > 0 \\ m z, & z \leq 0 \end{cases}$  (3.6)
Softmax: The function commonly used in the output layer of NNs for cat-
egorizing inputs into multiple categories is called the softmax activation function.
This function normalizes the outputs for each class between 0 and 1 to calculate
the likelihood that an input belongs to a particular class. The softmax activation
function’s formula is as follows:
$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, 2, \ldots, K$  (3.7)
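The activation functions above can be written in a few lines of NumPy; the sketch below mirrors Equations (3.3)-(3.7):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))              # Equation (3.3)

def tanh(z):
    return 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0  # Equation (3.4)

def relu(z):
    return np.maximum(0.0, z)                    # Equation (3.5)

def leaky_relu(z, m=0.01):
    return np.where(z > 0, z, m * z)             # Equation (3.6), m is the leak factor

def softmax(z):
    e = np.exp(z - np.max(z))                    # Equation (3.7), shifted for stability
    return e / e.sum()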
The NNs employ both forward and backward propagation as effective techniques.
The process of passing inputs to a group of neurons in the first layer and passing
the neurons’ outputs to the last layer (the output layer) to generate a result is
referred to as "forward propagation" (or "forward pass"). The difference between
the predicted output of the neural network and the correct response (ground truth)
is calculated using the loss function after the neural network has produced a result.
It is then used as a feedback signal to modify the network’s weights. The network
weights initialised in the first stage using weight initialisation methods such as zero
initialisation, random initialisation, Xavier initialisation [101], Kaiming initialisa-
tion [102], etc., are updated using a process known as backward propagation, also
known as back-propagation. The weights are continuously adjusted throughout
the training to align with the input data set, which is referred to as the learning
process.
However, various loss functions can be employed to determine the loss score in
NNs, including mean squared error (MSE), binary cross-entropy, and multi-class
cross-entropy (or categorical cross-entropy). Different scenarios call for the use of each function. For instance, the loss score in a regression problem is calculated using
the mean squared error. In a binary classification problem, cross-entropy is used,
while in a multi-class classification problem, multi-class cross-entropy is employed.
The following is a list of the formulas used to determine the loss of each function:
Binary Cross-Entropy:
$L_{BCE} = -\left[ y \log(y') + (1 - y) \log(1 - y') \right]$  (3.9)
Categorical Cross-Entropy:
$L_{MultiCE} = -\sum_{i=1}^{C} y_i \log(y_i')$  (3.10)
Where $y'$ is the predicted value, $y$ is the actual value, and C is the number of classes.
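The loss functions can likewise be expressed directly; in the NumPy sketch below, y holds the ground-truth values and y_pred the network's predictions:

import numpy as np

def mse(y, y_pred):
    # Mean squared error, used for regression problems.
    return np.mean((y - y_pred) ** 2)

def binary_cross_entropy(y, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

def categorical_cross_entropy(y, y_pred, eps=1e-12):
    # y is a one-hot vector over C classes, y_pred the predicted probabilities.
    return -np.sum(y * np.log(np.clip(y_pred, eps, 1.0)))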
Hyperparameters are a group of variables that can be utilised to regulate the learning process of a neural network. A set of hyperparameters must be established before a neural network can be trained. The hyperparameters learning rate, momentum, decay, batch size, and epoch significantly affect the neural network's capacity for learning.
The learning rate is used to regulate how much the neural network’s weights
are modified concerning the loss gradient. The momentum is used to manage the
degree to which the previous weight update influences the current weight update.
In a neural network, decay is used to regulate how quickly the learning rate declines
at each weight update. The batch size is the number of samples sent through the
neural network and processed before the weights are updated.
For instance, if the batch size is set to 8 and there are 800 training samples,
the algorithm will use the first eight examples (from the first to the eighth) from
the training dataset to train the network. The network is then trained again using
the training dataset’s subsequent eight samples (from the ninth to the sixteenth).
This process will be repeated until all training samples have gone through the
network once.
The epoch specifies the number of times the neural network will traverse the
entire dataset.
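The interaction of batch size and epochs described above corresponds to an iteration scheme like the following sketch, using the 800-sample example with a batch size of 8:

num_samples = 800
batch_size = 8
epochs = 3
updates = 0

for epoch in range(epochs):                          # one epoch = one full pass
    for start in range(0, num_samples, batch_size):
        batch = range(start, start + batch_size)     # samples 0-7, then 8-15, ...
        # a forward pass, loss computation and weight update would happen here
        updates += 1

print(updates)  # 100 updates per epoch x 3 epochs = 300 weight updates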
The first version of the CNN was created by Fukushima in 1980 [103], but it
gained popularity in 2012 after Krizhevsky et al. [104] suggested using AlexNet,
a CNN-based system, to categorise the 1.2 million high-resolution images in the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) into 1,000 different classes. Since AlexNet performed well in the ILSVRC competition, other re-
searchers were inspired to create many CNN models. It quickly gained popularity
in various computer vision applications, including image classification, object re-
cognition, and image segmentation.
The CNN’s architecture is inspired by visual perception and has three primary
divisions. The first division is the input layer, followed by the hidden layers, which
comprise numerous convolutional layers, activation processes, pooling layers, and
a fully connected layer. The third part is the output layer. The CNN’s structure
is depicted in Figure 3.9 [105].
Convolutional Layer
At each step of the kernel, the convolutional layer performs the convolution
operation by multiplying the kernel values by the corresponding image pixel values
and then adding the multiplication results to produce a feature map. An example of a convolution operation on a 6 × 6 pixel input image with a 3 × 3 pixel kernel is shown in Figure 3.10.
Figure 3.10 illustrates how the convolution operation on an image with a 6 × 6 pixel resolution and a kernel with a 3 × 3 pixel resolution results in a 4 × 4 pixel resolution output feature map. Equations (3.11) and (3.12) can be used to determine the size of the output feature map if the image is $I_H$ rows by $I_W$ columns and the kernel is $K_h$ rows by $K_w$ columns.
$O_{height} = I_H - K_h + 1$  (3.11)
$O_{width} = I_W - K_w + 1$  (3.12)
Where IH and IW are an input image’s height and width, Kh and Kw are a
kernel’s height and width, and Oh eight and Ow idth are an output feature map’s
height and width, respectively. However, the number of kernels used to perform
the convolution operation on the input image determines the number of output
feature maps. As each kernel is used to identify different aspects of an input image,
applying multiple kernels to the same input will result in various output feature
maps from the same input image. For instance, the first kernel is used to identify
the input image’s vertical edges, the second kernel is used to identify the input
image’s horizontal edges, and the third kernel is used to sharpen the image. As a
result, utilising K kernels to perform the convolution operation on the same input
picture will result in K feature maps, which are then combined to create the final
output of the convolutional layer (see Figure 3.11) [106]. The size of the kernels
in a convolutional layer is often set to an odd number, such as 3, 5, 7, or 11, to
extract the features of an input image.
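A minimal NumPy sketch of the 'valid' convolution described above is given below (illustrative code only, not an implementation used in this work); for a 6×6 input and a 3×3 kernel it produces the 4×4 feature map predicted by Equations (3.11) and (3.12):

import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image, multiply the overlapping values
    # and sum them to form the output feature map (Equation 3.15, no padding).
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1          # Equations (3.11) and (3.12)
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(36).reshape(6, 6)                        # 6x6 input image
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])    # vertical-edge kernel
print(conv2d_valid(image, kernel).shape)                   # (4, 4), as in Figure 3.10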
Padding is an additional method that can be used to help the convolution operation work better
by retaining information at the edges of an input image. An extra set of pixels is
added around an input image's edge to perform the padding technique. Zero-
padding is a common practice that involves setting the additional pixels' values to
zero. Figure 3.12 illustrates how zero-padding is applied to the 6×6 input image
from Figure 3.10 and how an output feature map is created using a 3×3 kernel
and a single stride value.
Figure 3.12: Convolution operation on a 6×6 image with zero-padding and a 3×3
kernel
Figure 3.12 demonstrates how the input image's size was enlarged from 6×6 to
8×8 pixels by adding extra pixels with a value of 0 around the edge of the image.
It also illustrates that the resulting output feature map is larger than it would be
without applying the padding technique to the input image. The following formulas
may therefore be used to determine the size of the output feature map after the
convolution operation:
O_{height} = \frac{I_H - K_h + 2P}{S} + 1     (3.13)

O_{width} = \frac{I_W - K_w + 2P}{S} + 1     (3.14)

where I_H and I_W are an input image's height and width, K_h and K_w are
a kernel's height and width, P is the convolution operation's padding value (for
instance, if the input image has one extra pixel added around the boundary, the
P value is 1), S is the convolution operation's stride value, and O_{height} and
O_{width} are an output feature map's height and width. Equation (3.15) describes the
formula used to calculate the convolution operation, where i ranges from 1 to O_{height},
j ranges from 1 to O_{width}, K is a kernel, and I is an input image.
O(i, j) = \sum_{k=1}^{K_h} \sum_{l=1}^{K_w} I(i + k - 1, j + l - 1) K(k, l)     (3.15)
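The generalised output-size formulas can be expressed as a small helper function; the sketch below (illustrative only) reproduces Equations (3.13) and (3.14) and confirms that zero-padding a 6×6 image with P = 1 preserves its size under a 3×3 kernel:

def conv_output_size(in_h, in_w, k_h, k_w, padding=0, stride=1):
    # Equations (3.13) and (3.14): output height and width of a convolution
    # with padding P applied around the input and a stride S.
    out_h = (in_h - k_h + 2 * padding) // stride + 1
    out_w = (in_w - k_w + 2 * padding) // stride + 1
    return out_h, out_w

print(conv_output_size(6, 6, 3, 3))                       # (4, 4) - no padding
print(conv_output_size(6, 6, 3, 3, padding=1, stride=1))  # (6, 6) - zero-padding keeps the size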
Pooling Layer
The layer responsible for reducing the spatial dimension of the output feature map
is known as the pooling layer. This layer is typically applied after the convolu-
tional layer has extracted the relevant features, and it serves multiple purposes in
the convolutional neural network (CNN) architecture. Firstly, before the pooling
layer is applied, an activation function is used to introduce non-linearity into the
output feature map. This activation function transforms the values within the
feature map, allowing for the modelling of complex relationships and enhancing
the network’s ability to learn intricate patterns and representations. The pooling
layer’s primary role is to downsample the feature map, e↵ectively reducing its
spatial dimensions. By aggregating information within local receptive fields, such
as max pooling or average pooling, the pooling layer decreases the resolution of
the feature map while preserving the most salient features. This downsampling
operation contributes to reducing the computational complexity of the network,
as it decreases the number of parameters and subsequent computations required
in the network’s subsequent layers. Moreover, the pooling layer introduces several
benefits to the CNN model. It enhances the network’s robustness to noise and dis-
tortions in the input data by capturing the most prominent features within local
regions, thus reducing the impact of irrelevant variations. Additionally, pooling
helps to address the issue of overfitting by promoting generalisation. By summar-
ising the information within each pooling region, the pooling layer encourages the
CNN model to focus on the most significant features while discarding less relevant
or noisy details, which can lead to better generalisation performance on unseen
data.
The pooling layer down-scales the feature map’s size by reducing its height
and width while maintaining its depth. The pooling layer’s technique is similar to
how an image is resized in image processing. In CNN designs, the pooling layer
performs a pooling operation on an output feature map using a 2×2 pixel-sized
kernel and a 2-pixel stride. There are two primary types of pooling operations
in the pooling layer: maximum pooling and average pooling. Below is a detailed
description of each pooling approach:
Max Pooling: The goal of the pooling procedure known as "max pooling"
is to choose the highest value from the area of the feature map that the
kernel covers. The output of the max-pooling layer is a feature map that includes
the most noticeable features from the prior feature map. A sample of the max
pooling process using a 2×2 kernel and a 2-stride value is shown in Figure 3.14.
This feature map was obtained from Figure 3.10 and is called a "rectified feature
map".
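A minimal NumPy sketch of max pooling with a 2×2 kernel and a stride of 2, as described above, is shown below (the feature map values are illustrative only):

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Keep the largest value within each size x size window of the feature map.
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 0],
                 [3, 4, 1, 8]])
print(max_pool(fmap))   # [[6. 4.] [7. 9.]]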
Fully Connected Layer
A CNN uses a fully connected layer as its last layer. After the final convolutional
or pooling layer has produced the final output of the feature map, this layer is
used to learn and classify the features. However, the fully connected layer can
only work with one-dimensional input data. Therefore, the output feature map’s
multi-dimensional data must go through a flattening process to convert the multi-
dimensional matrix to a 1-dimensional one. After flattening, each feature map
component is configured to connect to every neuron in the fully connected layer.
Figure 3.16 [107] shows an example of the fully connected layer and the flattening
operation being applied to a max-pooled feature map derived from Figure 3.14.
Figure 3.16: Example of fully connected layer and flatten operation from [107]
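The flattening step and the fully connected layer can be sketched as follows (the sizes and variable names are illustrative assumptions, not taken from the networks used in this thesis):

import numpy as np

# A pooled feature map must be flattened to a 1-D vector before it can be
# fed to the fully connected layer.
pooled = np.random.rand(2, 2, 32)      # 2x2 feature map with 32 channels
flattened = pooled.reshape(-1)         # 128-element vector

weights = np.random.rand(128, 10)      # fully connected layer with 10 output neurons
bias = np.random.rand(10)
logits = flattened @ weights + bias    # every input element connects to every neuron
print(logits.shape)                    # (10,)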
Faster R-CNN combines Fast R-CNN with a Region Proposal Network (RPN). The region
proposals produced by the RPN are then supplied, in the second stage, to the
classification layer, which is used to identify object locations and the class to
which each object belongs [108]. The relationship between the RPN and R-CNN is
shown in Figure 3.17.
SSD predicts bounding boxes from feature maps of different sizes. The prediction
layers in SSD are Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2.
In the case of using an image of size 300×300 as input, referred to as SSD300, these
prediction layers predict the bounding boxes. The SSD detector can identify
objects of different sizes thanks to the feature maps' predictions at multiple scales.
After detection, SSD uses non-maximum suppression to aggregate redundant and
overlapping bounding boxes into a single box to produce the final detection result.
However, SSD performs worse than Faster R-CNN when identifying small objects.
In SSD, small objects can only be recognised by the feature map layers with higher
resolution, but the feature information from these layers is less valuable for
classification because they only provide low-level object characteristics, such as
edges or colour patches [110].
Figure 3.19: The key process of YOLO object detection algorithm from [115]
YOLO finds objects by partitioning an input image into S×S grid cells, each
of which predicts B bounding boxes. Each bounding box contains the centre x,
centre y, width, height, confidence score, and the class probability of each class. The
centre of the bounding box is indicated by the (x, y) coordinates. The key process
of the YOLO object detection algorithm is shown in Figure 3.19.
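The organisation of YOLO's per-cell predictions can be sketched as follows (the grid size, box count and class count are illustrative assumptions, not the settings of any specific YOLO version):

import numpy as np

S, B, C = 7, 2, 3                      # grid size, boxes per cell, number of classes
# Each cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities.
prediction = np.random.rand(S, S, B * 5 + C)

cell = prediction[3, 4]                # predictions of one grid cell
boxes = cell[:B * 5].reshape(B, 5)     # [x, y, w, h, confidence] for each box
class_probs = cell[B * 5:]             # class probabilities shared by the cell

# Class-specific confidence score = box confidence x class probability
scores = boxes[:, 4:5] * class_probs   # shape (B, C)
print(scores.shape)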
YOLO-V3 makes predictions at three different scales of the same image. The feature pyramid network (FPN) uses a top-down architecture
with skip connections to combine low-level and high-level features, which helps to
improve the accuracy of object detection. In addition, YOLO-V3 uses a different
backbone network than YOLO-V2, which further improves its performance.
YOLO-V4 builds upon the success of its predecessors and brings significant
improvements to object detection accuracy and speed. One of the most significant
improvements in YOLO-V4 is the use of the CSP (cross-stage partial) architecture
[116], which improves the model’s ability to learn features and makes it more
efficient. YOLO-V4 also uses several other techniques such as spatial pyramid
pooling, focal loss, and bag-of-freebies (BoF) to improve its performance.
In addition to these improvements, YOLO-V4 has also made significant strides
in terms of speed. YOLO-V4 can process up to 65 frames per second on a single
GPU, making it one of the fastest object detection systems available.
In conclusion, YOLO-V3 and YOLO-V4 are both significant improvements
over their predecessors in terms of object detection accuracy and speed. YOLO-
V4, in particular, has introduced several new techniques and architectures that
have pushed the boundaries of what is possible in real-time object detection. As
computer vision and deep learning continue to advance, we can expect even more
improvements to object detection systems like YOLO.
YOLO-V5
YOLO-V5 is the 5th version of this object detection system, released in 2020,
which has brought significant improvements over its predecessors YOLO-V3 and
YOLO-V4.
YOLO-V5 is built upon a more efficient architecture than YOLO-V4, which
reduces the model's complexity and improves its speed. This architecture uses
a single neural network with a large number of channels and smaller image sizes.
Additionally, YOLO-V5 uses a new backbone network architecture based on
CSPDarknet, which has been optimised for both accuracy and efficiency.
Compared to YOLO-V3 and YOLO-V4, YOLO-V5 is significantly faster and
more accurate. YOLO-V5 is capable of achieving state-of-the-art performance on
several benchmarks while maintaining real-time speed, even on low-power devices.
The model’s speed has been further improved through several techniques such as
model pruning, automatic mixed-precision training, and optimised CUDA code.
Another significant improvement in YOLO-V5 is the introduction of a novel
data augmentation technique, named mosaic data augmentation, which combines
multiple images into a single training image. This technique helps the model learn
to detect objects in complex scenes, where objects can appear partially or fully
occluded.
The input stage of YOLO-V5 uses the same Mosaic data augmentation method as
YOLO-V4, and the author of this method is also a member of the YOLO-V5
team. Random scaling, random cropping, and random arrangement are used for
splicing the images, and the effect is very good for detecting small targets.
In the YOLO algorithm, anchor boxes with initially set lengths and widths are
assigned for different datasets. During network training, the network outputs
predicted boxes based on the initial anchor boxes, compares them with the ground-truth
boxes, calculates the difference between the two, and then updates the network
parameters through back-propagation. The initial anchor boxes are therefore an
important part of the algorithm.
However, this has been improved in the YOLO-V5 code. The author of YOLO-V5
observed that, in practical use, the input images may have different aspect ratios.
Therefore, the letterbox function in datasets.py in YOLO-V5's code has been
modified to adaptively add the least amount of black border to the original image.
This simple improvement increased the inference speed by 37%, making it very
effective.
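The idea behind this adaptive letterboxing can be sketched as follows; this is not the actual datasets.py code, only an illustration of computing the minimal padding under assumed target-size and stride values:

def minimal_letterbox(img_h, img_w, target=640, stride=32):
    # Scale the longer side to the target size, then pad the shorter side only
    # up to the next multiple of the network stride, rather than to a full
    # square, so less black border is added.
    scale = target / max(img_h, img_w)
    new_h, new_w = round(img_h * scale), round(img_w * scale)
    pad_h = (stride - new_h % stride) % stride
    pad_w = (stride - new_w % stride) % stride
    return new_h + pad_h, new_w + pad_w

print(minimal_letterbox(720, 1280))   # (384, 640) instead of a full (640, 640)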
In the backbone of the network, YOLO-V5 uses the Focus structure, which is
not present in YOLO-V3 and YOLO-V4; the key to this structure is the slice
operation. For instance, a 4×4×3 image slice becomes a 2×2×12 feature map.
Taking the structure of YOLO-V5s as an example, the original 608×608×3 image
is input into the Focus structure, and the slicing operation is used to first create a
304×304×12 feature map. Then, after a convolution operation with 32 convolution
kernels, it finally becomes a feature map of 304×304×32 [117].
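The Focus slicing operation can be sketched in a few lines of NumPy (illustrative only); it reproduces the 608×608×3 to 304×304×12 transformation described above:

import numpy as np

def focus_slice(image):
    # Take every second pixel in both directions and stack the four
    # sub-images along the channel axis, so an H x W x C input becomes
    # H/2 x W/2 x 4C without discarding any pixel values.
    return np.concatenate([image[0::2, 0::2],
                           image[1::2, 0::2],
                           image[0::2, 1::2],
                           image[1::2, 1::2]], axis=-1)

image = np.random.rand(608, 608, 3)
print(focus_slice(image).shape)   # (304, 304, 12), as described in the text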
Only the backbone network in YOLO-V4 uses the CSP structure, whereas in
YOLO-V5, two CSP structures are designed. Taking the YOLO-V5s network as
an example, the CSP1_X structure is applied to the backbone network, and the
CSP2_X structure is applied to the Neck. In YOLO-V4's Neck structure, ordinary
convolution operations are used. However, in the Neck structure of YOLO-V5, the
CSP2 structure is adopted to enhance the network's feature fusion ability. The
CSP2 structure is designed by referring to CSPNet. In conclusion, YOLO-V5 is a
significant improvement over its predecessors, YOLO-V3 and
YOLO-V4, in terms of accuracy, speed, and efficiency. The model’s new archi-
tecture and innovative techniques have made it one of the best-performing object
detection systems in the field.
YOLOX
YOLO-V6, 7 and 8
Figure 3.21: ResNet, RepVGG training and RepVGG testing from [120]
In order to reduce latency on the hardware, the Rep structure is also introduced
in the feature fusion structure of the Neck, where Rep-PAN is used.
Rep-PAN is based on the combination of both PAN and RepBlock. The main idea
is to replace the CSP-Block in the PAN with the RepBlock.
Like YOLOX, YOLO-V6 also decouples the detection head, separating the
processes of bounding box regression and category classification. Coupling bounding
box regression and class classification can affect performance, as it not only slows
down convergence but also increases the complexity of the detection head. In the
decoupled head of YOLOX, two additional 3×3 convolutions are added, which
also increases the computational complexity to a certain extent. YOLO-V6 has
redesigned a more efficient decoupled head structure based on the Hybrid Channels
strategy. The latency is reduced without changing the accuracy, achieving
a trade-off between speed and accuracy.
YOLO-V7 [122] and YOLO-V5 are different versions of YOLO, with YOLO-
V7 being the newer version. In terms of computational efficiency and accuracy,
YOLO-V7 has improved compared to YOLO-V5. YOLO-V7 uses faster convolution
operations and smaller models, allowing it to achieve higher detection speeds
under the same computing resources. Additionally, YOLO-V7 provides higher
accuracy and can detect more fine-grained objects.
However, YOLO-V5 trains and infers much faster than YOLO-V7 and has
a lower memory footprint. This makes YOLO-V5 more advantageous in certain
application scenarios, such as in mobile devices or resource-constrained systems.
In general, both YOLO-V7 and YOLO-V5 have improved in performance and
accuracy: YOLO-V7 offers higher detection accuracy but requires more resources,
while YOLO-V5 is faster in training and inference but slightly lower in accuracy
than YOLO-V7. Therefore, when choosing which version to use, a trade-off needs
to be made based on the specific needs of the application scenario.
Ultralytics YOLO-V8 [123] is the latest version of the YOLO object detection
and image segmentation model developed by Ultralytics. YOLO-V8 is a cutting-edge,
state-of-the-art (SOTA) model that builds on the success of previous
YOLO versions and introduces new features and improvements to further boost
performance and flexibility. It can be trained on large datasets and can run on
various hardware platforms, from CPU to GPU.
A key feature of YOLO-V8 is its extensibility: it is designed as a framework
that supports all previous versions of YOLO, making it easy to switch between
different versions and compare their performance.
In addition to scalability, YOLO-V8 includes many other innovations that
make it an attractive choice for various object detection and image segmentation
tasks. These include a new backbone network, a new anchor-free detection head,
and new loss functions.
Overall, YOLO-V8 is a powerful and flexible tool for object detection and
image segmentation that offers the best of both worlds: state-of-the-art (SOTA)
technology and the ability to use and compare all previous YOLO versions.
• FP (False Positive): The sample’s true class is a negative example, but the
model predicts a positive example, which is incorrect;
• FN (False Negative): The sample’s true class is a positive example, but the
model predicts a negative example, resulting in an incorrect prediction;
• mAP (mean Average Precision): mAP can characterise the entire precision-
recall curve. The area under the precision-recall curve is mAP (In our ex-
periments we use the default threshold of 0.5).
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}     (3.16)

Recall = \frac{TP}{TP + FN}     (3.17)

Precision = \frac{TP}{TP + FP}     (3.18)
The three metrics above all fall between 0 and 1. The closer precision is to 1,
the greater the proportion of predicted positive samples that are truly positive.
The closer recall is to 1, the greater the proportion of ground-truth positive samples
that are detected. In general, the closer these values are to 1, the better the
detector performs.
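Equations (3.16)-(3.18) can be computed directly from the confusion-matrix counts, as in the following sketch (the counts shown are illustrative only):

def detection_metrics(tp, fp, fn, tn=0):
    # Accuracy, recall and precision from Equations (3.16)-(3.18).
    accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, recall, precision

# e.g. 80 correct detections, 10 false detections, 20 missed objects
print(detection_metrics(tp=80, fp=10, fn=20))   # (0.727..., 0.8, 0.888...)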
In determining the accuracy of an object detector, it is important to judge not
only whether an object has been correctly identified as being of a particular type,
but also how close the location of the identified object is to the ground truth.
Therefore, instead of using Accuracy, in this research we use [email protected] as the
measure of correctness of object detection.
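The localisation aspect of [email protected] relies on the Intersection over Union (IoU) between a predicted box and a ground-truth box; a minimal sketch of this calculation (illustrative boxes only) is given below:

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2). A detection counts as correct for
    # [email protected] when its IoU with a ground-truth box is at least 0.5.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))   # 0.333...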
In our experiments we compared the performance of the four object detection
models we obtained by training the four sub-versions of YOLO-V5, with models
based on other popular Deep Neural Networks, SSD (Single Shot Multi-box De-
tector) and Faster R-CNN (Faster Region-based Convolutional Neural Networks).
Based on the foundational machine learning and deep learning algorithms and
networks provided in this chapter, Chapters 4–6 present the original contributions
of the research presented in this thesis.
Chapter 4
In this chapter, we utilise a model based on one of the best Convolutional Neural
Networks (CNNs), YOLO-V5, to effectively detect Ghaf trees in images taken by
cameras onboard lightweight Unmanned Aerial Vehicles (UAVs), i.e., drones, in some
areas of the UAE. We utilise a dataset of over 3200 drone-captured images
partitioned into data subsets to be used for training (60%), validation (20%), and
testing (20%). Four sub-versions of the YOLO-V5 CNN architecture are trained using
the training data subset. The validation data subset is used to fine-tune the trained
models to realise the best Ghaf tree detection accuracy. The trained models are
finally evaluated on the reserved test data subset not utilised during training. The
object detection results of the Ghaf tree detection models obtained using the four
different sub-versions of YOLO-V5 are compared quantitatively and qualitatively.
Each Ghaf tree canopy in the training, validation and testing data subsets was
labelled with a bounding box and tagged as "Ghaf". It is specifically noted that some Ghaf trees can contain a number of
canopies that grow from the same root structure, while some have only one canopy.
It is often impossible to judge whether some adjacent canopies belong
to the same root structure (as sand covers or occludes most of the roots and
trunks) and hence form a single Ghaf tree. Therefore, in this chapter, rather than
attempting to detect a Ghaf tree, we attempt to detect Ghaf tree canopies. It
should therefore be noted that counting canopies, for example, will not allow us
to count the total number of Ghaf trees.
Table 4.1: Number of labelled Ghaf tree canopies in each data subset

Data subset     Number of labelled canopies
Training        3200
Validation      900
Testing         900
The labelled data of the training data subset is used to train the four sub-versions
of the YOLO-V5 CNN. The training is for a single object class, 'Ghaf Tree',
and hence a Ghaf tree is detected by differentiating it from its background. Similarly,
the tagged Ghaf trees from the validation data subset are used to fine-tune
the YOLO-V5 CNN training when determining the optimal parameters of the model.
Finally, the test data subset is used to evaluate the performance of the
trained model. The labelled Ghaf trees in the validation set are used during training
to optimise the network parameters, whilst the labelled Ghaf trees in the test
data subset are used as benchmark data to determine the accuracy of prediction.
When labelling data for training and validation, the rectangles that enclose Ghaf
trees may contain Ghaf trees of different sizes and may overlap with, or be obscured
by, other Ghaf trees or objects.
Moreover, they may have different backgrounds (i.e., sand, bushes/shrub undergrowth,
etc.). It is therefore important to capture rectangles of image pixels
with the above possible variations for training and testing, as this effectively tests
the generalisability of the trained CNN model for subsequent Ghaf tree detection
tasks.
The four YOLO-V5 based models were also compared with models based on Faster
R-CNN and SSD. Each model was trained with the same set of UAV images of
the training data subset, validated on the same validation data subset and tested
on the same test data subset. The performance of the models is compared
both quantitatively and qualitatively below.
The results tabulated in Table-4.2 also show that SSD requires a significant
amount of time for the convergence of training (i.e., completion of training)
and also takes a significant amount of extra time for testing. Its recall, precision and
[email protected] values also remain significantly lower than those of the YOLO-V5
models. Faster R-CNN took the lowest amount of time to complete training and
has very good testing speeds, second only to YOLO-V5s, the shallowest YOLO-V5
sub-version.
However, its accuracy, precision and [email protected] values were much lower than in the
case of the four YOLO-V5 models. Comparing the performance of the four YOLO-
V5 models, it is observed that as the complexity/depth of the architecture
increases, more time is taken for training, and generally the same trend exists
for testing, with YOLO-V5x taking significantly more time than sub-versions m
and l. In comparison to the other models, YOLO-V5x achieved the highest mean
average precision (81.1%) in Ghaf tree detection, as shown in Table-4.2. Figure 4.2
illustrates the Precision vs Recall graph for YOLO-V5x, indicating a [email protected]
value of 0.811.
Some of the test images contain only Ghaf trees, and others contain other types of
trees or plants. Yellow circles in the images show the missed targets and red crosses
mark wrong detections.
Figure 4.3: The visual performance comparison of Ghaf tree detector models de-
rived from DNN architectures, (a) SSD, (b) Faster R-CNN, (c) YOLO-V5s, (d)
YOLO-V5m, (e) YOLO-V5l, and (f) YOLO-V5x
Figure 4.3 illustrates the performance of the six models on a desert area only
containing Ghaf trees. Unfortunately, the SSD based model did not pick up any of the
Ghaf trees and the Faster R-CNN based model failed to detect a number of Ghaf
trees. The performance of the four sub-versions of YOLO-V5 was very much
comparable. It is noted that in this image the Ghaf trees have been captured at a
high resolution with clear views, with no other types of objects in the background.
YOLO-V5s
YOLO-V5m
Figure 4.4: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x
YOLO-V5l
YOLO-V5x
Figure 4.4: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)
Figure 4.4 illustrates the performance of the models created by the four sub-versions
of YOLO-V5 on a drone-captured image taken from a higher altitude (hence trees
appear smaller) and in an area where there are other trees and objects. The
yellow circles denote missed Ghaf trees. The model based on YOLO-V5x outperforms
the models based on the other YOLO-V5 sub-versions. YOLO-V5s misses some
sparse canopies, and YOLO-V5s, m and l miss Ghaf trees located at the boundary
of the image.
YOLO-V5s
YOLO-V5m
Figure 4.5: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x
YOLO-V5l
YOLO-V5x
Figure 4.5: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)
Further inspecting the images included in Figure 4.5, it can be observed that the
models generated by training the four YOLO-V5 sub-versions perform very well
most of the time, and their accuracy gaps only exist in some details. YOLO-V5x
performs marginally better when detecting small, overlapping and closely spaced
canopies.
YOLO-V5s
YOLO-V5m
Figure 4.6: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x
YOLO-V5l
YOLO-V5x
Figure 4.6: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)
Figure 4.6 illustrates the use of the models created by the four sub-versions
of YOLO-V5 in detecting Ghaf trees in a more complex area that includes other
trees. This image contains Ghaf trees with a wider size variation. Comparing with
the labelled ground-truth image, the model generated by YOLO-V5s demonstrates
better performance than the models created by YOLO-V5m, l and x. When the
scene becomes complex, significantly more data is needed to train a deeper neural
network. Thus, with more high-quality data for training YOLO-V5x, the results
could still be improved.
YOLO-V5s
YOLO-V5m
Figure 4.7: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x
YOLO-V5l
YOLO-V5x
Figure 4.7: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)
Figure 4.7 illustrates the performance of the four models on another test image
in which other trees and plants of different sizes exist in a complex area. In this
specific case the model created by YOLO-V5l performs better than the other versions,
detecting every Ghaf tree without mistakes. Once again, the slightly less accurate
detection capability of YOLO-V5x can be attributed to the lack of substantial
quantities of training data. The depth of the model and the amount of data affect
the actual detection performance, and in certain situations a particular model may
perform better.
4.4 Conclusion
In this chapter we investigated the use of Convolutional Neural Networks in detecting
Ghaf trees in videos captured by a drone flying at different altitudes and in different
environments that contain Ghaf trees. To the best of the authors' knowledge,
this is the first attempt at using Convolutional Neural Networks to automatically
detect Ghaf trees, a task that poses a significant challenge to traditional machine
learning approaches. Despite the relatively small number of images utilised for
training the DNNs in this work, the high [email protected] value of 81.1% obtained by
the YOLO-V5x based model in detecting Ghaf trees in approximately 78 ms is a
promising step towards achieving real-time detection using aerial imagery. The
training time for model generation was high, approximately 10 hours, and this was
mainly due to the hardware limitations of the computer utilised. The training time
could be considerably reduced if faster computer hardware were utilised. Models
trained based on all other sub-versions of YOLO-V5 resulted in [email protected] values
above 77.4%, whilst other popular DNNs such as SSD and Faster R-CNN performed
less efficiently. Rigorous visual inspection of Ghaf tree detections obtained using all
four sub-versions of YOLO-V5 revealed that YOLO-V5x particularly outperforms
the other YOLO-V5 networks at detecting Ghaf trees in scenarios where images are
blurred, trees overlap or are obstructed, backgrounds differ, and where there is a
significant size variation of Ghaf trees. Additional test results can be found in
Appendix B.
This work utilised just over 5000 Ghaf trees, with 3200 of them used during
training. If this number can be expanded to 10,000 or more, the detection
performance will be further improved, as the models would then be able to better
generalise to new unseen data and accurately identify Ghaf trees. Dataset
limitations notwithstanding, the results obtained in this work provide a promising
basis for real-time detection of the Ghaf tree using aerial surveillance, thus aiding
the efforts to preserve this endangered national tree of the UAE. The models can
also be used to design change detection software to identify damage to Ghaf trees.
Chapter 5
The dataset is separated into three data subsets, for training (60%), validation (20%)
and testing (20%). Since some images included a large number of trees and some
contained a relatively small number of trees, the total numbers of images
used for training, validation and testing were 619, 80 and 101, respectively.
In the data preparation phase, each Ghaf, Palm and Acacia tree within the
training, validation and testing data subsets was labelled with a bounding box,
using the labelImg software, and labelled as "Ghaf", "Palm" or "Acacia". It
is noted that the images contained other types of trees and shrubs that did not
belong to these three groups of trees and hence would likely create false positives.
Moreover, since a single Ghaf tree can contain more than one canopy (growing from
one root/trunk structure), for the purpose of this research we define a Ghaf tree
as a single Ghaf tree canopy, which we label and aim to detect via the object
detection and classification models we develop.
Tables 5.1-5.3 tabulate the total number of images used for labelling each type
of tree and the total number of trees of each type present in them, for the training,
validation and testing image subsets.
The drone images were randomly picked from a large set of images captured at the
same resolution, but with the drone flying at different altitudes and at different
times of the day (sometimes resulting in illumination variations, shadows, etc.).
The density/sparsity of trees varied between images. Palm trees co-existed with
Ghaf trees, but most Acacia trees existed in isolation. Trees overlapped with, and
were occluded by, similar or different trees and other objects. The underlying
background below trees varied, mostly consisting of sand but occasionally consisting
of other trees, shrubs or man-made structures. In drawing rectangles around
samples of trees, it was essential that the captured rectangles attempted to tightly
fit the tree canopies, but in some cases they included other objects, trees, crops or
sand within the rectangle. When a tree was partially occluded by other trees, we
estimated the tree's occluded canopy area and drew the rectangle to include the
potentially covered area of the canopy. We also attempted to capture many trees
that are clearly isolated from other trees and objects and that show their shape,
boundary, texture and colour with high clarity. Further, when labelling trees that
have a shadow, the shadow was excluded from the rectangle. All the above strict
criteria were adopted in labelling to give the DNN the opportunity to learn to
identify and recognise the three types of trees under variations of illumination,
occlusion, size, clarity, etc.
Data subset     Number of images     Number of labelled trees
Training        34                   3600
Validation      8                    1200
Testing         9                    1200
If the intention is to only generate one binary confusion matrix, then the
macro-average can be used for performance evaluation. Macro-averaging first
calculates the metric value (e.g. precision or recall) of each class and then takes
the arithmetic mean over all classes.
Macro-R = \frac{1}{n} \sum_{i=1}^{n} R_i     (5.1b)

Micro-R = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} TP_i + \sum_{i=1}^{n} FN_i}     (5.2b)
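A minimal sketch of the difference between Equations (5.1b) and (5.2b) is given below; the per-class counts are illustrative only and do not correspond to the results reported in this chapter:

import numpy as np

def macro_micro_recall(tp, fn):
    # tp[i] and fn[i] are the true-positive and false-negative counts of class i.
    tp, fn = np.asarray(tp, float), np.asarray(fn, float)
    per_class = tp / (tp + fn)                  # recall of each class
    macro = per_class.mean()                    # Equation (5.1b)
    micro = tp.sum() / (tp.sum() + fn.sum())    # Equation (5.2b)
    return macro, micro

# e.g. three tree classes with hypothetical counts
print(macro_micro_recall(tp=[300, 350, 40], fn=[50, 30, 20]))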
The YOLO-V5s based model has about 2.2% lower performance in [email protected] value;
for object detectors deployed on powerful computing devices (desktops, cloud services,
off-line processing), YOLO-V5l or x are recommended, due to their superior
[email protected] values.
Table 5.5: Ghaf Tree Detection Performance comparison of YOLO-V5 based ob-
ject detection models
Table 5.6: Palm Tree Detection Performance comparison of YOLO-V5 based ob-
ject detection models
The results show that the YOLO-V5 based models can effectively detect and classify
multiple trees in the benchmark dataset. These results suggest that YOLO-V5x may
be the best choice for researchers or practitioners who are interested in using object
detection models for tree-related applications. According to our analysis, the reason
why YOLO-V5x performs best is that the dataset of this project is large, there are
many types of objects to be detected, and the features of the Ghaf tree and Acacia
tree are similar; deeper networks can therefore extract more features, which is
conducive to object detection. Recall and precision trade off against each other: if
we want higher recall, the model's predictions need to cover more samples, which
makes the model more likely to make mistakes and lowers precision; if the model is
very conservative and only detects samples it is certain about, its precision will be
high, but recall will be relatively low. Here, the precision of YOLO-V5x is significantly
higher than that of the other models, further indicating that its detection accuracy
is higher than that of the other models; the reason, again, is that it is a deeper model
that performs better in complex situations. The tables in this section present the
testing results of YOLO-V5s, m, l, and x on the benchmark dataset. We evaluated
the models based on their precision, recall, and mean average precision (mAP)
values, which are commonly used metrics for object detection tasks. The tables
included in this chapter compare the performance of the multiple tree detection and
classification models on a single kind of tree.
Acacia trees grow in the wild or in nature reserves where they have been purposely
grown for conservation purposes. Palm trees are often present in areas inhabited by
humans, as they are normally grown for a purpose (shade, consumption, etc.).
Ghaf trees are grown for consumption by animals and hence grow in areas reachable
by humans and animals. Given these observations, it was difficult for us to find any
single image that contained Acacia trees together with Palm and/or Ghaf trees.
There were a few images in which Ghaf trees co-existed with Palm trees. Therefore,
for the purpose of testing the subjective effectiveness of tree type classification, we
created a mosaic image formed from three sub-images, each consisting mostly of one
type of tree (see the test image in Figure 5.2). The yellow circles in the images show
the missed targets and the red crosses mark wrong detections.
Figure 5.1: The visual performance comparison of Multiple tree detector models
derived from DNN architectures, (a) SSD, (b) Faster R-CNN, (c) YOLO-V5s, (d)
YOLO-V5m, (e) YOLO-V5l, and (f) YOLO-V5x
The results illustrated in Figure 5.1 show that all four YOLO-V5 sub-versions
can detect the Ghaf, Date Palm and Acacia trees with high precision (above
81.3% [email protected], as tabulated in Table-5.4). SSD failed to detect any type of
tree, whereas Faster R-CNN was not able to detect the Palm trees that appear very
small in the mosaic image. Testing on many different mosaic images we created,
with random combinations/orientations of images predominantly consisting of the
three different tree types, produced similar subjective observations. Therefore, in
the subjective analysis that follows, we exclude SSD and Faster R-CNN from the
subjective performance comparison.
YOLO-V5s
YOLO-V5m
Figure 5.2: The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x
YOLO-V5l
YOLO-V5x
Figure 5.2: The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)
Figure 5.2 consists of an image that contains only Ghaf trees. We observe
that the trees are of different sizes, that the image was captured at a time at which
significant shadows exist, that the density of trees varies largely within the test image
and that, in many cases, trees overlap and occlude one another. Given the strict
labelling procedure described in Section 5.2.1, it is observed in Figure 5.2 that we
have been able to achieve a remarkable level of accuracy in detecting and recognising
the Ghaf trees, particularly when using YOLO-V5x and l. Most Ghaf trees have
been detected, all with a confidence value of over 0.27, and no tree has been
misclassified. A remarkable result is also the trained models' ability to avoid
detecting shadows as Ghaf trees and/or double counting Ghaf trees due to their
shadows. We have also used non-maximum suppression to avoid multiple rectangles
being drawn around a single tree (a sketch of this procedure is given below). Due to
this, we should be able to count the number of Ghaf trees and find their GPS
locations, if the original images include longitude and latitude information. The
results indicate that, given the amount of data used in training, the deeper the
model, the better it has trained and the more accurately it performs.
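A minimal sketch of the greedy non-maximum suppression idea referred to above is given below; it is an illustration of the general technique, not the exact implementation used in YOLO-V5:

import numpy as np

def box_iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(boxes, scores, iou_thresh=0.5):
    # Keep the highest-scoring box, drop every remaining box that overlaps it
    # by more than iou_thresh, and repeat on the boxes that are left.
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if box_iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # [0, 2] - the two overlapping boxes collapse to one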
YOLO-V5s
YOLO-V5m
Figure 5.3: The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x
YOLO-V5l
YOLO-V5x
Figure 5.3: The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)
YOLO-V5s
YOLO-V5m
Figure 5.4: The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x
YOLO-V5l
YOLO-V5x
Figure 5.4: The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)
As indicated in Table-5.3, we only had 500 samples of Acacia trees for training,
as against 3600 and 3200 samples of Palm and Ghaf trees, respectively. Given this
significant class imbalance when training a multi-class classifier, a significant
reduction in the accuracy of Acacia tree detection would have been expected.
However, in Figure 5.4, our image samples that contained Acacia trees did not have
any trees of the other types, and as a result the Acacia tree detection performance
of all four YOLO-V5 models illustrates a very good level of accuracy. Furthermore,
all Acacia trees have a good level of contrast with the sandy background at the
automatic exposure setting of the drone camera with which the image was captured.
To further investigate the accuracy of Ghaf tree detection, Figure 5.5 shows an
image with a very different background, which is relatively rare in the training
set. In this image, there are Ghaf trees of varying sizes, making the detection of
some of them more difficult. No Acacia trees are present in this image. A larger
number of training samples, extracted from images with different exposure settings,
taken at different times of the day and at different drone flight altitudes, as well as
samples of Ghaf trees from environments where they overlap and/or are occluded
by other trees, would significantly improve the accuracy of Ghaf tree detection.
Despite the modest number of images used in this study, the excellent mean average
precision values (over 81.3%) obtained for the detection of Ghaf, Palm and Acacia
trees from drone images using the four sub-versions of YOLO-V5 are a promising
step toward real-time detection utilising aerial data. The model training time
could have been shortened if a more powerful computer had been used for training.
Rather than creating separate DNN models for the detection of the three different
tree types commonly present in the Arabian regions, we have shown that a single
multi-class model can be equally effective, provided the training data between
classes is balanced and a modest set of training data is used for each type of tree.
The number of samples needed for each type of tree to be detected at the same
level of accuracy depends on the complexities and variations present in the test
images, the uniqueness of each tree type relative to the other trees and vegetation
present in the images, the drone's flying altitude, the image resolution, etc.
5.4 Conclusion
In this chapter, we have proposed YOLO-V5 based multi-class object detection
models to detect three types of trees widely present in Arabian countries: Ghaf,
Palm and Acacia trees. We have successfully demonstrated the use of object
detection models created from four sub-versions of YOLO-V5 (s, m, l and x), having
different computational and network complexities, in achieving mean average
precision values of over 81.3%. The model created by YOLO-V5x demonstrated
the best multiple tree detection performance, with an 83.5% [email protected] value.
We have used specific approaches to data labelling, training and network
optimisation to create models that are capable of detecting the three types of
trees in the presence of occlusion, illumination variations, changes in size, different
levels of contrast with the background, shadows, overlaps, shape variations,
etc. A total of 800 drone-captured images (619 of them for training) were used in
the training, validation and testing of the models, with over 11,500 trees of three
different types used during training. We expect that if this number is increased to
over 30,000, the detection performance will improve significantly, as the models will
be able to generalise better to previously unseen data. Despite the limitations of the
dataset used in training, we have optimised training via effective data labelling
and training data selection, and we share the knowledge gathered for the benefit of
the wider research community. The detailed findings of this research should pave
the way for real-time detection of multiple trees via aerial surveillance, assisting in
the preservation of the UAE's desert and environment. We note that the model
created by YOLO-V5s is capable of being deployed on board a drone or within
a mobile device to enable real-time applications, despite its [email protected] value
being approximately 2.2% lower. We provide quantitative and subjective experimental
results to evaluate the performance of the four DNN models proposed. Additional
test results can be found in Appendix B.
In chapter 6, we propose the use of Deep Neural Network based object detector
models for litter detection in remote desert areas and in more suburban campsites.
Chapter 6
In this chapter we investigate the use of the most popular Deep Neural Network
(DNN) architectures to create several novel litter detection models. We investigate
the use of Faster R-CNN (Faster Region-based CNN), SSD (Single Shot Detector)
and YOLO (You Only Look Once) architectures. With regard to YOLO,
we investigate the use of version 5 (s, m, l and x sub-versions) and version 7 (l
sub-version) architectures. Two types of object detection models are developed:
single-class (litter only) models and two-class (litter, and man-made objects that
are not litter) models. Approximately 5000 samples of litter objects and 2100
man-made objects not identified as litter were used for training. 3200 litter objects
of various types and 1400 man-made, non-litter objects were used for validation
and testing, respectively. We rigorously compare the performance of the different
models in litter detection and localisation in drone images captured at different
altitudes, under different environmental conditions. Both objective and subjective
approaches are used for the performance analysis.
The litter detection models are applied both in remote desert areas and in more
suburban campsite areas (where the background contains some man-made objects
such as camp sites). For clarity of presentation, this chapter is divided into five
sub-sections. Section 6.1 provided an introduction to the application and research
context. Section 6.2 provides the research methodology adopted, the experimental
design details of the two approaches adopted for litter detection, and the
corresponding details of dataset preparation, data labelling, training and the
approach adopted for testing the performance of the Deep Neural Network models.
Section 6.3 presents details of the litter detection results and a comprehensive
analysis of the performance of the trained models under the two adopted approaches
to litter detection. Finally, Section 6.5 concludes, with an insight into future work
and suggestions for improvements of the established DNN models.
For the single-class litter detection approach, the selected images contained few
man-made objects in the background and mostly consisted of natural habitat. The
selected image dataset was then divided into three data subsets for training (60%),
validation (20%), and testing (20%). In the dataset used, as some of the images
have a significantly greater number of litter objects than others, the numbers of
images used in the training, validation, and test data subsets differed and were
recorded as 512, 236, and 165, respectively.
For the two-class litter detection approach, a further 255 images containing
more than 3500 human-made objects were randomly selected from images taken
during drone flights at different altitudes over sub-urban camp sites of the DDCR.
These images contained many human-made objects of different sizes and shapes,
whilst also including litter objects. The selected image dataset was then divided into
three data subsets for training (60%), validation (20%), and testing (20%). In the
dataset used, as some of the images have a significantly greater number of human-
made objects than others, the numbers of images used in the training, validation,
and test data subsets differed and were recorded as 171, 53, and 31, respectively.
Table 6.1: Number of labelled litter and human-made items in each data subset
(columns: data subset, number of images with human-made items, number of images
with litter, number of labelled litter objects, number of labelled human-made items)
All drone imagery was collected by the authors within the nature reserve areas
of the Dubai Desert Conservation Reserve (DDCR), with DJI Phantom-4 and DJI
Mavic 2 Pro drones flying at different heights/altitudes.
The drone-captured image set included images captured in different areas of
the DDCR demarcated land (nature and camp sites), at different altitudes and
camera angles, and at different times of the day (during daytime). The density/sparsity
of litter present within an image varied between images. The nature-site
images only contained natural objects and litter (i.e., they rarely contained
a man-made, non-litter object). The camp-site images contained litter as well
as man-made objects/structures, which are difficult to differentiate from litter
without taking the image context into account. Sets of images were randomly
picked from the drone-captured images as the training, validation and testing sets
for the design of the two litter detection approaches, single-class and two-class
(see Table-6.1).
All experiments were performed on a computer system comprising an Intel
Core i7-6850k CPU, an NVIDIA GeForce GTX-1080ti GPU, and 32 GB of RAM.
The computer was running the Windows 10 operating system.
We use the design of the proposed single-class object detector approach to
litter detection as a means of comparing the capabilities of popular Deep Neural
Network architectures in object detection, and use the two-class object detector
approach to litter detection as a means of rigorously comparing the performance of
only the best DNN architectures.
Single-Class Object Detector - The labelled information (i.e., objects of
class 'litter') of the training data subset is used to train SSD, Faster R-CNN,
and the four sub-versions of YOLO-V5. Additionally, the labelled litter samples
from the validation data subset are used to fine-tune the CNN architectures and
optimise their performance during training, by deciding on the optimal values of
the hyperparameters of the network.
Two-Class Object Detector - The labelled information (i.e., objects of
classes 'litter' and 'human-made') of the training data subset is used to train the
YOLO-V5-Large and YOLO-V7 (Large) CNN architectures. Additionally, the
labelled samples from the validation data subset are used to fine-tune the CNN
architectures and optimise their performance during training, by deciding on the
optimal values of the hyperparameters of the network.
The results show that the SSD based model's recall, precision and [email protected]
values remain substantially lower than those of the four YOLO-V5 models. Faster
R-CNN, on the other hand, takes the longest time for the convergence of training
but has relatively good testing speeds/deployment costs as compared to the
YOLO-V5 based models. However, the accuracy, precision, and [email protected] values
of the Faster R-CNN model are inferior to those of the four YOLO-V5 models.
Comparing the performance of the four YOLO-V5 models, we observed that as the
complexity/depth of the architecture increases, more time is required for training
and testing, with YOLO-V5l and x taking significantly more time than sub-versions
s and m. YOLO-V5l achieved the highest mean average precision (71.5%) in litter
detection as compared to the other models, as presented in Table-6.2. However, its
precision value is marginally lower than in the case of YOLO-V5x. The YOLO-V5l
based model, however, has a significantly lower detection time as compared to the
YOLO-V5x based model. Given the above observations, and considering the
objective performance metrics, we recommend the use of YOLO-V5l for litter
detection.
Table 6.2: Performance comparison of DNN based litter detection models
(columns: model, training time in hours, precision, recall, [email protected], average
detection time in milliseconds)
According to Table-6.2, all four YOLO-V5 versions can identify every sort of
litter present in the images, with precision values above 76%. The excellent
precision and very good recall values of all four YOLO-V5 models are also consistent
with their successful subjective performance, as demonstrated by the results
illustrated in Figure 6.4. In contrast, SSD and Faster R-CNN exhibit poor
performance; both miss the detection of many litter objects, as illustrated in
Figure 6.1.
It is noted that a similar relative performance was observed in the visual
comparison of the said models across all of the test images used. Given this
observation and the results tabulated in Table-6.2, we do not consider SSD and
Faster R-CNN in the further performance comparisons. The yellow circles in the
images show the missed targets and the red crosses mark wrong detections.
Figure 6.1: The results of litter detection in drone imagery using the SSD, Faster
R-CNN, YOLO-V5s, YOLO-V5m, YOLO-V5l and YOLO-V5x based models
YOLO-V5s
YOLO-V5m
Figure 6.2: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x
YOLO-V5l
YOLO-V5x
Figure 6.2: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)
In Figure 6.2, after testing using the four versions of YOLO-V5, it was observed
that all four models were able to detect the majority of the litter. However,
they all missed some of the smaller pieces of litter. The YOLO-V5l model performed
the best of the four in terms of detecting the smaller litter items. Overall,
YOLO-V5l is an effective tool for litter detection, but further improvements are
needed to accurately detect all types and sizes of litter.
YOLO-V5s
YOLO-V5m
Figure 6.3: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x
YOLO-V5l
YOLO-V5x
Figure 6.3: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)
YOLO-V5s
YOLO-V5m
Figure 6.4: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x
YOLO-V5l
YOLO-V5x
Figure 6.4: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)
The visual comparison results above illustrate that each of the four sub-versions
of YOLO-V5 performs well in litter detection. When the flight height of the UAV
is relatively low, the detection results of all four models are excellent (e.g., see
Figure 6.4). It is generally observed that litter objects are detected regardless of
their sub-type (e.g., bottles, paper, bags, etc.), with objects common in the training
data (e.g. bottles) and objects well defined in terms of shape being detected very
accurately, even at higher drone flight altitudes. The visual performance comparison
shows a bias towards objects of certain colours, such as blue/turquoise litter. The
reason is that the training data is not balanced and contains more samples of these
colours; in particular, it contains a large number of blue coloured bottles. This
project is focused on the optimised detection of litter typically present in the Dubai
desert areas. The types and distribution of litter in the Dubai desert are similar to
those of the training data, and hence this specific challenge is minimised.
YOLO-V5s
YOLO-V5m
Figure 6.5: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x
YOLO-V5l
YOLO-V5x
Figure 6.5: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)
Based on the results in Figure 6.5, all models demonstrated good performance in
the litter detection tests using the four versions of YOLO-V5. The mAP values from
the test data indicated that YOLO-V5l performed better than YOLO-V5m overall;
however, in this specific test image, YOLO-V5m performed slightly better. This
could be attributed to the particular characteristics of the test image; additionally,
in less complex situations, a shallower model may have an advantage. Overall, the
data suggest that YOLO-V5l is the best-performing model for litter detection,
with YOLO-V5x demonstrating comparable performance. In this particular test
image, YOLO-V5l detected more litter than YOLO-V5x. It can therefore be
concluded that YOLO-V5l is the most effective model for litter detection, while
YOLO-V5x is also a good alternative. It is important to note that individual
test images may have unique characteristics, and a broader analysis is necessary
to evaluate the overall model performance.
YOLO-V5s
YOLO-V5m
Figure 6.6: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x
YOLO-V5l
YOLO-V5x
Figure 6.6: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)
In Figure 6.6, based on the results of litter detection tests using the four versions of YOLO-V5, all four models were able to detect most of the litter but missed some of the smaller items. The test image in this case was captured from a UAV at a height of 60 meters, higher than the previous test images, so the litter in the image is extremely small and difficult even for the human eye to identify. As a result, there are more missed instances of litter in this image compared to the previous ones. Comparing YOLO-V5l and YOLO-V5x, both models achieved similar litter detection accuracy, but YOLO-V5l has advantages in training and testing speed as well as a smaller model size. Therefore, YOLO-V5l is a more practical and efficient choice for litter detection tasks in real-world applications. However, further research and optimisation can still be done to improve the detection of small litter items by YOLO-V5l.
Figure 6.7 illustrates an example image containing several very small objects of litter, and the result of applying the YOLO-V5x based litter detection model to it. This image was captured at 30 meters. A few very small objects of litter at the top right of the image are not detected, as these objects are extremely small. The YOLO-V5x model performs best in detecting the smallest objects of litter, and our detailed analysis indicated that by adding a larger number of very small litter objects when training this model, the detection accuracy can be further improved. We will keep adding new, balanced data in future work. However, when considering the performance of the shallower YOLO-V5 models, such as 's', this is not the case, as it could lead to false positives, i.e. very small non-litter objects being detected as litter. The depth of the model should be sufficient to carry out a feature analysis detailed enough for the trained model to differentiate very small litter objects from very small non-litter objects. It was also noted in Figure 6.8 (and in a few other example images we tested) that YOLO-V5l performed marginally better than YOLO-V5x. This is likely because more data is needed for the deeper network, i.e. YOLO-V5x, to perform more accurately.
Figure 6.7: Detecting very small objects of litter using YOLO-V5x based model
According to the review of literature we conducted, this is the first attempt reported in the literature to identify litter in drone-captured footage using the latest advancements in Deep Neural Network architectures. We have shown that single-class litter detection models based on the YOLO-V5 sub-versions result in mean average precision values above 71.5% and precision values over 76%. This is a promising step toward real-time detection of litter using aerial image data, despite the small number of training images employed in training the DNN architectures in the proposed research. Further investigation of the resulting models' training and object detection times (see Table 6.3) indicated that the higher training and detection times of the models generated from the more complex and deeper networks (i.e., the YOLO-V5 x and l sub-versions) were mainly due to the limitations of the processing power of the computational hardware used in this research. The training and testing times could be considerably reduced by using faster computer hardware.
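For reference, precision and recall figures of the kind quoted above are obtained by matching predicted boxes to ground-truth boxes at an IoU threshold of 0.5. Below is a minimal, illustrative sketch of that matching step; the box format (x1, y1, x2, y2 with a confidence score) and the toy detections are assumptions for demonstration, not the evaluation code used in this study.

def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall_at_50(predictions, ground_truth):
    """Greedily match predictions (highest confidence first) to ground truth at IoU >= 0.5."""
    preds = sorted(predictions, key=lambda p: p["conf"], reverse=True)
    matched, tp = set(), 0
    for p in preds:
        best_j, best_iou = -1, 0.5  # require IoU of at least 0.5 for a true positive
        for j, g in enumerate(ground_truth):
            if j in matched:
                continue
            v = iou(p["box"], g)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp
    fn = len(ground_truth) - tp
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)

# Hypothetical detections for one image ('litter' class only, boxes in pixels)
preds = [{"box": (10, 10, 50, 50), "conf": 0.91}, {"box": (200, 120, 240, 160), "conf": 0.48}]
gts = [(12, 8, 52, 48), (300, 300, 330, 330)]
print(precision_recall_at_50(preds, gts))  # -> (0.5, 0.5) for this toy example

mAP extends this idea by averaging precision over the full range of recall values as the confidence threshold is varied.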
In the experiments conducted we only investigated a single-class litter detector, in which all objects of litter, regardless of whether they are bottles, cans, paper, boxes, or any other common items typically left behind after human consumption, were labelled as a single type of object, 'litter'. Our detailed investigations revealed that it is still important to maintain a sub-class balance, i.e., a similar number of different types of litter objects used in training, despite all litter types being classed as one type, so that testing accuracy remains similar across all sub-types. For example, in our training data the least represented sub-type was 'drink cans', and such objects had the highest chance of being missed or misclassified. Therefore, within the training process we attempted to balance the amount of different sub-types of litter, as the available sample data on some sub-types (such as cans) was relatively scarce. Our investigations revealed that the models' performance would be better if sufficient data were collected for all sub-types of litter, e.g., 2000 samples of each sub-type.
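One simple way to mitigate such sub-type imbalance, given that the detector itself only sees a single 'litter' class, is to oversample images containing the rare sub-types when building the training list. The sketch below illustrates this idea; it assumes a hypothetical metadata file (subtypes.csv with columns image,subtype) recording the dominant litter sub-type per image, which is not part of the published dataset.

import csv
import random
from collections import defaultdict

random.seed(0)

# Hypothetical metadata: one row per training image with its dominant litter sub-type
with open("subtypes.csv", newline="") as f:
    rows = list(csv.DictReader(f))  # columns: image, subtype

by_subtype = defaultdict(list)
for r in rows:
    by_subtype[r["subtype"]].append(r["image"])

print({s: len(imgs) for s, imgs in by_subtype.items()})  # original counts per sub-type

# Oversample every sub-type up to the size of the most frequent one
target = max(len(imgs) for imgs in by_subtype.values())
train_list = []
for subtype, images in by_subtype.items():
    train_list += images * (target // len(images))
    train_list += random.sample(images, target % len(images))

random.shuffle(train_list)
print(len(train_list), "image entries after oversampling")

Duplicating rare-sub-type images in this way is only a stop-gap; collecting genuinely new samples, as noted above, remains the preferred remedy.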
The reason we did not compare the litter detector developed in this project with some existing work is that the test sets used in those studies were captured using ground-based cameras. In contrast, our work is more forward-looking and challenging, given the widespread use of drones and the small size of the litter in the images. There are also individual projects that use drones for testing, but they detect large groups of litter; in contrast, our project offers the potential for litter classification.
As a recommendation for the future improvement of model performance, in particular that of the models created from the deeper DNN architectures such as the YOLO-V5 x and l sub-versions, it is recommended that more UAV data be captured and used in training.
Figure 6.8: Test results of single-class litter detection models in desert campsites
Figure 6.9: Test results of single-class litter detection models in desert campsites
Figures 6.8 and 6.9 show the results of testing the single-class litter detector at a campsite. We can see that many artificial items that are not litter are identified as litter, because litter and other artificial objects share similar features. Whether a human-made item is litter is ultimately a subjective question that only a human can determine, and this poses some challenges for our research.
This section presents a comparison of the efficiency of the YOLO-V5 and YOLO-V7 models in detecting litter and human-made items. All models were trained, validated, and tested on the same subsets of UAV images. Their performance was evaluated using both quantitative and qualitative measures, discussed below; the comparison is conducted in two ways, a quantitative performance comparison and a visual performance comparison.
Network | Precision of human-made items | Recall of human-made items | mAP of human-made items
Network | Precision of 2 classes | Recall of 2 classes | mAP for 2 classes | Model size
In this study, we trained YOLO-V5l and YOLO-V7l to detect litter and human-made objects that are not litter. The performance of the resulting trained models was compared in terms of precision, recall, and mean average precision (mAP). YOLO-V5l exhibited high precision in detecting litter but had low recall on human-made items. Conversely, YOLO-V7l had very low recall in detecting litter but performed better in detecting human-made items. The results show that YOLO-V5 performs better when detecting litter, while YOLO-V7 performs better when detecting human-made items. Compared to the YOLO-V5l model for single-class litter detection (which achieved an mAP of 71.5%), the two-class detection model achieved a slightly lower mAP of 70.6% for litter. This can be attributed to the high similarity between the two types of target objects, which can cause some interference in the results. Nonetheless, the performance of the two-class model is still considered satisfactory, and it demonstrates the potential of the YOLO-V5 model in complex object detection tasks. Further research could explore ways to improve the model's ability to distinguish between closely related object classes. The mAP of YOLO-V7l was slightly higher than that of YOLO-V5l; however, YOLO-V5l has a smaller model size. In practical applications, a model with a smaller size is often more advantageous, because hardware limitations, such as the memory available on the drone, must be considered. A smaller model not only requires less memory to store but can also be processed more quickly, which is essential for real-time applications. Therefore, depending on the specific detection requirements, either YOLO-V5l or YOLO-V7l can be chosen based on the quantitative performance of the two models. The following visual performance comparison provides a more intuitive comparison of the two models.
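As an illustration of how the two-class setup differs from the single-class one only in its dataset definition, the sketch below writes a YOLO-V5 style dataset configuration for the 'litter' and 'human-made' classes and notes the corresponding training invocation in a comment. The file paths, class order and training hyperparameters are assumptions for demonstration; they are not the exact settings used in this study.

# Minimal sketch: build a two-class dataset config for YOLO-V5 style training
import yaml

two_class_config = {
    "path": "datasets/litter_two_class",   # assumed dataset root
    "train": "images/train",
    "val": "images/val",
    "test": "images/test",
    "nc": 2,                               # number of classes
    "names": ["litter", "human-made"],     # assumed class order (index 0 = litter)
}

with open("litter_two_class.yaml", "w") as f:
    yaml.safe_dump(two_class_config, f, sort_keys=False)

# Training would then be launched with the YOLO-V5 repository's train.py, e.g.:
#   python train.py --img 640 --batch 16 --epochs 300 \
#       --data litter_two_class.yaml --weights yolov5l.pt
# (image size, batch size and epoch count here are illustrative, not the thesis settings)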
In Figures 6.10 and 6.11, we can see that YOLO-V5 accurately detects litter items of different sizes, shapes, and colours, while YOLO-V7 has difficulty identifying smaller litter items, resulting in a lower recall rate. This is consistent with the quantitative performance comparison results. Additionally, in Figure 6.10, YOLO-V5 shows better performance in detecting litter items that are partially occluded or located in complex backgrounds, which indicates that YOLO-V5 can effectively extract and analyse features from these images. These results demonstrate the superior performance of YOLO-V5 in litter detection.
Figure 6.10: Testing results of YOLO-V5l based two class litter detection model
Figure 6.11: Testing results of YOLO-V7l based two class litter detection model
Figure 6.12: Testing results of YOLO-V5l based two class litter detection model
Figure 6.13: Testing results of YOLO-V7l based two class litter detection model
In Figure 6.12 and Figure 6.13, both models are able to detect a wide range of litter items such as plastic bags, bottles, and cans, but YOLO-V5 provides more accurate and consistent detection results, with higher confidence values shown at the top-left corner of each bounding box. Similarly, in Figure 6.13, YOLO-V5 is more successful in detecting litter items that are partially occluded or located in complex backgrounds, indicating its superior ability to analyse features and accurately identify objects. These observations further support the findings from the quantitative performance comparison and demonstrate the effectiveness of YOLO-V5 in this two-class litter detection task.
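The confidence values shown at the top-left of each bounding box are the per-detection scores returned by the model. A minimal sketch of obtaining them from a trained YOLO-V5 checkpoint is shown below, using the standard torch.hub loading route of the YOLO-V5 repository; the checkpoint path, test image and confidence threshold are illustrative assumptions.

import torch

# Load a custom-trained YOLO-V5 checkpoint (path is an assumed placeholder)
model = torch.hub.load("ultralytics/yolov5", "custom", path="runs/train/exp/weights/best.pt")
model.conf = 0.25  # confidence threshold applied when filtering detections (illustrative)

results = model("test_images/campsite_01.jpg")  # assumed test image

# Each row holds xmin, ymin, xmax, ymax, confidence, class index and class name
detections = results.pandas().xyxy[0]
for _, det in detections.iterrows():
    print(f"{det['name']}: confidence {det['confidence']:.2f} "
          f"at ({det['xmin']:.0f}, {det['ymin']:.0f}, {det['xmax']:.0f}, {det['ymax']:.0f})")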
Figure 6.14: Testing results of YOLO-V5l based two class litter detection model
Figure 6.15: Testing results of YOLO-V7l based two class litter detection model
Figures 6.14 and 6.15 depict the results of testing the two models on campsite images. This test is particularly challenging due to the scene's complexity, including the number and size of the objects present, which can significantly affect the result. Comparing the two models, we find that their performance is comparable.
However, Table 6.5 shows that YOLO-V5's ability to detect human-made items is not very good. This is due to the complexity of the campsite, which makes such images challenging to label accurately, leading to lower overlap between the predicted bounding boxes and the ground-truth annotations. Despite this, both YOLO-V5 and YOLO-V7 perform well in the test images, successfully recognising objects of different sizes and colours, such as houses, walls, crocks, and abandoned buildings.
Based on the comprehensive test results and visual performance, YOLO-V5l outperforms YOLO-V7l in detecting litter and human-made items. However, it is important to note that the choice of model depends on the specific project requirements and constraints. In this section, we have shown that YOLO-V5l is a feasible and effective option for litter detection using a two-class approach. The success of both the YOLO-V5l and YOLO-V7l models in detecting litter and human-made items highlights the potential of deep learning techniques in helping to solve a complex environmental problem.
6.4 Conclusion
The accurate detection of objects whose definition relies on human judgement, such as the litter and human-made items targeted in this chapter, is a difficult task in the field of object recognition, as the definition is set by a human and could therefore differ between humans. This study used different DNN models to detect the objects concerned and rigorously compared their performance. The outcomes of this study can provide new solutions and inspiration for future researchers. It is essential to continue to explore and develop new methods and technologies for object recognition, especially in areas where subjective interpretation is required. By doing so, we can improve the accuracy and efficiency of object recognition and expand its applications to various fields. Additional test results can be found in Appendix B.
In the research presented in this chapter, we first investigated using the most established recent YOLO version, i.e., YOLO-V5, to detect litter as a single-class litter detection problem. All YOLO-V5 sub-version networks achieved a detection precision of over 76%, according to the findings of the experiments presented in this chapter. YOLO-V5l performed best in terms of litter detection, with a [email protected] of over 71.5%. We observed that when the images are blurry, overlapping, or contain different backgrounds, YOLO-V5x outperforms the other YOLO-V5 sub-version networks at detecting litter. Over 5000 litter samples were used during training, and over 913 drone-captured images were used for collecting these samples. The detection performance should improve even further if this number is raised to 10,000 or higher, since the models, especially those created from the sub-versions with the deeper architectures, will be able to generalise better and hence perform better on previously unobserved data. To improve the performance of the resulting object detection models under more complex scenarios and backgrounds, we added a new class, namely a human-made object class, and trained DNN architectures of both YOLO-V5 and the more recently proposed YOLO-V7. We compared the objective and subjective/visual performance of the resulting models and concluded that YOLO-V5 generally performed better in this study, as the deeper sub-versions of YOLO-V5 had more complex architectures than the model with the deepest architecture in YOLO-V7. The proposed two-class litter detection approach helped overcome many challenges of litter detection compared to the single-class approach. The results of this research and the resulting models, despite the limits of the data set used in training, demonstrate the possibility of real-time litter detection via aerial surveillance, aiding the preservation of the environment and desert in the UAE.
Chapter 7
This thesis presented the results of a research study in which the design, development and testing of novel and innovative deep network based computational models were rigorously investigated for detecting and identifying named objects in drone imagery, with a particular focus on detecting objects of significant importance in nature reserves in desert areas of the Middle East, UAE. The study aimed to design deep learning-based object detection models using well-known Convolutional Neural Networks (CNNs) that could efficiently and accurately recognise various types of objects, such as Ghaf trees, different types of trees (i.e., Ghaf, Acacia and Palm trees), and litter. The research was conducted as part of an ongoing research collaboration with the Dubai Desert Conservation Reserve, Dubai, UAE and involved the creation of software packages that were field-tested in real-world settings, providing vital feedback for continuous improvements of the model designs and supported by expert knowledge of the application scenarios that the research served. The methodology employed in the research involved the collection of drone imagery from farms and suburban desert areas in the Middle East and Thailand. The images were labelled and used to train the CNN models, which were then tested on a separate dataset of images. The results of the study showed that the CNN models developed in this research achieved high levels of accuracy in detecting various objects, including Ghaf trees, different types of trees using a single model, and litter. These findings are significant, as they can ease some of the tedious and time-consuming job functions of nature reserve managers and wardens in protecting nature reserves. Overall, the research presented in this thesis contributes to the field of machine learning and deep learning by demonstrating the effectiveness of CNN-based object detection models in identifying objects in drone imagery, highlighting their limitations, and providing useful insights for future research in this area.
In Chapter 4, we explored the use of Convolutional Neural Networks to detect Ghaf trees in aerial videos captured by a drone in various environmental settings and at different altitudes. Our findings represent the first attempt in the literature to automatically detect Ghaf trees in drone video footage using CNN based computational models, which have been proven to be more effective than traditional machine learning based methods. Despite training with a relatively small number of images, we obtained a high [email protected] value of 81.1% using the YOLO-V5x based model, which can detect Ghaf trees in approximately 78 ms on an image of size 3840×2160 pixels. This is a promising achievement towards real-time detection of Ghaf trees. The training time for generating the model was approximately 10 hours, limited predominantly by the hardware used. Models based on the other three sub-versions of YOLO-V5 achieved [email protected] values above 77.4%, while other popular DNNs, such as R-CNN and SSD, performed less effectively. Our detailed analysis of the detection results revealed that YOLO-V5x was the most successful at detecting Ghaf trees in complex scenarios, including overlapping, blurry, and obstructed images with different backgrounds and varying tree sizes. With a dataset of just over 5,000 Ghaf trees, 3,200 of which were used for training, we expect that performance can be further enhanced if the number of images is expanded to 10,000 or more. Our results demonstrate the potential of aerial surveillance for real-time detection of the endangered national tree of the UAE and the Gulf Region.
Chapter 5 investigated the effectiveness of using YOLO-V5 based computational models in detecting multiple types of trees in drone images using a single computational model, specifically focusing on detecting and differentiating Ghaf, Palm, and Acacia trees in high-altitude drone video footage. The practical challenge addressed in this research is the difficulty of detecting and differentiating multiple types of trees at high altitude, as they appear too small to contain features that a deep neural network can meaningfully learn from. To implement the multiple-tree detection system, the three types of trees, as perceived by a human observer, were marked with three different labels. The images we used contained at most two types of trees, as Acacia trees are usually only present in natural areas that do not usually have Palm or Ghaf trees. Therefore, we created image mosaics consisting of multiple tree types for rigorous testing. Different tree sizes, overlaps with other trees and objects, lack of contrast with the image background, image blur, and similarities in colour, texture and shape with other trees challenged the accurate detection of the three tree types in drone footage. The proposed research also conducted experiments to train YOLO-V5 to detect individual types of trees, for comparison with the three-tree-type detector. The results showed that the mean average precision (mAP) values obtained for the single tree type detector (Chapter 4) are lower than those for group detection across all four sub-versions of YOLO-V5 (Chapter 5). The highest mAP value achieved for the Ghaf-tree-only detector was 81.1%, using YOLO-V5x, while the corresponding mAP value obtained using the multiple tree detector model was 83.5%. Additionally, the models trained to detect single types of trees are unable to accurately detect small trees that are in close proximity to big trees, indicating the difficulty of detecting single trees, especially those very small in size/appearance, in high-altitude drone imagery. The use of more than one type of tree in the labelling process also helps to more accurately differentiate the target type of tree from other trees and the image background, leading to relatively more accurate detection performance. For example, if the confidence of a tree being detected as a Ghaf tree is 0.51, the confidence of it being detected as an Acacia tree may be recorded as 0.30, and the confidence of it being detected as a Palm tree as 0.15; the model can therefore be more confident that the tree in question is a Ghaf tree and not one of the other two types, minimising false positives. The original research findings presented in Chapter 5 have practical applications in using drones to autonomously survey large areas for different tree species, allowing their protection, maintenance and management.
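The example above can be expressed as a simple comparison over the per-class confidence scores returned for a candidate detection: the class with the highest score is kept, and a clear margin over the runner-up gives additional assurance. The sketch below illustrates this with the scores quoted in the text; the margin threshold is an illustrative assumption rather than a parameter of the trained models.

# Per-class confidences for one candidate tree detection (values from the example above)
scores = {"Ghaf": 0.51, "Acacia": 0.30, "Palm": 0.15}

ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
best_class, best_score = ranked[0]
runner_up_score = ranked[1][1]

MARGIN = 0.15  # illustrative: require a clear gap between the top two classes
if best_score - runner_up_score >= MARGIN:
    print(f"Accept detection as {best_class} (confidence {best_score:.2f})")
else:
    print("Ambiguous detection: leave for manual review")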
In Chapter 6, we first investigated using the most established YOLO version, i.e., YOLO-V5, to detect litter as a single-class litter detection problem. We found that models based on all YOLO-V5 sub-version networks achieve a detection precision of over 76%, according to the findings of the experiments presented in that chapter. YOLO-V5l performed best in terms of litter detection, with a [email protected] of over 71.5%. We observed that when the images are blurry, overlapping, or contain different backgrounds, YOLO-V5x outperforms the other YOLO-V5 sub-version networks at detecting litter. Over 5000 litter samples were used during training, and over 913 drone-captured images were used in the research for collecting these samples. We showed that the detection performance should improve even further if this number is raised to 10,000 or higher, since especially those models created from the sub-versions with deeper architectures will be able to generalise better and hence perform better on previously unobserved data. To improve the performance of the resulting object detection models under more complex scenarios and backgrounds, we added a new class, namely a human-made object class, and trained DNN architectures of both YOLO-V5 and the more recently proposed YOLO-V7. We compared the objective and subjective/visual performance of the resulting models and concluded that YOLO-V5 generally performed better in this study, as the deeper sub-versions of YOLO-V5 had more complex architectures than the model with the deepest architecture in YOLO-V7. The proposed two-class litter detection approach helped overcome many challenges of litter detection compared to the single-class approach. The results of this research and the resulting models, despite the limits of the data set used in training, demonstrate the possibility of real-time litter detection via aerial surveillance, aiding the preservation of the environment and desert in the UAE.
7.2 Limitations and Future Work
• Training the networks with more training samples to increase the detection accuracy of all trained models;
• Utilizing the drone-based tree detection system for analysing the locations (e.g., for census purposes), growth, wellbeing and distribution of trees (e.g., change detection);
• Extending the drone-based litter detection system to detect and recognise litter type in association with the flight altitude and image resolution;
• YOLO-V8 was published a few months before the submission of this thesis. It promises more accurate predictions and faster training and deployment times. The same approach to data labelling, training and testing can be used for generating object detection models based on YOLO-V8 (see the sketch after this list).
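As indicated in the final point above, the same data and workflow could be reused with YOLO-V8 through the ultralytics Python package. The sketch below is a minimal, hedged example of what such a training run might look like; the dataset file, model variant and hyperparameters are illustrative assumptions rather than settings evaluated in this thesis.

# Minimal YOLO-V8 training sketch using the ultralytics package (pip install ultralytics)
from ultralytics import YOLO

# Start from pretrained weights; 'yolov8l.pt' is chosen here only to mirror the 'l' sub-versions used above
model = YOLO("yolov8l.pt")

# 'litter_two_class.yaml' would be the same dataset definition used for YOLO-V5 (assumed path)
model.train(data="litter_two_class.yaml", epochs=300, imgsz=640, batch=16)

# Evaluate on the validation split and report mAP metrics
metrics = model.val()
print(metrics.box.map50)  # [email protected], comparable to the values reported in Chapter 6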
The original research presented in this thesis has already been submitted as four papers, two journal and two conference papers (see Appendix A). The resulting implementations are being used in practice at the Dubai Desert Conservation Reserve (DDCR), UAE. The continuous feedback received on their operational accuracy, together with the additional data being gathered, is being used to continuously improve the accuracy of the models currently in use and presented in this thesis.
With the rapid development of artificial intelligence, more powerful algorithms and networks emerge frequently. At present, the accuracy one can achieve with these models is already very high; however, there is still a significant gap in performance between machines and human beings. Humans can also make mistakes, usually called human error. For example, suppose the ground truth contains 100 Ghaf trees. Based on our current ability to visually recognise Ghaf trees, a human could correctly detect at least 98 trees, while the model developed can correctly detect over 80. There may also be some trees that a human cannot detect, or detects wrongly, that the model detects correctly. Based on these observations, with the future advancement of AI, machines could surpass human performance in the field of object detection and classification when large amounts of data are available for training and more advanced networks are proposed.
Appendix A
Research Publications
This thesis presents four contributions, with two submitted to international con-
ferences and the remaining two to international journals.
Conference
Journal
Appendix B
Additional Results
This appendix provides additional image results obtained from testing the systems proposed in Chapters 4, 5, and 6. The results include Ghaf tree detection, multiple tree detection and classification, and litter detection.