Object Detection in Drone Imagery Using Convolutional Neural Networks


This item was submitted to Loughborough's Research Repository by the author.

Items in Figshare are protected by copyright, with all rights reserved, unless otherwise indicated.

Object detection in drone imagery using convolutional neural networks


PLEASE CITE THE PUBLISHED VERSION

PUBLISHER

Loughborough University

LICENCE

CC BY-NC-ND 4.0

REPOSITORY RECORD

Wang, Guoxu. 2023. “Object Detection in Drone Imagery Using Convolutional Neural Networks”.
Loughborough University. https://doi.org/10.26174/thesis.lboro.24435160.v1.
Object Detection In Drone Imagery using
Convolutional Neural Networks

by

Guoxu Wang

A Doctoral Thesis

Submitted in partial fulfilment


of the requirements for the award of

Doctor of Philosophy
of
Loughborough University

May 2023

Copyright 2023 Guoxu Wang


Abstract

Drones, also known as Unmanned Aerial Vehicles (UAVs), are lightweight aircraft
that can fly without a pilot on board. Equipped with high-resolution cameras and
ample data storage capacity, they can capture visual information for subsequent
processing by humans to gather vital information. Drone imagery provides a
unique viewpoint that humans cannot access by other means, and the captured
images can be valuable for both manual processing and automated image analysis.
However, detecting and recognising objects in drone imagery using computer
vision-based methods is difficult because the object appearances differ from those
typically used to train object detection and recognition systems. Additionally,
drones are often flown at high altitudes, which makes the captured objects appear
small. Furthermore, various adverse imaging conditions may occur during flight,
such as noise, illumination changes, motion blur, object occlusion, background
clutter, and camera calibration issues, depending on the drone hardware used,
interference in flight paths, changing environmental conditions, and regional climate
conditions. These factors make the automated computer-based analysis of drone
footage challenging.
In the past, conventional machine-based object detection methods were widely
used to identify objects in images captured by cameras of all types. These methods
involved using feature extractors to extract an object’s features and then using
an image classifier to learn and classify the object’s features, enabling the learning
system to infer objects based on extracted features from an unknown object.
However, the feature extractors used in traditional object detection methods were
based on handcrafted features decided by humans (i.e. feature engineering was
required), making it challenging to achieve robustness of feature representation
and affecting classification accuracy. Addressing this challenge, Deep Neural Network
(DNN) based learning provides an alternative approach to detect objects
in images. Convolutional Neural Networks (CNNs) are a type of DNN that can
extract millions of high-level features of objects that can be effectively trained for
object detection and classification. The aim of the research presented in this thesis is
to optimally design, develop and extensively investigate the performance of CNN-based
object detection and recognition models that can be efficiently used on drone
imagery.
One significant achievement of this work is the successful utilisation of
state-of-the-art CNNs, such as SSD, Faster R-CNN and YOLO (versions 5s, 5m,
5l, 5x, 7), to generate innovative DNN-based models. We show that these models
are highly effective in detecting and recognising Ghaf trees, multiple tree types
(i.e., Ghaf, Acacia and Date Palm trees) and in detecting litter. Mean Average
Precision (mAP) values ranging from 70% to 92% were obtained, depending
on the application and the CNN architecture utilised.
The thesis places a strong emphasis on developing systems that can effectively
perform under practical constraints and variations in images. As a result, several
robust computer vision applications have been developed through this research,
which are currently being used by the collaborators and stakeholders.

Guoxu Wang,
May 2023
Acknowledgements

I would like to express my sincere gratitude to Prof. Eran Edirisinghe, my primary
supervisor, for his encouragement, support, and guidance throughout my PhD. Dr.
Asma Adnane, my supervisor, and Dr. Sara Saravi, my associate supervisor, also
deserve my gratitude for their wise counsel and concern.
I extend my gratitude to the Dubai Desert Conservation Reserve (DDCR) and
Dr. Andrew Leonce from Zayed University, Dubai for their assistance in capturing
data, helping with their interpretation, providing expert views and general
collaboration in my research.
I deeply appreciate the support I received from my father, Qiang Wang, my
mother, Donghe Gao and my fiancée, Xiaohui Huang. Without their unconditional
love and unwavering support, this thesis would not have been possible.
I also want to express my appreciation to all my Ph.D. colleagues, especially
those based in Room N2.12 of the Haslegrave Building at Loughborough University,
for their warmth, supportive comments, and enjoyable experience. Additionally,
I would like to thank my Chinese friends in Loughborough, who never failed
to lift my spirits, when needed, and provided me with numerous hours of enjoyable
conversations.
Lastly, but not least, I would like to thank my friends from the 3+1+1 collaboration
between Loughborough University and Northeastern University, for their
assistance in both my studies and personal life. The five years I spent with you
were unforgettable, and I wish you all success in your studies.

Guoxu Wang,
May 2023

Contents

Abstract iii

Acknowledgements v

List of Abbreviations xiii

1 Introduction 1
1.1 Contextual Background and Motivation . . . . . . . . . . . . . . . . 1
1.2 Research Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Research Aim and Objectives . . . . . . . . . . . . . . . . . . . . . 6
1.4 Original Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 Literature Review 11
2.1 Classical Object Detection Methods . . . . . . . . . . . . . . . . . . 11
2.1.1 Sliding Window-based Method . . . . . . . . . . . . . . . . . 11
2.1.2 Region Proposal-based Method . . . . . . . . . . . . . . . . 13
2.2 Object Detection in Aerial Imagery Using Machine Learning . . . . 14
2.3 Object Detection in Aerial Imagery Using Deep Learning . . . . . . 20

3 Theoretical Background 25
3.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Fundamentals of Neural Networks . . . . . . . . . . . . . . . 28
3.2.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . 35
3.3 CNN-based object detection . . . . . . . . . . . . . . . . . . . . . . 45
3.3.1 Faster Region-based Convolutional Neural Network . . . . . 45
3.3.2 Single Shot Multibox Detector . . . . . . . . . . . . . . . . . 46
3.3.3 You Only Look Once . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Quantitative Performance Comparison Methods . . . . . . . . . . . 54

4 Ghaf Tree Detection Using Deep Neural Networks 57
4.1 Introduction to Ghaf Tree Detection . . . . . . . . . . . . . . . . . 57
4.2 Proposed Approach to the Ghaf Tree Detection . . . . . . . . . . . 58
4.2.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Experimental Results and Analysis . . . . . . . . . . . . . . . . . . 59
4.3.1 Quantitative Performance Comparison . . . . . . . . . . . . 60
4.3.2 Visual Performance Comparison . . . . . . . . . . . . . . . . 61
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5 Multiple Tree Classification Using Deep Neural Networks 77


5.1 Introduction to the Multiple Tree Detection and Classification . . . 78
5.2 Proposed Approach to Multiple Tree Detection and Classification . 78
5.2.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Experimental Results and Analysis . . . . . . . . . . . . . . . . . . 80
5.3.1 Quantitative Performance Comparison . . . . . . . . . . . . 80
5.3.2 Visual Performance Comparison . . . . . . . . . . . . . . . . 84
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

6 Litter Detection Using Deep Neural Networks 99


6.1 Introduction to the Litter Detection . . . . . . . . . . . . . . . . . . 99
6.2 Proposed Approach to Litter Detection . . . . . . . . . . . . . . . . 101
6.2.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3 Experimental Results and Discussion . . . . . . . . . . . . . . . . . 103
6.3.1 Single-Class Litter Detection . . . . . . . . . . . . . . . . . . 103
6.3.2 Two-Class Litter Detection . . . . . . . . . . . . . . . . . . 122
6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

7 Conclusions and Future Work 131


7.1 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . 131
7.2 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . 134

A Research Publications 137

B Additional Results 139


List of Figures

2.1 Example of sliding window object detector . . . . . . . . . . . . . . 12


2.2 The workflow of object detection implemented using machine learning 15

3.1 The learning process of supervised learning . . . . . . . . . . . . . . 26


3.2 A workflow of a deep learning algorithm . . . . . . . . . . . . . . . 28
3.3 Neural network architecture for an input image of size 13 x 13 pixels
with one hidden layer and ten neurons in the output layer from [30][31] 29
3.4 Example of the operation of a neuron in a neuron network . . . . . 30
3.5 The curve of sigmoid activation function . . . . . . . . . . . . . . . 31
3.6 The curve of tanh activation function . . . . . . . . . . . . . . . . . 32
3.7 The curve of ReLU activation function . . . . . . . . . . . . . . . . 32
3.8 The curve of Leaky ReLU activation function . . . . . . . . . . . . 33
3.9 Convolutional network architecture from [105] . . . . . . . . . . . . 36
3.10 Convolution operation on a 6×6 image with a 3×3 kernel . . . . . . 37
3.11 Example of a stack of feature maps from [106] . . . . . . . . . . . . 38
3.12 Convolution operation on a 6×6 image with zero-padding and a
3×3 kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.13 Convolution operation on a 6×6×3 image with a 3×3×3 kernel . . 41
3.14 Example of the max pooling operation . . . . . . . . . . . . . . . . 43
3.15 Example of the average pooling operation . . . . . . . . . . . . . . 44
3.16 Example of fully connected layer and flatten operation from [107] . 45
3.17 Network structure diagram of Faster R-CNN from [108] . . . . . . . 46
3.18 Architecture of SSD from [109] . . . . . . . . . . . . . . . . . . . . . 46
3.19 The key process of YOLO object detection algorithm from [115] . . 47
3.20 Focus structure from [117] . . . . . . . . . . . . . . . . . . . . . . . 50
3.21 ResNet, RepVGG training and RepVGG testing from [120] . . . . . 52
3.22 Backbone of YOLO-V6 from [121] . . . . . . . . . . . . . . . . . . . 52

4.1 The workflow of the proposed method . . . . . . . . . . . . . . . . . 58


4.2 mAP: Precision&Recall curve . . . . . . . . . . . . . . . . . . . . . 61

4.3 The visual performance comparison of Ghaf tree detector models
derived from DNN architectures, (a) SSD, (b) Faster R-CNN, (c)
YOLO-V5s, (d) YOLO-V5m, (e) YOLO-V5l, and (f) YOLO-V5x . 62
4.4 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 63
4.4 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 64
4.5 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 66
4.5 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 67
4.6 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 69
4.6 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 70
4.7 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 72
4.7 The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 73

5.1 The visual performance comparison of Multiple tree detector mod-


els derived from DNN architectures, (a) SSD, (b) Faster R-CNN,
(c) YOLO-V5s, (d) YOLO-V5m, (e) YOLO-V5l, and (f) YOLO-V5x 85
5.2 The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . 87
5.2 The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . 88
5.3 The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . 90
5.3 The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . 91
5.4 The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . 93
5.4 The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . 94
5.5 Missed trees due to limited training data . . . . . . . . . . . . . . . 95

6.1 The results of litter detection in drone imagery using the SSD,
Faster R-CNN, YOLO-V5s, YOLO-V5m, YOLO-V5l and YOLO-
V5x based models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 106
6.2 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 107
6.3 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 108
6.3 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 109
6.4 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 110
6.4 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 111
6.5 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 113
6.5 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 114
6.6 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x . . . . . . . . . . . . 116
6.6 The results of litter detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued) . . . . . 117
6.7 Detecting very small objects of litter using YOLO-V5x based model 118
6.8 Test results of single-class litter detection models in desert campsites . 121
6.9 Test results of single-class litter detection models in desert campsites . 121
6.10 Testing results of YOLO-V5l based two class litter detection model 124
6.11 Testing results of YOLO-V7l based two class litter detection model 125
6.12 Testing results of YOLO-V5l based two class litter detection model 126
6.13 Testing results of YOLO-V7l based two class litter detection model 126
6.14 Testing results of YOLO-V5l based two class litter detection model 127
6.15 Testing results of YOLO-V7l based two class litter detection model 127

B.1 Ghaf Tree Detection Result . . . . . . . . . . . . . . . . . . . . . . 141


B.2 Ghaf Tree Detection Result . . . . . . . . . . . . . . . . . . . . . . 141
B.3 Ghaf Tree Detection Result . . . . . . . . . . . . . . . . . . . . . . 142
B.4 Ghaf Tree Detection Result . . . . . . . . . . . . . . . . . . . . . . 142
B.5 Ghaf Tree Detection Result . . . . . . . . . . . . . . . . . . . . . . 143
B.6 Ghaf Tree Detection Result . . . . . . . . . . . . . . . . . . . . . . 143
B.7 Ghaf Tree Detection Result . . . . . . . . . . . . . . . . . . . . . . 144
B.8 Ghaf Tree Detection Result . . . . . . . . . . . . . . . . . . . . . . 144
B.9 Ghaf Tree Detection Using Multiple Tree Detector . . . . . . . . . . 146
B.10 Ghaf Tree Detection Using Multiple Tree Detector . . . . . . . . . . 146
B.11 Ghaf Tree Detection Using Multiple Tree Detector . . . . . . . . . . 147
B.12 Ghaf Tree Detection Using Multiple Tree Detector . . . . . . . . . . 147
B.13 Ghaf Tree Detection Using Multiple Tree Detector . . . . . . . . . . 148
B.14 Ghaf Tree Detection Using Multiple Tree Detector . . . . . . . . . . 148
B.15 Palm Tree Detection Using Multiple Tree Detector . . . . . . . . . 149
B.16 Palm Tree Detection Using Multiple Tree Detector . . . . . . . . . 149
B.17 Palm Tree Detection Using Multiple Tree Detector . . . . . . . . . 150
B.18 Palm Tree Detection Using Multiple Tree Detector . . . . . . . . . 150
B.19 Palm Tree Detection Using Multiple Tree Detector . . . . . . . . . 151
B.20 Palm Tree Detection Using Multiple Tree Detector . . . . . . . . . 151
B.21 Acacia Tree Detection Using Multiple Tree Detector . . . . . . . . . 152
B.22 Acacia Tree Detection Using Multiple Tree Detector . . . . . . . . . 152
B.23 Acacia Tree Detection Using Multiple Tree Detector . . . . . . . . . 153
B.24 Acacia Tree Detection Using Multiple Tree Detector . . . . . . . . . 153
B.25 Acacia Tree Detection Using Multiple Tree Detector . . . . . . . . . 154
B.26 Acacia Tree Detection Using Multiple Tree Detector . . . . . . . . . 154
B.27 Litter Detection Result in Desert Area . . . . . . . . . . . . . . . . 156
B.28 Litter Detection Result in Desert Area . . . . . . . . . . . . . . . . 156
B.29 Litter Detection Result in Desert Area . . . . . . . . . . . . . . . . 157
B.30 Litter Detection Result in Desert Area . . . . . . . . . . . . . . . . 157
B.31 Litter Detection Result in Desert Area . . . . . . . . . . . . . . . . 158
B.32 Litter Detection Result in Desert Area . . . . . . . . . . . . . . . . 158
B.33 Litter Detection Result in Desert Area . . . . . . . . . . . . . . . . 159
B.34 Litter Detection Result in Desert Area . . . . . . . . . . . . . . . . 159
B.35 Litter Detection Result in Camp Area . . . . . . . . . . . . . . . . 160
B.36 Litter Detection Result in Camp Area . . . . . . . . . . . . . . . . 160
List of Tables

4.1 Number of labelled Ghaf tree canopies in each data subset . . . . . 59


4.2 Performance comparison of DNN based object detection models . . 61

5.1 Number of labelled Ghaf tree in each data subset . . . . . . . . . . 80


5.2 Number of labelled Palm tree in each data subset . . . . . . . . . . 80
5.3 Number of labelled Acacia tree in each data subset . . . . . . . . . 80
5.4 Performance comparison of DNN based object detection models . . 82
5.5 Ghaf Tree Detection Performance comparison of YOLO-V5 based
object detection models . . . . . . . . . . . . . . . . . . . . . . . . 83
5.6 Palm Tree Detection Performance comparison of YOLO-V5 based
object detection models . . . . . . . . . . . . . . . . . . . . . . . . 83
5.7 Acacia Tree Detection Performance comparison of YOLO-V5 based
object detection models . . . . . . . . . . . . . . . . . . . . . . . . 83

6.1 Number of labelled litter items and human-made items in each data subset 102
6.2 Performance and comparison of DNN based litter detection models 104
6.3 Illustration of a conceptual/architectural comparison of the two
Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.4 Performance comparison of YOLO-V5 and YOLO-V7 models . . . . 123
6.5 Performance comparison of YOLO-V5 and YOLO-V7 models . . . . 123
6.6 Performance comparison of YOLO-V5 and YOLO-V7 models . . . . 123

List of Abbreviations

AI Artificial Intelligence

CNN Convolutional Neural Network

CRFs Conditional Random Fields

CSPNet Cross Stage Partial Network

DBNs Deep Belief Networks

DDCR Dubai Desert Conservation Reserve

DNN Deep Neural Network

FDNS Deep Feedforward Networks

FFDN Feature Fusing Deep Network

FSG Fuzzy Stacked generalisation

GANs Generative Adversarial Networks

GLCM Gray Level Co-occurrence Matrix

GMDH Group Method of Data Handling

GOP Geodesic Object Proposals

HOG Histogram of Oriented Gradients

ILSVRC ImageNet Large Scale Visual Recognition Challenge

IoU Intersection over Union

LBP Local Binary Patterns

mAP Mean Average Precision

MCG Multiscale Combinatorial Grouping

MSE Mean Squared Error

NMS Non-Maximum Suppression

NN Neural Network

PANet Path Aggregation Network

RBFNN Radial Basis Function Neural Network

R-CNN Region-based Convolutional Neural Network

R-FCN Region-based Fully Convolutional Network

ReLU Rectified Linear Unit

RNNs Recurrent Neural Networks

ROI Region of Interest

RPN Region Proposal Network

SIFT Scale-Invariant Feature Transform

SPP Spatial Pyramid Pooling

SSD Single Shot MultiBox Detector

SVM Support Vector Machine

UAE United Arab Emirates

UAVs Unmanned Aerial Vehicles

YOLO You Only Look Once


Chapter 1

Introduction

1.1 Contextual Background and Motivation

The Ghaf tree, also known as the tree of life by local people in Bahrain and much
of Arabia, is a drought-resilient tree capable of withstanding the extremely harsh
conditions of a desert environment [1]. The Ghaf tree, scientifically known as
Prosopis cineraria [2], can survive in extremely dry and hot weather for hundreds
of years with no artificial irrigation required. In the United Arab Emirates (UAE)
particularly, the Ghaf was declared a national tree in 2008 due to its historical and
national importance [3] [4]. The leaves of Ghaf trees have historically been used
as food for camels, while their tender leaves are still used in the UAE to make salads
and for various medicinal purposes. Like any other natural entity in the environment,
Ghaf trees have, in recent years, increasingly become threatened
by the ever-expanding human activity in the UAE as a result of urbanisation and
infrastructural development projects. Given the arid environment in which the
Ghaf trees exist, aerial surveillance systems such as Unmanned Aerial Vehicle
(UAV) based imagery are naturally the preferred mechanism for monitoring
habitats in such environments.
One of the oldest fruit trees in the Arabian Peninsula, the Middle East, and
North Africa is the date palm tree. It is a major fruit crop grown in arid locations
around the world and is a vital part of the gulf region’s crop/food production
systems, due to its arid climate. Approximately 90% of the world’s dates are pro-
duced in the Arabian Peninsula. The fruits, bark, and leaves of date palm trees
are the most often used portions. Date palm tree fruits can be considered a food
with great nutritional value and numerous possible health advantages [5]. The
remaining portions of date palm trees can be used to make cosmetics, building
materials, and paper [6], among other things. As a result, surveying date palm
trees, which includes measuring their quantity, determining their location, pattern,
and distribution, is critical for predicting and forecasting production levels
and plantation management. Thani, J. has successfully completed work on the monocular
detection of palm trees and published a research paper [7].
Due to the depth of its roots, the Acacia tree has great drought tolerance
and can flourish without water for long periods. It also has a moderate degree of salt
tolerance. Acacia trees can be found in sand and gravel valleys, as well
as on the medium-altitude slopes of mountains. They are frequent and widespread
throughout the UAE’s eastern regions, including in the Hafeet Mountain area. The
Acacia tree has numerous advantages. It is a good source of animal feed since it
increases milk production, especially in semi-arid environments. During a drought,
when all other sources of protein and energy are scarce, it is a valuable supply
of protein and energy. The acacia tree’s branches are a great source of nutrients,
containing 38 percent raw protein, phosphorus, and the calories needed to provide
energy to animals [8]. Therefore, monitoring acacia trees, paying attention to their
growth conditions and growing environment, is crucial for predicting production
levels and planning routine maintenance.
With the increase of tourism in the desert areas of the Gulf region, the detection
and removal of litter left behind by visitors at popular tourist sites is
becoming an increasingly important environmental problem to resolve. Litter
impacts desert ecology and endangers wildlife and natural habitats. Typically, litter
detection and pickup are done by humans who physically visit popular tourist
destinations in search of litter left behind. However, given the terrain, which is difficult
to navigate on foot or in vehicles, searching large areas for litter using ground
vehicles is impractical. Further, wind can spread lightweight
litter far beyond popular tourist sites, making searching more difficult.
The use of lightweight Unmanned Aerial Vehicles (UAVs), or drones [9], in practical
application areas has increased in popularity recently due to advances in aerial
photography, digital aerial displays, and the potential they provide for low-cost
surveillance of large areas. Visual data captured by high-resolution, high-quality
digital cameras mounted on drones are increasingly used in computer vision applications.
Drones equipped with high-resolution cameras can capture images over
large areas, providing a bird’s-eye view similar to satellite imaging [10], but at a
relatively lower altitude, and hence without being challenged by cloud cover. Further,
the cost of aerial image capture with drones is much lower than that of aerial
image capture with satellites.
A drone is a lightweight aircraft that does not have a pilot on board. Drones
can be operated automatically through a pre-programmed drone
flight planning system/software, or alternatively, can be manually controlled by a
ground pilot. Various practical tasks such as surveillance [11], aerial mapping [12],
infrastructure inspection [13], search and rescue [14], precision agriculture [15], and
ecological monitoring [16] are now made possible with the use of drones. Drones
flying at a controlled height can provide vital visual aerial surveillance data for the
above application areas, either to be processed manually by humans or processed
automatically by computers.
Due to factors such as high-altitude flying, noise in images, lighting changes,
motion-induced object blur, object occlusion, and background clutter, detecting
objects in images captured by drones, either by humans or computers, can be exceedingly
challenging, especially considering the small size of the objects involved.
Although drone camera technology has advanced significantly in recent years, minimising
the impact of the above challenges, and thereby improving the potential for
information gathering from drone images, remains a challenging task. Additionally,
lightweight UAV-based (i.e., drone) imagery and sensing has in the recent past
been utilised in detecting and mapping woody species’ encroachment in subalpine
grassland [17], estimating carbon stock for native desert shrubs [18], and several
other desert and forest monitoring applications [19–22]. While UAV imagery has
enabled large-scale, high-resolution and fast landscape mapping, the use of this
significant imagery data is still largely limited to offline use, with much more to
be realised for real-time applications [21] [23] [24].
Due to the large amounts of visual data captured in drone applications, human/manual
processing of such data is time-consuming and often prohibitive.
Automatic object detection methods based on machine learning have traditionally
been used to detect objects in images captured by cameras to support a wide range
of application domains. The traditional machine learning-based methods of object
detection and recognition can be divided into three stages: image pre-processing,
feature extraction, and classification. The aim is to extract object features from an
image/object using a feature extractor and then learn and classify these features
using a classifier.
Moranduzzo and Melgani used a Histogram of Oriented Gradients (HOG) as a
feature descriptor to represent the features of a car based on its shape in 2014 [25].
To detect cars in images, they used Support Vector Machines (SVMs) [26] to learn
and classify the features of cars and other objects. Another example is the use
of Haar-like features [27] in conjunction with an AdaBoost classifier [28] to detect
suspicious objects in drone video for military surveillance [29]. These traditional
object detection methods use handcrafted feature descriptors, which are determined
by humans based on their experience and judgment about which characteristics
uniquely define the object for a specific application domain. Hence the problem
of selecting the right features to detect and identify objects in various scenarios
is complex. Describing them effectively is difficult for humans because the object
detection mechanisms of the human psycho-visual system are mostly inexplicable.
Although a large number of object detection and recognition approaches based on
traditional machine learning have been presented in the literature, due to
the above reason, they will not be a focus of the research context presented in this
thesis.
Deep learning is an alternative and much more effective approach to solving the problems
mentioned above. It can be thought of as enabling computers to mimic the
high-level behaviours and operations of the human psycho-visual system. Several
recent studies have addressed the problem of prior feature selection in traditional
machine learning-based object detection systems using deep learning-based approaches
[30] [31], thereby addressing the challenge of a human having to select
the features that will optimise an object recognition task. More relevant to the
research conducted in this thesis is the use of a Convolutional Neural Network
(CNN) as a deep learning architecture to create models that can be used for object
detection and classification [32–34]. CNNs can extract and learn high-level
features from millions of objects without the need for prior feature selection. Several
CNN-based object detection methods have been proposed in the literature
since 2014. These methods are broadly classified into two types: single-stage and
two-stage methods. You Only Look Once (YOLO) [35] and Single Shot MultiBox
Detector (SSD) [36] are two popular approaches for single-stage object detectors.
The region-based CNN family, which consists of R-CNN [37], Fast R-CNN [38],
and Faster R-CNN [39], is widely used for two-stage detectors. However, the
detection speed and accuracy of the two categories differ.
Deep learning-based object detection and classification methods, based on
DNN architectures such as YOLO, SSD, R-CNN, Fast R-CNN, and Faster R-CNN,
have been used in the literature [40]. They are primarily used to detect
standard objects such as vehicles, cars, buildings, animals etc., captured by handheld
cameras or cameras set at the angles and heights of typical surveillance cameras.
In such a system, the pretrained networks for common object detection,
readily available to download from the internet and use, only need to be retrained
with a small number of additional images, for fine-tuning the network weights. Unfortunately,
when dealing with drone images, which tend to record objects with
the camera looking straight down (or bird’s-eye view), or at small angles to the
vertical, a significant amount of additional training is required to fine-tune the
weights to achieve object detection as seen from a drone. Although there have
been a few attempts to detect objects in video footage captured by drones using
CNN-based methods [41] [42], such research is also limited to data collected under
controlled conditions. It has thus been demonstrated to work only under certain
practical constraints. For example, the detected objects must not overlap, the
contrast with the background must be clear, blurry objects may not be detected,
object detection may not work in scenes with significant lighting changes, and
so on. As a result, the research contributions of such work to solving practical
problems are limited. A further limitation of existing work is that their scientific
rigour and contribution are limited, as in the majority of such work the DNN has
been considered a ‘black box’ used to solve a practical problem. Hence some
of the model design details and decisions on parameter selection do not follow a
scientific approach and have therefore not resulted in optimal outcomes.
This thesis proposes novel object detection models that can cope with the real-world
challenges of object detection and recognition in aerial images
captured from drones. We follow a rigorous experimental design and scientific
decision-making process, based on knowledge of the structures of the Deep Neural Network
architectures we use to create our models and the domain knowledge of the
application sectors acquired through the expertise of our research collaborators.
The models developed have been practically implemented and have been successfully
field tested for accuracy and performance. All design details are presented,
making significant original contributions to the subject area of DNN-based object
detection and recognition, in general, and drone imaging, in particular.
Given the above, the research conducted within the scope of this thesis aims
to make the necessary alterations, optimisations, and improvements to existing DNN
architectures to make effective use of them in novel designs of object detection and
recognition models. We demonstrate that the proposed novel models are able to
detect and recognise objects within drone video footage, captured in challenging
real-world environments and conditions. Each method has been designed, implemented,
and tested to ensure practical relevance and use in the field. The field
test results have been used to refine all designs, backed up by well-defined training
methods and rigorous analysis. The experimental results obtained, when using
object detection and recognition models developed with all state-of-the-art Deep
Neural Network models, are quantitatively and qualitatively compared, to identify
the best methods that can be used in desert areas with challenging practical
limitations.

1.2 Research Problem

Design, develop, implement, rigorously test and compare novel Deep Neural
Network-based models for object detection and object classification in aerial
images captured with drones in desert areas.

1.3 Research Aim and Objectives


This research aims to create several Deep Neural Network based computational
models that can automatically detect and recognise specific objects in UAV images,
specifically captured in the desert areas of the Middle East under challenging
practical and environmental conditions. Detecting and recognising each named object
(e.g., Ghaf trees, Date Palm trees, Acacia trees and litter) presents practical
challenges that the proposed methods must effectively overcome.
The following objectives are met to achieve the aim of this research:

• Conduct a background study of computational algorithms and methods used
in the literature for machine learning and Deep Neural Network-based object
detection and recognition;

• Conduct a review of algorithms and models proposed in the literature for object
detection and recognition in aerial and drone imagery;

• Design and develop novel models for detecting and recognising Ghaf trees
in drone imagery, under real-world conditions, using state-of-the-art Deep
Neural Network architectures, optimise their performance based on feedback,
rigorously compare their performance and recommend the best models and
approaches adopted for their design;

• Design and develop novel models for detecting and recognising multiple
tree types in drone imagery (e.g., Ghaf trees, Date Palm trees, and Acacia
trees), under real-world conditions, using state-of-the-art Deep Neural
Network architectures, optimise their performance based on feedback, rigorously
compare their performance and recommend the best models and
approaches adopted for their design;

• Design and develop novel models for detecting and recognising litter in
drone imagery captured in natural and campsite desert areas, under real-world
conditions, using state-of-the-art Deep Neural Network architectures,
optimise their performance based on feedback, rigorously compare their performance
and recommend the best models and approaches adopted for their
design.

1.4 Original Contributions


This thesis focuses on developing novel object detection and recognition models
for drone imagery by making effective use of state-of-the-art Deep Neural Network
architectures. It also contributes to the wider area of computer-based object detection
and recognition, by recommending effective approaches to data labelling,
training, parameter-based optimisation of designs and recommending the best architectures
to be used under different practical and environmental conditions. The
research particularly considers domain knowledge of the applications and practical
limitations / environmental conditions of image capture, in developing the most
effective approaches to object detection and recognition. In doing so, the research
presented in this thesis fills a vital knowledge gap in the literature, where such systems
are typically developed considering CNN architectures as black boxes.
The original contributions of this thesis can be summarised as follows:

• The use of state-of-the-art CNN architectures to develop efficient,
novel computational models for automated Ghaf tree detection in
drone imagery.

In Chapter 4, we present research that rigorously investigates the potential of optimally
using the popular state-of-the-art CNN architectures (i.e., YOLO, Faster
R-CNN, and SSD), for creating computational models for detecting Ghaf trees
in drone imagery captured in desert environments. The trees vary in size, view
angle, percentage of occlusion, lighting, contents of background, and the presence
of other crops. Extensive experiments were conducted, and the results were thoroughly
compared to determine the best model for detecting Ghaf trees in drone
images. It is worth noting that no previous work has been published on Ghaf
tree detection, in particular when using drones to capture visual content. Such
detection work needs to include rigorous comparisons between models developed
from SSD, Faster R-CNN, and YOLO sub-versions 5s, 5m, 5l, and 5x. The models
created by the latest version of YOLO (i.e., YOLO-V5) achieved the best performance
in Ghaf tree detection, with significant differences between models created by
its different sub-versions s, m, l, and x. Although the sub-versions having deeper
neural network architectures, i.e., sub-versions ‘l’ and ‘x’, were more accurate in
Ghaf tree detection, our analysis revealed that substantially more data is required
for their effective training. All developed CNN models were field-tested during
the proposed research delivery stages. Further, a number of criteria to follow during
data labelling for optimal network training were investigated. The proposed
automated Ghaf tree inspection system could be an essential part of an automated
plantation management system, allowing the industry to check the growth of Ghaf
trees.

• The use of state-of-the-art CNN architectures to develop efficient,
novel computational models for automated multiple-tree type detection
and recognition in drone imagery.
In Chapter 5, we present research that rigorously investigates the potential of optimally
using the popular state-of-the-art CNN architectures (i.e., YOLO, Faster
R-CNN, and SSD), for creating computational models for detecting and recognising
multiple tree types (e.g. Ghaf, Acacia and Date Palm trees) in desert
areas. Trees come in different sizes, species and sub-species, can be differentially
shaded, and may have different lighting conditions, backgrounds, and crops. Extensive
experiments are conducted, and the results are thoroughly compared to
determine the best model for multiple tree detection in drone imagery. It is worth
noting that existing literature on multiple tree detection is yet to include such
rigorous comparisons, and no research has been conducted on detecting multiple
tree types in drone imagery using models developed from SSD, Faster R-CNN, and
YOLO versions 5s, 5m, 5l, and 5x. The model created by the most established
version of YOLO, YOLO-V5, achieves the best performance in multiple tree detection,
with significant differences between the models created by its sub-versions s,
m, l, and x. The developed models are field-tested during the research and the results
are used to further fine-tune the models. According to the investigation carried
out in this thesis, increasing the volume of data used in training results in the best-performing
models, with the models generated by deeper architectures needing
relatively larger amounts of data to reach optimal performance. Experiments also
showed that expanding the amount of training data through data augmentation
helps to improve the results. The proposed automated multiple tree detection
system could become an essential part of the Dubai Desert Conservation Reserve’s plant
management system, enabling them to monitor and control the growth and distribution
of multiple tree types within their reserve.

• The use of state-of-the-art CNN architectures to develop efficient,
novel computational models for automated litter detection in drone
imagery.

In Chapter 6, we present two novel approaches to litter detection in drone imagery.
We investigate the potential use of popular, state-of-the-art CNNs (YOLO,
Faster R-CNN, and SSD), to create novel computational models for litter detection
in desert environments. Through rigorous training, validation, optimisations
and making effective design decisions backed by domain knowledge, we develop
accurate models for litter detection in areas of natural environments and campsites
in desert areas. Commonly found litter items include plastic bottles, paper
bags, polythene bags, boxes, drink cans, wrapping material etc. Therefore, litter
objects are of various shapes, sizes, colours, and transparencies. In addition, different
lighting conditions, backgrounds, presence of vegetation, and partial occlusions (e.g.
being partially buried in desert sand) could cause challenges for automatic litter
detection via computational means. Existing litter detection algorithms mainly
use ground-level cameras for image capture, and machine learning approaches to
detect litter. Unfortunately, given the above listed challenges, such approaches have
failed to detect litter at any reasonable level of accuracy and the studies have been
very much limited to controlled conditions and environments. Deep Neural Network
based models provide a very viable solution to litter detection. However,
no such approaches have been presented in the literature. In this thesis we investigate
the use of state-of-the-art CNN architectures to create accurate models for litter
detection in two different environments. We propose a one-class, i.e., litter only,
approach to litter detection that shows a significant level of accuracy,
particularly when generated based on the deeper CNN architectures such as the
YOLO-V5 ‘l’ and ‘x’ sub-versions. However, this approach fails in campsites
where many other human-made objects are present. Therefore, as an improvement,
we propose the introduction of a second object class, i.e., a human-made
object class. This helps in differentiating litter from non-litter human-made objects
and avoids misclassifications and false detections in campsites. The proposed
automated litter detection systems can positively contribute towards the Dubai
Desert Conservation Reserve’s aim to keep the environment of their reserve clean,
allowing them to monitor litter distribution in a large area of their land relatively
quickly using drones. The models produced can be field-tested, and the resulting
outcomes, together with more data for use within training, can further improve the accuracy
of the proposed models.
The above original contributions have led to the submission of four scientific
publications (see Appendix-A). Further, the resulting software for Ghaf Tree Detection
and Multiple Tree Type Detection has been implemented within the
DDCR’s software platforms, to help officials monitor and manage the trees and
their growth within the reserve. The Litter Detection software will be implemented
both on-board drones and on the reserve’s software platforms for practical use,
in due course.

1.5 Thesis Outline


This thesis consists of seven chapters. Chapter 1 introduces the research context and
the research problem addressed, and highlights the original contributions being
made by the research presented in this thesis. Chapter 2 provides a thorough review
of the literature on machine learning-based and deep learning-based approaches to object
detection and recognition. Chapter 3 reviews the theoretical background on machine
learning and deep neural networks, and defines the objective metrics used in this
thesis to compare the qualitative performance of the various object detection and
recognition models presented in this thesis. Chapter 4 presents the design, development,
implementation details and testing results of novel CNN models presented
in this thesis for the detection and recognition of Ghaf trees. Chapter 5 extends the
work of Chapter 4 to present novel CNN models for multiple tree detection and recognition,
namely of Ghaf, Acacia and Palm trees. Chapter 6 presents novel models
for litter detection in drone imagery, based on state-of-the-art CNN architectures.
Finally, Chapter 7 summarises and concludes the findings of the research presented
in this thesis and recommends possible improvements and further work.
Chapter 2

Literature Review

The different methods for detecting and classifying objects in digital images that
have been proposed in the literature are summarised in this chapter. The sliding
window and region proposal methods, two standard classical object detection
techniques, are described in Section 2.1. Sections 2.2 and 2.3 introduce machine learning-based
and deep learning-based object detection approaches developed and presented
in the literature. The review also helps to highlight research gaps in object detection,
particularly object detection in aerial imagery, which helps to justify the proposed
study’s research focus.

2.1 Classical Object Detection Methods


In digital images and videos, object detection is used to identify objects such as
people, cars, buildings, animals, plants, and more. Numerous applications, such as
video surveillance [43], automated parking systems [44], picture retrieval [45], and
environmental monitoring [46], depend on object detection. It can also be used to
track moving objects in video recordings, such as detecting and tracking a person
or a ball in a football game video [47] [48]. However, computer-based automated
object detection is more challenging than detection carried out by humans,
who can perceive objects with their eyes and instantly pinpoint their location,
owing to cognition (intelligence and memory). Two traditional object detection
techniques, region proposal-based and sliding-window-based, that have been in
widespread use in image and video analysis over the past ten years, are described
below.

2.1.1 Sliding Window-based Method


Sliding-window techniques involve sliding a fixed-size window from the upper-left
to the lower-right corner of the image to search for objects of interest. However,
this method is not suitable for real-time object detection applications because it
is considered an exhaustive search, and thus finding objects in an image can take
a significant amount of time [49].
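To make the exhaustive nature of this search concrete, the following sketch scans an image with a fixed-size window. The window size, stride and the per-patch classifier (anything exposing a predict(patch) method) are illustrative placeholders, not a specific published detector.

```python
# A minimal sketch of an exhaustive sliding-window search; parameters and
# the classifier interface are hypothetical.
def sliding_window(image, window=(64, 64), stride=16):
    """Yield (x, y, patch) for every window position, top-left to bottom-right."""
    h, w = image.shape[:2]
    for y in range(0, h - window[1] + 1, stride):
        for x in range(0, w - window[0] + 1, stride):
            yield x, y, image[y:y + window[1], x:x + window[0]]

def detect(image, classifier, window=(64, 64), stride=16):
    """Return (x, y, w, h) boxes for windows the classifier accepts."""
    boxes = []
    for x, y, patch in sliding_window(image, window, stride):
        if classifier.predict(patch):            # hypothetical per-patch classifier
            boxes.append((x, y, window[0], window[1]))
    return boxes
```

Even with a modest stride, the number of windows grows quickly with image size, which is why such detectors struggle in real time.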

(a) The first frame

(b) The last frame

Figure 2.1: Example of sliding window object detector



In 2010, Subburaman and Marcel [50] proposed an algorithm for detecting
faces in images. They combined a bounding box estimator and a sliding window
to improve the detection speed of a sliding window detector when searching for
objects in an image. They used a binary decision tree to predict a set of patches
that are likely to contain faces in the bounding box estimation stage and then
drew bounding boxes for each predicted face on the face region. They tested their
face detectors to assess the performance of their proposed method and discovered
that it was still not good enough when the faces were obscured by other objects
or faces. In 2013, Comaschi et al. [51] proposed an algorithm to improve the
speed of sliding window-based methods by changing the step size of moving the
sliding window. They developed their proposed method on the assumption that
the sliding window should move quickly when scanning regions with no objects
of interest and slowly when scanning regions with objects of interest. They
applied their algorithm to face detection and discovered that it could increase
the detection speed of sliding-window-based methods by a factor of 2.03 in frames per second.
However, it still could not be used for real-time object detection applications.
In 2015, Jiang et al. [52] proposed a technique for reducing the execution time
of traditional sliding-window-based methods. They proposed an algorithm that
generates flexible-sized sliding windows when searching for objects in an image.
The classifier’s response value determines the size of the next sliding window at
the current window position. The algorithm can increase the sliding window size
in areas with no objects and decrease the size in areas with objects of interest.
This technique can detect objects of various sizes and reduce the execution time of
traditional sliding window-based methods. However, it still needs to be improved
to accurately detect objects when the objects of interest overlap.
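The toy sketch below illustrates only the general idea of response-driven window sizing; it is not the algorithm of [52], and the classifier interface, thresholds and sizes are hypothetical.

```python
# A toy sketch: grow the window where the classifier response is low and
# shrink it where the response is high; all values are illustrative.
def adaptive_scan(image, classifier, min_size=32, max_size=128, stride=16):
    h, w = image.shape[:2]
    size = min_size
    detections = []
    y = 0
    while y + size <= h:
        x = 0
        while x + size <= w:
            patch = image[y:y + size, x:x + size]
            score = classifier.score(patch)       # hypothetical response value in [0, 1]
            if score > 0.5:
                detections.append((x, y, size, size))
                size = max(min_size, size // 2)   # look more closely around objects
            else:
                size = min(max_size, size * 2)    # move through empty areas faster
            x += max(stride, size // 2)
        y += max(stride, size // 2)
    return detections
```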

2.1.2 Region Proposal-based Method


Region proposal-based approaches are another conventional technique for detecting
objects in images. These approaches use an object proposal algorithm to create
a set of object regions likely to include items of interest instead of searching for objects
throughout the whole image with a sliding window. This method surpasses
sliding-window-based algorithms in terms of detection accuracy, while avoiding
the exhaustive search issue they have. Thus, region proposal-based techniques have
become popular for reducing the computing expense of numerous object identification
applications.
Uijlings et al. [53] proposed a method for generating object proposals in images,
named “selective search”. Their proposed method incorporates the Felzenszwalb
and Huttenlocher method [54] to develop an initial object proposal. They then
used a greedy algorithm to calculate similarity values between regions and their
neighbours, before passing them to the classifier. After completing the detection
process, they analysed the experimental results. They claimed that their proposed
method could reduce the execution time of searching for objects in an image, but
the performance may degrade when detecting overlapping objects.
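For illustration, a selective-search-style proposal stage can be run with OpenCV's contributed implementation; this is a generic sketch (assuming opencv-contrib-python is installed) rather than the exact pipeline of [53], and the image path and number of proposals kept are illustrative.

```python
# Generate region proposals with OpenCV's selective search (opencv-contrib).
import cv2

img = cv2.imread("aerial_image.jpg")
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()        # trade proposal quality for speed
rects = ss.process()                    # array of (x, y, w, h) region proposals
candidates = rects[:200]                # keep the top proposals for the classifier
print(f"{len(rects)} proposals generated, {len(candidates)} passed on")
```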

In 2014, Zitnick and Dollár [55] proposed a method for generating bounding boxes
for a set of object proposals by considering edge-based possible object regions.
They used a structured edge detector algorithm to predict object boundaries and
a greedy algorithm to group the edges [56] [57]. Finally, they assigned a score
to each set of edges using a scoring function to rank and define possible
object regions. According to the experimental results, their proposed method can
detect overlapping objects but does not yet achieve the speed required for real-time
object detection applications.

In 2017, Huang et al. [58] presented an algorithm for generating object proposals
to detect ships in remote sensing images. The core of their proposed method
is to generate a set of object proposals using edge detection and structured forest
methods. They then used morphological processing to eliminate some object proposals
that could have been caused by edge detection false positives. Finally, each target
proposal’s edge results are fed into a classifier to identify ships. According to the
experimental results, their proposed method outperforms other methods under the
illumination and interference conditions present in remote sensing images when detecting
ships in images with the sea as the background. However, their proposed method has
limitations when another object, such as a cloud, obscures parts of the ship.
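As a rough illustration of the generic edge-detection and morphological-filtering steps described above (not the authors' exact pipeline), candidate regions can be produced as follows; the thresholds, kernel size and image path are arbitrary choices.

```python
# Generic edge detection + morphological filtering to form candidate regions.
import cv2
import numpy as np

gray = cv2.imread("remote_sensing_scene.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(gray, 50, 150)                              # candidate object boundaries
kernel = np.ones((5, 5), np.uint8)
cleaned = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)    # join broken edge fragments
cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_OPEN, kernel)   # drop small false positives
contours, _ = cv2.findContours(cleaned, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
proposals = [cv2.boundingRect(c) for c in contours]           # (x, y, w, h) candidates
```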

Other methods for generating object proposals, besides those mentioned above,
include Geodesic Object Proposals (GOP) [59], Multiscale Combinatorial Grouping
(MCG) [60], and so on.

2.2 Object Detection in Aerial Imagery Using Machine Learning

Machine learning has become a popular method in the development of object
detection systems as artificial intelligence (AI) technology has advanced.

Figure 2.2: The workflow of object detection implemented using machine learning

Machine learning-based object detection is accomplished by using a classifier
to learn a set of features from training data and then using the classifier to detect
and classify objects in test images after training. Feature extraction and classifier
training are the two main stages in implementing object detection based on
machine learning. Feature extraction involves extracting features from an image
related to the colour, texture, or shape of the object of interest. Feature extraction
can create a feature descriptor to describe and represent an object’s characteristics.
Several feature descriptors, such as Histogram of Oriented Gradients (HOG),
Local Binary Patterns (LBP) [61], Scale-Invariant Feature Transform (SIFT) [62],
Haar-like, colour, and others, can be used to describe and extract features from objects.
These characteristics can be used alone to represent the object’s information
or combined manually to make the object’s characteristics more straightforward.
Following that, feature selection is used to select only the necessary features from the
combined feature vector, increasing the robustness of these features. However,
feature fusion and feature selection methods are optional when using machine
learning to detect objects. The classifier then learns a set of features that can be
used to identify information about the object. SVM, AdaBoost, Conditional Random
Fields (CRF) [63], and other classifiers can be used to learn a set of features
for objects in images.
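As a concrete illustration of the workflow in Figure 2.2, the sketch below extracts HOG descriptors and trains a linear SVM. The names `patches` (equally sized grayscale patches) and `labels` (object vs. background) are assumed to be prepared beforehand, and all parameter values are illustrative.

```python
# A minimal HOG + linear SVM sketch of the feature-extraction and
# classifier-training stages; inputs and parameters are illustrative.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_features(patches):
    return np.array([
        hog(p, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for p in patches
    ])

X_train = extract_features(patches)     # feature extraction stage
clf = LinearSVC()
clf.fit(X_train, labels)                # classifier training stage
# At test time, the same descriptor is computed for each candidate window
# and passed to clf.predict() to decide object vs. background.
```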
In 2016, Redmon et al. presented a novel object detection algorithm called You
Only Look Once (YOLO). YOLO divides the input image into a grid and predicts
the object’s class and bounding box location for each cell in the grid. This way,
YOLO can detect multiple objects in an image at once, making it a fast and
efficient algorithm. YOLO uses a deep convolutional neural network to extract
features from the image, and the last layer of the network predicts the class probabilities
and bounding box coordinates. They claim that their method achieves
state-of-the-art accuracy while maintaining real-time performance. YOLO is now
among the most popular and effective networks in the object detection area.
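For reference, a YOLO model can be applied to a single image with only a few lines using the public YOLOv5 torch.hub interface (this assumes internet access to fetch the ultralytics/yolov5 repository and weights); the model variant and image path below are illustrative.

```python
# Single-image inference with a pre-trained YOLOv5 model via torch.hub.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model("drone_frame.jpg")      # single forward pass over the whole image
results.print()                         # class, confidence and box for each detection
boxes = results.xyxy[0]                 # tensor of [x1, y1, x2, y2, confidence, class]
```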
In 2017, He et al. [64] presented a new object detection method called Mask
R-CNN, an extension of the Faster R-CNN algorithm. Mask R-CNN adds a branch
to the Faster R-CNN network to generate a binary mask for each detected object
in addition to the bounding box and class prediction. The mask branch uses a fully
convolutional network to predict the object mask pixel by pixel. They achieved
state-of-the-art performance on several object detection benchmarks, including
COCO [65] and PASCAL VOC [66].
In 2019, Ghiasi et al. proposed an object detection algorithm named NAS-
FPN [67], which is a combination of Neural Architecture Search (NAS) and Feature
Pyramid Networks (FPN). NAS is used to search for the optimal network archi-
tecture, and FPN is used to extract features at different scales from the input
image. They claimed that their proposed algorithm can achieve state-of-the-art
performance while using fewer parameters than previous methods.
In summary, these object detection methods based on machine learning have
significantly advanced the field of computer vision, enabling the detection of ob-
jects in images captured from aerial viewpoints, which can be used for various
applications such as environmental protection, surveillance, traffic monitoring,
and disaster response.
Tuermer et al. [68] presented a new method for the detection of cars in dense
urban areas in 2013. The first stage of their proposed method involves selecting
road locations in urban areas from a road database. This stage is used to prevent
the car detector from being confused by the similarity of other objects that may
resemble a car, such as a sunroof. They then used HOG feature descriptors to
represent shape-based car features in the second stage. Finally, they trained and
classified the car feature vectors using AdaBoost as a classifier. They tested their
proposed method on test images taken by a drone over downtown Munich and
discovered that it achieved 70% accuracy in detecting cars in dense urban areas.
Meanwhile, Cheng et al. [69] created an object detection framework that uses
a discriminatively trained hybrid model to detect aircraft in aerial images. The
framework is divided into two stages: model training and object detection. The
training images are downscaled using the image pyramid method during the model
training phase. This technique is intended to assist the framework in detecting
aircraft of all sizes. After downsizing the training images, HOG feature descriptors
are used to extract the features and train them with SVM. The features of a
given test image are extracted in the object detection stage by constructing multi-
scale HOG features. The object detection task is then carried out by computing
a threshold based on the response of the mixture model. They analysed the
experimental results and found that their proposed method could detect objects
of various sizes and rotations but could have performed better for small-sized
objects in aerial images.
Cheng et al. [70] presented a technique for detecting multiple objects in aerial
images in 2014. The proposed method is based on a group of part detectors and
is divided into two stages: object training and object detection. During the ob-
ject training phase, the authors collected various objects of interest with varying
orientations. They used an image pyramid technique to zoom in and out of this
set of objects to obtain more training image samples and generate more samples
of objects of interest of various sizes. HOG feature descriptors are extracted from
the training image set and fed into a linear SVM for training. The authors use a
sliding-window-based method to search for objects in the entire image during the
object detection stage. The sliding window’s size is fixed, and the test image is
zoomed in and out with the image pyramid method. HOG features are extracted
at each sliding window step and passed to the SVM classifier to classify objects.
The authors also used non-maxima suppression [71] in a post-processing step after
detection to remove redundant and overlapping bounding boxes. They analysed
the experimental results and claimed that their proposed method can detect mul-
tiple objects with high accuracy but still has a detection speed problem due to
the suppression step. Meanwhile, Xu et al. [72] proposed a rotation-invariant part-based
model to detect complex-shaped objects in high-resolution remote sensing images
in 2014. They extended the original HOG features to rotation-invariant HOG fea-
tures (RIHOG) by evaluating the principal orientation of each region containing
the object of interest. Then they introduced a clustering method to reduce the
proposed method's detection time. Their experimental results show that their
proposed method achieves high accuracy when applied to object detection in
aerial images, but degrades when applied to other object classes.
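
The non-maxima suppression step used as post-processing in the sliding-window detector above can be sketched as follows. This is a minimal, illustrative implementation in NumPy; the overlap threshold of 0.5 is an assumed value, not one reported in the cited works.

import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring boxes and drop boxes that overlap them too much.
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidence values.
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the best box with the remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou < iou_threshold]  # keep only weakly overlapping boxes
    return keep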
Meanwhile, Shi et al. [73] presented an algorithm for detecting ships in high-
resolution satellite images in 2013. They first investigated the problem of ship
detection in high-resolution satellite images. They discovered that the most chal-
lenging task of ship detection is the change in the appearance and background of
the ship in the image. Therefore, they decided to convert the panchromatic im-
age to a pseudo hyper-spectral form and rearrange spatially adjacent pixels into
vectors. The size of each vector produced during the conversion process adds ad-
ditional contextual information that can magnify the difference between the vessel
and the background. They then used the HOG feature descriptor to represent the
ship’s shape. Finally, send the ship’s feature vector to AdaBoost for classification,
and allow the algorithm to learn and classify objects. According to the experi-
mental results, their proposed method could detect ships when they were not close
to each other, but it still missed some detections when detecting the ships close
to land.
In summary, classical machine learning based approaches to object detection
based on HOG features have been presented above. The results obtained by these
algorithms were considered excellent at the time they were published, achieving
accuracy rates of approximately 70%. As with many other classical machine
learning based approaches to object detection, these algorithms are limited in the
depth of feature extraction and in classification accuracy by the restricted fea-
ture set used. Novel deep learning based approaches to object detection address these
limitations.
Texture is a critical feature widely used in aerial image target detection
using machine learning algorithms. Texture features are used to describe local
patterns within an object's surface. Zhong and Wang [74] published a method for
detecting urban areas in remote sensing images in 2007. They divided the input
images (training and testing) into non-overlapping 16×16 blocks at the start of the
proposed method. The researchers then calculated five multi-scale texture features
around each block to capture the general statistical properties of urban areas. The
five texture features are grey-level appearance texture features, Gray-level co-
occurrence matrix (GLCM) [75], Gabor texture features [76], gradient direction
features [77], and line length features. Finally, they used multiple Conditional
Random Fields (CRFs) as base classifiers for each block to learn and classify
feature vectors. They analysed the experimental results and claimed that the proposed
model can outperform a single CRF in terms of detection accuracy while avoiding
the over-fitting problem.
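
As an illustration of how such texture descriptors are computed in practice, the sketch below derives a few GLCM statistics with scikit-image. It is a generic example with assumed parameter values, not the exact configuration used in the cited works; note that older scikit-image releases spell the functions greycomatrix and greycoprops.

import numpy as np
from skimage.feature import graycomatrix, graycoprops

# Placeholder 16x16 grayscale block, quantised to 8-bit levels.
block = (np.random.rand(16, 16) * 255).astype(np.uint8)

# Co-occurrence matrix for a one-pixel offset at four orientations.
glcm = graycomatrix(block, distances=[1],
                    angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                    levels=256, symmetric=True, normed=True)

# Scalar texture statistics that can be concatenated into a feature vector.
features = [graycoprops(glcm, prop).mean()
            for prop in ('contrast', 'homogeneity', 'energy', 'correlation')]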
Senaras et al. [78] used texture features to represent building features in optical
satellite images in 2013. They used a multi-classifier approach known as Fuzzy
Stacked Generalisation (FSG) [79] to classify objects in the proposed system. To
generate the final detection, the detection results of multiple classifiers are com-
bined. They used GLCM as a texture feature descriptor in the feature extraction
stage, combining it with shape features to represent the features of buildings.
After using GLCM for segmentation, they analysed the detection results. They
discovered that their proposed method could detect buildings of various sizes but
had problems detecting buildings with textures similar to the background. The
same year, Aytekin et al. [80] presented an algorithm for detecting runways in
satellite images. The proposed algorithm is based on pattern-recognition operations
over texture features to detect runways. They began by dividing the input satellite image into 32×32-
pixel non-overlapping image patches. The texture features, image intensity mean,
and standard deviation were then used to characterise the runway. They used
six texture features to represent the texture properties of the runway in the tex-
ture representation, including Zernike moments [81], circular-Mellin features [82],
GLCM, Fourier power spectrum [83], wavelet analysis [84], and Gabor filter [85].
The six features were then concatenated into a feature vector for classification
using the AdaBoost algorithm. They analysed the experimental results and con-
cluded that incorporating texture features can improve detection accuracy for
runway detection applications.
Cirneanu et al. [86] presented a method for detecting flooded areas in drone
images in 2015. They used texture features to detect flooded areas. They selected
sample images of flooded areas, extracted their features using the LBP feature
extraction method during the model training phase, and used the LBP feature
descriptor to extract features from an RGB input image’s red and green channels.
As the blue channel contains little information about the flooded areas, the blue channel of the RGB
input image is not chosen to represent the texture information of the flooded area.
In addition, they converted the RGB input image to HSV colour space and only
selected the H component. The texture features of the H component are then
extracted using the LBP feature extractor. The average histogram vector for the
three-colour components was then computed (R, G, and H). The same operations
as in the training phase are performed on the test images in the test phase, and the
flooded areas are classified by comparing the Euclidean distance of the training
image histogram vector to the test image. They tested their proposed method
and claimed it could detect flooded areas in drone images with high accuracy.
Moranduzzo et al. [87] published their findings in the same year, proposing a
multi-class classification method for detecting multiple objects in drone images.
They started by dividing the raw drone image into tiles. The LBP feature extractor
is then used to extract texture features from each tile. The chi-square histogram
distance between each tile's LBP feature vector and those of the tiles
in the training dataset is then calculated to measure similarity. Experimental results show that their
proposed method performs well when only detecting large objects.
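
A minimal sketch of this tile-based LBP matching idea is given below. It is an illustrative reconstruction with assumed parameter values (an 8-sample, radius-1 uniform LBP), not the authors' code: it computes an LBP histogram per tile with scikit-image and compares tiles with the chi-square distance.

import numpy as np
from skimage.feature import local_binary_pattern

P, R = 8, 1          # assumed LBP neighbourhood: 8 samples, radius 1
N_BINS = P + 2       # the 'uniform' LBP variant produces P + 2 distinct codes

def lbp_histogram(tile):
    # Texture descriptor: normalised histogram of uniform LBP codes.
    codes = local_binary_pattern(tile, P, R, method='uniform')
    hist, _ = np.histogram(codes, bins=N_BINS, range=(0, N_BINS), density=True)
    return hist

def chi_square(h1, h2, eps=1e-10):
    # Chi-square distance between two histograms (smaller = more similar).
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

# Placeholder tiles: one query tile and one labelled training tile.
query_tile = np.random.rand(64, 64)
train_tile = np.random.rand(64, 64)
distance = chi_square(lbp_histogram(query_tile), lbp_histogram(train_tile))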
Agular et al. [88] published their findings in 2017. They proposed an algorithm
for detecting pedestrians in drone images. Firstly, they divided all the images into
two categories: training (70 percent) and testing (30 percent). The first set was
used for training, and the second was used to evaluate the proposed method’s
performance. Secondly, they used Haar-like and LBP feature descriptors to rep-
resent pedestrian features. Finally, they fed the combination of these two feature
vectors into AdaBoost for object recognition and classification. They examined
the experimental results and concluded that their proposed method could detect
pedestrians even when they were not approaching.
The papers presented above attempt to use additional features and different
feature combinations, followed by classification, for object detection, obtaining
very good levels of accuracy. However, they are limited by the small number of
features being exploited within the object detection stage, by requiring a human to
be involved in deciding which features to use for a given object detection task
(i.e., the need to do feature engineering), and by using a preselected classifier in the final
stage of object detection. If an unlimited number of features can be exploited, the
feature engineering is performed automatically by a computer and optimised for
the detection task, and the best approach to classification is decided based on exhaustive
investigations, one could achieve much higher levels of accuracy. This is
the focus of Deep Neural Network based object detection approaches.
Although many attempts have been made to detect objects in images captured
by drones using distance-based and machine learning-based methods, these meth-
ods are designed by humans who manually select feature descriptors based on the
application domain. The problem of selecting the correct features to represent
object type information for accurate classification still needs to be solved. The
proposed algorithms must be improved to account for small object size, overlap
or proximity between objects, illumination changes, and background contrast.

2.3 Object Detection in Aerial Imagery Using
Deep Learning

New deep learning methods can be used to learn the right features automatically when
constructing object detection algorithms. Deep learn-
ing systems do not require pre-selected features in the first stage of training since
they can automatically extract crucial features from the raw input image. There
are numerous processing layers in a deep learning network's architecture. Each
layer can learn many aspects of the input image. As there are many layers, the
network can learn the high-level characteristics of the input data. Ivakhnenko [89]
used deep learning techniques for the first time in 1971 when he created the so-
called Group Method of Data Handling (GMDH) utilising a deep neural network.
One of the deep learning networks extensively employed in object detection devel-
opment is the Convolutional Neural Network (CNN). A CNN can extract count-
less high-level characteristics from objects without needing feature pre-selection
during the initial stage of object detection. As a result, in this section, object de-
tection methods for images acquired from aerial perspectives are provided. These
approaches were developed and implemented using deep learning algorithms, par-
ticularly CNN.
Zeggada and Melgani [90] introduced a multi-class classification method in
2016 for identifying eight objects in urban settings using aerial photography. They
began by creating a grid of identical tiles from the supplied image. Then, to extract
the key features of objects in each tile image, they used CNN without a fully
connected layer as a feature extractor. Finally, they used a Radial Basis Function
Neural Network (RBFNN) to deduce the class of each object using the features
of each tile. According to their analysis of the experimental results, the proposed
strategy outperformed the methods employing just an LBP feature descriptor for
detecting various classes of objects.
A CNN-based object detection approach for detecting several items in aerial
photos was also presented by Radovic et al. in 2017 [91]. Their study’s objective
was to create an application for object detection that could be used to survey
urban environments to carry out more research in the domain of city transporta-
tion. The open source ”YOLO” object detection and object classification platform
was the foundation for the CNN used in the suggested method. Twenty-four con-
volutional layers, two fully connected layers, and one final detection layer made up the
YOLO network design. The input image was segmented into S × S grid cells by
YOLO. Then, each grid cell forecasted bounding boxes and gave each bounding
box an object class likelihood. The trial findings demonstrated that their sugges-
ted strategy had a 97.8 percent accuracy rate for identifying vehicles and buses in
urban areas in aerial photographs. The method used a pre-trained initial YOLO
version to detect commonly occurring object classes.
To locate and count olive ridley sea turtles in aerial photos, Gray et al. [92]
used a CNN. They put drone photos taken during a maritime survey in Ostional,
Costa Rica, during a significant nesting episode, to the test. Two groups were
created from all the photographs. The olive ridley sea turtles in the first group of
photographs were manually labelled with bounding boxes for training the CNN.
The images of the olive ridley sea turtles in the second group of images were saved
for evaluating the performance of the trained CNN model. The test image was
partitioned into a grid of tiles once the CNN had been trained. Each tile image,
of size 100 × 100 pixels, was sent to the trained model to identify olive ridley sea
turtles. They analysed the trial findings and found that the number of sea turtles
counted using the trained CNN model was equal to the number of sea
turtles obtained by manual counting. Saqib et al. [93] built a real-time object detection
program for a drone in 2018, to find aquatic life. Their research’s objectives were
to recognise and count the dolphins and stingrays in real-time. They put the
technique into practice using Faster R-CNN to identify both animals. Faster
R-CNN discovered objects by creating feature maps of potential object regions.
In the second stage, a network classifier categorised the potential object regions
into different types of objects. They examined the detection data and concluded
that while the proposed method’s detection accuracy was quite good, its detection
speed required improvement.
In 2019, Rohan et al. [94] presented a real-time object detection application for
use in a drone for detecting fixed and moving objects. They used an SSD object
detector to create a real-time object detection program, which was based on the
VGG-16 network architecture but did not use the fully connected layers. Instead,
they expanded the network's convolutional layers, enabling the SSD to extract
more features at different feature map scales. According to their analysis of the
experimental findings, the proposed real-time object detection program achieved
98 percent accuracy when detecting just one class of object in photos. However,
the accuracy decreased when applied to recognising several classes of objects.
Meanwhile, in 2019, Hong et al. [95] focused on detecting birds in drone imagery
to examine the performance of five CNN-based object detection approaches, including
Faster R-CNN, Region-based Fully Convolutional Network (R-FCN) [96], SSD,
RetinaNet [97], and YOLO. Both detection speed and accuracy were evaluated
to compare the performance of the five CNN-based object detection techniques.
After analysing the detection findings, they concluded that Faster R-CNN had the
highest detection accuracy, and YOLO had the fastest detection speed.
In addition to using the already-existing CNN-based object detection tech-
niques, other researchers designed their own CNN-based networks. Long
et al. [98] proposed a feature-fusing deep network (FFDN) for recognising small
objects in aerial photos in 2019. The envisioned network had three main parts. To
learn the deep hierarchical properties of objects, they used convolutional restric-
ted Boltzmann machines in the first component. The second component employed
conditional random fields to create the spatial relationship between neighbouring
objects and backgrounds. In the third component, a deep sparse auto-encoder
combined the results of the first and second components. After examining the
experimental data, they asserted that their suggested approach might strengthen
CNN’s feature representation while enabling the detection of small objects against
a challenging background.
The literature presented above shows that existing deep neural network-based
approaches either use pre-trained networks to detect well-known objects (e.g.
humans, cars, buildings etc.) from a distance, or use conventional deep neural net-
works without the necessary fine-tuning to train specifically to overcome chal-
lenges around detecting specific objects, particularly for object detection in arid
conditions.
The research presented in chapters 4–6 of this thesis focuses on the challenges
of detecting specific named objects and other regions of interest in dry environ-
ments with significant variations of object size, appearance, colour, occlusion and
different environmental conditions, using fine-tuned and trained state-of-the-art
machine learning methods and detecting distinct objects utilising deep learning-
based methodologies. Uniquely, each system proposed has been carefully designed
and calibrated to function best in the scenarios and application areas under con-
sideration, considerably expanding our understanding of object detection and clas-
sification. To the best of the authors' knowledge, the detection of such objects has not
been carried out using Deep Neural Network technology in the literature. Hence the
work presents novel findings of scientific significance and practical use.
Chapter 3

Theoretical Background

The theory, procedures, and methodologies of the research described in this thesis
are supported by the scientific background information provided in this chapter.
The foundational theoretical ideas of the machine learning approach are presented
in the first section. The background of deep learning is discussed in the second section,
with a focus on CNNs, providing crucial background information on the well-known
Convolutional Neural Networks (CNNs) used in the contributory chapters of
this thesis, Chapters 4, 5 and 6.

3.1 Machine Learning

Machine learning involves discovering rules or features through a learning al-
gorithm and training data. Supervised and unsupervised learning are the two
primary methods used in machine learning. Different applications, datasets, and
purposes require different types of learning. Supervised learning is used to solve
classification and regression problems, whereas unsupervised learning is used for
clustering and density estimation. The learning objective of supervised learning is
to identify input instances with their desired outcomes, while unsupervised learn-
ing aims to learn input patterns without the desired outcomes and group them
based on their structures. Supervised learning techniques are utilised for object
recognition.
The next sections of the chapter describe the theoretical and conceptual aspects
of supervised learning and its application in recognising objects through machine
learning (3.1.1). These sections are specifically linked to solving classification
problems.


3.1.1 Supervised Learning


The concept of learning from examples has been formalised through supervised
learning. The primary objective of supervised learning is to train a learning al-
gorithm using a set of training data comprising instances with matching outputs
(or ground truth), so that it can predict an output $y'$ for each new input data
sample $x'$ using the acquired knowledge.
Before beginning supervised learning, all input data must be divided into train-
ing and testing datasets. The input data is typically split at random into two parts, with
20% kept for testing and 80% used for training (in the YOLO experiments in this
thesis, 20% is used for testing, 20% for validation and 60% for training). The training
dataset is utilised to train a neural network and develop a model, while the test
dataset is used to evaluate the performance of the trained model.
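
A minimal sketch of such a split using scikit-learn is shown below. It is an illustrative example only; the 60/20/20 proportions follow the split described above, and the data arrays are placeholders.

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder dataset: 100 samples with 10 features each, and binary labels.
X = np.random.rand(100, 10)
y = np.random.randint(0, 2, size=100)

# First hold out 20% for testing, then carve 20% of the whole set for validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,
                                                  random_state=0)  # 0.25 * 0.8 = 0.2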
A training and testing data set in supervised learning can be described by
equations (3.1a) and (3.1b), respectively.

$$D_{train} = \{(x_i, y_i)\}_{i=1}^{N_{train}} = \{(x_1, y_1), \ldots, (x_{N_{train}}, y_{N_{train}})\} \qquad (3.1a)$$

$$D_{test} = \{(x'_i, y'_i)\}_{i=1}^{N_{test}} = \{(x'_1, y'_1), \ldots, (x'_{N_{test}}, y'_{N_{test}})\} \qquad (3.1b)$$

Here $y_i$ corresponds to the output of a training sample $x_i$, while $y'_i$ corresponds to
the output of a test sample $x'_i$. $N_{train}$ and $N_{test}$ refer to the number of training
and testing samples in the training dataset and testing dataset, respectively. $D_{train}$
and $D_{test}$ represent the sets of training and testing data, respectively. Figure 3.1
depicts the supervised learning process.

Figure 3.1: The learning process of supervised learning

The two primary stages of supervised learning are the training and testing
phases, as depicted in Figure 3.1. Prior to the training phase, it is necessary to
prepare the training data and their corresponding labels (ground truth). Feature
extraction is then performed to extract the characteristics of the training data.


Various feature extraction techniques, such as HOG, local binary patterns (LBP),
and scale-invariant feature transforms (SIFT), can be employed to extract the
features of the input data.
HOG and LBP feature extractors are frequently used to extract characterist-
ics of objects based on shape and texture, respectively, in object identification
and classification based on machine learning. However, more than one feature
descriptor may be required to capture all of an object’s characteristics. This is
where feature fusion comes in - during this process, one or more feature descriptors
are combined to better understand the characteristics of the training data. How-
ever, adding multiple characteristics can lengthen the processing time of the model,
and some features may even be unnecessary for the task at hand. Therefore, an
optional step named ”Feature selection” is often performed to choose the most
relevant features.
After the feature fusion and selection processes, the selected features are used as
input to supervised learning algorithms, such as support vector machines, decision
trees [99], Naive Bayes [100], and neural networks. These algorithms learn from
the features of the training data and map them to their corresponding labels,
resulting in a classification model or trained model that can predict the classes of
the testing data.
In the testing phase, the testing data must be prepared in a similar manner as
the training data. This involves feature extraction and feature selection. Then,
the output features from this initial stage are fed into the trained model to predict
the class of the item based on its features.

3.2 Deep Learning


A particular branch of machine learning is called deep learning, which aims to use
a neural network with multiple layers to learn complex data. It is also referred
to as deep structured learning or hierarchical learning in real-world applications.
Unlike traditional machine learning algorithms, deep learning does not require the
features to be defined in the first stage. Instead, it uses several layers to transform
the input data into various levels to extract and learn the features of the input
data. Figure 3.2 illustrates the workflow of the deep learning algorithm.
At the beginning of the learning process, the input data is supplied to the
first layer of the deep learning algorithm, and each layer is parameterised by its
weights. The deep sequence layers modify the data in a series of steps to extract
and learn the features from the raw input data. During learning, the weights of
each layer are changed to match the training samples. The loss function compares
the prediction of the input data with the actual response to change the network
weights.
Deep Belief Networks (DBNs), Recurrent Neural Networks (RNNs), Deep
Feedforward Networks (DFNs), Generative Adversarial Networks (GANs), and
Convolutional Neural Networks (CNNs) are some of the various deep neural net-
works that have been developed over the past two decades. Among these networks,
CNNs are the dominant deep neural network that performs best on visual iden-
tification tasks, and they are the main subject of the remaining sections of this
chapter.

Figure 3.2: A workflow of a deep learning algorithm

3.2.1 Fundamentals of Neural Networks


This section explains the foundations of a neural network (NN), which is one of
the essential tools in deep learning. The NN is composed of three main layers:
the input, hidden, and output layers. Each neuron in the layer below is fully
connected to every neuron in the layer above it, and each layer contains multiple
neurons. The input layer’s input must be a vector. When an image is used as
the input, it must be flattened into a one-dimensional vector. For instance, if the
input image is a grayscale image with dimensions of 13 × 13 pixels, the flattened
vector's dimension should be 1 × 169, and the input layer should have 169
neurons that are fully connected to every neuron in the following layer.
Figure 3.3 illustrates an example of an NN architecture using a grayscale image
with a 13 × 13 pixel input size. As shown, the NN contains one hidden layer and
one output layer in addition to the input layer with 169 neurons. The output layer
has ten neurons that recognise the type of handwritten digit in the input image.

Figure 3.3: Neural network architecture for an input image of size 13 x 13 pixels
with one hidden layer and ten neurons in the output layer from [30][31]

In a NN, there may be more than one hidden layer. A NN might have one
or more hidden layers for handling more complex issues and data. However, the
architecture of the NN will become more complex with more hidden layers. The
output layer is responsible for creating the NN’s output. The number of neurons
in the output layer needs to be proportional to the kind of work done by the
neural network. For instance, if the neural network is classifying a handwritten
digit (0-9), the 10 neurons in the output layer should correspond to each class, as
shown in Figure 3.3.
Each neuron in a NN computes its output by multiplying each input $x_i$ by its weight
$w_i$, summing the results together with a bias $b$, and then passing the result
through an activation function $f$ to produce an output $Y'$. The following equa-
tion can be used to calculate the output $Y'$ of a neuron with $n$ inputs:

$$Y' = f\Big(\Big(\sum_{i=1}^{n} w_i x_i\Big) + b\Big) \qquad (3.2)$$

Figure 3.4: Example of the operation of a neuron in a neuron network

A figure of a neuron in a NN in action can be seen in Figure 3.4, which
demonstrates how this neuron takes two input values ($x_1$ and $x_2$) connected
to two weights ($w_1$ and $w_2$). The weights of the inputs $x_1$ and $x_2$ are $w_1$ and $w_2$,
respectively. Each weight can be changed during the learning process by utilising
backpropagation to optimise and match it with the input data. Each weight con-
trols the strength of the connection between a neuron of the previous layer and
a neuron of the current layer. The results of the multiplications are then added
together with a bias. Finally, an activation function receives the summed result
and produces the output $Y'$. The activation function introduces nonlinearity into
the value of the neuron's output $Y'$. The activation functions extensively employed
in NNs are sigmoid, hyperbolic tangent, ReLU, leaky ReLU, and softmax. Each
function has unique properties.
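
A minimal numerical sketch of this single-neuron computation is given below; it is an illustrative example with made-up inputs, weights and bias, using ReLU as the activation function.

import numpy as np

def relu(z):
    # Activation function: pass positive values through, clamp negatives to zero.
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2])   # two inputs x1, x2
w = np.array([0.8, 0.3])    # their weights w1, w2
b = 0.1                     # bias

z = np.dot(w, x) + b        # weighted sum plus bias, as in equation (3.2)
y_out = relu(z)             # neuron output Y' after the activation function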
Activation Function

The function that converts the weighted total of each neuron’s inputs into output is
called an activation function. The activation function may also be referred to as a
transfer function. It is used to introduce nonlinearity into a neuron’s output value.
NNs use a variety of activation functions, including softmax, sigmoid, hyperbolic
tangent, ReLU, and leaky ReLU. These activation functions are briefly described
in the sub-section that follows.
Sigmoid: The sigmoid activation function is a logistic function that takes any
real value as input and produces an output value between 0 and 1. As shown
in Figure 3.5, the sigmoid activation function has an S-shaped curve. From the
curve, it can be observed that when the input value is very large or positive, the
sigmoid activation function will map it towards 1. Similarly, when the input value
is very low or negative, the sigmoid activation function will map it closer to 0.
The formula for the sigmoid activation function is as follows:

$$\sigma(z) = \frac{1}{1 + e^{-z}} \qquad (3.3)$$

Figure 3.5: The curve of sigmoid activation function

Hyperbolic tangent: An improved variation of the sigmoid activation func-


tion with a different range for the output value is the hyperbolic tangent, or tanh
activation function. This function maps any real input value to an output value
between -1 and 1. Figure 3.6 displays the tanh activation function curve. The
convergence rate of the tanh activation function is higher than the convergence
rate of the sigmoid activation function, as seen in Figure 3.6. The formula for the
tanh activation function is as follows:
$$\tanh(z) = \frac{2}{1 + e^{-2z}} - 1 \qquad (3.4)$$

Figure 3.6: The curve of tanh activation function

ReLU: Due to its computational simplicity, the ReLU activation function has
become one of the most used activations in NNs, particularly in CNNs. This func-
tion converts all negative numbers to zero, leaving all positive values unchanged.
ReLU stands for Rectified Linear Unit. Figure 3.7 displays the ReLU activation
function curve. The formula for the ReLU activation function is as follows:

Relu(z) = max(0, z) (3.5)

Figure 3.7: The curve of ReLU activation function



Leaky ReLU: The leaky ReLU activation function improves upon the ReLU
activation function by addressing the ”dying ReLU” problem. This function at-
tempts to convert any negative values that were changed to zero in the ReLU
activation function to non-zero values by multiplying all negative values by a
small constant number m, known as the leak factor, typically set to 0.01. The
curve for the leaky ReLU activation function is shown in Figure 3.8. As a result,
the Leaky ReLU activation function’s formula is as follows:

LeakyRelu(z) = max(mz, z) (3.6)

Figure 3.8: The curve of Leaky ReLU activation function

Softmax: The function commonly used in the output layer of NNs for cat-
egorizing inputs into multiple categories is called the softmax activation function.
This function normalizes the outputs for each class between 0 and 1 to calculate
the likelihood that an input belongs to a particular class. The softmax activation
function’s formula is as follows:

$$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \qquad \text{for } i = 1, 2, \ldots, K \qquad (3.7)$$

Here $K$ is the number of classes in the classification problem and $z_i$ is the
value received from neuron $i$ of the output layer. The specific value of $K$ depends
on the problem being addressed.
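
The activation functions above can be sketched in a few lines of NumPy; the implementation below is illustrative, following equations (3.3) to (3.7), with the leak factor set to the typical value of 0.01 mentioned earlier.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))              # equation (3.3)

def tanh(z):
    return 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0  # equation (3.4)

def relu(z):
    return np.maximum(0.0, z)                    # equation (3.5)

def leaky_relu(z, m=0.01):
    return np.maximum(m * z, z)                  # equation (3.6)

def softmax(z):
    e = np.exp(z - np.max(z))                    # shift for numerical stability
    return e / e.sum()                           # equation (3.7)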

Forward and Backward Propagation

The NNs employ both forward and backward propagation as effective techniques.
The process of passing inputs to a group of neurons in the first layer and passing
the neurons' outputs through to the last layer (the output layer) to generate a result is
referred to as "forward propagation" (or the "forward pass"). The difference between
the predicted output of the neural network and the correct response (ground truth)
is calculated using the loss function after the neural network has produced a result.
It is then used as a feedback signal to modify the network’s weights. The network
weights initialised in the first stage using weight initialisation methods such as zero
initialisation, random initialisation, Xavier initialisation [101], Kaiming initialisa-
tion [102], etc., are updated using a process known as backward propagation, also
known as back-propagation. The weights are continuously adjusted throughout
the training to align with the input data set, which is referred to as the learning
process.
However, various loss functions can be employed to determine the loss score in
NNs, including mean squared error (MSE), binary cross-entropy, and multi-class
cross-entropy (or categorical cross-entropy). Different scenarios call for the use of
each function. For instance, the loss score in a regression problem is calculated using
the mean squared error. In a binary classification problem, binary cross-entropy is used,
while in a multi-class classification problem, multi-class cross-entropy is employed.
The following is a list of the formulas used to determine the loss of each function:

Mean Squared Error:

$$L_{MSE} = (y - y')^2 \qquad (3.8)$$

Binary Cross-Entropy:

$$L_{BinaryCE} = -\left[y\log(y') + (1 - y)\log(1 - y')\right] \qquad (3.9)$$

Categorical Cross-Entropy:

$$L_{MultiCE} = -\sum_{i=1}^{C} y_i \log(y'_i) \qquad (3.10)$$

Here $y'$ is the predicted value, $y$ is the actual value and $C$ is the number of
classes.
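
The three loss functions can be written directly from equations (3.8) to (3.10); the NumPy sketch below is illustrative, with a small epsilon added inside the logarithms purely to avoid log(0).

import numpy as np

EPS = 1e-12  # guard against log(0)

def mse(y_true, y_pred):
    return (y_true - y_pred) ** 2                        # equation (3.8)

def binary_cross_entropy(y_true, y_pred):
    return -(y_true * np.log(y_pred + EPS)
             + (1 - y_true) * np.log(1 - y_pred + EPS))  # equation (3.9)

def categorical_cross_entropy(y_true, y_pred):
    # y_true is a one-hot vector, y_pred a vector of class probabilities.
    return -np.sum(y_true * np.log(y_pred + EPS))        # equation (3.10)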

Hyperparameters in neural network

Hyperparameters are a group of variables that can be utilised to regulate the
learning process of a neural network. A set of hyperparameters must be estab-
lished before a neural network can be trained. The hyperparameters learning rate,
momentum, decay, batch size, and epoch significantly affect the neural network's
capacity for learning.
The learning rate is used to regulate how much the neural network’s weights
are modified concerning the loss gradient. The momentum is used to manage the
degree to which the previous weight update influences the current weight update.
In a neural network, decay is used to regulate how quickly the learning rate declines
at each weight update. The batch size is the number of samples sent through the
neural network and processed before the weights are updated.

For instance, if the batch size is set to 8 and there are 800 training samples,
the algorithm will use the first eight examples (from the first to the eighth) from
the training dataset to train the network. The network is then trained again using
the training dataset’s subsequent eight samples (from the ninth to the sixteenth).
This process will be repeated until all training samples have gone through the
network once.

The epoch specifies the number of times the neural network will traverse the
entire dataset.
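
These hyperparameters appear together in a typical mini-batch training loop; the sketch below is schematic only, and the dataset size, hyperparameter values and the train_step function are placeholders rather than settings used in this thesis.

import numpy as np

learning_rate = 0.01      # how strongly weights react to the loss gradient
batch_size = 8            # samples processed before each weight update
epochs = 5                # full passes over the training set
num_samples = 800         # as in the example above: 800 / 8 = 100 updates per epoch

X = np.random.rand(num_samples, 10)        # placeholder training data
y = np.random.randint(0, 2, num_samples)   # placeholder labels

def train_step(x_batch, y_batch, lr):
    # Placeholder for one forward pass, loss computation and weight update.
    pass

for epoch in range(epochs):
    for start in range(0, num_samples, batch_size):
        x_batch = X[start:start + batch_size]
        y_batch = y[start:start + batch_size]
        train_step(x_batch, y_batch, learning_rate)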

3.2.2 Convolutional Neural Networks

A deep neural network, called a ”Convolutional Neural Network” (CNN), is used


to process image data. This network comprises convolutional layers and a group
of linked neurons. Using convolutional layers and linked neurons, the CNN can
automatically pick out high-level properties of objects without requiring a human
to do so during the pre-processing stage.

The first version of the CNN was created by Fukushima in 1980 [103], but it
gained popularity in 2012 after Krizhevsky et al. [104] suggested using AlexNet,
a CNN-based system, to categorise the 1.2 million high-resolution images in the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) into 1,000 different
classes. Since AlexNet performed well in the ILSVRC competition, other re-
searchers were inspired to create many CNN models. It quickly gained popularity
in various computer vision applications, including image classification, object re-
cognition, and image segmentation.

The CNN’s architecture is inspired by visual perception and has three primary
divisions. The first division is the input layer, followed by the hidden layers, which
comprise numerous convolutional layers, activation processes, pooling layers, and
a fully connected layer. The third part is the output layer. The CNN’s structure
is depicted in Figure 3.9 [105].

Figure 3.9: Convolutional network architecture from [105]

Convolutional Layer

The foundational element of a CNN is a convolutional layer. This layer seeks to
identify object features from the supplied image. The input convolutional layer in
the CNN is in charge of capturing the low-level image features.

The convolutional layer is made up of several learnable convolution kernels
(also known as filters) that are used to extract and learn various low-level features
of the image, such as edges, colours, and gradient orientations. The deeper
convolutional layers build on these to capture higher-level features of objects.

Convolution is carried out by sliding a kernel over an input image, typically
starting at the top-left corner and moving towards the bottom-right corner with
a chosen stride size. A stride size of 1 denotes a 1-pixel kernel movement. The
kernel moves across the image from the left-hand side to the right-hand side one
stride at a time; when the kernel reaches the top-right corner of the input image,
it shifts down one pixel and returns to the left-hand side. This operation is
repeated until the kernel reaches the bottom-right corner of the input image.

At each step of the kernel, the convolutional layer performs the convolution
operation by multiplying the kernel values by the corresponding image pixel values
and then adding the multiplication results to produce a feature map. An example
of a convolution operation on a 6×6 pixel input image with a 3×3 pixel kernel is
shown in Figure 3.10.
Figure 3.10: Convolution operation on a 6×6 image with a 3×3 kernel



Figure 3.10 illustrates how the convolution operation on an image with a 6×6
pixel resolution and a kernel with a 3×3 pixel resolution results in a 4×4 pixel res-
olution output feature map. Equations (3.11) and (3.12) can be used to determine
the size of the output feature map if the image is $I_H$ rows by $I_W$ columns and the
kernel is $K_h$ rows by $K_w$ columns.

$$O_{height} = I_H - K_h + 1 \qquad (3.11)$$

$$O_{width} = I_W - K_w + 1 \qquad (3.12)$$

Where IH and IW are an input image’s height and width, Kh and Kw are a
kernel’s height and width, and Oh eight and Ow idth are an output feature map’s
height and width, respectively. However, the number of kernels used to perform
the convolution operation on the input image determines the number of output
feature maps. As each kernel is used to identify di↵erent aspects of an input image,
applying multiple kernels to the same input will result in various output feature
maps from the same input image. For instance, the first kernel is used to identify
the input image’s vertical edges, the second kernel is used to identify the input
image’s horizontal edges, and the third kernel is used to sharpen the image. As a
result, utilising K kernels to perform the convolution operation on the same input
picture will result in K feature maps, which are then combined to create the final
output of the convolutional layer (see Figure 3.11) [106]. The size of the kernels
in a convolutional layer is often set to an odd number, such as 3, 5, 7, or 11, to
extract the features of an input image.

Figure 3.11: Example of a stack of feature maps from [106]

When performing the convolution operation on an input image, padding is an
additional method that can be used to help the convolution operation work better
by retaining information at the edges of an input image. An extra set of pixels can
be added around an input image’s edge to perform the padding technique. Zero-
padding is a common practice that involves setting the additional pixels’ values to
zero. Figure 3.12 illustrates how zero-padding is applied to the 6×6 input picture
from Figure 3.10 and how an output feature map is created using a 3×3 kernel
and a single stride value.

Figure 3.12: Convolution operation on a 6×6 image with zero-padding and a 3×3
kernel

Figure 3.12 demonstrates how the input image's size was enlarged from 6×6 to
8×8 pixels by adding extra pixels with a value of 0 around the edge of the image.
It also illustrates how the size of the output feature map was increased compared
with performing the convolution without padding. As a result, the following formulas may
be used to determine the size of the output feature map after the convolution
operation:
$$O_{height} = \frac{I_H - K_h + 2P}{S} + 1 \qquad (3.13)$$

$$O_{width} = \frac{I_W - K_w + 2P}{S} + 1 \qquad (3.14)$$

Where IH and IW are an input image’s height and width, Kh and Kw are
a kernel’s height and width, P is a convolution operation’s padding value (for
instance, if the input image had one extra pixel added around the boundary, the
P value should be 1), S is the convolution operation’s stride value, and Oheight and
Owidth are an output feature map’s height and width. Equation (3.15) describes the
formula to calculate the convolution operation, where I ranges from 1 to Oheight ,
j ranges from 1 to Owidth , K is a kernel, and I is an input image.

$$O(i, j) = \sum_{k=1}^{K_h} \sum_{l=1}^{K_w} I(i + k - 1,\, j + l - 1)\, K(k, l) \qquad (3.15)$$
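
A direct NumPy sketch of equation (3.15), with stride 1 and no padding, is shown below; it is illustrative only, since real CNN frameworks implement this operation far more efficiently.

import numpy as np

def conv2d(image, kernel):
    # Valid convolution (stride 1, no padding), following equation (3.15).
    i_h, i_w = image.shape
    k_h, k_w = kernel.shape
    o_h, o_w = i_h - k_h + 1, i_w - k_w + 1       # equations (3.11) and (3.12)
    output = np.zeros((o_h, o_w))
    for i in range(o_h):
        for j in range(o_w):
            region = image[i:i + k_h, j:j + k_w]  # patch currently under the kernel
            output[i, j] = np.sum(region * kernel)
    return output

image = np.random.rand(6, 6)                              # 6x6 input as in Figure 3.10
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])   # example vertical-edge kernel
feature_map = conv2d(image, kernel)                       # 4x4 output feature map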

When performing a convolution operation on images with multiple channels (such


as an RGB colour image), the kernel size must have the same depth as the input
image. This ensures that each channel of the input image is convolved with its
corresponding kernel weights, allowing for the extraction of spatial features inde-
pendently within each channel. The depth of the kernel determines the number
of filters applied to each channel, enabling the convolutional operation to cap-
ture different aspects of information across the input image's colour channels. By
maintaining consistency in the depth of the kernel and the input image, the convo-
lutional layer can effectively process multi-channel images and extract meaningful
features that encompass both spatial and colour-related characteristics. For ex-
ample, if the input image is an RGB colour image, the kernel’s depth must be
three to match the input image’s three colour channels. Figure 3.13 shows the
application of a 3⇥3⇥3 kernel’s convolution operation to an RGB image with a
6⇥6⇥3 pixel size. As can be seen, each kernel channel carries out the convolution
operation on the corresponding channel of the input image. For example, the con-
volution operation of the input image’s red, green, and blue channels is performed
using the kernel’s channel 1, channel 2, and channel 3, respectively. The output
feature map is then created by adding the channel results at each moving kernel
step.
Figure 3.13: Convolution operation on a 6×6×3 image with a 3×3×3 kernel



Pooling Layer

The layer responsible for reducing the spatial dimension of the output feature map
is known as the pooling layer. This layer is typically applied after the convolu-
tional layer has extracted the relevant features, and it serves multiple purposes in
the convolutional neural network (CNN) architecture. Firstly, before the pooling
layer is applied, an activation function is used to introduce non-linearity into the
output feature map. This activation function transforms the values within the
feature map, allowing for the modelling of complex relationships and enhancing
the network’s ability to learn intricate patterns and representations. The pooling
layer’s primary role is to downsample the feature map, e↵ectively reducing its
spatial dimensions. By aggregating information within local receptive fields, such
as max pooling or average pooling, the pooling layer decreases the resolution of
the feature map while preserving the most salient features. This downsampling
operation contributes to reducing the computational complexity of the network,
as it decreases the number of parameters and subsequent computations required
in the network’s subsequent layers. Moreover, the pooling layer introduces several
benefits to the CNN model. It enhances the network’s robustness to noise and dis-
tortions in the input data by capturing the most prominent features within local
regions, thus reducing the impact of irrelevant variations. Additionally, pooling
helps to address the issue of overfitting by promoting generalisation. By summar-
ising the information within each pooling region, the pooling layer encourages the
CNN model to focus on the most significant features while discarding less relevant
or noisy details, which can lead to better generalisation performance on unseen
data.
The pooling layer down-scales the feature map’s size by reducing its height
and width while maintaining its depth. The pooling layer’s technique is similar to
how an image is resized in image processing. In CNN designs, the pooling layer
performs a pooling operation on an output feature map using a 2×2 pixel-sized
kernel and a 2-pixel stride. There are two primary types of pooling operations
in the pooling layer: maximum pooling and average pooling. Below is a detailed
description of each pooling approach:
Max Pooling: The goal of the pooling procedure known as ”max pooling”
is to choose the highest value possible from the area of the feature map that the
kernel covers. The output of the max-pooling layer is a feature map that includes
the most noticeable features from the prior feature map. A sample of the max
pooling process using a 2×2 kernel and a 2-stride value is shown in Figure 3.14.
This feature map was obtained from Figure 3.10 and is called a ”rectified feature
map.”
Figure 3.14: Example of the max pooling operation



Average Pooling: The goal of average pooling is to determine each element's
average value inside the kernel-covered feature map area. An example of an average
pooling procedure using a 2×2 kernel and a 2-stride value is shown in Figure 3.15.
This feature map was obtained from Figure 3.10 and is called a ”rectified feature
map.”

Figure 3.15: Example of the average pooling operation
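
Both pooling operations can be sketched in a few lines of NumPy; the implementation below is illustrative, using the 2×2 window and stride of 2 described above.

import numpy as np

def pool2d(feature_map, size=2, stride=2, mode='max'):
    # Downsample a 2-D feature map with a size x size window.
    h, w = feature_map.shape
    o_h, o_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((o_h, o_w))
    for i in range(o_h):
        for j in range(o_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max() if mode == 'max' else window.mean()
    return out

fmap = np.random.rand(4, 4)             # e.g. a 4x4 rectified feature map
max_pooled = pool2d(fmap, mode='max')   # keeps the most prominent responses
avg_pooled = pool2d(fmap, mode='average')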


Fully Connected Layer

A CNN uses a fully connected layer as its last layer. After the final convolutional
or pooling layer has produced the final output of the feature map, this layer is
used to learn and classify the features. However, the fully connected layer can
only work with one-dimensional input data. Therefore, the output feature map’s
multi-dimensional data must go through a flattening process to convert the multi-
dimensional matrix to a 1-dimensional one. After flattening, each feature map
component is configured to connect to every neuron in the fully connected layer.
Figure 3.16 [107] shows an example of the fully connected layer and the flattening
operation being applied to a max-pooled feature map derived from Figure 3.14.

Figure 3.16: Example of fully connected layer and flatten operation from [107]

3.3 CNN-based object detection


In this section, we will discuss three popular CNN-based object recognition tech-
niques that have been widely used for object detection in images. There are two
categories of widely used CNN-based object detection techniques. The first
category comprises two-stage object detection approaches such as Faster R-CNN. The second
category includes one-stage object detection techniques such as YOLO and SSD.

3.3.1 Faster Region-based Convolutional Neural Network


The Faster Region-based Convolutional Neural Network (Faster R-CNN) is an improved
version of R-CNN and Fast R-CNN. Faster R-CNN was presented by
Ren et al. in 2015. The creators of Faster R-CNN aimed to improve the detection
speed and accuracy of both R-CNN and Fast R-CNN by eliminating the selective
search used to generate region proposals in the first stage.

Figure 3.17: Network structure diagram of Faster R-CNN from [108]

Faster R-CNN replaces the selective search of Fast R-CNN and R-CNN with a
Region Proposal Network (RPN). The region proposals produced by the RPN are
then supplied, in the second stage, to a classification layer, which is used to
identify object locations and the class to which each object belongs [108]. The
relationship between the RPN and R-CNN is
shown in Figure 3.17.

3.3.2 Single Shot Multibox Detector


The Single Shot MultiBox Detector (SSD) [109] is a single-stage object detector built
on a CNN. Using the VGG-16 network as a primary network with six extra
convolutional layers added after VGG-16, the inventors of SSD created the SSD
network in 2016.

Figure 3.18: Architecture of SSD from [109]

SSD predicts bounding boxes from different-sized feature maps. The prediction
layers in SSD are Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2.
In the case of using an image of size 300×300 as input, referred to as SSD300, these
prediction layers will predict the bounding boxes. The SSD detector can identify
objects of different sizes thanks to the feature maps' predictions at multiple scales.
After detection, SSD uses non-maximum suppression to aggregate redundant and
overlapping bounding boxes into a single box to produce the final detection res-
ult. However, SSD performs worse than Faster R-CNN when identifying small
objects. In SSD, only the higher-resolution feature map layers can recognise small
objects. However, the feature information from these layers is less valuable for
categorisation because they only provide low-level object characteristics, such as
edges or colour patches [110].

3.3.3 You Only Look Once


Another one-stage object detector, You Only Look Once (YOLO), was created to
shorten the detection time of two-stage object detectors [111]. Since it was first
proposed by Redmon et al. in 2016, five versions of YOLO have been developed:
YOLO-V1, YOLO-V2 [112], YOLO-V3 [113], YOLO-V4 [114], and YOLO-V5
[115]. In its network, YOLO can simultaneously localise and categorise items.

Figure 3.19: The key process of YOLO object detection algorithm from [115]

YOLO finds objects by partitioning an input image into S-by-S grid cells, each
of which forecasts B bounding boxes. Each bounding box contains the centre x,
centre y, width, height, confidence score, and class probability of each class. The
centre of the enclosed box is indicated by the (x, y) coordinates. The key process
of YOLO object detection algorithm is in Figure 3.19.

YOLO-V3 and YOLO-V4

YOLO-V3 introduced several new features and improvements over YOLO-V2,


such as multi-scale detection, feature pyramid network, and more. Multi-scale
detection improves the ability of the model to detect objects of different sizes in
the same image. The feature pyramid network (FPN) uses a top-down architecture
with skip connections to combine low-level and high-level features, which helps to
improve the accuracy of object detection. In addition, YOLO-V3 uses a different
backbone network than YOLO-V2, which further improves its performance.
YOLO-V4 builds upon the success of its predecessors and brings significant
improvements to object detection accuracy and speed. One of the most significant
improvements in YOLO-V4 is the use of the CSP (cross-stage partial) architecture
[116], which improves the model’s ability to learn features and makes it more
efficient. YOLO-V4 also uses several other techniques such as spatial pyramid
pooling, focal loss, and bag-of-freebies (BoF) to improve its performance.
In addition to these improvements, YOLO-V4 has also made significant strides
in terms of speed. YOLO-V4 can process up to 65 frames per second on a single
GPU, making it one of the fastest object detection systems available.
In conclusion, YOLO-V3 and YOLO-V4 are both significant improvements
over their predecessors in terms of object detection accuracy and speed. YOLO-
V4, in particular, has introduced several new techniques and architectures that
have pushed the boundaries of what is possible in real-time object detection. As
computer vision and deep learning continue to advance, we can expect even more
improvements to object detection systems like YOLO.

YOLO-V5

YOLO-V5 is the 5th version of this object detection system, released in 2020,
which has brought significant improvements over its predecessors YOLO-V3 and
YOLO-V4.
YOLO-V5 is built upon a more efficient architecture, compared to YOLO-V4,
which reduces the model’s complexity and improves its speed. This architecture
uses a single neural network with a large number of channels and smaller image
sizes. Additionally, YOLO-V5 uses a redesigned CSP-based backbone network
architecture that has been optimised for both accuracy and efficiency.
Compared to YOLO-V3 and YOLO-V4, YOLO-V5 is significantly faster and
more accurate. YOLO-V5 is capable of achieving state-of-the-art performance on
several benchmarks while maintaining real-time speed, even on low-power devices.
The model’s speed has been further improved through several techniques such as
model pruning, automatic mixed-precision training, and optimised CUDA code.
Another significant improvement in YOLO-V5 is the introduction of a novel
data augmentation technique, named mosaic data augmentation, which combines
multiple images into a single training image. This technique helps the model learn

to detect objects in complex scenes, where objects can appear partially or fully
occluded.

The input stage of YOLO-V5 uses the same mosaic data augmentation method as YOLO-V4 (the author of this method is also a member of the YOLO-V5 team). Random scaling, random cropping, and random arrangement are used when splicing the source images together, and the effect is particularly good for detecting small objects.
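As an illustration of the idea (not the YOLO-V5 implementation itself), the following sketch pastes four images around a random centre point to form one mosaic training image; the canvas size, grey fill value and the omission of bounding-box remapping are simplifying assumptions.

```python
import numpy as np

def mosaic4(images, out_size=640):
    """Minimal mosaic sketch: paste four images around a random centre point.
    Real pipelines also rescale the images and remap their bounding boxes;
    that bookkeeping is omitted here for clarity."""
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # grey fill
    cx = np.random.randint(out_size // 4, 3 * out_size // 4)        # random centre
    cy = np.random.randint(out_size // 4, 3 * out_size // 4)
    corners = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, corners):
        h, w = y2 - y1, x2 - x1
        # pad the source image if it is smaller than the target window
        pad_h, pad_w = max(h - img.shape[0], 0), max(w - img.shape[1], 0)
        img = np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)), constant_values=114)
        canvas[y1:y2, x1:x2] = img[:h, :w]
    return canvas

imgs = [np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) for _ in range(4)]
print(mosaic4(imgs).shape)   # (640, 640, 3)
```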

In the YOLO algorithm, anchor boxes with preset widths and heights are assigned for different datasets. During network training, the network outputs predicted boxes based on these initial anchors, compares them with the ground-truth boxes, calculates the difference between the two, and then updates the network parameters through back-propagation. The choice of initial anchor boxes is therefore an important part of the pipeline.

In YOLO-V3 and YOLO-V4, a separate program is used to calculate the initial anchor box values for different datasets. YOLO-V5 instead embeds this function into its training code and adaptively calculates the best anchor box values for each training set every time training is run.
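A simplified sketch of how such anchors can be estimated is given below: plain k-means clustering on the labelled box widths and heights. YOLO-V5's built-in autoanchor is more elaborate (it uses an IoU-based metric and a genetic refinement step), so this is only an approximation of the idea; the synthetic box sizes are illustrative.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=50, seed=0):
    """Simplified anchor estimation: plain k-means on (width, height) pairs."""
    rng = np.random.default_rng(seed)
    centres = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # assign each box to the nearest centre (Euclidean in w-h space)
        d = np.linalg.norm(wh[:, None, :] - centres[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = wh[labels == j].mean(axis=0)
    return centres[np.argsort(centres.prod(axis=1))]   # sort by box area

# stand-in for labelled box sizes (in pixels) collected from a training set
boxes_wh = np.abs(np.random.normal(loc=[60, 80], scale=[30, 40], size=(500, 2)))
print(kmeans_anchors(boxes_wh, k=9).round(1))
```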

In commonly used object detection pipelines, input images have different widths and heights, so the usual approach is to uniformly scale the original image to a standard size before feeding it into the detection network.

YOLO-V5 improves on this. Its author notes that, in practical use, the input images may have different aspect ratios; the letterbox function in datasets.py of YOLO-V5's code has therefore been modified to adaptively add the least amount of black border to the resized image. This simple improvement is reported to increase inference speed by 37%, making it very effective.
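The following is a minimal re-implementation of the letterbox idea, not the original datasets.py code: the image is resized with its aspect ratio preserved and then padded only up to the next multiple of the network stride. The 640-pixel target size, stride of 32 and grey padding value are assumptions based on common YOLO-V5 settings.

```python
import numpy as np

def letterbox(img, new_size=640, stride=32, pad_value=114):
    """Resize with unchanged aspect ratio, then pad only as much as needed so
    both sides are multiples of `stride` (a sketch of the letterbox idea)."""
    h, w = img.shape[:2]
    r = min(new_size / h, new_size / w)              # scale factor
    new_h, new_w = int(round(h * r)), int(round(w * r))
    # minimal padding: only up to the next multiple of the network stride
    pad_h = (stride - new_h % stride) % stride
    pad_w = (stride - new_w % stride) % stride
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    # nearest-neighbour resize via indexing (cv2.resize would normally be used)
    rows = (np.arange(new_h) / r).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / r).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    return np.pad(resized, ((top, bottom), (left, right), (0, 0)),
                  constant_values=pad_value)

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(letterbox(frame).shape)   # (384, 640, 3) rather than a full (640, 640, 3)
```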

In the backbone of the network, YOLO-V5 uses the Focus structure, which is not present in YOLO-V3 and YOLO-V4; the key to this structure is the slice operation. For instance, a 4×4×3 image slice becomes a 2×2×12 feature map. Taking the structure of YOLO-V5s as an example, the original 608×608×3 image is input into the Focus structure, and the slicing operation first creates a 304×304×12 feature map. Then, after a convolution operation with 32 convolution kernels, it finally becomes a feature map of 304×304×32 [117].
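A short PyTorch sketch of the Focus slicing described above is shown below; the channel sizes follow the YOLO-V5s example in the text, while the kernel size and the absence of batch normalisation and activation are simplifications.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Sketch of the Focus block: slice the image into four pixel-interleaved
    sub-images, stack them on the channel axis, then apply one convolution."""
    def __init__(self, in_ch=3, out_ch=32, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch * 4, out_ch, k, stride=1, padding=k // 2)

    def forward(self, x):
        # (B, C, H, W) -> (B, 4C, H/2, W/2) via the slice operation
        sliced = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                            x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(sliced)

x = torch.randn(1, 3, 608, 608)
print(Focus()(x).shape)   # torch.Size([1, 32, 304, 304])
```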

Figure 3.20: Focus structure from [117]

Only the backbone network in YOLO-V4 uses the CSP structure, whereas YOLO-V5 designs two CSP structures. Taking the YOLO-V5s network as an example, the CSP1_X structure is applied in the backbone, and the CSP2_X structure is applied in the neck. In YOLO-V4's neck, ordinary convolution operations are used; in the neck of YOLO-V5, the CSP2 structure, designed with reference to CSPNet, is adopted to enhance the network's feature fusion ability. In conclusion, YOLO-V5 is a significant improvement over its predecessors, YOLO-V3 and YOLO-V4, in terms of accuracy, speed, and efficiency. Its new architecture and innovative techniques have made it one of the best-performing object detection systems in the field.

YOLOX

YOLOX is a recent development in the YOLO family that has gained popularity due to its fast and accurate performance.
Compared to previous versions of YOLO such as YOLO-V3, YOLO-V4, and YOLO-V5, YOLOX [118] improves upon several aspects of the algorithm. One of the main differences is the use of an efficient backbone network based on Darknet-53 (as used in YOLO-V3), which allows for strong feature representation and fast training. YOLOX also employs cross-stage partial (CSP) modules that improve feature fusion and reduce computational cost. In terms of performance, YOLOX achieves state-of-the-art results on several benchmark datasets such as COCO, with a favourable trade-off between accuracy and speed.

It is able to achieve comparable accuracy to YOLO-V5 while being significantly faster. YOLOX is also more accurate than YOLO-V4 and YOLO-V3 in most cases, while still maintaining a high processing speed.
In contrast, YOLO-V5 was released in 2020 and is an improvement over YOLO-V4. YOLO-V5 introduced a new architecture design, a different detection head, and a new training pipeline, among other changes. While YOLO-V5 has shown promising results in object detection tasks, YOLOX aims to improve further on the YOLO series with its anchor-free approach and deep feature aggregation.
Overall, the YOLOX algorithm is a significant advancement in the field of object detection that offers improved accuracy and speed compared to previous versions of YOLO.

YOLO-V6, 7 and 8

The Meituan Visual AI Department has open-sourced the framework of YOLO-V6 [119]. While YOLO-V4 and YOLO-V5 focus more on data augmentation, they made fewer changes to the network structure.
YOLO-V6, on the other hand, makes significant changes to the network structure. Its backbone no longer uses CSPDarknet but switches to EfficientRep, a more efficient structure built from Rep (re-parameterisable) blocks. Its neck, Rep-PAN, is built from Rep blocks and PAN, and its head, like that of YOLOX, is decoupled, with a more efficient structure. It is worth mentioning that YOLO-V6 also uses the anchor-free method, abandoning the previous anchor-based approach. In addition to the changes in the model's structure, its data augmentation is consistent with that of YOLO-V5, its label assignment follows YOLOX in using SimOTA, and it introduces a new box regression loss called SIoU. From this perspective, YOLO-V6 can be described as a combination of the best features of both models.
Both YOLO-V5 and YOLOX use a multi-branch residual structure called CSPNet, but this structure is not very hardware-friendly. To make the model more suitable for GPU devices, the backbone structure was improved by introducing RepVGG and developing the more efficient EfficientRep structure. RepVGG adds a parallel 1×1 convolution branch and an identity mapping branch to each 3×3 convolution, forming a RepVGG block. Unlike ResNet, which adds a shortcut only every two or three layers, RepVGG adds these branches to every layer. RepVGG claims that the fused 3×3 convolution structure is highly efficient on computing-intensive hardware devices. Figure 3.21 provides more details about ResNet and about RepVGG at training and testing time [120].

Figure 3.21: ResNet, RepVGG training and RepVGG testing from [120]

RepVGG is a simple and powerful CNN structure. It uses a high-performance multi-branch model during training and a fast, memory-saving single-path model during inference, which gives a better balance between speed and accuracy. EfficientRep replaces the stride-2 convolutional layers in the backbone with stride-2 RepConv layers and changes the CSP-Block to the RepBlock. The backbone of YOLO-V6 is shown in Figure 3.22 [121].

Figure 3.22: Backbone of YOLO-V6 from [121]
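To make the re-parameterisation idea concrete, the following PyTorch sketch shows a simplified RepVGG-style block with 3×3, 1×1 and identity branches, and how the three branches can be folded into a single 3×3 convolution for inference. Batch normalisation is deliberately omitted, so this is an illustration of the principle rather than the actual RepVGG or EfficientRep implementation.

```python
import torch
import torch.nn as nn

class RepVGGBlockSketch(nn.Module):
    """Training-time block with 3x3, 1x1 and identity branches (no BN)."""
    def __init__(self, ch):
        super().__init__()
        self.conv3 = nn.Conv2d(ch, ch, 3, padding=1, bias=True)
        self.conv1 = nn.Conv2d(ch, ch, 1, bias=True)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x) + x   # multi-branch at training time

    def fuse(self):
        """Merge the three branches into one 3x3 convolution for inference."""
        ch = self.conv3.out_channels
        w = self.conv3.weight.data.clone()
        b = self.conv3.bias.data.clone()
        # 1x1 branch: pad its kernel to 3x3 (the value sits at the centre)
        w[:, :, 1, 1] += self.conv1.weight.data[:, :, 0, 0]
        b += self.conv1.bias.data
        # identity branch: a 3x3 kernel with 1 at the centre of its own channel
        for i in range(ch):
            w[i, i, 1, 1] += 1.0
        fused = nn.Conv2d(ch, ch, 3, padding=1, bias=True)
        fused.weight.data, fused.bias.data = w, b
        return fused

block = RepVGGBlockSketch(8)
x = torch.randn(1, 8, 16, 16)
print(torch.allclose(block(x), block.fuse()(x), atol=1e-4))   # True
```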

In order to reduce latency on the hardware, the Rep structure is also introduced into the feature fusion structure of the neck, where Rep-PAN is used. Rep-PAN is based on the combination of PAN and RepBlock; the main idea is to replace the CSP-Block in the PAN with the RepBlock.
Like YOLOX, YOLO-V6 also decouples the detection head, separating the processes of bounding box regression and category classification. Coupling box regression and class classification can affect performance, as it not only slows down convergence but also increases the complexity of the detection head. In the decoupled head of YOLOX, two additional 3×3 convolutions are added, which increases the computational complexity to a certain extent. YOLO-V6 redesigns a more efficient decoupled head structure based on the Hybrid Channels strategy. The latency is reduced without changing the accuracy, achieving a trade-off between speed and accuracy.
YOLO-V7 [122] and YOLO-V5 are different versions of YOLO, with YOLO-V7 being the newer release. In terms of computational efficiency and accuracy, YOLO-V7 improves on YOLO-V5: it uses more efficient convolution operations and compact models, allowing it to achieve higher detection speeds under the same computing resources, and it provides higher accuracy, including on more fine-grained objects.
However, YOLO-V5 trains faster than YOLO-V7 and has a lower memory footprint. This makes YOLO-V5 more advantageous in certain application scenarios, such as on mobile devices or in resource-constrained systems. In general, YOLO-V7 delivers higher accuracy at the cost of more resources, while YOLO-V5 is lighter and faster to train, at slightly lower accuracy than YOLO-V7. When choosing which version to use, a trade-off therefore has to be made based on the specific needs of the application scenario.
Ultralytics YOLO-V8 [123] is the latest version of the YOLO object detection and image segmentation model developed by Ultralytics. YOLO-V8 is a cutting-edge, state-of-the-art (SOTA) model that builds on the success of previous YOLO versions and introduces new features and improvements to further boost performance and flexibility. It can be trained on large datasets and can run on various hardware platforms, from CPUs to GPUs.
A key feature of YOLO-V8 is its extensibility: it is designed as a framework that supports all previous versions of YOLO, making it easy to switch between different versions and compare their performance.
In addition to this scalability, YOLO-V8 includes many other innovations that make it an attractive choice for a variety of object detection and image segmentation tasks. These include a new backbone network, a new anchor-free detection head, and new loss functions.
Overall, YOLO-V8 is a powerful and flexible tool for object detection and image segmentation that offers the best of both worlds: state-of-the-art technology and the ability to use and compare all previous YOLO versions.

3.4 Quantitative Performance Comparison Methods

A number of metrics can be used to analyse the performance of an object detector/model.

• TP (True Positive): the sample's true category is a positive example and the model's prediction is also a positive example, so the prediction is correct;

• TN (True Negative): the sample's true category is a negative example and the model predicts a negative example, which is correct;

• FP (False Positive): the sample's true class is a negative example but the model predicts a positive example, which is incorrect;

• FN (False Negative): the sample's true class is a positive example but the model predicts a negative example, resulting in an incorrect prediction;

• IoU (Intersection over Union): IoU is a key notion in object detection. It is the ratio of the area of overlap to the area of union between the predicted bounding box and the ground-truth box. If the IoU is greater than an agreed threshold, the prediction is considered correct (a minimal IoU computation is sketched after this list);

• mAP (mean Average Precision): mAP characterises the entire precision-recall curve; it is the area under the precision-recall curve (in our experiments we use the default IoU threshold of 0.5).
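As a minimal illustration of the IoU computation referred to above, assuming axis-aligned boxes given as (x1, y1, x2, y2) in pixel coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# a prediction overlapping a ground-truth box; IoU below the 0.5 threshold
print(round(iou((10, 10, 60, 60), (20, 20, 70, 70)), 3))   # 0.471
```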

Performance measures Accuracy, Recall and Precision can be derived as per the equations below, where TP, TN, FP and FN refer to the number of true positives, true negatives, false positives, and false negatives, respectively:

• Accuracy (all correct/all):

  Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3.16)

• Recall (true positives/all actual positives):

  Recall = TP / (TP + FN)    (3.17)

• Precision (true positives/predicted positives):

  Precision = TP / (TP + FP)    (3.18)
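For completeness, a small sketch that evaluates equations 3.16-3.18 from raw counts is given below; the example counts are made up for illustration and are not results from this thesis.

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, recall and precision from raw counts (equations 3.16-3.18)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, recall, precision

print(detection_metrics(tp=71, tn=10, fp=12, fn=29))   # hypothetical counts
```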

The three measures above all lie between 0 and 1, and for each of them a value closer to 1 indicates a better result: accuracy closer to 1 means more predictions are correct overall, recall closer to 1 means a larger fraction of the actual positive samples is detected, and precision closer to 1 means a larger proportion of the predicted positives are true positives.
In determining the accuracy of an object detector, it is important to judge not only whether an object has been correctly identified as being of a particular type, but also how close the location of the identified object is to the ground truth. Therefore, instead of using Accuracy, in this research we use [email protected] as the measure of correctness of object detection.
In our experiments we compared the performance of the four object detection
models we obtained by training the four sub-versions of YOLO-V5, with models
based on other popular Deep Neural Networks, SSD (Single Shot Multi-box De-
tector) and Faster R-CNN (Faster Region-based Convolutional Neural Networks).
Based on the foundational machine learning and deep learning algorithms and
networks provided in this chapter, Chapters 4–6 present the original contributions
of the research presented in this thesis.
Chapter 4

Ghaf Tree Detection Using Deep Neural Networks

In this chapter, we utilise a model based on YOLO-V5, one of the best-established Convolutional Neural Networks (CNNs), to effectively detect Ghaf trees in images taken by cameras onboard lightweight Unmanned Aerial Vehicles (UAVs), i.e., drones, in some areas of the UAE. We utilise a dataset of drone-captured images containing over 5000 labelled Ghaf tree canopies, partitioned into data subsets to be used for training (60%), validation (20%), and testing (20%). Four versions of the YOLO-V5 CNN architecture are trained using the training data subset. The validation data subset is used to fine-tune the trained models to realise the best Ghaf tree detection accuracy. The trained models are finally evaluated on the reserved test data subset not utilised during training. The object detection results of the Ghaf tree detection models obtained using the four different sub-versions of YOLO-V5 are compared quantitatively and qualitatively.

4.1 Introduction to Ghaf Tree Detection

In this chapter we investigate the use of four YOLO-V5 sub-variants representing DNN architectures of different depth: S (Small, the shallowest), M (Medium), L (Large) and X (Extra Large). For clarity of presentation, this chapter is divided into four sub-sections. Section-4.1 provides an introduction to the application and research context. Section-4.2 presents the methodology used, including dataset preparation, data labelling and the training of the Deep Neural Network models YOLO-V5 S, M, L and X. Section-4.3 presents the Ghaf tree detection experimental results and a detailed analysis of the performance of the four trained models. Finally, Section-4.4 concludes with an insight into future work and suggestions for improvements of the established DNN models.


4.2 Proposed Approach to the Ghaf Tree Detection
The workflow of the proposed method is shown in Figure 4.1. It includes two main
stages: training stage (including training and validation) and testing stage. The
details of each phase (i.e., data preparation, training, validation and testing) are
described in the following sections.

Figure 4.1: The workflow of the proposed method

4.2.1 Data Preparation


A total of 368 images containing 5000 Ghaf trees were randomly selected from a large number of images taken during drone flights. Drone imagery was collected by the Dubai Desert Conservation Reserve (DDCR) team with DJI Phantom-4 and DJI Mavic 2 Pro drones flying at different heights/altitudes. The selected image dataset was then divided into three data subsets for training (60%), validation (20%) and testing (20%).
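A minimal sketch of such a random 60/20/20 split is shown below. The file names and the fixed seed are illustrative assumptions; as noted next, the actual per-subset image counts in this work were adjusted slightly because some images contain far more Ghaf trees than others.

```python
import random

def split_dataset(image_paths, train=0.6, val=0.2, seed=42):
    """Shuffle and split image paths into train/val/test subsets
    (60/20/20 by default); proportions and seed are illustrative."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train, n_val = int(n * train), int(n * val)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

train_set, val_set, test_set = split_dataset([f"img_{i:04d}.jpg" for i in range(368)])
print(len(train_set), len(val_set), len(test_set))   # 220 73 75
```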
In the dataset used, as some of the images have a significantly larger number
of Ghaf trees as compared to some others, the number of images in the training,
validation and test data subsets were 298, 30 and 40, respectively. To start the
training phase, each Ghaf tree in the training and validation dataset subsets was
labelled with a bounding box using the ‘LabelImg’ tool and was labelled as type

”Ghaf”. It is specifically noted that some Ghaf trees can contain a number of
canopies that grow from the same root structure, while some have only one canopy.
It is therefore often impossible to judge whether adjacent canopies belong to the same root structure (as sand covers or occludes most of the roots and trunks) and hence form a single Ghaf tree. For this reason, in this chapter, rather than attempting to detect a Ghaf tree, we attempt to detect Ghaf tree canopies. It should therefore be noted that counting canopies, for example, will not allow us to count the total number of Ghaf trees.

Dataset Number of labelled Ghaf tree canopies

Training 3200
Validation 900
Testing 900

Table 4.1: Number of labelled Ghaf tree canopies in each data subset

The labelled data of the training subset is used to train the four sub-versions of the YOLO-V5 CNN. The training is for a single object class, 'Ghaf Tree', and hence a Ghaf tree is detected by differentiating it from its background. Similarly, the tagged Ghaf trees from the validation subset are used to fine-tune the YOLO-V5 models when determining their optimal parameters.
Finally, the test subset is used to evaluate the performance of the trained models. The labelled Ghaf trees in the validation set are used during training to optimise the network parameters, whilst the labelled Ghaf trees in the test set are used as benchmark data to determine the accuracy of prediction.
When labelling data for training and validation, the rectangles that enclose Ghaf trees may contain trees of different sizes and may overlap or be obscured by other Ghaf trees or objects. Moreover, they may have different backgrounds (i.e., sand, bushes/shrub undergrowth, etc.). It is therefore important to capture rectangles of image pixels with the above possible variations for training and testing, as this effectively tests the generalisability of the trained CNN model for subsequent Ghaf tree detection tasks.

4.3 Experimental Results and Analysis


In this section, we compare the performance of the Ghaf tree detection models
generated from the four versions of YOLO-V5 and popular networks like Faster

R-CNN and SSD. Each model was trained with the same set of UAV images from the training data subset, validated on the same validation data subset and tested on the same test data subset. The performance of the models is compared both quantitatively and qualitatively below.

4.3.1 Quantitative Performance Comparison

In determining the accuracy of an object detector, it is important to judge not only whether an object has been correctly identified as being of a particular type, but also how close the location of the identified object is to the ground truth.
Therefore, instead of using Accuracy, in this research we use [email protected] as the measure of correctness of object detection. In our experiments we compared the performance of the four object detection models we obtained by training the four sub-versions of YOLO-V5 with models based on other popular Deep Neural Networks, SSD and Faster R-CNN (Faster Region-based Convolutional Neural Networks) (see Table-4.2).

The results tabulated in Table-4.2 also show that SSD requires a significant amount of time for training to converge (i.e., for training to complete) and also takes a significant amount of extra time for testing. Its recall, precision and [email protected] values also remain significantly lower than those of the YOLO-V5 models. Faster R-CNN took the lowest amount of time to complete training and has very good testing speeds, second only to YOLO-V5s, the shallowest YOLO-V5 sub-version.
However, its recall, precision and [email protected] values were much lower than in the case of the four YOLO-V5 models. Comparing the performance of the four YOLO-V5 models, it is observed that as the complexity/depth of the architecture increases, more time is taken for training, and generally the same trend exists for testing, with YOLO-V5x taking significantly more time than sub-versions m and l. In comparison to the other models, YOLO-V5x achieved the highest mean average precision (81.1%) in Ghaf tree detection, as shown in Table-4.2. Figure 4.2 illustrates the Precision vs Recall graph for YOLO-V5x, indicating a [email protected] value of 0.811.

Model          Training Hours   Recall   Precision   mAP
SSD            18               0.50     0.14        28.6%
Faster R-CNN   5                0.52     0.56        57.6%
YOLO-V5s       6                0.71     0.82        78.8%
YOLO-V5m       7                0.69     0.86        78.3%
YOLO-V5l       8                0.72     0.83        77.4%
YOLO-V5x       10               0.71     0.88        81.1%

Table 4.2: Performance comparison of DNN based object detection models

Figure 4.2: mAP: Precision&Recall curve

In summary, based on the objective performance values presented and discussed above, it can be concluded that the object detector models generated from YOLO-V5 and its sub-versions are far superior in performance to the models generated by other popular CNNs such as SSD and Faster R-CNN. In particular, when comparing the different sub-versions of YOLO-V5, the deeper the architecture, the better the objective performance.

4.3.2 Visual Performance Comparison


As shown by the objective performance results tabulated in Table-4.2, all four YOLO-V5 sub-versions can detect Ghaf trees with a typically acceptable mAP of over 77.4%. To further analyse and compare the performance of the various models developed, this section provides a comprehensive subjective performance analysis.
Figures 4.3-4.7 provide a number of different images taken from the drone footage of different areas that contain Ghaf trees. Some areas contain only Ghaf trees, and others contain other types of trees or plants. Yellow circles in the images indicate missed targets and red crosses indicate incorrect detections.

(a) SSD (b) Faster R-CNN

(c) YOLO-V5s (d) YOLO-V5m

(e) YOLO-V5l (f) YOLO-V5x

Figure 4.3: The visual performance comparison of Ghaf tree detector models de-
rived from DNN architectures, (a) SSD, (b) Faster R-CNN, (c) YOLO-V5s, (d)
YOLO-V5m, (e) YOLO-V5l, and (f) YOLO-V5x

Figure 4.3 illustrates the performance of the six models on a desert area containing only Ghaf trees. Unfortunately, the SSD based model did not pick up any of the Ghaf trees, and the Faster R-CNN based model failed to detect a number of Ghaf trees. The performance of the four sub-versions of YOLO-V5 was very comparable. It is noted that in this image the Ghaf trees have been captured at high resolution with clear views, and with no other types of objects in the background.

YOLO-V5s

YOLO-V5m

Figure 4.4: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x

YOLO-V5l

YOLO-V5x

Figure 4.4: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)

Figure 4.4 illustrates the performance of models created by the four sub-versions of YOLO-V5 on a drone image captured at a higher altitude (hence the trees appear smaller) and in an area where other trees and objects are present. The yellow circles denote missed Ghaf trees. The model based on YOLO-V5x outperforms the models based on the other YOLO-V5 sub-versions. YOLO-V5s misses some sparse canopies, and YOLO-V5s, m and l miss Ghaf trees located at the boundary of the image.

YOLO-V5s

YOLO-V5m

Figure 4.5: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x

YOLO-V5l

YOLO-V5x

Figure 4.5: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)

Further inspecting the images included in Figure 4.5, it can be observed that the models generated by training the four YOLO-V5 sub-versions perform very well most of the time, and the accuracy gaps between them appear only in fine details. YOLO-V5x performs marginally better when detecting small, overlapped and closely spaced canopies.

YOLO-V5s

YOLO-V5m

Figure 4.6: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x

YOLO-V5l

YOLO-V5x

Figure 4.6: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)

Figure 4.6 illustrates the use of the models created by the four sub-versions of YOLO-V5 in detecting Ghaf trees in a more complex area that includes other trees. This image contains Ghaf trees with a wider size variation. Comparing with the labelled ground-truth image, the model generated by YOLO-V5s demonstrates better performance than the models created by YOLO-V5m, l and x. When the scene becomes complex, significantly more data is needed to train a deeper neural network; thus, with more high-quality training data for YOLO-V5x, the results could still be improved.

YOLO-V5s

YOLO-V5m

Figure 4.7: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x

YOLO-V5l

YOLO-V5x

Figure 4.7: The results of Ghaf tree detection in drone imagery using the YOLO-
V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)

Figure 4.7 illustrates the performance of the four models on another test image in which trees and plants of various other sizes exist in a complex area. In this specific case the model created by YOLO-V5l performs better than the other versions, detecting every Ghaf tree without a mistake. Once again, the slightly less accurate detection capability of YOLO-V5x can be attributed to the lack of substantial quantities of training data. The depth of the model and the amount of training data both affect detection performance in practice, so in certain situations one model may outperform another.

4.4 Conclusion
In this chapter we investigated the use of Convolutional Neural Networks in detecting Ghaf trees in videos captured by a drone flying at different altitudes and in different environments that contain Ghaf trees. To the best of the authors' knowledge, this is the first attempt at using Convolutional Neural Networks to automatically detect Ghaf trees, a task that poses a significant challenge to traditional machine learning approaches. Despite the relatively small number of images utilised for training the DNNs in this work, the high [email protected] value of 81.1% obtained by the YOLO-V5x based model, with Ghaf trees detected in approximately 78 ms, is a promising step towards achieving real-time detection using aerial imagery. The training time for model generation was high, approximately 10 hours, and this was mainly due to hardware limitations of the computer utilised. The training time could be considerably reduced if faster computer hardware were utilised. Models trained on all other sub-versions of YOLO-V5 resulted in [email protected] values above 77.4%, whilst other popular DNNs such as SSD and Faster R-CNN performed less well. Rigorous visual inspection of the Ghaf tree detections obtained using all four sub-versions of YOLO-V5 revealed that YOLO-V5x particularly outperforms the other YOLO-V5 networks at detecting Ghaf trees in scenarios where trees are overlapping, blurred or obstructed, where backgrounds differ, and where there is significant size variation of Ghaf trees. Additional test results can be found in Appendix B.
This work utilised just over 5000 Ghaf trees, with 3200 of them used during training. If this number can be expanded to 10,000 or more samples, the detection performance will be further improved, as the models would then be able to better generalise to new, unseen data and accurately identify Ghaf trees. Dataset limitations notwithstanding, the results obtained in this work provide promising ground for real-time detection of the Ghaf using aerial surveillance, thus aiding the efforts to preserve this endangered national tree of the UAE. The models can be used to design change detection software to identify damage to Ghaf trees based on

drone captured aerial footage.


In chapter 5, we propose the use of Deep Neural Networks to create trained
models to detect and classify multiple tree types, namely Ghaf Trees, Acacia Trees
and Palm Trees.
Chapter 5

Multiple Tree Classification Using Deep Neural Networks

In this chapter we investigate the use of Deep Convolutional Neural Networks in the detection and recognition of three types of trees present in desert areas: Ghaf, Palm and Acacia trees. These are all drought-resistant trees native to regions of Asia and the Indian Subcontinent, including the UAE (the United Arab Emirates). Because of its historical and cultural significance, the Ghaf is particularly seen as a symbol of stability and tranquillity in the UAE. It is now designated an endangered tree in the UAE, requiring conservation due to rising urbanisation and infrastructural development. Similarly, the Arabian Peninsula, the Middle East and North Africa rely heavily on date palm trees as a source of food and income. For projecting date production and for plantation management, counting the number of date palm trees and knowing their locations is critical. Finally, Acacia Arabica, often known as Babul or the Gum Arabic tree, is a multipurpose tree that has gained recognition all over the world. This species is moderately long-lived, despite its sluggish growth rate, and can survive in severely dry conditions as well as floods.
In the proposed work, we investigate the use of YOLO-V5, one of the best-established Convolutional Neural Networks (CNNs), and several of its sub-versions, to automatically detect and recognise Ghaf, Date Palm and Acacia trees in imagery captured by onboard cameras of Unmanned Aerial Vehicles (UAVs). We use a dataset of about 800 images that is divided into three subsets: training (60%), validation (20%), and testing (20%). The training dataset is used to train the four sub-versions of the YOLO-V5 CNN. To obtain the best detection accuracy, the trained models are fine-tuned using the validation dataset. Finally, the trained models are tested against the unseen UAV imagery in the reserved test dataset. The quantitative and qualitative results of tree detection and recognition


of the four sub-versions of YOLO-V5 are compared.

5.1 Introduction to the Multiple Tree Detection and Classification
In this chapter we investigate the use of YOLO-V5 and four of its variants representing DNN architectures of different depth: S (Small, the shallowest), M (Medium), L (Large) and X (Extra Large). In this research we are not exploring the use of YOLO-V7 and YOLO-V8, which are the latest releases. For clarity of presentation, this chapter is divided into five sub-sections. Section-5.1 provides an introduction to the application and research context. Section-5.2 presents the YOLO-V5 network architecture and defines the objective metrics that we use in measuring and comparing the performance of the trained models. Section-5.3 provides the overall methodology adopted and details dataset preparation, data labelling, training, and the approach adopted for testing the performance of the Deep Neural Network models YOLO-V5 S, M, L and X. Section-5.4 presents details of the Multiple Tree Type detection and recognition experimental results and a comprehensive analysis of the performance of the four trained models in recognising each type of tree. Finally, Section-5.5 concludes with an insight into future work and suggestions for improvements of the established DNN models.

5.2 Proposed Approach to Multiple Tree Detection and Classification
The workflow of the proposed multiple tree detection and classification method is the same as that of the Ghaf-tree-only detection, shown in Figure 4.1. It includes two main stages: a training stage (including training and validation) and a testing stage. The details of each phase (i.e., data preparation, training, validation and testing) are described in the following sections.

5.2.1 Data Preparation


Up to 800 drone-captured images of resolution 3840×2160 pixels, captured at different drone flight altitudes (ranging between 10-60 metres) and containing 5000 Ghaf trees, 6000 Palm trees and 500 Acacia trees, were randomly selected from a large number of images. The drone imagery was collected by the Dubai Desert Conservation Reserve (DDCR) team with DJI Phantom-4 and DJI Mavic 2 Pro drones. The selected image dataset is separated into three data subsets, for training (60%), validation (20%) and testing (20%). Since some images included a large number of trees and others a relatively restricted number, the total numbers of images used for training, validation and testing were 619, 80 and 101, respectively.

In the data preparation phase, each Ghaf, Palm and Acacia tree within the training, validation and testing data subsets was labelled with a bounding box using the LabelImg software and labelled as "Ghaf", "Palm" or "Acacia". It is noted that the images contained other types of trees and shrubs that did not belong to these three groups and hence could create false positives. Moreover, since a single Ghaf tree can contain more than one canopy (growing from one root/trunk structure), for the purpose of this research we define a Ghaf tree as a single Ghaf tree canopy, which we label and aim to detect via the object detection and classification models we develop.

Tables 5.1-5.3 tabulate, for the training, validation and testing image subsets, the total number of images used for labelling each type of tree and the total number of trees of each type present in them.

The drone images were randomly picked from a large set of images captured at the same resolution but with the drone flying at different altitudes and at different times of the day (sometimes resulting in illumination variations, shadows, etc.). The density/sparsity of trees varied between images. Palm trees co-existed with Ghaf trees, but most Acacia trees existed in isolation. Trees overlapped and occluded one another, both with similar and with different trees and other objects. The underlying background below the trees varied, mostly consisting of sand but occasionally containing other trees, shrubs or man-made structures. In drawing rectangles around samples of trees, it was essential that the captured rectangles tightly fitted the tree canopies, although in some cases other objects, trees, crops or sand were included within the rectangle. When a tree was partially occluded by other trees, we estimated the tree's occluded canopy area when drawing the rectangle and drew it to include the potentially covered area of the canopy. We also attempted to capture many trees that are clearly isolated from other trees and objects, with high clarity of shape, boundary, texture and colour. Further, when labelling trees that have a shadow, it is important to exclude the shadow from the rectangle. All of the above strict criteria were adopted in labelling to give the DNN the opportunity to learn to identify and recognise the three types of trees under variations of illumination, occlusion, size, clarity, etc.

Dataset Number of Images Number of Ghaf trees

Training 437 3200


Validation 40 900
Testing 62 900

Table 5.1: Number of labelled Ghaf tree in each data subset

Dataset Number of Images Number of Palm trees

Training 34 3600
Validation 8 1200
Testing 9 1200

Table 5.2: Number of labelled Palm tree in each data subset

Dataset Number of Images Number of Acacia trees

Training 148 300


Validation 32 100
Testing 30 100

Table 5.3: Number of labelled Acacia tree in each data subset

All experiments were conducted on a PC equipped with an Intel Core i7-6850K CPU, an NVIDIA GeForce GTX-1080Ti GPU and 32 GB of RAM. The operating system of the computer used was Windows 10.

5.3 Experimental Results and Analysis


In this section, we compare the performance of the multiple tree detection and
classification models generated from the four versions of YOLO-V5 and popular
networks like Faster R-CNN and SSD. Each model was trained with the same set
of UAV images of the training data subset, validated on the same validation data
subset and tested on the same test data subset.

5.3.1 Quantitative Performance Comparison


In determining the accuracy of an object detector, it is important to judge not only whether an object has been correctly identified as being of a particular type, but also how close the location of the identified object is to the ground truth. Therefore, instead of using Accuracy, in this research we use [email protected] as the measure of correctness of object detection.

If the intention is to generate only one binary confusion matrix, then macro-averages can be used for performance evaluation. Macro-averaging first calculates the metric value (e.g. Precision or Recall) for each class and then takes the arithmetic mean over all classes.

Macro-average Precision and Recall are defined as:

  Macro_P = (1/n) Σ_{i=1}^{n} P_i    (5.1a)

  Macro_R = (1/n) Σ_{i=1}^{n} R_i    (5.1b)

Alternatively, micro-averages first build a global confusion matrix over every instance in the dataset, regardless of class, and then compute the corresponding metrics from it.

Micro-average Precision and Recall are defined as:

  Micro_P = Σ_{i=1}^{n} TP_i / (Σ_{i=1}^{n} TP_i + Σ_{i=1}^{n} FP_i)    (5.2a)

  Micro_R = Σ_{i=1}^{n} TP_i / (Σ_{i=1}^{n} TP_i + Σ_{i=1}^{n} FN_i)    (5.2b)

In summary, the macro-average is computed independently for each category: the P and R values of each category are calculated separately and the metric values of all categories are then averaged directly, so every category is treated equally. The micro-average instead weights the contribution of each category by its size when computing the average. Therefore, in multi-class problems with data imbalance, the values obtained using the micro-average are more credible. In this chapter, the training set sizes of the three trees are quite different (1600 Ghaf trees, 2000 Date Palm trees and 500 Acacia trees), so the micro-average method is chosen to evaluate the models.
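The following sketch computes both macro- and micro-averaged precision and recall from per-class TP/FP/FN counts (equations 5.1 and 5.2); the three-class counts used in the example are invented for illustration and are not the experimental results of this chapter.

```python
def macro_micro(per_class_counts):
    """Macro and micro averaged precision/recall from per-class (TP, FP, FN)."""
    precisions = [tp / (tp + fp) if (tp + fp) else 0.0
                  for tp, fp, fn in per_class_counts]
    recalls = [tp / (tp + fn) if (tp + fn) else 0.0
               for tp, fp, fn in per_class_counts]
    macro_p = sum(precisions) / len(precisions)
    macro_r = sum(recalls) / len(recalls)
    tp_sum = sum(tp for tp, fp, fn in per_class_counts)
    fp_sum = sum(fp for tp, fp, fn in per_class_counts)
    fn_sum = sum(fn for tp, fp, fn in per_class_counts)
    micro_p = tp_sum / (tp_sum + fp_sum)
    micro_r = tp_sum / (tp_sum + fn_sum)
    return macro_p, macro_r, micro_p, micro_r

# (TP, FP, FN) per class, e.g. Ghaf, Palm, Acacia -- imbalanced on purpose
print(macro_micro([(700, 120, 200), (1000, 150, 200), (80, 10, 20)]))
```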

Model          Training Hours   Recall   Precision   mAP
SSD            23               0.50     0.14        28.6%
Faster R-CNN   6                0.52     0.56        57.6%
YOLO-V5s       7.5              0.76     0.83        81.3%
YOLO-V5m       8                0.78     0.84        81.9%
YOLO-V5l       9.5              0.79     0.85        83.0%
YOLO-V5x       10.6             0.78     0.88        83.5%

Table 5.4: Performance comparison of DNN based object detection models

In the performance analysis of the multiple tree classification models we created based on the four sub-versions of the YOLO-V5 CNN architecture, within the research context of this work, we also compare their performance with classifiers created using the SSD and Faster R-CNN DNN architectures. All trained models were developed using the same training and validation datasets and were tested on the same test data subset, as explained above. The performance of the models is compared both quantitatively and visually/subjectively.
Quantitative object classification results for models based on SSD, Faster R-CNN and the four sub-versions of YOLO-V5 (s, m, l and x) are tabulated in Table-5.4. The recorded [email protected] value for SSD was the lowest, and the YOLO-V5 models performed progressively better with increased complexity/depth of network (s to x), with YOLO-V5x achieving the highest [email protected] (83.5%), although its recall was marginally lower than that of YOLO-V5l. This marginal drop reflects the fact that when the depth of a CNN increases, a relatively larger amount of data is needed for its effective training, and YOLO-V5x would benefit from more data to perform fully optimally. It is noted that the precision and recall values quoted in Table-5.4 are the micro-precision and micro-recall values defined in Section 5.3.1, computed over all three types of tree detection/classification.
We also note that the model training time increases as the model depth increases, as demonstrated by the progressively increasing training times of the YOLO-V5 s-to-x models. Faster R-CNN demonstrates a significantly low training time, while SSD has a relatively high training time. It is also noted that the models developed differ in complexity and hence have different deployment costs in practice, with the Faster R-CNN and YOLO-V5s (small/shallow) models needing the least deployment/detection time and YOLO-V5x requiring the most resources for deployment. Given these observations, the YOLO-V5 sub-versions offer a choice of trading off detection accuracy against deployment cost: for object detectors deployed on computationally constrained edge devices (e.g. mobile devices, on-board drones), YOLO-V5s is recommended, even though its [email protected] value is about 2.2% lower, whereas for object detectors deployed on powerful computing devices (desktops, cloud services, off-line processing), YOLO-V5l or x are recommended, due to their superior [email protected] values.

Model Recall Precision mAP

YOLO-V5s 0.68 0.82 77.8%


YOLO-V5m 0.68 0.86 76.6%
YOLO-V5l 0.68 0.85 76.8%
YOLO-V5x 0.66 0.86 76.4%

Table 5.5: Ghaf Tree Detection Performance comparison of YOLO-V5 based ob-
ject detection models

Model Recall Precision mAP

YOLO-V5s 0.83 0.85 80.7%


YOLO-V5m 0.86 0.83 84.7%
YOLO-V5l 0.86 0.84 84.7%
YOLO-V5x 0.86 0.85 85.6%

Table 5.6: Palm Tree Detection Performance comparison of YOLO-V5 based ob-
ject detection models

Model Recall Precision mAP

YOLO-V5s 0.79 0.85 85.5%


YOLO-V5m 0.80 0.82 84.6%
YOLO-V5l 0.84 0.84 87.6%
YOLO-V5x 0.80 0.93 88.5%

Table 5.7: Acacia Tree Detection Performance comparison of YOLO-V5 based object detection models

Based on Tables 5.5-5.7, it is evident that YOLO-V5x outperforms the other models in the task of multiple tree detection and classification. YOLO-V5x achieved the highest mAP scores, indicating that it was able to accurately detect and classify multiple trees in the benchmark dataset. These results suggest that YOLO-V5x may be the best choice for researchers or practitioners interested in using object detection models for tree-related applications. According to our analysis, the reason YOLO-V5x performs best is that the dataset of this project is large, there are many types of objects to be detected, and the features of the Ghaf tree and the Acacia tree are similar; deeper networks can therefore extract more features, which is conducive to object detection. Recall and precision are in tension with each other: if we want higher recall, the model's predictions need to cover more samples, but this makes the model more likely to make mistakes, which lowers precision; if the model is very conservative and only detects samples it is certain of, its precision will be high but its recall will be relatively low. Here, the precision of YOLO-V5x is significantly higher than that of the other models, further indicating that its detection accuracy is higher; the reason, again, is that the deeper model performs better in complex situations. In the tables above, we presented the testing results of YOLO-V5s, m, l and x on the benchmark dataset, evaluated in terms of precision, recall and mean average precision (mAP), which are commonly used metrics for object detection tasks. These tables compare the performance of the multiple tree detection and classification models on each single kind of tree.

5.3.2 Visual Performance Comparison

Acacia trees grow in the wild or in nature reserves where they have been purposely planted for conservation purposes. Palm trees are often present in human-inhabited areas, as they are normally grown by humans for a purpose (shade, consumption, etc.). Ghaf trees provide fodder for animals and hence grow in areas reachable by humans and animals. Given these observations, it was difficult for us to find any single image that contained Acacia trees together with Palm and/or Ghaf trees, and there were only a few images in which Ghaf trees co-existed with Palm trees. Therefore, for the purpose of testing the subjective effectiveness of tree type classification, we created a mosaic image formed from three sub-images, each consisting mostly of one type of tree (see the test image in Figure 5.2). The yellow circles in the images indicate missed targets and the red crosses indicate incorrect detections.

(a) SSD (b) Faster R-CNN

(c) YOLO-V5s (d) YOLO-V5m

(e) YOLO-V5l (f) YOLO-V5x

Figure 5.1: The visual performance comparison of Multiple tree detector models
derived from DNN architectures, (a) SSD, (b) Faster R-CNN, (c) YOLO-V5s, (d)
YOLO-V5m, (e) YOLO-V5l, and (f) YOLO-V5x

The results illustrated in Figure 5.1 show that all four YOLO-V5 sub-versions can detect the Ghaf, Date Palm and Acacia trees with high precision (above 81.3% [email protected], as tabulated in Table-5.4). SSD failed to detect any type of tree, whereas Faster R-CNN was not able to detect the Palm trees, which appear very small in the mosaic image. Testing on many different mosaic images we created, with random combinations/orientations of sub-images predominantly consisting of the three different tree types, produced similar subjective observations. Therefore, in the subjective analysis that follows, we exclude the subjective performance comparison with SSD and Faster R-CNN.

YOLO-V5s

YOLO-V5m

Figure 5.2: The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x

YOLO-V5l

YOLO-V5x

Figure 5.2: The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)

Figure 5.2 shows an image that contains only Ghaf trees. We observe that the trees are of different sizes, that the image was captured at a time when significant shadows exist, that the density of trees varies largely within the test image and that, in many cases, trees overlap and occlude one another. Given the strict labelling procedure described in Section 5.2.1, it is observed in Figure 5.2 that we have been able to achieve a remarkable level of accuracy in detecting and recognising the Ghaf trees, particularly when using YOLO-V5x and l. Most Ghaf trees have been detected, all with a confidence value of over 0.27, and no tree has been misclassified. A further remarkable result is the trained models' ability to avoid detecting shadows as Ghaf trees and to avoid double counting Ghaf trees because of their shadows. We have also used non-maximum suppression to avoid multiple rectangles being drawn around a single tree (a minimal sketch of this step is given below). Because of this, we should be able to count the number of Ghaf trees and find their GPS locations, if the original images include longitude and latitude information. The results indicate that the deeper the model, the better it has managed to train and the more accurately it performs, given the amount of data used in training.
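A minimal sketch of the greedy non-maximum suppression step mentioned above is given below; the IoU threshold of 0.45 is an illustrative value rather than the setting used in the experiments.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, drop overlapping boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

dets = [(10, 10, 60, 60), (12, 12, 62, 62), (200, 200, 260, 260)]
confs = [0.90, 0.75, 0.60]
print(nms(dets, confs))   # [0, 2] -- the two overlapping boxes collapse to one
```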

YOLO-V5s

YOLO-V5m

Figure 5.3: The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x

YOLO-V5l

YOLO-V5x

Figure 5.3: The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)

Figure 5.3 illustrates the application of the multiple tree type detectors/classifiers of the YOLO-V5 sub-versions on an image that consists of Palm trees, in a surrounding that contains hedges used as boundary fences or windbreaks. These images do not contain Ghaf or Acacia trees. The results illustrate that the models created from all four sub-versions of YOLO-V5 are capable of accurately detecting Palm trees, despite the fact that the Palm trees are of different sizes and overlap with, and get occluded/camouflaged by, other trees and hedges. Palm trees have a distinct shape and texture compared to Ghaf trees, which makes them easier to detect. It is also noted that we used 3600 Palm trees for training versus 3200 Ghaf trees. However, a few small Palm trees at the bottom-left corner are missed, because the training data contains few Palm trees of this size.

YOLO-V5s

YOLO-V5m

Figure 5.4: The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x

YOLO-V5l

YOLO-V5x

Figure 5.4: The results of Multiple tree detection in drone imagery using the
YOLO-V5s, YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)

As indicated in Table-5.3, we only had 300 samples of Acacia trees for training, as against 3600 and 3200 Palm and Ghaf trees, respectively. Given this significant class imbalance, one would have expected a significant reduction of accuracy in Acacia tree detections when training a multi-class classifier. However, as shown in Figure 5.4, our image samples that consisted of Acacia trees did not contain trees of any other type, and as a result the Acacia tree detection performance of all four YOLO-V5 models shows a very good level of accuracy. Furthermore, all Acacia trees have a good level of contrast with the sandy background at the automatic exposure setting of the drone camera with which the images were captured.

Figure 5.5: Missed trees due to limited training data

To further investigate the accuracy of Ghaf tree detection, Figure 5.5 shows an image with a very different background, which is relatively rare in the training set. In this image, there are Ghaf trees of varying sizes, making the detection of some of them more difficult. No Acacia trees are present in this image. A larger number of training samples, extracted from images with different exposure settings, taken at different times of the day and at different drone flight altitudes, as well as samples of Ghaf trees from environments where they overlap with and/or are occluded by other trees, would significantly improve the accuracy of Ghaf tree detection.
Despite the modest number of images used in this study, the excellent mean average precision values (over 81.3%) obtained for the detection of Ghaf, Palm and Acacia trees from drone images using the four sub-versions of YOLO-V5 are a promising step toward real-time detection utilising aerial data. The model training time could have been shortened if a more powerful computer had been used for training. Rather than creating separate DNN models for the detection of the three different tree types commonly present in the Arabian regions, we have shown that a single multi-class model can be equally effective, provided the training data between classes are reasonably balanced and a modest set of data is used to train the detection of each type of tree. The number of samples needed for each different type of tree to be detected at the same level of accuracy depends on the complexities and variations present in the test images, the distinctiveness of each tree type from other trees and vegetation present in the images, the drone's flying altitude, the image resolution, etc.

5.4 Conclusion
In this chapter, we have proposed YOLO-V5 based multi-class object detection models to detect three types of trees widely present in Arabian countries: Ghaf, Palm and Acacia trees. We have successfully demonstrated the use of object detection models created from four sub-versions of YOLO-V5 (s, m, l and x), having different computational and network complexities, in achieving mean average precision values of over 81.3%. The model created by YOLO-V5x demonstrated the best multiple tree detection accuracy, with an 83.5% [email protected] value. We have used specific approaches to data labelling, training and network optimisation to create models that are capable of detecting the three types of trees in the presence of occlusion, illumination variations, changes in size, different levels of contrast with the background, shadows, overlaps, shape variations, etc. Some 800 drone-captured images were used in the training, validation and testing of the models, covering over 11,500 labelled trees of three different types. If this number is increased to 30,000 or more, the detection performance should improve significantly, as the models will be able to generalise better to previously unseen data. Despite the limitations of the dataset used in training, we have optimised training via effective data labelling and training data selection, and we share the knowledge gathered for the benefit of the wider research community. The detailed findings of this research should pave the way for real-time detection of multiple tree types via aerial surveillance, assisting in the preservation of the UAE's desert environment. We note that the model created by YOLO-V5s is capable of being deployed on board a drone or within a mobile device to enable real-time applications, although its [email protected] value is approximately 2.2% lower. We provide quantitative and subjective experimental results to evaluate the performance of the four DNN models proposed. Additional test results can be found in Appendix B.

In chapter 6, we propose the use of Deep Neural Network based object detector
models for litter detection in remote desert areas and in more suburban campsites.
Chapter 6

Litter Detection Using Deep Neural Networks

In this chapter we investigate the use of the most popular Deep Neural Network (DNN) architectures to create several novel litter detection models. We investigate the use of the Faster R-CNN (Faster Region-based CNN), SSD (Single Shot Detector) and YOLO (You Only Look Once) architectures. With regard to YOLO, we investigate the use of version 5 (s, m, l and x sub-versions) and version 7 (l sub-version) architectures. Two types of object detection models are developed: single-class (litter only) models and two-class (litter, plus man-made objects that are not litter) models. Approximately 5000 samples of litter objects and 2100 man-made objects not identified as litter were used for training, while 3200 litter objects of various types and 1400 man-made, non-litter objects were used for validation and testing, respectively. We rigorously compare the performance of the different models in litter detection and localisation in drone images captured at different altitudes and under different environmental conditions. Both objective and subjective approaches are used for the performance analysis.

6.1 Introduction to the Litter Detection


The area of land defined as desert in the world is about 47 million square kilometres
[124]. Deserts are widely distributed throughout the African, Asian, Australian,
American and European continents. The major deserts of the world include the
Sahara Desert in North Africa, the Rub' al Khali Desert in Saudi Arabia, and the
Taklamakan Desert in China [125]. The Dubai Desert Conservation Reserve (DDCR)
in the United Arab Emirates is a protected desert area of approximately 225 square
kilometres and is home to several endangered species of wildlife and flora [126].
It is involved in several desert conservation projects that include breeding and
controlled release of wildlife, and the preservation of endangered tree species.
Annually, the DDCR protected regions are visited by many tourists, scientists,
and researchers. Frequent human visits to nature reserves such as the DDCR-managed
land areas often lead to noticeable amounts of litter being left behind. Such litter
is a threat to the natural environment, flora and fauna.
At present, approaches to litter removal in desert areas have focused on sending
groups of litter pickers (workers and/or volunteers) to the areas where litter is
most likely to be present (popular tourist sites) and carrying out a manual search
and pick-up. This requires such groups to navigate challenging terrain, either in
vehicles or on foot. It is a tedious task: the groups may have to cover large areas
in which no litter is eventually found, wasting time, effort and resources and
exposing individuals to unnecessary risk. More recently, with the widespread use of
low-cost drones, the tendency has been to use drones for the surveillance of such
terrain. A drone flying at a sufficiently high altitude can cover a large ground
area in a relatively short time. Subsequently, such images/video can be checked by
human operators to identify the presence of litter and to mark its locations manually.
This is also a tedious task, as the operators of the video surveillance system have
to subjectively observe all footage captured by the drones. Once human observations
are made, litter pickers can be sent directly to the areas where litter has been
marked (e.g., via GPS locations), for eventual collection.
It is possible to use computer vision-based approaches to automatically detect
and locate litter in drone footage. Traditional Machine Learning (ML) approaches
require the identification of unique features of litter (i.e., they require manual
feature engineering) and subsequently the use of a feature-based object classifier
to differentiate litter from other objects. The challenge faced in using machine
learning for litter detection is the definition of the most effective features that
allow litter to be accurately differentiated from other objects. Although some
attempts at machine learning based litter detection have been made in the literature
[127–129], they have largely been ineffective, as the recorded accuracy levels are
not high.
In the research presented in this chapter, we adopt two approaches to litter
detection using Deep Neural Network models. The first is a single-class detection
approach, in which one type of object, i.e., a litter object, is differentiated from
everything else in the background of the scene. The second approach uses a two-class
classifier, in which litter objects and man-made objects that are not litter are
separated into two different classes, and both are separated from the scene
background. In order to compare and contrast the performance of the two approaches
to litter detection being investigated, we carry out litter detection in remote
areas (mostly consisting of a background with natural objects) and sub-urban areas
(with backgrounds containing some man-made objects, such as campsites).
For clarity of presentation, this chapter is divided into four sections. Section-6.1
provided an introduction to the application and research context. Section-6.2
presents the research methodology adopted, the experimental design of the two
approaches to litter detection, and the corresponding details of dataset preparation,
data labelling, training and the approach adopted for testing the performance of
the Deep Neural Network models. Section-6.3 presents the litter detection results
and a comprehensive analysis of the performance of the trained models under the
two adopted approaches to litter detection. Finally, Section-6.4 concludes, with
insights into future work and suggestions for improvements to the established DNN
models.

6.2 Proposed Approach to Litter Detection


The generic workflow that underpins the design of the proposed approaches to
litter detection is the same as that used for Ghaf-tree-only detection and multiple
tree detection, illustrated in Figure 4.1. It includes two main stages: the training
stage (including training and validation) and the testing stage. Before training
starts, the captured dataset must be prepared, and the objects to be detected must
be labelled for training, validation and testing purposes. Subsequently, the Deep
Neural Network is trained. Once the network is trained, it produces a trained model,
which can be used as an object or multiple-object detector/classifier. The test
image set is fed into the object classifier, which classifies the different object
types in each test image. In the case of a single-object detector model, the trained
model searches for one object type. The details of each phase, i.e., data preparation,
labelling, training, validation, and testing, are described in the sub-sections
that follow.

6.2.1 Data Preparation


We investigate two different approaches to litter detection: detecting one object
class, i.e., litter objects only, and detecting two classes of objects, i.e., litter
objects and human-made objects that are not litter. In Section-6.3, we show the
merits of each approach in accurately detecting litter in natural and sub-urban
environments.
For the single-class litter detection approach, a total of 913 images containing
more than 8000 litter objects were randomly selected from the many images taken
during drone flights at different altitudes in the nature/natural areas of the DDCR.
These images did not contain any significant number of human-made objects in the
background and mostly consisted of natural habitat. The selected image dataset was
then divided into three data subsets for training (60%), validation (20%), and
testing (20%). As some of the images contain a significantly greater number of
litter objects than others, the numbers of images in the training, validation, and
test data subsets differed from an exact split and were recorded as 512, 236, and
165, respectively.
For the two-class litter detection approach, a further 255 images containing
more than 3500 human-made objects were randomly selected from images taken during
drone flights at different altitudes over sub-urban campsites of the DDCR. These
images contained many human-made objects of different sizes and shapes, whilst also
including litter objects. The selected image dataset was then divided into three
data subsets for training (60%), validation (20%), and testing (20%). As some of
the images contain a significantly greater number of human-made objects than others,
the numbers of images in the training, validation, and test data subsets differed
and were recorded as 171, 53, and 31, respectively.

Dataset      Litter images   Human-made item images   Labelled litter objects   Labelled human-made items

Training     512             171                      5000                      2100
Validation   236             53                       1600                      700
Testing      165             31                       1600                      700

Table 6.1: Number of labelled litter objects and human-made items in each data subset
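As an illustration of the random 60/20/20 split described above, the minimal Python sketch below (not the authors' code; the folder name and image extension are assumptions) partitions a set of labelled images into the three subsets. In the thesis, the final image counts per subset were additionally adjusted to account for the uneven number of objects per image.

```python
# Minimal sketch of a random 60/20/20 split (not the authors' code).
# "ddcr_litter_images" and the .jpg extension are hypothetical assumptions.
import random
from pathlib import Path

def split_dataset(image_dir, seed=0):
    """Randomly partition labelled images into train/val/test subsets (60/20/20)."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train = int(0.6 * len(images))
    n_val = int(0.2 * len(images))
    return {
        "train": images[:n_train],
        "val": images[n_train:n_train + n_val],
        "test": images[n_train + n_val:],
    }

if __name__ == "__main__":
    for name, files in split_dataset("ddcr_litter_images").items():
        print(name, len(files))
```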

All drone imagery was collected by the authors within the nature reserve areas
of the Dubai Desert Conservation Reserve (DDCR), with DJI Phantom-4 and DJI Mavic 2
Pro drones flying at different heights/altitudes.
The drone-captured image set included images captured in different areas of the
DDCR demarcated land (nature and camp sites), at different altitudes and camera
angles, and taken at different times of the day (during daytime). The density/sparsity
of litter present within an image varied between images. The nature-site images
consisted only of natural objects and litter (i.e., they rarely contained a man-made,
non-litter object). The campsite images consisted of litter as well as man-made
objects/structures, which are difficult to differentiate from litter without taking
the image context into account. A set of images was randomly picked from the
drone-captured images as the training, validation and testing sets for the design
of the two litter detection approaches, single-class and two-class (see Table-6.1).
All experiments were performed on a computer system comprising an Intel Core
i7-6850K CPU, an NVIDIA GeForce GTX-1080Ti GPU, and 32 GB of RAM, running the
Windows 10 operating system.
We use the design of the proposed Single-Class Object Detector approach to litter
detection as a means of comparing the capabilities of popular Deep Neural Network
architectures in object detection, and use the Two-Class Object Detector approach
to litter detection as a means of rigorously comparing the performance of only the
best DNN architectures.
Single-Class Object Detector - The labelled information (i.e., objects of class
'litter') of the training data subset is used to train SSD, Faster R-CNN, and the
four sub-versions of YOLO-V5. Additionally, the labelled litter samples from the
validation data subset are used to fine-tune the CNN architectures and optimise
their performance during training, by deciding on the optimal values of the
hyperparameters of the network.
Two-Class Object Detector - The labelled information (i.e., objects of the classes
'litter' and 'human-made') of the training data subset is used to train the
YOLO-V5-Large and YOLO-V7 (Large) CNN architectures. Additionally, the labelled
samples from the validation data subset are used to fine-tune the CNN architectures
and optimise their performance during training, by deciding on the optimal values
of the hyperparameters of the network.
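The sketch below illustrates, under stated assumptions, how such a training configuration is typically expressed for YOLO-V5: a dataset description file listing the training/validation image folders, the number of classes and the class names. The paths, file names and the example training command are illustrative assumptions that depend on the exact ultralytics/yolov5 release used; they are not the authors' actual configuration.

```python
# Illustrative sketch only: writing a YOLOv5-style dataset description file for the
# single-class detector. Paths and names are assumptions, not the authors' setup.
from pathlib import Path

DATA_YAML = """\
train: datasets/litter/images/train   # folder of training images (labels alongside)
val: datasets/litter/images/val       # folder of validation images
nc: 1                                 # one class for the single-class detector
names: ['litter']
"""

Path("litter.yaml").write_text(DATA_YAML)

# Training would then typically be launched with the ultralytics/yolov5 repository
# script, for example (exact flags depend on the release used):
#   python train.py --img 640 --batch 16 --epochs 300 --data litter.yaml --weights yolov5l.pt
# For the two-class detector, nc would be 2 and names would be ['litter', 'human-made'].
```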

6.3 Experimental Results and Discussion


We present the experimental results obtained and a comprehensive performance
analysis of all CNN models developed within the two approaches to litter detection,
separately. The performances are compared using both quantitative and qualitative
approaches. Section-6.3.1 presents the results of the Single-Class Object Detection
approach and Section-6.3.2 presents those of the Two-Class Object Detection approach
to litter detection. Testing is done on a test image set (reserved and not previously
used in training and/or validation) comprising drone images captured in both the
nature areas and the campsites of the DDCR.

6.3.1 Single-Class Litter Detection


We evaluate the effectiveness of the four object detection models produced by
training the four sub-versions of YOLO-V5 and compare them to models created by
training other widely used Deep Neural Networks, namely SSD and Faster R-CNN
(Faster Region-based Convolutional Neural Networks). The six networks are trained,
validated and tested on the same datasets (see Table-6.1).

Quantitative Performance Comparison

The comparison of quantitative results is presented in Table-6.2. Our analysis
shows that SSD has the fastest training convergence time, but requires significant
time for testing (i.e., high deployment time/resources), and its recall, precision,
and [email protected] values remain substantially lower than those of the four YOLO-V5
models. Faster R-CNN, on the other hand, takes the longest time to converge in
training but has relatively good testing speeds/deployment costs compared to the
YOLO-V5 based models. However, the recall, precision, and [email protected] values of
the Faster R-CNN model are inferior to those of the four YOLO-V5 models. Comparing
the performance of the four YOLO-V5 models, we observed that as the complexity/depth
of the architecture increases, more time is required for training and testing, with
YOLO-V5l and x taking significantly more time than sub-versions s and m. YOLO-V5l
achieved the highest mean average precision (71.5%) in litter detection of all the
models, as presented in Table-6.2. However, its precision value is marginally lower
than that of YOLO-V5x. The YOLO-V5l based model, on the other hand, has a lower
detection time than the YOLO-V5x based model. Given the above observations and
considering the objective performance metrics, we recommend the use of YOLO-V5l
for litter detection.

Model          Training hours   Precision   Recall   [email protected]   Average detection time (ms)

SSD            10.5             0.21        0.30     15.9%       193
Faster R-CNN   15.6             0.14        0.28     20.2%       65.4
YOLO-V5s       14.6             0.76        0.61     65.3%       65.4
YOLO-V5m       14.8             0.81        0.62     69.5%       78.1
YOLO-V5l       15.1             0.76        0.65     71.5%       83.2
YOLO-V5x       15.2             0.79        0.65     71.3%       87.5

Table 6.2: Performance comparison of the DNN based litter detection models
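For context, the precision and recall values reported above are computed by matching predicted boxes to ground-truth boxes at an IoU threshold of 0.5. The sketch below is a simplified illustration of that matching logic, not the exact evaluation code used in the thesis; [email protected] additionally averages precision over the recall curve, which is omitted here.

```python
# Simplified illustration of matching detections to ground truth at IoU >= 0.5.
# Boxes are (x1, y1, x2, y2); predictions are dicts with 'box' and 'conf' keys.

def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(predictions, ground_truths, iou_thr=0.5):
    """Greedily match predictions (highest confidence first) to unmatched ground truth."""
    matched, tp = set(), 0
    for pred in sorted(predictions, key=lambda p: -p["conf"]):
        best_j, best_iou = None, iou_thr
        for j, gt_box in enumerate(ground_truths):
            overlap = iou(pred["box"], gt_box)
            if j not in matched and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    fp = len(predictions) - tp
    fn = len(ground_truths) - tp
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)
```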

In summary, based on the objective performance values presented and discussed
above, it can be concluded that the object detector models generated from YOLO-V5
and its sub-versions are far superior in performance to models generated by other
popular CNNs such as SSD and Faster R-CNN. Comparing the different sub-versions of
YOLO-V5, the deeper the architecture, the better the objective performance.

Visual Performance Comparison

According to Table-6.2, all four YOLO-V5 versions can identify every sort of
litter present in the images, with precision values above 76%. The high precision
and good recall values of all four YOLO-V5 models are also consistent with their
successful subjective performance, as demonstrated by the results illustrated in
Figure 6.4. In contrast, SSD and Faster R-CNN exhibit poor performance: both miss
the detection of many litter objects, as illustrated in Figure 6.1.

It is noted that a similar relative performance of the said models was observed in
the visual comparison across all the test images used. Given this observation and
the results tabulated in Table-6.2, we do not consider SSD and Faster R-CNN in the
further performance comparisons. In the figures, yellow circles indicate missed
targets and red crosses indicate incorrect detections.

(a) SSD (b) Faster R-CNN

(c) YOLO-V5s (d) YOLO-V5m

(e) YOLO-V5l (f) YOLO-V5x

Figure 6.1: The results of litter detection in drone imagery using the SSD, Faster
R-CNN, YOLO-V5s, YOLO-V5m, YOLO-V5l and YOLO-V5x based models

YOLO-V5s

YOLO-V5m

Figure 6.2: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x

YOLO-V5l

YOLO-V5x

Figure 6.2: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)

In Figure 6.2, after testing using the four versions of YOLO-V5, it was observed
that all four models were able to detect the majority of the litter. However, they
all missed some of the smaller pieces of litter. The YOLO-V5l model performed the
best of the four in detecting the smaller litter items. Overall, YOLO-V5l is an
effective tool for litter detection; however, further improvements are needed to
accurately detect all types and sizes of litter.

Further experimental results comparing the performance of the litter detection
models created via the four YOLO-V5 sub-versions are illustrated in Figures 6.3
and 6.4 below. Note that the yellow circles depict all items of litter missed, as
compared to a human observer.

YOLO-V5s

YOLO-V5m

Figure 6.3: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x

YOLO-V5l

YOLO-V5x

Figure 6.3: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)

YOLO-V5s

YOLO-V5m

Figure 6.4: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x

YOLO-V5l

YOLO-V5x

Figure 6.4: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)

The visual comparison results above illustrate that each of the four sub-versions
of YOLO-V5 performs well in litter detection. When the flight height of the UAV is
relatively low, the detection results of all four models are excellent (e.g., see
Figure 6.4). It is generally observed that litter objects have been detected
regardless of their sub-type (e.g., bottles, paper, bags, etc.), with objects more
common in the training data (e.g., bottles) and objects well defined in terms of
shape being detected very accurately, even at higher drone flight altitudes. The
visual performance comparison shows a bias towards objects of certain colours, such
as blue/turquoise litter. The reason is that the training data is not balanced and
hence contains more data related to these colours; in particular, the training data
consists of a large number of blue coloured bottles. This project is focused on
optimised detection of the litter typically present in the Dubai desert areas. The
types and distribution of litter in the Dubai desert are similar to those of the
training data, and hence this specific challenge is minimised.

YOLO-V5s

YOLO-V5m

Figure 6.5: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x

YOLO-V5l

YOLO-V5x

Figure 6.5: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)

Based on the results in Figure 6.5, all four versions of YOLO-V5 demonstrated good
litter detection performance. The mAP values from the test data indicated that
YOLO-V5l performed better than YOLO-V5m overall; however, in this specific test
image, YOLO-V5m performed slightly better. This could be attributed to the particular
characteristics of the test image; additionally, in less complex situations, a
shallower model may have an advantage. Overall, the data suggest that YOLO-V5l is
the best-performing model for litter detection, with YOLO-V5x demonstrating
comparable performance; in this particular test image, YOLO-V5l detected more litter
than YOLO-V5x. It can therefore be concluded that YOLO-V5l is the most effective
model for litter detection, while YOLO-V5x is also a good alternative. It is
important to note that individual test images may have unique characteristics, and
a broader analysis is necessary to evaluate the overall model performance.

YOLO-V5s

YOLO-V5m

Figure 6.6: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x

YOLO-V5l

YOLO-V5x

Figure 6.6: The results of litter detection in drone imagery using the YOLO-V5s,
YOLO-V5m, YOLO-V5l, and YOLO-V5x (continued)

In Figure 6.6, based on the results of litter detection tests using the four versions
of YOLO-V5, it was found that all four models were able to detect most of the litter
but missed some of the smaller items. This test image was captured from a UAV at a
height of 60 metres, higher than the previous test images. The litter in this image
is extremely small, making it difficult even for the human eye to identify; as a
result, there are more missed instances of litter than in the previous images.
Comparing YOLO-V5l and YOLO-V5x, both models performed similarly in litter detection
accuracy, but YOLO-V5l demonstrated advantages in terms of model training and testing
speed, as well as its smaller size. Therefore, YOLO-V5l is a more practical and
efficient choice for litter detection tasks in real-world applications. However,
further research and optimisation can still be done to improve the detection of
small litter items by YOLO-V5l.
Figure 6.7 illustrates an example image containing several very small objects of
litter, and the result of applying the YOLO-V5x based litter detection model. This
image was captured at 30 metres. A few very small objects of litter at the top right
of the image are not detected, as the objects are very small. The YOLO-V5x model
performs best in detecting the smallest objects of litter, and our detailed analysis
indicated that by adding a larger number of very small litter objects when training
this model, the accuracy of detection can be further improved; we will keep adding
new, balanced data in future work. However, when considering the performance of the
shallower YOLO-V5 models, such as 's', this is not the case, as it could lead to
false positives, i.e., very small non-litter objects being detected as litter. The
depth of the model should be sufficient to carry out a feature analysis detailed
enough for the trained model to differentiate very small litter objects from very
small non-litter objects. It was also noted in Figure 6.8 (and a few other example
images we tested) that YOLO-V5l performed marginally better than YOLO-V5x. This is
likely because the deeper network, i.e., YOLO-V5x, needs more data to perform more
accurately.

Figure 6.7: Detecting very small objects of litter using YOLO-V5x based model

Discussion – Single Class Litter Detector Performance

According to the literature review we conducted, this is the first attempt reported
in the literature to identify litter in drone-captured footage using the latest
advancements in Deep Neural Network architectures. We have shown that single-class
litter detection models based on the YOLO-V5 sub-versions result in mean average
precision values of up to 71.5% and precision values over 76%. This is a promising
step toward real-time detection of litter using aerial image data, despite the small
sample of training images employed in training the DNN architectures in the proposed
research. Further investigation of the resulting models' training and object detection
times (see Table 6.2) indicated that the higher training and detection times of the
models generated from the more complex and deeper networks (i.e., the YOLO-V5 x and
l sub-versions) were mainly due to the limitations of the processing power of the
computational hardware used in this research. The training and testing times could
be considerably reduced by using faster computer hardware.
In the experiments conducted, we only investigated a single-class litter detector,
in which all objects of litter, regardless of whether they are bottles, cans, paper,
boxes, or any other common objects of litter typically left behind after human
consumption, were labelled as a single type of object, 'litter'. Our detailed
investigations revealed that it is still important to have a sub-class balance,
i.e., a similar number of each type of litter object used in training, even though
all litter types are classed as one type, so that the testing accuracy for all
sub-types is similar. For example, in our training data the least represented type
of litter was 'drink cans', and such objects had the highest chance of being missed
or misclassified. Therefore, within the training process, we attempted to balance
the number of different sub-types of litter, as the available sample data for some
sub-types (such as drink cans) was relatively scarce. Our investigations indicated
that the models' performance would be better if sufficient data were collected for
all sub-types of litter, e.g., 2000 samples of each sub-type. A simple way of
checking this balance is sketched below.
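A minimal sketch of such a balance check follows. The CSV file name and its 'subtype' column are hypothetical assumptions: since all litter is labelled as a single class, sub-type information would have to come from separate annotation notes.

```python
# Hypothetical balance check over litter sub-types (bottle, can, paper, bag, ...).
# The CSV file and its 'subtype' column are assumptions; they are not part of the
# single-class label files, which only record the class 'litter'.
import csv
from collections import Counter

def subtype_counts(annotation_csv):
    counts = Counter()
    with open(annotation_csv, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["subtype"]] += 1
    return counts

if __name__ == "__main__":
    print(subtype_counts("litter_annotations.csv").most_common())
```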
We did not compare the litter detector developed in this project with some of the
existing work because the test sets used in those studies were captured using
ground-level cameras. In contrast, our work is more forward-looking and challenging,
given the widespread use of drones and the small size of the litter objects in the
images. There are also individual projects that use drones for testing, but they
detect large groups of litter; in contrast, our project offers the potential for
classifying individual litter objects.
As a recommendation for future improvement of model performance, in particular
that of the models created from the deeper DNN architectures such as YOLO-V5 x and
l, more UAV data should be captured at widely differing drone flight altitudes,
i.e., capturing at altitudes that result in visibly distinct litter types as well
as at altitudes where the litter types are indistinguishable. Such data, when used
in training, results in better generalisation of the models in litter detection.
Despite the limitations of the data available for training, we have demonstrated
the superior capability of the models based on the most popular DNN architectures
in litter detection.
In the investigation conducted in this section and the results presented, the
training and test data were limited to desert regions of natural habitat, where
litter had been left behind by visitors to the DDCR desert conservation areas. The
scenes/images rarely contained any human-made or non-natural objects. This led us
to investigate the potential capability of the developed models to detect litter
in areas that are usually occupied by humans, such as campsites. In the remainder
of this section we present the results of this investigation and a resulting
alteration to our litter detection approach, to enable more accurate detection of
litter in such regions.
We use only the best-performing litter detection model, based on the YOLO-V5l
sub-version (71.5% mAP), for single-class litter detection in campsites of the DDCR
desert regions. A representative sample of results, illustrated in Figures 6.8 and
6.9, clearly shows that the single-class litter detection model detects many
human-made objects (such as cars and freezers) as litter. Not all human-made objects
are detected as litter: most false positives are of a distinct colour (not typical
of natural desert regions) or shape (e.g., those with straight-line edges). Although
one could argue that the scene in Figure 6.8 has some objects that can be defined
as litter, the scene in Figure 6.9 does not have any objects that can be defined as
litter. It is noted that human-made objects left around in a disorganised manner,
in a locality where they are not generally expected, could typically be contextually
defined as litter. An attempt to detect individual objects of litter, without
considering the context in which they appear within a scene, therefore has
limitations. However, our current research efforts are limited to identifying litter
based on single object detection. The fact that 'litter' objects are not natural
objects but human-made makes the challenge more complex. Therefore, it is important
that we differentiate litter objects that a human would define as litter (based on
a visual context analysis) from man-made, but non-litter, objects (again a human
visual judgement based on context). Both object types are human-made, and the
single-class litter detection approach we adopted in Section 6.3.1 will therefore
fail to differentiate the two types of objects. Further, as we trained our
single-class litter detector only on natural images, the background of the labelled
litter objects used in training and validation consists only of natural objects.
Therefore, when we apply the resulting object detectors to campsites, any non-natural
or man-made objects, which we might not define as litter, are still more likely to
be classified as litter than as part of the scene background. This is a further
reason behind the failure of the single-class litter detector when applied to
campsite images. Nevertheless, we demonstrated that the single-class approach to
litter detection is an ideal and simple solution for litter detection in nature
reserves such as most of the land area managed/conserved by the DDCR.

Figure 6.8: Test results of single-class litter detection models in desert campsites

Figure 6.9: Test results of single-class litter detection models in desert campsites

Figures 6.8 and 6.9 show the results of applying the single-class litter detector
to campsite images. Many artificial items that are not litter are identified as
litter, because litter and other artificial objects share similar features. Whether
a human-made item is litter is a subjective question that only a human can ultimately
decide, and this poses some challenges for our research.

6.3.2 Two-Class Litter Detection


Given the observations and discussion in Section 6.3.1, we propose the introduction
of a second object class into the design of our litter detector: a class of
'human-made' (hence non-natural) objects that would not be classified as 'litter'
by a human observer. Many human-made objects in the campsite images illustrated in
Figures 6.8 and 6.9 belong to this group. However, in the scene illustrated in
Figure 6.8 in particular, there are a number of human-made campsite objects that a
human observer might still classify as litter due to their abandoned/neglected
nature, for example, the two clay pots in the top-left quarter of the image. In the
labelling we carried out, we have still classified such objects as human-made, as
they are large, despite the neglected way in which they have been left. Therefore,
the resulting litter detector should not classify such objects as litter, but should
rather classify them as 'human-made', not as a part of the scene background.
Following the approach proposed in Section 6.2.1, we labelled images in readiness
for the design of the two-class litter detector (the label format is sketched below).
At the commencement of this research project, YOLO-V5 (and its sub-versions) was
the most recent YOLO architecture proposed. More recently, however, YOLO-V7 and
YOLO-V8 have been proposed, with claims of better performance. Therefore, in
approaching two-class litter detection, we have decided to compare the best-performing
sub-version of YOLO-V5, i.e., YOLO-V5l, against the performance of a two-class
litter detector trained using YOLO-V7l.
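For reference, a minimal sketch of the two-class labelling convention in the YOLO text-label format is shown below (one line per object: class index followed by the normalised box centre and size). The class indices and file name are illustrative assumptions, not the authors' exact labelling tool output.

```python
# Sketch of YOLO-format labels for the two-class detector:
# each line is "class_id x_center y_center width height", normalised to [0, 1].
# Class indices and the file name below are illustrative assumptions.
LITTER, HUMAN_MADE = 0, 1

def yolo_label_line(class_id, x_c, y_c, w, h):
    """Format one bounding box as a YOLO label line."""
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# Example: one small litter object and one large human-made structure in an image.
lines = [
    yolo_label_line(LITTER, 0.412, 0.655, 0.018, 0.024),
    yolo_label_line(HUMAN_MADE, 0.700, 0.300, 0.250, 0.180),
]
with open("DJI_0001.txt", "w") as f:   # hypothetical label file name
    f.write("\n".join(lines) + "\n")
```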

Network                       YOLO-V5s   YOLO-V5m   YOLO-V5l   YOLO-V5x   YOLO-V7l

No. of convolutional layers   78         170        268        448        101

Table 6.3: Illustration of a conceptual/architectural comparison of the Deep Neural
Networks (number of convolutional layers)

Experimental Results and Discussion

This section presents a comparison of the performance of the YOLO-V5 and YOLO-V7
models in detecting litter and human-made items. Both models were trained, validated,
and tested on the same subsets of UAV images. Their performance is evaluated in two
ways, discussed below: a quantitative performance comparison and a visual performance
comparison.

Network    Precision (litter)   Recall (litter)   mAP (litter)

YOLO-V5l   82.6%                58.1%             70.6%
YOLO-V7l   87.5%                19.9%             39.0%

Table 6.4: Performance comparison of the YOLO-V5l and YOLO-V7l models on the litter class

Network    Precision (human-made)   Recall (human-made)   mAP (human-made)

YOLO-V5l   51.2%                    58.1%                 46.3%
YOLO-V7l   89.7%                    69.0%                 88.0%

Table 6.5: Performance comparison of the YOLO-V5l and YOLO-V7l models on the human-made item class

Network    Precision (2 classes)   Recall (2 classes)   mAP (2 classes)   Model size

YOLO-V5l   66.9%                   54.4%                58.5%             88.4 MB
YOLO-V7l   87.9%                   30.5%                63.6%             142 MB

Table 6.6: Performance comparison of the YOLO-V5l and YOLO-V7l models over both classes
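As a quick consistency check (ours, not from the thesis), the two-class mAP values in Table 6.6 are approximately the mean of the per-class mAP values reported in Tables 6.4 and 6.5:

```python
# Consistency check (ours): two-class mAP ~= mean of the per-class mAP values.
per_class_map = {
    "YOLO-V5l": (70.6, 46.3),   # (litter, human-made) mAP in %
    "YOLO-V7l": (39.0, 88.0),
}
for model, (m_litter, m_human) in per_class_map.items():
    print(f"{model}: {(m_litter + m_human) / 2:.2f}%")
# Prints 58.45% and 63.50%, close to the 58.5% and 63.6% reported in Table 6.6.
```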

In this study, we trained YOLO-V5l and YOLO-V7l to detect litter and human-made
objects that are not litter. The performance of the resulting trained models was
compared in terms of precision, recall, and mean average precision (mAP). YOLO-V5l
exhibited high precision in detecting litter, but performed comparatively poorly
on human-made items. Conversely, YOLO-V7l had very low recall in detecting litter,
but performed better in detecting human-made items. The results show that YOLO-V5
performs better when detecting litter, while YOLO-V7 performs better when detecting
human-made items. Compared to the YOLO-V5l model in single-class litter detection
(which achieved an mAP of 71.5%), the two-class litter detection model achieved a
slightly lower mAP of 70.6% in detecting litter. This can be attributed to the high
similarity between the two types of target objects, which could cause some
interference in the results. Nonetheless, the performance of the two-class model is
still considered satisfactory, and it demonstrates the potential of the YOLO-V5
model in complex object detection tasks. Further research could explore ways to
improve the performance of the model in distinguishing between closely related
object classes. The overall two-class mAP of YOLO-V7l was slightly higher than that
of YOLO-V5l. Additionally, YOLO-V5l has a smaller model size. In practical
applications, a model with a smaller size is often more advantageous, because
hardware limitations, such as the memory available on the drone, must be considered.
A smaller model not only requires less memory to store, but can also be processed
more quickly, which is essential for real-time applications; a simple way of checking
model size and average detection time is sketched below. Therefore, according to
the quantitative performance of the two models, and depending on the specific
detection requirements, either YOLO-V5l or YOLO-V7l can be chosen for optimal
performance. The following visual performance comparison provides a more intuitive
comparison of the two models.
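A minimal sketch of such a size/latency check is given below, assuming a trained weights file ('best.pt') loaded through torch.hub and a handful of hypothetical test images; the exact API depends on the ultralytics/yolov5 release used and this is not the thesis' benchmarking code.

```python
# Rough size/latency check (assumptions: a locally trained 'best.pt' weights file,
# the ultralytics/yolov5 hub interface, and three hypothetical test images).
import os
import time
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
print("Model file size: %.1f MB" % (os.path.getsize("best.pt") / 1e6))

test_images = ["test1.jpg", "test2.jpg", "test3.jpg"]
start = time.time()
for img in test_images:
    _ = model(img)                      # run detection on one image
avg_ms = 1000 * (time.time() - start) / len(test_images)
print("Average detection time: %.1f ms per image" % avg_ms)
```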

In Figures 6.10 and 6.11, we can see that YOLO-V5 can accurately detect litter
items of different sizes, shapes, and colours, while YOLO-V7 has difficulty in
identifying smaller litter items, resulting in a lower recall rate. This is
consistent with the quantitative performance comparison results. Additionally, in
Figure 6.10, YOLO-V5 shows better performance in detecting litter items that are
partially occluded or located in complex backgrounds, which indicates that YOLO-V5
can effectively extract and analyse features from these images. These results
demonstrate the superior performance of YOLO-V5 in litter detection.

Figure 6.10: Testing results of YOLO-V5l based two class litter detection model

Figure 6.11: Testing results of YOLO-V7l based two class litter detection model

Figure 6.12: Testing results of YOLO-V5l based two class litter detection model

Figure 6.13: Testing results of YOLO-V7l based two class litter detection model

In Figures 6.12 and 6.13, both models are able to detect a wide range of litter
items such as plastic bags, bottles, and cans, but YOLO-V5 provides more accurate
and consistent detection results, with higher confidence values shown at the top-left
corner of each bounding box. Similarly, YOLO-V5 is more successful in detecting
litter items that are partially occluded or located in complex backgrounds,
indicating its superior ability to analyse features and accurately identify objects.
These observations further support the findings from the quantitative performance
comparison and demonstrate the effectiveness of YOLO-V5 in litter detection.
Overall, the visual performance comparison provides valuable insights into the
strengths and weaknesses of the two models, which can help inform decision-making
in environmental monitoring and management.

Figure 6.14: Testing results of YOLO-V5l based two class litter detection model

Figure 6.15: Testing results of YOLO-V7l based two class litter detection model

Figures 6.14 and 6.15 depict the results of applying the two models to campsite
images. This test is particularly challenging due to the scene's complexity,
including the number and size of the objects present, which can significantly affect
the result. Comparing the two models, we find that their performance is comparable.
However, Table 6.5 shows that YOLO-V5's ability to detect human-made items is not
very good. This is due to the complexity of the campsite, which makes it challenging
to label such images accurately, leading to a lower overlap between the predicted
bounding boxes and the ground truth. Despite this, both YOLO-V5 and YOLO-V7 perform
well in the test images, successfully recognising objects of different sizes and
colours, such as houses, walls, crocks, and abandoned buildings.
Based on the comprehensive test results and visual performance, YOLO-V5l
outperforms YOLO-V7l in detecting litter and human-made items. However, it is
important to note that the choice of model depends on the specific project
requirements and constraints. In this section, we have shown that YOLO-V5l is a
feasible and effective option for litter detection using a two-class approach. The
success of both the YOLO-V5l and YOLO-V7l models in detecting litter and human-made
items highlights the potential of deep learning techniques in helping to solve a
complex environmental problem.

6.4 Conclusion
The accurate detection of objects whose identification relies on human judgement,
such as the litter and human-made items addressed in this chapter, is a difficult
task in the field of object recognition, as the definition is made by humans and
could thus differ between individuals. This study used different DNN models to
detect the objects concerned and rigorously compared their performance. The outcomes
of this study can provide new solutions and inspiration for future researchers. It
is essential to continue to explore and develop new methods and technologies for
object recognition, especially in areas where subjective interpretation is required.
By doing so, we can improve the accuracy and efficiency of object recognition and
expand its applications to various fields. Additional test results can be found in
Appendix B.
In the research presented in this chapter, we first investigated using the most
established recent YOLO version, i.e., YOLO-V5, to detect litter as a single-class
litter detection problem. All YOLO-V5 sub-version networks achieved a detection
precision of over 76%, according to the findings of the experiments presented in
this chapter. The performance of YOLO-V5l in terms of litter detection is the best,
with a [email protected] value of 71.5%. We observed that when the images are blurry,
overlapping, or consist of different backgrounds, YOLO-V5x outperforms the other
YOLO-V5 sub-version networks at detecting litter. Over 5000 litter samples were
used during training, and 913 drone-captured images were used for collecting these
samples. The detection performance should improve further if this number is raised
to 10,000 or higher, since the models, especially those created from the sub-versions
with deeper architectures, will be able to generalise more and hence perform better
on previously unobserved data. To improve the performance of the resulting object
detection models under more complex scenarios and backgrounds, we added a new class,
namely a human-made object class, and trained DNN architectures of both YOLO-V5 and
the more recently proposed YOLO-V7. We compared the objective and subjective/visual
performance of the resulting models and concluded that YOLO-V5 generally performed
better in this study, as the deeper sub-versions of YOLO-V5 had more complex
architectures than the deepest architecture available in YOLO-V7. The two-class
litter detection approach proposed helped overcome many challenges of litter
detection as compared to the single-class approach presented. The results of this
research and the resulting models, despite the limits of the dataset used in
training, demonstrate the feasibility of real-time litter detection via aerial
surveillance, aiding the preservation of the environment and desert in the UAE.
Chapter 7

Conclusions and Future Work

7.1 Summary and Conclusion

This thesis presented the results of a research study in which the design,
development and testing of novel and innovative deep network based computational
models were rigorously investigated for detecting and identifying named objects in
drone imagery, with a particular focus on detecting objects of significant importance
in nature reserves in desert areas of the Middle East (UAE). The study aimed to
design deep learning-based object detection models using well-known Convolutional
Neural Networks (CNNs) that could efficiently and accurately recognise various types
of objects, such as Ghaf trees, different types of trees (i.e., Ghaf, Acacia and
Palm trees), and litter. The research was conducted as part of an ongoing research
collaboration with the Dubai Desert Conservation Reserve, Dubai, UAE, and involved
the creation of software packages that were field-tested in real-world settings,
providing vital feedback for continuous improvement of the model designs, supported
by expert knowledge of the application scenarios that the research addressed. The
methodology employed in the research involved the collection of drone imagery from
farms and suburban desert areas in the Middle East and Thailand. The images were
labelled and used to train the CNN models, which were then tested on a separate
dataset of images. The results of the study showed that the CNN models developed
in this research achieved high levels of accuracy in detecting various objects,
including Ghaf trees, different types of trees using a single model, and litter.
These findings are significant, as they can relieve nature reserve managers and
wardens of some tedious and time-consuming job functions involved in protecting
nature reserves. Overall, the research presented in this thesis contributes to the
field of machine learning and deep learning by demonstrating the effectiveness of
CNN-based object detection models in identifying objects in drone imagery and their
limitations, and provides useful insights for future research in this area.
In Chapter 4, we explored the use of Convolutional Neural Networks to detect Ghaf
trees in aerial videos captured by a drone in various environmental settings and at
different altitudes. Our findings represent the first attempt in the literature to
automatically detect Ghaf trees in drone video footage using CNN based computational
models, which have been proven to be more effective than traditional machine learning
based methods. Despite training with a relatively small number of images, we obtained
a high [email protected] value of 81.1% using the YOLO-V5x based model, which can detect
Ghaf trees in approximately 78 ms on an image of size 3840×2160 pixels. This is a
promising achievement towards real-time detection of Ghaf trees. The training time
for generating the model was approximately 10 hours, predominantly limited by the
hardware used. Models based on the other three sub-versions of YOLO-V5 achieved
[email protected] values of above 77.4%, while other popular DNNs, such as R-CNN and SSD,
performed less effectively. Our detailed analysis of the detection results revealed
that YOLO-V5x was the most successful at detecting Ghaf trees in complex scenarios,
including overlapping, blurry, and obstructed images with different backgrounds and
varying tree sizes. With a dataset of just over 5,000 Ghaf trees, 3,200 of which
were used for training, we expect that performance can be further enhanced if the
number of images is expanded to 10,000 or more. Our results demonstrate the potential
of aerial surveillance for real-time detection of the endangered national tree of
the UAE and of the Gulf Region.
Chapter 5 investigated the effectiveness of using YOLO-V5 based computational
models to detect multiple types of trees in drone images using a single computational
model, specifically focusing on detecting and differentiating Ghaf, Palm, and Acacia
trees in high-altitude drone video footage. The practical challenge addressed in
this research is the difficulty of detecting and differentiating multiple types of
trees at high altitude, as they appear too small to include features that a deep
neural network can meaningfully learn from. To implement the multiple-tree detection
system, the three types of trees, as perceived by a human observer, were marked
with three different labels. The images we used consisted of at most two types of
trees, as Acacia trees are usually only present in nature areas that do not usually
have Palm or Ghaf trees. Therefore, we created image mosaics consisting of multiple
tree types, for rigorous testing. Different sizes of trees, overlaps with other
trees and objects, lack of contrast with the image background, image blur, and
similarities in colour, texture and shape with other trees challenged the accurate
detection of the three tree types from drone footage. The proposed research also
conducted experiments to train YOLO-V5 to detect individual types of trees, for
comparison with the three-tree-type detector.
The results showed that the mean average precision (mAP) values obtained for the
single-type tree detector (Chapter 4) are lower than those for group detection
across all four sub-versions of YOLO-V5 (Chapter 5). The highest mAP value achieved
for the Ghaf-tree-only detector was 81.1%, using YOLO-V5x, while the corresponding
mAP value obtained was 83.5% when using the multiple tree detector model.
Additionally, the models trained to detect single types of trees are unable to
accurately detect small trees that are in close proximity to big trees, indicating
the difficulty of detecting single trees, especially those very small in
size/appearance, in high-altitude drone imagery. The use of more than one type of
tree in the labelling process also helps to differentiate each tree type more
accurately from other trees and from the image background, leading to relatively
better detection performance. For example, if the confidence of a tree being detected
as a Ghaf tree is 0.51, the confidence of it being detected as an Acacia tree may
be recorded as 0.30, and the confidence of it being detected as a Palm tree as 0.15.
Hence the model can be more confident that the tree in question is a Ghaf tree, and
not one of the other two types, thereby minimising false positives. These original
research findings, presented in Chapter 5, have practical applications in using
drones to autonomously survey large areas for different tree species, aiding their
protection, maintenance and management.
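A toy sketch of this confidence comparison, using the hypothetical values quoted above, is:

```python
# Toy illustration of the per-class confidence comparison (values hypothetical).
confidences = {"Ghaf": 0.51, "Acacia": 0.30, "Palm": 0.15}
predicted = max(confidences, key=confidences.get)
margin = confidences[predicted] - sorted(confidences.values())[-2]
print(f"{predicted} (margin {margin:.2f})")   # -> Ghaf (margin 0.21)
```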
In Chapter 6, we first investigated using the most established YOLO version, i.e.,
YOLO-V5, to detect litter as a single-class litter detection problem. We found that
models based on all YOLO-V5 sub-version networks have a detection precision of over
76%, according to the findings of the experiments presented in that chapter. The
performance of YOLO-V5l in terms of litter detection is the best, with a [email protected]
value of 71.5%. We observed that when the images are blurry, overlapping, or consist
of different backgrounds, YOLO-V5x outperforms the other YOLO-V5 sub-version networks
at detecting litter. Over 5000 litter samples were used during training, and 913
drone-captured images were used in the research conducted for collecting these
samples. We showed that the detection performance will improve further if this
number is raised to 10,000 or higher, since especially those models created from
the sub-versions with deeper architectures will be able to generalise more and hence
perform better on previously unobserved data. To improve the performance of the
resulting object detection models under more complex scenarios and backgrounds, we
added a new class, namely a human-made object class, and trained DNN architectures
of both YOLO-V5 and the more recently proposed YOLO-V7. We compared the objective
and subjective/visual performance of the resulting models and concluded that YOLO-V5
generally performed better in this study, as the deeper sub-versions of YOLO-V5 had
more complex architectures than the deepest architecture available in YOLO-V7. The
two-class litter detection approach proposed helped overcome many challenges of
litter detection as compared to the single-class approach presented. The results
of this research and the resulting models, despite the limits of the dataset used
in training, demonstrate the feasibility of real-time litter detection via aerial
surveillance, aiding the preservation of the environment and desert in the UAE.

7.2 Limitations and Future Work


While the thesis demonstrates the design and development of novel object detection
models that can be effectively used on drone imagery, there is scope for further
research to improve the models. Future investigations could explore the following:

• Training the networks with more training samples to increase the detection
accuracy of all trained models;

• Further tuning the network hyperparameters to optimise performance;

• Pruning the networks by removing unnecessary sections, to increase the detection
speed and the detection accuracy and to reduce deployment cost/challenges;

• Utilising a drone-based tree detection system for analysing the locations (for
census purposes), growth, wellbeing and distribution of trees (e.g., change
detection);

• Extending the drone-based litter detection system for detecting and recog-
nising litter type in association with the flight altitude and image resolution;

• Applying all developed models to large ortho-mosaic images, to work seamlessly
with GIS software tools;

• YOLO-V8 was published a few months before the submission of this thesis. It
promises more accurate predictions and faster training and deployment times. The
same approach to data labelling, training and testing can be used for generating
object detection models based on YOLO-V8 (a brief sketch of this route is given
after this list).
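A brief, hedged sketch of the YOLO-V8 route mentioned in the last item, using the ultralytics Python package (API details may differ between releases, and this is not part of the work reported in the thesis), is:

```python
# Hedged sketch of training a YOLO-V8 model with the ultralytics package
# (not part of the work reported in this thesis; API details may differ by release).
from ultralytics import YOLO

model = YOLO("yolov8l.pt")                                  # pretrained YOLO-V8 large weights
model.train(data="litter.yaml", epochs=300, imgsz=640)      # reuse the same data YAML convention
results = model.predict(source="test_images/", conf=0.25)   # run detection on a folder of images
```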

The original research presented in this thesis has already been submitted as four
papers, two journal and two conference papers (see Appendix-A). The resulting
implementations are being used in practice at the Dubai Desert Conservation Reserve
(DDCR), UAE. The continuous feedback being received on operational accuracy, together
with the additional data being gathered, is being used to continuously improve the
accuracy of the models currently in use and presented in this thesis.
With the rapid development of artificial intelligence, more powerful algorithms
and networks emerge frequently. At present, the accuracy one can achieve with these
models is already very high. However, there is still a significant gap in performance
between machines and human beings. Humans can make mistakes, usually called human
error. For example, suppose the ground truth contains 100 Ghaf trees. Based on our
current ability to visually recognise Ghaf trees, it is possible for a human to
correctly detect at least 98 trees, while the model developed can correctly detect
over 80 trees. There may also be some trees that a human cannot detect, or detects
wrongly, but that the model detects correctly. Based on these observations, with
the future advancement of AI, machines could surpass human performance in the field
of object detection and classification when large amounts of data are available for
training and more advanced networks are proposed.
Appendix A

Research Publications

This thesis presents four contributions, with two submitted to international con-
ferences and the remaining two to international journals.

Conference

• G. Wang, A. Leonce, E. Edirisinghe, T. Khafaga, G. Simkins, U. Yahya and M. S. Shah,
"Ghaf Tree Detection from Unmanned Aerial Vehicle Imagery Using Convolutional Neural
Networks", has been accepted to The 10th International Symposium on Networks,
Computers and Communications (ISNCC'23);

• A. Leonce, G. Wang, E. Edirisinghe, "Litter Detection from Unmanned Aerial Vehicle
Imagery Using Convolutional Neural Networks", has been accepted to The 10th
International Symposium on Networks, Computers and Communications (ISNCC'23);

Journal

• G. Wang, T. Jintasuttisak, A. Leonce, E. Edirisinghe, T. Khafaga, G. Simkins,
U. Yahya and M. S. Shah, "A Deep Neural Network based Approach to Tree Type
Recognition in Desert Drone Imagery", has been submitted to Computational Visual
Media (a Q1 journal);

• A. Leonce, G. Wang, E. Edirisinghe, "Deep Neural Network based Automatic Litter
Detection in Desert Areas using Unmanned Aerial Vehicle Imagery", has been submitted
to the Journal of Environmental Informatics (a Q1 journal).

Appendix B

Additional Results

This appendix provides additional image results that were obtained from testing
the systems proposed in Chapters 4, 5, and 6. The results include ghaf tree
detection, multiple tree detection and classification, and litter detection.


B-I: Additional results of ghaf tree detection


To further demonstrate the performance of the trained model, this section shows
additional testing results for Ghaf tree detection.

Figure B.1: Ghaf Tree Detection Result

Figure B.2: Ghaf Tree Detection Result

Figure B.3: Ghaf Tree Detection Result

Figure B.4: Ghaf Tree Detection Result

Figure B.5: Ghaf Tree Detection Result

Figure B.6: Ghaf Tree Detection Result

Figure B.7: Ghaf Tree Detection Result

Figure B.8: Ghaf Tree Detection Result

B-II: Additional results of multiple tree detection and classification


To further demonstrate the performance of the trained model, this section shows
additional testing results for multiple tree detection and classification.

Figure B.9: Ghaf Tree Detection Using Multiple Tree Detector

Figure B.10: Ghaf Tree Detection Using Multiple Tree Detector

Figure B.11: Ghaf Tree Detection Using Multiple Tree Detector

Figure B.12: Ghaf Tree Detection Using Multiple Tree Detector

Figure B.13: Ghaf Tree Detection Using Multiple Tree Detector

Figure B.14: Ghaf Tree Detection Using Multiple Tree Detector

Figure B.15: Palm Tree Detection Using Multiple Tree Detector

Figure B.16: Palm Tree Detection Using Multiple Tree Detector

Figure B.17: Palm Tree Detection Using Multiple Tree Detector

Figure B.18: Palm Tree Detection Using Multiple Tree Detector

Figure B.19: Palm Tree Detection Using Multiple Tree Detector

Figure B.20: Palm Tree Detection Using Multiple Tree Detector

Figure B.21: Acacia Tree Detection Using Multiple Tree Detector

Figure B.22: Acacia Tree Detection Using Multiple Tree Detector

Figure B.23: Acacia Tree Detection Using Multiple Tree Detector

Figure B.24: Acacia Tree Detection Using Multiple Tree Detector

Figure B.25: Acacia Tree Detection Using Multiple Tree Detector

Figure B.26: Acacia Tree Detection Using Multiple Tree Detector


B-III: Additional results of litter detection


156 7.2 Limitations and Future Work

Figure B.27: Litter Detection Result in Desert Area

Figure B.28: Litter Detection Result in Desert Area


7.2 Limitations and Future Work 157

Figure B.29: Litter Detection Result in Desert Area

Figure B.30: Litter Detection Result in Desert Area


158 7.2 Limitations and Future Work

Figure B.31: Litter Detection Result in Desert Area

Figure B.32: Litter Detection Result in Desert Area


7.2 Limitations and Future Work 159

Figure B.33: Litter Detection Result in Desert Area

Figure B.34: Litter Detection Result in Desert Area



Figure B.35: Litter Detection Result in Camp Area

Figure B.36: Litter Detection Result in Camp Area

