OSCAR ALSING
Abstract
With the advancement of deep learning in the past few years, we are
able to create complex machine learning models for detecting objects
in images, regardless of the characteristics of the objects to be detected.
This development has enabled engineers to replace existing heuristics-based
systems with machine learning models of superior performance.
In this report, we evaluate the viability of using deep learning models
for object detection in real-time video feeds on mobile devices, in terms
of object detection performance and inference delay, both as end-to-end
systems and as feature extractors for existing algorithms. Our results
show a significant increase in object detection performance in comparison
to existing algorithms when transfer learning is used on neural networks
adapted for mobile use.
Sammanfattning
The advances in deep learning in recent years mean that we are able
to create more complex machine learning models for identifying objects
in images, regardless of the attributes or character of the objects. This
development has made it possible for researchers to replace existing
heuristics-based algorithms with machine learning models of superior
performance. This report aims to evaluate the use of deep learning models
for object detection in video on mobile devices, with respect to detection
performance and inference time. Our results show a significant increase
in performance relative to existing heuristics-based algorithms when deep
learning and transfer learning are applied to artificial neural networks.
Contents

1 Introduction
  1.1 Problem statement
  1.2 Scope
  1.3 Thesis outline
2 Background
  2.1 History of Computer Vision
  2.2 Definitions
    2.2.1 Classification
    2.2.2 Object Detection
    2.2.3 Real-Time Object Detection
    2.2.4 Training and inference
      2.2.4.1 Mean Average Precision
    2.2.5 Precision and Recall
    2.2.6 Cost function
    2.2.7 Hyperparameters
  2.3 Relevant Theory
    2.3.1 Artificial Neural Networks
      2.3.1.1 Architecture
        2.3.1.1.1 Feed-forward Neural Networks
        2.3.1.1.2 Deep Neural Networks
      2.3.1.2 Activation function
        2.3.1.2.1 Rectified Linear Units
        2.3.1.2.2 Softmax
      2.3.1.3 Learning
        2.3.1.3.1 Algorithms
        2.3.1.3.2 Generalisation
        2.3.1.3.3 Regularisation
    2.3.2 Convolutional Neural Networks
4 Results
  4.1 mAP Performance
  4.2 Inference time
  4.3 Detection experiments
5 Discussion
  5.1 Performance/latency payoff
  5.2 Augmented network performance
  5.3 Deep learning vs heuristics
  5.4 Quality of data
  5.5 Hyperparameter tuning
  5.6 Sustainability and ethics
6 Conclusions
Bibliography
Acronyms

AI Artificial Intelligence.
CV Computer Vision.
DL Deep Learning.
ML Machine Learning.
TF TensorFlow.
Chapter 1
Introduction
With the advancement in Deep Learning (DL) in the past few years, we
are able to create complex ML models for detecting objects in images,
regardless of the characteristics of the objects to be detected. This
development has enabled engineers to replace existing heuristics-based
systems with ML models of superior performance [37].
As people are using their mobile phones to a larger extent, and
also expect increasingly advanced performance [43] from their mobile
applications, the industry needs to adopt more advanced technologies
to meet these expectations. One such adoption could be the use of
ML algorithms for object detection.
ML is commonly divided into two phases, namely the training and
the inference phase. Training is the phase where a model, usually a
neural network, is trained to behave a certain way based on given
datasets. This step can easily be carried out in the cloud and dis-
tributed to mobile devices, where the trained models can be used for
inference on previously unknown data.
When applying more advanced technologies and algorithms in a
mobile environment, one of the challenges is the limited computational
power of the mobile hardware. As inference is computationally ex-
pensive, it is crucial that operations are optimised for mobile devices.
By using the mobile version of TensorFlow (TF) [30] namely Tensor-
Flow Mobile (TFM) [22] and the updated mobile framework Tensor-
Flow Lite (TFL) [21], developers are able to use pre-trained models on
mobile devices for inference with optimisation for mobile hardware.
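As a minimal sketch of how such a pre-trained model is used for inference, the following Python snippet loads a TFL model and runs it on one frame. The model file name and input shape are hypothetical assumptions; on an actual device the equivalent TFL Java/Kotlin or Swift APIs would be used.

    import numpy as np
    import tensorflow as tf

    # Load a pre-trained, mobile-optimised detection model (hypothetical file).
    interpreter = tf.lite.Interpreter(model_path="ssd_postit_detector.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # One normalised RGB frame; the shape must match the model's input tensor.
    frame = np.random.rand(1, 300, 300, 3).astype(np.float32)

    interpreter.set_tensor(input_details[0]["index"], frame)
    interpreter.invoke()  # run inference

    # An SSD-style detector typically outputs boxes, classes and scores.
    boxes = interpreter.get_tensor(output_details[0]["index"])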
The goal of this thesis is to evaluate the feasibility of using DL models
for detecting Post-it® notes on mobile devices in comparison to the
existing heuristics-based algorithm.
1.2 Scope
The assignment entails the development of an ML model running on a
mobile device capable of detecting Post-it® notes in real-time from a
video feed. The following challenges have been identified.
1. Computational cost during the inference (recall) phase of such a model,
as it should be capable of running on a mobile device with limited
computational power. As multiple objects might exist in a single frame,
the frame must be divided into a grid of multiple cells, where each cell
is analysed independently, which inherently increases the computational
cost.
Chapter 2
Background
Figure 2.1: Example images from the MNIST data with the corre-
sponding correct labels. The dataset consists of thousands of images
of handwritten digits.
2.2 Definitions
2.2.1 Classification
The process of specifying which of the k possible categories some input
x belongs to is referred to as a classification problem. This is described
as producing a function f : \mathbb{R}^n \to \{1, \dots, k\}. The output could be the
predicted class y, or a vector Y with the probability distribution over all k
classes [12]. Image classification is the task of classifying the category
to which the object in the image belongs.
Figure 2.2: The left image shows how the input image is split into a
grid in the YOLO architecture, the middle image displays how this grid is
used to evaluate multiple sub-windows of the image, and the right image
displays the original image with the corresponding GT boxes [40].
To calculate the precision and recall we use the number of true positives,
false positives, true negatives and false negatives. Precision and recall
are calculated as in equation 2.1 and equation 2.2.

\mathrm{Precision} = \frac{tp}{tp + fp} \quad (2.1)

\mathrm{Recall} = \frac{tp}{tp + fn} \quad (2.2)
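As an illustration, the two equations translate directly into a small Python helper; the counts used in the example are hypothetical:

    def precision_recall(tp, fp, fn):
        """Compute precision and recall from detection counts (equations 2.1, 2.2)."""
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        return precision, recall

    # Hypothetical counts: 98 correct detections, 0 spurious, 2 missed notes.
    p, r = precision_recall(tp=98, fp=0, fn=2)  # p = 1.0, r = 0.98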
2.2.7 Hyperparameters
The goal of training an ML model is to learn the model parameters θ that
minimise the cost function (section 2.2.6). These parameters are derived
during the training phase, but there are other parameters in the algorithm,
hyperparameters, that are not optimised during the training phase but
have to be set before the learning process begins.
For example, the learning rate (η) in a mini-batch stochastic gradient
descent algorithm specifies the speed of the weight updates in the learning
phase, as described in equation 2.4 [45], where J is the cost function.
2.3.1.2.2 Softmax
The softmax function, also known as the normalised exponential function,
is a multiclass generalisation of the logistic function. It takes a
K-dimensional vector z as input and returns a K-dimensional vector σ(z)
where \sum_j \sigma(z)_j = 1 and where the value of every element in σ(z)
is between 0 and 1 [3]. The softmax function is commonly used as the last
layer in neural networks for multiclass classification, as the vector σ(z)
represents the probabilities of the K different classes. The equation is
described in equation 2.8.

\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \dots, K. \quad (2.8)
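A direct NumPy translation of equation 2.8, with the standard max-subtraction trick for numerical stability (a detail not required by the equation itself):

    import numpy as np

    def softmax(z):
        # Subtracting max(z) leaves the result unchanged but avoids overflow in exp.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    probs = softmax(np.array([2.0, 1.0, 0.1]))  # sums to 1; each entry in (0, 1)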
2.3.1.3 Learning
2.3.1.3.1 Algorithms
During the learning phase of the ML model, the objective is to minimise
the cost function by making constant improvements. This is achieved
through partial derivatives: we analyse the gradient at a point x, where
the function f(x) decreases fastest when moving in the direction of the
negative gradient of f(x), as described in equation 2.9 [12], which enables
us to update the parameters θ. By doing this process iteratively, the
algorithm makes constant small improvements towards the global minimum.
However, the algorithm is not guaranteed to converge towards a global
minimum, as it is prone to getting stuck in local minima, as visualised
in Figure 2.4.

\theta' = \theta - \eta \cdot \nabla_{\theta} J(\theta) \quad (2.9)
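As a sketch, equation 2.9 applied to a toy cost J(θ) = θ², whose gradient is 2θ; the learning rate and step count are arbitrary illustrative choices:

    def gradient_descent(theta, grad_J, eta=0.1, steps=100):
        """Iteratively apply the update rule theta' = theta - eta * grad_J(theta)."""
        for _ in range(steps):
            theta = theta - eta * grad_J(theta)
        return theta

    # J(theta) = theta^2 has its global minimum at theta = 0.
    theta_min = gradient_descent(theta=5.0, grad_J=lambda t: 2 * t)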
Figure 2.4: Local and global minima [12]. As seen in the figure, there are
multiple extreme points corresponding to minima, which is a challenge when
training an ANN; multiple strategies exist to avoid getting stuck in local
minima.
2.3.1.3.2 Generalisation
Generalisation is the capability of an ML model to perform well on previously
unseen data. During the training process of an ML model, the
cost on the training set, the training cost, is used to optimise the pa-
rameters θ of the model as to minimise the cost function. When the
training phase is complete, an optimal algorithm performs just as well
on a dataset not used during the training phase, namely the test set,
which measures how well the algorithm generalises to previously un-
seen data. This is called the generalisation error or test error [12].
The goal of the training phase is to minimise the training loss, in-
herently learning as much as possible from the training data. Further-
more, we would like to minimise the gap between training loss and
test loss. In order to increase the performance on the training set, one
could increase the complexity of the ML model, by for example in-
creasing the depth of an ANN. This process is challenging from the viewpoint
of generalisation, as the risk that the model mimics the training
data rather than learns from it increases as the model complexity
increases and the training loss decreases, which has a negative impact
on the test loss as the model performs worse on unseen data [12].
When an ML model is overly complex and more or less mimics
the training data, the model is said to suffer from overfitting. In the
same way, a model that is not complex enough to learn the underlying
patterns of the training data is said to suffer from underfitting. Over-
fitting, underfitting and appropriate model complexity are visualised
in Figure 2.5.
2.3.1.3.3 Regularisation
L1 and L2 regularisation are two common regularisation techniques
used to penalise large weights. For example, equation 2.10 describes
logistic regression with a regularisation term R(θ) controlled by a
penalty parameter α [32].

\arg\max_{\theta} \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta) - \alpha R(\theta) \quad (2.10)

The difference between L1 and L2 regularisation lies in the regularisation
term R(θ), where L1 regularisation corresponds to
R(\theta) = \lVert \theta \rVert_1 = \sum_{i=1}^{n} |\theta_i|.

L(w) = \sum_{i=1}^{N} \ell_i(y(X_i; w, \gamma, \beta)) \quad \text{for every sample } i \quad (2.11)
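To make the two penalty terms concrete, a small NumPy sketch of R(θ) for both variants (the weight vector is arbitrary):

    import numpy as np

    theta = np.array([0.5, -1.2, 3.0])

    l1_penalty = np.abs(theta).sum()   # R(theta) = ||theta||_1 = sum |theta_i|
    l2_penalty = (theta ** 2).sum()    # R(theta) = ||theta||_2^2 = sum theta_i^2

    # The objective in equation 2.10 subtracts alpha * R(theta) from the
    # log-likelihood, so a larger alpha drives the weights towards zero.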
the original image (receptive field) is performed and the result is
summarised. This results in one or more N × N, N ∈ Z, feature maps
of the original image, where every unit in the resulting feature map is
the result of the operations on the Y × Y neighbourhood of the original
image [27]. This convolution operation is visualised in Figure 2.6.
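A minimal sketch of this operation for a single-channel image and a single filter (no padding or stride handling), assuming NumPy:

    import numpy as np

    def conv2d(image, kernel):
        """Slide a k x k kernel over the image, summarising each receptive field."""
        k = kernel.shape[0]
        out = np.zeros((image.shape[0] - k + 1, image.shape[1] - k + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # Element-wise product over the receptive field, summed to one unit.
                out[i, j] = (image[i:i + k, j:j + k] * kernel).sum()
        return out

    feature_map = conv2d(np.random.rand(28, 28), np.ones((3, 3)) / 9)  # 26 x 26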
As seen in Figure 2.8, the patterns that the CNN layers learn to
distinguish are increasingly complex. The first layer learns to identify
simple lines and shapes, whereas the second layer is more complex,
which naturally follows from the non-linearity between layers. This
pattern holds for all layers, and more complex patterns are constructed
as we reach the top layers in the CNN. This shows the importance
of depth in a CNN, as without the depth the network is unable
to distinguish and classify more complex patterns and images [58].
Figure 2.10: Example of transfer learning where the last fully con-
nected layer is removed, and two new adaptation layers are added to
the network. These two layers are later fine-tuned to the new task [35].
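A hedged Keras sketch of the same idea: load a pre-trained backbone, drop its classification head and fine-tune two new adaptation layers. The layer sizes and two-class output are illustrative assumptions, not the exact setup from [35]:

    import tensorflow as tf

    # Pre-trained backbone with the final fully connected layer removed.
    base = tf.keras.applications.MobileNetV2(
        include_top=False, pooling="avg", weights="imagenet")
    base.trainable = False  # freeze the transferred feature extractor

    # Two new adaptation layers, fine-tuned on the new task.
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),  # e.g. note / background
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")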
Figure 2.11: Sliding Windows over an image [8]. When using the slid-
ing window approach, a fixed-sized window slides across the entire
input space. This approach is therefore computationally expensive
and does not capture objects of varying scale.
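A sketch of the approach: a fixed-size window moved across the image at a given stride, where each crop would be classified independently; the window size and stride are arbitrary:

    def sliding_windows(image, win=64, stride=32):
        """Yield every win x win crop of an (H, W, ...) image at the given stride."""
        h, w = image.shape[:2]
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                yield image[y:y + win, x:x + win]

    # Even a modest 480 x 640 frame yields 14 * 19 = 266 crops at this stride,
    # which illustrates why the approach is computationally expensive.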
Figure 2.13: Network using the Spatial Pyramid Pooling Layer [15].
The CNN feature vector is calculated once, as displayed by the black
and white layers, and the SPP-layer is thereafter applied to calculate
the CNN representation for each region.
Fast R-CNN added the bounding box regression to the training of the
network. Because of this, training for classification and localisation no
longer needs to be performed independently [10].
Faster R-CNN is an improvement of the Fast R-CNN network and
performs 10x faster by replacing the Selective Search and Edge Boxes
part of the Fast R-CNN network, which served as the performance
bottleneck. The replacement for the Selective Search algorithm is
a CNN called the Region Proposal Network (RPN) [42].
The RPN is faster than traditional region proposal algorithms as it
shares the full-image convolutional features with the network respon-
sible for the detection of objects, minimising model cost for region pro-
posal. As with R-CNN and SPP-net, the RPN can be trained end-to-end
together with the Fast R-CNN detection network [42].
The Faster R-CNN edge boxes algorithm is modified to further improve
the architecture's capability to identify objects of various aspect ratios
and scales. The network uses three kinds of anchor boxes with the scales
128², 256² and 512², with the aspect ratios 1:1, 2:1 and 1:2, which in
total gives us 9 boxes to be analysed by the RPN [42].
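The nine anchors are simply the cross product of the three scales and three aspect ratios; a small sketch enumerating the (width, height) pairs while preserving each anchor's area:

    # Scales are side lengths whose squares give the anchor areas.
    scales = [128, 256, 512]
    ratios = [(1, 1), (2, 1), (1, 2)]  # height : width

    anchors = []
    for s in scales:
        for rh, rw in ratios:
            h = s * (rh / rw) ** 0.5  # stretch height ...
            w = s * (rw / rh) ** 0.5  # ... and shrink width, keeping h * w = s * s
            anchors.append((w, h))

    assert len(anchors) == 9  # three scales times three aspect ratios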
The TensorFlow Object Detection API [20] provides multiple im-
plementations of the Faster R-CNN model built on both the Inception
V2 [51] and ResNet [13] models.
2.4.2 SSD
SSD uses a fixed set of default bounding boxes produced by convolutional
filters applied to feature maps. The network makes predictions on
feature maps of different scales in order to achieve high accuracy on
predictions. The SSD network utilises anchor boxes just as YOLO (section
2.4.3). At the time of prediction, the network predicts adjustments to
the boxes to match the object shape and produces probabilities for the
existence of each classification label in the box [29].
The fact that SSD combines predictions from various feature maps
results in an increased number of detections per class and image,
and the varying resolution of these feature maps increases the capability
to detect objects of different sizes. At the time of development,
SSD aimed to outperform the state-of-the-art Faster R-CNN
network in terms of mAP [29].
The Faster R-CNN network executed slowly, at about 7 FPS, on
the same hardware as the SSD network, which runs at 59 FPS. The
speed difference stems from SSD removing the need to re-sample pixels
or features, which lowered the number of computations per detection
while maintaining high performance. The SSD network both ran faster
than and had superior performance to YOLO. A model comparison
between SSD and YOLO is displayed in Figure 2.14.
As mentioned, the increased speed in comparison to the Faster R-CNN
model was due to the elimination of bounding box proposals and the
subsampling of the image. The performance increase was partially due
to the small changes to the network listed below [29].
Figure 2.14: Comparison between the SSD and the YOLO Single Shot
detection models for object detection [29]. The primary difference be-
tween the two architectures is that the YOLO architecture utilises two
fully connected layers, whereas the SSD network uses convolutional
layers of varying size.
L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, \mathrm{smooth}_{L1}(l_i^m - \hat{g}_j^m)
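The smooth L1 term in this loss is the piecewise Huber-style function from Fast R-CNN; a one-line sketch:

    def smooth_l1(x):
        """Quadratic near zero, linear for |x| >= 1; less outlier-sensitive than L2."""
        return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5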
Figure 2.15: During the training of the SSD network, boxes and the
corresponding box offsets are predicted and compared against the GT
boxes [29]. The confidence loss (label) and the localisation loss (box)
are used to increase the performance of the network.
the network from finding smaller objects. YOLO unifies the task of
object detection and the framing of the detected objects, as the spatial
location of the bounding boxes is treated as a regression problem.
Because of this, the entire process of calculating class probabilities and
predicting bounding boxes is executed in one single ANN, which enables
optimised end-to-end training of the network and allows the YOLO
network to perform at a high FPS [39].
As networks such as Fast R-CNN, Faster R-CNN and SSD were
released, the YOLO network was improved to YOLOv2 (YOLO9000),
which adopted some of the techniques used in these networks.
The aim of YOLO9000 was to release a better version of the YOLO
network, and some of the changes in YOLOv2 are the following [40].
and optimised for use on embedded systems and mobile devices. The
Tiny YOLO networks are inferior to the full YOLO networks in terms of
mAP but run at a significantly higher FPS.
As seen in Figure 2.17, YOLOv3 executes faster than both Faster
R-CNN and SSD but does not perform as well in terms of classification
accuracy. As computational power is a limiting factor on mobile
devices, the Tiny YOLO networks will be reviewed in depth when
constructing the network in this thesis.
All YOLO networks are executed in darknet [38], which is an open-
source ANN library written in C. These networks can be exported to
a common .pb format, which is supported by TensorFlow. Consequently,
networks trained in darknet can be used on all platforms supported
by TensorFlow.
2.4.4 MobileNets
MobileNets are based on a streamlined architecture that uses depthwise
separable convolutions to build the CNN. This results in a lightweight
DNN, restricting not only model size but primarily model latency
(inference time). This set of models allows for simple optimisation of
the trade-off between latency and performance.
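To illustrate why depthwise separable convolutions are cheap, a parameter count for one hypothetical layer (the channel counts are arbitrary examples):

    # Standard 3x3 convolution: every output filter spans all input channels.
    c_in, c_out, k = 32, 64, 3
    standard_params = k * k * c_in * c_out                  # 18432

    # Depthwise separable: one k x k filter per input channel (depthwise),
    # followed by 1x1 filters that mix the channels (pointwise).
    separable_params = k * k * c_in + c_in * c_out          # 288 + 2048 = 2336

    # Roughly an 8x reduction in parameters (and multiply-adds) for this layer.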
2.4.5 Inception
The Inception network shares the same goal as MobileNet, namely to
limit model size and computational cost in environments such as
mobile vision and big-data scenarios. The Inception network is
successfully scaled up by factorising convolutions and adding
regularisation [51].
In general, the Inception network is built upon a set of design
principles implying constraints on how the network should be constructed.
One of these design principles is to avoid bottlenecks, allowing
information to flow through the network in a direct manner.
Figure 2.19: Object detection results on the COCO dataset using varying
frameworks and models [17]. As seen in the table, the number
of parameters and operations varies greatly between the models. From
this data, it is clear that the MobileNet model architecture is less
computationally expensive than the others.
2.4.6 ResNet
The key idea behind ResNet is to ease the increased difficulty of
training as networks become deeper by reformulating the neural network
layers as residual functions with references to the input layer,
as seen in Figure 2.23.
TF enables easy scaling of ANN training with parallelisation and
replication [30]. As the model described in this paper is to be trained
on computational servers and later ported to mobile devices, TF is
highly suitable.
Chapter 3
Method and Experiments
This section explains the procedure and rationale behind the conducted
experiments and the choice of method. The goal of the conducted
experiments was to assess the performance of the object detection
algorithms and to evaluate their viability in this computationally
limited environment. The evaluated networks are R-CNN, SSD and
YOLO, as these networks provide state-of-the-art performance and are
widely favoured. Ultimately, the desired outcome was to identify one
or more networks that outperformed the existing heuristics-based
model under the constraint of performing inference in a reasonable
time.
Training of all networks was conducted on a desktop computer as
described in section 3.2.
3.1 Data
3.1.1 Existing data
Bontouch has a dataset of pre-processed images of Post-it® notes that
can be used for evaluating the deep learning model in comparison to
the existing heuristics-based algorithm. This dataset is limited to a
couple of hundred entries, and more data is required for the training
process. To extend the dataset, we gathered images of Post-it® notes
from social media websites and web directories. Fortunately, thousands
of high-quality images of Post-it® notes were easily accessible.
The manual search for suitable accounts and the use of instagram-scraper
resulted in a training set of over 3000 images of Post-it® notes,
of which 1842 images containing 2436 Post-it® notes remained after
manually removing non-relevant images from the dataset. This amount
of data is sufficient for tuning the models, and adding more images
in transfer learning tasks does not increase mAP performance significantly,
as the training suffers from diminishing returns [18].
3.1.2.2 RectLabel
RectLabel [24] is a tool for image annotation available on the Mac
App Store, which eases the process of labelling images with bounding
boxes. The tool enables the user to easily draw bounding boxes and
annotate each box with a pre-defined label. The annotations of the
bounding boxes created by RectLabel follow the PASCAL VOC [6]
format, as seen in Figure 3.3, which is commonly used in object
detection.
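For reference, a minimal PASCAL VOC annotation for one note could look as follows; the file name, image size and box coordinates are hypothetical:

    <annotation>
      <filename>postit_0001.jpg</filename>
      <size>
        <width>640</width>
        <height>480</height>
        <depth>3</depth>
      </size>
      <object>
        <name>post-it</name>
        <bndbox>
          <xmin>112</xmin>
          <ymin>86</ymin>
          <xmax>298</xmax>
          <ymax>262</ymax>
        </bndbox>
      </object>
    </annotation>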
All images of Post-it® notes gathered from the scraping of Instagram
were manually processed for bounding boxes in RectLabel,
where one or multiple notes were annotated in each image, as seen in
Figure 3.2.
3.2 Hardware
The networks were trained on a computational server with the follow-
ing hardware specifications.
Chapter 4
Results
Table 4.1: The mAP performance of the trained models. The performance
of the models varies greatly: the more complex Faster RCNN
ResNet50 model reaches near-optimal performance, whereas the less
complex Tiny YOLO V2 performs significantly worse.
Figure 4.1: The AP of the SSD Inception V2 model shows near-logarithmic
growth and reaches a plateau after 30k steps. At this point,
the performance of the model barely increases.
The inference times of SSD MobileNet and SSD Inception are quite similar,
whereas the Faster RCNN ResNet50 and Faster RCNN Inception inference
times are multiple times slower. As the inference time of the RCNN models
was too long, these models were dropped from further fine-tuning and
evaluation. Furthermore, these models are left out of the following graphs
to keep the data representation comparable and legible. The execution
time on a non-logarithmic scale for all models is displayed in Table 4.2.
Figure 4.2: Frame inference time per model measured in log(ms). The
Faster RCNN ResNet50 model is significantly slower than the other
models, which is not surprising given the complexity of the model.
except for the somewhat slower SSD Inception V2 model, which had a
significantly slower inference time in comparison to the other models.
Furthermore, it is clear that the SSD Inception V2 and Tiny YOLO V2
models had significantly more inference-time outliers than the SSD
models based on MobileNet, which resulted in a larger standard deviation
of the inference time, as seen in Figure 4.5. This behaviour was traced
to the launch of the Android app, as the application required significant
resources during launch, and it did not affect the overall performance
of the model once the app was fully loaded.
The mean inference times of the four models are visualised in
Figure 4.4.
Figure 4.3: Inference time per model for the Inception- and MobileNet-based
models, measured in ms. The models vary in inference time, and it is
clear that the SSD Inception V2 and Tiny YOLO V2 models suffer
from outliers that the MobileNet-based models do not.
Figure 4.4: Mean inference time per model measured in ms. The models
based on MobileNet execute faster than the others, and the SSD
Inception V2 model is significantly slower than the rest.
Figure 4.5: Standard deviation of the inference time per model, measured
in ms. The standard deviation of the SSD Inception V2 model is
larger than that of the other models, whereas the SSD MobileNet V2 and
Tiny YOLO models have the smallest standard deviations.
detection reliability.
Following the initial round of experiments, another round of ex-
periments was conducted with the purpose of detecting many notes
gathered in an equally spaced table pattern. The primary purpose of
this test was to evaluate the performance on a large number of notes
in the same image, as the training data mostly consisted of images of
single notes. During these experiments, the performance difference
was staggering: the SSD Inception V2 algorithm outperformed the
SSD MobileNet V2 by a large margin, as seen in Figure 4.6.
Figure 4.6: The left image represents the SSD Inception V2 model,
which performs well when detecting multiple notes and is capable
of detecting all but one note. The right image represents the SSD
MobileNet V2 model, which performs poorly: despite its large mAP value,
it is unable to detect multiple notes in the same image in a satisfying
manner. The squares surrounding the notes represent the bounding box
output of the ML algorithm.
order to find the options with the greatest time-to-benefit ratio for our
task at hand.
To alter the spatial relationships in the training images, random
horizontal flipping, random cropping and the insertion of random black
patches were applied. Furthermore, random conversion from colour to
greyscale and random adjustment of contrast were applied to alter the
colour and colour intensity of the notes.
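In the TensorFlow Object Detection API, these augmentations are declared in the training pipeline configuration; a sketch of the relevant block, assuming the standard preprocessor option names listed in Appendix A:

    train_config {
      data_augmentation_options { random_horizontal_flip { } }
      data_augmentation_options { random_crop_image { } }
      data_augmentation_options { random_black_patches { } }
      data_augmentation_options { random_rgb_to_gray { } }
      data_augmentation_options { random_adjust_contrast { } }
    }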
The augmented training started from the already trained network
and ran for 50k more steps, as fine-tuning the already trained network
with the added augmentation step provided better performance than
training the network with full augmentation enabled during the whole
training phase. Furthermore, augmented training takes 4x longer than
training without augmentation, which was a limiting factor given the
mentioned constraints on computational power and time.
As seen in Table 4.3, the performance of the networks did not
increase with augmentation but rather decreased slightly.
Model               mAP
SSD MobileNet V2    90.74%
SSD Inception V2    95.48%

Table 4.3: The mAP performance of the two final models. As seen, the
SSD Inception V2 model outperforms the SSD MobileNet V2 model.
model experienced issues with images where the notes were of similar
colour to the surrounding objects, as seen in Figure 4.10.
The major drawback of the heuristics-based model, where the SSD
Inception V2 was far superior, was when the notes were partly obscured
or in generally distorted environments, as seen in Figure 4.11.
Images with these attributes accounted for the largest differences in
performance between the two models.
As seen in Table 4.4, the performance of the heuristics-based model
and the SSD Inception V2 differed greatly in terms of recall and
precision. The heuristics-based model has a precision of 79.3% and
a recall of 45%, whereas the better performing SSD Inception V2 model
has a precision of 100% and a recall of 98%. The largest difference
between the models was in the number of false negatives, which inherently
leads to a large difference in recall as well. Furthermore, the heuristics-based
model suffered from false positives, which the ML-based model
did not.
The percentage increase in precision and recall is displayed in
Table 4.5 and demonstrates the superior performance of the ML model.
The increase in recall clearly suggests that ML can be used successfully
for this object detection task.
In order to statistically evaluate the difference between the heuristics-based
model and the SSD Inception V2 model in terms of recall, as displayed
in Table 4.4 and Table 4.5, an ANOVA test was conducted. The
results of the ANOVA test are displayed in Table 4.6; given the p-value
of 0.004, which is smaller than the significance level of 0.05, we
can with 95% confidence reject the null hypothesis that the models do
not differ in recall.
Recall
                  SS       df    MS       F       P-value   eta^2   Obs. power
Between Groups    11.161    1    11.161   9.077   0.004     0.144   0.796
Within Groups     66.393   54     1.229
Total             77.554   55
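For reference, a one-way ANOVA of this kind can be reproduced with SciPy; the per-image recall samples below are placeholders, not the thesis data:

    from scipy import stats

    # Per-image recall for the two models (hypothetical values).
    heuristic_recall = [0.40, 0.50, 0.30, 0.60, 0.45]
    ssd_recall = [0.95, 1.00, 0.98, 0.97, 1.00]

    f_stat, p_value = stats.f_oneway(heuristic_recall, ssd_recall)
    # Reject the null hypothesis of equal mean recall when p_value < 0.05.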
Chapter 5
Discussion
Figure 5.1: Example image with characteristics similar to the end case.
Figure 5.2: Image lacking the end case environment setting.
Chapter 6
Conclusions
Bibliography
[11] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation". In: CoRR abs/1311.2524 (2013). arXiv: 1311.2524.
[12] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition". In: CoRR abs/1512.03385 (2015). arXiv: 1512.03385.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". In: CoRR abs/1502.01852 (2015). arXiv: 1502.01852.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition". In: CoRR abs/1406.4729 (2014). arXiv: 1406.4729.
[16] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Improving neural networks by preventing co-adaptation of feature detectors". In: CoRR abs/1207.0580 (2012). arXiv: 1207.0580.
[17] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications". In: CoRR abs/1704.04861 (2017). arXiv: 1704.04861.
[18] Mi-Young Huh, Pulkit Agrawal, and Alexei A. Efros. "What makes ImageNet good for transfer learning?" In: CoRR abs/1608.08614 (2016). arXiv: 1608.08614.
[19] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size". In: CoRR abs/1602.07360 (2016). arXiv: 1602.07360.
[20] Google Inc. Tensorflow. https://github.com/tensorflow/models/tree/master/research/object_detection/models. Accessed: 2018-05-02.
Appendix A
TensorFlow API Data Augmentation Variables
• NormalizeImage normalize_image = 1;
• RandomHorizontalFlip random_horizontal_flip = 2;
• RandomPixelValueScale random_pixel_value_scale = 3;
• RandomImageScale random_image_scale = 4;
• RandomRGBtoGray random_rgb_to_gray = 5;
• RandomAdjustBrightness random_adjust_brightness = 6;
• RandomAdjustContrast random_adjust_contrast = 7;
• RandomAdjustHue random_adjust_hue = 8;
• RandomAdjustSaturation random_adjust_saturation = 9;
• ScaleBoxesToPixelCoordinates scale_boxes_to_pixel_coordinates
= 18;
• SSDRandomCropFixedAspectRatio ssd_random_crop_fixed_aspect_ratio
= 23;
• SSDRandomCropPadFixedAspectRatio ssd_random_crop_pad_fixed_aspect_ratio
= 24;