
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,

SECOND CYCLE, 30 CREDITS


STOCKHOLM, SWEDEN 2018

Mobile Object Detection using


TensorFlow Lite and Transfer
Learning

OSCAR ALSING

KTH ROYAL INSTITUTE OF TECHNOLOGY


SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Mobile Object Detection
using TensorFlow Lite and
Transfer Learning

OSCAR ALSING

Master in Computer Science


Date: August 27, 2018
Supervisor: Pawel Herman
Examiner: Danica Kragic
Swedish title: Objektigenkänning i mobila enheter med Tensorflow
Lite
School of Computer Science and Communication

Abstract
With the advancement in deep learning in the past few years, we are
able to create complex machine learning models for detecting objects
in images, regardless of the characteristics of the objects to be detected.
This development has enabled engineers to replace existing heuristics-based systems with machine learning models of superior performance. In this report, we evaluate the viability of using deep
learning models for object detection in real-time video feeds on mobile
devices in terms of object detection performance and inference delay
as either an end-to-end system or feature extractor for existing algo-
rithms. Our results show a significant increase in object detection per-
formance in comparison to existing algorithms with the use of transfer
learning on neural networks adapted for mobile use.

Sammanfattning
The development of deep learning in recent years means that we are able to create more complex machine learning models for identifying objects in images, regardless of the attributes or character of the objects. This development has enabled researchers to replace existing heuristics-based algorithms with machine learning models of superior performance. This report aims to evaluate the use of deep learning models for performing object detection in video on mobile devices with respect to performance and execution time. Our results show a significant increase in performance relative to existing heuristics-based algorithms when using deep learning and transfer learning in artificial neural networks.
Contents

1 Introduction 1
1.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Background 4
2.1 History of Computer Vision . . . . . . . . . . . . . . . . . 4
2.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Classification . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Object Detection . . . . . . . . . . . . . . . . . . . 5
2.2.3 Real-Time Object Detection . . . . . . . . . . . . . 6
2.2.4 Training and inference . . . . . . . . . . . . . . . . 6
2.2.4.1 Mean Average Precision . . . . . . . . . 7
2.2.5 Precision and Recall . . . . . . . . . . . . . . . . . 7
2.2.6 Cost function . . . . . . . . . . . . . . . . . . . . . 8
2.2.7 Hyperparameters . . . . . . . . . . . . . . . . . . . 8
2.3 Relevant Theory . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 Artificial Neural Networks . . . . . . . . . . . . . 9
2.3.1.1 Architecture . . . . . . . . . . . . . . . . 9
2.3.1.1.1 Feed-forward Neural Networks 9
2.3.1.1.2 Deep Neural Networks . . . . . 9
2.3.1.2 Activation function . . . . . . . . . . . . 10
2.3.1.2.1 Rectified Linear Units . . . . . . 10
2.3.1.2.2 Softmax . . . . . . . . . . . . . . 10
2.3.1.3 Learning . . . . . . . . . . . . . . . . . . 10
2.3.1.3.1 Algorithms . . . . . . . . . . . . 10
2.3.1.3.2 Generalisation . . . . . . . . . . 11
2.3.1.3.3 Regularisation . . . . . . . . . . 12
2.3.2 Convolutional Neural Networks . . . . . . . . . . 13


2.3.3 Transfer learning . . . . . . . . . . . . . . . . . . . 16


2.3.4 Sliding Window Detector . . . . . . . . . . . . . . 17
2.3.5 Existing heuristics based algorithm . . . . . . . . 17
2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 R-CNN, Fast R-CNN & Faster R-CNN . . . . . . . 18
2.4.2 SSD . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.3 YOLO, YOLOv2, YOLOv3 & Tiny YOLO . . . . . 23
2.4.4 MobileNets . . . . . . . . . . . . . . . . . . . . . . 25
2.4.5 Inception . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.6 ResNet . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Tools and Utilities . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.1 TensorFlow . . . . . . . . . . . . . . . . . . . . . . 29
2.5.2 TensorFlow Mobile . . . . . . . . . . . . . . . . . . 30
2.5.3 TensorFlow Lite . . . . . . . . . . . . . . . . . . . . 31
2.5.4 CUDA and cuDNN . . . . . . . . . . . . . . . . . . 31

3 Method and experiments 33


3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.1 Existing data . . . . . . . . . . . . . . . . . . . . . 33
3.1.2 Data gathering and processing . . . . . . . . . . . 34
3.1.2.1 Instagram-scraper . . . . . . . . . . . . . 34
3.1.2.2 RectLabel . . . . . . . . . . . . . . . . . . 35
3.2 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Choice of base models . . . . . . . . . . . . . . . . . . . . 36
3.4 Model training . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 Tiny YOLO . . . . . . . . . . . . . . . . . . . . . . 36
3.4.2 SSD and Faster R-CNN . . . . . . . . . . . . . . . 37
3.5 Hyperparameter selection . . . . . . . . . . . . . . . . . . 37
3.6 Hyperparameter optimisation . . . . . . . . . . . . . . . . 37
3.7 Data augmentation . . . . . . . . . . . . . . . . . . . . . . 37
3.8 Measuring and evaluating model performance . . . . . . 38
3.8.1 Evaluating model inference time . . . . . . . . . . 38
3.8.2 Evaluating heuristic model versus Machine Learning (ML) model . . . . . . . . 40

4 Results 41
4.1 mAP Performance . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Inference time . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Detection experiments . . . . . . . . . . . . . . . . . . . . 45

4.4 Augmented training . . . . . . . . . . . . . . . . . . . . . 47


4.5 Heuristics vs Machine Learning . . . . . . . . . . . . . . . 49

5 Discussion 54
5.1 Performance/latency payoff . . . . . . . . . . . . . . . . . 54
5.2 Augmented network performance . . . . . . . . . . . . . 55
5.3 Deep learning vs heuristics . . . . . . . . . . . . . . . . . 55
5.4 Quality of data . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.5 Hyperparameter tuning . . . . . . . . . . . . . . . . . . . 57
5.6 Sustainability and ethics . . . . . . . . . . . . . . . . . . . 57

6 Conclusions 59

Bibliography 60

A TensorFlow API data augmentation variables 66


Acronyms

AI Artificial Intelligence.

ANN Artificial Neural Networks.

CNN Convolutional Neural Networks.

CPU Central Processing Unit.

CV Computer Vision.

DL Deep Learning.

DNN Deep Neural Networks.

FLOPS Floating Point Operations Per Second.

FPS Frames Per Second.

GPU Graphics Processing Unit.

mAP Mean Average Precision.

ML Machine Learning.

RAM Random-Access Memory.

ReLU Rectified Linear Units.

TF TensorFlow.

TFL TensorFlow Lite.

TFM TensorFlow Mobile.

Chapter 1

Introduction

With the advancement in Deep Learning (DL) in the past few years, we
are able to create complex ML models for detecting objects in images,
regardless of the characteristics of the objects to be detected. This de-
velopment has enabled engineers to replace existing heuristics-based systems with ML models of superior performance [37].
As people are using their mobile phones to a larger extent, and also expect increasingly advanced performance [43] from their mobile applications, the industry needs to adopt more advanced technologies to meet these expectations. One such adaptation could be the use of ML algorithms for object detection.
ML is commonly divided into two phases namely the training and
the inference phase. Training is the phase where a model, usually a
neural network, is trained to behave a certain way based on given
datasets. This step can easily be carried out in the cloud and dis-
tributed to mobile devices, where the trained models can be used for
inference on previously unknown data.
When applying more advanced technologies and algorithms in a
mobile environment one of the challenges is the limited computational
power of the mobile hardware. As inference is computationally ex-
pensive, it is crucial that operations are optimised for mobile devices.
By using the mobile version of TensorFlow (TF) [30] namely Tensor-
Flow Mobile (TFM) [22] and the updated mobile framework Tensor-
Flow Lite (TFL) [21], developers are able to use pre-trained models on
mobile devices for inference with optimisation for mobile hardware.
The goal of this thesis is to evaluate the feasibility of using DL models for detecting Post-it® notes on mobile devices in comparison to the current heuristic-based models for detecting Post-it® notes from a live camera feed.

1.1 Problem statement


The task of detecting Post-it® notes is challenging as their geometrical shape is similar to that of many other objects, and when a note is obscured this shape is altered to a large extent. With no other distinguishing characteristics, the note is easily mistaken for another object, and vice versa.
Specifically, this thesis aims to examine if DL models can be used on mobile devices to outperform the existing heuristic-based visual object detection algorithms in terms of recall performance. Furthermore, this thesis examines the constraints in delay and computational time in the inference phase for the use of DL models on mobile devices. The examination is limited to the development and evaluation of the DL models and does not cover the implementation and deployment of the DL models in any end-user mobile applications, but solely the development of an Android application for ML model inference time measurements.

1.2 Scope
The assignment entails the development of a ML model running on a mobile device capable of detecting Post-it® notes in real time from a video feed. The following challenges have been identified.
1. Computational cost and time during the recall phase of such a model, as it should be capable of running on a mobile device with limited computational power. As multiple objects might exist in a single frame, the frame must be divided into a grid of multiple cells, where each cell is analysed independently, which inherently increases the computational cost.

2. The need to distinguish between similar objects.

3. Identification of multiple objects in a single frame, where some


objects might be only partially visible, and others are overlap-
ping.

4. Gathering and pre-processing of training data.



The primary focus of this thesis is to evaluate the recall performance of such a model. The work will focus heavily on the use of Convolutional Neural Networks (CNNs) [25, 17], as CNNs have proven useful for Computer Vision.

1.3 Thesis outline


This report follows a standard academic outline for research papers
and is divided into multiple chapters. Chapter 1 (Introduction) in-
troduces the research subject and the question to be explored as well
as brief information regarding the stakeholders of the paper. Chap-
ter 2 (Background) consists of an overview of the research subject and
its corresponding definitions, as well as relevant theory and previ-
ous work. Furthermore, the chapter outlines tools and utilities to be
used in later chapters. The primary purpose of chapter 2 is to present
the reader with the theoretical knowledge required to follow the argu-
ments and conclusions in later chapters and enable the reader to the-
oretically grasp the following results and the interpretation of them.
Chapter 3 outlines the experiment in terms of model construction,
model tuning, gathering of data as well as evaluation methods for the
ML models and the heuristic model comparison. Chapter 4 presents
the results from the experiments. Chapter 5 contains discussions and reflections on the achieved results. Chapter 6 contains the final interpretations of the results and the discussion that followed from them, as well as suggestions for further research, refinements and improvements.
Chapter 2

Background

2.1 History of Computer Vision


Computer Vision (CV) emerged in the late 1960s as a subset of Arti-
ficial Intelligence (AI), where scientists intended to mimic the func-
tionality of the human vision system. It was believed that processing
data from digital images in order to achieve a high-level understand-
ing of it and unravelling symbolic data from the image data was an
easy task – namely the "visual input" problem [52]. It didn’t take long
until researchers grasped the complexity of transforming retina input
into symbolic information.
As the field evolved in the 1980s, researchers focused on mathemat-
ical models and techniques to analyse images, like edge and contour
detection. During this time, researchers noticed correlations between
various algorithms in CV, and algorithms were unified to a higher ex-
tent [52].
In the later 2000s, the domain of ML models for visual recogni-
tion emerged, and is currently dominating the field of CV. With the
rise of large amounts of labelled data, sophisticated algorithms and in-
creasing computational power, these ML models are able to categorise
objects without human supervision [52]. The MNIST [28] dataset (Fig-
ure 2.1) was commonly used to evaluate performance of ML models
for visual recognition [56].
Currently, the most commonly used algorithms for object detection (section 2.2.2) in CV are CNNs [46], which have been shown to surpass human-level performance on image classification [14] (section 2.2.1).
Figure 2.1: Example images from the MNIST data with the corresponding correct labels. The dataset consists of thousands of images of handwritten digits.

The characteristics of Artificial Neural Networks (ANN) computations are similar to the characteristics of real-time graphics computations in video game rendering, where operations such as matrix multiplications and division are executed per pixel in parallel. Around 2005, researchers realised the potential advantages of performing ANN computations on Graphics Processing Units (GPUs) rather than Central Processing Units (CPUs), resulting in faster computations and higher performance. This enabled researchers to add more layers to ANNs, also known as deeper ANNs, and to use more data while maintaining a reasonable execution time [12].

2.2 Definitions
2.2.1 Classification
The process of specifying which of the k possible categories some input x belongs to is referred to as a classification problem. This is described as producing a function $f : \mathbb{R}^n \to \{1, \dots, k\}$. The output could be the predicted class y, or a vector Y with the probability distribution over all k classes [12]. Image classification is the task of classifying the category to which the object in the image belongs.

2.2.2 Object Detection


Object detection is the process of detecting objects in an image by applying a recognition algorithm to all sub-windows of the original image, covering anywhere from one to multiple classes of objects [52].

Object detection could, for example, be used to detect faces, pedestrians or cars in an image. In Figure 2.2 the YOLO network is used to detect objects in an image, where the process of dividing the image into multiple sub-windows is visualised [40, 39]. Object detection requires localisation of the objects within the image, which classification does not [11].

Figure 2.2: The left image shows how the input image is split into a grid by the YOLO architecture, and the middle image displays how this grid is used to evaluate multiple sub-windows of the image. The right image displays the original image with the corresponding GT boxes [40].

2.2.3 Real-Time Object Detection


When an object detection algorithm analyses videos consisting of multiple images (frames) per second and the detection is executed in real time, it is considered a real-time object detection algorithm. Analysing multiple images per second puts heavy emphasis on efficient algorithms, as the required computational power increases.

2.2.4 Training and inference


In ML the training phase is where the model parameters θ are optimised to minimise the cost function, and the model inherently learns the mapping function f∗ from input to output.
The inference phase of a ML model is when the fully trained model
is shown some input x, and outputs some output y derived from the
learnt function composition.

2.2.4.1 Mean Average Precision


A commonly used performance metric in object detection is Mean Av-
erage Precision (mAP), as defined by PASCAL VOC [6]. Better per-
formance is indicated as a higher mAP value given the ground-truth
boxes and given classes for an object detection task.
In order to use mAP in object detection, all predicted boxes and classes are sorted in decreasing order of probability and matched with ground-truth boxes and classes. If the classes of the prediction and the ground truth match, and their Intersection over Union (IoU, also known as the Jaccard index) (Figure 2.3) is greater than or equal to 0.5 (0.5IOU), the prediction is considered a match. The match is counted as a true positive if and only if the ground-truth box has not previously been matched, to mitigate duplicate detections of objects [4].
The Average Precision is computed as the area under the preci-
sion/recall curve by numerical integration, and the mAP is achieved
by calculating the mean of the Average Precision of all classes.

Figure 2.3: Intersection over Union as a similarity metric for object detection [44]. As seen in the right image, the GT bounding box and the predicted bounding box have a high percentage of overlap, which results in a large IoU value.
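As a minimal sketch of how the IoU between a predicted box and a GT box can be computed (the [x_min, y_min, x_max, y_max] box format and the function name are illustrative assumptions, not part of any specific framework):

def iou(box_a, box_b):
    """Intersection over Union for two boxes given as [x_min, y_min, x_max, y_max]."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction overlapping most of the ground-truth box passes the 0.5 threshold.
print(iou([10, 10, 50, 50], [15, 12, 55, 48]) >= 0.5)  # True (IoU is roughly 0.71)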

2.2.5 Precision and Recall


Precision and recall are commonly used metrics in pattern recognition
and information retrieval. The precision metric represents the fraction
of relevant documents of all the retrieved documents, and the recall
metric represents the fraction of relevant documents that have been
retrieved out of all relevant documents.

To calculate the precision and recall we use the number of true posi-
tives, false positives, true negatives and false negatives. Precision and recall
are calculated as in equation 2.1 and equation 2.2.
$\mathrm{Precision} = \dfrac{tp}{tp + fp} \qquad (2.1)$

$\mathrm{Recall} = \dfrac{tp}{tp + fn} \qquad (2.2)$
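As a small worked example of equations 2.1 and 2.2 (the counts below are made-up numbers, not results from this thesis):

# Hypothetical detection counts for a single class.
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)  # fraction of predicted detections that are correct
recall = tp / (tp + fn)     # fraction of ground-truth objects that were found

print(precision, recall)  # 0.8 0.666...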

2.2.6 Cost function


The cost function is used to evaluate the performance of the model.
During the training phase, our model is constructed as to minimise
the cost function and therefore increase the performance of the model.
Many ML algorithms are trained with maximum likelihood, which
naturally leaves us with a cost function represented as the negative
log-likelihood as described in equation 2.3. The primary cost func-
tion is often combined with a regularisation term as described in sec-
tion 2.3.1.3.3 [12].

$J(\theta) = -\mathbb{E}_{x,y \sim \hat{p}}\, \log p_{\mathrm{model}}(y \mid x) \qquad (2.3)$

2.2.7 Hyperparameters
The goal of training a ML model is to learn the model parameters θ that
minimises the cost function (section 2.2.6). These parameters are de-
rived during the training phase, but there are other parameters in the algorithm, hyperparameters, that are not optimised during the training phase but have to be set before the learning process begins.
For example, the learning rate (η) in a mini-batch stochastic gradi-
ent descent algorithm specifies the speed of the weight updates in the
learning phase, as described in equation 2.4 [45] where J is the cost
function.

$\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta;\, x^{(i:i+n)};\, y^{(i:i+n)}) \qquad (2.4)$


Another hyperparameter example is momentum (γ) as described in
equation 2.5, which aims to dampen oscillations and therefore acceler-
ate learning [45].

$v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta)$
$\theta = \theta - v_t \qquad (2.5)$
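A minimal NumPy sketch of the mini-batch update in equation 2.4 combined with the momentum update in equation 2.5; the linear model, squared-error cost and hyperparameter values are illustrative assumptions, not settings used in this thesis:

import numpy as np

def grad_J(theta, x_batch, y_batch):
    # Gradient of a simple squared-error cost for a linear model, used only for illustration.
    return 2 * x_batch.T @ (x_batch @ theta - y_batch) / len(y_batch)

rng = np.random.default_rng(0)
x, y = rng.normal(size=(100, 3)), rng.normal(size=100)
theta = np.zeros(3)
eta, gamma = 0.1, 0.9          # learning rate and momentum (hyperparameters)
v = np.zeros_like(theta)

for step in range(100):
    idx = rng.choice(len(y), size=16, replace=False)      # mini-batch indices
    v = gamma * v + eta * grad_J(theta, x[idx], y[idx])   # equation 2.5
    theta = theta - v                                     # parameter update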

2.3 Relevant Theory


The literature study will focus on the use of DL for real-time object
detection, using CNNs [42, 40, 29, 25, 58, 27, 11, 10]. Furthermore, the
literature study will cover existing literature on the implementation of
DL algorithms with the use of TFL on mobile devices[48, 57, 1, 17].
The necessary background knowledge and knowledge regarding
current state-of-the-art systems will be obtained by analysing the re-
sources identified in the literature study and implementing these in
TF.

2.3.1 Artificial Neural Networks


2.3.1.1 Architecture
2.3.1.1.1 Feed-forward Neural Networks
The Feed-Forward ANN is the most commonly used deep learning
model and serves as the basis for most neural network models such as
the CNN. The network aims to approximate some function f ∗ , which
for example could represent the mapping of input x to category y as
a classifier function y = f ∗ (x). This mapping is represented as y =
f (x; θ), where the θ parameters are learnt during training [12].
The Feed-Forward ANN is represented as a composition of func-
tions between layers as described in equation 2.6, where each layer
in the neural network represents one function in the composition. The
length of this chain of function compositions is referred to as the depth
of the model [12].

$f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x))) \qquad (2.6)$

2.3.1.1.2 Deep Neural Networks


An ANN is considered a Deep Neural Networks (DNN) when it con-
sists of multiple hidden layers between the input and the output layer.
This corresponds to the function composition depth described in 2.3.1.1.1.

2.3.1.2 Activation function


2.3.1.2.1 Rectified Linear Units
Rectified Linear Units (ReLU) were proposed as a non-saturating alternative to the tanh and sigmoid activation functions. ReLU has been shown to converge faster than the saturating nonlinearities [31, 25] and is described in equation 2.7 as the max function of 0 and the given value.

g(z) = max{0, z} (2.7)

2.3.1.2.2 Softmax
The softmax function, also known as the normalised exponential function, is a multiclass generalisation of the logistic function. It takes a K-dimensional vector z as input and returns a K-dimensional vector σ(z) where $\sum_{j} \sigma(z)_j = 1$ and where the value of every element in σ(z) is between 0 and 1 [3]. The softmax function is commonly used as the last layer in neural networks for multiclass classification, as the vector σ(z) represents the probabilities for the K different classes. The function is described in equation 2.8.

$\sigma(z)_j = \dfrac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \dots, K. \qquad (2.8)$
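A short NumPy sketch of equation 2.8; subtracting the maximum before exponentiating is a common numerical-stability trick and not part of the equation itself:

import numpy as np

def softmax(z):
    # Shift by the maximum for numerical stability; does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())  # e.g. [0.659 0.242 0.099] 1.0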

2.3.1.3 Learning
2.3.1.3.1 Algorithms
During the learning phase of the ML model the objective is to minimise the cost function by making constant improvements. This is achieved through the concept of partial derivatives: we analyse the gradient at a point x, where the function f(x) decreases fastest when moving in the direction of the negative gradient of f(x), as described in equation 2.9 [12], which enables us to update the parameters θ. By doing this process iteratively, the algorithm makes constant small improvements towards the global minimum. However, the algorithm is not guaranteed to converge to a global minimum, as it is prone to getting stuck in local minima, as visualised in Figure 2.4.

$\theta' = \theta - \eta \cdot \nabla_{\theta} J(\theta) \qquad (2.9)$

Figure 2.4: Local and global minima [12]. As seen in the figure, there are multiple extreme points corresponding to minima, which is a challenge when training an ANN. As we want to avoid local minima, there are multiple strategies for doing so.

2.3.1.3.2 Generalisation
Generalisation is the capability of a ML model to perform well on pre-
viously unseen data. During the training process of a ML model, the
cost on the training set, the training cost, is used to optimise the pa-
rameters θ of the model as to minimise the cost function. When the
training phase is complete, an optimal algorithm performs just as well
on a dataset not used during the training phase, namely the test set,
which measures how well the algorithm generalises to previously un-
seen data. This is called the generalisation error or test error [12].
The goal of the training phase is to minimise the training loss, in-
herently learning as much as possible from the training data. Further-
more, we would like to minimise the gap between training loss and
test loss. In order to increase the performance on the training set, one
could increase the complexity of the ML model, by for example in-
creasing the depth of an ANN. This process is challenging from a view-
point of generalisation, as the risk that the model mimics the training
data rather than learn from it increases as the model complexity in-
creases and the training loss decreases, which has a negative impact
on the test loss as the model performs worse on unseen data [12].
When a ML model is overly complex and more or less mimics the training data, the model is said to suffer from overfitting. In the same way, a model that is not complex enough to learn the underlying patterns of the training data is said to suffer from underfitting. Overfitting, underfitting and appropriate model complexity are visualised in Figure 2.5.

Figure 2.5: Three different models subject to underfitting, overfitting and appropriate complexity [12]. When a model suffers from underfitting as in the left image, the model is unable to learn the underlying patterns of the data, and if the model suffers from overfitting as in the right image, the model simply learns to mimic the data.

2.3.1.3.3 Regularisation
L1 and L2 regularisation are two common regularisation techniques
used to penalise large weights. For example, equation 2.10 describes
logistic regression with an regularisation term R(θ) controlled by a
penalty parameter α [32].
The difference between L1 and L2 regularisation is the regularisa-
tion term R(θ), where L1 regularisation equals R(θ) = ||θ||1 = ni=1 |θi |
P

and L2 regularisation equals R(θ) = ||θ||22 = ni=1 θi2 [32].


P

m
X
arg max log p(y (i) |x(i) ; θ) − αR(θ) (2.10)
θ
i=1

When training ANNs, our goal is to minimise the objective function or loss function. The unregularised objective function depends only on the weights of the network, as described in equation 2.11 [26].

$L(w) = \sum_{i=1}^{N} \ell_i\big(y(X_i; w, \gamma, \beta)\big) \quad \text{for every sample } i \qquad (2.11)$

When adding regularisation to the training of ANNs a regulari-


sation term is added to the unregularised objective function, just as
described for the logistic regression in equation 2.10. This revised ob-
jective function with the added L2 regularisation is described in equa-
tion 2.12. The λ parameter in equation 2.12 corresponds to the α pa-
rameter in 2.10, and adjusts the degree of penalty from the regularisa-
tion [26].

$L_{\lambda}(w) = L(w) + \lambda \lVert w \rVert_2^2 \qquad (2.12)$


In order to lower the degree of co-adaptation between neurons, dropout
is commonly performed on feed-forward ANNs as a measure to lower
the degree of overfitting. It can be interpreted as killing random neu-
rons in the ANN with a probability p, inherently lowering the degree
to which connected neurons depend on each other by rendering the
presence of each neuron unreliable. Dropout has been proven to help
networks generalise better to unseen data and increase network per-
formance [16, 56, 50].
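A minimal sketch of (inverted) dropout applied to a layer's activations; the rescaling-by-the-keep-probability convention shown here is one common formulation and is not taken from the thesis:

import numpy as np

def dropout(activations, p_drop, rng):
    """Zero out each unit with probability p_drop and rescale the survivors."""
    keep = 1.0 - p_drop
    mask = rng.random(activations.shape) < keep
    # Inverted dropout: rescaling keeps the expected activation unchanged,
    # so no extra scaling is needed at inference time.
    return activations * mask / keep

rng = np.random.default_rng(42)
hidden = rng.normal(size=(4, 8))       # activations of a hidden layer
print(dropout(hidden, p_drop=0.5, rng=rng))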

2.3.2 Convolutional Neural Networks


The CNN was first introduced in 1989 [27], and proved effective for digit recognition. Following this success, the interest in CNNs increased, and they have been shown to perform incredibly well in more challenging image recognition tasks, outperforming other ML models. The success of CNNs can be attributed to the availability of larger datasets, increased computational power and improved regularisation techniques (section 2.3.1.3.3) [58].
The underlying idea of the CNN is based on previous work in vi-
sual pattern recognition, where it has been demonstrated useful to
extract and combine local features to more abstract higher-order fea-
tures [27].
The first layer of a CNN is a convolutional layer, which consists of one or multiple Y by Y, Y ∈ Z filters convolving over the original input image. As each filter convolves over the original image, element-wise multiplication of the values of the filter and the sub-part of the original image (the receptive field) is performed and the results are summed. This results in one or more N by N, N ∈ Z feature maps of the original image, where every unit in the resulting feature map is a result of the operations on the Y by Y neighbourhood of the original image [27]. This convolution operation is visualised in Figure 2.6.

Figure 2.6: An example of a 2-D CNN Convolution [12]. The figure


displays how the sub-part of the original input is multiplied element-
wise with the filter values to construct new output.
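A naive NumPy sketch of the convolution described above (single channel, stride 1, no padding); real frameworks implement this far more efficiently:

import numpy as np

def conv2d(image, kernel):
    """Slide the Y-by-Y kernel over the image, multiplying element-wise and summing."""
    h, w = image.shape
    k = kernel.shape[0]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            receptive_field = image[i:i + k, j:j + k]
            out[i, j] = np.sum(receptive_field * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1., 0., -1.]] * 3)   # simple vertical-edge filter
print(conv2d(image, edge_kernel).shape)       # (3, 3) feature map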

Pooling layers are commonly inserted between successive convolutional layers as a means to reduce the spatial size and therefore the number of parameters in the network, and also serve as a natural way to reduce overfitting. Max pooling is commonly used in the pooling layer, which performs the action of selecting the maximum value inside the receptive field [23], as visualised in Figure 2.7. The
summarised neighbourhoods from the pooling operation usually do
not overlap as the stride is equal to the width of the filter, but over-
lapping pooling where the stride is smaller than the width of the filter
has proven useful [25]. The depth of the data remains unchanged, and
only the width and height dimensions are altered during the pooling
operation [23].

Figure 2.7: CNN Max Pooling operation. By choosing the maximum


value in each sampled filter area, the input space is downsampled [23].
As seen in the figure, the data is downsampled from a 4x4 to 2x2.
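A corresponding sketch of non-overlapping 2x2 max pooling with stride 2, matching the 4x4 to 2x2 downsampling in Figure 2.7; the helper name is an illustrative choice:

import numpy as np

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling (stride equal to the filter width)."""
    h, w = feature_map.shape
    pooled = feature_map[:h - h % 2, :w - w % 2]          # drop odd edges if any
    pooled = pooled.reshape(h // 2, 2, w // 2, 2)
    return pooled.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 1],
               [4, 6, 5, 7],
               [8, 9, 1, 0],
               [2, 3, 4, 5]], dtype=float)
print(max_pool_2x2(fm))   # [[6. 7.] [9. 5.]]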

As seen in Figure 2.8, the patterns that the CNN layers learn to distinguish are increasingly complex. The first layer learns to identify simple lines and shapes, whereas the second layer is more complex, which naturally follows from the non-linearity between layers. This pattern holds for all layers, and more complex patterns are constructed as we reach the top layers of the CNN. This shows the importance of depth in a CNN, as without the depth the network is unable to distinguish and classify more complex patterns and images [58].

Figure 2.8: Evolution of CNN layers. The complexity of the layers increases as data flows from the bottom layers to the top layers [58]. As seen in the figure, the bottom layers learn to identify crude geometrical shapes whereas the top layers identify more advanced shapes.

Naturally, CNNs are built deep with multiple convolutional lay-


ers and pooling layers. It’s a common practice to add one or multiple
dense layers for further complexity and non-linearity at the top of the
network and to use a softmax layer (section 2.3.1.2.2) as the top layer
for image classification. A complex CNN network is displayed in Fig-
ure 2.9 as an example.

Figure 2.9: Example CNN network architecture. The network contains convolutional layers and pooling layers, as well as dense layers followed by a softmax layer for image classification [25].

2.3.3 Transfer learning


It is rare to train a CNN from scratch, as it requires very large quantities of data to perform well, and training a CNN with such large amounts of data could take weeks on GPU clusters. Because of this, it
is a common practice to use pre-trained networks as an initialisation
or feature extractor for the task to be implemented. The task of re-
using a pre-trained network is known as Transfer Learning, and implies
transferring knowledge from one domain to another. The base net-
work is commonly trained on a dataset such as the ImageNet dataset,
which holds 1.2 million images over 1000 categories. This network
then serves as the base network used in the transfer learning.
When the pre-trained network is used as a feature extractor, the last fully connected layer in the base CNN is removed, and two new adaptation layers are added to the network. During training, all layers but the two new last layers remain fixed, and the only weights that are adjusted are the weights of these two fully connected layers. Because of this, the rest of the CNN serves as a fixed feature extractor, as visualised in Figure 2.10.
The pre-trained network could also, but not as commonly, be used
to initialise all the weights in the network where all the weights are
adjusted during training. When applying this approach in transfer
learning, it is common to leave some of the earlier layers fixed as they
represent generic features.

Figure 2.10: Example of transfer learning where the last fully con-
nected layer is removed, and two new adaptation layers are added to
the network. These two layers are later fine-tuned to the new task [35].
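As a minimal sketch of the feature-extractor setup described above, using the Keras API bundled with TensorFlow; the choice of base network, the layer sizes and the two-class head are illustrative assumptions, not values taken from this thesis:

import tensorflow as tf

# Pre-trained convolutional base (ImageNet weights), without its classification head.
base = tf.keras.applications.MobileNet(include_top=False, weights="imagenet",
                                        input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # keep the base fixed so it acts as a feature extractor

# Two new adaptation layers that are fine-tuned on the new task.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),  # e.g. note / background
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")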

2.3.4 Sliding Window Detector


A historical approach to object detection in images is the sliding win-
dow detector, where boxes are sliding over the whole frame in incre-
mental steps in order to analyse each window for objects as visualised
in Figure 2.11. As objects could be of different scales in the image, the
sliding window detector creates an image pyramid by resizing the image
at various scales, and sliding the fixed-sized window detector over the
resized image. Nevertheless, the sliding window detector is computa-
tionally expensive and struggles with aspect ratios [8].
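A minimal sketch of the sliding-window-plus-image-pyramid idea described above; the window size, stride and scale factor are arbitrary illustrative values, and the nearest-neighbour downscaling is only for demonstration:

import numpy as np

def pyramid(image, scale=0.75, min_size=64):
    """Yield the image repeatedly downscaled (nearest-neighbour, for illustration only)."""
    while min(image.shape[:2]) >= min_size:
        yield image
        h, w = int(image.shape[0] * scale), int(image.shape[1] * scale)
        rows = (np.arange(h) / scale).astype(int)
        cols = (np.arange(w) / scale).astype(int)
        image = image[rows][:, cols]

def sliding_windows(image, window=64, stride=32):
    """Yield every fixed-size window position to be passed to a classifier."""
    for y in range(0, image.shape[0] - window + 1, stride):
        for x in range(0, image.shape[1] - window + 1, stride):
            yield image[y:y + window, x:x + window]

image = np.zeros((256, 256))
n_windows = sum(1 for scaled in pyramid(image) for _ in sliding_windows(scaled))
print(n_windows)  # many windows to classify even for a small single-channel image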

2.3.5 Existing heuristics based algorithm


The existing algorithm used to detect Post-it® notes performs a series of image pre-processing steps prior to performing the object detection task on the processed image. This series of processing steps is primarily performed with the use of the OpenCV library [53].
The algorithm is built to detect edges in the image corresponding
to the edges of Post-it® notes, and thereafter applies a set of filters to
the detected edges in order to evaluate if the detected edges are edges
of actual notes or not. This process consists of four steps, where the
overlap, boundary, shape and size filters are applied in order. As these
filters are applied, the algorithm calculates various metrics that are

Figure 2.11: Sliding Windows over an image [8]. When using the slid-
ing window approach, a fixed-sized window slides across the entire
input space. This approach is therefore computationally expensive
and does not capture objects of varying scale.

used to make the decision to classify the detected edges as a Post-it R


note or not.
Figure 2.3.5 displays the process of applying the preprocessing and
filters to an image and how the edges in an image are approximated.

2.4 Related Work


2.4.1 R-CNN, Fast R-CNN & Faster R-CNN
As described in section 2.3.4 the sliding window detector is computationally expensive, and because of this it would not be feasible to apply CNNs in such an approach [11]. To solve this, R-CNN utilises an object
proposal algorithm named selective search, intended to limit the num-
ber of bounding boxes to be analysed and therefore lowering the num-
ber of windows to analyse. The selective search algorithm combines
exhaustive search and segmentation and uses cues such as texture and
colour to pinpoint all possible locations of objects [55].
The selected bounding boxes from the selective search algorithm
are later extracted as a fixed-size feature vector using a CNN and fed to
a Support Vector Machine (SVM) for classification of the object inside
the region. All the generated boxes are resized to a fixed size (224x224 for VGG) before being fed to the CNN [11].

Figure 2.12: The current heuristic algorithm applying multiple steps of preprocessing and calculations on two images. As seen in the left figure, the algorithm classifies the object as a false positive Post-it® note.


Despite the use of selective search as an object proposal algorithm for limiting the number of regions to analyse, the average number of proposed regions is 2403 [11]. This heavily limits the performance of
the architecture, as all of the 2403 regions have to be analysed individ-
ually, serving as a bottleneck for the Frames Per Second (FPS) of the
algorithm.
The Spatial Pyramid Pooling Network (SPP-net) was developed on
the basis of R-CNN, with the intention of increasing the performance
of the network. This was achieved by calculating the CNN feature vector for the entire image once, and then calculating the CNN representation for each region based on the already computed feature vector, as seen in Figure 2.13. However, back-propagation was non-
trivial with the SPP-net [15].
Fast R-CNN was released as an improvement to both R-CNN and
SPP-net, as it incorporates the improvements of SPP-net to R-CNN,
with the possibility of training the network end-to-end. The Fast R-
CNN network is 9x faster than the R-CNN network. Furthermore,

Figure 2.13: Network using the Spatial Pyramid Pooling layer [15]. The CNN feature vector is calculated once, as displayed by the black and white layers, and the SPP layer is thereafter applied to calculate the CNN representation for each region.

Fast R-CNN added the bounding box regression to the training of the network. Because of this, training for classification and localisation does not have to be performed independently [10].
Faster R-CNN is an improvement of the Fast R-CNN network and
performs 10x faster by replacing the Selective Search and Edge Boxes
part of the Fast R-CNN network, which served as the performance
bottleneck. The improved version of the Selective Search algorithm is
a CNN called the Region Proposal Network (RPN) [42].
The RPN is faster than traditional region proposal algorithms as it
shares the full-image convolutional features with the network respon-
sible for the detection of objects, minimising model cost for region pro-
posal. As with R-CNN and SPP-net, the RPN is subject to being trained
end-to-end with the Fast R-CNN detection network [42].
The Faster R-CNN network's edge boxes algorithm is modified to further improve the architecture's capability to identify objects with various aspect ratios and scales. The network uses three kinds of anchor boxes with the scales 128², 256² and 512², with the aspect ratios 1:1, 2:1 and 1:2, which in total gives 9 anchor boxes to be analysed by the RPN [42].
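To make the anchor enumeration concrete, the following sketch lists the 3 scales x 3 aspect ratios = 9 anchor shapes; representing anchors as (width, height) pairs and the rounding are illustrative choices, not the exact values of any particular RPN implementation:

# Anchor areas 128^2, 256^2 and 512^2 combined with aspect ratios 1:1, 2:1 and 1:2.
scales = [128, 256, 512]
ratios = [(1, 1), (2, 1), (1, 2)]

anchors = []
for s in scales:
    for rw, rh in ratios:
        area = s * s
        # Solve w * h = area with w / h = rw / rh.
        h = (area * rh / rw) ** 0.5
        w = area / h
        anchors.append((round(w), round(h)))

print(len(anchors))  # 9 anchor shapes per position
print(anchors[:3])   # [(128, 128), (181, 91), (91, 181)]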
The TensorFlow Object Detection API [20] provides multiple im-
plementations of the Faster R-CNN model built on both the Inception
V2 [51] and ResNet [13] models.

2.4.2 SSD
SSD uses a fixed set of default bounding boxes using convolutional
filters applied to feature maps. The network makes predictions on
feature maps of different scales in order to achieve high accuracy on
predictions. The SSD network utilises anchor boxes just as YOLO does (section 2.4.3). At the time of prediction, the network creates box adjustments to better match the object shape and produces probabilities for the existence of each classification label in the box [29].
The fact that SSD uses various feature maps to combine predic-
tions results in an increased number of detections per class and image
and the varying resolution on these feature maps leads to increased
capabilities of detecting objects of different sizes. At the time of devel-
opment, SSD aimed to outperform the state-of-the-art Faster R-CNN
network in terms of mAP [29].
The Faster R-CNN network executed slowly at about 7 FPS using
the same hardware as the SSD network, which runs at 59 FPS due to
the removed need to re-sample pixels or features, which lowered the
number of computations per detection while maintaining high per-
formance. The SSD network ran both faster and had superior per-
formance to YOLO. A model comparison between SSD and YOLO is
displayed in Figure 2.14.
As mentioned, the increased performance in speed in comparison
to the Faster R-CNN model was due to the elimination of bounding
box proposals and subsampling of the image. The performance in-
crease was partially due to the small changes to the network as listed
below [29].

• Small Convolutional Filters were used to predict object labels and bounding box offsets.

• Multiple Feature Maps & Prediction at Multiple Scales increased performance on objects of all sizes, but especially smaller objects which could be more challenging to detect. These predictions were separated by aspect ratio.

Figure 2.14: Comparison between the SSD and the YOLO single shot detection models for object detection [29]. The primary difference between the two architectures is that the YOLO architecture utilises two fully connected layers, whereas the SSD network uses convolutional layers of varying size.

During training of the network, default boxes of varying aspect


ratios are evaluated and box offsets are predicted as well as prediction
confidence for each category and compared to the ground truth (GT)
boxes as displayed in Figure 2.15. The model loss during training is
a weighted sum of the confidence loss (label) and the localisation loss
(box) as described in equation 2.13 [29].
$L(x, c, l, g) = \dfrac{1}{N}\big(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\big) \qquad (2.13)$
In the described loss equation, N is the number of matched default
boxes, the localisation loss is a Smooth L1 loss between the GT box (g)
and the predicted box (l). The offsets for the center (cx, cy) of the de-
fault bounding box (d) and its width (w) and height (h) are regressed
similar to Faster R-CNN, as described in equations 2.14 and 2.15 be-
low [29].

$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\!\left(l_i^{m} - \hat{g}_j^{m}\right)$

$\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx})/d_i^{w} \qquad \hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy})/d_i^{h}$
$\hat{g}_j^{w} = \log\!\left(\dfrac{g_j^{w}}{d_i^{w}}\right) \qquad \hat{g}_j^{h} = \log\!\left(\dfrac{g_j^{h}}{d_i^{h}}\right) \qquad (2.14)$

$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \quad \text{where } \hat{c}_i^{p} = \dfrac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})} \qquad (2.15)$

Figure 2.15: During the training of the SSD network, boxes and the
corresponding box offsets are predicted and compared versus the GT
boxes [29]. The confidence loss (label) and the localisation loss (box)
are used to increase the performance of the network.

The TensorFlow Object Detection API [20] provides multiple im-


plementations of the SSD model built on the MobileNet V1 [17], Mo-
bileNet V2 and Inception V2 models.

2.4.3 YOLO, YOLOv2, YOLOv3 & Tiny YOLO


As visualised in Figure 2.2, the YOLO network divides images into
a grid with GxG cells, and the grid then generates N predictions for
bounding boxes (GxGxN boxes in total). Each bounding box is limited
to having only one class at the time of prediction, which restricts the network from finding smaller objects. YOLO unifies the task of object detection and the framing of the detected objects, as the spatial locations of the bounding boxes are treated as a regression problem. Because of this, the entire process of calculating class probabilities and predicting bounding boxes is executed in one single ANN, which enables optimised end-to-end training of the network and enables the YOLO network to perform at a high FPS [39].
As networks such as Fast R-CNN, Faster R-CNN and SSD were
released the YOLO network was improved to YOLOv2 (YOLO9000),
which included some of the algorithms adapted in these networks.
The aim of YOLO9000 was to release a better version of the YOLO
network, and some of the changes in YOLOv2 are the following [40].

• Batch Normalisation. Helps to regularise the model and improves training convergence.

• High-Resolution Classifier. Fine-tunes the classification network at a higher resolution.

• Convolution With Anchor Boxes. YOLO uses bounding boxes for framing the objects, whereas YOLOv2 is inspired by Faster R-CNN and utilises anchor boxes instead, predicting offsets and confidences for these.

• Multi-Scale Training. By training on various input dimensions, the network is forced to perform well across input dimensions. This also leaves the network with a good payoff between accuracy and speed depending on the requirements of the application.

• Joint Classification and Detection. Utilising WordTree (Figure 2.16), the authors of YOLOv2 were able to combine multiple datasets in hierarchies, and therefore train on hierarchies of objects such as animal families. This enables the network to train on more data and to further refine the detection of objects.

The development of YOLOv3 built upon YOLOv2 but introduced


changes such as multi-scale predictions, an improved backbone classi-
fier and a new network for feature extraction [41].
As YOLO, YOLOv2 and YOLOv3 require large amounts of computational power for inference, the Tiny YOLO network was developed and optimised for use on embedded systems and mobile devices. The Tiny YOLO network is inferior to the full YOLO networks in terms of mAP but runs at a significantly higher FPS.

Figure 2.16: Combined datasets using a WordTree hierarchy [40]. Using this approach, multiple datasets could be combined in a hierarchical manner, which increased the amount of training data.
As seen in Figure 2.17, YOLOv3 executes faster than both Faster R-CNN and SSD but does not perform as well in terms of classification accuracy. As computational power is a limiting factor for mobile devices, the Tiny YOLO networks will be reviewed in depth when constructing the network in this thesis.
All YOLO networks are executed in darknet [38], which is an open-source ANN library written in C. These networks can be exported to the common .pb format, which is supported by TensorFlow. Because of this, networks trained in darknet can be used on all platforms supported by TensorFlow.

Figure 2.17: YOLOv3 performance on the COCO dataset [41]. As seen in the figure, YOLOv3 outperforms all other networks on the COCO dataset.

2.4.4 MobileNets
MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build the CNN. This results in a lightweight DNN that restricts not only model size but primarily model latency (inference time). This set of models allows for simple optimisation of hyperparameters to adjust the latency/accuracy trade-off depending on the problem at hand. MobileNets have been proven effective in a wide range of applications, including object detection, classification, facial attribute recognition and large-scale geolocalization [17].
MobileNet alters the standard convolution as described in section 2.3.2 by factorising it into a depthwise convolution and a 1x1 pointwise convolution, as displayed in Figure 2.18. The pointwise convolution is used to combine the output of the depthwise convolution, which applies a single filter to each input channel. The standard convolution generates new output by combining the input and applying the filters in one step, whereas the depthwise separable convolution splits this into a filtering layer and a combining layer, which reduces model size and computations. Using 3x3 depthwise convolutions, the MobileNet computational cost is reduced to 1/8–1/9 of that of a standard convolution [17].
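The 1/8–1/9 figure follows from the cost ratio reported for depthwise separable convolutions: a standard convolution uses roughly $D_K^2 \cdot M \cdot N \cdot D_F^2$ multiply-adds, while the depthwise plus pointwise pair uses $D_K^2 \cdot M \cdot D_F^2 + M \cdot N \cdot D_F^2$ [17]. A small sketch with made-up layer dimensions:

def conv_cost(k, m, n, f):
    """Multiply-adds for a standard k x k convolution: k^2 * M * N * F^2."""
    return k * k * m * n * f * f

def separable_cost(k, m, n, f):
    """Depthwise (k^2 * M * F^2) plus 1x1 pointwise (M * N * F^2) multiply-adds."""
    return k * k * m * f * f + m * n * f * f

# Illustrative layer: 3x3 kernels, 256 input channels, 256 output channels, 14x14 feature map.
k, m, n, f = 3, 256, 256, 14
ratio = separable_cost(k, m, n, f) / conv_cost(k, m, n, f)
print(round(1 / ratio, 1))  # ~8.7x fewer multiply-adds for this layer, within the 1/8-1/9 range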
In the task of object detection, MobileNet has been shown to be a useful base network. As shown in Figure 2.19, the reported results for MobileNet on the COCO dataset are promising. The MobileNet architecture remains the smallest and least computationally expensive base network in comparison to VGG and Inception for both the SSD and Faster R-CNN frameworks, while having similar or superior mAP performance.

Figure 2.18: The MobileNets convolution architecture [17]. By factorising the standard convolutional filters into depthwise and 1x1 pointwise convolutions, the number of computations required is reduced.

Figure 2.19: Object detection results on the COCO dataset using varying frameworks and models [17]. As seen in the table, the number of parameters and operations varies greatly between the models. From this data, it is clear that the MobileNet model architecture is less computationally expensive than the others.

2.4.5 Inception
The Inception network shares the same goal as MobileNet, namely to limit the model size and computational cost in environments such as mobile vision and big-data scenarios. The Inception network is successfully scaled up by factorising convolutions and adding regularisation [51].
In general, the Inception network is built upon a set of design principles, imposing constraints on how the network should be constructed. One of these design principles is to avoid bottlenecks and to allow information to flow through the network in a direct manner. This is achieved by gently decreasing the size of the network from input to output [51].
By increasing the activations per tile, the Inception network trains faster as it allows for more disentangled features, and by doing spatial aggregation over lower-dimensional embeddings the network remains less complex while not losing large quantities of representational power. Furthermore, the Inception network emphasises the importance of balancing the width and depth of the network in order to reach optimal performance and create higher-quality networks [51].
The Inception network uses Inception modules, which use convolutions of varying sizes per layer, executing these in parallel and concatenating the results for the following layer. As these modules more or less serve as models inside of a model, they are called Inception modules. To increase the speed of the network, larger convolutions are replaced by multiple smaller ones as seen in Figure 2.21, which reduces the number of parameters due to the weight sharing between adjacent tiles [51].

Figure 2.20: The original Inception module [51].

Figure 2.21: Inception module where the 5x5 convolution is replaced with two 3x3 convolutions [51].

2.4.6 ResNet
The key idea behind ResNet is to ease the increased difficulty of training as networks become deeper by reformulating the neural network layers as residual functions with references to the input layer, whereas normal networks learn unreferenced functions. This enables training of deeper networks, and the ResNet network was able to train at a depth 8x deeper than other successful networks when it was presented, while still having lower complexity [13].
The complexity of training very deep ANNs is addressed in ResNet by adding identity mappings as shortcut connections between layers, as seen in Figure 2.22, which makes it possible to not merely hope that ANN layers stacked on each other will fit an underlying mapping, but to explicitly fit them to a residual mapping. These identity mappings add no extra complexity or parameters to the network [13].

2.5 Tools and Utilities


2.5.1 TensorFlow
A commonly used software for ML tasks is TF. It is widely adopted
as it provides an interface to express common ML algorithms and ex-
ecutable code of the models. Models created in TF can be ported to
heterogeneous systems with little or no change with devices ranging
from mobile phones to distributed servers. TF was created by and is
maintained by Google, and is used internally within the company for
ML purposes. TF expresses computations as a stateful data flow graph

Figure 2.22: Residual learning identity mapping [13]. By adding these identity mappings, shortcut connections are created between layers, inherently making the network easier to train.

as seen in Figure 2.23, enabling easy scaling of ANN training with par-
allelisation and replication [30]. As the model described in this paper
is to be trained on computational servers and later ported to mobile
devices, TF is highly suitable.

2.5.2 TensorFlow Mobile


During design, Google developed TF to be able to run on heteroge-
neous systems, including mobile devices. This was due to the prob-
lems of sending data back and forth between devices and data centres
when computations could be executed on the device instead. TFM en-
abled developers to create interactive applications without the need of
network round-trip delays for ML computations [22].
As ML tasks are computationally expensive, model optimisation is
used to improve performance. The minimum hardware requirements
of TFM in terms of Random-Access Memory (RAM) size and CPU
speed are low, and the primary bottleneck is the calculation speed of the computations, as the desired latency for mobile applications is low. For example, a mobile device with hardware capable of running 10 Giga Floating Point Operations Per Second (FLOPS) is limited to running a model that requires 5 GFLOPs per inference at 2 FPS, which might impede desired application performance.
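The FPS bound in this example is simply the ratio between device throughput and per-inference cost; a one-line sketch with the numbers from the example above:

device_gflops = 10.0          # device throughput, GFLOP/s
model_gflops_per_frame = 5.0  # cost of one inference, GFLOPs
max_fps = device_gflops / model_gflops_per_frame
print(max_fps)  # 2.0 frames per second at best, ignoring all other overhead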

Figure 2.23: Example of a computational graph for a CNN in TensorFlow. As seen in the figure, all operations and variables are connected and split according to how they relate to each other.

2.5.3 TensorFlow Lite


TFL is the evolution of TFM, which already supports deployment on
mobile and embedded devices. As there is a trend to incorporate ML
in mobile applications and as users have higher expectations on their
mobile applications in terms of camera and voice it is highly incen-
tivised to further optimise TFM for lightweight mobile use [21].
Some of the optimisations included in TFL are hardware accelera-
tion through the silicon layer, frameworks such as the Android Neural
Network API and mobile-optimised ANNs such as MobileNets [17]
and SqueezeNet [19]. TF-trained models are converted to the TFL
model format automatically by TF [21].
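As a rough sketch of that conversion step using the TFLiteConverter API from a recent TensorFlow release (which post-dates the 2018 TFM/TFL tooling described here; the file paths are placeholders):

import tensorflow as tf

# Convert a trained SavedModel to the TensorFlow Lite flat-buffer format.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional size/latency optimisations
tflite_model = converter.convert()

with open("detector.tflite", "wb") as f:
    f.write(tflite_model)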

2.5.4 CUDA and cuDNN


CUDA was developed by NVIDIA to create an interface for paral-
lel computing on CUDA-enabled GPUs. The platform functions as
a software layer for general calculations that developers can utilise to execute virtual instructions, and supports many programming languages [33].
The NVIDIA CUDA Deep Neural Network library (cuDNN) enables GPU-accelerated training and inference of deep neural networks for common routines and operations in ML. As ML heavily depends on accessibility to computational power, this is crucial when training larger networks or training on high-dimensional data such as images. The cuDNN library offers great support for low-level GPU performance tuning, enabling ML developers to focus on the implementation of the networks. cuDNN supports and accelerates operations in TensorFlow, which is the deep learning framework used in this thesis [34].
Chapter 3

Method and experiments

This section explains the procedure and rationale behind the con-
ducted experiments and the choice of method. The goal of the con-
ducted experiments was to assess the performance of the object detec-
tion algorithms and to evaluate their viability in this computationally
limited environment. The evaluated networks are R-CNN, SSD and
YOLO, as these networks provide state-of-the-art performance and are
widely favoured. Ultimately, the desired outcome was to identify one
or multiple networks that outperformed the existing heuristics-based
model under the constraint of performing inference in a reasonable
time.
Training of all networks was conducted on a desktop computer as
described in section 3.2.

3.1 Data
3.1.1 Existing data
Bontouch has a dataset of pre-processed images of Post-it® notes that can be used for evaluating the deep learning model in comparison to the existing heuristics-based algorithm. This dataset is limited to a couple of hundred entries, and more data is required for the training process. To extend the dataset, we gathered images of Post-it® notes from social media websites and web directories. Luckily, there were thousands of high-quality images of Post-it® notes easily accessible.


3.1.2 Data gathering and processing


3.1.2.1 Instagram-scraper
In order to gather more images of Post-it® notes, the command-line Python application instagram-scraper [36] was used. The application enables efficient scraping of images from user profiles and hashtags on Instagram [7], which is one of the largest social media platforms focusing on photo-sharing, with 233 million active users.
In order to find suitable images of Post-it® notes, the social media platform was queried and browsed manually. As the platform was explored, multiple users were identified with a large quantity of high-quality Post-it® images, such as the image in Figure 3.1.

Figure 3.1: Post-it® note uploaded by Instagram user instachaaz.

The manual search for suitable accounts and the use of instagram-scraper
resulted in a training set of over 3000 images of Post-it® notes, of which
1842 images containing 2436 Post-it® notes remained after manually
removing non-relevant images from the dataset. This amount of data is
sufficient for tuning the models, as adding more images in transfer learning
tasks does not increase mAP performance significantly due to diminishing
returns [18].

3.1.2.2 RectLabel
RectLabel [24] is a tool for image annotation available on the Mac
App Store, which eases the process of labelling images with bounding
boxes. The tool enables the user to easily draw bounding boxes and
annotate each box with a pre-defined label. The bounding box annotations
created by RectLabel follow the PASCAL VOC [6] format, as seen in
Figure 3.3, which is commonly used in object detection.
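
To make the annotation format concrete, the sketch below parses one PASCAL
VOC XML file with Python's standard library and extracts the label and pixel
coordinates of each bounding box; the file path is hypothetical.

import xml.etree.ElementTree as ET

def read_voc_boxes(xml_path):
    """Return a list of (label, xmin, ymin, xmax, ymax) tuples."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        bb = obj.find("bndbox")
        coords = tuple(int(float(bb.find(tag).text))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name,) + coords)
    return boxes

# Example: every annotated note in one training image.
print(read_voc_boxes("annotations/postit_0001.xml"))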
All images gathered from the scraping of Post-it® notes from Instagram
were manually annotated with bounding boxes in RectLabel, where one or
multiple notes were annotated in each image, as seen in Figure 3.2.

Figure 3.2: RectLabel bounding box annotation interface [24].

Figure 3.3: Example of the bounding box annotation in the commonly used
PASCAL VOC format [6].

3.2 Hardware
The networks were trained on a computational server with the follow-
ing hardware specifications.

• GPU: Nvidia GTX 1080 Ti (11GB)

• CPU: Intel i7-7700K @ 4.20GHz

• RAM: 32GB 1600MHz



3.3 Choice of base models


As deep CNNs require intensive training, networks such as SSD [29],
R-CNN [11] and YOLO [40, 39] were evaluated as the detection frameworks
for the CNN to be extended and fine-tuned, and MobileNet [17],
Inception [51] and ResNet [13] were evaluated as base networks.
The selection of the base model for the object detection algorithms
amongst these state-of-the-art models depends heavily on the speed
versus performance payoff between the models. As visualised in
Figure 2.17, the speed, accuracy and mAP vary heavily between the
different models. According to this data, Tiny YOLO is faster than
object detection models built with SSD but is inferior in detection
performance, which is also the case between Faster R-CNN and SSD,
where SSD is the faster but lower performing network.
In a mobile environment, the FPS at which the mobile device is able to
run inference is crucial for the mobile experience. With a too complex
and computationally expensive network, the device would struggle to run
the application at more than 1 FPS. According to professional developers
at Bontouch, a minimum of around 2 FPS is required for a smooth mobile
experience, which limited the base network to either SSD or YOLO, as
Faster R-CNN has inferior speed on mobile devices. Nevertheless, all
mentioned architectures were evaluated and assessed.

3.4 Model training


3.4.1 Tiny YOLO
The Tiny YOLO network was trained using the open-source library
darkflow [54], as it enabled convenient transfer learning from the base
model to our specific model by loading the pre-trained model and
dropping the last two layers. Darkflow also provides an API to predict
bounding boxes for detected objects and to export these predictions in
JSON format. Furthermore, darkflow enabled us to export the darknet
model to TensorFlow for deployment on mobile devices, which was
required to evaluate the viability of the model in terms of FPS on
mobile devices.
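
As a rough sketch of how darkflow can be driven from Python (the
configuration, weight and image paths are assumptions, and the option
names follow the darkflow documentation at the time of writing, so they
may differ between versions):

from darkflow.net.build import TFNet
import cv2

# Load a Tiny YOLO configuration with pre-trained weights; "load" points at
# the weights used as the starting point for transfer learning, and
# "threshold" filters out low-confidence boxes.
options = {
    "model": "cfg/tiny-yolo-voc-postit.cfg",
    "load": "bin/tiny-yolo-voc.weights",
    "threshold": 0.4,
}
tfnet = TFNet(options)

image = cv2.imread("data/postit_example.jpg")
# return_predict yields a list of dicts with label, confidence and box
# corners, which darkflow can also export as JSON for the mAP evaluation.
predictions = tfnet.return_predict(image)
print(predictions)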

3.4.2 SSD and Faster R-CNN


The official TensorFlow [30] object detection library [20] contains the
object detection frameworks SSD and Faster R-CNN built on top of
MobileNet and Inception. These pre-trained models were downloaded
and trained in the same manner as Tiny YOLO.

3.5 Hyperparameter selection


The DNN models that were used came with existing hyperparameter values
that had been tuned to give optimal performance for each model during
training on their respective original datasets, and these values were used
as the defaults for the transfer learning task. These default hyperparameter
values were later optimised as described in chapter 3.6.

3.6 Hyperparameter optimisation


When training a DNN, finding the most suitable hyperparameters for
the given dataset increases the probability of reaching optimal performance.
The initial networks were trained using grid search for hyperparameter
optimisation, performing an exhaustive search over the hyperparameter
space [5].
Once the top performing networks had been identified during the first
round of experiments, the hyperparameters of these networks were further
optimised using random search over a discrete space, as random search has
been shown to outperform grid search when applied to a small number of
hyperparameters [2].
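
A minimal sketch of such a random search over a small discrete space is
given below; the parameter names and candidate values are illustrative, not
the exact grids used in the experiments, and train_and_evaluate is a
placeholder for a full training run.

import random

# Discrete candidate values for a few common hyperparameters.
search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [8, 16, 24, 32],
    "l2_regularisation": [0.0, 1e-5, 1e-4],
}

def sample_config(space):
    """Draw one configuration uniformly at random from the discrete space."""
    return {name: random.choice(values) for name, values in space.items()}

def train_and_evaluate(config):
    """Placeholder for a full training run returning the validation mAP."""
    return random.random()  # stand-in score; replace with a real training run

best_map, best_config = -1.0, None
for _ in range(20):  # budget of 20 random trials
    config = sample_config(search_space)
    score = train_and_evaluate(config)
    if score > best_map:
        best_map, best_config = score, config
print(best_config, best_map)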

3.7 Data augmentation


The TensorFlow Object Detection API image preprocessor provides
multiple data augmentation steps in the preprocessing pipeline, as shown
in appendix A. Applying these augmentation steps to the dataset could
increase the networks' ability to generalise, as more training data is
generated through variations and modifications of the original data.
To further improve the performance of the top performing networks, these
networks were finetuned and trained further on augmented data. Augmentation
techniques such as RandomBlackPatches are particularly interesting, as the
primary characteristic of the Post-it® note is its square form with four
corners, which is altered when black patches are applied to the original
image.
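
A hedged sketch of the kinds of augmentation applied, expressed directly
with tf.image operations rather than through the Object Detection API
configuration used in the actual training:

import tensorflow as tf

def augment(image):
    """Sketch of augmentations similar to those enabled in the pipeline.

    The Object Detection API applies each step with a configurable
    probability and also updates the ground-truth boxes for geometric
    operations such as flips, crops and black patches, which is omitted
    here for brevity.
    """
    image = tf.image.random_flip_left_right(image)     # random horizontal flip
    image = tf.image.random_contrast(image, 0.7, 1.3)  # random contrast adjustment
    # Conversion to greyscale, kept as three channels so the network input
    # shape is unchanged; the pipeline applies this only to a fraction of images.
    image = tf.image.grayscale_to_rgb(tf.image.rgb_to_grayscale(image))
    return image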

3.8 Measuring and evaluating model performance
The performance of the detection networks was evaluated using mAP
as defined in chapter 2.2.4.1. The final network was evaluated in com-
parison to the heuristics-based model in terms of recall as described
in chapter 2.2.5. The mAP value served as a direct indicator of detec-
tion performance in terms of both class prediction and bounding box
prediction, and as the base metric for evaluating model performance.
As mAP is a commonly used metric, there are multiple open-source
libraries and software packages available to evaluate the network mAP
performance, and TensorFlow has built-in support for calculating the
mAP metric via TensorBoard during training and evaluation.
After evaluating the various alternatives, the open-source project
mAP [4] was selected due to its functionality for converting the darkflow
prediction JSON format and the PASCAL VOC ground truth box XML
format to a common format for evaluation, which enabled us to evaluate
the various formats effortlessly.
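
The matching between predicted and ground-truth boxes that underlies the
mAP and recall figures is based on intersection over union [44]; a minimal
sketch of that computation for axis-aligned (xmin, ymin, xmax, ymax) boxes:

def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# A detection is typically counted as a true positive if its IoU with an
# unmatched ground-truth box exceeds a threshold such as 0.5.
print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.143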
The recall was measured by conducting a set of experimental object
detection tasks for the heuristics-based model and the DL model and
evaluating the results manually. By performing a series of experiments,
enough data was gathered to evaluate the performance of the models.

3.8.1 Evaluating model inference time


As the final model is to be deployed on a mobile phone, the inference
time of the model has to be low enough for the application to feel
responsive. As mentioned in section 3.3, 2 FPS is enough to guarantee a
good enough mobile experience. The trained models were therefore evaluated
based on their execution time, and inherently FPS, as this metric is
crucial for the intended use of the trained model.
The trained models were deployed on a OnePlus 5T Android mobile device,
where the inference task was executed on a separate thread. During
inference, the execution time was measured in ms and logged to a text
file in the internal storage of the mobile device, as shown in Figure 3.4.
The inference time was calculated as seen in listing 3.1. This data was
later analysed to assess the viability of each model in terms of FPS
performance.

Listing 3.1: Capture inference execution time in ms


final long startTime = SystemClock.uptimeMillis();
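// Run the trained detector synchronously on the current (cropped) camera frame.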
final List<Classifier.Recognition> results =
detector.recognizeImage(croppedBitmap);
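// Elapsed wall-clock time for one inference, later written to the log file.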
lastProcessingTimeMs = SystemClock.uptimeMillis()
- startTime;

Figure 3.4: Mobile object detection inference architecture and inference
time capturing. The mobile device launches a video thread and an inference
thread, where the video thread serves frames from a video feed to the
inference thread for object detection. The inference thread logs the
execution time to a log file on the device.

3.8.2 Evaluating heuristic model versus ML model


As stated in section 1.1, this report aims to investigate whether the ML
model is superior to the heuristics-based model in terms of recall, as the
primary goal is to increase the number of correctly identified Post-it®
notes in the image. These metrics were calculated by performing object
detection with both the heuristics-based model and the ML model on the
same dataset, and the results were evaluated manually. The precision and
recall metrics were calculated as described in chapter 2.2.5.
Chapter 4

Results

4.1 mAP Performance


In order to find the best performing model for the task of detecting
Post-it® notes, the Tiny YOLO V2, SSD MobileNet V1, SSD MobileNet V2,
SSD Inception V2, Faster RCNN Inception V2 and Faster RCNN ResNet50
models were trained and evaluated in terms of mAP performance. The
performance of the trained models is displayed in Table 4.1, where all
models have been trained on the Post-it® training data and tested on the
test data as described in chapter 3.1. The calculation of the mAP was
carried out as described in chapter 3.8.
As seen in the table, the mAP value varies heavily between the various
models, which follows naturally from the complexity of the models. A more
complex model such as Faster RCNN ResNet50 consists of considerably more
parameters and operations than a less complex model such as SSD MobileNet V1.
As the complexity of a model increases, its capability to capture spatial
relationships increases, since there are more non-linearities in the network.
The performance difference between these models is therefore natural and
expected.
During training, the mAP performance of all models increased rapidly at
first and stagnated around 20k-30k steps, as seen in Figure 4.1. From the
given results, this many steps are required to fine-tune the networks to
detect the spatial features of the Post-it® note. Nevertheless, all models
were trained for 100k steps, or finished earlier due to early stopping, in
order to optimise model performance.


Model | Base model training data | mAP
Tiny Yolo V2 | VOC 2007+2012 | 87.57%
SSD MobileNet V1 | COCO trainval | 91.16%
SSD MobileNet V2 | COCO trainval | 91.90%
SSD Inception V2 | COCO trainval | 96.82%
Faster RCNN Inception V2 | COCO trainval | 96.69%
Faster RCNN ResNet50 | COCO trainval | 99.33%

Table 4.1: The mAP performance of the trained models. The performance of
the models varies greatly: the more complex Faster RCNN ResNet50 model
reaches near-optimal performance, whereas the less complex Tiny Yolo V2
performs significantly worse.

Figure 4.1: The AP of the SSD Inception V2 model shows near-logarithmic
growth and reaches a plateau after 30k steps. Beyond this point, the
performance of the model barely increases.

4.2 Inference time


As mentioned in chapter 3.3, the inference time, and inherently the FPS,
of the model is crucial for a smooth experience in a mobile app environment.
In order to capture the inference time of each model and frame, the code
described in chapter 3.8 was executed as a wrapper around each inference.
The inference time of the various models varied greatly, as seen in
Figure 4.2. The boxplot describes the inference time in ms for each of
the evaluated models.
As seen in Figure 4.2, the inference times of Tiny YOLO, SSD MobileNet
and SSD Inception are quite similar, whereas the Faster RCNN ResNet50 and
Faster RCNN Inception models are several times slower. As the inference
time of the RCNN models was too long, these models were dropped from
further finetuning and evaluation. Furthermore, these models are left out
of the following graphs to keep the data representation comparable. The
execution time on a non-logarithmic scale for all models is displayed in
Table 4.2.

Figure 4.2: Frame inference time per model measured in log(ms). The
Faster RCNN ResNet50 model is significantly slower than the other models,
which is not surprising given the complexity of the model.

To visualise the boxplot results in a more graspable manner, the measured
inference times on a non-logarithmic scale for the Tiny YOLO, SSD MobileNet
and SSD Inception models are displayed in Figure 4.3. As seen in the figure,
the inference times of the models are quite similar, except for the SSD
Inception V2 model, which had a significantly slower inference time in
comparison to the other models.
Furthermore, it is clear that the SSD Inception V2 and Tiny Yolo V2 models
had significantly more inference time outliers than the SSD models based on
MobileNet, which resulted in a larger standard deviation of the inference
time, as seen in Figure 4.5. This behaviour was traced to the launch of the
Android app, as the application required significant resources during launch;
it did not affect the overall performance of the model once the app was
fully loaded. The mean inference times of the four models are visualised in
Figure 4.4.

Figure 4.3: Inference time per model, measured in ms, for the Inception and
MobileNet based models. The models vary in inference time, and it is clear
that the SSD Inception V2 and Tiny YOLO V2 models suffer from outliers that
the MobileNet based models do not.

Figure 4.4: Mean inference time per model measured in ms. The models based
on MobileNet execute faster than the others, and the SSD Inception V2 model
is significantly slower than the others.

4.3 Detection experiments


As the purpose of the networks was to serve as real-time object detection
algorithms for mobile devices, the trained networks were evaluated by users
on mobile devices. In order to evaluate the performance of the models beyond
the mAP metric, and their viability in terms of inference latency, detection
experiments were conducted in an environment as similar as possible to the
end use case, which is to detect one or multiple notes gathered on a plain
surface.
These experiments were conducted by end users, who used the Android
application to detect Post-it® notes in real time. As the users were given
the task of using the application to detect notes in varying environments,
recall of Post-it® notes was the primary metric used to evaluate model
performance.

Figure 4.5: Standard deviation of the inference time per model measured in
ms. The standard deviation of the SSD Inception V2 model is larger than
that of the other models, whereas SSD MobileNet V2 and Tiny YOLO have the
smallest standard deviations.


The detection experiments were conducted using the top performing networks
in terms of accuracy and latency, namely SSD Inception V2 and SSD MobileNet
V2. As seen in Table 4.1, SSD Inception V2 is the highest performing network
among those that are viable in terms of inference speed, whereas SSD
MobileNet V2 performs somewhat worse in terms of mAP but has superior
inference speed, as seen in Figure 4.3.
The initial experiments included detection of one to three notes on a plain
surface. During these experiments, both networks performed with close to
100% recall, which is not surprising given the near-optimal mAP values of
both networks as displayed in Table 4.2. In this setting, the two networks
were approximately equivalent in terms of detection reliability.

Model | Inference (ms) | mAP
Tiny Yolo V2 | 515 | 87.57%
SSD MobileNet V1 | 454 | 91.16%
SSD MobileNet V2 | 438 | 91.90%
SSD Inception V2 | 716 | 96.82%
Faster RCNN Inception V2 | 4105 | 96.69%
Faster RCNN ResNet50 | 20018 | 99.33%

Table 4.2: Inference time and mAP performance summary of the trained models.
The inference time of the Faster RCNN ResNet50 is significantly higher than
that of the other models, but so is its mAP performance. As seen in the
table, both RCNN based models are computationally expensive in comparison
to the others.
Following the initial round of experiments, another round was conducted with
the purpose of detecting many notes arranged in an equally spaced table
pattern. The primary purpose of this test was to evaluate the performance on
a large number of notes in the same image, as the training data mostly
consisted of images of single notes. During these experiments, the
performance difference was staggering: the SSD Inception V2 model
outperformed SSD MobileNet V2 by a large margin, as seen in Figure 4.6.

4.4 Augmented training


Following the initial experiments with the networks trained on vanilla data,
the SSD MobileNet V2 and SSD Inception V2 models were trained on augmented
data in order to increase the performance of the models in non-optimal
settings. The primary purpose of the augmented training was to induce
variation in the images, forcing the models not to get stuck in local minima
and improving generalisation, as the models are forced to learn a broader
spectrum of spatial relationships; systematically occluding parts of the
image has been shown to make the network activations vary [58].
There are multiple augmentation techniques available in the TensorFlow API,
as described in appendix A, and these were reviewed in order to find the
options with the greatest time-to-benefit ratio for the task at hand.

Figure 4.6: The left image represents the SSD Inception V2 model, which
performs well when detecting multiple notes and is capable of detecting all
but one note. The right image represents the SSD MobileNet V2 model, which
performs poorly; despite its large mAP value, it is unable to detect multiple
notes in the same image in a satisfying manner. The squares surrounding the
notes represent the bounding box output of the ML algorithm.
To alter the spatial relationships in the training images, random horizontal
flipping, random cropping and insertion of random black patches were applied.
Furthermore, random conversion from colour to greyscale and random contrast
adjustment were applied to vary the colour and colour intensity of the notes.
The augmented training was started from the already trained networks and run
for 50k additional steps, as finetuning the already trained network with the
added augmentation steps provided better performance than training the
network with full augmentation enabled during the whole training phase.
Furthermore, the augmented training time is roughly 4x longer than training
without augmentation, which was a limiting factor given the mentioned
constraints on computational power and time.
As seen in Table 4.3, the performance of the networks did not increase with
augmentation but rather decreased slightly. These networks were also manually
tested in distorted environments but did not show any increased performance.
Because of this, the vanilla networks were favoured over the networks
finetuned on augmented data.

Figure 4.7: The SSD Inception V2 model detecting notes in distorted
environments. As the geometrical fingerprint of the Post-it® note is the
square, the task of detecting notes with covered corners is not trivial.

Model | mAP
SSD MobileNet V2 | 90.74%
SSD Inception V2 | 95.48%

Table 4.3: The mAP performance of the two models after finetuning on
augmented data. As seen, the SSD Inception V2 model outperforms the SSD
MobileNet V2 model.

4.5 Heuristics vs Machine Learning


In order to evaluate the performance of the SSD Inception V2 network in
comparison to the heuristics-based Post-it® note detector, the performance
was evaluated in terms of precision and recall on a subset of the test
images.
The heuristics-based model did not perform well on rectangular Post-it®
notes, as seen in Figure 4.9, and was limited to the traditional square
notes. Furthermore, some square notes were not correctly identified, as seen
in Figure 4.8. This was surprising, as the performance on square notes was
good in general. The heuristics-based model also experienced issues with
images where the notes were of a colour similar to the surrounding objects,
as seen in Figure 4.10.
The major drawback of the heuristics-based model, and where the SSD
Inception V2 was far superior, was when the notes were partly obscured or
placed in distorted environments in general, as seen in Figure 4.11. Images
with these attributes accounted for the largest differences in performance
between the two models.

Model | FP | FN | TP | Precision | Recall
Heuristics-based | 6 | 28 | 23 | 79.3% | 45%
SSD Inception V2 | 0 | 1 | 50 | 100% | 98%

Table 4.4: Post-it® note detection performance of the heuristics-based
detector and the SSD Inception V2 network in terms of False Positives (FP),
False Negatives (FN), True Positives (TP), Precision and Recall.

Precision increase | Recall increase
26.1% | 117.7%

Table 4.5: The increase in performance of the SSD Inception V2 model in
comparison to the heuristics-based model.
As seen in Table 4.4, the performance of the heuristics-based model and the
SSD Inception V2 model differed to a great extent in terms of recall and
precision. The heuristics-based model has a precision of 79.3% and a recall
of 45%, whereas the better performing SSD Inception V2 model has a precision
of 100% and a recall of 98%. The largest difference between the models was
in the number of False Negatives, which inherently led to a large difference
in recall as well. Furthermore, the heuristics-based model suffered from
False Positives, which the ML based model did not.
The percentage increase in precision and recall is displayed in Table 4.5
and demonstrates the superior performance of the ML model. The increase in
recall clearly suggests that ML can be used successfully in this object
detection task.
In order to statistically evaluate the difference in recall between the
heuristics-based model and the SSD Inception V2 model, as displayed in
Table 4.4 and Table 4.5, an ANOVA test was conducted. The results of the
ANOVA test are displayed in Table 4.6; given the p-value of 0.004, which is
smaller than the significance level of 0.05, we can with 95% confidence
reject the null hypothesis that the models do not differ in recall
performance.

Recall | SS | df | MS | F | P-value | eta^2 | Obs. power
Between Groups | 11.161 | 1 | 11.161 | 9.077 | 0.004 | 0.144 | 0.796
Within Groups | 66.393 | 54 | 1.229 | | | |
Total | 77.554 | 55 | | | | |

Table 4.6: ANOVA displaying the significance of the recall performance
difference between the heuristics-based model and the SSD Inception V2
model. With a p-value of 0.004, the null hypothesis can be rejected with
95% confidence.


As the null hypothesis was rejected in the ANOVA test, we conducted a post
hoc Tukey HSD test to gain more insight into the source of the differences
between the mean performances. In consequence, we found that the ML model
outperformed the heuristics-based approach.
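
For reference, an equivalent one-way ANOVA can be reproduced with standard
Python statistics packages; the per-image recall vectors below are
placeholders for the actual experiment data, not the values that produced
Table 4.6.

from scipy import stats

# Placeholder per-image recall scores for the two models; the real analysis
# was run on the 56 observations summarised in Table 4.6.
heuristic_recall = [0.4, 0.6, 0.3, 0.5, 0.4, 0.5]
ml_recall = [1.0, 0.95, 1.0, 0.9, 1.0, 0.98]

f_stat, p_value = stats.f_oneway(heuristic_recall, ml_recall)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# With only two groups, a post hoc Tukey HSD test (for example via
# statsmodels.stats.multicomp.pairwise_tukeyhsd) reduces to the same comparison.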

Figure 4.8: Detection experiments with many notes. The DL based model is
able to identify all but one note, whereas the heuristics-based model only
identifies a fraction of the notes.

Figure 4.9: Detection experiments with a single rectangular note. As seen in
the figure, the heuristics-based model is not able to identify the
rectangular note, as it is crafted to identify only square notes. This
limits the use of the algorithm, as Post-it® notes exist in multiple shapes.

Figure 4.10: Detection experiments with one note on a background of similar
colour. The heuristics-based algorithm is not able to separate the Post-it®
note from the background, and is therefore not able to identify the note.

Figure 4.11: Detection experiments with overlapping notes. As the Post-it®
notes lose their common geometrical fingerprint and spatial structure when
corners are overlapping or missing, the heuristics-based algorithm is not
able to identify the notes.
Chapter 5

Discussion

5.1 Performance/latency payoff


When choosing a suitable ML model for object detection on mobile devices,
the payoff between the performance of the model and its latency depends on
the object detection task at hand. In settings where latency is not crucial,
for example when inference jobs are not executed in real time, more accurate
and complex models such as the Faster RCNN Inception model are suitable. In
settings such as the real-time object detection task presented in this
report, latency has to be taken into consideration to a higher degree, and
such models would introduce too much latency.
As mentioned in chapter 3.3, the deployed model should reach a minimum of
approximately 2 FPS. As seen in Table 4.2, this limits us to the Tiny YOLO,
SSD MobileNet and SSD Inception models, as mentioned earlier. The SSD
Inception model executes at ≈ 1.4 FPS, which might imply too much latency
depending on the implementation. The importance of latency has to be
considered for each individual application.
For the task of detecting Post-it® notes, the SSD Inception model is
favoured over the SSD MobileNet model, due to its superior performance in
distorted environments and when detecting multiple notes, despite the
increased latency. In the user experiments conducted, the users did not
perceive the application as too slow, which implies that the application can
utilise this higher-latency, higher-performance model instead of the
lower-latency MobileNet models.
Furthermore, all execution tests were performed on the OnePlus 5T device, as
mentioned earlier. According to the Android benchmarks performed by PassMark
[49], this device is ranked as the 19th most powerful Android device out of
the 4523 evaluated devices. Depending on the hardware of the mobile
application's end users, this could imply strict limitations on the
performance of the deployed ML model in order to reach a feasible FPS.
Because of this, it might be worth deploying multiple networks at different
points on the performance/latency payoff scale, and using the most suitable
model depending on the hardware of the mobile device running the
application. Deploying a mobile application with this methodology, one does
not have to take the full spectrum of the performance/latency payoff into
consideration.

5.2 Augmented network performance


The inferior performance of the networks trained with augmentation, as
displayed in Table 4.3, was surprising, as we expected superior results in
comparison to the networks trained without augmented data.
The worse performance could be explained by the fact that the applied
augmentation techniques may invalidate the input served to the ML model,
forcing the network to learn patterns that do not represent the actual
spatial relationships of the objects to be identified. Because of this,
augmentation might instead prevent the network from learning the actual
patterns during the training phase.

5.3 Deep learning vs heuristics


The trained networks performed well and were able to identify close to all
Post-it® notes, as seen in Table 4.1 and Table 4.2. As previously mentioned,
the SSD Inception V2 model was evaluated as sufficiently fast and performs
with an mAP value of 96.82%. As presented in Table 4.4 and Table 4.5, the
final ML model increases recall by 117.7% and precision by 26.1% in
comparison to the heuristics-based model, and is thus far superior to the
existing object detection algorithm.
Furthermore, the SSD Inception V2 model is able to detect objects in
distorted and occluded environments, which the existing algorithm is not.
For example, the model is able to detect notes with occluded corners, and
notes with multiple objects on top of them, as shown in Figure 4.7.
This strongly suggests that DL models can be used to support existing
algorithms, either as detectors in themselves or as feature extractors for
existing algorithms. For example, the model can be used to extract detected
Post-it® notes and serve these to the corner detection algorithm, where
certain thresholds can be relaxed since it is already known that a note
exists in the image, inherently increasing the recall of the algorithm.
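
A conceptual sketch of that hybrid pipeline is given below; run_detector and
detect_corners are placeholders standing in for the trained SSD model and the
existing heuristic corner detector, not actual Bontouch APIs.

def detect_notes_hybrid(image, run_detector, detect_corners, margin=10):
    """Use the DL detector to propose note regions, then refine each region
    with the heuristic corner detector.

    Since the detector has already confirmed that a note is present in the
    region, the corner detector can run with relaxed thresholds.
    """
    results = []
    for (xmin, ymin, xmax, ymax) in run_detector(image):
        # Crop the proposed region with a small margin (numpy-style indexing).
        crop = image[max(0, ymin - margin):ymax + margin,
                     max(0, xmin - margin):xmax + margin]
        corners = detect_corners(crop, relaxed=True)
        results.append(((xmin, ymin, xmax, ymax), corners))
    return results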

5.4 Quality of data


The quality of the training data is good, given the overall lighting
conditions and resolution of the images. The training data consists of
images of Post-it® notes in various lighting conditions, angles and
rotations, which suits our needs and requirements.
The possible shortcoming of the training data is the lack of images from an
environment related to the end use case of the model, which is an
environment with multiple notes and non-perfect lighting. An example of a
high-quality image in our training set that does not represent an
environment similar to the use case is the image in Figure 5.2: a close-up
of a single note in near-optimal lighting conditions. A more valuable image
for our training set would be one such as Figure 5.1, which consists of
multiple notes captured in non-optimal lighting conditions.
The Instagram scraping method described in chapter 3.1 scrapes and downloads
Post-it® note images from multiple Instagram accounts, most of which consist
of images of Post-it® notes and nothing else. This is problematic from a
training data perspective, as these accounts mostly consist of images of
single Post-it® notes in perfect lighting conditions, and not the conditions
we aimed for in order to diversify the training data.
Because of this, gathering a significant number of images in end-use-like
environments could improve the performance of the model. However, the
geometrically simple shape of the Post-it® note may already be captured
near-optimally with the current data, in which case significantly increased
performance can only be achieved by increasing the complexity of the model.

Figure 5.1: Example image with characteristics similar to the end case.

Figure 5.2: Image lacking the end case environment setting.

5.5 Hyperparameter tuning


The task of tuning the hyperparameters of the trained models involves a
large number of implicit and explicit hyperparameters not mentioned in this
report.
Given the time constraint and the constraint on available computational
power, it is not viable to investigate all possible hyperparameters and
configurations. For example, one could investigate all possible data
augmentation variables as described in appendix A, as well as the
hyperparameters governing early stopping, regularisation and the chosen
optimiser, to name a few. The reader is encouraged to investigate these
hyperparameters for their chosen model in order to increase the performance
of their ML model.

5.6 Sustainability and ethics


The training of machine learning algorithms can have a negative impact in
terms of sustainability, as these algorithms require intensive training over
a long period of time. As image data is high-dimensional and expensive in
terms of the number of computations required to train the networks, this
project required a large amount of energy to achieve good results.
Furthermore, the hardware required to run such computations is constructed
with scarce natural resources.
From an ethical perspective, this work required the gathering of public data
from social media websites, which resulted in extensive gathering and
processing of public data from private individuals as well as the crawling
of public websites. As the users were not asked for permission to use these
public images, it can be argued that this gathering could be unethical
despite the public nature of the images. Nevertheless, the training data is
not recoverable from the trained networks, and it is not possible to
recreate the data that was used during training.
Chapter 6

Conclusions

In this report we have presented an approach to building a Post-it® note
object detection CNN for use on mobile devices. Utilising multiple base
models and object detection frameworks, we successfully trained and
implemented a variety of models that can be used for real-time detection of
notes in an Android application. Using transfer learning, the networks could
be trained on a relatively small amount of data while still achieving high
mAP scores.
Our results show that ML models are viable for deployment on mobile devices
in terms of inference time and accuracy, and can provide a heuristics-based
corner-detection algorithm with bounding boxes of high recall.
The results given in this report emphasise the possibility of using ML
models and algorithms for object detection on mobile devices. It is shown
that the trained models are feasible for use in a real-time environment on
devices with limited computational power, while remaining highly accurate
and reliable for the end user.
To further develop the work presented in this report, careful gathering of
diversified training data could lead to increased performance: more images
from environments typical of the end use case could improve the performance
of the less complex models when detecting multiple notes, as these models
might have been limited by the quality of the training data. Furthermore,
utilising Capsule Networks [9, 47] instead of CNNs as the base model could
extend the vision capabilities of the model, adding improved learning of
part-whole relationships and viewpoint variation and thus increasing the
recall performance of the model.

Bibliography

[1] Mansour Ahmadi, Angelo Sotgiu, and Giorgio Giacinto. “Intel-


liAV: Toward the Feasibility of Building Intelligent Anti-malware
on Android Devices”. In: Machine Learning and Knowledge Extrac-
tion. Ed. by Andreas Holzinger, Peter Kieseberg, A Min Tjoa, and
Edgar Weippl. Cham: Springer International Publishing, 2017,
pp. 137–154.
[2] James Bergstra and Yoshua Bengio. “Random Search for Hyper-
parameter Optimization”. In: J. Mach. Learn. Res. 13 (Feb. 2012),
pp. 281–305. ISSN: 1532-4435.
[3] Christopher M. Bishop. Pattern Recognition and Machine Learning.
Springer, 2006.
[4] João Cartucho. mAP. https://github.com/Cartucho/mAP.
Accessed: 2018-04-09.
[5] Marc Claesen and Bart De Moor. “Hyperparameter Search in
Machine Learning”. In: CoRR abs/1502.02127 (2015). arXiv: 1502.
02127.
[6] PASCAL2 Network of Excellence. Pascal VOC. http://host.robots.ox.ac.uk/pascal/VOC/. Accessed: 2018-03-21.
[7] Facebook. Instagram. https://www.instagram.com/. Accessed: 2018-03-20.
[8] Sanja Fidler. Lecture Notes in Intro to Image Understanding.
[9] Nicholas Frosst Geoffrey E Hinton Sara Sabour. “Matrix cap-
sules with EM routing”. In: International Conference on Learning
Representations (2018). accepted as poster.
[10] Ross B. Girshick. “Fast R-CNN”. In: CoRR abs/1504.08083 (2015).
arXiv: 1504.08083.


[11] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Ma-
lik. “Rich feature hierarchies for accurate object detection and
semantic segmentation”. In: CoRR abs/1311.2524 (2013). arXiv:
1311.2524.
[12] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learn-
ing. MIT Press, 2016.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep
Residual Learning for Image Recognition”. In: CoRR abs/1512.03385
(2015). arXiv: 1512.03385.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Delv-
ing Deep into Rectifiers: Surpassing Human-Level Performance
on ImageNet Classification”. In: CoRR abs/1502.01852 (2015).
arXiv: 1502.01852.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Spa-
tial Pyramid Pooling in Deep Convolutional Networks for Vi-
sual Recognition”. In: CoRR abs/1406.4729 (2014). arXiv: 1406.
4729.
[16] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever,
and Ruslan Salakhutdinov. “Improving neural networks by pre-
venting co-adaptation of feature detectors”. In: CoRR abs/1207.0580
(2012). arXiv: 1207.0580.
[17] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko,
Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig
Adam. “MobileNets: Efficient Convolutional Neural Networks
for Mobile Vision Applications”. In: CoRR abs/1704.04861 (2017).
arXiv: 1704.04861.
[18] Mi-Young Huh, Pulkit Agrawal, and Alexei A. Efros. “What makes
ImageNet good for transfer learning?” In: CoRR abs/1608.08614
(2016). arXiv: 1608.08614.
[19] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song
Han, William J. Dally, and Kurt Keutzer. “SqueezeNet: AlexNet-
level accuracy with 50x fewer parameters and <1MB model size”.
In: CoRR abs/1602.07360 (2016). arXiv: 1602.07360.
[20] Google Inc. Tensorflow. https://github.com/tensorflow/models/tree/master/research/object_detection/models. Accessed: 2018-05-02.

[21] Google Inc. Tensorflow Lite. https://www.tensorflow.org/mobile/tflite/. Accessed: 2018-02-07.
[22] Google Inc. Tensorflow Mobile. https://www.tensorflow.org/mobile/mobile_intro. Accessed: 2018-01-23.
[23] Andrej Karpathy. CS231n Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/convolutional-networks/. Accessed: 2018-02-20.
[24] Ryo Kawamura. RectLabel. https://rectlabel.com. Accessed: 2018-03-20.
[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Ima-
geNet Classification with Deep Convolutional Neural Networks”.
In: Advances in Neural Information Processing Systems 25. Ed. by F.
Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Curran
Associates, Inc., 2012, pp. 1097–1105.
[26] Twan van Laarhoven. “L2 Regularization versus Batch and Weight
Normalization”. In: CoRR abs/1706.05350 (2017). arXiv: 1706.
05350.
[27] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W.
Hubbard, and L. D. Jackel. “Backpropagation Applied to Hand-
written Zip Code Recognition”. In: Neural Comput. 1.4 (Dec. 1989),
pp. 541–551. ISSN: 0899-7667. DOI: 10.1162/neco.1989.1.
4.541.
[28] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
“Gradient-based learning applied to document recognition”. In:
Proceedings of the IEEE. 1998, pp. 2278–2324.
[29] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy,
and Scott E. Reed. “SSD: Single Shot MultiBox Detector.” In:
CoRR abs/1512.02325 (2015).
[30] Martin Abadi et al. TensorFlow: Large-Scale Machine Learning on
Heterogeneous Systems. 2015.
[31] Vinod Nair and Geoffrey E. Hinton. “Rectified Linear Units Im-
prove Restricted Boltzmann Machines”. In: Proceedings of the 27th
International Conference on International Conference on Machine Learn-
ing. ICML’10. Haifa, Israel: Omnipress, 2010, pp. 807–814. ISBN:
978-1-60558-907-7.

[32] A.Y. Ng. “Feature selection, l1 vs. l2 regularization, and rota-


tional invariance”. In: Proceedings of the twenty-first international
conference on Machine learning. ACM. 2004, p. 78.
[33] Nvidia. CUDA. https://developer.nvidia.com/cuda-zone. Accessed: 2018-02-23.
[34] Nvidia. cuDNN. https://developer.nvidia.com/cudnn. Accessed: 2018-02-23.
[35] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. “Learn-
ing and Transferring Mid-level Image Representations Using Con-
volutional Neural Networks”. In: Proceedings of the 2014 IEEE
Conference on Computer Vision and Pattern Recognition. CVPR ’14.
Washington, DC, USA: IEEE Computer Society, 2014, pp. 1717–
1724. ISBN: 978-1-4799-5118-5. DOI: 10.1109/CVPR.2014.222.
[36] Rarcega. Instagram-scraper. https://github.com/rarcega/instagram-scraper. Accessed: 2018-03-20.
[37] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and
Stefan Carlsson. “CNN Features off-the-shelf: an Astounding Base-
line for Recognition”. In: CoRR abs/1403.6382 (2014). arXiv: 1403.
6382.
[38] Joseph Redmon. Darknet: Open Source Neural Networks in C. https://pjreddie.com/darknet/. Accessed: 2018-03-21.
[39] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and
Ali Farhadi. “You Only Look Once: Unified, Real-Time Object
Detection”. In: CoRR abs/1506.02640 (2015). arXiv: 1506.02640.
[40] Joseph Redmon and Ali Farhadi. “YOLO9000: Better, Faster, Stronger”.
In: 2017 IEEE Conference on Computer Vision and Pattern Recogni-
tion, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. 2017, pp. 6517–
6525. DOI: 10.1109/CVPR.2017.690.
[41] Joseph Redmon and Ali Farhadi. “YOLOv3: An Incremental Im-
provement”. In: arXiv (2018).
[42] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. “Faster
R-CNN: Towards Real-Time Object Detection with Region Pro-
posal Networks”. In: Advances in Neural Information Processing
Systems 28. Ed. by C. Cortes, N. D. Lawrence, D. D. Lee, M.
Sugiyama, and R. Garnett. Curran Associates, Inc., 2015, pp. 91–
99.

[43] Dimensional Research. FAILING TO MEET MOBILE APP USER


EXPECTATIONS – A MOBILE APP USER SURVEY. 2015.
[44] Adrian Rosebrock. Intersection over Union (IoU) for object detection. https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/. Accessed: 2018-04-09.
[45] Sebastian Ruder. “An overview of gradient descent optimiza-
tion algorithms”. In: CoRR abs/1609.04747 (2016). arXiv: 1609.
04747.
[46] Olga Russakovsky et al. “ImageNet Large Scale Visual Recog-
nition Challenge”. In: CoRR abs/1409.0575 (2014). arXiv: 1409.
0575.
[47] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. “Dynamic
Routing Between Capsules”. In: Advances in Neural Information
Processing Systems 30: Annual Conference on Neural Information
Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA.
2017, pp. 3859–3869.
[48] Pouya Samangouei and Rama Chellappa. “Convolutional Neu-
ral Networks for Attribute-based Active Authentication on Mo-
bile Devices”. In: CoRR abs/1604.08865 (2016). arXiv: 1604.08865.
[49] PassMark Software. Android Benchmarks. https://www.androidbenchmark.net/device_list.php. Accessed: 2018-05-14.
[50] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever,
and Ruslan Salakhutdinov. “Dropout: a simple way to prevent
neural networks from overfitting.” In: Journal of Machine Learn-
ing Research 15.1 (2014), pp. 1929–1958.
[51] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon
Shlens, and Zbigniew Wojna. “Rethinking the Inception Archi-
tecture for Computer Vision”. In: CoRR abs/1512.00567 (2015).
arXiv: 1512.00567.
[52] Richard Szeliski. Computer vision algorithms and applications. Lon-
don; New York: Springer, 2011, pp. 10–17. ISBN: 9781848829343
1848829345 9781848829350 1848829353.
[53] OpenCV team. OpenCV (Open Source Computer Vision Library). https://opencv.org. Accessed: 2018-05-18.

[54] Thtrieu. Darkflow: Translate Darknet to Tensorflow. https://github.com/thtrieu/darkflow. Accessed: 2018-03-21.
[55] J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, and A.W.M. Smeul-
ders. “Selective Search for Object Recognition”. In: International
Journal of Computer Vision (2013). DOI: 10.1007/s11263-013-
0620-5.
[56] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob
Fergus. “Regularization of Neural Networks using DropCon-
nect”. In: Proceedings of the 30th International Conference on Ma-
chine Learning. Ed. by Sanjoy Dasgupta and David McAllester.
Vol. 28. Proceedings of Machine Learning Research 3. Atlanta,
Georgia, USA: PMLR, 17–19 Jun 2013, pp. 1058–1066.
[57] Mengwei Xu, Feng Qian, and Saumay Pushp. Enabling Coopera-
tive Inference of Deep Learning on Wearables and Smartphones. 2017.
eprint: arXiv:1712.03073.
[58] Matthew D. Zeiler and Rob Fergus. “Visualizing and Under-
standing Convolutional Networks”. In: CoRR abs/1311.2901 (2013).
arXiv: 1311.2901.
Appendix A

TensorFlow API data augmentation variables

• NormalizeImage normalize_image = 1;

• RandomHorizontalFlip random_horizontal_flip = 2;

• RandomPixelValueScale random_pixel_value_scale = 3;

• RandomImageScale random_image_scale = 4;

• RandomRGBtoGray random_rgb_to_gray = 5;

• RandomAdjustBrightness random_adjust_brightness = 6;

• RandomAdjustContrast random_adjust_contrast = 7;

• RandomAdjustHue random_adjust_hue = 8;

• RandomAdjustSaturation random_adjust_saturation = 9;

• RandomDistortColor random_distort_color = 10;

• RandomJitterBoxes random_jitter_boxes = 11;

• RandomCropImage random_crop_image = 12;

• RandomPadImage random_pad_image = 13;

• RandomCropPadImage random_crop_pad_image = 14;

• RandomCropToAspectRatio random_crop_to_aspect_ratio = 15;


• RandomBlackPatches random_black_patches = 16;

• RandomResizeMethod random_resize_method = 17;

• ScaleBoxesToPixelCoordinates scale_boxes_to_pixel_coordinates
= 18;

• ResizeImage resize_image = 19;

• SubtractChannelMean subtract_channel_mean = 20;

• SSDRandomCrop ssd_random_crop = 21;

• SSDRandomCropPad ssd_random_crop_pad = 22;

• SSDRandomCropFixedAspectRatio ssd_random_crop_fixed_aspect_ratio
= 23;

• SSDRandomCropPadFixedAspectRatio ssd_random_crop_pad_fixed_aspect_ratio
= 24;

• RandomVerticalFlip random_vertical_flip = 25;

• RandomRotation90 random_rotation90 = 26;

• RGBtoGray rgb_to_gray = 27;

