Improving Embedded Deep Learning Object Detection by Integrating Infrared Camera
GEORGE PUNTER
Abstract
Deep learning is the current state-of-the-art for computer vision applications.
For embedded deployment, FPGAs have the potential to fill this niche, offering
lower development costs and faster development cycles than ASICs, with a
smaller size and power footprint than GPUs. Recent developments, such as HLS
and other frameworks that support deep learning on FPGAs, have made FPGA
development increasingly accessible. However, neural networks deployed onto
FPGAs suffer from reduced accuracy compared to their software counterparts.
This thesis explores whether integrating an additional camera, namely long-
wave infrared, into an embedded computer vision system is a viable option
to improve inference accuracy in critical vision tasks, and is split into three
stages.
First, we explore image registration methods between RGB and infrared
images to find one suitable for embedded implementation, and conclude that
for a static camera setup, manually assigning point matches to obtain a warping
homography is the best route. Incrementally optimising this estimate or using
phase congruency features combined with a feature matching algorithm are
both promising avenues to pursue further.
We implement this perspective warping function on an FPGA using the Vivado
HLS workflow, concluding that, whilst not without limitations, the development
of computer vision functions in HLS is considerably faster than implementation
in HDL. We note that the open-source PYNQ framework by Xilinx
is convenient for edge data processing, allowing drop-in access to hardware-
accelerated functions from Python which opens up FPGA-accelerated data
processing to less hardware-centric developers and data scientists.
Finally, we analyse whether the additional IR data can improve the ob-
ject detection accuracy of a pre-trained RGB network by calculating accuracy
metrics with and without image augmentation across a dataset of 7,777
annotated image pairs. We conclude that detection accuracy, especially for
pedestrians and at night, can be significantly improved without requiring any
network retraining.
We demonstrate that, in terms of implementation overhead, integrating an IR
camera is a viable approach to improving the accuracy of deep learning vision
systems. Future work should explore other methods of integrating the IR
data, such as enhancing predictions by utilising hot-point information within
bounding boxes, applying transfer learning principles with a dataset of aug-
mented images, or improving the image registration and fusion stages.
Sammanfattning
Deep learning is the current state of the art for computer vision applications. FPGAs are potentially useful here, as they have lower development costs and faster development cycles than ASICs, together with a smaller size and power consumption than GPUs. With frameworks such as HLS and others, it has become easier to use FPGAs for deep learning. Unfortunately, neural networks running on FPGAs suffer from reduced accuracy compared to their software counterparts.
This thesis explores whether it is possible to improve inference accuracy for critical computer vision tasks by adding a long-wave IR camera to an embedded computer vision system. The thesis is divided into three parts.
First, we explore image registration methods between RGB and IR images in order to find one suitable for an embedded implementation. The conclusion is that, for a static camera setup, the most suitable approach is to manually assign point matches to obtain a homography. Incrementally refining this estimate, or using phase congruency features combined with a feature matching algorithm, are two promising future improvements.
We implement a perspective warping function on an FPGA using the Vivado HLS tools, and conclude that, although not without limitations, computer vision functions are faster to develop in HLS than to implement in HDL. We observe that the open-source PYNQ framework by Xilinx is convenient for edge data processing, providing drop-in access to hardware-accelerated functions from Python. This makes FPGAs accessible to developers with little hardware experience.
Finally, we analyse whether the additional IR data can improve object detection accuracy when using a pre-trained RGB network, by computing accuracy metrics with and without image augmentation over a dataset of 7,777 annotated image pairs. We find that detection accuracy can be improved without any network retraining.
We show that integrating an IR camera is a viable way to improve the accuracy of deep learning-based computer vision systems, since it is manageable in terms of implementation overhead. Future work should focus on other methods of exploiting the IR data, for example improving predictions using hot-point information within bounding boxes, applying transfer learning principles with a dataset of augmented images, or improving the image registration and fusion stages.
Acknowledgements
I’d like to dedicate this work to my grandfathers: Francis George Punter and
Necdet ‘George’ Çilasun whose names I share and will always be a part of me.
Contents
1 Introduction 1
1.1 Thesis Description . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Thesis Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Contribution Goals . . . . . . . . . . . . . . . . . . . 6
1.3.2 Societal Impact . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 Ethical Considerations . . . . . . . . . . . . . . . . . 7
1.4 Report structure . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Related Works 9
2.1 CNNs on IR data . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 FPGA Acceleration of Computer Vision . . . . . . . . . . . . 10
2.3 Image Registration . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Global Methods . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Blob homography . . . . . . . . . . . . . . . . . . . . 12
2.3.3 Feature based . . . . . . . . . . . . . . . . . . . . . . 12
2.3.4 Incremental optimisation . . . . . . . . . . . . . . . . 13
3 Image Registration 14
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.1 OpenCV . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.2 Two-view Geometry . . . . . . . . . . . . . . . . . . 16
3.2.3 FLIR ADAS Dataset . . . . . . . . . . . . . . . . . . 16
3.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.1 SIFT and SURF . . . . . . . . . . . . . . . . . . . . 17
3.3.2 Phase Features . . . . . . . . . . . . . . . . . . . . . 19
3.3.3 Line matching . . . . . . . . . . . . . . . . . . . . . 20
4 FPGA Implementation 25
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.2 High-Level Synthesis (HLS) . . . . . . . . . . . . . . 26
4.2.3 Zynq-7000 System-on-a-Chip (SoC) . . . . . . . . . . 27
4.2.4 The PYNQ Framework . . . . . . . . . . . . . . . . . 28
4.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3.1 Development Platform . . . . . . . . . . . . . . . . . 29
4.4 Vivado HLS . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4.1 Algorithm Implementation . . . . . . . . . . . . . . . 30
4.4.2 Testbench: C-simulation and co-simulation . . . . . . 32
4.4.3 Programmer Directives . . . . . . . . . . . . . . . . . 33
4.5 Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.5.1 AXI Interfaces . . . . . . . . . . . . . . . . . . . . . 35
4.6 The PYNQ Framework . . . . . . . . . . . . . . . . . . . . . 36
4.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.7.1 Development Process . . . . . . . . . . . . . . . . . . 37
4.7.2 HLS pragmas . . . . . . . . . . . . . . . . . . . . . . 37
4.7.3 Timing Comparison . . . . . . . . . . . . . . . . . . 39
4.8 FPGA Implementation Conclusion . . . . . . . . . . . . . . . 39
4.8.1 Future Work . . . . . . . . . . . . . . . . . . . . . . 40
5 Data Augmentation 43
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.1 History . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.2 Neural Network Primer . . . . . . . . . . . . . . . . . 44
5.2.3 Convolutional Neural Networks (CNNs) . . . . . . . . 46
5.2.4 Data Augmentation . . . . . . . . . . . . . . . . . . . 48
5.2.5 Evaluating model performance . . . . . . . . . . . . . 48
5.2.6 YOLO . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . 51
6 Conclusion 63
Bibliography 65
Acronyms
AI Artificial Intelligence. 1
CV Computer Vision. 1, 3
FF Flip-Flop. 27, 38
FN False Negative. 54
FP False Positive. 54
GP General Purpose. 35
HP High Performance. 35
II Initiation Interval. 33
IP Intellectual Property. 28
IR infrared. 4
ML Machine Learning. 1, 3, 44
TP True Positive. 54
Introduction
[Figure: object detection results (a) before deep learning [1] and (b) with deep learning [3]]
with these results? The key is in the hardware, the above performance is on
CPUs. Neural networks require extensive matrix multiplications, something
for which the video game industry has been developing custom hardware for
decades in the form of the Graphics Processing Unit (GPU). CPUs are designed
for sequential operations, whilst huge computational performance gains
can be acquired by computing in parallel. Machine learning researchers ap-
propriated the GPU, exploiting its parallel computing capabilities to speed up
model training and inference on neural networks, and kick-starting the ML
boom we are now in the midst of. Running inference on a GPU can speed up
execution times by two orders of magnitude, which hugely improves usability.
More importantly, similar performance gains can be seen for the training stage
of neural networks. Since training a network for computer vision can take days
even with a GPU, it is easy to see just how pivotal specialised hardware is, and
why ML did not take off until the incorporation of GPUs.
However, even with a GPU, the performance of neural networks is slow.
The original state-of-the-art, Region-Convolutional Neural Network (R-CNN),
takes approximately 47 seconds to process an image. Its successors, Fast R-
CNN and Faster R-CNN, are approaching real-time, cutting time down to 2.3s
and 0.2s respectively. However, this is still only 5 FPS. The GPU used is an
Nvidia Tesla M40 GPU, which has a footprint of 10.5x4.4in, costs over $1000,
and consumes 250 W of power. To get closer to real-time performance, there
are two routes: either change the hardware or change the software.
You Only Look Once (YOLO) networks are designed for speed, only re-
quiring a single pass over the image and performing an order of magnitude
faster than the fastest R-CNN implementations on a GPU [4]. Whilst GPUs
are fine to use in a local desktop tower or via cloud computing services, edge
devices require a smaller size and power footprint, and real-time or critical ap-
plications do not have the luxury of relying on internet connectivity for both
latency and security reasons.
Bootstrapping hardware designed for one thing to use it for another is never
going to be efficient, which is why Google has started developing custom AI
Application-Specific Integrated Circuit (ASIC)s for machine learning, which
it has dubbed the Tensor Processing Unit (TPU). This allows much lower power
consumption per compute, and a smaller footprint, since the processor design
is completely optimised towards matrix multiplication-and-accumulation of
the kind performed in neural networks [5]. However, ASICs are very expensive
to design, and the field of machine learning is changing so rapidly that designs
could quickly become obsolete. Only for companies with a huge amount of
resources is this really a possibility, and even then most CV applications require
pre- and post-processing besides the ML inference.
Field Programmable Gate Array (FPGA)s are a middle ground between
GPUs and ASICs, allowing the development of custom hardware that is re-
programmable. If the architecture required for a cutting-edge AI implementa-
tion changes, it is possible to reprogram the architecture of the FPGA to suit the
change needed. Another benefit of FPGAs is that due to their closeness to the
hardware, it is easy to connect and interface additional sensors into the design
as befits the application, making it a good edge data acquisition and analysis
platform. With FPGAs, one can have custom hardware tailored to the entire
computer vision application, including pre- and post-processing steps that are
specific to the application domain, without the overhead of ASIC design and
manufacture.
FPGA design for computer vision applications requires a very broad set
of knowledge. FPGAs are programmed in a Hardware Description Language
(HDL), such as Verilog or VHSIC Hardware Description Language (VHDL),
requiring hardware engineers. The top machine learning tools are written in
C++ (tensorflow, numpy, scikit-learn, dlib), and computer vision tools in C++
or Python (OpenCV), with data scientists largely using Python due to its sim-
plicity, popularity for CV and ML, and support for Read Evaluate Print Loop
(REPL) making it easier to develop applications in Python than in C++ or
directly on an FPGA [6].
In short, development for an FPGA computer vision system requires ex-
tensive domain-specific knowledge in several fields: digital hardware design,
embedded systems, machine learning, and computer vision.
However, recent developments are allowing increased accessibility to this
paradigm. High Level Synthesis (HLS) tools allow the compilation of C++
into HDL, allowing software engineers to develop for FPGA without exten-
sive knowledge of HDL, and test algorithms within seconds rather than hours.
Xilinx has released open-source implementations of common OpenCV func-
tions in HLS-ready code in its xfOpenCV library, to further reduce this burden.
System-on-a-Chip (SoC) devices like the Zynq-7000 series combine a Process-
ing System (PS) and Programmable Logic (PL) on the same chip, reducing
the complexity of interacting with PL, and the Python productivity for Zynq
(PYNQ) framework [7] further simplifies this. Finally, tools are being devel-
oped to streamline the implementation of neural network inference in FPGA
programmable logic, though often at a cost to accuracy [8].
These initiatives open up new possibilities for the rapid development of
custom hardware for computer vision, but are yet to mature or stabilise. PYNQ
was first released in 2016, and a 2015 analysis of the state of Vivado HLS for
image processing concluded that, whilst promising, it was not yet worth it, since
it itself presented a steep, poorly documented learning curve whose time could be
‘better spent learning [V]HDL’ [9].
1.2.1 Hypothesis
Our hypothesis is that:
1.3 Contribution
The contribution goals of this thesis revolve around two concepts. Firstly,
demonstrating the practical utility of FPGAs in the field of computer vision
given industry efforts to improve developer productivity. Secondly, demon-
strating the usefulness of infrared cameras for computer vision systems.
[10], the TPU by Google in 2015 [5], and the Full Self-Driving (FSD) com-
puter announced by Tesla this year [11]. The ability to generate HDL blocks
from C/C++ and interface with custom hardware accelerators via Python opens
up this domain to less financed entities, potentially lessening the technologi-
cal monopolies held by large corporations who can afford to develop custom,
proprietary ASICs to edge out competition.
Utilising data from an additional modality to augment RGB images opens
up further capabilities of computer vision-based systems. In the case of IR,
the additional data could be used to analyse temperatures, improve detec-
tion of warm-bodied objects, and improve visibility in the dark. Applica-
tions of this could be monitoring plant, animal or human health, search and
rescue equipment, and improving pedestrian detection for Advanced Driver-
Assistance Systems (ADAS) and automated driving applications, using either
classical or deep learning computer vision techniques.
Image registration across different sensor modalities, in particular between
infrared and visible spectra, is useful in a variety of fields, from evaluating bi-
ological health, to surveillance or search and rescue systems. Hardware imple-
mentations of multimodal image registration provide a smaller size and power
footprint whilst improving real-time responsiveness. This extends the usabil-
ity of the aforementioned applications, for example extending the flight time
of a search and rescue drone or converting a medical device from a desktop to
a hand-held device.
Deep learning is the current state-of-the-art for object detection,
and embedded implementations provide more real-time applicability, lower
power and size profiles, and lower cost. The inference accuracies of FPGA
implementations of deep learning are usually worse than software-based coun-
terparts, since network weights are quantised, and the complexity of the model
itself reduced to ease implementation [8]. Hence, in place of this lost network com-
plexity, it is possible that augmenting images with additional data could im-
prove object detection or segmentation accuracies. Real-time object detection
is especially sought after for autonomous vehicles and other ADAS, therefore
exploring ways to improve object detection on safety critical computer vision
systems could ultimately improve automotive safety.
Firstly, the demonstration of the ease of use of HLS tools. Then, the im-
plementation of a hardware-based platform with image registration between
RGB and IR camera data streams. Finally, improved dual spectra datasets
for deep learning, and improved real-time embedded deep learning using IR-
augmented images.
The biggest ethical concern would be the use of improved RGB-IR vi-
sion systems in military technology, or by governments for surveillance that
breaches personal rights. RGB-IR image fusion is useful for weapons target-
ing systems, since the heat information can be used as a life-sign indicator and
can aid military personnel in identifying hostile targets, especially at night.
More intelligent systems implementing some kind of auto-targeting function-
ality using AI are even more ethically dubious, since the confirmation that the
target is hostile and not a civilian is left to the system. In any case, the morality
of warfare is always questionable.
Thus, this technology could be adapted for unethical applications; however,
this is not the intention behind the research, and the author is strongly against
the use of this work for those purposes.
Related Works
The inspiration for this work comes from Convolutional Neural Network Quan-
tisation for Accelerating Inference in Visual Embedded Systems [8], where it is
noted that quantised Convolutional Neural Network (CNN) inference on em-
bedded systems suffers from reduced accuracy compared to software-based
counterparts.
Since the aim of this work is to evaluate whether this reduced accuracy can
be in-part alleviated by integrating an IR camera into an embedded computer
vision system, relevant works are therefore applications of CNNs to IR data,
FPGA acceleration for computer vision, and methods for image registration.
RGB and IR data in such a way as to minimise any additional training overhead,
maintaining the benefits of data wealth in the RGB domain and minimising
implementation overhead, but also to leverage the additional spectrum data in
cases where the visible spectrum is less useful such as at night or in fog.
Research into deep learning is hugely active but still relatively new, and so
whilst not found at the time of writing, we expect papers with titles such as
Common deep learning object detection features between RGB and IR images
or Combining RGB and Infrared Images for Robust Object Detection using
Convolutional Neural Networks to be released in the near future. Researchers
are just beginning to understand how deep networks really work, and these
insights will be highly beneficial not just in the field of deep learning. This
thesis shows there are enough similarities between features in RGB and IR
images that a deep network can successfully identify the same objects in RGB,
IR, and RGB-IR augmented images despite having been trained on a solely
RGB dataset. Future work based on this concept could provide a more concrete
understanding of the relationship between these two spectra, with implications
for our understanding of optics.
to 5 times lower across the board’ using the accelerated Vivado HLS imple-
mentation over the software implementation [9].
Fourier-based methods
The Fourier-Mellin transform, such as described in [23], can provide fast and
robust image registration, but ‘the main drawback of the FMT approach is
that it is only applicable to register images linked through a transformation
limited to translation, rotation and scale change’ [23]. Other papers using FFT
described similar limitations [24][25].
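To make this limitation concrete, the sketch below (an illustration, not a method used in this thesis) estimates a pure translation between two single-channel images using OpenCV's FFT-based phase correlation; recovering rotation and scale in the Fourier-Mellin style would additionally require resampling the magnitude spectra to log-polar coordinates before correlating.

    import cv2
    import numpy as np

    def estimate_translation(img_a, img_b):
        """Estimate the (dx, dy) shift between two grayscale images
        using FFT-based phase correlation (translation only)."""
        a = np.float32(img_a)
        b = np.float32(img_b)
        # A Hanning window reduces edge effects in the FFT.
        window = cv2.createHanningWindow(a.shape[::-1], cv2.CV_32F)
        (dx, dy), response = cv2.phaseCorrelate(a, b, window)
        return dx, dy, response

A perspective misalignment between RGB and IR cameras falls outside this transformation model, which is exactly the limitation described in the quotation above.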
feature matching, and it was shown that phase congruency descriptors proved
more effective than the traditional SIFT/SURF approaches, which do not work
well on RGB-IR image pairs.
The line-based registration algorithm by [33], where lines are used as fea-
tures and matched between images, seems an extremely relevant and effective
way to align RGB and IR images, since lines are the most common attribute
shared between the two spectra.
Image Registration
3.1 Introduction
The overall aim of this thesis is to evaluate the viability of integrating an IR
camera into an image acquisition platform as a means to improve real-time
object detection. A crucial part of using this additional data is the image reg-
istration step, which aligns the two images.
This step must be done in real-time, and should utilise minimal resources
of the FPGA since most of the Programmable Logic (PL) will be used by the
neural network implementation. Ideally this process would be automatic, to
reduce the need for manual calibration.
Most methods for multi-spectra image registration are developed for desk-
top applications [20][22][25][28][31][33][34]. These methods are not neces-
sarily suitable for real-time or embedded implementation, either due to com-
putational overhead or the difficulty of re-implementation in hardware.
The aim of this stage of the thesis is to evaluate possible methods for image
registration from Section 2.3, Related Works, and choose a suitable method in
terms of processing speed (can it work in real-time?), registration quality (how
well are the two images aligned?), and suitability for hardware implementation
(can it be implemented in hardware within a short time-frame?).
First we will cover background information on the software used and rele-
vant computer vision theory for image registration in Section 3.2, Background.
In Section 3.3, Method, we describe our method and the work done to imple-
ment registration methods from Section 2.3, and discuss the outcomes of those
experiments. We summarise our thoughts in Section 3.4, Conclusion, finalis-
ing our chosen method for Chapters 4 and 5 of the thesis.
3.2 Background
3.2.1 OpenCV
Operations on images are highly parallelisable, and since the number of pix-
els in high-quality images is approaching the same order of magnitude as the
clock speed of modern CPUs, it is important to use Single Instruction Multiple
Data (SIMD) instructions and parallelised architectures wherever possible for
real-time performance. OpenCV is one of the fastest, most mature, and most
popular open-source image processing libraries. The library has ‘more than
2500 optimized algorithms’, and ‘leans mostly towards real-time vision ap-
plications’. It ‘takes advantage of MMX and SSE instructions when available’
and ‘full-featured CUDA and OpenCL interfaces are being actively developed’
for GPU execution, giving us one of the most efficient tools in software for in-
teracting with digital images [35].
OpenCV is used extensively in this project: it is written in C++, and has
bindings for Python, meaning the function calls in Python are executed at the
speed of statically-compiled and optimised C++, which is typically at least an
order of magnitude faster than Python code. Since Python is used for interfac-
ing with hardware on the PYNQ framework, OpenCV allows us to compare
side-by-side our custom hardware implementation on PL with the equivalent,
optimised C++ function calls on the PS of the system. Since it is open-source,
it benefits from an entire community of bug fixes and improvements, as well
as adaptations for specific hardware platforms.
xfOpenCV
A huge benefit from using OpenCV in this stage of the thesis is that Xilinx has
released an open-source library named xfOpenCV which ‘is a set of 50+ ker-
nels, optimized for Xilinx FPGAs and SoCs, based on the OpenCV computer
vision library’ 1 .
This means that using OpenCV functions in this stage of the thesis could
result in not needing to write their hardware equivalent if they are already im-
plemented in xfOpenCV. At the very least, the Xilinx library is a good reference
for writing OpenCV-like functionality in HLS code for an FPGA.
1 https://github.com/Xilinx/xfopencv
3.3 Method
An optimistic goal of this section is to implement a robust image registra-
tion algorithm between RGB and IR images, without requiring manually-input
point matches, and running in real-time.
Out of the algorithms reviewed in Related Works, Section 2.3, several
methods such as the Gaussian field methods were discounted due to complex-
ity [20][21][22]. FFT and FMT based methods were discounted due to their
limitations in application.
We gathered that for completely automatic image registration, without hu-
man assistance, the standard method is to use SIFT or SURF feature descrip-
tors on pre-rectified images to detect feature points to match between the im-
ages, and use these matches to generate homographies. The following three
algorithms proved promising for solving the image registration problem:
Figure 3.1: SIFT and SURF between RGB and IR images [38]
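As a point of reference for this standard pipeline, a minimal sketch is given below, assuming an OpenCV build in which SIFT is available (OpenCV 4.4+ or the contrib packages); the same structure applies to SURF. As noted in the Related Works, these descriptors transfer poorly between the RGB and IR spectra, which motivates the alternatives explored in the following subsections.

    import cv2
    import numpy as np

    def sift_homography(gray_rgb, gray_ir, ratio=0.75):
        """Detect SIFT keypoints in both images, match descriptors with
        Lowe's ratio test, and estimate a homography with RANSAC."""
        sift = cv2.SIFT_create()
        kp1, des1 = sift.detectAndCompute(gray_rgb, None)
        kp2, des2 = sift.detectAndCompute(gray_ir, None)

        matcher = cv2.BFMatcher(cv2.NORM_L2)
        good = []
        for pair in matcher.knnMatch(des1, des2, k=2):
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
                good.append(pair[0])
        if len(good) < 4:
            return None  # not enough matches to fit a homography

        src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        return H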
this method in Python seem promising. As visible in Figure 3.2, the edges
extracted from both the RGB (top) and IR (bottom) images can be seen to be
extremely similar, despite the differences in spectral modality. However, our
implementation was slow and memory intensive, leading us to conclude that a
real-time implementation would be difficult within our time-frame. Applying
a feature matching algorithm on top of this would be even further detrimental
to execution time and viability.
Whilst this algorithm is not suitable for this thesis, it is worth noting a cou-
ple of things. Firstly, the structure of the algorithm is well suited to hardware
optimisation: it uses a ‘bank of filters to analyse the signal’ [32]. Convolving
an input signal, or image, with a bank of filters is a typical use case for FPGAs,
since filters can be applied to streaming data, and stacked into a data pro-
cessing pipeline. Secondly, a qualitative review of the output finds that this
method produces features that are more similar between RGB and IR images
than traditional feature detectors such as Sobel or Canny edge detectors, with
no adjustable threshold parameter.
sufficiently far away and the cameras are collimated as much as possible, it
is the simplest and most effective way to obtain a homography which maps
one camera image to another, as well as a prerequisite step for semi-manual
methods described in the next sub-section.
We wrote a tool using Python and the Python library Matplotlib to allow
a quick way for a user to assign matching points between two images, and
save these to a .json file. Then, we manually annotated images from the FLIR
ADAS dataset in order to test this method.
We found that the homography generated from manual point matches be-
tween one pair of RGB-IR images produced a good mapping between the two
spectra. Furthermore, the homography from one pair could also be used for
all other image pairs captured using the same setup with decent results, as will
be seen in Chapter 5, Data Augmentation.
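A minimal sketch of this workflow with OpenCV is shown below; the .json structure and file name are illustrative rather than those of the actual tool.

    import json
    import cv2
    import numpy as np

    def homography_from_json(path):
        """Load manually assigned point matches (illustrative format:
        {"ir": [[x, y], ...], "rgb": [[x, y], ...]}) and fit a homography."""
        with open(path) as f:
            matches = json.load(f)
        src = np.float32(matches["ir"])   # points clicked in the IR image
        dst = np.float32(matches["rgb"])  # corresponding points in the RGB image
        # With exactly four matches cv2.getPerspectiveTransform would suffice;
        # findHomography accepts any number >= 4 and tolerates small errors.
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        return H

    # The homography from one annotated pair can then be reused for every
    # image pair captured with the same static camera setup:
    # H = homography_from_json("matches.json")
    # warped_ir = cv2.warpPerspective(ir, H, (rgb.shape[1], rgb.shape[0]))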
believe that the similarity measures used rewarded mapping dense feature ar-
eas together too much, for example sacrificing a homography which provided
a good global image alignment in order to maximise the overlapping areas of
tree leaves, which were often a particularly feature-rich region of the images
used in testing. As in [34], it may be necessary to optimise over multiple im-
age scales. Another possible solution would be to split the image into regions,
assigning each region the same weight in the final calculation. In this way, the
influence of particularly feature-dense regions on the skew of the homography
is limited.
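One readily available form of such incremental refinement, given here as a hedged illustration of the idea rather than the exact approach evaluated above, is OpenCV's ECC alignment, which iteratively adjusts an initial homography to maximise a correlation-based similarity measure:

    import cv2
    import numpy as np

    def refine_homography(template_gray, moving_gray, H_init,
                          iterations=200, eps=1e-6):
        """Iteratively refine an initial 3x3 homography by maximising the
        Enhanced Correlation Coefficient between the two images."""
        criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT,
                    iterations, eps)
        warp = np.float32(H_init)
        try:
            # Note: some OpenCV 4.x versions also expect an input mask and a
            # Gaussian filter size as extra arguments.
            _, warp = cv2.findTransformECC(template_gray, moving_gray, warp,
                                           cv2.MOTION_HOMOGRAPHY, criteria)
        except cv2.error:
            # ECC may fail to converge on low-overlap or low-texture pairs.
            return H_init
        return warp

As noted above, a plain global similarity measure can over-weight feature-dense regions, so any such refinement would need the multi-scale or region-weighting safeguards discussed in this section.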
FPGA Implementation
4.1 Introduction
Implementing image registration on embedded hardware is an important stage
of this thesis since the conclusions drawn from the final experiment are only
relevant to the overall research question if image registration on the embedded
platform is viable.
The viability of this approach is fundamentally a product of the research
questions in Section 1.2, and additionally we will explore the following, more
specific questions:
Therefore in this stage of the thesis we will implement the image regis-
tration algorithm chosen in Chapter 3 using Vivado HLS, set up the required
interfaces for interacting with the hardware via the PYNQ framework using
Vivado, and directly compare the speed of the hardware-acceleration block on
the Programmable Logic (PL) to the OpenCV implementation on the Process-
ing System (PS) of the board.
4.2 Background
4.2.1 FPGAs
The FPGA (Field Programmable Gate Array) is a re-programmable hardware
chip consisting of an array of programmable logic blocks with re-configurable
interconnects allowing the logic to be connected in many possible ways. In
contrast to ASICs, CPUs and GPUs whose internal structure cannot be changed,
an FPGA’s hardware architecture can be reprogrammed in field, providing cus-
tomisable hardware without lock-in to a specific design and with the benefits of
mass production. Having application-specific hardware allows data to be pro-
cessed in a manner designed specifically for the use-case, ‘speeding up com-
putation time by orders of magnitude’ [9] and allowing a much more power
efficient execution.
FPGAs are usually developed using a Hardware Description Language
(HDL), which allows hardware developers to define the wires, buses, calcu-
lations, memory usage, clock frequencies and so on of the internal hardware,
determining the behaviour of the system. The HDL is used to synthesize a
digital circuit to be implemented on the FPGA, which involves an optimisa-
tion process to determine how to utilise the resources of the FPGA in the best
possible way to reduce latency and chip area usage whilst respecting timing
constraints to ensure the desired result each time.
Vivado is a software tool produced by Xilinx for synthesis and analysis
of HDL designs, and is the tool used throughout the thesis for synthesising
the final designs. The hardware synthesis process is very time consuming,
with a typical design taking on the order of hours to compile, even on dedicated
servers. HDL is slower to implement solutions in than software programming
languages since it is used to define functionality at the Register Transfer Level
(RTL), with typical development times an order of magnitude higher than for
the equivalent function in software: weeks rather than days. Since errors incur
a high compilation cost, it is important to test for and remove any logical bugs
before running compilation and proceeding with integration testing.
[41]. However, these tools have yet to see mainstream use, as often the learn-
ing curves of the HLS tools themselves do not justify the time, which could be
better spent learning or writing HDL [9].
Vivado HLS ‘accelerates IP creation by enabling C, C++ and System C
specifications to be directly targeted into Xilinx programmable devices with-
out the need to manually create [RTL using a HDL]’ [42]. In theory, this
provides a huge increase to developer productivity, since designs can be im-
plemented more quickly, and the logical correctness of the implementation can
be tested using a software testbench, allowing the individual HLS components
to be tested and validated much more quickly prior to including the generated
blocks in the full design and testing the whole system.
4.3 Method
The function we will be implementing is a perspective warp, which requires
using the manual point matching tool created in Chapter 3 and solving the sys-
tem of equations generated by the point matches to obtain a homography which
maps between the two images. This only needs to be done once for each cam-
era setup, and so will be done in software using the getPerspectiveWarp
1 An IP block is the name for synthesised hardware blocks used in designs.
vado to incorporate this block into a PYNQ design, and leveraging the PYNQ
framework to be able to access the acceleration block from Python and per-
form a direct comparison of the Programmable Logic (PL) implementation
with the OpenCV implementation on the Processing System (PS).
Due to the delay in waiting for assistance with these is-
sues, it was more time-effective to implement our own perspective warp func-
tion since we would have a full understanding of its constituent parts. From
development to HDL synthesis was incredibly fast, taking just half a day. This
is a mixed result: not all built-in xfOpenCV functions work out-of-the-box, with
some requiring quite an in-depth understanding of their implementation and
configuration options, which can be difficult to find amongst the dense docu-
mentation and examples online which are quickly out-of-date due to the fast-
paced changes to these resources. However, enough of the base architecture is
there, such as copying to and from memory and accessing pixel values from a
matrix, that coding image processing algorithms in C/C++ which can be syn-
thesised to HDL is remarkably straightforward and can be faster than debug-
ging the out-of-the-box implementation, which makes up for any unexpected
issues with the provided xfOpenCV code.
In the case of xf_warp_perspective, we have since received an-
swers from the Xilinx team which deepened our understanding of the func-
tionality of their implementation, though not to the extent that we feel fully
confident using it. Our current understanding is that we had set the number
of pixels to be processed per clock cycle to 8 as opposed to 1, leading to 8
times as many resources being used, and setting this to 1 could resolve the
problem. This is a reasonable explanation since, as visible in Figure 4.1, di-
viding utilisation by 8 would reduce it to below 100%, solving the over-usage
of resources.
That said, even dividing by 8 results in on average more chip utilisation
than our hand-coded warp implementation. For this reason, and since this
information was gained after our hand-coded example was working, we did
not take time to explore whether making this change solves the issue, and what
the difference in utilisation and latency is compared to our implementation.
A pipelined function or loop can process new inputs every <N> clock
cycles, where <N> is the II of the loop or function. The default II for the
PIPELINE pragma is 1, which processes a new input every clock cycle. You
can also specify the initiation interval through the use of the II option for the
pragma.
Pipelining a loop allows the operations of the loop to be implemented in a
concurrent manner as shown in Figure 4.4 (B). (A) shows the default sequential
operation where there are 3 clock cycles between each input read (II=3), and
it requires 8 clock cycles before the last output write is performed.
Loop Unrolling
Loop unrolling signals to the compiler to remove all the loop overhead of the
logic, and instead generate hardware which directly represents a logical evalu-
ation of the loop. For example, instead of initialising a loop counter, and then
incrementing it and checking each time whether the loop is over, the logic for
the loop body is just copied 8 times in sequence. This reduces latency at the cost
of space; however, it is especially efficient for short loops of known length
that are executed repeatedly.
Intuitively, it makes sense to unroll the innermost loop which extracts and
applies the warping function to the 8 pixels represented by one 64-bit word,
since it is likely that the variable will arrive as one 64-bit word.
Loop Merging
Loop merging combines the loop logic for nested loops into a single loop to
reduce overall latency, increase resource sharing, and improve logic optimisa-
tion. Merging loops:
4.5 Interfaces
After developing C/C++ synthesisable implementations which can be com-
piled to HDL and optimised using programmer directives, the next challenge
is setting up the interfaces for the synthesised IP blocks in Vivado from the
Zynq board. Documentation is incredibly dense, and is more descriptive than
demonstrative with few working examples. Examples that do exist are often
outdated since the domain is undergoing rapid development.
Therefore, for a software engineer with little hardware experience, it is
difficult to grok the hardware interfaces used on a SoC: AXI, AXI-Stream,
the use of Direct Memory Access (DMA), whether to use on-chip RAM or Ultra-
RAM (URAM) or to copy directly from the shared memory, and how to interface
with each of these things. In this work, it was understood that the two simplest
interfaces to use are AXI and AXI-Stream.
controlled by the PS. Then, there is a second connection from a Master AXI
port on the HLS block to a HP Slave AXI port on the Zynq-7000. The HLS
block uses this connection for fast access to an internal shared memory be-
tween the PS and the PL, and must be Master so that it is in control of the data
it receives. Both input and output are transferred across these ports, requiring
allocated address space for both input and output, and this approach accesses
the memory directly and in a random-access manner. This method is called
memory-mapped AXI, and is chosen since the inverse warp requires random
access to the input image.
There is an opportunity cost to using memory-mapped protocols – for
known memory access patterns, it is possible to use AXI-Stream interfaces
instead. These stream in the data in a known order, reducing latency since
blocks of pixels can be transferred at once which reduces data transfer time.
Since the order of pixels is known it is also possible to implement pipelining
optimisations, because blocks can stream output pixels which can then start to
be used by subsequent blocks before the entire image is processed. However,
since only one block is used in this design as opposed to a chain of processing
blocks, and to save time, only memory-mapped AXI protocols were used even
though a streaming interface could have been used for the output.
The full process of connecting the HLS-generated IP block (the Warp IP)
to the Zynq-7000 PS is listed below:
1. Enable High Performance AXI slave and General Purpose Master ports
on the Zynq-7000 IP.
2. Connect the Warp IP AXI Master to Zynq-7000 HP AXI Slave port, and
Zynq-7000 GP AXI Master port to Warp IP AXI Slave port via AXI
Interconnect IPs.
3. Assign shared memory addresses using the Address Editor.
4. Connect the interrupt from the Warp IP to the interrupt port of the Zynq-
7000 IP using an AXI Interrupt Controller IP (AXI INTC).
After completing these steps, it is possible to synthesise the design into
a bitstream used to program the PL of the SoC, and then interface with the
hardware-accelerated functional block created in Vivado HLS via PYNQ.
file which is used to program the FPGA, and a .hwh file which contains a
description used by the PYNQ framework to obtain a user-friendly description
of the hardware.
Installing PYNQ simply involves flashing the latest image onto the SD
card of a PYNQ-compatible SoC. This installs an environment containing the
pynq Python package, which contains drivers and an interface for program-
ming the PL with the hardware Overlay created using Vivado.
The Overlay is loaded by passing the path to the location of the .bit
file, and returns a Python class. The Warp IP will be available as a class mem-
ber of the main class, with a RegisterMap member. RegisterMap con-
tains references to the hardware registers created and used by our design. For
the Warp IP, homography, input and output are available registers in the PL
to be set to addresses of physical memory locations of the warp homography
matrix, the input image and the output image respectively on the SoC. Addi-
tional control registers are included with the design, most notably ap_start
which signals the block to start execution when set to 1.
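A minimal sketch of driving such a block from Python is shown below, assuming a recent PYNQ release where pynq.allocate is available; the overlay file, IP name and register names are illustrative, since they depend on how the block and its ports were named in Vivado, but the pattern of allocating contiguous buffers, writing their physical addresses into the registers and setting ap_start follows the description above.

    import numpy as np
    from pynq import Overlay, allocate

    overlay = Overlay("warp.bit")     # the matching .hwh is picked up automatically
    warp_ip = overlay.warp_0          # hypothetical name of the Warp IP instance

    # Physically contiguous buffers reachable by the PL over its AXI master port.
    in_buf = allocate(shape=(512, 640), dtype=np.uint8)
    out_buf = allocate(shape=(512, 640), dtype=np.uint8)
    h_buf = allocate(shape=(3, 3), dtype=np.float32)

    ir_image = np.zeros((512, 640), dtype=np.uint8)  # placeholder input frame
    H = np.eye(3, dtype=np.float32)                  # placeholder homography
    in_buf[:] = ir_image
    h_buf[:] = H

    regs = warp_ip.register_map
    regs.input = in_buf.physical_address       # register names exposed by the HLS block
    regs.output = out_buf.physical_address
    regs.homography = h_buf.physical_address
    regs.CTRL.AP_START = 1                     # signal the block to start execution

    # ...poll for completion (or wait on the interrupt), then read the result:
    # warped = np.array(out_buf)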
In summary, the PYNQ framework makes it very easy to interact with IP
blocks created with Vivado HLS and connected to the Zynq-7000 board using
standard AXI interfaces, giving access to custom hardware acceleration from
a Python development environment.
4.7 Results
4.7.1 Development Process
The first major result of this stage is the demonstration that it is possible to,
with little to no experience with Vivado Electronic Design Automation (EDA)
tools, Vivado HLS or HDL, successfully write, compile, synthesise into hard-
ware and then run, from Python, a hardware-accelerated image warp in a very
short time frame.
Whilst we have no direct comparison between implementations of image
registration, from our experience, and as noted by others in the literature [18], using
HLS reduces the implementation overhead by at least an order of magnitude
(weeks to days, months to weeks).
outer loops which govern scanning the two dimensions of the output image,
and a pipeline across the merged loop, the max latency was reduced substan-
tially from 648,024 to 12,049, a reduction by a factor of over 50 (approximately
54). See Table 4.2 (without pragmas) and Table 4.3 (with pragmas) for the full
latency reports from Vivado HLS.
Table 4.2: Vivado HLS latency report without pragmas.
Latency (min)  Latency (max)  Interval (min)  Interval (max)  Type
609,624        648,024        607,202         645,602         dataflow

Table 4.3: Vivado HLS latency report with pragmas.
Latency (min)  Latency (max)  Interval (min)  Interval (max)  Type
12,049         12,049         9,627           9,627           dataflow
Data Augmentation
5.1 Introduction
The aim of this chapter of the thesis is to obtain a quantitative and qualitative
analysis of whether a network trained with RGB images can achieve higher
accuracies by integrating data from the IR spectrum.
Though it relies heavily on the work done so far, this is the most important
part of the thesis with regard to answering the main research question: How viable
is it to improve the accuracy of real-time embedded object detection by inte-
grating an IR camera to augment the RGB image? Successful results would
demonstrate a method to improve the accuracy of a deep object detection net-
work solely by fusing additional spectral data with the input data.
The most important images from this dataset will be at nighttime, contain-
ing pedestrians or vehicles visible in IR but not in the RGB images. However,
it will be useful to see whether image fusion adversely affects object detection
in other conditions, either due to poorly registered images, sub-optimal fusion
or the additional features from the IR spectrum obscuring RGB features. In
this case, it may be necessary for dual-spectra systems to disable augmenta-
tion during daytime or good lighting conditions. Another solution would be
to train the network further so that it can learn this distinction itself, or run
inference on both images sequentially, though this would increase latency.
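For illustration only, one simple augmentation strategy of this kind is a weighted blend of the RGB image with the registered IR image; this is a hypothetical sketch, not necessarily the fusion method used in this work.

    import cv2

    def augment_with_ir(rgb, ir, H, alpha=0.7):
        """Warp the IR frame into the RGB frame using homography H and
        blend the two; alpha controls how much RGB content is kept."""
        h, w = rgb.shape[:2]
        ir_warped = cv2.warpPerspective(ir, H, (w, h))
        # Replicate the single IR channel so it can be blended with 3-channel RGB.
        ir_3ch = cv2.cvtColor(ir_warped, cv2.COLOR_GRAY2BGR)
        return cv2.addWeighted(rgb, alpha, ir_3ch, 1.0 - alpha, 0)

Any such blend trades off exactly the effects discussed above: a lower alpha makes warm objects more prominent at night, but risks washing out RGB features in daylight.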
Hence we will perform a qualitative analysis on select image pairs in which
pedestrians or cars are recognisable in IR, but difficult to see in RGB. We will
obtain a quantitative measure of the performance of the network on data with
and without augmentation by calculating common accuracy metrics across the
entire datasets, and with the datasets split into daytime and nighttime images,
giving an objective view of the mean effect on accuracy of our approach.
5.2 Background
5.2.1 History
Deep ML (Machine Learning), often referred to simply as Deep Learning,
provides the state-of-the-art for object detection at the time of writing. Its
preeminence in computer vision emerged as a result of the annual competi-
tion held by ImageNet, the ImageNet Large Scale Visual Recognition Chal-
lenge (ILSVRC), which began in 2010. Deep Convolutional Neural Network
(CNN)s in the computer vision domain were beginning to become more popu-
lar in this period, with the use of GPUs for training being an enabler of this approach
[15]. In 2012, AlexNet won the competition with a top-5 error of 15.3%, com-
pared to the second place score of 26.2%.
Prior to this point, the victors of ILSVRC had been either much shal-
lower neural networks or methods based on human-engineered features like SIFT; how-
ever, this huge margin of victory triggered extensive research into the role of
GPUs in deep learning, and into deep learning itself. Since then, all winners
of the ImageNet competition, and the similar Common Objects in COntext
(COCO) detection challenge, have been deep networks, and Deep Neural Net-
work (DNN)s, specifically CNNs, now ‘perform better than or on par with
humans on good quality images’ [44].
Convolutional Layer
The convolutional layer played a key part in AlexNet, where the ‘immense
complexity of the object recognition task’ required models with ‘lots of prior
knowledge to compensate for data we don’t have’. Convolutional layers are
considered to ‘make strong and mostly correct assumptions about the nature
of images (namely, stationarity of statistics and locality of pixel dependen-
cies)’. This allows more complex relationships to be derived with ‘much fewer
connections and parameters’ [15].
Convolutional layers have two main adjustable hyperparameters: the size
of the convolutional kernels, or receptive field, and the number of filters to use,
or depth.
The receptive field is so called because each neuron in the convolutional
layer is only connected to a local region in the previous layer, so is only recep-
tive to changes in that sub-field4 , as opposed to fully-connected layers where
each neuron in the current layer is connected to all neurons in the previous
layer. Fewer connections means fewer calculations, and since image pixels
mostly exhibit local dependencies – i.e. pixels close together are more corre-
lated – the number of connections per neuron can be reduced substantially to
a much smaller receptive field without a loss of information.
The depth of a convolutional layer refers to the number of filters to train
for. Intuitively, this means the number of features since the filter responses
for convolving each filter with a region of the image are ultimately passed
through to the later parts of the network as inputs, and therefore should identify
unique and independent features, for example some sort of horizontal, curved
or vertical line, which in combination can be used to differentiate between
objects.
4 Receptive field instead of region, since the input usually has three dimensions: width,
height, and depth (which for an image is each RGB component), for example 256x256x3 for
a 256x256 RGB image. Therefore the receptive field is a 3D block: 5x5x3 for a 5x5 kernel.
One of the keys to the efficiency of convolutional layers is that they take
advantage of the stationarity of image statistics – the property that features, for
example a horizontal line in an image, do not depend on their spatial position
(the x, y coordinates of the line). This allows a parameter sharing scheme that
dramatically reduces the number of parameters. Instead of updating unique
weights across the entire 2D space of the image, neurons at the same depth
– i.e. the kernel weights at each depth slice – can be constrained to use the
same weights for the entire 2D space, hugely reducing the number of unique
parameters at each depth to the dimensionality of the receptive field5 .
Pooling Layer
Pooling layers are used to reduce the spatial size of the input by applying a
downsampling function across each 2D depth slice of the input, which reduces
parameters for future layers and helps to control overfitting. These layers have
two main hyperparameters, the receptive field, or kernel size, and stride length.
The downsampling function summarises data from the input volume, for
example reducing a 2x2 area in the input to the average or maximum value
in the area, reducing the stored information by 75%. In practice taking the
maximum has been found to be most effective, which is known as Max Pooling.
These are either used in-between convolutional layers to create a deeper
network but reduce dimensionality, or just before the final layers of the network
to sample the outputs into the desired shape.
Residual Block
Residual blocks are a solution to the vanishing gradient, or degradation, prob-
lem that occurs with deep neural networks. They function by skipping the
training of intermediate layers, allowing a simpler sub-model to train its weights
without going through the entire network, and then since earlier layers influ-
ence the deeper layers of the model, the gradients passed back during training are less sus-
ceptible to becoming too small and thus halting training. This allows deeper
networks to be trained.
5 For example, from [5x5x3] parameters for each input position in a [256x256] 2D space,
i.e. 5 × 5 × 3 × 256 × 256 = 4,915,200 parameters, down to 5 × 5 × 3 = 75 parameters per depth slice.
Fully-connected Layer
The fully-connected layer is the typical neuron layer in machine learning, de-
scribed previously and depicted in Figure 5.1. Each of the nodes is connected
to all of the input nodes, with weights for each of the connections. In a CNN,
there is usually one fully-connected layer at the end of the network either before
or after a pooling layer (to sample to the correct dimension), which generates
activations based on parameters learned from training each node across all the
final input features.
These activations are then passed to the output layer, which converts the
vector of activations into a probabilistic interpretation of the input weights, for
example by assigning a probability to each class label.
datasets. The test dataset is not “seen” by the network during the training
phase, and therefore is used to evaluate the trained network’s performance
on unseen data, which prevents rote-learning and tests the model’s ability to
generalise. The test data is fully annotated with ground-truth values, and the
predictions by the network are compared to the ground-truth in order to obtain
a metric for performance.
Top-1, Top-5, Top-X numbers are image classification metrics. Given an
image with one ‘main’ object, a Top-X score is the percentage of times the
correct classification is in the top X results of the classifier. For example,
Top-1 is the percentage of times the most likely prediction by the classifier
is the correct result, whereas Top-5 is the percentage of times the top 5 most
likely predictions by the classifier contain the correct result.
Object detection, where both the position and classification of all objects
in an image must be returned, requires a different metric. This is based on the
IoU (Intersection-over-Union) of the predicted bounding boxes and the ground-
truth values. As visible in Figure 5.3, the IoU is a value between 0 and 1,
defined as

    IoU = (area of overlap) / (area of union)

of the ground-truth bounding box and the predicted bounding box.
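A straightforward way to compute this for axis-aligned bounding boxes, given here as an illustrative sketch with boxes in (x1, y1, x2, y2) form:

    def iou(box_a, box_b):
        """Intersection-over-Union of two axis-aligned (x1, y1, x2, y2) boxes."""
        ix1 = max(box_a[0], box_b[0])
        iy1 = max(box_a[1], box_b[1])
        ix2 = min(box_a[2], box_b[2])
        iy2 = min(box_a[3], box_b[3])
        intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - intersection
        return intersection / union if union > 0 else 0.0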
The detection measures used in the COCO detection challenge are shown in
Figure 5.2. In this report we will be using Average Precision (AP), specifically
AP at IoU = 0.5, which was used in the PASCAL VOC challenges. The 0.5 in
the metric is the IoU value above which we consider a prediction a True Positive.
Since humans have a hard time visually differentiating IoU values of 0.5 and
0.75, and annotations were only provided for the IR data and so had to be
migrated to the RGB and augmented datasets, it was decided to make the
accuracy metric as broad as possible and use an IoU of 0.5.
Common evaluation metrics used for prediction models are Precision and
Recall. Precision measures the percentage of predictions which are correct:

    Precision = TP / (TP + FP)

whereas Recall measures the percentage of possible positive predictions that
were found:

    Recall = TP / (TP + FN)

Mean AP (mAP) is calculated by sampling the Precision/Recall graph, and the
measure usually used is the Area Under Curve (AUC).
5.2.6 YOLO
Since we are aiming for real-time computer vision, the network that we will use
to evaluate our IR augmented images will be the YOLO network [4], specifi-
cally YOLOv3, which is ‘more than 1000x faster than R-CNN and 100x faster than Fast R-CNN’ [4].
5.3 Method
5.3.1 Overview
In this section we describe the processing steps completed on 7,777 images
from the FLIR ADAS dataset presented in Section 3.2.3 for our analysis of the
hypothesis that:
Cleaning
First, since 499 images in the Training set and 109 images in the Validation set do not have
RGB counterpart images, we remove all images without counterparts.
The flip-side of this is that we will have bounding boxes for some objects in
IR-space which are not visible in RGB, which is exactly the desired outcome
for this testing. Ideally, the dataset would be fully annotated with all objects
visible in both RGB and IR, however it is easy to see why this is hard to ob-
tain. Firstly, the publishers of the dataset had no registration between the im-
ages. Secondly, annotating both sets of images is twice as time consuming
and costly. Thirdly, even if both sets of images are annotated, when mapped
together many annotations would be overlapping, duplicates or contradictory,
and it would require further processing to obtain ‘ground-truth’ containing all
annotations in both spectra.
from our annotation files to calculate the number of True Positive (TP)s, False
Negative (FN)s and False Positive (FP)s, with two different Intersection over
Union (IoU) values, 0.5 and 0.25.
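A sketch of how such counts can be produced, assuming a simple greedy matching in which each prediction is matched to at most one unmatched ground-truth box of the same class (the exact matching scheme used in this work may differ):

    def count_tp_fp_fn(predictions, ground_truth, iou_threshold=0.5):
        """Count TP, FP and FN for one image. Both arguments are lists of
        (class_name, (x1, y1, x2, y2)); iou() is the function defined earlier."""
        matched = [False] * len(ground_truth)
        tp = fp = 0
        for pred_cls, pred_box in predictions:
            best_iou, best_idx = 0.0, None
            for i, (gt_cls, gt_box) in enumerate(ground_truth):
                if matched[i] or gt_cls != pred_cls:
                    continue
                overlap = iou(pred_box, gt_box)
                if overlap > best_iou:
                    best_iou, best_idx = overlap, i
            if best_idx is not None and best_iou >= iou_threshold:
                matched[best_idx] = True
                tp += 1
            else:
                fp += 1
        fn = matched.count(False)
        return tp, fp, fn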
A manual review of the predictions showed that the predictions by the net-
work on an RGB image were very similar to the mapped annotations from the
annotated FLIR dataset, as can be seen in Figure 5.4. Inspecting the bound-
ing boxes in these figures further, we would expect 5 TP car predictions, with
4 FN cars missed, and 1 FN for the person. The actual result is close, with
just one of the car predictions not aligning closely enough to the annotation
bounding box to register as a TP. Reducing the IoU to 0.25 solves this issue
for this image. Therefore, whilst 0.5 is more standard in the literature, we repeated
the measurements at 0.25 due to the error introduced by mapping annotations
from the IR coordinates to the RGB image.
At the end of the experiment, we had obtained TP, FP, and FN counts for
each image, and an overall count for each dataset at each IoU value. We also
noted which images were taken at daytime or nighttime, so that the results
could be split into these categories.
Mean Average Precision (mAP) is calculated as the Area Under Curve
(AUC) of the Precision/Recall graph. In this work, we calculate the Precision
and Recall for each image, and then sort the results by Precision. Then we
accumulate TP, FP and FN to obtain an evolving value for Precision and Recall
as we parse the results for each image. In the end, we acquire a graph of
Precision/Recall which starts with high precision and low recall, and descends
to a lower precision and higher recall value, for each class and at each IoU
value.
Average Precision (AP) for each class is calculated by computing the area under this graph. At each block of width $r_{n+1} - r_n$, we multiply by the precision value at that index to get the block's area, and accumulate over the entire graph, as in Equation 5.1:

$$\mathrm{AP} = \sum_{n} (r_{n+1} - r_n)\, p(r_{n+1}) \qquad (5.1)$$
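Equation 5.1 translates directly into a short accumulation loop; the sketch below assumes the evolving recall and precision values have already been collected per image as described above.

```python
def average_precision(recalls, precisions):
    """Area under the precision/recall curve as in Equation 5.1.

    `recalls` and `precisions` are parallel lists of the evolving recall and
    precision values obtained while accumulating TP, FP and FN per image.
    """
    ap = 0.0
    for n in range(len(recalls) - 1):
        ap += (recalls[n + 1] - recalls[n]) * precisions[n + 1]
    return ap
```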
5.4 Results
5.4.1 Qualitative Analysis
For the qualitative analysis, we picked images from the 42 reviewed images which showed a large range of possible results. Figure C.1 shows an example where the mapping of annotations onto the RGB image is imperfect, and Figures C.2 to C.6 show further comparisons between the RGB, IR and augmented (AUG) versions of selected images.
Tables 5.1 and 5.2 are calculated across the entire dataset using IoU > 0.5
and IoU > 0.25 respectively. Meanwhile, Tables 5.3 and 5.4 are calculated for
the 4,891 daytime images, and Tables 5.5 and 5.6 across the 2,886 nighttime
images, both sets of tables calculated for IoU > 0.5 and IoU > 0.25 respectively.
The mAP (mean Average Precision) for each table is calculated by com-
puting a weighted average of the AP for each class, based on the number of
ground-truth instances of the class. The mAP for each result table is displayed
in Table 5.7.
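This weighting can be written as a short weighted average; in the sketch below the AP values are placeholders only, while the ground-truth counts are those listed in this chapter.

```python
def weighted_map(ap_per_class, gt_count_per_class):
    """mAP as the AP of each class weighted by its number of ground-truth instances."""
    total = sum(gt_count_per_class[c] for c in ap_per_class)
    return sum(ap_per_class[c] * gt_count_per_class[c] for c in ap_per_class) / total

gt_counts = {"person": 21119, "bicycle": 3693, "car": 37380, "dog": 207}
example_ap = {"person": 0.50, "bicycle": 0.40, "car": 0.60, "dog": 0.10}  # placeholders
print(weighted_map(example_ap, gt_counts))
```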
TP + FN should equal the number of positive examples in the dataset, and
a quick check confirms that TP + FN for each image and class is constant,
with ground-truth 21,119 people, 3,693 bicycles, 37,380 cars, and 207 dogs
in total. Since 7,777 images were used in testing, this means on average 2.7
people, 0.5 bicycles and 4.8 cars per image, which seems reasonable.
Using AP as a measure of performance of the network on each dataset,
we observe that when calculated over all images the performance on the RGB
dataset is superior for all classes except the person class, where it is outper-
formed by the AUG and IR datasets for IoU > 0.5, and only the IR dataset for
the less strict IoU > 0.25. This is reflected in the mAP, which is highest on the
RGB dataset, next highest on the AUG dataset and lowest on the IR dataset for
all IoU values.
For daytime images, the network performs better on the RGB dataset than
over all images or nighttime images, as expected. Even so, person detection is still better on the IR dataset than on the RGB dataset for IoU > 0.5; for IoU > 0.25 this is reversed, and the network performs significantly better on the RGB dataset, by 9.2% AP.
For nighttime images, the general trend reverses, with mAP for the IR
dataset the highest for both IoU > 0.5 and IoU > 0.25. For IoU > 0.5, the network performs better on the AUG dataset than on the RGB dataset, whilst for IoU > 0.25 it performs better on the RGB dataset than on the AUG dataset.
The IR and AUG datasets see much higher precision than the RGB dataset
for all classes.
Comparing IoU > 0.5 with IoU > 0.25, we see that the RGB dataset shows a proportionally higher rise in AP when the required IoU is reduced than the other datasets do. For the person class, the AP for RGB rises by 11.6%, compared to 6.5% for AUG and 4.2% for IR. For bicycle, it rises by 17.7% compared to 12.9% and 8.2%, for cars by 14.9% compared to 8.6% and 6.7%, and for dogs by 2.9% compared to 1.5% and 0.0%.
These results allow us to address our research questions with regard to the effect of augmenting RGB images with IR data on the inference performance of a deep neural network trained solely on RGB data.
Augmenting images with IR data has the potential to improve the inference
accuracy of a neural network trained on RGB data with no additional training.
Whilst the augmentation strategy used in this work results in poorer performance of the network on average, it does show improved pedestrian detection at night and higher precision in its predictions across the board when the augmented images are used.
At night, the network has considerably better performance when applied
to raw IR images than either corresponding RGB or AUG images. Whilst on
average performance is adversely affected, for the specific task of person detec-
tion, and in qualitative examples, performance is improved using the additional
modality.
Overall, considering the results attained, better performance can be ob-
tained by augmenting images with the additional data at night, although on
the FLIR dataset and with the image fusion method utilised in this paper, bet-
ter results would be achieved by switching entirely to the IR camera during the
night.
Only the object classes annotated for in the FLIR dataset are used in this study. A more specifically trained RGB network may see better results at night with IR; equally, it could
be that more generally trained networks learn features which generalise better
across the two spectra.
The Long-wave infrared (LWIR) camera used to capture the IR images in
the FLIR dataset is of very high quality, which makes the method used in this
work more viable since the capture resolution is similar to that of typical RGB
images used to train deep networks. For poorer quality IR cameras, it could
be possible to use the additional data in a different way which provides more
effective results. For example, prediction accuracy could be enhanced by ap-
plying Bayesian inference utilising the temperature information within a pre-
dicted object’s bounding box to respectively increase or decrease the prediction
probability for that object.
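As an illustration of this idea (not implemented or evaluated in this work), the sketch below uses a simple heuristic stand-in for such an update: the confidence of a detection is scaled up or down depending on the mean normalised IR intensity inside its bounding box. The threshold and scaling factor are arbitrary placeholder values.

```python
def adjust_confidence(confidence, ir_image, box, hot_threshold=0.6, factor=1.2):
    """Scale a detection's confidence according to how 'hot' its bounding box is.

    `ir_image` is a normalised (0..1) IR frame registered to the RGB image and
    `box` is (x1, y1, x2, y2); the threshold and factor are placeholder values.
    """
    x1, y1, x2, y2 = box
    patch = ir_image[y1:y2, x1:x2]
    if patch.size == 0:
        return confidence
    if float(patch.mean()) >= hot_threshold:
        return min(1.0, confidence * factor)   # warm region: boost the prediction
    return confidence / factor                 # cold region: penalise it
```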
Chapter 6
Conclusion
Combining the results from each of the three stages provides an answer to
whether it is viable to integrate an infrared camera into an existing embedded
computer vision design to improve the detection accuracy of a pre-trained RGB
network.
In short, the answer is yes, it is viable. The image registration method used
is effective and fast, and both easily implemented in hardware with HLS and
easy to swap in-place of a software implementation with the PYNQ frame-
work. The network performance on augmented images is more precise, and
achieves better person detection than on the base RGB images at night, al-
though at night using the raw IR image sees even better results than using the
augmented images. During the day, the plain RGB images are best to use,
although the extra IR information helps person detection in some cases.
To us, these results hint at a multitude of ways in which additional IR data
could be useful in embedded computer vision applications that are worth in-
vestigating. These range from simple to complex: using IR hot-point information to adjust predictions, simply switching to IR in poor lighting conditions,
or creating a lightweight, parallel branch of the RGB network solely for person
detection, which utilises both RGB and IR data.
Integrating IR is most useful for person detection; the results did not show
any improvement in the detection of bicycles, cars or dogs using the addi-
tional modality. However, since cars are usually well lit with headlights, bi-
cycles are only relevant when they are being ridden by a person, and dogs are
usually leashed to a person, this paper demonstrates an excellent method to
improve the effectiveness of a real-time embedded system exposed to objects
with identifiable IR profiles, such as people or animals, in poor visible lighting
conditions.
Bibliography
[10] Michael Bedford Taylor. “Bitcoin and the age of bespoke silicon”. In: 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES). IEEE, 2013, pp. 1–10.
[11] Jean Baptiste Su. Why Tesla Dropped Nvidia’s AI Platform For Self-Driving Cars And Built Its Own. Aug. 15, 2018. url: https://www.forbes.com/sites/jeanbaptiste/2018/08/15/why-tesla-dropped-nvidias-ai-platform-for-self-driving-cars-and-built-its-own/#72ae0cf67228 (visited on June 8, 2019).
[12] Zhiping Dan et al. “A Transfer Knowledge Framework for Object Recognition of Infrared Image”. In: Communications in Computer and Information Science 363 (Apr. 2013), pp. 209–214. doi: 10.1007/978-3-642-37149-3_25.
[13] Markus Jangblad. “Object Detection in Infrared Images using Deep Convolutional Neural Networks”. PhD thesis. Uppsala Universitet, 2018.
[14] Jionghui Jiang et al. “Multi-spectral RGB-NIR image classification using double-channel CNN”. In: IEEE Access PP (Jan. 2019), pp. 1–1. doi: 10.1109/ACCESS.2019.2896128.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., 2012, pp. 1097–1105. url: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[16] Fabián Inostroza et al. “Embedded registration of visible and infrared images in real time for noninvasive skin cancer screening”. In: Microprocessors and Microsystems 55 (2017), pp. 70–81. issn: 0141-9331. doi: 10.1016/j.micpro.2017.09.006. url: http://dx.doi.org/10.1016/j.micpro.2017.09.006.
[17] Jorge Hiraiwa and Hideharu Amano. “An FPGA implementation of reconfigurable real-time vision architecture”. In: Proceedings – 27th International Conference on Advanced Information Networking and Applications Workshops, WAINA 2013. 2013, pp. 150–155. isbn: 9780769549521. doi: 10.1109/WAINA.2013.131.
[46] Jonathan Hui. mAP (mean Average Precision) for Object Detection. Mar. 7, 2018. url: https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173 (visited on Apr. 28, 2019).
[47] Joseph Redmon and Ali Farhadi. “YOLOv3: An Incremental Improvement”. In: CoRR abs/1804.02767 (2018). arXiv: 1804.02767. url: http://arxiv.org/abs/1804.02767.
[48] Joseph Redmon and Ali Farhadi. “YOLO9000: Better, Faster, Stronger”. In: CoRR abs/1612.08242 (2016). arXiv: 1612.08242. url: http://arxiv.org/abs/1612.08242.
[49] Deng Yuan Huang, Ta Wei Lin, and Wu Chih Hu. “Automatic multilevel thresholding based on two-stage Otsu’s method with cluster determination by valley estimation”. In: International Journal of Innovative Computing, Information and Control 7.10 (2011), pp. 5631–5644. issn: 1349-4198.
[50] Michael K. Spencer. To LiDAR or tesla. May 9, 2019. url: https://medium.com/artificial-intelligence-network/to-lidar-or-to-tesla-5d3c2ab254c3 (visited on June 12, 2019).
Appendix A
TUL PYNQ-Z2 Product Brief

Product specification and photos of the TUL PYNQ-Z2 board (part number 1M4-M000127000).
Appendix B
Timing Analysis of Warping

Appendix C
Figure C.1: FLIR_04580.jpg – Note imperfect annotation mapping to RGB.
Figure C.2: FLIR_05828.jpg – Note IR is better than RGB and AUG.
Figure C.3: FLIR_07409.jpg – Note AUG better than both RGB and IR in this case.
Figure C.4: FLIR_07464.jpg – Note more car detections in AUG than in RGB or IR.
Figure C.5: FLIR_08175.jpg – Note IR better than AUG, AUG better than RGB.
Figure C.6: FLIR_06426.jpg – Note RGB better than AUG, AUG better than IR.