
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS

Improving embedded deep learning object detection by
integrating an infrared camera
Förbättra inbyggd djupinlärnings-objektigenkänning
genom integration av en infraröd kamera.

GEORGE PUNTER

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Improving embedded deep learning object detection by
integrating an infrared camera

GEORGE PUNTER

Master in Computer Science
Date: June 13, 2019
Supervisor: Mårten Björkman
Examiner: Danica Kragic Jensfelt
School of Electrical Engineering and Computer Science
Host company: BitSim AB
Company supervisor: Andrea Leopardi
Swedish title: Förbättra inbyggd djupinlärnings-objektigenkänning
genom integration av en infraröd kamera.

Abstract
Deep learning is the current state-of-the-art for computer vision applications.
FPGAs have the potential to fill the embedded deep learning niche, due to lower
development costs and faster development cycles than ASICs, and a smaller
size and power footprint than GPUs. Recent developments, such as HLS and
other frameworks that assist the development of deep learning on FPGAs, have
made FPGA development increasingly accessible. However, neural networks
deployed onto FPGAs suffer from reduced accuracy compared to their software
counterparts.
This thesis explores whether integrating an additional camera, namely long-
wave infrared, into an embedded computer vision system is a viable option
to improve inference accuracy in critical vision tasks, and is split into three
stages.
First, we explore image registration methods between RGB and infrared
images to find one suitable for embedded implementation, and conclude that
for a static camera setup, manually assigning point matches to obtain a warping
homography is the best route. Incrementally optimising this estimate or using
phase congruency features combined with a feature matching algorithm are
both promising avenues to pursue further.
We implement this perspective warping function on an FPGA using the Vi-
vado HLS workflow, concluding that, whilst not without limitations, the devel-
opment of computer vision functions in HLS is considerably faster than imple-
mentation in HDL. We note that the open-source PYNQ framework by Xilinx
is convenient for edge data processing, allowing drop-in access to hardware-
accelerated functions from Python, which opens up FPGA-accelerated data
processing to less hardware-centric developers and data scientists.
Finally, we analyse whether the additional IR data can improve the ob-
ject detection accuracy of a pre-trained RGB network by calculating accuracy
metrics with and without image augmentation across a dataset of 7,777
annotated image pairs. We conclude that detection accuracy, especially for
pedestrians and at night, can be significantly improved without requiring any
network retraining.
We demonstrate that, in terms of implementation overhead, integrating an IR
camera is a viable approach to improving the accuracy of deep learning vision
systems. Future work should explore other methods of integrating the IR
data, such as enhancing predictions by utilising hot-point information within
bounding boxes, applying transfer learning principles with a dataset of aug-
mented images, or improving the image registration and fusion stages.

Sammanfattning
Deep learning is the state-of-the-art technology for computer vision applications.
FPGAs are potentially useful for this, as they have lower development costs and
faster development cycles than ASICs; ASICs in turn have a smaller size and
power consumption than GPUs. With frameworks such as HLS, it has become
easier to use FPGAs. Unfortunately, neural networks running on FPGAs suffer
from reduced accuracy compared to their software counterparts.
This thesis explores whether it is possible to improve inference accuracy for
fundamental computer vision tasks by adding a long-wave IR camera to an
embedded computer vision system. The thesis is divided into three parts.
First, we explore image registration methods between RGB and IR images in
order to find one suitable for an embedded implementation. The conclusion is
that, for a static camera setup, it is most appropriate to manually assign point
matches to obtain a homography. Incrementally improving this estimate, or
using phase congruency features combined with a feature matching algorithm,
are two possible future improvements.
We implement a perspective warping function on an FPGA with the Vivado
HLS tools, and conclude that, although limited for computer vision, functions
in HLS are faster to develop than implementations in HDL. We observe that
the open-source PYNQ framework by Xilinx is convenient for edge data
processing, providing drop-in access to hardware-accelerated functions from
Python. This makes it possible for hardware novices to use FPGAs.
Finally, we analyse whether additional IR data can improve object detection
accuracy when using a pre-trained RGB network, by computing accuracy
metrics with and without image augmentation over a dataset of 7,777 annotated
image pairs. We find that detection accuracy can be improved without the need
for network retraining.
We show that integrating an IR camera is a sound way to improve the accuracy
of deep-learning-based computer vision systems, as it is manageable in terms
of implementation. Future research should focus on other methods of exploiting
the IR data. For example, predictions can be improved using hot-point
information within bounding boxes, transfer learning principles can be applied
with a dataset of augmented images, or the image registration and fusion
stages can be improved.
Acknowledgements

I would like to thank my supervisor Mårten Björkman, whose tireless efforts
simultaneously supervising 10 thesis students have been astounding. His ad-
vice is always spot-on and he has gone above and beyond all expectations with
the level of care and dedication he has shown towards his supervisees. A very
special mention goes to my thesis examiner, Danica Kragic, who, despite a
huge number of responsibilities, somehow still manages to respond ridicu-
lously quickly to my emails.
I am very grateful towards BitSim AB for the resources they provided, the
open-minded work environment, and the coffee – a truly instrumental trinity
behind this work. Special thanks go to my company supervisor Andrea Leop-
ardi who has always taken the time to support me whenever necessary.
I’d like to thank both Professor Alessandro Astolfi and Adrian Hawksworth
from Imperial College London who have both done an excellent job in han-
dling the year abroad process and whose presence has been reassuring through-
out any and all difficult moments over the past year.
This thesis is the end of a 4-year chapter of my life. A huge thank you to
Imperial alumnus Derek Kingsbury (CBE FREng) whose eponymous schol-
arship has supported me throughout my time at Imperial College London.
I’d like to thank my family and close friends and colleagues for helping
to make this possible: Martin Ferianc for his tremendous work ethic and re-
silience, the Tally for their boundless enthusiasm towards everything, both
Valentin Gourmet and Zoe Slattery for the endless shared hours writing up
together, Henrik Lagebrand for helping with my Swedish and the White Stag
for their constant emotional support.

I’d like to dedicate this work to my grandfathers: Francis George Punter and
Necdet ‘George’ Çilasun whose names I share and will always be a part of me.

Contents

1 Introduction
  1.1 Thesis Description
  1.2 Thesis Work
    1.2.1 Hypothesis
  1.3 Contribution
    1.3.1 Contribution Goals
    1.3.2 Societal Impact
    1.3.3 Ethical Considerations
  1.4 Report structure

2 Related Works
  2.1 CNNs on IR data
  2.2 FPGA Acceleration of Computer Vision
  2.3 Image Registration
    2.3.1 Global Methods
    2.3.2 Blob homography
    2.3.3 Feature based
    2.3.4 Incremental optimisation

3 Image Registration
  3.1 Introduction
  3.2 Background
    3.2.1 OpenCV
    3.2.2 Two-view Geometry
    3.2.3 FLIR ADAS Dataset
  3.3 Method
    3.3.1 SIFT and SURF
    3.3.2 Phase Features
    3.3.3 Line matching
    3.3.4 Point matching
    3.3.5 Semi-manual homography optimisation
  3.4 Image Registration Conclusion
    3.4.1 Future Work

4 FPGA Implementation
  4.1 Introduction
  4.2 Background
    4.2.1 FPGAs
    4.2.2 High-Level Synthesis (HLS)
    4.2.3 Zynq-7000 System-on-a-Chip (SoC)
    4.2.4 The PYNQ Framework
  4.3 Method
    4.3.1 Development Platform
  4.4 Vivado HLS
    4.4.1 Algorithm Implementation
    4.4.2 Testbench: C-simulation and co-simulation
    4.4.3 Programmer Directives
  4.5 Interfaces
    4.5.1 AXI Interfaces
  4.6 The PYNQ Framework
  4.7 Results
    4.7.1 Development Process
    4.7.2 HLS pragmas
    4.7.3 Timing Comparison
  4.8 FPGA Implementation Conclusion
    4.8.1 Future Work

5 Data Augmentation
  5.1 Introduction
  5.2 Background
    5.2.1 History
    5.2.2 Neural Network Primer
    5.2.3 Convolutional Neural Networks (CNNs)
    5.2.4 Data Augmentation
    5.2.5 Evaluating model performance
    5.2.6 YOLO
  5.3 Method
    5.3.1 Overview
    5.3.2 Dataset Processing Steps
    5.3.3 Qualitative Analysis
    5.3.4 Quantitative Analysis
  5.4 Results
    5.4.1 Qualitative Analysis
    5.4.2 Quantitative Analysis
  5.5 Data Augmentation Conclusion
    5.5.1 Future work

6 Conclusion

Bibliography

A TUL PYNQ-Z2 Product Brief
B Timing Analysis of Warping
C Detection on RGB/AUG/IR Images


Acronyms

ADAS Advanced Driver-Assistance Systems
AI Artificial Intelligence
AMBA Advanced Microcontroller Bus Architecture
ANN Artificial Neural Network
ASIC Application-Specific Integrated Circuit
AUC Area Under Curve
AUG Augmented RGB-IR Dataset
AXI Advanced eXtensible Interface
BRAM Block Random Access Memory
CNN Convolutional Neural Network
COCO Common Objects in COntext
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
CV Computer Vision
DMA Direct Memory Access
DNN Deep Neural Network
DSP Digital Signal Processing
EDA Electronic Design Automation
FAIR Facebook AI Research
FF Flip-Flop
FLIR FLIR is a thermal imaging camera company
FN False Negative
FP False Positive
FPGA Field Programmable Gate Array
FPS Frames Per Second
FSD Full Self-Driving
GP General Purpose
GPU Graphics Processing Unit
HDL Hardware Description Language
HLS High Level Synthesis
HP High Performance
II Initiation Interval
ILSVRC ImageNet Large Scale Visual Recognition Challenge
IoU Intersection over Union
IP Intellectual Property
IR infrared
LiDAR Light Detection and Ranging
LUT Look-Up Table
LWIR Long-wave infrared
mAP Mean Average Precision
ML Machine Learning
NCC Normalised Cross-Correlation
OpenCL Open Computing Language
PL Programmable Logic
PS Processing System
PYNQ Python productivity for Zynq
R-CNN Region-Convolutional Neural Network
RAM Random Access Memory
RANSAC Random sample consensus
REPL Read Evaluate Print Loop
RGB Red, Green, Blue
RTL Register Transfer Level
SIMD Single Instruction Multiple Data
SoC System-on-a-Chip
TP True Positive
TPU Tensor Processing Unit
TSMO Two-Stage Multithreshold Otsu
URAM Unified Random Access Memory
VHDL VHSIC Hardware Description Language
YOLO You Only Look Once


Chapter 1

Introduction

In the pursuit of Artificial Intelligence (AI), it is impossible to
overlook the importance of Computer Vision (CV). A true demonstration of
AI is not doing what humans cannot do, but doing the things that are uniquely
within the human domain, and then doing them better: communicating with
language, solving unseen problems, and seeing.
A reasonable definition of vision is the ability to detect, classify and locate
objects in a three-dimensional scene. Humans are far better at vision than
computers, and until recently the prospect of computer vision surpassing human
ability was unheard of. Using classical computer vision algorithms, the best
achievable result is an imperfect segmentation of the image into discrete areas,
for example using the Watershed algorithm [1], as in fig 1.1a. Note that this
segmentation is naive: it contains no semantic or position information.
The advancement in Machine Learning (ML) over the past decade has al-
lowed researchers to generate inference models which provide dramatic im-
provements in the state-of-the-art of computer vision. An implementation by
Facebook AI Research (FAIR) in 2017 produces the results shown in fig 1.1b,
where not only are objects segmented, but also the position and semantic in-
formation of each object is inferred. Nowadays, AI computer vision systems
can surpass human ability in terms of quality, but not yet speed [2].
Even implementations of classical computer vision techniques, for exam-
ple the Watershed algorithm, are a long way off real-time performance on
Central Processing Unit (CPU)s, achieving frame rates in the region of 1-10
Frames Per Second (FPS) [1]. The machine learning methods do not even
come close in terms of speed.
Figure 1.1: Deep learning impact on computer vision. (a) Before deep learning [1]; (b) with deep learning [3].

So how can car manufacturers even begin to think about self-driving cars
with these results? The key is in the hardware: the performance figures above
are on CPUs. Neural networks require extensive matrix multiplications,
something for which the video game industry has been developing custom
hardware for decades, in the form of the Graphics Processing Unit (GPU). CPUs are de-
signed for sequential operations, whilst huge computational performance gains
can be acquired by computing in parallel. Machine learning researchers ap-
propriated the GPU, exploiting its parallel computing capabilities to speed up
model training and inference on neural networks, and kick-starting the ML
boom we are now in the midst of. Running inference on a GPU can speed up
execution times by two orders of magnitude, which hugely improves usability.
More importantly, similar performance gains can be seen for the training stage
of neural networks. Since training a network for computer vision with a GPU
can take days, it is easy to see just how pivotal specialised hardware is, and
why ML did not take off until the incorporation of GPUs.
However, even with a GPU, the performance of neural networks is slow.
The original state-of-the-art, Region-Convolutional Neural Network (R-CNN),
takes approximately 47 seconds to process an image. Its successors, Fast R-
CNN and Faster R-CNN, are approaching real-time, cutting time down to 2.3s
and 0.2s respectively. However, this is still only 5 FPS. The GPU used is an
Nvidia Tesla M40 GPU, which has a footprint of 10.5x4.4in, costs over $1000,
and consumes 250W of power. To get closer to real-time performance, there
are 2 routes: either change the hardware or change the software.
You Only Look Once (YOLO) networks are designed for speed, only re-
quiring a single pass over the image and performing an order of magnitude
faster than the fastest R-CNN implementations on a GPU [4]. Whilst GPUs

are fine to use in a local desktop tower or via cloud computing services, edge
devices require a smaller size and power footprint, and real-time or critical ap-
plications do not have the luxury of relying on internet connectivity for both
latency and security reasons.
Bootstrapping hardware designed for one thing to use it for another is never
going to be efficient, which is why Google has started developing custom AI
Application-Specific Integrated Circuit (ASIC)s for machine learning, which
it has dubbed the Tensor Processing Unit (TPU). This allows much lower power
consumption per compute, and a smaller footprint, since the processor design
is completely optimised towards matrix multiplication-and-accumulation of
the kind performed in neural networks [5]. However, ASICs are very expensive
to design, and the field of machine learning is changing so rapidly that designs
could quickly become obsolete. Only for companies with a huge amount of
resources is this really a possibility, and even then most CV applications require
pre- and post-processing besides the ML inference.
Field Programmable Gate Array (FPGA)s are a middle ground between
GPUs and ASICs, allowing the development of custom hardware that is re-
programmable. If the architecture required for a cutting-edge AI implementa-
tion changes, it is possible to reprogram the architecture of the FPGA to suit the
change needed. Another benefit of FPGAs is that due to their closeness to the
hardware, it is easy to connect and interface additional sensors into the design
as befits the application, making it a good edge data acquisition and analysis
platform. With FPGAs, one can have custom hardware tailored to the entire
computer vision application, including pre- and post-processing steps that are
specific to the application domain, without the overhead of ASIC design and
manufacture.
FPGA design for computer vision applications requires a very broad set
of knowledge. FPGAs are programmed in a Hardware Description Language
(HDL), such as Verilog or VHSIC Hardware Description Language (VHDL),
requiring hardware engineers. The top machine learning tools are written in
C++ (tensorflow, numpy, scikit-learn, dlib), and computer vision tools in C++
or Python (OpenCV), with data scientists largely using Python due to its sim-
plicity, popularity for CV and ML, and support for Read Evaluate Print Loop
(REPL) making it easier to develop applications in Python than in C++ or
directly on an FPGA [6].
In short, development for an FPGA computer vision system requires ex-
tensive domain-specific knowledge in several fields: digital hardware design,
embedded systems, machine learning, and computer vision.
However, recent developments are allowing increased accessibility to this
paradigm. High Level Synthesis (HLS) tools allow the compilation of C++
into HDL, allowing software engineers to develop for FPGA without exten-
sive knowledge of HDL, and test algorithms within seconds rather than hours.
Xilinx has released open-source implementations of common OpenCV func-
tions in HLS-ready code in its xfOpenCV library1, to further reduce this burden.
System-on-a-Chip (SoC) boards like the Zynq-7000 series combine Process-
ing System (PS) and Programmable Logic (PL) on the same board, reducing
the complexity of interacting with PL, and the Python productivity for Zynq
(PYNQ) framework [7] further simplifies this. Finally, tools are being devel-
oped to streamline the implementation of neural network inference in FPGA
programmable logic, though often at a cost to accuracy [8].
These initiatives open up new possibilities for the rapid development of
custom hardware for computer vision, but are yet to mature or stabilise. PYNQ
was first released in 2016, and a 2015 analysis of the state of Vivado HLS2 for
image processing concluded that, whilst promising, it was not yet worth it, since
it required a steep, poorly documented learning curve whose time would be
'better spent learning [V]HDL' [9].

1.1 Thesis Description


This thesis explores the current state of Vivado HLS and PYNQ in the context
of implementing a hardware-accelerated system for the real-time augmenta-
tion of RGB images with data from an infrared (IR) camera. This is part of a
broader plan to investigate whether incorporating IR data at the inference stage
can increase the capabilities of deep learning networks trained on purely RGB
datasets, and the feasibility of implementing this in real-time on an FPGA.
The Principal, BitSim AB, is interested in developing embedded AI sys-
tems. Since the bulk of research into object detection is on RGB images, the
availability of both data and deep learning models for dual-spectral data is
limited. The Principal’s objective is to develop a robust embedded real-time
image registration platform for a boresighted pair of RGB and IR cameras.
Achieving this facilitates potential solutions which take advantage of the rela-
tive data wealth in RGB compared to IR data.
Firstly, the ability to map bounding box annotations between multispectral
image pairs will be used to aid dataset creation, with the possibility to automate
the annotation of IR images using a trained RGB network as ground-truth.
1 https://github.com/Xilinx/xfOpenCV
2 https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html

Secondly, an investigation into the extent to which existing, trained models
for object detection can be improved by augmenting images with additional
spectral data without the overhead of updating the trained parameters could
demonstrate an avenue to improve the accuracy of embedded AI vision sys-
tems, opening new business opportunities for the Principal.

1.2 Thesis Work


The thesis is split into three stages.
Firstly, an investigation into image registration methods described in lit-
erature in order to pick a suitable method for registering RGB and IR data in
hardware. Promising examples from literature will be explored using Python
and OpenCV, and then the method used for the next sections of the report will
be chosen according to the following criteria:
1. Real-time applicability i.e. processing speed.
2. Registration quality.
3. Suitability for hardware implementation.
Secondly, the implementation of multimodal image registration in pro-
grammable logic, and a summary of the state of Vivado HLS and the PYNQ
platform in 2019. The questions to be answered are:
1. How complex is it to implement a custom computer vision algorithm in
C/C++ which compiles to HDL using Vivado HLS?
2. What is the speed-up compared to software implementations?
3. What is the implementation overhead? In terms of: development time,
space requirement, and latency.
Finally, performing an analysis of the effect of augmenting RGB images
with IR data on pre-trained deep neural network inference accuracy.
1. How does this affect inference performance?
2. In what conditions is it most useful?
3. Are there other benefits to integrating an IR camera?
The overarching question to be answered in this work is:
How viable is it to improve the accuracy of real-time embedded
object detection by integrating an IR camera to augment the RGB
image?

1.2.1 Hypothesis
Our hypothesis is that:

Augmenting RGB images with IR data can improve the inference
performance of a deep neural network trained on RGB data.

Therefore if integrating an IR camera into an FPGA design is viable, it follows
that:

Embedded implementations of deep neural networks for object
detection can benefit from integrating an IR camera.

1.3 Contribution
The contribution goals of this thesis revolve around two concepts. Firstly,
demonstrating the practical utility of FPGAs in the field of computer vision
given industry efforts to improve developer productivity. Secondly, demon-
strating the usefulness of infrared cameras for computer vision systems.

1.3.1 Contribution Goals


1. Analysis of registration techniques for multimodal images and a hard-
ware implementation.
2. Evaluation and demonstration of the current state of Vivado HLS tools.
3. Demonstration of FPGA hardware-accelerated function calls in Python.
4. Analysis of the effect of image fusion on deep learning network perfor-
mance.
5. Exploration of the viability of utilising infrared data in deep learning without
requiring an extensive dataset for the new data modality.

1.3.2 Societal Impact


Custom computing, i.e. generating custom hardware for a particular task, is
a highly specialised field, requiring a broad knowledge base and large teams of
people. Thus, its cost is prohibitive unless the application is worth it, for ex-
ample in recent years the custom-compute ASICs for Bitcoin mining in 2013
[10], the TPU by Google in 2015 [5], and the Full Self-Driving (FSD) com-
puter announced by Tesla this year3 [11]. The ability to generate HDL blocks
from C/C++ and interface with custom hardware accelerators via Python opens
up this domain to less financed entities, potentially lessening the technologi-
cal monopolies held by large corporations who can afford to develop custom,
proprietary ASICs to edge out competition.
Utilising data from an additional modality to augment RGB images opens
up further capabilities of computer vision-based systems. In the case of IR,
the additional data could be used to analyse temperatures, improve detec-
tion of warm-bodied objects, and improve visibility in the dark. Applica-
tions of this could be monitoring plant, animal or human health, search and
rescue equipment, and improving pedestrian detection for Advanced Driver-
Assistance Systems (ADAS) and automated driving applications, using either
classical or deep learning computer vision techniques.
Image registration across different sensor modalities, in particular between
infrared and visible spectra, is useful in a variety of fields, from evaluating bi-
ological health, to surveillance or search and rescue systems. Hardware imple-
mentations of multimodal image registration provide a smaller size and power
footprint whilst improving real-time responsiveness. This extends the usabil-
ity of the aforementioned applications, for example extending the flight time
of a search and rescue drone or converting a medical device from a desktop to
a hand-held device.
Deep learning is the current state-of-the-art for object detection, and
embedded implementations provide better real-time applicability, lower
power and size profiles, and lower cost. The inference accuracies of FPGA
implementations of deep learning are usually worse than software-based coun-
terparts, since network weights are quantised, and the complexity of the model
itself reduced to ease implementation [8]. Hence, in place of added network
complexity, it is possible that augmenting images with additional data could
improve object detection or segmentation accuracy. Real-time object detection
is especially sought after for autonomous vehicles and other ADAS; therefore,
exploring ways to improve object detection in safety-critical computer vision
systems could ultimately improve automotive safety.

1.3.3 Ethical Considerations


To consider the ethical implications it is necessary to imagine the fullest suc-
cess of the investigations in this thesis.
3 Link to announcement: https://youtu.be/Ucp0TTmvqOE?t=4310

Firstly, the demonstration of the ease of use of HLS tools. Then, the im-
plementation of a hardware-based platform with image registration between
RGB and IR camera data streams. Finally, improved dual spectra datasets
for deep learning, and improved real-time embedded deep learning using IR-
augmented images.
The biggest ethical concern would be the use of improved RGB-IR vi-
sion systems in military technology, or by governments in surveillance which
breaches personal rights. RGB-IR image fusion is useful for weapons target-
ing systems, since the heat information can be used as a life-sign indicator and
aid military personnel in identifying hostile targets, especially during the night.
More intelligent systems implementing some kind of auto-targeting function-
ality using AI are even more ethically dubious, since the confirmation that the
target is hostile and not a civilian is left to the system. In any case, the
morality of warfare is always questionable.
Thus, this technology could be adapted for unethical applications; however,
this is not the intention behind the research, and the author is strongly against
the use of this work for those purposes.

1.4 Report structure


The report is divided into six chapters and the work has been split into three
parts, Chapters 3, 4 and 5, in order to preserve continuity within each part.
Background, Method and Conclusion will be presented for each part individ-
ually, prior to an overarching Conclusion for the thesis work as a whole.
In Chapter 2, Related Works, the findings from reading papers around the
topics of real-time image registration, image fusion, and FPGA design for
computer vision are presented.
In Chapter 3, Image Registration, we will cover the work done in order to
choose the image registration method used in the subsequent stages.
In Chapter 4, FPGA Implementation, we will implement image registra-
tion on an FPGA as a feasibility study, and analyse the state of current tools
and software that make hardware development easier for software engineers.
In Chapter 5, Data Augmentation, we will analyse the effect of augmenting
input images with IR data on the object recognition inference of a deep neu-
ral network and discuss other potential strategies for utilising the additional
camera data.
In Chapter 6, Conclusion, we will discuss the implications of the work
done as a whole and summarise the conclusions drawn from this work.
Chapter 2

Related Works

The inspiration for this work comes from Convolutional Neural Network Quan-
tisation for Accelerating Inference in Visual Embedded Systems [8], where it is
noted that quantised Convolutional Neural Network (CNN) inference on em-
bedded systems suffers from reduced accuracy compared to software-based
counterparts.
Since the aim of this work is to evaluate whether this reduced accuracy can
be in-part alleviated by integrating an IR camera into an embedded computer
vision system, relevant works are therefore applications of CNNs to IR data,
FPGA acceleration for computer vision, and methods for image registration.

2.1 CNNs on IR data


To the best of our knowledge, no research has been done into directly apply-
ing an RGB-trained neural network to augmented RGB-IR data, for desktop or
embedded applications. Instead, research tends to be focused on either trans-
fer learning, as in A Transfer Knowledge Framework for Object Recognition
of Infrared Image [12], or training networks designed for RGB images with
IR data, as in Object Detection in Infrared Images using Deep Convolutional
Neural Networks [13], or training networks with both spectra simultaneously,
as in Multi-spectral RGB-NIR image classification using double-channel CNN
[14]. The downside to these approaches is that annotated datasets in RGB are
far more extensive and, even so, 'the immense complexity of the object recog-
nition task means that this problem cannot be specified even by a dataset as
large as ImageNet' [15], so object detectors trained solely on IR data, or
on the even rarer registered dual-spectra image data, are hampered
by a lack of data. Our approach differs from these in that it aims to combine
RGB and IR data in such a way as to minimise any additional training overhead,
maintaining the benefits of data wealth in the RGB domain and minimising
implementation overhead, but also to leverage the additional spectrum data in
cases where the visible spectrum is less useful such as at night or in fog.
Research into deep learning is hugely active but still relatively new, and so,
whilst none were found at the time of writing, we expect papers with titles such as
Common deep learning object detection features between RGB and IR images
or Combining RGB and Infrared Images for Robust Object Detection using
Convolutional Neural Networks to be released in the near future. Researchers
are just beginning to understand how deep networks really work, and these
insights will be highly beneficial not just in the field of deep learning. This
thesis shows there are enough similarities between features in RGB and IR
images that a deep network can successfully identify the same objects in RGB,
IR, and RGB-IR augmented images despite having been trained on a solely
RGB dataset. Future work based on this concept could provide a more concrete
understanding of the relationship between these two spectra, with implications
on our understanding of optics.

2.2 FPGA Acceleration of Computer Vision


Whilst there exist descriptions of embedded implementations of image regis-
tration in literature [16], we believed the cost of a full HDL implementation of
image registration was too high to justify in our application, and instead opted
to use HLS.
Evaluating Vivado High-Level Synthesis on OpenCV Functions for the
Zynq-7000 FPGA [9] (2015) is one of the most up-to-date evaluations of Vi-
vado HLS, and provides a good description of its benefits and flaws, conclud-
ing that it was at the time ‘too unreliable and require[d] too much effort to
learn to make use of... The time spent wrestling with the IDE [...] is better
spent on learning [V]HDL’. This work brought our attention to prior works on
Vivado HLS, which are summarised below.
A comparison between manually written HDL and Vivado HLS for a real-
time video processing architecture [17] (2013) found that ‘while the manually
coded HDL took approximately 15 days to complete, the Vivado HLS code
took only 3 days’ with no significant difference in latency, but a ‘cost of about
3-4 times higher FPGA resource usage’. Vivado HLS is used in [18] (2013)
to quantify the benefits of hardware loop unrolling on three different Digital
Pre-Distortion solutions, where it is found that ‘total latency [...] showed close
to 5 times lower across the board’ using the accelerated Vivado HLS imple-
mentation over the software implementation [9].

2.3 Image Registration


RGB-IR image registration is an active field of research, and the bulk of the
reading for this paper was focused on finding a suitable image registration
algorithm for an embedded implementation.
The goal was to find a method for automatic image registration between
RGB and IR images, ideally following an incremental approach so that, as long
as the algorithm converges, iterations can be performed in real-time on a
video stream. Methods involving deep learning were not considered, since the
overhead is too large for what is to be an embedded pre-processing step for a
deep learning application.
Infrared and visible image fusion methods and applications: A survey [19],
is an extremely comprehensive and up-to-date review of current research into
infrared and visible image registration, and an invaluable resource for tracking
down relevant papers to review.
The methods mentioned in this section are the most promising subset from
the papers reviewed, and the choice of method to use for this paper will be
discussed in more depth in Section 3.3.

2.3.1 Global Methods


Gaussian fields
Three papers were found describing methods that use Gaussian fields to regis-
ter non-homographic images; these seem incredibly accurate, but are too com-
putationally intensive, taking seconds to compute rather than the sub-second
real-time performance our application requires [20][21][22].

Fourier-based methods
The Fourier-Mellin transform, such as described in [23], can provide fast and
robust image registration, but ‘the main drawback of the FMT approach is
that it is only applicable to register images linked through a transformation
limited to translation, rotation and scale change’ [23]. Other papers using FFT
described similar limitations [24][25].

2.3.2 Blob homography


In [26] and [27], the authors implement a blob-based homography approxi-
mation to overcome parallax errors. Background subtraction is used to reduce
computational cost, and the algorithm estimates the blob-to-blob homogra-
phy matrices to achieve pixel-level registration based on the disparity between
each blob pair from the two cameras. Unfortunately, these solutions require a
video stream, which restricts their applicability since they cannot be used on
individual image pairs, however are good examples of real-time registration.

2.3.3 Feature based


The solution presented in [16] is an embedded hardware implementation of
image registration that works in real-time, and is thus a very good reference. The
method implements feature point matching between images, estimating trans-
formation parameters recursively on-line with the video, which is the high
level structure we were looking for. This implementation uses a marker, how-
ever, which is not appropriate for our application. It functions by performing
gradient descent on an error calculated from the differences when trying to
overlap corners found in both images on the marker. The results in [28] are
another example of extremely good results that require the use of a square
marker.
The solution described in [29] ‘uses the wavelet transform modulus maxi-
mum algorithm for edge detection, makes use of SURF to detect feature points
on the edges, applies two-stage matching method including rough matching
and accurate matching and hence carries out the image registration of infrared
and visible images with high accuracy and stability’. However, this does not
seem suitable for real-time image registration, and the authors note that ‘con-
sidering that current registration of multi-source images requires real-time ap-
plications.... further research should focus on improving the computing speed
of the registration’.
In [30] the authors employ a method which analyses sliding correspon-
dence windows in the colour reference image, and finds their appropriate match
in the thermal image using maximisation of mutual information. This requires
a foreground extraction step on both images, and pre-calibrated and rectified
image pairs.
Phase Congruency was used for feature descriptor extraction in [31], which
is based on work by [32] on the advantages of using phase congruency for
detecting features instead of the more standard feature detectors, such as Sobel
or Canny line detection filters. The images in this case were rectified prior to
feature matching, and it was shown that phase congruency descriptors proved
more effective than the traditional SIFT/SURF approach which do not work
well on RGB-IR image pairs.
The line-based registration algorithm by [33], where lines are used as fea-
tures and matched between images, seems an extremely relevant and effective
way to align RGB and IR images, since lines are the most common attribute
shared between the two spectra.

2.3.4 Incremental optimisation


Edge-based RGB-IR image fusion by [34] is based on an initial estimate of
the homography with the assumption that parallax effects are negligible. The
initial homography is obtained by selecting point matches between two images
with the cameras set-up as they will be in practice, and solving the constraints
imposed by the point matches to obtain a homography that maps between the
two images. This estimate is then optimised over a cost function based on a
Normalised Cross-Correlation (NCC) over edge detected maps at varied res-
olutions. This implementation is very similar to [28], which uses a modified
Sobel operator and varies the resolution to improve efficiency and achieves
very good results. This seems to be an effective method: a one-time manual
calibration gives the initial homography, which is then further optimised so
that it better maps between the two images. This offloads the bulk of the
challenge to a reliable human operator, and then incrementally improves on
the result in the field.
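To make this edge-based cost concrete, the sketch below shows one way a multi-scale NCC-over-edge-maps objective could be written with OpenCV. It is an assumption of the general idea described in [34] and [28] rather than code from those papers; the Canny thresholds and scales are placeholders.

```python
import cv2
import numpy as np

def ncc(a, b, eps=1e-8):
    """Normalised cross-correlation between two equally sized arrays."""
    a = a.astype(np.float32) - a.mean()
    b = b.astype(np.float32) - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def edge_alignment_cost(H, rgb_gray, ir_gray, scales=(1.0, 0.5, 0.25)):
    """Negative multi-scale NCC between edge maps of the RGB image and the
    IR image warped by homography H; lower is better for a minimiser."""
    h, w = rgb_gray.shape
    ir_warped = cv2.warpPerspective(ir_gray, H, (w, h))
    total = 0.0
    for s in scales:
        size = (int(w * s), int(h * s))
        edges_rgb = cv2.Canny(cv2.resize(rgb_gray, size), 50, 150)
        edges_ir = cv2.Canny(cv2.resize(ir_warped, size), 50, 150)
        total += ncc(edges_rgb, edges_ir)
    return -total / len(scales)
```

An optimiser can then refine the eight free entries of H around the manual estimate by minimising this cost.
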
Chapter 3

Image Registration

3.1 Introduction
The overall aim of this thesis is to evaluate the viability of integrating an IR
camera into an image acquisition platform as a means to improve real-time
object detection. A crucial part of using this additional data is the image reg-
istration step, which aligns the two images.
This step must be done in real-time, and should utilise minimal resources
of the FPGA since most of the Programmable Logic (PL) will be used by the
neural network implementation. Ideally this process would be automatic, to
reduce the need for manual calibration.
Most methods for multi-spectra image registration are developed for desk-
top applications [20][22][25][28][31][33][34]. These methods are not neces-
sarily suitable for real-time or embedded implementation, either due to com-
putational overhead or the difficulty of re-implementation in hardware.
The aim of this stage of the thesis is to evaluate possible methods for image
registration from Section 2.3, Related Works, and choose a suitable method in
terms of processing speed (can it work in real-time?), registration quality (how
well are the two images aligned?), and suitability for hardware implementation
(can it be implemented in hardware within a short time-frame?).
First we will cover background information on the software used and rele-
vant computer vision theory for image registration in Section 3.2, Background.
In Section 3.3, Method, we describe our method and the work done to imple-
ment registration methods from Section 2.3, and discuss the outcomes of those
experiments. We summarise our thoughts in Section 3.4, Conclusion, finalis-
ing our chosen method for Chapters 4 and 5 of the thesis.


3.2 Background
3.2.1 OpenCV
Operations on images are highly parallelisable, and since the number of pix-
els in high-quality images is approaching the same order of magnitude as the
clock speed of modern CPUs, it is important to use Single Instruction Multiple
Data (SIMD) instructions and parallelised architectures wherever possible for
real-time performance. OpenCV is one of the fastest, most mature, and most
popular open-source image processing libraries. The library has ‘more than
2500 optimized algorithms’, and ‘leans mostly towards real-time vision ap-
plications’. It ‘takes advantage of MMX and SSE instructions when available’
and ‘full-featured CUDA and OpenCL interfaces are being actively developed’
for GPU execution, giving us one of the most efficient tools in software for in-
teracting with digital images [35].
OpenCV is used extensively in this project: it is written in C++, and has
bindings for Python, meaning the function calls in Python are executed at the
speed of statically-compiled and optimised C++, which is typically at least an
order of magnitude faster than Python code. Since Python is used for interfac-
ing with hardware on the PYNQ framework, OpenCV allows us to compare
side-by-side our custom hardware implementation on PL with the equivalent,
optimised C++ function calls on the PS of the system. Since it is open-source,
it benefits from an entire community of bug fixes and improvements, as well
as adaptations for specific hardware platforms.

xfOpenCV
A huge benefit from using OpenCV in this stage of the thesis is that Xilinx has
released an open-source library named xfOpenCV which ‘is a set of 50+ ker-
nels, optimized for Xilinx FPGAs and SoCs, based on the OpenCV computer
vision library’ 1 .
This means that using OpenCV functions in this stage of the thesis could
result in not needing to write their hardware equivalents if they are already
implemented in xfOpenCV. At the very least, the Xilinx library is a good reference
for writing OpenCV-like functionality in HLS code for an FPGA.
1 https://github.com/Xilinx/xfopencv

3.2.2 Two-view Geometry


The theory behind image registration comes from epipolar geometry, a topic
covered extremely well in Multiple View Geometry for Computer Vision by
Richard Hartley and Andrew Zisserman [36]. In this section we will sum-
marise the most important parts for image registration.
‘The epipolar geometry is the intrinsic projective geometry between two
views. It is independent of scene structure, and only depends on the cameras’
internal parameters and relative pose. The fundamental matrix F encapsulates
this geometry. It is a 3 × 3 matrix of rank 2. If a point in 3-space X is imaged
as x in the first view, and x′ in the second, then the image points satisfy the
relation x′ᵀF x = 0' [36]. This does not, however, give a point-to-point cor-
respondence between the two images: F x maps the point x to an epipolar line l′
in the second image.
By referencing a plane, it is possible to use two-view geometry to map
points between images. ‘It is said that the plane induces a homography be-
tween the views. The homography map transfers points from one view to the
other as if they were images of points on the plane’ [36]. This point-to-point
transformation matrix, H, can be computed from point matches between im-
ages, which are used to solve for the nine entries of H (eight degrees of freedom,
since H is defined only up to scale); several algorithms to do so are provided in
the book and subsequently implemented in OpenCV. The downside to this
method is that it introduces a plane-induced parallax.
Ideally, one would need a homography for each image plane present in the
two images in order to obtain accurate image registrations for all regions in
the image, which is the idea behind blob-based homography image registration
methods. The extent of the parallax effect is influenced by the distance between
the two cameras, and the distance to the scene. If the distance to the scene with
respect to the distance between the cameras is increased, the parallax induced
by mapping based on a single plane is reduced, and the argument that ‘the
parallax effect becomes negligible (i.e. sub-pixel) when the distance exceeds
a certain value’ [34] is used in literature.

3.2.3 FLIR ADAS Dataset


The FLIR ADAS dataset was used throughout this work. The dataset specifi-
cation2 is reproduced in Table 3.1. It consists of time-synced RGB-IR image
pairs which are not registered, and thus provides a good platform for testing
image registration algorithms, as well as data for the experimentation in
Chapter 5, Data Augmentation.

2 Found at https://www.flir.com/oem/adas/adas-dataset-form/

3.3 Method
An optimistic goal of this section is to implement a robust image registra-
tion algorithm between RGB and IR images, without requiring manually-input
point matches, and running in real-time.
Out of the algorithms reviewed in Related Works, Section 2.3, several
methods such as the Gaussian field methods were discounted due to complex-
ity [20][21][22]. FFT and FMT based methods were discounted due to their
limitations in application.
We gathered that for completely automatic image registration, without hu-
man assistance, the standard method is to use SIFT or SURF feature descrip-
tors on pre-rectified images to detect feature points to match between the im-
ages, and use these matches to generate homographies. The following three
algorithms proved promising for solving the image registration problem:

1. An improved SURF implementation using phase features, which would
provide an unassisted image registration [32] [31].
2. Homography calculations acquired from matching similar lines between
images [37] [33].
3. Incremental Optimisation algorithms using a manually obtained homo-
graphic estimation [34] [16].

3.3.1 SIFT and SURF


SIFT (Scale-Invariant Feature Transform) and SURF (Speeded Up Robust
Features) are the go-to algorithms in computer vision for creating feature de-
scriptors to use when matching features between images, and thus for image
registration, where they are combined with a robust estimator such as Random
sample consensus (RANSAC) to find homographic mappings between images.
The difficulty of using SIFT and SURF feature descriptors to acquire point
matches between RGB-IR image pairs has been observed in previous research
[38]; see, for example, Figure 3.1: the majority of feature point matches,
indicated by the blue lines, do not correspond to the same points in the scene
at all.

Table 3.1: FLIR ADAS Dataset Specification

Content: Synced annotated thermal imagery and non-annotated RGB imagery
for reference. Camera centerlines approximately 2 inches apart and collimated
to minimize parallax.

Images: >14K total images, with >10K from short video segments and random
image samples, plus >4K BONUS images from a 140 second video.

Image Capture Refresh Rate: Recorded at 30 Hz. Dataset sequences sampled at
2 frames/sec or 1 frame/sec. Video annotations were performed at 30 frames/sec
recording.

Frame Annotation Label Totals: 10,228 total frames and 9,214 frames with
bounding boxes. 1. Person (28,151) 2. Car (46,692) 3. Bicycle (4,457)
4. Dog (240) 5. Other Vehicle (2,228)

Video Annotation Label Totals: 4,224 total frames and 4,183 frames with
bounding boxes. 1. Person (21,965) 2. Car (14,013) 3. Bicycle (1,205)
4. Dog (0) 5. Other Vehicle (540)

Driving Conditions: Day (60%) and night (40%) driving on Santa Barbara, CA
area streets and highways during November to May, with clear to overcast
weather.

Capture Camera Specifications: IR: FLIR Tau2 640x512, 13mm f/1.0
(HFOV 45°, VFOV 37°). RGB: FLIR BlackFly (BFS-U3-51S5C-C) 1280x1024,
Computar 4-8mm f/1.4-16 megapixel lens (FOV set to match Tau2).

Dataset File Format: 1. Thermal - 14-bit TIFF (no AGC) 2. Thermal - 8-bit
JPEG (AGC applied) without bounding boxes embedded in images 3. Thermal -
8-bit JPEG (AGC applied) with bounding boxes embedded in images for viewing
purposes 4. RGB - 8-bit JPEG 5. Annotations: JSON (MSCOCO format)

Sample Results: mAP score of 0.587 (50% IoU) was obtained by fine tuning
RefineDetect512 with this dataset and testing using a holdout validation set.
Details further explained in readme.

Figure 3.1: SIFT and SURF between RGB and IR images [38]

A preliminary implementation of these algorithms on the FLIR ADAS


dataset confirmed that SIFT or SURF-based features between RGB and IR
images are too dissimilar for feature matching methods. This is believed to
be due to structural differences between the two spectra, for example the reverse
gradient problem: in the IR spectrum edges of foreground objects are usually
bright going to darker, since more infrared radiation is received from closer
objects, whilst in the visible spectrum object boundaries are the opposite.
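For reference, the pipeline tested here was along the lines of the sketch below, where file names and thresholds are illustrative and SIFT_create assumes OpenCV 4.4 or newer (older builds need the contrib package). On RGB-IR pairs the ratio test leaves few, mostly incorrect matches, so the estimated homography is unusable.

```python
import cv2
import numpy as np

rgb = cv2.imread("rgb.jpg", cv2.IMREAD_GRAYSCALE)
ir = cv2.imread("ir.jpg", cv2.IMREAD_GRAYSCALE)

# SIFT keypoints and descriptors in both spectra.
sift = cv2.SIFT_create()
kp_rgb, des_rgb = sift.detectAndCompute(rgb, None)
kp_ir, des_ir = sift.detectAndCompute(ir, None)

# Brute-force matching with Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = []
for pair in matcher.knnMatch(des_rgb, des_ir, k=2):
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])

if len(good) >= 4:
    src = np.float32([kp_rgb[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_ir[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    # RANSAC rejects outliers, but with cross-spectral matches almost every
    # candidate is an outlier, so H is rarely meaningful here.
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
```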

3.3.2 Phase Features


An incredibly interesting paper by Peter Kovesi [32], further investigated in the
context of RGB-IR image registration by Mouats [31], proposes phase features
as a more robust and less parameter dependent method for feature extraction,
particularly between multispectral images, where it was asserted that using
phase features as opposed to traditional feature detectors such as Canny or
Sobel edge detection would provide more similar features between the two
different spectra.
Kovesi’s method extends phase congruency theory from one dimension
to two dimensions, allowing it to be applied to images, and shows ‘how phase
congruency can be calculated from Gabor wavelets – geometrically scaled spa-
tial filters in quadrature’ [32]. The results from implementing and applying
this method in Python seem promising. As visible in Figure 3.2, the edges
extracted from both the RGB (top) and IR (bottom) images can be seen to be
extremely similar, despite the differences in spectral modality. However, our
implementation was slow and memory intensive, leading us to conclude that a
real-time implementation would be difficult within our time-frame. Applying
a feature matching algorithm on top of this would be even further detrimental
to execution time and viability.
Whilst this algorithm is not suitable for this thesis, it is worth noting a cou-
ple of things. Firstly, the structure of the algorithm is well suited to hardware
optimisation: it uses a ‘bank of filters to analyse the signal’ [32]. Convolving
an input signal, or image, with a bank of filters is a typical use case for FPGAs,
since filters can be applied to streaming data, and stacked into a data pro-
cessing pipeline. Secondly, a qualitative review of the output finds that this
method produces features that are more similar between RGB and IR images
than traditional feature detectors such as Sobel or Canny edge detectors, with
no adjustable threshold parameter.
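A minimal sketch of that filter-bank structure is given below: it convolves the image with a small bank of Gabor filters in quadrature (even and odd phase) over several orientations and scales, and forms a crude phase-congruency-style measure as the ratio of the magnitude of the summed responses to the summed amplitudes. It omits Kovesi's log-Gabor filters, noise compensation and frequency-spread weighting, so it should be read as an illustration of the idea rather than a reimplementation of [32], and, like our implementation, it is slow in software.

```python
import cv2
import numpy as np

def simple_phase_congruency(gray, nscale=4, norient=6, base_lambda=4.0, mult=2.0):
    """Crude phase-congruency-style edge measure from a quadrature Gabor bank.
    Values near 1 mark points where responses are in phase across scales."""
    img = gray.astype(np.float32)
    pc = np.zeros_like(img)
    for o in range(norient):
        theta = o * np.pi / norient
        sum_even = np.zeros_like(img)
        sum_odd = np.zeros_like(img)
        sum_amp = np.zeros_like(img)
        for s in range(nscale):
            lambd = base_lambda * (mult ** s)
            ksize = int(6 * lambd) | 1  # odd kernel size covering the wavelength
            even_k = cv2.getGaborKernel((ksize, ksize), 0.5 * lambd, theta,
                                        lambd, 0.75, 0, ktype=cv2.CV_32F)
            odd_k = cv2.getGaborKernel((ksize, ksize), 0.5 * lambd, theta,
                                       lambd, 0.75, np.pi / 2, ktype=cv2.CV_32F)
            even = cv2.filter2D(img, cv2.CV_32F, even_k)
            odd = cv2.filter2D(img, cv2.CV_32F, odd_k)
            sum_even += even
            sum_odd += odd
            sum_amp += np.sqrt(even ** 2 + odd ** 2)
        # Energy of the summed response over total amplitude: close to 1 when
        # all scales agree in phase, close to 0 when phases are scattered.
        pc += np.sqrt(sum_even ** 2 + sum_odd ** 2) / (sum_amp + 1e-6)
    return pc / norient

rgb_pc = simple_phase_congruency(cv2.imread("rgb.jpg", cv2.IMREAD_GRAYSCALE))
ir_pc = simple_phase_congruency(cv2.imread("ir.jpg", cv2.IMREAD_GRAYSCALE))
```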

3.3.3 Line matching


Since a couple of papers [33] and [37] saw some success by seeking to match
lines between images instead of points in order to calculate the homographies,
we attempted this method.
Using a pair of images from the FLIR ADAS dataset, we detected lines us-
ing the Hough Line Transform method, which has an implementation in both
OpenCV and xfOpenCV. Using ‘brightness-based and geometric-based im-
age parameters’, it was possible to extract many similar lines in both images
[33]. The scheme used to match lines involved ranking line similarity using a
correlation with the line and the edge map of the image, and the angle of the
line. This produced good line matches between the two images; however, it did
not provide enough information to perform any kind of effective image regis-
tration, since the lines which matched tended to already be similarly aligned
in both images. This meant that the homographies that were calculated from
the line matches did not produce an image warp that significantly changed the
target image enough to produce an alignment between the two images.
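
For illustration, the line extraction step can be reproduced with OpenCV's
probabilistic Hough transform applied to a Canny edge map; the thresholds
below are illustrative rather than the values tuned for the FLIR images.

    import cv2
    import numpy as np

    def detect_lines(gray):
        edges = cv2.Canny(gray, 50, 150)
        # Probabilistic Hough transform: returns segments as (x1, y1, x2, y2)
        lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                                threshold=60, minLineLength=40, maxLineGap=10)
        if lines is None:
            return edges, [], []
        segments = [tuple(l) for l in lines[:, 0]]
        # Line angle, one of the geometric parameters used when ranking matches
        angles = [np.arctan2(y2 - y1, x2 - x1) for x1, y1, x2, y2 in segments]
        return edges, segments, angles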

3.3.4 Point matching


Manually assigning point matches is not an ideal solution. However, if the
relative position between the two cameras is fixed, the distance to the scene is
sufficiently large and the cameras are collimated as far as possible, it is the
simplest and most effective way to obtain a homography which maps one camera
image onto the other, as well as being a prerequisite step for the semi-manual
methods described in the next sub-section.

Figure 3.2: Phase features
We wrote a tool using Python and the Python library Matplotlib to allow
a quick way for a user to assign matching points between two images, and
save these to a .json file. Then, we manually annotated images from the FLIR
ADAS dataset in order to test this method.
We found that the homography generated from manual point matches be-
tween one pair of RGB-IR images produced a good mapping between the two
spectra. Furthermore, the homography from one pair could also be used for
all other image pairs captured using the same setup with decent results, as will
be seen in Chapter 5, Data Augmentation.
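
A minimal sketch of this calibration step is shown below, assuming the .json
file written by our tool stores two equal-length point lists under the keys
"rgb" and "ir" (the key names and file name are illustrative).

    import json
    import cv2
    import numpy as np

    def homography_from_matches(json_path):
        with open(json_path) as f:
            matches = json.load(f)
        src = np.array(matches["ir"], dtype=np.float32)   # points clicked in the IR image
        dst = np.array(matches["rgb"], dtype=np.float32)  # corresponding points in the RGB image
        # RANSAC is optional for hand-picked points, but it tolerates the odd misclick
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        return H

    # Warp the IR image into the RGB frame using the calibrated homography:
    # H = homography_from_matches("flir_pair_0001.json")
    # registered = cv2.warpPerspective(ir, H, (rgb.shape[1], rgb.shape[0]))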

3.3.5 Semi-manual homography optimisation


Inspired by [34], research was done into an implementation of a global opti-
misation technique which, after estimating a homography with point matches,
further refines this estimate by overlapping the edge maps at a number of dif-
ferent scales simultaneously and calculating a Normalised Cross-Correlation
(NCC) of the edge maps to use as a metric for optimisation. The reported results
are good, and the paper suggests that the optimisation algorithm they used
could be applied iteratively for real-time use. Unfortunately, the
paper only presented the results, and the algorithm used to robustly converge
the mapping after the initial estimate was not published.
An attempt was made to apply a global optimisation to the initial homog-
raphy estimate in order to improve the mapping. Since we could not formulate
the cross-correlation in such a way to provide a gradient to optimise over, the
method presented in Global optimization of Lipschitz functions [39], which
can optimise an arbitrary, non-differentiable function, was a possible solution.
The algorithm is presented as a method to find global optima for the hyperpa-
rameters of neural networks, which can be over 10 dimensions, and optimise
over all of these. Since the homography needs to be solved for 8 free vari-
ables, it seemed a good match. Davis King provides an implementation of the
algorithm in dlib3, a popular machine learning library for C++ which also has
Python bindings [40]. Using various types of correlation measures between
the two images (NCC, histogram similarity, cosine similarity) as the loss func-
tion, we found that the mappings did not converge to visibly good results. We
3 http://dlib.net

believe that the similarity measures used rewarded mapping dense feature ar-
eas together too much, for example sacrificing a homography which provided
a good global image alignment in order to maximise the overlapping areas of
tree leaves, which were often a particularly feature-rich region of the images
used in testing. As in [34], it may be necessary to optimise over multiple im-
age scales. Another possible solution would be to split the image into regions,
assigning each region the same weight in the final calculation. In this way,
the influence of particularly feature-dense regions on the skew of the
homography is limited.
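
For completeness, a sketch of the optimisation loop we experimented with is
given below. It uses dlib's find_min_global as the derivative-free optimiser
and the NCC of Canny edge maps as the (negated) objective; the search bounds,
edge thresholds and loss formulation are illustrative rather than the exact
values used.

    import cv2
    import dlib
    import numpy as np

    def edge_ncc(rgb_edges, ir, H):
        warped = cv2.warpPerspective(ir, H, rgb_edges.shape[::-1])
        ir_edges = cv2.Canny(warped, 50, 150).astype(np.float32)
        a = rgb_edges - rgb_edges.mean()
        b = ir_edges - ir_edges.mean()
        return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def refine_homography(H0, rgb, ir, calls=300, delta=0.05):
        rgb_edges = cv2.Canny(rgb, 50, 150).astype(np.float32)
        h0 = H0.flatten()[:8]  # 8 free parameters, H[2, 2] fixed to 1

        def loss(h00, h01, h02, h10, h11, h12, h20, h21):
            H = np.array([[h00, h01, h02], [h10, h11, h12], [h20, h21, 1.0]])
            return -edge_ncc(rgb_edges, ir, H)  # maximise NCC by minimising its negative

        lower = list(h0 - delta * np.abs(h0) - 1e-3)   # search in a band around the estimate
        upper = list(h0 + delta * np.abs(h0) + 1e-3)
        best, _ = dlib.find_min_global(loss, lower, upper, calls)
        return np.array(list(best) + [1.0]).reshape(3, 3)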

3.4 Image Registration Conclusion


In the Method, Section 3.3, we reviewed five methods for image registration.
The fundamental similarity between all the algorithms presented was finding
some way to calculate a homographic mapping between the two images. The
algorithms reviewed were: using SIFT/SURF feature detectors with a feature
matching algorithm, using phase congruency features as alternative feature
descriptors, extracting and matching lines using a Hough Line Transform, and
manually assigning point matches with and without a post-optimisation step.
In the context of this work, sufficiently good results were achieved using man-
ual point matching to generate a homography between a pair of images; the
same homography proved adequate even when applied to image pairs of different
scenes acquired from the same platform.
Since creating point matches and calculating the homography parameters
can be done offline in software as a calibration step, the embedded imple-
mentation of this method is fast and simple, equivalent to applying a matrix
multiplication at each output pixel. Of the reviewed algorithms, this is the
simplest, and it sufficiently satisfies the aims set out in Section 1.2:

1. This method is clearly the fastest since the homography is calculated


offline, and therefore the most real-time applicable of those reviewed.
2. The registration quality is the best out of all the methods, although better
registration quality could theoretically be achieved at the cost of the pro-
cessing speed and implementation overhead of including an incremental
optimisation step.
3. Again, since the homography calculation is offline in software, this is
the simplest method to implement in hardware.

3.4.1 Future Work


In terms of completely automatic image registration between RGB and IR im-
ages, we believe that using phase congruency to generate feature descriptors
as described in [32] is an avenue that should be explored further. Future work
on this could be writing robust, optimised implementations for phase feature
computation in order to provide an accurate like-to-like comparison between
phase features and other feature descriptors such as SIFT and SURF. If proven
a viable, if not superior, alternative, we expect a hardware-accelerated
implementation of phase features would deliver a substantial speed benefit at
relatively low development cost.
For the semi-manual methods, we believe it would be fruitful to investi-
gate further down the route of implementing a robust iterative optimisation of
homography estimations, as described in [34]. This approach offers the best of
both worlds: a simple and effective initial homography estimate given by
hand-matched points, with further refinement at the point of deployment. If
such a method were to be developed as a standalone post-processing step to a
manually obtained point match homography, it could be added to any image
registration algorithm with minimal cost to improve the result.
Chapter 4

FPGA Implementation

4.1 Introduction
Implementing image registration on embedded hardware is an important stage
of this thesis since the conclusions drawn from the final experiment are only
relevant to the overall research question if image registration on the embedded
platform is viable.
The viability of this approach is fundamentally a product of the research
questions in Section 1.2, and additionally we will explore the following, more
specific questions:

1. How reliable are Xilinx’s ported OpenCV functions (xfOpenCV)?


2. How easy is it to include IP blocks generated by Vivado HLS in Vivado
projects?
3. How easy is it to interface with hardware using PYNQ?

Therefore in this stage of the thesis we will implement the image regis-
tration algorithm chosen in Chapter 3 using Vivado HLS, set up the required
interfaces for interacting with the hardware via the PYNQ framework using
Vivado, and directly compare the speed of the hardware-acceleration block on
the Programmable Logic (PL) to the OpenCV implementation on the Process-
ing System (PS) of the board.


4.2 Background
4.2.1 FPGAs
The FPGA (Field Programmable Gate Array) is a re-programmable hardware
chip consisting of an array of programmable logic blocks with re-configurable
interconnects allowing the logic to be connected in many possible ways. In
contrast to ASICs, CPUs and GPUs, whose internal structure cannot be changed,
an FPGA’s hardware architecture can be reprogrammed in the field, providing
customisable hardware without lock-in to a specific design and with the benefits of
mass production. Having application-specific hardware allows data to be pro-
cessed in a manner designed specifically for the use-case, ‘speeding up com-
putation time by orders of magnitude’ [9] and allowing a much more power
efficient execution.
FPGAs are usually developed using a Hardware Description Language
(HDL), which allows hardware developers to define the wires, buses, calcu-
lations, memory usage, clock frequencies and so on of the internal hardware,
determining the behaviour of the system. The HDL is used to synthesize a
digital circuit to be implemented on the FPGA, which involves an optimisa-
tion process to determine how to utilise the resources of the FPGA in the best
possible way to reduce latency and chip area usage whilst respecting timing
constraints to ensure the desired result each time.
Vivado is a software tool produced by Xilinx for synthesis and analysis
of HDL designs, and is the tool used throughout the thesis for synthesising
the final designs. The hardware synthesis process is very time consuming,
with a typical design taking several hours to compile even on dedicated
servers. Implementing solutions in HDL is also slower than in software
programming languages, since functionality is defined at the Register Transfer
Level (RTL); typical development time is an order of magnitude higher than
for the equivalent function in software – weeks rather than days. Because
errors incur a high compilation cost, it is especially important to ensure
correctness prior to compilation: any logical bugs should be tested for and
removed before running synthesis and proceeding with integration testing.

4.2.2 High-Level Synthesis (HLS)


HLS has been a ‘holy-grail’ of FPGA design for over 20 years, with many tools
promising C-to-HDL compilation being developed over the past few years

[41]. However, these tools have yet to see mainstream use, as the learning
curves of the HLS tools themselves often do not justify the time that could be
better spent learning or writing HDL [9].
Vivado HLS ‘accelerates IP creation by enabling C, C++ and System C
specifications to be directly targeted into Xilinx programmable devices with-
out the need to manually create [RTL using a HDL]’ [42]. In theory, this
provides a huge increase to developer productivity, since designs can be im-
plemented more quickly, and the logical correctness of the implementation can
be tested using a software testbench, allowing the individual HLS components
to be tested and validated much more quickly prior to including the generated
blocks in the full design and testing the whole system.

4.2.3 Zynq-7000 System-on-a-Chip (SoC)


The advances we have seen in computing over the past few decades have pri-
marily focused on homogeneous computing: maximising economies of scale
by designing computing components to be as similar and general purpose as
possible, which allowed consumer CPUs to progress from millisecond-scale to
sub-nanosecond clock periods within 30 years. However, in recent years this
progress has stagnated and computing has begun to diversify, in particular
towards heterogeneous computing, a paradigm where a processing
unit is made up of distinct components (a CPU for general purpose, a GPU
for graphics, a TPU for AI, an FPGA for custom Digital Signal Processing
(DSP)), each specialised for its use domain.
All the major FPGA vendors have their own version of FPGA System-on-
a-Chip (SoC)s, which contain both a CPU and FPGA in tandem. Xilinx has the
Zynq-7000 series and MPSoC, Intel simply calls theirs SoCs, and Microchip
have SmartFusion SoCs. This setup allows very fast interaction between Pro-
cessing System (PS) and Programmable Logic (PL), and allows the FPGA and
CPU to share an external memory across a relatively fast memory interface,
compared to interfacing with off-chip memory. This type of solution allows
conventional software to interleave with hardware-accelerated function calls,
combining the flexibility of a CPU with the speed of dedicated hardware ac-
celeration.
The board used in this thesis is the TUL PYNQ-Z2 [43] (See Appendix
A), which contains the Xilinx Zynq-7020 SoC. The SoC has a Dual ARM
Cortex-A9 CPU with 4 High Performance AXI ports, and PL with 13,300
logic slices, each with four 6-input Look-Up Tables (LUTs) and eight Flip-Flops
(FFs), and 220 dedicated DSP slices (specifically DSP48E), with 630KB of fast

Block Random Access Memory (BRAM).

Figure 4.1: PYNQ Overview [7]

4.2.4 The PYNQ Framework


‘PYNQ (Python productivity for Zynq) is an open-source [software] project
from Xilinx® that makes it easy to design embedded systems with Xilinx
Zynq® SoCs.’ The ‘PYNQ image is a bootable Linux image [for Zynq SoCs]’
which ‘includes the pynq Python package, and other open-source packages’
[7]. PYNQ provides a Python interface to the hardware on the Zynq board, and
‘PYNQ enabled Zynq board[s] can be easily programmed in Jupyter Notebook
using Python’[7]. See Figure 4.1 to get an idea of how PYNQ allows users
to interact with Programmable Logic (PL) on the board, and Figure 4.2 for
the typical development process for hardware-accelerated Intellectual Prop-
erty (IP) blocks1 .

4.3 Method
The function we will be implementing is a perspective warp, which requires
using the manual point matching tool created in Chapter 3 and solving the sys-
tem of equations generated by the point matches to obtain a homography which
maps between the two images. This only needs to be done once for each cam-
era setup, and so will be done in software using the findHomography
1 An IP block is the name for synthesised hardware blocks used in designs.

Figure 4.2: Hardware development options for PYNQ

function from OpenCV. The homography acquired is a 3×3 matrix describing


a pixel-to-pixel relation between the two images.
There are two main approaches to performing an image warp using the
homography matrix. The first method is the forward warp, iterating over the
input image and directly applying the homography to each pixel to acquire the
coordinates of the warped pixel. This method is the most intuitive, but can
result in ‘holes’ in the output image, since not all pixels in the output image
may be produced by the rounded multiplication of the input image with the
homography matrix. The second method is to iterate over the would-be output
image, and acquire the corresponding pixel coordinate in the input image by
applying the inverse of the homography matrix and using that value. This
ensures that every pixel in the output image has a value from the input image,
since we iterate over the entire output explicitly obtaining a value for every
pixel.
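
The inverse mapping can be expressed in a few lines of NumPy as a software
reference for the hardware implementation. The sketch below assumes a
single-channel image, nearest-neighbour sampling and no boundary
interpolation (out-of-range source coordinates are simply left at zero).

    import numpy as np

    def inverse_warp(src, H, out_shape):
        H_inv = np.linalg.inv(H)
        out = np.zeros(out_shape, dtype=src.dtype)
        rows, cols = out_shape
        for y in range(rows):
            for x in range(cols):
                # Map the output pixel back into the input image
                u, v, w = H_inv @ np.array([x, y, 1.0])
                sx, sy = int(round(u / w)), int(round(v / w))
                if 0 <= sx < src.shape[1] and 0 <= sy < src.shape[0]:
                    out[y, x] = src[sy, sx]  # every output pixel gets a value, no holes
        return out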

4.3.1 Development Platform


The work on this thesis will be carried out on the PYNQ-Z2 board from TUL
[43], which ‘based on [the] Xilinx Zynq SoC, is designed for the Xilinx Uni-
versity Program to support PYNQ’. The board will be booted with an SD card
flashed with the latest PYNQ image (v2.4 at the time of use).
We will be using Vivado HLS to generate HDL from C/C++ code, Vi-

vado to incorporate this block into a PYNQ design, and leveraging the PYNQ
framework to be able to access the acceleration block from Python and per-
form a direct comparison of the Programmable Logic (PL) implementation
with the OpenCV implementation on the Processing System (PS).

4.4 Vivado HLS


The first stage of the implementation is to generate an IP block which performs
the image registration algorithm chosen in Chapter 3. To do this, we will be
using Vivado HLS to compile C/C++ code into HDL.
The design methodology for working with Vivado HLS is the following:
1. Implement algorithm in HLS-compatible C/C++.
2. Write a testbench using special adaptor functions to ensure compatibil-
ity between hardware and software function call.
3. Correct the implementation until it passes the testbench.
4. Attempt to synthesise C/C++ code into hardware.
5. Correct synthesis errors and adjust parameters and HLS pragma calls to
optimise latency and/or size trade-offs.
6. Check co-simulation of the generated RTL block passes the testbench.
7. Export IP block into the main Vivado design and connect it using stan-
dard hardware interfaces.
8. Update block interface definitions in HLS to suit the required interfaces
in Vivado.
In practice, steps 3, 4 and 5 are done in parallel – there is no point continu-
ing development on an algorithm if you already know it is completely incom-
patible with RTL synthesis.

4.4.1 Algorithm Implementation


There are two functions in the xfOpenCV library which implement the func-
tionality required to perform a perspective warp: xf_warpperspective, and
xf_warp_transform, which can be configured to perform either an affine
or a perspective warp. The xfOpenCV library is developed specifically for
SDAccel or SDSoC Development Environments, which provide a full system
design view and therefore have more complete knowledge of the hardware plat-
form than standalone Vivado HLS. Thus, whilst the xfOpenCV functions are

designed to be plug-and-play for SDAccel and SDSoC environments, there are


limited examples documenting their use in standalone Vivado HLS, and in some
cases the functions may utilise resources that Vivado HLS is unaware of or
that are not present on the target board.
Whilst the software testbench for our implementation using
xf_warpperspective passed, we encountered several difficulties trying
to synthesise the design into RTL. After resolving several issues which pre-
vented compilation due to incorrectly understanding function arguments and
their types, we were finally able to synthesise the design into HDL without
errors. However, due to configuration problems, the resultant design utilised
more than 100% of the FPGA logic (see Table 4.1) and was therefore not
usable on our FPGA.
Name BRAM_18K DSP48E FF LUT
Total 290 781 53,651 57,083
Available 280 220 106,400 53,200
Utilization (%) 103 355 50 107

Table 4.1: Utilisation estimates for xfOpenCV function

At this point, we tried xf_warp_transform. This function would


not synthesise due to it attempting to use a type of Random Access Memory
(RAM) not present on the FPGA chip we were synthesising for, regardless of
whether the setting USE_URAM was set to 0 or 1. However, the software test-
bench did not pass either, with the test output displaying noticeable artefacts
which were believed to be due to the function accessing input in fixed-size
blocks as opposed to all at once. These can be seen in Figure 4.3.

Figure 4.3: Artifacts introduced by streaming access

(a) Output from warp_transform (b) Desired output from OpenCV

Due to the delay in waiting for assistance with these issues, it was more
time-effective to implement our own perspective warp function, since we would
then have a full understanding of its constituent parts. From development to
HDL synthesis took just half a day. This is a mixed result: not all built-in
xfOpenCV functions work out-of-the-box, with some requiring quite an in-depth
understanding of their implementation and configuration options, which can be
difficult to find amongst the dense docu-
mentation and examples online which are quickly out-of-date due to the fast-
paced changes to these resources. However, enough of the base architecture is
there, such as copying to and from memory and accessing pixel values from a
matrix, that coding image processing algorithms in C/C++ which can be syn-
thesised to HDL is remarkably straightforward and can be faster than debug-
ging the out-of-the-box implementation, which makes up for any unexpected
issues with the provided xfOpenCV code.
In the case of xf_warpperspective, we have since received an-
swers from the Xilinx team which deepened our understanding of the func-
tionality of their implementation, though not to the extent that we feel fully
confident using it. Our current understanding is that we had set the number
of pixels to be processed per clock cycle to 8 as opposed to 1, leading to 8
times as many resources being used, and that setting this to 1 could resolve the
problem. This is a reasonable explanation since, as visible in Table 4.1, di-
viding utilisation by 8 would reduce it to below 100%, solving the over-usage
of resources.
That said, even dividing by 8 results in on average more chip utilisation
than our hand-coded warp implementation. For this reason, and since this
information was gained after our hand-coded example was working, we did
not take time to explore whether making this change solves the issue, and what
the difference in utilisation and latency is compared to our implementation.

4.4.2 Testbench: C-simulation and co-simulation


Vivado HLS gives the user access to powerful testing functionality, provid-
ing a software-based C-simulation, which simulates the functionality of the
function under test, and co-simulation, which simulates the generated hard-
ware definition in software. These simulations allow an error detection step
much earlier on in the development cycle than in traditional hardware devel-
opment, increasing productivity. Xilinx provides HLS libraries which contain
adaptor functions for use in a testbench. For example, OpenCV expects a
cv::Mat, whilst the hardware implementation requires the xfOpenCV equiv-
alent, xf::Mat. xf::Mat contains the member functions copyFrom and
copyTo to convert matrix data to and from the cv::Mat format, which are
useful precisely in this testbench context.


This allows a direct comparison between the output of the OpenCV warp
function and our hardware-compatible implementation of the same algorithm.
In the testbench we perform a diff between the two images, accumulating the
difference between the values of the reference image and the test image to gen-
erate an error value. It is also possible to generate a more visual representation
of the error by creating an image from the individual pixel difference values.
This allowed us to see issues with our software implementation whilst devel-
oping the hand-coded version, and the issues with xf_warp_transform
shown in Figure 4.3, before synthesising to HDL.
There is a bug with co-simulation, reported on the Xilinx forums and
acknowledged since 2015, that requires a workaround to enable running it on
some Linux systems. The workaround involves adding some #include state-
ments, relative to the location of the Xilinx installation, to some of the source
files. Troubleshooting this bug was an unnecessary time sink, since during the
early stages of development it was believed to be a result of our implemen-
tation. Thankfully, although bugs in HLS software used to be commonplace,
their appearance has been extremely rare in this work.

4.4.3 Programmer Directives


Vivado HLS comes with a suite of programmer directives, otherwise known
as pragmas, which are a standard language construct of C/C++ to specify
how the compiler should process its input. These are introduced by placing
#pragma HLS ... lines into the code to indicate how hardware synthesis
should be treated in those places. Without these user-defined pragmas,
the source code is compiled as-is, meaning it will not be optimised efficiently
into hardware. The developer of the code has better insight than the compiler
into which loops can be pipelined, unrolled or merged to improve either latency
or size, although it can be a bit of an educated guessing game. Pragma-free
C/C++ code compiled to HDL, particularly code containing a number of loops,
will result in very high latency programs. This is discussed in more detail in
Section 4.7, where we observe the reduction in latency as a result of applying
these pragmas.

Function and Loop Pipelining


The PIPELINE pragma reduces the Initiation Interval (II) for a function or
loop by allowing the concurrent execution of operations.

Figure 4.4: Pipelining graphic by Xilinx®

A pipelined function or loop can process new inputs every <N> clock
cycles, where <N> is the II of the loop or function. The default II for the
PIPELINE pragma is 1, which processes a new input every clock cycle. You
can also specify the initiation interval through the use of the II option for the
pragma.
Pipelining a loop allows the operations of the loop to be implemented in a
concurrent manner as shown in Figure 4.4 (B). (A) shows the default sequential
operation where there are 3 clock cycles between each input read (II=3), and
it requires 8 clock cycles before the last output write is performed.

Loop Unrolling
Loop unrolling signals to the compiler to remove all the loop overhead of the
logic, and instead generate hardware which directly represents a logical evalu-
ation of the loop. For example, instead of initialising a loop counter, then
incrementing it and checking on each iteration whether the loop is finished, the
logic for the loop body is simply replicated in sequence (e.g. eight times for a
loop of eight iterations). This reduces latency at the cost of area, but is
especially efficient for short loops of known length that are executed repeatedly.
Intuitively, it makes sense to unroll the innermost loop which extracts and
applies the warping function to the 8 pixels represented by one 64-bit word,
since it is likely that the variable will arrive as one 64-bit word.

Loop Merging
Loop merging combines the loop logic for nested loops into a single loop to
reduce overall latency, increase resource sharing, and improve logic optimisa-
tion. Merging loops:

• Reduces the number of clock cycles required in the RTL to transition


between the loop-body implementations.
• Allows the loops to be implemented in parallel.

4.5 Interfaces
After developing C/C++ synthesisable implementations which can be com-
piled to HDL and optimised using programmer directives, the next challenge
is setting up the interfaces for the synthesised IP blocks in Vivado from the
Zynq board. Documentation is incredibly dense, and is more descriptive than
demonstrative with few working examples. Examples that do exist are often
outdated since the domain is undergoing rapid development.
Therefore, for a software engineer with little hardware experience, it is
difficult to get to grips with the hardware interfaces used on an SoC: AXI,
AXI-Stream, the use of Direct Memory Access (DMA), whether to use Block RAM
or UltraRAM (URAM) or to copy directly from the shared memory, and how to
interface with each of these. In this work, it was understood
that the two simplest interfaces to use are AXI and AXI-Stream.

4.5.1 AXI Interfaces


The Advanced eXtensible Interface (AXI) protocol is part of the ARM Ad-
vanced Microcontroller Bus Architecture (AMBA) specification, which is ‘an
open standard for the connection and management of functional blocks in a
System-on-Chip’. This standard is widely adopted in industry, so it
is no surprise that the functional blocks generated by Vivado HLS are most
easily interfaced with using the AXI protocol. The AXI protocol is based on
a point-to-point interconnect to avoid bus sharing, allowing higher
bandwidth and lower latency.
The Zynq-7000 PS contains a number of General Purpose (GP) and High
Performance (HP) AXI ports. For interfacing with the Vivado HLS block,
we use a GP Master AXI port on the Zynq-7000 to send control signals to
the corresponding Slave AXI interface on the IP block, since the block will be

controlled by the PS. Then, there is a second connection from a Master AXI
port on the HLS block to a HP Slave AXI port on the Zynq-7000. The HLS
block uses this connection for fast access to an internal shared memory be-
tween the PS and the PL, and must be Master so that it is in control of the data
it receives. Both input and output are transferred across these ports, requiring
allocated address space for both input and output, and this approach accesses
the memory directly and in a random-access manner. This method is called
memory-mapped AXI, and is chosen since the inverse warp requires random
access to the input image.
There is an opportunity cost to using memory-mapped protocols – for
known memory access patterns, it is possible to use AXI-Stream interfaces
instead. These stream in the data in a known order, reducing latency since
blocks of pixels can be transferred at once which reduces data transfer time.
Since the order of pixels is known it is also possible to implement pipelining
optimisations, because blocks can stream output pixels which can then start to
be used by subsequent blocks before the entire image is processed. However,
since only one block is used in this design as opposed to a chain of processing
blocks, and to save time, only memory-mapped AXI protocols were used, even
though a streaming interface could have been used for the output.
The full process of connecting the HLS-generated IP block (the Warp IP)
to the Zynq-7000 PS is listed below:
1. Enable High Performance AXI slave and General Purpose Master ports
on the Zynq-7000 IP.
2. Connect the Warp IP AXI Master to Zynq-7000 HP AXI Slave port, and
Zynq-7000 GP AXI Master port to Warp IP AXI Slave port via AXI
Interconnect IPs.
3. Assign shared memory addresses using the Address Editor.
4. Connect the interrupt from the Warp IP to the interrupt port of the Zynq-
7000 IP using an AXI Interrupt Controller IP (AXI INTC).
After completing these steps, it is possible to synthesise the design into
a bitstream used to program the PL of the SoC, and then interface with the
hardware-accelerated functional block created in Vivado HLS via PYNQ.

4.6 The PYNQ Framework


Vivado is used to generate the hardware description files used to program and
interact with the PL from Python. For PYNQ v2.4, these are a .bit bitstream

file which is used to program the FPGA, and a .hwh file which contains a
description used by the PYNQ framework to obtain a user-friendly description
of the hardware.
Installing PYNQ simply involves flashing the latest image onto the SD
card of a PYNQ-compatible SoC. This installs an environment containing the
pynq Python package, which contains drivers and an interface for program-
ming the PL with the hardware Overlay created using Vivado.
The Overlay is loaded by passing the path to the location of the .bit
file, and returns a Python class. The Warp IP will be available as a class mem-
ber of the main class, with a RegisterMap member. RegisterMap con-
tains references to the hardware registers created and used by our design. For
the Warp IP, homography, input and output are available registers in the PL
to be set to addresses of physical memory locations of the warp homography
matrix, the input image and the output image respectively on the SoC. Addi-
tional control registers are included with the design, most notably ap_start
which signals the block to start execution when set to 1.
In summary, the PYNQ framework makes it very easy to interact with IP
blocks created with Vivado HLS and connected to the Zynq-7000 board using
standard AXI interfaces, giving access to custom hardware acceleration from
a Python development environment.
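
The interaction from Python looks roughly as follows. This is a sketch for
the PYNQ v2.4 image, where physically contiguous buffers are obtained through
Xlnk; the IP instance name (warp_0), the register names exposed by
register_map, and the variables ir_image and H (assumed to already hold the
input image and homography) are all illustrative and depend on the generated
block and the Vivado design.

    import numpy as np
    from pynq import Overlay, Xlnk

    overlay = Overlay("warp.bit")      # programs the PL and parses the .hwh description
    warp_ip = overlay.warp_0           # the HLS-generated Warp IP (name depends on the design)

    xlnk = Xlnk()
    in_buf = xlnk.cma_array(shape=(512, 512), dtype=np.uint8)
    out_buf = xlnk.cma_array(shape=(512, 512), dtype=np.uint8)
    h_buf = xlnk.cma_array(shape=(9,), dtype=np.float32)

    in_buf[:] = ir_image               # copy the input image and homography into shared memory
    h_buf[:] = H.flatten()

    regs = warp_ip.register_map        # register names as described above (illustrative)
    regs.input = in_buf.physical_address
    regs.output = out_buf.physical_address
    regs.homography = h_buf.physical_address
    regs.ap_start = 1                  # start the accelerator; poll its done bit before reading back

    registered = np.array(out_buf)     # the warped image, once the block has finished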

4.7 Results
4.7.1 Development Process
The first major result of this stage is the demonstration that it is possible to,
with little to no experience with Vivado Electronic Design Automation (EDA)
tools, Vivado HLS or HDL, successfully write, compile, synthesise into hard-
ware and then run, from Python, a hardware-accelerated image warp in a very
short time frame.
Whilst we have no direct comparison between implementations of image
registration, from our experience and noted by others in literature [18], using
HLS reduces the implementation overhead by at least an order of magnitude
(weeks to days, months to weeks).

4.7.2 HLS pragmas


After adding a loop unroll directive to the innermost loop which applies the
algorithm to each byte in the input word, a loop merge between the inner and

outer loops which govern scanning the two dimensions of the output image,
and a pipeline across the merged loop, the maximum latency was reduced sub-
stantially from 648,024 to 12,049 clock cycles, a reduction by a factor of more
than 50 (approximately 53.8). See
Table 4.2, compared to with pragmas in Table 4.3 for the full latency report
from Vivado HLS.

Latency Interval
min max min max Type
609,624 648,024 607,202 645,602 dataflow

Table 4.2: Latency (clock cycles) without pragmas

Latency Interval
min max min max Type
12,049 12,049 9,627 9,627 dataflow

Table 4.3: Latency (clock cycles) with pragmas

Utilisation without and with pragmas respectively can be found in Tables


4.4 and 4.5. Looking at the individual resource types, we see that the use of
BRAM_18K stays constant, while the utilisation of DSP48E blocks increases by
14 percentage points, FFs by 5 and LUTs by 17. These correspond to roughly
4×, 2.2× and 2.1× increases in resource usage respectively, so an average
utilisation increase of approximately 3× for an over 50× speed-up.

Name BRAM_18K DSP48E FF LUT


Total 48 10 4,399 7,714
Available 280 220 106,400 53,200
Utilization (%) 17 4 4 14

Table 4.4: Utilisation without pragmas



Name BRAM_18K DSP48E FF LUT


Total 48 40 9,769 16,493
Available 280 220 106,400 53,200
Utilization (%) 17 18 9 31

Table 4.5: Utilisation with pragmas

Whilst utilisation has increased, we made cost-effective
decisions to minimise the increase in utilisation whilst substantially improving
latency, with a benefit-to-cost ratio of nearly 20.

4.7.3 Timing Comparison


Having implemented the hardware-accelerated warp and the means to interact
with it from Python, we could perform a timing analysis of the speed-
up obtained from using custom logic compared to the optimised OpenCV im-
plementation on the board’s Dual-Core ARM Cortex-A9 CPU. The hardware-
accelerated implementation on the PL achieved a speed of 762µs compared to
3.54ms on the PS, a nearly 5× (4.6×) speed-up.
Note that a lot of the processing time is spent executing Python code to
copy the images to the correct reserved memory locations and set the register
values for the homography and input/output memory locations. This overhead
could be reduced by writing a compiled driver for these setup steps in a lan-
guage such as C/C++, and chaining multiple accelerated processing blocks in
hardware whenever possible. Additionally, if the same homography and in-
put/output memory addresses are used throughout the program, this is just a
one-time cost.
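
The comparison itself is a straightforward wall-clock measurement from
Python; a sketch of the kind of measurement used is shown below, where
hw_warp stands in for the call into the PL described in Section 4.6, and
ir_image and H are assumed to be already defined.

    import time
    import cv2

    def mean_time(fn, repeats=100):
        start = time.perf_counter()
        for _ in range(repeats):
            fn()
        return (time.perf_counter() - start) / repeats

    sw = mean_time(lambda: cv2.warpPerspective(ir_image, H, (512, 512)))  # OpenCV on the PS
    hw = mean_time(lambda: hw_warp(ir_image, H))                          # accelerated call on the PL
    print("PS: %.2f ms, PL: %.0f us, speed-up: %.1fx" % (sw * 1e3, hw * 1e6, sw / hw))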

4.8 FPGA Implementation Conclusion


Starting from no prior knowledge, we were firstly capable of implementing an im-
age warp function very quickly, with a highly useful software test-
bench to verify the custom implementation’s correctness before committing
to hardware compilation. Secondly, optimising the implementation with HLS
pragmas was quite intuitive, although it seems like it has potential to be auto-
mated to an extent, and reduced latency by a huge margin. The most difficult
step is then working out how to interface with the generated block, however

with bigger teams and an interdisciplinary sharing of knowledge this is easily


overcome.
Finally, even with extremely limited knowledge of hardware, we managed
to connect the block to the Zynq board in such a way that it was accessible from
the PS in Python. The PYNQ way of interacting with the board is incredibly
intuitive for someone with a mainly software background. Again, the documenta-
tion is dense and has few examples, so it can be hard to find the answers you
want, because the platform itself is evolving quickly and is not yet in mainstream
use; the examples that can be found tend to be out of date. The user base and
the framework are not yet large or stable enough to guarantee up-to-date, clear
documentation and examples, but this is a problem that should resolve itself
as development on the platform matures; indeed, each day returns more search
results referencing the PYNQ framework.
Whilst not a flawless experience, we conclude that this development parad-
igm is definitely worth the time spent familiarising oneself with it.

4.8.1 Future Work


Interfaces
With more understanding, we could further reduce latency by reducing unnec-
essary data paths, or allowing concurrent execution of IP blocks by accessing
or outputting data in a streaming manner using AXI-Stream. This part of HLS
design is the trickiest, since HLS-familiar software engineers have min-
imal knowledge of the standard interfaces frequently used by hardware engi-
neers, whilst the hardware engineers have minimal knowledge of HLS and the
interfaces it generates. As is often the case, the interface where two separate
domains meet is a tricky area to develop in.

Floating Point Optimisation


Floating point operations are considerably slower than fixed-point or integer
operations. An investigation into the timing analysis of the hand-coded HLS per-
spective warp showed that the majority (> 90%) of the latency in the design
was due to floating point operations. See Appendix, Figure B.1: 20 out of
22 steps are occupied by sitofp, fmul and fadd, which are, respectively,
a conversion from integer to floating point, floating point multiplication and
floating point addition. Since registration is only accurate to a certain preci-
sion anyway, and the warp algorithm is only accurate to integer pixel
values, there is the option to instead use fixed-point arithmetic, which would

provide a considerable speed-up. This will not come as a surprise to hard-


ware engineers, since quantisation – i.e. removing floating point operations in
favour of fixed-point values – is one of the most common optimisations made
in hardware design.

Pipelining HLS blocks


Since image processing usually involves a sequence of operations over the im-
ages, each of the processing blocks should be chained together into an image
processing pipeline whenever possible.
In this way, after the image has been loaded into the shared memory, multi-
ple hardware-accelerated functions can be applied directly in sequence entirely
on hardware, reading the results from memory only when the full
sequence of operations is complete. This minimises the time spent setting up
and copying the data to the correct memory locations, and then subsequently
reading the results, which is all currently done in Python and introduces a fixed
overhead regardless of the complexity of the accelerated function calls.
Our implementation of the image warp requires random access to the in-
put image, restricting the possibility to dramatically reduce latency and size
of the generated hardware by utilising an AXI-Stream interface. Successive
AXI-Stream interfaces, chained together, allow the data processing pipeline to
be executed concurrently, since subsequent logic blocks can receive and start
processing processed data from earlier pixels in the data stream, before the
later pixels have even been processed by the prior IP block.
We implemented identical interfaces for both input and output data to re-
duce the complexity of our implementation and save development time, so
we were restricted to direct memory access due to random access required to
the input data. However, since the output data is produced sequentially when
applying the inverse warp algorithm, it would be possible to stream the out-
put pixels using an AXI-Stream interface, and we would recommend that ap-
proach. For a forward warp transform, the input data is accessed sequentially,
whilst the output must have random access.
For an ideal image processing pipeline, we would chain the processing
blocks to take advantage of the data streaming described previously, with mul-
tiplexers placed in such a way that the individual blocks can be skipped by
manipulating the control logic. This would minimise latency and space usage,
and result in a hardware-accelerated image processing pipeline which can be
changed on-the-fly by interacting with the control logic from Python.
With the rise in demand for edge computing and data processing at the

point of capture, we believe the paradigm described above is hugely applicable


to many applications in embedded computer vision.
Chapter 5

Data Augmentation

5.1 Introduction
The aim of this chapter of the thesis is to obtain a quantitative and qualitative
analysis of whether a network trained with RGB images can achieve higher
accuracies by integrating data from the IR spectrum.
Though it relies heavily on the work done so far, this is the most important
part of the thesis with regard to answering the main research question: How viable
is it to improve the accuracy of real-time embedded object detection by inte-
grating an IR camera to augment the RGB image? Successful results would
demonstrate a method to improve the accuracy of a deep object detection net-
work solely by fusing additional spectral data with the input data.
The most important images from this dataset will be at nighttime, contain-
ing pedestrians or vehicles visible in IR but not in the RGB images. However,
it will be useful to see whether image fusion adversely affects object detection
in other conditions, either due to poorly registered images, sub-optimal fusion
or the additional features from the IR spectrum obscuring RGB features. In
this case, it may be necessary for dual-spectra systems to disable augmenta-
tion during daytime or good lighting conditions. Another solution would be
to train the network further so that it can learn this distinction itself, or run
inference on both images sequentially, though this would increase latency.
Hence we will perform a qualitative analysis on select image pairs in which
pedestrians or cars are recognisable in IR, but difficult to see in RGB. We will
obtain a quantitative measure of the performance of the network on data with
and without augmentation by calculating common accuracy metrics across the
entire datasets, and with the datasets split into daytime and nighttime images,
giving an objective view of the mean effect on accuracy of our approach.


5.2 Background
5.2.1 History
Deep ML (Machine Learning), often referred to simply as Deep Learning,
provides the state-of-the-art for object detection at the time of writing. Its
preeminence in computer vision emerged as a result of the annual competi-
tion held by ImageNet, the ImageNet Large Scale Visual Recognition Chal-
lenge (ILSVRC), which began in 2010. Deep Convolutional Neural Networks
(CNNs) in the computer vision domain were beginning to become more popu-
lar in this period, with the use of GPUs in training an enabler for this approach
[15]. In 2012, AlexNet won the competition with a top-5 error of 15.3%, com-
pared to the second place score of 26.2%.
Prior to this point, the victors of ILSVRC had been either much shal-
lower neural networks or approaches based on human-engineered features like
SIFT. This huge margin of victory triggered extensive research into the role of
GPUs in deep learning, and into deep learning itself. Since then, all winners
of the ImageNet competition, and of the similar Common Objects in COntext
(COCO) detection challenge, have been deep networks, and Deep Neural Net-
works (DNNs), specifically CNNs, now ‘perform better than or on par with
humans on good quality images’ [44].

5.2.2 Neural Network Primer


The concept of Machine Learning (ML) is to enable a program to learn in-
dependently of explicit programmer instructions, in a way inspired by neuroscience.
Artificial Neural Networks (ANNs) are so named because they take inspiration
from the brain, which consists of an extensive network of interconnected neu-
rons1, and as such some of the terminology from neuroscience carries over.
Simplified, brain neurons consist of the cell body, dendrites, axons, and
synapses. The cell body (node or neuron in computing) receives inputs along
the dendrites, and outputs a corresponding impulse along its axon which ter-
minates at a number of synapses2 . Synapses act as the junction between the
output of one cell (an axon terminal) and the input of another cell (the den-
drites), scaling the signal received from the axon before transmitting it to the
dendrites.
1 On average 86 billion neurons, with ≈ 10^15 connections or synapses.
2 On average each axon terminates at between 1,000 and 10,000 individual synapses.

Figure 5.1: ANN representation by Colin M.L. Burnett

In artificial neural networks, the scaling role of the synapses is represented


by vectors of weights for each node, which are used to scale each of the input
signals. At each node, a net input function combines the input signals be-
fore an activation function is applied to generate the output signal. A neural
network consists of multiple layers of interconnected nodes: an input layer, a
number of hidden layers, and an output layer. Each layer of nodes is usually
homogeneous: i.e. the behaviour of each node within a layer is identical to
its sister nodes. Deep neural networks are defined as neural networks with
more than one hidden layer that can learn features in a hierarchical manner,
learning complicated features by building them up from simpler ones, which
allows them to overcome the performance plateau common to other ML meth-
ods [45]. A representation of a 3-layer neural network is provided at Figure
5.1, although typically DNNs contain many more hidden layers. The process
of learning comes from adjusting the weight values, or network parameters, in
such a way to improve the performance of the final output of the network.
This paradigm of computing allows developers to train machines to per-
form a specified task without explicitly instructing them how – either by specify-
ing an objective reward or fitness function (reinforcement learning or genetic
programming), or providing a set of training data with examples of correctly
corresponding input and outputs (supervised learning).
This is a useful tool when it is more difficult to specify a function by hand3
than generate an equivalent function with a neural network, and is one of
the most effective methods for generating ‘artificially intelligent’ functions in
3 Explicitly programming heuristics.

fields where human understanding is limited, such as Natural Language Pro-


cessing, Automatic Control and Computer Vision.

5.2.3 Convolutional Neural Networks (CNNs)


There are a few core types of neural network layers used in CNNs, and a myriad
of variations and combinations of these to achieve different goals. Here we will
go over four important layer types used in CNNs for computer vision, which are
in use in most current state-of-the-art implementations for object detection and
relevant to our analysis: the Convolutional Layer, the Pooling Layer, the
Residual Block, and the Fully-Connected Layer.

Convolutional Layer
The convolutional layer played a key part in AlexNet, where the ‘immense
complexity of the object recognition task’ required models with ‘lots of prior
knowledge to compensate for data we don’t have’. Convolutional layers are
considered to ‘make strong and mostly correct assumptions about the nature
of images (namely, stationarity of statistics and locality of pixel dependen-
cies)’. This allows more complex relationships to be derived with ‘much fewer
connections and parameters’ [15].
Convolutional layers have two main adjustable hyperparameters, the size
of the convolutional kernels, or receptive field, and the number of filters to use,
or depth.
The receptive field is so called because each neuron in the convolutional
layer is only connected to a local region in the previous layer, so is only recep-
tive to changes in that sub-field4 , as opposed to fully-connected layers where
each neuron in the current layer is connected to all neurons in the previous
layer. Fewer connections means fewer calculations, and since image pixels
mostly exhibit local dependencies – i.e. pixels close together are more corre-
lated – the number of connections per neuron can be reduced substantially to
a much smaller receptive field without a loss of information.
The depth of a convolutional layer refers to the number of filters to train
for. Intuitively, this means the number of features since the filter responses
for convolving each filter with a region of the image are ultimately passed
through to the later parts of the network as inputs and therefore should identify
4 Receptive field instead of region, since the input usually has three dimensions: width,
height, and depth (which for an image is each RGB component), for example 256x256x3 for
a 256x256 RGB image. Therefore the receptive field is a 3D block: 5x5x3 for a 5x5 kernel.

unique and independent features, for example some sort of horizontal, curved
or vertical line, which in combination can be used to differentiate different
objects.
One of the keys to the efficiency of convolutional layers is that they take
advantage of the stationarity of image statistics – the property that features, for
example a horizontal line in an image, do not depend on their spatial position
(the x, y coordinates of the line). This allows a parameter sharing scheme that
dramatically reduces the number of parameters. Instead of updating unique
weights across the entire 2D space of the image, neurons at the same depth
– i.e. the kernel weights at each depth slice – can be constrained to use the
same weights for the entire 2D space, hugely reducing the number of unique
parameters at each depth to the dimensionality of the receptive field5 .

Pooling Layer
Pooling layers are used to reduce the spatial size of the input by applying a
downsampling function across each 2D depth slice of the input, which reduces
parameters for future layers and helps to control overfitting. These layers have
two main hyperparameters, the receptive field, or kernel size, and stride length.
The downsampling function summarises data from the input volume, for
example reducing a 2x2 area in the input to the average or maximum value
in the area, reducing the stored information by 75%. In practice taking the
maximum has been most effective, which is know as Max Pooling.
These are either used in-between convolutional layers to create a deeper
network but reduce dimensionality, or just before the final layers of the network
to sample the outputs into the desired shape.

Residual Block
Residual blocks are a solution to the vanishing gradient, or degradation, prob-
lem that occurs with deep neural networks. They add shortcut (skip) connections
around groups of intermediate layers, allowing a simpler sub-model to train its
weights without going through the entire network; because the skipped path lets
gradients flow past the intermediate layers, the gradients propagated back
during training are less susceptible to becoming too small and halting training.
This allows deeper networks to be trained.
5 For example, from [5x5x3] parameters for each position in a [256x256] 2D space, i.e.
5 × 5 × 3 × 256 × 256 = 4,915,200 parameters, down to 5 × 5 × 3 = 75 parameters per depth slice.

Fully-connected Layer
The fully-connected layer is the typical neuron layer in machine learning, de-
scribed previously and depicted in Figure 5.1. Each of the nodes is connected
to all of the input nodes, with weights for each of the connections. In a CNN,
there is usually one fully-connected layer at the end of the network either before
or after a pooling layer (to sample to the correct dimension), which generates
activations based on parameters learned from training each node across all the
final input features.
These activations are then passed to the output layer, which converts the
vector of activations into a probabilistic interpretation of the input weights, for
example by assigning a probability to each class label.

5.2.4 Data Augmentation


‘The easiest and most common method to reduce overfitting on image data is to
artificially enlarge the dataset using label-preserving transformations’. Since
we will be using a pre-trained network, the details of these transformations are
not so important, but a broad overview is provided.
The first class of transformations involves spatial properties of the images:
conserving RGB pixel values, but transforming images using cropping, hori-
zontal reflections and translations, and translating label values accordingly.
‘The second form of data augmentation consists of altering the intensities
of the RGB channels in training images’. In AlexNet, they performed ‘PCA
(Principal Component Analysis) on the set of RGB pixel values throughout
the ImageNet training set ... add[ing] multiples of the found principal compo-
nents, with magnitudes proportional to the corresponding eigenvalues times a
random variable drawn from a Gaussian ... This scheme approximately cap-
tures an important property of natural images, namely, that object identity is
invariant to changes in the intensity and color of the illumination’ [15]. This
way of training deep networks implies that data from the IR spectrum –
where the general shape of objects is preserved but the intensity and colour of
illumination diverge from those in RGB images – may contain features
recognisable to the network despite it not having been trained on IR data.
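
For reference, this PCA-based colour augmentation can be sketched as follows.
AlexNet computes the principal components over the whole training set; the
simplified version below computes them per image, purely for illustration.

    import numpy as np

    def pca_colour_augment(img, sigma=0.1, rng=None):
        # img: HxWx3 float array with values in [0, 1]
        rng = np.random.default_rng() if rng is None else rng
        pixels = img.reshape(-1, 3)
        cov = np.cov(pixels, rowvar=False)        # 3x3 covariance of the RGB values
        eigvals, eigvecs = np.linalg.eigh(cov)    # principal components of the colour distribution
        alphas = rng.normal(0.0, sigma, size=3)   # random magnitudes, as in AlexNet
        shift = eigvecs @ (alphas * eigvals)      # offset added to every pixel
        return np.clip(img + shift, 0.0, 1.0)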

5.2.5 Evaluating model performance


In order to analyse the performance of our model, we will need an understand-
ing of the various accuracy measures used in the object detection community.
For testing machine learning, the data is split into training, validation and test

datasets. The test dataset is not “seen” by the network during the training
phase, and therefore is used to evaluate the trained network’s performance
on unseen data, which prevents rote-learning and tests the model’s ability to
generalise. The test data is fully annotated with ground-truth values, and the
predictions by the network are compared to the ground-truth in order to obtain
a metric for performance.
Top-1, Top-5, Top-X numbers are image classification metrics. Given an
image with one ‘main’ object, a Top-X score is the percentage of times the
correct classification is in the top X results of the classifier. For example,
Top-1 is the percentage of times the most likely prediction by the classifier
is the correct result, whereas Top-5 is the percentage of times the top 5 most
likely predictions by the classifier contain the correct result.
Object detection, where both the position and classification of all objects
in an image must be returned, requires a different metric. This is based on the
IoU (Intersection-over-Union) of the predicted bounding boxes and the ground-
truth values. As visible in Figure 5.3, the IoU is a decimal value from 0 to 1,
defined as the fraction IoU = (area of overlap) / (area of union) of the
ground-truth bounding box and the predicted bounding box.
The detection measures used in the COCO detection challenge are shown in
Figure 5.2. In this report we will be using Average Precision (AP), specifically
AP at IoU = 0.5 (AP50), which was used in the PASCAL VOC challenges. The 0.5
in the metric is the IoU threshold above which we consider a prediction a True
Positive. Since humans have a hard time differentiating IoU values of 0.5 and
0.75, and annotations were only provided for the IR data and so had to be
migrated to the RGB and augmented datasets, it was decided to make the accuracy
metric as broad as possible and use an IoU threshold of 0.5.
Common evaluation metrics used for prediction models are Precision and
Recall. Precision measures the fraction of predictions which are correct,
TP / (TP + FP), whereas Recall measures the fraction of the possible positive
predictions that were found, TP / (TP + FN). Mean AP (mAP) is calculated by
sampling the Precision/Recall curve, and the measure usually used is the Area
Under the Curve (AUC).

5.2.6 YOLO
Since we are aiming for real-time computer vision, the network that we will use
to evaluate our IR augmented images will be the YOLO network [4], specifi-
cally YOLOv3, which is ‘more than 1000x faster than R-CNN and 100x faster
than Fast-R-CNN’, and ‘achieves 57.9 AP50 in 51ms on a Titan X, compared
to 57.5 AP50 in 198ms by RetinaNet, similar performance but 3.8× faster’ [47]
(timings measured on an NVIDIA Titan X).

Figure 5.2: COCO metrics: http://cocodataset.org/#detection-leaderboard

Figure 5.3: Intersection-over-Union definition [46]
‘YOLOv3 uses successive 3 × 3 and 1 × 1 convolutional layers’, ‘shortcut
connections [residual blocks]’ and ‘has 53 convolutional layers’. As is the
case for most networks, it’s trained using ‘lots of data augmentation... all the
standard stuff’, which we can expect includes all the ‘random crops, rotations,
and hue, saturation, and exposure shifts’ mentioned in the YOLOv2 paper [48].
Darknet is the machine learning framework developed by YOLO’s creator,
Joseph Redmon. In addition to providing Darknet, Redmon has made many
different network configurations and pre-trained weights freely available for
download, including the configuration for YOLOv3 and the associated network
weights. Since the aim of this research is to analyse whether pre-trained net-
works in the RGB spectrum can benefit from IR data, we will use a YOLOv3
network with the network weights provided by Redmon. In the interests of
preserving as much similarity to an embedded implementation as possible, we
use a version of YOLOv3 named tiny-YOLOv3, which is a significantly smaller
and hence less accurate version of the YOLOv3 network.
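For reference, a minimal sketch of running the publicly available tiny-YOLOv3 configuration and weights is given below, here loaded through OpenCV's DNN module rather than Darknet itself; the file names and thresholds are assumptions, not the exact pipeline used in this work.

    import cv2
    import numpy as np

    # Assumed local copies of the publicly available tiny-YOLOv3 files
    net = cv2.dnn.readNetFromDarknet("yolov3-tiny.cfg", "yolov3-tiny.weights")

    img = cv2.imread("example.jpg")
    # Darknet expects RGB input scaled to [0, 1]; 416x416 is the default input size
    blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    # Each detection row is [cx, cy, w, h, objectness, class scores...]
    for out in outputs:
        for det in out:
            scores = det[5:]
            cls = int(np.argmax(scores))
            if scores[cls] > 0.5:
                print(cls, float(scores[cls]), det[:4])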

5.3 Method
5.3.1 Overview
In this section we describe the processing steps completed on 7,777 images
from the FLIR ADAS dataset presented in Section 3.2.3 for our analysis of the
hypothesis that:

Augmenting RGB images with IR data can improve the inference
performance of a deep neural network trained on RGB data.

This involves the following high-level steps:

1. Cleaning the dataset.

2. Acquiring warp homographies.

3. Mapping annotations for RGB images.

4. Image fusion to obtain the AUG dataset.

5. Network accuracy analysis.

5.3.2 Dataset Processing Steps


Firstly, the dataset had to undergo the processing steps described in this section
prior to being used in this study. The end goal of this process is to obtain three
annotated datasets of 512x512 images in RGB, IR, and the two combined into
an augmented RGB-IR dataset which we will call AUG.

Cleaning
Since 499 images in the Training set and 109 images in the Validation set do
not have RGB counterpart images, we first remove all images without counterparts.

Acquiring warp homographies


The next step is to acquire homographies which describe the mapping be-
tween each pair of images. Whilst the dataset description specifies a 1280x1024
camera, most of the RGB images are at 1800x1600 resolution, with 1,973 im-
ages at 2048x1636, 1,401 images at 1280x1024, and 119 images at 720x480.
For each of the different RGB camera resolutions, the mapping between the
IR and RGB cameras is different; however, within each image resolution the
mappings are the same.
Using the manual point matching and findHomography functions de-
scribed in Section 3, Image Registration, the coordinates of corresponding
points between the two images are stored by clicking on a pair of images side-
by-side and can then be used to generate a homography. This is done for each
RGB image at maximum resolution to make the most of the visual data avail-
able, and then the results from this are scaled down to 512 × 512, which is the
most convenient resolution for working with the network.
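As an illustration, a minimal sketch of this step with OpenCV is shown below; the point coordinates and resolutions are placeholders, not the values used in this work.

    import cv2
    import numpy as np

    # Manually clicked corresponding points (placeholder values), IR -> full-resolution RGB
    pts_ir = np.array([[102, 64], [510, 70], [498, 455], [95, 440]], dtype=np.float32)
    pts_rgb = np.array([[230, 180], [1620, 195], [1588, 1390], [215, 1350]], dtype=np.float32)
    H_full, _ = cv2.findHomography(pts_ir, pts_rgb)

    # Rescale the destination side so the homography maps into a 512x512 RGB frame
    h_rgb, w_rgb = 1600, 1800          # one of the RGB resolutions in the dataset
    S = np.diag([512.0 / w_rgb, 512.0 / h_rgb, 1.0])
    H_512 = S @ H_full

    # Warp the IR image into the 512x512 RGB frame
    ir = cv2.imread("FLIR_00001_ir.jpeg", cv2.IMREAD_GRAYSCALE)
    ir_warped = cv2.warpPerspective(ir, H_512, (512, 512))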

Mapping annotations for RGB images


The same homographies can be applied to the annotation bounding box co-
ordinates in the annotation files provided with the dataset, to transform the
annotations to align with the RGB images, and again we do this at the highest
possible resolution first, before shrinking the results to 512x512.
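A sketch of how a single axis-aligned box could be transformed with such a homography is shown below; since a perspective warp does not in general map an axis-aligned rectangle to another axis-aligned rectangle, the warped corners are re-enclosed in a bounding box (a simplification, not necessarily the exact procedure used here).

    import cv2
    import numpy as np

    def map_box(box, H):
        """Map an axis-aligned box (x, y, w, h) through homography H and return
        the axis-aligned box that encloses the warped corners."""
        x, y, w, h = box
        corners = np.array([[[x, y]], [[x + w, y]], [[x + w, y + h]], [[x, y + h]]],
                           dtype=np.float32)
        warped = cv2.perspectiveTransform(corners, H).reshape(-1, 2)
        x0, y0 = warped.min(axis=0)
        x1, y1 = warped.max(axis=0)
        return float(x0), float(y0), float(x1 - x0), float(y1 - y0)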
Since annotations are only provided for the IR images, these will be mapped
to align with the RGB images and used as ground-truth for the RGB and AUG
data. This introduces some error: besides the mapping itself being imperfect,
there may be objects that are clearly visible in RGB but were not visible to the
IR annotators, which will result in some correct predictions being interpreted
as False Positives.

The flip-side of this is that we will have bounding boxes for some objects in
IR-space which are not visible in RGB, which is exactly the desired outcome
for this testing. Ideally, the dataset would be fully annotated with all objects
visible in both RGB and IR, however it is easy to see why this is hard to ob-
tain. Firstly, the publishers of the dataset had no registration between the im-
ages. Secondly, annotating both sets of images is twice as time consuming
and costly. Thirdly, even if both sets of images are annotated, when mapped
together many annotations would be overlapping, duplicates or contradictory,
and it would require further processing to obtain ‘ground-truth’ containing all
annotations in both spectra.

Image fusion to obtain AUG dataset


In order to fuse the warped IR image with the RGB image, we first use a mod-
ified Two-Stage Multithreshold Otsu (TSMO) method [49] to remove low-
intensity background noise from the IR image; since intensity in the infrared
spectrum decreases with distance, this is a good approximation of background
subtraction. Then, we perform a pixel-by-pixel weighted addition between the
RGB image and the warped IR image, taking 50% of the intensity from each.
This is quite a crude method, chosen for its simplicity as a proof-of-concept
that even the simplest fusion method can aid detection. See Appendix C, Fig-
ure C.1c to observe the artefacts introduced by fusing in this way, notably the
very visible border of the warped IR image.
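A minimal sketch of this fusion step is given below; as a simplification it uses OpenCV's single-threshold Otsu method in place of the modified TSMO method, and the file names are placeholders.

    import cv2

    rgb = cv2.imread("rgb_512.png")                              # 512x512 RGB image
    ir = cv2.imread("ir_warped_512.png", cv2.IMREAD_GRAYSCALE)   # warped IR image

    # Zero low-intensity (distant/background) IR pixels via Otsu thresholding
    _, mask = cv2.threshold(ir, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    ir_fg = cv2.bitwise_and(ir, ir, mask=mask)

    # Pixel-wise 50/50 weighted addition of the RGB image and the thresholded IR image
    ir_bgr = cv2.cvtColor(ir_fg, cv2.COLOR_GRAY2BGR)
    aug = cv2.addWeighted(rgb, 0.5, ir_bgr, 0.5, 0)
    cv2.imwrite("aug_512.png", aug)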

5.3.3 Qualitative Analysis


For this section of the report, we selected 42 images containing distinct charac-
teristics, and performed object detection on the RGB image, the IR image, and
the augmented RGB-IR image. Six of these images are shown in Appendix C.
The red bounding boxes are the ground-truth annotations from FLIR; the
rest are the result of inference using the pre-trained tiny-YOLOv3 network with
green bounding boxes for people, orange for bicycles, blue for cars, and cyan
for dogs.

5.3.4 Quantitative Analysis


We follow the description in Section 5.2.5, Evaluating Model Performance,
in order to obtain an accuracy metric on which to compare the performance
of the model on the three datasets. For each image, we run the tiny-YOLOv3
network inference and compare the predictions to the annotated ground-truth
from our annotation files to calculate the number of True Positives (TP), False
Negatives (FN) and False Positives (FP), at two different Intersection-over-
Union (IoU) thresholds, 0.5 and 0.25.
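One way such per-image counting could be implemented is sketched below (a greedy matching, assuming the IoU helper sketched in Section 5.2.5; the exact matching rules used in this work may differ).

    def count_detections(preds, gts, iou_thresh=0.5):
        """Greedily match predictions (class, confidence, box) to ground-truth
        (class, box) entries of the same class and return (TP, FP, FN)."""
        matched = set()
        tp = fp = 0
        for cls, conf, box in sorted(preds, key=lambda p: -p[1]):
            best_iou, best_j = 0.0, None
            for j, (gt_cls, gt_box) in enumerate(gts):
                if gt_cls != cls or j in matched:
                    continue
                overlap = iou(box, gt_box)
                if overlap > best_iou:
                    best_iou, best_j = overlap, j
            if best_j is not None and best_iou >= iou_thresh:
                matched.add(best_j)
                tp += 1
            else:
                fp += 1
        fn = len(gts) - len(matched)
        return tp, fp, fn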
A manual review of the predictions showed that the predictions by the net-
work on an RGB image were very similar to the mapped annotations from the
annotated FLIR dataset, as can be seen in Figure 5.4. Inspecting the bound-
ing boxes in these figures further, we would expect 5 TP car predictions, with
4 FN cars missed, and 1 FN for the person. The actual result is close, with
just one of the car predictions not aligning closely enough to the annotation
bounding box to register as a TP. Reducing the IoU to 0.25 solves this issue
for this image. Therefore, whilst 0.5 is more standard in the literature, we repeated
the measurements at 0.25 due to error introduced from mapping annotations
from the IR coordinates to the RGB image.
At the end of the experiment, we had obtained TP, FP, and FN counts for
each image, and an overall count for each dataset at each IoU value. We also
noted which images were taken at daytime or nighttime, so that the results
could be split into these categories.
Mean Average Precision (mAP) is calculated as the Area Under Curve
(AUC) of the Precision/Recall graph. In this work, we calculate the Precision
and Recall for each image, and then sort the results by Precision. Then we
accumulate TP, FP and FN to obtain an evolving value for Precision and Recall
as we parse the results for each image. In the end, we acquire a graph of
Precision/Recall which starts with high precision and low recall, and descends
to a lower precision and higher recall value, for each class and at each IoU
value.
Average Precision (AP) for each class is calculated by computing the area
under the graph. At each block of width r_{n+1} − r_n, we multiply by the precision
value at this index to get the area, and accumulate over the entire graph, as in
Equation 5.1.

\[ \mathrm{AP} = \sum_{n} (r_{n+1} - r_n)\, p(r_{n+1}) \]    (5.1)
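A compact sketch of this accumulation, assuming results is a list of per-image (TP, FP, FN) tuples for one class, already sorted by per-image precision as described above:

    def average_precision(results):
        """Accumulate TP/FP/FN over the sorted per-image results and integrate
        precision over recall, as in Equation 5.1."""
        total_pos = sum(tp + fn for tp, fp, fn in results)
        tp_acc = fp_acc = 0
        prev_recall = ap = 0.0
        for tp, fp, fn in results:
            tp_acc += tp
            fp_acc += fp
            precision = tp_acc / (tp_acc + fp_acc) if (tp_acc + fp_acc) else 0.0
            recall = tp_acc / total_pos if total_pos else 0.0
            ap += (recall - prev_recall) * precision
            prev_recall = recall
        return ap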

5.4 Results
5.4.1 Qualitative Analysis
For the qualitative analysis, we picked images from the 42 reviewed images
which showed a large range of possible results. Figure C.1 shows both the
crudeness of the image fusion method and particularly misaligned bounding
boxes for the people on the right-hand side of the RGB image. Figure C.2
shows better detection on IR than either RGB or AUG. Figures C.3 and C.4
show cases where the network performs better on the AUG image than on both
the RGB and IR images. Figures C.5 and C.6 show the common case where,
at nighttime, the IR image is best, followed by the AUG image and RGB the
worst, and the inverse during the daytime.

Figure 5.4: FLIR IR bounding boxes. (a) FLIR dataset annotations; (b) YOLOv3 bounding boxes

5.4.2 Quantitative Analysis


The results from the quantitative analysis are summarised in the following ta-
bles and will be discussed in detail in this section.

id  class name  dataset  AP (%)  TP     FP    FN     Precision  Recall
1   person      RGB      19.0    4206   3271  16913  0.56       0.20
                AUG      21.5    4592   1435  16527  0.76       0.22
                IR       33.7    7152   1065  13967  0.87       0.34
2   bicycle     RGB      17.2    665    844   3028   0.44       0.18
                AUG      14.4    544    487   3149   0.53       0.15
                IR       14.8    553    228   3140   0.71       0.15
3   car         RGB      40.6    16416  8482  20964  0.66       0.44
                AUG      34.8    13335  3298  24045  0.80       0.36
                IR       28.3    10684  2023  26696  0.84       0.29
17  dog         RGB      20.8    43     13    164    0.77       0.21
                AUG      14.0    29     5     178    0.85       0.14
                IR       14.0    29     1     178    0.97       0.14

Table 5.1: Entire dataset IoU > 0.5

id  class name  dataset  AP (%)  TP     FP    FN     Precision  Recall
1   person      RGB      30.6    6508   1223  14611  0.84       0.31
                AUG      27.0    5727   514   15392  0.92       0.27
                IR       37.9    8017   463   13102  0.95       0.38
2   bicycle     RGB      34.9    1295   320   2398   0.80       0.35
                AUG      27.3    1011   133   2682   0.88       0.27
                IR       23.0    849    37    2844   0.96       0.23
3   car         RGB      55.5    21190  4990  16190  0.81       0.57
                AUG      43.4    16314  1583  21066  0.91       0.44
                IR       35.0    13103  708   24277  0.95       0.35
17  dog         RGB      23.7    49     7     158    0.88       0.24
                AUG      15.5    32     2     175    0.94       0.15
                IR       14.0    29     1     178    0.97       0.14

Table 5.2: Entire dataset IoU > 0.25



id  class name  dataset  AP (%)  TP     FP    FN     Precision  Recall
1   person      RGB      27.2    3010   2690  7423   0.53       0.29
                AUG      26.7    2838   1178  7595   0.71       0.27
                IR       31.5    3305   592   7128   0.85       0.32
2   bicycle     RGB      20.2    631    765   2356   0.45       0.21
                AUG      16.9    519    427   2468   0.55       0.17
                IR       17.2    521    207   2466   0.72       0.17
3   car         RGB      49.6    13393  7083  11567  0.65       0.54
                AUG      40.9    10435  2654  14525  0.80       0.42
                IR       29.6    7499   1480  17461  0.84       0.30
17  dog         RGB      27.7    43     12    112    0.78       0.28
                AUG      18.7    29     5     126    0.85       0.19
                IR       18.7    29     1     126    0.97       0.19

Table 5.3: Daytime IoU > 0.5

id  class name  dataset  AP (%)  TP     FP    FN     Precision  Recall
1   person      RGB      45.7    4819   1043  5614   0.82       0.46
                AUG      35.6    3723   401   6710   0.90       0.36
                IR       36.5    3807   193   6626   0.95       0.36
2   bicycle     RGB      40.3    1210   288   1777   0.81       0.41
                AUG      31.6    945    108   2042   0.90       0.32
                IR       26.7    797    30    2190   0.96       0.27
3   car         RGB      66.8    17028  4409  7932   0.79       0.68
                AUG      50.7    12714  1330  12246  0.91       0.51
                IR       36.9    9229   481   15731  0.95       0.37
17  dog         RGB      31.6    49     6     106    0.89       0.32
                AUG      20.6    32     2     123    0.94       0.21
                IR       18.7    29     1     126    0.97       0.19

Table 5.4: Daytime IoU > 0.25



id  class name  dataset  AP (%)  TP     FP    FN     Precision  Recall
1   person      RGB      10.9    1196   581   9490   0.67       0.11
                AUG      16.3    1754   257   8932   0.87       0.16
                IR       35.8    3847   473   6839   0.89       0.36
2   bicycle     RGB      4.8     34     79    672    0.30       0.05
                AUG      3.5     25     60    681    0.29       0.04
                IR       4.5     32     21    674    0.60       0.05
3   car         RGB      23.1    3023   1399  9397   0.68       0.24
                AUG      23.0    2900   644   9520   0.82       0.23
                IR       25.5    3185   543   9235   0.85       0.26
17  dog         RGB      0.00    0      1     52     0.00       0.00
                AUG      nan     0      0     52     nan        0.00
                IR       nan     0      0     52     nan        0.00

Table 5.5: Nighttime IoU > 0.5

id  class name  dataset  AP (%)  TP     FP    FN     Precision  Recall
1   person      RGB      15.8    1689   180   8997   0.90       0.16
                AUG      18.7    2004   113   8682   0.95       0.19
                IR       39.3    4210   270   6476   0.94       0.39
2   bicycle     RGB      12.0    85     32    621    0.73       0.12
                AUG      9.3     66     25    640    0.73       0.09
                IR       7.4     52     7     654    0.88       0.07
3   car         RGB      33.2    4162   581   8258   0.88       0.34
                AUG      28.9    3600   253   8820   0.93       0.29
                IR       31.1    3874   227   8546   0.94       0.31
17  dog         RGB      0.00    0      1     52     0.00       0.00
                AUG      nan     0      0     52     nan        0.00
                IR       nan     0      0     52     nan        0.00

Table 5.6: Nighttime IoU > 0.25



Results           Dataset  mAP
All, IoU 0.5      RGB      35.6
                  AUG      30.9
                  IR       29.9
Day, IoU 0.5      RGB      44.5
                  AUG      36.9
                  IR       29.6
Night, IoU 0.5    RGB      19.5
                  AUG      20.4
                  IR       31.0
All, IoU 0.25     RGB      48.9
                  AUG      38.6
                  IR       35.6
Day, IoU 0.25     RGB      60.9
                  AUG      46.7
                  IR       36.2
Night, IoU 0.25   RGB      27.9
                  AUG      25.1
                  IR       35.2

Table 5.7: mAP for each results table

Tables 5.1 and 5.2 are calculated across the entire dataset using IoU > 0.5
and IoU > 0.25 respectively. Meanwhile, Tables 5.3 and 5.4 are calculated for
the 4,891 daytime images, and Tables 5.5 and 5.6 across the 2,886 nighttime
images, both sets of tables calculated for IoU > 0.5 and IoU > 0.25 respectively.
The mAP (mean Average Precision) for each table is calculated by com-
puting a weighted average of the AP for each class, based on the number of
ground-truth instances of the class. The mAP for each result table is displayed
in Table 5.7.
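As a sketch, this weighting could be implemented as follows (a hypothetical helper, not code from this work):

    def weighted_map(ap_per_class, gt_per_class):
        """mAP as the average of per-class AP values, weighted by the number of
        ground-truth instances of each class."""
        total = sum(gt_per_class[c] for c in ap_per_class)
        return sum(ap_per_class[c] * gt_per_class[c] for c in ap_per_class) / total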
TP + FN should equal the number of positive examples in the dataset, and
a quick check confirms that TP + FN for each image and class is constant,
with ground-truth 21,119 people, 3,693 bicycles, 37,380 cars, and 207 dogs
in total. Since 7,777 images were used in testing, this means on average 2.7
people, 0.5 bicycles and 4.8 cars per image, which seems reasonable.
Using AP as a measure of performance of the network on each dataset,
we observe that when calculated over all images the performance on the RGB
dataset is superior for all classes except the person class, where it is outper-
formed by the AUG and IR datasets for IoU > 0.5, and only the IR dataset for
the less strict IoU > 0.25. This is reflected in the mAP, which is highest on the
RGB dataset, next highest on the AUG dataset and lowest on the IR dataset for
all IoU values.
For daytime images, the network performs better on the RGB dataset than
over all images or nighttime images, as expected. It still performs better at
person detection on the IR dataset than on the RGB dataset for IoU > 0.5, but
for IoU > 0.25 this is reversed and the network performs significantly better on
the RGB dataset, by 9.2% AP.
For nighttime images, the general trend reverses, with mAP for the IR
dataset the highest for both IoU > 0.5 and IoU > 0.25. For IoU > 0.5, the
network performs better on the AUG dataset than on the RGB dataset, whilst
for IoU > 0.25 it performs better on the RGB dataset than on the AUG dataset.

The IR and AUG datasets see much higher precision than the RGB dataset
for all classes.
Comparing IoU > 0.5 with IoU > 0.25, we see that the RGB dataset sees a
proportionally higher rise in AP when reducing the required IoU than the
other datasets. For the person class, the AP for RGB rises by 11.6%, compared
to 5.5% for AUG and 4.2% for IR. For bicycle, it rises by 17.7% compared to
12.9% and 8.2%, for cars by 14.9% compared to 8.6% and 6.7%, and for dogs
by 2.9% compared to 1.5% and 0.0%.

5.5 Data Augmentation Conclusion


Potentially, the error introduced by the warping of annotations from IR to RGB
space hinders the ability of the network to perform well on the RGB and AUG
datasets. This hypothesis is based on the fact that the results see a propor-
tionally higher rise in performance (according to mAP, AP and Recall values)
on the RGB and AUG datasets than over the IR dataset in most cases at the
lower IoU value of 0.25 compared to 0.5. However, decreasing the required
IoU value for a TP result rewards less precise guesses by the network, and for
RGB data the network makes considerably more FP predictions than for the
other two datasets i.e. the network is much more precise over the AUG and
IR datasets than the RGB dataset. Therefore, it is not clear whether these dis-
proportionate increases in accuracy are due to the error introduced by the map-
ping of annotations or to the higher precision of the network on the AUG and IR
datasets; consequently, where results could lead to different conclusions
depending on the IoU value, we cannot decisively conclude either way.
From the results, we can definitively say that the information in IR images
is beneficial to the prediction accuracy for people of a pre-trained RGB net-
work during the night, with the network performing better at detecting people
on both the IR and AUG datasets than on the RGB dataset. During daytime,
there are some qualitative examples of the network detecting a person in the
IR image that it missed in the RGB image, however the quantitative results are
inconclusive since the results differ depending on the IoU value used.
For the other 3 classes, the network proved significantly better at detect-
ing objects in the RGB images during daytime. At nighttime, performances
over all three datasets were comparable, with performance on the RGB im-
ages slightly better on average. Performance on the AUG dataset was typi-
cally somewhere in-between performance on the RGB and IR datasets, with
IR results better at night and RGB better during the day.
The results from this stage of the thesis allow us to directly answer the
research questions with regard to the effect of augmenting RGB images with IR
data on the inference performance of a deep neural network trained solely on
RGB data.
Augmenting images with IR data has the potential to improve the inference
accuracy of a neural network trained on RGB data with no additional training.
Whilst the augmentation strategy used in this work results in an on-average
poorer performance of the network, it does however show improved pedestrian
detection during the night and higher precision in its predictions across the
board when utilising the augmented images.
At night, the network has considerably better performance when applied
to raw IR images than either corresponding RGB or AUG images. Whilst on
average performance is adversely affected, for the specific task of person detec-
tion, and in qualitative examples, performance is improved using the additional
modality.
Overall, considering the results attained, better performance can be ob-
tained by augmenting images with the additional data at night, although on
the FLIR dataset and with the image fusion method utilised in this paper, bet-
ter results would be achieved by switching entirely to the IR camera during the
night.

5.5.1 Future work


Work beyond this thesis includes developing and exploring the effectiveness
of applying transfer-learning principles to the pre-trained networks, using the
pre-trained network as a base and training it further with augmented and/or IR
images. This would allow the best of both worlds, utilising prior knowledge
about the structure of objects in images in the RGB space from the vast amount
of data available, but also specialising for the domain of IR-augmented images.
The image fusion method used in this paper is incredibly basic, removing
low-intensity values from the IR image using Otsu thresholding and perform-
ing a weighted addition between the RGB and mapped IR image. The extent to
which this fusion method can be improved or optimised is unknown, though it is
clear that more sophisticated approaches than the one chosen exist. Fully op-
timising the fusion process would require reverse engineering object features
defined by the training of the network, at which point training a network with
both spectra seems a better option. The quality of the image fusion method is
in any case constrained by the quality of the image registration.
The network weights used were generated by training for the 80 MS COCO
classes, although only the 4 classes most relevant for ADAS and annotated
in the FLIR dataset are used in this study. A more specifically trained
RGB network may see better results at night with IR, however equally, it could
be that more generally trained networks learn features which generalise better
across the two spectra.
The long-wave infrared (LWIR) camera used to capture the IR images in
the FLIR dataset is of very high quality, which makes the method used in this
work more viable since the capture resolution is similar to that of typical RGB
images used to train deep networks. For poorer quality IR cameras, it could
be possible to use the additional data in a different way which provides more
effective results. For example, prediction accuracy could be enhanced by ap-
plying Bayesian inference utilising the temperature information within a pre-
dicted object’s bounding box to respectively increase or decrease the prediction
probability for that object.
Chapter 6

Conclusion

Combining the results from each of the three stages provides an answer to
whether it is viable to integrate an infrared camera into an existing embedded
computer vision design to improve the detection accuracy of a pre-trained RGB
network.
In short, the answer is yes, it is viable. The image registration method used
is effective and fast, easily implemented in hardware with HLS, and easy to
swap in place of a software implementation with the PYNQ frame-
work. The network performance on augmented images is more precise, and
achieves better person detection than on the base RGB images at night, al-
though at night using the raw IR image sees even better results than using the
augmented images. During the day, the plain RGB images are best to use,
although the extra IR information helps person detection in some cases.
To us, these results hint at a multitude of ways in which additional IR data
could be useful in embedded computer vision applications that are worth in-
vestigating. These range from simple to complex: using IR hotspots to adjust
predictions, simply switching to IR in poor lighting conditions, or creating a
lightweight, parallel branch of the RGB network solely for person detection
which utilises both RGB and IR data.
Integrating IR is most useful for person detection; the results did not see
any improvement in the detection of bicycles, cars or dogs using the addi-
tional modality. However, since cars are usually well lit with headlights, bi-
cycles are only relevant when they are being ridden by a person, and dogs are
usually leashed to a person, this paper demonstrates an excellent method to
improve the effectiveness of a real-time embedded system exposed to objects
with identifiable IR profiles, such as people or animals, in poor visible lighting
conditions.


The most important contribution of this work, beyond demonstrating an
alternative way to improve the performance of an embedded deep object detec-
tion network, is that we show that an RGB network can perform significantly
better at person detection in IR images despite having been trained entirely
with RGB examples. This is hugely relevant to all autonomous driving appli-
cations, whether using GPU-, ASIC- or FPGA-implemented neural networks,
as a way to improve accuracy of person detection in poor visible light condi-
tions.
Whilst it is still disputed whether autonomous driving applications should
focus on more vision-like sensor systems as opposed to radar and LiDAR, it
is worth noting that humans drive extremely competently whilst being almost
solely reliant on two visible spectrum cameras. The view that self-driving sen-
sor suites should focus on visible spectrum cameras instead of LiDAR is held
by Tesla CEO Elon Musk and Director of AI Andrej Karpathy, who believe
LiDAR to be an ineffective and expensive crutch in the place of actual visual
understanding [50].
However, since there are clearly situations where visible cameras fail, the
IR spectrum is a good spectral alternative to the visible spectrum, over ultra-
sound, radar or LiDAR. The latter sensors are used for their ability to ‘see
through’ adverse conditions such as rain, fog or dust, which is equally possi-
ble with IR. IR also provides more visually relevant data which, as it has been
shown in this thesis, can benefit from the huge amount of visual learning al-
ready performed on the visible spectrum. Visual recognition is considerably
easier for humans on IR images versus ultrasound, radar or LiDAR point maps,
aiding manual data annotation, and providing better data for network learning
– differentiating between a plastic bag blowing across the road which the car
should ignore versus something it should emergency stop for is considerably
easier using visual IR data than distance-reading point maps.
This leads us to assert that IR is one of the best spectral complements to the
visible spectrum for ADAS, since it provides much more visual information
than ultrasound, radar or LiDAR, and works well in situations where visible
cameras do not, such as glare, fog or darkness.
Bibliography

[1] Anton S Kornilov and Ilia V Safonov. An Overview of Watershed Algorithm Implementations in Open Source Libraries. 2018. doi: 10.3390/jimaging4100123.

[2] R. Colin Johnson. Microsoft, Google Beat Humans at Image Recognition. Feb. 18, 2015. url: https://www.eetimes.com/document.asp?doc_id=1325712 (visited on June 8, 2019).

[3] Kaiming He et al. “Mask R-CNN”. In: CoRR abs/1703.06870 (2017). arXiv: 1703.06870. url: http://arxiv.org/abs/1703.06870.

[4] Joseph Redmon et al. “You Only Look Once: Unified, Real-Time Object Detection”. In: CoRR abs/1506.02640 (2015). arXiv: 1506.02640. url: http://arxiv.org/abs/1506.02640.

[5] Norman P Jouppi et al. “In-datacenter performance analysis of a tensor processing unit”. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE. 2017, pp. 1–12.

[6] Alison DeNisco Rayome. The 10 most popular machine learning frameworks used by data scientists. Sept. 14, 2018. url: https://www.techrepublic.com/article/the-10-most-popular-machine-learning-frameworks-used-by-data-scientists/ (visited on June 8, 2019).

[7] Xilinx. PYNQ: Python Productivity for Zynq. url: http://www.pynq.io (visited on June 10, 2019).

[8] Andrea Leopardi. “Convolutional Neural Network Quantisation for Accelerating Inference in Visual Embedded Systems”. 2018.

[9] Henrik Johansson and Carl Ahlberg. Evaluating Vivado High-Level Synthesis on OpenCV Functions for the Zynq-7000 FPGA. 2015.

[10] Michael Bedford Taylor. “Bitcoin and the age of bespoke silicon”. In: 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES). IEEE. 2013, pp. 1–10.

[11] Jean Baptiste Su. Why Tesla Dropped Nvidia’s AI Platform For Self-Driving Cars And Built Its Own. Aug. 15, 2018. url: https://www.forbes.com/sites/jeanbaptiste/2018/08/15/why-tesla-dropped-nvidias-ai-platform-for-self-driving-cars-and-built-its-own/#72ae0cf67228 (visited on June 8, 2019).

[12] Zhiping Dan et al. “A Transfer Knowledge Framework for Object Recognition of Infrared Image”. In: Communications in Computer and Information Science 363 (Apr. 2013), pp. 209–214. doi: 10.1007/978-3-642-37149-3_25.

[13] Markus Jangblad. “Object Detection in Infrared Images using Deep Convolutional Neural Networks”. PhD thesis. Uppsala Universitet, 2018.

[14] Jionghui Jiang et al. “Multi-spectral RGB-NIR image classification using double-channel CNN”. In: IEEE Access PP (Jan. 2019), pp. 1–1. doi: 10.1109/ACCESS.2019.2896128.

[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., 2012, pp. 1097–1105. url: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

[16] Fabián Inostroza et al. “Embedded registration of visible and infrared images in real time for noninvasive skin cancer screening”. In: Microprocessors and Microsystems 55 (2017), pp. 70–81. issn: 01419331. doi: 10.1016/j.micpro.2017.09.006. url: http://dx.doi.org/10.1016/j.micpro.2017.09.006.

[17] Jorge Hiraiwa and Hideharu Amano. “An FPGA implementation of reconfigurable real-time vision architecture”. In: Proceedings - 27th International Conference on Advanced Information Networking and Applications Workshops, WAINA 2013. 2013, pp. 150–155. isbn: 9780769549521. doi: 10.1109/WAINA.2013.131.

[18] B. Özgül et al. “Software-programmable digital pre-distortion on the Zynq SoC”. In: 2013 IFIP/IEEE 21st International Conference on Very Large Scale Integration (VLSI-SoC). Oct. 2013, pp. 288–289. doi: 10.1109/VLSI-SoC.2013.6673292.

[19] Jiayi Ma, Yong Ma, and Chang Li. “Infrared and visible image fusion methods and applications: A survey”. In: Information Fusion 45 (2019), pp. 153–178. issn: 15662535. doi: 10.1016/j.inffus.2018.02.004.

[20] Faysal Boughorbel et al. “Gaussian fields: A new criterion for 3D rigid registration”. In: Pattern Recognition 37.7 (2004), pp. 1567–1571. issn: 00313203. doi: 10.1016/j.patcog.2004.02.005.

[21] Gang Wang, Qiangqiang Zhou, and Yufei Chen. “Robust non-rigid point set registration using spatially constrained Gaussian fields”. In: IEEE Transactions on Image Processing 26.4 (2017), pp. 1759–1769. issn: 10577149. doi: 10.1109/TIP.2017.2658947.

[22] Jiayi Ma et al. “Non-rigid visible and infrared face registration via regularized Gaussian fields criterion”. In: Pattern Recognition 48.3 (2015), pp. 772–784. issn: 0031-3203. doi: https://doi.org/10.1016/j.patcog.2014.09.005. url: http://www.sciencedirect.com/science/article/pii/S0031320314003471.

[23] Gilles Rabatel and Sylvain Labbé. “Registration of visible and near infrared unmanned aerial vehicle images based on Fourier-Mellin transform”. In: Precision Agriculture 17.5 (2016), pp. 564–587. issn: 15731618. doi: 10.1007/s11119-016-9437-x.

[24] B. Srinivasa Reddy and B. N. Chatterji. “An FFT-based technique for translation, rotation, and scale-invariant image registration”. In: IEEE Transactions on Image Processing 5.8 (1996), pp. 1266–1271. issn: 10577149. doi: 10.1109/83.506761.

[25] Qing Zhou et al. “Image Registration Method Based On Edge Phase Correlation Algorithm”. In: Proceedings of the 2016 3rd International Conference on Materials Engineering, Manufacturing Technology and Control (2016), pp. 1590–1598. doi: 10.2991/icmemtc-16.2016.304. url: http://www.atlantis-press.com/php/paper-details.php?id=25852406.

[26] Jian Zhao and Sen-ching S. Cheung. “Human segmentation by geometrically fusing visible-light and thermal imageries”. In: Multimedia Tools and Applications 73.1 (Nov. 2014), pp. 61–89. issn: 1573-7721. doi: 10.1007/s11042-012-1299-2. url: https://doi.org/10.1007/s11042-012-1299-2.

[27] Pier-Luc St-Onge and Guillaume-Alexandre Bilodeau. “Visible and Infrared Sensors Fusion by Matching Feature Points of Foreground Blobs”. In: Advances in Visual Computing. Ed. by George Bebis et al. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 1–10. isbn: 978-3-540-76856-2.

[28] Weiping Yang et al. “Efficient registration of optical and infrared images via modified Sobel edging for plant canopy temperature estimation”. In: Computers and Electrical Engineering 38.5 (2012), pp. 1213–1221. issn: 00457906. doi: 10.1016/j.compeleceng.2012.05.014.

[29] Xiang Yi et al. “Registration of infrared and visible images based on the correlation of the edges”. In: Proceedings of the 2013 6th International Congress on Image and Signal Processing, CISP 2013 (2013), pp. 990–994. doi: 10.1109/CISP.2013.6745309.

[30] Stephen Krotosky and Mohan Trivedi. “Multimodal Stereo Image Registration for Pedestrian Detection”. In: 2006 IEEE Intelligent Transportation Systems Conference (2006). doi: 10.1109/ITSC.2006.1706727.

[31] Tarek Mouats and Nabil Aouf. “Multimodal stereo correspondence based on phase congruency and edge histogram descriptor”. In: 2013 IEEE International Conference on Information Fusion (2013), pp. 1981–1987. url: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6641248.

[32] P. Kovesi. “Image Features from Phase Congruency”. In: Technical Report 1.3 (1999), pp. C3–C3. issn: 1041-1135.

[33] Jungong Han, Eric Pauwels, and Paul De Zeeuw. “Visible and infrared image registration employing line-based geometric analysis”. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 7252 LNCS. 2012, pp. 114–125. isbn: 9783642324352. doi: 10.1007/978-3-642-32436-9_10.

[34] Jamie P. Heather and Moira I. Smith. “Multimodal image registration with applications to image fusion”. In: 2005 7th International Conference on Information Fusion, FUSION (2005), pp. 372–379. doi: 10.1109/ICIF.2005.1591879.

[35] OpenCV Team. About OpenCV (Open Source Computer Vision Library). url: https://opencv.org/about/ (visited on June 11, 2019).

[36] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. 2nd ed. New York, NY, USA: Cambridge University Press, 2003. isbn: 0521540518.

[37] José J. Guerrero and Carlos Sagüés. Robust Line Matching and Estimate of Homographies Simultaneously. 2010. doi: 10.1007/978-3-540-44871-6_35.

[38] Huan Wu et al. “Image registration of infrared and visible based on SIFT and SURF”. In: Aug. 2018, p. 186. doi: 10.1117/12.2503048.

[39] Nicolas Vayatis. “Global optimization of Lipschitz functions”. 2017. arXiv: 1703.02628v3.

[40] Davis King. A Global Optimization Algorithm Worth Using. Dec. 28, 2017. url: http://blog.dlib.net/2017/12/a-global-optimization-algorithm-worth.html (visited on Feb. 17, 2019).

[41] Wim Meeus et al. An overview of today’s high-level synthesis tools. 2012. doi: 10.1007/s10617-012-9096-8.

[42] Xilinx. Vivado High-Level Synthesis (HLS). url: https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html (visited on June 11, 2019).

[43] TUL. PYNQ-Z2. url: http://www.tul.com.tw/ProductsPYNQ-Z2.html (visited on June 10, 2019).

[44] Samuel Dodge and Lina Karam. A Study and Comparison of Human and Deep Learning Recognition Performance Under Visual Distortions. arXiv: 1705.02498v1.

[45] Thimira Amaratunga. How ‘deep’ should it be to be called Deep Learning? Sept. 5, 2017. url: https://towardsdatascience.com/how-deep-should-it-be-to-be-called-deep-learning-a7b1a6ab5610 (visited on June 11, 2019).

[46] Jonathan Hui. mAP (mean Average Precision) for Object Detection. Mar. 7, 2018. url: https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173 (visited on Apr. 28, 2019).

[47] Joseph Redmon and Ali Farhadi. “YOLOv3: An Incremental Improvement”. In: CoRR abs/1804.02767 (2018). arXiv: 1804.02767. url: http://arxiv.org/abs/1804.02767.

[48] Joseph Redmon and Ali Farhadi. “YOLO9000: Better, Faster, Stronger”. In: CoRR abs/1612.08242 (2016). arXiv: 1612.08242. url: http://arxiv.org/abs/1612.08242.

[49] Deng Yuan Huang, Ta Wei Lin, and Wu Chih Hu. “Automatic multilevel thresholding based on two-stage Otsu’s method with cluster determination by valley estimation”. In: International Journal of Innovative Computing, Information and Control 7.10 (2011), pp. 5631–5644. issn: 13494198.

[50] Michael K. Spencer. To LiDAR or to Tesla. May 9, 2019. url: https://medium.com/artificial-intelligence-network/to-lidar-or-to-tesla-5d3c2ab254c3 (visited on June 12, 2019).

Appendix A

TUL PYNQ-Z2 Product Brief


Introducing TUL PYNQ™-Z2

ZYNQ XC7Z020-1CLG400C
• 650MHz dual-core Cortex-A9 processor
• DDR3 memory controller with 8 DMA channels and 4 High Performance AXI3 Slave ports
• High-bandwidth peripheral controllers: 1G Ethernet, USB 2.0, SDIO
• Low-bandwidth peripheral controller: SPI, UART, CAN, I2C
• Programmable from JTAG, Quad-SPI flash, and microSD card
• Programmable logic equivalent to Artix-7 FPGA:
  • 13,300 logic slices, each with four 6-input LUTs and 8 flip-flops
  • 630 KB of fast block RAM
  • 4 clock management tiles, each with a phase-locked loop (PLL) and mixed-mode clock manager (MMCM)
  • 220 DSP slices
  • On-chip analog-to-digital converter (XADC)

Memory
• 512MB DDR3 with 16-bit bus @ 1050Mbps
• 16MB Quad-SPI Flash with factory programmed 48-bit globally unique EUI-48/64™ compatible identifier
• microSD slot

Power
• Powered from USB or 7V-15V external power source

USB and Ethernet
• Gigabit Ethernet PHY
• Micro USB-JTAG Programming circuitry
• Micro USB-UART bridge
• USB 2.0 OTG PHY (supports host only)

Audio and Video
• HDMI sink port (input)
• HDMI source port (output)
• I2S interface with 24bit DAC with 3.5mm TRRS jack
• Line-in with 3.5mm jack

Switches, Push-buttons and LEDs
• 4 push-buttons
• 2 slide switches
• 4 LEDs
• 2 RGB LEDs

Expansion Connectors
• Two standard Pmod ports: 16 Total FPGA I/O (8 shared pins with Raspberry Pi connector)
• Arduino Shield connector: 24 Total FPGA I/O, 6 Single-ended 0-3.3V Analog inputs to XADC
• Raspberry Pi connector: 28 Total FPGA I/O (8 shared pins with Pmod A port)

TUL PYNQ-Z2 Product Specification

Part number:  1M4-M000127000
EAN:          TUL PYNQ-Z2 | 4713436170785
Processor:    Dual-Core ARM Cortex-A9
FPGA:         1.3M reconfigurable gates
Memory:       512MB DDR3 / 128Mbit FLASH
Storage:      Micro SD card slot
Video:        HDMI In / HDMI Out
Audio:        HP+Mic, Line in, ADAU1761 AUDIO codec
Network:      10/100/1000 Ethernet
Expansion:    USB Host connected to ARM PS
Interfaces:   Arduino Shield connector, Raspberry Pi connector, GPIO: 2 Pmod ports
Other I/O:    6 User LEDs, 4 Push-buttons, 2 Slide Switches
Dimensions:   87mm x 140mm


Appendix B

Timing Analysis of Warping


Figure B.1: Timing analysis of HLS warp


Appendix C

Detection on RGB/AUG/IR Images

Figure C.1: FLIR_04580.jpg – Note imperfect annotation mapping to RGB.

(a) RGB (b) IR

(c) AUG
Figure C.2: FLIR_05828.jpg – Note IR is better than RGB and AUG.

(a) RGB (b) IR

(c) AUG
Figure C.3: FLIR_07409.jpg – Note AUG better than both RGB and IR in this case.

(a) RGB (b) IR

(c) AUG
Figure C.4: FLIR_07464.jpg – Note more car detections in AUG than in RGB or IR.

(a) RGB (b) IR

(c) AUG
Figure C.5: FLIR_08175.jpg – Note IR better than AUG, AUG better than RGB.

(a) RGB (b) IR

(c) AUG
Figure C.6: FLIR_06426.jpg – Note RGB better than AUG, AUG better than IR.

(a) RGB (b) IR

(c) AUG
www.kth.se
