Deep Learning in Computer Vision - Principles and Applications
Digital Imaging and Computer Vision Series
Series Editor
Rastislav Lukac
Foveon, Inc./Sigma Corporation San Jose, California, U.S.A.
Dermoscopy Image Analysis
by M. Emre Celebi, Teresa Mendonça, and Jorge S. Marques
Semantic Multimedia Analysis and Processing
by Evaggelos Spyrou, Dimitris Iakovidis, and Phivos Mylonas
Microarray Image and Data Analysis: Theory and Practice
by Luis Rueda
Perceptual Digital Imaging: Methods and Applications
by Rastislav Lukac
Image Restoration: Fundamentals and Advances
by Bahadir Kursat Gunturk and Xin Li
Image Processing and Analysis with Graphs: Theory and Practice
by Olivier Lézoray and Leo Grady
Visual Cryptography and Secret Image Sharing
by Stelvio Cimato and Ching-Nung Yang
Digital Imaging for Cultural Heritage Preservation: Analysis,
Restoration, and Reconstruction of Ancient Artworks
by Filippo Stanco, Sebastiano Battiato, and Giovanni Gallo
Computational Photography: Methods and Applications
by Rastislav Lukac
Super-Resolution Imaging
by Peyman Milanfar
Deep Learning in
Computer Vision
Principles and Applications
Edited by
Mahmoud Hassaballah and Ali Ismail Awad
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have
been made to publish reliable data and information, but the author and publisher cannot assume responsibility
for the validity of all materials or the consequences of their use. The authors and publishers have attempted to
trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if
permission to publish in this form has not been obtained. If any copyright material has not been acknowledged
please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system, with-
out written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive,
Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration
for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate
system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Foreword
Deep learning, while it has multiple definitions in the literature, can be defined as
“inference of model parameters for decision making in a process mimicking the
understanding process in the human brain”; or, in short: “brain-like model identification”.
We can say that deep learning is a way of data inference in machine
learning, and the two together are among the main tools of modern artificial intelligence.
Novel technologies away from traditional academic research have fueled
R&D in convolutional neural networks (CNNs); companies like Google, Microsoft,
and Facebook ignited the “art” of data manipulation, and the term “deep learning”
became almost synonymous with decision making.
Various CNN structures have been introduced and invoked in many computer
vision-related applications, with greatest success in face recognition, autonomous
driving, and text processing. The reality is: deep learning is an art, not a science.
This state of affairs will remain until its developers develop the theory behind its
functionality, which would lead to “cracking its code” and explaining why it works,
and how it can be structured as a function of the information gained with data. In
fact, with deep learning, there is good and bad news. The good news is that the indus-
try—not necessarily academia—has adopted it and is pushing its envelope. The bad
news is that the industry does not share its secrets. Indeed, industries are never inter-
ested in procedural and textbook-style descriptions of knowledge.
This book, Deep Learning in Computer Vision: Principles and Applications—as
a journey in the progress made through deep learning by academia—confines itself
to deep learning for computer vision, a domain that studies sensory information
used by computers for decision making, and has had its impacts and drawbacks for
nearly 60 years. Computer vision has been and continues to be a system: sensors,
computer, analysis, decision making, and action. This system takes various forms,
and the flow of information within its components does not necessarily proceed in tandem. The
linkages between computer vision and machine learning, and between it and artificial
intelligence, are very fuzzy, as is the linkage between computer vision and
deep learning. Computer vision has moved forward, showing amazing progress in
its short history. During the sixties and seventies, computer vision dealt mainly with
capturing and interpreting optical data. In the eighties and nineties, geometric computer
vision added science (geometry plus algorithms) to computer vision. During
the first decade of the new millennium, modern computing contributed to the evolu-
tion of object modeling using multimodality and multiple imaging. By the end of
that decade, a lot of data became available, and so the term “deep learning” crept
into computer vision, as it did into machine learning, artifcial intelligence, and other
domains.
This book shows that traditional applications in computer vision can be solved
through invoking deep learning. The applications addressed and described in the
eleven different chapters have been selected in order to demonstrate the capabilities
of deep learning algorithms to solve various issues in computer vision. The content
of this book has been organized such that each chapter can be read independently
of the others. Chapters of the book cover the following topics: accelerating the CNN
inference on field-programmable gate arrays, fire detection in surveillance applications,
face recognition, action and activity recognition, semantic segmentation for
autonomous driving, aerial imagery registration, robot vision, tumor detection, and
skin lesion segmentation as well as skin melanoma classification.
From the assortment of approaches and applications in the eleven chapters, the
common thread is that deep learning-based identification of CNN models provides accuracy
beyond traditional approaches. This accuracy is attributed to the flexibility of CNNs
and the availability of large data to enable identification through the deep learning
strategy. I would expect the content of this book to be welcomed worldwide by
graduate and postgraduate students and workers in computer vision, including practitioners
in academia and industry. Additionally, professionals who want to explore
the advances in concepts and implementation of deep learning algorithms applied to
computer vision may find in this book an excellent guide for such a purpose. Finally,
I hope that readers will find the presented chapters in the book interesting and
inspiring to future research, from both theoretical and practical viewpoints, spurring
further advances in discovering the secrets of deep learning.
Preface
Mahmoud Hassaballah
Qena, Egypt
Contributors
Ahmad El Sallab
Valeo Company
Cairo, Egypt

Ahmed Nassar
IRISA Institute
Rennes, France

Alaa S. Al-Waisy
University of Bradford
Bradford, UK

Ali Ismail Awad
Luleå University of Technology
Luleå, Sweden
and
Al-Azhar University
Qena, Egypt

Amin Ullah
Sejong University
Seoul, South Korea

Ashraf A. M. Khalaf
Minia University
Minia, Egypt

François Berry
University Clermont Auvergne
Clermont-Ferrand, France

Guanghui Wang
University of Kansas
Kansas City, Kansas

Hesham F.A. Hamed
Egyptian Russian University
Cairo, Egypt
and
Minia University
Minia, Egypt

Javier Ruiz-Del-Solar
University of Chile
Santiago, Chile

Kaidong Li
University of Kansas
Kansas City, Kansas

Kamel Abdelouahab
Clermont Auvergne University
Clermont-Ferrand, France

Khalid M. Hosny
Zagazig University
Zagazig, Egypt

Khan Muhammad
Sejong University
Seoul, South Korea

Mahmoud Hassaballah
South Valley University
Qena, Egypt

Mahmoud Khaled Abd-Ellah
Al-Madina Higher Institute for Engineering and Technology
Giza, Egypt
1 Accelerating the CNN Inference on FPGAs
CONTENTS
1.1 Introduction ......................................................................................................2
1.2 Background on CNNs and Their Computational Workload ............................3
1.2.1 General Overview.................................................................................3
1.2.2 Inference versus Training ..................................................................... 3
1.2.3 Inference, Layers, and CNN Models ....................................................3
1.2.4 Workloads and Computations...............................................................6
1.2.4.1 Computational Workload .......................................................6
1.2.4.2 Parallelism in CNNs ..............................................................8
1.2.4.3 Memory Accesses ..................................................................9
1.2.4.4 Hardware, Libraries, and Frameworks ................................ 10
1.3 FPGA-Based Deep Learning.......................................................................... 11
1.4 Computational Transforms ............................................................................. 12
1.4.1 The im2col Transformation ................................................................ 13
1.4.2 Winograd Transform .......................................................................... 14
1.4.3 Fast Fourier Transform ....................................................................... 16
1.5 Data-Path Optimizations ................................................................................ 16
1.5.1 Systolic Arrays.................................................................................... 16
1.5.2 Loop Optimization in Spatial Architectures ...................................... 18
Loop Unrolling ................................................................................... 19
Loop Tiling .........................................................................................20
1.5.3 Design Space Exploration................................................................... 21
1.5.4 FPGA Implementations ...................................................................... 22
1.6 Approximate Computing of CNN Models ..................................................... 23
1.6.1 Approximate Arithmetic for CNNs.................................................... 23
1.6.1.1 Fixed-Point Arithmetic ........................................................ 23
1.6.1.2 Dynamic Fixed Point for CNNs...........................................28
1.6.1.3 FPGA Implementations ....................................................... 29
1.6.1.4 Extreme Quantization and Binary Networks....................... 29
1.6.2 Reduced Computations....................................................................... 30
1.6.2.1 Weight Pruning .................................................................... 31
1.6.2.2 Low Rank Approximation ................................................... 31
1.6.2.3 FPGA Implementations ....................................................... 32
1.7 Conclusions..................................................................................................... 32
Bibliography ............................................................................................................ 33
1.1 INTRODUCTION
The exponential growth of big data during the last decade motivates innovative
methods to extract high-level semantic information from raw sensor data such as videos,
images, and speech sequences. Among the proposed methods, convolutional neural
networks (CNNs) [1] have become the de facto standard by delivering near-human
accuracy in many applications related to machine vision (e.g., classification [2],
detection [3], segmentation [4]) and speech recognition [5].
This performance comes at the price of a large computational cost, as CNNs
require up to 38 GOPs to classify a single frame [6]. As a result, dedicated hardware
is required to accelerate their execution. Graphics processing units (GPUs)
are the most widely used platform to implement CNNs, as they offer the best performance
in terms of pure computational throughput, reaching up to 11 TFLOPs
[7]. Nevertheless, in terms of power consumption, field-programmable gate array
(FPGA) solutions are known to be more energy efficient than GPUs. While GPU
implementations have demonstrated state-of-the-art computational performance,
CNN acceleration will soon be moving towards FPGAs for two reasons. First,
recent improvements in FPGA technology put FPGA performance within striking
distance of GPUs, with a reported performance of 9.2 TFLOPs for the latest devices [8].
Second, recent trends in CNN development increase the sparsity of CNNs and
use extremely compact data types. These trends favor FPGA devices, which are
designed to handle irregular parallelism and custom data types. As a result, next-generation
CNN accelerators are expected to deliver up to 5.4× better computational
throughput than GPUs [7].
As an inflection point in the development of CNN accelerators might be near, we
conduct a survey on FPGA-based CNN accelerators. While a similar survey can be
found in [9], we focus in this chapter on the recent techniques that were not covered
in the previous works. In addition to this chapter, we refer the reader to the work
of Venieris et al. [10], which reviews the toolflows automating the CNN mapping
process, and to the works of Sze et al., which focus on ASICs for deep learning
acceleration.
The amount and diversity of research on the subject of CNN FPGA acceleration
within the last 3 years demonstrate the tremendous industrial and academic interest.
This chapter presents a state-of-the-art review of CNN inference accelerators over
FPGAs. The computational workloads, their parallelism, and the involved memory
accesses are analyzed. At the level of neurons, optimizations of the convolutional
and fully connected (FC) layers are explained and the performances of the differ-
ent methods compared. At the network level, approximate computing and data-path
optimization methods are covered and state-of-the-art approaches compared. The
methods and tools investigated in this survey represent the recent trends in FPGA
CNN inference accelerators and will fuel the future advances in efficient hardware
deep learning.
TABLE 1.1
Tensors Involved in the Inference of a Given Layer ℓ with Their Dimensions

X   Input FMs        B × C × H × W      B       Batch size (number of input frames)
Y   Output FMs       B × N × V × U      W/H/C   Width/Height/Depth of input FMs
Θ   Learned filters  N × C × J × K      U/V/N   Width/Height/Depth of output FMs
β   Learned biases   N                  K/J     Horizontal/Vertical kernel size
A convolutional layer (conv) carries out the feature extraction process by applying – as
illustrated in Figure 1.1 – a set of three-dimensional convolution filters Θ^conv to a set
of B input volumes X^conv. Each input volume has a depth C and can be a color image
(in the case of the first conv layer), or an output generated by previous layers in the
network. Applying a three-dimensional filter to a three-dimensional input results in
a two-dimensional feature map (FM). Thus, applying N three-dimensional filters in a
layer results in a three-dimensional output with a depth N.
In some CNN models, a learned offset β^conv – called a bias – is added to the processed
feature maps. However, this practice has been discarded in recent models [6]. The
computations involved in the feed-forward propagation of conv layers are detailed in
Equation 1.1.
Y^{conv}[b, n, v, u] = \beta^{conv}[n] + \sum_{c=1}^{C} \sum_{j=1}^{J} \sum_{k=1}^{K} X^{conv}[b, c, v + j, u + k] \times \Theta^{conv}[n, c, j, k]   (1.1)
One may note that applying a depth convolution to a 3D input boils down to applying
a mainstream 2D convolution to each of the 2D channels of the input, then, at each
point, summing the results across all the channels, as shown in Equation 1.2.
FIGURE 1.1 Feed-forward propagation in conv, act, and pool layers (batch size B = 1, bias
β omitted).
Accelerating the CNN Inference on FPGAs 5
\forall n \in [1, N], \quad Y[n]^{conv} = \beta^{conv}[n] + \sum_{c=1}^{C} \mathrm{conv2D}\big(X[c]^{conv}, \Theta[c]^{conv}\big)   (1.2)
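As a concrete illustration of Equations 1.1 and 1.2, the following minimal sketch (in C) implements the feed-forward pass of one conv layer for a single frame; the batch dimension is omitted, stride is 1, there is no padding, and the flat CHW/NCJK memory layout is an assumption made only for this example.

// Minimal sketch of Equations 1.1 and 1.2: a conv layer as nested
// multiply-accumulate loops for one frame (stride 1, no padding).
void conv_layer(const float *X,      /* input FMs,  C x H x W     */
                const float *Theta,  /* filters,    N x C x J x K */
                const float *beta,   /* biases,     N             */
                float *Y,            /* output FMs, N x V x U     */
                int C, int H, int W, int N, int J, int K)
{
    int V = H - J + 1;               /* output height ("valid" convolution) */
    int U = W - K + 1;               /* output width                        */
    for (int n = 0; n < N; n++)
        for (int v = 0; v < V; v++)
            for (int u = 0; u < U; u++) {
                float acc = beta[n];
                for (int c = 0; c < C; c++)
                    for (int j = 0; j < J; j++)
                        for (int k = 0; k < K; k++)
                            acc += X[c * H * W + (v + j) * W + (u + k)]
                                 * Theta[((n * C + c) * J + j) * K + k];
                Y[(n * V + v) * U + u] = acc;
            }
}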
Each conv layer of a CNN is usually followed by an activation layer that applies a
nonlinear function to all the values of the FMs. Early CNNs were trained with TanH
or Sigmoid functions, but recent models employ the rectified linear unit (ReLU)
function, which grants faster training times and less computational complexity, as
highlighted in Krizhevsky et al. [12].

Y^{act}[b, n, h, w] = \mathrm{act}(X^{act}[b, n, h, w]), \quad \mathrm{act} := \mathrm{TanH}, \mathrm{Sigmoid}, \mathrm{ReLU}, \ldots   (1.3)
The convolutional and activation parts of a CNN are directly inspired by the
cells of visual cortex in neuroscience [13]. This is also the case with pooling
layers, which are periodically inserted in between successive conv layers. As
shown in Equation 1.4, pooling sub-samples each channel of the input FM by
selecting either the average, or, more commonly, the maximum of a given neigh-
borhood K. As a result, the dimensionality of an FM is reduced, as illustrated
in Figure 1.1.
Y^{pool}[b, n, v, u] = \max_{p, q \in [1:K]} X^{pool}[b, n, v + p, u + q]   (1.4)
When deployed for classification purposes, the CNN pipeline is often terminated
by FC layers. In contrast with convolutional layers, FC layers do not implement
weight sharing and involve as many weights as input data (i.e., W = K, H = J, U = V = 1).
Moreover, in a similar way as conv layers, a nonlinear function is applied to the
outputs of FC layers.
Y^{fc}[b, n] = \beta^{fc}[n] + \sum_{c=1}^{C} \sum_{h=1}^{H} \sum_{w=1}^{W} X^{fc}[b, c, h, w] \times \Theta^{fc}[n, c, h, w]   (1.5)
Y^{BN}[b, n, v, u] = \gamma \, \frac{X^{BN}[b, n, v, u] - \mu}{\sqrt{\sigma^2 + \epsilon}} + \alpha   (1.7)
TABLE 1.2
Popular CNN Models with Their Computational Workload*

Model                         AlexNet [12]  GoogleNet [16]  VGG16 [6]  VGG19 [6]  ResNet101 [17]  ResNet-152 [17]
Conv workload  Σ C_ℓ^conv     666 M         1.58 G          15.3 G     19.5 G     7.57 G          11.3 G
Conv weights   Σ W_ℓ^conv     2.33 M        5.97 M          14.7 M     20 M       42.4 M          58 M
Act                           ReLU          ReLU            ReLU       ReLU       ReLU            ReLU
Pool layers                   3             14              5          5          2               2
FC layers L_f                 3             1               3          3          1               1
FC workload    Σ C_ℓ^fc       58.6 M        1.02 M          124 M      124 M      2.05 M          2.05 M
FC weights     Σ W_ℓ^fc       58.6 M        1.02 M          124 M      124 M      2.05 M          2.05 M
C = \sum_{\ell=1}^{L_c} C_\ell^{conv} + \sum_{\ell=1}^{L_f} C_\ell^{fc}   (1.8)

C_\ell^{conv} = N_\ell \times C_\ell \times J_\ell \times K_\ell \times U_\ell \times V_\ell   (1.9)

C_\ell^{fc} = N_\ell \times C_\ell \times W_\ell \times H_\ell   (1.10)
In a similar way, the number of weights, and consequently the size of a given CNN
model, can be expressed as follows:
W = \sum_{\ell=1}^{L_c} W_\ell^{conv} + \sum_{\ell=1}^{L_f} W_\ell^{fc}   (1.11)

W_\ell^{conv} = N_\ell \times C_\ell \times J_\ell \times K_\ell   (1.12)

W_\ell^{fc} = N_\ell \times C_\ell \times W_\ell \times H_\ell   (1.13)
For state-of-the-art CNN models, L_c, N_ℓ, and C_ℓ can be quite large. This makes
CNNs computationally and memory intensive; for instance, the classification
of a single frame using the VGG19 network requires 19.5 billion MAC operations.
It can be observed in the same table that most of the MACs occur in the convolutional
parts, and consequently, 90% of the execution time of a typical inference is spent
on conv layers [18]. By contrast, FC layers concentrate most of the weights and thus
dominate the size of a given CNN model.
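To make these orders of magnitude concrete, the short program below evaluates Equations 1.9 and 1.12 for one hypothetical 3 × 3 conv layer; the layer dimensions are illustrative values chosen for the example, not taken from a specific model.

#include <stdio.h>

int main(void)
{
    /* Hypothetical layer dimensions (a VGG-like 3x3 layer), for illustration only. */
    long long N = 64, C = 64, J = 3, K = 3;   /* filters, input depth, kernel size */
    long long V = 224, U = 224;               /* output FM height and width        */

    long long macs    = N * C * J * K * U * V;  /* C_l^conv, Equation 1.9  */
    long long weights = N * C * J * K;          /* W_l^conv, Equation 1.12 */

    printf("MACs per frame: %lld (about %.2f GOP)\n", macs, macs / 1e9);
    printf("Weights:        %lld (about %.2f M)\n", weights, weights / 1e6);
    return 0;
}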
Moreover, the execution of the most computationally intensive parts (i.e., conv layers)
exhibits four types of concurrency: inter-FM, intra-FM, inter-convolution, and
intra-convolution parallelism (cf. Table 1.4).

Note that the fully connected parts of state-of-the-art models involve large values
of N_ℓ and C_ℓ, making the memory reads of weights the dominant factor,
as formulated in Equation 1.16. In this context, batch parallelism can significantly
accelerate the execution of CNNs with a large number of FC layers.
In the conv parts, the high number of MAC operations results in a high number
of memory accesses, as each MAC requires at least 2 memory reads and 1 memory
write*. This number of memory accesses accumulates with the high dimensions of
data manipulated by conv layers, as shown in Equation 1.18. If all these accesses are
towards external memory (for instance, DRAM), throughput and energy consumption
* This is the best-case scenario of a fully pipelined MAC, where intermediate results do not need to be
loaded.
10 Deep Learning in Computer Vision
will be highly impacted, because DRAM access engenders high latency and energy
consumption, even more than the computation itself [21].
The number of these DRAM accesses, and thus latency and energy consumption, can
be reduced by implementing a memory-caching hierarchy using on-chip memories.
As discussed in the next sections, state-of-the-art CNN accelerators employ register
files as well as several levels of caches. The former, being the fastest, are implemented
nearest to the computational resources. The latency and energy consumption
of these caches are lower by several orders of magnitude than those of external
memory accesses, as pointed out in Sze et al. [22].
1. A high density of hard-wired digital signal processor (DSP) blocks that are
able to achieve up to 20 TMACs (8 TFLOPs) [8].
2. A collection of in situ on-chip memories, located next to the DSPs, that can be
exploited to significantly reduce the number of external memory accesses.
T(\mathrm{FPS}) = \frac{T(\mathrm{MACS})}{C(\mathrm{MAC})}   (1.19)
* At a similar number of memory accesses. These accesses typically play the most dominant role in the
power consumption of an accelerator.
* https://www.openblas.net/
† https://developer.nvidia.com/cublas
FIGURE 1.3 GEMM-based processing of FC layers (a) and conv layers (b).
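Figure 1.3b relies on the im2col lowering to map conv layers onto a GEMM. A minimal sketch of that lowering is given below; a CHW layout, unit stride, and no padding are assumptions made for the example.

/* Minimal im2col sketch: every C x J x K receptive field of the input is
 * flattened into one column of a (C*J*K) x (V*U) matrix, so that the conv
 * layer of Figure 1.3b reduces to a single GEMM with the N x (C*J*K)
 * filter matrix. */
void im2col(const float *x,   /* input FMs,       C x H x W      */
            float *cols,      /* lowered matrix, (C*J*K) x (V*U) */
            int C, int H, int W, int J, int K)
{
    int V = H - J + 1;        /* output FM height */
    int U = W - K + 1;        /* output FM width  */
    for (int c = 0; c < C; c++)
        for (int j = 0; j < J; j++)
            for (int k = 0; k < K; k++) {
                int row = (c * J + j) * K + k;
                for (int v = 0; v < V; v++)
                    for (int u = 0; u < U; u++)
                        cols[row * (V * U) + v * U + u] =
                            x[c * H * W + (v + j) * W + (u + k)];
            }
}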
FIGURE 1.4 Winograd filtering: the (w + k − 1) × (w + k − 1) transformed input tile x̃ and transformed filter Θ̃ are multiplied element-wise (EWMM) to produce ỹ.
* That’s what the im2col name refers to: flattening an image to a column.
y = A^T \big[ \tilde{\theta} \odot \tilde{x} \big] A   (1.24)

A^T = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix}; \quad
B^T = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}; \quad
C = \begin{bmatrix} 1 & 0 & 0 \\ 1/2 & 1/2 & 1/2 \\ 1/2 & -1/2 & 1/2 \\ 0 & 0 & 1 \end{bmatrix}   (1.25)
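As a plain software illustration of Equations 1.24 and 1.25 (not an FPGA data path; the function and variable names are assumptions for the example), the sketch below applies Winograd filtering to one 4 × 4 input tile and one 3 × 3 filter.

/* Minimal sketch of Winograd F(2x2, 3x3) on a single tile, using the constant
 * matrices A, B, C of Equation 1.25: theta~ = C g C^T, x~ = B^T d B, and the
 * 2x2 output tile y = A^T [theta~ (.) x~] A. The EWMM costs 16 multiplications
 * instead of the 36 needed by direct filtering of the same 2x2 output tile. */
static const float AT[2][4] = {{1, 1, 1, 0}, {0, 1, -1, -1}};
static const float BT[4][4] = {{1, 0, -1, 0}, {0, 1, 1, 0},
                               {0, -1, 1, 0}, {0, 1, 0, -1}};
static const float Cm[4][3] = {{1, 0, 0}, {0.5f, 0.5f, 0.5f},
                               {0.5f, -0.5f, 0.5f}, {0, 0, 1}};

void winograd_f2x2_3x3(const float d[4][4], const float g[3][3], float y[2][2])
{
    float t[4][3], u[4][4], s[4][4], v[4][4], m[4][4], a[2][4];
    /* u = C g C^T (filter transform, can be evaluated offline) */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++) {
            t[i][j] = 0;
            for (int k = 0; k < 3; k++) t[i][j] += Cm[i][k] * g[k][j];
        }
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            u[i][j] = 0;
            for (int k = 0; k < 3; k++) u[i][j] += t[i][k] * Cm[j][k];
        }
    /* v = B^T d B (input transform) */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            s[i][j] = 0;
            for (int k = 0; k < 4; k++) s[i][j] += BT[i][k] * d[k][j];
        }
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            v[i][j] = 0;
            for (int k = 0; k < 4; k++) v[i][j] += s[i][k] * BT[j][k];
        }
    /* m = u (.) v: the element-wise matrix multiplication (EWMM) */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) m[i][j] = u[i][j] * v[i][j];
    /* y = A^T m A (output transform) */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 4; j++) {
            a[i][j] = 0;
            for (int k = 0; k < 4; k++) a[i][j] += AT[i][k] * m[k][j];
        }
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            y[i][j] = 0;
            for (int k = 0; k < 4; k++) y[i][j] += a[i][k] * AT[j][k];
        }
}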
Beside this complexity reduction, implementing Winograd filtering in FPGA-based
CNN accelerators has two advantages. First, the transformation matrices A, B, C can be
evaluated offline once u and k are determined. As a result, these transforms become
multiplications with constants that can be implemented by means of LUTs and shift
registers, as proposed in [54].
Second, Winograd filtering can employ the loop optimization techniques discussed
in Section 1.5.2 to vectorize the implementation. On one hand, the computational
throughput is increased when unrolling the computation of the EWMM parts
over multiple DSP blocks. On the other hand, memory bandwidth is optimized using
loop tiling to determine the size of the FM tiles and filter buffers.
First, utilization of Winograd filtering in FPGA-based CNN accelerators is investigated
in [32] and delivers a computational throughput of 46 GOPs when executing
AlexNet convolutional layers. This performance is significantly improved by a factor
of ×42 in [31] when optimizing the data path to support Winograd convolutions (by
employing loop unrolling and tiling strategies), and storing the intermediate FM in
on-chip buffers (cf Section 1.4).
The same method is employed in [54] to derive a CNN accelerator on a Xilinx
ZCU102 device that delivers a throughput of 2.94 TOPs on VGG convolutional lay-
ers. The reported throughput corresponds to half of the performance of a TitanX
device, with 5.7× less power consumption [23]*.
* Implementation in the TitanX GPU employs the Winograd algorithm and 32-bit floating point arithmetic.
Using the FFT to process 2D convolutions reduces the complexity from O(W² × K²) to
O(W² log₂(W)), which is exploited to derive FPGA-based accelerators for CNN inference [34].
When compared to standard filtering and the Winograd algorithm, the FFT is most attractive
for convolutions with a large kernel size (K > 5), as demonstrated in [53,
55]. The computational complexity of FFT convolutions can be further reduced to
O(W log₂(K)) using the overlap-and-add method [56], which can be applied when
the signal size is much larger than the filter size, which is typically the case in conv
layers (W >> K). The works in [33, 57] leverage overlap-and-add to implement
frequency-domain acceleration of conv layers on FPGA, which results in a computational
throughput of 83 GOPs for AlexNet (Table 1.3).
TABLE 1.3
Entry   Network    Comp (GOP)  Params (M)  Precision  Tool    Device           Freq (MHz)  Through. (GOPs)  Power (W)  LUT (K)  DSP   Memory (MB)
Winograd
[33]    AlexNet-C  1.3         2.3         Float 32   OpenCL  Virtex7 VX690T   200         46               –          505      3683  56.3
[32]    AlexNet-C  1.3         2.3         Float 16   OpenCL  Arria10 GX1150   303         1382             44.3       246      1576  49.7
[55]    VGG16-C    30.7        14.7        Fixed 16   HLS     Zynq ZU9EG       200         3045             23.6       600      2520  32.8
[55]    AlexNet-C  1.3         2.3         Fixed 16   HLS     Zynq ZU9EG       200         855              23.6       600      2520  32.8
FFT
[34]    AlexNet-C  1.3         2.3         Float 32   –       Stratix5 QPI     200         83               13.2       201      224   4.0
[34]    VGG19-C    30.6        14.7        Float 32   –       Stratix5 QPI     200         123              13.2       201      224   4.0
GEMM
[30]    AlexNet-C  1.3         2.3         Fixed 16   OpenCL  Stratix5 GXA7    194         66               33.9       228      256   37.9
[50]    VGG16-F    31.1        138.0       Fixed 16   HLS     Kintex KU060     200         365              25.0       150      1058  14.1
[50]    VGG16-F    31.1        138.0       Fixed 16   HLS     Virtex7 VX960T   150         354              26.0       351      2833  22.5
[51]    VGG16-F    31.1        138.0       Fixed 16   OpenCL  Arria10 GX1150   370         866              41.7       437      1320  25.0
[51]    VGG16-F    31.1        138.0       Float 32   OpenCL  Arria10 GX1150   385         1790             37.5       –        2756  29.0
FIGURE 1.5 Generic data paths of FPGA-based CNN accelerators: (a) Static systolic array.
(b) Dedicated SIMD accelerator. (c) Dedicated processing element.
// Lb: Batch
for (int b = 0; b < B; b++) {
  // Ll: Layer
  for (int l = 0; l < L; l++) {
    // Ln: Y Depth
    for (int n = 0; n < N; n++) {
      // Lv: Y Columns
      for (int v = 0; v < V; v++) {
        // Lu: Y Rows
        for (int u = 0; u < U; u++) {
          // Lc, Lj, Lk: MAC over the input channels and the kernel window
          for (int c = 0; c < C; c++)
            for (int j = 0; j < J; j++)
              for (int k = 0; k < K; k++)
                Y[b][n][v][u] += X[b][c][v + j][u + k] * Theta[n][c][j][k];
}}}}}
Loop Unrolling
Unrolling a loop L_i with an unrolling factor P_i (P_i ≤ i, i ∈ {L, V, U, N, C, J, K}) accelerates
its execution by allocating multiple computational resources. Each of the parallelism
patterns listed in Section 1.2.4.2 can be implemented by unrolling one of
the loops of Listing 1.1, as summarized in Table 1.4. For the configuration given in
Figure 1.5c, the unrolling factor P_N sets the number of PEs. The remaining factors
– P_C, P_K, P_J – determine the number of multipliers, as well as the size of the buffer
contained in each PE (Figure 1.6).
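A minimal sketch of what unrolling L_N means for the data path of Figure 1.5c follows; the value P_N = 4, the buffer names, and the omission of the batch and layer loops are assumptions made for the example.

/* Loop Ln unrolled by a factor PN: PN output channels are computed in
 * parallel, e.g., by PN processing elements (an HLS tool would typically be
 * instructed to do so with an unroll directive on the p loop). N is assumed
 * to be a multiple of PN; buffers follow the layout of Listing 1.1. */
#define PN 4   /* unrolling factor (illustrative value) */

for (int n = 0; n < N; n += PN)
    for (int v = 0; v < V; v++)
        for (int u = 0; u < U; u++)
            for (int p = 0; p < PN; p++) {      /* unrolled: PN parallel PEs */
                float acc = 0.0f;
                for (int c = 0; c < C; c++)
                    for (int j = 0; j < J; j++)
                        for (int k = 0; k < K; k++)
                            acc += X[c][v + j][u + k]
                                 * Theta[n + p][c][j][k];
                Y[n + p][v][u] = acc;
            }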
TABLE 1.4
Loop Optimization Parameters Pi and Ti

Parallelism     Intra layer   Inter FM   Intra FM    Inter conv.   Intra conv.
Loop            LL            LN         LV   LU     LC            LJ   LK
Unroll factor   PL            PN         PV   PU     PC            PJ   PK
Tiling factor   TL            TN         TV   TU     TC            TJ   TK
FIGURE 1.6 Processing element with tiled on-chip buffers (TN, TC, TV, TU, TJ, TK) feeding PJ × PK multipliers and an adder tree.
Loop Tiling
In general, the capacity of on-chip memory in current FPGAs is not large enough to
store the weights and intermediate FMs of all CNN layers*. For example, AlexNet’s
convolutional layers require 18.6 Mbits of weights, and generate a total of 70.7 Mbits
of intermediate feature maps†. In contrast, the highest-end Stratix V FPGA provides
a maximum of 52 Mbits of on-chip RAM.
As a consequence, FPGA-based accelerators resort to external DRAM to store
these data. As mentioned in Section 1.2.4.3, DRAM accesses are costly in terms of
energy and latency, and data caches must be implemented by means of on-chip buffers
and local registers. The challenge is thus to build a data path in a way that every
piece of data transferred from DRAM is reused as much as possible.
For conv layers, this challenge can be addressed by tiling the nested loops of
Listing 1.1. Loop tiling [66] divides the FM and weights of each layer into multiple
groups that can fit into the on-chip buffers. For the configuration given in Figure 1.5c,
the size of the buffers containing input FM, weights, and output FM is set according
to the tiling factors listed in Table 1.4.
* Exception can be made for [6666], where a large cluster of FPGAs is interconnected and resorts only
to on-chip memory to store CNN weights and intermediate data.
† Estimated by summing the number of outputs for each convolution layer.
B_\Theta^{conv} = T_N \times T_C \times T_J \times T_K   (1.28)

B_Y^{conv} = T_N \times T_V \times T_U   (1.29)
With these buffers, the memory accesses occurring in the conv layer (cf Equation
1.18) are respectively divided by B_X^{conv}, B_\Theta^{conv}, and B_Y^{conv}, as expressed in Equation 1.30.
Since the same hardware is reused to accelerate the execution of multiple conv layers
with different workloads, the tiling factors are agnostic to the workload of a specific
layer, as can be noticed in the denominator of Equation 1.30. As a result, the values
of the tiling factors are generally set to optimize the overall performance of a CNN
execution.
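The sketch below illustrates the tiling scheme just described for one conv layer; the tile sizes T_N, T_C, T_V, T_U, the buffer names, and the choice T_J = J, T_K = K are assumptions for the example, and the DRAM transfers are only indicated by comments.

/* Output FMs are produced tile by tile so that the working set fits in the
 * on-chip buffers X_buf, Theta_buf and Y_buf; only whole tiles cross the
 * DRAM boundary. Tile sizes are assumed to divide N, V, U and C evenly. */
for (int n0 = 0; n0 < N; n0 += TN)
    for (int v0 = 0; v0 < V; v0 += TV)
        for (int u0 = 0; u0 < U; u0 += TU) {
            /* Y_buf holds one TN x TV x TU output tile (initialization omitted). */
            for (int c0 = 0; c0 < C; c0 += TC) {
                /* Load the matching tiles of X and Theta from DRAM into
                 * X_buf and Theta_buf (transfers not shown). */
                for (int n = 0; n < TN; n++)
                    for (int v = 0; v < TV; v++)
                        for (int u = 0; u < TU; u++)
                            for (int c = 0; c < TC; c++)
                                for (int j = 0; j < J; j++)
                                    for (int k = 0; k < K; k++)
                                        Y_buf[n][v][u] +=
                                            X_buf[c][v + j][u + k]
                                          * Theta_buf[n][c][j][k];
            }
            /* Write the finished Y tile back to DRAM (transfer not shown). */
        }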
the design variables searching for optimal loop unroll and tiling factors. More
particularly, the authors demonstrate that the input FM and weights are optimally
reused when unrolling only computations within a single input FM (i.e., when
PC = PJ = PK = 1). Tiling factors are set in such a way that all the data required to
compute an element of Y are fully buffered (i.e., TC = C, TK = K, TJ = J). The remain-
ing design parameters are derived after a brute-force design exploration. The same
authors leverage on these loop optimizations to build an RTL compiler for CNNs
in [71]. To the best of our knowledge, this accelerator outperforms all the previ-
ous implementations that are based on loop optimization in terms of computational
throughput (Tables 1.5 through 1.7).
TABLE 1.5
Entry   Network   Comp (GOP)   Params (M)   Precision   Tool   Device   Freq (MHz)   Through. (GOPs)   Power (W)   LUT (K)   DSP   Memory (MB)
[40] AlexNet-C 1.3 2.3 Float 32 HLS Virtex7 VX485T 100 61.62 18.61 186 2240 18.4
[29] VGG16SVD-F 30.8 50.2 Fixed 16 RTL Zynq Z7045 150 136.97 9.63 183 780 17.5
[30] AlexNet-C 1.3 2.3 Fixed 16 OpenCL Stratix5 GSD8 120 187.24 33.93 138 635 18.2
[30] AlexNet-F 1.4 61.0 Fixed 16 OpenCL Stratix5 GSD8 120 71.64 33.93 272 752 30.1
[30] VGG16-F 31.1 138.0 Fixed 16 OpenCL Stratix5 GSD8 120 117.9 33.93 524 1963 51.4
[68] AlexNet-C 1.3 2.3 Float 32 HLS Virtex7 VX485T 100 75.16 33.93 28 2695 19.5
[49] AlexNet-F 1.4 61.0 Fixed 16 HLS Virtex7 VX690T 150 825.6 126.00 N.R 14400 N.R
[49] VGG16-F 31.1 138.0 Fixed 16 HLS Virtex7 VX690T 150 1280.3 160.00 N.R 21600 N.R
[69] NIN-F 2.2 61.0 Fixed 16 RTL Stratix5 GXA7 100 114.5 19.50 224 256 46.6
[69] AlexNet-F 1.5 7.6 Fixed 16 RTL Stratix5 GXA7 100 134.1 19.10 242 256 31.0
[38] AlexNet-F 1.4 61.0 Fixed 16 RTL Virtex7 VX690T 156 565.94 30.20 274 2144 34.8
[63] AlexNet-C 1.3 2.3 Float 32 HLS Virtex7 VX690T 100 61.62 30.20 273 2401 20.2
[64] VGG16-F 31.1 138.0 Fixed 16 RTL Arria10 GX1150 150 645.25 50.00 322 1518 38.0
[42] AlexNet-C 1.3 2.3 Fixed 16 RTL Cyclone5 SEM 100 12.11 N.R 22 28 0.2
[42] AlexNet-C 1.3 2.3 Fixed 16 RTL Virtex7 VX485T 100 445 N.R 22 2800 N.R
[72] NiN 20.2 7.6 Fixed 16 RTL Stratix5 GXA7 150 282.67 N.R 453 256 30.2
[72] VGG16-F 31.1 138.0 Fixed 16 RTL Stratix5 GXA7 150 352.24 N.R 424 256 44.0
[72] ResNet-50 7.8 25.5 Fixed 16 RTL Stratix5 GXA7 150 250.75 N.R 347 256 39.3
[72] NiN 20.2 7.6 Fixed 16 RTL Arria10 GX1150 200 587.63 N.R 320 1518 30.4
[72] VGG16-F 31.1 138.0 Fixed 16 RTL Arria10 GX1150 200 720.15 N.R 263 1518 44.5
[72] ResNet-50 7.8 25.5 Fixed 16 RTL Arria10 GX1150 200 619.13 N.R 437 1518 38.5
[73] AlexNet-F 1.5 7.6 Float 32 N.R Virtex7 VX690T 100 445.6 24.80 207 2872 37
[73] VGG16SVD-F 30.8 50.2 Float 32 N.R Virtex7 VX690T 100 473.4 25.60 224 2950 47
TABLE 1.6
Accelerators Employing Approximate Arithmetic

A×C    Entry  Dataset   Comp (GOP)  Params (M)  Bit-width (In/Out, FMs, θconv, θFC)  Acc (%)  Device          Freq (MHz)  Through. (GOPs)  Power (W)  LUT (K)  DSP   Memory (MB)
FP32   [51]   ImageNet  30.8        138.0       32 / 32 / 32 / 32                    90.1     Arria10 GX1150  370         866              41.7       437      1320  25.0
FP16   [32]   ImageNet  30.8        61.0        16 / 16 / 16 / 16                    79.2     Arria10 GX1150  303         1382             44.3       246      1576  49.7
       [64]   ImageNet  30.8        138.0       16 / 16 / 8 / 8                      88.1     Arria10 GX1150  150         645              N.R        322      1518  38.0
DFP    [72]   ImageNet  30.8        138.0       16 / 16 / 16 / 16                    N.R      Arria10 GX1150  200         720              N.R        132      1518  44.5
       [51]   ImageNet  30.8        138.0       16 / 16 / 16 / 16                    N.R      Arria10 GX1150  370         1790             N.R        437      2756  29.0
       [91]   Cifar10   1.2         13.4        20 / 2 / 1 / 1                       87.7     Zynq Z7020      143         208              4.7        47       3     N.R
       [46]   Cifar10   0.3         5.6         20/16 / 2 / 1 / 1                    80.1     Zynq Z7045      200         2465             11.7       83       N.R   7.1
BNN    [93]   MNIST     0.0         9.6         8 / 2 / 1 / 1                        98.2     Stratix5 GSD8   150         5905             26.2       364      20    44.2
       [93]   Cifar10   1.2         13.4        8 / 8 / 1 / 1                        86.3     Stratix5 GSD8   150         9396             26.2       438      20    44.2
       [93]   ImageNet  2.3         87.1        8 / 32 / 1a / 1                      66.8     Stratix5 GSD8   150         1964             26.2       462      384   44.2
       [94]   Cifar10   1.2         13.4        8 / 2 / 2 / 2                        89.4     Xilinx7 VX690T  250         10962            13.6       275      N.R   39.4
TNN    [94]   SVHN      0.3         5.6         8 / 2 / 2 / 2                        97.6     Xilinx7 VX690T  250         86124            7.1        155      N.R   12.2
       [94]   GTSRB     0.3         5.6         8 / 2 / 2 / 2                        99.0     Xilinx7 VX690T  250         86124            6.6        155      N.R   12.2
TABLE 1.7
Accelerators Employing Pruning and Low Rank Approximation

Reduc.   Entry  Dataset   Comp (GOP)  Params (M)  Removed Param. (%)  Bit-width  Acc (%)  Device         Freq (MHz)  Through. (GOPs)  Power (W)  LUT (K)  DSP   Memory (MB)
SVD      [29]   ImageNet  30.8        50.2        63.6                16 Fixed   88.0     Zynq 7Z045     150         137.0            9.6        183      780   17.50
Pruning  [44]   Cifar10   0.3         13.9        89.3                8 Fixed    91.5     Kintex 7K325T  100         8620.7           7.0        17       145   15.12
         [7]    ImageNet  1.5         9.2         85.0                32 Float   79.7     Stratix 10     500         12000.0          141.2      N.R      N.R   N.R
In the simplest version of fixed-point arithmetic, all the numbers are encoded
with the same fractional and integer bit-widths. This means that the position of the
radix point is the same for all the represented numbers. In this chapter, we refer to this
representation as static F × P.
When compared to floating point, F × P is known to be more efficient in terms of
hardware utilization and power consumption. This is especially true in FPGAs [79],
where – for instance – a single DSP block in Intel devices can either implement one
32-bit floating-point multiplication or three concurrent F × P multiplications of 9 bits
[8]. This motivated early FPGA implementations such as [61, 80] to employ fixed-point
arithmetic in deriving CNN accelerators. These implementations mainly use
a 16-bit Q8.8 format, where 8 bits are allocated to the integer part, and 8 bits to the
fractional part. Note that the same Q8.8 format is used for representing the features
and the weights of all the layers.
In order to prevent overflow, the former implementations also expand the bit-width
when computing the weighted sums of convolutions. Equation 1.31 explains how
the bit-width is expanded: if b_X bits are used to quantize the input FM and b_Θ bits are
used to quantize the weights, an accumulator of b_acc bits is required to represent a
weighted sum of C_ℓ K_ℓ² elements, where:

b_{acc} = b_X + b_\Theta + \lceil \log_2(C_\ell K_\ell^2) \rceil   (1.31)

In practice, most FPGA accelerators use 48-bit accumulators, such as in [59, 60]
(Figure 1.9).
FIGURE 1.9 (a) Fixed-point MAC: inputs X^conv quantized on b_x bits, weights Θ^conv on b_θ bits, and accumulation on b_θ + b_x + log₂(CK²) bits. (b) Histogram of weights and activations, with inputs and weights encoded in 8 bits (8-bit quantization).
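A minimal software model of the static Q8.8 arithmetic described above follows; the helper names are illustrative, and the 64-bit accumulator simply mirrors the expanded accumulators of Equation 1.31.

#include <stdint.h>
#include <stdio.h>

typedef int16_t q8_8_t;                       /* Q8.8: real value = raw / 256 */

static q8_8_t to_q8_8(float x) { return (q8_8_t)(x * 256.0f); }

int main(void)
{
    q8_8_t x[3] = { to_q8_8(1.50f), to_q8_8(-0.75f), to_q8_8(2.00f) };
    q8_8_t w[3] = { to_q8_8(0.50f), to_q8_8( 1.25f), to_q8_8(-0.50f) };

    int64_t acc = 0;                          /* wide accumulator, cf. Eq. 1.31 */
    for (int i = 0; i < 3; i++)
        acc += (int32_t)x[i] * (int32_t)w[i]; /* each product is a Q16.16 value */

    /* The accumulator holds a value with 16 fractional bits: divide by 2^16. */
    printf("weighted sum = %f\n", (double)acc / 65536.0);
    return 0;
}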
‡ Another approach to address this problem is to use custom floating point representations, as detailed
in [31].
to another. More particularly, the weights, weighted sums, and outputs of each layer are
assigned distinct integer and fractional bit-widths.
The optimal values of these bit-widths (i.e., the ones that deliver the best trade-off
between accuracy loss and computational load) for each layer can be derived after
a profiling process performed by dedicated frameworks that support F × P. Among
these frameworks, Ristretto [81] and FixCaffe [84] are compatible with Caffe, while
TensorFlow natively supports 8-bit computations. Most of these tools can fine-tune a
given CNN model to improve the accuracy of the quantized network.
In particular, the works in [85] demonstrate the efficiency of dynamic F × P, pointing
out how the inference of AlexNet is possible using 6 bits in dynamic F × P instead
of 16 bits with a conventional fixed-point format.
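A minimal sketch of the per-layer quantization step behind dynamic F × P is given below; the function names, the 16-bit width, and the example weights are assumptions for illustration. The fractional bit-width of each layer is derived from the largest magnitude found in that layer's values, so the radix point moves from one layer to the next.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Choose the fractional bit-width of a b-bit signed format so that the layer's
 * largest magnitude still fits; the +2 accounts for the sign bit and for values
 * just below the next power of two. */
static int choose_frac_bits(const float *w, int n, int bits)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(w[i]) > max_abs) max_abs = fabsf(w[i]);
    int int_bits = (int)floorf(log2f(max_abs)) + 2;
    return bits - int_bits;
}

static int16_t quantize(float w, int frac_bits)
{
    return (int16_t)lrintf(w * (float)(1 << frac_bits));
}

int main(void)
{
    float layer_weights[4] = { 0.031f, -0.27f, 0.008f, 0.12f };  /* example values */
    int frac = choose_frac_bits(layer_weights, 4, 16);
    printf("fractional bits for this layer: %d\n", frac);
    for (int i = 0; i < 4; i++)
        printf("%+.3f -> %d\n", layer_weights[i], quantize(layer_weights[i], frac));
    return 0;
}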
* Since the same PEs are reused to process different layers, the same bit width is used with a variable
radix point for each layer.
† When compared to an exact 32-bit implementation of AlexNet.
* The network topology used in this work involves 90% fewer computations and achieves 7% less classification
accuracy on Cifar10.
The same concept extends to depth convolutions, where a separable filter requires
C + J + K multiplications instead of C × J × K multiplications.
Nonetheless, only a small proportion of CNN filters are separable. To increase
this proportion, a first approach is to force the convolution kernels to be separable by
penalizing high-rank filters when training the network [98]. Alternatively, and after
the training, the weights Θ of a given layer can be approximated by a small set of r
low-rank filters. In this case, r × (C + J + K) multiplications are required to process a
single depth convolution.
Finally, CNN computations can be reduced further by decomposing the weight
matrix Θ_ℓ through singular-value decomposition (SVD). As demonstrated in the early
works of [99], SVD greatly reduces the resource utilization of a given 2D-filter
implementation. Moreover, SVD also finds its interest when processing FC layers and
convolutions that employ the im2col method (cf Section 1.4.1). In a similar way to
pruning, low-rank approximation or SVD is followed by a fine-tuning in order to
counterbalance the drop in classification accuracy.
* This format represents a matrix by three one-dimensional arrays, that respectively contain nonzero
values, row indices, and column indices.

1.7 CONCLUSIONS
In this chapter, a number of methods and tools have been compared that aim at
porting convolutional neural networks onto FPGAs. At the network level, approximate
computing and data-path optimization methods have been covered, while at
the neuron level, the optimizations of convolutional and fully connected layers have
been detailed and compared. All the different degrees of freedom offered by FPGAs
(custom data types, local data streams, dedicated processors, etc.) are exploited by
the presented methods. Moreover, algorithmic and data-path optimizations can and
should be jointly implemented, resulting in additive hardware performance gains.
CNNs are by nature overparameterized and are particularly well suited to approximate
computing techniques such as weight pruning and fixed-point computation.
Approximate computing already constitutes a key to CNN acceleration in hardware
and will certainly continue driving the performance gains in the years to come.
BIBLIOGRAPHY
1. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp.
436–444, 2015.
2. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual
recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp.
211–252, September 2014.
3. R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on
Computer Vision - ICCV ’15, 2015, pp. 1440–1448.
4. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic seg-
mentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition - CVPR ’15, 2015, pp. 3431–3440.
5. Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. L. Y. Bengio, and A. Courville,
“Towards end-to-end speech recognition with deep convolutional neural networks,”
arXiv preprint, vol. arXiv:1701, 2017.
6. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale
image recognition,” arXiv preprint, vol. arXiv:1409, pp. 1–14, 2014.
7. E. Nurvitadhi, S. Subhaschandra, G. Boudoukh, G. Venkatesh, J. Sim, D. Marr, R.
Huang, J. OngGeeHock, Y. T. Liew, K. Srivatsan, and D. Moss, “Can FPGAs beat GPUs
in accelerating next-generation deep neural networks?” in Proceedings of the ACM/
SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA ’17,
2017, pp. 5–14.
8. Intel FPGA, “Intel Stratix 10 variable precision DSP blocks user guide,” pp. 4–5, 2017.
9. G. Lacey, G. Taylor, and S. Areibi, “Deep learning on FPGAs: Past, present, and
future,” arXiv e-print, 2016.
10. S. I. Venieris, A. Kouris, and C.-S. Bouganis, “Toolflows for mapping convolutional
neural networks on FPGAs,” ACM Computing Surveys, vol. 51, no. 3, pp. 1–39, June
2018.
11. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient based learning applied to
document recognition,” in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
12. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep con-
volutional neural networks,” in Advances in Neural Information Processing Systems -
NIPS’12, 2012, p. 19.
13. D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional
architecture in the cat’s visual cortex,” The Journal of Physiology, vol. 160, no. 1, pp.
106–154, 1962.
14. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by
reducing internal covariate shift,” in Proceedings of the International Conference on
Machine Learning - ICML ’15, F. Bach and D. Blei, Eds., vol. 37, 2015, pp. 448–456.
47. R. Zhao, W. Ouyang, H. Li, and X. Wang, “Saliency detection by multicontext deep
learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition - CVPR ’15, 2015, pp. 1265–1274.
48. C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong, “Energy-efficient CNN imple-
mentation on a deeply pipelined FPGA cluster,” in Proceedings of the International
Symposium on Low Power Electronics and Design - ISLPED ’16, 2016, pp. 326–331.
49. C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine: Towards uniformed repre-
sentation and acceleration for deep convolutional neural networks,” in Proceedings of the
International Conference on Computer-Aided Design - ICCAD ’16. ACM, 2016, pp. 1–8.
50. J. Zhang and J. Li, “Improving the performance of openCL-based FPGA accelerator
for convolutional neural network,” in Proceedings of the ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays - FPGA ’17, 2017, pp. 25–34.
51. Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong,
“FP-DNN: An automated framework for mapping deep neural networks onto FPGAs
with RTL-HLS hybrid templates,” in Proceedings of the IEEE Annual International
Symposium on Field- Programmable Custom Computing Machines - FCCM ’17. IEEE,
2017, pp. 152–159.
52. S. Winograd, Arithmetic Complexity of Computations. SIAM, 1980, vol. 33.
53. A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” arXiv
e-print, vol. arXiv: 150, September 2015.
54. L. Lu, Y. Liang, Q. Xiao, and S. Yan, “Evaluating fast algorithms for convolutional neu-
ral networks on FPGAs,” in Proceedings of the IEEE Annual International Symposium
on Field-Programmable Custom Computing Machines - FCCM ’17, 2017, pp. 101–108.
55. J. Bottleson, S. Kim, J. Andrews, P. Bindu, D. N. Murthy, and J. Jin, “ClCaffe: OpenCL accel-
erated caffe for convolutional neural networks,” in Proceedings of the IEEE International
Parallel and Distributed Processing Symposium – IPDPS ’16, 2016, pp. 50–57.
56. T. Highlander and A. Rodriguez, “Very efficient training of convolutional neural net-
works using fast Fourier transform and overlap-and-add,” arXiv preprint, pp. 1–9, 2016.
57. H. Zeng, R. Chen, C. Zhang, and V. Prasanna, “A framework for generating high
throughput CNN implementations on FPGAs,” in Proceedings of the ACM/SIGDA
International Symposium on Field- Programmable Gate Arrays - FPGA ’18. ACM
Press, 2018, pp. 117–126.
58. M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and
H. P. Graf, “A massively parallel coprocessor for convolutional neural networks,” in
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal
Processing - ICASSP ’17. IEEE, July 2009, pp. 53–60.
59. C. Farabet, C. Poulet, J. Y. Han, Y. LeCun, D. R. Tobergte, and S. Curtis, “CNP: An
FPGA-based processor for convolutional networks,” in Proceedings of the International
Conference on Field Programmable Logic and Applications - FPL ’09, pp. 32–37, 2009.
60. S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A dynamically con-
figurable coprocessor for convolutional neural networks,” ACM SIGARCH Computer
Architecture News, vol. 38, no. 3, pp. 247–257, June 2010.
61. V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, “A 240 G-ops/s mobile
coprocessor for deep neural networks,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition - CVPR ’14, June 2014, pp. 696–701.
62. M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN accelerators,” in
Proceedings of the Annual International Symposium on Microarchitecture - MICRO
’16, vol. 2016, December 2016.
63. Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Optimizing loop operation and dataflow
in FPGA acceleration of deep convolutional neural networks,” in Proceedings of the
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA
’17, 2017, pp. 45–54.
96. T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing energy-efficient convolutional neural
networks using energy-aware pruning,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition - CVPR ’17, pp. 5687–5695, 2017.
97. S. Han, H. Mao, and W. J. Dally, “Deep compression - compressing deep neural net-
works with pruning, trained quantization and huffman coding,” in Proceedings of the
International Conference on Learning Representations – ICLR ’16, 2016, pp. 1–13.
98. A. Sironi, B. Tekin, R. Rigamonti, V. Lepetit, and P. Fua, “Learning separable filters,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 1, pp.
94–106, 2015.
99. C. Bouganis, G. Constantinides, and P. Cheung, “A novel 2D filter design methodology
for heterogeneous devices,” in Proceedings of the Annual IEEE Symposium on Field-
Programmable Custom Computing Machines - FCCM ’05. IEEE, 2005, pp. 13–22.
100. R. Dorrance, F. Ren, and D. Marković, “A scalable sparse matrix-vector multiplication
kernel for energy-efficient sparse-BLAS on FPGAs,” in Proceedings of the ACM/SIGDA
International Symposium on Field- Programmable Gate Arrays - FPGA ’14. ACM,
2014, pp. 161–170.
2 Object Detection
with Convolutional
Neural Networks
Kaidong Li, Wenchi Ma, Usman Sajid,
Yuanwei Wu, and Guanghui Wang
CONTENTS
2.1 Introduction .................................................................................................... 41
2.1.1 Major Evolution .................................................................................. 42
2.1.2 Other Development............................................................................. 43
2.2 Two-Stage Model............................................................................................44
2.2.1 Regions with CNN Features (R-CNN)...............................................44
2.2.2 Fast R-CNN ........................................................................................ 45
2.2.3 Faster R-CNN .....................................................................................46
2.3 One-Stage Model ............................................................................................ 48
2.3.1 You Only Look Once (YOLO) ........................................................... 48
2.3.2 YOLOv2 and YOLO9000 .................................................................. 49
2.3.3 YOLOv3 ............................................................................................. 50
2.3.4 Single-Shot Detector (SSD) ................................................................ 50
2.4 Other Detector Architectures ......................................................................... 51
2.4.1 Deep Residual Learning (ResNet)...................................................... 52
2.4.2 RetinaNet ............................................................................................ 53
2.5 Performance Comparison on Different Datasets............................................ 54
2.5.1 PASCAL VOC .................................................................................... 54
2.5.2 MS COCO .......................................................................................... 55
2.5.3 VisDrone-DET2018 ............................................................................ 55
2.6 Conclusion ...................................................................................................... 58
Bibliography ............................................................................................................ 58
2.1 INTRODUCTION
Deep learning was first proposed in 2006 [1]; however, it did not attract much attention
until 2012. With the development of computational power and a large number
of labeled datasets [2, 3], deep learning has proven to be very effective in extracting
intrinsic structural and high-level features. Significant progress has been made
in solving various problems in the artificial intelligence community [4], especially
in areas where the data are multidimensional and the features are difficult to
One-stage models, thanks to their simpler architecture, usually require fewer computational
resources. Some recent networks can achieve more than 150 frames
per second (fps) [43]. However, the trade-off is accuracy, especially for small-scale
objects, as shown in Table 2.2 in Section 2.5.1. Another drawback is the class imbalance
during training. In order to detect B objects in a grid, the output tensor we
mentioned above includes information for N × N × B anchor boxes. But
among them, only a few contain objects. The ratio most of the time is about 1,000:1
[36]. This results in low efficiency during training.
behaving this way. The study from Bradley [65] back in 2010 found that the magnitude
of the gradient update decreases exponentially from the output layer to the input layer.
Therefore, training becomes inefficient towards the input layers. This shows the difficulty
of training very deep models. On the other hand, research shows that simply
adding additional layers will not necessarily result in better detection performance.
Monitoring the training process shows that added layers are trained to become identity
maps. Therefore, beyond a certain number of layers, adding more can at best
match the performance of a shallower network. To address this issue,
skip connections are introduced into the networks [35, 46, 47] to pass information
between nonadjacent layers.
Feature pyramid networks (FPN) [48], which output feature maps at different
scales, can detect objects at very different scales. This idea could also be found when
features were hand-engineered [49]. Another problem is the trailing accuracy
of one-stage detectors. According to Lin et al. [36], the lower accuracy is caused by
an extreme foreground-background class imbalance during training. To address this
issue, Lin et al. introduced RetinaNet [36] by defining a focal loss to reduce the weight
of the background loss.
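A minimal sketch of the focal loss idea for the binary foreground/background case is given below; the values alpha = 0.25 and gamma = 2 are the defaults reported in the RetinaNet paper [36], and the function name is illustrative.

#include <math.h>

/* Focal loss for one prediction: p is the predicted foreground probability.
 * The factor (1 - p_t)^gamma shrinks the loss of easy, well-classified examples. */
static float focal_loss(float p, int is_foreground, float alpha, float gamma)
{
    float p_t     = is_foreground ? p : 1.0f - p;           /* prob. of the true class */
    float alpha_t = is_foreground ? alpha : 1.0f - alpha;   /* class balancing weight  */
    return -alpha_t * powf(1.0f - p_t, gamma) * logf(p_t + 1e-12f);
}

With these defaults, an easy background box with a predicted foreground probability of 0.05 contributes almost nothing to the total loss, while a misclassified box is barely down-weighted, which counteracts the background-dominated class imbalance during training.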
FIGURE 2.1 An overview of the R-CNN detection system. The system generates 2k
proposals. Each warped proposal produces feature maps. Then the system classifies each
proposal and refines the bounding-box prediction.
The first stage is the region proposal extraction. The R-CNN model utilizes selective
search [40], which takes the entire image as input and generates around 2,000
class-independent region proposals. In theory, R-CNN is able to work with any
region proposal method. Selective search is chosen because it performs well and has
been employed by other detection models. The second stage starts with a CNN that
takes a fixed-dimension image as input and generates a fixed-length feature vector as
output. The input comes from the region proposals. Each proposal is warped into
the required size regardless of its original size and aspect ratio. Then, using the CNN
feature vector, pre-trained class-specific linear support vector machines (SVMs) are
applied to calculate the class scores. Girshick et al. [41] conducted an error analysis,
and based on the analysis result, a bounding-box regression is added to reduce the
localization errors using the CNN feature vector. The regressor, which is class specific,
can refine the bounding-box prediction.
FIGURE 2.2 An overview of the Fast R-CNN structure. Feature maps of the entire image are
extracted. RoI is projected to the feature maps and pooled into a fixed size. Then, the RoI feature
vector is computed using fully connected layers. Using the vector, softmax classification
probability and bounding-box regression are calculated as outputs for each proposal.
* A unit of computational complexity equivalent to what a single GPU can complete in a day.
using selective search. For the second stage, instead of generating feature vectors
for each region proposal separately, fast R-CNN calculates one feature map from
the entire image and uses a RoI pooling layer to extract a fixed-length feature vector
for each region proposal. Each RoI feature vector is processed by a sequence of
fully connected layers and forked into two branches. The first branch calculates class
scores with a softmax layer, and the other branch is again a bounding-box regressor
that refines the bounding-box estimation.
In fast R-CNN, the RoI pooling layer is the key part where the region proposals
are translated into fixed-dimension feature vectors. This layer takes the feature map
of the entire image and RoI as input. A four-element tuple (r,c,h,w) represents a RoI,
with (r,c) and (h,w) specifying its top-left corner and its height and width respectively
[50]. Then, the (h,w) dimension of RoI is divided into an H × W grid, where H and W
are layer hyper-parameters. Standard max pooling is conducted to each grid cell in
each feature map channel.
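A minimal sketch of the RoI pooling operation just described follows, for a single feature-map channel; integer RoI coordinates are assumed, and the rounding details of the original implementation are simplified.

#include <float.h>

/* Max-pool one RoI (top-left (r, c), height h, width w) of a single
 * fm_h x fm_w feature-map channel into a fixed H_out x W_out grid. */
void roi_pool_channel(const float *fmap, int fm_h, int fm_w,
                      int r, int c, int h, int w,
                      float *out, int H_out, int W_out)
{
    for (int i = 0; i < H_out; i++)
        for (int j = 0; j < W_out; j++) {
            /* Bounds of grid cell (i, j) inside the RoI. */
            int y0 = r + (i * h) / H_out, y1 = r + ((i + 1) * h) / H_out;
            int x0 = c + (j * w) / W_out, x1 = c + ((j + 1) * w) / W_out;
            if (y1 <= y0) y1 = y0 + 1;          /* keep cells non-empty */
            if (x1 <= x0) x1 = x0 + 1;
            if (y1 > fm_h) y1 = fm_h;           /* stay inside the map  */
            if (x1 > fm_w) x1 = fm_w;
            float best = -FLT_MAX;
            for (int y = y0; y < y1; y++)
                for (int x = x0; x < x1; x++)
                    if (fmap[y * fm_w + x] > best)
                        best = fmap[y * fm_w + x];
            out[i * W_out + j] = best;
        }
}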
With similar mAP (mean average precision) levels, fast R-CNN is very successful
in improving the training/testing speed. For large-scale objects defined in He et al.
[35], fast R-CNN is 8.8 times faster in training and 146 times faster in detection [50].
It achieves 0.32s/image performance, which is a significant leap towards real-time
detection.
FIGURE 2.3 An overview of the Faster R-CNN framework. The region proposal network
(RPN) shares feature maps with a fast R-CNN network; a small convolutional network
slides over the shared feature maps to produce region proposals; and each location has k
anchor boxes.
To train the RPN, a binary class label is assigned to each anchor. A positive label is assigned
in two situations: (i) to the anchor(s) with the highest IoU overlap with a ground-truth box,
and (ii) to any anchor whose IoU overlap with some ground-truth box is higher than 0.7.
A non-positive label is given to anchors with IoU lower than 0.3 for all ground-truth boxes.
During training, only positive or non-positive anchors contribute to loss functions.
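A minimal sketch of the IoU computation and of the labelling rule above is given below; boxes are assumed to be given as [x1, y1, x2, y2], the 0.7 and 0.3 thresholds follow the text, and the helper names are illustrative.

/* Intersection over Union of two axis-aligned boxes given as [x1, y1, x2, y2]. */
static float iou(const float a[4], const float b[4])
{
    float ix1 = a[0] > b[0] ? a[0] : b[0];
    float iy1 = a[1] > b[1] ? a[1] : b[1];
    float ix2 = a[2] < b[2] ? a[2] : b[2];
    float iy2 = a[3] < b[3] ? a[3] : b[3];
    float iw = ix2 > ix1 ? ix2 - ix1 : 0.0f;
    float ih = iy2 > iy1 ? iy2 - iy1 : 0.0f;
    float inter  = iw * ih;
    float area_a = (a[2] - a[0]) * (a[3] - a[1]);
    float area_b = (b[2] - b[0]) * (b[3] - b[1]);
    return inter / (area_a + area_b - inter);
}

/* Label one anchor from its best IoU over all ground-truth boxes:
 * +1 = positive, -1 = non-positive, 0 = ignored by the RPN loss. */
static int anchor_label(float best_iou, int has_highest_iou_for_some_gt)
{
    if (has_highest_iou_for_some_gt || best_iou > 0.7f) return +1;
    if (best_iou < 0.3f) return -1;
    return 0;
}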
With the introduction of RPN, the total time to complete object detection on GPU
is 198 ms using VGG as the CNN layer. Compared to the selective search, it is
almost 10 times faster. The proposal stage is improved from 1,510 ms to only 10 ms.
Combined, the new Faster R-CNN achieves 5 fps.
Two-stage detectors usually have higher accuracy than one-stage detectors, and most of the models at the top of detection leaderboards are two-stage models. Because region proposals are refined by a dedicated regressor, these detectors inherently produce much better bounding-box predictions. In terms of speed, they offered far from real-time performance when first introduced. With recent developments, they are getting closer by simplifying
the architecture, evolving from models with multiple computationally heavy steps to models with a single shared CNN. Since Faster R-CNN [33] was published, the networks have mainly been based on shared feature maps. For example, R-FCN (region-based fully convolutional networks) [51] and Cascade R-CNN [52] focus on localization accuracy while both achieving around 10 fps. Region-based two-stage detectors are thus becoming not only more accurate but also faster, and they remain one of the main branches of object detection.
which reflects how confident the prediction is that an object exists in this bounding box and how accurate the bounding box is relative to the ground-truth box. The final parameter is C, the conditional class probabilities. It represents the class probabilities conditioned on the grid cell containing an object, regardless of how many bounding boxes the grid cell predicts. Therefore, the final predictions form an S × S × (B × 5 + C) tensor.
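As a concrete illustration of this output shape, the short Python snippet below reshapes a flat prediction into the S × S × (B × 5 + C) tensor for the PASCAL VOC setting (S = 7, B = 2, C = 20), matching the 7 × 7 × 30 output of FC2 mentioned next. The exact memory layout (boxes first, then class probabilities) is an assumption for illustration.

```python
import numpy as np

S, B, C = 7, 2, 20                      # grid size, boxes per cell, classes (PASCAL VOC)
output_dim = S * S * (B * 5 + C)        # 7 * 7 * 30 = 1,470 values in the final FC layer

fc2 = np.zeros(output_dim)              # stand-in for the flat network output
pred = fc2.reshape(S, S, B * 5 + C)     # the S x S x (B*5 + C) prediction tensor
boxes = pred[..., :B * 5].reshape(S, S, B, 5)   # (x, y, w, h, confidence) per box
class_probs = pred[..., B * 5:]                 # C class probabilities shared per cell
```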
The network design is shown in Table 2.1. It is inspired by GoogLeNet [12] and is followed by two fully connected layers: FC1, a 4,096-dimensional vector, and the final output FC2, a 7 × 7 × 30 tensor. In the paper, the authors also introduced a fast version with only 9 convolutional layers.
During training, the loss function is shown in Equation 2.1, where 1_ij^obj denotes whether the jth predictor in cell i is responsible for the prediction, and 1_i^obj denotes whether there is an object in cell i [34]. From the function, we can see that it penalizes the
TABLE 2.1
The Architecture of the Convolutional Layers of YOLO [34]
Stage Image conv1 conv2 conv3 conv4 conv5 conv6
error only when an object exists and when a prediction is actually responsible for the
ground truth.
The loss of Equation 2.1 is the sum of the following three terms:

$$
\text{Loss}_{\text{bounding box}} = \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \Big[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2\Big]
 + \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \Big[\big(\sqrt{w_i}-\sqrt{\hat{w}_i}\big)^2 + \big(\sqrt{h_i}-\sqrt{\hat{h}_i}\big)^2\Big],
$$

$$
\text{Loss}_{\text{confidence}} = \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \big(C_i-\hat{C}_i\big)^2
 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \big(C_i-\hat{C}_i\big)^2,
$$

$$
\text{Loss}_{\text{classification}} = \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c\,\in\,\text{classes}} \big(p_i(c)-\hat{p}_i(c)\big)^2 .
$$
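For readers who prefer code, the NumPy sketch below evaluates the three terms above. The array layout of the masks and box tensors is an assumption made for illustration; it is not the implementation of [34].

```python
import numpy as np

def yolo_loss(pred_boxes, true_boxes, pred_cls, true_cls, resp, obj_cell,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Sum of the bounding-box, confidence, and classification terms of Equation 2.1.

    pred_boxes, true_boxes : (S, S, B, 5) arrays of (x, y, w, h, C), w and h non-negative
    pred_cls, true_cls     : (S, S, num_classes) conditional class probabilities
    resp     : (S, S, B) 0/1 mask, 1 where predictor j in cell i is responsible (1_ij^obj)
    obj_cell : (S, S)    0/1 mask, 1 where an object appears in cell i (1_i^obj)
    """
    noobj = 1.0 - resp

    # Localization: x, y directly; w, h through square roots as in the paper.
    xy_err = np.sum(resp * ((pred_boxes[..., 0] - true_boxes[..., 0]) ** 2 +
                            (pred_boxes[..., 1] - true_boxes[..., 1]) ** 2))
    wh_err = np.sum(resp * ((np.sqrt(pred_boxes[..., 2]) - np.sqrt(true_boxes[..., 2])) ** 2 +
                            (np.sqrt(pred_boxes[..., 3]) - np.sqrt(true_boxes[..., 3])) ** 2))
    loss_box = lambda_coord * (xy_err + wh_err)

    # Confidence: responsible predictors plus down-weighted non-object predictors.
    conf_err = (pred_boxes[..., 4] - true_boxes[..., 4]) ** 2
    loss_conf = np.sum(resp * conf_err) + lambda_noobj * np.sum(noobj * conf_err)

    # Classification: only cells that contain an object.
    loss_cls = np.sum(obj_cell[..., None] * (pred_cls - true_cls) ** 2)

    return loss_box + loss_conf + loss_cls
```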
YOLO9000 [53], as its name suggests, can detect 9,000 object categories. Based on a slightly modified version of YOLOv2, this is achieved with a joint training strategy on classification and detection datasets. A WordTree hierarchy [53] is used to merge the ground-truth classes from different datasets.
In this section, we discuss some of the most effective modifications. Batch normalization can help reduce the effect of internal covariate shift [54], thus accelerating convergence during training. By adding batch normalization to all convolutional layers, the accuracy is improved by 2.4%.
In YOLO, the classifier is trained at a resolution of 224 × 224, and at the detection stage the resolution is increased to 448 × 448. In YOLOv2, during the last 10 epochs of classifier training, the input is changed to the full 448 resolution, so that detection training can focus on object detection rather than on adapting to the new resolution. This gives a 4% mAP increase. While trying anchor boxes with YOLO, an instability issue was exposed. YOLO predicts each box by generating offsets to an anchor. Training is most efficient when every object is predicted by the closest anchor with minimal offsets; however, without offset constraints, an anchor could predict an object at any location. Therefore, in YOLOv2 [53] a logistic activation constraint on the offsets is introduced to keep the predicted bounding box near the original anchor. This makes the network more stable and increases mAP by 4.8%.
2.3.3 YOLOV3
YOLOv3 [43] is another improved detector based on YOLOv2. The major improvement is in the convolutional network. Based on the idea of Feature Pyramid Networks [48], YOLOv3 predicts boxes at 3 different scales, which helps to detect small objects. The idea of skip connections, as discussed in the following section, is also added to the design. As a result, the network becomes larger, with 53 layers compared to the original 19; it is therefore called Darknet-53 [43]. Another achievement is that Darknet-53 runs at the highest measured floating-point operation rate, an indication that the network makes better use of GPU resources.
FIGURE 2.4 SSD architecture. The first CONV layer, which is called the base network, is truncated from a standard network. The detection layer computes confidence scores for each class and offsets to default boxes.
is divided into grid cells. Each cell in the feature maps is associated with a set of default bounding boxes and aspect ratios. SSD then computes category confidence scores and bounding-box offsets to those default bounding boxes for each set. During prediction, SSD detects objects of different sizes on feature maps of various scales (Figure 2.4).
One-stage models inherently have the advantage of speed. In recent years, models running at more than 150 fps have been published, while a fast two-stage model [33] only achieves around 20 fps. To make predictions more accurate, networks combine ideas from both one-stage and two-stage models to find a balance between speed and accuracy. For example, Faster R-CNN [33] resembles a one-stage model in sharing one major CNN network. One-stage models normally produce predictions from the entire feature map, so they are good at taking context information into consideration. However, this also means they are not very sensitive to smaller objects. Methods like the deconvolutional single-shot detector (DSSD) [55] and RetinaNet [36] have been proposed to fix this problem from different viewpoints. DSSD, borrowing from FPN [48], modifies its CNN to predict objects at different scales, while RetinaNet develops a new loss function to focus on hard examples. A detailed comparison of results from different models on the PASCAL VOC dataset is given in Table 2.2.
TABLE 2.2
Detection Results on PASCAL VOC 2007 Test Set
Detector mAP (%) fps
FIGURE 2.5 ResNet building block. The underlying mapping is converted from H(x) to F(x), where H(x) = F(x) + x.
solution is an identity mapping, the modified layers only need to fit zero once the skip connection is added. The hypothesis is supported by the test results, as discussed in the next section.
Another attractive property of ResNets is that the design does not introduce any extra parameters or computational complexity. This is desirable in two ways. First, it helps the network achieve better performance without extra cost. Second, it facilitates a fair comparison between plain networks and residual networks: during testing, we can run the same network with and without the skip connection, so any performance difference is caused solely by the skip connection.
Based on a 34-layer plain network, skip connections are added as indicated in Figure 2.5. When the input and output dimensions are the same, the skip connection can be added directly. When the skip connection goes across different dimensions, the paper proposes two options: (i) identity mapping with extra zero padding, or (ii) using a projection shortcut to match the dimensions. With either option, a stride of 2 is performed.
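A minimal PyTorch sketch of the building block in Figure 2.5 is given below, assuming equal input and output channels so that the identity shortcut applies directly; it is an illustration, not the exact ResNet-34 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Residual building block: H(x) = F(x) + x with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # F(x): two 3x3 convolutions with a ReLU in between.
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The skip connection adds no extra parameters or computation.
        return F.relu(out + x)

y = BasicBlock(64)(torch.randn(1, 64, 56, 56))   # same spatial size and channel count
```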
The significance of this network design is that it reduces the difficulty of finding the optimal mapping for each layer. Without it, researchers have to design architectures with different depths, train them on the dataset, and then compare detection accuracies to narrow down the optimal depth. This process is very time-consuming, and different problem domains may have different optimal depths. The common training strategy, introduced in R-CNN [41], is supervised pre-training on a large auxiliary dataset followed by domain-specific fine-tuning on a small dataset. On the domain-specific dataset, there is a high chance that the network architecture is not optimal, and it is practically impossible to find the optimal depth for each specific problem. With ResNets, however, we can safely add more layers and expect better performance; we can rely on the network and the training to find the best-performing model as long as the optimal model is no deeper than the one we build. Another advantage of ResNets is efficiency: the performance improvement comes without adding any computational complexity.
2.4.2 RETINANET
With all the advancements made in recent years, both two-stage and one-stage methods have their own pros and cons. One-stage methods are usually faster in detection but trail in detection accuracy. According to Lin et al. [36], the lower accuracy is mainly caused by the extreme foreground-background class imbalance during training. A one-stage detector must sample many more candidate locations in an image; in practice, the number of candidate locations can reach around 100k, covering different positions, scales, and aspect ratios. These candidates are dominated by background examples, which makes training very inefficient. In two-stage detectors, by contrast, most background regions are filtered out by the region proposals.
To address this issue, Lin et al. proposed RetinaNet [36]. The problem with a normal one-stage network is the overwhelming number of easy background candidates, which contribute little to the improvement of accuracy during training. The standard cross-entropy (CE) loss for binary classification is given in Equation 2.2:
$$
\mathrm{CE}(p, y) = -\log(p_t), \quad \text{where } p_t =
\begin{cases}
p, & \text{if } y = 1 \\
1 - p, & \text{otherwise}
\end{cases}
\tag{2.2}
$$
To reduce the imbalance between easy background and hard foreground samples, Lin et al. [36] introduced the focal loss, FL(p_t) = −α_t (1 − p_t)^γ log(p_t). When a sample is misclassified, p_t has a small value, so the modulating factor (1 − p_t)^γ is close to 1 and the loss is barely affected. But when a sample is correctly classified with high confidence, p_t is close to 1, which significantly down-weights its contribution. In addition, the weighting factor α is employed to balance the importance of positive and negative examples.
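A minimal NumPy sketch of this loss is shown below, assuming binary labels and the γ = 2, α = 0.25 values reported in [36].

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for binary labels y in {0, 1}; p is the predicted foreground probability."""
    p_t = np.where(y == 1, p, 1.0 - p)              # p_t as defined in Equation 2.2
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    # (1 - p_t)^gamma down-weights well-classified (easy) examples.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```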
Both models in this section address problems encountered during experiments. ResNet [35] has become one of the most widely used backbone networks; its model is provided in most popular deep learning frameworks, such as Caffe, PyTorch, Keras, and MXNet. RetinaNet [36] achieves a remarkable performance improvement with minimal additional computational complexity by designing a new loss function. More importantly, it shows a new direction for optimizing detectors: aligning the training objective with the evaluation metric can result in a significant accuracy increase. Rezatofighi et al. [57] in 2019 proposed the Generalized Intersection-over-Union (GIoU) to replace the current IoU and used GIoU as the loss function. Before their study, CNN detectors calculated IoU at test time to evaluate results while employing other metrics as the loss function during training. Other recent detectors proposed after 2017 mainly focus on detecting objects at various scales, especially at small scale [52, 58, 60]. Among them, the trend in design is refinement based on previously successful detectors [59, 37].
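As an illustration of the GIoU idea, the sketch below computes GIoU for two axis-aligned boxes in (x1, y1, x2, y2) form, so that 1 − GIoU can serve as a regression loss. The box format is an assumption for illustration, not the exact formulation of [57].

```python
def giou(box_a, box_b):
    """Generalized IoU for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection and union of the two boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest enclosing box C around both boxes.
    area_c = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    # GIoU penalizes the empty space of C not covered by the union.
    return iou - (area_c - union) / area_c
```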
2.5.2 MS COCO
MS COCO [26] is a large-scale object detection, segmentation, and captioning dataset with 330,000 images. It aims at addressing three core problems of scene understanding: detection of non-iconic views (or non-canonical perspectives) of objects, contextual reasoning between objects, and precise 2D localization of objects [26]. COCO defines 12 metrics to evaluate the performance of a detector, which gives a more detailed and insightful picture. Since MS COCO is a relatively new dataset, not all detectors have official results; we only compare the results of YOLOv3 and RetinaNet in Table 2.3. Although the two models were tested on different GPUs, an M40 and a Titan X, their performance is almost identical [43].
Performance difference with higher IoU threshold. From Table 2.3, we can see
that YOLOv3 performs better than RetinaNet when the IoU threshold is 50%, while
at 75% IoU, RetinaNet has better performance, which indicates that RetinaNet has
higher localization accuracy. This aligns with the effort by Lin et al. [36]. The focal
loss is designed to put more weight on learning hard examples.
2.5.3 VISDRONE-DET2018
VisDrone-DET2018 dataset [61] is a special dataset that consists of images from
different drone platforms. It has 8,599 images in total, including 6,471 for train-
ing, 548 for validation, and 1,580 for testing. The images feature diverse real-
world scenes, collected in various scenarios (across 14 different cities in different
TABLE 2.3
Detection Results on MS COCO*
Detector Backbone Scale (pixels) AP AP50 AP75 fps
* AP50 and AP75 are average precision with IoU thresholds of 0.5 and 0.75, respectively. AP (average precision) is the average over 10 thresholds ([0.5:0.05:0.95]). RetinaNet is measured on an Nvidia M40 GPU [36], while YOLOv3 is tested on an Nvidia Titan X [43].
regions), and under various environmental and lighting conditions [62]. This data-
set is challenging since most of the objects are small and densely populated as
shown in Figure 2.6.
To check the performance of different models on this dataset, we implemented
the latest version of each detector and show the result in Tables 2.4 and 2.5. The
results are calculated on the VisDrone validation set since the test set is not publicly
available.
From Table 2.4, it is evident that Faster R-CNN [33] performs significantly better than YOLOv3 [43]; at IoU = 0.5, Faster R-CNN has 10% higher accuracy. YOLO detectors inherently struggle with datasets like this. The fully connected layers at the end take the entire feature map as input, which gives YOLO enough contextual information but leaves it lacking local detail. In addition, each grid cell can predict only a fixed number of objects, determined before training starts, so YOLO has a hard time with the small and densely populated VisDrone-DET2018
FIGURE 2.6 Six sample images from the VisDrone-DET2018 dataset, taken from drones. The rectangular boxes on each image are ground truth. The images contain a large number of small objects at a distance.
TABLE 2.4
Performance Metrics on VisDrone-DET2018*
Detector Iterations AP Score
* AR @ [maxDets=1] and [maxDets=10] mean the maximum recall given 1 detection per
image and 10 detections per image, respectively.
TABLE 2.5
RetinaNet on VisDrone-DET2018 without Tuning
Pedestrian Person Bicycle Car Van
dataset. It can be noted that the gap is reduced to 5% at IoU = 0.75, which means
YOLOv3 is catching up in terms of localization precision. We suggest this is the
result of adding prediction at 3 different scales in YOLOv3 and detecting on high-
resolution (832 × 832) images.
Table 2.5 shows the per-class mAPs from RetinaNet [36]. It indicates that recent detectors still need improvement for small and morphologically similar objects. The VisDrone dataset is a unique dataset with many potential real-life applications. In practice, most applications have to face challenging situations such as bad exposure, poor lighting, and scenes saturated with objects. More investigation is needed to develop effective models that can handle these complex real-life applications.
The evolution of object detection is partially linked to the availability of labeled
large-scale datasets. The existence of various datasets helps train the neural network
to extract more effective features. The different characteristics of datasets motivate
2.6 CONCLUSION
In this chapter, we have given a brief review of CNN-based object detection by presenting some of the most typical detectors and network architectures. Detectors started with a two-stage, multi-step architecture and evolved towards simpler one-stage models. In the latest models, even two-stage detectors employ an architecture that shares a single CNN feature map so as to reduce the computational load [52]. Some models gain accuracy by fusing different ideas into one detector [47, 59]. In addition to direct work on detectors, the training strategy is also an important factor in producing high-quality results [41, 63]. With recent developments, most detectors offer decent performance in both accuracy and efficiency. In practical applications, we need to make a trade-off between accuracy and speed by choosing a proper set of parameters and network structures. Although great progress has been made in object detection in recent years, some challenges still need to be addressed, such as occlusion [64] and truncation. In addition, better-designed datasets, like VisDrone, need to be developed for specific practical applications.
BIBLIOGRAPHY
1. G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief
nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
2. Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new
perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.
35, no. 8, pp. 1798–1828, 2013.
3. I. Arel, D. C. Rose, T. P. Karnowski, et al., "Deep machine learning - a new frontier in artificial intelligence research," IEEE Computational Intelligence Magazine, vol. 5, no. 4, pp. 13–18, 2010.
4. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p.
436, 2015.
5. G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V.
Vanhoucke, P. Nguyen, T. N. Sainath, et al., “Deep neural networks for acoustic mod-
eling in speech recognition: The shared views of four research groups,” IEEE Signal
Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
6. T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convo-
lutional neural networks for lvcsr,” in IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 8614–8618, IEEE, 2013.
7. G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neu-
ral networks for large-vocabulary speech recognition,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
8. R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural
language processing (almost) from scratch,” Journal of Machine Learning Research,
vol. 12, pp. 2493–2537, 2011.
9. I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural net-
works,” in Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
10. R. Socher, C. C. Lin, C. Manning, and A. Y. Ng, “Parsing natural scenes and natural
language with recursive neural networks,” in Proceedings of the 28th International
Conference on Machine Learning (ICML-11), pp. 129–136, 2011.
11. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
12. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
13. L. He, G. Wang, and Z. Hu, “Learning depth from single images with deep neural net-
work embedding focal length,” IEEE Transactions on Image Processing, vol. 27, no. 9,
pp. 4676–4689, 2018.
14. J. Gao, J. Yang, G. Wang, and M. Li, “A novel feature extraction method for scene
recognition based on centered convolutional restricted boltzmann machines,”
Neurocomputing, vol. 214, pp. 708–717, 2016.
15. M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E.
Muharemagic, “Deep learning applications and challenges in big data analytics,”
Journal of Big Data, vol. 2, no. 1, p. 1, 2015.
16. G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van
Der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical
image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
17. X. Mo, K. Tao, Q. Wang, and G. Wang, "An efficient approach for polyps detection in endoscopic videos based on faster R-CNN," in 2018 24th International Conference on Pattern Recognition (ICPR), pp. 3929–3934, IEEE, 2018.
18. A. Elgammal, B. Liu, M. Elhoseiny, and M. Mazzone, “CAN: Creative adversarial
networks, generating “art” by learning about styles and deviating from style norms,”
arXiv preprint arXiv:1706.07068, 2017.
19. W. Xu, S. Keshmiri, and G. Wang, “Adversarially approximated autoencoder for
image generation and manipulation,” IEEE Transactions on Multimedia, doi:10.1109/
TMM.2019.2898777, 2019.
20. W. Xu, S. Keshmiri, and G. Wang, “Toward learning a unifed many-to-many mapping
for diverse image translation,” Pattern Recognition, doi:10.1016/j.patcog.2019.05.017,
2019.
21. W. Ma, Y. Wu, Z. Wang, and G. Wang, "MDCN: Multi-scale, deep inception convolutional neural networks for efficient object detection," in 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2510–2515, IEEE, 2018.
22. Z. Zhang, Y. Wu, and G. Wang, “Bpgrad: Towards global optimality in deep learning
via branch and pruning,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 3301–3309, 2018.
23. L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, “Deep
learning for generic object detection: A survey,” arXiv preprint arXiv:1809.02165, 2018.
24. F. Cen and G. Wang, “Dictionary representation of deep features for occlusion-robust
face recognition,” IEEE Access, vol. 7, pp. 26595–26605, 2019.
25. M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal
visual object classes (voc) challenge,” International Journal of Computer Vision, vol.
88, no. 2, pp. 303–338, 2010.
26. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L.
Zitnick, “Microsoft coco: Common objects in context,” in European Conference on
Computer Vision, pp. 740–755, Springer, 2014.
27. S. P. Bharati, S. Nandi, Y. Wu, Y. Sui, and G. Wang, “Fast and robust object tracking
with adaptive detection,” in 2016 IEEE 28th International Conference on Tools with
Artifcial Intelligence (ICTAI), pp. 706–713, IEEE, 2016.
28. Y. Wu, Y. Sui, and G. Wang, “Vision-based real-time aerial object localization and
tracking for UAV sensing system,” IEEE Access, vol. 5, pp. 23969–23978, 2017.
29. S. P. Bharati, Y. Wu, Y. Sui, C. Padgett, and G. Wang, “Real-time obstacle detection and
tracking for sense-and-avoid mechanism in UAVs,” IEEE Transactions on Intelligent
Vehicles, vol. 3, no. 2, pp. 185–197, 2018.
30. Y. Wei, X. Pan, H. Qin, W. Ouyang, and J. Yan, “Quantization mimic: Towards very tiny
CNN for object detection,” in Proceedings of the European Conference on Computer
Vision (ECCV), pp. 267–283, 2018.
31. K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X.
Wang, et al., “T-CNN: Tubelets with convolutional neural networks for object detection
from videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol.
28, no. 10, pp. 2896–2907, 2018.
32. W. Chu and D. Cai, “Deep feature based contextual model for object detection,”
Neurocomputing, vol. 275, pp. 1035–1042, 2018.
33. S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detec-
tion with region proposal networks,” in Advances in Neural Information Processing
Systems, pp. 91–99, 2015.
34. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, 2016.
35. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778, 2016.
36. T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detec-
tion,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, p. 1, 2018.
37. Q. Zhao, T. Sheng, Y. Wang, Z. Tang, Y. Chen, L. Cai, and H. Ling, “M2det: A sin-
gle-shot object detector based on multi-level feature pyramid network,” CoRR, vol.
abs/1811.04533, 2019.
38. P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multi-
scale, deformable part model,” in IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1–8, IEEE, 2008.
39. P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat:
Integrated recognition, localization and detection using convolutional networks,” arXiv
preprint arXiv:1312.6229, 2013.
40. J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search
for object recognition,” International Journal of Computer Vision, vol. 104, no. 2, pp.
154–171, 2013.
41. R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate
object detection and semantic segmentation,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 580–587, 2014.
42. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd:
Single shot multibox detector,” in European Conference on Computer Vision, pp. 21–37,
Springer, 2016.
43. J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint
arXiv:1804.02767, 2018.
44. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale
image recognition,” arXiv preprint arXiv:1409.1556, 2014.
45. X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
46. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected con-
volutional networks,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 4700–4708, 2017.
47. Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” in Advances
in Neural Information Processing Systems, pp. 4467–4475, 2017.
48. T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature
pyramid networks for object detection,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 2117–2125, 2017.
49. D. G. Lowe, “Object recognition from local scale-invariant features,” in The
Proceedings of the Seventh IEEE International Conference on Computer Vision, vol.
99, no. 2, pp. 1150–1157, IEEE, 1999.
50. R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on
Computer Vision, pp. 1440–1448, 2015.
51. J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully con-
volutional networks,” in Advances in Neural Information Processing Systems, pp.
379–387, 2016.
52. Z. Cai and N. Vasconcelos, “Cascade R-CNN: Delving into high quality object detec-
tion,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 6154–6162, 2018.
53. J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271, 2017.
54. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by
reducing internal covariate shift,” in Proceedings of the 32nd International Conference
on Machine Learning, vol. 37, pp. 448–456, 2015.
55. C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “DSSD: Deconvolutional single
shot detector,” arXiv preprint arXiv:1701.06659, 2017.
56. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception
architecture for computer vision,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 2818–2826, 2016.
57. H. Rezatofghi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized
intersection over union: A metric and a loss for bounding box regression,” arXiv pre-
print arXiv:1902.09630, 2019.
58. P. Zhou, B. Ni, C. Geng, J. Hu, and Y. Xu, “Scale-transferrable object detection,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.
528–537, 2018.
59. S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, "Single-shot refinement neural network for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4203–4212, 2018.
60. B. Singh and L. S. Davis, “An analysis of scale invariance in object detection snip,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.
3578–3587, 2018.
61. P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu, “Vision meets drones: A challenge,”
arXiv preprint arXiv:1804.07437, 2018.
62. P. Zhu, L. Wen, D. Du, X. Bian, H. Ling, Q. Hu, Q. Nie, H. Cheng, C. Liu, X. Liu, et
al., “VisDrone-DET2018: The vision meets drone object detection in image challenge
results,” In Proceedings of the European Conference on Computer Vision (ECCV),
pp. 437–468, 2018.
63. J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter
optimization,” in Advances in Neural Information Processing Systems, pp. 2546–2554,
2011.
64. B. Wu and R. Nevatia, “Detection of multiple, partially occluded humans in a single
image by bayesian combination of edgelet part detectors,” in Tenth IEEE International
Conference on Computer Vision, vol. 1, pp. 90–97, 2005.
65. D. M. Bradley, "Learning in modular systems," Technical Report CMU-RI-TR-09-26, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, 2010.
3 Efficient Convolutional
Neural Networks
for Fire Detection in
Surveillance Applications
Khan Muhammad, Salman Khan,
and Sung Wook Baik
CONTENTS
3.1 Introduction ....................................................................................................64
3.2 Related Work .................................................................................................. 65
3.2.1 Traditional Sensor-Based Fire Detection............................................ 65
3.2.2 Vision-Based Fire Detection...............................................................66
3.2.2.1 Color- and Motion-Based Fire Detection Methods..............66
3.2.2.2 Deep Learning-Based Fire Detection Methods ...................66
3.3 CNN-Based Fire Detection Methods ............................................................. 67
3.3.1 EFD: Early Fire Detection for Efficient Disaster Management ......... 67
3.3.1.1 EFD Architecture................................................................. 68
3.3.1.2 EFD Training and Fine-Tuning............................................ 68
3.3.2 BEA: Balancing Efficiency and Accuracy ......................................... 69
3.3.2.1 BEA Framework .................................................................. 70
3.3.3 FLCIE: Fire Localization and Contextual Information Extraction........ 72
3.3.3.1 FLCIE Architecture............................................................. 72
3.3.3.2 Feature Map Selection for Localization .............................. 72
3.3.3.3 Contextual Information Extraction Mechanism .................. 74
3.4 Experimental Results and Discussion ............................................................ 75
3.4.1 Datasets............................................................................................... 75
3.4.2 Experimental Setup ............................................................................ 77
3.4.3 Experimental Results.......................................................................... 78
3.4.3.1 Results and Comparison ...................................................... 78
3.4.3.2 FLCIE Results .....................................................................80
3.4.3.3 Experiments for Contextual Information Extraction........... 81
3.5 Conclusions..................................................................................................... 83
Bibliography ............................................................................................................84
3.1 INTRODUCTION
Recently, a variety of sensors have been introduced for different applications such as setting off a fire alarm [1], detecting vehicle obstacles [2], visualizing the interior of the human body for diagnosis [3–5], monitoring animals and ships, and surveillance. Of these applications, surveillance has attracted particular attention from researchers due to the enhanced embedded processing capabilities of cameras. Using smart surveillance systems, various abnormal events such as road accidents, fires, and medical emergencies can be detected at early stages, and the appropriate authority can be informed autonomously [6, 7]. A fire is an abnormal event that can cause significant damage to lives and property within a very short time. Such disasters, the causes of which include human error and system failures, result in severe loss of human life and other damage [8]. In June 2013, a fire disaster killed 19 firefighters and ruined 100 houses in Arizona. Similarly, another forest fire in August 2013 in California burned an area of 1,042 km2, causing a loss of US$127.35 million. According to an annual disaster report, fire disasters alone affected 494,000 people and resulted in a loss of US$3.1 billion in 2015. According to [9], at least 37 people were killed and about 130 injured in a fire at Sejong Hospital, Miryang, South Korea, in December 2017, and a fire in Greece killed around 80 people, injured hundreds, and destroyed about 1,000 homes in July 2018. It is therefore important to detect fires at early stages using smart surveillance cameras in order to avoid such disasters.
Two broad categories of approach can be identified for fire detection: traditional fire alarms and vision sensor-assisted fire detection. Traditional fire alarm systems are based on sensors, such as infrared and optical sensors, that require close proximity for activation. To overcome the limitations of existing approaches, numerous vision sensor-based methods have been explored by researchers in this field; these have many other advantages, e.g., less human interference, faster response, affordable cost, and larger surveillance coverage. Despite these advantages, there are still some issues with these systems, e.g., the complexity of the scenes under observation, irregular lighting, and low-quality frames; researchers have made several efforts to address these aspects, taking both color and motion features into consideration. For instance, [8, 10–17] used color features for fire detection by exploring different color models including HSI [8], YUV [13], YCbCr [14], RGB [15], and YUC [10]. The major issue with these methods is their high rate of false alarms. Several attempts have been made to solve this issue by combining color information with motion and with analyses of a fire's shape and other characteristics. However, maintaining a well-balanced trade-off between accuracy, false alarms, and computational efficiency has remained a challenge. In addition, the above-mentioned methods fail to detect small fires or fires at larger distances.
In this chapter, two major issues related to fire detection are investigated: image representation with feature vectors, and feature map selection for fire localization and contextual information extraction. The proposed solution for image representation uses highly discriminative deep convolutional features to construct a robust representation. The extensive fine-tuning process enables the feature extraction pipeline to focus on fire regions in images, thereby effectively capturing the essential features of fire and ignoring irrelevant, trivial, or often
misleading features. Further details about feature extraction, feature map selection,
and contextual information extraction are provided in the respective sections of the
chapter.
Section 3.2 presents an overview of image representation techniques for fire detection based on traditional hand-crafted features and recent deep learning approaches. Section 3.3 introduces the main contributions of this chapter to the fire detection literature for both indoor and outdoor surveillance. Section 3.4 contains experimental results on benchmark datasets and evaluations of the proposed methods from different perspectives. Finally, Section 3.5 concludes the chapter by focusing on the key findings, strengths, and weaknesses of the proposed methods, along with some future research directions.
This necessitates effective fire alarm systems for surveillance. The majority of the research on fire detection uses cameras and visual sensors, whose details are given in the next section.
indexing and retrieval [32, 33]. Recently, some preliminary results for fire detection using CNNs have also been reported; it is therefore important to highlight and analyze their performance in comparison with our proposed works. Frizzi et al. [34] presented a CNN-based method for fire and smoke detection. This work is based on a limited number of images and offers no comparison with existing methods that could prove its performance. Sharma et al. [35] explored VGG16 and ResNet50 for fire detection. The dataset is very small (651 images only) and the reported testing accuracy is less than 93%. This work is compared with that of Frizzi et al. [34], with a testing accuracy of 50%. Another recent work was presented by Zhong et al. [36], where a flame detection algorithm based on a CNN processes video data generated by an ordinary camera monitoring a scene. First, a candidate target area extraction algorithm is used to find suspected flame areas and to improve recognition efficiency. Second, the extracted feature maps of the candidate areas are classified by the designed CNN-based deep neural network model. Finally, the corresponding alarm signal is obtained from the classification results. Experiments conducted on a homemade database show that this method can identify fire and achieve reasonable accuracy, but with a higher false alarm rate. The method can process 30 fps on a standard desktop PC equipped with an octa-core 3.6 GHz CPU and 8 GB RAM.
It can be observed that some of the methods are too naïve: their execution time is fast, but they compromise on accuracy, producing a large number of false alarms. Conversely, some methods achieve good fire detection accuracy but their execution time is too high; hence, they cannot be applied in real-world environments, especially in critical areas where a minor delay can lead to a huge disaster. Therefore, for more accurate and earlier detection of fire, we need a robust mechanism that can detect fire under varying conditions and immediately send the important keyframes and alerts to disaster management systems.
FIGURE 3.2 Early fire detection using CNN with reliable communication for effective disaster management [7].
objects. Thus, there is a need for an algorithm that can achieve better accuracy in the aforementioned scenarios while minimizing the number of false alarms. To achieve this goal, we explored deep CNNs and devised a fine-tuned architecture for early fire detection during surveillance for effective disaster management. Our system is illustrated in Figure 3.2.
FIGURE 3.4 Sample query images along with their probabilities for CNN-based fire detection [7].
process, a target model that can be used for the prediction of fire at early stages was obtained. For testing, the query image is passed through the proposed model, which outputs probabilities for both classes: fire and non-fire (normal). Based on the higher probability, the image is assigned to its appropriate class. Examples of query images along with their probabilities are shown in Figure 3.4.
FIGURE 3.5 Early flame detection in surveillance videos using deep CNN [40].
recent improvements in embedded processing capabilities and the potential of deep features, numerous CNNs have been investigated to improve flame detection accuracy and minimize the false warning rate. In this section, we propose a cost-effective fire detection CNN architecture for surveillance videos. The model is inspired by the GoogleNet architecture, considering its reasonable computational complexity and suitability for the intended problem compared to other computationally expensive networks such as AlexNet. To balance efficiency and accuracy, the model is fine-tuned considering the nature of the target problem and fire data. An overview of our framework for flame detection in CCTV surveillance networks is given in Figure 3.5.
The input image, of size 224 × 224 × 3 pixels, is passed through the proposed CNN architecture visualized in Figure 3.6. The motivation for using the proposed architecture is to avoid an uncontrollable increase in computational complexity while retaining network flexibility, without significant increases in the number of units at each stage. To achieve this, a dimensionality reduction mechanism is applied before computation-hungry convolutions of patches with larger sizes. Further, the architecture is modified for the fire classification problem by keeping the number of output classes to 2, i.e., fire and non-fire. The results of using this architecture on benchmark datasets, with prediction scores for the fire and non-fire classes, are shown in Figure 3.7.
FIGURE 3.7 Probability scores and predicted labels produced by the proposed deep CNN
framework for different images from benchmark datasets [40].
FIGURE 3.8 Prediction scores for a set of query images using the proposed deep CNN [45].
FIGURE 3.9 Sample images and the corresponding localized fire regions using our approach. The first row shows the original images, while the second row shows the localized fire regions [45].
First, a prediction is obtained from our deep CNN. In non-fire cases, no further action is performed; in the case of fire, localization is performed as given in Algorithms 1 and 2.
After analyzing all the feature maps of the different layers of our proposed CNN using Algorithm 1, feature maps 8, 26, and 32 of the "Fire2/Concat" layer were found to be sensitive to fire regions and therefore appropriate for fire localization. These three feature maps are fused and binarization is applied to segment the fire [46]. A set of sample fire images with their segmented regions is given in Figure 3.9.
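A minimal NumPy sketch of this fusion-and-binarization step is shown below. It assumes the activations of the "Fire2/Concat" layer are available as an array; the averaging, normalization, and threshold value are illustrative choices, not the exact procedure of [45].

```python
import numpy as np

def localize_fire(feature_maps, idx=(8, 26, 32), threshold=0.5):
    """Fuse the selected feature maps and binarize them into a fire mask.

    feature_maps: (N, H, W) activations of the layer used for localization.
    idx and threshold are illustrative values only.
    """
    fused = np.mean([feature_maps[i] for i in idx], axis=0)
    fused = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)  # scale to [0, 1]
    return fused > threshold   # boolean fire mask
```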
6. Calculate the Hamming distance HD_i between GT and each binarized feature map F_bin(i) as follows: HD_i = Σ |F_bin(i) − GT|, i.e., the number of pixels at which the two binary maps differ.
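For binary masks, this distance can be computed as follows (a small sketch, assuming boolean arrays of equal shape):

```python
import numpy as np

def hamming_distance(f_bin, gt):
    """Count the pixels where the binarized feature map disagrees with the ground truth."""
    return int(np.sum(f_bin.astype(bool) != gt.astype(bool)))
```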
1. What is on fire?
2. Are there any people in the building?
3. What type of building is on fire?
4. Are there any hazards in or around the building?
5. Is evacuation in progress?
6. What is the address of the premises?
7. What is your phone number?
8. Other information as relevant.
The proposed method can extract similar contextual information, as well as more useful information (such as the size and expansion of the fire), automatically by processing the video stream of surveillance cameras. To achieve this, the segmented fire regions from Algorithm 2 are further processed to determine the severity level (burning degree) of the scene under observation and to find the zone of influence (ZOI) in the input fire image. The burning degree can be determined from the number of pixels in the segmented fire.
The zone of influence is calculated by subtracting the segmented fire regions from the original input image. The resulting zone-of-influence image is then passed to the original SqueezeNet model [44], which predicts its label from 1,000 objects. The object information can be used to determine the situation in the scene, such as a fire in a house, a forest, or a vehicle. This information, along with the severity of the fire, can be reported to the fire brigade so that appropriate action can be taken.
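A small sketch of this ZOI step is given below, under the assumption that the fire mask from Algorithm 2 is a boolean array; the pixel-count "severity" is an illustrative proxy for the burning degree, not the exact measure used in [45].

```python
import numpy as np

def zone_of_influence(image, fire_mask):
    """Remove segmented fire pixels from the frame and estimate a simple severity score.

    image: (H, W, 3) uint8 frame; fire_mask: (H, W) boolean mask of segmented fire.
    The returned ZOI image is what gets classified by the original SqueezeNet model.
    """
    zoi = image.copy()
    zoi[fire_mask] = 0            # blank out the burning region
    severity = fire_mask.mean()   # fraction of fire pixels as a burning-degree proxy
    return zoi, severity
```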
FIGURE 3.10 Fire localization and contextual information extraction using the proposed deep CNN [45].
3.4.1 DATASETS
This section presents the benchmark datasets used for the experimental results and evaluation. The experiments focus on two well-known benchmark datasets: Foggia's video dataset [10] and Chino's dataset [47]. The first dataset contains a total of 31 videos covering indoor as well as outdoor scenarios, where 14 videos belong
to the fire class and the remaining 17 videos belong to the non-fire class. The main motivation for selecting this dataset is that it contains videos recorded in both indoor and outdoor environments. Furthermore, this dataset is very challenging because of the fire-like objects in the non-fire videos, which can be predicted as fire, making the classification more difficult. Color-based methods may fail to differentiate between real fire and scenes with red objects, and motion-based techniques may wrongly classify a scene with mountains covered by smoke, cloud, or fog. These compositions make the dataset more challenging, enabling us to stress our framework and investigate its performance in various real-world situations. Representative frames of these videos are shown in Figure 3.11.
The second dataset (DS2) [47] is comparatively small but very challenging. There are 226 images in this dataset, out of which 119 images contain fire while
FIGURE 3.11 Sample images from DS1 videos, containing fire and without fire [7].
the remaining 107 are fire-like images containing sunsets, fire-like lights, sunlight coming through windows, etc. A set of selected images from this dataset is shown in Figure 3.12.
deep learning framework [48]. The rest of the experiments were conducted using
MATLAB R2015a with a Core i5 system containing 8 GB RAM.
Furthermore, two different sets of evaluation metrics were employed to evaluate the performance of each method from all perspectives. The first set of metrics contains true positives (accuracy), false positives, and false negatives, as used in Muhammad et al. [49]. Dividing the number of images identified as fire by the system by the number of actual fire images gives the "true positive" (TP) rate. The "false negative" (FN) rate is calculated by dividing the number of fire images identified as non-fire by the system by the number of actual fire images. The "false positive" (FP) or false alarm rate is obtained by dividing the number of non-fire images predicted as fire by the system by the total number of non-fire images. For a better evaluation of performance, another set of metrics including precision [4], recall, and F-measure [50] is also employed. These metrics are computed as follows:
$$
\text{Precision} = \frac{TP}{TP + FP} \tag{3.1}
$$

$$
\text{Recall} = \frac{TP}{TP + FN} \tag{3.2}
$$

$$
F\text{-Measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3.3}
$$
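A direct Python transcription of Equations 3.1–3.3, for readers who want to reproduce the evaluation from raw counts:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F-measure as defined in Equations 3.1-3.3."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```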
TABLE 3.1
Comparison of Various Fire Detection Methods for Dataset1
Technique False Positives (%) False Negatives (%) Accuracy (%)
In terms of false alarms, the method of Habiboğlu et al. [16] performs best and dominates the other methods. However, its false-negative rate of 14.29% is the worst of all the methods examined, and the accuracy of the other four methods is also better, with the most recent method [10] being the best. However, its false positive score of 11.67% is still high, and the accuracy could be further improved. To achieve high accuracy and a low false positive rate, the use of deep features for fire detection is explored, and for this purpose three different methods, i.e., EFD, BEA, and FLCIE, are proposed.
In EFD, we first used the AlexNet architecture without fine-tuning, which resulted in an accuracy of 90.06% and reduced false positives from 11.67% to 9.22%. In the baseline AlexNet architecture, the kernel weights are initialized randomly and are modified during training according to the error rate and accuracy. The strategy of transfer learning [52] is then applied, whereby the weights are initialized from a pre-trained AlexNet model with a low learning rate of 0.001 and the last fully connected layer is modified according to the fire detection problem. Interestingly, an improvement in accuracy of 4.33% and reductions in false negatives and false positives of up to 8.52% and 0.15%, respectively, are obtained.
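The chapter's experiments were run in Caffe and MATLAB; the PyTorch sketch below only illustrates the same transfer-learning idea (ImageNet-pretrained AlexNet, a replaced final layer for the two fire/non-fire classes, and a small learning rate) and is not the authors' pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet weights and replace the last fully connected layer
# so the classifier outputs two classes: fire and non-fire.
model = models.alexnet(pretrained=True)
model.classifier[6] = nn.Linear(4096, 2)

# A small learning rate keeps the pre-trained weights close to their initialization.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```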
The false alarm score is still high and needs further improvement. Therefore, the GoogleNet architecture is explored for this purpose in BEA. Initially, a GoogleNet model is trained with its default kernel weights, which results in an accuracy of 88.41% and a false positive score of 0.11%. The baseline GoogleNet architecture randomly initializes the kernel weights, which are tuned according to the accuracy and error rate during the training process. In an attempt to improve the accuracy, transfer learning [52] is explored, initializing the weights from a pre-trained GoogleNet model and keeping the learning rate at 0.001; further, the last fully connected layer is changed as per the nature of the intended problem. With this fine-tuning process, the false alarm rate is reduced from 0.11% to 0.054% and the false negative score from 5.5% to 1.5%.
TABLE 3.2
Comparison of Different Fire Detection Methods for Dataset2
Technique Precision Recall F-Measure
Although the results of EFD and BEA are good compared to existing methods, there are still certain limitations. First, the size of the model is comparatively large, restricting its implementation in CCTV networks. Second, the accuracy is still low and would be problematic for fire brigades and disaster management teams.
With these motivations, SqueezeNet, a lightweight architecture, was explored for this problem in FLCIE. The experiments were repeated with this new architecture, and an improvement of 0.11% in accuracy was achieved. Furthermore, the rate of false alarms was reduced from 9.07% to 8.87%, while the rate of false negatives remained almost the same. Finally, the major achievement of the proposed framework was the reduction of the model size from 238 MB to 3 MB, saving 235 MB, which can greatly reduce the cost of CCTV surveillance systems.
For further experimentation, Dataset2 was considered, and the results were compared with four other fire detection algorithms in terms of their relevancy, datasets, and years of publication, as shown in Table 3.2.
To ensure a fair evaluation and a full overview of the performance of our approach, we considered another set of metrics (precision, recall, and F-measure), as used by Chino et al. [47]. In a similar way to the experiments on Dataset1, we tested Dataset2 using EFD, BEA, and FLCIE. The fine-tuned models in EFD and BEA achieved F-measure scores of 0.89 and 0.86, respectively. Further improvement was achieved using FLCIE, increasing the F-measure score to 0.91 and the precision to 0.86. It is evident from Table 3.2 that FLCIE achieved better results than the state-of-the-art methods, confirming the effectiveness of the proposed deep CNN framework.
FIGURE 3.13 Comparison of the CNNFire approach with other methods [45].
maps from different convolutional and pooling layers of our proposed CNN architecture are used to localize fire in an image precisely. Next, the number of overlapping fire pixels between the detection maps and the ground-truth images is counted as true positives. Similarly, the number of non-overlapping fire pixels in the detection maps is counted as false positives. The localization results are compared with those of several state-of-the-art methods, such as Chen et al. [8], Celik et al. [14], Rossi et al. [54], Rudz et al. [53], and Chino et al. (BoWFire) [47], as shown in Figure 3.13. Figures 3.14 and 3.15 show the results of all methods for a sample image from Dataset2. The results of BoWFire, color classification, Celik et al., and Rudz et al. are almost the same. Rossi et al. gives the worst results in this case, and Chen et al. is better than Rossi et al. The results from CNNFire are similar to the ground truth.
3.4.3.3 Experiments for Contextual Information Extraction
Along with fire detection and localization, the proposed approach is able to accurately determine the intensity of the fire detected in an input image. To achieve this, the zone of influence (ZOI) inside the input image is extracted along with the segmented fire regions. The ZOI image is then passed to the SqueezeNet model pre-trained on the ImageNet dataset comprising 1,000 classes. The label assigned by the SqueezeNet model to the ZOI image is then combined with the severity of the fire for reporting to the fire brigade. A set of sample cases from this experiment is given in Figure 3.16.
In Figure 3.16, the first column shows input images with labels predicted by the CNN model and their probabilities, with the highest probability taken as the final class label; the second column shows the three feature maps (F8, F26, and F32) selected by EFD; the third column highlights the results for each image using BEA; the fourth column shows the severity of the fire and the ZOI images with a label assigned
FIGURE 3.14 Visual fire localization results of our CNNFire approach and other fire localization methods [45].
FIGURE 3.15 Fire localization results from our CNNFire and other schemes with false positives [45].
by the SqueezeNet model; and the final column shows the alert that should be sent to emergency services, such as the fire brigade.
3.5 CONCLUSIONS
Recently, convolutional neural networks (CNNs) have shown substantial progress in surveillance applications; however, their key roles in early fire detection, localization, and fire scene analysis had not been investigated. With this motivation, CNNs were investigated in this chapter for fire detection and localization, achieving better detection accuracy with a minimal false alarm rate compared to existing methods. Several highly efficient and intelligent methods that select discriminative deep features for fire scene analysis in both indoor and outdoor surveillance have been presented.
In this chapter, a total of three CNN-based frameworks have been proposed. The first framework was proposed for early fire detection using CNNs in surveillance
BIBLIOGRAPHY
1. B. C. Ko, K.-H. Cheong, and J.-Y. Nam, “Fire detection based on vision sensor and sup-
port vector machines,” Fire Safety Journal, vol. 44, pp. 322–329, 2009.
2. V. D. Nguyen, H. Van Nguyen, D. T. Tran, S. J. Lee, and J. W. Jeon, “Learning frame-
work for robust obstacle detection, recognition, and tracking,” IEEE Transactions on
Intelligent Transportation Systems, vol. 18, pp. 1633–1646, 2017.
3. I. Mehmood, M. Sajjad, and S. W. Baik, "Mobile-cloud assisted video summarization framework for efficient management of remote sensing data generated by wireless capsule sensors," Sensors, vol. 14, pp. 17112–17145, 2014.
4. K. Muhammad, M. Sajjad, M. Y. Lee, and S. W. Baik, "Efficient visual attention driven framework for key frames extraction from hysteroscopy videos," Biomedical Signal Processing and Control, vol. 33, pp. 161–168, 2017.
5. R. Hamza, K. Muhammad, Z. Lv, and F. Titouna, “Secure video summarization frame-
work for personalized wireless capsule endoscopy,” Pervasive and Mobile Computing,
vol. 41, pp. 436–450, 2017/10/01/ 2017.
6. K. Muhammad, R. Hamza, J. Ahmad, J. Lloret, H. H. G. Wang, and S. W. Baik, “Secure
surveillance framework for IoT systems using probabilistic image encryption,” IEEE
Transactions on Industrial Informatics, vol. 14(8), pp. 3679–3689, 2018.
7. K. Muhammad, J. Ahmad, and S. W. Baik, "Early fire detection using convolutional neural networks during surveillance for effective disaster management," Neurocomputing, vol. 288, pp. 30–42, 2018.
8. C. Thou-Ho, W. Ping-Hsueh, and C. Yung-Chuen, "An early fire-detection method based on image processing," in 2004 International Conference on Image Processing, 2004. ICIP '04, vol. 3, 2004, pp. 1707–1710.
9. http://www.bbc.com/news/world-asia-42828023 (Visited 31 January, 2018, 9 AM).
10. P. Foggia, A. Saggese, and M. Vento, “Real-time fire detection for video-surveillance
applications using a combination of experts based on color, shape, and motion,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 25(9), pp. 1545–1556, 2015.
11. B. U. Töreyin, Y. Dedeoğlu, U. Güdükbay, and A. E. Cetin, “Computer vision based
method for real-time fire and flame detection,” Pattern Recognition Letters, vol. 27(1),
pp. 49–58, 2006.
12. D. Han, and B. Lee, “Development of early tunnel fire detection algorithm using
the image processing,” in International Symposium on Visual Computing, 2006, pp.
39–48.
13. G. Marbach, M. Loepfe, and T. Brupbacher, “An image processing technique for fire
detection in video images,” Fire Safety Journal, vol. 41(4), pp. 285–289, 2006.
14. T. Celik, and H. Demirel, “Fire detection in video sequences using a generic color
model,” Fire Safety Journal, vol. 44(2), pp. 147–158, 2009.
15. A. Rafiee, R. Dianat, M. Jamshidi, R. Tavakoli, and S. Abbaspour, “Fire and smoke
detection using wavelet analysis and disorder characteristics,” in 2011 3rd International
Conference on Computer Research and Development, 2011, pp. 262–265.
16. Y. H. Habiboğlu, O. Günay, and A. E. Çetin, “Covariance matrix-based fire and flame
detection method in video,” Machine Vision and Applications, vol. 23(6), pp. 1103–
1113, 2012.
17. A. Sorbara, E. Zereik, M. Bibuli, G. Bruzzone, and M. Caccia, “Low cost optronic
obstacle detection sensor for unmanned surface vehicles,” in 2015 IEEE Sensors
Applications Symposium (SAS), 2015, pp. 1–6.
18. I. Kolesov, P. Karasev, A. Tannenbaum, and E. Haber, “Fire and smoke detection in
video with optimal mass transport based optical flow and neural networks,” in 2010
IEEE International Conference on Image Processing, 2010, pp. 761–764.
19. H. J. G. Haynes, “Fire loss in the United States during 2015,” http://www.nfpa.org/, 2016.
20. K. Muhammad, T. Hussain, and S. W. Baik, “Efficient CNN based summarization of
surveillance videos for resource-constrained devices,” Pattern Recognition Letters, 2018.
21. M. Sajjad, S. Khan, T. Hussain, K. Muhammad, A. K. Sangaiah, A. Castiglione, et
al., “CNN-based anti-spoofing two-tier multi-factor authentication system,” Pattern
Recognition Letters, vol. 126, pp. 123–131, 2019.
22. M. Hassaballah, A. A. Abdelmgeid, and H. A. Alshazly, “Image features detection,
description and matching,” in Image Feature Detectors and Descriptors, Awad, A. I.,
Hassaballah, M., (Eds.): Springer, 2016, pp. 11–45.
23. A. I. Awad, and M. Hassaballah, Image Feature Detectors and Descriptors. Studies in
Computational Intelligence. Springer International Publishing, Cham, 2016.
24. F. U. M. Ullah, A. Ullah, K. Muhammad, I. U. Haq, and S. W. Baik, “Violence detec-
tion using spatiotemporal features with 3D Convolutional Neural Network,” Sensors,
vol. 19(11), p. 2472, 2019.
25. M. Sajjad, S. Khan, Z. Jan, K. Muhammad, H. Moon, J. T. Kwak, et al., “Leukocytes
classification and segmentation in microscopic blood smear: A resource-aware healthcare
service in smart cities,” IEEE Access, vol. 5, pp. 3475–3489, 2017.
26. I. U. Haq, K. Muhammad, A. Ullah, and S. W. Baik, “DeepStar: Detecting starring
characters in movies,” IEEE Access, vol. 7, pp. 9265–9272, 2019.
27. M. Hassaballah, H. A. Alshazly, and A. A. Ali, “Ear recognition using local binary pat-
terns: A comparative experimental study,” Expert Systems with Applications, vol. 118,
pp. 182–200, 2019.
28. A. I. Awad, and K. Baba, “Singular point detection for efficient fingerprint classification,”
International Journal on New Computer Architectures and Their Applications
(IJNCAA), vol. 2, pp. 1–7, 2012.
29. A. Ullah, K. Muhammad, J. D. Ser, S. W. Baik, and V. Albuquerque, “Activity recogni-
tion using temporal optical fow convolutional features and multi-layer LSTM,” IEEE
Transactions on Industrial Electronics, vol. 66(12), pp. 9692–9702, 2019.
30. A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad, and S. W. Baik, “Action recognition in
video sequences using deep Bi-directional LSTM with CNN features,” IEEE Access,
vol. 6, pp. 1155–1166, 2018.
31. M. Sajjad, S. Khan, K. Muhammad, W. Wu, A. Ullah, and S. W. Baik, “Multi-grade
brain tumor classification using deep CNN with extensive data augmentation,” Journal
of Computational Science, vol. 30, pp. 174–182, 2019.
32. J. Ahmad, K. Muhammad, J. Lloret, and S. W. Baik, “Efficient conversion of deep fea-
tures to compact binary codes using Fourier decomposition for multimedia big data,”
IEEE Transactions on Industrial Informatics, vol. 14(7), pp. 3205–3215, 2018.
33. J. Ahmad, K. Muhammad, S. Bakshi, and S. W. Baik, “Object-oriented convolu-
tional features for fine-grained image retrieval in large surveillance datasets,” Future
Generation Computer Systems, vol. 81, pp. 314–330, 2018.
50. K. Muhammad, J. Ahmad, M. Sajjad, and S. W. Baik, “Visual saliency models for sum-
marization of diagnostic hysteroscopy videos in healthcare systems,” SpringerPlus,
vol. 5(1), p. 1495, 2016.
51. R. Di Lascio, A. Greco, A. Saggese, and M. Vento, “Improving fire detection reliability
by a combination of videoanalytics,” in International Conference Image Analysis and
Recognition, 2014, pp. 477–484.
52. S. J. Pan, and Q. Yang, “A survey on transfer learning,” IEEE Transactions on
Knowledge and Data Engineering, vol. 22(10), pp. 1345–1359, 2010.
53. S. Rudz, K. Chetehouna, A. Hafiane, H. Laurent, and O. Séro-Guillaume, “Investigation
of a novel image segmentation method dedicated to forest fire applications,”
Measurement Science and Technology, vol. 24(7), p. 075403, 2013.
54. L. Rossi, M. Akhloufi, and Y. Tison, “On the use of stereovision to develop a novel
instrumentation system to extract geometric fire fronts characteristics,” Fire Safety
Journal, vol. 46(1–2), pp. 9–20, 2011.
4 A Multi-biometric Face
Recognition System
Based on Multimodal
Deep Learning
Representations
Alaa S. Al-Waisy, Shumoos Al-Fahdawi,
and Rami Qahwaji
CONTENTS
4.1 Introduction ....................................................................................................90
4.2 Related Work ..................................................................................................92
4.2.1 Handcrafted-Descriptor Approaches.................................................. 93
4.2.2 Learning-Based Approaches ..............................................................94
4.3 The Proposed Approaches.............................................................................. 95
4.3.1 Fractal Dimension............................................................................... 95
4.3.2 Restricted Boltzmann Machine ..........................................................96
4.3.3 Deep Belief Networks......................................................................... 98
4.3.4 Convolutional Neural Network...........................................................99
4.4 The Proposed Methodology.......................................................................... 100
4.4.1 The Proposed FDT-DRBM Approach.............................................. 101
4.4.1.1 Fractal Dimension Transformation.................................... 102
4.4.1.2 Discriminative Restricted Boltzmann Machines............... 104
4.4.2 The Proposed FaceDeepNet System................................................. 105
4.5 Experimental Results.................................................................................... 106
4.5.1 Description of Face Datasets ............................................................ 107
4.5.2 Face Identification Experiments ....................................... 108
4.5.2.1 Parameter Settings of the FDT-DRBM Approach............. 109
4.5.2.2 Testing the Generalization Ability of the
Proposed Approaches ........................................................ 114
4.5.3 Face Verification Experiments.......................................... 117
4.6 Conclusions................................................................................................... 119
Bibliography .......................................................................................................... 121
4.1 INTRODUCTION
A reliable and efficient identity administration system is a vital component in many
governmental and civilian high-security applications that provide services to only
legitimately registered users [1]. Many pattern recognition systems have been used
for establishing the identity of an individual based on different biometric traits (e.g.,
physiological, behavioral, and/or soft biometric traits) [2–3]. In a biometric system,
these traits establish a substantial and strong connection between the users and
their identities [4]. Face recognition has received considerably more attention in the
research community than other biometric traits, due to its wide range of commercial
and governmental applications, its low cost, and the ease of capturing the face image
in a non-intrusive manner. This is unlike other biometric systems (e.g., fingerprint
recognition), which can cause some health risks by transferring viruses and/or some
infectious diseases from one user to the other by using the same sensor device to
capture the biometric trait from multiple users [5]. Despite these advantages, design-
ing and implementing a face recognition system is considered a challenging task in
the image processing, computer vision, and pattern recognition fields. Therefore,
different factors and problems need to be addressed when developing any biometric
system based on the face image, especially when face images are taken in uncon-
strained environments. Such environments are challenging due to the large intraper-
sonal variations, such as changes in facial expression, illumination, multiple views,
aging, occlusion from wearing glasses and hats, and the small interpersonal differ-
ences [6]. Basically, the performance of the face recognition system depends on two
fundamental stages: feature extraction and classification. The correct identification
of a person’s identity in the latter stage is dependent on the discriminative power of
the extracted facial features in the former stage. Thus, extracting and learning pow-
erful and highly discriminating facial features to minimize intrapersonal variations
and interpersonal similarities remains a challenging task [7].
Recently, considerable attention has been paid to employing multi-biometric
systems in many governmental and private sectors due to their ability to significantly
improve the recognition performance of biometric systems. Moreover,
multimodal systems have the following advantages [8–11]: (i) improving population
coverage; (ii) improving the biometric system’s throughput; (iii) deterring
spoofing attacks; (iv) minimizing the interpersonal similarities and the intrapersonal
variations; and (v) providing a high degree of flexibility, allowing people to choose
to provide either a subset or all of their biometric traits
depending on the nature of the implemented application and the user’s conve-
nience. As illustrated in Figure 4.1, to satisfy the multi-biometric system concept,
the face biometric trait can be engaged in four out of five different types of multi-
biometric systems.
In this chapter, a novel multi-biometric system for identifying a person’s iden-
tity using two discriminative deep learning approaches is proposed based on the
combination of a convolutional neural network (CNN) and deep belief network
(DBN) to address the problem of unconstrained face recognition. CNN is one of
the most powerful supervised deep neural networks (DNNs), which is widely used
to resolve many tasks in image processing, computer vision, and pattern recognition
owing to its high ability to automatically extract discriminative features from input images.
FIGURE 4.1 The five different types of a multimodal biometric system, adapted from Ross
and Jain [10].
near-infrared images with visual lighting images [40, 41]. Gao et al. [42] have also
addressed the problem of insufficient labeled samples by proposing the Semi-
Supervised Sparse Representation-based Classification method. This method has
been tested on four different datasets, including AR, Multi-PIE, CAS-PEAL, and
LFW. For more information on the most significant challenges, achievements, and
future directions related to the face recognition task, the reader is referred to [43].
E(v, h) = \sum_{i=1}^{m} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{i=1}^{m} \sum_{j=1}^{n} w_{i,j} h_j \frac{v_i}{\sigma_i^2} - \sum_{j=1}^{n} c_j h_j \quad (4.1)
Here, σi is the standard deviation of the Gaussian noise for the visible unit vi, wij
represents the weight between the visible unit vi and the hidden unit hj, and bi and cj
are the biases for the visible and hidden units, respectively.
FIGURE 4.2 (a) A typical RBM structure. (b) A discriminative RBM modeling the joint
distribution of input variables v and target classes y (represented as a one-hot vector ŷ).
(c) A greedy layer-wise training algorithm for the DBN composed of three stacked RBMs.
(d) Three layers of the DBN as a generative model, where the top-down generative path is
represented by the P distributions (solid arcs), and bottom-up inference and the training path
are represented by the Q distributions (dashed arcs).
The conditional probabilities for the visible units given the hidden units, and vice versa
for the hidden units, are defined as follows:
p(v_i \mid h) = \mathcal{N}\left( v_i \,\middle|\, b_i + \sum_{j} w_{i,j} h_j,\; \sigma_i^2 \right) \quad (4.2)

p(h_j = 1 \mid v) = f\left( c_j + \sum_{i} w_{i,j} \frac{v_i}{\sigma_i^2} \right) \quad (4.3)
Here, N(·|µ, σ²) refers to the Gaussian probability density function with mean µ and
variance σ², and f(x) is the sigmoid function. During the training process, the log-
likelihood of the training data is maximized using stochastic gradient descent, and
the update rules for the parameters are defned as follows:
\Delta w_{i,j} = \eta \left( \frac{1}{\sigma_i^2} \langle v_i h_j \rangle_{data} - \frac{1}{\sigma_i^2} \langle v_i h_j \rangle_{model} \right) \quad (4.4)

\Delta b_i = \eta \left( \frac{1}{\sigma_i^2} \langle v_i \rangle_{data} - \frac{1}{\sigma_i^2} \langle v_i \rangle_{model} \right) \quad (4.5)

\Delta c_j = \eta \left( \langle h_j \rangle_{data} - \langle h_j \rangle_{model} \right) \quad (4.6)
Here, η is the learning rate, and ⟨·⟩_data and ⟨·⟩_model represent the expectations under the
distribution specified by the input data (positive phase) and the internal representations
of the RBM model (negative phase), respectively. Finally, bi and cj are the bias
terms for the visible and hidden units, respectively. More details on the GRBM model
can be found in [49]. Typically, RBMs can be used in two different ways: either as
generative models or as discriminative models, as shown in Figure 4.2 (a, b). The
generative models use a layer of hidden units to model a distribution over the visible
units, as described above. Such models are usually trained in an unsupervised way
and used as feature extractors to model only the inputs for another learning algo-
rithm. On the other hand, discriminative models can also model the joint distribution
of the input data and associated target classes. The discriminative models aim to
train a joint density model using an RBM that has two layers of visible units. The first
represents the input data and the second is the Softmax label layer that represents
the target classes. A discriminative RBM with n hidden units is a parametric model
of the joint distribution between a layer of hidden units (referred to as features) h =
(h1, h2 ,…, hn) and visible units made of the input data v = (v1, v2 ,…, vm) and targets
y ∈ {1, 2, …, C}, which can be defined as follows:
p(y, v, h) \propto e^{-E(y, v, h)} \quad (4.7)
where

E(y, v, h) = -h^T W v - b^T v - c^T h - d^T \vec{y} - h^T U \vec{y} \quad (4.8)

Here, (W, b, c, d, U) refer to the model parameters and \vec{y} = (1_{y=i})_{i=1}^{C} to the C classes.
More details about the discriminative RBM model can be found in [50].
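To make the conditional distributions of Eqs. 4.2 and 4.3 concrete, the short NumPy sketch below samples the hidden units given Gaussian visible units and reconstructs the visible units given the hidden units. The array shapes, variable names, and parameter values are illustrative assumptions, not code from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, c, sigma):
    """Eq. 4.3: p(h_j = 1 | v) = sigmoid(c_j + sum_i w_ij * v_i / sigma_i^2)."""
    p_h = sigmoid(c + (v / sigma**2) @ W)           # shape: (batch, n_hidden)
    return p_h, (rng.random(p_h.shape) < p_h).astype(float)

def sample_v_given_h(h, W, b, sigma):
    """Eq. 4.2: p(v_i | h) = N(b_i + sum_j w_ij * h_j, sigma_i^2)."""
    mean = b + h @ W.T                               # shape: (batch, n_visible)
    return mean, mean + sigma * rng.standard_normal(mean.shape)

# Tiny example with made-up sizes: 4 Gaussian visible units, 3 binary hidden units.
W, b, c = rng.normal(0, 0.02, (4, 3)), np.zeros(4), np.zeros(3)
sigma = np.ones(4)
v0 = rng.standard_normal((2, 4))
p_h, h0 = sample_h_given_v(v0, W, c, sigma)
v_mean, v1 = sample_v_given_h(h0, W, b, sigma)
```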
P(v, h^1, \ldots, h^l) = \left( \prod_{k=0}^{l-2} P(h^k \mid h^{k+1}) \right) P(h^{l-1}, h^l) \quad (4.9)

Here, v = h^0, P(h^k \mid h^{k+1}) is the conditional distribution for the visible units given the
hidden units of the RBM associated with level k of the DBN, and P(h^{l-1}, h^l) is
the visible-hidden joint distribution in the top-level RBM. An example of a three-
layer DBN as a generative model is shown in Figure 4.2 (d), where the symbol Q
is introduced for exact or approximate posteriors of that model, which are used
for bottom-up inference. During the bottom-up inference, the Q posteriors are all
approximate except for the top-level P(h^l | h^{l-1}), which is formed as an RBM, so
exact inference is possible there. Although the DBN is considered one of
the most popular unsupervised deep learning approaches, and has been success-
fully applied in a wide range of applications (e.g., face recognition [13], speech
recognition [14], audio classification [15], etc.), one of the main issues of the DBN
is that the feature representations it learns are very sensitive to the local transla-
tions of the input image, in particular when the pixel intensity values (raw data)
are assigned directly to the visible units. This can cause the local facial features
of the input image, which are known to be important for face recognition, to be
disregarded.
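The greedy layer-wise procedure behind Eq. 4.9 can be summarized in a few lines: each RBM is trained on the activations produced by the layer below it, and its weights are then frozen. The sketch below is a schematic outline in which `train_rbm` and `hidden_activations` are hypothetical placeholders, not functions defined in this chapter.

```python
# Minimal sketch of greedy layer-wise DBN pre-training (assumed helper functions).
def pretrain_dbn(data, layer_sizes, train_rbm, hidden_activations):
    """Stack RBMs bottom-up; each layer models the activations of the layer below."""
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm = train_rbm(layer_input, n_hidden)               # unsupervised CD training
        rbms.append(rbm)                                     # weights frozen afterwards
        layer_input = hidden_activations(rbm, layer_input)   # feed activations upward
    return rbms
```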
FIGURE 4.3 An illustration of the CNN architecture, where the gray and green squares
refer to the activation maps and the learnable convolution kernels, respectively. The crossed
lines between the last two layers show the fully connected neurons [55].
g(x, y) = \frac{1}{1 + e^{\,c\,(Th - f(x, y))}} \quad (4.10)
Here, g(x, y) is the enhanced image. The contrast factor (c) and the threshold value
(Th) are empirically set to be 5 and 0.3, respectively. As shown in Figure 4.5, the
advantage of the sigmoid function is to reduce the effect of lighting changes by
expanding and compressing the range of values of the dark and bright pixels in the
face image, respectively. This is followed by estimating the fractal dimension value
of each pixel in the enhanced face image using the proposed FDT approach. As men-
tioned above, the fractal dimension has many important characteristics, for instance
its ability to reflect the roughness and fluctuations of a face image’s surface, and to
represent the facial features under different environmental conditions (e.g., illumina-
tion changes).
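A minimal sketch of the contrast-enhancement step in Eq. 4.10 is given below, assuming the input face image is a grayscale array normalized to [0, 1]; the constants follow the values stated above (c = 5, Th = 0.3).

```python
import numpy as np

def enhance_contrast(f, c=5.0, th=0.3):
    """Sigmoid contrast stretching of a grayscale face image f in [0, 1] (Eq. 4.10)."""
    return 1.0 / (1.0 + np.exp(c * (th - f)))

# Example on a random "face" image; dark pixels are expanded, bright ones compressed.
face = np.random.default_rng(1).random((64, 64))
g = enhance_contrast(face)
```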
FIGURE 4.5 Output of the image contrast enhancement procedure. (a) The original face
image. (b) The enhanced face image. pmax and pmin refer to the maximum and minimum
pixel values in the enhanced face image.
However, fractal estimation approaches are very time consuming and cannot
meet real-time requirements [7]. Thus, the face image is rescaled to (64×64) pixels
after detecting the face region in order to speed up the experiments and meet
real-time system demands. The result is then reshaped into a row feature vector,
FDTVector, which is used as input to a nonlinear DRBM classifier to efficiently
model the joint distribution of inputs and target classes. In this chapter, the
number of matching scores and predicted classes produced by the DRBM
classifier varies according to the applied recognition task. For instance, the goal
of the identification system is to determine the true identity (I) of the query trait
based on these N matching scores, where I ∈ {I1, I2, I3, ···, IN, IN+1}. Here, (I1, I2,
I3, ···, IN) correspond to the identities of the N persons enrolled in the dataset, while
(IN+1) points to the “reject” option, which is generated when no associated identity
can be determined for the given query trait; this is known as open-set identification.
However, when the biometric system assumes that the given query trait is among the
templates enrolled in the dataset, this is referred to as closed-set identification [57].
In the proposed system, the open-set identification protocol is adopted because of
its efficiency in large-scale, real-world applications (e.g., surveillance and watch
list scenarios) [58]. On the other hand, in the verification task the DRBM classifier is
used as a binary classifier to determine whether or not a pair of face images belongs
to the same person. The next two subsections describe in more detail the FDT and
DRBM methods mentioned above.
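The open-set decision rule described above can be sketched as follows: the query is assigned to the enrolled identity with the highest matching score unless that score falls below a rejection threshold, in which case the “reject” option I_{N+1} is returned. The threshold value below is an assumption for illustration only.

```python
import numpy as np

def open_set_identify(scores, threshold=0.5):
    """Return the index of the best-matching enrolled identity, or -1 ('reject')
    when no score exceeds the threshold (open-set identification)."""
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else -1

# Example: matching scores for N = 4 enrolled identities (made-up values).
print(open_set_identify(np.array([0.12, 0.08, 0.71, 0.33])))   # -> 2
print(open_set_identify(np.array([0.12, 0.08, 0.21, 0.33])))   # -> -1 (reject)
```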
N_d(x, y, d) = \sum_{p=-a}^{a} \sum_{q=-b}^{b} f(p, q)\, I(x + p, y + q) \left( \frac{r_{max}}{r} \right) \quad (4.11)
Here, d = (1, 2, 3, …, rmax − 1) is the third dimension of matrix Nd (x, y, d), which rep-
resents the number of the produced feature maps obtained from applying the kernel
function f(p, q) along with different values of the scaling factor (r). Here, the values
of the scaling factor (r) are empirically chosen to be in the range between 2 and 9.
The kernel function f(p, q) operates by block processing on (7×7) neighboring pixels
of the face image, and calculating the fractal dimension value of each pixel from its
surrounding neighbors, as follows:
f(p, q) = \left\lfloor \frac{p_{max} - p_{min}}{r} \right\rfloor + 1, \quad p = -a, \ldots, a, \; q = -b, \ldots, b \quad (4.12)
Here, a and b are non-negative integer variables, which are used to center the kernel
function on each pixel in the face image, and are defined as a = b = ceil((n − 1)/2).
pmax and pmin are the highest- and lowest-intensity values of neighboring pixels in
the processing block. The size of the kernel function was determined empirically,
noting that increasing its size can affect the accuracy of the calculated fractal dimen-
sion, causing the obtained image to become less distinct, while decreasing its size
can result in an insufficient number of surrounding pixels to calculate the fractal
dimension value accurately. Once the Nd (x, y, d) matrix is obtained, each element
from each array in Nd (x, y, d) will be stored in a new row vector (V). In other words,
the first element in all arrays of Nd (x, y, d) will compose vector (V1), and all second
elements will compose (V2), etc., as follows:
Finally, the fractal dimension value at each pixel (x, y) is computed as the slope of the
least-squares linear regression line as follows:

FD(x, y) = \frac{S_v(x, y)}{S_r} \quad (4.14)

Here, (S_v) and (S_r) are the sums of squares and can be computed as follows:

S_v(x, y) = \sum_{r=1}^{j} \log(r)\, v\big((x, y), r\big) - \frac{\left( \sum_{r=1}^{j} \log(r) \right) \left( \sum_{r=1}^{j} v\big((x, y), r\big) \right)}{j} \quad (4.15)

S_r(x, y) = \sum_{r=1}^{j} \big(\log(r)\big)^2 - \frac{\left( \sum_{r=1}^{j} \log(r) \right)^2}{j} \quad (4.16)

Here, j = r_max − 1. A fractal-transformed face image is illustrated in Figure 4.6.
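The slope computation in Eqs. 4.14–4.16 is ordinary least squares of the per-scale responses against log(r). The compact NumPy sketch below assumes `v` holds the j = r_max − 1 responses collected for a single pixel and that r runs from 1 to j as in the equations; it is an illustrative approximation rather than the authors' implementation.

```python
import numpy as np

def fractal_dimension_slope(v, r_max=9):
    """Least-squares slope FD = S_v / S_r (Eqs. 4.14-4.16) for one pixel,
    where v[r-1] is the response obtained with scaling factor r = 1..r_max-1."""
    j = r_max - 1
    r = np.arange(1, j + 1)
    log_r = np.log(r)
    s_v = np.sum(log_r * v) - np.sum(log_r) * np.sum(v) / j     # Eq. 4.15
    s_r = np.sum(log_r**2) - np.sum(log_r)**2 / j               # Eq. 4.16
    return s_v / s_r                                            # Eq. 4.14

# Example with made-up per-scale responses for a single pixel.
v = np.log(np.arange(1, 9) + 3.0)
print(fractal_dimension_slope(v))
```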
FIGURE 4.6 (a) Original face images. (b) FDT fractal-transformed images.
1. Visible units (vi) are initialized using the training data, and the probabilities of
the hidden units are computed with Eq. 4.3. Then, a hidden activation vector
(hj) is sampled from this probability distribution.
2. Compute the outer product of (vi) and (hj), which constitutes the positive phase.
3. Sample a reconstruction (v′i) of the visible units from (hj) with Eq. 4.2, and
then from (v′i) resample the hidden units’ activations (h′j). (One Gibbs sampling
step.)
4. Compute the outer product of (v′i) and (h′j), which constitutes the negative
phase.
5. Update the weight matrix and biases with Eq. 4.4, Eq. 4.5, and Eq. 4.6.
The computation steps of the CD-1 algorithm are graphically shown in Figure 4.7. In
the CD learning algorithm, k is usually set to 1 for many applications. In this work,
the weights were initialized with small random values sampled from a zero-mean
normal distribution and standard deviation of 0.02. In the positive phase, the binary
states of the hidden units are determined by computing the probabilities of weights
and visible units. Since the probability of training data is increased in this phase,
it is called positive phase. Meanwhile, the probability of samples generated by the
model is decreased during the negative phase. A complete positive-negative phase is
considered as one training epoch, and the error between produced samples from the
model and actual data vector is computed at the end of each epoch. Finally, all the
weights are updated by taking the derivative of the probability of visible units with
respect to weights, which is the expectation of the difference between positive phase
contribution and negative phase contribution. Algorithm 1 shows pseudo-code of the
procedure proposed to train the DRBM classifier using the CD algorithm.
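Since Algorithm 1 itself is not reproduced here, the following NumPy sketch illustrates one CD-1 update following steps 1–5 and Eqs. 4.4–4.6, for the simpler binary–binary case with unit variances (σi = 1). It is an illustrative approximation under those assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.01):
    """One CD-1 step for a binary RBM with unit-variance units (Eqs. 4.4-4.6)."""
    # Positive phase: hidden probabilities and sampled states from the data.
    ph0 = sigmoid(c + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct visibles, then recompute hidden probabilities.
    pv1 = sigmoid(b + h0 @ W.T)
    ph1 = sigmoid(c + pv1 @ W)
    n = v0.shape[0]
    # Negative phase and parameter updates (expectations approximated per batch).
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

# Example usage with a tiny random binary batch (8 visible, 4 hidden units).
v0 = (rng.random((5, 8)) < 0.5).astype(float)
W, b, c = rng.normal(0, 0.02, (8, 4)), np.zeros(8), np.zeros(4)
W, b, c = cd1_update(v0, W, b, c)
```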
FIGURE 4.8 Examples of face images in four face datasets: (a) SDUMLA-HMT, (b) FRGC
V 2.0, (c) UFI, and (d) LFW.
(e.g., using product, sum, weighted sum, max, and min rule) and rank level (e.g.,
using highest rank, Borda count, and logistic regression).
approaches used in the FaceDeepNet system (e.g., number of layers, number of filters
per layer, learning rate, etc.), as it has more images per person in its image gallery
than the other datasets. This allowed more flexibility in dividing the face images
into training, validation, and testing sets.
4.5.2.1 Parameter Settings of the FDT-DRBM Approach
The most important hyper-parameter in the proposed FDT-DRBM approach is the
number of hidden units in the DRBM classifier. The other hyper-parameters (e.g.,
learning rate, number of epochs, etc.) were determined based on domain knowledge
from the literature to be a learning rate of 10⁻², a weight decay of 0.0002, a
momentum of 0.9, and 300 epochs, as in [7]. The weights were initialized with small
random values sampled from a zero-mean normal distribution with a standard
deviation of 0.02. In this work, the number of hidden units was determined empirically
by varying its value from 1,000 to 5,000 units in steps of 100 units, using the
CD learning algorithm with one step of Gibbs sampling to train the DRBM as a
nonlinear classifier, as described in Section 4.4.1.2. Hence, 41 experiments were
conducted using 60% randomly selected samples per person for the training set and the
remaining 40% for the testing set. In these experiments, the parameter optimization
process is performed on the training set using a 10-fold cross-validation procedure
that divides the training set into K subsets of equal size. Consecutively, one subset
is used to assess the performance of the DRBM classifier trained on the remaining
K−1 subsets. Then, the average error rate (AER) over the 10 trials is computed as follows:
AER = \frac{1}{K} \sum_{i=1}^{K} Error_i \quad (4.17)
Here, Error_i refers to the error rate per trial. After finding the number of hidden
units with the highest validation accuracy rate (VAR), the DRBM classifier was trained
using the whole training set, and its ability to predict unseen data was then
evaluated using the testing set. Figure 4.9 shows the VAR produced
throughout the 41 experiments. One can see that the highest VAR was obtained when
3,000 units were employed in the hidden layer of the DRBM classifier. To verify the
proposed FDT-DRBM approach, its performance was compared with two current
state-of-the-art face recognition approaches, namely the Curvelet-Fractal approach
[7] and the Curvelet Transform-Fractional Brownian Motion (CT-FBM) approach
[64], using the same dataset. Figure 4.10 shows the cumulative match characteristic
(CMC) curve used to visualize their performance. It can be seen in Figure 4.10 that
the proposed FDT-DRBM approach has the potential to outperform the CT-FBM
approach, as the Rank-1 identification rate increases dramatically from 0.90–0.95
using the CT-FBM approach to 0.9513–1.0 using the FDT-DRBM approach.
On the other hand, although the Curvelet-Fractal approach achieves a slightly higher
Rank-1 identification rate, beyond Rank-10 its accuracy drops compared to the
FDT-DRBM approach.
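The hidden-unit search described above amounts to a simple grid search scored by the AER of Eq. 4.17. A schematic version using scikit-learn's KFold is sketched below; `train_drbm` and `error_rate` stand in for the authors' DRBM training and evaluation routines and are assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_hidden_units(X, y, train_drbm, error_rate,
                        candidates=range(1000, 5001, 100), n_splits=10):
    """Pick the DRBM hidden-layer size with the lowest AER (Eq. 4.17)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    best_size, best_aer = None, np.inf
    for n_hidden in candidates:
        errors = []
        for train_idx, val_idx in kf.split(X):
            model = train_drbm(X[train_idx], y[train_idx], n_hidden)
            errors.append(error_rate(model, X[val_idx], y[val_idx]))
        aer = np.mean(errors)          # AER = (1/K) * sum of per-fold errors
        if aer < best_aer:
            best_size, best_aer = n_hidden, aer
    return best_size, best_aer
```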
FIGURE 4.9 The VAR produced throughout the 41 experiments for finding the best number
of hidden units.
As mentioned previously, the main structure of the FaceDeepNet system is based
on building deep learning representations and fusing the matching scores obtained
from two discriminative deep learning approaches (e.g., CNN and DBN). It is well
known in the literature that the major challenge of using deep learning approaches is
the number of model architectures and hyper-parameters that need to be tested, such
as the number of layers, the number of units per layer, the number of epochs, etc. In
this section, a number of extensive experiments are performed to find the best model
architecture for both CNN and DBN, along with studying and analyzing the influence
of the different training parameters on the performance of the proposed deep
learning approaches. All these experiments were conducted on the SDUMLA-HMT
face dataset using the same sizes of the training and testing sets described in Section
4.5.2.1, and the parameters with the best performance (e.g., highest VAR and best
generalization ability) were kept to be used later in finding the best network architecture.
For an initial CNN architecture, the architecture of our IrisConvNet system as
described in Al-Waisy et al. [55] was used to test its generalization ability and adap-
tation to another new application (e.g., face recognition). This initial CNN archi-
tecture will be denoted as FaceConvNet-A through the subsequent experiments.
The main structure of the IrisConvNet system is shown in Figure 4.3. Initially, the
FaceConvNet-A system was trained using the same training methodology along with
the same hyper-parameters described in [55]. However, it was observed that when
500 epochs were evaluated, the obtained model started overfitting the training data,
and poor results were obtained on the validation set. Thus, a number of experiments
were conducted to fine-tune this parameter. First, an initial number of epochs was
set to 100 epochs, and then larger numbers of epochs were also investigated, includ-
ing 200, 300, and 400. Figure 4.11 (a) shows the CMC curves used to visualize the
performance of the final obtained model on the validation set. It can be seen that as
the number of epochs increases, the performance of the final model gets
better, and the highest accuracy rate was obtained when 300 epochs were evaluated.
FIGURE 4.11 CMC curves for (a) epoch number parameter evaluation of CNN and (b)
performance comparison between FaceConvNet-A and FaceConvNet-B on the testing set of
the SDUMLA-HMT dataset.
Furthermore, it was also observed that a better performance can be obtained by adding
a new convolution layer (C0 = 10@64×64) before the first one (C1 = 6@64×64), using
the same filter size and increasing the number of filters in these two layers from 6 to
10 learnable filters. This new CNN architecture was denoted as the FaceConvNet-B
system, and it was used in our assessment procedure for all remaining experiments
instead of the FaceConvNet-A system. The CMC curves of the FaceConvNet-A and
FaceConvNet-B systems are shown in Figure 4.11 (b) to visualize their performances
on the testing set of the SDUMLA-HMT dataset. Table 4.1 summarizes the details of
the FaceConvNet-B system and its hyper-parameters.
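The architectural change described above (inserting a new convolution layer C0 with 10 feature maps in front of the original first layer and widening that layer to 10 maps) could be expressed, for example, in PyTorch as below. The kernel size, padding, input channels, and the rest of the network are placeholders, since the full FaceConvNet-B configuration is only summarized in Table 4.1.

```python
import torch.nn as nn

# Hypothetical sketch: prepend a 10-map convolution layer (C0 = 10@64x64) to the
# original first layer, keeping the 64x64 spatial size via padding. Kernel size,
# grayscale input, and the remainder of the network are assumptions.
class FaceConvNetBStem(nn.Module):
    def __init__(self, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.c0 = nn.Conv2d(1, 10, kernel_size, padding=pad)   # new layer C0
        self.c1 = nn.Conv2d(10, 10, kernel_size, padding=pad)  # original C1, widened to 10 maps
        self.act = nn.ReLU()

    def forward(self, x):               # x: (batch, 1, 64, 64) face images
        return self.act(self.c1(self.act(self.c0(x))))
```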
Following the training methodology described in [7], the DBN approach was
greedily trained using input data acquired from the FDT-DRBM approach. Once the
training of a given RBM (hidden layer) is completed, its weight matrix is frozen,
and its activations serve as input to train the next RBM (next hidden layer) in
the stack. Unlike [7], in which the DBN models are composed of only 3 layers, here
larger and deeper DBN models have been investigated by stacking five RBM models.
As shown in Table 4.2, five different 5-layer DBN models were greedily trained
in a bottom-up manner using different numbers of hidden units only in the last three
layers. For the first four hidden layers, each one was trained separately as an RBM
model in an unsupervised way, using the CD learning algorithm with one step of
Gibbs sampling. Each individual RBM was trained for 100 epochs with the learning
rate set to 0.01, a momentum value of 0.95, a weight-decay value of 0.0005,
and a mini-batch size of 100. The weights of each model were initialized with small
random values sampled from a zero-mean normal distribution with a standard
deviation of 0.02. The last RBM model was trained in a supervised way as a nonlinear
classifier associated with Softmax units for the multi-class classification purpose. In
this supervised phase, the last RBM model was trained using the same values of the
hyper-parameters used to train the first four models. Finally, in the fine-tuning phase,
the whole DBN model was trained in a top-down manner using the back-propagation
algorithm equipped with the Dropout technique, to find the optimized parameters
and to avoid overfitting. The Dropout ratio was set to 0.5 and the number of epochs
through the training set was determined using an early stopping procedure, in which
the training process is stopped as soon as the classification error on the validation
set starts to rise again. As can be seen in Table 4.2, the last DBN model provided a
significantly better recognition rate compared to the other architectures investigated.
More information on the FaceConvNet-B and its hyper-parameters is presented in
Table 4.1.
TABLE 4.1
Details of Hyper-Parameters for the Proposed Deep Learning Approaches
(e.g., CNN and DBN)
FaceConvNet-B 5-Layer DBN Model
Hyper-parameter Value Hyper-parameter Value
TABLE 4.2
Rank-1 Identification Rates Obtained for Different DBN Architectures
Using the Validation Set
DBN Model Accuracy Rate
4096-4096-2048-1024-512 0.92
4096-4096-2048-1024-1000 0.94
4096-4096-2048-2000-1000 0.95
4096-4096-3000-2000-1000 0.94
4096-4096-3000-2000-2000 0.97
Moreover, a few experiments were also carried out using the SDUMLA-HMT
dataset to find out how far the performance of the proposed deep learning approaches
can be improved by assigning the local feature representations to the visible
units of the DNNs instead of the raw data, as a way of guiding the neural network to
learn only the useful facial features. As shown in Figure 4.4, the local facial features
are first extracted using the proposed FDT-DRBM approach. Then, these extracted
local facial features are assigned to the feature extraction units of the CNN and DBN
to learn additional and complementary feature representations. As expected, and as
shown in Table 4.3, our experiments demonstrate that training the proposed deep
learning approaches on top of the FDT-DRBM approach’s output can significantly
increase the ability of the DNNs to learn more discriminating facial features
with a shorter training time. Generally, the CNN produced a higher recognition
rate than the DBN, due to its ability to capture local dependencies and
extract elementary features in the input image (e.g., edges, curves, etc.), which are
proven to be important for face recognition. Thus, in the fusion phase, when the
weighted sum rule is applied, a higher weight was assigned to the proposed CNN
than to the DBN approach, such that the maximum recognition rate is achieved.
TABLE 4.3
Rank-1 Identification Rates Obtained for the Proposed Deep Learning
Approaches (CNN and DBN) Using Different Types of Input Data
Raw Data Local Facial Features
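Referring back to the weighted sum rule used above, the sketch below shows a minimal score-level fusion of CNN and DBN matching scores with a larger weight on the CNN; the normalization step and the weight values are illustrative assumptions.

```python
import numpy as np

def weighted_sum_fusion(cnn_scores, dbn_scores, w_cnn=0.7, w_dbn=0.3):
    """Combine two sets of matching scores with the weighted sum rule."""
    def min_max(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    return w_cnn * min_max(cnn_scores) + w_dbn * min_max(dbn_scores)

# Example: fused scores for N enrolled identities; the predicted identity is the argmax.
fused = weighted_sum_fusion([0.2, 0.9, 0.4], [0.3, 0.7, 0.5])
print(int(np.argmax(fused)))
```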
TABLE 4.4
Rank-1 Identification Rate of the Proposed Multi-Biometric Face Recognition
System on Three Different Face Datasets Using Score-Level Fusion. SR and
PR refer to the Sum Rule and Product Rule, respectively
Score Fusion Method
FRGC V 2.0 Exp. 1 94.11 97.62 95.12 97.34 94.92 95.34 95.21
Exp. 2 95.10 99.54 97.33 100 95.54 97.82 96.32
Exp. 4 96.21 98.33 97.13 99.12 97.13 97.12 97.12
UFI CI 92.23 93.60 94.32 96.66 92.12 92.31 93.21
LI 93.12 95.29 96.12 97.78 93.64 95.12 94.12
SDUMLA-HMT 95.13 95.13 98.89 100 97.53 96.23 96.21
TABLE 4.5
Rank-1 Identification Rate of the Proposed Multi-Biometric Face
Recognition System on Three Different Face Datasets Using
Rank-Level Fusion. HR, BC, and LR refer to Highest Ranking,
Borda Count, and Logistic Regression, respectively
Rank Fusion Method
highest-ranking approach was adopted for comparing the performance of the proposed
multi-biometric face recognition system with other existing approaches, due
to its efficiency compared to other fusion methods in exploiting the strength of each
classifier effectively and breaking the ties between the subjects in the final ranking
list. Table 4.6 compares the Rank-1 identification rates of the proposed approaches
and the current state-of-the-art face recognition approaches using the three experiments
with the FRGC V 2.0 dataset.
Although some approaches, such as the Partial Least Squares (PLS) approach
[65], achieved a slightly higher identification rate in experiments 1 and 2, they
obtained inferior results in the remaining experiments with the FRGC V 2.0 dataset.
In addition, the FaceDeepNet system and the fusion results obtained from the
proposed approaches achieved higher identification rates on all the experiments with
the FRGC V 2.0 dataset. Table 4.7 compares the Rank-1 identification rates of the
proposed approaches and the current state-of-the-art face recognition approaches
using these two partitions of the UFI dataset. It can be seen that we were able to
achieve a higher Rank-1 identification rate using only the FDT-DRBM approach
compared with other existing approaches, and better results were also achieved using
the FaceDeepNet system and the fusion framework on both partitions of the UFI
dataset.
TABLE 4.6
Comparison of the Proposed Multi-Biometric Face Recognition System with
the State-of-the-Art Approaches on FRGC V 2.0 Dataset
FRGC V 2.0
TABLE 4.7
Comparison of the Proposed Multi-Biometric Face Recognition System with
the State-of-the-Art Approaches on UFI Dataset
UFI
Approach CI LI
Finally, the authors in [66] have proposed a multimodal biometric system
A Multi-biometric Face Recognition System 117
using only the face trait in the SDUMLA-HMT dataset. Therefore, for the purpose of
comparison, the best results obtained on the SDUMLA-HMT dataset using the
highest-ranking approach were employed, and the results are listed in Table 4.8. It can
be seen that the accuracy rate of 100% achieved using the proposed system is higher
than the best result reported in [66], which was 96.54% using the CLVQ approach.
TABLE 4.8
Comparison of the Proposed Multi-Biometric Face Recognition System with
the State-of-the-Art Approaches on SDUMLA-HMT Dataset
Approach Accuracy Rate
* http://www.openu.ac.il/home/hassner/data/lfwa/
† Incorrect face detection results were assessed manually to ensure that all the subjects were included in
creating additional positive/negative pairs from any other sources) was used in the
training phase of the proposed FDT-DRBM approach. The general flow of information,
when the proposed multi-biometric face recognition system operates in the
verification mode, is shown in Figure 4.12. The feature representations, fx and fy, of
a pair of two images, Ix and Iy, are obtained first by applying the FDT approach, and
then a feature vector F for this pair is formed using element-wise multiplication (F =
fx ⊙ fy). Finally, this vector F (extracted from pairs of images) is used as input
data to the DBN and DRBM to learn additional feature representations and perform
face verification in the last layer, respectively. The final performance is reported in
Table 4.9 by calculating the mean accuracy rate (m̂) and the standard error of the
mean accuracy (SE) over 10-fold cross-validation using different score-level fusion
methods, and the corresponding receiver operating characteristic (ROC) curves are
shown in Figure 4.13. From Table 4.9 and Figure 4.13, the highest accuracy rate was
obtained using the weighted sum rule as the fusion method at the score level, where a
higher weight was assigned to the proposed FaceDeepNet system compared to the
FDT-DRBM approach. The accuracy rate was improved by 4.5% and 0.98%
compared to the FDT-DRBM approach and the FaceDeepNet system, respectively.
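The pair representation used for verification (F = fx ⊙ fy) is simply the element-wise product of the two FDT feature vectors. A small sketch of building such pair features for a binary same/different classifier is given below, with the FDT extractor left as an assumed placeholder.

```python
import numpy as np

def pair_feature(fx, fy):
    """Element-wise product F = fx * fy of the two FDT feature vectors."""
    return np.asarray(fx) * np.asarray(fy)

def build_pair_dataset(pairs, labels, extract_fdt):
    """pairs: list of (image_x, image_y); labels: 1 = same person, 0 = different.
    extract_fdt is an assumed placeholder for the FDT feature extractor."""
    F = np.stack([pair_feature(extract_fdt(ix), extract_fdt(iy)) for ix, iy in pairs])
    return F, np.asarray(labels)
```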
The proposed FDT-DRBM approach was compared to the current state-of-the-art
approaches on the LFW dataset, such as Curvelet-Fractal [7], DDML [34], LBP, Gabor
[74], and MSBSIF-SIEDA [75], using the same evaluation protocol (i.e., Restricted), as
shown in Table 4.10. It can be seen that the Curvelet-Fractal approach [7] achieved a
slightly higher accuracy rate, but the proposed FDT-DRBM approach was able to achieve
results competitive with the other state-of-the-art face verification results reported on the
LFW dataset. The performance of the proposed face recognition approaches was also
compared with the current state-of-the-art deep learning approaches, such as DeepFace
[31], DeepID [76], ConvNet-RBM [77], ConvolutionalDBN [78], the MDFR Framework
[7], and DDML [34]. The first three approaches were mainly trained using a different
evaluation protocol, named the “Unrestricted, Labeled Outside Data” protocol,
FIGURE 4.12 Illustration of the proposed multi-biometric face recognition system operating
in the verification mode.
TABLE 4.9
Performance Comparison between the Proposed Face Recognition
Approaches on LFW Dataset Using Different Score Fusion Methods
Acc. (m̂ ± SE)
FDT-DRBM FaceDeepNet SR WSR PR Max Min
TABLE 4.10
Performance Comparison between the Proposed Approaches and the
State-of-the-Art Approaches on LFW Dataset under Different Evaluation
Protocols
Approach Acc. (m̂ ± SE) Protocol
4.6 CONCLUSIONS
In this chapter, a multi-biometric face recognition approach based on multimodal
deep learning, termed FDT-DRBM, is proposed. The approach is based on integrating
the advantages of the fractal dimension and the DRBM classifier: the former extracts
the face texture roughness and fluctuations in the surface under unconstrained
environmental conditions, and the latter efficiently models the joint distribution of
inputs and target classes. Furthermore, a novel FaceDeepNet system is
proposed to learn additional and complementary facial feature representations by
training two discriminative deep learning approaches (i.e., CNN and DBN) on top
of the local feature representations obtained from the FDT-DRBM approach. The
proposed approach was tested on four large-scale unconstrained face datasets (i.e.,
SDUMLA-HMT, FRGC V 2.0, UFI, and LFW) with high diversity in facial expressions,
illumination conditions, pose, noise, etc. A number of extensive experiments
were conducted, and a new state-of-the-art accuracy rate was achieved for both the
face identification and verification tasks by applying the proposed FaceDeepNet system
and the whole multi-biometric face recognition system (e.g., using the WSR)
on all the employed datasets. The obtained results demonstrate the reliability and
efficiency of the FDT-DRBM approach, which achieves competitive results with the
current state-of-the-art face recognition approaches (e.g., CLVQ, Curvelet-Fractal,
DeepID, etc.). For future work, it would be necessary to further validate the efficiency
and reliability of the proposed multi-biometric system using a larger multimodal
dataset containing more individuals, with face images captured under more
challenging conditions.
BIBLIOGRAPHY
1. K. Nandakumar, A. Ross, and A. K. Jain, “Introduction to multibiometrics,” in
Appeared in Proc of the 15th European Signal Processing Conference (EUSIPCO)
(Poznan Poland), 2007, pp. 271–292.
2. A. S. Al-Waisy, “Ear identification system based on multi-model approach,”
International Journal of Electronics Communication and Computer Engineering, vol.
3, no. 5, pp. 2278–4209, 2012.
3. M. S. Al-ani and A. S. Al-Waisy, “Multi-view face detection based on kernel principal
component analysis and kernel,” International Journal on Soft Computing (IJSC), vol.
2, no. 2, pp. 1–13, 2011.
4. A. S. Al-Waisy, R. Qahwaji, S. Ipson, and S. Al-Fahdawi, “A fast and accurate Iris local-
ization technique for healthcare security system,” in IEEE International Conference on
Computer and Information Technology; Ubiquitous Computing and Communications;
Dependable, Autonomic and Secure Computing; Pervasive Intelligence and
Computing, 2015, pp. 1028–1034.
5. R. Jafri and H. R. Arabnia, “A survey of face recognition techniques,” Journal of
Information Processing Systems, vol. 5, no. 2, pp. 41–68, 2009.
6. C. Ding, J. Choi, D. Tao, and L. S. Davis, “Multi-directional multi-level dual-cross pat-
terns for robust face recognition,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 38, no. 3, pp. 518–531, 2016.
7. A. S. Al-Waisy, R. Qahwaji, S. Ipson, and S. Al-Fahdawi, “A multimodal deep learning
framework using local feature representations for face recognition,” Machine Vision
and Applications, vol. 29, no. 1, pp. 35–54, 2017.
8. A. S. Al-Waisy, R. Qahwaji, S. Ipson, and S. Al-fahdawi, “A multimodal bio-
metric system for personal identification based on deep learning approaches,” in
Seventh International Conference on Emerging Security Technologies (EST), 2017,
pp. 163–168.
9. A. Ross, K. Nandakumar, and A. K. Jain, Handbook of Multibiometrics. Springer,
2006.
10. A. Ross and A. K. Jain, “Multimodal biometrics : An overview,” in 12th European
Signal Processing Conference IEEE, 2004, pp. 1221–1224.
11. A. Lumini and L. Nanni, “Overview of the combination of biometric matchers,”
Information Fusion, vol. 33, pp. 71–85, 2017.
12. L. Deng and D. Yu, “Deep learning methods and applications,” Signal Processing, vol.
28, no. 3, pp. 198–387, 2013.
13. J. Liu, C. Fang, and C. Wu, “A fusion face recognition approach based on 7-layer deep
learning neural network,” Journal of Electrical and Computer Engineering, vol. 2016,
pp. 1–7, 2016.
14. P. Fousek, S. Rennie, P. Dognin, and V. Goel, “Direct product based deep belief net-
works for automatic speech recognition,” in Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing, ICASSP, 2013, pp.
3148–3152.
15. H. Lee, P. Pham, Y. Largman, and A. Ng, “Unsupervised feature learning for audio
classification using convolutional deep belief networks,” in Advances in Neural
Information Processing Systems Conference, 2009, pp. 1096–1104.
16. R. Sarikaya, G. E. Hinton, and A. Deoras, “Application of deep belief networks for
natural language understanding,” IEEE Transactions on Audio, Speech and Language
Process., vol. 22, no. 4, pp. 778–784, 2014.
17. C. Ding and D. Tao, “A comprehensive survey on pose-invariant face recognition,”
ACM Transactions on Intelligent Systems and Technology (TIST), vol. 7, no. 3, pp.
1–40, 2016.
37. M. Kostinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, “Large scale metric
learning from equivalence constraints,” in Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, 2012, pp. 2288–2295.
38. H. V. Nguyen and L. Bai, “Cosine similarity metric learning for face verification,” in
Asian Conference on Computer Vision. Springer Berlin Heidelberg, 2011, pp. 1–12.
39. C. Peng, X. Gao, N. Wang, and J. Li, “Graphical representation for heterogeneous
face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 39, no. 2, pp. 301–312, 2017.
40. C. Peng, N. Wang, J. Li, and X. Gao, “DLFace: Deep local descriptor for cross-modal-
ity face recognition,” Pattern Recognition, vol. 90, pp. 161–171, 2019.
41. C. Peng, X. Gao, S. Member, N. Wang, and D. Tao, “Multiple representations-based
face sketch – photo synthesis,” IEEE Transactions on Neural Networks and Learning
Systems, vol. 27, no. 11, pp. 2201–2215, 2016.
42. Y. Gao, J. Ma, and A. L. Yuille, “Semi-supervised sparse representation based classification
for face recognition with insufficient labeled samples,” IEEE Transactions on
Image Processing, vol. 26, no. 5, pp. 2545–2560, 2017.
43. M. Hassaballah and S. Aly, “Face recognition: Challenges, achievements and future
directions,” IET Computer Vision, vol. 9, no. 4, pp. 614–626, 2015.
44. R. Lopes and N. Betrouni, “Fractal and multifractal analysis: A review,” Medical Image
Analysis, vol. 13, no. 4, pp. 634–649, 2009.
45. B. Mandelbrot, The Fractal Geometry of Nature. Library of Congress Cataloging in
Publication Data, United States of America, 1983.
46. B. Mandelbrot, “Self-affinity and fractal dimension,” Physica Scripta, vol. 32, 1985,
pp. 257–260.
47. K. Lin, K. Lam, and W. Siu, “Locating the human eye using fractal dimensions,”
in Proceedings of International Conference on Image Processing, 2001, pp.
1079–1082.
48. M. H. Farhan, L. E. George, and A. T. Hussein, “Fingerprint identification using fractal
geometry,” International Journal of Advanced Research in Computer Science and
Software Engineering, vol. 4, no. 1, pp. 52–61, 2014.
49. G. Hinton, “A practical guide to training restricted Boltzmann machines,” Neural
Networks: Tricks of the Trade. Springer, Berlin, Heidelberg, 2010, pp. 599–619.
50. H. Larochelle and Y. Bengio, “Classification using discriminative restricted Boltzmann
machines,” in Proceedings of the 25th International Conference on Machine Learning,
2008, pp. 536–543.
51. B. Abibullaev, J. An, S. H. Jin, S. H. Lee, and J. Il Moon, “Deep machine learning:
A new frontier in artificial intelligence research,” Medical Engineering and Physics,
vol. 35, no. 12, pp. 1811–1818, 2013.
52. G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief
nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
53. H. Khalajzadeh, M. Mansouri, and M. Teshnehlab, “Face recognition using convolutional
neural network and simple logistic classifier,” Soft Computing in Industrial
Applications. Springer International Publishing, pp. 197–207, 2014.
54. Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends® in Machine
Learning, vol. 2, no. 1, pp. 1–127, 2009.
55. A. S. Al-Waisy, R. Qahwaji, S. Ipson, S. Al-Fahdawi, and T. A. M. Nagem, “A multi-
biometric iris recognition system based on a deep learning approach,” Pattern Analysis
and Applications, vol. 21, no. 3, pp. 783–802, 2018.
56. P. Viola, O. M. Way, and M. J. Jones, “Robust real-time face detection,” International
Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
57. A. R. Chowdhury, T.-Y. Lin, S. Maji, and E. Learned-Miller, “One-to-many face recog-
nition with bilinear CNNs,” in IEEE Winter Conference on Applications of Computer
Vision (WACV), 2016, pp. 1–9.
58. D. Wang, C. Otto, and A. K. Jain, “Face search at scale: 80 million gallery,” arXiv
Prepr. arXiv1507.07242, pp. 1–14, 2015.
59. G. E. Hinton, “Training products of experts by minimizing contrastive divergence,”
Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
60. Y. Yin, L. Liu, and X. Sun, “SDUMLA-HMT: A multimodal biometric database,”
in Chinese Conference Biometric Recognition, Springer-Verlag Berlin Heidelberg,
pp. 260–268, 2011.
61. P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J.
Marques, J. Min, and W. Worek, “Overview of the face recognition grand challenge,”
in Proceedings IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, CVPR, 2005, vol. I, pp. 947–954.
62. L. Lenc and P. Král, “Unconstrained facial images: Database for face recognition
under real-world conditions,” Lecture Notes in Computer Science (including subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9414,
pp. 349–361, 2015.
63. G. B. Huang, M. Mattar, T. Berg, and E. Learned-miller, “Labeled faces in the wild:
A database for studying face recognition in unconstrained environments,” in Technical
Report 07-49, University of Massachusetts, Amherst, 2007, pp. 1–14.
64. A. S. Al-Waisy, R. Qahwaji, S. Ipson, and S. Al-Fahdawi, “A robust face recognition
system based on curvelet and fractal dimension transforms,” in IEEE International
Conference on Computer and Information Technology; Ubiquitous Computing
and Communications; Dependable, Autonomic and Secure Computing; Pervasive
Intelligence and Computing, 2015, pp. 548–555.
65. W. R. Schwartz, H. Guo, and L. S. Davis, “A robust and scalable approach to face
identification,” in European Conference on Computer Vision, 2010, pp. 476–489.
66. M. Y. Shams, A. S. Tolba, and S. H. Sarhan, “A vision system for multi-view face rec-
ognition,” International Journal of Circuits, Systems and Signal Processing, vol. 10,
pp. 455–461, 2016.
67. J. Li, T. Qiu, C. Wen, K. Xie, and F.-Q. Wen, “Robust face recognition using the deep
C2D-CNN model based on decision-level fusion,” Sensors, vol. 18, no. 7, pp. 1–27,
2018.
68. J. Holappa, T. Ahonen, and M. Pietikäinen, “An optimized illumination normalization
method for face recognition,” in IEEE Second International Conference on Biometrics:
Theory, Applications and Systems, 2008, pp. 1–6.
69. P. Král, L. Lenc, and A. Vrba, “Enhanced local binary patterns for automatic face rec-
ognition,” arXiv Prepr. arXiv1702.03349, 2017.
70. J. Lin and C.-T. Chiu, “LBP edge-mapped descriptor using MGM interest points for
face recognition,” in IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2017, pp. 1183–1187.
71. L. Lenc, “Genetic algorithm for weight optimization in descriptor based face recog-
nition methods,” in Proceedings of the 8th International Conference on Agents and
Artificial Intelligence. SCITEPRESS-Science and Technology Publications, 2016, pp.
330–336.
72. J. Gaston, J. Ming, and D. Crookes, “Unconstrained face identification with multi-scale
block-based correlation,” in IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), 2017, pp. 1477–1481.
73. G. B. Huang and E. Learned-miller, “Labeled faces in the wild : Updates and new
reporting procedures,” Department of Computer Science, University of Massachusetts
Amherst, Amherst, MA, USA, Technical Report, pp. 1–14, 2014.
74. X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li, “High-fidelity pose and expression normal-
ization for face recognition in the wild,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2015, pp. 787–796.
75. A. Ouamane, M. Bengherabi, A. Hadid, and M. Cheriet, “Side-information based
exponential discriminant analysis for face verification in the wild,” in 11th IEEE
International Conference and Workshops on Automatic Face and Gesture Recognition
(FG), 2015, pp. 1–6.
76. Y. Sun, X. Wang, and X. Tang, “Deep learning face representation from predict-
ing 10,000 classes,” in Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, 2014, pp. 1891–1898.
77. Y. Sun, X. Wang, and X. Tang, “Hybrid deep learning for face verification,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 1997–
2009, 2016.
78. G. B. Huang, H. Lee, and E. Learned-Miller, “Learning hierarchical representations
for face verification with convolutional deep belief networks,” in Proceedings of the
IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
2012, pp. 2518–2525.
79. O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in British
Machine Vision Conference, 2015, pp. 1–12.
80. O. Barkan, J. Weill, L. Wolf, and H. Aronowitz, “Fast high dimensional vector multi-
plication face recognition,” in Proceedings of the IEEE International Conference on
Computer Vision, 2013, pp. 1960–1967.
81. T. Hassner, S. Harel, E. Paz, and R. Enbar, “Effective face frontalization in uncon-
strained images,” in IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2015, pp. 4295–4304.
5 Deep LSTM-Based
Sequence Learning
Approaches for Action
and Activity Recognition
Amin Ullah, Khan Muhammad, Tanveer Hussain,
Miyoung Lee, and Sung Wook Baik
CONTENTS
5.1 Introduction .................................................................................................. 127
5.1.1 Recurrent Neural Network (RNN) ................................................... 129
5.1.2 Long Short-Term Memory (LSTM) ................................................. 130
5.1.3 Multi-layer LSTM............................................................................. 131
5.1.4 Bidirectional LSTM.......................................................................... 131
5.1.5 Gated Recurrent Unit........................................................................ 132
5.2 Action and Activity Recognition Methods ................................................... 132
5.3 Results and Discussion ................................................................................. 139
5.4 Future Research Directions .......................................................................... 143
5.4.1 Benefiting from Image Models......................................... 144
5.4.2 Multiple Actions and Activity Recognition ...................................... 145
5.4.3 Utilization of Large-Scale Datasets.................................................. 145
5.5 Conclusion .................................................................................................... 145
Acknowledgment ................................................................................................... 145
Bibliography .......................................................................................................... 146
5.1 INTRODUCTION
Human action and activity recognition is one of the most challenging areas in computer
vision. The growing number of cameras installed everywhere has enabled numerous
applications in various domains, but the enormous amount of data these cameras generate
would otherwise require constant human monitoring to identify different activities and
events. Smarter surveillance, in which normal and abnormal activities are identified
automatically using artificial intelligence and computer vision technologies, is therefore
needed. It is in high demand because of its importance in real-life problems such as
detection of abnormal and suspicious activities, intelligent video surveillance, elderly
care, patient monitoring in healthcare centers, and
video retrieval based on different actions [1, 2]. In the literature, many researchers
have analyzed action and activity recognition using various sensor data, including
RGB, grayscale, night vision, Kinect depth, and skeleton data [3].
Smartphone and smartwatch sensors have also been utilized for activity analysis and
fitness applications [4]. A categorization of action and activity recognition methods by
the available sensors is given in Figure 5.1. In this chapter, we study only vision-based
(RGB) action and activity recognition methods in which long short-term memory (LSTM)
is deployed for sequence learning. Processing the visual contents of video data for
activity recognition is very challenging due to the inter-similarity of visual and motion
patterns, changes in viewpoint, camera motion in sports videos, scale, action poses, and
different illumination conditions [5].
Recently, convolutional neural networks (CNNs) have outperformed state-of-the-art
techniques in various domains such as image classification, object detection, and
localization. However, an activity is recorded as a sequence of images, so processing
a single image cannot capture the whole activity. In videos, the activity as a whole can
therefore be recognized by analyzing the motion and other visual changes of human body
parts, such as hands and legs, across a sequence of frames. For instance, jumping for a
header in football and skipping rope share the same action pose in the initial frame,
yet the difference between these actions can easily be recognized from a sequence of
frames. Another challenge in analyzing long-term activities in video data is that each
activity has parts that overlap with other activities; for example, diving into a pool is
one activity, but whenever the swimmer dives he also jumps into the air, which is another
activity. When two activities such as jumping and diving appear in the same sequence,
the interrupted sequence can lead to false predictions. To address these issues, many
researchers [6, 7] have introduced different techniques to represent sequences and learn
from them for activity recognition. For instance, the current literature on human
activity recognition highlights some popular methods
FIGURE 5.1 Categorization of action and activity recognition using available sensor data.
including volume motion templates [6], spatio-temporal features [7], motion history
volumes [8], interest point descriptors [9], bag-of-words [10], optical flow trajectories
[11], and CNN-based learned features [12, 13] for sequence representation, followed by
machine learning algorithms such as support vector machines (SVMs), decision trees,
recurrent neural networks (RNNs), and LSTM. However, these methods alone are not enough
for sequence representation and learning. Recent studies show that LSTM is one of the
most effective models for sequence learning. Therefore, in this chapter, we explore
sequence learning concepts integrated with LSTM for action and activity recognition.
FIGURE 5.2 (a) A standard unrolled RNN network; (b) vanishing gradient problem
in RNN.
i_t = \sigma\big((x_t + s_{t-1})\,W^i + b^i\big)   (5.1)

f_t = \sigma\big((x_t + s_{t-1})\,W^f + b^f\big)   (5.2)

o_t = \sigma\big((x_t + s_{t-1})\,W^o + b^o\big)   (5.3)

g = \tanh\big((x_t + s_{t-1})\,W^g + b^g\big)   (5.4)

c_t = c_{t-1} \odot f_t + g \odot i_t   (5.5)

s_t = \tanh(c_t) \odot o_t   (5.6)
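As a concrete illustration of Equations 5.1–5.6, the following NumPy sketch steps a single LSTM cell through a toy sequence using the chapter's notation (x_t input, s_{t-1} previous hidden state, c_{t-1} previous cell state, \sigma the sigmoid activation, \odot element-wise multiplication). The weight shapes, random inputs, and hidden size are illustrative assumptions, not values taken from any method surveyed here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, c_prev, W, b):
    """One LSTM time step following Equations 5.1-5.6.

    W and b hold the gate weights W^i, W^f, W^o, W^g and biases b^i, b^f,
    b^o, b^g. The chapter combines x_t and s_{t-1} additively, so both must
    share the same dimensionality in this sketch.
    """
    z = x_t + s_prev                       # (x_t + s_{t-1}) as in Eqs. 5.1-5.4
    i_t = sigmoid(z @ W["i"] + b["i"])     # input gate,  Eq. 5.1
    f_t = sigmoid(z @ W["f"] + b["f"])     # forget gate, Eq. 5.2
    o_t = sigmoid(z @ W["o"] + b["o"])     # output gate, Eq. 5.3
    g_t = np.tanh(z @ W["g"] + b["g"])     # candidate cell state, Eq. 5.4
    c_t = c_prev * f_t + g_t * i_t         # cell state update, Eq. 5.5
    s_t = np.tanh(c_t) * o_t               # hidden state, Eq. 5.6
    return s_t, c_t

# Illustrative dimensions: hidden size 4, one toy feature vector per frame.
rng = np.random.default_rng(0)
dim = 4
W = {k: rng.standard_normal((dim, dim)) * 0.1 for k in "ifog"}
b = {k: np.zeros(dim) for k in "ifog"}
s, c = np.zeros(dim), np.zeros(dim)
for x_t in rng.standard_normal((10, dim)):   # 10 frames of toy features
    s, c = lstm_step(x_t, s, c, W, b)
print(s)
```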
Recent studies [17–21] have shown that large amounts of data increase the accuracy
of machine learning tasks. However, learning complex sequential patterns from the video
data of large-scale datasets is very challenging, and a single LSTM layer is often not
effective enough to do so. Researchers have therefore recently utilized multi-layer LSTM,
constructed by stacking multiple LSTM layers, to learn long-term dependen-
cies in video data.
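A minimal sketch of such a multi-layer (stacked) LSTM, assuming PyTorch as the framework; the feature dimension, hidden size, clip length, and 101-class output are illustrative placeholders rather than settings taken from the surveyed methods.

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers process a sequence of per-frame feature vectors;
# the final hidden state of the top layer is classified into an action label.
feature_dim, hidden_dim, num_classes = 4096, 512, 101

stacked_lstm = nn.LSTM(input_size=feature_dim, hidden_size=hidden_dim,
                       num_layers=2, batch_first=True)   # multi-layer LSTM
classifier = nn.Linear(hidden_dim, num_classes)

frames = torch.randn(8, 30, feature_dim)    # batch of 8 clips, 30 frames each
outputs, (h_n, c_n) = stacked_lstm(frames)  # h_n: (num_layers, batch, hidden)
logits = classifier(h_n[-1])                # use the top layer's final state
print(logits.shape)                         # torch.Size([8, 101])
```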
function is adjusted after the output layer, weights, and biases are adjusted through
back-propagation.
FIGURE 5.4 The framework for action recognition in videos using CNN and deep bidirec-
tional LSTM [22].
FIGURE 5.5 The framework of activity recognition in industrial systems. In the first stage, the surveillance stream is preprocessed using a CNN-based
salient feature extraction mechanism for shot segmentation. The second stage extracts CNN-based temporal optical flow features from the sequence of
frames. Finally, multi-layer LSTM is presented for sequence learning and the recognition of activities [1].
FIGURE 5.6 The framework of selected spatio-temporal features via bidirectional LSTM
for action recognition [25].
TABLE 5.1
Description of LSTM-Based Approaches with Details about the Nature of the Activity
Sequence representation Sequence learning Nature of activity
FIGURE 5.7 Group activity recognition via a hierarchical model [26]. Each person in a
scene is modeled using a temporal model that captures his/her dynamics; these models are
integrated into a higher-level model that captures scene-level activity.
FIGURE 5.8 Framework of early activity detection [33]. At each video frame, their model
first computes deep CNN features (named fc7), and then the features are fed into the LSTM
to compute detection scores of activities.
and sequence recognition for final action prediction. The LSTM used by the authors
is bidirectional with a hierarchical structure, and they employ spatio-temporal attention
in their method. Finally, they integrate spatio-temporal features extracted using
three-dimensional CNNs (3DCNNs) to strengthen the relationships mined in each frame.
Similarly, many other methods utilize sequence learning based on RNN or LSTM with
different configurations to recognize different actions and activities from video data.
Furthermore, researchers (see [25] and [29–36]) have
FIGURE 5.9 Network architecture for relational LSTM for video action recognition [29].
FIGURE 5.10 The architecture of the RACNN model for action recognition.
introduced different types of LSTM and RNN configurations for precise action and
activity recognition. Mainstream LSTM-based action recognition literature relies on
convolutional features for sequence representation, extracting deep features as a
prerequisite step for different versions of LSTM or RNN. The sequence learning step then
uses different flavors of LSTM, such as bidirectional, multi-layer, and simple single-layer
networks; the majority of the action recognition literature uses a simple LSTM structure.
The LSTM-based methods discussed so far are limited to individual action or activity
recognition and localization in video frames: they process all the frames with a simple
LSTM instead of taking separate cues from each individual in a frame to predict the
activity of a group.
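The mainstream pipeline described above, per-frame CNN features followed by LSTM sequence learning, can be sketched roughly as follows in PyTorch. The ResNet-18 backbone, sizes, and class count are stand-in assumptions (torchvision 0.13 or later is assumed for the `weights` argument); individual papers use their own backbones and LSTM variants.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNLSTMActionRecognizer(nn.Module):
    """Per-frame CNN features followed by LSTM sequence learning (sketch)."""
    def __init__(self, hidden_dim=256, num_classes=101):
        super().__init__()
        backbone = models.resnet18(weights=None)     # any frame-level CNN
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.lstm = nn.LSTM(512, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips):                  # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))  # (b*t, 512, 1, 1)
        feats = feats.flatten(1).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)         # sequence learning over frames
        return self.fc(h_n[-1])                # one action label per clip

model = CNNLSTMActionRecognizer()
print(model(torch.randn(2, 16, 3, 224, 224)).shape)   # torch.Size([2, 101])
```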
TABLE 5.2
Comparison of Different LSTM-Based Action and Activity
Recognition Methods on Different Challenging Benchmark Datasets
Method UCF-101 YouTube HMDB-51
TABLE 5.3
List of Popular Benchmark Action and Activity
Recognition Datasets
Dataset    Number of videos    Number of action/activity classes
FIGURE 5.11 Sample activities from UCF-101, HMDB51, and YouTube action datasets [22].
this purpose, a color-coded confusion matrix is used to show the error matrix of the
algorithm: values near 100% are brighter and lower values are darker. The confusion
matrices for the DB-LSTM approach on the HMDB51 and UCF-101 datasets are given in
Figure 5.12, where the intensity of the true positives (diagonal) is high for each
category, demonstrating the effectiveness of this method. A recent approach by Ullah
et al. [1] evaluated their method on five benchmark datasets. Confusion matrices from
their method for the UCF-101 and HMDB51 datasets are shown in Figure 5.13. They achieved
high accuracy on the UCF-101 and UCF-50 datasets, so Figure 5.13 (a) and Figure 5.13 (d)
show very high true positives for each class. For the two more challenging datasets,
HMDB51 and Hollywood2, they achieved average performance, so the diagonals in
Figure 5.13 (b) and Figure 5.13 (c) are not as bright. Figure 5.13 (e) is the confusion
matrix for the YouTube action dataset, which contains only 11 action categories; the
results are therefore good for each class and the true positives are very bright in the
confusion matrix.
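A rough sketch of how such a color-coded confusion matrix can be produced with scikit-learn and matplotlib; the labels and predictions below are random placeholders standing in for real test results.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes; each row is normalized
# so that bright diagonal cells correspond to high true-positive rates.
rng = np.random.default_rng(0)
num_classes = 11                      # e.g., the YouTube action dataset
y_true = rng.integers(0, num_classes, size=500)
y_pred = np.where(rng.random(500) < 0.8, y_true,
                  rng.integers(0, num_classes, size=500))

cm = confusion_matrix(y_true, y_pred, labels=range(num_classes))
cm_norm = cm / cm.sum(axis=1, keepdims=True)   # row-wise normalization

plt.imshow(cm_norm, cmap="viridis", vmin=0.0, vmax=1.0)
plt.colorbar(label="per-class recall")
plt.xlabel("Predicted class")
plt.ylabel("True class")
plt.title("Color-coded confusion matrix (toy data)")
plt.savefig("confusion_matrix.png")
```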
Category-wise accuracy is another metric for large-scale dataset evaluation. It
shows the performance of the algorithm for every category in the dataset.
We compare the class-wise accuracies obtained by DB-LSTM [22] and by temporal
optical flow with multi-layer LSTM [1] on the UCF-101 dataset in
Figure 5.14 (a) and Figure 5.14 (b). The horizontal axis shows the classes, and the
vertical axis represents the percentage accuracy for the corresponding classes on the
UCF-101 test set. It is evident from Figure 5.14 (a) and Figure 5.14 (b) that the
best accuracies are above 80%, approaching 100% for some categories.
However, in the DB-LSTM results shown in Figure 5.14 (a), many categories fall below
50%, which shows that the method is not effective for this dataset. On the other hand,
the temporal optical flow method with multi-layer LSTM [1], shown in Figure 5.14 (b),
achieves higher than 70% accuracy for each class, which demonstrates the robustness of
this approach for different kinds of activity recognition in real-world scenarios.
The class-wise accuracies achieved by DB-LSTM [22]
FIGURE 5.12 Confusion matrices of HMDB51 and UCF-101 datasets obtained by deep
features of AlexNet model for frame-level representation followed by DB-LSTM for action
recognition [22].
FIGURE 5.13 Confusion matrices for testing five datasets used in the evaluation of CNN-based temporal optical flow features followed by multi-layer
LSTM. (a) UCF101 dataset, (b) HMDB51 dataset, (c) Hollywood2 dataset, (d) UCF50 dataset, and (e) YouTube dataset [1, 22].
FIGURE 5.14 Comparison of category-wise accuracy on the UCF101 dataset for DB-LSTM
and temporal optical flow with multi-layer LSTM approaches [1, 22].
and by temporal optical flow with multi-layer LSTM [1] on the HMDB51 dataset
are shown in Figure 5.15 (a) and Figure 5.15 (b), respectively.
HMDB51 is one of the most challenging datasets; on it, the method of [1] achieved 71%
and the method of [22] obtained 87% overall accuracy. Looking at the class-wise
accuracies, the method of [22], shown in Figure 5.15 (a), performs better for most
classes, but a few of them are still below 10%. For the method of [1], shown in
Figure 5.15 (b), a few category scores are above 90%, while most lie in the range of
70% to 90%; the accuracy for all classes is greater than 45% except kicking a football,
picking, and swinging a baseball bat. This is due to the overlap of visual content and
the similarity of motion information with other classes, so predictions for these
classes are confused with others.
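The category-wise accuracy plotted in Figures 5.14 and 5.15 is simply the per-class recall, i.e., the diagonal of the row-normalized confusion matrix; a small sketch with a made-up 4-class confusion matrix:

```python
import numpy as np

# Class-wise accuracy: the fraction of test samples of each class that were
# predicted correctly, taken from the diagonal of the confusion matrix.
cm = np.array([[48,  1,  1,  0],
               [ 3, 40,  5,  2],
               [ 2,  6, 37,  5],
               [ 0,  1,  4, 45]])
class_accuracy = np.diag(cm) / cm.sum(axis=1)
for idx, acc in enumerate(class_accuracy):
    print(f"class {idx}: {acc:.1%}")
print(f"overall accuracy: {np.trace(cm) / cm.sum():.1%}")
```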
summarization [54], etc. The strength of these models is that they are trained
on large-scale image datasets, which helps them learn discriminative hidden local and
global representations from visual data.
5.5 CONCLUSION
Human action and activity recognition is the problem of analyzing and predicting the
movements of human body parts in a sequence of video frames. In this chapter, we
have described the concept of sequence learning for action and activity recognition
using LSTM and its variants such as multi-layer or deep LSTM and bidirectional
LSTM networks. Moreover, recent LSTM-based methods for action and activity recognition
are surveyed, and their achievements and drawbacks are discussed. We also
explained the working of traditional RNNs and their limitations, along with why
LSTM is better than RNN. Finally, the chapter concluded with recommendations for
future directions that may help overcome the present challenges of action and activ-
ity recognition using LSTM.
ACKNOWLEDGMENT
This work was supported by the National Research Foundation of Korea (NRF)
grant funded by the Korea Government (MSIP) (No. 2019R1A2B5B01070067).
BIBLIOGRAPHY
1. A. Ullah, K. Muhammad, J. Del Ser, S. W. Baik, and V. Albuquerque, “Activity recog-
nition using temporal optical flow convolutional features and multi-layer LSTM,” IEEE
Transactions on Industrial Electronics, vol. 66, no. 12, pp. 9692–9702, 2019.
2. I. U. Haq, K. Muhammad, A. Ullah, and S. W. Baik, “DeepStar: Detecting starring
characters in movies,” IEEE Access, vol. 7, pp. 9265–9272, 2019.
3. Y. Liu, L. Nie, L. Han, L. Zhang, and D. S. Rosenblum, “Action2Activity: Recognizing
complex activities from sensor data,” in IJCAI, 2015, pp. 1617–1623.
4. Y. Liu, L. Nie, L. Liu, and D. S. Rosenblum, “From action to activity: Sensor-based
activity recognition,” Neurocomputing, vol. 181, pp. 108–115, 2016.
5. A. Ullah, K. Muhammad, I. U. Haq, and S. W. Baik, “Action recognition using opti-
mized deep autoencoder and CNN for surveillance data streams of non-stationary envi-
ronments,” Future Generation Computer Systems, vol. 96, pp. 386–397, 2019.
6. M.-C. Roh, H.-K. Shin, and S.-W. Lee, “View-independent human action recognition
with volume motion template on single stereo camera,” Pattern Recognition Letters,
vol. 31, pp. 639–647, 2010.
7. M. Xin, H. Zhang, H. Wang, M. Sun, and D. Yuan, “Arch: Adaptive recurrent-convo-
lutional hybrid networks for long-term action recognition,” Neurocomputing, vol. 178,
pp. 87–102, 2016.
8. D. Weinland, R. Ronfard, and E. Boyer, “Free viewpoint action recognition using
motion history volumes,” Computer Vision and Image Understanding, vol. 104, pp.
249–257, 2006.
9. M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, “Action classifica-
tion in soccer videos with long short-term memory recurrent neural networks,” in
International Conference on Artificial Neural Networks, 2010, pp. 154–159.
10. A. Kovashka, and K. Grauman, “Learning a hierarchy of discriminative space-time
neighborhood features for human action recognition,” in 2010 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2010, pp. 2046–2053.
11. M. Sekma, M. Mejdoub, and C. B. Amar, “Human action recognition based on multi-
layer Fisher vector encoding method,” Pattern Recognition Letters, vol. 65, pp. 37–43,
2015.
12. J. Hou, X. Wu, Y. Sun, and Y. Jia, “Content-attention representation by factorized
action-scene network for action recognition,” IEEE Transactions on Multimedia, vol.
20, pp. 1537–1547, 2018.
13. F. U. M. Ullah, A. Ullah, K. Muhammad, I. U. Haq, and S. W. Baik, “Violence detec-
tion using spatiotemporal features with 3D convolutional neural network,” Sensors, vol.
19, p. 2472, 2019.
14. K.-i. Funahashi, and Y. Nakamura, “Approximation of dynamical systems by continu-
ous time recurrent neural networks,” Neural Networks, vol. 6, pp. 801–806, 1993.
15. S. Hochreiter, and J. Schmidhuber, “Long short-term memory,” Neural Computation,
vol. 9, pp. 1735–1780, 1997.
16. H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network
architectures for large scale acoustic modeling,” in Fifteenth Annual Conference of the
International Speech Communication Association, 2014.
17. M. Sajjad, S. Khan, K. Muhammad, W. Wu, A. Ullah, and S. W. Baik, “Multi-grade
brain tumor classification using deep CNN with extensive data augmentation,” Journal
of Computational Science, vol. 30, pp. 174–182, 2019.
18. J. Ahmad, K. Muhammad, S. Bakshi, and S. W. Baik, “Object-oriented convolu-
tional features for fne-grained image retrieval in large surveillance datasets,” Future
Generation Computer Systems, vol. 81, pp. 314–330, 2018.
19. K. Muhammad, J. Ahmad, Z. Lv, P. Bellavista, P. Yang, and S. W. Baik, “Efficient deep
CNN-based fire detection and localization in video surveillance applications,” IEEE
Transactions on Systems, Man, and Cybernetics: Systems, vol. 49, no. 7, pp. 1419–1434,
2019.
20. K. Muhammad, R. Hamza, J. Ahmad, J. Lloret, H. H. G. Wang, and S. W. Baik, “Secure
surveillance framework for IoT systems using probabilistic image encryption,” IEEE
Transactions on Industrial Informatics, vol. 14, no. 8, pp. 3679–3689, 2018.
21. M. Sajjad, A. Ullah, J. Ahmad, N. Abbas, S. Rho, and S. W. Baik, “Integrating salient
colors with rotational invariant texture features for image representation in retrieval
systems,” Multimedia Tools and Applications, vol. 77, pp. 4769–4789, 2018.
22. A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad, and S. W. Baik, “Action recognition in
video sequences using deep Bi-directional LSTM with CNN features,” IEEE Access,
vol. 6, pp. 1155–1166, 2018.
23. A. Ogawa, and T. Hori, “Error detection and accuracy estimation in automatic
speech recognition using deep bidirectional recurrent neural networks,” Speech
Communication, vol. 89, pp. 70–83, 2017.
24. J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent
neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
25. W. Li, W. Nie, and Y. Su, “Human action recognition based on selected spatio-temporal
features via bidirectional LSTM,” IEEE Access, vol. 6, pp. 44211–44220, 2018.
26. M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, “A hierarchical deep
temporal model for group activity recognition,” In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 1971–1980.
27. Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek, “VideoLSTM convolves,
attends and flows for action recognition,” Computer Vision and Image Understanding,
vol. 166, pp. 41–50, 2018.
28. C.-Y. Ma, M.-H. Chen, Z. Kira, and G. AlRegib, “TS-LSTM and temporal-inception:
Exploiting spatiotemporal dynamics for activity recognition,” Signal Processing:
Image Communication, vol. 71, pp. 76–87, 2019.
29. Z. Chen, B. Ramachandra, T. Wu, and R. R. Vatsavai, “Relational long short-term
memory for video action recognition,” arXiv preprint arXiv:1811.07059, 2018.
30. W. Du, Y. Wang, and Y. Qiao, “Recurrent spatial-temporal attention network for action
recognition in videos,” IEEE Transactions on Image Processing, vol. 27, pp. 1347–
1360, 2018.
31. J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K.
Saenko, et al., “Long-term recurrent convolutional networks for visual recognition and
description,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2015, pp. 2625–2634.
32. Y. Huang, X. Cao, Q. Wang, B. Zhang, X. Zhen, and X. Li, “Long-short term fea-
tures for dynamic scene classifcation,” IEEE Transactions on Circuits and Systems for
Video Technology, vol. 29, no. 4, pp. 1038–1047, 2019.
33. S. Ma, L. Sigal, and S. Sclaroff, “Learning activity progression in LSTMs for activity
detection and early detection,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2016, pp. 1942–1950.
34. L. Sun, K. Jia, K. Chen, D.-Y. Yeung, B. E. Shi, and S. Savarese, “Lattice long short-
term memory for human action recognition,” in ICCV, 2017, pp. 2166–2175.
35. V. Veeriah, N. Zhuang, and G.-J. Qi, “Differential recurrent neural networks for action
recognition,” in Proceedings of the IEEE International Conference on Computer
Vision, 2015, pp. 4041–4049.
36. H. Yang, J. Zhang, S. Li, and T. Luo, “Bi-direction hierarchical LSTM with spatial-
temporal attention for action recognition,” Journal of Intelligent & Fuzzy Systems, vol.
36, no. 1, pp. 775–786, 2019.
37. K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions
classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
38. H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: A large video
database for human motion recognition,” in 2011 IEEE International Conference on
Computer Vision (ICCV), 2011, pp. 2556–2563.
39. C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local SVM
approach,” in Proceedings of the 17th International Conference on Pattern Recognition,
ICPR 2004, 2004, pp. 32–36.
40. M. Marszalek, I. Laptev, and C. Schmid, “Actions in context,” in IEEE Conference on
Computer Vision and Pattern Recognition, CVPR 2009, 2009, pp. 2929–2936.
41. F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, “Activitynet: A large-
scale video benchmark for human activity understanding,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2015, pp. 961–970.
42. W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, et al.,
“The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
43. C. Gu, C. Sun, S. Vijayanarasimhan, C. Pantofaru, D. A. Ross, G. Toderici, et al., “AVA:
A video dataset of spatio-temporally localized atomic visual actions,” arXiv preprint
arXiv:1705.08421, vol. 3, p. 6, 2017.
44. S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, et
al., “Youtube-8m: A large-scale video classification benchmark,” arXiv preprint
arXiv:1609.08675, 2016.
45. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-
scale video classifcation with convolutional neural networks,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
46. M. Monfort, B. Zhou, S. A. Bargal, A. Andonian, T. Yan, K. Ramakrishnan, et al.,
“Moments in time dataset: One million videos for event understanding,” arXiv preprint
arXiv:1801.03150, 2018.
47. S. Khan, K. Muhammad, S. Mumtaz, S. W. Baik, and V. H. C. de Albuquerque,
“Energy-efficient deep CNN for smoke detection in foggy IoT environment,” IEEE
Internet of Things Journal, vol. 6, no. 6, pp. 9237–9245, 2019.
48. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, et al.,
“Mobilenets: Efficient convolutional neural networks for mobile vision applications,”
arXiv preprint arXiv:1704.04861, 2017.
49. F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer,
“Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model
size,” arXiv preprint arXiv:1602.07360, 2016.
50. N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for effi-
cient CNN architecture design,” arXiv preprint arXiv:1807.11164, vol. 5, 2018.
51. M. Sajjad, M. Nasir, F. U. M. Ullah, K. Muhammad, A. K. Sangaiah, and S. W. Baik,
“Raspberry Pi assisted facial expression recognition framework for smart security in
law-enforcement services,” Information Sciences, vol. 479, pp. 416–431, 2019.
52. M. Sajjad, S. Khan, T. Hussain, K. Muhammad, A. K. Sangaiah, A. Castiglione, et
al., “CNN-based anti-spoofing two-tier multi-factor authentication system,” Pattern
Recognition Letters, vol. 126, pp. 123–131, 2019.
53. J. Ahmad, K. Muhammad, J. Lloret, and S. W. Baik, “Efficient conversion of deep fea-
tures to compact binary codes using Fourier decomposition for multimedia big data,”
IEEE Transactions on Industrial Informatics, vol. 14, pp. 3205–3215, July 2018.
54. K. Muhammad, T. Hussain, and S. W. Baik, “Efficient CNN based summarization
of surveillance videos for resource-constrained devices,” Pattern Recognition Letters,
2018.
55. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-
time object detection,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2016, pp. 779–788.
56. S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detec-
tion with region proposal networks,” in Advances in Neural Information Processing
Systems, 2015, pp. 91–99.
6 Deep Semantic
Segmentation in
Autonomous Driving
Hazem Rashed, Senthil Yogamani,
Ahmad El-Sallab, Mahmoud Hassaballah, and
Mohamed ElHelw
CONTENTS
6.1 Introduction .................................................................................................. 152
6.2 Convolutional Neural Networks ................................................................... 154
6.2.1 Overview of Deep CNN ................................................................... 154
6.2.2 Types of CNN Layers ....................................................................... 155
6.3 Semantic Segmentation ................................................................................ 157
6.3.1 Automotive Scene Segmentation ...................................................... 159
6.3.2 Common Automotive Datasets......................................................... 160
6.3.2.1 Real Datasets ..................................................................... 160
6.3.2.2 Synthetic Datasets.............................................................. 161
6.4 Motion and Depth Cues................................................................................ 161
6.4.1 Depth Estimation .............................................................................. 162
6.4.2 Optical Flow Estimation................................................................... 163
6.5 Proposed Fusion Architectures..................................................................... 164
6.5.1 Single-Input Network........................................................................ 164
6.5.2 Early-Fusion Network....................................................................... 165
6.5.3 Mid-Fusion Network......................................................................... 165
6.6 Experiments.................................................................................................. 166
6.6.1 Experimental Settings ...................................................................... 166
6.6.2 Results Analysis and Discussion ...................................................... 167
6.6.2.1 Depth Experiments ............................................................ 167
6.6.2.2 Optical Flow Experiments ................................................. 167
6.7 Conclusion .................................................................................................... 174
Bibliography .......................................................................................................... 178
6.1 INTRODUCTION
The first autonomous cars (ACs), developed by Carnegie Mellon University and
ALV, appeared in 1984 [1]. Later, in 1987, Mercedes-Benz collaborated with
Bundeswehr University Munich in the Eureka Prometheus Project [2], the largest R&D
project ever in autonomous driving (AD). Since then, numerous car manufacturers have
developed prototypes of self-driving cars [3, 4]. The first truly autonomous car, VaMP,
developed in the 1990s by Mercedes-Benz, was a 500 SEL whose steering wheel, throttle,
and brakes were controlled by computers in real time. The car relied only on four
cameras for localization, as there was no GPS at that time. The vehicle was able to
drive more than 1,000 km in normal traffic at speeds of up to 130 km/h, moving from one
lane to another and passing other cars autonomously after the approval of the safety
driver [5]. Recently, AD has gained great attention with significant progress in
computer vision algorithms [6, 7], and it is considered one of the most trending
technologies around the globe. Within the next 5–10 years, AD is expected to be deployed
commercially. Currently, most automotive original equipment manufacturers (OEMs)
worldwide, such as Volvo, Daimler, BMW, Audi, Ford, Nissan, and Volkswagen, are working
on development projects focusing on AD technology [8, 9].
In addition, every year large numbers of people die or are injured in car accidents
[10]. Autonomous cars (ACs) have great potential to significantly reduce these numbers.
One of the potential benefits of ACs is a higher safety level for both drivers and
surrounding pedestrians [11]. The consulting firm McKinsey & Company has reported that
deploying AD technology could eliminate 90% of accidents in the USA and save up to
US$190 billion annually through damage prevention. Autonomous vehicles are also
predicted to have a positive impact on traffic flow, with higher speed limits permitted
for increased roadway capacity [12]. AD can further provide a higher level of
satisfaction for customers, especially the elderly and the disabled, introduce new
business models such as autonomous taxis as a service, and reduce crime.
Additionally, when vehicles are self-aware, they will have the information needed
to avoid numerous accidents [13]. For example, vehicles may recognize when they
are being stolen or when the driver is being threatened. Self-aware vehicles may
recognize children being kidnapped or weapons being transported, or they may spot drunk
drivers and consequently take proper action [14, 15]. Generally, there are five levels
of autonomy [16]:
• Level 1 is “driver assistance”, where the car has a single automated aid, such
as a lane-keeping system that keeps the car in the center of the lane. Another
example is adaptive cruise control (ACC), which controls the acceleration/
deceleration according to the distance between the ego-vehicle and the car ahead
with a predefined speed limit [17].
• Level 2 is “partial automation”, where the car can do maneuvers control-
ling both speed and steering together; however, the driver must keep his/
her hands on the steering wheel for any required intervention. Most of the
current commercial autonomous driving systems are of Level 2.
• Level 3 is “conditional automation”, where the car manages almost all of
the tasks but the system prompts the driver for intervention when there is a
scenario it cannot navigate through. Unlike Level 2, the driver can remove
his hands from the steering wheel; however, he/she should be available to
take over at any time.
• Level 4 is “high automation”, where the car can operate without the
driver’s supervision but under conditional circumstances for the envi-
ronment. Otherwise, the driver may take control. Google’s Firefly prototype is an
example of Level 4 autonomy; however, its speed is limited to 25 mph.
• Level 5 is “full automation”, where the car can operate in any condition and
the driver’s role is only providing the destination. This level of autonomy
has not been reached yet.
Nevertheless, like any new technology, AD faces various technical and non-technical
challenges. The technical challenges include sensor accuracy versus cost; that is, the
sensors have to be reasonably priced [18], so efficient algorithms have to be
implemented to deal with map generation errors. Localization errors occur because data
from GPS and inertial measurement units (IMUs) is not always stable due to connectivity
issues, and even odometry information has errors, especially when the car travels long
distances. Path planning and navigation should not depend on imperfect inputs and should
handle various practical cases [19]. The complexity of the system must remain acceptable
for producing commercial cars, which adds another limitation in terms of the hardware
used for production. The non-technical challenges include the urgent need to implement
tough decisions [20]. For instance, the AC has to be equipped with a decision-making
system for unavoidable accidents; it may have to decide between hitting a woman or a
child, and whom to protect the most, the child outside the car or the car occupant [21].
In this regard, a mobile robot must be able to
perform four key tasks to be able to navigate autonomously [22, 23].
• Path planning and decision making according to both map and position
information. In addition to path planning, there are some constraints and
regulations to be followed especially for a self-driving car, e.g., observ-
ing traffic lights and following rules in special paths such as roundabouts,
intersections, and U-turns.
• Vehicle control through sending signals to actuators over the vehicle’s net-
work for taking a desired action.
added to a bias. The convolution input can be raw pixels or an output feature map from
the previous layer. In deep learning, the convolution acts as a feature detector for the
CNN, where the values in the kernel define the feature being detected; a kernel with
ones in the middle columns, for example, can detect vertical edges within the input
patch. The output of each convolution operation is called a feature map, and stacking
the output feature maps of different kernels generates a 3D output volume.
Convolution filters define a small bounded region, producing activation maps from the
input volume by connecting to only a subset of it. This allows high-quality feature
extraction that exploits the spatial correlation between pixels while reducing the
number of parameters per layer compared to fully connected neurons [31]. Since the depth
of an RGB input volume is 3, the kernels of the first convolution layer should also have
a depth of 3: convolution kernels connect to a bounded region of the image in width and
height, but their depth always has to match the input depth. CNNs use parameter sharing
to control the total number of parameters: the same learned weights of a filter are
reused as the filter scans the whole image or feature map, instead of defining new
parameters at every location. Several parameters control the use of a filter in CNNs,
such as the receptive field, output depth, stride, and zero-padding.
Receptive field: This term refers to the spatial filter size in terms of width and
height. A filter with a receptive field of 5 × 5 × 3 has a 5-pixel width, a 5-pixel
height, and a depth of 3 channels. Usually, filters of the first convolutional layer
have a depth of 3 to accept an RGB input volume of 3 channels.
Output depth: The number of neurons in the convolutional layer connected to the
same region of the input volume; in other words, the number of filters used in the
convolutional layer.
Stride: The step used by the filter to scan through the input volume. A stride of 1
means that the filter moves horizontally and vertically in one-pixel steps, resulting in
a dense output volume with a large number of depth columns, as there is a large overlap
between receptive fields.
Zero-padding: This parameter controls the spatial size of the output volume. It is
very helpful if the output volume is required to be the same size as the input volume.
Additionally, if convolutions are done without zero-padding, the size is reduced and the
information at the boundaries is lost quickly.
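The receptive field, stride, and zero-padding together determine the spatial size of the output volume through the standard relation (W − F + 2P)/S + 1. A small sketch with illustrative sizes checks this formula against a framework convolution (PyTorch is assumed here purely for verification):

```python
import torch
import torch.nn as nn

def conv_output_size(w, f, s, p):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

# Illustrative settings: 224x224 RGB input, 5x5 receptive field, 64 filters
# (output depth), stride 2, zero-padding 2.
w, f, s, p = 224, 5, 2, 2
print(conv_output_size(w, f, s, p))                     # 112

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=f,
                 stride=s, padding=p)
print(conv(torch.randn(1, 3, w, w)).shape)              # (1, 64, 112, 112)
```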
Batch normalization layer: This layer is used in deep learning to speed up training
by making normalization part of the network. For example, if one input takes values in
the range 1 to 1,000 and another in the range 1 to 10, a change from 1 to 2 in the
second input represents a 10% change, but the same change represents only 0.1% of the
first input's range. Normalizing the inputs helps the network learn faster and more
accurately without biasing certain dimensions over others. Batch normalization applies
the same idea to the hidden layers, whose outputs keep changing during training. Since
the activations of a previous layer are the inputs of the next layer, each layer in the
neural network faces a problem as its input distribution changes at every step. This
problem is termed “covariate shift”. The basic idea behind batch normalization is to
limit covariate shift by normalizing the activations of each layer, transforming the
inputs to have zero mean and unit variance. This
allows each layer to learn from a more stable distribution of inputs and accelerates the
training of the network. Limiting covariate shift also helps to avoid the vanishing
gradient problem, in which gradients become very small when the input to a sigmoid
activation function, for example, is large; without normalization, the distribution
keeps changing during training and may grow large enough to create this problem [32].
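A minimal NumPy sketch of the training-time computation described above, normalizing each feature of a mini-batch to zero mean and unit variance before a learned scale (gamma) and shift (beta); epsilon is a small constant assumed for numerical stability:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization over a mini-batch.

    x: (batch, features) activations of a layer.
    gamma, beta: learnable per-feature scale and shift.
    """
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # learned scale and shift

# Two features on very different scales (cf. the 1-1,000 vs. 1-10 example).
rng = np.random.default_rng(0)
x = np.stack([rng.uniform(1, 1000, 64), rng.uniform(1, 10, 64)], axis=1)
out = batch_norm_train(x, gamma=np.ones(2), beta=np.zeros(2))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))
```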
Pooling layers: These layers are periodically inserted after convolutional layers
in a typical ConvNet architecture. Their function is to progressively reduce the spatial
size of the input representation, which provides multiple advantages: (1) obtaining an
abstract representation of the input data in which only the most important features are
considered, which helps to avoid overfitting the model to the input data; (2) reducing
the computational cost, as the number of learnable parameters becomes smaller than in
corresponding networks without pooling; and (3) providing basic translation invariance
to the input representation. The pooling layer operates independently on each depth
slice of the input representation. The most common form of pooling is max-pooling, where
a 2 × 2 window scans the image with a stride of 2, keeping only the maximum value of
each window and thus retaining only 25% of the input volume [33].
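A minimal sketch of 2 × 2, stride-2 max-pooling on a single depth slice (NumPy, toy input):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max-pooling with stride 2 on one depth slice (H and W even)."""
    h, w = x.shape
    windows = x.reshape(h // 2, 2, w // 2, 2)     # group into 2x2 windows
    return windows.max(axis=(1, 3))               # keep the max per window

feature_map = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(feature_map))   # [[ 5.  7.]
                                   #  [13. 15.]]
```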
Fully connected layers: Neurons in a fully connected layer have connections to all
output activations from the previous layer, which makes it possible to model complex
representations. The fully connected layer provides all possible combinations of the
features from the previous layers, whereas the convolutional layer relies only on the
local spatial correlation between pixels. However, fully connected layers contribute
strongly to the network's complexity [34].
Output layer: In a standard classification task where each object belongs to exactly
one class, a softmax layer is typically used to generate the final output. Suppose we
are trying to classify an object among five classes: the softmax function provides an
output probability for the likelihood that the object belongs to each class. These
output probabilities sum to 1, which is convenient because the next step is to classify
the object according to the maximum probability generated.
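A small numeric sketch of the softmax computation for the five-class example above; the logits are made up:

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    z = logits - logits.max()        # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1, -1.2, 0.5])   # made-up scores for 5 classes
probs = softmax(logits)
print(probs, probs.sum(), probs.argmax())       # probabilities, 1.0, class 0
```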
networks are implemented to use only appearance cues, without considering several other
cues that might be helpful for a specific application. In this context, deep learning
has played a major role in semantic segmentation tasks, mainly because of the automatic
feature learning advantage it has over conventional methods. Three main directions have
been explored in deep learning for semantic segmentation. The first uses patch-wise
training, where the input image is fed into a Laplacian pyramid, each scale is forwarded
through a three-stage network for hierarchical feature extraction, and patch-wise
classification is used [41].
The second direction is end-to-end training, in which the CNN is trained to take raw
images as input and directly output dense pixel-wise classification instead of patch
classification. In [42], well-known classification networks, namely AlexNet [43],
VGG-net [44], and GoogLeNet [45], are adapted into fully convolutional networks, and the
learned knowledge is transferred to the semantic segmentation task. Skip connections
between shallow and deep features are implemented to avoid losing resolution, and the
features learned within the network are upsampled using deconvolution to obtain dense
predictions. In [46], a deeper deconvolution network with stacked deconvolution and
unpooling layers is utilized. In [47], an encoder-decoder network is implemented in
which the decoder makes use of the pooling indices computed in the max-pooling step of
the corresponding encoder for nonlinear upsampling.
Recent works have focused on multiscale semantic segmentation [48]. In [41], the
scale issue is addressed using multiple rescaled versions of the image; after the
emergence of end-to-end learning, the skip connections implemented in [42] helped to
merge feature maps from different resolutions, since resolution is lost through image
downsampling. Dilated convolution was introduced in [49], where the receptive field is
expanded by inserting zero values between the kernel weights,
with the spacing set by the dilation factor, in order to avoid losing resolution. This
also helps with the multiple-scales issue. In the attention model [50, 51], the
multiscale image segmentation model is adapted so that it learns to softly weight the
multiscale features for objects of different scales.
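A short sketch of dilated convolution as described above, assuming PyTorch: a 3 × 3 kernel with dilation 2 covers a 5 × 5 neighborhood (effective receptive field k + (k − 1)(d − 1)) while keeping the same number of weights and, with suitable padding, the same output resolution. The channel counts and input size are illustrative.

```python
import torch
import torch.nn as nn

# A 3x3 kernel with dilation 2 spans a 5x5 region using only 9 weights.
# Padding is chosen so the spatial resolution of the feature map is preserved.
x = torch.randn(1, 64, 128, 128)

standard = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(standard(x).shape, dilated(x).shape)   # both (1, 64, 128, 128)
```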
FIGURE 6.3 Sample images from the surround-view camera network demonstrating near-
field sensing and wide field of view.
systems. The difference between the viewpoints of the camera and the LiDAR causes
occluded points to be projected incorrectly onto the image plane. An occlusion
correction method is therefore presented to remove the incorrectly mapped points, and a
dense depth map for the fisheye camera is generated using the LiDAR depth information
as a supervision signal.
In the generic structure of automotive scenes, different types of prior information
common to any automotive road scene can be utilized, e.g., spatial priors such as road
pixels usually occupying almost all of the bottom half of the image, and lanes appearing
as thick white lines extending until the road pixels end. Other priors include color
priors for traffic lights, roads, and lanes. Additionally, there is a strong geometric
structure in the scene: the road is flat, and all objects stand vertically above it.
This is explicitly exploited in the formulation of a commonly used object-depth
representation, namely Stixels [54]. There are static and moving objects in addition to
the vehicle ego-motion, which may provide useful temporal information. Although there is
much in the automotive scene structure that can be used as prior information, the
system's complexity has to be taken into consideration in automotive applications, given
the memory and processing-power constraints of the embedded systems on board.
them with the corresponding regions in the RGB image. The approach is applied to
indoor scenes using the NYU dataset. Cao et al. [73] proposed a way to learn the depth
map and empirically showed a performance gain of 2% on the VOC2012 dataset.
f_x u + f_y v + f_t = 0   (6.1)
where u and v denote the change in a pixel's x and y positions with respect to time,
i.e., the motion of the pixel in both directions; f_x and f_y denote the image
derivatives along the x and y directions; and f_t is the image difference over time. The
Lucas-Kanade method is computed over previously selected features, so the final output
is sparse, which makes it computationally efficient. In contrast, Farneback's method
[76] computes optical flow for every pixel in the image and provides a dense output in
which every pixel in the source frame is associated with a 2D vector (u, v) describing
its motion in the x and y directions. Both the Lucas-Kanade and Farneback methods
perform well only when there is little motion between corresponding pixels in the two
images, and they fail when the displacement is large.
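A rough sketch of computing Farneback dense optical flow between two consecutive frames with OpenCV and encoding it as the color wheel (HSV) image used later in the fusion experiments; the file names and parameter values are placeholders, not the settings used in this chapter:

```python
import cv2
import numpy as np

# Dense optical flow between two consecutive frames with Farneback's method,
# then encoded as a color-wheel image: hue = flow direction, value = magnitude.
# "frame0.png" / "frame1.png" are placeholder file names.
prev = cv2.cvtColor(cv2.imread("frame0.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame1.png"), cv2.COLOR_BGR2GRAY)

# Positional args: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
u, v = flow[..., 0], flow[..., 1]                  # per-pixel motion in x and y

mag, ang = cv2.cartToPolar(u, v, angleInDegrees=True)
hsv = np.zeros((*prev.shape, 3), dtype=np.uint8)
hsv[..., 0] = (ang / 2).astype(np.uint8)           # hue encodes direction (0-180)
hsv[..., 1] = 255                                  # full saturation
hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
color_wheel = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
cv2.imwrite("flow_color_wheel.png", color_wheel)
```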
On the other hand, deep learning has been used to formulate optical flow estimation
as a learning problem in which temporally sequential images are fed to the network and
the output is a dense optical flow. FlowNet [77] used a large-scale synthetic dataset,
namely “Flying Chairs”, as ground truth for dense optical flow. Two architectures were
proposed. The first accepts a six-channel input formed by stacking two temporally
sequential images in a single frame. The second is a mid-fusion implementation in which
each image is processed separately in its own stream and a 2D correlation is computed to
output a dense optical flow (DOF) map. This work was later extended to FlowNet v2 [78],
which focuses on small displacements: an additional specialized subnetwork is added to
learn small displacements, and its results are fused with the original architecture
using another network that learns the best fusion. In [79], the estimated flow is used
to warp one of the two temporally sequential images to the other, and the loss function
is designed to minimize the error between the
warped image and the original image. Junhwa and Stefan [80] implemented joint
estimation of optical flow and semantic segmentation such that each task benefits from
the other.
FIGURE 6.4 Single-Input network utilized for semantic segmentation given only one piece
of information (e.g., RGB-only, depth-only, or DOF-only).
FIGURE 6.5 Early-Fusion Network: the pixel-level fusion is done on raw pixels before any
processing or feature extraction.
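To make the difference between the fusion strategies concrete, the following simplified PyTorch sketch contrasts early fusion (channel-wise concatenation of RGB and the auxiliary map at the input) with mid fusion (two encoder streams whose feature maps are concatenated). The tiny encoders, channel counts, and 19-class head are stand-ins; the networks used in this chapter are VGG-based segmentation networks.

```python
import torch
import torch.nn as nn

def tiny_encoder(in_channels):
    """Stand-in feature extractor; the chapter's networks use a VGG encoder."""
    return nn.Sequential(nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())

class EarlyFusionSeg(nn.Module):
    """Pixel-level fusion: RGB and flow/depth stacked before any processing."""
    def __init__(self, aux_channels=3, num_classes=19):
        super().__init__()
        self.encoder = tiny_encoder(3 + aux_channels)
        self.head = nn.Conv2d(64, num_classes, 1)
    def forward(self, rgb, aux):
        return self.head(self.encoder(torch.cat([rgb, aux], dim=1)))

class MidFusionSeg(nn.Module):
    """Two-stream fusion: each modality has its own encoder, features merged."""
    def __init__(self, aux_channels=3, num_classes=19):
        super().__init__()
        self.rgb_stream = tiny_encoder(3)
        self.aux_stream = tiny_encoder(aux_channels)
        self.head = nn.Conv2d(128, num_classes, 1)
    def forward(self, rgb, aux):
        fused = torch.cat([self.rgb_stream(rgb), self.aux_stream(aux)], dim=1)
        return self.head(fused)

rgb = torch.randn(1, 3, 128, 256)
flow = torch.randn(1, 3, 128, 256)   # e.g., a color-wheel DOF map
print(EarlyFusionSeg()(rgb, flow).shape, MidFusionSeg()(rgb, flow).shape)
```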
6.6 EXPERIMENTS
6.6.1 EXPERIMENTAL SETTINGS
This section discusses the experimental settings. For all experiments, the Adam
optimizer is used with a learning rate of e−5. L2 regularization is used in the loss
function with a factor of 5e−4 to reduce overfitting on the training data. The VGG
pre-trained weights are utilized for initialization, and dropout with probability 0.5 is
applied to the 1 × 1 convolution layers. Intersection over Union (IoU) is used to
evaluate class-wise accuracy, and the mean IoU is computed over all classes. In this
work, TensorFlow, an open-source library developed by Google, is utilized. TensorFlow
can run on CPUs as well as GPUs and is available for Windows, Linux, and macOS. In our
experiments, we used TensorFlow v1.0.0 on a Python 2.7 platform under Linux Ubuntu
v14.04, and the implemented networks were executed on a TitanX GPU (Maxwell
architecture).
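A short sketch of the class-wise IoU and mean IoU computation used for evaluation, built from a confusion matrix over toy label maps (the arrays are placeholders):

```python
import numpy as np

def mean_iou(y_true, y_pred, num_classes):
    """Class-wise Intersection over Union and its mean from label maps."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true.ravel(), y_pred.ravel()), 1)   # confusion matrix
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)            # avoid divide-by-zero
    return iou, iou.mean()

# Toy 4x4 label maps with 3 classes standing in for real segmentation output.
gt = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 2, 2], [2, 2, 2, 2]])
pred = np.array([[0, 0, 1, 0], [0, 1, 1, 1], [2, 2, 2, 2], [2, 2, 0, 2]])
iou, miou = mean_iou(gt, pred, num_classes=3)
print(iou, miou)
```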
The datasets used in this work were chosen based on three factors. First, they have
to be automotive datasets containing urban road scenes, as our application is mainly
focused on autonomous driving. Second, the datasets should consist of video sequences so
that we can compute DOF and use the motion information for semantic segmentation. Third,
there should be a way to calculate depth information from the chosen dataset. Two
well-known automotive datasets fulfilling these requirements, namely Virtual KITTI [61]
and Cityscapes [57], are used in the experiments. For Virtual KITTI, we shuffled the
dataset randomly and split the frames into 80% training data and 20% test data. In
Cityscapes, we use the split provided with the dataset, which is 3,474 images for
training and 1,525 images for testing. Optical flow annotation is not provided with the
Cityscapes dataset; however, the video sequences contain temporal information about the
scenes, so we exploit them to generate different representations of dense optical flow
using two methods. We train the network to output all of the classes provided in the
dataset; however, evaluation is done on 19 classes only, as specified by the Cityscapes
team, and the official evaluation script published for the dataset is used to report
results.
TABLE 6.1
Performance of Depth-Augmented Semantic Segmentation on the Virtual KITTI Dataset Using IoU Metric
Type Mean Truck Car Van Road Sky Vegetation Building Guardrail TrafficSign TrafficLight Pole
RGB 66.47 33.66 85.69 29.04 95.91 93.91 80.92 68.15 81.82 66.01 65.07 40.91
D (GT) 55 67.68 58.03 56.3 73.81 94.38 53.64 43.95 14.61 53.97 56.51 42.67
RGBD (GT) 66.76 65.34 91.74 56.93 95.46 94.41 79.17 54.91 73.42 60.21 46.09 30.46
RGB+D (GT-add) 68.6 43.38 91.59 29.19 69.01 94.32 85.17 77.6 80.13 69.54 72.73 32.09
RGB+D (GT-concat) 72.13 62.84 93.32 38.42 96.33 94.2 90.46 79.04 90.85 72.22 67.83 34.4
D (monoDepth) 46.1 36.05 75.46 33.2 77.3 87.3 39.3 32.3 6.8 42.14 45.9 15.9
RGB+D (monoDepth-add) 67.05 42.9 86.9 43.5 96.2 94.1 88.07 65.94 85.4 65.7 51.25 30.13
RGB+D (monoDepth-concat) 68.92 40.57 86.1 50.3 95.95 93.82 81.63 70.43 86.3 68.66 67.58 35.94
TABLE 6.2
Performance of Depth-Augmented Semantic Segmentation on the Cityscapes Dataset Using IoU Metric
Type Mean Bicycle Person Rider Motorcycle Bus Car Fence Building Road Sidewalk Sky TrafficSign
RGB 62.47 63.52 67.93 40.49 29.96 62.13 89.16 44.53 87.86 96.22 74.98 89.79 59.88
D (SGM) 47.8 39.84 54.99 29.04 11.29 48.1 82.36 34.32 78.42 95.15 67.78 81.18 27.96
RGBD (SGM) 55.5 56.68 60.27 34.64 21.18 58 86.94 36.47 84.7 94.84 70.39 84.64 45.48
RGB+D (SGM-add) 63.48 66.46 67.85 42.31 41.37 63.1 89.77 46.28 88.1 96.38 75.66 90.23 60.78
RGB+D (SGM-concat) 63.13 65.32 67.79 39.14 37.27 69.71 90.06 42.75 87.44 96.6 76.35 91.06 59.44
D (monoDepth) 40.89 36.63 44.6 18.5 7.3 37.5 77.78 16.16 77.01 92.83 54.87 89.33 24.67
RGB+D (monoDepth-add) 61.39 66.23 67.33 39.9 44.01 55.7 89.1 40.2 87.34 69.47 75.7 88.7 57.7
RGB+D (monoDepth-concat) 63.03 65.85 67.44 41.33 46.24 66.5 89.7 33.6 87.25 96.01 73.5 90.3 59.8
learned only to generate the scene structure without any further details. This result
motivated us toward the concept of fusion, which should help the network understand the
scene semantics from the RGB image and enhance performance using motion information from
optical flow. We used the color wheel representation for this task, encoding both flow
magnitude and direction.
Table 6.3 shows an evaluation of four architectures on the Virtual KITTI dataset.
There is an improvement of 4% in mean IoU with the Mid-Fusion approach, with larger
improvements in moving classes like Truck, Car, and Van (38%, 6%, and 27%, respectively),
which demonstrates that the Mid-Fusion network captures the motion cues best and thus
yields the best semantic segmentation accuracy. The RGBF network performs very close to
the RGB-only network, but with significant improvement in moving classes such as Truck
and Van. For all the results reported in Tables 6.3 and 6.4, we use the color wheel
representation for flow. Table 6.4 illustrates the performance on the Cityscapes dataset,
indicating a marginal improvement of 0.1% in overall IoU. There is a significant
improvement in moving object classes such as motorcycle and train, by 17% and 7%, even
when using noisy flow maps. The improvement in moving objects is significant compared to
the overall IoU, which is dominated by sky and road pixels.
Quantitative comparisons between different DOF representations in the Early-Fusion
(RGBF) and Mid-Fusion (RGB+F) architectures for the Virtual KITTI and Cityscapes
datasets are reported in Tables 6.5 and 6.6, respectively. The results show that the
color wheel Mid-Fusion network provides the best performance, as the color wheel
representation encodes both magnitude and direction, while the Mid-Fusion network gives
each stream the capacity to learn separately. Hence, we obtained the best results using
this approach. The RGBF results are also interesting: they show that the network cannot
provide good output segmentation when the input is increased to 6 channels, 3 for RGB
and 3 for color wheel DOF. However, augmenting the input with only one channel of DOF
magnitude provides better results for moving classes such as Truck and Van in Virtual
KITTI, which might be beneficial for an embedded system with limited resources. On
Cityscapes, RGBF fails to provide better segmentation for several reasons, among them
the simple network architecture used and the large number of classes trained for.
Additionally, Cityscapes has only 3,475 images for training, while Virtual KITTI has
17,008 frames. The Virtual KITTI results show the need for more research in this
direction, especially for limited hardware resources. Figures 6.7 and 6.8 illustrate the
qualitative results of the four architectures on Virtual KITTI and Cityscapes,
respectively. Figure 6.7 (f) shows better detection of the van, which has a uniform
texture that the flow cue has helped to detect more accurately. Figure 6.7 (h) shows
that the FlowNet-only result provides good segmentation of the moving van and captures
generic scene structures like pavement, sky, and vegetation without color information;
however, the fusion in Figure 6.7 (i) still needs to be improved. Figure 6.8 (f) and (h)
illustrate better detection of the bicycle and the cyclist after fusion with DOF. These
examples visually verify the accuracy improvements shown in Tables 6.3 and 6.4.
TABLE 6.3
Performance of Flow-Augmented Semantic Segmentation Results on the Virtual KITTI Dataset Using IoU Metric
Type Mean Truck Car Van Road Sky Vegetation Building Guardrail TrafficSign TrafficLight Pole
RGB 66.47 33.66 85.69 29.04 95.91 93.91 80.92 68.15 81.82 66.01 65.07 40.91
F (GT) 42 36.2 55.2 20.7 62.6 93.9 19.54 34 15.23 51.5 33.2 29.3
RGBF (GT) 65.328 70.74 80.2 48.34 93.6 93.3 70.79 62.05 67.86 55.14 55.48 31.9
RGB+F (GT) 70.52 71.79 91.4 56.8 96.19 93.5 83.4 66.53 82.6 64.69 64.65 26.6
F (FlowNet) 28.6 24.6 47.8 14.3 57.9 68 13.4 4.9 0.8 31.8 18.5 6.6
RGB+F (FlowNet) 68.84 60.05 90.87 40.54 96.05 91.73 84.54 68.52 82.43 65.2 63.54 26.54
TABLE 6.4
Performance of Flow-Augmented Semantic Segmentation Results on the Cityscapes Dataset Using IoU Metric
Type Mean Bicycle Person Rider Motorcycle Bus Car Train Building Road Truck Sky TrafficSign
RGB 62.47 63.52 67.93 40.49 29.96 62.13 89.16 44.19 87.86 96.22 48.54 89.79 59.88
F (Farneback) 34.7 34.48 37.9 12.7 7.39 31.4 74.3 11.35 72.77 91.2 19.42 79.6 11.4
RGBF (Farneback) 47.8 52.6 55.8 31.1 22.4 39.34 82.75 22.8 80.43 92.24 20.7 81.87 44.08
RGB+F (Farneback) 62.56 63.65 66.3 39.65 47.22 66.24 89.63 51.02 87.13 96.4 36.1 90.64 60.68
F (FlowNet) 36.8 32.9 50.9 26.8 5.12 25.99 75.29 15.1 65.16 90.75 25.46 50.16 29.14
RGBF (FlowNet) 52.3 54.9 58.9 34.8 26.1 53.7 83.6 40.7 79.4 94 28.1 79.4 45.5
RGB+F (FlowNet) 62.43 64.2 66.32 40.9 40.76 66.05 90.03 41.3 87.3 95.8 54.7 91.07 58.21
TABLE 6.5
Semantic Segmentation Results on the KITTI for Different DOF (GT) Representations
Type Mean Truck Car Van Road Sky Vegetation Building Guardrail TrafficSign TrafficLight Pole
RGBF (GT-color wheel) 59.88 41.7 84.44 40.74 93.76 93.6 66.3 49.43 52.18 62.21 49.61 21.52
RGBF (GT-Mag & Dir) 58.85 45.12 82.3 30.04 90.25 94.1 60.97 56.48 51.48 58.74 49.7 26.01
RGBF (GT-Mag only) 65.32 70.73 80.16 48.33 93.59 93.3 70.79 62.04 67.86 55.13 55.48 31.92
RGB+F (GT-3 layers Mag only) 67.88 35.7 91.02 24.78 96.47 94.06 88.72 74.4 84.5 69.48 68.95 34.28
RGB+F (GT-color wheel) 70.52 71.79 91.4 56.8 96.19 93.5 83.4 66.53 82.6 64.69 64.65 26.6
TABLE 6.6
Semantic Segmentation Results on the Cityscapes for Different DOF (Farneback) Representations
Type Mean Bicycle Person Rider Motorcycle Bus Car Train Building Road Truck Sky TrafficSign
RGBF (Mag only) 47.8 52.63 55.82 31.08 22.38 39.34 82.75 22.8 80.43 92.24 20.7 81.87 44.08
RGBF (Mag & Dir) 54.6 57.28 58.63 33.56 18.49 56.44 84.6 41.15 84.41 95.5 31.8 87.86 44.26
RGBF (color wheel) 57.2 61.47 62.18 35.13 22.68 54.87 87.45 36.69 86.28 95.94 40.2 90.07 51.64
RGB+F (3 layers Mag only) 62.1 65.15 65.44 32.59 33.19 63.07 89.48 43.6 87.88 96.17 57.2 91.48 55.76
RGB+F (color wheel) 62.56 63.65 66.3 39.65 47.22 66.24 89.63 51.02 87.13 96.4 36.11 90.64 60.68
Unexpectedly, Table 6.4 shows that Farneback and FlowNet v2 provide similar
results; however, FlowNet shows better results for some classes, such as Truck. In
Figure 6.8, fusion with DOF in a two-stream approach enhances the semantic
segmentation results for both Farneback and FlowNet. Farneback DOF is noisy;
however, the result in Figure 6.8 (f) shows that the network has good immunity to this
noise. Farneback has an advantage over FlowNet in fine details: it is visually evident
that the rider can be separated from the bicycle, whereas FlowNet does not provide
this level of detail. As a result, fusion with Farneback provides slightly higher
performance than fusion with FlowNet. In both datasets, we observed some
degradation in the accuracy of static objects after fusion, despite the improvement
obtained for moving objects. This highlights the motivation for multimodal fusion
combining RGB, motion, and depth together, where the depth signal is expected to
improve the accuracy of static objects. Moreover, there is a need for deeper
investigation of the fusion strategy to maximize the benefit from each modality
without loss of accuracy in other classes (Tables 6.7 and 6.8).
6.7 CONCLUSION
This chapter focuses on computer vision tasks for autonomous driving applications.
It explores the autonomous driving field and the role of recent computer vision
algorithms in self-driving car development, such as deep learning CNNs, which have
shown significant improvement in terms of accuracy for various tasks in different
scope for improvement, as depth, flow, and color are different modalities, so more
research should be conducted to construct better fusion networks. Future research
aims to
TABLE 6.7
Comparison between Baseline, Depth-Augmented, Flow-Augmented, and Depth-Flow-Augmented Semantic Segmentation Results on the Virtual KITTI Dataset Using IoU Metric
Type Mean Truck Car Van Road Sky Vegetation Building Guardrail TrafficSign TrafficLight Pole
RGB-only 66.47 33.66 85.69 29.04 95.91 93.91 80.92 68.15 81.82 66.01 65.07 40.91
RGB+D (GT) 72.13 62.84 93.32 38.42 96.33 94.2 90.46 79.04 90.85 72.22 67.83 34.4
RGB+F (GT) 70.52 71.79 91.4 56.8 96.19 93.5 83.4 66.53 82.6 64.69 64.65 26.6
RGB+D+F (GT) 71.88 71.688 92.08 61.44 95.85 94.83 84.32 71.86 83.42 64.69 60.67 31.08
TABLE 6.8
Comparison between Baseline, Depth-Augmented, Flow-Augmented, and Depth-Flow-Augmented Semantic Segmentation Results on the Cityscapes Dataset Using IoU Metric
Type Mean Bicycle Person Rider Motorcycle Bus Car Train Building Road Truck Sky TrafficSign
RGB-only 62.47 63.52 67.93 40.49 29.96 62.13 89.16 44.19 87.86 96.22 48.54 89.79 59.88
RGB+D (SGM) 63.48 66.46 67.85 42.31 41.37 63.1 89.77 46.28 88.1 96.38 75.66 90.23 60.78
RGB+F (Farneback) 62.56 63.65 66.3 39.65 47.22 66.24 89.63 51.02 87.13 96.4 36.1 90.64 60.68
RGB+D+F 62.58 65.46 68.18 43.09 37.5 64.4 88.34 57.13 86.86 96.13 41.45 87.96 59.23
BIBLIOGRAPHY
1. Richard S. Wallace, Anthony Stentz, Charles E. Thorpe, Hans P. Moravec, William
Whittaker, and Takeo Kanade. First results in robot road-following. In International
Joint Conferences on Artificial Intelligence (IJCAI), pages 1089–1095. Citeseer, 1985.
2. Sebastian Thrun. Toward robotic cars. Communications of the ACM, 53(4): 99–106,
2010.
3. Matthew A. Turk, David G. Morgenthaler, Keith D. Gremban, and Martin Marra.
VITS-A vision system for autonomous land vehicle navigation. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 10(3): 342–361, 1988.
4. Kichun Jo, Junsoo Kim, Dongchul Kim, Chulhoon Jang, and Myoungho Sunwoo.
Development of autonomous car-Part II: A case study on the implementation of an
autonomous driving system based on distributed architecture. IEEE Transactions on
Industrial Electronics, 62(8): 5119–5132, 2015.
5. Ernst Dieter Dickmanns. Dynamic Vision for Perception and Control of Motion.
Springer-Verlag, London, 2007.
6. Mahdi Rezaei and Reinhard Klette. Computer Vision for Driver Assistance. Springer,
Cham Switzerland, 2017.
7. Mahmoud Hassaballah and Khalid M. Hosny. Recent Advances in Computer Vision:
Theories and Applications, volume 804. Springer International Publishing, 2019.
8. David Michael Stavens. Learning to Drive: Perception for Autonomous Cars. Stanford
University, 2011.
9. Yuna Ro and Youngwook Ha. A factor analysis of consumer expectations for autono-
mous cars. Journal of Computer Information Systems, 59(1): 52–60, 2019.
10. Adil Hashim, Tanya Saini, Hemant Bhardwaj, Adityan Jothi, and Ammannagari Vinay
Kumar. Application of swarm intelligence in autonomous cars for obstacle avoidance.
In Integrated Intelligent Computing, Communication and Security, pages 393–404.
Springer, 2019.
11. Angelos Amanatiadis, Evangelos Karakasis, Loukas Bampis, Stylianos Ploumpis, and
Antonios Gasteratos. ViPED: On-road vehicle passenger detection for autonomous
vehicles. Robotics and Autonomous Systems, 112: 282–290, 2019.
12. Yusuf Artan, Orhan Bulan, Robert P. Loce, and Peter Paul. Passenger compartment vio-
lation detection in HOV/HOT lanes. IEEE Transactions on Intelligent Transportation
Systems, 17(2): 395–405, 2016.
13. Dorsa Sadigh, S Shankar Sastry, and Sanjit A. Seshia. Verifying robustness of human-
aware autonomous cars. IFAC-PapersOnLine, 51(34): 131–138, 2019.
14. Bhargava Reddy, Ye-Hoon Kim, Sojung Yun, Chanwon Seo, and Junik Jang. Real-time
driver drowsiness detection for embedded system using model compression of deep
neural networks. In IEEE Conference on Computer Vision and Pattern Recognition
Workshops, pages 121–128, 2017.
15. Theocharis Kyriacou, Guido Bugmann, and Stanislao Lauria. Vision-based urban navi-
gation procedures for verbally instructed robots. Robotics and Autonomous Systems,
51(1): 69–80, 2005.
16. SAE International Committee. Taxonomy and definitions for terms related to on-
road motor vehicle automated driving systems. Technical Report J3016–201401, SAE
International, 2014. http://doi.org/ 10.4271/J3016_201806.
17. Jonathan Horgan, Ciaran Hughes, John McDonald, and Senthil Yogamani. Vision-
based driver assistance systems: Survey, taxonomy and advances. In International
Conference on Intelligent Transportation Systems (ITSC), pages 2032–2039.
IEEE, 2015.
18. Zhentao Hu, Tianxiang Chen, Quanbo Ge, and Hebin Wang. Observable degree analy-
sis for multi-sensor fusion system. Sensors, 18(12): 4197, 2018.
19. Shan Luo, Joao Bimbo, Ravinder Dahiya, and Hongbin Liu. Robotic tactile perception
of object properties: A review. Mechatronics, 48: 5467, 2017.
20. Guilherme N. DeSouza and Avinash C. Kak. Vision for mobile robot navigation: A survey.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2): 237–267, 2002.
21. Oliver Pink, Jan Becker, and Soren Kammel. Automated driving on public roads:
Experiences in real traffic. IT-Information Technology, 57(4): 223–230, 2015.
22. Thomas Braunl. Embedded Robotics: Mobile Robot Design and Applications with
Embedded Systems. Springer Science & Business Media, 2008.
23. Gregory Dudek and Michael Jenkin. Computational Principles of Mobile Robotics.
Cambridge University Press, 2010.
24. Uwe Handmann, Thomas Kalinke, Christos Tzomakas, Martin Werner, and Werner von
Seelen. Computer vision for driver assistance systems. In Enhanced and Synthetic Vision
1998, volume 3364, pages 136–148. International Society for Optics and Photonics, 1998.
25. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):
436, 2015.
26. Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew
Y. Ng. Multimodal deep learning. In International Conference on Machine Learning
(ICML-11), pages 689–696, 2011.
27. Li Deng and Dong Yu. Deep learning: Methods and applications. Foundations and
Trends® in Signal Processing, 7(3–4): 197–387, 2014.
28. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with
deep convolutional neural networks. In Advances in Neural Information Processing
Systems, pages 1097–1105, 2012.
29. Josh Patterson and Adam Gibson. Deep Learning: A Practitioner’s Approach. O’Reilly
Media, Inc., 2017.
30. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
31. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely
connected convolutional networks. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 4700–4708, 2017.
32. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network
training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
33. Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Daniel Fink, Olivier
Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, et al. Evolving
deep neural networks. In Artificial Intelligence in the Age of Neural Networks and
Brain Computing, pages 293–312. Elsevier, 2019.
34. Michael Hauser, Sean Gunn, Samer Saab Jr, and Asok Ray. State-space representations
of deep neural networks. Neural Computation, 31(3): 538–554, 2019.
35. Abhinav Valada, Gabriel L. Oliveira, Thomas Brox, and Wolfram Burgard. Deep
multispectral semantic scene understanding of forested environments using multi-
modal fusion. In International Symposium on Experimental Robotics, pages 465–477.
Springer, 2016.
36. Taigo M. Bonanni, Andrea Pennisi, D.D Bloisi, Luca Iocchi, and Daniele Nardi.
Human-robot collaboration for semantic labeling of the environment. In 3rd Workshop
on Semantic Perception, Mapping and Exploration, pp. 1–6, 2013.
37. Abhijit Kundu, Yin Li, Frank Dellaert, Fuxin Li, and James M. Rehg. Joint semantic
segmentation and 3D reconstruction from monocular video. In European Conference
on Computer Vision, pages 703–718. Springer, 2014.
38. Ondrej Miksik, Vibhav Vineet, Morten Lidegaard, Ram Prasaath, Matthias Nießner,
Stuart Golodetz, Stephen L. Hicks, Patrick Perez, Shahram Izadi, and Philip H.S. Torr.
The semantic paintbrush: Interactive 3d mapping and recognition in large outdoor
spaces. In 33rd Annual ACM Conference on Human Factors in Computing Systems,
pages 3317–3326. ACM, 2015.
39. Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler,
Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes data-
set for semantic urban scene understanding. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2016.
40. Gabriel J. Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in
video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):
88–97, 2009.
41. Clement Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning
hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 35(8): 1915–1929, 2013.
42. Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks
for semantic segmentation. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 3431–3440, 2015.
43. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep
convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q.
Weinberger, editors, Advances in Neural Information Processing Systems, 25, pages
1097–1105. Curran Associates, Inc., 2012.
44. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-
scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
45. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper
with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition,
pages 1–9, 2015.
46. Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network
for semantic segmentation. In IEEE International Conference on Computer Vision,
pages 1520–1528, 2015.
47. Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolu-
tional encoder-decoder architecture for image segmentation. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 39(12): 2481–2495, 2017.
48. Jun Mao, Xiaoping Hu, Xiaofeng He, Lilian Zhang, Liao Wu, and Michael J. Milford.
Learning to fuse multiscale features for visual place recognition. IEEE Access, 7:
5723–5735, 2019.
49. Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions.
arXiv preprint arXiv:1511.07122, 2015.
50. Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L. Yuille. Attention to
scale: Scale-aware semantic image segmentation. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 3640–3649, 2016.
51. Boyu Chen, Peixia Li, Chong Sun, Dong Wang, Gang Yang, and Huchuan Lu. Multi
attention module for visual tracking. Pattern Recognition, 87: 80–93, 2019.
52. Liuyuan Deng, Ming Yang, Hao Li, Tianyi Li, Bing Hu, and Chunxiang Wang.
Restricted deformable convolution based road scene semantic segmentation using sur-
round view cameras. arXiv preprint arXiv:1801.00708, 2018.
53. Varun Ravi Kumar, Stefan Milz, Christian Witt, Martin Simon, Karl Amende,
Johannes Petzold, Senthil Yogamani, and Timo Pech. Monocular fisheye camera depth
estimation using sparse lidar supervision. In International Conference on Intelligent
Transportation Systems (ITSC), pages 2853–2858. IEEE, 2018.
54. Marius Cordts, Timo Rehfeld, Lukas Schneider, David Pfeiffer, Markus Enzweiler,
Stefan Roth, Marc Pollefeys, and Uwe Franke. The stixel world: A medium-level repre-
sentation of traffic scenes. Image and Vision Computing, 68: 40–52, 2017.
55. Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation
and recognition using structure from motion point clouds. In European Conference on
Computer Vision, pages 44–57. Springer, 2008.
56. Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driv-
ing? The kitti vision benchmark suite. In Conference on Computer Vision and Pattern
Recognition, pp. 3354–3361, 2012.
57. Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler,
Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes data-
set for semantic urban scene understanding. arXiv preprint arXiv:1604.01685, 2016.
58. Heiko Hirschmuller. Accurate and efficient stereo processing by semi-global match-
ing and mutual information. In IEEE Conference on Computer Vision and Pattern
Recognition, volume 2, pages 807–814. IEEE, 2005.
59. Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder.
The mapillary vistas dataset for semantic understanding of street scenes. In IEEE
International Conference on Computer Vision, pages 4990–4999, 2017.
60. Xinyu Huang, Xinjing Cheng, Qichuan Geng, Binbin Cao, Dingfu Zhou, Peng Wang,
Yuanqing Lin, and Ruigang Yang. The apolloscape dataset for autonomous driving.
In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages
954–960, 2018.
61. Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as
proxy for multi-object tracking analysis. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 4340–4349, 2016.
62. German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M.
Lopez. The synthia dataset: A large collection of synthetic images for semantic seg-
mentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 3234–3243, 2016.
63. NuTonomy. Nuscenes dataset. 2012. https://www.nuscenes.org/.
64. Magnus Wrenninge and Jonas Unger. Synscapes: A photorealistic synthetic dataset for
street scene parsing. arXiv preprint arXiv:1810.08705, 2018.
65. Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan
L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous
convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 40(4): 834–848, 2018.
66. Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single
monocular images using deep convolutional neural fields. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 38(10): 2024–2039, 2016.
67. Yuanzhouhan Cao, Zifeng Wu, and Chunhua Shen. Estimating depth from mon-
ocular images as classification using deep fully convolutional residual networks.
IEEE Transactions on Circuits and Systems for Video Technology, 28(11): 3174–
3182, 2018.
68. Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised
learning of depth and ego-motion from video. In IEEE Conference on Computer Vision
and Pattern Recognition, pages 1851–1858, 2017.
69. Clement Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular
depth estimation with left-right consistency. In IEEE Computer Vision and Pattern
Recognition, volume 2, page 7, 2017.
70. Caner Hazirbas, Lingni Ma, Csaba Domokos, and Daniel Cremers. Fusenet:
Incorporating depth into semantic segmentation via fusion-based cnn architecture. In
Asian Conference on Computer Vision, pages 213–228. Springer, 2016.
71. Lingni Ma, Jorg Stückler, Christian Kerl, and Daniel Cremers. Multi-view deep
learning for consistent semantic mapping with RGB-d cameras. arXiv preprint
arXiv:1703.08866, 2017.
72. Di Lin, Guangyong Chen, Daniel Cohen-Or, Pheng-Ann Heng, and Hui Huang.
Cascaded feature network for semantic segmentation of rgb-d images. In IEEE
International Conference on Computer Vision (ICCV), pages 1320–1328. IEEE, 2017.
73. Yuanzhouhan Cao, Chunhua Shen, and Heng Tao Shen. Exploiting depth from single
monocular images for object detection and semantic segmentation. IEEE Transactions
on Image Processing, 26(2): 836–846, 2017.
74. Min Bai, Wenjie Luo, Kaustav Kundu, and Raquel Urtasun. Exploiting semantic infor-
mation and deep matching for optical flow. In European Conference on Computer
Vision, pages 154–170. Springer, 2016.
75. Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with
an application to stereo vision. In Imaging Understanding Workshop, pages 121–130.
Vancouver, British Columbia, 1981.
76. Gunnar Farneback. Two-frame motion estimation based on polynomial expansion. In
Scandinavian Conference on Image Analysis, pages 363–370. Springer, 2003.
77. Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas,
Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet:
Learning optical flow with convolutional networks. In IEEE International Conference
on Computer Vision, pages 2758–2766, 2015.
78. Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and
Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 2462–2470,
2017.
79. Zhe Ren, Junchi Yan, Bingbing Ni, Bin Liu, Xiaokang Yang, and Hongyuan Zha.
Unsupervised deep learning for optical flow estimation. In Thirty-First AAAI
Conference on Artificial Intelligence, pp. 1495–1501, 2017.
80. Junhwa Hur and Stefan Roth. Joint optical flow and temporally consistent semantic
segmentation. In European Conference on Computer Vision, pages 163–177. Springer,
2016.
81. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-
scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
82. Mennatullah Siam, Heba Mahgoub, Mohamed Zahran, Senthil Yogamani, Martin
Jagersand, and Ahmad El-Sallab. Modnet: Moving object detection network with
motion and appearance for autonomous driving. arXiv preprint arXiv:1709.04821,
2017.
83. Suyog Dutt Jain, Bo Xiong, and Kristen Grauman. Fusionseg: Learning to combine
motion and appearance for fully automatic segmentation of generic objects in videos.
arXiv preprint arXiv:1701.05384, 2017.
84. Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for
action recognition in videos. In Advances in Neural Information Processing Systems,
pages 568–576, 2014.
85. Mennatullah Siam and Mohammed Elhelw. Enhanced target tracking in uav imagery
with pn learning and structural constraints. In Proceedings of the IEEE International
Conference on Computer Vision Workshops, pages 586–593, 2013.
86. Ahmed Salaheldin, Sara Maher, and Mohamed Helw. Robust real-time tracking with
diverse ensembles and random projections. In Proceedings of the IEEE International
Conference on Computer Vision Workshops, pages 112–120, 2013.
7 Aerial Imagery
Registration Using
Deep Learning for
UAV Geolocalization
Ahmed Nassar and Mohamed ElHelw
CONTENTS
7.1 Introduction .................................................................................................. 183
7.2 An Integrated Framework for UAV Localization......................................... 185
7.2.1 Framework Architecture................................................................... 185
7.2.2 Calibration ........................................................................................ 187
7.2.3 Registration and Detection Using Semantic Segmentation .............. 188
7.2.4 Registration by Shape Matching....................................................... 190
7.2.5 Detection by Semantic Segmentation............................................... 193
7.2.6 Registration of Sequence Frames ..................................................... 193
7.3 Results and Analysis..................................................................................... 194
7.3.1 Platform ............................................................................................ 194
7.3.2 Datasets............................................................................................. 194
7.3.3 Earth Observation Image Registration ............................................. 198
7.3.4 Semantic Segmentation ....................................................................200
7.3.5 Semantic Segmentation Registration Using Shape
Matching Results ..............................................................................205
7.4 Conclusion ....................................................................................................206
Bibliography ..........................................................................................................207
7.1 INTRODUCTION
The proliferation of drones, also known as unmanned aerial vehicles (UAVs), has
shifted from being used for military operations to being available for the public mar-
ket. This conception came through the advancement and availability of mature hard-
ware components such as Bluetooth, Wi-Fi, radio communication, accelerometers,
gyroscopes, high-performance processors with low consumption, and effcient bat-
tery technologies. This rise can be attributed to the increasing smartphone demand,
which led to the race of manufacturing better hardware components, which brought
benefts to other sectors such as UAVs [1]. The smartphone market has led to the
183
184 Deep Learning in Computer Vision
affordability of these elements since it reduced prices and the size of the hardware.
Now UAVs can beneft from all the different types of components and modules in
many various applications, such as photography, mapping, agriculture, surveillance,
rescue operations, and the recently famous delivery systems. The UAV applications
mentioned previously and the ones available in the current market are currently
being controlled by radio communications through a controller, a software that plans
the route, or by using the on-board camera.
Conventionally, UAV navigation has relied on inertial sensors, altitude meters, and
sometimes a global positioning system to feed commands to the motor controller.
The latter modality uses the Global Navigation Satellite System (GNSS), composed
of different regional systems such as GPS (owned by the United States), GLONASS,
Galileo, Beidou, and others located in Medium Earth Orbit for the purpose of
providing autonomous geo-spatial positioning [2]. This positioning is provided as
latitude, longitude, and altitude data. The terms "GPS" and "GNSS" are commonly
used interchangeably in the UAV navigation literature. The signal transmitted from
GPS satellites can be affected by distance and by ambient/background noise, and it is
especially vulnerable to interference and/or blockage in urban areas, where buildings
cause loss of line of sight (LOS) to navigational satellites. Moreover, some areas are
GPS denied, meaning that the GPS signal is unavailable or weak, and this necessitates
finding alternative approaches to UAV guidance.
A plethora of work has been proposed for UAV localization using inertial
measurement unit (IMU) readings and an on-board camera to acquire features that
are stored in a feature database. Scaramuzza et al. [3], Chowdhary et al. [4], and Rady
et al. [5] offer examples of these methods. During UAV flight, information in the
pre-acquired feature database is used to approximate the UAV location by matching
currently extracted features against those in the database. Similar work [6] uses
online mapping services and feature extraction to localize a UAV as well as to
classify terrain. Mantelli et al. [7] used abBRIEF for global localization and
estimated the UAV pose in 4 degrees of freedom.
Currently, the abundance of satellite and aerial high-resolution imagery [8] cov-
ering most of the globe has facilitated the emergence of new applications such as
autonomous vision-only UAV navigation. In this case, UAV images are compared
with aerial/satellite imagery to infer the drone location as shown in Figure 7.1.
However, processing these images is computationally demanding and even more
involved if data labeling is needed to process and extract actionable information. The
ability to correlate two images of the same location under different viewing angles,
illumination conditions, occlusions, object arrangements, and orientations is crucial
to enable robust autonomous UAV navigation. In traditional computer vision, image
entropy and edges are used as features and compared for similarity as presented by
Fan et al. [9] and Sim et al. [10]. Other works use local feature detectors [11], such
as Scale-Invariant Feature Transform (SIFT) [12] or Speeded-Up Robust Features
(SURF) [13], to produce features that are then matched, providing affine transfor-
mations as investigated by Koch et al. [14] and Shukla [15]. Using Simultaneous
Localization And Mapping (SLAM), UAV navigation is achieved by acquiring and
correlating distinct features of indoor and/or outdoor regions in order to map the
FIGURE 7.2 The left panel shows an Sn frame; the right panel shows its corresponding ROI
from Bing Maps in Famagusta, Cyprus.
their labels will be represented by L = (L(1), L(2), …, L(n)). In this chapter, the Earth
Observation (EO) image will be referred to as the reference map; its notation is R and
its label is M. The extents of r on the map and the GPS coordinates of its pixels are
known. Therefore, registering Sn onto r yields the GPS coordinates of Sn. Initially,
Sn and its GPS location coordinates on r are assumed to be known, since the person
or system deploying the UAV knows its initial location.
Two main inputs are fed to the localization framework. The first corresponds to
prerecorded top-down view footage S acquired from 1) a UAV flying over a specific
region and 2) a simulated overflight generated by using Google Earth. The second
is a high-resolution EO imagery map obtained from free public online services such
as Google Maps, Bing Maps, or existing datasets such as ISPRS’s Potsdam, which is
used for r as demonstrated in Figure 7.2. Consequently, the different components of
the framework deal with those two inputs, and the techniques used will be evaluated
on how they deal with these inputs to localize the aerial vehicle.
7.2.2 CALIBRATION
Calibration is an integral part of the framework and is invoked at S(1) and every
five frames (S(n+5)) thereafter. It is responsible for finding the initial location and the
difference in scale and orientation between Sn and its corresponding field of view
in r, and for autocorrecting any drift that might occur. To accomplish this task, SIFT
is used to make sure that the localization estimate is checked and updated at a fixed
interval. That is, under the condition of being the first frame, or the third sequence
frame, S and r are passed to the SIFT registration, which produces a homography
that is applied to r and also updates the drone location, as shown in Figure 7.4. SIFT
is robust; however, it is time consuming, and thus it is not used frequently, in order to
reduce computational overhead.
As stated before, knowing the initial GPS coordinates, a region of interest r
is cropped from R around the coordinate, given that r = R(i, ws+b), where i is the
coordinate at the center and the width ws+b is equal to the width of S plus a margin b.
This is done to limit the search for the features extracted from Sn in r. As seen in
Figure 7.5, r covers a wider region than the one in S. Furthermore, to get the pixel
location i from the coordinate, the conversion is calculated by Equations 7.1 and 7.2,
which map the GPS bounds of R to its pixel size, so that each pixel is associated with
its actual GPS location. This process reduces the search space and eliminates the
possibility of the matching going astray.
Where pix_x: pixel location on the x axis, pix_y: pixel location on the y axis,
height_min: minimum height of r (0), height_max: maximum height of r, width_min:
minimum width of r (0), width_max: maximum width of r, lon: current longitude
coordinate, lat: current latitude coordinate, lat_n: north latitude bound of r, lat_s:
south latitude bound of r, lon_e: east longitude bound of r, and lon_w: west longitude
bound of r.
FIGURE 7.4 Calibration using SIFT registration, where location estimation is checked and
updated every fixed period.
Using the search space created, features are extracted from both r and Sn.
Afterward, the features are matched and used to estimate a homography matrix,
which in turn is used to extract transformation components. The rotation and scale
components are maintained and are incremented or decremented depending on the
next frame or transformation. This helps calculate the orientation of the UAV from
the rotation, and its altitude from the scale relative to r.
FIGURE 7.5 Registration using SIFT for calibration. The circles represent extracted feature
key points, and the lines represent matches between them. The polygon represents the
boundaries of Sn (left) in r (right) after the transformations.
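As an illustration of this calibration step, the following OpenCV sketch shows how SIFT matching, homography estimation, and rotation/scale extraction could be implemented. It is a minimal example under assumed settings (grayscale inputs, a 0.75 ratio-test threshold), not the authors' implementation.

import cv2
import numpy as np

def calibrate(sn_gray, r_gray):
    # Detect and describe key points in the UAV frame (Sn) and the reference ROI (r)
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(sn_gray, None)
    kp2, des2 = sift.detectAndCompute(r_gray, None)
    # Match descriptors and keep good matches via Lowe's ratio test
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    # Estimate the homography relating Sn to r
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    # Extract approximate rotation (degrees) and scale from the homography
    rotation = np.degrees(np.arctan2(H[1, 0], H[0, 0]))
    scale = np.sqrt(H[0, 0] ** 2 + H[1, 0] ** 2)
    return H, rotation, scale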
To accomplish this, as seen in Figure 7.6, Sn and r are fed into a U-Net network
and the segmented outputs are subsequently matched.
The Segmentation Network: The task of segmenting the input images (r, S) provides
a classification of the input image pixels and is accomplished using a U-Net network,
an encoder-decoder model that follows the architecture of Ronneberger et al. [26].
The segmentation network is able to classify image regions into buildings, roads, and
vegetation. Due to the size and number of the images, each class was trained
separately, resulting in one model per class for each dataset.
Data Pre-processing: The images used to train the network, presented in
Section 7.3.2, each exceed 2000 × 2000 pixels. Therefore, the images are split into
smaller patches of 224 × 224 pixels, similar to the input sizes used by VGG [27] or
ResNet [28]. Simple normalization is applied to each patch by dividing the pixel
values by the highest value in their channel. After that, the patches are stitched back
to form the full image. The same procedure is applied to the ground-truth images (M, L).
FIGURE 7.6 Two U-Net networks take in Sn and r and provide segmented shapes.
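A minimal sketch of the patch-based pre-processing described above is given below, assuming NumPy arrays and non-overlapping tiles; the exact tiling and normalization details used by the authors may differ.

import numpy as np

def split_and_normalize(image, patch=224):
    # Cut the large image into non-overlapping patch x patch tiles and
    # normalize each tile by its per-channel maximum value.
    h, w, c = image.shape
    patches = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            tile = image[y:y + patch, x:x + patch].astype(np.float32)
            channel_max = tile.reshape(-1, c).max(axis=0)
            patches.append(tile / np.maximum(channel_max, 1e-6))
    return np.stack(patches)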
Dice score = 2TP / (2TP + FP + FN)  (7.5)
where TP: True Positive, FP: False Positive, and FN: False Negative.
As shown in Equation 7.5, the Dice score rewards true positives while penalizing
both the false positives and the positives that the method fails to find. Dice is similar
to the Jaccard index [33], which is another commonly used loss function for image
segmentation. However, Dice is currently favored because it is more intuitive than
Jaccard, which counts true positives only once in both the numerator and
denominator.
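For concreteness, a small NumPy sketch of Equation 7.5 on binary masks is shown below (an illustration only; a small epsilon is added to avoid division by zero).

import numpy as np

def dice_score(pred, target, eps=1e-7):
    # pred and target are binary masks (arrays of 0s and 1s) of equal shape
    tp = np.sum((pred == 1) & (target == 1))
    fp = np.sum((pred == 1) & (target == 0))
    fn = np.sum((pred == 0) & (target == 1))
    return (2.0 * tp) / (2.0 * tp + fp + fn + eps)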
Regularization and Normalization: To avoid overfitting, regularization was added
in the form of dropout layers after every convolutional layer, with a rate of 0.05. After
every dropout layer, a batch normalization layer was also added to normalize the
activations of each batch.
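The chapter does not state which framework was used; the following Keras-style sketch (an assumption) illustrates the convolution–dropout–batch-normalization pattern described above.

from tensorflow.keras import layers

def conv_block(x, filters):
    # Convolution followed by dropout (rate 0.05) and batch normalization,
    # mirroring the regularization scheme described in the text.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Dropout(0.05)(x)
    x = layers.BatchNormalization()(x)
    return x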
Network Initialization Using Transfer Learning: Training from scratch is
computationally expensive in terms of processing and time. Transfer learning is
relied upon to reduce training time, especially when datasets are small. In fact, it is
difficult to assemble a dataset capable of catering to truly deep networks. Models
trained on large datasets, such as ImageNet [27], contain generic weights that can be
reused in different tasks. The pre-trained model is trained using a random image
generator that also creates the ground-truth for those images.
FIGURE 7.7 Buildings segmented using U-Net (the left panel is an EO image, in the mid-
dle panel is the result of segmentation, and the right panel is an EO image overlaid with
segmentation).
FIGURE 7.8 The segmented images are processed to find properties such as area, distance,
location, and image moments, which act as features for each shape or object. These features
are matched to create a homography.
FIGURE 7.9 The left panel shows segmented Sn images, and the right panel shows segmented
r images. Contours are computed around each segmented region, and their area and radius are
calculated.
Afterwards, the matches with the highest scores are chosen, as presented in
Figure 7.10. However, matches are eliminated if their scores are too close and/or the
objects are within a short distance of each other, since such a match between two
objects of the same size at a similar distance is unreliable. For example, a grid of
identical-looking buildings would confuse the matching. Finally, as in the calibration
step, the top matches found to be reliable (selected using a threshold) are used to
calculate another homography. This homography, estimated from the semantic
segmentation, is applied to calculate the current location of the UAV accurately.
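The following OpenCV sketch illustrates how such per-shape features (area, centroid, enclosing radius, and moments) could be extracted from a binary segmentation mask before matching; it is an illustrative example rather than the authors' implementation.

import cv2

def shape_features(mask):
    # mask: binary segmentation output (uint8, object pixels set to 255)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    features = []
    for contour in contours:
        area = cv2.contourArea(contour)
        (cx, cy), radius = cv2.minEnclosingCircle(contour)
        hu = cv2.HuMoments(cv2.moments(contour)).flatten()
        features.append({"area": area, "center": (cx, cy), "radius": radius, "hu": hu})
    return features

Shapes from Sn and r whose feature vectors are most similar would then provide the point correspondences passed to the homography estimation.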
FIGURE 7.11 Two consecutive images from the UAV footage are taken and a homography
is estimated from them to update the current UAV location.
7.3.2 DATASETS
Due to the lack of established datasets for benchmarking camera-based UAV path
navigation, datasets had to be collected and ground-truth manually created in order
to produce performance metrics. The datasets used in this project can be split into
two categories: aerial datasets and satellite imagery datasets. Multiple sources are
used to cater to different views (rooftops, roads, color). A dataset was also created
* https://www.ubuntu.com
† https://www.nvidia.com/en-us/autonomous-machines/embedded-systems-dev-kits-modules/
‡ https://www.geforce.com/hardware/technology/cuda
§ https://developer.nvidia.com/cudnn
¶ http://cluster-irisa.univ-ubs.fr/wiki/doku.php/accueil
using images obtained from online mapping services for different locations and their
ground-truth created by hand using OpenStreetMap (OSM)*.
ISPRS WG II/4 Potsdam: This dataset [20] is provided by the ISPRS Working
Group II/4 for the sole purpose of detecting objects and aiding in 3D reconstruc-
tion in urban scenes. The dataset covers two cities: Vaihingen (Germany), a small
village with small buildings, and Potsdam (Germany), a dense city with big blocks
of large buildings and narrow streets. In this work, we focus on the Potsdam data-
set because it contains RGB information and the images reflect a state-of-the-art
airborne image dataset in a very high resolution. Figure 7.13 shows samples of the
dataset.
The datasets come with ground-truth or label images to classify six classes:
Impervious Surfaces, Buildings, Low Vegetation, Trees, Cars, and Clutter/
Background. The Potsdam dataset is made of 38 patches of the same size trimmed
from a larger mosaic and a Digital Surface Model (DSM). The resolution of both
data types is 5 cm. The images come in different channel compositions: IRRG (3
channels IR-R-G), RGB (3 channels R-G-B), and RGBIR (4 channels R-G-B-IR).
This dataset has proven very beneficial throughout this work due to semantically
segmented labels and high resolution.
Potsdam Extended: In this work, Google Satellite Images (GSI) are used and
thus images from GSI with the same extents and zoom level for 12 of the original
ISPRS dataset locations were obtained. Subsequently, the ground-truth from ISPRS
is used and updated if there is a difference between recent GSI and ISPRS. Along
FIGURE 7.13 A sample of the tiles available from the ISPRS Potsdam 2D Semantic
Labeling Challenge.
* https://www.openstreetmap.org/
with the corresponding ground-truth from ISPRS, several tiles were generated from
areas outside the ISPRS dataset and ground-truth produced using OSM as shown in
Figure 7.13. The difference between both ISPRS datasets and Google Satellite Image
can be seen in Figures 7.14 and 7.15 (Table 7.1).
UAV Videos: One of the major difficulties faced in this work was the unavailability
of multiple data sources for one location, especially top-down view video from a
UAV. To address this problem, previously acquired UAV footage from the Internet
was used.
The first video used was shot in Famagusta (Cyprus), acquired with a top-down view,
and was available on YouTube under the name "Above Varosha: one of the most
famous ghost cities" [37]. The UAV motion in the video was smooth, gliding
seamlessly with no vibrations or sudden changes in direction or camera orientation
(Figure 7.16). Figure 7.17 shows frame samples from the video. For every 32 frames,
the ground-truth position was manually computed as GPS coordinates by comparison
with the corresponding GSI map. The excerpt of the video that was usable was 43
seconds long.
Having high-resolution aerial imagery of Potsdam from ISPRS inspired the idea
of simulating UAV footage of this area. This was implemented by using Google
Earth 3D and navigating a camera with a top-down view. The corresponding video
footage was captured while gliding over a region in Potsdam. The captured video
length is 2 minutes.
FIGURE 7.14 A sample of the Potsdam Extended dataset. In the top row, the left panel is the
Google Satellite Image corresponding to an ISPRS tile (reusing its ground-truth), while the
middle and right panels are tiles produced from other areas in Potsdam. The bottom row
shows the OSM-derived ground-truth.
FIGURE 7.15 A comparison between ISPRS Potsdam (left panel) and a Google Satellite
image (right panel).
TABLE 7.1
Comparison of the Different Datasets
Dataset name Resolution (cm) Area (sq. km) Classes
Potsdam 5 2.16 6
Potsdam Extended 9.1 1.24 3
Famagusta 19 1.13 3
FIGURE 7.16 A sample of frames acquired from the Famagusta UAV video.
FIGURE 7.17 A sample of frames acquired from the Potsdam UAV video.
1. Manually, another path is created from the center points of the UAV
sequence frames, which provides our ground-truth.
2. Features are extracted from Sn and r.
3. If 70% of the matches found have a distance of 0.2 or less, then a homography
is created.
4. The homography is applied to the reference map ROI, which results in a
warped image.
5. A GPS coordinate is estimated for the center pixel of the warped frame
using the method explained in Section 7.2.2 (Equations 7.1 and 7.2).
6. The distance between the estimated GPS coordinate and the ground-truth is
calculated using the Haversine distance formula (a minimal sketch of this
computation follows the list), as illustrated by Figure 7.18.
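The Haversine distance mentioned in step 6 can be computed as follows; this is a standard formulation (the mean Earth radius constant is an assumption), not code taken from the chapter.

from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance between two GPS coordinates, in meters
    earth_radius = 6371000.0  # mean Earth radius (m), assumed value
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius * asin(sqrt(a))

# Example with the coordinates given in Figure 7.18:
# haversine_m(52.404758, 13.053362, 52.404737, 13.053338)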
1. A path is created from the center points of the UAV sequence frames to set
up the ground-truth.
2. Features are then extracted from both consecutive frames passed from the
sequence.
FIGURE 7.18 The left panel is Sn, and the right panel is the estimated r. The actual UAV
GPS coordinate is {52.404758, 13.053362}, while the estimated GPS coordinate is {52.404737,
13.053338}.
Local Feature Registration Results: The output of the pipeline, excluding the
semantic segmentation component, performs relatively well overall. As shown in
Table 7.2, the estimated GPS coordinates deviate 8.3 m from the actual flight path. As
expected, the SIFT calibration step helps the system to "catch up" when it is lagging
due to differences in scale (altitude) or photogrammetric differences. Without the
SIFT calibration step, longer videos or flights continue to drift until system failure or
until the system is no longer capable of comparing or estimating the location.
Qualitatively, as seen in Figures 7.19 and 7.20, using this method alone the deviation
does not seem significant to the eye at that height (300 m). Also, it is apparent in
TABLE 7.2
Deviation in Meters Comparing the Two Approaches
Dataset Without local features With local features
Famagusta 26 m 10.4 m
Potsdam 52.1 m† 8.3 m
† Only 60% of the Potsdam sequence was used to produce this result, before
localization drifted away and the system was no longer capable of finding an
accurate position.
FIGURE 7.19 The actual path of the UAV and the estimated one visualized on a map using
their GPS coordinates for the Famagusta Sequence.
FIGURE 7.20 The actual path of the UAV and the estimated one visualized on a map using
their GPS coordinates for the Potsdam Sequence. The intersecting rounded line represents
where the experiment ended due to accumulated drift.
several cases that the pipeline performs well if the buildings are exactly perpendicu-
lar from a top-down view and their facades are not visible.
use to navigate by comparing two aerial or satellite images, trying to find matching
structures to correct its position. The second reason is to make more information
available to the UAV, hence making it aware of the types of objects it is flying over,
which could be beneficial in route planning or decision making. This component
builds upon the previous steps to increase the accuracy of localization and to reduce
the deviation or drift of the UAV from its planned trajectory. In this section, the setup
of the experiment is presented with all the different components and choices made
for the network, including the training process. Afterward, the metrics chosen for
evaluation are presented, along with a discussion of the results and the outcome of
the experiments.
The Semantic Segmentation Experiment: As a result of hardware constraints,
several models were pre-trained for this experiment. Each region (Famagusta,
Potsdam) had its own model, and each class (buildings, roads, etc.) was trained
separately using U-Net (see Section 7.2.3). The pre-trained model was trained with a
random image generator [38] that was used to initially start training the different
models. The data was split into 224 × 224 pixel patches; the patches were then
augmented using vertical and horizontal flips and random rotations and fed to the
U-Net network.
Training: Many configurations and runs were performed for each model with
different hyper-parameters. Training was stopped as soon as the average F1-score on
the validation dataset stopped improving. The average number of epochs trained for
each model was 6, and each epoch took almost 3.5 hours to train.
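In Keras terms (the framework is an assumption, as is the name of the monitored metric), this early-stopping criterion could be expressed as follows.

from tensorflow.keras.callbacks import EarlyStopping

# Stop as soon as the monitored validation metric stops improving;
# 'val_f1' assumes a custom F1 metric has been attached to the model.
early_stop = EarlyStopping(monitor="val_f1", mode="max", patience=0,
                           restore_best_weights=True)
# model.fit(train_data, validation_data=val_data, epochs=20, callbacks=[early_stop])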
Metrics: To evaluate the semantic segmentation method, the average F1-score is
used as an error metric for each trained class. Precision measures how many of the
positive predictions are in fact positive. Recall measures how many of the actual
positives are predicted correctly. The F1-score is the harmonic mean of precision
and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)  (7.6)
Precision = TP / (TP + FP)  (7.7)
Recall = TP / (TP + FN)  (7.8)
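A minimal sketch of Equations 7.6 through 7.8, computed from raw true-positive, false-positive, and false-negative counts (an illustration only, with an epsilon to avoid division by zero):

def precision_recall_f1(tp, fp, fn, eps=1e-7):
    # Implements Equations 7.7, 7.8, and 7.6 in that order
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1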
Semantic Segmentation Results: The experiment was run over two regions,
Potsdam and Famagusta. The performance of this step was evaluated experimen-
tally using two ground-truth values since each city has its UAV image sequence
and its satellite or aerial reference map as presented in Section 7.3.2. The perfor-
mance of the semantic segmentation in our pipeline was evaluated in terms of
localization.
To evaluate the model fairly, the ISPRS Potsdam 2D Semantic Labeling Challenge
was used as a benchmark to compare how our model performs against other work.
However, this work was compared only against work that uses RGB images alone in
training and does not utilize DSM (Digital Surface Models) (Table 7.3).
Experiment I was trained purely on the ISPRS Potsdam dataset using 19 images
for training and 5 for validation. The purpose of this experiment was to find out how
well U-Net performs in comparison to other work based on the benchmark scores.
The experiment was run only for the building and road classes, which are the most
important landmarks for this region since it is an urban area. In general, the
Experiment I model performed better across the 2 classes and resulted in an average
score of 84.7%. Unfortunately, the building class score lagged behind, but this
experiment was the basis on which the other experiments were built (Table 7.4).
After testing the model generated in Experiment I on the UAV sequence with
source images from GSI, the results were not satisfactory. Experiment II was
therefore trained on the Potsdam Extended dataset (Section 7.3.2) using the
Experiment I model for pre-training. The first convolutional layer was frozen while
training this experiment, since the first layer captures basic shapes and edges. The
scores were satisfactory when tested, and the predictions provided sharp edges but
hollow shapes, as shown in Figure 7.21.
TABLE 7.3
Comparison to Other Work Using ISPRS
Potsdam 2D Semantic Labeling
Challenge (RGB Images Only)*
Method Average Building Roads
TABLE 7.4
Semantic Segmentation Results Using Our Models*
Method Average Building Roads Vegetation
FIGURE 7.21 A sample of the predictions made using Experiment II. The left column
shows the target images, the middle column shows the predictions, and the right column is
the combination of the previous columns. These target images are Potsdam's (first 3 rows) and
Famagusta's (last row) S. Buildings are labeled cyan, and roads are labeled red.
Experiment III was carried out to validate whether freezing Experiment II's first
convolutional layer was a good idea. Therefore, in this experiment the first
convolutional layer was unfrozen. The Experiment III model provided fuller shapes
but with less accurate edges in comparison to Experiment II, as shown in Figure 7.22.
However, this model provided the highest score. To demonstrate the procedure of
training only from the online map services using OSM as ground-truth, Experiment IV was
FIGURE 7.22 A sample of the predictions made using Experiment III. The left column
is the target image, the middle column is the prediction, and the right column is the pre-
diction overlaying the target image. These target images are Potsdam’s S. Buildings are
labeled in cyan.
arranged. As expected, the results were behind Experiments II and III by nearly
10%. This shows that high-resolution data with pixel-wise accurate ground-truth
definitely had an effect (Table 7.5).
In general, our task was not to come up with a state-of-the-art segmentation
method, but to choose the best segmentation method for the proposed framework.
Although Experiment III provided the highest score, in practice Experiment II was
the model used due to its sharper shapes, whereas the hollow shapes were remedied
using morphological operations. Qualitatively, buildings that are close to each
other are challenging to segment. There is also a discrepancy in segmenting buildings
from sidewalks, which are sometimes considered part of the building, resulting
in added difficulty for the matching process during the registration step. Another
TABLE 7.5
The Specifications That Trained Our Network, after Trying
Many Variations in Terms of Filter Size or Optimizers
Method Filter size Batch size LR Optimizer Epochs
point to consider when using OSM as ground-truth is that OSM labels treat trees
and vegetation as one class. This inconsistency or misclassification between
vegetation and tree pixels was accounted for by merging the two classes. However,
for our application the difference between trees and vegetation is not that vital. The
result of the segmentation step usually contains inaccuracies that have to be remedied
using morphological techniques such as dilation and erosion, which are applied to
fill empty hollows in the shapes and remove small regions.
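In OpenCV terms, this clean-up step corresponds to morphological closing followed by opening, as sketched below (the kernel size is an assumption, not a value stated in the chapter).

import cv2
import numpy as np

def clean_mask(mask, ksize=5):
    # Closing (dilation then erosion) fills small hollows inside segmented shapes;
    # opening (erosion then dilation) removes small spurious regions.
    kernel = np.ones((ksize, ksize), np.uint8)
    closed = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return cv2.morphologyEx(closed, cv2.MORPH_OPEN, kernel)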
TABLE 7.6
Deviation in Meters with the Addition of
Semantic Segmentation to the Pipeline
Dataset Without local features With local features With semantic segmentation
FIGURE 7.23 This graph shows the deviation from the actual GPS coordinate in meters for
every time the calibration function is called for the Famagusta Sequence.
FIGURE 7.24 This graph shows the deviation from the actual GPS coordinate in meters for
every time the calibration function is called for the Potsdam Sequence.
FIGURE 7.25 The ground-truth path of the UAV and the estimated path visualized on a
map using their GPS coordinates using registration by shape matching in Famagusta.
it has been observed that in Potsdam, building blocks are tightly spaced, so when
segmented they create huge blobs that were very difficult to match at a low altitude.
Better localization would be achieved in less dense areas or with finer segmentation.
In this section, the feasibility of our framework was investigated. The requirements
of the system were first laid out and the framework was introduced. The process of
acquiring datasets, their specifications, and the creation of their ground-truths were
presented, along with justification for picking specific datasets. Afterward, the main
registration components were broken down and evaluated, focusing on their
effectiveness within the framework for localizing the UAV separately.
7.4 CONCLUSION
The essence of this work is to demonstrate the capability of using computer vision to
geolocalize a UAV without using any inertial sensors or GPS but using only an on-board
camera. The proposed method is then demonstrated in its entirety and the process from
start to end is explained in an orderly fashion. Each component is broken down into
FIGURE 7.26 The ground-truth path of the UAV and the estimated path visualized on a
map using their GPS coordinates using registration by shape matching in Potsdam.
its subcomponents, and its purpose is clarified and evaluated. The proposed method
is advantageous over other methods that use only local feature detectors to match the
UAV image to an offline reference map, even when searching within a window, as the
experiments have shown. We believe that segmenting and matching shapes helps
avoid matching the wrong features in areas that look similar or have an urban pattern.
However, a major disadvantage of the method concerns areas with no recognizable
landmarks or features, though this is a common disadvantage of all methods that rely
solely on vision. Future work is to experiment with larger datasets covering multiple
cities with different visual features, which would increase the robustness of the
proposed method. Another direction is to use a trained neural network to compare
the shapes of the segmented objects, replacing the demanding dictionary search and
heuristics employed by the current method.
BIBLIOGRAPHY
1. K. D. Atherton, "Senate hearing: Drones are "basically flying smartphones"."
https://www.popsci.com/technology/article/2013–03/how-drone-smartphone, March 2013.
2. B. Hofmann-Wellenhof, H. Lichtenegger, and E. Wasle, GNSS-Global Navigation
Satellite Systems: GPS, GLONASS, Galileo, and More. Springer Science & Business
Media, 2007.
CONTENTS
8.1 Introduction .................................................................................................. 211
8.2 Requirements in Robot Vision...................................................................... 212
8.3 Robot Vision Methods .................................................................................. 213
8.3.1 Object Detection and Categorization................................................ 213
8.3.2 Object Grasping and Manipulation .................................................. 216
8.3.3 Scene Representation and Classification .......................................... 218
8.3.4 Spatiotemporal Vision ...................................................................... 221
8.4 Future Challenges ......................................................................................... 222
8.5 Conclusions...................................................................................................224
Acknowledgments.................................................................................................. 225
Bibliography .......................................................................................................... 225
8.1 INTRODUCTION
Deep learning is a hot topic in the pattern recognition, machine learning, and com-
puter vision research communities. This fact can be seen clearly in the large number
of reviews, surveys, special issues, and workshops that are being organized, and
the special sessions in conferences that address this topic (e.g., [1–9]). Indeed, the
explosive growth of computational power and training datasets, as well as technical
improvements in the training of neural networks, has allowed a paradigm shift in
pattern recognition, from using hand-crafted features (e.g., Histograms of Oriented
Gradients, Scale Invariant Feature Transform, and Local Binary Patterns) [10]
together with statistical classifiers to the use of data-driven representations, in which
features and classifiers are learned together. Thus, the success of this new paradigm
is that "it requires very little engineering by hand" [4], because most of the parameters
are learned from the data using general-purpose learning procedures. Moreover,
the existence of public repositories with source code and parameters of trained deep
neural networks (DNNs), as well as that of specific deep learning frameworks/tools
such as TensorFlow [11], has promoted the use of deep learning methods.
The use of the deep learning paradigm has facilitated addressing several com-
puter vision problems in a more successful way than with traditional approaches.
In fact, in several computer vision benchmarks, such as the ones addressing image
classification, object detection and recognition, semantic segmentation, and action
recognition, just to name a few, most of the competitive methods are now based on
the use of deep neural networks. In addition, most recent presentations at the flagship
conferences in this area use methods that incorporate deep learning.
Deep learning has already attracted the attention of the robot vision community.
However, given that new methods and algorithms are usually developed within the
computer vision community and then transferred to the robot vision one, the ques-
tion is whether or not new deep learning solutions to computer vision and recog-
nition problems can be directly transferred to robotics applications. This transfer
is not straightforward considering the multiple requirements of current solutions
based on deep neural networks in terms of memory and computational resources,
which in many cases include the use of GPUs. Furthermore, following [12], it must
be considered that robot vision applications have different requirements from
standard computer vision applications, such as real-time operation with limited
on-board computational resources, and the constraining observational conditions
derived from the robot geometry, limited camera resolution, and sensor/object
relative pose.
Currently, there are several reviews and surveys related to deep learning [1–7].
However, they are not focused on robotic vision applications, and they do not con-
sider their specific requirements. In this context, the main motivation of this chapter
is to address the use of this new family of algorithms in robotics applications. This
chapter is intended to be a guide for developers of robot vision systems. Therefore,
the focus is on the practical aspects of the use of deep neural networks rather than on
theoretical issues. It is also important to note that as convolutional neural networks
(CNNs) [7–13] are the most commonly used deep neural networks in vision applica-
tions, the analysis provided in this chapter will be focused on them.
This chapter is organized as follows: In Section 2, requirements for robot vision
applications are discussed. In Section 3, relevant papers related to robot vision are
described and analyzed. In Section 4, current and future challenges of deep learning
in robotic applications are discussed. Finally, in Section 5, conclusions related to
current robot vision systems and their usability are drawn.
real time and be responsive; and (iii) several processes alongside robot vision must
be executed in parallel. These conditions are needed for a robot to behave and to
complete tasks autonomously in the real world, as these tasks commonly involve
time constraints to be completed successfully. Thus, the computational workload
is a crucial factor that must be considered when designing solutions for robot
vision tasks.
In the following sections, recent advances based on deep learning related to robot
vision and future challenges are described. As stated before, deep neural networks
require a considerable amount of computational power, which currently limits their
usability in robotics applications. However, the development of new processing plat-
forms for massive parallel computing with low power consumption will enable the
adoption of a wide variety of deep learning algorithms in robotic applications.
proposals. Also, smaller DNN architectures can be used when dealing with a limited
number of object categories.
It is worth mentioning that most of the reported studies do not report the frame rate achieved for full object detection/categorization, or they report frame rates that are far from real time. In generic object detection methods, computation of
proposals using methods like Selective Search or EdgeBoxes takes most of the time
[14]. Systems like Faster R-CNN [15] that compute proposals using CNNs are able
to obtain higher frame rates, but require the use of high-end GPUs to work (see
Figure 8.1). Also, methods derived from YOLO (You Only Look Once) [16] offer better runtimes, but they impose limits on the number of objects that can be detected and have slightly lower performance. The use of task-specific, knowledge-based detectors on depth [17–20], motion [21], color segmentation [22], or weak object detectors [23] can be useful for generating fewer, faster proposals, which is the key to achieving high frame rates on CPU-based systems. Methods based on fully
convolutional networks (FCNs) cannot achieve high frame rates on robotic platforms
with low processing capabilities (no GPU available) because they process images
with larger resolutions than normal CNNs. Thus, FCNs cannot be used trivially for
real-time robotics on these kinds of platforms.
First, the use of complementary information sources for segmenting the objects
will be analyzed. Robotic platforms usually have range sensors, such as Kinects/
Asus or LIDAR (Light Detection and Ranging) sensors, that are able to extract depth
information that can be used for boosting the object segmentation. Methods that use
RGBD data for detecting objects include those presented in [17–19, 24, 25], while
methods that use LIDAR data include those of [20, 26]. These methods are able
to generate proposals by using tabletop segmentation or edge/gradient information.
FIGURE 8.1 Faster-RCNN is an object detector that uses a region proposal network and a classification network that share the first convolutional layers [15].
Also, they generate colorized images from depth for use in a standard CNN pipeline
for object recognition. These studies use CNNs and custom depth-based proposals, and thus their speed is limited by the CNN model. For instance, in [27], a system is described that runs at 405 fps on a Titan X GPU, 67 fps on a Tegra X1 GPU, and 62 fps on a Core i7 6700K CPU. It must be noted that while the Titan X GPU and Core
i7 CPU processors are designed for desktop computers, Tegra X1 is a mobile GPU
processor for embedded systems aimed at low power consumption, and so it can be
used in robotic platforms.
Second, two applications that use specifc detectors for generating the object pro-
posals will be presented: pedestrian detection, and detection of objects in robotics
soccer. Detecting pedestrians in real time with high reliability is an important ability
in many robotics applications, especially for autonomous driving. Large differences
in illumination and variable, cluttered backgrounds are hard issues to be addressed.
Methods that use CNN-based methods for pedestrian detection include [23, 28–34].
Person detectors such as LDCF [23] or ACF [35] are used for generating pedestrian
proposals in some methods. As indicated in [23], the system based on AlexNet [36]
requires only 3 msec for processing a region, and proposal computation runs at 2 fps when using LDCF, and at 21.7 fps when using ACF, on an NVIDIA GTX980 GPU with images of size 640 × 480. The lightweight pedestrian detector is able to run at 2.4 fps on an NVIDIA Jetson TK1, and beats classical methods by a large
margin [23]. A second approach is to use an FCN for detecting parts of persons [37].
This method is tested on an NVIDIA Titan X GPU, and it is able to run at 4 fps when
processing images of size 300 × 300. The KITTI pedestrian benchmark is used by state-of-the-art methods. Current leading methods in the KITTI benchmark whose algorithms are described are shown in Table 8.1 (many of the best-performing entries do not describe their algorithms). Note that, as the objective of the benchmark is to
evaluate accuracy, most of the methods cannot run in real time (~30 fps). This is an
aspect that needs further research.
TABLE 8.1
Selected Methods for Pedestrian Detection from the KITTI Benchmark*
Method                          Moderate  Easy    Hard    Runtime  Computing environment
RRC [30] (code available)       75.33%    84.14%  70.39%  3.6 s    GPU @ 2.5 GHz (Python + C/C++)
MS-CNN [31] (code available)    73.62%    83.70%  68.28%  0.4 s    GPU @ 2.5 GHz (C/C++)
SubCNN [32]                     71.34%    83.17%  66.36%  2 s      GPU @ 3.5 GHz (Python + C/C++)
IVA [33] (code available)       70.63%    83.03%  64.68%  0.4 s    GPU @ 2.5 GHz (C/C++)
SDP+RPN [34]                    70.20%    79.98%  64.84%  0.4 s    GPU @ 2.5 GHz (Python + C/C++)
* Note that methods not describing their algorithms are not included.
command. The system computes the probability that a given motor command will
produce a successful grasp, conditioned on the image, which enables the system to
control the robot. The network is composed of seven convolutional layers, whose output is concatenated with the motor command. The concatenated representation is further processed by an additional eight convolutional layers and two fully connected layers. The system achieves effective real-time control; the robot can grasp novel objects successfully and correct mistakes by continuous servoing. It also lowers the failure rate from 35% (hand-designed baseline) to 20%. However, learned eye-hand
coordination depends on the particular robot used for training the system. The
computing hardware used in this work is not fully described. It must be noted that
these kinds of solutions are able to run successfully in real time using GPUs.
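The architecture just described can be made concrete with a minimal PyTorch sketch. This is only an illustrative stand-in, not the network of [45]: the layer counts and widths, the 5-dimensional motor command, the spatial tiling of that command, and the 64 × 64 input resolution are all assumptions.

import torch
import torch.nn as nn

class GraspSuccessNet(nn.Module):
    """Predicts P(successful grasp | image, motor command); illustrative sketch only."""
    def __init__(self, command_dim=5):
        super().__init__()
        # Convolutional trunk that processes the camera image.
        self.image_trunk = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Further convolutions applied after the motor command is merged in.
        self.joint_trunk = nn.Sequential(
            nn.Conv2d(32 + command_dim, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),   # grasp success probability
        )

    def forward(self, image, command):
        feat = self.image_trunk(image)                     # B x 32 x H' x W'
        b, _, h, w = feat.shape
        cmd = command.view(b, -1, 1, 1).expand(b, command.shape[1], h, w)
        feat = torch.cat([feat, cmd], dim=1)               # tile the command spatially
        return self.head(self.joint_trunk(feat))

# Example: score a candidate motor command for a 64 x 64 camera image.
net = GraspSuccessNet()
p = net(torch.rand(1, 3, 64, 64), torch.rand(1, 5))

In such a setup, control amounts to repeatedly scoring candidate motor commands and executing a high-scoring one, which is what makes continuous servoing possible.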
Systems able to infer manipulation trajectories for object-task pairs using DNNs
have been developed recently. This task is difficult because generalization over dif-
ferent objects and manipulators is needed. Proposed solutions include [46, 47]. These
systems are able to learn from manipulation demonstrations and generalize over new
objects. In Robobarista [46], the input of the system is composed of a point cloud, a trajectory, and a natural language instruction, which are mapped into a joint embedding space. Once trained, the trajectory with the highest similarity to a new point cloud and natural
language instruction is selected. Computational hardware (GPU) and runtime are not
reported in this work. In [47], the proposed system is able to learn new tasks by using
image-based reinforcement learning. That system is composed of an encoder and a decoder. The encoder consists of three convolutional layers and two fully con-
nected layers. Computational hardware and runtime are not reported in this work.
Visual servoing systems have also benefited from the use of deep learning tech-
niques. In Saxena et al. [48], a system that is able to perform image-based visual
servoing by feeding raw data into convolutional networks is presented. The system
does not require the 3D geometry of the scene or the intrinsic parameters from the
camera. The CNN takes a pair of monocular images as input, representing the cur-
rent and desired pose, and processes them by using four convolutional layers and a
fully connected layer. The system is trained for computing the transformation in 3D
space (position and orientation) needed for moving the camera from its current pose
to the desired one. The estimated relative motion is used for generating linear and
angular speed commands for moving the camera to the desired pose. The system is
trained on synthetic data and tested both in simulated environments and by using a
real quadrotor. The system takes 20 msec for processing each frame when using an
NVIDIA Titan X GPU, and 65 msec when using a Quadro M2000 GPU. A Wi-Fi connection was used for transferring data between the drone and a host PC.
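As a small illustration of the final step described above (turning the estimated relative motion into speed commands), a simple proportional law could look like the following sketch; the gains and the pose representation are assumptions and are not taken from [48].

import numpy as np

def servo_command(delta_position, delta_orientation,
                  linear_gain=0.5, angular_gain=0.5):
    """Map an estimated relative camera pose to velocity commands.

    delta_position:    (3,) translation from current to desired pose [m]
    delta_orientation: (3,) rotation from current to desired pose [rad]
    Returns proportional (linear_velocity, angular_velocity) commands.
    """
    linear_velocity = linear_gain * np.asarray(delta_position, dtype=float)
    angular_velocity = angular_gain * np.asarray(delta_orientation, dtype=float)
    return linear_velocity, angular_velocity

# Example: the network estimated a 0.2 m forward motion and a 0.1 rad yaw change.
v, w = servo_command([0.2, 0.0, 0.0], [0.0, 0.0, 0.1])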
Multimodal data delivers valuable information for object grasping. In [49], a hap-
tic object classifier that is fed by visual and haptic data is proposed. The outputs of a
visual CNN and a haptic CNN are fused by a fusion layer, followed by a loss function
for haptic adjectives. The visual CNN is based on the (5a) layer of GoogleNet, while
the haptic CNN is based on three one-dimensional convolutional layers followed by
a fully connected layer. Examples of haptic adjectives are “absorbent,” “bumpy,” and
“compressible.” The trained system is able to estimate haptic adjectives of objects
by using only visual data and it is shown to improve over systems based on classical
features. Computing hardware and runtime of the method are not reported.
TABLE 8.2
Datasets Used for Scene Classification
Dataset Results
Studies addressing the semantic segmentation of scenes include [26, 54, 62–70]. Also, multimodal information is used in
[52, 71, 72]. These studies are useful for tasks that require pixel-wise precision like
road segmentation and other tasks in autonomous driving. However, runtimes of
methods based on semantic segmentation are usually not reported. PASCAL VOC
and Cityscapes are two benchmarks used for semantic segmentation. While images
from PASCAL VOC are general-purpose, Cityscapes is aimed at autonomous driv-
ing. Best-performing methods in both databases are reported in Tables 8.3 and 8.4.
It can be noted that DeepLabv3 [64] and PSPNet [65] (see Figure 8.2) achieve good average precision on both benchmarks, and code for [65] is available. Moreover, the increasing average precision has a strong impact on autonomous driving, as this application requires reliable segmentation because of the risks involved.
Visual navigation using deep learning techniques is an active research topic since
new methodologies are able to solve navigation directly, without the need for detect-
ing objects or roads. Examples of studies related to visual navigation are [74–78].
These methods are able to perform navigation directly over raw images captured by
RGB cameras. For instance, the forest trail follower in [74] (see Figure 8.3) has an
architecture composed of seven convolutional layers and one fully connected layer. It
is able to run at 15 fps on an Odroid-U3 platform, enabling it to be used in a real UAV.
TABLE 8.3
PASCAL VOC Segmentation Benchmark*
Name AP Runtime (msec/frame)
* Only entries with reported methods are considered. Repeated entries are not
considered.
TABLE 8.4
Cityscapes Benchmark*
Name IoU class Runtime (ms/frame)
Deeplabv3 [64] 81.3 n/a
PSPNet [65] 81.2 n/a
ResNet-38 [66] 80.6 6 msec (minibatch size 10) @ GTX 980 GPU
TuSimple_Coarse [67] (code available) 80.1 n/a
SAC-multiple [68] 78.2 n/a
* Only entries with reported methods are considered. Repeated entries are not considered.
FIGURE 8.2 Pyramid Scene Parsing Network (PSPNet) is a scene-parsing algorithm based
on the use of convolutional layers, followed by upsampling and concatenation layers [65].
FIGURE 8.3 A deep neural network is used to decide one of three possible actions (turn
left, go straight, turn right) from a single image. This system is used as a guide in an autonomous quadrotor [74].
In [79], a visual navigation system for Nao robots based on deep reinforcement learn-
ing is proposed. The system consists of two neural networks, an actor and a critic, that are trained using Deep Deterministic Policy Gradients. Both networks are composed of convolutional, fully connected, and Long Short-Term Memory (LSTM) layers. Once trained, the system is able to run in real time on the Nao internal CPU.
The use of CNN methodologies for estimating 3D geometry is another active
research topic in robotics. Studies related to 3D geometry estimation from RGB
images include [80–82]. Also, depth information is added in [71, 72]. These methods
provide a new approach to dealing with 3D geometry, and they are shown to outperform classical structure-from-motion approaches, as they are able to infer depth even from a single image.
A functional area is a zone in the real world that can be manipulated by a
robot. Functional areas are classified by the kind of manipulation action the robot can perform on them. In [83], a system for localizing and recognizing functional areas by using deep learning is presented. The system enables an autonomous robot to have a functional understanding of an arbitrary indoor scene. An ontology with 11 possible categories is defined. Functional areas are detected
by using selective-search region proposals, and then by applying a VGG-based
TABLE 8.5
Current Results on the UCF-101 Video Dataset for Action
Recognition [91]
Name Accuracy Runtime (msec/frame)
C3D (1 net) + linear SVM [90] 82.3% 3.2 msec/frame @ Tesla K40 GPU
VGG-3D + C3D [90] 86.7% n/a
Ng et al. [92] 88.6% n/a
Wu et al. [93] 92.6% 1,730 msec/frame @ Tesla K80 GPU
Guo et al. [91] 93.3% n/a
Lev et al. [86] 94.0% n/a
FIGURE 8.4 TSC-DL is a system that is able to segment a suturing video sequence into
different stages (position needle, push needle, pull needle, and hand-off) [94].
and runtime is present. When selecting a method, it must be considered whether real-time operation is needed for the specific task to be solved.
Transfer of video-based computer vision techniques to medical robotics is an
active area of research, especially in surgery-related applications. The use of deep
learning in this application area has already begun. For instance, in [94], a system
that is able to segment video and kinematic data for performing unsupervised tra-
jectory segmentation of multimodal surgical demonstrations is implemented (see
Figure 8.4). The system is able to segment video-kinematic descriptions of surgi-
cal demonstrations successfully, corresponding to stages such as “position,” “push
needle,” “pull needle,” and “hand-off.” The system is based on a switching linear
dynamic system, which considers both continuous and discrete states composed of
kinematic and visual features. Visual features are extracted from the fifth convolutional layer of a VGG CNN and then reduced to 100 dimensions by using PCA. The system applies clustering on the state space and learns transitions between clusters, enabling trajectory segmentation. The system uses a video source at 10 frames per second; however, the frame rate of the full method
is not reported.
platforms [95, 96]. For instance, in [38] a CNN-based robot detector is presented that is able to run in real time on Nao robots while playing soccer. In addition, companies such as Intel, NVIDIA, and Samsung, just to name a few, are developing CNN chips that will enable real-time vision applications [4]. For instance, mobile GPU processors like the NVIDIA Tegra K1 enable efficient implementation of deep learning algorithms with low power consumption, which is relevant for mobile robotics [27].
It is expected that these methodologies will consolidate in the next few years and
will then be available to the developers of robot vision applications.
The ability of deep learning models to achieve spatial invariance with respect to the input data in a computationally efficient manner is another important issue. In order to address this, the spatial transformer module (STM) was recently introduced [97]. It is a self-contained module that can be incorporated into DNNs and that is able to perform explicit spatial transformations of the features. The most important characteristic of STMs is that their parameters (i.e., the parameters of the spatial transformation) can be learned together with the other parameters of the network using
the backpropagation of the loss. The further development of the STM concept will
allow addressing the required invariance to the observational conditions derived
from the robot geometry, limited camera resolution, and sensor/object relative pose.
STMs are able to deal with rotations, translations, and scaling [97], and STMs are
already being extended to deal with 3D transformations [98]. Further development
of STM-inspired techniques is expected in the next few years.
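For reference, in the affine case the spatial transformation of [97] maps each coordinate (x_t, y_t) of the regular output grid back to a sampling location (x_s, y_s) in the input feature map,

(x_s, y_s)^T = A_theta (x_t, y_t, 1)^T,   with A_theta = [theta_11 theta_12 theta_13; theta_21 theta_22 theta_23],

and the six entries of A_theta are learned by backpropagation like any other network weights; rotations, translations, and scalings correspond to particular choices of A_theta. The affine parameterization is only one of the options discussed in [97].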
Unsupervised learning is another relevant area of future development for deep
learning-based vision systems. Biological learning is largely unsupervised; ani-
mals and humans discover the structure of the world by exploring and observing it.
Therefore, one would expect that similar mechanisms could be used in robotics and
other computer-based systems. Until now, the success of purely supervised learning,
which is based largely on the availability of massive labeled data, has overshad-
owed the application of unsupervised learning in the development of vision sys-
tems. However, when considering the increasing availability of non-labeled digital
data, and the increasing number of vision applications that need to be addressed,
one would expect further development of unsupervised learning strategies for DNN
models [4, 7]. In the case of robotics applications, a natural way of addressing this
issue is by using deep learning and reinforcement learning together. This strategy
has already been applied for the learning of games (Atari, Go, etc.) [99, 100], for
solving simple tasks like pushing a box [47], and for solving more complex tasks
such as navigating in simulated environments [77, 78], but it has not been used for
robots learning in the wild. The visual system of most animals is active, i.e., animals
decide where to look, based on a combination of internal and external stimuli. In a
robotics system having a similar active vision approach, reinforcement learning can
be used for deciding where to look according to the results of the interaction of the
robot with the environment.
A related relevant issue is that of open-world learning [101], i.e., to learn to detect
new classes incrementally, or to learn to distinguish among subclasses incrementally
after the main one has been learned. If this can be done without supervision, new
classifiers can be built based on those that already exist, greatly reducing the effort
required to learn new object classes. Note that humans are continuously inventing
new objects, fashion changes, etc., and therefore robot vision systems will need to be
updated continuously, adding new classes and/or updating existing ones [101]. Some
recent work has addressed these issues, based mainly on the joint use of deep learn-
ing and transfer learning methods [102, 103].
Classification using a very large number of classes is another important challenge to address. AlexNet, like most CNN networks, was developed for solving problems that contain ~1,000 classes (e.g., ILSVRC 2012 challenge). It uses fully connected units and a final softmax unit for classification. However, when working on problems with a very large number of classes (10,000+) in which dense data sampling is available, nearest neighbors could work better than other classifiers [104]. Also, the use
of a hierarchy-aware cost function (based on WordNet) could enable providing more
detailed information when the classes have redundancy [104]. However, in current
object recognition benchmarks (like ILSVRC 2016), the number of classes has not
increased, and no hierarchical information has been used. The ability of learning
systems to deal with a high number of classes is an important challenge that needs to
be addressed for performing open-world learning.
Another important area of research is the combination of new methods based
on deep learning with classical vision methods based on geometry, which has been
very successful in robotics (e.g., visual SLAM). This topic has been addressed at
recent workshops [105] and conferences [106]. There are several ways in which these
important and complementary paradigms can be combined. On the one hand, geom-
etry-based methods such as Structure from Motion and Visual SLAM can be used
for the training of deep learning-based vision systems; geometry-based methods can
extract and model the structure of the environment, and this structure can be used
for assisting the learning process of a DNN. On the other hand, DNNs can also be
used for learning to compute the visual odometry automatically, and eventually to
learn the parameters of a visual SLAM system. In a very recent study, STMs were
extended to 3D spatial transformations [98], allowing end-to-end learning of DNNs
for computing visual odometry. It is expected that STM-based techniques will allow
end-to-end training for optical flow, depth estimation, place recognition with geo-
metric invariance, small-scale bundle adjustment, etc. [98].
A fnal relevant research topic is the direct analysis of video sequences in robot-
ics applications. The analysis of video sequences by using current techniques based
on deep learning is very demanding of computational resources. It is expected that
the increase of computing capabilities will enable the analysis of video sequences in
real time in the near future. Also, the availability of labeled video datasets for learn-
ing tasks such as action recognition will enable the improvement of robot percep-
tion tasks, which are currently addressed by using independent CNNs for individual
video frames.
8.5 CONCLUSIONS
In this chapter, several approaches based on deep learning that are related to robot
vision applications are described and analyzed. They are classified into four categories of methods: (i) object detection and categorization, (ii) object grasping and manipulation, (iii) scene representation and classification, and (iv) spatiotemporal
vision. (i) Object detection and categorization has been transferred successfully
into robot vision. While the best-performing object detectors are mainly variants
of Faster-RCNN, methods derived from YOLO offer a better runtime. (ii) Object
grasping, manipulation, and visual servoing are currently being improved by the
adoption of deep learning. While real-time processing is needed for hand-eye coor-
dination and visual servoing to work, it is not needed for open-loop object grasping
detection. Also, these methods are highly dependent on the robot used for collecting
data. (iii) Methods for representing, recognizing, or classifying scenes as a whole
(not pixel-wise) feed images directly into a single CNN, and then near-real-time
processing can be achieved. However, pixel-wise methods based on FCNs require a
GPU for running in near real time. (iv) Spatiotemporal vision applications need real-
time processing capability to be useful in robotics. However, current research using
deep learning is experimental, and the frame rates for most of the methods are not
reported. Platforms with low computing capability (only CPU) are not able to run
most of the methods at a useful frame rate. The adoption of general deep learning
methods in robot vision applications requires addressing several problems: (i) Deep
neural networks require large amounts of data to be trained. Obtaining large amounts of data for specific robotic applications is a challenge to address each time a specific task needs to be solved. (ii) Deep learning is able to deal with difficult problems; however, it cannot generalize to conditions that are not represented by the available data. (iii) The need for low power consumption is a critical constraint
to be considered. However, the availability of specialized hardware for computing
algorithms based on CNNs with low power consumption would make it possible to
broaden the use of deep neural networks in robotics.
ACKNOWLEDGMENTS
This work was partially funded by FONDECYT grant 1161500 and CONICYT PIA
grant AFB18004.
BIBLIOGRAPHY
1. Yanming Guo, Yu Liu, Ard Oerlemans, Songyang Lao, Song Wu, and Michael S. Lew.
Deep learning for visual understanding: A review. Neurocomputing, 187, 27–48, 2016.
2. Soren Goyal and Paul Benjamin. Object recognition using deep neural networks: A sur-
vey. http://arxiv.org/abs/1412.3684, 2014.
3. Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai,
Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, and Tsuhan Chen. Recent advances
in convolutional neural networks. Pattern Recognition, 77, 354–377, 2018.
4. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521, 436–
444, 2015, doi:10.1038/nature14539.
5. Li Deng. A tutorial survey of architectures, algorithms, and applications for deep learn-
ing. APSIPA Transactions on Signal and Information Processing, 3, 1–29, 2014.
6. Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks,
61, 85–117, 2015.
7. Suraj Srinivas, Ravi Kiran Sarvadevabhatla, Konda Reddy Mopuri, Nikita Prabhu,
Srinivas S. S. Kruthiventi, and R. Venkatesh Babu. A taxonomy of deep convolutional
neural nets for computer vision. Frontiers in Robotics and AI, 2, 2016.
8. Xiaowei Zhou, Emanuele Rodola, Jonathan Masci, Pierre Vandergheynst, Sanja Fidler,
and Kostas Daniilidis. Workshop geometry meets deep learning, ECCV 2016, 2016.
9. Yi Li, Yezhou Yang, Michael James, Danil Prokhorov. Deep Learning for Autonomous
Robots, Workshop at RSS 2016, 2016.
10. Awad, Ali Ismail, and Mahmoud Hassaballah. Image feature detectors and descriptors.
Studies in Computational Intelligence. Springer International Publishing, Cham, 2016.
11. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey
Dean, et al. TensorFlow: A system for large-scale machine learning. 12th Symposium
on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
12. Patricio Loncomilla, Javier Ruiz-del-Solar, and Luz Martínez, Object recognition
using local invariant features for robotic applications: A survey, Pattern Recognition,
60, 499–514, 2016.
13. Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel.
Backpropagation applied to handwritten zip code recognition. Neural Computation,
1(4), 541–551, 1989.
14. R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for
accurate object detection and semantic segmentation. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 580–587, 2014.
15. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards
real-time object detection with region proposal networks. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 39, 6, 1137–1149, 2017.
16. Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 6517–6525, 2017.
17. Max Schwarz, Hannes Schulz, and Sven Behnke. RGB-D object recognition and
pose estimation based on pre-trained convolutional neural network features. IEEE
International Conference on Robotics and Automation (ICRA), 1329–1335, 2015.
18. Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning
rich features from RGB-D images for object detection and segmentation. European
Conference on Computer Vision (ECCV), 345–360, 2014.
19. Andreas Eitel, Jost Tobias Springenberg, Luciano Spinello, Martin Riedmiller,
and Wolfram Burgard. Multimodal deep learning for robust RGB-D object recogni-
tion. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
681–687, 2015.
20. Joel Schlosser, Christopher K. Chow, and Zsolt Kira. Fusing LIDAR and images
for pedestrian detection using convolutional neural networks. IEEE International
Conference on Robotics and Automation (ICRA), 2198–2205, 2016.
21. G. Pasquale, C. Ciliberto, F. Odone, L. Rosasco, and L. Natale. Teaching iCub
to recognize objects using deep convolutional neural networks. 4th International
Conference on Machine Learning for Interactive Systems (MLIS’15), 43, 21–25, 2015.
22. Dario Albani, Ali Youssef, Vincenzo Suriani, Daniele Nardi, and Domenico
Daniele Bloisi. A deep learning approach for object recognition with NAO soccer
robots. RoboCup International Symposium, 392–403, 2016.
23. Denis Tomè, Federico Monti, Luca Baroffo, Luca Bondi, Marco Tagliasacchi, and
Stefano Tubaro. Deep convolutional neural networks for pedestrian detection. Signal
Processing: Image Communication, 47, C, 482–489, 2016.
24. Hasan F. M. Zaki, Faisal Shafait, and Ajmal Mian. Convolutional hypercube pyra-
mid for accurate RGB-D object category and instance recognition. IEEE International
Conference on Robotics and Automation (ICRA), 2016, 1685–1692, 2016.
25. Judy Hoffman, Saurabh Gupta, Jian Leong, Sergio Guadarrama, and Trevor
Darrell. Cross-modal adaptation for RGB-D detection. IEEE International Conference
on Robotics and Automation (ICRA), 5032–5039, 2016.
26. Bo Li, Tianlei Zhang, and Tian Xia. Vehicle detection from 3D lidar using fully
convolutional network. Proceedings of Robotics: Science and System Proceedings,
2016.
27. Nvidia: GPU-Based deep learning inference: A performance and power analysis.
Whitepaper, 2015.
28. Jan Hosang, Mohamed Omran, Rodrigo Benenson, and Bernt Schiele. Taking
a deeper look at pedestrians. IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 4073–4082, 2015.
29. Rudy Bunel, Franck Davoine, and Philippe Xu. Detection of pedestrians at far
distance. IEEE International Conference on Robotics and Automation (ICRA), 2326–
2331, 2016.
30. Jimmy Ren, Xiaohao Chen, Jianbo Liu, Wenxiu Sun, Jiahao Pang, Qiong Yan,
Yu-Wing Tai, and Li Xu. Accurate single stage detector using recurrent rolling con-
volution. IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
752–760, 2017.
31. Zhaowei Cai, Quanfu Fan, Rogerio S. Feris, and Nuno Vasconcelos. A unifed
multi-scale deep convolutional neural network for fast object detection. European
Conference on Computer Vision (ECCV), 354–370, 2016.
32. Yu Xiang, Wongun Choi, Yuanqing Lin, and Silvio Savarese. Subcategory-aware
convolutional neural networks for object proposals and detection. IEEE Winter
Conference on Applications of Computer Vision (WACV), 924–933, 2017.
33. Yousong Zhu, Jinqiao Wang, Chaoyang Zhao, Haiyun Guo, and Hanqing Lu.
Scale-adaptive deconvolutional regression network for pedestrian detection. Asian
Conference on Computer Vision, 416–430, 2016.
34. Fan Yang, Wongun Choi, and Yuanqing Lin. Exploit all the layers: Fast and accu-
rate CNN object detector with scale dependent pooling and cascaded rejection classi-
fers. IEEE International Conference on Computer Vision and Pattern Recognition
(CVPR), 2129–2137, 2016.
35. Piotr Dollar, Ron Appel, Serge Belongie, and Pietro Perona. Fast feature pyramids
for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence,
36, 8, 1532–1545, 2014.
36. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification
with deep convolutional neural networks. Advances in Neural Information Processing
Systems, 25, 1097–1105, 2012.
37. Gabriel L. Oliveira, Abhinav Valada, Claas Bollen, Wolfram Burgard, and
Thomas Brox. Deep learning for human part discovery in images. IEEE International
Conference on Robotics and Automation (ICRA), 1634–1641, 2016.
38. Nicolás Cruz, Kenzo Lobos-Tsunekawa, and Javier Ruiz-del-Solar. Using convolu-
tional neural networks in robots with limited computational resources: Detecting NAO
robots while playing soccer. RoboCup 2017: Robot World Cup XXI, 19–30, 2017.
39. Daniel Speck, Pablo Barros, Cornelius Weber, and Stefan Wermter. Ball localiza-
tion for robocup soccer using convolutional neural networks. RoboCup International
Symposium, 19–30, 2016.
40. Francisco Leiva, Nicolas Cruz, Ignacio Bugueño, and Javier Ruiz-del-Solar.
Playing soccer without colors in the SPL: A convolutional neural network approach.
RoboCup Symposium, 2018 (in press).
41. Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic
grasps. The International Journal of Robotics Research, 34, 4–5, 705–724, 2015.
42. Joseph Redmon and Anelia Angelova. Real-time grasp detection using convolu-
tional neural networks. IEEE International Conference on Robotics and Automation
(ICRA), 1316–1322, 2015.
43. Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp
from 50K tries and 700 robot hours. IEEE International Conference on Robotics and
Automation (ICRA), 3406–3413, 2016.
44. Di Guo, Tao Kong, Fuchun Sun, and Huaping Liu. Object discovery and grasp
detection with a shared convolutional neural network. IEEE International Conference
on Robotics and Automation (ICRA), 2038–2043, 2016.
45. Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-
eye coordination for robotic grasping with deep learning and large-scale data collection.
The International Journal of Robotics Research, 37, 4–5, 421–436, 2018.
46. Jaeyong Sung, Seok Hyun Jin, and Ashutosh Saxena. Robobarista: Object part
based transfer of manipulation trajectories from crowd-sourcing in 3D pointclouds.
Robotics Research, 3, 701–720, 2018.
47. Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter
Abbeel. Deep spatial autoencoders for visuomotor learning. IEEE International
Conference on Robotics and Automation (ICRA), 512–519, 2016.
48. Aseem Saxena, Harit Pandya, Gourav Kumar, Ayush Gaud, and K. Madhava
Krishna. Exploring convolutional networks for end-to-end visual servoing. IEEE
International Conference on Robotics and Automation (ICRA), 3817–3823, 2017.
49. Yang Gao, Lisa Anne Hendricks, Katherine J. Kuchenbecker, and Trevor Darrell.
Deep learning for tactile understanding from visual and haptic data. IEEE International
Conference on Robotics and Automation (ICRA), 536–543, 2016.
50. Manuel Lopez-Antequera, Ruben Gomez-Ojeda, Nicolai Petkov, and Javier
Gonzalez-Jimenez. Appearance-invariant place recognition by discriminatively
training a convolutional neural network. Pattern Recognition Letters, 92, 89–95,
2017.
51. Yi Hou, Hong Zhang, and Shilin Zhou. Convolutional neural network-based image
representation for visual loop closure detection. IEEE International Conference on
Information and Automation (ICIA), 2238–2245, 2015.
52. Niko Sunderhauf, Feras Dayoub, Sean McMahon, Ben Talbot, Ruth Schulz, Peter
Corke, Gordon Wyeth, Ben Upcroft, and Michael Milford. Place categorization and
semantic mapping on a mobile robot. IEEE International Conference on Robotics and
Automation (ICRA), 5729–5736, 2016.
53. Niko Sunderhauf, Sareh Shirazi, Adam Jacobson, Feras Dayoub, Edward Pepperell,
Ben Upcroft, and Michael Milford. Place recognition with convnet landmarks:
Viewpoint-robust, condition-robust, training-free. Robotics: Science and System
Proceedings, 2015.
54. Yiyi Liao, Sarath Kodagoda, Yue Wang, Lei Shi, and Yong Liu. Understand Scene
Categories by objects: A semantic regularized scene classifer using convolutional neu-
ral networks. IEEE International Conference on Robotics and Automation (ICRA),
2318–2325, 2016.
55. Peter Uršic, Rok Mandeljc, Aleš Leonardis, and Matej Kristan. Part-based room
categorization for household service robots. IEEE International Conference on
Robotics and Automation (ICRA), 2287–2294, 2016.
56. Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba.
Sun database: Large-scale scene recognition from abbey to zoo. IEEE Conference on
Computer Vision and Pattern recognition (CVPR), 3485–3492, 2010.
57. Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba.
Places: A 10 million image database for scene recognition. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 40, 6, 1452–1464, 2018.
58. Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva.
Learning deep features for scene recognition using places database. Advances in Neural
Information Processing Systems 27 (NIPS), 487–495, 2014.
93. Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, and Xiangyang Xue. Multi-stream
multi-class fusion of deep networks for video classifcation. ACM Multimedia, 791–
800, 2016.
94. Adithyavairavan Murali, Animesh Garg, Sanjay Krishnan, Florian T. Pokorny,
Pieter Abbeel, Trevor Darrell, and Ken Goldberg. TSC-DL: Unsupervised trajec-
tory segmentation of multi-modal surgical demonstrations with deep learning. IEEE
International Conference on Robotics and Automation (ICRA), 4150–4157, 2016.
95. Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep
neural networks with pruning, trained quantization and huffman coding. International
Conference on Learning Representations (ICLR’16), 2016.
96. Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz,
and William J. Dally. EIE: Efficient inference engine on compressed deep neural network. International Symposium on Computer Architecture (ISCA), 243–254, 2016.
97. Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu.
Spatial transformer networks. Advances in Neural Information Processing Systems, 28,
2017–2025, 2015.
98. Ankur Handa, Michael Bloesch, Viorica Patraucean, Simon Stent, John McCormac,
and Andrew Davison. gvnn: Neural network library for geometric computer vision.
Computer Vision – ECCV 2016 Workshops, 67–82, 2016.
99. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness,
Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg
Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen
King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-
level control through deep reinforcement learning. Nature, 518, 529–533, 2015.
100. David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George
van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam,
Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya
Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel,
and Demis Hassabis. Mastering the game of go with deep neural networks and tree
search. Nature, 529, 7587, 484–489, 2016.
101. Rodrigo Verschae and Javier Ruiz-del-Solar. Object detection: Current and future
directions. Frontiers in Robotics and AI, 29, 2, 2015.
102. Yoshua Bengio. Deep learning of representations for unsupervised and transfer
learning. International Conference on Unsupervised and Transfer Learning Workshop,
27, 17–37, 2012.
103. Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain
Gagnon. Multiple instance learning. Pattern Recognition, 77, 329–353, 2018.
104. Jia Deng, Alexander C. Berg, Kai Li, and Li Fei-Fei. What does classifying more
than 10,000 image categories tell us? European Conference on Computer Vision: Part
V (ECCV’10), 71–84, 2010.
105. Stefan Leutenegger, Thomas Whelan, Richard A. Newcombe, and Andrew J.
Davison. Workshop the future of real-time SLAM: Sensors, processors, representa-
tions, and algorithms, ICCV, 2015.
106. Ken Goldberg. Deep grasping: Can large datasets and reinforcement learning
bridge the dexterity gap? Keynote Talk at ICRA, 2016.
9 Deep Convolutional Neural Networks: Foundations and Applications in Medical Imaging
Mahmoud Khaled Abd-Ellah, Ali Ismail Awad,
Ashraf A. M. Khalaf, and Hesham F. A. Hamed
CONTENTS
9.1 Introduction .................................................................................................. 234
9.2 Convolutional Neural Networks (CNNs) ...................................................... 236
9.2.1 CNN Layers ...................................................................................... 236
9.2.1.1 Convolutional Layers ......................................................... 237
9.2.1.2 Max-Pooling Layer ............................................................ 238
9.2.1.3 Nonlinearity and ReLU ..................................................... 238
9.2.1.4 Fully Connected Layers (FCLs)......................................... 239
9.3 Optimization Approaches............................................................................. 239
9.3.1 Gradient-Based Optimization...........................................................240
9.3.2 Dropout Property..............................................................................240
9.3.3 Batch Normalization.........................................................................240
9.3.4 Network Depth.................................................................................. 241
9.4 Operational Properties.................................................................................. 241
9.4.1 Using Pretrained CNNs .................................................................... 241
9.4.1.1 Fine-Tuning ........................................................................ 241
9.4.1.2 CNN Activations as Features............................................. 242
9.4.2 Deep Learning Difficulties............................................................... 242
9.4.2.1 Overfitting Reduction......................................................... 242
9.5 Commonly Used Deep Neural Networks ..................................................... 242
9.5.1 AlexNet............................................................................................. 243
9.5.2 GoogleNet......................................................................................... 243
9.5.3 VGG-19............................................................................................. 243
9.5.4 Deep Residual Networks (ResNet) ................................................... 243
9.5.5 Tools and Software ...........................................................................244
9.1 INTRODUCTION
Machine learning has become a primary focus of many research groups. The main uses of machine learning include feature extraction, feature reduction, feature classification, forecasting, clustering, regression, and ensembles of different algorithms. Classic algorithms and techniques are
optimized to provide effective self-learning [1]. Since machine learning is applied in a wide range of studies, many techniques have been developed, such as Bayesian networks, clustering, decision tree learning, and deep learning, as shown in Figure 9.1.
Meanwhile, some researchers in the machine learning field attempted to learn models that also learn the features themselves. These models normally consist of several layers of nonlinearity, which led to the first deep learning models. Early models such as deep belief networks [2], stacked autoencoders (SAEs) [3], and restricted Boltzmann machines [4] showed promise on small datasets. This was known as the "unsupervised pretraining" phase; it was believed that these "pretrained" models would be a good initialization for supervised tasks such as classification.
Deep learning first appeared in 2006 as a new subfield of machine learning research. In Mosavi and Varkonyi-Koczy [5], it was referred to as hierarchical learning, and it has influenced research fields related to pattern recognition. Deep learning mainly depends on two key factors: nonlinear processing in multiple layers and supervised or unsupervised learning [6]. With supervised learning, the class target label is available; in its absence, the system is unsupervised. During the ImageNet large-scale visual recognition challenge (ILSVRC) competition in 2012 [7], different large-scale algorithms were applied to a large dataset, which required, among other tasks, the classification of an image into one of a thousand classes. Surprisingly, a convolutional neural network (CNN) significantly reduced the error rate on
FIGURE 9.1 Machine learning is the overall category that includes supervised learning,
unsupervised learning, reinforcement learning, neural networks, deep learning, and other
approaches.
FIGURE 9.2 Representation of CNN network weights. A ReLU nonlinearity is added after
every convolutional layer.
we will call "fully connected layers (FCLs)". Figure 9.2 presents a depiction of the weights in the AlexNet network: convolutional layers are used in the first five layers, while the last three layers are fully connected.
where x is the input and w is the weight (kernel) provided from the previous layer. The convolution kernel size is m × n, and the convolutional layer output is y. The input of the CNN and the convolutional kernel should have the same number of dimensions; a multidimensional kernel should be used if the input is a multidimensional array [17].
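A minimal NumPy sketch of the operation implied by these definitions is given below: a "valid" two-dimensional convolution of a single-channel input x with an m × n kernel w, producing the output y. Padding, stride, and multiple channels are omitted for clarity.

import numpy as np

def conv2d_valid(x, w):
    """'Valid' 2D convolution of a single-channel input x with kernel w."""
    m, n = w.shape
    H, W = x.shape
    w_flipped = w[::-1, ::-1]               # flip the kernel for a true convolution
    y = np.zeros((H - m + 1, W - n + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(x[i:i + m, j:j + n] * w_flipped)
    return y

# Example: a 5 x 5 input convolved with a 3 x 3 averaging kernel gives a 3 x 3 map.
y = conv2d_valid(np.arange(25.0).reshape(5, 5), np.ones((3, 3)) / 9.0)

In practice, CNN libraries implement the cross-correlation variant (without the kernel flip) and add channels, padding, and stride, but the core sliding-window multiply-and-sum is the same.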
Different types of pooling include stochastic pooling [19], average pooling, and win-
ner-takes-all pooling [20]. However, these types of pooling are not as commonly
utilized as max-pooling. Other pooling types occasionally cannot extract good fea-
tures; for example, average pooling provides an average value by taking all features
into account, which is a very generalized computation. Therefore, poor accuracy
will be obtained with average pooling when the systems do not require all features
from the convolutional layer. Another example, winner-takes-all pooling, is similar
to max-pooling; however, it keeps the feature dimension fixed, and no downsampling occurs. Nevertheless, some classification tasks have used alternative or mixed pooling schemes and outperformed max-pooling. Thus, the selection of the pooling type depends on the dataset.
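As a point of comparison for the pooling variants discussed above, the following NumPy sketch implements plain non-overlapping max-pooling; the 2 × 2 window with stride 2 is an assumed typical configuration.

import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max-pooling of a 2D feature map with a size x size window."""
    H, W = x.shape
    H_out, W_out = H // size, W // size
    x = x[:H_out * size, :W_out * size]          # drop incomplete border rows/columns
    return x.reshape(H_out, size, W_out, size).max(axis=(1, 3))

# Example: a 4 x 4 feature map is reduced to 2 x 2, keeping the strongest activations.
pooled = max_pool2d(np.arange(16.0).reshape(4, 4))

Replacing max with mean in the last line of the function yields average pooling, the more generalized computation criticized above.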
most commonly applied activation function for deep learning applications in the literature is the rectified linear unit (ReLU), mainly because CNNs with this nonlinearity have been found to train faster than with other activation functions.
s(x) = 1 / (1 + e^(-x)).    (9.2)
tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x)).    (9.3)
Recently, Maas et al. [22] presented another type of nonlinearity, called the leaky-
ReLU. This nonlinearity is characterized as shown in Equation 9.5 [18].
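For convenience, the activation functions discussed in this subsection can be summarized in a few lines of NumPy; the leaky-ReLU below uses its standard definition and an assumed typical slope of alpha = 0.01, since Equation 9.5 itself is not reproduced here.

import numpy as np

def sigmoid(x):                      # Equation 9.2
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                         # Equation 9.3
    return np.tanh(x)

def relu(x):                         # rectified linear unit
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):       # standard leaky-ReLU definition
    return np.where(x > 0, x, alpha * x)

# Example: evaluate the activations on a small range of inputs.
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
outputs = {f.__name__: f(z) for f in (sigmoid, tanh, relu, leaky_relu)}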
requires careful preparation and generally slows training. To overcome this issue, the layer output activations are normalized by BN to guarantee that their values lie within a small interval. In particular, BN performs normalization per minibatch, using the minibatch mean and variance statistics. Additional shift and scale parameters are learned to counteract the normalization effect if necessary. Recently, BN has been observed to be a fundamental component for training very deep networks.
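A compact sketch of the minibatch transformation described above follows; gamma and beta are the learned scale and shift parameters, and eps is the usual small constant added for numerical stability.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a minibatch x (shape: batch x features) and apply scale/shift."""
    mu = x.mean(axis=0)                      # per-feature minibatch mean
    var = x.var(axis=0)                      # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # activations now lie in a small interval
    return gamma * x_hat + beta              # learned parameters can undo the effect

# Example: normalize a minibatch of 4 samples with 3 features each.
x = np.random.randn(4, 3) * 10 + 5
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))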
9.4.1.1 Fine-Tuning
This process refers to adapting a pretrained network, for example one trained for image classification, to a different task by using the trained weights as an initialization and running SGD again on the new task. The learning rate is set much lower than that used for the original network. When the new task is similar to the original one, the earlier layers can be kept fixed, and only the later semantic layers need to be relearned. However, if the new task is completely different, we should either retrain all layers or train the network from scratch. The
number of layers to retrain also depends on the quantity of information available
for learning the new task. The larger the dataset is, the higher is the number of
layers that can be retrained. The reader is referred to Yosinski et al. [42] for more
information.
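A minimal Keras sketch of this recipe is shown below: reuse pretrained weights, freeze the earlier layers, and retrain only the later layers with a reduced learning rate. The choice of VGG-19, the number of frozen layers, the new 10-class head, and the learning rate are illustrative assumptions, not recommendations from the text.

import tensorflow as tf

# Load a network pretrained on ImageNet, without its original classifier head.
base = tf.keras.applications.VGG19(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))

# Freeze the earlier layers; only the last few layers will be retrained.
for layer in base.layers[:-4]:
    layer.trainable = False

# Attach a new head for the target task (assumed here to have 10 classes).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Use a much smaller learning rate than when training from scratch.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])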
TABLE 9.1
Comparison among Well-Known Convolutional Neural
Networks Using ImageNet Dataset
Year CNN No. of layers No. of parameters Error rate
9.5.1 ALEXNET
AlexNet has been discussed previously. Figure 9.2 shows max-pooling grouped with
layers 1, 2, and 5, while the last two fully connected layers are combined with drop-
out because they contain the most parameters. Local response normalization is applied to layers 1 and 2, which does not affect the network performance as reported in [50]. The ILSVRC 2012 database, which has 1.2 million images for 1,000 classes, is used to train the network. Two GPUs are used to train the network for one month; with more powerful GPUs, the network can be trained in a few days [50]. Hyperparameters such as the dropout rate, weight decay, momentum, and learning rate were manually tuned. Blob-like features and Gabor-like oriented edges are learned in the early layers, higher-order features such as shapes are learned by the intermediate layers, and semantic attributes such as wheels or eyes are learned by the last layers.
9.5.2 GOOGLENET
GoogleNet [40], the best-performing method at ILSVRC-2014, showed that classification performance can be significantly improved with very deep networks. The number of parameters grows with the large number of layers, so several design tricks are used, such as inserting a 1 × 1 convolutional layer after a traditional convolutional layer, which reduces the number of parameters and increases the expressive power of the CNN. Having at least one 1 × 1 convolutional layer can be seen as applying a multilayer perceptron to the output of the convolutional layer that precedes it. Another trick used by the authors is to involve the internal layers of the network in the calculation of the target function, rather than only the final softmax layer (as in AlexNet).
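As an illustrative numeric example of this trick (the channel counts are assumptions, not values taken from [40]): a 3 × 3 convolution that maps a 256-channel feature map to 256 channels uses 3 × 3 × 256 × 256 ≈ 590,000 weights, whereas pairing it with a 1 × 1 layer that reduces the input to 64 channels costs about 1 × 1 × 256 × 64 ≈ 16,000 weights for the reduction plus 3 × 3 × 64 × 256 ≈ 147,000 weights for the 3 × 3 convolution, roughly 164,000 weights in total, while also adding one more nonlinearity.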
9.5.3 VGG-19
The VGG-19 [39] network design is one of the highest-performing deep CNN (DCNN) architectures. A notable feature of the VGG design is that it divides large convolutional filters into small ones. The small filters are chosen so that their total number of parameters is approximately the same as that of the larger convolutional filters they replace. The net effect of this design choice is efficiency and a regularization-like effect on the parameters because of the smaller size of the filters involved.
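As a worked example of this factorization (the channel count of 128 is chosen purely for illustration): a single 5 × 5 convolution between two 128-channel layers uses 5 × 5 × 128 × 128 = 409,600 weights, whereas a stack of two 3 × 3 convolutions covering the same 5 × 5 receptive field uses 2 × (3 × 3 × 128 × 128) = 294,912 weights and inserts an additional ReLU nonlinearity between the two layers.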
TABLE 9.2
Software List for Convolutional Neural Networks
Package Interfaces AutoDiff
task of localizing objects, the network must search for image regions with different
scales simultaneously.
Girshick et al. [55] used region proposals to solve the problem of object localization. The method obtained the best results on the PASCAL VOC 2012 detection dataset. It is named region-based CNN (R-CNN) because it applies a CNN to regions of the image. Different works have used the R-CNN approach to extract features from small regions in order to solve many target applications in computer vision.
Generally, deep learning strategies are very successful when the available number of image samples in the training stage is very large. For instance, 1 million images were provided in the ILSVRC competition. However, medical datasets typically contain a limited number of samples, often fewer than 1,000. This increases the challenge of developing a deep learning model with a low number of training samples without experiencing overfitting. Various strategies have been developed by researchers [59], such as the following:
There are several medical imaging applications that depend on the type of medical images. Medical imaging applications can be categorized based on the body part or structure (such as brain tumor, heart, organ/body part, cell, pulmonary nodules, lymph nodes, interstitial lung disease, cerebral microbleeds, and sclerosis lesions), the medical image type (MRI, CT, X-ray, ultrasound, and PET), and the application
FIGURE 9.3 A block diagram of the proposed DCNN architecture for brain tumor detection.
FIGURE 9.4 Sample images used for the experimental work. The image dataset is extracted
from the RIDER database. The top row shows normal brain MRI images, while abnormal
brain images are shown in the bottom row.
TABLE 9.3
Image Distribution in the Performance Evaluation Dataset
Dataset   Image type   No. of training images   No. of testing images
RIDER     Normal       45                        64
RIDER     Abnormal     77                        163
The accuracy (ACC) is a widely used metric for the classification performance; it is the ratio between the correctly classified samples and the total number of samples, as shown in Equation 9.6.

ACC = (TP + TN) / (TP + TN + FP + FN),    (9.6)

where TP, TN, FP, and FN denote the true positives, true negatives, false positives, and false negatives, respectively [66].
Sensitivity (SV), also called recall, hit rate, or true positive rate (TPR), represents the ratio of true positive samples to the total number of positive samples, as shown in Equation 9.7. Specificity (SP), also called inverse recall or true negative rate (TNR), expresses the ratio of correctly classified negative samples to the total number of negative samples, as shown in Equation 9.8 [67].

SV = TP / (TP + FN).    (9.7)

SP = TN / (TN + FP).    (9.8)
The precision or positive predictive value (PPV) reflects the ratio of correctly classified positive samples to all samples classified as positive, as shown in Equation 9.9. The negative predictive value (NPV), or inverse precision, represents the ratio of correctly classified negative samples to all samples classified as negative, as shown in Equation 9.10. Equation 9.11 refers to the balanced classification rate (BCR), or balanced accuracy (BA), which combines the sensitivity and specificity metrics [17].

Precision = PPV = TP / (TP + FP).    (9.9)

NPV = TN / (TN + FN).    (9.10)

BCR = BA = (SV + SP) / 2 = (1/2) [TP / (TP + FN) + TN / (TN + FP)].    (9.11)
The false positive rate (FPR) represents the ratio of false positive classified samples to all negative samples, as shown in Equation 9.12, whereas the false negative rate (FNR) expresses the ratio of false negative classified samples to all positive samples, as shown in Equation 9.13.

FPR = FP / (TN + FP).    (9.12)
FN
FNR = . (9.13)
TP + FN
In addition to the above metrics, many other metrics can be considered for evaluating
various applications. These include Youden's index (YI), also called Bookmaker informedness (BM),
the Matthews correlation coefficient (MCC), discriminant power (DP), the F-measure, markedness (MK),
the geometric mean (GM), the Jaccard index, and optimization precision (OP), as shown in
Equations 9.14 through 9.21, respectively [87].
\[ \mathrm{YI} = SV + SP - 1. \tag{9.14} \]
\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}. \tag{9.15} \]
\[ \mathrm{DP} = \frac{\sqrt{3}}{\pi}\left(\log\left(\frac{SV}{1 - SP}\right) + \log\left(\frac{SP}{1 - SV}\right)\right). \tag{9.16} \]
\[ \textit{F-measure} = \frac{2\,PPV \times SV}{PPV + SV}. \tag{9.17} \]
\[ \mathrm{MK} = PPV + NPV - 1. \tag{9.18} \]
\[ \mathrm{GM} = \sqrt{SV \times SP}. \tag{9.19} \]
\[ \mathrm{Jaccard} = \frac{TP}{TP + FP + FN}. \tag{9.20} \]
\[ \mathrm{OP} = \mathrm{ACC} - \frac{|SV - SP|}{SV + SP}. \tag{9.21} \]
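For quick reference, the following Python sketch collects Equations 9.6 through 9.21 into a single helper; the function name is illustrative, the natural logarithm is assumed for the discriminant power term, and all counts are assumed to be non-zero and non-degenerate.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Evaluate Equations 9.6-9.21 from raw counts (assumes non-zero counts)."""
    sv = tp / (tp + fn)                               # sensitivity / TPR (Eq. 9.7)
    sp = tn / (tn + fp)                               # specificity / TNR (Eq. 9.8)
    ppv = tp / (tp + fp)                              # precision (Eq. 9.9)
    npv = tn / (tn + fn)                              # inverse precision (Eq. 9.10)
    metrics = {
        "ACC": (tp + tn) / (tp + tn + fp + fn),       # accuracy (Eq. 9.6)
        "SV": sv, "SP": sp, "PPV": ppv, "NPV": npv,
        "BCR": (sv + sp) / 2,                         # balanced accuracy (Eq. 9.11)
        "FPR": fp / (tn + fp),                        # Eq. 9.12
        "FNR": fn / (tp + fn),                        # Eq. 9.13
        "YI": sv + sp - 1,                            # Youden's index (Eq. 9.14)
        "MCC": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),       # Eq. 9.15
        "DP": (math.sqrt(3) / math.pi) * (
            math.log(sv / (1 - sp)) + math.log(sp / (1 - sv))),   # Eq. 9.16
        "F": 2 * ppv * sv / (ppv + sv),               # F-measure (Eq. 9.17)
        "MK": ppv + npv - 1,                          # markedness (Eq. 9.18)
        "GM": math.sqrt(sv * sp),                     # geometric mean (Eq. 9.19)
        "JAC": tp / (tp + fp + fn),                   # Jaccard index (Eq. 9.20)
    }
    metrics["OP"] = metrics["ACC"] - abs(sv - sp) / (sv + sp)     # Eq. 9.21
    return metrics
```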
TABLE 9.4
Comparison of the Proposed Brain Tumor Detection
Model against Some Approaches in the Literature
Performance evaluation criteria (%)
9.9 CONCLUSION
The major advantage of deep learning over traditional machine learning
techniques is that it can automatically reveal significant features in high-dimensional
data. Recently, CNNs have led to great developments in processing text,
speech, images, videos, and other applications. In this chapter, the basic under-
standing of CNNs and the recent enhancements of CNNs were explained and
presented. CNN improvements have been discussed from various perspectives,
specifically the general model of CNN, layer design, loss function, activation
function, optimization, regularization, normalization, fast computation, and net-
work depth, reviewing the benefits of each phase of a CNN. Additionally, the use
of pretrained CNNs and deep learning difficulties have been discussed. This
chapter also introduced the commonly used deep convolutional neural networks,
namely, AlexNet, GoogLeNet, VGG-19, and ResNet. It discussed the commonly
used programs for deep CNN learning. The chapter focused on different types of
CNNs, which are region-based CNNs, fully convolutional networks, and hybrid
learning networks.
This chapter outlined the main deep learning applications in medical imaging
processing in terms of feature extraction, tumor detection, and tumor segmentation.
A deep convolutional neural network (DCNN) structure was proposed for brain
tumor detection from MRI images. The proposed DCNN was evaluated using the
RIDER database using 122 and 227 images for training and testing, respectively. It
achieved accurate detection within a time of 0.24 seconds per image in the testing
phase. This chapter aims to provide comprehensive knowledge to researchers,
learners, and those who are interested in this field.
BIBLIOGRAPHY
1. Marra, F., Poggi, G., Sansone, C., Verdoliva, L.: A deep learning approach for iris
sensor model identification. Pattern Recognition Letters 113, 46–53 (2018), Integrating Biometrics and Forensics.
2. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Computation 18(7), 1527–1554 (2006), https://doi.org/10.1162/neco.2006.18.7.1527, PMID: 16764513.
3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising
autoencoders: Learning useful representations in a deep network with a local denoising
criterion. Journal of Machine Learning Research 11, 3371–3408 (Dec 2010).
4. Hinton, G.E.: Training products of experts by minimizing contrastive divergence.
Neural Computation 14(8), 1771–1800 (2002), https://doi.org/10.1162/08997660276
0128018.
5. Mosavi, A., Varkonyi-Koczy, A.R.: Integration of machine learning and optimization
for robot learning. In: Jabloński, R., Szewczyk, R. (eds.) Recent Global Research and
Education: Technological Challenges, pp. 349–355. Springer International Publishing,
Cham (2017).
6. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine
Learning 2(1), 1–127 (2009), http://dx.doi.org/10.1561/2200000006.
7. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual rec-
ognition challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252
(2015), https://doi.org/10.1007/s11263–015–0816-y.
8. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convo-
lutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q.
(eds.) Advances in Neural Information Processing Systems 25, (NIPS 2012), pp. 1097–
1105. Curran Associates, Inc. (2012).
9. Hassaballah, M., Awad, A.I.: Detection and description of image features: An introduc-
tion. In: Awad, A.I., Hassaballah, M. (eds.) Image Feature Detectors and Descriptors:
Foundations and Applications, Studies in Computational Intelligence, Vol. 630, pp.
1–8. Springer International Publishing, Cham (2016).
10. Lowe, D.G.: Distinctive image features from scale-invariant key-points. International
Journal of Computer Vision 60(2), 91–110 (2004), https://doi.org/10.1023/B:VISI.0000029664.99615.94.
11. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’05). Vol. 1, pp. 886–893. IEEE (2005), https://doi.org/10.1109/CVPR.2005.177.
12. Yang, J., Jiang, Y.G., Hauptmann, A.G., Ngo, C.W.: Evaluating bag-of-visual-words
representations in scene classification. In: Proceedings of the International Workshop
on Workshop on Multimedia Information Retrieval. pp. 197–206. ACM, New York,
NY, USA (2007), https://doi.org/10.1145/1290082.1290111.
13. Awad, A.I., Hassaballah, M.: Image Feature Detectors and Descriptors: Foundations
and Applications, Studies in Computational Intelligence, Vol. 630. Springer
International Publishing, Cham, 1st edn. (2016).
14. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998), https://doi.org/10.1109/5.726791.
15. Bishop, C.: Pattern Recognition and Machine Learning. Springer-Verlag New York, 1
edn. (2006).
16. Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology 160(1), 106–154 (1962).
17. Abd-Ellah, M.K., Awad, A.I., Khalaf, A.A.M., Hamed, H.F.A.: Two-phase multi-model
automatic brain tumour diagnosis system from magnetic resonance images using con-
volutional neural networks. EURASIP Journal on Image and Video Processing 97(1),
1–10 (2018).
18. Srinivas, S., Sarvadevabhatla, R.K., Mopuri, K.R., Prabhu, N., Kruthiventi, S.S., Babu,
R.V.: Chapter 2—An introduction to deep convolutional neural nets for computer
vision. In: Zhou, S.K., Greenspan, H., Shen, D. (eds.) Deep Learning for Medical Image
Analysis, pp. 25–52. Academic Press (2017).
19. Zeiler, M., Fergus, R.: Stochastic pooling for regularization of deep convolutional
neural networks. In: Proceedings of the International Conference on Learning
Representation (ICLR). pp. 1–9 (2013).
20. Srivastava, R.K., Masci, J., Kazerounian, S., Gomez, F., Schmidhuber, J.: Compete to
compute. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger,
K.Q. (eds.) Advances in Neural Information Processing Systems 26, (NIPS 2013), pp.
2310–2318. Curran Associates, Inc. (2013).
21. Albawi, S., Mohammed, T.A., Al-Zawi, S.: Understanding of a convolutional neural
network. In: 2017 International Conference on Engineering and Technology (ICET).
pp. 1–6 (Aug 2017), https://doi.org/10.1109/ICEngTechnol.2017.8308186.
22. Maas, A.L.: Rectifier nonlinearities improve neural network acoustic models. In:
Dasgupta, S., McAllester, D. (eds.) Proceedings of the 30th International Conference
on Machine Learning. Proceedings of Machine Learning Research, Vol. 28. PMLR,
Atlanta, Georgia, USA (17–19 June 2013).
23. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level
performance on ImageNet classification. In: 2015 IEEE International Conference on
Computer Vision (ICCV). pp. 1026–1034. IEEE, Santiago, Chile (2015), https://doi.
org/10.1109/ICCV.2015.123.
24. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–
778. IEEE, Las Vegas, NV, USA (June 2016), https://doi.org/10.1109/CVPR.2016.90.
25. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-prop-
agating errors. Nature 323, 533–536 (1986), https://doi.org/10.1038/323533a0.
26. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Lechevallier,
Y., Saporta, G. (eds.) Proceedings of COMPSTAT'2010. pp. 177–186. Physica-Verlag HD, Heidelberg, Paris, France (2010), https://doi.org/10.1007/978-3-7908-2604-3_16.
27. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5), 1–17 (1964), https://doi.org/10.1016/0041-5553(64)90137-5.
28. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980
(2014), http://arxiv.org/abs/1412.6980.
29. Nesterov, Y.: A method of solving a convex programming problem with convergence
rate O(1/sqr(k)). Soviet Mathematics Doklady 27(2), 372–376 (1983).
30. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research 12, 2121–2159 (2011).
31. Zeiler, M.D.: ADADELTA: An adaptive learning rate method. CoRR abs/1212.5701
(2012), http://arxiv.org/abs/1212.5701.
32. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and
momentum in deep learning. In: Dasgupta, S., McAllester, D. (eds.) Proceedings of the
30th International Conference on Machine Learning. Proceedings of Machine Learning
Research, Vol. 28, pp. 1139–1147. PMLR, Atlanta, Georgia, USA (17–19 June 2013).
33. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Improving neural networks by preventing co-adaptation of feature detectors. CoRR
abs/1207.0580 (2012), http://arxiv.org/abs/1207.0580.
34. Wan, L., Zeiler, M., Zhang, S., Cun, Y.L., Fergus, R.: Regularization of neural networks
using DropConnect. In: Dasgupta, S., McAllester, D. (eds.) Proceedings of the 30th
International Conference on Machine Learning. Proceedings of Machine Learning
Research, Vol. 28, pp. 1058–1066. PMLR, Atlanta, Georgia, USA (17–19 June 2013).
35. Wang, S., Manning, C.: Fast dropout training. In: Dasgupta, S., McAllester, D. (eds.)
Proceedings of the 30th International Conference on Machine Learning. Proceedings
of Machine Learning Research, Vol. 28, pp. 118–126. PMLR, Atlanta, Georgia, USA
(17–19 June 2013).
36. Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout net-
works. In: Dasgupta, S., McAllester, D. (eds.) Proceedings of the 30th International
Conference on Machine Learning. Proceedings of Machine Learning Research, Vol.
28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (17–19 June 2013).
37. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In: Proceedings of the 32nd International Conference
on International Conference on Machine Learning, Vol. 37. pp. 448–456. ICML’15,
JMLR.org (2015).
38. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural
Networks 4(2), 251–257 (1991), https://doi.org/10.1016/0893–6080(91)90009-T.
39. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image
recognition. CoRR abs/1409.1556 (2014), http://arxiv.org/abs/1409.1556.
40. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke,
V., Rabinovich, A.: Going deeper with convolutions. In: 2015 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR). pp. 1–9. Boston, MA, USA (2015),
https://doi.org/10.1109/CVPR.2015.7298594.
41. Srivastava, R.K., Greff, K., Schmidhuber, J.: Training very deep networks. In:
Proceedings of the 28th International Conference on Neural Information Processing
Systems, Vol. 2. pp. 2377–2385. NIPS’15, MIT Press, Montreal, Canada (2015).
42. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep
neural networks? In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D.,
Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp.
3320–3328. Curran Associates, Inc. (2014).
43. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF:
A deep convolutional activation feature for generic visual recognition. In: Xing,
E.P., Jebara, T. (eds.) Proceedings of the 31st International Conference on Machine
Learning. Proceedings of Machine Learning Research, Vol. 32, pp. 647–655. PMLR,
Bejing, China (22–24 June 2014).
44. Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image
retrieval. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision—
ECCV 2014. pp. 584–599. Springer International Publishing, Cham (2014), https://doi.org/10.1007/978-3-319-10590-1_38.
45. Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: An
astounding baseline for recognition. In: 2014 IEEE Conference on Computer Vision
and Pattern Recognition Workshops. pp. 512–519 (June 2014), https://doi.org/10.1109/
CVPRW.2014.131.
46. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Gordon, G.,
Dunson, D., Dudík, M. (eds.) Proceedings of the Fourteenth International Conference
on Artifcial Intelligence and Statistics. Proceedings of Machine Learning Research,
Vol. 15, pp. 315–323. PMLR, Fort Lauderdale, FL, USA (11–13 Apr 2011).
47. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward
neural networks. In: Teh, Y.W., Titterington, M. (eds.) Proceedings of the Thirteenth
International Conference on Artificial Intelligence and Statistics. Proceedings of
Machine Learning Research, Vol. 9, pp. 249–256. PMLR, Chia Laguna Resort,
Sardinia, Italy (13–15 May 2010).
48. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A
simple way to prevent neural networks from overfitting. Journal of Machine Learning
Research 15, 1929–1958 (2014).
49. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising
autoencoders: Learning useful representations in a deep network with a local denoising
criterion. Journal of Machine Learning Research 11, 3371–3408 (2010).
50. Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the
details: Delving deep into convolutional nets. In: Valstar, M., French, A., Pridmore,
T. (eds.) Proceedings of the British Machine Vision Conference. BMVA Press (2014),
http://dx.doi.org/10.5244/C.28.6.
51. Lasagne: https://lasagne.readthedocs.io/en/latest/, Accessed: September 01, 2019.
52. Keras: https://keras.io/, Accessed: September 01, 2019.
53. Neidinger, R.: Introduction to automatic differentiation and MATLAB object-oriented programming. SIAM Review 52(3), 545–563 (2010), https://doi.org/10.1137/
080743627.
54. Hosang, J., Benenson, R., Dollar, P., Schiele, B.: What makes for effective detection
proposals? IEEE Transactions on Pattern Analysis and Machine Intelligence 38(4),
814–830 (April 2016), https://doi.org/10.1109/TPAMI.2015.2465908.
55. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate
object detection and semantic segmentation. In: 2014 IEEE Conference on Computer
Vision and Pattern Recognition. pp. 580–587. IEEE, Columbus, OH, USA (June 2014),
https://doi.org/10.1109/CVPR.2014.81.
56. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic seg-
mentation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). pp. 3431–3440. IEEE, Boston, MA, USA (June 2015), https://doi.org/10.1109/
CVPR.2015.7298965.
57. Girshick, R.: Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision
(ICCV). pp. 1440–1448. IEEE, Santiago, Chile (Dec 2015), https://doi.org/10.1109/
ICCV.2015.169.
58. Brody, H.: Medical imaging. Nature 502, S81 (10/30 2013).
59. Shen, D., Wu, G., Suk, H.I.: Deep learning in medical image analysis. Annual review of
biomedical engineering 19, 221–248 (June 21 2017).
60. Schmidhuber, J.: Deep learning in neural networks: An overview. Neural Networks 61,
85–117 (2015).
61. Wu, G., Kim, M., Wang, Q., Munsell, B.C., Shen, D.: Scalable high-performance image
registration framework by unsupervised deep feature representations learning. IEEE
Transactions on Biomedical Engineering 63(7), 1505–1516 (July 2016), https://doi.org/10.1109/TBME.2015.2496253.
62. Wu, G., Kim, M., Wang, Q., Gao, Y., Liao, S., Shen, D.: Unsupervised deep feature
learning for deformable registration of MR brain images. In: Mori, K., Sakuma, I.,
Sato, Y., Barillot, C., Navab, N. (eds.) Medical Image Computing and Computer-
Assisted Intervention—MICCAI 2013. Lecture Notes in Computer Science, Vol. 8150,
pp. 649–656. Springer Berlin Heidelberg, Berlin, Heidelberg (2013).
63. Liao, S., Gao, Y., Oto, A., Shen, D.: Representation learning: A unifed deep learning
framework for automatic prostate MR segmentation. In: Mori, K., Sakuma, I., Sato,
Y., Barillot, C., Navab, N. (eds.) Medical Image Computing and Computer-Assisted
Intervention—MICCAI 2013. Lecture Notes in Computer Science, Vol. 8150, pp. 254–
261. Springer Berlin Heidelberg, Berlin, Heidelberg (2013).
64. Guo, Y., Gao, Y., Shen, D.: Deformable MR prostate segmentation via deep feature
learning and sparse patch matching. IEEE Transactions on Medical Imaging 35(4),
1077–1089 (April 2016), https://doi.org/10.1109/TMI.2015.2508280.
65. Kim, M., Wu, G., Shen, D.: Unsupervised deep learning for hippocampus segmentation
in 7.0 Tesla MR images. In: Wu, G., Zhang, D., Shen, D., Yan, P., Suzuki, K., Wang,
F. (eds.) Machine Learning in Medical Imaging. Lecture Notes in Computer Science,
Vol. 8184, pp. 1–8. Springer International Publishing, Cham (2013), https://doi.org/10.1007/978-3-319-02267-3_1.
66. Abd-Ellah, M.K., Awad, A.I., Khalaf, A.A.M., Hamed, H.F.A.: Classification of brain
tumor MRIs using a kernel support vector machine. In: Li, H., Nykanen, P., Suomi,
R., Wickramasinghe, N., Widen, G., Zhan, M. (eds.) Building Sustainable Health
Ecosystems, WIS 2016. Communications in Computer and Information Science, Vol.
636. Springer, Cham, Tampere, Finland (2016).
67. Abd-Ellah, M.K., Awad, A.I., Khalaf, A.A.M., Hamed, H.F.A.: Design and implemen-
tation of a computer-aided diagnosis system for brain tumor classification. In: 2016
28th International Conference on Microelectronics (ICM). pp. 73–76. IEEE, Giza,
Egypt (2016).
68. Roth, H.R., Lu, L., Liu, J., Yao, J., Seff, A., Cherry, K., Kim, L., Summers, R.M.:
Improving computer-aided detection using convolutional neural networks and random
view aggregation. IEEE Transactions on Medical Imaging 35(5), 1170–1181 (May
2016), https://doi.org/10.1109/TMI.2015.2482920.
69. Ciompi, F., de Hoop, B., van Riel, S.J., Chung, K., Scholten, E.T., Oudkerk, M., de Jong,
P.A., Prokop, M., van Ginneken, B.: Automatic classification of pulmonary peri-fissural
nodules in computed tomography using an ensemble of 2D views and a convolutional
neural network out-of-the-box. Medical Image Analysis 26(1), 195–202 (September
2015), https://doi.org/10.1016/j.media.2015.08.001.
70. Gao, M., Bagci, U., Lu, L., Wu, A., Buty, M., Shin, H.C., Roth, H., Papadakis, G.Z.,
Depeursinge, A., Summers, R.M., Xu, Z., Mollura, D.J.: Holistic classification of CT
attenuation patterns for interstitial lung diseases via deep convolutional neural net-
works. Computer methods in biomechanics and biomedical engineering: Imaging &
Visualization 6(1), 1–6 (2016), https://doi.org/10.1080/21681163.2015.1124249.
71. Brosch, T., Tang, L.Y.W., Yoo, Y., Li, D.K.B., Traboulsee, A., Tam, R.: Deep 3D convo-
lutional encoder networks with shortcuts for multi-scale feature integration applied to
multiple sclerosis lesion segmentation. IEEE Transactions on Medical Imaging 35(5),
1229–1239 (May 2016), https://doi.org/10.1109/TMI.2016.2528821.
72. Dou, Q., Chen, H., Yu, L., Zhao, L., Qin, J., Wang, D., Mok, V.C., Shi, L., Heng, P.:
Automatic detection of cerebral microbleeds from MR images via 3D convolutional
neural networks. IEEE Transactions on Medical Imaging 35(5), 1182–1195 (May 2016),
10.1109/TMI.2016.2528129.
73. Abd-Ellah, M.K., Awad, A.I., Khalaf, A.A.M., Hamed, H.F.A.: A review on brain
tumor diagnosis from MRI images: Practical implications, key achievements, and les-
sons learned. Magnetic Resonance Imaging 61, 300–318 (2019), https://doi.org/10.1016/j.mri.2019.05.028.
74. Ciresan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Mitosis detection in
breast cancer histology images with deep neural networks. In: Mori, K., Sakuma,
I., Sato, Y., Barillot, C., Navab, N. (eds.) Medical Image Computing and Computer-
Assisted Intervention—MICCAI 2013. pp. 411–418. Springer International Publishing,
Berlin, Heidelberg (2013), https://doi.org/10.1007/978–3-642–40763–5_51.
75. Kleesiek, J., Urban, G., Hubert, A., Schwarz, D., Maier-Hein, K., Bendszus, M., Biller,
A.: Deep MRI brain extraction: A 3D convolutional neural network for skull stripping.
NeuroImage 129, 460–469 (2016).
76. Moeskops, P., Viergever, M.A., Mendrik, A.M., de Vries, L.S., Benders, M.J.N.L.,
Isgum, I.: Automatic segmentation of MR brain images with a convolutional neural
network. IEEE Transactions on Medical Imaging 35(5), 1252–1261 (May 2016).
77. Weisenfeld, N.I., Warfield, S.K.: Automatic segmentation of newborn brain MRI.
NeuroImage 47(2), 564–572 (2009).
78. Zhang, W., Li, R., Deng, H., Wang, L., Lin, W., Ji, S., Shen, D.: Deep convolutional
neural networks for multi-modality isointense infant brain image segmentation.
NeuroImage 108, 214–224 (2015).
79. Nie, D., Wang, L., Gao, Y., Shen, D.: Fully convolutional networks for multi-modal-
ity isointense infant brain image segmentation. In: 2016 IEEE 13th International
Symposium on Biomedical Imaging (ISBI). pp. 1342–1345 (April 2016).
80. Zhao, X., Wu, Y., Song, G., Li, Z., Fan, Y., Zhang, Y.: Brain tumor segmentation using
a fully convolutional neural network with conditional random fields. In: Crimi, A.,
Menze, B., Maier, O., Reyes, M., Winzeck, S., Handels, H. (eds.) Brainlesion: Glioma,
Multiple Sclerosis, Stroke and Traumatic Brain Injuries. Lecture Notes in Computer
Science, Vol. 10154, pp. 75–87. Springer International Publishing, Cham (2016).
81. Casamitjana, A., Puch, S., Aduriz, A., Vilaplana, V.: 3D convolutional neural networks
for brain tumor segmentation: A comparison of multi-resolution architectures. In: Crimi,
A., Menze, B., Maier, O., Reyes, M., Winzeck, S., Handels, H. (eds.) Brainlesion: Glioma,
Multiple Sclerosis, Stroke and Traumatic Brain Injuries. Lecture Notes in Computer
Science, Vol. 10154, pp. 150–161. Springer International Publishing, Cham (2016).
82. Pereira, S., Oliveira, A., Alves, V., Silva, C.A.: On hierarchical brain tumor segmenta-
tion in MRI using fully convolutional neural networks: A preliminary study. In: 2017
IEEE 5th Portuguese Meeting on Bioengineering (ENBENG). pp. 1–4 (Feb 2017).
83. Havaei, M., Davy, A., Warde-Farley, D., Biard, A., Courville, A., Bengio, Y., Pal, C.,
Jodoin, P.M., Larochelle, H.: Brain tumor segmentation with deep neural networks.
Medical Image Analysis 35, 18–31 (2017), https://doi.org/10.1016/j.media.2016.05.004.
84. Wang, G., Li, W., Ourselin, S., Vercauteren, T.: Automatic brain tumor segmenta-
tion using convolutional neural networks with test-time augmentation. In: Crimi,
A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) Brainlesion:
Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. pp. 61–72. Springer
International Publishing, Cham (2019), https://doi.org/10.1007/978–3-030–11726–9_6.
85. Abd-Ellah, M.K., Khalaf, A.A.M., Awad, A.I., Hamed, H.F.A.: TPUAR-Net: Two
parallel U-Net with asymmetric residual-based deep convolutional neural network for
brain tumor segmentation. In: Karray, F., Campilho, A., Yu, A. (eds.) Image Analysis
and Recognition. ICIAR 2019. Lecture Notes in Computer Science, Vol. 11663, pp.
106–116. Springer International Publishing, Cham (2019).
86. The Cancer Imaging Archive: RIDER NEURO MRI database. https://wiki.cancerimagingarchive.net/display/Public/RIDER+NEURO+MRI (2016), Accessed: February 5, 2019.
87. Tharwat, A.: Classification assessment methods. Applied Computing and Informatics
(2018), https://doi.org/10.1016/j.aci.2018.08.003.
88. Mohsen, H., El-Dahshan, E.S.A., El-Horbaty, E.S.M., Salem, A.B.M.: Classification
using deep learning neural networks for brain tumors. Future Computing and
Informatics Journal 3, 68–71 (2018).
10 Lossless Full-Resolution
Deep Learning
Convolutional Networks
for Skin Lesion Boundary
Segmentation
Mohammed A. Al-masni, Mugahed A. Al-antari,
and Tae-Seong Kim
CONTENTS
10.1 Introduction .................................................................................................. 262
10.2 Related Works............................................................................................... 263
10.2.1 Conventional Segmentation Methods ............................................... 263
10.2.2 Initial Deep Learning Segmentation Methods .................................264
10.2.3 Recent Deep Learning Segmentation Methods................................ 265
10.2.4 Previous Deep Learning Works on Skin Lesion Segmentation .......266
10.3 Materials and Methods ................................................................................. 267
10.3.1 Dataset .............................................................................................. 267
10.3.2 Data Preprocessing and Augmentation ............................................ 267
10.3.3 Full-Resolution Convolutional Networks ......................................... 269
10.3.4 Training and Testing......................................................................... 272
10.3.5 Evaluation Metrics............................................................................ 274
10.3.6 Key Component Optimization.......................................................... 275
10.3.6.1 The Effect of Training Data............................................... 275
10.3.6.2 The Effect of Network Optimizers .................................... 276
10.4 Results of Skin Lesion Segmentation ........................................................... 277
10.4.1 Segmentation Performance on the ISIC 2017 Test Dataset .............. 277
10.4.2 Network Computation....................................................................... 277
10.4.3 Comparisons of Segmentators ..........................................................280
10.5 Discussion..................................................................................................... 282
10.6 Conclusions................................................................................................... 283
Acknowledgments.................................................................................................. 283
Bibliography .......................................................................................................... 283
10.1 INTRODUCTION
Skin cancer is one of the most widely diagnosed cancers in the world. Indeed,
melanoma (i.e., malignant skin tumor) usually starts when melanocyte cells begin to
grow out of control [1]. Among different types of skin cancers, melanoma is the most
aggressive type due to its greater capability of spreading into other organs and its
higher death rate [2]. According to the annual report of the American Cancer Society
(ACS), about 99,550 new cases of skin cancer (excluding basal and squamous cell skin
cancers) were estimated to be diagnosed in the United States in 2018, and the estimated
deaths from this disease reached 13,460 cases [3]. Melanoma accounts for the majority of
these figures, representing an estimated 91.7% of the new cases and 69.2% of the expected
deaths. Moreover, it
is reported that melanoma is the most fatal skin cancer, with deaths from melanoma
making up 1.53% of total cancer deaths, and it represents 5.3% of all new cancer
cases [3, 4]. Regarding the survival statistics from the Surveillance, Epidemiology,
and End Results (SEER) Cancer Statistics Review (CSR) report [5], the five-year survival
rate of patients diagnosed early with melanoma is about 91.8%. However, this rate
decreases to 63% when the disease spreads into the lymph nodes. Therefore, a
skin lesion detected and correctly diagnosed at its earliest stage is highly curable,
and such detection reduces the mortality rate. This highlights the significance of
timely diagnosis and appropriate treatment of melanoma for patients’ survival.
Visual inspection with the naked eye during the medical examination of skin cancers
is hindered by the similarity between normal tissue and skin lesions, which
may produce an incorrect diagnosis [6, 7]. In order to allow better visualization of
skin lesions and improve the diagnosis of melanoma, different non-invasive imag-
ing modalities have been developed. Dermoscopy (also known as dermatoscopy or
epiluminescence microscopy) has become a gold-standard imaging technology that
assists dermatologists in improving the screening of skin lesions by visualizing
prominent features present under the skin surface. The key idea of dermoscopy is
that it acquires a magnified high-resolution image while reducing or filtering out
skin surface reflections, utilizing polarized light. In clinical practice, skin lesion
diagnosis is based on visual assessment of dermoscopy images. Although dermos-
copy images improve diagnostic precision, examination of dermoscopy images by
dermatologists via visual inspection is still tedious, time consuming, complex, sub-
jective, and fault prone [6, 7]. Hence, automated computerized diagnostic systems
for skin lesions are highly demanded to support and assist dermatologists in clinical
decision-making.
Indeed, automatic segmentation of skin lesion boundaries from surrounding
tissue is a key prerequisite procedure in any computerized diagnostic system of
skin cancer. Accurate segmentation of the lesion boundaries plays a critical role in
obtaining more prominent, specific, and representative features, which are utilized
to distinguish between different skin lesion diseases [8–11]. However, automatic seg-
mentation of skin lesions is still a challenging task due to their large variations in
size, shape, texture, color, and location in the dermoscopy images. Some challeng-
ing examples of skin lesion dermoscopy images are shown in Figure 10.1. These
examples show the presence of artifacts for screening: artificial artifacts such as ruler marks, ebony frames, air bubbles, and color illumination, and natural artifacts such as hair and blood vessels.
FIGURE 10.1 Examples of some challenging cases of skin lesions such as (a) low contrast,
(b) irregular fuzzy boundaries, (c) color illumination, (d) blood vessels, (e) ruler mark artifact,
(f) bubbles, (g) frame artifact, and (h) hair artifact. White line contours indicate the segmen-
tation of the lesions by expert dermatologists.
This chapter presents a novel deep learning method for skin lesion boundary seg-
mentation called full-resolution convolutional networks (FrCN) [12]; this method
produces very accurate segmentation. The FrCN method is a resolution-preserving
model that leads to learning high-level features and improved segmentation perfor-
mance. It directly learns the full-resolution attributes of each individual pixel of the
input image. This is achieved by removing all the subsampling layers from the net-
work. The evaluation of the effciency and effectiveness of the FrCN segmentation
method is presented using the well-established public International Skin Imaging
Collaboration (ISIC) 2017 challenge dataset. In addition, the performance of the
FrCN method is compared against well-known deep learning techniques such as
U-Net, SegNet, and fully convolutional network (FCN) methods under the same
experimental conditions.
For further details of these traditional techniques, the latest comprehensive reviews
of skin lesion boundary segmentation algorithms can be consulted [8, 27–30].
All of these methods utilize low-level features that rely only on pixel-level attributes.
Hence, these classical segmentation techniques still do not deliver satisfactory per-
formance and cannot overcome the challenges of hair artifacts and low contrast.
center pixel. In addition, it requires complex computation and a lot of execution time,
since the network should be processed separately for each patch.
segmentation challenge with an overall accuracy of 94.9% when using the ISIC 2016
challenge dataset.
Yuan et al. developed a skin lesion segmentation technique via deep FCN [56].
In order to handle the imbalance among lesion-tissue pixels, the authors extended
the well-known FCN approach by employing the Jaccard distance as a loss func-
tion. Their technique achieved a dice index of 93.8% and an overall segmentation
accuracy of 95.5% with the PH2 and ISIC 2016 datasets, respectively. Goyal and Yap
presented a multi-class semantic segmentation via FCN, which was capable of seg-
menting three classes of skin lesions (i.e., melanoma, seborrheic keratoses [SK], and
benign nevi) from the ISIC 2017 challenge dataset [57]. Their method achieved dice
metrics of 65.3%, 55.7%, and 78.5% for melanoma, seborrheic keratosis, and benign
lesions, respectively. Lin et al. proposed a comparison between two skin lesion seg-
mentation methods, C-means clustering and U-Net-based histogram equalization
[58]. They evaluated their methodologies utilizing the ISIC 2017 challenge dataset.
The U-Net method obtained a dice coefficient of 77.0%, which significantly
outperformed the clustering method's result of only 61.0%. In 2017, Yading Yuan
developed a skin lesion segmentation method utilizing deep convolutional-deconvolutional
neural networks (CDNN) [59]. The CDNN model was trained with various color
spaces of dermoscopy images using the ISIC 2017 dataset. The CDNN approach was
ranked first in the ISIC 2017 challenge, in which it obtained a Jaccard index (JAC)
of 76.5%.
TABLE 10.1
Distribution of the ISIC 2017 Challenge Dataset
Data Type Benign SK Melanoma Total
FIGURE 10.2 (a) and (b) are an exemplary pair of the original dermoscopy image and its
segmentation mask, which was annotated by expert dermatologists. (c) illustrates the seg-
mented lesion boundaries.
Considering the various image sizes of the ISIC 2017 dataset (i.e., a majority height-to-width ratio of 3:4), we
resized all of the dermoscopy images into 192 × 256 pixels utilizing bilinear interpo-
lation according to the report in [56], where the optimal segmentation achievement
was made with the image size of 192 × 256 among other different image sizes. In
fact, deep learning methods require a large amount of data for appropriate train-
ing. However, a limit on the sizes of medical image datasets, especially a limit on
reliable annotated ground-truths, is one of the challenges to adopting deep learning
approaches. Due to this, data augmentation was applied to enlarge the training data-
set. Augmentation is a process that generates augmented images utilizing the given
original images using various transformation techniques such as rotation, scaling,
and translation [62–64].
By taking the merit of the dermoscopy images (i.e., RGB color), the information
from three channels—hue-saturation-value (HSV)—could be derived in addition to
the original RGB images, producing more color space features. The augmented data-
set includes the RGB and HSV dermoscopy images and their rotated images with
angles of 0˚, 90˚, 180˚, and 270˚. Additionally, horizontal and vertical flipping was
applied. Thus, a total of 32,000 [i.e., (2,000 RGB + 2,000 HSV) × 4 rotations × 2 flips]
images with their corresponding ground-truth binary masks were utilized to
train the FrCN segmentation technique. Figure 10.3 shows a sample of the original
RGB images in addition to the 15 augmented images. In fact, the data augmentation
process reduces the overfitting problem of deep learning and improves the robust-
ness of the deep network.
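As an illustration of the augmentation arithmetic described above, the following Python sketch (using Pillow and NumPy, which the chapter does not prescribe) produces the 16 variants of one dermoscopy image; the function name and the exact composition of rotations and flips are assumptions.

```python
from PIL import Image
import numpy as np

def augment_dermoscopy_image(path):
    """Return 16 augmented variants of one dermoscopy image:
    (RGB + derived HSV) x 4 rotations x 2 flips, each resized to 192 x 256 pixels."""
    base = Image.open(path).convert("RGB")
    variants = []
    for img in (base, base.convert("HSV")):          # RGB plus the derived HSV color space
        for angle in (0, 90, 180, 270):               # four rotation angles
            rot = img.rotate(angle, expand=True)
            rot = rot.resize((256, 192), Image.BILINEAR)   # width x height = 256 x 192
            arr = np.asarray(rot)
            variants.append(np.fliplr(arr))            # horizontal flip
            variants.append(np.flipud(arr))            # vertical flip
    return variants                                    # 2 x 4 x 2 = 16 images
```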
FIGURE 10.3 The augmentation processes for the training images. Top two rows represent
RGB images, while the bottom two rows are for HSV images. From each original RGB image
(top left corner), we generate a total of 16 different images including four rotations and four
horizontal and vertical flippings on the original RGB and HSV images.
In classification tasks, a pooling or subsampling layer promotes the robustness of the classifier, since it
reduces overfitting, eliminates the redundancy of features, and minimizes computa-
tion time [45, 54]. This is because all the reduced attributes represent a single class
label of the input image. However, in pixel-wise segmentation tasks, the subsam-
pling layers cause a loss of spatial resolution in the features of the input images.
The second part of the segmentation networks is the upsampling layers followed
by the softmax classifier, which exploits the extracted features of each individual
pixel to distinguish it as a tissue or a lesion pixel. Recent deep learning segmentation
approaches have used complicated procedures to compensate for the missing fea-
tures of particular pixels, due to the reduced image size, via upsampling, deconvolu-
tion, bilinear interpolation, decoding, or Atrous convolution, increasing the number
of hyper-parameters [45, 47, 48, 50, 51]. Furthermore, these mechanisms require
further PI or CRF operations to refine boundaries of the segmented objects [51, 54].
These deep learning segmentation networks suffer from reduced spatial resolution
and loss of details. Because pixel-wise dependency is not adequately addressed
throughout the previous deep learning segmentation methods such as FCN, U-Net,
and SegNet, important information about some pixels is left out and is not easy to
retrieve.
The full-resolution convolutional networks (FrCN) method [12] is an end-to-end super-
vised deep network that is trained by mapping the entire input image to its corre-
sponding ground-truth mask with no resolution loss, leading to better segmentation
performance for skin lesion boundaries. In fact, the FrCN method is a resolution-
preserving model that leads to high-level feature learning and improves the segmen-
tation performance. The full-resolution feature maps of the input pixels are preserved
by removing all the subsampling layers from the architecture. Figure 10.4 illustrates the
architecture of the FrCN segmentation method. The FrCN allows each pixel in the
input dermoscopy image to extract its own features utilizing the convolutional layers.
FIGURE 10.4 An illustration of the FrCN segmentation method. There are no subsampling
or fully connected layers.
The FrCN architecture contains 16 convolutional layers, which are inspired by the
VGG-16 network [67].
It has multiple stacks of convolutional layers with different sizes of feature maps,
where each stack learns distinctive representations and increases the network
complexity. The last three FC-NN layers in Block-6 were substituted with
convolutional layers to allow the network to process input images of arbitrary
sizes as well as to generate pixel-wise maps of dense prediction with the same sizes
as the input images. Details of the number of maps and the convolutional filter sizes
in each layer are given in Table 10.2. The early layers of FrCN learn localization
information, while the late layers learn subtle and fine feature representations of skin
lesion boundaries.
A convolutional network is inherently translation invariant; its main
role is to extract different features from the entire dermoscopy image and then
produce feature maps using convolution operations [68]. In the deep learning net-
work, the filter kernels of the convolutional layers are represented by weights,
which are automatically updated during the training process. The basic compo-
nents of this convolutional network are the filter kernels W, the convolution operator *,
and the activation function φ(·). Therefore, the kth feature map of layer L is computed
as follows:
\[ F_L^k = \varphi\left(W_L^k * F_{L-1}^k + b_L^k\right), \tag{10.1} \]
where \(b_L^k\) is the bias applied to each feature map in each layer.
A nonlinear activation is applied via rectified linear units (ReLUs)
immediately after each convolutional layer to provide nonlinearity to the network. The
ReLU activation function is widely used rather than the traditional sigmoid and tanh
functions due to its capability to alleviate the vanishing gradient problem and to
train the network with higher computational efficiency [69, 70]. The ReLU activation
function is defined as follows:
\[ \varphi(x) = \max(0, x) = \begin{cases} x, & \text{if } x \geq 0, \\ 0, & \text{if } x < 0. \end{cases} \tag{10.2} \]
TABLE 10.2
Details of Architecture Layers of the FrCN Method
Block   Layer   Filter Size, Maps      Block   Layer   Filter Size, Maps
In addition, dropout is used with p = 0.5 after the convolutional layers Conv14 and
Conv15, as shown in the FrCN architecture in Figure 10.4 and Table 10.2. Dropout
is a technique used in deep learning to tackle the overfitting problem of deep layers
[71]. During training, it randomly eliminates some neural units along with their connec-
tions to prevent overfitting to the training data. Hence, every unit is retained with a
particular probability, and the outgoing weights of that unit are multiplied by
p at test time.
Finally, a softmax classifier, also known as multinomial logistic regression, is applied in the
last layer of the FrCN architecture to classify each pixel in the dermoscopy image
into two binary classes (i.e., lesion and non-lesion). This logistic regression function
generates a prediction for each pixel, which constitutes the segmented map. As
a result, the resolution of the spatial output maps is the same as that of the input data.
Cross-entropy is used as the loss function, in which the overall loss H of each pixel is
minimized throughout the training process as defined by
\[ H(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i, \tag{10.3} \]
where y and ŷ are the ground-truth delineation and the predicted segmented map,
respectively. The cross-entropy loss function is utilized when a deep convolutional
network is applied for pixel-wise recognition [45, 56].
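To make the overall design concrete, the following Keras sketch builds a resolution-preserving network in the spirit of FrCN. It is not the exact FrCN configuration: the chapter used Keras with a Theano backend, while this sketch assumes the tf.keras API, and the filter counts and kernel sizes of Table 10.2 are not reproduced here, so all layer widths below are illustrative assumptions. The reported initial learning rate of 0.2 for Adadelta is used in the compile step.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_frcn_like(input_shape=(192, 256, 3), n_classes=2):
    """A sketch of a 16-convolution, resolution-preserving segmentation network."""
    inputs = keras.Input(shape=input_shape)
    x = inputs
    # Stacked 3x3 convolutions (VGG-16-like) with 'same' padding and *no* pooling,
    # so the 192 x 256 spatial resolution is preserved end to end.
    for filters in (64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    # Convolutional replacements of the last fully connected layers (Block-6).
    x = layers.Conv2D(1024, 3, padding="same", activation="relu")(x)
    x = layers.Dropout(0.5)(x)                       # dropout after "Conv14"
    x = layers.Conv2D(1024, 1, padding="same", activation="relu")(x)
    x = layers.Dropout(0.5)(x)                       # dropout after "Conv15"
    # Pixel-wise softmax over the lesion / non-lesion classes ("Conv16").
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adadelta(learning_rate=0.2),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```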
The ISIC 2017 dataset is divided into three separate datasets as shown in Figure 10.5. A training dataset (i.e., 72.7% of the whole
dataset) is used to train the deep learning network with hyper-parameters, which are
evaluated utilizing the validation dataset (5.5% of the whole dataset). The optimal
deep learning model is selected according to its efficiency on the validation set.
Then, further evaluation is performed on the test dataset (21.8% of the whole dataset)
to obtain the overall performance of the network [65, 79].
The objective of training deep learning approaches is to optimize the weight
parameters in each layer. A single optimization cycle proceeds
as follows [38, 80]. First, the forward propagation pass sequentially computes the
output of each layer utilizing the training data. In the last output layer, the error
between the ground-truth and predicted labels is computed using the loss function.
To reduce the training error, back-propagation proceeds through the network lay-
ers. Consequently, the training weights of FrCN are updated using the training data.
Furthermore, the performance of FrCN is compared against the performance of
recent deep learning approaches such as FCN, U-Net, and SegNet utilizing the test
subset of 600 dermoscopy images from the ISIC 2017 dataset. All the segmentation
models were trained using the superior Adadelta optimization method (details in
Section 10.3.6.2) with a batch size of 20. The learning rate was initially set to 0.2 and
then reduced with automated updating throughout the training process. The training
network tends to converge at about the 200th epoch.
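A hypothetical training call with the reported settings (Adadelta, batch size of 20, roughly 200 epochs) might look as follows; x_train, y_train, x_val, and y_val are assumed to hold the augmented images and their one-hot pixel masks and are not defined in the chapter.

```python
# Reuses the build_frcn_like() sketch above (already compiled with Adadelta).
model = build_frcn_like()
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    batch_size=20, epochs=200)
```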
All experiments were implemented on a personal
computer (PC) with the following specifications: a CPU of Intel® Core(TM) i7-6850K
@ 3.60 GHz with 16 GB RAM and a GPU of NVIDIA GeForce GTX 1080. This
work was conducted with Python 2.7.14 on Ubuntu 16.04 OS using the Keras and
Theano DL libraries [81, 82].
The segmentation performance is evaluated using the following metrics:
\[ \mathrm{Sensitivity\ (SEN)} = \frac{TP}{TP + FN}, \tag{10.4} \]
\[ \mathrm{Specificity\ (SPE)} = \frac{TN}{TN + FP}, \tag{10.5} \]
\[ \mathrm{Dice\ Index\ (DIC)} = \frac{2 \times TP}{(2 \times TP) + FP + FN}, \tag{10.6} \]
\[ \mathrm{Jaccard\ Index\ (JAC)} = \frac{TP}{TP + FN + FP}, \tag{10.7} \]
\[ \mathrm{Accuracy\ (ACC)} = \frac{TP + TN}{TP + FN + TN + FP}, \tag{10.8} \]
\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \tag{10.9} \]
where TP and FP denote the true and false positives, while TN and FN indicate the
true and false negatives, respectively. If the lesion pixels were segmented correctly,
they were counted as TPs; otherwise, they were FNs. Conversely, the non-lesion
pixels were counted as TNs if they were classified correctly as non-lesion;
otherwise, they were FPs. An illustration of the sensitivity and specificity concepts
on a dermoscopy image is shown in Figure 10.6.
Moreover, the receiver operating characteristic (ROC) curve with its area under
the curve (AUC) was used for further segmentation evaluation.
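Under the assumption that the predicted and ground-truth lesion masks are available as binary NumPy arrays, Equations 10.4 through 10.9 can be evaluated with a short helper such as the following sketch; the function and variable names are illustrative.

```python
import numpy as np

def segmentation_metrics(pred_mask, gt_mask):
    """Pixel-wise evaluation of a predicted binary lesion mask against its
    ground-truth delineation (Equations 10.4-10.9)."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    tp = float(np.sum(pred & gt))      # lesion pixels segmented correctly
    tn = float(np.sum(~pred & ~gt))    # non-lesion pixels segmented correctly
    fp = float(np.sum(pred & ~gt))     # non-lesion pixels labeled as lesion
    fn = float(np.sum(~pred & gt))     # lesion pixels missed by the model
    return {
        "SEN": tp / (tp + fn),                               # Eq. 10.4
        "SPE": tn / (tn + fp),                               # Eq. 10.5
        "DIC": 2 * tp / (2 * tp + fp + fn),                  # Eq. 10.6
        "JAC": tp / (tp + fn + fp),                          # Eq. 10.7
        "ACC": (tp + tn) / (tp + fn + tn + fp),              # Eq. 10.8
        "MCC": (tp * tn - fp * fn) / np.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),  # Eq. 10.9
    }
```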
FIGURE 10.6 Concepts of sensitivity and specificity in terms of TP, FN, FP, and TN pixels.
The white contour indicates the ground-truth of the lesion boundary annotation by a derma-
tologist, while the red one is an exemplary segmented boundary.
TABLE 10.3
Segmentation Performance with Different Optimization Components
Performance Evaluation (%)
Component                      SEN     SPE     DIC     JAC     MCC     ACC
Data Augmentation
  Without Aug.                 73.40   92.07   79.59   63.36   72.88   88.81
  With Aug.                    81.20   94.15   87.74   73.58   82.89   91.89
Optimizer
  SGD                          74.05   96.44   77.57   63.36   73.22   92.53
  Adam                         78.60   97.35   82.23   69.83   78.81   94.08
  Adadelta                     81.20   94.15   87.74   73.58   82.89   91.89
Data augmentation improved the sensitivity, which reflects the ability of the network to segment the skin lesion
correctly, by 7.80%, as reported in Table 10.3. Due to this, the
overall segmentation accuracy increased from 88.81% to 91.89%. These results
confirm that training deep learning models with larger training data achieves better
performance and provides more feasible and reliable models. The overfitting prob-
lem is clearly shown in Figure 10.7 (a), which presents how the loss function diverges
over epochs on the validation dataset when the network is trained with the original
training dataset without data augmentation. In contrast, the loss function converged
rapidly when training the network with the larger augmented dataset.
The Adadelta optimizer outperformed the Adam and SGD optimizers, with overall Jaccard indices of 73.58%, 69.83%, and 63.36% on the validation
dataset, respectively. Furthermore, Figure 10.7 (b) illustrates how the loss function
declined during the training process on the training datasets. This figure provides
an indication of the convergence speed of the network under training over epochs.
Clearly, the training and validation curves of Adadelta are significantly better than
those of Adam and SGD. In addition, these curves clarify the network performance
on the validation data, in which a small variation between them could give the FrCN
the capability to perform well on unseen test data. Although the loss function of
the training data using the Adam optimizer decreases rapidly over epochs, the loss
for the validation data diverges, leading to reduced performance for skin lesion bound-
ary segmentation compared to Adadelta. The SGD optimizer provides the
worst performance with this dataset, as illustrated in Figure 10.7 (b).
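The comparison in Table 10.3 amounts to retraining the same network with each optimizer. A minimal sketch of how that could be set up is shown below, reusing the hypothetical build_frcn_like builder from the earlier sketch; the SGD and Adam learning rates are illustrative defaults, not values reported in the chapter.

```python
from tensorflow import keras

optimizers = {
    "SGD": keras.optimizers.SGD(learning_rate=0.01),         # assumed default
    "Adam": keras.optimizers.Adam(learning_rate=0.001),      # assumed default
    "Adadelta": keras.optimizers.Adadelta(learning_rate=0.2),  # setting used above
}
for name, opt in optimizers.items():
    model = build_frcn_like()   # hypothetical builder from the earlier sketch
    model.compile(optimizer=opt, loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(...) would then be run once per optimizer on the same data split.
```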
TABLE 10.4
Segmentation Performance (%) of the FrCN Compared to FCN, U-Net, and SegNet in Terms
of Sensitivity, Specificity, and Accuracy for Benign, SK, Melanoma, and Overall Cases
Benign Cases SK Cases Melanoma Cases Overall
Method SEN SPE ACC SEN SPE ACC SEN SPE ACC SEN SPE ACC
FCN 85.25 96.95 94.45 75.10 95.91 90.96 70.67 96.18 88.25 79.98 96.66 92.72
U-Net 76.76 97.26 92.89 43.81 97.64 84.83 58.71 96.81 84.98 67.15 97.24 90.14
SegNet 85.19 96.30 93.93 70.58 92.50 87.29 73.78 94.26 87.90 80.05 95.37 91.76
FrCN 88.95 97.44 95.62 82.37 94.08 91.29 78.91 96.04 90.78 85.40 96.69 94.03
TABLE 10.5
Segmentation Performance (%) of the FrCN Compared to FCN, U-Net, and SegNet in Terms
of Dice, Jaccard, and MCC Indices for Benign, SK, Melanoma, and Overall Cases
Benign Cases SK Cases Melanoma Cases Overall
Method DIC JAC MCC DIC JAC MCC DIC JAC MCC DIC JAC MCC
FCN 86.77 76.63 83.28 79.81 66.40 74.26 78.89 65.14 71.84 83.83 72.17 79.30
U-Net 82.16 69.72 78.05 57.88 40.73 53.89 70.82 54.83 63.71 76.27 61.64 71.23
SegNet 85.69 74.97 81.84 72.54 56.91 64.32 79.11 65.45 71.03 82.09 69.63 76.79
FrCN 89.68 81.28 86.90 81.83 69.25 76.11 84.02 72.44 77.90 87.08 77.11 83.22
FIGURE 10.8 ROC curves of different deep learning methods for skin lesion boundary seg-
mentation on the ISIC 2017 test dataset for (a) benign, (b) SK, (c) melanoma, and (d) overall
clinical cases.
image took about 9.7 seconds. This should make the FrCN segmentation approach
applicable for clinical practice. In addition, Table 10.6 shows the trainable param-
eters along with the training computation time per epoch and the test time per
single dermoscopy image for each of the segmentation techniques. Clearly, the
FrCN approach seems feasible for medical practices, since less than 10 seconds of
inference time is required to segment the suspicious lesion from the dermoscopy
image. The computational speed during the training process of the FrCN approach
was faster than that of the other segmentation techniques. Table 10.6 shows the speedup of the
FrCN approach in training time compared to the FCN model (i.e.,
from 651 to 315 seconds per epoch). Indeed, this is because FrCN was executed
without the need for the downsampling processes, while keeping the same convolu-
tion operations.
the most popular deep learning methods of FCN, U-Net, and SegNet on the ISIC
2017 datasets.
FIGURE 10.9 The lesion contours segmentation with statistical evaluation per each image
of the FrCN against FCN, U-Net, and SegNet for (a) benign, (b) seborrheic keratosis, and (c)
melanoma cases from the ISIC 2017 dataset. Segmentation results are depicted as follows:
ground-truth, white; FrCN, blue; FCN, green; U-Net, red; and SegNet, black.
TABLE 10.6
Measurements in Seconds of the Training Time
per Epoch and Test Time per Dermoscopy Image
Method    Trainable Parameters    Training Time/Epoch    Test Time/Single Image
TABLE 10.7
Performance (%) of the FrCN Method Compared to the Latest Studies in the
Literature on Skin Lesion Segmentation
Ranked in the
References ISIC Challenge SEN SPE DIC JAC ACC
Yuan et al. (CDNN) [59] Top 1 82.50 97.50 84.90 76.50 93.40
Berseth [91] Top 2 82.00 97.80 84.70 76.20 93.20
Bi et al. (ResNet) [92] Top 3 80.20 98.50 84.40 76.00 93.40
Bi et al. (Multi-scale ResNet) [92] Top 4 80.10 98.40 84.20 75.8 93.40
Menegola et al. [93] Top 5 81.70 97.00 83.90 75.40 93.10
Lin et al. (U-Net) [58] – – – 77.00 62.00 –
FrCN – 85.40 96.69 87.08 77.11 94.03
the skin lesion boundary segmentation. However, the FrCN method provides prom-
ising results with fewer layers (i.e., only 16 layers) compared to the ResNet model,
which employed a deeper network of 50 layers. Although the U-Net-
based histogram equalization technique proposed in [58] outperformed the clustering
method, its segmentation performance was lower than that of the recent deep learn-
ing approaches, with Jaccard and dice indices of 62.00% and 77.00%, respectively. The
distribution of the segmentation performance on the skin lesion boundaries of the ISIC 2017
test dataset (i.e., 600 images) in terms of the Jaccard index for different segmentation
networks is shown in Figure 10.10 and Table 10.7. This figure illustrates the boxplot
performance of the different models and also shows the median, lower, and upper Jaccard
measures on the test dataset. Obviously, the FrCN method achieves a slightly higher
median Jaccard value than the top-ranked method, with fewer counts below a Jaccard index
of 0.4. With regard to the overall pixel-wise segmentation results of the skin lesion
task, the FrCN method proves its feasibility and effectiveness compared to other methods.
FIGURE 10.10 Boxplots of the Jaccard index for different segmentation networks on the
ISIC 2017 challenge test dataset.
10.5 DISCUSSION
Automatic delineation of skin lesion boundaries is highly demanded as a prerequisite
step for melanoma recognition. Accurate segmentation of skin lesions enables the
classification CNNs to extract more specific and representative features from only
segmented skin lesion areas instead of entire dermoscopy images. Generally, the overall
segmentation performance of deep learning approaches improves as the size of the
training dataset increases. The impact of various training dataset sizes is presented in
[31, 62, 84], in which the results showed the significance of using larger training sets. In
this work, the training datasets are augmented by utilizing two procedures. First, HSV
images are generated from the original RGB images. Second, the training dermoscopy
images are augmented by rotating the RGB and HSV images with four rotation angles.
Furthermore, the images are flipped by applying horizontal and vertical flipping. The
augmented training data should support the network to learn better features of the
skin lesion characteristics. The results present the capability of the FrCN technique to
segment skin lesions with accuracies higher than those obtained with the conventional
methods. This is because the FrCN method is trained with the full spatial resolution
of the input images. In our experiments, the segmentation performance of the FrCN
technique was compared against three well-known segmentation approaches, namely
FCN, U-Net, and SegNet, under the same test conditions. Quantitative evaluations with
different indices of the recent deep learning methods are reported in Tables 10.4 and
10.5 for the ISIC 2017 test dataset. Significantly, the FrCN method outperformed other
skin lesion segmentation approaches. The MCC index showed how the segmented
skin lesions were correlated with the annotated ground-truths. The FrCN architecture
obtained overall increments of 3.92%, 11.99%, and 6.43% in MCC as compared to
FCN, U-Net, and SegNet, respectively. These promising segmentation results indicate
the consistency and capability of the FrCN method.
In this work, further analysis was done to show the segmentation performance
for clinical diagnosis cases. In the ISIC 2017 test dataset, the benign cases achieved
a higher segmentation performance than the melanoma and SK cases, with Jaccard
indices of 81.28%, 72.44%, and 69.25%, respectively, as shown in Table 10.5. This
significant improvement for the benign cases is due to the larger percentage
of the cases in the training dataset (i.e., 1,372/2,000 cases) and in the test dataset (i.e.,
393/600), as numbered in the database distribution in Table 10.1. Note that due to the
inequality of the numbers of each diagnostic type, the segmentation results of the
overall indices shown in Tables 10.4 and 10.5 were not computed as direct averages
of all cases, but were computed as percentages of their presence.
Lossless Deep Learning FrCN 283
Figure 10.9 illustrates some examples of the segmentation results of the FrCN
versus FCN, U-Net, and SegNet compared to the ground-truth contours. This fgure
shows the segmentation results of each clinical diagnostic case in the ISIC 2017
dataset. All approaches showed reasonable segmentation performance on the skin
lesion cases such as the presented in Figure 10.9 (a). In contrast, FrCN achieved bet-
ter segmentation when the dermoscopy image contained hair obstacles, as shown in
Figure 10.9 (c). Also, Figure 10.9 (b) illustrates how the FrCN method segments one
of the most challenging skin lesion cases with very low contrast.
In fact, the FrCN method is able to take an input image of any arbitrary size since
there are no downsampling layers. However, there is a demand on the computational
resources (i.e., enough memory), since the original dermoscopy images come with
very large sizes (i.e., from 540 × 722 to 4,499 × 6,748). In this work, all images are
resized in a preprocessing step for two reasons. One is to compare the conventional
segmentation methods that require the fxed image size. The other is to take our
computation resources into account.
10.6 CONCLUSIONS
In this chapter, we presented a full-resolution convolutional networks (FrCN)
method for skin lesion segmentation. Unlike the previous well-known segmentation
approaches, the FrCN method is able to utilize full spatial resolution features for each
pixel of dermoscopy images, improving in the performance of pixel-wise segmenta-
tion. We have evaluated the FrCN method utilizing the ISIC 2017 challenge dataset.
The results show that the FrCN method outperformed the recent FCN, U-Net, and
SegNet techniques. In future work, a larger number of training dermoscopy images
should be used to improve the segmentation performance of each class. Developing
a computer-aided diagnostic system using segmented lesions is also necessary to
distinguish between benign, melanoma, and normal skin lesions.
ACKNOWLEDGMENTS
This work was supported by the International Collaborative Research and
Development Programme (funded by the Ministry of Trade, Industry and Energy
[MOTIE, Korea]) (N0002252). This work was also supported by the National
Research Foundation of Korea (NRF) grant funded by the Korean government
(MEST) (NRF-2019R1A2C1003713).
BIBLIOGRAPHY
1. American Joint Committee on Cancer, “Melanoma of the skin,” Cancer Staging
Manual, pp. 209–220, Springer, New York, NY, 2002.
2. M. E. Celebi, H. A. Kingravi, B. Uddin, H. Lyatornid, Y. A. Aslandogan, W. V.
Stoecker, and R. H. Moss, “A methodological approach to the classifcation of der-
moscopy images,” Computerized Medical Imaging and Graphics, vol. 31, no. 6, pp.
362–373, September, 2007.
3. R. L. Siegel, K. D. Miller, and A. Jemal, “Cancer statistics,” CA Cancer Journal for
Clinicians, vol. 68, no. 1, pp. 7–30, 2018.
284 Deep Learning in Computer Vision
4. American Cancer Society, “Cancer facts & fgures 2018, American Cancer Society,
Atlanta, 2018,” Accessed [September 3, 2018]; https://www.cancer.org/cancer/melan
oma-skin-cancer.html.
5. A. M. Noone, N. Howlader, M. Krapcho, D. Miller, A. Brest, M. Yu, J. Ruhl, Z.
Tatalovich, A. Mariotto, D. R. Lewis, H. S. Chen, E. J. Feuer, and K. A. Cronin, “SEER
cancer statistics review, 1975–2015,” National Cancer Institute, 2018.
6. M. E. Vestergaard, P. Macaskill, P. E. Holt, and S. W. Menzies, “Dermoscopy com-
pared with naked eye examination for the diagnosis of primary melanoma: A meta-
analysis of studies performed in a clinical setting,” British Journal of Dermatology,
vol. 159, no. 3, pp. 669–676, 2008.
7. M. Silveira, J. C. Nascimento, J. S. Marques, A. R. Marçal, T. Mendonça, S. Yamauchi,
J. Maeda, and J. Rozeira, “Comparison of segmentation methods for melanoma diagno-
sis in dermoscopy images,” IEEE Journal of Selected Topics in Signal Processing, vol.
3, no. 1, pp. 35–45, 2009.
8. M. E. Celebi, H. Iyatomi, G. Schaefer, and W. V. Stoecker, “Lesion border detection in
dermoscopy images,” Computerized Medical Imaging and Graphics, vol. 33, no. 2, pp.
148–53, March, 2009.
9. H. Ganster, A. Pinz, R. Rohrer, E. Wildling, M. Binder, and H. Kittler, “Automated
melanoma recognition,” IEEE Transactions on Medical Imaging, vol. 20, no. 3, pp.
233–239, March, 2001.
10. E. Meskini, M. S. Helfroush, K. Kazemi, and M. Sepaskhah, “A new algorithm for skin
lesion border detection in dermoscopy images,” Journal of Biomedical Physics and
Engineering, vol. 8, no. 1, pp. 117–126, 2018.
11. G. Schaefer, B. Krawczyk, M. E. Celebi, and H. Iyatomi, “An ensemble classifcation
approach for melanoma diagnosis,” Memetic Computing, vol. 6, no. 4, pp. 233–240,
December, 2014.
12. M. A. Al-Masni, M. A. Al-antari, M. T. Choi, S. M. Han, and T. S. Kim, “Skin lesion
segmentation in dermoscopy images via deep full resolution convolutional networks,”
Computer Methods and Programs in Biomedicine, vol. 162, pp. 221–231, August, 2018.
13. M. E. Yuksel, and M. Borlu, “Accurate segmentation of dermoscopic images by image
thresholding based on type-2 fuzzy logic,” IEEE Transactions on Fuzzy Systems, vol.
17, no. 4, pp. 976–982, August, 2009.
14. K. Mollersen, H. M. Kirchesch, T. G. Schopf, and F. Godtliebsen, “Unsupervised seg-
mentation for digital dermoscopic images,” Skin Research and Technology, vol. 16, no.
4, pp. 401–407, November, 2010.
15. M. E. Celebi, Q. Wen, S. Hwang, H. Iyatomi, and G. Schaefer, “Lesion border detection
in dermoscopy images using ensembles of thresholding methods,” Skin Research and
Technology, vol. 19, no. 1, pp. E252–E258, 2013.
16. F. Peruch, F. Bogo, M. Bonazza, V. M. Cappelleri, and E. Peserico, “Simpler, faster,
more accurate melanocytic lesion segmentation through MEDS,” IEEE Transactions
on Biomedical Engineering, vol. 61, no. 2, pp. 557–565, February, 2014.
17. H. Y. Zhou, G. Schaefer, A. H. Sadka, and M. E. Celebi, “Anisotropic mean shift based
fuzzy c-means segmentation of dermoscopy images,” IEEE Journal of Selected Topics
in Signal Processing, vol. 3, no. 1, pp. 26–34, February, 2009.
18. S. Kockara, M. Mete, V. Yip, B. Lee, and K. Aydin, “A soft kinetic data structure for
lesion border detection,” Bioinformatics, vol. 26, no. 12, pp. i21–i28, June 15, 2010.
19. S. Suer, S. Kockara, and M. Mete, “An improved border detection in dermoscopy
images for density based clustering,” BMC Bioinformatics, vol. 12, no. 10, p. S12.
20. F. Y. Xie, and A. C. Bovik, “Automatic segmentation of dermoscopy images using self-
generating neural networks seeded by genetic algorithm,” Pattern Recognition, vol. 46,
no. 3, pp. 1012–1019, March, 2013.
Lossless Deep Learning FrCN 285
54. L. Bi, J. Kim, E. Ahn, A. Kumar, M. Fulham, and D. Feng, “Dermoscopic image
segmentation via multistage fully convolutional networks,” IEEE Transactions on
Biomedical Engineering, vol. 64, no. 9, pp. 2065–2074, 2017.
55. L. Q. Yu, H. Chen, Q. Dou, J. Qin, and P. A. Heng, “Automated melanoma recogni-
tion in dermoscopy images via very deep residual networks,” IEEE Transactions on
Medical Imaging, vol. 36, no. 4, pp. 994–1004, 2017.
56. Y. D. Yuan, M. Chao, and Y. C. Lo, “Automatic skin lesion segmentation using deep
fully convolutional networks with jaccard distance,” IEEE Transactions on Medical
Imaging, vol. 36, no. 9, pp. 1876–1886, 2017.
57. M. Goyal, and M. H. Yap, “Multi-class semantic segmentation of skin lesions via fully
convolutional networks,” arXiv preprint arXiv:1711.10449, 2017.
58. B. S. Lin, K. Michael, S. Kalra, and H. R. Tizhoosh, “Skin lesion segmentation: U-Nets
versus clustering,” in 2017 IEEE Symposium Series on Computational Intelligence
(SSCI), 2017, pp. 1–7.
59. Y. Yuan, “Automatic skin lesion segmentation with fully convolutional-deconvolutional
networks,” arXiv preprint arXiv:1703.05165, 2017.
60. N. C. F. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza,
A. Kalloo, K. Liopyris, N. Mishra, H. Kittler, and A. Halpern, “Skin lesion analy-
sis toward melanoma detection: A challenge at the 2017 International Symposium on
Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration
(ISIC),” in 15th IEEE International Symposium on Biomedical Imaging (ISBI 2018),
2018, pp. 68–172.
61. International Skin Imaging Collaboration, “ISIC 2017: Skin lesion analysis towards mela-
noma detection,” Accessed [October 19, 2018]; https://challenge.kitware.com/#challenges.
62. Z. Jiao, X. Gao, Y. Wang, and J. Li, “A deep feature based framework for breast masses
classifcation,” Neurocomputing vol. 197, pp. 221–231, 2016.
63. T. Kooi, G. Litjens, B. van Ginneken, A. Gubern-Merida, C. I. Sancheza, R. Mann, A. den
Heeten, and N. Karssemeijer, “Large scale deep learning for computer aided detection of
mammographic lesions,” Medical Image Analysis, vol. 35, pp. 303–312, January, 2017.
64. H. R. Roth, L. Lu, J. M. Liu, J. H. Yao, A. Seff, K. Cherry, L. Kim, and R. M. Summers,
“Improving computer-aided detection using convolutional neural networks and ran-
dom view aggregation,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp.
1170–1181, May, 2016.
65. M. Lin, Q. Chen, and S. Yan, “Network in network,” in rXiv preprint arXiv:1312.4400,
pp. 1–10, 2013.
66. D. Scherer, A. Müller, and S. Behnke, “Evaluation of pooling operations in convo-
lutional architectures for object recognition,” in Artifcial Neural Networks–ICANN
2010, 2010, pp. 92–101.
67. K. Simonyan, and A. Zisserman, “Very deep convolutional networks for large-scale
image recognition,” arXiv preprint arXiv:1409.1556, 2014.
68. Q. Guo, F. L. Wang, J. Lei, D. Tu, and G. H. Li, “Convolutional feature learning and
Hybrid CNN-HMM for scene number recognition,” Neurocomputing, vol. 184, pp.
78–90, April 5, 2016.
69. A. L. Maas, A. Y. Hannun, and A. Y. Ng., “Rectifer nonlinearities improve neural
network acoustic models,” in Proceeding of 30th International Conference on Machine
Learning (ICML), 2013, p. 3.
70. X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifer neural networks,” in
Proceedings of the 14th International Conference on Artifcial Intelligence and
Statistics, 2011, pp. 315–323.
71. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout:
A simple way to prevent neural networks from overftting,” Journal of Machine
Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
288 Deep Learning in Computer Vision
72. S. Hoo-Chang, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and
R. M. Summers, “Deep convolutional neural networks for computer-aided detection:
CNN architectures, dataset characteristics and transfer learning,” IEEE Transactions
on Medical Imaging, vol. 35, no. 5, pp. 1285–1298, 2016.
73. Y. Bar, I. Diamant, L. Wolf, S. Lieberman, E. Konen, and H. Greenspan, “Chest pathol-
ogy identifcation using deep feature selection with non-medical training,” Computer
Methods in Biomechanics and Biomedical Engineering-Imaging and Visualization,
vol. 6, no. 3, pp. 259–263, 2018.
74. R. K. Samala, H. P. Chan, L. Hadjiiski, M. A. Helvie, J. Wei, and K. Cha, “Mass detec-
tion in digital breast tomosynthesis: Deep convolutional neural network with trans-
fer learning from mammography,” Medical Physics, vol. 43, no. 12, pp. 6654–6666,
December, 2016.
75. J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep
neural networks?” in Advances in Neural Information Processing Systems, 2014, pp.
3320–3328.
76. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. H. Huang, A.
Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale
visual recognition challenge,” International Journal of Computer Vision, vol. 115, no.
3, pp. 211–252, December, 2015.
77. E. Szymanska, E. Saccenti, A. K. Smilde, and J. A. Westerhuis, “Double-check:
validation of diagnostic statistics for PLS-DA models in metabolomics studies,”
Metabolomics, vol. 8, no. 1, pp. S3–S16, June, 2012.
78. S. Smit, M. J. van Breemen, H. C. J. Hoefsloot, A. K. Smilde, J. M. F. G. Aerts, and
C. G. de Koster, “Assessing the statistical validity of proteomics based biomarkers,”
Analytica Chimica Acta, vol. 592, no. 2, pp. 210–217, June 5, 2007.
79. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning; Data
Mining, Inference and Prediction, Second ed., Springer, New York, 2008.
80. S. Min, B. Lee, and S. Yoon, “Deep learning in bioinformatics,” Briefngs in
Bioinformatics, vol. 18, no. 5, pp. 851–869, September, 2017.
81. Image Segmentation Keras, “Implementation of Segnet, FCN, UNet and other models
in Keras,” Accessed [October 23, 2018]; https://github.com/divamgupta/image-segm
entation-keras.
82. Keras, “Keras: The python deep learning library,” Accessed [January 8, 2019]; https://
keras.io/.
83. M. Dong, X. Lu, Y. Ma, Y. Guo, Y. Ma, and K. Wang, “An effcient approach for
automated mass segmentation and classifcation in mammograms,” Journal of Digital
Imaging, vol. 28, no. 5, pp. 613–25, 2015.
84. M. A. Al-antari, M. A. Al-masni, S. U. Park, J. Park, M. K. Metwally, Y. M. Kadah,
S. M. Han, and T. S. Kim, “An automatic computer-aided diagnosis system for breast
cancer in digital mammograms via deep belief network,” Journal of Medical and
Biological Engineering, vol. 38, no. 3, pp. 443–456, June, 2018.
85. D. Zikic, Y. Ioannou, M. Brown, and A. Criminisi, “Segmentation of brain tumor tis-
sues with convolutional neural networks,” in MICCAI-BRATS, 2014, pp. 36–39.
86. S. Pereira, A. Pinto, V. Alves, and C. A. Silva, “Brain tumor segmentation using convo-
lutional neural networks in MRI images,” IEEE Transactions on Medical Imaging, vol.
35, no. 5, pp. 1240–1251, May, 2016.
87. D. M. Powers, “Evaluation: From precision, recall and F-measure to ROC, informed-
ness, markedness & correlation,” Journal of Machine Learning Technologies, vol. 2,
no. 1, pp. 37–63, 2011.
88. M. D. Zeiler, “ADADELTA: An adaptive learning rate method,” arXiv preprint
arXiv:1212.5701, 2012.
Lossless Deep Learning FrCN 289
CONTENTS
11.1 Introduction .................................................................................................. 291
11.2 Literature Review ......................................................................................... 293
11.3 Background................................................................................................... 294
11.3.1 Preprocessing.................................................................................... 294
11.3.2 Convolutional Neural Networks ....................................................... 296
11.3.2.1 Alex-Net CNN Model ........................................................ 296
11.3.2.2 Resnet50............................................................................. 297
11.3.3 Transfer Knowledge for Deep Neural Network................................ 298
11.4 Methodology................................................................................................. 299
11.4.1 The Augmentation Process............................................................... 299
11.4.2 Transfer Learning ............................................................................. 301
11.4.3 Dropout Layer...................................................................................302
11.5 Experimental Results....................................................................................304
11.5.1 Performance of the Alex-net.............................................................304
11.5.2 Performance of the Resnet50............................................................307
11.6 Discussion.....................................................................................................307
11.7 Conclusion .................................................................................................... 311
Bibliography .......................................................................................................... 312
11.1 INTRODUCTION
Computer-supported skin lesion inspection and dermal investigation have been
extended over the past decade to classify and model dermatological diseases and
human skin [1]. Automated systems for analyzing and synthesizing human skin face
different diffculties and complexities such as humidity and seasonal variation in
temperature, geographically based differences in diseases, environmental factors,
and hair presences [2]. Computerized assessment of skin lesions is an active research
291
292 Deep Learning in Computer Vision
area. The main target of this feld is to create systems able to analyze images of pig-
mented skin lesions for automated diagnosis of cancerous lesions or to assist derma-
tologists in this task [3]. The most important challenge in this context is classifying
the color images of the skin lesions with highly accurate classifers [4]. In terms of
dermoscopic images, skin lesions fall into different categories based on the disease
type, severity, and stage [5]. There are many algorithms used to extract skin lesion
features [6]. The “ABDC” rule is the most known algorithm, where A, B, D, and C
refer to lesion Asymmetry, irregularity of the Border, Color variation, and lesion
Diameter, respectively. COLOR is the most prevalent content that can be easily visu-
ally observed when retrieving images. This is due to its value in isolating diseased
skin from wholesome [7]. However, color-based features are highly infuenced by
the condition of the patient and his/her external environment such as variations in
natural skin complexion, lighting, and camera specifcations [8].
During the last decade, there have been many studies on classifying non-mela-
nocytic and melanocytic skin lesions. These studies vary in using statistical meth-
ods, traditional feature extraction, and classifcation methods [11–15]. Recently, deep
learning has been raised as a promising tool for feature extraction of images through
different models. In these models, many layers of neurons have been used to fulfll
different layers to extract the fne details of the image, where the new layer combines
the characteristics of the previous one. This comprises the newest generic of such
models, which are called convolutional neural networks (CNNs) [16]. Applying deep
learning techniques in recent healthcare research area has had positive results, as
suggested by Dubal et al. and Kumar et al. [17, 18]. There are many studies using
different learning techniques, as can be found in Nasr-Esfahan et al., Premaladha
and Ravichandran, and Kostopoulos et al. [19–22]. Various kinds of skin tumor have
been found, like squamous cell carcinoma, basal cell carcinoma, and melanoma;
the last of these is the most unpredictable. Jain et al. [23] suggested that detecting
melanoma at an early stage is helpful for curing it. Precisely analyzing each patient
sample is a tedious task for dermatologists; therefore, an effcient automated system
with a high classifcation rate is necessary to assess the dangers associated with the
given samples [24].
Martin et al. [25] showed that melanoma is a highly invasive and malignant
tumor that can reach the bloodstream easily and cause metastasis in affected
patients. Azulay [26] reported that one of the hardest challenges in dermatology
is to distinguish malignant melanotic lesions (melanoma) from non-malignant
melanotic lesions (nevus), since both diseases share morphological character-
istics. Melanomas may also take on features that may not be common to their
morphology, which can cause it to be confused with a completely different mela-
noma disease. Goodfellow et al. [16] concluded that CNNs are the best tool for
recognizing patterns in digital images. In this chapter, we assessed the existing
methods for skin lesion classifcation, especially deep learning-based methods.
It was observed that existing deep learning-based methods encounter two major
problems that negatively affect their performance and signifcantly reduce their
classifcation rates. First, color images of skin lesions are associated with compli-
cated backgrounds. Second, the available skin lesion datasets contain a limited
Skin Melanoma Classifcation Using Deep CNNs 293
number of images, while deep learning classifcation methods need a large num-
ber of images for good training.
These challenges motivated us to present a new method that signifcantly improves
the skin cancer classifcation rate and outperforms all previous methods. High skin
lesion detection and classifcation rates enable dermatologists and physicians to make
the right decisions at an earlier stage, which generally improves patients’ health and
saves their lives. The proposed method is a pretrained deep convolutional neural
network-based classifcation method. The well-known datasets of color images of
skin, MED-NODE, Dermatology Information System (DermIS), & DermQuest,
are used to prove the validity of the proposed method and evaluate its performance
against the other existing methods. The obtained results clearly show that the pro-
posed method achieves the best classifcation rate.
has been proposed by Nasr-Esfahan et al. [19]. The accuracy of the system is
increased by preprocessing to correlate the illumination and the segmentation of
the region of interest (ROI). The proposed system sends the image to the CNN
after segmentation for feature extraction. The system achieves a classifcation
rate of 81%. Premaladha and Ravichandran [20] proposed a CAD system that
combines supervised and deep learning algorithms in skin image classifcation.
The contrast-limited adaptive histogram equalization technique (CLAHE) is used
to enhance the input image; then, the median flter with the normalized Otsu’s
segmentation (NOS) method is used to separate the affected lesion from normal
skin. This system achieved a classifcation rate of 90.12% with artifcial neu-
ral networks (ANN) and a rate of 92.89% with deep learning neural networks
(DLNN). Kostopoulos et al. [21] proposed a computer-based analysis of plain
photography using a probabilistic neural network (PNN) to extract the features
and decide if the lesion is melanocytic or melanoma. The system achieved a clas-
sifcation rate of 76.2%. Esteva et al. [22] classifed the skin lesion using a pre-
trained CNN called Inception v3. They achieved accuracy of about 71.2% when
they augmented the images in the dataset using rotations with random angles
from 0° and 359° and vertical fipping. Table 11.1 summarizes and compares these
state-of-the-art methods.
In Hosny et al. [42], transfer learning has been applied to a CNN model. To extract
ROI from images, images have been segmented to discard the background. To over-
come the limitation of low numbers of images, different methods of augmentation
have been carried on the segmented images. They achieved 97.70% accuracy. In
[43], the last layers are replaced in a pretrained CNN model, with different ways of
augmentation, and fne-tuned; an accuracy of 98.61% is achieved.
11.3 BACKGROUND
During the last few years, deep neural networks have gained researchers’ attention
around the world as an alternative approach to extracting features. The core idea of
deep neural networks is that these networks automatically select the most important
features without performing feature extraction [31]. This mechanism is a powerful
step in machine intelligence [33]. Through the following subsections, a brief descrip-
tion of the convolutional neural networks is given.
11.3.1 PREPROCESSING
In machine vision and image-processing applications, segmenting the ROI is a very
crucial task. Image segmentation usually improves the accuracy of classifcation,
since this allows the background of the original image to be ignored. We utilized the
segmentation method of Basavaprasad and Hegad [39] to segment the ROIs from the
color images of skin. In this method, a number of bins are used to coarsely represent
the input color image. The spatial information is used by coarse representation from
a histogram-based windowing process, and then the coarse data of the image is clus-
tered by hierarchical clustering.
TABLE 11.1
Comparison of Literature Review
Classifcation Accuracy
Method Dataset Enhancement Segmentation Image type method (%)
Bunte et al. [7] Real-world dataset University of Groningen Yes No RGB Image retrieval 84
Chang et al. [9] Collected by authors No No Gray level SVM 82.30
Kundu et al. [10] Collected by authors Gray level MLP 94.28
Amelard et al. [11] DermIS & DermQuest Yes Yes RGB SVM 87.38
Almaraz et al. [13] DermIS & DermQuest Yes Yes RGB SVM 75.1
Karabulut et al. [12] DermIS & DermQuest Yes Yes Black & white DCNN 71.4
Karabulut et al. [12] DermIS & DermQuest Yes Yes RGB SVM 71.4
Premaladha et al. [20] MED-NODE Yes Yes Black & white DLNN 92.89
Premaladha et al. [20] MED-NODE Yes Yes Black & white ANN 90.12
Skin Melanoma Classifcation Using Deep CNNs
CNNs have been used successfully with signifcant improved performance in dif-
ferent applications such as natural language processing [28] and visual tasks [29].
Many deep convolution neural networks (DCNNs) are available, including LeNet,
Alex-net, ZFNet, GoogLeNet, VGGNet, and Resnet [30, 31].
with a stride of 2 pixels. The output of this layer is 27 × 27 × 256, which splits to two
GPUs, where each GPU works with 27 × 27 × 128.
The third, fourth, and ffth convolutional layers are created without any normal-
ization and pooling layers. The third layer of the convolutional network is connected
to the (pooled, normalized) outputs of the second convolutional layer with 384 ker-
nels, each of size 3 × 3 × 192. The fourth and ffth convolutional layers have 384 and
256 kernels of sizes 3 × 3 × 192 and 3 × 3 × 128, respectively. There are two fully con-
nected layers containing 4096 neurons for each one. The output features of the ffth
layer are the input of these two fully connected layers. Krizhevsky et al. [29] selected
1.2 million images from ImageNet [30] to construct 1000 classes to train the Alex-
net. An illustration of the architecture of the Alex-net is displayed in Figure 11.1.
Table 11.2 shows different layers with their associated parameters.
11.3.2.2 Resnet50
Resnet50 [44] is one of the deep residual networks (DRN). It is a very deep feed-
forward neural network (FFNN) with extra connections. A model of deep learning
based on the residual learning passes the input from one layer to a far layer by escap-
ing 2 to 5 layers; this method is called “skip connection”. Some models like Alex-net
try to fnd a solution by mapping some of the input to meet the same characteristics
in the output, but there is an enforcement in the residual network to learn how to map
some of the input to some of the output and input.
Resnet50 is a pretrained model consisting of 177 layers and containing a number
of connections equaling 192 × 2 to connect layers with each other. The input layer
restricts the size of the input images. The input must be resized to 224 × 224 × 3,
where the width is 224, the height is 224, and the depth is 3. The depth number
refers to the color space red, green, and blue channels. Resnet50 layers consist of
298 Deep Learning in Computer Vision
TABLE 11.2
Different Layers and Their Associated Parameters
Input Output Filter Depth Stride Padding
Layers size size size size size size
( )
y = F x, {Wi } + x (11.1)
where x and y are the input and the output vectors of the layers considered, Ƒ(x,{Wi})
and represents the residual mapping to be learned. The dimensions of x and Ƒ must
be equal.
Because Resnet50 contains 177 layers and 192 × 2 connections, a detailed part
of Resnet50 has been shown in Figure 11.3. Furthermore, Table 11.3 shows different
layers with their associated output size and the building block parameters.
11.4 METHODOLOGY
This section describes the steps of the proposed method to classify color-skin
images. It is divided into three subsections. The augmentation process of the input
color images of skin is discussed in the frst subsection, which describes different
augmentation methods that have been applied here. The implementation of the trans-
fer learning methodology is described in the second subsection for Alex-net and
Resnet50. The last subsection is devoted to describing the process of dropping out
the last three layers for Alex-net and Resnet50 and how authors replaced these layers
in order to adapt the process to the required task here.
TABLE 11.3
Resnet50 Layers*
Layer
name Conv1 Conv2_x Conv3_x Conv4_x Conv5_x Others FLOPs
Building 7 × 7, 3 × 3, max
blocks stride 2 pool, stide 2
° 1 × 1, 64 ˙ ° 1 × 1, 128 ˙ ° 1 × 1, 256 ˙ ° 1 × 1, 512 ˙ average 3.8 × 109
˝ ˇ ˝ ˇ ˝ ˇ ˝ ˇ pool,
˝ 3 × 3, 64 ˇ × 3 ˝3 × 3, 128ˇ × 3 ˝ 3 × 3, 256 ˇ × 3 ˝ 3 × 3, 512 ˇ × 3
˝˛1 × 1, 256 ˇˆ ˝˛1 × 1, 512 ˇˆ ˝˛1 × 1, 1024 ˇˆ ˝˛1 × 1, 2048ˇˆ 1000-d fc,
SoftMax
Output 112 × 112 56 × 56 28 × 28 14 × 14 7×7 1×1 FLOPs
size
TABLE 11.4
Relationship between Traditional Machine Learning and Various Transfer
Learning Settings
Transfer learning
Traditional
machine learning Unsupervised/inductive Supervised/transudative
Domains of target & Same Different but related Different but related
source
Tasks of target & Same Different but related Same
source
the frst dataset becomes (180/5+1) × 70 = 2590 melanoma images and (180/5+1) ×
100 = 3700 nevus images.
Similarly, the number of images in the second dataset becomes (180/5+1) × 84 =
3108 melanoma images and (180/5+1) × 87 = 3219 nevus images. The original images
are augmented by rotation with random angles ranging from 0° to 355° and trans-
lation with different translation parameters for every image. All dataset images
were randomly divided into images for training (85% of the images) and images for
testing samples of original, segmented, rotated, and translated images (15% of the
images) as displayed in Figure 11.4. The three columns on the left are from DermIS
& DermQuest, while those on the right are from MED-NODE.
FIGURE 11.4 The three columns on the left were selected randomly from the DermIS &
DermQuest datasets, and the three right columns are from MED-NODE. The frst, second,
third, and fourth rows show original, segmented, rotated, and translated images, respectively.
the datasets used in the proposed model. The weights will not change dramatically
because of using a very small learning rate. The learning rate is used to update the
weights of the convolutional layers. Stochastic gradient descent (SGD) algorithm
has been used to update convolutional layer weights because it is computationally
fast, in addition to processing a single training sample, which makes it easier to ft
into memory. Unlike the convolutional layers, the weights of fully connected layers
have been initialized randomly because we have changed the classifcation layer to
multiclass SVM in Alex-net and to SoftMax in Resnet50.
signifcantly increased the training speed and reduced the computational complexity
of the DCNN. Also, it led to learning extra new robust features and fnally reduced
the negative effect of overftting.
The same process of dropout layers has been applied with Resnet50 but with some
differences. The last three layers, called FC100, FC100_softmax, and Classifcation
layer_fc100, have been replaced with new layers to be ft with the required task here.
The new layers are FC2, SoftMax, and Class Output. These layers are able to clas-
sify images into 2 classes (melanoma and nevus) instead of the 1000 classes in the
original Resnet50. Figure 11.6 explains the modifed Resnet50.
t p + tn
Accuracy = (11.2)
t p + f p + fn + t n
tp
Sensitivity = (11.3)
t p + fn
tn
Specificity = (11.4)
f p + tn
Where, tp, fp, f n, and tn refer to true positive, false positive, false negative, and true
negative, respectively. The false-positive rate should be relatively small and the true
negative rate should be relatively large, resulting in most points falling in the left part
of the receiver operating characteristic (ROC) curve [41].
any preprocessing. Because of noise like hair, background, and other factors, a seg-
mentation process was carried out to extract ROI, keeping images in the same RGB
color space. The segmented ROI images were used in the second experiments to
measure the impact of noise reduction with Alex-net.
The segmented ROI images were augmented by rotating each image in the dataset
by an angle ranging from 0° to 90° with a fxed step of 5°. The numbers of melanoma
and nevus images increased to (90/5+1) × 70 = 1330 and (90/5+1) × 100 = 1900,
respectively. The third experiment was carried out to measure the performance of
the modifed Alex-net after the augmentation of the segmented images. The results
of all experiments are summarized in Table 11.5. Comparing these results with the
results of the frst and second experiments clearly shows that increasing the number
of training images results in a signifcant improvement in the evaluation metrics,
accuracy, sensitivity, and specifcity. This observation motivates us to conduct an
additional experiment with an approximately doubled number of training images.
The fourth experiment was performed with the same classes and the same condi-
tions except for the data augmentation. In this experiment, the segmented images
were augmented by rotating each image in the dataset by an angle ranging from
0° to 180° with a fxed step of 5°. Therefore, the numbers of melanoma and nevus
images increased to (180/5+1) × 70 = 2590 and (180/5+1) ×100 = 3700, respectively.
As shown in Table 11.5, the performance metrics were increased compared with the
previous experiments. The ffth experiment was performed where each segmented
image in the dataset was augmented by a rotation angle ranging from 0° to 355° with
a fxed step of 5°. The obtained result was improved compared with those from the
third and fourth experiments. We used the same augmentation process by rotating
each image with a fxed step angle of 5° from 0° to 90°, 0° to 180°, and 0° to 355° for
the third, fourth, and ffth experiments, respectively.
That last experiment was done by using a combination of different augmentation
approaches. Each segmented color image was translated with different translation
parameters and rotated with a random rotation angle in the range between 0° and
355°. The results of these six experiments clearly show that having more training
images increases the ability of the proposed system to identify skin lesions of the
type “melanoma” correctly (sensitivity) and its ability to identify skin lesions of type
“nevus” correctly (specifcity). To ensure this observation and the credibility of the
proposed system and its ability to identify melanoma and nevus skin lesions cor-
rectly, another group of experiments was performed with more complicated color-
skin images.
The second group of experiments was performed using noisy low-quality color-
skin images. The dataset used in these experiments was created by a random selec-
tion of images from DermIS [36] and DermQuest [37]. As mentioned before, the
images of the created dataset were segmented to isolate the ROI from the back-
ground to reduce the noise. The previous six experiments were repeated again using
the DermIS & DermQuest dataset. The acquired results have proved that increasing
the number of training images improved the evaluation metrics, accuracy, sensitivity,
and specifcity. Table 11.5 summarizes all obtained results of Alex-net using MED-
NODE and the DermIS & DermQuest datasets.
306
TABLE 11.5
The Accuracy of the Proposed Model Using Alex-net
MED-NODE dataset DermIS & DermQuest dataset
Experiment Accuracy (%) Sensitivity (%) Specifcity (%) Accuracy (%) Sensitivity (%) Specifcity (%)
11.6 DISCUSSION
It is noted that the results obtained from the second dataset are smaller than the
results obtained from the frst dataset. This is normal and predictable according to
the nature of the images in each dataset. The images of the frst dataset are high-
quality microscopic images acquired by specialized devices. On the other side, the
images of the second dataset are low-quality images acquired by using a common
camera in a noisy environment.
By comparing the performance measures of proposed Alex-net against proposed
Resnet50 using different datasets, we found that the best performance measures
were for the modifed Resnet50. The performance of Resnet50 is the best because it
contains many more layers than Alex-net. As discussed before, Resnet50 is enforced
to learn how to map some of the input to some of the output and input while Alex-net
tried to map some of the input to the output. Based on “skip connection”, Resnet50
passes the input from one layer to a far layer by escaping two to fve layers.
We could say that DCNN works well with any dataset of different resolutions.
The performance increased with high-quality images, high resolution, and low levels
of noise. The performance of the proposed methods was compared with the well-
known existing methods [11–21, 42] and the comparison is summarized in Tables 11.7
and 11.8. The accuracy and the ROC curves were used to measure the performance
of the proposed methods against the performance of the existing methods. For the
308
TABLE 11.6
The Accuracy of the Proposed Model Using Resnet50
MED-NODE dataset DermIS & DermQuest dataset
Experiment Accuracy (%) Sensitivity (%) Specifcity (%) Accuracy (%) Sensitivity (%) Specifcity (%)
TABLE 11.7
Comparative Study Using the DermIS & DermQuest Dataset
Preprocessing Image Classifcation Accuracy
Method (enhancement) Segmentation type method (%)
DermIS & DermQuest dataset, the performance of the proposed method was com-
pared with the performance of the existing methods [11–13, 42]. The obtained results
are listed in Table 11.7. All the mentioned methods in Table 11.7 used the same
DermIS & DermQuest dataset. Amelard et al. [11] enhanced the images and detected
the borders of skin lesions to segment important regions, and then they used the
segmented images in the classifcation process and achieved an accuracy percentage
of 87.38%. A similar approach was utilized in Karabulut and Ibrikci [12], where the
input images were enhanced to reduce the noise and then the regions of interest were
segmented in two ways. The frst way is to keep the segmented images in RGB color
space and the second way is to convert the segmented images to black-and-white
images. The SVM and CNN were used and achieved the same accuracy, 71.4%, for
the two methods. In Almaraz-Damian et al. [13], similar preprocessing steps were
used. The obtained accuracy was marginally increased to 75.1%. In Hosny et al.
[42], an automated process for skin cancer classifcation was proposed, where the
authors applied transfer learning with Alex-net and replaced the classifcation layer
with SoftMax. The achieved accuracy rate was 96.86%.
In order to improve the classifcation accuracy, the authors applied transfer learn-
ing with Alex-net and replaced the classifcation layer with the multiclass SVM.
Unfortunately, this method achieved an accuracy rate of only 93.35%. It is clear that
the achieved accuracy is lower than the achieved accuracy of the SoftMax method
[42]. Therefore, utilizing an alternative CNN could be a fruitful solution for this
problem. The authors applied transfer learning with Resnet50 to classify skin lesions
using the same dataset where the proposed Resnet50-based classifcation method
achieved a 98.56% accuracy rate and outperformed the Alex-net-based classifcation
methods. The ROC curves of the proposed methods and the existing methods [11–13,
42] are plotted and displayed in Figure 11.7.
Another comparison was performed using the MED-NODE dataset; all meth-
ods mentioned in Table 11.8 use this dataset. The performance of the proposed
methods was compared with that of the existing methods [14, 15, 19–21, 42]. Table
11.8 shows an overview of the conditions, image types, and classifers used in the
existing methods. It was noted that all of these methods applied two preprocessing
310 Deep Learning in Computer Vision
TABLE 11.8
Comparative Study for MED-NODE Dataset
Preprocessing Accuracy
Method (enhancement) Segmentation Image type Classifer (%)
steps. The frst step was enhancement, while the second was segmentation to extract
the region of interest. These methods used RGB color images, while the method of
[20] converted these color images to black and white. Table 11.8 clearly shows that,
on one hand, the performance of the proposed Alex-net method failed in terms
of its classifcation rates compared with other methods [14, 15, 19–21, 42]. On
the other hand, the proposed Resnet50-based classifcation method outperformed
FIGURE 11.7 The ROC curves for the different methods with the DermIS & DermQuest
dataset.
Skin Melanoma Classifcation Using Deep CNNs 311
FIGURE 11.8 The ROC curves for the different methods with the MED-NODE dataset.
the existing methods [14, 15, 19–21, 42]. The ROC curves of the different meth-
ods are plotted and displayed in Figure 11.8. All results are consistent and clearly
show that the proposed methods are superior to all the existing methods [11–21].
From the results displayed in Tables 11.7 and 11.8 in addition to the ROC curves
in Figures 11.7 and 11.8, it was proved that the proposed method is superior to the
state of the art.
11.7 CONCLUSION
This chapter proposed two models using the theory of transfer learning for the Alex-
net and Resnet50 architecture to classify tumors as melanoma or not-melanoma.
The last three layers (fully connected, SoftMax, and classifcation layers) have been
dropped out and replaced with three new layers. In Alex-net, the new layers were
fully connected for two classes only, namely, a multiclass SVM and an output layer;
in Resnet50, these layers were fully connected for two different classes, namely,
SoftMax and a classifcation layer. This modifcation was done to create two classif-
cations of skin images in addition to overcoming the overftting problem. Two differ-
ent datasets were used, one with high-quality images and the other with low-quality
images. These datasets were used to investigate the performance of the two proposed
models Alex-net and Resnet50 in a situation of using images of high and low quality.
Different ways of augmentation have been used to overcome the limitations of image
number in datasets. The proposed models have outperformed other models using
312 Deep Learning in Computer Vision
the same datasets. The sensitivity and specifcity of the proposed models proved the
credibility of the proposed model, because the increased sensitivity and specifcity
proved the truthfulness of the system.
BIBLIOGRAPHY
1. I. Maglogiannis, and C. N. Doukas, “Overview of advanced computer vision sys-
tems for skin lesions characterization”, Transactions on Information Technology in
Biomedicine, vol. 13, no. 5, pp. 721–733, 2009.
2. M. S. Arifn, M. G. Kibria, A. Firoze, M. A. Amini, and H. Yan, “Dermatological
disease diagnosis using color-skin images”, International Conference on Machine
Learning and Cybernetics, vol. 5, pp. 1675–1680, 2012.
3. J. M. Gálvez, D. Castillo, L. J. Herrera, B. S. Román, O. Valenzuela, F. M. Ortuño, and
I. Rojas, “Multiclass classifcation for skin cancer profling based on the integration of
heterogeneous gene expression series”, PLoS ONE, vol. 13, no. 5, pp. 1–26, 2018.
4. A. Masood, A. A. Al-Jumaily, and T. Adnan, “Development of automated diagnostic sys-
tem for skin cancer: Performance analysis of neural network learning algorithms for clas-
sifcation”, International Conference on Artifcial Neural Networks, pp. 837–844, 2014.
5. P. G. Cavalcanti, and J. Scharcanski, “Macroscopic pigmented skin lesion segmentation
and its infuence on lesion classifcation and diagnosis”, Color Medical Image Analysis,
vol. 6, pp. 15–39, 2013.
6. M. J. M. Vasconcelos, and L. Rosado, “No-reference blur assessment of dermatologi-
cal images acquired via mobile devices”, Image and Signal Processing, vol. 8509, pp.
350–357, 2014.
7. K. Bunte, M. Biehl, M. F. Jonkman, and N. Petkov, “Learning effective color features
for content-based image retrieval in dermatology”, Pattern Recognition, vol. 44, pp.
1892–1902, 2011.
8. S. V. Patwardhan, A. P. Dhawan, and P. A. Relue, “Classifcation of melanoma using
tree structured wavelet transforms”, Computer Methods and Programs in Biomedicine,
vol. 72, no. 3, pp. 223–239, 2003.
9. W. Y. Chang, A. Huang, C. Y. Yang, C. H. Lee, Y. C. Chen, T. Y. Wu, and G. S. Chen,
“Computer-aided diagnosis of skin lesions using conventional digital photography: A
reliability and feasibility study”, PLoS ONE, vol. 8, no. 11, pp. 1–9, 2013.
10. S. Kundu, N. Das, and M. Nasipuri, “Automatic detection of ringworm using Local
Binary Pattern (LBP)”, 2011, https://arxiv.org/abs/1103.0120.
11. R. Amelard, A. Wong, and D. A. Clausi, “Extracting morphological high-level intuitive
features (HLIF) for enhancing skin lesion classifcation”, International Conference of
the IEEE Engineering in Medicine and Biology Society, pp. 4458–4461, 2012.
12. E. M. Karabulut, and T. Ibrikci, “Texture analysis of melanoma images for computer-
aided diagnosis”, International Conference on Intelligent Computing, Computer
Science and Information Systems (ICCSIS), vol. 2, pp. 26–29, 2016.
13. J. A. Almaraz-Damian, V. Ponomaryov, and E. R. Gonzalez, “Melanoma CADe based
on ABCD rule and Haralick texture features”, International Kharkiv Symposium
on Physics and Engineering of Microwaves, Millimeter and Submillimeter Waves
(MSMW), pp. 1–4, 2016.
14. I. Giotis, N. Molders, S. Land, M. Biehl, M. F. Jonkman, and N. Petkov, “MED-NODE:
A computer-assisted melanoma diagnosis system using non-dermoscopic images”,
Expert Systems with Applications, vol. 42, no. 19, pp. 6578–6585, 2015.
15. M. H. Jafari, S. Samavi, N. Karimi, S. M. R. Soroushmehr, K. Ward, and K. Najarian,
“Automatic detection of melanoma using broad extraction of features from digital
images”, International Conference of the IEEE Engineering in Medicine and Biology
Society (EMBC), pp. 1357–1360, 2016.
Skin Melanoma Classifcation Using Deep CNNs 313
16. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
17. P. Dubal, S. Bhatt, C. Joglekar, and S. Patil, “Skin cancer detection and classifca-
tion”, International Conference on Electrical Engineering and Informatics (ICEEI),
pp. 1–6, 2017.
18. D. Kumar, M. J. Shafee, A. Chung, F. Khalvati, M. Haider, and A. Wong, “Discovery
radiomics for computed tomography cancer detection”, 2015, https://arxiv.org/
abs/1509.00117.
19. E. Nasr-Esfahan, S. Samavi, N. Karimi, S. M. R. Soroushmehr, M. H. Jafari, K. Ward,
and K. Najarian, “Melanoma detection by analysis of clinical images using convolu-
tional neural network”, International Conference of the IEEE Engineering in Medicine
and Biology Society (EMBC), pp. 1373–1376, 2016.
20. J. Premaladha, and K. S. Ravichandran, “Novel approaches for diagnosing melanoma
skin lesions through supervised and deep learning algorithms”, Journal of Medical
Systems, vol. 40, no. 96, pp. 1–12, 2016.
21. S. A. Kostopoulos et al., “Adaptable pattern recognition system for discriminating
Melanocytic Nevi from Malignant Melanomas using plain photography images from
different image databases”, International Journal of Medical Informatics, vol. 105, pp.
1–10, 2017.
22. A. Esteva, B. Kuprel, H. M. Blau, S. M. Swetter, J. Ko, R. A. Novoa, and S. Thrun,
“Dermatologist-level classifcation of skin cancer with deep neural networks”, Nature,
vol. 542, no. 7639, pp. 115–118, 2017.
23. S. Jain, V. Jagtap, and N. Pise, “Computer aided melanoma skin cancer detection using
image processing”, Procedia Computer Science, vol. 48, pp. 735–740, 2015.
24. D. Gautam, and M. Ahmed, “Melanoma detection and classifcation using SVM based
decision support system”, IEEE India Conference (INDICON), pp. 1–6, 2015.
25. Convolutional Neural Networks (CNNs / ConvNets), the Stanford CS class notes,
Assignments, Spring 2017, http://cs231n.github.io/convolutional-networks/, Accessed:
18 August 2017.
26. R. D. Azulay, D. R. Azulay, and L. Azulay-Abulafa, Dermatologia. 6th edition. Rio de
Janeiro: Guanabara Koogan, 2013.
27. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to
document recognition”, Proceedings of the IEEE, vol. 86, no. 11, pp. 101–118, 1998.
28. S. Srinivas, R. K. Sarvadevabhatl, K. R. Mopur, N. Prabhu, S. S. S. Kruthiventi, and R.
V. Babu, “A taxonomy of deep convolutional neural nets for computer vision”, Frontiers
in Robotics and AI, vol. 2, pp. 1–18, 2016.
29. A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classifcation with deep convo-
lutional neural networks”, Advances in Neural Information Processing Systems, vol.
25, no. 2, pp. 1097–1105, 2012.
30. O. Russakovsky et al., “ImageNet large scale visual recognition challenge”,
International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
31. J. D. Prusa, and T. M. Khoshgoftaar, “Improving deep neural network design with new
text data representations”, Journal of Big Data, Springer, vol. 4, pp. 1–16, 2017.
32. T. A. Martin, A. J. Sanders, L. Ye, J. Lane, and W. G. jiang, “Cancer invasion and
metastasis: Molecular and cellular perspective”, Landes Bioscience, pp. 135–168, 2013.
33. J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep
neural networks?” Advances in Neural Information Processing Systems, vol. 2, pp.
3320–3328, 2014.
34. S. J. Pan, and Q. Yang, “A survey on transfer learning”, Transactions on Knowledge
and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
35. K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer learning”, Journal
of Big Data, Springer, vol. 3, no 9, pp. 1–15, 2016.
36. Dermatology Information System, 2012, http://www.dermis.net, Accessed: 16 August 2017.
314 Deep Learning in Computer Vision
315
316 Index
U Weight pruning, 31
Winograd transform, 14–15
UAVs, see Unmanned aerial vehicles Workloads and computations, CNN models
Ubuntu, 194 computational workload, 6–8
Unconstrained Facial Images (UFI), 108 frameworks, 10
U-Net method, 265, 267, 277–280 hardware, 10
Unmanned aerial vehicles (UAVs) libraries, 10
applications, 184 memory accesses, 9–10
calibration, 187–188 parallelism, 8–9
datasets
ISPRS WG II/4 Potsdam, 195 X
videos, 196–197
defnition, 183 Xilinx ZCU102 device, 15
EO image registration
calibration component experiment, 198 Y
local feature registration results, 199–200
sequence experiment, 198–199 YI, see Youden’s index
framework architecture, 184–187 YOLO, see You Only Look Once
GNSS, 184 YOLO9000, 50
IMU, 184 YOLOv2, 49–50
LOS, 184 YOLOv3, 50
platform, 194 Youden’s index (YI), 252
semantic segmentation (see Semantic You Only Look Once (YOLO)
segmentation) architecture, 49
Unsupervised clustering techniques, 263–264 bounding boxes, 48
Unsupervised pretraining, 234 grid cell, 48
network design, 48
V
Z
Validation accuracy rate (VAR), 109
VGG16 classifcation model, 265 Zero skip scheduler, 32