Abstract
To reduce monocular visual odometry drift through loop-closure detection, this paper presents a comparison of state-of-the-art Convolutional Neural Networks in 2-channel and Siamese configurations. The work consists of training these networks so that they can robustly identify loop closures. Since the task takes a pair of images as input, we train and test both the 2-channel and the Siamese architecture for each network.
Keywords: visual odometry, loop-closure, deep learning, convolutional neural network
1. Introduction
The wide diffusion of unmanned vehicles in various fields raises the problem of estimating a robot's position and orientation. Internal and external sensors, such as gyroscopes, accelerometers and GPS, are used for this purpose. However, these sensors are subject to drift, noise and jamming. To enhance navigation skills and position-estimation accuracy, a camera-based approach is adopted.
In this paper we present a comparative study of several state-of-the-art CNN models. Both the 2-channel and the Siamese variant are studied for each model. Section 2 presents related work on deep learning approaches to loop-closure detection. Section 3 describes our approach. Section 4 shows our results, which are discussed in Section 5.
2. Related Work
There are mainly two appearance-based approaches to loop-closure detection: hand-crafted methods such as Bag-of-Words (BoW), and deep learning approaches using neural networks.
BoW was initially developed to extract features from texts in order to describe, model and classify documents by counting word frequencies [10]. The technique was later adapted to the computer vision context, where lexical words become visual words, namely a set of extracted visual features. A wide variety of features can be used, such as SIFT [11], SURF [12] and ORB [13]. For computer vision, the BoW technique consists in extracting visual features, clustering them, and describing each image by a histogram of visual words, as sketched below. This yields a pose-invariant place description learned from training image sequences. The resulting model is, however, limited to that training sequence, so loop-closure detection can only be performed in the known area. Fast Appearance-Based Mapping (FAB-MAP) [14] is a place-recognition and mapping algorithm that uses BoW to define a probabilistic model of the environment.
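As an illustration of the pipeline described above, here is a minimal Python sketch using OpenCV ORB features and k-means clustering; the vocabulary size and other parameter values are illustrative assumptions, not taken from any of the cited systems.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(images, n_words=100):
    """Cluster ORB descriptors from training images into visual words."""
    orb = cv2.ORB_create()
    descriptors = []
    for img in images:
        _, des = orb.detectAndCompute(img, None)
        if des is not None:
            descriptors.append(des.astype(np.float32))
    return KMeans(n_clusters=n_words).fit(np.vstack(descriptors))

def bow_histogram(image, vocabulary):
    """Describe one image as a normalized histogram of visual words."""
    orb = cv2.ORB_create()
    _, des = orb.detectAndCompute(image, None)
    words = vocabulary.predict(des.astype(np.float32))
    hist, _ = np.histogram(words, bins=vocabulary.n_clusters,
                           range=(0, vocabulary.n_clusters))
    return hist / hist.sum()
```

Loop-closure candidates can then be proposed by comparing histograms, e.g. with a cosine similarity, which is what makes the description pose-invariant but also vocabulary-bound.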
Recently, a large variety of Convolutional Neural Network (CNN) methods has emerged. Deep learning approaches are widely used for computer vision tasks such as object detection and classification [15], [16] and face recognition [17]. In contrast to BoW, such approaches build an abstract representation of the problem, which generalizes beyond the training context. Some recent works address the loop-closure problem with CNNs. While some papers, like [18], proposed a novel architecture, others used state-of-the-art pretrained networks, like [2] and [6]. In particular, [2] evaluated deep learning networks for loop-closure detection in visual SLAM and showed that neural networks outperform BoW. That work used pretrained models such as AlexNet [19], CaffeNet [20] and GoogLeNet [21], and found AlexNet to be the most accurate for loop-closure detection. It was based on multichannel networks, whereas we additionally study Siamese networks. Besides, [6] presented a CNN approach for estimating image-to-image similarity to detect loop closures in a visual SLAM process: a 2-channel AlexNet was implemented to estimate the similarity between two input images, and it showed satisfying results on a cross-season dataset. [22] designed a novel pyramid Siamese architecture, processing RGB-D input, to perform loop-closure detection in a Simultaneous Localization And Mapping context.
Since our task is image comparison, the neural network must take a pair of images as input. [23] presented various deep learning architectures to compare gray-scale image patches; according to that paper, these architectures perform well and outperform hand-crafted approaches.
In our paper we are interested in both the Siamese and the 2-channel architecture. These architectures are applied to state-of-the-art CNNs that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC); we limit our study to three of them.
AlexNet [19] is the ILSVRC 2012 winner. It is composed of eight layers: five convolution layers, two fully connected layers and one softmax layer. It has 60 million parameters and 650,000 neurons. Overlapping max-pooling layers follow the first two convolution layers, while the third, fourth and fifth convolution layers are directly connected. Then comes another overlapping max-pooling layer followed by the two fully connected layers. The top layer of this network is a softmax layer that generates scores for 1000 classes.
GoogLeNet [21] is the ILSVRC 2014 winner. It has 22 layers in total and around five million parameters. This network stands out through its use of inception modules, of which it contains nine. These modules simultaneously perform different convolutions and concatenate the resulting feature maps.
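A hedged sketch of an inception-style module in PyTorch follows; the branch widths are illustrative assumptions, not GoogLeNet's actual configuration.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel convolutions whose outputs are concatenated channel-wise."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),          # 1x1 reduction
            nn.Conv2d(16, 32, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 8, kernel_size=1),           # 1x1 reduction
            nn.Conv2d(8, 16, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 16, kernel_size=1))

    def forward(self, x):
        # Every branch preserves spatial size, so concatenation is valid.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)
```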
Residual Networks (ResNet) [24] won first place in ILSVRC 2015. This architecture introduced the identity shortcut connection, called a residual connection, which allows layers to be skipped. Before ResNet, CNNs were growing deeper and becoming hard to train due to accuracy saturation. ResNet makes the skipped layers fit a residual mapping instead of a desired underlying mapping, and it converges faster than its plain equivalent.
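A minimal sketch of such a residual block in PyTorch (layer sizes are illustrative, not ResNet's exact configuration): the stacked layers learn a residual F(x) and the block outputs F(x) + x.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv layers learning a residual F(x); the output is F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Identity shortcut: gradients can flow around the conv stack.
        return self.relu(self.f(x) + x)
```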
Siamese networks were introduced in 1994 by Bromley et al. [25]. The original aim was to verify signatures by learning suitable descriptors for comparing inputs. The architecture is made of two (or more) identical branches, called subnetworks, that share the same weights and parameters (Fig. 3). Each branch takes a distinct input, but they meet in a top conjoining layer, namely a contrastive loss function. During the training phase, the back-propagation step updates the shared weights of the subnetworks simultaneously, so fewer parameters need to be trained. Siamese networks are generally used for binary classification. Koch et al. [26] employed a Siamese neural network for one-shot learning on the MNIST dataset using a network pretrained on a different dataset, Omniglot.
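A minimal PyTorch sketch of this weight sharing (illustrative, not the exact networks trained here): a single embedding subnetwork is applied to both inputs, so the branches share parameters by construction.

```python
import torch.nn as nn

class SiameseNetwork(nn.Module):
    """Apply the same embedding network to both inputs (shared weights)."""
    def __init__(self, embedding_net):
        super().__init__()
        self.embedding_net = embedding_net  # e.g. a truncated AlexNet/ResNet

    def forward(self, x1, x2):
        # Both branches reuse the same module, so back-propagation
        # updates a single set of parameters.
        f1 = self.embedding_net(x1)
        f2 = self.embedding_net(x2)
        return f1, f2
```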
Figure 3: Siamese neural network architecture

Multi-channel architectures consider the input images as a single image with multiple channels. In other terms, a network comparing N RGB images (3 channels each) can view its input as a single 3N-channel image (Fig. 4). To detect loop closures for visual SLAM, [6] estimated the similarity between two images by concatenating their RGB channels. The approach employed AlexNet [19], which originally takes a 224x224x3 image; to measure image similarity, the AlexNet input dimension became 224x224x6. According to Zagoruyko et al. [23], such an architecture provides great flexibility and is fast to train.
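As a hedged sketch of this input adaptation in PyTorch/torchvision (the weight-duplication initialization is our assumption, not necessarily the cited papers' choice), the first convolution of a pretrained AlexNet can be rebuilt to accept a 6-channel input:

```python
import torch
import torch.nn as nn
from torchvision import models

def make_2channel_alexnet():
    """AlexNet whose first conv accepts a 6-channel (two RGB images) input."""
    net = models.alexnet(pretrained=True)
    old = net.features[0]            # Conv2d(3, 64, 11, stride=4, padding=2)
    new = nn.Conv2d(6, old.out_channels, kernel_size=old.kernel_size,
                    stride=old.stride, padding=old.padding)
    with torch.no_grad():
        # Copy the pretrained RGB filters twice and halve them so the
        # activations keep roughly the pretrained scale (one plausible choice).
        new.weight.copy_(torch.cat([old.weight, old.weight], dim=1) / 2)
        new.bias.copy_(old.bias)
    net.features[0] = new
    return net

# Two 224x224 RGB images become one 6-channel input:
# x = torch.cat([img1, img2], dim=1)   # shape (B, 6, 224, 224)
```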
3. Proposed Approach
Our work is divided into two main parts. We start with the adaptation and modification of the CNNs; then we prepare the dataset for training and testing. Loop-closure detection is a task based on image similarity. We therefore need to modify the input and output layers of the models cited above: these CNNs were trained to classify 1000 objects, while we need similarity estimation.
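As a minimal sketch of the output-layer modification (the embedding size of 256 is an illustrative assumption), the 1000-class classifier head can be replaced by a feature-embedding layer whose outputs are compared by distance:

```python
import torch.nn as nn
from torchvision import models

def make_embedding_alexnet(embedding_dim=256):
    """Replace AlexNet's 1000-class output with a feature embedding."""
    net = models.alexnet(pretrained=True)
    # classifier[6] is the final Linear(4096, 1000) layer.
    net.classifier[6] = nn.Linear(4096, embedding_dim)
    return net
```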
For similarity estimation, a distance-based loss function is better suited. In this work, the contrastive loss function [4] is used. It is given by the following formula:

L(y, f_1, f_2) = y \, d^2 + (1 - y) \, \max(m - d, 0)^2

where y is the ground truth: y = 1 if the image pair is made of similar images (a loop closure in our case), and y = 0 otherwise. The term d is the Euclidean distance between the discriminative features generated by each subnetwork. Letting f_1 and f_2 be the generated descriptions of the two images,

d = \sqrt{\sum_i (f_{1,i} - f_{2,i})^2} = \| f_1 - f_2 \|_2.

Finally, m is a margin term introduced to clamp the constraint: it is the minimum distance required between two dissimilar images. To minimize the loss, similar images should be mapped close together, while dissimilar images should be at least m apart. There are different ways to choose the margin m. A first way is to adaptively update m during training, using the smallest error as the margin; a second is to use a constant margin m tuned manually.
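A minimal PyTorch sketch of this loss under the paper's convention (y = 1 for similar pairs); the default margin value is only a placeholder for the constant-margin option:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f1, f2, y, margin=1.0):
    """Contrastive loss [4]: L = y*d^2 + (1-y)*max(m - d, 0)^2,
    pulling similar pairs (y=1) together and pushing dissimilar
    pairs (y=0) at least `margin` apart."""
    y = y.float()
    d = F.pairwise_distance(f1, f2)   # Euclidean distance per pair
    loss = y * d.pow(2) + (1 - y) * torch.clamp(margin - d, min=0).pow(2)
    return loss.mean()
```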
1 http://www.ipb.uni-bonn.de/data/visual-place-recognition-datasets/
Figure 5: Samples from the dataset. Positive pairs (left), negative pairs (right)
4.1. Training
In this work, we limit training to 30 epochs for each network. We set the learning rate to 10^-4 and chose Stochastic Gradient Descent as the optimizer. Training uses 80% of the whole dataset, that is, 7200 image pairs, in batches of 11 image pairs each.
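A minimal sketch of this training configuration for the Siamese variant, assuming a dataset yielding (image1, image2, label) triples and the contrastive_loss function sketched above; names are illustrative:

```python
import torch
from torch.utils.data import DataLoader

def train(model, pair_dataset, epochs=30, lr=1e-4, batch_size=11):
    """SGD training over image pairs, matching the setup of Section 4.1."""
    loader = DataLoader(pair_dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for img1, img2, y in loader:
            f1, f2 = model(img1, img2)      # Siamese forward pass
            loss = contrastive_loss(f1, f2, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```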
The following curves show the evolution of epoch duration in seconds. We first compare the 6-channel (Fig. 6) and Siamese networks (Fig. 7) separately, then perform an overall comparison (Fig. 8).

Figure 7: Siamese training duration per epoch
Figure 8: Overall training duration per epoch
The training duration curves show that epoch durations are more stable for Siamese networks. Besides, the Siamese networks' training phase is shorter than that of the 6-channel networks. On the other hand, both GoogLeNet architectures take more time to train than the ResNet and AlexNet architectures.
4.2. Validation
To validate the networks, we used the remaining 20% of our dataset, i.e. 1800 pairs. After 30 training epochs, the pretrained 6-channel AlexNet reaches 87.7% accuracy while the Siamese AlexNet reaches 93% (Fig. 9). On the other hand, the Siamese GoogLeNet reaches 83.61% accuracy and the 6-channel GoogLeNet reaches 82.38% after 25 training iterations (Fig. 10). Finally, the Siamese ResNet arrives at 94.83% accuracy and the 6-channel ResNet reaches 93.55% (Fig. 11).
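For illustration, a hedged sketch of how such validation accuracies can be computed for the Siamese variants: pairs are classified as loop closures by thresholding the embedding distance. The threshold value is an assumption, not necessarily the decision rule used here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def accuracy(model, loader, threshold=0.5):
    """Fraction of pairs correctly classified by distance thresholding."""
    correct, total = 0, 0
    for img1, img2, y in loader:
        f1, f2 = model(img1, img2)
        pred = (F.pairwise_distance(f1, f2) < threshold).long()  # 1 = loop closure
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total
```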
Figure 10: GoogLeNet validation accuracy
Figure 11: ResNet validation accuracy
Besides, our performance comparison relied on confusion matrices for each of the six networks (Table 2). The first column shows the multichannel networks; the second is dedicated to the Siamese networks.
Table 2: Networks confusion matrices. Multichannel networks (left), Siamese networks (right): a) Multichannel AlexNet, b) Siamese AlexNet
5. Conclusion
In this paper, the images were already stored on disk, so the storage-management issue does not arise. In a real-time setting, however, incoming frames have to be stored in order to compare them with future images; storage and real-time memory management is therefore an axis to explore in the future. Another future challenge is to incorporate a deep-learning-based loop-closure detection module into our monocular visual odometry framework.
References
[1] P. Beeson, J. Modayil, and B. Kuipers, "Factoring the mapping problem: Mobile robot map-building in the Hybrid Spatial Semantic Hierarchy," The International Journal of Robotics Research, 2009.
[2] Y. Xia, J. Li, L. Qi, H. Yu, and J. Dong, "An evaluation of deep learning in loop closure detection for visual SLAM," in 2017 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), Exeter, 2017, pp. 85-91.
[3] S. Zagoruyko and N. Komodakis, "Learning to compare image patches via convolutional neural networks," 2015.
[4] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, 2006, pp. 1735-1742.
[5] J. Bromley, I. Guyon, Y. LeCun, et al., "Signature verification using a Siamese time delay neural network," in International Conference on Neural Information Processing Systems, Morgan Kaufmann Publishers Inc., 1993, pp. 737-744.
[6] J. Ma, K. Qian, X. Ma, and W. Zhao, "Reliable loop closure detection using 2-channel convolutional neural networks for visual SLAM," in 2018 Chinese Control Conference (CCC), pp. 5347-5352, doi:10.23919/ChiCC.2018.8483560.
[7] D. Filliat, "A visual bag of words method for interactive qualitative localization and mapping," in IEEE International Conference on Robotics and Automation, 2007, pp. 3921-3926.
[8] M. Cummins and P. Newman, "Highly scalable appearance-only SLAM - FAB-MAP 2.0," in Robotics: Science and Systems (RSS), June 2009.
[9] E. Garcia-Fidalgo and A. Ortiz, "iBoW-LCD: An appearance-based loop closure detection approach using incremental bags of binary words," IEEE Robotics and Automation Letters, 2018, doi:10.1109/LRA.2018.2849609.
[10] G. Salton and M. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Computer Science Series, McGraw-Hill, 1983.
[11] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60(2):91-110, 2004.
[12] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in European Conference on Computer Vision, Springer, 2006, pp. 404-417.
[13] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in 2011 IEEE International Conference on Computer Vision (ICCV), pp. 2564-2571.
[14] M. Cummins and P. Newman, "FAB-MAP: Appearance-based place recognition and mapping using a learned visual vocabulary model," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, pp. 3-10.
[15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in CVPR, 2017.
[16] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517-6525.
[17] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815-823.
[18] N. Merrill and G. Huang, "Lightweight unsupervised deep loop closure," in Robotics: Science and Systems (RSS), 2018, doi:10.15607/RSS.2018.XIV.032.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097-1105.
[20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 2014 ACM Conference on Multimedia, doi:10.1145/2647868.2654889.
[21] C. Szegedy et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[22] Q. Zhang, A. Mai, J. Menke, and A. Y. Yang, "Loop closure detection with RGB-D feature pyramid Siamese networks," CoRR, abs/1811.09938, 2018.
[23] S. Zagoruyko and N. Komodakis, "Learning to compare image patches via convolutional neural networks," 2015, pp. 4353-4361, doi:10.1109/CVPR.2015.7299064.
[24] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," AAAI Conference on Artificial Intelligence, 2016.
[25] J. Bromley et al., "Signature verification using a Siamese time delay neural network," Advances in Neural Information Processing Systems, 1994.
[26] G. Koch, R. Zemel, and R. Salakhutdinov, "Siamese neural networks for one-shot image recognition," ICML Deep Learning Workshop, 2015.