Comparative Study of Statistical Methods for Image Feature Matching in a Wide Base Line

Rodrigo Munguía, Antoni Grau
Department of Automatic Control, UPC,
c/ Pau Gargallo, 5 E-08028 Barcelona, Spain
{rodrigo.munguia,antoni.grau}@upc.edu
http://webesaii.upc.es

Abstract—Motivated by the problems of vision-based mobile robot map building and localization, we present a comparative study of statistical methods for matching image features across a wide base line between the learning and recognition phases. A general methodology, called the feature-class method, for fast matching of image features in a wide base line is described in the context of mobile robots. This work does not aim to give an exhaustive description of the methods considered, but rather a good idea of their performance and of the effectiveness of the feature-class method.

Index Terms—Matching image features, wide base line, statistical methods.
I. INTRODUCTION
Many computer vision applications involve the challenging problem of determining and matching features between images with a wide base line. When a mobile platform moves through its environment, a single video camera can be used to build a map of its surroundings and to determine its position (absolute or relative). Because a great amount of information is available in computer vision, sparse image statistics called features are used to create a model that is rich enough to represent the environment, yet sparse enough to be stored efficiently. Good image feature matching in a wide base line can be very helpful for mobile robot localization and 3D vision-based map building. A descriptor can be viewed as a distinctive representation of a feature and its variations over time, compact with respect to the original data yet without losing its statistical meaning. Descriptors can be very helpful for the whole problem because they can provide distinctive signatures of different locations in space. Furthermore, descriptors have to be as invariant as possible to changes in scale, rotation, illumination or projection, and algorithms must be efficient and robust to a number of environmental variations such as lighting, shades, and occlusions, among others.
In applications like structure from motion or robot localization, a video stream is available, making it possible to detect and then track features across images with a small base line; several trackers [1, 2] can be used for this. Small base-line tracking is useful for capturing the variations in the appearance of each feature over time. Descriptors can be created from this small base-line tracking and used later, with different methods, to perform the matching among descriptors.
Some approaches [3, 4] have been presented to address the problem of image feature representation for tracking, recognition or reconstruction: the affine-invariant descriptors. Generally these methods search for extrema in the image scale space to obtain good candidate locations for detection. Lowe's scale-invariant feature transform (SIFT) [5] has been shown to match reliably across a wide range of scale and orientation changes; it uses a cascade filtering approach in an isotropic Gaussian scale space to identify points of interest, then samples gradients to create an orientation histogram represented in a 128-element descriptor. In [6] an approximate version of Kernel Principal Component Analysis (KPCA) was used to estimate feature descriptors for wide base-line matching.
In the mobile robot map building and localization setting, we describe a general methodology, called the feature-class method, to address the problem of image feature matching in a wide base line; then a number of statistical methods are presented and compared. It is important to note that the methods and strategies studied in this work would not be enough, by themselves, to obtain a match set of sufficient quality for most computer vision applications. Instead, they have to be used within a more elaborate matching scheme. The objective of this study is therefore to give a good idea of the relative performance of the methods and of the effectiveness of the methodology, not to study these algorithms as a whole.
II. FEATURE-CLASS METHOD AS A GENERAL
METHODOLOGY
We consider the wide base-line feature matching problem in the context of map building and localization of mobile robots. The viewing conditions change drastically between the phases of map building (learning) and localization (recognition). Such changes affect both the domain of the image (deformations of the



scene and geometric distortion due to changes of viewpoint) and its range (changes in illumination). Such changes are due both to intrinsic properties of the scene (shape, reflectance) and to nuisance factors (illumination, viewpoint).
A feature is a statistic of the image intended to make the matching process easier; ideally, one would want a feature statistic invariant to all kinds of changes. Descriptors invariant to changes like scale or rotation can be created from one single view and are suitable for many applications; nevertheless, in the context of mobile robots, changes in viewpoint or illumination can be quite significant between the learning and recognition stages. Unfortunately, there exists no single-view statistic that is invariant with respect to viewpoint or lighting conditions. On the other hand, in the context of mobile robots a high frame-rate video is available during both map building and localization. So multiple adjacent views of the same scene are available, as for instance in a video from a moving camera, and at least in theory the point of view can be explicitly accounted for. Additionally, changes in viewpoint cause irradiance changes that are due to the interplay of reflectance and illumination.
The key to our work is to use statistical methods to learn feature classes that capture the variation of features over time using multiple adjacent views. A descriptor generated from one single view acts as a single object of the class (invariant to changes like scale or rotation), and the whole class, by capturing the changes over time, is therefore less sensitive, even invariant, to changes like lighting or viewpoint.
Feature-class is a general and modular method to address the problem: the idea is to try different kinds of descriptors and statistical methods of learning and classification, and the modularity makes it possible to interchange different methods and descriptors. Therefore, the scheme is divided into two stages: learning (map building) and recognition (localization).

Learning phase:

1. Small base-line tracking: Features are detected and tracked using a conventional small base-line tracker; in this work the Kanade-Lucas-Tomasi (KLT) tracker was used, but any efficient tracker could serve.
2. Window extraction: For each detected feature, a p-by-p pixel window around the feature center is extracted; in our work we used a 12-by-12 pixel window. Other window sizes were tested, but the results were worse.
3. Descriptor creation: A descriptor x_i is obtained for each window area. A good descriptor has to be as invariant as possible to changes like rotation or scale and insensitive to changes like illumination. We used two statistical techniques, principal component analysis (PCA) and independent component analysis (ICA), but other descriptors like SIFT can be used.
4. Storage in database: For each frame, descriptors are scaled and stored in a database. Descriptors created from the same feature, tracked over time with a small base line, are stored with the same label y_i.
5. Feature-class creation: A statistical method is used to represent the descriptors stored in the database with the same label y_i as a unique class V that represents the feature. This feature class V is created with the purpose of capturing the variations in the appearance of the feature over time. The capacity of V to represent these variations depends on the number of descriptors and on the scene conditions under which it was created. For this study we used Support Vector Machines (SVM) and variations of K-Nearest Neighbors (KNN), but obviously other methods can be used.
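Steps 2-4 of the learning phase can be sketched as follows. This is a minimal illustration assuming numpy, with a random matrix standing in for a trained PCA/ICA basis; the names `extract_window` and `pca_descriptor` are illustrative, not from the paper.

```python
import numpy as np

def extract_window(frame, center, p=12):
    """Step 2: cut a p-by-p pixel window around a feature center (row, col)."""
    r, c = center
    h = p // 2
    return frame[r - h:r + h, c - h:c + h]

def pca_descriptor(window, basis):
    """Step 3: project the flattened window onto a (here random, toy) basis;
    rows of `basis` play the role of principal directions."""
    return basis @ window.flatten()

# Step 4: database mapping a feature label y_i to its list of descriptors x_i.
database = {}
rng = np.random.default_rng(0)
frame = rng.random((480, 640))        # stand-in for a video frame
basis = rng.random((8, 144))          # toy 8-dim basis for 12*12 = 144 pixels

for frame_idx in range(5):            # the same feature tracked over 5 frames
    w = extract_window(frame, (100 + frame_idx, 200), p=12)
    database.setdefault("feature_0", []).append(pca_descriptor(w, basis))

print(len(database["feature_0"]), database["feature_0"][0].shape)  # 5 (8,)
```

Step 5 would then fit a classifier (SVM or KNN) over the descriptor lists, one class per label.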

Recognition phase:

1. Feature detection: To improve the computational performance of the recognition phase, the same small base-line tracker as in the learning phase is used to detect features, but it is not used to track the candidate features to be matched.
2. Candidate descriptors: For each candidate feature, a descriptor x_j is created in the same way as in the learning phase.
3. Recognition: The same statistical method used in the learning phase is now used to classify the candidate descriptor x_j into the most adequate feature class V. Depending on the kind of statistical method selected, a quality-of-correspondence scheme between the candidate descriptor and its associated class can be implemented.
III. STATISTICAL METHODS FOR MATCHING FEATURES
IN A WIDE BASE-LINE
In this section the statistical methods used for the comparative study are briefly described. As descriptor-feature methods, ICAD (independent component analysis descriptor) and PCAD (principal component analysis descriptor) follow most of the feature-class method described in the previous section: adjacent views are taken into account only by the ICA and PCA process, and no statistical learning and classification method is used to generate a feature class.
On the other hand, we implemented four methods that follow the feature-class method entirely: SVM-ICA, SVM-PCA, KNN-ICA, and KNN-PCA. In these methods ICA and PCA are used only to generate single descriptors, and SVM and KNN are used to establish the feature class.



PCA and ICA
Principal Component Analysis (PCA) [7] is a standard statistical tool used to find the orthogonal directions corresponding to the highest variance. It is equivalent to a decorrelation of the data using second-order information. The basic idea in PCA is to find the n linearly transformed components p_1, p_2, ..., p_n that explain the maximum amount of variance. The principal components are given by p_i = w_i^T X, where X = [x_1, ..., x_m]^T, x_i is an observed data vector, and w_i is a basis vector (an eigenvector of the sample covariance matrix E{XX^T}). In matrix form,

   P = WX   (1)

where P = [p_1, ..., p_n]^T and p_i is a principal component vector.
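The PCA construction above can be sketched with numpy; this is a toy example via eigendecomposition of the sample covariance, not the snapshot method [11] used later in the paper.

```python
import numpy as np

# Toy data: m = 6 observed vectors x_i (rows of X), each of dimension 4.
rng = np.random.default_rng(1)
X = rng.random((6, 4))
Xc = X - X.mean(axis=0)                    # center the data

# Basis vectors w_i are eigenvectors of the sample covariance matrix E{XX^T}.
cov = Xc.T @ Xc / len(Xc)
eigvals, W = np.linalg.eigh(cov)
W = W[:, np.argsort(eigvals)[::-1]]        # sort by decreasing variance

# Principal components p_i = w_i^T x per observation, i.e. P = WX in Eq. (1).
P = Xc @ W
C = np.cov(P, rowvar=False, bias=True)
print(np.abs(C - np.diag(np.diag(C))).max() < 1e-10)  # True: decorrelated
```

The check at the end confirms the second-order property: the covariance of the components is diagonal.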
Independent Component Analysis (ICA) [8] attempts to go one step further than PCA, by finding the matrix which transforms the data into components Z_1, Z_2, ..., Z_m that are statistically independent. ICA is thus more general than PCA, in trying not only to decorrelate the data but also to find a decomposition transforming the input into independent components. The simplest ICA model, the noise-free linear ICA model, seems to be sufficient for most applications. In this model, ICA of observed random data X consists of estimating the generative model

   X = AS   (2)

where X = [x_1, ..., x_n]^T, x_i is an observed random vector, S = [s_1, ..., s_n]^T, s_i is a latent component, and A is the constant mixing matrix. The transform we seek is B = VW; then

   Z = BX = BAS = CS   (3)

If a matrix B that transforms the mixed signals X into Z with independent components can be found, and assuming that at most one independent source s_k is normally distributed, then Z = CS with C being a non-mixing matrix. ICA algorithms attempt to find the matrix B which ensures that Z is independent.
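The generative model of Eqs. (2)-(3) can be illustrated on synthetic mixtures; the sketch below uses scikit-learn's FastICA as a stand-in implementation (the paper uses the FastICA algorithm [10], not necessarily this library).

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent latent sources s_i, mixed by a constant matrix A: X = AS.
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t))]   # (samples, sources)
A = np.array([[1.0, 0.5], [0.4, 1.0]])             # mixing matrix
X = S @ A.T                                        # observed mixtures

# FastICA estimates the transform B such that Z = BX has independent components.
ica = FastICA(n_components=2, random_state=0)
Z = ica.fit_transform(X)
print(Z.shape)  # (2000, 2)
```

Up to permutation and sign, each column of Z should recover one of the original sources.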
ICA and PCA applied to window-based image features
If we consider a feature as a window of p x p pixels in a frame, we can organize each feature as a long vector with as many dimensions as the number of pixels in the feature. ICA or PCA can be applied to these data by organizing the vectors into a matrix X where each row is the same image feature in a different frame (Fig. 1, left). In this approach, image features are random variables and pixels are samples (Fig. 1, right).

Fig. 1. a) To apply ICA or PCA, a matrix is formed where each row is a feature i tracked in frame n. b) ICA or PCA finds a weight vector w in the directions of statistical dependencies among the pixel locations. ICA not only decorrelates the data, but also finds a decomposition transforming the input into independent components.
SVM and KNN
Support Vector Machine (SVM) is a technique for
data classification. The goal of SVM is to produce a
model which predicts target value of data instances in
the testing set which are given only the attributes.
Given a training set of instance-label pairs (x
i
,y
i
),i=1,,l
where x
i
R
n
and y{1,-1}
l
, the SVM [9] require the
solution of the following optimization problem:


=
+
l
i
i
C w
t
w
b w
1
2
1
, ,
min

, with C > 0
subject to
i i
t
i
b x w y + 1 ) ) ( ( , where . 0
i


Here training vectors, x
i
, are mapped into a higher
(maybe infinite) dimensional space by the function .
Then SVM finds a linear separating hyperplane with the
maximal margin in this higher dimensional space. C is
the penalty parameter of the error term. Furthermore,
K(x
i
,x
j
)= (x
i
)
T
(x
j
) is called the kernel function. In this
work a radial basis function (RBF) was used like kernel
function:
0 ), || || exp( ) (
2
>
j i j i
x x x x K
(4)
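The RBF-kernel classifier can be sketched as follows; this uses scikit-learn's SVC as a stand-in (the paper's implementation is LIBSVM [12], of which SVC is a wrapper), on toy data rather than real descriptors, with the paper's C = 8 and γ = 1.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class problem standing in for descriptor/label pairs (x_i, y_i).
rng = np.random.default_rng(2)
X = np.r_[rng.normal(0, 1, (40, 3)), rng.normal(4, 1, (40, 3))]
y = np.r_[np.full(40, -1), np.full(40, 1)]

# RBF kernel K(x_i, x_j) = exp(-gamma ||x_i - x_j||^2); C penalizes slack.
clf = SVC(kernel="rbf", C=8.0, gamma=1.0)
clf.fit(X, y)
print(clf.predict(X[[0, -1]]))  # [-1  1]
```

In the feature-class method, each feature class V would be one label in a multi-class version of this fit.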
K-nearest neighbors (KNN) is a well-known and very simple statistical classification method; nevertheless, it has proved very effective in a wide variety of applications. It determines the K nearest neighbors by minimum distance from the query instance x_j to the training samples (x_i, y_i). After gathering the K nearest neighbors, a simple majority vote of these neighbors is taken as the prediction for the query instance.
In this work we employ a 5-nearest-neighbor classifier, and we use a parameter τ as a threshold for considering a good match between a candidate descriptor x_j and the most voted feature class V:

   τ = (K / l^2) Σ_{j=1}^{l} d(x_j, x)   (5)

where x_j ∈ V, d(x_j, x) is the Euclidean distance, and l is the number of votes for the most voted class V. In this way the average Euclidean distance is used as a threshold, but penalized according to the number of votes received by V with respect to K.
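The 5-NN decision and the vote-penalized threshold of Eq. (5) can be sketched as follows; numpy is assumed and the function name is illustrative.

```python
import numpy as np

def knn_vote_threshold(query, train_X, train_y, K=5):
    """Classify `query` by K-NN majority vote and compute the threshold of
    Eq. (5): tau = K * (sum of distances to the l voters of the winning
    class V) / l**2, i.e. the average voter distance scaled by K/l."""
    d = np.linalg.norm(train_X - query, axis=1)
    nn = np.argsort(d)[:K]                       # indices of K nearest neighbors
    labels, counts = np.unique(train_y[nn], return_counts=True)
    V = labels[np.argmax(counts)]                # most voted class
    voters = nn[train_y[nn] == V]                # neighbors that voted for V
    l = len(voters)
    tau = K * d[voters].sum() / l**2             # Eq. (5)
    return V, tau

rng = np.random.default_rng(3)
train_X = np.r_[rng.normal(0, 0.1, (10, 4)), rng.normal(5, 0.1, (10, 4))]
train_y = np.array([0] * 10 + [1] * 10)
V, tau = knn_vote_threshold(np.zeros(4), train_X, train_y)
print(V, tau < 1.0)  # class 0 wins with a small penalized distance
```

When fewer of the K neighbors vote for V (smaller l), the K/l factor inflates τ, so a split vote is penalized even if the distances are small.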
ICAD and PCAD methods
Many ICA and PCA algorithms are available. A computationally efficient ICA algorithm called FastICA [10] and the PCA snapshot method [11] have been chosen for this work.
When the KLT (small base-line tracker) locates a feature (feature i at frame f), a p-by-p pixel window around the feature center is stored as a vector u_fi of length p*p with a distinctive label; in the following frames the feature is tracked and the above process is repeated, storing the window with the same label. Then vectors with the same label (same feature i) are regrouped in a matrix U_i = [u_1i, ..., u_ni]^T, where n is the number of frames in which the feature has been tracked.
For each matrix U the ICA or PCA is then applied as shown in Section 3.2, along with a dimensional reduction that retains only the largest eigenvalue. At the output of the ICA or PCA we obtain a descriptor q_i with a dimension equal to the feature window size. The descriptors are stored in a database with a unique label for each feature.
In the recognition phase, features are detected but not tracked by the KLT for each incoming frame. For each detected feature a window is obtained in the same way as in the learning stage and sorted into a vector v_i; the ICA or PCA is applied directly to this vector without dimensional reduction, producing a descriptor x_j. A fast k-nearest-neighbor algorithm is applied to the database in order to find the two nearest neighbor descriptors q_i1 and q_i2. Let k_1 = d(x_j, q_i1) and k_2 = d(x_j, q_i2) (d is the Euclidean distance), with k_1 ≤ k_2, and let ρ = k_1 / k_2. This factor is used by our algorithm as a threshold for considering a good match between the candidate descriptor x_j and its corresponding nearest descriptors q_i1 and q_i2 in the database. When ρ tends to 0 the best candidate is much closer than the runner-up and, empirically, the results are better.
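The 2-NN distance-ratio test just described can be sketched as follows (numpy assumed; function name and the 0.6 threshold value are illustrative).

```python
import numpy as np

def ratio_match(x_j, descriptors, thresh=0.6):
    """Accept a match when rho = k1/k2, the ratio of the distances to the two
    nearest database descriptors q_i1 and q_i2, falls below a threshold:
    rho close to 0 means the best match is much closer than the runner-up."""
    d = np.sort(np.linalg.norm(descriptors - x_j, axis=1))
    k1, k2 = d[0], d[1]                  # k1 <= k2 after sorting
    rho = k1 / k2
    return rho, rho < thresh

db = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
rho, ok = ratio_match(np.array([0.1, 0.0]), db)
print(ok, rho < 0.1)  # True True: a confident match
```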
SVM-ICA and SVM-PCA methods
We used LIBSVM [12] for the implementation of the SVM. The method follows exactly the same steps as the feature-class method (Section 2). In step 3 (descriptor creation), unlike the ICAD and PCAD methods, the ICA or PCA is applied directly to the vector u_fi obtained from the p-by-p pixel window (step 2) and stored in the database (step 4). At the output of the ICA or PCA we obtain a descriptor with the same dimension as the pixel window.
For step 5 (feature-class creation) we employ the descriptor database to train an SVM classifier with a radial basis function (RBF) kernel, Equation (4). The parameters C = 8 and γ = 1 used in the RBF were selected by cross-validation and grid search. For the recognition phase, the SVM output model is used to predict the feature class V of the candidate descriptor x_j, as explained in Section 2 (recognition phase).
KNN-ICA and KNN-PCA methods
For the implementation of KNN we employ a computationally efficient algorithm called approximate nearest neighbor (ANN). The method follows the same steps as SVM-ICA and SVM-PCA, except that a training model is not generated from the descriptor database. Prior to the recognition phase, the whole database is loaded into memory by the ANN algorithm. In the experiments we used 5-NN. For the recognition phase, ANN is applied as explained in Section 3.3. A threshold τ is used for considering a good match between a candidate descriptor x_j and the most voted feature class V (Equation (5)).
IV. EXPERIMENTS
We have implemented a C++ version of the
methods that runs on a PC 2GHz Pentium IV processor,
512MB RAM. A non-expensive USB Webcam with a
maximum resolution of 640-by-480 pixels and 30 fps
has been used.
We performed a variety of experiments in order to show the performance of the aforementioned methods. For each method, in the learning phase a video sequence of a rigid desktop scene was recorded, moving the camera slowly and continuously in order to obtain a change of some degrees in the 3D point of view and in the rotation of the camera. Twenty descriptors were then created from this video sequence as described in Section 3. In Fig. 2 (upper and lower center) the scene used in the learning phase is shown.
Together with these 20 descriptors, the database contains another 1000 descriptors corresponding to other video sequences. The objective of these experiments is to observe the response of the methods when a set of descriptors coming from an online video sequence (close to the learning sequence, as explained below) is matched against the descriptor database. The response was observed in four different situations, which can be seen in Fig. 2:
case a) change in 3D viewpoint with respect to the position in the learning phase and little change in illumination,
case b) change in 3D point of view plus change in rotation and little change in illumination,



case c) change in 3D point of view plus change in
scale (camera zoom) and little change in illumination,
case d) change in 3D point of view plus great change
in illumination.

We define the response of the methods in terms of two measurements:
1) Error of classification: the ratio between the percentage of false positives and the percentage of classification. We define the percentage of classification as the ratio between the total number of features classified in the scene (correctly or incorrectly) and the total that could potentially be matched (we consider only a finite number of possible locations in each frame to be matched; step 1 of the recognition phase). A direct trade-off exists between the threshold and the percentage of classification: the lower the threshold, the lower the number of false positives, but consequently the percentage of classification is also lower. Depending on the type of algorithm it is possible to establish a threshold for considering a good match, as in the KNN approach; but in the case of SVM it is not possible, and the method amounts to choosing the "best" candidate. Since we consider that the distance between the hyperplane and the candidate is not a good quality measurement, we take the percentage of classification for SVM as 100 percent. Therefore 38 percent false positives in SVM means a classification error of 0.38. In KNN the percentage of false positives for a threshold τ = 0.6 could be 16 percent while the percentage of classification is 50 percent; consequently, the error of classification is 0.32. The best case is 0 percent false positives and 100 percent classification.
2) Computational cost: we have calculated the time per frame (or equivalently the frequency) that the different methods take to classify 30 possible features against the 1000 descriptors in the database.
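As a worked example of the first measure, using the KNN numbers quoted in the text:

```python
# Error of classification = (fraction of false positives) /
#                           (fraction of candidate features classified).
false_positive = 0.16   # 16% false positives at threshold 0.6 (KNN example)
classification = 0.50   # 50% of candidate features were classified
error = false_positive / classification
print(error)  # 0.32
```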

The results for the experiments and the
measurements are shown in Table 1.


Fig. 2. Examples of frames used in the learning phase (upper and lower central images) and frames used in the recognition phase: case a (lower right), case b (upper right), case c (lower left), case d (upper left).
Table 1. Error of classification for each condition case (a, b, c and d) in the recognition phase and their computational cost. CPU* does not include the time to detect features by the KLT tracker.

Case   PCAD      ICAD      SVM-PCA   SVM-ICA   KNN-PCA   KNN-ICA
a      .40       .35       .33       .21       .26       .21
b      .47       .45       .44       .38       .31       .32
c      .47       .33       .41       .28       .50       .35
d      .48       .47       .43       .42       .37       .32
CPU    5.34 Hz   5.05 Hz   2.32 Hz   2.20 Hz   5.34 Hz   3.84 Hz
CPU*   16.30 Hz  10.70 Hz  3.22 Hz   2.94 Hz   21.72 Hz  7.09 Hz



V. CONCLUSIONS
In this work we presented a general feature-class method for image feature matching in a wide base line using statistical classification methods and descriptors, as well as a comparative study of different statistical methods with the objective of observing their relative performance and the effectiveness of the feature-class methodology.
In general, ICA-based descriptors show a lower error of classification than PCA-based descriptors, but the computational cost of ICA is greater than that of PCA. We can also observe a lower error of classification in the descriptor feature-class methods (SVM-PCA, SVM-ICA, KNN-PCA and KNN-ICA) compared with the descriptor-feature methods (PCAD and ICAD), and a similar computational cost for PCAD and ICAD compared with KNN-PCA and KNN-ICA. Among the descriptor feature-class methods, KNN shows better performance than SVM both in error of classification and in computational speed.
The results show the effectiveness of the feature-class method even though the descriptor-feature methods are also based on multiple adjacent views.
It is very important to note the principal objective of this work: using the feature-class methodology can improve the performance of image feature matching in a wide base line, since invariance to changes like illumination or point of view is difficult to achieve with a single-view descriptor. The key idea is to learn the variations of image features over time using statistical methods. This is independent of the kind of descriptor used to represent the image feature. For this reason, the experimental results of the different descriptor methods presented in this work have to be interpreted relative to each other.
In future work we want to try a more robust descriptor like SIFT. Looking at the results of this work, we can expect that using SIFT [5] together with the feature-class method will improve performance compared with using SIFT alone in applications, like mobile robots, where an incoming video stream is available.
REFERENCES
[1] J. Shi, C. Tomasi, Good features to track, Proc. IEEE CVPR,
1994.
[2] C. Harris, M. Stephens, A combined corner and edge detector,
Alvey Vision Conf., 1988.
[3] K. Mikolajczyk, C. Schmid, An affine invariant interest point
detector, Proc. ECCV, 2002.
[4] D. Lowe, Object recognition from local scale-invariant
features, Proc. ICCV, Corfu, Greece, September 1999.
[5] D. G. Lowe, Distinctive image features from scale-invariant
keypoints, International Journal of Computer Vision, 60 (2):91-
110, 2004.
[6] J Meltzer, M H. Yang, R Gupta, S. Soatto, Multiple view feature
descriptors from image sequences via kernel principal component
analysis, Proc. ECCV, 2004.
[7] I. T. Jolliffe. Principal Component Analysis. Springer Verlag,
1986.
[8] P. Comon, Independent component analysis, a new concept?,
Signal Processing, Elsevier, 36(3):287-314, April 1994.
[9] B. E. Boser, I. Guyon, and V. Vapnik, A training algorithm for
optimal margin classifiers, Proc. of the Fifth Annual Workshop
on Computational Learning Theory 5, pp. 144-152, 1992.
[10] A. Hyvarinen, E. Oja. A fast fixed-point algorithm for
independent component analysis, Neural Computation, 9
(7):1483-1492, 1997.
[11] L. Sirovich, Turbulence and the dynamics of coherent structures,
Part 1: Coherent Structure, Quarterly of Applied Mathematics,
45 (3):561-571, October 1987.
[12] Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
