
Dominant Orientation Templates for Real-Time Detection of Texture-Less Objects

Stefan Hinterstoisser¹, Vincent Lepetit², Slobodan Ilic¹, Pascal Fua², Nassir Navab¹

¹ Department of Computer Science, CAMP, Technische Universität München (TUM), Germany
² École Polytechnique Fédérale de Lausanne (EPFL), Computer Vision Laboratory, Switzerland
{hinterst,slobodan.ilic,navab}@in.tum.de, {vincent.lepetit,pascal.fua}@epfl.ch

Abstract

We present a method for real-time 3D object detection that does not require a time-consuming training stage and can handle untextured objects. At its core is a novel template representation designed to be robust to small image transformations. This robustness, based on dominant gradient orientations, lets us test only a small subset of all possible pixel locations when parsing the image, and represent a 3D object with a limited set of templates. We show that, together with a binary representation that makes evaluation very fast and a branch-and-bound approach to efficiently scan the image, it can detect untextured objects in complex situations and provide their 3D pose in real-time.
Figure 1. Overview. Our templates can detect non-textured objects over cluttered background in real-time without relying on feature point detection. Adding new objects is fast and easy, as it can be done online without the need for an initial training set. Only a few templates are required to cover all appearances of the objects.

1. Introduction

Currently, the dominant approach to object recognition is to use statistical learning to build a classifier offline, and then to use it at run-time for recognition [17]. This works remarkably well but is not applicable in all scenarios, for example in a system that has to continuously learn new objects online. It is then difficult, or even impossible, to update the classifier without losing efficiency.

To overcome this problem, we propose an approach based on real-time template recognition. With such a tool at hand, it becomes trivial and virtually instantaneous to learn new incoming objects by simply adding new templates to the database, while simultaneously maintaining reliable real-time recognition.

However, we also wish to keep the advantages of statistical methods: they learn how to reject unpromising image locations very quickly, which increases their run-time performance considerably, and they can be very robust because they generalize well from the training set. For these reasons, we designed our template representation around fast-to-compute image statistics that provide invariance to small translations and deformations, which in turn allows us to search the image quickly yet reliably.

As shown in Figure 1, in this paper we propose a template representation that is invariant enough to make search in the images very fast and that generalizes well. As a result, we can almost instantaneously learn new objects and recognize them in real-time, without requiring much time for training or any feature point detection at runtime.

Our representation is related to the Histograms-of-Gradients (HoG) representation [1], which has proved to generalize well. Instead of local histograms, it relies on locally dominant orientations and is made explicitly invariant to small translations. Our experiments show that it is in practice at least as discriminant as HoG, while being much faster. Because it is explicitly made invariant to small translations, we can skip many locations while parsing the images without the risk of missing the targets. Moreover, we developed a bit-coding method inspired by [16] to evaluate an image location for the presence of a template. It mostly uses simple bit-wise operations and is therefore very fast on modern CPUs. Our similarity measure also fulfills the requirements of recent branch-and-bound exploration techniques [10], speeding up the search even more.

In the remainder of the paper we first discuss related work, then explain our template representation and how the similarity can be evaluated very fast. We then show quantitative experiments and real-world applications of our method.

2. Related Work

Template matching is attractive for object detection because of its simplicity and its capability to handle different types of objects. It neither needs a large training set nor a time-consuming training stage, and it can handle low-textured objects, which are, for example, difficult to detect with feature-point-based methods.

An early approach to template matching [13] and its extension [3] use the Chamfer distance between the template and the input image contours as a dissimilarity measure. This distance can be computed efficiently using the image Distance Transform (DT). It tends to generate many false positives, but [13] shows that taking the orientations into account drastically reduces their number. [9] is also based on the Distance Transform; it is invariant to scale changes and robust enough against perspective distortions to do real-time matching. Unfortunately, it is restricted to objects with closed contours, which are not always available.

The main weakness of all Distance Transform-based methods, however, is the need to extract contour points, with the Canny detector for example, and this stage is relatively fragile: it is sensitive to illumination changes, noise and blur. For instance, if the image contrast is lowered, contours on the object may not be detected and the detection will fail.

The method proposed in [15] tries to overcome these limitations by considering the image gradients instead of the image contours. It relies on the dot product as a similarity measure between the template gradients and those in the image. Unfortunately, this measure declines rapidly with the distance to the object location, or when the object appearance is even slightly distorted. As a result, the similarity measure must be evaluated densely, and with many templates to handle appearance variations, which makes the method computationally costly. Using image pyramids provides some speed improvement, but fine yet important structures tend to be lost if the scale space is not sampled carefully.

Histograms of Gradients [1] is another very popular method. It describes the local distributions of image gradients computed on a regular grid. It has proven to give reliable results but tends to be slow due to its computational complexity.

Recently, [2] proposed a learning-based method that recognizes objects via a Hough-style voting scheme with a non-rigid shape matcher on the contour image. It relies on statistical methods to learn the model from few images that are only constrained with a bounding box around the object. While it gives very good classification results, the approach is neither appropriate for real-time object tracking, due to its expensive computation, nor exact enough to return the correct pose of the object. Moreover, it inherits all the disadvantages of Distance Transform-based methods mentioned previously.

Grabner and Bischof [4, 5] developed another learning-based approach that puts more focus on online learning. In [4, 5] it is shown how a classifier can be trained online in real-time, with a training set generated automatically. However, [4] was demonstrated only on textured objects, and [5] cannot provide the object pose.

The method proposed in this paper combines the strength of the similarity measure of [15], the robustness of [1], and the online learning capability of [4, 5]. In addition, by binarizing the template representation and using the recent branch-and-bound method of [10], our method becomes very fast, making the detection of untextured 3D objects possible in real-time.

3. Proposed Approach

In this section, we describe our Dominant Orientation Templates, and how they can be built and used to parse images to quickly find objects. We start by deriving our similarity measure, emphasizing the contribution of each aspect of it. We then show how to use a binary representation to compute the similarity with efficient bit-wise operations. Finally, we demonstrate how to use it within a branch-and-bound exploration of the image.

3.1. Initial Similarity Measure

Our starting idea is to measure the similarity between an input image I and a reference image O of an object centered on a location c in the image I by comparing the orientations of their gradients.

We chose to consider image gradients because they proved to be more discriminant than other forms of representation [11, 15] and are robust to illumination changes and noise. For even more robustness to such changes, we use the gradient magnitudes only to retain the orientations of the strongest gradients, without using their actual values for matching. Also, to correctly handle object occluding boundaries, we consider only the orientations of the gradients, as opposed to their directions: two vectors with a 180-degree angle between them have the same orientation. In this way, the measure is not affected by whether the object appears over a dark or a bright background. Moreover, as in SIFT or HoG [1], we discretize the orientations to a small number n_o of integer values.
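For illustration, the orientation discretization just described can be sketched as follows. This is a minimal sketch, not the paper's implementation; the function name, the use of atan2, and the folding of directions into orientations modulo 180 degrees are our own illustrative choices.

```cpp
#include <cassert>
#include <cmath>

// Quantize a gradient vector (gx, gy) into one of n_o = 7 orientation bins
// over [0, 180) degrees. Opposite vectors fall into the same bin, so the
// measure is unaffected by a dark versus a bright background.
const int kNumOrientations = 7;

int quantizeOrientation(float gx, float gy)
{
    // atan2 gives the gradient *direction* in (-180, 180] degrees;
    // adding 180 to negative angles keeps only the *orientation*.
    float angle = std::atan2(gy, gx) * 180.0f / 3.14159265358979f;
    if (angle < 0.0f)    angle += 180.0f;
    if (angle >= 180.0f) angle -= 180.0f;
    // Map [0, 180) onto the integer bins {0, ..., kNumOrientations - 1}.
    int bin = static_cast<int>(angle * kNumOrientations / 180.0f);
    return bin < kNumOrientations ? bin : 0; // 180 degrees wraps to bin 0
}
```

A vertical gradient and its negation land in the same bin, which is exactly the occluding-boundary behavior motivated above.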
Our initial energy function E1 counts how many orientations are similar between the image and the template centered on location c, and can be formalized as:

E1(I, O, c) = Σ_r δ( ori(I, c + r) = ori(O, r) ) ,   (1)

where

• δ(P) is a binary function that returns 1 if P is true and 0 otherwise;

• ori(O, r) is the discretized gradient orientation in the reference image O at location r, which parses the template. Similarly, ori(I, c + r) is the discretized gradient orientation at c shifted by r in the input image I.

Figure 2. Similarity measure E4. Our final energy measure E4 counts how many times a local dominant orientation for a region R in the image belongs to the corresponding precomputed list of orientations L(O, R) for the corresponding template region. Each list is made of the local dominant orientations that are in the region R when the object template is slightly translated.

3.2. Robustness to Small Deformations

To make our measure tolerant to small deformations, and also to make it faster to compute, we will not consider all possible locations, and will decompose the two images into small squared regions R over a regular grid. For each region, we will consider only the dominant orientations. Such an approach is similar to the HMAX pooling mechanism [14]. Our similarity measure can now be modified as:

E2(I, O, c) = Σ_{R in O} δ( do(I, c + R) ∈ DO(O, R) ) ,   (2)

where DO(O, R) returns the set of orientations of the k strongest gradients in region R of the object reference image. In contrast, do(I, c + R) returns only one orientation, the orientation of the strongest gradient in the region R shifted by c in the input image.

The reason why we chose each region in O to be represented by the strongest gradients is that the strongest gradients are easy and fast to identify and very robust to noise and illumination changes. Moreover, to describe uniform regions, we introduce the symbol ⊥ to indicate that no reliable gradient information is available for the region. The DO(.) function therefore returns either a set of discretized gradient orientations of the k strongest gradients in the range [0, n_o − 1], or {⊥}, and can formally be written as:

DO(O, R) = S(O, R) if S(O, R) ≠ ∅, and {⊥} otherwise,   (3)

with

S(O, R) = { ori(O, l) : l ∈ maxmag_k(R) ∧ mag(O, l) > τ } ,   (4)

where

• l is a pixel location in R;

• ori(O, l) is the gradient orientation at l in image O, and mag(O, l) its magnitude;

• maxmag_k(R) is the set of locations of the k strongest gradients in R. In practice we take k = 7, but the choice of k does not seem critical;

• τ is a threshold on the gradient magnitudes to decide whether the region is uniform or not.

The function do(I, c + R) is computed similarly in the input image I. However, to be faster at runtime, k is restricted to 1 in do(I, c + R), and therefore do(I, c + R) returns one single element.

3.3. Invariance to Small Translation

We will now explicitly make our similarity measure invariant to small motions. In this way, we will be able to consider only a limited number of locations c when parsing an image and save a significant amount of time without increasing the chance of missing the target object. To do so, we consider a measure that returns the maximal value of E2 when the object is slightly moved, which can be written as:

E3(I, O, c) = max_{M ∈ M} E2(I, w(O, M), c)
            = max_{M ∈ M} Σ_{R in O} δ( do(I, c + R) ∈ DO(w(O, M), R) ) ,   (5)

where w(O, M) is the image O of the object warped by a transformation M. In practice, we consider for M only 2D translations, as this appears sufficient to handle other small deformations, and M is the set of all (small) translations in the range [−t; +t]².

There is of course a limit on the range t. A large t results in a high speed-up but also in a loss of discriminative power of the function. In practice, we found that t = 7 for 640 × 480 images is a good trade-off.
3.4. Ignoring the Dependence between Regions

Our last step is to ignore the dependence between the different regions R. This simplifies and significantly speeds up the computation of the similarity. We therefore approximate E3 as given in Eq.(5) by:

E4(I, O, c) = Σ_{R in O} max_{M ∈ M} δ( do(I, c + R) ∈ DO(w(O, M), R) ) .   (6)

The speed-up comes from the fact that, for each region R, we can precompute a list L(O, R) of the dominant orientations in R when O is translated over M. As illustrated by Figure 2, the measure E4 can thus be written as:

E4(I, O, c) = Σ_{R in O} δ( do(I, c + R) ∈ L(O, R) ) ,   (7)

and L(O, R) can formally be written as:

L(O, R) = { o : ∃ M ∈ M such that o ∈ DO(w(O, M), R) } .   (8)

The collection of lists over all regions R in O forms the final object template.

3.5. Using Bitwise Operations

Inspired by [16], and as shown in Figure 3, we efficiently compute the energy function E4 using a binary representation of the lists L(O, R) and of the dominant orientations do(I, c + R). This allows us to compute E4 with only a few bitwise operations.

By setting n_o, the number of discretized orientations, to 7, we can represent a list L(O, R) or a dominant orientation do(I, c + R) with one byte, i.e. an 8-bit integer. Each of the 7 first bits corresponds to an orientation, while the last bit stands for ⊥.

More exactly, to each list L(O, R) corresponds a byte L whose i-th bit, with 0 ≤ i ≤ 6, is set to 1 iff i ∈ L(O, R), and whose 7th bit is set to 1 iff ⊥ ∈ L(O, R). A byte D can be constructed similarly to represent a dominant orientation do(I, c + R). Note that only one bit of D is set to 1. Now the term δ( do(I, c + R) ∈ L(O, R) ) in Eq.(7) can be evaluated very quickly. We have:

δ( do(I, c + R) ∈ L(O, R) ) = 1 iff L ⊗ D ≠ 0 ,   (9)

where ⊗ is the bitwise AND operation.

Figure 3. Computing the similarity E4 using bitwise operations and a lookup table that counts how many terms δ(), as in Eq.(9), are equal to 1.

int energy_function4( __m128i lhs, __m128i rhs )
{
    __m128i a = _mm_and_si128(lhs, rhs);
    // SSE has no non-equality comparison for unsigned 8-bit integers,
    // so the AND'ed result is compared against zero instead.
    __m128i b = _mm_cmpeq_epi8(a, _mm_setzero_si128());
    return lookuptable[_mm_movemask_epi8(b)];
}

Listing 1. C++ energy function for 16 regions with 3 SSE instructions and one look-up in a table with 16-bit entries. Since in SSE there is no comparison on non-equality for unsigned 8-bit integers, we have, in contrast to Figure 3, to compare the AND'ed result to zero and count the "0" instead.

3.6. Using SSE Instructions

The computation of E4 as formulated in Section 3.5 can be further sped up using SSE operations. In addition to bitwise operations, which are already very fast, SSE technology allows the same operation to be performed on 16 bytes in parallel. Thus, by using the function given in Listing 1, the similarity score for 16 regions can be computed with only 3 SSE operations and one lookup in a table with 16-bit entries. If n denotes the number of regions R, we only have to use 3⌈n/16⌉ SSE instructions, ⌈n/16⌉ uses of the lookup table, and an additional ⌈n/16⌉ − 1 "+" operations if the number of regions n is larger than 16. Assuming that each operation has the same computational cost, we need 5⌈n/16⌉ − 1 operations for n regions, which results in only ≈ 0.3 operations per region.

This method is extremely cache friendly because only successive chunks of 128 bits are processed at a time, which keeps the number of cache misses low. This is very important because SSE technology is very sensitive to optimal cache alignment. This is probably why, although our energy function is slightly more computationally expensive in theory than [16], we found that our formulation performed 1.5 times faster in practice.

Another advantage of our algorithm is that it is very flexible with respect to varying template sizes without losing the ability to use the computational capacities very efficiently. In our method, the optimal processor load is reached at multiples of 16 regions, in contrast to [16], which needs multiples of 128 in a possible dynamic SSE implementation. The probability of wasting computational power is therefore much lower with our approach.
Figure 4. Method comparisons on the Oxford Graffiti and Wall datasets. (a-b): Matching scores for the Graffiti and Wall sets when increasing the viewpoint angle. Our method is referred to as "DOT", and reaches a 100% score on both sets for every angle. These results are discussed in Section 4.1. (c) shows the overlaps between the retrieved and expected regions as an accuracy measure for Graffiti. These results are discussed in Section 4.2.

Figure 5. (a) Comparison of different methods and clustering schemes with respect to speed. Our method with our clustering scheme outperforms all other methods and clustering schemes, as discussed in Section 4.3. (b) In Section 4.4 we discuss the linear behavior of our method with respect to occlusion. (c) t = 7 is a good trade-off between speed and robustness (Section 4.5).

3.7. Clustering for Efficient Branch and Bound

We can further improve the scalability of our method by exploiting the similarity between different templates representing different objects under different views. The general idea is to build clusters of similar templates, each of them represented by what we will refer to as a cluster template. A cluster template is computed as a bitwise OR operation applied to all the templates belonging to the same cluster. It provides tight upper bounds and can be used in a branch-and-bound constrained search as in [10]. By first computing the similarity measure E4 between the image and the cluster templates at run-time, we can reject all the templates that belong to a cluster template not similar enough to the current image.

We use a bottom-up clustering method: to build a cluster, we start from a template picked randomly among the templates that do not yet belong to a cluster. We then look for the not-yet-picked template most similar to it according to the Hamming distance. We proceed this way, using the Hamming distance between the current cluster template and the remaining templates, until the cluster has a given number of templates assigned. We then continue building clusters until every template is assigned to a cluster.

For our approach, this clustering scheme allows faster runtime than the binary tree clustering suggested in [16], as will be shown in Section 4.3.
4. Experimental Results

In the experiments, we compared our approach, called DOT (for Dominant Orientation Templates), to Affine Region Detectors [12] (Harris-Affine, Hessian-Affine, MSER, IBR, EBR), to patch rectification methods [8, 7, 6] (Leopar, Panter, Gepard), and to the Histograms-of-Gradients (HoG) template matching approach [1].

For HoG, we used our own SSE-optimized implementation. In order to detect the correct template from a large template database, we replaced the Support Vector Machine mentioned in the original work on HoG by a nearest neighbor search, since we want to avoid a training phase and look for a robust representation instead.

We did the performance evaluation on the Oxford Graffiti and the Oxford Wall image sets [12].
Since no video sequence is available, we synthesized a training set by scaling and rotating the first image of the dataset for changes in viewpoint angle up to 75 degrees, and by adding random noise and affine illumination changes.

4.1. Robustness

The matching scores of the different methods are shown in Figure 4(a) for the Graffiti dataset and in Figure 4(b) for the Wall dataset. As defined in [12], this score is the ratio between the number of correct matches and the smaller number of regions detected in one of the two images.

For the affine regions, we first extract the regions using the different region detectors and match them using SIFT. Two regions are said to be correctly matched if the overlap error of the normalized regions is smaller than 40%. In our case, the regions are defined as the patches warped by the retrieved transformation. For a fair comparison, we used the same numbers and appearances of templates for the DOT and HoG approaches. We also turned off the final check on the correlation for all patch rectification approaches (Leopar, Panter, Gepard), since there is no equivalent for the affine regions.

DOT and HoG clearly outperform the other approaches by delivering optimal matching results of 100% on the Graffiti image set. For the Wall image set, DOT again performs optimally with a matching rate of 100%, while HoG performs worse for larger viewpoint changes.

These very good performances can be explained by the fact that DOT and HoG scan the whole image, while the affine region approaches depend on the quality of the region extraction. As will be shown in Section 4.3, even though it parses the whole image, our approach is fast enough to compete with affine region and patch rectification approaches in terms of computation times.

4.2. Detection Accuracy

As was done in [7], in Figure 4(c) we compare the average overlap between the ground truth quadrangles and their corresponding warped versions obtained with DOT, HoG, the patch rectification methods, and the affine region detectors. We ran the experiments for overlap and accuracy on both image sets, but due to the similarity of the results and the lack of space, we only show the results on the Graffiti image set. Since the Affine Region Detectors deliver elliptic regions, we fit quadrangles around these ellipses by aligning them to the main gradient orientation, as was done in [7].

The average overlap is very close to 100% for DOT and HoG, about 10% better than MSER and about 20% better than the other affine region detectors.

4.3. Speed

Although performing similarly in terms of robustness and accuracy, DOT clearly outperforms HoG in terms of speed by several orders of magnitude. In order to compare both approaches, we trained them on the same locations and appearances in a 640 × 480 image with |R| = 121. The experiment was done on a standard notebook with an Intel Centrino Core2Duo processor at 2.4 GHz with 3 GB RAM, where unoptimized training of one template took 1.8 ms and the clustering of about 1600 templates 0.76 s. As one can see in Figure 5(a), when using about 1600 templates our approach is about 310 times faster at runtime than our SSE-optimized HoG implementation. The reason for this is both the robustness to small deformations, which allows DOT to skip most of the pixel locations, and the binary representation of our templates, which enables a fast similarity evaluation.

We also compared our similarity measure to an SSE-optimized version of Taylor's [16]. Our approach is consistently about 1.5 times faster than Taylor's. We believe this is due to the cache-friendly formulation of E4, where we successively use sequential chunks of 128 bits at a time, while [16] has to jump back and forth within 1024 bits (in case |R| = 121) to successively OR pairs of 128-bit vectors and accumulate the result in an SSE register (for a closer explanation of Taylor's similarity measure, please refer to [16]).

We also ran experiments with the different clustering schemes, comparing the approach without clustering to the binary tree of [16] and to our clustering described in Section 3.7. Surprisingly, our clustering is twice as fast as the binary tree clustering at runtime. Although the matching should behave in O(log(N)) time, our implementation of the binary tree clustering behaves linearly up to about 1600 templates, as was also observed by [16]. As the authors of [16] state, the reason for this might be that there are not enough overlapping templates to fully exploit the potential of their tree structure.

4.4. Occlusion

Occlusion is a very important aspect of template matching. To test our approach with respect to occlusion, we selected 100 templates in the first image of the Oxford Graffiti image set, added small image deformations, noise and illumination changes, and incrementally occluded the templates in 2.5% steps from 0% to 100%. The results are displayed in Figure 5(b). As expected, the similarity score of our method behaves linearly with the percentage of occlusion. This is a desirable property, since it allows partly occluded templates to be detected by setting the detection threshold with respect to the tolerated percentage of occlusion.
Figure 6. Failure case. When the object does not exhibit strong gradients, as in the blurry image on the left, our method performs worse than HoG.

4.5. Region Size

The size of the region R is another important parameter. The larger the region R gets, the faster the approach becomes at runtime. However, as the size of the region increases, the discriminative power of the approach decreases, since the number of gradients to be considered rises. It is therefore necessary to choose the size of the region R carefully, as a compromise between speed and robustness. In the following experiment on the Graffiti image set, we tested the behavior of DOT with respect to the matching score and the size of the region R. The result is shown in Figure 5(c). While the matching score is still 100% for regions of 7 × 7 pixels, one can see that the robustness decreases with increasing region size. Although it depends on the texture and on the density of strong gradients within a region R, we empirically found on many different objects that a region size of 7 × 7 gives very good results.

4.6. Failure Cases

Figure 6 shows the limitation of our method: to obtain optimal results as in Figure 4, the templates have to exhibit strong gradients. In the case of too smooth or blurry template images, HoG tends to perform better.

4.7. Applications

Due to the robustness and the real-time capability of our approach, DOT is suited for many different applications, including untextured object detection as shown in Figure 8, and planar patch detection as shown in Figure 9. Although neither a final refinement nor any final verification, in contrast to [7] for example, was applied to the found 3D objects, the results are very accurate, robust and stable. Creating the templates for new objects is easy, as illustrated in Figure 7.

5. Conclusion

We introduced a new binary template representation based on locally dominant gradient orientations that is invariant to small image deformations. It can very reliably detect untextured 3D objects from many different viewpoints in real-time, using relatively few templates. We have shown that our approach outperforms state-of-the-art methods with respect to the combination of recognition rate and speed. Moreover, template creation is fast and easy, requires only a few exemplars instead of a training set, and can be done interactively.

Acknowledgment: This project was funded by the BMBF project AVILUSplus (01IM08002).

References

[1] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In CVPR, 2005.
[2] V. Ferrari, F. Jurie, and C. Schmid. From images to shape models for object detection. IJCV, 2009.
[3] D. Gavrila and V. Philomin. Real-time object detection for "smart" vehicles. In ICCV, 1999.
[4] M. Grabner, H. Grabner, and H. Bischof. Tracking via Discriminative Online Learning of Local Features. In CVPR, 2007.
[5] M. Grabner, C. Leistner, and H. Bischof. Semi-supervised on-line boosting for robust tracking. In ECCV, 2008.
[6] S. Hinterstoisser, S. Benhimane, V. Lepetit, P. Fua, and N. Navab. Simultaneous recognition and homography extraction of local patches with a simple linear classifier. In BMVC, 2008.
[7] S. Hinterstoisser, S. Benhimane, N. Navab, P. Fua, and V. Lepetit. Online learning of patch perspective rectification for efficient object detection. In CVPR, 2008.
[8] S. Hinterstoisser, O. Kutter, N. Navab, P. Fua, and V. Lepetit. Real-time learning of accurate patch rectification. In CVPR, 2009.
[9] S. Holzer, S. Hinterstoisser, S. Ilic, and N. Navab. Distance transform templates for object detection and pose estimation. In CVPR, 2009.
[10] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond Sliding Windows: Object Localization by Efficient Subwindow Search. In CVPR, 2008.
[11] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2):91-110, 2004.
[12] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 2005.
[13] C. F. Olson and D. P. Huttenlocher. Automatic target recognition by matching oriented edge pixels. IP, 6, 1997.
[14] T. Serre and M. Riesenhuber. Realistic modeling of simple and complex cell tuning in the HMAX model, and implications for invariant object recognition in cortex. Technical report, MIT, 2004.
[15] C. Steger. Occlusion, Clutter, and Illumination Invariant Object Recognition. In IAPRS, 2002.
[16] S. Taylor and T. Drummond. Multiple target localisation at over 100 fps. In BMVC, 2009.
[17] P. Viola and M. Jones. Robust real-time object detection. IJCV, 2001.
Figure 7. Template creation. To easily define the templates for a new object, we use DOT to detect a known object (the ICCV logo in this case) placed next to the object to learn, in order to estimate the camera pose and to define an area in which the object to learn is located. A template for the new object is created from the first image, and we start detecting the object while moving the camera. When the detection score becomes too low, a new template is created, in order to cover the different object appearances as the viewpoint changes.

Figure 8. Detection of different objects at about 12 fps over a cluttered background. The detections are shown by superimpos-
ing the thresholded gradient magnitudes from the object image over the input images. The corresponding video is available on
http://campar.in.tum.de/Main/StefanHinterstoisser.

Figure 9. Patch 3D orientation estimation. Like Gepard [8], DOT can detect planar patches and provide an estimate of their orientations.
DOT is however much more reliable as it does not rely on feature point detection, but parses the image instead. The corresponding video
is available on http://campar.in.tum.de/Main/StefanHinterstoisser.
