
From images to shape models for object detection

Vittorio Ferrari, Frédéric Jurie, Cordelia Schmid

To cite this version:

Vittorio Ferrari, Frédéric Jurie, Cordelia Schmid. From images to shape models for object detection. International Journal of Computer Vision, Springer Verlag, 2010, 87 (3), pp. 284-303. 10.1007/s11263-009-0270-9. inria-00548643

HAL Id: inria-00548643
https://hal.inria.fr/inria-00548643
Submitted on 20 Dec 2010

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
IJCV manuscript No.
(will be inserted by the editor)

From images to shape models for object detection


Vittorio Ferrari · Frederic Jurie · Cordelia Schmid

Received: date / Accepted: date

Abstract We present an object class detection approach which fully integrates the complementary strengths offered by shape matchers. Like an object detector, it can learn class models directly from images, and can localize novel instances in the presence of intra-class variations, clutter, and scale changes. Like a shape matcher, it finds the boundaries of objects, rather than just their bounding-boxes. This is achieved by a novel technique for learning a shape model of an object class given images of example instances. Furthermore, we also integrate Hough-style voting with a non-rigid point matching algorithm to localize the model in cluttered images. As demonstrated by an extensive evaluation, our method can localize object boundaries accurately and does not need segmented examples for training (only bounding-boxes).

Fig. 1 Example object detections returned by our approach (see also figure 13).

This research was supported by the EADS foundation, INRIA, CNRS, and SNSF. V. Ferrari was funded by a fellowship of the EADS foundation and by SNSF.

V. Ferrari, ETH Zurich. E-mail: [email protected]
F. Jurie, University of Caen. E-mail: [email protected]
C. Schmid, INRIA Grenoble. E-mail: [email protected]

1 Introduction

In the last few years, the problem of learning object class models and localizing previously unseen instances in novel images has received a lot of attention. While many methods use local image patches as basic features [18,27,40,44], recently several approaches based on contour features have been proposed [2,14,16,25,28,32,39]. These are better suited to represent objects defined by their shape, such as mugs and horses. Most of the methods that train without annotated object segmentations can localize objects in test images only up to a bounding-box, rather than delineating their outlines. We believe the main reason lies in the nature of the proposed models, and in the difficulty of learning them from real images, as opposed to hand-segmented shapes [8,12,21,37]. The models are typically composed of rather sparse collections of contour fragments with a loose layer of spatial organization on top [16,25,32,39]. A few authors even go to the extreme of using individual edgels as modeling units [2,28]. In contrast, an explicit shape model formed by continuous connected curves completely covering the object outlines is more desirable, as it would naturally support boundary-level localization in test images.

In order to achieve this goal, we propose an approach which bridges the gap between shape matching and object detection. Classic non-rigid shape matchers [3,6,8,37] produce point-to-point correspondences, but need clean pre-segmented shapes as models. In contrast, we propose a method that can learn complete shape models directly from images.
Moreover, it can automatically match the learned model to cluttered test images, thereby localizing novel class instances up to their boundaries (as opposed to a bounding-box).

The main contribution of this paper is a technique for learning the prototypical shape of an object class as well as a statistical model of intra-class deformations, given image windows containing training instances (figure 3a; no pre-segmented shapes are needed). The challenge is to determine which contour points belong to the class boundaries, while discarding background and details specific to individual instances (e.g. mug labels). Note how these typically form the majority of points, yielding a poor signal-to-noise ratio. The task is further complicated by intra-class variability: the shape of the object boundary varies across instances.

As additional contributions, we extend the non-rigid shape matcher of Chui and Rangarajan [6] in two ways. First, we extend it to operate in cluttered test images, by deriving an automatic initialization for the location and scale of the object from a Hough-style voting scheme [27,32,39] (instead of the manual initialization that would otherwise be necessary). This enables matching the learned shape model even to severely cluttered images, where the object boundaries cover only a small fraction of the contour points (figures 1, 13). As a second extension, we constrain the shape matcher [6] to only search over transformations compatible with the learned, class-specific deformation model. This ensures output shapes similar to class members, improves accuracy, and helps avoid local minima.

These contributions result in a powerful system, capable of detecting novel class instances and localizing their boundaries in cluttered images, while training from objects annotated only with bounding-boxes.

After reviewing related work (section 2) and the local contour features used in our approach (section 3), we present our shape learning method in section 4, and the scheme for localizing objects in test images in section 5. Section 6 reports extensive experiments. We evaluate the quality of the learned models and quantify localization performance at test time in terms of accuracy of the detected object boundaries. We also compare to previous works for object localization with training on real images [16] and hand-drawings [14]. A preliminary version of this work was published at CVPR 2007 [15].

2 Related works

As there exists a large body of work on shape representations for recognition [1,3,8,16,17,23,24,28,37], we briefly review in the following only the most important works relevant to this paper, i.e. on shape description and matching for modeling, recognition, and localization of object classes.

Several earlier works for shape description are based on silhouettes [31,41]. Yet, silhouettes are limited because they ignore internal contours and are difficult to extract from cluttered images, as noted by [3]. Therefore, more recent works represent shapes as loose collections of 2D points [8,22] or other 2D features [12,16]. Other works propose more informative structures than individual points as features, in order to simplify matching. Belongie et al. [3] propose the Shape Context, which captures for each point the spatial distribution of all other points relative to it on the shape. This semi-local representation allows establishing point-to-point correspondences between shapes even under non-rigid deformations. Leordeanu et al. [28] propose another way to go beyond individual edgels, by encoding relations between all pairs of edgels. Similarly, Elidan et al. [12] use pairwise spatial relations between landmark points. Ferrari et al. [16] present a family of scale-invariant local shape features formed by short chains of connected contour segments, capable of cleanly encoding pure fragments of an object boundary. They offer an attractive compromise between information content and repeatability, and encompass a wide variety of local shape structures.

While generic features can be directly used to model any object, an alternative is to learn features adapted to a particular object class. Shotton et al. [39] and Opelt et al. [32] learn class-specific boundary fragments (local groups of edgels), and their spatial arrangement as a star configuration. In addition to their own local shape, such fragments store a pointer to the object center, enabling object localization in novel images using voting. Other methods [11,16] achieve this functionality by encoding spatial organization by tiling object windows, and learning which feature/tile combinations discriminate objects from background.

The overall shape model of the above approaches is either (a) a global geometric organization of edge fragments [3,16,32,39]; or (b) an ensemble of pairwise constraints between point features [12,28]. Global geometric shape models are appealing because of their ability to handle deformations, which can be represented in several ways. The authors of [3] use regularized Thin Plate Splines, a generic deformation model that can quantify dissimilarity between any two shapes, but cannot model shape variations within a specific class. In contrast, Pentland et al. [33] learn the intra-class deformation modes of an elastic material from clean training shapes. The most famous work in this spirit is Active Shape Models [8], where the shape model in novel images is constrained to vary only in ways seen during training.
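The principal-modes idea can be sketched as follows. This is an illustrative PCA-based reconstruction in the spirit of Active Shape Models, not the actual implementation of [8]; shapes are assumed pre-aligned, and all names are our own:

```python
import numpy as np

def learn_deformation_modes(shapes, n_modes=2):
    """Learn a mean shape and principal deformation modes via PCA.

    shapes: array of K pre-aligned training shapes of N 2D points each,
            with shape (K, N, 2).
    Returns the mean shape (N, 2) and the top n_modes eigenvectors
    (n_modes, N, 2), sorted by decreasing explained variance.
    """
    K, N, _ = shapes.shape
    X = shapes.reshape(K, 2 * N)            # flatten each shape to a vector
    mean = X.mean(axis=0)
    Xc = X - mean                           # center the data
    cov = Xc.T @ Xc / (K - 1)               # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalue order
    order = np.argsort(eigvals)[::-1][:n_modes]
    modes = eigvecs[:, order].T             # (n_modes, 2N)
    return mean.reshape(N, 2), modes.reshape(n_modes, N, 2)
```

New class members are then written as the mean plus a weighted sum of the modes, with the weights constrained to ranges observed during training.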
A few principal deformation modes, accounting for most of the total variability over the training set, are learnt using PCA. More generally, non-linear statistics can be used to gain robustness to noise and outliers [10].

A shortcoming of the above methods is the need for clean training shapes, which requires a substantial manual segmentation effort. Recently, a few authors have tried to develop semi-supervised algorithms not requiring segmented training examples. The key idea is to find combinations of features repeatedly recurring over many training images. Berg et al. [2] suggest building the model from pairs of training images, and retaining parts matching across several image pairs. A related strategy is used by [28], which initializes the model using all line segments from a single image and then uses many other images to iteratively remove spurious features and add new good features. Finally, the LOCUS [43] model can also be learned in a semi-supervised way, but needs the training objects to be roughly aligned and to occupy most of the image surface.

A limitation common to these approaches is the lack of modeling of intra-class shape deformations: a single shape is assumed to explain all training images. Moreover, as pointed out by [7,44], LOCUS is not suited for localizing objects in extensively cluttered test images. Finally, the models learned by [28] are sparse collections of features, rather than explicit shapes formed by continuous connected curves. As a consequence, [28] cannot localize objects up to their (complete) boundaries in test images.

Object recognition using shape can be cast as finding correspondences between model and image features. The resulting combinatorics can be made tractable by accepting sub-optimal matching solutions. When the shape is not deformable, or we are not interested in recovering the deformation but only in localizing the object up to translation and scale, simple strategies can be applied, such as Geometric Hashing [26], the Hough Transform [32], or exhaustive search (typically combined with Chamfer Matching [22] or classifiers [16,39]). In case of non-rigid deformations, the parameter space becomes too large for these strategies. Gold and Rangarajan [24] propose an iterative method to simultaneously find correspondences and the model deformation. The sum of distances between model points and image points is minimized by alternating a step where the correspondences are estimated while keeping the transformation fixed, and a step where the transformation is computed while fixing the correspondences. Chui and Rangarajan [6] put this idea in a deterministic annealing framework and adopt Thin Plate Splines (TPS) as the deformation model. The deterministic annealing formulation elegantly supports a coarse-to-fine search in the TPS transformation space, while maintaining a continuous soft-correspondence matrix. The disadvantage is the need for initialization near the object, as it cannot operate automatically in a very cluttered image. A related framework is adopted by Belongie et al. [3], where matching is supported by shape contexts. Depending on the model structure, the optimization scheme can be based on Integer Quadratic Programming [2], spectral matching [28] or graph cuts [43].

Our approach in context In this paper, we present an approach for learning and matching shapes which has several attractive properties. First of all, we build explicit shape models formed by continuous connected curves, which represent the prototype shapes of object classes. The training objects need only be annotated by a bounding-box, i.e. no segmentation is necessary. Our learning method avoids the pairwise image matching used in previous approaches, and is therefore computationally cheaper and more robust to clutter edgels, due to the 'global view' gained by considering all training images at once. Moreover, we model intra-class deformations and enforce them at test time, when matching the model to novel images. Finally, we extend the algorithm [6] to a two-stage technique enabling the deformable matching of the learned shape models to extensively cluttered test images. This enables accurately localizing the complete boundaries of previously unseen object instances.

3 Local contour features

In this section, we briefly present the local contour features used in our approach: the scale-invariant PAS features of [16].

Fig. 2 Local contour features. (a) three example PAS. (b) the 12 most frequent PAS types from 24 mug images.

PAS features. The first step is to extract edgels with the excellent Berkeley edge detector [29] and to chain them. The resulting edgel-chains are linked at their discontinuities, and approximately straight segments are fit to them, using the technique of [14].
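As an illustration of fitting approximately straight segments to an edgel-chain (the actual technique of [14] differs; this is a generic recursive split on the maximum point-to-chord deviation, in the style of Douglas-Peucker):

```python
def fit_segments(chain, tol=2.0):
    """Recursively split an edgel-chain into approximately straight
    segments: if some point deviates from the chord between the chain's
    endpoints by more than tol, split there and recurse.

    chain: list of (x, y) points; returns a list of (start, end) segments.
    """
    def point_chord_dist(p, a, b):
        (px, py), (ax, ay), (bx, by) = p, a, b
        vx, vy = bx - ax, by - ay
        norm = (vx * vx + vy * vy) ** 0.5
        if norm == 0.0:                      # degenerate chord
            return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
        # perpendicular distance from p to the line through a and b
        return abs(vx * (py - ay) - vy * (px - ax)) / norm

    if len(chain) < 3:
        return [(chain[0], chain[-1])]
    dists = [point_chord_dist(p, chain[0], chain[-1]) for p in chain[1:-1]]
    k = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[k - 1] <= tol:                  # chain is straight enough
        return [(chain[0], chain[-1])]
    return fit_segments(chain[:k + 1], tol) + fit_segments(chain[k:], tol)
```

A right-angle chain is split at its corner, yielding two straight segments.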
Fig. 3 Learning the shape model. (a) Four training examples (out of a total 24). (b) Model parts. (c) Occurrences selected to form the initial shape. (d) Refined shape. (e) First two modes of variation (mean shape in the middle).

Segments are fit both over individual edgel-chains, and bridged between two linked chains. This brings robustness to the unavoidable broken edgel-chains [14].

The local features we use are pairs of connected segments (figure 2a). Informally, two segments are considered connected if they are adjacent on the same edgel-chain, or if one is at the end of an edgel-chain directed towards the other (i.e. if the first segment were extended a bit, it would meet the second one). As two segments in a pair are not limited to come from a single edgel-chain, but may come from adjacent edgel-chains, the extraction of pairs is robust to the typical mistakes of the underlying edge detector.

Each pair of connected segments forms one feature, called a PAS, for Pair of Adjacent Segments. A PAS feature P = (x, y, s, e, d) has a location (x, y) (mean over the two segment centers), a scale s (distance between the segment centers), a strength e (average edge detector confidence over the edgels, with values in [0, 1]), and a descriptor d = (θ1, θ2, l1, l2, r) invariant to translation and scale changes. The descriptor encodes the shape of the PAS, by the segments' orientations θ1, θ2 and lengths l1, l2, and the relative location vector r, going from the center of the first segment to the center of the second (a stable way to derive the order of the segments in a PAS is given in [16]). Both lengths and the relative location are normalized by the scale of the PAS. Notice that PAS can overlap, i.e. two different PAS can share a common segment.

PAS features are particularly suited to our needs. First, they are robustly detected, because they connect segments even across gaps between edgel-chains. Second, as both PAS and their descriptors cover solely the two segments, they can cover a pure portion of an object boundary, without including clutter edges which often lie in the vicinity (as opposed to patch descriptors). Hence, PAS descriptors respect the nature of boundary fragments, which are one-dimensional elements embedded in a 2D image, as opposed to local appearance features, whose extent is a 2D patch. Third, PAS have intermediate complexity. As demonstrated in [16], they are complex enough to be informative, yet simple enough to be detectable repeatably across different images and object instances. Finally, since a correspondence between two PAS induces a translation and scale change, they can be readily used within a Hough-style voting scheme for object detection [27,32,39].

PAS dissimilarity measure. The dissimilarity D(d^p, d^q) between the descriptors d^p, d^q of two PAS P, Q defined in [16] is

D(d^p, d^q) = w_r \|r^p - r^q\| + w_\theta \sum_{i=1}^{2} D_\theta(\theta_i^p, \theta_i^q) + \sum_{i=1}^{2} \left| \log(l_i^p / l_i^q) \right|   (1)

where the first term is the difference in the relative locations of the segments, D_\theta \in [0, \pi/2] measures the difference between segment orientations, and the last term accounts for the difference in lengths. In all our experiments, the weights w_r, w_\theta are fixed to the same values used in [16] (w_r = 4, w_\theta = 2).

PAS codebook. We construct a codebook by clustering the PAS inside all training bounding-boxes according to their descriptors (see [16] for more details about the clustering algorithm). For each cluster, we retain the centermost PAS, minimizing the sum of dissimilarities to all the others. The codebook C = {t_i} is the collection of the descriptors of these centermost PAS, the PAS types {t_i} (figure 2b). A codebook is useful for efficient matching, since all features similar to a type are considered in correspondence. The codebook is class-specific and built from the same images used later to learn the shape model.

4 Learning the shape model

In this section we present the new technique for learning a prototype shape for an object class and its principal intra-class deformation modes, given image windows W with example instances (figure 3a). To achieve this, we propose a procedure for discovering which contour points belong to the common class boundaries, and for putting them in full point-to-point correspondence across the training examples.
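The PAS dissimilarity of equation (1), used throughout the learning stages below, can be made concrete with a minimal sketch. The descriptor layout and helper names here are our own illustrative choices:

```python
import math

W_R, W_THETA = 4.0, 2.0  # weights as used in [16]

def theta_dist(a, b):
    """D_theta in [0, pi/2]: difference between segment orientations,
    which are defined modulo pi (segments are undirected)."""
    d = abs(a - b) % math.pi
    return min(d, math.pi - d)

def pas_dissimilarity(p, q):
    """Equation (1). Each descriptor is (theta1, theta2, l1, l2, r),
    with lengths and the relative location r already normalized by
    the PAS scale."""
    (tp1, tp2, lp1, lp2, rp) = p
    (tq1, tq2, lq1, lq2, rq) = q
    loc = math.hypot(rp[0] - rq[0], rp[1] - rq[1])     # ||r^p - r^q||
    ang = theta_dist(tp1, tq1) + theta_dist(tp2, tq2)  # orientation terms
    length = abs(math.log(lp1 / lq1)) + abs(math.log(lp2 / lq2))
    return W_R * loc + W_THETA * ang + length
```

Identical descriptors have dissimilarity zero, and each term grows with the corresponding geometric difference.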
Fig. 4 Finding model parts. Left: four training instances with two recurring PAS of the upper-L type (one on the handle, and another on the main body). Right: four slices of the accumulator space for this PAS type (each slice corresponds to a different size). The two recurring PAS form peaks at different locations and sizes. Our method allows for different model parts with the same PAS type.

For example, we want the shape model to include the outline of a mug, which is characteristic for the class, and not the mug labels, which vary across instances. The technique is composed of four stages (figure 3b-e):

1. Determine model parts as PAS frequently reoccurring with similar locations, scales, and shapes (subsection 4.1).
2. Assemble an initial shape by selecting a particular PAS for each model part from the training examples (subsection 4.2).
3. Refine the initial shape by iteratively matching it back onto the training images (subsection 4.3).
4. Learn a statistical model of intra-class deformations from the corresponded shape instances produced by stage 3 (subsection 4.4).

The shape model output at the end of this procedure is composed of a prototype shape S, which is a set of points in the image plane, and a small number n of intra-class deformation modes E_{1:n}, so that new class members can be written as S + E_{1:n}.

4.1 Finding model parts

The first stage towards learning the model shape is to determine which PAS lie on boundaries common across the object class, as opposed to those on the background clutter and those on details specific to individual training instances. The basic idea is that a PAS belonging to the class boundaries will recur consistently across several training instances with a similar location, size, and shape. Although they are numerous, PAS not belonging to the class boundaries are not correlated across different examples. In the following we refer to any PAS or edgel not lying on the class boundaries as clutter.

4.1.1 Algorithm

The algorithm consists of three steps:

1. Align windows. Let a be the geometric mean of the aspect-ratios of the training windows W (width over height). Each window is transformed to a canonical zero-centered rectangle of height 1 and width a. This removes translation and scale differences, and cancels out shape variations due to different aspect-ratios (e.g. tall Starbucks mugs versus coffee cups). This facilitates the learning task, because PAS on the class boundaries are now better aligned.

2. Vote for parts. Let V_i be a voting space associated with PAS type t_i. There are |C| such voting spaces, all initially empty. Each voting space has three dimensions: two for location (x, y) and one for size s. Every PAS P = (x, y, s, e, d) from every training window casts votes as follows:

1. P is soft-assigned to all types T within a dissimilarity threshold γ: T = {t_j | D(d, t_j) < γ}, where d is the shape descriptor of P (see equation (1)).
2. For each assigned type t_j ∈ T, a vote is cast in V_j at (x, y, s), i.e. at the location and size of P. The vote is weighted by e · (1 − D(d, t_j)/γ), where e is the edge strength of P.

Assigning P to multiple types T, and weighting votes according to the similarity 1 − D(d, t_j)/γ, reduces the sensitivity to the exact shape of P and to the exact codebook types. Weighting by edge strength takes into account the relevance of the PAS. It leads to better results over treating edgels as binary features (as also noticed by [11,14]).

Essentially, each PAS votes for the existence of a part of the class boundary with shape, location, and size like its own (figure 4). This is the best it can do from its limited local perspective.
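The vote-casting of step 2 can be sketched as follows. This is a simplified, discretized toy version: the accumulator binning, the parameter names, and the plug-in dissimilarity D are assumptions of this sketch, not the paper's implementation:

```python
from collections import defaultdict

def vote_for_parts(pas_list, codebook, D, gamma, bin_size=0.1):
    """Accumulate weighted votes in one 3D voting space per PAS type.

    pas_list: PAS tuples (x, y, s, e, d) in canonical window coordinates.
    codebook: list of PAS types t_i (descriptors).
    D: dissimilarity between descriptors (equation (1)).
    Returns {type index: {(x_bin, y_bin, s_bin): accumulated weight}}.
    """
    spaces = defaultdict(lambda: defaultdict(float))
    for (x, y, s, e, d) in pas_list:
        for j, t in enumerate(codebook):
            dist = D(d, t)
            if dist < gamma:                      # soft-assignment to type j
                w = e * (1.0 - dist / gamma)      # strength- and similarity-weighted vote
                cell = (round(x / bin_size), round(y / bin_size),
                        round(s / bin_size))
                spaces[j][cell] += w
    return spaces
```

Smoothing each space before taking local maxima (step 3) then gives the wide basins of attraction discussed in subsection 4.1.2.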
3. Find local maxima. All voting spaces are searched for local maxima. Each local maximum yields a model part M = (x, y, s, v, d), with a specific location (x, y), size s, and shape d = t_i (the PAS type corresponding to the voting space where M was found). The value v of the local maximum measures the confidence that the part belongs to the class boundaries. The (x, y, s) coordinates are relative to the canonical window.

4.1.2 Discussion

The success of this procedure is due in part to adopting PAS as basic shape elements. A simpler alternative would be to use individual edgels. In that case, there would be just one voting space, with two location dimensions and one orientation dimension. In contrast, PAS bring two additional degrees of separation: the shape of the PAS, expressed as the assignments to codebook types, and its size (relative to the window). Individual edgels have no size, and the shape of a PAS is more distinctive than the orientation of an edgel. As a consequence, it is very unlikely that a significant number of clutter PAS will accidentally have similar locations, sizes and shapes at the same time. Hence, recurring PAS stemming from the desired class boundaries tend to form peaks in the voting spaces, whereas clutter PAS don't.

Intra-class shape variability is addressed partly by the soft-assignment of PAS to types, and partly by applying a substantial spatial smoothing to the voting spaces before detecting local maxima. This creates wide basins of attraction for PAS from different training examples to accumulate evidence for the same part. We can afford this flexibility while keeping a low risk of accumulating clutter because of the high separability discussed above, especially due to the separate voting spaces for different codebook types. This yields the discriminativity necessary to overcome the poor signal-to-noise ratio, while allowing the flexibility necessary to accommodate intra-class shape variations.

The voting procedure is similar in spirit to recent works on finding frequently recurring spatial configurations of local appearance features in unannotated images [19,34], but it is specialized for the case when bounding-box annotation is available.

The proposed algorithm sees all training data at once, and therefore reliably selects parts and robustly estimates their locations/sizes/shapes. In our experiments this was more stable and more robust to clutter than matching pairs of training instances and combining their output a posteriori. As another advantage, the algorithm has complexity linear in the total number of PAS in the training windows, so it can learn from large training sets efficiently.

4.2 Assembling the initial model shape

The collection of parts learned in the previous section captures class boundaries well, and conveys a sense of the shape of the object class (figure 3b). The outer boundary of the mug and the handle hole are included, whereas the label and background clutter are largely excluded. Based on this 'collection of parts' model (COP) one could already attempt to detect objects in a test image, by matching parts based on their descriptor and enforcing their spatial relationship. This could be achieved in a way similar to what earlier approaches do based on appearance features [18,27], as also done recently with contour features by [32,39], and it would localize objects up to a bounding-box.

However, the COP model has no notion of shape at the global scale. It is a loose collection of fragments learnt rather independently, each focusing on its own local scale. In order to support localizing object boundaries accurately and completely on novel test images, a more globally consistent shape is preferable. Ideally, its parts would be connected into a whole shape featuring smooth, continuous lines.

In this subsection we describe a procedure for constructing a first version of such a shape, and in the next subsection we refine it. We start with some intuition behind the method. A model part occurs several times on different images (figure 5a-b). These occurrences offer slightly different alternatives for the part's location, size, and shape. We can assemble variants of the model shape by selecting different occurrences for each part. The key idea for obtaining a globally consistent shape is to select one occurrence for each part so as to form larger aggregates of connected occurrences (figure 3c). We cast the shape assembly task as the search for the assignment of parts to occurrences leading to the best connected shape. In the following, we explain the algorithm in more detail.

4.2.1 Algorithm

The algorithm consists of three steps:

1. Compute occurrences. A PAS P = (x^p, y^p, s^p, e^p, d^p) is an occurrence of model part M = (x^m, y^m, s^m, v^m, d^m) if they have similar location, scale, and shape (figure 5a). The following function measures the confidence that P is an occurrence of M (denoted M → P):

conf(M \to P) = e^p \cdot D(d^m, d^p) \cdot \min\left( \frac{s^m}{s^p}, \frac{s^p}{s^m} \right) \cdot \exp\left( -\frac{(x^p - x^m)^2 + (y^p - y^m)^2}{2\sigma^2} \right)   (2)
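Equation (2) can be sketched directly, taking each factor exactly as printed; the shape term D(d^m, d^p) is used as given in the equation, and D and σ are inputs of this hypothetical helper rather than fixed values from the paper:

```python
import math

def occurrence_confidence(M, P, D, sigma):
    """Equation (2), as printed: confidence that PAS P is an
    occurrence of model part M.

    M = (xm, ym, sm, vm, dm), P = (xp, yp, sp, ep, dp).
    D is the descriptor term of equation (2), supplied by the caller.
    """
    xm, ym, sm, _vm, dm = M
    xp, yp, sp, ep, dp = P
    scale_term = min(sm / sp, sp / sm)                 # scale agreement
    loc_term = math.exp(-((xp - xm) ** 2 + (yp - ym) ** 2)
                        / (2.0 * sigma ** 2))          # location agreement
    return ep * D(dm, dp) * scale_term * loc_term      # shape term as printed
```

With identical location and scale and a descriptor term of 1, the confidence reduces to the edge strength e^p.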
Fig. 5 Occurrences and connectedness. (a) A model part (above) and two of its occurrences (below). (b) All occurrences of all model parts on a few training images, colored by the distance to the peak in the voting space (decreasing from blue to cyan to green to yellow to red). (c) Two model parts with high connectedness (above) and two of their occurrences, which share a common segment (below).

It takes into account P's edge strength (first factor) and how close it is to M in terms of shape, scale, and location (second to last factors). The confidence ranges in [0, 1], and P is deemed an occurrence of M if conf(M → P) > δ, with δ a threshold. By analogy, M_i → P_i denotes the occurrence of model segment M_i on image segment P_i (with i ∈ {1, 2}).

2. Compute connectedness. As a PAS P is formed by two segments P_1, P_2, two occurrences P, Q of different model parts M, N might share a segment (figure 5c). This suggests that M, N explain connected portions of the class boundaries and should be connected in the model. As model parts occur in several images, we estimate how likely it is for two parts to be connected in the model by how frequently their occurrences share segments.

Let the equivalence of segments M_i, N_j be

eq(M_i, N_j) = \sum_{\{P,Q \mid s \in P,\, s \in Q,\, M_i \to s,\, N_j \to s\}} \left( conf(M \to P) + conf(N \to Q) \right)   (3)

The summation runs over all pairs of PAS P, Q sharing a segment s, where s is an occurrence of both M_i and N_j (figure 5c). Let the connectedness of M, N be the combined equivalence of their segments, for the best of the two possible segment matchings:

conn(M, N) = \max\left( eq(M_1, N_1) + eq(M_2, N_2),\; eq(M_1, N_2) + eq(M_2, N_1) \right)   (4)

Two parts have high connectedness if their occurrences frequently share a segment. Two parts sharing both segments have even higher connectedness, suggesting they explain the same portion of the class boundaries.

3. Assign parts to occurrences. Let A(M) = P be a function assigning a PAS P to each model part M. Find the mapping A that maximizes

\sum_{M} conf(M \to A(M)) + \alpha \sum_{M,N} conn(M, N) \cdot 1(A(M), A(N)) - \beta K   (5)

where 1(a, b) = 1 if occurrences a, b come from the same image, and 0 otherwise; K is the number of images contributing occurrences to A; and α, β are predefined weights. The first term prefers high-confidence occurrences. The second favors assigning connected parts to connected occurrences, because occurrences of parts with high connectedness are likely to be connected when they come from the same image (by construction of function (5)). The last term encourages selecting occurrences from a few images, as occurrences from the same image fit together naturally. Overall, function (5) encourages the formation of aggregates of good-confidence, properly connected occurrences.

Optimizing (5) exactly is expensive, as the space of all assignments is huge. In practice, the following approximation algorithm brings satisfactory results. We start by assigning the part with the single most confident occurrence. Next, we iteratively consider the part most connected to those assigned so far, and assign it to the occurrence maximizing (5). The algorithm iterates until all parts are assigned to an occurrence.

Figure 3c shows the selected occurrences for our running example. These form a rather well connected shape, where most segments fit together and form continuous lines. The remaining discontinuities are smoothed out by the refinement procedure in the next subsection.

4.3 Model shape refinement

In this subsection we refine the initial model shape. The key idea is to match it back onto the training image windows W, by applying a deformable matching algorithm [6] (figure 6b). This results in a backmatched shape for each window (figure 6c-left). An improved model shape is obtained by averaging them (figure 6c-right).
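The backmatch-and-average loop can be sketched as below. The nearest-neighbour "matcher" here is only a stand-in for the deformable TPS-RPM matcher [6], the alignment step is omitted, and all names are our own; it is a sketch of the loop structure, not of the actual matcher:

```python
import numpy as np

def refine_model_shape(S, windows, n_iters=3):
    """Iteratively backmatch the model shape S onto each training window,
    then replace S by the mean of the backmatched shapes.

    S: initial model shape, array (N, 2).
    windows: list of edgel point sets, each an array (M_i, 2).
    """
    S = np.asarray(S, dtype=float)
    for _ in range(n_iters):
        backmatched = []
        for E in windows:
            # Stand-in matcher: snap every model point to its nearest edgel.
            # The backmatched shape stays in point-to-point correspondence
            # with S, which is what makes the averaging step possible.
            d2 = ((S[:, None, :] - E[None, :, :]) ** 2).sum(axis=2)
            backmatched.append(E[d2.argmin(axis=1)])
        S = np.mean(backmatched, axis=0)   # average the backmatched shapes
    return S
```

Because every backmatched shape has one point per model point, the mean shape is simply the per-point average, mirroring step 3 of the algorithm below.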
8

a) sampled initial shape b) backmatching (init −> match)

average shape average shape

backmatching backmatching

c) First iteration d) Second iteration


Fig. 6 Model shape refinement. (a) sampled points from the initial model shape. (b) after initializing backmatching by aligning
the model with the image bounding-box (left), it deforms it so as to match the image edgels (right). (c) the first iteration of shape
refinement. (d) the second iteration.
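As an illustration (not the authors' implementation), the greedy approximation used to optimize objective (5) can be sketched as follows. The data layout (nested dicts for confidences, a pair-keyed dict for connectedness) and the reading of the indicator 1(A(M), A(N)) as "the two parts picked the same occurrence" are our assumptions.

```python
def greedy_assign(conf, conn, alpha=1.0, compatible=lambda a, b: a == b):
    """Greedy approximation of objective (5).

    conf: {part: {occurrence: confidence}}
    conn: {(part, part): connectedness}
    `compatible` stands in for the indicator 1(A(M), A(N)); here a
    placeholder using exact occurrence equality (our assumption).
    """
    parts = list(conf)

    def link(p, assigned):
        # total connectedness of part p to the already-assigned parts
        return sum(conn.get((p, n), 0) + conn.get((n, p), 0) for n in assigned)

    # seed with the part owning the single most confident occurrence
    seed = max(parts, key=lambda m: max(conf[m].values()))
    A = {seed: max(conf[seed], key=conf[seed].get)}
    while len(A) < len(parts):
        # next: the unassigned part most connected to those assigned so far
        m = max((p for p in parts if p not in A), key=lambda p: link(p, A))
        # give it the occurrence with the best local contribution to (5)
        A[m] = max(conf[m], key=lambda o: conf[m][o] + alpha * sum(
            conn.get((m, n), 0) + conn.get((n, m), 0)
            for n in A if compatible(o, A[n])))
    return A
```

The seed and the iteration order make the procedure deterministic, which matches the description above (most confident occurrence first, then most connected parts).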

4.3.1 Algorithm

The algorithm follows three steps:

1. Sampling. Sample 100 equally spaced points from the initial model shape, giving the point set S (figure 6a).

2. Backmatching. Match S back to each training window w ∈ W by doing:
2.1 Alignment. Translate, scale, and stretch S so that its bounding-box aligns with w (figure 6b-left). This provides the initialization for the shape matcher.
2.2 Shape matching. Let E be the point set consisting of the edgels inside w. Put S and E in point-to-point correspondence using the non-rigid robust point matcher TPS-RPM [6] (Thin-Plate Spline Robust Point Matcher). This estimates a TPS transformation from S to E, while at the same time rejecting edgels not corresponding to any point of S. This is important, as only some edgels lie on the object boundaries. Subsection 5.2 presents TPS-RPM in detail, where it is used again for localizing object boundaries in test images.

3. Averaging. (1) Align the backmatched shapes B = {Bi}i=1..|W| using Cootes' variant of Procrustes analysis [9], by translating, scaling, and rotating each shape so that the total sum of squared distances to the mean shape B̄, ΣBi∈B |Bi − B̄|², is minimized (see appendix A of [9]). (2) Update S by setting it to the mean shape: S ← B̄ (figure 6c-right).

The algorithm now iterates to step 2, using the updated model shape S. In our experiments, steps 2 and 3 are repeated two to three times.

4.3.2 Discussion

Step 3 is possible because the backmatched shapes B are in point-to-point correspondence, as they are different TPS transformations of the same S (figure 6c-left). This enables us to define B̄ as the coordinates of corresponding points averaged over all Bi ∈ B. It also enables us to analyze the variations in the point locations. The differences remaining after alignment are due to non-rigid shape variations, which we will learn in the next subsection.

The alternation of backmatching and averaging results in a succession of better models and better matches to the data, as the point correspondences cover more and more of the class boundaries of the training objects (figure 6d).
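The align-and-average loop of step 3 can be sketched with a standard SVD-based similarity Procrustes fit. This is a generic stand-in for Cootes' exact variant [9], with names of our choosing, assuming p x 2 point arrays in correspondence:

```python
import numpy as np

def align(B, ref):
    """Similarity-align point set B (p x 2) to ref: translate, rotate, and
    scale so as to minimise the sum of squared distances (orthogonal
    Procrustes solution via SVD)."""
    Bc, Rc = B - B.mean(0), ref - ref.mean(0)
    U, _, Vt = np.linalg.svd(Bc.T @ Rc)
    R = U @ Vt                                   # optimal rotation
    s = np.sum(Rc * (Bc @ R)) / np.sum(Bc**2)    # optimal isotropic scale
    return s * (Bc @ R) + ref.mean(0)

def mean_shape(backmatched, iters=3):
    """Step 3: align every backmatched shape to the current mean shape and
    re-average, repeating a few times until stable."""
    mean = backmatched[0]
    for _ in range(iters):
        mean = np.mean([align(B, mean) for B in backmatched], axis=0)
    return mean
```

Because all backmatched shapes are TPS deformations of the same sampled model, averaging corresponding rows is meaningful, which is exactly the point made in the discussion above.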

Segments of the model shape are moved, bent, and stretched so as to form smooth, connected lines, thus recovering the shape of the class well on a global scale (e.g. the topmost and leftmost segments in figure 6c-right). This is because backmatching deforms the initial shape onto the class boundaries of the training images, delivering natural, well-formed shapes. The averaging step then integrates them into a generic-looking shape, and smoothes out occasional inaccuracies of the individual backmatches.

The proposed technique can be seen as searching for the model shape that best explains the training data, under the general assumption that TPS deformations account for the difference between the model shape and the class boundaries of the training objects.

As shown in figure 6d-right, the running example improves further during the second (and final) iteration (e.g. the handle arcs become more continuous). The final shape is smooth and well connected, includes no background clutter and little interior clutter, and, as desired, represents an average class member (a prototype shape). Both large-scale (the external frame) and fine-scale structures (the double handle arc) are correctly recovered. The backmatched shapes also improve in the second iteration, because matching is easier given a better model. In turn, the better backmatches yield a better average shape. The mutual help between backmatching and updating the model is key to the success of the procedure.

Figure 7 shows examples of other models evolving over the three stages (sections 4.1 to 4.3). Notice the large positive impact of model shape refinement. Furthermore, to demonstrate that the proposed techniques consistently produce good-quality models, we show many of them in the results section (figure 10).

Fig. 7 Evolution of shape models over the three stages of learning. Top row: model parts (section 4.1). Second row: initial shape (section 4.2). Bottom row: refined shape (section 4.3).

Our idea for shape refinement is related to a general design principle visible in different areas of vision. It involves going back to the image after building some intermediate representation from initial low-level features, in order to refine and extend it. This differs from the conventional way of building layers of increasing abstraction, involving representations of higher and higher level, progressively departing from the original image data. The traditional strategy suffers from two problems: errors accumulate from one layer to the next, and relevant information missed by the low-level features is never recovered. Going back to the image corrects both problems, and has good chances of succeeding since a rough model has already been built. Several algorithms are instances of this strategy and have led to excellent results in various areas: human pose estimation [35], top-down segmentation [27,4], and recognition of specific objects [13].

4.4 Learning shape deformations

The previous subsection matches the model shape to each training image, and thus provides examples of the variations within the object class we want to learn. Since these examples are in full point-to-point correspondence, we can learn a compact model of the intra-class variations using the statistical shape analysis technique by Cootes [8].

The idea is to consider each example shape as a point in a 2p-D space (with p the number of points on each shape), and to model their distribution with Principal Component Analysis (PCA). The eigenvectors returned by PCA represent modes of variation, and the associated eigenvalues λi their importance (how much the example shapes deform along them, figure 3e). By keeping only the n largest eigenvectors E1:n representing 95% of the total variance, we can approximate the region in which the training examples live by S + E1:n b, where S is the mean shape, b is a vector representing shapes in the subspace spanned by E1:n, and b's ith component is bounded by ±3√λi. This defines the valid region of the shape space, containing shapes similar to the example ones. Typically, n < 15 eigenvectors are sufficient (compared to 2p ≃ 200).

Figure 3e shows the first two deformation modes for our running example. The first mode spans the spectrum between little coffee cups and tall Starbucks-style mugs, while the handle can vary from pointed down to pointed up within the second mode. In subsection 5.3, we exploit this deformation model to constrain the matching of the model to novel test images. We should point out that by deformation we mean the geometric transformation from the shape of one instance of the object class to that of another instance.
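The Cootes-style model described above reduces to a few lines of PCA. A minimal numpy sketch (variable names and the (m, 2p)-array layout are our choices, not the paper's code):

```python
import numpy as np

def learn_deformations(shapes, var_keep=0.95):
    """Learn mean shape, deformation modes, and eigenvalues from an
    (m, 2p) array of aligned shapes in full point-to-point correspondence."""
    S = shapes.mean(0)                              # mean shape
    X = shapes - S
    _, sv, Vt = np.linalg.svd(X, full_matrices=False)
    lam = sv**2 / len(shapes)                       # eigenvalues of the covariance
    # keep the smallest n modes covering var_keep of the total variance
    n = int(np.searchsorted(np.cumsum(lam) / lam.sum(), var_keep)) + 1
    return S, Vt[:n], lam[:n]                       # S, E_{1:n} (rows), lambda_i

def synthesize(S, E, lam, b):
    """A shape S + E^T b inside the valid region: each coefficient b_i is
    clamped to +/- 3 sqrt(lambda_i)."""
    b = np.clip(b, -3 * np.sqrt(lam), 3 * np.sqrt(lam))
    return S + E.T @ b
```

Since the rows of E are orthonormal, clamping b directly bounds how far a synthesized shape may stray from the mean along each mode, which is the "valid region" used later for constrained matching.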

Note that even though each individual mug is a rigid object, a non-rigid transformation is needed to map the shape of one mug to that of another.

Notice that previous works on these deformation models require at least the example shapes as input [21], and many also need the point-to-point correspondences [8]. In contrast, we automatically learn shapes, correspondences, and deformations given just images.

5 Object detection

In this section we describe how to localize the boundaries of previously unseen object instances in a test image. To this end, we match the shape model learnt in the previous section to the test image edges. This task is very challenging because 1) the image can be extensively cluttered, with the object covering only a small proportion of its edges (figure 8a-b); and 2) to handle intra-class variability, the shape model must be deformed into the shape of the particular instance shown in the test image.

We decompose the problem into two stages. We first obtain rough estimates for the location and scale of the object based on a Hough-style voting scheme (subsection 5.1). This greatly simplifies the subsequent shape matching, as it approximately lifts three degrees of freedom (translation and scale). The estimates are then used to initialize the non-rigid shape matcher [6] (subsection 5.2). This combination enables [6] to operate in cluttered images, and hence allows us to localize object boundaries. Furthermore, in subsection 5.3, we constrain the matcher to explore only the region of shape space spanned by the training examples, thereby ensuring that output shapes are similar to class members.

Fig. 8 Object detection. (a) A challenging test image and (b) its edgemap. The object covers only about 6% of the image surface, and only about 1 edgel in 17 belongs to its boundaries. (c) Initialization with a local maximum in Hough space. (d) Output shape with unconstrained TPS-RPM. It recovers the object boundaries well, except at the bottom-right corner, where it is attracted by the strong-gradient edgels caused by the shading inside the mug. (e) Output of the shape-constrained TPS-RPM. The bottom-right corner is now properly recovered.

5.1 Initialization by Hough voting

In subsection 4.1 we have represented the shape of a class as a set of PAS parts, each with a specific shape, location, size, and confidence. Here we match these parts to PAS from the test image, based on their shape descriptors. More precisely, a model part is deemed matched to an image PAS if their dissimilarity (1) is below a threshold γ (the same as used in section 4.1). Since a pair of matched PAS induces a translation and scale transformation, each match votes for the presence of an object instance at a particular location (object center) and scale (in the same spirit as [27,32,39]). Votes are weighted by the shape similarity between the model part and the test PAS, the edge strength of the PAS, and the confidence of the part. Local maxima in the voting space define rough estimates of the location and scale of candidate object instances (figure 8c).

The above voting procedure delivers 10 to 40 local maxima in a typical cluttered image, as the local features are not very distinctive on their own. The important point is that a few tens is far fewer than the number of possible locations and scales the object could take in the image, which is on the order of thousands. Thus, Hough voting acts as a focus-of-attention mechanism, drastically reducing the problem complexity. We can now afford to run a full-featured shape matching algorithm [6], starting from each of the initializations. Note that running [6] directly, without initialization, is likely to fail on very cluttered images, where only a small minority of edgels lie on the boundaries of the target object.

5.2 Shape Matching by TPS-RPM

For each initial location l and scale s found by Hough voting, we obtain a point set V by centering the model shape on l and rescaling it to s, and a set X which contains all image edge points within a larger rectangle of scale 1.8s (figure 8c). This larger rectangle is designed to contain the whole object, even when s is under-estimated. Any point outside this rectangle is ignored by the shape matcher.

Given the initialization, we want to put V in correspondence with the subset of X lying on the object boundary. We estimate the associated non-rigid transformation, and reject image points not corresponding to any model point, with the Thin-Plate Spline Robust Point Matching algorithm (TPS-RPM [6]).
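The voting scheme can be sketched as a 3-D accumulator over quantised centre position and scale. The binning and the (x, y, scale, weight) representation of a part/PAS match are illustrative assumptions of ours:

```python
import numpy as np

def vote(matches, shape=(64, 64, 8)):
    """Accumulate weighted votes in a quantised (x, y, scale) space.

    matches: iterable of (x_bin, y_bin, s_bin, weight); each matched model
    part / image PAS pair contributes its predicted object centre and scale,
    weighted by shape similarity, edge strength, and part confidence.
    """
    acc = np.zeros(shape)
    for x, y, s, w in matches:
        acc[x, y, s] += w
    return acc

def peaks(acc, thresh):
    """Local maxima above thresh: the rough location/scale candidates used
    to initialize the shape matcher (figure 8c)."""
    out = []
    for idx in zip(*np.nonzero(acc > thresh)):
        x, y, s = idx
        nb = acc[max(x - 1, 0):x + 2, max(y - 1, 0):y + 2, max(s - 1, 0):s + 2]
        if acc[idx] == nb.max():
            out.append(idx)
    return out
```

A few tens of surviving peaks per image, as reported above, is what makes it affordable to run the full shape matcher from every initialization.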

In this subsection we give a brief summary of TPS-RPM, and we refer to [6] for details.

TPS-RPM matches the two point sets V = {va}a=1..K and X = {xi}i=1..N by applying a non-rigid TPS mapping {d, w} to V (d is the affine component, and w the non-rigid warp). It estimates both the correspondence matrix M = {mai} between V and X, and the mapping {d, w}, that minimize an objective function including 1) the distance between points of X and their corresponding points of V after mapping them by the TPS, and 2) regularization terms for the affine and warp components of the TPS. In addition to the inner K × N part, M has an extra row and an extra column which allow points to be rejected as unmatched.

Since neither the correspondence M nor the TPS mapping {d, w} are known beforehand, TPS-RPM iteratively alternates between updating M while keeping {d, w} fixed, and updating the mapping with M fixed. M is a continuous-valued soft-assign matrix, allowing the algorithm to evolve through a continuous correspondence space, rather than jumping around in the space of binary matrices (hard correspondence). It is updated by setting mai as a function of the distance between xi and va, after mapping by the TPS (details below). The update of the mapping fits a TPS between V and the current estimate Y = {ya}a=1..K of the corresponding points. Each point ya in Y is a linear combination of all image points {xi}i=1..N weighted by the soft-assign values mai:

ya = Σi=1..N mai xi   (6)

The TPS fitting maximizes the proximity between the points Y and the model points V after TPS mapping, under the influence of the regularization terms, which penalize local warpings w and deviations of d from the identity. Fitting the TPS to V ↔ Y rather than to V ↔ X allows the algorithm to harvest the benefits of maintaining a full soft-correspondence matrix M.

The optimization procedure of TPS-RPM is embedded in a deterministic annealing framework by introducing a temperature parameter T, which decreases at each iteration. The entries of M are updated by the following equation:

mai = (1/T) exp( −(xi − f(va, d, w))ᵀ (xi − f(va, d, w)) / (2T) )   (7)

where f(va, d, w) is the mapping of point va by the TPS {d, w}. The entries of M are then iteratively normalized to ensure that the rows and columns sum to 1 [6]. Since T is the bandwidth of the Gaussian kernel in equation (7), as it decreases M becomes less fuzzy, progressively approaching a hard correspondence matrix. At the same time, the regularization terms of the TPS are given less weight. Hence, the TPS is rigid in the beginning, and gets more and more deformable as the iterations continue. These two phenomena enable TPS-RPM to find a good solution even when given a rather poor initialization. At first, when the correspondence uncertainty is high, each ya essentially averages over a wide area of X around the TPS-mapped point, and the TPS is constrained to near-rigid transformations. This can be seen as follows: a large T in equation (7) generates similar-valued mai, which are then averaged by equation (6). As the iterations continue and the temperature decreases, M looks less and less far, and pays increasing attention to the differences between matching options from X. Since the uncertainty diminishes, it is safe to let the TPS looser, freer to fit the details of X more accurately. Figure 9 illustrates TPS-RPM on our running example.

Fig. 9 Three iterations of TPS-RPM initialized as in figure 8c (shown after iterations 1, 8, and 12). The image points X are shown in red, and the current shape estimate Y in blue. The green circles have radius proportional to the temperature T, and give an indication of the range of potential correspondences considered by M. This is fully shown by the yellow lines joining all pairs of points with non-zero mai. Top: unconstrained TPS-RPM. Bottom: TPS-RPM with the proposed class-specific shape constraints. The two processes are virtually identical until iteration eight, when the unconstrained matcher diverges towards interior clutter. The constrained version instead sticks to the true object boundary.

We have extended TPS-RPM by adding two terms to the objective function: the orientation difference between corresponding points (to be minimized), and the edge strength of matched image points (to be maximized). In our experiments, these extra terms made TPS-RPM more accurate and stable, i.e. it succeeds even when initialized farther away from the best location and scale.
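The alternation of the M-update (7) and the mapping update can be sketched as below. Two deliberate simplifications of ours: a least-squares affine map regularised toward the identity (with annealed weight proportional to T) stands in for the full regularised TPS regression of [6], and the outlier row/column plus the alternating row/column normalisation of M are reduced to a single row normalisation.

```python
import numpy as np

def soft_assign(f, X, T):
    """Equation (7): Gaussian soft-correspondences with bandwidth T.
    Outlier handling and full Sinkhorn normalisation are omitted;
    only rows are normalised to sum to 1."""
    d2 = ((f[:, None, :] - X[None, :, :])**2).sum(-1)
    M = np.exp(-d2 / (2 * T)) / T
    return M / M.sum(1, keepdims=True)

def tps_rpm_sketch(V, X, T0=1.0, rate=0.9, iters=60, lam=1.0):
    """Skeleton of the deterministic-annealing loop. The mapping update is
    an affine fit regularised toward the identity with annealed weight
    lam*T, a stand-in for the regularised TPS of [6]."""
    f, T = V.copy(), T0
    Vh = np.hstack([V, np.ones((len(V), 1))])            # homogeneous coords
    A_id = np.vstack([np.eye(2), np.zeros((1, 2))])      # identity affine map
    for _ in range(iters):
        M = soft_assign(f, X, T)
        Y = M @ X                                        # equation (6)
        # min |Vh A - Y|^2 + lam*T*|A - A_id|^2  (normal equations)
        A = np.linalg.solve(Vh.T @ Vh + lam * T * np.eye(3),
                            Vh.T @ Y + lam * T * A_id)
        f = Vh @ A
        T *= rate                                        # anneal
    return f, M
```

As in the full algorithm, the map is near-rigid while T is large (the regulariser dominates) and fits the data more and more freely as T decreases, while M sharpens toward a hard correspondence.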

            apple   bottle   giraffe   mug   swan   horse
train         20      24       44       24    16      50
test pos      20      24       43       24    16     120
test neg     215     207      167      207   223     170

Table 1 Number of training images and of positive/negative test images for all datasets.

5.3 Constrained shape matching

TPS-RPM treats all shapes according to the same generic TPS deformation model, simply preferring smoother transformations (in particular, low 2D curvature w, and low affine skew d). Two shapes with the same deformation energy are considered equivalent. This might result in output shapes unlike any of the training examples. In this section, we extend TPS-RPM with the class-specific deformation model learned in subsection 4.4. We constrain the optimization to explore only the valid region of the shape space, containing shapes plausible for the class (defined by S, E1:n, λi from subsection 4.4).

At each iteration of TPS-RPM we project the current shape estimate Y (equation (6)) inside the valid region, just before fitting the TPS. This amounts to:

1) align Y on S w.r.t. translation/rotation/scale;
2) project Y onto the subspace spanned by E1:n: b = E⁻¹ · (Y − S), b(n+1):2p = 0;
3) bound the first n components of b by ±3√λi;
4) transform b back into the original space: Yᶜ = S + E · b;
5) apply to Yᶜ the inverse of the transformation used in 1).

The assignment Y ← Yᶜ imposes hard constraints on the shape space. While this guarantees output shapes similar to class members, it might sometimes be too restrictive. To match a novel instance accurately, it could be necessary to move a little along some dimensions of the shape space not recorded in the deformation model. The training data cannot be assumed to present all possible intra-class variations.

To tackle this issue, we propose a soft-constrained variant, where Y is attracted by the valid region with a force that diminishes with temperature: Y ← Y + (T/Tinit)(Yᶜ − Y). This causes TPS-RPM to start fully constrained; then, as the temperature decreases and M looks for correspondences closer to the current estimates, later iterations are allowed to apply small deformations beyond the valid region (typically along dimensions not in E1:n). As a result, output shapes fit the image data more accurately, while still resembling class members. Notice how this behavior is fully in the spirit of TPS-RPM, which also lets the TPS more and more free as T decreases.

The proposed extension to TPS-RPM has a deep impact, in that it alters the search through the transformation and correspondence spaces. Besides improving accuracy, it can help TPS-RPM avoid local minima far from the correct solution, thus avoiding gross failures.

Figure 8e shows the improvement brought by the proposed constrained shape matching, compared to TPS-RPM with just the generic TPS model (figure 8d). On the running example, the two versions of TPS-RPM diverge after the eighth iteration, as shown in figure 9.

5.4 Detections

Every local maximum in Hough space constitutes an initialization for the shape matching, and results in a different shape (detection) localized in the test image. In this section we score the detections, making it possible to reject detections and to evaluate the detection rate and false-positive rate of our system.

We score each detection by a weighted sum of four terms:

1) the number of matched model points, i.e. those for which a corresponding image point has been found with good confidence. Following [6], these are all points va with maxi=1..N (mai) > 1/N.

2) the sum of squared distances from the TPS-mapped model points to their corresponding image points. This measure is made scale-invariant by normalizing by the squared range r² of the image point coordinates (width or height, whichever is larger). Only matched model points are considered.

3) the deviation Σi,j∈[1,2] (I(i, j) − d(i, j)/√|d|)² of the affine component d of the TPS from the identity I. The normalization by the determinant of d factors out deviations due to scale changes.

4) the amount of non-rigid warp w of the TPS, trace(wᵀΦw)/r², where Φ(a, b) ∝ ||va − vb||² log ||va − vb|| is the TPS kernel matrix [6].

This score integrates the information a matched shape provides. It is high when the TPS fits many points (term 1) well (term 2), without having to distort much (terms 3 and 4). In our current implementation, the relative weights between these terms have been selected manually; they are the same for all classes, and remain fixed in all experiments.

As a final refinement, if two detections overlap substantially, we remove the lower-scored one. Notice that the method can detect multiple instances of the same class in an image. Since they appear as different peaks in the Hough voting space, they result in separate detections.
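Steps 2-4 of the projection, together with the soft-constrained update, take only a few lines of numpy. This sketch omits the similarity alignment of steps 1 and 5 and assumes E holds the orthonormal modes as rows (our layout):

```python
import numpy as np

def project_to_valid(Y, S, E, lam):
    """Steps 2-4: express Y - S in the mode subspace (components outside
    span(E) are implicitly zeroed, as in b_{(n+1):2p} = 0), clamp each
    coefficient to +/- 3 sqrt(lambda_i), and reconstruct Y^c."""
    b = E @ (Y - S)
    b = np.clip(b, -3 * np.sqrt(lam), 3 * np.sqrt(lam))
    return S + E.T @ b

def soft_constrain(Y, Yc, T, T_init):
    """Soft variant: Y is pulled toward the valid region with a force that
    fades as the temperature decreases: Y <- Y + (T/T_init)(Y^c - Y)."""
    return Y + (T / T_init) * (Yc - Y)
```

At T = T_init the soft update returns Yᶜ exactly (fully constrained); as T shrinks, the pull weakens and small excursions beyond the valid region become possible, matching the behavior described above.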

6 Experiments

We present an extensive evaluation involving six diverse object classes from two existing datasets [14,25]. After introducing the datasets in the next subsection, we evaluate our approach for learning shape models in subsection 6.2. The ability to localize objects in novel images, both in terms of bounding-boxes and boundaries, is measured in subsection 6.3. All experiments are run with the same parameters (no class-specific or dataset-specific tuning is applied).

6.1 Datasets and protocol

ETHZ shape classes [14]. This dataset features five diverse classes (bottles, swans, mugs, giraffes, apple-logos) and contains a total of 255 images collected from the web. It is highly challenging, as the objects appear in a wide range of scales, there is considerable intra-class shape variation, and many images are severely cluttered, with objects comprising only a fraction of the total image area (figures 13 and 18).

For each class, we learn 5 models, each from a different random sample containing half of the available images (there are 40 for apple-logos, 48 for bottles, 87 for giraffes, 48 for mugs and 32 for swans). Learning models from different training sets allows us to evaluate the stability of the proposed learning technique (subsection 6.2). Notice that our method does not require negative training images, i.e. images not containing any instance of the class.

The test set for a model consists of all other images in the dataset. Since this includes about 200 negative images, it allows us to properly estimate false-positive rates. Table 1 gives an overview of the composition of all training and testing sets. We refer to learning and testing on a particular split of the images as a trial.

INRIA horses [25]. This challenging dataset consists of 170 images with one or more horses viewed from the side, and 170 images without horses. Horses appear at several scales and against cluttered backgrounds. We train 5 models, each from a different random subset of 50 horse images. For each model, the remaining 120 horse images and all 170 negative images are used for testing (see table 1).

6.2 Learning shape models

Evaluation measures. We assess the performance of the learning procedure of section 4 in terms of how accurately it recovers the true class boundaries of the training instances. For this evaluation, we have manually annotated the boundaries of all object instances in the ETHZ shape classes dataset. We present results for all five of these classes.

Let Bgt be the ground-truth boundaries, and Bmodel the backmatched shapes output by the model shape refinement algorithm of subsection 4.3. The accuracy of learning is quantified by two measures. Coverage is the percentage of points from Bgt closer than a threshold t to any point of Bmodel. We set t to 4% of the diagonal of the bounding-box of Bgt. Conversely, precision is the percentage of Bmodel points closer than t to any point of Bgt. The measures are complementary. Coverage captures how much of the object boundary has been recovered by the algorithm, whereas precision reports how much of the algorithm's output lies on the object boundaries.

Models from the full algorithm. Table 2 shows coverage and precision averaged over training instances and trials, for the complete learning procedure described in section 4. With the exception of giraffes, the proposed method achieves very high coverage (above 90%), demonstrating its ability to discover which contour points belong to the class boundaries. The precision of apple-logos and bottles is also excellent, thanks to the clean prototype shapes learned by our approach (figure 10). Interestingly, the precision of mugs is somewhat lower, because the learned shapes include a detail not present in the ground-truth annotations, although it is arguably part of the class boundaries: the inner half of the opening on top of the mug. A similar phenomenon penalizes the precision of swans, where our method sometimes includes a few water waves in the model. Although they are not part of the swan boundaries, waves accidentally occurring at a similar position over many training images are picked up by the algorithm. A larger training set might lead to the suppression of such artifacts, as waves would have less chance of accumulating accidentally (we only used 16 images). The modeling performance for giraffes is lower, due to the extremely cluttered edgemaps arising from their natural environment, and to the camouflage texture which tends to break edges along the body outlines (figure 11).

Models without assembling the initial shape. We experiment with a simpler scheme for learning shape models by skipping the procedure for assembling the initial shape (section 4.2).
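The coverage and precision measures defined above can be computed directly from the two point sets. A brute-force sketch (names ours):

```python
import numpy as np

def coverage_precision(B_gt, B_model, frac=0.04):
    """Coverage: % of ground-truth boundary points within t of the model
    output; precision: % of model points within t of the ground truth.
    t is frac (4%) of the ground-truth bounding-box diagonal."""
    lo, hi = B_gt.min(0), B_gt.max(0)
    t = frac * np.linalg.norm(hi - lo)
    # pairwise distances: rows index ground-truth points, columns model points
    d = np.linalg.norm(B_gt[:, None, :] - B_model[None, :, :], axis=-1)
    coverage = (d.min(1) < t).mean() * 100
    precision = (d.min(0) < t).mean() * 100
    return coverage, precision
```

The asymmetry of the two minima is what makes the measures complementary: a model with spurious extra strokes keeps full coverage but loses precision, and vice versa.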

              apple         bottle        giraffe       mug           swan
Full system   90.2 / 90.6   96.2 / 92.7   70.8 / 74.3   93.9 / 83.6   90.0 / 80.0
No assembly   91.2 / 92.7   96.8 / 88.1   70.0 / 72.6   92.6 / 82.9   89.4 / 79.2

Table 2 Accuracy of learned models. Each entry is the average coverage/precision over trials and training instances.

Fig. 10 Learned shape models for the ETHZ Shape Classes (three out of the total five per class). Top three rows: models learnt using the full method presented in section 4. Last row: models learnt using the same training images as in row 3, but skipping the procedure for assembling the initial shape (subsection 4.2; done only for the ETHZ shape classes).

Fig. 11 A typical edgemap for a giraffe training window is very cluttered, and edges are broken along the animal's outline, making it difficult to learn clean models.

An alternative initial shape can be obtained directly from the COP model (section 4.1) by picking, for each part, the occurrence closest to the peak in the voting space corresponding to the part (as in section 4.1). This initial shape can then be passed on to the shape refinement stage as usual (section 4.3).

For each object class and trial we have rerun the learning algorithm without the assembly stage, but otherwise under identical conditions (including using exactly the same training images). Many of the resulting prototype shapes are moderately worse than those obtained using the full learning scheme (figure 10, bottom row). However, the lower model quality only results in slightly lower average coverage/precision values (table 2). These results suggest that while the initial assembly stage does help to obtain better models, it is not a crucial step, and that the shape refinement stage of section 4.3 is robust to large amounts of noise and delivers good models even when starting from poor initial shapes.

6.3 Object detection

Detection up to a bounding-box. We first evaluate the ability of the object detection procedure of section 5 to localize objects in cluttered test images up to a bounding-box (i.e. the traditional detection task commonly defined in the literature).

Figure 12 reports detection-rate against the number of false-positives per image (FPPI), averaged over all 255 test images and over the 5 trials. As discussed above, this test set includes mostly negative images. We adopt the strict standards of the PASCAL Challenge criterion (dashed lines in the plots): a detection is counted as correct only if the intersection-over-union ratio (IoU) with the ground-truth bounding-box is greater than 50%. All other detections are counted as false-positives. In order to compare to [14,16], we also report results under their somewhat softer criterion: a detection is counted as correct if its bounding-box overlaps more than 20% with the ground-truth one, and vice-versa (we refer to this criterion as 20%-IoU).

As the plots show, our method performs well on all classes but giraffes, with detection-rates around 80% at the moderate false-positive rate of 0.4 FPPI (this is the reference point for all comparisons). The lower performance on giraffes is mainly due to the difficulty of building shape models from their extremely noisy edge maps.

It is interesting to compare against the detection performance obtained by the Hough voting stage alone (subsection 5.1), without the shape matcher on top (subsections 5.2, 5.3). The full system performs substantially better: the difference under the PASCAL criterion is about +30% averaged over all classes. This shows the benefit of treating object detection fully as a shape matching task, rather than simply matching local features, which is one of the principal points of this paper.
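The two overlap criteria amount to a few lines of code; boxes here are (x1, y1, x2, y2) tuples (a layout of our choosing):

```python
def _inter(a, b):
    # area of the intersection of two axis-aligned boxes (0 if disjoint)
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return ix * iy

def _area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def iou(a, b):
    """Intersection-over-union ratio of two boxes."""
    i = _inter(a, b)
    return i / (_area(a) + _area(b) - i)

def correct_pascal(det, gt):
    """Strict PASCAL criterion: IoU above 50%."""
    return iou(det, gt) > 0.5

def correct_20iou(det, gt):
    """Softer criterion of [14,16]: each box covers more than 20% of the
    other (mutual overlap)."""
    i = _inter(det, gt)
    return i > 0.2 * _area(det) and i > 0.2 * _area(gt)
```

As the numbers in the text show, a detection can easily pass 20%-IoU yet fail PASCAL, which is why the gap between the two curves widens for classes whose bounding-boxes shift (horses, giraffes).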

Apple logos Bottles


Giraffes
1 1
1
0.9 0.9
0.9
0.8 0.8 0.8
0.7 0.7 0.7

Detection rate
Detection rate

Detection rate
0.6 0.6 0.6
0.5 0.5 0.5
0.4 0.4 0.4
0.3 0.3 0.3
0.2 0.2 0.2
0.1 0.1 0.1

0 0 0
0 0.2 0.4 0.6 0.8 1 1.2 1.4 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0 0.2 0.4 0.6 0.8 1 1.2 1.4
False−positives per image False−positives per image False−positives per image

Mugs Swans
INRIA Horses
1 1 1
0.9 0.9 0.9
0.8 0.8 0.8
0.7 0.7 0.7
Detection rate

Detection rate

0.6 0.6

Detection rate
0.6
0.5 0.5 0.5
0.4 0.4 0.4
Ferrari et al. PAMI 08 (20% IoU)
0.3 0.3 0.3 Ferrari et al. PAMI 08 (PASCAL)

0.2 0.2 Full system (20% IoU)


0.2
Full system (PASCAL)
0.1 0.1 0.1 Hough only (PASCAL)
0 0
0 0.2 0.4 0.6 0.8 1 1.2 1.4 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0
False−positives per image 0 0.2 0.4 0.6 0.8 1 1.2 1.4
False−positives per image False−positives per image

Fig. 12 Object detection performance (models learnt from real images). Each plot shows five curves: the full system evaluated under
the PASCAL criterion for a correct detection (dashed, thick, red), the full system under the 20%-IoU criterion (solid, thick, red), the
Hough voting stage alone under PASCAL (dashed, thin, blue), [16] under 20%-IoU (solid, thin, green) and under PASCAL (dashed,
thin, green). The curve for the full system under PASCAL in the apple-logo plot is identical to the curve for 20%-IoU.

matching task, rather than simply matching local features, which is one of the principal points of this paper. Moreover, the shape matching stage also makes it possible to localize complete object boundaries, rather than just bounding-boxes (figure 13).

The difference between the curves under the PASCAL criterion and the 20%-IoU criterion of [14,16] is small for apple-logos, bottles, mugs and swans (0%, −1.6%, −3.6%, −4.9%), indicating that most detections have accurate bounding-boxes. For horses and giraffes the decrease is more significant (−18.1%, −14.1%), because the legs of the animals are harder to detect and cause the bounding-box to shift along the body. On average over all classes, our method achieves 78.1% detection-rate at 0.4 FPPI under 20%-IoU and 71.1% under PASCAL. The corresponding standard deviation over trials, averaged over classes, is 8.1% under 20%-IoU and 8.0% under PASCAL (this variation is due to different trials having different training and test sets).

For reference, the plots also show the performance of [16] on the same datasets, using the same number of training and test images. An exact comparison is not possible, as [16] reports results based on only one training/testing split, whereas we average over 5 random splits. Under the rather permissive 20%-IoU criterion, [16] performs a little better than our method on average over all classes. Under the strict PASCAL criterion instead, our method performs substantially better than [16] on two classes (apple-logos, swans), moderately worse on two (bottles, horses), and about the same on two (mugs, giraffes), thanks to the higher accuracy of the detected bounding-boxes. Averaged over all classes, under PASCAL our method reaches 71.1% detection-rate at 0.4 FPPI, comparing well against the 68.5% of [16]. Note how our results are achieved without the beneficial discriminative learning of [16], where an SVM learns which PAS types at which relative location within the training bounding-box best discriminate between instances of the class and background image windows. Our method instead trains only from positive examples.

For clarity, and as a reference for comparison by future works, we summarize here our results on the ETHZ Shape Classes alone (without INRIA horses). Under PASCAL, averaged over all 5 trials and 5 classes, our method achieves 72.0%/67.2% detection-rate at 0.4/0.3 FPPI respectively. Under 20%-IoU, it achieves 76.8%/71.5% detection-rate at 0.4/0.3 FPPI.

After our results were first published [15], Fritz and Schiele [20] presented an approach based on topic models and a dense gradient histogram representation of image windows (no explicit shapes). They report results on the ETHZ Shape Classes dataset (i.e. no horses), using the same protocol (5 random trials). Their method achieves 84.8% averaged over classes, improving over our 76.8% (both at 0.4 FPPI and under 20%-IoU).
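The two bounding-box correctness criteria compared throughout this section can be sketched concretely. Box convention (x1, y1, x2, y2) and function names are ours; the criteria themselves only fix the thresholds: IoU ≥ 0.5 for PASCAL, IoU ≥ 0.2 for the 20%-IoU criterion of [14,16].

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct(det, gt, criterion="pascal"):
    """A detection is correct if its IoU with the ground-truth box reaches
    0.5 (PASCAL) or 0.2 (the permissive 20%-IoU criterion)."""
    return iou(det, gt) >= (0.5 if criterion == "pascal" else 0.2)
```

For instance, a detection shifted along a horse's body can still overlap the ground-truth box by well over 20% while falling below the 50% PASCAL threshold, which is exactly the gap the horse and giraffe curves exhibit.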

Fig. 13 Example detections (models learnt from images). Notice the large scale variations (especially in apple-logos, swans), the intra-category shape variability (especially in swans, giraffes), and the extensive clutter (especially in giraffes, mugs). The method works for photographs as well as paintings (first swan, last bottle). Two bottle cases also show false-positives. In the first two horse images, the horizontal line below the horses' legs is part of the model and represents the ground. Interestingly, the ground line systematically reoccurs over the training images for that model and gets learned along with the horse.

Fig. 15 Learned shape models for INRIA horses (three out of total five), using the method presented in section 4.

Beyond the above quantitative evaluation, the method presented in this paper offers two important advantages over both [16] and [20]. It localizes object boundaries, rather than just bounding-boxes, and can also detect objects starting from a single hand-drawing as a model (see below).

Localizing object boundaries. The most interesting feature of our approach is the ability to localize object boundaries in novel test images. This is shown by several examples in figure 13, where the method succeeds in spite of extensive clutter, a large range of scales, and intra-class variability (typical failure cases are discussed in figure 16). In the following we quantify how accurately the output shapes match the true object boundaries. We use the coverage and precision measures defined above. In the present context, coverage is the percentage of ground-truth boundary points recovered by the method, and precision is the percentage of output points that lie on the ground-truth boundaries. All shape models used in these experiments have been learned from real images, as discussed before. Several models for each object class are shown in figures 10 and 15.

Table 3 shows coverage and precision averaged over trials and correct detections at 0.4 FPPI. Coverage ranges in 78−92% for all classes but giraffes, demonstrating that most of the true boundaries have been successfully detected. Moreover, precision values are similar, indicating that the method returns only a small proportion of points outside the true boundaries. Performance is lower for giraffes, due to the noisier models and the difficult edgemaps derived from the test images.

Although it uses the same evaluation metric, the experiment carried out at training time in subsection 6.2 differs substantially from the present one, because at testing time the system is not given ground-truth bounding-boxes. In spite of the important additional challenge of having to determine the object's location and scale in the image, the coverage/precision scores in table 3 are only moderately lower than those achieved during training (the average difference in coverage and precision is 7.1% and 2.1% respectively). This demonstrates that our detection approach is highly robust to clutter.

As a baseline, table 3 also reports coverage/precision results when using the ground-truth bounding-boxes as shapes. The purpose of this experiment is to compare the accuracy of our method to the maximal accuracy that can be achieved when localizing objects up to a bounding-box. As the table clearly shows, the shapes returned by our method are substantially more accurate than the best bounding-box, thereby proving one of the principal points of this paper. While the average difference is about 35%, it is interesting to observe how the difference is greater for less rectangular objects (swans, giraffes, apple-logos) than for bottles and mugs. Notice also how our method is much more accurate than the ground-truth bounding-box even for giraffes, the class where it performs the worst.

Finally, we investigate the impact of the constrained shape matching technique proposed in subsection 5.3, by re-running the experiment without it, simply relying on the deformation model implicit in the thin-plate spline formulation (table 3, second row). The coverage/precision values are very similar to those obtained through constrained shape matching. The reason is that most cases are either already solved accurately without learned deformation models, or they do not improve when using them because the low accuracy is due to particularly bad edgemaps. In practice, the difference made by constrained shape matching is visible in about one case every six, and it is localized to a relatively small region of the shape (figure 14). The combination of these two factors explains why constrained shape matching appears to make little quantitative difference, although in many cases the localized boundaries improve visibly.

Detection from hand-drawn models. A useful characteristic of the proposed approach is that, unlike most existing object detection methods, it can take either a hand-drawing as a model, or learn it from real images. When given a hand-drawing as a model, our approach does not perform the learning stage, and naturally falls back to the functionality of pure shape matchers which take a clean shape as input (e.g. the recent works [14,38], which support matching to cluttered test images). In this case, the modeling stage simply decomposes the hand-drawing into PAS. Object detection then uses these PAS for the Hough voting stage, and the hand-drawing itself for the shape matching stage. As no deformation model can be learnt from a single example, our method naturally switches to the standard deformation model implicit in the thin-plate spline formulation.

Figure 17 compares our method to [14] using their exact setup, i.e. with a single hand-drawing per class as model and all 255 images of the ETHZ shape classes as

[Panel labels: "default TPS" vs. "shape constrained", for each of the two examples.]

Fig. 14 (left) typical improvement brought by constrained shape matching over simply using the TPS deformation model. As the improvement is often a refinement of a local portion of the shape (the swan's tail in this case), the numerical difference in the evaluation measures is only modest (in this case less than 1%). (right) an infrequent case, where constrained shape matching fixes the entirely wrong solution delivered by standard matching. The numerical difference in such cases is noticeable (about 6%).
                     apple         bottle        giraffe       mug           swan
Full system          91.6 / 93.9   83.6 / 84.5   68.5 / 77.3   84.4 / 77.6   77.7 / 77.2
No learned deform    91.3 / 93.6   82.7 / 84.2   68.4 / 77.7   83.2 / 75.7   78.4 / 77.0
Ground-truth BB      42.5 / 40.8   71.2 / 67.7   26.7 / 29.8   55.1 / 62.3   36.8 / 39.3

Table 3 Accuracy of localized object boundaries at test time. Each entry is the average coverage / precision over trials and correct detections at 0.4 FPPI.
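The coverage and precision entries in table 3 can be sketched for two 2-D point sets as follows. The matching tolerance `tol` and the function names are our assumptions; the paper's exact threshold is not restated here.

```python
import math

def _has_neighbor(p, pts, tol):
    """True if some point in pts lies within distance tol of p."""
    return any(math.dist(p, q) <= tol for q in pts)

def coverage_precision(gt_pts, out_pts, tol):
    """Coverage: fraction of ground-truth boundary points recovered by the
    output shape; precision: fraction of output points lying on (near)
    the ground-truth boundary."""
    cov = sum(_has_neighbor(p, out_pts, tol) for p in gt_pts) / len(gt_pts)
    prec = sum(_has_neighbor(p, gt_pts, tol) for p in out_pts) / len(out_pts)
    return cov, prec
```

The two measures are deliberately asymmetric: coverage penalizes missed boundary fragments, while precision penalizes output points hallucinated onto clutter.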

Fig. 16 Example failed detections (models learnt from images). (a) A typical case: a good match to the image edges is found, but at the wrong scale. Our system has no bias for any particular scale. (b) Another typical case: failure is due to an extremely cluttered edge-map. The neck is correctly matched, and gives rise to a peak in the Hough voting space (section 5.1). However, the subsequent deformable matching stage (section 5.2) is attracted by the high-contrast waves in the background. (c) An infrequent case: failure is due to a poor shape model (right; this is the worst of the 30 models we have learned).
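As background for the Hough voting stage mentioned in case (b): each matched local feature casts a vote for an object centre and scale, votes are accumulated in coarse bins, and the strongest bin becomes a detection hypothesis. A minimal sketch, where the bin sizes and all names are ours, not the paper's implementation:

```python
from collections import Counter

def hough_vote(votes, xy_bin=20, s_bin=0.5):
    """votes: (cx, cy, scale) hypotheses, one per feature match, e.g.
    obtained by applying a model feature's stored offset to the matched
    image feature. Returns the peak bin's centre and its vote count."""
    acc = Counter()
    for cx, cy, s in votes:
        acc[(round(cx / xy_bin), round(cy / xy_bin), round(s / s_bin))] += 1
    (bx, by, bs), count = acc.most_common(1)[0]
    return (bx * xy_bin, by * xy_bin, bs * s_bin), count
```

A cluttered edge-map can still produce a genuine peak at the right location, as in case (b); it is the subsequent deformable matching that then drifts onto background contours.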

test set. Therefore, the test set for each class contains mostly images not containing any instance of the class, which supports the proper estimation of FPPI. Our method performs better than [14] on all 5 classes, especially in the low FPPI range, and substantially outperforms the oriented chamfer matching baseline (details in [14]). Averaged over classes, our method achieves 85.3%/82.4% detection-rate at 0.4/0.3 FPPI respectively, compared to 81.5%/70.5% of [14] (all results under 20%-IoU). As one reason for this improvement, our method is more robust because it does not need the test image to contain long chains of contour segments around the object.

After our results were first published [15], two works reported even better performance. Ravishankar et al. [36] achieve 95.2% at 0.4 FPPI. Zhu et al. [45] report 0.21 FPPI at 85.3% detection-rate (ours); note this is the opposite of the usual convention of reporting detection-rate at a reference FPPI [14–16,20,36]. All results are under 20%-IoU and averaged over classes. As part of the reason for the high performance, Ravishankar et al. [36] propose a sophisticated scoring method which allows false-positives to be rejected reliably, while the method of Zhu et al. [45] relies on their algorithm [46] to find long salient contours, effectively removing many clutter edgels before the object detector runs. An interesting avenue for further research is incorporating these successful elements in our framework.

Besides this quantitative evaluation, the main advantage of our approach over [14,36,45] is that it can also train from real images (which is the main topic of this paper). Moreover, compared to [14], it supports branching and self-intersecting input shapes.

Interestingly, in our system hand-drawings lead to moderately better detection results than when learning models from images. This is less surprising when considering that hand-drawings are essentially the prototype shapes the system tries to learn.
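The usual reporting convention discussed above, detection-rate at a reference FPPI, can be sketched from scored detections pooled over the test set. The data layout and names here are our illustration, not the paper's evaluation code:

```python
def detrate_at_fppi(dets, n_pos, n_imgs, ref_fppi):
    """dets: (score, is_true_positive) pairs for every detection in the
    test set. Sweeping the score threshold from strict to permissive,
    return the detection rate at the last operating point whose
    false-positives-per-image stays within ref_fppi."""
    best = 0.0
    tp = fp = 0
    for score, is_tp in sorted(dets, key=lambda d: -d[0]):
        tp += is_tp
        fp += not is_tp
        if fp / n_imgs <= ref_fppi:
            best = tp / n_pos
    return best
```

Reporting FPPI at a reference detection-rate, as in [45], is simply the inverse reading of the same curve.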

[Figure 17 plots: panels for Apple logos, Bottles, Giraffes, Mugs and Swans; axes: detection rate vs. false-positives per image; legend: Chamfer matching, Ferrari et al. ECCV 2006, Our method.]

Fig. 17 Object detection performance (hand-drawn models). To facilitate comparison, all curves have been computed using
the 20%-IoU criterion of [14].

Fig. 18 Detection from hand-drawn models. Top: four of the five models from [14]. There is just one example per object class.
Bottom: example detections delivered by our shape matching procedure, using these hand-drawings as models.

7 Conclusions and future work

We have proposed an approach for learning class-specific explicit shape models from images annotated by bounding-boxes, and localizing the boundaries of novel class instances in the presence of extensive clutter, scale changes, and intra-class variability. In addition, the approach also operates effectively when given hand-drawings as models. The ability to input both images and hand-drawings as training data is a consequence of the basic design of our approach, which attempts to bridge the gap between shape matching and object class detection.

The presented approach can be extended in several ways. First, the training stage models only positive examples. This could be extended by learning a classifier to distinguish between positive and negative examples, which might reduce false positives. One possibility could be to train both our shape models and the discriminative models of [16]. At detection time, we could then use the bounding-box delivered by [16] to initialize shape matching based on our models. Moreover, the discriminative power of the representation could be improved by using appearance features in addition to image contours. Finally, in this paper we have assumed that all observed differences in the shape of the training examples originate from intra-class variation, and not from viewpoint changes. It would be interesting to add
20

a stage to automatically group objects by viewpoint, and learn separate shape models.

References

1. R. Basri, L. Costa, D. Geiger, and D. Jacobs, Determining the Similarity of Deformable Shapes, Vision Research, 38:2365-2385, 1998.
2. A. Berg, T. Berg, and J. Malik, Shape Matching and Object Recognition using Low Distortion Correspondence, CVPR, 2005.
3. S. Belongie and J. Malik, Shape Matching and Object Recognition using Shape Contexts, PAMI, 24(4):509-522, 2002.
4. E. Borenstein and S. Ullman, Class-Specific, Top-Down Segmentation, ECCV, 2002.
5. I. Biederman, Recognition-by-components: A theory of human image understanding, Psychological Review, 94(2):115-147, 1987.
6. H. Chui and A. Rangarajan, A new point matching algorithm for non-rigid registration, CVIU, 89(2-3):114-141, 2003.
7. O. Chum and A. Zisserman, An Exemplar Model for Learning Object Classes, CVPR, 2007.
8. T. Cootes, C. Taylor, D. Cooper, and J. Graham, Active Shape Models: Their Training and Application, CVIU, 61(1):38-59, 1995.
9. T. Cootes, An Introduction to Active Shape Models, 2000.
10. D. Cremers, T. Kohlberger, and C. Schnorr, Nonlinear Shape Statistics in Mumford-Shah Based Segmentation, ECCV, 2002.
11. N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, CVPR, 2005.
12. G. Elidan, G. Heitz, and D. Koller, Learning Object Shape: From Drawings to Images, CVPR, 2006.
13. V. Ferrari, T. Tuytelaars, and L. Van Gool, Simultaneous Object Recognition and Segmentation by Image Exploration, ECCV, 2004.
14. V. Ferrari, T. Tuytelaars, and L. Van Gool, Object Detection with Contour Segment Networks, ECCV, 2006.
15. V. Ferrari, F. Jurie, and C. Schmid, Accurate Object Detection with Deformable Shape Models Learnt from Images, CVPR, 2007.
16. V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid, Groups of Adjacent Contour Segments for Object Detection, PAMI, 30(1):36-51, 2008.
17. P. Felzenszwalb, Representation and Detection of Deformable Shapes, PAMI, 27(2):208-220, 2005.
18. R. Fergus, P. Perona, and A. Zisserman, Object Class Recognition by Unsupervised Scale-invariant Learning, CVPR, 2003.
19. M. Fritz and B. Schiele, Towards Unsupervised Discovery of Visual Categories, DAGM, 2006.
20. M. Fritz and B. Schiele, Decomposition, discovery and detection of visual categories using topic models, CVPR, 2008.
21. A. Hill and C. Taylor, A Method of Non-Rigid Correspondence for Automatic Landmark Identification, BMVC, 1996.
22. D. Gavrila, Multi-Feature Hierarchical Template Matching Using Distance Transforms, ICPR, 1998.
23. Y. Gdalyahu and D. Weinshall, Flexible Syntactic Matching of Curves and Its Application to Automatic Hierarchical Classification of Silhouettes, PAMI, 21(12):1312-1328, 1999.
24. S. Gold and A. Rangarajan, Graduated Assignment Algorithm for Graph Matching, PAMI, 18(4):377-388, 1996.
25. F. Jurie and C. Schmid, Scale-invariant Shape Features for Recognition of Object Categories, CVPR, 2004.
26. Y. Lamdan, J. Schwartz, and H. Wolfson, Affine Invariant Model-based Object Recognition, IEEE Transactions on Robotics and Automation, 6(5):578-589, 1990.
27. B. Leibe and B. Schiele, Scale-Invariant Object Categorization using a Scale-Adaptive Mean-Shift Search, DAGM, 2004.
28. M. Leordeanu, M. Hebert, and R. Sukthankar, Beyond Local Appearance: Category Recognition from Pairwise Interactions of Simple Features, CVPR, 2007.
29. D. Martin, C. Fowlkes, and J. Malik, Learning to detect natural image boundaries using local brightness, color, and texture cues, PAMI, 26(5):530-549, 2004.
30. D. Marr and H.K. Nishihara, Representation and Recognition of the Spatial Organization of Three-Dimensional Shapes, Proc. Royal Soc. London, Series B, Biological Sciences, 200:269-294, 1978.
31. F. Mokhtarian and A. Mackworth, Scale-based Description and Recognition of Planar Curves and Two-dimensional Shapes, PAMI, 8(1):34-43, 1986.
32. A. Opelt, A. Pinz, and A. Zisserman, A Boundary-Fragment-Model for Object Detection, ECCV, 2006.
33. A. Pentland and S. Sclaroff, Closed-form solutions for physically based shape modeling and recognition, PAMI, 13(7):715-729, 1991.
34. T. Quack, V. Ferrari, B. Leibe, and L. Van Gool, Efficient Mining of Frequent and Distinctive Feature Configurations, ICCV, 2007.
35. D. Ramanan, Learning to parse images of articulated bodies, NIPS, 2006.
36. S. Ravishankar, A. Jain, and A. Mittal, Multi-stage Contour Based Detection of Deformable Objects, ECCV, 2008.
37. T. Sebastian, P. Klein, and B. Kimia, Recognition of Shapes by Editing Their Shock Graphs, PAMI, 26(5):550-571, 2004.
38. J. Schwartz and P. Felzenszwalb, Hierarchical Matching of Deformable Shapes, CVPR, 2007.
39. J. Shotton, A. Blake, and R. Cipolla, Contour-Based Learning for Object Detection, ICCV, 2005.
40. A. Torralba, K. Murphy, and W. Freeman, Sharing Features: Efficient Boosting Procedures for Multiclass Object Detection, CVPR, 2004.
41. D. Sharvit, J. Chan, H. Tek, and B. Kimia, Symmetry-Based Indexing of Image Databases, IEEE Workshop on Content-Based Access of Image and Video Libraries, 1998.
42. S. Ullman, Aligning pictorial descriptions: An approach to object recognition, Cognition, 32(3):193-254, 1989.
43. J. Winn and N. Jojic, LOCUS: Learning Object Classes with Unsupervised Segmentation, ICCV, 2005.
44. J. Winn and J. Shotton, The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects, CVPR, 2006.
45. Q. Zhu, L. Wang, Y. Wu, and J. Shi, Contour Context Selection for Object Detection: A Set-to-Set Contour Matching Approach, ECCV, 2008.
46. Q. Zhu, G. Song, and J. Shi, Untangling cycles for contour grouping, ICCV, 2007.
