images, thereby localizing novel class instances up to their boundaries (as opposed to a bounding-box).

The main contribution of this paper is a technique for learning the prototypical shape of an object class as well as a statistical model of intra-class deformations, given image windows containing training instances (figure 3a; no pre-segmented shapes are needed). The challenge is to determine which contour points belong to the class boundaries, while discarding background and details specific to individual instances (e.g. mug labels). Note how these typically form the majority of points, yielding a poor signal-to-noise ratio. The task is further complicated by intra-class variability: the shape of the object boundary varies across instances.

As additional contributions, we extend the non-rigid shape matcher of Chui and Rangarajan [6] in two ways. First, we extend it to operate in cluttered test images, by deriving an automatic initialization for the location and scale of the object from a Hough-style voting scheme [27,32,39] (instead of the manual initialization that would otherwise be necessary). This enables matching the learned shape model even to severely cluttered images, where the object boundaries cover only a small fraction of the contour points (figures 1, 13). As a second extension, we constrain the shape matcher [6] to search only over transformations compatible with the learned, class-specific deformation model. This ensures output shapes similar to class members, improves accuracy, and helps avoid local minima.

These contributions result in a powerful system, capable of detecting novel class instances and localizing their boundaries in cluttered images, while training from objects annotated only with bounding-boxes.

After reviewing related work (section 2) and the local contour features used in our approach (section 3), we present our shape learning method in section 4, and the scheme for localizing objects in test images in section 5. Section 6 reports extensive experiments. We evaluate the quality of the learned models and quantify localization performance at test time in terms of the accuracy of the detected object boundaries. We also compare to previous works for object localization with training on real images [16] and hand-drawings [14]. A preliminary version of this work was published at CVPR 2007 [15].

2 Related works

As there exists a large body of work on shape representations for recognition [1,3,8,16,17,23,24,28,37], we briefly review in the following only the most important works relevant to this paper, i.e. on shape description and matching for modeling, recognition, and localization of object classes.

Several earlier works for shape description are based on silhouettes [31,41]. Yet, silhouettes are limited because they ignore internal contours and are difficult to extract from cluttered images, as noted by [3]. Therefore, more recent works represent shapes as loose collections of 2D points [8,22] or other 2D features [12,16]. Other works propose more informative structures than individual points as features, in order to simplify matching. Belongie et al. [3] propose the Shape Context, which captures for each point the spatial distribution of all other points relative to it on the shape. This semi-local representation allows establishing point-to-point correspondences between shapes even under non-rigid deformations. Leordeanu et al. [28] propose another way to go beyond individual edgels, by encoding relations between all pairs of edgels. Similarly, Elidan et al. [12] use pairwise spatial relations between landmark points. Ferrari et al. [16] present a family of scale-invariant local shape features formed by short chains of connected contour segments, capable of cleanly encoding pure fragments of an object boundary. They offer an attractive compromise between information content and repeatability, and encompass a wide variety of local shape structures.

While generic features can be directly used to model any object, an alternative is to learn features adapted to a particular object class. Shotton et al. [39] and Opelt et al. [32] learn class-specific boundary fragments (local groups of edgels), and their spatial arrangement as a star configuration. In addition to their own local shape, such fragments store a pointer to the object center, enabling object localization in novel images using voting. Other methods [11,16] achieve this functionality by encoding spatial organization by tiling object windows, and learning which feature/tile combinations discriminate objects from background.

The overall shape model of the above approaches is either (a) a global geometric organization of edge fragments [3,16,32,39]; or (b) an ensemble of pairwise constraints between point features [12,28]. Global geometric shape models are appealing because of their ability to handle deformations, which can be represented in several ways. The authors of [3] use regularized Thin Plate Splines, a generic deformation model that can quantify dissimilarity between any two shapes, but cannot model shape variations within a specific class. In contrast, Pentland et al. [33] learn the intra-class deformation modes of an elastic material from clean training shapes. The most famous work in this spirit is Active Shape Models [8], where the shape model in novel images is constrained to vary only in ways seen during training. A few principal deformation modes, accounting for most of the total variability over the training
Fig. 3 Learning the shape model. (a) Four training examples (out of a total of 24). (b) Model parts. (c) Occurrences selected to form the initial shape. (d) Refined shape. (e) First two modes of variation (mean shape in the middle).
fit both over individual edgel-chains, and bridged between two linked chains. This brings robustness to the unavoidable broken edgel-chains [14].

The local features we use are pairs of connected segments (figure 2a). Informally, two segments are considered connected if they are adjacent on the same edgel-chain, or if one is at the end of an edgel-chain directed towards the other (i.e. if the first segment were extended a bit, it would meet the second one). As two segments in a pair are not limited to come from a single edgel-chain, but may come from adjacent edgel-chains, the extraction of pairs is robust to the typical mistakes of the underlying edge detector.

Each pair of connected segments forms one feature, called a PAS, for Pair of Adjacent Segments. A PAS feature P = (x, y, s, e, d) has a location (x, y) (mean over the two segment centers), a scale s (distance between the segment centers), a strength e (average edge detector confidence over the edgels, with values in [0, 1]), and a descriptor d = (θ1, θ2, l1, l2, r) invariant to translation and scale changes. The descriptor encodes the shape of the PAS, by the segments' orientations θ1, θ2 and lengths l1, l2, and the relative location vector r, going from the center of the first segment to the center of the second (a stable way to derive the order of the segments in a PAS is given in [16]). Both lengths and relative location are normalized by the scale of the PAS. Notice that PAS can overlap, i.e. two different PAS can share a common segment.

PAS features are particularly suited to our needs. First, they are robustly detected because they connect segments even across gaps between edgel-chains. Second, as both PAS and their descriptors cover solely the two segments, they can cover pure portions of an object boundary, without including clutter edges which often lie in the vicinity (as opposed to patch descriptors). Hence, PAS descriptors respect the nature of boundary fragments, to be one-dimensional elements embedded in a 2D image, as opposed to local appearance features, whose extent is a 2D patch. Third, PAS have intermediate complexity. As demonstrated in [16], they are complex enough to be informative, yet simple enough to be detectable repeatably across different images and object instances. Finally, since a correspondence between two PAS induces a translation and scale change, they can be readily used within a Hough-style voting scheme for object detection [27,32,39].

PAS dissimilarity measure. The dissimilarity D(d^p, d^q) between the descriptors d^p, d^q of two PAS P, Q, defined in [16], is

D(d^p, d^q) = w_r ||r^p − r^q|| + w_θ Σ_{i=1}^{2} D_θ(θ_i^p, θ_i^q) + Σ_{i=1}^{2} |log(l_i^p / l_i^q)|    (1)

where the first term is the difference in the relative locations of the segments, D_θ ∈ [0, π/2] measures the difference between segment orientations, and the last term accounts for the difference in lengths. In all our experiments, the weights w_r, w_θ are fixed to the same values used in [16] (w_r = 4, w_θ = 2).

PAS codebook. We construct a codebook by clustering the PAS inside all training bounding-boxes according to their descriptors (see [16] for more details about the clustering algorithm). For each cluster, we retain the centermost PAS, minimizing the sum of dissimilarities to all the others. The codebook C = {t_i} is the collection of the descriptors of these centermost PAS, the PAS types {t_i} (figure 2b). A codebook is useful for efficient matching, since all features similar to a type are considered in correspondence. The codebook is class-specific and built from the same images used later to learn the shape model.
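As a concrete illustration, a minimal Python sketch of the dissimilarity (1) follows. The tuple layout of the descriptor and the example values are our own assumptions for illustration; only the weights follow [16].

```python
import numpy as np

W_R, W_THETA = 4.0, 2.0  # weights w_r, w_theta as in [16]

def d_theta(a, b):
    """Orientation difference D_theta in [0, pi/2] (segments are undirected)."""
    d = abs(a - b) % np.pi
    return min(d, np.pi - d)

def pas_dissimilarity(dp, dq):
    """Equation (1). A descriptor is ((theta1, theta2), (l1, l2), r),
    with lengths and the relative location r already normalized by scale."""
    (th_p, l_p, r_p), (th_q, l_q, r_q) = dp, dq
    term_r = W_R * np.linalg.norm(np.subtract(r_p, r_q))
    term_th = W_THETA * sum(d_theta(a, b) for a, b in zip(th_p, th_q))
    term_l = sum(abs(np.log(a / b)) for a, b in zip(l_p, l_q))
    return term_r + term_th + term_l

# two similar upper-L shaped PAS (illustrative values)
dp = ((0.0, np.pi / 2), (0.50, 0.60), (0.30, 0.10))
dq = ((0.1, np.pi / 2), (0.45, 0.65), (0.32, 0.08))
print(pas_dissimilarity(dp, dq))  # small value -> similar PAS
```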
Fig. 4 Finding model parts. Left: four training instances with two recurring PAS of the upper-L type (one on the handle, and
another on the main body). Right: four slices of the accumulator space for this PAS type (each slice corresponds to a different size).
The two recurring PAS form peaks at different locations and sizes. Our method allows for different model parts with the same PAS
type.
4 Learning the shape model

In this section we present the new technique for learning a prototype shape for an object class and its principal intra-class deformation modes, given image windows W with example instances (figure 3a). To achieve this, we propose a procedure for discovering which contour points belong to the common class boundaries, and for putting them in full point-to-point correspondence across the training examples. For example, we want the shape model to include the outline of a mug, which is characteristic for the class, and not the mug labels, which vary across instances. The technique is composed of four stages (figure 3b-e):

1. Determine model parts as PAS frequently reoccurring with similar locations, scales, and shapes (subsection 4.1).
2. Assemble an initial shape by selecting a particular PAS for each model part from the training examples (subsection 4.2).
3. Refine the initial shape by iteratively matching it back onto the training images (subsection 4.3).
4. Learn a statistical model of intra-class deformations from the corresponded shape instances produced by stage 3 (subsection 4.4).

The shape model output at the end of this procedure is composed of a prototype shape S, which is a set of points in the image plane, and a small number of n intra-class deformation modes E_{1:n}, so that new class members can be written as S + E_{1:n}.

4.1 Finding model parts

1. Align windows. Let a be the geometric mean of the aspect-ratios of the training windows W (width over height). Each window is transformed to a canonical zero-centered rectangle of height 1 and width a. This removes translation and scale differences, and cancels out shape variations due to different aspect-ratios (e.g. tall Starbucks mugs versus coffee cups). This facilitates the learning task, because PAS on the class boundaries are now better aligned.

2. Vote for parts. Let V_i be a voting space associated with PAS type t_i. There are |C| such voting spaces, all initially empty. Each voting space has three dimensions: two for location (x, y) and one for size s. Every PAS P = (x, y, s, e, d) from every training window casts votes as follows (see the sketch after these steps):

1. P is soft-assigned to all types T within a dissimilarity threshold γ: T = {t_j | D(d, t_j) < γ}, where d is the shape descriptor of P (see equation (1)).
2. For each assigned type t_j ∈ T, a vote is cast in V_j at (x, y, s), i.e. at the location and size of P. The vote is weighted by e · (1 − D(d, t_j)/γ), where e is the edge strength of P.
to the voting space where M was found). The value v of the local maximum measures the confidence that the part belongs to the class boundaries. The (x, y, s) coordinates are relative to the canonical window.

4.1.2 Discussion

The success of this procedure is due in part to adopting PAS as basic shape elements. A simpler alternative would be to use individual edgels. In that case, there would be just one voting space, with two location dimensions and one orientation dimension. In contrast, PAS bring two additional degrees of separation: the shape of the PAS, expressed as the assignments to codebook types, and its size (relative to the window). Individual edgels have no size, and the shape of a PAS is more distinctive than the orientation of an edgel. As a consequence, it is very unlikely that a significant number of clutter PAS will accidentally have similar locations, sizes, and shapes at the same time. Hence, recurring PAS stemming from the desired class boundaries tend to form peaks in the voting spaces, whereas clutter PAS don't.

Intra-class shape variability is addressed partly by the soft-assignment of PAS to types, and partly by applying a substantial spatial smoothing to the voting spaces before detecting local maxima. This creates wide basins of attraction for PAS from different training examples to accumulate evidence for the same part. We can afford this flexibility while keeping a low risk of accumulating clutter because of the high separability discussed above, especially due to the separate voting spaces for different codebook types. This yields the discriminativity necessary to overcome the poor signal-to-noise ratio, while allowing the flexibility necessary to accommodate intra-class shape variations.

The voting procedure is similar in spirit to recent works on finding frequently recurring spatial configurations of local appearance features in unannotated images [19,34], but it is specialized for the case when bounding-box annotation is available.

The proposed algorithm sees all training data at once, and therefore reliably selects parts and robustly estimates their locations/sizes/shapes. In our experiments this was more stable and more robust to clutter than matching pairs of training instances and combining their output a posteriori. As another advantage, the algorithm has complexity linear in the total number of PAS in the training windows, so it can learn from large training sets efficiently.

4.2 Assembling the initial model shape

The collection of parts learned in the previous section captures class boundaries well, and conveys a sense of the shape of the object class (figure 3b). The outer boundary of the mug and the handle hole are included, whereas the label and background clutter are largely excluded. Based on this 'collection of parts' model (COP) one could already attempt to detect objects in a test image, by matching parts based on their descriptor and enforcing their spatial relationship. This could be achieved in a way similar to what earlier approaches do based on appearance features [18,27], as also done recently with contour features by [32,39], and it would localize objects up to a bounding-box.

However, the COP model has no notion of shape at the global scale. It is a loose collection of fragments learnt rather independently, each focusing on its own local scale. In order to support localizing object boundaries accurately and completely on novel test images, a more globally consistent shape is preferable. Ideally, its parts would be connected into a whole shape featuring smooth, continuous lines.

In this subsection we describe a procedure for constructing a first version of such a shape, and in the next subsection we refine it. We start with some intuition behind the method. A model part occurs several times on different images (figure 5a-b). These occurrences offer slightly different alternatives for the part's location, size, and shape. We can assemble variants of the model shape by selecting different occurrences for each part. The key idea for obtaining a globally consistent shape is to select one occurrence for each part so as to form larger aggregates of connected occurrences (figure 3c). We cast the shape assembly task as the search for the assignment of parts to occurrences leading to the best connected shape. In the following, we explain the algorithm in more detail.

4.2.1 Algorithm

The algorithm consists of three steps:

1. Compute occurrences. A PAS P = (x^p, y^p, s^p, e^p, d^p) is an occurrence of model part M = (x^m, y^m, s^m, v^m, d^m) if they have similar location, scale, and shape (figure 5a). The following function measures the confidence that P is an occurrence of M (denoted M → P):

conf(M → P) = e^p · D(d^m, d^p) · min(s^m/s^p, s^p/s^m) · exp(−((x^p − x^m)² + (y^p − y^m)²) / (2σ²))    (2)
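In code, the confidence (2) could look as follows. The value of σ is a placeholder, and shape_sim stands for the shape factor written as D(d^m, d^p) in (2), supplied by the caller; both are our own assumptions for illustration.

```python
import numpy as np

SIGMA = 0.1  # spatial tolerance in canonical window units (placeholder)

def occurrence_confidence(part, pas, shape_sim):
    """Equation (2): part = (x_m, y_m, s_m), pas = (x_p, y_p, s_p, e_p);
    shape_sim is the shape factor of (2), computed by the caller."""
    (xm, ym, sm), (xp, yp, sp, ep) = part, pas
    scale_factor = min(sm / sp, sp / sm)
    loc_factor = np.exp(-((xp - xm) ** 2 + (yp - ym) ** 2) / (2 * SIGMA ** 2))
    return ep * shape_sim * scale_factor * loc_factor

# illustrative call: a strong PAS, close in location and scale to the part
print(occurrence_confidence((0.0, 0.0, 0.5), (0.02, 0.01, 0.48, 0.9), 0.8))
```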
Fig. 5 Occurrences and connectedness. (a) A model part (above) and two of its occurrences (below). (b) All occurrences of all model parts on a few training images, colored by the distance to the peak in the voting space (decreasing from blue to cyan to green to yellow to red). (c) Two model parts with high connectedness (above) and two of their occurrences, which share a common segment (below).
It takes into account P's edge strength (first factor) and how close it is to M in terms of shape, scale, and location (second to last factors). The confidence ranges in [0, 1], and P is deemed an occurrence of M if conf(M → P) > δ, with δ a threshold. By analogy, M_i → P_i denotes the occurrence of model segment M_i on image segment P_i (with i ∈ {1, 2}).

2. Compute connectedness. As a PAS P is formed by two segments P_1, P_2, two occurrences P, Q of different model parts M, N might share a segment (figure 5c). This suggests that M, N explain connected portions of the class boundaries and should be connected in the model. As model parts occur in several images, we estimate how likely it is for two parts to be connected in the model, by how frequently their occurrences share segments.

Let the equivalence of segments M_i, N_j be

eq(M_i, N_j) = Σ_{{P,Q | s∈P, s∈Q, M_i→s, N_j→s}} (conf(M → P) + conf(N → Q))    (3)

The summation runs over all pairs of PAS P, Q sharing a segment s, where s is an occurrence of both M_i and N_j (figure 5c). Let the connectedness of M, N be the combined equivalence of their segments¹:

conn(M, N) = max(eq(M_1, N_1) + eq(M_2, N_2), eq(M_1, N_2) + eq(M_2, N_1))    (4)

Two parts have high connectedness if their occurrences frequently share a segment. Two parts sharing both segments have even higher connectedness, suggesting they explain the same portion of the class boundaries.

3. Assign parts to occurrences. Let A(M) = P be a function assigning a PAS P to each model part M. Find the mapping A that maximizes

Σ_M conf(M → A(M)) + α Σ_{M,N} conn(M, N) · 1(A(M), A(N)) − βK    (5)

where 1(a, b) = 1 if occurrences a, b come from the same image, and 0 otherwise; K is the number of images contributing occurrences to A; α, β are predefined weights. The first term prefers high-confidence occurrences. The second favors assigning connected parts to connected occurrences, because occurrences of parts with high connectedness are likely to be connected when they come from the same image (by construction of function (4)). The last term encourages selecting occurrences from a few images, as occurrences from the same image fit together naturally. Overall, function (5) encourages the formation of aggregates of good confidence and properly connected occurrences.

Optimizing (5) exactly is expensive, as the space of all assignments is huge. In practice, the following approximation algorithm brings satisfactory results (see the sketch below). We start by assigning the part with the single most confident occurrence. Next, we iteratively consider the part most connected to those assigned so far, and assign it to the occurrence maximizing (5). The algorithm iterates until all parts are assigned to an occurrence.

Figure 3c shows the selected occurrences for our running example. These form a rather well connected shape, where most segments fit together and form continuous lines. The remaining discontinuities are smoothed out by the refinement procedure in the next subsection.

¹ For the best of the two possible segment matchings.
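A sketch of this greedy approximation, under assumed data structures (conf and conn as precomputed dictionaries, occurrences as (pas_id, image_id) pairs); the values of α, β are placeholders.

```python
def greedy_assignment(parts, occs, conf, conn, alpha=1.0, beta=0.1):
    """parts: list of part ids; occs[m]: candidate occurrences of part m,
    each a (pas_id, image_id) tuple; conf[m][o], conn[m][n]: precomputed."""
    # seed with the single most confident (part, occurrence) pair
    m0, o0 = max(((m, o) for m in parts for o in occs[m]),
                 key=lambda mo: conf[mo[0]][mo[1]])
    assign = {m0: o0}
    while len(assign) < len(parts):
        # the unassigned part most connected to the parts assigned so far
        m = max((p for p in parts if p not in assign),
                key=lambda p: sum(conn[p][n] for n in assign))
        def gain(o):
            # marginal contribution to objective (5) of assigning m -> o
            link = sum(conn[m][n] for n in assign if assign[n][1] == o[1])
            images = {a[1] for a in assign.values()} | {o[1]}
            return conf[m][o] + alpha * link - beta * len(images)
        assign[m] = max(occs[m], key=gain)
    return assign
```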
4.3 Model shape refinement

In this subsection we refine the initial model shape. The key idea is to match it back onto the training image windows W, by applying a deformable matching algorithm [6] (figure 6b). This results in a backmatched shape for each window (figure 6c-left). An improved model shape is obtained by averaging them (figure 6c-right). The process is then iterated by alternating backmatching and averaging (figure 6d). Below we give the details of the algorithm.

4.3.1 Algorithm

The algorithm follows three steps:

1. Sampling. Sample 100 equally spaced points from the initial model shape, giving the point set S (figure 6a).

2. Backmatching. Match S back to each training window w ∈ W by doing:

2.1 Alignment. Translate, scale, and stretch S so that its bounding-box aligns with w (figure 6b-left). This provides the initialization for the shape matcher.

2.2 Shape matching. Let E be the point set consisting of the edgels inside w. Put S and E in point-to-point correspondence using the non-rigid robust point matcher TPS-RPM [6] (Thin-Plate Spline Robust Point Matcher). This estimates a TPS transformation from S to E, while at the same time rejecting edgels not corresponding to any point of S. This is important, as only some edgels lie on the object boundaries. Subsection 5.2 presents TPS-RPM in detail, where it is used again for localizing object boundaries in test images.

3. Averaging. (1) Align the backmatched shapes B = {B_i}_{i=1..|W|} using Cootes' variant of Procrustes analysis [9], by translating, scaling, and rotating each shape so that the total sum of distances to the mean shape B̄ is minimized: Σ_{B_i ∈ B} |B_i − B̄|² (see appendix A of [9]). (2) Update S by setting it to the mean shape: S ← B̄ (figure 6c-right).

The algorithm now iterates to Step 2, using the updated model shape S. In our experiments, Steps 2 and 3 are repeated two to three times.

4.3.2 Discussion

Step 3 is possible because the backmatched shapes B are in point-to-point correspondence, as they are different TPS transformations of the same S (figure 6c-left). This enables defining B̄ as the coordinates of corresponding points averaged over all B_i ∈ B. It also enables analyzing the variations in the point locations. The differences remaining after alignment are due to non-rigid shape variations, which we will learn in the next subsection.

The alternation of backmatching and averaging results in a succession of better models and better matches to the data, as the point correspondences cover more and
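Schematically, the refinement loop alternates these steps as follows. In this sketch, tps_rpm and procrustes_align stand in for the TPS-RPM matcher of [6] and Cootes' alignment [9], and the window attributes (edgels, origin, size) are our own illustrative interface.

```python
import numpy as np

def refine_shape(S, windows, tps_rpm, procrustes_align, n_iters=3):
    """S: (100, 2) array of points sampled from the initial model shape."""
    for _ in range(n_iters):
        backmatched = []
        for w in windows:
            S0 = align_to_window(S, w)      # step 2.1: initialize the matcher
            B = tps_rpm(S0, w.edgels)       # step 2.2: (100, 2) backmatched shape
            backmatched.append(B)
        aligned = procrustes_align(backmatched)  # step 3(1): remove similarity
        S = np.mean(aligned, axis=0)             # step 3(2): S <- mean shape
    return S

def align_to_window(S, w):
    """Translate/scale/stretch S so its bounding-box matches window w."""
    lo, hi = S.min(axis=0), S.max(axis=0)
    return (S - lo) / (hi - lo) * w.size + w.origin
```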
5 Object detection

Table 1 Number of training images and of positive/negative test images for all datasets.

          apple  bottle  giraffe  mug  swan  horse
train       20     24      44     24    16     50
test pos    20     24      43     24    16    120
test neg   215    207     167    207   223    170

5.3 Constrained shape matching

TPS-RPM treats all shapes according to the same generic TPS deformation model, simply preferring smoother transformations (in particular, low 2D curvature w, and low affine skew d). Two shapes with the same deformation energy are considered equivalent. This might result in output shapes unlike any of the training examples.

In this section, we extend TPS-RPM with the class-specific deformation model learned in subsection 4.4. We constrain the optimization to explore only the valid region of the shape space, containing shapes plausible for the class (defined by S, E_{1:n}, λ_i from subsection 4.4).

At each iteration of TPS-RPM we project the current shape estimate Y (equation (6)) inside the valid region, just before fitting the TPS. This amounts to the following steps (sketched in code at the end of this section):

1) align Y on S w.r.t. translation/rotation/scale
2) project Y on the subspace spanned by E_{1:n}: b = E⁻¹ · (Y − S), b_{(n+1):2p} = 0
3) bound the first n components of b by ±3√λ_i
4) transform b back into the original space: Y^c = S + E · b
5) apply to Y^c the inverse of the transformation used in 1)

The assignment Y ← Y^c imposes hard constraints on the shape space. While this guarantees output shapes similar to class members, it might sometimes be too restrictive. To match a novel instance accurately, it could be necessary to move a little along some dimensions of the shape space not recorded in the deformation model. The training data cannot be assumed to present all possible intra-class variations.

To tackle this issue, we propose a soft-constrained variant, where Y is attracted by the valid region, with a force that diminishes with temperature: Y ← Y + (T/T_init)(Y^c − Y). This causes TPS-RPM to start fully constrained, and then, as the temperature decreases and M looks for correspondences closer to the current estimates, later iterations are allowed to apply small deformations beyond the valid region (typically along dimensions not in E_{1:n}). As a result, output shapes fit the image data more accurately, while still resembling class members. Notice how this behavior is fully in the spirit of TPS-RPM, which also lets the TPS become more and more free as T decreases.

The proposed extension to TPS-RPM has a deep impact, in that it alters the search through the transformation and correspondence spaces. Besides improving accuracy, it can help TPS-RPM avoid local minima far from the correct solution, thus avoiding gross failures.

Figure 8e shows the improvement brought by the proposed constrained shape matching, compared to TPS-RPM with just the generic TPS model (figure 8d). On the running example, the two versions of TPS-RPM diverge after the eighth iteration, as shown in figure 9.

5.4 Detections

Every local maximum in Hough space constitutes an initialization for the shape matching, and results in different shapes (detections) localized in the test image. In this section we score the detections, making it possible to reject detections and to evaluate the detection rate and false-positive rate of our system.

We score each detection by a weighted sum of four terms:

1) the number of matched model points, i.e. those for which a corresponding image point has been found with good confidence. Following [6], these are all points v_a with max_{i=1..N}(m_{ai}) > 1/N.

2) the sum of squared distances from the TPS-mapped model points to their corresponding image points. This measure is made scale-invariant by normalizing by the squared range r² of the image point coordinates (width or height, whichever is larger). Only matched model points are considered.

3) the deviation Σ_{i,j∈[1,2]} (I(i,j) − d(i,j)/√|d|)² of the affine component d of the TPS from the identity I. The normalization by the determinant of d factors out deviations due to scale changes.

4) the amount of non-rigid warp w of the TPS: trace(wᵀΦw)/r², where Φ(a, b) ∝ ||v_a − v_b||² log ||v_a − v_b|| is the TPS kernel matrix [6].

This score integrates the information a matched shape provides. It is high when the TPS fits many (term 1) points well (term 2), without having to distort much (terms 3 and 4). In our current implementation, the relative weights between these terms have been selected manually; they are the same for all classes, and remain fixed in all experiments.
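As a concrete reading of this score, a sketch follows. The relative weights and the sign convention (terms 2-4 entering as penalties) are our interpretation of "fits many points well without distorting much"; m is the N×K TPS-RPM match matrix, sq_dists the per-point squared distances, d the 2×2 affine component, w the warp coefficients, Phi the TPS kernel matrix, and r the coordinate range.

```python
import numpy as np

def detection_score(m, sq_dists, d, w, Phi, r, w1=1.0, w2=1.0, w3=1.0, w4=1.0):
    N = m.shape[0]
    matched = m.max(axis=1) > 1.0 / N            # term 1: matched model points
    term1 = matched.sum()
    term2 = sq_dists[matched].sum() / r ** 2     # term 2: normalized fit error
    A = d / np.sqrt(abs(np.linalg.det(d)))       # factor out scale changes
    term3 = ((np.eye(2) - A) ** 2).sum()         # term 3: affine deviation
    term4 = np.trace(w.T @ Phi @ w) / r ** 2     # term 4: non-rigid warp amount
    return w1 * term1 - w2 * term2 - w3 * term3 - w4 * term4
```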
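And the projection of section 5.3, as promised above (steps 2-4; the similarity alignment of steps 1 and 5 is omitted), assuming the mode matrix E is orthonormal so that E⁻¹ = Eᵀ, with S and Y flattened to length-2p vectors.

```python
import numpy as np

def project_to_valid_region(Y, S, E, lambdas, n):
    """Steps 2-4 of the hard-constraint projection: Y, S are length-2p
    vectors; E is 2p x 2p with the n learned modes in its first columns."""
    b = E.T @ (Y - S)                      # step 2: b = E^-1 (Y - S)
    b[n:] = 0.0                            #         zero out non-modeled modes
    bound = 3.0 * np.sqrt(lambdas[:n])
    b[:n] = np.clip(b[:n], -bound, bound)  # step 3: bound by +/- 3 sqrt(lambda_i)
    return S + E @ b                       # step 4: back to point coordinates

def soft_constrain(Y, Y_c, T, T_init):
    """Soft variant: attract Y toward its projection Y_c with force T/T_init."""
    return Y + (T / T_init) * (Y_c - Y)
```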
As a final refinement, if two detections overlap substantially, we remove the lower scored one. Notice that the method can detect multiple instances of the same class in an image. Since they appear as different peaks in the Hough voting space, they result in separate detections.

6 Experiments

We present an extensive evaluation involving six diverse object classes from two existing datasets [14,25]. After introducing the datasets in the next subsection, we evaluate our approach for learning shape models in subsection 6.2. The ability to localize objects in novel images, both in terms of bounding-boxes and boundaries, is measured in subsection 6.3. All experiments are run with the same parameters (no class-specific nor dataset-specific tuning is applied).

6.1 Datasets and protocol

ETHZ shape classes [14]. This dataset features five diverse classes (bottles, swans, mugs, giraffes, apple-logos) and contains a total of 255 images collected from the web. It is highly challenging, as the objects appear in a wide range of scales, there is considerable intra-class shape variation, and many images are severely cluttered, with objects comprising only a fraction of the total image area (figures 13 and 18).

For each class, we learn 5 models, each from a different random sample containing half of the available images (there are 40 for apple-logos, 48 for bottles, 87 for giraffes, 48 for mugs and 32 for swans). Learning models from different training sets allows evaluating the stability of the proposed learning technique (subsection 6.2). Notice that our method does not require negative training images, i.e. images not containing any instance of the class.

The test set for a model consists of all other images in the dataset. Since this includes about 200 negative images, it allows properly estimating false-positive rates. Table 1 gives an overview of the composition of all training and testing sets. We refer to learning and testing on a particular split of the images as a trial.

INRIA horses [25]. This challenging dataset consists of 170 images with one or more horses viewed from the side and 170 images without horses. Horses appear at several scales, and against cluttered backgrounds. We train 5 models, each from a different random subset of 50 horse images. For each model, the remaining 120 horse images and all 170 negative images are used for testing, see table 1.

6.2 Learning shape models

Evaluation measures. We assess the performance of the learning procedure of section 4 in terms of how accurately it recovers the true class boundaries of the training instances. For this evaluation, we have manually annotated the boundaries of all object instances in the ETHZ shape classes dataset. We will present results for all of these five classes.

Let B_gt be the ground-truth boundaries, and B_model the backmatched shapes output by the model shape refinement algorithm of subsection 4.3. The accuracy of learning is quantified by two measures (sketched in code below). Coverage is the percentage of points from B_gt closer than a threshold t to any point of B_model. We set t to 4% of the diagonal of the bounding-box of B_gt. Conversely, precision is the percentage of B_model points closer than t to any point of B_gt. The measures are complementary. Coverage captures how much of the object boundary has been recovered by the algorithm, whereas precision reports how much of the algorithm's output lies on the object boundaries.

Models from the full algorithm. Table 2 shows coverage and precision averaged over training instances and trials, for the complete learning procedure described in section 4. With the exception of giraffes, the proposed method achieves very high coverage (above 90%), demonstrating its ability to discover which contour points belong to the class boundaries. The precision of apple-logos and bottles is also excellent, thanks to the clean prototype shapes learned by our approach (figure 10). Interestingly, the precision of mugs is somewhat lower, because the learned shapes include a detail not present in the ground-truth annotations, although it is arguably part of the class boundaries: the inner half of the opening on top of the mug. A similar phenomenon penalizes the precision of swans, where our method sometimes includes a few water waves in the model. Although they are not part of the swan boundaries, waves accidentally occurring at a similar position over many training images are picked up by the algorithm. A larger training set might lead to the suppression of such artifacts, as waves have fewer chances of accumulating accidentally (we only used 16 images). The modeling performance for giraffes is lower, due to the extremely cluttered edgemaps arising from their natural environment, and to the camouflage texture which tends to break edges along the body outlines (figure 11).

Models without assembling the initial shape. We experiment with a simpler scheme for learning shape models by skipping the procedure for assembling the
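The two measures above are easy to state in code; here is a minimal sketch assuming boundaries are given as (K, 2) arrays of points.

```python
import numpy as np

def _frac_within(A, B, t):
    """Fraction of points of A closer than t to some point of B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2).min(axis=1)
    return float((d < t).mean())

def coverage_precision(B_gt, B_model):
    diag = np.linalg.norm(B_gt.max(axis=0) - B_gt.min(axis=0))
    t = 0.04 * diag                      # 4% of the ground-truth bbox diagonal
    coverage = _frac_within(B_gt, B_model, t)
    precision = _frac_within(B_model, B_gt, t)
    return coverage, precision
```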
Table 2 Accuracy of learned models. Each entry is the average coverage/precision over trials and training instances.
[Figure 12 plots: detection rate versus false-positives per image, one panel per class (including Mugs, Swans, and INRIA Horses).]
Fig. 12 Object detection performance (models learnt from real images). Each plot shows five curves: the full system evaluated under
the PASCAL criterion for a correct detection (dashed, thick, red), the full system under the 20%-IoU criterion (solid, thick, red), the
Hough voting stage alone under PASCAL (dashed, thin, blue), [16] under 20%-IoU (solid, thin, green) and under PASCAL (dashed,
thin, green). The curve for the full system under PASCAL in the apple-logo plot is identical to the curve for 20%-IoU.
matching task, rather than simply matching local features, which is one of the principal points of this paper. Moreover, the shape matching stage also makes it possible to localize complete object boundaries, rather than just bounding-boxes (figure 13).

The difference between the curves under the PASCAL criterion and the 20%-IoU criterion of [14,16] is small for apple-logos, bottles, mugs and swans (0%, −1.6%, −3.6%, −4.9%), indicating that most detections have accurate bounding-boxes. For horses and giraffes the decrease is more significant (−18.1%, −14.1%), because the legs of the animals are harder to detect and cause the bounding-box to shift along the body. On average over all classes, our method achieves 78.1% detection-rate at 0.4 FPPI under 20%-IoU and 71.1% under PASCAL. The corresponding standard-deviation over trials, averaged over classes, is 8.1% under 20%-IoU and 8.0% under PASCAL (this variation is due to different trials having different training and test sets).

For reference, the plots also show the performance of [16] on the same datasets, using the same number of training and test images. An exact comparison is not possible, as [16] reports results based on only one training/testing split, whereas we average over 5 random splits. Under the rather permissive 20%-IoU criterion, [16] performs a little better than our method on average over all classes. Under the strict PASCAL criterion instead, our method performs substantially better than [16] on two classes (apple-logos, swans), moderately worse on two (bottles, horses), and about the same on two (mugs, giraffes), thanks to the higher accuracy of the detected bounding-boxes. Averaged over all classes, under PASCAL our method reaches 71.1% detection-rate at 0.4 FPPI, comparing well against the 68.5% of [16]. Note how our results are achieved without the beneficial discriminative learning of [16], where an SVM learns which PAS types at which relative location within the training bounding-box best discriminate between instances of the class and background image windows. Our method instead trains only from positive examples.

For clarity and as a reference for comparison by future works, we summarize here our results on the ETHZ Shape Classes alone (without INRIA horses). Under PASCAL, averaged over all 5 trials and 5 classes, our method achieves 72.0%/67.2% detection-rate at 0.4/0.3 FPPI respectively. Under 20%-IoU, it achieves 76.8%/71.5% detection-rate at 0.4/0.3 FPPI.

After our results were first published [15], Fritz and Schiele [20] presented an approach based on topic models and a dense gradient histogram representation of image windows (no explicit shapes). They report results on the ETHZ Shape Classes dataset (i.e. no horses), using the same protocol (5 random trials). Their method achieves 84.8% averaged over classes, improving over our 76.8% (both at 0.4 FPPI and under 20%-IoU).
Fig. 13 Example detections (models learnt from images). Notice the large scale variations (especially in apple-logos, swans), the intra-category shape variability (especially in swans, giraffes), and the extensive clutter (especially in giraffes, mugs). The method works for photographs as well as paintings (first swan, last bottle). Two bottle cases also show false-positives. In the first two horse images, the horizontal line below the horses' legs is part of the model and represents the ground. Interestingly, the ground line systematically reoccurs over the training images for that model and gets learned along with the horse.
Table 3 Accuracy of localized object boundaries at test time. Each entry is the average coverage/precision over trials and
correct detections at 0.4 FPPI.
Fig. 16 Example failed detections (models learnt from images). (a) A typical case: a good match to the image edges is found, but at the wrong scale. Our system has no bias for any particular scale. (b) Another typical case: failure is due to an extremely cluttered edge-map. The neck is correctly matched, and gives rise to a peak in the Hough voting space (section 5.1). However, the subsequent deformable matching stage (section 5.2) is attracted by the high contrast waves in the background. (c) An infrequent case: failure is due to a poor shape model (right; this is the worst of the 30 models we have learned).
test set. Therefore, the test set for each class contains mostly images not containing any instance of the class, which supports the proper estimation of FPPI. Our method performs better than [14] on all 5 classes, especially in the low FPPI range, and substantially outperforms the oriented chamfer matching baseline (details in [14]). Averaged over classes, our method achieves 85.3%/82.4% detection-rate at 0.4/0.3 FPPI respectively, compared to 81.5%/70.5% of [14] (all results under 20%-IoU). As one reason for this improvement, our method is more robust because it does not need the test image to contain long chains of contour segments around the object.

After our results were first published [15], two works reported even better performance. Ravishankar et al. [36] achieve 95.2% at 0.4 FPPI. Zhu et al. [45] report 0.21 FPPI at our 85.3% detection-rate (note this is the opposite of the usual way of reporting, i.e. detection-rate at a reference FPPI [14-16,20,36]). All results are under 20%-IoU and averaged over classes. As part of the reason for the high performance, Ravishankar et al. [36] propose a sophisticated scoring method which allows reliably rejecting false-positives, while the method of Zhu et al. [45] relies on their algorithm [46] to find long salient contours, effectively removing many clutter edgels before the object detector runs. An interesting avenue for further research is incorporating these successful elements in our framework.

Beside this quantitative evaluation, the main advantage of our approach over [14,36,45] is that it can also train from real images (which is the main topic of this paper). Moreover, compared to [14], it supports branching and self-intersecting input shapes.

Interestingly, in our system hand-drawings lead to moderately better detection results than when learning models from images. This is less surprising when considering that hand-drawings are essentially the prototype shapes the system tries to learn.
[Figure 17 plots: detection rate versus false-positives per image for each class, including Mugs and Swans.]
Fig. 17 Object detection performance (hand-drawn models). To facilitate comparison, all curves have been computed using
the 20%-IoU criterion of [14].
Fig. 18 Detection from hand-drawn models. Top: four of the five models from [14]. There is just one example per object class.
Bottom: example detections delivered by our shape matching procedure, using these hand-drawings as models.
7 Conclusions and future work

We have proposed an approach for learning class-specific explicit shape models from images annotated by bounding-boxes, and localizing the boundaries of novel class instances in the presence of extensive clutter, scale changes, and intra-class variability. In addition, the approach also operates effectively when given hand-drawings as models. The ability to input both images and hand-drawings as training data is a consequence of the basic design of our approach, which attempts to bridge the gap between shape matching and object class detection.

The presented approach can be extended in several ways. First, the training stage models only positive examples. This could be extended by learning a classifier to distinguish between positive and negative examples, which might reduce false positives. One possibility could be to train both our shape models and the discriminative models of [16]. At detection time, we could then use the bounding-box delivered by [16] to initialize shape matching based on our models. Moreover, the discriminative power of the representation could be improved by using appearance features in addition to image contours. Finally, in this paper we have assumed that all observed differences in the shape of the training examples originate from intra-class variation, and not from viewpoint changes. It would be interesting to add
a stage to automatically group objects by viewpoint, and learn separate shape models.

References

1. R. Basri, L. Costa, D. Geiger, and D. Jacobs, Determining the Similarity of Deformable Shapes, Vision Research, vol. 38, pp. 2365-2385, 1998.
2. A. Berg, T. Berg, and J. Malik, Shape Matching and Object Recognition using Low Distortion Correspondence, CVPR, 2005.
3. S. Belongie and J. Malik, Shape Matching and Object Recognition using Shape Contexts, PAMI, 24(4):509-522, 2002.
4. E. Borenstein and S. Ullman, Class-Specific, Top-Down Segmentation, ECCV, 2002.
5. I. Biederman, Recognition-by-components: A theory of human image understanding, Psychological Review, 94(2):115-147, 1987.
6. H. Chui and A. Rangarajan, A new point matching algorithm for non-rigid registration, CVIU, 89(2-3):114-141, 2003.
7. O. Chum and A. Zisserman, An Exemplar Model for Learning Object Classes, CVPR, 2007.
8. T. Cootes, C. Taylor, D. Cooper, and J. Graham, Active Shape Models: Their Training and Application, CVIU, 61(1):38-59, 1995.
9. T. Cootes, An Introduction to Active Shape Models, 2000.
10. D. Cremers, T. Kohlberger, and C. Schnorr, Nonlinear Shape Statistics in Mumford-Shah Based Segmentation, ECCV, 2002.
11. N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, CVPR, 2005.
12. G. Elidan, G. Heitz, and D. Koller, Learning Object Shape: From Drawings to Images, CVPR, 2006.
13. V. Ferrari, T. Tuytelaars, and L. Van Gool, Simultaneous Object Recognition and Segmentation by Image Exploration, ECCV, 2004.
14. V. Ferrari, T. Tuytelaars, and L. Van Gool, Object Detection with Contour Segment Networks, ECCV, 2006.
15. V. Ferrari, F. Jurie, and C. Schmid, Accurate Object Detection with Deformable Shape Models Learnt from Images, CVPR, 2007.
16. V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid, Groups of Adjacent Contour Segments for Object Detection, PAMI, 30(1):36-51, 2008.
17. P. Felzenszwalb, Representation and Detection of Deformable Shapes, PAMI, 27(2):208-220, 2005.
18. R. Fergus, P. Perona, and A. Zisserman, Object Class Recognition by Unsupervised Scale-invariant Learning, CVPR, 2003.
19. M. Fritz and B. Schiele, Towards Unsupervised Discovery of Visual Categories, DAGM, 2006.
20. M. Fritz and B. Schiele, Decomposition, discovery and detection of visual categories using topic models, CVPR, 2008.
21. A. Hill and C. Taylor, A Method of Non-Rigid Correspondence for Automatic Landmark Identification, BMVC, 1996.
22. D. Gavrila, Multi-Feature Hierarchical Template Matching Using Distance Transforms, ICPR, 1998.
23. Y. Gdalyahu and D. Weinshall, Flexible Syntactic Matching of Curves and Its Application to Automatic Hierarchical Classification of Silhouettes, PAMI, 21(12):1312-1328, 1999.
24. S. Gold and A. Rangarajan, Graduated Assignment Algorithm for Graph Matching, PAMI, 18(4):377-388, 1996.
25. F. Jurie and C. Schmid, Scale-invariant Shape Features for Recognition of Object Categories, CVPR, 2004.
26. Y. Lamdan, J. Schwartz, and H. Wolfson, Affine Invariant Model-based Object Recognition, IEEE Transactions on Robotics and Automation, 6(5):578-589, 1990.
27. B. Leibe and B. Schiele, Scale-Invariant Object Categorization using a Scale-Adaptive Mean-Shift Search, DAGM, 2004.
28. M. Leordeanu, M. Hebert, and R. Sukthankar, Beyond Local Appearance: Category Recognition from Pairwise Interactions of Simple Features, CVPR, 2007.
29. D. Martin, C. Fowlkes, and J. Malik, Learning to detect natural image boundaries using local brightness, color, and texture cues, PAMI, 26(5):530-549, 2004.
30. D. Marr and H.K. Nishihara, Representation and Recognition of the Spatial Organization of Three-Dimensional Shapes, Proc. Royal Soc. London, Series B, Biological Sciences, (200):269-294, 1978.
31. F. Mokhtarian and A. Mackworth, Scale-based Description and Recognition of Planar Curves and Two-dimensional Shapes, PAMI, 8(1):34-43, 1986.
32. A. Opelt, A. Pinz, and A. Zisserman, A Boundary-Fragment-Model for Object Detection, ECCV, 2006.
33. A. Pentland and S. Sclaroff, Closed-form solutions for physically based shape modeling and recognition, PAMI, 13(7):715-729, 1991.
34. T. Quack, V. Ferrari, B. Leibe, and L. Van Gool, Efficient Mining of Frequent and Distinctive Feature Configurations, ICCV, 2007.
35. D. Ramanan, Learning to parse images of articulated bodies, NIPS, 2006.
36. S. Ravishankar, A. Jain, and A. Mittal, Multi-stage Contour Based Detection of Deformable Objects, ECCV, 2008.
37. T. Sebastian, P. Klein, and B. Kimia, Recognition of Shapes by Editing Their Shock Graphs, PAMI, 26(5):550-571, 2004.
38. J. Schwartz and P. Felzenszwalb, Hierarchical Matching of Deformable Shapes, CVPR, 2007.
39. J. Shotton, A. Blake, and R. Cipolla, Contour-Based Learning for Object Detection, ICCV, 2005.
40. A. Torralba, K. Murphy, and W. Freeman, Sharing Features: Efficient Boosting Procedures for Multiclass Object Detection, CVPR, 2004.
41. D. Sharvit, J. Chan, H. Tek, and B. Kimia, Symmetry-Based Indexing of Image Databases, IEEE Workshop on Content-Based Access of Image and Video Libraries, 1998.
42. S. Ullman, Aligning pictorial descriptions: An approach to object recognition, Cognition, 32(3):193-254, 1989.
43. J. Winn and N. Jojic, LOCUS: Learning Object Classes with Unsupervised Segmentation, ICCV, 2005.
44. J. Winn and J. Shotton, The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects, CVPR, 2006.
45. Q. Zhu, L. Wang, Y. Wu, and J. Shi, Contour Context Selection for Object Detection: A Set-to-Set Contour Matching Approach, ECCV, 2008.
46. Q. Zhu, G. Song, and J. Shi, Untangling cycles for contour grouping, ICCV, 2007.