
Multi-instance Object Segmentation with Occlusion Handling

Yi-Ting Chen^1   Xiaokai Liu^{1,2}   Ming-Hsuan Yang^1


^1 University of California at Merced   ^2 Dalian University of Technology

Abstract

We present a multi-instance object segmentation algorithm to tackle occlusions. As an object is split into two parts by an occluder, it is nearly impossible to group the two separate regions into an instance by purely bottom-up schemes. To address this problem, we propose to incorporate top-down category-specific reasoning and shape prediction through exemplars into an intuitive energy minimization framework. We perform extensive evaluations of our method on the challenging PASCAL VOC 2012 segmentation set. The proposed algorithm achieves favorable results on the joint detection and segmentation task against the state-of-the-art method both quantitatively and qualitatively.
Figure 1: Segmentation quality comparison. Given an image (a), our method (d) can handle occlusions caused by the leg of the person, while MCG [5] (c) includes the leg of the person as part of the motorbike. Note that our result has an IoU of 0.85 and the MCG result has an IoU of 0.61. Moreover, the segment in (c) is classified as a bicycle by class-specific classifiers, whereas our segment can be classified correctly as a motorbike.

1. Introduction

Object detection and semantic segmentation are core tasks in computer vision. Object detection aims to localize and recognize every instance marked by a bounding box. However, bounding boxes provide only coarse positions of detected objects. Semantic segmentation, on the other hand, assigns a category label to each pixel in an image, which provides more accurate object locations, but it does not provide instance information (e.g., the number of instances). Intuitively, it is beneficial to tackle object detection and semantic segmentation jointly. However, doing so is challenging due to occlusions, shape deformations, texture and color variations within one object, and boundaries obscured by other image parts in real-world scenes.

Occlusion is the main challenge in producing accurate segmentation results. A typical semantic segmentation pipeline [3, 6, 10] starts by generating segmentation hypotheses with a category-independent bottom-up segmentation algorithm [5, 7, 4], followed by class-specific classifiers. In many cases, bottom-up segmentation algorithms cannot correctly handle occlusions where an object is split into two separate regions, since they lack top-down information. Figure 1(a) shows such an example: a motorbike is occluded by the leg of a person and is split into two parts. Here, the best hypothesis (with respect to the highest intersection-over-union (IoU) score) generated by the top-performing segmentation algorithm (Multiscale Combinatorial Grouping, MCG [5]) fails to parse the motorbike correctly, as shown in Figure 1(c).

In this work, we address this issue by developing an algorithm suited to handling occlusions. We incorporate both top-down and bottom-up information to achieve accurate segmentations under occlusions. We start by finding the occluding regions (i.e., the overlap between two instances). In the case of Figure 1, finding the overlap between the person and the motorbike gives the occluding region, i.e., the leg of the person. To find these regions, we need to parse and categorize the two overlapping instances. Recently, a large-scale convolutional neural network (CNN) has been applied to obtain highly discriminative features for training class-specific classifiers [16] (i.e., R-CNN). These classifiers can categorize the object in a bounding box with high accuracy on the challenging PASCAL VOC dataset [11].
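The IoU values quoted for Figure 1 are standard intersection-over-union scores between a predicted mask and the ground truth; a minimal sketch (masks represented as sets of pixel coordinates purely for illustration):

```python
def mask_iou(a, b):
    """IoU of two binary masks given as collections of pixel coordinates."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Toy example: a 6-pixel prediction against an 8-pixel ground truth.
pred = [(0, x) for x in range(6)]        # pixels x = 0..5
gt   = [(0, x) for x in range(2, 10)]    # pixels x = 2..9, overlap = 4
print(mask_iou(pred, gt))                # 4 / 10 = 0.4
```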
Based on R-CNN, Hariharan et al. [19] propose a simultaneous detection and segmentation (SDS) algorithm. Unlike R-CNN, SDS feeds both bounding boxes and segmentation foreground masks to a modified CNN architecture to extract CNN features. Afterward, the features are used to train class-specific classifiers. This framework shows a significant improvement in the segmentation classification task, and its classification capability provides powerful top-down category-specific reasoning to tackle occlusions.

We use the categorized segmentation hypotheses obtained by SDS to infer occluding regions by checking whether two of the top-scoring categorized segmentation proposals overlap. If they do, we record the overlap in the occluding region set. The classification capability is also used to generate class-specific likelihood maps and to find the corresponding category-specific exemplar sets for better shape predictions. Then, the inferred occluded regions, shape predictions and class-specific likelihood maps are formulated into an energy minimization framework [30, 27, 1] to obtain the desired segmentation candidates (e.g., Figure 1(d)). Finally, we score all the segmentation candidates using the class-specific classifiers.

We demonstrate the effectiveness of the proposed algorithm by comparing with SDS on the challenging PASCAL VOC segmentation dataset [11]. The experimental results show that the proposed algorithm achieves favorable performance both quantitatively and qualitatively; moreover, they suggest that high-quality segmentations improve the detection accuracy significantly. For example, the segment in Figure 1(c) is classified as a bicycle, whereas our segment in Figure 1(d) can be classified correctly as a motorbike.

2. Related Work and Problem Context

In this section, we discuss the most relevant work on object detection, object segmentation, occlusion modeling and shape prediction.

Object Detection. Object detection algorithms aim to localize and recognize every instance marked by a bounding box [12, 34, 16]. Felzenszwalb et al. [12] propose a deformable part model that infers the deformations among parts of the object by latent variables learned through discriminative training. The "Regionlet" algorithm [34] detects an object by learning a cascaded boosting classifier with the most discriminative features extracted from subparts of regions, i.e., the regionlets. Most recently, the R-CNN detector [16] leverages a large-scale CNN [26] to tackle detection and outperforms the state of the art by a large margin on the challenging PASCAL VOC dataset.

Object Segmentation. Recent years have witnessed significant progress in bottom-up object segmentation algorithms [25, 7, 4, 35]. Arbelaez et al. [4] develop a unified approach to contour detection and image segmentation based on the gPb contour detector [4], the oriented watershed transform and the ultrametric contour map [2]. Carreira and Sminchisescu [7] generate segmentation hypotheses by solving a sequence of constrained parametric min-cut problems (CPMC) with various seeds and unary terms. Kim and Grauman [25] introduce the shape sharing concept, a category-independent top-down cue, for object segmentation. Specifically, they transfer a shape prior to an object in the test image from an exemplar database based on a local region matching algorithm [24]. Most recently, the SCALPEL framework [35], which integrates bottom-up cues and top-down priors such as object layout, class and scale into a cascaded bottom-up segmentation scheme to generate object segments, is proposed.

Semantic segmentation [3, 6, 28, 16, 31] assigns a category label to each segment generated by a bottom-up segmentation algorithm. Arbelaez et al. [3] first generate segmentations using the gPb framework [4] and then extract rich feature representations for training class-specific classifiers. Carreira et al. [6] start with the CPMC algorithm to generate hypotheses, propose a second-order pooling (O2P) scheme to encode local features into a global descriptor, and then train linear support vector regressors on top of the pooled features. Girshick et al. [16] extract CNN features from the CPMC segmentation proposals and then apply the same procedure as the O2P framework to tackle semantic segmentation. Most recently, Tao et al. [31] integrate a new categorization cost, based on discriminative sparse dictionary learning, into a conditional random field model for semantic segmentation. A similar work that utilizes the estimated statistics of mutually overlapping mid-level object segmentation proposals to predict an optimal full-image semantic segmentation is proposed in [28]. In contrast, the proposed algorithm incorporates both category-specific classification and shape predictions from mid-level segmentation proposals in an energy minimization framework to tackle occlusions.

Occlusion Modeling. Approaches for handling occlusion have been studied extensively [32, 36, 15, 37, 21, 23, 14]. Tighe et al. [32] handle occlusions in the scene parsing task by inferring the occlusion ordering based on a histogram that gives the probability for class c1 to be occluded by class c2 with an overlap score. Winn and Shotton [36] handle occlusions using a layout-consistent random field, which models the object parts with a hidden random field whose pairwise potentials are asymmetric. Ghiasi et al. [15] model occlusion patterns by learning a pictorial structure with local mixtures from large-scale synthetic data. Gao et al. [14] propose a segmentation-aware model that handles occlusions by introducing binary variables to denote the visibility of the object in each cell of a bounding box. The assignment
(Figure 2 diagram: input → object hypotheses (MCG) → categorized object hypotheses (SDS CNN) → class-specific likelihood maps, occluding regions, and exemplar-based shape predictions → graph cut with occlusion handling → output.)

Figure 2: Overall framework. The framework starts by generating object hypotheses using MCG [5]. Then, the SDS CNN architecture [19] extracts CNN features for each object hypothesis, and the extracted features are fed into class-specific classifiers to obtain the categories of the object hypotheses. Categorized segmentation hypotheses are used to obtain class-specific likelihood maps, and top-scoring segmentation proposals are used to infer occluding regions. Meanwhile, exemplars serve as inputs to the proposed exemplar-based shape predictor to obtain a better shape estimation of an object. Finally, the inferred occluding regions, shape predictions and class-specific likelihood maps are formulated into an energy minimization problem to obtain the desired segmentation.
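The control flow of Figure 2 can be summarized in a few lines; every function below is a stand-in stub for illustration (the real components are MCG, the SDS CNN with class-specific classifiers, and the graph-cut step):

```python
# Control-flow sketch of Figure 2. All helpers are hypothetical stubs.
def generate_proposals(image):           # stands in for MCG [5]
    return [{"mask": m} for m in image["candidate_masks"]]

def classify(proposal):                  # stands in for SDS CNN + classifiers
    return {"category": "motorbike", "score": len(proposal["mask"]) / 10.0}

def segment(image):
    hypotheses = [dict(p, **classify(p)) for p in generate_proposals(image)]
    hypotheses.sort(key=lambda h: h["score"], reverse=True)
    top = hypotheses[:2]
    # Occluding region: the overlap between the two top-scoring instances.
    occluding = set(top[0]["mask"]) & set(top[1]["mask"]) if len(top) > 1 else set()
    return hypotheses, occluding

image = {"candidate_masks": [{(0, 0), (0, 1), (0, 2)}, {(0, 2), (0, 3)}]}
hyps, occ = segment(image)
print(occ)   # {(0, 2)} -- the overlap recorded as an occluding region
```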

of a binary variable represents a particular occlusion pattern, and this assignment is different from [33], which only models occlusions due to image boundaries (e.g., a finite field of view). Hsiao and Hebert [23] take a data-driven approach to reasoning about occlusions by modeling the interaction of objects in 3D space. Yang et al. [37] tackle occlusions by learning a layered model; this layered model infers the relative depth ordering of objects using the outputs of an object detector. In this work, we tackle occlusions by incorporating top-down category-specific reasoning and shape prediction through exemplars, together with bottom-up segments, into an energy minimization framework.

Shape Prediction. Shape is an effective object descriptor due to its invariance to lighting conditions and color. Several recent works have attempted to use shape priors to guide the segmentation inference. Yang et al. [37] use detection results to generate shape predictions based on the content of the bounding boxes. Gu et al. [17] aggregate poselet activations to obtain the spatial distribution of contours within an image cell. He and Gould [21] apply the exemplar SVM to get a rough location and scale of candidate objects, and subsequently project the shape masks as an initial object hypothesis. In this paper, we obtain fine-grained shape priors by evaluating the similarities between the segmentation proposals and the exemplar templates.

3. Proposed Algorithm

3.1. Overview

In this section, we present the proposed multi-instance object segmentation algorithm with occlusion handling in detail. We first introduce the joint detection and segmentation framework and then our approach to tackling occlusions. Figure 2 illustrates the proposed algorithm.

3.2. Joint Detection and Segmentation

In this section, we briefly review the SDS algorithm proposed by Hariharan et al. [19]. SDS consists of the following four steps. First, category-independent segmentation proposals are generated with MCG [5]. Then, these segments are fed into a CNN to extract features; this CNN is based on the R-CNN framework [16]. The CNN architecture of SDS is shown in Figure 3 and consists of two pathways, box and region. The box pathway is the same network as in the R-CNN framework, which has been shown to be effective in classifying object proposals in the detection task. However, R-CNN does not perceive the foreground shape directly. Hariharan et al. adopt the idea proposed by Girshick et al. and compute a second set of CNN features on a bounding box that contains only the foreground contents.
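The two inputs described above (a cropped image for the box pathway, and the same crop with the background removed for the region pathway) can be illustrated with toy arrays; `crop` and `mask_background` are hypothetical helpers, not SDS code:

```python
# Sketch of the two pathway inputs (Sec. 3.2). Images are nested lists
# purely for illustration; the real system feeds fixed-size crops to
# two CNNs whose features are then concatenated.
def crop(image, box):
    r0, c0, r1, c1 = box
    return [row[c0:c1] for row in image[r0:r1]]

def mask_background(patch, mask):
    # Zero out pixels outside the foreground mask (the "region" input).
    return [[v if m else 0 for v, m in zip(vr, mr)] for vr, mr in zip(patch, mask)]

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
box_input = crop(image, (0, 0, 2, 2))                  # [[1, 2], [4, 5]]
region_input = mask_background(box_input, [[1, 0], [1, 1]])  # [[1, 0], [4, 5]]
print(box_input, region_input)
```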
Third, the two resulting CNN feature vectors are concatenated, and the result is used as input to train class-specific classifiers. Note that the two pathways are trained jointly in this framework. These classifiers assign a score for each category to each segmentation proposal. Finally, a refinement step is conducted to boost the performance. More details about the SDS algorithm can be found in [19].

(Figure 3 diagram: object proposals → cropped images and foreground images → Box CNN and Region CNN → feature vectors → class-specific classifiers.)

Figure 3: SDS CNN architecture [19]. SDS first applies MCG [5] to obtain foreground masks and the corresponding bounding boxes. Foreground images and cropped images are fed into the Region and Box CNNs respectively to jointly train the network. Finally, the grouped CNN features are used to train class-specific classifiers.

3.3. Class-specific Likelihood Map

From SDS, we obtain a set of categorized segmentation hypotheses {h_k}_{k=1}^K and scores {s_{h_k}^{c_j}}_{k=1}^K, where c_j ∈ C and C is the set of target classes. We use superpixels to represent an image I. For each superpixel sp covered by h_k, we record the corresponding category and score. After examining all the segmentation proposals, each superpixel has a list {s_{sp}^{c_j}}_{sp∈I}^{c_j∈C} indicating the score of the superpixel belonging to class c_j. Then, the class-specific likelihood map is defined as the mean over the scores of all superpixels belonging to class c_j. Figures 4(b) and 4(c) are the person and horse likelihood maps respectively. Due to the high computational load of the gPb edge detector [4], we generate the superpixel map using [8]. The resulting superpixel map is shown in Figure 4(a).

Figure 4: The superpixel map and class-specific likelihood maps. (a) Superpixel, (b) Person, (c) Horse.

3.4. Exemplar-Based Shape Predictor

Bottom-up segmentation proposals tend to undershoot (e.g., missing parts of an object) and overshoot (e.g., containing background clutter). Thus, we propose an exemplar-based shape predictor to better estimate the shape of an object. The framework of the proposed shape predictor is shown in Figure 5.

(Figure 5 diagram: exemplar templates → corner detector and chamfer matching → matched points → inferred masks M_1 … M_n → shape prior S_n.)

Figure 5: Overview of the exemplar-based shape predictor. This figure shows an example in which the shape predictor uses top-down class-specific shape information to remove the overshooting on the back of the horse.

We assume that segmentation proposals can provide instance-level information to a certain extent, although these proposals may undershoot or overshoot in reality. We aim to remove these issues according to global shape cues and simultaneously recover the object shape. We thus propose a non-parametric, data-driven shape predictor based on chamfer matching (CM). Given a proposal, we identify and refine strong matches locally based on the chamfer distance to every possible exemplar template. After aggregating all the matches for the best-scoring segmentation proposals, the matches are automatically clustered into a sequence of shape priors. Note that the exemplar templates are selected from the VOC 2012 segmentation training set.

We first choose the top 20 scoring segmentation proposals from each class. Given a proposal, we slightly enlarge the segmentation proposal (1.2x its width and height) to form the search area. Then, we place an exemplar template at the top-left corner of the enlarged search area and slide it with a step size of 5 pixels. A fast CM algorithm [29] is applied to evaluate the distance between the contour of the proposal and the contour of the exemplar template. CM provides a fairly robust distance measure between two contours and can tolerate small rotations, misalignments, occlusions and deformations to a certain extent.

The chamfer distance between the contour of the proposal U and the contour of the exemplar template T is given by the average of the distances between each point t_i ∈ T and its nearest point u_j in U as

    d_CM(T, U) = (1/|T|) Σ_{t_i ∈ T} min_{u_j ∈ U} |t_i − u_j| ,    (1)

where |T| indicates the number of points on the contour of the exemplar template T, and we use boldface to represent a vector.
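Equation (1) can be implemented directly from its definition; the brute-force sketch below ignores any distance-transform speedup and is only meant to make the formula concrete:

```python
# Chamfer distance of Eq. (1): mean distance from each template point
# to its nearest contour point in U. Note the measure is asymmetric.
import math

def chamfer(T, U):
    return sum(min(math.dist(t, u) for u in U) for t in T) / len(T)

T = [(0, 0), (0, 2)]     # template contour points
U = [(0, 0), (0, 1)]     # proposal contour points
print(chamfer(T, U))     # (0 + 1) / 2 = 0.5
```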
The matching cost can be computed efficiently via a distance transform image DT_U(x) = min_{u_j ∈ U} |x − u_j|, which specifies the distance from each point x to the nearest edge pixel in U. Then, (1) can be efficiently calculated as d_CM(T, U) = (1/|T|) Σ_{t_i ∈ T} DT_U(t_i). Based on the instance-level assumption, for a segmentation proposal of size w, we limit the search scale to [w × 1.1^{-3}, w]. Searching over this scale space, we select the scale with the minimum distance as the shape candidate U*. Among all the exemplar templates, we choose the top 5 matches for a proposal.

However, CM only provides discrete matched points, and we need to infer a closed contour from them. Moreover, CM cannot handle undershooting and overshooting effectively. Therefore, we propose a two-stage approach to solve these issues. First, we eliminate the effects of large distances in DT_U(x) caused by undershooting and overshooting by truncating values of DT_U(x) above τ to τ. Second, undershooting and overshooting always lead to contour inconsistency, which we remove as follows. We first apply the Harris corner detector [20] to detect inconsistent points (blue dots in Figure 5). We choose three inconsistent points and check the number of matched points on the adjacent contour segments formed by these points. If fewer than 20% of the points on the segments are matched with a template, we index the common inconsistent point, i.e., the middle one of the three. We then choose another three inconsistent points and repeat the process. Finally, we remove the indexed inconsistent points from the inconsistent point set. In this way, we effectively remove odd contours and obtain a better estimate of the object shape.

After collecting all the strong matches for the segmentation candidates with high classification scores, we apply Affinity Propagation (AP) [13] to find representative shape priors {S_n}_{n=1}^N, where N is the number of clusters and is determined automatically. A shape prior corresponds to a cluster cls(n). The n-th shape prior is defined as the weighted mean of every matched inferred mask M_m in the cluster cls(n):

    S_n = (1/|cls(n)|) Σ_{M_m ∈ cls(n)} s_{h_k}^{c_j} · s_{h_k}^{CM} · M_m ,    (2)

where s_{h_k}^{c_j} is the classification score of the proposal h_k for the class c_j, and s_{h_k}^{CM} = exp(−d_{CM}^2 / σ^2) is the chamfer matching score between the contour of the proposal and the contour of the exemplar template. Note that the shape priors {S_n}_{n=1}^N are probabilistic. We threshold each shape prior by an empirically chosen number (i.e., 0.6 in the experiments) to form the corresponding foreground mask. We denote the thresholded shape priors as {S̃_n}_{n=1}^N. Finally, we form a set of foreground masks F = {S̃_1, S̃_2, ..., S̃_N, h_1, h_2, ..., h_K} by concatenating the thresholded shape priors and the segmentation proposals.

3.5. Graph Cut with Occlusion Handling

In this section, we introduce the proposed graph cut formulation to address occlusions. Specifically, we infer the occluding regions (i.e., the overlap between two instances) based on the segmentation proposals with top classification scores, and formulate the occluding regions into the energy minimization framework.

Let y_p denote the label of a pixel p in an image and y denote the vector of all y_p. The energy function given the foreground-specific appearance model A_i is defined as

    E(y; A_i) = Σ_{p ∈ P} U_p(y_p; A_i) + Σ_{(p,q) ∈ N} V_{p,q}(y_p, y_q) ,    (3)

where P denotes all pixels in an image, N denotes pairs of adjacent pixels, U_p(·) is the unary term and V_{p,q}(·) is the pairwise term. Our unary term U_p(·) is a linear combination of several terms and is written as

    U_p(y_p; A_i) = −α_{A_i} log p(y_p; c_p, A_i) − α_O log p(y_p; O) − α_{P_{c_j}} log p(y_p; P_{c_j}) .    (4)

For the pairwise term V_{p,q}(y_p, y_q), we follow the definition in GrabCut [30].

The first potential p(y_p; c_p, A_i) evaluates how likely a pixel of color c_p is to take label y_p based on a foreground-specific appearance model A_i. As in [30], an appearance model A_i consists of two Gaussian mixture models (GMMs), one for the foreground (y_p = 1) and one for the background (y_p = 0). Each GMM has 5 components, and each component is a full-covariance Gaussian over the RGB color space. Each foreground-specific appearance model A_i corresponds to the foreground and background models initialized using one of the elements in F. An element of the set F is denoted as f_i.

The second potential p(y_p; O) accounts for the occlusion handling in the proposed graph cut framework. To find the occluding regions in a given image I, we first choose the segmentation proposals with the top 10 scores from each category. Then, we check whether a pair of proposals overlaps. If they overlap, we record the overlap in the occluding region set O. We use classification scores to determine the energy of the pixels in the occluding regions. Thus, we define the second energy term −log p(y_p; O) as

    −log p(y_p; O) = −log s_{f_i, y_p}^{c_j} + (2y_p − 1)γ    if p ∈ O* and s_{f_i\O*}^{c_j} > s_{f_i}^{c_j} ,
    −log p(y_p; O) = −log s_{f_i, y_p}^{c_j}                  otherwise ,    (5)
Table 1: Per-class results of the joint detection and segmentation task using the AP^r metric over 20 classes at 0.5 IoU on the PASCAL VOC 2012 segmentation validation set. All numbers are %.

              aero  bike  bird  boat  bottle  bus   car   cat   chair  cow   dtable  dog   horse  mbike  person  plant  sheep  sofa  train  TV    avg
    SDS [19]  58.8  0.5   60.1  34.4  29.5    60.6  40.0  73.6  6.5    52.4  31.7    62.0  49.1   45.6   47.9    22.6   43.5   26.9  66.2   66.1  43.8
    Ours      63.6  0.3   61.5  43.9  33.8    67.3  46.9  74.4  8.6    52.3  31.3    63.5  48.8   47.9   48.3    26.3   40.1   33.5  66.7   67.8  46.3
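As a sanity check, two claims made about Table 1 in Section 4.1 (the proposed method improves 15 of the 20 classes, and improves boat, bus, car and sofa by more than 5 points) can be verified directly; the numbers below are hard-coded from the table rows:

```python
# Per-class AP^r at 0.5 IoU, in VOC class order, copied from Table 1.
sds  = [58.8, 0.5, 60.1, 34.4, 29.5, 60.6, 40.0, 73.6, 6.5, 52.4,
        31.7, 62.0, 49.1, 45.6, 47.9, 22.6, 43.5, 26.9, 66.2, 66.1]
ours = [63.6, 0.3, 61.5, 43.9, 33.8, 67.3, 46.9, 74.4, 8.6, 52.3,
        31.3, 63.5, 48.8, 47.9, 48.3, 26.3, 40.1, 33.5, 66.7, 67.8]
wins = sum(o > s for o, s in zip(ours, sds))          # classes improved
big_gains = sum(o - s > 5 for o, s in zip(ours, sds)) # improved by > 5 pts
print(wins, big_gains)   # 15 classes improved, 4 of them by more than 5 points
```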

where s_{f_i, y_p}^{c_j} = (s_{f_i}^{c_j})^{y_p} (1 − s_{f_i}^{c_j})^{1−y_p} and the penalization is γ = −log( (1/|O*|) Σ_{p ∈ O*} (s_{f_i, y_p=0}^{c_j} + e) ). The parameter e is a small number that prevents the logarithm from returning infinity. The variable O* ⊂ f_i ∩ O is one of the possible occluding regions for f_i. Given a foreground mask f_i and its class score s_{f_i}^{c_j}, we check the corresponding score of the region f_i\O*. The score s_{f_i\O*}^{c_j} is obtained by applying the classifier of the class c_j to the region f_i\O*, which is obtained by removing the occluding region O* from the foreground mask f_i. When s_{f_i\O*}^{c_j} > s_{f_i}^{c_j}, a pixel p in the occluding region O* is discouraged from being associated with the foreground mask f_i. In this case, we penalize the energy of the occluding region by adding the penalization γ when y_p = 1; when y_p = 0, the energy of the occluding region is reduced by γ.

The third potential p(y_p; P_{c_j}) corresponds to one of the class-specific likelihood maps P_{c_j}. Because of the probabilistic nature of the class-specific likelihood map P_{c_j}, we set the third potential as p(y_p; P_{c_j}) = P_{c_j}^{y_p} (1 − P_{c_j})^{1−y_p}. Finally, we iteratively minimize the energy function (3) as in [30]. The parameters of the foreground-specific appearance model keep updating in each iteration until the energy function converges.

In the experiments, the parameter α_{A_i} is set to 1. We vary the parameter α_O from 0.5 to 1 with a step size of 0.1. In addition, the parameter α_{P_{c_j}} ranges from 0.1 to 0.3 with a step size of 0.1. In the pairwise term, we vary the constant controlling the degree of smoothness from 60 to 240 with a step size of 60. We use these combinations of parameters to generate different segmentation candidates for a given foreground mask f_i. Finally, the segmentation candidates of all the foreground masks are scored with the class-specific classifiers trained on top of the CNN features extracted from the SDS CNN architecture. Note that we apply the same classifiers as in SDS.

4. Experiments

We present experimental results for the joint detection and segmentation task on the PASCAL VOC 2012 segmentation validation set, with a comparison to SDS [19]. There are 1449 images in the PASCAL VOC 2012 segmentation validation set. Moreover, we show our performance on a subset of the segmentation validation set to better evaluate our occlusion handling, as images in the segmentation set mostly contain only one instance.

Table 2: Results of the joint detection and segmentation task using the AP^r metric at different IoU thresholds on the PASCAL VOC 2012 segmentation validation set. The top two rows show the AP^r results using all validation images. The bottom two rows show AP^r using the images with occlusions between instances. We discuss the selection scheme in the text.

              # of images   IoU 0.5   0.6    0.7    0.8    0.9
    SDS [19]  1449          43.8      34.5   21.3   8.7    0.9
    Ours      1449          46.3      38.2   27.0   13.5   2.6
    SDS [19]  309           27.2      19.6   12.5   5.7    1.0
    Ours      309           38.4      28.0   19.0   10.1   2.1

4.1. Results of Joint Detection and Segmentation

Experimental Setting. We use AP^r to evaluate the proposed algorithm against SDS [19] on the joint detection and segmentation task. However, recent works on object proposal algorithms [9, 22] show that an IoU of 0.5 is not sufficient for all purposes. Thus, Hariharan et al. propose to vary the IoU threshold from 0.1 to 0.9 to show that their algorithm can be adopted for different applications. In our application, we aim to provide accurate segmentations, and thus we choose thresholds from 0.5 to 0.9 in the experiments.

In addition, we collect a subset of images from the VOC 2012 validation dataset to form VOC_occluded. Each image in the VOC_occluded dataset satisfies the following: (a) it contains at least two instances (with respect to the VOC object categories), and (b) there is an overlap between two instances in the image. In the end, VOC_occluded contains 309 images in total, and it helps to evaluate the detection performance of the proposed algorithm under occlusions.

Experimental Results. We use the benchmarking source code provided by Hariharan et al. [19] and follow the same protocols to evaluate the proposed algorithm on the joint detection and segmentation task. Table 1 shows the per-class results of the joint detection and segmentation task using the AP^r metric (at IoU 0.5) on all the images of the validation set. Our method is highly beneficial for object classes such
(Figure 6 rows, top to bottom: sheep, bird, train, bus, horse, person, sofa, aeroplane.)

Figure 6: Top detection results (with respect to the ground truth) of SDS [19] and the proposed algorithm on the PASCAL VOC 2012 segmentation validation dataset. (a) Input, (b) Ground truth, (c) Ours, (d) SDS [19]. Compared with SDS, the proposed algorithm obtains favorable segmentation results for different categories. Best viewed in color.

as boat, bus, car and sofa, boosting the performance by more than 5%. Overall, the proposed algorithm performs the best in 15 out of the 20 categories. Note that in our experiments, SDS obtains a much lower AP^r in the bike category compared to the original scores reported in [19]. This is because we evaluate the performance using the VOC 2012 segmentation annotations from the VOC website instead of annotations from the semantic boundary database (SBD) [18].
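The VOC_occluded selection rule of Section 4.1 (at least two instances, with an overlap between two of them) can be sketched as follows; the per-image mask representation is made up for illustration:

```python
def is_occluded_image(instance_masks):
    """True if some pair of instance masks (sets of pixels) overlaps."""
    masks = [set(m) for m in instance_masks]
    if len(masks) < 2:
        return False
    return any(masks[i] & masks[j]
               for i in range(len(masks))
               for j in range(i + 1, len(masks)))

print(is_occluded_image([{(0, 0), (0, 1)}, {(0, 1), (0, 2)}]))  # True
print(is_occluded_image([{(0, 0)}, {(5, 5)}]))                  # False
```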
(a) Input (b) Ground truth (c) MCG [5] (d) Ours

Figure 7: Some representative segmentation results with comparisons to MCG [5] on the PASCAL VOC segmentation
validation dataset. These results aim to present the occlusion handling capability of the proposed algorithm.

(SBD) [18]. mance experiments on the segmentation datasets. The per-


The first two rows of Table 2 show the AP^r at different IoUs on all the VOC images. The results suggest that high quality segmentation candidates boost the detection results at high IoUs. In particular, we achieve more than a 5% jump in performance at high IoUs. Moreover, the bottom two rows of Table 2 show that the proposed algorithm outperforms SDS by a large margin on the occlusion images from the VOC_occluded dataset. This suggests that an algorithm with occlusion handling can boost the detection results significantly.

We present qualitative results in Figures 6 and 7. Figure 6 shows the segmentation quality of the top detection results (with respect to the ground truth). The proposed algorithm obtains favorable segmentation results for different categories. Although the segmentation quality in Figure 6 is promising, the best detected proposal does not always have the best segmentation. We therefore present Figure 7 to demonstrate that the proposed algorithm generates high quality segmentations and, moreover, is capable of handling occlusions.

5. Ablation studies

In this section, we conduct ablation studies to understand how critical the exemplar-shape based predictor and the occlusion regularization in (5) are for the joint detection and segmentation task. First, we disable the occlusion regularization in (5) and perform experiments on the segmentation datasets. The performance drops from 46.3% to 46%. On the other hand, when the exemplar-shape based predictor is disabled, the performance drops to 39.3%.

Next, we conduct experiments on the occlusion subset. Without the occlusion regularization, the performance drops from 38.4% to 37.9%. If we turn off the exemplar-shape based predictor, the performance drops to 33.2%. These studies suggest that the exemplar-shape based predictor is more important than the occlusion regularization for the joint detection and segmentation task. We conclude that a better estimate of object shape helps detection significantly.
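The ablations above amount to removing individual terms from the energy in (5). Since (5) is defined earlier in the paper and is not reproduced here, the sketch below only illustrates the mechanics of such a term-level ablation under an assumed weighted-sum form of the objective: zeroing one term's weight disables it while leaving the rest of the energy untouched. All term functions, weights, and the toy "segmentation" are placeholder assumptions, not the paper's model:

```python
# Schematic ablation harness: the energy is a weighted sum of terms, and an
# ablation disables one term by zeroing its weight. The term functions below
# are placeholders standing in for the data term, the exemplar-shape term,
# and the occlusion regularization of (5).

def total_energy(segmentation, terms, weights):
    """Weighted sum of energy terms; a zero weight removes that term."""
    return sum(w * f(segmentation) for f, w in zip(terms, weights))

unary     = lambda s: sum(s)        # placeholder data term
shape     = lambda s: len(s)        # placeholder exemplar-shape term
occlusion = lambda s: s.count(0)    # placeholder occlusion regularizer

s = [1, 0, 1]                       # toy labeling, not a real segmentation
full     = total_energy(s, [unary, shape, occlusion], [1.0, 1.0, 1.0])
no_occl  = total_energy(s, [unary, shape, occlusion], [1.0, 1.0, 0.0])  # ablate occlusion term
no_shape = total_energy(s, [unary, shape, occlusion], [1.0, 0.0, 1.0])  # ablate shape term
```

Rerunning the full pipeline with each weight zeroed in turn, and comparing the resulting AP^r to the full model, is exactly the comparison reported above (46.3% vs. 46% and 39.3%).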
6. Conclusion

We present a novel multi-instance object segmentation algorithm to tackle occlusions. We observe that bottom-up segmentation approaches cannot correctly handle occlusions. We thus incorporate top-down category-specific reasoning and shape predictions through exemplars into an energy minimization framework. Experimental results show that the proposed algorithm generates favorable segmentation candidates on the PASCAL VOC 2012 segmentation validation dataset. Moreover, the results suggest that high quality segmentations improve the detection accuracy significantly, especially for images with occlusions between objects.

Acknowledgment. This work is supported in part by the NSF CAREER Grant #1149783, NSF IIS Grant #1152576, and a gift from Toyota. X. Liu is sponsored by a CSC fellowship. We thank Rachit Dubey and Simon Sáfár for their suggestions and the CVPR reviewers for their feedback on this work.
References

[1] E. Ahmed, S. Cohen, and B. Price. Semantic object selection. In CVPR, 2014.
[2] P. Arbeláez. Boundary extraction in natural images using ultrametric contour maps. In POCV, 2006.
[3] P. Arbeláez, B. Hariharan, C. Gu, S. Gupta, L. Bourdev, and J. Malik. Semantic segmentation using regions and parts. In CVPR, 2012.
[4] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. PAMI, 33(5):898–916, 2011.
[5] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[6] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
[7] J. Carreira and C. Sminchisescu. CPMC: Automatic object segmentation using constrained parametric min-cuts. PAMI, 34(7):1312–1328, 2012.
[8] Y.-T. Chen, J. Yang, and M.-H. Yang. Extracting image regions by structured edge prediction. In WACV, 2015.
[9] P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In ICCV, 2013.
[10] J. Dong, Q. Chen, S. Yan, and A. Yuille. Towards unified object detection and segmentation. In ECCV, 2014.
[11] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge (VOC). IJCV, 88(2):303–338, 2010.
[12] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 32(9):1627–1645, 2010.
[13] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.
[14] T. Gao, B. Packer, and D. Koller. A segmentation-aware object detection model with occlusion handling. In CVPR, 2011.
[15] G. Ghiasi, Y. Yang, D. Ramanan, and C. Fowlkes. Parsing occluded people. In CVPR, 2014.
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[17] C. Gu, P. Arbeláez, Y. Lin, K. Yu, and J. Malik. Multi-component models for object detection. In ECCV, 2012.
[18] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[19] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
[20] C. Harris and M. Stephens. A combined corner and edge detector. In Fourth Alvey Vision Conference, 1988.
[21] X. He and S. Gould. An exemplar-based CRF for multi-instance object segmentation. In CVPR, 2014.
[22] J. Hosang, R. Benenson, and B. Schiele. How good are detection proposals, really? In BMVC, 2014.
[23] E. Hsiao and M. Hebert. Occlusion reasoning for object detection under arbitrary viewpoint. In CVPR, 2012.
[24] J. Kim and K. Grauman. Boundary preserving dense local regions. In CVPR, 2011.
[25] J. Kim and K. Grauman. Shape sharing for object segmentation. In ECCV, 2012.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[27] D. Kuettel, M. Guillaumin, and V. Ferrari. Segmentation propagation in ImageNet. In ECCV, 2012.
[28] F. Li, J. Carreira, G. Lebanon, and C. Sminchisescu. Composite statistical inference for semantic segmentation. In CVPR, 2013.
[29] M.-Y. Liu, O. Tuzel, A. Veeraraghavan, and R. Chellappa. Fast directional chamfer matching. In CVPR, 2010.
[30] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In SIGGRAPH, 2004.
[31] L. Tao, F. Porikli, and R. Vidal. Sparse dictionaries for semantic segmentation. In ECCV, 2014.
[32] J. Tighe, M. Niethammer, and S. Lazebnik. Scene parsing with object instances and occlusion ordering. In CVPR, 2014.
[33] A. Vedaldi and A. Zisserman. Structured output regression for detection with partial truncation. In NIPS, 2009.
[34] X. Wang, M. Yang, S. Zhu, and Y. Lin. Regionlets for generic object detection. In ICCV, 2013.
[35] D. Weiss and B. Taskar. SCALPEL: Segmentation cascades with localized priors and efficient learning. In CVPR, 2013.
[36] J. Winn and J. Shotton. The layout consistent random field for recognizing and segmenting partially occluded objects. In CVPR, 2006.
[37] Y. Yang, S. Hallman, D. Ramanan, and C. Fowlkes. Layered object models for image segmentation. PAMI, 34(9):1731–1743, 2012.
