Chen, Liu, Yang - Unknown - Multi-Instance Object Segmentation With Occlusion Handling-Annotated
Abstract
accuracy on the challenging PASCAL VOC dataset [11]. Based on R-CNN, Hariharan et al. [19] propose
a simultaneous detection and segmentation (SDS) algorithm. Unlike R-CNN, SDS feeds both bounding
boxes and segmentation foreground masks into a modified CNN architecture to extract CNN features.
The features are then used to train class-specific classifiers. This framework shows a significant
improvement in the segmentation classification task. This classification capability gives us
powerful top-down, category-specific reasoning to tackle occlusions.

We use the categorized segmentation hypotheses obtained by SDS to infer occluding regions by
checking whether two of the top-scoring categorized segmentation proposals overlap. If they do, we
record the overlapping region in the occluding region set. In addition, the classification
capability is used to generate class-specific likelihood maps and to find the corresponding
category-specific exemplar sets to get better shape predictions. Then, the inferred occluding
regions, shape predictions and class-specific likelihood maps are formulated into an energy
minimization framework [30, 27, 1] to obtain the desired segmentation candidates (e.g.,
Figure 1(d)). Finally, we score all the segmentation candidates using the class-specific
classifiers.

We demonstrate the effectiveness of the proposed algorithm by comparing it with SDS on the
challenging PASCAL VOC segmentation dataset [11]. The experimental results show that the proposed
algorithm achieves favorable performance both quantitatively and qualitatively; moreover, they
suggest that high-quality segmentations improve the detection accuracy significantly. For example,
the segment in Figure 1(c) is classified as a bicycle, whereas our segment in Figure 1(d) is
classified correctly as a motorbike.

2. Related Work and Problem Context

In this section, we discuss the most relevant work on object detection, object segmentation,
occlusion modeling and shape prediction.

Object Detection. Object detection algorithms aim to localize and recognize every instance, each
marked by a bounding box [12, 34, 16]. Felzenszwalb et al. [12] propose a deformable part model
that infers the deformations among parts of the object by latent variables learned through
discriminative training. The "Regionlet" algorithm [34] detects an object by learning a cascaded
boosting classifier with the most discriminative features extracted from sub-parts of regions,
i.e., the regionlets. Most recently, the R-CNN detector [16] uses a large-scale CNN [26] to tackle
detection and outperforms the state of the art by a large margin on the challenging PASCAL VOC
dataset.

Object Segmentation. Recent years have witnessed significant progress in bottom-up object
segmentation algorithms [25, 7, 4, 35]. Arbelaez et al. [4] develop a unified approach to contour
detection and image segmentation based on the gPb contour detector [4], the oriented watershed
transform and the ultrametric contour map [2]. Carreira and Sminchisescu [7] generate segmentation
hypotheses by solving a sequence of constrained parametric min-cut problems (CPMC) with various
seeds and unary terms. Kim and Grauman [25] introduce a shape sharing concept, a
category-independent top-down cue, for object segmentation. Specifically, they transfer a shape
prior to an object in the test image from an exemplar database based on the local region matching
algorithm [24]. Most recently, the SCALPEL framework [35] integrates bottom-up cues and top-down
priors such as object layout, class and scale into a cascaded bottom-up segmentation scheme to
generate object segments.

Semantic segmentation [3, 6, 28, 16, 31] assigns a category label to each segment generated by a
bottom-up segmentation algorithm. Arbelaez et al. [3] first generate segmentations using the gPb
framework [4]. Then, rich feature representations are extracted for training class-specific
classifiers. Carreira et al. [6] start with the CPMC algorithm to generate hypotheses. They then
propose a second-order pooling (O2P) scheme to encode local features into a global descriptor and
train linear support vector regressors on top of the pooled features. Girshick et al. [16] instead
extract CNN features from the CPMC segmentation proposals and then apply the same procedure as in
the O2P framework to tackle semantic segmentation. Most recently, Tao et al. [31] integrate a new
categorization cost, based on discriminative sparse dictionary learning, into the conditional
random field model for semantic segmentation. A similar approach [28] uses the estimated
statistics of mutually overlapping mid-level object segmentation proposals to predict the optimal
full-image semantic segmentation. In contrast, the proposed algorithm incorporates both
category-specific classification and shape predictions from mid-level segmentation proposals in an
energy minimization framework to tackle occlusions.

Occlusion Modeling. Approaches for handling occlusion have been studied extensively
[32, 36, 15, 37, 21, 23, 14]. Tighe et al. [32] handle occlusions in the scene parsing task by
inferring the occlusion ordering based on a histogram that gives the probability that class c1 is
occluded by class c2 at a given overlap score. Winn and Shotton [36] handle occlusions by using a
layout consistent random field, which models the object parts using a hidden random field where
pairwise potentials are asymmetric. Ghiasi et al. [15] model occlusion patterns by learning a
pictorial structure with local mixtures using large-scale synthetic data. Gao et al. [14] propose
a segmentation-aware model that handles occlusions by introducing binary variables to denote the
visibility of the object in each cell of a bounding box. The assignment
[Figure 2 panels: input; object hypotheses; SDS CNN; categorized object hypotheses; occluding regions; class-specific likelihood map; exemplar-based shape prediction; graph-cut with occlusion handling; output]
Figure 2: Overall framework. The framework starts by generating object hypotheses using MCG [5]. Then, an SDS CNN
architecture [19] extracts CNN features for each object hypothesis, and the extracted features are fed into class-specific
classifiers to obtain the categories of the object hypotheses. Categorized segmentation hypotheses are used to obtain
class-specific likelihood maps, and top-scoring segmentation proposals are used to infer occluding regions. Meanwhile,
category-specific exemplars serve as inputs to the proposed exemplar-based shape predictor to obtain a better shape
estimate of an object. Finally, the inferred occluding regions, shape predictions and class-specific likelihood maps are
formulated into an energy minimization problem to obtain the desired segmentation.
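
The occluding-region step described above (checking whether top-scoring categorized proposals overlap and recording the overlap) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `infer_occluding_regions`, the representation of a mask as a set of `(row, col)` pixels, and the parameters `top_k` and `min_overlap` are all assumptions introduced here.

```python
from itertools import combinations


def infer_occluding_regions(proposals, top_k=2, min_overlap=1):
    """Collect overlap regions between the top-scoring categorized proposals.

    proposals: list of (score, label, mask) tuples, where mask is a set of
               (row, col) pixel coordinates (a hypothetical representation).
    top_k:       how many top-scoring proposals to compare (assumed parameter).
    min_overlap: minimum number of shared pixels to count as an overlap
                 (assumed parameter).
    Returns a list of pixel sets, one per overlapping pair.
    """
    # Keep only the top_k proposals by classification score.
    top = sorted(proposals, key=lambda p: p[0], reverse=True)[:top_k]

    occluding_regions = []
    for (_, _, mask_a), (_, _, mask_b) in combinations(top, 2):
        overlap = mask_a & mask_b  # pixels claimed by both proposals
        if len(overlap) >= min_overlap:
            occluding_regions.append(overlap)
    return occluding_regions


if __name__ == "__main__":
    demo = [
        (0.9, "person", {(0, 0), (0, 1), (1, 1)}),
        (0.8, "horse", {(1, 1), (1, 2)}),
        (0.1, "sofa", {(5, 5)}),
    ]
    print(infer_occluding_regions(demo))
```

Using sets keeps the overlap test to a single intersection; a dense boolean array would serve equally well for real masks.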
each category to each segmentation proposal. Finally, a refinement step is conducted to boost the
performance. More details about the SDS algorithm can be found in [19].

[Figure panels: exemplar templates; matching; matched points; inferred masks; shape prior]

the chamfer matching score between the contour of the proposal and the contour of the exemplar
template. Note that the shape priors $\{S_n\}_{n=1}^{N}$ are probabilistic. We threshold each shape
prior by an empirically chosen value (0.6 in the experiments) to form the corresponding foreground
mask. We denote the thresholded shape priors as $\{\tilde{S}_n\}_{n=1}^{N}$. Finally, we form a set
of foreground masks

Thus, we define the second energy term $-\log p(y_p; O)$ as

$$
-\log p(y_p; O) =
\begin{cases}
-\log s^{c_j}_{f_i, y_p} + (2y_p - 1)\gamma & \text{if } p \in O^{*} \text{ and } s^{c_j}_{f_i \setminus O^{*}} > s^{c_j}_{f_i}, \\
-\log s^{c_j}_{f_i, y_p} & \text{otherwise,}
\end{cases}
\tag{5}
$$
Table 1: Per-class results of the joint detection and segmentation task using the AP^r metric over 20 classes at 0.5 IoU on the
PASCAL VOC 2012 segmentation validation set. All numbers are in %.

Method    aero  bike  bird  boat  bottle  bus   car   cat   chair  cow   dtable  dog   horse  mbike  person  plant  sheep  sofa  train  TV    avg
SDS [19]  58.8  0.5   60.1  34.4  29.5    60.6  40.0  73.6  6.5    52.4  31.7    62.0  49.1   45.6   47.9    22.6   43.5   26.9  66.2   66.1  43.8
Ours      63.6  0.3   61.5  43.9  33.8    67.3  46.9  74.4  8.6    52.3  31.3    63.5  48.8   47.9   48.3    26.3   40.1   33.5  66.7   67.8  46.3
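
The per-class numbers in Table 1 can be tallied with a few lines of Python. The sketch below assumes the columns follow the standard PASCAL VOC class order (aero, bike, bird, ..., TV), which is an inference from the extracted header rather than something the flattened table states explicitly; the variable names are illustrative only.

```python
# Column order is ASSUMED to be the standard PASCAL VOC class order.
CLASSES = ["aero", "bike", "bird", "boat", "bottle", "bus", "car", "cat",
           "chair", "cow", "dtable", "dog", "horse", "mbike", "person",
           "plant", "sheep", "sofa", "train", "TV"]

# AP^r values (in %) copied from the two rows of Table 1, without the avg column.
SDS = [58.8, 0.5, 60.1, 34.4, 29.5, 60.6, 40.0, 73.6, 6.5, 52.4,
       31.7, 62.0, 49.1, 45.6, 47.9, 22.6, 43.5, 26.9, 66.2, 66.1]
OURS = [63.6, 0.3, 61.5, 43.9, 33.8, 67.3, 46.9, 74.4, 8.6, 52.3,
        31.3, 63.5, 48.8, 47.9, 48.3, 26.3, 40.1, 33.5, 66.7, 67.8]

# Classes where the proposed algorithm scores higher than SDS.
wins = [c for c, s, o in zip(CLASSES, SDS, OURS) if o > s]

# Classes where the absolute gain exceeds 5 percentage points.
big_gains = [c for c, s, o in zip(CLASSES, SDS, OURS) if o - s > 5.0]

# Mean over the 20 classes, recomputed from the rounded table entries.
mean_sds = sum(SDS) / len(SDS)
mean_ours = sum(OURS) / len(OURS)

if __name__ == "__main__":
    print(len(wins), "classes improved:", wins)
    print("gain above 5 points:", big_gains)
    print("recomputed means:", round(mean_sds, 2), round(mean_ours, 2))
```

Because the table entries are already rounded to one decimal, a recomputed mean can differ from the reported average in the last digit.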
where $s^{c_j}_{f_i, y_p} = (s^{c_j}_{f_i})^{y_p} (1 - s^{c_j}_{f_i})^{1 - y_p}$ and the penalization
$\gamma = -\log \frac{1}{|O^{*}|} \sum_{p \in O^{*}} \left( s^{c_j}_{f_i, y_p = 0} + e \right)$. The parameter $e$ is

uate our occlusion handling as images in segmentation set mostly contain only one instance.
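
To make the occlusion-aware energy term concrete, here is a small sketch of how the term $-\log p(y_p; O)$ and the penalization $\gamma$ could be evaluated for a single pixel. It assumes the Bernoulli form of the label score (the score itself for $y_p = 1$, one minus the score for $y_p = 0$) and treats the score as a per-pixel value inside the occluding region $O^{*}$; both are assumptions made for illustration. The function names, the default value of `e` and the `EPS` clamp are hypothetical and do not come from the source.

```python
import math

EPS = 1e-12  # assumed clamp that keeps log() finite; not part of the source


def label_score(s, y_p):
    """Bernoulli-style score: s for foreground (y_p = 1), 1 - s for background."""
    return s if y_p == 1 else 1.0 - s


def penalty_gamma(scores_in_region, e=1e-3):
    """gamma = -log( mean over O* of (background score + e) ).

    scores_in_region: scores for the pixels in the occluding region O*
                      (per-pixel scores are an assumption of this sketch).
    e: small constant; its value in the source is not recoverable here.
    """
    mean_bg = sum(label_score(s, 0) + e for s in scores_in_region) / len(scores_in_region)
    return -math.log(mean_bg)


def occlusion_energy(y_p, s, in_region, score_without_region, score_full, gamma):
    """Energy for one pixel with label y_p.

    The penalty (2*y_p - 1)*gamma applies only if the pixel lies in O* and the
    proposal scores higher once O* is removed (score_without_region > score_full).
    """
    base = -math.log(max(label_score(s, y_p), EPS))
    if in_region and score_without_region > score_full:
        return base + (2 * y_p - 1) * gamma
    return base


if __name__ == "__main__":
    g = penalty_gamma([0.8, 0.8])
    print(occlusion_energy(1, 0.8, True, 0.9, 0.7, g))
    print(occlusion_energy(0, 0.8, True, 0.9, 0.7, g))
```

With a positive `gamma`, the factor `(2 * y_p - 1)` raises the cost of labeling a pixel in the occluding region as foreground and lowers the cost of labeling it as background.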
[Figure 6 panels: (a) Input, (b) Ground truth, (c) Ours, (d) SDS [19]; example labels: bird, train, bus, horse, person, sofa, aeroplane]
Figure 6: Top detection results (with respect to the ground truth) of SDS [19] and the proposed algorithm on the PASCAL
VOC 2012 segmentation validation dataset. Compared with SDS, the proposed algorithm obtains favorable segmentation
results for different categories. Best viewed in color.
as boat, bus, car and sofa by boosting the performance by more than 5%. Overall, the proposed
algorithm performs the best in 15 out of the 20 categories. Note that in our experiments, SDS
obtains a much lower AP^r in the bike category compared to the original scores reported in [19].
This is because we evaluate the performance using the VOC 2012 segmentation annotations from the
VOC website instead of annotations from the semantic boundary database
(a) Input (b) Ground truth (c) MCG [5] (d) Ours
Figure 7: Some representative segmentation results with comparisons to MCG [5] on the PASCAL VOC segmentation
validation dataset. These results aim to present the occlusion handling capability of the proposed algorithm.