
SSAP: Single-Shot Instance Segmentation With Affinity Pyramid

Naiyu Gao 1,2, Yanhu Shan 3, Yupei Wang 1,2, Xin Zhao 1,2*, Yinan Yu 3, Ming Yang 3, Kaiqi Huang 1,2,4
1 CRISE, Institute of Automation, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Horizon Robotics, Inc.
4 CAS Center for Excellence in Brain Science and Intelligence Technology
{gaonaiyu2017,wangyupei2014}@ia.ac.cn, {xzhao,kaiqi.huang}@nlpr.ia.ac.cn
{yanhu.shan,yinan.yu}@horizon.ai, [email protected]
* Corresponding author

arXiv:1909.01616v1 [cs.CV] 4 Sep 2019

Abstract

Recently, proposal-free instance segmentation has received increasing attention due to its concise and efficient pipeline. Generally, proposal-free methods generate instance-agnostic semantic segmentation labels and instance-aware features to group pixels into different object instances. However, previous methods mostly employ separate modules for these two sub-tasks and require multiple passes for inference. We argue that treating these two sub-tasks separately is suboptimal. In fact, employing multiple separate modules significantly reduces the potential for application, and the mutual benefits between the two complementary sub-tasks are left unexplored. To this end, this work proposes a single-shot proposal-free instance segmentation method that requires only one single pass for prediction. Our method is based on a pixel-pair affinity pyramid, which computes the probability that two pixels belong to the same instance in a hierarchical manner. The affinity pyramid can also be jointly learned with the semantic class labeling, and the two tasks mutually benefit. Moreover, incorporating the learned affinity pyramid, a novel cascaded graph partition module is presented to sequentially generate instances from coarse to fine. Unlike previous time-consuming graph partition methods, this module achieves a 5x speedup and a 9% relative improvement on Average Precision (AP). Our approach achieves state-of-the-art results on the challenging Cityscapes dataset.

Figure 1. Overview of the proposed method. The per-pixel semantic class and pixel-pair affinities are generated with a single pass of a fully-convolutional network. The final instance segmentation result is then derived from these predictions by the proposed cascaded graph partition module.

1. Introduction

The rapid development of convolutional networks [30, 29] has revolutionized various vision tasks, enabling us to move towards more fine-grained understanding of images. Instead of classic bounding-box-level object detection [19, 18, 46, 39, 44, 15] or class-level semantic segmentation [41, 6, 49], instance segmentation provides in-depth understanding by segmenting all objects and distinguishing different object instances. Researchers are thus showing increasing interest in instance segmentation.

Current state-of-the-art solutions to this challenging problem can be classified into proposal-based and proposal-free approaches [34, 28, 40]. The proposal-based approaches regard it as an extension of the classic object detection task [46, 39, 44, 15]. After localizing each object with a bounding box, a foreground mask is predicted within each bounding-box proposal. However, the performance of these proposal-based methods is highly limited by the quality of the bounding-box predictions, and the two-stage pipeline also limits the speed of the systems. By contrast, the proposal-free approach has the advantage of a simple and efficient design. This work also focuses on the proposal-free paradigm.

The proposal-free methods mostly start by producing instance-agnostic pixel-level semantic class labels [41, 6, 8, 49], followed by clustering them into different object instances with particularly designed instance-aware features. However, previous methods mainly treat the two sub-processes as two separate stages and employ multiple
modules, which is suboptimal. In fact, the mutual benefits between the two sub-tasks can be exploited, which further improves the performance of instance segmentation. Moreover, employing multiple modules may result in additional computational costs for real-world applications.

To cope with the above issues, this work proposes a single-shot proposal-free instance segmentation method, which jointly learns pixel-level semantic class segmentation and object instance differentiation in a unified model with a single backbone network, as shown in Fig. 1. Specifically, for distinguishing different object instances, an affinity pyramid is proposed, which can be jointly learned with the labeling of semantic classes. The pixel-pair affinity computes the probability that two pixels belong to the same instance. In this work, the short-range affinities for pixels close to each other are derived with dense small learning windows. Simultaneously, long-range affinities for pixels distant from each other are also required to group objects with large scales or nonadjacent parts. Instead of enlarging the windows, the multi-range affinities are decoupled, and the long-range affinities are sparsely derived from instance maps with lower resolutions. After that, we propose learning the affinity pyramid at multiple scales along the hierarchy of a U-shape network, where the short-range and long-range affinities are effectively learned from the feature levels with higher and lower resolutions respectively. Experiments in Table 3 show that the pixel-level semantic segmentation and the affinity-pyramid-based grouping indeed mutually benefit from the proposed joint learning scheme. The overall instance segmentation is thus further improved.

Then, in order to utilize cues from global context reasoning, this work employs a graph partition method [26] to derive instances from the learned affinities. Unlike previous time-consuming methods, a cascaded graph partition module is presented to incorporate the graph partition process with the hierarchical manner of the affinity pyramid, which finally provides both acceleration and performance improvements. Concretely, with the learned pixel-pair affinity pyramid, a graph is constructed by regarding each pixel as a node and transforming affinities into edge scores. Graph partition is then employed progressively, from higher-level lower-resolution layers to lower-level higher-resolution layers. Instance segmentation predictions from lower resolutions produce confident proposals, which significantly reduce the number of nodes at higher resolutions. Thus the whole process is accelerated.

The main contributions of this paper are as follows:

- A novel instance-aware pixel-pair affinity pyramid is proposed to distinguish instances, which can be jointly learned with the pixel-level labeling of semantic classes. The mutual benefits between the two sub-tasks are explored by encouraging bidirectional interactions, which further boosts instance segmentation.

- A single-shot, proposal-free instance segmentation method is proposed, based on the proposed affinity pyramid. Unlike most previous methods, our approach requires only one single pass to generate instances. On the challenging Cityscapes dataset, our method achieves state of the art with 37.3% AP (val) / 32.7% AP (test) and 61.1% PQ (val).

- Incorporating the hierarchical manner of the affinity pyramid, a novel cascaded graph partition module is proposed to gradually segment an image into instances from coarse to fine. Compared with the non-cascaded way, this module achieves a 5x speedup and a 9% relative improvement on AP.

2. Related Work

2.1. Instance Segmentation

Existing approaches to instance segmentation can be divided into two paradigms: proposal-based methods and proposal-free methods.

Proposal-based methods recognize object instances with bounding boxes generated by detectors [46, 39, 15]. MNC [14] decomposes instance segmentation into a cascade of sub-tasks, including box localization, mask refinement and instance classification. Other works [2, 32] combine the predictions of detection and semantic segmentation with a CRFasRNN [50] to generate instances. FCIS [33] develops the position-sensitive score map [13]. Mask R-CNN [20] extends Faster R-CNN [46] by adding a segmentation-mask prediction branch on each Region of Interest (RoI). Follow-up works extend Mask R-CNN by modifying the feature layers [38] or the mask prediction head [7].

Proposal-free methods mainly solve instance segmentation by building on the success of semantic segmentation [6, 49, 8]. These segmentation-based methods learn instance-aware features and use corresponding grouping methods to cluster pixels into instances. DWT [3] learns boundary-aware energy for each pixel, followed by a watershed transform. Several methods [5, 17, 43] adopt instance-level embeddings to differentiate instances. SGN [37] sequentially groups instances with three sub-networks. Recurrent Neural Networks (RNNs) are adopted in several approaches [47, 45] to generate one instance mask at a time. Graph-based algorithms [26] are also utilized for post-processing [31, 28], segmenting an image into instances with global reasoning. However, graph-based algorithms are usually time-consuming. To speed up, Levinkov et al. [31] down-sample the outputs before the graph optimization, while Kirillov et al. [28] only derive edges for adjacent neighbors. Both accelerate at the expense of performance. Recently, Yang et al. [48] propose a single-shot image parser that achieves a balance between accuracy and efficiency.
2.2. Pixel-Pair Affinity

The concept of learning pixel-pair affinity has been developed in many previous works [36, 23, 1, 4, 42] to facilitate semantic segmentation during training or post-processing. Recently, Liu et al. [40] propose learning instance-aware affinity and grouping pixels into instances with agglomerative hierarchical clustering. Our approach also utilizes instance-aware affinity to distinguish object instances, but both the way affinities are derived and the way pixels are grouped are significantly different. Importantly, Liu et al. [40] employ two models and require multiple passes for the RoIs generated from semantic segmentation results. Instead, our approach is single-shot and requires only one single pass to generate the final instance segmentation result.

Figure 2. Illustration of the affinity pyramid. Pixel-pair affinity specifies whether two pixels belong to the same instance or not. For each current pixel, the affinities to neighboring pixels within a small r x r window (here, r = 5) are predicted. The short-range and long-range affinities are decoupled and derived from instance maps with higher and lower resolutions respectively. In practice, the ground truth affinity is set to 1 if two pixels are from the same instance, and 0 otherwise. Best viewed in color and zoom.
3. Proposed Approach
This work proposes a single-shot proposal-free instance segmentation model based on a jointly learned semantic segmentation and pixel-pair affinity pyramid, equipped with a cascaded graph partition module to differentiate object instances. As shown in Fig. 3, our model consists of two parts: (a) a unified network that learns the semantic segmentation and affinity pyramid with a single backbone network, and (b) a cascaded graph partition module that sequentially generates multi-scale instance predictions using the jointly learned affinity pyramid and semantic segmentation. In this section, the affinity pyramid is first explained in Subsection 3.1, and the cascaded graph partition module is then described in Subsection 3.2.

3.1. Affinity Pyramid

Given the instance-agnostic semantic segmentation, grouping pixels into individual object instances is critical for instance segmentation. This work proposes distinguishing different object instances based on the instance-aware pixel-pair affinity, which specifies whether two pixels belong to the same instance or not. As shown in the second column of Fig. 2, for each pixel, the short-range affinities to neighboring pixels within a small r x r window are learned. In this way, an $r^2 \times h \times w$ affinity response map is produced. For training, the average L2 loss is calculated over the $r^2$ predicted affinities of each pixel:

$$\mathrm{loss}(a, y) = \frac{1}{r^2} \sum_{j=1}^{r^2} \left\| y^j - a^j \right\|_2^2, \qquad (1)$$

where $a = [a^1, a^2, \dots, a^{r^2}]$ and $a^j$ is the predicted affinity between the current pixel and the $j$-th pixel in its affinity window, representing the probability that the two pixels belong to the same instance. A sigmoid activation is used so that $a^j \in (0, 1)$. Here, $y = [y^1, y^2, \dots, y^{r^2}]$, where $y^j$ is the ground truth affinity for $a^j$: it is set to 1 if the two pixels are from the same instance, and 0 if they are from different instances. Importantly, the training data generated in this way is unbalanced. Specifically, the ground truth affinities are mostly all 1, as most pixels lie in the inner regions of instances. To this end, 80% of the pixels with all-1 ground truth affinities are randomly dropped during training. Additionally, we weight the affinity loss by a factor of 3 for pixels belonging to object instances.

Moreover, apart from the short-range affinities above, long-range affinities are also required to handle objects of larger scales or nonadjacent object parts. A simple solution is to use a large affinity window. However, besides the cost in GPU memory, a large affinity window inevitably conflicts with the semantic segmentation during training, which severely hinders the joint learning of the two sub-tasks. As shown in the experiments (see Table 2), jointly learning the short-range affinities with semantic segmentation yields mutual benefits for the two tasks, but the long-range affinities are clearly more difficult to learn jointly with the pixel-level semantic class labeling. A similar observation is reported by Ke et al. [24].

Instead of enlarging the affinity window, we propose to learn multi-scale affinities as an affinity pyramid, where the short-range and long-range affinities are decoupled and the latter are sparsely derived from instance maps with lower resolutions. More concretely, as shown in Fig. 2, the long-range affinities are obtained with the same small affinity window at the lower resolutions. Note that the window sizes could differ across levels; they are fixed in this work for simplicity. In this way, the 5 x 5 windows at the 1/64 resolution can produce affinities between pixels up to 128 pixels apart. With the constructed affinity pyramid, the finer short-range and coarser long-range affinities are learned from the higher and lower resolutions, respectively. Consequently, multi-scale instance predictions are generated by the affinities at the corresponding resolutions. As shown in Fig. 2, the predictions of larger instances are proposed by the lower-resolution affinities and further detailed by the higher-resolution affinities. Meanwhile, although smaller instances have responses too weak to be proposed at lower resolutions, they can be generated by the affinities at higher resolutions.
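To make the affinity targets concrete, the sketch below builds the r^2 ground-truth affinity channels for one level of the pyramid from an instance-ID map and evaluates the average L2 loss of Eq. (1). It is a minimal NumPy illustration under our own simplifying assumptions (the helper names `affinity_targets` and `affinity_l2_loss` are ours, and the out-of-image border is simply treated as a different instance), not the released implementation.

```python
import numpy as np

def affinity_targets(inst_ids, r=5):
    """Ground-truth r*r affinities per pixel: 1 if the neighbour belongs
    to the same instance as the centre pixel, 0 otherwise (Eq. 1 targets).
    inst_ids: (h, w) integer instance-ID map at one pyramid resolution."""
    h, w = inst_ids.shape
    pad = r // 2
    padded = np.pad(inst_ids, pad, mode="constant", constant_values=-1)
    targets = np.zeros((r * r, h, w), dtype=np.float32)
    k = 0
    for dy in range(-pad, pad + 1):
        for dx in range(-pad, pad + 1):
            neigh = padded[pad + dy:pad + dy + h, pad + dx:pad + dx + w]
            targets[k] = (neigh == inst_ids).astype(np.float32)
            k += 1
    return targets

def affinity_l2_loss(pred, target):
    """Average L2 loss of Eq. (1); pred holds sigmoid outputs in (0, 1)."""
    return float(np.mean((target - pred) ** 2))

# Toy instance map with two objects side by side at a coarse resolution.
inst = np.zeros((8, 8), dtype=np.int64)
inst[:, 4:] = 1
gt = affinity_targets(inst, r=5)            # shape (25, 8, 8)
pred = np.clip(gt * 0.9 + 0.05, 0.0, 1.0)   # a fake prediction for the demo
print(affinity_l2_loss(pred, gt))
```

Note that at the coarsest level the same 5 x 5 window reaches much further in image coordinates: an offset of two cells at the 1/64 resolution corresponds to 2 x 64 = 128 pixels in the original image, which is how the pyramid supplies long-range affinities without ever enlarging the window.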
[Figure 3 diagram: (a) U-shape structure predicting semantic segmentation and the affinity pyramid, with semantic branches S1-S5 and affinity branches A1-A5 at the 1/4-1/64 resolutions; (b) cascaded graph partition. Legend: semantic branch Si / affinity branch Ai at the i-th resolution, semantic segmentation, affinity prediction, up-sampled inner-regions, element-wise sum, graph partition.]

Figure 3. Our instance segmentation model consists of two parts: (a) a unified U-shape framework that jointly learns the semantic segmentation and affinity pyramid. The affinity pyramid is constructed by learning multi-range affinities from feature levels with different resolutions separately. (b) A cascaded graph partition module that utilizes the jointly learned affinity pyramid and semantic segmentation to progressively refine instance predictions starting from the deepest layer. Instance predictions in the lower-level layers with higher resolution are guided by the instance proposals generated from the deeper layers with lower resolution. Best viewed in color and zoom.

After that, the affinity pyramid can easily be learned by adding affinity branches in parallel with the existing branches for semantic segmentation along the hierarchy of the decoder network. As shown in Fig. 3 (a), affinities are predicted at the {1/4, 1/8, 1/16, 1/32, 1/64} resolutions of the original image. In this way, the short-range and long-range affinities can be effectively learned at different feature levels in the feature pyramid of the U-shape architecture. The resulting affinity pyramid can thus be jointly learned with the semantic segmentation in a unified model, yielding mutual benefits.

3.2. Cascaded Graph Partition

With the jointly learned semantic segmentation and affinity pyramid, a graph-based partition mechanism is employed to differentiate object instances. In particular, incorporating the hierarchical manner of the affinity pyramid, a cascaded graph partition module is presented. This module sequentially generates instances at multiple scales, guided by the cues encoded in the deeper-level layers of the affinity pyramid.

Graph Partition. With the learned pixel-pair affinity pyramid, an undirected graph $G = (V, E)$ is constructed, where $V$ is the set of pixels and $E \subseteq V^2$ is the set of pixel pairs within affinity windows. $e_{u,v} \in E$ represents the edge between the pixels $\{u, v\}$. Furthermore, $a_{u,v}, a_{v,u} \in (0, 1)$ are the affinities for pixels $\{u, v\}$, predicted at pixels $u$ and $v$ respectively. The average affinity $\alpha_{u,v}$ is calculated and transformed into the score $w_{u,v}$ of edge $e_{u,v}$ by:

$$\alpha_{u,v} = (a_{u,v} + a_{v,u}) / 2, \qquad (2)$$
$$w_{u,v} = \log\!\left(\frac{\alpha_{u,v}}{1 - \alpha_{u,v}}\right). \qquad (3)$$

As the affinities predict how likely two pixels are to belong to the same instance, average affinities higher than 0.5 are transformed into positive scores and lower ones into negative scores.
In this way, instance segmentation is transformed into a graph partition problem [11] and can be addressed by solving the following optimization problem [26]:

$$\min_{y \in \{0,1\}^{E}} \; \sum_{e \in E} w_e \, y_e, \qquad (4)$$
$$\text{s.t.} \quad \forall C \; \forall e' \in C: \; y_{e'} \le \sum_{e \in C \setminus \{e'\}} y_e. \qquad (5)$$

Here, $y_e = y_{u,v} \in \{0, 1\}$ is a binary variable, and $y_{u,v} = 1$ indicates that nodes $u$ and $v$ belong to different partitions. $C$ ranges over all cycles of the graph $G$. The objective in formulation (4) amounts to maximizing the total score of the edges kept within partitions, and the cycle inequalities in (5) constrain each feasible solution to represent a valid partition. A search-based algorithm [26] is used to solve this optimization problem. However, when this algorithm is employed to segment instances, the inference time is not only long but also rises significantly with the number of nodes, which poses a potential problem for real-world applications.
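The solver itself comes from Keuper et al. [26]; the fragment below only illustrates what the objective in Eq. (4) measures, by evaluating the total score of the edges cut by a candidate node labelling (the `labels` mapping and `multicut_objective` helper are hypothetical, introduced purely for illustration).

```python
def multicut_objective(edges, labels):
    """edges: iterable of (u, v, w_uv) with signed scores from Eq. (3);
    labels[n]: instance ID assigned to node n. An edge has y_e = 1 iff
    it is cut, so the returned value is the Eq. (4) objective."""
    return sum(w for u, v, w in edges if labels[u] != labels[v])

# Three nodes: one strong "same instance" edge and two repulsive edges.
edges = [(0, 1, 2.2), (1, 2, -1.5), (0, 2, -0.8)]
print(multicut_objective(edges, {0: 0, 1: 0, 2: 1}))   # about -2.3: good partition
print(multicut_objective(edges, {0: 0, 1: 1, 2: 1}))   # about  1.4: worse
```

Deriving $y$ from a node labelling automatically satisfies the cycle inequalities of Eq. (5); the solver's job is to search over labellings for the minimizer.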
Cascade Scheme. The sizes of instances in the Cityscapes dataset vary significantly. For large instances, most pixels lie in inner regions, which are easy to segment yet cost long inference time. Motivated by this observation, a cascaded strategy is developed to incorporate the graph partition mechanism with the hierarchical manner of the affinity pyramid. As shown in Fig. 3 (b), graph partition is first applied at a low resolution, where there are fewer pixels and partitioning requires only a short running time. Although only coarse segments are generated for large instances, the inner regions of these segments are still reliable. These inner regions can therefore be up-sampled and regarded as proposals for the next higher resolution. At the higher resolution, the pixels in each proposal are combined into a single node and the remaining pixels are each treated as a node. To construct a graph over these nodes, the edge score $w_{t_i,t_j}$ between nodes $t_i$ and $t_j$ is calculated by summing all pixel-pair edge scores between the two nodes: $w_{t_i,t_j} = \sum_{u \in t_i, v \in t_j} w_{u,v}$. In this way, the proposals for instance predictions are progressively refined. Because the number of nodes decreases significantly at each step, the entire graph partition is accelerated.
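A minimal sketch, under our own data layout, of how pixel-pair edge scores could be pooled into node-pair scores once up-sampled proposals are collapsed into single nodes ($w_{t_i,t_j} = \sum w_{u,v}$); the helper name and node-ID convention are illustrative, not taken from the released implementation.

```python
from collections import defaultdict

def merge_edge_scores(pixel_edges, node_of):
    """pixel_edges: iterable of (u, v, w_uv) pixel-pair scores.
    node_of[p]: integer node ID of pixel p (pixels inside one up-sampled
    proposal share a node ID; remaining pixels keep unique IDs).
    Returns summed scores between distinct nodes; intra-node edges vanish."""
    node_edges = defaultdict(float)
    for u, v, w in pixel_edges:
        nu, nv = node_of[u], node_of[v]
        if nu != nv:
            key = (nu, nv) if nu < nv else (nv, nu)
            node_edges[key] += w
    return dict(node_edges)

# Pixels 0-2 fall inside one coarse proposal (node 100); pixel 3 is free.
node_of = {0: 100, 1: 100, 2: 100, 3: 3}
pixel_edges = [(0, 1, 1.7), (1, 2, 2.0), (2, 3, -0.4), (1, 3, -0.9)]
print(merge_edge_scores(pixel_edges, node_of))   # {(3, 100): about -1.3}
```

Collapsing a proposal this way is what shrinks the node set at each cascade stage and hence accelerates the partition.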
Segmentation Refinement. In the previous steps, the partition is made within each class to speed up. At this step, the cues from both the semantic segmentation and affinity branches are integrated to segment instances from all the pixels classified as foreground. In practice, the average affinity $\alpha_{u,v}$ for pixels $\{u, v\}$ is refined to $\alpha'_{u,v}$ by:

$$\alpha'_{u,v} = \alpha_{u,v} \cdot \exp\!\left[-D_{JS}(s_u \,\|\, s_v)\right], \qquad (6)$$
$$D_{JS}(P \,\|\, Q) = \frac{1}{2}\left[ D_{KL}\!\left(P \,\Big\|\, \frac{P+Q}{2}\right) + D_{KL}\!\left(Q \,\Big\|\, \frac{P+Q}{2}\right) \right], \qquad (7)$$
$$D_{KL}(P \,\|\, Q) = \sum_i P_i \log \frac{P_i}{Q_i}. \qquad (8)$$

Here, $s_u = [s_u^1, s_u^2, \dots, s_u^c]$ and $s_v = [s_v^1, s_v^2, \dots, s_v^c]$ are the semantic segmentation scores for the $c$ object classes at pixels $u$ and $v$, representing the classification probability distributions over the $c$ object classes. The distance between the two distributions is measured with the popular Jensen-Shannon divergence, as described in Eqs. (7)-(8). After this refinement of the initial affinities, graph partition is conducted over all the foreground pixels at the 1/4 resolution. By combining the information from the semantic segmentation and affinity branches, the errors in instance segmentation caused by semantic segmentation failures are significantly reduced, as shown in Fig. 4.
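A sketch of the refinement in Eqs. (6)-(8), assuming `s_u` and `s_v` are already softmax-normalised class-score vectors; the small epsilon is our own numerical guard and is not part of the paper's formulation.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """D_KL(P || Q) of Eq. (8) for discrete distributions."""
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """Jensen-Shannon divergence of Eq. (7)."""
    m = 0.5 * (np.asarray(p, dtype=np.float64) + np.asarray(q, dtype=np.float64))
    return 0.5 * (kl(p, m) + kl(q, m))

def refine_affinity(alpha_uv, s_u, s_v):
    """Eq. (6): damp the affinity when the two pixels' class distributions
    disagree, leave it essentially untouched when they agree."""
    return alpha_uv * np.exp(-js(s_u, s_v))

same_class = [0.90, 0.05, 0.05]
other_class = [0.10, 0.85, 0.05]
print(refine_affinity(0.8, same_class, same_class))    # about 0.8, kept
print(refine_affinity(0.8, same_class, other_class))   # noticeably reduced
```

The damping only suppresses affinities across pixels whose class posteriors disagree, which is why the class-agnostic foreground partition can still recover instances that the per-class partition would have split along semantic-segmentation errors.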
Figure 4. Influence of segmentation refinement (SR). (a) Input image. (b) Semantic segmentation. (c) Instance segmentation without SR. (d) Instance segmentation with SR. (e) Ground truth. SR significantly reduces the errors in instance segmentation that are caused by semantic segmentation failures. Best viewed in color and zoom.

Finally, the class label for each instance is obtained by voting among all of its pixels based on the semantic segmentation labels. Following DWT [3], small instances are removed, and semantic scores from the semantic segmentation are used to rank predictions.
4. Experiments

Dataset. Our model is evaluated on the challenging urban street-scene dataset Cityscapes [12]. In this dataset, each image has a high resolution of 1,024x2,048 pixels. There are 5,000 images with high-quality dense pixel annotations and 20,000 images with coarse annotations. Note that only the finely annotated set is used to train our model. The Cityscapes benchmark evaluates 8 classes for instance segmentation. Together with another 11 background classes, 19 classes are evaluated for semantic segmentation.

Metrics. The main metric for evaluation is Average Precision (AP), which is calculated by averaging the precisions under IoU (Intersection over Union) thresholds from 0.50 to 0.95 with a step of 0.05. Our result is also reported with three sub-metrics from Cityscapes: AP50%, AP100m and AP50m. They are calculated at a 0.5 IoU threshold or only for objects within specific distances.

This paper also evaluates the results with a newer metric, Panoptic Quality (PQ) [27], which is further divided into Segmentation Quality (SQ) and Recognition Quality (RQ) to measure segmentation and recognition performance respectively. PQ is defined as:

$$PQ = \underbrace{\frac{\sum_{(p,g) \in TP} \mathrm{IoU}(p, g)}{|TP|}}_{\text{Segmentation Quality (SQ)}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\text{Recognition Quality (RQ)}}, \qquad (9)$$

where $p$ and $g$ are the predicted and ground truth segments, and $TP$, $FP$ and $FN$ denote matched pairs of segments, unmatched predicted segments and unmatched ground truth segments respectively. Moreover, both countable objects (things) and uncountable regions (stuff) are evaluated in PQ and are separately reported as PQTh and PQSt. As stuff is not the focus of this work, only PQ and PQTh are reported.
i
Influence of Cascaded Graph Partition At this part, the
where Lis and Lia
are multi-class focal loss [35] and av- proposed cascaded graph partition module is analyzed by
erage L2 loss (see Eq. 1) for semantic segmentation and being initialized from each resolution. As shown in Fig. 5,
affinity branches at the ith resolution in [ 14 , 18 , ..., 64
1
] res- the running time for graph partition increases rapidly w.r.t.
olutions respectively. To combine losses from each scale, the size of object regions when conducting the partition at
we firstly tune the balancing parameter λi to make losses the 14 resolution directly, without the guidance of instance
of each scale are in the same order, which are finally set to proposals. However, the time significantly reduces when
[0.01, 0.03, 0.1, 0.3, 1] respectively. After that, α is set to initializing the cascaded graph partition from lower reso-
1
0.003 to balance the losses of affinity pyramid and seman- lutions, like the 16 resolution, where the graph partition
1 1 1
tic segmentation. The influence of α and λi are shown in is constructed at the [ 16 , 8 , 4 ] resolutions sequentially, and
Table 1. We run all experiments using the MXNet frame- the latter two are guided by the proposals from the previous
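A schematic of how the per-scale losses of Eq. (10) could be combined with the reported weights; the per-branch loss values here are placeholders, and in the actual MXNet model they would come from the focal-loss and affinity-L2 branches at each resolution.

```python
# Balancing weights reported in the paper: lambda_i for the
# {1/4, 1/8, 1/16, 1/32, 1/64} affinity branches, and a global alpha.
LAMBDAS = [0.01, 0.03, 0.1, 0.3, 1.0]
ALPHA = 0.003

def total_loss(sem_losses, aff_losses, lambdas=LAMBDAS, alpha=ALPHA):
    """Eq. (10): L = sum_i (L_s^i + alpha * lambda_i * L_a^i).
    sem_losses / aff_losses: per-scale focal / average-L2 loss values."""
    return sum(ls + alpha * lam * la
               for ls, la, lam in zip(sem_losses, aff_losses, lambdas))

# Placeholder per-scale values, just to show the weighting.
print(total_loss(sem_losses=[0.6, 0.5, 0.4, 0.3, 0.2],
                 aff_losses=[40.0, 30.0, 20.0, 10.0, 5.0]))
```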
alpha    lambda  AP (%)  PQTh (%)  PQ (%)
0.0003   diff.   27.5    45.0      54.6
0.001    diff.   29.5    48.0      55.8
0.003    diff.   31.5    49.2      56.6
0.003    same    31.0    48.7      56.2
0.01     diff.   31.0    49.2      56.3
0.03     diff.   28.1    46.4      53.4

Table 1. Influence of the balancing parameters.

Influence of Joint Learning. Our separately trained semantic segmentation model achieves 74.5% mIoU. This result is significantly improved after being jointly trained with the affinity pyramid, as shown in Table 2. However, the performance of both instance and semantic segmentation is affected by the affinity window size. A similar phenomenon is observed by Ke et al. [24], who explain that small windows benefit small objects while large windows benefit large objects. Due to the limitation of GPU memory, window sizes from 3 to 9 are tested. Among them, the 5 x 5 affinity window best balances this conflict and achieves the best performance, so it is used in the other experiments. Furthermore, in our proposed model, the semantic segmentation and affinity pyramid are jointly learned along the hierarchy of the U-shape network. We compare this approach with generating all layers of the affinity pyramid from the single 1/4-resolution feature map with corresponding strides; the use of dilated convolution [6] is also tested. Table 3 shows that our approach performs best: the mutual benefits of the two tasks are exploited and finally improve the performance of instance segmentation.

r   AP (%)  PQTh (%)  PQ (%)  mIoU (%)
0   -       -         -       74.5
3   30.5    48.5      56.4    75.0
5   31.3    49.0      56.5    75.0
7   31.2    48.1      56.0    75.1
9   30.0    46.2      55.0    74.3

Table 2. Influence of the affinity window size r. mIoU for semantic segmentation evaluation is also provided. r = 0 means training semantic segmentation only.

Figure 5. Running time for the cascaded graph partition module under different object sizes. The cascade scheme significantly reduces the time for large objects. Best viewed in color.

Feature                 JL   AP (%)  PQTh (%)  PQ (%)  mIoU (%)
Single (w/o dilation)        29.4    46.9      54.9    74.5
Single (w/o dilation)   yes  30.2    47.6      55.0    74.2
Single (w/ dilation)         30.6    48.2      55.5    74.5
Single (w/ dilation)    yes  30.8    48.8      55.8    74.5
Hierarchical                 30.0    47.7      55.2    74.5
Hierarchical            yes  31.3    49.0      56.5    75.0

Table 3. JL: joint learning. Compared with learning all layers of the affinity pyramid from the single 1/4-resolution feature map, our hierarchical manner with joint learning performs better.

Influence of Cascaded Graph Partition. In this part, the proposed cascaded graph partition module is analyzed by initializing it from each resolution. As shown in Fig. 5, the running time for graph partition increases rapidly with the size of the object regions when the partition is conducted at the 1/4 resolution directly, without the guidance of instance proposals. However, the time drops significantly when the cascaded graph partition is initialized from lower resolutions, such as the 1/16 resolution, where the graph partition is constructed at the [1/16, 1/8, 1/4] resolutions sequentially and the latter two are guided by the proposals from the previous stage. The quantitative results are shown in Table 4. Compared with the 1/4-resolution initialization (non-cascaded), the 1/64-resolution initialization achieves a 5x acceleration. Importantly, the cascaded approach speeds up without sacrificing precision. As shown in Table 4, initializing from the 1/64 resolution yields a 2.0% absolute improvement in AP, because the proposals from lower resolutions reduce distracting information for prediction. Meanwhile, initializing from the 1/16 resolution performs better than the 1/64 and 1/32 variants, which indicates that proposals from resolutions that are too low still introduce errors. In the other experiments, cascaded graph partitions are initialized from the 1/16 resolution.

Init. Res.  GP time (s)  AP (%)  PQTh (%)  PQ (%)
1/4         1.26         28.9    45.1      54.9
1/8         0.33         31.3    49.2      56.6
1/16        0.26         31.5    49.2      56.6
1/32        0.26         30.9    48.8      56.5
1/64        0.26         30.9    48.7      56.5

Table 4. Influence of the initial resolution for the cascaded graph partition. With decreasing initial resolution, the GP time (running time of the cascaded graph partition per image) keeps decreasing. Compared with the 1/4-resolution initialization, initializing the cascaded graph partition from the 1/16 resolution achieves a 5x speedup with a 9% relative AP improvement.

Quantitative Results. First, to show the effectiveness of the long-range affinities, we start with only the affinities from the 1/4 resolution and gradually add longer-range affinities; the results are shown in Table 5. Then, the influence of balancing the training data, of the larger affinity loss for object pixels, and of employing a larger kernel is evaluated in Table 6. After that, as shown in Table 7, the segmentation refinement improves the performance by 2.8% AP. With test-time augmentation, our model achieves 34.4% AP and 58.4% PQ on the validation set. Our model is also trained with ResNet-101, which achieves 37.3% AP and 61.1% PQ, as shown in Table 8. On the test set, our model attains 32.7% AP, which exceeds all previous methods; details are given in Table 10.

Affinities Used  AP (%)  PQTh (%)  PQ (%)
A1 only          25.7    41.2      53.2
+A2              29.8    46.5      55.4
+A3              30.8    48.6      56.3
+A4              31.4    49.2      56.5
+A5              31.5    49.2      56.6

Table 5. Effectiveness of the long-range affinities. [A1, A2, ..., A5] are the affinities of the [1/4, 1/8, ..., 1/64] resolutions respectively. Affinities with longer range are gradually added.

BD   OL   Kernel  AP (%)  PQTh (%)  PQ (%)
          3       29.1    46.4      55.8
yes       3       30.0    48.8      56.0
yes  yes  3       31.3    49.0      56.5
yes  yes  5       31.5    49.2      56.6

Table 6. BD: balance the training data by randomly dropping 80% of the pixels with all-1 ground truth affinities. OL: 3x affinity loss for pixels belonging to object instances. Kernel: kernel size.

Backbone    SR   HF   MS   AP (%)  PQTh (%)  PQ (%)
ResNet-50                  28.7    45.4      55.1
ResNet-50   yes            31.5    49.2      56.6
ResNet-50   yes  yes       32.8    50.4      57.6
ResNet-50   yes  yes  yes  34.4    50.6      58.4
ResNet-101  yes  yes  yes  37.3    55.0      61.1

Table 7. SR: segmentation refinement. HF: horizontal-flipping test. MS: multi-scale test.

Method            AP (%)  PQTh (%)  PQ (%)  Backbone
Li et al. [32]    28.6    42.5      53.8    ResNet-101
SGN [37]          29.2    -         -       -
Mask R-CNN [20]   31.5    49.6 (1)  -       ResNet-50
GMIS [40]         34.1    -         -       ResNet-101
DeeperLab [48]    -       -         56.5    Xception-71 [10]
PANet [38]        36.5    -         -       ResNet-50
SSAP (ours)       34.4    50.6      58.4    ResNet-50
SSAP (ours)       37.3    55.0      61.1    ResNet-101

Table 8. Results on the Cityscapes val set. All results are trained with Cityscapes data only.

Visual Results. The proposals generated from the 1/16 and 1/8 resolutions are visualized in Fig. 6. A few sample results on the validation set are visualized in Fig. 7, where fine details are precisely captured. As shown in the second column, cars occluded by persons or poles and separated into parts are successfully grouped.

Results on COCO. To show the effectiveness of our method in scenarios other than street scenes, we evaluate it on the COCO dataset. The annotations for COCO instance segmentation contain overlaps, making them unsuitable for training and testing a proposal-free method like ours, so our method is evaluated on the panoptic segmentation task. To train on COCO, we resize the longer edge to 640 and train the model with 512 x 512 crops. The number of iterations is 80,000 and the learning rate is divided by 10 at 60,000 and 70,000 iterations. The other experimental settings remain the same. The performance of our model (ResNet-101 based) is summarized in Table 9. To the best of our knowledge, DeeperLab [48] is currently the only proposal-free method to report COCO results. Our method outperforms DeeperLab (Xception-71 based) in all sub-metrics.

Method          PQ [val]  PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt
DeeperLab [48]  33.8      34.3  77.1  43.1  37.5  77.5  46.8  29.6  76.4  37.4
SSAP (ours)     36.5      36.9  80.7  44.8  40.1  81.6  48.5  32.0  79.4  39.3

Table 9. Results on the COCO val ('PQ [val]' column) and test-dev (remaining columns) sets. Results are reported as percentages.

(1) This result is reported by Kirillov et al. [27].
Method             Training data  AP    AP50%  AP50m  AP100m  person  rider  car   truck  bus   train  motor  bicycle
InstanceCut [28]   fine + coarse  13.0  27.9   26.1   22.1    10.0    8.0    23.7  14.0   19.5  15.2   9.3    4.7
Multi-task [25]    fine           21.6  39.0   37.0   35.0    19.2    21.4   36.6  18.8   26.8  15.9   19.4   14.5
SGN [37]           fine + coarse  25.0  44.9   44.5   38.9    21.8    20.1   39.4  24.8   33.2  30.8   17.7   12.4
Mask R-CNN [20]    fine           26.2  49.9   40.1   37.6    30.5    23.7   46.9  22.8   32.2  18.6   19.1   16.0
GMIS [40]          fine + coarse  27.3  45.6   -      -       31.5    25.2   42.3  21.8   37.2  28.9   18.8   12.8
Neven et al. [43]  fine           27.6  50.9   -      -       34.5    26.1   52.4  21.7   31.2  16.4   20.1   18.9
PANet [38]         fine           31.8  57.1   46.0   44.2    36.8    30.4   54.8  27.0   36.3  25.5   22.6   20.8
SSAP (ours)        fine           32.7  51.8   51.4   47.3    35.4    25.5   55.9  33.2   43.9  31.9   19.5   16.2

Table 10. Results on the Cityscapes test set. All results are trained with Cityscapes data only. Results are reported as percentages.

[Figure 6 panels: Image, Proposals from 1/16 Res., Proposals from 1/8 Res., Instance Seg., Ground Truth.]

Figure 6. Visualizations of proposals generated from lower resolutions within the cascaded graph partition module and the final instance segmentation results. Best viewed in color and zoom.

[Figure 7 panels: Semantic Seg. and Instance Seg. pairs.]

Figure 7. Visualizations of sampled results on the validation set. Best viewed in color and zoom.

5. Conclusion

This work has proposed a single-shot proposal-free instance segmentation method that requires only one single pass to generate instances. Our method is based on a novel affinity pyramid for distinguishing instances, which can be jointly learned with the pixel-level semantic class labels using a single backbone network. Experimental results have shown that the two sub-tasks mutually benefit from our joint learning scheme, which further boosts instance segmentation. Moreover, a cascaded graph partition module has been developed to segment instances with the affinity pyramid and the semantic segmentation results. Compared with the non-cascaded way, this module achieves a 5x speedup and a 9% relative improvement on AP. Our approach achieves a new state of the art on the challenging Cityscapes dataset.

Acknowledgment

This work is supported in part by the National Key Research and Development Program of China (Grant No. 2016YFB1001005), the National Natural Science Foundation of China (Grant No. 61673375 and No. 61602485), and the Projects of the Chinese Academy of Sciences (Grant No. QYZDB-SSW-JSC006).
References

[1] Jiwoon Ahn and Suha Kwak. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In CVPR, 2018. 3
[2] Anurag Arnab and Philip H. S. Torr. Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, 2017. 2
[3] Min Bai and Raquel Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017. 2, 5
[4] Gedas Bertasius, Lorenzo Torresani, Stella X Yu, and Jianbo Shi. Convolutional random walk networks for semantic image segmentation. In CVPR, 2017. 3
[5] Bert De Brabandere, Davy Neven, and Luc Van Gool. Semantic instance segmentation with a discriminative loss function. arXiv:1708.02551, 2017. 2
[6] Liangchieh Chen, George Papandreou, Iasonas Kokkinos, Kevin P Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 40(4), 2018. 1, 2, 6
[7] Liang-Chieh Chen, Alexander Hermans, George Papandreou, Florian Schroff, Peng Wang, and Hartwig Adam. Masklab: Instance segmentation by refining object detection with semantic and direction features. In CVPR, 2018. 2
[8] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017. 1, 2
[9] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274, 2015. 6
[10] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017. 7
[11] Sunil Chopra and M R Rao. The partition problem. Mathematical Programming, 59(1), 1993. 4
[12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 5
[13] Jifeng Dai, Kaiming He, Yi Li, Shaoqing Ren, and Jian Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016. 2
[14] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016. 2
[15] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, 2016. 1, 2
[16] Timothy Dozat. Incorporating nesterov momentum into adam. 2016. 6
[17] Alireza Fathi, Zbigniew Wojna, Vivek Rathod, Peng Wang, Hyun Oh Song, Sergio Guadarrama, and Kevin P. Murphy. Semantic instance segmentation via deep metric learning. arXiv:1703.10277, 2017. 2
[18] Ross Girshick. Fast r-cnn. In ICCV, 2015. 1
[19] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 1
[20] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In ICCV, 2017. 2, 7, 8
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 6
[22] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. 6
[23] Tsungwei Ke, Jyhjing Hwang, Ziwei Liu, and Stella X Yu. Adaptive affinity fields for semantic segmentation. In ECCV, 2018. 3
[24] Tsung-Wei Ke, Jyh-Jing Hwang, Ziwei Liu, and Stella X. Yu. Adaptive affinity fields for semantic segmentation. In ECCV, 2018. 3, 6
[25] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, 2018. 8
[26] Margret Keuper, Evgeny Levinkov, Nicolas Bonneel, Guillaume Lavoue, Thomas Brox, and Bjoern Andres. Efficient decomposition of image and mesh graphs by lifted multicuts. In ICCV, 2015. 2, 4, 5
[27] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollar. Panoptic segmentation. In CVPR, 2019. 5, 7
[28] Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, and Carsten Rother. Instancecut: From edges to instances with multicut. In CVPR, 2017. 1, 2, 8
[29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. 1
[30] Yann Lecun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998. 1
[31] Evgeny Levinkov, Jonas Uhrig, Siyu Tang, Mohamed Omran, Eldar Insafutdinov, Alexander Kirillov, Carsten Rother, Thomas Brox, Bernt Schiele, and Bjoern Andres. Joint graph decomposition & node labeling: Problem, algorithms, applications. In CVPR, 2017. 2
[32] Qizhu Li, Anurag Arnab, and Philip H.S. Torr. Weakly- and semi-supervised panoptic segmentation. In ECCV, 2018. 2, 7
[33] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017. 2
[34] Xiaodan Liang, Yunchao Wei, Xiaohui Shen, Jianchao Yang, Liang Lin, and Shuicheng Yan. Proposal-free network for instance-level object segmentation. arXiv:1509.02636, 2015. 1
[35] Tsungyi Lin, Priya Goyal, Ross B Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In ICCV, 2017. 6
[36] Sifei Liu, Shalini De Mello, Jinwei Gu, Guangyu Zhong, Ming-Hsuan Yang, and Jan Kautz. Learning affinity via spatial propagation networks. In NIPS, 2017. 3
[37] Shu Liu, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. Sgn:
Sequential grouping networks for instance segmentation. In
ICCV, 2017. 2, 7, 8
[38] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia.
Path aggregation network for instance segmentation. In
CVPR, 2018. 2, 7, 8
[39] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian
Szegedy, Scott E Reed, Chengyang Fu, and Alexander C
Berg. SSD: Single shot multibox detector. In ECCV, 2016.
1, 2
[40] Yiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu,
Houqiang Li, and Yan Lu. Affinity derivation and graph
merge for instance segmentation. In ECCV, 2018. 1, 3, 7, 8
[41] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully
convolutional networks for semantic segmentation. In
CVPR, 2015. 1
[42] Michael Maire, Takuya Narihira, and Stella X Yu. Affin-
ity cnn: Learning pixel-centric pairwise relations for fig-
ure/ground embedding. In CVPR, 2016. 3
[43] Davy Neven, Bert De Brabandere, Marc Proesmans, and
Luc Van Gool. Instance segmentation by jointly optimiz-
ing spatial embeddings and clustering bandwidth. In CVPR,
2019. 2, 8
[44] Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster,
stronger. In CVPR, 2017. 1
[45] Mengye Ren and Richard S Zemel. End-to-end instance seg-
mentation with recurrent attention. In CVPR, 2017. 2
[46] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.
Faster r-cnn: Towards real-time object detection with region
proposal networks. In NIPS, 2015. 1, 2
[47] Bernardino Romeraparedes and Philip H S Torr. Recurrent
instance segmentation. In ECCV, 2016. 2
[48] Tien-Ju Yang, Maxwell D. Collins, Yukun Zhu, Jyh-Jing
Hwang, Ting Liu, Xiao Zhang, Vivienne Sze, George Pa-
pandreou, and Liang-Chieh Chen. Deeperlab: Single-shot
image parser. arXiv:1902.05093, 2019. 2, 7
[49] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang
Wang, and Jiaya Jia. Pyramid scene parsing network. In
CVPR, 2017. 1, 2
[50] Shuai Zheng, Sadeep Jayasumana, Bernardino Romerapare-
des, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang,
and Philip H S Torr. Conditional random fields as recurrent
neural networks. In ICCV, 2015. 2
