

Unsupervised Learning of Image Segmentation Based on Differentiable Feature Clustering

Wonjik Kim*, Member, IEEE, Asako Kanezaki*, Member, IEEE, and Masayuki Tanaka, Member, IEEE

Abstract—The usage of convolutional neural networks (CNNs) for unsupervised image segmentation was investigated in this study. Similar to supervised image segmentation, the proposed CNN assigns labels to pixels that denote the cluster to which the pixel belongs. In unsupervised image segmentation, however, no training images or ground truth labels of pixels are specified beforehand. Therefore, once a target image is input, the pixel labels and feature representations are jointly optimized, and their parameters are updated by gradient descent. In the proposed approach, label prediction and network parameter learning are alternately iterated to meet the following criteria: (a) pixels of similar features should be assigned the same label, (b) spatially continuous pixels should be assigned the same label, and (c) the number of unique labels should be large. Although these criteria are incompatible, the proposed approach minimizes the combination of a similarity loss and a spatial continuity loss to find a plausible label assignment that balances the aforementioned criteria well. The contributions of this study are four-fold. First, we propose a novel end-to-end network for unsupervised image segmentation that consists of normalization and an argmax function for differentiable clustering. Second, we introduce a spatial continuity loss function that mitigates the limitation of fixed segment boundaries possessed by previous work. Third, we present an extension of the proposed method for segmentation with scribbles as user input, which showed better accuracy than existing methods while maintaining efficiency. Finally, we introduce another extension of the proposed method: unseen image segmentation using networks pre-trained with a few reference images, without re-training the networks. The effectiveness of the proposed approach was examined on several benchmark datasets of image segmentation.

Index Terms—convolutional neural networks, unsupervised learning, feature clustering.

*W. Kim and A. Kanezaki contributed equally to this work.
The authors are with the Tokyo Institute of Technology, Tokyo 152-8550, Japan (e-mail: [email protected]; [email protected]; [email protected]). The authors are also with the National Institute of Advanced Industrial Science and Technology, Tokyo 135-0061, Japan.
© 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. This paper was accepted by the IEEE Transactions on Image Processing in July 2020.

I. INTRODUCTION

IMAGE segmentation has garnered attention in computer vision research for decades. The applications of image segmentation include object detection, texture recognition, and image compression. In supervised image segmentation, a set consisting of pairs of images and pixel-level semantic labels, such as "sky" or "bicycle", is used for training. The objective is to train a system that classifies the labels of the known categories for the image pixels. In contrast, unsupervised image segmentation is used to predict more general labels, such as "foreground" and "background". The latter is more challenging than the former. Furthermore, it is extremely difficult to segment an image into an arbitrary number (≥ 2) of plausible regions. This study considers a problem in which an image is partitioned into an arbitrary number of salient or meaningful regions without any previous knowledge.

Once the pixel-level feature representation is obtained, image segments can be obtained by clustering the feature vectors. However, the design of the feature representation remains a challenge. The desired feature representation depends considerably on the content of the target image. For instance, if the objective is to detect zebras as a foreground, the feature representation should be reactive to black-white vertical stripes. Therefore, the pixel-level features should be descriptive of the colors and textures of a local region surrounding each pixel. Recently, convolutional neural networks (CNNs) have been successfully applied to semantic image segmentation in supervised learning scenarios such as autonomous driving and augmented reality games. CNNs are not often used in completely unsupervised scenarios; however, they have great potential for extracting detailed features from image pixels, which is necessary for unsupervised image segmentation. Driven by the high feature descriptiveness of CNNs, a joint learning approach is presented that predicts, for an arbitrary image input, unknown cluster labels and learns the optimal CNN parameters for the image pixel clustering. Subsequently, a group of image pixels in each cluster is extracted as a segment.

The characteristics of the cluster labels that are necessary for good image segmentation are discussed further. Similar to previous studies on unsupervised image segmentation [1], [2], it is assumed that a good image segmentation solution matches well with a solution that a human would provide. When a human is asked to segment an image, they would most likely create segments, each of which corresponds to the whole or a salient part of a single object instance. An object instance tends to contain large regions of similar colors or texture patterns. Therefore, grouping spatially continuous pixels that have similar colors or texture patterns into the same cluster is a reasonable strategy for image segmentation. To separate segments from different object instances, it is better to assign different cluster labels to neighboring pixels with dissimilar patterns. To facilitate the cluster separation, a strategy in which a large number of unique cluster labels is desired is considered as well. In conclusion, the following three criteria for the prediction of cluster labels are introduced:
(a) Pixels of similar features should be assigned the same label.
(b) Spatially continuous pixels should be assigned the same label.
(c) The number of unique cluster labels should be large.

In this paper, we propose a CNN-based algorithm that jointly optimizes feature extraction functions and clustering functions to satisfy these criteria. Here, in order to enable end-to-end learning of a CNN, an iterative approach to predict cluster labels using differentiable functions is proposed. The code is available online¹.

¹https://github.com/kanezaki/pytorch-unsupervised-segmentation-tip/

This study is an extension of our previous research published at the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018 [3]. In the previous work, superpixel extraction using simple linear iterative clustering [4] was employed for criterion (b). However, the previous algorithm had the limitation that the boundaries of the segments were fixed in the superpixel extraction process. In this study, a spatial continuity loss is proposed as an alternative that mitigates the aforementioned limitation. In addition, two new applications based on our improved unsupervised segmentation method are introduced: segmentation with user input and utilization of network weights obtained by unsupervised learning on different images. As the proposed method is completely unsupervised, it segments images based on their nature, which is not always related to the user's intention. As an exemplar application of the proposed method, scribbles were used as user input, and the effect was compared with other existing methods. Furthermore, the proposed method incurs a high calculation cost because it iteratively obtains the segmentation result of a single input image. Therefore, as another potential application of the proposed method, network weights pre-trained with several reference images were used. Once the network weights are obtained from several images using the proposed algorithm, a new unseen image can be segmented by the fixed network, provided it is somewhat similar to the reference images. The utilization of this technique for a video segmentation task was demonstrated as well.

The contributions of this paper are summarized as follows.
• We proposed a novel end-to-end differentiable network for unsupervised image segmentation.
• We introduced a spatial continuity loss function that mitigates the limitations of our previous method [3].
• We presented an extension of the proposed method for segmentation with scribbles as user input, which showed better accuracy than existing methods while maintaining efficiency.
• We introduced another extension of the proposed method: unseen image segmentation by using networks pre-trained with a few reference images, without re-training the networks.

II. RELATED WORK

Image segmentation is the process of assigning labels to all the pixels within an image such that pixels sharing certain characteristics are assigned the same labels. Classical image segmentation can be performed by, e.g., k-means clustering [5], which is a de facto standard method for vector quantization. The k-means clustering aims to assign the target data to k clusters in which each datum belongs to the cluster with the nearest mean. The graph-based segmentation method (GS) [6] is another example that makes simple greedy decisions for image segmentation. It produces segmentation results that are neither too coarse nor too fine globally, based on a particular region comparison function. Similar to these classical methods, the proposed method in this study aims to perform unsupervised image segmentation. Recently, a few learning-based methods for unsupervised image segmentation have been proposed [7], [8], [9]. MsLRR [7] is an efficient and versatile approach that can be switched between unsupervised and supervised modes. MsLRR [7] employed superpixels (as did our previous work [3]), which caused the limitation that the boundaries are fixed to those of the superpixels. W-Net [8] performs unsupervised segmentation by estimating a segmentation from an input image and restoring the input image from the estimated segmentation. Therefore, similar pixels are assigned to the same label, although it does not estimate the boundary of each segment. Croitoru et al. [9] proposed an unsupervised segmentation method based on deep neural network techniques. Whereas this method performs binary foreground/background segmentation, our method generates an arbitrary number of segments. A comprehensive survey of deep learning techniques for image segmentation is presented in [10].

The remainder of this section introduces image segmentation with user input, weakly-supervised image segmentation based on CNNs, and methods for unsupervised deep learning.

Image segmentation with user input: Graph cut is a common method for image segmentation that works by minimizing the cost of a graph in which image pixels correspond to the nodes. This algorithm can be applied to image segmentation with certain user inputs such as scribbles [11] and bounding boxes [12]. Image matting is commonly used for image segmentation with user input [13], [14] as well. The distinguishing characteristic of image matting is the soft assignment of pixel labels, whereas graph cuts produce a hard segmentation in which every pixel belongs to either the foreground or the background. Constrained random walks [15] were proposed to achieve interactive image segmentation with more flexible user input, which allows scribbles to specify the boundary regions as well as the foreground/background seeds. Recently, a quadratic optimization problem related to dominant-set clusters has been solved with several types of user input: scribbles, sloppy contours, and bounding boxes [16].

The abovementioned methods chiefly produce a binary map that separates image pixels into foreground and background. In order to apply the graph cut to multi-label segmentation problems, the α-β swap and α-expansion algorithms were proposed in [17]. Both algorithms repeatedly find the global minimum of a binary labeling problem. In the α-expansion algorithm, an expansion move is defined for a label α to increase the set of pixels that are given this label. This algorithm finds a local minimum such that no expansion move for any label α yields a labeling with lower energy.
A swap move takes some subset of the pixels presently labeled α and assigns them the label β, and vice versa, for a pair of labels α, β. The α-β swap algorithm finds a minimum state such that there is no swap move for any pair of labels α, β that produces a lower-energy labeling.

Weakly-supervised image segmentation based on CNN: Semantic image segmentation based on CNNs has been gaining importance in the literature [18], [19], [20], [21]. As pixel-level annotations for image segmentation are difficult to obtain, weakly supervised learning approaches that train using object detectors [22], [23], [24], object bounding boxes [25], [26], image-level class labels [27], [28], [29], [30], or scribbles [31], [32], [33] are widely used.

Most of the weakly supervised segmentation algorithms [31], [25], [26], [30] generate a training target from the weak labels and update their models using the generated training set. Therefore, these methods follow an iterative process that alternates between two steps: (1) gradient descent for training a CNN-based model from the generated target and (2) training target generation from the weak labels. For example, ScribbleSup [31] propagates the semantic labels of scribbles to other pixels using superpixels so as to completely annotate the images (step 1) and learns a convolutional neural network for semantic segmentation with the annotated images (step 2). In the case of e-SVM [25], segment proposals are generated from bounding box annotations or pixel-level annotations using CPMC segments [34] (step 1), and the model is trained with the generated segment proposals (step 2). Shimoda et al. [30] estimated class saliency maps using image-level annotations (step 1) and applied a fully-connected CRF [35] with the estimated saliency maps as unary potentials (step 2). These iterative processes carry the danger that convergence is not guaranteed: errors in training target generation with weak labels might drive the entire algorithm to update the model in an undesired direction. Therefore, recent approaches [33], [32], [36] for avoiding such errors in training target generation with weak labels have been proposed. In this study, to deal with the convergence problem, an end-to-end differentiable segmentation algorithm based on a CNN is proposed.

Unsupervised deep learning: Unsupervised deep learning approaches are mainly focused on learning high-level feature representations using generative models [37], [38], [39]. The idea behind these studies is closely related to the conjecture in neuroscience that there exist neurons that represent specific semantic concepts. In contrast, the application of deep learning to image segmentation and the importance of high-level features extracted with convolutional layers are investigated in this study. Deep CNN filters are known to be effective for texture recognition and segmentation [40], [41].

Notably, the convolution filters used in the proposed method are trainable with the standard backpropagation algorithm, although there are no ground truth labels. The present study is therefore related to the recent research on deep embedded clustering (DEC) [42]. The DEC algorithm iteratively refines clusters by minimizing a KL divergence loss between the soft-assigned data points and an auxiliary target distribution, whereas the proposed method simply minimizes the softmax loss based on the estimated clusters. Similar approaches such as maximum margin clustering [43] and discriminative clustering [44], [45] have been proposed for semi-supervised learning frameworks; however, the proposed method is focused on completely unsupervised image segmentation.

III. METHOD

The problem solved for image segmentation is described as follows. For simplicity, let {·} denote {·}_{n=1}^{N} unless otherwise noted, where N denotes the number of pixels in an input color image I = {v_n ∈ R^3}. Let f : R^3 → R^p be a feature extraction function and {x_n ∈ R^p} be a set of p-dimensional feature vectors of the image pixels. Cluster labels {c_n ∈ Z} are assigned to all of the pixels by c_n = g(x_n), where g : R^p → Z denotes a mapping function. Here, g can be an assignment function that returns the label of the cluster centroid closest to x_n. For the case in which f and g are fixed, {c_n} are obtained using the abovementioned equation. Conversely, if f and g are trainable whereas {c_n} are specified (fixed), then the aforementioned equation can be regarded as a standard supervised classification problem. The parameters of f and g in this case can be optimized by gradient descent if f and g are differentiable. However, in the present study, unknown {c_n} are predicted while training the parameters of f and g in a completely unsupervised manner. To put this into practice, the following two sub-problems are solved: prediction of the optimal {c_n} with fixed f and g, and training of the parameters of f and g with fixed {c_n}.

Notably, the three criteria introduced in Sec. I are incompatible and are never satisfied perfectly. One possible solution for addressing this problem with classical methods is: applying k-means clustering to {x_n} for (a), performing a graph cut algorithm [17] using distances to centroids for (b), and determining k in the k-means clustering using a non-parametric method for (c). However, these classical methods are only applicable to fixed {x_n}, and therefore the solution can be suboptimal. Therefore, a CNN-based algorithm is proposed to solve the problem. The feature extraction function for {x_n} and the clustering into {c_n} are jointly optimized in a manner that satisfies all the aforementioned criteria. In order to enable end-to-end learning of a CNN, an iterative approach to predict {c_n} using differentiable functions is proposed.

A CNN structure is proposed, as shown in Fig. 1, along with a loss function to satisfy the three criteria described in Sec. I. The proposed CNN architecture, which addresses criteria (a) and (c), is detailed in Section III-A. The loss function, which addresses criteria (a) and (b), is presented in Section III-B. The details of training the CNN using backpropagation are described in Sec. III-C.

A. Network architecture

1) Constraint on feature similarity: We consider the first criterion of assigning the same label to pixels with similar characteristics. The proposed solution is to apply a linear classifier that classifies the features of each pixel into q classes. In this study, we assume the input to be an RGB image I = {v_n ∈ R^3}, where each pixel value is normalized to [0, 1].
Fig. 1: Illustration of the proposed algorithm for training a CNN. Input image I is fed into the CNN to extract deep features {x_n} using a feature extraction module. Subsequently, a one-dimensional (1D) convolutional layer calculates the response vectors {r_n} of the features in the q-dimensional cluster space, where q = 3 in this illustration. Here, z_1, z_2, and z_3 represent the three axes of the cluster space. Subsequently, the response vectors are normalized across the axes of the cluster space using a batch normalization function. Further, cluster labels {c_n} are determined by assigning cluster IDs to the response vectors using an argmax function. The cluster labels are then used as pseudo targets to compute the feature similarity loss. Finally, the spatial continuity loss as well as the feature similarity loss are computed and backpropagated.

A p-dimensional feature map {x_n} is computed from {v_n} through M convolutional components, each of which consists of a two-dimensional (2D) convolution, a ReLU activation function, and a batch normalization function, where a batch corresponds to the N pixels of a single input image. Here, we set p filters of region size 3 × 3 for all of the M components. Notably, these components for feature extraction can be replaced by alternatives such as fully convolutional networks (FCN) [20]. Subsequently, a response map {r_n = W_c x_n} is obtained by applying a linear classifier, where W_c ∈ R^{q×p}. The response map is then normalized to {r'_n} such that {r'_n} has zero mean and unit variance. The motivation behind the normalization process is described in Sec. III-A2. Finally, the cluster label c_n for each pixel is obtained by selecting the dimension that has the maximum value in r'_n. This classification rule is referred to as the argmax classification. This processing corresponds intuitively to the clustering of the feature vectors into q clusters. The ith cluster of the final responses {r'_n} can be written as:

  C_i = { r'_n ∈ R^q | r'_{n,i} ≥ r'_{n,j}, ∀j },

where r'_{n,i} denotes the ith element of r'_n. This is equivalent to assigning each pixel to the closest point among q representative points, which are placed at infinite distance on the respective axes of the q-dimensional space. Notably, C_i can be ∅, and therefore the number of unique cluster labels can arbitrarily range from 1 to q.

2) Constraint on the number of unique cluster labels: In unsupervised image segmentation, there is no clue as to how many segments should be generated in an image. Therefore, the number of unique cluster labels should be adaptive to the image content. As described in Sec. III-A1, the proposed strategy is to classify pixels into an arbitrary number q' (1 ≤ q' ≤ q) of clusters, where q is the maximum possible value of q'. A large q' indicates oversegmentation, whereas a small q' indicates undersegmentation. To train the neural network, we set a large number as the initial (maximum) number of cluster labels q. Then, in the iterative update process, similar or spatially close pixels are integrated by the feature similarity and spatial continuity constraints. This leads to a reduction in the number of unique cluster labels q', even though there is no explicit constraint on q'.

As shown in Fig. 1, the proposed clustering function based on argmax classification corresponds to q'-class clustering, where the q' anchors correspond to a subset of the q points at infinity on the q axes. The aforementioned criteria (a) and (b) only facilitate the grouping of pixels, which could lead to the trivial solution q' = 1. To prevent this kind of undersegmentation failure, the third criterion (c) is introduced, which is the preference for a large q'. The proposed solution is to insert an intra-axis normalization process for the response map {r_n} before assigning cluster labels using the argmax classification. Here, batch normalization [46] is used. This operation, also known as whitening, converts the original responses {r_n} to {r'_n}, where each axis has zero mean and unit variance. This gives each r'_{n,i} (i = 1, . . . , q) an even chance to be the maximum value of r'_n across the axes. Although this operation does not guarantee that every cluster index i (i = 1, . . . , q) achieves the maximum value for some n (n = 1, . . . , N), many cluster indices will do so because of it. Consequently, this intra-axis normalization process gives the proposed system a preference for a large q'.
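To make the architecture concrete, the following is a minimal PyTorch sketch of the pipeline just described: M convolutional components (3 × 3 convolution, ReLU, and batch normalization), the linear classifier W_c implemented as a 1 × 1 convolution, intra-axis batch normalization of the responses, and argmax classification. This is our own illustrative re-implementation, not the authors' released code (see the repository linked in Sec. I); the module names and input sizes are placeholders.

```python
import torch
import torch.nn as nn

class UnsupSegNet(nn.Module):
    """Sketch of the proposed network: feature extractor + linear classifier."""
    def __init__(self, in_ch=3, p=100, q=100, M=3):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(M):  # M components: 3x3 conv -> ReLU -> batch norm
            layers += [nn.Conv2d(ch, p, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.BatchNorm2d(p)]
            ch = p
        self.features = nn.Sequential(*layers)            # f: pixels -> {x_n}
        self.classifier = nn.Conv2d(p, q, kernel_size=1)  # W_c in R^{q x p}
        self.norm = nn.BatchNorm2d(q)                     # intra-axis normalization

    def forward(self, img):               # img: (1, 3, H, W), values in [0, 1]
        x = self.features(img)            # deep features {x_n}
        r = self.classifier(x)            # response map {r_n}
        return self.norm(r)               # normalized responses {r'_n}

net = UnsupSegNet()
responses = net(torch.rand(1, 3, 64, 64))
labels = responses.argmax(dim=1)          # cluster labels {c_n}, shape (1, H, W)
```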
B. Loss function

The proposed loss function L consists of a constraint on feature similarity and a constraint on spatial continuity, denoted as follows:

  L = L_sim({r'_n, c_n}) + µ L_con({r'_n}),   (1)

where the first term evaluates feature similarity, the second term evaluates spatial continuity, and µ represents the weight for balancing the two constraints. Although the proposed method is a completely unsupervised learning method, its use with scribbles as user input was also investigated. In the case of segmentation using scribble information, the loss function (1) is simply modified using another weight ν as follows:

  L = L_sim({r'_n, c_n}) + µ L_con({r'_n}) + ν L_scr({r'_n, s_n, u_n}),   (2)

where the added third term evaluates consistency with the scribble information. Each component of the abovementioned function is described in its respective section below.

1) Constraint on feature similarity: As described in Sec. III-A1, the cluster labels {c_n} are obtained by applying the argmax function to the normalized response map {r'_n}. The cluster labels are further utilized as pseudo targets. In the proposed approach, the following cross entropy loss between {r'_n} and {c_n} is calculated as the constraint on feature similarity:

  L_sim({r'_n, c_n}) = Σ_{n=1}^{N} Σ_{i=1}^{q} −δ(i − c_n) ln r'_{n,i},

where

  δ(t) = 1 if t = 0, and 0 otherwise.

The objective behind this loss function is to enhance the similarity of the similar features. Once the image pixels are clustered based on their features, the feature vectors within the same cluster should be similar to each other, and the feature vectors from different clusters should be different from each other. Through the minimization of this loss function, the network weights are updated to facilitate the extraction of more efficient features for clustering.

2) Constraint on spatial continuity: The elementary concept of image pixel clustering is to group similar pixels into clusters, as shown in Sec. III-A1. In image segmentation, however, it is preferable for the clusters of image pixels to be spatially continuous. An additional constraint is therefore introduced that favors cluster labels that are the same as those of the neighboring pixels.

In a similar manner to [47], we consider the L1-norm of the horizontal and vertical differences of the response map {r'_n} as a spatial constraint. This can be implemented with a differential operator. More specifically, the spatial continuity loss L_con is defined as follows:

  L_con({r'_n}) = Σ_{ξ=1}^{W−1} Σ_{η=1}^{H−1} ( ||r'_{ξ+1,η} − r'_{ξ,η}||_1 + ||r'_{ξ,η+1} − r'_{ξ,η}||_1 ),

where W and H represent the width and height of the input image, and r'_{ξ,η} represents the pixel value at (ξ, η) in the response map {r'_n}. By applying the spatial continuity loss L_con, an excessive number of labels due to complicated patterns or textures can be suppressed.

3) Constraint on scribbles as user input: Image segmentation with scribble information has been researched extensively [15], [31], [32], [33]. In the proposed approach, the scribble loss L_scr, a partial cross entropy, was introduced as follows:

  L_scr({r'_n, s_n, u_n}) = Σ_{n=1}^{N} Σ_{i=1}^{q} −u_n δ(i − s_n) ln r'_{n,i},

where u_n = 1 if the nth pixel is a scribbled pixel and 0 otherwise, and s_n denotes the scribble label for each pixel.
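The three loss terms can be sketched in PyTorch as follows, continuing from the network sketch in Sec. III-A, with r denoting the normalized response map of shape (1, q, H, W). We use the standard softmax cross entropy for L_sim and L_scr, matching the description of minimizing a softmax loss against the argmax pseudo targets, and a mean reduction instead of the sums above (so µ and ν would be rescaled accordingly); the function names are ours.

```python
import torch.nn.functional as F

def similarity_loss(r):
    """L_sim: cross entropy against the argmax pseudo targets {c_n}."""
    targets = r.argmax(dim=1)               # pseudo targets, shape (1, H, W)
    return F.cross_entropy(r, targets)

def continuity_loss(r):
    """L_con: L1 norm of vertical and horizontal differences of {r'_n}."""
    dv = (r[:, :, 1:, :] - r[:, :, :-1, :]).abs().mean()
    dh = (r[:, :, :, 1:] - r[:, :, :, :-1]).abs().mean()
    return dv + dh

def scribble_loss(r, scribbles, mask):
    """L_scr: partial cross entropy on scribbled pixels only.
    scribbles: (1, H, W) long labels s_n; mask: (1, H, W) booleans u_n."""
    return F.cross_entropy(r.permute(0, 2, 3, 1)[mask], scribbles[mask])

def total_loss(r, mu=5.0, nu=0.5, scribbles=None, mask=None):
    loss = similarity_loss(r) + mu * continuity_loss(r)       # Eq. (1)
    if scribbles is not None:
        loss = loss + nu * scribble_loss(r, scribbles, mask)  # Eq. (2)
    return loss
```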
C. Learning network by backpropagation

In this section, the method of training the network for unsupervised image segmentation is described. Once a target image is input, the following two sub-problems are solved: the prediction of cluster labels with fixed network parameters and the training of network parameters with the (fixed) predicted cluster labels. The former corresponds to the forward process of the network following the architecture described in Sec. III-A. The latter corresponds to the backward process of the network based on gradient descent. We calculate and backpropagate the loss L described in Sec. III-B to update the parameters of the convolutional filters {W_m}_{m=1}^{M} as well as the parameters of the classifier W_c. In this study, stochastic gradient descent with momentum is used for updating the parameters. The parameters are initialized with the Xavier initialization [48], which samples values from a uniform distribution normalized according to the input and output layer sizes. This forward-backward process is iterated T times to obtain the final prediction of the cluster labels {c_n}. Algorithm 1 shows the pseudocode for the proposed unsupervised image segmentation algorithm.

Algorithm 1: Unsupervised image segmentation
  Input:  I = {v_n ∈ R^3}          // RGB image
          µ                        // weight for L_con
  Output: L = {c_n ∈ Z}            // label image
  {W_m, b_m}_{m=1}^{M} ← Init()    // initialize feature extractor
  W_c ← Init()                     // initialize classifier
  for t = 1 to T do
      {x_n}  ← GetFeats({v_n}, {W_m, b_m}_{m=1}^{M})
      {r_n}  ← {W_c x_n}
      {r'_n} ← Norm({r_n})             // batch normalization
      {c_n}  ← {arg max_i r'_{n,i}}    // assign labels
      L ← L_sim({r'_n, c_n}) + µ L_con({r'_n})
      {W_m, b_m}_{m=1}^{M}, W_c ← Update(L)
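For reference, the forward-backward iteration of Algorithm 1 might be realized in PyTorch as below, reusing the UnsupSegNet and total_loss sketches above and the optimizer settings stated in this section (SGD with learning rate 0.1 and momentum 0.9). The image and the number of iterations T are placeholders, and the Xavier initialization [48] of the parameters is omitted for brevity.

```python
import torch

net = UnsupSegNet(p=100, q=100, M=3)          # sketch from Sec. III-A
optimizer = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9)

img = torch.rand(1, 3, 128, 128)              # placeholder target image
T = 500                                       # placeholder iteration count

for t in range(T):
    optimizer.zero_grad()
    r = net(img)                              # forward: {r'_n}
    loss = total_loss(r, mu=5.0)              # L_sim + mu * L_con
    loss.backward()                           # backward
    optimizer.step()                          # update {W_m, b_m}, W_c

labels = net(img).argmax(dim=1)               # final prediction {c_n}
```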
Since this iterative process requires a non-trivial amount of computation time, we further introduce the use of the proposed method with one or several reference images. Provided that a target image is somewhat similar to the reference images, the fixed network weights trained on those images as pre-processing can be reused. The effectiveness of the use of reference images is investigated in Sec. IV-C.

As shown in Fig. 1, the proposed CNN is composed of basic functions. The most distinctive part of the proposed CNN is the existence of the batch normalization layer between the final convolution layer and the argmax classification layer. Unlike the supervised learning scenario, in which the target labels are fixed, the batch normalization of responses over the axes is necessary to obtain reasonable labels {c_n} (see Sec. III-A2). Furthermore, in contrast to supervised learning, there are multiple solutions of {c_n} with different network parameters that achieve near-zero loss. The value of the learning rate controls the balance between the parameter updates and the clustering, which leads to different solutions of {c_n}. We set the learning rate to 0.1 with a momentum of 0.9.

IV. EXPERIMENTAL RESULTS

As mentioned in Sec. I, a spatial continuity loss is proposed, as described in Sec. III-B2, as an alternative to the superpixel extraction used in our previous study [3]. The effectiveness of the continuity loss was evaluated by comparing it with [3] as well as with the other classical methods discussed in Sec. IV-A. Additionally, the use of the proposed method with scribble input is demonstrated in Sec. IV-B and with reference images in Sec. IV-C. The number of convolutional layers M was set to 3 and p = q = 100 for all of the experiments. For the loss function, different µ were set for each experiment: µ = 5 for PASCAL VOC 2012 and BSD500 in Sections IV-A and IV-C, µ = 50 for iCoseg and BBC Earth in Section IV-C, µ = 100 for pixabay in Section IV-C, and µ = 1 for Section IV-B. The results of all the experiments were evaluated by the mean intersection over union (mIOU). Here, mIOU was calculated as the mean IOU between each segment in the ground truth (GT) and the estimated segment that had the largest IOU with that GT segment. Notably, the object category labels in the PASCAL VOC 2012 dataset [49] were ignored, and each segment, along with the background region, was treated as an individual segment.
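As an illustration of this protocol, the following NumPy sketch (ours, not the evaluation code used in the paper) matches each ground-truth segment with the estimated segment of largest IOU and averages the resulting IOU values.

```python
import numpy as np

def mean_iou(gt, pred):
    """mIOU: for each GT segment, take the best-overlapping estimated segment."""
    ious = []
    for g in np.unique(gt):
        g_mask = (gt == g)
        best = 0.0
        for p in np.unique(pred):
            p_mask = (pred == p)
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            best = max(best, inter / union)
        ious.append(best)
    return float(np.mean(ious))
```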
that contains the smallest number of segments. In this case,
“fine” used Fig. 4b, “coarse” used Fig. 4c, and “all” used all the
A. Effect of continuity loss ground truth files including both of those for mIOU calculation.
The effect of continuity loss on the validation dataset According to Table I, the proposed method achieved the best
of PASCAL VOC 2012 segmentation benchmark [49] and or the second best scores on PASCAL VOC 2012 and BSD500
Berkeley Segmentation Dataset and Benchmark (BSD500) [51] datasets. The proposed method was outperformed by GS on
were evaluated. Figure 2 shows examples of the segmentation “BSD500 all” and “BSD500 fine” because the IOU values
results when µ was changed. In case of Fig. 2f, the image was for small segments are dominant owing to the several small
successfully segmented into sky, sea, rock, cattle, and beach segments in the ground truth sets. This in effect does not convey
regions. However, the image was segmented in more detail that the proposed method produced fewer accurate segments
with µ = 1; for example, the beach was further segmented than GS. To confirm this fact, the precision-recall curves in
into sand and grass regions. It is inferred that the optimal µ “BSD500 all” with an IOU threshold 0.2, 0.3, 0.4, 0.5, 0.6,
changes depending on the degree of detailing in the desired and 0.7 in Fig. 5 were also presented. For this evaluation, we
segmentation results. Table II shows the change in mIOU scores
with respect to µ and ν variations on PASCAL VOC 2012 2 https://github.com/xu-ji/IIC
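The connected-component extraction mentioned above, which turns each cluster of a label map into spatially contiguous segments, can be sketched as follows; this is our illustration using scipy.ndimage.label, not the evaluation code used in the paper.

```python
import numpy as np
from scipy import ndimage

def clusters_to_segments(label_map):
    """Split every cluster of a 2D label map into its connected components."""
    segments = np.zeros_like(label_map)
    next_id = 0
    for c in np.unique(label_map):
        comp, n = ndimage.label(label_map == c)   # components of cluster c
        segments[comp > 0] = comp[comp > 0] + next_id
        next_id += n
    return segments                               # one unique ID per segment
```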
TABLE I: Comparison of mIOU for unsupervised segmentation on PASCAL VOC 2012 and BSD500. The best scores are shown in bold and the second-best scores are underlined.

Method                                     | VOC 2012 | BSD500 all | BSD500 fine | BSD500 coarse | mean
k-means clustering, k = 2                  | 0.3166   | 0.1223     | 0.0865      | 0.1972        | 0.1807
k-means clustering, k = 17                 | 0.2383   | 0.2404     | 0.2208      | 0.2648        | 0.2411
Graph-based Segmentation (GS) [6], τ = 100 | 0.2682   | 0.3135     | 0.2951      | 0.3255        | 0.3006
Graph-based Segmentation (GS) [6], τ = 500 | 0.3647   | 0.2768     | 0.2238      | 0.3659        | 0.3078
IIC [50], k = 2                            | 0.2729   | 0.0896     | 0.0537      | 0.1733        | 0.1474
IIC [50], k = 20                           | 0.2005   | 0.1724     | 0.1513      | 0.2071        | 0.1828
Ours w/ superpixels [3]                    | 0.3082   | 0.2261     | 0.1690      | 0.3239        | 0.2568
Ours w/ continuity loss, µ = 5             | 0.3520   | 0.3050     | 0.2592      | 0.3739        | 0.3225

Fig. 2: Effect of continuity loss with different µ values ((a) input image, (b) ground truth, (c) µ = 1, (d) µ = 5, (e) µ = 10, (f) µ = 50, (g) µ = 100). Different segments are shown in different colors.

TABLE II: Parameter search on PASCAL VOC 2012.

Unsupervised segmentation:
µ    | 0.1    | 0.5    | 1      | 5      | 10     | 20
mIOU | 0.3340 | 0.3433 | 0.3449 | 0.3520 | 0.3483 | 0.3438

Segmentation with user input:
ν    | 0.1    | 0.5    | 1      | 5      | 10     | 20
mIOU | 0.4774 | 0.6174 | 0.5994 | 0.5298 | 0.4982 | 0.4650

TABLE III: Ablation studies on L_con and batch normalization.

L_sim | L_con | BN | VOC2012 | BSD500 all | BSD500 fine | BSD500 coarse
✓     |       |    | 0.3312  | 0.2279     | 0.1928      | 0.2932
✓     | ✓     |    | 0.3340  | 0.2199     | 0.1832      | 0.2931
✓     |       | ✓  | 0.3358  | 0.3007     | 0.2619      | 0.3506
✓     | ✓     | ✓  | 0.3520  | 0.3050     | 0.2592      | 0.3739

The proposed method w/ continuity loss achieved the best average precision scores, and our previous method w/ superpixels [3] achieved the second-best average precision scores, in all the cases in Fig. 5.

To confirm the effectiveness of each element of the proposed method, an ablation study was performed with PASCAL VOC 2012 and BSD500. Table III shows the results regarding the presence and absence of L_con and the batch normalization of the response map. The experimental results show that the batch normalization process consistently and considerably improves the performance on all the datasets. Even though the effect of L_con alone is marginal, it gives a solid improvement when used together with the batch normalization. This indicates the importance of the three criteria introduced in Sec. I.

B. Segmentation with scribbles as user input

The effect of the proposed method was tested for image segmentation with user input on the validation dataset of the PASCAL VOC 2012 segmentation benchmark [49]. We let ν = 0.5 in (2) in this experiment. The scribble information given in [31] was used for the test images as the user input. For comparison, graph cut [17], graph cut α-expansion [17], graph cut α-β swap [17], and regularized loss [33] were employed. In graph cut, a Gaussian Mixture Model (GMM) was used for modeling the foreground and background of an image. A graph is constructed from the pixel distributions modeled by the GMM. At this time, scribbled pixels are fixed to their scribbled labels, which are foreground or background. In the generated graph, a node is defined as a pixel, whereas the weight of an edge connecting nodes is defined by the probability of being foreground or background.
Fig. 3: Comparison of unsupervised segmentation results on PASCAL VOC 2012 (rows: image, ground truth, IIC, w/ superpixels, w/ continuity loss). The method with superpixels corresponds to the previous method proposed in [3]. Different segments are shown in different colors.

Thereafter, the graph is divided by energy minimization into the two groups: foreground and background. The vanilla graph cut is an algorithm for segmenting the foreground and the background, and it does not support the multi-label case. Therefore, in this study, the graph cut was performed multiple times, with each scribble regarded as the foreground in turn, and subsequently all the extracted segments were used for calculating the mIOU. To compare the performance, α-expansion and α-β swap (introduced in Sec. II), as well as regularized loss [33], were tested. Regularized loss [33] is a weakly-supervised segmentation method that uses a training dataset and additional scribble information. In order to unify the experimental conditions, one image from the validation dataset of PASCAL VOC 2012 was used for the network training with the scribble information. The output in the final iteration for the image after completion of training was regarded as the segmentation result of that image. After that, the network weights were re-initialized, and the process was repeated for the next image. This process was repeated individually for all the test images in the validation dataset of PASCAL VOC 2012 and was defined as "Regularized loss 1-image training". We tested two base architectures for "Regularized loss 1-image training": DeepLab-largeFOV and DeepLab-ResNet-101.

Exemplar segmentation results are shown in Fig. 6. It was observed that the proposed method is more stable than the graph-based methods. Relatively rougher segments of objects are detected by "Regularized loss 1-image training", whereas the boundaries of the segmented areas of the proposed method are more accurate. The quantitative evaluation in Table IV shows that the proposed method achieved the best mIOU score. In addition to outperforming "Regularized loss 1-image training" with the DeepLab-ResNet-101 architecture, the proposed method is effective in three respects. First, the proposed method uses a small network in which the number of parameters is 1,000 times smaller than that of DeepLab-ResNet-101. Second, owing to the smallness of the architecture, the proposed method converges 20 times faster than "Regularized loss 1-image training" with the DeepLab-ResNet-101 architecture. Finally, the proposed method initializes the network with random weights and thus requires no pre-trained weights. In contrast, "Regularized loss 1-image training" requires weights pre-trained on, e.g., the ImageNet dataset³ for initialization. Notably, we found that "Regularized loss 1-image training", with both the DeepLab-ResNet-101 and DeepLab-largeFOV architectures, failed to train the weights from random states in our experiment.

³We used the pre-trained weights downloaded from http://liangchiehchen.com/projects/Init%20Models.html and https://github.com/KaimingHe/deep-residual-networks for DeepLab-largeFOV and DeepLab-ResNet-101, respectively.
Fig. 4: Comparison of unsupervised segmentation results on BSD500 ((a) input image, (b) ground truth #1, (c) ground truth #2, (d) GS [6], (e) k-means, (f) IIC [50], (g) w/ superpixels [3], (h) w/ continuity loss). Different segments are shown in different colors. In (b) and (c), two different ground truth segmentations of image (a) in the BSD500 superpixel benchmark are shown.

Fig. 5: Precision-recall curves with different IOU thresholds for BSD500 ((a) IOU = 0.2, (b) IOU = 0.3, (c) IOU = 0.4, (d) IOU = 0.5, (e) IOU = 0.6, (f) IOU = 0.7). The numbers in the legends represent the average precision scores of each method.

C. Unsupervised segmentation with reference images

Supervised learning generally learns from training data and evaluates the performance using test data; the network can thus produce segmentation results by processing the test images with the (fixed) learned weights. In contrast, as the proposed method is completely unsupervised, it is necessary to learn the network weights every time a test image is input in order to obtain its segmentation result. Therefore, an unsupervised segmentation experiment was conducted with reference images. The effectiveness of networks with fixed weights, trained on several images as references, was evaluated on unseen test images. The BSD500 and iCoseg [54] datasets were employed for the experiment.

The proposed method was trained with the four images in BSD500 shown in Fig. 7a. In the training phase, the network was updated once for each reference image. After training, the network weights were fixed, and the other three images shown in the top row of Fig. 7b were segmented. The reference images and the test images were arbitrarily selected from different scenes in the nature category. The segmentation results are shown in the two bottom rows of Fig. 7b. The phrase "from beginning" in Fig. 7b means that an image is segmented with the proposed method where the weights of the network are trained for each test image from scratch.
Fig. 6: Comparison of segmentation results with user input (rows: image, ground truth, graph cut, graph cut α-expansion, graph cut α-β swap, regularized loss 1-image training, and w/ scribble loss). We used DeepLab-ResNet-101 for the base architecture of "Regularized loss 1-image training". The "Image" row shows input images including scribbles (user input), which are made bold for the purpose of visualization. Different segments are shown in different colors.

As shown in Fig. 7b, the segmentation results "w/ reference images" were more detailed than "from beginning". This is because "from beginning" integrates clusters under the influence of the continuity loss when training on the target image. According to Fig. 7b, "w/ reference images" showed acceptable segmentation performance compared with "from beginning". The method "w/ reference images" takes under 0.02 s for the segmentation of each image, whereas the "from beginning" method takes approximately 20 s of GPU computation on a GeForce GTX TITAN X to obtain the segmentation results. The proposed method was also trained with four groups in iCoseg (ID: 12, 17, 36, 49). As iCoseg does not distinguish between training and test data, two images from each group were randomly selected for testing, and the proposed method was trained on the images in the group excluding the sampled test images. The segmentation results are illustrated in Fig. 8. It was concluded that it is possible to segment unknown images with weights trained on reference images in an unsupervised manner, provided that the images are somewhat similar to the reference images (e.g., when they belong to the same category).

We also conducted an experiment to segment an image using a single reference image. Figure 9 shows the segmentation results for the test and reference images. Even though the segmentation result for a test image is not as appropriate as that for a reference image, a sufficient segmentation result is obtained. We can see different levels of quality in these two cases: the fishes in the left case are successfully assigned the same label, whereas the oranges in the right case are differentiated. This implies that similar objects with somewhat similar colors are assigned the same label.

In the experiments thus far, it was found that the proposed method can be trained from several reference images and is effective for unseen images that are similar to them. Therefore, another application, for videos, was introduced. Video data generally contains information connected in a time series. Hence, video segmentation can be accomplished by training on only a part of all the frames using the proposed method. Figure 10 shows examples of segmentation results when video data was input to the proposed method. The proposed method trained a network only with the leftmost image in the respective row in Fig. 10. It was observed that most of the segments obtained from the other images were successfully matched to the same segments in the leftmost image.
TABLE IV: Comparison of the number of parameters, computation time, and mIOU for segmentation with user input.

Method                                                      | # parameters | Time (sec.) | mIOU
Graph cut [17]                                              | -            | 1.47        | 0.2965
Graph cut α-expansion [17]                                  | -            | 0.81        | 0.5509
Graph cut α-β swap [17]                                     | -            | 0.77        | 0.5524
Regularized loss [33] 1-image training (DeepLab-largeFOV)   | 20,499,136   | 42          | 0.5790
Regularized loss [33] 1-image training (DeepLab-ResNet-101) | 132,145,344  | 414         | 0.6064
Proposed method w/ scribble loss                            | 103,600      | 20          | 0.6174

Consequently, it was demonstrated that even video data without ground truth can be segmented efficiently with the proposed method using only a single frame as a reference. This result indicates that the proposed method, which aims at unsupervised learning of image segmentation, can be extended to unsupervised learning of video segmentation. By using the first frame of the video as a reference and segmenting the other frames, the segmentation task can be accelerated. In addition, the segmentation of the full target video can also be improved by stacking processed images as additional reference images.

V. CONCLUSION

A novel CNN architecture was presented in this study, along with an unsupervised training process that enables image segmentation without supervision. The proposed CNN architecture consists of convolutional filters for feature extraction and differentiable processes for feature clustering, which enables end-to-end network training. The proposed CNN jointly assigned cluster labels to image pixels and updated the convolutional filters to achieve better separation of clusters, using backpropagation of the proposed loss on the normalized responses of the convolutional layers. Furthermore, two applications based on the proposed segmentation method were introduced: segmentation with scribbles as user input and the utilization of reference images. The experimental results on the PASCAL VOC 2012 segmentation benchmark dataset [49] and BSD500 [51] demonstrated the effectiveness of the proposed method for completely unsupervised segmentation. The proposed method outperformed classical methods for unsupervised image segmentation, such as k-means clustering and a graph-based segmentation method, which verified the importance of feature learning. Furthermore, the effectiveness of the proposed method for image segmentation with user input and with reference images was validated by additional experimental results on the PASCAL VOC 2012, BSD500, and iCoseg [54] datasets. A potential application of the proposed method to an efficient video segmentation system was also demonstrated.

REFERENCES

[1] R. Unnikrishnan, C. Pantofaru, and M. Hebert, "Toward objective evaluation of image segmentation algorithms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 929–944, 2007.
[2] A. Y. Yang, J. Wright, Y. Ma, and S. Sastry, "Unsupervised segmentation of natural images via lossy data compression," Computer Vision and Image Understanding, vol. 110, no. 2, pp. 212–225, 2008.
[3] A. Kanezaki, "Unsupervised image segmentation by backpropagation," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, 2018.
[4] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "Slic superpixels compared to state-of-the-art superpixel methods," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282, 2012.
[5] J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, Oakland, CA, USA, 1967.
[6] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient graph-based image segmentation," International Journal of Computer Vision, vol. 59, no. 2, pp. 167–181, 2004.
[7] X. Liu, Q. Xu, J. Ma, H. Jin, and Y. Zhang, "Mslrr: A unified multiscale low-rank representation for image segmentation," IEEE Transactions on Image Processing, vol. 23, no. 5, pp. 2159–2167, 2014.
[8] X. Xia and B. Kulis, "W-net: A deep model for fully unsupervised image segmentation," arXiv preprint arXiv:1711.08506, 2017.
[9] I. Croitoru, S.-V. Bogolin, and M. Leordeanu, "Unsupervised learning of foreground object segmentation," International Journal of Computer Vision, vol. 127, no. 9, pp. 1279–1302, 2019.
[10] S. Ghosh, N. Das, I. Das, and U. Maulik, "Understanding deep learning techniques for image segmentation," ACM Computing Surveys (CSUR), vol. 52, no. 4, pp. 1–35, 2019.
[11] Y. Y. Boykov and M.-P. Jolly, "Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images," in Proceedings of International Conference on Computer Vision (ICCV), IEEE, 2001.
[12] V. S. Lempitsky, P. Kohli, C. Rother, and T. Sharp, "Image segmentation with a bounding box prior," in Proceedings of International Conference on Computer Vision (ICCV), vol. 76, Citeseer, 2009.
[13] A. Levin, D. Lischinski, and Y. Weiss, "A closed-form solution to natural image matting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 228–242, 2007.
[14] A. Levin, A. Rav-Acha, and D. Lischinski, "Spectral matting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 10, pp. 1699–1712, 2008.
[15] W. Yang, J. Cai, J. Zheng, and J. Luo, "User-friendly interactive image segmentation through unified combinatorial user inputs," IEEE Transactions on Image Processing, vol. 19, no. 9, pp. 2470–2479, 2010.
[16] E. Zemene and M. Pelillo, "Interactive image segmentation using constrained dominant sets," in Proceedings of European Conference on Computer Vision (ECCV), Springer, 2016.
[17] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," in Proceedings of International Conference on Computer Vision (ICCV), vol. 1, IEEE, 1999.
[18] V. Badrinarayanan, A. Kendall, and R. Cipolla, "Segnet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[19] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," in Proceedings of International Conference on Learning Representations (ICLR), 2015.
[20] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
Fig. 7: Results of segmentation with reference images on BSD500. Different segments are shown in different colors. (a) Reference images: #28083, #79073, #117025, and #145079. (b) Test results on images #35049, #101027, and #176051: "w/ reference images" obtained mIOU = 0.1108, 0.1451, and 0.3691, whereas "from beginning" obtained mIOU = 0.1669, 0.1377, and 0.2844, respectively.

[21] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr, "Conditional random fields as recurrent neural networks," in Proceedings of International Conference on Computer Vision (ICCV), 2015.
[22] J. Tighe and S. Lazebnik, "Finding things: Image parsing with regions and per-exemplar detectors," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[23] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Simultaneous detection and segmentation," in Proceedings of European Conference on Computer Vision (ECCV), 2014.
[24] J. Dai, K. He, and J. Sun, "Instance-aware semantic segmentation via multi-task network cascades," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[25] J. Zhu, J. Mao, and A. L. Yuille, "Learning from weakly supervised data by the expectation loss SVM (e-SVM) algorithm," in Proceedings of Advances in Neural Information Processing Systems (NIPS), 2014.
[26] F.-J. Chang, Y.-Y. Lin, and K.-J. Hsu, "Multiple structured-instance learning for semantic segmentation with uncertain training data," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[27] D. Pathak, P. Krahenbuhl, and T. Darrell, "Constrained convolutional neural networks for weakly supervised segmentation," in Proceedings of International Conference on Computer Vision (ICCV), 2015.
[28] N. Pourian, S. Karthikeyan, and B. S. Manjunath, "Weakly supervised graph based semantic segmentation by learning communities of image-parts," in Proceedings of International Conference on Computer Vision (ICCV), 2015.
Fig. 8: Results of segmentation with reference images on iCoseg. "#n-m" denotes the mth test image in the group whose ID is n. Different segments are shown in different colors. Group mIOU scores: #12 = 0.9780, #17 = 0.6875, #36 = 0.6570, and #49 = 0.4831.

Fig. 9: Results of segmentation with a single reference image. Different segments are shown in different colors. The results imply that similar objects with somewhat similar colors are assigned the same label (see the fishes in the left case), whereas objects with dissimilar colors are not (see the oranges in the right case). Images are from pixabay [52].

[29] Z. Shi, Y. Yang, T. Hospedales, and T. Xiang, "Weakly-supervised image annotation and segmentation with objects and attributes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2525–2538, 2016.
[30] W. Shimoda and K. Yanai, "Distinct class-specific saliency maps for weakly supervised semantic segmentation," in Proceedings of European Conference on Computer Vision (ECCV), 2016.
[31] D. Lin, J. Dai, J. Jia, K. He, and J. Sun, "ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[32] M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers, "Normalized cut loss for weakly-supervised CNN segmentation," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[33] M. Tang, F. Perazzi, A. Djelouah, I. Ben Ayed, C. Schroers, and Y. Boykov, "On regularized losses for weakly-supervised CNN segmentation," in Proceedings of European Conference on Computer Vision (ECCV), 2018.
[34] J. Carreira and C. Sminchisescu, "CPMC: Automatic object segmentation using constrained parametric min-cuts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1312–1328, 2011.
[35] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Proceedings of Advances in Neural Information Processing Systems (NIPS), 2011.
[36] Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang, "Weakly-supervised semantic segmentation network with deep seeded region growing," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[37] H. Lee, P. Pham, Y. Largman, and A. Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Proceedings of Advances in Neural Information Processing Systems (NIPS), 2009.
[38] Q. V. Le, "Building high-level features using large scale unsupervised learning," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013.
[39] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in Proceedings of International Conference on Machine Learning (ICML), 2009.
[40] M. Cimpoi, S. Maji, and A. Vedaldi, "Deep filter banks for texture recognition and segmentation," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[41] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[42] J. Xie, R. Girshick, and A. Farhadi, "Unsupervised deep embedding for clustering analysis," in Proceedings of International Conference on Machine Learning (ICML), 2016.
[43] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, "Maximum margin clustering," in Proceedings of Advances in Neural Information Processing Systems (NIPS), 2005.

Fig. 10: Results of segmentation for sequential images. For each case, a network is trained with only the leftmost image in the respective row. Different segments are shown in different colors, and time flows from left to right. The images are from "BBC Earth, Nature Makes You Happy" [53].

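The single-frame training protocol in Fig. 10 suggests a simple recipe for efficient video segmentation: fit the network on the first frame, then label the remaining frames by forward passes alone. A minimal sketch follows, in which train_unsupervised_seg is a hypothetical stand-in for the paper's unsupervised training routine and frames are H x W x 3 torch tensors; both names are assumptions for illustration.

import torch

def segment_video(frames, train_unsupervised_seg):
    # Train once on the leftmost frame, then reuse the network for all frames.
    model = train_unsupervised_seg(frames[0])
    model.eval()
    labels = []
    with torch.no_grad():
        for frame in frames:
            x = frame.permute(2, 0, 1).unsqueeze(0).float()   # 1 x 3 x H x W
            response = model(x)                  # 1 x C x H x W class scores
            labels.append(response.argmax(dim=1).squeeze(0))  # per-pixel labels
    return labels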
[44] F. R. Bach and Z. Harchaoui, "Diffrac: A discriminative and flexible framework for clustering," in Proceedings of Advances in Neural Information Processing Systems (NIPS), 2008.
[45] A. Joulin, F. Bach, and J. Ponce, “Discriminative clustering for image
co-segmentation,” in Proceedings of IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2010.
[46] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” in Proceedings of
International Conference on Machine Learning (ICML), 2015.
[47] T. Shibata, M. Tanaka, and M. Okutomi, "Misalignment-robust joint filter for cross-modal image pairs," in Proceedings of International Conference on Computer Vision (ICCV), pp. 3295–3304, 2017.
[48] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep
feedforward neural networks,” in Proceedings of International Conference
on Artificial Intelligence and Statistics (AISTATS), 2010.
[49] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn,
and A. Zisserman, “The pascal visual object classes challenge: A
retrospective,” International Journal of Computer Vision, vol. 111, no. 1,
pp. 98–136, 2015.
[50] X. Ji, J. F. Henriques, and A. Vedaldi, "Invariant information clustering for unsupervised image classification and segmentation," in Proceedings of International Conference on Computer Vision (ICCV), pp. 9865–9874, 2019.
[51] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection
and hierarchical image segmentation,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 33, no. 5, pp. 898–916, 2011.
[52] Pixabay. https://pixabay.com/.
[53] BBC Earth, "Nature makes you happy," YouTube, 2017. https://youtu.be/1wkPMUZ9vX4.
[54] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, "iCoseg: Interactive co-segmentation with intelligent scribble guidance," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010.
