Understanding Deep Learning Techniques For Image Segmentation
Abstract
The machine learning community has been overwhelmed by a plethora
of deep learning based approaches. Many challenging computer vision
tasks such as detection, localization, recognition and segmentation of ob-
jects in unconstrained environment are being efficiently addressed by var-
ious types of deep neural networks like convolutional neural networks,
recurrent networks, adversarial networks, autoencoders and so on. While
there have been plenty of analytical studies regarding the object detection
or recognition domain, many new deep learning techniques have surfaced
with respect to image segmentation techniques. This paper approaches
these various deep learning techniques of image segmentation from an
analytical perspective. The main goal of this work is to provide an in-
tuitive understanding of the major techniques that have made significant contributions to the image segmentation domain. Starting from some of
the traditional image segmentation approaches, the paper progresses de-
scribing the effect deep learning had on the image segmentation domain.
Thereafter, most of the major segmentation algorithms have been logi-
cally categorized with paragraphs dedicated to their unique contribution.
With an ample amount of intuitive explanations, the reader is expected
to have an improved ability to visualize the internal dynamics of these
processes.
1 Introduction
Image segmentation can be defined as a specific image processing technique
which is used to divide an image into two or more meaningful regions. Image
segmentation can also be seen as a process of defining boundaries between sep-
arate semantic entities in an image. From a more technical perspective, image
segmentation is a process of assigning a label to each pixel in the image such
that pixels with the same label are connected with respect to some visual or
semantic property (Fig. 1).
Image segmentation subsumes a large class of finely related problems in
computer vision. The most classic version is semantic segmentation [66]. In
semantic segmentation, each pixel is classified into one of the predefined set
Figure 1: Semantic image segmentation (samples from the Mapillary Vistas Dataset [155])
of classes such that pixels belonging to the same class belong to a unique semantic entity in the image. It is also worth noting that the semantics in question depend not only on the data but also on the problem that needs to be addressed. For example, for a pedestrian detection system, the whole body of a person should belong to the same segment; however, for an action recognition system, it might be necessary to segment different body parts into different classes.
Other forms of image segmentation focus on the most important object in a scene; a particular class of problems called saliency detection [19] is born from this. Other variants of this domain include foreground-background separation
problems. In many systems, like image retrieval or visual question answering, it is often necessary to count the number of objects. Instance specific segmentation addresses that issue: it is often coupled with object detection systems to detect and segment multiple instances of the same object [43] in a scene. Segmentation in the temporal space is also a challenging domain with various applications. In object tracking scenarios, pixel level classification is performed not only in the spatial domain but also across time. Other applications in traffic analysis or surveillance need to perform motion
segmentation to analyze paths of moving objects. In the field of segmentation
with lower semantic level, over-segmentation is also a common approach where
images are divided into extremely small regions to ensure boundary adherence,
at the cost of creating a lot of spurious edges. Over-segmentation algorithms
are often combined with region merging techniques to perform image segmen-
tation. Even simple color or texture segmentation also finds its use in various
scenarios. Another important distinction between segmentation algorithms is the need for interaction from the user. While it is desirable to have fully automated systems, a little bit of interaction from the user can improve the quality of segmentation to a large extent. This is especially applicable where we are dealing with complex scenes or do not possess an ample amount of data to train the system.
Segmentation algorithms have several applications in the real world. In medical image processing [123], we need to localize various abnormalities
like aneurysms [48], tumors [145], cancerous elements like melanoma detec-
tion [189], or specific organs during surgeries [206]. Another domain where
segmentation is important is surveillance. Many problems such as pedestrian
detection [113], traffic surveillance [60] require the segmentation of specific ob-
jects, e.g., persons or cars. Other domains include satellite imagery [11, 17],
guidance systems in defense [119], forensics such as face [5], iris [51] and fin-
gerprint [144] recognition. Generally, traditional methods such as histogram thresholding [195], hybridization [193, 87], feature space clustering [40], region-based approaches [59], edge detection approaches [184], fuzzy approaches [39], entropy-based approaches [47], neural networks (Hopfield neural networks [35], self-organizing maps [27]), and physics-based approaches [158] are popularly used for this purpose. However, such feature-based approaches have a common
bottleneck: they are dependent on the quality of the features extracted by domain experts, and humans are bound to miss latent or abstract features relevant for image segmentation. Deep learning, on the other hand, addresses this issue through automated feature learning. In this regard, one of the most influential techniques in computer vision was the convolutional neural network [110], which learns a cascaded set of convolutional kernels through backpropagation [182]. Since then, it has been improved significantly with features like layer-wise training [13], rectified linear activations [153], batch normalization [84], auxiliary classifiers [52], atrous convolutions [211], skip connections [78], better optimization techniques [97] and so on. With all these came a large number of new image segmentation
techniques as well. Various such techniques drew inspiration from popular net-
works such as AlexNet [104], convolutional autoencoders [141], recurrent neural
networks [143], residual networks [78] and so on.
2 Motivation
There have been many reviews and surveys regarding the traditional technolo-
gies associated with image segmentation [61, 160]. Some of them specialized in application areas [107, 123, 185], while others focused on specific types of algorithms [20, 19, 59]. With the arrival of deep learning techniques, many new
classes of image segmentation algorithms have surfaced. Earlier studies [219]
have shown the potential of deep learning based approaches. There have been
more recent studies [68] which cover a number of methods and compare them on
the basis of their reported performance. The work of Garcia-Garcia et al. [66] lists a variety of deep learning based segmentation techniques and tabulates the performance of various state-of-the-art networks on several modern challenges. These resources are incredibly useful for understanding the current state of the art in this domain. Knowing the available methods is quite useful for developing products; however, to contribute to this domain as a researcher, one needs to understand the underlying mechanics that make the methods effective. In the present work, our main motivation is to answer the question of why the methods are designed the way they are. Understanding the mechanics
of modern techniques would make it easier to tackle new challenges and develop
better algorithms. Our approach carefully analyses each method to understand
why they succeed at what they do and also why they fail for certain problems. Being aware of the pros and cons of such methods, new designs can be initiated that reap the benefits of the pros and overcome the cons. We recommend the
work of Garcia-Garcia et al. [66] to get an overview of some of the best image segmentation techniques using deep learning, while our focus is to understand why, when and how these techniques perform on various challenges.
2.1 Contribution
The paper has been designed in a way such that new researchers reap the most
benefits. Initially, some of the traditional techniques have been discussed to highlight the frameworks that preceded the deep learning era. Gradually, the various factors governing the onset of deep learning have been discussed so that readers have a good idea of the current direction in which machine learning is progressing. In
the subsequent sections the major deep learning algorithms have been briefly
described in a generic way to establish a clearer concept of the procedures in the
mind of the readers. The image segmentation algorithms discussed thereafter
have been categorized into the major families of algorithms that governed the
last few years in this domain. The concepts behind all the major approaches
have been explained in very simple language with a minimal amount of complicated mathematics. Almost all the diagrams corresponding to major
networks have been drawn using a common representational format as shown
in fig. 2. The various approaches that have been discussed come with different representations for their architectures. The unified representation scheme allows the
user to understand the fundamental similarities and differences between net-
works. Finally, the major application areas have been discussed to help new
researchers pursue a field of their choice.
Table 1: A list of various datasets in the image segmentation domain
Natural Scenes:
  Berkeley Segmentation Dataset [140]
  PASCAL VOC [54]
  Stanford Background Dataset [72]
  Microsoft COCO [122]
  MIT Scene Parsing Data (ADE20K) [222, 223]
  Semantic Boundaries Dataset [75]
  Microsoft Research Cambridge Object Recognition Image Database (MSRC) [188]

Video Segmentation:
  Densely Annotated Video Segmentation (DAVIS) [168]
  Video Segmentation Benchmark (VSB100) [64]
  YouTube Video Object Segmentation [209]

Autonomous Driving:
  Cambridge-driving Labeled Video Database (CamVid) [23]
  Cityscapes: Semantic Urban Scene Understanding [41]
  Mapillary Vistas Dataset [155]
  SYNTHIA: Synthetic Collection of Imagery and Annotations [178]
  KITTI Vision Benchmark Suite [67]
  Berkeley Deep Drive [212]
  India Driving Dataset (IDD) [202]

Aerial Imaging:
  Inria Aerial Image Labeling Dataset [134]
  Aerial Image Segmentation Dataset [213]
  ISPRS Dataset Collection [57]
  Google Open Street Map [8]
  DeepGlobe [49]

Medical Imaging:
  DRIVE: Digital Retinal Images for Vessel Extraction [191]
  Sunnybrook Cardiac Data [171]
  Multiple Sclerosis Database [129, 25]
  IMT: Intima Media Thickness Segmentation Dataset [148]
  SCR: Segmentation in Chest Radiographs [201]
  BRATS: Brain Tumor Segmentation [146]
  LITS: Liver Tumour Segmentation [74]
  BACH: Breast Cancer Histology [6]
  IDRiD: Indian Diabetic Retinopathy Image Dataset [169]
  ISLES: Ischemic Stroke Lesion Segmentation [135]

Saliency Detection:
  MSRA Salient Object Database [37]
  ECSSD: Extended Complex Scene Saliency Dataset [187]
  PASCAL-S Dataset [117]
  THUR15K: Group Saliency in Image [36]
  JuddDB: MIT Saliency Benchmark [18]
  DUT-OMRON Image Dataset [210]

Scene Text Segmentation:
  KAIST Scene Text Database [112]
  COCO-Text [203]
  SVT: Street View Text Dataset [205]
Figure 3: Input image and sample activation maps from a typical CNN. (Top row) Input image and two activation maps from earlier layers showing part objects like t-shirts and features like contours. (Bottom row) Activation maps from later layers with more meaningful activations like fields, people and sky respectively.
segments. Various methods have been implemented to make use of these inter-
nal activations to segment the images. A summary of major deep learning based
segmentation algorithms is provided in table 2 along with a brief description of their major contributions.
Table 2: A summary of major deep learning based segmentation algorithms.
Abbreviations: S: Supervised, W: Weakly supervised, U: Unsupervised, I: In-
teractive, P: Partially Supervised, SO: Single objective optimization, MO: Multi
objective optimization, AD: Adversarial Learning, SM: Semantic Segmentation,
CL: Class specific Segmentation, IN: Instance Segmentation, RNN: Recurrent
Modules, E-D: Encoder Decoder Architecture
Supervision Learning Type Modules
Method Year Description
S W U I P SO MO AD SM CL IN RNN E-D
Global Average Pooling 2013 X X X Object specific soft segmentation
DenseCRF 2014 X X X Using CRF to boost segmentation
FCN 2015 X X X Fully convolutional layers
DeepMask 2015 X X X Simultaneous learning for segmentation and classification
U-Net 2015 X X X X Encoder-Decoder with multiscale feature concatenation
SegNet 2015 X X X X Encoder-Decoder with forwarding pooling indices
CRF as RNN 2015 X X X X Simulating CRFs as trainable RNN modules
Deep Parsing Network 2015 X X Using unshared kernels to incorporate higher order dependency
BoxSup 2015 X X Using bounding box for weak supervision
SharpMask 2016 X X X X Refined Deepmask with multi layer feature fusion
Attention to Scale 2016 X X X Fusing features from multi scale inputs
Semantic Segmentation 2016 X X X Adversarial training for image segmentation
Conv LSTM and Spatial Inhibition 2016 X X X X Using spatial inhibition for instance segmentation
JULE 2016 X X X X Joint unsupervised learning for segmentation
ENet 2016 X X X Compact network for realtime segmentation
Instance aware segmentation 2016 X X X Multi task approach for instance segmentation
Mask RCNN 2017 X X X Using region proposal network for segmentation
Large Kernel Matters 2017 X X X X Using larger kernels for learning complex features
RefineNet 2017 X X X X Multi path refinement module for fine segmentation
PSPNet 2017 X X X Multi scale pooling for scale agnostic segmentation
Tiramisu 2017 X X X X DenseNet 121 feature extractor
Image to Image Translation 2017 X X X X Conditional GAN for translating images to segment maps
Instance Segmentation with attention 2017 X X X X Attention modules for image segmentation
W-Net 2017 X X X X Unsupervised segmentation using normalized cut loss
PolygonRNN 2017 X X X X Generating contours by RNN
Deep Layer Cascade 2017 X X X Multi level approach to handle pixels of different complexity
Spatial Propagation Network 2017 X X X Refinement using linear label propagation
DeepLab 2018 X X X Atrous convolution, Spatial pooling pyramid, DenseCRF
SegCaps 2018 X X Capsule Networks for Segmentation
Adversarial Collaboration 2018 X X Adversarial collaboration between multiple networks
Superpixel Supervision 2018 X X Using superpixel refinement as supervisory signals
Deep Extreme Cut 2018 X X X Using extreme points for interactive segmentation
Two Stream Fusion 2019 X X X Using image stream and interaction stream simultaneously
SegFast 2019 X X X X Using depth-wise separable convolution in SqueezeNet encoder
values. As these pooled scalars are connected to the output layers, the weights corresponding to each class may be used to perform a weighted summation of the corresponding activation maps in the previous layers. This process, called Global Average Pooling (GAP) [121], can be directly used on various trained networks like residual networks to find object specific activation zones which can be used for pixel level segmentation (a minimal sketch is given below). The major issue with algorithms such as this is the loss of sharpness due to the intermediate sub-sampling operations. Sub-sampling is a common operation in convolutional neural networks to increase the sensory area of kernels: as the activation maps reduce in size in the subsequent layers, the kernels convolving over them actually correspond to a larger area in the original image. However, it reduces the image size in the process, which loses sharpness when up-sampled back to the original size. Many approaches have been implemented to handle this issue. For fully convolutional models, skip connections from preceding layers can be used to obtain sharper versions of the activations from which finer segments can be chalked out (refer fig. 4). Another work showed how the usage of large kernels to capture global information with FCN models created better segmentation masks [165]. Segmentation algorithms can also be treated as boundary detection techniques, and convolutional features are very useful from that perspective [139]. While earlier layers can provide fine details, later layers focus more on the coarser boundaries.
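To make the GAP-based localization described above concrete, the following is a minimal sketch (our illustration, not code from [121]) of extracting a class activation map from a torchvision ResNet-18. The layer and attribute names follow torchvision's ResNet implementation; the input tensor is a random stand-in for a real image.

```python
import torch
import torchvision.models as models

# Load a trained classifier (newer torchvision versions use the
# weights= argument instead of pretrained=True).
net = models.resnet18(pretrained=True).eval()

feats = {}
def hook(module, inp, out):
    feats['maps'] = out          # final activation maps, before GAP
net.layer4.register_forward_hook(hook)

img = torch.randn(1, 3, 224, 224)            # stand-in for a real image
cls = net(img).argmax(dim=1).item()          # predicted class

# GAP trick: weight the final activation maps by the fully connected
# weights of the chosen class and sum over channels.
w = net.fc.weight[cls]                       # shape (512,)
cam = torch.relu((w[:, None, None] * feats['maps'][0]).sum(dim=0))  # (7, 7)
# Upsampling 'cam' to 224x224 and thresholding it gives a coarse,
# object specific segment; the blur comes from the 7x7 resolution.
```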
Figure 6: The SharpMask network
intensities to cluster the pixels into objects. The bounding boxes corresponding
to these segments are passed through classifying networks to short-list some of
the most sensible boxes. Finally, with a simple linear regression network, tighter coordinates can be obtained. The main downside of the technique is its compu-
tational cost. The network needs to compute a forward pass for every bounding
box proposition. The problem with sharing computation across all boxes was
that the boxes were of different sizes and hence uniform sized features were not
achievable. In the upgraded Fast R-CNN [69], ROI (Region of Interest) Pooling
was proposed in which region of interests were dynamically pooled to obtain a
fixed size feature output. Henceforth, the network was mainly bottlenecked by
the selective search technique for candidate region proposal. In Faster-RCNN
[175], instead of depending on external features, the intermediate activation
maps were used to propose bounding boxes, thus speeding up the feature ex-
traction process. Bounding boxes are representative of the location of the object; however, they do not provide pixel-level segments. The Faster R-CNN network was extended as Mask R-CNN [76] with a parallel branch that performed pixel level object specific binary classification to provide accurate segments. With Mask R-CNN, an average precision of 35.7 was attained on the COCO [122] test images. The family of RCNN algorithms has been depicted in fig. 7. Region
proposal networks have often been combined with other networks [118, 44] to
give instance level segmentations. RCNN was further improved under the name
of HyperNet [99] by using features from multiple layers of the feature extrac-
tor. Region proposal networks have also been implemented for instance specific segmentation. As mentioned before, the object detection capabilities of approaches like RCNN are often coupled with segmentation models to generate different masks for different instances of the same object [43].
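As a usage illustration (a hedged sketch, not the original authors' code), the Mask R-CNN implementation shipped with torchvision (since version 0.3) produces per-instance soft masks directly; the 0.7 score threshold below is an arbitrary choice.

```python
import torch
import torchvision

# Pre-trained Mask R-CNN with a ResNet-50 FPN backbone.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)              # RGB tensor scaled to [0, 1]
with torch.no_grad():
    out = model([image])[0]                  # one output dict per image

# 'masks' holds one soft HxW mask per detected instance; thresholding
# at 0.5 yields binary instance segments.
keep = out['scores'] > 0.7
masks = out['masks'][keep, 0] > 0.5          # (num_instances, 480, 640)
labels = out['labels'][keep]                 # COCO category ids
```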
4.1.3 DeepLab
While pixel level segmentation was effective, two complementary issues were still affecting the performance. Firstly, smaller kernel sizes failed to capture contextual information. In classification problems, this is handled using pooling layers that increase the sensory area of the kernels with respect to the original image; but in segmentation, that reduces the sharpness of the segmented output. Alternatively, larger kernels tend to be slower due to a significantly larger number of trainable parameters. To handle this issue, the DeepLab [30, 32] family of algorithms demonstrated the usage of various methodologies like atrous convolutions [211], spatial pooling pyramids [77] and fully connected conditional random fields [100] to perform image segmentation with great efficiency. The DeepLab algorithm was able to attain a mean IOU of 79.7 on the PASCAL VOC 2012 dataset [54].
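A small sketch of the core idea behind atrous convolution follows: for the same number of parameters, a dilated kernel covers a larger sensory area without sub-sampling the feature map. The channel and input sizes are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 65, 65)

conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)                 # sees 3x3
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)   # sees 5x5

print(conv(x).shape, atrous(x).shape)   # both (1, 64, 65, 65)
# Both layers have 64*64*3*3 + 64 parameters, but the dilated kernel
# covers a 5x5 area of its input, capturing more context per position.
```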
Figure 7: The RCNN Family of localization and segmentation networks
Figure 8: Normal convolution (red) vs. atrous or dilated convolution (green)
Spatial Pyramid Pooling Spatial pyramid pooling [77] was introduced in the R-CNN family, where ROI pooling showed the benefit of using multi-scale regions for object localization. However, in DeepLab, atrous convolutions were preferred over pooling layers for changing the field of view or sensory area. To imitate the effect of ROI pooling, multiple branches with atrous convolutions of different dilation rates were combined to exploit multi-scale properties for image segmentation.
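The following is a minimal sketch in the spirit of DeepLab's atrous spatial pyramid pooling: parallel branches with different dilation rates are concatenated and projected. The rates (1, 6, 12, 18) mirror common practice; the channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        # Each branch has a different field of view; concatenating them
        # mixes multi-scale context at every spatial position.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

feats = torch.randn(1, 256, 33, 33)
print(ASPP(256, 256)(feats).shape)   # (1, 256, 33, 33)
```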
fully connected CRF. For each pair of pixels i and j in the image, the pairwise potential was defined as

\theta_{ij}(x_i, x_j) = \mu(x_i, x_j)\left[ w_1 \exp\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\alpha^2} - \frac{\|I_i - I_j\|^2}{2\sigma_\beta^2}\right) + w_2 \exp\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\gamma^2}\right) \right] \quad (2)
Here, µ(x_i, x_j) = 1 if x_i ≠ x_j and 0 otherwise, and w_1, w_2 are the weights given to the kernels. The expression uses two Gaussian kernels. The first one is a bilateral kernel that depends on both the pixel positions (p_i, p_j) and their corresponding intensities in the RGB channels. The second kernel depends only on the pixel positions. σ_α, σ_β and σ_γ control the scale of the Gaussian kernels.
The intuition behind the design of such a pairwise potential energy function
is to ensure that nearby pixels of similar intensities in the RGB channels are
classified under the same class. This model has also been later included in the
popular network called DeepLab (refer section 4.1.3). In the various versions of
the DeepLab algorithm, the use of CRF was able to boost the mean IOU on the
PASCAL VOC 2012 dataset by a significant amount (up to 4% in some cases).
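A toy numeric evaluation of the pairwise potential of Eq. (2) for a single pixel pair may clarify how the two kernels interact; all weights and bandwidths below are illustrative assumptions, not the published settings.

```python
import numpy as np

w1, w2 = 10.0, 3.0
sa, sb, sg = 80.0, 13.0, 3.0   # sigma_alpha, sigma_beta, sigma_gamma

p_i, p_j = np.array([10., 12.]), np.array([14., 15.])              # positions
I_i, I_j = np.array([200., 90., 40.]), np.array([190., 95., 50.])  # RGB values

appearance = w1 * np.exp(-((p_i - p_j)**2).sum() / (2 * sa**2)
                         - ((I_i - I_j)**2).sum() / (2 * sb**2))
smoothness = w2 * np.exp(-((p_i - p_j)**2).sum() / (2 * sg**2))

# mu(x_i, x_j) = 1 only when the labels differ, so nearby, similarly
# colored pixels pay this penalty whenever they are labeled differently.
print(appearance + smoothness)
```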
4. Compatibility Transform: Considering a compatibility function to keep track of the compatibility between various labels, a simple 1 × 1 convolution with the same number of input and output channels is enough to simulate it. Unlike the Potts model, which assigns the same penalty to every pair of distinct labels, here the compatibility function can be learnt, making it a much better alternative.
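As a quick sketch of this step, a 1 × 1 convolution over the per-class belief maps is exactly a learnable label-to-label compatibility matrix applied at every pixel; the class count below just follows the PASCAL VOC convention.

```python
import torch
import torch.nn as nn

num_classes = 21
compat = nn.Conv2d(num_classes, num_classes, kernel_size=1, bias=False)

q = torch.softmax(torch.randn(1, num_classes, 64, 64), dim=1)  # pixel beliefs
print(compat(q).shape)   # (1, 21, 64, 64); the kernel is the 21x21 matrix
```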
PSPNet The pyramid scene parsing network [220] was built upon the FCN based pixel level classification network. The feature maps from a ResNet-101 network are converted to activations of different resolutions through multi-scale pooling layers, which are later upsampled and concatenated with the original feature map to perform segmentation (refer fig. 10). The learning process in deep networks like ResNet was further optimized by using auxiliary classifiers. The different types of pooling modules focus on different areas of the activation map: pooling kernels of various sizes like 1 × 1, 2 × 2, 3 × 3 and 6 × 6 look into different areas of the activation map to create the spatial pooling pyramid. On the ImageNet scene parsing challenge, PSPNet was able to score a mean IoU of 57.21 as compared to 44.80 for FCN and 40.79 for SegNet.
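A minimal pyramid pooling sketch in the spirit of PSPNet follows; the bin sizes (1, 2, 3, 6) match the kernels mentioned above, while the channel arithmetic is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, in_ch // len(bins), 1))
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        # Pool to each bin size, reduce channels, upsample back, and
        # concatenate with the original feature map.
        pyramids = [F.interpolate(s(x), size=(h, w), mode='bilinear',
                                  align_corners=False) for s in self.stages]
        return torch.cat([x] + pyramids, dim=1)

feats = torch.randn(1, 2048, 60, 60)        # e.g. a ResNet-101 output
print(PyramidPooling(2048)(feats).shape)    # (1, 4096, 60, 60)
```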
RefineNet Working with features from the last layer of a CNN produces soft boundaries for the object segments. This issue was avoided in the DeepLab algorithms with atrous convolutions. RefineNet [120] takes an alternative approach by refining intermediate activation maps and hierarchically concatenating them to combine multi-scale activations and prevent loss of sharpness simultaneously. The network consisted of separate RefineNet modules for each block of the ResNet. Each RefineNet module was made up of three main blocks, namely, a residual convolution unit (RCU), multi-resolution fusion (MRF) and chained residual pooling (CRP) (refer fig. 11). The RCU block consists of an adaptive convolution set that fine-tunes the pre-trained ResNet weights for the segmentation problem. The MRF layer fuses activations of different resolutions using convolutions and upsampling layers to create a higher resolution map. Finally, in the CRP layer, pooling kernels of multiple sizes are used on the activations to capture background context from large image areas. RefineNet was tested on the Person-Part dataset, where it obtained an IOU of 68.6 as compared to 64.9 by DeepLab-v2, both of which used ResNet-101 as a feature extractor.
Figure 11: A schematic representation of the RefineNet
an encoder that encodes the input representations from a raw input to a possibly
lower dimensional intermediate representation and a decoder that attempts to
reconstruct the original input from the intermediate representation. The loss
is computed in terms of the difference between the raw input images and the
reconstructed output image. The generative nature of the decoder part has
often been modified and used for image segmentation purposes. Unlike the
traditional autoencoders, during segmentation the loss is computed in terms of
the difference between the reconstructed pixel level class distribution and the
desired pixel level class distribution. This kind of segmentation approach is
more of a generative procedure as compared to the classification approach of
RCNN or DeepLab algorithms. The problem with approaches such as this is to
prevent over-abstraction of images during the encoding process. The primary
benefit of such approaches is the ability to generate sharper boundaries with
much lesser complication. Unlike the classification approaches, the generative
nature of the decoder can learn to create delicate boundaries based on extracted
features. The major issue that affects these algorithms is the level of abstraction. It has been seen that, without proper modification, the reduction in the size of the feature map creates inconsistencies during the reconstruction. In the paradigm of convolutional neural networks, the encoding is basically a series of convolution and pooling layers or strided convolutions. The reconstruction, however, can be tricky. The commonly used techniques for decoding from a lower dimensional feature are transposed convolutions or unpooling layers. One of the main advantages of using an autoencoder based approach over a normal convolutional feature extractor is the freedom of choosing the input size. With a clever use of
down-sampling and up-sampling operations, it is possible to output a pixel-level probability map that is of the same resolution as the input image. This benefit has made encoder-decoder architectures with multi-scale feature forwarding ubiquitous for networks where the input size is not predetermined and an output of the same size as the input is needed.
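The size bookkeeping behind this freedom can be sketched as follows: a strided convolution halves the resolution and a matching transposed convolution restores it, independently of the (even-sized) input resolution. The layer hyper-parameters are illustrative.

```python
import torch
import torch.nn as nn

down = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
up = nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                        padding=1, output_padding=1)

for size in [(64, 64), (128, 96)]:
    x = torch.randn(1, 3, *size)
    y = up(down(x))
    print(x.shape[2:], '->', y.shape[2:])   # output matches the input size
```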
Figure 12: (Left) Normal Convolution with unit stride. (Right) Transposed
convolution with fractional strides.
concepts. Skip connections have proved to be very effective in combining different levels of abstraction from different layers to generate crisp segmentation maps.
decompress the pixel to a similar dimension of 2 × 2. By forwarding pooling indices, the network basically remembers the location of the maximum value among the 4 pixels while performing max-pooling. The index corresponding to the maximum value is forwarded to the decoder (refer fig. 14) so that, during the un-pooling operation, the value from the single pixel can be copied to the corresponding location in the 2 × 2 region in the next layer [215]. The values in the rest of the three positions are computed in the subsequent convolutional layers. If the value were copied to a random location without the knowledge of the pooling indices, there would be inconsistencies in classification, especially in the boundary regions.
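A minimal sketch of this index forwarding, using PyTorch's built-in pooling/unpooling pair (the mechanism SegNet-style decoders rely on):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [1., 0., 0., 1.],
                    [2., 3., 1., 2.]]]])
y, idx = pool(x)     # y is 2x2; idx remembers where each maximum came from
z = unpool(y, idx)   # 4x4 again: maxima restored in place, other cells zero
print(z)
```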
SegNet The SegNet algorithm [9] was launched in 2015 to compete with the FCN network on complex indoor and outdoor images. The architecture was composed of 5 encoding blocks and 5 decoding blocks. The encoding blocks followed the architecture of the feature extractor in the VGG-16 network. Each block is a sequence of multiple convolution, batch normalization and ReLU layers. Each encoding block ends with a max-pooling layer where the indices are stored, and each decoding block begins with an unpooling layer where the saved pooling indices are used (refer fig. 15). The indices from the max-pooling layer of the i-th block in the encoder are forwarded to the max-unpooling layer in the (L − i + 1)-th block in the decoder, where L is the total number of blocks in each of the encoder and decoder. The SegNet architecture scored an mIoU of 60.10 as compared to 53.88 by DeepLab-LargeFOV [31], 49.83 by FCN [130] and 59.77 by DeconvNet [156] on the CamVid dataset.
Figure 15: Architecture of SegNet
The segmentation problem has also been approached from an adversarial learning perspective. The segmentation network is treated as a generator that generates the segmentation masks for each class, whereas a discriminator network tries to predict whether a set of masks comes from the ground truth or from the output of the generator [133]. A schematic diagram of the process is shown in fig. 20. Furthermore, conditional GANs have been used to perform image to image translation [86]. This framework can be used for image segmentation problems where the semantic boundaries of the image and the output segmentation map do not necessarily coincide, for example, when creating a schematic diagram of the façade of a building.
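A schematic single training step for such an adversarial setup is sketched below; the generator, discriminator and optimizers are assumed to be defined elsewhere (e.g. an encoder-decoder and a small CNN classifier), and the loss weighting is deliberately omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

bce = nn.BCEWithLogitsLoss()

def train_step(generator, discriminator, opt_g, opt_d, image, gt_onehot):
    logits = generator(image)              # per-class score maps
    pred = torch.softmax(logits, dim=1)

    # Discriminator: ground-truth masks are "real", predictions "fake".
    d_real = discriminator(torch.cat([image, gt_onehot], dim=1))
    d_fake = discriminator(torch.cat([image, pred.detach()], dim=1))
    loss_d = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool the discriminator while matching the ground truth.
    d_fake = discriminator(torch.cat([image, pred], dim=1))
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + \
             F.cross_entropy(logits, gt_onehot.argmax(dim=1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```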
Figure 16: Adversarial learning model for image segmentation
tions. Attention models have been further developed with the introduction of a dedicated attention module and an external memory to keep track of segments. In the work of [174], the instance segmentation network was divided into 4 modules. First, an external memory provides object boundary details from
all previous steps. Second, a box network attempts to predict the location of
the next instance of the object and outputs a sub-region of the image for the
third module that is the segmentation module. The segmentation module is
similar to a convolutional auto-encoder model discussed previously. The fourth
module scores the predicted segments based on whether they qualify as a proper
instance of the object. The network terminates when the score goes below a
user-defined threshold.
a weak supervision to generate pixel level segmentation maps. In the work of [42], titled BoxSup, segmentation proposals were generated using region proposal methods like selective search. After that, multi-scale combinatorial grouping is used to combine candidate masks, and the objective is to select the optimal combination that has the highest IOU with the box. This segmentation map is used to tune a traditional image segmentation network like FCN. BoxSup was able to attain an mIOU of 75.1 on the PASCAL VOC 2012 test set as compared to 62.2 by FCN or 66.4 by DeepLab-CRF.
extracting superpixels from the image using standard algorithms like SLIC [1], and all pixels within a superpixel are forced to have the same label. The difference between the two segmentation maps is used as a supervisory signal to update the weights.
4.5.3 W-Net
W-Net [207] derived its inspiration from the previously discussed U-Net. The W-Net architecture consists of two cascaded U-Nets. The first U-Net acts as an encoder that converts an image to its segmented version, while the second U-Net tries to reconstruct the original image from the output of the first U-Net, that is, the segmented image. Two loss functions are minimized simultaneously. One of them is the mean square error between the input image and the reconstructed image given by the second U-Net. The second loss function is derived from the normalized cut [186]. The hard normalized cut is formulated as
\mathrm{Ncut}_K(V) = \sum_{k=1}^{K} \frac{\sum_{u \in A_k,\, v \in V - A_k} w(u, v)}{\sum_{u \in A_k,\, t \in V} w(u, t)} \quad (5)
where A_k is the set of pixels in the k-th segment, V is the set of all the pixels, and w measures the weight between two pixels.
However, this function is not differentiable and hence backpropagation is not
possible. Hence, a soft version of the function was proposed:
J_{soft\text{-}Ncut}(V, K) = K - \sum_{k=1}^{K} \frac{\sum_{u \in V} p(u = A_k) \sum_{v \in V} w(u, v)\, p(v = A_k)}{\sum_{u \in V} p(u = A_k) \sum_{t \in V} w(u, t)} \quad (6)
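A small-scale sketch of Eq. (6) on a toy affinity matrix shows that the soft loss is differentiable in the class probabilities, which is what lets W-Net backpropagate through its encoder; all tensors below are random stand-ins for a real image.

```python
import torch

n, K = 16, 3                                    # 16 pixels, 3 segments
w = torch.rand(n, n); w = (w + w.t()) / 2       # symmetric pixel affinities
p = torch.softmax(torch.randn(n, K), dim=1)     # soft pixel assignments

loss = torch.tensor(float(K))
for k in range(K):
    pk = p[:, k]                     # p(u = A_k) for every pixel u
    assoc = pk @ w @ pk              # sum_u sum_v p_u w(u,v) p_v
    degree = pk @ w.sum(dim=1)       # sum_u p_u sum_t w(u,t)
    loss = loss - assoc / degree

print(loss)   # differentiable in p, so gradients reach the encoder
```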
4.6.3 Polygon-RNN
Polygon-RNN [26] takes a different approach from the other methods. Multi-scale features are extracted from different layers of a typical VGG network and concatenated to create a feature block for a recurrent network. The RNN, in turn, is supposed to provide a sequence of points as an output that represents the contour of the object. The system is primarily designed as an interactive image annotation tool. The users can interact in two different ways. Firstly, the users must provide a tight bounding box for the object of interest. Secondly, after the polygon is built, the users are allowed to edit any point in it. However, this editing is not used for any further training of the system and hence presents a small avenue for improvement.
4.7.1 ENet
ENet [163] brought forward a couple of interesting design choices to create a quite shallow segmentation network with a small number of parameters (0.37 million). Instead of a symmetric encoder-decoder architecture like SegNet or U-Net, it has a deeper encoder and a shallower decoder. Instead of increasing channel sizes after pooling, parallel pooling operations were performed along with convolutions of stride 2 to reduce the overall feature size. To increase the learning capability, PReLUs were used instead of ReLUs so that the transfer function remains dynamic and can simulate the job of a ReLU as well as an identity function as required. Identity mappings are normally an important factor in ResNets; however, because the network is shallow, using PReLU is the smarter choice. Additionally, factorized filters allowed for a smaller number of parameters.
In the subsequent stages, the convolutions only take place on those pixels which could not be classified in the previous stage, while forwarding yet harder pixels to the next stage. Typically, the proposed model comes with three stages, each adding more convolutional modules to the network. With layer cascading, an mIOU of 82.7 was reached on the VOC12 test set, with DeepLab-V2 and the Deep Parsing Network being the nearest competitors with mIOUs of 79.7 and 77.5 respectively. In terms of speed, 23.6 frames were processed per second as compared to 14.6 fps by SegNet or 7.1 fps by DeepLab-V2.
4.7.3 SegFast
Another recent implementation, titled SegFast [159], built a network with only 0.6 million parameters that can perform a forward pass in around 0.38 seconds without a GPU. The approach combined the concept of depth-wise separable convolutions with the fire modules of SqueezeNet. SqueezeNet introduced fire modules to reduce the number of convolutional weights; with depth-wise separable convolutions, the number of parameters went further down. The authors also proposed the use of depth-wise separable transposed convolutions for decoding. Even with so many feature reduction strategies, the performance was quite comparable to that of other popular networks like SegNet.
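The parameter saving that SegFast-style designs rely on can be sketched with a direct count; the channel sizes here are arbitrary.

```python
import torch.nn as nn

in_ch, out_ch, k = 128, 256, 3

standard = nn.Conv2d(in_ch, out_ch, k, padding=1)
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, k, padding=1, groups=in_ch),  # depth-wise
    nn.Conv2d(in_ch, out_ch, 1),                          # point-wise
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))   # 295168 vs. 34304 parameters
```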
5 Applications
Image segmentation is one of the most commonly addressed problems in the do-
main of computer vision. It is often augmented with other related tasks like ob-
ject detection, object recognition, scene parsing, image description generation.
Hence this branch of study finds extensive use in various real-life scenarios.
post-production it is often essential to perform segmentation for various tasks
like image matting [115], compositing [24] and rotoscoping [2].
5.4 Forensics
Biometric verification systems based on the iris [125, 65], fingerprints [94], finger veins [170] or dental records [93] involve segmentation of various informative regions for efficient analysis.
5.5 Surveillance
Surveillance systems [147, 95, 89] are associated with various issues like occlusion, lighting or weather conditions. Moreover, surveillance systems can also involve the analysis of images from hyper-spectral sources [4]. They extend to various applications such as object tracking [82], searching [3], anomaly detection [173], threat detection [137], traffic control [208] and so on. Image segmentation plays a vital role in segregating objects of interest from the clutter present in natural scenes.
domain. Weakly supervised algorithms are also in high demand. It is much easier to collect annotations corresponding to problems like classification or localization, and using those annotations to guide image segmentation is a promising direction.
The next important aspect of building deep learning models for image seg-
mentation is the selection of the appropriate approaches. Pre-trained classifiers
can be used for various fully convolutional approaches. Most of the time some
kind of multi-scale feature fusion can be carried out by combining informa-
tion from different depths of the network. Pre-trained classifiers like VGGNet
or ResNet or DenseNet are also often used for the encoder part of an encoder-decoder architecture. Here too, information can be passed from various layers of the encoder to corresponding similar sized layers of the decoder to obtain multi-scale information. Another major benefit of encoder-decoder architectures is that if the down-sampling and up-sampling operations are designed carefully, outputs can be generated which are of the same size as the input. This is a major benefit over simple convolutional approaches like FCN or DeepMask, as it removes the dependency on the input size and hence makes the system more scalable. These two approaches are the most common in the case of semantic segmentation problems. However, if a finer level of instance specific segments is required, it is often necessary to couple them with other methods corresponding to object detection. Utilizing bounding box information is one way to address these problems, while other approaches use attention based models or recurrent models to provide the output as a sequence of segments for each instance of the object.
There can be two aspects to consider while measuring the performance of
the system. One is speed and the other is accuracy. Conditional random field
is one of the most commonly used post-processing modules for refining outputs
from other networks. CRFs can be simulated as an RNN to create end-to-end
trainable modules to provide very precise segmentation maps. Other refinement
strategies include the use of over-segmentation algorithms like superpixels, or
using human interactions to guide segmentation algorithms. In terms of gain
in speed, networks can be highly compressed using strategies like depth-wise
separable convolutions, kernel factorizations, reducing the number of spatial convolutions and so on. These tactics can reduce the number of parameters to a great
extent without reducing the performance too much. Lately, generative adver-
sarial networks have seen a tremendous rise in popularity. However, their use in
the field of segmentation is still pretty thin with only a handful of approaches
addressing the avenue. Given the success they have gained, they certainly have the potential to improve existing systems by a great margin.
The future of image segmentation largely depends on the quality and quan-
tity of available data. While there is an abundance of unstructured data on the internet, the lack of accurate annotations is a legitimate concern. In particular,
pixel level annotations can be incredibly difficult to obtain without manual in-
tervention. The most ideal scenario would be to exploit the data distribution
itself to analyze and extract meaningful segments that represent concepts rather
than content. This is an incredibly challenging task especially if we are working
with a huge amount of unstructured data. The key is to map a representation
of the data distribution to the intent of the problem statement such that the
derived segments are meaningful in some way and contribute to the overall purpose of the system.
7 Conclusion
Image segmentation has seen a new rush of deep learning based algorithms.
Starting with the evolution of deep learning based algorithms we have thor-
oughly explained the pros and cons of the various state of the art algorithms
associated with image segmentation based on deep learning. The simple expla-
nations allow the reader to grasp the most basic concepts that contribute to
the success of deep learning based image segmentation algorithms. The unified
representation scheme followed in the diagrams can highlight the similarities
and differences of various algorithms. In the future this theoretical survey work
can be accompanied by empirical analysis of the discussed methods.
Acknowledgement
This work is partially supported by the project order no. SB/S3/EECE/054/2016,
dated 25/11/2016, sponsored by SERB (Government of India) and carried out
at the Centre for Microprocessor Application for Training Education and Re-
search, CSE Department, Jadavpur University. The authors would also like to
thank the reviewers for their valuable suggestions, which helped to improve the
quality of the manuscript.
Supplementary Information
Image Segmentation before Deep Learning: A Refresher
Thresholding: The most straightforward image segmentation can be achieved by assigning class labels to each pixel based on a threshold point with respect to the intensity values. One of the earliest algorithms, popularly known as Otsu's method [199], chooses the threshold point that maximizes the between-class variance. Many modern approaches have been applied involving fuzzy logic [39] or non-linear thresholds [193]. While early approaches were mainly focused on binary thresholding, multi-class segmentation has also come up in subsequent years.
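A minimal NumPy sketch of Otsu's method may make the idea concrete: sweep all thresholds and keep the one that maximizes the between-class variance of the intensity histogram.

```python
import numpy as np

def otsu_threshold(gray):                    # gray: 2-D uint8 array
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
mask = img >= otsu_threshold(img)            # binary segmentation mask
```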
[186, 21, 22, 180] may be efficiently used to obtain the segments. Probabilistic graphical models such as Markov random fields (MRF) [80] can be used to create a pixel labeling scheme based on a prior probability distribution. An MRF tries to maximize the probability of correctly classifying pixels or regions based on a set of features. Probabilistic graphical models like MRFs, or other similar graph based approaches [56], can also be seen as energy minimization problems [47]; simulated annealing [124] is an apt example in this regard. These approaches can choose to partition graphs based on their energy.
Advancements in Hardware: Another important factor that triggered the onset of deep learning was the availability of parallel computation capabilities. Earlier, the dependence on CPU based architectures created a bottleneck for the large number of floating point operations. Clusters were costly, and hence research was quite localized at organizations with substantial funding. But with the onset of graphics processing units (GPUs) [7] developed by Nvidia and their underlying CUDA API for accessing the parallel compute cores, neural learning systems got a significant boost. These graphics cards provided hundreds and thousands of computational cores at a much cheaper rate, efficiently built to handle the matrix based operations that are ideal for neural networks.
Larger Datasets: With the availability of GPUs, neural network based research became much more widespread. Another important boosting factor was the influx of new and challenging datasets and competitions. Starting with the MNIST dataset [111], digit recognition became one of the vanilla challenges for new systems. Yann LeCun, Yoshua Bengio and Geoffrey Hinton kept deep learning alive at the Canadian Institute For Advanced Research, which lent its name to the CIFAR datasets [103] of natural objects for object classification. The challenge was extended to new heights with the 1000-class ImageNet database [50] along with the competition called the ImageNet Large Scale Visual Recognition Challenge [15]. To this date, it has remained one of the main benchmarks for new object classification algorithms. As people moved on to more complex datasets, image segmentation became challenging as well. Many datasets and competitions such as PASCAL VOC image segmentation [54], ILSVRC scene parsing [15], the DAVIS challenge [168], Microsoft COCO [122] and so on came into the picture and boosted research in image segmentation.
Sequential Models: The earliest problems with deep networks were seen with recurrent neural networks [143]. Recurrent networks can be characterized by the feedback loop that allows them to accept a sequence of inputs while carrying the learned information over every time-step. However, over long chains of inputs, it was seen that information got lost over time due to vanishing gradients. Vanishing or exploding gradients primarily occur due to the long multiplication chains of partial derivatives, which can push the resultant value to almost zero or to a huge value, which in turn results in either an insignificant or a too large weight update [14]. The first attempt to solve this was proposed by the long short term memory (LSTM) [81] architecture, where relevant information can be propagated over long distances through a highway channel affected only by addition or subtraction, hence preserving the gradient values. Sequential models have various applications in computer vision such as video processing, instance segmentation and so on. Fig. 17 shows a generic and an unrolled version of a typical recurrent network.
Figure 17: Sequential models: (top left) generic representation for the t-th input, (bottom left) unfolded network along a sequence of n inputs, (right) a generic LSTM module. The Ψ function represents a linear layer in a traditional LSTM and a convolutional layer in a convolutional LSTM.
Convolutional Neural Networks: Convolutional neural networks [110, 104] are probably one of the most significant inventions under the wing of deep learning for computer vision. Convolutional kernels have often been used for feature extraction from complex images; however, designing kernels by hand was not an easy task, especially for complex data like natural images. With convolutional neural networks, kernels can be randomly initialized and updated iteratively through backpropagation based on an error function like cross entropy or mean square error. Many other operations are commonly found in CNNs, such as pooling, batch normalization, activations, residual connections and so on. Pooling layers increase the receptive fields of convolutional kernels. Batch normalization [84] normalizes the activations across the batch, which helps generalization. Activation functions have been an integral part of perceptron based learning. Since the introduction of AlexNet [104], rectified linear units (ReLU) [153] have been the activation function of choice. ReLU provides a gradient of either 0 or 1, thus preventing vanishing or exploding gradients and also inducing sparsity in the activations. Lately, another interesting method for improving gradient flow was seen in the application of residual connections. Residual connections [78] provide an alternate path for gradients to flow which is devoid of operations that inhibit gradients. Residual connections have also been applied in many cases to improve the quality of segmented images. Fig. 19 shows a convolutional feature extractor along with a fully connected classifier.
Generative Models: Generative models are probably one of the latest attractions of deep learning in computer vision. While sequential models like long short term memory or gated recurrent units are able to generate sequences of vectorized elements, in computer vision generation is much more difficult due to the spatial complexities. Lately, various methodologies like variational autoencoders [98] and adversarial learning [136, 71] have become extremely efficient at generating complex images. These generative properties can be used quite efficiently in tasks like the generation of segmentation masks. A typical example of a generative network that learns by adversarial learning is shown in fig. 20.
Figure 20: A block diagram of a generative adversarial network
References
[1] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P.,
Süsstrunk, S., et al. Slic superpixels compared to state-of-the-art
superpixel methods. IEEE transactions on pattern analysis and machine
intelligence 34, 11 (2012), 2274–2282.
[2] Agarwala, A., Hertzmann, A., Salesin, D. H., and Seitz, S. M.
Keyframe-based tracking for rotoscoping and animation. In ACM Trans-
actions on Graphics (ToG) (2004), vol. 23, ACM, pp. 584–591.
[3] Ahmad, J., Mehmood, I., and Baik, S. W. Efficient object-based
surveillance image search using spatial pooling of convolutional features.
Journal of Visual Communication and Image Representation 45 (2017),
62–76.
[4] Alam, F. I., Zhou, J., Liew, A. W.-C., and Jia, X. Crf learning with
cnn features for hyperspectral image segmentation. In Geoscience and
Remote Sensing Symposium (IGARSS), 2016 IEEE International (2016),
IEEE, pp. 6890–6893.
[5] Albiol, A., Torres, L., and Delp, E. J. An unsupervised color
image segmentation algorithm for face detection applications. In Image
Processing, 2001. Proceedings. 2001 International Conference on (2001),
vol. 2, IEEE, pp. 681–684.
[6] Araújo, T., Aresta, G., Castro, E., Rouco, J., Aguiar, P., Eloy,
C., Polónia, A., and Campilho, A. Classification of breast cancer
histology images using convolutional neural networks. PloS one 12, 6
(2017), e0177544.
[7] Asano, S., Maruyama, T., and Yamaguchi, Y. Performance com-
parison of fpga, gpu and cpu in image processing. In Field programmable
logic and applications, 2009. fpl 2009. international conference on (2009),
IEEE, pp. 126–131.
[8] Ather, A. A quality analysis of openstreetmap data. ME Thesis, Uni-
versity College London, London, UK 22 (2009).
[9] Badrinarayanan, V., Kendall, A., and Cipolla, R. Segnet: A deep
convolutional encoder-decoder architecture for image segmentation. IEEE
transactions on pattern analysis and machine intelligence 39, 12 (2017),
2481–2495.
[10] Bandyopadhyay, S., Maulik, U., and Mukhopadhyay, A. Multiob-
jective genetic clustering for pixel classification in remote sensing imagery.
IEEE transactions on Geoscience and Remote Sensing 45, 5 (2007), 1506–
1511.
[11] Barlow, J., Franklin, S., and Martin, Y. High spatial resolution
satellite imagery, dem derivatives, and image segmentation for the detec-
tion of mass wasting processes. Photogrammetric Engineering and Remote
Sensing 72, 6 (2006), 687–692.
[12] Belongie, S., Carson, C., Greenspan, H., and Malik, J. Color-and
texture-based image segmentation using em and its application to content-
based image retrieval. In Computer Vision, 1998. Sixth International
Conference on (1998), IEEE, pp. 675–682.
[13] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H.
Greedy layer-wise training of deep networks. In Advances in neural infor-
mation processing systems (2007), pp. 153–160.
[14] Bengio, Y., Simard, P., and Frasconi, P. Learning long-term de-
pendencies with gradient descent is difficult. IEEE transactions on neural
networks 5, 2 (1994), 157–166.
[15] Berg, A., Deng, J., and Fei-Fei, L. Large scale visual recognition challenge (ilsvrc), 2010. URL http://www.image-net.org/challenges/LSVRC 3 (2010).
[16] Bezdek, J. C., Ehrlich, R., and Full, W. Fcm: The fuzzy c-means
clustering algorithm. Computers and Geosciences 10, 2-3 (1984), 191–203.
[17] Bins, L. S., Fonseca, L. G., Erthal, G. J., and Ii, F. M. Satellite
imagery segmentation: a region growing approach. Simpósio Brasileiro de
Sensoriamento Remoto 8, 1996 (1996), 677–680.
[18] Borji, A. What is a salient object? a dataset and a baseline model for
salient object detection. IEEE Transactions on Image Processing 24, 2
(2015), 742–756.
[19] Borji, A., Cheng, M.-M., Hou, Q., Jiang, H., and Li, J. Salient
object detection: A survey. arXiv preprint arXiv:1411.5878 (2014).
[20] Borji, A., Cheng, M.-M., Jiang, H., and Li, J. Salient object de-
tection: A benchmark. IEEE Transactions on Image Processing 24, 12
(2015), 5706–5722.
[21] Boykov, Y., Veksler, O., and Zabih, R. Fast approximate energy
minimization via graph cuts. IEEE Transactions on pattern analysis and
machine intelligence 23, 11 (2001), 1222–1239.
[22] Boykov, Y. Y., and Jolly, M.-P. Interactive graph cuts for optimal
boundary & region segmentation of objects in nd images. In Computer
Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Con-
ference on (2001), vol. 1, IEEE, pp. 105–112.
[23] Brostow, G. J., Fauqueur, J., and Cipolla, R. Semantic object
classes in video: A high-definition ground truth database. Pattern Recog-
nition Letters 30, 2 (2009), 88–97.
[24] Cahill, N. D., and Ray, L. A. Method and system for compositing
images to produce a cropped image, Jan. 9 2007. US Patent 7,162,102.
[25] Carass, A., Roy, S., Jog, A., Cuzzocreo, J. L., Magrath, E.,
Gherman, A., Button, J., Nguyen, J., Bazin, P.-L., Calabresi,
P. A., et al. Longitudinal multiple sclerosis lesion segmentation data
resource. Data in brief 12 (2017), 346–350.
[26] Castrejon, L., Kundu, K., Urtasun, R., and Fidler, S. Annotating
object instances with a polygon-rnn. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition (2017), pp. 5230–5238.
[27] Chang, P.-L., and Teng, W.-G. Exploiting the self-organizing map for
medical image segmentation. In Computer-Based Medical Systems, 2007.
CBMS’07. Twentieth IEEE International Symposium on (2007), IEEE,
pp. 281–288.
[28] Chen, J., Yang, L., Zhang, Y., Alber, M., and Chen, D. Z. Com-
bining fully convolutional and recurrent neural networks for 3d biomedical
image segmentation. In Advances in Neural Information Processing Sys-
tems (2016), pp. 3036–3044.
[29] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and
Yuille, A. L. Semantic image segmentation with deep convolutional
nets and fully connected crfs. arXiv preprint arXiv:1412.7062 (2014).
[30] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and
Yuille, A. L. Deeplab: Semantic image segmentation with deep convo-
lutional nets, atrous convolution, and fully connected crfs. IEEE transac-
tions on pattern analysis and machine intelligence 40, 4 (2018), 834–848.
[31] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and
Yuille, A. L. Deeplab: Semantic image segmentation with deep convo-
lutional nets, atrous convolution, and fully connected crfs. IEEE transac-
tions on pattern analysis and machine intelligence 40, 4 (2018), 834–848.
[32] Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H. Rethink-
ing atrous convolution for semantic image segmentation. arXiv preprint
arXiv:1706.05587 (2017).
[33] Chen, L.-C., Yang, Y., Wang, J., Xu, W., and Yuille, A. L. At-
tention to scale: Scale-aware semantic image segmentation. In Proceedings
of the IEEE conference on computer vision and pattern recognition (2016),
pp. 3640–3649.
[34] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam,
H. Encoder-decoder with atrous separable convolution for semantic image
segmentation. arXiv preprint arXiv:1802.02611 (2018).
[35] Cheng, K.-S., Lin, J.-S., and Mao, C.-W. The application of com-
petitive hopfield neural network to medical image segmentation. IEEE
transactions on medical imaging 15, 4 (1996), 560–567.
[36] Cheng, M.-M., Mitra, N. J., Huang, X., and Hu, S.-M.
Salientshape: Group saliency in image collections. The Visual Computer
30, 4 (2014), 443–453.
[37] Cheng, M.-M., Mitra, N. J., Huang, X., Torr, P. H., and Hu,
S.-M. Global contrast based salient region detection. IEEE Transactions
on Pattern Analysis and Machine Intelligence 37, 3 (2015), 569–582.
[38] Choudhuri, S., Das, N., Ghosh, S., and Nasipuri, M. A multi-cue
information based approach to contour detection by utilizing superpixel
segmentation. In Advances in Computing, Communications and Informat-
ics (ICACCI), 2016 International Conference on (2016), IEEE, pp. 1057–
1063.
[39] Chuang, K.-S., Tzeng, H.-L., Chen, S., Wu, J., and Chen, T.-J.
Fuzzy c-means clustering with spatial information for image segmentation.
Computerized MedicalImaging and Graphics 30, 1 (2006), 9–15.
[40] Comaniciu, D., and Meer, P. Robust analysis of feature spaces: color
image segmentation. In Computer Vision and Pattern Recognition, 1997.
Proceedings., 1997 IEEE Computer Society Conference on (1997), IEEE,
pp. 750–755.
[41] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler,
M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The
cityscapes dataset for semantic urban scene understanding. In Proceedings
of the IEEE conference on computer vision and pattern recognition (2016),
pp. 3213–3223.
[42] Dai, J., He, K., and Sun, J. Boxsup: Exploiting bounding boxes to
supervise convolutional networks for semantic segmentation. In Proceed-
ings of the IEEE International Conference on Computer Vision (2015),
pp. 1635–1643.
[43] Dai, J., He, K., and Sun, J. Instance-aware semantic segmentation via
multi-task network cascades. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (2016), pp. 3150–3158.
[44] Dai, J., Li, Y., He, K., and Sun, J. R-fcn: Object detection via region-
based fully convolutional networks. In Advances in neural information
processing systems (2016), pp. 379–387.
[45] Das, A., Ghosh, S., Sarkhel, R., Choudhuri, S., Das, N., and
Nasipuri, M. Combining multi-level contexts of superpixel using convo-
lutional neural networks to perform natural scene labeling. arXiv preprint
arXiv:1803.05200 (2018).
[46] Das, A., Ghosh, S., Sarkhel, R., Choudhuri, S., Das, N., and
Nasipuri, M. Combining multilevel contexts of superpixel using con-
volutional neural networks to perform natural scene labeling. In Recent
Developments in Machine Learning and Data Analytics. Springer, 2019,
pp. 297–306.
[50] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L.
Imagenet: A large-scale hierarchical image database. In Computer Vision
and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (2009),
IEEE, pp. 248–255.
[51] Du, Y., Arslanturk, E., Zhou, Z., and Belcher, C. Video-based
noncooperative iris image segmentation. IEEE Transactions on Systems,
Man, and Cybernetics, Part B (Cybernetics) 41, 1 (2011), 64–74.
[52] Duan, L., Tsang, I. W., Xu, D., and Chua, T.-S. Domain adap-
tation from multiple sources via auxiliary classifiers. In Proceedings of
the 26th Annual International Conference on Machine Learning (2009),
ACM, pp. 289–296.
[53] Dumoulin, V., and Visin, F. A guide to convolution arithmetic for
deep learning. arXiv preprint arXiv:1603.07285 (2016).
[54] Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and
Zisserman, A. The pascal visual object classes (voc) challenge. Interna-
tional journal of computer vision 88, 2 (2010), 303–338.
[55] Farabet, C., Couprie, C., Najman, L., and LeCun, Y. Learning
hierarchical features for scene labeling. IEEE transactions on pattern
analysis and machine intelligence 35, 8 (2013), 1915–1929.
[56] Felzenszwalb, P. F., and Huttenlocher, D. P. Efficient graph-
based image segmentation. International journal of computer vision 59, 2
(2004), 167–181.
[66] Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-
Martinez, V., and Garcia-Rodriguez, J. A review on deep
learning techniques applied to semantic segmentation. arXiv preprint
arXiv:1704.06857 (2017).
[67] Geiger, A., Lenz, P., and Urtasun, R. Are we ready for autonomous
driving? The kitti vision benchmark suite. In Computer Vision and
Pattern Recognition (CVPR), 2012 IEEE Conference on (2012), IEEE,
pp. 3354–3361.
[68] Geng, Q., Zhou, Z., and Cao, X. Survey of recent progress in semantic
image segmentation with cnns. Science China Information Sciences 61, 5
(2018), 051101.
[69] Girshick, R. Fast r-cnn. arXiv preprint arXiv:1504.08083 (2015).
[70] Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich fea-
ture hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE conference on computer vision and pattern
recognition (2014), pp. 580–587.
[71] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-
Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative
adversarial nets. In Advances in neural information processing systems
(2014), pp. 2672–2680.
[72] Gould, S., Fulton, R., and Koller, D. Decomposing a scene into
geometric and semantically consistent regions. In Computer Vision, 2009
IEEE 12th International Conference on (2009), IEEE, pp. 1–8.
[73] Grau, V., Mewes, A., Alcaniz, M., Kikinis, R., and Warfield,
S. K. Improved watershed transform for medical image segmentation
using prior information. IEEE transactions on medical imaging 23, 4
(2004), 447–458.
[74] Han, X. Automatic liver lesion segmentation using a deep convolutional
neural network method. arXiv preprint arXiv:1704.07239 (2017).
[75] Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S., and Malik,
J. Semantic contours from inverse detectors. In International Conference
on Computer Vision (ICCV) (2011).
[76] He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask r-cnn.
In Computer Vision (ICCV), 2017 IEEE International Conference on
(2017), IEEE, pp. 2980–2988.
[77] He, K., Zhang, X., Ren, S., and Sun, J. Spatial pyramid pooling in
deep convolutional networks for visual recognition. IEEE transactions on
pattern analysis and machine intelligence 37, 9 (2015), 1904–1916.
[78] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for
image recognition. In Proceedings of the IEEE conference on computer
vision and pattern recognition (2016), pp. 770–778.
[79] Hinton, G. E., Osindero, S., and Teh, Y.-W. A fast learning algo-
rithm for deep belief nets. Neural computation 18, 7 (2006), 1527–1554.
[85] Irshad, H., Veillard, A., Roux, L., and Racoceanu, D. Methods
for nuclei detection, segmentation, and classification in digital histopathol-
ogy: a review – current status and future potential. IEEE reviews in biomed-
ical engineering 7 (2014), 97–114.
[86] Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-to-image
translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004 (2017).
[87] Jassim, F. A., and Altaani, F. H. Hybridization of otsu method
and median filter for color image segmentation. arXiv preprint
arXiv:1305.1052 (2013).
[88] Jégou, S., Drozdzal, M., Vazquez, D., Romero, A., and Ben-
gio, Y. The one hundred layers tiramisu: Fully convolutional densenets
for semantic segmentation. In Computer Vision and Pattern Recognition
Workshops (CVPRW), 2017 IEEE Conference on (2017), IEEE, pp. 1175–
1183.
[89] Jin, C.-B., Li, S., Do, T. D., and Kim, H. Real-time human ac-
tion recognition using cnn over temporal images for static video surveil-
lance cameras. In Pacific Rim Conference on Multimedia (2015), Springer,
pp. 330–339.
[90] Kam, A., Ng, T., Kingsbury, N., and Fitzgerald, W. Content
based image retrieval through object extraction and querying. In IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL 2000), IEEE, p. 91.
[91] Kamnitsas, K., Ledig, C., Newcombe, V. F., Simpson, J. P.,
Kane, A. D., Menon, D. K., Rueckert, D., and Glocker, B.
Efficient multi-scale 3d cnn with fully connected crf for accurate brain
lesion segmentation. Medical image analysis 36 (2017), 61–78.
[92] Kanezaki, A. Unsupervised image segmentation by backpropagation.
In 2018 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP) (2018), IEEE, pp. 1543–1547.
[93] Kang, J., Li, X., Luan, Q., Liu, J., and Min, L. Dental plaque quan-
tification using cellular neural network-based image segmentation. In In-
telligent computing in signal processing and pattern recognition. Springer,
2006, pp. 797–802.
[94] Kang, J., and Zhang, W. Fingerprint segmentation using cellular neu-
ral network. In Computational Intelligence and Natural Computing, 2009.
CINC’09. International Conference on (2009), vol. 2, IEEE, pp. 11–14.
[95] Kang, K., and Wang, X. Fully convolutional neural networks for crowd
segmentation. arXiv preprint arXiv:1411.4464 (2014).
[96] Kass, M., Witkin, A., and Terzopoulos, D. Snakes: Active contour
models. International journal of computer vision 1, 4 (1988), 321–331.
[97] Kingma, D. P., and Ba, J. Adam: A method for stochastic optimiza-
tion. arXiv preprint arXiv:1412.6980 (2014).
[98] Kingma, D. P., and Welling, M. Auto-encoding variational bayes.
arXiv preprint arXiv:1312.6114 (2013).
[99] Kong, T., Yao, A., Chen, Y., and Sun, F. Hypernet: Towards accu-
rate region proposal generation and joint object detection. In Proceedings
of the IEEE conference on computer vision and pattern recognition (2016),
pp. 845–853.
[103] Krizhevsky, A., Nair, V., and Hinton, G. The cifar-10 dataset.
online: http://www.cs.toronto.edu/kriz/cifar.html (2014).
[104] Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classi-
fication with deep convolutional neural networks. In Advances in neural
information processing systems (2012), pp. 1097–1105.
[105] Kuntimad, G., and Ranganath, H. S. Perfect image segmentation us-
ing pulse coupled neural networks. IEEE Transactions on Neural networks
10, 3 (1999), 591–598.
[106] Labao, A. B., and Naval, P. C. Weakly-labelled semantic segmenta-
tion of fish objects in underwater videos using a deep residual network.
In Asian Conference on Intelligent Information and Database Systems
(2017), Springer, pp. 255–265.
[107] Skarbek, W., and Koschan, A. Colour image segmentation: a survey. IEEE Transactions on Circuits and Systems for Video Technology 14, 7 (1994).
[108] LaLonde, R., and Bagci, U. Capsules for object segmentation. arXiv
preprint arXiv:1804.04241 (2018).
[109] Längkvist, M., Kiselev, A., Alirezaie, M., and Loutfi, A. Clas-
sification and segmentation of satellite orthoimagery using convolutional
neural networks. Remote Sensing 8, 4 (2016), 329.
[110] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-
based learning applied to document recognition. Proceedings of the IEEE
86, 11 (1998), 2278–2324.
[111] LeCun, Y., Cortes, C., and Burges, C. Mnist handwritten
digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2 (2010).
[112] Lee, S., Cho, M. S., Jung, K., and Kim, J. H. Scene text extraction
with edge constraint and text collinearity. In 2010 International Confer-
ence on Pattern Recognition (2010), IEEE, pp. 3983–3986.
[113] Leibe, B., Seemann, E., and Schiele, B. Pedestrian detection in
crowded scenes. In Computer Vision and Pattern Recognition, 2005.
CVPR 2005. IEEE Computer Society Conference on (2005), vol. 1, IEEE,
pp. 878–885.
[114] Levi, D., Garnett, N., Fetaya, E., and Herzlyia, I. Stixelnet: A
deep convolutional network for obstacle detection and road segmentation.
In BMVC (2015), pp. 109.1–109.12.
[115] Levin, A., Lischinski, D., and Weiss, Y. A closed-form solution
to natural image matting. IEEE transactions on pattern analysis and
machine intelligence 30, 2 (2008), 228–242.
[116] Li, X., Liu, Z., Luo, P., Change Loy, C., and Tang, X. Not all
pixels are equal: Difficulty-aware semantic segmentation via deep layer
cascade. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (2017), pp. 3193–3202.
[117] Li, Y., Hou, X., Koch, C., Rehg, J. M., and Yuille, A. L. The
secrets of salient object segmentation. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition (2014), pp. 280–287.
[118] Li, Y., Qi, H., Dai, J., Ji, X., and Wei, Y. Fully convolutional
instance-aware semantic segmentation. arXiv preprint arXiv:1611.07709
(2016).
[119] Lie, W.-N. Automatic target segmentation by locally adaptive image
thresholding. IEEE Transactions on Image Processing 4, 7 (1995), 1036–
1041.
[120] Lin, G., Milan, A., Shen, C., and Reid, I. Refinenet: Multi-path
refinement networks for high-resolution semantic segmentation. In IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
[121] Lin, M., Chen, Q., and Yan, S. Network in network. arXiv preprint
arXiv:1312.4400 (2013).
[122] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ra-
manan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common
objects in context. In European conference on computer vision (2014),
Springer, pp. 740–755.
[123] Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi,
F., Ghafoorian, M., van der Laak, J. A., van Ginneken, B., and
Sánchez, C. I. A survey on deep learning in medical image analysis.
Medical image analysis 42 (2017), 60–88.
[124] Liu, J., and Yang, Y.-H. Multiresolution color image segmentation.
IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 7
(1994), 689–700.
[125] Liu, N., Li, H., Zhang, M., Liu, J., Sun, Z., and Tan, T. Accurate
iris segmentation in non-cooperative environments using fully convolu-
tional networks. In Biometrics (ICB), 2016 International Conference on
(2016), IEEE, pp. 1–8.
[126] Liu, S., De Mello, S., Gu, J., Zhong, G., Yang, M.-H., and
Kautz, J. Learning affinity via spatial propagation networks. In Ad-
vances in Neural Information Processing Systems (2017), pp. 1520–1530.
[127] Liu, Y., Zhang, D., Lu, G., and Ma, W.-Y. A survey of content-
based image retrieval with high-level semantics. Pattern recognition 40, 1
(2007), 262–282.
[128] Liu, Z., Li, X., Luo, P., Loy, C.-C., and Tang, X. Semantic im-
age segmentation via deep parsing network. In Proceedings of the IEEE
International Conference on Computer Vision (2015), pp. 1377–1385.
[129] Loizou, C. P., Murray, V., Pattichis, M. S., Seimenis, I.,
Pantziaris, M., and Pattichis, C. S. Multiscale amplitude-
modulation frequency-modulation (am–fm) texture analysis of multiple
sclerosis in brain mri images. IEEE Transactions on Information Tech-
nology in Biomedicine 15, 1 (2011), 119–129.
[130] Long, J., Shelhamer, E., and Darrell, T. Fully convolutional net-
works for semantic segmentation. In Proceedings of the IEEE conference
on computer vision and pattern recognition (2015), pp. 3431–3440.
[131] López-Linares, K., Lete, N., Kabongo, L., Ceresa, M., Maclair,
G., García-Familiar, A., Macía, I., and Ballester, M. Á. G.
Comparison of regularization techniques for dcnn-based abdominal aortic
aneurysm segmentation. In Biomedical Imaging (ISBI 2018), 2018 IEEE
15th International Symposium on (2018), IEEE, pp. 864–867.
[132] Lu, P., Barazzetti, L., Chandran, V., Gavaghan, K., Weber, S.,
Gerber, N., and Reyes, M. Highly accurate facial nerve segmentation
refinement from cbct/ct imaging using a super-resolution classification
approach. IEEE transactions on biomedical engineering 65, 1 (2018), 178–
188.
[133] Luc, P., Couprie, C., Chintala, S., and Verbeek, J. Semantic
segmentation using adversarial networks. arXiv preprint arXiv:1611.08408
(2016).
[134] Maggiori, E., Tarabalka, Y., Charpiat, G., and Alliez, P. Can
semantic labeling methods generalize to any city? the inria aerial image
labeling benchmark. In IEEE International Symposium on Geoscience
and Remote Sensing (IGARSS) (2017).
[135] Maier, O., Menze, B. H., von der Gablentz, J., Häni, L., Hein-
rich, M. P., Liebrand, M., Winzeck, S., Basit, A., Bentley, P.,
Chen, L., et al. Isles 2015-a public evaluation benchmark for ischemic
stroke lesion segmentation from multispectral mri. Medical image analysis
35 (2017), 250–269.
[136] Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey,
B. Adversarial autoencoders. arXiv preprint arXiv:1511.05644 (2015).
[138] Maninis, K.-K., Caelles, S., Pont-Tuset, J., and Van Gool, L.
Deep extreme cut: From extreme points to object segmentation. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition (2018), pp. 616–625.
[139] Maninis, K.-K., Pont-Tuset, J., Arbeláez, P., and Van Gool, L.
Convolutional oriented boundaries. In European Conference on Computer
Vision (2016), Springer, pp. 580–596.
[140] Martin, D., Fowlkes, C., Tal, D., and Malik, J. A database of
human segmented natural images and its application to evaluating seg-
mentation algorithms and measuring ecological statistics. In Proc. 8th
Int’l Conf. Computer Vision (July 2001), vol. 2, pp. 416–423.
[141] Masci, J., Meier, U., Cireşan, D., and Schmidhuber, J. Stacked
convolutional auto-encoders for hierarchical feature extraction. In In-
ternational Conference on Artificial Neural Networks (2011), Springer,
pp. 52–59.
[148] Molinari, F., Zeng, G., and Suri, J. S. A state of the art re-
view on intima-media thickness (imt) measurement and wall segmenta-
tion techniques for carotid ultrasound. Computer methods and programs
in biomedicine 100, 3 (2010), 201–221.
[149] Moriya, T., Roth, H. R., Nakamura, S., Oda, H., Nagara, K.,
Oda, M., and Mori, K. Unsupervised segmentation of 3d medical
images based on clustering and deep representation learning. In Medi-
cal Imaging 2018: Biomedical Applications in Molecular, Structural, and
Functional Imaging (2018), vol. 10578, International Society for Optics
and Photonics, p. 1057820.
[150] Mukhopadhyay, A., Maulik, U., and Bandyopadhyay, S. Multiob-
jective genetic algorithm-based fuzzy clustering of categorical attributes.
IEEE transactions on evolutionary computation 13, 5 (2009), 991–1005.
[151] Mumford, D., and Shah, J. Optimal approximations by piecewise
smooth functions and associated variational problems. Communications
on pure and applied mathematics 42, 5 (1989), 577–685.
[152] Mundhenk, T. N., Ho, D., and Chen, B. Y. Improvements to context
based self-supervised learning. In Computer Vision and Pattern Recogni-
tion (CVPR) (2018).
[153] Nair, V., and Hinton, G. E. Rectified linear units improve restricted
boltzmann machines. In Proceedings of the 27th international conference
on machine learning (ICML-10) (2010), pp. 807–814.
[154] Nassar, A., Amer, K., ElHakim, R., and ElHelw, M. A deep
cnn-based framework for enhanced aerial imagery registration with appli-
cations to uav geolocalization. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition Workshops (2018), pp. 1513–
1523.
[155] Neuhold, G., Ollmann, T., Bulo, S. R., and Kontschieder,
P. The mapillary vistas dataset for semantic understanding of street
scenes. In Proceedings of the International Conference on Computer Vi-
sion (ICCV), Venice, Italy (2017), pp. 22–29.
[156] Noh, H., Hong, S., and Han, B. Learning deconvolution network for
semantic segmentation. In Proceedings of the IEEE International Confer-
ence on Computer Vision (2015), pp. 1520–1528.
Computer Vision, Graphics and Image Processing (ICVGIP 2018), ACM,
p. 7.
[160] Pal, N. R., and Pal, S. K. A review on image segmentation techniques.
Pattern recognition 26, 9 (1993), 1277–1294.
[167] Pinheiro, P. O., Lin, T.-Y., Collobert, R., and Dollár, P. Learn-
ing to refine object segments. In European Conference on Computer Vision
(2016), Springer, pp. 75–91.
[168] Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-
Hornung, A., and Van Gool, L. The 2017 davis challenge on video
object segmentation. arXiv preprint arXiv:1704.00675 (2017).
[169] Porwal, P., Pachade, S., Kamble, R., Kokare, M., Desh-
mukh, G., Sahasrabuddhe, V., Meriaudeau, F., Quellec, G.,
MacGillivray, T., Giancardo, L., and Sidibé, D. Diabetic retinopathy: Segmentation and grading challenge workshop. IEEE International Symposium on Biomedical Imaging (ISBI 2018) (2018).
[171] Radau, P., Lu, Y., Connelly, K., Paul, G., Dick, A., and
Wright, G. Evaluation framework for algorithms segmenting short axis
cardiac mri. The MIDAS Journal-Cardiac MR Left Ventricle Segmenta-
tion Challenge 49 (2009).
[172] Ranjan, A., Jampani, V., Kim, K., Sun, D., Wulff, J., and Black,
M. J. Adversarial collaboration: Joint unsupervised learning of depth,
camera motion, optical flow and motion segmentation. arXiv preprint
arXiv:1805.09806 (2018).
[173] Ravanbakhsh, M., Nabi, M., Mousavi, H., Sangineto, E., and
Sebe, N. Plug-and-play cnn for crowd motion analysis: An application
in abnormal event detection. arXiv preprint arXiv:1610.00307 (2016).
[174] Ren, M., and Zemel, R. S. End-to-end instance segmentation with
recurrent attention. arXiv preprint arXiv:1605.09410 (2017).
[175] Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards
real-time object detection with region proposal networks. In Advances in
neural information processing systems (2015), pp. 91–99.
[176] Romera-Paredes, B., and Torr, P. H. S. Recurrent instance seg-
mentation. In European Conference on Computer Vision (2016), Springer,
pp. 312–329.
[183] Sabour, S., Frosst, N., and Hinton, G. E. Dynamic routing between
capsules. In Advances in Neural Information Processing Systems (2017),
pp. 3856–3866.
[184] Senthilkumaran, N., and Rajesh, R. Edge detection techniques for
image segmentation–a survey of soft computing approaches. International
Journal of Recent Trends in Engineering 1, 2 (2009), 250–254.
[185] Sharma, N., and Aggarwal, L. M. Automated medical image seg-
mentation techniques. Journal of medical physics/Association of Medical
Physicists of India 35, 1 (2010), 3.
[186] Shi, J., and Malik, J. Normalized cuts and image segmentation. IEEE
Transactions on pattern analysis and machine intelligence 22, 8 (2000),
888–905.
[187] Shi, J., Yan, Q., Xu, L., and Jia, J. Hierarchical image saliency
detection on extended cssd. IEEE transactions on pattern analysis and
machine intelligence 38, 4 (2016), 717–729.
[188] Shotton, J., Winn, J., Rother, C., and Criminisi, A. Texton-
boost: Joint appearance, shape and context modeling for multi-class ob-
ject recognition and segmentation. In European conference on computer
vision (2006), Springer, pp. 1–15.
[189] Silveira, M., Nascimento, J. C., Marques, J. S., Marçal, A. R.,
Mendonça, T., Yamauchi, S., Maeda, J., and Rozeira, J. Com-
parison of segmentation methods for melanoma diagnosis in dermoscopy
images. IEEE Journal of Selected Topics in Signal Processing 3, 1 (2009),
35–45.
[190] Song, Y., Zhu, Y., Li, G., Feng, C., He, B., and Yan, T. Side
scan sonar segmentation using deep convolutional neural network. In
OCEANS–Anchorage, 2017 (2017), IEEE, pp. 1–4.
[191] Staal, J., Abràmoff, M. D., Niemeijer, M., Viergever, M. A.,
and Van Ginneken, B. Ridge-based vessel segmentation in color images
of the retina. IEEE transactions on medical imaging 23, 4 (2004), 501–509.
[192] Szirányi, T., László, K., Czúni, L., and Ziliani, F. Object oriented
motion-segmentation for video-compression in the cnn-um. Journal of
VLSI signal processing systems for signal, image and video technology 23,
2-3 (1999), 479–496.
[193] Tan, K. S., and Isa, N. A. M. Color image segmentation using his-
togram thresholding–fuzzy c-means hybrid approach. Pattern Recognition
44, 1 (2011), 1–15.
[194] Tatiraju, S., and Mehta, A. Image segmentation using k-means clustering, EM and normalized cuts. University of California, Irvine (2008).
[195] Tobias, O. J., and Seara, R. Image segmentation by histogram thresh-
olding using fuzzy sets. IEEE transactions on Image Processing 11, 12
(2002), 1457–1465.
[196] Treml, M., Arjona-Medina, J., Unterthiner, T., Durgesh, R.,
Friedmann, F., Schuberth, P., Mayr, A., Heusel, M., Hof-
marcher, M., Widrich, M., et al. Speeding up semantic segmentation
for autonomous driving. In MLITS, NIPS Workshop (2016).
[197] Tu, W.-C., Liu, M.-Y., Jampani, V., Sun, D., Chien, S.-Y., Yang,
M.-H., and Kautz, J. Learning superpixels with segmentation-aware
affinity loss. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (2018), pp. 568–576.
[198] Uijlings, J. R., Van De Sande, K. E., Gevers, T., and Smeulders,
A. W. Selective search for object recognition. International journal of
computer vision 104, 2 (2013), 154–171.
[199] Vala, M. H. J., and Baxi, A. A review on otsu image segmentation
algorithm. International Journal of Advanced Research in Computer En-
gineering and Technology (IJARCET) 2, 2 (2013), 387.
[200] Van de Sande, K. E., Uijlings, J. R., Gevers, T., and Smeulders,
A. W. Segmentation as selective search for object recognition. In Com-
puter Vision (ICCV), 2011 IEEE International Conference on (2011),
IEEE, pp. 1879–1886.
[201] Van Ginneken, B., Stegmann, M. B., and Loog, M. Segmentation
of anatomical structures in chest radiographs using supervised methods:
a comparative study on a public database. Medical image analysis 10, 1
(2006), 19–40.
[202] Varma, G., Subramanian, A., Namboodiri, A., Chandraker,
M., and Jawahar, C. Idd: A dataset for exploring problems of
autonomous navigation in unconstrained environments. arXiv preprint
arXiv:1811.10200 (2018).
[203] Veit, A., Matera, T., Neumann, L., Matas, J., and Belongie, S.
Coco-text: Dataset and benchmark for text detection and recognition in
natural images. arXiv preprint arXiv:1601.07140 (2016).
[204] Vilarino, D. L., Cabello, D., and Brea, V. M. An analogic cnn-
algorithm of pixel level snakes for tracking and surveillance tasks. In Cel-
lular Neural Networks and Their Applications, 2002.(CNNA 2002). Pro-
ceedings of the 2002 7th IEEE International Workshop on (2002), IEEE,
pp. 84–91.
[205] Wang, K., Babenko, B., and Belongie, S. End-to-end scene text
recognition. In Computer Vision (ICCV), 2011 IEEE International Con-
ference on (2011), IEEE, pp. 1457–1464.
[206] Wei, G.-Q., Arbter, K., and Hirzinger, G. Real-time visual ser-
voing for laparoscopic surgery: controlling robot motion with color image
segmentation. IEEE Engineering in Medicine and Biology Magazine 16,
1 (1997), 40–45.
[207] Xia, X., and Kulis, B. W-net: A deep model for fully unsupervised
image segmentation. arXiv preprint arXiv:1711.08506 (2017).
[208] Xu, J., Wang, G., and Sun, F. A novel method for detecting and track-
ing vehicles in traffic-image sequence. In Fifth International Conference
on Digital Image Processing (ICDIP 2013) (2013), vol. 8878, International
Society for Optics and Photonics, p. 88782P.
[209] Xu, N., Yang, L., Fan, Y., Yue, D., Liang, Y., Yang, J., and
Huang, T. Youtube-vos: A large-scale video object segmentation bench-
mark. arXiv preprint arXiv:1809.03327 (2018).
[210] Yang, C., Zhang, L., Lu, H., Ruan, X., and Yang, M.-H. Saliency
detection via graph-based manifold ranking. In Computer Vision and
Pattern Recognition (CVPR), 2013 IEEE Conference on (2013), IEEE,
pp. 3166–3173.
[211] Yu, F., and Koltun, V. Multi-scale context aggregation by dilated
convolutions. arXiv preprint arXiv:1511.07122 (2015).
[212] Yu, F., Xian, W., Chen, Y., Liu, F., Liao, M., Madhavan, V., and
Darrell, T. Bdd100k: A diverse driving video database with scalable
annotation tooling. arXiv preprint arXiv:1805.04687 (2018).
[213] Yuan, J., Gleason, S. S., and Cheriyadat, A. M. Systematic bench-
marking of aerial image segmentation. IEEE Geoscience and Remote Sens-
ing Letters 10, 6 (2013), 1527–1531.
[214] Zeiler, M. D., and Fergus, R. Visualizing and understanding con-
volutional networks. In European conference on computer vision (2014),
Springer, pp. 818–833.
[216] Zhan, X., Pan, X., Liu, Z., Lin, D., and Loy, C. C. Self-
supervised learning via conditional motion propagation. arXiv preprint
arXiv:1903.11412 (2019).
[217] Zhang, Q., Goldman, S. A., Yu, W., and Fritts, J. E. Content-
based image retrieval using multiple-instance learning. In ICML (2002),
vol. 2, pp. 682–689.
[218] Zhang, R., Isola, P., and Efros, A. A. Colorful image colorization.
In European conference on computer vision (2016), Springer, pp. 649–666.
[219] Zhao, B., Feng, J., Wu, X., and Yan, S. A survey on deep learning-
based fine-grained object classification and semantic segmentation. Inter-
national Journal of Automation and Computing 14, 2 (2017), 119–135.
[220] Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. Pyramid scene pars-
ing network. In IEEE Conf. on Computer Vision and Pattern Recognition
(CVPR) (2017), pp. 2881–2890.
[221] Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su,
Z., Du, D., Huang, C., and Torr, P. H. Conditional random fields
as recurrent neural networks. In Proceedings of the IEEE international
conference on computer vision (2015), pp. 1529–1537.
[222] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., and Tor-
ralba, A. Semantic understanding of scenes through the ade20k dataset.
arXiv preprint arXiv:1608.05442 (2016).
[223] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., and Tor-
ralba, A. Scene parsing through ade20k dataset. In Proc. CVPR (2017).
[224] Zikic, D., Ioannou, Y., Brown, M., and Criminisi, A. Segmentation
of brain tumor tissues with convolutional neural networks. Proceedings
MICCAI-BRATS (2014), 36–39.