
A Review on Deep Learning Techniques Applied to Semantic Segmentation

A. Garcia-Garcia, S. Orts-Escolano, S.O. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez

arXiv:1704.06857v1 [cs.CV] 22 Apr 2017

Abstract—Image semantic segmentation is of increasing interest to computer vision and machine learning researchers. Many applications on the rise need accurate and efficient segmentation mechanisms: autonomous driving, indoor navigation, and even virtual or augmented reality systems, to name a few. This demand coincides with the rise of deep learning approaches in almost every field or application target related to computer vision, including semantic segmentation and scene understanding. This paper provides a review of deep learning methods for semantic segmentation applied to various application areas. Firstly, we describe the terminology of this field as well as mandatory background concepts. Next, the main datasets and challenges are described to help researchers decide which ones best suit their needs and targets. Then, existing methods are reviewed, highlighting their contributions and their significance in the field. Next, quantitative results are given for the described methods and the datasets on which they were evaluated, followed by a discussion of those results. Finally, we point out a set of promising future works and draw our own conclusions about the state of the art of semantic segmentation using deep learning techniques.

Index Terms—Semantic Segmentation, Deep Learning, Scene Labeling, Object Segmentation

1 I NTRODUCTION

N OWADAYS , semantic segmentation – applied to still


2D images, video, and even 3D or volumetric data
– is one of the key problems in the field of computer
mantic segmentation and properly interpret their proposals,
prune subpar approaches, and validate results.
To the best of our knowledge, this is the first review to
vision. Looking at the big picture, semantic segmentation focus explicitly on deep learning for semantic segmentation.
is one of the high-level task that paves the way towards Various semantic segmentation surveys already exist such
complete scene understanding. The importance of scene as the works by Zhu et al. [12] and Thoma [13], which do
understanding as a core computer vision problem is high- a great work summarizing and classifying existing meth-
lighted by the fact that an increasing number of applications ods, discussing datasets and metrics, and providing design
nourish from inferring knowledge from imagery. Some of choices for future research directions. However, they lack
those applications include autonomous driving [1] [2] [3], some of the most recent datasets, they do not analyze
human-machine interaction [4], computational photography frameworks, and none of them provide details about deep
[5], image search engines [6], and augmented reality to name learning techniques. Because of that, we consider our work
a few. Such problem has been addressed in the past using to be novel and helpful thus making it a significant contri-
various traditional computer vision and machine learning bution for the research community.
techniques. Despite the popularity of those kind of methods,
the deep learning revolution has turned the tables so that
many computer vision problems – semantic segmentation
among them – are being tackled using deep architectures,
usually Convolutional Neural Networks (CNNs) [7] [8] [9]
[10] [11], which are surpassing other approaches by a large
margin in terms of accuracy and sometimes even efficiency.
However, deep learning is far from the maturity achieved
by other old-established branches of computer vision and
machine learning. Because of that, there is a lack of unifying
works and state of the art reviews. The ever-changing state
of the field makes initiation difficult and keeping up with
its evolution pace is an incredibly time-consuming task due
to the sheer amount of new literature being produced. This
makes it hard to keep track of the works dealing with se-

• A. Garcia-Garcia, S.O. Oprea, V. Villena-Martinez, and J. Garcia-


Rodriguez are with the Department of Computer Technology, University
of Alicante, Spain. Fig. 1: Evolution of object recognition or scene understand-
E-mail: {agarcia, soprea, vvillena, jgarcia}@dtic.ua.es
• S. Orts-Escolano is with the Department of Computer Science and
ing from coarse-grained to fine-grained inference: classifica-
Artificial Intelligence, Universit of Alicante, Spain. tion, detection or localization, semantic segmentation, and
E-mail: [email protected]. instance segmentation.
The key contributions of our work are as follows:

• We provide a broad survey of existing datasets that might be useful for segmentation projects using deep learning techniques.
• We give an in-depth and organized review of the most significant methods that use deep learning for semantic segmentation, covering their origins and their contributions.
• We carry out a thorough performance evaluation, gathering quantitative metrics such as accuracy, execution time, and memory footprint.
• We provide a discussion of the aforementioned results, as well as a list of possible future works that might set the course of upcoming advances, and a conclusion summarizing the state of the art of the field.

The remainder of this paper is organized as follows. Firstly, Section 2 introduces the semantic segmentation problem as well as notation and conventions commonly used in the literature. Other background concepts, such as common deep neural networks, are also reviewed. Next, Section 3 describes existing datasets, challenges, and benchmarks. Section 4 reviews existing methods following a bottom-up complexity order based on their contributions. This section focuses on describing the theory and highlights of those methods rather than performing a quantitative evaluation. Next, Section 5 presents a brief discussion of the presented methods based on their quantitative results on the aforementioned datasets, and future research directions are also laid out. Finally, Section 6 summarizes the paper and draws conclusions about this work and the state of the art of the field.

2 TERMINOLOGY AND BACKGROUND CONCEPTS

In order to properly understand how semantic segmentation is tackled by modern deep learning architectures, it is important to know that it is not an isolated field but rather a natural step in the progression from coarse to fine inference. The origin can be located at classification, which consists of making a prediction for a whole input, i.e., predicting which object is in an image, or even providing a ranked list if there are many of them. Localization or detection is the next step towards fine-grained inference, providing not only the classes but also additional information regarding the spatial location of those classes, e.g., centroids or bounding boxes. From there, semantic segmentation is the natural step towards fine-grained inference: its goal is to make dense predictions, inferring labels for every pixel, so that each pixel is labeled with the class of its enclosing object or region. Further refinements can be made, such as instance segmentation (separate labels for different instances of the same class) and even part-based segmentation (low-level decomposition of already segmented classes into their components). Figure 1 shows the aforementioned evolution. In this review, we mainly focus on generic scene labeling, i.e., per-pixel class segmentation, but we also review the most important methods for instance and part-based segmentation.

In the end, the per-pixel labeling problem can be reduced to the following formulation: find a way to assign a state from the label space L = {l1, l2, ..., lk} to each one of the elements of a set of random variables X = {x1, x2, ..., xN}. Each label l represents a different class or object, e.g., aeroplane, car, traffic sign, or background. This label space has k possible states, which are usually extended to k + 1 by treating l0 as background or a void class. Usually, X is a 2D image of W × H = N pixels x. However, that set of random variables can be extended to any dimensionality, such as volumetric data or hyperspectral images.
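As a minimal, purely illustrative sketch of this formulation (our own Python/NumPy example, not part of any reviewed method), a dense labeling of a W × H image can be stored as an integer array whose entries index the extended label space {l0, l1, ..., lk}:

    import numpy as np

    # Hypothetical setting: k = 20 object classes plus l0 as the background/void class,
    # so every pixel takes a value in {0, ..., 20}.
    k = 20
    W, H = 640, 480

    # A dense prediction assigns one label to each of the N = W * H random variables.
    label_map = np.zeros((H, W), dtype=np.int64)   # everything starts as background (l0)
    label_map[100:200, 150:300] = 15               # a region labeled with some class l15

    assert 0 <= label_map.min() and label_map.max() <= k
    print("labels present:", np.unique(label_map))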
Apart from the problem formulation, it is important to remark on some background concepts that may help the reader understand this review: firstly, common networks, approaches, and design decisions that are often used as the basis for deep semantic segmentation systems; secondly, common training techniques such as transfer learning; and lastly, data pre-processing and augmentation approaches.

2.1 Common Deep Network Architectures

As we previously stated, certain deep networks have made such significant contributions to the field that they have become widely known standards. This is the case for AlexNet, VGG-16, GoogLeNet, and ResNet. Such is their importance that they are currently used as building blocks for many segmentation architectures. For that reason, we devote this section to reviewing them.

2.1.1 AlexNet

AlexNet was the pioneering deep CNN that won the ILSVRC-2012 challenge with a TOP-5 test accuracy of 84.6%, while its closest competitor, which made use of traditional techniques instead of deep architectures, achieved 73.8% accuracy in the same challenge. The architecture presented by Krizhevsky et al. [14] was relatively simple. It consists of five convolutional layers, max-pooling layers, Rectified Linear Units (ReLUs) as non-linearities, three fully-connected layers, and dropout. Figure 2 shows that CNN architecture.

Fig. 2: AlexNet Convolutional Neural Network architecture. Figure reproduced from [14].
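The following PyTorch sketch is our own illustration of the ingredients listed above (five convolutional layers with ReLU non-linearities, interleaved max-pooling, three fully-connected layers, and dropout); the layer sizes only roughly follow [14] and it is not the authors' code:

    import torch
    import torch.nn as nn

    # Rough AlexNet-like classifier: 5 conv layers + ReLU, max pooling, 3 FC layers, dropout.
    alexnet_like = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Flatten(),
        nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
        nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, 1000),  # 1000 ImageNet classes
    )

    scores = alexnet_like(torch.randn(1, 3, 224, 224))
    print(scores.shape)  # (1, 1000): one score vector per image, no spatial output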
2.1.2 VGG

Visual Geometry Group (VGG) refers to a family of CNN models introduced by the Visual Geometry Group at the University of Oxford. They proposed various models and configurations of deep CNNs [15], one of which was submitted to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)-2014. That model, also known as VGG-16 because it is composed of 16 weight layers, became popular thanks to achieving 92.7% TOP-5 test accuracy. Figure 3 shows the configuration of VGG-16. The main difference between VGG-16 and its predecessors is the use of a stack of convolution layers with small receptive fields in the first layers instead of a few layers with large receptive fields.
This leads to fewer parameters and more non-linearities in between, thus making the decision function more discriminative and the model easier to train.

Fig. 3: VGG-16 CNN architecture. Figure extracted from Matthieu Cord's talk with his permission.
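To make the small-receptive-field argument concrete, the following snippet (our own back-of-the-envelope computation, not taken from [15]) compares the weight count of three stacked 3 × 3 convolutions, which together cover a 7 × 7 receptive field, against a single 7 × 7 convolution, for a constant number of channels C and ignoring biases:

    # Three stacked 3x3 convolutions cover a 7x7 receptive field, like one 7x7 convolution,
    # but with fewer parameters and two extra non-linearities in between.
    C = 256  # channels kept constant for the comparison

    params_three_3x3 = 3 * (3 * 3 * C * C)
    params_single_7x7 = 7 * 7 * C * C

    print(params_three_3x3)                         # 1769472
    print(params_single_7x7)                        # 3211264
    print(params_three_3x3 / params_single_7x7)     # ~0.55, i.e. roughly half the weights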
2.1.3 GoogLeNet

GoogLeNet is a network introduced by Szegedy et al. [16] which won the ILSVRC-2014 challenge with a TOP-5 test accuracy of 93.3%. This CNN architecture is characterized by its complexity, emphasized by the fact that it is composed of 22 layers and a newly introduced building block called the inception module (see Figure 4). This new approach proved that CNN layers can be stacked in more ways than the typical sequential manner. In fact, those modules consist of a Network in Network (NiN) layer, a pooling operation, a large-sized convolution layer, and a small-sized convolution layer. All of them are computed in parallel and followed by 1 × 1 convolution operations to reduce dimensionality. Thanks to those modules, this network puts special consideration on memory and computational cost by significantly reducing the number of parameters and operations.

Fig. 4: Inception module with dimensionality reduction from the GoogLeNet architecture: parallel 1 × 1, 3 × 3, and 5 × 5 convolutions and a 3 × 3 max pooling, combined with 1 × 1 reduction convolutions and merged by filter concatenation. Figure reproduced from [16].
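A minimal PyTorch sketch of such an inception block follows; it is our own illustration, with arbitrary branch widths rather than the exact values used in [16]:

    import torch
    import torch.nn as nn

    class InceptionBlock(nn.Module):
        """Parallel branches reduced with 1x1 convolutions and merged by filter concatenation."""
        def __init__(self, in_ch):
            super().__init__()
            self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)                  # 1x1 branch
            self.branch3 = nn.Sequential(nn.Conv2d(in_ch, 96, kernel_size=1),   # 1x1 reduction
                                         nn.Conv2d(96, 128, kernel_size=3, padding=1))
            self.branch5 = nn.Sequential(nn.Conv2d(in_ch, 16, kernel_size=1),   # 1x1 reduction
                                         nn.Conv2d(16, 32, kernel_size=5, padding=2))
            self.branch_pool = nn.Sequential(nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                                             nn.Conv2d(in_ch, 32, kernel_size=1))

        def forward(self, x):
            return torch.cat([self.branch1(x), self.branch3(x),
                              self.branch5(x), self.branch_pool(x)], dim=1)

    block = InceptionBlock(192)
    print(block(torch.randn(1, 192, 28, 28)).shape)  # (1, 256, 28, 28): 64+128+32+32 channels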

2.1.4 ResNet

Microsoft's ResNet [17] is especially remarkable for winning ILSVRC-2015 with 96.4% TOP-5 test accuracy. Apart from that fact, the network is well known for its depth (152 layers) and the introduction of residual blocks (see Figure 5). The residual blocks address the problem of training a really deep architecture by introducing identity skip connections so that layers can copy their inputs to the next layer.

Fig. 5: Residual block from the ResNet architecture: the input χ passes through two weight layers to produce a residual F(χ), which is added to the identity-mapped input to give F(χ) + χ, followed by a ReLU. Figure reproduced from [17].

The intuitive idea behind this approach is that it ensures that the next layer learns something new and different from what the input has already encoded (since it is provided with both the output of the previous layer and its unchanged input). In addition, this kind of connection helps to overcome the vanishing gradient problem.
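The following is a minimal PyTorch sketch of a residual block in the spirit of Figure 5 (our own illustration; it omits the batch normalization and projection shortcuts used in the full ResNet):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """y = ReLU(F(x) + x): the weight layers learn a residual F(x) on top of the identity."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            residual = self.conv2(self.relu(self.conv1(x)))  # F(x)
            return self.relu(residual + x)                   # identity skip connection

    block = ResidualBlock(64)
    print(block(torch.randn(1, 64, 56, 56)).shape)  # (1, 64, 56, 56)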
2.1.5 ReNet

In order to extend Recurrent Neural Network (RNN) architectures to multi-dimensional tasks, Graves et al. [18] proposed a Multi-dimensional Recurrent Neural Network (MDRNN) architecture which replaces each single recurrent connection from standard RNNs with d connections, where d is the number of spatio-temporal data dimensions. Based on this initial approach, Visin et al. [19] proposed the ReNet architecture, which uses standard sequence RNNs instead of multi-dimensional ones. In this way, the number of RNNs scales linearly at each layer with the number of dimensions d of the input image (2d). In this approach, each convolutional layer (convolution + pooling) is replaced with four RNNs sweeping the image vertically and horizontally in both directions, as shown in Figure 6.

Fig. 6: One layer of the ReNet architecture modeling vertical and horizontal spatial dependencies. Extracted from [19].

2.2 Transfer Learning

Training a deep neural network from scratch is often not feasible for various reasons: a dataset of sufficient size is required (and is not usually available), and reaching convergence can take too long for the experiments to be worthwhile. Even if a dataset large enough is available and convergence does not take that long, it is often helpful to start with pre-trained weights instead of randomly initialized ones [20] [21]. Fine-tuning the weights of a pre-trained network by continuing the training process is one of the major transfer learning scenarios.

Yosinski et al. [22] proved that transferring features even from distant tasks can be better than using random initialization, taking into account that the transferability of features decreases as the difference between the pre-trained task and the target one increases.

However, applying this transfer learning technique is not completely straightforward. On the one hand, there are architectural constraints that must be met to use a pre-trained network. Nevertheless, since it is not usual to come up with a whole new architecture, it is common to reuse already existing network architectures (or components), thereby enabling transfer learning. On the other hand, the training process differs slightly when fine-tuning instead of training from scratch. It is important to choose properly which layers to fine-tune – usually the higher-level part of the network, since the lower one tends to contain more generic features – and also to pick an appropriate policy for the learning rate, which is usually smaller, since the pre-trained weights are expected to be relatively good and there is no need to change them drastically.

Due to the inherent difficulty of gathering and creating per-pixel labelled segmentation datasets, their scale is not as large as that of classification datasets such as ImageNet [23] [24]. This problem gets even worse when dealing with RGB-D or 3D datasets, which are even smaller. For that reason, transfer learning, and in particular fine-tuning from pre-trained classification networks, is a common trend for segmentation networks and has been successfully applied in the methods that we will review in the following sections.
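As a hedged illustration of this fine-tuning recipe (our own sketch, not the procedure of any specific reviewed method, and assuming a recent torchvision is installed; older versions use the pretrained=True argument instead of weights), one can load ImageNet-pretrained weights, freeze the lower convolutional layers, and train the remaining ones with a small learning rate:

    import torch.nn as nn
    import torch.optim as optim
    from torchvision import models

    # Start from ImageNet-pretrained VGG-16 instead of random initialization.
    net = models.vgg16(weights="IMAGENET1K_V1")

    # Freeze the lower, more generic part of the feature extractor...
    for param in net.features[:17].parameters():
        param.requires_grad = False

    # ...and replace the task-specific head (here: a hypothetical 21-class problem).
    net.classifier[6] = nn.Linear(4096, 21)

    # Fine-tune only the trainable parameters, with a small learning rate.
    trainable = [p for p in net.parameters() if p.requires_grad]
    optimizer = optim.SGD(trainable, lr=1e-3, momentum=0.9)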
2.3 Data Preprocessing and Augmentation

Data augmentation is a common technique that has been proven to benefit the training of machine learning models in general, and deep architectures in particular, either by speeding up convergence or by acting as a regularizer, thus avoiding overfitting and increasing generalization capabilities [25].

It typically consists of applying a set of transformations in either the data or the feature space, or even both. The most common augmentations are performed in the data space. That kind of augmentation generates new samples by applying transformations to the already existing data. There are many transformations that can be applied: translation, rotation, warping, scaling, color space shifts, crops, etc. The goal of those transformations is to generate more samples to create a larger dataset, preventing overfitting and presumably regularizing the model, to balance the classes within that database, and even to synthetically produce new samples that are more representative of the use case or task at hand.

Augmentations are especially helpful for small datasets, and have proven their efficacy through a long track record of success stories. For instance, in [26], a dataset of 1500 portrait images is augmented by synthesizing four new scales (0.6, 0.8, 1.2, 1.5), four new rotations (−45, −22, 22, 45), and four gamma variations (0.5, 0.8, 1.2, 1.5) to generate a new dataset of 19000 training images. That process allowed them to raise the accuracy of their system for portrait segmentation from 73.09 to 94.20 Intersection over Union (IoU) when including that augmented dataset for fine-tuning.
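The sketch below (our own toy NumPy example, unrelated to the actual pipeline of [26]) illustrates data-space augmentation for segmentation; the important detail is that every geometric transformation must be applied identically to the image and to its label mask, while photometric changes touch the image only:

    import numpy as np

    def augment(image, mask, rng):
        """Generate one augmented (image, mask) pair with a random flip, scale, and gamma shift."""
        # Horizontal flip: applied to both image and mask so labels stay aligned.
        if rng.random() < 0.5:
            image, mask = image[:, ::-1], mask[:, ::-1]

        # Nearest-neighbour rescale by a random factor; the mask must never be interpolated
        # smoothly, otherwise invalid in-between class indices would appear.
        scale = rng.choice([0.6, 0.8, 1.2, 1.5])
        h, w = mask.shape
        rows = (np.arange(int(h * scale)) / scale).astype(int).clip(0, h - 1)
        cols = (np.arange(int(w * scale)) / scale).astype(int).clip(0, w - 1)
        image, mask = image[rows][:, cols], mask[rows][:, cols]

        # Photometric change (gamma) is applied to the image only, never to the labels.
        image = np.clip(image, 0.0, 1.0) ** rng.choice([0.5, 0.8, 1.2, 1.5])
        return image, mask

    rng = np.random.default_rng(0)
    img = rng.random((480, 640, 3))              # toy RGB image with values in [0, 1]
    lbl = rng.integers(0, 21, size=(480, 640))   # toy label map with 21 classes
    aug_img, aug_lbl = augment(img, lbl, rng)
    print(aug_img.shape, aug_lbl.shape)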
3 DATASETS AND CHALLENGES

Two kinds of readers are expected for this type of review: either they are initiating themselves in the problem, or they are experienced enough and are just looking for the most recent advances made by other researchers in the last few years. Although the second kind is usually aware of two of the most important aspects to know before starting to research this problem, it is critical for newcomers to get a grasp of what the top-quality datasets and challenges are. Therefore, the purpose of this section is to kickstart novel scientists, providing them with a brief summary of datasets that might suit their needs as well as data augmentation and preprocessing tips. Nevertheless, it can also be useful for hardened researchers who want to review the fundamentals or maybe discover new information.

Arguably, data is one of the most – if not the most – important part of any machine learning system. When dealing with deep networks, this importance is increased even more. For that reason, gathering adequate data into a dataset is critical for any segmentation system based on deep learning techniques. Gathering and constructing an appropriate dataset, which must be of a large enough scale and represent the use case of the system accurately, needs time, domain expertise to select relevant information, and infrastructure to capture that data and transform it into a representation that the system can properly understand and learn. This task, despite the simplicity of its formulation in comparison with sophisticated neural network architecture definitions, is one of the hardest problems to solve in this context. Because of that, the most sensible approach usually means using an existing standard dataset which is representative enough for the domain of the problem. Following this approach has another advantage for the community: standardized datasets enable fair comparisons between systems; in fact, many datasets are part of a challenge which reserves some data – not provided to developers to test their algorithms – for a competition in which many methods are tested, generating a fair ranking of methods according to their actual performance without any kind of data cherry-picking.

In the following lines we describe the most popular large-scale datasets currently in use for semantic segmentation. All datasets listed here provide appropriate pixel-wise or point-wise labels. The list is structured into three parts according to the nature of the data: 2D or plain RGB datasets, 2.5D or RGB-Depth (RGB-D) ones, and pure volumetric or 3D databases. Table 1 shows a summarized view, gathering all the described datasets and providing useful information such as their purpose, number of classes, data format, and training/validation/testing splits.

3.1 2D Datasets

Throughout the years, semantic segmentation has been mostly focused on two-dimensional images. For that reason, 2D datasets are the most abundant ones. In this section we describe the most popular 2D large-scale datasets for semantic segmentation, considering as 2D any dataset that contains any kind of two-dimensional representation such as gray-scale or Red Green Blue (RGB) images.

• PASCAL Visual Object Classes (VOC) [27] (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/): this challenge consists of a ground-truth annotated dataset of images and five different competitions: classification, detection, segmentation, action classification, and person layout. The segmentation one is specially interesting since its goal is to predict the object class of each pixel for each test image. There are 21 classes categorized into vehicles, household, animals, and other: aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, TV/monitor, bird, cat, cow, dog, horse, sheep, and person. Background is also considered if the pixel does not belong to any of those classes. The dataset is divided into two subsets: training and validation, with 1464 and 1449 images respectively. The test set is private for the challenge. This dataset is arguably the most popular for semantic segmentation, so almost every remarkable method in the literature is submitted to its performance evaluation server to validate against the private test set. Methods can be trained either using only the dataset or using additional information. Furthermore, its leaderboard is public and can be consulted online (http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=6).

• PASCAL Context [28] (http://www.cs.stanford.edu/~roozbeh/pascal-context/): this dataset is an extension of the PASCAL VOC 2010 detection challenge which contains pixel-wise labels for all training images (10103). It contains a total of 540 classes – including the original 20 classes plus background from PASCAL VOC segmentation – divided into three categories (objects, stuff, and hybrids). Despite the large number of categories, only the 59 most frequent ones are remarkable. Since its classes follow a power law distribution, many of them are too sparse throughout the dataset. In this regard, this subset of 59 classes is usually selected to conduct studies on this dataset, relabeling the rest of them as background.

• PASCAL Part [29] (http://www.stat.ucla.edu/~xianjie.chen/pascal_part_dataset/pascal_part.html): this database is an extension of the PASCAL VOC 2010 detection challenge which goes beyond that task to provide per-pixel segmentation masks for each part of the objects (or at least silhouette annotations if the object does not have a consistent set of parts). The original classes of PASCAL VOC are kept, but their parts are introduced, e.g., bicycle is now decomposed into back wheel, chain wheel, front wheel, handlebar, headlight, and saddle. It contains labels for all training and validation images from PASCAL VOC as well as for the 9637 testing images.

• Semantic Boundaries Dataset (SBD) [30] (http://home.bharathh.info/home/sbd): this dataset is an extended version of the aforementioned PASCAL VOC which provides semantic segmentation ground truth for those images that were not labelled in VOC. It contains annotations for 11355 images from PASCAL VOC 2011. Those annotations provide both category-level and instance-level information, apart from boundaries for each object. Since the images are obtained from the whole PASCAL VOC challenge (not only from the segmentation one), the training and validation splits diverge. In fact, SBD provides its own training (8498 images) and validation (2857 images) splits. Due to its increased amount of training data, this dataset is often used as a substitute for PASCAL VOC for deep learning.

• Microsoft Common Objects in Context (COCO) [31] (http://mscoco.org/): this is another large-scale image recognition, segmentation, and captioning dataset. It features various challenges, the detection one being the most relevant for this field since one of its parts is focused on segmentation. That challenge, which features more than 80 classes, provides more than 82783 images for training, 40504 for validation, and a test set of more than 80000 images. In particular, the test set is divided into four different subsets or splits: test-dev (20000 images) is for additional validation and debugging, test-standard (20000 images) is the default test data for the competition and the one used to compare state-of-the-art methods, test-challenge (20000 images) is the split used for the challenge when submitting to the evaluation server, and test-reserve (20000 images) is a split used to protect against possible overfitting in the challenge (if a method is suspected to have made too many submissions or to have been trained on the test data, its results will be compared with the reserve split). Its popularity and importance have ramped up since its appearance thanks to its large scale. In fact, the results of the challenge are presented yearly in a joint workshop at the European Conference on Computer Vision (ECCV) (http://image-net.org/challenges/ilsvrc+coco2016) together with ImageNet's.

• SYNTHetic Collection of Imagery and Annotations (SYNTHIA) [32] (http://synthia-dataset.net/): this is a large-scale collection of photo-realistic renderings of a virtual city, semantically segmented, whose purpose is scene understanding in the context of driving or urban scenarios. The dataset provides fine-grained pixel-level annotations for 11 classes (void, sky, building, road, sidewalk, fence, vegetation, pole, car, sign, pedestrian, and cyclist). It features 13407 training images from rendered video streams. It is also characterized by its diversity in terms of scenes (towns, cities, highways), dynamic objects, seasons, and weather.
TABLE 1: Popular large-scale segmentation datasets. Sequence indicates whether the data comes as video sequences; Synthetic/Real: S = synthetic, R = real; N/A = not provided.

Name and Reference | Purpose | Year | Classes | Data | Resolution | Sequence | Synthetic/Real | Samples (training) | Samples (validation) | Samples (test)
PASCAL VOC 2012 Segmentation [27] | Generic | 2012 | 21 | 2D | Variable | No | R | 1464 | 1449 | Private
PASCAL-Context [28] | Generic | 2014 | 540 (59) | 2D | Variable | No | R | 10103 | N/A | 9637
PASCAL-Part [29] | Generic-Part | 2014 | 20 | 2D | Variable | No | R | 10103 | N/A | 9637
SBD [30] | Generic | 2011 | 21 | 2D | Variable | No | R | 8498 | 2857 | N/A
Microsoft COCO [31] | Generic | 2014 | +80 | 2D | Variable | No | R | 82783 | 40504 | 81434
SYNTHIA [32] | Urban (Driving) | 2016 | 11 | 2D | 960 × 720 | No | S | 13407 | N/A | N/A
Cityscapes (fine) [33] | Urban | 2015 | 30 (8) | 2D | 2048 × 1024 | Yes | R | 2975 | 500 | 1525
Cityscapes (coarse) [33] | Urban | 2015 | 30 (8) | 2D | 2048 × 1024 | Yes | R | 22973 | 500 | N/A
CamVid [34] | Urban (Driving) | 2009 | 32 | 2D | 960 × 720 | Yes | R | 701 | N/A | N/A
CamVid-Sturgess [35] | Urban (Driving) | 2009 | 11 | 2D | 960 × 720 | Yes | R | 367 | 100 | 233
KITTI-Layout [36] [37] | Urban/Driving | 2012 | 3 | 2D | Variable | No | R | 323 | N/A | N/A
KITTI-Ros [38] | Urban/Driving | 2015 | 11 | 2D | Variable | No | R | 170 | N/A | 46
KITTI-Zhang [39] | Urban/Driving | 2015 | 10 | 2D/3D | 1226 × 370 | No | R | 140 | N/A | 112
Stanford background [40] | Outdoor | 2009 | 8 | 2D | 320 × 240 | No | R | 725 | N/A | N/A
SiftFlow [41] | Outdoor | 2011 | 33 | 2D | 256 × 256 | No | R | 2688 | N/A | N/A
Youtube-Objects-Jain [42] | Objects | 2014 | 10 | 2D | 480 × 360 | Yes | R | 10167 | N/A | N/A
Adobe's Portrait Segmentation [26] | Portrait | 2016 | 2 | 2D | 600 × 800 | No | R | 1500 | 300 | N/A
MINC [43] | Materials | 2015 | 23 | 2D | Variable | No | R | 7061 | 2500 | 5000
DAVIS [44] [45] | Generic | 2016 | 4 | 2D | 480p | Yes | R | 4219 | 2023 | 2180
NYUDv2 [46] | Indoor | 2012 | 40 | 2.5D | 480 × 640 | No | R | 795 | 654 | N/A
SUN3D [47] | Indoor | 2013 | – | 2.5D | 640 × 480 | Yes | R | 19640 | N/A | N/A
SUNRGBD [48] | Indoor | 2015 | 37 | 2.5D | Variable | No | R | 2666 | 2619 | 5050
RGB-D Object Dataset [49] | Household objects | 2011 | 51 | 2.5D | 640 × 480 | Yes | R | 207920 | N/A | N/A
ShapeNet Part [50] | Object/Part | 2016 | 16/50 | 3D | N/A | No | S | 31,963 | N/A | N/A
Stanford 2D-3D-S [51] | Indoor | 2017 | 13 | 2D/2.5D/3D | 1080 × 1080 | Yes | R | 70,496 | N/A | N/A
3D Mesh [52] | Object/Part | 2009 | 19 | 3D | N/A | No | S | 380 | N/A | N/A
Sydney Urban Objects Dataset [53] | Urban (Objects) | 2013 | 26 | 3D | N/A | No | R | 41 | N/A | N/A
Large-Scale Point Cloud Classification Benchmark [54] | Urban/Nature | 2016 | 8 | 3D | N/A | No | R | 15 | N/A | 15

• Cityscapes [33] (https://www.cityscapes-dataset.com/): this is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of around 5000 finely annotated images and 20000 coarsely annotated ones. Data was captured in 50 cities over several months, at different times of day, and in good weather conditions. It was originally recorded as video, so the frames were manually selected to have the following features: a large number of dynamic objects, varying scene layout, and varying background.

• CamVid [55] [34] (http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/): this is a road/driving scene understanding database which was originally captured as five video sequences with a 960 × 720 resolution camera mounted on the dashboard of a car. Those sequences were sampled (four of them at 1 fps and one at 15 fps), adding up to 701 frames. Those stills were manually annotated with 32 classes: void, building, wall, tree, vegetation, fence, sidewalk, parking block, column/pole, traffic cone, bridge, sign, miscellaneous text, traffic light, sky, tunnel, archway, road, road shoulder, lane markings (driving), lane markings (non-driving), animal, pedestrian, child, cart luggage, bicyclist, motorcycle, car, SUV/pickup/truck, truck/bus, train, and other moving object. It is important to remark the partition introduced by Sturgess et al. [35], which divided the dataset into 367/100/233 training, validation, and testing images respectively. That partition makes use of a subset of class labels: building, tree, sky, car, sign, road, pedestrian, fence, pole, sidewalk, and bicyclist.

• KITTI [56]: this is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their necessities. Álvarez et al. [36] [37] generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. [39] annotated 252 acquisitions (140 for training and 112 for testing) – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. [38] labeled 170 training images and 46 testing images (from the visual odometry challenge) with 11 classes: building, tree, sky, car, sign, road, pedestrian, fence, pole, sidewalk, and bicyclist.

• Youtube-Objects [57]: this is a database of videos collected from YouTube which contain objects from ten PASCAL VOC classes: aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, and train. The database does not contain pixel-wise annotations, but Jain et al. [42] manually annotated a subset of 126 sequences. They took every 10th frame from those sequences and generated semantic labels. That totals 10167 annotated frames at 480 × 360 pixels resolution.

• Adobe's Portrait Segmentation [26] (http://xiaoyongshen.me/webpage_portrait/index.html): this is a dataset of 800 × 600 pixels portrait images collected from Flickr, mainly captured with mobile front-facing cameras. The database consists of 1500 training images and 300 reserved for testing, both sets fully binary annotated: person or background. The images were labeled in a semi-automatic way: first a face detector was run on each image to crop it to 600 × 800 pixels, and then persons were manually annotated using Photoshop quick selection. This dataset is remarkable due to its specific purpose, which makes it suitable for person-in-foreground segmentation applications.
• Materials in Context (MINC) [43]: this work provides a dataset for patch material classification and full scene material segmentation. The dataset provides segment annotations for 23 categories: wood, painted, fabric, glass, metal, tile, sky, foliage, polished stone, carpet, leather, mirror, brick, water, other, plastic, skin, stone, ceramic, hair, food, paper, and wallpaper. It contains 7061 labeled material segmentations for training, 5000 for testing, and 2500 for validation. The main source for these images is the OpenSurfaces dataset [58], which was augmented using other sources of imagery such as Flickr or Houzz. For that reason, image resolution for this dataset varies. On average, image resolution is approximately 800 × 500 or 500 × 800.

• Densely-Annotated VIdeo Segmentation (DAVIS) [44] [45] (http://davischallenge.org/index.html): this challenge is purposed for video object segmentation. Its dataset is composed of 50 high-definition sequences which add up to 4219 and 2023 frames for training and validation respectively. Frame resolution varies across sequences, but all of them were downsampled to 480p for the challenge. Pixel-wise annotations are provided for each frame for four different categories: human, animal, vehicle, and object. Another feature of this dataset is the presence of at least one target foreground object in each sequence. In addition, it is designed not to have many different objects with significant motion. For those scenes which do have more than one target foreground object from the same class, separate ground truth is provided for each one of them to allow instance segmentation.

• Stanford background [40] (http://dags.stanford.edu/data/iccv09Data.tar.gz): this is a dataset of outdoor scene images imported from existing public datasets: LabelMe, MSRC, PASCAL VOC, and Geometric Context. The dataset contains 715 images (of 320 × 240 pixels) with at least one foreground object and with the horizon position within the image. The dataset is pixel-wise annotated (horizon location, pixel semantic class, pixel geometric class, and image region) for evaluating methods for semantic scene understanding.

• SiftFlow [41]: this dataset contains 2688 fully annotated images which are a subset of the LabelMe database [59]. Most of the images are based on 8 different outdoor scenes, including streets, mountains, fields, beaches, and buildings. Images are 256 × 256 pixels and belong to one of 33 semantic classes. Unlabeled pixels, or pixels labeled with a different semantic class, are treated as unlabeled.

3.2 2.5D Datasets

With the advent of low-cost range scanners, datasets including not only RGB information but also depth maps are gaining popularity and usage. In this section, we review the most well-known 2.5D databases which include that kind of depth data.

• NYUDv2 [46] (http://cs.nyu.edu/~silberman/projects/indoor_scene_seg_sup.html): this database consists of 1449 indoor RGB-D images captured with a Microsoft Kinect device. It provides per-pixel dense labeling (at category and instance level), which was coalesced into 40 indoor object classes by Gupta et al. [60], for both the training (795 images) and testing (654 images) splits. This dataset is specially remarkable due to its indoor nature, which makes it really useful for certain robotic tasks at home. However, its relatively small scale with regard to other existing datasets hinders its application to deep learning architectures.

• SUN3D [47] (http://sun3d.cs.princeton.edu/): similar to NYUDv2, this dataset contains a large-scale RGB-D video database, with 8 annotated sequences. Each frame has a semantic segmentation of the objects in the scene and information about the camera pose. It is still in progress and will be composed of 415 sequences captured in 254 different spaces, in 41 different buildings. Moreover, some places have been captured multiple times at different moments of the day.

• SUNRGBD [48] (http://rgbd.cs.princeton.edu/): captured with four RGB-D sensors, this dataset contains 10000 RGB-D images, at a scale similar to PASCAL VOC. It contains images from NYU depth v2 [46], Berkeley B3DO [61], and SUN3D [47]. The whole dataset is densely annotated, including polygons, bounding boxes with orientation, as well as a 3D room layout and category, making it suitable for scene understanding tasks.

• The Object Segmentation Database (OSD) [62] (http://www.acin.tuwien.ac.at/?id=289): this database has been designed for segmenting unknown objects from generic scenes, even under partial occlusions. The dataset contains 111 entries, and provides depth and color images together with per-pixel annotations for each one to evaluate object segmentation approaches. However, the dataset does not differentiate the categories of different objects, so its classes are reduced to a binary set of objects and non-objects.

• RGB-D Object Dataset [49] (http://rgbd-dataset.cs.washington.edu/): this dataset is composed of video sequences of 300 common household objects organized into 51 categories arranged using WordNet hypernym-hyponym relationships. The dataset has been recorded using a Kinect-style 3D camera that records synchronized and aligned 640 × 480 RGB and depth images at 30 Hz. For each frame, the dataset provides the RGB-D and depth images, a cropped version containing the object, the object location, and a mask with per-pixel annotation. Moreover, each object has been placed on a turntable, providing isolated video sequences over 360 degrees. For the validation process, 22 annotated video sequences of natural indoor scenes containing the objects are provided.
3.3 3D Datasets

Purely three-dimensional databases are scarce; this kind of dataset usually provides Computer Aided Design (CAD) meshes or other volumetric representations, such as point clouds. Generating large-scale 3D datasets for segmentation is costly and difficult, and not many deep learning methods are able to process that kind of data as it is. For those reasons, 3D datasets are not quite popular at the moment. In spite of that fact, we describe the most promising ones for the task at hand.

• ShapeNet Part [50] (http://cs.stanford.edu/~ericyi/project_page/part_annotation/): this is a subset of the ShapeNet [63] repository which focuses on fine-grained 3D object segmentation. It contains 31,693 meshes sampled from 16 categories of the original dataset (airplane, earphone, cap, motorbike, bag, mug, laptop, table, guitar, knife, rocket, lamp, chair, pistol, car, and skateboard). Each shape class is labeled with two to five parts (totalling 50 object parts across the whole dataset), e.g., each shape from the airplane class is labeled with wings, body, tail, and engine. Ground-truth labels are provided on points sampled from the meshes.

• Stanford 2D-3D-S [51] (http://buildingparser.stanford.edu): this is a multi-modal and large-scale indoor spaces dataset extending the Stanford 3D Semantic Parsing work [64]. It provides a variety of registered modalities – 2D (RGB), 2.5D (depth maps and surface normals), and 3D (meshes and point clouds) – with semantic annotations. The database is composed of 70,496 full high-definition RGB images (1080 × 1080 resolution) along with their corresponding depth maps, surface normals, meshes, and point clouds with semantic annotations (per-pixel and per-point). That data was captured in six indoor areas from three different educational and office buildings. That makes a total of 271 rooms and approximately 700 million points annotated with labels from 13 categories: ceiling, floor, wall, column, beam, window, door, table, chair, bookcase, sofa, board, and clutter.

• A Benchmark for 3D Mesh Segmentation [52] (http://segeval.cs.princeton.edu/): this benchmark is composed of 380 meshes classified into 19 categories (human, cup, glasses, airplane, ant, chair, octopus, table, teddy, hand, plier, fish, bird, armadillo, bust, mech, bearing, vase, fourleg). Each mesh has been manually segmented into functional parts; the main goal is to provide a sample distribution over "how humans decompose each mesh into functional parts".

• Sydney Urban Objects Dataset [53] (http://www.acfr.usyd.edu.au/papers/SydneyUrbanObjectsDataset.shtml): this dataset contains a variety of common urban road objects scanned with a Velodyne HDL-64E LIDAR. There are 631 individual scans (point clouds) of objects across classes of vehicles, pedestrians, signs, and trees. The interesting point of this dataset is that, for each object, apart from the individual scan, a full 360-degree annotated scan is provided.

• Large-Scale Point Cloud Classification Benchmark [54] (http://www.semantic3d.net/): this benchmark provides manually annotated 3D point clouds of diverse natural and urban scenes: churches, streets, railroad tracks, squares, villages, soccer fields, and castles, among others. The dataset features statically captured point clouds with very fine detail and density. It contains 15 large-scale point clouds for training and another 15 for testing. Its scale can be grasped by the fact that it totals more than one billion labelled points.

4 METHODS

The relentless success of deep learning techniques in various high-level computer vision tasks – in particular, supervised approaches such as Convolutional Neural Networks (CNNs) for image classification or object detection [14] [15] [16] – motivated researchers to explore the capabilities of such networks for pixel-level labelling problems like semantic segmentation. The key advantage of these deep learning techniques, which gives them an edge over traditional methods, is the ability to learn appropriate feature representations for the problem at hand, e.g., pixel labelling on a particular dataset, in an end-to-end fashion instead of using hand-crafted features that require domain expertise, effort, and often too much fine-tuning to make them work on a particular scenario.

Fig. 7: Fully Convolutional Network figure by Long et al. [65]. Transforming a classification-purposed CNN to produce spatial heatmaps by replacing fully connected layers with convolutional ones. Including a deconvolution layer for upsampling allows dense inference and learning for per-pixel labeling.
TABLE 2: Summary of semantic segmentation methods based on deep learning. The target columns (Accuracy, Efficiency, Training, Instance, Sequences, Multi-modal, 3D) are graded from one to three stars (★); "No" means the target is not addressed. Source Code indicates whether an implementation is publicly available.

Name and Reference | Architecture | Accuracy | Efficiency | Training | Instance | Sequences | Multi-modal | 3D | Source Code | Contribution(s)
Fully Convolutional Network [65] | VGG-16 (FCN) | ★ | ★ | ★ | No | No | No | No | Yes | Forerunner
SegNet [66] | VGG-16 + Decoder | ★★★ | ★★ | ★ | No | No | No | No | Yes | Encoder-decoder
Bayesian SegNet [67] | SegNet | ★★★ | ★ | ★ | No | No | No | No | Yes | Uncertainty modeling
DeepLab [68] [69] | VGG-16/ResNet-101 | ★★★ | ★ | ★ | No | No | No | No | Yes | Standalone CRF, atrous convolutions
MINC-CNN [43] | GoogLeNet (FCN) | ★ | ★ | ★ | No | No | No | No | Yes | Patchwise CNN, standalone CRF
CRFasRNN [70] | FCN-8s | ★ | ★★ | ★★★ | No | No | No | No | Yes | CRF reformulated as RNN
Dilation [71] | VGG-16 | ★★★ | ★ | ★ | No | No | No | No | Yes | Dilated convolutions
ENet [72] | ENet bottleneck | ★★ | ★★★ | ★ | No | No | No | No | Yes | Bottleneck module for efficiency
Multi-scale-CNN-Raj [73] | VGG-16 (FCN) | ★★★ | ★ | ★ | No | No | No | No | No | Multi-scale architecture
Multi-scale-CNN-Eigen [74] | Custom | ★★★ | ★ | ★ | No | No | No | No | Yes | Multi-scale sequential refinement
Multi-scale-CNN-Roy [75] | Multi-scale-CNN-Eigen | ★★★ | ★ | ★ | No | No | ★★ | No | No | Multi-scale coarse-to-fine refinement
Multi-scale-CNN-Bian [76] | FCN | ★★ | ★ | ★★ | No | No | No | No | No | Independently trained multi-scale FCNs
ParseNet [77] | VGG-16 | ★★★ | ★ | ★ | No | No | No | No | Yes | Global context feature fusion
ReSeg [78] | VGG-16 + ReNet | ★★ | ★ | ★ | No | No | No | No | Yes | Extension of ReNet to semantic segmentation
LSTM-CF [79] | Fast R-CNN + DeepMask | ★★★ | ★ | ★ | No | No | No | No | Yes | Fusion of contextual information from multiple sources
2D-LSTM [80] | MDRNN | ★★ | ★★ | ★ | No | No | No | No | No | Image context modelling
rCNN [81] | MDRNN | ★★★ | ★★ | ★ | No | No | No | No | Yes | Different input sizes, image context
DAG-RNN [82] | Elman network | ★★★ | ★ | ★ | No | No | No | No | Yes | Graph image structure for context modelling
SDS [10] | R-CNN + Box CNN | ★★★ | ★ | ★ | ★★ | No | No | No | Yes | Simultaneous detection and segmentation
DeepMask [83] | VGG-A | ★★★ | ★ | ★ | ★★ | No | No | No | Yes | Proposals generation for segmentation
SharpMask [84] | DeepMask | ★★★ | ★ | ★ | ★★★ | No | No | No | Yes | Top-down refinement module
MultiPathNet [85] | Fast R-CNN + DeepMask | ★★★ | ★ | ★ | ★★★ | No | No | No | Yes | Multi-path information flow through network
Huang-3DCNN [86] | Own 3DCNN | ★ | ★ | ★ | No | No | No | ★★★ | No | 3DCNN for voxelized point clouds
PointNet [87] | Own MLP-based | ★★ | ★ | ★ | No | No | No | ★★★ | Yes | Segmentation of unordered point sets
Clockwork Convnet [88] | FCN | ★★ | ★★ | ★ | No | ★★★ | No | No | Yes | Clockwork scheduling for sequences
3DCNN-Zhang | Own 3DCNN | ★★ | ★ | ★ | No | ★★★ | No | No | Yes | 3D convolutions and graph cut for sequences
End2End Vox2Vox [89] | C3D | ★★ | ★ | ★ | No | ★★★ | No | No | No | 3D convolutions/deconvolutions for sequences

Fig. 8: Visualization of the reviewed methods.

Currently, the most successful state-of-the-art deep learning techniques for semantic segmentation stem from a common forerunner: the Fully Convolutional Network (FCN) by Long et al. [65]. The insight of that approach was to take advantage of existing CNNs as powerful visual models that are able to learn hierarchies of features. They transformed those existing and well-known classification models – AlexNet [14], VGG (16-layer net) [15], GoogLeNet [16], and ResNet [17] – into fully convolutional ones by replacing the fully connected layers with convolutional ones to output spatial maps instead of classification scores. Those maps are upsampled using fractionally strided convolutions (also named deconvolutions [90] [91]) to produce dense per-pixel labeled outputs. This work is considered a milestone since it showed how CNNs can be trained end-to-end for this problem, efficiently learning how to make dense predictions for semantic segmentation with inputs of arbitrary sizes. This approach achieved a significant improvement in segmentation accuracy over traditional methods on standard datasets like PASCAL VOC, while preserving efficiency at inference. For all those reasons, and other significant contributions, the FCN is the cornerstone of deep learning applied to semantic segmentation. The convolutionalization process is shown in Figure 7.

Despite the power and flexibility of the FCN model, it still lacks various features which hinder its application to certain problems and situations: its inherent spatial invariance does not take into account useful global context information, no instance awareness is present by default, efficiency is still far from real-time execution at high resolutions, and it is not completely suited to unstructured data such as 3D point clouds or models. Those problems will be reviewed in this section, as well as the state-of-the-art solutions that have been proposed in the literature to overcome those hurdles. Table 2 provides a summary of that review. It shows all reviewed methods (sorted by order of appearance in the section), their base architecture, their main contribution, and a classification depending on the targets of the work: accuracy, efficiency, training simplicity, instance awareness, sequence processing, multi-modal inputs, and 3D data. Each target is graded from one to three stars (★) depending on how much focus the work puts on it, and marked "No" if that issue is not addressed. In addition, Figure 8 shows a graph of the reviewed methods for the sake of visualization.
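A schematic PyTorch illustration of this convolutionalization idea follows; it is our own toy sketch, far simpler than the actual FCN-8s of [65] and without the skip connections that fuse finer encoder features:

    import torch
    import torch.nn as nn

    num_classes = 21

    # Toy VGG-like encoder that downsamples the input by a factor of 32.
    encoder = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
    )

    # "Convolutionalized" head: the former fully connected layers become 1x1 convolutions,
    # so the network outputs a coarse spatial map of class scores instead of a single vector.
    head = nn.Sequential(
        nn.Conv2d(512, 4096, 1), nn.ReLU(inplace=True), nn.Dropout2d(),
        nn.Conv2d(4096, num_classes, 1),
    )

    # Fractionally strided (transposed) convolution upsamples the scores 32x for dense prediction.
    upsample = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=64, stride=32, padding=16)

    x = torch.randn(1, 3, 320, 480)
    scores = upsample(head(encoder(x)))
    print(scores.shape)  # (1, 21, 320, 480): one score per class and pixel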
4.1 Decoder Variants

Apart from the FCN architecture, other variants have been developed to transform a network whose original purpose was classification into one suitable for segmentation. Arguably, FCN-based architectures are more popular and successful, but other alternatives are also remarkable. In general terms, all of them take a classification network, such as VGG-16, and remove its fully connected layers. This part of the new segmentation network is often called the encoder, and it produces low-resolution image representations or feature maps. The problem lies in learning to decode or map those low-resolution images to pixel-wise predictions for segmentation. This part is called the decoder, and it is usually the point of divergence in this kind of architecture.

SegNet [66] is a clear example of this divergence (see Figure 9). The decoder stage of SegNet is composed of a set of upsampling and convolution layers which are finally followed by a softmax classifier to predict pixel-wise labels for an output with the same resolution as the input image. Each upsampling layer in the decoder stage corresponds to a max-pooling layer in the encoder part. Those layers upsample feature maps using the max-pooling indices from their corresponding feature maps in the encoder phase. The upsampled maps are then convolved with a set of trainable filter banks to produce dense feature maps. Once the feature maps have been restored to the original resolution, they are fed to the softmax classifier to produce the final segmentation.

Fig. 9: SegNet architecture with an encoder and a decoder followed by a softmax classifier for pixel-wise classification. Figure extracted from [66].

On the other hand, FCN-based architectures make use of learnable deconvolution filters to upsample feature maps. After that, the upsampled feature maps are added element-wise to the corresponding feature map generated by the convolution layer in the encoder part. Figure 10 shows a comparison of both approaches.

Fig. 10: Comparison of SegNet (left) and FCN (right) decoders. While SegNet uses max-pooling indices from the corresponding encoder stage to upsample, FCN learns deconvolution filters to upsample (adding the corresponding feature map from the encoder stage). Figure reproduced from [66].
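A reduced sketch of the index-based upsampling that distinguishes the SegNet decoder (our own PyTorch illustration, not the authors' implementation): the encoder's max-pooling layer returns its pooling indices, and the mirrored decoder stage reuses them for max-unpooling before convolving with a trainable filter bank:

    import torch
    import torch.nn as nn

    class TinyEncoderDecoder(nn.Module):
        """One encoder stage and its mirrored decoder stage, SegNet-style."""
        def __init__(self, num_classes=21):
            super().__init__()
            self.enc_conv = nn.Conv2d(3, 64, 3, padding=1)
            self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)   # keep the max locations
            self.unpool = nn.MaxUnpool2d(2, stride=2)                    # reuse them to upsample
            self.dec_conv = nn.Conv2d(64, num_classes, 3, padding=1)     # trainable filter bank

        def forward(self, x):
            feats = torch.relu(self.enc_conv(x))
            pooled, indices = self.pool(feats)
            upsampled = self.unpool(pooled, indices, output_size=feats.size())
            return self.dec_conv(upsampled)   # per-pixel class scores (softmax applied in the loss)

    net = TinyEncoderDecoder()
    print(net(torch.randn(1, 3, 224, 224)).shape)  # (1, 21, 224, 224)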
4.2 Integrating Context Knowledge

Semantic segmentation is a problem that requires the integration of information from various spatial scales. It also implies balancing local and global information. On the one hand, fine-grained or local information is crucial to achieve good pixel-level accuracy. On the other hand, it is also important to integrate information from the global context of the image to be able to resolve local ambiguities.

Vanilla CNNs struggle with this balance. Pooling layers, which allow the networks to achieve some degree of spatial invariance and keep computational cost at bay, dispose of the global context information. Even pure CNNs – without pooling layers – are limited, since the receptive field of their units can only grow linearly with the number of layers.

Many approaches can be taken to make CNNs aware of that global information: refinement as a post-processing step with Conditional Random Fields (CRFs), dilated convolutions, multi-scale aggregation, or even deferring the context modeling to another kind of deep network such as RNNs.

4.2.1 Conditional Random Fields

As we mentioned before, the inherent invariance to spatial transformations of CNN architectures limits the spatial accuracy achievable in segmentation tasks. One possible and common approach to refine the output of a segmentation system and boost its ability to capture fine-grained details is to apply a post-processing stage using a Conditional Random Field (CRF). CRFs enable the combination of low-level image information – such as the interactions between pixels [92] [93] – with the output of multi-class inference systems that produce per-pixel class scores. That combination is especially important to capture long-range dependencies, which CNNs fail to consider, and fine local details.

The DeepLab models [68] [69] make use of the fully connected pairwise CRF by Krähenbühl and Koltun [94] [95] as a separate post-processing step in their pipeline to refine the segmentation result. This CRF models each pixel as a node in the field and employs one pairwise term for each pair of pixels no matter how far apart they lie (this model is known as a dense or fully connected factor graph). By using this model, both short- and long-range interactions are taken into account, rendering the system able to recover detailed structures in the segmentation that were lost due to the spatial invariance of the CNN. Despite the fact that fully connected models are usually inefficient, this model can be efficiently approximated via probabilistic inference. Figure 11 shows the effect of this CRF-based post-processing on the score and belief maps produced by the DeepLab model.

The material recognition in the wild network by Bell et al. [43] makes use of various CNNs trained to identify patches in the MINC database. Those CNNs are used in a sliding-window fashion to classify those patches.
11

The material recognition in the wild network by Bell et al. [43] makes use of various CNNs trained to identify patches in the MINC database. Those CNNs are applied in a sliding-window fashion to classify those patches. Their weights are transferred to the same networks converted into FCNs by adding the corresponding upsampling layers. The outputs are averaged to generate a probability map. At last, the same CRF from DeepLab, but discretely optimized, is applied to predict and refine the material at every pixel.

Another significant work applying a CRF to refine the segmentation of a FCN is the CRFasRNN by Zheng et al. [70]. The main contribution of that work is the reformulation of the dense CRF with pairwise potentials as an integral part of the network. By unrolling the mean-field inference steps as RNNs, they make it possible to fully integrate the CRF with a FCN and train the whole network end-to-end. This work demonstrates the reformulation of CRFs as RNNs to form a part of a deep network, in contrast with Pinheiro et al. [81], who employed RNNs to model large spatial dependencies.

4.2.2 Dilated Convolutions

Dilated convolutions, also named à-trous convolutions, are a generalization of Kronecker-factored convolutional filters [96] which support exponentially expanding receptive fields without losing resolution. In other words, dilated convolutions are regular ones that make use of upsampled filters. The dilation rate l controls that upsampling factor. As shown in Figure 12, stacking l-dilated convolutions makes the receptive fields grow exponentially while the number of filter parameters grows only linearly. This means that dilated convolutions allow efficient dense feature extraction on any arbitrary resolution. As a side note, it is important to remark that typical convolutions are just 1-dilated convolutions.

Fig. 12: As shown in [71], dilated convolution filters with various dilation rates: (a) 1-dilated convolutions in which each unit has a 3 × 3 receptive field, (b) 2-dilated ones with 7 × 7 receptive fields, and (c) 3-dilated convolutions with 15 × 15 receptive fields.

In practice, it is equivalent to dilating the filter before doing the usual convolution. That means expanding its size according to the dilation rate while filling the empty elements with zeros. In other words, the filter weights are matched to distant elements which are not adjacent if the dilation rate is greater than one. Figure 13 shows examples of dilated filters.

Fig. 13: Filter elements (green) matched to input elements when using 3 × 3 dilated convolutions with various dilation rates. From left to right: 1, 2, and 3.
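The receptive-field/parameter trade-off described above is easy to verify with a short sketch (assuming PyTorch as the framework; the channel sizes are arbitrary). Setting the padding equal to the dilation rate keeps the spatial resolution intact, which is precisely what makes dilated convolutions attractive for dense prediction:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 1, 64, 64)  # dummy single-channel input

    # Three 3x3 convolutions with dilation rates 1, 2 and 4.
    # Padding equal to the dilation rate preserves the spatial resolution.
    layers = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3, dilation=1, padding=1),
        nn.Conv2d(8, 8, kernel_size=3, dilation=2, padding=2),
        nn.Conv2d(8, 8, kernel_size=3, dilation=4, padding=4),
    )

    y = layers(x)
    print(y.shape)  # torch.Size([1, 8, 64, 64]) -- no downsampling

    # Receptive field after each layer of this stack: 3 -> 7 -> 15 pixels per
    # side (exponential growth), while every layer still holds only 3x3 weights
    # per input/output channel pair (linear growth in parameters).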

The most important works that make use of dilated convolutions are the multi-scale context aggregation module by Yu et al. [71], the already mentioned DeepLab (its improved version) [69], and the real-time network ENet [72]. All of them use combinations of dilated convolutions with increasing dilation rates to have wider receptive fields with no additional cost and without overly downsampling the feature maps. Those works also show a common trend: dilated convolutions are tightly coupled to multi-scale context aggregation, as we will explain in the following section.

4.2.3 Multi-scale Prediction

Another possible way to deal with context knowledge integration is the use of multi-scale predictions. Almost every single parameter of a CNN affects the scale of the generated feature maps. In other words, the very same architecture will have an impact on the number of pixels of the input image which correspond to a pixel of the feature map. This means that the filters will implicitly learn to detect features at specific scales (presumably with a certain invariance degree). Furthermore, those parameters are usually tightly coupled to the problem at hand, making it difficult for the models to generalize to different scales. One possible way to overcome that obstacle is to use multi-scale networks, which generally make use of multiple networks that target different scales and then merge the predictions to produce a single output.

Raj et al. [73] propose a multi-scale version of a fully convolutional VGG-16. That network has two paths, one that processes the input at the original resolution and another one which doubles it. The first path goes through a shallow convolutional network. The second one goes through the fully convolutional VGG-16 and an extra convolutional layer. The result of that second path is upsampled and combined with the result of the first path. That concatenated output then goes through another set of convolutional layers to generate the final output. As a result, the network becomes more robust to scale variations.
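The two-path pattern just described can be captured in a few lines. The sketch below (PyTorch assumed; the backbones are simplified stand-ins, not the shallow network and fully convolutional VGG-16 actually used in [73]) processes the input at two resolutions, brings the coarse prediction back to the fine grid, and fuses both paths before the final classifier:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoPathSegNet(nn.Module):
        """Toy two-path multi-scale network: a shallow path at the original
        resolution and a deeper, downsampling path on a doubled input."""
        def __init__(self, in_ch=3, num_classes=21):
            super().__init__()
            self.shallow = nn.Sequential(   # fine path, original resolution
                nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True))
            self.deep = nn.Sequential(      # deeper path with pooling
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2))
            self.fuse = nn.Conv2d(32 + 64, num_classes, 1)

        def forward(self, x):
            fine = self.shallow(x)
            x2 = F.interpolate(x, scale_factor=2, mode='bilinear',
                               align_corners=False)      # doubled-resolution input
            coarse = self.deep(x2)
            coarse = F.interpolate(coarse, size=fine.shape[2:], mode='bilinear',
                                   align_corners=False)  # back to the fine grid
            return self.fuse(torch.cat([fine, coarse], dim=1))

    scores = TwoPathSegNet()(torch.randn(1, 3, 128, 128))
    print(scores.shape)  # torch.Size([1, 21, 128, 128])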

Roy et al. [75] take a different approach using a network composed of four multi-scale CNNs. Those four networks have the same architecture introduced by Eigen et al. [74]. One of those networks is devoted to finding semantic labels for the scene. That network extracts features from a progressively coarse-to-fine sequence of scales (see Figure 14).

Fig. 14: Multi-scale CNN architecture proposed by Eigen et al. [74]. The network progressively refines the output using a sequence of scales to estimate depth, normals, and also perform semantic segmentation over an RGB input. Figure extracted from [74].

Another remarkable work is the network proposed by Bian et al. [76]. That network is a composition of n FCNs which operate at different scales. The features extracted from the networks are fused together (after the necessary upsampling with an appropriate padding) and then they go through an additional convolutional layer to produce the final segmentation. The main contribution of this architecture is the two-stage learning process: first, each network is trained independently; then the networks are combined and the last layer is fine-tuned. This multi-scale model allows adding an arbitrary number of newly trained networks in an efficient manner.

4.2.4 Feature Fusion

Another way of adding context information to a fully convolutional architecture for segmentation is feature fusion. This technique consists of merging a global feature (extracted from a previous layer in a network) with a more local feature map extracted from a subsequent layer. Common architectures such as the original FCN make use of skip connections to perform a late fusion by combining the feature maps extracted from different layers (see Figure 15).

Fig. 15: Skip-connection-like architecture, which performs late fusion of feature maps as if making independent predictions for each layer and merging the results. Figure extracted from [84].

Another approach is performing early fusion. This approach is taken by ParseNet [77] in their context module: the global feature is unpooled to the same spatial size as the local feature and then they are concatenated to generate a combined feature that is used in the next layer or to learn a classifier. Figure 16 shows a representation of that process.

Fig. 16: ParseNet context module overview in which a global feature (from a previous layer) is combined with the feature of the next layer to add context information. Figure extracted from [77].
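A ParseNet-style early fusion step can be sketched as follows (PyTorch assumed; normalization layers and the surrounding network are omitted, so this only shows the pool, unpool, and concatenate mechanics rather than the exact ParseNet module):

    import torch
    import torch.nn.functional as F

    def early_fusion(local_feat):
        """Fuse a global context vector with a local feature map.
        local_feat: (N, C, H, W) feature map from some layer of the network."""
        n, c, h, w = local_feat.shape
        # Global feature: spatial average pooling collapses the map to one vector.
        global_feat = F.adaptive_avg_pool2d(local_feat, output_size=1)  # (N, C, 1, 1)
        # "Unpool" it back to the spatial size of the local feature map.
        global_feat = global_feat.expand(n, c, h, w)
        # Early fusion: concatenate along the channel dimension.
        return torch.cat([local_feat, global_feat], dim=1)              # (N, 2C, H, W)

    fused = early_fusion(torch.randn(2, 64, 32, 32))
    print(fused.shape)  # torch.Size([2, 128, 32, 32])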
This feature fusion idea was continued by Pinheiro et al. in their SharpMask network [84], which introduced a progressive refinement module to incorporate features from the previous layer into the next one in a top-down architecture. This work will be reviewed later since it is mainly focused on instance segmentation.

4.2.5 Recurrent Neural Networks

As we noticed, CNNs have been successfully applied to multi-dimensional data such as images. Nevertheless, these networks rely on hand-specified kernels, limiting the architecture to local contexts. Taking advantage of their topological structure, Recurrent Neural Networks have been successfully applied to model short- and long-term sequences. In this way, by linking together pixel-level and local information, RNNs are able to successfully model global contexts and improve semantic segmentation. However, one important issue is the lack of a natural sequential structure in images and the focus of standard vanilla RNN architectures on one-dimensional inputs.

Based on the ReNet model for image classification, Visin et al. [19] proposed an architecture for semantic segmentation called ReSeg [78], represented in Figure 17. In this approach, the input image is processed with the first layers of the VGG-16 network [15], feeding the resulting feature maps into one or more ReNet layers for fine-tuning. Finally, feature maps are resized using upsampling layers based on transposed convolutions. In this approach, Gated Recurrent Units (GRUs) are used since they strike a good balance between memory usage and computational power. Vanilla RNNs have problems modeling long-term dependencies, mainly due to the vanishing gradient problem. Derived models such as Long Short-Term Memory (LSTM) networks [97] and GRUs [98] are the state of the art for avoiding that problem.

Fig. 17: Representation of the ReSeg network. VGG-16 convolutional layers are represented by the blue and yellow first layers. The rest of the architecture is based on the ReNet approach with fine-tuning purposes. Figure extracted from [78].

Inspired by the same ReNet architecture, a novel Long Short-Term Memorized Context Fusion (LSTM-CF) model for scene labeling was proposed by [99]. In this approach, two different data sources are used: RGB and depth. The RGB pipeline relies on a variant of the DeepLab architecture [29], concatenating features at three different scales to enrich the feature representation (inspired by [100]). The global context is modeled vertically over both the depth and photometric data sources, concluding with a horizontal fusion in both directions over these vertical contexts.

As we noticed, modeling global image contexts is related to 2D recurrent approaches by unfolding the network vertically and horizontally over the input images. Based on the same idea, Byeon et al. [80] proposed a simple 2D LSTM-based architecture in which the input image is divided into non-overlapping windows which are fed into four separate LSTM memory blocks. This work emphasizes its low computational complexity on a single-core CPU and the model simplicity.

Another approach for capturing global information relies on using bigger input windows in order to model larger contexts. Nevertheless, this reduces image resolution and also implies several problems regarding window overlapping. However, Pinheiro et al. [81] introduced Recurrent Convolutional Neural Networks (rCNNs), which recurrently train with different input window sizes while taking previous predictions into account. In this way, the predicted labels are automatically smoothed, increasing the performance.

Undirected cyclic graphs (UCGs) were also adopted to model image contexts for semantic segmentation [82]. Nevertheless, RNNs are not directly applicable to UCGs, so the solution is to decompose them into several directed acyclic graphs (DAGs). In this approach, images are processed by three different layers: a CNN that produces the image feature map, DAG-RNNs that model the image contextual dependencies, and a deconvolution layer that upsamples the feature maps. This work demonstrates how RNNs can be used together with graphs to successfully model long-range contextual dependencies, overcoming state-of-the-art approaches in terms of performance.

4.3 Instance Segmentation

Instance segmentation is considered the next step after semantic segmentation and, at the same time, the most challenging problem in comparison with the rest of low-level pixel segmentation techniques. Its main purpose is to represent objects of the same class split into different instances. The automation of this process is not straightforward, since the number of instances is initially unknown and the evaluation of the predictions is not pixel-wise as in semantic segmentation. Consequently, this problem remains partially unsolved, but the interest in this field is motivated by its potential applicability. Instance labeling provides extra information for reasoning about occlusion situations, for counting the number of elements belonging to the same class, and for detecting a particular object to grasp in robotic tasks, among many other applications.

For this purpose, Hariharan et al. [10] proposed a Simultaneous Detection and Segmentation (SDS) method in order to improve performance over already existing works. Their pipeline first uses a bottom-up hierarchical image segmentation and object candidate generation process called Multi-scale COmbinatorial Grouping (MCG) [101] to obtain region proposals. For each region, features are extracted by using an adapted version of the Region-CNN (R-CNN) [102], which is fine-tuned using bounding boxes provided by the MCG method instead of selective search, together with region foreground features. Then, each region proposal is classified by using a linear Support Vector Machine (SVM) on top of the CNN features. Finally, and for refinement purposes, Non-Maximum Suppression (NMS) is applied to the previous proposals.
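Since several of these proposal-based pipelines end with Non-Maximum Suppression, a minimal sketch of that step may be useful (NumPy; the box format and the 0.5 overlap threshold are arbitrary choices, not the exact setting used in [10]):

    import numpy as np

    def nms(boxes, scores, iou_threshold=0.5):
        """Greedy Non-Maximum Suppression.
        boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
        Returns the indices of the boxes that are kept."""
        order = np.argsort(scores)[::-1]      # highest-scoring boxes first
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # Intersection of the best box with all remaining boxes.
            xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
            yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
            xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
            yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
            inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                    (boxes[order[1:], 3] - boxes[order[1:], 1])
            iou = inter / (area_i + areas - inter)
            # Drop proposals overlapping too much with the kept box.
            order = order[1:][iou <= iou_threshold]
        return keep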

Later, Pinheiro et al. [83] presented the DeepMask model, an object proposal approach based on a single ConvNet. This model predicts a segmentation mask for an input patch and the likelihood of that patch containing an object. The two tasks are learned jointly and computed by a single network, sharing most of the layers except the last ones, which are task-specific.

Based on the DeepMask architecture as a starting point, due to its effectiveness, the same authors presented a novel architecture for object instance segmentation implementing a top-down refinement process [84] and achieving better performance in terms of accuracy and speed. The goal of this process is to efficiently merge low-level features with high-level semantic information from upper network layers. The process consists of different refinement modules stacked together (one module per pooling layer), with the purpose of inverting the pooling effect by generating a new upsampled object encoding. Figure 18 shows the refinement module in SharpMask.

Fig. 18: SharpMask's top-down architecture with progressive refinement using their signature modules. That refinement merges spatially rich information from lower-level features with high-level semantic cues encoded in upper layers. Figure extracted from [83].

Another approach, based on Fast R-CNN as a starting point and using DeepMask object proposals instead of Selective Search, was presented by Zagoruyko et al. [85]. This combined system, called MultiPath classifier, improved performance over the COCO dataset and introduced three modifications to Fast R-CNN: improving localization with an integral loss, providing context by using foveal regions, and finally adding skip connections to give multi-scale features to the network. The system achieved a 66% improvement over the baseline Fast R-CNN.

As we have seen, most of the methods mentioned above rely on existing object detectors, which limits model performance. Even so, instance segmentation remains an unresolved research problem and the mentioned works are only a small part of this challenging research topic.

4.4 RGB-D Data

As we noticed, a significant amount of work on semantic segmentation has been done using photometric data. Nevertheless, the use of structural information was spurred by the advent of low-cost RGB-D sensors, which provide useful geometric cues extracted from depth information. Several works focused on RGB-D scene segmentation have reported an improvement in the fine-grained labeling precision by using depth information and not only photometric data. Using depth information for segmentation is considered more challenging because of the unpredictable variation of scene illumination alongside the incomplete representation of objects caused by complex occlusions. However, various works have successfully made use of depth information to increase accuracy.

The use of depth images with approaches focused on photometric data is not straightforward. Depth data needs to be encoded with three channels at each pixel as if it were an RGB image. Techniques such as Horizontal Height Angle (HHA) [11] are used for encoding the depth into three channels as follows: horizontal disparity, height above ground, and the angle between the local surface normal and the inferred gravity direction. In this way, depth images can be fed to models designed for RGB data, improving performance by learning new features from structural information. Several works such as [99] are based on this encoding technique.

In the literature related to methods that use RGB-D data, we can also find some works that leverage a multi-view approach to improve existing single-view works.

Zeng et al. [103] present an object segmentation approach that leverages multi-view RGB-D data and deep learning techniques. RGB-D images captured from each viewpoint are fed to a FCN which returns a 40-class probability for each pixel in each image. Segmentation labels are thresholded at three times the standard deviation above the mean probability across all views. Moreover, in this work, multiple networks for feature extraction were trained (AlexNet [14] and VGG-16 [15]), evaluating the benefits of using depth information. They found that adding depth did not yield any major improvements in segmentation performance, which could be caused by noise in the depth information. The described approach was presented during the 2016 Amazon Picking Challenge. This work is a minor contribution towards multi-view deep learning systems since RGB images are independently fed to a FCN.

Ma et al. [104] propose a novel approach for object-class segmentation using a multi-view deep learning technique. Multiple views are obtained from a moving RGB-D camera. During the training stage, the camera trajectory is obtained using an RGB-D SLAM technique, and RGB-D images are warped into ground-truth annotated frames in order to enforce multi-view consistency for training. The proposed approach is based on FuseNet [105], which combines RGB and depth images for semantic segmentation, and improves the original work by adding multi-scale loss minimization.

4.5 3D Data

3D geometric data such as point clouds or polygonal meshes are useful representations thanks to their additional dimension, which provides methods with rich spatial information that is intuitively useful for segmentation. However, the vast majority of successful deep learning segmentation architectures – CNNs in particular – are not originally engineered to deal with unstructured or irregular inputs such as the aforementioned ones.
In order to enable weight sharing and other optimizations in convolutional architectures, most researchers have resorted to 3D voxel grids or projections to transform unstructured and unordered point clouds or meshes into regular representations before feeding them to the networks. For instance, Huang et al. [86] (see Figure 19) take a point cloud and parse it through a dense voxel grid, generating a set of occupancy voxels which are used as input to a 3D CNN to produce one label per voxel. They then map the labels back to the point cloud. Although this approach has been applied successfully, it has some disadvantages like quantization, loss of spatial information, and unnecessarily large representations. For that reason, various researchers have focused their efforts on creating deep architectures that are able to directly consume unstructured 3D point sets or meshes.

Fig. 19: 3DCNN-based system presented by Huang et al. [86] for semantic labeling of point clouds. Clouds undergo a dense voxelization process and the CNN produces per-voxel labels that are then mapped back to the point cloud. Figure extracted from [86].
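As an illustration of the voxelization step such pipelines rely on, the following NumPy sketch rasterizes a point cloud into a dense occupancy grid and keeps the point-to-voxel indices so that per-voxel predictions can later be mapped back to the points (the grid size and normalization are arbitrary; this is not the exact procedure of [86]):

    import numpy as np

    def voxelize(points, grid_size=32):
        """points: (N, 3) array of XYZ coordinates.
        Returns a dense occupancy grid and the voxel index of every point."""
        mins = points.min(axis=0)
        maxs = points.max(axis=0)
        # Normalize coordinates to [0, 1) and discretize into grid cells.
        scaled = (points - mins) / np.maximum(maxs - mins, 1e-9)
        idx = np.minimum((scaled * grid_size).astype(int), grid_size - 1)  # (N, 3)
        occupancy = np.zeros((grid_size,) * 3, dtype=np.float32)
        occupancy[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
        return occupancy, idx

    points = np.random.rand(2048, 3)
    grid, idx = voxelize(points)
    print(grid.shape, grid.sum())  # (32, 32, 32) and the number of occupied voxels

    # After a 3D CNN predicts per-voxel labels over `grid`, they can be mapped
    # back to points with: point_labels = voxel_labels[idx[:, 0], idx[:, 1], idx[:, 2]]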
PointNet [87] is a pioneering work which presents a deep neural network that takes raw point clouds as input, providing a unified architecture for both classification and segmentation. Figure 20 shows that two-part network, which is able to consume unordered point sets in 3D.

Fig. 20: The PointNet unified architecture for point cloud classification and segmentation. Figure reproduced from [87].

As we can observe, PointNet is a deep network architecture that stands out of the crowd due to the fact that it is based on fully connected layers instead of convolutional ones. The architecture features two subnetworks: one for classification and another for segmentation. The classification subnetwork takes a point cloud and applies a set of transforms and Multi-Layer Perceptrons (MLPs) to generate features which are then aggregated using max-pooling to generate a global feature which describes the original input cloud. That global feature is classified by another MLP to produce output scores for each class. The segmentation subnetwork concatenates the global feature with the per-point features extracted by the classification network and applies another two MLPs to generate features and produce output scores for each point.
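The key order-invariance trick, a shared per-point MLP followed by max-pooling and concatenation of the resulting global feature with the per-point features, can be sketched as follows (PyTorch assumed; the layer widths are illustrative and the input transform networks of PointNet are omitted):

    import torch
    import torch.nn as nn

    class TinyPointNetSeg(nn.Module):
        """Simplified PointNet-like segmentation head for (N, P, 3) point clouds."""
        def __init__(self, num_classes=13):
            super().__init__()
            self.point_mlp = nn.Sequential(   # shared MLP applied to every point
                nn.Linear(3, 64), nn.ReLU(inplace=True),
                nn.Linear(64, 128), nn.ReLU(inplace=True))
            self.seg_mlp = nn.Sequential(     # per-point classifier on local+global features
                nn.Linear(128 + 128, 128), nn.ReLU(inplace=True),
                nn.Linear(128, num_classes))

        def forward(self, points):            # points: (N, P, 3)
            per_point = self.point_mlp(points)                   # (N, P, 128)
            global_feat, _ = per_point.max(dim=1, keepdim=True)  # symmetric max-pool
            global_feat = global_feat.expand_as(per_point)       # broadcast to every point
            fused = torch.cat([per_point, global_feat], dim=-1)  # (N, P, 256)
            return self.seg_mlp(fused)                           # (N, P, num_classes)

    scores = TinyPointNetSeg()(torch.randn(2, 1024, 3))
    print(scores.shape)  # torch.Size([2, 1024, 13])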
4.6 Video Sequences

As we have observed, there has been significant progress in single-image segmentation. However, when dealing with image sequences, many systems rely on the naïve application of the very same algorithms in a frame-by-frame manner. This approach works, often producing remarkable results. Nevertheless, applying those methods frame by frame is usually not viable due to computational cost. In addition, those methods completely ignore temporal continuity and coherence cues which might help increase the accuracy of the system while reducing its execution time.

Arguably, the most remarkable work in this regard is the clockwork FCN by Shelhamer et al. [88]. This network is an adaptation of a FCN to make use of temporal cues in video to decrease inference time while preserving accuracy. The clockwork approach relies on the following insight: feature velocity – the temporal rate of change of features in the network – varies from layer to layer across frames, so that features from shallow layers change faster than deep ones. Under that assumption, layers can be grouped into stages that are processed at different update rates depending on their depth. By doing this, deep features can be persisted over frames thanks to their semantic stability, thus saving inference time. Figure 21 shows the network architecture of the clockwork FCN.

Fig. 21: The clockwork FCN with three stages and their corresponding clock rates. Figure extracted from [88].

It is important to remark that the authors propose two kinds of update rates: fixed and adaptive. The fixed schedule simply sets a constant time frame for recomputing the features of each stage of the network. The adaptive schedule fires each clock in a data-driven manner, e.g., depending on the amount of motion or semantic change. Figure 22 shows an example of this adaptive scheduling.

Fig. 22: Adaptive clockwork method proposed by Shelhamer et al. [88]. Extracted features persist during static frames while they are recomputed for dynamic ones. Figure extracted from [88].
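The scheduling idea can be made concrete with a short sketch (plain Python; the stage functions, the frame-difference trigger, and the thresholds are hypothetical placeholders rather than the actual policy of [88]):

    import numpy as np

    def clockwork_inference(frames, shallow_stage, deep_stage, head,
                            deep_period=4, motion_threshold=None):
        """Run a two-stage network over a sequence, recomputing the deep stage
        only every `deep_period` frames (fixed schedule) or when the inter-frame
        difference is large (adaptive schedule, if motion_threshold is given)."""
        outputs, deep_feat, prev_frame = [], None, None
        for t, frame in enumerate(frames):
            shallow_feat = shallow_stage(frame)       # cheap, fast-changing features
            if motion_threshold is None:
                update_deep = (t % deep_period == 0)  # fixed clock
            else:
                motion = 0.0 if prev_frame is None else np.abs(frame - prev_frame).mean()
                update_deep = deep_feat is None or motion > motion_threshold  # adaptive clock
            if update_deep:
                deep_feat = deep_stage(shallow_feat)  # expensive, slowly-changing features
            outputs.append(head(shallow_feat, deep_feat))
            prev_frame = frame
        return outputs

    # Toy usage with identity-like stages on random "frames":
    frames = [np.random.rand(8, 8) for _ in range(6)]
    outs = clockwork_inference(frames, lambda x: x, lambda x: x * 2, lambda s, d: s + d)
    print(len(outs))  # 6 predictions; the deep stage is recomputed on frames 0 and 4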
Zhang et al. [106] took a different approach and made use of a 3DCNN, which was originally created for learning features from volumes, to learn hierarchical spatio-temporal features from multi-channel inputs such as video clips. In parallel, they over-segment the input clip into supervoxels. Then they use that supervoxel graph and embed the learned features in it. The final segmentation is obtained by applying graph-cut [107] on the supervoxel graph.

Another remarkable method, which builds on the idea of using 3D convolutions, is the deep end-to-end voxel-to-voxel prediction system by Tran et al. [89]. In that work, they make use of the Convolutional 3D (C3D) network introduced by themselves in a previous work [108], and extend it for semantic segmentation by adding deconvolutional layers at the end. Their system works by splitting the input into clips of 16 frames and performing predictions for each clip separately. Its main contribution is the use of 3D convolutions. Those convolutions make use of three-dimensional filters which are suitable for spatio-temporal feature learning across multiple channels, in this case frames. Figure 23 shows the difference between 2D and 3D convolutions applied to multi-channel inputs, proving the usefulness of the 3D ones for video segmentation.

Fig. 23: Difference between 2D and 3D convolutions applied on a set of frames: (a) 2D convolutions use the same weights for the whole depth of the stack of frames (multiple channels) and result in a single image, whereas (b) 3D convolutions use 3D filters and produce a 3D volume as a result of the convolution, thus preserving temporal information of the frame stack.
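The dimensional difference illustrated in Figure 23 is easy to verify directly (PyTorch assumed; the clip length follows the 16-frame splitting mentioned above, while the channel counts and spatial size are arbitrary):

    import torch
    import torch.nn as nn

    clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, height, width)

    # A 3D convolution slides over time as well as space, so the output keeps a
    # temporal dimension and spatio-temporal patterns can be learned.
    conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)
    print(conv3d(clip).shape)                 # torch.Size([1, 64, 16, 112, 112])

    # Collapsing the frames into channels and using a 2D convolution instead
    # produces a single map per filter: the temporal structure is lost.
    stacked = clip.reshape(1, 3 * 16, 112, 112)
    conv2d = nn.Conv2d(3 * 16, 64, kernel_size=3, padding=1)
    print(conv2d(stacked).shape)              # torch.Size([1, 64, 112, 112])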
5 DISCUSSION

In the previous section we reviewed the existing methods from a literary and qualitative point of view, i.e., we did not take any quantitative result into account. In this section we are going to discuss those very same methods from a numeric standpoint. First of all, we will describe the most popular evaluation metrics that can be used to measure the performance of semantic segmentation systems from three aspects: execution time, memory footprint, and accuracy. Next, we will gather the results of the methods on the most representative datasets using the previously described metrics. After that, we will summarize and draw conclusions about those results. At last, we enumerate possible future research lines that we consider significant for the field.

5.1 Evaluation Metrics

For a segmentation system to be useful and actually produce a significant contribution to the field, its performance must be evaluated with rigor. In addition, that evaluation must be performed using standard and well-known metrics that enable fair comparisons with existing methods. Furthermore, many aspects must be evaluated to assert the validity and usefulness of a system: execution time, memory footprint, and accuracy. Depending on the purpose or the context of the system, some metrics might be more important than others, i.e., accuracy may be expendable up to a certain point in favor of execution speed for a real-time application. Nevertheless, for the sake of scientific rigor it is of utmost importance to provide all the possible metrics for a proposed method.

5.1.1 Execution Time

Speed or runtime is an extremely valuable metric since the vast majority of systems must meet hard requirements on how much time they can spend on the inference pass. In some cases it might be useful to know the time needed for training the system, but it is usually not that significant, unless it is exaggeratedly slow, since training is an offline process. In any case, providing exact timings for the methods can be seen as meaningless since they are extremely dependent on the hardware and the backend implementation, rendering some comparisons pointless.

However, for the sake of reproducibility and in order to help fellow researchers, it is useful to provide timings together with a thorough description of the hardware on which the system was executed, as well as the conditions of the benchmark. If done properly, that can help others estimate whether the method is useful for their application, as well as perform fair comparisons under the same conditions to check which are the fastest methods.
5.1.2 Memory Footprint

Memory usage is another important factor for segmentation methods. Although it is arguably less constraining than execution time – scaling memory capacity is usually feasible – it can also be a limiting element. In some situations, such as onboard chips for robotic platforms, memory is not as abundant as in a high-performance server. Even high-end Graphics Processing Units (GPUs), which are commonly used to accelerate deep networks, do not pack a copious amount of memory. In this regard, and considering the same implementation-dependent aspects as with runtime, documenting the peak and average memory footprint of a method with a complete description of the execution conditions can be extraordinarily helpful.

5.1.3 Accuracy

Many evaluation criteria have been proposed and are frequently used to assess the accuracy of any kind of technique for semantic segmentation. Those metrics are usually variations on pixel accuracy and IoU. We report the most popular metrics for semantic segmentation that are currently used to measure how per-pixel labeling methods perform on this task. For the sake of the explanation, we remark the following notation details: we assume a total of $k + 1$ classes (from $L_0$ to $L_k$, including a void class or background), and $p_{ij}$ is the amount of pixels of class $i$ inferred to belong to class $j$. In other words, $p_{ii}$ represents the number of true positives, while $p_{ij}$ and $p_{ji}$ are usually interpreted as false positives and false negatives respectively (although either of them can be the sum of both false positives and false negatives).

• Pixel Accuracy (PA): the simplest metric, which computes the ratio between the amount of properly classified pixels and the total number of them:

  $PA = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}}$

• Mean Pixel Accuracy (MPA): a slightly improved PA in which the ratio of correct pixels is computed on a per-class basis and then averaged over the total number of classes:

  $MPA = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}$

• Mean Intersection over Union (MIoU): the standard metric for segmentation purposes. It computes the ratio between the intersection and the union of two sets, in our case the ground truth and our predicted segmentation. That ratio can be reformulated as the number of true positives (intersection) over the sum of true positives, false negatives, and false positives (union). The IoU is computed on a per-class basis and then averaged:

  $MIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$

• Frequency Weighted Intersection over Union (FWIoU): an improvement over the raw MIoU which weights the importance of each class depending on its appearance frequency:

  $FWIoU = \frac{1}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}} \sum_{i=0}^{k} \frac{\left(\sum_{j=0}^{k} p_{ij}\right) p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$
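For clarity, all four metrics can be computed directly from the $(k+1) \times (k+1)$ confusion matrix, as in the following sketch (NumPy; the toy matrix is purely illustrative):

    import numpy as np

    def segmentation_metrics(p):
        """p[i, j] counts pixels of ground-truth class i predicted as class j."""
        p = np.asarray(p, dtype=np.float64)
        total = p.sum()
        tp = np.diag(p)                      # true positives per class
        gt_per_class = p.sum(axis=1)         # ground-truth pixels per class
        pred_per_class = p.sum(axis=0)       # predicted pixels per class

        pa = tp.sum() / total                                # Pixel Accuracy
        mpa = np.nanmean(tp / gt_per_class)                  # Mean Pixel Accuracy
        iou = tp / (gt_per_class + pred_per_class - tp)      # per-class IoU
        miou = np.nanmean(iou)                               # Mean IoU
        fwiou = np.nansum((gt_per_class / total) * iou)      # Frequency Weighted IoU
        return pa, mpa, miou, fwiou

    # Toy example with two classes (background and foreground):
    conf = np.array([[50, 10],
                     [ 5, 35]])
    print(segmentation_metrics(conf))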

Of all the metrics described above, the MIoU stands out of the crowd as the most used one due to its representativeness and simplicity. Most challenges and researchers make use of that metric to report their results.

5.2 Results

As we stated before, Section 4 provided a functional description of the reviewed methods according to their targets. Now we gather all the quantitative results for those methods as stated by their authors in their corresponding papers. These results are organized into three parts depending on the input data used by the methods: 2D RGB or 2.5D RGB-D images, volumetric 3D, or video sequences.

The most used datasets have been selected for that purpose. It is important to remark the heterogeneity of the papers in the field when reporting results. Although most of them try to evaluate their methods on standard datasets, provide enough information to reproduce their results, and express them in widely known metrics, many others fail to do so. That leads to a situation in which it is hard or even impossible to fairly compare methods.

Furthermore, we also came across the fact that few authors provide information about metrics other than accuracy. Despite the importance of other metrics, most of the papers do not include any data about execution time nor memory footprint. In some cases that information is provided, but no reproducibility information is given, so it is impossible to know the setup that produced those results, which makes them of no use.

5.2.1 RGB

For the single 2D image category we have selected seven datasets: PASCAL VOC-2012, PASCAL Context, PASCAL Person-Part, CamVid, CityScapes, Stanford Background, and SiftFlow. That selection accounts for a wide range of situations and targets.

The first, and arguably the most important, dataset on which the vast majority of methods are evaluated is PASCAL VOC-2012. Table 3 shows the results of those reviewed methods which provide accuracy results on the PASCAL VOC-2012 test set. This set of results shows a clear improvement trend from the first proposed methods (SegNet and the original FCN) to the most complex models such as CRFasRNN and the winner (DeepLab) with 79.70 IoU.

TABLE 3: Performance results on PASCAL VOC-2012.
# Method Accuracy (IoU)
1 DeepLab [69] 79.70
2 Dilation [71] 75.30
3 CRFasRNN [70] 74.70
4 ParseNet [77] 69.80
5 FCN-8s [65] 67.20
6 Multi-scale-CNN-Eigen [74] 62.60
7 Bayesian SegNet [67] 60.50

Apart from the widely known VOC, we also collected metrics for its Context counterpart. Table 4 shows those results, in which DeepLab is again the top scorer (45.70 IoU).

TABLE 4: Performance results on PASCAL-Context.
# Method Accuracy (IoU)
1 DeepLab [69] 45.70
2 CRFasRNN [70] 39.28
3 FCN-8s [65] 39.10

In addition, we also took into account the PASCAL Person-Part dataset, whose results are shown in Table 5. In this case, the only analyzed method that provides metrics for this dataset is DeepLab, which achieved a 64.94 IoU.

TABLE 5: Performance results on PASCAL-Person-Part.
# Method Accuracy (IoU)
1 DeepLab [69] 64.94

Moving on from a general-purpose dataset such as PASCAL VOC, we also gathered results for two of the most important urban driving databases. Table 6 shows the results of those methods which provide accuracy metrics for the CamVid dataset. In this case, an RNN-based approach (DAG-RNN) is the top one with a 91.60 IoU.

TABLE 6: Performance results on CamVid.
# Method Accuracy (IoU)
1 DAG-RNN [82] 91.60
2 Bayesian SegNet [67] 63.10
3 SegNet [66] 60.10
4 ReSeg [78] 58.80
5 ENet [72] 55.60

Table 7 shows the results on a more challenging and currently more widely used database: CityScapes. The trend on this dataset is similar to the one on PASCAL VOC, with DeepLab leading with a 70.40 IoU.

TABLE 7: Performance results on CityScapes.
# Method Accuracy (IoU)
1 DeepLab [69] 70.40
2 Dilation10 [71] 67.10
3 FCN-8s [65] 65.30
4 CRFasRNN [70] 62.50
5 ENet [72] 58.30

Table 8 shows the results of various recurrent networks on the Stanford Background dataset. The winner, rCNN, achieves a maximum accuracy of 80.20 IoU.

TABLE 8: Performance results on Stanford Background.
# Method Accuracy (IoU)
1 rCNN [81] 80.20
2 2D-LSTM [80] 78.56

At last, results for another popular dataset, SiftFlow, are shown in Table 9. This dataset is also dominated by recurrent methods; in particular, DAG-RNN is the top scorer with 85.30 IoU.

TABLE 9: Performance results on SiftFlow.
# Method Accuracy (IoU)
1 DAG-RNN [82] 85.30
2 rCNN [81] 77.70
3 2D-LSTM [80] 70.11

5.2.2 2.5D

Regarding the 2.5D category, i.e., datasets which also include depth information apart from the typical RGB channels, we have selected three of them for the analysis: SUN-RGB-D, NYUDv2, and SUN3D. Table 10 shows the results for SUN-RGB-D, which are only provided by LSTM-CF; that method achieves 48.10 IoU.

TABLE 10: Performance results on SUN-RGB-D.
# Method Accuracy (IoU)
1 LSTM-CF [79] 48.10

Table 11 shows the results for NYUDv2, which are exclusive to LSTM-CF as well. That method reaches 49.40 IoU.

TABLE 11: Performance results on NYUDv2.
# Method Accuracy (IoU)
1 LSTM-CF [79] 49.40

At last, Table 12 gathers results for the last 2.5D dataset: SUN3D. Again, LSTM-CF is the only one which provides information for that database, in this case a 58.50 accuracy.

TABLE 12: Performance results on SUN3D.
# Method Accuracy (IoU)
1 LSTM-CF [79] 58.50

5.2.3 3D

Two 3D datasets have been chosen for this discussion: ShapeNet Part and Stanford-2D-3D-S. In both cases, only one of the analyzed methods actually scored on them. It is the case of PointNet, which achieved 83.70 and 47.71 IoU on ShapeNet Part (Table 13) and Stanford-2D-3D-S (Table 14) respectively.

TABLE 13: Performance results on ShapeNet Part.
# Method Accuracy (IoU)
1 PointNet [87] 83.70

TABLE 14: Performance results on Stanford 2D-3D-S.
# Method Accuracy (IoU)
1 PointNet [87] 47.71

5.2.4 Sequences

The last category included in this discussion is video or sequences. For that part we gathered results for two datasets which are suitable for sequence segmentation: CityScapes and YouTube-Objects. Only one of the reviewed methods for video segmentation provides quantitative results on those datasets: the Clockwork Convnet. That method reaches 64.40 IoU on CityScapes (Table 15) and 68.50 on YouTube-Objects (Table 16).

TABLE 15: Performance results on CityScapes.
# Method Accuracy (IoU)
1 Clockwork Convnet [88] 64.40

TABLE 16: Performance results on YouTube-Objects.
# Method Accuracy (IoU)
1 Clockwork Convnet [88] 68.50

5.3 Summary

In light of the results, we can draw various conclusions. The most important of them is related to reproducibility. As we have observed, many methods report results on non-standard datasets or are not even tested at all, which makes comparisons impossible. Furthermore, some of them do not describe the setup for their experimentation or do not provide the source code for their implementation, thus significantly hurting reproducibility. Methods should report their results on standard datasets, exhaustively describe the training procedure, and also make their models and weights publicly available to enable progress.

Another important fact discovered thanks to this study is the lack of information about other metrics such as execution time and memory footprint. Almost no paper reports this kind of information, and those that do suffer from the reproducibility issues mentioned before. This void is due to the fact that most methods focus on accuracy without any concern about time or space. However, it is important to think about where those methods are being applied. In practice, most of them will end up running on embedded devices, e.g., self-driving cars, drones, or robots, which are fairly limited on both fronts: computational power and memory.

Regarding the results themselves, we can conclude that DeepLab is the most solid method, outperforming the rest on almost every single RGB image dataset by a significant margin. The 2.5D or multimodal datasets are dominated by recurrent networks such as LSTM-CF. 3D data segmentation still has a long way to go, with PointNet paving the way for future research on dealing with unordered point clouds without any kind of preprocessing or discretization. Finally, dealing with video sequences is another green area with no clear direction, but Clockwork Convnets are the most promising approach thanks to their efficiency-accuracy duality. 3D convolutions are worth remarking due to their power and flexibility to process multi-channel inputs, which makes them successful at capturing both spatial and temporal information.

5.4 Future Research Directions

Based on the reviewed research, which marks the state of the art of the field, we present a list of future research directions that would be interesting to pursue.

• 3D datasets: methods that make full use of 3D information are starting to rise but, even if new proposals and techniques are engineered, they still lack one of the most important components: data. There is a strong need for large-scale datasets for 3D semantic segmentation, which are harder to create than their lower-dimensional counterparts. Although there are already some promising works, there is still room for more, better, and more varied data. It is important to remark the importance of real-world 3D data, since most of the already existing works are synthetic databases. A proof of the importance of 3D is the fact that the ILSVRC will feature 3D data in 2018.

• Sequence datasets: the same lack of large-scale data that hinders progress on 3D segmentation also impacts video segmentation. There are only a few datasets that are sequence-based and thus helpful for developing methods which take advantage of temporal information. Bringing up more high-quality data of this nature, either 2D or 3D, will unlock new research lines without any doubt.

• Point cloud segmentation using Graph Convolutional Networks (GCNs): as we already mentioned, dealing with 3D data such as point clouds poses an unsolved challenge. Due to their unordered and unstructured nature, traditional architectures such as CNNs cannot be applied unless some sort of discretization process is applied to structure them. One promising line of research aims to treat point clouds as graphs and apply convolutions over them [109] [110] [111]. This has the advantage of preserving spatial cues in every dimension without quantizing the data.

• Context knowledge: while FCNs are a consolidated approach for semantic segmentation, they lack several features such as context modelling that help increase accuracy. The reformulation of CRFs as RNNs to create end-to-end solutions seems to be a promising direction to improve results on real-life data. Multi-scale and feature fusion approaches have also shown remarkable progress. In general, all those works represent important steps towards achieving the ultimate goal, but there are some problems that still require more research.

• Real-time segmentation: in many applications, precision is important; however, it is also crucial that these implementations are able to cope with common camera frame rates (at least 25 frames per second). Most of the current methods are far from that frame rate, e.g., FCN-8s takes roughly 100 ms to process a low-resolution PASCAL VOC image whilst CRFasRNN needs more than 500 ms. Therefore, during the next years, we expect a stream of works to come out focusing more on real-time constraints. These future works will have to find a trade-off between accuracy and runtime.
• Memory: some platforms are bounded by hard memory constraints. Segmentation networks usually need significant amounts of memory to be executed for both inference and training. In order to fit them into such devices, networks must be simplified. While this can be easily accomplished by reducing their complexity (often trading it for accuracy), other approaches can be taken. Pruning is a promising research line that aims to simplify a network, making it lightweight while keeping the knowledge, and thus the accuracy, of the original network architecture [112] [113] [114].

• Temporal coherency on sequences: some methods have addressed video or sequence segmentation, but only by taking advantage of temporal cues to increase accuracy or efficiency. However, none of them have explicitly tackled the coherency problem. For a segmentation system to work on video streams it is important not only to produce good results frame by frame, but also to make them coherent through the whole clip without producing artifacts, by smoothing the predicted per-pixel labels along the sequence.

• Multi-view integration: the use of multiple views in recently proposed segmentation works is mostly limited to RGB-D cameras and, in particular, focused on single-object segmentation.

6 CONCLUSION

To the best of our knowledge, this is the first review paper in the literature to focus on semantic segmentation using deep learning. In comparison with other surveys, this paper is devoted to such a rising topic as deep learning, covering the most advanced and recent work on that front. We formulated the semantic segmentation problem and provided the reader with the necessary background knowledge about deep learning for that task. We covered the contemporary literature of datasets and methods, providing a comprehensive survey of 28 datasets and 27 methods. Datasets were carefully described, stating their purposes and characteristics so that researchers can easily pick the one that best suits their needs. Methods were surveyed from two perspectives: contributions and raw results, i.e., accuracy. We also presented a comparative summary of the datasets and methods in tabular form, classifying them according to various criteria. In the end, we discussed the results and provided useful insight in the shape of future research directions and open problems in the field. In conclusion, semantic segmentation has been approached with many success stories but still remains an open problem whose solution would prove really useful for a wide set of real-world applications. Furthermore, deep learning has proved to be extremely powerful to tackle this problem, so we can expect a flurry of innovation and new research lines in the upcoming years.

ACKNOWLEDGMENTS

This work has been funded by the Spanish Government TIN2016-76515-R grant for the COMBAHO project, supported with Feder funds. It has also been supported by a Spanish national grant for PhD studies FPU15/04516. In addition, it was also funded by the grant Ayudas para Estudios de Máster e Iniciación a la Investigación from the University of Alicante.

REFERENCES

[1] A. Ess, T. Müller, H. Grabner, and L. J. Van Gool, "Segmentation-based urban traffic scene understanding," in BMVC, vol. 1, 2009, p. 2.
[2] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in 2012 IEEE Conference on Computer Vision and Pattern Recognition, June 2012, pp. 3354–3361.
[3] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
[4] M. Oberweger, P. Wohlhart, and V. Lepetit, "Hands deep in deep learning for hand pose estimation," arXiv preprint arXiv:1502.06807, 2015.
[5] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, and I. So Kweon, "Learning a deep convolutional network for light-field image super-resolution," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 24–32.
[6] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, "Deep learning for content-based image retrieval: A comprehensive study," in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 157–166.
[7] F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. E. Barbano, "Toward automatic phenotyping of developing embryos from videos," IEEE Transactions on Image Processing, vol. 14, no. 9, pp. 1360–1371, 2005.
[8] D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, "Deep neural networks segment neuronal membranes in electron microscopy images," in Advances in Neural Information Processing Systems, 2012, pp. 2843–2851.
[9] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.
[10] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Simultaneous detection and segmentation," in European Conference on Computer Vision. Springer, 2014, pp. 297–312.
[11] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, "Learning rich features from RGB-D images for object detection and segmentation," in European Conference on Computer Vision. Springer, 2014, pp. 345–360.
[12] H. Zhu, F. Meng, J. Cai, and S. Lu, "Beyond pixels: A comprehensive survey from bottom-up to semantic image segmentation and cosegmentation," Journal of Visual Communication and Image Representation, vol. 34, pp. 12–27, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1047320315002035
[13] M. Thoma, "A survey of semantic segmentation," CoRR, vol. abs/1602.06541, 2016. [Online]. Available: http://arxiv.org/abs/1602.06541
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[15] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[18] A. Graves, S. Fernández, and J. Schmidhuber, "Multi-dimensional recurrent neural networks," CoRR, vol. abs/0705.2011, 2007. [Online]. Available: http://arxiv.org/abs/0705.2011
21

[19] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. C. Courville, and [39] R. Zhang, S. A. Candra, K. Vetter, and A. Zakhor, “Sensor fusion
Y. Bengio, “Renet: A recurrent neural network based alternative for semantic segmentation of urban scenes,” in Robotics and
to convolutional networks,” CoRR, vol. abs/1505.00393, 2015. Automation (ICRA), 2015 IEEE International Conference on. IEEE,
[Online]. Available: http://arxiv.org/abs/1505.00393 2015, pp. 1850–1857.
[20] A. Ahmed, K. Yu, W. Xu, Y. Gong, and E. Xing, “Training hi- [40] S. Gould, R. Fulton, and D. Koller, “Decomposing a scene into
erarchical feed-forward visual recognition models using transfer geometric and semantically consistent regions,” in Computer Vi-
learning from pseudo-tasks,” in European Conference on Computer sion, 2009 IEEE 12th International Conference on. IEEE, 2009, pp.
Vision. Springer, 2008, pp. 69–82. 1–8.
[21] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and [41] C. Liu, J. Yuen, and A. Torralba, “Nonparametric scene parsing:
transferring mid-level image representations using convolutional Label transfer via dense scene alignment,” in Computer Vision and
neural networks,” in Proceedings of the IEEE conference on computer Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE,
vision and pattern recognition, 2014, pp. 1717–1724. 2009, pp. 1972–1979.
[22] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable [42] S. D. Jain and K. Grauman, “Supervoxel-consistent foreground
are features in deep neural networks?” in Advances in neural propagation in video,” in European Conference on Computer Vision.
information processing systems, 2014, pp. 3320–3328. Springer, 2014, pp. 656–671.
[23] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Im- [43] S. Bell, P. Upchurch, N. Snavely, and K. Bala, “Material recog-
agenet: A large-scale hierarchical image database,” in Computer nition in the wild with the materials in context database,” in
Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference Proceedings of the IEEE conference on computer vision and pattern
on. IEEE, 2009, pp. 248–255. recognition, 2015, pp. 3479–3487.
[24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, [44] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet and A. Sorkine-Hornung, “A benchmark dataset and evaluation
large scale visual recognition challenge,” International Journal of methodology for video object segmentation,” in Computer Vision
Computer Vision, vol. 115, no. 3, pp. 211–252, 2015. and Pattern Recognition, 2016.
[25] S. C. Wong, A. Gatt, V. Stamatescu, and M. D. McDonnell, [45] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-
“Understanding data augmentation for classification: when to Hornung, and L. Van Gool, “The 2017 davis challenge on video
warp?” CoRR, vol. abs/1609.08764, 2016. [Online]. Available: object segmentation,” arXiv:1704.00675, 2017.
http://arxiv.org/abs/1609.08764 [46] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor seg-
[26] X. Shen, A. Hertzmann, J. Jia, S. Paris, B. Price, E. Shechtman, and mentation and support inference from rgbd images,” in European
I. Sachs, “Automatic portrait segmentation for image stylization,” Conference on Computer Vision. Springer, 2012, pp. 746–760.
in Computer Graphics Forum, vol. 35, no. 2. Wiley Online Library, [47] J. Xiao, A. Owens, and A. Torralba, “Sun3d: A database of big
2016, pp. 93–102. spaces reconstructed using sfm and object labels,” in 2013 IEEE
[27] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, International Conference on Computer Vision, Dec 2013, pp. 1625–
J. Winn, and A. Zisserman, “The pascal visual object classes chal- 1632.
lenge: A retrospective,” International Journal of Computer Vision,
[48] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d
vol. 111, no. 1, pp. 98–136, Jan. 2015.
scene understanding benchmark suite,” in Proceedings of the IEEE
[28] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, conference on computer vision and pattern recognition, 2015, pp. 567–
R. Urtasun, and A. Yuille, “The role of context for object detection 576.
and semantic segmentation in the wild,” in IEEE Conference on
[49] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical
Computer Vision and Pattern Recognition (CVPR), 2014.
multi-view rgb-d object dataset,” in Robotics and Automation
[29] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille,
(ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp.
“Detect what you can: Detecting and representing objects using
1817–1824.
holistic models and body parts,” in IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2014. [50] L. Yi, V. G. Kim, D. Ceylan, I.-C. Shen, M. Yan, H. Su, C. Lu,
Q. Huang, A. Sheffer, and L. Guibas, “A scalable active frame-
[30] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik,
work for region annotation in 3d shape collections,” SIGGRAPH
“Semantic contours from inverse detectors,” in 2011 International
Asia, 2016.
Conference on Computer Vision. IEEE, 2011, pp. 991–998.
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, [51] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese, “Joint 2D-3D-
P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in Semantic Data for Indoor Scene Understanding,” ArXiv e-prints,
context,” in European Conference on Computer Vision. Springer, Feb. 2017.
2014, pp. 740–755. [52] X. Chen, A. Golovinskiy, and T. Funkhouser, “A benchmark
[32] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, for 3D mesh segmentation,” ACM Transactions on Graphics (Proc.
“The synthia dataset: A large collection of synthetic images for SIGGRAPH), vol. 28, no. 3, Aug. 2009.
semantic segmentation of urban scenes,” in Proceedings of the IEEE [53] A. Quadros, J. Underwood, and B. Douillard, “An occlusion-
Conference on Computer Vision and Pattern Recognition, 2016, pp. aware feature for range images,” in Robotics and Automation, 2012.
3234–3243. ICRA’12. IEEE International Conference on. IEEE, May 14-18 2012.
[33] M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, [54] T. Hackel, J. D. Wegner, and K. Schindler, “Contour detection
R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes in unstructured 3d point clouds,” in Proceedings of the IEEE
dataset,” in CVPR Workshop on The Future of Datasets in Vision, Conference on Computer Vision and Pattern Recognition, 2016, pp.
2015. 1610–1618.
[34] G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object [55] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmenta-
classes in video: A high-definition ground truth database,” Pat- tion and recognition using structure from motion point clouds,”
tern Recognition Letters, vol. 30, no. 2, pp. 88–97, 2009. in European Conference on Computer Vision. Springer, 2008, pp.
[35] P. Sturgess, K. Alahari, L. Ladicky, and P. H. Torr, “Combining 44–57.
appearance and structure from motion features for road scene [56] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets
understanding,” in BMVC 2012-23rd British Machine Vision Con- robotics: The kitti dataset,” The International Journal of Robotics
ference. BMVA, 2009. Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[36] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez, “Road [57] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, “Learn-
scene segmentation from a single image,” in European Conference ing object class detectors from weakly annotated video,” in Com-
on Computer Vision. Springer, 2012, pp. 376–389. puter Vision and Pattern Recognition (CVPR), 2012 IEEE Conference
[37] G. Ros and J. M. Alvarez, “Unsupervised image transformation on. IEEE, 2012, pp. 3282–3289.
for outdoor semantic labelling,” in Intelligent Vehicles Symposium [58] S. Bell, P. Upchurch, N. Snavely, and K. Bala, “OpenSurfaces: A
(IV), 2015 IEEE. IEEE, 2015, pp. 537–542. richly annotated catalog of surface appearance,” ACM Trans. on
[38] G. Ros, S. Ramos, M. Granados, A. Bakhtiary, D. Vazquez, and Graphics (SIGGRAPH), vol. 32, no. 4, 2013.
A. M. Lopez, “Vision-based offline-online perception paradigm [59] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman,
for autonomous driving,” in Applications of Computer Vision “Labelme: a database and web-based tool for image annotation,”
(WACV), 2015 IEEE Winter Conference on. IEEE, 2015, pp. 231– International journal of computer vision, vol. 77, no. 1, pp. 157–173,
238. 2008.
22

[60] S. Gupta, P. Arbelaez, and J. Malik, “Perceptual organization and recognition of indoor scenes from rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 564–571.
[61] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell, A Category-Level 3D Object Dataset: Putting the Kinect to Work. London: Springer London, 2013, pp. 141–165. [Online]. Available: http://dx.doi.org/10.1007/978-1-4471-4640-7_8
[62] A. Richtsfeld, “The object segmentation database (osd),” 2012.
[63] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.
[64] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese, “3d semantic parsing of large-scale indoor spaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1534–1543.
[65] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[66] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv preprint arXiv:1511.00561, 2015.
[67] A. Kendall, V. Badrinarayanan, and R. Cipolla, “Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding,” arXiv preprint arXiv:1511.02680, 2015.
[68] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
[69] ——, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” arXiv preprint arXiv:1606.00915, 2016.
[70] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1529–1537.
[71] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[72] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep neural network architecture for real-time semantic segmentation,” arXiv preprint arXiv:1606.02147, 2016.
[73] A. Raj, D. Maturana, and S. Scherer, “Multi-scale convolutional architecture for semantic segmentation,” 2015.
[74] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2650–2658.
[75] A. Roy and S. Todorovic, “A multi-scale cnn for affordance segmentation in rgb images,” in European Conference on Computer Vision. Springer, 2016, pp. 186–201.
[76] X. Bian, S. N. Lim, and N. Zhou, “Multiscale fully convolutional network with application to industrial inspection,” in Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on. IEEE, 2016, pp. 1–8.
[77] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see better,” arXiv preprint arXiv:1506.04579, 2015.
[78] F. Visin, M. Ciccone, A. Romero, K. Kastner, K. Cho, Y. Bengio, M. Matteucci, and A. Courville, “Reseg: A recurrent neural network-based model for semantic segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2016.
[79] Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin, LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling. Cham: Springer International Publishing, 2016, pp. 541–557. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46475-6_34
[80] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki, “Scene labeling with lstm recurrent neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3547–3555.
[81] P. H. Pinheiro and R. Collobert, “Recurrent convolutional neural networks for scene labeling,” in ICML, 2014, pp. 82–90.
[82] B. Shuai, Z. Zuo, G. Wang, and B. Wang, “Dag-recurrent neural networks for scene labeling,” CoRR, vol. abs/1509.00552, 2015. [Online]. Available: http://arxiv.org/abs/1509.00552
[83] P. O. Pinheiro, R. Collobert, and P. Dollar, “Learning to segment object candidates,” in Advances in Neural Information Processing Systems, 2015, pp. 1990–1998.
[84] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár, “Learning to refine object segments,” in European Conference on Computer Vision. Springer, 2016, pp. 75–91.
[85] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár, “A multipath network for object detection,” arXiv preprint arXiv:1604.02135, 2016.
[86] J. Huang and S. You, “Point cloud labeling using 3d convolutional neural network,” in Proc. of the International Conf. on Pattern Recognition (ICPR), vol. 2, 2016.
[87] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” arXiv preprint arXiv:1612.00593, 2016.
[88] E. Shelhamer, K. Rakelly, J. Hoffman, and T. Darrell, “Clockwork convnets for video semantic segmentation,” in Computer Vision–ECCV 2016 Workshops. Springer, 2016, pp. 852–868.
[89] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Deep end2end voxel2voxel prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 17–24.
[90] M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 2018–2025.
[91] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision. Springer, 2014, pp. 818–833.
[92] C. Rother, V. Kolmogorov, and A. Blake, “Grabcut: Interactive foreground extraction using iterated graph cuts,” in ACM Transactions on Graphics (TOG), vol. 23, no. 3. ACM, 2004, pp. 309–314.
[93] J. Shotton, J. Winn, C. Rother, and A. Criminisi, “Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context,” International Journal of Computer Vision, vol. 81, no. 1, pp. 2–23, 2009.
[94] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” Adv. Neural Inf. Process. Syst., vol. 2, no. 3, p. 4, 2011.
[95] P. Krähenbühl and V. Koltun, “Parameter learning and convergent inference for dense random fields,” in ICML (3), 2013, pp. 513–521.
[96] S. Zhou, J.-N. Wu, Y. Wu, and X. Zhou, “Exploiting local structures with the kronecker layer in convolutional networks,” arXiv preprint arXiv:1512.09194, 2015.
[97] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[98] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014.
[99] Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin, “RGB-D scene labeling with long short-term memorized fusion model,” CoRR, vol. abs/1604.05000, 2016. [Online]. Available: http://arxiv.org/abs/1604.05000
[100] G. Li and Y. Yu, “Deep contrast learning for salient object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 478–487.
[101] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 328–335.
[102] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[103] A. Zeng, K. Yu, S. Song, D. Suo, E. W. Jr., A. Rodriguez, and J. Xiao, “Multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge,” CoRR, vol. abs/1609.09475, 2016. [Online]. Available: http://arxiv.org/abs/1609.09475
[104] L. Ma, J. Stuckler, C. Kerl, and D. Cremers, “Multi-view deep learning for consistent semantic mapping with rgb-d cameras,” arXiv preprint arXiv:1703.08866, Mar 2017.
[105] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture,” in Proc. ACCV, vol. 2, 2016.
[106] H. Zhang, K. Jiang, Y. Zhang, Q. Li, C. Xia, and X. Chen, “Discriminative feature learning for video semantic segmentation,” in Virtual Reality and Visualization (ICVRV), 2014 International Conference on. IEEE, 2014, pp. 321–326.
[107] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222–1239, 2001.
[108] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
[109] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
[110] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[111] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in Proceedings of the 33rd International Conference on Machine Learning. ACM, 2016.
[112] S. Anwar, K. Hwang, and W. Sung, “Structured pruning of deep convolutional neural networks,” arXiv preprint arXiv:1512.08571, 2015.
[113] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
[114] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient transfer learning,” arXiv preprint arXiv:1611.06440, 2016.

Alberto Garcia-Garcia is a PhD Student (Machine Learning and Computer Vision) at the University of Alicante. He received his Bachelor's Degree (Computer Engineering) and his Master's Degree (Automation and Robotics) from the same institution in June 2015 and June 2016, respectively. His main research interests include deep learning (especially convolutional neural networks), 3D computer vision, and parallel computing on GPUs. He was an intern at the Jülich Supercomputing Center and at NVIDIA, working jointly with the Camera/Solutions engineering team and the Mobile Visual Computing group from NVIDIA Research. He is also a member of European Networks such as HiPEAC and IV&L.

Sergio Orts-Escolano received a BSc, MSc and PhD in Computer Science from the University of Alicante (Spain) in 2008, 2010 and 2014, respectively. He is currently an assistant professor in the Department of Computer Science and Artificial Intelligence at the University of Alicante. Previously he was a researcher at Microsoft Research, where he was one of the leading members of the Holoportation project (virtual 3D teleportation in real time). His research interests include computer vision, 3D sensing, real-time computing, GPU computing, and deep learning. He has authored more than 50 publications in journals and top conferences such as CVPR, SIGGRAPH, 3DV, BMVC, Neurocomputing, Neural Networks, and Applied Soft Computing. He is also a member of European Networks such as HiPEAC and EUCog.

Sergiu-Ovidiu Oprea is an MSc Student (Automation and Robotics) at the University of Alicante. He received his Bachelor's Degree (Computer Engineering) from the same institution in June 2015. His main research interests include deep learning (especially recurrent neural networks), 3D computer vision, parallel computing on GPUs, and computer graphics. He is also a member of European Networks such as HiPEAC.

Victor Villena-Martinez is a PhD Student at the University of Alicante. He received his Master's Degree in Automation and Robotics in June 2016 and his Bachelor's Degree in Computer Engineering in June 2015. He has collaborated in the project "Acquisition and modeling of growing plants" (GV/2013/005). His main research is focused on the calibration of RGB-D devices and the reconstruction of the human body using those devices.

Jose Garcia-Rodriguez received his Ph.D. degree, with specialization in Computer Vision and Neural Networks, from the University of Alicante (Spain). He is currently an Associate Professor at the Department of Computer Technology of the University of Alicante. His research areas of interest include computer vision, computational intelligence, machine learning, pattern recognition, robotics, man-machine interfaces, ambient intelligence, computational chemistry, and parallel and multicore architectures. He has authored more than 100 publications in journals and top conferences, and has reviewed papers for several journals such as Journal of Machine Learning Research, Computational Intelligence, Neurocomputing, Neural Networks, Applied Soft Computing, Image and Vision Computing, Journal of Computer Mathematics, IET Image Processing, and SPIE Optical Engineering, chairing sessions in the last decade for WCCI/IJCNN and participating in program committees of several conferences including IJCNN, ICRA, ICANN, IWANN, IWINAC, KES, ICDP and many others. He is also a member of European Networks of Excellence and COST actions such as EUCog, HiPEAC, AAPELE and IV&L, and director of the GPU Research Center at the University of Alicante and of the PhD program in Computer Science.