A Review On Deep Learning Techniques Applied To Semantic Segmentation
Abstract—Image semantic segmentation is increasingly of interest to computer vision and machine learning researchers. Many emerging applications demand accurate and efficient segmentation mechanisms: autonomous driving, indoor navigation, and even virtual or augmented reality systems, to name a few. This demand coincides with the rise of deep learning approaches in almost every field or application target related to computer vision, including semantic segmentation and scene understanding. This paper provides a review of deep learning methods for semantic segmentation applied to various application areas. Firstly, we describe the terminology of this field as well as mandatory background concepts. Next, the main datasets and challenges are described to help researchers decide which ones best suit their needs and targets. Then, existing methods are reviewed, highlighting their contributions and their significance in the field. After that, quantitative results are given for the described methods and the datasets on which they were evaluated, followed by a discussion of those results. Finally, we point out a set of promising future works and draw our own conclusions about the state of the art of semantic segmentation using deep learning techniques.
1 INTRODUCTION
The key contributions of our work are as follows:

• We provide a broad survey of existing datasets that might be useful for segmentation projects with deep learning techniques.
• An in-depth and organized review of the most significant methods that use deep learning for semantic segmentation, their origins, and their contributions.
• A thorough performance evaluation which gathers quantitative metrics such as accuracy, execution time, and memory footprint.
• A discussion about the aforementioned results, as well as a list of possible future works that might set the course of upcoming advances, and a conclusion summarizing the state of the art of the field.

The remainder of this paper is organized as follows. Firstly, Section 2 introduces the semantic segmentation problem as well as notation and conventions commonly used in the literature. Other background concepts such as common deep neural networks are also reviewed. Next, Section 3 describes existing datasets, challenges, and benchmarks. Section 4 reviews existing methods following a bottom-up complexity order based on their contributions. This section focuses on describing the theory and highlights of those methods rather than performing a quantitative evaluation. Finally, Section 5 presents a brief discussion on the presented methods based on their quantitative results on the aforementioned datasets. In addition, future research directions are also laid out. At last, Section 6 summarizes the paper and draws conclusions about this work and the state of the art of the field.
2 TERMINOLOGY AND BACKGROUND CONCEPTS
In order to properly understand how semantic segmentation is tackled by modern deep learning architectures, it is important to know that it is not an isolated field but rather a natural step in the progression from coarse to fine inference. The origin could be located at classification, which consists of making a prediction for a whole input, i.e., predicting which is the object in an image or even providing a ranked list if there are many of them. Localization or detection is the next step towards fine-grained inference, providing not only the classes but also additional information regarding the spatial location of those classes, e.g., centroids or bounding boxes. Providing that, it is obvious that semantic segmentation is the natural step to achieve fine-grained inference, its goal: make dense predictions inferring labels for every pixel; this way, each pixel is labeled with the class of its enclosing object or region. Further improvements can be made, such as instance segmentation (separate labels for different instances of the same class) and even part-based segmentation (low-level decomposition of already segmented classes into their components). Figure 1 shows the aforementioned evolution. In this review, we will mainly focus on generic scene labeling, i.e., per-pixel class segmentation, but we will also review the most important methods on instance and part-based segmentation.

In the end, the per-pixel labeling problem can be reduced to the following formulation: find a way to assign a state from the label space L = {l1, l2, ..., lk} to each one of the elements of a set of random variables X = {x1, x2, ..., xN}. Each label l represents a different class or object, e.g., aeroplane, car, traffic sign, or background. This label space has k possible states, which are usually extended to k + 1 by treating l0 as background or a void class. Usually, X is a 2D image of W × H = N pixels x. However, that set of random variables can be extended to any dimensionality such as volumetric data or hyperspectral images.

Apart from the problem formulation, it is important to remark some background concepts that might help the reader to understand this review: firstly, common networks, approaches, and design decisions that are often used as the basis for deep semantic segmentation systems; in addition, common techniques for training such as transfer learning; and, at last, data pre-processing and augmentation approaches.
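To make this formulation concrete, the following minimal NumPy sketch (our own illustration, not code from any reviewed work) represents a dense labeling of a W × H image as an integer map over the extended label space {l0, ..., lk} and converts it into the per-pixel one-hot layout that most deep networks predict:

```python
import numpy as np

# Toy setting: k = 3 object classes plus l0 as background/void.
k = 3
H, W = 4, 6

# A dense prediction assigns one state from {0, ..., k} to every pixel.
label_map = np.random.randint(0, k + 1, size=(H, W))

# One-hot (per-pixel score-like) view with shape (k + 1, H, W),
# which is the output layout of most per-pixel classification networks.
one_hot = np.eye(k + 1, dtype=np.float32)[label_map].transpose(2, 0, 1)

assert one_hot.shape == (k + 1, H, W)
assert np.array_equal(one_hot.argmax(axis=0), label_map)
```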
2.1 Common Deep Network Architectures
As we previously stated, certain deep networks have made such significant contributions to the field that they have become widely known standards. It is the case of AlexNet, VGG-16, GoogLeNet, and ResNet. Such was their importance that they are currently being used as building blocks for many segmentation architectures. For that reason, we will devote this section to reviewing them.

2.1.1 AlexNet
AlexNet was the pioneering deep CNN that won the ILSVRC-2012 with a TOP-5 test accuracy of 84.6%, while the closest competitor, which made use of traditional techniques instead of deep architectures, achieved a 73.8% accuracy in the same challenge. The architecture presented by Krizhevsky et al. [14] was relatively simple. It consists of five convolutional layers, max-pooling layers, Rectified Linear Units (ReLUs) as non-linearities, three fully connected layers, and dropout. Figure 2 shows that CNN architecture.

Fig. 2: AlexNet Convolutional Neural Network architecture. Figure reproduced from [14].

2.1.2 VGG
Visual Geometry Group (VGG) is a CNN model introduced by the Visual Geometry Group (VGG) from the University of Oxford. They proposed various models and configurations of deep CNNs [15], one of which was submitted to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)-2013. That model, also known as VGG-16 because it is composed of 16 weight layers, became popular thanks to its achievement of 92.7% TOP-5 test accuracy. Figure 3 shows the configuration of VGG-16. The main difference between VGG-16 and its predecessors is the use of a stack of convolution layers with small receptive fields in the first layers instead of a few layers with big receptive fields.
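The design rationale behind VGG-16 can be checked numerically. The following PyTorch sketch (our own illustration; the channel width is arbitrary) compares a stack of three 3 × 3 convolutions against a single 7 × 7 convolution: both cover a 7 × 7 receptive field, but the stack needs fewer parameters and interleaves more non-linearities:

```python
import torch.nn as nn

def param_count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

channels = 64  # arbitrary width, kept constant for a fair comparison

# Three stacked 3x3 convolutions (ReLU in between) -> 7x7 receptive field.
stacked = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
)

# A single 7x7 convolution covering the same receptive field.
single = nn.Conv2d(channels, channels, kernel_size=7, padding=3)

print(param_count(stacked))  # 3 * (64*64*3*3 + 64) ~= 110k parameters
print(param_count(single))   # 64*64*7*7 + 64 ~= 200k parameters
```

Fewer parameters and more interleaved non-linearities for the same receptive field is the core argument for this design choice.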
Fig. 5: Residual block from the ResNet architecture. Figure reproduced from [17].
2.1.4 ResNet
Microsoft's ResNet [17] is specially remarkable for winning ILSVRC-2015 with 96.4% accuracy. Apart from that fact, the network is well known due to its depth (152 layers) and the introduction of residual blocks (see Figure 5). The residual blocks address the problem of training a really deep architecture by introducing identity skip connections so that layers can copy their inputs to the next layer.

Fig. 6: One layer of the ReNet architecture modeling vertical and horizontal spatial dependencies. Extracted from [19].
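To make the residual connection of Figure 5 concrete, here is a minimal PyTorch sketch of a basic residual block computing F(x) + x (our own simplified illustration; the actual 152-layer model uses bottleneck blocks and further details not shown here):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                      # identity skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))   # F(x)
        return self.relu(out + residual)  # F(x) + x

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))     # same shape in and out
print(y.shape)                            # torch.Size([1, 64, 32, 32])
```

Because each block only needs to learn the residual F(x) while the identity path lets inputs (and gradients) flow through unchanged, very deep stacks of such blocks remain trainable.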
2.2 Transfer Learning
Training a deep neural network from scratch is often not feasible for various reasons: a dataset of sufficient size is required (and not usually available), and reaching convergence can take too long for the experiments to be worthwhile. Even if a dataset large enough is available and convergence does not take that long, it is often helpful to start with pre-trained weights instead of randomly initialized ones [20] [21]. Fine-tuning the weights of a pre-trained network by continuing the training process is one of the major transfer learning scenarios.

Yosinski et al. [22] proved that transferring features even from distant tasks can be better than using random initialization, taking into account that the transferability of features decreases as the difference between the pre-trained task and the target one increases.

However, applying this transfer learning technique is not completely straightforward. On the one hand, there are architectural constraints that must be met to use a pre-trained network. Nevertheless, since it is not usual to come up with a whole new architecture, it is common to reuse already existing network architectures (or components), thus enabling transfer learning. On the other hand, the training process differs slightly when fine-tuning instead of training from scratch. It is important to choose properly which layers to fine-tune – usually the higher-level part of the network, since the lower one tends to contain more generic features – and also to pick an appropriate policy for the learning rate, which is usually smaller because the pre-trained weights are expected to be relatively good, so there is no need to change them drastically.

Due to the inherent difficulty of gathering and creating per-pixel labelled segmentation datasets, their scale is not as large as the size of classification datasets such as ImageNet [23] [24]. This problem gets even worse when dealing with RGB-D or 3D datasets, which are even smaller. For that reason, transfer learning, and in particular fine-tuning from pre-trained classification networks, is a common trend for segmentation networks and has been successfully applied in the methods that we will review in the following sections.
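As a hedged illustration of this fine-tuning recipe (assuming torchvision is available and using VGG-16 merely as an example backbone; the layer split and learning rates are arbitrary choices, not values prescribed by any reviewed method), the sketch below freezes the lower, more generic convolutional layers and assigns a smaller learning rate to the reused layers than to a newly added, hypothetical segmentation head:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Start from ImageNet pre-trained weights instead of random initialization.
backbone = models.vgg16(pretrained=True).features

# Freeze the lowest layers: they capture generic features (edges, blobs).
for layer in list(backbone.children())[:10]:
    for p in layer.parameters():
        p.requires_grad = False

# Hypothetical task-specific head producing per-pixel class scores.
num_classes = 21
head = nn.Conv2d(512, num_classes, kernel_size=1)

# Smaller learning rate for pre-trained layers, larger one for the new head.
optimizer = optim.SGD(
    [
        {"params": [p for p in backbone.parameters() if p.requires_grad],
         "lr": 1e-4},
        {"params": head.parameters(), "lr": 1e-2},
    ],
    momentum=0.9,
)
```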
2.3 Data Preprocessing and Augmentation
Data augmentation is a common technique that has been proven to benefit the training of machine learning models in general and deep architectures in particular, either speeding up convergence or acting as a regularizer, thus avoiding overfitting and increasing generalization capabilities [25].

It typically consists of applying a set of transformations in either the data or feature spaces, or even both. The most common augmentations are performed in the data space. That kind of augmentation generates new samples by applying transformations to the already existing data. There are many transformations that can be applied: translation, rotation, warping, scaling, color space shifts, crops, etc. The goal of those transformations is to generate more samples to create a larger dataset, preventing overfitting and presumably regularizing the model, to balance the classes within that database, and even to synthetically produce new samples that are more representative for the use case or task at hand.

Augmentations are specially helpful for small datasets, and have proven their efficacy with a long track record of success stories. For instance, in [26], a dataset of 1500 portrait images is augmented by synthesizing four new scales (0.6, 0.8, 1.2, 1.5), four new rotations (−45, −22, 22, 45), and four gamma variations (0.5, 0.8, 1.2, 1.5) to generate a new dataset of 19000 training images. That process allowed them to raise the accuracy of their system for portrait segmentation from 73.09 to 94.20 Intersection over Union (IoU) when including that augmented dataset for fine-tuning.
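A minimal sketch of such a data-space augmentation is shown below (our own PIL-based example; the scale, rotation, and gamma values mirror the ones reported in [26], but the code itself is not theirs). The key detail for segmentation is that geometric transformations must be applied identically to the image and to its label mask, while photometric changes such as gamma only touch the image:

```python
from PIL import Image

def augment(image: Image.Image, mask: Image.Image):
    """Yield augmented (image, mask) pairs from one annotated sample."""
    scales = (0.6, 0.8, 1.2, 1.5)
    rotations = (-45, -22, 22, 45)
    gammas = (0.5, 0.8, 1.2, 1.5)

    for s in scales:
        size = (int(image.width * s), int(image.height * s))
        # Nearest-neighbour interpolation keeps the mask labels discrete.
        yield image.resize(size, Image.BILINEAR), mask.resize(size, Image.NEAREST)

    for angle in rotations:
        # Geometric changes are applied identically to image and mask.
        yield image.rotate(angle, Image.BILINEAR), mask.rotate(angle, Image.NEAREST)

    for g in gammas:
        # Photometric change: only the image is modified, labels are untouched.
        lut = [round(255 * (v / 255) ** g) for v in range(256)]
        yield image.point(lut * len(image.getbands())), mask
```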
3 DATASETS AND CHALLENGES
Two kinds of readers are expected for this type of review: either they are initiating themselves in the problem, or they are experienced enough and are just looking for the most recent advances made by other researchers in the last few years. Although the second kind is usually aware of two of the most important aspects to know before starting to research in this problem, it is critical for newcomers to get a grasp of what the top-quality datasets and challenges are. Therefore, the purpose of this section is to kickstart novel scientists, providing them with a brief summary of datasets that might suit their needs as well as data augmentation and preprocessing tips. Nevertheless, it can also be useful for hardened researchers who want to review the fundamentals or maybe discover new information.

Arguably, data is one of the most – if not the most – important parts of any machine learning system. When dealing with deep networks, this importance is increased even more. For that reason, gathering adequate data into a dataset is critical for any segmentation system based on deep learning techniques. Gathering and constructing an appropriate dataset, which must have a scale large enough and represent the use case of the system accurately, needs time, domain expertise to select relevant information, and infrastructure to capture that data and transform it to a representation that the system can properly understand and learn. This task, despite the simplicity of its formulation in comparison with sophisticated neural network architecture definitions, is one of the hardest problems to solve in this context. Because of that, the most sensible approach usually means using an existing standard dataset which is representative enough for the domain of the problem. Following this approach has another advantage for the community: standardized datasets enable fair comparisons between systems; in fact, many datasets are part of a challenge which reserves some data – not provided to developers to test their algorithms – for a competition in which many methods are tested, generating a fair ranking of methods according to their actual performance without any kind of data cherry-picking.

In the following lines we describe the most popular large-scale datasets currently in use for semantic segmentation. All datasets listed here provide appropriate pixel-wise or point-wise labels. The list is structured into three parts according to the nature of the data: 2D or plain RGB datasets, 2.5D or RGB-Depth (RGB-D) ones, and pure volumetric or 3D databases. Table 1 shows a summarized view, gathering all the described datasets and providing useful information such as their purpose, number of classes, data format, and training/validation/testing splits.
3.1 2D Datasets
Throughout the years, semantic segmentation has been mostly focused on two-dimensional images. For that reason, 2D datasets are the most abundant ones. In this section we describe the most popular 2D large-scale datasets for semantic segmentation, considering 2D any dataset that contains any kind of two-dimensional representation such as gray-scale or Red Green Blue (RGB) images.

• PASCAL Visual Object Classes (VOC) [27]1: this challenge consists of a ground-truth annotated dataset of images and five different competitions: classification, detection, segmentation, action classification, and person layout. The segmentation one is specially interesting since its goal is to predict the object class of each pixel for each test image. There are 21 classes categorized into vehicles, household, animals, and other: aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, TV/monitor, bird, cat, cow, dog, horse, sheep, and person. Background is also considered if the pixel does not belong to any of those classes. The dataset is divided into two subsets: training and validation with 1464 and 1449 images respectively. The test set is private for the challenge. This dataset is arguably the most popular for semantic segmentation, so almost every remarkable method in the literature is submitted to its performance evaluation server to validate against the private test set. Methods can be trained either using only the dataset or using additional information. Furthermore, its leaderboard is public and can be consulted online2.
• PASCAL Context [28]3: this dataset is an extension of the PASCAL VOC 2010 detection challenge which contains pixel-wise labels for all training images (10103). It contains a total of 540 classes – including the original 20 classes plus background from PASCAL VOC segmentation – divided into three categories (objects, stuff, and hybrids). Despite the large number of categories, only the 59 most frequent are remarkable. Since its classes follow a power law distribution, many of them are too sparse throughout the dataset. In this regard, this subset of 59 classes is usually selected to conduct studies on this dataset, relabeling the rest of them as background.
• PASCAL Part [29]4: this database is an extension of the PASCAL VOC 2010 detection challenge which goes beyond that task to provide per-pixel segmentation masks for each part of the objects (or at least silhouette annotation if the object does not have a consistent set of parts). The original classes of PASCAL VOC are kept, but their parts are introduced, e.g., bicycle is now decomposed into back wheel, chain wheel, front wheel, handlebar, headlight, and saddle. It contains labels for all training and validation images from PASCAL VOC as well as for the 9637 testing images.
• Semantic Boundaries Dataset (SBD) [30]5: this dataset is an extended version of the aforementioned PASCAL VOC which provides semantic segmentation ground truth for those images that were not labelled in VOC. It contains annotations for 11355 images from PASCAL VOC 2011. Those annotations provide both category-level and instance-level information, apart from boundaries for each object. Since the images are obtained from the whole PASCAL VOC challenge (not only from the segmentation one), the training and validation splits diverge. In fact, SBD provides its own training (8498 images) and validation (2857 images) splits. Due to its increased amount of training data, this dataset is often used as a substitute for PASCAL VOC for deep learning.
• Microsoft Common Objects in Context (COCO) [31]6: is another image recognition, segmentation, and captioning large-scale dataset. It features various challenges, the detection one being the most relevant for this field since one of its parts is focused on segmentation. That challenge, which features more than 80 classes, provides more than 82783 images for training, 40504 for validation, and a test set of more than 80000 images. In particular, the test set is divided into four different subsets or splits: test-dev (20000 images) for additional validation and debugging, test-standard (20000 images) which is the default test data for the competition and the one used to compare state-of-the-art methods, test-challenge (20000 images) which is the split used when submitting to the evaluation server for the challenge, and test-reserve (20000 images) which is a split used to protect against possible overfitting in the challenge (if a method is suspected to have made too many submissions or trained on the test data, its results will be compared with the reserve split). Its popularity and importance have ramped up since its appearance thanks to its large scale. In fact, the results of the challenge are presented yearly in a joint workshop at the European Conference on Computer Vision (ECCV)7 together with ImageNet's.
• SYNTHetic Collection of Imagery and Annotations (SYNTHIA) [32]8: is a large-scale collection of photo-realistic renderings of a virtual city, semantically segmented, whose purpose is scene understanding in the context of driving or urban scenarios. The dataset provides fine-grained pixel-level annotations for 11 classes (void, sky, building, road, sidewalk, fence, vegetation, pole, car, sign, pedestrian, and cyclist). It features 13407 training images from rendered video streams. It is also characterized by its diversity in terms of scenes (towns, cities, highways), dynamic objects, seasons, and weather.

1. http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
2. http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=6
3. http://www.cs.stanford.edu/~roozbeh/pascal-context/
4. http://www.stat.ucla.edu/~xianjie.chen/pascal_part_dataset/pascal_part.html
5. http://home.bharathh.info/home/sbd
6. http://mscoco.org/
7. http://image-net.org/challenges/ilsvrc+coco2016
8. http://synthia-dataset.net/
• Cityscapes [33]9: is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of around 5000 finely annotated images and 20000 coarsely annotated ones. Data was captured in 50 cities during several months, daytimes, and good weather conditions. It was originally recorded as video, so the frames were manually selected to have the following features: large number of dynamic objects, varying scene layout, and varying background.
• CamVid [55] [34]10: is a road/driving scene understanding database which was originally captured as five video sequences with a 960 × 720 resolution camera mounted on the dashboard of a car. Those sequences were sampled (four of them at 1 fps and one at 15 fps) adding up to 701 frames. Those stills were manually annotated with 32 classes: void, building, wall, tree, vegetation, fence, sidewalk, parking block, column/pole, traffic cone, bridge, sign, miscellaneous text, traffic light, sky, tunnel, archway, road, road shoulder, lane markings (driving), lane markings (non-driving), animal, pedestrian, child, cart luggage, bicyclist, motorcycle, car, SUV/pickup/truck, truck/bus, train, and other moving object. It is important to remark the partition introduced by Sturgess et al. [35], which divided the dataset into 367/100/233 training, validation, and testing images respectively. That partition makes use of a subset of class labels: building, tree, sky, car, sign, road, pedestrian, fence, pole, sidewalk, and bicyclist.
• KITTI [56]: is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their necessities. Álvarez et al. [36] [37] generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. [39] annotated 252 (140 for training and 112 for testing) acquisitions – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. [38] labeled 170 training images and 46 testing images (from the visual odometry challenge) with 11 classes: building, tree, sky, car, sign, road, pedestrian, fence, pole, sidewalk, and bicyclist.
• Youtube-Objects [57]: is a database of videos collected from YouTube which contain objects from ten PASCAL VOC classes: aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, and train. That database does not contain pixel-wise annotations, but Jain et al. [42] manually annotated a subset of 126 sequences. They took every 10th frame from those sequences and generated semantic labels. That totals 10167 annotated frames at 480 × 360 pixels resolution.
• Adobe's Portrait Segmentation [26]11: this is a dataset of 800 × 600 pixels portrait images collected from Flickr, mainly captured with mobile front-facing cameras. The database consists of 1500 training images and 300 reserved for testing, both sets fully binary annotated: person or background. The images were labeled in a semi-automatic way: first a face detector was run on each image to crop it to 600 × 800 pixels, and then persons were manually annotated using Photoshop quick selection. This dataset is remarkable due to its specific purpose, which makes it suitable for person-in-foreground segmentation applications.

9. https://www.cityscapes-dataset.com/
10. http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/
11. http://xiaoyongshen.me/webpage_portrait/index.html
• Materials in Context (MINC) [43]: this work is a dataset for patch material classification and full scene material segmentation. The dataset provides segment annotations for 23 categories: wood, painted, fabric, glass, metal, tile, sky, foliage, polished stone, carpet, leather, mirror, brick, water, other, plastic, skin, stone, ceramic, hair, food, paper, and wallpaper. It contains 7061 labeled material segmentations for training, 5000 for testing, and 2500 for validation. The main source for these images is the OpenSurfaces dataset [58], which was augmented using other sources of imagery such as Flickr or Houzz. For that reason, image resolution for this dataset varies. On average, image resolution is approximately 800 × 500 or 500 × 800.
• Densely-Annotated VIdeo Segmentation (DAVIS) [44] [45]12: this challenge is purposed for video object segmentation. Its dataset is composed of 50 high-definition sequences which add up to 4219 and 2023 frames for training and validation respectively. Frame resolution varies across sequences, but all of them were downsampled to 480p for the challenge. Pixel-wise annotations are provided for each frame for four different categories: human, animal, vehicle, and object. Another feature of this dataset is the presence of at least one target foreground object in each sequence. In addition, it is designed not to have many different objects with significant motion. For those scenes which do have more than one target foreground object of the same class, separate ground truth is provided for each one of them to allow instance segmentation.
• Stanford background [40]13: dataset with outdoor scene images imported from existing public datasets: LabelMe, MSRC, PASCAL VOC, and Geometric Context. The dataset contains 715 images (size of 320 × 240 pixels) with at least one foreground object and with the horizon position within the image. The dataset is pixel-wise annotated (horizon location, pixel semantic class, pixel geometric class, and image region) for evaluating methods for semantic scene understanding.
• SiftFlow [41]: contains 2688 fully annotated images which are a subset of the LabelMe database [59]. Most of the images are based on 8 different outdoor scenes including streets, mountains, fields, beaches, and buildings. Images are 256 × 256, belonging to one of 33 semantic classes. Unlabeled pixels, or pixels labeled as a different semantic class, are treated as unlabeled.

3.2 2.5D Datasets
With the advent of low-cost range scanners, datasets including not only RGB information but also depth maps are gaining popularity and usage. In this section, we review the most well-known 2.5D databases which include that kind of depth data.

• NYUDv2 [46]14: this database consists of 1449 indoor RGB-D images captured with a Microsoft Kinect device. It provides per-pixel dense labeling (category and instance levels) which was coalesced into 40 indoor object classes by Gupta et al. [60] for both training (795 images) and testing (654) splits. This dataset is specially remarkable due to its indoor nature, which makes it really useful for certain robotic tasks at home. However, its relatively small scale with regard to other existing datasets hinders its application to deep learning architectures.
• SUN3D [47]15: similar to NYUDv2, this dataset contains a large-scale RGB-D video database, with 8 annotated sequences. Each frame has a semantic segmentation of the objects in the scene and information about the camera pose. It is still in progress and will be composed of 415 sequences captured in 254 different spaces, in 41 different buildings. Moreover, some places have been captured multiple times at different moments of the day.
• SUNRGBD [48]16: captured with four RGB-D sensors, this dataset contains 10000 RGB-D images, at a similar scale as PASCAL VOC. It contains images from NYU depth v2 [46], Berkeley B3DO [61], and SUN3D [47]. The whole dataset is densely annotated, including polygons and bounding boxes with orientation as well as a 3D room layout and category, making it suitable for scene understanding tasks.
• The Object Segmentation Database (OSD) [62]17: this database has been designed for segmenting unknown objects from generic scenes even under partial occlusions. This dataset contains 111 entries, and provides depth and color images together with per-pixel annotations for each one to evaluate object segmentation approaches. However, the dataset does not differentiate the category of different objects, so its classes are reduced to a binary set of objects and non-objects.
• RGB-D Object Dataset [49]18: this dataset is composed of video sequences of 300 common household objects organized in 51 categories arranged using WordNet hypernym-hyponym relationships. The dataset has been recorded using a Kinect-style 3D camera that records synchronized and aligned 640 × 480 RGB and depth images at 30 Hz. For each frame, the dataset provides the RGB and depth images, a cropped one containing the object, the location, and a mask with per-pixel annotation. Moreover, each object has been placed on a turntable, providing isolated video sequences around 360 degrees. For the validation process, 22 annotated video sequences of natural indoor scenes containing the objects are provided.

12. http://davischallenge.org/index.html
13. http://dags.stanford.edu/data/iccv09Data.tar.gz
14. http://cs.nyu.edu/~silberman/projects/indoor_scene_seg_sup.html
15. http://sun3d.cs.princeton.edu/
16. http://rgbd.cs.princeton.edu/
17. http://www.acin.tuwien.ac.at/?id=289
18. http://rgbd-dataset.cs.washington.edu/
3.3 3D Datasets
Pure three-dimensional databases are scarce. This kind of dataset usually provides Computer Aided Design (CAD) meshes or other volumetric representations, such as point clouds. Generating large-scale 3D datasets for segmentation is costly and difficult, and not many deep learning methods are able to process that kind of data as it is. For those reasons, 3D datasets are not quite popular at the moment. In spite of that fact, we describe the most promising ones for the task at hand.

• ShapeNet Part [50]19: is a subset of the ShapeNet [63] repository which focuses on fine-grained 3D object segmentation. It contains 31693 meshes sampled from 16 categories of the original dataset (airplane, earphone, cap, motorbike, bag, mug, laptop, table, guitar, knife, rocket, lamp, chair, pistol, car, and skateboard). Each shape class is labeled with two to five parts (totalling 50 object parts across the whole dataset), e.g., each shape from the airplane class is labeled with wings, body, tail, and engine. Ground-truth labels are provided on points sampled from the meshes.
• Stanford 2D-3D-S [51]20: is a multi-modal and large-scale indoor spaces dataset extending the Stanford 3D Semantic Parsing work [64]. It provides a variety of registered modalities – 2D (RGB), 2.5D (depth maps and surface normals), and 3D (meshes and point clouds) – with semantic annotations. The database is composed of 70496 full high-definition RGB images (1080 × 1080 resolution) along with their corresponding depth maps, surface normals, meshes, and point clouds with semantic annotations (per-pixel and per-point). That data was captured in six indoor areas from three different educational and office buildings. That makes a total of 271 rooms and approximately 700 million points annotated with labels from 13 categories: ceiling, floor, wall, column, beam, window, door, table, chair, bookcase, sofa, board, and clutter.
• A Benchmark for 3D Mesh Segmentation [52]21: this benchmark is composed of 380 meshes classified into 19 categories (human, cup, glasses, airplane, ant, chair, octopus, table, teddy, hand, plier, fish, bird, armadillo, bust, mech, bearing, vase, fourleg). Each mesh has been manually segmented into functional parts; the main goal is to provide a sample distribution over "how humans decompose each mesh into functional parts".
• Sydney Urban Objects Dataset [53]22: this dataset contains a variety of common urban road objects scanned with a Velodyne HDK-64E LIDAR. There are 631 individual scans (point clouds) of objects across classes of vehicles, pedestrians, signs, and trees. The interesting point of this dataset is that, for each object, apart from the individual scan, a full 360-degrees annotated scan is provided.
• Large-Scale Point Cloud Classification Benchmark [54]23: this benchmark provides manually annotated 3D point clouds of diverse natural and urban scenes: churches, streets, railroad tracks, squares, villages, soccer fields, and castles, among others. This dataset features statically captured point clouds with very fine details and density. It contains 15 large-scale point clouds for training and another 15 for testing. Its scale can be grasped by the fact that it totals more than one billion labelled points.

19. http://cs.stanford.edu/~ericyi/project_page/part_annotation/
20. http://buildingparser.stanford.edu
21. http://segeval.cs.princeton.edu/
22. http://www.acfr.usyd.edu.au/papers/SydneyUrbanObjectsDataset.shtml
23. http://www.semantic3d.net/

4 METHODS
The relentless success of deep learning techniques in various high-level computer vision tasks – in particular, supervised approaches such as Convolutional Neural Networks (CNNs) for image classification or object detection [14] [15] [16] – motivated researchers to explore the capabilities of such networks for pixel-level labelling problems like semantic segmentation. The key advantage of these deep learning techniques, which gives them an edge over traditional methods, is the ability to learn appropriate feature representations for the problem at hand, e.g., pixel labelling on a particular dataset, in an end-to-end fashion instead of using hand-crafted features that require domain expertise, effort, and often too much fine-tuning to make them work on a particular scenario.

Fig. 7: Fully Convolutional Network figure by Long et al. [65]. Transforming a classification-purposed CNN to produce spatial heatmaps by replacing fully connected layers with convolutional ones. Including a deconvolution layer for upsampling allows dense inference and learning for per-pixel labeling.
Currently, the most successful state-of-the-art deep learning techniques for semantic segmentation stem from a common forerunner: the Fully Convolutional Network (FCN) by Long et al. [65]. The insight of that approach was to take advantage of existing CNNs as powerful visual models that are able to learn hierarchies of features. They transformed those existing and well-known classification models – AlexNet [14], VGG (16-layer net) [15], GoogLeNet [16], and ResNet [17] – into fully convolutional ones by replacing the fully connected layers with convolutional ones to output spatial maps instead of classification scores. Those maps are upsampled using fractionally strided convolutions (also named deconvolutions [90] [91]) to produce dense per-pixel labeled outputs. This work is considered a milestone since it showed how CNNs can be trained end-to-end for this problem, efficiently learning how to make dense predictions for semantic segmentation with inputs of arbitrary sizes. This approach achieved a significant improvement in segmentation accuracy over traditional methods on standard datasets like PASCAL VOC, while preserving efficiency at inference. For all those reasons, and other significant contributions, the FCN is the cornerstone of deep learning applied to semantic segmentation. The convolutionalization process is shown in Figure 7.
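The following PyTorch sketch illustrates the convolutionalization idea of Figure 7 in its simplest form (our own toy example, not the actual FCN-8s/16s/32s architecture, which also uses skip connections and a pre-trained backbone): the fully connected classifier is replaced by 1 × 1 convolutions so the network outputs a coarse spatial map of class scores, and a transposed (fractionally strided) convolution upsamples that map to a dense per-pixel prediction:

```python
import torch
import torch.nn as nn

num_classes = 21

# A tiny stand-in for a classification backbone (e.g., the conv part of VGG).
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

# "Convolutionalized" classifier: 1x1 convolutions instead of Linear layers,
# so arbitrary input sizes yield spatial score maps instead of a single vector.
classifier = nn.Sequential(
    nn.Conv2d(256, 1024, 1), nn.ReLU(), nn.Dropout2d(),
    nn.Conv2d(1024, num_classes, 1),
)

# Fractionally strided (transposed) convolution upsamples the coarse map
# back to the input resolution for dense per-pixel labeling.
upsample = nn.ConvTranspose2d(num_classes, num_classes,
                              kernel_size=16, stride=8, padding=4)

x = torch.randn(1, 3, 224, 320)           # arbitrary input size
scores = upsample(classifier(backbone(x)))
print(scores.shape)                        # torch.Size([1, 21, 224, 320])
```

Training then reduces to a per-pixel classification loss over the upsampled score map, which is what enables end-to-end learning for inputs of arbitrary size.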
Despite the power and flexibility of the FCN model, it still lacks various features which hinder its application to certain problems and situations: its inherent spatial invariance does not take into account useful global context information, no instance-awareness is present by default, efficiency is still far from real-time execution at high resolutions, and it is not completely suited for unstructured data such as 3D point clouds or models. Those problems will be reviewed in this section, as well as the state-of-the-art solutions that have been proposed in the literature to overcome those hurdles. Table 2 provides a summary of that review. It shows all reviewed methods (sorted by appearance order in the section), their base architecture, their main contribution, and a classification depending on the target of the work: accuracy, efficiency, training simplicity, sequence processing, multi-modal inputs, and 3D data. Each target is graded from one to three stars (★) depending on how much focus the work puts on it, and marked (✗) if that issue is not addressed. In addition, Figure 8 shows a graph of the reviewed methods for the sake of visualization.
Fig. 17: Representation of the ReSeg network. VGG-16 convolutional layers are represented by the first blue and yellow layers. The rest of the architecture is based on the ReNet approach with fine-tuning purposes. Figure extracted from [78].

VGG-16 network [15], feeding the resulting feature maps into one or more ReNet layers for fine-tuning. Finally, feature maps are resized using upsampling layers based on transposed convolutions. In this approach, Gated Recurrent Units (GRUs) have been used as they strike a good performance balance regarding memory usage and computational power. Vanilla RNNs have problems modeling long-term dependencies mainly due to the vanishing gradients problem. Several derived models such as Long Short-Term Memory (LSTM) networks [97] and GRUs [98] are the state of the art in this field to avoid such a problem.

Inspired by the same ReNet architecture, a novel Long Short-Term Memorized Context Fusion (LSTM-CF) model for scene labeling was proposed by [99]. In this approach, they use two different data sources: RGB and depth. The RGB pipeline relies on a variant of the DeepLab architecture [29], concatenating features at three different scales to enrich the feature representation (inspired by [100]). The global context is modeled vertically over both the depth and photometric data sources, concluding with a horizontal fusion in both directions over these vertical contexts.

As we noticed, modeling image global contexts is related to 2D recurrent approaches by unfolding the network vertically and horizontally over the input images. Based on the same idea, Byeon et al. [80] proposed a simple 2D LSTM-based architecture in which the input image is divided into non-overlapping windows which are fed into four separate LSTM memory blocks. This work emphasizes its low computational complexity on a single-core CPU and the model simplicity.

Another approach for capturing global information relies on using bigger input windows in order to model larger contexts. Nevertheless, this reduces image resolution and also implies several problems regarding window overlapping. However, Pinheiro et al. [81] introduced Recurrent Convolutional Neural Networks (rCNNs) which recurrently train with different input window sizes, taking into account previous predictions. In this way, predicted labels are automatically smoothed, increasing the performance.

Undirected cyclic graphs (UCGs) were also adopted to model image contexts for semantic segmentation [82]. Nevertheless, RNNs are not directly applicable to UCGs, and the solution is to decompose them into several directed acyclic graphs (DAGs). In this approach, images are processed by three different layers: the image feature map produced by a CNN, DAG-RNNs that model image contextual dependencies, and a deconvolution layer for upsampling the feature maps. This work demonstrates how RNNs can be used together with graphs to successfully model long-range contextual dependencies, overcoming state-of-the-art approaches in terms of performance.

4.3 Instance Segmentation
Instance segmentation is considered the next step after semantic segmentation and, at the same time, the most challenging problem in comparison with the rest of the low-level pixel segmentation techniques. Its main purpose is to represent objects of the same class split into different instances. The automation of this process is not straightforward, since the number of instances is initially unknown and the evaluation of performed predictions is not pixel-wise as in semantic segmentation. Consequently, this problem remains partially unsolved, but the interest in this field is motivated by its potential applicability. Instance labeling provides extra information for reasoning about occlusion situations, for counting the number of elements belonging to the same class, and for detecting a particular object for grasping in robotics tasks, among many other applications.

For this purpose, Hariharan et al. [10] proposed a Simultaneous Detection and Segmentation (SDS) method in order to improve performance over already existing works. Their pipeline firstly uses a bottom-up hierarchical image segmentation and object candidate generation process called Multi-scale COmbinatorial Grouping (MCG) [101] to obtain region proposals. For each region, features are extracted by using an adapted version of the Region-CNN (R-CNN) [102], which is fine-tuned using bounding boxes provided by the MCG method instead of selective search and also alongside region foreground features. Then, each region proposal is classified by using a linear Support Vector Machine (SVM) on top of the CNN features. Finally, and for refinement purposes, Non-Maximum Suppression (NMS) is applied to the previous proposals.

Later, Pinheiro et al. [83] presented the DeepMask model, an object proposal approach based on a single ConvNet. This model predicts a segmentation mask for an input patch and the likelihood of this patch containing an object. The two tasks are learned jointly and computed by a single network,
sharing most of the layers except the last ones, which are task-specific.

Based on the DeepMask architecture as a starting point due to its effectiveness, the same authors presented a novel architecture for object instance segmentation implementing a top-down refinement process [84] and achieving better performance in terms of accuracy and speed. The goal of this process is to efficiently merge low-level features with high-level semantic information from upper network layers. The process consists of different refinement modules stacked together (one module per pooling layer), with the purpose of inverting the pooling effect by generating a new upsampled object encoding. Figure 18 shows the refinement module in SharpMask.

Fig. 18: SharpMask's top-down architecture with progressive refinement using their signature modules. That refinement merges spatially rich information from lower-level features with high-level semantic cues encoded in upper layers. Figure extracted from [83].

Another approach, based on Fast R-CNN as a starting point and using DeepMask object proposals instead of Selective Search, was presented by Zagoruyko et al. [85]. This combined system, called the MultiPath classifier, improved performance over the COCO dataset and introduced three modifications to Fast R-CNN: improving localization with an integral loss, providing context by using foveal regions, and finally adding skip connections to give multi-scale features to the network. The system achieved a 66% improvement over the baseline Fast R-CNN.

As we have seen, most of the methods mentioned above rely on existing object detectors, which limits model performance. Even so, the instance segmentation process remains an unresolved research problem and the mentioned works are only a small part of this challenging research topic.

4.4 RGB-D Data
As we noticed, a significant amount of work has been done in semantic segmentation by using photometric data. Nevertheless, the use of structural information was spurred on by the advent of low-cost RGB-D sensors which provide useful geometric cues extracted from depth information. Several works focused on RGB-D scene segmentation have reported an improvement in the fine-grained labeling precision by using depth information and not only photometric data. Using depth information for segmentation is considered more challenging because of the unpredictable variation of scene illumination alongside the incomplete representation of objects due to complex occlusions. However, various works have successfully made use of depth information to increase accuracy.

The use of depth images with approaches focused on photometric data is not straightforward. Depth data needs to be encoded with three channels at each pixel as if it were an RGB image. Different techniques such as Horizontal Height Angle (HHA) [11] are used for encoding the depth into three channels as follows: horizontal disparity, height above ground, and the angle between the local surface normal and the inferred gravity direction. In this way, we can input depth images to models designed for RGB data and improve the performance by learning new features from structural information. Several works such as [99] are based on this encoding technique.

In the literature related to methods that use RGB-D data, we can also find some works that leverage a multi-view approach to improve existing single-view works.

Zeng et al. [103] present an object segmentation approach that leverages multi-view RGB-D data and deep learning techniques. RGB-D images captured from each viewpoint are fed to an FCN network which returns a 40-class probability for each pixel in each image. Segmentation labels are thresholded by using three times the standard deviation above the mean probability across all views. Moreover, in this work, multiple networks for feature extraction were trained (AlexNet [14] and VGG-16 [15]), evaluating the benefits of using depth information. They found that adding depth did not yield any major improvements in segmentation performance, which could be caused by noise in the depth information. The described approach was presented during the 2016 Amazon Picking Challenge. This work is a minor contribution towards multi-view deep learning systems since RGB images are independently fed to an FCN network.

Ma et al. [104] propose a novel approach for object-class segmentation using a multi-view deep learning technique. Multiple views are obtained from a moving RGB-D camera. During the training stage, the camera trajectory is obtained using an RGB-D SLAM technique, then RGB-D images are warped into ground-truth annotated frames in order to enforce multi-view consistency for training. The proposed approach is based on FuseNet [105], which combines RGB and depth images for semantic segmentation, and improves the original work by adding multi-scale loss minimization.

4.5 3D Data
3D geometric data such as point clouds or polygonal meshes are useful representations thanks to their additional dimension, which provides methods with rich spatial information that is intuitively useful for segmentation. However, the vast majority of successful deep learning segmentation architectures – CNNs in particular – are not originally engineered to deal with unstructured or irregular inputs such as the aforementioned ones. In order to enable weight sharing
and other optimizations in convolutional architectures, most manner. This approach works, often producing remarkable
researchers have resorted to 3D voxel grids or projections results. Nevertheless, applying those methods frame by
to transform unstructured and unordered point clouds or frame is usually non-viable due to computational cost. In
meshes into regular representations before feeding them to addition, those methods completely ignore temporal con-
the networks. For instance, Huang et al. [86] (see Figure tinuity and coherence cues which might help increase the
19 take a point cloud and parse it through a dense voxel accuracy of the system while reducing its execution time.
grid, generating a set of occupancy voxels which are used Arguably, the most remarkable work in this regard is
as input to a 3D CNN to produce one label per voxel. the clockwork FCN by Shelhamer et al. [88]. This network
They then map back the labels to the point cloud. Although is an adaptation of a FCN to make use of temporal cues in
this approach has been applied successfully, it has some video to decrease inference time while preserving accuracy.
disadvantages like quantization, loss of spatial information, The clockwork approach relies on the following insight:
and unnecessarily large representations. For that reason, feature velocity – the temporal rate of change of features
various researchers have focused their efforts on creating in the network – across frames varies from layer to layer so
deep architectures that are able to directly consume unstruc- that features from shallow layers change faster than deep
tured 3D point sets or meshes. ones. Under that assumption, layers can be grouped into
stages, processing them at different update rates depending
on their depth. By doing this, deep features can be persisted
over frames thanks to their semantic stability, thus saving
inference time. Figure 21 shows the network architecture of
the clockwork FCN.
It is important to remark that the authors propose two
kinds of update rates: fixed and adaptive. The fixed sched-
ule just sets a constant time frame for recomputing the fea-
tures for each stage of the network. The adaptive schedule
fires each clock on a data-driven manner, e.g., depending on
the amount of motion or semantic change. Figure 22 shows
an example of this adaptive scheduling.
Zhang et al. [106] took a different approach and made
use of a 3DCNN, which was originally created for learning
Fig. 19: 3DCNN based system presented by Huang et al. features from volumes, to learn hierarchical spatio-temporal
[86] for semantic labeling of point clouds. Clouds undergo features from multi-channel inputs such as video clips. In
a dense voxelization process and the CNN produces per- parallel, they over-segment the input clip into supervoxels.
voxel labels that are then mapped back to the point cloud. Then they use that supervoxel graph and embed the learned
Figure extracted from [86]. features in it. The final segmentation is obtained by applying
graph-cut [107] on the supervoxel graph.
PointNet [87] is a pioneering work which presents a Another remarkable method, which builds on the idea
deep neural network that takes raw point clouds as input, of using 3D convolutions, is the deep end-to-end voxel-to-
providing a unified architecture for both classification and voxel prediction system by Tran et al. [89]. In that work,
segmentation. Figure 20 shows that two-part network which they make use of the Convolutional 3D (C3D) network intro-
is able to consume unordered point sets in 3D. duced by themselves on a previous work [108], and extend
As we can observe, PointNet is a deep network archi-
it for semantic segmentation by adding deconvolutional
tecture that stands out of the crowd due to the fact that it
layers at the end. Their system works by splitting the input
is based on fully connected layers instead of convolutional
into clips of 16 frames, performing predictions for each clip
ones. The architecture features two subnetworks: one for
separately. Its main contribution is the use of 3D convo-
classification and another for segmentation. The classifica-
lutions. Those convolutions make use of three-dimensional
tion subnetwork takes a point cloud and applies a set of
filters which are suitable for spatio-temporal feature learn-
transforms and Multi Layer Perceptrons (MLPs) to generate
ing across multiple channels, in this case frames. Figure
features which are then aggregated using max-pooling to
23 shows the difference between 2D and 3D convolutions
generate a global feature which describes the original input
applied to multi-channel inputs, proving the usefulness of
cloud. That global feature is classified by another MLP to
the 3D ones for video segmentation.
produce output scores for each class. The segmentation
subnetwork concatenates the global feature with the per-
point features extracted by the classification network and 5 D ISCUSSION
applies another two MLPs to generate features and produce In the previous section we reviewed the existing methods
output scores for each point. from a literary and qualitative point of view, i.e., we did
not take any quantitative result into account. In this Section
4.6 Video Sequences we are going to discuss the very same methods from a
As we have observed, there has been a significant progress numeric standpoint. First of all, we will describe the most
in single-image segmentation. However, when dealing with popular evaluation metrics that can be used to measure the
image sequences, many systems rely on the naı̈ve appli- performance of semantic segmentation systems from three
cation of the very same algorithms in a frame-by-frame aspects: execution time, memory footprint, and accuracy.
Fig. 20: The PointNet unified architecture for point cloud classification and segmentation. Figure reproduced from [87].

• Mean Pixel Accuracy (MPA): the ratio of correctly classified pixels is computed per class and then averaged over the total number of classes:

$$\mathrm{MPA} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}$$

• Mean Intersection over Union (MIoU): this is the standard metric for segmentation purposes. It computes a ratio between the intersection and the union of two sets, in our case the ground truth and our predicted segmentation. That ratio can be reformulated as the number of true positives (intersection) over the sum of true positives, false negatives, and false positives (union).
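As a hedged reference implementation of these accuracy metrics (our own sketch, following the notation above where p_ij counts the pixels of class i predicted as class j), the snippet below derives pixel accuracy, MPA, and MIoU from a confusion matrix:

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """p[i, j] = number of pixels of ground-truth class i predicted as class j."""
    valid = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes,
                                                                 num_classes)

def accuracy_metrics(p):
    tp = np.diag(p).astype(float)
    gt_total = p.sum(axis=1)     # TP + FN per class
    pred_total = p.sum(axis=0)   # TP + FP per class
    with np.errstate(divide="ignore", invalid="ignore"):
        pixel_accuracy = tp.sum() / p.sum()
        mean_pixel_accuracy = np.nanmean(tp / gt_total)
        iou = tp / (gt_total + pred_total - tp)   # TP / (TP + FN + FP)
    return pixel_accuracy, mean_pixel_accuracy, np.nanmean(iou)

gt = np.array([[0, 0, 1], [2, 1, 1]])
pred = np.array([[0, 1, 1], [2, 1, 0]])
print(accuracy_metrics(confusion_matrix(gt, pred, num_classes=3)))
# (0.666..., 0.722..., 0.611...)
```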
5.2.1 RGB
For the single 2D image category we have selected seven datasets: PASCAL VOC-2012, PASCAL Context, PASCAL Person-Part, CamVid, CityScapes, Stanford Background, and SiftFlow. That selection accounts for a wide range of situations and targets.

The first, and arguably the most important, dataset on which the vast majority of methods are evaluated is PASCAL VOC-2012. Table 3 shows the results of those reviewed methods which provide accuracy results on the PASCAL VOC-2012 test set. This set of results shows a clear improvement trend from the first proposed methods (SegNet and the original FCN) to the most complex models such as CRFasRNN and the winner (DeepLab) with 79.70 IoU.

TABLE 3: Performance results on PASCAL VOC-2012.
# Method Accuracy (IoU)
1 DeepLab [69] 79.70
2 Dilation [71] 75.30
3 CRFasRNN [70] 74.70
4 ParseNet [77] 69.80
5 FCN-8s [65] 67.20
6 Multi-scale-CNN-Eigen [74] 62.60
7 Bayesian SegNet [67] 60.50

Apart from the widely known VOC, we also collected metrics for its Context counterpart. Table 4 shows those results, in which DeepLab is again the top scorer (45.70 IoU).

# Method Accuracy (IoU)
1 DeepLab [69] 64.94

Moving on from a general-purpose dataset such as PASCAL VOC, we also gathered results for two of the most important urban driving databases. Table 6 shows the results of those methods which provide accuracy metrics for the CamVid dataset. In this case, an RNN-based approach (DAG-RNN) is the top one with a 91.60 IoU.

TABLE 6: Performance results on CamVid.

TABLE 8: Performance results on Stanford Background.
# Method Accuracy (IoU)
1 rCNN [81] 80.20
2 2D-LSTM [80] 78.56

recurrent methods. In particular, DAG-RNN is the top scorer with 85.30 IoU.

TABLE 9: Performance results on SiftFlow.
# Method Accuracy (IoU)
1 DAG-RNN [82] 85.30
2 rCNN [81] 77.70
3 2D-LSTM [80] 70.11

Table 11 shows the results for NYUDv2, which are also exclusive to LSTM-CF. That method reaches 49.40 IoU.

TABLE 11: Performance results on NYUDv2.
# Method Accuracy (IoU)
1 LSTM-CF [79] 49.40

At last, Table 12 gathers results for the last 2.5D dataset: SUN-3D. Again, LSTM-CF is the only one which provides information for that database, in this case a 58.50 accuracy.
• Memory: some platforms are bounded by hard memory constraints. Segmentation networks usually need significant amounts of memory to be executed, for both inference and training. In order to fit them into some devices, networks must be simplified. While this can be easily accomplished by reducing their complexity (often trading it for accuracy), other approaches can be taken. Pruning is a promising research line that aims to simplify a network, making it lightweight while keeping the knowledge, and thus the accuracy, of the original network architecture [112], [113], [114] (a toy pruning sketch is shown after this list).

• Temporal coherency on sequences: some methods have addressed video or sequence segmentation, but they did so by taking advantage of temporal cues to increase either accuracy or efficiency. However, none of them have explicitly tackled the coherency problem. For a segmentation system to work on video streams, it is important not only to produce good results frame by frame, but also to make them coherent through the whole clip without producing artifacts, for instance by smoothing predicted per-pixel labels along the sequence (a minimal smoothing sketch also follows this list).

• Multi-view integration: the use of multiple views in recently proposed segmentation works is mostly limited to RGB-D cameras and, in particular, focused on single-object segmentation.
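The two snippets below are our own toy illustrations of those ideas, not techniques taken from the cited works. The first applies magnitude-based pruning: it zeroes out a fraction of the smallest-magnitude weights in each layer, whereas real pruning pipelines such as [112], [113], [114] additionally fine-tune the network and exploit the resulting sparsity.

    # Toy magnitude-based pruning: zero out the smallest-magnitude weights per layer.
    import numpy as np

    def prune_by_magnitude(weights, fraction=0.5):
        """weights: dict mapping layer name -> np.ndarray of parameters."""
        pruned = {}
        for name, w in weights.items():
            threshold = np.quantile(np.abs(w), fraction)   # per-layer cutoff
            mask = np.abs(w) >= threshold                  # keep only the largest weights
            pruned[name] = w * mask
        return pruned

The second is a baseline for temporal coherency, assuming per-frame class probability maps (H x W x C arrays) are already available: it smooths them with an exponential moving average before taking the per-pixel argmax, so labels change less abruptly between consecutive frames.

    # Baseline temporal smoothing of per-frame probability maps (not a reviewed method).
    def smooth_sequence(prob_maps, momentum=0.8):
        """prob_maps: iterable of H x W x C numpy arrays for consecutive frames."""
        smoothed_labels, state = [], None
        for probs in prob_maps:
            state = probs if state is None else momentum * state + (1 - momentum) * probs
            smoothed_labels.append(state.argmax(axis=-1))  # per-pixel class labels
        return smoothed_labels

Neither snippet addresses the harder problems raised above (retraining after pruning, or enforcing coherency without sacrificing accuracy); they only make the starting point of those research lines explicit.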
6 CONCLUSION

To the best of our knowledge, this is the first review paper in the literature which focuses on semantic segmentation using deep learning. In comparison with other surveys, this paper is devoted to such a rising topic as deep learning, covering the most advanced and recent work on that front. We formulated the semantic segmentation problem and provided the reader with the necessary background knowledge about deep learning for such a task. We covered the contemporary literature of datasets and methods, providing a comprehensive survey of 28 datasets and 27 methods. Datasets were carefully described, stating their purposes and characteristics so that researchers can easily pick the one that best suits their needs. Methods were surveyed from two perspectives: contributions and raw results, i.e., accuracy. We also presented a comparative summary of the datasets and methods in tabular form, classifying them according to various criteria. In the end, we discussed the results and provided useful insight in the shape of future research directions and open problems in the field. In conclusion, semantic segmentation has been approached with many success stories but still remains an open problem whose solution would prove really useful for a wide set of real-world applications. Furthermore, deep learning has proved to be extremely powerful to tackle this problem, so we can expect a flurry of innovation and new lines of research in the upcoming years.

ACKNOWLEDGMENTS

This work has been funded by the Spanish Government TIN2016-76515-R grant for the COMBAHO project, supported with FEDER funds. It has also been supported by a Spanish national grant for PhD studies FPU15/04516. In addition, it was also funded by the grant Ayudas para Estudios de Máster e Iniciación a la Investigación from the University of Alicante.

REFERENCES

[1] A. Ess, T. Müller, H. Grabner, and L. J. Van Gool, “Segmentation-based urban traffic scene understanding,” in BMVC, vol. 1, 2009, p. 2.
[2] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, June 2012, pp. 3354–3361.
[3] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
[4] M. Oberweger, P. Wohlhart, and V. Lepetit, “Hands deep in deep learning for hand pose estimation,” arXiv preprint arXiv:1502.06807, 2015.
[5] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, and I. So Kweon, “Learning a deep convolutional network for light-field image super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 24–32.
[6] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, “Deep learning for content-based image retrieval: A comprehensive study,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 157–166.
[7] F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. E. Barbano, “Toward automatic phenotyping of developing embryos from videos,” IEEE Transactions on Image Processing, vol. 14, no. 9, pp. 1360–1371, 2005.
[8] D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, “Deep neural networks segment neuronal membranes in electron microscopy images,” in Advances in neural information processing systems, 2012, pp. 2843–2851.
[9] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.
[10] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Simultaneous detection and segmentation,” in European Conference on Computer Vision. Springer, 2014, pp. 297–312.
[11] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from rgb-d images for object detection and segmentation,” in European Conference on Computer Vision. Springer, 2014, pp. 345–360.
[12] H. Zhu, F. Meng, J. Cai, and S. Lu, “Beyond pixels: A comprehensive survey from bottom-up to semantic image segmentation and cosegmentation,” Journal of Visual Communication and Image Representation, vol. 34, pp. 12–27, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1047320315002035
[13] M. Thoma, “A survey of semantic segmentation,” CoRR, vol. abs/1602.06541, 2016. [Online]. Available: http://arxiv.org/abs/1602.06541
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[15] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[18] A. Graves, S. Fernández, and J. Schmidhuber, “Multi-dimensional recurrent neural networks,” CoRR, vol. abs/0705.2011, 2007. [Online]. Available: http://arxiv.org/abs/0705.2011
[19] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. C. Courville, and Y. Bengio, “Renet: A recurrent neural network based alternative to convolutional networks,” CoRR, vol. abs/1505.00393, 2015. [Online]. Available: http://arxiv.org/abs/1505.00393
[20] A. Ahmed, K. Yu, W. Xu, Y. Gong, and E. Xing, “Training hierarchical feed-forward visual recognition models using transfer learning from pseudo-tasks,” in European Conference on Computer Vision. Springer, 2008, pp. 69–82.
[21] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1717–1724.
[22] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in neural information processing systems, 2014, pp. 3320–3328.
[23] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
[24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[25] S. C. Wong, A. Gatt, V. Stamatescu, and M. D. McDonnell, “Understanding data augmentation for classification: when to warp?” CoRR, vol. abs/1609.08764, 2016. [Online]. Available: http://arxiv.org/abs/1609.08764
[26] X. Shen, A. Hertzmann, J. Jia, S. Paris, B. Price, E. Shechtman, and I. Sachs, “Automatic portrait segmentation for image stylization,” in Computer Graphics Forum, vol. 35, no. 2. Wiley Online Library, 2016, pp. 93–102.
[27] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, Jan. 2015.
[28] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The role of context for object detection and semantic segmentation in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[29] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille, “Detect what you can: Detecting and representing objects using holistic models and body parts,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[30] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik, “Semantic contours from inverse detectors,” in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 991–998.
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[32] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3234–3243.
[33] M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset,” in CVPR Workshop on The Future of Datasets in Vision, 2015.
[34] G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” Pattern Recognition Letters, vol. 30, no. 2, pp. 88–97, 2009.
[35] P. Sturgess, K. Alahari, L. Ladicky, and P. H. Torr, “Combining appearance and structure from motion features for road scene understanding,” in BMVC 2012-23rd British Machine Vision Conference. BMVA, 2009.
[36] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez, “Road scene segmentation from a single image,” in European Conference on Computer Vision. Springer, 2012, pp. 376–389.
[37] G. Ros and J. M. Alvarez, “Unsupervised image transformation for outdoor semantic labelling,” in Intelligent Vehicles Symposium (IV), 2015 IEEE. IEEE, 2015, pp. 537–542.
[38] G. Ros, S. Ramos, M. Granados, A. Bakhtiary, D. Vazquez, and A. M. Lopez, “Vision-based offline-online perception paradigm for autonomous driving,” in Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on. IEEE, 2015, pp. 231–238.
[39] R. Zhang, S. A. Candra, K. Vetter, and A. Zakhor, “Sensor fusion for semantic segmentation of urban scenes,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 1850–1857.
[40] S. Gould, R. Fulton, and D. Koller, “Decomposing a scene into geometric and semantically consistent regions,” in Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 1–8.
[41] C. Liu, J. Yuen, and A. Torralba, “Nonparametric scene parsing: Label transfer via dense scene alignment,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 1972–1979.
[42] S. D. Jain and K. Grauman, “Supervoxel-consistent foreground propagation in video,” in European Conference on Computer Vision. Springer, 2014, pp. 656–671.
[43] S. Bell, P. Upchurch, N. Snavely, and K. Bala, “Material recognition in the wild with the materials in context database,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3479–3487.
[44] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in Computer Vision and Pattern Recognition, 2016.
[45] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool, “The 2017 davis challenge on video object segmentation,” arXiv:1704.00675, 2017.
[46] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European Conference on Computer Vision. Springer, 2012, pp. 746–760.
[47] J. Xiao, A. Owens, and A. Torralba, “Sun3d: A database of big spaces reconstructed using sfm and object labels,” in 2013 IEEE International Conference on Computer Vision, Dec 2013, pp. 1625–1632.
[48] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 567–576.
[49] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view rgb-d object dataset,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 1817–1824.
[50] L. Yi, V. G. Kim, D. Ceylan, I.-C. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas, “A scalable active framework for region annotation in 3d shape collections,” SIGGRAPH Asia, 2016.
[51] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese, “Joint 2D-3D-Semantic Data for Indoor Scene Understanding,” ArXiv e-prints, Feb. 2017.
[52] X. Chen, A. Golovinskiy, and T. Funkhouser, “A benchmark for 3D mesh segmentation,” ACM Transactions on Graphics (Proc. SIGGRAPH), vol. 28, no. 3, Aug. 2009.
[53] A. Quadros, J. Underwood, and B. Douillard, “An occlusion-aware feature for range images,” in Robotics and Automation, 2012. ICRA’12. IEEE International Conference on. IEEE, May 14-18 2012.
[54] T. Hackel, J. D. Wegner, and K. Schindler, “Contour detection in unstructured 3d point clouds,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1610–1618.
[55] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from motion point clouds,” in European Conference on Computer Vision. Springer, 2008, pp. 44–57.
[56] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[57] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, “Learning object class detectors from weakly annotated video,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3282–3289.
[58] S. Bell, P. Upchurch, N. Snavely, and K. Bala, “OpenSurfaces: A richly annotated catalog of surface appearance,” ACM Trans. on Graphics (SIGGRAPH), vol. 32, no. 4, 2013.
[59] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “Labelme: a database and web-based tool for image annotation,” International journal of computer vision, vol. 77, no. 1, pp. 157–173, 2008.
[60] S. Gupta, P. Arbelaez, and J. Malik, “Perceptual organization and recognition of indoor scenes from rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 564–571.
[61] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell, A Category-Level 3D Object Dataset: Putting the Kinect to Work. London: Springer London, 2013, pp. 141–165. [Online]. Available: http://dx.doi.org/10.1007/978-1-4471-4640-7_8
[62] A. Richtsfeld, “The object segmentation database (osd),” 2012.
[63] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.
[64] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese, “3d semantic parsing of large-scale indoor spaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1534–1543.
[65] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[66] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv preprint arXiv:1511.00561, 2015.
[67] A. Kendall, V. Badrinarayanan, and R. Cipolla, “Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding,” arXiv preprint arXiv:1511.02680, 2015.
[68] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
[69] ——, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” arXiv preprint arXiv:1606.00915, 2016.
[70] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1529–1537.
[71] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[72] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep neural network architecture for real-time semantic segmentation,” arXiv preprint arXiv:1606.02147, 2016.
[73] A. Raj, D. Maturana, and S. Scherer, “Multi-scale convolutional architecture for semantic segmentation,” 2015.
[74] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2650–2658.
[75] A. Roy and S. Todorovic, “A multi-scale cnn for affordance segmentation in rgb images,” in European Conference on Computer Vision. Springer, 2016, pp. 186–201.
[76] X. Bian, S. N. Lim, and N. Zhou, “Multiscale fully convolutional network with application to industrial inspection,” in Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on. IEEE, 2016, pp. 1–8.
[77] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see better,” arXiv preprint arXiv:1506.04579, 2015.
[78] F. Visin, M. Ciccone, A. Romero, K. Kastner, K. Cho, Y. Bengio, M. Matteucci, and A. Courville, “Reseg: A recurrent neural network-based model for semantic segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2016.
[79] Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin, LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling. Cham: Springer International Publishing, 2016, pp. 541–557. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46475-6_34
[80] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki, “Scene labeling with lstm recurrent neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3547–3555.
[81] P. H. Pinheiro and R. Collobert, “Recurrent convolutional neural networks for scene labeling,” in ICML, 2014, pp. 82–90.
[82] B. Shuai, Z. Zuo, G. Wang, and B. Wang, “Dag-recurrent neural networks for scene labeling,” CoRR, vol. abs/1509.00552, 2015. [Online]. Available: http://arxiv.org/abs/1509.00552
[83] P. O. Pinheiro, R. Collobert, and P. Dollar, “Learning to segment object candidates,” in Advances in Neural Information Processing Systems, 2015, pp. 1990–1998.
[84] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár, “Learning to refine object segments,” in European Conference on Computer Vision. Springer, 2016, pp. 75–91.
[85] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár, “A multipath network for object detection,” arXiv preprint arXiv:1604.02135, 2016.
[86] J. Huang and S. You, “Point cloud labeling using 3d convolutional neural network,” in Proc. of the International Conf. on Pattern Recognition (ICPR), vol. 2, 2016.
[87] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” arXiv preprint arXiv:1612.00593, 2016.
[88] E. Shelhamer, K. Rakelly, J. Hoffman, and T. Darrell, “Clockwork convnets for video semantic segmentation,” in Computer Vision–ECCV 2016 Workshops. Springer, 2016, pp. 852–868.
[89] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Deep end2end voxel2voxel prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 17–24.
[90] M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 2018–2025.
[91] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833.
[92] C. Rother, V. Kolmogorov, and A. Blake, “Grabcut: Interactive foreground extraction using iterated graph cuts,” in ACM transactions on graphics (TOG), vol. 23, no. 3. ACM, 2004, pp. 309–314.
[93] J. Shotton, J. Winn, C. Rother, and A. Criminisi, “Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context,” International Journal of Computer Vision, vol. 81, no. 1, pp. 2–23, 2009.
[94] V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” Adv. Neural Inf. Process. Syst, vol. 2, no. 3, p. 4, 2011.
[95] P. Krähenbühl and V. Koltun, “Parameter learning and convergent inference for dense random fields,” in ICML (3), 2013, pp. 513–521.
[96] S. Zhou, J.-N. Wu, Y. Wu, and X. Zhou, “Exploiting local structures with the kronecker layer in convolutional networks,” arXiv preprint arXiv:1512.09194, 2015.
[97] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[98] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014.
[99] Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin, “RGB-D scene labeling with long short-term memorized fusion model,” CoRR, vol. abs/1604.05000, 2016. [Online]. Available: http://arxiv.org/abs/1604.05000
[100] G. Li and Y. Yu, “Deep contrast learning for salient object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 478–487.
[101] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 328–335.
[102] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[103] A. Zeng, K. Yu, S. Song, D. Suo, E. W. Jr., A. Rodriguez, and J. Xiao, “Multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge,” CoRR, vol. abs/1609.09475, 2016. [Online]. Available: http://arxiv.org/abs/1609.09475
[104] L. Ma, J. Stuckler, C. Kerl, and D. Cremers, “Multi-view deep learning for consistent semantic mapping with rgb-d cameras,” in arXiv:1703.08866, Mar 2017.
[105] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture,” in Proc. ACCV, vol. 2, 2016.
[106] H. Zhang, K. Jiang, Y. Zhang, Q. Li, C. Xia, and X. Chen, “Discriminative feature learning for video semantic segmentation,” in Virtual Reality and Visualization (ICVRV), 2014 International Conference on. IEEE, 2014, pp. 321–326.
[107] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Transactions on pattern analysis and machine intelligence, vol. 23, no. 11, pp. 1222–1239, 2001.
[108] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
[109] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
[110] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[111] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in Proceedings of the 33rd annual international conference on machine learning. ACM, 2016.
[112] S. Anwar, K. Hwang, and W. Sung, “Structured pruning of deep convolutional neural networks,” arXiv preprint arXiv:1512.08571, 2015.
[113] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
[114] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient transfer learning,” arXiv preprint arXiv:1611.06440, 2016.

Sergiu-Ovidiu Oprea is an MSc student (Automation and Robotics) at the University of Alicante. He received his Bachelor's Degree (Computer Engineering) from the same institution in June 2015. His main research interests include deep learning (especially recurrent neural networks), 3D computer vision, parallel computing on GPUs, and computer graphics. He is also a member of European networks such as HiPEAC.

Victor Villena-Martinez is a PhD student at the University of Alicante. He received his Master's Degree in Automation and Robotics in June 2016 and his Bachelor's Degree in Computer Engineering in June 2015. He has collaborated in the project "Acquisition and modeling of growing plants" (GV/2013/005). His main research is focused on the calibration of RGB-D devices and the reconstruction of the human body using the same devices.