Article
Depth Estimation and Semantic Segmentation from
a Single RGB Image Using a Hybrid Convolutional
Neural Network
Xiao Lin 1,2, *, Dalila Sánchez-Escobedo 2 , Josep R. Casas 2 and Montse Pardàs 2
1 Visual Interactions and Communication Technologies (Vicomtech), 20009 Donostia/San Sebastián, Spain
2 Image Processing Group, TSC Department, Technical University of Catalonia (UPC), 08034 Barcelona, Spain;
[email protected] (D.S.-E.); [email protected] (J.R.C.); [email protected] (M.P.)
* Correspondence: [email protected]
Received: 6 March 2019; Accepted: 12 April 2019; Published: 15 April 2019
Abstract: Semantic segmentation and depth estimation are two important tasks in computer vision,
and many methods have been developed to tackle them. These two tasks are commonly addressed
independently, but the idea of merging them into a single framework has recently been studied,
under the assumption that integrating two highly correlated tasks can benefit both and improve
estimation accuracy. In this paper, depth estimation and semantic segmentation are
jointly addressed using a single RGB input image under a unified convolutional neural network.
We analyze two different architectures to evaluate which features are more relevant when shared
by the two tasks and which features should be kept separated to achieve a mutual improvement.
Likewise, our approaches are evaluated in two different scenarios designed to compare our results
against single-task and multi-task methods. Qualitative and quantitative experiments demonstrate
that our methodology outperforms state-of-the-art single-task approaches, while obtaining
competitive results compared with other multi-task methods.
1. Introduction
Semantic segmentation and depth information are intrinsically related, and both pieces of
information need to be considered in an integrated manner to succeed in challenging applications,
such as robotics [1] or autonomous navigation [2]. In robotics, performing tasks in interactive
environments requires identification of objects as well as their distance from the camera. Likewise,
autonomous navigation applications need a 3D reconstruction of the scene as well as semantic
information to ensure that the agent device has enough information available to carry out the
navigation in a safe and independent manner. Although RGB-D sensors are currently being used
in many applications, most systems only provide RGB information. Therefore, addressing depth
estimation and semantic segmentation under a unified framework is of special interest.
On the other hand, deep learning techniques have shown extraordinary success in both tasks [3] in
recent years. In this context, the feature-extraction process for a specific task is modeled in
Convolutional Neural Networks (CNNs) as a parameter estimation problem based on a set of training
data. In other words, the feature extractors are created by learning from the available prior
knowledge. This opens up the possibility of combining different tasks (different sources of prior knowledge)
when training the feature extractors, in particular for highly correlated tasks such as depth estimation
and semantic segmentation. Specifically, the idea of integrating depth estimation and semantic
segmentation into a single structure is motivated by the fact that both segmentation maps and
depth maps represent geometric information about a scene. In this manner, the feature extractors
can be better trained thanks to the enriched prior knowledge.
In this paper, we introduce a hybrid convolutional network that integrates depth estimation
and semantic segmentation into a unified framework. We propose to build a model where the
features extracted are suitable for both tasks, thus leading to an improved accuracy in the estimated
information. One of the main advantages of the proposed approach is the straightforward manner
in which the semantic segmentation and the depth map are estimated from a single image, providing
a feasible solution to both problems.
2. Related Work
Depth estimation and semantic segmentation are two widely studied problems in the image
processing community, and both have recently been tackled with deep learning techniques due to
their successful results in terms of accuracy and efficiency. This section reviews the state of the art,
introducing first single-task approaches and afterwards methods focused on solving multiple tasks.
provides more memory efficiency than FCN. Ghiasi et al. [14] present a Laplacian pyramid for semantic
segmentation refinement, incorporating into the decoding step the spatial information contained in
the high-resolution feature maps, in order to recover the spatial detail destroyed by pooling. Thus, a better
dense pixel-accurate labeling is obtained.
Architectures in the second class use what are known as dilated/atrous convolutions [15–17].
Pooling layers help in classification networks because they increase the receptive field of a network.
However, as mentioned, this is not suitable for a semantic segmentation task since pooling drops the
spatial information and decreases the resolution. Dilated/Atrous convolutions can compute responses
at all image positions with an n times larger receptive field if the full resolution image is convolved
with a filter ‘with holes’, in which the original filter is upsampled by a factor n, and zeros are introduced
in between filter values. Although the effective filter size increases, it is only necessary to take into
account the non-zero filter values, hence both the number of filter parameters and the number of
operations per position stay constant.
A more recent multi-task approach is introduced in [23]. The methodology proposed in this work
makes initial estimations for depth and semantic label at a pixel level through a joint network. Later,
depth estimation is used to solve possible confusions between similar semantic categories and thus to
obtain the final semantic segmentation.
Another multi-task approach by Teichmann et al. [24] presents a network architecture named
MultiNet that can perform classification, semantic segmentation, and detection simultaneously.
They incorporate these three tasks into a unified encoder-decoder network where the encoder
stage is shared among all tasks and specific decoders for each task produce outputs in real time.
This work focused on improving the computational efficiency for real-time applications
such as autonomous driving.
A similar approach is Pixel-Level Encoding and Depth Layering (PLEDL) [25], which extended
an FCN [11] with three output channels jointly trained to estimate the semantic label, the direction
to the instance center and the depth at pixel level.
Table 1 presents a brief comparison of the pros and cons of the different types of
methods. Traditional image segmentation approaches [4–9] usually perform low-level segmentation,
obtaining segments under rather general assumptions, such as local homogeneity. On the other
hand, semantic segmentation methods [10,11,13,16,26] improve image segmentation by introducing
semantic annotations, which provide higher level meaning (semantics at object level) rather than
the low-level features exploited in traditional methods. Approaches under multi-task learning schemes,
such as the proposed approach and [21,25], exploit the correlation between semantic segmentation
and depth estimation to the benefit of both tasks, generating both an image segmentation and a depth
estimate from a single color image. Unlike the multi-task methods in the state of the
art [21,25], the proposed approach focuses on separating what the two tasks have in common from what
is specific to each, which leads to the promising results shown in our experiments.
each other in a hybrid system. We also apply them to the more challenging indoor scenes to verify the
validity of the unifying strategy.
Figure 1. Architecture 1 (blocks: input/output, semantic segmentation network and depth estimation network).
Figure 2. Architecture 2.
The rest of the paper is organized as follows: in Section 3 we introduce the proposed methodology;
the detailed explanation of the proposed architectures is presented in Section 4, as well as the training
details. In Section 5, we present the experiment results of our approach in different datasets and
compare our approach with state-of-the-art approaches. Finally, conclusions are drawn in Section 6.
representative of a family of methods. (2) DepthNet has a modularized architecture, which allows us
to analyze each of its components and better integrate DepthNet into a hybrid architecture.
The semantic segmentation architecture [16], shown in Figure 4, is divided into two main parts:
a feature network and an atrous upsampling network. The feature network follows the VGGNet
architecture proposed in [28]. It is in charge of extracting robust features from the input image,
benefiting from the deep structure of the network. The atrous upsampling network, on the other hand,
is a group of atrous spatial pyramid pooling layers [16] which outputs a class score map whose
number of channels equals the number of labels. Atrous upsampling layers allow us to explicitly
control the resolution at which feature responses are computed within the architecture, while enlarging
the field of view of the filters to incorporate larger context in the semantic segmentation task. The semantic
segmentation architecture is denoted as DeepLab-Atrous Spatial Pyramid Pooling (DeepLab-ASPP) in
this paper. In DeepLab-ASPP, all parts are trained together.
DeepLab-ASPP is employed as the semantic segmentation component in our approach due to its
outstanding performance in this task.
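As an illustration of how such an atrous upsampling head can be assembled, the sketch below builds a small ASPP-style module in PyTorch. It is not the DeepLab-ASPP implementation used in this paper; the dilation rates (6, 12, 18, 24), the channel widths and the fusion by summation are assumptions borrowed from common DeepLab configurations.

```python
import torch
import torch.nn as nn

class ASPPHead(nn.Module):
    """ASPP-style head: parallel dilated branches over a shared feature map,
    each producing a class score map, fused by summation."""

    def __init__(self, in_channels=512, num_classes=19, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_channels, 1024, 3, padding=r, dilation=r),
                nn.ReLU(inplace=True),
                nn.Conv2d(1024, 1024, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(1024, num_classes, 1),   # per-branch class score map
            )
            for r in rates
        )

    def forward(self, features):
        # Sum the branch scores: multi-scale context fused at every position.
        return sum(branch(features) for branch in self.branches)

# Feature map as it might come out of a VGG-style feature network.
scores = ASPPHead()(torch.randn(1, 512, 41, 41))
print(scores.shape)   # torch.Size([1, 19, 41, 41])
```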
fewer parameters are involved in the training process, which makes the model easier to train. In practice,
we exploit the VGG structure [28] as the feature-extraction network for both tasks. Based on the
extracted features, the atrous upsampling network in DeepLab-ASPP is employed for the semantic
segmentation task, while the refining network in DepthNet is leveraged as the decoder for the depth
estimation task. We denote architecture 1 as HybridNet A1 in the rest of this paper.
Architecture 2: The motivation for building this architecture is to further separate the common and
task-specific attributes of the two tasks. Thus, we build the hybrid network by substituting the gradient
network in DepthNet with a common feature-extraction network for the two tasks, while keeping the
global depth estimation network only for the depth task. The advantages of this hybrid architecture
are two-fold. On the one hand, the strong capability of extracting object information from a color image
learned in the semantic segmentation task can also benefit depth layering when predicting a depth
map, while the strong capability of extracting rich depth boundaries from a color image learned in the
depth estimation task is shared with the semantic segmentation task to improve segmentation accuracy
on object boundaries. On the other hand, the global layout of the scene, which is more relevant in depth
estimation than in semantic segmentation, is estimated independently by a global network in the depth
estimation task. This avoids interfering with the common feature extraction for both tasks. In practice,
we keep the global network and the refining network in DepthNet unchanged, while replacing the
structure of the gradient network with the VGG structure, in order to keep the structure consistent
with the feature network in DeepLab-ASPP. We denote architecture 2 as HybridNet A2 in the rest
of this paper.
4. Architecture Details
Since the proposed architectures are assembled from basic components of the two single-task
architectures, we explain the details of the proposed architectures by describing the two single-task
architectures in this section.
the feature maps extracted from the color image. These feature maps are concatenated with the outputs of
the global depth network and the gradient network and then fed to the remaining four convolutional
layers, each of which is followed by a ReLU layer. The output of the 5th convolutional layer in the
refining network is treated as the output (a refined depth map of size 81 × 81).
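Since the surrounding description of the refining network is only partial, the following sketch should be read as a loose illustration rather than the DepthNet refining network itself: it concatenates the image feature maps with the outputs of the global and gradient networks and applies the four remaining convolutional stages; the channel widths, kernel sizes and input resolutions are assumptions.

```python
import torch
import torch.nn as nn

class RefiningNetwork(nn.Module):
    """Refinement stage: image features are concatenated with the outputs of the
    global depth network and the gradient network, then passed through four more
    convolutional layers to produce a refined 81x81 depth map (sketch only)."""

    def __init__(self, feat_ch=63, global_ch=1, grad_ch=1):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(feat_ch + global_ch + grad_ch, 64, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 5, padding=2),   # assumed final (5th) refining layer: depth output
        )

    def forward(self, image_feats, global_depth, gradient_out):
        x = torch.cat([image_feats, global_depth, gradient_out], dim=1)
        return self.refine(x)

depth = RefiningNetwork()(torch.randn(1, 63, 81, 81),
                          torch.randn(1, 1, 81, 81),
                          torch.randn(1, 1, 81, 81))
print(depth.shape)   # torch.Size([1, 1, 81, 81])
```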
weighted in the overall loss function, except for unlabeled pixels, which are ignored. The loss
function used for the depth estimation task is composed of two Euclidean loss layers, $L_{D_{abs}}$ and $L_{D_{mvn}}$.
$L_{D_{abs}}$ computes the Euclidean distance between the absolute values of the ground-truth depth map
and the estimated depth map, while $L_{D_{mvn}}$ computes the Euclidean distance between estimation
and ground truth after performing a mean-variance normalization (MVN) on both of them. $L_{D_{abs}}$ is
a pixel-level metric which evaluates locally how well the estimated depth value matches the
ground truth, regardless of the geometry of the scene. On the other hand, $L_{D_{mvn}}$ introduces a global
regularization in the depth estimation by aligning the depth values of both the estimation and the ground
truth to zero mean and unit variance.
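As a sketch of how these two terms can be computed (the framework and the reduction over pixels are assumptions, not the Caffe layers used in the paper):

```python
import torch

def mvn(d, eps=1e-8):
    # Mean-variance normalization: zero mean, unit variance over the whole map.
    return (d - d.mean()) / (d.std() + eps)

def depth_losses(pred, gt):
    # L_Dabs: Euclidean distance between the absolute depth values (local fit).
    l_abs = torch.mean((pred - gt) ** 2)
    # L_Dmvn: the same distance after normalizing both maps, which regularizes
    # the global layout independently of absolute scale and offset.
    l_mvn = torch.mean((mvn(pred) - mvn(gt)) ** 2)
    return l_abs, l_mvn

pred, gt = torch.rand(1, 1, 81, 81), torch.rand(1, 1, 81, 81)
print(depth_losses(pred, gt))
```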
The hybrid loss function $L_H$ is therefore defined as the linear combination of them:
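A plausible written form, in which placing the balancing weight on the depth terms is an assumption rather than something stated explicitly here, is

$$L_H = L_S + \alpha \,\left( L_{D_{abs}} + L_{D_{mvn}} \right),$$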
where $\alpha$ is the term used to balance the loss functions of the depth estimation and semantic
segmentation tasks. In our experiments, $\alpha$ is set to 1000, based on an analysis of the values of $L_{D_{abs}} + L_{D_{mvn}}$
and $L_S$, respectively, when training them separately in the single-task architectures.
5. Experiments
We quantify the performance of the proposed architectures on both semantic segmentation and
depth estimation in different scenes using our Caffe implementation. We first evaluate the proposed
architectures on road scenes, which are of current practical interest for various autonomous-driving-related
problems. Secondly, the proposed architectures are evaluated on indoor scenes, which are of
immediate interest for possible augmented reality (AR) applications.
Our first aim is to determine whether the features obtained in the shared part of the proposed
architectures, which solve the two tasks simultaneously, provide better results than those we would
obtain using two identical networks trained separately. For this reason, in addition to the results of the
proposed architectures, we present for comparison the results obtained by models that solve the two
tasks separately. The models used to train semantic segmentation and depth estimation
independently are denoted as DeepLab-ASPP [16] and DepthNet [19], respectively. We trained these
two models using the code provided by the authors, with the same training data from the Cityscapes dataset
and the same training configuration as the proposed architectures. Apart from that, we also compare the
different ways of unifying single-task architectures proposed in Section 3, to assess which
unifying strategy is better. Besides, a comparison between the proposed architectures and a state-of-the-art
hybrid method [25] is also made on the Cityscapes dataset. The hybrid approach proposed
in [25] is similar to HybridNet A1, in which the encoder network in FCN [11] is employed as the
feature network shared by three different tasks and the decoder network in FCN is then employed for
each task to decode the commonly extracted features. The three tasks that [25] tackles are semantic
segmentation, depth layering and boundary detection, which is similar to our target. However, in the
depth layering task, ref. [25] focuses on estimating a depth label for each object, instead of estimating
the real depth value of the whole scene at pixel level. This is also the reason why we only compare the
performance of our approach and [25] on semantic segmentation. We present the results of our
experiments in the following two subsections specifying the evaluation in semantic segmentation and
depth estimation, respectively.
in [25]. For additional evaluation, comparisons of our approach against other well-adopted
single-task methods [11,13,16,33] are presented in Table 2.
Table 2. Evaluation of HybridNet against multi-task and single-task approaches (best results in bold).

Method               G      C      mIoU
HybridNet A2         93.26  79.47  66.61
HybridNet A1         89.31  77.22  58.1
PLEDL [25]           -      -      64.3
DeepLab-ASPP [16]    90.99  74.88  58.02
FCN [11]             -      -      65.3
SegNet [13]          -      -      57.0
GoogLeNetFCN [26]    -      -      63.0
Table 3. Definition of the evaluation metrics for depth estimation: Percentage of Pixels (PP), PP-MVN,
Absolute Relative Difference (ARD), Squared Relative Difference (SRD), RMSE-linear, RMSE-log and
Scale-Invariant Error (SIE).

Metric        Definition
PP            percentage of pixels for which $\max\!\left(\frac{d_i}{d_i^*}, \frac{d_i^*}{d_i}\right) = \gamma < \text{threshold}$
PP-MVN        percentage of pixels for which $\max\!\left(\frac{\mathrm{MVN}(d_i)}{\mathrm{MVN}(d_i^*)}, \frac{\mathrm{MVN}(d_i^*)}{\mathrm{MVN}(d_i)}\right) = \gamma < \text{threshold}$
ARD           $\frac{1}{N}\sum_i \left| d_i - d_i^* \right| / d_i^*$
SRD           $\frac{1}{N}\sum_i \left( d_i - d_i^* \right)^2 / d_i^*$
RMSE-linear   $\sqrt{\frac{1}{N}\sum_i \left( d_i - d_i^* \right)^2}$
RMSE-log      $\sqrt{\frac{1}{N}\sum_i \left( \log d_i - \log d_i^* \right)^2}$
SIE           $\frac{1}{N}\sum_i \left( \log d_i - \log d_i^* + \frac{1}{N}\sum_j \left( \log d_j^* - \log d_j \right) \right)^2$
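For reference, a minimal NumPy sketch of these metrics is given below; the threshold value (1.25), the omission of PP-MVN and the handling of invalid pixels are assumptions, not the exact evaluation protocol of the paper.

```python
import numpy as np

def depth_metrics(pred, gt, thr=1.25):
    pred, gt = pred.ravel(), gt.ravel()
    pp       = np.mean(np.maximum(pred / gt, gt / pred) < thr)       # PP
    ard      = np.mean(np.abs(pred - gt) / gt)                       # ARD
    srd      = np.mean((pred - gt) ** 2 / gt)                        # SRD
    rmse_lin = np.sqrt(np.mean((pred - gt) ** 2))                    # RMSE-linear
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))    # RMSE-log
    diff     = np.log(pred) - np.log(gt)
    sie      = np.mean((diff - diff.mean()) ** 2)                    # SIE (equivalent form)
    return dict(PP=pp, ARD=ard, SRD=srd, RMSE_linear=rmse_lin,
                RMSE_log=rmse_log, SIE=sie)

rng  = np.random.default_rng(0)
gt   = rng.uniform(1.0, 80.0, size=(81, 81))
pred = gt * rng.uniform(0.8, 1.2, size=gt.shape)
print(depth_metrics(pred, gt))
```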
Figure 6. Depth estimation qualitative results. A visual comparison of the estimated depth maps
against the ground truth is presented. The first column shows the input image; columns 2 and 3
depict the estimated depth maps obtained by DepthNet [19] and our hybrid model A2, respectively.
Finally, the ground truth is presented in column 4.
In the quantitative experiment, we compare the proposed hybrid architectures and DepthNet.
Table 4 shows the quantitative results of the proposed hybrid architectures and DepthNet under
the different evaluation metrics introduced above. HybridNet A2 obtains the best results in 6 out of 9 metrics,
which shows that training the feature-extraction network for the simultaneous tasks of semantic
segmentation and depth estimation also improves the depth estimation results. The better performance
of HybridNet A2 in comparison to DepthNet illustrates that the shared features obtained with
the semantic segmentation task in HybridNet A2 have richer information and are more relevant
in the depth estimation task than the information extracted from the depth gradient in DepthNet.
The comparison between HybridNet A2 and HybridNet A1 shows the necessity of separating the common
and specific attributes of different tasks. Sharing only the common attributes of the tasks in the
feature-extraction process leads to a better performance in depth estimation. We also verified the
standard deviation of the performance of these methods across all testing samples to ensure the
statistical significance of the results. Since very similar results are observed, we do not present them in
Table 4 for conciseness.
Table 4. Depth estimation. Quantitative evaluation: PP, PP-MVN, ARD, SRD, RMSE-linear, RMSE-log,
and SIE (best results in bold).
Figure 7. Semantic segmentation qualitative results (rows: color image, HybridNet A2 segmentation mask, DeepLab-ASPP and ground truth).
In addition to the qualitative results, we follow the three metrics introduced in Section 5.1.1,
the global accuracy (G), the class average accuracy (C) and the mean intersection over union (mIoU), to
evaluate the segmentation performance quantitatively. We also benchmark the proposed architectures
against several other well adopted architectures for semantic segmentation, such as FCN [11],
SegNet [13], DeepLab [16] and DeconvNet [38]. For FCN, the parameters for the deconvolutional layers
are learned from the training process instead of using fixed parameters to perform bilinear upsampling.
For DeepLab, three architectures are employed, which are DeepLab-ASPP, DeepLab-LargeFOV,
and DeepLab-LargeFOV-denseCRF. They use the same VGGNet architecture for feature map
extraction, which is similar to the proposed architectures. DeepLab-LargeFOV performs single
scale upsampling on the feature map, while DeepLab-ASPP performs multi-scale upsampling.
DeepLab-LargeFOV-denseCRF introduces a dense conditional random field as a post-processing
step for DeepLab-LargeFOV. Table 5 shows the quantitative results of the proposed architectures
(HybridNet A1 and A2) compared with other methods. HybridNet A2 achieves the best results in
C and mIoU among all the 7 methods, while also obtaining a global accuracy (71.63%) close to the best one (73.87%),
obtained by DeepLab-ASPP. The higher global accuracy and lower per-class accuracy obtained by
DeepLab-ASPP in comparison to HybridNet A2 illustrate that DeepLab-ASPP tends to better cover
large objects in the scene, such as floor and wall, which provides good results in the global evaluation.
However, this affects its performance in smaller objects, which results in its lower per-class accuracy,
as well as its mIoU. The improvement over DeepLab-ASPP verifies again the idea of multi-task
learning, namely that estimating depth in addition to semantic segmentation helps the segmentation task
(6.1% and 5.1% improvement in C and mIoU, respectively). The performance of HybridNet A1 is
even worse than the single-task method DeepLab-ASPP, which indicates that the idea of benefiting
from unifying two single tasks in a hybrid architecture can hardly be achieved by simply sharing
the feature-extraction process in more complex indoor scenes. The best segmentation performance
obtained by HybridNet A2 compared with HybridNet A1 shows the importance of selecting a suitable
unifying strategy in a multi-task learning problem and verifies the effectiveness of the strategy employed
in HybridNet A2.
Table 5. Semantic segmentation quantitative evaluation: G, C and mIoU (best results in bold).

Method                            G      C      mIoU
HybridNet A2                      71.63  46.20  34.30
HybridNet A1                      69.34  38.64  28.68
DeepLab-ASPP [16]                 73.87  40.09  29.22
SegNet [13]                       72.63  44.76  31.84
DeepLab-LargeFOV [16]             71.90  42.21  32.08
DeepLab-LargeFOV-denseCRF [16]    66.96  33.06  24.13
FCN (learned deconv) [11]         68.18  38.41  27.39
DeconvNet [38]                    66.13  32.28  22.57
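For completeness, the following sketch computes the three segmentation metrics (G, C and mIoU) from a confusion matrix; the treatment of unlabeled pixels and of classes absent from the test set are assumptions, not the exact evaluation protocol used here.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    # Confusion matrix: rows index ground-truth classes, columns predictions.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        cm[g, p] += 1
    tp = np.diag(cm).astype(float)
    with np.errstate(divide="ignore", invalid="ignore"):
        g_acc = tp.sum() / cm.sum()                                      # G: global accuracy
        c_acc = np.nanmean(tp / cm.sum(axis=1))                          # C: class average accuracy
        miou  = np.nanmean(tp / (cm.sum(axis=1) + cm.sum(axis=0) - tp))  # mIoU
    return g_acc, c_acc, miou

pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [0, 1]])
print(segmentation_metrics(pred, gt, num_classes=2))   # (0.75, 0.75, ~0.583)
```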
Figure 8. Depth estimation qualitative results. A comparison of the depth estimates against the
ground truth is presented. The input image is depicted in the first row. The 2nd, 3rd and 4th rows present
the estimated depth maps of our method, DepthNet and the ground truth, respectively.
Table 6. Depth estimation. Quantitative evaluation: PP, PP-MVN, ARD, SRD, RMSE-linear, RMSE-log,
and SIE (best results in bold).
Figure 9). Each of the VGG structures takes the output of the previous one, along with the input color
image, as its input. Among the three tasks, depth estimation and surface normal estimation are
tackled jointly, which means that these two tasks share the network in scale 1, while the networks
in scales 2-3 are assembled separately for each task. For the semantic segmentation task, the architecture
shown in Figure 9 is used again. However, unlike for the other two tasks, the semantic segmentation
architecture allows two additional input channels, namely depth and normal channels.
This architecture is only fine-tuned from the model previously trained on depth and normal estimation
to generate semantic segmentation masks.
Although the source code of this method was not available, its performance evaluation is
reported on a public dataset (the NYU Depth V2 dataset [35]). To make the comparison with this approach,
we trained and evaluated our approach on the NYU Depth V2 dataset. This dataset includes RGB images
with their corresponding 2D ground-truth object labels for 40 indoor scene classes, as well as depth maps.
The NYU Depth V2 dataset is divided into 795 images for training and 654 for testing. Due to the small
number of images available for training, we augment the training set by random cropping, flipping,
and mirroring.
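A minimal sketch of such a joint augmentation is shown below; the crop size, flip probability and array layout are assumptions, not the settings used for these experiments.

```python
import numpy as np

def augment(rgb, label, depth, crop=(321, 321), rng=np.random.default_rng()):
    # Apply the same random crop and horizontal mirroring to all three modalities,
    # given as numpy arrays with the spatial dimensions first.
    h, w = rgb.shape[:2]
    ch, cw = crop
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    window = (slice(y, y + ch), slice(x, x + cw))
    rgb, label, depth = rgb[window], label[window], depth[window]
    if rng.random() < 0.5:                       # mirror all modalities together
        rgb, label, depth = rgb[:, ::-1], label[:, ::-1], depth[:, ::-1]
    return rgb, label, depth

rgb   = np.zeros((480, 640, 3), dtype=np.uint8)
label = np.zeros((480, 640), dtype=np.int64)
depth = np.zeros((480, 640), dtype=np.float32)
print([a.shape for a in augment(rgb, label, depth)])
```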
Tables 7 and 8 show the quantitative results of HybridNet A2 for both tasks and provide
a comparison with the approach proposed in [21], denoted as Eigen. The semantic segmentation results in
Table 7 show that HybridNet A2 outperforms Eigen in class average accuracy (C) and mean intersection
over union (mIoU), while keeping results similar to Eigen in global accuracy (G). This also illustrates
that addressing the RGB-D-based semantic segmentation task under a multi-task learning scheme makes
better use of the depth information than directly feeding the depth information to the network as an extra input
channel. On the other hand, the depth estimation results in Table 8 show that HybridNet A2 has a better
performance in the relative measure SIE, while in the absolute measures Eigen outperforms HybridNet
A2. The better performance of HybridNet A2 in the relative measure shows that HybridNet A2 has
a better depth layering capability than Eigen, which is more relevant in real applications. For the absolute
measures, we believe that the worse performance of HybridNet A2 is due to a weaker ability to
describe the global layout of the scene: HybridNet A2 employs a much simpler architecture (AlexNet
structure) for the global depth network compared with the scale-1 network (VGG structure) in Eigen.
Table 7. Quantitative segmentation results on NYU V2: G, C and mIoU (best results in bold).
Method          G     C     mIoU
HybridNet A2    64.7  48.4  36.5
Eigen [21]      65.6  45.1  34.1
Table 8. Depth estimation results on NYU V2 Quantitative evaluation: PP, PP-MVN, ARD, SRD,
RMSE-linear, RMSE-log, and SIE (best results in bold).
• Designing better loss functions for a multi-task learning scheme. The loss function employed in
state-of-the-art approaches is normally a balanced linear combination of the losses of the single
tasks. However, these losses may have totally different physical meanings depending on the task
(e.g., cross entropy and Euclidean loss), which makes it hard to combine them. Finding higher
level evaluation metrics would help define the loss function of a multi-task learning system. For
instance, evaluating the prediction of the 3D oriented bounding boxes of objects requires using
both the semantic segmentation and the depth estimation results, which naturally combines the loss
functions of both tasks.
• Applying the approach to higher level tasks requiring 3D analysis. Since the proposed approach produces an
object-level segmentation and a depth map of an input image, applying the estimated results to
applications requiring 3D analysis (such as traffic violation detection) will be of great interest.
Author Contributions: Conceptualization, X.L., J.R.C. and M.P.; Formal analysis, X.L.; Investigation, X.L. and
D.S.-E.; Methodology, X.L.; Supervision, J.R.C. and M.P.; Validation, D.S.-E.; Writing—original draft, X.L.;
Writing—review & editing, X.L., D.S.-E., J.R.C. and M.P.
Funding: This work has been developed in the framework of project MALEGRA TEC2016-75976-R, financed by
the Spanish Ministerio de Economia y Competitividad and the European Regional Development Fund (ERDF).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Ball, D.; Ross, P.; English, A.; Milani, P.; Richards, D.; Bate, A.; Upcroft, B.; Wyeth, G.; Corke, P. Farm workers
of the future: Vision-based robotics for broad-acre agriculture. IEEE Robot. Autom. Mag. 2017, 24, 97–107.
[CrossRef]
2. Shah, U.; Khawad, R.; Krishna, K.M. DeepFly: Towards complete autonomous navigation of MAVs with
monocular camera. In Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image
Processing, Guwahati, India, 18–22 December 2016; ACM: New York, NY, USA, 2016; p. 59.
3. Leo, M.; Furnari, A.; Medioni, G.G.; Trivedi, M.; Farinella, G.M. Deep Learning for Assistive Computer Vision.
In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018;
pp. 3–14.
4. Yang, J.; Gan, Z.; Li, K.; Hou, C. Graph-based segmentation for RGB-D data using 3-D geometry enhanced
superpixels. IEEE Trans. Cybern. 2015, 45, 927–940. [CrossRef] [PubMed]
5. Stutz, D.; Hermans, A.; Leibe, B. Superpixels: An evaluation of the state-of-the-art. Comput. Vis. Image Underst.
2018, 166, 1–27. [CrossRef]
6. Ciecholewski, M. An edge-based active contour model using an inflation/deflation force with a damping
coefficient. Expert Syst. Appl. 2016, 44, 22–36. [CrossRef]
7. Ding, K.; Xiao, L.; Weng, G. Active contours driven by local pre-fitting energy for fast image segmentation.
Pattern Recognit. Lett. 2018, 104, 29–36. [CrossRef]
8. Cousty, J.; Bertrand, G.; Najman, L.; Couprie, M. Watershed cuts: Thinnings, shortest path forests, and
topological watersheds. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 925–939. [CrossRef] [PubMed]
9. Gaetano, R.; Masi, G.; Poggi, G.; Verdoliva, L.; Scarpa, G. Marker-controlled watershed-based segmentation
of multiresolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2987–3004. [CrossRef]
10. Shotton, J.; Johnson, M.; Cipolla, R. Semantic texton forests for image categorization and segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA,
23–28 June 2008; pp. 1–8.
11. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015;
pp. 3431–3440.
12. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation.
In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted
Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
13. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for
image segmentation. arXiv 2015, arXiv:1511.00561.
14. Ghiasi, G.; Fowlkes, C.C. Laplacian pyramid reconstruction and refinement for semantic segmentation.
In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands,
8–16 October 2016; pp. 519–534.
15. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
16. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation
with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv 2016, arXiv:1606.00915.
17. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image
Segmentation. arXiv 2017, arXiv:1706.05587.
18. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep
network. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA,
3–8 December 2014; pp. 2366–2374.
19. Ivanecký, B.J. Depth Estimation by Convolutional Neural Networks. Master’s Thesis, Brno University of
Technology, Brno, Czechia, 2016.
20. Abdi, L.; Meddeb, A. Driver information system: A combination of augmented reality and deep learning.
In Proceedings of the Symposium on Applied Computing, Marrakech, Morocco, 4–6 April 2017; ACM:
New York, NY, USA; pp. 228–230.
21. Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale
convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision,
Santiago, Chile, 13–16 December 2015; pp. 2650–2658.
22. Wang, P.; Shen, X.; Lin, Z.; Cohen, S.; Price, B.; Yuille, A.L. Towards unified depth and semantic prediction
from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Boston, MA, USA, 8–10 June 2015; pp. 2800–2809.
23. Mousavian, A.; Pirsiavash, H.; Košecká, J. Joint Semantic Segmentation and Depth Estimation with Deep
Convolutional Networks. In Proceedings of the Fourth International Conference on 3D Vision (3DV),
Stanford, CA, USA, 25–28 October 2016; pp. 611–619.
24. Teichmann, M.; Weber, M.; Zoellner, M.; Cipolla, R.; Urtasun, R. MultiNet: Real-time Joint Semantic
Reasoning for Autonomous Driving. arXiv 2016, arXiv:1612.07695.
25. Uhrig, J.; Cordts, M.; Franke, U.; Brox, T. Pixel-level encoding and depth layering for instance-level
semantic labeling. In Proceedings of the German Conference on Pattern Recognition, Hannover, Germany,
12–15 September 2016; pp. 14–25.
26. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.
Going deeper with convolutions. In Proceedings of the IEEE conference on Computer Vision and Pattern
Recognition, Boston, MA, USA, 8–10 June 2015; pp. 1–9.
27. Sanchez-Escobedo, D.; Lin, X.; Casas, J.R.; Pardas, M. Hybridnet for depth estimation and semantic
segmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 1563–1567.
28. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014,
arXiv:1409.1556.
29. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural
networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, AK, USA,
3–8 December 2012; pp. 1097–1105.
30. Brostow, G.J.; Fauqueur, J.; Cipolla, R. Semantic object classes in video: A high-definition ground truth
database. Pattern Recognit. Lett. 2009, 30, 88–97. [CrossRef]
31. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark
suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA,
18–20 June 2012; pp. 3354–3361.
32. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B.
The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 3213–3223.
33. Papandreou, G.; Chen, L.C.; Murphy, K.; Yuille, A.L. Weakly-and semi-supervised learning of a DCNN for
semantic image segmentation. arXiv 2015, arXiv:1502.02734.
34. Song, S.; Lichtenberg, S.P.; Xiao, J. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015;
pp. 567–576.
35. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from rgbd
images. In Proceedings of the European Conference on Computer Vision, Firenze, Italy, 7–13 October 2012;
pp. 746–760.
36. Janoch, A.; Karayev, S.; Jia, Y.; Barron, J.T.; Fritz, M.; Saenko, K.; Darrell, T. A category-level 3d object dataset:
Putting the kinect to work. In Consumer Depth Cameras for Computer Vision; Springer: London, UK, 2013;
pp. 141–165.
37. Xiao, J.; Owens, A.; Torralba, A. Sun3d: A database of big spaces reconstructed using sfm and object
labels. In Proceedings of the IEEE International Conference on Computer Vision, Portland, OR, USA,
25–27 June 2013; pp. 1625–1632.
38. Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of
the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1520–1528.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).