Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding
1. Introduction
Semantic segmentation requires an understanding of an image at a pixel level and is an important tool for scene understanding. It is a difficult problem, as scenes often vary significantly in pose and appearance. However, it is an important problem, as it can be used to infer scene geometry and object support relationships. This has wide-ranging applications, from robotic interaction to autonomous driving.

Figure 1: Bayesian SegNet. These examples show the performance of Bayesian SegNet on popular segmentation and scene understanding benchmarks: SUN [35] (left), CamVid [4] (centre column) and Pascal VOC [11] (right). The system takes an RGB image as input (top), and outputs a semantic segmentation (middle row) and model uncertainty estimate, averaged across all classes (bottom row). We observe higher model uncertainty at object boundaries and with visually difficult objects. An online demo and source code can be found on our project webpage: mi.eng.cam.ac.uk/projects/segnet/

Previous approaches to scene understanding used low-level visual features [32]. We are now seeing the emergence of machine learning techniques for this problem [31, 25]. In particular, deep learning [25] has set the benchmark on many popular datasets [11, 8]. However, none of these deep learning methods produce a probabilistic segmentation with a measure of model uncertainty.

Uncertainty should be a natural part of any predictive system's output. Knowing the confidence with which we can trust the semantic segmentation output is important for decision making. For instance, a system on an autonomous vehicle may segment an object as a pedestrian. But it is desirable to know the model uncertainty with respect to other classes such as street sign or cyclist, as this can have a strong effect on behavioural decisions. Uncertainty is also immediately useful for other applications such as active learning [7], semi-supervised learning, or label propagation [1].
The main contribution of this paper is extending deep convolutional encoder-decoder neural network architectures [3] to Bayesian convolutional neural networks which can produce a probabilistic segmentation output [13]. In Section 4 we propose Bayesian SegNet, a probabilistic deep convolutional neural network framework for pixel-wise semantic segmentation. We use dropout at test time, which allows us to approximate the posterior distribution by sampling from the Bernoulli distribution across the network's weights. This is achieved with no additional parameterisation.

In Section 5, we demonstrate that Bayesian SegNet sets the best performing benchmark on prominent scene understanding datasets, CamVid Road Scenes [4] and SUN RGB-D Indoor Scene Understanding [35]. In particular, we find a larger performance improvement on smaller datasets such as CamVid, where the Bayesian neural network is able to cope with the additional uncertainty from a smaller amount of data.

Moreover, we show in Section 5.4 that this technique is broadly applicable across a number of state-of-the-art architectures and achieves a 2-3% improvement in segmentation accuracy when applied to SegNet [3], FCN [25] and Dilation Network [40].

Finally, in Section 5.5 we demonstrate the effectiveness of model uncertainty. We show this measure can be used to understand with what confidence we can trust image segmentations. We also explore what factors contribute to Bayesian SegNet making an uncertain prediction.
2. Related Work tribution over models. This technique has seen success in
Semantic pixel labelling was initially approached with TextonBoost [32], TextonForest [30] and Random Forest based classifiers [31]. We are now seeing the emergence of deep learning architectures for pixel-wise segmentation, following their success in whole-image object recognition [21]. Architectures such as SegNet [3], Fully Convolutional Networks (FCN) [25] and Dilation Network [40] have been proposed, which we refer to as the core segmentation engine. FCN is trained using stochastic gradient descent with a stage-wise training scheme. SegNet was the first architecture proposed that can be trained end-to-end in one step, due to its lower parameterisation.

We have also seen methods which improve on these core segmentation engine architectures by adding post-processing tools. HyperColumn [16] and DeconvNet [27] use region proposals to bootstrap their core segmentation engine. DeepLab [6] post-processes with conditional random fields (CRFs) and CRF-RNN [42] uses recurrent neural networks. These methods improve performance by smoothing the output and ensuring label consistency. However, none of these proposed segmentation methods generate a probabilistic output with a measure of model uncertainty.

Neural networks which model uncertainty are known as Bayesian neural networks [9, 26]. They offer a probabilistic interpretation of deep learning models by inferring distributions over the networks' weights. They are often computationally very expensive, increasing the number of model parameters without significantly increasing model capacity. Performing inference in Bayesian neural networks is a difficult task, and approximations to the model posterior are often used, such as variational inference [14].

On the other hand, the already significant parameterization of convolutional network architectures leaves them particularly susceptible to over-fitting without large amounts of training data. A technique known as dropout is commonly used as a regularizer in convolutional neural networks to prevent overfitting and co-adaption of features [36]. During training with stochastic gradient descent, dropout randomly removes units within a network. By doing this it samples from a number of thinned networks with reduced width. At test time, standard dropout approximates the effect of averaging the predictions of all these thinned networks by using the weights of the unthinned network. This is referred to as weight averaging.

Gal and Ghahramani [13] have cast dropout as approximate Bayesian inference over the network's weights. [12] shows that dropout can be used at test time to impose a Bernoulli distribution over the convolutional net filters' weights, without requiring any additional model parameters. This is achieved by sampling the network with randomly dropped out units at test time. We can consider these as Monte Carlo samples obtained from the posterior distribution over models. This technique has seen success in modelling uncertainty for camera relocalisation [19]. Here we apply it to pixel-wise semantic segmentation.

We note that the probability distribution from Monte Carlo sampling is significantly different to the 'probabilities' obtained from a softmax classifier. The softmax function approximates relative probabilities between the class labels, but not an overall measure of the model's uncertainty [13]. Figure 3 illustrates these differences.

3. SegNet Architecture

We briefly review the SegNet architecture [3], which we modify to produce Bayesian SegNet. SegNet is a deep convolutional encoder-decoder architecture which consists of a sequence of non-linear processing layers (encoders) and a corresponding set of decoders followed by a pixel-wise classifier. Typically, each encoder consists of one or more convolutional layers with batch normalisation and a ReLU non-linearity, followed by non-overlapping max-pooling and sub-sampling. The sparse encoding due to the pooling process is upsampled in the decoder using the max-pooling indices from the encoding sequence. This has the important advantage of retaining class boundary details in the segmented images and also reducing the total number of model parameters. The model is trained end to end using stochastic gradient descent.

We take both SegNet [3] and a smaller variant termed SegNet-Basic [2] as our base models. SegNet's encoder is based on the 13 convolutional layers of the VGG-16 network [34], followed by 13 corresponding decoders. SegNet-Basic is a much smaller network with only four layers each for the encoder and decoder, with a constant feature size of 64. We use SegNet-Basic as a smaller model for our analysis since it conceptually mimics the larger architecture.

Figure 2: A schematic of the Bayesian SegNet architecture. This diagram shows the entire pipeline for the system, which is trained end-to-end in one step with stochastic gradient descent. The encoders are based on the 13 convolutional layers of the VGG-16 network [34], with the decoder placing them in reverse. The probabilistic output is obtained from Monte Carlo samples of the model with dropout at test time. We take the variance of these softmax samples as the model uncertainty for each class.

4. Bayesian SegNet

The technique we use to form a probabilistic encoder-decoder architecture is dropout [36], which we use as approximate inference in a Bayesian neural network [12]. We can therefore consider using dropout as a way of getting samples from the posterior distribution of models. Gal and Ghahramani [12] link this technique to variational inference in Bayesian convolutional neural networks with Bernoulli distributions over the network's weights. We leverage this method to perform probabilistic inference over our segmentation model, giving rise to Bayesian SegNet.

For Bayesian SegNet we are interested in finding the posterior distribution over the convolutional weights, W, given our observed training data X and labels Y:

p(W | X, Y).  (1)

In general, this posterior distribution is not tractable, therefore we need to approximate the distribution of these weights [9]. Here we use variational inference to approximate it [14]. This technique allows us to learn the distribution over the network's weights, q(W), by minimising the Kullback-Leibler (KL) divergence between this approximating distribution and the full posterior:

KL(q(W) || p(W | X, Y)).  (2)

Here, the approximating variational distribution q(W_i) for every K x K dimensional convolutional layer i, with units j, is defined as:

b_{i,j} ~ Bernoulli(p_i) for j = 1, ..., K_i,
W_i = M_i diag(b_i),  (3)

with b_i vectors of Bernoulli distributed random variables and variational parameters M_i. With this we obtain the approximate model of the Gaussian process in [12]. The dropout probabilities, p_i, could be optimised; however, we fix them to the standard probability of dropping a connection, 50%, i.e. p_i = 0.5 [36].

In [12] it was shown that minimising the cross entropy loss objective function has the effect of minimising the Kullback-Leibler divergence term. Therefore training the network with stochastic gradient descent will encourage the model to learn a distribution of weights which explains the data well while preventing over-fitting.

We train the model with dropout and sample the posterior distribution over the weights at test time using dropout to obtain the posterior distribution of softmax class probabilities. We take the mean of these samples for our segmentation prediction and use the variance to output model uncertainty for each class. We take the mean of the per-class variance measurements as an overall measure of model uncertainty. We also explored using the variation ratio as a measure of uncertainty (i.e. the percentage of samples which agree with the class prediction), however we found this to qualitatively produce a more binary measure of model uncertainty. Fig. 2 shows a schematic of the segmentation prediction and model uncertainty estimate process.
Figure 3: Comparison of uncertainty from Monte Carlo dropout with uncertainty from softmax regression: (a) input image, (b) semantic segmentation, (c) softmax uncertainty for the car class, (d) dropout uncertainty for the car class, and (e) dropout uncertainty across all classes. In (c-e), darker colour represents a larger value. This figure shows that softmax regression is only capable of inferring relative probabilities between classes. In contrast, dropout uncertainty can produce an estimate of absolute model uncertainty.
Method | Building | Tree | Sky | Car | Sign-Symbol | Road | Pedestrian | Fence | Column-Pole | Side-walk | Bicyclist | Class avg. | Global avg. | Mean I/U
SfM+Appearance [5] | 46.2 | 61.9 | 89.7 | 68.6 | 42.9 | 89.5 | 53.6 | 46.6 | 0.7 | 60.5 | 22.5 | 53.0 | 69.1 | n/a
Boosting [37] | 61.9 | 67.3 | 91.1 | 71.1 | 58.5 | 92.9 | 49.5 | 37.6 | 25.8 | 77.8 | 24.7 | 59.8 | 76.4 | n/a
Structured Random Forests [20] | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | 51.4 | 72.5 | n/a
Neural Decision Forests [29] | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | 56.1 | 82.1 | n/a
Local Label Descriptors [39] | 80.7 | 61.5 | 88.8 | 16.4 | n/a | 98.0 | 1.09 | 0.05 | 4.13 | 12.4 | 0.07 | 36.3 | 73.6 | n/a
Super Parsing [38] | 87.0 | 67.1 | 96.9 | 62.7 | 30.1 | 95.9 | 14.7 | 17.9 | 1.7 | 70.0 | 19.4 | 51.2 | 83.3 | n/a
Boosting+Detectors+CRF [22] | 81.5 | 76.6 | 96.2 | 78.7 | 40.2 | 93.9 | 43.0 | 47.6 | 14.3 | 81.5 | 33.9 | 62.5 | 83.8 | n/a
SegNet-Basic (layer-wise training [2]) | 75.0 | 84.6 | 91.2 | 82.7 | 36.9 | 93.3 | 55.0 | 37.5 | 44.8 | 74.1 | 16.0 | 62.9 | 84.3 | n/a
SegNet-Basic [3] | 80.6 | 72.0 | 93.0 | 78.5 | 21.0 | 94.0 | 62.5 | 31.4 | 36.6 | 74.0 | 42.5 | 62.3 | 82.8 | 46.3
SegNet [3] | 88.0 | 87.3 | 92.3 | 80.0 | 29.5 | 97.6 | 57.2 | 49.4 | 27.8 | 84.8 | 30.7 | 65.9 | 88.6 | 50.2
FCN 8 [25] | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | 64.2 | 83.1 | 52.0
DeconvNet [27] | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | 62.1 | 85.9 | 48.9
DeepLab-LargeFOV-DenseCRF [6] | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | 60.7 | 89.7 | 54.7
Bayesian SegNet models in this work:
Bayesian SegNet-Basic | 75.1 | 68.8 | 91.4 | 77.7 | 52.0 | 92.5 | 71.5 | 44.9 | 52.9 | 79.1 | 69.6 | 70.5 | 81.6 | 55.8
Bayesian SegNet | 80.4 | 85.5 | 90.1 | 86.4 | 67.9 | 93.8 | 73.8 | 64.5 | 50.8 | 91.7 | 54.6 | 76.3 | 86.9 | 63.1

Table 2: Quantitative results on CamVid [4], consisting of 11 road scene categories. Bayesian SegNet outperforms all other methods, including shallow methods which utilise depth, video and/or CRFs, and more contemporary deep methods. Particularly noteworthy are the significant improvements in accuracy for the smaller/thinner classes.
Weight averaging proposes to remove dropout at test time and scale the weights proportionally to the dropout percentage. Fig. 4 shows that Monte Carlo sampling with dropout performs better than weight averaging after approximately 6 samples. We also observe no additional performance improvement beyond approximately 40 samples. Therefore the weight averaging technique produces poorer segmentation results, in terms of global accuracy, in addition to being unable to provide a measure of model uncertainty.

Figure 4: Global segmentation accuracy against number of Monte Carlo samples for (a) SegNet-Basic and (b) SegNet. Results averaged over 5 trials, with two standard deviation error bars, are shown for the CamVid dataset. This shows that Monte Carlo sampling outperforms the weight averaging technique after approximately 6 samples. Monte Carlo sampling converges after around 40 samples, with no further significant improvement beyond this point.

4.3. Training and Inference

Following [3] we train SegNet with median frequency class balancing, using the formula proposed by Eigen and Fergus [10]. We use batch normalisation layers after every convolutional layer [17]. We compute batch normalisation statistics across the training dataset and use these at test time. We experimented with computing these statistics while using dropout sampling; however, we experimentally found that computing them with weight averaging produced better results.

We implement Bayesian SegNet using the Caffe library [18] and release the source code and trained models for public evaluation (an online demo and source code can be found on our project webpage, mi.eng.cam.ac.uk/projects/segnet/). We train the whole system end-to-end using stochastic gradient descent with a base learning rate of 0.001 and a weight decay parameter equal to 0.0005. We train the network until convergence, when we observe no further reduction in training loss.
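As an illustration of the class balancing scheme described above, the sketch below computes median frequency balancing weights in the spirit of Eigen and Fergus [10]; the helper name and the exact treatment of absent classes are our own assumptions, not taken from the released code.

```python
import numpy as np

def median_frequency_weights(label_maps, num_classes):
    """Median frequency class balancing in the spirit of Eigen and Fergus [10].

    freq(c) = pixels of class c / total pixels in images where c appears;
    weight(c) = median(freq) / freq(c), so rare classes such as pedestrians
    and sign-symbols receive larger loss weights. Assumes label values lie
    in [0, num_classes).
    """
    class_pixels = np.zeros(num_classes)
    image_pixels = np.zeros(num_classes)  # pixels in images containing class c
    for labels in label_maps:             # each label map is an integer array
        counts = np.bincount(labels.ravel(), minlength=num_classes)
        class_pixels += counts
        image_pixels[counts > 0] += labels.size
    freq = class_pixels / np.maximum(image_pixels, 1)
    median_freq = np.median(freq[freq > 0])
    # Classes never observed get weight 0 (an assumption of this sketch).
    return np.where(freq > 0, median_freq / np.maximum(freq, 1e-12), 0.0)
```

The resulting per-class weights would then scale the cross entropy loss at each pixel according to its ground truth class.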
5. Experiments

We quantify the performance of Bayesian SegNet on three different benchmarks using our Caffe implementation. Through this process we demonstrate the efficacy of Bayesian SegNet for a wide variety of scene segmentation tasks which have practical applications. CamVid [4] is a road scene understanding dataset which has applications for autonomous driving. SUN RGB-D [35] is a very challenging and large dataset of indoor scenes which is important for domestic robotics. Finally, Pascal VOC 2012 [11] is an RGB dataset for object segmentation.

5.1. CamVid

CamVid is a road scene understanding dataset with 367 training images and 233 testing images of day and dusk scenes [4]. The challenge is to segment 11 classes such as road, building, cars, pedestrians, signs, poles and side-walk. We resize images to 360x480 pixels for training and testing of our system.

Table 2 shows our results and compares them to previous benchmarks. We compare to methods which utilise depth and motion cues. Additionally we compare to other prominent deep learning architectures. Bayesian SegNet obtains the highest overall class average and mean intersection over union score by a significant margin. We set a new benchmark on 7 out of the 11 classes. Qualitative results can be viewed in Fig. 5.

5.2. Scene Understanding (SUN)

SUN RGB-D [35] is a very challenging and large dataset of indoor scenes with 5285 training and 5050 testing images. The images are captured by different sensors and hence come in various resolutions. The task is to segment 37 indoor scene classes including wall, floor, ceiling, table, chair and sofa. This task is difficult because object classes come in various shapes and sizes and in different poses, with frequent partial occlusions. These factors make this one of the hardest segmentation challenges. For our model, we resize the input images for training and testing to 224x224 pixels. Note that we only use RGB input to our system. Using the depth modality would necessitate architectural modifications and careful post-processing to fill in missing depth measurements. This is beyond the scope of this paper.

Table 3 shows our results on this dataset compared to other methods. Bayesian SegNet outperforms all previous benchmarks, including those which use the depth modality. We also note that an earlier benchmark dataset, NYUv2 [33], is included as part of this dataset, and Table 4 shows our evaluation on this subset. Qualitative results can be viewed in Fig. 6.

Table 3: SUN Indoor Scene Understanding. Quantitative comparison on the SUN RGB-D dataset [35], which consists of 5050 test images of indoor scenes with 37 classes. SegNet RGB-based predictions have a high global accuracy and outperform all previous benchmarks, including those which use the depth modality.

Table 4: NYU v2. Results for the NYUv2 RGB-D dataset [33], which consists of 654 test images. Bayesian SegNet is the top performing RGB method.

5.3. Pascal VOC

The Pascal VOC12 segmentation challenge [11] consists of segmenting 20 salient object classes from a widely varying background class. For our model, we resize the input images for training and testing to 224x224 pixels. We train on the 12031 training images and evaluate on the 1456 test images, with scores computed remotely on a test server. Table 5 shows our results compared to other methods, with qualitative results in Fig. 9.

Method | Parameters (millions) | Pascal VOC Test IoU (non-Bayesian) | Pascal VOC Test IoU (Bayesian)
Dilation Network [40] | 140.8 | 71.3 | 73.1
FCN-8 [25] | 134.5 | 62.2 | 65.4
SegNet [3] | 29.45 | 59.1 | 60.5

Table 5: Pascal VOC12 [11] test results evaluated from the online evaluation server. We compare to competing deep learning architectures. Bayesian SegNet is considerably smaller but achieves a competitive accuracy to other methods. We also evaluate FCN [25] and Dilation Network (front end) [40] with Monte Carlo dropout sampling. We observe an improvement in segmentation performance across all three deep learning models when using the Bayesian approach, demonstrating the general applicability of this method. Additional results are available on the leaderboard: host.robots.ox.ac.uk:8080/leaderboard
Figure 5: Bayesian SegNet results on CamVid road scene understanding dataset [4]. The top row is the input image, with the ground
truth shown in the second row. The third row shows Bayesian SegNet’s segmentation prediction, with overall model uncertainty, averaged
across all classes, in the bottom row (with darker colours indicating more uncertain predictions). In general, we observe high quality
segmentation, especially on more difficult classes such as poles, people and cyclists. Where SegNet produces an incorrect class label we
often observe a high model uncertainty.
Figure 6: Bayesian SegNet results on the SUN RGB-D indoor scene understanding dataset [35]. The top row is the input image, with
the ground truth shown in the second row. The third row shows Bayesian SegNet’s segmentation prediction, with overall model uncertainty,
averaged across all classes, in the bottom row (with darker colours indicating more uncertain predictions). Bayesian SegNet uses only RGB
input and is able to accurately segment 37 classes in this challenging dataset. Note that often parts of an image do not have ground truth
labels and these are shown in black colour.
Table 6: Pixel-wise classification accuracy for varying levels of model confidence.

Confidence percentile | CamVid | SUN RGB-D
90 | 99.7 | 97.6
50 | 98.5 | 92.3
10 | 89.5 | 79.0
0 | 86.7 | 75.4
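A sketch of how such a confidence-percentile evaluation can be computed is given below; the function and its inputs (flattened per-pixel uncertainty, predictions, and ground truth over a test set) are our own illustration of how we read Table 6, not the paper's evaluation code.

```python
import numpy as np

def accuracy_above_confidence_percentile(uncertainty, predictions, labels, percentile):
    """Pixel-wise accuracy over the most confident pixels (cf. Table 6).

    Keeps the pixels whose model uncertainty lies below the threshold for
    the given confidence percentile; percentile 0 keeps every pixel and
    reduces to the overall global accuracy.
    """
    threshold = np.percentile(uncertainty, 100 - percentile)
    confident = uncertainty <= threshold
    return np.mean(predictions[confident] == labels[confident])

# For example, the 90th confidence percentile scores only the 10% of
# pixels the model is most certain about:
# acc = accuracy_above_confidence_percentile(u, pred, gt, percentile=90)
```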
Table 7: Class accuracy of Bayesian SegNet predictions for the 37 indoor scene classes in the SUN RGB-D benchmark dataset [35], comparing SegNet [3] with Bayesian SegNet.
We compare each class's frequency with its mean model uncertainty in Fig. 8. Uncertainty is calculated as the mean uncertainty value for each pixel of that class in a test dataset. We observe an inverse relationship between uncertainty and class accuracy or class frequency. This shows that the model is more confident about classes which are easier or occur more often, and less certain about rare and challenging classes.

Figure 8: Bayesian SegNet class frequency compared to mean model uncertainty for each class in the CamVid road scene understanding dataset. This figure shows that there is a strong inverse relationship between model uncertainty and the frequency at which a class label appears in the dataset. The classes that Bayesian SegNet is more confident at are more prevalent in the dataset. Conversely, for the rarer classes such as Sign Symbol and Bicyclist, Bayesian SegNet has a much higher model uncertainty.

Additionally, Table 6 shows segmentation accuracies for varying levels of confidence. We observe very high levels of accuracy for values of model uncertainty above the 90th percentile across each dataset. This demonstrates that the model's uncertainty is an effective measure of confidence in prediction.

5.6. Real Time Performance

Table 5 shows that SegNet and Bayesian SegNet maintain a far lower parameterisation than their competitors. Monte Carlo sampling requires additional inference time; however, if model uncertainty is not required, then the weight averaging technique can be used to remove the need for sampling (Fig. 4 shows the performance drop is modest). Our implementation can run SegNet at 35ms per frame and Bayesian SegNet with 10 Monte Carlo samples at 90ms per frame on a Titan X GPU. However, inference time will depend on the implementation.

6. Conclusions

We have presented Bayesian SegNet, the first probabilistic framework for semantic segmentation using deep learning, which outputs a measure of model uncertainty for each class. We show that the model is uncertain at object boundaries and with difficult and visually ambiguous objects. We quantitatively show that Bayesian SegNet produces a reliable measure of model uncertainty and is very effective when modelling smaller datasets. Bayesian SegNet outperforms shallow architectures which use motion and depth cues, and other deep architectures. We obtain the highest performing results on the CamVid road scenes and SUN RGB-D indoor scene understanding datasets. We show that the segmentation model can be run in real time on a GPU. For future work we intend to explore how video data can improve our model's scene understanding performance.
References

[1] V. Badrinarayanan, F. Galasso, and R. Cipolla. Label propagation in video sequences. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3265–3272. IEEE, 2010.
[2] V. Badrinarayanan, A. Handa, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293, 2015.
[3] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
[4] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.
[5] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In Computer Vision–ECCV 2008, pages 44–57. Springer, 2008.
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014.
[7] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 1996.
[8] C. Couprie, C. Farabet, L. Najman, and Y. LeCun. Indoor semantic segmentation using depth information. arXiv preprint arXiv:1301.3572, 2013.
[9] J. Denker and Y. LeCun. Transforming neural-net output levels to probability distributions. In Advances in Neural Information Processing Systems 3. Citeseer, 1991.
[10] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. arXiv preprint arXiv:1411.4734, 2014.
[11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[12] Y. Gal and Z. Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv:1506.02158, 2015.
[13] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv:1506.02142, 2015.
[14] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
[15] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In Computer Vision–ECCV 2014, pages 345–360. Springer, 2014.
[16] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. arXiv preprint arXiv:1411.5752, 2014.
[17] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[19] A. Kendall and R. Cipolla. Modelling uncertainty in deep learning for camera relocalization. arXiv preprint arXiv:1509.05909, 2015.
[20] P. Kontschieder, S. Rota Bulò, H. Bischof, and M. Pelillo. Structured class-labels in random forests for semantic image labelling. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2190–2197. IEEE, 2011.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[22] L. Ladický, P. Sturgess, K. Alahari, C. Russell, and P. H. Torr. What, where and how many? Combining object detectors and CRFs. In Computer Vision–ECCV 2010, pages 424–437. Springer, 2010.
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014, pages 740–755. Springer, 2014.
[24] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman. SIFT Flow: Dense correspondence across different scenes. In Computer Vision–ECCV 2008, pages 28–42. Springer, 2008.
[25] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1411.4038, 2014.
[26] D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
[27] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. arXiv preprint arXiv:1505.04366, 2015.
[28] X. Ren, L. Bo, and D. Fox. RGB-(D) scene labeling: Features and algorithms. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2759–2766. IEEE, 2012.
[29] S. Rota Bulò and P. Kontschieder. Neural decision forests for semantic image labelling. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 81–88. IEEE, 2014.
[30] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[31] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 56(1):116–124, 2013.
[32] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision, 81(1):2–23, 2009.
[33] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In Computer Vision–ECCV 2012, pages 746–760. Springer, 2012.
[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[35] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015.
[36] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[37] P. Sturgess, K. Alahari, L. Ladický, and P. H. Torr. Combining appearance and structure from motion features for road scene understanding. In BMVC, volume 1, page 6, 2009.
[38] J. Tighe and S. Lazebnik. Superparsing. International Journal of Computer Vision, 101(2):329–349, 2013.
[39] Y. Yang, Z. Li, L. Zhang, C. Murphy, J. Ver Hoeve, and H. Jiang. Local label descriptor for example based semantic image labeling. In Computer Vision–ECCV 2012, pages 361–375. Springer, 2012.
[40] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[41] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pages 818–833. Springer, 2014.
[42] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. arXiv preprint arXiv:1502.03240, 2015.