3D Graph Neural Networks for RGBD Semantic Segmentation
Xiaojuan Qi† Renjie Liao‡,§ Jiaya Jia†,♭ Sanja Fidler‡ Raquel Urtasun§,‡
† The Chinese University of Hong Kong   ‡ University of Toronto
§ Uber Advanced Technologies Group   ♭ Youtu Lab, Tencent
{xjqi, leojia}@cse.cuhk.edu.hk {rjliao, fidler, urtasun}@cs.toronto.edu
Abstract
RGBD semantic segmentation requires joint reasoning
about 2D appearance and 3D geometric information. In
this paper we propose a 3D graph neural network (3DGNN)
that builds a k-nearest neighbor graph on top of 3D point
cloud. Each node in the graph corresponds to a set of points
and is associated with a hidden representation vector ini-
tialized with an appearance feature extracted by a unary
CNN from 2D images. Relying on recurrent functions, every
node dynamically updates its hidden representation based
on the current status and incoming messages from its neigh-
bors. This propagation model is unrolled for a certain num-
ber of time steps and the final per-node representation is
used for predicting the semantic class of each pixel. We
use back-propagation through time to train the model. Extensive experiments on NYUD2 and SUN-RGBD datasets demonstrate the effectiveness of our approach.

Figure 1. 2D and 3D context. The solid lines indicate neighbors in 3D while the dotted lines are for neighbors in 2D but not in 3D. (a) Input image. (b) 2D image projected into the 3D point cloud. (c) Prediction by the two-stream CNN with HHA encoding [29]. (d) Our 3DGNN prediction.
1. Introduction
The advent of depth sensors makes it possible to perform RGBD semantic segmentation along with many applications in virtual reality, robotics and human-computer interaction. Compared with the more common 2D image setting, RGBD semantic segmentation can utilize real-world geometric information by exploiting depth information. For example, in Fig. 1(a), given the 2D image alone, the local neighborhood of the red point located on the table inevitably includes microwave and counter pixels. However, in 3D, there is no such confusion because these points are distant in the 3D point cloud, as shown in Fig. 1(b).

Several methods [11, 29, 21, 7] treat RGBD segmentation as a 2D segmentation problem where depth is taken as another input image. Deep convolutional neural networks (CNNs) are applied separately to the color and depth images to extract features. These methods need two CNNs, which doubles computation and memory consumption. Possible errors stem from missing part of the geometric context information, since 2D pixels are associated with 3D ones in the real world. For example, in Fig. 1(c), the two-network model [29] classifies the table as part of the counter.

An alternative is to use 3D CNNs [37] in voxelized 3D space. This type of method has the potential to extract more geometric information. However, since 3D point clouds are quite sparse, effective representation learning from such data is challenging. In addition, 3D CNNs are computationally more expensive than their 2D version, thus it is difficult to scale up these systems to deal with a large number of classes. Anisotropic convolutional neural networks [1, 30] provide a promising way to learn filters in non-Euclidean space for shape analysis. Yet they face the same difficulty of scaling up to perform large-scale RGBD dense semantic segmentation due to the complex association of points.

To tackle the challenges above, we propose an end-to-end 3D graph neural network, which directly learns its representation from 3D points.
Figure 2. Overview of our 3D graph neural network. The top part of the figure shows the 3D point cloud and a close-up of the graph constructed from the point cloud. Blue points and the associated black dotted lines represent nodes and edges that exist in the graph constructed from the 2D image. It is clear that a graph built from the 3D point cloud encodes geometric information that is hard to infer from the 2D image. In the bottom part, we show the sub-graph connected to the red point as an example to illustrate the propagation process. We highlight the source of the messages the red point receives at different time steps using yellow edges.
We first cast the 2D pixels into 3D based on depth information and associate with each 3D point a unary feature vector, i.e., an output of a 2D segmentation CNN. We then build a graph whose nodes are these 3D points and whose edges are constructed by finding the nearest neighbors in 3D. For each node, we take the image feature vector as the initial representation and iteratively update it using a recurrent function. The key idea of this dynamic computation scheme is that the node state is determined by its history state and the messages sent by its neighbors, while taking both appearance and 3D information into consideration.

We use the final state of each node to perform per-node classification. The back-propagation through time (BPTT) algorithm is adopted to compute gradients of the graph neural network. Further, we pass the gradients to the unary CNN to facilitate end-to-end training. Our experimental results show state-of-the-art performance on the challenging NYUD2 and SUN-RGBD datasets.

2. Related Work

2D Semantic Segmentation. Fully convolutional networks (FCN) [29] have demonstrated effectiveness in performing semantic segmentation. In fact, most of the following work [3, 45, 44, 26, 25, 46, 28, 43, 4] is built on top of FCN. Chen et al. [3] used dilated convolutions to enlarge the receptive field of the network while retaining dense prediction. Conditional random fields (CRFs) have been applied as post-processing [3] or have been integrated into the network [45] to refine prediction boundaries. Recently, global and local context has been modeled in scene parsing [42, 27]. In [44, 27], the context is incorporated with pyramid pooling [12]. Liang et al. [22, 23] tackled semantic segmentation as sequence prediction and used LSTMs to capture local and global dependencies. Graph LSTM [22] was used to model structured data. However, the update procedure and sequential processing make it hard to scale up the system to large graphs.

RGBD Semantic Segmentation. Compared to the 2D setting, RGBD semantic segmentation has the benefit of exploiting more geometric information. Several methods have encoded the depth map as an image. In [11, 29, 21], depth information forms three channels via HHA encoding: horizontal disparity, height above ground, and the angle between the local surface normal and the gravity direction. In [7], the depth image is simply treated as a one-channel image. FCNs were then applied to extract semantic features directly on the encoded images.

In [11, 24], a set of 2.5D region proposals is first generated. Each proposal is then represented by its RGB image and the encoded HHA image. Two CNNs were used to extract features separately, which are finally concatenated and passed as input to SVM classification. Besides high computation, separate region proposal generation and label assignment make these systems fragile. The final classification stage can be influenced by errors produced in the proposal stage. Long et al. [29] applied FCN to RGB and HHA images separately and fused the scores for final prediction. Eigen et al. [7] proposed a global-to-local strategy to combine different levels of prediction, which simply extracts features via CNNs from the depth image. The extracted feature is again concatenated with the image one for prediction. Li et al. [21] used LSTMs to fuse the HHA image and color information. These methods all belong to the category that uses 2D CNNs to extract depth features.
Alternatively, several approaches deal with 2.5D data using 3D voxel networks [37, 41, 36]. Song et al. [37] used a 3D dilated voxel convolutional neural network to learn the semantics and occupancy of each voxel. These methods take better advantage of 3D context. However, scaling up to deal with high-resolution and complex scenes is challenging since 3D voxel networks are computationally expensive. Further, quantization of the 3D space can lead to additional errors. Other methods [1, 30] learned non-Euclidean filters for shape analysis. They typically rely on well-defined point associations, e.g., meshes, which are not readily available for complex RGBD segmentation data.

Graph neural networks. In terms of the structure of neural networks, there has been effort to generalize neural networks to graph data. One direction is to apply Convolutional Neural Networks (CNNs) to graphs. In [2, 5, 18], CNNs are employed in the spectral domain relying on the graph Laplacian, while [6] used hash functions so that CNNs can be applied to graphs. Another direction is to recurrently apply neural networks to every node of the graph [9, 33, 20, 39], producing "Graph Neural Networks". This model includes a propagation process, which resembles message passing in graphical models [8]. The learning process of such a model can be achieved by the back-propagation through time (BPTT) algorithm.

3. Graph Neural Networks

A graph neural network associates each node v with a hidden state vector h_v^t that is updated iteratively. At every time step, the propagation model is

m_v^t = M({h_u^t | u ∈ Ω_v}),
h_v^{t+1} = F(h_v^t, m_v^t),        (1)

where m_v^t is a vector that indicates the aggregation of messages that node v receives from its neighbors Ω_v, M is a function to compute the message, and F is the function to update the hidden state. Similar to a recurrent neural network, M and F are feedforward neural networks that are shared among different time steps. Simple choices of M and F are an element-wise summation function and a fully connected layer, respectively. Note that these update functions specify a propagation model of information inside the graph. It is also possible to incorporate more information from the graph with different types of edges using multiple M.

Inference is performed by executing the above propagation model for a certain number of steps. The final prediction can be at the node or at the graph level depending on the task. For example, one can feed the hidden representation (or an aggregation of it) to another neural network to perform node (or graph) classification.

Graph Neural Networks are closely related to many existing models, such as conditional random fields (CRFs) and recurrent neural networks (RNNs). We discuss them next. We focus on pairwise CRFs but note that the connection extends to higher-order models.

Loopy Belief Propagation Inference. We are given a pairwise (often cyclic in practice) CRF whose conditional distribution factorizes as log P(Y|I) ∝ −∑_{i∈V} φ_u(y_i|I) − ∑_{(i,j)∈E} φ_p(y_i, y_j|I), with Y = {y_i | i ∈ V} the set of all labels, I the set of all observed image pixels, and φ_u and φ_p the unary and pairwise potentials, respectively. One fundamental algorithm for approximate inference in general MRFs/CRFs is loopy belief propagation (BP) [31, 8]. The propagation process is denoted as

β_{i→j} = ∑_{y_i} exp{−φ_u(y_i) − φ_p(y_i, y_j)} ∏_{k∈Ω_i\j} β_{k→i}.        (2)

Mean Field Inference. Mean field inference defines an approximate distribution Q(Y) = ∏_i Q(y_i) and minimizes the KL-divergence KL(Q‖P). The fixed-point propagation equations characterize the stationary points of the KL-divergence as

Q(y_i) = (1/Z_i) exp{−φ_u(y_i) − ∑_{j∈Ω_i} E_{Q(y_j)}[φ_p(y_j, y_i)]},        (3)

where Z_i is a normalizing constant and Ω_i is the neighborhood of node i. This fixed-point iteration converges to a local minimum [8]. From Eqs. (1) and (3), it is clear that
the mean field propagation is a special case of graph neural networks. The hidden representation of node i is just the approximate distribution Q(y_i); M and F are the negation of element-wise summation and the softmax function, respectively; and E_{Q(y_j)}[φ_p(y_j, y_i)] is the message sent from node j to node i. While the messages of CRFs lie in the space of output labels y, GNNs have messages m_v^t in the space of hidden representations. GNNs are therefore more flexible in terms of information propagation, as the dimension of the hidden space can be much larger than that of the label space.
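To make this correspondence concrete, the following toy sketch implements the mean field update of Eq. (3) in NumPy; the potentials and neighbor lists are illustrative assumptions, and each iteration has exactly the message-then-update structure of Eq. (1).

```python
import numpy as np

def mean_field_step(Q, unary, pairwise, neighbors):
    """Q: (N, L) current marginals; unary: (N, L) potentials phi_u;
    pairwise: (L, L) potentials phi_p; neighbors: list of neighbor index lists."""
    new_Q = np.empty_like(Q)
    for i in range(len(Q)):
        # "message": sum_j E_{Q(y_j)}[phi_p(y_j, y_i)] over the neighbors of node i
        msg = sum(Q[j] @ pairwise for j in neighbors[i])
        logits = -unary[i] - msg
        new_Q[i] = np.exp(logits - logits.max())   # softmax-style "update" of Eq. (3)
        new_Q[i] /= new_Q[i].sum()                 # normalize by Z_i
    return new_Q
```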
Figure 3. Mean IoU (%) with different numbers of propagation steps for the RNN and LSTM update functions.
4. 3DGNN for RGBD Semantic Segmentation

In this section, we propose a special GNN to tackle the problem of RGBD semantic segmentation.

4.1. Graph Construction

Given an image, we construct a directed graph based on the 2D positions and the depth information of the pixels. Let [x, y, z] be the 3D coordinates of a point in the camera coordinate system and let [u, v] be its projection onto the image according to the pinhole camera model. The geometry of perspective projection yields

x = (u − c_x) · z / f_x,
y = (v − c_y) · z / f_y,        (4)

where f_x and f_y are the focal lengths along the x and y directions, and c_x and c_y are the coordinates of the principal point. To form our graph, we regard each pixel as a node and connect it via directed edges to its K nearest neighbors (KNN) in 3D space, where K is set to 64 in our experiments. Note that this process creates an asymmetric structure, i.e., an edge from A to B does not necessarily imply the existence of an edge from B to A. We visualize the graph in Fig. 2.
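To make the construction concrete, the sketch below back-projects a depth map with Eq. (4) and builds the directed KNN graph. It assumes NumPy/SciPy and is only an illustration, not the implementation used in our experiments; handling of missing depth values is omitted.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_3d_knn_graph(depth, fx, fy, cx, cy, K=64):
    """depth: (H, W) depth map in metres; fx, fy, cx, cy: camera intrinsics.
    Returns the 3D points (H*W, 3) and, for each node, the indices of its K
    nearest neighbors in 3D (H*W, K); each (node, neighbor) pair is a directed edge."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    x = (u - cx) * depth / fx                        # back-projection, Eq. (4)
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    tree = cKDTree(points)
    _, idx = tree.query(points, k=K + 1)             # K+1: the closest point is the node itself
    neighbors = idx[:, 1:]                           # drop self, keep K directed edges per node
    return points, neighbors
```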
4.2. Propagation Model

After constructing the graph, we use a CNN as the unary model to compute the features for each pixel. We provide details of this unary CNN in Section 5. The feature vector of each pixel initializes the hidden state h_v^0 of the corresponding node, and the hidden states are then updated as

m_v^t = (1/|Ω_v|) ∑_{u∈Ω_v} g(h_u^t),
h_v^{t+1} = F(h_v^t, m_v^t),        (5)

where g is a multi-layer perceptron (MLP). Unless otherwise specified, all instances of MLP that we employ have one layer with ReLU [19] as the nonlinearity. At each time step, every node collects messages from its neighbors. The message is computed by first feeding the hidden states to the MLP g and then taking the average over the neighborhood. Then every node updates its hidden state based on its previous state and the aggregated message. This process is shown in Fig. 2. We consider two choices of the update function F.

Vanilla RNN Update. We can use an MLP as the update function:

h_v^{t+1} = q([h_v^t, m_v^t]),        (6)

where we concatenate the hidden state and the message before feeding them to the MLP q. This type of update function is common in vanilla RNNs.

LSTM Update. Another choice is to use a long short-term memory (LSTM) [15] cell. This is more powerful since it maintains its own memory, which helps extract useful information from incoming messages.
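A minimal sketch of one propagation step (Eqs. (5) and (6)) is given below; it assumes PyTorch and a dense (N, K) neighbor-index tensor from the KNN graph, and is illustrative rather than our exact implementation. Replacing q with an LSTM cell that takes the message as input and h_v as its state gives the LSTM update.

```python
import torch
import torch.nn as nn

class PropagationStep(nn.Module):
    """One step of Eq. (5) with the vanilla RNN update of Eq. (6)."""
    def __init__(self, dim):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())      # message MLP g
        self.q = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())  # update MLP q

    def forward(self, h, neighbors):
        # h: (N, dim) hidden states; neighbors: (N, K) long tensor of neighbor indices
        msg = self.g(h)[neighbors].mean(dim=1)       # average g(h_u) over the neighborhood
        return self.q(torch.cat([h, msg], dim=1))    # update on the concatenated [h_v, m_v]
```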
4.3. Prediction Model

Assuming the propagation model in Eq. (5) is unrolled for T steps, we now predict the semantic label for each pixel in the score map. For node v corresponding to a pixel in the score map, we predict the probability over semantic classes y_v as follows:

p_{y_v} = s([h_v^T, h_v^0]),        (7)

where s is an MLP with a softmax layer shared by all nodes. Note that we concatenate the initial hidden state, which is the output of the unary CNN, to capture the 2D appearance information. We finally associate a softmax cross-entropy loss function with each node and train the model with the back-propagation through time (BPTT) algorithm.
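The sketch below (again assuming PyTorch; the function names are illustrative) shows the classifier of Eq. (7) and the per-node cross-entropy training signal; back-propagating through the unrolled steps yields BPTT.

```python
import torch
import torch.nn as nn

class NodeClassifier(nn.Module):
    """Shared classifier s of Eq. (7) applied to the concatenation [h_v^T, h_v^0]."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.s = nn.Linear(2 * dim, num_classes)

    def forward(self, h_T, h_0):
        return self.s(torch.cat([h_T, h_0], dim=1))  # per-node class logits

def segmentation_loss(step, classifier, h0, neighbors, labels, T):
    """Unroll the propagation model for T steps, then apply softmax cross-entropy per node."""
    h = h0
    for _ in range(T):
        h = step(h, neighbors)                       # propagation model, Eq. (5)
    logits = classifier(h, h0)                       # prediction model, Eq. (7)
    return nn.functional.cross_entropy(logits, labels)
```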
5. Experiments

Datasets. We evaluate our method on two popular RGBD datasets: NYUD2 [34] and SUN-RGBD [35]. NYUD2 contains a total of 1,449 RGBD image pairs from 464 different scenes. The dataset is divided into 795 images from 249 scenes for training and 654 images from 215 scenes for testing. We randomly split 49 scenes from the training set as the validation set, which contains 167 images. The remaining 628 images from 200 scenes are used as the training set. SUN-RGBD consists of 10,335 images, which are divided into 5,285 RGBD image pairs for training and 5,050 for testing. All our hyperparameter search and ablation studies are performed on the NYUD2 validation set.

Unary CNN. For most of the ablation experiments, we use a modified VGG-16 network, i.e., Deeplab-LargeFOV [3] with dilated convolutions, as our unary CNN to extract appearance features from the 2D images. We use the fc7 feature map. The output feature map is of size H × W × C, where H, W and C are the height, width and channel size, respectively. Note that due to the stride and pooling of this network, H and W are 1/8 of the original input size. Therefore, our 3D graph is built on top of the downsampled feature maps.

To further incorporate contextual information, we use global pooling [27] to compute another C-dimensional vector from the feature map. We then append this vector to all spatial positions, which results in an H × W × 2C feature map. In our experiments, C = 1024 and a 1 × 1 convolution layer is used to further reduce the dimension to 512. We also experimented with replacing the VGG network with ResNet-101 [14] or combining it with the HHA encoding.
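A sketch of this context step (PyTorch is an assumption here): global average pooling yields a C-dimensional vector that is appended to every spatial position, followed by a 1 × 1 convolution reducing the 2C channels to 512.

```python
import torch
import torch.nn as nn

def append_global_context(feat, reduce_conv):
    # feat: (B, C, H, W) unary CNN feature map
    B, C, H, W = feat.shape
    ctx = feat.mean(dim=(2, 3), keepdim=True)   # global pooling -> (B, C, 1, 1)
    ctx = ctx.expand(B, C, H, W)                # copy the context vector to every position
    fused = torch.cat([feat, ctx], dim=1)       # (B, 2C, H, W)
    return reduce_conv(fused)                   # 1x1 conv: 2C -> 512 channels

reduce_conv = nn.Conv2d(2 * 1024, 512, kernel_size=1)  # C = 1024 in our experiments
```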
Implementation Details. We initialize the unary CNN from the pre-trained VGG network of [3]. We use SGD with momentum to optimize the network and clip the norm of the gradients such that it is not larger than 10. The initial learning rates of the pre-trained unary CNN and the GNN are 0.001 and 0.01, respectively. Momentum is set to 0.9. We initialize the RNN and LSTM update functions of the graph neural network with the MSRA method [13]. We randomly scale the images within the range [0.5, 2] and randomly crop 425×425 patches. For multi-scale testing, we use three scales: 0.8, 1.0 and 1.2. In the ResNet-101 experiment, we modified the network by reducing the overall stride to 8 and by adding dilated convolutions to enlarge the receptive field.
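The optimization setup can be summarized by the following sketch; PyTorch and the module names unary_cnn and gnn are placeholders for illustration only, not our released code.

```python
import torch
import torch.nn as nn

unary_cnn = nn.Conv2d(3, 64, 3, padding=1)  # placeholder for the pre-trained VGG unary CNN
gnn = nn.Linear(64, 64)                     # placeholder for the graph neural network

optimizer = torch.optim.SGD(
    [{"params": unary_cnn.parameters(), "lr": 0.001},   # pre-trained unary CNN
     {"params": gnn.parameters(), "lr": 0.01}],         # graph neural network
    lr=0.001, momentum=0.9)

def training_step(loss):
    optimizer.zero_grad()
    loss.backward()
    params = list(unary_cnn.parameters()) + list(gnn.parameters())
    nn.utils.clip_grad_norm_(params, max_norm=10)        # clip gradient norm at 10
    optimizer.step()
```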
We adopt two common metrics to evaluate our method: mean accuracy and mean intersection-over-union (IoU).
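For clarity, both metrics can be computed from a class confusion matrix as in the sketch below (NumPy assumed; the exact evaluation protocol follows the standard benchmarks).

```python
import numpy as np

def mean_acc_and_iou(conf):
    """conf: (C, C) confusion matrix with conf[i, j] = #pixels of class i predicted as j."""
    tp = np.diag(conf).astype(float)
    gt = conf.sum(axis=1)                  # pixels per ground-truth class
    pred = conf.sum(axis=0)                # pixels per predicted class
    valid = gt > 0                         # ignore classes absent from the ground truth
    mean_acc = (tp[valid] / gt[valid]).mean()
    iou = tp[valid] / (gt[valid] + pred[valid] - tp[valid])
    return mean_acc, iou.mean()
```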
5.1. Comparison with State-of-the-art

In our comparison with other methods, we use the vanilla RNN update function in all our experiments due to its efficiency and good performance. We defer the thorough ablation study to Section 5.2.

NYUD2 dataset. We first compare with other methods in the NYUD2 40-class and 37-class settings. As shown in Tables 1 and 2, our model achieves very good performance in both settings. Note that Long et al. [29] and Eigen et al. [7] both used two VGG networks with HHA image/depth encoding, whereas we only use one VGG network to extract appearance features. The configuration of Lin et al. [26] is a bit different since it only takes the color image as input and builds a complicated model that involves several VGG networks to extract image features.

Results in these tables also reveal that by combining VGG with the HHA encoding features [29] as the unary model, our method further improves in performance.

model                        mean IoU%   mean acc%
Gupta et al. [11] (2014)     28.6        35.1
Long et al. [29] (2015)      34.0        46.1
Eigen et al. [7] (2015)      34.1        45.1
Lin et al. [26] + ms (2016)  40.6        53.6
HHA + ss                     40.8        54.6
3DGNN + ss                   39.9        54.0
3DGNN + ms                   41.7        55.4
HHA-3DGNN + ss               42.0        55.2
HHA-3DGNN + ms               43.1        55.7

Table 1. Comparison with state-of-the-art methods on the NYUD2 test set in the 40-class setting. "HHA" means combining the HHA feature [29]. "ss" and "ms" indicate single- and multi-scale testing.

model                         mean IoU%   mean acc%
Silberman et al. [34] (2012)  -           17.5
Ren et al. [32] (2012)        -           20.2
Gupta et al. [10] (2015)      -           30.2
Wang et al. [40] (2015)       -           29.2
Khan et al. [17] (2016)       -           43.9
Li et al. [21] (2016)         -           49.4
3DGNN + ss                    43.6        57.1
3DGNN + ms                    45.4        59.5

Table 2. Comparison with state-of-the-art methods on the NYUD2 test set in the 37-class setting. "ss" and "ms" indicate single- and multi-scale testing.

SUN-RGBD dataset. We also compare these methods on SUN-RGBD in Table 3. The performance difference is significant. Note that Li et al. [21] also adopted the Deeplab-LargeFOV network for extracting image features and a separate network for HHA-encoded depth feature extraction. Our single 3DGNN model already outperforms previous ones by a large margin. Combining HHA features or replacing the VGG network with ResNet-101 further boosts the performance. These gains showcase that our method is effective in encoding 3D geometric context.

model                       mean IoU%   mean acc%
Song et al. [35] (2015)     -           36.3
Kendall et al. [16] (2015)  -           45.9
Li et al. [21] (2016)       -           48.1
HHA + ss                    41.7        52.3
ResNet-101 + ss             42.7        53.5
3DGNN + ss                  40.2        52.5
3DGNN + ms                  42.3        54.6
HHA-3DGNN + ss              42.0        55.2
HHA-3DGNN + ms              43.1        55.7
ResNet-101-3DGNN + ss       44.1        55.7
ResNet-101-3DGNN + ms       45.9        57.0

Table 3. Comparison with other methods on the SUN-RGBD test set. "ResNet-101" exploits ResNet-101 as the unary model. "HHA" denotes a combination of the RGB image feature with the HHA image feature [29]. "ss" and "ms" indicate single-scale and multi-scale testing.

5.2. Ablation Study

In this section, we conduct an ablation study on the NYUD2 validation set to verify the functionality of different parts of our model.

Propagation Steps. We first calculate statistics of the constructed 3D graphs. The average diameter of all graphs is 21; it corresponds to the average number of propagation steps needed to traverse a graph. The average distance between any pair of nodes is 7.9. We investigate how the number of propagation steps affects the performance of our model in Fig. 3. The performance, i.e., mean IoU, gradually improves as the number of propagation steps increases. The oscillation when the number of propagation steps is large might relate to the optimization process. We found that 3 to 6 propagation steps produce reasonably good results. We also show segmentation maps obtained with different numbers of propagation steps in Fig. 4. Limited by the receptive field size, the unary CNN often yields wrong predictions when the objects are too large. For example, in the first row of Fig. 4, the table is confused with the counter. With 4 propagation steps, our prediction of the table becomes much more accurate.

Figure 4. Influence of different propagation steps on the NYUD2 validation set. (a) Original image; (b) ground truth; (c) unary CNN; (d) propagation step 1; (e) propagation step 4.

Update Equation. Here we compare the two update equations described in Section 4.2. As shown in Fig. 3, the vanilla RNN performs similarly to the LSTM, while the computational complexity of the LSTM update is much larger than that of the vanilla RNN. Based on this finding, we stick to the vanilla RNN update in all our experiments.

2D vs. 3D Graph. To investigate how much improvement the 3D graph additionally brings, we compare with 2D graphs that are built on 2D pixel positions with the same KNN method. We conduct experiments using the same graph neural network and show the performance with different propagation steps in Table 4. Results on the whole test set are shown in Table 5. They indicate that with 3DGNN, more 3D geometric context is captured, which in turn makes prediction more accurate. Another interesting observation is that even the simple 2DGNN still outperforms the unary CNN.

Propagation Step   Unary CNN   2DGNN   3DGNN
0                  37.9        -       -
1                  -           37.8    38.1
3                  -           38.4    39.3
4                  -           38.0    39.4
6                  -           38.1    39.0

Table 4. 2D vs. 3D graph: performance (mean IoU%) with different propagation steps on the NYUD2 validation set.

Dataset     network     mean IoU%   mean acc%
NYUD2-40    Unary CNN   37.1        51.0
            2DGNN       38.7        52.9
            3DGNN       39.9        54.0
NYUD2-37    Unary CNN   41.7        55.0
            3DGNN       43.6        57.0
SUN-RGBD    Unary CNN   38.5        49.4
            2DGNN       38.9        50.3
            3DGNN       40.2        52.5

Table 5. Comparison with the unary CNN on the NYUD2 and SUN-RGBD test sets.

Performance Analysis. We now compare our 3DGNN to the unary CNN in order to investigate how the GNN can be enhanced by leveraging 3D geometric information. The results based on single-scale input are listed in Table 5. Our 3DGNN model outperforms the unary and 2DGNN models, which again supports the fact that 3D context is important in semantic segmentation.

We further break down the improvement in performance for each semantic class in Fig. 5. The statistics show that our 3DGNN outperforms the unary CNN by a large margin for classes like cabinet, bed, dresser, and refrigerator. This is likely because these objects are easily misclassified as their surroundings in the 2D image. However, in 3D space, they typically have rigid shapes and the depth distribution
is more consistent, which makes the classification task relatively easier to tackle.

Figure 5. Improvement of IoU over the unary CNN for each semantic class.

To better understand what contributes to the improvement, we analyze how the performance gain varies with different sizes of objects. In particular, for each semantic class, we first divide the ground-truth segmentation maps into a set of connected components, where each component is regarded as one instance of an object of that class. We then count the sizes of the object instances for all classes. The range of object sizes covers up to 10,200 different values in terms of the number of pixels.
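This size statistic can be computed as in the following sketch (NumPy/SciPy assumed), which splits each class mask into connected components and records the instance sizes in pixels.

```python
import numpy as np
from scipy import ndimage

def instance_sizes(gt, num_classes):
    """gt: (H, W) integer ground-truth label map.
    Returns a dict mapping class id -> list of instance sizes in pixels."""
    sizes = {}
    for c in range(num_classes):
        mask = (gt == c)
        if not mask.any():
            continue
        labeled, n = ndimage.label(mask)            # connected components = object instances
        counts = np.bincount(labeled.ravel())[1:]   # pixel count per instance (skip background 0)
        sizes[c] = counts.tolist()
    return sizes
```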
References

[1] D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In NIPS, pages 3189–3197, 2016.
[2] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. ICLR, 2014.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. ICLR, 2015.
[4] J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.
[5] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.
[6] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.
[7] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
[8] N. Friedman. Inferring cellular networks using probabilistic graphical models. Science, 2004.
[9] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In IJCNN, 2005.
[10] S. Gupta, P. Arbeláez, R. Girshick, and J. Malik. Indoor scene understanding with rgb-d images: Bottom-up segmentation, object detection and semantic segmentation. IJCV, 2015.
[11] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from rgb-d images for object detection and segmentation. In ECCV, 2014.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[16] A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv, 2015.
[17] S. H. Khan, M. Bennamoun, F. Sohel, R. Togneri, and I. Naseem. Integrating geometrical context for semantic labeling of indoor scenes using rgbd images. IJCV, 2016.
[18] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2017.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[20] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. ICLR, 2016.
[21] Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin. Lstm-cf: Unifying context modeling and fusion with lstms for rgb-d scene labeling. In ECCV, 2016.
[22] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan. Semantic object parsing with graph lstm. In ECCV, 2016.
[23] X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan. Semantic object parsing with local-global long short-term memory. In CVPR, 2016.
[24] D. Lin, S. Fidler, and R. Urtasun. Holistic scene understanding for 3d object detection with rgbd cameras. In ICCV, 2013.
[25] G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. arXiv, 2016.
[26] G. Lin, C. Shen, A. van den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, 2016.
[27] W. Liu, A. Rabinovich, and A. C. Berg. Parsenet: Looking wider to see better. ICLR Workshop, 2016.
[28] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015.
[29] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[30] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In ICCV Workshops, pages 37–45, 2015.
[31] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[32] X. Ren, L. Bo, and D. Fox. Rgb-(d) scene labeling: Features and algorithms. In CVPR, 2012.
[33] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE TNN, 2009.
[34] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
[35] S. Song, S. P. Lichtenberg, and J. Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In CVPR, 2015.
[36] S. Song and J. Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In CVPR, 2016.
[37] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. arXiv, 2016.
[38] I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In ICML, 2011.
[39] K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. ACL, 2015.
[40] A. Wang, J. Lu, J. Cai, G. Wang, and T.-J. Cham. Unsupervised joint feature learning and encoding for rgb-d scene labeling. TIP, 2015.
[41] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, 2015.
[42] J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In CVPR, 2012.
[43] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv, 2015.
[44] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. arXiv, 2016.
[45] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.
[46] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In CVPR, 2017.