Neural RGB→D Sensing: Depth and Uncertainty from a Video Camera
Chao Liu¹,²∗   Jinwei Gu¹,³∗   Kihwan Kim¹   Srinivasa Narasimhan²   Jan Kautz¹
¹NVIDIA   ²Carnegie Mellon University   ³SenseTime
as lack of texture, specular/transparent materials, occlusion, and scale drift. While some recent work has started focusing on uncertainty estimation [15, 21, 23, 24] for certain computer vision tasks, to our knowledge, we are the first to predict a depth probability volume from images and integrate it over time in a statistical framework.

We evaluate our method extensively on multiple datasets and compare with recent state-of-the-art, DL-based depth estimation methods [13, 17, 51]. We also perform the so-called “cross-dataset” evaluation task, which tests models trained on a different dataset without fine-tuning. We believe such cross-dataset tasks are essential to evaluate robustness and generalization ability [1]. Experimental results show that, with reasonably good camera pose estimation, our method outperforms these prior methods on depth estimation with better accuracy, robustness, and temporal stability. Moreover, as shown in Fig. 1, the output of the proposed method can be directly fed into RGB-D based 3D scanning methods [32, 33] for 3D scene reconstruction.

2. Related Work

Depth sensing from active sensors  Active depth sensors, such as depth cameras [19, 36] or LiDAR sensors [7], provide dense metric depth measurements as well as sensor-specific confidence measures [37]. Despite their wide usage [18, 32, 35, 53], they have several inherent drawbacks [5, 29, 34, 50], such as limited operating range, low spatial resolution, sensor interference, and high power consumption. Our goal in this paper is to mimic an RGB-D sensor with a monocular RGB camera, which continuously predicts depth (and its uncertainty) from a video stream.

Depth estimation from images  Depth estimation directly from images has been a core problem in computer vision [39, 42]. Classical single-view methods [9, 38] often make strong assumptions on scene structure. Stereo and multi-view methods [42] rely on triangulation and suffer from finding correspondences for textureless regions, transparent/specular materials, and occlusion. Moreover, due to global bundle adjustment, these methods are often computationally expensive for real-time applications. For depth estimation from a monocular video, there is also scale ambiguity and drift [30]. Because of these challenges, many computer vision systems [30, 40] use RGB images mainly for camera pose estimation but rarely for dense 3D reconstruction [41]. Nevertheless, depth sensing from images has great potential, since it addresses all the above drawbacks of active depth sensors. In this paper, we take a step in this direction using a learning-based method.

Learning-based depth estimation  Recently, researchers have shown encouraging results for depth sensing directly from image(s), including single-view methods [13, 17, 57], video-based methods [28, 52, 55], depth and motion from two views [6, 51], and multi-view stereo [20, 54, 56]. A few works have also incorporated these DL-based depth sensing methods into visual SLAM systems [4, 47]. Despite their promising performance, however, these DL-based methods are still far from real-world applications, since their robustness and generalization ability have yet to be thoroughly tested [1]. In fact, as shown in Sec. 4, we found that many state-of-the-art methods degrade significantly even for simple cross-dataset tasks. This gives rise to an increasing demand for a systematic study of uncertainty and Bayesian deep learning for depth sensing, as performed in our paper.

Uncertainty and Bayesian deep learning  Uncertainty and Bayesian modeling have long been studied over the past few decades, with various definitions ranging from the variance of posterior distributions for low-level vision [46] and motion analysis [25] to the variability of sensor input models [22]. Recently, uncertainty estimation [15, 23] in Bayesian deep learning was introduced for a variety of computer vision tasks [8, 21, 24]. In our work, the uncertainty is defined as the posterior probability of depth, i.e., the DPV estimated from a local window of several consecutive frames. Thus, our network estimates the “measurement uncertainty” [23] rather than the “model uncertainty”. We also learn an additional network module to integrate this depth probability distribution over time in a Bayesian filtering manner, in order to improve the accuracy and robustness of depth estimation from a video stream.

3. Our Approach

Figure 2 shows an overview of our proposed method for depth sensing from an input video stream, which consists of three parts. The first part (Sec. 3.1) is the D-Net, which estimates the Depth Probability Volume (DPV) for each input frame. The second part (Sec. 3.2) is the K-Net, which helps to integrate the DPVs over time. The third part (Sec. 3.3) is the refinement R-Net, which improves the spatial resolution of DPVs with guidance from the input images.

Specifically, we denote the depth probability volume (DPV) as p(d; u, v), which represents the probability of pixel (u, v) having a depth value d, where d ∈ [d_min, d_max]. Due to perspective projection, the DPV is defined on the 3D view frustum attached to the camera, as shown in Fig. 3(a). d_min and d_max are the near and far planes of the 3D frustum, which is discretized into N = 64 planes uniformly over the inverse of depth (i.e., disparity). The DPV contains the complete statistical distribution of depth for a given scene. In this paper, we directly use the non-parametric volume to represent the DPV. Parametric models, such as a Gaussian Mixture Model [3], could also be used.
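As a concrete illustration of that discretization, here is a minimal sketch in Python of generating the N = 64 depth hypotheses uniformly in inverse depth. The paper only states that its framework is implemented in Python, so the PyTorch choice, the helper name, and the example near/far planes are ours:

```python
import torch

def make_depth_candidates(d_min: float, d_max: float, n_planes: int = 64) -> torch.Tensor:
    """Depth hypotheses spaced uniformly in inverse depth (disparity).

    The DPV is defined over these N planes between the near plane d_min
    and the far plane d_max of the camera frustum.
    """
    # Uniform steps in disparity space, then convert back to depth.
    disparity = torch.linspace(1.0 / d_max, 1.0 / d_min, n_planes)
    return 1.0 / disparity  # shape: (n_planes,), ordered from far to near

# Example: 64 planes between 0.5 m and 10 m.
depth_candidates = make_depth_candidates(0.5, 10.0, 64)
```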
[Figure 2: pipeline diagram with three stages: Estimate Depth Probability (Sec. 3.1, D-Net), Integrate Depth Probability Over Time (Sec. 3.2, residual and gain applied to the measured DPV p(d_t|I_t) and the predicted DPV p(d_t|I_1:t-1)), and Refine Depth Probability (Sec. 3.3, R-Net).]
Figure 2. Overview of the proposed network for depth estimation with uncertainty from a video. Our method takes the frames in a local time window of the video as input and outputs a Depth Probability Volume (DPV) that is updated over time. The update procedure follows a Bayesian filtering fashion: we first take the difference between the local DPV estimated using the D-Net (Sec. 3.1) and the DPV predicted from previous frames to get the residual; then the residual is modified by the K-Net (Sec. 3.2) and added back to the predicted DPV; finally, the DPV is refined and upsampled by the R-Net (Sec. 3.3), and can be used to compute the depth map and its confidence measure.
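The caption above already spells out one full update step; the schematic sketch below restates it in code. The three modules are treated as black boxes, and the exact interfaces (and whether the DPVs are kept in probability or log-probability space) are not given in this excerpt, so the signatures are illustrative only:

```python
def dpv_update_step(d_net, k_net, r_net, frames, predicted_dpv, rel_poses):
    """One predict-update iteration of the pipeline in Figure 2.

    frames:        local time window of video frames centered at frame t
    predicted_dpv: DPV predicted (warped) from the previous frames
    rel_poses:     relative camera poses used by the D-Net for warping
    """
    measured_dpv = d_net(frames, rel_poses)        # local DPV p(d_t | I_t), Sec. 3.1
    residual = measured_dpv - predicted_dpv        # difference between measurement and prediction
    updated_dpv = predicted_dpv + k_net(residual)  # K-Net modifies the residual, Sec. 3.2
    refined_dpv = r_net(updated_dpv, frames)       # R-Net refines and upsamples, Sec. 3.3
    return refined_dpv
```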
Figure 3. Representation and update for DPV. (a) The DPV is defined over a 3D frustum defined by the pinhole camera model. (b) The DPV gets updated over time as the camera moves.

Given the DPV, we can compute the Maximum-Likelihood Estimates (MLE) for depth and its confidence:

    Depth:       d̂(u, v) = Σ_{d = d_min}^{d_max} p(d; u, v) · d,    (1)
    Confidence:  Ĉ(u, v) = p(d̂; u, v).                              (2)

To make notations more concise, we will omit (u, v) and use p(d) for DPVs in the rest of the paper.

When processing a video stream, the DPV can be treated as a hidden state of the system. As the camera moves, as shown in Fig. 3(b), the DPV p(d) is updated as new observations arrive, especially for the overlapping volumes. Meanwhile, if the camera motion is known, we can easily predict the next state p(d) from the current state. This predict-update iteration naturally implies a Bayesian filtering scheme to update the DPV over time for better accuracy.

3.1. D-Net: Estimating DPV

For each frame I_t, we use a CNN, named D-Net, to estimate the conditional DPV, p(d_t | I_t), using I_t and its temporally neighboring frames. In this paper, we consider a local time window of five frames N_t = [t − 2∆t, t − ∆t, t, t + ∆t, t + 2∆t], and we set ∆t = 5 for all our testing videos (25 fps/30 fps). For a given depth candidate d, we can compute a cost map by warping all the neighboring frames into the current frame I_t and computing their differences. Thus, for all depth candidates, we can compute a cost volume, which produces the DPV after a softmax layer:

    L(d_t | I_t) = Σ_{k ∈ N_t, k ≠ t} || f(I_t) − warp(f(I_k); d_t, δT_kt) ||,
    p(d_t | I_t) = softmax(L(d_t | I_t)),    (3)

where f(·) is a feature extractor, δT_kt is the relative camera pose from frame I_k to frame I_t, and warp(·) is an operator that warps the image features from frame I_k to the reference frame I_t, which is implemented as 2D grid sampling. In this paper, without loss of generality, we use the feature extractor f(·) from PSM-Net [6], which outputs a feature map of 1/4 the size of the input image. Later, in Sec. 3.3, we learn a refinement R-Net to upsample the DPV back to the original size of the input image.

Figure 4 shows an example of a depth map d̂(u, v) and its confidence map Ĉ(u, v) (blue means low confidence) derived from a Depth Probability Volume (DPV) for the input image. The bottom plot shows the depth probability distributions p(d; u, v) for the three selected points, respectively. The red and green points have sharp peaks, which indicates high confidence in their depth values. The blue point is in the highlight region, and thus it has a flat depth probability distribution and a low confidence for its depth.

3.2. K-Net: Integrating DPV over Time

When processing a video stream, our goal is to integrate the local estimation of DPVs over time to reduce uncertainty. As mentioned earlier, this integration can be naturally implemented as Bayesian filtering. Let us define d_t as the hidden state, which is the depth (in camera coordinates) at frame I_t. The “belief” volume p(d_t | I_1:t) is the conditional distribution of the state given all the previous frames. A simple Bayesian filtering can be implemented in this predict-update manner.
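To make the per-frame estimation of Sec. 3.1 concrete, the sketch below (PyTorch) illustrates the plane-sweep warping, cost volume, and softmax of Eq. (3), followed by the depth and confidence read-out of Eqs. (1)-(2). The L1 feature distance, the reference-to-source pose convention, and the negated cost inside the softmax (so that smaller distances map to higher probability) are our assumptions; the feature extractor f(·) (Table 10) is treated as given.

```python
import torch
import torch.nn.functional as F

def warp_to_ref(feat_src, depth, K, T_ref_to_src):
    """Warp source-view features into the reference view for one depth plane.

    feat_src:     (B, C, h, w) features f(I_k) of a neighboring frame
    depth:        scalar depth hypothesis d, defined in the reference frame
    K:            (3, 3) camera intrinsics at feature resolution
    T_ref_to_src: (4, 4) pose taking reference-camera points to source-camera points
    """
    B, C, h, w = feat_src.shape
    dev = feat_src.device
    v, u = torch.meshgrid(torch.arange(h, dtype=torch.float32, device=dev),
                          torch.arange(w, dtype=torch.float32, device=dev),
                          indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1)  # (3, h*w)
    pts_ref = torch.linalg.inv(K) @ pix * depth                          # back-project at depth d
    pts_src = T_ref_to_src[:3, :3] @ pts_ref + T_ref_to_src[:3, 3:4]     # move to source camera
    proj = K @ pts_src
    uv = proj[:2] / proj[2:3].clamp(min=1e-6)                            # project to source pixels
    grid = torch.stack([2.0 * uv[0] / (w - 1) - 1.0,                     # normalize for grid_sample
                        2.0 * uv[1] / (h - 1) - 1.0], dim=-1)
    grid = grid.reshape(1, h, w, 2).expand(B, h, w, 2)
    return F.grid_sample(feat_src, grid, align_corners=True)

def estimate_dpv(feat_ref, feat_srcs, poses_ref_to_src, K, depth_candidates):
    """Per-frame DPV (Eq. 3) plus MLE depth (Eq. 1) and confidence (Eq. 2)."""
    costs = []
    for d in depth_candidates:
        cost_d = 0.0
        for feat_src, T in zip(feat_srcs, poses_ref_to_src):
            warped = warp_to_ref(feat_src, float(d), K, T)
            cost_d = cost_d + (feat_ref - warped).abs().mean(dim=1)      # L1 over channels -> (B, h, w)
        costs.append(cost_d)
    cost_volume = torch.stack(costs, dim=1)                              # (B, D, h, w)
    dpv = torch.softmax(-cost_volume, dim=1)                             # lower cost -> higher probability
    d_vals = depth_candidates.view(1, -1, 1, 1).to(dpv)
    depth_map = (dpv * d_vals).sum(dim=1)                                # expected depth, Eq. (1)
    idx = (d_vals - depth_map.unsqueeze(1)).abs().argmin(dim=1, keepdim=True)
    confidence = dpv.gather(1, idx).squeeze(1)                           # probability at d_hat, Eq. (2)
    return dpv, depth_map, confidence
```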
[Figure 4: an input frame with its estimated depth and confidence maps, and the depth probability distributions p(d) (probability vs. depth in meters) for three selected points.]
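Reading one pixel's distribution out of the DPV, as plotted in Figure 4, is a single indexing operation; a tiny illustrative snippet, reusing the (hypothetical) `dpv` tensor returned by the `estimate_dpv` sketch above:

```python
# p(d; u, v) for one pixel at feature resolution; sums to 1 over the depth planes.
u, v = 40, 30                    # example pixel (illustrative)
curve = dpv[0, :, v, u]          # shape (D,)
sharpness = curve.max().item()   # a sharp peak corresponds to high depth confidence
```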
Table 1. Comparison of depth estimation over the 7-Scenes dataset [43] with the metrics defined in [11].

             σ < 1.25   abs. rel   rmse     scale inv.
DeMoN [51]   31.88      0.3888     0.8549   0.4473
DORN [13]    60.05      0.2000     0.4591   0.2207
Ours         69.26      0.1758     0.4408   0.1899

is difficult to choose methods to compare with. We carefully select a few recent DL-based depth estimation methods and try our best for a fair comparison. For single-view methods, we select DORN [13], which is the current state-of-the-art [1]. For two-view methods, we compare with DeMoN [51], which shows high-quality depth prediction from a pair of images. We also compare with MonoDepth [17], which is a semi-supervised learning approach from stereo images. To improve the temporal consistency of these per-frame estimations, we trained a post-processing network [27], but we observed that it does not improve the performance. Since there is always scale ambiguity for depth from a monocular camera, for a fair comparison we normalize the scale of the outputs from all the above methods before we compute the statistical metrics [11].

The inference time for processing one frame in our method is ∼0.7 second without pose optimization and ∼1.5 seconds with pose estimation, on a workstation with a GTX 1080 GPU and 64 GB of RAM, with the framework implemented in Python. The pose estimation part can be implemented in C++ to improve efficiency.

Results for Indoor Scenarios  We first evaluated our method for indoor scenarios, for which RGB-D sensors were used to capture dense metric depth for ground truth. We trained our network on ScanNet [10]. Figure 7 shows two exemplar results. As shown, in addition to depth maps, our method also outputs reasonable confidence maps (e.g., low confidence in the occluded or specular regions) which correlate with the depth errors. Moreover, with more input frames, the confidence maps accumulate correctly over time: the confidence of the books (top row) increases and the depth error decreases; the confidence of the glass region (bottom row) decreases and the depth error increases.

For comparison, since the models provided by DORN and DeMoN were trained on different datasets, we compare with these two methods on a separate indoor dataset, 7-Scenes [43]. For our method, we assume that the relative camera rotation δR within a local time window is provided (e.g., measured by an IMU). As shown in Table 5, our method significantly outperforms both DeMoN and DORN on this dataset based on the commonly used statistical metrics [11]. We include the complete metrics in the appendix. Without using an IMU, our method can also achieve better performance, as shown in Table 4.

For qualitative comparison, as shown in Fig. 8, the depth maps from our method are less noisy, sharper, and temporally more consistent. More importantly, using an RGB-D 3D scanning method [33], we can reconstruct a much higher quality 3D mesh with our estimated depths compared to other methods. Even when compared with 3D reconstruction using a real RGB-D sensor, our result has better coverage and accuracy in some regions (e.g., monitors / glossy surfaces) that active depth sensors cannot capture.

Results for Outdoor Scenarios  We also evaluated our method on outdoor datasets: KITTI [16] and Virtual KITTI [14]. The Virtual KITTI dataset is used because it has dense, accurate metric depth as ground truth, while KITTI only has sparse depth values from LiDAR as ground truth. For our method, we use the camera poses measured by the IMU and GPS. Table 6 lists the comparison results with DORN [13], Eigen [11], and MonoDepth [17], which are also trained on KITTI [16]. Our method has similar performance to DORN [13], and is better than the other two methods, based on the statistical metrics defined in [11]. We also tested our method with camera poses from DSO [12] and obtained slightly worse performance (see appendix).

Figure 9 shows a qualitative comparison of depth maps on the KITTI dataset. As shown, our method generates sharper and less noisy depth maps. In addition, our method outputs depth confidence maps (e.g., lower confidence on the car window). Our depth estimation is temporally consistent, which opens the possibility of fusing multiple depth maps with voxel hashing [33] outdoors for large-scale dense 3D reconstruction, as shown in Fig. 9.
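For reference, the statistics reported in Tables 1 and 2 follow the standard definitions of Eigen et al. [11]; the sketch below shows one way to compute them after scale normalization. The median-ratio scaling is our assumption about the normalization step, and the scale-invariant error has several common variants, so treat this as an illustration rather than the exact evaluation script:

```python
import numpy as np

def scale_normalize(pred, gt):
    """Align the global scale of a monocular prediction to the ground truth
    (median-ratio scaling; the paper's exact normalization may differ)."""
    return pred * np.median(gt) / np.median(pred)

def depth_metrics(pred, gt):
    """sigma < 1.25, abs. rel, rmse, and a scale-invariant log error [11]."""
    mask = gt > 0                                   # evaluate on valid ground-truth pixels only
    pred, gt = pred[mask], gt[mask]
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = 100.0 * np.mean(ratio < 1.25)          # percentage of pixels with sigma < 1.25
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    d = np.log(pred) - np.log(gt)
    scale_inv = np.sqrt(np.mean(d ** 2) - np.mean(d) ** 2)
    return delta1, abs_rel, rmse, scale_inv
```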
[Figure 8 panels, repeated for three scenes: Input frame | GT depth | DORN depth | DeMoN depth | Our depth | Our confidence; 3D reconstructions: GT view 1 | GT view 2 | DORN view 1 | DeMoN view 1 | Our view 1 | Our view 2.]
Figure 8. Depth and 3D reconstruction results on indoor datasets (best viewed when zoomed in). We compare our method with DORN [13]
and DeMoN [51], in terms of both depth maps and 3D reconstruction using Voxel Hashing [33] that accumulates the estimated depth maps
for multiple frames. To show the temporal consistency of the depths, we use different numbers of depth maps for Voxel Hashing: 2 depth
maps for the first sample and 30 depth maps for the other samples. The depth maps from DORN contain block artifacts as marked in red
boxes. This is manifested as the rippled shapes in the 3D reconstruction. DeMoN generates sharp depth boundaries but fails to recover the
depth faithfully in the regions marked in the green box. Also, the depths from DeMoN are not temporally consistent. This leads to severe
misalignment artifacts in the 3D reconstructions. In comparison, our method generates correct and temporally consistent depth maps,
especially at regions with high confidence, such as the monitor where even the Kinect sensor fails to get the depth due to low reflectance.
In Table 3, we performed the cross-dataset task. The left part shows the results of training on KITTI [16] and testing on Virtual KITTI [14]. The right part shows the results of training on indoor datasets (NYUv2 [31] for DORN [13] and ScanNet [10] for ours) and testing on KITTI [16]. As shown, our method performs better, which demonstrates its better robustness and generalization ability.

Table 2. Comparison of depth estimation on KITTI [16].

             σ < 1.25   abs. rel   rmse     scale inv.
Eigen [11]   67.80      0.1904     5.114    0.2628
Mono [17]    86.43      0.1238     2.8684   0.1635
DORN [13]    92.62      0.0874     3.1375   0.1233
Ours         93.15      0.0998     2.8294   0.1070

Ablation Study  The performance of our method relies on accurate estimates of the camera poses, so we test our method with different camera pose estimation schemes: (1) the relative camera rotation δR is read from an IMU sensor (denoted as “GT R”); (2) δR of all frames is initialized with DSO [12] (denoted as “VO pose”); (3) δR of the first five frames is initialized with DSO [12] (denoted as “1st win”). We observe that when only the camera poses in the first time window are initialized using DSO, the depth estimation performance is better than that using the DSO pose initialization.
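All three pose schemes feed the same warping operator through the relative pose δT_kt used in Eq. (3). A minimal sketch of composing that relative pose from two absolute poses, assuming 4×4 camera-to-world matrices (the convention and function name are ours):

```python
import numpy as np

def relative_pose(T_k_c2w, T_t_c2w):
    """Relative pose from frame I_k to frame I_t: maps points expressed in
    camera-k coordinates into camera-t coordinates (camera-to-world inputs assumed)."""
    return np.linalg.inv(T_t_c2w) @ T_k_c2w
```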
[Figure 9 panels: Input frame | MonoDepth | DORN | Our depth | Our confidence.]
The proposed update process defined in Eq. 8 in the main paper, which is based on residuals, is closely related to the Kalman filter. In the Kalman filter, given the observation x_t at time t and the estimated hidden state h_{t−1} at time t−1, the updated hidden state h_t is

    h_t = h_{t−1} + K_t (x_t − h_{t−1}),

where K_t is the Kalman gain.

B. More Results

B.1. Complete metrics for comparisons

We show the complete metrics for the depth estimation comparisons in Table 5 and Table 6.

C. Network structures

In this section, we illustrate the network structures used in the pipeline.

C.1. D-Net

We show the structure of the D-Net in Table 10. In the paper, we set D = 64.

C.3. R-Net

We show the structure of the R-Net in Table 9. In the paper, we set D = 64.
Table 5. Comparison of depth estimation over the 7-Scenes dataset [43] with the metrics defined in [11].

             σ < 1.25   σ < 1.25²   σ < 1.25³   abs. rel   sq. rel   rmse     rmse log   scale inv.
DeMoN [51]   31.88      61.02       82.52       0.3888     0.4198    0.8549   0.4771     0.4473
DORN [13]    60.05      87.76       96.33       0.2000     0.1153    0.4591   0.2813     0.2207
Ours         69.26      91.77       96.82       0.1758     0.1123    0.4408   0.2500     0.1899
Table 8. K-Net structure. The operator expand(·) repeats the image intensity in the depth dimension.

Name    | Components                                                          | Input   | Output dimension
Input   | concat(cost volume, expand(I_ref))                                  |         | H/4 × W/4 × D × 4
conv_0  | conv3d(3×3, ch_in=4, ch_out=32), ReLU;                              | Input   | H/4 × W/4 × D × 32
        |   conv3d(3×3, ch_in=32, ch_out=32), ReLU                            |         |
conv_1  | [conv3d(3×3, ch_in=32, ch_out=32), ReLU;                            | conv_0  | H/4 × W/4 × D × 32
        |   conv3d(3×3, ch_in=32, ch_out=32)] ×4                              |         |
conv_2  | conv3d(3×3, ch_in=32, ch_out=32), ReLU;                             | conv_1  | H/4 × W/4 × D × 1
        |   conv3d(3×3, ch_in=32, ch_out=1)                                   |         |
Output  | Modified cost volume from the conv_2 layer                          |         | H/4 × W/4 × D × 1
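A sketch of the K-Net as listed in Table 8, written in PyTorch. The table only gives kernel sizes and channel counts, so we assume 3×3×3 kernels with padding that preserves the D × H/4 × W/4 resolution, and we stack the ×4 blocks sequentially (the original implementation may add skip connections):

```python
import torch
import torch.nn as nn

class KNet(nn.Module):
    """3D CNN that modifies the DPV residual (Table 8).

    Input:  concat(cost volume, expand(I_ref)), shape (B, 4, D, H/4, W/4).
    Output: modified cost volume, shape (B, 1, D, H/4, W/4).
    """
    def __init__(self):
        super().__init__()

        def conv(c_in, c_out, relu=True):
            layers = [nn.Conv3d(c_in, c_out, kernel_size=3, padding=1)]
            if relu:
                layers.append(nn.ReLU(inplace=True))
            return layers

        self.conv_0 = nn.Sequential(*conv(4, 32), *conv(32, 32))
        # conv_1: four repetitions of (conv3d + ReLU, conv3d), as in Table 8.
        blocks = []
        for _ in range(4):
            blocks += conv(32, 32) + conv(32, 32, relu=False)
        self.conv_1 = nn.Sequential(*blocks)
        self.conv_2 = nn.Sequential(*conv(32, 32), *conv(32, 1, relu=False))

    def forward(self, x):
        return self.conv_2(self.conv_1(self.conv_0(x)))

# Shape check with a dummy input (D = 64 depth planes, 1/4-resolution 60 x 80 grid):
if __name__ == "__main__":
    out = KNet()(torch.zeros(1, 4, 64, 60, 80))
    print(out.shape)  # torch.Size([1, 1, 64, 60, 80])
```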
Table 10. D-Net structure. The structure is taken from [6].

Name     | Components                                                                                    | Input    | Output dimension
Input    | Input frame                                                                                   |          | H × W × 3
CNN Layers
conv0_1  | conv2d(3×3, ch_in=3, ch_out=32, stride=2), ReLU                                               | Input    | H/2 × W/2 × 32
conv0_2  | conv2d(3×3, ch_in=32, ch_out=32), ReLU                                                        | conv0_1  | H/2 × W/2 × 32
conv0_3  | conv2d(3×3, ch_in=32, ch_out=32), ReLU                                                        | conv0_2  | H/2 × W/2 × 32
conv1    | [conv2d(3×3, ch_in=32, ch_out=32), ReLU; conv2d(3×3, ch_in=32, ch_out=32)] ×3                 | conv0_2  | H/2 × W/2 × 32
conv1_1  | conv2d(3×3, ch_in=32, ch_out=64, stride=2), ReLU                                              | conv1    | H/4 × W/4 × 64
conv2    | [conv2d(3×3, ch_in=64, ch_out=64), ReLU; conv2d(3×3, ch_in=64, ch_out=64)] ×15                | conv1_1  | H/4 × W/4 × 64
conv2_1  | conv2d(3×3, ch_in=64, ch_out=128), ReLU                                                       | conv2    | H/4 × W/4 × 128
conv3    | [conv2d(3×3, ch_in=128, ch_out=128), ReLU; conv2d(3×3, ch_in=128, ch_out=128)] ×2             | conv2_1  | H/4 × W/4 × 128
conv4    | [conv2d(3×3, ch_in=128, ch_out=128, dila=2), ReLU; conv2d(3×3, ch_in=128, ch_out=128, dila=2)] ×3 | conv3 | H/4 × W/4 × 128
Spatial Pyramid Layers
branch1  | avg_pool(64×64, stride=64); conv2d(1×1, ch_in=128, ch_out=32), ReLU; bilinear interpolation   | conv4    | H/4 × W/4 × 32
branch2  | avg_pool(32×32, stride=32); conv2d(1×1, ch_in=128, ch_out=32), ReLU; bilinear interpolation   | conv4    | H/4 × W/4 × 32
branch3  | avg_pool(16×16, stride=16); conv2d(1×1, ch_in=128, ch_out=32), ReLU; bilinear interpolation   | conv4    | H/4 × W/4 × 32
branch4  | avg_pool(8×8, stride=8); conv2d(1×1, ch_in=128, ch_out=32), ReLU; bilinear interpolation      | conv4    | H/4 × W/4 × 32
concat   | concat(branch1, branch2, branch3, branch4, conv2, conv4)                                      |          | H/4 × W/4 × 320
fusion   | conv2d(3×3, ch_in=320, ch_out=128), ReLU; conv2d(1×1, ch_in=128, ch_out=64), ReLU             | concat   | H/4 × W/4 × 64
Output   | The extracted image feature from the fusion layer                                             |          | H/4 × W/4 × 64

References

[1] Robust Vision Challenge Workshop. http://www.robustvision.net, 2018.
[2] J. T. Barron and B. Poole. The fast bilateral solver. In European Conference on Computer Vision (ECCV), 2016.
[3] C. M. Bishop. Mixture density networks. 1994.
[4] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. Davison. CodeSLAM - Learning a compact, optimisable representation for dense visual SLAM. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[5] D. Chan, H. Buisman, C. Theobalt, and S. Thrun. A noise-aware filter for real-time depth upsampling. In Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2), Marseille, France, 2008.
[6] J.-R. Chang and Y.-S. Chen. Pyramid stereo matching network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5410–5418, 2018.
[7] J. A. Christian and S. Cryan. A survey of LiDAR technology and its use in spacecraft relative navigation. In AIAA Guidance, Navigation, and Control (GNC) Conference, 2013.
[8] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen. VidLoc: A deep spatial-temporal model for 6-DoF video-clip relocalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[9] A. Criminisi, I. Reid, and A. Zisserman. Single view metrology. International Journal of Computer Vision (IJCV), 2000.
[10] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[11] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems (NIPS), 2014.
[12] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40:611–625, 2018.
[13] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[14] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[15] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), 2016.
[16] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361, 2012.
[17] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[18] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision (ECCV), 2014.
[19] R. Horaud, M. Hansard, G. Evangelidis, and C. Ménier. An overview of depth cameras and range scanners based on time-of-flight technologies. Machine Vision and Applications Journal, 27(7):1005–1020, 2016.
[20] P.-H. Huang, K. Matzen, J. Kopf, N. Ahuja, and J.-B. Huang. DeepMVS: Learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[21] E. Ilg, Ö. Çiçek, S. Galesso, A. Klein, O. Makansi, F. Hutter, and T. Brox. Uncertainty estimates and multi-hypotheses networks for optical flow. In European Conference on Computer Vision (ECCV), 2018.
[22] G. Kamberova and R. Bajcsy. Sensor errors and the uncertainties in stereo reconstruction. In Empirical Evaluation Techniques in Computer Vision, pages 96–116. IEEE Computer Society Press, 1998.
[23] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems (NIPS), 2017.
[24] A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[25] K. Kim, D. Lee, and I. Essa. Gaussian process regression flow for analysis of motion trajectories. In International Conference on Computer Vision (ICCV), 2011.
[26] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[27] W.-S. Lai, J.-B. Huang, O. Wang, E. Shechtman, E. Yumer, and M.-H. Yang. Learning blind video temporal consistency. In European Conference on Computer Vision (ECCV), 2018.
[28] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[29] A. Maimone and H. Fuchs. Reducing interference between multiple structured light depth sensors using motion. In IEEE Virtual Reality Workshops (VRW), pages 51–54, 2012.
[30] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[31] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision (ECCV), 2012.
[32] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR), pages 127–136, 2011.
[33] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 2013.
[34] F. Pomerleau, A. Breitenmoser, M. Liu, F. Colas, and R. Siegwart. Noise characterization of depth sensors for surface inspections. In International Conference on Applied Robotics for the Power Industry (CARPI), pages 16–21, 2012.
[35] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum PointNets for 3D object detection from RGB-D data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[36] F. Remondino and D. Stoppa. TOF Range-Imaging Cameras. Springer Publishing Company, Incorporated, 2013.
[37] M. Reynolds, J. Dobo, L. Peel, T. Weyrich, and G. J. Brostow. Capturing time-of-flight data with confidence. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[38] A. Saxena, S. H. Chung, and A. Y. Ng. 3D depth reconstruction from a single still image. International Journal of Computer Vision (IJCV), 76(1):53–69, Jan. 2008.
[39] A. Saxena, J. Schulte, and A. Y. Ng. Depth estimation using monocular and stereo cues. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 2197–2203, 2007.
[40] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[41] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[42] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[43] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[44] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[45] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012.
[46] R. Szeliski. Bayesian modeling of uncertainty in low-level vision. International Journal of Computer Vision, 5(3):271–301, Dec 1990.
[47] K. Tateno, F. Tombari, I. Laina, and N. Navab. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[48] B. Tippetts, D. J. Lee, K. Lillywhite, and J. Archibald. Review of stereo vision algorithms and their suitability for resource-limited systems. Journal of Real-Time Image Processing, 11(1):5–25, 2016.
[49] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment: A modern synthesis. In International Conference on Computer Vision (ICCV), 1999.
[50] J. Tuley, N. Vandapel, and M. Hebert. Analysis and removal of artifacts in 3-D LIDAR data. In International Conference on Robotics and Automation (ICRA), 2005.
[51] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[52] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[53] T. Whelan, S. Leutenegger, R. S. Moreno, B. Glocker, and A. Davison. ElasticFusion: Dense SLAM without a pose graph. In Robotics: Science and Systems (RSS), 2015.
[54] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan. MVSNet: Depth inference for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2018.
[55] Z. Yin and J. Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[56] H. Zhou, B. Ummenhofer, and T. Brox. DeepTAM: Deep tracking and mapping. In European Conference on Computer Vision (ECCV), 2018.
[57] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.