
Neural RGB→D Sensing: Depth and Uncertainty from a Video Camera

Chao Liu 1,2,*    Jinwei Gu 1,3,*    Kihwan Kim 1    Srinivasa Narasimhan 2    Jan Kautz 1
1 NVIDIA    2 Carnegie Mellon University    3 SenseTime
* The authors contributed to this work when they were at NVIDIA.
arXiv:1901.02571v1 [cs.CV] 9 Jan 2019

Abstract

Depth sensing is crucial for 3D reconstruction and scene understanding. Active depth sensors provide dense metric measurements, but often suffer from limitations such as restricted operating ranges, low spatial resolution, sensor interference, and high power consumption. In this paper, we propose a deep learning (DL) method to estimate per-pixel depth and its uncertainty continuously from a monocular video stream, with the goal of effectively turning an RGB camera into an RGB-D camera. Unlike prior DL-based methods, we estimate a depth probability distribution for each pixel rather than a single depth value, leading to an estimate of a 3D depth probability volume for each input frame. These depth probability volumes are accumulated over time under a Bayesian filtering framework as more incoming frames are processed sequentially, which effectively reduces depth uncertainty and improves accuracy, robustness, and temporal stability. Compared to prior work, the proposed approach achieves more accurate and stable results, and generalizes better to new datasets. Experimental results also show that the output of our approach can be directly fed into classical RGB-D based 3D scanning methods for 3D scene reconstruction.

Figure 1. [Panels: input frame, estimated depth, confidence, and a 3D reconstruction using 30 views.] We propose a DL-based method to estimate depth and its uncertainty (or confidence) continuously for a monocular video stream, with the goal of turning an RGB camera into an RGB-D camera. Its output can be directly fed into classical RGB-D based 3D scanning methods [32, 33] for 3D reconstruction.

1. Introduction

Depth sensing is crucial for 3D reconstruction [32, 33, 53] and scene understanding [18, 35, 44]. Active depth sensors (e.g., time-of-flight cameras [19, 36], LiDAR [7]) measure dense metric depth, but often have limited operating range (e.g., indoor) and spatial resolution [5], consume more power, and suffer from multi-path reflection and interference between sensors [29]. In contrast, estimating depth directly from images solves these issues, but faces other long-standing challenges such as scale ambiguity and drift for monocular methods [38], as well as the correspondence problem and high computational cost for stereo [48] and multi-view methods [42].

Inspired by the recent success of deep learning in 3D vision [4, 6, 13, 17, 20, 47, 51, 52, 54, 56, 57], in this paper we propose a DL-based method to estimate depth and its uncertainty continuously from a monocular video stream, with the goal of effectively turning an RGB camera into an RGB-D camera. We have two key ideas:

1. Unlike prior work, for each pixel we estimate a depth probability distribution rather than a single depth value, leading to an estimate of a Depth Probability Volume (DPV) for each input frame. As shown in Fig. 1, the DPV provides both a Maximum-Likelihood Estimate (MLE) of the depth map and the corresponding per-pixel uncertainty measure.

2. These DPVs across different frames are accumulated over time as more incoming frames are processed sequentially. The accumulation step, which originates from Bayesian filtering theory and is implemented as a learnable deep network, effectively reduces depth uncertainty and improves accuracy, robustness, and temporal stability over time, as shown later in Sec. 4.

We argue that all DL-based depth estimation methods should predict not depth values but depth distributions, and should integrate such statistical distributions over time (e.g., via Bayesian filtering). This is because dense depth estimation from images, especially for single-view methods, inherently has a lot of uncertainty, due to factors such as lack of texture, specular or transparent materials, occlusion, and scale drift. While some recent work has started focusing on uncertainty estimation [15, 21, 23, 24] for certain computer vision tasks, to our knowledge we are the first to predict a depth probability volume from images and integrate it over time in a statistical framework.
We evaluate our method extensively on multiple datasets and compare with recent state-of-the-art DL-based depth estimation methods [13, 17, 51]. We also perform the so-called "cross-dataset" evaluation task, which tests models trained on a different dataset without fine-tuning. We believe such cross-dataset tasks are essential to evaluate robustness and generalization ability [1]. Experimental results show that, with reasonably good camera pose estimation, our method outperforms these prior methods on depth estimation with better accuracy, robustness, and temporal stability. Moreover, as shown in Fig. 1, the output of the proposed method can be directly fed into RGB-D based 3D scanning methods [32, 33] for 3D scene reconstruction.

2. Related Work

Depth sensing from active sensors  Active depth sensors, such as depth cameras [19, 36] or LiDAR sensors [7], provide dense metric depth measurements as well as sensor-specific confidence measures [37]. Despite their wide usage [18, 32, 35, 53], they have several inherent drawbacks [5, 29, 34, 50], such as limited operating range, low spatial resolution, sensor interference, and high power consumption. Our goal in this paper is to mimic an RGB-D sensor with a monocular RGB camera, which continuously predicts depth (and its uncertainty) from a video stream.

Depth estimation from images  Depth estimation directly from images has been a core problem in computer vision [39, 42]. Classical single-view methods [9, 38] often make strong assumptions on scene structures. Stereo and multi-view methods [42] rely on triangulation and suffer from finding correspondences for textureless regions, transparent/specular materials, and occlusion. Moreover, due to global bundle adjustment, these methods are often computationally expensive for real-time applications. For depth estimation from a monocular video, there is also scale ambiguity and drifting [30]. Because of these challenges, many computer vision systems [30, 40] use RGB images mainly for camera pose estimation but rarely for dense 3D reconstruction [41]. Nevertheless, depth sensing from images has great potential, since it addresses all the above drawbacks of active depth sensors. In this paper, we take a step in this direction using a learning-based method.

Learning-based depth estimation  Recently, researchers have shown encouraging results for depth sensing directly from images, including single-view methods [13, 17, 57], video-based methods [28, 52, 55], depth and motion from two views [6, 51], and multi-view stereo [20, 54, 56]. A few works also incorporated these DL-based depth sensing methods into visual SLAM systems [4, 47]. Despite the promising performance, however, these DL-based methods are still far from real-world applications, since their robustness and generalization ability are yet to be thoroughly tested [1]. In fact, as shown in Sec. 4, we found that many state-of-the-art methods degrade significantly even for simple cross-dataset tasks. This gives rise to an increasing demand for a systematic study of uncertainty and Bayesian deep learning for depth sensing, as performed in our paper.

Uncertainty and Bayesian deep learning  Uncertainty and Bayesian modeling have been studied for decades, with definitions ranging from the variance of posterior distributions for low-level vision [46] and motion analysis [25] to the variability of sensor input models [22]. Recently, uncertainty estimates [15, 23] for Bayesian deep learning were introduced for a variety of computer vision tasks [8, 21, 24]. In our work, the uncertainty is defined as the posterior probability of depth, i.e., the DPV estimated from a local window of several consecutive frames. Thus, our network estimates the "measurement uncertainty" [23] rather than the "model uncertainty". We also learn an additional network module to integrate this depth probability distribution over time in a Bayesian filtering manner, in order to improve the accuracy and robustness of depth estimation from a video stream.

3. Our Approach

Figure 2 shows an overview of our proposed method for depth sensing from an input video stream, which consists of three parts. The first part (Sec. 3.1) is the D-Net, which estimates the Depth Probability Volume (DPV) for each input frame. The second part (Sec. 3.2) is the K-Net, which helps to integrate the DPVs over time. The third part (Sec. 3.3) is the refinement R-Net, which improves the spatial resolution of the DPVs with guidance from the input images.

Specifically, we denote the depth probability volume (DPV) as p(d; u, v), which represents the probability of pixel (u, v) having a depth value d, where d ∈ [d_min, d_max]. Due to perspective projection, the DPV is defined on the 3D view frustum attached to the camera, as shown in Fig. 3(a). d_min and d_max are the near and far planes of the 3D frustum, which is discretized into N = 64 planes uniformly over the inverse of depth (i.e., disparity). The DPV contains the complete statistical distribution of depth for a given scene. In this paper, we directly use a non-parametric volume to represent the DPV; parametric models, such as a Gaussian Mixture Model [3], can also be used.
Figure 2. Overview of the proposed network for depth estimation with uncertainty from a video. [Pipeline diagram with three stages: Estimate Depth Probability (Sec. 3.1), where the shared D-Net and a softmax produce the measured DPV p(d_t | I_t); Integrate Depth Probability Over Time (Sec. 3.2), where the residual between the measured DPV and the predicted DPV p(d_t | I_{1:t-1}), warped from the previous frame, is modified by the K-Net (residual gain) and added back to give the updated DPV p(d_t | I_{1:t}); Refine Depth Probability (Sec. 3.3), where the R-Net upsamples the DPV using skip connections from the image features.] Our method takes the frames in a local time window of the video as input and outputs a Depth Probability Volume (DPV) that is updated over time. The update procedure follows a Bayesian filtering scheme: we first take the difference between the local DPV estimated using the D-Net (Sec. 3.1) and the predicted DPV from previous frames to get the residual; then the residual is modified by the K-Net (Sec. 3.2) and added back to the predicted DPV; finally the DPV is refined and upsampled by the R-Net (Sec. 3.3), which can be used to compute the depth map and its confidence measure.

Figure 3. Representation and update of the DPV. (a) The DPV is defined over a 3D frustum determined by the pinhole camera model and the camera trajectory. (b) The DPV gets updated over time (from frame t to t+1) as the camera moves.

Given the DPV, we can compute the Maximum-Likelihood Estimate (MLE) of the depth and its confidence:

    Depth:      d̂(u, v) = Σ_{d=d_min}^{d_max} p(d; (u, v)) · d,      (1)
    Confidence: Ĉ(u, v) = p(d̂; (u, v)).                              (2)

To make the notation more concise, we will omit (u, v) and use p(d) for DPVs in the rest of the paper.

When processing a video stream, the DPV can be treated as a hidden state of the system. As the camera moves, as shown in Fig. 3(b), the DPV p(d) is updated as new observations arrive, especially for the overlapping volumes. Meanwhile, if the camera motion is known, we can easily predict the next state p(d) from the current state. This predict-update iteration naturally implies a Bayesian filtering scheme to update the DPV over time for better accuracy.
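As a concrete illustration of Eqs. 1 and 2, the sketch below computes the depth and confidence maps from a DPV tensor in PyTorch. The tensor layout (batch, depth bins, height, width) and the variable names are our own assumptions, not the authors' released code.

```python
import torch

def depth_and_confidence(dpv, depth_bins):
    """Compute the per-pixel depth estimate (Eq. 1) and confidence (Eq. 2).

    dpv:        (B, D, H, W) tensor, softmax-normalized over the D depth bins.
    depth_bins: (D,) tensor with the depth value of each candidate plane.
    """
    # Eq. 1: expectation of depth under the per-pixel distribution.
    depth = (dpv * depth_bins.view(1, -1, 1, 1)).sum(dim=1)          # (B, H, W)

    # Eq. 2: confidence = probability mass at the estimated depth.
    # With a discretized DPV we read the probability of the bin closest to d_hat.
    idx = (depth.unsqueeze(1) - depth_bins.view(1, -1, 1, 1)).abs().argmin(dim=1)
    confidence = dpv.gather(1, idx.unsqueeze(1)).squeeze(1)          # (B, H, W)
    return depth, confidence

# Usage sketch: dpv = torch.softmax(-cost_volume, dim=1), with depth_bins spaced
# uniformly in inverse depth as described above.
```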
3.1. D-Net: Estimating DPV

For each frame I_t, we use a CNN, named the D-Net, to estimate the conditional DPV p(d_t | I_t), using I_t and its temporally neighboring frames. In this paper, we consider a local time window of five frames N_t = [t − 2∆t, t − ∆t, t, t + ∆t, t + 2∆t], and we set ∆t = 5 for all our testing videos (25 fps / 30 fps). For a given depth candidate d, we can compute a cost map by warping all the neighboring frames into the current frame I_t and computing their differences. Thus, for all depth candidates, we can compute a cost volume, which produces the DPV after a softmax layer:

    L(d_t | I_t) = Σ_{k ∈ N_t, k ≠ t} || f(I_t) − warp(f(I_k); d_t, δT_{kt}) ||,
    p(d_t | I_t) = softmax(L(d_t | I_t)),                                        (3)

where f(·) is a feature extractor, δT_{kt} is the relative camera pose from frame I_k to frame I_t, and warp(·) is an operator that warps the image features from frame I_k to the reference frame I_t, implemented as 2D grid sampling. In this paper, without loss of generality, we use the feature extractor f(·) from PSM-Net [6], which outputs a feature map of 1/4 the size of the input image. Later, in Sec. 3.3, we learn a refinement R-Net to upsample the DPV back to the original size of the input image.

Figure 4 shows an example of a depth map d̂(u, v) and its confidence map Ĉ(u, v) (blue means low confidence) derived from a Depth Probability Volume (DPV) for the input image. The bottom plot shows the depth probability distributions p(d; u, v) for three selected points. The red and green points have sharp peaks, which indicates high confidence in their depth values. The blue point is in the highlight region, and thus it has a flat depth probability distribution and a low confidence for its depth.
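The plane-sweep warping in Eq. 3 can be sketched as follows. This is a minimal PyTorch illustration under assumed conventions (pinhole intrinsics K at the feature resolution, relative pose (R, t) mapping reference-frame 3D points into the source camera frame, L1 feature differences); it is not the authors' implementation, and the sign of the softmax input is our assumption since lower cost should map to higher probability.

```python
import torch
import torch.nn.functional as F

def plane_sweep_dpv(feat_ref, feats_src, poses_src, K, depth_bins):
    """Build the measured DPV of Eq. 3 by warping source features over depth planes.

    feat_ref:   (C, H, W) features of the reference frame I_t.
    feats_src:  list of (C, H, W) features of the neighboring frames I_k.
    poses_src:  list of (R, t) with R: (3, 3), t: (3,), i.e., delta T_kt.
    K:          (3, 3) intrinsics at the feature resolution.
    depth_bins: (D,) depth candidates (uniform in inverse depth in the paper).
    """
    C, H, W = feat_ref.shape
    D = depth_bins.numel()
    # Homogeneous pixel grid of the reference view, shape (3, H*W).
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u.reshape(-1), v.reshape(-1), torch.ones(H * W)], dim=0)
    rays = torch.linalg.inv(K) @ pix                                   # (3, H*W)

    cost = torch.zeros(D, H, W)
    for feat_src, (R, t) in zip(feats_src, poses_src):
        for i, d in enumerate(depth_bins):
            # Back-project reference pixels to depth d, transform, re-project.
            X_src = R @ (rays * d) + t.view(3, 1)                      # (3, H*W)
            p_src = K @ X_src
            uv = p_src[:2] / p_src[2:].clamp(min=1e-6)                 # (2, H*W)
            # Normalize to [-1, 1] for grid_sample (2D grid sampling in the paper).
            grid = torch.stack([2 * uv[0] / (W - 1) - 1,
                                2 * uv[1] / (H - 1) - 1], dim=-1).view(1, H, W, 2)
            warped = F.grid_sample(feat_src.unsqueeze(0), grid,
                                   mode="bilinear", align_corners=True)[0]
            cost[i] += (feat_ref - warped).abs().sum(dim=0)            # L1 feature distance
    # Softmax over the depth dimension per pixel; the cost is negated (assumption).
    return torch.softmax(-cost.view(D, -1), dim=0).view(D, H, W)
```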
Figure 4. An example of a depth map d̂(u, v) and its confidence map Ĉ(u, v) (blue means low confidence) derived from a Depth Probability Volume (DPV). [Panels: input frame, depth, confidence; bottom plot: depth probability p(d) versus depth (meters) for three selected points.] The bottom plot shows the depth probability distributions p(d; u, v) for the three selected points. The red and green points have sharp peaks, which indicates high confidence in their depth values. The blue point is in the highlight region, which results in a flat depth probability distribution and a low confidence for its depth value.

3.2. K-Net: Integrating DPV over Time

When processing a video stream, our goal is to integrate the local estimates of the DPV over time to reduce uncertainty. As mentioned earlier, this integration can be naturally implemented as Bayesian filtering. Let us define d_t as the hidden state, which is the depth (in camera coordinates) at frame I_t. The "belief" volume p(d_t | I_{1:t}) is the conditional distribution of the state given all the previous frames. A simple Bayesian filter can be implemented in two iterative steps:

    Predict:  p(d_t | I_{1:t})     → p(d_{t+1} | I_{1:t}),
    Update:   p(d_{t+1} | I_{1:t}) → p(d_{t+1} | I_{1:t+1}),                      (4)

where the prediction step warps the current DPV from the camera coordinates at time t to the camera coordinates at time t + 1:

    p(d_{t+1} | I_{1:t}) = warp(p(d_t | I_{1:t}), δT_{t,t+1}),                     (5)

where δT_{t,t+1} is the relative camera pose from time t to time t + 1, and warp(·) here is a warping operator implemented as 3D grid sampling. At time t + 1, we can compute the local DPV p(d_{t+1} | I_{t+1}) from the new measurement I_{t+1} using the D-Net. This local estimate is then used to update the hidden state, i.e., the "belief" volume,

    p(d_{t+1} | I_{1:t+1}) = p(d_{t+1} | I_{1:t}) · p(d_{t+1} | I_{t+1}).          (6)

Note that we always normalize the DPV in the above equations and ensure ∫_{d_min}^{d_max} p(d) = 1. Figure 5 shows an example. As shown in the second row, with the above Bayesian filtering (labeled as "no damping"), the estimated depth map is less noisy, especially in the regions of the back wall and the floor.

Figure 5. Comparison between different methods for integrating the DPV over time. Part of the wall is occluded by the chair at frame t and disoccluded in frame t + 1. No filtering: not integrating the DPV over time. No damping: integrating the DPV directly with Bayesian filtering. Global damping: down-weighting the predicted DPV for all voxels using Eq. 7 with λ = 0.8. Adaptive damping: down-weighting the predicted DPV adaptively with the K-Net (Sec. 3.2). Using the K-Net, we get the best depth estimation for regions with and without disocclusion.
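A minimal sketch of the predict-update iteration of Eqs. 4-6, written in negative-log space, with an optional global damping weight λ that anticipates Eq. 7 below. It assumes the predicted DPV has already been warped into the new camera frame (Eq. 5); tensor layout and names are our assumptions.

```python
import torch

def bayesian_update(dpv_pred_warped, dpv_measured, lam=1.0, eps=1e-12):
    """Fuse the warped prediction p(d_{t+1}|I_{1:t}) with the measurement
    p(d_{t+1}|I_{t+1}) (Eq. 6), optionally with global damping (Eq. 7).

    Both inputs: (B, D, H, W), normalized over the depth dimension.
    lam = 1.0 reproduces plain Bayesian filtering ("no damping");
    lam < 1.0 down-weights the prediction equally for all voxels.
    """
    E_pred = -torch.log(dpv_pred_warped.clamp(min=eps))   # E(d) = -log p(d)
    E_meas = -torch.log(dpv_measured.clamp(min=eps))
    E_post = lam * E_pred + E_meas                        # Eqs. 6/7 in log space
    # Back to probabilities, renormalized so the DPV sums to 1 over depth.
    return torch.softmax(-E_post, dim=1)
```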
However, one problem with directly applying Bayesian filtering is that it integrates both correct and incorrect information in the prediction step. For example, when there are occlusions or disocclusions, the depth values near the occlusion boundaries change abruptly. Applying Bayesian filtering directly will propagate wrong information to the next frames in those regions, as highlighted in the red box in Fig. 5. One straightforward solution is to reduce the weight of the prediction in order to prevent incorrect information from being integrated over time. Specifically, by defining E(d) = − log p(d), Eq. 6 can be rewritten as

    E(d_{t+1} | I_{1:t+1}) = E(d_{t+1} | I_{1:t}) + E(d_{t+1} | I_{t+1}),

where the first term is the prediction and the second term is the measurement. To reduce the weight of the prediction, we multiply the first term by a weight λ ∈ [0, 1]:

    E(d_{t+1} | I_{1:t+1}) = λ · E(d_{t+1} | I_{1:t}) + E(d_{t+1} | I_{t+1}).      (7)

We call this scheme "global damping". As shown in Fig. 5, global damping helps to reduce the error in the disoccluded regions. However, global damping may also prevent some correct depth information from being integrated into the next frames, since it reduces the weights equally for all voxels in the DPV. Therefore, we propose an "adaptive damping" scheme to update the DPV:

    E(d_{t+1} | I_{1:t+1}) = E(d_{t+1} | I_{1:t}) + g(∆E_{t+1}, I_{t+1}),           (8)
where ∆E_{t+1} is the difference between the measurement and the prediction,

    ∆E_{t+1} = E(d_{t+1} | I_{t+1}) − E(d_{t+1} | I_{1:t}),                         (9)

and g(·) is a CNN, named the K-Net, which learns to transform ∆E_{t+1} into a correction term for the prediction. Intuitively, for regions with correct depth probability estimates, the values in the overlapping volume of the DPVs are consistent; the residual in Eq. 9 is therefore small and the DPV will barely be updated in Eq. 8. On the other hand, for regions with incorrect depth probability, the residual is large and the DPV will be corrected by g(∆E, I_{t+1}). In this way, the weight of the prediction is changed adaptively for different DPV voxels. As shown in Fig. 5, the adaptive damping, i.e., the K-Net, significantly improves the accuracy of depth estimation. In fact, the K-Net is closely related to the derivation of the Kalman filter, where "K" stands for Kalman gain. Please refer to the appendix for details.
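The adaptive update of Eqs. 8 and 9 can be sketched as follows, with `gain_net` standing in for the K-Net g(·) (e.g., a small 3D CNN like the sketch following Table 8 in the appendix). The layout conventions are our assumptions rather than the released code.

```python
import torch

def adaptive_update(dpv_pred_warped, dpv_measured, ref_img_small, gain_net, eps=1e-12):
    """Adaptive damping (Eqs. 8-9): a learned gain modifies the residual between
    the measured and predicted negative-log DPVs before it is added back.

    dpv_pred_warped, dpv_measured: (B, D, H/4, W/4), normalized over depth.
    ref_img_small: (B, 3, H/4, W/4) reference image at the DPV resolution.
    gain_net: a CNN g(.) mapping (residual volume, image) -> correction volume.
    """
    E_pred = -torch.log(dpv_pred_warped.clamp(min=eps))
    E_meas = -torch.log(dpv_measured.clamp(min=eps))
    residual = E_meas - E_pred                                   # Eq. 9
    correction = gain_net(residual.unsqueeze(1), ref_img_small).squeeze(1)
    E_post = E_pred + correction                                 # Eq. 8
    return torch.softmax(-E_post, dim=1)                         # renormalized DPV
```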
3.3. R-Net and Training Details

Finally, since the DPV p(d_t | I_{1:t}) is estimated at 1/4 of the spatial resolution (in both width and height) of the input image, we employ a CNN, named the R-Net, to upsample and refine the DPV back to the original image resolution. The R-Net, h(·), is essentially a U-Net with skip connections, which takes as input the low-resolution DPV from the K-Net g(·) and the image features extracted by the feature extractor f(·), and outputs a high-resolution DPV.

In summary, as shown in Fig. 2, the entire network has three modules, i.e., the D-Net f(·; Θ1), the K-Net g(·; Θ2), and the R-Net h(·; Θ3). Detailed network architectures are provided in the appendix. The full network is trained end-to-end, with simply the Negative Log-Likelihood (NLL) loss over the depth, Loss = NLL(p(d), d_GT). We also tried adding image warping as an additional loss term (i.e., minimizing the difference between I_t and the warped neighboring frames), but we found that it does not improve the quality of the depth prediction.

During training, we use ground-truth camera poses. For all our experiments, we use the ADAM optimizer [26] with a learning rate of 10^-5, β1 = 0.9, and β2 = 0.999. The whole framework, including the D-Net, K-Net, and R-Net, is trained together in an end-to-end fashion for 20 epochs.
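The NLL loss over the discretized DPV can be written as below. Mapping the ground-truth depth to its nearest depth bin is our assumption about the discretization, not a detail stated in the paper.

```python
import torch

def nll_depth_loss(dpv, depth_gt, depth_bins, valid_mask, eps=1e-12):
    """Negative log-likelihood of the ground-truth depth under the predicted DPV.

    dpv:        (B, D, H, W), normalized over the D depth bins.
    depth_gt:   (B, H, W) ground-truth depth.
    depth_bins: (D,) depth value of each bin.
    valid_mask: (B, H, W) boolean mask of pixels with valid ground truth.
    """
    # Assign each ground-truth depth to its nearest bin (assumed discretization).
    bin_idx = (depth_gt.unsqueeze(1) - depth_bins.view(1, -1, 1, 1)).abs().argmin(dim=1)
    p_gt = dpv.gather(1, bin_idx.unsqueeze(1)).squeeze(1)      # p(d_GT) per pixel
    nll = -torch.log(p_gt.clamp(min=eps))
    return nll[valid_mask].mean()
```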
3.4. Camera Poses during Inference

During inference, given an input video stream, our method requires relative camera poses δT between consecutive frames, at least for the first five frames, to bootstrap the computation of the DPV. In this paper, we evaluated several options to solve this problem. In many applications, such as autonomous driving and AR, initial camera poses may be provided by additional sensors such as GPS, an odometer, or an IMU. Alternatively, we can also run state-of-the-art monocular visual odometry methods, such as DSO [12], to obtain the initial camera poses. Since our method outputs continuous dense depth maps and their uncertainty maps, we can in fact further optimize the initial camera poses within a local time window, similar to local bundle adjustment [49].

Figure 6. Camera pose optimization in a sliding local time window during inference. Given the relative camera pose from the reference frame in N_t to the reference frame in N_{t+1}, we can predict the depth map for the reference frame in N_{t+1}. Then, we optimize the relative camera poses between every source frame and the reference frame in N_{t+1} using Eq. 10.

Specifically, as shown in Fig. 6, given p(d_t | I_{1:t}), the DPV of the reference frame I_t in the local time window N_t, we can warp p(d_t | I_{1:t}) to the reference camera view in N_{t+1} to predict the DPV p(d_{t+1} | I_{1:t}) using Eq. 5. Then we get the depth map d̂ and the confidence map Ĉ for the new reference frame using Eq. 2. The camera poses within the local time window N_{t+1} are optimized as:

    min_{δT_{k,t+1}}  Σ_{k ∈ N_{t+1}, k ≠ t+1}  Ĉ · | I_{t+1} − warp(I_k; d̂, δT_{k,t+1}) |_1,      (10)

where δT_{k,t+1} is the relative camera pose of frame k with respect to frame t + 1, I_k is the source image at frame k, and warp(·) is an operator that warps the image from the source view to the reference view.
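The objective in Eq. 10 amounts to a confidence-weighted photometric error. A sketch of evaluating it for one source frame is given below; the image-warping routine is assumed to follow the same pinhole back-project/re-project pattern as in the Sec. 3.1 sketch, and the choice of optimizer for the pose parameters is ours, not the paper's.

```python
import torch

def pose_photometric_loss(img_ref, img_src, depth_ref, conf_ref, pose_src, warp_fn):
    """Confidence-weighted photometric error of Eq. 10 for a single source frame.

    img_ref, img_src: (3, H, W) reference and source images.
    depth_ref:        (H, W) depth map d_hat of the reference frame (from Eq. 1).
    conf_ref:         (H, W) confidence map C_hat (from Eq. 2).
    pose_src:         relative pose delta T_{k,t+1}, e.g., an (R, t) pair.
    warp_fn:          warp_fn(img_src, depth_ref, pose_src) -> (3, H, W), the source
                      image resampled into the reference view (pinhole warp).
    """
    warped = warp_fn(img_src, depth_ref, pose_src)
    residual = (img_ref - warped).abs().sum(dim=0)          # per-pixel L1 over channels
    return (conf_ref * residual).mean()

# The poses in the window are refined by minimizing the sum of this loss over all
# source frames, e.g., with gradient descent over a local pose parameterization.
```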
4. Experimental Results

We evaluate our method on multiple indoor and outdoor datasets [43, 45, 14, 16], with an emphasis on accuracy and robustness. For the accuracy evaluation, we argue that the widely used statistical metrics [11, 51] are insufficient, because they can only provide an overall estimate over the entire depth map. Rather, we feed the estimated depth maps directly into classical RGB-D based 3D scanning systems [32, 33] for 3D reconstruction, which shows the metric accuracy, the consistency, and the usefulness of the estimation. For the robustness evaluation, we performed the aforementioned cross-dataset evaluation tasks, i.e., testing on new datasets without fine-tuning. The performance degradation on new datasets shows the generalization ability and robustness of a given algorithm.

Figure 7. Exemplar results of our approach on ScanNet [10]. In addition to high-quality depth output, we also obtain reasonable confidence maps (as shown in the marked regions for occlusion and specularity) which correlate with the depth error. Moreover, the confidence maps accumulate correctly over time with more input frames.

Table 1. Comparison of depth estimation over the 7-Scenes dataset [43] with the metrics defined in [11].
            σ < 1.25   abs. rel   rmse     scale inv.
DeMoN [51]  31.88      0.3888     0.8549   0.4473
DORN [13]   60.05      0.2000     0.4591   0.2207
Ours        69.26      0.1758     0.4408   0.1899
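For reference, the sketch below shows one common way to compute the metrics from [11] reported in Table 1 and throughout this section (threshold accuracy, absolute relative error, RMSE, and the scale-invariant log error). The exact normalization of the scale-invariant term varies between papers, so treat this as an illustrative convention rather than the authors' evaluation script.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth-estimation metrics, after the definitions in Eigen et al. [11].

    pred, gt: 1D arrays of predicted and ground-truth depths over valid pixels.
    """
    ratio = np.maximum(pred / gt, gt / pred)
    delta_125 = np.mean(ratio < 1.25) * 100.0               # "sigma < 1.25", in percent
    abs_rel = np.mean(np.abs(pred - gt) / gt)                # absolute relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))                # root mean squared error
    d = np.log(pred) - np.log(gt)                            # log-depth residuals
    scale_inv = np.sqrt(np.mean(d ** 2) - np.mean(d) ** 2)   # scale-invariant log error
    return {"sigma<1.25": delta_125, "abs rel": abs_rel,
            "rmse": rmse, "scale inv.": scale_inv}
```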
As no prior work operates in the exact same setting as ours, it is difficult to choose methods to compare with. We carefully select a few recent DL-based depth estimation methods and try our best for a fair comparison. For single-view methods, we select DORN [13], which is the current state of the art [1]. For two-view methods, we compare with DeMoN [51], which shows high-quality depth prediction from a pair of images. We also compare with MonoDepth [17], which is a semi-supervised learning approach from stereo images. To improve the temporal consistency of these per-frame estimations, we trained a post-processing network [27], but we observed that it does not improve the performance. Since there is always scale ambiguity for depth from a monocular camera, for a fair comparison we normalize the scale of the outputs from all the above methods before computing the statistical metrics [11].

The inference time for processing one frame with our method is ∼0.7 second without pose optimization and ∼1.5 seconds with pose estimation, on a workstation with a GTX 1080 GPU and 64 GB of RAM, with the framework implemented in Python. The pose estimation part could be implemented in C++ to improve efficiency.

Results for Indoor Scenarios  We first evaluated our method for indoor scenarios, for which RGB-D sensors were used to capture dense metric depth as ground truth. We trained our network on ScanNet [10]. Figure 7 shows two exemplar results. As shown, in addition to depth maps, our method also outputs reasonable confidence maps (e.g., low confidence in the occluded or specular regions) which correlate with the depth errors. Moreover, with more input frames, the confidence maps accumulate correctly over time: the confidence of the books (top row) increases and the depth error decreases; the confidence of the glass region (bottom row) decreases and the depth error increases.

For comparison, since the models provided by DORN and DeMoN were trained on different datasets, we compare with these two methods on a separate indoor dataset, 7-Scenes [43]. For our method, we assume that the relative camera rotation δR within a local time window is provided (e.g., measured by an IMU). As shown in Table 5, our method significantly outperforms both DeMoN and DORN on this dataset based on the commonly used statistical metrics [11]. We include the complete metrics in the appendix. Without using an IMU, our method can also achieve better performance, as shown in Table 4.

For a qualitative comparison, as shown in Fig. 8, the depth maps from our method are less noisy, sharper, and temporally more consistent. More importantly, using an RGB-D 3D scanning method [33], we can reconstruct a much higher quality 3D mesh with our estimated depths compared to the other methods. Even when compared with 3D reconstruction using a real RGB-D sensor, our result has better coverage and accuracy in some regions (e.g., monitors and glossy surfaces) that active depth sensors cannot capture.

Results for Outdoor Scenarios  We also evaluated our method on outdoor datasets: KITTI [16] and Virtual KITTI [14]. The Virtual KITTI dataset is used because it has dense, accurate metric depth as ground truth, while KITTI only has sparse depth values from LiDAR as ground truth. For our method, we use the camera poses measured by the IMU and GPS. Table 6 lists the comparison results with DORN [13], Eigen [11], and MonoDepth [17], which are also trained on KITTI [16]. Our method has similar performance to DORN [13], and is better than the other two methods, based on the statistical metrics defined in [11]. We also tested our method with camera poses from DSO [12] and obtained slightly worse performance (see appendix).

Figure 9 shows a qualitative comparison of depth maps on the KITTI dataset. As shown, our method generates sharper and less noisy depth maps. In addition, our method outputs depth confidence maps (e.g., lower confidence on the car window). Our depth estimation is temporally consistent, which makes it possible to fuse multiple depth maps with voxel hashing [33] outdoors for a large-scale dense 3D reconstruction, as shown in Fig. 9.
Figure 8. Depth and 3D reconstruction results on indoor datasets (best viewed when zoomed in). [Panels per example: input frame, GT depth, DORN depth, DeMoN depth, our depth, our confidence; reconstructed meshes: GT views 1-2, DORN view 1, DeMoN view 1, our views 1-2.] We compare our method with DORN [13] and DeMoN [51], in terms of both depth maps and 3D reconstruction using Voxel Hashing [33], which accumulates the estimated depth maps over multiple frames. To show the temporal consistency of the depths, we use different numbers of depth maps for Voxel Hashing: 2 depth maps for the first sample and 30 depth maps for the other samples. The depth maps from DORN contain block artifacts, as marked in the red boxes; this manifests as rippled shapes in the 3D reconstruction. DeMoN generates sharp depth boundaries but fails to recover the depth faithfully in the regions marked in the green box. Also, the depths from DeMoN are not temporally consistent, which leads to severe misalignment artifacts in the 3D reconstructions. In comparison, our method generates correct and temporally consistent depth maps, especially in regions with high confidence, such as the monitor, where even the Kinect sensor fails to get the depth due to low reflectance.

Table 2. Comparison of depth estimation on KITTI [16].
            σ < 1.25   abs. rel   rmse     scale inv.
Eigen [11]  67.80      0.1904     5.114    0.2628
Mono [17]   86.43      0.1238     2.8684   0.1635
DORN [13]   92.62      0.0874     3.1375   0.1233
Ours        93.15      0.0998     2.8294   0.1070

Table 3. Cross-dataset tests for depth estimation in the outdoors.
KITTI (train) → Virtual KITTI (test)
            σ < 1.25   abs. rel   rmse     scale inv.
DORN [13]   69.61      0.2256     9.618    0.3986
Ours        73.38      0.2537     6.452    0.2548
Indoor (train) → KITTI (test)
            σ < 1.25   abs. rel   rmse     scale inv.
DORN [13]   25.44      0.6352     8.603    0.4448
Ours        72.96      0.2798     5.437    0.2139

Table 4. Performance on 7-Scenes with different initial poses.
            σ < 1.25   abs. rel   rmse     scale inv.
VO pose     60.63      0.1999     0.4816   0.2158
1st win.    62.08      0.1923     0.4591   0.2001
GT R        69.26      0.1758     0.4408   0.1899
GT pose     70.54      0.1619     0.3932   0.1586

Figure 9. Depth map and 3D reconstruction for KITTI, compared with DORN [13] and MonoDepth [17] (best viewed when zoomed in). [Panels: input frame, MonoDepth, DORN, our depth, our confidence; second row: top-view reconstructions from MonoDepth, DORN, and ours.] First row: our depth map is sharper and contains less noise. For the specular region (marked in the pink box), the confidence is lower. Second row, from left to right: reconstructions using the depth maps of the same 100 frames estimated by MonoDepth, DORN, and our method. All meshes are viewed from above. Within the 100 frames, the vehicle was travelling in a straight line without turning.

In Table 3, we performed the cross-dataset task. The first half shows the results of training on KITTI [16] and testing on Virtual KITTI [14]. The second half shows the results of training on indoor datasets (NYUv2 [31] for DORN [13] and ScanNet [10] for ours) and testing on KITTI [16]. As shown, our method performs better, which demonstrates its better robustness and generalization ability.

Ablation Study  The performance of our method relies on an accurate estimate of the camera poses, so we test our method with different camera pose estimation schemes: (1) the relative camera rotation δR is read from an IMU sensor (denoted as "GT R"); (2) δR of all frames is initialized with DSO [12] (denoted as "VO pose"); (3) δR of the first five frames is initialized with DSO [12] (denoted as "1st win."). We observe that when only the camera poses in the first time window are initialized using DSO, the depth estimation performance is better than when using the DSO pose initialization for all frames. This may seem counter-intuitive, but it is because monocular VO methods sometimes have large errors in textureless regions, while optimization with dense depths may overcome this problem.

Figure 10. Usefulness of the depth confidence map. [Panels: (a) Depth Correction, with input frame, confidence, depth before, depth after; (b) Mesh Masking, before and after masking.] (a) Correcting the depth map using the Fast Bilateral Solver [2]. (b) Masking out pixels with low confidence before applying Voxel Hashing [33].
Usefulness of the Confidence Map  The estimated confidence maps can also be used to further improve the depth maps. As shown in Fig. 10(a), given the depth map and the corresponding confidence, we can correct the regions with lower confidence due to specular reflection. Also, for 3D reconstruction algorithms, given the depth confidence, we can mask out the regions with lower confidence for a better reconstruction, as shown in Fig. 10(b).
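A minimal sketch of the masking step in Fig. 10(b): pixels whose confidence falls below a threshold are invalidated before the depth map is passed to a fusion system such as Voxel Hashing [33]. The threshold value and the convention of marking invalid depths as zero are our assumptions.

```python
import numpy as np

def mask_low_confidence(depth, confidence, threshold=0.5, invalid_value=0.0):
    """Invalidate low-confidence depths before feeding them to an RGB-D fusion system."""
    masked = depth.copy()
    masked[confidence < threshold] = invalid_value   # many fusion back-ends treat 0 as "no depth"
    return masked
```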
5. Conclusions and Limitations

In this paper, we present a DL-based method for continuous depth sensing from a monocular video camera. Our method estimates a depth probability distribution volume from a local time window, and integrates it over time under a Bayesian filtering framework. Experimental results show that our approach achieves high accuracy, temporal consistency, and robustness for depth sensing, especially on cross-dataset tasks. The estimated depth maps from our method can be fed directly into RGB-D scanning systems for 3D reconstruction, and achieve on-par or sometimes more complete 3D meshes than using a real RGB-D sensor.

There are several limitations that we plan to address in the future. First, camera poses from a monocular video often suffer from scale drifting, which may affect the accuracy of our depth estimation. Second, in this work we focus on depth sensing from a local time window, rather than solving it in a global context using all the frames.
Appendices

A. Relation of the K-Net to the Kalman Filter

The proposed update process defined in Eq. 8 of the main paper, which uses residuals, is closely related to the Kalman filter. In the Kalman filter, given the observation x_t at time t and the estimated hidden state h_{t−1} at time t − 1, the updated hidden state h_t is:

    h_t = W_t h_{t−1} + K_t (x_t − V_t W_t h_{t−1}),                              (11)

where W_t is the transition matrix mapping the previous hidden state to the current state, K_t is the gain matrix mapping the residual in the observation space to the hidden-state space, and V_t is the measurement matrix mapping the estimate in the hidden-state space back to the observation space.

If we assume the measurement matrix is accurate, x_t = V h_t, and that the gain and measurement matrices are temporally invariant, we have:

    h_t = W_t h_{t−1} + K (V h_t − V W_t h_{t−1})
        = W_t h_{t−1} + KV (h_t − W_t h_{t−1}).                                   (12)

Comparing our proposed update process in Eq. 5, Eq. 8, and Eq. 9 of the main paper with the Kalman filter in Eq. 12: in our case, the input images correspond to the observations x_t; the negative-log depth probabilities correspond to the hidden states h_t; the warping operator warp(·) corresponds to the transition matrix W_t; and the K-Net g(·) corresponds to the product of the gain and measurement matrices KV in Eq. 12.

B. More Results

B.1. Complete Metrics for Comparisons

We show the complete metrics for the depth estimation comparisons in Table 5 and Table 6.

B.2. Results on KITTI without GPS or IMU

In Table 7, we show the performance of our method on the KITTI dataset in the case where only the IMU measurements are available (denoted as "GT R"), and where neither IMU nor GPS is available (denoted as "opt. pose").

C. Network Structures

In this section, we illustrate the network structures used in the pipeline.

C.1. D-Net

We show the structure of the D-Net in Table 10. In the paper, we set D = 64.

C.2. K-Net

We show the structure of the K-Net in Table 8. In the paper, we set D = 64.

C.3. R-Net

We show the structure of the R-Net in Table 9. In the paper, we set D = 64.
Table 5. Comparison of depth estimation over the 7-Scenes dataset [43] with the metrics defined in [11].
            σ < 1.25   σ < 1.25²   σ < 1.25³   abs. rel   sq. rel   rmse     rmse log   scale inv.
DeMoN [51]  31.88      61.02       82.52       0.3888     0.4198    0.8549   0.4771     0.4473
DORN [13]   60.05      87.76       96.33       0.2000     0.1153    0.4591   0.2813     0.2207
Ours        69.26      91.77       96.82       0.1758     0.1123    0.4408   0.2500     0.1899

Table 6. Comparison of depth estimation over the KITTI dataset [16].
            σ < 1.25   σ < 1.25²   σ < 1.25³   abs. rel   sq. rel   rmse     rmse log   scale inv.
Eigen [11]  67.80      88.79       96.51       0.1904     1.263     5.114    0.2758     0.2628
Mono [17]   86.43      97.70       99.47       0.1238     0.5023    2.8684   0.1644     0.1635
DORN [13]   92.62      98.18       99.35       0.0874     0.4134    3.1375   0.1337     0.1233
Ours        93.15      98.018      99.25       0.0998     0.4732    2.8294   0.1280     0.1070

Table 7. Performance on the KITTI dataset without GPS/IMU measurements.
            σ < 1.25   σ < 1.25²   σ < 1.25³   abs. rel   sq. rel   rmse     rmse log   scale inv.
GT R        89.34      98.30       99.64       0.1178     0.4490    3.2042   0.1514     0.1509
opt. pose   87.78      97.22       99.10       0.1201     0.5763    3.5157   0.1672     0.1665

Table 8. K-Net structure. The operator expand(·) repeats the image intensity in the depth dimension.
Name    | Components                                              | Input   | Output dimension
Input   | concat(cost volume, expand(I_ref))                      |         | 1/4 H × 1/4 W × D × 4
conv_0  | conv_3d(3×3, ch_in=4, ch_out=32), ReLU                  | Input   | 1/4 H × 1/4 W × D × 32
        | conv_3d(3×3, ch_in=32, ch_out=32), ReLU                 |         |
conv_1  | [conv_3d(3×3, ch_in=32, ch_out=32), ReLU                | conv_0  | 1/4 H × 1/4 W × D × 32
        |  conv_3d(3×3, ch_in=32, ch_out=32)] × 4                 |         |
conv_2  | conv_3d(3×3, ch_in=32, ch_out=32), ReLU                 | conv_1  | 1/4 H × 1/4 W × D × 1
        | conv_3d(3×3, ch_in=32, ch_out=1)                        |         |
Output  | Modified cost volume from the conv_2 layer              |         | 1/4 H × 1/4 W × D × 1
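For readers who prefer code, a PyTorch sketch of the module in Table 8 is given below. It is our transcription of the table, not the released implementation: the 3×3 entries are interpreted as 3×3×3 kernels with padding 1 (so the volume size is preserved), and since the table does not state whether the ×4 blocks in conv_1 use residual connections, this sketch simply stacks them.

```python
import torch
import torch.nn as nn

class KNet(nn.Module):
    """Sketch of the K-Net in Table 8: 3D convolutions mapping the residual cost
    volume (concatenated with the reference image expanded along the depth axis)
    to a per-voxel correction of the predicted DPV."""

    def __init__(self, in_ch=4, mid_ch=32, num_blocks=4):
        super().__init__()
        self.conv_0 = nn.Sequential(
            nn.Conv3d(in_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        blocks = []
        for _ in range(num_blocks):  # "conv_1 x4" in Table 8
            blocks += [nn.Conv3d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                       nn.Conv3d(mid_ch, mid_ch, kernel_size=3, padding=1)]
        self.conv_1 = nn.Sequential(*blocks)
        self.conv_2 = nn.Sequential(
            nn.Conv3d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, 1, kernel_size=3, padding=1))

    def forward(self, residual_volume, ref_img_small):
        # residual_volume: (B, 1, D, H/4, W/4); ref_img_small: (B, 3, H/4, W/4)
        D = residual_volume.shape[2]
        expanded = ref_img_small.unsqueeze(2).expand(-1, -1, D, -1, -1)  # expand(.) in Table 8
        x = torch.cat([residual_volume, expanded], dim=1)                # (B, 4, D, H/4, W/4)
        return self.conv_2(self.conv_1(self.conv_0(x)))                  # (B, 1, D, H/4, W/4)
```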

Table 9. R-Net structure.
Name          | Components                                                      | Input                                 | Output dimension
Input         | cost volume from the K-Net                                      |                                       | 1/4 H × 1/4 W × D
conv_0        | conv_2d(3×3, ch_in=64+D, ch_out=64+D), LeakyReLU                | concat(Input, fusion in D-Net)        | 1/4 H × 1/4 W × (64+D)
              | conv_2d(3×3, ch_in=64+D, ch_out=64+D), LeakyReLU                |                                       |
trans_conv_0  | transpose_conv(4×4, ch_in=64+D, ch_out=D, stride=2), LeakyReLU  | conv_0                                | 1/2 H × 1/2 W × D
conv_1        | conv_2d(3×3, ch_in=32+D, ch_out=32+D), LeakyReLU                | concat(trans_conv_0, conv_1 in D-Net) | 1/2 H × 1/2 W × (D+32)
              | conv_2d(3×3, ch_in=32+D, ch_out=32+D), LeakyReLU                |                                       |
trans_conv_1  | transpose_conv(4×4, ch_in=32+D, ch_out=D, stride=2), LeakyReLU  | conv_1                                | H × W × D
conv_2        | conv_2d(3×3, ch_in=3+D, ch_out=3+D), LeakyReLU                  | concat(trans_conv_1, I_ref)           | H × W × D
              | conv_2d(3×3, ch_in=3+D, ch_out=D), LeakyReLU                    |                                       |
              | conv_2d(3×3, ch_in=D, ch_out=D)                                 |                                       |
Output        | Upsampled and refined cost volume                               |                                       | H × W × D

Table 10. D-Net structure. The structure is taken from [6].
Name     | Components                                                             | Input    | Output dimension
Input    | Input frame                                                            |          | H × W × 3
CNN Layers
conv0_1  | conv_2d(3×3, ch_in=3, ch_out=32, stride=2), ReLU                       | Input    | 1/2 H × 1/2 W × 32
conv0_2  | conv_2d(3×3, ch_in=32, ch_out=32), ReLU                                | conv0_1  | 1/2 H × 1/2 W × 32
conv0_3  | conv_2d(3×3, ch_in=32, ch_out=32), ReLU                                | conv0_2  | 1/2 H × 1/2 W × 32
conv1    | [conv_2d(3×3, ch_in=32, ch_out=32), ReLU                               | conv0_2  | 1/2 H × 1/2 W × 32
         |  conv_2d(3×3, ch_in=32, ch_out=32)] × 3                                |          |
conv1_1  | conv_2d(3×3, ch_in=32, ch_out=64, stride=2), ReLU                      | conv1    | 1/4 H × 1/4 W × 64
conv2    | [conv_2d(3×3, ch_in=64, ch_out=64), ReLU                               | conv1_1  | 1/4 H × 1/4 W × 64
         |  conv_2d(3×3, ch_in=64, ch_out=64)] × 15                               |          |
conv2_1  | conv_2d(3×3, ch_in=64, ch_out=128), ReLU                               | conv2    | 1/4 H × 1/4 W × 128
conv3    | [conv_2d(3×3, ch_in=128, ch_out=128), ReLU                             | conv2_1  | 1/4 H × 1/4 W × 128
         |  conv_2d(3×3, ch_in=128, ch_out=128)] × 2                              |          |
conv4    | [conv_2d(3×3, ch_in=128, ch_out=128, dila=2), ReLU                     | conv3    | 1/4 H × 1/4 W × 128
         |  conv_2d(3×3, ch_in=128, ch_out=128, dila=2)] × 3                      |          |
Spatial Pyramid Layers
branch1  | avg_pool(64×64, stride=64); conv_2d(1×1, ch_in=128, ch_out=32), ReLU;  | conv4    | 1/4 H × 1/4 W × 32
         | bilinear interpolation                                                 |          |
branch2  | avg_pool(32×32, stride=32); conv_2d(1×1, ch_in=128, ch_out=32), ReLU;  | conv4    | 1/4 H × 1/4 W × 32
         | bilinear interpolation                                                 |          |
branch3  | avg_pool(16×16, stride=16); conv_2d(1×1, ch_in=128, ch_out=32), ReLU;  | conv4    | 1/4 H × 1/4 W × 32
         | bilinear interpolation                                                 |          |
branch4  | avg_pool(8×8, stride=8); conv_2d(1×1, ch_in=128, ch_out=32), ReLU;     | conv4    | 1/4 H × 1/4 W × 32
         | bilinear interpolation                                                 |          |
concat   | concat(branch1, branch2, branch3, branch4, conv2, conv4)               |          | 1/4 H × 1/4 W × 320
fusion   | conv_2d(3×3, ch_in=320, ch_out=128), ReLU                              | concat   | 1/4 H × 1/4 W × 64
         | conv_2d(1×1, ch_in=128, ch_out=64), ReLU                               |          |
Output   | The extracted image feature from the fusion layer                      |          | 1/4 H × 1/4 W × 64

References

[1] Robust Vision Challenge Workshop. http://www.robustvision.net, 2018.
[2] J. T. Barron and B. Poole. The fast bilateral solver. In European Conference on Computer Vision (ECCV), 2016.
[3] C. M. Bishop. Mixture density networks. 1994.

[4] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. Davison. CodeSLAM - Learning a compact, optimisable representation for dense visual SLAM. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[5] D. Chan, H. Buisman, C. Theobalt, and S. Thrun. A noise-aware filter for real-time depth upsampling. In Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications - M2SFA2 2008, Marseille, France, 2008. Andrea Cavallaro and Hamid Aghajan.
[6] J.-R. Chang and Y.-S. Chen. Pyramid stereo matching network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5410–5418, 2018.
[7] J. A. Christian and S. Cryan. A survey of LiDAR technology and its use in spacecraft relative navigation. In AIAA Guidance, Navigation, and Control (GNC) Conference, 2013.
[8] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen. VidLoc: a deep spatial-temporal model for 6-DoF video-clip relocalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[9] A. Criminisi, I. Reid, and A. Zisserman. Single view metrology. International Journal of Computer Vision (IJCV), 2000.
[10] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[11] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems (NIPS), 2014.
[12] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40:611–625, 2018.
[13] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[14] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[15] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), 2016.
[16] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361, 2012.
[17] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[18] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision (ECCV), 2014.
[19] R. Horaud, M. Hansard, G. Evangelidis, and C. Ménier. An overview of depth cameras and range scanners based on time-of-flight technologies. Machine Vision and Applications Journal, 27(7):1005–1020, 2016.
[20] P.-H. Huang, K. Matzen, J. Kopf, N. Ahuja, and J.-B. Huang. DeepMVS: Learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[21] E. Ilg, Ö. Çiçek, S. Galesso, A. Klein, O. Makansi, F. Hutter, and T. Brox. Uncertainty estimates and multi-hypotheses networks for optical flow. In European Conference on Computer Vision (ECCV), 2018.
[22] G. Kamberova and R. Bajcsy. Sensor errors and the uncertainties in stereo reconstruction. In Empirical Evaluation Techniques in Computer Vision, pages 96–116. IEEE Computer Society Press, 1998.
[23] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems (NIPS), 2017.
[24] A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[25] K. Kim, D. Lee, and I. Essa. Gaussian process regression flow for analysis of motion trajectories. In International Conference on Computer Vision (ICCV), 2011.
[26] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[27] W.-S. Lai, J.-B. Huang, O. Wang, E. Shechtman, E. Yumer, and M.-H. Yang. Learning blind video temporal consistency. In European Conference on Computer Vision (ECCV), 2018.
[28] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[29] A. Maimone and H. Fuchs. Reducing interference between multiple structured light depth sensors using motion. In IEEE Virtual Reality Workshops (VRW), pages 51–54, 2012.
[30] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[31] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision (ECCV), 2012.
[32] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR), pages 127–136, 2011.
[33] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 2013.
[34] F. Pomerleau, A. Breitenmoser, M. Liu, F. Colas, and R. Siegwart. Noise characterization of depth sensors for surface inspections. In International Conference on Applied Robotics for the Power Industry (CARPI), pages 16–21, 2012.
[35] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum PointNets for 3D object detection from RGB-D data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[36] F. Remondino and D. Stoppa. TOF Range-Imaging Cameras. Springer Publishing Company, Incorporated, 2013.
[37] M. Reynolds, J. Doboš, L. Peel, T. Weyrich, and G. J. Brostow. Capturing time-of-flight data with confidence. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[38] A. Saxena, S. H. Chung, and A. Y. Ng. 3D depth reconstruction from a single still image. International Journal of Computer Vision (IJCV), 76(1):53–69, Jan. 2008.
[39] A. Saxena, J. Schulte, and A. Y. Ng. Depth estimation using monocular and stereo cues. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 2197–2203, 2007.
[40] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[41] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[42] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[43] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[44] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[45] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012.
[46] R. Szeliski. Bayesian modeling of uncertainty in low-level vision. International Journal of Computer Vision, 5(3):271–301, Dec 1990.
[47] K. Tateno, F. Tombari, I. Laina, and N. Navab. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[48] B. Tippetts, D. J. Lee, K. Lillywhite, and J. Archibald. Review of stereo vision algorithms and their suitability for resource-limited systems. Journal of Real-Time Image Processing, 11(1):5–25, 2016.
[49] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment - a modern synthesis. In International Conference on Computer Vision (ICCV), 1999.
[50] J. Tuley, N. Vandapel, and M. Hebert. Analysis and removal of artifacts in 3-D LIDAR data. In International Conference on Robotics and Automation (ICRA), 2005.
[51] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[52] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[53] T. Whelan, S. Leutenegger, R. S. Moreno, B. Glocker, and A. Davison. ElasticFusion: dense SLAM without a pose graph. In Robotics: Science and Systems (RSS), 2015.
[54] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan. MVSNet: Depth inference for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2018.
[55] Z. Yin and J. Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[56] H. Zhou, B. Ummenhofer, and T. Brox. DeepTAM: Deep tracking and mapping. In European Conference on Computer Vision (ECCV), 2018.
[57] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
