Unsupervised Monocular Depth Learning With Integrated Intrinsics and Spatio-Temporal Constraints

1 Kenny Chen and Ankur Mehta are with the Department of Electrical and Computer Engineering at the University of California Los Angeles, Los Angeles, CA 90095, USA. {kennyjchen, mehtank}@ucla.edu
2 Alexandra Pogue is with the Department of Mechanical and Aerospace Engineering at the University of California Los Angeles, Los Angeles, CA 90095, USA. [email protected]
3 Ali-akbar Agha-mohammadi and Brett T. Lopez are with the NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA. {aliagha, brett.t.lopez}@jpl.nasa.gov

I. INTRODUCTION

Fig. 1. System Overview. Our system regresses depth, pose, and camera intrinsics from a sequence of monocular images. During training, we use two pairs of unlabeled stereo images and consider losses in both spatial and temporal directions for our network weights. During inference, only monocular images are required as input, and our system outputs accurately scaled depth maps and egomotion in addition to the camera's intrinsics.

Modern robotic agents take advantage of accurate, real-time range measurements to build a spatial understanding of their surrounding environments for collision avoidance, state estimation, and other navigational tasks. Such measurements are commonly retrieved via active sensors (e.g., LiDAR), which resolve distance by measuring the time-of-flight of a reflected light signal; however, these sensors are often costly [1], difficult to calibrate and maintain [2], [3], and can be unwieldy for platforms with a weight budget [4]. Passive sensors, on the other hand, have seen a tremendous surge of interest in recent literature for predicting scene depth from input imagery using multi-view stereo [5]–[7], structure-from-motion [8]–[11], or, more recently, purely monocular systems [12]–[17], due to their smaller form factor and their increasing potential to rival the performance of explicit active sensors with the advent of machine learning.

In particular, monocular depth inference is attractive since RGB cameras are ubiquitous and require the fewest sensors, but this setup suffers from a fundamental issue of scale acquisition. More specifically, in a purely monocular system, depth can only be estimated up to an ambiguous scale, and additional geometric information is required to resolve the units of the depth map. Such cameras capture frames by projecting 3D scene information onto a 2D image plane, and recovering higher-dimensional depth information from a lower dimension is fundamentally an ill-posed problem. To resolve the scale factors of these depth maps, a variety of learning-based approaches have been proposed with differing techniques to constrain the problem geometrically [13], [18]–[26]. Temporal constraints, for example, are commonly employed [12], [25], [27]–[29] and are defined as the geometric constraint between two consecutive monocular frames, aiming to minimize the photometric consistency loss after warping one frame onto the next. Spatial constraints [13], [19], on the other hand, extract scene geometry not through a forward-backward reconstruction loss (i.e., temporally) but rather through left-right pairs of stereo images with a predefined baseline. Most works choose to design their systems around one or the other, and while a few systems have integrated both constraints in a multi-network framework [30]–[32], none have taken advantage of both spatial and temporal constraints in a single network to resolve these scale factors.

To this end, we propose an unsupervised, single-network monocular depth inference approach that considers both spatial and temporal geometric constraints to resolve the scale of a predicted depth map.
Fig. 2. Architecture Overview. Our system uses a common convolutional-based encoder between the different outputs, which compresses the input
images into a latent space representation. This representation is then sent through either a trained decoder to retrieve left-right stereo image disparities, or
through different groups of fully connected layers to estimate the egomotion (n = 3 for the last layer) or camera intrinsics (n = 4).
These "spatio-temporal" constraints are enforced within the reconstruction loss functions of our network during training (Fig. 1), which aim to minimize the photometric difference between a warped frame and the actual next frame (forward-backward) while simultaneously maximizing the disparity consistency between a pair of stereo frames (left-right). Unlike previous approaches, we consider the camera intrinsics as an additional unknown parameter to be inferred, and we demonstrate accurate inference of both depth and camera parameters from a sequence of purely monocular frames; this is all performed in a single end-to-end network to minimize implementation overhead.

Our main contributions are as follows: (1) we propose an unsupervised, single-network architecture for monocular depth inference which takes advantage of the geometric constraints found in both spatial and temporal directions; (2) a novel loss function that integrates unknown camera intrinsics directly into the depth prediction network, with analysis showing that there is sufficient supervisory signal to regress these parameters; and (3) an analysis of our proposed architecture's performance on the KITTI driving dataset [33] as compared to the current state-of-the-art.

Related Work

Depth estimation using monocular images and deep learning began with supervised methods over large datasets and ground truth labeling [22], [34], [35]. Although these methods produced accurate results, acquiring ground truth data for supervised training requires expensive 3D sensors, multiple scene views, and inertial feedback to obtain even sparse depth maps [20]. Later work sought to address the lack of available high-quality labeled data by posing monocular depth estimation as a stereo image correspondence problem, where the second image in a binocular pair served as a supervisory signal [18]–[20]. This approach trained a convolutional neural network (CNN) to learn epipolar geometry constraints by generating disparity images subject to a stereo image reconstruction loss. Once trained, such networks were able to infer depth using only a single monocular color image as input. While this work achieved results comparable to supervised methods in some cases, occlusion and texture-copy artifacts that arose with stereo supervision motivated learning approaches that use a temporal sequence of images as an alternative [13], [19]. CNNs trained on monocular video regressed depth by using the camera egomotion to warp a source image to its temporally adjacent target. To address the additional problem of camera pose, [12]–[17] trained a separate pose network.

The learning of visual odometry (VO) and depth maps has useful application in visual simultaneous localization and mapping (SLAM). Visual SLAM leverages 3D vision to navigate an unknown area by determining camera pose relative to a constructed global map of the environment. To build a map and localize within it, the VO within the SLAM pipeline must solve at metric scale. Geometric approaches to monocular SLAM using first-principled solutions, such as structure from motion (SfM) [36], resolved scaling issues using external information [26], [37]. Building on such methods, work in data-driven monocular VO sought to obtain accurate scaling using sources such as GPS sensor fusion [38] or supervision [37]. Unsupervised approaches using a camera alone remain attractive, however, due to the reduction of manual effort associated with fewer sensors. Promising research in this area used combined visual constraints (e.g., monocular depth [12]–[17], stereo depth [26], [30]–[32], or optical flow [17]) to achieve scale-consistent outcomes. Network architectures for visual odometry and dense depth map estimation separate the depth and pose networks into two CNNs: one with convolutional and fully connected layers, and the other with an encoder-decoder structure [39], respectively. In the case where only monocular images are used in training, the self-supervision inherent in estimation is less constrained, having only pose generated from temporal constraints to determine depth, and vice versa. The work of [15], [40], for example, suffered from scaling ambiguity issues [30]. Training using binocular video, on the other hand, makes use of independent constraints from spatial and temporal image pairs that offer an enriched set of sampled images for network training. This "spatio-temporal" approach allowed for the regression of depth from spatial cues generated by epipolar constraints, which were then passed to the
pose network to independently estimate VO using temporal constraints [30]–[32].

In this work, we draw from the findings of [30]–[32] and determine dense depth maps and egomotion using an unsupervised, end-to-end approach. By observing the similarities between the architecture of the depth network encoder and the pose network's convolutional layers, we can effectively eliminate architecture redundancy by merging them via a common encoder (Fig. 2). Through the creation of a single, spatio-temporal network, we reduce the overhead associated with optimizing networks separately under different criteria while still achieving the performance benefits of combined vision techniques. To further reduce human effort and eliminate error-prone manual intervention, this work also follows the work of [12], [25] in demonstrating support for learning the camera intrinsics. In addition to freeing the system from manual calibration, learning the camera intrinsics can be useful when a video source is unknown.

Our proposed single, spatio-temporal network uses an effective combination of losses to regress depth, egomotion, and camera intrinsics. To predict depth, we use a photoconsistency loss between stereo image pairs, a left-right consistency loss between image disparity maps [12], and a disparity map smoothing function [41]. To estimate egomotion and camera intrinsics, we leverage a unique loss function that accounts for the photometric difference between temporally adjacent images. Using this loss, we show that we can obtain scaled visual odometry information in addition to accurate camera intrinsics.

Fig. 3. Training Diagram. Our single-network system takes a timed sequence of left images and runs them through the encoder to generate outputs that are fed to the fully connected (FC) layers (blue rectangles) and the decoder (green trapezoid). Outputs from the FC layers are the camera pose and intrinsics, and outputs from the decoder are disparity maps. The disparities are the spatial component of the network used to find left-right reprojected images (green dashed lines), while the disparities, camera pose, and intrinsics determine the temporal reprojections (pink dashed lines). All input and output images are framed in black for clarity.

II. METHODS

A. Notation

A color image, $I$, is composed of pixels with coordinates $p_{ij} \in \mathbb{R}^2$, where $I_{ij} = I(p_{ij})$. In temporal training, we denote images at time $t$ as $I^t$, and temporally adjacent images as the source frame $I^{t'}$. A pixel at time $t'$ is transformed to its corresponding pixel at time $t$ using the homogeneous transformation matrix $T_{t' \to t} \in SE(3)$ and the camera intrinsics matrix $K \in \mathbb{R}^{3 \times 3}$, where pixels in homogeneous coordinates, $\tilde{p} = (p, 1)^T$, are denoted $p$ for simplicity.

Rectified stereo image pairs are given by $I^r$, $I^l$, where the superscripts for time have been dropped for convenience, and the superscripts $l$, $r$ correspond to the left and right images, respectively. $D^l$ represents the disparity map that warps $I^r$ to the corresponding $I^l$, and we define the per-pixel disparity as $d^l_{ij} = D^l(p_{ij})$. Thus $I^l_{ij} = I^r_{i+d^l, j}$, and $d^r_{i+d^l, j} = D^r(p_{i+d^l, j})$ is the disparity that performs the reverse operation. Depth per pixel, $z$, is then determined by the relation $z = B f_x / d$, where $f_x$ is the x-component of the focal length and $B$ is the horizontal baseline between the stereo cameras.
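As a concrete illustration of the relation $z = B f_x / d$, the sketch below converts a predicted disparity map into a metric depth map. It is a minimal NumPy example; the baseline, focal length, and array shapes are illustrative KITTI-like values, not our calibration results.

```python
import numpy as np

def disparity_to_depth(disparity, fx, baseline, min_disp=1e-6):
    """Convert a disparity map (in pixels) to metric depth via z = B * fx / d."""
    d = np.maximum(disparity, min_disp)   # guard against division by zero
    return baseline * fx / d

# Illustrative values (assumed for this example only).
fx = 721.5        # focal length in pixels
baseline = 0.54   # stereo baseline in meters
disp = np.full((375, 1242), 10.0)  # a constant 10-pixel disparity map
depth = disparity_to_depth(disp, fx, baseline)
print(depth[0, 0])  # ~39 m
```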
B. Preliminaries

We can obtain the projected pixel coordinates and depth map using the equation

$$z^t p^t = K R_{t' \to t} K^{-1} z^{t'} p^{t'} + K t_{t' \to t}, \qquad (1)$$

where the camera intrinsics matrix, $K$, is written explicitly as

$$K = \begin{bmatrix} F & X_0 \\ 0 & 1 \end{bmatrix}, \quad F = \mathrm{diag}(f_x, f_y), \quad X_0 = [x_0, y_0]^T, \qquad (2)$$

and $R$ and $t$ are the rotation matrix and translation vector arguments of the transformation matrix $T$ [12]. Note that in this work we assume no lens distortion and a zero skew coefficient in the camera, and that the stereo cameras have equal intrinsic parameters. Equation (1) constitutes the temporal reconstruction loss used during training to determine the camera egomotion, $R$ and $t$, and the camera intrinsics $K$ in a single network.
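To make the role of (1) concrete, the following sketch rigidly reprojects the pixels of a source frame into the target frame given a depth map, relative pose, and intrinsics. It is a simplified NumPy illustration under the paper's assumptions (zero skew, no distortion); in practice, the warped coordinates would be used to bilinearly sample the source image inside the training graph, which is omitted here, and none of the names below are taken from our implementation.

```python
import numpy as np

def reproject(depth_src, K, R, t):
    """Apply Eq. (1): z^t p^t = K R K^{-1} z^{t'} p^{t'} + K t.

    depth_src: (H, W) depth of the source frame t'.
    K: (3, 3) intrinsics; R: (3, 3) rotation; t: (3,) translation (t' -> t).
    Returns target-frame pixel coordinates and the projected depth.
    """
    H, W = depth_src.shape
    j, i = np.meshgrid(np.arange(W), np.arange(H))        # pixel grid (u = column, v = row)
    p = np.stack([j, i, np.ones_like(j)], axis=-1)        # homogeneous pixels (u, v, 1)
    cam_pts = (np.linalg.inv(K) @ p.reshape(-1, 3).T) * depth_src.reshape(1, -1)
    cam_pts = R @ cam_pts + t.reshape(3, 1)               # rigid motion in 3D
    proj = K @ cam_pts                                    # back onto the image plane
    z_tgt = proj[2]
    uv_tgt = proj[:2] / np.maximum(z_tgt, 1e-6)
    return uv_tgt.reshape(2, H, W), z_tgt.reshape(H, W)
```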
C. Overall Optimization Objective

Our loss function is made up of a novel temporal reconstruction term and four spatial reconstruction terms [19], [26]. Error regression over the following losses allows the network to correctly predict a target image temporally and spatially during training, in order to infer depth, pose, and camera intrinsics from a monocular image sequence at test time. The temporal reconstruction term of the loss function is designated $l_{te}$, and the spatial reconstruction terms are composed of a photoconsistency loss, $l_p$, a left-right consistency loss, $l_{lr}$, and a disparity smoothness loss, $l_r$:

$$l\big(f(I^l; \theta), I^l, I^r\big) = \lambda_p \big( l_p(I^l, \hat{I}^l) + l_p(I^r, \hat{I}^r) \big) + \lambda_{te}\, l_p(I^{l,t}, \hat{I}^{l,t}) + \lambda_{lr}\, l_{lr}(D^l, D^r) + \lambda_r \big( l_r(D^l, I^l) + l_r(D^r, I^r) \big), \qquad (3)$$

where $I$ is the original image and $\hat{I}$ is the reprojected image, and the individual losses are weighted by $\lambda_l$, with the subscript $l$ corresponding to the loss function being weighted.
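The sketch below shows how the weighted terms in (3) might be combined during training. The individual loss callables, dictionary keys, and weight values are placeholders for illustration (the weights $\lambda$ are not specified at this point in the text); only the structure of the objective follows (3).

```python
def total_loss(l_p, l_lr, l_r, outputs, weights):
    """Weighted sum of the spatio-temporal terms in Eq. (3).

    `outputs` is assumed to hold the stereo reprojections (I_l_hat, I_r_hat),
    the temporal reprojection (I_lt_hat), and the predicted disparities.
    """
    o, w = outputs, weights
    spatial_photo = l_p(o["I_l"], o["I_l_hat"]) + l_p(o["I_r"], o["I_r_hat"])
    temporal_photo = l_p(o["I_lt"], o["I_lt_hat"])
    lr_consistency = l_lr(o["D_l"], o["D_r"])
    smoothness = l_r(o["D_l"], o["I_l"]) + l_r(o["D_r"], o["I_r"])
    return (w["p"] * spatial_photo + w["te"] * temporal_photo
            + w["lr"] * lr_consistency + w["r"] * smoothness)

# Example (placeholder) weights:
weights = {"p": 1.0, "te": 1.0, "lr": 1.0, "r": 0.1}
```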
1) Spatio-Temporal Reconstruction Loss: The photoconsistency loss compares image appearance:

$$l_p(I, \hat{I}) = \frac{1}{N} \sum_{i,j} \alpha\, \frac{1 - \mathrm{SSIM}(I_{ij}, \hat{I}_{ij})}{2} + (1 - \alpha)\, \big|I_{ij} - \hat{I}_{ij}\big|. \qquad (4)$$

This loss is composed of three terms in total (two spatial losses and a temporal loss). Here, $N$ is the number of image pixels and the weight $\alpha$ is set to 0.85. The structural similarity index measure (SSIM) is used in addition to an absolute error between the generated views and the sampled images [42].
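A minimal sketch of (4) is given below. It assumes an `ssim_map` helper that returns a per-pixel SSIM map (e.g., computed with small local windows, as in common open-source implementations); that helper and the array layout are assumptions of this sketch, not the paper's code.

```python
import numpy as np

def photoconsistency_loss(I, I_hat, ssim_map, alpha=0.85):
    """Eq. (4): alpha * (1 - SSIM)/2 + (1 - alpha) * L1, averaged over pixels."""
    ssim_term = (1.0 - ssim_map(I, I_hat)) / 2.0   # per-pixel structural term
    l1_term = np.abs(I - I_hat)                    # per-pixel absolute error
    per_pixel = alpha * ssim_term + (1.0 - alpha) * l1_term
    return per_pixel.mean()                        # 1/N sum over pixels
```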
For the reprojected images, we assume equal camera intrinsics produce the right and left stereo images. The focal length $\hat{f}_x$ from the intrinsics matrix $\hat{K}$ in (2) is co-predicted via the learned disparity and penalized using the spatial reconstruction losses. For stereo image inputs, the predicted disparity maps are used to generate the left view from a right image, and vice versa. Depth values calculated from the disparity maps are then input to the temporal reconstruction loss to generate the left target image from temporally adjacent source images; i.e., to generate the temporal image arguments for (4), we put (1) in the form where, for pixels $P = \{p_i,\ i = 1 \ldots N\}$,

$$\sum_{i,j} \big|I^{l,t}_{ij} - \hat{I}^{l,t}_{ij}\big| \;\rightarrow\; \sum_{p \in P} \Big| z^t p^t - \Big( \hat{K} \hat{R}_{t' \to t} \hat{K}^{-1} \tfrac{B \hat{f}_x}{d^l}\, p^{t'} + \hat{K} \hat{t}_{t' \to t} \Big) \Big|, \qquad (5)$$

is the absolute error between the left image and the reprojected image, and the structural similarity measure is generated by the same mappings between $I_{ij}$ and pixel $p_i$.

2) Spatial Reconstruction Loss: The left-right disparity consistency loss is used to obtain consistency between the disparity maps [19]. During training, the network predicts disparity maps $D^l$ and $D^r$ using only left image sequences as input, and then penalizes the difference between the left-view disparity map and the warped right view, as well as the right-view disparity map and the warped left view:

$$l_{lr}(D^l, D^r) = \frac{1}{N} \sum_{i,j} \big| d^l_{ij} - d^r_{i+d^l, j} \big| + \big| d^r_{ij} - d^l_{i+d^r, j} \big|. \qquad (6)$$

The disparity smoothness loss penalizes depth discontinuities that occur at image gradients $\partial I$ [41]. To obtain locally smooth disparities, an exponential weighting function is used on the disparity gradients $\partial d$:

$$l_r(D, I) = \frac{1}{N} \sum_{i,j} \big| \partial_x d_{ij} \big| e^{-|\partial_x I_{ij}|} + \big| \partial_y d_{ij} \big| e^{-|\partial_y I_{ij}|}. \qquad (7)$$
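The two spatial terms can be sketched as follows. This is a simplified NumPy illustration only: it uses nearest-neighbor sampling instead of bilinear interpolation, assumes single-channel images, and glosses over disparity sign conventions, so it mirrors the structure of (6) and (7) rather than the training implementation.

```python
import numpy as np

def lr_consistency_loss(d_l, d_r):
    """Eq. (6): penalize disagreement between each disparity map and the
    opposite-view map sampled at disparity-shifted columns."""
    H, W = d_l.shape
    cols = np.tile(np.arange(W), (H, 1))
    r_at_l = np.take_along_axis(d_r, np.clip(np.round(cols + d_l).astype(int), 0, W - 1), axis=1)
    l_at_r = np.take_along_axis(d_l, np.clip(np.round(cols + d_r).astype(int), 0, W - 1), axis=1)
    return np.mean(np.abs(d_l - r_at_l) + np.abs(d_r - l_at_r))

def smoothness_loss(d, img):
    """Eq. (7): edge-aware smoothness; disparity gradients are down-weighted
    where the (grayscale) image has strong gradients."""
    dx_d, dy_d = np.abs(np.diff(d, axis=1)), np.abs(np.diff(d, axis=0))
    dx_i, dy_i = np.abs(np.diff(img, axis=1)), np.abs(np.diff(img, axis=0))
    return np.mean(dx_d * np.exp(-dx_i)) + np.mean(dy_d * np.exp(-dy_i))
```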
D. Learning the Camera Intrinsics

For the predicted parameters $\hat{K}$, $\hat{R}$, $\hat{t}$ in (1), penalizing differences via the training loss ensures that $\hat{K}\hat{t}$ and $\hat{K}\hat{R}\hat{K}^{-1}$ converge to the correct values. To determine the parameters individually, the translational relation fails because it is under-determined, since there exist incorrect values of $\hat{K}$ and $\hat{t}$ such that $\hat{K}\hat{t} = Kt$. The rotational relationship, $\hat{K}\hat{R}\hat{K}^{-1} = KRK^{-1}$, however, does uniquely determine $\hat{K}$, $\hat{R}$ such that they are equal to $K$, $R$, and therefore provides sufficient supervisory signal to estimate these values accurately.

Proof: From the above relation we obtain $\hat{R} = \hat{K}^{-1} K R K^{-1} \hat{K}$, and we constrain $\hat{R}$ to be in $SO(3)$, i.e., $\hat{R}^T = \hat{R}^{-1}$ and $\det(\hat{R}) = 1$. Substituting $\hat{R}$ into the relationship $\hat{R}\hat{R}^T = I$, we find that $AR = RA$, where $A = K^{-1}\hat{K}\hat{K}^T K^{-T}$. The value $\det(\hat{K}^{-1} K R K^{-1} \hat{K})$ is equal to 1, therefore the determinant of $A$ is also equal to 1. Moreover, the characteristic equation of $A$ shows that $A$ always has an eigenvalue of 1 [12]. Thus, either the eigenvalue of $A$ is equal to 1 with an algebraic multiplicity of 3, implying $A$ is the identity matrix, or the eigenvalues are unique. If we assume $A$ has 3 distinct eigenvalues, then because $A \in \mathbb{R}^{3 \times 3}$ and $A = A^T$, we may choose the eigenvectors of $A$ such that they are real. But because $AR = RA$, for every eigenvector $v$ of $A$, $Rv$ is also an eigenvector. For an eigenvalue with algebraic multiplicity 1, the corresponding eigenspace has dimension 1, thus $Rv = \mu v$ for some scalar $\mu$, implying each eigenvector of $A$ is also an eigenvector of $R$. If $R$ is in $SO(3)$, however, it has complex eigenvectors in general, which contradicts this assertion. Therefore, $A$ must be the identity matrix, and $\hat{K}\hat{K}^T = KK^T$. Referring to $K$ from (2), we observe

$$KK^T = \begin{bmatrix} FF^T + X_0 X_0^T & X_0 \\ X_0^T & 1 \end{bmatrix}, \qquad (8)$$

which implies $\hat{X}_0 = X_0$ and $\hat{F} = F$, or $\hat{K} = K$.

It is clear from the above that for $R = I$, the relation $AR = RA$ holds trivially, and $\hat{K}$ cannot be uniquely determined. The tolerance with which $F$ in (2) can be determined (in units of pixels), with respect to the amount of camera rotation that occurs, is thus quantified as

$$\delta f_x < \frac{2 f_x^2}{w^2 r_y}; \qquad \delta f_y < \frac{2 f_y^2}{h^2 r_x}, \qquad (9)$$

where $r_x$ and $r_y$ are the x- and y-axis rotation angles (in radians) between adjacent frames, and $w$ and $h$ are the width and height of the image, respectively. For a complete proof of the relation between the strength of supervision on $K$ and the closeness of $R$ to $I$, see [12].
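As a worked example of (9), assume a KITTI-like image width of $w = 1242$ px and a focal length of roughly $f_x \approx 720$ px (illustrative values, not our calibration results). A y-axis rotation of $r_y = 0.01$ rad between adjacent frames then bounds the focal-length error at $\delta f_x < 2 f_x^2 / (w^2 r_y) \approx 67$ px, while a larger rotation of $r_y = 0.1$ rad tightens the bound to roughly $7$ px. In other words, sequences with more inter-frame rotation provide a stronger supervisory signal on the intrinsics.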
E. Network Architecture

Our framework is inspired by [30] and [32], but rather than requiring two separate networks for depth and pose estimation, we use a common encoder for both tasks in a novel single-network architecture (Fig. 3). That is, given two temporally adjacent input images at times $t$ and $t'$, our network first convolves these inputs through a series of convolutional blocks in a common encoder, and then predicts either disparities through a decoder, or camera pose and intrinsics through fully connected layers. In the decoder network, the encoder's latent representation of the input images is first re-upsampled using a standard bilinear interpolation kernel with pooling indices from the encoder to fuse low-level features, as inspired by [19], [30]. We then use rectified linear units (ReLU) [43] as activation functions in all layers of this decoder except for the prediction layer, which uses a sigmoid function instead. The decoder predicts left-to-right and right-to-left disparities $D$ at both timesteps, which are then either used to reconstruct the right stereo images for a spatially-constrained geometric loss during training, or used to construct the depth during inference. In the fully connected layers, the translation $\hat{t}_{t' \to t}$, rotation $\hat{R}_{t' \to t}$, and camera intrinsics $\hat{K}$ are predicted independently in three separate and decoupled groups of fully connected layers for better performance [31]. These outputs are then either taken at face value during inference as the predicted egomotion and camera parameters, or used as inputs (along with the estimated depth map) to warp the current frame to the next for our temporal reconstruction loss as described previously.
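To make this layout concrete, the sketch below builds a skeleton of a shared-encoder network in TensorFlow/Keras (the framework used in Sec. III): a common convolutional encoder feeding a bilinear-upsampling decoder for disparities and three decoupled fully connected heads for translation, rotation, and intrinsics. Layer widths, input resolution, head sizes, and the rotation parameterization are illustrative assumptions, and the low-level feature fusion from the encoder is omitted for brevity; this is not the paper's exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_spatio_temporal_net(height=256, width=512):
    """Structural sketch: shared encoder, disparity decoder, and FC pose/intrinsics heads."""
    inputs = layers.Input(shape=(height, width, 6))  # two temporally adjacent RGB frames, stacked

    # Common encoder: a stack of strided convolutions with ReLU activations.
    x = inputs
    for filters in (32, 64, 128, 256, 512):
        x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    latent = x

    # Decoder head: bilinear upsampling + convolution; sigmoid on the disparity prediction.
    d = latent
    for filters in (512, 256, 128, 64, 32):
        d = layers.UpSampling2D(interpolation="bilinear")(d)
        d = layers.Conv2D(filters, 3, padding="same", activation="relu")(d)
    disparities = layers.Conv2D(4, 3, padding="same", activation="sigmoid",
                                name="disparities")(d)  # L->R and R->L maps at both timesteps

    # Three decoupled groups of fully connected layers for translation, rotation, intrinsics.
    f = layers.GlobalAveragePooling2D()(latent)
    def fc_head(units_out, name):
        h = layers.Dense(512, activation="relu")(f)
        h = layers.Dense(512, activation="relu")(h)
        return layers.Dense(units_out, name=name)(h)
    translation = fc_head(3, "translation")
    rotation = fc_head(3, "rotation")      # e.g., an axis-angle or Euler parameterization
    intrinsics = fc_head(4, "intrinsics")  # fx, fy, x0, y0

    return tf.keras.Model(inputs, [disparities, translation, rotation, intrinsics])
```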
Fig. 4. Example Depth. Two representative examples of depth maps produced by our framework. Our single-network system can accurately estimate object distances from purely monocular images.

III. RESULTS

In this section, we evaluate our proposed framework using the KITTI driving dataset [33]. The network architecture was implemented using the TensorFlow framework [44], and models were trained on a single NVIDIA GeForce RTX 2070 Super GPU with 8GB of memory using a batch size of 4. The Adam optimizer [45] was used to train the network parameters, with exponential decay rates $\beta_1 = 0.9$ and $\beta_2 = 0.99$ and a learning rate $\alpha$ initially set to 0.001 but gradually decreased throughout training, allowing faster weight updates at the beginning and smaller fine-tuning steps towards the end of the learning process. We used standard data augmentation techniques during training, such as random left-right mirroring of the training images and perturbing the image color space (i.e., gamma, brightness, and color shifts), to artificially increase the size of the training dataset. At test time, inference was performed on an AMD Ryzen 7 3700X 8-Core 3.6GHz CPU, and we compare our method against the current state-of-the-art using conventional metrics (i.e., Abs Rel, Sq Rel, RMSE, RMSE log, and Accuracy for depth, and Absolute Trajectory Error (ATE) for egomotion) as per [46].
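A minimal sketch of the training configuration described above (Adam with $\beta_1 = 0.9$, $\beta_2 = 0.99$, an initial learning rate of 0.001 with gradual decay, plus mirroring and color-space augmentation) is given below. The decay schedule's step count and rate, and the augmentation ranges, are assumed values for illustration; for stereo training pairs, mirroring would also require swapping the left/right roles, which is omitted here.

```python
import tensorflow as tf

# Learning rate starts at 0.001 and is gradually decreased; the schedule
# parameters (decay_steps, decay_rate) are illustrative assumptions.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=10000, decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule, beta_1=0.9, beta_2=0.99)

def augment(image):
    """Random left-right mirroring and color-space perturbation; ranges are placeholders."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.adjust_gamma(image, gamma=tf.random.uniform([], 0.8, 1.2))
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_hue(image, max_delta=0.05)
    return tf.clip_by_value(image, 0.0, 1.0)
```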
A. Performance of Depth Estimation

To evaluate the performance of our network's depth inference, we use a standard Eigen split [34] on the KITTI dataset [33], as per convention, and compare against several state-of-the-art methods, including [13], [30]–[32]. The full KITTI Vision Benchmark Suite is an extensive dataset which consists of 61 video sequences recorded across 5 days, with 42,382 total rectified stereo image pairs, each with a resolution of 1242x375. For a fair comparison, Eigen et al. [34] proposed a selection of 697 images across 28 of these video sequences as a test set for single-view depth evaluation, held out from the training and validation sets. The remaining 23,178 images from the 32 other scenes make up the training and validation sets, in which the training set contains 21,055 images while the validation set contains 2,123. We train, validate, and test our network using these splits and compare our depth estimation accuracy against other works across several metrics, shown in Table I. Ground truth data for the testing set was calculated by projecting the Velodyne LiDAR data onto the image plane. Example depth maps output by our system can be seen in Fig. 4.

B. Learned Camera Intrinsics

To evaluate our system's ability to recover the camera intrinsics (i.e., $f_x$, $f_y$, $x_0$, $y_0$) through the supervisory signal provided by the rotational component of (1), we follow a similar procedure to [12] and trained separate models on several different video sequences until convergence of these parameters, yielding multiple independent results. We specifically used ten video sequences of the "2011_09_28" subdataset, chosen to have the same ground truth calibration done that day, and Table III shows the resulting mean and standard deviation of those ten tests. All experiments were done on the left stereo color camera ("image_02") of the vehicle setup.

C. Egomotion

We carried out our pose estimation performance evaluation using four sequences from the KITTI Odometry dataset [48] and compared against several state-of-the-art methods, including UnDEMoN [32], SfMLearner [15], and VISO-M [47]. For a quantitative comparison, we adopt the absolute trajectory root-mean-square error (ATE) for both translational ($t_{ate}$) and rotational ($r_{ate}$) components as per standard practice [49], defined as

$$F_i := Q_i^{-1} S P_i. \qquad (10)$$
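For reference, the sketch below shows how the translational ATE based on (10) can be computed from two trajectories following the standard evaluation protocol [49]: the ground-truth pose $Q_i$ and estimated pose $P_i$ are compared after a rigid alignment $S$, and the translational components of the error poses are summarized by their RMSE. The 4x4 homogeneous-matrix representation and the precomputed alignment are assumptions of this sketch.

```python
import numpy as np

def ate_translation(Q, P, S=np.eye(4)):
    """Eq. (10): F_i = Q_i^{-1} S P_i; return the RMSE of the translational parts.

    Q, P: lists of 4x4 homogeneous ground-truth and estimated poses.
    S: rigid alignment between the two trajectories (identity if pre-aligned).
    """
    errs = []
    for Qi, Pi in zip(Q, P):
        Fi = np.linalg.inv(Qi) @ S @ Pi         # relative error pose
        errs.append(np.linalg.norm(Fi[:3, 3]))  # magnitude of the translational error
    return float(np.sqrt(np.mean(np.square(errs))))
```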
TABLE I: Comparison of monocular depth estimation with other spatio-temporal approaches. Cropped regions from [19] were used for the performance evaluation of all methods. In the column labeled "Train", "Depth" indicates supervised training with ground truth and "S-T" indicates an unsupervised spatio-temporal training approach. We evaluate using the Eigen split [34] on the KITTI dataset [33] and cap depth to 80 m and 50 m as per standard practice [19]. Results from other methods were taken from their corresponding papers. For error metrics, lower is better; for accuracy, higher is better.
TABLE II: Comparison of our system's odometry estimation against various other state-of-the-art methods [15], [32], [47] using the absolute trajectory error for translational ($t_{ate}$) and rotational ($r_{ate}$) movement. Values for the other methods were retrieved from [32].