
Unsupervised Monocular Depth Learning with Integrated Intrinsics and Spatio-Temporal Constraints


Kenny Chen¹, Alexandra Pogue², Brett T. Lopez³, Ali-akbar Agha-mohammadi³, and Ankur Mehta¹

arXiv:2011.01354v1 [cs.CV] 2 Nov 2020

Abstract— Monocular depth inference has gained tremendous attention from researchers in recent years and remains a promising replacement for expensive time-of-flight sensors, but issues with scale acquisition and implementation overhead still plague these systems. To this end, this work presents an unsupervised learning framework that is able to predict at-scale depth maps and egomotion, in addition to camera intrinsics, from a sequence of monocular images via a single network. Our method incorporates both spatial and temporal geometric constraints to resolve depth and pose scale factors, which are enforced within the supervisory reconstruction loss functions at training time. Only unlabeled stereo sequences are required for training the weights of our single-network architecture, which reduces overall implementation overhead as compared to previous methods. Our results demonstrate strong performance when compared to the current state-of-the-art on multiple sequences of the KITTI driving dataset.

I. INTRODUCTION

Fig. 1. System Overview. Our system regresses depth, pose, and camera intrinsics from a sequence of monocular images. During training, we use two pairs of unlabeled stereo images and consider losses in both spatial and temporal directions for our network weights. During inference, only monocular images are required as input, and our system outputs accurately scaled depth maps and egomotion in addition to the camera's intrinsics.

Modern robotic agents take advantage of accurate, real-time range measurements to build a spatial understanding of their surrounding environments for collision avoidance, state estimation, and other navigational tasks. Such measurements are commonly retrieved via active sensors (e.g., LiDAR) which resolve distance by measuring the time-of-flight of a reflected light signal; however, these sensors are often costly [1], difficult to calibrate and maintain [2], [3], and can be unwieldy for platforms with a weight budget [4]. Passive sensors, on the other hand, have seen a tremendous surge of interest in the recent literature for predicting scene depth from input imagery using multi-view stereo [5]–[7], structure-from-motion [8]–[11], or, more recently, purely monocular systems [12]–[17], due to their smaller form factor and increasing potential to rival the performance of explicit active sensors with the advent of machine learning.

In particular, monocular depth inference is attractive since RGB cameras are ubiquitous in modern times and it requires the fewest sensors, but this setup suffers from a fundamental issue of scale acquisition. More specifically, in a purely monocular system, depth can only be estimated up to an ambiguous scale and requires additional geometric information to resolve the units of the depth map. Such cameras typically capture frames by projecting 3D scene information onto a 2D image plane, and abstracting higher-dimensional depth information from a lower dimension is fundamentally an ill-posed problem. To resolve the scale factors of these depth maps, a variety of learning-based approaches have been proposed with differing techniques to constrain the problem geometrically [13], [18]–[26]. Temporal constraints, for example, are commonly employed [12], [25], [27]–[29] and are defined as the geometric constraint between two consecutive monocular frames, aiming to minimize the photometric consistency loss after warping one frame to the next. Spatial constraints [13], [19], on the other hand, extract scene geometry not through a forward-backward reconstruction loss (i.e., temporally) but rather in left-right pairs of stereo images with a predefined baseline. Most works choose to design their systems around one or the other, and while a few systems have integrated both constraints before in a multi-network framework [30]–[32], none have taken advantage of both spatial and temporal constraints in a single network to resolve these scale factors.

To this end, we propose an unsupervised, single-network monocular depth inference approach that considers both spatial and temporal geometric constraints to resolve the scale of a predicted depth map. These "spatio-temporal" constraints are enforced within the reconstruction loss functions of our network during training (Fig. 1), which aim to minimize the photometric difference between a warped frame and the actual next frame (forward-backward) while simultaneously maximizing the disparity consistency between a pair of stereo frames (left-right). Unlike previous approaches, we consider camera intrinsics as an additional unknown parameter to be inferred and demonstrate accurate inference of both depth and camera parameters from a sequence of purely monocular frames; this is all performed in a single end-to-end network to minimize implementation overhead.

Our main contributions are as follows: (1) we propose an unsupervised, single-network architecture for monocular depth inference which takes advantage of the geometric constraints found in both spatial and temporal directions; (2) a novel loss function that integrates unknown camera intrinsics directly into the depth prediction network, with analysis showing that there is sufficient supervisory signal to regress these parameters; and (3) an analysis of our proposed architecture's performance on the KITTI driving dataset [33] as compared to the current state-of-the-art.

1 Kenny Chen and Ankur Mehta are with the Department of Electrical and Computer Engineering at the University of California Los Angeles, Los Angeles, CA 90095, USA. {kennyjchen, mehtank}@ucla.edu
2 Alexandra Pogue is with the Department of Mechanical and Aerospace Engineering at the University of California Los Angeles, Los Angeles, CA 90095, USA. [email protected]
3 Ali-akbar Agha-mohammadi and Brett T. Lopez are with the NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA. {aliagha, brett.t.lopez}@jpl.nasa.gov
[Fig. 2 diagram: a common convolutional encoder (3x3 to 7x7 kernels, 32 to 512 channels) whose latent representation feeds a disparity decoder (outputs D_t^l, D_t^r, D_{t+1}^l, D_{t+1}^r) and three groups of fully connected layers (512, 512, n) for the egomotion R, t and the intrinsics K.]
Fig. 2. Architecture Overview. Our system uses a common convolutional-based encoder between the different outputs, which compresses the input
images into a latent space representation. This representation is then sent through either a trained decoder to retrieve left-right stereo image disparities, or
through different groups of fully connected layers to estimate the egomotion (n = 3 for the last layer) or camera intrinsics (n = 4).

Related Work

Depth estimation using monocular images and deep learning began with supervised methods over large datasets and ground-truth labeling [22], [34], [35]. Although these methods produced accurate results, acquiring ground-truth data for supervised training requires expensive 3D sensors, multiple scene views, and inertial feedback to obtain even sparse depth maps [20]. Later work sought to address the lack of available high-quality labeled data by posing monocular depth estimation as a stereo image correspondence problem, where the second image in a binocular pair served as a supervisory signal [18]–[20]. This approach trained a convolutional neural network (CNN) to learn epipolar geometry constraints by generating disparity images subject to a stereo image reconstruction loss. Once trained, networks were able to infer depth using only a single monocular color image as input. While this work achieved results comparable to supervised methods in some cases, occlusion and texture-copy artifacts that arose with stereo supervision motivated learning approaches that used a temporal sequence of images as an alternative [13], [19]. CNNs trained using monocular video regressed depth by using the camera egomotion to warp a source image to its temporally adjacent target. To address the additional problem of camera pose, [12]–[17] trained a separate pose network.

The learning of visual odometry (VO) and depth maps has useful applications in visual simultaneous localization and mapping (SLAM). Visual SLAM leverages 3D vision to navigate an unknown area by determining camera pose relative to a constructed global map of an environment. To build a map and localize within it, the VO within the SLAM pipeline must solve at metric scale. Geometric approaches to monocular SLAM using first-principled solutions, such as structure from motion (SfM) [36], resolved scaling issues using external information [26], [37]. Building on such methods, work in data-driven monocular VO sought to obtain accurate scaling using sources such as GPS sensor fusion [38] or supervision [37]. Unsupervised approaches using a camera alone remain attractive, however, due to the reduction of manual effort associated with fewer sensors. Promising research in this area used combined visual constraints (e.g., monocular depth [12]–[17], stereo depth [26], [30]–[32], or optical flow [17]) to achieve scale-consistent outcomes. Network architectures for visual odometry and dense depth map estimation separate the depth and pose networks into two CNNs: one with convolutional and fully connected layers, and the other an encoder-decoder structure [39], respectively. In the case where only monocular images are used in training, the self-supervision inherent in estimation is less constrained, having only pose generated from temporal constraints to determine depth, and vice versa. The work of [15], [40], for example, suffered from scaling ambiguity issues [30]. Training using binocular video, on the other hand, made use of independent constraints from spatial and temporal image pairs that offered an enriched set of sampled images for network training. This "spatio-temporal" approach allowed for the regression of depth from spatial cues generated by epipolar constraints, which were then passed to the pose network to independently estimate VO using temporal constraints [30]–[32].

In this work, we draw from the findings of [30]–[32] and determine dense depth maps and egomotion using an unsupervised, end-to-end approach. By observing the similarities between the architecture of the depth network encoder and the pose network's convolutional layers, we can effectively eliminate architecture redundancy by merging them via a common encoder (Fig. 2). Through the creation of a single, spatio-temporal network, we reduce the overhead associated with optimizing networks separately under different criteria while still achieving the performance benefits of combined vision techniques. To further reduce human effort and eliminate error-prone manual intervention, this work also follows the work of [12], [25] in demonstrating support for learning the camera intrinsics. In addition to freeing the system from manual calibration, learning camera intrinsics can be useful when a video source is unknown.

Our proposed single, spatio-temporal network uses an effective combination of losses to regress depth, egomotion, and camera intrinsics. To predict depth, we use a photoconsistency loss between stereo image pairs, a left-right consistency loss between image disparity maps [12], and a disparity map smoothing function [41]. To estimate egomotion and camera intrinsics, we leverage a unique loss function that accounts for the photometric difference between temporally adjacent images. Using this loss, we show that we can obtain scaled visual odometry information in addition to accurate camera intrinsics.
[Fig. 3 diagram: "Temporal Learning" (left) and "Spatial Learning" (right) reprojection paths between I_t^l, I_{t+1}^l, I_t^r and the disparities D_t^l, D_t^r.]

Fig. 3. Training Diagram. Our single-network system takes a timed sequence of left images and runs them through the encoder to generate outputs that are fed to the fully connected (FC) layers (blue rectangles) and the decoder (green trapezoid). Outputs from the FC layers are the camera pose and intrinsics, and outputs from the decoder are disparity maps. The disparities are the spatial component of the network used to find left-right reprojected images (green dashed lines), while the disparities, camera pose, and intrinsics determine the temporal reprojections (pink dashed lines). All input and output images are framed in black for clarity.

II. METHODS

A. Notation

A color image, I, is composed of pixels with coordinates p_{ij} \in R^2, where I_{ij} = I(p_{ij}). In temporal training, we denote images at time t as I^t, and temporally adjacent images as the source frame I^{t'}. A pixel at time t' is transformed to its corresponding pixel at time t using the homogeneous transformation matrix T_{t' \to t} \in SE(3) and the camera intrinsics matrix K \in R^{3x3}, where pixels in homogeneous coordinates, p~ = (p, 1)^T, are denoted p for simplicity.

Rectified stereo image pairs are given by I^r, I^l, where the superscripts for time have been dropped for convenience, and the superscripts l, r correspond to the left and right images, respectively. D^l represents the disparity map that warps I^r to the corresponding I^l, and we define the per-pixel disparity as d^l_{ij} = D^l(p_{ij}). Thus I^l_{ij} = I^r_{i+d^l, j}, and d^r_{i+d^l, j} = D^r(p_{i+d^l, j}) is the disparity that does the reverse operation. Depth per pixel, z, is then determined by the relation z = B f_x / d, where f_x is the x-component focal length and B is the horizontal baseline between the stereo cameras.
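To make the disparity-to-depth relation above concrete, here is a small NumPy sketch; the function name, array shapes, and the baseline/focal-length values are illustrative placeholders rather than values taken from the paper.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_x, baseline, eps=1e-6):
    """Convert per-pixel disparity d (in pixels) to metric depth via z = B * fx / d."""
    return (baseline * focal_length_x) / np.maximum(disparity, eps)

# Illustrative, roughly KITTI-like values: 0.54 m baseline, fx in pixels.
disparity_map = np.random.uniform(1.0, 100.0, size=(375, 1242))
depth_map = disparity_to_depth(disparity_map, focal_length_x=720.0, baseline=0.54)
```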
B. Preliminaries

We can obtain the projected pixel coordinates and depth map using the equation

z^t p^t = K R_{t' \to t} K^{-1} z^{t'} p^{t'} + K t_{t' \to t},   (1)

where the camera intrinsics matrix, K, is written explicitly as

K = \begin{bmatrix} F & X_0 \\ 0 & 1 \end{bmatrix}, \quad F = \mathrm{diag}(f_x, f_y), \quad X_0 = [x_0, y_0]^T,   (2)

and R and t are the rotation matrix and translation vector arguments of the transformation matrix T [12]. Note that in this work we assume no lens distortion and a zero skew coefficient in the camera, and that the stereo cameras have equal intrinsic parameters. Equation (1) constitutes the temporal reconstruction loss used at training time to determine the camera egomotion, R and t, and the camera intrinsics K in a single network.
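The sketch below implements the reprojection in (1) with NumPy for a batch of homogeneous pixel coordinates; the function and variable names are ours, and the bilinear sampling needed to actually form the warped image Î is omitted.

```python
import numpy as np

def reproject_pixels(p_src, z_src, K, R, t):
    """Apply (1): z^t p^t = K R K^{-1} z^{t'} p^{t'} + K t.

    p_src: (3, N) homogeneous pixel coordinates at time t'.
    z_src: (N,) depths at time t'.
    Returns homogeneous pixel coordinates at time t and their depths.
    """
    cam_points = np.linalg.inv(K) @ (p_src * z_src)   # back-project to 3D camera coordinates
    warped = K @ (R @ cam_points + t[:, None])        # apply rigid motion, project with K
    z_tgt = warped[2]                                 # depth in the target frame
    p_tgt = warped / z_tgt                            # normalize back to pixel coordinates
    return p_tgt, z_tgt
```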
C. Overall Optimization Objective

Our loss function is made up of a novel temporal reconstruction term and four spatial reconstruction terms [19], [26]. Error regression over the following losses allows the network to correctly predict a target image temporally and spatially during training in order to infer depth, pose, and camera intrinsics from a monocular image sequence at test time.
The temporal reconstruction term of the loss function is designated l_{te}, and the spatial reconstruction terms are composed of a photoconsistency loss l_p, a left-right consistency loss l_{lr}, and a disparity smoothness loss l_r:

l(f(I^l; \theta), I^l, I^r) = \lambda_p [ l_p(I^l, \hat{I}^l) + l_p(I^r, \hat{I}^r) ]
                             + \lambda_{te} \, l_p(I^{l,t}, \hat{I}^{l,t})
                             + \lambda_{lr} \, l_{lr}(D^l, D^r)
                             + \lambda_r [ l_r(D^l, I^l) + l_r(D^r, I^r) ],   (3)

where I is the original image and Î is the reprojected image, and individual losses are weighted by \lambda_l, with the subscript l corresponding to the loss function being weighted.
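To make the weighting in (3) concrete, here is a minimal sketch of how the terms could be combined. The sub-loss functions are passed in as placeholders, and the lambda values shown are assumptions, since the weight settings are not listed in the text above.

```python
def total_loss(I_l, I_r, I_l_hat, I_r_hat, I_lt, I_lt_hat, D_l, D_r,
               photometric_loss, lr_consistency_loss, smoothness_loss,
               lam_p=1.0, lam_te=1.0, lam_lr=1.0, lam_r=0.1):
    """Weighted sum of spatial and temporal reconstruction terms as in (3).

    The lambda values here are placeholders, not the paper's settings.
    """
    spatial_photo = photometric_loss(I_l, I_l_hat) + photometric_loss(I_r, I_r_hat)
    temporal_photo = photometric_loss(I_lt, I_lt_hat)
    lr_term = lr_consistency_loss(D_l, D_r)
    smooth_term = smoothness_loss(D_l, I_l) + smoothness_loss(D_r, I_r)
    return (lam_p * spatial_photo + lam_te * temporal_photo
            + lam_lr * lr_term + lam_r * smooth_term)
```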
1) Spatio-Temporal Reconstruction Loss: The photoconsistency loss compares image appearance:

l_p(I, \hat{I}) = \frac{1}{N} \sum_{i,j} \left[ \alpha \, \frac{1 - \mathrm{SSIM}(I_{ij}, \hat{I}_{ij})}{2} + (1 - \alpha) \, |I_{ij} - \hat{I}_{ij}| \right].   (4)

This loss appears in three terms of (3) in total (two spatial losses and a temporal loss). N in this equation is the number of image pixels, and the weight α is set to 0.85. The structural similarity index measure (SSIM) is used here in addition to an absolute error between generated views and sampled images [42].
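A TensorFlow sketch of the photoconsistency term in (4) is shown below. It uses tf.image.ssim, which returns a per-image SSIM score rather than the per-pixel, windowed SSIM map implied by (4), so this is an approximation; tensor names and the assumption that images are scaled to [0, 1] are ours.

```python
import tensorflow as tf

def photoconsistency_loss(I, I_hat, alpha=0.85):
    """SSIM + L1 photometric loss in the spirit of (4); images are [B, H, W, 3] in [0, 1]."""
    ssim = tf.image.ssim(I, I_hat, max_val=1.0)              # per-image SSIM score
    ssim_term = alpha * (1.0 - tf.reduce_mean(ssim)) / 2.0   # (1 - SSIM) / 2 term
    l1_term = (1.0 - alpha) * tf.reduce_mean(tf.abs(I - I_hat))
    return ssim_term + l1_term
```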
For reprojected images, we assume equal camera intrinsics produce the right and left stereo images. The focal length f̂_x from the intrinsics matrix K̂ in (2) is co-predicted via the learned disparity and penalized using the spatial reconstruction losses. For stereo image inputs, predicted disparity maps are used to generate the left view from a right image, and vice versa. Depth values calculated from the disparity maps are then input to the temporal reconstruction loss to generate the left target image from temporally adjacent source images; i.e., to generate the temporal image arguments for (4), we put (1) in the form where, for pixels P = {p_i, i = 1 ... N},

\sum_{i,j} |I^{l,t}_{ij} - \hat{I}^{l,t}_{ij}| \;\rightarrow\; \sum_{p \in P} \left| z^t p^t - \left( \hat{K} \hat{R}_{t' \to t} \hat{K}^{-1} \frac{b \hat{f}_x}{d^l} \, p^{t'} + \hat{K} \hat{t}_{t' \to t} \right) \right|,   (5)

is the absolute error between the left image and the reprojected image, and the structural similarity measure is generated by the same mappings between I_{ij} and pixel p_i.
2) Spatial Reconstruction Loss: The left-right disparity consistency loss is used to obtain consistency between disparity maps [19]. During training, the network predicts disparity maps D^l and D^r using only left image sequences as input and then penalizes the difference between the left-view disparity map and the warped right view, as well as the right view and the warped left view,

l_{lr}(D^l, D^r) = \frac{1}{N} \sum_{i,j} \left( \left| d^l_{ij} - d^r_{i+d^l_{ij},\,j} \right| + \left| d^r_{ij} - d^l_{i+d^r_{ij},\,j} \right| \right).   (6)

The disparity smoothness loss penalizes depth discontinuities that occur at image gradients ∂I [41]. To obtain locally smooth disparities, an exponential weighting function is used on the disparity gradients ∂d:

l_r(D, I) = \frac{1}{N} \sum_{i,j} \left( |\partial_x d_{ij}| \, e^{-|\partial_x I_{ij}|} + |\partial_y d_{ij}| \, e^{-|\partial_y I_{ij}|} \right).   (7)
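Below is a NumPy sketch of the left-right consistency term (6) and the edge-aware smoothness term (7). The array names, the nearest-pixel rounding of the disparity warp, and the use of a grayscale image for the edge weights are our simplifications of the formulation above.

```python
import numpy as np

def lr_consistency_loss(D_l, D_r):
    """Eq. (6) with nearest-pixel sampling of the opposite disparity map."""
    H, W = D_l.shape
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Sample the opposite disparity map at columns shifted by the predicted disparity.
    cols_r = np.clip(cols + np.rint(D_l).astype(int), 0, W - 1)
    cols_l = np.clip(cols + np.rint(D_r).astype(int), 0, W - 1)
    return np.mean(np.abs(D_l - D_r[rows, cols_r]) + np.abs(D_r - D_l[rows, cols_l]))

def smoothness_loss(D, I_gray):
    """Eq. (7): disparity gradients weighted by an edge-aware exponential term."""
    dD_x, dD_y = np.abs(np.diff(D, axis=1)), np.abs(np.diff(D, axis=0))
    dI_x, dI_y = np.abs(np.diff(I_gray, axis=1)), np.abs(np.diff(I_gray, axis=0))
    return np.mean(dD_x * np.exp(-dI_x)) + np.mean(dD_y * np.exp(-dI_y))
```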
D. Learning the Camera Intrinsics

For the predicted parameters K̂, R̂, and t̂ in (1), penalizing differences via the training loss ensures that K̂t̂ and K̂R̂K̂^{-1} converge to the correct values. To determine the parameters individually, the translational relation fails because it is under-determined, since there exist incorrect values of K̂ and t̂ such that K̂t̂ = Kt. The rotational relationship, K̂R̂K̂^{-1} = KRK^{-1}, however, does uniquely determine K̂ and R̂ such that they are equal to K and R, and therefore provides sufficient supervisory signal to estimate these values accurately.

Proof: From the above relation we obtain R̂ = K̂^{-1}KRK^{-1}K̂, and we constrain R̂ to be SO(3), i.e., R̂^T = R̂^{-1} and det(R̂) = 1. Substituting R̂ into the relationship R̂R̂^T = I, we find that AR = RA, where A = K^{-1}K̂K̂^T K^{-T}. The value det(K̂^{-1}KRK^{-1}K̂) is equal to 1, therefore the determinant of A is also equal to 1. Moreover, the characteristic equation of A shows that A always has an eigenvalue of 1 [12]. Thus either the eigenvalue of A is equal to 1 with an algebraic multiplicity of 3, implying A is the identity matrix, or the eigenvalues are unique. If we assume A has 3 distinct eigenvalues, because A \in R^{3x3} and A = A^T, we may choose the eigenvectors of A such that they are real. But because AR = RA, for every eigenvector v of A, Rv is also an eigenvector. For an eigenvalue with algebraic multiplicity 1, the corresponding eigenspace has dimension 1, thus Rv = μv for some scalar μ, implying each eigenvector of A is also an eigenvector of R. If R is SO(3), however, it has complex eigenvectors in general, which contradicts this assertion. Therefore A must be the identity matrix, and K̂K̂^T = KK^T. Referring to K from (2), we observe

K K^T = \begin{bmatrix} F F^T + X_0 X_0^T & X_0 \\ X_0^T & 1 \end{bmatrix},   (8)

which implies X̂_0 = X_0 and F̂ = F, or K̂ = K.

It is clear from the above that for R = I, the relation AR = RA holds trivially, and K̂ cannot be uniquely determined. Thus the tolerance with which F in (2) can be determined (in units of pixels), with respect to the amount of camera rotation that occurs, is quantified as

\delta f_x < \frac{2 f_x^2}{w^2 r_y}; \qquad \delta f_y < \frac{2 f_y^2}{h^2 r_x},   (9)

where r_x and r_y are the x- and y-axis rotation angles (in radians) between adjacent frames, and w and h are the width and height of the image, respectively. For a complete proof of the relation between the strength of supervision on K and the closeness of R to I, see [12].
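As a sanity check on (9), the short computation below plugs in illustrative numbers (a KITTI-sized image and roughly 0.01 rad of inter-frame rotation; these are example values we chose, not measurements from the paper) to see the focal-length tolerance that the rotational constraint provides. Larger rotations tighten the bound, which matches the observation above that K̂ is unidentifiable when R = I.

```python
# Illustrative values only: KITTI-sized image and a small inter-frame rotation.
fx, fy = 720.0, 720.0          # focal lengths in pixels (assumed)
w, h = 1242, 375               # image width and height in pixels
rx = ry = 0.01                 # x- and y-axis rotation between frames, in radians

delta_fx = 2 * fx**2 / (w**2 * ry)   # tolerance on fx from (9)
delta_fy = 2 * fy**2 / (h**2 * rx)   # tolerance on fy from (9)
print(f"delta fx < {delta_fx:.1f} px, delta fy < {delta_fy:.1f} px")
```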
E. Network Architecture

Our framework is inspired by [30] and [32], but rather than requiring two separate networks for depth and pose estimation, we use a common encoder for both tasks in a novel single-network architecture (Fig. 3). That is, given two temporally adjacent input images at times t and t', our network first convolves these inputs through a series of convolutional blocks in a common encoder, and then predicts either disparities through a decoder, or camera pose and intrinsics through fully connected layers. In the decoder network, the encoder's latent representation of the input images is first re-upsampled using a standard bilinear interpolation kernel with pooling indices from the encoder to fuse low-level features, as inspired by [19], [30]. We then use rectified linear units (ReLU) [43] as activation functions in all layers of this decoder except for the prediction layer, which uses a sigmoid function instead. The decoder predicts left-to-right and right-to-left disparities D at both timesteps, which are then either used to reconstruct the right stereo images for a spatially-constrained geometric loss during training, or used to construct the depth during inference. In the fully connected layers, the translation t̂_{t' \to t}, rotation R̂_{t' \to t}, and camera intrinsics K̂ are predicted independently in three separate and decoupled groups of fully connected layers for better performance [31]. These outputs are then either taken at face value during inference as the predicted egomotion and camera parameters, or used as inputs (along with the estimated depth map) to warp the current frame to the next for our temporal reconstruction loss as described previously.

Fig. 4. Example Depth. Two representative examples of depth maps produced by our framework (RGB input and corresponding predicted depth). Our single-network system can accurately estimate object distances from purely monocular images.
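The sketch below illustrates the shared-encoder idea in tf.keras. Layer counts and channel widths loosely follow Fig. 2, but the exact filter sizes, the absence of encoder pooling indices and skip connections, and the channel-wise concatenation of the two input frames are our assumptions rather than the authors' released architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_single_network(height=256, width=512):
    """Shared encoder with a disparity decoder head and FC pose/intrinsics heads (illustrative)."""
    images = layers.Input(shape=(height, width, 6))  # two temporally adjacent RGB frames, concatenated

    # Common encoder (channel widths loosely following Fig. 2).
    x = images
    for filters, kernel in [(32, 7), (64, 5), (128, 3), (256, 3), (512, 3), (512, 3)]:
        x = layers.Conv2D(filters, kernel, strides=2, padding="same", activation="relu")(x)
    latent = x

    # Decoder head: bilinear upsampling back to input resolution, sigmoid disparity prediction.
    d = latent
    for filters in [512, 256, 128, 64, 32, 16]:
        d = layers.UpSampling2D(interpolation="bilinear")(d)
        d = layers.Conv2D(filters, 3, padding="same", activation="relu")(d)
    disparities = layers.Conv2D(4, 3, padding="same", activation="sigmoid", name="disparities")(d)

    # Fully connected heads: translation, rotation, and intrinsics predicted in decoupled groups.
    f = layers.GlobalAveragePooling2D()(latent)
    translation = layers.Dense(3, name="translation")(layers.Dense(512, activation="relu")(f))
    rotation = layers.Dense(3, name="rotation")(layers.Dense(512, activation="relu")(f))
    intrinsics = layers.Dense(4, name="intrinsics")(layers.Dense(512, activation="relu")(f))

    return tf.keras.Model(images, [disparities, translation, rotation, intrinsics])
```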
III. RESULTS

In this section, we evaluate our proposed framework using the KITTI driving dataset [33]. The network architecture was implemented using the TensorFlow framework [44], and models were trained on a single NVIDIA GeForce RTX 2070 Super GPU with 8GB of memory using a batch size of 4. The Adam optimizer [45] was used to train the network parameters, with exponential decay rates β1 = 0.9 and β2 = 0.99 and a learning rate α initially set to 0.001 but gradually decreased throughout training to allow faster weight updates in the beginning and smaller fine-tuning towards the end of the learning process. We used standard data augmentation techniques during training, such as random left-right mirroring of training images and perturbing the image color space (i.e., gamma, brightness, and color shifts) to artificially increase the size of the training dataset. At test time, inference was performed on an AMD Ryzen 7 3700X 8-Core 3.6GHz CPU, and we compare our method against the current state-of-the-art using conventional metrics (i.e., Abs Rel, Sq Rel, RMSE, RMSE log, and Accuracy for depth, and Absolute Trajectory Error (ATE) for egomotion) as per [46].
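A sketch of this training configuration in tf.keras is shown below. The exact learning-rate decay schedule and augmentation ranges are not specified above, so the staircase exponential decay and the gamma/brightness ranges here are placeholders.

```python
import tensorflow as tf

# Adam with beta_1 = 0.9, beta_2 = 0.99 and a learning rate starting at 1e-3.
# The decay schedule itself is an assumption; the text only states a gradual decrease.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=10_000, decay_rate=0.9, staircase=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule, beta_1=0.9, beta_2=0.99)

def augment(left, right):
    """Random left-right mirroring and color perturbation (illustrative ranges)."""
    if tf.random.uniform(()) > 0.5:
        # Mirroring a stereo pair also swaps the roles of the left and right images.
        left, right = tf.image.flip_left_right(right), tf.image.flip_left_right(left)
    gamma = tf.random.uniform((), 0.8, 1.2)
    brightness = tf.random.uniform((), -0.1, 0.1)
    left = tf.clip_by_value(left ** gamma + brightness, 0.0, 1.0)
    right = tf.clip_by_value(right ** gamma + brightness, 0.0, 1.0)
    return left, right
```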
A. Performance of Depth Estimation

To evaluate the performance of our network's depth inference, we use the standard Eigen split [34] on the KITTI dataset [33], as per convention, and compare against several state-of-the-art methods, including [13], [30]–[32]. The full KITTI Vision Benchmark Suite is an extensive dataset which consists of 61 video sequences recorded across 5 days, with 42,382 total rectified stereo image pairs, each with a resolution of 1242x375. For a fair comparison, Eigen et al. [34] proposed a selection of 697 images across 28 of these video sequences as a test set for single-view depth evaluation, held out from the training and validation sets. The remaining 23,178 images from the 32 other scenes make up the training and validation sets, in which the training set contains 21,055 images while the validation set contains 2,123. We train, validate, and test our network using these splits and compare our depth estimation accuracy against other works across several metrics, shown in Table I. Ground truth data for the testing set was calculated by projecting the Velodyne LiDAR data onto the image plane. Example depth maps outputted by our system can be seen in Fig. 4.
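For reference, the standard depth metrics reported in Table I can be computed as in the NumPy sketch below; it assumes ground-truth and predicted depths have already been masked to valid LiDAR pixels and capped at the chosen maximum depth.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Abs Rel, Sq Rel, RMSE, RMSE log, and delta < 1.25^k accuracies."""
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    ratio = np.maximum(gt / pred, pred / gt)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, deltas
```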
B. Learned Camera Intrinsics

To evaluate our system's ability to recover the camera intrinsics (i.e., f_x, f_y, x_0, y_0) through the supervisory signal provided by the rotational component of (1), we follow a similar procedure to [12] and trained separate models on several different video sequences until these parameters converged, yielding multiple independent results. We specifically used ten video sequences of the "2011 09 28" subdataset, chosen to have the same ground truth calibration done that day, and Table III shows the resulting mean and standard deviation of those ten tests. All experiments were done on the left stereo color camera ("image 02") of the vehicle setup.

C. Egomotion

We carried out our pose estimation performance evaluation using four sequences from the KITTI Odometry dataset [48] and compared against several state-of-the-art methods, including UnDEMoN [32], SfMLearner [15], and VISO-M [47]. For a quantitative comparison, we adopt the absolute trajectory root-mean-square error (ATE) for both translational (t_ate) and rotational (r_ate) components as per standard practice [49], defined as

F_i := Q_i^{-1} S P_i,   (10)

with estimated trajectory P_{1:n} and ground truth trajectory Q_{1:n}. We note that the same model that was trained for depth estimation was used to output our egomotion estimation, and that these four test sequences were not part of our training set.
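A minimal NumPy sketch of the translational ATE from (10) is given below; the alignment transform S is assumed to be given (e.g., from a separate trajectory alignment step), poses are 4x4 homogeneous matrices, and the function name is ours.

```python
import numpy as np

def translational_ate(Q, P, S=np.eye(4)):
    """RMSE of the translational part of F_i = Q_i^{-1} S P_i over a trajectory.

    Q, P: lists of 4x4 ground-truth and estimated poses; S: alignment transform (assumed given).
    """
    errors = []
    for Q_i, P_i in zip(Q, P):
        F_i = np.linalg.inv(Q_i) @ S @ P_i
        errors.append(np.linalg.norm(F_i[:3, 3]))   # translation magnitude of the error pose
    return np.sqrt(np.mean(np.square(errors)))
```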
TABLE I: Comparison of monocular depth estimation with other spatio-temporal approaches. Cropped regions from [19] were used for performance evaluation of all methods. In the column labeled "Train", "Depth" indicates supervised training with ground truth and "S-T" indicates an unsupervised spatio-temporal training approach. We evaluate using the Eigen split [34] on the KITTI dataset [33] and cap depth to 80m and 50m as per standard practice [19]. Results from other methods were taken from their corresponding papers. For error metrics, lower is better; for accuracy, higher is better.

Method           | Train | Abs Rel | Sq Rel | RMSE  | RMSE log | δ<1.25 | δ<1.25^2 | δ<1.25^3
Depth: 80m cap
Train Set Mean   | Depth | 0.361   | 4.826  | 8.102 | 0.377    | 0.638  | 0.804    | 0.894
Li et al. [31]   | S-T   | 0.183   | 1.730  | 6.570 | 0.268    | -      | -        | -
Babu et al. [32] | S-T   | 0.139   | 1.174  | 5.590 | 0.239    | 0.812  | 0.930    | 0.968
Zhan et al. [30] | S-T   | 0.144   | 1.391  | 5.869 | 0.241    | 0.803  | 0.928    | 0.969
Ours             | S-T   | 0.141   | 1.227  | 5.629 | 0.239    | 0.809  | 0.927    | 0.962
Depth: 50m cap
Babu et al. [32] | S-T   | 0.132   | 0.885  | 4.290 | 0.226    | 0.827  | 0.937    | 0.972
Zhan et al. [30] | S-T   | 0.135   | 0.905  | 4.366 | 0.225    | 0.818  | 0.937    | 0.973
Ours             | S-T   | 0.131   | 0.897  | 4.297 | 0.228    | 0.821  | 0.938    | 0.972

TABLE II: Comparison of our system's odometry estimation against various other state-of-the-art methods [15], [32], [47] using absolute trajectory error for translational (t_ate) and rotational (r_ate) movement. Values of other methods were retrieved from [32].

Seq. | Ours t_ate | Ours r_ate | UnDEMoN [32] t_ate | UnDEMoN [32] r_ate | SfMLearner [15] t_ate | SfMLearner [15] r_ate | VISO-M [47] t_ate | VISO-M [47] r_ate
00   | 0.0712     | 0.0014     | 0.0644             | 0.0013             | 0.7366                | 0.0040                | 0.1747            | 0.0009
04   | 0.0962     | 0.0016     | 0.0974             | 0.0008             | 1.5521                | 0.0027                | 0.2184            | 0.0009
05   | 0.0689     | 0.0009     | 0.0696             | 0.0009             | 0.7260                | 0.0036                | 0.3787            | 0.0013
07   | 0.0753     | 0.0013     | 0.0742             | 0.0011             | 0.5255                | 0.0036                | 0.4803            | 0.0018

TABLE III: Regressed camera intrinsics during training as compared to the ground truth. Note that ground truth values have been adjusted to match the scaling and cropping done for training. All values are in units of pixels.

Camera Parameter               | Learned     | Ground Truth
Horizontal Focal Length (fx)   | 298.4 ± 2.3 | 295.8
Vertical Focal Length (fy)     | 483.1 ± 3.6 | 489.2
Horizontal Principal Point (x0)| 254.8 ± 2.4 | 252.7
Vertical Principal Point (y0)  | 127.8 ± 1.7 | 124.9
further reduce human effort and manual intervention, we
with estimated trajectory P1:n and ground truth trajectory also take advantage of intrinsics observability in the system
Q1:n . We note that the same model that was trained for depth by learning the camera parameters embedded within the
estimation to output our egomotion estimation, and that these temporal recontruction loss. We verify the success of our
four test sequences were not part of our training set. system using the KITTI dataset, where results in Tables I-
From Table II, we observe that for both translational and III show we are able to achieve performance comparable to
rotational errors in all four sequences, our method outper- the state-of-the-art in spatio-temporal monocular vision while
formed SfMLearner [15] and VISO-M [47] and is compa- reducing overhead and solving for intrinsics.
rable with UnDEMoN’s [32] performance — all methods In future work we plan to quantify the training efficiency
in which a separate, dedicated pose estimation network was of our neural network, as we suspect the parameter re-
trained specifically for the task of predicting egomotion. In duction resulting from network elimination decreases the
our system, egomotion and camera intrinsics are co-predicted necessary time to optimize weights. We are also interested
(alongside disparity) in a single network such that the loss in expanding training and evaluation of our work to the
functions for these free parameters are tied together. This Cityscapes [50] and EuRoC [51] datasets. Increasing the size
may explain the slight loss in accuracy, but the upside is of the training dataset facilitates improved performance from
that our method is a reduction in computational and network unsupervised learning methods and a more diverse collection
complexity as there are less weights in our architecture to of scenes for evaluation. We also plan to address occlusion
optimize over. and moving objects to obtain more accurate reprojection
losses by discounting the correspondence between associated
IV. D ISCUSSION pixels. Extention of our work for SLAM applications is
In this work we have presented an unsupervised, single- also of interest, where augmentation of our system for a
network monocular depth inference approach for joint pre- hybrid, learned front-end and geometric back-end will aid in
diction of environmental depth, egomotion, and camera generating an accurate global pose graph.
REFERENCES

[1] S. Royo and M. Ballesta-Garcia, "An overview of lidar imaging systems for autonomous vehicles," Applied Sciences, vol. 9, no. 19, p. 4093, 2019.
[2] R. Katzenbeisser, "About the calibration of lidar sensors," in ISPRS Workshop, 2003, pp. 1–6.
[3] N. Muhammad and S. Lacroix, "Calibration of a rotating multi-beam lidar," in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010, pp. 5648–5653.
[4] B. T. Lopez and J. P. How, "Aggressive collision avoidance with limited field-of-view sensing," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 1358–1365.
[5] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, "A comparison and evaluation of multi-view stereo reconstruction algorithms," in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 1. IEEE, 2006, pp. 519–528.
[6] J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Pollefeys, "Pixelwise view selection for unstructured multi-view stereo," in European Conference on Computer Vision. Springer, 2016, pp. 501–518.
[7] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski, "Towards internet-scale multi-view stereo," in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 1434–1441.
[8] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[9] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, "DTAM: Dense tracking and mapping in real-time," in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 2320–2327.
[10] J. L. Schonberger and J.-M. Frahm, "Structure-from-motion revisited," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4104–4113.
[11] F. Dellaert, S. M. Seitz, C. E. Thorpe, and S. Thrun, "Structure from motion without correspondence," in Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), vol. 2. IEEE, 2000, pp. 557–564.
[12] A. Gordon, H. Li, R. Jonschkowski, and A. Angelova, "Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras," in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 8976–8985.
[13] C. Godard, O. Mac Aodha, M. Firman, and G. Brostow, "Digging into self-supervised monocular depth estimation," arXiv:1806.01260 [cs, stat], Aug. 2019. [Online]. Available: http://arxiv.org/abs/1806.01260
[14] J.-W. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and I. Reid, "Unsupervised scale-consistent depth and ego-motion learning from monocular video," arXiv:1908.10553 [cs], Oct. 2019. [Online]. Available: http://arxiv.org/abs/1908.10553
[15] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI: IEEE, Jul. 2017, pp. 6612–6619. [Online]. Available: http://ieeexplore.ieee.org/document/8100183/
[16] Z. Yin and J. Shi, "GeoNet: Unsupervised learning of dense depth, optical flow and camera pose," arXiv:1803.02276 [cs], Mar. 2018. [Online]. Available: http://arxiv.org/abs/1803.02276
[17] Y. Zou, Z. Luo, and J.-B. Huang, "DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency," 2018.
[18] M. Poggi, F. Aleotti, F. Tosi, and S. Mattoccia, "Towards real-time unsupervised monocular depth estimation on CPU," arXiv:1806.11430 [cs], Jul. 2018.
[19] C. Godard, O. M. Aodha, and G. J. Brostow, "Unsupervised monocular depth estimation with left-right consistency," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6602–6611.
[20] R. Garg, V. K. BG, G. Carneiro, and I. Reid, "Unsupervised CNN for single view depth estimation: Geometry to the rescue," arXiv:1603.04992 [cs], Jul. 2016. [Online]. Available: http://arxiv.org/abs/1603.04992
[21] A. Saxena, S. H. Chung, and A. Y. Ng, "Learning depth from single monocular images," in Advances in Neural Information Processing Systems, 2006, pp. 1161–1168.
[22] F. Liu, C. Shen, G. Lin, and I. Reid, "Learning depth from single monocular images using deep convolutional neural fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2024–2039, 2015.
[23] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, 2008.
[24] K. Tateno, F. Tombari, I. Laina, and N. Navab, "CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction," arXiv:1704.03489 [cs], Apr. 2017. [Online]. Available: http://arxiv.org/abs/1704.03489
[25] C. Schmid, C. Sminchisescu, and Y. Chen, "Self-supervised learning with geometric constraints in monocular video - connecting flow, depth, and camera," in ICCV, 2019.
[26] W. N. Greene and N. Roy, "Metrically-scaled monocular SLAM using learned scale factors," in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 43–50.
[27] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, "Deep ordinal regression network for monocular depth estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2002–2011.
[28] D. Xu, W. Wang, H. Tang, H. Liu, N. Sebe, and E. Ricci, "Structured attention guided convolutional neural fields for monocular depth estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3917–3925.
[29] S. Pillai, R. Ambruş, and A. Gaidon, "SuperDepth: Self-supervised, super-resolved monocular depth estimation," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 9250–9256.
[30] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid, "Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction," arXiv:1803.03893 [cs], Apr. 2018. [Online]. Available: http://arxiv.org/abs/1803.03893
[31] R. Li, S. Wang, Z. Long, and D. Gu, "UnDeepVO: Monocular visual odometry through unsupervised deep learning," arXiv:1709.06841 [cs], Feb. 2018. [Online]. Available: http://arxiv.org/abs/1709.06841
[32] V. Madhu Babu, K. Das, A. Majumdar, and S. Kumar, "UnDEMoN: Unsupervised deep network for depth and ego-motion estimation," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2018, pp. 1082–1088.
[33] M. Menze and A. Geiger, "Object scene flow for autonomous vehicles," in Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[34] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2366–2374.
[35] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," 2015.
[36] J. J. Koenderink and A. J. Van Doorn, "Affine structure from motion," JOSA A, vol. 8, no. 2, pp. 377–385, 1991.
[37] R. Clark, S. Wang, H. Wen, A. Markham, and N. Trigoni, "VINet: Visual-inertial odometry as a sequence-to-sequence learning problem," 2017.
[38] S. Pillai and J. J. Leonard, "Towards visual ego-motion learning in robots," 2017.
[39] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[40] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki, "SfM-Net: Learning of structure and motion from video," 2017.
[41] P. Heise, S. Klose, B. Jensen, and A. Knoll, "PM-Huber: PatchMatch with Huber regularization for stereo matching," in 2013 IEEE International Conference on Computer Vision, 2013, pp. 2360–2367.
[42] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004. [Online]. Available: http://ieeexplore.ieee.org/document/1284395/
[43] V. Nair and G. E. Hinton, “Rectified linear units improve restricted
boltzmann machines,” in ICML, 2010.
[44] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro,
G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat,
I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz,
L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga,
S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner,
I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan,
F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke,
Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on
heterogeneous systems,” 2015, software available from tensorflow.org.
[Online]. Available: http://tensorflow.org/
[45] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-
tion,” arXiv preprint arXiv:1412.6980, 2014.
[46] C. Zhao, Q. Sun, C. Zhang, Y. Tang, and F. Qian, “Monocular
Depth Estimation Based On Deep Learning: An Overview,”
Science China Technological Sciences, vol. 63, no. 9, pp. 1612–
1627, Sep. 2020, arXiv: 2003.06620. [Online]. Available: http:
//arxiv.org/abs/2003.06620
[47] A. Geiger, J. Ziegler, and C. Stiller, “Stereoscan: Dense 3d reconstruc-
tion in real-time,” in 2011 IEEE intelligent vehicles symposium (IV).
Ieee, 2011, pp. 963–968.
[48] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous
driving? the kitti vision benchmark suite,” in 2012 IEEE Conference
on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 3354–
3361.
[49] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers,
“A benchmark for the evaluation of rgb-d slam systems,” in 2012
IEEE/RSJ International Conference on Intelligent Robots and Systems.
IEEE, 2012, pp. 573–580.
[50] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W.
Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,”
The International Journal of Robotics Research, vol. 35, no. 10, pp.
1157–1163, 2016.
[51] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Be-
nenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset
for semantic urban scene understanding,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2016, pp.
3213–3223.
