Unsupervised Monocular Depth Learning With Integrated Intrinsics and Spatio-Temporal Constraints

1 Kenny Chen and Ankur Mehta are with the Department of Electrical and Computer Engineering at the University of California Los Angeles, Los Angeles, CA 90095, USA. {kennyjchen, mehtank}@ucla.edu
2 Alexandra Pogue is with the Department of Mechanical and Aerospace Engineering at the University of California Los Angeles, Los Angeles, CA 90095, USA. [email protected]
3 Ali-akbar Agha-mohammadi and Brett T. Lopez are with the NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA. {aliagha, brett.t.lopez}@jpl.nasa.gov

I. INTRODUCTION

Fig. 1. System Overview. Our system regresses depth, pose, and camera intrinsics from a sequence of monocular images. During training, we use two pairs of unlabeled stereo images and consider losses in both spatial and temporal directions for our network weights. During inference, only monocular images are required as input, and our system outputs accurately scaled depth maps and egomotion in addition to the camera's intrinsics.

Modern robotic agents take advantage of accurate, real-time range measurements to build a spatial understanding of their surrounding environments for collision avoidance, state estimation, and other navigational tasks. Such measurements are commonly retrieved via active sensors (e.g., LiDAR), which resolve distance by measuring the time-of-flight of a reflected light signal; however, these sensors are often costly [1], difficult to calibrate and maintain [2], [3], and can be unwieldy for platforms with a weight budget [4]. Passive sensors, on the other hand, have seen a tremendous surge of interest in recent literature for predicting scene depth from input imagery using multi-view stereo [5]–[7], structure-from-motion [8]–[11], or, more recently, purely monocular systems [12]–[17], due to their smaller form factor and their increasing potential to rival the performance of explicit active sensors with the advent of machine learning.

In particular, monocular depth inference is attractive since RGB cameras are ubiquitous and require the fewest sensors, but this setup suffers from a fundamental issue of scale acquisition. More specifically, in a purely monocular system, depth can only be estimated up to an ambiguous scale, and additional geometric information is required to resolve the units of the depth map. Such cameras capture frames by projecting 3D scene information onto a 2D image plane, and recovering higher-dimensional depth information from a lower dimension is fundamentally an ill-posed problem. To resolve the scale factors of these depth maps, a variety of learning-based approaches have been proposed with differing techniques to constrain the problem geometrically [13], [18]–[26]. Temporal constraints, for example, are commonly employed [12], [25], [27]–[29] and are defined as the geometric constraint between two consecutive monocular frames, aiming to minimize the photometric consistency loss after warping one frame onto the next. Spatial constraints [13], [19], on the other hand, extract scene geometry not through a forward-backward reconstruction loss (i.e., temporally) but rather through left-right pairs of stereo images with a predefined baseline. Most works choose to design their systems around one or the other, and while a few systems have integrated both constraints in a multi-network framework [30]–[32], none have taken advantage of both spatial and temporal constraints in a single network to resolve these scale factors.

To this end, we propose an unsupervised, single-network monocular depth inference approach that considers both spatial and temporal geometric constraints to resolve the scale of a predicted depth map.
Fig. 2. Architecture Overview. Our system uses a common convolutional-based encoder between the different outputs, which compresses the input
images into a latent space representation. This representation is then sent through either a trained decoder to retrieve left-right stereo image disparities, or
through different groups of fully connected layers to estimate the egomotion (n = 3 for the last layer) or camera intrinsics (n = 4).
These "spatio-temporal" constraints are enforced within the reconstruction loss functions of our network during training (Fig. 1), which aim to minimize the photometric difference between a warped frame and the actual next frame (forward-backward) while simultaneously maximizing the disparity consistency between a pair of stereo frames (left-right). Unlike previous approaches, we consider the camera intrinsics as an additional unknown parameter to be inferred, and we demonstrate accurate inference of both depth and camera parameters from a sequence of purely monocular frames; this is all performed in a single end-to-end network to minimize implementation overhead.

Our main contributions are as follows: (1) we propose an unsupervised, single-network architecture for monocular depth inference which takes advantage of the geometric constraints found in both spatial and temporal directions; (2) a novel loss function that integrates unknown camera intrinsics directly into the depth prediction network, with analysis showing that there is sufficient supervisory signal to regress these parameters; and (3) an analysis of our proposed architecture's performance on the KITTI driving dataset [33] as compared to the current state-of-the-art.

Related Work

Depth estimation using monocular images and deep learning began with supervised methods over large datasets and ground truth labeling [22], [34], [35]. Although these methods produced accurate results, acquiring ground truth data for supervised training requires expensive 3D sensors, multiple scene views, and inertial feedback to obtain even sparse depth maps [20]. Later work sought to address the lack of available high-quality labeled data by posing monocular depth estimation as a stereo image correspondence problem, where the second image in a binocular pair served as a supervisory signal [18]–[20]. This approach trained a convolutional neural network (CNN) to learn epipolar geometry constraints by generating disparity images subject to a stereo image reconstruction loss. Once trained, such networks were able to infer depth using only a single monocular color image as input. While this work achieved results comparable to supervised methods in some cases, occlusion and texture-copy artifacts that arose with stereo supervision motivated learning approaches that use a temporal sequence of images as an alternative [13], [19]. CNNs trained on monocular video regressed depth by using the camera egomotion to warp a source image to its temporally adjacent target. To address the additional problem of camera pose, [12]–[17] trained a separate pose network.

The learning of visual odometry (VO) and depth maps has useful application in visual simultaneous localization and mapping (SLAM). Visual SLAM leverages 3D vision to navigate an unknown area by determining camera pose relative to a constructed global map of the environment. To build a map and localize within it, the VO within the SLAM pipeline must solve at metric scale. Geometric approaches to monocular SLAM using first-principled solutions, such as structure from motion (SfM) [36], resolved scaling issues using external information [26], [37]. Building on such methods, work in data-driven monocular VO sought to obtain accurate scaling using sources such as GPS sensor fusion [38] or supervision [37]. Unsupervised approaches using a camera alone remain attractive, however, due to the reduction of manual effort associated with fewer sensors. Promising research in this area used combined visual constraints (e.g., monocular depth [12]–[17], stereo depth [26], [30]–[32], or optical flow [17]) to achieve scale-consistent outcomes. Network architectures for visual odometry and dense depth map estimation separate the depth and pose networks into two CNNs: one with convolutional and fully connected layers, and the other with an encoder-decoder structure [39], respectively. In the case where only monocular images are used in training, the self-supervision inherent in estimation is less constrained, having only pose generated from temporal constraints to determine depth, and vice versa. The work of [15], [40], for example, suffered from scaling ambiguity issues [30]. Training using binocular video, on the other hand, makes use of independent constraints from spatial and temporal image pairs that offer an enriched set of sampled images for network training. This "spatio-temporal" approach allowed for the regression of depth from spatial cues generated by epipolar constraints, which were then passed to the
pose network to independently estimate VO using temporal constraints [30]–[32].

In this work, we draw from the findings of [30]–[32] and determine dense depth maps and egomotion using an unsupervised, end-to-end approach. By observing the similarities between the architecture of the depth network encoder and the pose network's convolutional layers, we can effectively eliminate architecture redundancy by merging them via a common encoder (Fig. 2). Through the creation of a single, spatio-temporal network, we reduce the overhead associated with optimizing networks separately under different criteria while still achieving the performance benefits of combined vision techniques. To further reduce human effort and eliminate error-prone manual intervention, this work also follows the work of [12], [25] in demonstrating support for learning the camera intrinsics. In addition to freeing the system from manual calibration, learning the camera intrinsics can be useful when a video source is unknown.

Our proposed single, spatio-temporal network uses an effective combination of losses to regress depth, egomotion, and camera intrinsics. To predict depth, we use a photoconsistency loss between stereo image pairs, a left-right consistency loss between image disparity maps [12], and a disparity map smoothing function [41]. To estimate egomotion and camera intrinsics, we leverage a unique loss function that accounts for the photometric difference between temporally adjacent images. Using this loss, we show that we can obtain scaled visual odometry information in addition to accurate camera intrinsics.

Fig. 3. Training Diagram. Our single-network system takes a timed sequence of left images and runs them through the encoder to generate outputs that are fed to the fully connected (FC) layers (blue rectangles) and the decoder (green trapezoid). Outputs from the FC layers are the camera pose and intrinsics, and outputs from the decoder are disparity maps. The disparities are the spatial component of the network used to find left-right reprojected images (green dashed lines), while the disparities, camera pose, and intrinsics determine the temporal reprojections (pink dashed lines). All input and output images are framed in black for clarity.

II. METHODS

A. Notation

A color image, $I$, is composed of pixels with coordinates $p_{ij} \in \mathbb{R}^2$, where $I_{ij} = I(p_{ij})$. In temporal training, we denote images at time $t$ as $I^t$, and temporally adjacent images as the source frame $I^{t'}$. A pixel at time $t'$ is transformed to its corresponding pixel at time $t$ using the homogeneous transformation matrix $T_{t' \to t} \in SE(3)$ and the camera intrinsics matrix $K \in \mathbb{R}^{3 \times 3}$, where pixels in homogeneous coordinates, $\tilde{p} = (p, 1)^T$, are denoted $p$ for simplicity.

Rectified stereo image pairs are given by $I^r$, $I^l$, where the superscripts for time have been dropped for convenience, and the superscripts $l$, $r$ correspond to the left and right images, respectively. $D^l$ represents the disparity map that warps $I^r$ to the corresponding $I^l$, and we define the per-pixel disparity as $d^l_{ij} = D^l(p_{ij})$. Thus $I^l_{ij} = I^r_{i+d^l, j}$, and $d^r_{i+d^l, j} = D^r(p_{i+d^l, j})$ is the disparity that performs the reverse operation. Depth per pixel, $z$, is then determined by the relation $z = B f_x / d$, where $f_x$ is the x-component of the focal length and $B$ is the horizontal baseline between the stereo cameras.
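As a concrete illustration of the relation $z = B f_x / d$, the sketch below converts a predicted disparity map into a metric depth map. It is a minimal NumPy example; the baseline, focal length, and array shapes are illustrative KITTI-like values, not our calibration results.

```python
import numpy as np

def disparity_to_depth(disparity, fx, baseline, min_disp=1e-6):
    """Convert a disparity map (in pixels) to metric depth via z = B * fx / d."""
    d = np.maximum(disparity, min_disp)   # guard against division by zero
    return baseline * fx / d

# Illustrative values (assumed for this example only).
fx = 721.5        # focal length in pixels
baseline = 0.54   # stereo baseline in meters
disp = np.full((375, 1242), 10.0)  # a constant 10-pixel disparity map
depth = disparity_to_depth(disp, fx, baseline)
print(depth[0, 0])  # ~39 m
```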
B. Preliminaries

We can obtain the projected pixel coordinates and depth map using the equation

$$z^t p^t = K R_{t' \to t} K^{-1} z^{t'} p^{t'} + K t_{t' \to t}, \qquad (1)$$

where the camera intrinsics matrix, $K$, is written explicitly as

$$K = \begin{bmatrix} F & X_0 \\ 0 & 1 \end{bmatrix}, \quad F = \mathrm{diag}(f_x, f_y), \quad X_0 = [x_0, y_0]^T, \qquad (2)$$

and $R$ and $t$ are the rotation matrix and translation vector arguments of the transformation matrix $T$ [12]. Note that in this work we assume no lens distortion and a zero skew coefficient in the camera, and that the stereo cameras have equal intrinsic parameters. Equation (1) constitutes the temporal reconstruction loss used during training to determine the camera egomotion, $R$ and $t$, and the camera intrinsics $K$ in a single network.
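To make the role of (1) concrete, the following sketch rigidly reprojects the pixels of a source frame into the target frame given a depth map, relative pose, and intrinsics. It is a simplified NumPy illustration under the paper's assumptions (zero skew, no distortion); in practice, the warped coordinates would be used to bilinearly sample the source image inside the training graph, which is omitted here, and none of the names below are taken from our implementation.

```python
import numpy as np

def reproject(depth_src, K, R, t):
    """Apply Eq. (1): z^t p^t = K R K^{-1} z^{t'} p^{t'} + K t.

    depth_src: (H, W) depth of the source frame t'.
    K: (3, 3) intrinsics; R: (3, 3) rotation; t: (3,) translation (t' -> t).
    Returns target-frame pixel coordinates and the projected depth.
    """
    H, W = depth_src.shape
    j, i = np.meshgrid(np.arange(W), np.arange(H))        # pixel grid (u = column, v = row)
    p = np.stack([j, i, np.ones_like(j)], axis=-1)        # homogeneous pixels (u, v, 1)
    cam_pts = (np.linalg.inv(K) @ p.reshape(-1, 3).T) * depth_src.reshape(1, -1)
    cam_pts = R @ cam_pts + t.reshape(3, 1)               # rigid motion in 3D
    proj = K @ cam_pts                                    # back onto the image plane
    z_tgt = proj[2]
    uv_tgt = proj[:2] / np.maximum(z_tgt, 1e-6)
    return uv_tgt.reshape(2, H, W), z_tgt.reshape(H, W)
```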
C. Overall Optimization Objective

Our loss function is made up of a novel temporal reconstruction term and four spatial reconstruction terms [19], [26]. Error regression over the following losses allows the network to correctly predict a target image temporally and spatially during training, in order to infer depth, pose, and camera intrinsics from a monocular image sequence at test time. The temporal reconstruction term of the loss function is designated $l_{te}$, and the spatial reconstruction terms are composed of a photoconsistency loss, $l_p$, a left-right consistency loss, $l_{lr}$, and a disparity smoothness loss, $l_r$:

$$l\big(f(I^l; \theta), I^l, I^r\big) = \lambda_p \big( l_p(I^l, \hat{I}^l) + l_p(I^r, \hat{I}^r) \big) + \lambda_{te}\, l_p(I^{l,t}, \hat{I}^{l,t}) + \lambda_{lr}\, l_{lr}(D^l, D^r) + \lambda_r \big( l_r(D^l, I^l) + l_r(D^r, I^r) \big), \qquad (3)$$

where $I$ is the original image and $\hat{I}$ is the reprojected image, and the individual losses are weighted by $\lambda_l$, with the subscript $l$ corresponding to the loss function being weighted.
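The sketch below shows how the weighted terms in (3) might be combined during training. The individual loss callables, dictionary keys, and weight values are placeholders for illustration (the weights $\lambda$ are not specified at this point in the text); only the structure of the objective follows (3).

```python
def total_loss(l_p, l_lr, l_r, outputs, weights):
    """Weighted sum of the spatio-temporal terms in Eq. (3).

    `outputs` is assumed to hold the stereo reprojections (I_l_hat, I_r_hat),
    the temporal reprojection (I_lt_hat), and the predicted disparities.
    """
    o, w = outputs, weights
    spatial_photo = l_p(o["I_l"], o["I_l_hat"]) + l_p(o["I_r"], o["I_r_hat"])
    temporal_photo = l_p(o["I_lt"], o["I_lt_hat"])
    lr_consistency = l_lr(o["D_l"], o["D_r"])
    smoothness = l_r(o["D_l"], o["I_l"]) + l_r(o["D_r"], o["I_r"])
    return (w["p"] * spatial_photo + w["te"] * temporal_photo
            + w["lr"] * lr_consistency + w["r"] * smoothness)

# Example (placeholder) weights:
weights = {"p": 1.0, "te": 1.0, "lr": 1.0, "r": 0.1}
```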
1) Spatio-Temporal Reconstruction Loss: The photoconsistency loss compares image appearance:

$$l_p(I, \hat{I}) = \frac{1}{N} \sum_{i,j} \alpha\, \frac{1 - \mathrm{SSIM}(I_{ij}, \hat{I}_{ij})}{2} + (1 - \alpha)\, \big|I_{ij} - \hat{I}_{ij}\big|. \qquad (4)$$

This loss is composed of three terms in total (two spatial losses and a temporal loss). Here, $N$ is the number of image pixels and the weight $\alpha$ is set to 0.85. The structural similarity index measure (SSIM) is used in addition to an absolute error between the generated views and the sampled images [42].
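A minimal sketch of (4) is given below. It assumes an `ssim_map` helper that returns a per-pixel SSIM map (e.g., computed with small local windows, as in common open-source implementations); that helper and the array layout are assumptions of this sketch, not the paper's code.

```python
import numpy as np

def photoconsistency_loss(I, I_hat, ssim_map, alpha=0.85):
    """Eq. (4): alpha * (1 - SSIM)/2 + (1 - alpha) * L1, averaged over pixels."""
    ssim_term = (1.0 - ssim_map(I, I_hat)) / 2.0   # per-pixel structural term
    l1_term = np.abs(I - I_hat)                    # per-pixel absolute error
    per_pixel = alpha * ssim_term + (1.0 - alpha) * l1_term
    return per_pixel.mean()                        # 1/N sum over pixels
```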
For the reprojected images, we assume equal camera intrinsics produce the right and left stereo images. The focal length $\hat{f}_x$ from the intrinsics matrix $\hat{K}$ in (2) is co-predicted via the learned disparity and penalized using the spatial reconstruction losses. For stereo image inputs, the predicted disparity maps are used to generate the left view from a right image, and vice versa. Depth values calculated from the disparity maps are then input to the temporal reconstruction loss to generate the left target image from temporally adjacent source images; i.e., to generate the temporal image arguments for (4), we put (1) in the form where, for pixels $P = \{p_i,\ i = 1 \ldots N\}$,

$$\sum_{i,j} \big|I^{l,t}_{ij} - \hat{I}^{l,t}_{ij}\big| \;\rightarrow\; \sum_{p \in P} \Big| z^t p^t - \Big( \hat{K} \hat{R}_{t' \to t} \hat{K}^{-1} \tfrac{B \hat{f}_x}{d^l}\, p^{t'} + \hat{K} \hat{t}_{t' \to t} \Big) \Big|, \qquad (5)$$

is the absolute error between the left image and the reprojected image, and the structural similarity measure is generated by the same mappings between $I_{ij}$ and pixel $p_i$.

2) Spatial Reconstruction Loss: The left-right disparity consistency loss is used to obtain consistency between the disparity maps [19]. During training, the network predicts disparity maps $D^l$ and $D^r$ using only left image sequences as input, and then penalizes the difference between the left-view disparity map and the warped right view, as well as the right-view disparity map and the warped left view:

$$l_{lr}(D^l, D^r) = \frac{1}{N} \sum_{i,j} \big| d^l_{ij} - d^r_{i+d^l, j} \big| + \big| d^r_{ij} - d^l_{i+d^r, j} \big|. \qquad (6)$$

The disparity smoothness loss penalizes depth discontinuities that occur at image gradients $\partial I$ [41]. To obtain locally smooth disparities, an exponential weighting function is used on the disparity gradients $\partial d$:

$$l_r(D, I) = \frac{1}{N} \sum_{i,j} \big| \partial_x d_{ij} \big| e^{-|\partial_x I_{ij}|} + \big| \partial_y d_{ij} \big| e^{-|\partial_y I_{ij}|}. \qquad (7)$$
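The two spatial terms can be sketched as follows. This is a simplified NumPy illustration only: it uses nearest-neighbor sampling instead of bilinear interpolation, assumes single-channel images, and glosses over disparity sign conventions, so it mirrors the structure of (6) and (7) rather than the training implementation.

```python
import numpy as np

def lr_consistency_loss(d_l, d_r):
    """Eq. (6): penalize disagreement between each disparity map and the
    opposite-view map sampled at disparity-shifted columns."""
    H, W = d_l.shape
    cols = np.tile(np.arange(W), (H, 1))
    r_at_l = np.take_along_axis(d_r, np.clip(np.round(cols + d_l).astype(int), 0, W - 1), axis=1)
    l_at_r = np.take_along_axis(d_l, np.clip(np.round(cols + d_r).astype(int), 0, W - 1), axis=1)
    return np.mean(np.abs(d_l - r_at_l) + np.abs(d_r - l_at_r))

def smoothness_loss(d, img):
    """Eq. (7): edge-aware smoothness; disparity gradients are down-weighted
    where the (grayscale) image has strong gradients."""
    dx_d, dy_d = np.abs(np.diff(d, axis=1)), np.abs(np.diff(d, axis=0))
    dx_i, dy_i = np.abs(np.diff(img, axis=1)), np.abs(np.diff(img, axis=0))
    return np.mean(dx_d * np.exp(-dx_i)) + np.mean(dy_d * np.exp(-dy_i))
```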
D. Learning the Camera Intrinsics

For the predicted parameters $\hat{K}$, $\hat{R}$, $\hat{t}$ in (1), penalizing differences via the training loss ensures that $\hat{K}\hat{t}$ and $\hat{K}\hat{R}\hat{K}^{-1}$ converge to the correct values. To determine the parameters individually, the translational relation fails because it is under-determined, since there exist incorrect values of $\hat{K}$ and $\hat{t}$ such that $\hat{K}\hat{t} = Kt$. The rotational relationship, $\hat{K}\hat{R}\hat{K}^{-1} = KRK^{-1}$, however, does uniquely determine $\hat{K}$, $\hat{R}$ such that they are equal to $K$, $R$, and therefore provides sufficient supervisory signal to estimate these values accurately.

Proof: From the above relation we obtain $\hat{R} = \hat{K}^{-1} K R K^{-1} \hat{K}$, and we constrain $\hat{R}$ to be in $SO(3)$, i.e., $\hat{R}^T = \hat{R}^{-1}$ and $\det(\hat{R}) = 1$. Substituting $\hat{R}$ into the relationship $\hat{R}\hat{R}^T = I$, we find that $AR = RA$, where $A = K^{-1}\hat{K}\hat{K}^T K^{-T}$. The value $\det(\hat{K}^{-1} K R K^{-1} \hat{K})$ is equal to 1, therefore the determinant of $A$ is also equal to 1. Moreover, the characteristic equation of $A$ shows that $A$ always has an eigenvalue of 1 [12]. Thus, either the eigenvalue of $A$ is equal to 1 with an algebraic multiplicity of 3, implying $A$ is the identity matrix, or the eigenvalues are unique. If we assume $A$ has 3 distinct eigenvalues, then because $A \in \mathbb{R}^{3 \times 3}$ and $A = A^T$, we may choose the eigenvectors of $A$ such that they are real. But because $AR = RA$, for every eigenvector $v$ of $A$, $Rv$ is also an eigenvector. For an eigenvalue with algebraic multiplicity 1, the corresponding eigenspace has dimension 1, thus $Rv = \mu v$ for some scalar $\mu$, implying each eigenvector of $A$ is also an eigenvector of $R$. If $R$ is in $SO(3)$, however, it has complex eigenvectors in general, which contradicts this assertion. Therefore, $A$ must be the identity matrix, and $\hat{K}\hat{K}^T = KK^T$. Referring to $K$ from (2), we observe

$$KK^T = \begin{bmatrix} FF^T + X_0 X_0^T & X_0 \\ X_0^T & 1 \end{bmatrix}, \qquad (8)$$

which implies $\hat{X}_0 = X_0$ and $\hat{F} = F$, or $\hat{K} = K$.

It is clear from the above that for $R = I$, the relation $AR = RA$ holds trivially, and $\hat{K}$ cannot be uniquely determined. The tolerance with which $F$ in (2) can be determined (in units of pixels), with respect to the amount of camera rotation that occurs, is thus quantified as

$$\delta f_x < \frac{2 f_x^2}{w^2 r_y}; \qquad \delta f_y < \frac{2 f_y^2}{h^2 r_x}, \qquad (9)$$

where $r_x$ and $r_y$ are the x- and y-axis rotation angles (in radians) between adjacent frames, and $w$ and $h$ are the width and height of the image, respectively. For a complete proof of the relation between the strength of supervision on $K$ and the closeness of $R$ to $I$, see [12].
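As a worked example of (9), assume a KITTI-like image width of $w = 1242$ px and a focal length of roughly $f_x \approx 720$ px (illustrative values, not our calibration results). A y-axis rotation of $r_y = 0.01$ rad between adjacent frames then bounds the focal-length error at $\delta f_x < 2 f_x^2 / (w^2 r_y) \approx 67$ px, while a larger rotation of $r_y = 0.1$ rad tightens the bound to roughly $7$ px. In other words, sequences with more inter-frame rotation provide a stronger supervisory signal on the intrinsics.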
E. Network Architecture

Our framework is inspired by [30] and [32], but rather than requiring two separate networks for depth and pose estimation, we use a common encoder for both tasks in a novel single-network architecture (Fig. 3). That is, given two temporally adjacent input images at times $t$ and $t'$, our network first convolves these inputs through a series of convolutional blocks in a common encoder, and then predicts either disparities through a decoder, or camera pose and intrinsics through fully connected layers. In the decoder network, the encoder's latent representation of the input images is first re-upsampled using a standard bilinear interpolation kernel with pooling indices from the encoder to fuse low-level features, as inspired by [19], [30]. We then use rectified linear units (ReLU) [43] as activation functions in all layers of this decoder except for the prediction layer, which uses a sigmoid function instead. The decoder predicts left-to-right and right-to-left disparities $D$ at both timesteps, which are then either used to reconstruct the right stereo images for a spatially-constrained geometric loss during training, or used to construct the depth during inference. In the fully connected layers, the translation $\hat{t}_{t' \to t}$, rotation $\hat{R}_{t' \to t}$, and camera intrinsics $\hat{K}$ are predicted independently in three separate and decoupled groups of fully connected layers for better performance [31]. These outputs are then either taken at face value during inference as the predicted egomotion and camera parameters, or used as inputs (along with the estimated depth map) to warp the current frame to the next for our temporal reconstruction loss as described previously.
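To make this layout concrete, the sketch below builds a skeleton of a shared-encoder network in TensorFlow/Keras (the framework used in Sec. III): a common convolutional encoder feeding a bilinear-upsampling decoder for disparities and three decoupled fully connected heads for translation, rotation, and intrinsics. Layer widths, input resolution, head sizes, and the rotation parameterization are illustrative assumptions, and the low-level feature fusion from the encoder is omitted for brevity; this is not the paper's exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_spatio_temporal_net(height=256, width=512):
    """Structural sketch: shared encoder, disparity decoder, and FC pose/intrinsics heads."""
    inputs = layers.Input(shape=(height, width, 6))  # two temporally adjacent RGB frames, stacked

    # Common encoder: a stack of strided convolutions with ReLU activations.
    x = inputs
    for filters in (32, 64, 128, 256, 512):
        x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    latent = x

    # Decoder head: bilinear upsampling + convolution; sigmoid on the disparity prediction.
    d = latent
    for filters in (512, 256, 128, 64, 32):
        d = layers.UpSampling2D(interpolation="bilinear")(d)
        d = layers.Conv2D(filters, 3, padding="same", activation="relu")(d)
    disparities = layers.Conv2D(4, 3, padding="same", activation="sigmoid",
                                name="disparities")(d)  # L->R and R->L maps at both timesteps

    # Three decoupled groups of fully connected layers for translation, rotation, intrinsics.
    f = layers.GlobalAveragePooling2D()(latent)
    def fc_head(units_out, name):
        h = layers.Dense(512, activation="relu")(f)
        h = layers.Dense(512, activation="relu")(h)
        return layers.Dense(units_out, name=name)(h)
    translation = fc_head(3, "translation")
    rotation = fc_head(3, "rotation")      # e.g., an axis-angle or Euler parameterization
    intrinsics = fc_head(4, "intrinsics")  # fx, fy, x0, y0

    return tf.keras.Model(inputs, [disparities, translation, rotation, intrinsics])
```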
Fig. 4. Example Depth. Two representative examples of depth maps produced by our framework. Our single-network system can accurately estimate object distances from purely monocular images.

III. RESULTS

In this section, we evaluate our proposed framework using the KITTI driving dataset [33]. The network architecture was implemented using the TensorFlow framework [44], and models were trained on a single NVIDIA GeForce RTX 2070 Super GPU with 8GB of memory using a batch size of 4. The Adam optimizer [45] was used to train the network parameters, with exponential decay rates $\beta_1 = 0.9$ and $\beta_2 = 0.99$ and a learning rate $\alpha$ initially set to 0.001 but gradually decreased throughout training, allowing faster weight updates at the beginning and smaller fine-tuning steps towards the end of the learning process. We used standard data augmentation techniques during training, such as random left-right mirroring of the training images and perturbing the image color space (i.e., gamma, brightness, and color shifts), to artificially increase the size of the training dataset. At test time, inference was performed on an AMD Ryzen 7 3700X 8-Core 3.6GHz CPU, and we compare our method against the current state-of-the-art using conventional metrics (i.e., Abs Rel, Sq Rel, RMSE, RMSE log, and Accuracy for depth, and Absolute Trajectory Error (ATE) for egomotion) as per [46].
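A minimal sketch of the training configuration described above (Adam with $\beta_1 = 0.9$, $\beta_2 = 0.99$, an initial learning rate of 0.001 with gradual decay, plus mirroring and color-space augmentation) is given below. The decay schedule's step count and rate, and the augmentation ranges, are assumed values for illustration; for stereo training pairs, mirroring would also require swapping the left/right roles, which is omitted here.

```python
import tensorflow as tf

# Learning rate starts at 0.001 and is gradually decreased; the schedule
# parameters (decay_steps, decay_rate) are illustrative assumptions.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=10000, decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule, beta_1=0.9, beta_2=0.99)

def augment(image):
    """Random left-right mirroring and color-space perturbation; ranges are placeholders."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.adjust_gamma(image, gamma=tf.random.uniform([], 0.8, 1.2))
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_hue(image, max_delta=0.05)
    return tf.clip_by_value(image, 0.0, 1.0)
```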
A. Performance of Depth Estimation

To evaluate the performance of our network's depth inference, we use a standard Eigen split [34] on the KITTI dataset [33], as per convention, and compare against several state-of-the-art methods, including [13], [30]–[32]. The full KITTI Vision Benchmark Suite is an extensive dataset which consists of 61 video sequences recorded across 5 days, with 42,382 total rectified stereo image pairs, each with a resolution of 1242x375. For a fair comparison, Eigen et al. [34] proposed a selection of 697 images across 28 of these video sequences as a test set for single-view depth evaluation, held out from the training and validation sets. The remaining 23,178 images from the 32 other scenes make up the training and validation sets, in which the training set contains 21,055 images while the validation set contains 2,123. We train, validate, and test our network using these splits and compare our depth estimation accuracy against other works across several metrics, shown in Table I. Ground truth data for the testing set was calculated by projecting the Velodyne LiDAR data onto the image plane. Example depth maps output by our system can be seen in Fig. 4.

B. Learned Camera Intrinsics

To evaluate our system's ability to recover the camera intrinsics (i.e., $f_x$, $f_y$, $x_0$, $y_0$) through the supervisory signal provided by the rotational component of (1), we follow a similar procedure to [12] and trained separate models on several different video sequences until convergence of these parameters, yielding multiple independent results. We specifically used ten video sequences of the "2011_09_28" subdataset, chosen to have the same ground truth calibration done that day, and Table III shows the resulting mean and standard deviation of those ten tests. All experiments were done on the left stereo color camera ("image_02") of the vehicle setup.

C. Egomotion

We carried out our pose estimation performance evaluation using four sequences from the KITTI Odometry dataset [48] and compared against several state-of-the-art methods, including UnDEMoN [32], SfMLearner [15], and VISO-M [47]. For a quantitative comparison, we adopt the absolute trajectory root-mean-square error (ATE) for both translational ($t_{ate}$) and rotational ($r_{ate}$) components as per standard practice [49], defined as

$$F_i := Q_i^{-1} S P_i. \qquad (10)$$
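For reference, the sketch below shows how the translational ATE based on (10) can be computed from two trajectories following the standard evaluation protocol [49]: the ground-truth pose $Q_i$ and estimated pose $P_i$ are compared after a rigid alignment $S$, and the translational components of the error poses are summarized by their RMSE. The 4x4 homogeneous-matrix representation and the precomputed alignment are assumptions of this sketch.

```python
import numpy as np

def ate_translation(Q, P, S=np.eye(4)):
    """Eq. (10): F_i = Q_i^{-1} S P_i; return the RMSE of the translational parts.

    Q, P: lists of 4x4 homogeneous ground-truth and estimated poses.
    S: rigid alignment between the two trajectories (identity if pre-aligned).
    """
    errs = []
    for Qi, Pi in zip(Q, P):
        Fi = np.linalg.inv(Qi) @ S @ Pi         # relative error pose
        errs.append(np.linalg.norm(Fi[:3, 3]))  # magnitude of the translational error
    return float(np.sqrt(np.mean(np.square(errs))))
```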
TABLE I: Comparison of monocular depth estimation with other spatio-temporal approaches. Cropped regions from [19] were used for the performance evaluation of all methods. In the column labeled "Train", "Depth" indicates supervised training with ground truth and "S-T" indicates an unsupervised spatio-temporal training approach. We evaluate using the Eigen split [34] on the KITTI dataset [33] and cap depth to 80 m and 50 m as per standard practice [19]. Results from other methods were taken from their corresponding papers. For error metrics, lower is better; for accuracy, higher is better.
TABLE II: Comparison of our system's odometry estimation against various other state-of-the-art methods [15], [32], [47] using the absolute trajectory error for translational ($t_{ate}$) and rotational ($r_{ate}$) movement. Values for the other methods were retrieved from [32].