CroMo: Cross-Modal Learning for Monocular Depth Estimation
Yannick Verdie1, Jifei Song1, Barnabe Mas1,2, Benjamin Busam1,3, Ales Leonardis1, Steven McDonagh1
1 Huawei Noah's Ark Lab   2 Ecole Polytechnique   3 Technical University of Munich
{yannick.verdie, jifei.song, ales.leonardis, steven.mcdonagh}@huawei.com
Abstract
Learning-based depth estimation has witnessed recent progress in multiple directions; from self-supervision using monocular video to supervised methods offering highest accuracy. Complementary to supervision, further boosts to performance and robustness are gained by combining information across modalities. We propose a pipeline that connects scene geometry with polarisation and ToF signals and enables self-supervised and cross-modal learning. In the absence of existing multimodal datasets, we examine our approach with a custom-made multi-modal camera rig and collect CroMo; the first dataset to consist of synchronized stereo polarisation, indirect ToF and structured light depth, captured at video rates. Extensive experiments on challenging video scenes confirm both qualitative and quantitative pipeline advantages where we are able to outperform competitive monocular depth estimation methods.

1. Introduction

Modern vision sensors are able to leverage a variety of light properties for optical sensing. Common RGB sensors, for instance, use colour filter arrays (CFA) over a pixel sensor grid to separate incoming radiation into specified wavebands. This allows a photosensor to detect wavelength-separated light intensity and enables the acquisition of familiar visible spectrum images. Wavelength is however only one property of light capable of providing information.

Light polarisation defines another property and describes the oscillation direction of an electromagnetic wave. While the majority of natural light sources (e.g. the sun) emit unpolarised light, consisting of a mixture of oriented oscillations, surface reflection from non-metallic objects can linearly polarise the light. Such polarised light then contains surface structure information, retrievable using analytic physical models [8]. This information can be used to harness the depth cues offered by this light property. Polarimetric imagery is a passive example for depth estimation. Passive sensors have acceptable resolution and dense depth, however there exist well understood capture situations that prove challenging (e.g. textureless surface regions).

However further known properties of light (i.e. speed) provide yet more information. Indirect Time-of-Flight (i-ToF) cameras are active light sensors and use a pulsed, near infrared light source to measure object and surface distance. Further active sensors use structured-light; these emit known infrared patterns and use stereoscopic imaging to measure the distance to the surface. While i-ToF and structured-light cameras have clear advantages, such as the ability to function in low-light scenarios and good short range precision, they are susceptible to specular reflections and ambient light, and range remains limited.

We argue that novel combination of active and passive light sensors offers new possibilities. We can exploit such a combination to take advantage of the discussed, modality-specific strengths and weaknesses. We observe that (1) dif-
i-ToF measurements exhibit limited resolution and noise [15], which constitute challenging factors for any learning-based approach. Obtaining reliable signals from specular surfaces is difficult, and inherent Multi-Path Interference (MPI) often manifests as noisy measurements and artifacts. Synthetic training is also explored for raw i-ToF input data in end-to-end fashion [5, 24, 57]. However, the ability to account for real world domain shifts is limited. In [4] a GAN is employed towards addressing such domain adaptation issues on a limited dataset.

i-ToF depth improvement: MPI can be considered a critical issue and error mitigation has been the focus of a body of work [60]. Two-path approximations [23] have been used within optimization schemes [14, 35] and multiple frequencies are used to constrain the problem [16]. Kadambi et al. [33] propose a hardware solution to address scenes with translucent objects and a number of scholars incorporate light transport information to correct for MPI [2, 26, 44, 46].

2.2. Depth with multiple sensors

Depth completion has been carried out via combining multiple input modalities, for example, a sparse but accurate LiDAR signal in combination with RGB [58]. It is difficult to address sparse signals with CNNs [41] and LiDAR sensors can produce problematic artifacts resulting in unreliable Ground Truth depth estimates [39]. One strategy towards removing dependence on this form of supervision are self-supervision cues, however these fall behind supervised pipelines in terms of accuracy [40].

i-ToF and stereo: Confidence-based combination of i-ToF depth and classical RGB stereo is explored with the network architecture of [3], and a semi-supervised approach for this combination is explored by [47] in a generic framework. While these approaches improve upon the individual depth estimates, they rely on a late fusion paradigm. Son et al. [54] use a robotic arm to collect 540 real training images of short range scenes with structured light ground truth.

By inserting micro linear polarizers [45] in front of a photo-receiver chip, Yoshida et al. [62] build an i-ToF sensor capable of acquiring both i-ToF depth and polarisation scene cues. Combination of both the absolute depth (i-ToF) and relative shape (polarisation cues) allowed reconstruction of depth for specular surfaces. While this pipeline requires i-ToF and polarisation input to solve an optimization problem, we alternatively explore cross-modal self-supervised training and single image inference.

Depth from multi-view Polarisation: Another route to predict depth is the use of more than one polarisation image [7], which enables methods based on physical models. An RGB+Polarisation pair can provide sharp depth maps with stereo vision [66]. Other methods [12] use more than two polarisation images. Despite the sharpness of the results, the difficulty to acquire multi-view polarisation images is still a major hurdle. Atkinson et al. [6] combine polarisation methods with photometric stereo. Two images of a scene, from an identical view point yet with different light exposures, are leveraged. An extension dealing with mixed reflectivity is established via a combined photometric-polarisation linear system in [38], and Garcia et al. [18] solve for polarisation normals using circularly polarised light. Traditional multi-view methods also benefit from polarisation. Miyazaki et al. [43] recover surfaces of black objects using polarisation physics and space carving.

Depth refinement with Polarisation: Consumer depth estimation tools progress significantly in recent years, however their predictions are noisy and lack details. Using polarisation cues, [32] produce sharper depth maps from RGBD cameras by differentiating their depth maps to resolve polarisation ambiguities and perform mutual optimization.

Despite clear improvements in monocular depth estimation methods, their performance remains bounded by the chosen modality, hence calling for multi-modal depth estimation. Our method alleviates this problem with a learning-based approach. During training we leverage complementary modalities such that our model can compensate for the drawbacks of the single modality used at inference time.

3. Method

Our multi-modal monocular depth investigation leads to a new model architecture that accounts explicitly for prediction blur and introduces two novel analytic losses. We discuss these components in the following sections.

3.1. Architecture

Our architecture employs multiple encoder-decoder networks, illustrated in Fig. 3a. We observe that monocular depth estimation methods often incur blurry depth predictions and we address this problem by introducing architectural components that account for prediction blur. Firstly, convolutions in our encoders are coupled with gated convolutions. Our network then composes a traditional U-Net [51] with skip connections and the gated convolutions [63]. The encoder utilises a ResNet [28] style block, while the decoder is a cascade of convolutions with layer resizing.

Secondly, drawing on the fact that Displacement Fields (DF) can be utilised to aid sharpness [49], we estimate a DF using a self-supervised sharpening decoder. Depth pixels with strong local disparity have values redefined to mirror a nearest neighbour that does not exhibit strong local disparity. Groundtruth (GT) displacement fields can thus be defined for each predicted depth during training ("on-the-fly"), guiding our dedicated displacement field prediction. We inspect predicted depth with and without our DF strategy and observe significantly improved sharpness, most evident when employing 3D visualisations (Fig. 4).

3.2. Loss Formulation

Our study considers multiple modalities and various sensor configurations at training time. We explore several loss formulations, detailed below.
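As a concrete reference for the architectural components above, the gated convolution of Sec. 3.1 can be sketched as follows. This is a minimal PyTorch illustration in the spirit of [63]; the class name, kernel sizes and activation choices are our assumptions rather than the paper's exact layers.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution in the spirit of [63]: a feature branch is
    modulated by a learned soft gate. A sketch only; the exact
    configuration used in the paper is an assumption here."""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        pad = k // 2
        self.feature = nn.Conv2d(in_ch, out_ch, k, stride, pad)
        self.gate = nn.Conv2d(in_ch, out_ch, k, stride, pad)

    def forward(self, x):
        # Sigmoid gate in [0, 1] decides which spatial features pass.
        return torch.relu(self.feature(x)) * torch.sigmoid(self.gate(x))
```

In the encoder, such layers stand in for plain convolutions, letting the network softly suppress unreliable spatial features.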
Figure 3. Our full model with modality specific losses $\mathcal{L}_{corr}$, $\mathcal{L}_{corr\to pol}$ and $\mathcal{L}_{stereo}$ (see Sec. 3.1 for further details). (a) Our network architecture. (b) Our training procedure and introduced losses (N: Network, A: Analytic, P: Projection).
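The on-the-fly GT displacement fields of Sec. 3.1 admit a compact sketch: mark pixels with strong local depth disparity and redirect each to its nearest unmarked neighbour. The gradient test and threshold below are illustrative assumptions on our part; [49] defines the construction that the paper follows.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, sobel

def gt_displacement_field(depth, grad_thresh=0.1):
    """Hypothetical on-the-fly GT displacement field (Sec. 3.1): each
    pixel with strong local disparity points to its nearest 'flat'
    neighbour; flat pixels map to themselves."""
    edges = np.hypot(sobel(depth, axis=0), sobel(depth, axis=1)) > grad_thresh
    # Indices of the nearest non-edge pixel, for every pixel.
    _, nearest = distance_transform_edt(edges, return_indices=True)
    ys, xs = np.indices(depth.shape)
    return np.stack([nearest[0] - ys, nearest[1] - xs])  # per-pixel (dy, dx)

def resample_with_df(depth, df):
    """Sharpen a depth map by mirroring values from displaced positions."""
    ys, xs = np.indices(depth.shape)
    return depth[ys + df[0], xs + df[1]]
```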
The outward pointing normal vector is defined as the cross product of the partial derivatives with respect to $x$ and $y$ [66]:

$$ n = \begin{bmatrix} -f_y\,\partial_x d(x,y) \\ -f_x\,\partial_y d(x,y) \\ (x - c_x)\,\partial_x d(x,y) + (y - c_y)\,\partial_y d(x,y) + d(x,y) \end{bmatrix} \qquad (5) $$

with $f_x, f_y, c_x, c_y$ the camera intrinsics.

Hence, from a given depth map $d$, one can compute the azimuth angle $\alpha$ and the viewing angle $\theta$ using (4) and (5), followed by the polarisation parameters $\rho$ and $\phi$ with (2) and (3). The polarisation images for diffuse and specular surfaces, $I^{\text{diffuse}}_{corr\to pol}$ and $I^{\text{specular}}_{corr\to pol}$, are finally recovered using the calculated polarisation parameters with (1).
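Eq. (5) translates directly into a few lines of NumPy; the finite-difference scheme, the normalisation and the angle extraction in the trailing comment are implementation choices on our part.

```python
import numpy as np

def normals_from_depth(d, fx, fy, cx, cy):
    """Outward-pointing unit normals from a depth map via Eq. (5)."""
    dy, dx = np.gradient(d)                  # d/dy, d/dx (rows, cols)
    ys, xs = np.indices(d.shape, dtype=float)
    n = np.stack([-fy * dx,
                  -fx * dy,
                  (xs - cx) * dx + (ys - cy) * dy + d], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

# The azimuth and viewing angle then follow from the normal direction,
# e.g. alpha = arctan2(n_y, n_x) and theta = arccos(n_z) for a unit
# normal in camera coordinates (our reading of Eqs. (4)-(5)).
```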
Depth to correlation (A1): Indirect ToF measures the correlation between a known emitted signal and the received signal. The emitted signal at frequency $f_M$ is a sinusoid:

$$ g(t) = 2\cos(2\pi f_M t) + 1 \qquad (6) $$

and the signal, reflected by the scene, is of the form [30]

$$ f(t) = a\cos\big(2\pi f_M (t - \tau)\big) + \beta $$

where $\tau$ is the time delay between the emitted signal $g(t)$ and the reflected signal $f(t)$. The i-ToF measurement $c(x)$ is the correlation between the two signals:

$$ c(x) = \lim_{T\to\infty} \frac{1}{T}\int_{-T/2}^{T/2} f(t)\,g(t - x)\,dt. $$

By sampling the correlation at four offsets, measurements $\{c(x_0), c(x_1), c(x_2), c(x_3)\}$ can be obtained to recover the phase $\varphi$, the amplitude $a$ and the intensity $\beta$.
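Although the sampling offsets $x_k$ are not reproduced above, the common four-bucket decoding at quarter-period offsets recovers $\varphi$, $a$ and $\beta$ in closed form; the sketch below makes that assumption, with a 25 MHz default matching the rig of Sec. 4.1.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def itof_decode(c0, c1, c2, c3, f_mod=25e6):
    """Phase, amplitude and intensity from correlation samples taken at
    quarter-period offsets x_k = k / (4 f_mod); standard 4-bucket i-ToF
    decoding, assumed rather than quoted from the paper."""
    phase = np.arctan2(c1 - c3, c0 - c2) % (2 * np.pi)  # phi in [0, 2pi)
    amplitude = 0.5 * np.hypot(c0 - c2, c1 - c3)        # a
    intensity = 0.25 * (c0 + c1 + c2 + c3)              # beta
    depth = C * phase / (4 * np.pi * f_mod)             # metres, phase-wrapped
    return phase, amplitude, intensity, depth
```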
Stereo loss $\mathcal{L}_{stereo}$: This loss requires that left and right image pairs are accessible during training. While only the left image $I_l$ is fed to the network, the right image $I_r$ can guide the model towards generating valid depth, and vice versa. More formally, let $K_l$ and $K_r$ be camera matrices with intrinsic parameters for left and right images respectively, and $D$ a depth map on the left reference frame. Let $T_{left\to right}$ denote the transformation that moves 3D points from the left coordinate system to the right. An image coordinate transformed from left coordinate $p_l$ to the right image is

$$ p_{left\to right} = K_r \cdot T_{left\to right} \cdot D(p_l) \cdot K_l^{-1} \cdot p_l \qquad (13) $$

A backward differentiable warping [31] is used to reproject an image onto the left view as $I_{right\to left}$. We form a stereo loss $\mathcal{L}_{stereo}$ and related mask loss $\mathcal{L}_{mask}$ similarly to [22], which aid network training and deal with occluded pixels respectively, as

$$ \mathcal{L}_{stereo} = E_{pe}\big(I_l,\, I^{D_{pol}}_{right\to left}\big) \qquad (14) $$

$$ \mathcal{L}_{mask} = E_{pe}\big(I_l,\, I^{D_{pol}}_{right\to left}\big) \qquad (15) $$

where the photometric error $E_{pe}$ is similar to [22]. Analogously, the correlation-to-polarisation loss selects the reflection model that best explains the observation:

$$ \mathcal{L}_{corr\to pol} = \min_{\bullet\, \in\, \{\text{diffuse},\, \text{specular}\}} \Big\{ E_{pe}\big(I_l,\, I^{\bullet}_{corr\to pol}\big) \Big\} \qquad (18) $$

where $I_l$ is the left polarisation image and $I^{\bullet}_{corr\to pol}$ the recovered polarisation image for the predicted depth $D_{pol}$.
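Eq. (13) together with the backward warp [31] is, in effect, a projective resampling; a sketch with PyTorch's grid_sample follows, where the tensor conventions are our own and the SSIM part of $E_{pe}$ [22] is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(img_r, depth_l, K_l, K_r, T_lr):
    """Reproject the right image into the left view (Eq. (13)) with a
    backward, differentiable bilinear warp [31].
    img_r: (B,3,H,W)  depth_l: (B,1,H,W)  K_l,K_r: (B,3,3)  T_lr: (B,4,4)"""
    B, _, H, W = depth_l.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=depth_l.dtype),
                            torch.arange(W, dtype=depth_l.dtype),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).reshape(1, 3, -1)
    # Back-project left pixels, transform to the right camera, project.
    cam = torch.linalg.inv(K_l) @ pix * depth_l.reshape(B, 1, -1)
    cam = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)
    proj = K_r @ (T_lr @ cam)[:, :3]
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    # Normalise to [-1, 1] and sample the right image at those locations.
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], -1).reshape(B, H, W, 2)
    return F.grid_sample(img_r, grid, align_corners=True, padding_mode="border")

def photometric_error(a, b):
    # L1 term only; [22] combines it with SSIM (omitted here for brevity).
    return (a - b).abs().mean(1, keepdim=True)
```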
Depending on the input modalities available at training time, we can add or remove the introduced losses $\mathcal{L}_{corr}$, $\mathcal{L}_{corr\to pol}$, $\mathcal{L}_{stereo}$ and $\mathcal{L}_{struct}$ as appropriate. We explicitly note that hyper-parameter tuning for balancing of these loss terms is not required, due to our formulation. Our total loss $\mathcal{L}$ can then be formulated as:

$$ \mathcal{L} = \min_{i\, \in\, \{\text{mask},\, \text{stereo},\, \text{corr}\to\text{pol},\, \text{struct}\}} \{\mathcal{L}_i\} + \mathcal{L}_{corr} + \mathcal{L}_{DF} \qquad (19) $$

where $\mathcal{L}_{DF}$ is the $\ell_2$ norm between predicted and GT DF.
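Reading the minimum in Eq. (19) per pixel, in the spirit of the minimum-reprojection idea of [22], the total loss could be assembled as below; whether the minimum is taken per pixel or over scalar terms is our interpretation of the recovered equation.

```python
import torch

def total_loss(aux_loss_maps, l_corr, l_df):
    """Eq. (19): minimum over the auxiliary per-pixel loss maps that are
    available this iteration (mask / stereo / corr->pol / struct), plus
    the correlation and displacement-field terms. A sketch only.
    aux_loss_maps: non-empty list of (B,1,H,W) tensors."""
    l_min = torch.stack(aux_loss_maps).min(dim=0).values.mean()
    return l_min + l_corr + l_df
```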
4. Data

We next provide details on our custom camera rig (Sec. 4.1) and CroMo dataset (Sec. 4.2), comprising synchronised image sequences capturing multiple modalities, at video-rate across real-world indoor and outdoor scenes.

4.1. Camera capture rig

Our prototype custom-camera hardware rig is shown in Fig. 5. Our rig is constructed in order to capture synchronised data across multiple modalities including stereo polarisation, i-ToF correlation, structured-light depth and IMU. We rigidly mount two polarisation cameras (Lucid PHX050S-QC) providing a left-right stereo pair, an i-ToF camera (Lucid HLS003S-001) operating at 25 MHz and a camera (RealSense D435i) for active IR stereo capture. All devices are connected with a hardware synchronisation wire resulting in time-aligned video capture at a frame rate of 10 fps. The left polarisation camera is the lead camera which generates the genlock signal and defines the world reference frame. Accurate synchronisation was validated using a flash-light torch and was further confirmed by the respectable quality observed from stereo Block Matching results [29]. The focus of all sensors is set to infinity, the aperture to maximum, and the exposure is manually fixed at the beginning of each capture sequence. The calibration of all four cameras' extrinsics, intrinsics, and distortion coefficients is done with a graph bundle-adjustment for improved multi-view calibration (see appendix for further details).

4.2. CroMo dataset

We collect a unique dataset comprising multi-modal captures such that each time point pertains to measurement of (1) Polarisation: raw stereo polarisation cameras produce 2448x2048 px stereo images. (2) i-ToF: 4-channel 640x480 px correlation images. (3) Depth: a structured-light capture of the scene resulting in a 848x480 px depth image. In addition to the three main sensors, IMU information is recorded to further enable future research directions. Our dataset consists of more than 20k frames, totalling >80k images of indoor and outdoor sequences in challenging conditions, with no constraints on minimum or maximum scene range. We group these sequences into four different scenes which we name: Kitchen, Station, Park and Facades. Despite the multitude of sensors, operating ranges are not unlimited and our data collection also does not cover all possible scenarios; we further discuss limitations in our appendix. We report statistics per captured scene in Tab. 1 (lower). These statistics characterise our scene captures and provide useful information, e.g., that the median scene depth differs greatly between indoor (Kitchen) and outdoor (Station, Park and Facades) scenes. This is a strong indicator for whether the i-ToF sensor will perform well. Tab. 1 (upper) provides a comparison with other depth datasets, showing that CroMo is the first publicly available, modality-rich dataset containing a large quantity of image data.

5. Experiments

Our experimental design evaluates (1) the effect of multiple modalities, accessible at training time, for monocular depth estimation and (2) the effect that changing network architecture has on depth quality, under consistent input signal. Our capture setup allows us to employ a standard MVS approach [52] on full temporal sequences of polarisation intensity frames (left camera), to serve as ground-truth depth for our experimental work. This expensive offline optimisation leverages correspondences amongst all frames per sequence, affording high quality depth to evaluate our ideas.

Multi-modal training: We firstly evaluate combinations of training input signal by changing the number of sensors available to the model. We fix network encoder-decoder backbone components (i.e. ResNet50, analogous to [22]) and train models that leverage cues from a maximum of four sensors; left and right polarisation, i-ToF correlation and structured-light. We show predicted depth improvements, attainable by systematic addition of sensors, and quantify where best gains can be made. The training signal components used for our monocular depth estimation experiments are as follows: Temporal (M) extracts information from video sequences (3 frames), Stereo (S) uses stereo images, i-ToF (T) leverages i-ToF correlations via our two interconnected depth branches (see Sec. 3.2). Finally, Structured-light (L) incorporates an additional mask into the objective function, derived from information provided by our structured-light sensor. The structured-light signal is utilised only when the mask improves the projection loss. We explore alternative strategies to exploit the structured-light signal and discuss details on practical benefits (e.g. convergence speed) in the appendix.

Introduced signal components define our set of training experiments. For example, Stereo and Structured-Light (SL) trains the model using self-supervised stereo (S) and structured-light (L) information. Experiments therefore use differing subsets of the introduced loss terms (see Tab. 3).

Qualitative results are shown in Fig. 6. Unsurprisingly, self-supervised stereo (S) is relatively blurry and struggles to capture fine details, such as the thin, metallic arch on the Facades sample, or the furniture in the Kitchen.
Figure 5. Our multi-modal camera rig (see Sec. 4.1).
Table 1. CroMo comparison and dataset statistics.
| Models trained with Stereo (S) input | MP | GMACs | Sq Rel | RMSE | RMSE Log | d<1.25 | d<1.25^2 | d<1.25^3 |
| ResNet18 architecture [22] | 14.36 | 20.17 | 1.7928 | 2.1982 | 0.3596 | 0.5061 | 0.7026 | 0.8009 |
| ResNet50 architecture [22] | 32.55 | 39.62 | 1.5037 | 2.0642 | 0.3383 | 0.5324 | 0.7262 | 0.8160 |
| p2d [11] (ResNet50 - Stokes) | 32.55 | 39.62 | 1.5938 | 2.1291 | 0.3884 | 0.4565 | 0.6632 | 0.7775 |
| MiDaS architecture [50] | 104.21 | 207.86 | 1.4021 | 1.9985 | 0.3252 | 0.5409 | 0.7901 | 0.8281 |
| Our architecture | 74.40 | 97.39 | 1.3031 | 1.8889 | 0.3233 | 0.5533 | 0.7301 | 0.8213 |

Table 2. Architectural comparisons under consistent modality sensor input; Stereo (S). Our proposed architecture improves quantitative results across the majority of metrics whilst remaining competitive in terms of compute and space requirements.
| Image sensors | Training strategy | L_stereo | L_DF | L_corr | L_struct | L_corr->pol | Sq Rel | RMSE | RMSE Log | d<1.25 | d<1.25^2 | d<1.25^3 |
| 2 | Stereo (S) w/o DF sampling | x |  |  |  |  | 1.5037 | 2.0642 | 0.3383 | 0.5324 | 0.7262 | 0.8160 |
| 2 | Stereo (S) | x | x |  |  |  | 1.3031 | 1.8889 | 0.3233 | 0.5533 | 0.7301 | 0.8213 |
| 3 | Stereo and i-ToF (ST) | x | x | x |  | x | 1.2829 | 1.8573 | 0.3202 | 0.5541 | 0.7308 | 0.9062 |
| 3 | Stereo and Structured-Light (SL) | x | x |  | x |  | 1.1233 | 1.7510 | 0.3168 | 0.5529 | 0.7370 | 0.9251 |
| 4 | Stereo, i-ToF, Structured-Light (STL) | x | x | x | x | x | 1.0699 | 1.6070 | 0.2891 | 0.6512 | 0.7882 | 0.9266 |
| 4 | STL+Temporal (STLM) | x | x | x | x | x | 1.0031 | 1.4889 | 0.2527 | 0.7061 | 0.8066 | 0.9246 |

Table 3. Model training strategies that differ in terms of available image sensor signals (utilised loss components, marked x). Sec. 3 and 4 provide details on loss function components and image sensors, respectively. In spite of having access to only a single, consistent modality during inference, the model benefits from visibility of additional training signals.
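For completeness, the error and accuracy columns of Tab. 2 and Tab. 3 follow the standard monocular depth evaluation protocol popularised by [22]; a sketch assuming pre-masked, metric depth arrays:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Sq Rel, RMSE, RMSE log and delta-threshold accuracies, as in [22]."""
    ratio = np.maximum(gt / pred, pred / gt)
    d1, d2, d3 = [(ratio < 1.25 ** k).mean() for k in (1, 2, 3)]
    sq_rel = (((gt - pred) ** 2) / gt).mean()
    rmse = np.sqrt(((gt - pred) ** 2).mean())
    rmse_log = np.sqrt(((np.log(gt) - np.log(pred)) ** 2).mean())
    return dict(sq_rel=sq_rel, rmse=rmse, rmse_log=rmse_log,
                d1=d1, d2=d2, d3=d3)
```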
Addition of i-ToF and structured-light modalities, exclusively at training time, results in (ST), (SL) and (STL), which can be observed to improve respective depth quality. Finally, (STLM) adds our temporal modality and improves detail recovery (e.g. metallic arch and fence). Qualitative results can be observed to corroborate our hypothesis; inclusion of additional modalities at training time affords the model multiple complementary depth cues that can qualitatively improve depth inference. Our experimental work highlights the nature of valuable investigation possible with our unique CroMo dataset.

Quantitative results are reported in Tab. 3. We follow [22], reporting standard evaluation metrics, with focus on the RMSE in our following experiments. Best performance is obtained when all sensors are used together (STLM), while self-supervised stereo (S) with only polarisation images performs worst. When additional modalities are added to self-supervision (S), i.e. i-ToF (ST) or structured light (SL), performance improves in both cases, with larger gains coming from the addition of the latter. We conjecture that structured light information helps more due to the nature of our dataset and the current distribution of image content therein, i.e. ~85% outdoor imagery, where i-ToF sensors are impaired by ambient light. Combining the i-ToF and structured-light sensors (STL) further improves. The best depth prediction utilises the temporal component (STLM). RMSE in Tab. 3 displays a clear trend; the availability of additional sensor cues at training time improves monocular inference.

Network architecture: We next investigate the effect of network architecture on monocular estimation performance. Of note, we highlight that employing a larger capacity network is not the only way to improve prediction performance. We use our self-supervised stereo (S), i.e. baseline modality, training strategy for all experiments that follow in this section. This strategy provided weakest performance in our previous investigation of training modality choice (Tab. 3). For this reason, we consider it an appropriate candidate with which to evaluate improvements afforded by changes to network architecture.
Figure 6. Same ResNet50 architecture as in [22] with different modalities: each new modality closes the gap with GT.
Figure 7. Different architectures, same training strategy Stereo (S): our new architecture produces the sharpest depth predictions.
We report millions-of-parameters at inference (MP) and the giga multiply-accumulates per second (GMACs) in order to evaluate size and compute-cost per architecture. Architectures consist of the ResNet18 U-Net used in [22] and their supplementary material ResNet50 variant, the p2d architecture [11] using ResNet50 with a different data representation (Stokes), the MiDaS [50] architecture and Ours (see Sec. 3.1, Fig. 3a).

Qualitative results are shown in Fig. 7. It may be observed that the ResNet18 architecture with smallest (MP) fails to obtain good background detail of the swing frame structure (Station sample) or of the tree (Park sample). The ResNet50 variant slightly improves detail, especially with raw measurements instead of Stokes (p2d [11]). Even when increasing network capacity roughly three-fold with MiDaS, results are unsatisfying. Our proposed architecture (Ours) requires smaller capacity and computation for a sharper reconstruction of the swing and the tree. We disentangle the benefits of additional sensor modalities from our model contributions, highlighting the advantage of gated convolutions and our DF-based approach towards reducing blur.

Quantitative results are reported in Tab. 2. The smallest architecture ResNet18 [22] performs worst. The larger U-Net ResNet50 performs better, and has been generally adopted [11, 21]. Note p2d [11] uses a different data representation (Stokes) for polarisation cf. ResNet50; performance decreases. We believe the Stokes representation, using angle directly, is more sensitive to noise and not appropriate for an SSIM loss with the self-supervised stereo (S) training strategy. MiDaS [50] provides second best performance and yet necessitates roughly x2 GMACs. Our architecture provides best performance while remaining relatively compact, which we largely attribute to gated convolutions and displacement field estimation (see Sec. 3.1).

6. Conclusion

We systematically investigate the effect of using additional information from co-modal sensors at training time, for the task of monocular depth estimation from polarisation imagery. Our exploration is enabled through a unique multi-modal video dataset which constitutes synchronized images from binocular polarisation, raw i-ToF and structured-light depth. We quantify the beneficial influence of both passive and active sensors, leveraging self-supervised and cross-modal learning strategies that lead to the proposal of a new method providing sharper and more accurate depth estimation. This is made possible through two physical models that describe the relationships between polarisation and surface normals on one side, and correlation measures and scene depth on the other. We believe that our fundamental investigation of modality combination and the CroMo dataset can accelerate research of both spatial and temporal fusion, towards advancing cross-modal computer vision.
References

[1] Intel RealSense depth camera D435i. https://www.intelrealsense.com/depth-camera-d435i/. Accessed: 2021-11-22. 3
[2] Supreeth Achar, Joseph R Bartels, William L'Red' Whittaker, Kiriakos N Kutulakos, and Srinivasa G Narasimhan. Epipolar time-of-flight imaging. ACM Transactions on Graphics (ToG), 36(4):1-8, 2017. 3
[3] Gianluca Agresti, Ludovico Minto, Giulio Marin, and Pietro Zanuttigh. Deep learning for confidence information in stereo and ToF data fusion. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017. 3
[4] Gianluca Agresti, Henrik Schaefer, Piergiorgio Sartor, and Pietro Zanuttigh. Unsupervised domain adaptation for ToF data denoising with adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 3, 7
[5] Gianluca Agresti and Pietro Zanuttigh. Deep learning for multi-path error removal in ToF sensors. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0-0, 2018. 3
[6] Gary A Atkinson. Polarisation photometric stereo. Computer Vision and Image Understanding, 160:158-167, 2017. 3
[7] Gary A Atkinson and Edwin R Hancock. Multi-view surface reconstruction using polarization. In Tenth IEEE International Conference on Computer Vision (ICCV'05), volume 1, pages 309-316. IEEE, 2005. 3
[8] Gary A Atkinson and Edwin R Hancock. Recovery of surface orientation from diffuse polarization. IEEE Transactions on Image Processing, 15(6):1653-1664, 2006. 1, 2
[9] Yunhao Ba, Rui Chen, Yiqin Wang, Lei Yan, Boxin Shi, and Achuta Kadambi. Physics-based neural networks for shape from polarization. arXiv preprint arXiv:1903.10210, 2019. 2
[10] Yunhao Ba, Alex Gilbert, Franklin Wang, Jinfa Yang, Rui Chen, Yiqin Wang, Lei Yan, Boxin Shi, and Achuta Kadambi. Deep shape from polarization. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020, pages 554-571, Cham, 2020. Springer International Publishing. 7
[11] Marc Blanchon, Desire Sidibe, Olivier Morel, Ralph Seulin, Daniel Braun, and Fabrice Meriaudeau. P2D: a self-supervised method for depth estimation from polarimetry. In 25th International Conference on Pattern Recognition (ICPR 2020), 2020. 1, 2, 7, 8
[12] Zhaopeng Cui, Jinwei Gu, Boxin Shi, Ping Tan, and Jan Kautz. Polarimetric multi-view stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1558-1567, 2017. 3
[13] Rui Dai, Srijan Das, and Francois Bremond. Learning an augmented RGB representation with cross-modal knowledge distillation for action detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13053-13064, October 2021. 4
[14] Adrian A Dorrington, John P Godbaz, Michael J Cree, Andrew D Payne, and Lee V Streeter. Separating true range measurements from multi-path and scattering interference in [...], Interaction, and Measurement, volume 7864, page 786404. International Society for Optics and Photonics, 2011. 3
[15] Sergi Foix, Guillem Alenya, and Carme Torras. Lock-in time-of-flight (ToF) cameras: A survey. IEEE Sensors Journal, 11(9):1917-1926, 2011. 3
[16] Daniel Freedman, Yoni Smolin, Eyal Krupka, Ido Leichter, and Mirko Schmidt. SRA: Fast removal of general multipath for ToF sensors. In European Conference on Computer Vision, pages 234-249. Springer, 2014. 3
[17] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002-2011, 2018. 2
[18] N Missael Garcia, Ignacio De Erausquin, Christopher Edmiston, and Viktor Gruev. Surface normal reconstruction using circularly polarized light. Optics Express, 23(11):14391-14406, 2015. 3
[19] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue, 2016. 2
[20] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354-3361. IEEE, 2012. 2
[21] Clement Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017. 2, 8
[22] Clement Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3828-3838, 2019. 1, 2, 3, 5, 6, 7, 8
[23] John P Godbaz, Michael J Cree, and Adrian A Dorrington. Closed-form inverses for the mixed pixel/multipath interference problem in AMCW lidar. In Computational Imaging X, volume 8296, page 829618. International Society for Optics and Photonics, 2012. 3
[24] Qi Guo, Iuri Frosio, Orazio Gallo, Todd Zickler, and Jan Kautz. Tackling 3D ToF artifacts through learning and the FLAT dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 368-383, 2018. 3, 7
[25] Xiaoyang Guo, Hongsheng Li, Shuai Yi, Jimmy Ren, and Xiaogang Wang. Learning monocular depth by distilling cross-domain stereo networks, 2018. 2
[26] Mohit Gupta, Shree K Nayar, Matthias B Hullin, and Jaime Martin. Phasor imaging: A generalization of correlation-based time-of-flight imaging. ACM Transactions on Graphics (ToG), 34(5):1-18, 2015. 3
[27] Miles Hansard, Seungkyu Lee, Ouk Choi, and Radu Patrice Horaud. Time-of-flight cameras: principles, methods and applications. Springer Science & Business Media, 2012. 2
[28] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
[29] Heiko Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 807-814. IEEE, 2005. 6
[30] Radu Horaud, Miles Hansard, Georgios Evangelidis, and Clement Menier. An overview of depth cameras and range scanners based on time-of-flight technologies. Machine Vision and Applications, 27(7):1005-1020, Jun 2016. 5
[31] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28, pages 2017-2025. Curran Associates, Inc., 2015. 5
[32] Achuta Kadambi, Vage Taamazyan, Boxin Shi, and Ramesh Raskar. Polarized 3D: High-quality depth sensing with polarization cues. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
[41] Fangchang Ma and Sertac Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1-8. IEEE, 2018. 3
[42] Nikolaus Mayer, Eddy Ilg, Philipp Fischer, Caner Hazirbas, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. What makes good synthetic training data for learning disparity and optical flow estimation? International Journal of Computer Vision, 126(9):942-960, 2018. 2
[43] Daisuke Miyazaki, Takuya Shigetomi, Masashi Baba, Ryo Furukawa, Shinsaku Hiura, and Naoki Asada. Surface normal estimation of black specular objects from multiview polarization images. Optical Engineering, 56(4):041303, 2016. 3
[44] Nikhil Naik, Achuta Kadambi, Christoph Rhemann, Shahram Izadi, Ramesh Raskar, and Sing Bing Kang. A light transport model for mitigating multipath interference in time-of-flight sensors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 73-81, 2015. 3
[54] Kilho Son, Ming-Yu Liu, and Yuichi Taguchi. Learning to remove multipath distortions in time-of-flight range images for a robotic arm setup. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3390-3397. IEEE, 2016. 3
[55] Jaime Spencer, Richard Bowden, and Simon Hadfield. DeFeat-Net: General monocular depth via simultaneous unsupervised representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14402-14413, 2020. 2
[56] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012. 7
[57] Shuochen Su, Felix Heide, Gordon Wetzstein, and Wolfgang Heidrich. Deep end-to-end time-of-flight imaging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6383-6392, 2018. 3
[58] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant CNNs. In Proceedings of the International Conference on 3D Vision (3DV), pages 11-20. IEEE, 2017. 3
[59] Jamie Watson, Michael Firman, Gabriel J. Brostow, and Daniyar Turmukhambetov. Self-supervised monocular depth hints. In The International Conference on Computer Vision (ICCV), October 2019. 3, 5
[60] Refael Whyte, Lee Streeter, Michael J Cree, and Adrian A Dorrington. Review of methods for resolving multi-path interference in time-of-flight range cameras. In SENSORS, 2014 IEEE, pages 629-632. IEEE, 2014. 3
[61] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks, 2016. 2
[62] Tomonari Yoshida, Vladislav Golyanik, Oliver Wasenmuller, and Didier Stricker. Improving time-of-flight sensor for specular surfaces with shape from polarization. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 1558-1562. IEEE, 2018. 3
[63] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas Huang. Free-form image inpainting with gated convolution, 2019. 2, 3
[64] Ye Yu, Dizhong Zhu, and William AP Smith. Shape-from-polarisation: a nonlinear least squares approach. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 2969-2976, 2017. 2
[65] Pietro Zanuttigh, Giulio Marin, Carlo Dal Mutto, Fabio Dominio, Ludovico Minto, and Guido Maria Cortelazzo. Time-of-flight and structured light depth cameras. Technology and Applications, pages 978-3, 2016. 2
[66] Dizhong Zhu and William AP Smith. Depth from a polarisation + RGB stereo pair. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7586-7595, 2019. 3, 4, 5, 7