CroMo: Cross-Modal Learning for Monocular Depth Estimation
Yannick Verdie1, Jifei Song1, Barnabe Mas1,2, Benjamin Busam1,3, Ales Leonardis1, Steven McDonagh1
1 Huawei Noah's Ark Lab   2 Ecole Polytechnique   3 Technical University of Munich
{yannick.verdie, jifei.song, ales.leonardis, steven.mcdonagh}@huawei.com
Abstract
Learning-based depth estimation has witnessed recent progress in multiple directions; from self-supervision using monocular video to supervised methods offering highest accuracy. Complementary to supervision, further boosts to performance and robustness are gained by combining information across modalities. We propose a pipeline that connects scene geometry with polarisation and ToF signals and enables self-supervised and cross-modal learning. In the absence of existing multimodal datasets, we examine our approach with a custom-made multi-modal camera rig and collect CroMo; the first dataset to consist of synchronized stereo polarisation, indirect ToF and structured light depth, captured at video rates. Extensive experiments on challenging video scenes confirm both qualitative and quantitative pipeline advantages where we are able to outperform competitive monocular depth estimation methods.

1. Introduction

Modern vision sensors are able to leverage a variety of light properties for optical sensing. Common RGB sensors, for instance, use colour filter arrays (CFA) over a pixel sensor grid to separate incoming radiation into specified wavebands. This allows a photosensor to detect wavelength-separated light intensity and enables the acquisition of familiar visible spectrum images. Wavelength is however only one property of light capable of providing information.

Light polarisation defines another property and describes the oscillation direction of an electromagnetic wave. While the majority of natural light sources (e.g. the sun) emit unpolarised light, consisting of a mixture of oriented oscillations, surface reflection from non-metallic objects can linearly polarise the light. Such polarised light then contains surface structure information, retrievable using analytic physical models [8]. This information can be used to harness the depth cues offered by this light property. Polarimetric imagery is a passive example for depth estimation. Passive sensors have acceptable resolution and dense depth, however there exist well understood capture situations that prove challenging (e.g. textureless surface regions).

However further known properties of light (i.e. speed) provide yet more information. Indirect Time-of-Flight (i-ToF) cameras are active light sensors and use a pulsed, near infrared light source to measure object and surface distance. Further active sensors use structured-light; these emit known infrared patterns and use stereoscopic imaging to measure the distance to the surface. While i-ToF and structured-light cameras have clear advantages, such as the ability to function in low-light scenarios and good short range precision, they are susceptible to specular reflections and ambient light, and range remains limited.

We argue that novel combination of active and passive light sensors offers new possibilities. We can exploit such a combination to take advantage of the discussed, modality-specific strengths and weaknesses. We observe that (1) dif-
i-ToF measurements exhibit limited resolution and noise [15], which constitute challenging factors for any learning-based approach. Obtaining reliable signals from specular surfaces is difficult, and inherent Multi-Path Interference (MPI) often manifests as noisy measurements and artifacts. Synthetic training is also explored for raw i-ToF input data in end-to-end fashion [5, 24, 57]. However, the ability to account for real world domain shifts is limited. In [4] a GAN is employed towards addressing such domain adaptation issues on a limited dataset.

i-ToF depth improvement: MPI can be considered a critical issue and error mitigation has been the focus of a body of work [60]. Two-path approximations [23] have been used within optimization schemes [14, 35] and multiple frequencies are used to constrain the problem [16]. Kadambi et al. [33] propose a hardware solution to address scenes with translucent objects and a number of scholars incorporate light transport information to correct for MPI [2, 26, 44, 46].

2.2. Depth with multiple sensors

Depth completion has been carried out via combining multiple input modalities, for example, a sparse but accurate LiDAR signal in combination with RGB [58]. It is difficult to address sparse signals with CNNs [41] and LiDAR sensors can produce problematic artifacts resulting in unreliable Ground Truth depth estimates [39]. One strategy towards removing dependence on this form of supervision are self-supervision cues, however these fall behind supervised pipelines in terms of accuracy [40].

i-ToF and stereo: Confidence-based combination of i-ToF depth and classical RGB stereo is explored with the network architecture of [3], and a semi-supervised approach for this combination is explored by [47] in a generic framework. While these approaches improve upon the individual depth estimates, they rely on a late fusion paradigm. Son et al. [54] use a robotic arm to collect 540 real training images of short range scenes with structured light ground truth.

By inserting micro linear polarizers [45] in front of a photo-receiver chip, Yoshida et al. [62] build an i-ToF sensor capable of acquiring both i-ToF depth and polarisation scene cues. Combination of both the absolute depth (i-ToF) and relative shape (polarisation cues) allowed reconstruction of depth for specular surfaces. While this pipeline requires i-ToF and polarisation input to solve an optimization problem, we alternatively explore cross-modal self-supervised training and single image inference.

Depth from multi-view Polarisation: Another route to predict depth is the use of more than one polarisation image [7], which enables methods based on physical models. An RGB+Polarisation pair can provide sharp depth maps with stereo vision [66]. Other methods [12] use more than two polarisation images. Despite the sharpness of the results, the difficulty to acquire multi-view polarisation images is still a major hurdle. Atkinson et al. [6] combine polarisation methods with photometric stereo. Two images of a scene, from an identical view point yet with different light exposures, are leveraged. An extension dealing with mixed reflectivity is established via a combined photometric-polarisation linear system in [38], and Garcia et al. [18] solve for polarisation normals using circularly polarised light. Traditional multi-view methods also benefit from polarisation. Miyazaki et al. [43] recover surfaces of black objects using polarisation physics and space carving.

Depth refinement with Polarisation: Consumer depth estimation tools progress significantly in recent years, however their predictions are noisy and lack details. Using polarisation cues, [32] produce sharper depth maps from RGBD cameras by differentiating their depth maps to resolve polarisation ambiguities and perform mutual optimization.

Despite clear improvements in monocular depth estimation methods, their performance remains bounded by the chosen modality, hence calling for multi-modal depth estimation. Our method alleviates this problem with a learning-based approach. During training we leverage complementary modalities such that our model can compensate for the drawbacks of the single modality used at inference time.

3. Method

Our multi-modal monocular depth investigation leads to a new model architecture that accounts explicitly for prediction blur and introduces two novel analytic losses. We discuss these components in the following sections.

3.1. Architecture

Our architecture employs multiple encoder-decoder networks, illustrated in Fig. 3a. We observe that monocular depth estimation methods often incur blurry depth predictions and we address this problem by introducing architectural components that account for prediction blur. Firstly, convolutions in our encoders are coupled with gated convolutions. Our network then composes a traditional U-Net [51] with skip connections and the gated convolutions [63]. The encoder utilises a ResNet [28] style block, while the decoder is a cascade of convolutions with layer resizing.

Secondly, drawing on the fact that Displacement Fields (DF) can be utilised to aid sharpness [49], we estimate a DF using a self-supervised sharpening decoder. Depth pixels with strong local disparity have values redefined to mirror a nearest neighbour that does not exhibit strong local disparity. Groundtruth (GT) displacement fields can thus be defined for each predicted depth during training ("on-the-fly"), guiding our dedicated displacement field prediction. We inspect predicted depth with and without our DF strategy and observe significantly improved sharpness, most evident when employing 3D visualisations (Fig. 4).

3.2. Loss Formulation

Our study considers multiple modalities and various sensor configurations at training time. We explore several loss formulations, detailed below.
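As a concrete reference for the architectural components above, the gated convolution of Sec. 3.1 can be sketched as follows. This is a minimal PyTorch illustration in the spirit of [63]; the class name, kernel sizes and activation choices are our assumptions rather than the paper's exact layers.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution in the spirit of [63]: a feature branch is
    modulated by a learned soft gate. A sketch only; the exact
    configuration used in the paper is an assumption here."""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        pad = k // 2
        self.feature = nn.Conv2d(in_ch, out_ch, k, stride, pad)
        self.gate = nn.Conv2d(in_ch, out_ch, k, stride, pad)

    def forward(self, x):
        # Sigmoid gate in [0, 1] decides which spatial features pass.
        return torch.relu(self.feature(x)) * torch.sigmoid(self.gate(x))
```

In the encoder, such layers stand in for plain convolutions, letting the network softly suppress unreliable spatial features.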
Figure 3. Our full model with modality specific losses $\mathcal{L}_{corr}$, $\mathcal{L}_{corr\to pol}$ and $\mathcal{L}_{stereo}$ (see Sec. 3.1 for further details). (a) Our network architecture. (b) Our training procedure and introduced losses (N: Network, A: Analytic, P: Projection).
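The on-the-fly GT displacement fields of Sec. 3.1 admit a compact sketch: mark pixels with strong local depth disparity and redirect each to its nearest unmarked neighbour. The gradient test and threshold below are illustrative assumptions on our part; [49] defines the construction that the paper follows.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, sobel

def gt_displacement_field(depth, grad_thresh=0.1):
    """Hypothetical on-the-fly GT displacement field (Sec. 3.1): each
    pixel with strong local disparity points to its nearest 'flat'
    neighbour; flat pixels map to themselves."""
    edges = np.hypot(sobel(depth, axis=0), sobel(depth, axis=1)) > grad_thresh
    # Indices of the nearest non-edge pixel, for every pixel.
    _, nearest = distance_transform_edt(edges, return_indices=True)
    ys, xs = np.indices(depth.shape)
    return np.stack([nearest[0] - ys, nearest[1] - xs])  # per-pixel (dy, dx)

def resample_with_df(depth, df):
    """Sharpen a depth map by mirroring values from displaced positions."""
    ys, xs = np.indices(depth.shape)
    return depth[ys + df[0], xs + df[1]]
```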
The outward pointing normal vector is defined as the cross product of the partial derivatives with respect to $x$ and $y$ [66]:

$$ n = \begin{bmatrix} -f_y\,\partial_x d(x,y) \\ -f_x\,\partial_y d(x,y) \\ (x - c_x)\,\partial_x d(x,y) + (y - c_y)\,\partial_y d(x,y) + d(x,y) \end{bmatrix} \qquad (5) $$

with $f_x, f_y, c_x, c_y$ the camera intrinsics.

Hence, from a given depth map $d$, one can compute the azimuth angle $\alpha$ and the viewing angle $\theta$ using (4) and (5), followed by the polarisation parameters $\rho$ and $\phi$ with (2) and (3). The polarisation images for diffuse and specular surfaces, $I^{\text{diffuse}}_{corr\to pol}$ and $I^{\text{specular}}_{corr\to pol}$, are finally recovered using the calculated polarisation parameters with (1).
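Eq. (5) translates directly into a few lines of NumPy; the finite-difference scheme, the normalisation and the angle extraction in the trailing comment are implementation choices on our part.

```python
import numpy as np

def normals_from_depth(d, fx, fy, cx, cy):
    """Outward-pointing unit normals from a depth map via Eq. (5)."""
    dy, dx = np.gradient(d)                  # d/dy, d/dx (rows, cols)
    ys, xs = np.indices(d.shape, dtype=float)
    n = np.stack([-fy * dx,
                  -fx * dy,
                  (xs - cx) * dx + (ys - cy) * dy + d], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

# The azimuth and viewing angle then follow from the normal direction,
# e.g. alpha = arctan2(n_y, n_x) and theta = arccos(n_z) for a unit
# normal in camera coordinates (our reading of Eqs. (4)-(5)).
```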
Depth to correlation (A1): Indirect ToF measures the correlation between a known emitted signal and the received signal. The emitted signal at frequency $f_M$ is a sinusoid:

$$ g(t) = 2\cos(2\pi f_M t) + 1 \qquad (6) $$

and the signal, reflected by the scene, is of the form [30]

$$ f(t) = a\cos\big(2\pi f_M (t - \tau)\big) + \beta $$

where $\tau$ is the time delay between the emitted signal $g(t)$ and the reflected signal $f(t)$. The i-ToF measurement $c(x)$ is the correlation between the two signals:

$$ c(x) = \lim_{T\to\infty} \frac{1}{T}\int_{-T/2}^{T/2} f(t)\,g(t - x)\,dt. $$

By sampling the correlation at four offsets, measurements $\{c(x_0), c(x_1), c(x_2), c(x_3)\}$ can be obtained to recover the phase $\varphi$, the amplitude $a$ and the intensity $\beta$.
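Although the sampling offsets $x_k$ are not reproduced above, the common four-bucket decoding at quarter-period offsets recovers $\varphi$, $a$ and $\beta$ in closed form; the sketch below makes that assumption, with a 25 MHz default matching the rig of Sec. 4.1.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def itof_decode(c0, c1, c2, c3, f_mod=25e6):
    """Phase, amplitude and intensity from correlation samples taken at
    quarter-period offsets x_k = k / (4 f_mod); standard 4-bucket i-ToF
    decoding, assumed rather than quoted from the paper."""
    phase = np.arctan2(c1 - c3, c0 - c2) % (2 * np.pi)  # phi in [0, 2pi)
    amplitude = 0.5 * np.hypot(c0 - c2, c1 - c3)        # a
    intensity = 0.25 * (c0 + c1 + c2 + c3)              # beta
    depth = C * phase / (4 * np.pi * f_mod)             # metres, phase-wrapped
    return phase, amplitude, intensity, depth
```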
Stereo loss $\mathcal{L}_{stereo}$: This loss requires that left and right image pairs are accessible during training. While only the left image $I_l$ is fed to the network, the right image $I_r$ can guide the model towards generating valid depth, and vice versa. More formally, let $K_l$ and $K_r$ be camera matrices with intrinsic parameters for left and right images respectively, and $D$ a depth map on the left reference frame. Let $T_{left\to right}$ denote the transformation that moves 3D points from the left coordinate system to the right. An image coordinate transformed from left coordinate $p_l$ to the right image is

$$ p_{left\to right} = K_r \cdot T_{left\to right} \cdot D(p_l) \cdot K_l^{-1} \cdot p_l \qquad (13) $$

A backward differentiable warping [31] is used to reproject an image onto the left view as $I_{right\to left}$. We form a stereo loss $\mathcal{L}_{stereo}$ and related mask loss $\mathcal{L}_{mask}$ similarly to [22], which aid network training and deal with occluded pixels respectively, as

$$ \mathcal{L}_{stereo} = E_{pe}\big(I_l,\, I^{D_{pol}}_{right\to left}\big) \qquad (14) $$

$$ \mathcal{L}_{mask} = E_{pe}\big(I_l,\, I^{D_{pol}}_{right\to left}\big) \qquad (15) $$

where the photometric error $E_{pe}$ is similar to [22]. Analogously, the correlation-to-polarisation loss selects the reflection model that best explains the observation:

$$ \mathcal{L}_{corr\to pol} = \min_{\bullet\, \in\, \{\text{diffuse},\, \text{specular}\}} \Big\{ E_{pe}\big(I_l,\, I^{\bullet}_{corr\to pol}\big) \Big\} \qquad (18) $$

where $I_l$ is the left polarisation image and $I^{\bullet}_{corr\to pol}$ the recovered polarisation image for the predicted depth $D_{pol}$.
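Eq. (13) together with the backward warp [31] is, in effect, a projective resampling; a sketch with PyTorch's grid_sample follows, where the tensor conventions are our own and the SSIM part of $E_{pe}$ [22] is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(img_r, depth_l, K_l, K_r, T_lr):
    """Reproject the right image into the left view (Eq. (13)) with a
    backward, differentiable bilinear warp [31].
    img_r: (B,3,H,W)  depth_l: (B,1,H,W)  K_l,K_r: (B,3,3)  T_lr: (B,4,4)"""
    B, _, H, W = depth_l.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=depth_l.dtype),
                            torch.arange(W, dtype=depth_l.dtype),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).reshape(1, 3, -1)
    # Back-project left pixels, transform to the right camera, project.
    cam = torch.linalg.inv(K_l) @ pix * depth_l.reshape(B, 1, -1)
    cam = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)
    proj = K_r @ (T_lr @ cam)[:, :3]
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    # Normalise to [-1, 1] and sample the right image at those locations.
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], -1).reshape(B, H, W, 2)
    return F.grid_sample(img_r, grid, align_corners=True, padding_mode="border")

def photometric_error(a, b):
    # L1 term only; [22] combines it with SSIM (omitted here for brevity).
    return (a - b).abs().mean(1, keepdim=True)
```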
Depending on the input modalities available at training time, we can add or remove the introduced losses $\mathcal{L}_{corr}$, $\mathcal{L}_{corr\to pol}$, $\mathcal{L}_{stereo}$ and $\mathcal{L}_{struct}$ as appropriate. We explicitly note that hyper-parameter tuning for balancing of these loss terms is not required, due to our formulation. Our total loss $\mathcal{L}$ can then be formulated as:

$$ \mathcal{L} = \min_{i\, \in\, \{\text{mask},\, \text{stereo},\, \text{corr}\to\text{pol},\, \text{struct}\}} \{\mathcal{L}_i\} + \mathcal{L}_{corr} + \mathcal{L}_{DF} \qquad (19) $$

where $\mathcal{L}_{DF}$ is the $\ell_2$ norm between predicted and GT DF.
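Reading the minimum in Eq. (19) per pixel, in the spirit of the minimum-reprojection idea of [22], the total loss could be assembled as below; whether the minimum is taken per pixel or over scalar terms is our interpretation of the recovered equation.

```python
import torch

def total_loss(aux_loss_maps, l_corr, l_df):
    """Eq. (19): minimum over the auxiliary per-pixel loss maps that are
    available this iteration (mask / stereo / corr->pol / struct), plus
    the correlation and displacement-field terms. A sketch only.
    aux_loss_maps: non-empty list of (B,1,H,W) tensors."""
    l_min = torch.stack(aux_loss_maps).min(dim=0).values.mean()
    return l_min + l_corr + l_df
```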
4. Data

We next provide details on our custom camera rig (Sec. 4.1) and CroMo dataset (Sec. 4.2), comprising synchronised image sequences capturing multiple modalities, at video-rate across real-world indoor and outdoor scenes.

4.1. Camera capture rig

Our prototype custom-camera hardware rig is shown in Fig. 5. Our rig is constructed in order to capture synchronised data across multiple modalities including stereo polarisation, i-ToF correlation, structured-light depth and IMU. We rigidly mount two polarisation cameras (Lucid PHX050S-QC) providing a left-right stereo pair, an i-ToF camera (Lucid HLS003S-001) operating at 25 MHz and a camera (RealSense D435i) for active IR stereo capture. All devices are connected with a hardware synchronisation wire resulting in time-aligned video capture at a frame rate of 10 fps. The left polarisation camera is the lead camera which generates the genlock signal and defines the world reference frame. Accurate synchronisation was validated using a flash-light torch and was further confirmed by the respectable quality observed from stereo Block Matching results [29]. The focus of all sensors is set to infinity, the aperture to maximum, and the exposure is manually fixed at the beginning of each capture sequence. The calibration of all four cameras' extrinsics, intrinsics, and distortion coefficients is done with a graph bundle-adjustment for improved multi-view calibration (see appendix for further details).

4.2. CroMo dataset

We collect a unique dataset comprising multi-modal captures such that each time point pertains to measurement of (1) Polarisation: raw stereo polarisation cameras produce 2448x2048 px stereo images. (2) i-ToF: 4-channel 640x480 px correlation images. (3) Depth: a structured-light capture of the scene resulting in a 848x480 px depth image. In addition to the three main sensors, IMU information is recorded to further enable future research directions. Our dataset consists of more than 20k frames, totalling >80k images of indoor and outdoor sequences in challenging conditions, with no constraints on minimum or maximum scene range. We group these sequences into four different scenes which we name: Kitchen, Station, Park and Facades. Despite the multitude of sensors, operating ranges are not unlimited and our data collection also does not cover all possible scenarios; we further discuss limitations in our appendix. We report statistics per captured scene in Tab. 1 (lower). These statistics characterise our scene captures and provide useful information, e.g., that the median scene depth differs greatly between indoor (Kitchen) and outdoor (Station, Park and Facades) scenes. This is a strong indicator for whether the i-ToF sensor will perform well. Tab. 1 (upper) provides a comparison with other depth datasets, showing that CroMo is the first publicly available, modality-rich dataset containing a large quantity of image data.

5. Experiments

Our experimental design evaluates (1) the effect of multiple modalities, accessible at training time, for monocular depth estimation and (2) the effect that changing network architecture has on depth quality, under consistent input signal. Our capture setup allows us to employ a standard MVS approach [52] on full temporal sequences of polarisation intensity frames (left camera), to serve as ground-truth depth for our experimental work. This expensive offline optimisation leverages correspondences amongst all frames per sequence, affording high quality depth to evaluate our ideas.

Multi-modal training: We firstly evaluate combinations of training input signal by changing the number of sensors available to the model. We fix network encoder-decoder backbone components (i.e. ResNet50, analogous to [22]) and train models that leverage cues from a maximum of four sensors; left and right polarisation, i-ToF correlation and structured-light. We show predicted depth improvements, attainable by systematic addition of sensors, and quantify where best gains can be made. The training signal components used for our monocular depth estimation experiments are as follows: Temporal (M) extracts information from video sequences (3 frames), Stereo (S) uses stereo images, i-ToF (T) leverages i-ToF correlations via our two interconnected depth branches (see Sec. 3.2). Finally, Structured-light (L) incorporates an additional mask into the objective function, derived from information provided by our structured-light sensor. The structured-light signal is utilised only when the mask improves the projection loss. We explore alternative strategies to exploit the structured-light signal and discuss details on practical benefits (e.g. convergence speed) in the appendix.

Introduced signal components define our set of training experiments. For example, Stereo and Structured-Light (SL) trains the model using self-supervised stereo (S) and structured-light (L) information. Experiments therefore use differing subsets of the introduced loss terms (see Tab. 3).

Qualitative results are shown in Fig. 6. Unsurprisingly, self-supervised stereo (S) is relatively blurry and struggles to capture fine details, such as the thin, metallic arch on the Facades sample, or the furniture in the Kitchen.
Figure 5. Our multi-modal camera rig (see Sec. 4.1).
Table 1. CroMo comparison and dataset statistics.
| Models trained with Stereo (S) input | MP | GMACs | Sq Rel | RMSE | RMSE Log | d<1.25 | d<1.25^2 | d<1.25^3 |
| ResNet18 architecture [22] | 14.36 | 20.17 | 1.7928 | 2.1982 | 0.3596 | 0.5061 | 0.7026 | 0.8009 |
| ResNet50 architecture [22] | 32.55 | 39.62 | 1.5037 | 2.0642 | 0.3383 | 0.5324 | 0.7262 | 0.8160 |
| p2d [11] (ResNet50 - Stokes) | 32.55 | 39.62 | 1.5938 | 2.1291 | 0.3884 | 0.4565 | 0.6632 | 0.7775 |
| MiDaS architecture [50] | 104.21 | 207.86 | 1.4021 | 1.9985 | 0.3252 | 0.5409 | 0.7901 | 0.8281 |
| Our architecture | 74.40 | 97.39 | 1.3031 | 1.8889 | 0.3233 | 0.5533 | 0.7301 | 0.8213 |

Table 2. Architectural comparisons under consistent modality sensor input; Stereo (S). Our proposed architecture improves quantitative results across the majority of metrics whilst remaining competitive in terms of compute and space requirements.
| Image sensors | Training strategy | L_stereo | L_DF | L_corr | L_struct | L_corr->pol | Sq Rel | RMSE | RMSE Log | d<1.25 | d<1.25^2 | d<1.25^3 |
| 2 | Stereo (S) w/o DF sampling | x |  |  |  |  | 1.5037 | 2.0642 | 0.3383 | 0.5324 | 0.7262 | 0.8160 |
| 2 | Stereo (S) | x | x |  |  |  | 1.3031 | 1.8889 | 0.3233 | 0.5533 | 0.7301 | 0.8213 |
| 3 | Stereo and i-ToF (ST) | x | x | x |  | x | 1.2829 | 1.8573 | 0.3202 | 0.5541 | 0.7308 | 0.9062 |
| 3 | Stereo and Structured-Light (SL) | x | x |  | x |  | 1.1233 | 1.7510 | 0.3168 | 0.5529 | 0.7370 | 0.9251 |
| 4 | Stereo, i-ToF, Structured-Light (STL) | x | x | x | x | x | 1.0699 | 1.6070 | 0.2891 | 0.6512 | 0.7882 | 0.9266 |
| 4 | STL+Temporal (STLM) | x | x | x | x | x | 1.0031 | 1.4889 | 0.2527 | 0.7061 | 0.8066 | 0.9246 |

Table 3. Model training strategies that differ in terms of available image sensor signals (utilised loss components, marked x). Sec. 3 and 4 provide details on loss function components and image sensors, respectively. In spite of having access to only a single, consistent modality during inference, the model benefits from visibility of additional training signals.
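For completeness, the error and accuracy columns of Tab. 2 and Tab. 3 follow the standard monocular depth evaluation protocol popularised by [22]; a sketch assuming pre-masked, metric depth arrays:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Sq Rel, RMSE, RMSE log and delta-threshold accuracies, as in [22]."""
    ratio = np.maximum(gt / pred, pred / gt)
    d1, d2, d3 = [(ratio < 1.25 ** k).mean() for k in (1, 2, 3)]
    sq_rel = (((gt - pred) ** 2) / gt).mean()
    rmse = np.sqrt(((gt - pred) ** 2).mean())
    rmse_log = np.sqrt(((np.log(gt) - np.log(pred)) ** 2).mean())
    return dict(sq_rel=sq_rel, rmse=rmse, rmse_log=rmse_log,
                d1=d1, d2=d2, d3=d3)
```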
Addition of i-ToF and structured-light modalities, exclusively at training time, results in (ST), (SL) and (STL), which can be observed to improve respective depth quality. Finally, (STLM) adds our temporal modality and improves detail recovery (e.g. metallic arch and fence). Qualitative results can be observed to corroborate our hypothesis; inclusion of additional modalities at training time affords the model multiple complementary depth cues that can qualitatively improve depth inference. Our experimental work highlights the nature of valuable investigation possible with our unique CroMo dataset.

Quantitative results are reported in Tab. 3. We follow [22], reporting standard evaluation metrics, with focus on the RMSE in our following experiments. Best performance is obtained when all sensors are used together (STLM), while self-supervised stereo (S) with only polarisation images performs worst. When additional modalities are added to self-supervision (S), i.e. i-ToF (ST) or structured light (SL), performance improves in both cases, with larger gains coming from the addition of the latter. We conjecture that structured light information helps more due to the nature of our dataset and the current distribution of image content therein, i.e. ~85% outdoor imagery, where i-ToF sensors are impaired by ambient light. Combining the i-ToF and structured-light sensors (STL) further improves. The best depth prediction utilises the temporal component (STLM). RMSE in Tab. 3 displays a clear trend; the availability of additional sensor cues at training time improves monocular inference.

Network architecture: We next investigate the effect of network architecture on monocular estimation performance. Of note, we highlight that employing a larger capacity network is not the only way to improve prediction performance. We use our self-supervised stereo (S), i.e. baseline modality, training strategy for all experiments that follow in this section. This strategy provided weakest performance in our previous investigation of training modality choice (Tab. 3). For this reason, we consider it an appropriate candidate with which to evaluate improvements afforded by changes to network architecture.
Figure 6. Same ResNet50 architecture as in [22] with different modalities: each new modality closes the gap with GT.
Figure 7. Different architectures, same training strategy Stereo (S): our new architecture produces the sharpest depth predictions.
We report millions-of-parameters at inference (MP) and the giga multiply-accumulates per second (GMACs) in order to evaluate size and compute-cost per architecture. Architectures consist of the ResNet18 U-Net used in [22] and their supplementary material ResNet50 variant, the p2d architecture [11] using ResNet50 with a different data representation (Stokes), the MiDaS [50] architecture and Ours (see Sec. 3.1, Fig. 3a).

Qualitative results are shown in Fig. 7. It may be observed that the ResNet18 architecture with smallest (MP) fails to obtain good background detail of the swing frame structure (Station sample) or of the tree (Park sample). The ResNet50 variant slightly improves detail, especially with raw measurements instead of Stokes (p2d [11]). Even when increasing network capacity roughly three-fold with MiDaS, results are unsatisfying. Our proposed architecture (Ours) requires smaller capacity and computation for a sharper reconstruction of the swing and the tree. We disentangle the benefits of additional sensor modalities from our model contributions, highlighting the advantage of gated convolutions and our DF-based approach towards reducing blur.

Quantitative results are reported in Tab. 2. The smallest architecture ResNet18 [22] performs worst. The larger U-Net ResNet50 performs better, and has been generally adopted [11, 21]. Note p2d [11] uses a different data representation (Stokes) for polarisation cf. ResNet50; performance decreases. We believe the Stokes representation, using angle directly, is more sensitive to noise and not appropriate for an SSIM loss with the self-supervised stereo (S) training strategy. MiDaS [50] provides second best performance and yet necessitates roughly x2 GMACs. Our architecture provides best performance while remaining relatively compact, which we largely attribute to gated convolutions and displacement field estimation (see Sec. 3.1).

6. Conclusion

We systematically investigate the effect of using additional information from co-modal sensors at training time, for the task of monocular depth estimation from polarisation imagery. Our exploration is enabled through a unique multi-modal video dataset which constitutes synchronized images from binocular polarisation, raw i-ToF and structured-light depth. We quantify the beneficial influence of both passive and active sensors, leveraging self-supervised and cross-modal learning strategies that lead to the proposal of a new method providing sharper and more accurate depth estimation. This is made possible through two physical models that describe the relationships between polarisation and surface normals on one side, and correlation measures and scene depth on the other. We believe that our fundamental investigation of modality combination and the CroMo dataset can accelerate research of both spatial and temporal fusion, towards advancing cross-modal computer vision.
References

[1] Intel RealSense depth camera D435i. https://www.intelrealsense.com/depth-camera-d435i/. Accessed: 2021-11-22. 3
[2] Supreeth Achar, Joseph R Bartels, William L'Red' Whittaker, Kiriakos N Kutulakos, and Srinivasa G Narasimhan. Epipolar time-of-flight imaging. ACM Transactions on Graphics (ToG), 36(4):1-8, 2017. 3
[3] Gianluca Agresti, Ludovico Minto, Giulio Marin, and Pietro Zanuttigh. Deep learning for confidence information in stereo and ToF data fusion. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017. 3
[4] Gianluca Agresti, Henrik Schaefer, Piergiorgio Sartor, and Pietro Zanuttigh. Unsupervised domain adaptation for ToF data denoising with adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 3, 7
[5] Gianluca Agresti and Pietro Zanuttigh. Deep learning for multi-path error removal in ToF sensors. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0-0, 2018. 3
[6] Gary A Atkinson. Polarisation photometric stereo. Computer Vision and Image Understanding, 160:158-167, 2017. 3
[7] Gary A Atkinson and Edwin R Hancock. Multi-view surface reconstruction using polarization. In Tenth IEEE International Conference on Computer Vision (ICCV'05), volume 1, pages 309-316. IEEE, 2005. 3
[8] Gary A Atkinson and Edwin R Hancock. Recovery of surface orientation from diffuse polarization. IEEE Transactions on Image Processing, 15(6):1653-1664, 2006. 1, 2
[9] Yunhao Ba, Rui Chen, Yiqin Wang, Lei Yan, Boxin Shi, and Achuta Kadambi. Physics-based neural networks for shape from polarization. arXiv preprint arXiv:1903.10210, 2019. 2
[10] Yunhao Ba, Alex Gilbert, Franklin Wang, Jinfa Yang, Rui Chen, Yiqin Wang, Lei Yan, Boxin Shi, and Achuta Kadambi. Deep shape from polarization. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020, pages 554-571, Cham, 2020. Springer International Publishing. 7
[11] Marc Blanchon, Desire Sidibe, Olivier Morel, Ralph Seulin, Daniel Braun, and Fabrice Meriaudeau. P2D: a self-supervised method for depth estimation from polarimetry. In 25th International Conference on Pattern Recognition (ICPR 2020), 2020. 1, 2, 7, 8
[12] Zhaopeng Cui, Jinwei Gu, Boxin Shi, Ping Tan, and Jan Kautz. Polarimetric multi-view stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1558-1567, 2017. 3
[13] Rui Dai, Srijan Das, and Francois Bremond. Learning an augmented RGB representation with cross-modal knowledge distillation for action detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13053-13064, October 2021. 4
[14] Adrian A Dorrington, John P Godbaz, Michael J Cree, Andrew D Payne, and Lee V Streeter. Separating true range measurements from multi-path and scattering interference in [...], Interaction, and Measurement, volume 7864, page 786404. International Society for Optics and Photonics, 2011. 3
[15] Sergi Foix, Guillem Alenya, and Carme Torras. Lock-in time-of-flight (ToF) cameras: A survey. IEEE Sensors Journal, 11(9):1917-1926, 2011. 3
[16] Daniel Freedman, Yoni Smolin, Eyal Krupka, Ido Leichter, and Mirko Schmidt. SRA: Fast removal of general multipath for ToF sensors. In European Conference on Computer Vision, pages 234-249. Springer, 2014. 3
[17] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002-2011, 2018. 2
[18] N Missael Garcia, Ignacio De Erausquin, Christopher Edmiston, and Viktor Gruev. Surface normal reconstruction using circularly polarized light. Optics Express, 23(11):14391-14406, 2015. 3
[19] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue, 2016. 2
[20] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354-3361. IEEE, 2012. 2
[21] Clement Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017. 2, 8
[22] Clement Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3828-3838, 2019. 1, 2, 3, 5, 6, 7, 8
[23] John P Godbaz, Michael J Cree, and Adrian A Dorrington. Closed-form inverses for the mixed pixel/multipath interference problem in AMCW lidar. In Computational Imaging X, volume 8296, page 829618. International Society for Optics and Photonics, 2012. 3
[24] Qi Guo, Iuri Frosio, Orazio Gallo, Todd Zickler, and Jan Kautz. Tackling 3D ToF artifacts through learning and the FLAT dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 368-383, 2018. 3, 7
[25] Xiaoyang Guo, Hongsheng Li, Shuai Yi, Jimmy Ren, and Xiaogang Wang. Learning monocular depth by distilling cross-domain stereo networks, 2018. 2
[26] Mohit Gupta, Shree K Nayar, Matthias B Hullin, and Jaime Martin. Phasor imaging: A generalization of correlation-based time-of-flight imaging. ACM Transactions on Graphics (ToG), 34(5):1-18, 2015. 3
[27] Miles Hansard, Seungkyu Lee, Ouk Choi, and Radu Patrice Horaud. Time-of-flight cameras: principles, methods and applications. Springer Science & Business Media, 2012. 2
[28] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
[29] Heiko Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 807-814. IEEE, 2005. 6
[30] Radu Horaud, Miles Hansard, Georgios Evangelidis, and Clement Menier. An overview of depth cameras and range scanners based on time-of-flight technologies. Machine Vision and Applications, 27(7):1005-1020, Jun 2016. 5
[31] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28, pages 2017-2025. Curran Associates, Inc., 2015. 5
[32] Achuta Kadambi, Vage Taamazyan, Boxin Shi, and Ramesh Raskar. Polarized 3D: High-quality depth sensing with polarization cues. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
[41] Fangchang Ma and Sertac Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1-8. IEEE, 2018. 3
[42] Nikolaus Mayer, Eddy Ilg, Philipp Fischer, Caner Hazirbas, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. What makes good synthetic training data for learning disparity and optical flow estimation? International Journal of Computer Vision, 126(9):942-960, 2018. 2
[43] Daisuke Miyazaki, Takuya Shigetomi, Masashi Baba, Ryo Furukawa, Shinsaku Hiura, and Naoki Asada. Surface normal estimation of black specular objects from multiview polarization images. Optical Engineering, 56(4):041303, 2016. 3
[44] Nikhil Naik, Achuta Kadambi, Christoph Rhemann, Shahram Izadi, Ramesh Raskar, and Sing Bing Kang. A light transport model for mitigating multipath interference in time-of-flight sensors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 73-81, 2015. 3
[54] Kilho Son, Ming-Yu Liu, and Yuichi Taguchi. Learning to remove multipath distortions in time-of-flight range images for a robotic arm setup. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3390-3397. IEEE, 2016. 3
[55] Jaime Spencer, Richard Bowden, and Simon Hadfield. DeFeat-Net: General monocular depth via simultaneous unsupervised representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14402-14413, 2020. 2
[56] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012. 7
[57] Shuochen Su, Felix Heide, Gordon Wetzstein, and Wolfgang Heidrich. Deep end-to-end time-of-flight imaging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6383-6392, 2018. 3
[58] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant CNNs. In Proceedings of the International Conference on 3D Vision (3DV), pages 11-20. IEEE, 2017. 3
[59] Jamie Watson, Michael Firman, Gabriel J. Brostow, and Daniyar Turmukhambetov. Self-supervised monocular depth hints. In The International Conference on Computer Vision (ICCV), October 2019. 3, 5
[60] Refael Whyte, Lee Streeter, Michael J Cree, and Adrian A Dorrington. Review of methods for resolving multi-path interference in time-of-flight range cameras. In SENSORS, 2014 IEEE, pages 629-632. IEEE, 2014. 3
[61] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks, 2016. 2
[62] Tomonari Yoshida, Vladislav Golyanik, Oliver Wasenmuller, and Didier Stricker. Improving time-of-flight sensor for specular surfaces with shape from polarization. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 1558-1562. IEEE, 2018. 3
[63] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas Huang. Free-form image inpainting with gated convolution, 2019. 2, 3
[64] Ye Yu, Dizhong Zhu, and William AP Smith. Shape-from-polarisation: a nonlinear least squares approach. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 2969-2976, 2017. 2
[65] Pietro Zanuttigh, Giulio Marin, Carlo Dal Mutto, Fabio Dominio, Ludovico Minto, and Guido Maria Cortelazzo. Time-of-flight and structured light depth cameras. Technology and Applications, pages 978-3, 2016. 2
[66] Dizhong Zhu and William AP Smith. Depth from a polarisation + RGB stereo pair. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7586-7595, 2019. 3, 4, 5, 7