
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
CroMo: Cross-Modal Learning for Monocular Depth Estimation

Yannick Verdie1, Jifei Song1, Barnabé Mas1,2, Benjamin Busam1,3, Ales Leonardis1, Steven McDonagh1
1 Huawei Noah's Ark Lab   2 École Polytechnique   3 Technical University of Munich
{yannick.verdie, jifei.song, ales.leonardis, steven.mcdonagh}@huawei.com

barnabe.mas@polytechnique.edu   b.busam@tum.de

Abstract

Learning-based depth estimation has witnessed recent progress in multiple directions; from self-supervision using monocular video to supervised methods offering highest accuracy. Complementary to supervision, further boosts to performance and robustness are gained by combining information from multiple signals. In this paper we systematically investigate key trade-offs associated with sensor and modality design choices as well as related model training strategies. Our study leads us to a new method, capable of connecting modality-specific advantages from polarisation, Time-of-Flight and structured-light inputs. We propose a novel pipeline capable of estimating depth from monocular polarisation for which we evaluate various training signals. The inversion of differentiable analytic models thereby connects scene geometry with polarisation and ToF signals and enables self-supervised and cross-modal learning.

In the absence of existing multimodal datasets, we examine our approach with a custom-made multi-modal camera rig and collect CroMo; the first dataset to consist of synchronized stereo polarisation, indirect ToF and structured-light depth, captured at video rates. Extensive experiments on challenging video scenes confirm both qualitative and quantitative pipeline advantages where we are able to outperform competitive monocular depth estimation methods.

Figure 1. Top row: polarisation input signal (Pol.) visualised as Angle and Intensity. Additionally, Time-of-Flight Amplitude (i-ToF) and structured-light sensor co-modalities, exploitable during training. Bottom row: monocular depth estimation, using the Pol. input. Uni-modal model training of p2d [11] and the monodepth2 architecture [22], compared with cross-modal training (Ours). GT Label shown for reference.

1. Introduction

Modern vision sensors are able to leverage a variety of light properties for optical sensing. Common RGB sensors, for instance, use colour filter arrays (CFA) over a pixel sensor grid to separate incoming radiation into specified wavebands. This allows a photosensor to detect wavelength-separated light intensity and enables the acquisition of familiar visible spectrum images. Wavelength is however only one property of light capable of providing information. Light polarisation defines another property and describes the oscillation direction of an electromagnetic wave. While the majority of natural light sources (e.g. the sun) emit unpolarised light, consisting of a mixture of oriented oscillations, surface reflection from non-metallic objects can linearly polarise the light. Such polarised light then contains surface structure information, retrievable using analytic physical models [8]. This information can be used to harness the depth cues offered by this light property. Polarimetric imagery is a passive example for depth estimation. Passive sensors have acceptable resolution and dense depth however there exist well understood capture situations that prove challenging (e.g. textureless surface regions).

However further known properties of light (i.e. speed) provide yet more information. Indirect Time-of-Flight (i-ToF) cameras are active light sensors and use a pulsed, near infrared light source to measure object and surface distance. Further active sensors use structured-light and these emit known infrared patterns and use stereoscopic imaging to measure the distance to the surface. While i-ToF and structured-light cameras have clear advantages, such as the ability to function in low-light scenarios and good short range precision, they are susceptible to specular reflections, ambient light and range remains limited.

We argue that novel combination of active and passive light sensors offers new possibilities. We can exploit such a combination to take advantage of the discussed, modality-specific strengths and weaknesses.

We observe that (1) differing visual modalities offer information cues about complementary aspects of the world and (2) there exist clear trade-offs between the complexity of capture sensor setups and the resulting data diversity and quality, accessible for supervision signals. This motivates us to systematically investigate these considerations and provide insight into training data capture design decisions and the related pay-offs. Our study results in the proposal of a framework capable of exploiting available supervision signals and is tailored to benefit from the particular strengths of unique modalities.

We instantiate our ideas by bringing together the physical understanding of Polarisation and i-ToF in a data driven fashion. In practice this affords an inference pipeline that estimates depth from a single polarisation image. We train a convolutional neural network (CNN), with cross-modal fusion using differentiable physical models. We establish a dataset comprising Ground Truth depth obtained via Multi-View Stereo (MVS) reconstruction [52] that enjoys access to information rich, full video sequences. We carry out extensive experimental work to establish the efficacy of our proposed monocular depth estimation strategies.

Our contributions can be summarised as:
1. Novel multi-modal method. We propose a multi-modal training approach that allows for monocular depth estimation from polarisation images. We propose (i) differentiable analytic formulae that define modal-specific loss terms, (ii) cross-modal consistency joint-training towards improved real-world depth estimation from a single polarisation image, (iii) architectural components that increase predicted depth sharpness (see Fig. 1).
2. CroMo dataset and training modalities study. We provide a systematic analysis of the benefits afforded when multiple image modalities are available at training time, for monocular depth estimation. Investigation and exposure of improvements are enabled by the unique Cross-Modality video dataset¹. Our multiple hardware-synchronized cameras capture, for the first time, stereo polarisation (Pol), indirect Time-of-Flight (i-ToF) and structured-light images from active sensing.

¹ Dataset is available at: https://cromo-data.github.io/

The remaining sections of the paper are thus organised: Sec. 2 provides brief review of depth estimation with respect to relevant modalities and previous work considering multiple information signals. Sec. 3 presents our model capable of monocular depth estimation from polarisation imagery and our cross-modal training procedure. In Sec. 4 we introduce CroMo, our novel multi-modal dataset, Sec. 5 reports experimental work validating our contributions and Sec. 6 provides discussion and future research issues.

2. Related Work

To the best of our knowledge this is the first work to study end-to-end monocular depth inference, utilising cross-modal information from Time-of-Flight (i-ToF), active stereo and polarisation modalities during training. We briefly review the literature most closely related to the main components of our investigation and proposed framework.

2.1. Monocular depth estimation

Estimating depth from a single image constitutes a hard, ill-posed problem. Pioneering work on supervised monocular depth estimation [42] used synthetic samples during training. Synthetic data was also previously used in conjunction with stereo network distillation [25] for this task. To improve accuracy and convergence speed, [17] introduce a spatially-increasing discretisation. However, acquiring ground truth depth data remains a difficult task [20].

To overcome the difficulty of collecting accurate ground truth signal, multiple works [19, 61] investigate a consistency loss by leveraging stereo imagery during training, towards self-supervision. While being undoubtedly path-breaking, the initial methods suffered from a non-differentiable sampling step. Godard et al. [21] formulated a fully-differentiable pipeline with left-right consistency checks during training and have also explored the temporal components [22], even in challenging setups such as night scenes [55]. These methods predict depth with RGB input, while we utilise polarisation images.

Monocular Polarisation. Previous work use monocular polarisation imagery to recover depth. One route to overcome Shape from Polarisation (SfP) ambiguities is to use orthographic camera models to express polarisation intensity in terms of depth [64]. Atkinson et al. [8] compute depth without knowing the light direction through a non-linear optimization framework and yet assume fully diffuse surfaces. Linear systems have also been constructed for the task [53] by adding shape from shading equations. While theoretically interesting, the orthographic assumption has restricted their application to synthetic lab environments.

Learning based Polarisation. Due to lack of reasonably-sized datasets, only a limited number of works focus on learning with polarisation. Ba et al. [9] provide polarisation images together with a set of plausible inputs from a physical model to estimate surface normals. The work of [34] apply polarisation for instance segmentation of transparent objects and [37] learn de-glaring of images with semi-transparent objects. Recently, Blanchon et al. [11] extended the work of [22] with complementary polarimetric cues. In contrast to them, we invert a physical model to enable self-supervision through consistency cycles and additionally study the benefit of co-modal i-ToF information.

Learning based i-ToF. i-ToF sensors acquire distance information by estimating the time required for an emitted light pulse to be reflected [65]. Sensors measure either the time (direct) or the phase (indirect) difference between emitted and received light.

The modality enjoys high precision for short range distances [27], yet suffers from limited spatial resolution and noise [15], which constitute challenging factors for any learning-based approach. Obtaining reliable signals from specular surfaces is difficult and inherent Multi-Path Interference (MPI) often manifests as noisy measurements and artifacts. Synthetic training is also explored for raw i-ToF input data in end-to-end fashion [5, 24, 57]. However, the ability to account for real world domain shifts is limited. In [4] a GAN is employed towards addressing such domain adaptation issues on a limited dataset.

i-ToF depth improvement. MPI can be considered a critical issue and error mitigation has been the focus of a body of work [60]. Two-path approximations [23] have been used within optimization schemes [14, 35] and multiple frequencies are used to constrain the problem [16]. Kadambi et al. [33] propose a hardware solution to address scenes with translucent objects and a number of scholars incorporate light transport information to correct for MPI [2, 26, 44, 46].

2.2. Depth with multiple sensors

Depth completion has been carried out via combining multiple input modalities, for example, a sparse but accurate LiDAR signal in combination with RGB [58]. It is difficult to address sparse signals with CNNs [41] and LiDAR sensors can produce problematic artifacts resulting in unreliable Ground Truth depth estimates [39]. One strategy towards removing dependence on this form of supervision are self-supervision cues, however these fall behind supervised pipelines in terms of accuracy [40].

i-ToF and X. Confidence-based combination of i-ToF depth and classical RGB stereo is explored with the network architecture of [3] and a semi-supervised approach for this combination is explored by [47] in a generic framework. While these approaches improve upon the individual depth estimates, they rely on a late fusion paradigm. Son et al. [54] use a robotic arm to collect 540 real training images of short range scenes with structured light ground truth.

By inserting micro linear polarizers [45] in front of a photo-receiver chip, Yoshida et al. [62] build an i-ToF sensor capable of acquiring both i-ToF depth and polarisation scene cues. Combination of both the absolute depth (i-ToF) and relative shape (polarisation cues) allowed reconstruction of depth for specular surfaces. While this pipeline requires i-ToF and polarisation input to solve an optimization problem, we alternatively explore cross-modal self-supervised training and single image inference.

Depth from multi-view Polarisation. Another route to predict depth is the use of more than one polarisation image [7] which enables methods based on physical models. An RGB+Polarisation pair can provide sharp depth maps with stereo vision [66]. Other methods [12] use more than two polarisation images. Despite the sharpness of the results, the difficulty to acquire multi-view polarisation images is still a major hurdle. Atkinson et al. [6] combine polarisation methods with photometric stereo. Two images of a scene, from an identical view point yet with different light exposures, are leveraged. An extension dealing with mixed reflectivity is established via a combined photometric-polarisation linear system in [38] and Garcia et al. [18] solve for polarisation normals using circularly polarised light. Traditional multi-view methods also benefit from polarisation. Miyazaki et al. [43] recover surfaces of black objects using polarisation physics and space carving.

Depth refinement with Polarisation. Consumer depth estimation tools progress significantly in recent years, however their predictions are noisy and lack details. Using polarisation cues, [32] enhance sharper depth maps from RGBD cameras by differentiating their depth maps to resolve polarisation ambiguities and perform mutual optimization.

Despite clear improvements in monocular depth estimation methods, their performance remains bounded by the chosen modality, hence calling for multi-modal depth estimation. Our method alleviates this problem with a learning based approach. During training we leverage complementary modalities such that our model can compensate the drawbacks of the single modality used at inference time.

3. Method

Our multi-modal monocular depth investigation leads to a new model architecture that accounts explicitly for prediction blur and introduces two novel analytic losses. We discuss these components in the following sections.

3.1. Architecture

Our architecture employs multiple encoder-decoder networks illustrated in Fig. 3a. We observe that monocular depth estimation methods often incur blurry image predictions and we address this problem by introducing architectural components that account for prediction blur. Firstly, convolutions in our encoders are coupled with gated convolution. Our network then composes a traditional U-Net [51] with skip connections and the gated convolutions [63]. The encoder utilises a ResNet [28] style block, while the decoder is a cascade of convolutions with layer resizing.

Secondly, drawing on the fact that Displacement Fields (DF) can be utilised to aid sharpness [49], we estimate a DF using a self-supervised sharpening decoder. Depth pixels with strong local disparity have values redefined to mirror a nearest neighbour that does not exhibit strong local disparity. Groundtruth (GT) displacement fields can thus be defined for each predicted depth during training ("on-the-fly"), guiding our dedicated displacement field prediction. We inspect predicted depth with and without our DF strategy and observe significantly improved sharpness, most evident when employing 3D visualisations (Fig. 4).
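To make these two architectural components concrete, the following PyTorch sketch illustrates a gated convolution block in the spirit of [63] and a displacement-field resampling step in the spirit of [49]. It is an illustrative sketch under our own naming (GatedConv2d, apply_displacement_field); it is not the authors' released implementation, and the exact layer configuration of the CroMo encoder-decoder is assumed rather than known.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConv2d(nn.Module):
    # Gated convolution in the spirit of Yu et al. [63]: a sigmoid gate,
    # predicted by a parallel convolution, modulates the feature response.
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)

    def forward(self, x):
        return F.elu(self.feature(x)) * torch.sigmoid(self.gate(x))

def apply_displacement_field(depth, df):
    # Resample a depth map with a 2-channel field of pixel offsets so that
    # pixels near strong depth discontinuities copy a nearby, less ambiguous
    # value, following the displacement-field idea of [49].
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(depth.device)      # (2, H, W)
    coords = base.unsqueeze(0) + df                                    # displaced sampling positions
    grid = torch.stack((2.0 * coords[:, 0] / (w - 1) - 1.0,            # normalise to [-1, 1]
                        2.0 * coords[:, 1] / (h - 1) - 1.0), dim=-1)   # (B, H, W, 2), (x, y) order
    return F.grid_sample(depth, grid, mode="nearest", align_corners=True)

In this reading, the sigmoid gate lets the network suppress unreliable activations near depth discontinuities, while the nearest-neighbour resampling moves "flying" depth pixels onto a more confident neighbour.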
3.2. Loss Formulation

Our study considers multiple modalities and various sensor configurations at training time. We explore several loss terms to exploit our unique setting (see Tab. 3).

[Figure 3: (a) our network architecture; (b) our training procedure and introduced losses. Diagram labels: N: Network, A: Analytic, P: Projection; correlation (i-ToF) input, auxiliary depth, loss L_corr; components used at training time versus inference time are indicated.]
Figure 3. Our full model with modality specific losses L_corr, L_corr→pol and L_stereo (see Sec. 3.2 for further details).

Figure 4. Effect of the Displacement Fields (DF). Left top (bottom): predicted depth (point cloud) without DF. Center top: polarisation intensity. Right top (bottom): predicted depth (point cloud) with DF. Flying pixels, visible in 3D, are clearly reduced.

Loss terms in our training procedure are enabled through both coordinate frame projections (P1, P2) and analytic transforms (A1, A2) of individual network (N1, N2) outputs (see Figs. 3a, 3b). We firstly process input modalities individually using distinct networks. These ingest i-ToF correlation and left polarisation images respectively and output initial depth maps. We propose two analytic losses, derived from properties of i-ToF and polarisation, to train and link the networks. We train the i-ToF module without ground truth and also leverage the available multi-modal information through image recovery via related analytical formulae (A1, A2). Strategically similar to previous work [13], at inference time we require only a single modality (in our case polarisation), and can discard network N1 completely.

Terms L_corr→pol and L_corr evaluate discrepancies between each input image and respective recovered images, obtained using auxiliary and final depth maps (Fig. 3b). Our individual branches share information through the loss term L_corr→pol. Explicitly, we recover a polarisation image from an auxiliary depth map and then project this, using projection P1, to the polarisation sensor frame of reference via the final depth map D_pol. Finally, our third loss term L_stereo is used to train the polarisation network (N2) by comparing the right polarisation image, projected using the predicted depth D_pol, with the left polarisation image. We next provide details of our analytical formulae for image recovery and the loss terms that enable our training procedure.

Depth to polarisation (A2). Polarisation cameras capture polarised intensity along directions φ_pol. The measured intensity is given by [66]

    i_{\varphi_{pol}} = i_{un} \cdot (1 + \rho \cos(2\varphi_{pol} - 2\phi)), \quad \varphi_{pol} \in \{0, \tfrac{\pi}{4}, \tfrac{\pi}{2}, \tfrac{3\pi}{4}\}    (1)

where φ_pol is the polariser angle, i_un is the intensity of unpolarised light, ρ is the degree of linear polarisation and φ is the Angle of Polarisation (AoP). The polarisation parameters ρ ∈ {ρ_s, ρ_d} and φ ∈ {φ_s, φ_d} depend on the local reflection type, either diffuse (d) or specular (s), as follows:

    \rho_d = \frac{(\eta - 1/\eta)^2 \sin^2\theta}{2 + 2\eta^2 - (\eta + 1/\eta)^2 \sin^2\theta + 4\cos\theta\sqrt{\eta^2 - \sin^2\theta}}, \qquad \rho_s = \frac{2\sin^2\theta\,\cos\theta\,\sqrt{\eta^2 - \sin^2\theta}}{\eta^2 - \sin^2\theta - \eta^2\sin^2\theta + 2\sin^4\theta}    (2)

with θ ∈ [0, π/2] the viewing angle and η the object refractive index, typically 1.5, and

    \phi = \alpha \;[\pi] \text{ if the pixel is diffuse}, \qquad \phi = \alpha + \tfrac{\pi}{2} \;[\pi] \text{ if the pixel is specular}    (3)

The π-ambiguity is denoted as [π] in (3), and α denotes the azimuth angle of the surface normal n. Azimuth angle α and viewing angle θ are obtained as

    \cos(\theta) = \frac{n \cdot v}{\|n\|\,\|v\|} \quad \text{and} \quad \tan(\alpha) = \frac{n_y}{n_x},    (4)

with v the viewing vector defined as the vector pointing toward the camera center from the 3D point P(x, y) corresponding to pixel (x, y) with depth d(x, y), and n the outward pointing normal vector, defined as the cross product of the partial derivatives with respect to x and y [66]:

    n = \begin{bmatrix} -f_y\,\partial_x d(x, y) \\ -f_x\,\partial_y d(x, y) \\ (x - c_x)\,\partial_x d(x, y) + (y - c_y)\,\partial_y d(x, y) + d(x, y) \end{bmatrix}    (5)

with f_x, f_y, c_x, c_y the camera intrinsics. Hence, from a given depth map d, one can compute the azimuth angle α and the viewing angle θ using (4) and (5), followed by the polarisation parameters ρ and φ with (2) and (3). The polarisation images for diffuse and specular surfaces, Î_pol^diffuse and Î_pol^specular, are finally recovered using the calculated polarisation parameters with (1).
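The depth-to-polarisation chain (5) → (4) → (2), (3) → (1) can be illustrated with a short NumPy sketch. Two caveats: the function names are ours, and the degree-of-linear-polarisation expressions used for rho_d and rho_s are the standard Fresnel-based forms from the shape-from-polarisation literature [8, 66], assumed here rather than copied from the paper's code.

import numpy as np

def normals_from_depth(d, fx, fy, cx, cy):
    # Outward normals from a depth map d(x, y) via Eq. (5), then normalised.
    h, w = d.shape
    dx = np.gradient(d, axis=1)                       # partial derivative w.r.t. x
    dy = np.gradient(d, axis=0)                       # partial derivative w.r.t. y
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    n = np.stack([-fy * dx, -fx * dy,
                  (xs - cx) * dx + (ys - cy) * dy + d], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def polarisation_from_depth(d, fx, fy, cx, cy, eta=1.5, i_un=0.5, specular=False):
    # Recover the four polarisation intensities of Eq. (1) from depth;
    # i_un is a placeholder unpolarised intensity (assumed constant here).
    n = normals_from_depth(d, fx, fy, cx, cy)
    h, w = d.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    p = np.stack([(xs - cx) * d / fx, (ys - cy) * d / fy, d], axis=-1)   # 3D point P(x, y)
    v = -p / np.linalg.norm(p, axis=-1, keepdims=True)                   # viewing vector, toward camera
    cos_t = np.clip(np.sum(n * v, axis=-1), 0.0, 1.0)                    # cos(theta), Eq. (4)
    sin2 = 1.0 - cos_t ** 2
    alpha = np.arctan2(n[..., 1], n[..., 0])                             # azimuth, Eq. (4)
    if specular:                                                         # Eq. (2)/(3), specular case
        rho = (2 * sin2 * cos_t * np.sqrt(eta**2 - sin2)) / \
              (eta**2 - sin2 - eta**2 * sin2 + 2 * sin2**2 + 1e-8)
        phi = alpha + np.pi / 2
    else:                                                                # diffuse case
        rho = ((eta - 1/eta)**2 * sin2) / \
              (2 + 2*eta**2 - (eta + 1/eta)**2 * sin2 + 4*cos_t*np.sqrt(eta**2 - sin2) + 1e-8)
        phi = alpha
    angles = np.array([0.0, np.pi/4, np.pi/2, 3*np.pi/4])                # polariser angles, Eq. (1)
    return np.stack([i_un * (1 + rho * np.cos(2*a - 2*phi)) for a in angles], axis=0)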

Depth to correlation (A1). Indirect ToF measures the correlation between a known emitted signal and the received signal. The emitted signal at frequency f_M is a sinusoid:

    g(t) = 2\cos(2\pi f_M t) + 1    (6)

and the signal, reflected by the scene, is of the form [30]

    f(t) = \alpha \cos(2\pi f_M t + 2\pi f_M \tau) + \beta    (7)

where τ is the time delay between the emitted signal g(t) and the reflected signal f(t). The i-ToF measurement c(x) is the correlation between the two signals:

    c(x) = \lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{T/2} f(t)\, g(t - x)\, dt = \alpha \cos(2\pi f_M x + 2\pi f_M \tau) + \beta    (8)

where we only consider the direct reflection signal and ignore the multipath interference (MPI) and sensor imperfections. We are interested in the phase φ, proportional to the depth d between the objects in the scene and the sensor:

    \varphi = 2\pi f_M \tau = \frac{4\pi f_M d}{c} \;[2\pi]    (9)

where c is the speed of light and [2π] represents the 2π-ambiguity. Using the four bucket strategy [36] to sample c(x) at four positions, where 2π f_M x ∈ {0, π/2, π, 3π/2}, four measurements {c(x_0), c(x_1), c(x_2), c(x_3)} can be obtained to recover the phase φ, the amplitude α and the intensity β:

    \varphi = \arctan\!\left(\frac{c(x_3) - c(x_1)}{c(x_0) - c(x_2)}\right)    (10)

    \alpha = \tfrac{1}{2}\sqrt{(c(x_3) - c(x_1))^2 + (c(x_0) - c(x_2))^2}    (11)

    \beta = \tfrac{1}{4}\,(c(x_0) + c(x_1) + c(x_2) + c(x_3))    (12)

Hence, from a given depth d, one can compute the phase φ using (9) and then reformulate the recovered i-ToF correlation using (10), (11) and (12) in turn to form Î_corr.
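A compact NumPy sketch of the depth/correlation relation follows; the four-bucket inversion (10)-(12) and the 25 MHz modulation frequency of our i-ToF camera are used, but the code itself (names, toy values) is an illustrative assumption, not the paper's implementation.

import numpy as np

C_LIGHT = 299_792_458.0                                   # speed of light [m/s]

def correlation_from_depth(depth, f_mod=25e6, alpha=1.0, beta=0.5):
    # Forward model, Eqs. (6)-(9): phase from depth, then the four
    # correlation samples c(x_k) at 2*pi*f_M*x in {0, pi/2, pi, 3*pi/2}.
    phase = (4.0 * np.pi * f_mod * depth / C_LIGHT) % (2.0 * np.pi)
    xk = np.array([0.0, np.pi / 2, np.pi, 3 * np.pi / 2])
    return np.stack([alpha * np.cos(x + phase) + beta for x in xk], axis=0)

def phase_amplitude_intensity(c):
    # Four-bucket inversion, Eqs. (10)-(12).
    phase = np.arctan2(c[3] - c[1], c[0] - c[2]) % (2.0 * np.pi)
    amplitude = 0.5 * np.sqrt((c[3] - c[1]) ** 2 + (c[0] - c[2]) ** 2)
    intensity = 0.25 * (c[0] + c[1] + c[2] + c[3])
    return phase, amplitude, intensity

depth = np.full((4, 4), 2.0)                              # toy 2 m plane
phase, amp, inten = phase_amplitude_intensity(correlation_from_depth(depth))
depth_back = phase * C_LIGHT / (4.0 * np.pi * 25e6)       # unambiguous below ~6 m at 25 MHz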

Stereo loss L_stereo. This loss requires that left and right image pairs are accessible during training. While only the left image I_l is fed to the network, the right image I_r can guide the model towards generating valid depth, and vice versa. More formally, let K_l and K_r be camera matrices with intrinsic parameters for left and right images respectively, and D a depth map on the left reference frame. Let T_left→right denote the transformation that moves 3D points from the left coordinate system to the right. An image coordinate transformed from left coordinate p_l to the right image is

    p_{left \to right} = K_r \cdot T_{left \to right} \cdot D(p_l) \cdot K_l^{-1} \cdot p_l    (13)

A backward differentiable warping [31] is used to reproject an image onto the left view as I_right→left. We form a stereo loss L_stereo, and related mask loss L_mask similarly to [22], which aid network training and deal with occluded pixels respectively as

    L_{stereo} = E_{pe}\!\left(I_l,\; I_{right \to left}^{D_{pol}}\right)    (14)

    L_{mask} = E_{pe}\!\left(I_l,\; I_{right \to left}\right)    (15)

where the photometric error is similar to [22]:

    E_{pe}(I_a, I_b) = \frac{\lambda}{2}\,\bigl(1 - \mathrm{SSIM}(I_a, I_b)\bigr) + (1 - \lambda)\,\|I_a - I_b\|_1    (16)
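As a concrete reading of (13)-(16), the sketch below back-projects left pixels with a predicted depth, reprojects them into the right view, samples the right image with a differentiable backward warp [31], and scores the result with an SSIM+L1 photometric error in the style of [22]. Function names, the 3x3 averaging window and the 0.85 weighting are assumptions carried over from [22], not details confirmed by the paper.

import torch
import torch.nn.functional as F

def warp_right_to_left(img_right, depth_left, K_left, K_right, T_left_to_right):
    # Eq. (13): back-project left pixels with the predicted depth, move them
    # to the right camera frame and re-project; then sample I_r with a
    # differentiable backward warp [31]. All tensors assumed on one device.
    b, _, h, w = depth_left.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack((xs, ys, torch.ones_like(xs)), dim=0).float()
    pix = pix.view(1, 3, -1).to(depth_left.device)
    cam = torch.inverse(K_left) @ pix * depth_left.view(b, 1, -1)       # 3D points, left frame
    cam = torch.cat((cam, torch.ones(b, 1, h * w, device=cam.device)), dim=1)
    pix_r = K_right @ (T_left_to_right @ cam)[:, :3]
    pix_r = pix_r[:, :2] / (pix_r[:, 2:3] + 1e-7)
    grid = torch.stack((2 * pix_r[:, 0] / (w - 1) - 1,                  # normalise to [-1, 1]
                        2 * pix_r[:, 1] / (h - 1) - 1), dim=-1).view(b, h, w, 2)
    return F.grid_sample(img_right, grid, padding_mode="border", align_corners=True)

def ssim(x, y):
    # Simplified single-scale SSIM with a 3x3 averaging window, as in [22].
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sig_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sig_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sig_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sig_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sig_x + sig_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_error(i_target, i_warped, alpha=0.85):
    # Eq. (16)-style error: weighted SSIM plus L1, per pixel then averaged.
    return (alpha * ssim(i_target, i_warped)
            + (1 - alpha) * (i_target - i_warped).abs()).mean()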

Analytical losses L_corr and L_corr→pol. Depth D_corr is firstly inferred directly from i-ToF correlation input, and then two recovered images Î_corr and Î_pol are formed. Recovered images represent the 'ideal' input for each modality, i-ToF and polarisation respectively, conditioned on the inferred depth. Since Î_pol is generated from D_corr, we reproject it using D_pol to form a recovered final polarisation image Î_corr→pol,i^{D_pol}, i ∈ {diffuse, specular}. In each case, discrepancies between the recovered image and the true input image provide a strong indication of the quality of the generated depth. We use this signal to guide the network. Formally

    L_{corr} = E_{pe}\!\left(I_{corr},\; \hat{I}_{corr}\right)    (17)

    L_{corr \to pol} = \min_{i \in \{diffuse,\, specular\}} \left\{ E_{pe}\!\left(I_l,\; \hat{I}_{corr \to pol,\, i}^{D_{pol}}\right) \right\}    (18)

where I_l is the left polarisation image, Î_corr→pol^{D_pol} the recovered polarisation image aligned to I_l, I_corr the i-ToF correlation input, and Î_corr the recovered correlation image. We use a min operator for L_corr→pol to lift the problem of classifying a pixel as diffuse or specular by computing both possibilities and letting the network select the best solution.

Finally, following [59], we use an additional loss L_struct in the objective function, derived from structured-light information (see appendix for further detail). In summary,

depending on the input modalities available at training time, we can add or remove the introduced losses L_corr, L_corr→pol, L_stereo and L_struct as appropriate. We explicitly note that hyper-parameter tuning for balancing of these loss terms is not required, due to our formulation. Our total loss L can then be formulated as:

    L = \min_{i \in \{mask,\, stereo,\, corr \to pol,\, struct\}} \{L_i\} + L_{corr} + L_{DF}    (19)

where L_DF is the L2 norm between predicted and GT DF.
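A minimal sketch of how (19) can be assembled is given below, assuming the minimum is taken per pixel over whichever photometric error maps are available for a given sensor configuration (in the spirit of the minimum-reprojection idea of [22] and the depth hints of [59]); the dictionary-based interface is our own illustration, not the paper's code.

import torch
import torch.nn.functional as F

def total_loss(pe_maps, loss_corr, df_pred, df_gt):
    # Sketch of Eq. (19): a per-pixel minimum over the available photometric
    # error maps (mask / stereo / corr->pol / struct), plus the correlation
    # loss and the L2 displacement-field term.
    stacked = torch.stack(list(pe_maps.values()), dim=0)   # (n_terms, B, 1, H, W)
    loss_min = stacked.min(dim=0).values.mean()
    return loss_min + loss_corr + F.mse_loss(df_pred, df_gt)

# e.g. pe_maps = {"stereo": pe_stereo, "corr_to_pol": pe_c2p, "struct": pe_struct}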
4. Data

We next provide details on our custom camera rig (Sec. 4.1) and CroMo dataset (Sec. 4.2), comprising synchronised image sequences capturing multiple modalities, at video-rate across real-world indoor and outdoor scenes.

4.1. Camera capture rig

Our prototype custom-camera hardware rig is shown in Fig. 5. Our rig is constructed in order to capture synchronised data across multiple modalities including stereo polarisation, i-ToF correlation, structured-light depth and IMU. We rigidly mount two polarisation cameras (Lucid PHX050S-QC) providing a left-right stereo pair, an i-ToF camera (Lucid HLS003S-001) operating at 25 MHz and a camera (RealSense D435i) for active IR stereo capture. All devices are connected with a hardware synchronisation wire resulting in time-aligned video capture at a frame rate of 10 fps. The left polarisation camera is the lead camera which generates the genlock signal and defines the world reference frame. Accurate synchronisation was validated using a flash-light torch and was further confirmed by the respectable quality observed from stereo Block Matching results [29]. The focus of all sensors is set to infinity, the aperture to maximum, and the exposure is manually fixed at the beginning of each capture sequence. The calibration of all four cameras' extrinsics, intrinsics and distortion coefficients is done with a graph bundle-adjustment for improved multi-view calibration (see appendix for further details).

4.2. CroMo dataset

We collect a unique dataset comprising multi-modal captures such that each time point pertains to measurement of (1) Polarisation: raw stereo polarisation cameras produce 2448 x 2048 px stereo images. (2) i-ToF: 4 channel 640 x 480 px correlation images. (3) Depth: a structured-light capture of the scene resulting in a 848 x 480 px depth image. In addition to the three main sensors, IMU information is recorded to further enable future research directions. Our dataset consists of more than 20k frames, totalling >80k images of indoor and outdoor sequences in challenging conditions, with no constraints on minimum or maximum scene range. We group these sequences into four different scenes which we name: Kitchen, Station, Park and Facades. Despite the multitude of sensors, operating ranges are not unlimited and our data collection also does not cover all possible scenarios; we further discuss limitations in our appendix. We report statistics per captured scene in Tab. 1 (lower). These statistics characterise our scene captures and provide useful information, e.g., that the median scene depth differs greatly between indoor (Kitchen) and outdoor (Station, Park and Facades) scenes. This is a strong indicator for whether the i-ToF sensor will perform well. Tab. 1 (upper) provides a comparison with other depth datasets showing that CroMo is the first publicly available, modality rich dataset containing a large quantity of image data.

5. Experiments

Our experimental design evaluates (1) the effect of multiple modalities, accessible at training time, for monocular depth estimation and (2) the effect that changing network architecture has on depth quality, under consistent input signal. Our capture setup allows us to employ a standard MVS approach [52] on full temporal sequences of polarisation intensity frames (left camera), to serve as ground-truth depth for our experimental work. This expensive offline optimisation leverages accordances amongst all frames per sequence, affording high quality depth to evaluate our ideas.

Multi-modal training. We firstly evaluate combinations of training input signal by changing the number of sensors available to the model. We fix network encoder-decoder backbone components (i.e. ResNet50, analogous to [22]) and train models that leverage cues from a maximum of four sensors; left and right polarisation, i-ToF correlation and structured-light. We show predicted depth improvements, attainable by systematic addition of sensors, and quantify where best gains can be made. The training signal components used for our monocular depth estimation experiments are as follows: Temporal (M) extracts information from video sequences (3 frames), Stereo (S) uses stereo images, i-ToF (T) leverages i-ToF correlations via our two interconnected depth branches (see Sec. 3.2). Finally, Structured-light (L) incorporates an additional mask into the objective function, derived from information provided by our structured-light sensor. The structured-light signal is utilised only when the mask improves the projection loss. We explore alternative strategies to exploit the structured-light signal and discuss details on practical benefits (e.g. convergence speed) in the appendix.

Introduced signal components define our set of training experiments. For example, Stereo and Structured-Light (SL) trains the model using self-supervised stereo (S) and structured-light (L) information. Experiments therefore use differing subsets of the introduced loss terms (see Tab. 3).

Qualitative results are shown in Fig. 6. Unsurprisingly, self-supervised stereo (S) is relatively blurry and struggles to capture fine details, such as the thin, metallic arch on the Facades sample, or the furniture in the Kitchen.

Figure 5. Our multi-modal camera rig (see Sec. 4.1).

Table 1. CroMo comparison and dataset statistics.

Upper: comparison with other depth datasets (number of frames):
Sturm et al. [56]: >20k; Agresti et al. [4]: 113; Guo et al. [24]: 2000; Zhu and Smith [66]: 1; Qiu et al. [48]: 40; Ba et al. [10]: 300; Kadambi et al. [32]: 1; CroMo: >20k.

Lower: GT depth statistics (meters) per scene:

Scene     mean   var.   min   max    median   valid ratio   # of seqs.   # of frames
Kitchen   3.3    3.6    0.3   15.7   2.9      0.95          3            2859
Station   4.9    14.8   0.3   18.9   3.6      0.86          11           7400
Facades   4.0    8.4    0.3   17.8   3.3      0.86          7            7228
Park      6.1    23.7   0.3   19.7   4.4      0.82          10           5551
Total     4.7    13.6   0.3   18.3   3.6      0.86          31           23038
Models trained with Stereo (S) input   MP      GMACs    Sq Rel   RMSE     RMSE Log   d<1.25   d<1.25^2   d<1.25^3
ResNet18 architecture [22]             14.36   20.17    1.7928   2.1982   0.3596     0.5061   0.7026     0.8009
ResNet50 architecture [22]             32.55   39.62    1.5037   2.0642   0.3383     0.5324   0.7262     0.8160
p2d [11] (ResNet50 - Stokes)           32.55   39.62    1.5938   2.1291   0.3884     0.4565   0.6632     0.7775
MiDaS architecture [50]                104.21  207.86   1.4021   1.9985   0.3252     0.5409   0.7901     0.8281
Our architecture (Stereo (S) input)    74.40   97.39    1.3031   1.8889   0.3233     0.5533   0.7301     0.8213

Table 2. Architectural comparisons under consistent modality sensor input; Stereo (S). Our proposed architecture improves quantitative results across the majority of metrics whilst remaining competitive in terms of compute and space requirements.

Training strategy (utilised losses)                                       Sq Rel   RMSE     RMSE Log   d<1.25   d<1.25^2   d<1.25^3
Stereo (S) w/o DF sampling (L_stereo)                                     1.5037   2.0642   0.3383     0.5324   0.7262     0.8160
Stereo (S) (L_stereo, L_DF)                                               1.3031   1.8889   0.3233     0.5533   0.7301     0.8213
Stereo and Structured-Light (SL) (L_stereo, L_DF, L_struct)               1.2829   1.8573   0.3202     0.5541   0.7308     0.9062
Stereo and i-ToF (ST) (L_stereo, L_DF, L_corr, L_corr→pol)                1.1233   1.7510   0.3168     0.5529   0.7370     0.9251
Stereo, i-ToF, Structured-Light (STL) (all introduced losses)             1.0699   1.6070   0.2891     0.6512   0.7882     0.9266
STL+Temporal (STLM) (all introduced losses)                               1.0031   1.4889   0.2527     0.7061   0.8066     0.9246

Table 3. Model training strategies that differ in terms of available image sensor signals (utilised loss components). Sec. 3 and 4 provide details on loss function components and image sensors, respectively. In spite of having access to only a single, consistent modality during inference, the model benefits from visibility of additional training signals.
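For reference, the evaluation metrics reported in Tables 2 and 3 (following [22]) are the standard monocular depth metrics; a short NumPy sketch of their definitions is given below. This is the conventional definition set, not evaluation code released with the paper.

import numpy as np

def depth_metrics(gt, pred, mask=None):
    # Sq Rel, RMSE, RMSE log and delta-threshold accuracies as used in Tab. 2/3.
    if mask is None:
        mask = gt > 0                      # evaluate only on valid GT pixels
    gt, pred = gt[mask], pred[mask]
    thresh = np.maximum(gt / pred, pred / gt)
    return {
        "Sq Rel":   np.mean(((gt - pred) ** 2) / gt),
        "RMSE":     np.sqrt(np.mean((gt - pred) ** 2)),
        "RMSE log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "d<1.25":   np.mean(thresh < 1.25),
        "d<1.25^2": np.mean(thresh < 1.25 ** 2),
        "d<1.25^3": np.mean(thresh < 1.25 ** 3),
    }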
Addition of i-ToF and structured-light modalities, exclusively at training time, results in (ST), (SL), (STL) and can be observed to improve respective depth quality. Finally, (STLM) adds our temporal modality and improves detail recovery (e.g. metallic arch and fence). Qualitative results can be observed to corroborate our hypothesis; inclusion of additional modalities at training time affords the model multiple complementary depth cues that can qualitatively improve depth inference. Our experimental work highlights the nature of valuable investigation possible with our unique CroMo dataset.

Quantitative results are reported in Tab. 3. We follow [22], reporting standard evaluation metrics, with focus on the RMSE in our following experiments. Best performance is obtained when all sensors are used together (STLM) while self-supervised stereo (S) with only polarisation images performs worst. When additional modalities are added to self-supervision (S), i.e. i-ToF (ST) or structured light (SL), performance improves in both cases, with larger gains coming from the addition of the latter. We conjecture that structured light information helps more due to the nature of our dataset and current distribution of image content therein, i.e. ~85% outdoor imagery, where i-ToF sensors are impaired by ambient light. Combining the i-ToF and structured-light sensors (STL) further improves. The best depth prediction utilises the temporal component (STLM). RMSE in Tab. 3 displays a clear trend; the availability of additional sensor cues at training time improves monocular inference.

Network architecture. We next investigate the effect of network architecture on monocular estimation performance. Of note, we highlight that employing a larger capacity network is not the only way to improve prediction performance. We use our self-supervised stereo (S), i.e. baseline-modality, training strategy for all experiments that follow in this section. This strategy provided weakest performance in our previous investigation of training modality choice (Tab. 3). For this reason, we consider it an appropriate candidate with which to evaluate improvements afforded by changes to network architecture.

[Figure 6 panels: Park, Facades, Kitchen, Station. Figure 7 panels: Station sample, Park sample.]
Figure 6. Same ResNet50 architecture as in [22] with different modalities: each new modality closes the gap with GT.
Figure 7. Different architectures, same training strategy Stereo (S): our new architecture produces the sharpest depth predictions.

We report millions-of-parameters at inference (MP) and the giga-multiply-accumulates per second (GMACs) in order to evaluate size and compute-cost per architecture. Architectures consist of the ResNet18 U-Net used in [22] and their supplementary material ResNet50 variant, the p2d architecture [11] using ResNet50 with a different data representation (Stokes), the MiDaS [50] architecture and Ours (see Sec. 3.1, Fig. 3a).

Qualitative results are shown in Fig. 7. It may be observed that the ResNet18 architecture with smallest (MP) fails to obtain good background detail of the swing frame structure (Station sample) or of the tree (Park sample). The ResNet50 variant slightly improves detail, especially with raw measurements instead of Stokes (p2d [11]). Even when increasing network capacity c. three-fold with MiDaS, results are unsatisfying. Our proposed architecture (Ours) requires smaller capacity and computation for a sharper reconstruction of the swing and the tree. We disentangle the benefits of additional sensor modalities from our model contributions, highlighting the advantage of gated convolutions and our DF-based approach towards reducing blur.

Quantitative results are reported in Tab. 2. The smallest architecture, ResNet18 [22], performs worst. The larger U-Net ResNet50 performs better, and has been generally adopted [11, 21]. Note p2d [11] uses a different data representation (Stokes) for polarisation cf. ResNet50; performance decreases. We believe the Stokes representation, using angle directly, is more sensitive to noise and not appropriate for an SSIM loss with the self-supervised stereo (S) training strategy. MiDaS [50] provides second best performance and yet necessitates roughly x2 GMACs. Our architecture provides best performance while remaining relatively compact, which we largely attribute to gated convolutions and displacement field estimation (see Sec. 3.1).

6. Conclusion

We systematically investigate the effect of using additional information from co-modal sensors at training time, for the task of monocular depth estimation from polarisation imagery. Our exploration is enabled through a unique multi-modal video dataset which constitutes synchronized images from binocular polarisation, raw i-ToF and structured-light depth. We quantify the beneficial influence of both passive and active sensors, leveraging self-supervised and cross-modal learning strategies that lead to the proposal of a new method providing sharper and more accurate depth estimation. This is made possible through two physical models that describe the relationships between polarisation and surface normals on one side and correlation measures and scene depth on the other. We believe that our fundamental investigation of modality combination and the CroMo dataset can accelerate research of both spatial and temporal fusion, towards advancing cross-modal computer vision.

References

[1] Intel realsense depth camera d435i. https://www.intelrealsense.com/depth-camera-d435i/. Accessed: 2021-11-22.
[2] Supreeth Achar, Joseph R Bartels, William L'Red' Whittaker, Kiriakos N Kutulakos, and Srinivasa G Narasimhan. Epipolar time-of-flight imaging. ACM Transactions on Graphics (ToG), 36(4):1-8, 2017.
[3] Gianluca Agresti, Ludovico Minto, Giulio Marin, and Pietro Zanuttigh. Deep learning for confidence information in stereo and tof data fusion. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017.
[4] Gianluca Agresti, Henrik Schaefer, Piergiorgio Sartor, and Pietro Zanuttigh. Unsupervised domain adaptation for tof data denoising with adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[5] Gianluca Agresti and Pietro Zanuttigh. Deep learning for multi-path error removal in tof sensors. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0-0, 2018.
[6] Gary A Atkinson. Polarisation photometric stereo. Computer Vision and Image Understanding, 160:158-167, 2017.
[7] Gary A Atkinson and Edwin R Hancock. Multi-view surface reconstruction using polarization. In Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, volume 1, pages 309-316. IEEE, 2005.
[8] Gary A Atkinson and Edwin R Hancock. Recovery of surface orientation from diffuse polarization. IEEE Transactions on Image Processing, 15(6):1653-1664, 2006.
[9] Yunhao Ba, Rui Chen, Yiqin Wang, Lei Yan, Boxin Shi, and Achuta Kadambi. Physics-based neural networks for shape from polarization. arXiv preprint arXiv:1903.10210, 2019.
[10] Yunhao Ba, Alex Gilbert, Franklin Wang, Jinfa Yang, Rui Chen, Yiqin Wang, Lei Yan, Boxin Shi, and Achuta Kadambi. Deep shape from polarization. In Computer Vision - ECCV 2020, pages 554-571. Springer International Publishing, Cham, 2020.
[11] Marc Blanchon, Desire Sidibe, Olivier Morel, Ralph Seulin, Daniel Braun, and Fabrice Meriaudeau. P2D: a self-supervised method for depth estimation from polarimetry. In 25th International Conference on Pattern Recognition (ICPR 2020), 2020.
[12] Zhaopeng Cui, Jinwei Gu, Boxin Shi, Ping Tan, and Jan Kautz. Polarimetric multi-view stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1558-1567, 2017.
[13] Rui Dai, Srijan Das, and François Bremond. Learning an augmented rgb representation with cross-modal knowledge distillation for action detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13053-13064, October 2021.
[14] Adrian A Dorrington, John P Godbaz, Michael J Cree, Andrew D Payne, and Lee V Streeter. Separating true range measurements from multi-path and scattering interference in commercial range cameras. In Three-Dimensional Imaging, Interaction, and Measurement, volume 7864, page 786404. International Society for Optics and Photonics, 2011.
[15] Sergi Foix, Guillem Alenya, and Carme Torras. Lock-in time-of-flight (tof) cameras: A survey. IEEE Sensors Journal, 11(9):1917-1926, 2011.
[16] Daniel Freedman, Yoni Smolin, Eyal Krupka, Ido Leichter, and Mirko Schmidt. SRA: Fast removal of general multipath for tof sensors. In European Conference on Computer Vision, pages 234-249. Springer, 2014.
[17] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002-2011, 2018.
[18] N Missael Garcia, Ignacio De Erausquin, Christopher Edmiston, and Viktor Gruev. Surface normal reconstruction using circularly polarized light. Optics Express, 23(11):14391-14406, 2015.
[19] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue, 2016.
[20] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354-3361. IEEE, 2012.
[21] Clement Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
[22] Clement Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3828-3838, 2019.
[23] John P Godbaz, Michael J Cree, and Adrian A Dorrington. Closed-form inverses for the mixed pixel/multipath interference problem in amcw lidar. In Computational Imaging X, volume 8296, page 829618. International Society for Optics and Photonics, 2012.
[24] Qi Guo, Iuri Frosio, Orazio Gallo, Todd Zickler, and Jan Kautz. Tackling 3d tof artifacts through learning and the flat dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 368-383, 2018.
[25] Xiaoyang Guo, Hongsheng Li, Shuai Yi, Jimmy Ren, and Xiaogang Wang. Learning monocular depth by distilling cross-domain stereo networks, 2018.
[26] Mohit Gupta, Shree K Nayar, Matthias B Hullin, and Jaime Martin. Phasor imaging: A generalization of correlation-based time-of-flight imaging. ACM Transactions on Graphics (ToG), 34(5):1-18, 2015.
[27] Miles Hansard, Seungkyu Lee, Ouk Choi, and Radu Patrice Horaud. Time-of-flight cameras: principles, methods and applications. Springer Science & Business Media, 2012.
[28] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.

[29] Heiko Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 807-814. IEEE, 2005.
[30] Radu Horaud, Miles Hansard, Georgios Evangelidis, and Clement Menier. An overview of depth cameras and range scanners based on time-of-flight technologies. Machine Vision and Applications, 27(7):1005-1020, Jun 2016.
[31] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In Advances in Neural Information Processing Systems, volume 28, pages 2017-2025. Curran Associates, Inc., 2015.
[32] Achuta Kadambi, Vage Taamazyan, Boxin Shi, and Ramesh Raskar. Depth sensing using geometrically constrained polarization normals. International Journal of Computer Vision, 125(1-3):34-51, 2017.
[33] Achuta Kadambi, Refael Whyte, Ayush Bhandari, Lee Streeter, Christopher Barsi, Adrian Dorrington, and Ramesh Raskar. Coded time of flight cameras: sparse deconvolution to address multipath interference and recover time profiles. ACM Transactions on Graphics (ToG), 32(6):1-10, 2013.
[34] Agastya Kalra, Vage Taamazyan, Supreeth Krishna Rao, Kartik Venkataraman, Ramesh Raskar, and Achuta Kadambi. Deep polarization cues for transparent object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8602-8611, 2020.
[35] Ahmed Kirmani, Arrigo Benedetti, and Philip A Chou. SPUMIC: Simultaneous phase unwrapping and multipath interference cancellation in time-of-flight cameras using spectral methods. In 2013 IEEE International Conference on Multimedia and Expo (ICME), pages 1-6. IEEE, 2013.
[36] R. Lange and P. Seitz. Solid-state time-of-flight range camera. IEEE Journal of Quantum Electronics, 37(3):390-397, 2001.
[37] Chenyang Lei, Xuhua Huang, Mengdi Zhang, Qiong Yan, Wenxiu Sun, and Qifeng Chen. Polarized reflection removal with perfect alignment in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1750-1758, 2020.
[38] Fotios Logothetis, Roberto Mecca, Fiorella Sgallari, and Roberto Cipolla. A differential approach to shape from polarisation: A level-set characterisation. International Journal of Computer Vision, 127(11-12):1680-1693, 2019.
[39] Adrian Lopez-Rodriguez, Benjamin Busam, and Krystian Mikolajczyk. Project to adapt: Domain adaptation for depth completion from noisy and sparse sensor data. arXiv preprint arXiv:2008.01034, 2020.
[40] Fangchang Ma, Guilherme Venturelli Cavalheiro, and Sertac Karaman. Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera. In 2019 International Conference on Robotics and Automation (ICRA), pages 3288-3295. IEEE, 2019.
[41] Fangchang Ma and Sertac Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1-8. IEEE, 2018.
[42] Nikolaus Mayer, Eddy Ilg, Philipp Fischer, Caner Hazirbas, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. What makes good synthetic training data for learning disparity and optical flow estimation? International Journal of Computer Vision, 126(9):942-960, 2018.
[43] Daisuke Miyazaki, Takuya Shigetomi, Masashi Baba, Ryo Furukawa, Shinsaku Hiura, and Naoki Asada. Surface normal estimation of black specular objects from multiview polarization images. Optical Engineering, 56(4):041303, 2016.
[44] Nikhil Naik, Achuta Kadambi, Christoph Rhemann, Shahram Izadi, Ramesh Raskar, and Sing Bing Kang. A light transport model for mitigating multipath interference in time-of-flight sensors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 73-81, 2015.
[45] Gregory P Nordin, Jeffrey T Meier, Panfilo C Deguzman, and Michael W Jones. Micropolarizer array for infrared imaging polarimetry. JOSA A, 16(5):1168-1174, 1999.
[46] Matthew O'Toole, Felix Heide, Lei Xiao, Matthias B Hullin, Wolfgang Heidrich, and Kiriakos N Kutulakos. Temporal frequency probing for 5d transient analysis of global light transport. ACM Transactions on Graphics (ToG), 33(4):1-11, 2014.
[47] Can Pu, Runzi Song, Radim Tylecek, Nanbo Li, and Robert B Fisher. SDF-GAN: Semi-supervised depth fusion with multi-scale adversarial networks. arXiv preprint arXiv:1803.06657, 2018.
[48] Simeng Qiu, Qiang Fu, Congli Wang, and Wolfgang Heidrich. Polarization demosaicking for monochrome and color polarization focal plane arrays. In Vision, Modeling and Visualization. The Eurographics Association, 2019.
[49] M. Ramamonjisoa, Y. Du, and V. Lepetit. Predicting sharp and accurate occlusion boundaries in monocular depth estimation using displacement fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[50] Rene Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
[51] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, pages 234-241. Springer International Publishing, Cham, 2015.
[52] Johannes Lutz Schonberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[53] William AP Smith, Ravi Ramamoorthi, and Silvia Tozza. Height-from-polarisation with unknown lighting or albedo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(12):2875-2888, 2018.

[54] Kilho Son, Ming-Yu Liu, and Yuichi Taguchi. Learning to remove multipath distortions in time-of-flight range images for a robotic arm setup. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3390-3397. IEEE, 2016.
[55] Jaime Spencer, Richard Bowden, and Simon Hadfield. Defeat-net: General monocular depth via simultaneous unsupervised representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14402-14413, 2020.
[56] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of rgb-d slam systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012.
[57] Shuochen Su, Felix Heide, Gordon Wetzstein, and Wolfgang Heidrich. Deep end-to-end time-of-flight imaging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6383-6392, 2018.
[58] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant cnns. In Proceedings of the International Conference on 3D Vision (3DV), pages 11-20. IEEE, 2017.
[59] Jamie Watson, Michael Firman, Gabriel J. Brostow, and Daniyar Turmukhambetov. Self-supervised monocular depth hints. In The International Conference on Computer Vision (ICCV), October 2019.
[60] Refael Whyte, Lee Streeter, Michael J Cree, and Adrian A Dorrington. Review of methods for resolving multi-path interference in time-of-flight range cameras. In SENSORS, 2014 IEEE, pages 629-632. IEEE, 2014.
[61] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks, 2016.
[62] Tomonari Yoshida, Vladislav Golyanik, Oliver Wasenmüller, and Didier Stricker. Improving time-of-flight sensor for specular surfaces with shape from polarization. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 1558-1562. IEEE, 2018.
[63] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas Huang. Free-form image inpainting with gated convolution, 2019.
[64] Ye Yu, Dizhong Zhu, and William AP Smith. Shape-from-polarisation: a nonlinear least squares approach. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 2969-2976, 2017.
[65] Pietro Zanuttigh, Giulio Marin, Carlo Dal Mutto, Fabio Dominio, Ludovico Minto, and Guido Maria Cortelazzo. Time-of-flight and structured light depth cameras: Technology and applications. Springer, 2016.
[66] Dizhong Zhu and William AP Smith. Depth from a polarisation + rgb stereo pair. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7586-7595, 2019.
