Shin et al., "Deep Depth Estimation From Thermal Image," CVPR 2023
Table 1. Comprehensive comparison of multi-modal datasets. Compared to previous datasets [6, 8, 24, 25, 28, 53], the proposed Multi-Spectral Stereo (MS2) dataset provides about 195K synchronized and rectified multi-spectral stereo data pairs (i.e., RGB, NIR, thermal, LiDAR, and GNSS/IMU data) covering diverse locations (e.g., city, campus, residential, road, and suburban), times (e.g., morning, daytime, and nighttime), and weather conditions (e.g., clear sky, cloudy, and rainy).
However, two necessities are still missing: a well-established large-scale dataset and public benchmark results. Publicly available datasets for autonomous driving are overwhelmingly composed of the visible spectrum band (i.e., RGB images) and very rarely include other spectrum bands, such as the NIR and LWIR bands. In particular, despite the advantages of the LWIR band, only a few LWIR datasets have been released recently, and these datasets are indoor-oriented [8, 25, 28], small-scale [25, 53], publicly unavailable [6], or limited in sensor diversity [6, 24]. Therefore, there is a growing need for a large-scale multi-sensor driving dataset to investigate the feasibility and challenges of an autonomous driving perception system built on multi-spectral sensors.

The other necessity is a thorough validation of vision applications on the LWIR band. Estimating a depth map from monocular or stereo images is a fundamental task for geometric understanding. Despite numerous recent studies on depth estimation, these works have mainly focused on RGB images. However, thermal images, which typically have lower resolution, less texture, and more noise than RGB images, could pose a challenge for stereo-matching algorithms. This means the performance of these previous works in the thermal image domain is uncertain and cannot be guaranteed.

To this end, in this paper we provide a large-scale multi-spectral dataset along with exhaustive experimental results and a new perspective of depth unification, to encourage active research on various geometry algorithms from multi-spectral data toward high-level performance, reliability, and robustness against hostile conditions. Our contributions can be summarized as follows:

• We provide a large-scale Multi-Spectral Stereo (MS2) dataset, including stereo RGB, stereo NIR, stereo thermal, and stereo LiDAR data along with GNSS/IMU data. Our dataset provides about 195K synchronized data pairs taken from city, residential, road, campus, and suburban areas in the morning, daytime, and nighttime under clear-sky, cloudy, and rainy conditions.

• We perform an exhaustive validation and find that monocular and stereo depth estimation algorithms originally designed for the visible spectral band work reasonably well in the thermal spectral band.

• We propose a unified depth network that bridges the monocular and stereo depth estimation tasks from the perspective of a conditional random field approach.

2. Related Work

2.1. Thermal Image Dataset for 3D Vision

A well-established large-scale dataset is the most fundamental and highest-priority requirement for modern deep neural network training. For the visible spectrum band, numerous large-scale datasets have been proposed, such as the KITTI [15], DDAD [17], Cityscapes [7], Oxford [36], and nuScenes [4] datasets. On the other hand, the InfraRed (IR) spectrum band (e.g., near-IR, short-wave IR, long-wave IR) is included in only a few datasets, and only in a limited form, despite its superior environmental robustness.

A comprehensive comparison is shown in Tab. 1. Most datasets are insufficient to investigate the feasibility of geometric and semantic understanding from multi-spectrum image sensors under diverse outdoor driving scenarios. More specifically, these datasets are indoor-oriented [8, 25, 28], small-scale [25, 53], publicly unavailable [6], limited in sensor diversity [6, 24], limited in weather conditions [6, 24, 25], or missing RAW thermal data [53].

2.2. Depth From Visible Spectrum Band

Monocular Depth Estimation (MDE) has high-level universality because it estimates a depth map from a single image. Mainstream methods formulate depth estimation as per-pixel regression [26, 41, 42, 56], directly estimating per-pixel depth values through a neural network; as per-pixel classification [12, 13], discretizing the continuous depth range into discrete intervals; or as a combined classification-and-regression problem [2, 29].
However, MDE is an ill-posed problem; a single 2D image can be generated from an infinite number of distinct 3D scenes. Therefore, the estimated monocular depth map is inherently scale-ambiguous, generalizes poorly, and is less accurate than depth estimation from multi-view images.

Stereo Depth Estimation (SDE) can estimate a metric-scale depth map by utilizing a known camera baseline and a disparity map computed from a rectified stereo image pair. Existing stereo matching networks can be categorized into 3D cost volume [30, 37, 52, 55] and 4D cost volume based methods [5, 18, 20, 43, 54]. The former estimates a single-channel cost volume (e.g., D×H×W) by measuring the similarity between left and right features and then aggregates contextual information via 2D convolutions. These methods have high memory and computational efficiency, yet the encoded volume loses substantial content information, leading to unsatisfactory accuracy.

The latter builds a multi-channel cost volume (e.g., D×C×H×W) by concatenating the two left-right feature volumes [5, 20], a correlation volume and left-right features [18], or attention-added features [54], and then aggregates the 4D cost volume with 3D convolution layers. Current state-of-the-art models are mostly based on this approach. However, it demands high memory consumption and cubic computational complexity, which is expensive to deploy in real-world applications. The SDE task yields significant performance gains compared to the MDE task, yet it still struggles to find accurate corresponding points in inherently ill-posed regions such as occluded areas, repeated patterns, textureless regions, and reflective surfaces.

2.3. Depth From Thermal Spectrum Band

The thermal spectrum band is highly robust against various adverse weather and lighting conditions, such as rain, fog, dust, haze, and low light. However, due to the absence of a large-scale dataset, most previous studies on geometric understanding [3, 10, 21, 38, 47] are conducted on their own testbeds. Also, most works focus on utilizing a thermal camera along with other heterogeneous sensors for the target geometric task rather than focusing on the thermal camera itself.

For geometric understanding with deep neural networks, a few studies [22, 35, 44–46] have been proposed recently. Most of them focus on self-supervised depth estimation from thermal images with auxiliary modality guidance, such as aligned-and-paired RGB images [22], a style transfer network [35], and paired RGB images [45]. Unlike these previous studies, in this paper we target supervised depth estimation from single and stereo thermal images, which has not yet been actively explored.

3. Multi-Spectral Stereo (MS2) Dataset

3.1. Multi-Spectral Stereo Sensor System

Despite the well-known advantages of the long-wave infrared camera (i.e., thermal camera) [9, 19, 57], the absence of a large-scale dataset still hinders the development and investigation of condition-agnostic autonomous driving perception systems in the thermal spectrum domain. To this end, we designed a data collection platform that consists of RGB, NIR, thermal, and LiDAR stereo systems along with a GNSS/IMU module, as shown in Fig. 2-(a), (b), and (c). Each sensor's specification is described in Tab. 2.

Table 2. Sensor specifications of the multi-spectral stereo system. Our sensor system consists of RGB, NIR, thermal, and LiDAR stereo systems along with a GNSS/IMU module. The data from the RGB, NIR, and thermal stereo systems were taken at 15 fps with synchronized signals. LiDAR stereo data were taken at 10 fps.

| Sensor | Model | Frame Rate | Characteristics |
| --- | --- | --- | --- |
| RGB camera | PointGrey BlackFly-S BFS-U3-51S5C (Kowa LM5JC10M lens) | Max 75 fps | 2448×2048 pixels, global shutter, 82.2° (H) × 66.5° (V) FoV |
| NIR camera | Intel RealSense D435i | Max 90 fps | 1280×720 pixels, global shutter, 69° (H) × 42° (V) FoV |
| Thermal camera | FLIR A65C | Max 30 fps | 640×512 pixels, 45° (H) × 37° (V) FoV, uncooled VOx microbolometer, 16-bit RAW data |
| LiDAR | Velodyne VLP-16 | Max 20 fps | Accuracy: ±3 cm, measurement range: 100 m, 360° (H), ±15° (V) FoV |
| GNSS/IMU | LORD Microstrain 3DM-GX5-45 | 10/100 Hz | Position, velocity, attitude, acceleration, etc. |

Accurate time synchronization is an important prerequisite for various geometric tasks with multiple sensors, such as depth estimation, odometry, 3D detection, and 3D reconstruction. Therefore, we synchronize the RGB and NIR stereo cameras via an external synchronizer. The thermal stereo cameras are synchronized with the sync signal of the left thermal camera. Also, a software trigger is used to synchronize the two systems at the start time of each data acquisition. Please refer to the supplementary material for more details on calibration and the sensor system configuration.

3.2. Data Collection

We collect multi-spectral stereo data (i.e., stereo RGB, NIR, thermal, and LiDAR data) along with GNSS/IMU data under various locations, lighting conditions, and weather conditions. Specifically, we obtain synchronized multi-spectral data from campus, city, residential, suburban, and multiple road environments. We also provide time diversity (e.g., morning, daytime, and nighttime) and weather diversity (e.g., clear sky, cloudy, and rainy) for each representative location (Fig. 2-(d) and (e)).
[Figure 2 panels: (a) frontal view of sensor system, (b) sensor system details, (c) coordinate system of our platform, (f) driving scenario - campus (RGB/NIR/THR), (g) driving scenario - road (RGB/NIR/THR)]

Figure 2. Overview of our proposed Multi-Spectral Stereo (MS2) outdoor driving dataset. We designed a data collection platform that consists of RGB, NIR, thermal, and LiDAR stereo systems along with a GNSS/IMU module ((a), (b), (c)). The collected dataset was taken in campus, city, residential, road, and suburban locations across various time slots (morning, day, and night) and weather conditions (clear sky, cloudy, and rainy) ((d) and (e)). Depending on the surrounding conditions, each spectrum sensor shows different aspects, advantages, and disadvantages induced by its sensor characteristics ((f) and (g)). Further examples and details are described in the supplementary material.
This aims to investigate and evaluate the generalization and domain-gap handling abilities of a deep neural network. It also targets exploring the possibility of multi-sensor complementation and the characteristics of each sensor under various conditions (Fig. 2-(f) and (g)). Compared to previous datasets [6, 8, 24, 25, 28, 53], the proposed dataset provides about 195K synchronized and rectified multi-spectral data pairs (i.e., RGB, NIR, thermal, LiDAR, and GNSS/IMU data) covering diverse locations, times, weather conditions, and sensors.

3.3. Multi-Spectral Stereo (MS2) Depth Dataset

Ground-Truth Generation Process. To create a dense Ground-Truth (GT) depth map, we accumulate 10 successive stereo LiDAR scans by utilizing interpolated odometry information from the GNSS/IMU sensor, in a similar way to the KITTI dataset [15]. Specifically, we calculate the pose at each sensor's time stamp by interpolating the GNSS/IMU sensor data. Afterward, we aggregate the 10 successive stereo LiDAR scans for each target thermal image via transformation matrices between consecutive frames and refine the aggregated point cloud via the Iterative Closest Point (ICP) algorithm [1]. Then, the refined and aggregated 3D point cloud is projected onto the thermal image plane to obtain the final semi-dense depth map (a schematic sketch of this projection step is given below).

Training Set Configuration. From the MS2 dataset, we periodically sample the thermal images and filter out static vehicle movement to create training, validation, and evaluation splits for learning monocular and stereo depth networks. We utilize 26K data pairs for training, 4K pairs for validation, and 5.8K, 6.8K, and 5.2K pairs for evaluation under daytime, nighttime, and rainy conditions, respectively. We make the training set splits have almost zero overlap in time, weather, and location. The split details can be found in the supplementary material.
Figure 3. Overall pipeline of our proposed depth estimation network. We design a single network that can estimate both monocular and stereo depth maps from a given single or stereo thermal image. We bridge monocular and stereo depth estimation by regarding the cost volume as additional information for the Neural Window Conditional Random Field (NeWCRF) block [56]. Initially, the network extracts multi-scale feature maps via a Swin Transformer backbone [31] and aggregates global contextual information via a Pyramid Pooling Module (PPM) head [58]. If the right thermal image is available, the network generates a single-channel cost volume at each scale (i.e., D_scale × H_scale × W_scale) based on the feature similarity of the left-right features. If only the left image is available, the network utilizes a zero-filled cost volume. The depth maps are estimated from the multi-scale concatenated features via NeWCRF blocks [56].
4. Depth Estimation from Thermal Image

4.1. Bridging Monocular and Stereo Depth Estimation

In this section, we connect the Monocular Depth Estimation (MDE) and Stereo Depth Estimation (SDE) tasks from the Conditional Random Field (CRF) perspective. An MDE network has the advantage of high-level universality: it needs no extra constraints such as pre-rectification, extrinsic matrix information, or additional images. However, MDE networks suffer from inherent scale ambiguity and generalization issues. On the other hand, SDE networks provide an accurate metric-scale depth map by finding horizontal correspondences between rectified left and right images. But an SDE network struggles to provide a reliable depth map in ill-posed regions such as occluded areas, repeated patterns, textureless regions, and reflective surfaces.

The two tasks can complement each other when bridged and, at the same time, the network can flexibly estimate depth maps from given monocular or stereo images, as shown in Fig. 3. To this end, we utilize the recently proposed MDE network, Neural Window FC-CRF (NeWCRF) [56], to connect the two tasks. Specifically, we regard the estimated cost volume as additional information for the NeWCRF blocks. Therefore, when the right image is available, we add the cost volume of the multi-scale left-and-right features to the left image feature F_L^{scale} at each scale. If only the left image is available, the network utilizes a zero-filled cost volume.

4.2. Feature Extraction and Aggregation

We adopt the Swin Transformer [31] as our backbone network. The backbone extracts features at four scale levels (i.e., 1/4, 1/8, 1/16, and 1/32) from the given images. After that, the Pyramid Pooling Module (PPM) [58] aggregates global context information with global average pooling over receptive fields of 1, 2, 3, and 6 from the last scale level. The features of the remaining scales are provided to each decoder level in a skip-connected manner.

4.3. Cost Volume Construction

Most state-of-the-art stereo matching networks [5, 18, 54] utilize a 4D cost volume with 3D convolution layers to achieve higher performance. However, 4D cost volume based methods require costly memory and computation. They also make it hard to associate monocular depth estimation with the network architecture, because they always enforce the utilization of both left and right feature maps.

Therefore, we utilize a correlation cost volume (i.e., a 3D cost volume) [30, 37, 52, 55] that has a single-channel correlation map for each disparity level. This method loses some correlation information between the left-right features, yet it can easily be associated with a monocular depth estimation network as additional information. The cost volume of each scale is estimated as follows:

  C^{scale}(d, x, y) = \frac{1}{N_c} \langle f_l^{scale}(x, y), f_r^{scale}(x - d, y) \rangle,   (1)

where \langle \cdot, \cdot \rangle is the inner product, N_c denotes the number of channels, and f_l^{scale} and f_r^{scale} are the left and right feature maps at each scale.
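As an illustration of Eq. (1), here is a minimal PyTorch sketch of the single-channel correlation volume. The function name and the explicit loop over disparity levels are simplifications of my own (real implementations typically vectorize the shift); the monocular branch would receive a zero-filled volume of the same shape.

```python
import torch

def correlation_cost_volume(feat_l, feat_r, max_disp):
    """Single-channel correlation volume, Eq. (1): C[d, y, x] = <f_l(x, y), f_r(x - d, y)> / N_c.

    feat_l, feat_r: (B, C, H, W) left and right feature maps at one scale.
    Returns: (B, max_disp, H, W) cost volume.
    """
    b, c, h, w = feat_l.shape
    volume = feat_l.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_l * feat_r).mean(dim=1)
        else:
            # Right features shifted by d pixels; the left-most d columns stay zero.
            volume[:, d, :, d:] = (feat_l[:, :, :, d:] * feat_r[:, :, :, :-d]).mean(dim=1)
    return volume

# Monocular input: a zero-filled volume of the same shape keeps the decoder interface unchanged.
# volume = feat_l.new_zeros(b, max_disp, h, w)
```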
The cost volume of each scale is concatenated with the corresponding feature map of the left image f_l^{scale} to form the skip-connection input F for the NeWCRF blocks.

4.4. Neural Window FC-CRF

NeWCRF [56] implements the traditional CRF as a neural network in a computation-efficient way by utilizing the shifted-window multi-head attention module [31]. Given the previous prediction result X and the concatenated feature F, the NeWCRF block estimates the unary potential \psi_u and the pairwise potential \psi_p via the multi-head attention mechanism (i.e., the NeWCRF block of Fig. 3), as follows:

  \psi_u = \theta_u(X), \qquad \psi_p = \sum_i \mathrm{SoftMax}(Q \cdot K^T + P) \cdot X,   (2)

where \theta_u is the parameter of a unary network and Q, K, and P are the query, key, and position embedding matrices of the attention block. After that, the optimization network, which consists of two MLP layers, estimates the current-stage result X'. This X' is then regarded as X for the next NeWCRF block.
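For intuition, the following is a heavily simplified, single-window, single-head sketch of the potentials in Eq. (2). It is not the authors' implementation: the real NeWCRF block [56] uses shifted-window multi-head attention with additional normalization, and the module names and the exact inputs of the query/key projections below are placeholder assumptions.

```python
import torch
import torch.nn as nn

class SimplifiedNeWCRFBlock(nn.Module):
    """Toy, single-window version of the unary/pairwise computation in Eq. (2)."""

    def __init__(self, dim):
        super().__init__()
        self.unary = nn.Linear(dim, dim)       # theta_u(X)
        self.to_q = nn.Linear(dim, dim)        # query projection (assumed input: X)
        self.to_k = nn.Linear(dim, dim)        # key projection (assumed input: F)
        self.optimize = nn.Sequential(         # two-MLP "optimization" head producing X'
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x, f, pos_bias):
        # x: (B, N, C) previous prediction feature X, f: (B, N, C) concatenated feature F,
        # pos_bias: (N, N) relative position embedding P for one window of N tokens.
        psi_u = self.unary(x)
        q, k = self.to_q(x), self.to_k(f)
        attn = torch.softmax(q @ k.transpose(-2, -1) + pos_bias, dim=-1)  # SoftMax(QK^T + P)
        psi_p = attn @ x                       # pairwise potential uses X as the value
        return self.optimize(psi_u + psi_p)    # current-stage result X'
```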
4.5. Disparity and Inverse Depth Prediction

The proposed network estimates prediction results at four scales (i.e., 1/4, 1/8, 1/16, and 1/32) from the last four NeWCRF blocks. When a single image is fed to the network, we regard the prediction results as an inverse depth map. For a stereo image pair, we regard the prediction results as a common disparity map. For the prediction feature X of each scale, the network employs two convolution layers to obtain a single-channel (disparity/inverse depth) volume. After that, the volume is upsampled and converted into a probability volume by the softmax function along the disparity dimension. Finally, the predicted value is computed as follows:

  D_{pred} = \sum_{k=0}^{D_{max}-1} k \cdot p_k,   (3)

where k denotes the disparity level, p_k indicates the corresponding probability, and D_{max} is the maximum value of the disparity range.
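A minimal sketch of the prediction head described above, i.e., a softmax over disparity levels followed by the expectation in Eq. (3). The function name is a placeholder, and the two prediction convolutions and the upsampling step are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def soft_argmax_disparity(volume, d_max=192):
    """volume: (B, D, H, W) per-disparity-level scores from the prediction layers (D == d_max).
    Returns: (B, H, W) expected disparity, Eq. (3): sum_k k * p_k."""
    prob = F.softmax(volume, dim=1)                       # probability volume p_k
    levels = torch.arange(d_max, device=volume.device, dtype=volume.dtype).view(1, d_max, 1, 1)
    return (prob * levels).sum(dim=1)                     # soft-argmax over the disparity dimension

# For a single thermal image, the same head output is read as inverse depth instead of disparity.
```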
4.6. Loss Function

We utilize a multi-scale smooth L1 loss, which is commonly adopted in the SDE task, to train our network:

  L_{sup} = \sum_{scale=0}^{3} \lambda_{scale} \cdot \big( \mathrm{SmoothL1}(D^{scale}_{pred,mono}, D_{GT}) + \mathrm{SmoothL1}(D^{scale}_{pred,stereo}, D_{GT}) \big),   (4)

where \lambda_{scale} indicates the coefficient for the prediction result of each scale, D_{GT} denotes the GT disparity map, and SmoothL1 is the smooth L1 loss.

5. Experimental Results

5.1. Implementation Details

MDE and SDE Networks. To validate various MDE and SDE networks designed for the visible spectrum band, we train and evaluate representative MDE and SDE networks on the proposed MS2 dataset. Specifically, we adopt regression [26], classification [13], classification-and-regression [2], and modern transformer [56] based MDE networks (i.e., BTS, DORN, AdaBins, and NeWCRF). We also employ 3D cost volume [55] and 4D cost volume [18, 54] based SDE networks (i.e., AANet, GwcNet, and ACVNet). We utilize their official source code to implement each network architecture. All networks are initialized with ImageNet-pretrained [11] or provided backbone models, following their original implementations [2, 13, 18, 26, 54–56]. We utilize the PyTorch library [40] to implement our proposed method and the other comparison methods.

Optimizer and Data Augmentation. All models are trained for 60 epochs on a single A6000 GPU with 48GB of memory. We utilize a batch size of 8 for all MDE model training and 4 for all SDE model training. For our method, we use a batch size of 6. We adopt the AdamW optimizer [34] with an initial learning rate of 1e-4 for all model training. Cosine Annealing Warm Restarts [33] is used as the learning rate scheduler. For data augmentation, we apply random center crop-and-resize, brightness jitter, and contrast jitter for all model training. A horizontal flip is additionally applied to the MDE networks. We set the coefficients of the multi-scale L1 loss, \lambda_{scale}, to 0.5, 0.5, 0.7, and 1.0. The maximum value of the disparity range, D_{max}, is set to 192. (A schematic sketch of the loss in Eq. (4) and these optimizer settings is given just before Tab. 3.)

5.2. Depth Estimation from Thermal Images

We provide a comprehensive comparison of representative MDE and SDE networks on our MS2 depth dataset, as shown in Tab. 3. The advantage of depth estimation from thermal images can also be observed in Fig. 4.

Monocular Depth Estimation. The performance tendency of MDE networks is generally preserved in the thermal spectrum domain, similar to the KITTI depth benchmark results [15]. MDE networks with regression heads for depth map prediction (i.e., BTS and NeWCRF) have clear advantages in the error metrics over methods with classification heads, by directly regressing precise depth values. On the other hand, the classification heads (i.e., DORN and Ours) achieve higher accuracy scores by explicitly binning the depth range. The proposed unified network (i.e., Ours (Mono)) generally shows results comparable to the state-of-the-art MDE method, with higher scores in the accuracy metrics yet lower scores in some error metrics. We think the performance gap comes from the depth prediction head and loss function. All MDE networks utilize GT depth maps
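Returning to the loss in Eq. (4) and the optimizer settings listed in Sec. 5.1, the following is a minimal sketch using standard PyTorch modules. The model placeholder, the valid-pixel masking for the semi-dense GT, and the scheduler restart period are assumptions for illustration, not the authors' released training code.

```python
import torch
import torch.nn.functional as F

def multi_scale_loss(preds_mono, preds_stereo, gt_disp, lambdas=(0.5, 0.5, 0.7, 1.0)):
    """Eq. (4): smooth L1 on the monocular and stereo predictions at each of the four scales.

    preds_mono, preds_stereo: lists of four (B, H, W) predictions, assumed already upsampled
    to the GT resolution; gt_disp: (B, H, W) ground-truth disparity (semi-dense, 0 = no GT).
    """
    valid = gt_disp > 0                                    # supervise only pixels with LiDAR GT
    loss = 0.0
    for lam, p_m, p_s in zip(lambdas, preds_mono, preds_stereo):
        loss = loss + lam * (F.smooth_l1_loss(p_m[valid], gt_disp[valid])
                             + F.smooth_l1_loss(p_s[valid], gt_disp[valid]))
    return loss

# Optimizer and scheduler as listed in Sec. 5.1 (the model and T_0 below are placeholders).
model = torch.nn.Conv2d(1, 1, 3, padding=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)
```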
Table 3. Quantitative comparison of depth estimation results on the proposed dataset. We compare our network with state-of-the-art monocular and stereo depth estimation networks [2, 13, 18, 26, 54–56]. Ours shows comparable results in both monocular and stereo depth estimation. Differing from the other networks, Ours has high-level practicality and flexibility in that it can estimate a depth map regardless of whether a single or a stereo thermal image is given as input. Reg and Cls indicate regression and classification heads for the MDE task; 3D CV and 4D CV denote 3D and 4D cost volumes for the SDE task.
(a) Monocular depth estimation results on the evaluation set of our MS2 depth dataset.

| Methods | Type | TestSet | AbsRel ↓ | SqRel ↓ | RMSE ↓ | RMSElog ↓ | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DORN [13] | Cls | Day | 0.144 | 1.288 | 5.483 | 0.230 | 0.856 | 0.941 | 0.970 |
| | | Night | 0.136 | 1.136 | 5.290 | 0.212 | 0.863 | 0.950 | 0.976 |
| | | Rain | 0.180 | 1.934 | 6.735 | 0.276 | 0.781 | 0.910 | 0.955 |
| | | Avg | 0.151 | 1.419 | 5.776 | 0.237 | 0.837 | 0.935 | 0.968 |
| BTS [26] | Reg | Day | 0.122 | 0.905 | 4.923 | 0.198 | 0.857 | 0.951 | 0.980 |
| | | Night | 0.114 | 0.798 | 4.701 | 0.184 | 0.870 | 0.959 | 0.984 |
| | | Rain | 0.157 | 1.395 | 6.053 | 0.243 | 0.791 | 0.926 | 0.969 |
| | | Avg | 0.129 | 1.008 | 5.169 | 0.206 | 0.843 | 0.947 | 0.978 |
| AdaBins [2] | Reg+Cls | Day | 0.129 | 0.976 | 5.108 | 0.205 | 0.847 | 0.947 | 0.979 |
| | | Night | 0.119 | 0.822 | 4.749 | 0.187 | 0.864 | 0.958 | 0.984 |
| | | Rain | 0.168 | 1.545 | 6.336 | 0.254 | 0.771 | 0.918 | 0.965 |
| | | Avg | 0.137 | 1.084 | 5.330 | 0.212 | 0.831 | 0.943 | 0.977 |
| NeWCRF [56] | Reg | Day | 0.120 | 0.864 | 4.852 | 0.195 | 0.858 | 0.952 | 0.982 |
| | | Night | 0.112 | 0.755 | 4.594 | 0.179 | 0.875 | 0.961 | 0.985 |
| | | Rain | 0.155 | 1.352 | 5.956 | 0.240 | 0.795 | 0.929 | 0.970 |
| | | Avg | 0.127 | 0.965 | 5.077 | 0.202 | 0.846 | 0.949 | 0.980 |
| Ours (Mono) | Cls | Day | 0.115 | 0.983 | 4.895 | 0.201 | 0.882 | 0.952 | 0.977 |
| | | Night | 0.107 | 0.850 | 4.658 | 0.185 | 0.894 | 0.961 | 0.981 |
| | | Rain | 0.152 | 1.567 | 6.020 | 0.247 | 0.822 | 0.928 | 0.964 |
| | | Avg | 0.123 | 1.103 | 5.134 | 0.208 | 0.869 | 0.948 | 0.975 |
| Ours (Stereo) | Cls | Day | 0.113 | 0.948 | 4.852 | 0.200 | 0.884 | 0.953 | 0.977 |
| | | Night | 0.105 | 0.811 | 4.584 | 0.183 | 0.896 | 0.961 | 0.981 |
| | | Rain | 0.149 | 1.499 | 5.940 | 0.245 | 0.826 | 0.929 | 0.965 |
| | | Avg | 0.120 | 1.057 | 5.068 | 0.207 | 0.872 | 0.949 | 0.975 |
(b) Disparity estimation results on the evaluation set of our MS2 depth dataset (lower is better for all metrics).

| Methods | Cost Volume | TestSet | EPE-all (px) | D1-all (%) | >1px (%) | >2px (%) | >3px (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GwcNet [18] | 4D CV | Day | 0.905 | 5.5 | 19.2 | 8.4 | 5.5 |
| | | Night | 0.946 | 5.6 | 26.0 | 10.2 | 5.6 |
| | | Rain | 1.070 | 7.2 | 24.3 | 11.1 | 7.2 |
| | | Avg | 0.969 | 6.0 | 23.3 | 9.9 | 6.0 |
| AANet [55] | 3D CV | Day | 0.939 | 5.8 | 20.2 | 8.8 | 5.8 |
| | | Night | 0.995 | 6.1 | 27.9 | 11.1 | 6.1 |
| | | Rain | 1.091 | 7.5 | 25.3 | 11.6 | 7.5 |
| | | Avg | 1.005 | 6.4 | 24.7 | 10.5 | 6.4 |
| ACVNet [54] | 4D CV | Day | 0.898 | 5.5 | 18.9 | 8.3 | 5.5 |
| | | Night | 0.943 | 5.5 | 25.9 | 10.1 | 5.5 |
| | | Rain | 1.056 | 7.2 | 23.6 | 10.9 | 7.2 |
| | | Avg | 0.962 | 6.0 | 23.0 | 9.8 | 6.0 |
| Ours (Mono) | 3D CV | Day | 1.033 | 6.4 | 23.1 | 10.5 | 6.4 |
| | | Night | 0.946 | 5.6 | 29.6 | 9.8 | 5.6 |
| | | Rain | 1.261 | 8.7 | 24.4 | 14.6 | 8.7 |
| | | Avg | 1.066 | 6.8 | 24.4 | 11.4 | 6.8 |
| Ours (Stereo) | 3D CV | Day | 0.957 | 5.7 | 22.7 | 9.1 | 5.7 |
| | | Night | 0.853 | 4.8 | 21.3 | 8.2 | 4.8 |
| | | Rain | 1.159 | 7.7 | 29.1 | 12.4 | 7.7 |
| | | Avg | 0.976 | 5.9 | 24.0 | 9.7 | 5.9 |
[Figure 4 panels: (a) RGB (reference only), (b) NIR (reference only), (c) THR, (d) GT disparity, (e) Ours (stereo)]

Figure 4. Qualitative results of stereo disparity estimation on the MS2 depth dataset. The disparity maps predicted from stereo thermal images are highly robust regardless of lighting and weather conditions. However, inherent hardware noise and the absence of high-frequency information lead to blurry predictions in specific regions, such as areas with similar thermal radiation values (i.e., temperature) and noisy areas generated by the sensor itself. We think multi-spectral modality fusion can achieve both robustness and reliability. Further results and comparisons with other MDE and SDE networks can be found in the supplementary material.
References

[1] Paul J Besl and Neil D McKay. Method for registration of 3-D shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pages 586–606. SPIE, 1992.
[2] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. AdaBins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4009–4018, 2021.
[3] Paulo Vinicius Koerich Borges and Stephen Vidas. Practical infrared visual odometry. IEEE Transactions on Intelligent Transportation Systems, 17(8):2205–2213, 2016.
[4] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
[5] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5418, 2018.
[6] Yukyung Choi, Namil Kim, Soonmin Hwang, Kibaek Park, Jae Shin Yoon, Kyounghwan An, and In So Kweon. KAIST multi-spectral day/night data set for autonomous and assisted driving. IEEE Transactions on Intelligent Transportation Systems, 19(3):934–948, 2018.
[7] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] Weichen Dai, Yu Zhang, Shenzhou Chen, Donglei Sun, and Da Kong. A multi-spectral dataset for evaluating motion estimation systems. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 5560–5566. IEEE, 2021.
[9] Kevser Irem Danaci and Erdem Akagunduz. A survey on infrared image and video sets. arXiv preprint arXiv:2203.08581, 2022.
[10] Jeff Delaune, Robert Hewitt, Laura Lytle, Cristina Sorice, Rohan Thakker, and Larry Matthies. Thermal-inertial odometry for autonomous flight throughout the night. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1122–1128. IEEE, 2019.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[12] Raul Diaz and Amit Marathe. Soft labels for ordinal regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4738–4747, 2019.
[13] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018.
[14] Stefano Gasperini, Patrick Koch, Vinzenz Dallabetta, Nassir Navab, Benjamin Busam, and Federico Tombari. R4Dyn: Exploring radar for self-supervised monocular depth estimation of dynamic scenes. In 2021 International Conference on 3D Vision (3DV), pages 751–760. IEEE, 2021.
[15] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[16] Vitor Guizilini, Rares Ambrus, Wolfram Burgard, and Adrien Gaidon. Sparse auxiliary networks for unified monocular depth prediction and completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11078–11088, 2021.
[17] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3D packing for self-supervised monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[18] Xiaoyang Guo, Kai Yang, Wukui Yang, Xiaogang Wang, and Hongsheng Li. Group-wise correlation stereo network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3273–3282, 2019.
[19] Keli Huang, Botian Shi, Xiang Li, Xin Li, Siyuan Huang, and Yikang Li. Multi-modal sensor fusion for auto driving perception: A survey. arXiv preprint arXiv:2202.02703, 2022.
[20] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 66–75, 2017.
[21] Shehryar Khattak, Christos Papachristos, and Kostas Alexis. Keyframe-based thermal-inertial odometry. Journal of Field Robotics, 37(4):552–579, 2020.
[22] Namil Kim, Yukyung Choi, Soonmin Hwang, and In So Kweon. Multispectral transfer network: Unsupervised depth estimation for all-day vision. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[23] Yeong-Hyeon Kim, Ukcheol Shin, Jinsun Park, and In So Kweon. MS-UDA: Multi-spectral unsupervised domain adaptation for thermal image semantic segmentation. IEEE Robotics and Automation Letters, 6(4):6497–6504, 2021.
[24] Alex Junho Lee, Younggun Cho, Young-sik Shin, Ayoung Kim, and Hyun Myung. ViViD++: Vision for visibility dataset. IEEE Robotics and Automation Letters, 7(3):6282–6289, 2022.
[25] Alex Junho Lee, Younggun Cho, Sungho Yoon, Youngsik Shin, and Ayoung Kim. ViViD: Vision for Visibility Dataset. In ICRA Workshop on Dataset Generation and Benchmarking of SLAM Algorithms for Robotics and VR/AR, Montreal, May 2019.
[26] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019.
[27] Chenglong Li, Wei Xia, Yan Yan, Bin Luo, and Jin Tang. Segmenting objects in day and night: Edge-conditioned CNN for thermal image semantic segmentation. arXiv preprint arXiv:1907.10303, 2019.
[28] Peize Li, Kaiwen Cai, Muhamad Risqi U. Saputra, Zhuangzhuang Dai, and Chris Xiaoxuan Lu. OdomBeyondVision: An indoor multi-modal multi-platform odometry dataset beyond the visible spectrum. arXiv preprint arXiv:2206.01589, 2022.
[29] Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang. BinsFormer: Revisiting adaptive bins for monocular depth estimation. arXiv preprint arXiv:2204.00987, 2022.
[30] Zhengfa Liang, Yiliu Feng, Yulan Guo, Hengzhu Liu, Wei Chen, Linbo Qiao, Li Zhou, and Jianfeng Zhang. Learning for disparity estimation through feature constancy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2811–2820, 2018.
[31] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[32] Yunfei Long, Daniel Morris, Xiaoming Liu, Marcos Castro, Punarjay Chakravarty, and Praveen Narayanan. Radar-camera pixel depth association for depth completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12507–12516, 2021.
[33] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[34] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
[35] Yawen Lu and Guoyu Lu. An alternative of LiDAR in nighttime: Unsupervised depth estimation based on single thermal image. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3833–3843, 2021.
[36] Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman. 1 year, 1000 km: The Oxford RobotCar dataset. The International Journal of Robotics Research, 36(1):3–15, 2017.
[37] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
[38] Yasuto Nagase, Takahiro Kushida, Kenichiro Tanaka, Takuya Funatomi, and Yasuhiro Mukaigawa. Shape from thermal radiation: Passive ranging using multi-spectral LWIR measurements. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12661–12671, 2022.
[39] Jinsun Park, Yongseop Jeong, Kyungdon Joo, Donghyeon Cho, and In So Kweon. Adaptive cost volume fusion network for multi-modal depth estimation in changing environments. IEEE Robotics and Automation Letters, 2022.
[40] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[41] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.
[42] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[43] Zhelun Shen, Yuchao Dai, and Zhibo Rao. CFNet: Cascade and fused cost volume for robust stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13906–13915, 2021.
[44] Ukcheol Shin, Kyunghyun Lee, Byeong-Uk Lee, and In So Kweon. Maximizing self-supervision from thermal image for effective self-supervised learning of depth and ego-motion. IEEE Robotics and Automation Letters, 7(3):7771–7778, 2022.
[45] Ukcheol Shin, Kyunghyun Lee, Seokju Lee, and In So Kweon. Self-supervised depth and ego-motion estimation for monocular thermal video using multi-spectral consistency loss. IEEE Robotics and Automation Letters, 2021.
[46] Ukcheol Shin, Kwanyong Park, Byeong-Uk Lee, Kyunghyun Lee, and In So Kweon. Self-supervised monocular depth estimation from thermal images via adversarial multi-spectral adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5798–5807, 2023.
[47] Young-Sik Shin and Ayoung Kim. Sparse depth enhanced direct thermal-infrared SLAM beyond the visible spectrum. IEEE Robotics and Automation Letters, 4(3):2918–2925, 2019.
[48] Shreyas S. Shivakumar, Neil Rodrigues, Alex Zhou, Ian D. Miller, Vijay Kumar, and Camillo J. Taylor. PST900: RGB-thermal calibration, dataset and segmentation network. arXiv preprint arXiv:1909.10980, 2019.
[49] Yuxiang Sun, Weixun Zuo, and Ming Liu. RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes. IEEE Robotics and Automation Letters, 4(3):2576–2583, 2019.
[50] Yuxiang Sun, Weixun Zuo, Peng Yun, Hengli Wang, and Ming Liu. FuseSeg: Semantic segmentation of urban scenes based on RGB and thermal data fusion. IEEE Transactions on Automation Science and Engineering (TASE), 2020.
[51] Jie Tang, Fei-Peng Tian, Wei Feng, Jian Li, and Ping Tan. Learning guided convolutional network for depth completion. IEEE Transactions on Image Processing, 30:1116–1129, 2020.
[52] Alessio Tonioni, Fabio Tosi, Matteo Poggi, Stefano Mattoccia, and Luigi Di Stefano. Real-time self-adaptive deep stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 195–204, 2019.
[53] Wayne Treible, Philip Saponaro, Scott Sorensen, Abhishek Kolagunda, Michael O'Neal, Brian Phelan, Kelly Sherbondy, and Chandra Kambhamettu. CATS: A color and thermal stereo benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2961–2969, 2017.
[54] Gangwei Xu, Junda Cheng, Peng Guo, and Xin Yang. Attention concatenation volume for accurate and efficient stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12981–12990, 2022.
[55] Haofei Xu and Juyong Zhang. AANet: Adaptive aggregation network for efficient stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1959–1968, 2020.
[56] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Neural window fully-connected CRFs for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3916–3925, 2022.
[57] Yuxiao Zhang, Alexander Carballo, Hanting Yang, and Kazuya Takeda. Autonomous driving in adverse weather conditions: A survey. arXiv preprint arXiv:2112.08936, 2021.
[58] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.