Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth
Figure 1: Overall architecture of the proposed network. The main components of the architecture are the encoder, decoder, and skip connections with feature fusion modules.
variant of CutDepth, in which the crop is applied only along the horizontal axis, so that the model adaptively learns to capture vertical long-range information from the training samples.

The proposed network architecture and training strategy are evaluated on the popular depth estimation dataset NYU Depth V2 [Silberman et al., 2012] and exhibit state-of-the-art performance. We validate the model through extensive quantitative and qualitative experiments, and the suggested architecture and data augmentation method demonstrate their effectiveness. In addition, we observe that our network generalizes well under cross-dataset validation and shows robust performance against image corruption.

To summarize, our contributions are as follows:

• We propose a novel global-local path architecture for monocular depth estimation.

• We suggest an improved depth-specific data augmentation method to boost the performance of the model.

• Our network achieves state-of-the-art performance on the most popular dataset, NYU Depth V2, and shows higher generalization ability and robustness than previously developed networks.
2 Related Work

Monocular depth estimation is a computer vision task that predicts a corresponding depth map for a given input image. Learning-based monocular depth estimation has been studied following the seminal work of [Saxena et al., 2008], which used monocular cues to predict depth based on a Markov random field. Later, with the development of CNNs, depth estimation networks have utilized the encoded features of deep CNNs, which generalize well to various tasks, and achieved drastic performance improvements [Eigen et al., 2014; Huynh et al., 2020; Yin et al., 2019]. Recently, BTS [Lee et al., 2019] suggested a local planar guidance layer that outputs plane coefficients, which are then used in the full-resolution depth estimation. AdaBins [Bhat et al., 2021] reformulates the depth estimation problem as a classification task by dividing depth values into bins and shows state-of-the-art performance.

Transformer [Vaswani et al., 2017] adopts a self-attention mechanism with a multi-layer perceptron (MLP) to overcome the limitations of previous RNNs for natural language processing. Since its emergence, the transformer has gained considerable attention in various fields. In computer vision, the vision transformer (ViT) [Dosovitskiy et al., 2020] first used a transformer to solve image classification tasks. The success of ViT in image classification accelerated the introduction of the transformer into other tasks. SETR [Zheng et al., 2021] first employs ViT as a backbone and demonstrates the potential of the transformer in dense prediction tasks by achieving new state-of-the-art performance. [Xie et al., 2021] proposed SegFormer, a transformer-based segmentation framework with a simple lightweight MLP decoder.

However, very few attempts have been made to employ a transformer for monocular depth estimation. AdaBins [Bhat et al., 2021] uses a minimized version of a vision transformer (mini-ViT) to calculate bin widths in an adaptive manner. DPT [Ranftl et al., 2021] employs ViT as an encoder to obtain a global receptive field at different stages and attaches a convolutional decoder to make a dense prediction. However, both AdaBins and DPT use CNN-based encoders and transformers simultaneously, which increases the computational complexity. In addition, DPT is trained with an extra large-scale dataset. In contrast to these studies, our method uses only one encoder and does not require an additional dataset to accomplish state-of-the-art performance.

Data augmentation plays an important role in preventing overfitting by increasing the effective amount of training data. Therefore, common methods such as flipping, color space transformation, cropping, and rotation are used in several tasks to improve network performance. However, although various methods such as CutMix [Yun et al., 2019], Copy-Paste [Ghiasi et al., 2021], and CutBlur [Yoo et al., 2020] have been actively proposed in diverse tasks, depth-specific data augmentation has rarely been studied. To the best of our knowledge, CutDepth [Ishii and Yamashita, 2021] is the first approach that attempts to augment the data in depth estimation. We improve the performance of this depth-specific data augmentation method by emphasizing the vertical location in the image.
3 Methods

3.1 Global-Local Path Networks
Our depth estimation framework aims to predict the depth map Ŷ ∈ R^(H×W×1) from a given RGB image I ∈ R^(H×W×3). Thus, we suggest a new architecture with global and local feature paths through the entire network to generate Ŷ. The overall structure of our framework is depicted in Figure 1. Our transformer encoder [Xie et al., 2021] enables the model to learn global dependencies, and the proposed decoder successfully recovers the extracted features into the target depth map by constructing the local path through skip connections and the feature fusion module. We detail the proposed architecture in the following subsections.
3.2 Encoder

In the encoding phase, we aim to leverage rich global information from the RGB image. To achieve this, we adopt a hierarchical transformer as the encoder. First, the input image I is embedded as a sequence of patches with a 3 × 3 convolution operation. Then, the embedded patches are used as the input of the transformer block, which comprises multiple sets of self-attention and MLP-Conv-MLP layers with residual skips. To reduce the computational cost of the self-attention layer, the dimension of each attention head is reduced with ratio R_i for the i-th block. Given the resulting output, we perform patch merging with an overlapped convolution. This process allows us to generate multi-scale features during the encoding phase, which can then be utilized in the decoding phase. We use four transformer blocks, and each block generates a 1/4, 1/8, 1/16, and 1/32 scale feature with [C_1, C_2, C_3, C_4] dimensions, respectively.
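To make the block structure concrete, the following is a minimal PyTorch sketch of one encoder block. It assumes the SegFormer-style design of [Xie et al., 2021] that our MiT-b4 encoder follows: the key/value sequence of the self-attention is spatially reduced by the ratio R_i with a strided convolution, and the MLP-Conv-MLP layer is a pointwise expansion, a 3 × 3 depthwise convolution, and a pointwise projection. Class names and defaults are illustrative, not the exact MiT-b4 implementation.

```python
# A simplified encoder block, assuming SegFormer-style efficient self-attention
# and an MLP-Conv-MLP (Mix-FFN) feed-forward layer.
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    def __init__(self, dim, num_heads, reduction_ratio):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Strided convolution shrinks the key/value sequence by R x R.
        self.reduce = (nn.Conv2d(dim, dim, kernel_size=reduction_ratio, stride=reduction_ratio)
                       if reduction_ratio > 1 else nn.Identity())
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):
        # x: (B, N, C) with N = h * w
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.reduce(kv).flatten(2).transpose(1, 2)
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)
        return out

class MixFFN(nn.Module):
    # MLP-Conv-MLP: pointwise expansion, 3x3 depthwise conv, pointwise projection.
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, h, w):
        b, n, c = x.shape
        x = self.fc1(x)
        x = x.transpose(1, 2).reshape(b, -1, h, w)
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))

class EncoderBlock(nn.Module):
    def __init__(self, dim, num_heads, reduction_ratio):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = EfficientSelfAttention(dim, num_heads, reduction_ratio)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = MixFFN(dim)

    def forward(self, x, h, w):
        x = x + self.attn(self.norm1(x), h, w)   # residual skip around self-attention
        x = x + self.ffn(self.norm2(x), h, w)    # residual skip around MLP-Conv-MLP
        return x
```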
3.3 Lightweight Decoder

The encoder transforms the input image I into the bottleneck feature F_E4 of size (1/32)H × (1/32)W × C_4. To obtain the estimated depth map, we construct a lightweight and effective decoder that restores the bottleneck feature to the size H × W × 1. Most previous studies conventionally stack multiple bilinear upsampling, convolution, or deconvolution layers to recover the original size. However, we empirically observe that the model can achieve better performance with far fewer convolution and bilinear upsampling layers in the decoder if we design the restoring path effectively. First, we reduce the channel dimension of the bottleneck feature to N_C with a 1 × 1 convolution to limit the computational complexity. Then we use consecutive bilinear upsampling to enlarge the feature to the size H × W × N_C. Finally, the output is passed through two convolution layers and a sigmoid function to predict a depth map of size H × W × 1, which is multiplied by the maximum depth value to obtain the prediction in meters. This simple decoder can generate as precise a depth map as other baseline structures. However, to further exploit local structures with fine details, we add skip connections with the proposed fusion module.
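A minimal PyTorch sketch of this simple decoder (before adding SFF) is given below, assuming N_C = 64 and a maximum depth of 10 m as used for NYU Depth V2; the exact layer configuration is illustrative.

```python
# A minimal sketch of the lightweight decoder: 1x1 channel reduction,
# consecutive bilinear upsampling, two convolutions, and a sigmoid scaled
# by the maximum depth.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightDecoder(nn.Module):
    def __init__(self, in_channels=512, nc=64, max_depth=10.0):
        super().__init__()
        self.max_depth = max_depth
        self.reduce = nn.Conv2d(in_channels, nc, kernel_size=1)   # 1x1 channel reduction
        self.head = nn.Sequential(                                 # two convs + sigmoid
            nn.Conv2d(nc, nc, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(nc, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, f_e4):                      # f_e4: (B, C4, H/32, W/32)
        x = self.reduce(f_e4)
        for _ in range(5):                        # 1/32 -> 1/16 -> 1/8 -> 1/4 -> 1/2 -> 1/1
            x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        return self.head(x) * self.max_depth      # sigmoid output scaled to meters

decoder = LightweightDecoder()
print(decoder(torch.randn(1, 512, 14, 18)).shape)   # torch.Size([1, 1, 448, 576])
```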
3.4 Selective Feature Fusion

We propose a Selective Feature Fusion (SFF) module to adaptively select and integrate local and global features by attaining an attention map for each feature. The detailed structure of SFF is illustrated in Figure 2. To match the dimensions of the decoded features F_D and the encoded features F_E, we first reduce the dimensions of the multi-scale local context features to N_C with a convolution layer. Then, these features are concatenated along the channel dimension and passed through two 3 × 3 convolution-batch normalization-ReLU layers. The final convolution and sigmoid layers produce a two-channel attention map, and each local and global feature is multiplied with its corresponding channel to focus on the significant locations. These re-weighted features are then added element-wise to construct a hybrid feature H_D. To strengthen the local continuity, we do not reduce the dimension of the 1/4 scale feature. We verify the effectiveness of the proposed decoder in Section 4.4.

Figure 2: Detailed description of the SFF module.
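The following is a minimal PyTorch sketch of the SFF module in Figure 2, assuming both inputs already have N_C channels; the width of the intermediate convolution layers is an assumption.

```python
# A minimal SFF sketch: concatenate local and global features, run two 3x3
# Conv-BN-ReLU layers, predict a two-channel attention map with a final conv +
# sigmoid, and fuse the re-weighted features by element-wise summation.
import torch
import torch.nn as nn

class SelectiveFeatureFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels // 2, 3, padding=1),
            nn.BatchNorm2d(channels // 2), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, 2, 3, padding=1), nn.Sigmoid())

    def forward(self, local_feat, global_feat):
        attn = self.body(torch.cat([local_feat, global_feat], dim=1))
        # channel 0 weights the local feature, channel 1 the global feature
        return local_feat * attn[:, 0:1] + global_feat * attn[:, 1:2]

sff = SelectiveFeatureFusion(64)
local_feat, global_feat = torch.randn(1, 64, 56, 72), torch.randn(1, 64, 56, 72)
print(sff(local_feat, global_feat).shape)   # torch.Size([1, 64, 56, 72])
```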
3.5 Vertical CutDepth

Recently, a depth-specific data augmentation method named CutDepth was proposed [Ishii and Yamashita, 2021], which replaces a part of the input image with the ground-truth depth map to provide diversity to the input and to enable the network to focus on high-frequency areas. In CutDepth, the coordinates (l, u) and size (w, h) of the cut region are randomly chosen. However, we believe that the vertical and horizontal directions should not be treated equally in depth estimation, based on the following observation. A previous study [Dijk and Croon, 2019] suggested that depth estimation networks mainly use the vertical position of an object in the image, rather than its apparent size or texture, to predict the depth of arbitrary obstacles. This motivates us to propose vertical CutDepth, which enhances the original CutDepth by preserving the vertical geometric information. In vertical CutDepth, the ground-truth depth map replaces an area of I with the same location in Y, but the crop is not applied along the vertical direction. Therefore, the coordinates of the replacement region (l, u) and its size (w, h) are calculated as follows:

$(l, u) = (\alpha \times W,\ 0)$
$(w, h) = (\max((W - \alpha \times W) \times \beta \times p,\ 1),\ H)$   (1)

where α and β are sampled from U(0, 1), and p is a hyperparameter set to a value in (0, 1]. By maintaining the vertical range of the input RGB image, the network can capture long-range vertical information for better prediction, as shown in the results. We set p to 0.75 after evaluating various settings of p (Section 4.4).
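A minimal sketch of vertical CutDepth following Eq. (1) is shown below; the (C, H, W) tensor layout and the function name are assumptions, and during training we apply it with 25% probability and p = 0.75 (Section 4.2).

```python
# Vertical CutDepth: the pasted region spans the full image height (u = 0,
# h = H); its left coordinate and width are drawn from alpha, beta ~ U(0, 1),
# with p in (0, 1] bounding the relative width.
import torch

def vertical_cutdepth(image, depth, p=0.75):
    """Replace a full-height vertical strip of `image` with the depth map."""
    _, H, W = image.shape
    alpha, beta = torch.rand(2).tolist()
    l = int(alpha * W)                              # left coordinate, u = 0
    w = max(int((W - alpha * W) * beta * p), 1)     # strip width, h = H
    out = image.clone()
    # depth is (1, H, W); it broadcasts over the RGB channels of the strip
    out[:, :, l:l + w] = depth[:, :, l:l + w]
    return out
```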
3.6 Training Loss

To measure the distance between the predicted output Ŷ and the ground-truth depth map Y, we use the scale-invariant log-scale loss [Eigen et al., 2014] to train the model. Let y_i* and y_i denote the i-th pixel of Ŷ and Y, respectively. The training loss is

$L = \frac{1}{n}\sum_{i} d_i^2 - \frac{1}{2n^2}\Big(\sum_{i} d_i\Big)^2$   (2)

where d_i = log y_i − log y_i*.
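Eq. (2) can be implemented directly; the sketch below assumes depth tensors of identical shape and adds a small epsilon to guard the logarithm (in practice, invalid ground-truth pixels are typically masked out first).

```python
# Scale-invariant log-scale loss of [Eigen et al., 2014], Eq. (2):
# L = mean(d^2) - 0.5 * mean(d)^2, with d = log(gt) - log(pred).
import torch

def silog_loss(pred, target, eps=1e-6):
    d = torch.log(target + eps) - torch.log(pred + eps)
    return torch.mean(d ** 2) - 0.5 * torch.mean(d) ** 2
```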
4 Experiments

To validate our approach, we perform several experiments on the NYU Depth V2 and SUN RGB-D datasets. We compare our model with existing methods through quantitative and qualitative evaluation, and an ablation study is conducted to show the effectiveness of each contribution. Additionally, we provide results on further datasets in the supplementary material.
4.1 Dataset

NYU Depth V2 [Silberman et al., 2012] contains 640 × 480 images and corresponding depth maps of various indoor scenes acquired with a Microsoft Kinect camera. We train our network on approximately 24K images with a random crop of 576 × 448 and test on 654 images. To facilitate a fair comparison, we perform the evaluation on the pre-defined center crop of Eigen [Eigen et al., 2014] with a maximum depth of 10 m.

SUN RGB-D [Song et al., 2015] contains approximately 10K RGB-D images of various indoor scenes captured by four different sensors, along with the corresponding depth and segmentation maps. We use this dataset only for evaluating pre-trained models; thus, only the official test set of 5050 images is used. Image sizes are not constant throughout this dataset, so we resize each image to the largest multiple of 32 below its size, pass the resized image to the network to predict the depth map, and then resize the prediction to the original image size.
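As an illustration of this evaluation protocol, the sketch below resizes an arbitrarily sized image down to the largest multiple of 32, predicts, and resizes the prediction back; the function name and the bilinear resampling mode are assumptions.

```python
# Evaluate on images whose size is not a multiple of 32 (e.g. SUN RGB-D).
import torch
import torch.nn.functional as F

def predict_full_resolution(model, image):
    """image: (B, 3, H, W) tensor with arbitrary H, W."""
    _, _, h, w = image.shape
    h32, w32 = (h // 32) * 32, (w // 32) * 32          # largest multiples of 32
    x = F.interpolate(image, size=(h32, w32), mode='bilinear', align_corners=False)
    depth = model(x)                                    # (B, 1, h32, w32)
    return F.interpolate(depth, size=(h, w), mode='bilinear', align_corners=False)
```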
4.2 Implementation Details

We implement the proposed network with the PyTorch framework. For training, we use the one-cycle learning rate strategy with an Adam optimizer. The learning rate increases from 3e-5 to 1e-4 following a poly LR schedule with a factor of 0.9 in the first half of the total iterations, and then decreases from 1e-4 to 3e-5 in the second half. The total number of epochs is set to 25 with a batch size of 12. We initialize our encoder with pre-trained weights from the MiT-b4 [Xie et al., 2021] network. The values of N_C, R_i, and C_i are 64, [8, 4, 2, 1], and [64, 128, 320, 512], respectively.

For data augmentation, the following strategies are applied with 50% probability in addition to the proposed vertical CutDepth: horizontal flip, random brightness (±0.2), contrast (±0.2), gamma (±20), hue (±20), saturation (±30), and value (±20). Vertical CutDepth itself is applied with p = 0.75 and 25% probability.
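One way to realize this schedule is a LambdaLR whose multiplier follows a poly curve up to the peak learning rate and back down; this is a sketch under the assumption that the "factor of 0.9" is the poly power, and it is not the exact training script.

```python
# One-cycle schedule: the multiplier rises from lr_min/lr_max to 1.0 over the
# first half of training with a poly curve of power 0.9, then falls back
# symmetrically over the second half.
import torch

def one_cycle_poly(total_steps, lr_min=3e-5, lr_max=1e-4, power=0.9):
    half = total_steps / 2.0
    ratio = lr_min / lr_max
    def multiplier(step):
        t = step / half if step <= half else (total_steps - step) / half
        t = max(0.0, min(1.0, t))                 # clamp to [0, 1]
        return ratio + (1.0 - ratio) * (t ** power)
    return multiplier

model = torch.nn.Linear(8, 1)                     # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, one_cycle_poly(total_steps=10000))
# call scheduler.step() once per training iteration
```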
4.3 Comparison with State-of-the-Arts

Method Params (M) δ1 ↑ δ2 ↑ δ3 ↑ AbsRel ↓ RMSE ↓ log10 ↓
Eigen [Eigen et al., 2014] 141 0.769 0.950 0.988 0.158 0.641 -
Fu [Fu et al., 2018] 110 0.828 0.965 0.992 0.115 0.509 0.051
Yin [Yin et al., 2019] 114 0.875 0.976 0.994 0.108 0.416 0.048
DAV [Huynh et al., 2020] 25 0.882 0.980 0.996 0.108 0.412 -
BTS [Lee et al., 2019] 47 0.885 0.978 0.994 0.110 0.392 0.047
AdaBins [Bhat et al., 2021] 78 0.903 0.984 0.997 0.103 0.364 0.044
DPT* [Ranftl et al., 2021] 123 0.904 0.988 0.998 0.110 0.357 0.045
Ours 62 0.915 0.988 0.997 0.098 0.344 0.042
Table 1: Performance on the NYU Depth V2 dataset. DPT* is trained with an extra dataset.

Method δ1 ↑ δ2 ↑ δ3 ↑ AbsRel ↓ RMSE ↓ log10 ↓
Yin [Yin et al., 2019] 0.696 0.912 0.973 0.183 0.541 0.082
BTS [Lee et al., 2019] 0.740 0.933 0.980 0.172 0.515 0.075
AdaBins [Bhat et al., 2021] 0.771 0.944 0.983 0.159 0.476 0.068
Ours 0.814 0.964 0.991 0.144 0.418 0.061
Table 2: Performance on the SUN RGB-D dataset with the NYU Depth V2 trained model. We test the model without any fine-tuning.

Figure 3: Qualitative comparison with previous works on the NYU Depth V2 dataset. (a) RGB, (b) BTS, (c) AdaBins, (d) DPT, (e) Ours, (f) GT.

NYU Depth V2. Table 1 presents the performance comparison on the NYU Depth V2 dataset. DPT [Ranftl et al., 2021] uses a much larger dataset of 1.4M images for training the model. As listed in the table, the proposed model shows state-of-the-art performance on most of the evaluation metrics, which we attribute to the proposed architecture and the enhanced depth-specific data augmentation method. Furthermore, our model achieves higher performance than the recently developed state-of-the-art models (AdaBins, DPT) with fewer parameters. This suggests that the combination of the trans-
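The metrics in Tables 1 and 2 are the standard depth-estimation measures; the sketch below assumes the usual definitions (threshold accuracies δ < 1.25^k, absolute relative error, RMSE, and mean log10 error) computed over valid pixels.

```python
# Standard monocular depth evaluation metrics (assumed definitions).
import torch

def depth_metrics(pred, gt):
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    ratio = torch.max(pred / gt, gt / pred)
    return {
        'd1': (ratio < 1.25).float().mean(),
        'd2': (ratio < 1.25 ** 2).float().mean(),
        'd3': (ratio < 1.25 ** 3).float().mean(),
        'AbsRel': ((pred - gt).abs() / gt).mean(),
        'RMSE': torch.sqrt(((pred - gt) ** 2).mean()),
        'log10': (torch.log10(pred) - torch.log10(gt)).abs().mean(),
    }
```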
4.4 Ablation Study

Effectiveness of the proposed decoder. Since our study aims to avoid computationally demanding decoders, we construct simple baselines and compare them with ours. Baseline-DConv consists of consecutive deconvolution-batch normalization-ReLU blocks to obtain the desired depth map. Baseline-UNet is an improved structure over Baseline-DConv that has skip connections between the encoder and decoder. As detailed in the table, our decoder achieves better performance than the baselines. Even without SFF, it already outperforms the other decoders while having fewer parameters. The powerful encoding ability of our encoder and the effectively designed decoder enable the network to produce a finely detailed depth map. In addition, the proposed SFF further improves the performance of our model.

We additionally compare with existing decoder architectures that integrate multi-scale features in the bottom part of Table 3. Despite the compactness of the proposed decoder, our network outperforms the other networks. Our decoder has only 0.66M parameters, whereas the MLP decoder [Xie et al., 2021], BTS [Lee et al., 2019], and DPT [Ranftl et al., 2021] have 3.19M, 5.79M, and 14.15M parameters, respectively, and are thus considerably heavier than ours. This indicates that we have effectively designed the restoring path for our encoder, which enables the proposed model to achieve strong performance with very few parameters.

Effectiveness of the vertical CutDepth. We perform an ablation study on the data augmentation method used to train the network. The results are shown in Table 5. The first row of the table represents the baseline, which is trained only with the traditional data augmentations excluding CutDepth, and the second row shows the result obtained by adopting the basic CutDepth method. Then, we apply the proposed vertical CutDepth with different choices of the hyperparameter p. As detailed in the table, CutDepth helps the model achieve slightly better performance than the baseline. However, applying vertical CutDepth yields a further improvement. This shows that exploiting vertical features enhances depth estimation accuracy compared with simply cropping a random area. In addition, the model achieves the best performance with p = 0.75.

Method δ1 ↑ δ2 ↑ δ3 ↑ AbsRel ↓ RMSE ↓ log10 ↓
Baseline 0.908 0.987 0.997 0.101 0.351 0.043
+ CutDepth 0.909 0.986 0.997 0.102 0.348 0.042
+ Ours (p=0.25) 0.911 0.988 0.997 0.102 0.354 0.043
+ Ours (p=0.50) 0.911 0.988 0.997 0.100 0.348 0.042
+ Ours (p=0.75) 0.915 0.988 0.998 0.098 0.343 0.042
+ Ours (p=1.00) 0.910 0.988 0.997 0.101 0.351 0.043
Table 5: Experimental results with data augmentation.

4.5 Robustness of the model

In this subsection, we demonstrate the robustness of the proposed method against natural image corruptions. Model robustness is essential for depth estimation because real-world images always have a high possibility of being corrupted to a certain degree. Under these circumstances, it is beneficial to design a robust model so that it can perform the given task without being critically affected. Following a previous study on the robustness of CNNs [Hendrycks and Dietterich, 2018], we test our model on images corrupted by 16 different methods. Each corruption is applied at five different intensities, and the performance is averaged over all test images and all five intensities.

Table 4 presents the depth estimation results for the corrupted images of the NYU Depth V2 test set. Due to space constraints, we provide the complete table in the supplementary material and present results for a few corruption types in Table 4. The results show that our model is clearly more robust to all types of corruption than the compared models. The experimental results indicate that our model shows stronger robustness and thus is more appropriate for safety-critical applications.
5 Conclusion

This paper proposes a new architecture for monocular depth estimation that delivers meaningful global and local features to generate a precisely estimated depth map. We further exploit a depth-specific data augmentation technique to improve the performance of the model, based on the knowledge that the use of vertical position is a crucial property of depth estimation. The proposed method improves over state-of-the-art performance on the NYU Depth V2 dataset. Moreover, extensive experimental results demonstrate the effectiveness and generalization ability of our network.
References

[Bhat et al., 2021] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4009-4018, 2021.

[Chen et al., 2019] Xiaotian Chen, Xuejin Chen, and Zheng-Jun Zha. Structure-aware residual pyramid network for monocular depth estimation. arXiv preprint arXiv:1907.06023, 2019.

[Dijk and Croon, 2019] Tom van Dijk and Guido de Croon. How do neural networks see depth in single images? In Proceedings of the IEEE International Conference on Computer Vision, pages 2183-2191, 2019.

[Dosovitskiy et al., 2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[Eigen et al., 2014] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems (NIPS), pages 2366-2374, 2014.

[Fu et al., 2018] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002-2011, 2018.

[Garg et al., 2016] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In Proceedings of the European Conference on Computer Vision (ECCV), pages 740-756. Springer, 2016.

[Geiger et al., 2013] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231-1237, 2013.

[Ghiasi et al., 2021] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2918-2928, 2021.

[Hendrycks and Dietterich, 2018] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2018.

[Huynh et al., 2020] Lam Huynh, Phong Nguyen-Ha, Jiri Matas, Esa Rahtu, and Janne Heikkilä. Guiding monocular depth estimation using depth-attention volume. In European Conference on Computer Vision, pages 581-597. Springer, 2020.

[Ishii and Yamashita, 2021] Yasunori Ishii and Takayoshi Yamashita. Cutdepth: Edge-aware data augmentation in depth estimation. arXiv preprint arXiv:2107.07684, 2021.

[Kim et al., 2020] Doyeon Kim, Sihaeng Lee, Janghyeon Lee, and Junmo Kim. Leveraging contextual information for monocular depth estimation. IEEE Access, 8:147808-147817, 2020.

[Koch et al., 2018] Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Korner. Evaluation of cnn-based single-image depth estimation methods. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.

[Lee et al., 2019] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019.

[Luo et al., 2016] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 4905-4913, 2016.

[Ramamonjisoa and Lepetit, 2019] Michael Ramamonjisoa and Vincent Lepetit. Sharpnet: Fast and accurate recovery of occluding contours in monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.

[Ranftl et al., 2021] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. arXiv preprint arXiv:2103.13413, 2021.

[Saxena et al., 2008] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824-840, 2008.

[Silberman et al., 2012] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision, pages 746-760. Springer, 2012.

[Song et al., 2015] Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[Swami et al., 2020] Kunal Swami, Prasanna Vishnu Bondada, and Pankaj Kumar Bajpai. Aced: Accurate and edge-consistent monocular depth estimation. In 2020 IEEE International Conference on Image Processing (ICIP), pages 1376-1380. IEEE, 2020.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.

[Xie et al., 2021] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203, 2021.

[Yin et al., 2019] Wei Yin, Yifan Liu, Chunhua Shen, and Youliang Yan. Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 5684-5693, 2019.

[Yoo et al., 2020] Jaejun Yoo, Namhyuk Ahn, and Kyung-Ah Sohn. Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8375-8384, 2020.

[Yun et al., 2019] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. CoRR, abs/1905.04899, 2019.

[Zheng et al., 2021] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6881-6890, 2021.
6 Appendix: Additional dataset results
In this section, we provide additional results on the KITTI [Geiger et al., 2013] and iBims-1 [Koch et al., 2018] datasets. KITTI is an outdoor depth estimation dataset, and iBims-1 is an indoor dataset.
6.1 KITTI
KITTI [Geiger et al., 2013] contains stereo camera images and corresponding 3D LiDAR scans of various driving scenes acquired by car-mounted sensors. The size of the RGB images is around 1224 × 368. We train our network on approximately 23K images with a random crop of 704 × 352 and test on 697 images. To compare our performance with previous works, we use the crop defined by Garg [Garg et al., 2016] and a maximum depth of 80 m for evaluation. The results on the KITTI dataset are shown in Table 6. As shown in the table, our model outperforms the previous studies.
Table 6: Performance on the KITTI dataset. DPT* is trained with an extra dataset.
6.2 iBims-1
iBims-1 [Koch et al., 2018] (independent Benchmark images and matched scans version 1) is a high-quality RGB-D dataset acquired using a digital single-lens reflex (DSLR) camera and a high-precision laser scanner. iBims-1 is characterized by accurate edges and planar regions, consistent depth values, and accurate absolute distances. We evaluate our NYU Depth V2 trained model without any fine-tuning. Results on the iBims-1 dataset are listed in Table 7.
In Table 8, we report the full table of results on the corrupted NYU Depth V2 dataset (Section 4.5 of the main paper).
We illustrate the detailed structures of Baseline-DConv and Baseline-UNet in Figure 5. We use transposed convolutions with parameters K = 3, S = 2, and P = 1 to upscale a given feature to twice its size. For Baseline-UNet, the encoder features F_E3, F_E2, and F_E1 are concatenated along the channel dimension.

Figure 5: Detailed structures of the Baseline-DConv and Baseline-UNet decoders.
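To make the baseline concrete, the following is a minimal sketch of Baseline-DConv with the stated K = 3, S = 2, P = 1 transposed convolutions; the channel widths and the output_padding = 1 (needed for an exact 2× upscale with these parameters) are assumptions.

```python
# Baseline-DConv: a stack of transposed-convolution + batch norm + ReLU blocks
# that doubles the spatial resolution at each step, followed by a prediction conv.
import torch.nn as nn

def upsample_block(in_ch, out_ch):
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                           padding=1, output_padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

baseline_dconv = nn.Sequential(                     # 1/32 -> full resolution
    upsample_block(512, 256), upsample_block(256, 128),
    upsample_block(128, 64), upsample_block(64, 32),
    upsample_block(32, 16), nn.Conv2d(16, 1, 3, padding=1))
```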