

Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth


Doyeon Kim¹, Woonghyun Ka², Pyunghwan Ahn¹, Donggyu Joo¹, Sewhan Chun² and Junmo Kim¹,²
¹School of Electrical Engineering, KAIST, South Korea
²Division of Future Vehicle, KAIST, South Korea
{doyeon kim, kwh950724, p.ahn, jdg105, alskdjfhgk, junmo.kim}@kaist.ac.kr

Abstract

Depth estimation from a single image is an important task that can be applied to various fields in computer vision, and it has grown rapidly with the development of convolutional neural networks. In this paper, we propose a novel structure and training strategy for monocular depth estimation to further improve the prediction accuracy of the network. We deploy a hierarchical transformer encoder to capture and convey the global context, and design a lightweight yet powerful decoder to generate an estimated depth map while considering local connectivity. By constructing connected paths between multi-scale local features and the global decoding stream with our proposed selective feature fusion module, the network can integrate both representations and recover fine details. In addition, the proposed decoder shows better performance than the previously proposed decoders, with considerably less computational complexity. Furthermore, we improve the depth-specific augmentation method by utilizing an important observation in depth estimation to enhance the model. Our network achieves state-of-the-art performance on the challenging depth dataset NYU Depth V2. Extensive experiments have been conducted to validate and show the effectiveness of the proposed approach. Finally, our model shows better generalization ability and robustness than other comparative models. The code will be available soon.

1 Introduction

Depth estimation is a challenging area that has been actively researched for many years. In particular, monocular depth estimation, which uses a single image to predict depth, is an ill-posed problem due to its inherent ambiguity. With the advent of convolutional neural networks (CNNs), many CNN-based approaches have been proposed for depth estimation and have yielded promising results [Bhat et al., 2021; Lee et al., 2019; Fu et al., 2018]. This paper proposes a new architecture and training strategy to further improve performance by focusing on the essential properties of monocular depth estimation.

As many previous papers have claimed [Chen et al., 2019; Kim et al., 2020], understanding both global and local contexts is crucial for successful depth estimation. Many cues in monocular depth estimation require understanding the scene on a global scale, such as the location of objects or the vanishing point. In addition, local connectivity of features is important because adjacent pixels tend to have similar values owing to their coplanar surfaces. Therefore, we propose a new global-local path network to fully extract meaningful features at diverse scales and effectively deliver them throughout the network. First, we adopt a hierarchical transformer as the encoder to model long-range dependencies and capture multi-scale context features. Prior studies have observed that the transformer enables the network to enlarge its receptive field [Xie et al., 2021]. Motivated by this, we leverage global relationships explicitly by building the global path with multiple transformer blocks. Second, we design an efficient decoder with an effective fusion module that enables local features to produce a fine depth map while preserving structural details. In contrast to the transformer, skip connections tend to create smaller receptive fields and help the network focus on short-distance information [Luo et al., 2016]. Thus, the proposed architecture is intended to take the complementary advantages of both the transformer and skip connections. This is enabled by aggregating the encoded and decoded features with an input-dependent fusion module, called selective feature fusion (SFF). The SFF module helps the model selectively focus on salient regions by estimating an attention map for both features with a very low computational burden. Compared to other decoders, our decoder achieves superior performance with much lower complexity.

Furthermore, we train the network with an additional task-specific data augmentation technique to boost model capability. Data augmentation plays an important role in optimizing the network and can improve model performance without additional computational cost. Nevertheless, data augmentation for depth estimation has rarely been adopted, unlike in other tasks. To the best of our knowledge, CutDepth [Ishii and Yamashita, 2021] is the first data augmentation method designed specifically for depth estimation. We revisit CutDepth with the observation that the vertical position of an object plays an essential role in monocular depth estimation [Dijk and Croon, 2019]. To this end, we propose a variant of CutDepth in which the crop is applied only along the horizontal axis, so that the model adaptively learns to capture vertical long-range information from the training sample.
Figure 1: Overall architecture of the proposed network. The main components of the architecture are the encoder, decoder, and skip connections with feature fusion modules.

The proposed network architecture and training strategy are evaluated on the popular depth estimation dataset NYU Depth V2 [Silberman et al., 2012] and exhibit state-of-the-art performance. We validate the model through extensive quantitative and qualitative experiments, and the suggested architecture and data augmentation method demonstrate their effectiveness. In addition, we observe that our network generalizes well under cross-dataset validation and shows robust performance against image corruption.

To summarize, our contributions are as follows:
• We propose a novel global-local path architecture for monocular depth estimation.
• We suggest an improved depth-specific data augmentation method to boost the performance of the model.
• Our network achieves state-of-the-art performance on the most popular dataset, NYU Depth V2, and shows higher generalization ability and robustness than previously developed networks.

2 Related Work

Monocular depth estimation is a computer vision task that predicts the depth map corresponding to a given input image. Learning-based monocular depth estimation has been studied following the seminal work of [Saxena et al., 2008], which used monocular cues to predict depth based on a Markov random field. Later, with the development of CNNs, depth estimation networks have utilized the encoded features of deep CNNs that generalize well to various tasks and achieved drastic performance improvements [Eigen et al., 2014; Huynh et al., 2020; Yin et al., 2019]. Recently, BTS [Lee et al., 2019] suggested a local planar guidance layer that outputs plane coefficients, which are then used in the full-resolution depth estimation. AdaBins [Bhat et al., 2021] reformulates the depth estimation problem as a classification task by dividing depth values into bins and shows state-of-the-art performance.

Transformer [Vaswani et al., 2017] adopts a self-attention mechanism with a multi-layer perceptron (MLP) to overcome the limitations of previous RNNs for natural language processing. Since its emergence, the transformer has gained considerable attention in various fields. In computer vision, the vision transformer (ViT) [Dosovitskiy et al., 2020] first used a transformer to solve image classification tasks. The success of ViT in image classification accelerated the introduction of the transformer into other tasks. SETR [Zheng et al., 2021] first employs ViT as a backbone and demonstrates the potential of the transformer in dense prediction tasks by achieving new state-of-the-art performance. [Xie et al., 2021] proposed SegFormer, a transformer-based segmentation framework with a simple and lightweight MLP decoder.

However, very few attempts have been made to employ a transformer for monocular depth estimation. AdaBins [Bhat et al., 2021] uses a minimized version of a vision transformer (mini-ViT) to calculate bin widths in an adaptive manner. DPT [Ranftl et al., 2021] employs ViT as an encoder to obtain a global receptive field at different stages and attaches a convolutional decoder to make dense predictions. However, both AdaBins and DPT use CNN-based encoders and transformers simultaneously, which increases the computational complexity. In addition, DPT is trained with an extra large-scale dataset. In contrast to these studies, our method uses only one encoder and does not require an additional dataset to achieve state-of-the-art performance.

Data augmentation plays an important role in preventing overfitting by increasing the effective amount of training data. Therefore, common methods such as flipping, color space transformation, cropping, and rotation are used in several tasks to improve network performance. However, although various methods, such as CutMix [Yun et al., 2019], Copy-Paste [Ghiasi et al., 2021] and CutBlur [Yoo et al., 2020], have been actively proposed for diverse tasks, depth-specific data augmentation has rarely been studied. To the best of our knowledge, CutDepth [Ishii and Yamashita, 2021] is the first approach that attempts to augment the data in depth estimation. We boost the performance of this depth-specific data augmentation method by emphasizing the vertical location in the image.

3 Methods

3.1 Global-Local Path Networks
Our depth estimation framework aims to predict the depth map Ŷ ∈ R^(H×W×1) from a given RGB image I ∈ R^(H×W×3). To this end, we suggest a new architecture with global and local feature paths through the entire network to generate Ŷ. The overall structure of our framework is depicted in Figure 1. Our transformer encoder [Xie et al., 2021] enables the model to learn global dependencies, and the proposed decoder successfully recovers the extracted features into the target depth map by constructing the local path through skip connections and the feature fusion module. We detail the proposed architecture in the following subsections.

3.2 Encoder

In the encoding phase, we aim to leverage rich global information from the RGB image. To achieve this, we adopt a hierarchical transformer as the encoder. First, the input image I is embedded as a sequence of patches with a 3 × 3 convolution operation. Then, the embedded patches are used as the input of the transformer block, which comprises multiple sets of self-attention and MLP-Conv-MLP layers with residual skips. To reduce the computational cost of the self-attention layer, the dimension of each attention head is reduced with ratio Ri for the ith block. Given the output, we perform patch merging with an overlapped convolution. This process allows us to generate multi-scale features during the encoding phase that can be utilized in the decoding phase. We use four transformer blocks, which generate 1/4, 1/8, 1/16, and 1/32 scale features with [C1, C2, C3, C4] dimensions.
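To make the encoding path more concrete, the following is a minimal PyTorch sketch of one transformer block in the spirit of the MiT/SegFormer encoder we adopt [Xie et al., 2021]; it is illustrative rather than the released implementation. The reduction with ratio Ri is realized here as a strided-convolution shrinkage of the key/value sequence, which is how the cited encoder family implements it; all module and argument names below are our own.

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    def __init__(self, dim, heads, sr_ratio):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Strided conv that shrinks the key/value sequence by sr_ratio
        # (assumption: mirrors the sequence-reduction trick of the cited encoder).
        self.sr = (nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
                   if sr_ratio > 1 else nn.Identity())
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):                      # x: (B, N, C) with N = h*w
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # reduced key/value sequence
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)
        return out

class MixFFN(nn.Module):
    """MLP-Conv-MLP block: pointwise MLP, 3x3 depthwise conv, pointwise MLP."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, h, w):
        b, n, c = x.shape
        x = self.fc1(x)
        x = x.transpose(1, 2).reshape(b, -1, h, w)
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        return self.fc2(torch.relu(x))

class EncoderBlock(nn.Module):
    def __init__(self, dim, heads, sr_ratio):
        super().__init__()
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = EfficientSelfAttention(dim, heads, sr_ratio)
        self.ffn = MixFFN(dim, dim * 4)

    def forward(self, x, h, w):                      # residual skips around both sublayers
        x = x + self.attn(self.n1(x), h, w)
        x = x + self.ffn(self.n2(x), h, w)
        return x

# Example: a first-stage block with C1 = 64 and R1 = 8 (values from Section 4.2).
# block = EncoderBlock(dim=64, heads=1, sr_ratio=8)
```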
3.3 Lightweight Decoder

The encoder transforms the input image I into the bottleneck feature FE4 with a size of (1/32)H × (1/32)W × C4. To obtain the estimated depth map, we construct a lightweight and effective decoder that restores the bottleneck feature to a size of H × W × 1. Most previous studies conventionally stack multiple bilinear upsampling, convolution, or deconvolution layers to recover the original size. However, we empirically observe that the model can achieve better performance with far fewer convolution and bilinear upsampling layers in the decoder if the restoring path is designed effectively. First, we reduce the channel dimension of the bottleneck feature to NC with a 1 × 1 convolution to limit the computational complexity. Then we use consecutive bilinear upsampling to enlarge the feature to a size of H × W × NC. Finally, the output is passed through two convolution layers and a sigmoid function to predict a depth map of size H × W × 1, and the depth map is multiplied by the maximum depth value to obtain a prediction in meters. This simple decoder can generate as precise a depth map as other baseline structures. However, to further exploit local structures with fine details, we add skip connections with the proposed fusion module.
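A minimal sketch of this plain decoding path (before adding the skip connections of Section 3.4) is given below, assuming NC = 64, C4 = 512, a maximum depth of 10 m for the NYU setup, and an input whose sides are divisible by 32; the released model may differ in its details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightDecoder(nn.Module):
    def __init__(self, c4=512, nc=64, max_depth=10.0):
        super().__init__()
        self.reduce = nn.Conv2d(c4, nc, kernel_size=1)   # 1x1 channel reduction to NC
        self.head = nn.Sequential(                        # Conv-ReLU-Conv prediction head
            nn.Conv2d(nc, nc, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(nc, 1, 3, padding=1))
        self.max_depth = max_depth

    def forward(self, fe4):
        x = self.reduce(fe4)                              # 1/32 scale, NC channels
        for _ in range(5):                                # five bilinear x2 steps: 1/32 -> 1/1
            x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        return torch.sigmoid(self.head(x)) * self.max_depth  # scale to meters
```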
3.4 Selective Feature Fusion

We propose a Selective Feature Fusion (SFF) module to adaptively select and integrate local and global features by attaining an attention map for each feature. The detailed structure of SFF is illustrated in Figure 2. To match the dimensions of the decoded features FD and the encoded features FE, we first reduce the dimensions of the multi-scale local context features to NC with a convolution layer. Then, these features are concatenated along the channel dimension and passed through two 3 × 3 Conv-batch normalization-ReLU layers. The final convolution and sigmoid layers produce a two-channel attention map, and each local and global feature is multiplied with its corresponding channel to focus on the significant locations. The multiplied features are then added element-wise to construct a hybrid feature HD. To strengthen local continuity, we do not reduce the dimension of the 1/4 scale feature. We verify the effectiveness of the proposed decoder in Section 4.4.

Figure 2: Detailed description of the SFF module.
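The following sketch illustrates the SFF computation described above; the module name and the intermediate channel widths are our assumptions (NC = 64), and the exact released layer configuration may differ. In the decoder, one such module would be applied at each skip-connection scale shown in Figure 1.

```python
import torch
import torch.nn as nn

class SelectiveFeatureFusion(nn.Module):
    def __init__(self, nc=64):
        super().__init__()
        self.body = nn.Sequential(
            # two 3x3 Conv-BN-ReLU layers over the concatenated features
            nn.Conv2d(2 * nc, nc, 3, padding=1), nn.BatchNorm2d(nc), nn.ReLU(inplace=True),
            nn.Conv2d(nc, nc // 2, 3, padding=1), nn.BatchNorm2d(nc // 2), nn.ReLU(inplace=True),
            nn.Conv2d(nc // 2, 2, 3, padding=1))          # two-channel attention map

    def forward(self, local_feat, global_feat):
        attn = torch.sigmoid(self.body(torch.cat([local_feat, global_feat], dim=1)))
        # One attention channel reweights each feature; the sum is the hybrid feature H_D.
        return local_feat * attn[:, 0:1] + global_feat * attn[:, 1:2]
```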
3.5 Vertical CutDepth

Recently, a depth-specific data augmentation method named CutDepth was proposed [Ishii and Yamashita, 2021], which replaces a part of the input image with the ground-truth depth map to provide diversity to the input image and enable the network to focus on high-frequency areas. In CutDepth, the coordinates (l, u) and size (w, h) of the cut region are chosen randomly. However, we believe that the vertical and horizontal directions should not be regarded equally for depth estimation, based on the following discovery. A previous study [Dijk and Croon, 2019] suggested that depth estimation networks mainly use the vertical position in the image, rather than apparent size or texture, to predict the depth of arbitrary obstacles. This motivates us to propose vertical CutDepth, which enhances the original CutDepth by preserving vertical geometric information. In vertical CutDepth, the ground-truth depth map replaces an area of I at the same location as in Y, but the crop is not applied along the vertical direction.
Method Params (M) δ1 ↑ δ2 ↑ δ3 ↑ AbsRel ↓ RMSE ↓ log10 ↓
Eigen [Eigen et al., 2014] 141 0.769 0.950 0.988 0.158 0.641 -
Fu [Fu et al., 2018] 110 0.828 0.965 0.992 0.115 0.509 0.051
Yin [Yin et al., 2019] 114 0.875 0.976 0.994 0.108 0.416 0.048
DAV [Huynh et al., 2020] 25 0.882 0.980 0.996 0.108 0.412 -
BTS [Lee et al., 2019] 47 0.885 0.978 0.994 0.110 0.392 0.047
Adabins[Bhat et al., 2021] 78 0.903 0.984 0.997 0.103 0.364 0.044
DPT* [Ranftl et al., 2021] 123 0.904 0.988 0.998 0.110 0.357 0.045
Ours 62 0.915 0.988 0.997 0.098 0.344 0.042
Table 1: Performance on the NYU Depth V2 dataset. DPT* is trained with an extra dataset.

Therefore, the coordinates (l, u) and size (w, h) of the replacement region are calculated as follows:

(l, u) = (α × W, 0)
(w, h) = (max((W − α × W) × β × p, 1), H)    (1)

where α and β are sampled from U(0, 1), and p is a hyperparameter set to a value in (0, 1]. By maintaining the vertical range of the input RGB image, the network can capture long-range information in the vertical direction for better prediction, as shown in the results. We set the value of p to 0.75 after evaluating various settings of p (Section 4.4).
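A sketch of vertical CutDepth following Eq. (1) is shown below. It assumes the ground-truth depth map has already been scaled to the value range of the RGB image before pasting, and the function name is ours.

```python
import torch

def vertical_cutdepth(image, depth, p=0.75):
    """image: (3, H, W) RGB tensor; depth: (1, H, W) ground-truth depth (image-range scaled)."""
    _, H, W = image.shape
    alpha, beta = torch.rand(1).item(), torch.rand(1).item()
    l = int(alpha * W)                               # left coordinate; u = 0
    w = max(int((W - alpha * W) * beta * p), 1)      # strip width; full height, h = H
    out = image.clone()
    # Replace a full-height vertical strip of the image with the depth map.
    out[:, :, l:l + w] = depth[:, :, l:l + w].expand(3, -1, -1)
    return out
```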
3.6 Training Loss

To measure the distance between the predicted output Ŷ and the ground-truth depth map Y, we use the scale-invariant log-scale loss [Eigen et al., 2014] to train the model. Let yi* and yi denote the ith pixels of Ŷ and Y, respectively. The training loss is defined as:

L = (1/n) Σi di² − (1/(2n²)) (Σi di)²    (2)

where di = log yi − log yi*.
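A direct implementation of Eq. (2) could look as follows; the small epsilon and the validity mask are practical assumptions for pixels without ground-truth depth.

```python
import torch

def silog_loss(pred, target, eps=1e-6):
    """pred, target: (B, 1, H, W) depth maps in meters."""
    valid = target > eps                      # ignore invalid (zero) ground-truth pixels
    d = torch.log(pred[valid] + eps) - torch.log(target[valid])
    n = d.numel()
    return (d ** 2).sum() / n - 0.5 * (d.sum() ** 2) / (n ** 2)
```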
4 Experiments

To validate our approach, we perform several experiments on the NYU Depth V2 and SUN RGB-D datasets. We compare our model with existing methods through quantitative and qualitative evaluation, and an ablation study is conducted to show the effectiveness of each contribution. Additionally, we provide results on additional datasets in the supplementary material.

4.1 Dataset

NYU Depth V2 [Silberman et al., 2012] contains 640 × 480 images and corresponding depth maps of various indoor scenes acquired using a Microsoft Kinect camera. We train our network on approximately 24K images with a random crop of 576 × 448 and test on 654 images. To facilitate a fair comparison, we perform the evaluation on the pre-defined center crop of Eigen [Eigen et al., 2014] with a maximum range of 10 m.

SUN RGB-D [Song et al., 2015] contains approximately 10K RGB-D images of various indoor scenes captured by four different sensors, along with the corresponding depth and segmentation maps. We use this dataset only for evaluating pre-trained models; thus, only the official test set of 5050 images is used. Image sizes are not constant throughout this dataset, so we resize each image to the largest multiple of 32 below its original size, pass the resized image to the network to predict the depth map, and then resize the prediction back to the original image size.
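This variable-size evaluation procedure can be sketched as follows; `model` is a placeholder for the trained network.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_full_resolution(model, image):
    """image: (1, 3, H, W) tensor; returns a (1, 1, H, W) depth prediction."""
    _, _, H, W = image.shape
    h32, w32 = (H // 32) * 32, (W // 32) * 32        # largest multiples of 32 below H, W
    resized = F.interpolate(image, size=(h32, w32), mode='bilinear', align_corners=False)
    depth = model(resized)
    return F.interpolate(depth, size=(H, W), mode='bilinear', align_corners=False)
```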
Method δ1 ↑ δ2 ↑ δ3 ↑ AbsRel ↓ RMSE ↓ log10 ↓
Yin [Yin et al., 2019] 0.696 0.912 0.973 0.183 0.541 0.082
BTS [Lee et al., 2019] 0.740 0.933 0.980 0.172 0.515 0.075
Adabins [Bhat et al., 2021] 0.771 0.944 0.983 0.159 0.476 0.068
Ours 0.814 0.964 0.991 0.144 0.418 0.061

Table 2: Performance on the SUN RGB-D dataset with the NYU Depth V2 trained model. We test the model without any fine-tuning.

4.2 Implementation Details

We implement the proposed network using the PyTorch framework. For training, we use a one-cycle learning rate strategy with an Adam optimizer. The learning rate increases from 3e-5 to 1e-4 following a poly LR schedule with a factor of 0.9 in the first half of the total iterations, and then decreases from 1e-4 to 3e-5 in the last half. The total number of epochs is set to 25 with a batch size of 12. We initialize our encoder with pre-trained weights from the MiT-b4 [Xie et al., 2021] network. The values of NC, Ri, and Ci are 64, [8, 4, 2, 1], and [64, 128, 320, 512], respectively.

For data augmentation, the following strategies are used together with the proposed vertical CutDepth, each with 50% probability: horizontal flip, random brightness (±0.2), contrast (±0.2), gamma (±20), hue (±20), saturation (±30), and value (±20). We apply vertical CutDepth with p = 0.75 and 25% probability.
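The exact schedule implementation is not specified; one plausible reading of the description above is the rise-and-fall polynomial interpolation below, which should be treated as an assumption rather than the released code.

```python
def one_cycle_poly_lr(step, total_steps, lr_min=3e-5, lr_max=1e-4, power=0.9):
    # First half: lr rises from lr_min to lr_max with a polynomial factor of 0.9.
    # Second half: lr falls back symmetrically from lr_max to lr_min.
    half = total_steps // 2
    if step < half:
        t = step / max(half, 1)
        return lr_min + (lr_max - lr_min) * (t ** power)
    t = (step - half) / max(total_steps - half, 1)
    return lr_max - (lr_max - lr_min) * (t ** power)

# Example usage: update the optimizer once per iteration.
# for g in optimizer.param_groups:
#     g["lr"] = one_cycle_poly_lr(global_step, total_steps)
```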
4.3 Comparison with State-of-the-Arts

NYU Depth V2. Table 1 presents the performance comparison on the NYU Depth V2 dataset. DPT [Ranftl et al., 2021] uses a much larger dataset of 1.4M images for training the model. As listed in the table, the proposed model shows state-of-the-art performance on most of the evaluation metrics, which we attribute to the proposed architecture and the enhanced depth-specific data augmentation method. Furthermore, our model achieves higher performance than the recently developed state-of-the-art models (Adabins, DPT) with fewer parameters.
This suggests that the combination of the transformer encoder and the proposed compact decoder makes an important contribution to estimating accurate depth maps in an efficient manner. The visualized results are shown in Figure 3. In the figure, our model shows accurate estimation of depth values for the provided example images and is more robust to various illumination conditions than the other methods.

Figure 3: Qualitative comparison with previous works on the NYU Depth V2 dataset.

SUN RGB-D. We test our network on an additional indoor dataset, SUN RGB-D, to show the generalization performance. The network is trained on the NYU Depth V2 dataset and evaluated on the SUN RGB-D test set without any fine-tuning. Table 2 compares the results with those obtained by comparative studies. The proposed approach outperforms the competing methods in all metrics. As shown in Figure 4, reasonable depth maps are generated by our model without additional training.

Figure 4: Examples of estimated depth maps on the SUN RGB-D dataset.

Method Params (M) δ1 ↑ δ2 ↑ δ3 ↑ AbsRel ↓ RMSE ↓
Baseline - Dconv 4.03 0.898 0.986 0.997 0.110 0.359
Baseline - UNet 4.95 0.901 0.987 0.997 0.109 0.363
Ours (w/o SFF) 0.38 0.905 0.986 0.997 0.104 0.357
Ours 0.66 0.908 0.987 0.997 0.101 0.351
[Xie et al., 2021] 3.19 0.893 0.983 0.995 0.112 0.379
[Lee et al., 2019] 5.79 0.906 0.985 0.997 0.102 0.356
[Ranftl et al., 2021] 14.15 0.907 0.987 0.997 0.103 0.354

Table 3: Comparison with multiple decoders. All results in this table are obtained from the same encoder.

4.4 Ablation Study

In this subsection, we validate the effectiveness of our approach through several experiments conducted on the NYU Depth V2 dataset.
Corruption Type Method δ1 ↑ δ2 ↑ δ3 ↑ AbsRel ↓ SqRel ↓ RMSE ↓ RMSElog ↓
BTS 0.885 0.978 0.994 0.110 0.066 0.392 0.142
Clean Adabins 0.903 0.984 0.997 0.103 0.057 0.364 0.130
Ours 0.915 0.988 0.997 0.098 0.049 0.344 0.124
BTS 0.223 0.384 0.543 0.435 0.824 1.589 0.743
Gaussian Noise Adabins 0.347 0.553 0.708 0.343 0.578 1.299 0.544
Ours 0.775 0.940 0.983 0.161 0.126 0.541 0.198
BTS 0.677 0.850 0.922 0.189 0.207 0.701 0.279
Motion Blur Adabins 0.697 0.859 0.927 0.180 0.182 0.643 0.262
Ours 0.807 0.946 0.981 0.139 0.103 0.494 0.183
BTS 0.697 0.864 0.932 0.181 0.198 0.689 0.263
Contrast Adabins 0.654 0.836 0.917 0.198 0.234 0.752 0.283
Ours 0.860 0.971 0.992 0.117 0.074 0.427 0.152
BTS 0.410 0.649 0.803 0.298 0.423 1.114 0.458
Snow Adabins 0.410 0.656 0.817 0.292 0.410 1.094 0.440
Ours 0.723 0.926 0.981 0.170 0.138 0.598 0.217
Table 4: Robustness experiment results on corrupted images of the NYU Depth V2 dataset. The results of BTS and Adabins are obtained from their distributed pre-trained weights.

Comparison with different decoder designs. Table 3 presents the comparison results with different decoder designs. In this experiment, vertical CutDepth is omitted so that only the effectiveness of the decoder is compared. As our study aims to avoid computationally demanding decoders, we construct simple baselines and compare them with ours. Baseline-Dconv consists of consecutive deconvolution-batch normalization-ReLU blocks to obtain the desired depth map. Baseline-UNet is an improved structure over Baseline-Dconv that has skip connections between the encoder and decoder. As detailed in the table, our decoder achieves better performance than the baselines. Even without the SFF module, it already shows better performance than the other decoders while having fewer parameters. The powerful encoding ability of our encoder and the effectively designed decoder enable the network to produce a finely detailed depth map. In addition, our proposed SFF brings a further performance gain.

We additionally provide a comparison with existing decoder architectures that integrate multi-scale features in the bottom part of Table 3. Despite the compactness of the proposed decoder, our network outperforms the other networks. Our decoder has only 0.66M parameters, while the MLP decoder [Xie et al., 2021], BTS [Lee et al., 2019], and DPT [Ranftl et al., 2021] decoders have 3.19M, 5.79M, and 14.15M parameters, respectively, and are thus far heavier than ours. This indicates that we have effectively designed the restoring path for our encoder, which enables the proposed model to record fine performance with very few parameters.

Effectiveness of the vertical CutDepth. We perform an ablation study on the data augmentation method used to train the network. The results are shown in Table 5. The first row of the table represents the baseline, which is trained only with the traditional data augmentation strategies and without CutDepth, and the second row shows the result obtained by adopting the basic CutDepth method. Then, we apply the proposed vertical CutDepth with different choices of the hyperparameter p. As detailed in the table, CutDepth helps the model achieve slightly better performance than the baseline. However, by applying vertical CutDepth, the network shows further improvement. This proves that utilizing vertical features enhances depth estimation accuracy compared to simply cropping a random area. In addition, the model achieves the best performance with a setting of p = 0.75.

Method δ1 ↑ δ2 ↑ δ3 ↑ AbsRel ↓ RMSE ↓ log10 ↓
Baseline 0.908 0.987 0.997 0.101 0.351 0.043
+ CutDepth 0.909 0.986 0.997 0.102 0.348 0.042
+ Ours (p=0.25) 0.911 0.988 0.997 0.102 0.354 0.043
+ Ours (p=0.50) 0.911 0.988 0.997 0.100 0.348 0.042
+ Ours (p=0.75) 0.915 0.988 0.998 0.098 0.343 0.042
+ Ours (p=1.00) 0.910 0.988 0.997 0.101 0.351 0.043

Table 5: Experimental results with data augmentation.

4.5 Robustness of the model

In this subsection, we demonstrate the robustness of the proposed method against natural image corruptions. Model robustness for depth estimation is essential because real-world images always have a high possibility of being corrupted to a certain degree. Under these circumstances, it is beneficial to design a robust model so that it can perform the given task without being critically affected. Following the previous study on the robustness of CNNs [Hendrycks and Dietterich, 2018], we test our model on images corrupted by 16 different methods. Each corruption is applied with five different intensities, and the performance is averaged over all test images and all five intensities.

Table 4 presents the depth estimation results for the corrupted images of the NYU Depth V2 test set. Due to space constraints, we provide the complete table in the supplementary material and present results on a few corruption types in Table 4. The results show that our model is clearly more robust to all types of corruption than the compared models. The experimental results indicate that our model shows stronger robustness and is thus more appropriate for safety-critical applications.
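The evaluation protocol above can be sketched as follows; the corruption functions and the metric evaluator are passed in as placeholders rather than tied to any specific library, so the function names and arguments are assumptions.

```python
def corruption_benchmark(model, images, gts, corruptions, evaluate, severities=(1, 2, 3, 4, 5)):
    """corruptions: dict name -> f(image, severity); evaluate: f(model, images, gts) -> metric dict."""
    results = {}
    for name, corrupt in corruptions.items():
        per_severity = []
        for s in severities:
            corrupted = [corrupt(img, s) for img in images]
            per_severity.append(evaluate(model, corrupted, gts))
        # Average every metric over the five severities (evaluate already averages over images).
        keys = per_severity[0].keys()
        results[name] = {k: sum(m[k] for m in per_severity) / len(per_severity) for k in keys}
    return results
```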
5 Conclusion

This paper proposes a new architecture for monocular depth estimation that delivers meaningful global and local features and generates a precisely estimated depth map. We further exploit a depth-specific data augmentation technique to improve the performance of the model by incorporating the knowledge that vertical position is a crucial cue in depth estimation. The proposed method improves over state-of-the-art performance on the NYU Depth V2 dataset. Moreover, extensive experimental results demonstrate the effectiveness and generalization ability of our network.

References

[Bhat et al., 2021] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4009–4018, 2021.

[Chen et al., 2019] Xiaotian Chen, Xuejin Chen, and Zheng-Jun Zha. Structure-aware residual pyramid network for monocular depth estimation. arXiv preprint arXiv:1907.06023, 2019.

[Dijk and Croon, 2019] Tom van Dijk and Guido de Croon. How do neural networks see depth in single images? In Proceedings of the IEEE International Conference on Computer Vision, pages 2183–2191, 2019.

[Dosovitskiy et al., 2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[Eigen et al., 2014] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems (NIPS), pages 2366–2374, 2014.

[Fu et al., 2018] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018.

[Garg et al., 2016] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In Proceedings of the European Conference on Computer Vision (ECCV), pages 740–756. Springer, 2016.

[Geiger et al., 2013] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.

[Ghiasi et al., 2021] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2918–2928, 2021.

[Hendrycks and Dietterich, 2018] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2018.

[Huynh et al., 2020] Lam Huynh, Phong Nguyen-Ha, Jiri Matas, Esa Rahtu, and Janne Heikkilä. Guiding monocular depth estimation using depth-attention volume. In European Conference on Computer Vision, pages 581–597. Springer, 2020.

[Ishii and Yamashita, 2021] Yasunori Ishii and Takayoshi Yamashita. Cutdepth: Edge-aware data augmentation in depth estimation. arXiv preprint arXiv:2107.07684, 2021.

[Kim et al., 2020] Doyeon Kim, Sihaeng Lee, Janghyeon Lee, and Junmo Kim. Leveraging contextual information for monocular depth estimation. IEEE Access, 8:147808–147817, 2020.

[Koch et al., 2018] Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Korner. Evaluation of cnn-based single-image depth estimation methods. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018.

[Lee et al., 2019] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019.

[Luo et al., 2016] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 4905–4913, 2016.

[Ramamonjisoa and Lepetit, 2019] Michael Ramamonjisoa and Vincent Lepetit. Sharpnet: Fast and accurate recovery of occluding contours in monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.

[Ranftl et al., 2021] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. arXiv preprint arXiv:2103.13413, 2021.

[Saxena et al., 2008] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2008.

[Silberman et al., 2012] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.

[Song et al., 2015] Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[Swami et al., 2020] Kunal Swami, Prasanna Vishnu Bon-
dada, and Pankaj Kumar Bajpai. Aced: Accurate
and edge-consistent monocular depth estimation. In
2020 IEEE International Conference on Image Processing
(ICIP), pages 1376–1380. IEEE, 2020.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki
Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you
need. In Advances in neural information processing sys-
tems, pages 5998–6008, 2017.
[Xie et al., 2021] Enze Xie, Wenhai Wang, Zhiding Yu,
Anima Anandkumar, Jose M Alvarez, and Ping Luo.
Segformer: Simple and efficient design for seman-
tic segmentation with transformers. arXiv preprint
arXiv:2105.15203, 2021.
[Yin et al., 2019] Wei Yin, Yifan Liu, Chunhua Shen, and
Youliang Yan. Enforcing geometric constraints of vir-
tual normal for depth prediction. In Proceedings of the
IEEE International Conference on Computer Vision, pages
5684–5693, 2019.
[Yoo et al., 2020] Jaejun Yoo, Namhyuk Ahn, and Kyung-
Ah Sohn. Rethinking data augmentation for image super-
resolution: A comprehensive analysis and a new strategy.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 8375–8384, 2020.
[Yun et al., 2019] Sangdoo Yun, Dongyoon Han, Seong Joon
Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo.
Cutmix: Regularization strategy to train strong classifiers
with localizable features. CoRR, abs/1905.04899, 2019.
[Zheng et al., 2021] Sixiao Zheng, Jiachen Lu, Hengshuang
Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu,
Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethink-
ing semantic segmentation from a sequence-to-sequence
perspective with transformers. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 6881–6890, 2021.
6 Appendix: Additional dataset results
In this section, we provide additional results on KITTI [Geiger et al., 2013] and iBims-1 [Koch et al., 2018] datasets. KITTI is
an outdoor depth estimation dataset and iBims-1 is an indoor dataset.
6.1 KITTI
KITTI [Geiger et al., 2013] contains stereo camera images and corresponding 3D LiDAR scans of various driving scenes
acquired by car mounted sensors. The size of RGB images is around 1224 × 368. We train our network using approximately
23K images on a random crop of 704 × 352 and test on 697 images. To compare our performance with previous works, we use
the crop as defined by Garg [Garg et al., 2016] and a maximum value of 80m for evaluation. The results on the KITTI dataset
are shown in Table 6. As shown in the table, our model outperforms other previous studies.

Method Params (M) δ1 ↑ δ2 ↑ δ3 ↑ AbsRel ↓ RMSE ↓ RMSE log ↓


Fu [Fu et al., 2018] 110 0.932 0.984 0.994 0.072 2.727 0.120
Yin [Yin et al., 2019] 114 0.938 0.984 0.998 0.072 3.258 0.117
BTS [Lee et al., 2019] 113 0.956 0.993 0.998 0.059 2.756 0.088
DPT* [Ranftl et al., 2021] 123 0.959 0.995 0.999 0.062 2.573 0.092
Adabins [Bhat et al., 2021] 78 0.964 0.995 0.999 0.058 2.360 0.088
Ours 62 0.967 0.996 0.999 0.057 2.297 0.086

Table 6: Performance on the KITTI dataset. DPT* is trained with an extra dataset.

6.2 iBims-1
iBims-1 [Koch et al., 2018] (independent Benchmark images and matched scans version 1) is a high quality RGB-D dataset
acquired using a digital single-lens reflex (DSLR) camera and high-precision laser scanner. iBims-1 can be characterized by
accurate edges and planar regions, consistent depth values, and accurate absolute distances. We evaluate with our NYU Depth
V2 trained model without any fine-tuning. Results on iBims-1 dataset are listed in Table 7.

Method δ1 ↑ δ2 ↑ δ3 ↑ AbsRel ↓ RMSE ↓ log10 ↓


VNL [Yin et al., 2019] 0.54 0.84 0.93 0.24 1.06 0.11
BTS [Lee et al., 2019] 0.53 0.84 0.94 0.24 1.10 0.12
DORN [Fu et al., 2018] 0.55 0.81 0.92 0.24 1.13 0.12
AdaBins [Bhat et al., 2021] 0.55 0.86 0.95 0.22 1.07 0.11
SharpNet [Ramamonjisoa and Lepetit, 2019] 0.59 0.84 0.94 0.26 1.07 0.11
ACED [Swami et al., 2020] 0.60 0.87 0.95 0.20 1.03 0.10
Ours 0.61 0.89 0.96 0.20 1.01 0.10

Table 7: Performance on the iBims-1 dataset.


7 Appendix: Robustness of the Model

In Table 8, we report a full table of the results on the corrupted NYU Depth V2 dataset. (Section 4.5 of the main paper)

Corruption Type Method δ1 ↑ δ2 ↑ δ3 ↑ AbsRel ↓ SqRel ↓ RMSE ↓ RMSElog ↓


BTS 0.885 0.978 0.994 0.110 0.066 0.392 0.142
Clean Adabins 0.903 0.984 0.997 0.103 0.057 0.364 0.130
Ours 0.915 0.988 0.997 0.098 0.049 0.344 0.124
BTS 0.223 0.384 0.543 0.435 0.824 1.589 0.743
Gaussian Noise Adabins 0.347 0.553 0.708 0.343 0.578 1.299 0.544
Ours 0.775 0.940 0.983 0.161 0.126 0.541 0.198
BTS 0.280 0.448 0.600 0.399 0.736 1.482 0.669
Shot Noise Adabins 0.436 0.653 0.791 0.293 0.460 1.141 0.454
Ours 0.791 0.949 0.986 0.152 0.114 0.523 0.189
BTS 0.116 0.249 0.420 0.504 1.006 1.818 0.875
Impulse Noise Adabins 0.377 0.589 0.736 0.327 0.541 1.246 0.518
Ours 0.760 0.938 0.984 0.167 0.131 0.556 0.204
BTS 0.456 0.633 0.756 0.302 0.500 1.159 0.492
Speckle Noise Adabins 0.639 0.834 0.918 0.200 0.244 0.805 0.294
Ours 0.830 0.965 0.991 0.136 0.091 0.467 0.168
BTS 0.677 0.850 0.922 0.189 0.207 0.701 0.279
Motion Blur Adabins 0.697 0.859 0.927 0.180 0.182 0.643 0.262
Ours 0.807 0.946 0.981 0.139 0.103 0.494 0.183
BTS 0.511 0.684 0.786 0.276 0.415 1.002 0.436
Defocus Blur Adabins 0.599 0.769 0.859 0.227 0.277 0.793 0.341
Ours 0.728 0.897 0.954 0.166 0.155 0.605 0.228
BTS 0.671 0.855 0.927 0.193 0.224 0.747 0.285
Glass Blur Adabins 0.743 0.914 0.967 0.165 0.149 0.619 0.223
Ours 0.770 0.914 0.978 0.155 0.132 0.573 0.202
BTS 0.530 0.688 0.779 0.274 0.422 0.989 0.437
Gaussian Blur Adabins 0.595 0.738 0.814 0.244 0.341 0.847 0.379
Ours 0.716 0.865 0.926 0.177 0.190 0.641 0.248
BTS 0.842 0.965 0.990 0.124 0.084 0.457 0.166
Brightness Adabins 0.862 0.972 0.994 0.117 0.073 0.427 0.152
Ours 0.899 0.984 0.997 0.104 0.055 0.369 0.133
BTS 0.697 0.864 0.932 0.181 0.198 0.689 0.263
Contrast Adabins 0.654 0.836 0.917 0.198 0.234 0.752 0.283
Ours 0.860 0.971 0.992 0.117 0.074 0.427 0.152
BTS 0.814 0.950 0.983 0.135 0.103 0.505 0.182
Saturation Adabins 0.839 0.965 0.991 0.125 0.086 0.465 0.162
Ours 0.896 0.984 0.996 0.107 0.058 0.374 0.134
BTS 0.786 0.942 0.983 0.154 0.124 0.532 0.195
JPEG Compression Adabins 0.804 0.954 0.988 0.153 0.115 0.493 0.182
Ours 0.860 0.973 0.994 0.123 0.073 0.413 0.153
BTS 0.410 0.649 0.803 0.298 0.423 1.114 0.458
Snow Adabins 0.410 0.656 0.817 0.292 0.410 1.094 0.440
Ours 0.723 0.926 0.981 0.170 0.138 0.598 0.217
BTS 0.705 0.878 0.945 0.176 0.168 0.642 0.250
Spatter Adabins 0.699 0.890 0.964 0.173 0.155 0.625 0.234
Ours 0.835 0.971 0.994 0.134 0.083 0.445 0.162
BTS 0.588 0.798 0.893 0.227 0.273 0.835 0.332
Fog Adabins 0.523 0.748 0.873 0.252 0.308 0.898 0.357
Ours 0.759 0.928 0.978 0.153 0.125 0.559 0.204
BTS 0.515 0.734 0.850 0.261 0.359 0.996 0.400
Frost Adabins 0.439 0.691 0.842 0.280 0.398 1.074 0.413
Ours 0.736 0.929 0.983 0.163 0.130 0.576 0.209

Table 8: Robustness experiment results on corrupted images of the NYU Depth V2 dataset.


8 Appendix: Detailed structure of baseline decoder

We illustrate the detailed structure of Baseline-DConv and Baseline-UNet in Figure 5. We use transposed convolutions with K = 3, S = 2, P = 1 to upscale a given feature to twice its spatial size. For Baseline-UNet, the encoder features FE3, FE2, and FE1 are concatenated along the channel dimension.

Figure 5: The detailed structure of (a) Baseline-DConv and (b) Baseline-UNet.
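A sketch of one such upscaling block is given below; output_padding=1 is an assumption needed for PyTorch's transposed convolution to exactly double the spatial size with these parameters.

```python
import torch.nn as nn

def deconv_block(in_ch, out_ch):
    # Transposed convolution (k=3, s=2, p=1) followed by BatchNorm and ReLU,
    # doubling the spatial resolution of the input feature map.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                           padding=1, output_padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))
```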
