AdaBins: Depth Estimation Using Adaptive Bins

Shariq Farooq Bhat (KAUST), Ibraheem Alhashim (KAUST), Peter Wonka (KAUST)
Abstract
Figure 2: Overview of our proposed network architecture. Our architecture consists of two major components: an encoder-decoder block and our proposed adaptive bin-width estimator block called AdaBins. The input to our network is an RGB image of spatial dimensions H and W, and the output is a single-channel h × w depth image (e.g., half the spatial resolution). (Diagram blocks: Standard Encoder-Decoder, AdaBins Module, Hybrid Regression.)
In this section, we describe our proposed architecture and the corresponding loss functions used.

3.1. Motivation

Our idea could be seen as a generalization of depth estimation via an ordinal regression network as proposed by Fu et al. [11].
The main modifications are changing the encoder from DenseNet [20] to EfficientNet B5 and using a different, appropriate loss function for the new architecture. In addition, the output of the decoder is a tensor x_d ∈ R^(h×w×C_d), not a single-channel image representing the final depth values. We refer to this tensor as the "decoded features". The second component is a key contribution in this paper, the AdaBins module. The input to the AdaBins module is the decoded-features tensor of size h × w × C_d, and the output tensor is of size h × w × 1. Due to memory limitations of current GPU hardware, we use h = H/2 and w = W/2 to facilitate better learning with larger batch sizes. The final depth map is computed by simply bilinearly upsampling to H × W × 1. The first block in the AdaBins module is called mini-ViT.

Patch size (p): 16, E: 128, num heads: 4, Layers: 4, C: 128, MLP size: 1024, Params: 5.8 M

Table 1: Mini-ViT architecture details.

Thus, the result of this convolution is a tensor of size h/p × w/p × E (assuming both h and w are divisible by p). The result is reshaped into a spatially flattened tensor x_p ∈ R^(S×E), where S = hw/p² serves as the effective sequence length for the transformer. We refer to this sequence of E-dimensional vectors as patch embeddings. Following common practice [2, 6], we add learned positional encodings to the patch embeddings before feeding them to the transformer. Our transformer is a small transformer encoder (see Table 1 for details) and outputs a sequence of output embeddings x_o ∈ R^(S×E). We use an MLP head over the first output embedding (we also experimented with a version that has an additional special token as first input, but did not see an improvement). The MLP head uses a ReLU activation and outputs an N-dimensional vector b′. Finally, we normalize the vector b′ such that it sums up to 1, to obtain the bin-widths vector b as follows:

b_i = (b′_i + ε) / Σ_{j=1}^{N} (b′_j + ε),    (1)

where ε is a small positive constant.
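To make the tensor shapes concrete, below is a minimal PyTorch sketch of the path just described: a p × p strided convolution produces the patch embeddings, a small transformer encoder processes the sequence, and an MLP head over the first output embedding yields the normalized bin widths of Eq. (1). The module name, the value of ε, the use of nn.TransformerEncoder, and the maximum sequence length are our own choices for illustration; the head layer sizes follow Table 7 in the appendix.

```python
import torch
import torch.nn as nn

class BinWidthEstimator(nn.Module):
    """Sketch of the patch-embedding -> transformer -> MLP-head path.

    Cd: channels of the decoded features, E: embedding size, p: patch size,
    N: number of bins. Defaults follow Table 1 (p=16, E=128, 4 layers,
    4 heads, MLP size 1024) and N=256; the value of eps is an assumption.
    """
    def __init__(self, Cd=128, E=128, p=16, N=256, max_patches=600, eps=1e-3):
        super().__init__()
        self.eps = eps
        self.patch_embed = nn.Conv2d(Cd, E, kernel_size=p, stride=p)
        # learned positional encodings, one per possible patch position
        self.pos_embed = nn.Parameter(torch.zeros(1, max_patches, E))
        layer = nn.TransformerEncoderLayer(d_model=E, nhead=4,
                                           dim_feedforward=1024, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Sequential(nn.Linear(E, 256), nn.LeakyReLU(0.01),
                                  nn.Linear(256, 256), nn.LeakyReLU(0.01),
                                  nn.Linear(256, N), nn.ReLU())   # sizes as in Table 7

    def forward(self, x_d):
        # x_d: decoded features of shape (batch, Cd, h, w)
        x = self.patch_embed(x_d)                     # (batch, E, h/p, w/p)
        x = x.flatten(2).transpose(1, 2)              # (batch, S, E), S = h*w / p^2
        x = x + self.pos_embed[:, : x.shape[1], :]
        x_o = self.transformer(x)                     # output embeddings (batch, S, E)
        b_prime = self.head(x_o[:, 0, :]) + self.eps  # MLP head over the first embedding
        return b_prime / b_prime.sum(dim=1, keepdim=True)  # Eq. (1): widths sum to 1

b = BinWidthEstimator()(torch.randn(2, 128, 240, 320))
print(b.shape, b.sum(dim=1))  # torch.Size([2, 256]); each row sums to 1
```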
Finally, we define the total loss as a combination of the pixel-wise loss and the bin-center Chamfer loss (Eq. 5).
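The pixel-wise term referred to as "SI" in Table 4 is a scale-invariant log loss in the spirit of Eigen et al. [8]; a minimal sketch of that standard formulation is given below (the exact variant and the weighting of the bin-center term used here may differ).

```python
import torch

def scale_invariant_loss(pred, target, lam=0.5, eps=1e-6):
    """Scale-invariant log loss in the standard form of Eigen et al. [8].

    pred, target: depth maps of shape (batch, H, W) with positive values.
    lam and eps are conventional defaults, not values taken from this paper.
    """
    d = torch.log(pred + eps) - torch.log(target + eps)
    n = d[0].numel()
    # (1/n) * sum(d^2)  -  (lam / n^2) * (sum(d))^2, averaged over the batch
    loss = (d.pow(2).sum(dim=(1, 2)) / n
            - lam * d.sum(dim=(1, 2)).pow(2) / n ** 2)
    return loss.mean()
```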
Method δ1 ↑ δ2 ↑ δ3 ↑ REL ↓ RMS ↓ log10 ↓
Eigen et al. [8] 0.769 0.950 0.988 0.158 0.641 –
Laina et al. [25] 0.811 0.953 0.988 0.127 0.573 0.055
Hao et al. [16] 0.841 0.966 0.991 0.127 0.555 0.053
Lee et al. [27] 0.837 0.971 0.994 0.131 0.538 –
Fu et al. [11] 0.828 0.965 0.992 0.115 0.509 0.051
SharpNet [34] 0.836 0.966 0.993 0.139 0.502 0.047
Hu et al. [19] 0.866 0.975 0.993 0.115 0.530 0.050
Chen et al. [4] 0.878 0.977 0.994 0.111 0.514 0.048
Yin et al. [47] 0.875 0.976 0.994 0.108 0.416 0.048
BTS [26] 0.885 0.978 0.994 0.110 0.392 0.047
DAV [22] 0.882 0.980 0.996 0.108 0.412 –
AdaBins (Ours) 0.903 0.984 0.997 0.103 0.364 0.044
Table 2: Comparison of performances on the NYU-Depth-v2 dataset. The reported numbers are from the corresponding
original papers. Best results are in bold, second best are underlined.
Table 3: Comparison of performances on the KITTI dataset. We compare our network against the state-of-the-art on this
dataset. The reported numbers are from the corresponding original papers. Measurements are made for the depth range from
0m to 80m. Best results are in bold, second best are underlined.
Loss δ1 ↑ δ2 ↑ δ3 ↑ REL↓ RMS↓ log10↓
L1/SSIM 0.888 0.980 0.995 0.107 0.384 0.046
SI 0.897 0.984 0.997 0.106 0.368 0.044
SI+Bins 0.903 0.984 0.997 0.103 0.364 0.044

Table 4: Comparison of performance with respect to the choice of loss function.

We use the SUN RGB-D dataset [39] only for cross-evaluating pre-trained models on the official test set of 5050 images. We do not use it for training.

Evaluation metrics. We use the standard six metrics used in prior work [8] to compare our method against the state of the art. These error metrics are defined as: average relative error (REL): (1/n) Σ_p |y_p − ŷ_p| / y_p; root mean squared error (RMS): sqrt((1/n) Σ_p (y_p − ŷ_p)²); average log10 error: (1/n) Σ_p |log10(y_p) − log10(ŷ_p)|; and threshold accuracy (δ_i): % of y_p s.t. max(y_p/ŷ_p, ŷ_p/y_p) = δ < thr for thr = 1.25, 1.25², 1.25³; where y_p is a pixel in the ground-truth depth image y, ŷ_p is a pixel in the predicted depth image ŷ, and n is the total number of pixels in each depth image. Additionally, for KITTI we use the two standard metrics Squared Relative Difference (Sq. Rel): (1/n) Σ_p ‖y_p − ŷ_p‖² / y_p, and RMSE log: sqrt((1/n) Σ_p ‖log y_p − log ŷ_p‖²).
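These definitions translate directly into code. The following NumPy sketch (function name ours) computes all of the above for a pair of depth maps, assuming invalid pixels have already been masked out.

```python
import numpy as np

def compute_metrics(gt, pred):
    """Standard depth-estimation metrics as defined above.

    gt, pred: flattened arrays of valid ground-truth / predicted depths (> 0).
    """
    thresh = np.maximum(gt / pred, pred / gt)
    d1 = (thresh < 1.25).mean()
    d2 = (thresh < 1.25 ** 2).mean()
    d3 = (thresh < 1.25 ** 3).mean()

    rel = np.mean(np.abs(gt - pred) / gt)                           # REL
    rms = np.sqrt(np.mean((gt - pred) ** 2))                        # RMS
    log10 = np.mean(np.abs(np.log10(gt) - np.log10(pred)))          # average log10 error

    sq_rel = np.mean(((gt - pred) ** 2) / gt)                       # Sq. Rel (KITTI)
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))   # RMSE log (KITTI)

    return dict(d1=d1, d2=d2, d3=d3, rel=rel, rms=rms,
                log10=log10, sq_rel=sq_rel, rmse_log=rmse_log)
```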
4.2. Implementation details

We implement the proposed network in PyTorch [33]. For training, we use the AdamW optimizer [30] with weight decay 10⁻². We use the 1-cycle policy [38] for the learning rate with max_lr = 3.5 × 10⁻⁴ and a linear warm-up phase.
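This training setup maps onto standard PyTorch utilities roughly as follows. Only the optimizer type, weight decay, peak learning rate, and the use of a 1-cycle schedule come from the text above; the model, data, warm-up fraction, and iteration counts are placeholders.

```python
import torch
import torch.nn as nn

# Stand-ins for the real model and data pipeline (not part of the paper).
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
steps_per_epoch, num_epochs = 10, 2            # tiny placeholder values

optimizer = torch.optim.AdamW(model.parameters(), lr=3.5e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3.5e-4,                             # peak learning rate from the text
    total_steps=steps_per_epoch * num_epochs,
    pct_start=0.3,                             # warm-up fraction: placeholder
)

for step in range(steps_per_epoch * num_epochs):
    rgb = torch.randn(2, 3, 64, 64)            # dummy batch
    target = torch.rand(2, 1, 64, 64) + 0.1
    loss = nn.functional.l1_loss(model(rgb), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                           # OneCycleLR is stepped once per batch
```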
Method δ1 ↑ δ2 ↑ δ3 ↑ REL↓ RMS↓ log10↓
Chen [4] 0.757 0.943 0.984 0.166 0.494 0.071
Yin [47] 0.696 0.912 0.973 0.183 0.541 0.082
BTS [26] 0.740 0.933 0.980 0.172 0.515 0.075
Ours 0.771 0.944 0.983 0.159 0.476 0.068

Table: Cross-dataset generalization to the SUN RGB-D test set [39]. All models are trained on NYU-Depth-v2 and evaluated without fine-tuning.
(Figure: qualitative comparison; panels (a) RGB, (b) BTS [26], (c) DAV [22], (d) Ours, (e) GT.)
We train our model with different numbers of bins N and measure the performance in terms of the Absolute Relative Error metric. Results are plotted in Fig. 6. Interestingly, starting from N = 20, the error first increases with increasing N and then decreases significantly. As we keep increasing N above 256, the gain in performance starts to diminish. We use N = 256 for our final model.

Loss function: Table 4 lists performance corresponding to the three choices of loss function. Firstly, the L1/SSIM combination does not lead to state-of-the-art performance in our case. Secondly, we trained our network with and without the proposed Chamfer loss (Eq. 5). Introducing the Chamfer loss clearly gives a boost to the performance; for example, it reduces the Absolute Relative Error from 10.6% to 10.3%.
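The Chamfer loss (Eq. 5) encourages the predicted bin centers to follow the distribution of depth values in the ground-truth image. As a rough illustration, a bi-directional Chamfer distance between those two sets could be sketched as follows (our own minimal formulation, not necessarily the exact Eq. 5).

```python
import torch

def chamfer_loss(bin_centers, gt_depths):
    """Bi-directional Chamfer distance between two 1-D point sets.

    bin_centers: (N,) predicted depth-bin centers for one image.
    gt_depths:   (M,) ground-truth depth values of the valid pixels
                 (in practice one would subsample pixels to save memory).
    """
    diff = (bin_centers[:, None] - gt_depths[None, :]) ** 2  # (N, M) pairwise squared distances
    loss_bins_to_gt = diff.min(dim=1).values.mean()          # each bin center to its nearest depth
    loss_gt_to_bins = diff.min(dim=0).values.mean()          # each depth to its nearest bin center
    return loss_bins_to_gt + loss_gt_to_bins

# Example with hypothetical values
centers = torch.linspace(0.5, 10.0, steps=256)
depths = torch.rand(5000) * 9.5 + 0.5
print(chamfer_loss(centers, depths))
```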
5. Conclusion

We introduced a new architecture block, called AdaBins, for depth estimation from a single RGB image. AdaBins leads to a decisive improvement in the state of the art for the two most popular datasets, NYU and KITTI. In future work, we would like to investigate whether global processing of information at a high resolution can also improve performance on other tasks, such as segmentation, normal estimation, and 3D reconstruction from multiple images.
References

[1] Ibraheem Alhashim and Peter Wonka. High quality monocular depth estimation via transfer learning. CoRR, abs/1812.11941, 2018. 2, 3, 5

[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 213–229, Cham, 2020. Springer International Publishing. 2, 3, 4

[3] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018. 11

[4] Xiaotian Chen, Xuejin Chen, and Zheng-Jun Zha. Structure-aware residual pyramid network for monocular depth estimation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 694–700. International Joint Conferences on Artificial Intelligence Organization, 7 2019. 6, 7, 8

[5] Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. Meshed-memory transformer for image captioning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. 3

[6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 2, 4

[7] Ruofei Du, Eric Lee Turner, Maksym Dzitsiuk, Luca Prasso, Ivo Duarte, Jason Dourgarian, Joao Afonso, Jose Pascoal, Josh Gladstone, Nuno Moura e Silva Cruces, Shahram Izadi, Adarsh Kowdle, Konstantine Nicholas John Tsotsos, and David Kim. Depthlab: Real-time 3d interaction with depth maps for mobile augmented reality. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, page 15, 2020. 1

[8] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014. 2, 5, 6

[9] H. Fan, H. Su, and L. Guibas. A point set generation network for 3d object reconstruction from a single image. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2463–2471, 2017. 5

[10] Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. 2015 IEEE International Conference on Computer Vision (ICCV), pages 2758–2766, 2015. 2

[11] Huan Fu, Mingming Gong, Chaohui Wang, Nematollah Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018. 2, 3, 5, 6

[12] Yukang Gan, Xiangyu Xu, Wenxiu Sun, and Liang Lin. Monocular depth estimation with affinity, vertical pooling, and label enhancement. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, pages 232–247, Cham, 2018. Springer International Publishing. 6

[13] Ravi Garg, Vijay Kumar B.G., Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 740–756, Cham, 2016. Springer International Publishing. 5

[14] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. I. J. Robotics Res., 32:1231–1237, 2013. 1, 2, 5

[15] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6602–6611, 2017. 2, 6

[16] Zhixiang Hao, Yu Li, Shaodi You, and Feng Lu. Detail preserving depth estimation from a single image using attention guided networks. 2018 International Conference on 3D Vision (3DV), pages 304–313, 2018. 2, 6

[17] Caner Hazirbas, Lingni Ma, Csaba Domokos, and Daniel Cremers. Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture. In ACCV, 2016. 1

[18] Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. Image captioning: Transforming objects into words. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 11137–11147. Curran Associates, Inc., 2019. 3

[19] Junjie Hu, Mete Ozay, Yan Zhang, and Takayuki Okatani. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1043–1051, 2019. 2, 6

[20] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017. 4

[21] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2821–2830, 2018. 2

[22] Lam Huynh, Phong Nguyen-Ha, Jiri Matas, Esa Rahtu, and Janne Heikkila. Guiding monocular depth estimation using depth-attention volume. arXiv preprint arXiv:2004.02760, 2020. 1, 2, 3, 6, 7, 11, 12

[23] Allison Janoch, Sergey Karayev, Yangqing Jia, Jonathan T Barron, Mario Fritz, Kate Saenko, and Trevor Darrell. A category-level 3d object dataset: Putting the kinect to work. In Consumer depth cameras for computer vision, pages 141–165. Springer, 2013. 5

[24] Yevhen Kuznietsov, Jörg Stückler, and Bastian Leibe. Semi-supervised deep learning for monocular depth map prediction. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2215–2223, 2017. 6

[25] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. 2016 Fourth International Conference on 3D Vision (3DV), pages 239–248, 2016. 2, 6

[26] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019. 1, 2, 5, 6, 7, 8, 11, 12, 13

[27] Wonwoo Lee, Nohyoung Park, and Woontack Woo. Depth-assisted real-time 3d object detection for augmented reality. ICAT'11, 2:126–132, 2011. 1, 6

[28] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2Noise: Learning image restoration without clean data. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2965–2974, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. 2

[29] Fayao Liu, Chunhua Shen, Guosheng Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38:2024–2039, 2016. 6

[30] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In ICLR, 2019. 6

[31] Francesc Moreno-Noguer, Peter N. Belhumeur, and Shree K. Nayar. Active refocusing of images and videos. ACM Trans. Graph., 26(3), July 2007. 1

[32] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4055–4064, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. 2

[33] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 8026–8037. Curran Associates, Inc., 2019. 6

[34] Michael Ramamonjisoa and Vincent Lepetit. Sharpnet: Fast and accurate recovery of occluding contours in monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2019. 6

[35] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing. 2

[36] Ashutosh Saxena, Sung H. Chung, and Andrew Y. Ng. Learning depth from single monocular images. In Proceedings of the 18th International Conference on Neural Information Processing Systems, NIPS'05, pages 1161–1168, Cambridge, MA, USA, 2005. MIT Press. 6

[37] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Computer Vision – ECCV 2012, pages 746–760, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. 1, 2, 5, 11

[38] Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of residual networks using large learning rates. CoRR, abs/1708.07120, 2017. 6

[39] S. Song, S. P. Lichtenberg, and J. Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 567–576, 2015. 5, 7, 11

[40] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. PMLR, 2019. 3

[41] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. Demon: Depth and motion network for learning monocular stereo. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5622–5631, 2017. 2

[42] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13:600–612, 2004. 5

[43] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 2

[44] J. Xiao, A. Owens, and A. Torralba. Sun3d: A database of big spaces reconstructed using sfm and object labels. In 2013 IEEE International Conference on Computer Vision, pages 1625–1632, 2013. 5

[45] Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5354–5362, 2017. 2

[46] Dan Xu, Wei Wang, Hao Tang, Hong Liu, Nicu Sebe, and Elisa Ricci. Structured attention guided convolutional neural fields for monocular depth estimation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3917–3925, 2018. 2

[47] Wei Yin, Yifan Liu, Chunhua Shen, and Youliang Yan. Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019. 6, 7
[48] Huizhong Zhou, Benjamin Ummenhofer, and Thomas Brox. Deeptam: Deep tracking and mapping. In Proceedings of the European Conference on Computer Vision (ECCV), pages 822–838, 2018. 2

A. Appendix

A.1. Geometric Consistency

Layer | Input Dimension | Output Dimension | Activation
FC | E | 256 | LeakyReLU (negative slope = 0.01)
FC | 256 | 256 | LeakyReLU (negative slope = 0.01)
FC | 256 | N | ReLU

Table 7: Architecture details of the MLP head. FC: Fully Connected layer, E: Embedding dimension, N: Number of bins.
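Table 7 maps directly onto a three-layer module; a minimal PyTorch rendering (function name ours) is:

```python
import torch.nn as nn

def mlp_head(E: int, N: int) -> nn.Sequential:
    """MLP head as listed in Table 7: FC(E->256), FC(256->256), FC(256->N)."""
    return nn.Sequential(
        nn.Linear(E, 256),
        nn.LeakyReLU(negative_slope=0.01),
        nn.Linear(256, 256),
        nn.LeakyReLU(negative_slope=0.01),
        nn.Linear(256, N),
        nn.ReLU(),
    )

head = mlp_head(E=128, N=256)  # E from Table 1, N = 256 for the final model
```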
(Figure: qualitative comparison; columns RGB, DAV [22], BTS [26], Ours.)
Figure 10: Qualitative comparison of generalization from NYU-Depth-v2 to the SUN RGB-D dataset (columns: RGB, BTS [26], Ours, GT). Darker pixels are farther. Missing ground truth values are shown in white.