NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video
Jiaming Sun1,2∗  Yiming Xie1∗  Linghao Chen1  Xiaowei Zhou1  Hujun Bao1†
1 Zhejiang University   2 SenseTime Research
arXiv:2104.00681v1 [cs.CV] 1 Apr 2021
Abstract
In this paper, we propose a novel framework for real-time monocular reconstruction named NeuralRecon that jointly reconstructs and fuses the 3D geometry directly in the volumetric TSDF representation. Given a sequence of monocular images and their corresponding camera poses estimated by a SLAM system, NeuralRecon incrementally reconstructs local geometry in a view-independent 3D volume instead of view-dependent depth maps. Specifically, it unprojects the image features to form a 3D feature volume and then uses sparse convolutions to process the feature volume and output a sparse TSDF volume. With a coarse-to-fine design, the predicted TSDF is gradually refined at each level. By directly reconstructing the implicit surface (TSDF), the network is able to learn the local smoothness and global shape priors of natural 3D surfaces. Different from depth-based methods that predict depth maps for each key frame separately, the surface geometry within a local fragment window is jointly predicted in NeuralRecon, and thus locally coherent geometry estimates can be produced. To make the current-fragment reconstruction globally consistent with the previously reconstructed fragments, a learning-based TSDF fusion module using a Gated Recurrent Unit (GRU) is proposed. The GRU fusion makes the current-fragment reconstruction conditioned on the previously reconstructed global volume, yielding a joint reconstruction and fusion approach. As a result, the reconstructed mesh is dense, accurate and globally coherent in scale. Furthermore, predicting the volumetric representation also removes the redundant computation of depth-based methods, which allows us to use a larger 3D CNN while maintaining real-time performance.

We validate our system on the ScanNet and 7-Scenes datasets. The experimental results show that NeuralRecon outperforms multiple state-of-the-art multi-view depth estimation methods and the volume-based reconstruction method Atlas [30] by a large margin, while achieving real-time performance at 33 key frames per second, ∼10× faster than Atlas. As shown in the supplementary video, our method is able to reconstruct large-scale 3D scenes from a video stream on a laptop GPU in real-time. To the best of our knowledge, this is the first learning-based system that is able to reconstruct dense and coherent 3D scene geometry in real-time.

2. Related Work

Multi-view Depth Estimation. The most related line of research is real-time methods for multi-view depth estimation. Before the age of deep learning, many renowned works in monocular 3D reconstruction [47, 21, 38, 34] achieved good performance with plane-sweeping stereo and depth filters under the assumption of photo-consistency. These methods estimate depth maps for individual frames, causing redundant computation. [46, 51] optimize this line of research towards low power consumption on mobile platforms. Learning-based methods for real-time multi-view depth estimation try to alleviate the photo-consistency assumption with a data-driven approach. Notably, MVDepthNet [48] and Neural RGB-D [24] use 2D CNNs to process the 2D depth cost volume constructed from multi-view image features. CNMNet [26] further leverages the planar structure of indoor scenes to constrain the surface normals computed from the predicted depth maps and obtain smooth depth estimates. These learning-based methods use 2D CNNs to process the depth cost volume to maintain a low computational cost for near real-time performance.

When the input images are high-resolution and offline computation is allowed, multi-view depth estimation is also known as the Multiple View Stereo (MVS) problem. PatchMatch-based methods [56, 37] have achieved impressive accuracy and are still the most popular methods applicable to high-resolution images. Learning-based approaches in MVS have recently dominated several benchmarks [2, 20] in terms of accuracy, but are limited to processing mid-resolution images due to GPU memory constraints. Different from the real-time methods, 3D cost volumes are constructed and 3D CNNs are used to process the cost volume, as proposed in MVSNet [53]. Some recent works [12, 4] improve this pipeline with a coarse-to-fine approach. A similar design can also be found in many learning-based SLAM systems [45, 57, 42, 44].

All the above-mentioned works adopt single-view depth maps as intermediate representations. SurfaceNet [15, 16] takes a different approach and uses a unified volumetric representation to predict volume occupancy. Recently, Atlas [30] also proposed a volumetric design and directly predicts TSDF and semantic labels with a 3D CNN. As an offline method, Atlas aggregates the image features of the entire sequence and then predicts the global TSDF volume only once with a decoder module. We further elaborate the relationship between the proposed method and Atlas in the supplementary material. The proposed method is also related to [5, 18] in terms of using recurrent networks for multi-view feature fusion. However, their recurrent fusion is applied only to global features and their focus is the reconstruction of single objects.

3D Surface Reconstruction. After depth maps are estimated and converted to point clouds, the remaining task for 3D reconstruction is to estimate the 3D surface position and produce the reconstructed mesh. In an offline MVS pipeline [37], Poisson reconstruction [19] and Delaunay triangulation [22] are often used for this purpose. Proposed by the seminal work KinectFusion [31], incremental volumetric TSDF fusion [7] is widely adopted in real-time reconstruction scenarios due to its simplicity and parallelization capability. [32, 10] improve KinectFusion by making it more scalable and robust. RoutedFusion [49, 50] changes the fusion operation from a simple linear addition into a data-dependent process.
Figure 2. NeuralRecon architecture. NeuralRecon predicts the TSDF with a three-level coarse-to-fine approach that gradually increases the density of the sparse voxels. Key-frame images in the local fragment are first passed through the image backbone to extract multi-level features. These image features are then back-projected along each ray and aggregated into a 3D feature volume F_t^l, where l represents the level index. At the first level (l = 1), a dense TSDF volume S_t^1 is predicted. At the second and third levels, the upsampled S_t^{l-1} from the previous level is concatenated with F_t^l and used as the input to the GRU Fusion and MLP modules. A feature volume defined in the world frame is maintained at each level as the global hidden state of the GRU. At the last level, the output S_t^l is used to replace the corresponding voxels in the global TSDF volume S_t^g, yielding the final reconstruction at time t.
Neural Implicit Representations. Recently, neural implicit representations [29, 33, 36, 17, 54, 25] have advanced significantly. Our work also learns a neural implicit representation by predicting SDF values with a neural network from the encoded image features, similar to PIFu [36]. The key difference is that we use sparse 3D convolution to predict a discrete TSDF volume, instead of querying an MLP with image features and 3D coordinates.
3. Methods

Given a sequence of monocular images {I_t} and camera pose trajectory {ξ_t} ∈ SE(3) provided by a SLAM system, the goal is to reconstruct dense 3D scene geometry accurately in real-time. We denote the global TSDF volume to reconstruct as S_t^g, where t represents the current time step. The system architecture is illustrated in Fig. 2.

3.1. Key Frame Selection

To achieve real-time 3D reconstruction that is suitable for interactive applications, the reconstruction process needs to be incremental and the input images should be processed sequentially in local fragments [40]. We seek to find a set of suitable key frames from the incoming image stream as input for the networks. To provide enough motion parallax while keeping multi-view co-visibility for reconstruction, the selected key frames should be neither too close to nor too far from each other. Following [13], a new incoming frame is selected as a key frame if its relative translation is greater than t_max and its relative rotation angle is greater than R_max. A window with N key frames is defined as a local fragment. After key frames are selected, a cubic-shaped fragment bounding volume (FBV) that encloses all the key-frame view frustums is computed with a fixed maximum depth range d_max in each view. Only the region within the FBV is considered during the reconstruction of each fragment.

3.2. Joint Fragment Reconstruction and Fusion

We propose to simultaneously reconstruct the TSDF volume of a local fragment S_t^l and fuse it with the global TSDF volume S_t^g with a learning-based approach. The joint reconstruction and fusion is carried out in the local coordinate system. The definition of the local and global coordinate systems, as well as the construction of the FBV, are illustrated in Fig. 1 of the supplementary material.

Image Feature Volume Construction. The N images in the local fragment are first passed through the image backbone to extract multi-level features. Similar to previous works on volumetric reconstruction [18, 15, 30], the extracted features are back-projected along each ray into the 3D feature volume. The image feature volume F_t^l is obtained by averaging the features from different views according to the visibility weight of each voxel. The visibility weight is defined as the number of views from which a voxel can be observed in the local fragment. A visualization of this unprojection process can be found in Fig. 3 (i).
Figure 3. 2D toy examples illustrating the unprojection, GRU fusion and the sparse TSDF representation. In panels (i) and (ii), the colored grids denote different features; in panel (iii), they denote different TSDF values. Best viewed in color.
Coarse-to-fine TSDF Reconstruction. We adopt a coarse-to-fine approach to gradually refine the predicted TSDF volume at each level. We use 3D sparse convolution to efficiently process the feature volume F_t^l. The sparse volumetric representation also naturally integrates with the coarse-to-fine design. Specifically, each voxel in the TSDF volume S_t^l contains two values, the occupancy score o and the SDF value x. At each level, both o and x are predicted by the MLP. The occupancy score represents the confidence of a voxel being within the TSDF truncation distance λ. A voxel whose occupancy score is lower than the sparsification threshold θ is defined as void space and will be sparsified. This sparse TSDF representation is visually illustrated in Fig. 3 (iii). After the sparsification, S_t^l is upsampled by 2× and concatenated with F_t^{l+1} as the input for the GRU Fusion module (introduced later) at the next level.
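A dense-tensor sketch of the per-level sparsification and upsampling described above. The real implementation keeps only the surviving voxels in a sparse voxel list (torchsparse), so the dense masking and the placeholder value assigned to void voxels below are simplifications.

```python
import torch
import torch.nn.functional as F

def sparsify_and_upsample(occupancy, sdf, feat_next, theta=0.5):
    """One coarse-to-fine transition (dense sketch of the sparse pipeline).

    occupancy: (1, 1, D, H, W) occupancy scores o predicted at level l
    sdf:       (1, 1, D, H, W) SDF values x predicted at level l
    feat_next: (1, C, 2D, 2H, 2W) image feature volume of level l + 1
    """
    keep = occupancy > theta                              # voxels below theta become void space
    sdf = torch.where(keep, sdf, torch.ones_like(sdf))    # placeholder; the real code simply drops them
    # 2x nearest-neighbour upsampling between levels (Sec. 3.3)
    up_sdf = F.interpolate(sdf, scale_factor=2, mode="nearest")
    up_keep = F.interpolate(keep.float(), scale_factor=2, mode="nearest").bool()
    # concatenate the upsampled prediction with the next-level feature volume
    next_input = torch.cat([feat_next, up_sdf], dim=1)
    return next_input, up_keep
```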
Instead of estimating single-view depth maps for each key frame, NeuralRecon jointly reconstructs the implicit surface within the bounding volume of the local fragment window. This design guides the network to learn the natural surface prior directly from the training data. As a result, the reconstructed surface is locally smooth and coherent in scale. Notably, this design also leads to less redundant computation compared to depth-based methods, since each area on the 3D surface is estimated only once during the fragment reconstruction.

GRU Fusion. To make the reconstruction consistent between fragments, we propose to condition the current-fragment reconstruction on the reconstructions of previous fragments. We use a 3D convolutional variant of the Gated Recurrent Unit (GRU) [6] for this purpose. As illustrated in Fig. 3 (ii), at each level the image feature volume F_t^l is first passed through 3D sparse convolution layers to extract 3D geometric features G_t^l. The hidden state H_{t-1}^l is extracted from the global hidden state H_{t-1}^g within the fragment bounding volume. The GRU fuses G_t^l with the hidden state H_{t-1}^l and produces the updated hidden state H_t^l, which is then passed through the MLP layers to predict the TSDF volume S_t^l at this level. The hidden state H_t^l is also written back to the global hidden state H_t^g by directly replacing the corresponding voxels. Formally, denoting z_t as the update gate, r_t as the reset gate, σ as the sigmoid function and W_* as the weights of the sparse convolutions, the GRU fuses G_t^l with the hidden state H_{t-1}^l through the following operations:

    z_t = σ(SparseConv([H_{t-1}^l, G_t^l], W_z))
    r_t = σ(SparseConv([H_{t-1}^l, G_t^l], W_r))
    H̃_t^l = tanh(SparseConv([r_t ⊙ H_{t-1}^l, G_t^l], W_h))
    H_t^l = (1 − z_t) ⊙ H_{t-1}^l + z_t ⊙ H̃_t^l

Intuitively, in the context of joint reconstruction and fusion of TSDF, the update gate z_t and the reset gate r_t in the GRU determine how much information from the previous reconstructions (i.e. the hidden state H_{t-1}^l) is fused with the current-fragment geometric features G_t^l, as well as how much information from the current fragment is fused into the hidden state H_t^l. As a data-driven approach, the GRU serves as a selective attention mechanism that replaces the linear running-average operation in conventional TSDF fusion [31]. By predicting S_t^l after the GRU, the MLP network can leverage the context information accumulated from past fragments to produce consistent surface geometry across local fragments. This is also conceptually analogous to the depth filter in non-learning-based 3D reconstruction pipelines [38, 34], where the current observation and the temporally fused depths are combined with a Bayesian filter. The effectiveness of joint reconstruction and fusion is validated in the ablation study.
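For illustration, the four update equations can be collected into a single ConvGRU cell. The sketch below uses dense 3D convolutions and standard PyTorch modules for readability; NeuralRecon applies the same gating with sparse convolutions over the occupied voxels, so the module layout here is illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn

class ConvGRUFusion(nn.Module):
    """Fuses the current-fragment features G_t^l with the hidden state H_{t-1}^l."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_z = nn.Conv3d(2 * channels, channels, kernel_size, padding=pad)
        self.conv_r = nn.Conv3d(2 * channels, channels, kernel_size, padding=pad)
        self.conv_h = nn.Conv3d(2 * channels, channels, kernel_size, padding=pad)

    def forward(self, h_prev, g_cur):
        x = torch.cat([h_prev, g_cur], dim=1)
        z = torch.sigmoid(self.conv_z(x))                     # update gate z_t
        r = torch.sigmoid(self.conv_r(x))                     # reset gate r_t
        h_tilde = torch.tanh(                                 # candidate hidden state
            self.conv_h(torch.cat([r * h_prev, g_cur], dim=1)))
        return (1 - z) * h_prev + z * h_tilde                 # updated hidden state H_t^l
```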
Integration to the Global TSDF Volume. At the last coarse-to-fine level, S_t^3 is predicted and further sparsified to S_t^l. Since the fusion between S_t^l and S_t^g has already been done in GRU Fusion, S_t^l is integrated into S_t^g by directly replacing the corresponding voxels after being transformed into the global coordinate frame. At each time step t, Marching Cubes is performed on S_t^g to reconstruct the mesh.
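A minimal sketch of this integration step, assuming for simplicity that the global volume is stored as a dense array indexed by integer voxel coordinates (the actual system keeps it sparse), with skimage's Marching Cubes standing in for the mesh extraction:

```python
import numpy as np
from skimage import measure

def integrate_fragment(global_tsdf, frag_coords, frag_tsdf):
    """Write the fragment prediction into the global volume by direct replacement.

    global_tsdf: (X, Y, Z) dense global TSDF volume (sketch; sparse in practice)
    frag_coords: (V, 3) integer voxel indices of the fragment in global coordinates
    frag_tsdf:   (V,) predicted TSDF values of those voxels
    """
    x, y, z = frag_coords.T
    global_tsdf[x, y, z] = frag_tsdf              # replacement, not a running average
    return global_tsdf

def extract_mesh(global_tsdf, voxel_size=0.04):
    # Marching Cubes on the zero level set; 0.04 m matches the finest voxel size
    verts, faces, normals, _ = measure.marching_cubes(global_tsdf, level=0.0)
    return verts * voxel_size, faces, normals
```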
Supervision. Following [9], two loss functions are used to supervise the network. The occupancy loss is defined as the binary cross-entropy (BCE) between the predicted occupancy values and the ground-truth occupancy values. The SDF loss is defined as the ℓ1 distance between the predicted SDF values and the ground-truth SDF values. We log-transform the SDF values of predictions and ground truth before applying the ℓ1 loss. The supervision is applied to all the coarse-to-fine levels.
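The two loss terms can be sketched as follows. The exact log-transform and the relative weighting of the two terms follow [9] and are not spelled out in this section, so the forms used below are assumptions.

```python
import torch
import torch.nn.functional as F

def log_transform(x, shift=1.0):
    # assumed form of the log-scaling applied to SDF values before the l1 loss
    return torch.sign(x) * torch.log(torch.abs(x) + shift)

def neural_recon_loss(occ_pred, sdf_pred, occ_gt, sdf_gt):
    """BCE on occupancy (already sigmoid-activated) plus l1 on log-transformed SDF."""
    occ_loss = F.binary_cross_entropy(occ_pred, occ_gt.float())
    sdf_loss = torch.abs(log_transform(sdf_pred) - log_transform(sdf_gt)).mean()
    return occ_loss + sdf_loss          # equal weighting assumed; applied at every level
```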
3.3. Implementation Details

We use torchsparse [43] as the implementation of 3D sparse convolution. The image backbone is a variant of MnasNet [41] and is initialized with weights pretrained on ImageNet. A Feature Pyramid Network [23] is used in the backbone to extract more representative multi-level features. The entire network is trained end-to-end with randomly initialized weights, except for the image backbone. The occupancy score o is predicted with a Sigmoid layer. The voxel size at the last level is 4 cm and the TSDF truncation distance λ is set to 12 cm. d_max is set to 3 m. R_max and t_max are set to 15° and 0.1 m, respectively. θ is set to 0.5. Nearest-neighbor interpolation is used for the upsampling between coarse-to-fine levels.
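For reference, the hyper-parameters stated above, together with the fragment length used in the final model (row (v) of Tab. 5), collected in one place; the key names are illustrative only.

```python
# Hyper-parameters reported in Sec. 3.1, Sec. 3.3 and Tab. 5 (key names are ours).
CONFIG = {
    "keyframes_per_fragment": 9,        # N, the best setting in the ablation study
    "keyframe_translation_m": 0.1,      # t_max, relative translation threshold
    "keyframe_rotation_deg": 15.0,      # R_max, relative rotation threshold
    "max_depth_range_m": 3.0,           # d_max, used to build the fragment bounding volume
    "voxel_size_last_level_m": 0.04,    # 4 cm voxels at the finest level
    "tsdf_truncation_m": 0.12,          # lambda
    "sparsification_threshold": 0.5,    # theta, on the occupancy score
}
```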
frames used for evaluation are sampled from the video se-
quence with an interval of 10 frames for both depth-based
4. Experiments
methods and Atlas. Following [30, 26], [53, 48, 14, 13] are
In this section, we conduct a series of experiments to fine-tuned on ScanNet. To evaluate depth-based methods
evaluate the reconstruction quality and different design con- [37, 48, 13, 14] in 3D, we use the point cloud fusion to ob-
siderations of NeuralRecon. tain the 3D reconstruction following Atlas. For other depth-
based methods, we use the standard TSDF fusion proposed
4.1. Datasets, Metrics, Baselines and Protocols. in [31, 7]. For the reasons we detailed in the supplementary
material, in order to make a fair comparison with Atlas, we
Datasets. We perform the experiments on two indoor
also report the evaluation results using the double-layered
datasets, ScanNet (V2) [8] and 7-Scenes [39]. The ScanNet
mesh (same as Atlas). The evaluation of 3D geometry on 7-
dataset contains 1613 indoor scenes with ground-truth cam-
Scenes uses the single-layered mesh. We also evaluate the
era poses, surface reconstructions, and semantic segmenta-
depth filtering operation with multi-view consistency check,
tion labels. There are two training/validation splits com-
which will be elaborated in the supplementary material.
monly used in previous works (defined in [30] and [42]) for
the ScanNet dataset. We use the same training and valida-
4.2. Evaluation Results
tion data with the corresponding baseline methods to make
a fair comparison. The 7-Scenes dataset is another chal- ScanNet. 2D depth metrics and 3D geometry metrics are
lenging RGB-D dataset captured in indoor scenes. Follow- used on the ScanNet dataset. The 3D geometry evalua-
ing the baseline method [26], we use the model trained on tion results are shown in Tab. 1. Our method produces
ScanNet to perform the validation on 7-Scenes. much better performance than recent learning-based meth-
Metrics. The 3D reconstruction quality is evaluated using ods and achieves slightly better results than COLMAP. We
3D geometry metrics presented in [30], as well as standard believe that the improvements come from the joint recon-
2D depth metrics defined in [11]. The definitions of these struction and fusion design achieved by the GRU Fusion
metrics are detailed in the supplementary material. Among module. Compared to depth-based methods, NeuralRecon
Method Layer Comp ↓ Acc ↓ Recall ↑ Prec ↑ F-score ↑ Time (ms) ↓
MVDepthNet [48] single 0.040 0.240 0.831 0.208 0.329 48
GPMVS [13] single 0.031 0.879 0.871 0.188 0.304 51
DPSNet [14] single 0.045 0.284 0.793 0.223 0.344 322
COLMAP [37] single 0.069 0.135 0.634 0.505 0.558 2076
Ours single 0.128 0.054 0.479 0.684 0.562 30
Atlas [30] double 0.062 0.128 0.732 0.382 0.499 292
Ours double 0.106 0.073 0.609 0.450 0.516 30
DeepV2D [44] single 0.057 0.239 0.646 0.329 0.431 347
Consistent Depth [28] single 0.091 0.344 0.461 0.266 0.331 2321
Ours single 0.120 0.062 0.428 0.592 0.494 30
Table 1. 3D geometry metrics on ScanNet. We use two different training/validation splits following Atlas [30] (top block) and BA-Net
[42] (bottom block). We elaborate the meaning of the single and double layer in the supplementary material.
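As a quick consistency check on Tab. 1, assuming the F-score is the usual harmonic mean of precision and recall (as in the 3D metrics of [30]), the single-layer entry of our method gives F = 2 · Prec · Recall / (Prec + Recall) = 2 · 0.684 · 0.479 / (0.684 + 0.479) ≈ 0.563, which matches the reported 0.562 up to rounding.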
Method Abs Rel ↓ Abs Diff ↓ Sq Rel ↓ RMSE ↓ δ < 1.25 ↑ Comp ↑
COLMAP [37] 0.137 0.264 0.138 0.502 83.4 0.871
MVDepthNet [48] 0.098 0.191 0.061 0.293 89.6 0.928
GPMVS [13] 0.130 0.239 0.339 0.472 90.6 0.928
DPSNet [14] 0.087 0.158 0.035 0.232 92.5 0.928
Atlas [30] 0.065 0.123 0.045 0.251 93.6 0.999
Ours 0.065 0.106 0.031 0.195 94.8 0.909
Method Abs Rel ↓ Sq Rel ↓ RMSE ↓ RMSE log ↓ Sc Inv ↓ -
DeMoN [45] 0.231 0.520 0.761 0.289 0.284 -
BA-Net [42] 0.161 0.092 0.346 0.214 0.184 -
DeepV2D [44] 0.057 0.010 0.168 0.080 0.077 -
Consistent Depth [28] 0.073 0.037 0.217 0.105 0.103 -
Ours 0.047 0.024 0.164 0.093 0.092 -
Table 2. 2D depth metrics on ScanNet. We use two different training/validation splits following Atlas [30] (top block) and BA-Net [42]
(bottom block).
Compared to depth-based methods, NeuralRecon can produce coherent reconstructions both locally and globally. Our method also surpasses the volumetric baseline method Atlas [30] in accuracy, precision, and F-score. The improvements potentially come from the local fragment separation in our method, which can act as a view-selection mechanism that prevents irrelevant image features from being fused into the 3D volume. In terms of completeness and recall, the proposed method performs worse than both depth-based methods and Atlas. Since depth-based methods predict pixel-wise depth maps for each view, the coverage of their predictions is high by nature, but at the cost of accuracy. Being an offline approach, Atlas has the advantage of having global context from the entire sequence before predicting the geometry. As a result, Atlas sometimes achieves even better completeness than the ground truth due to its TSDF completion capability. However, Atlas tends to predict over-smoothed geometry, and the completed regions may be inaccurate.

As for the 2D depth metrics, NeuralRecon also outperforms previous state-of-the-art methods on almost all metrics, as shown in Tab. 2.

7-Scenes. 2D depth metrics and 3D geometry metrics are evaluated on the 7-Scenes dataset. As shown in Tab. 3, our method achieves performance comparable to the state-of-the-art method CNMNet [26] and outperforms all other methods. We believe that the accuracy of the proposed method can be further improved by leveraging the planar structure information as in CNMNet. Since the model used here is only trained on ScanNet, these results also demonstrate that NeuralRecon can generalize well beyond the domain of the training data.

Efficiency. We also report the average running time of the baselines and our method in Tab. 1. Only the inference time on key frames is counted. A detailed timing analysis for each module of NeuralRecon is presented in Tab. 4. For volumetric methods (Atlas and ours), the running time is obtained by dividing the time to reconstruct the TSDF volume of a local fragment by the number of key frames in the local fragment. Notice that the time for TSDF fusion is not included for depth-based methods. The running time for [44, 28, 24, 26, 45] and NeuralRecon is measured on an NVIDIA RTX 2080Ti GPU. We use the running times reported in [30] and [55] for [48, 14, 37, 13, 30] and [53], respectively.

As shown in Tab. 1, our time cost is 30 ms per key frame, achieving real-time speed at 33 key frames per second and outperforming all previous methods. Specifically, our method runs ∼10× faster than Atlas, and 77× faster than Consistent Depth. Predicting the volumetric representation removes the redundant computation in depth-based methods, which contributes to the fast running speed of our method. Compared to Atlas, incrementally reconstructing the geometry in local fragments avoids processing a huge 3D volume, leading to a faster speed. The use of sparse convolution also contributes to the superior efficiency of NeuralRecon.
Method Comp ↓ Acc ↓ Recall ↑ Prec ↑ F-score ↑ Img. Enc. Unproj. Sparse Conv. GRU Total
DeepV2D [44] 0.180 0.518 0.175 0.087 0.115 Level 1 1.27 3.70 2.18
CNMNet [26] 0.150 0.398 0.246 0.111 0.149
Ours 0.228 0.100 0.227 0.389 0.282
4.03 Level 2 1.21 3.84 2.24 29.56
Method δ < 1.25 ↑ Abs Rel ↓ Sq Rel ↓ RMSE ↓ Time ↓ Level 3 2.18 5.11 3.80
DeMoN [45] 31.88 0.3888 0.4198 0.8549 110
MVSNet [53] 64.09 0.2339 0.1904 0.5078 1050 Table 4. Timing analysis of NeuralRecon measured in millisec-
N-RGBD [24] 69.26 0.1758 0.1123 0.4408 202 onds per key frame. The level number indicates the different
MVDNet [48] 71.79 0.1925 0.2350 0.4585 48 coarse-to-fine level. Img. Enc. stands for image encoder, Unproj.
DPSNet [14] 70.96 0.1991 0.1420 0.4382 322
DeepV2D [44] 42.80 0.4370 0.5530 0.8690 347 stands for unprojection.
CNMNet [26] 76.64 0.1612 0.0832 0.3614 80
Ours 82.00 0.1550 0.1040 0.3470 30 Fusion 3D Geometry Metrics
#views
Area Method Recall Prec F-score
Table 3. 3D geometry metrics (top block) and 2D depth metrics i 5 OCC Linear 0.576 0.386 0.462
(bottom block) on 7-Scenes. Time is measured in milliseconds. ii 5 OCC Avg 0.535 0.432 0.478
iii 5 OCC GRU 0.572 0.426 0.488
iv 5 FBV GRU 0.613 0.421 0.494
of sparse convolution also contributes to the superior effi- - 7 FBV GRU 0.607 0.435 0.507
ciency of NeuralRecon. v 9 FBV GRU 0.609 0.450 0.516
- 11 FBV GRU 0.593 0.398 0.474
4.3. Ablation Study
Table 5. Ablation study. We report 3D geometry metrics on Scan-
In this section, we conduct several ablation experiments
Net. OCC: fuse 3D geometric features Glt within the occupied
on the ScanNet dataset to discuss the effectiveness of com- area where occupancy score o > θ. FBV: fuse 3D geometric fea-
ponents in our method. tures Glt within the fragment bounding volume. Linear: remove
GRU Fusion. We validate the GRU Fusion design by com- GRU-Fusion and use the conventional running-average-based lin-
paring rows from (i) to (iv) in Tab. 5. ear TSDF fusion to update the global TSDF volume. Avg: fuse 3D
geometric features Glt with the average operation. GRU: fuse 3D
To validate the benefit of feature fusion, we compare row
geometric features Glt with GRU. We use row (v) in all other ex-
(i) and row (ii) in Tab. 5. Using feature fusion with the av- periments. More details about ablation experiments can be found
erage operation obtains nearly 5% improvement for the pre- in the supplementary material.
cision metric than conventional linear TSDF fusion. Visual-
ization in Fig. 5 shows that feature fusion with the average
operation can reconstruct smoother geometry. These results Qualitative Results. We provide the qualitative results and
demonstrate that feature fusion can be more effective than the corresponding analysis in Fig. 4.
TSDF fusion using the same average operation.
Comparing row (ii) and row (iii) in Tab. 5 shows that replacing the average operation with a GRU gives a 4% improvement in recall. The mesh in Fig. 5 (iii) is also more complete than that in Fig. 5 (ii). These results demonstrate that the GRU is more effective at selectively integrating only the consistent information from the current fragment into the hidden state.
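To make the GRU-based feature fusion concrete, the sketch below shows a dense ConvGRU-style update in PyTorch, where the hidden state plays the role of the fused global feature volume and the input is the current fragment's geometric feature volume G_t^l. This is only an illustration of the gating equations under simplified assumptions; the actual model operates on sparse voxels with sparse 3D convolutions, so the layer types and shapes differ.

```python
import torch
import torch.nn as nn

class ConvGRUFusion3D(nn.Module):
    """Dense ConvGRU cell over a 3D feature volume; an illustrative stand-in
    for the sparse-convolution GRU fusion described in the paper."""
    def __init__(self, channels):
        super().__init__()
        self.conv_zr = nn.Conv3d(2 * channels, 2 * channels, 3, padding=1)  # update & reset gates
        self.conv_h = nn.Conv3d(2 * channels, channels, 3, padding=1)       # candidate hidden state

    def forward(self, h_prev, g_t):
        # h_prev: fused (hidden) feature volume; g_t: current-fragment features G_t^l.
        # Both are shaped (B, C, X, Y, Z).
        zr = torch.sigmoid(self.conv_zr(torch.cat([h_prev, g_t], dim=1)))
        z, r = zr.chunk(2, dim=1)
        h_tilde = torch.tanh(self.conv_h(torch.cat([r * h_prev, g_t], dim=1)))
        return (1 - z) * h_prev + z * h_tilde  # the gate decides how much of g_t to integrate

# usage sketch: the hidden state is updated once per incoming fragment
# fusion = ConvGRUFusion3D(channels=32); h = fusion(h, g_t)
```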
The recalls in rows (iii) and (iv) in Tab. 5 show that fusion within the fragment bounding volume produces much more complete results. The visualization results in Fig. 5 (iii) and (iv) show that, with fusion in the fragment bounding volume, our method produces fewer artifacts on the ground. Fusion in the fragment bounding volume can leverage the context information at fragment boundaries and produce more consistent and complete surface estimates.
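The OCC and FBV variants above differ only in which voxels take part in the feature fusion: OCC restricts fusion to voxels whose predicted occupancy score exceeds the threshold θ, while FBV fuses every voxel inside the current fragment bounding volume. The schematic sketch below uses hypothetical names and assumes a per-voxel occupancy-score grid and a boolean mask of the fragment bounding volume; whether OCC additionally intersects the bounding volume is an implementation detail assumed here.

```python
import numpy as np

def fusion_mask(occupancy, in_fragment_bv, mode="FBV", theta=0.5):
    """Select the voxels whose features G_t^l are fused into the hidden state."""
    if mode == "OCC":
        # occupied area only: voxels whose occupancy score exceeds theta
        return in_fragment_bv & (occupancy > theta)
    # FBV: all voxels inside the current fragment bounding volume
    return in_fragment_bv

# toy usage on an 8^3 grid
occ = np.random.rand(8, 8, 8)
fbv = np.ones((8, 8, 8), dtype=bool)
mask = fusion_mask(occ, fbv, mode="OCC", theta=0.5)
```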
Number of views. We set 5, 7, 9, and 11 views as the length of a fragment, respectively. As shown in row (v) of Tab. 5, the F-score improves by over 2% when 9 views are used per fragment instead of 5. As shown in the visualization results in Fig. 5 (v), with more views in a fragment, the geometry is reconstructed more accurately compared to Fig. 5 (iv).
Qualitative Results. We provide the qualitative results and the corresponding analysis in Fig. 4.
5. Conclusion
In this paper, we introduced NeuralRecon, a novel system for real-time 3D reconstruction from monocular video. The key idea is to jointly and incrementally reconstruct and fuse sparse TSDF volumes for each video fragment with 3D sparse convolutions and a GRU. This design enables NeuralRecon to output accurate and coherent reconstructions in real time. Experiments show that NeuralRecon outperforms state-of-the-art methods in both reconstruction quality and running speed. The sparse TSDF volume reconstructed by NeuralRecon can be directly used in downstream tasks such as 3D object detection, 3D semantic segmentation, and neural rendering. We believe that, by jointly training with these downstream tasks end-to-end, NeuralRecon enables new possibilities in learning-based multi-view perception and recognition systems.
Acknowledgement. The authors would like to acknowledge the support from the National Key Research and Development Program of China (No. 2020AAA0108901), NSFC (No. 61806176), and the ZJU-SenseTime Joint Lab of 3D Vision.
[Figure 4 panel labels: COLMAP, CNMNet, DeepV2D, Ground Truth]
Figure 4. Qualitative results on ScanNet. Compared to depth-based methods, NeuralRecon produces much more coherent reconstruction results. Notice that our method also recovers sharper geometry than Atlas [30], which illustrates the effectiveness of the local fragment design in our method. Reconstructing only within the local fragment window prevents irrelevant image features from far-away camera views from being fused into the 3D volume. The color indicates the surface normal. More qualitative results can be found in the supplementary material and on the project webpage. Zoom in for details.
Figure 5. Ablation study. The Roman numerals correspond to the rows of Tab. 5. The analysis is presented in Sec. 4.3.
References
[1] Augmented Reality with ARKit - Apple Developer.
[2] Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-Scale Data for Multiple-View Stereopsis. IJCV, 2016.
[3] Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM. ArXiv, 2020.
[4] Shuo Cheng, Zexiang Xu, Shilin Zhu, Zhuwen Li, Li Erran Li, Ravi Ramamoorthi, and Hao Su. Deep stereo using adaptive thin volume representation with uncertainty awareness. In CVPR, 2020.
[5] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016.
[6] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NeurIPS 2014 Workshop on Deep Learning, 2014.
[7] Brian Curless and Marc Levoy. A Volumetric Method for Building Complex Models from Range Images. In SIGGRAPH, 1996.
[8] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In CVPR, 2017.
[9] Angela Dai, Christian Diller, and Matthias Nießner. SG-NN: Sparse Generative Neural Networks for Self-Supervised Scene Completion of RGB-D Scans. In CVPR, 2020.
[10] Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration. ACM TOG, 2017.
[11] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014.
[12] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In CVPR, 2020.
[13] Yuxin Hou, Juho Kannala, and Arno Solin. Multi-view stereo by temporal nonparametric fusion. In ICCV, 2019.
[14] Sunghoon Im, Hae-Gon Jeon, Stephen Lin, and In So Kweon. DPSNet: End-to-end Deep Plane Sweep Stereo. In ICLR, 2019.
[15] Mengqi Ji, Juergen Gall, Haitian Zheng, Yebin Liu, and Lu Fang. SurfaceNet: An end-to-end 3D neural network for multiview stereopsis. In ICCV, 2017.
[16] Mengqi Ji, Jinzhi Zhang, Qionghai Dai, and Lu Fang. SurfaceNet+: An End-to-End 3D Neural Network for Very Sparse Multi-View Stereopsis. IEEE TPAMI, 2020.
[17] Chiyu Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Nießner, and Thomas Funkhouser. Local implicit grid representations for 3D scenes. In CVPR, 2020.
[18] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a Multi-View Stereo Machine. In NeurIPS, 2017.
[19] Michael Kazhdan and Hugues Hoppe. Screened Poisson Surface Reconstruction. ACM TOG, 2013.
[20] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction. ACM TOG, 2017.
[21] Kalin Kolev, Petri Tanskanen, Pablo Speciale, and Marc Pollefeys. Turning Mobile Phones into 3D Scanners. In CVPR, 2014.
[22] P. Labatut, J.-P. Pons, and R. Keriven. Robust and Efficient Surface Reconstruction From Range Data. Computer Graphics Forum, 2009.
[23] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[24] Chao Liu, Jinwei Gu, Kihwan Kim, Srinivasa G Narasimhan, and Jan Kautz. Neural RGB->D Sensing: Depth and uncertainty from a video camera. In CVPR, 2019.
[25] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural Sparse Voxel Fields. In NeurIPS, 2020.
[26] Xiaoxiao Long, Lingjie Liu, Christian Theobalt, and Wenping Wang. Occlusion-Aware Depth Estimation with Adaptive Normal Constraints. In ECCV, 2020.
[27] William E. Lorensen and Harvey E. Cline. Marching Cubes: A High Resolution 3D Surface Construction Algorithm. SIGGRAPH, 1987.
[28] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent Video Depth Estimation. ACM TOG, 2020.
[29] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy Networks: Learning 3D reconstruction in function space. In CVPR, 2019.
[30] Zak Murez, Tarrence van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-End 3D Scene Reconstruction from Posed Images. In ECCV, 2020.
[31] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In ISMAR, 2011.
[32] Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Marc Stamminger. Real-Time 3D Reconstruction at Scale Using Voxel Hashing. ACM TOG, 2013.
[33] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. In CVPR, 2019.
[34] Matia Pizzoli, Christian Forster, and Davide Scaramuzza. REMODE: Probabilistic, Monocular Dense Reconstruction in Real Time. In ICRA, 2014.
[35] Tong Qin, Jie Pan, Shaozu Cao, and Shaojie Shen. A General Optimization-Based Framework for Local Odometry Estimation with Multiple Sensors. ArXiv, 2019.
[36] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization. In ICCV, 2019.
[37] Johannes L. Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise View Selection for Unstructured Multi-View Stereo. In ECCV, 2016.
[38] Thomas Schöps, Torsten Sattler, Christian Häne, and Marc Pollefeys. 3D Modeling on the Go: Interactive 3D Reconstruction of Large-Scale Scenes on Mobile Devices. In 3DV, 2015.
[39] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. In CVPR, 2013.
[40] Sungjoon Choi, Q. Zhou, and V. Koltun. Robust reconstruction of indoor scenes. In CVPR, 2015.
[41] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. In CVPR, 2019.
[42] Chengzhou Tang and Ping Tan. BA-Net: Dense Bundle Adjustment Networks. In ICLR, 2019.
[43] Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, and Song Han. Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution. In ECCV, 2020.
[44] Zachary Teed and Jia Deng. DeepV2D: Video to Depth with Differentiable Structure from Motion. In ICLR, 2020.
[45] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. DeMoN: Depth and Motion Network for Learning Monocular Stereo. In CVPR, 2017.
[46] Julien Valentin, Adarsh Kowdle, Jonathan T. Barron, Neal Wadhwa, Max Dzitsiuk, Michael Schoenberg, Vivek Verma, Ambrus Csaszar, Eric Turner, Ivan Dryanovski, Joao Afonso, Jose Pascoal, Konstantine Tsotsos, Mira Leung, Mirko Schmidt, Onur Guleryuz, Sameh Khamis, Vladimir Tankovitch, Sean Fanello, Shahram Izadi, and Christoph Rhemann. Depth from Motion for Smartphone AR. ACM TOG, 2019.
[47] George Vogiatzis and Carlos Hernández. Video-Based, Real-Time Multi-View Stereo. Image and Vision Computing, 2011.
[48] Kaixuan Wang and Shaojie Shen. MVDepthNet: Real-Time Multiview Depth Estimation Neural Network. In 3DV, 2018.
[49] Silvan Weder, Johannes Schönberger, Marc Pollefeys, and Martin R. Oswald. RoutedFusion: Learning Real-Time Depth Map Fusion. In CVPR, 2020.
[50] Silvan Weder, Johannes L. Schönberger, Marc Pollefeys, and Martin R. Oswald. NeuralFusion: Online Depth Fusion in Latent Space, 2020.
[51] Xingbin Yang, L. Zhou, Hanqing Jiang, Z. Tang, Yuanbo Wang, H. Bao, and Guofeng Zhang. Mobile3DRecon: Real-time Monocular 3D Reconstruction on a Mobile Phone. IEEE TVCG, 2020.
[52] Zhenfei Yang, Fei Gao, and Shaojie Shen. Real-Time Monocular Dense Mapping on Aerial Robots Using Visual-Inertial Fusion. In ICRA, 2017.
[53] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth Inference for Unstructured Multi-View Stereo. In ECCV, 2018.
[54] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance. In NeurIPS, 2020.
[55] Zehao Yu and Shenghua Gao. Fast-MVSNet: Sparse-to-dense multi-view stereo with learned propagation and Gauss-Newton refinement. In CVPR, 2020.
[56] Enliang Zheng, Enrique Dunn, Vladimir Jojic, and Jan-Michael Frahm. PatchMatch Based Joint View Selection and Depthmap Estimation. In CVPR, 2014.
[57] Huizhong Zhou, Benjamin Ummenhofer, and Thomas Brox. DeepTAM: Deep Tracking and Mapping. In ECCV, 2018.