
Improved Direct Voxel Grid Optimization for Radiance Fields Reconstruction

Cheng Sun Min Sun Hwann-Tzong Chen


National Tsing Hua University
[email protected] [email protected] [email protected]
arXiv:2206.05085v4 [cs.GR] 2 Jul 2022

Abstract

In this technical report, we improve the DVGO [9] framework (called DVGOv2), which is based on PyTorch and uses the simplest dense grid representation. First, we re-implement part of the PyTorch operations in CUDA, achieving a 2–3× speedup. The CUDA extension is automatically compiled just in time. Second, we extend DVGO to support forward-facing and unbounded inward-facing capturing. Third, we improve the space and time complexity of the distortion loss proposed by mip-NeRF 360 [1] from O(N²) to O(N). The distortion loss improves our quality and training speed. Our efficient implementation could allow more future works to benefit from the loss.
Project page: https://sunset1995.github.io/dvgo/
Code: https://github.com/sunset1995/DirectVoxGO

Method           Data structure    Density   Color
DVGO [9]         dense grid        explicit  hybrid
Plenoxels [10]   sparse grid       explicit  explicit
Instant-NGP [7]  hash table        hybrid    hybrid
TensoRF [2]      decomposed grid   explicit  hybrid

Table 1. Overview of explicit radiance field representations. Various data structures have been realized to model the volume densities and view-dependent colors explicitly. 'Hybrid' indicates that the explicit representation is followed by an implicit representation (an MLP network).
1. Introduction

Neural radiance fields (NeRF) [6] provide an appealing approach to novel view synthesis for their high quality and flexibility in reconstructing volume densities and view-dependent colors from multi-view calibrated images. However, NeRF runs very slowly due to the processing time of its multilayer perceptron (MLP) networks. Consider an MLP consisting of 8 layers with 256 hidden channels: querying a single point requires more than 520k ≈ 256² · 8 FLOPs. In each training iteration, there are typically 8,192 rays, each with 256 sampled points, which results in more than 1T FLOPs in total.
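A quick back-of-the-envelope check of these numbers (our own sketch, not code from any of the cited implementations):

    # Rough FLOPs estimate for one NeRF training iteration,
    # counting multiply-accumulates as in the text above.
    hidden, layers = 256, 8
    flops_per_point = hidden * hidden * layers            # 524,288 ≈ 520k
    rays, points_per_ray = 8192, 256
    flops_per_iter = flops_per_point * rays * points_per_ray
    print(f"{flops_per_point:,} per point, {flops_per_iter:.2e} per iteration")
    # 524,288 per point, 1.10e+12 per iteration (> 1T FLOPs)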
Using an occupancy mask is one of the easiest ways to speed up training. As training progresses for a scene, we can gradually update an occupancy mask to deactivate parts of the space with low density. Thanks to the occupancy mask, we can then skip most of the point queries in each iteration after training for a while. VaxNeRF [4] reports that a vanilla NeRF with an occupancy mask can achieve 2–8× speedups on bounded inward-facing scenes. Most of the recent works [2, 4, 7, 9, 10] on training-time speedup use the occupancy mask trick.

Recently, many works (Tab. 1) have emerged that use explicit representations to reduce training time from hours to minutes per scene. Querying an explicit representation requires only constant-time computation, which is much more efficient than the few hundred thousand FLOPs per MLP query. Even hybrid representations may benefit from the reduced computation for speedup in training, since the MLP network in a hybrid representation is typically much shallower than that in a fully implicit representation. DVGO [9] uses the simplest dense grid data structure in a fully PyTorch implementation. Plenoxels [10] models the coefficients of spherical harmonics for view-dependent colors and realizes a fully explicit (without MLP) representation; Plenoxels' interpolation and rendering pipeline are fused in CUDA code. Instant-NGP [7] uses hash-table and hybrid representations for both densities and colors, and further improves the training time with a C/C++ and fully-fused CUDA implementation. TensoRF [2] improves the memory footprint and scalability of the dense grid via tensor decomposition, directly modeling the low-rank components.

This technical report presents DVGOv2. Compared to DVGO, DVGOv2 achieves another 2–3× speedup and extends to forward-facing and unbounded inward-facing capturing. We also present an efficient realization of the distortion loss (reduced from O(N²) to O(N)), which improves our quality and training time. DVGOv2 uses the simplest data structure, and most of our intermediate steps remain in the Python interface. Meanwhile, DVGOv2 still demonstrates good quality and convergence speed.
2. Efficient regularization

Unlike the implicit MLP representation, the explicit representation is found to be more prone to producing artifacts of holes or floaters [10]. Thus, regularization losses are especially important for achieving reasonable results on unbounded real-world captured scenes.
Efficient total variation (TV) loss. A TV loss [8] is commonly adopted to prevent unnecessary sharpness in explicit modeling [2, 10]. For each grid point, we compute the Huber loss to its six nearest-neighbor grid points; we find that the Huber loss works better than the L1 and L2 losses in our case. However, computing the TV loss is time-consuming for a large dense grid and requires many PyTorch API calls to implement, so we fuse them into a single CUDA kernel. Besides, we skip the forward pass and directly add the gradient into the PyTorch tensor, so users have to call our API between the normal backward pass and the optimization step (see Listing 1). Despite the CUDA extension, it still takes a lot of time, so we compute the TV loss densely for all grid points only in the first 10k iterations. After the 10k checkpoint, we compute the TV loss only for the grid points involved in the current iteration (i.e., those with non-zero gradients).
    optimizer.zero_grad()
    # compute total_loss
    total_loss.backward()
    dvgo_model.total_variation_add_grad(
        tv_weight, dense_mode=(curr_step < 10000))
    optimizer.step()

Listing 1. Call our efficient TV loss after the backward pass.
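For reference, the dense computation that the fused kernel replaces looks roughly like the following plain-PyTorch sketch (our own illustration; the function name and the (1, C, X, Y, Z) grid layout are assumptions, and the real kernel also writes the gradients directly as described above):

    import torch
    import torch.nn.functional as F

    def tv_loss_reference(grid):
        # Huber penalty between each voxel and its neighbor along each axis;
        # summing over the three axes covers all six nearest neighbors.
        loss = 0.0
        for dim in (2, 3, 4):  # spatial dims of a (1, C, X, Y, Z) grid
            a = grid.narrow(dim, 0, grid.shape[dim] - 1)
            b = grid.narrow(dim, 1, grid.shape[dim] - 1)
            loss = loss + F.huber_loss(a, b, reduction='mean')
        return loss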
Efficient O(N) distortion loss. The distortion loss is proposed by mip-NeRF 360 [1]. For a ray with N sampled points, the loss is defined as

    L_{dist}(s, w) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} w_i w_j \left| \frac{s_i + s_{i+1}}{2} - \frac{s_j + s_{j+1}}{2} \right| + \frac{1}{3} \sum_{i=0}^{N-1} w_i^2 (s_{i+1} - s_i) ,    (1)

where (s_{i+1} - s_i) is the length and (s_i + s_{i+1})/2 is the mid-point of the i-th query interval. The s is non-linearly normalized (from the near-far clipping distance to [0, 1]) to prevent overweighting the far queries. The weight w_i is for the i-th sampled point. Although we use the point-based instead of the interval-based query (which is an interesting problem), we find it still beneficial to adopt the distortion loss.

However, a straightforward implementation of the first term in Eq. (1) results in O(N²) computation for a single ray. This is not a problem for mip-NeRF 360, as there are only 32 query intervals in the finest sampling. For the point-based query, there are typically more than 256 sampled points on each ray, which makes the computation non-trivial and consumes much GPU memory (more than 3 GB for a batch with 4,096 rays).

Thus, we re-implement the first term to achieve O(N) computation. Let the mid-point distance be m_i = (s_i + s_{i+1})/2 with m_i < m_{i+1}. We can eliminate the diagonal terms (i = k, j = k) and rewrite it as

    L^{1st}_{dist} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} w_i w_j |m_i - m_j|
                   = 2 \sum_{i=1}^{N-1} \sum_{j=0}^{i-1} w_i w_j (m_i - m_j)    (2)
                   = 2 \left( \sum_{i=1}^{N-1} w_i m_i \sum_{j=0}^{i-1} w_j \right) - 2 \left( \sum_{i=1}^{N-1} w_i \sum_{j=0}^{i-1} w_j m_j \right) ,

where we can first compute and store the prefix sums of (w) and (w ⊙ m), so we can directly look up the results of the inner summations when computing the outer summation. The overall computation can thus be realized in O(N).

The derivative with respect to w_k is

    \frac{\partial}{\partial w_k} L^{1st}_{dist} = 2 \frac{\partial}{\partial w_k} \sum_{j=0}^{k-1} w_k w_j (m_k - m_j) + 2 \frac{\partial}{\partial w_k} \sum_{i=k+1}^{N-1} w_i w_k (m_i - m_k)
                                                 = 2 \sum_{j=0}^{k-1} w_j (m_k - m_j) + 2 \sum_{i=k+1}^{N-1} w_i (m_i - m_k)
                                                 = 2 m_k \sum_{j=0}^{k-1} w_j - 2 \sum_{j=0}^{k-1} w_j m_j + 2 \sum_{i=k+1}^{N-1} w_i m_i - 2 m_k \sum_{i=k+1}^{N-1} w_i ,    (3)

where we can also compute and store the prefix and suffix sums of (w) and (w ⊙ m), so we can directly look up the results when computing the derivative for every w_k. The overall computation is also O(N).

We implement the efficient distortion loss as a PyTorch CUDA extension and support an uneven number of sampled points on each ray. We provide a self-contained package at https://github.com/sunset1995/torch_efficient_distloss. We will see that the distortion loss improves our rendering quality and speeds up our training, thanks to the compactness encouraged by the loss. We believe the efficient distortion loss and our implementation can let more works benefit from mip-NeRF 360's regularization technique, as most NeRF-based methods have hundreds of sampled points on a ray.
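A minimal PyTorch sketch of the O(N) forward pass follows (our own illustration of Eq. (2); the function name and the single-ray setting are ours — the released package handles batching and uneven sample counts):

    import torch

    def distortion_loss_1ray(w, m, interval):
        # w: (N,) sample weights; m: (N,) mid-point distances sorted ascending;
        # interval: (N,) interval lengths s_{i+1} - s_i, all normalized to [0, 1].
        wm = w * m
        w_prefix = torch.cumsum(w, dim=0) - w       # sum_{j<i} w_j
        wm_prefix = torch.cumsum(wm, dim=0) - wm    # sum_{j<i} w_j m_j
        loss_bi = 2 * (wm * w_prefix - w * wm_prefix).sum()  # first term via Eq. (2)
        loss_uni = (w.pow(2) * interval).sum() / 3           # second term of Eq. (1)
        return loss_bi + loss_uni

Because cumsum is differentiable, autograd recovers the gradient of Eq. (3) without ever materializing the N × N pairwise matrix.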
3. CUDA speedup

There are lots of sequential point-wise operations in DVGO, each of which incurs the overhead of launching a CUDA kernel. So we re-implement these sequential point-wise operations as a single CUDA kernel to reduce the launching overhead; we refer interested readers to https://pytorch.org/tutorials/advanced/cpp_extension.html. We use PyTorch's just-in-time compilation mechanism, which automatically compiles the newly implemented C/C++ and CUDA code the first time it is required.
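For instance, an extension can be built lazily with torch.utils.cpp_extension.load (a sketch; the file names below are illustrative placeholders, not necessarily the ones in our repository):

    from torch.utils.cpp_extension import load

    # Compiles on first use and caches the build; later runs just load it.
    render_utils = load(
        name='render_utils_cuda',
        sources=['cuda/render_utils.cpp', 'cuda/render_utils_kernel.cu'],
        verbose=True)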
Re-implement Adam optimizer. There are about ten point-wise operations in an Adam optimization step. We fuse them into a single kernel and skip updating the grid points with zero gradients.
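In plain PyTorch, the skip-zero-gradient rule amounts to the following per-tensor update (a sketch under our own naming; the actual implementation is a single fused CUDA kernel):

    import torch

    @torch.no_grad()
    def masked_adam_step(param, exp_avg, exp_avg_sq, step, lr,
                         betas=(0.9, 0.999), eps=1e-8):
        mask = param.grad != 0                 # grid points touched this iteration
        g = param.grad[mask]
        exp_avg[mask] = betas[0] * exp_avg[mask] + (1 - betas[0]) * g
        exp_avg_sq[mask] = betas[1] * exp_avg_sq[mask] + (1 - betas[1]) * g * g
        m_hat = exp_avg[mask] / (1 - betas[0] ** step)   # bias correction
        v_hat = exp_avg_sq[mask] / (1 - betas[1] ** step)
        param[mask] -= lr * m_hat / (v_hat.sqrt() + eps)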
Re-implement rendering utils. Originally, we sample an equal number of points on each ray for the vectorized PyTorch implementation, where a large number of query points fall outside the scene bounding box (BBox). We now infer the ray–BBox intersection to sample query points parsimoniously for each ray (which is only applicable to bounded scenes). Besides, we fuse about ten point-wise operations for the forward and backward passes of the density-to-alpha function. In the volume rendering accumulation procedure, we halt tracing a ray once the accumulated transmittance is less than 10⁻³.
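The termination rule is equivalent to the following sequential sketch (illustrative Python of our own; the real version runs inside the fused CUDA kernel):

    def composite_ray(alphas, rgbs, t_thresh=1e-3):
        # alphas/rgbs: per-sample opacity and color, ordered near to far.
        T, acc = 1.0, 0.0
        for alpha, rgb in zip(alphas, rgbs):
            acc += T * alpha * rgb
            T *= 1.0 - alpha
            if T < t_thresh:  # remaining contribution is negligible; stop
                break
        return acc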
4. Mix of factors affecting speed and quality

Please note that the comparisons of training speed and result quality in this report are affected by many factors, not just the different scene representations presented in Tab. 1.
First, the computation devices are different. As shown in Tab. 2, the computing power across the different works is not aligned; we use the lowest-spec GPU to measure our training times. Second, Instant-NGP's training pipeline is implemented in C++, while the other methods use Python/PyTorch. Instant-NGP and Plenoxels implement most of their computations (e.g., grid interpolation, ray-casting, volume rendering) in CUDA; DVGOv2 customizes part of the computation as a CUDA extension, and most of the intermediate steps are still in the Python interface; DVGO and TensoRF only use the built-in PyTorch API. Third, implementation details such as regularization terms, the occupancy-grid policy, and other tricks can affect the quality and convergence speed as well.

GPU          FLOPs    Memory   Used by
RTX 2080Ti   13.45T   11G      DVGO [9], DVGOv2
Tesla V100   15.67T   16G      TensoRF [2]
Titan RTX    16.31T   24G      Plenoxels [10]
RTX 3090     35.58T   24G      Instant-NGP [7]

Table 2. GPU specs. FLOPs are theoretical for float32.
5. Experiments

5.1. Ablation study for the CUDA speedup

We test the re-implementation with 160³ voxels on the lego, mic, and ship scenes. The PSNRs of the different versions are roughly the same. We present the results in Tab. 3, where the training time is measured on an RTX 2080Ti GPU and is 2–3× faster than the original implementation. We use the improved implementation in the rest of this technical report.

Adam  Rendering  lego          mic           ship
                 11.5m         9.3m          14.6m
✓                8.7m (1.3×)   6.4m (1.5×)   12.1m (1.2×)
✓     ✓          4.8m (2.4×)   3.4m (2.7×)   7.1m (2.1×)

Table 3. Speedup by the improved implementation.

5.2. Bounded inward-facing scenes

We evaluate DVGOv2 on two bounded inward-facing datasets—Synthetic-NeRF [6] and Tanks&Temples [3] (bounded ver.). The results are summarized in Tab. 4. DVGOv2's training is more than two times faster than DVGO's. DVGOv2 also uses less training time than most of the recent methods despite using the lowest-spec GPU. The result qualities are also comparable to the recent methods. The improvement from scaling to a higher grid resolution is limited on the Tanks&Temples [3] dataset, perhaps because of the photometric variation between training views.

Method           Tr. time  PSNR↑  SSIM↑  LPIPS(VGG)↓
DVGO [9]         14.2m     31.95  0.957  0.053
Plenoxels [10]   11.1m     31.71  0.958  0.049
Instant-NGP [7]  5m        33.18  -      -
TensoRF (S) [2]  13.9m     32.39  0.957  0.057
TensoRF (L) [2]  8.1m      32.52  0.959  0.053
TensoRF (L) [2]  17.6m     33.14  0.963  0.047
DVGOv2 (S)       4.9m      31.91  0.956  0.054
DVGOv2 (L)       6.8m      32.76  0.962  0.046

(a) Results on the Synthetic-NeRF [6] dataset. The results are averaged over 8 scenes.

Method           Tr. time  PSNR↑  SSIM↑  LPIPS(VGG)↓
DVGO [9]         17.7m     28.41  0.911  0.155
TensoRF (S) [2]  -         28.06  0.909  0.155
TensoRF (L) [2]  -         28.56  0.920  0.140
DVGOv2 (S)       7.3m      28.29  0.910  0.157
DVGOv2 (L)       9.1m      28.69  0.918  0.143

(b) Results on the Tanks&Temples [3] dataset (bounded ver.). The results are averaged over 5 scenes.

Table 4. Results on bounded inward-facing scenes. We only compare with the recent fast-convergence approaches. Our small and large models use 160³ and 256³ voxels respectively, and both are measured on an RTX 2080Ti GPU. Results breakdown and rendered videos: https://sunset1995.github.io/dvgo/results.html.
5.3. Forward-facing scenes

Points parameterization and sampling. We use NeRF's parameterization to warp the unbounded forward-facing frustum into a bounded volume. In this case, the dense voxel grid allocation is similar to multiplane images (MPIs) [12], where we place D RGB-density images at fixed depths, each with X × Z resolution. Every ray is traced from the first to the D-th image with a step size of s images, i.e., there are 2D − 1 sampled points if s = 0.5.
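A quick check of that count (our own arithmetic, assuming samples run from the 1st to the D-th layer inclusive):

    # Marching from layer 1 to layer D with stride s in layer units.
    D, s = 256, 0.5
    n_samples = int((D - 1) / s) + 1
    assert n_samples == 2 * D - 1  # 511 sampled points per ray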
Implementation details. The number of depth layers is D = 256, each with XZ = 384² voxels. The sampling step size is s = 1.0 layer. The TV loss weights are 10⁻⁵ for the density grid and 10⁻⁶ for the feature grid; the distortion loss weight is 10⁻².
Results. We compare DVGOv2 with the recent fast-convergence approaches in Tab. 5. DVGOv2 shows comparable quality using less training time on the lowest-spec GPU. We also see that the efficient distortion loss makes our training faster and achieves better quality, thanks to the compactness encouraged by the loss. We also note that we achieve similar performance using a much lower grid resolution. This is perhaps due to the other challenges in the LLFF dataset (e.g., fewer training views with some multi-view inconsistency due to real-world capturing), which hinder the gain from using a higher grid resolution.
Method            Tr. time  PSNR↑  SSIM↑  LPIPS(VGG)↓
Plenoxels [10]    24.2m     26.29  0.839  0.210
TensoRF (S) [2]   19.7m     26.51  0.832  0.217
TensoRF (L) [2]   25.7m     26.73  0.839  0.204
DVGOv2 w/o Ldist  13.9m     26.24  0.833  0.204
DVGOv2            10.9m     26.34  0.838  0.197

Table 5. Results on the LLFF [5] dataset. The results are averaged over 8 scenes. The effective grid resolution is 1408×1156×128 for Plenoxels and 640³ for TensoRF, while ours is about 384² × 256.
5.4. Unbounded inward-facing scenes

Points parameterization and sampling. We adapt the mip-NeRF 360 [1] parameterization for the unbounded 360° scenes, which is

    x' = \begin{cases} x, & \|x\|_p \le 1 ; \\ \left(1 + b - \frac{b}{\|x\|_p}\right) \frac{x}{\|x\|_p}, & \|x\|_p > 1 . \end{cases}    (4)
We first rotate the world coordinate to align the first two PCA directions of the camera positions with the grid's XY axes, which slightly improves our results. The world coordinate is then shifted to align with the cameras' centroid and scaled to cover all cameras' near planes in a unit sphere. We allocate a cuboid voxel grid centered at the origin with side length 2 + 2b. The hyperparameter b > 0 controls the proportion of voxel grid points allocated to the background. The sampling step size is measured in the contracted space. The original p = 2 wastes around 50% of the grid points (a sphere in a cube), so we also try p = ∞ to make a cuboid contracted space.
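Eq. (4) translates directly into a few tensor ops; below is our own sketch (the default b = 0.2 is an arbitrary example, and p can be 2 or float('inf')):

    import torch

    def contract(x, b=0.2, p=2):
        # Eq. (4): identity inside the unit p-norm ball; points outside are
        # squashed so the whole space fits within extent 1 + b per axis.
        norm = x.norm(p=p, dim=-1, keepdim=True).clamp_min(1e-9)
        warped = (1 + b - b / norm) * (x / norm)
        return torch.where(norm <= 1, x, warped)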
Implementation details. The grid resolution is 320³. We set α^(init) = 10⁻⁴ [9] with a 0.5-voxel step size. The TV loss weights are 10⁻⁶ for the density grid and 10⁻⁷ for the feature grid; the distortion loss weight is 10⁻².

Results. We show our results on the unbounded inward-facing scenes in Tab. 6 and Tab. 7. On the Tanks&Temples [3] dataset (Tab. 6), we achieve SSIM and LPIPS comparable to NeRF++, while our quality is behind Plenoxels due to the grid resolution limited by the dense grid. On the newly released mip-NeRF 360 dataset (Tab. 7), we achieve NeRF-comparable PSNR and SSIM, while our LPIPS is still far behind. One improvement is scaling the grid resolution, as we only use 320³ voxels, but advanced data structures [2, 7, 10] are necessary in this case. However, DVGOv2 could still be a good starting point for its simplicity (PyTorch, dense grid) with reasonable quality. Using a cuboid contracted space significantly improves our results on the mip-NeRF 360 dataset, while it degrades the results on the Tanks&Temples dataset, perhaps due to the photometric inconsistency problem in that dataset. Again, the distortion loss [1] with our efficient realization improves our quality and training speed.

Method            Tr. time  PSNR↑  SSIM↑  LPIPS(VGG)↓
NeRF++ [11]       hours     20.49  0.648  0.478
Plenoxels [10]    27.3m     20.40  0.696  0.420
DVGOv2 w/o Ldist  22.1m     20.08  0.649  0.495
DVGOv2            16.0m     20.10  0.653  0.477

Table 6. Results on the Tanks&Temples [3] dataset.‡ The results are averaged over the 4 scenes organized by NeRF++ [11]. The grid resolution of Plenoxels is 640³ for the foreground and 2048 × 1024 × 64 for the background, while ours is a single 320³ grid shared by foreground and background.

Method             Tr. time  PSNR↑  SSIM↑  LPIPS(VGG)↓
NeRF [6]           hours     24.85  0.659  0.426
NeRF++ [11]        hours     26.21  0.729  0.348
mip-NeRF 360 [1]   hours     28.94  0.837  0.208
DVGOv2 w/o Ldist   16.4m     24.73  0.663  0.465
DVGOv2 p=2         13.2m     24.80  0.659  0.468
DVGOv2 p=∞         14.0m     25.24  0.680  0.446
DVGOv2 p=∞ (∗)     15.6m     25.42  0.695  0.429

(∗) Longer grid scaling and decaying schedule.

Table 7. Results on the mip-NeRF-360 [1] dataset.‡ The results are averaged over only the 7 publicly available scenes.
‡ Results breakdown and rendered videos: https://sunset1995.github.io/dvgo/results.html
References
[1] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-NeRF 360: Unbounded anti-aliased neural radiance fields. In CVPR, 2022.
[2] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. TensoRF: Tensorial radiance fields. arXiv cs.CV 2203.09517, 2022.
[3] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Trans. Graph., 2017.
[4] Naruya Kondo, Yuya Ikeda, Andrea Tagliasacchi, Yutaka Matsuo, Yoichi Ochiai, and Shixiang Shane Gu. VaxNeRF: Revisiting the classic for voxel-accelerated neural radiance field. arXiv cs.CV 2111.13112, 2021.
[5] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. In SIGGRAPH, 2019.
[6] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[7] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 2022.
[8] Leonid I. Rudin and Stanley J. Osher. Total variation based image restoration with free local constraints. In ICIP, 1994.
[9] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In CVPR, 2022.
[10] Alex Yu, Sara Fridovich-Keil, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In CVPR, 2022.
[11] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv cs.CV 2010.07492, 2020.
[12] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. ACM Trans. Graph., 2018.
