Fast Single-View 3D Object Reconstruction With Fine Details Through Dilated Downsample and Multi-Path Upsample Deep Neural Network
Fig. 1. Overall architecture of the proposed method. In the training stage, we randomly choose one view of each object in the training batch as input, so each view of the same object is trained at a different time. The constant size block (CB) is similar to the DSDB.
view of the object. It needs multi-view images to show its advantage.

The work most closely related to ours is the voxel tube network of Richter et al. [1]. They use a single image and the 3D shape ground truth for 3D shape reconstruction, without any additional supervision such as camera pose or object silhouettes. Their network architecture is memory friendly because their decoder, called the voxel tube decoder, mainly applies 2D convolution modules. Moreover, the computation time of 2D convolution is lower than that of 3D convolution. However, because the object shape is three-dimensional, 2D convolution is not powerful enough to reconstruct the 3D shape. Thus, we still use 3D convolution modules in the decoder and integrate our proposed method to address the single-view reconstruction problem.

3. PROPOSED METHOD

3.1.2. Dilated Downsample Block

Recall that the encoder is mainly responsible for extracting image features, so we need to design a powerful image extractor that captures more features. For this purpose, we propose the dilated downsample block, shown in Fig. 2.
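As a rough illustration, the following PyTorch sketch shows one way such a block could be built: a residual-style downsample block whose parallel branches use dilated convolutions to enlarge the receptive field. The layer counts, channel widths, and dilation rates (1, 2, 4) are our assumptions for illustration and may differ from the exact design in Fig. 2.

```python
import torch.nn as nn

class DilatedDownsampleBlock(nn.Module):
    """Hypothetical dilated downsample block (D-DB) sketch.

    Downsamples the 2D feature map by 2 while aggregating context with
    parallel dilated 3x3 convolutions; details are illustrative assumptions.
    """
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
        super().__init__()
        # one stride-2 branch per dilation rate; padding keeps all branch outputs aligned
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(out_ch),
            )
            for d in dilations
        ])
        # 1x1 stride-2 shortcut so the residual sum has a matching shape
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = sum(branch(x) for branch in self.branches)
        return self.relu(out + self.shortcut(x))
```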
3.2.1. Upsample Block

Because we want to reconstruct the 3D object at a specific resolution, the main difference between the upsample block and a residual block is that the first layer of the upsample block is changed from convolution to deconvolution, since convolution does not upsample the input features. Besides, a general upsampling operation, such as nearest-neighbour or bilinear interpolation, is added to the first layer of the shortcut path of the upsample block.
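A minimal PyTorch sketch of a block matching this description is given below; the channel widths, kernel sizes, and the choice of nearest-neighbour interpolation in the shortcut are illustrative assumptions, not the exact configuration of our network.

```python
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """Residual-style upsample block sketch (assumed details).

    The first layer of the main path is a transposed 3D convolution
    (deconvolution) that doubles the volume resolution; the shortcut path
    starts with interpolation so both paths stay spatially aligned.
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.main = nn.Sequential(
            # deconvolution doubles the spatial size: 4x4x4 kernel, stride 2
            nn.ConvTranspose3d(in_ch, out_ch, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(out_ch),
        )
        self.shortcut = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),   # the general upsampling on the shortcut
            nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.main(x) + self.shortcut(x))
```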
3.2.2. Multi-path Upsample Block

Though the upsample block is able to reconstruct the object shape at the specific resolution, it is not powerful enough. To exploit the input features more fully, we expand the upsample block into multiple paths, and the first layer of each path uses a deconvolution with a different kernel size. Fig. 3 shows the detailed structure of the multi-path upsample block. The motivation is similar to that of the dilated downsample block mentioned above: the input features can be used as much as possible.
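The sketch below extends the upsample block above to multiple parallel deconvolution paths; the kernel sizes (2, 4, 6) and the summation-based fusion are assumptions for illustration and may differ from the exact structure in Fig. 3.

```python
import torch.nn as nn

class MultiPathUpsampleBlock(nn.Module):
    """Multi-path upsample block sketch (assumed kernel sizes).

    Each path starts with a stride-2 transposed 3D convolution of a different
    kernel size, so the input features are exploited at several scales; the
    path outputs are summed with an upsampled shortcut.
    """
    def __init__(self, in_ch, out_ch, kernel_sizes=(2, 4, 6)):
        super().__init__()
        self.paths = nn.ModuleList([
            nn.Sequential(
                # padding chosen per (even) kernel size so every path exactly doubles resolution
                nn.ConvTranspose3d(in_ch, out_ch, kernel_size=k, stride=2,
                                   padding=(k - 2) // 2, bias=False),
                nn.BatchNorm3d(out_ch),
                nn.ReLU(inplace=True),
                nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm3d(out_ch),
            )
            for k in kernel_sizes
        ])
        self.shortcut = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = sum(path(x) for path in self.paths)
        return self.relu(out + self.shortcut(x))
```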
3.4. Objective Function

We first define some notation for convenience. Because the predicted shape is 3D, let $V^* = [v_1^*, v_2^*, \ldots, v_N^*]$ denote the probabilistic volume obtained by applying a sigmoid to the output volume of our network, where $N$ is the number of voxels in $V^*$ and $0 < v_i^* < 1$. $V = [v_1, v_2, \ldots, v_N]$ represents the ground truth volume, and each voxel of the ground truth volume is 0 or 1.

3.4.1. Intersection-over-Union (IoU) Loss

Inspired by Richter et al. [1], the IoU divides the size of the intersection by that of the union. For the segmentation of a 3D object, correct foreground predictions are effectively weighted by the sizes of the ground truth and predicted shapes. Therefore, it is beneficial for segmenting the object shape from the background. To use IoU for training, the IoU loss is defined as follows:

L_{\mathrm{IoU}}(V^*, V) = 1 - \frac{\sum_i v_i^* v_i}{\sum_i \left[ v_i^* + v_i - v_i^* v_i \right]}    (1)

F_{\mathrm{NCE}} = -\frac{1}{N^2} \sum_{p=1}^{N^2} \left[ v_p \log v_p^* + (1 - v_p) \log (1 - v_p^*) \right]    (4)
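To make the loss terms concrete, the following is a minimal PyTorch sketch of the soft IoU loss of Eq. (1) and the voxel-wise cross entropy of Eq. (4). The slice-wise weighting and the way MSFCEL and L_IoU are combined into L_Final are not reproduced here; the helper names and the epsilon values are our own.

```python
import torch

def iou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft IoU loss of Eq. (1): `pred` holds sigmoid probabilities, `target` is the 0/1 volume."""
    inter = (pred * target).sum()
    union = (pred + target - pred * target).sum()
    return 1.0 - inter / (union + eps)

def voxel_cross_entropy(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Voxel-wise binary cross entropy as in Eq. (4), averaged over the given voxels."""
    pred = pred.clamp(eps, 1.0 - eps)
    return -(target * pred.log() + (1.0 - target) * (1.0 - pred).log()).mean()

# toy example on a 32^3 probability volume and its ground-truth occupancy
pred = torch.rand(32, 32, 32)
gt = (torch.rand(32, 32, 32) > 0.5).float()
total = iou_loss(pred, gt) + voxel_cross_entropy(pred, gt)
```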
Table 1. Comparison with other works. Note: * is recalculated by the authors of AtlasNet; ** and *** are tested in our own experiment environment.

Method | 2D Image Encoder-Decoder Arch. | ShapeNet dataset | Mean IoU (%) | Mean CD | Memory (MB) | Time per image
3D-R2N2 [3] | 3D RNN | 13 categories | 56.0 | - | 205.8 | -
OGN [14] | Octree Decoder | 13 categories | 59.6 (+3.6) | - | 296.9 | 16 ms on Titan X
PSGN [15] | Point Cloud Decoder | 13 categories | 64.0 (+8.0) | 6.41* | 148.4 | 130 ms on laptop CPU
AtlasNet [9] | Mesh Decoder | 13 categories | - | 5.11 (-1.3) | 488.4 | -
VTN [1] | Voxel Tube Decoder | 13 categories | 64.1 (+8.1) | - | 126.6 | 8 ms** on GTX 1080 Ti
PSVH [2] | 3D Shape Decoder | 4 categories | 68.0 | - | 226.9 | 18 ms on Tesla M40 / 323 ms*** on GTX 1080 Ti
D-DB-MpUB w/ L_Final | 3D Shape Decoder | 13 categories | 66.7 (+10.7) | 5.01 (-1.40) | 112.3 | 11 ms on GTX 1080 Ti
C-D-DB-MpUB w/ L_Final | 3D Shape Decoder | 13 categories | 67.7 (+11.7) | 4.83 (-1.58) | 118.1 | 13 ms on GTX 1080 Ti
4. EXPERIMENTAL RESULTS

4.1. Environment and Dataset

We implement our method in PyTorch. The CPU is an Intel(R) Xeon(R) E5-2620 v4 @ 2.1 GHz, the main memory is 32 GB of DDR4 RAM, and the GPU is an NVIDIA GeForce GTX 1080 Ti.

In this work, we mainly use the dataset provided by [3], which renders objects from the ShapeNet dataset [5]. The dataset has 13 categories and consists of nearly 50K 3D objects, and each object has 24 images from different views. We use the same training and testing split provided by Choy et al. [3]. The object resolution of this dataset is 32 × 32 × 32 in voxel representation, and we train one network for all 13 categories.

We train our model with batch size 64 for 210 epochs. The initial learning rate is 10^-3, the learning rate is decayed by 0.5 every 30 epochs, and the optimizer is Adam.
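This optimization schedule maps directly onto standard PyTorch components; the following sketch shows the equivalent configuration, with a placeholder module standing in for our network.

```python
import torch

model = torch.nn.Conv3d(1, 1, kernel_size=3, padding=1)  # placeholder for the reconstruction network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# halve the learning rate every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

for epoch in range(210):
    # ... iterate over the training set with batch size 64, compute the loss,
    #     call loss.backward() and optimizer.step() ...
    scheduler.step()
```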
We evaluate our reconstruction results with two metrics: intersection over union (IoU) and Chamfer distance (CD). A higher IoU score and a lower CD value both indicate a better reconstruction result.
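For reference, the sketch below shows one common symmetric formulation of the Chamfer distance between two point sets; the exact CD variant and point sampling used for the numbers in Table 1 may differ.

```python
import torch

def chamfer_distance(p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point sets of shapes (N, 3) and (M, 3).

    Averages, in both directions, the squared distance from each point to its
    nearest neighbour in the other set, then sums the two directions.
    """
    d = torch.cdist(p1, p2) ** 2           # (N, M) pairwise squared distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# toy example with two random point clouds
a, b = torch.rand(1024, 3), torch.rand(2048, 3)
print(chamfer_distance(a, b))
```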
4.2. Results

In terms of the architecture, we propose two modules and leverage the concatenation skill. Then we integrate our proposed modules with our L_Final. We show the influence of each module in Table 2.

Table 2. Summary of all modules used and comparison of their influence.

Method | D-DB | MpUB | Concat. | w/ L_Final | Mean IoU (%)
SSDB-UB (Baseline) | | | | | 63.5
D-DB-UB | ✓ | | | | 64.4
D-DB-MpUB | ✓ | ✓ | | | 64.6
C-D-DB-MpUB | ✓ | ✓ | ✓ | | 65.6
C-D-DB-MpUB w/ L_Final | ✓ | ✓ | ✓ | ✓ | 67.7
In this table, our baseline is the sparse-step downsample block (SSDB) with the upsample block (UB). D-DB is the dilated downsample block, MpUB is the multi-path upsample block, and w/ L_Final denotes integrating L_IoU and MSFCEL. We can infer that each module we propose benefits the reconstruction result. Specifically, the concatenation module gives a 1% improvement over its previous entry in Table 2, and L_Final also gives a large improvement, 2.1% higher than its previous entry.

4.3. Comparison with Other Works

First, we compare our final result with PSVH [2], whose research purpose is the same as ours. Table 3 shows that our reconstruction result is better than PSVH, with a 3.4% improvement. For reconstruction speed, our reconstruction time is 13 ms per image, approximately 25 times faster than their 323 ms. Note that we use their pretrained model and test it in our own environment.

Table 3. Reconstruction comparison with PSVH.

Method | aero | car | chair | sofa | Mean IoU (%) | Time per image
PSVH [2] | 63.1 | 83.9 | 55.2 | 69.8 | 68.0 | 323 ms
Ours | 68.0 | 85.5 | 57.6 | 74.4 | 71.4 (+3.4) | 13 ms

Finally, we compare our result with other works in several aspects. As Table 1 shows, our best method achieves a 67.7% IoU score, 3.6% higher than our reference architecture VTN, and has better results in both IoU and CD. Even on a slower GPU, our average reconstruction time is lower than that of most works. Note that all CD values are multiplied by 10^3. In addition, we find that our reconstruction time is lower without using concatenation. VTN has the lowest reconstruction time, as it applies 2D convolution in its voxel tube decoder, whose computation time is less than that of 3D convolution.

5. CONCLUSION

We design a network for single-view 3D object reconstruction without any additional input. We focus on designing a network with a powerful ability to extract object features and to use them effectively. We propose two blocks, the dilated downsample block and the multi-path upsample block, to enhance our model. In addition, we use the concatenation skill to preserve encoder features during the reconstruction step and leverage MSFCEL to handle the class imbalance problem.

With all of our proposed modules, our final reconstruction result achieves state-of-the-art performance, 67.7% IoU on 13 categories, a 3.6% improvement over VTN [1]. In addition, on 4 categories, our result achieves 71.4% IoU, 3.4% higher than PSVH [2]. In terms of reconstruction speed, our average reconstruction time is 13 ms, 25 times faster than PSVH in our own experiment environment.
6. REFERENCES

[1] Stephan R. Richter and Stefan Roth, "Matryoshka networks: Predicting 3d geometry via nested shape layers," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[2] Hanqing Wang, Jiaolong Yang, Wei Liang, and Xin Tong, "Deep single-view 3d object reconstruction with visual hull embedding," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019.

[3] Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese, "3d-r2n2: A unified approach for single and multi-view 3d object reconstruction," in Computer Vision – ECCV 2016, Cham, 2016, pp. 628–644, Springer International Publishing.

[4] Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik, "Multi-view consistency as supervisory signal for learning shape and pose prediction," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[5] Angel X. Chang, Thomas A. Funkhouser, Leonidas J. Guibas, Pat Hanrahan, Qi-Xing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu, "Shapenet: An information-rich 3d model repository," CoRR, vol. abs/1512.03012, 2015.

[6] Fisher Yu and Vladlen Koltun, "Multi-scale context aggregation by dilated convolutions," arXiv preprint arXiv:1511.07122, 2015.

[7] Minhyuk Sung, Vladimir G. Kim, Roland Angst, and Leonidas Guibas, "Data-driven structural priors for shape completion," ACM Trans. Graph., vol. 34, no. 6, pp. 175:1–175:11, Oct. 2015.

[8] Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik, "Category-specific object reconstruction from a single image," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[9] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, and Mathieu Aubry, "A papier-mâché approach to learning 3d surface generation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[10] Guandao Yang, Yin Cui, Serge Belongie, and Bharath Hariharan, "Learning single-view 3d reconstruction with limited pose supervision," in The European Conference on Computer Vision (ECCV), September 2018.

[11] Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B. Tenenbaum, and William T. Freeman, "Pix3d: Dataset and methods for single-image 3d shape modeling," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[12] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-net: Convolutional networks for biomedical image segmentation," CoRR, vol. abs/1505.04597, 2015.

[13] Yongbin Sun, Ziwei Liu, Yue Wang, and Sanjay E. Sarma, "Im2avatar: Colorful 3d reconstruction from a single image," CoRR, vol. abs/1804.06375, 2018.

[14] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox, "Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs," in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[15] Haoqiang Fan, Hao Su, and Leonidas J. Guibas, "A point set generation network for 3d object reconstruction from a single image," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.