
FAST SINGLE-VIEW 3D OBJECT RECONSTRUCTION WITH FINE DETAILS THROUGH

DILATED DOWNSAMPLE AND MULTI-PATH UPSAMPLE DEEP NEURAL NETWORK

Chia-Ho Hsu, Ching-Te Chiu, and Chia-Yu Kuan

Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan

ABSTRACT

Three-dimensional (3D) object reconstruction is among the most important research areas in the field of computer vision. Its purpose is to reconstruct the overall shape of an object from its two-dimensional (2D) image. With the development of deep learning, many methods based on convolutional neural networks (CNNs) have been applied in related research.
To achieve 3D shape reconstruction with low computation time, we focus on the commonly used setting: single-image reconstruction. The main issue with using a single image as input is that the reconstructed shape often lacks structural detail. To address this issue, we propose two methods: the dilated downsample block and the multi-path upsample block. The dilated downsample block extracts more features, and the multi-path upsample block makes better use of those features in our architecture. In addition, we concatenate the encoder and decoder at corresponding layers to preserve the image features during the reconstruction process.
Finally, we perform experiments on the dataset provided by Choy et al. Results show that our method achieves 67.7% intersection-over-union (IoU) accuracy, 3.6% higher than the state-of-the-art method VTN. Compared with the PSVH method, our result reaches 71.4%, an increase of 3.4%. Our average reconstruction time is 13 ms, approximately 25 times faster than PSVH.
Index Terms— 3D object reconstruction, 3D shape reconstruction, deep convolutional neural network, single view
1. INTRODUCTION

Three-dimensional (3D) object reconstruction is a computer vision technique that uses two-dimensional (2D) information to reconstruct a 3D shape. Its purpose is to recover the shape of an object from its 2D image, including the information that the image cannot present. Given an image of an object, a person can recognize it and imagine the parts that the image does not show. However, 3D object reconstruction remains a difficult task for computer vision.
There are two main types of 3D object reconstruction: single-view reconstruction and multi-view reconstruction. In single-view reconstruction [1] [2], at each time step one view is randomly selected from the several available views of the same object to reconstruct its corresponding 3D shape. By contrast, multi-view reconstruction [3] [4] uses more than one of these views and reconstructs the 3D shape by integrating the features of the different views. With the same deep network architecture, the reconstruction quality of the latter is better than that of the former. However, in real-world applications such as AR and VR, the efficiency of the former is greater than that of the latter. Furthermore, although the reconstruction quality of the former does not surpass the latter, the result is not far from the expected shape.
With the emergence of large-scale datasets like ShapeNet [5], many learning-based works have been reported. For example, Choy et al. [3] proposed a unified recurrent convolutional neural network that can take a single-view image or multiple-view images as input. With their method, the reconstruction quality improves when the network sees more images of the object. However, CNN-based methods that reconstruct the object shape in 3D space usually miss some shape details, as the variations in object shape can be very large, even within the same object category.
Therefore, Wang et al. [2] use the probabilistic single-view visual hull (PSVH) to map the 2D silhouette of the object and a pose estimate into 3D space. With PSVH, they successfully address the problem of missing shape detail. However, they need additional silhouette and camera-pose ground truth in their training stage.
Our work aims to use a single-view image to reconstruct a detailed 3D shape in a voxel grid without additional information. It is not surprising that taking only one view tends to produce a coarse shape, so we enhance our model to increase its reconstruction ability. In this regard, we propose a powerful feature extractor called the dilated downsample block, which integrates the residual block and dilated convolution [6]. We then propose a multi-path upsample block that extends the paths of the residual block to increase the usage of the features, and we concatenate the encoder and decoder. Moreover, without additional input, we can also improve the reconstruction speed.

2. RELATED WORK

2.1. Traditional Method

Traditional methods [7] [8] use geometry priors for this difficult task. For example, Kar et al. [8] reconstruct a 3D shape template from images of objects in the same category as a shape prior. Given an input image, they first estimate the silhouette and viewpoint from the image, and then reconstruct the 3D object by fitting the shape template. However, these methods often rely on the database to achieve a precise shape; if the database does not cover the target shape, the reconstruction quality will be low.

2.2. Learning-Based Method

With the appearance of large-scale shape repositories like ShapeNet [5], many data-driven methods have been proposed, especially ones using CNNs [4] [9] [10] [11]. For example, Choy et al. [3] propose the 3D recurrent reconstruction neural network (3D-R2N2), a unified network for single-view or multi-view 3D object reconstruction. Because this network has a memory property that can memorize the state of previous inputs, its reconstruction quality gets increasingly better as it sees more views of the object. However, this model has worse reconstruction quality when it only sees one view of the object; it needs multi-view images to show its advantage.

Fig. 1. Overall architecture of the proposed method. In the training stage, we randomly choose one view of each object in the training batch as input, so every view of the same object is trained at a different time. The constant-size block (CB) is similar to the DSDB.

The most related work to ours is the voxel tube network (VTN) from Richter et al. [1]. It uses a single image and the 3D shape ground truth for 3D shape reconstruction without any additional ground truth such as camera pose or object silhouette. Their network architecture is memory-friendly because it mainly applies 2D convolution modules in its decoder, called the voxel tube decoder. Besides, the computation time of 2D convolution is less than that of 3D convolution. However, because the object shape lives in 3D, 2D convolution is not powerful enough to reconstruct the 3D shape. Thus, we still use 3D convolution modules in the decoder and integrate our proposed methods to address the single-view reconstruction problem.

3. PROPOSED METHOD

In our proposed method, there are four modules: the dilated downsample block, the multi-path upsample block, concatenation, and the integrated loss function. Fig. 1 shows our detailed architecture.
3.1. 2D Image Encoder

The encoder focuses on extracting image features and downsampling the input. Therefore, the first method that comes to mind is the residual block, because it has been proven to have this ability. In the following sections, we describe three blocks: the sparse step downsample block (SSDB), the dense step downsample block (DSDB), and our proposed dilated downsample block (D-DB). All blocks are based on the residual block.
3.1.1. Sparse Step Downsample Block & Dense Step Downsample Block

Because the input must be downsampled in the encoder, the first convolutional layer of the block sets stride = 2 for this purpose. However, with this setting some information in the input feature maps of the previous layer may not contribute to the next layer, because the strided convolution skips elements of the input feature maps.
If we instead set stride = 1, which gives the dense step downsample block, we avoid this problem. However, this block alone cannot downsample the input feature maps, so we add a max pooling layer immediately after each dense step downsample block to reduce the size of the input feature map.
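For concreteness, a minimal PyTorch sketch of the two baseline blocks is given below. The exact layer counts, channel widths, and normalization choices are not spelled out in the text, so they are illustrative assumptions; only the stride-2 convolution in the SSDB and the stride-1 convolution followed by max pooling in the DSDB are taken from the description above.

```python
import torch
import torch.nn as nn

class SparseStepDownsample(nn.Module):
    """Residual-style block that downsamples with a stride-2 convolution (SSDB).
    The strided first layer skips elements of the input feature map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch))
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

class DenseStepDownsample(nn.Module):
    """Residual-style block that keeps stride = 1 and downsamples afterwards
    with max pooling (DSDB), so no input element is skipped by the convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch))
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=2)  # downsampling happens here

    def forward(self, x):
        return self.pool(self.relu(self.body(x) + self.shortcut(x)))
```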
3.1.2. Dilated Downsample Block

Recall that the encoder is mainly responsible for extracting image features, so we need to design a powerful extractor that captures more features. For this purpose, we propose the dilated downsample block, shown in Fig. 2.

Fig. 2. Detail structure of the dilated downsample block.

The reason this downsample block is more powerful than the two blocks described previously is that convolutions with different kernel sizes have different receptive fields, and different receptive fields let us extract different features. Dilated convolution is a good candidate for extracting such features because it can obtain a large receptive field with a small kernel, without increasing the number of network parameters. Therefore, we do not use convolutions with large kernels in our proposed dilated downsample block.
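Since the exact topology of the dilated downsample block is given only in Fig. 2, the sketch below should be read as an assumption-laden illustration of the idea: several parallel 3x3 convolutions with different dilation rates (and thus different receptive fields) downsample the input, their outputs are fused, and a strided shortcut is added. The branch count, dilation rates, and fusion layer are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DilatedDownsampleBlock(nn.Module):
    """Sketch of a residual downsample block whose branches use 3x3 convolutions
    with different dilation rates, so each branch covers a different receptive
    field without resorting to large kernels. Branch count, dilation rates and
    channel sizes are illustrative assumptions."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2,
                          padding=d, dilation=d),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for d in dilations])
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch * len(dilations), out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch))
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # All branches produce the same spatial size, so they can be concatenated.
        multi_scale = torch.cat([b(x) for b in self.branches], dim=1)
        return self.relu(self.fuse(multi_scale) + self.shortcut(x))
```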

3.2. 3D Shape Decoder

The decoder is mainly responsible for upsampling the input features from the latent space and reconstructing the 3D object shape in a voxel representation from these features. To reconstruct an object shape with detailed structure, the decoder must make full use of the input features. Therefore, the upsample block used in the decoder is similar to the residual block. We describe the upsample block and our proposed multi-path upsample block in the following sections.

3.2.1. Upsample Block

Because we want to reconstruct the 3D object at a specific resolution, the main difference between the upsample block and the residual block is that the first layer of the upsample block is changed from a convolution to a deconvolution, since a convolution does not upsample the input features. In addition, a general upsampling operation, such as nearest-neighbor or bilinear interpolation, is added to the first layer of the shortcut path of the upsample block.
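A minimal PyTorch sketch of such an upsample block is shown below, assuming a 3D decoder with a stride-2 transposed convolution on the main path and nearest-neighbor interpolation on the shortcut path; the channel sizes and the 1x1x1 shortcut convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleBlock(nn.Module):
    """Residual-style 3D upsample block: the first layer of the main path is a
    transposed convolution (deconvolution) that doubles the resolution, and the
    shortcut path is upsampled by interpolation so the two paths can be added."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.ConvTranspose3d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch))
        self.shortcut = nn.Conv3d(in_ch, out_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Shortcut: nearest-neighbor interpolation followed by a 1x1x1 convolution.
        skip = self.shortcut(F.interpolate(x, scale_factor=2, mode='nearest'))
        return self.relu(self.body(x) + skip)
```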
3.2.2. Multi-path Upsample Block

Although the upsample block can reconstruct the object shape at the specified resolution, it is not powerful enough. To make greater use of the input features, we expand the upsample block into several paths, and the first layer of each path uses a deconvolution with a different kernel size. Fig. 3 shows the detailed structure of the multi-path upsample block. The motivation is similar to that of the dilated downsample block described above: the input features can be used as much as possible.

Fig. 3. Detail structure of our multi-path upsample block.
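The sketch below illustrates one plausible reading of the multi-path upsample block: each path starts with a stride-2 transposed convolution of a different kernel size (all producing the same output resolution), the paths are fused, and an interpolated shortcut is added. The number of paths, the kernel sizes, and the fusion layer are assumptions, since the exact layout is given only in Fig. 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPathUpsampleBlock(nn.Module):
    """Sketch of the multi-path upsample block: parallel deconvolution paths with
    different kernel sizes, a 1x1x1 fusion layer, and an interpolated shortcut."""
    def __init__(self, in_ch, out_ch, kernels=((2, 0), (4, 1), (6, 2))):
        super().__init__()
        # Each (kernel, padding) pair doubles the spatial resolution at stride 2.
        self.paths = nn.ModuleList([
            nn.Sequential(
                nn.ConvTranspose3d(in_ch, out_ch, kernel_size=k, stride=2, padding=p),
                nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
                nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm3d(out_ch))
            for k, p in kernels])
        self.fuse = nn.Conv3d(out_ch * len(kernels), out_ch, kernel_size=1)
        self.shortcut = nn.Conv3d(in_ch, out_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        paths = torch.cat([p(x) for p in self.paths], dim=1)
        skip = self.shortcut(F.interpolate(x, scale_factor=2, mode='nearest'))
        return self.relu(self.fuse(paths) + skip)
```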

3.3. Concatenation

In our architecture, the image features extracted by the encoder are passed to the latent space through a fully connected layer. Therefore, when we reconstruct the object shape in the decoder, the object features from the encoder have vanished. Thus, to preserve the image features, we must pass them from the encoder to the decoder.
Inspired by [12], concatenation is the proper technique for preserving image features. However, because our encoder and decoder operate in different dimensions, we must transform the 2D image feature maps into a 3D feature volume. The transformation stacks several feature maps so that they form a 3D feature volume.
In this way, we can concatenate the transformed feature volumes with the corresponding decoder layers of the same 3D volume size, as Fig. 4 shows.

Fig. 4. Detail structure of the concatenation module in our decoder. "T" is the transformation and "C" is the concatenation.
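A small sketch of this transformation is given below: the 2D feature maps are reshaped so that part of the channel dimension becomes the depth axis of a 3D volume, which can then be concatenated with the decoder features. The channel grouping and the tensor shapes are illustrative assumptions.

```python
import torch

def feature_maps_to_volume(feat_2d: torch.Tensor, depth: int) -> torch.Tensor:
    """Stack 2D feature maps along a new depth axis so they form a 3D feature
    volume. feat_2d has shape (B, C, H, W); the C channels are regrouped so the
    result has shape (B, C // depth, depth, H, W). The exact grouping used in
    the paper is not spelled out, so this reshaping is an assumption."""
    b, c, h, w = feat_2d.shape
    assert c % depth == 0, "channel count must be divisible by the target depth"
    return feat_2d.view(b, c // depth, depth, h, w)

# Usage inside the decoder (shapes are illustrative):
enc_feat = torch.randn(2, 256, 16, 16)        # 2D encoder feature maps
dec_feat = torch.randn(2, 16, 16, 16, 16)     # 3D decoder features, same resolution
vol = feature_maps_to_volume(enc_feat, depth=16)   # (2, 16, 16, 16, 16)
fused = torch.cat([dec_feat, vol], dim=1)          # concatenate on the channel axis
```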
3.4. Objective Function

We first define some notation for convenience. Because the predicted shape is 3D, let V* = [v_1*, v_2*, ..., v_N*] denote the probabilistic volume obtained by applying a sigmoid to the output volume of our network, where N is the number of voxels in V* and 0 < v_i* < 1 for all i. V = [v_1, v_2, ..., v_N] represents the ground-truth volume, and each voxel of the ground-truth volume is 0 or 1.

3.4.1. Intersection-over-Union (IoU) Loss

Inspired by Richter et al. [1], the IoU divides the size of the intersection by the size of the union. For the segmentation of a 3D object, correct foreground predictions are effectively weighted by the sizes of the ground-truth and predicted shapes, which is beneficial for segmenting the object shape from the background.
To use the IoU for training, the IoU loss is defined as follows:

L_{IoU}(V^*, V) = 1 - \frac{\sum_i v_i^* v_i}{\sum_i [v_i^* + v_i - v_i^* v_i]},    (1)

where V* is the predicted probabilistic volume, V is the ground-truth volume, and i runs over all voxels in both V* and V.
in decoder as Fig. 4 shows.

3.4.2. Mean Squared False Cross Entropy Loss

In general, 3D object reconstruction in a voxel grid is cast as binary classification, so minimizing a binary cross-entropy loss is the main objective. Many works [3] [2] use the standard binary cross-entropy function, which weights false-positive and false-negative results equally.
However, in the context of 3D object reconstruction, the shape volume is sparse: there is a severe imbalance between occupied and unoccupied voxels. If this loss function is used, the loss is therefore unbalanced, and the network easily produces false-positive estimations.
Consequently, inspired by [13], we leverage the mean squared false cross entropy loss (MSFCEL) proposed by Sun et al. [13] to handle this imbalance, expressed as

MSFCEL(V^*, V) = FPCE^2 + FNCE^2,    (2)

where FPCE is the false-positive cross entropy on the unoccupied voxels of the ground-truth shape volume and FNCE is the false-negative cross entropy on the occupied voxels:

FPCE = -\frac{1}{N_1} \sum_{n=1}^{N_1} [v_n \log v_n^* + (1 - v_n) \log(1 - v_n^*)],    (3)

FNCE = -\frac{1}{N_2} \sum_{p=1}^{N_2} [v_p \log v_p^* + (1 - v_p) \log(1 - v_p^*)],    (4)

where N_1 is the number of unoccupied voxels of V and N_2 is the number of occupied voxels, v_n is the n-th unoccupied voxel, v_p is the p-th occupied voxel, and v_n^* and v_p^* are the predicted values for v_n and v_p, respectively.
Thus, the losses on occupied and unoccupied voxels are minimized together and balanced. Finally, we integrate the two loss functions into our final loss function L_Final, expressed as

L_{Final}(V^*, V) = MSFCEL(V^*, V) + L_{IoU}(V^*, V).    (5)
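A compact PyTorch sketch of Eqs. (2)-(5) is given below, reusing the iou_loss function from the earlier sketch; the clamping of the predictions before the logarithm is a numerical-stability detail not present in the equations.

```python
import torch

def msfcel(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Mean squared false cross entropy loss of Eqs. (2)-(4): the cross entropy is
    averaged separately over unoccupied voxels (FPCE) and occupied voxels (FNCE),
    and the two means are squared and summed so both classes are balanced."""
    pred = pred.clamp(eps, 1.0 - eps)
    ce = -(target * torch.log(pred) + (1.0 - target) * torch.log(1.0 - pred))
    fpce = ce[target == 0].mean()   # false-positive term over unoccupied voxels
    fnce = ce[target == 1].mean()   # false-negative term over occupied voxels
    return fpce ** 2 + fnce ** 2

def final_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Integrated objective of Eq. (5), combining MSFCEL and the IoU loss."""
    return msfcel(pred, target) + iou_loss(pred, target)
```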

Table 1. Comparison with other works. Note: * is recalculated by the authors of AtlasNet; ** and *** are tested in our own experiment environment.

Method | 2D Image Encoder-Decoder Arch. | ShapeNet dataset | Mean IoU (%) | Mean CD | Memory (MB) | Time per image
3D-R2N2 [3] | 3D RNN | 13 categories | 56.0% | - | 205.8 MB | -
OGN [14] | Octree Decoder | 13 categories | 59.6% (+3.6%) | - | 296.9 MB | 16 ms on Titan X
PSGN [15] | Point Cloud Decoder | 13 categories | 64.0% (+8.0%) | 6.41* | 148.4 MB | 130 ms on laptop CPU
AtlasNet [9] | Mesh Decoder | 13 categories | - | 5.11 (-1.3) | 488.4 MB | -
VTN [1] | Voxel Tube Decoder | 13 categories | 64.1% (+8.1%) | - | 126.6 MB | 8 ms** on GTX 1080 Ti
PSVH [2] | 3D Shape Decoder | 4 categories | 68.0% | - | 226.9 MB | 18 ms on Tesla M40; 323 ms*** on GTX 1080 Ti
D-DB-MpUB w/ L_Final | 3D Shape Decoder | 13 categories | 66.7% (+10.7%) | 5.01 (-1.40) | 112.3 MB | 11 ms on GTX 1080 Ti
C-D-DB-MpUB w/ L_Final | 3D Shape Decoder | 13 categories | 67.7% (+11.7%) | 4.83 (-1.58) | 118.1 MB | 13 ms on GTX 1080 Ti

4. EXPERIMENTAL RESULTS

4.1. Environment and Dataset

We implement our method in PyTorch. The CPU is an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.1 GHz, the main memory is 32 GB of DDR4 RAM, and the GPU is an NVIDIA GeForce GTX 1080 Ti.
In this work, we mainly use the dataset provided by [3], which renders objects from the ShapeNet dataset [5]. The dataset has 13 categories and consists of nearly 50K 3D objects, and each object has 24 images from different views. We use the same training and testing split provided by Choy et al. [3]. The object resolution of this dataset is 32 × 32 × 32 in voxel representation, and we train one network for all 13 categories.
We train our model with batch size 64 for 210 epochs. The initial learning rate is 10^-3, the learning rate is decayed by 0.5 every 30 epochs, and the optimizer is Adam.
We evaluate our reconstruction results with two metrics: intersection over union (IoU) and chamfer distance (CD). A higher IoU score means a better reconstruction result; a lower CD value means a better reconstruction result.
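The training setup described in this subsection can be written down as the following PyTorch sketch. ReconstructionNet and train_loader are hypothetical placeholders for the network and data pipeline, and final_loss is the loss sketch from Sec. 3.4; only the batch size, epoch count, learning-rate schedule, optimizer, and loss follow the text and Eq. (5).

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# Hypothetical model and dataloader names; hyperparameters follow Sec. 4.1.
model = ReconstructionNet().cuda()
optimizer = Adam(model.parameters(), lr=1e-3)
scheduler = StepLR(optimizer, step_size=30, gamma=0.5)   # halve the LR every 30 epochs

for epoch in range(210):
    for images, voxels in train_loader:                  # batch size 64
        pred = torch.sigmoid(model(images.cuda()))       # probabilistic volume V*
        loss = final_loss(pred, voxels.cuda())           # Eq. (5)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```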
4.2. Results

In terms of the architecture, we propose two modules and leverage the concatenation technique. We then integrate our proposed modules with our loss L_Final. Table 2 shows the influence of each module.

Table 2. Summary of all modules used and comparison of the influence of each module.

Method | D-DB | MpUB | Concat. | w/ L_Final | Mean IoU (%)
SSDB-UB (Baseline) |  |  |  |  | 63.5%
D-DB-UB | √ |  |  |  | 64.4%
D-DB-MpUB | √ | √ |  |  | 64.6%
C-D-DB-MpUB | √ | √ | √ |  | 65.6%
C-D-DB-MpUB w/ L_Final | √ | √ | √ | √ | 67.7%

In this table, our baseline is the sparse step downsample block (SSDB) with the upsample block (UB). D-DB is the dilated downsample block, MpUB is the multi-path upsample block, and w/ L_Final means integrating L_IoU and the MSFCEL. We can infer that each proposed module benefits the reconstruction result. Specifically, the concatenation module gives a 1% improvement over its previous entry in Table 2, and L_Final also brings a large improvement, 2.1% higher than its previous entry.

Table 3. Reconstruction comparison with PSVH.

Method | aero | car | chair | sofa | Mean IoU (%) | Time per image
PSVH [2] | 63.1 | 83.9 | 55.2 | 69.8 | 68.0 | 323 ms
Ours | 68.0 | 85.5 | 57.6 | 74.4 | 71.4 (+3.4) | 13 ms

4.3. Comparison with Other Works

First, we compare our final result with PSVH [2], whose research purpose is the same as ours. Table 3 shows that our reconstruction result is better than PSVH, with a 3.4% improvement. For reconstruction speed, our method takes 13 ms per image, approximately 25 times faster than their 323 ms. Note that we use their pretrained model and test it in our own environment.
Finally, we compare our result with other works in several aspects. As Table 1 shows, our best method achieves a 67.7% IoU score, 3.6% higher than our reference architecture VTN, and obtains better results in both IoU and CD. Even with a slower GPU, our average reconstruction time is lower than that of most works. Note that all CD values are multiplied by 10^3. In addition, we find that our reconstruction time is lower without using concatenation. VTN has the lowest reconstruction time because it applies 2D convolution in its voxel tube decoder, whose computation time is less than that of 3D convolution.

5. CONCLUSION

We design a network for single-view 3D object reconstruction without other additional input. We focus on designing a network with a powerful ability to extract object features and to make full use of them. We propose two blocks, the dilated downsample block and the multi-path upsample block, to enhance our model. In addition, we use concatenation to preserve the encoder features during the reconstruction step and leverage the MSFCEL to handle the imbalance problem.
With all of our proposed modules, our final reconstruction result achieves state-of-the-art performance, 67.7% IoU over 13 categories, a 3.6% improvement compared with VTN [1]. In addition, over 4 categories, our result achieves 71.4% IoU, 3.4% higher than PSVH [2]. In terms of reconstruction speed, our average reconstruction time is 13 ms, 25 times faster than PSVH in our own experiment environment.
6. REFERENCES

[1] Stephan R. Richter and Stefan Roth, "Matryoshka networks: Predicting 3d geometry via nested shape layers," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[2] Hanqing Wang, Jiaolong Yang, Wei Liang, and Xin Tong, "Deep single-view 3d object reconstruction with visual hull embedding," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019.

[3] Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese, "3d-r2n2: A unified approach for single and multi-view 3d object reconstruction," in Computer Vision – ECCV 2016, Cham, 2016, pp. 628–644, Springer International Publishing.

[4] Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik, "Multi-view consistency as supervisory signal for learning shape and pose prediction," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[5] Angel X. Chang, Thomas A. Funkhouser, Leonidas J. Guibas, Pat Hanrahan, Qi-Xing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu, "Shapenet: An information-rich 3d model repository," CoRR, vol. abs/1512.03012, 2015.

[6] Fisher Yu and Vladlen Koltun, "Multi-scale context aggregation by dilated convolutions," arXiv preprint arXiv:1511.07122, 2015.

[7] Minhyuk Sung, Vladimir G. Kim, Roland Angst, and Leonidas Guibas, "Data-driven structural priors for shape completion," ACM Trans. Graph., vol. 34, no. 6, pp. 175:1–175:11, Oct. 2015.

[8] Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik, "Category-specific object reconstruction from a single image," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[9] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, and Mathieu Aubry, "A papier-mâché approach to learning 3d surface generation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[10] Guandao Yang, Yin Cui, Serge Belongie, and Bharath Hariharan, "Learning single-view 3d reconstruction with limited pose supervision," in The European Conference on Computer Vision (ECCV), September 2018.

[11] Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B. Tenenbaum, and William T. Freeman, "Pix3d: Dataset and methods for single-image 3d shape modeling," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[12] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-net: Convolutional networks for biomedical image segmentation," CoRR, vol. abs/1505.04597, 2015.

[13] Yongbin Sun, Ziwei Liu, Yue Wang, and Sanjay E. Sarma, "Im2avatar: Colorful 3d reconstruction from a single image," CoRR, vol. abs/1804.06375, 2018.

[14] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox, "Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs," in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[15] Haoqiang Fan, Hao Su, and Leonidas J. Guibas, "A point set generation network for 3d object reconstruction from a single image," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
