Volumetric and Multi-View CNNs For Object Classification On 3D Data
Charles R. Qi∗ Hao Su∗ Matthias Nießner Angela Dai Mengyuan Yan Leonidas J. Guibas
Stanford University
Abstract

3D shape models are becoming widely available and easier to capture, making available 3D information crucial for progress in object classification. Current state-of-the-art methods rely on CNNs to address this problem. Recently, we witness two types of CNNs being developed: CNNs based upon volumetric representations versus CNNs based upon multi-view representations. Empirical results from these two types of CNNs exhibit a large gap, indicating that existing volumetric CNN architectures and approaches are unable to fully exploit the power of 3D representations. In this paper, we aim to improve both volumetric CNNs and multi-view CNNs according to extensive analysis of existing approaches. To this end, we introduce two distinct network architectures of volumetric CNNs. In addition, we examine multi-view CNNs, where we introduce multi-resolution filtering in 3D. Overall, we are able to outperform current state-of-the-art methods for both volumetric CNNs and multi-view CNNs. We provide extensive experiments designed to evaluate underlying design choices, thus providing a better understanding of the space of methods available for object classification on 3D data.

While the extension of 2D convolutional neural networks to 3D seems natural, the additional computational complexity (volumetric domain) and data sparsity introduce significant challenges; for instance, in an image, every pixel contains observed information, whereas in 3D, a shape is only defined on its surface. Seminal work by Wu et al. [33] proposes volumetric CNN architectures on volumetric grids for object classification and retrieval. While these approaches achieve good results, it turns out that training a CNN on multiple 2D views achieves significantly higher performance, as shown by Su et al. [32], who augment their 2D CNN with pre-training on ImageNet RGB data [6]. These results indicate that existing 3D CNN architectures and approaches are unable to fully exploit the power of 3D representations. In this work, we analyze these observations and evaluate the design choices. Moreover, we show how to reduce the gap between volumetric CNNs and multi-view CNNs by efficiently augmenting training data and introducing new CNN architectures in 3D. Finally, we examine multi-view CNNs; our experiments show that we are able to improve upon the state of the art with improved training data augmentation and a new multi-resolution component.
[Figure 3: architecture diagram. A 30×30×30 input volume passes through mlpconv(48, 6, 2; 48; 48), mlpconv(160, 5, 2; 160; 160), and mlpconv(512, 3, 2; 512; 512); the resulting feature tensor is sliced into subvolumes, each with its own loss ("Prediction by partial object"), and is also fed through fully connected layers (2048, 2048) to the main loss ("Prediction by whole object").]
Figure 3. Auxiliary Training by Subvolume Supervision (Sec 4.2). The main innovation is that we add auxiliary tasks to predict class labels
that focus on part of an object, intended to drive the CNN to more heavily exploit local discriminative features. An mlpconv layer is a
composition of three conv layers interleaved by ReLU layers. The five numbers under mlpconv are the number of channels, kernel size
and stride of the first conv layer, and the number of channels of the second and third conv layers, respectively. The kernel size and stride of
the second and third conv layers are 1. For example, mlpconv(48, 6, 2; 48; 48) is a composition of conv(48, 6, 2), ReLU, conv(48, 1, 1),
ReLU, conv(48, 1, 1) and ReLU layers. Note that we add dropout layers with rate=0.5 after fully connected layers.
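To make the caption's notation concrete, here is a minimal PyTorch-style sketch of one 3D mlpconv block and the three-block backbone of Fig 3. This is a re-implementation sketch under our own assumptions (single-channel occupancy input, no padding), not the authors' Caffe definition.

```python
import torch
import torch.nn as nn

def mlpconv3d(in_ch, out_ch, kernel, stride):
    """mlpconv(out_ch, kernel, stride; out_ch; out_ch): one k x k x k conv
    followed by two 1 x 1 x 1 convs, each followed by a ReLU."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=kernel, stride=stride),
        nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=1), nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=1), nn.ReLU(inplace=True),
    )

# Backbone of Fig 3: a 30x30x30 occupancy grid shrinks to a (512, 2, 2, 2) tensor.
backbone = nn.Sequential(
    mlpconv3d(1, 48, kernel=6, stride=2),     # mlpconv(48, 6, 2; 48; 48)
    mlpconv3d(48, 160, kernel=5, stride=2),   # mlpconv(160, 5, 2; 160; 160)
    mlpconv3d(160, 512, kernel=3, stride=2),  # mlpconv(512, 3, 2; 512; 512)
)

x = torch.zeros(1, 1, 30, 30, 30)   # batch of one 30x30x30 volume
print(backbone(x).shape)            # torch.Size([1, 512, 2, 2, 2])
```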
Data Augmentation  Compared with 2D image datasets, currently available 3D shape datasets are limited in scale and variation. To fully exploit the design of our networks, we augment the training data with different azimuth and elevation rotations. This allows the first network to cover local regions at different orientations, and the second network to relate distant points at different relative angles.

Multi-Orientation Pooling  Both of our new networks are sensitive to shape orientation, i.e., they capture different information at different orientations. To capture a more holistic sense of a 3D object, we add an orientation pooling stage that aggregates information from different orientations.

4.2. Network 1: Auxiliary Training by Subvolume Supervision

We observe significant overfitting when we train the volumetric CNN proposed by [33] in an end-to-end fashion (see supplementary). When the volumetric CNN overfits to the training data, it has no incentive to continue learning. We thus introduce auxiliary tasks that are closely correlated with the main task but are difficult to overfit, so that learning continues even if our main task is overfitted.

These auxiliary training tasks also predict the same object labels, but the predictions are made solely on a local subvolume of the input. Without complete knowledge of the object, the auxiliary tasks are more challenging, and can thus better exploit the discriminative power of local regions. This design is different from the classic multi-task learning setting of heterogeneous auxiliary tasks, which inevitably requires collecting additional annotations (e.g., conducting both object classification and detection [9]).

We implement this design through the architecture shown in Fig 3. The first three layers are mlpconv (multilayer perceptron convolution) layers, a 3D extension of the 2D mlpconv proposed by [23]. The input and output of our mlpconv layers are both 4D tensors. Compared with the standard combination of linear convolutional layers and max pooling layers, mlpconv has a three-layer structure and is thus a universal function approximator if enough neurons are provided in its intermediate layers. Therefore, mlpconv is a powerful filter for feature extraction of local patches, enhancing approximation of more abstract representations. In addition, mlpconv has been validated to be more discriminative with fewer parameters than ordinary convolution with pooling [23].

At the fourth layer, the network branches into two. The lower branch takes the whole object as input for traditional classification. The upper branch is a novel branch for auxiliary tasks. It slices the 512 × 2 × 2 × 2 4D tensor (2 grids along the x, y, z axes and 512 channels) into 2 × 2 × 2 = 8 vectors of dimension 512. We set up a classification task for each vector. A fully connected layer and a softmax layer are then appended independently to each vector to construct classification losses. Simple calculation shows that the receptive field of each task is 22 × 22 × 22, covering roughly 2/3 of the entire volume.
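To make the slicing step concrete, the following is a minimal PyTorch-style sketch of the auxiliary branch (not the authors' Caffe implementation); the tensor layout, class count (40 for ModelNet40), and loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SubvolumeSupervisionHead(nn.Module):
    """Auxiliary heads that classify from each 512-d subvolume vector.

    Assumes the backbone (three 3D mlpconv blocks) outputs a tensor of
    shape (batch, 512, 2, 2, 2), as described in Sec 4.2 / Fig 3.
    """
    def __init__(self, num_classes=40, channels=512, grid=2):
        super().__init__()
        # One independent linear classifier per spatial cell (2*2*2 = 8 heads).
        self.heads = nn.ModuleList(
            [nn.Linear(channels, num_classes) for _ in range(grid ** 3)]
        )

    def forward(self, feat):
        # feat: (B, 512, 2, 2, 2) -> flatten the 8 spatial cells
        b, c = feat.shape[:2]
        cells = feat.reshape(b, c, -1)            # (B, 512, 8)
        # Each head sees only "its" subvolume feature vector.
        return [head(cells[:, :, i]) for i, head in enumerate(self.heads)]

def auxiliary_loss(logits_per_cell, labels):
    # The same object label supervises every partial-object prediction.
    ce = nn.CrossEntropyLoss()
    return sum(ce(logits, labels) for logits in logits_per_cell)
```

In training, this auxiliary loss would be added (possibly with a weight) to the whole-object classification loss from the lower branch.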
4.3. Network 2: Anisotropic Probing

[Figure 4: diagram of the anisotropic probing architecture at 3D resolution 30.]

The success of multi-view CNNs is intriguing. Multi-view CNNs first project 3D objects to 2D and then make use of well-developed 2D image CNNs for classification. Inspired by this success, we design a neural network architecture that is also composed of the two stages. However, while multi-view CNNs use external rendering pipelines from computer graphics, we achieve the 3D-to-2D projection using network layers, in a manner similar to 'X-ray scanning'.

Key to this network is the use of an elongated anisotropic kernel which helps capture the global structure of the 3D volume. As illustrated in Fig 4, the neural network has two modules: an anisotropic probing module and a network-in-network module. The anisotropic probing module contains three convolutional layers of elongated kernels, each followed by a nonlinear ReLU layer. Note that both the input and output of each layer are 3D tensors.

In contrast to traditional isotropic kernels, an anisotropic probing module has the advantage of aggregating long-range interactions in the early feature learning stage with fewer parameters. As a comparison, with traditional neural networks constructed from isotropic kernels, introducing long-range interactions at an early stage can only be achieved through large kernels, which inevitably introduce many more parameters. After anisotropic probing, we use an adapted NIN network [23] to address the classification problem.

Our anisotropic probing network is capable of capturing internal structures of objects through its X-ray-like projection mechanism. This is an ability not offered by standard rendering. Combined with multi-orientation pooling (introduced below), it is possible for this probing mechanism to capture any 3D structure, due to its relationship with the Radon transform.

In addition, this architecture is scalable to higher resolutions, since all its layers can be viewed as 2D. While 3D convolution involves computation at locations of cubic …
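As a rough illustration of the probing idea (not the exact kernel sizes or layer counts from the paper), the sketch below uses elongated 3D kernels that mix information only along one axis and finally collapses that axis, yielding a 2D feature map that a standard image NIN/CNN can consume; all shapes and channel counts here are assumptions.

```python
import torch
import torch.nn as nn

class AnisotropicProbing(nn.Module):
    """Toy version of 'X-ray scanning' with elongated kernels.

    Input:  (B, 1, 30, 30, 30) occupancy grid (resolution is an assumption).
    Output: (B, C, 30, 30) 2D feature map obtained by repeatedly convolving
    along the depth axis only, then collapsing it.
    """
    def __init__(self, channels=32, depth=30):
        super().__init__()
        self.probe = nn.Sequential(
            # Kernels of shape (k, 1, 1): long along depth, 1x1 in-plane.
            nn.Conv3d(1, channels, kernel_size=(5, 1, 1), padding=(2, 0, 0)),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=(5, 1, 1), padding=(2, 0, 0)),
            nn.ReLU(inplace=True),
            # Final elongated kernel spans the whole depth, so depth collapses to 1.
            nn.Conv3d(channels, channels, kernel_size=(depth, 1, 1)),
            nn.ReLU(inplace=True),
        )

    def forward(self, vol):
        feat = self.probe(vol)        # (B, C, 1, 30, 30)
        return feat.squeeze(2)        # (B, C, 30, 30), ready for a 2D NIN/CNN

x = torch.zeros(4, 1, 30, 30, 30)
print(AnisotropicProbing()(x).shape)  # torch.Size([4, 32, 30, 30])
```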
4.4. Multi-Orientation Pooling

… interaction in the early feature extraction stage. Thus it is helpful to augment the training data by varying object orientation and combining predictions through orientation pooling.

Similar to Su-MVCNN [32], which aggregates information from multiple view inputs through a view-pooling layer and follow-on fully connected layers, we sample 3D input from different orientations and aggregate them in a multi-orientation volumetric CNN (MO-VCNN) as shown in Fig 5. At training time, we generate different rotations of the 3D model by changing both azimuth and elevation angles, sampled randomly. A volumetric CNN is first trained on single rotations. Then we decompose the network into CNN1 (lower layers) and CNN2 (higher layers) to construct a multi-orientation version. The MO-VCNN's weights are initialized by a previously trained volumetric CNN, with CNN1's weights fixed during fine-tuning. While a common practice is to extract the highest-level features (features before the last classification linear layer) of multiple orientations, average/max/concatenate them, and train a linear SVM on the combined feature, this is just a special case of the MO-VCNN.

Compared to 3DShapeNets [33], which only augments data by rotating around the vertical axis, our experiment shows that orientation pooling combined with elevation rotation …

[Figure 5: left, a plain volumetric 3D CNN producing a class prediction; right, the MO-VCNN, where each orientation passes through a shared 3D CNN1, the per-orientation features are merged by orientation pooling, and a 3D CNN2 produces the class prediction.]
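A minimal sketch of the MO-VCNN wiring described above, assuming the network is split into CNN1/CNN2 modules and that orientation pooling is a max over orientations (the paper's exact pooling operator and split point may differ):

```python
import torch
import torch.nn as nn

class MOVCNN(nn.Module):
    """Multi-orientation volumetric CNN: shared CNN1 per orientation,
    pooling across orientations, then CNN2 for classification."""
    def __init__(self, cnn1: nn.Module, cnn2: nn.Module):
        super().__init__()
        self.cnn1 = cnn1    # lower layers, weights shared across orientations
        self.cnn2 = cnn2    # higher layers + classifier

    def forward(self, vols):
        # vols: (B, K, 1, D, H, W); K pre-rotated copies of each object
        b, k = vols.shape[:2]
        feats = self.cnn1(vols.flatten(0, 1))        # (B*K, C, d, h, w)
        feats = feats.reshape(b, k, *feats.shape[1:])
        pooled = feats.max(dim=1).values             # orientation pooling (max)
        return self.cnn2(pooled)                     # class logits
```

Initializing cnn1 and cnn2 from a single-orientation model and freezing cnn1 during fine-tuning mirrors the training schedule described above.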
[Figure 7: bar chart of average class and instance accuracy for 3DShapeNets (Wu et al.), VoxNet (Maturana et al.), Ours-MO-SubvolumeSup, Ours-MO-AniProbing, and Ours-MVCNN-Sphere-30.]
Figure 7. Classification accuracy on ModelNet40 (voxelized at resolution 30). Our volumetric CNNs have matched the performance of multi-view CNN at 3D resolution 30 (our implementation of Su-MVCNN [32], rightmost group).

… 3D resolution, we also include Ours-MVCNN-Sphere-30, the result of our multi-view CNN with sphere rendering at 3D resolution 30. More details of the setup can be found in the supplementary.

As can be seen, both of our proposed volumetric CNNs significantly outperform state-of-the-art volumetric CNNs. Moreover, they both match the performance of our multi-view CNN under the same 3D resolution. That is, the gap between volumetric CNNs and multi-view CNNs is closed under 3D resolution 30 on the ModelNet40 dataset, an issue that motivates our study (Sec 3).

Multi-view CNNs  Fig 8 summarizes the performance of multi-view CNNs. Ours-MVCNN-MultiRes is the result of training an SVM over the concatenation of fc7 features from Ours-MVCNN-Sphere-30, Ours-MVCNN-Sphere-60, and Ours-MVCNN. HoGPyramid-LFD is the result of training an SVM over a concatenation of HoG features at three 2D resolutions. Here LFD (lightfield descriptor) simply refers to extracting features from renderings. Ours-MVCNN-MultiRes achieves state-of-the-art performance.

[Figure 8: bar chart of average class and instance accuracy for MVCNN (Su et al.), HoGPyramid-LFD, Ours-MVCNN, and Ours-MVCNN-MultiRes.]
Figure 8. Classification accuracy on ModelNet40 (multi-view representation). The 3D multi-resolution version is the strongest. It is worth noting that the simple baseline HoGPyramid-LFD performs quite well.
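For concreteness, training Ours-MVCNN-MultiRes as described above amounts to concatenating per-resolution CNN features and fitting a linear SVM; the sketch below assumes the fc7 features have already been extracted and saved (the array/file names and the scikit-learn classifier choice are ours, not the paper's).

```python
import numpy as np
from sklearn.svm import LinearSVC

# Assumed precomputed fc7 features for the same ordered set of shapes:
#   feats_sphere30, feats_sphere60, feats_std : (N, 4096) arrays
#   labels : (N,) integer class ids
feats_sphere30 = np.load("fc7_sphere30.npy")   # hypothetical file names
feats_sphere60 = np.load("fc7_sphere60.npy")
feats_std = np.load("fc7_standard.npy")
labels = np.load("labels.npy")

# Multi-resolution descriptor = concatenation of the three fc7 features.
multires = np.concatenate([feats_sphere30, feats_sphere60, feats_std], axis=1)

svm = LinearSVC(C=1.0)      # C is a placeholder hyperparameter
svm.fit(multires, labels)
print("train accuracy:", svm.score(multires, labels))
```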
6.3. Effect of 3D Resolution over Performance

Sec 6.2 shows that our volumetric CNN and multi-view CNN perform comparably at 3D resolution 30. Here we study the effect of 3D resolution for both types of networks.

[Figure 9: top, sphere renderings at 3D resolutions 10, 30, 60, and a standard rendering; bottom, accuracy (%) versus voxelization resolution for CNN-Sphere (single view), Ours-MVCNN-Sphere, Ours-SubvolumeSup (single ori), and Ours-MO-SubvolumeSup.]
Figure 9. Top: sphere rendering at 3D resolutions 10, 30, 60, and standard rendering. Bottom: performance of image-based CNN and volumetric CNN with increasing 3D resolution. The two rightmost points are trained/tested from standard rendering.

Fig 9 shows the performance of our volumetric CNN and multi-view CNN at different 3D resolutions (defined at the beginning of Sec 6). Due to computational cost, we only test our volumetric CNN at 3D resolutions 10 and 30. The observations are: first, the performance of our volumetric CNN and multi-view CNN is on par at the tested 3D resolutions; second, the performance of the multi-view CNN increases as the 3D resolution grows. To further improve the performance of volumetric CNNs, this experiment suggests that it is worth exploring how to scale volumetric CNNs to higher 3D resolutions.

6.4. More Evaluations

Data Augmentation and Multi-Orientation Pooling  We use the same volumetric CNN model, the end-to-end learning version of 3DShapeNets [33], to train and test on three variations of augmented data (Table 1). A similar trend is observed for other volumetric CNN variations.

Data Augmentation          Single-Ori   Multi-Ori   ∆
Azimuth rotation (AZ)      84.7         86.1        1.4
AZ + translation           84.8         86.1        1.3
AZ + elevation rotation    83.0         87.8        4.8
Table 1. Effects of data augmentation on the multi-orientation volumetric CNN. We report classification accuracy on ModelNet40, with (Multi-Ori) or without (Single-Ori) the multi-orientation pooling described in Sec 4.4.

When combined with multi-orientation pooling, applying both azimuth rotation (AZ) and elevation rotation (EL) augmentations is extremely effective. Using only azimuth augmentation (randomly sampled from 0° to 360°) with orientation pooling, the classification performance is increased by 86.1% − 84.7% = 1.4%; combined with elevation augmentation (randomly sampled from −45° to 45°), the improvement becomes more significant, increasing by 87.8% − 83.0% = 4.8%. On the other hand, translation jittering (randomly sampled shifts from 0 to 6 voxels in each direction) provides only marginal influence.
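As an illustration of the azimuth/elevation augmentation discussed above, the sketch below rotates a mesh's vertices before voxelization; the rotation ranges match the text, while the axis conventions and the mesh/voxelization utilities are placeholders of ours.

```python
import numpy as np

def random_azimuth_elevation(vertices, max_elev_deg=45.0):
    """Randomly rotate mesh vertices (N, 3) about the up axis (azimuth,
    0-360 deg) and about a horizontal axis (elevation, +/- max_elev_deg),
    before the shape is voxelized. Axis conventions are an assumption."""
    az = np.deg2rad(np.random.uniform(0.0, 360.0))
    el = np.deg2rad(np.random.uniform(-max_elev_deg, max_elev_deg))

    # Azimuth: rotation about the z (up) axis.
    rot_az = np.array([[np.cos(az), -np.sin(az), 0.0],
                       [np.sin(az),  np.cos(az), 0.0],
                       [0.0,         0.0,        1.0]])
    # Elevation: rotation about the x axis.
    rot_el = np.array([[1.0, 0.0,         0.0],
                       [0.0, np.cos(el), -np.sin(el)],
                       [0.0, np.sin(el),  np.cos(el)]])

    return vertices @ (rot_el @ rot_az).T

# Usage with hypothetical helpers:
#   grid = voxelize(random_azimuth_elevation(mesh.vertices), resolution=30)
```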
Comparison of Volumetric CNN Architectures  The architectures in comparison include VoxNet [24], E2E-[33] (the end-to-end learning variation of [33] implemented in Caffe [16] by ourselves), 3D-NIN (a 3D variation of Network in Network [23] designed by ourselves as in Fig 3, without the "Prediction by partial object" branch), SubvolumeSup (Sec 4.2), and AniProbing (Sec 4.3). Data augmentation of AZ+EL (Sec 6.4) is applied.

Network                Single-Ori   Multi-Ori
E2E-[33]               83.0         87.8
VoxNet [24]            83.8         85.9
3D-NIN                 86.1         88.5
Ours-SubvolumeSup      87.2         89.2
Ours-AniProbing        85.9         89.9
Table 2. Comparison of the performance of volumetric CNN architectures. Numbers reported are classification accuracy on ModelNet40. Results for E2E-[33] (end-to-end learning version) and VoxNet [24] are obtained by ourselves. All experiments use the same set of azimuth- and elevation-augmented data.

From Table 2, first, the two volumetric CNNs we propose, the SubvolumeSup and AniProbing networks, both show superior performance, indicating the effectiveness of our design; second, multi-orientation pooling increases performance for all network variations. This is especially significant for the anisotropic probing network, since each orientation usually carries only partial information of the object.

Comparison of Multi-view Methods  We compare different methods that are based on multi-view representations in Table 3. Methods in the second group are trained on the full ModelNet40 train set. Methods in the first group, SPH, LFD, FV, and Su-MVCNN, are trained on a subset of ModelNet40 containing 3,183 training samples. They are provided for reference. Also note that the MVCNNs in the second group are our implementations in Caffe with AlexNet instead of VGG as in Su-MVCNN [32]. We observe that MVCNNs are superior to methods based on SVMs over hand-crafted features.

Method                    #Views   Accuracy (class)   Accuracy (instance)
SPH (reported by [33])    -        68.2               -
LFD (reported by [33])    -        75.5               -
FV (reported by [32])     12       84.8               -
Su-MVCNN [32]             80       90.1               -
PyramidHoG-LFD            20       87.2               90.5
Ours-MVCNN                20       89.7               92.0
Ours-MVCNN-MultiRes       20       91.4               93.8
Table 3. Comparison of multi-view based methods. Numbers reported are classification accuracy (class average and instance average) on ModelNet40.

Evaluation on the Real-World Reconstruction Dataset  We further assess the performance of volumetric CNNs and multi-view CNNs on real-world reconstructions in Table 4. All methods are trained on CAD models in ModelNet40 but tested on real data, which may be highly partial, noisy, or oversmoothed (Fig 6). Our networks continue to outperform state-of-the-art results. In particular, our 3D multi-resolution filtering is quite effective on real-world data, possibly because the low 3D resolution component filters out spurious and noisy micro-structures. Example results for object retrieval can be found in the supplementary.

Method                    Classification   Retrieval MAP
E2E-[33]                  69.6             -
Su-MVCNN [32]             72.4             35.8
Ours-MO-SubvolumeSup      73.3             39.3
Ours-MO-AniProbing        70.8             40.2
Ours-MVCNN-MultiRes       74.5             51.4
Table 4. Classification accuracy and retrieval MAP on reconstructed meshes of 12-class real-world scans.

7. Conclusion and Future Work

In this paper, we have addressed the task of object classification on 3D data using volumetric CNNs and multi-view CNNs. We have analyzed the performance gap between volumetric CNNs and multi-view CNNs from the perspectives of network architecture and 3D resolution. The analysis motivates us to propose two new architectures of volumetric CNNs, which outperform state-of-the-art volumetric CNNs, achieving comparable performance to multi-view CNNs at the same 3D resolution of 30 × 30 × 30. Further evaluation of the influence of 3D resolution indicates that 3D resolution is likely to be the bottleneck for the performance of volumetric CNNs. Therefore, it is worth exploring the design of efficient volumetric CNN architectures that scale up to higher resolutions.

Acknowledgement. The authors gratefully acknowledge the support of a Stanford Graduate Fellowship, NSF grants IIS-1528025 and DMS-1546206, ONR MURI grant N00014-13-1-0341, a Google Focused Research award, the Max Planck Center for Visual Computing and Communications, and hardware donations from NVIDIA.
References

[1] A. M. Bronstein, M. M. Bronstein, L. J. Guibas, and M. Ovsjanikov. Shape google: Geometric words and expressions for invariant shape retrieval. ACM Transactions on Graphics (TOG), 30(1):1, 2011.
[2] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[3] S. Chaudhuri and V. Koltun. Data-driven suggestions for creativity support in 3d modeling. In ACM Transactions on Graphics (TOG), volume 29, page 183. ACM, 2010.
[4] D.-Y. Chen, X.-P. Tian, Y.-T. Shen, and M. Ouhyoung. On visual similarity based 3d model retrieval. In CGF, volume 22, pages 223–232. Wiley Online Library, 2003.
[5] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In CVPR 2014, pages 3606–3613. IEEE, 2014.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR 2009, pages 248–255. IEEE, 2009.
[7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[8] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard. Multimodal deep learning for robust rgb-d object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 2015.
[9] R. Girshick. Fast r-cnn. In ICCV 2015, pages 1440–1448, 2015.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR 2014, pages 580–587. IEEE, 2014.
[11] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from rgb-d images for object detection and segmentation. In ECCV 2014, pages 345–360. Springer, 2014.
[12] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In CVPR 2015, pages 3279–3286, 2015.
[13] B. K. Horn. Extended gaussian images. Proceedings of the IEEE, 72(12):1671–1686, 1984.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[15] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2008–2016, 2015.
[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[17] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical harmonic representation of 3d shape descriptors. In SGP 2003, volume 6, pages 156–164, 2003.
[18] J. Knopp, M. Prasad, G. Willems, R. Timofte, and L. Van Gool. Hough transform and 3d surf for robust three dimensional classification. In ECCV 2010, pages 589–602. Springer, 2010.
[19] I. Kokkinos, M. M. Bronstein, R. Litman, and A. M. Bronstein. Intrinsic shape context descriptors for deformable shapes. In CVPR 2012, pages 159–166. IEEE, 2012.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[22] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR 2004, volume 2, pages II–97. IEEE, 2004.
[23] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[24] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems, September 2015.
[25] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 32(6):169, 2013.
[26] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin. Shape distributions. ACM Transactions on Graphics (TOG), 21(4):807–832, 2002.
[27] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), pages 512–519. IEEE, 2014.
[28] B. Shi, S. Bai, Z. Zhou, and X. Bai. Deeppano: Deep panoramic representation for 3-d shape recognition. Signal Processing Letters, IEEE, 22(12):2339–2343, 2015.
[29] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV 2012, pages 746–760. Springer, 2012.
[30] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng. Convolutional-recursive deep learning for 3d object classification. In NIPS 2012, pages 665–673, 2012.
[31] S. Song, S. P. Lichtenberg, and J. Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In CVPR 2015, pages 567–576, 2015.
[32] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In ICCV 2015, 2015.
[33] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR 2015, pages 1912–1920, 2015.
[34] J. Xiao, A. Owens, and A. Torralba. Sun3d: A database of big spaces reconstructed using sfm and object labels. In ICCV 2013, pages 1625–1632. IEEE, 2013.

A. Appendix

In this section, we present the positive effects of two add-on modules – volumetric batch normalization (Sec A.1) and spatial transformer networks (Sec A.2). We also provide more details on experiments in the main paper (Sec A.3) and real-world dataset construction (Sec A.4). Retrieval results can also be found in Sec A.5.
Model                          Single-Ori
Ours-SubvolSup + BN            88.8
Ours-SubvolSup + BN + STN      89.1
Table 6. Spatial transformer network helps improve single-orientation classification accuracy.
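As a sketch of the volumetric batch normalization add-on measured in Table 6 (a modern re-implementation sketch, not the authors' code; layer sizes follow the first mlpconv block of Fig 3), 3D batch norm is simply inserted between each convolution and its ReLU:

```python
import torch.nn as nn

def mlpconv3d_bn(in_ch, out_ch, kernel, stride):
    """3D mlpconv block with volumetric batch normalization after each conv."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=kernel, stride=stride),
        nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=1),
        nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=1),
        nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
    )

block1 = mlpconv3d_bn(1, 48, kernel=6, stride=2)   # mirrors mlpconv(48, 6, 2; 48; 48)
```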
[Figure 13: example reconstructed objects per category, including bench, chair, cup, desk, monitor, dresser, nightstand, sofa, table, and toilet, each shown with its object count.]
Figure 13. Our real-world reconstruction test dataset, comprising 12 categories and 243 models. Each row lists a category along with the
number of objects and several example reconstructed models in that category.