
Volumetric and Multi-View CNNs for Object Classification on 3D Data

Charles R. Qi∗ Hao Su∗ Matthias Nießner Angela Dai Mengyuan Yan Leonidas J. Guibas
Stanford University

* indicates equal contributions.

arXiv:1604.03265v2 [cs.CV] 29 Apr 2016

Abstract

3D shape models are becoming widely available and easier to capture, making available 3D information crucial for progress in object classification. Current state-of-the-art methods rely on CNNs to address this problem. Recently, we witness two types of CNNs being developed: CNNs based upon volumetric representations versus CNNs based upon multi-view representations. Empirical results from these two types of CNNs exhibit a large gap, indicating that existing volumetric CNN architectures and approaches are unable to fully exploit the power of 3D representations. In this paper, we aim to improve both volumetric CNNs and multi-view CNNs according to extensive analysis of existing approaches. To this end, we introduce two distinct network architectures of volumetric CNNs. In addition, we examine multi-view CNNs, where we introduce multi-resolution filtering in 3D. Overall, we are able to outperform current state-of-the-art methods for both volumetric CNNs and multi-view CNNs. We provide extensive experiments designed to evaluate underlying design choices, thus providing a better understanding of the space of methods available for object classification on 3D data.

1. Introduction

Understanding 3D environments is a vital element of modern computer vision research due to paramount relevance in many vision systems, spanning a wide field of application scenarios from self-driving cars to autonomous robots. Recent advancements in real-time SLAM techniques and crowd-sourcing of virtual 3D models have additionally facilitated the availability of 3D data [29, 34, 31, 33, 2]. This development has encouraged the lifting of 2D to 3D for deep learning, opening up new opportunities with the additional information of 3D data; e.g., aligning models is easier in 3D Euclidean space. In this paper, we specifically focus on the object classification task on 3D data obtained from both CAD models and commodity RGB-D sensors. In addition, we demonstrate retrieval results in the supplemental material.

While the extension of 2D convolutional neural networks to 3D seems natural, the additional computational complexity (volumetric domain) and data sparsity introduce significant challenges; for instance, in an image, every pixel contains observed information, whereas in 3D, a shape is only defined on its surface. Seminal work by Wu et al. [33] proposes volumetric CNN architectures on volumetric grids for object classification and retrieval. While these approaches achieve good results, it turns out that training a CNN on multiple 2D views achieves a significantly higher performance, as shown by Su et al. [32], who augment their 2D CNN with pre-training from ImageNet RGB data [6]. These results indicate that existing 3D CNN architectures and approaches are unable to fully exploit the power of 3D representations. In this work, we analyze these observations and evaluate the design choices. Moreover, we show how to reduce the gap between volumetric CNNs and multi-view CNNs by efficiently augmenting training data and introducing new CNN architectures in 3D. Finally, we examine multi-view CNNs; our experiments show that we are able to improve upon the state of the art with improved training data augmentation and a new multi-resolution component.

Problem Statement  We consider volumetric representations of 3D point clouds or meshes as input to the 3D object classification problem. This is primarily inspired by recent advances in real-time scanning technology, which use volumetric data representations. We further assume that the input data is already pre-segmented by 3D bounding boxes. In practice, these bounding boxes can be extracted using sliding windows, object proposals, or background subtraction. The output of the method is the category label of the volumetric data instance.

Approach  We provide a detailed analysis of factors that influence the performance of volumetric CNNs, including network architecture and volume resolution. Based upon our analysis, we strive to improve the performance of volumetric CNNs. We propose two volumetric CNN network architectures that significantly improve the state of the art of volumetric CNNs on 3D shape classification. This result has also closed the gap between volumetric CNNs and multi-view CNNs when they are provided with 3D input discretized at 30 × 30 × 30 resolution. The first network introduces auxiliary learning tasks by classifying part of an object, which help to scrutinize details of 3D objects more deeply. The second network uses long anisotropic kernels to probe for long-distance interactions. Combining data augmentation with multi-orientation pooling, we observe significant performance improvements for both networks. We also conduct extensive experiments to study the influence of volume resolution, which sheds light on future directions for improving volumetric CNNs.

Furthermore, we introduce a new multi-resolution component to multi-view CNNs, which improves their already compelling performance.

In addition to providing extensive experiments on 3D CAD model datasets, we also introduce a dataset of real-world 3D data, constructed using dense 3D reconstruction taken with [25]. Experiments show that our networks can better adapt from synthetic data to this real-world data than previous methods.

Figure 1. 3D shape representations: Multi-View Standard Rendering, Multi-View Sphere Rendering, and Volumetric Occupancy Grid of a 3D shape.
2. Related Work

Shape Descriptors  A large variety of shape descriptors has been developed in the computer vision and graphics communities. For instance, shapes can be represented as histograms or bag-of-feature models which are constructed from surface normals and curvatures [13]. Alternatives include models based on distances, angles, triangle areas, or tetrahedra volumes [26], local shape diameters measured at densely-sampled surface points [3], heat kernel signatures [1, 19], or extensions of SIFT and SURF feature descriptors to 3D voxel grids [18]. The spherical harmonic descriptor (SPH) [17] and the Light Field descriptor (LFD) [4] are other popular descriptors. LFD extracts geometric and Fourier descriptors from object silhouettes rendered from several different viewpoints, and can be directly applied to the shape classification task. In contrast to recently developed feature learning techniques, these features are hand-crafted and do not generalize well across different domains.

Convolutional Neural Networks  Convolutional Neural Networks (CNNs) [21] have been successfully used in different areas of computer vision and beyond. In particular, significant progress has been made in the context of learning features. It turns out that training from large RGB image datasets (e.g., ImageNet [6]) is able to learn general-purpose image descriptors that outperform hand-crafted features for a number of vision tasks, including object detection, scene recognition, texture recognition, and classification [7, 10, 27, 5, 12]. This significant improvement in performance on these tasks has decidedly moved the field forward.

CNNs on Depth and 3D Data  With the introduction of commodity range sensors, the depth channel became available to provide additional information that could be incorporated into common CNN architectures. A very first approach combines convolutional and recursive neural networks for learning features and classifying RGB-D images [30]. Impressive performance for object detection from RGB-D images has been achieved using a geocentric embedding for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity [11]. Recently, a CNN architecture has been proposed where the RGB and depth data are processed in two separate streams; in the end, the two streams are combined with a late fusion network [8]. All these descriptors operate on single RGB-D images, thus processing 2.5D data.

Wu et al. [33] lift 2.5D to 3D with their 3DShapeNets approach by categorizing each voxel as free space, surface, or occluded, depending on whether it is in front of, on, or behind the visible surface (i.e., the depth value) from the depth map. The resulting representation is a 3D binary voxel grid, which is the input to a CNN with 3D filter banks. Their method is particularly relevant in the context of this work, as they are the first to apply CNNs to a 3D representation. A similar approach is VoxNet [24], which also uses binary voxel grids and a corresponding 3D CNN architecture. The advantage of these approaches is that they can process different sources of 3D data, including LiDAR point clouds, RGB-D point clouds, and CAD models; we likewise follow this direction.

An alternative direction is to exploit established 2D CNN architectures; to this end, 2D data is extracted from the 3D representation. In this context, DeepPano [28] converts 3D shapes into panoramic views, i.e., a cylinder projection around the principal axis. The current state of the art uses multiple rendered views and trains a CNN that can process all views jointly [32]. This multi-view CNN (MVCNN) is pre-trained on ImageNet [6] and uses view-point pooling to combine all streams obtained from each view. A similar idea on stereo views has been proposed earlier [22].
3. Analysis of state-of-the-art 3D Volumetric CNN versus Multi-View CNN

Two representations of generic 3D shapes are popularly used for object classification, volumetric and multi-view (Fig 1). The volumetric representation encodes a 3D shape as a 3D tensor of binary or real values. The multi-view representation encodes a 3D shape as a collection of renderings from multiple viewpoints. Stored as tensors, both representations can easily be used to train convolutional neural networks, i.e., volumetric CNNs and multi-view CNNs.

Intuitively, a volumetric representation should encode as much information, if not more, than its multi-view counterpart. However, experiments indicate that multi-view CNNs produce superior performance in object classification. Fig 2 reports the classification accuracy on the ModelNet40 dataset by state-of-the-art volumetric/multi-view architectures¹. A volumetric CNN based on voxel occupancy (green) is 7.3% worse than a multi-view CNN (yellow).

We investigate this performance gap in order to ascertain how to improve volumetric CNNs. The gap seems to be caused by two factors: input resolution and network architecture differences. The multi-view CNN downsamples each rendered view to 227 × 227 pixels (Multi-View Standard Rendering in Fig 1); to maintain a similar computational cost, the volumetric CNN uses a 30 × 30 × 30 occupancy grid (Volumetric Occupancy Grid in Fig 1)². As shown in Fig 1, the input to the multi-view CNN captures more detail.

However, the difference in input resolution is not the primary reason for this performance gap, as evidenced by further experiments. We compare the two networks by providing them with data containing a similar level of detail. To this end, we feed the multi-view CNN with renderings of the 30 × 30 × 30 occupancy grid using sphere rendering³, i.e., for each occupied voxel, a ball is placed at its center, with radius equal to the edge length of a voxel (Multi-View Sphere Rendering in Fig 1). We train the multi-view CNN from scratch using these sphere renderings. The accuracy of this multi-view CNN is reported in blue.

As shown in Fig 2, even with a similar level of object detail, the volumetric CNN (green) is 4.8% worse than the multi-view CNN (blue). That is, there is still significant room to improve the architecture of volumetric CNNs. This discovery motivates our efforts in Sec 4 to improve volumetric CNNs. Additionally, low-frequency information in 3D seems to be quite discriminative for object classification: it is possible to achieve 89.5% accuracy (blue) at a resolution of only 30 × 30 × 30. This discovery motivates our efforts in Sec 5 to improve multi-view CNNs with a 3D multi-resolution approach.

Figure 2. Classification accuracy: Multi-View CNN (standard rendering) 92.0, Multi-View CNN (sphere-30 rendering) 89.5, Volumetric CNN (volume 30 × 30 × 30) 84.7. Yellow and blue bars: performance drop of the multi-view CNN due to discretization of CAD models in rendering. Blue and green bars: the volumetric CNN is significantly worse than the multi-view CNN, even though their inputs have similar amounts of information. This indicates that the network of the volumetric CNN is weaker than that of the multi-view CNN.

¹ We train models by replicating the architecture of [33] for volumetric CNNs and [32] for multi-view CNNs. All networks are trained in an end-to-end fashion. All methods are trained/tested on the same split for fair comparison. The reported numbers are average instance accuracy. See Sec 6 for details.
² Note that 30 × 30 × 30 ≈ 227 × 227.
³ It is computationally prohibitive to match the volumetric CNN resolution to the multi-view CNN, which would be 227 × 227 × 227.

4. Volumetric Convolutional Neural Networks

4.1. Overview

We improve volumetric CNNs through three separate means: 1) introducing new network structures; 2) data augmentation; 3) feature pooling.

Network Architecture  We propose two network variations that significantly improve state-of-the-art CNNs on 3D volumetric data. The first network is designed to mitigate overfitting by introducing auxiliary training tasks, which are themselves challenging. These auxiliary tasks encourage the network to predict object class labels from partial subvolumes. Therefore, no additional annotation efforts are needed. The second network is designed to mimic multi-view CNNs, as they are strong in 3D shape classification. Instead of using rendering routines from computer graphics, our network projects a 3D shape to 2D by convolving its 3D volume with an anisotropic probing kernel. This kernel is capable of encoding long-range interactions between points. An image CNN is then appended to classify the 2D projection. Note that the training of the projection module and the image classification module is end-to-end. This emulation of multi-view CNNs achieves similar performance to them, using only standard layers in CNN.

In order to mitigate overfitting from too many parameters, we adopt the mlpconv layer from [23] as our basic building block in both network variations.
Figure 3. Auxiliary Training by Subvolume Supervision (Sec 4.2). The main innovation is that we add auxiliary tasks to predict class labels that focus on part of an object, intended to drive the CNN to more heavily exploit local discriminative features. An mlpconv layer is a composition of three conv layers interleaved by ReLU layers. The five numbers under mlpconv are the number of channels, kernel size and stride of the first conv layer, and the number of channels of the second and third conv layers, respectively. The kernel size and stride of the second and third conv layers are 1. For example, mlpconv(48, 6, 2; 48; 48) is a composition of conv(48, 6, 2), ReLU, conv(48, 1, 1), ReLU, conv(48, 1, 1) and ReLU layers. Note that we add dropout layers with rate=0.5 after fully connected layers.
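For concreteness, the following is a minimal PyTorch-style sketch of such a 3D mlpconv block (our own illustration; the authors' implementation is in Caffe, and the module and variable names here are assumptions):

```python
# A minimal sketch of the 3D mlpconv block described in the caption above.
import torch
import torch.nn as nn

class MLPConv3d(nn.Module):
    """mlpconv(c, k, s; c2; c3): conv(c, k, s) -> ReLU -> conv(c2, 1, 1) -> ReLU -> conv(c3, 1, 1) -> ReLU."""
    def __init__(self, in_channels, c, k, s, c2, c3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_channels, c, kernel_size=k, stride=s),
            nn.ReLU(inplace=True),
            nn.Conv3d(c, c2, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(c2, c3, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Example: mlpconv(48, 6, 2; 48; 48) applied to a 30x30x30 occupancy grid.
x = torch.zeros(1, 1, 30, 30, 30)        # (batch, channel, D, H, W)
y = MLPConv3d(1, 48, 6, 2, 48, 48)(x)    # -> (1, 48, 13, 13, 13)
print(y.shape)
```

The two 1 × 1 × 1 convolutions act as a small per-location MLP on top of the first convolution's responses, which is what makes the block a stronger local feature extractor than a single linear convolution.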

Data Augmentation  Compared with 2D image datasets, currently available 3D shape datasets are limited in scale and variation. To fully exploit the design of our networks, we augment the training data with different azimuth and elevation rotations. This allows the first network to cover local regions at different orientations, and the second network to relate distant points at different relative angles.

Multi-Orientation Pooling  Both of our new networks are sensitive to shape orientation, i.e., they capture different information at different orientations. To capture a more holistic sense of a 3D object, we add an orientation pooling stage that aggregates information from different orientations.

4.2. Network 1: Auxiliary Training by Subvolume Supervision

We observe significant overfitting when we train the volumetric CNN proposed by [33] in an end-to-end fashion (see supplementary). When the volumetric CNN overfits to the training data, it has no incentive to continue learning. We thus introduce auxiliary tasks that are closely correlated with the main task but are difficult to overfit, so that learning continues even if our main task is overfitted.

These auxiliary training tasks also predict the same object labels, but the predictions are made solely on a local subvolume of the input. Without complete knowledge of the object, the auxiliary tasks are more challenging, and can thus better exploit the discriminative power of local regions. This design is different from the classic multi-task learning setting of heterogeneous auxiliary tasks, which inevitably requires collecting additional annotations (e.g., conducting both object classification and detection [9]).

We implement this design through the architecture shown in Fig 3. The first three layers are mlpconv (multilayer perceptron convolution) layers, a 3D extension of the 2D mlpconv proposed by [23]. The input and output of our mlpconv layers are both 4D tensors. Compared with the standard combination of linear convolutional layers and max pooling layers, mlpconv has a three-layer structure and is thus a universal function approximator if enough neurons are provided in its intermediate layers. Therefore, mlpconv is a powerful filter for feature extraction of local patches, enhancing approximation of more abstract representations. In addition, mlpconv has been validated to be more discriminative with fewer parameters than ordinary convolution with pooling [23].

At the fourth layer, the network branches into two. The lower branch takes the whole object as input for traditional classification. The upper branch is a novel branch for auxiliary tasks. It slices the 512 × 2 × 2 × 2 4D tensor (2 grids along the x, y, z axes and 512 channels) into 2 × 2 × 2 = 8 vectors of dimension 512. We set up a classification task for each vector. A fully connected layer and a softmax layer are then appended independently to each vector to construct classification losses. Simple calculation shows that the receptive field of each task is 22 × 22 × 22, covering roughly 2/3 of the entire volume.
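A minimal sketch of this two-branch head follows (our own PyTorch-style illustration, not the authors' released Caffe model; the fully connected sizes come from Fig 3, while the simple unweighted sum of losses and all names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubvolumeSupervisionHead(nn.Module):
    """Whole-object branch plus 8 auxiliary per-subvolume classifiers (Sec 4.2, Fig 3)."""
    def __init__(self, num_classes=40, feat_channels=512):
        super().__init__()
        # Whole-object branch: two fully connected layers (2048-2048) as in Fig 3.
        self.whole = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat_channels * 2 * 2 * 2, 2048), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(2048, 2048), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(2048, num_classes),
        )
        # One independent linear classifier per spatial cell (2x2x2 = 8 cells).
        self.aux = nn.ModuleList([nn.Linear(feat_channels, num_classes) for _ in range(8)])

    def forward(self, feat, labels):
        # feat: (B, 512, 2, 2, 2), output of the third mlpconv layer
        loss = F.cross_entropy(self.whole(feat), labels)
        cells = feat.flatten(start_dim=2).permute(2, 0, 1)     # (8, B, 512)
        for head, cell in zip(self.aux, cells):
            # Each auxiliary task predicts the same object label from one subvolume only.
            loss = loss + F.cross_entropy(head(cell), labels)
        return loss
```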

4.3. Network 2: Anisotropic Probing

The success of multi-view CNNs is intriguing. Multi-view CNNs first project 3D objects to 2D and then make use of well-developed 2D image CNNs for classification. Inspired by this success, we design a neural network architecture that is also composed of the two stages. However, while multi-view CNNs use external rendering pipelines from computer graphics, we achieve the 3D-to-2D projection using network layers, in a manner similar to 'X-ray scanning'.

Key to this network is the use of an elongated anisotropic kernel which helps capture the global structure of the 3D volume. As illustrated in Fig 4, the neural network has two modules: an anisotropic probing module and a network in network module. The anisotropic probing module contains three convolutional layers of elongated kernels, each followed by a nonlinear ReLU layer. Note that both the input and output of each layer are 3D tensors.

In contrast to traditional isotropic kernels, an anisotropic probing module has the advantage of aggregating long-range interactions in the early feature learning stage with fewer parameters. As a comparison, with traditional neural networks constructed from isotropic kernels, introducing long-range interactions at an early stage can only be achieved through large kernels, which inevitably introduce many more parameters. After anisotropic probing, we use an adapted NIN network [23] to address the classification problem.

Our anisotropic probing network is capable of capturing internal structures of objects through its X-ray-like projection mechanism. This is an ability not offered by standard rendering. Combined with multi-orientation pooling (introduced below), it is possible for this probing mechanism to capture any 3D structure, due to its relationship with the Radon transform.

In addition, this architecture is scalable to higher resolutions, since all its layers can be viewed as 2D. While 3D convolution involves computation at locations of cubic resolution, we maintain quadratic compute.

Figure 4. CNN with anisotropic probing kernels. We use an elongated kernel to convolve the 3D cube and aggregate information to a 2D plane. Then we use a 2D NIN (NIN-CIFAR10 [23]) to classify the 2D projection of the original 3D shape.

4.4. Data Augmentation and Multi-Orientation Pooling

The two networks proposed above are both sensitive to model orientation. In the subvolume supervision method, different model orientations define different local subvolumes; in the anisotropic probing method, only voxels of the same height and along the probing direction can have interaction in the early feature extraction stage. Thus it is helpful to augment the training data by varying object orientation and combining predictions through orientation pooling.

Similar to Su-MVCNN [32], which aggregates information from multiple view inputs through a view-pooling layer and follow-on fully connected layers, we sample 3D input from different orientations and aggregate them in a multi-orientation volumetric CNN (MO-VCNN), as shown in Fig 5. At training time, we generate different rotations of the 3D model by changing both azimuth and elevation angles, sampled randomly. A volumetric CNN is first trained on single rotations. Then we decompose the network into CNN1 (lower layers) and CNN2 (higher layers) to construct a multi-orientation version. The MO-VCNN's weights are initialized by a previously trained volumetric CNN, with CNN1's weights fixed during fine-tuning. While a common practice is to extract the highest-level features (features before the last classification linear layer) of multiple orientations, average/max/concatenate them, and train a linear SVM on the combined feature, this is just a special case of the MO-VCNN.

Compared to 3DShapeNets [33], which only augments data by rotating around the vertical axis, our experiment shows that orientation pooling combined with elevation rotation can greatly increase performance.

Figure 5. Left: Volumetric CNN (single orientation input). Right: Multi-orientation volumetric CNN (MO-VCNN), which takes in various orientations of the 3D input, extracts features from a shared CNN1, and then passes the pooled feature through another network CNN2 to make a prediction.
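The following sketch illustrates the anisotropic probing idea in PyTorch (our own illustration: kernels elongated along a single axis collapse that axis to produce a 2D projection, which a 2D NIN-style classifier then consumes; the exact kernel lengths, strides, and channel counts of the paper's network are not reproduced here and are placeholders):

```python
import torch
import torch.nn as nn

class AnisotropicProbing(nn.Module):
    """Project a 30^3 occupancy grid to a 2D feature map with kernels elongated
    along the z axis, then classify the projection with a small 2D NIN-style CNN.
    Kernel lengths, strides and channel counts are illustrative placeholders."""
    def __init__(self, num_classes=40):
        super().__init__()
        # Three conv layers with kernels elongated along z only (kernel = (kz, 1, 1)).
        self.probe = nn.Sequential(
            nn.Conv3d(1, 5, kernel_size=(10, 1, 1), stride=(10, 1, 1)), nn.ReLU(inplace=True),  # z: 30 -> 3
            nn.Conv3d(5, 5, kernel_size=(3, 1, 1)), nn.ReLU(inplace=True),                      # z: 3 -> 1
            nn.Conv3d(5, 1, kernel_size=(1, 1, 1)), nn.ReLU(inplace=True),
        )
        # NIN-style 2D classifier on the 30x30 projection (stand-in for NIN-CIFAR10 [23]).
        self.nin2d = nn.Sequential(
            nn.Conv2d(1, 96, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(96, 96, 1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(96, num_classes),
        )

    def forward(self, vol):                 # vol: (B, 1, 30, 30, 30)
        proj = self.probe(vol).squeeze(2)   # (B, 1, 30, 30), the 'X-ray' projection
        return self.nin2d(proj)

logits = AnisotropicProbing()(torch.zeros(2, 1, 30, 30, 30))
print(logits.shape)  # torch.Size([2, 40])
```

Because all information along the probing axis is aggregated by convolution rather than by rendering, occluded interior structure still contributes to the projection, which is the property the text above highlights.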

5. Multi-View Convolutional Neural Networks

The multi-view CNN proposed by [32] is a strong alternative to volumetric representations. This multi-view representation is constructed in three steps: first, a 3D shape is rendered into multiple images using varying camera extrinsics; then image features (e.g., conv5 features in VGG or AlexNet) are extracted for each view; lastly, features are combined across views through a pooling layer, followed by fully connected layers.

Although the multi-view CNN presented by [32] produces compelling results, we are able to improve its performance through a multi-resolution extension with improved data augmentation. We introduce multi-resolution 3D filtering to capture information at multiple scales. We perform sphere rendering (see Sec 3) at different volume resolutions. Note that we use spheres for this discretization as they are view-invariant. In particular, this helps regularize out potential noise or irregularities in real-world scanned data (relative to synthetic training data), enabling robust performance on real-world scans. Note that our 3D multi-resolution filtering is different from classical 2D multi-resolution approaches, since the 3D filtering respects the distance in 3D.

Additionally, we also augment training data with variations in both azimuth and elevation, as opposed to azimuth only. We use AlexNet instead of VGG for efficiency.
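A minimal sketch of this multi-resolution multi-view pipeline follows (our own PyTorch illustration under assumptions; the authors fine-tune AlexNet in Caffe and train an SVM over concatenated fc7 features, see Sec 6.2, whereas here a shared backbone, view max-pooling, and a linear classifier stand in for that pipeline):

```python
import torch
import torch.nn as nn

class MultiResMVCNN(nn.Module):
    """View-pool CNN features per 3D resolution, then concatenate across resolutions."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_resolutions: int, num_classes: int = 40):
        super().__init__()
        self.backbone = backbone                      # shared 2D CNN applied to every rendered view
        self.classifier = nn.Linear(feat_dim * num_resolutions, num_classes)

    def forward(self, views_per_res):
        # views_per_res: list (one entry per 3D resolution) of tensors (B, V, 3, H, W)
        descriptors = []
        for views in views_per_res:
            b, v, c, h, w = views.shape
            feats = self.backbone(views.reshape(b * v, c, h, w)).reshape(b, v, -1)
            descriptors.append(feats.max(dim=1).values)   # element-wise max across the V views
        return self.classifier(torch.cat(descriptors, dim=1))

# Toy usage with a tiny stand-in backbone (AlexNet fc7 features in the paper).
backbone = nn.Sequential(nn.Conv2d(3, 8, 7, stride=4), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = MultiResMVCNN(backbone, feat_dim=8, num_resolutions=3)
views = [torch.zeros(2, 20, 3, 227, 227) for _ in range(3)]   # e.g., sphere-30, sphere-60, standard
print(model(views).shape)   # torch.Size([2, 40])
```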
6. Experiments

We evaluate our volumetric CNNs and multi-view CNNs along with the current state of the art on the ModelNet dataset [33] and a new dataset of real-world reconstructions of 3D objects.

For convenience in the following discussions, we define 3D resolution to be the discretization resolution of a 3D shape. That is, a 30 × 30 × 30 volume has 3D resolution 30. The sphere rendering from this volume also has 3D resolution 30, though it may have higher 2D image resolution.

6.1. Datasets

ModelNet  We use ModelNet [33] for our training and testing datasets. ModelNet currently contains 127,915 3D CAD models from 662 categories. ModelNet40, a subset including 12,311 models from 40 categories, is well annotated and can be downloaded from the web. The authors also provide a training and testing split on the website, in which there are 9,843 training and 2,468 test models⁴. We use this train/test split for our experiments.

By default, we report classification accuracy on all models in the test set (average instance accuracy). For comparisons with previous work we also report average class accuracy.

⁴ VoxNet [24] uses the train/test split provided on the website and reports average class accuracy on the 2,468 test split. 3DShapeNets [33] and MVCNN [32] use another train/test split comprising the first 80 shapes of each category in the "train" folder (or all shapes if there are fewer than 80) and the first 20 shapes of each category in the "test" folder, respectively.

Real-world Reconstructions  We provide a new real-world scanning dataset benchmark, comprising 243 objects of 12 categories; the geometry is captured with an ASUS Xtion Pro and a dense reconstruction is obtained using the publicly-available VoxelHashing framework [25]. For each scan, we have performed a coarse, manual segmentation of the object of interest. In addition, each scan is aligned with the world-up vector. While there are existing datasets captured with commodity range sensors, e.g., [29, 34, 31], this is the first containing hundreds of annotated models from dense 3D reconstructions. The goal of this dataset is to provide an example of modern real-time 3D reconstructions, i.e., structured representations more complete than a single RGB-D frame but still with many occlusions. This dataset is used as a test set.

Figure 6. Example models from our real-world dataset: (a) bathtub, (b) sofa, (c) chair, (d) monitor, (e) bed. Each model is a dense 3D reconstruction, annotated, and segmented from the background.

6.2. Comparison with State-of-the-Art Methods

We compare our methods with the state of the art for shape classification on the ModelNet40 dataset. In the following, we discuss the results within volumetric CNN methods and within multi-view CNN methods.
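For reference, the two accuracy metrics used throughout this section can be computed as follows (a minimal sketch of ours, with hypothetical predictions for illustration only):

```python
import numpy as np

def instance_and_class_accuracy(pred, label, num_classes=40):
    """pred, label: integer category arrays over the whole test set."""
    pred, label = np.asarray(pred), np.asarray(label)
    instance_acc = float((pred == label).mean())               # averaged over all test models
    per_class = [(pred[label == c] == c).mean()                # accuracy within each category
                 for c in range(num_classes) if (label == c).any()]
    return instance_acc, float(np.mean(per_class))

# Hypothetical predictions for illustration.
inst, cls = instance_and_class_accuracy([0, 1, 1, 2], [0, 1, 2, 2], num_classes=3)
print(inst, cls)   # 0.75, ~0.833
```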
Volumetric CNNs  Fig 7 summarizes the performance of volumetric CNNs. Ours-MO-SubvolumeSup is the subvolume supervision network in Sec 4.2 and Ours-MO-AniProbing is the anisotropic probing network in Sec 4.3. Data augmentation is applied as described in Sec 6.4 (azimuth and elevation rotations). For clarity, we use MO- to denote that both networks are trained with an additional multi-orientation pooling step (20 orientations in practice). For reference of multi-view CNN performance at the same 3D resolution, we also include Ours-MVCNN-Sphere-30, the result of our multi-view CNN with sphere rendering at 3D resolution 30. More details of the setup can be found in the supplementary.

As can be seen, both of our proposed volumetric CNNs significantly outperform state-of-the-art volumetric CNNs. Moreover, they both match the performance of our multi-view CNN under the same 3D resolution. That is, the gap between volumetric CNNs and multi-view CNNs is closed under 3D resolution 30 on the ModelNet40 dataset, an issue that motivates our study (Sec 3).

Figure 7. Classification accuracy on ModelNet40 (voxelized at resolution 30), comparing 3DShapeNets (Wu et al.), VoxNet (Maturana et al.), Ours-MO-SubvolumeSup, Ours-MO-AniProbing, and Ours-MVCNN-Sphere-30 in terms of average class and average instance accuracy. Our volumetric CNNs have matched the performance of the multi-view CNN at 3D resolution 30 (our implementation of Su-MVCNN [32], rightmost group).

Multi-view CNNs  Fig 8 summarizes the performance of multi-view CNNs. Ours-MVCNN-MultiRes is the result of training an SVM over the concatenation of fc7 features from Ours-MVCNN-Sphere-30, -60, and Ours-MVCNN. HoGPyramid-LFD is the result of training an SVM over a concatenation of HoG features at three 2D resolutions. Here LFD (lightfield descriptor) simply refers to extracting features from renderings. Ours-MVCNN-MultiRes achieves state-of-the-art performance.

Figure 8. Classification accuracy on ModelNet40 (multi-view representation), comparing MVCNN (Su et al.), HoGPyramid-LFD, Ours-MVCNN, and Ours-MVCNN-MultiRes in terms of average class and average instance accuracy. The 3D multi-resolution version is the strongest. It is worth noting that the simple baseline HoGPyramid-LFD performs quite well.

6.3. Effect of 3D Resolution over Performance

Sec 6.2 shows that our volumetric CNN and multi-view CNN perform comparably at 3D resolution 30. Here we study the effect of 3D resolution for both types of networks.

Fig 9 shows the performance of our volumetric CNN and multi-view CNN at different 3D resolutions (defined at the beginning of Sec 6). Due to computational cost, we only test our volumetric CNN at 3D resolutions 10 and 30. The observations are: first, the performance of our volumetric CNN and multi-view CNN is on par at the tested 3D resolutions; second, the performance of the multi-view CNN increases as the 3D resolution grows. To further improve the performance of volumetric CNNs, this experiment suggests that it is worth exploring how to scale volumetric CNNs to higher 3D resolutions.

Figure 9. Top: sphere rendering at 3D resolution 10, 30, 60, and standard rendering. Bottom: accuracy (%) versus model voxelization resolution for the image-based CNNs (CNN-Sphere single view, Ours-MVCNN-Sphere) and the volumetric CNNs (Ours-SubvolumeSup single orientation, Ours-MO-SubvolumeSup). The two rightmost points are trained/tested from standard rendering.

6.4. More Evaluations

Data Augmentation and Multi-Orientation Pooling  We use the same volumetric CNN model, the end-to-end learning version of 3DShapeNets [33], to train and test on three variations of augmented data (Table 1). A similar trend is observed for other volumetric CNN variations.

Data Augmentation           Single-Ori   Multi-Ori   ∆
Azimuth rotation (AZ)       84.7         86.1        1.4
AZ + translation            84.8         86.1        1.3
AZ + elevation rotation     83.0         87.8        4.8

Table 1. Effects of data augmentation on the multi-orientation volumetric CNN. We report classification accuracy on ModelNet40, with (Multi-Ori) or without (Single-Ori) the multi-orientation pooling described in Sec 4.4.

When combined with multi-orientation pooling, applying both azimuth rotation (AZ) and elevation rotation (EL) augmentations is extremely effective. Using only azimuth augmentation (randomly sampled from 0° to 360°) with orientation pooling, the classification performance is increased by 86.1% − 84.7% = 1.4%; combined with elevation augmentation (randomly sampled from −45° to 45°), the improvement becomes more significant, increasing by 87.8% − 83.0% = 4.8%. On the other hand, translation jittering (randomly sampled shift from 0 to 6 voxels in each direction) provides only marginal influence.
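A minimal sketch of the multi-orientation pooling that produces the Multi-Ori numbers above (our own PyTorch illustration; the CNN1/CNN2 split and element-wise pooling follow Sec 4.4, while the toy layers, the max pooling choice, and all names are assumptions):

```python
import torch
import torch.nn as nn

class MOVCNN(nn.Module):
    """Multi-orientation volumetric CNN: shared CNN1, orientation pooling, CNN2."""
    def __init__(self, cnn1: nn.Module, cnn2: nn.Module):
        super().__init__()
        self.cnn1, self.cnn2 = cnn1, cnn2

    def forward(self, vols):
        # vols: (B, K, 1, 30, 30, 30) -- K rotated copies (azimuth/elevation) of each shape
        b, k = vols.shape[:2]
        feats = self.cnn1(vols.flatten(0, 1))          # CNN1 is shared across orientations
        feats = feats.reshape(b, k, *feats.shape[1:])
        pooled = feats.max(dim=1).values               # orientation pooling (element-wise max)
        return self.cnn2(pooled)

# Toy stand-ins for CNN1/CNN2 (the paper fixes CNN1 and fine-tunes CNN2; see Sec 4.4).
cnn1 = nn.Sequential(nn.Conv3d(1, 8, 6, stride=2), nn.ReLU())
cnn2 = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, 40))
print(MOVCNN(cnn1, cnn2)(torch.zeros(2, 20, 1, 30, 30, 30)).shape)   # torch.Size([2, 40])
```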

Comparison of Volumetric CNN Architectures  The architectures in comparison include VoxNet [24], E2E-[33] (the end-to-end learning variation of [33] implemented in Caffe [16] by ourselves), 3D-NIN (a 3D variation of Network in Network [23] designed by ourselves as in Fig 3, without the "Prediction by partial object" branch), SubvolumeSup (Sec 4.2), and AniProbing (Sec 4.3). Data augmentation of AZ+EL (Sec 6.4) is applied.

Network               Single-Ori   Multi-Ori
E2E-[33]              83.0         87.8
VoxNet [24]           83.8         85.9
3D-NIN                86.1         88.5
Ours-SubvolumeSup     87.2         89.2
Ours-AniProbing       85.9         89.9

Table 2. Comparison of performance of volumetric CNN architectures. Numbers reported are classification accuracy on ModelNet40. Results from E2E-[33] (end-to-end learning version) and VoxNet [24] are obtained by ourselves. All experiments use the same set of azimuth and elevation augmented data.

From Table 2: first, the two volumetric CNNs we propose, the SubvolumeSup and AniProbing networks, both show superior performance, indicating the effectiveness of our design; second, multi-orientation pooling increases performance for all network variations. This is especially significant for the anisotropic probing network, since each orientation usually only carries partial information of the object.

Comparison of Multi-view Methods  We compare different methods that are based on multi-view representations in Table 3. Methods in the second group are trained on the full ModelNet40 train set. Methods in the first group, SPH, LFD, FV, and Su-MVCNN, are trained on a subset of ModelNet40 containing 3,183 training samples. They are provided for reference. Also note that the MVCNNs in the second group are our implementations in Caffe with AlexNet instead of VGG as in Su-MVCNN [32].

We observe that MVCNNs are superior to methods based on SVMs over hand-crafted features.

Method                    #Views   Accuracy (class)   Accuracy (instance)
SPH (reported by [33])    -        68.2               -
LFD (reported by [33])    -        75.5               -
FV (reported by [32])     12       84.8               -
Su-MVCNN [32]             80       90.1               -
PyramidHoG-LFD            20       87.2               90.5
Ours-MVCNN                20       89.7               92.0
Ours-MVCNN-MultiRes       20       91.4               93.8

Table 3. Comparison of multi-view based methods. Numbers reported are classification accuracy (class average and instance average) on ModelNet40.

Evaluation on the Real-World Reconstruction Dataset  We further assess the performance of volumetric CNNs and multi-view CNNs on real-world reconstructions in Table 4. All methods are trained on CAD models in ModelNet40 but tested on real data, which may be highly partial, noisy, or oversmoothed (Fig 6). Our networks continue to outperform state-of-the-art results. In particular, our 3D multi-resolution filtering is quite effective on real-world data, possibly because the low 3D resolution component filters out spurious and noisy micro-structures. Example results for object retrieval can be found in the supplementary.

Method                   Classification   Retrieval MAP
E2E-[33]                 69.6             -
Su-MVCNN [32]            72.4             35.8
Ours-MO-SubvolumeSup     73.3             39.3
Ours-MO-AniProbing       70.8             40.2
Ours-MVCNN-MultiRes      74.5             51.4

Table 4. Classification accuracy and retrieval MAP on reconstructed meshes of 12-class real-world scans.

7. Conclusion and Future Work

In this paper, we have addressed the task of object classification on 3D data using volumetric CNNs and multi-view CNNs. We have analyzed the performance gap between volumetric CNNs and multi-view CNNs from the perspectives of network architecture and 3D resolution. The analysis motivates us to propose two new architectures of volumetric CNNs, which outperform state-of-the-art volumetric CNNs, achieving comparable performance to multi-view CNNs at the same 3D resolution of 30 × 30 × 30. Further evaluation of the influence of 3D resolution indicates that 3D resolution is likely to be the bottleneck for the performance of volumetric CNNs. Therefore, it is worth exploring the design of efficient volumetric CNN architectures that scale up to higher resolutions.

Acknowledgement. The authors gratefully acknowledge the support of a Stanford Graduate Fellowship, NSF grants IIS-1528025 and DMS-1546206, ONR MURI grant N00014-13-1-0341, a Google Focused Research award, the Max Planck Center for Visual Computing and Communications, and hardware donations by NVIDIA.
References

[1] A. M. Bronstein, M. M. Bronstein, L. J. Guibas, and M. Ovsjanikov. Shape google: Geometric words and expressions for invariant shape retrieval. ACM Transactions on Graphics (TOG), 30(1):1, 2011.
[2] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[3] S. Chaudhuri and V. Koltun. Data-driven suggestions for creativity support in 3d modeling. In ACM Transactions on Graphics (TOG), volume 29, page 183. ACM, 2010.
[4] D.-Y. Chen, X.-P. Tian, Y.-T. Shen, and M. Ouhyoung. On visual similarity based 3d model retrieval. In CGF, volume 22, pages 223–232. Wiley Online Library, 2003.
[5] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In CVPR 2014, pages 3606–3613. IEEE, 2014.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR 2009, pages 248–255. IEEE, 2009.
[7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[8] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard. Multimodal deep learning for robust rgb-d object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 2015.
[9] R. Girshick. Fast r-cnn. In ICCV 2015, pages 1440–1448, 2015.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR 2014, pages 580–587. IEEE, 2014.
[11] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from rgb-d images for object detection and segmentation. In ECCV 2014, pages 345–360. Springer, 2014.
[12] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In CVPR 2015, pages 3279–3286, 2015.
[13] B. K. Horn. Extended gaussian images. Proceedings of the IEEE, 72(12):1671–1686, 1984.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[15] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2008–2016, 2015.
[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[17] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical harmonic representation of 3d shape descriptors. In SGP 2003, volume 6, pages 156–164, 2003.
[18] J. Knopp, M. Prasad, G. Willems, R. Timofte, and L. Van Gool. Hough transform and 3d surf for robust three dimensional classification. In ECCV 2010, pages 589–602. Springer, 2010.
[19] I. Kokkinos, M. M. Bronstein, R. Litman, and A. M. Bronstein. Intrinsic shape context descriptors for deformable shapes. In CVPR 2012, pages 159–166. IEEE, 2012.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[22] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR 2004, volume 2, pages II–97. IEEE, 2004.
[23] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[24] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems, September 2015.
[25] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 32(6):169, 2013.
[26] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin. Shape distributions. ACM Transactions on Graphics (TOG), 21(4):807–832, 2002.
[27] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pages 512–519. IEEE, 2014.
[28] B. Shi, S. Bai, Z. Zhou, and X. Bai. Deeppano: Deep panoramic representation for 3-d shape recognition. Signal Processing Letters, IEEE, 22(12):2339–2343, 2015.
[29] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV 2012, pages 746–760. Springer, 2012.
[30] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng. Convolutional-recursive deep learning for 3d object classification. In NIPS 2012, pages 665–673, 2012.
[31] S. Song, S. P. Lichtenberg, and J. Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In CVPR 2015, pages 567–576, 2015.
[32] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In ICCV 2015, 2015.
[33] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR 2015, pages 1912–1920, 2015.
[34] J. Xiao, A. Owens, and A. Torralba. Sun3d: A database of big spaces reconstructed using sfm and object labels. In ICCV 2013, pages 1625–1632. IEEE, 2013.

A. Appendix

In this section, we present the positive effects of two add-on modules: volumetric batch normalization (Sec A.1) and spatial transformer networks (Sec A.2). We also provide more details on the experiments in the main paper (Sec A.3) and the real-world dataset construction (Sec A.4). Retrieval results can also be found in Sec A.5.

A.1. Batch Normalization


We observe that using batch normalization [14] can accelerate the training process and also improve final performance. Taking our subvolume supervision model (base network is 3D-NIN) for example, the classification accuracy from single orientation is 87.2% and 88.8% before and after using batch normalization, respectively. Complete results are in Table 5.
are in Table 5.
Specifically, compared with the model described in the main paper, we add batch normalization layers after each convolutional and fully connected layer. We also add dropout layers after each convolutional layer.

Model                    Single-Ori   Multi-Ori
Ours-SubvolSup           87.2         89.2
Ours-AniProbing          85.9         89.9
Ours-SubvolSup + BN      88.8         90.1
Ours-AniProbing + BN     87.5         90.0

Table 5. Positive effect of adding batch normalization at convolutional layers. Numbers reported are classification accuracy (instance average) on the ModelNet40 test set.
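As a sketch of where the normalization is inserted (our own PyTorch-style illustration; the authors use Caffe, and the dropout rate after convolutional layers is an assumption since Sec A.1 does not state it):

```python
import torch.nn as nn

def conv_bn_relu(cin, cout, k, s=1, p_drop=0.5):
    # conv -> batch norm -> ReLU -> dropout, the pattern described in Sec A.1
    return nn.Sequential(nn.Conv3d(cin, cout, k, stride=s),
                         nn.BatchNorm3d(cout),
                         nn.ReLU(inplace=True),
                         nn.Dropout3d(p_drop))

# mlpconv(48, 6, 2; 48; 48) with batch normalization after every conv layer
mlpconv_bn = nn.Sequential(conv_bn_relu(1, 48, 6, s=2),
                           conv_bn_relu(48, 48, 1),
                           conv_bn_relu(48, 48, 1))
```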

A.2. Spatial Transformer Networks


One disadvantage of the multi-view/orientation method is that one needs to prepare multiple views/orientations of the 3D data, which is computationally more expensive. It would be ideal if we could achieve similar performance with just a single input. In this section we show how a Spatial Transformer Network (STN) [15] can help boost our model's performance on single-orientation input.

Model                         Single-Ori
Ours-SubvolSup + BN           88.8
Ours-SubvolSup + BN + STN     89.1

Table 6. Spatial transformer network helps improve single orientation classification accuracy.

The spatial transformer network has three components: (1) a regressor network which takes the occupancy grid as input and predicts transformation parameters; (2) a grid generator that outputs a sampling grid based on the transformation; and (3) a sampler that transforms the input volume to a new volume based on the sampling grid. We include a spatial transformer network directly after the data layer and before the original volumetric CNN (see Table 6 for results).
In Fig 10, we visualize the effect of the spatial transformer network on some exemplar input occupancy grids.

Figure 10. Each row is an input and output pair of the spatial transformer network ('table' category). Each point represents an occupied voxel and color is determined by depth. We see the STN tends to align all the tables to a canonical viewpoint.

A.3. Details on Model Training

Training for Our Volumetric CNNs  To produce occupancy grids from meshes, the faces of a mesh are subdivided until the length of the longest edge is within a single voxel; then all voxels that intersect with a face are marked as occupied. For 3D resolutions 10, 30 and 60 we generate voxelizations with central regions 10, 24, 54 and padding 0, 3, 3, respectively. This voxelization is followed by a hole filling step that fills the holes inside the models as occupied voxels.

To augment our training data with azimuth and elevation rotations, we generate 60 voxelizations for each model, with azimuth uniformly sampled from [0, 360] and elevation uniformly sampled from [−45, 45] (both in degrees).

We use a Nesterov solver with learning rate 0.005 and weight decay 0.0005 for training. It takes around 6 hours to train on a K40 using Caffe [16] for the subvolume supervision CNN and 20 hours for the anisotropic probing CNN. For multi-orientation versions of them, SubvolumeSup splits at the last conv layer and AniProbing splits at the second last conv layer. Volumetric CNNs trained on single orientation inputs are then used to initialize their multi-orientation versions for fine tuning.

During testing time, 20 orientations of a CAD model occupancy grid (equally distributed azimuth and uniformly sampled elevation from [−45, 45]) are input to the MO-VCNN to make a class prediction.

Training for Our MVCNN and Multi-resolution MVCNN  We use Blender to render 20 views of each (either ordinary or spherical) CAD model from azimuth angles in 0, 36, 72, ..., 324 degrees and elevation angles in −30 and 30 degrees. For sphere rendering, we convert voxelized CAD models into meshes by replacing each voxel with an approximate sphere with 50 faces and diameter equal to the voxel size. Four fixed point light sources are used for the ray-tracing rendering.

We first finetune AlexNet with rendered images for ordinary rendering and multi-resolution sphere renderings separately. Then we use the trained AlexNet to initialize the MVCNN and fine-tune it on multi-view inputs.

Other Volumetric Data Representations  Note that while we present our volumetric CNN methods using occupancy grid representations of 3D objects, our approaches easily generalize to other volumetric data representations. In particular, we have also used Signed Distance Functions and (unsigned) Distance Functions as input (also 30 × 30 × 30 grids). Signed distance fields were generated through virtual scanning of synthetic training data, using volumetric fusion (for our real-world reconstructed models, this is the natural representation); distance fields were generated directly from the surfaces of the models. Performance was not affected significantly by the different representations, differing by around 0.5% to 1.0% in classification accuracy on ModelNet test data.

A.4. Real-world Reconstruction Test Data

In order to evaluate our method on real scanning data, we obtain a dataset of 3D models, which we reconstruct using data from a commodity RGB-D sensor (ASUS Xtion Pro). To this end, we pick a variety of real-world objects for which we record a short RGB-D frame sequence (several hundred frames) for each instance. For each object, we use the publicly-available Voxel Hashing framework in order to obtain a dense 3D reconstruction. In a semi-automatic post-processing step, we segment out the object of interest's geometry by removing the scene background. In addition, we align the obtained model with the world up direction. Overall, we obtained scans of 243 objects, comprising a total of over XYZ thousand RGB-D input frames.

A.5. More Retrieval Results

For model retrieval, we extract CNN features (either from 3D CNNs or MVCNNs) from query models and find nearest neighbor results based on L2 distance. Similar to MVCNN (Su et al.) [32], we use a low-rank Mahalanobis metric to optimize retrieval performance. Figure 11 and Figure 12 show more examples of retrieval from real model queries.
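A minimal sketch of this retrieval step (our own NumPy illustration with hypothetical feature arrays; the low-rank Mahalanobis metric learning from [32] is omitted, plain L2 distance only):

```python
import numpy as np

def retrieve(query_feat, db_feats, top_k=5):
    """Return indices of the top_k database models closest to the query in L2 distance."""
    dists = np.linalg.norm(db_feats - query_feat[None, :], axis=1)
    return np.argsort(dists)[:top_k]

# Hypothetical 4096-d fc7-style features for one query and a small database.
rng = np.random.default_rng(0)
db = rng.standard_normal((100, 4096)).astype(np.float32)
query = db[42] + 0.01 * rng.standard_normal(4096).astype(np.float32)
print(retrieve(query, db))   # index 42 should rank first
```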
Figure 11. More retrieval results. Left column: queries, real reconstructed meshes. Right five columns: retrieved models from ModelNet40
Test800.
Figure 12. More retrieval results (samples with mistakes). Left column: queries, real reconstructed meshes. Right five columns: retrieved
models from ModelNet40 Test800. Red bounding boxes denote results from wrong categories.
Figure 13. Our real-world reconstruction test dataset, comprising 12 categories and 243 models. Each row lists a category along with the number of objects and several example reconstructed models in that category. Per-category model counts: Bathtub 7, Bed 27, Bench 19, Chair 17, Cup 18, Desk 16, Monitor 45, Dresser 12, Night-stand 21, Sofa 26, Table 18, Toilet 17.
