
3D AffordanceNet: A Benchmark for Visual Object Affordance Understanding

Shengheng Deng1,*, Xun Xu2,*, Chaozheng Wu1, Ke Chen1,4 and Kui Jia1,3,4,†
1 South China University of Technology, 2 I2R, A*STAR, 3 Pazhou Laboratory, 4 Peng Cheng Laboratory
* indicates equal contribution. † Correspondence to Kui Jia <[email protected]>.

Abstract

The ability to understand the ways to interact with objects from visual cues, a.k.a. visual affordance, is essential to vision-guided robotic research. This involves categorizing, segmenting and reasoning about visual affordance. Relevant studies have been made previously in the 2D and 2.5D image domains; however, a truly functional understanding of object affordance requires learning and prediction in the 3D physical domain, which is still absent in the community. In this work, we present the 3D AffordanceNet dataset, a benchmark of 23k shapes from 23 semantic object categories, annotated with 18 visual affordance categories. Based on this dataset, we provide three benchmarking tasks for evaluating visual affordance understanding, including full-shape, partial-view and rotation-invariant affordance estimation. Three state-of-the-art point cloud deep learning networks are evaluated on all tasks. In addition, we also investigate a semi-supervised learning setup to explore the possibility of benefiting from unlabeled data. Comprehensive results on our contributed dataset show the promise of visual affordance understanding as a valuable yet challenging benchmark.

[Figure 1. The 3D AffordanceNet dataset. Example shapes are shown with affordances such as Pour, Wrap-Grasp, Contain, Grasp, Cut, Stab and Lift. The mesh was first annotated with affordance keypoints. Then we densely sample points and obtain the ground truth data via label propagation.]

1. Introduction

The concept of affordance was first defined as what the environment offers the animal, introduced by [6]. Affordance understanding is concerned with the interactions between humans and the environment. For instance, a human can sit on a chair, grasp a cup or lift a bag. Being able to understand the affordance of objects is crucial for robots to operate in dynamic and complex environments [8]. Many applications are supported by affordance understanding, including anticipating and predicting future actions [12, 10, 13], recognizing agents' activities [21, 4, 26], providing valid functionality of objects [7], understanding social scene situations [2] and understanding the hidden values of objects [33]. Tasks including affordance categorization, reasoning, semantic labeling, activity recognition, etc. are defined as specific instantiations of affordance understanding [8]. Among all these, we find semantic labeling [22, 33] to be of the most importance, because the ability to localize the position of possible affordances is highly desired by robotic research. We refer to semantic labeling as affordance estimation throughout this paper.

The most important and proper modality for affordance understanding is through visual sensors [8]. Visual affordance understanding has been extensively studied recently with computer vision techniques. Many algorithms are built upon deep neural networks [19, 3, 23] and thus require large labeled affordance datasets for benchmarking. Relevant datasets have been developed for these purposes with data collected from 2D (RGB) sensors [32, 24, 22] or 2.5D (RGBD) sensors [18, 19, 23]. Nevertheless, we believe that affordance understanding requires learning in the 3D domain, which conveys the geometric properties. For example, the affordance of grasp is highly correlated with vertical structures of small perimeter, and sittable is correlated with flat surfaces. Unfortunately, such detailed geometry is not captured by the existing 2D datasets, while the 2.5D ones [19, 23] are often captured with small depth variation and do not carry enough geometric information.

To encourage research into visual affordance understanding in more realistic scenarios, a benchmark on real 3D data is highly desired. Therefore, we are inspired by PartNet [17], a recently proposed dataset containing fine-grained part hierarchy information of 3D shapes based on the large-scale 3D CAD model dataset ShapeNet [1] and 3D Warehouse (https://3dwarehouse.sketchup.com). Although PartNet mentioned affordance as a potential application, there is still no benchmark purposely established for affordance yet. More importantly, we discover, via user annotations, that human-perceived affordances often do not fully overlap with the individual parts specified in the PartNet dataset. For example, in the first row of Fig. 1, the Pour, Wrap-Grasp and Contain affordances of a Mug do not perfectly match any part indicated by the colored image in the first column. Therefore, we believe it is necessary to provide a new set of affordance labels on the PartNet dataset.

Creating a 3D visual affordance benchmark is challenging due to the subjective definition of affordance. We take into account the affordance definitions from existing research on visual affordance learning in the 2D and 2.5D domains [8] and select possible interactions that one can take with the 3D shapes from PartNet. Finally, 18 types of affordance were formally defined over 23 semantic object categories. An additional challenge associated with annotation on 3D models is the scalability issue. In order to provide high-quality annotation at such a large scale, we use a label propagation method to propagate affordances sparsely labeled on individual points. Eventually, we obtain a point-wise probabilistic score of affordance for each individual shape in PartNet. We name the new benchmark 3D AffordanceNet to reflect the focus on visual affordance on 3D point cloud data.

3D AffordanceNet enables benchmarking a diverse set of tasks; in particular, we put forward full-shape, partial-view and rotation-invariant affordance estimation. Three state-of-the-art point cloud deep learning networks are evaluated on all tasks. We also propose a semi-supervised affordance estimation method to take advantage of a large amount of unlabeled data for affordance estimation.

In summary, we make the following contributions:

• We introduce 3D AffordanceNet, consisting of 56307 well-defined affordance annotations for 22949 shapes covering 18 affordance classes and 23 semantic object categories. To the best of our knowledge, this is the first large-scale dataset with well-defined probabilistic affordance score annotations;

• We propose three affordance learning tasks which are supported by 3D AffordanceNet to demonstrate the value of the annotated data: full-shape affordance estimation, partial-view affordance estimation and rotation-invariant affordance estimation;

• We benchmark three baseline methods on the proposed affordance learning tasks and further propose a semi-supervised affordance estimation method to take advantage of unlabeled data for affordance estimation.

2. Related Work

Affordance refers to the possible actions an agent could take to interact with the environment [6]. Examples include a cup that affords 'pouring', or a bed that is 'sittable' and 'layable'. Affordance understanding is a core function in developing autonomous systems. In particular, visual affordance understanding is the most promising way due to the rich information carried by visual sensors. We mainly review the recent development of visual affordance datasets and approaches; a detailed review can be found in [8]. Recent advances in visual affordance are mostly demonstrated on affordance recognition [7, 3], detection [3, 18] and segmentation [3, 18]. Beyond the low-level visual tasks, there is substantial attention on-the-rise paid to affordance reasoning, affordance-based activity recognition and social affordances. In this work we are interested in providing a benchmark for affordance segmentation, a.k.a. prediction, due to the clear definition and high demand in robotic applications.

To benchmark visual affordance segmentation, UMD [18], CAD120 [23] and IIT-AFF [19] were respectively developed recently. All datasets feature affordance segmentation on RGBD images covering 10-20 objects, 6-9 affordance types and 3k-10k labelled images. In particular, IIT-AFF and CAD120 capture complex scenes while UMD mainly focuses on well-controlled scenes. Nevertheless, none of these datasets carry the rich geometric properties of objects that robotic applications would expect, and only a single viewpoint is present. As a result, these datasets no longer pose challenges to modern computer vision techniques.

With the easy access to 3D point cloud data, e.g. collected from LiDAR and SFM, and potential applications in robotics, autonomous driving, etc., there is a recent surge of research towards 3D point clouds. ShapeNet [1] collected 3D CAD models from open-sourced 3D repositories, with more than 3,000,000 models and 3,135 object categories. It was further developed by [29] for shape part segmentation. Partially motivated by affordance understanding, the subsequent PartNet dataset [17] was proposed with 26k objects, featuring a fine-grained semantic segmentation task. Hierarchical segmentation was also addressed by a recursive part decomposition network [31]. Though affordance is briefly mentioned as the motivation for creating the above 3D shape datasets, to the best of our knowledge, there is no existing dataset which explicitly addresses the task of visual affordance prediction. The only known attempt at 3D shape functionality understanding [9] is still limited to a few types of objects and did not make a connection to the well-studied visual affordance understanding in 2D image domains. In contrast, we created a new benchmark for visual affordance estimation on 3D point clouds. The affordance types are selectively inherited from a summary of existing works, and annotations are made on 3D point cloud data directly.

[Figure 2. The annotation workflow (Start → observe shape information → "What affordances does the shape support?" → "Which points on the shape support the affordances?" → select a functional point → "Do the adjacent parts support the same affordances?" → if yes, select the supported parts; repeat several times). The blue arrows indicate the annotation procedures, the green arrows refer to the corresponding 3D GUI actions. The annotators are first asked to determine the supported affordance classes and then select the functional points. The annotators need to confirm whether the adjacent parts support the same affordances.]
3. Dataset Construction

We present 3D AffordanceNet as a dataset for affordance estimation on 3D point clouds. To construct this dataset, we first define a set of affordance types by referring to existing visual affordance works [8]. Raw 3D shape data are collected from the shapes in PartNet [17], which covers common object types in typical indoor scenes. A question-answering 3D GUI is developed to collect raw point-wise annotations on mesh shapes. In total, we hired 42 professional annotators; the average annotation time per shape is 2 minutes and each shape is annotated by 3 annotators. Finally, label propagation is employed to obtain probabilistic ground-truth for the shape point cloud.

3.1. Affordance Type

We refer to [8] for a full review of affordance types adopted in visual affordance research. From the full list of possible affordances, we select those suitable for the 3D objects present in PartNet [17] and remove the irrelevant ones, e.g. 'reachable' and 'carryable'. Overall, we arrive at 18 categories of affordances, namely 'grasp', 'lift', 'contain', 'open', 'lay', 'sit', 'support', 'wrap-grasp', 'pour', 'display', 'push', 'pull', 'listen', 'wear', 'press', 'cut', 'stab', and 'move'. Then, we associate the affordance types with each category of object in PartNet according to its attributes and the functionality with which it can afford to interact with a human or robot. For example, a chair is 'sittable' but not 'layable', a table can afford 'support' but not 'contain', etc. The affordances of each category are shown in Tab. 2. The annotators are allowed to freely determine where the affordance is located on the object, e.g. 'grasp' of a bag can be annotated on its handle, webbing or straps. Notice that we allow the annotators to select the supported affordances for each shape, thus some shapes may not have all the affordances defined for their shape category.

3.2. Annotation Interface

We created a web-based 3D GUI to collect raw annotations. The process of annotation is designed as a question-answering workflow, as illustrated in Fig. 2. A user is given one shape at a time visualized in 3D. Each individual part is colored according to the pre-defined colormap of the PartNet dataset [17]. Annotators are allowed to freely rotate, translate and change the scale of the shape using the mouse, which allows them to observe the shape from more angles. After observation, annotators are first asked to determine the supported affordances by choosing from a list ('What affordances does this shape support?'). Considering that some annotators may not understand the affordances, we provide an explanation of each affordance on the interface. Annotators are then asked to select keypoints that support the specified affordance ('What points on the shape support the affordance?'). At least 3 keypoints are labeled by one annotator for each affordance. Annotators also decide whether the selected affordance propagates beyond the part on which the labeled keypoint sits. If yes, annotators are asked to select the eligible parts that the affordance can propagate to; otherwise, more annotations are made on the same part until enough keypoints are collected.

The questions that the annotation interface proposes for each affordance directly determine how the annotators perceive the affordances. Therefore, we define questions carefully tailored for each affordance. Some questions are shown in Tab. 1. A complete list of affordance questions is given in the supplementary material.

Affordance | Object | Question
Lay | Bed | If you were to lie on this bed, on which points would you lie?
Grasp | Earphone | If you want to grab this earphone, where will your palm be positioned?
Lift | Bag | If you want to lift this bag, at which points are your fingers most likely to carry the bag?
Sit | Chair | If you were sitting on this chair, on which points would you sit?
Move | Table | If you want to move this table, at which points on this table will you exert your strength?
Open | Trash Can | If you want to open the lid of this trash can, from which points on the trash can would you open it?
Pour | Bottle | Suppose there is water in the bottle, and you want to pour the water out of the bottle. From which points on the bottle will the water flow out?
Press | Laptop | If you want to press keys on a computer keyboard, which points on the keyboard would you press?
Contain | Microwave | If you put something in the microwave, at which points in the microwave would you put the object?
Support | Table | If you want to put something on the table, at which points on the table would you put the object?
Table 1. Some examples of the proposed questions for affordance annotation.

[Figure 3. Examples of annotated data across the object categories (Knife, Scissors, Bowl, Bottle, Bed, Bag, Dishwasher, Clock, Trash Can, Table, Storage Furniture, Chair, Refrigerator, Mug, Microwave, Laptop, Keyboard, Hat, Faucet, Earphone, Door, Display, Vase) and the 18 affordances (Grasp, Contain, Lift, Open, Lay, Sit, Support, Wrap-Grasp, Pour, Move, Display, Push, Pull, Listen, Wear, Press, Cut, Stab). Different affordances are shown in different colors; points annotated with multiple affordances are colored by the affordance that has the highest score. The brighter the color, the higher the score.]
3.3. Ground-Truth Construction

After obtaining the affordance keypoints, we propagate labels to all points on the shape to create ground-truth for downstream learning tasks. We first record the coordinates of the selected keypoints. We then propagate the labels to N points densely sampled on the shape mesh surface; note that we only propagate on the parts that support the specific affordance, as recorded during user annotation. Formally, we construct a kNN graph on the sampled points whose adjacency matrix A is

a_{ij} = \begin{cases} \|v_i - v_j\|_2, & v_j \in \mathrm{NN}_k(v_i) \\ 0, & \text{otherwise} \end{cases}    (1)

where v is the xyz spatial coordinate of a point and NN_k denotes the set of k nearest neighbors. The adjacency matrix is symmetrized by W = \frac{1}{2}(A + A^\top). Then we normalize the adjacency matrix by \tilde{W} = D^{-0.5} W D^{-0.5}, where D is the degree matrix. Finally, the scores S for all points are propagated by the closed-form solution S = (I - \alpha \tilde{W})^{-1} Y, where Y \in \{0,1\}^{N \times 18} is the one-hot label matrix and 1 indicates a positive label. \alpha is a hyper-parameter controlling the decreasing speed of S; we empirically set \alpha to 0.998 throughout the experiments. Finally, we linearly normalize S to the range between 0 and 1 so that it is a probabilistic score (a minimal code sketch of this propagation step is given after Tab. 3 below). Example shapes with propagated affordance ground-truth are shown in Fig. 3.

3.4. Statistics

The final dataset provides well-defined visual affordance score map annotations for 22949 shapes from 23 shape categories, with at most 5 affordance types defined for each category. From the perspective of affordance categories, 3D AffordanceNet contains 56307 affordance annotations from 18 affordance classes. It is worth noting that, due to the multi-label nature, each point could be labeled with multiple affordances. More details of the dataset are presented in Tab. 2 and Tab. 3.

Object | Affordances | Num
Bag | grasp, lift, contain, open | 125
Bed | lay, sit, support | 181
Bowl | contain, wrap-grasp, pour | 187
Clock | display | 524
Dishwasher | open, contain | 166
Display | display | 887
Door | open, push, pull | 220
Earphone | grasp, listen | 223
Faucet | grasp, open | 628
Hat | grasp, wear | 222
Storage Furniture | contain, open | 2186
Keyboard | press | 156
Knife | grasp, cut, stab | 314
Laptop | display, press | 421
Microwave | open, contain, support | 184
Mug | contain, pour, wrap-grasp, grasp | 190
Refrigerator | contain, open | 185
Chair | sit, support, move | 6113
Scissors | grasp, cut, stab | 68
Table | support, move | 7990
Trash Can | contain, pour, open | 315
Vase | contain, pour, wrap-grasp | 1048
Bottle | contain, open, wrap-grasp, grasp, pour | 411
Table 2. 3D AffordanceNet statistics. The first column shows the object category, the second column shows the defined affordance classes for each category, and the third column shows the number of shapes in each semantic category.

Affordance | Support | Move | Sit | Contain | Open | Grasp | Pour | Display | Wrap-Grasp
#Annot | 14848 | 14540 | 6516 | 5155 | 4506 | 2253 | 2086 | 1914 | 1889
Affordance | Press | Cut | Stab | Wear | Listen | Pull | Push | Lay | Lift
#Annot | 588 | 393 | 393 | 231 | 228 | 225 | 225 | 194 | 123
Table 3. The number of shapes that are positive for each category of affordance.
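To make the propagation step of Sec. 3.3 concrete, the following is a minimal, illustrative NumPy/SciPy sketch of Eq. (1) and the closed-form solution. The function and variable names are our own, and details such as the neighborhood size k are assumptions rather than the authors' released code.

```python
import numpy as np
from scipy.spatial import cKDTree

def propagate_affordance_scores(points, keypoint_idx, affordance_ids,
                                num_affordances=18, k=10, alpha=0.998):
    """Illustrative label propagation on an (N, 3) point cloud (Sec. 3.3).
    keypoint_idx / affordance_ids are parallel arrays giving, for each
    annotated keypoint, its point index and affordance index.
    (The paper additionally restricts propagation to the parts recorded
    during annotation; that masking is omitted here.)"""
    n = points.shape[0]

    # kNN graph, Eq. (1): a_ij = ||v_i - v_j||_2 for the k nearest neighbors, else 0.
    tree = cKDTree(points)
    dists, nbrs = tree.query(points, k=k + 1)        # first neighbor is the point itself
    A = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    A[rows, nbrs[:, 1:].reshape(-1)] = dists[:, 1:].reshape(-1)

    # Symmetrize and normalize: W = (A + A^T) / 2, W~ = D^-0.5 W D^-0.5.
    W = 0.5 * (A + A.T)
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    W_norm = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # One-hot seed labels Y in {0,1}^{N x 18} from the annotated keypoints.
    Y = np.zeros((n, num_affordances))
    Y[np.asarray(keypoint_idx), np.asarray(affordance_ids)] = 1.0

    # Closed-form propagation S = (I - alpha * W~)^-1 Y, then min-max to [0, 1].
    S = np.linalg.solve(np.eye(n) - alpha * W_norm, Y)
    S = (S - S.min(axis=0)) / (S.max(axis=0) - S.min(axis=0) + 1e-8)
    return S
```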

4. Tasks and Benchmarks

In this section, we benchmark three tasks to demonstrate the 3D AffordanceNet dataset, namely full-shape, partial-view and rotation-invariant affordance estimation. The 3D AffordanceNet dataset is split into train, validation and test sets with a ratio of 70%, 10% and 20%, respectively, according to the shape semantic category. The first experiment estimates point-wise affordance given a full 3D point cloud as input. The second experiment estimates the affordance of partially visible objects observed from different viewpoints. The last experiment estimates the affordance of rotated 3D objects under two different rotation settings. We also create a semi-supervised affordance estimation benchmark to explore the opportunity of exploiting unlabeled data for affordance estimation.

4.1. Full-Shape Affordance Estimation

Given an object as a 3D point cloud, without knowing the affordances supported by the object, the full-shape affordance estimation task aims to estimate the supported affordance types and predict the point-wise probabilistic score of each affordance. We show that state-of-the-art 3D point cloud segmentation networks predict reasonable results on 3D AffordanceNet.

Network and Training. We evaluate three network architectures, namely PointNet++[20], DGCNN[27] and U-Net[28], for this task. To obtain the point-wise scores, we utilize the segmentation branches of PointNet++ and DGCNN as shared backbones to extract features for each point; then, for each affordance type, we pass the features through multiple classification heads and use a sigmoid function to obtain the posterior scores. The classification heads are set up for each affordance category individually while the backbone networks are shared (a sketch of this multi-head design is given below). We use a cross-entropy loss l_CE for training the network as below,

l_{CE} = \frac{1}{N}\sum_i^M \sum_j^N -(1 - s_{ij})\log(1 - p_{ij}) - s_{ij}\log(p_{ij})    (2)

where M is the total number of affordance types, N is the number of points within each shape, s_{ij} is the ground-truth score of the jth point for the ith affordance category, and p_{ij} is the predicted score.
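As an illustration of the shared-backbone, per-affordance-head design described above, here is a minimal PyTorch sketch; the backbone is left abstract (any per-point feature extractor such as a PointNet++ or DGCNN segmentation branch could be plugged in), and the layer sizes are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class AffordanceEstimator(nn.Module):
    """Shared per-point backbone + one classification head per affordance.
    Sketch only: `backbone` is any module mapping (B, 3, N) points to
    (B, C, N) per-point features."""

    def __init__(self, backbone, feat_dim=128, num_affordances=18):
        super().__init__()
        self.backbone = backbone
        # One small head per affordance category; the backbone is shared.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Conv1d(feat_dim, 64, 1), nn.ReLU(),
                          nn.Conv1d(64, 1, 1))
            for _ in range(num_affordances)
        ])

    def forward(self, xyz):
        feats = self.backbone(xyz)                                  # (B, C, N) per-point features
        logits = torch.cat([h(feats) for h in self.heads], dim=1)   # (B, M, N)
        return torch.sigmoid(logits)                                # posterior scores in [0, 1]
```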
Since the points with zero score account for a relatively large proportion of the total dataset, we further propose to use a dice loss [15] to mitigate the imbalance issue. The dice loss l_DICE is defined as:

l_{DICE} = \sum_i^M 1 - \frac{\sum_j^N s_{i,j}\, p_{i,j} + \epsilon}{\sum_j^N s_{i,j} + p_{i,j} + \epsilon} - \frac{\sum_j^N (1 - s_{i,j})(1 - p_{i,j}) + \epsilon}{\sum_j^N 2 - s_{i,j} - p_{i,j} + \epsilon}    (3)

Finally, the loss function is defined as l = l_{CE} + l_{DICE}. We train PointNet++ and DGCNN on 3D AffordanceNet using the default training strategies and hyper-parameters described in the respective papers [20, 27]. For U-Net, we fine-tune the network initialized with the pre-trained weights provided by PointContrast [28].
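The combined objective of Eqs. (2)-(3) can be written compactly as follows. This is a hedged sketch with our own function names; the epsilon smoothing value is an assumption.

```python
import torch

def affordance_loss(pred, target, eps=1e-6):
    """Cross-entropy (Eq. 2) plus dice loss (Eq. 3) for per-point,
    multi-label affordance scores. pred, target: (B, M, N) tensors in [0, 1]."""
    pred = pred.clamp(eps, 1.0 - eps)

    # Eq. (2): binary cross-entropy, averaged over the N points and summed over affordances.
    l_ce = -(target * torch.log(pred)
             + (1.0 - target) * torch.log(1.0 - pred)).mean(dim=2).sum(dim=1).mean()

    # Eq. (3): dice terms for the positive and negative parts, summed over affordances.
    l_dice = (1.0
              - ((target * pred).sum(dim=2) + eps)
                / ((target + pred).sum(dim=2) + eps)
              - (((1 - target) * (1 - pred)).sum(dim=2) + eps)
                / ((2 - target - pred).sum(dim=2) + eps)).sum(dim=1).mean()

    return l_ce + l_dice
```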

Full-shape Avg Grasp Lift Contain Open Lay Sit Support Wrap. Pour Display Push Pull Listen Wear Press Move Cut Stab
P mAP 48.0 43.4 75.1 56.9 46.6 63.0 81.1 52.5 19.0 46.9 59.3 20.5 37.9 41.6 20.4 31.4 35.3 41.1 90.9
P AUC 87.4 82.8 97.1 89.3 90.6 92.6 96.0 89.7 72.6 89.2 90.6 83.1 85.3 85.9 67.9 90.9 79.0 91.4 98.8
P aIOU 19.3 15.7 41.2 22.2 20.2 30.0 38.1 17.5 4.0 18.2 25.6 6.5 12.7 14.0 6.5 11.2 8.5 15.2 40.9
P MSE 0.059 0.003 0.0001 0.006 0.003 0.0005 0.005 0.012 0.002 0.002 0.002 0.0007 0.0002 0.0006 0.0005 0.0006 0.021 0.0003 0.0001
D mAP 46.4 43.9 85.2 57.6 51.8 12.3 80.9 54.0 20.7 47.7 65.5 20.5 40.5 36.0 18.3 34.2 35.5 40.2 91.4
D AUC 85.5 82.5 98.7 89.9 91.6 50.1 96.1 90.2 74.6 89.2 92.1 85.0 89.7 86.1 61.9 91.8 78.9 91.7 98.7
D aIOU 17.8 13.9 40.2 21.6 25.4 1.0 34.9 18.8 5.6 17.7 32.1 5.5 11.8 11.9 5.9 14.8 9.9 14.5 35.4
D MSE 0.08 0.003 0.0001 0.007 0.003 0.0006 0.006 0.013 0.007 0.005 0.002 0.002 0.0006 0.002 0.002 0.0007 0.025 0.0002 0.0001
U mAP 47.4 42.6 75.7 56.6 45.9 60.6 80.8 53.7 19.1 45.0 61.5 19.7 36.5 37.4 20.4 33.9 35.2 39.8 89.0
U AUC 86.3 79.8 94.3 88.8 88.5 88.2 95.8 89.7 72.9 87.1 90.2 81.4 84.0 82.9 70.3 91.6 79.2 90.8 98.5
U aIOU 19.7 13.7 41.2 22.4 20.8 29.4 37.3 18.6 4.7 18.3 32.4 6.2 13.0 13.1 4.4 14.5 9.2 14.2 41.5
U MSE 0.063 0.003 0.0003 0.006 0.003 0.0006 0.005 0.014 0.002 0.002 0.002 0.0006 0.0002 0.0007 0.0003 0.0008 0.021 0.0002 0.0001
Partial Avg Grasp Lift Contain Open Lay Sit Support Wrap. Pour Display Push Pull Listen Wear Press Move Cut Stab
P mAP 45.7 43.2 80.6 41.9 48.5 52.6 69.8 45.5 20.0 47.0 52.5 24.1 36.5 42.1 15.3 30.7 37.8 43.3 92.6
P AUC 85.2 81.2 96.2 83.3 87.9 86.7 95.0 86.5 71.3 88.4 85.2 84.9 86.1 84.2 64.1 84.7 79.6 90.2 98.8
P aIOU 16.9 14.4 45.6 13.2 21.6 25.2 31.0 11.2 3.6 17.8 19.3 5.7 11.5 13.4 2.4 12.4 6.2 13.5 37.8
P MSE 0.062 0.003 0.0001 0.005 0.003 0.0006 0.004 0.013 0.002 0.002 0.002 0.0002 0.0001 0.0007 0.0004 0.0006 0.025 0.0003 0.0001
D mAP 42.2 40.1 83.8 38.5 44.5 46.6 67.3 43.9 19.6 44.8 50.3 15.7 28.6 19.9 17.7 26.9 35.6 43.3 92.1
D AUC 83.7 80.3 97.0 80.8 86.8 82.6 94.6 87.0 70 86.8 81.7 83.6 86.0 75 64.1 83.0 78.5 90.0 99.1
D aIOU 13.8 13.3 37.9 8.6 16.0 15.5 27.2 13.5 3.2 13.9 13.8 3.9 4.3 8.4 5.2 5.7 8.2 13.3 36.8
D MSE 0.069 0.005 0.0002 0.005 0.004 0.0006 0.004 0.013 0.003 0.002 0.002 0.001 0.001 0.002 0.003 0.0004 0.022 0.0004 0.0001
U mAP 43.0 40.8 73.2 41.2 47.6 44.6 68.9 45.3 18.5 45.2 53.2 19.5 29.6 38.9 11.1 29.6 36.7 39.1 90.8
U AUC 83.2 78.9 94.6 81.7 87.1 80.4 94.5 87.0 69.3 87.5 84.4 80.9 79.2 84.2 56.4 84.9 77.7 89.3 98.7
U aIOU 16.8 14.7 38.1 15.0 21.1 20.8 33.5 14.4 4.3 18.9 20.4 5.3 8.2 10.9 1.0 14.7 9.0 15.9 35.8
U MSE 0.065 0.003 0.0002 0.006 0.003 0.001 0.006 0.012 0.003 0.003 0.002 0.0003 0.0002 0.0003 0.0006 0.0008 0.022 0.0007 0.0002
Rotate z Avg Grasp Lift Contain Open Lay Sit Support Wrap. Pour Display Push Pull Listen Wear Press Move Cut Stab
P mAP 47.3 43.1 85.7 58.1 39.6 62.7 80.6 53.8 20.4 47.5 47.2 21.8 34.8 39.7 19.0 28.1 36.3 40.4 91.9
P AUC 87.0 82.0 97.9 89.5 86.2 91.1 95.9 90.1 74.2 89.4 87.1 85.4 87.9 84.3 67.0 88.5 80.3 91.5 98.5
P aIOU 18.7 15.5 45.4 22.4 17.6 26.0 38.0 18.3 4.6 19.5 17.4 7.0 10.5 13.9 6.9 9.2 8.8 14.8 40.6
P MSE 0.06 0.003 0.0001 0.006 0.003 0.0006 0.005 0.012 0.002 0.002 0.003 0.001 0.0003 0.0007 0.0007 0.0005 0.02 0.0003 0.0001
D mAP 44.8 42.2 82.9 58.2 45.2 17.3 78.8 52.4 20.3 46.7 58.5 21.6 45.2 28.5 16.8 29.6 35.0 36.4 90.8
D AUC 84.9 80.8 98.2 89.9 87.9 54.9 95.8 89.9 74.0 89.5 90.6 84.7 89.4 81.0 64.4 89.2 79.7 90.6 98.4
D aIOU 16.1 13.6 36.1 19.6 21.2 1.0 29.2 18.5 3.4 13.6 25.3 5.7 13.6 11.3 5.7 13.1 9.7 13.6 36.3
D MSE 0.074 0.003 0.0001 0.007 0.003 0.0006 0.007 0.013 0.005 0.004 0.002 0.003 0.0008 0.002 0.002 0.0009 0.021 0.0004 0.0001
U mAP 46.1 42.7 74.4 56.1 40.4 58.4 81.2 54.9 18.5 44.7 56.4 20.7 35.3 36.8 17.4 31.6 35.6 36.2 88.8
U AUC 86.1 81.2 95.8 87.5 85.9 88.4 95.8 90.2 72.2 87.6 87.9 85.1 87.9 83.2 63.1 90.0 78.9 90.3 98.4
U aIOU 18.9 15.5 39.6 21.8 18.4 24.7 38.4 18.9 4.5 18.6 26.9 6.6 12.2 14.5 5.1 14.1 9.9 12.1 37.7
U MSE 0.06 0.003 0.0001 0.006 0.003 0.0004 0.005 0.013 0.002 0.002 0.002 0.0007 0.0002 0.0008 0.0004 0.0008 0.021 0.0003 0.0001
Rotate SO(3) Avg Grasp Lift Contain Open Lay Sit Support Wrap. Pour Display Push Pull Listen Wear Press Move Cut Stab
P mAP 41.8 40.5 78.5 42.4 32.7 38.4 74.5 48.3 19.4 41.7 41.1 19.7 30.3 39.4 17.6 21.5 34.6 41.1 90.6
P AUC 83.3 79.0 93.6 81.1 81.3 79.6 93.9 87.4 71.6 85.4 83.7 83.1 84.0 84.2 64.8 78.2 78.6 91.0 99.4
P aIOU 15.2 12.8 38.3 12.2 13.1 9.6 33.6 16.5 3.8 16.1 13.7 3.0 11.1 14.8 5.5 8.8 8.9 14.4 37.2
P MSE 0.072 0.003 0.0001 0.008 0.003 0.0007 0.007 0.015 0.003 0.003 0.003 0.0002 0.0001 0.0006 0.002 0.0009 0.022 0.0006 0.0001
D mAP 37.3 37.9 70.7 37.3 34.2 9.7 73.9 46.8 17.6 40.2 46.8 19.1 37.4 6.9 11.8 22.3 30.9 41.4 86.4
D AUC 78.9 79.1 95.6 79.4 81.4 36.3 93.9 87.2 71.2 85.6 85.7 82.3 88.8 43.5 60.3 82.7 76.9 91.9 99.1
D aIOU 12.8 13.6 32.5 7.8 13.9 1.0 35.4 14.4 4.9 16.4 19.3 4.5 11.0 0.003 1.0 8.4 7.1 15.7 23.3
D MSE 0.08 0.004 0.0002 0.007 0.003 0.0006 0.007 0.015 0.005 0.004 0.005 0.002 0.0005 0.0004 0.0006 0.0009 0.022 0.0007 0.0002
U mAP 37.9 37.0 61.2 38.0 29.8 34.0 77.4 49.9 16.4 39.3 42.6 14.8 24.7 35.7 8.6 20.1 31.8 36.9 83.3
U AUC 80.9 76.9 90.5 79.5 77.7 78.1 94.1 87.8 67.9 82.7 81.9 78.1 83.5 83.7 52.5 76.6 77.1 89.6 99.0
U aIOU 12.0 15.3 8.2 10.8 10.7 5.4 35.8 16.2 2.7 12.7 15.8 1.0 4.3 12.3 1.0 7.3 7.9 11.9 35.8
U MSE 0.07 0.005 0.0001 0.008 0.003 0.0005 0.006 0.013 0.003 0.003 0.004 0.0002 0.0001 0.0004 0.0004 0.0008 0.022 0.0003 0.0001

Table 4. Affordance estimation results. Except for MSE, all numbers are percentages and higher is better; for MSE, lower is better. Algorithms P, D and U represent PointNet++[20], DGCNN[27] and U-Net[28], respectively. The labels Full-shape, Partial, Rotate z and Rotate SO(3) denote the full-shape, partial-view, z/z and SO(3)/SO(3) rotation-invariant affordance estimation tasks, respectively.
Network / Split | mAP | AUC | aIOU | MSE | Loss
P Train | 52.3 | 89.8 | 21.7 | 0.054 | 8.75
P Val | 48.2 | 88.0 | 19.2 | 0.057 | 8.81
P Test | 48.0 | 87.4 | 19.3 | 0.059 | 8.83
D Train | 51.7 | 89.1 | 21.2 | 0.061 | 8.83
D Val | 47.8 | 85.8 | 17.5 | 0.075 | 8.91
D Test | 46.4 | 85.5 | 17.8 | 0.08 | 8.93
Table 5. The performance of two different networks on the full-shape affordance estimation task over the train, validation and test sets. P represents PointNet++ and D refers to DGCNN.

Evaluation and Results. We evaluate four metrics for affordance estimation, including mean Average Precision (mAP), mean squared error (MSE), Area Under the ROC Curve (AUC) and average Intersection Over Union (aIoU). For AP, we calculate the Precision-Recall curve and AP is calculated for each affordance. For AUC, we report the area under the ROC curve. For MSE, we calculate the mean squared error of each affordance category and sum up the results from all affordance categories. For aIoU, we gradually tune up the threshold from 0 to 0.99 with a 0.01 step to binarize the prediction, and the aIoU is the arithmetic average of the IoUs at all thresholds (a sketch of this metric is given below). Except for MSE, all the other metrics for each category are averaged over all shapes, a.k.a. macro-average. For each affordance category, the ground-truth map is binarized with a 0.5 threshold before evaluation. The results are reported in Tab. 4 under the Full-shape section and some qualitative examples from PointNet++ are selected and visualized in Fig. 4.
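As a concrete reading of the aIoU definition above, the following is a small sketch (our own code, not the official evaluation script) that averages the IoU of the binarized prediction against the binarized ground truth over thresholds 0, 0.01, ..., 0.99; the handling of an empty union is our own assumption.

```python
import numpy as np

def average_iou(pred, gt, gt_threshold=0.5):
    """aIoU for one affordance on one shape.
    pred, gt: (N,) arrays of per-point scores in [0, 1]."""
    gt_bin = gt >= gt_threshold                # ground truth binarized at 0.5
    ious = []
    for t in np.arange(0.0, 1.0, 0.01):        # thresholds 0, 0.01, ..., 0.99
        pred_bin = pred >= t
        union = np.logical_or(pred_bin, gt_bin).sum()
        inter = np.logical_and(pred_bin, gt_bin).sum()
        # Empty union treated as a perfect match (an assumption).
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))
```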
As shown in Tab. 4, the performances of the three networks are close and all achieve a relatively low aIoU score, which indicates that affordance estimation is still a challenging task. Comparing the second row of Fig. 4 to the corresponding ground truth, we find that PointNet++ produces some reasonable results. For example, the estimations of grasp on a bag are successfully localized on both the handles and the webbing. However, the results of pour on a bottle fail, since the network predicts the scores mainly on the lid of the bottle rather than on the body edge of the bottle, which is where the water would flow out. More qualitative examples are given in the supplementary.

Room for Performance Improvement. The performances of PointNet++ and DGCNN on the tasks mentioned above are relatively weak. Hence, we evaluate the trained networks on the full-shape affordance estimation task over the training, validation and testing sets to investigate the room for performance improvement, with results reported in Tab. 5. From the results we observe that both networks still underfit, meaning that the proposed affordance estimation task is very challenging for existing point cloud analysis networks.

4.2. Partial-View Affordance Estimation

Although complete point clouds or meshes can provide detailed geometric information for affordance estimation, in real-world application scenarios we can only expect a partial view of 3D shapes, represented as a partial point cloud. Therefore, another important task we are concerned with is to estimate the affordance from partial point clouds.

Network and Training. To obtain partial point clouds, we follow [11] to synthesize point clouds observed from certain camera viewpoints. Only points directly facing the camera are preserved as visible points, and each point is assigned a radius to create the occlusion effect. Specifically, because all shapes are well aligned within the (-1,-1,-1) to (1,1,1) cube, we set up 4 affine cameras located at (1,1,1), (-1,-1,1), (1,-1,-1), (-1,1,-1) in the Cartesian coordinate system, facing towards the origin. After obtaining the partial point clouds, we sample 2048 points from each viewpoint via furthest point sampling; if the number of points in the point cloud is fewer than 2048, we utilize the point cloud up-sampling method proposed in [30] to up-sample the data. We use exactly the same backbone networks and training strategies described in the previous sections (a sketch of the partial-view synthesis is given at the end of this subsection).

Evaluation and Results. During the testing stage, we estimate the affordance on the visible partial point cloud only. The evaluation protocol follows the one described in Section 4.1. All evaluation metrics are reported in Tab. 4, with qualitative results from PointNet++ shown in Fig. 4.

Unsurprisingly, the quantitative performances of the three networks decrease due to the loss of geometric information of the partial point cloud relative to the complete point cloud. Nevertheless, we still observe reasonable qualitative results even though only a partial view is observed. For instance, the network produces high predictions for move on the upper side of the legs of a table despite the unseen parts of the legs. The grasp for bag, listen for earphone, sit for chair, etc., are all more-or-less correctly predicted. In contrast, the estimation for contain on the storage furniture is partially missing, since it predicts the scores on the top of the furniture, which is not fully observed.
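For illustration, a partial view per camera could be synthesized roughly as follows. This sketch relies on Open3D's hidden_point_removal (an implementation of the spherical-flipping visibility test of [11]) plus a naive furthest point sampling; the radius parameter is an assumption rather than the value used by the authors, and the up-sampling of sparse views with [30] is omitted.

```python
import numpy as np
import open3d as o3d

CAMERAS = [(1, 1, 1), (-1, -1, 1), (1, -1, -1), (-1, 1, -1)]

def farthest_point_sample(points, n_samples=2048):
    """Naive furthest point sampling (O(n_samples * N))."""
    selected = [0]
    dist = np.full(len(points), np.inf)
    for _ in range(n_samples - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[-1]], axis=1))
        selected.append(int(dist.argmax()))
    return points[selected]

def synthesize_partial_views(points, radius_scale=100.0):
    """Keep only the points visible from each of the 4 camera locations,
    then resample 2048 points per view."""
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    diameter = np.linalg.norm(points.max(0) - points.min(0))
    views = []
    for cam in CAMERAS:
        # Hidden point removal: visibility from `cam` with an assumed radius.
        _, visible_idx = pcd.hidden_point_removal(np.array(cam, dtype=float),
                                                  radius_scale * diameter)
        visible = points[np.asarray(visible_idx)]
        if len(visible) >= 2048:
            visible = farthest_point_sample(visible, 2048)
        views.append(visible)
    return views
```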

[Figure 4. Qualitative results for affordance estimation on Bag, Earphone, Table, Chair, Bottle, Storage Furniture and Microwave (affordances Grasp, Listen, Move, Sit, Pour, Contain and Open). The top row shows the ground truth, the second row shows the full-shape estimated results, the third row shows the partial-view estimated results, and the fourth and bottom rows show the z/z and SO(3)/SO(3) rotation-invariant estimated results, respectively. All results come from PointNet++. The top words indicate the semantic category of each column and the bottom words indicate the affordance category. The greener the color of the points, the higher the confidence about the specific affordance type. Wrap. is the abbreviation of Wrap-Grasp.]

4.3. Rotation-Invariant Affordance Estimation

The shapes in 3D AffordanceNet are all aligned in canonical poses; however, the data observed by sensors in the real world are not always in canonical poses. The difference in rotation between real data and training data will lead to a performance drop in real-world usage, which has inspired research into rotation-equivariant networks [5]. Hence, it is critical to train the algorithms to estimate affordance on rotated objects. In this section, we provide a benchmark for affordance estimation subject to two types of rotations.

Network and Training. We use the same backbone networks, training strategies and hyper-parameters described in Sect. 4.1. We propose two different rotation settings for the experiment: z/z and SO(3)/SO(3), where z/z means rotation is applied along the z axis only for both training and inference stages, while SO(3)/SO(3) refers to SO(3) rotation, i.e. free rotation along the x, y and z axes. During the training session, we randomly sample rotation poses between [0, 2π] for each shape in the training mini-batch on-the-fly (a sketch of this rotation sampling is given at the end of this subsection). We train the proposed methods on complete point clouds. For the testing phase, we randomly sample 5 rotation poses for each shape for both rotation settings and fix the sampled rotations for the testing data.

Evaluation and Results. We calculate mAP, AUC and aIoU on the proposed methods. Quantitative results are presented in Tab. 4 and qualitative results are shown in the fourth and fifth rows of Fig. 4 for the z/z and SO(3)/SO(3) settings with PointNet++ as backbone. We observe that the performances on both the z/z and SO(3)/SO(3) settings drop compared to the canonical-view experiments. In particular, for the z/z setting, the performance drops around 1% in all metrics for all backbone networks, while a more significant loss of performance is observed for the SO(3)/SO(3) setting, with a 5-10% drop in all metrics. This is aligned with our expectation that SO(3)/SO(3) is a much more challenging task. We further make observations from the qualitative results in Fig. 4. First, under the z/z rotation scheme, despite the consistent performance drop, affordance estimations are largely correct across most categories. Obvious mistakes are made on the bottle and the microwave, where the former misses the tip which supports pour, while the latter mistakenly predicts the whole door of the microwave for open. Under the more challenging SO(3)/SO(3) scheme, there are still visually satisfying results for most shapes. The most prominent error is made on the storage furniture, where contain is totally missed, probably because the complex geometric structure (many concave shapes) renders contain a hard affordance to learn under arbitrary rotation. In general, we believe affordance estimation under SO(3)/SO(3) is a very challenging task and it deserves further investigation.
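The on-the-fly rotation augmentation described above can be read as sampling either a single z-axis angle or three Euler angles in [0, 2π]; below is a small NumPy sketch under that reading (our own helper names, not the authors' code).

```python
import numpy as np

def rotation_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def random_rotation(points, mode="z"):
    """Rotate an (N, 3) point cloud with a random pose sampled in [0, 2*pi].
    mode='z' rotates about the z axis only; mode='so3' composes rotations
    about x, y and z (one simple way to realize the SO(3)/SO(3) setting;
    note that Euler-angle sampling is not exactly uniform over SO(3))."""
    if mode == "z":
        R = rotation_z(np.random.uniform(0.0, 2.0 * np.pi))
    else:
        ax, ay, az = np.random.uniform(0.0, 2.0 * np.pi, size=3)
        Rx = np.array([[1, 0, 0],
                       [0, np.cos(ax), -np.sin(ax)],
                       [0, np.sin(ax), np.cos(ax)]])
        Ry = np.array([[np.cos(ay), 0, np.sin(ay)],
                       [0, 1, 0],
                       [-np.sin(ay), 0, np.cos(ay)]])
        R = Rx @ Ry @ rotation_z(az)
    return points @ R.T
```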

VAT Avg Grasp Lift Contain Open Lay Sit Support Wrap. Pour Display Push Pull Listen Wear Press Move Cut Stab
mAP 36.6 34.8 78.8 44.3 22.6 34.9 64.8 41.4 13.3 41.5 58.9 13.5 8.3 20.5 8.5 29.5 19.4 33.1 90.2
AUC 78.8 75.8 95.3 78.4 79.2 71.2 92.4 86.7 56.5 83.7 89.9 69.0 74.8 74.1 51.4 88.8 62.7 89.3 99.1
aIOU 11.2 12.7 20.3 14.2 4.8 6.4 16.9 7.7 5.1 17.6 30.2 2.5 1.1 8.4 3.6 12.8 2.9 9.5 24.3
MSE 0.155 0.007 0.0001 0.02 0.006 0.0007 0.012 0.023 0.023 0.007 0.008 0.0003 0.0001 0.002 0.002 0.005 0.034 0.004 0.0003
Full-Shape Avg Grasp Lift Contain Open Lay Sit Support Wrap. Pour Display Push Pull Listen Wear Press Move Cut Stab
mAP 34.3 34.5 77.9 42.3 18.2 36.5 62.9 38.8 11.2 38.9 53.3 13.4 7.3 9 6.7 24.5 17.4 35.8 89.5
AUC 77.5 75.4 94.8 78.1 75.7 73.3 92.1 85.8 54.4 82.1 87.5 74.5 77.4 61.6 46.7 83.4 64.1 89.8 98.9
aIOU 9.8 11.2 28.1 14.3 2.5 10.0 23.4 9.8 2.2 7.5 19.9 1.9 1.0 1.6 1.6 6.8 2.3 5.6 26.5
MSE 0.105 0.009 0.0002 0.013 0.003 0.001 0.013 0.021 0.004 0.003 0.003 0.0002 0.0001 0.0004 0.001 0.0008 0.031 0.0007 0.0001

Table 6. Results of semi-supervised affordance estimation. All numbers are in % except for MSE. We only implement semi-supervised affordance estimation on DGCNN. VAT denotes semi-supervised training with virtual adversarial training, and Full-Shape denotes the fully supervised full-shape baseline trained on the same 1% of labeled data. Wrap. is the abbreviation of Wrap-Grasp.

4.4. Semi-Supervised Affordance Estimation

Although the label propagation procedure allows the annotators to only annotate a few keypoints on the object surface, affordance annotation still remains an expensive and labor-intensive procedure. Inspired by the recent success of semi-supervised learning (SSL) [14, 25, 16], we establish a benchmark for semi-supervised affordance estimation. We synthesize a semi-supervised setting by randomly sampling 1% of the training data, assumed to be labeled, while the rest is assumed to be unlabeled data. The validation and testing sets are kept the same as in the standard benchmarks.

Network and Training. We utilize DGCNN [27] as our backbone. During every mini-batch in the training stage, we randomly sample an equal number of labeled data Xl and unlabeled data Xul. To fully exploit the unlabeled data, we employ a state-of-the-art semi-supervised learning framework, namely Virtual Adversarial Training (VAT) [16]. It encourages consistency between the posterior of an unlabeled sample and its augmentation, measured by the mean square error,

l_{mse} = \frac{1}{N}\sum_i^M \sum_j^N \|p_{i,j} - \hat{p}_{i,j}\|_2^2    (4)

where \hat{p}_{i,j} is the posterior prediction for the augmented sample. To best exploit the consistency power, the augmentation is obtained by first applying a one-step adversarial attack; the corresponding adversarial perturbation is then added to the original point cloud to produce the augmentation. Finally, the total loss for semi-supervised affordance estimation combines the losses defined for labeled data and unlabeled data,

l = l_{CE} + l_{DICE} + l_{mse}^{l} + l_{mse}^{u}    (5)

where l_{mse}^{l} and l_{mse}^{u} are the mean square errors (MSE) calculated on the labeled and unlabeled data, respectively. We compare the semi-supervised approach against a fully supervised baseline which is trained on the 1% labeled data alone with the cross-entropy and dice losses. We use a mini-batch of 16, with 8 labeled data and 8 unlabeled data, and follow the same training strategies and hyper-parameters described in [16]. We train a full-shape affordance estimation method based on DGCNN only on the labelled data following the description in Sect. 4.1.

Evaluation and Results. We evaluate the methods following the metrics described in Sect. 4.1, with results reported in Tab. 6. Comparing the performance of semi-supervised affordance estimation to the full-shape baseline, we find that semi-supervised affordance estimation outperforms the fully supervised baseline on all three metrics. Specifically, the gains for some affordance categories (e.g. open) that have low metrics on full-shape affordance estimation are high, which indicates that unlabeled data can provide useful information for affordance learning. In conclusion, we believe that exploiting unlabeled data to improve the performance has practical value and should receive more attention in the future.

5. Conclusion

In this work, we proposed 3D AffordanceNet, a 3D point cloud benchmark consisting of 22949 shapes from 23 semantic object categories, annotated with 56307 affordance annotations and covering 18 visual affordance categories. Based on this dataset, we defined three individual affordance estimation tasks and benchmarked three state-of-the-art point cloud deep learning networks. The results suggest that future research is required to achieve better performance on difficult affordance categories and under SO(3) rotation. Furthermore, we proposed a semi-supervised affordance estimation method to take advantage of a large amount of unlabeled data. The proposed dataset encourages the community to focus on affordance estimation research.

Acknowledgement. This work was supported in part by the National Natural Science Foundation of China (Grant No.: 61771201, 61902131), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No.: 2017ZT07X183), and the Guangdong R&D key project of China (Grant No.: 2019B010155001). Xun Xu acknowledges the A*STAR Career Development Award (CDA) Funding for providing financial support (Grant No. 202D8243).

References

[1] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

[2] Ching-Yao Chuang, Jiaman Li, Antonio Torralba, and Sanja Fidler. Learning to act properly: Predicting and explaining affordances from images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[3] Thanh-Toan Do, Anh Nguyen, and Ian Reid. AffordanceNet: An end-to-end deep learning approach for object affordance detection. In IEEE International Conference on Robotics and Automation, 2018.
[4] Jay Earley. An efficient context-free parsing algorithm. Communications of the ACM, 1970.
[5] Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning SO(3) equivariant representations with spherical CNNs. In Proceedings of the European Conference on Computer Vision, 2018.
[6] James J Gibson. The ecological approach to visual perception, 1979.
[7] Helmut Grabner, Juergen Gall, and Luc Van Gool. What makes a chair a chair? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[8] Mohammed Hassanin, Salman Khan, and Murat Tahtali. Visual affordance and function understanding: A survey. arXiv preprint arXiv:1807.06775, 2018.
[9] Ruizhen Hu, Oliver van Kaick, Bojian Wu, Hui Huang, Ariel Shamir, and Hao Zhang. Learning how objects function via co-analysis of interactions. ACM Transactions on Graphics, 2016.
[10] Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[11] Sagi Katz, Ayellet Tal, and Ronen Basri. Direct visibility of point sets. In ACM SIGGRAPH 2007 papers, 2007.
[12] Hema Swetha Koppula, Rudhir Gupta, and Ashutosh Saxena. Learning human activities and object affordances from RGB-D videos. The International Journal of Robotics Research, 2013.
[13] Hema S Koppula and Ashutosh Saxena. Anticipating human activities using object affordances for reactive robotic response. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[14] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. International Conference on Learning Representations, 2017.
[15] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision, 2016.
[16] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[17] Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[18] Austin Myers, Ching L Teo, Cornelia Fermüller, and Yiannis Aloimonos. Affordance detection of tool parts from geometric features. In IEEE International Conference on Robotics and Automation, 2015.
[19] Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. Object-based affordances detection with convolutional neural networks and dense conditional random fields. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017.
[20] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, 2017.
[21] Siyuan Qi, Siyuan Huang, Ping Wei, and Song-Chun Zhu. Predicting human activities using stochastic grammar. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
[22] Anirban Roy and Sinisa Todorovic. A multi-scale CNN for affordance segmentation in RGB images. In Proceedings of the European Conference on Computer Vision, 2016.
[23] Johann Sawatzky, Abhilash Srikantha, and Juergen Gall. Weakly supervised affordance detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[24] Hyun Oh Song, Mario Fritz, Daniel Goehring, and Trevor Darrell. Learning to detect visual grasp affordance. IEEE Transactions on Automation Science and Engineering, 2015.
[25] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, 2017.
[26] Tuan-Hung Vu, Catherine Olsson, Ivan Laptev, Aude Oliva, and Josef Sivic. Predicting actions from static scenes. In Proceedings of the European Conference on Computer Vision, 2014.
[27] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics, 2019.
[28] Saining Xie, Jiatao Gu, Demi Guo, Charles R Qi, Leonidas Guibas, and Or Litany. PointContrast: Unsupervised pre-training for 3D point cloud understanding. In Proceedings of the European Conference on Computer Vision, 2020.
[29] Li Yi, Vladimir G Kim, Duygu Ceylan, I-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. A scalable active framework for region annotation in 3D shape collections. ACM Transactions on Graphics, 2016.
[30] Wang Yifan, Shihao Wu, Hui Huang, Daniel Cohen-Or, and Olga Sorkine-Hornung. Patch-based progressive 3D point set upsampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[31] Fenggen Yu, Kun Liu, Yan Zhang, Chenyang Zhu, and Kai Xu. PartNet: A recursive part decomposition network for fine-grained and hierarchical shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[32] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 2019.
[33] Yixin Zhu, Yibiao Zhao, and Song Chun Zhu. Understanding tools: Task-oriented object modeling, learning and recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

