Image and Vision Computing 123 (2022) 104482

Contents lists available at ScienceDirect

Image and Vision Computing

journal homepage: www.elsevier.com/locate/imavis

Gradual adaption with memory mechanism for image-based 3D model retrieval
Dan Song a,b,c, Yuting Ling c, Tianbao Li c,⁎, Ting Zhang c, Guoqing Jin a, Junbo Guo a, Xuanya Li d,⁎
a State Key Laboratory of Communication Content Cognition, People's Daily Online, Beijing 100733, China
b Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230088, China
c School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
d Baidu Inc., Beijing 100105, China

Article info

Article history: Received 11 January 2022; received in revised form 18 April 2022; accepted 11 May 2022; available online 16 May 2022.

Keywords: 3D model retrieval; Unsupervised learning; Domain adaptation

Abstract

With the development of 3D modeling technology and its wide application in different fields, the number of 3D models is increasing rapidly, making 3D model retrieval a hot research topic. Compared with other 3D model retrieval methods, 2D image-based unsupervised 3D model retrieval uses 2D images, which are richly labeled and easy to obtain, as queries, and it also takes into account the difficulty of labeling 3D models. This task involves a cross-domain adaptation problem whose main challenge is the excessive domain gap. In this paper, we propose a cross-domain 3D model retrieval method that builds a memory mechanism on disentangled feature learning. Disentangled feature learning separates the entangled original features into domain-invariant features and domain-specific features, and only the former are aligned to narrow the domain gap. On this basis, for every sample the memory mechanism selects a feature vector from class memory modules built from class-representative features of the opposite domain, and uses it to update the domain-invariant features with a gradually changing weight. The memory mechanism gradually improves the adaptability of the model to the two very different domains. Experiments on the public datasets MI3DOR and MI3DOR-2 verify the feasibility and superiority of the proposed method. In particular, on MI3DOR-2 our method outperforms the current state-of-the-art methods by 7.71% on the strictest retrieval metric, NN.

© 2022 Elsevier B.V. All rights reserved.

⁎ Corresponding authors.
E-mail addresses: [email protected] (T. Li), [email protected] (X. Li).

https://doi.org/10.1016/j.imavis.2022.104482
0262-8856/© 2022 Elsevier B.V. All rights reserved.

1. Introduction

Driven by advances in computer software and hardware, many disciplines and industries have improved, and the types of information that can be obtained, presented and used are becoming increasingly rich [1,2]. In particular, 3D technology has attracted growing attention as it has matured. 3D modeling technology captures the spatial structure of an object, so that the object is presented in a form closer to human perception. At present, 3D modeling technology has been applied in many industries and fields, such as manufacturing, construction, medicine and the cultural industries. The management of 3D models is an important part of 3D technology, and 3D model retrieval plays a vital role in the whole management process. Although more and more software and hardware tries to simplify the modeling process and reduce its technical difficulty, 3D modeling still requires expertise and high manual cost. This means that in many scenarios it is more convenient and efficient to retrieve an existing 3D model directly than to build one from scratch. However, with the widespread application of 3D modeling technology, the number of 3D models is also increasing. In particular, in recent years 3D modeling technology has been successfully deployed on mobile terminals, which frees it from hardware limitations and has gradually won it a wider user group, resulting in a surge in the number of 3D models, and this growth trend is expected to continue. Under this premise, finding the required 3D model in a mass of data has become a practical and challenging task, which has drawn scholars' attention to 3D model retrieval methods.

A large body of related work on 3D model retrieval has emerged [3,4]. The current mainstream approach is to apply deep learning, which brings impressive performance [5–7]. It is worth noting that the success of deep learning-based methods depends on the annotation information of 3D models. However, annotating 3D models requires high manual cost. Given that the number of 3D models is huge and still increasing rapidly, the resources needed to annotate them may be unacceptable. Therefore, seeking a method that does not depend on annotation is one of the directions worth exploring in 3D model retrieval research. One such method is to transfer
knowledge from data in other domains with abundant annotation. Domain adaptation can ensure that the knowledge gained from one label-rich domain is well applied to different but related domains. Images can be used as the source domain because there are many widely used image datasets with reliable annotations. At the same time, compared with sketches, real-world images contain more details, which can bring higher retrieval accuracy. Therefore, domain adaptation can be performed between the labeled 2D image domain and the unlabeled 3D model domain, and 3D models can be retrieved using 2D images, which leads to the task called unsupervised 3D model retrieval based on 2D images.

2D image-based 3D object retrieval is especially challenging because of the significant gap in feature space between real 2D images and 3D objects. The images generally come from real-world objects and scenes shot by a camera, while the 3D models are generally created on a computer. The huge visual difference between the two domains reflects their different data distributions. Previous cross-domain 3D model retrieval methods tend to perform the alignment globally and in a fixed manner to reduce the gap between the two domains. This cannot cope with the large domain gap in the 2D image-based 3D model retrieval task, because the domain-invariant features and domain-specific features are entangled with each other, and the domain-specific features interfere with feature alignment and lead to negative transfer during domain adaptation. In this paper, a memory mechanism is designed on top of a disentangled feature learning framework to enhance the original representations and gradually improve the adaptability of the model to the target domain. Our contributions can be summarized as follows:

1. We propose an end-to-end framework for the 2D image-based unsupervised 3D model retrieval task, which gradually transfers knowledge from labeled 2D images to unlabeled 3D models. Its effectiveness is demonstrated by experiments conducted on the MI3DOR and MI3DOR-2 datasets.
2. We design an incremental memory mechanism involving a class memory module. The memory mechanism uses the class memory module to update features for alignment, which reduces the large domain gap in a progressive way.

2. Related work

2.1. 3D retrieval

3D model retrieval aims to find the matching 3D models in a database according to the queries. The current mainstream methods can be roughly divided into two kinds: model-based 3D model retrieval methods [8–14] and image-based 3D model retrieval methods [3,15–20].

Model-based 3D model retrieval refers to retrieval in which the queries are 3D models. The focus of model-based methods is to find suitable 3D model representations or descriptors, and to design algorithms based on them that extract features for retrieval and compute the similarity between the features. VoxNet [10] and ShapeNet [11] represent each grid as a binary tensor, and use the probability distribution of the binary tensor on the 3D voxel grid to represent the 3D model. Qi et al. construct PointNet [4] to learn feature representations from a set of unordered points collected from the model, including point coordinates and additional features. Afterwards, Qi et al. propose PointNet++ [9] based on PointNet, which imitates the idea of multi-layer receptive fields and adds a network structure for extracting local features to supplement details. MVCNN [3] uses rendered multi-views to represent the model. These views are obtained by setting virtual cameras around the 3D model. After feature extraction, a max-pooling operation is added to compress the features from multiple views into one compact descriptor. MVCNN achieves better performance than ShapeNet, and is an important view-based method. Based on it, GVCNN [21] adds a “group” module, which groups features from different views according to their discrimination scores. Compared with MVCNN, GVCNN considers the connection between multiple views.

2D images are more common and easier to obtain than 3D models. For practical reasons, 3D model retrieval based on 2D images has been extensively studied, and images and models are generally regarded as two domains. The images here can be hand-drawn sketches or real-world images. The method proposed by Wang et al. [22] learns Siamese convolutional neural networks for sketches and 3D models respectively to extract features. Zhu et al. [23] propose a pyramid cross-domain neural network (PCDNN), which maps sketches and low-level representations of 3D models on multiple pyramid levels to a unified feature space. Dai et al. [24] propose Deep Correlation Metric Learning (DCML), which learns two deep nonlinear transformations. Research on using real-world images to retrieve 3D models has attracted more interest in recent years. Mu et al. [25] model the images and the views of 3D models as Euclidean points and symmetric positive definite matrices respectively, which turns the task into a Euclidean-Riemann metric learning problem, and map both the Euclidean space and the Riemannian manifold to a high-dimensional Hilbert space to solve it. Zhou et al. [1] propose an end-to-end unsupervised Dual-level Embedding Alignment (DLEA) network. Its visual feature learning module learns visual features for images and views of 3D models, and its cross-domain feature adaptation module aligns the features of the two domains at the domain level and the class level. Zhou et al. [26] then focus on instance features and local semantics, and propose maximizing the mutual information between the input and the high-level features to preserve as much of the individual instance features as possible. The method proposed by Grabner et al. [27] departs from the previous approach of mapping the 3D model and the real image to the same embedding space. It establishes low-level representations that encode the correspondence between pixels and coordinates, from which a pose-invariant 3D model descriptor is computed, and this descriptor is used for retrieval.

2.2. Domain adaptation

Domain adaptation is a branch of transfer learning. In a common problem setting, domain adaptation involves a source domain with a large number of labeled samples and a target domain with no (or only a small number of) labeled samples. The data distributions of the two domains differ, that is, there is a domain shift. Due to the lack of labels in the target domain, domain adaptation is often combined with unsupervised [28–33] or semi-supervised [34–37] learning.

Traditional domain adaptation is often performed by minimizing a statistics-based inter-domain difference, such as the difference in the mean or higher-order moments between the source and target domains. CORAL [38] linearly transforms the covariance matrix of the source domain and aligns it with the covariance matrix of the target domain, and then replaces the original matrix operation with whitening and re-coloring operations. Transfer Component Analysis (TCA), proposed by Pan et al. [39], uses the Maximum Mean Discrepancy (MMD) strategy to learn domain-invariant representations in a Reproducing Kernel Hilbert Space (RKHS).

Deep domain adaptation methods rely on deep neural networks, and the strategies and difference metrics of traditional domain adaptation methods, such as CORAL and MMD, are still applicable in deep domain adaptation [40,41]. In recent years, adversarial learning has become popular in domain adaptation. Ganin et al. [42] first apply adversarial learning to domain adaptation and propose the Domain-Adversarial Neural Network (DANN). The Gradient Reversal Layer (GRL) is also proposed to achieve the two opposite optimization objectives of the feature extractor and the domain discriminator. In contrast to DANN, which shares the feature extractor between the source domain and the target domain, Tzeng et al. [43] do not share weights and extract features independently from the source domain and the
target domain to capture more domain-specific features, and propose the Adversarial Discriminative Domain Adaptation (ADDA) algorithm. Long et al. [44] propose Conditional Domain Adversarial Networks (CDANs) to exploit the correlation between categories, which is neglected by DANN. Two new adaptation strategies are also adopted: a multilinear strategy to improve the accuracy of the classifier and an entropy strategy to ensure the transferability of the classifier.

3. Method

3.1. Overview

The aim of the 2D image-based 3D model retrieval task is to retrieve the 3D models that match the query 2D images. In this task, 2D images are defined as the source domain $D_s = \{(x_s^i, y_s^i)\}_{i=1}^{n_s}$, where $x_s^i$ is the $i$-th image with its corresponding label $y_s^i \in [0, J-1]$, and $n_s$ and $J$ are the number of image samples and the number of classes respectively. Unlabeled 3D models are defined as the target domain $D_t = \{x_t^i\}_{i=1}^{n_t}$, which contains $n_t$ unlabeled 3D model samples $x_t$. The whole framework we propose is shown in Fig. 1. It contains a feature disentanglement module, a class memory module and a domain adversarial learning module. After the extracted original visual features are disentangled, the retained domain-invariant features are enhanced by the memory mechanism, the enhanced features are then aligned under the guidance of the adversarial domain adaptation strategy, and the aligned features are used for retrieval.

Fig. 1. Overview of the network for the proposed method.

3.2. Original feature extraction

In the proposed method, a general CNN architecture is adopted as the backbone of the feature extractor $F(\cdot)$. For the source domain, the features of 2D images are extracted directly by the backbone network. For the target domain, following [3], we represent a 3D model by a set of rendered views taken by preset virtual cameras. The feature extractor is therefore adjusted for the target domain: a view pooling layer is added after the last convolutional layer of the CNN. A set of $N$ views is regarded as $N$ channels. The features of each view are extracted first and then fused into a compact one-channel feature vector by the view pooling layer.

3.3. Feature disentanglement

The features extracted directly from a domain contain a variety of information and are called original features in this paper. We assume that the information contained in a sample can be summarized into two kinds: domain-invariant features and domain-specific features. Domain-specific features represent untransferable domain attributes, which introduce interference and hinder more comprehensive alignment. For the two domains of 2D images and 3D models in particular, the visual difference between the samples indicates that both domains contain domain-specific features that cannot be ignored.

We adopt a feature disentanglement module [45] and add it after $F(\cdot)$ to strip the domain-specific features from the original features. The feature disentanglement module is composed of two parallel linear layers $d$ and $d'$. The domain-specific features $f_s^i$ are captured by $d'$ and subtracted from the output of $d$, so that only the domain-invariant features $f^i$ are left. The feature extractor with the added disentanglement module is written as $F_m(\cdot)$. As our goal is to reduce the domain gap and the domain-specific features disturb it, the domain-specific features should be eliminated during feature learning. We measure the domain-specific features by their L1-norm and constrain them with $L_d$:

$$L_d = \frac{1}{n_s + n_t} \sum_{i=1}^{n_s + n_t} \left| f_s^i \right| \tag{1}$$

3.4. Memory mechanism

When human beings face a brand-new item, they sometimes use their memory of similar items to help understand it. The design of the memory mechanism resembles this cognitive process. The memory module stores representative features of categories that can be used to provide assistance. Inspired by [46], we use class centroids as class representations to build the memory modules $M_s$ and $M_t$. The class centroids are calculated as the mean values of the samples in every category, where pseudo labels are required for the target domain. It is worth noting that $M_s$ is the memory for the source domain and stores the class centroids of the target domain, while $M_t$ does the opposite.

For every sample $x$, the memory mechanism uses a selection function $\varphi(\cdot)$ to choose a feature vector from the memory module as $f_m$:
$$f_m = M[\varphi(x)] \tag{2}$$

where $M$ is $M_s$ if $x$ is from the source domain and $M_t$ otherwise. $\varphi(\cdot)$ uses the softmax function to output the index of the selected class centroid. $f_m$ and the domain-invariant features $f$ together constitute the new features:

$$f_{new} = \omega f + (1 - \omega) f_m \tag{3}$$

Here, $\omega$ changes during training and is defined as:

$$\omega = \begin{cases} \mathrm{thr}, & \gamma < \mathrm{thr} \\ \gamma, & \gamma \ge \mathrm{thr} \end{cases} \tag{4}$$

where thr is a threshold that keeps $f$ the major part of $f_{new}$, and $\gamma = \frac{2}{1 + \exp(-10 \cdot p)} - 1$ is a value that gradually increases with the training period $p$ [47]. Therefore, $\omega$ never falls below thr and gradually increases as training proceeds. As $f_m$ offers a buffer for the large domain gap, there is no need to buffer as much as before once the domain gap gradually decreases.

3.5. Adversarial learning

Adversarial learning is widely used for domain adaptation. Following several popular works on adversarial domain adaptation, we make the feature extractor compete against a domain discriminator. Here, the feature extractor refers to the feature extractor with the disentanglement module, $F_m(\cdot)$. A domain discriminator is usually a simple binary classification network composed of multiple linear layers and activation functions. To maintain structural symmetry, we modify the domain discriminator in the same way as the disentanglement module to obtain $D_m(\cdot)$. The adversarial loss optimizes both $F_m(\cdot)$ and $D_m(\cdot)$:

$$L_{ad} = -E_{x \sim D_s}\left[\log D_m(F_m(x))\right] - E_{x \sim D_t}\left[\log\left(1 - D_m(F_m(x))\right)\right] \tag{5}$$

The source domain is assigned domain label 1 and the target domain is assigned domain label 0. The goal of $F_m(\cdot)$ is to learn domain-invariant features of the two domains that can confuse $D_m(\cdot)$, whereas $D_m(\cdot)$ aims to predict correctly whether a sample comes from the source domain or the target domain. As $F_m(\cdot)$ and $D_m(\cdot)$ are optimized towards their contradictory goals, the difference between the two domains is reduced until neither $F_m(\cdot)$ nor $D_m(\cdot)$ can improve further. The classification loss $L_{cls}$ is employed to ensure that the domain-invariant features are discriminative with respect to category on the source domain:

$$L_{cls} = E_{(x,y) \sim D_s}\, L_{ce}\left(y, G(F_m(x))\right) \tag{6}$$

where $L_{ce}$ is the cross-entropy loss for multi-class classification and $G$ is the classifier. As the memory module is added, Eqs. (5) and (6) should be modified to:

$$L_{ad} = -E_{x \sim D_s}\left[\log D_m(f_{new})\right] - E_{x \sim D_t}\left[\log\left(1 - D_m(f_{new})\right)\right] \tag{7}$$

$$L_{cls} = E_{(x,y) \sim D_s}\, L_{ce}\left(y, G(f_{new})\right) \tag{8}$$

Eqs. (5) and (6), however, show the components that actually participate in the optimization.

Finally, the domain-invariant features are constrained by minimizing $L$, which is written as:

$$L = L_{cls} + L_{ad} + L_d \tag{9}$$

4. Experiments

4.1. Implementation

4.1.1. Dataset

MI3DOR [48] is a public dataset containing 21,000 2D images and 7690 3D models of the same 21 categories, which is divided into a training set and a test set. The training set includes 10,500 2D images and 3842 3D models of all categories. Each 3D model is represented by 12 views. The test set includes the remaining 2D images and 3D models. MI3DOR-2 [1] contains 19,694 2D images and 3982 3D models (also represented by 12 views) of 40 categories. The training set includes 19,294 2D images and 3182 3D models. The test set includes 400 2D images and 800 3D models. MI3DOR-2 has more categories than MI3DOR, but the 3D models in MI3DOR come from more than one dataset.

4.1.2. Evaluation metrics

Following previous work on this task, six metrics are used to evaluate the retrieval performance of the proposed method, namely Nearest Neighbor (NN), First Tier (FT), Second Tier (ST), F-Measure, Discounted Cumulative Gain (DCG) and Average Normalized Modified Retrieval Rank (ANMRR). They are described as follows:

• NN is defined as the accuracy of the first returned result over all the queries.
• FT and ST represent how many results of the query's class appear within the top-$t$ and top-$2t$ returned results, where $t$ denotes the number of 3D models that belong to the query's class.
• F-Measure measures the precision and recall of all the queries jointly, and is defined as $2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$.
• DCG takes the position of each result in the sorted retrieval list into account, so that a correct result ranked near the top gets a larger weight.
• ANMRR is also a metric that considers both the number of correct results retrieved and their ranking.

The values of the first five metrics are positively correlated with performance, whereas a lower ANMRR is better.

4.1.3. Implementation settings

MI3DOR and MI3DOR-2 both provide 12 views of each 3D model. When the batch size is fixed, the number of models that can be loaded in a batch is inversely proportional to the number of views per model. ResNet50 [49] is used as the backbone of the feature extractor, and the parameters of the convolutional layers are loaded from a model pretrained on ImageNet. The original feature vector is set to 256-d, and feature disentanglement does not change the dimension of the features. The constructed memory modules $M_s$ and $M_t$ are tensors of size 21 × 256 or 40 × 256, depending on the dataset used. Mini-batch stochastic gradient descent and learning rate decay are adopted in training. The learning rate is calculated as $\frac{lr_0}{(1 + \alpha \cdot \min(1, p))^{\beta}}$ with initial learning rate $lr_0 = 0.01$, $\alpha = 10$ and $\beta = 0.75$, where the training schedule $p$ varies from 0 to 1. The threshold thr in the memory mechanism is set to 0.7 according to the sensitivity analysis shown in Sec. 4.2.3.

4.2. Results and analysis

4.2.1. Comparison results

We compare our work with several representative methods on both the MI3DOR and MI3DOR-2 datasets: CORAL [38], MEDA [50], JGSA [51], JAN [52], RevGrad [47], DLEA [1] and SC-IFA [5]. CORAL, MEDA and JGSA are traditional transfer learning methods, which align the distributions by reducing the shift of statistics between the two domains. JAN, RevGrad, DLEA and SC-IFA are deep transfer learning methods, which embed domain-adaptation modules into deep networks to align feature distributions across domains. To be specific, CORAL directly
narrows the difference in second-order statistics between the two domains to reduce the shift. Considering that geometrical shift and distribution shift result in negative transfer, JGSA sets constraints on two coupled projections for the source and target data. MEDA performs dynamic distribution alignment for manifold domain adaptation by training a domain-invariant classifier on the Grassmann manifold with structural risk minimization. JAN tries to align the joint distributions of multiple domain-specific layers across domains. RevGrad proposes an adversarial domain adaptation strategy to learn domain-invariant features. Different from previous methods, which only eliminate the domain-level shift, DLEA considers class-level alignment by matching the centroid of each category in the source and target domains. SC-IFA jointly performs instance visual feature extraction and cross-domain instance feature adaptation with semantic consistency to strengthen adversarial domain training. The method proposed in this paper also aims to improve the adaptability of adversarial alignment.

The results of the above methods and of our work are shown in Table 1. In general, our method achieves a leading position on most indicators on both MI3DOR and MI3DOR-2. Moreover, it can be observed from the results that:

1. Compared with traditional transfer learning methods, the performance of our method is far ahead. Other deep transfer learning methods also perform better than the traditional ones. For a retrieval task, deep neural networks can extract high-level semantic information from the inputs. At the same time, feature extraction and domain adaptation in deep transfer learning methods are carried out cooperatively, with the goal of learning transferable features that minimize the domain gap. Traditional approaches, in contrast, extract features and perform domain adaptation separately, which means that their optimization directions are not necessarily consistent. JGSA learns low-dimensional coupled projections on the two domains and reduces geometrical shifts and distribution shifts simultaneously, which makes it beat JAN, a deep transfer learning method, on MI3DOR.
2. Compared with JAN and RevGrad, our work performs much better. Both JAN and RevGrad directly align the features globally. They are suited to general cross-domain tasks with a small domain gap and simple data composition, but fail in the 2D image-based 3D model retrieval task. The proposed method first disentangles the original extracted features to remove the domain-specific features, which cause huge domain divergence and interfere with alignment. Then, considering that disentanglement does not work sufficiently at the beginning and the learned features of the two domains still diverge a lot, the memory mechanism softens the features, provides a buffer for the discriminator, and enables the model to be optimized robustly.
3. Compared with DLEA and SC-IFA, our work gains better scores on most indicators. Besides domain-level alignment, DLEA proposes class-level alignment while SC-IFA proposes instance feature adaptation. Such hierarchical alignment shares with disentangled feature learning with a memory mechanism the aim of making the alignment more comprehensive. Generally speaking, however, disentangled feature learning with a memory mechanism is more suitable for this task because it focuses on the large gap between the two domains.

Table 1
Comparison results on MI3DOR and MI3DOR-2.

| Dataset | Method | NN↑ | FT↑ | ST↑ | F-Measure↑ | DCG↑ | ANMRR↓ |
|---|---|---|---|---|---|---|---|
| MI3DOR | CORAL | 0.362 | 0.174 | 0.256 | 0.060 | 0.199 | 0.816 |
| MI3DOR | MEDA | 0.430 | 0.344 | 0.501 | 0.046 | 0.361 | 0.646 |
| MI3DOR | JGSA | 0.612 | 0.443 | 0.599 | 0.116 | 0.473 | 0.541 |
| MI3DOR | JAN | 0.446 | 0.343 | 0.495 | 0.085 | 0.364 | 0.647 |
| MI3DOR | RevGrad | 0.650 | 0.505 | 0.643 | 0.112 | 0.542 | 0.474 |
| MI3DOR | DLEA | 0.764 | 0.558 | 0.716 | 0.143 | 0.597 | 0.421 |
| MI3DOR | SC-IFA | 0.721 | 0.584 | 0.721 | 0.163 | 0.637 | 0.363 |
| MI3DOR | Ours | 0.743 | 0.593 | 0.736 | 0.143 | 0.626 | 0.389 |
| MI3DOR-2 | CORAL | 0.538 | 0.369 | 0.497 | 0.369 | 0.399 | 0.614 |
| MI3DOR-2 | MEDA | 0.570 | 0.392 | 0.523 | 0.392 | 0.425 | 0.590 |
| MI3DOR-2 | JGSA | 0.585 | 0.405 | 0.533 | 0.405 | 0.433 | 0.577 |
| MI3DOR-2 | JAN | 0.608 | 0.501 | 0.646 | 0.501 | 0.527 | 0.484 |
| MI3DOR-2 | RevGrad | 0.623 | 0.467 | 0.614 | 0.467 | 0.503 | 0.514 |
| MI3DOR-2 | DLEA | 0.700 | 0.555 | 0.681 | 0.555 | 0.593 | 0.424 |
| MI3DOR-2 | SC-IFA | 0.713 | 0.641 | 0.738 | 0.641 | 0.648 | 0.415 |
| MI3DOR-2 | Ours | 0.768 | 0.636 | 0.749 | 0.636 | 0.673 | 0.342 |

4.2.2. Ablation study

To explore the effectiveness of the proposed method, we evaluate the performance of its key parts in Table 2. Here, “m” represents the memory mechanism and “d” represents feature disentanglement, and a suffix such as “-m” means that the corresponding module is removed. The framework that only uses the adversarial strategy (Ours-d-m) lags behind, because it not only tries to align domain-invariant features but also tries to align domain-specific features, which can cause negative transfer and affect the direction of domain alignment optimization. After the disentanglement module separates the features into domain-invariant features and domain-specific features, only the domain-invariant features are retained, providing a pure feature space for subsequent operations, so that the two domains can be aligned as much as possible. On this basis, the memory mechanism provides a buffer for the model to reduce the negative effects of excessive domain differences, so its addition brings further performance improvements.

Table 2
Ablation study results on MI3DOR and MI3DOR-2.

| Dataset | Method | NN↑ | FT↑ | ST↑ | F-Measure↑ | DCG↑ | ANMRR↓ |
|---|---|---|---|---|---|---|---|
| MI3DOR | Ours-d-m | 0.647 | 0.517 | 0.685 | 0.126 | 0.547 | 0.469 |
| MI3DOR | Ours-m | 0.738 | 0.561 | 0.705 | 0.139 | 0.596 | 0.402 |
| MI3DOR | Ours | 0.743 | 0.593 | 0.736 | 0.143 | 0.626 | 0.389 |
| MI3DOR-2 | Ours-d-m | 0.737 | 0.597 | 0.724 | 0.597 | 0.635 | 0.382 |
| MI3DOR-2 | Ours-m | 0.745 | 0.630 | 0.747 | 0.630 | 0.668 | 0.348 |
| MI3DOR-2 | Ours | 0.768 | 0.636 | 0.749 | 0.636 | 0.673 | 0.342 |

In addition, we consider the impact of different feature extraction backbones (ResNet34 [49], ResNet50 [49], VGG16 [53]) on retrieval performance and conduct experiments on the MI3DOR-2 dataset. As shown in Table 3, the features extracted by ResNet50 are the most suitable for the method proposed in this paper.

Table 3
Experiment results of various backbones on MI3DOR-2.

| Dataset | Backbone | NN↑ | FT↑ | ST↑ | F-Measure↑ | DCG↑ | ANMRR↓ |
|---|---|---|---|---|---|---|---|
| MI3DOR-2 | VGG16 | 0.722 | 0.508 | 0.638 | 0.508 | 0.543 | 0.468 |
| MI3DOR-2 | ResNet34 | 0.737 | 0.611 | 0.747 | 0.611 | 0.640 | 0.371 |
| MI3DOR-2 | ResNet50 | 0.768 | 0.636 | 0.749 | 0.636 | 0.673 | 0.342 |

4.2.3. Analysis of hyperparameter

The hyperparameter involved in the proposed method is thr. For the memory mechanism, the threshold thr determines the degree to which the domain-invariant features of a sample are enhanced by memory. Fig. 3 gives the results of the sensitivity analysis on thr. When thr = 0.3, the overall retrieval performance decreases. When thr is above 0.5, changing thr brings only tiny variations in the indicators, and the performance is also better. This indicates that the domain-invariant features should dominate; otherwise the domain discriminator and the classifier are confused to no purpose, which degrades their optimization. As most of the indicators reach their best values when thr is 0.7, thr is set to 0.7.

4.2.4. Qualitative evaluation

As shown in Fig. 2, we visualize the features obtained from different settings on MI3DOR-2 to observe the effects of the proposed modules using
D. Song, Y. Ling, T. Li et al. Image and Vision Computing 123 (2022) 104482

Fig. 3. Analysis on hyperparameter thr.

t-distributed stochastic neighbor embedding(t-SNE) [54]. The red dots


Fig. 2. Visualization of the cross-domain features with different modules via t-SNE on MI3DOR-2. represent the 2D image from source domain and the blue dots represent
the 3D models from the target domain. Fig. 2(a) shows that
when only the adversarial strategy is adopted, the features can be mapped
to the same feature space, but some samples are poorly aligned. We
attribute this to the excessive influence of negative transfer caused
by domain-specific features. After adding the disentanglement module,
as shown in Fig. 2(b), the distance between the source-domain and
target-domain samples is effectively reduced globally. In comparison,
Fig. 2(c) shows that the feature distribution of each category becomes
more concentrated, since the memory mechanism enhances the
domain-invariant features with the representative features.
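
The two visual effects described above (a smaller global distance between the domains, and tighter per-category clusters) can also be checked numerically. The following is a minimal sketch and not part of the proposed method: the function names `domain_gap` and `class_spread` and the toy feature values are hypothetical, and `class_spread` assumes ground-truth labels are available for evaluation only, since the target domain is unlabeled during training.

```python
import math

def centroid(points):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def domain_gap(source_feats, target_feats):
    """Euclidean distance between the source and target feature centroids."""
    return math.dist(centroid(source_feats), centroid(target_feats))

def class_spread(feats, labels):
    """Average distance of each sample to the centroid of its own class."""
    by_class = {}
    for f, y in zip(feats, labels):
        by_class.setdefault(y, []).append(f)
    centers = {y: centroid(fs) for y, fs in by_class.items()}
    total = sum(math.dist(f, centers[y]) for f, y in zip(feats, labels))
    return total / len(feats)

# Toy 2-D features (hypothetical values, not taken from the paper).
source = [[0.0, 0.0], [2.0, 0.0]]   # stand-in for 2D image features
target = [[0.0, 3.0], [2.0, 3.0]]   # stand-in for 3D model features
labels = [0, 0]                     # both source samples share one class

gap = domain_gap(source, target)       # centroids (1, 0) and (1, 3)
spread = class_spread(source, labels)  # each sample is 1 away from (1, 0)
print(f"domain gap: {gap:.3f}, class spread: {spread:.3f}")
```

Under these assumptions, a lower `domain_gap` would correspond to the global alignment seen in Fig. 2(b), and a lower `class_spread` to the more concentrated categories seen in Fig. 2(c).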

5. Conclusion

Aiming at the task of 3D model retrieval based on 2D images, this
paper proposes a framework that incorporates a memory mechanism. To
address the large domain gap in this task, the framework applies the
memory mechanism on top of disentangled feature learning. The extracted
original visual features are disentangled to obtain domain-invariant
features. The domain-invariant features learned at the beginning of
training are still affected by domain divergence and cannot achieve the
desired representation. The memory mechanism offers “memory” features to
the domain-invariant features, bringing additional bidirectional
knowledge transfer and gradually improving the model's adaptability to
the large domain gap. The experimental results demonstrate the
effectiveness of the proposed framework. In addition, we sincerely thank
the Baidu Program for the PaddlePaddle platform [55,56].

CRediT authorship contribution statement

Dan Song: Conceptualization, Methodology. Yuting Ling: Data
curation, Software, Writing – original draft. Tianbao Li: Investigation,
Writing – review & editing, Software. Ting Zhang: Investigation,
Writing – original draft. Guoqing Jin: Writing – review & editing.
Junbo Guo: Software, Validation. Xuanya Li: Visualization.

Declaration of Competing Interest

None.

Acknowledgment

This work was supported in part by the National Natural Science
Foundation of China (61902277), State Key Laboratory of Communication
Content Cognition (Grant No. A02106), the Open Funding Project
of the State Key Laboratory of Communication Content Cognition (Grant
No. 20K04) and the Baidu Program.

References

[1] Heyu Zhou, An-An Liu, Weizhi Nie, Dual-level embedding alignment network for 2d image-based 3d object retrieval, Proceedings of the 27th ACM International Conference on Multimedia 2019, pp. 1667–1675.
[2] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, Xin Wang, A comprehensive survey of neural architecture search: challenges and solutions, ACM Comput. Surv. 54 (4) (2021) 1–34.
[3] Hang Su, Subhransu Maji, Evangelos Kalogerakis, Erik Learned-Miller, Multi-view convolutional neural networks for 3d shape recognition, Proceedings of the IEEE International Conference on Computer Vision 2015, pp. 945–953.
[4] Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas, Pointnet: Deep learning on point sets for 3d classification and segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, pp. 652–660.
[5] Heyu Zhou, Weizhi Nie, Dan Song, Hu Nian, Xuanya Li, An-An Liu, Semantic consistency guided instance feature alignment for 2d image-based 3d shape retrieval, Proceedings of the 28th ACM International Conference on Multimedia 2020, pp. 925–933.
[6] Zhihui Li, Feiping Nie, Xiaojun Chang, Yi Yang, Chengqi Zhang, Nicu Sebe, Dynamic affinity graph construction for spectral clustering using multiple features, IEEE Trans. Neural Netw. Learn. Syst. 29 (12) (2018) 6323–6332.
[7] Xiaojun Chang, Feiping Nie, Sen Wang, Yi Yang, Xiaofang Zhou, Chengqi Zhang, Compound rank-k projections for bilinear analysis, IEEE Trans. Neural Netw. Learn. Syst. 27 (7) (2015) 1502–1513.
[8] Alberto Garcia-Garcia, Francisco Gomez-Donoso, Jose Garcia-Rodriguez, Sergio Orts-Escolano, Miguel Cazorla, J. Azorin-Lopez, Pointnet: A 3d convolutional neural network for real-time object class recognition, 2016 International Joint Conference on Neural Networks (IJCNN), IEEE 2016, pp. 1578–1584.
[9] Charles Ruizhongtai Qi, Li Yi, Hao Su, Leonidas J. Guibas, Pointnet++: Deep hierarchical feature learning on point sets in a metric space, Advances in Neural Information Processing Systems 2017, pp. 5099–5108.
[10] Daniel Maturana, Sebastian Scherer, Voxnet: A 3d convolutional neural network for real-time object recognition, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE 2015, pp. 922–928.
[11] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, Jianxiong Xiao, 3d shapenets: A deep representation for volumetric shapes, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015, pp. 1912–1920.
[12] Nima Sedaghat, Mohammadreza Zolfaghari, Ehsan Amiri, Thomas Brox, Orientation-Boosted Voxel Nets for 3d Object Recognition, arXiv preprint arXiv:1604.03351 2016.
[13] Seong-heum Kim, Youngbae Hwang, In So Kweon, Category-specific upright orientation estimation for 3d model classification and retrieval, Image Vis. Comput. 96 (2020) 103900.
[14] Weizhi Nie, Yue Zhao, Dan Song, Yue Gao, Dan: deep-attention network for 3d shape recognition, IEEE Trans. Image Process. 30 (2021) 4371–4383.
[15] Song Bai, Xiang Bai, Zhichao Zhou, Zhaoxiang Zhang, Qi Tian, Longin Jan Latecki, Gift: Towards scalable 3d shape retrieval, IEEE Trans. Multimedia 19 (6) (2017) 1257–1271.
[16] Alexander Grabner, Peter M. Roth, Vincent Lepetit, 3d pose estimation and 3d model retrieval for objects in the wild, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, pp. 3022–3031.
[17] Xinwei He, Yang Zhou, Zhichao Zhou, Song Bai, Xiang Bai, Triplet-center loss for multi-view 3d object retrieval, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, pp. 1945–1954.
[18] Zhaoqun Li, Xu Cheng, Biao Leng, Angular triplet-center loss for multi-view 3d shape retrieval, Proceedings of the AAAI Conference on Artificial Intelligence, 33, 2019, pp. 8682–8689.
[19] Jin Xie, Guoxian Dai, Fan Zhu, Yi Fang, Learning barycentric representations of 3d shapes for sketch-based 3d shape retrieval, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, pp. 5068–5076.
[20] Wei-Zhi Nie, An-An Liu, Sicheng Zhao, Yue Gao, Deep correlated joint network for 2-d image-based 3-d model retrieval, IEEE Trans. Cybern. 52 (3) (2020) 1862–1871.
[21] Yifan Feng, Zizhao Zhang, Xibin Zhao, Rongrong Ji, Yue Gao, Gvcnn: Group-view convolutional neural networks for 3d shape recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, pp. 264–272.
[22] Fang Wang, Le Kang, Yi Li, Sketch-based 3d shape retrieval using convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015, pp. 1875–1883.
[23] Fan Zhu, Jin Xie, Yi Fang, Learning cross-domain neural networks for sketch-based 3d shape retrieval, Proceedings of the AAAI Conference on Artificial Intelligence, 30, 2016.
[24] Guoxian Dai, Jin Xie, Fan Zhu, Yi Fang, Deep correlated metric learning for sketch-based 3d shape retrieval, Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[25] Pan-pan Mu, San-yuan Zhang, Yin Zhang, Xiu-zi Ye, Xiang Pan, Image-based 3d model retrieval using manifold learning, Front. Inform. Technol. Electron. Eng. 19 (11) (2018) 1397–1408.
[26] Heyu Zhou, Weizhi Nie, Wenhui Li, Dan Song, An-An Liu, Hierarchical instance feature alignment for 2d image-based 3d shape retrieval, Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence 2021, pp. 839–845.
[27] Alexander Grabner, Peter M. Roth, Vincent Lepetit, Location field descriptors: Single image 3d model retrieval in the wild, 2019 International Conference on 3D Vision (3DV), IEEE 2019, pp. 583–593.
[28] Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze, Deep clustering for unsupervised learning of visual features, Proceedings of the European Conference on Computer Vision (ECCV) 2018, pp. 132–149.
[29] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin, Unsupervised learning of visual features by contrasting cluster assignments, Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS), 2020.
[30] Atif Belal, Madhu Kiran, Jose Dolz, Louis-Antoine Blais-Morin, Eric Granger, et al., Knowledge distillation methods for efficient unsupervised adaptation across multiple domains, Image Vis. Comput. 108 (2021) 104096.
[31] Q. Zhou, W. Zhou, S. Wang, Cluster adaptation networks for unsupervised domain adaptation, Image Vis. Comput. 108 (2021).
[32] Xueping Wang, Rameswar Panda, Min Liu, Yaonan Wang, Amit K. Roy-Chowdhury, Exploiting global camera network constraints for unsupervised video person re-identification, IEEE Transactions on Circuits and Systems for Video Technology, 2020.
[33] Zhenguang Liu, Peng Qian, Xiaoyang Wang, Yuan Zhuang, Lin Qiu, Xun Wang, Combining graph neural networks with expert knowledge for smart contract vulnerability detection, IEEE Transactions on Knowledge and Data Engineering, 2021.
[34] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, Colin A. Raffel, Mixmatch: A holistic approach to semi-supervised learning, Adv. Neural Inf. Proces. Syst. 32 (2019).
[35] Bo Jiang, Ziyan Zhang, Doudou Lin, Jin Tang, Bin Luo, Semi-supervised learning with graph learning-convolutional networks, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, pp. 11313–11320.
[36] Zongsheng Yue, Deyu Meng, Juan He, Gemeng Zhang, Semi-supervised learning through adaptive laplacian graph trimming, Image Vis. Comput. 60 (2017) 38–47.
[37] Xueping Wang, Min Liu, Dripta S. Raychaudhuri, Sujoy Paul, Yaonan Wang, Amit K. Roy-Chowdhury, Learning person re-identification models from videos with weak supervision, IEEE Trans. Image Process. 30 (2021) 3017–3028.
[38] Baochen Sun, Jiashi Feng, Kate Saenko, Return of frustratingly easy domain adaptation, Proceedings of the AAAI Conference on Artificial Intelligence, 30, 2016.
[39] Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, Qiang Yang, Domain adaptation via transfer component analysis, IEEE Trans. Neural Netw. 22 (2) (2010) 199–210.
[40] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, Trevor Darrell, Deep Domain Confusion: Maximizing for Domain Invariance, arXiv preprint arXiv:1412.3474 2014.
[41] Baochen Sun, Kate Saenko, Deep coral: Correlation alignment for deep domain adaptation, European Conference on Computer Vision, Springer 2016, pp. 443–450.
[42] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, Victor Lempitsky, Domain-adversarial training of neural networks, J. Mach. Learn. Res. 17 (1) (2016) 2096–2030.
[43] Eric Tzeng, Judy Hoffman, Kate Saenko, Trevor Darrell, Adversarial discriminative domain adaptation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, pp. 7167–7176.
[44] Mingsheng Long, Zhangjie Cao, Jianmin Wang, Michael I. Jordan, Conditional adversarial domain adaptation, Advances in Neural Information Processing Systems 2018, pp. 1640–1650.
[45] Shuhao Cui, Shuhui Wang, Junbao Zhuo, Chi Su, Qingming Huang, Qi Tian, Gradually vanishing bridge for adversarial domain adaptation, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, pp. 12455–12464.
[46] Jake Snell, Kevin Swersky, Richard Zemel, Prototypical networks for few-shot learning, Proceedings of the 31st International Conference on Neural Information Processing Systems 2017, pp. 4080–4090.
[47] Yaroslav Ganin, Victor Lempitsky, Unsupervised domain adaptation by backpropagation, International Conference on Machine Learning, PMLR 2015, pp. 1180–1189.
[48] Wenhui Li, Anan Liu, Ngoc-Minh Bui, Yunchi Cen, Huy-Hoang Chung-Nguyen, Zenian Chen, Gia-Han Diep, Trong-Le Do, Eugeni L. Doubrovski, Charlie C.L. Wang, Shijie Wang, Shrec 2019-monocular image based 3d model retrieval, Eurographics 2019 Workshop 3D Object Retrieval 2019, pp. 1–7.
[49] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, pp. 770–778.
[50] Jindong Wang, Wenjie Feng, Yiqiang Chen, Yu Han, Meiyu Huang, Philip S. Yu, Visual domain adaptation with manifold embedded distribution alignment, Proceedings of the 26th ACM International Conference on Multimedia 2018, pp. 402–410.
[51] Jing Zhang, Wanqing Li, Philip Ogunbona, Joint geometrical and statistical alignment for visual domain adaptation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, pp. 1859–1867.
[52] Mingsheng Long, Han Zhu, Jianmin Wang, Michael I. Jordan, Deep transfer learning with joint adaptation networks, International Conference on Machine Learning, PMLR 2017, pp. 2208–2217.
[53] Karen Simonyan, Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv preprint arXiv:1409.1556 2014.
[54] Laurens Van der Maaten, Geoffrey Hinton, Visualizing data using t-sne, J. Mach. Learn. Res. 9 (11) (2008).
[55] Yanjun Ma, Dianhai Yu, Tian Wu, Haifeng Wang, Paddlepaddle: an open-source deep learning platform from industrial practice, Front. Data Computing 1 (1) (2019) 105–115.
[56] PaddlePaddle, Paddlepaddle: An Easy-to-Use, Easy-to-Learn Deep Learning Platform, http://www.paddlepaddle.org/ 2019.
