Unsupervised Embedding Learning via Invariant and Spreading Instance Feature

Mang Ye† Xu Zhang‡ Pong C. Yuen† Shih-Fu Chang‡


†Hong Kong Baptist University, Hong Kong    ‡Columbia University, New York
{mangye,pcyuen}@comp.hkbu.edu.hk, {xu.zhang,sc250}@columbia.edu

Abstract

This paper studies the unsupervised embedding learning problem, which requires an effective similarity measurement between samples in low-dimensional embedding space. Motivated by the positive concentrated and negative separated properties observed from category-wise supervised learning, we propose to utilize the instance-wise supervision to approximate these properties, which aims at learning data augmentation invariant and instance spread-out features. To achieve this goal, we propose a novel instance based softmax embedding method, which directly optimizes the 'real' instance features on top of the softmax function. It achieves significantly faster learning speed and higher accuracy than all existing methods. The proposed method performs well for both seen and unseen testing categories with cosine similarity. It also achieves competitive performance even without pre-trained network over samples from fine-grained categories.

Figure 1: Illustration of our basic idea. The features of the same instance under different data augmentations should be invariant, while features of different image instances should be separated.

1. Introduction

Deep embedding learning is a fundamental task in computer vision [14], which aims at learning a feature embedding that has the following properties: 1) positive concentrated, the embedding features of samples belonging to the same category are close to each other [32]; 2) negative separated, the embedding features of samples belonging to different categories are separated as much as possible [52]. Supervised embedding learning methods have been studied to achieve such objectives and demonstrate impressive capabilities in various vision tasks [28, 30, 53]. However, annotated data needed for supervised methods might be difficult to obtain. Collecting enough annotated data for different tasks requires costly human efforts and special domain expertise. To address this issue, this paper tackles the unsupervised embedding learning problem (a.k.a. unsupervised metric learning in [21]), which aims at learning discriminative embedding features without human annotated labels.

Unsupervised embedding learning usually requires that the similarity between learned embedding features is consistent with the visual similarity or category relations of input images. In comparison, general unsupervised feature learning usually aims at learning a good "intermediate" feature representation from unlabelled data [6, 26, 31, 34]. The learned feature is then generalized to different tasks by using a small set of labelled training data from the target task to fine-tune models (e.g., linear classifier, object detector, etc.) for the target task [3]. However, the learned feature representation may not preserve visual similarity, and its performance drops dramatically for similarity based tasks, e.g., nearest neighbor search [46, 48, 50].

The main challenge of unsupervised embedding learning is to discover visual similarity or weak category information from unlabelled samples. Iscen et al. [21] proposed to mine hard positive and negative samples on manifolds. However, its performance heavily relies on the quality of the initialized feature representation for label mining, which limits the applicability for general tasks. In this paper, we propose to utilize the instance-wise supervision to approximate the positive concentrated and negative separated properties mentioned earlier. The learning process only relies on the instance-wise relationship and does not rely on relations between pre-defined categories, so it can be well generalized to samples of arbitrary categories that have not been seen before (unseen testing categories) [12].

For positive concentration: it is usually infeasible to mine reliable positive information with a randomly initialized network. Therefore, we apply a random data augmentation (e.g., transformation, scaling) to each image instance and use the augmented image as a positive sample. In other words, features of each image instance under different data augmentations should be invariant. For negative separation: since unlabelled data are usually highly imbalanced [27, 49], the number of negative samples for each image instance is much larger than that of positive samples. Therefore, a small batch of randomly selected instances can be approximately treated as negative samples for each instance. With this assumption, we try to separate each instance from all the other sampled instances within the batch, resulting in a spread-out property [52]. It is clear that such an assumption may not always hold, and each batch may contain a few false negatives. However, through our extensive experiments, we observe that the spread-out property effectively improves the discriminability. In summary, our main idea is to learn a discriminative instance feature, which preserves the data augmentation invariant and spread-out properties for unsupervised embedding learning, as shown in Fig. 1.

To achieve these goals, we introduce a novel instance feature-based softmax embedding method. Existing softmax embedding is usually built on classifier weights [8] or memorized features [46], which have limited efficiency and discriminability. We propose to explicitly optimize the feature embedding by directly using the inner products of instance features on top of the softmax function, leading to significant performance and efficiency gains. The softmax function mines hard negative samples and takes full advantage of relationships among all sampled instances to improve the performance. The number of instances is significantly larger than the number of categories, so we introduce a Siamese network training strategy. We transform the multi-class classification problem into a binary classification problem and use maximum likelihood estimation for optimization.

The main contributions can be summarized as follows:

• We propose a novel instance feature-based softmax embedding method to learn data augmentation invariant and instance spread-out features. It achieves significantly faster learning speed and higher accuracy than all the competing methods.

• We show that both the data augmentation invariant and instance spread-out properties are important for instance-wise unsupervised embedding learning. They help capture apparent visual similarity between samples and generalize well on unseen testing categories.

• The proposed method achieves state-of-the-art performance over other unsupervised learning methods in comprehensive image classification and embedding learning experiments.

2. Related Work

General Unsupervised Feature Learning. Unsupervised feature learning has been widely studied in the literature. Existing works can be roughly categorized into three categories [3]: 1) generative models, which aim at learning a parameterized mapping between images and predefined noise signals, constraining the distribution between raw data and noise [46]. Restricted Boltzmann Machines (RBMs) [24, 40], auto-encoders [20, 42] and generative adversarial networks (GAN) [7, 10, 11] are widely studied. 2) Estimating between-image labels, which usually estimates between-image labels with a clustering technique [3, 9, 26] or kNN-based methods [41] to provide label information; the label information and the feature learning process are then iteratively updated. 3) Self-supervised learning, which designs pretext tasks/signals to generate "pseudo-labels" and then formulates a prediction task to learn the feature representations. The pretext task could be the context information of local patches [6], the position of randomly rearranged patches [31], the missing pixels of an image [34] or the color information of gray-scale images [51]. Some attempts also use video information to provide weak supervision to learn feature representations [1, 44].

As we discussed in Section 1, general unsupervised feature learning usually aims at learning a good "intermediate" feature representation that can be well generalized to other tasks. The intermediate feature representation may not preserve visual similarity. In comparison, unsupervised embedding learning additionally requires that the learned features preserve visual similarity.

Deep Embedding Learning. Deep embedding learning usually learns an embedding function by minimizing the intra-class variation and maximizing the inter-class variation [32, 37, 45, 47]. Most methods are designed on top of pairwise [12, 30] or triplet relationships [13, 29]. In particular, several sampling strategies are widely investigated to improve the performance, such as hard mining [16], semi-hard mining [35], smart mining [13] and so on. In comparison, softmax embedding achieves competitive performance without a sampling requirement [18]. Supervised learning has achieved superior performance on various tasks, but it still relies on enough annotated data.

Unsupervised Embedding Learning. According to the evaluation protocol, it can be categorized into two cases: 1) the testing categories are the same as the training categories (seen testing categories), and 2) the testing categories do not overlap with the training categories (unseen testing categories). The latter setting is more challenging. Without category-wise labels, Iscen et al. [21] proposed to mine hard positive and negative samples on manifolds, and then train the feature embedding with a triplet loss. However, it heavily relies on the initialized representation for label mining.

3. Proposed Method

Our goal is to learn a feature embedding network f_θ(·) from a set of unlabelled images X = {x_1, x_2, ..., x_n}. f_θ(·) maps the input image x_i into a low-dimensional embedding feature f_θ(x_i) ∈ R^d, where d is the feature dimension. For simplicity, the feature representation f_θ(x_i) of an image instance is denoted by f_i, and we assume that all the features are ℓ2 normalized, i.e. ‖f_i‖_2 = 1. A good feature embedding should satisfy: 1) the embedding features of visually similar images are close to each other; 2) the embedding features of dissimilar image instances are separated.

Without category-wise labels, we utilize the instance-wise supervision to approximate the positive concentrated and negative separated properties. In particular, the embedding features of the same instance under different data augmentations should be invariant, while the features of different instances should be spread-out. In the rest of this section, we first review two existing instance-wise feature learning methods, then propose a much more efficient and discriminative instance feature-based softmax embedding. Finally, we give a detailed rationale analysis and introduce our training strategy with a Siamese network.

3.1. Instance-wise Softmax Embedding

Softmax Embedding with Classifier Weights. Exemplar CNN [8] treats each image as a distinct class. Following the conventional classifier training, it defines a matrix W = [w_1, w_2, ..., w_n]^T ∈ R^{n×d}, where w_j is the classifier weight for the j-th instance. Exemplar CNN ensures that an image instance under different image transformations can be correctly classified into its original instance with the learned weights. Based on the softmax function, the probability of sample x_j being recognized as the i-th instance can be represented as

P(i|x_j) = \frac{\exp(w_i^T f_j)}{\sum_{k=1}^{n} \exp(w_k^T f_j)}.    (1)

At each step, the network pulls the sample feature f_i towards its corresponding weight w_i, and pushes it away from the classifier weights w_k of other instances. However, classifier weights prevent explicit comparison over features, which results in limited efficiency and discriminability.

Softmax Embedding with Memory Bank. To improve the inferior efficiency, Wu et al. [46] propose to set up a memory bank to store the instance features f_i calculated in the previous step. The feature stored in the memory bank is denoted as v_i, which serves as the classifier weight for the corresponding instance in the following step. Therefore, the probability of sample x_j being recognized as the i-th instance can be written as

P(i|x_j) = \frac{\exp(v_i^T f_j / \tau)}{\sum_{k=1}^{n} \exp(v_k^T f_j / \tau)},    (2)

where τ is the temperature parameter controlling the concentration level of the sample distribution [17]. v_i^T f_j measures the cosine similarity between the feature f_j and the i-th memorized feature v_i. For instance x_i, at each step the network pulls its feature f_i towards its corresponding memorized vector v_i, and pushes it away from the memorized vectors of other instances. Due to efficiency issues, the memorized feature v_i corresponding to instance x_i is only updated in the iteration which takes x_i as input. In other words, the memorized feature v_i is only updated once per epoch. However, the network itself is updated in each iteration. Comparing the real-time instance feature f_i with the outdated memorized feature v_i would hamper the training process. Thus, the memory bank scheme is still inefficient.

A straightforward idea to improve the efficiency is to directly optimize over the feature itself, i.e. replacing the weights {w_i} or memory {v_i} with f_i. However, this is infeasible for two reasons: 1) Considering the probability P(i|x_i) of recognizing x_i as itself, since f_i^T f_i = 1, i.e. the feature and the "pseudo classifier weight" (the feature itself) are always perfectly aligned, optimizing the network will not provide any positive concentrated property; 2) It is impractical to calculate the features of all the samples (f_k, k = 1, ..., n) on-the-fly in order to calculate the denominator in Eq. (2), especially for datasets with a large number of instances.

3.2. Softmax Embedding on 'Real' Instance Feature

To address the above issues, we propose a softmax embedding variant for unsupervised embedding learning, which directly optimizes the real instance features rather than classifier weights [8] or a memory bank [46]. To achieve the goal that features of the same instance under different data augmentations are invariant, while the features of different instances are spread-out, we propose to consider 1) both the original image and its augmented image, and 2) a small batch of randomly selected samples instead of the full dataset.

For each iteration, we randomly sample m instances from the dataset. To simplify the notation, without loss of generality, the selected samples are denoted by {x_1, x_2, ..., x_m}. For each instance, a random data augmentation operation T(·) is applied to slightly modify the original image. The augmented sample T(x_i) is denoted by x̂_i, and its embedding feature f_θ(x̂_i) is denoted by f̂_i. Instead of considering the instance feature learning as a multi-class classification problem, we solve it as a binary classification problem via maximum likelihood estimation (MLE). In particular, for instance x_i, the augmented sample x̂_i should be classified into instance i, and the other instances x_j, j ≠ i, should not be classified into instance i. The probability of x̂_i being recognized as instance i is defined by

P(i|x̂_i) = \frac{\exp(f_i^T \hat{f}_i / \tau)}{\sum_{k=1}^{m} \exp(f_k^T \hat{f}_i / \tau)}.    (3)

Figure 2: The framework of the proposed unsupervised learning method with Siamese network. The input images are projected into low-dimensional normalized embedding features with the CNN backbone. Image features of the same image instance with different data augmentations are invariant, while embedding features of different image instances are spread-out.

On the other hand, the probability of x_j being recognized as instance i is defined by

P(i|x_j) = \frac{\exp(f_i^T f_j / \tau)}{\sum_{k=1}^{m} \exp(f_k^T f_j / \tau)}, \quad j \neq i.    (4)

Correspondingly, the probability of x_j not being recognized as instance i is 1 − P(i|x_j).

Assuming that the events of different instances being recognized as instance i are independent, the joint probability of x̂_i being recognized as instance i and x_j, j ≠ i, not being classified into instance i is

P_i = P(i|\hat{x}_i) \prod_{j \neq i} (1 - P(i|x_j)).    (5)

The negative log likelihood is given by

J_i = -\log P(i|\hat{x}_i) - \sum_{j \neq i} \log(1 - P(i|x_j)).    (6)

We solve this problem by minimizing the sum of the negative log likelihood over all the instances within the batch, which is denoted by

J = -\sum_i \log P(i|\hat{x}_i) - \sum_i \sum_{j \neq i} \log(1 - P(i|x_j)).    (7)

3.3. Rationale Analysis

This section gives a detailed rationale analysis of why minimizing Eq. (6) achieves the augmentation invariant and instance spread-out feature. Minimizing Eq. (6) can be viewed as maximizing Eq. (3) and minimizing Eq. (4). Considering Eq. (3), it can be rewritten as

P(i|\hat{x}_i) = \frac{\exp(f_i^T \hat{f}_i / \tau)}{\exp(f_i^T \hat{f}_i / \tau) + \sum_{k \neq i} \exp(f_k^T \hat{f}_i / \tau)}.    (8)

Maximizing Eq. (3) requires maximizing exp(f_i^T f̂_i / τ) and minimizing exp(f_k^T f̂_i / τ), k ≠ i. Since all the features are ℓ2 normalized, maximizing exp(f_i^T f̂_i / τ) requires increasing the inner product (cosine similarity) between f_i and f̂_i, resulting in a feature that is invariant to data augmentation. On the other hand, minimizing exp(f_k^T f̂_i / τ) ensures that f̂_i and the other instances {f_k} are separated. Considering all the instances within the batch, the instances are forced to be separated from each other, resulting in the spread-out property.

Similarly, Eq. (4) can be rewritten as

P(i|x_j) = \frac{\exp(f_i^T f_j / \tau)}{\exp(f_j^T f_j / \tau) + \sum_{k \neq j} \exp(f_k^T f_j / \tau)}.    (9)

Note that the inner product f_j^T f_j is 1 and the value of τ is generally small (0.1 in our experiments). Therefore, exp(f_j^T f_j / τ) generally dominates the value of the whole denominator. Minimizing Eq. (4) means that exp(f_i^T f_j / τ) should be minimized, which aims at separating f_j from f_i. Thus, it further enhances the spread-out property.

3.4. Training with Siamese Network

We propose a Siamese network to implement the proposed algorithm, as shown in Fig. 2. At each iteration, m randomly selected image instances are fed into the first branch, and the corresponding augmented samples are fed into the second branch. Note that data augmentation is also used in the first branch to enrich the training samples. For implementation, each sample has one randomly augmented positive sample and 2N − 2 negative samples to compute Eq. (7), where N is the batch size. The proposed training strategy greatly reduces the computational cost. Meanwhile, this training strategy also takes full advantage of relationships among all instances sampled in a mini-batch [32]. Theoretically, we could also use a multi-branch network by considering multiple augmented images for each instance in the batch.
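To make the training objective concrete, the sketch below writes the batch loss of Eqs. (3)–(7) in PyTorch, the framework used in our experiments. It assumes two ℓ2-normalized feature matrices, f for the original images and f_hat for their augmented counterparts, produced by the two weight-sharing branches; the function and variable names are illustrative rather than the released code, and the sketch follows Eqs. (3)–(7) literally, so extending it to the 2N − 2 negatives of Sec. 3.4 amounts to concatenating the two feature matrices before computing the similarities.

```python
import torch
import torch.nn.functional as F

def instance_softmax_loss(f, f_hat, tau=0.1):
    """Sketch of the instance feature-based softmax embedding loss (Eqs. 3-7).

    f, f_hat: (m, d) l2-normalized embeddings of the m sampled instances and
    of their randomly augmented versions. tau is the temperature parameter.
    """
    m = f.size(0)
    eps = 1e-7  # numerical safety for log(); an implementation detail

    # Eq. (3): row i of `sim_aug` holds f_k^T f_hat_i / tau for all k, so the
    # softmax diagonal is P(i | x_hat_i), the augmentation-invariance term.
    sim_aug = torch.mm(f_hat, f.t()) / tau            # (m, m)
    p_pos = F.softmax(sim_aug, dim=1).diagonal()      # P(i | x_hat_i)

    # Eq. (4): similarities among the original features give P(i | x_j);
    # the off-diagonal entries are the spread-out (negative) terms.
    sim = torch.mm(f, f.t()) / tau                    # (m, m)
    p = F.softmax(sim, dim=1)                         # p[j, i] = P(i | x_j)
    off_diag = ~torch.eye(m, dtype=torch.bool, device=f.device)

    # Eqs. (5)-(7): negative log-likelihood summed over the batch.
    loss_pos = -torch.log(p_pos + eps).sum()
    loss_neg = -torch.log(1.0 - p[off_diag] + eps).sum()
    return (loss_pos + loss_neg) / m
```

Since the two branches share weights, f and f_hat are simply two forward passes of f_θ(·) on the original and augmented views of the same mini-batch.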

Table 1: kNN accuracy (%) on the CIFAR-10 dataset.

Methods                  kNN
RandomCNN                32.1
DeepCluster (10) [3]     44.4
DeepCluster (1000) [3]   67.6
Exemplar [8]             74.5
NPSoftmax [46]           80.8
NCE [46]                 80.4
Triplet                  57.5
Triplet (Hard)           78.4
Ours                     83.6

Figure 3: Evaluation of the training efficiency on the CIFAR-10 dataset. kNN accuracy (%) at each epoch is reported, demonstrating the learning speed of different methods (Ours, DeepCluster [3], NCE [46], Exemplar [8]).
4. Experimental Results

We have conducted experiments with two different settings to evaluate the proposed method¹. The first setting is that the training and testing sets share the same categories (seen testing categories). This protocol is widely adopted for general unsupervised feature learning. The second setting is that the training and testing sets do not share any common categories (unseen testing categories). This setting is usually used for supervised embedding learning [32]. Following [21], we do not use any semantic label in the training set. The latter setting is more challenging than the former and better demonstrates the quality of the learned features on unseen categories.

¹Code is available at https://github.com/mangye16/Unsupervised_Embedding_Learning

4.1. Experiments on Seen Testing Categories

We follow the experimental settings in [46] to conduct the experiments on the CIFAR-10 [23] and STL-10 [4] datasets, where the training and testing sets share the same categories. Specifically, the ResNet18 network [15] is adopted as the backbone and the output embedding feature dimension is set to 128. The initial learning rate is set to 0.03, and it is decayed by 0.1 and 0.01 at epochs 120 and 160. The network is trained for 200 epochs. The temperature parameter τ is set to 0.1. The algorithm is implemented in PyTorch with the SGD optimizer with momentum. The weight decay parameter is 5 × 10⁻⁴ and the momentum is 0.9. The training batch size is set to 128 for all competing methods on both datasets. Four kinds of data augmentation methods (RandomResizedCrop, RandomGrayscale, ColorJitter, RandomHorizontalFlip) in PyTorch with default parameters are adopted.
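For readers reproducing this setup, a plausible torchvision composition of the four listed augmentations is sketched below; the jitter strengths, grayscale probability and crop size are illustrative assumptions, not values specified in the paper.

```python
import torchvision.transforms as T

# The four augmentations listed above, composed for 32x32 CIFAR-10 images.
# Parameter values here are illustrative; the released code may differ.
train_transform = T.Compose([
    T.RandomResizedCrop(32),              # random crop then resize back to 32x32
    T.RandomGrayscale(p=0.2),             # probability is an assumption
    T.ColorJitter(0.4, 0.4, 0.4, 0.4),    # jitter strengths are assumptions
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
```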
Following [46], we adopt a weighted kNN classifier to evaluate the performance. Given a test sample, we retrieve its top-k (k = 200) nearest neighbors based on cosine similarity, then apply weighted voting to predict its label [46].
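A minimal sketch of this evaluation protocol is given below. The exponential weighting of the neighbor votes follows the protocol of [46]; treat the exact weighting function and the helper's name as assumptions rather than something this paper specifies.

```python
import torch

def weighted_knn_predict(query, train_feats, train_labels,
                         k=200, num_classes=10, tau=0.1):
    """Weighted kNN prediction with cosine similarity (sketch).

    query: (d,) l2-normalized test feature. train_feats: (n, d) l2-normalized
    training features. train_labels: (n,) integer class labels.
    """
    sims = train_feats @ query                 # cosine similarities, shape (n,)
    topk_sims, topk_idx = sims.topk(k)         # k nearest neighbors
    weights = (topk_sims / tau).exp()          # weight each neighbor's vote
    votes = torch.zeros(num_classes)
    votes.index_add_(0, train_labels[topk_idx], weights)
    return votes.argmax().item()               # predicted class label
```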
4.1.1 CIFAR-10 Dataset

The CIFAR-10 dataset [23] contains 50K training images and 10K testing images from the same ten classes. The image size is 32 × 32. Five methods are included for comparison: DeepCluster [3] with different cluster numbers, Exemplar CNN [8], NPSoftmax [46], NCE [46], and Triplet loss with and without hard mining. Triplet (hard) uses the online hard negative sample within each batch for training [16], and the margin parameter is set to 0.5. DeepCluster [3] and NCE [46] represent the state-of-the-art unsupervised feature learning methods. The results are shown in Table 1.

Classification Accuracy. Table 1 demonstrates that our proposed method achieves the best performance (83.6%) with the kNN classifier. DeepCluster [3] performs well in learning good "intermediate" features with large-scale unlabelled data, but its performance with kNN classification drops dramatically. Meanwhile, it is also quite sensitive to the number of clusters, which makes it unsuitable for different tasks. Compared to Exemplar CNN [8], which uses classifier weights for training, the proposed method outperforms it by 9.1%. Compared to NPSoftmax [46] and NCE [46], which use memorized features for optimization, the proposed method outperforms them by 2.8% and 3.2%, respectively. The performance improvement clearly comes from directly performing optimization over the feature itself. Compared to the triplet loss, the proposed method also outperforms it by a clear margin. The superiority is due to the hard-mining nature of the softmax function.

Efficiency. We plot the learning curves of the competing methods at different epochs in Fig. 3. The proposed method takes only 2 epochs to reach a kNN accuracy of 60%, while [46] takes 25 epochs and [8] takes 45 epochs to reach the same accuracy. It is obvious that our learning speed is much faster than that of the competitors. The efficiency is guaranteed by direct optimization on instance features rather than classifier weights [8] or a memory bank [46].

Table 2: Classification accuracy (%) with linear classifier and kNN classifier on the STL-10 dataset. *Results are taken from [33]; the baseline network is different.

Methods               Training   Linear   kNN
RandomCNN             None       -        22.4
k-MeansNet* [5]       105K       60.1     -
HMP* [2]              105K       64.5     -
Stacked* [54]         105K       74.3     -
Exemplar* [8]         105K       75.4     -
NPSoftmax [46]        5K         62.3     66.8
NCE [46]              5K         61.9     66.2
DeepCluster(100) [3]  5K         56.5     61.2
Ours                  5K         69.5     74.1
Ours                  105K       77.9     81.6
4.1.2 STL-10 Dataset

The STL-10 dataset [4] is an image recognition dataset with colored images of size 96 × 96, which is widely used in unsupervised learning. Specifically, this dataset is originally designed with three splits: 1) train, 5K labelled images in ten classes for training; 2) test, 8K images from the same ten classes for testing; 3) unlabelled, 100K unlabelled images which share a similar distribution with the labelled data for unsupervised learning. We follow the same experimental setting as for the CIFAR-10 dataset and report classification accuracy (%) with both a Linear Classifier (Linear) and a kNN classifier (kNN) in Table 2. Linear classifier means training an SVM classifier on the learned features and the labels of the training samples; the classifier is then used to predict the labels of the test samples. We implement NPSoftmax [46], NCE [46] and DeepCluster [3] (cluster number 100) under the same settings with their released code. By default, we only use the 5K training images, without labels, for training. The performances of some state-of-the-art unsupervised methods (k-MeansNet [5], HMP [2], Stacked [54] and Exemplar [8]) are also reported. Those results are taken from [33].
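The linear evaluation described above can be sketched as follows with scikit-learn; the use of LinearSVC and its regularization constant are assumptions for illustration, since the paper only states that an SVM is trained on the frozen features.

```python
from sklearn.svm import LinearSVC

# Linear evaluation on frozen embeddings (sketch): `train_feats`/`test_feats`
# are the l2-normalized features extracted by the trained, frozen network.
clf = LinearSVC(C=1.0)                      # C is an illustrative choice
clf.fit(train_feats, train_labels)
linear_accuracy = clf.score(test_feats, test_labels)
```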
As shown in Table 2, when only using the 5K training images for learning, the proposed method achieves the best accuracy with both classifiers (kNN: 74.1%, Linear: 69.5%), which is much better than NCE [46] and DeepCluster [3] under the same evaluation protocol. Note that kNN measures the similarity directly with the learned features, while Linear requires additional classifier learning with the labelled training data. When 105K images are used for training, the proposed method also achieves the best performance for both the kNN classifier and the linear classifier. In particular, the kNN accuracy is 74.1% for 5K training images, and it increases to 81.6% for the full 105K training images. The classification accuracy with the linear classifier also increases from 69.5% to 77.9%. This experiment verifies that the proposed method can benefit from more training samples.

4.2. Experiments on Unseen Testing Categories

This section evaluates the discriminability of the learned feature embedding when the semantic categories of the training samples and testing samples do not overlap. We follow the experimental settings described in [32] to conduct experiments on the CUB200-2011 (CUB200) [43], Stanford Online Product (Product) [32] and Car196 [22] datasets. No semantic label is used for training. Caltech-UCSD Birds-200 (CUB200) [43] is a fine-grained bird dataset. Following [32], the first 100 categories with 5,864 images are used for training, while the other 100 categories with 5,924 images are used for testing. Stanford Online Product (Product) [32] is a large-scale fine-grained product dataset. Similarly, 11,318 categories with a total of 59,551 images are used for training, while the other 11,316 categories with 60,502 images are used for testing. The Cars (Car196) dataset [22] is a fine-grained car category dataset. The first 98 categories with 8,054 images are used for training, while the other 98 categories with 8,131 images are used for testing.

Table 3: Results (%) on the CUB200 dataset.

Methods           R@1    R@2    R@4    R@8    NMI
Initial (FC)      39.2   52.1   66.1   78.2   51.4
Supervised Learning
Lifted [32]       43.6   56.6   68.6   79.6   56.5
Clustering [38]   48.2   61.4   71.8   81.9   59.2
Triplet+ [13]     45.9   57.7   69.6   79.8   58.1
Smart+ [13]       49.8   62.3   74.1   83.3   59.9
Unsupervised Learning
Cyclic [25]       40.8   52.8   65.1   76.0   52.6
Exemplar [8]      38.2   50.3   62.8   75.0   45.0
NCE [46]          39.2   51.4   63.7   75.8   45.1
DeepCluster [3]   42.9   54.1   65.6   76.2   53.0
MOM [21]          45.3   57.8   68.6   78.4   55.0
Ours              46.2   59.0   70.1   80.2   55.4

Table 4: Results (%) on the Product dataset.

Methods           R@1    R@10   R@100  NMI
Initial (FC)      40.8   56.7   72.1   84.0
Exemplar [8]      45.0   60.3   75.2   85.0
NCE [46]          46.6   62.3   76.8   85.8
DeepCluster [3]   34.6   52.6   66.8   82.8
MOM [21]          43.3   57.2   73.2   84.4
Ours              48.9   64.0   78.0   86.0

Implementation Details. We implement the proposed method in PyTorch. The pre-trained Inception-V1 [39] on ImageNet is used as the backbone network following existing methods [30, 32, 37]. A 128-dim fully connected layer with ℓ2 normalization is added after the pool5 layer as the feature embedding layer. All the input images are first resized to 256 × 256. For data augmentation, the images are randomly cropped at size 227 × 227 with random horizontal flipping, following [21, 30]. Since the pre-trained network performs well on the CUB200 dataset, we randomly select the augmented instance and its corresponding nearest instance as positives. In the testing phase, a single center-cropped image is adopted for fine-grained recognition as in [30]. We adopt the SGD optimizer with 0.9 momentum. The initial learning rate is set to 0.001 without decay. The temperature parameter τ is set to 0.1. The training batch size is set to 64.
Table 5: Results (%) on the Car196 dataset.

Methods           R@1    R@2    R@4    R@8    NMI
Initial (FC)      35.1   47.4   60.0   72.0   38.3
Exemplar [8]      36.5   48.1   59.2   71.0   35.4
NCE [46]          37.5   48.7   59.8   71.5   35.6
DeepCluster [3]   32.6   43.8   57.0   69.5   38.5
MOM [21]          35.5   48.2   60.6   72.4   38.6
Ours              41.3   52.3   63.6   74.9   35.8

Table 6: Results (%) on the Product dataset using a network without pre-trained parameters.

Methods           R@1    R@10   R@100  NMI
Random            18.4   29.4   46.0   79.8
Exemplar [8]      31.5   46.7   64.2   82.9
NCE [46]          34.4   49.0   65.2   84.1
MOM [21]          16.3   27.6   44.5   80.6
Ours              39.7   54.9   71.0   84.7

Evaluation Metrics. Following existing works on supervised deep embedding learning [13, 32], the retrieval performance and clustering quality of the testing set are evaluated. Cosine similarity is adopted for similarity measurement. Given a query image from the testing set, R@K measures the probability of any correct match (with the same category label) occurring in the top-K retrieved ranking list [32]. The average score is reported over all testing samples. Normalized Mutual Information (NMI) [36] is utilized to measure the clustering performance of the testing set.
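As a concrete reference for the retrieval metric, the following is a minimal sketch of R@K computed over the whole test set with cosine similarity; the function and variable names are illustrative.

```python
import torch

def recall_at_k(feats, labels, ks=(1, 2, 4, 8)):
    """R@K (sketch): a query counts as correct if any of its top-K cosine
    neighbors (excluding itself) shares its category label."""
    sims = feats @ feats.t()                        # (n, n) cosine similarities
    sims.fill_diagonal_(-2.0)                       # exclude the query itself
    knn_idx = sims.topk(max(ks), dim=1).indices     # (n, max_k) neighbor ids
    match = labels[knn_idx] == labels.unsqueeze(1)  # (n, max_k) label matches
    return {k: match[:, :k].any(dim=1).float().mean().item() for k in ks}
```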
Comparison to the State-of-the-art. The results of all the competing methods on the three datasets are listed in Tables 3, 4 and 5, respectively. MOM [21] is the only method that explicitly targets unsupervised metric learning. We implement the other three state-of-the-art unsupervised methods (Exemplar [8], NCE [46] and DeepCluster [3]) on the three datasets with their released code under the same setting for a fair comparison. Note that these methods were originally evaluated for general unsupervised feature learning, where the training and testing sets share the same categories. We also list some results of supervised learning (taken from [21]) on the CUB200 dataset in Table 3.

Generally, the instance-wise feature learning methods (NCE [46], Exemplar [8], Ours) outperform the non-instance-wise feature learning methods (DeepCluster [3], MOM [21]), especially on the Car196 and Product datasets, which indicates that instance-wise feature learning methods have good generalization ability on unseen testing categories. Among all the instance-wise feature learning methods, the proposed method is the clear winner, which also verifies the effectiveness of directly optimizing over the feature itself. Moreover, the proposed unsupervised learning method is even competitive with some supervised learning methods on the CUB200 dataset.

Qualitative Results. Some retrieved examples with cosine similarity on the CUB200 dataset at different training epochs are shown in Fig. 4. The proposed algorithm iteratively improves the quality of the learned feature and retrieves more correct images. Although there are some wrongly retrieved samples from other categories, most of the top retrieved samples are visually similar to the query.

Training from Scratch. We also evaluate the performance using a network (ResNet18) without pre-training. The results on the large-scale Product dataset are shown in Table 6. The proposed method is again a clear winner. Interestingly, MOM [21] fails in this experiment. The main reason is that the features from a randomly initialized network provide limited information for label mining. Therefore, MOM cannot estimate reliable labels for training.

4.3. Ablation Study

The proposed method imposes two important properties for instance feature learning: data augmentation invariance and instance spread-out. We conduct an ablation study on the CIFAR-10 dataset to show the effectiveness of each property.

Table 7: Effects of each data augmentation operation on the CIFAR-10 dataset. 'w/o': without. 'R': RandomResizedCrop, 'G': RandomGrayscale, 'C': ColorJitter, 'F': RandomHorizontalFlip.

Strategy       Full    w/o R   w/o G   w/o C   w/o F
kNN Acc (%)    83.6    56.2    79.3    75.7    82.6

Table 8: Different sampling strategies on the CIFAR-10 dataset.

Strategy       Full    No DA   Hard    Easy
kNN Acc (%)    83.6    37.4    83.2    57.5

To show the importance of the data augmentation invariance property, we first evaluate the performance after removing each operation in turn from the data augmentation set. The results are shown in Table 7. We observe that all listed operations contribute to the remarkable performance gain achieved by the proposed algorithm. In particular, RandomResizedCrop contributes the most. We also evaluate the performance without data augmentation (No DA) in Table 8, and the performance drops significantly, from 83.6% to 37.4%. This is because, when training without data augmentation, the network does not create any positive concentration property, and the features of visually similar images are falsely separated.

To show the importance of the spread-out property, we evaluated two different strategies for choosing negative samples: 1) selecting the top 50% of instance features that are most similar to the query instance as negatives (hard negatives); 2) selecting the bottom 50% of instance features that are most similar to the query instance as negatives (easy negatives). The results are shown as "Hard" and "Easy" in Table 8. The performance drops dramatically when only using the easy negatives. In comparison, the performance remains almost the same as the full model when only using the hard negatives. This shows that separating hard negative instances helps to improve the discriminability of the learned embedding.
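To make the two ablation strategies concrete, the sketch below shows one way to pick within-batch negatives for a single query from its cosine similarities; the 50% split mirrors Table 8, and everything else (names, the argsort-based selection) is illustrative.

```python
import torch

def select_negatives(sim_row, strategy="hard"):
    """Pick negative indices for one query from its within-batch similarities.

    sim_row: (m,) cosine similarities to the other instances in the batch.
    'hard' keeps the top 50% most similar instances, 'easy' the bottom 50%,
    mirroring the two strategies compared in Table 8.
    """
    half = sim_row.numel() // 2
    order = sim_row.argsort(descending=True)        # most similar first
    return order[:half] if strategy == "hard" else order[-half:]
```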

Figure 4: 4NN retrieval results of some example queries on the CUB200-2011 dataset (columns: Query, Epoch 0, Epoch 1, Epoch 2). The positive (negative) retrieved results are framed in green (red). The similarity is measured with cosine similarity.

4.4. Understanding of the Learned Embedding

We calculate the cosine similarity between the query feature and its 5NN features from the same category (Positive) as well as its 5NN features from different categories (Negative). The distributions of the cosine similarity for the different methods are shown in Fig. 5. A more separable distribution indicates a better feature embedding. It shows that the proposed method performs best at separating positive and negative samples. We can also observe that our learned feature preserves the best spread-out property.

Figure 5: The cosine similarity distributions on CIFAR-10 [23]. (Panels: (a) Random Network, (b) NCE [46], (c) Exemplar [8], (d) Ours.)

Figure 6: The cosine similarity distributions of a randomly initialized network (left column) and our learned model (right column) with different attributes on CIFAR-10 [23]. (Panels: (a) Attribute "animals vs artifacts", (b) Attribute "big vs small shape animal".)

It is interesting to show how the learned instance-wise feature helps category label prediction. We report the cosine similarity distributions based on other category definitions (attributes in [19]) instead of the semantic label in Fig. 6. The distributions clearly show that the proposed method also performs well at separating these other attributes, which demonstrates the generalization ability of the learned feature.

5. Conclusion

In this paper, we propose to address the unsupervised embedding learning problem by learning a data augmentation invariant and instance spread-out feature. In particular, we propose a novel instance feature-based softmax embedding trained with a Siamese network, which explicitly pulls the features of the same instance under different data augmentations close together and pushes the features of different instances away. Comprehensive experiments show that directly optimizing over the instance features leads to significant performance and efficiency gains. We empirically show that the spread-out property is particularly important and that it helps capture the visual similarity among samples.

Acknowledgement

This work is partially supported by the Research Grants Council (RGC/HKBU12200518), Hong Kong. This work is partially supported by the United States Air Force Research Laboratory (AFRL) and the Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-16-C-0166. Any opinions, findings and conclusions or recommendations expressed in this material are solely the responsibility of the authors and do not necessarily represent the official views of AFRL, DARPA, or the U.S. Government.
6217
References

[1] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In ICCV, pages 37–45, 2015.
[2] Liefeng Bo, Xiaofeng Ren, and Dieter Fox. Unsupervised feature learning for RGB-D based object recognition. In Experimental Robotics, pages 387–402. Springer, 2013.
[3] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, pages 132–149, 2018.
[4] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, pages 215–223, 2011.
[5] Adam Coates and Andrew Y Ng. Selecting receptive fields in deep networks. In NIPS, pages 2528–2536, 2011.
[6] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.
[7] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[8] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. PAMI, 38(9):1734–1747, 2016.
[9] Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, pages 766–774, 2014.
[10] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
[12] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
[13] Ben Harwood, BG Kumar, Gustavo Carneiro, Ian Reid, Tom Drummond, et al. Smart mining for deep metric learning. In ICCV, pages 2821–2829, 2017.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pages 1026–1034, 2015.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[16] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[17] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[18] Shota Horiguchi, Daiki Ikami, and Kiyoharu Aizawa. Significance of softmax-based features in comparison to distance metric learning-based features. arXiv preprint arXiv:1712.10151, 2017.
[19] Chen Huang, Chen Change Loy, and Xiaoou Tang. Unsupervised learning of discriminative attributes and visual representations. In CVPR, pages 5175–5184, 2016.
[20] Fu Jie Huang, Y-Lan Boureau, Yann LeCun, et al. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, pages 1–8, 2007.
[21] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Mining on manifolds: Metric learning without labels. 2018.
[22] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In ICCVW, pages 554–561, 2013.
[23] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[24] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pages 609–616, 2009.
[25] Dong Li, Wei-Chih Hung, Jia-Bin Huang, Shengjin Wang, Narendra Ahuja, and Ming-Hsuan Yang. Unsupervised visual representation learning by graph-based consistent constraints. In ECCV, pages 678–694, 2016.
[26] Renjie Liao, Alex Schwing, Richard Zemel, and Raquel Urtasun. Learning deep parsimonious representations. In NIPS, pages 5076–5084, 2016.
[27] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE TPAMI, 38(3):447–461, 2016.
[28] Chaochao Lu and Xiaoou Tang. Surpassing human-level face verification performance on LFW with GaussianFace. In AAAI, pages 3811–3819, 2015.
[29] R Manmatha, Chao-Yuan Wu, Alexander J Smola, and Philipp Krähenbühl. Sampling matters in deep embedding learning. In ICCV, pages 2859–2867, 2017.
[30] Yair Movshovitz-Attias, Alexander Toshev, Thomas K Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In ICCV, pages 360–368, 2017.
[31] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84, 2016.
[32] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, pages 4004–4012, 2016.
[33] Edouard Oyallon, Eugene Belilovsky, and Sergey Zagoruyko. Scaling the scattering transform: Deep hybrid networks. In ICCV, pages 5619–5628, 2017.
[34] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.
[35] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
[36] Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan. Introduction to Information Retrieval, volume 39. Cambridge University Press, 2008.
[37] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, pages 1857–1865, 2016.
[38] Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy. Deep metric learning via facility location. In CVPR, pages 2206–2214, 2017.
[39] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
[40] Yichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. Robust Boltzmann machines for recognition and denoising. In CVPR, pages 2264–2271, 2012.
[41] Daniel Tarlow, Kevin Swersky, Laurent Charlin, Ilya Sutskever, and Rich Zemel. Stochastic k-neighborhood selection for supervised and unsupervised learning. In ICML, pages 199–207, 2013.
[42] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096–1103, 2008.
[43] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[44] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, pages 2794–2802, 2015.
[45] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pages 499–515, 2016.
[46] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pages 3733–3742, 2018.
[47] Tong Xiao, Hongsheng Li, Wanli Ouyang, and Xiaogang Wang. Learning deep feature representations with domain guided dropout for person re-identification. In CVPR, pages 1249–1258, 2016.
[48] Mang Ye, Xiangyuan Lan, and Pong C. Yuen. Robust anchor embedding for unsupervised video person re-identification in the wild. In ECCV, pages 170–186, 2018.
[49] Mang Ye, Jiawei Li, Andy J Ma, Liang Zheng, and Pong C. Yuen. Dynamic graph co-matching for unsupervised video-based person re-identification. In IEEE Transactions on Image Processing (TIP), 2019.
[50] Mang Ye, Andy J Ma, Liang Zheng, Jiawei Li, and Pong C. Yuen. Dynamic label graph matching for unsupervised video re-identification. In ICCV, pages 5142–5150, 2017.
[51] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, pages 649–666, 2016.
[52] Xu Zhang, Felix X Yu, Sanjiv Kumar, and Shih-Fu Chang. Learning spread-out local feature descriptors. In ICCV, pages 4605–4613, 2017.
[53] Xuan Zhang, Hao Luo, Xing Fan, Weilai Xiang, Yixiao Sun, Qiqi Xiao, Wei Jiang, Chi Zhang, and Jian Sun. AlignedReID: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184, 2017.
[54] J Zhao, M Mathieu, R Goroshin, and Y LeCun. Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351.