Unsupervised Pre-Training of Image Features On Non-Curated Data
Abstract

Pre-training general-purpose visual features with convolutional neural networks without relying on annotations is a challenging and important task. Most recent efforts in unsupervised feature learning have focused on either small or highly curated datasets like ImageNet, whereas using non-curated raw datasets was found to decrease the feature quality when evaluated on a transfer task. Our goal is to bridge the performance gap between unsupervised methods trained on curated data, which are costly to obtain, and massive raw datasets that are easily available. To that end, we propose a new unsupervised approach which leverages self-supervision and clustering to capture complementary statistics from large-scale data. We validate our approach on 96 million images from YFCC100M [42], achieving state-of-the-art results among unsupervised methods on standard benchmarks, which confirms the potential of unsupervised learning when only non-curated raw data are available. We also show that pre-training a supervised VGG-16 with our method achieves 74.9% top-1 classification accuracy on the validation set of ImageNet, which is an improvement of +0.8% over the same network trained from scratch. Our code is available at https://github.com/facebookresearch/DeeperCluster.

Figure 1: Influence of the amount of data (left) and of the number of clusters (right) on the feature quality, for DeeperCluster and RotNet trained on either YFCC100M or ImageNet. We report validation mAP on the Pascal VOC classification task (FC6-8 setting).

1. Introduction

Pre-trained convolutional neural networks, or convnets, are important components of image recognition applications [7, 8, 38, 46]. They improve the generalization of models trained on a limited amount of data [39] and speed up training on applications where annotated data is abundant [20]. Convnets produce good generic representations when they are pre-trained on large supervised datasets like ImageNet [11]. However, designing such fully-annotated datasets has required a significant effort from the research community in terms of data cleansing and manual labeling. Scaling up the annotation process to datasets that are orders of magnitude bigger raises important difficulties. Using raw metadata as an alternative has been shown to perform comparatively well [23, 41], even surpassing ImageNet pre-training when trained on billions of images [30]. However, metadata are not always available, and when they are, they do not necessarily cover the full extent of a dataset. These difficulties motivate the design of methods that learn transferable features without using any annotation.

Recent works describing unsupervised approaches have reported performances that are closing the gap with their supervised counterparts [6, 15, 51]. However, the best performing unsupervised methods are trained on ImageNet, a curated dataset made of carefully selected images that form well-balanced and diversified classes [11]. Simply discarding the labels does not undo this careful selection, as it only removes part of the human supervision. Because of that, previous works that have experimented with non-curated raw data report a degradation of the quality of features [6, 12]. In this work, we aim at learning good visual representations from unlabeled and non-curated datasets. We focus on the YFCC100M dataset [42], which contains 99 million images from the Flickr photo-sharing website. This dataset is unbalanced, with a "long-tail" distribution of hashtags contrasting with the well-behaved label distribution of ImageNet (see Appendix).
For example, guenon and baseball correspond to labels with 1300 associated images in ImageNet, while there are respectively 226 and 256,758 images associated with these hashtags in YFCC100M. Our goal is to understand if trading manually-curated data for scale leads to an improvement in the feature quality.

We propose a new unsupervised approach specifically designed to leverage large amounts of raw data. Indeed, training on large-scale non-curated data requires (i) model complexity that increases with the dataset size and (ii) model stability to data distribution changes. A simple yet effective solution is to combine methods from two domains of unsupervised learning: clustering and self-supervision. Since clustering methods, like DeepCluster [6], build supervision from inter-image similarities, the task at hand becomes inherently more complex when the number of images increases. In addition, DeepCluster captures finer relations between images when the number of clusters scales with the dataset size. Clustering approaches infer target labels at the same time as features are learned; the target labels thus evolve during training, making clustering-based approaches unstable. Furthermore, these methods are sensitive to the data distribution, as they rely directly on the cluster structure of the underlying data. Explicitly dealing with an unbalanced category distribution might be a solution, but it assumes that we know the distribution of the latent classes; we design our method without this assumption. On the other hand, self-supervised learning [10] designs a pretext task whose pseudo-labels are automatically extracted from the input signals [12]. In other words, self-supervised approaches, like RotNet [15], leverage intra-image statistics to build supervision that is often independent of the data distribution. However, the dataset size then has little impact on the nature of the task and on the performance of the resulting features (see Figure 1). Leveraging larger datasets with self-supervision alone requires manually increasing the difficulty of the pretext task [19]. Our approach instead increases complexity automatically, through the clustering strategy.

Method             | intra-image statistics | inter-image statistics | stable to distribution change
Self-Sup (RotNet)  | yes                    | no                     | yes
Deep Clustering    | no                     | yes                    | no

Table 1: Training on non-curated large-scale data requires model complexity to increase with dataset size and model stability to data distribution changes. A simple solution is to combine self-supervision and clustering.

The novelty of our method lies in the combination of these two paradigms (Table 1) so that they benefit from one another. Our approach, DeeperCluster, automatically generates targets by clustering the features of the entire dataset, under constraints derived from self-supervision. Due to the "long-tail" distribution of raw non-curated data, processing huge datasets and learning a large number of targets is necessary, making the problem challenging from a computational point of view. For this reason, we propose a hierarchical formulation that is suitable for distributed training. This enables the discovery of latent categories present in the "tail" of the image distribution. While our framework is general, in practice we focus on combining the large-rotation classification task of Gidaris et al. [15] with the clustering approach of Caron et al. [6]. Figure 1 (left) shows that as we increase the number of training images, the quality of the features improves to the point where it surpasses those trained without labels on curated datasets. More importantly, we evaluate the quality of our approach as a pre-training step for ImageNet classification. Pre-training a supervised VGG-16 with our unsupervised approach leads to a top-1 accuracy of 74.9%, which is an improvement of +0.8% over a model trained from scratch. This shows the potential of unsupervised pre-training on large non-curated datasets as a way to improve the quality of visual features.

2. Related Work

Self-supervision. Self-supervised learning builds a pretext task from the input signal to train a model without annotation [10]. Many pretext tasks have been proposed [22, 31, 44, 48], exploiting, amongst others, spatial context [12, 24, 33, 34, 36], cross-channel prediction [27, 28, 52, 53], or the temporal structure of videos [1, 35, 43]. Some pretext tasks explicitly encourage the representations to be either invariant or discriminative to particular types of input transformations. For example, Dosovitskiy et al. [13] consider each image and its transformations as a class to enforce invariance to data transformations. In this paper, we build upon the work of Gidaris et al. [15], where the model encourages features to be discriminative for large rotations. Recently, Kolesnikov et al. [25] have conducted an extensive benchmark of self-supervised learning methods on different convnet architectures. As opposed to our work, they use curated datasets for pre-training.

Deep clustering. Clustering, along with density estimation and dimensionality reduction, is a family of standard unsupervised learning methods. Various attempts have been made to train convnets using clustering [2, 3, 6, 29, 45, 49, 50]. Our paper builds upon the work of Caron et al. [6], in which k-means is used to cluster the visual representations. Unlike our work, they mainly focus on training their approach on ImageNet without labels. Recently, Noroozi et al. [34] showed that clustering can also be used as a form of distillation to improve the performance of networks trained with self-supervision.
As opposed to our work, they use clustering only as a post-processing step and do not leverage the complementarity between clustering and self-supervision to further improve the quality of the features.

Learning on non-curated datasets. Some methods [9, 17, 32] aim at learning visual features from non-curated data streams. They typically use metadata such as hashtags [23, 41] or geolocalization [47] as a source of noisy supervision. In particular, Mahajan et al. [30] train a network to classify billions of Instagram images into predefined and clean sets of hashtags. They show that with little human effort, it is possible to learn features that transfer well to ImageNet, even achieving state-of-the-art performance if fine-tuned. As opposed to our work, they use an extrinsic source of supervision that had to be cleaned beforehand.

3. Preliminaries

In this work, we refer to the vector obtained at the penultimate layer of the convnet as a feature or representation. We denote by fθ the feature-extracting function, parametrized by a set of parameters θ. Given a set of images, our goal is then to learn a "good" mapping fθ*. By "good", we mean a function that produces general-purpose visual features that are useful on downstream tasks.

3.1. Self-supervision

In self-supervised learning, a pretext task is used to extract target labels directly from data [12]. These targets can take a variety of forms. They can be categorical labels associated with a multiclass problem, as when predicting the transformation of an image [15, 51] or the ordering of a set of patches [33]. Or they can be continuous variables associated with a regression problem, as when predicting image color [52] or surrounding patches [36]. In this work, we are interested in the former. We suppose that we are given a set of N images {x1, ..., xN} and we assign a pseudo-label yn in Y to each input xn. Given these pseudo-labels, we learn the parameters θ of the convnet jointly with a linear classifier V to predict the pseudo-labels by solving the problem

    \min_{\theta, V} \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, V f_\theta(x_n)),    (1)

where ℓ is a loss function. The pseudo-labels yn are fixed during the optimization, and the quality of the learned features entirely depends on their relevance.

Rotation as self-supervision. Gidaris et al. [15] have recently shown that good features can be obtained when training a convnet to discriminate between different image rotations. In this work, we focus on their pretext task, RotNet, since its performance on standard evaluation benchmarks is among the best in self-supervised learning. This pretext task corresponds to a multiclass classification problem with four categories: rotations in {0°, 90°, 180°, 270°}. Each input xn in Eq. (1) is randomly rotated and associated with a target yn that represents the angle of the applied rotation.
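To make the pretext task concrete, here is a minimal sketch of the rotation pseudo-labels (PyTorch; an illustration of the idea rather than the released implementation, and the function name is ours):

```python
import torch

def rotate_batch(images):
    """Build the four rotated copies of a batch and their pseudo-labels.

    images: float tensor of shape (B, C, H, W).
    Returns a (4B, C, H, W) tensor and labels in {0, 1, 2, 3}
    standing for rotations of 0, 90, 180 and 270 degrees.
    """
    rotated = [torch.rot90(images, k, dims=[2, 3]) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return torch.cat(rotated, dim=0), labels
```

The classifier V fθ(x) is then trained against these four-way labels with a cross-entropy loss, exactly as in Eq. (1).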
3.2. Deep clustering

Clustering-based approaches for deep networks typically build target classes by clustering the visual features produced by convnets. As a consequence, the targets are updated during training along with the representations and are potentially different at each epoch. In this context, we define a latent pseudo-label zn in Z for each image n, as well as a corresponding linear classifier W. These clustering-based methods alternate between learning the parameters θ and W and updating the pseudo-labels zn. Between two reassignments, the pseudo-labels zn are fixed, and the parameters and the classifier are optimized by solving

    \min_{\theta, W} \frac{1}{N} \sum_{n=1}^{N} \ell(z_n, W f_\theta(x_n)),    (2)

which is of the same form as Eq. (1). Then, the pseudo-labels zn can be reassigned by minimizing an auxiliary loss function. This loss sometimes coincides with Eq. (2) [3, 49], but some works propose to use another objective [6, 50].

Updating the targets with k-means. In this work, we focus on the framework of Caron et al. [6], DeepCluster, where the latent targets are obtained by clustering the activations with k-means. More precisely, the targets zn are updated by solving the following optimization problem:

    \min_{C \in \mathbb{R}^{d \times k}} \sum_{n=1}^{N} \min_{z_n \in \{0,1\}^k \,\text{s.t.}\, z_n^\top \mathbf{1} = 1} \| C z_n - f_\theta(x_n) \|_2^2,    (3)

where C is the matrix whose columns are the centroids, k is the number of centroids, and zn is a binary vector with a single non-zero entry. This approach assumes that the number of clusters k is known a priori; in practice, we set it by validation on a downstream task (see Sec. 5.3). The latent targets are updated every T epochs of the stochastic gradient descent steps minimizing the objective (2).

Note that this alternate optimization scheme is prone to trivial solutions, and controlling the way the optimization procedures of both objectives interact is crucial. Re-assigning empty clusters and sampling batches based on a uniform distribution over the cluster assignments are workarounds to avoid trivial parametrizations [6].
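A simplified, single-machine sketch of this reassignment step (numpy; DeepCluster additionally re-assigns empty clusters and samples batches uniformly over clusters, which this sketch omits):

```python
import numpy as np

def kmeans_step(features, centroids):
    """One k-means iteration on convnet features, as in Eq. (3).

    features: (N, d) array of f_theta(x_n); centroids: (k, d) array.
    Returns the pseudo-labels z_n (as indices) and updated centroids.
    """
    # Inner minimization of Eq. (3): assign each image to its
    # nearest centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    # Outer minimization over C: each centroid becomes the mean of
    # its members; empty clusters are left untouched here.
    for j in range(len(centroids)):
        members = features[assign == j]
        if len(members):
            centroids[j] = members.mean(axis=0)
    return assign, centroids
```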
4. Method

In this section, we describe how we combine self-supervised learning with deep clustering in order to scale up to large numbers of images and targets.

4.1. Combining self-supervision and clustering

We assume that the inputs x1, ..., xN are rotated images, each associated with a target label yn encoding its rotation angle and a cluster assignment zn. The cluster assignments change during training, as they result from clustering the features of the non-rotated images. We combine the two sets of targets by working in their Cartesian product space, which captures richer interactions between the two tasks. We get the following optimization problem:

    \min_{\theta, W} \frac{1}{N} \sum_{n=1}^{N} \ell(y_n \otimes z_n, W f_\theta(x_n)).    (4)

Figure 2: DeeperCluster alternates between a hierarchical clustering of the features (level 1: rotations and clusters; level 2: sub-clusters) and learning the parameters of a convnet by predicting both the rotation angle and the cluster assignments in a single hierarchical loss.
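In practice, the Cartesian product target yn ⊗ zn can be flattened into a single class index, which also makes the O(|Y||Z|)-sized output layer explicit; a sketch (PyTorch, names ours):

```python
import torch.nn.functional as F

def combined_loss(logits, rot_labels, cluster_labels, n_clusters):
    """Cross-entropy over the product of targets, as in Eq. (4).

    logits: (B, 4 * n_clusters) output of W f_theta(x); every
    (rotation, cluster) pair is one class of the flat classifier.
    """
    targets = rot_labels * n_clusters + cluster_labels
    return F.cross_entropy(logits, targets)
```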
Note that any clustering or self-supervised approach with a multiclass objective can be combined with this formulation. For example, we could use a self-supervision task that captures information about tile permutations [33] or frame ordering in a video [43]. However, this formulation does not scale in the number of combined targets, i.e., its complexity is O(|Y||Z|). This limits the use of a large number of clusters or of a self-supervised task with a large output space [51]. In particular, if we want to capture the information contained in the tail of the distribution of a non-curated dataset, we may need a large number of clusters. We thus propose an approximation of our formulation based on a scalable hierarchical loss that is designed to suit distributed training.

4.2. Scaling up to large numbers of targets

Hierarchical losses are commonly used in language modeling, where the goal is to predict a word out of a large vocabulary [5]. Instead of making one decision over the full vocabulary, these approaches split the process into a hierarchy of decisions, each with a smaller output space. For example, the vocabulary can be split into clusters of semantically similar words, and the hierarchical process would first select a cluster and then a word within this cluster.

Following this line of work, we partition the target labels into a 2-level hierarchy where we first predict a super-class and then a sub-class among its associated target labels. The first level is a partition of the images into S super-classes, and we denote by yn the super-class assignment vector in {0, 1}^S of the image n and by yns the s-th entry of yn. This super-class assignment is made with a linear classifier V on top of the features. The second level of the hierarchy is obtained by partitioning within each super-class. We denote by zns the vector in {0, 1}^{k_s} of the assignment into k_s sub-classes for an image n belonging to super-class s. There are S sub-class classifiers W1, ..., WS, each predicting the sub-class memberships within a super-class s. The parameters of the linear classifiers (V, W1, ..., WS) and θ are jointly learned by minimizing the following loss function:

    \frac{1}{N} \sum_{n=1}^{N} \left[ \ell\left(V f_\theta(x_n), y_n\right) + \sum_{s=1}^{S} y_{ns}\, \ell\left(W_s f_\theta(x_n), z_{ns}\right) \right],    (5)

where ℓ is the negative log-softmax function. Note that an image that does not belong to the super-class s does not belong to any of its k_s sub-classes either.

Choice of super-classes. A natural partition would be to define the super-classes based on the target labels from the self-supervised task and the sub-classes as the labels produced by clustering. However, this would mean that each image of the entire dataset would be present in each super-class (with a different rotation), which does not take advantage of the hierarchical structure to use a bigger number of clusters.

Instead, we split the dataset into m sets by running k-means with m centroids on the full dataset every T epochs. We then use the Cartesian product between the assignment to these m clusters and the angle rotation classes to form the super-classes. There are 4m super-classes, each associated with the subset of data belonging to the corresponding cluster (N/m images if the clustering is perfectly balanced). These subsets are then further split with k-means into k sub-classes. This amounts to running a hierarchical k-means with rotation constraints on the full dataset to form our hierarchical loss. We typically use m = 4 and k = 80k, leading to a total of 320k different clusters split into 4 subsets. Our approach, "DeeperCluster", shares similarities with DeepCluster but is designed to scale to larger datasets. We alternate between clustering the non-rotated image features and training the network to predict both the rotation applied to the input data and its cluster assignment amongst the clusters corresponding to this rotation (Figure 2).
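A compact, single-process sketch of the loss of Eq. (5) with these 4m super-classes (PyTorch; for simplicity all sub-class heads share the same k, and the class and variable names are ours — the actual training distributes the Ws across GPU groups, as described next):

```python
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalLoss(nn.Module):
    """Two-level loss of Eq. (5): 4m super-classes, k sub-classes each."""

    def __init__(self, dim, m, k):
        super().__init__()
        self.m = m
        self.super_head = nn.Linear(dim, 4 * m)        # classifier V
        self.sub_heads = nn.ModuleList(                 # classifiers W_s
            [nn.Linear(dim, k) for _ in range(4 * m)])

    def forward(self, feats, rot_labels, coarse_labels, sub_labels):
        # Super-class = Cartesian product of rotation and coarse cluster.
        super_labels = rot_labels * self.m + coarse_labels
        loss = F.cross_entropy(self.super_head(feats), super_labels)
        for s, head in enumerate(self.sub_heads):
            mask = super_labels == s
            if mask.any():
                # y_ns = 0 for images outside super-class s, so only its
                # own subset contributes; the subset mean is rescaled so
                # the sum matches the 1/N average of Eq. (5).
                loss = loss + mask.float().mean() * F.cross_entropy(
                    head(feats[mask]), sub_labels[mask])
        return loss
```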
Distributed training. Building the super-classes based on data splits lends itself to a distributed implementation that scales well in the number of images. Specifically, when optimizing Eq. (5), we form as many distributed communication groups of p GPUs as the number of super-classes, i.e., G = 4m. Different communication groups share the parameters θ and the super-class classifier V, while the parameters of the sub-class classifiers W1, ..., WS are only shared within a communication group. Each communication group s deals only with the subset of images and the rotation angle associated with the super-class s.
Distributed k-means. Every T epochs, we recompute the super- and sub-class assignments by running two consecutive k-means on the entire dataset. This is achieved by first randomly splitting the dataset across different GPUs. Each GPU is in charge of computing the cluster assignments for its partition, whereas the centroids are updated across GPUs. We reduce communication between GPUs by sharing only the number of assigned elements for each cluster and the sum of their features. The new centroids are then computed from these statistics. We observe empirically that k-means converges in 10 iterations. We cluster 96M features of dimension 4096 into m = 4 clusters using 64 GPUs (1 minute per iteration). Then, we split this pool of GPUs into 4 groups of 16 GPUs. Each group clusters around 23M features into 80k clusters (4 minutes per iteration).
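The reduction can be sketched as follows (PyTorch; it assumes a torch.distributed process group is already initialized and that every process holds one shard of the features):

```python
import torch
import torch.distributed as dist

def distributed_kmeans_step(local_feats, centroids):
    """One synchronized k-means iteration over sharded features.

    Only the per-cluster feature sums and counts are exchanged
    between GPUs; assignments stay local to each shard.
    """
    k, d = centroids.shape
    assign = torch.cdist(local_feats, centroids).argmin(dim=1)
    sums = torch.zeros(k, d, device=local_feats.device)
    sums.index_add_(0, assign, local_feats)
    counts = torch.bincount(assign, minlength=k).float()
    # Aggregate the sufficient statistics across all processes.
    dist.all_reduce(sums, op=dist.ReduceOp.SUM)
    dist.all_reduce(counts, op=dist.ReduceOp.SUM)
    nonempty = counts > 0
    centroids[nonempty] = sums[nonempty] / counts[nonempty].unsqueeze(1)
    return assign, centroids
```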
4.3. Implementation details

The loss in Eq. (5) is minimized with mini-batch stochastic gradient descent [4]. Each mini-batch contains 3072 instances distributed across 64 GPUs, leading to 48 instances per GPU per mini-batch [18]. We use dropout, weight decay, momentum and a constant learning rate of 0.1. We reassign the clusters every 3 epochs. We use the Pascal VOC 2007 classification task without finetuning as a downstream task to select hyper-parameters. In order to speed up experimentation, we initialize the network with RotNet trained on YFCC100M. Before clustering, we perform a whitening of the activations and ℓ2-normalize each of them. We use standard data augmentations, i.e., cropping of random sizes and aspect ratios, and horizontal flips [26]. We use the VGG-16 architecture [40] with batch normalization layers. Following [3, 6, 37], we pre-process images with a Sobel filtering. We train our models on the 96M images from YFCC100M [42] that we managed to download. We use this publicly available dataset for research purposes only.
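This pre-processing can be sketched in plain numpy (an illustration of PCA whitening followed by ℓ2 normalization, not the exact released code):

```python
import numpy as np

def whiten_and_l2(feats, eps=1e-5):
    """PCA-whiten activations, then l2-normalize each of them.

    feats: (N, d) array of activations gathered before clustering.
    """
    feats = feats - feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    feats = feats @ (eigvec / np.sqrt(eigval + eps))  # whitened coordinates
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, eps)
```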
5. Experiments

In this section, we evaluate the quality of the features learned with DeeperCluster on a variety of downstream tasks, such as classification or object detection. We also provide insights about the impact of the number of images and clusters on the performance of our model.

5.1. Evaluating unsupervised features

We evaluate the quality of the features extracted from a convnet trained with DeeperCluster on YFCC100M by considering several standard transfer learning tasks, namely image classification, object detection and scene classification.

Pascal VOC 2007 [14]. This dataset has small training and validation sets (2.5k images each), making it close to the setting of real applications where models trained using large computational resources are adapted to a new task with a small number of instances. We report numbers on the classification and detection tasks, either with finetuning of the whole network ("ALL") or with only the final layers trained on top of the frozen convolutional layers of the network ("FC6-8"). The FC6-8 setting gives a better measure of the quality of the learned features themselves, since the convolutional layers are kept fixed.

                                      Classif.          Detect.
Method                  Data      FC6-8    ALL      FC6-8    ALL
ImageNet labels         INet       89.3    89.2      66.3    70.3
Random                  -          10.1    49.6       5.4    55.6
Unsupervised on curated data
Larsson et al. [28]     INet+Pl.      -    77.2†     49.2    59.7
Wu et al. [48]          INet          -       -         -    60.5†
Doersch et al. [12]     INet       54.6    78.5      38.0    62.7
Caron et al. [6]        INet       78.5    82.5      58.7    65.9†
Unsupervised on non-curated data
Mahendran et al. [31]   YFCCv         -    76.4†        -       -
Wang and Gupta [43]     YT8M          -       -         -    60.2†
Wang et al. [44]        YT9M       59.4    79.6      40.9    63.2†
DeeperCluster           YFCC       79.7    84.3      60.5    67.8

Table 2: Comparison of DeeperCluster to the state of the art in unsupervised feature learning on classification and detection on PASCAL VOC 2007. We separate methods using curated datasets from methods using non-curated datasets. We selected hyper-parameters for each transfer task on the validation set, and then retrained on both the training and validation sets. We report results on the test set averaged over 5 runs. "YFCCv" stands for the videos contained in the YFCC100M dataset. † numbers from their original paper.

[Figure 3: linear classification accuracy on ImageNet and Places for DeeperCluster, RotNet and supervised features.]
ImageNet                                    top-1    top-5
Supervised (PyTorch documentation)           73.4     91.5
Supervised (our code)                        74.1     91.8
Supervised + RotNet pre-training             74.5     92.0
Supervised + DeeperCluster pre-training      74.9     92.3

Table 3: Accuracy on the validation set of ImageNet classification for a supervised VGG-16 trained with different initializations: we compare a network trained from a standard initialization to networks trained from weights pre-trained with either DeeperCluster or RotNet on YFCC100M.

Method            Data         ImageNet    Places    VOC2007
Supervised        ImageNet         70.2      45.9       84.8
Wu et al. [48]    ImageNet         39.2      36.3          -
RotNet            ImageNet         32.7      32.6       60.9
DeepCluster       ImageNet         48.4      37.9       71.9
RotNet            YFCC100M         33.0      35.5       62.2
DeepCluster       YFCC100M         34.1      35.4       63.9
DeeperCluster     YFCC100M         45.6      42.1       73.0

Table 4: Comparison between DeeperCluster, RotNet and DeepCluster when pre-trained on curated and non-curated datasets. We report the accuracy on several datasets of a linear classifier trained on top of the features of the last convolutional layer. All the methods use the same architecture. DeepCluster does not scale to the full YFCC100M dataset; we thus train it on a random subset of 1.3M images.

In this section, we compare DeeperCluster with RotNet and DeepCluster pre-trained on curated and non-curated datasets. We then report quantitative and qualitative evaluations of the clusters obtained with DeeperCluster.

Comparison with RotNet and DeepCluster. In Table 4, we compare DeeperCluster with DeepCluster and RotNet when a linear classifier is trained on top of the last convolutional layer of a VGG-16 on several datasets. For reference, we also report previously published numbers [48] with a VGG-16 architecture. We average-pool the features of the last layer, resulting in representations of 8192 dimensions. Our approach outperforms both RotNet and DeepCluster, even when they are trained on curated datasets (except for the ImageNet classification task, where DeepCluster trained on ImageNet yields the best performance). More interestingly, we see that the quality of the dataset or its scale has little impact on RotNet, while it has on DeepCluster. This confirms that self-supervised methods are more robust than clustering to a change of dataset distribution.
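The protocol can be sketched as follows (PyTorch; the 4x4 pooling grid is our assumption, chosen so that 512 x 4 x 4 = 8192 matches the dimension quoted above):

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d((4, 4))   # (B, 512, H, W) -> (B, 512, 4, 4)
classifier = nn.Linear(8192, 1000)    # the only trained module

def probe_logits(frozen_backbone, images):
    """Logits of a linear classifier over frozen convolutional features."""
    with torch.no_grad():                    # the backbone stays frozen
        conv_maps = frozen_backbone(images)  # last conv layer output
    return classifier(pool(conv_maps).flatten(1))
```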
Influence of dataset size and number of clusters. To measure the influence of the number of images on the features, we train models with 1M, 4M, 20M, and 96M images and report their accuracy on the validation set of the Pascal VOC 2007 classification task (FC6-8 setting). We also train models on 20M images with a number of clusters that varies from 10k to 160k. For the experiment with a total of 160k clusters, we choose m = 2, which results in 8 super-classes. In Figure 1, we observe that the quality of our features improves when scaling both in terms of images and of clusters. Interestingly, between 4M and 20M YFCC100M images are needed to match the performance of our method trained on ImageNet. Augmenting the number of images has a bigger impact than the number of clusters. Yet, the improvement brought by more clusters is significant, since it corresponds to a reduction of more than 10% of the relative error w.r.t. the supervised model.

Quality of the clusters. In addition to features, our method provides a clustering of the input images. We evaluate the quality of these clusters by measuring their correlation with existing partitions of the data. In particular, YFCC100M comes with many different metadata; we consider hashtags, users, cameras and GPS coordinates. If an image has several hashtags, we pick as label the least frequent one in the total hashtag distribution. We also measure the correlation of our clusters with the labels predicted by a classifier trained on ImageNet categories. We use a ResNet-50 network [21], pre-trained on ImageNet, to classify the YFCC100M images, and we select those for which the confidence in the prediction is higher than 75%. This evaluation omits a large amount of the data, but it gives some insight about the quality of our clustering in terms of object classification.

In Figure 4, we show the evolution during training of the normalized mutual information (NMI) between our clustering and the different metadata, as well as the labels predicted from ImageNet. The higher the NMI, the more correlated our clusters are to the considered partition. For reference, we compute the NMI for a clustering of RotNet features (as it corresponds to the weights at initialization) and of a supervised model. First, it is interesting to observe that our clustering improves over time for every type of metadata. One important factor is that most of these metadata are correlated, since a given user takes pictures in specific places, probably with a single camera, and uses a preferred fixed set of hashtags. Yet, these plots show that our model captures enough information in the input signal to predict these metadata at least as well as features trained with supervision.
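The correlation measure itself is a one-liner over two partitions of the same images; a toy sketch (scikit-learn; the label arrays are hypothetical):

```python
from sklearn.metrics import normalized_mutual_info_score

# One integer id per image: our cluster assignment vs. a metadata
# partition (e.g., the least frequent hashtag of each image).
cluster_assignments = [0, 0, 1, 1, 2, 2]
hashtag_labels = [5, 5, 9, 9, 9, 3]

nmi = normalized_mutual_info_score(hashtag_labels, cluster_assignments)
print(f"NMI(clusters, hashtags) = {nmi:.3f}")
```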
Figure 4: Normalized mutual information between our clustering and different sorts of metadata: hashtags, user IDs, geographic coordinates, and device types (curves for DeeperCluster, RotNet and a supervised model over 90 training epochs). We also plot the NMI with an ImageNet classifier labeling.

We visually assess the consistency of our clusters in Figure 5. We display 9 random images from each of 8 manually picked clusters. The first two clusters contain a majority of images associated with a tag from the head (first cluster) and from the tail (second cluster) of the YFCC100M distribution. Indeed, 418,538 YFCC100M images are associated with the tag cat.

Figure 5: We randomly select 9 images per cluster and indicate the dominant cluster metadata. The bottom row depicts clusters that are pure for GPS coordinates (e.g., (43, 10), (-34, -151), (64, -20), (43, -104)) but impure for user IDs; as expected, they turn out to correlate with tourist landmarks. No metadata is used during training. For copyright reasons, we provide in the Appendix the photographer username for each image.
References

[1] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[2] Miguel A Bautista, Artsiom Sanakoyeu, Ekaterina Tikhoncheva, and Bjorn Ommer. Cliquecnn: Deep unsupervised exemplar learning. In Advances in Neural Information Processing Systems (NIPS), 2016.
[3] Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
[4] Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer, 2012.
[5] Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
[6] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[7] Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik. Human pose estimation with iterative error feedback. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
[9] Xinlei Chen and Abhinav Gupta. Webly supervised learning of convolutional networks. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[10] Virginia R de Sa. Learning classification with unlabeled data. In Advances in Neural Information Processing Systems (NIPS), 1994.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[12] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[13] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1734–1747, 2016.
[14] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[15] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), 2018.
[16] Ross Girshick. Fast r-cnn. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[17] Lluis Gomez, Yash Patel, Marçal Rusiñol, Dimosthenis Karatzas, and CV Jawahar. Self-supervised learning of visual features through embedding images into text topic spaces. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[18] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[19] Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. arXiv preprint arXiv:1905.01235, 2019.
[20] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking imagenet pre-training. arXiv preprint arXiv:1811.08883, 2018.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[22] Simon Jenni and Paolo Favaro. Self-supervised feature learning by learning to spot artifacts. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[23] Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. Learning visual features from large weakly supervised data. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[24] Dahun Kim, Donghyeon Cho, Donggeun Yoo, and In So Kweon. Learning image representations by completing damaged jigsaw puzzles. In Winter Conference on Applications of Computer Vision (WACV), 2018.
[25] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005, 2019.
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
[27] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[28] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[29] Renjie Liao, Alex Schwing, Richard Zemel, and Raquel Urtasun. Learning deep parsimonious representations. In Advances in Neural Information Processing Systems (NIPS), 2016.
[30] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[31] Aravindh Mahendran, James Thewlis, and Andrea Vedaldi. Cross pixel optical flow similarity for self-supervised learning. arXiv preprint arXiv:1807.05636, 2018.
[32] Karl Ni, Roger Pearce, Kofi Boakye, Brian Van Essen, Damian Borth, Barry Chen, and Eric Wang. Large-scale deep learning on the yfcc100m dataset. arXiv preprint arXiv:1502.03409, 2015.
[33] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[34] Mehdi Noroozi, Ananth Vinjimoor, Paolo Favaro, and Hamed Pirsiavash. Boosting self-supervised learning via knowledge transfer. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[35] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[36] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[37] Mattis Paulin, Matthijs Douze, Zaid Harchaoui, Julien Mairal, Florent Perronin, and Cordelia Schmid. Local convolutional features with unsupervised training for image retrieval. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[38] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[39] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In CVPR Workshops, 2014.
[40] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[41] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the International Conference on Computer Vision (ICCV), pages 843–852, 2017.
[42] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. arXiv preprint arXiv:1503.01817, 2015.
[43] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[44] Xiaolong Wang, Kaiming He, and Abhinav Gupta. Transitive invariance for self-supervised visual representation learning. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.
[45] Xiaosong Wang, Le Lu, Hoo-Chang Shin, Lauren Kim, Mohammadhadi Bagheri, Isabella Nogues, Jianhua Yao, and Ronald M Summers. Unsupervised joint mining of deep features and image labels for large-scale radiology image categorization and scene recognition. In Winter Conference on Applications of Computer Vision (WACV), 2017.
[46] Philippe Weinzaepfel, Jerome Revaud, Zaid Harchaoui, and Cordelia Schmid. Deepflow: Large displacement optical flow with deep matching. In Proceedings of the International Conference on Computer Vision (ICCV), 2013.
[47] Tobias Weyand, Ilya Kostrikov, and James Philbin. Planet - photo geolocation with convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[48] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[49] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In Proceedings of the International Conference on Machine Learning (ICML), 2016.
[50] Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[51] Liheng Zhang, Guo-Jun Qi, Liqiang Wang, and Jiebo Luo. Aet vs. aed: Unsupervised representation learning by auto-encoding transformations rather than data. arXiv preprint arXiv:1901.04596, 2019.
[52] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[53] Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[54] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems (NIPS), 2014.