0% found this document useful (0 votes)

119 views10 pages

Unsupervised Pre-Training of Image Features On Non-Curated Data

Uploaded by

Hasham Adeeb

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

119 views10 pages

Unsupervised Pre-Training of Image Features On Non-Curated Data

Uploaded by

Hasham Adeeb

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 10

Unsupervised Pre-Training of Image Features on Non-Curated Data

Mathilde Caron1,2 , Piotr Bojanowski1 , Julien Mairal2 , and Armand Joulin1

1
Facebook AI Research
2
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France

Abstract
76 76
Pre-training general-purpose visual features with con-
volutional neural networks without relying on annotations 74 74

mAP

mAP
DeeperCluster YFCC
is a challenging and important task. Most recent efforts in DeeperCluster ImNet
unsupervised feature learning have focused on either small 72 RotNet YFCC
RotNet ImNet
72
or highly curated datasets like ImageNet, whereas using
non-curated raw datasets was found to decrease the fea- 70 70
ture quality when evaluated on a transfer task. Our goal is 1 4 20 100 1 2 4 8 16
to bridge the performance gap between unsupervised meth- size of dataset (x 10e6) number of cluster (x 10e4)
ods trained on curated data, which are costly to obtain, and Figure 1: Influence of amount of data (left) and number of
massive raw datasets that are easily available. To that ef- clusters (right) on the features quality. We report validation
fect, we propose a new unsupervised approach which lever- mAP on Pascal VOC classification task (FC 68 setting).
ages self-supervision and clustering to capture complemen-
tary statistics from large-scale data. We validate our ap-
proach on 96 million images from YFCC100M [42], achiev- Scaling up the annotation process to datasets that are orders
ing state-of-the-art results among unsupervised methods on of magnitude bigger raises important difficulties. Using
standard benchmarks, which confirms the potential of un- raw metadata as an alternative has been shown to perform
supervised learning when only non-curated raw data are comparatively well [23, 41], even surpassing ImageNet pre-
available. We also show that pre-training a supervised training when trained on billions of images [30]. However,
VGG-16 with our method achieves 74.9% top-1 classifica- metadata are not always available, and when they are, they
tion accuracy on the validation set of ImageNet, which is an do not necessarily cover the full extent of a dataset. These
improvement of +0.8% over the same network trained from difficulties motivate the design of methods that learn trans-
scratch. Our code is available at https://github. ferable features without using any annotation.
com/facebookresearch/DeeperCluster. Recent works describing unsupervised approaches have
reported performances that are closing the gap with their
supervised counterparts [6, 15, 51]. However, the best per-
1. Introduction forming unsupervised methods are trained on ImageNet, a
curated dataset made of carefully selected images to form
Pre-trained convolutional neural networks, or convnets, well-balanced and diversified classes [11]. Simply dis-
are important components of image recognition applica- carding the labels does not undo this careful selection, as
tions [7, 8, 38, 46]. They improve the generalization of it only removes part of the human supervision. Because
models trained on a limited amount of data [39] and speed of that, previous works that have experimented with non-
up the training on applications when annotated data is abun- curated raw data report a degradation of the quality of fea-
dant [20]. Convnets produce good generic representations tures [6, 12]. In this work, we aim at learning good visual
when they are pre-trained on large supervised datasets like representations from unlabeled and non-curated datasets.
ImageNet [11]. However, designing such fully-annotated We focus on the YFCC100M dataset [42], which contains
datasets has required a significant effort from the research 99 million images from the Flickr photo-sharing website.
community in terms of data cleansing and manual labeling. This dataset is unbalanced, with a “long-tail” distribution of

12959
hashtags contrasting with the well-behaved label distribu- another. Our approach, DeeperCluster, automatically gen-
tion of ImageNet (see Appendix). For example, guenon and erates targets by clustering the features of the entire dataset,
baseball correspond to labels with 1300 associated images under constraints derived from self-supervision. Due to the
in ImageNet, while there are respectively 226 and 256, 758 “long-tail” distribution of raw non-curated data, processing
images associated with these hashtags in YFCC100M. Our huge datasets and learning a large number of targets is nec-
goal is to understand if trading manually-curated data for essary, making the problem challenging from a computa-
scale leads to an improvement in the feature quality. tional point of view. For this reason, we propose a hier-
We propose a new unsupervised approach specifically achical formulation that is suitable for distributed training.
designed to leverage large amount of raw data. Indeed, This enables the discovery of latent categories present in
training on large-scale non-curated data requires (i) model the “tail” of the image distribution. While our framework
complexity to increase with dataset size; (ii) model stability is general, in practice we focus on combining the large ro-
to data distribution changes. A simple yet effective solution tation classification task of Gidaris et al. [15] with the clus-
is to combine methods from two domains of unsupervised tering approach of Caron et al. [6]. Figure 1 left shows that
learning: clustering and self-supervision. Since cluster- as we increase the number of training images, the quality
ing methods, like DeepCluster [6], build supervision from of features improves to the point where it surpasses those
inter-image similarities, the task at hand becomes inher- trained without labels on curated datasets. More impor-
ently more complex when the number of images increases. tantly, we evaluate the quality of our approach as a pre-
In addition, DeepCluster captures finer relations between training step for ImageNet classification. Pre-training a su-
images when the number of clusters scales with the dataset pervised VGG-16 with our unsupervised approach leads to
size. Clustering approaches infer target labels at the same a top-1 accuracy of 74.9%, which is an improvement of
time as features are learned. Thus, target labels evolve dur- +0.8% over a model trained from scratch. This shows the
ing training, making clustering-based approaches unstable. potential of unsupervised pre-training on large non-curated
Furthermore, these methods are sensitive to data distribu- datasets as a way to improve the quality of visual features.
tion as they rely directly on cluster structure in the underly-
ing data. Explicitly dealing with unbalanced category dis- 2. Related Work
tribution might be a solution but it assumes that we know Self-supervision. Self-supervised learning builds a pre-
the distribution of the latent classes. We design our method text task from the input signal to train a model with-
without this assumption. On the other hand, self-supervised out annotation [10]. Many pretext tasks have been pro-
learning [10] consists in designing a pretext task by predict- posed [22, 31, 44, 48], exploiting, amongst others, spatial
ing pseudo-labels automatically extracted from input sig- context [12, 24, 33, 34, 36], cross-channel prediction [27,
nals [12]. In other words, self-supervised approaches, like 28, 52, 53], or the temporal structure of videos [1, 35, 43].
RotNet [15], leverage intra-image statistics to build super- Some pretext tasks explicitly encourage the representations
vision, which are often independent of the data distribu- to be either invariant or discriminative to particular types of
tion. However, the dataset size has little impact on the input tranformations. For example, Dosovitskiy et al. [13]
nature of the task and on the performance of the result- consider each image and its transformations as a class to en-
ing features (see Figure 1). A solution to leveraging larger force invariance to data transformations. In this paper, we
datasets require manually increasing the difficulty of the build upon the work of Gidaris et al. [15] where the model
self-supervision task [19]. Our approach automatically in- encourages features to be discriminative for large rotations.
creases complexity through the clustering strategy. Recently, Kolesnikov et al. [25] have conducted an exten-
sive benchmark of self-supervised learning methods on dif-
intra-image inter-images stable to distri- ferent convnet architectures. As opposed to our work, they
Method statistics statistics. bution change use curated datasets for pre-training.
Self-Sup (RotNet) X ✗ X
Deep clustering. Clustering, along with density estima-
Deep Clustering X X ✗ tion and dimensionality reduction, is a family of standard
unsupervised learning methods. Various attempts have been
Table 1: Training on non-curated large-scale data requires made to train convnets using clustering [2, 3, 6, 29, 45, 49,
model complexity to increase with dataset size and model 50]. Our paper builds upon the work of Caron et al. [6],
stability to data distribution changes. A simple solution is in which k-means is used to cluster the visual represen-
to combine self-supervision and clustering. tations. Unlike our work, they mainly focus on training
their approach using ImageNet without labels. Recently,
The novelty of our method lies in the combination of Noroozi et al. [34] show that clustering can also be used as a
these two paradigms (Table 1) so that they benefit from one form of distillation to improve the performance of networks

2960
trained with self-supervision. As opposed to our work, they since its performance on standard evaluation benchmarks
use clustering only as a post-processing step and does not is among the best in self-supervised learning. This pretext
leverage the complementarity between clustering and self- task corresponds to a multiclass classification problem with
supervision to further improve the quality of features. four categories: rotations in {0◦ , 90◦ , 180◦ , 270◦ }. Each in-
put xn in Eq. (1) is randomly rotated and associated with a
Learning on non-curated datasets. Some methods [9, target yn that represents the angle of the applied rotation.
17, 32] aim at learning visual features from non-curated 3.2. Deep clustering
data streams. They typically use metadata such as hash-
tags [23, 41] or geolocalization [47] as a source of noisy su- Clustering-based approaches for deep networks typically
pervision. In particular, Mahajan et al. [30] train a network build target classes by clustering visual features produced
to classify billions of Instagram images into predefined and by convnets. As a consequence, the targets are updated
clean sets of hashtags. They show that with little human ef- during training along with the representations and are po-
fort, it is possible to learn features that transfer well to Im- tentially different at each epoch. In this context, we define
ageNet, even achieving state-of-the-art performance if fine- a latent pseudo-label zn in Z for each image n as well as
tuned. As opposed to our work, they use an extrinsic source a corresponding linear classifier W . These clustering-based
of supervision that had to be cleaned beforehand. methods alternate between learning the parameters θ and W
and updating the pseudo-labels zn . Between two reassign-
3. Preliminaries ments, the pseudo-labels zn are fixed, and the parameters
and classifier are optimized by solving
In this work, we refer to the vector obtained at the N
penultimate layer of the convnet as a feature or represen- 1 X
min ℓ(zn , W fθ (xn )), (2)
tation. We denote by fθ the feature-extracting function, θ,W N n=1
parametrized by a set of parameters θ. Given a set of im-
ages, our goal is then to learn a “good” mapping fθ∗ . By which is of the same form as Eq. (1). Then, the pseudo-
“good”, we mean a function that produces general-purpose labels zn can be reassigned by minimizing an auxiliary loss
visual features that are useful on downstream tasks. function. This loss sometimes coincides with Eq. (2) [3, 49]
but some works proposed to use another objective [6, 50].
3.1. Self-supervision
Updating the targets with k-means. In this work, we
In self-supervised learning, a pretext task is used to ex-
focus on the framework of Caron et al. [6], DeepCluster,
tract target labels directly from data [12]. These targets can
where latent targets are obtained by clustering the activa-
take a variety of forms. They can be categorical labels as-
tions with k-means. More precisely, the targets zn are up-
sociated with a multiclass problem, as when predicting the
dated by solving the following optimization problem:
transformation of an image [15, 51] or the ordering of a set
of patches [33]. Or they can be continuous variables asso- XN
ciated with a regression problem, as when predicting image min min kCzn − fθ (xn )k22 , (3)
C∈Rd×k zn ∈{0,1}k s.t. zn
⊤ 1=1
color [52] or surrounding patches [36]. In this work, we are n=1

interested in the former. We suppose that we are given a set C is the matrix where each column corresponds to a cen-
of N images {x1 , . . . , xN } and we assign a pseudo-label yn troid, k is the number of centroids, and zn is a binary vector
in Y to each input xn . Given these pseudo-labels, we learn with a single non-zero entry. This approach assumes that
the parameters θ of the convet jointly with a linear classifier the number of clusters k is known a priori; in practice, we
V to predict pseudo-labels by solving the problem set it by validation on a downstream task (see Sec. 5.3). The
latent targets are updated every T epochs of stochastic gra-
N
1 X dient descent steps when minimizing the objective (2).
min ℓ(yn , V fθ (xn )), (1) Note that this alternate optimization scheme is prone to
θ,V N n=1
trivial solutions and controlling the way optimization pro-
where ℓ is a loss function. The pseudo-labels yn are fixed cedures of both objectives interact is crucial. Re-assigning
during the optimization and the quality of the learned fea- empty clusters and performing a batch-sampling based on
tures entirely depends on their relevance. an uniform distribution over the cluster assignments are
workarounds to avoid trivial parametrization [6].
Rotation as self-supervision. Gidaris et al. [15] have re- 4. Method
cently shown that good features can be obtained when train-
ing a convnet to discriminate between different image rota- In this section, we describe how we combine self-
tions. In this work, we focus on their pretext task, RotNet, supervised learning with deep clustering in order to scale

2961
up to large numbers of images and targets. 1) Hierarchical clustering 2) Unsupervised feature
learning
4.1. Combining self-supervision and clustering Level 2 Sub-clusters
…
We assume that the inputs x1 , . . . , xN are rotated im-
ages, each associated with a target label yn encoding its Clusters
x
rotation angle and a cluster assignment zn . The cluster as- Level 1 Rotations

signment changes during training along with the visual rep-

resentations. We denote by Y the set of possible rotation Convnet
angles and by Z, the set of possible cluster assignments.
A way of combining self-supervision with deep clustering Dataset
is to add the losses defined in Eq. (1) and Eq. (2). How- Rotations

ever, summing these losses implicitly assumes that classi-

fying rotations and cluster memberships are two indepen-
dent tasks, which may limit the signal that can be captured.
Instead, we work with the Cartesian product space Y × Z,
which can potentially capture richer interactions between

…
…
…
…
0 90 180 270
the two tasks. We get the following optimization problem:
N Figure 2: DeeperCluster alternates between a hierachical
1 X clustering of the features and learning the parameters of a
min ℓ(yn ⊗ zn , W fθ (xn )). (4)
θ,W N convnet by predicting both the rotation angle and the cluster
n=1
assignments in a single hierachical loss.
Note that any clustering or self-supervised approach with
a multiclass objective can be combined with this formula- by zns the vector in {0, 1}ks of the assignment into ks sub-
tion. For example, we could use a self-supervision task that classes for an image n belonging to super-class s. There
captures information about tiles permutations [33] or frame are S sub-class classifiers W1 , . . . , WS , each predicting the
ordering in a video [43]. However, this formulation does not sub-class memberships within a super-class s. The param-
scale in the number of combined targets, i.e., its complexity eters of the linear classifiers (V, W1 , . . . , WS ) and θ are
is O(|Y||Z|). This limits the use of a large number of clus- jointly learned by minimizing the following loss function:
ter or a self-supervised task with a large output space [51]. " #
N S
In particular, if we want to capture information contained 1 X X
ℓ V fθ (xn ), yn + yns ℓ (Ws fθ (xn ), zns ) , (5)
in the tail of the distribution of non-curated dataset, we may N n=1 s=1
need a large number of clusters. We thus propose an approx-
imation of our formulation based on a scalable hierarchical where ℓ is the negative log-softmax function. Note that an
loss that it is designed to suit distributed training. image that does not belong to the super-class s does not
belong either to any of its ks sub-classes.
4.2. Scaling up to large number of targets
Hierarchical losses are commonly used in language mod- Choice of super-classes. A natural partition would be to
eling where the goal is to predict a word out of a large vo- define the super-classes based on the target labels from the
cabulary [5]. Instead of making one decision over the full self-supervised task and the sub-classes as the labels pro-
vocabulary, these approaches split the process in a hierarchy duced by clustering. However, this would mean that each
of decisions, each with a smaller output space. For exam- image of the entire dataset would be present in each super-
ple, the vocabulary can be split into clusters of semantically class (with a different rotation), which does not take advan-
similar words, and the hierarchical process would first se- tage of the hierarchical structure to use a bigger number of
lect a cluster and then a word within this cluster. clusters.
Following this line of work, we partition the target labels Instead, we split the dataset into m sets by running k-
into a 2-level hierarchy where we first predict a super-class means with m centroids on the full dataset every T epochs.
and then a sub-class among its associated target labels. The We then use the Cartesian product between the assignment
first level is a partition of the images into S super-classes to these m clusters and the angle rotation classes to form
and we denote by yn the super-class assignment vector in the super-classes. There are 4m super-classes, each associ-
{0, 1}S of the image n and by yns the s-th entry of yn . This ated with the subset of data belonging to the corresponding
super-class assignment is made with a linear classifier V on cluster (N/m images if the clustering is perfectly balanced).
top of the features. The second-level of the hierarchy is ob- These subsets are then further split with k-means into k sub-
tained by partitioning within each super-class. We denote classes. This is equivalent to running a hierarchical k-means

2962
with rotation constraints on the full datasets to form our hi- Classif. Detect.
erarchical loss. We typically use m = 4 and k = 80k, lead-
Method Data FC 68 ALL FC 68 ALL
ing to a total of 320k different clusters split in 4 subsets. Our
approach, “DeeperCluster”, shares similarities with Deep- ImageNet labels INet 89.3 89.2 66.3 70.3
Cluster but is designed to scale to larger datasets. We alter- Random – 10.1 49.6 5.4 55.6
nate between clustering the non-rotated images features and
Unsupervised on curated data
training the network to predict both the rotation applied to
the input data and its cluster assignment amongst the clus- Larsson et al. [28] INet+Pl. – 77.2† 49.2 59.7
ters corresponding to this rotation (Figure 2). Wu et al. [48] INet – – – 60.5†
Doersh et al. [12] INet 54.6 78.5 38.0 62.7
Caron et al. [6] INet 78.5 82.5 58.7 65.9†
Distributed training. Building the super-classes based
Unsupervised on non-curated data
on data splits lends itself to a distributed implementation
that scales well in the number of images. Specifically, when Mahendran et al. [31] YFCCv – 76.4† – –
optimizing Eq. (5), we form as many distributed communi- Wang and Gupta [43] YT8M – – – 60.2†
cation groups of p GPUs as the number of super-classes, Wang et al. [44] YT9M 59.4 79.6 40.9 63.2†
i.e., G = 4m. Different communication groups share the DeeperCluster YFCC 79.7 84.3 60.5 67.8
parameters θ and the super-class classifier V , while the pa-
rameters of the sub-class classifiers W1 , . . . , WS are only Table 2: Comparison of DeeperCluster to state-of-the-art
shared within a communication group. Each communica- unsupervised feature learning on classification and detec-
tion group s deals only with the subset of images and the tion on PASCAL VOC 2007. We disassociate methods using
rotation angle associated with the super-class s. curated datasets and methods using non-curated datasets.
We selected hyper-parameters for each transfer task on the
Distributed k-means. Every T epochs, we recompute the validation set, and then retrain on both training and val-
super and sub-class assignments by running two consecu- idation sets. We report results on the test set averaged
tive k-means on the entire dataset. This is achieved by first over 5 runs. “YFFCv” stands for the videos contained in
randomly splitting the dataset across different GPUs. Each YFFC100M dataset. † numbers from their original paper.
GPU is in charge of computing cluster assignments for its
partition, whereas centroids are updated across GPUs. We
the VGG-16 architecture [40] with batch normalization lay-
reduce communication between GPUs by sharing only the
ers. Following [3, 6, 37], we pre-process images with a So-
number of assigned elements for each cluster and the sum
bel filtering. We train our models on the 96M images from
of their features. The new centroids are then computed from
YFCC100M [42] that we managed to download. We use
these statistics. We observe empirically that k-means con-
this publicly available dataset for research purposes only.
verges in 10 iterations. We cluster 96M features of dimen-
sion 4096 into m = 4 clusters using 64 GPUs (1 minute per
5. Experiments
iteration). Then, we split this pool of GPUs into 4 groups
of 16 GPUs. Each group clusters around 23M features into In this section we evaluate the quality of the features
80k clusters (4 minutes per iteration). learned with DeeperCluster on a variety of downstream
tasks, such as classification or object detection. We also
4.3. Implementation details provide insights about the impact of the number of images
The loss in Eq. (5) is minimized with mini-batch stochas- and clusters on the performance of our model.
tic gradient descent [4]. Each mini-batch contains 3072 in- 5.1. Evaluating unsupervised features
stances distributed accross 64 GPUs, leading to 48 instances
per GPU per minibatch [18]. We use dropout, weight de- We evaluate the quality of the features extracted from a
cay, momentum and a constant learning rate of 0.1. We convnet trained with DeeperCluster on YFCC100M by con-
reassign clusters every 3 epochs. We use the Pascal VOC sidering several standard transfer learning tasks, namely im-
2007 classification task without finetuning as a downstream age classification, object detection and scene classification.
task to select hyper-parameters. In order to speed up exper-
imentations, we initialize the network with RotNet trained Pascal VOC 2007 [14]. This dataset has small training
on YFCC100M. Before clustering, we perform a whiten- and validation sets (2.5k images each), making it close to
ing of the activations and ℓ2 -normalize each of them. We the setting of real applications where models trained us-
use standard data augmentations, i.e., cropping of random ing large computational resources are adapted to a new task
sizes and aspect ratios and horizontal flips [26]). We use with a small number of instances. We report numbers on the

2963
classification and detection tasks with finetuning (“ALL”) 70 ImageNet 70
Places

or by only retraining the last three fully connected layers 60

classification accuracy

classification accuracy
DeeperCluster
60
RotNet
of the network (“FC 68”). The FC 68 setting gives a better 50 50 Supervised

measure of the quality of the evaluated features since fewer

40 40
parameters are retrained. For classification, we use the code
of Caron et al. [6]1 and for detection, fast-rcnn [16]2 . 30 30

For classification, we train the models for 150k iterations, 20 20

starting with a learning rate of 0.002 decayed by a factor 10 10

10 every 20k iterations, and we report results averaged over c1 c2 c3 c4 c5 c1 c2 c3 c4 c5

layer layer
10 random crops. For object detection, we train our net-
work for 150k iterations, dividing the step-size by 10 af- Figure 3: Accuracy of linear classifiers on ImageNet and
ter the first 50k steps with an initial learning rate of 0.01 Places205 using the activations from different layers as fea-
(FC 68) or 0.002 (ALL) and a weight decay of 0.0001. Fol- tures. We compare a VGG-16 trained with supervision on
lowing Doersch et al. [12], we use the multiscale config- ImageNet to VGG-16 trained with either RotNet or Deep-
uration, with scales [400, 500, 600, 700] for training and erCluster on YFCC100M. Exact numbers are in Appendix.
[400, 500, 600] for testing. In Table 2, we compare Deep-
erCluster with two sets of unsupervised methods that use
a VGG-16 network: those trained on curated datasets and pervision. In this experiment, we want to check whether
those trained on non-curated datasets. Previous unsuper- these low-level features pre-trained on YFCC100M with-
vised methods that worked on unucurated datasets with a out supervision can serve as a good initialization for fully-
VGG-16 use videos: Youtube8M (“YT8M”), Youtube9M supervised ImageNet classification. To this end, we pre-
(“YT9M”) or the videos from YFCC100M (“YFFCv”). Our train a VGG-16 on YFCC100M using either DeeperClus-
approach achieves state-of-the-art performance among all ter or RotNet. The resulting weights are then used as ini-
the unsupervised method that uses a VGG-16 architecture, tialization for the training of a network on ImageNet with
even those that use ImageNet as a training set. The gap with supervision. We merge the Sobel weights of the network
a supervised network is still important when we freeze the pre-trained with DeeperCluster with the first convolutional
convolutions (6% for detection and 10% for classification) layer during the initialization. We then train the networks
but drops to less than 5% for both tasks with finetuning. on ImageNet with mini-batch SGD for 100 epochs, a learn-
ing rate of 0.1, a weight decay of 0.0001, a batch size
Linear classifiers on ImageNet [11] and Places205 [54]. of 256 and dropout of 0.5. We reduce the learning rate
ImageNet (“INet”) and Places205 (“Pl.”) are two large scale by a factor of 0.2 every 20 epochs. Note that this learn-
image classification datasets: ImageNet’s domain covers ing rate decay schedule slightly differs from the ImageNet
objects and animals (1.3M images) and Places205’s domain classification PyTorch default implementation3 where they
covers indoor and outdoor scenes (2.5M images). We train train for 90 epochs and decay the learning rate by 0.1 at
linear classifiers with a logistic loss on top of frozen con- epochs 30 and 60. We give in Appendix the results with
volutional layers at different depths. To reduce influence of this default schedule (with unchanged conclusions). In Ta-
feature dimension in the comparison, we average-pool the ble 3, we compare the performance of a network trained
features until their dimension is below 10k [52]. This ex- with a standard intialization (“Supervised”) to one initial-
periment probes the quality of the features extracted at each ized with a pre-training obtained from either DeeperClus-
convolutional layer. In Figure 3, we observe that Deeper- ter (“Supervised + DeeperCluster pre-training”) or RotNet
Cluster matches the performance of a supervised network (“Supervised + RotNet pre-training”) on YFCC100M. We
for all layers on Places205. On ImageNet, it also matches see that our pre-training improves the performance of a su-
supervised features up to the 4th convolutional block; then pervised network by +0.8%, leading to 74.9% top-1 accu-
the gap suddenly increases to around 20%. It is not surpris- racy. This means that our pre-training captures important
ing since the supervised features are trained on ImageNet statistics from YFCC100M that transfers well to ImageNet.
itself, while ours are trained on YFCC100M.
5.3. Model analysis
5.2. Pre-training for ImageNet In this final set of experiments, we analyze some compo-
In the previous section, we can observe that a VGG-16 nents of our model. Since DeeperCluster derives from Rot-
trained on YFCC100M has similar or better low level fea- Net and DeepCluster, we first look at the difference between
tures than the same network trained on ImageNet with su- these methods and ours, when trained on curated and non-
1 github.com/facebookresearch/deepcluster 3 github.com/pytorch/examples/blob/master/imagenet/
2 github.com/rbgirshick/py-faster-rcnn 4 pytorch.org/docs/stable/torchvision/models

2964
ImageNet top-1 top-5 Influence of dataset size and number of clusters. To
measure the influence of the number of images on features,
Supervised (PyTorch documentation4 ) 73.4 91.5 we train models with 1M, 4M, 20M, and 96M images and
Supervised (our code) 74.1 91.8 report their accuracy on the validation set of the Pascal VOC
Supervised + RotNet pre-training 74.5 92.0 2007 classification task (FC 68 setting). We also train mod-
Supervised + DeeperCluster pre-training 74.9 92.3 els on 20M images with a number of clusters that varies
from 10k to 160k. For the experiment with a total of 160k
Table 3: Accuracy on the validation set of ImageNet classi- clusters, we choose m = 2 which results in 8 super-classes.
fication for a supervised VGG-16 trained with different ini- In Figure 1, we observe that the quality of our features im-
tializations: we compare a network trained from a standard proves when scaling both in terms of images and clusters.
initialization to networks trained from pre-trained weights Interestingly, between 4M and 20M of YFCC100M images
using either DeeperCluster or RotNet on YFCC100M. are needed to meet the performance of our method on Ima-
geNet. Augmenting the number of images has a bigger im-
Method Data ImageNet Places VOC2007 pact than the number of clusters. Yet, this improvement is
significant since it corresponds to a reduction of more than
Supervised ImageNet 70.2 45.9 84.8
10% of the relative error w.r.t. the supervised model.
Wu et al. [48] ImageNet 39.2 36.3 -
RotNet ImageNet 32.7 32.6 60.9 Quality of the clusters. In addition to features, our
DeepCluster ImageNet 48.4 37.9 71.9 method provides a clustering of the input images. We eval-
RotNet YFCC100M 33.0 35.5 62.2 uate the quality of these clusters by measuring their cor-
DeepCluster YFCC100M 34.1 35.4 63.9 relation with existing partitions of the data. In particular,
YFCC100M comes with many different metadata. We con-
DeeperCluster YFCC100M 45.6 42.1 73.0 sider hashtags, users, camera and GPS coordinates. If an
image has several hashtags, we pick as label the least fre-
Table 4: Comparaison between DeeperCluster, RotNet and quent one in the total hashtag distribution. We also mea-
DeepCluster when pre-trained on curated and non-curated sure the correlation of ours clusters with labels predicted
dataset. We report the accuracy on several datasets of a by a classifier trained on ImageNet categories. We use a
linear classifier trained on top of features of the last con- ResNet-50 network [21], pre-trained on ImageNet, to clas-
volutional layer. All the methods use the same architecture. sify the YFCC100M images and we select those for which
DeepCluster does not scale to the full YFCC100M dataset, the confidence in prediction is higher than 75%. This evalu-
we thus train it on a random subset of 1.3M images. ation omits a large amount of the data but gives some insight
about the quality of our clustering in object classification.
In Figure 4, we show the evolution during training of the
curated datasets. We then report quantitative and qualitative normalized mutual information (NMI) between our cluster-
evaluations of the clusters obtained with DeeperCluster. ing and different metadata, and the predicted labels from
ImageNet. The higher the NMI, the more correlated our
Comparison with RotNet and DeepCluster. In Table 4, clusters are to the considered partition. For reference, we
we compare DeeperCluster with DeepCluster and RotNet compute the NMI for a clustering of RotNet features (as it
when a linear classifier is trained on top of the last convolu- corresponds to weights at initialization) and of a supervised
tional layer of a VGG-16 on several datasets. For reference, model. First, it is interesting to observe that our cluster-
we also report previously published numbers [48] with a ing is improving over time for every type of metadata. One
VGG-16 architecture. We average-pool the features of the important factor is that most of these commodities are cor-
last layer resulting in representations of 8192 dimensions. related since a given user takes pictures in specific places
Our approach outperforms both RotNet and DeepCluster, with probably a single camera and use a preferred fixed set
even when they are trained on curated datasets (except for of hashtags. Yet, these plots show that our model captures in
ImageNet classification task where DeepCluster trained on the input signal enough information to predict these meta-
ImageNet yields the best performance). More interestingly, data at least as well as the features trained with supervision.
we see that the quality of the dataset or its scale has little We visually assess the consistency of our clusters in Fig-
impact on RotNet while it has on DeepCluster. This is con- ure 5. We display 9 random images from 8 manually picked
firming that self-supervised methods are more robust than clusters. The first two clusters contain a majority of images
clustering to a change of dataset distribution. associated with tag from the head (first cluster) and from
the tail (second cluster) in the YFC100M dataset. Indeed,
418.538 YFC100M images are associated with the tag cat

2965
Tags Users GPS
0.204
Device ImageNet classifier
0.554
0.530
0.202 0.65
0.552 0.528 0.370

0.526
0.200 0.60
0.550
0.524 0.365
0.198
0.55
NMI

0.196

0.50
0.548 0.522
DeeperCluster
0.520 0.360 0.194

0.45
0.546 Supervised
0.518 0.192
RotNet

0.40
0.544 0.516 0.355 0.190

10 30 50 70 90
0.514
10 30 50 70 90 10 30 50 70 90 10 30 50 70 90 10 30 50 70 90
epochs epochs epochs epochs epochs

Figure 4: Normalized mutual information between our clustering and different sorts of metadata: hashtags, user IDs, geo-
graphic coordinates, and device types. We also plot the NMI with an ImageNet classifier labeling.

tag: cat tag: elephantparadelondon tag: always device: CanoScan

GPS: (43, 10) GPS: (−34, −151) GPS: (64, −20) GPS: (43, −104)

Figure 5: We randomly select 9 images per cluster and indicate the dominant cluster metadata. The bottom row depicts
clusters pure for GPS coordinates but unpure for user IDs. As expected, they turn out to correlate with tourist landmarks. No
metadata is used during training. For copyright reasons, we provide in Appendix the photographer username for each image.

whereas only 384 images contain the tag elephantparade- 6. Conclusion

london (0.0004% of the dataset). We also show a cluster
for which the dominant hashtag does not corrolate visually In this paper, we present an unsupervised approach
with the content of the cluster. As already mentioned, this specifically designed to deal with large amount of non-
database is non-curated and contains images that basically curated data. Our method is well-suited for distributed
do not depict anything semantic. The dominant metadata of training, which allows training on large datasets with 96M
the last cluster in the top row is the device ID CanoScan. As of images. With such amount of data, our approach sur-
this cluster is about drawings, its images have been mainly passes unsupervised methods trained on curated datasets,
taken with a scanner. Finally, the bottom row depict clusters which validates the potential of unsupervised learning in
that are pure for GPS coordinates but unpure for user IDs. applications where annotations are scarce or curation is not
It results in clusters of images taken by many different users trivial. Finally, we show that unsupervised pre-training im-
in the same place: tourist landmarks. proves the performance of a network trained on ImageNet.

Acknowledgement. Julien Mairal was funded by the

ERC grant number 714381 (SOLARIS project).

2966
References tions. In International Conference on Learning Representa-
tions (ICLR), 2018. 1, 2, 3
[1] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learn-
[16] Ross Girshick. Fast r-cnn. In Proceedings of the Interna-
ing to see by moving. In Proceedings of the International
tional Conference on Computer Vision (ICCV), 2015. 6
Conference on Computer Vision (ICCV), 2015. 2
[17] Lluis Gomez, Yash Patel, Marçal Rusiñol, Dimosthenis
[2] Miguel A Bautista, Artsiom Sanakoyeu, Ekaterina Karatzas, and CV Jawahar. Self-supervised learning of
Tikhoncheva, and Bjorn Ommer. Cliquecnn: Deep un- visual features through embedding images into text topic
supervised exemplar learning. In Advances in Neural spaces. In Proceedings of the Conference on Computer Vi-
Information Processing Systems (NIPS), 2016. 2 sion and Pattern Recognition (CVPR), 2017. 3
[3] Piotr Bojanowski and Armand Joulin. Unsupervised learn- [18] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noord-
ing by predicting noise. In Proceedings of the International huis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch,
Conference on Machine Learning (ICML), 2017. 2, 3, 5 Yangqing Jia, and Kaiming He. Accurate, large mini-
[4] Léon Bottou. Stochastic gradient descent tricks. In Neu- batch sgd: Training imagenet in 1 hour. arXiv preprint
ral networks: Tricks of the trade, pages 421–436. Springer, arXiv:1706.02677, 2017. 5
2012. 5 [19] Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan
[5] Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent Misra. Scaling and benchmarking self-supervised visual
J Della Pietra, and Jenifer C Lai. Class-based n-gram models representation learning. arXiv preprint arXiv:1905.01235,
of natural language. Computational linguistics, 18(4):467– 2019. 2
479, 1992. 4 [20] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking im-
[6] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and agenet pre-training. arXiv preprint arXiv:1811.08883, 2018.
Matthijs Douze. Deep clustering for unsupervised learning 1
of visual features. In Proceedings of the European Confer- [21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
ence on Computer Vision (ECCV), 2018. 1, 2, 3, 5, 6 Deep residual learning for image recognition. In Proceedings
[7] Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Ji- of the Conference on Computer Vision and Pattern Recogni-
tendra Malik. Human pose estimation with iterative error tion (CVPR), 2016. 7
feedback. In Proceedings of the Conference on Computer [22] Simon Jenni and Paolo Favaro. Self-supervised feature learn-
Vision and Pattern Recognition (CVPR), 2016. 1 ing by learning to spot artifacts. In Proceedings of the
[8] Liang-Chieh Chen, George Papandreou, Iasonas Kokki- Conference on Computer Vision and Pattern Recognition
nos, Kevin Murphy, and Alan L Yuille. Deeplab: Se- (CVPR), 2018. 2
mantic image segmentation with deep convolutional nets, [23] Armand Joulin, Laurens van der Maaten, Allan Jabri, and
atrous convolution, and fully connected crfs. arXiv preprint Nicolas Vasilache. Learning visual features from large
arXiv:1606.00915, 2016. 1 weakly supervised data. In Proceedings of the European
[9] Xinlei Chen and Abhinav Gupta. Webly supervised learning Conference on Computer Vision (ECCV), 2016. 1, 3
of convolutional networks. In Proceedings of the Interna- [24] Dahun Kim, Donghyeon Cho, Donggeun Yoo, and In So
tional Conference on Computer Vision (ICCV), 2015. 3 Kweon. Learning image representations by completing dam-
[10] Virginia R de Sa. Learning classification with unlabeled aged jigsaw puzzles. In Winter Conference on Applications
data. In Advances in Neural Information Processing Systems of Computer Vision (WACV), 2018. 2
(NIPS), 1994. 2 [25] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Re-
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, visiting self-supervised visual representation learning. arXiv
and Li Fei-Fei. Imagenet: A large-scale hierarchical image preprint arXiv:1901.09005, 2019. 2
database. In Proceedings of the Conference on Computer [26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
Vision and Pattern Recognition (CVPR), 2009. 1, 6 Imagenet classification with deep convolutional neural net-
[12] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsuper- works. In Advances in Neural Information Processing Sys-
vised visual representation learning by context prediction. In tems (NIPS), 2012. 5
Proceedings of the International Conference on Computer [27] Gustav Larsson, Michael Maire, and Gregory
Vision (ICCV), 2015. 1, 2, 3, 5, 6 Shakhnarovich. Learning representations for automatic
[13] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springen- colorization. In Proceedings of the European Conference on
berg, Martin Riedmiller, and Thomas Brox. Discriminative Computer Vision (ECCV), 2016. 2
unsupervised feature learning with exemplar convolutional [28] Gustav Larsson, Michael Maire, and Gregory
neural networks. IEEE transactions on pattern analysis and Shakhnarovich. Colorization as a proxy task for visual
machine intelligence, 38(9):1734–1747, 2016. 2 understanding. In Proceedings of the Conference on Com-
[14] Mark Everingham, Luc Van Gool, Christopher KI Williams, puter Vision and Pattern Recognition (CVPR), 2017. 2,
John Winn, and Andrew Zisserman. The pascal visual object 5
classes (voc) challenge. International journal of computer [29] Renjie Liao, Alex Schwing, Richard Zemel, and Raquel Ur-
vision, 88(2):303–338, 2010. 5 tasun. Learning deep parsimonious representations. In Ad-
[15] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Un- vances in Neural Information Processing Systems (NIPS),
supervised representation learning by predicting image rota- 2016. 2

2967
[30] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, [44] Xiaolong Wang, Kaiming He, and Abhinav Gupta. Tran-
Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, sitive invariance for self-supervised visual representation
and Laurens van der Maaten. Exploring the limits of weakly learning. In Proceedings of the International Conference on
supervised pretraining. In Proceedings of the European Con- Computer Vision (ICCV), 2017. 2, 5
ference on Computer Vision (ECCV), 2018. 1, 3 [45] Xiaosong Wang, Le Lu, Hoo-Chang Shin, Lauren Kim, Mo-
[31] Aravindh Mahendran, James Thewlis, and Andrea Vedaldi. hammadhadi Bagheri, Isabella Nogues, Jianhua Yao, and
Cross pixel optical flow similarity for self-supervised learn- Ronald M Summers. Unsupervised joint mining of deep fea-
ing. arXiv preprint arXiv:1807.05636, 2018. 2, 5 tures and image labels for large-scale radiology image cate-
[32] Karl Ni, Roger Pearce, Kofi Boakye, Brian Van Essen, gorization and scene recognition. In Winter Conference on
Damian Borth, Barry Chen, and Eric Wang. Large-scale Applications of Computer Vision (WACV), 2017. 2
deep learning on the yfcc100m dataset. arXiv preprint [46] Philippe Weinzaepfel, Jerome Revaud, Zaid Harchaoui, and
arXiv:1502.03409, 2015. 3 Cordelia Schmid. Deepflow: Large displacement optical
[33] Mehdi Noroozi and Paolo Favaro. Unsupervised learning flow with deep matching. In Proceedings of the International
of visual representations by solving jigsaw puzzles. In Pro- Conference on Computer Vision (ICCV), 2013. 1
ceedings of the European Conference on Computer Vision [47] Tobias Weyand, Ilya Kostrikov, and James Philbin. Planet-
(ECCV), 2016. 2, 3, 4 photo geolocation with convolutional neural networks. In
[34] Mehdi Noroozi, Ananth Vinjimoor, Paolo Favaro, and Proceedings of the European Conference on Computer Vi-
Hamed Pirsiavash. Boosting self-supervised learning via sion (ECCV), 2016. 3
knowledge transfer. In Proceedings of the Conference on [48] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin.
Computer Vision and Pattern Recognition (CVPR), 2018. 2 Unsupervised feature learning via non-parametric instance
[35] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, discrimination. In Proceedings of the Conference on Com-
and Bharath Hariharan. Learning features by watching ob- puter Vision and Pattern Recognition (CVPR), 2018. 2, 5,
jects move. In Proceedings of the Conference on Computer 7
Vision and Pattern Recognition (CVPR), 2017. 2 [49] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised
[36] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor deep embedding for clustering analysis. In Proceedings of
Darrell, and Alexei A Efros. Context encoders: Feature the International Conference on Machine Learning (ICML),
learning by inpainting. In Proceedings of the Conference 2016. 2, 3
on Computer Vision and Pattern Recognition (CVPR), 2016. [50] Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsuper-
2, 3 vised learning of deep representations and image clusters. In
[37] Mattis Paulin, Matthijs Douze, Zaid Harchaoui, Julien Proceedings of the Conference on Computer Vision and Pat-
Mairal, Florent Perronin, and Cordelia Schmid. Local con- tern Recognition (CVPR), 2016. 2, 3
volutional features with unsupervised training for image re- [51] Liheng Zhang, Guo-Jun Qi, Liqiang Wang, and Jiebo Luo.
trieval. In Proceedings of the International Conference on Aet vs. aed: Unsupervised representation learning by auto-
Computer Vision (ICCV), 2015. 5 encoding transformations rather than data. arXiv preprint
[38] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. arXiv:1901.04596, 2019. 1, 3, 4
Faster r-cnn: Towards real-time object detection with region [52] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful
proposal networks. In Advances in Neural Information Pro- image colorization. In Proceedings of the European Confer-
cessing Systems (NIPS), 2015. 1 ence on Computer Vision (ECCV), 2016. 2, 3, 6
[39] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, [53] Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain
and Stefan Carlsson. Cnn features off-the-shelf: an astound- autoencoders: Unsupervised learning by cross-channel pre-
ing baseline for recognition. In CVPR workshops, 2014. 1 diction. In Proceedings of the Conference on Computer Vi-
[40] Karen Simonyan and Andrew Zisserman. Very deep convo- sion and Pattern Recognition (CVPR), 2017. 2
lutional networks for large-scale image recognition. arXiv [54] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Tor-
preprint arXiv:1409.1556, 2014. 5 ralba, and Aude Oliva. Learning deep features for scene
[41] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhi- recognition using places database. In Advances in Neural
nav Gupta. Revisiting unreasonable effectiveness of data in Information Processing Systems (NIPS), 2014. 6
deep learning era. In Proceedings of the International Con-
ference on Computer Vision (ICCV), pages 843–852, 2017.
1, 3
[42] Bart Thomee, David A Shamma, Gerald Friedland, Ben-
jamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and
Li-Jia Li. Yfcc100m: The new data in multimedia research.
arXiv preprint arXiv:1503.01817, 2015. 1, 5
[43] Xiaolong Wang and Abhinav Gupta. Unsupervised learning
of visual representations using videos. In Proceedings of the
International Conference on Computer Vision (ICCV), 2015.
2, 4, 5

2968

ASME A17.1/CSA B44 Handbook
0% (1)
ASME A17.1/CSA B44 Handbook
7 pages
Minerals: Mill Circuit Duty Pumps
No ratings yet
Minerals: Mill Circuit Duty Pumps
8 pages
LMS Band Handbook 2014
No ratings yet
LMS Band Handbook 2014
13 pages
Olympic Sports Matching Quiz
No ratings yet
Olympic Sports Matching Quiz
2 pages
Marketing Management - Cisco Case Study
100% (1)
Marketing Management - Cisco Case Study
1 page
Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey
No ratings yet
Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey
24 pages
A Survey On Semi-, Self - and Unsupervised Learning For Image Classification
No ratings yet
A Survey On Semi-, Self - and Unsupervised Learning For Image Classification
33 pages
Contrastive Learning
No ratings yet
Contrastive Learning
10 pages
2016-CVPR-Joint Unsupervised Learning of Deep Representations and Image Clusters
No ratings yet
2016-CVPR-Joint Unsupervised Learning of Deep Representations and Image Clusters
10 pages
Unsupervised Feature Learning Via Non-Parametric Instance Discrimination
No ratings yet
Unsupervised Feature Learning Via Non-Parametric Instance Discrimination
10 pages
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
A Simple Baseline For Low-Budget Active Learning
No ratings yet
A Simple Baseline For Low-Budget Active Learning
20 pages
SCAN: Learning To Classify Images Without Labels: 1 Introduction and Prior Work
No ratings yet
SCAN: Learning To Classify Images Without Labels: 1 Introduction and Prior Work
26 pages
DEEP LEARNING TECHNIQUES: CLUSTER ANALYSIS and PATTERN RECOGNITION with NEURAL NETWORKS. Examples with MATLAB
From Everand
DEEP LEARNING TECHNIQUES: CLUSTER ANALYSIS and PATTERN RECOGNITION with NEURAL NETWORKS. Examples with MATLAB
César Pérez López
No ratings yet
Make 04 00002 v2
No ratings yet
Make 04 00002 v2
20 pages
Unsupervised
No ratings yet
Unsupervised
11 pages
Unsupervised Embedding Learning Via Invariant and Spreading Instance Feature
No ratings yet
Unsupervised Embedding Learning Via Invariant and Spreading Instance Feature
10 pages
Sagar Institute of Research & Technology Department of Electronics & Communication
No ratings yet
Sagar Institute of Research & Technology Department of Electronics & Communication
13 pages
Cat and Dog 1
No ratings yet
Cat and Dog 1
9 pages
1 s2.0 S0925231221009486 Main
No ratings yet
1 s2.0 S0925231221009486 Main
7 pages
Big Self-Supervised Models Are Strong Semi-Supervised Learners
No ratings yet
Big Self-Supervised Models Are Strong Semi-Supervised Learners
17 pages
Unsupervised Embedding Learning Via Invariant and Spreading Instance Feature
No ratings yet
Unsupervised Embedding Learning Via Invariant and Spreading Instance Feature
11 pages
Big Self-Supervised Models Are Strong Semi-Supervised Learners
No ratings yet
Big Self-Supervised Models Are Strong Semi-Supervised Learners
18 pages
Automated Image Data Preprocessing With Deep Reinforcement Learning
No ratings yet
Automated Image Data Preprocessing With Deep Reinforcement Learning
9 pages
DATA MINING and MACHINE LEARNING. CLASSIFICATION PREDICTIVE TECHNIQUES: NAIVE BAYES, NEAREST NEIGHBORS and NEURAL NETWORKS: Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING. CLASSIFICATION PREDICTIVE TECHNIQUES: NAIVE BAYES, NEAREST NEIGHBORS and NEURAL NETWORKS: Examples with MATLAB
César Pérez López
No ratings yet
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
No ratings yet
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
13 pages
Dinov 2
No ratings yet
Dinov 2
31 pages
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
No ratings yet
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
23 pages
Deep Learning With Nonparametric Clustering: Gang Chen
No ratings yet
Deep Learning With Nonparametric Clustering: Gang Chen
14 pages
DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES.: Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES.: Examples with MATLAB
César Pérez López
No ratings yet
Best-4 - Topological Gradient Based Competitive Learning
No ratings yet
Best-4 - Topological Gradient Based Competitive Learning
12 pages
Unsupervised Learning To Aid Labelling For Supervised Learning
No ratings yet
Unsupervised Learning To Aid Labelling For Supervised Learning
29 pages
An Introductory Note On Machine Learning. A V Narasimhadhan
No ratings yet
An Introductory Note On Machine Learning. A V Narasimhadhan
2 pages
"I C U N N ": Mage Lassification Sing Eural Etworks
No ratings yet
"I C U N N ": Mage Lassification Sing Eural Etworks
15 pages
Review On Self-Supervised Image Recognition Using Deep Neural
No ratings yet
Review On Self-Supervised Image Recognition Using Deep Neural
22 pages
6333 Regularization With Stochastic Transformations and Perturbations For Deep Semi Supervised Learning
No ratings yet
6333 Regularization With Stochastic Transformations and Perturbations For Deep Semi Supervised Learning
9 pages
AAI Module 4
No ratings yet
AAI Module 4
13 pages
Contextual Image Classification: Understanding Visual Data for Effective Classification
From Everand
Contextual Image Classification: Understanding Visual Data for Effective Classification
Fouad Sabry
No ratings yet
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
No ratings yet
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
21 pages
Self Supervised Learning
No ratings yet
Self Supervised Learning
5 pages
How Are We Going To Build - Review-1
No ratings yet
How Are We Going To Build - Review-1
8 pages
DATA MINING AND MACHINE LEARNING. PREDICTIVE TECHNIQUES: REGRESSION, GENERALIZED LINEAR MODELS, SUPPORT VECTOR MACHINE AND NEURAL NETWORKS
From Everand
DATA MINING AND MACHINE LEARNING. PREDICTIVE TECHNIQUES: REGRESSION, GENERALIZED LINEAR MODELS, SUPPORT VECTOR MACHINE AND NEURAL NETWORKS
César Pérez López
No ratings yet
Learning Without Forgetting
No ratings yet
Learning Without Forgetting
13 pages
Fu Relaxing From Vocabulary ICCV 2015 Paper
No ratings yet
Fu Relaxing From Vocabulary ICCV 2015 Paper
9 pages
Dense Constrastive Learning For Self Supervised Visual Pre Training
No ratings yet
Dense Constrastive Learning For Self Supervised Visual Pre Training
11 pages
Fuzzy Preprocessing For Semi-Supervised Image Classification in Modern Industry
No ratings yet
Fuzzy Preprocessing For Semi-Supervised Image Classification in Modern Industry
11 pages
2021 CASTELLANOS - Artificial Intelligence - Unsupervised Deep Generative Adversarial Based Methodology For Automatic
No ratings yet
2021 CASTELLANOS - Artificial Intelligence - Unsupervised Deep Generative Adversarial Based Methodology For Automatic
6 pages
Application of Transfer Learning For Image Classification On Dataset With Not Mutually Exclusive Classes
No ratings yet
Application of Transfer Learning For Image Classification On Dataset With Not Mutually Exclusive Classes
4 pages
Self-Taught Learning: Transfer Learning From Unlabeled Data
No ratings yet
Self-Taught Learning: Transfer Learning From Unlabeled Data
8 pages
Local Aggregation For Unsupervised Learning of Visual Embeddings
No ratings yet
Local Aggregation For Unsupervised Learning of Visual Embeddings
13 pages
AML - Lecture - 11 - 19nov24
No ratings yet
AML - Lecture - 11 - 19nov24
103 pages
Weakly Supervised Contrastive Learning
No ratings yet
Weakly Supervised Contrastive Learning
10 pages
Generative Pretraining From Pixels
No ratings yet
Generative Pretraining From Pixels
13 pages
A Study On Using Data Clustering For Feature Extraction To Improve The Quality of Classification
No ratings yet
A Study On Using Data Clustering For Feature Extraction To Improve The Quality of Classification
35 pages
Revisting
No ratings yet
Revisting
13 pages
LSUN: Construction of A Large-Scale Image Dataset Using Deep Learning With Humans in The Loop
No ratings yet
LSUN: Construction of A Large-Scale Image Dataset Using Deep Learning With Humans in The Loop
9 pages
Module 3
No ratings yet
Module 3
26 pages
Exer8 TresMarias
No ratings yet
Exer8 TresMarias
3 pages
Week 09
No ratings yet
Week 09
6 pages
Why Does Unsupervised Pre-Training Help Deep Learning?
No ratings yet
Why Does Unsupervised Pre-Training Help Deep Learning?
36 pages
Towards Open Set Deep Networks: Abhijit Bendale, Terrance E. Boult University of Colorado at Colorado Springs
No ratings yet
Towards Open Set Deep Networks: Abhijit Bendale, Terrance E. Boult University of Colorado at Colorado Springs
14 pages
Unsupervised Clustering For Deep Learnin
No ratings yet
Unsupervised Clustering For Deep Learnin
25 pages
Generative Pretraining From Pixels V2
No ratings yet
Generative Pretraining From Pixels V2
12 pages
Caron Emerging Properties in Self-Supervised Vision Transformers ICCV 2021 Paper
No ratings yet
Caron Emerging Properties in Self-Supervised Vision Transformers ICCV 2021 Paper
11 pages
Machine Learning Tut
No ratings yet
Machine Learning Tut
68 pages
Clinical Perspectives On Wheelchair Selection: Factors in Functional Assessment
No ratings yet
Clinical Perspectives On Wheelchair Selection: Factors in Functional Assessment
14 pages
SRS-DNN: A Deep Neural Network With Strengthening Response Sparsity
No ratings yet
SRS-DNN: A Deep Neural Network With Strengthening Response Sparsity
16 pages
110id0625 18 PDF
No ratings yet
110id0625 18 PDF
40 pages
Assignment 5
No ratings yet
Assignment 5
2 pages
Payslip Explanation
No ratings yet
Payslip Explanation
2 pages
Quality Improvement Programme: Status Report 2009-2010
No ratings yet
Quality Improvement Programme: Status Report 2009-2010
7 pages
CSEC English A June 2023 Paper 1
No ratings yet
CSEC English A June 2023 Paper 1
16 pages
Bodnar - Saving Private Ryan and Postwar PDF
No ratings yet
Bodnar - Saving Private Ryan and Postwar PDF
14 pages
Annual Report INDS 2018
No ratings yet
Annual Report INDS 2018
155 pages
Impact of Technology On Leadership Practices in Organizations
No ratings yet
Impact of Technology On Leadership Practices in Organizations
8 pages
My Childhood 6. My Childhood 6. My Childhood 6. My Childhood 6. My Childhood
No ratings yet
My Childhood 6. My Childhood 6. My Childhood 6. My Childhood 6. My Childhood
14 pages
Month of October: Commemoration of Our Venerable Father Gall, Wonder-Worker of Switzerland
No ratings yet
Month of October: Commemoration of Our Venerable Father Gall, Wonder-Worker of Switzerland
5 pages
Windows Registry
No ratings yet
Windows Registry
17 pages
Double Layer Tile Press
100% (1)
Double Layer Tile Press
6 pages
DRUM LLVL Manual
No ratings yet
DRUM LLVL Manual
30 pages
Denim Present and Future of Bangladesh (For Selim)
33% (3)
Denim Present and Future of Bangladesh (For Selim)
12 pages
Diy Survival Containers From Co2 Cartige: Technology Workshop Craft Home Food Play Outside Costumes
No ratings yet
Diy Survival Containers From Co2 Cartige: Technology Workshop Craft Home Food Play Outside Costumes
6 pages
How To Program Your Healy Resonance With Custom Vibration Programs Rev080621
No ratings yet
How To Program Your Healy Resonance With Custom Vibration Programs Rev080621
28 pages
Pamphlet Stitch Bookbinding Class Handout
No ratings yet
Pamphlet Stitch Bookbinding Class Handout
6 pages
Practical 3
No ratings yet
Practical 3
10 pages
Besc1071 Besc1392 Lect 5 Lecture 2022-1
No ratings yet
Besc1071 Besc1392 Lect 5 Lecture 2022-1
89 pages
Jeeves Econo Lift 100 20x20x30 Front Back PDF
No ratings yet
Jeeves Econo Lift 100 20x20x30 Front Back PDF
1 page
What Is Caste
No ratings yet
What Is Caste
2 pages
Math4 Q4 Mod4
100% (1)
Math4 Q4 Mod4
18 pages
2022 Discrete Mathematics Lecturer
No ratings yet
2022 Discrete Mathematics Lecturer
167 pages
Notes - Union-Find Disjoint Sets (UFDS)
No ratings yet
Notes - Union-Find Disjoint Sets (UFDS)
1 page
OpTransactionHistory25 02 2025
No ratings yet
OpTransactionHistory25 02 2025
10 pages
Weekly PDF
No ratings yet
Weekly PDF
353 pages
SInclair Presentation Dec 2021
No ratings yet
SInclair Presentation Dec 2021
66 pages

Unsupervised Pre-Training of Image Features On Non-Curated Data

Uploaded by

Unsupervised Pre-Training of Image Features On Non-Curated Data

Uploaded by

Unsupervised Pre-Training of Image Features on Non-Curated Data

Mathilde Caron1,2 , Piotr Bojanowski1 , Julien Mairal2 , and Armand Joulin1

signment changes during training along with the visual rep-

ever, summing these losses implicitly assumes that classi-

or by only retraining the last three fully connected layers 60

measure of the quality of the evaluated features since fewer

For classification, we train the models for 150k iterations, 20 20

starting with a learning rate of 0.002 decayed by a factor 10 10

10 every 20k iterations, and we report results averaged over c1 c2 c3 c4 c5 c1 c2 c3 c4 c5

tag: cat tag: elephantparadelondon tag: always device: CanoScan

whereas only 384 images contain the tag elephantparade- 6. Conclusion

Acknowledgement. Julien Mairal was funded by the

You might also like