
The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

Ranking Info Noise Contrastive Estimation: Boosting Contrastive Learning via Ranked Positives
David T. Hoffmann*,1,2, Nadine Behrmann*,1, Juergen Gall3, Thomas Brox2, Mehdi Noroozi1
1 Bosch Center for Artificial Intelligence, 2 University of Freiburg, 3 University of Bonn
{david.hoffmann2, nadine.behrmann}@de.bosch.com

Abstract

This paper introduces Ranking Info Noise Contrastive Estimation (RINCE), a new member in the family of InfoNCE losses that preserves a ranked ordering of positive samples. In contrast to the standard InfoNCE loss, which requires a strict binary separation of the training pairs into similar and dissimilar samples, RINCE can exploit information about a similarity ranking for learning a corresponding embedding space. We show that the proposed loss function learns favorable embeddings compared to the standard InfoNCE whenever at least noisy ranking information can be obtained or when the definition of positives and negatives is blurry. We demonstrate this for a supervised classification task with additional superclass labels and noisy similarity scores. Furthermore, we show that RINCE can also be applied to unsupervised training, with experiments on unsupervised representation learning from videos. In particular, the embedding yields higher classification accuracy and retrieval rates, and performs better in out-of-distribution detection than the standard InfoNCE loss.

Introduction

Contrastive learning recently triggered progress in self-supervised representation learning. Most existing variants require a strict definition of positive and negative pairs used in the InfoNCE loss, or simply ignore samples that cannot be clearly classified as either one or the other (Zhao et al. 2021). Contrastive learning forces the network to impose a similar structure in the feature space by pulling the positive pairs closer to each other while keeping the negatives apart.

This binary separation into positives and negatives can be limiting whenever the boundary between those is blurry. For example, different samples from the same classes are used as negatives for instance recognition, which prevents the network from exploiting their similarities. One way to address this issue is supervised contrastive learning (SCL) (Khosla et al. 2020), which takes class labels into account when making pairs: samples from the same class are treated as positives, while samples of different classes pose negatives. However, even in this optimal setting with ground truth labels, the problem persists – semantically similar classes share many visual features (Deselaers and Ferrari 2011) with the query – and some samples cannot clearly be categorized as either positive or negative, e.g. the dog breeds in Fig. 1. Treating them as positives makes the network invariant towards the distinct attributes of the samples. As a result, the network struggles to distinguish between different dog breeds. If they are treated as negatives, the network cannot exploit their similarities. For transfer learning to other tasks, e.g. out-of-distribution detection, a clean structure of the embedding space, s.t. samples sharing certain attributes will be closer, is beneficial.

Another example comes from video representation learning: In addition to spatial crops as for images, videos allow to create temporal crops, i.e. creating a sample from different frames of the same video. To date, it is an open point of discussion whether temporally different clips from the same video should be treated as positive (Feichtenhofer et al. 2021) or negative (Dave et al. 2021). Treating them as positives will force the network to be invariant towards changes over time, but treating them as negatives will encourage the network to ignore the features that stay constant. In summary, a binary classification into positive and negative will, for most applications, lead to a sub-optimal solution. To the best of our knowledge, a method that benefits from a fine-grained definition of negatives, positives and various states in between is missing.

As a remedy, we propose Ranking Info Noise Contrastive Estimation (RINCE). RINCE supports a fine-grained definition of negatives and positives. Thus, methods trained with RINCE can take advantage of various kinds of similarity measures. For example, similarity measures can be based on class similarities, gradual changes of content within videos, pretrained feature embeddings, or even camera positions in a multi-view setting. In this work, we demonstrate class similarities and gradual changes in videos as examples.

RINCE puts higher emphasis on similarities between related samples than SCL and cross-entropy, resulting in a richer representation. We show that RINCE learns to represent semantic similarities in the embedding space, s.t. more similar samples are closer than less similar samples. Key to this is a new InfoNCE-based loss, which enforces gradually decreasing similarity with increasing rank of the samples. The representation learned with RINCE on Cifar-100 improves significantly over cross-entropy for classification, retrieval and OOD detection, and outperforms the stronger SCL baseline (Khosla et al. 2020).

* These authors contributed equally.
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Figure 1: Contrastive learning should not be binary. In many scenarios a strict separation of samples into “positives” and “negatives” is not possible. So far, this grey zone (left) was neglected, leading to sub-optimal results. We propose a solution to this problem, which embeds identical samples very close and similar samples close in the embedding space (right).

Here, improvements are particularly large for retrieval and OOD detection. To obtain ranked positives for RINCE, we use the superclasses of Cifar-100. Further, we demonstrate that RINCE works on large scale datasets and in more general applications, where a ranking of samples is not initially given and contains noise. To this end, we show that RINCE outperforms our baselines on ImageNet-100 using only noisy ranks provided by an off-the-shelf natural language processing model (Liu et al. 2019). Finally, we showcase that RINCE can be applied to the fully unsupervised setting, by training RINCE unsupervised on videos, treating temporally far clips as weak positives. This results in a higher accuracy on the downstream task of video action classification than our baselines and even outperforms recent video representation learning methods.

In summary, our contributions are: 1) We propose a new InfoNCE-based loss that replaces the binary definition of positives and negatives by a ranked definition of similarity. 2) We study the properties of RINCE in a controlled supervised setting. Here, we show mild improvements on Cifar-100 classification and sensible improvements for OOD detection. 3) We show that RINCE can handle significant noise in the similarity scores and leads to improvements on large scale datasets. 4) We demonstrate the applicability of RINCE to self-supervised learning with noisy similarities in a video representation learning task and show improvements over InfoNCE in all downstream tasks. 5) Code is available at https://github.com/boschresearch/rince. The Sup. Mat. can be found in (Hoffmann et al. 2022).

Related Works

Contrastive Learning. Contrastive learning has recently advanced the field of self-supervised learning. Current state-of-the-art methods use instance recognition, originally proposed by (Dosovitskiy et al. 2016), where the task is to recognize an instance under various transformations. Modern instance recognition methods utilize InfoNCE (van den Oord, Li, and Vinyals 2018), which was first proposed as N-pair loss in (Sohn 2016). It maximizes the similarity of positive pairs – which are obtained from two different views of the same instance – while minimizing the similarity of negative pairs, i.e. views of different instances. Different views can be generated from multi-modal data (Tian, Krishnan, and Isola 2020), permutations (Misra and van der Maaten 2020), or augmentations (Chen et al. 2020a). The negative pairs play a vital role in contrastive learning as they prevent shortcuts and collapsed solutions. In order to provide challenging negatives, (He et al. 2020) introduce a memory bank with a momentum encoder, which allows storing a large set of negatives. Other approaches explicitly construct hard negatives from patches in the same image (van den Oord, Li, and Vinyals 2018) or temporal negatives in videos (Behrmann, Gall, and Noroozi 2021). More recent works omit negative pairs completely (Chen and He 2021; Grill et al. 2020).

In the above cases, positive pairs are obtained from the same instance, and different instances serve as negatives even when they share the same semantics. Previous work addresses this issue by allowing multiple positive samples: (Miech et al. 2020) allows several positive candidates within a video, (Han, Xie, and Zisserman 2020) and (Caron et al. 2020) obtain positives by clustering the feature space, whereas (Khosla et al. 2020) uses class labels to define a set of positives. False negatives are eliminated from the InfoNCE loss by (Huynh et al. 2020), either using labels or a heuristic. Integrating multiple positives in contrastive learning is not straightforward: the set of positives can be noisy and include some samples that are more related than others. In this work, we provide a tool to properly incorporate such samples.

Supervised Contrastive Learning. Labelled training data has been used in many recent works on contrastive learning. (Romijnders et al. 2021) use pseudo labels obtained from a detector, (Tian et al. 2020) use labels to construct better views, and (Neill and Bollegala 2021) use the similarity of class word embeddings to draw hard negatives. The term supervised contrastive learning (SCL) is introduced in (Khosla et al. 2020), showing that SCL outperforms standard cross-entropy. In the SCL setting ground truth labels are available and can be used to define positives and negatives. Commonly, samples from the same class are treated as positives, while instances from all other classes are treated as negatives.
(Khosla et al. 2020) find that the SCL loss function outperforms cross-entropy in the supervised setting. In contrast, (Huynh et al. 2020) aim for an unsupervised detection of false negatives. They propose to only eliminate false negatives from the InfoNCE loss, which leads to the best results for noisy labels.

Along these lines, (Winkens et al. 2020) show that the InfoNCE loss is better suited for out-of-distribution detection than cross-entropy. Here, we introduce a method to deal with non-binary similarity labels, study different versions of it in the SCL setting free from label noise, and show that we get similar results in more noisy and even unsupervised settings.

Ranking. Learning to Rank has been studied extensively (Burges et al. 2005; Cakir et al. 2019; Cao et al. 2007; Liu 2009). These works aim for downstream applications that require ranking, e.g. image or document retrieval, Natural Language Processing and Data Mining. In contrast, we are not interested in the ranking per se, but rather use the ranking task to improve the learned representation.

Some approaches in the field of metric learning use ranking losses to learn a feature embedding: Contrastive losses such as the triplet loss (Weinberger, Blitzer, and Saul 2006) or the N-pair loss (Sohn 2016) can be interpreted as ranking the positive higher w.r.t. the anchor than the negative. For instance, (Tschannen et al. 2020) use the triplet loss to learn representations, but focus on learning invariances. (Ge 2018) learn a hierarchy from data for hard example mining to improve the triplet loss. Further, these approaches only consider two ranks, whereas our method can work with multiple ranks.

Methods

InfoNCE

We start with the most basic form of the InfoNCE. In this setting, two different views of the same data – e.g. two different augmentations of the same image – are pulled together in feature space, while pushing views of different samples apart. More specifically, for a query q, a single positive p and a set of negatives N = {n_1, ..., n_k} is given. The views are fed to an encoder network f, followed by a projection head g (Chen et al. 2020a). To measure the similarity between a pair of features we use the cosine similarity cos_sim. Overall, the task is to train a critic h(x, y) = cos_sim(g(f(x)), g(f(y))) using the loss:

L = -\log \frac{\exp(h(q,p)/\tau)}{\exp(h(q,p)/\tau) + \sum_{n \in N} \exp(h(q,n)/\tau)},   (1)

where τ is a temperature parameter (Chen et al. 2020a). The above loss relies on the assumption that a single positive pair is available. One drawback of this approach is that all other samples are treated as negatives, even if they are semantically close to the query. Potential solutions include removing them from the negatives (Zhao et al. 2021) or adding them to the positives (Khosla et al. 2020), which we denote by P = {p_1, ..., p_l}. In other cases, we naturally have access to more than one positive, e.g. we can sample several clips from a single video, see Fig. 3. Having multiple positives per query leaves two options, which we discuss in the following.

L^out Positives. A straightforward approach to include multiple positives is to compute Eq. (1) for each of them, i.e. take the sum over positives outside of the log. This enforces similarity between all positives during training, which suits a clean set of positives well:

L^{out} = -\sum_{p \in P} \log \frac{\exp(h(q,p)/\tau)}{\exp(h(q,p)/\tau) + \sum_{n \in N} \exp(h(q,n)/\tau)}.   (2)

However, the set of positives can be noisy, e.g. sampling a temporally distant clip may include sub-optimal positives due to drastic changes in the video.

L^in Positives. An alternative approach, which is more robust to noise or inaccurate samples (Miech et al. 2020), is to take the sum inside the log, Eq. (3). To minimize this loss, the network is not forced to set a high similarity to all pairs. It can neglect the noisy/false positives, given that a sufficiently large similarity is set for the true positives, see Tab. 4. However, if a discrepancy between positives exists, it results in a degenerate solution of discarding hard positives. For instance, consider supervised learning where both augmentations and class positives are available for a given query: the class positives, which are harder to optimize, can be ignored.

L^{in} = -\log \frac{\sum_{p \in P} \exp(h(q,p)/\tau)}{\sum_{p \in P} \exp(h(q,p)/\tau) + \sum_{n \in N} \exp(h(q,n)/\tau)}.   (3)

The above methods assume a binary set of positives and negatives. Thus, they cannot exploit the similarity of positives and negatives. In the following, we discuss the proposed ranking version of InfoNCE that allows us to preserve the order of the positives and benefit from the additional information.

RINCE: Ranking InfoNCE

Let us assume that for a given query sample q, we have access to a set of ranked positives in the form of P_1, ..., P_r, where P_i includes the positives of rank i. Let us also assume N is a set of negatives. Our objective is to train a critic h such that:

h(q, p_1) > \cdots > h(q, p_r) > h(q, n) \quad \forall p_i \in P_i, \; n \in N.   (4)

Note that P_i can contain multiple positives. For ease of notation we omit these indices. To impose the desired ranking presented by the positive sets, we use InfoNCE in a recursive manner where we start with the first set of positives, treat the remaining positives as negatives, drop the current positive, and move to the next. We repeat this procedure until there are no positives left. More precisely, the loss function reads L^rank = \sum_{i=1}^{r} \ell_i, where

\ell_i = -\log \frac{\sum_{p \in P_i} \exp(h(q,p)/\tau_i)}{\sum_{p \in \cup_{j \geq i} P_j} \exp(h(q,p)/\tau_i) + \sum_{n \in N} \exp(h(q,n)/\tau_i)}   (5)

and τ_i < τ_{i+1}. Eq. (5) denotes the L^in version of InfoNCE for positives of the same rank; other variants are summarized in Tab. 1. The rationale behind this loss is simple: the i-th loss is optimized when I) exp(h(q, p_i)/τ_i) ≫ 0, II) exp(h(q, p_j)/τ_i) → 0 for j > i, and III) exp(h(q, n)/τ_i) → 0 for all i, j, n. I) and II) are competing across the losses: ℓ_i entails exp(h(q, p_{i+1})/τ_i) → 0 but ℓ_{i+1} requires exp(h(q, p_{i+1})/τ_{i+1}) ≫ 0. This requires the model to trade off the respective loss terms, resulting in a ranking of positives h(q, p_i) > h(q, p_{i+1}).

Naming | # positives per rank | loss
RINCE-uni | single | Eq. (1)
RINCE-out | multiple | Eq. (2)
RINCE-in | multiple | Eq. (3)
RINCE-out-in | multiple | Eq. (2) (ℓ_1); Eq. (3) (ℓ_i, i > 1)

Table 1: Different variants of RINCE. For the exact loss functions see the Sup. Mat.

In the following we explain the intuition behind our choice of τ values based on the analyses of (Wang and Liu 2021); for a more detailed analysis see the Sup. Mat. A low temperature in the InfoNCE loss results in a larger relative penalty on the high similarity regions, i.e. hard negatives. As the temperature increases, the relative penalty distributes more uniformly, penalizing all negatives equally. A low temperature in ℓ_i allows the network to concentrate on forcing h(q, p_i) > h(q, p_{i+1}), ignoring easy negatives. A higher temperature on ℓ_r relaxes the relative penalty of negatives with respect to p_r so that the network can enforce h(q, p_r) > h(q, n).
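To make the losses concrete, the following is a minimal PyTorch sketch of Eq. (1)-(3) and of L^rank with the Eq. (5) (RINCE-in) terms. It assumes features that are already L2-normalized projection-head outputs, so a dot product equals cos_sim; the function names, tensor shapes and the example temperatures are our own illustrative choices and not taken from the authors' released code (see the repository linked above for the official implementation).

```python
# Hedged sketch of Eq. (1)-(3) and Eq. (5). All feature vectors are assumed to be
# L2-normalized outputs of the projection head g(f(.)), so dot products equal the
# cosine-similarity critic h. Function names and temperature values are illustrative.
import torch
import torch.nn.functional as F

def info_nce(q, p, neg, tau=0.1):
    """Eq. (1): single positive p. Shapes: q (d,), p (d,), neg (k, d)."""
    pos = torch.exp(torch.dot(q, p) / tau)
    negs = torch.exp(neg @ q / tau).sum()
    return -torch.log(pos / (pos + negs))

def l_out(q, pos, neg, tau=0.1):
    """Eq. (2): sum over positives outside the log. pos: (m, d)."""
    return torch.stack([info_nce(q, p, neg, tau) for p in pos]).sum()

def l_in(q, pos, neg, tau=0.1):
    """Eq. (3): sum over positives inside the log."""
    num = torch.exp(pos @ q / tau).sum()
    negs = torch.exp(neg @ q / tau).sum()
    return -torch.log(num / (num + negs))

def rince(q, ranked_pos, neg, taus):
    """Eq. (5): L^rank = sum_i l_i over ranked positive sets P_1, ..., P_r.

    ranked_pos: list of (m_i, d) tensors, rank 1 first; taus must increase with rank.
    """
    neg_sim = neg @ q
    loss = q.new_zeros(())
    for i, tau in enumerate(taus):
        pos_sim = ranked_pos[i] @ q                            # positives of rank i (numerator)
        pool_sim = torch.cat([p @ q for p in ranked_pos[i:]])  # ranks j >= i in the denominator
        num = torch.exp(pos_sim / tau).sum()
        den = torch.exp(pool_sim / tau).sum() + torch.exp(neg_sim / tau).sum()
        loss = loss + (-torch.log(num / den))
    return loss

# toy usage: two ranks (e.g. same class, same superclass), tau_1 < tau_2 (values illustrative)
d = 128
q = F.normalize(torch.randn(d), dim=0)
ranked = [F.normalize(torch.randn(m, d), dim=1) for m in (2, 4)]
neg = F.normalize(torch.randn(64, d), dim=1)
print(info_nce(q, ranked[0][0], neg), l_out(q, ranked[0], neg), l_in(q, ranked[0], neg))
print(rince(q, ranked, neg, taus=(0.1, 0.3)))
```

Note how in ℓ_i the positives of rank i appear in the numerator while positives of ranks j > i only appear in the denominator, which is what pushes h(q, p_i) above h(q, p_{i+1}).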
Experiments

We first study the properties of RINCE in the controlled supervised setting, looking at classification accuracy, retrieval and out-of-distribution (OOD) detection on Cifar-100. Next, we show that RINCE leads to significant improvements on the large scale dataset ImageNet-100 in terms of accuracy and OOD, even with more noisy similarity scores. Last, we showcase, with unsupervised video representation learning as an example, that RINCE can be used in an unsupervised setting. For all experiments we follow the MoCo v2 setting (Chen et al. 2020b) with a momentum encoder, a memory bank and a projection head. Throughout the section we compare different versions of RINCE (Tab. 1) to study their behavior in different settings. More ablations can be found in the Sup. Mat.

Learning from Class Hierarchies

The optimal testbed to study the proposed loss functions is the supervised contrastive learning (SCL) setting. The effect of the proposed loss functions can be studied without confounding noise, using ground truth labels and ground truth rankings. In SCL all samples with the same class are considered as positives, thus either Eq. (2) or Eq. (3) is used. However, semantically similar classes share similar visual features (Deselaers and Ferrari 2011). When they are strictly treated as negatives, the model does not mirror the structure available by the labels in its feature space. Mirroring this structure, however, is favorable for transferability to other tasks. RINCE allows the model to keep this structure, and to learn not only dissimilarities between, but also similarities across classes. We show quantitatively that RINCE learns a higher quality representation than cross-entropy and SCL on Cifar-100 and ImageNet-100 by evaluating on linear classification, image retrieval, and OOD tasks. Unless otherwise stated, we report results for ResNet-50. More implementation details can be found in the Sup. Mat.

Datasets. Cifar-100 (Krizhevsky, Hinton et al. 2009) provides both class and superclass labels, defining a semantic hierarchy. We use this hierarchy to define first rank positives (same class) and second rank positives (same superclass). TinyImageNet (Le and Yang 2015) comprises 200 ImageNet (Deng et al. 2009) classes at low resolution. ImageNet-100 (Tian, Krishnan, and Isola 2020) is a 100-class subset of ImageNet. We use the RoBERTa (Liu et al. 2019) model to obtain semantic word embeddings for all class names. Second rank positives are based on the word embedding similarity and a predefined threshold. Details are given in the Sup. Mat.

Baselines and SOTA. As baselines we use cross-entropy, cross-entropy with the same augmentations as RINCE (cross-entropy s.a.), the triplet loss (Weinberger, Blitzer, and Saul 2006) and SCL (Khosla et al. 2020), trained with Eq. (2) (SCL-out) or Eq. (3) (SCL-in). An advantage of RINCE compared to these baselines is that it benefits from extra information provided by the superclasses. To show that making use of this knowledge is not trivial, we compare to the following baselines: 1) We train SCL on Cifar-100 with 20 superclasses, denoted by SCL superclass. 2) Hierarchical Triplet (Ge 2018), which uses the superclasses to mine hard examples. 3) Fast AP (Cakir et al. 2019), a “learning to rank” approach that directly optimizes Average Precision. 4) Label smoothing (Szegedy et al. 2016), which reduces network over-confidence and can improve OOD detection (Lee and Cheon 2020). Here, we assign some probability mass to the classes from the same superclass. 5) A multi-classification baseline, referred to as two heads, that jointly predicts class and superclass labels. 6) SCL two heads, a variant of two heads that uses the SCL loss instead of cross-entropy. Details for all baselines are given in the Sup. Mat.

Classification and Retrieval on Cifar. For the classification evaluation we train a linear layer on top of the last layer of the frozen pre-trained networks. The non-parametric retrieval evaluation involves finding the relevant data points, in terms of class labels, in the feature space of the pre-trained network via a simple similarity metric, e.g. cosine similarity. RINCE is superior to the baselines for all experiments, Tab. 2. Note that all evaluations in Tab. 2 are based on the same pre-trained weights, using Cifar-100 fine labels as rank 1 and, if applicable, superclass labels as rank 2.

These experiments indicate that training with RINCE maintains the ranking order and results in a more structured feature space in which the samples of the same class are well separated from the other classes. This is further confirmed by a qualitative comparison between embedding spaces in Fig. 2. Furthermore, we find that the grouping of classes is learned by the MLP head. The increased difficulty of the ranking task of RINCE results in a more structured embedding space before the MLP compared with SCL, see Sup. Mat. Fig. 7.
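For reference, the retrieval metric can be computed directly on the frozen embeddings. Below is a hedged sketch of R@1 with cosine similarity; the exact query/gallery protocol is our assumption, not a detail stated in the paper.

```python
# Hedged sketch of the non-parametric retrieval evaluation (R@1 by cosine similarity).
# The leave-one-out protocol over a single embedded split is our own assumption.
import torch
import torch.nn.functional as F

def recall_at_1(feats, labels):
    """feats: (N, d) embeddings, labels: (N,) class ids; returns R@1 in [0, 1]."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t()                  # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))        # do not retrieve the query itself
    nn_idx = sim.argmax(dim=1)               # nearest neighbor per query
    return (labels[nn_idx] == labels).float().mean().item()

# toy usage with random features
feats, labels = torch.randn(1000, 128), torch.randint(0, 100, (1000,))
print(recall_at_1(feats, labels))
```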
Method | Cifar100 fine: Accuracy | Cifar100 fine: R@1 | Cifar100 superclass: R@1 | AUROC, Dout: Cifar-10 | AUROC, Dout: TinyImageNet
SCL-out | 76.50 | N/A | N/A | N/A | N/A
Soft Labels◦ | 76.90 | N/A | N/A | N/A | 67.50
ODIN† | N/A | N/A | N/A | 77.20 | 85.20
Mahalanobis† | N/A | N/A | N/A | 77.50 | 97.40
Contrastive OOD‡ | N/A | N/A | N/A | 78.30 | N/A
Gram Matrices | N/A | N/A | N/A | 67.90 | 98.90
Cross-entropy∗ | 74.52 ± 0.32 | 74.84 ± 0.21 | 83.99 ± 0.21 | 75.32 ± 0.65 | 77.76 ± 0.77
Cross-entropy s.a.∗ | 75.46 ± 1.09 | 76.03 ± 1.04 | 84.68 ± 0.86 | 75.91 ± 0.10 | 79.44 ± 0.50
Triplet | 68.44 ± 0.18 | 47.73 ± 0.14 | 72.29 ± 0.27 | 70.33 ± 0.54 | 80.76 ± 0.24
Hierarchical Triplet∗ | 69.27 ± 1.64 | 65.31 ± 2.69 | 77.41 ± 1.55 | 71.97 ± 2.48 | 76.22 ± 1.27
Fast AP∗ | 66.96 ± 0.88 | 62.03 ± 0.51 | 69.56 ± 0.54 | 69.14 ± 1.02 | 72.44 ± 0.94
Smooth Labels | 75.66 ± 0.27 | 74.90 ± 0.06 | 85.59 ± 0.12 | 74.35 ± 0.65 | 80.10 ± 0.77
Two heads | 74.08 ± 0.40 | 73.62 ± 0.31 | 81.92 ± 0.21 | 77.99 ± 0.07 | 78.35 ± 0.39
SCL-in superclass∗ | 74.41 ± 0.15 | 69.83 ± 0.28 | 85.35 ± 0.51 | 74.40 ± 0.72 | 80.20 ± 1.05
SCL-in∗ | 76.86 ± 0.18 | 73.20 ± 0.19 | 82.16 ± 0.24 | 74.63 ± 0.16 | 78.96 ± 0.45
SCL-out∗ | 76.70 ± 0.29 | 74.45 ± 0.39 | 82.94 ± 0.39 | 75.32 ± 0.59 | 79.80 ± 0.70
SCL-in two heads∗ | 77.15 ± 0.14 | 74.36 ± 0.10 | 83.31 ± 0.09 | 75.41 ± 0.16 | 79.34 ± 0.19
SCL-out two heads∗ | 76.91 ± 0.08 | 74.87 ± 0.37 | 83.74 ± 0.16 | 75.27 ± 0.34 | 79.64 ± 0.53
Contrastive OOD | N/A | N/A | N/A | 74.20 ± 0.40 | N/A
RINCE-out | 76.94 ± 0.16 | 76.68 ± 0.09 | 86.10 ± 0.25 | 77.76 ± 0.09 | 81.02 ± 0.14
RINCE-out-in | 77.59 ± 0.21 | 77.47 ± 0.16 | 86.20 ± 0.23 | 76.82 ± 0.44 | 81.40 ± 0.38
RINCE-in | 77.45 ± 0.05 | 77.56 ± 0.03 | 86.46 ± 0.21 | 77.03 ± 0.53 | 81.78 ± 0.05

Table 2: Classification, retrieval and OOD results for Cifar-100 pretraining. Left: classification and retrieval; fine-grained task (fine) with 100 classes and superclass task (superclass) with 20 classes. Right: OOD task with inlier dataset Din: Cifar-100 and outlier datasets Dout: Cifar-10 and TinyImageNet. We report the mean and standard deviation over 3 runs; Contrastive OOD is averaged over 5 runs. Best method in bold, second best underlined. Note that models indicated with † are not directly comparable, since they use data explicitly labeled as OOD samples for tuning. ∗ indicates methods of others trained by us, ◦ uses a 2× wider ResNet-40, ‡ a 4× wider ResNet-50. The lower part of the table uses ResNet-50. Methods not referenced in the text: Soft Labels (Lee and Cheon 2020), Gram Matrices (Sastry and Oore 2020), Triplet (Weinberger, Blitzer, and Saul 2006).

Out-of-distribution Detection. To further investigate the structure of the representation learned by RINCE we evaluate on the task of out-of-distribution (OOD) detection. As argued in (Winkens et al. 2020), models trained with cross-entropy only need to distinguish classes and can omit irrelevant features. Contrastive learning differs by forcing the network to distinguish between each pair of samples, resulting in a more complete representation. Such a representation is beneficial for OOD detection (Hendrycks et al. 2019; Winkens et al. 2020). Therefore, OOD performance can be seen as an evaluation of representation quality beyond standard metrics like accuracy and retrieval. RINCE incentivizes the network to learn an even richer representation. Besides that, OOD detection benefits from a good trade-off between alignment and uniformity, which RINCE manages well (Fig. 9 in the Sup. Mat.).

We follow common evaluation settings for OOD (Lee et al. 2018; Liang, Li, and Srikant 2018; Winkens et al. 2020). Here Cifar-100 is used as the inlier dataset Din, and Cifar-10 and TinyImageNet as outlier datasets Dout. Note that Cifar-100 and Cifar-10 have disjoint labels and images. For both protocols we only use the test or validation images. Our models are identical to those in the previous section. Inspired by (Winkens et al. 2020), we follow a simple approach and fit class-conditional multivariate Gaussians to the embedding of the training set. We use the log-likelihood to define the OOD score. As a result, the likelihood of identifying OOD samples is high if each in-distribution class follows roughly a Gaussian distribution in the embedding space, compare Fig. 2a and 2c. For evaluation, we compute the area under the receiver operating characteristic curve (AUROC); details in the Sup. Mat.

Results and a comparison to the most related previous work are shown in Tab. 2. Note that we aim here to compare the representation space learned via RINCE to its counterparts, i.e. cross-entropy and SCL, but show well known methods as reference. Most importantly, RINCE clearly outperforms cross-entropy, all SCL variants, Contrastive OOD and our own baselines using the identical OOD approach. Only two heads outperforms all other methods in the near-OOD setting with Dout: Cifar-10. However, its performance in all other settings is low, showing weak generalization. This underlines our hypothesis that training with RINCE yields a more structured and general representation space. Comparing to related works, RINCE not only outperforms Contrastive OOD (Winkens et al. 2020) using the same architecture, but even approaches the 4× wider ResNet on Cifar-10 as Dout. ODIN (Liang, Li, and Srikant 2018) and Mahalanobis (Lee et al. 2018) require samples labelled as OOD to tune parameters of the OOD approach. Here we evaluate in the more realistic setting without labelled OOD samples. Despite using significantly less information, RINCE is competitive with them and even outperforms them for Dout: Cifar-10.
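The OOD scoring described above (class-conditional Gaussians fitted on training embeddings, log-likelihood as the score, AUROC for evaluation) can be sketched in a few lines. The max-over-classes aggregation and the use of SciPy/scikit-learn utilities below are our assumptions, as the paper leaves these details to the Sup. Mat.

```python
# Hedged sketch of the Gaussian-based OOD score. Aggregation over classes
# (max log-likelihood) and the covariance regularization are our assumptions.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.metrics import roc_auc_score

def fit_class_gaussians(feats, labels):
    """Fit one multivariate Gaussian per class on training embeddings."""
    gaussians = []
    for c in np.unique(labels):
        x = feats[labels == c]
        mean = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])  # regularize
        gaussians.append(multivariate_normal(mean, cov))
    return gaussians

def ood_score(gaussians, feats):
    """Higher score = more in-distribution (best class log-likelihood)."""
    logp = np.stack([g.logpdf(feats) for g in gaussians], axis=1)
    return logp.max(axis=1)

# toy usage: in-distribution embeddings vs. shifted outliers
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 8)); labels = rng.integers(0, 5, size=500)
test_in = rng.normal(size=(100, 8)); test_out = rng.normal(loc=3.0, size=(100, 8))
g = fit_class_gaussians(train, labels)
scores = np.concatenate([ood_score(g, test_in), ood_score(g, test_out)])
is_inlier = np.concatenate([np.ones(100), np.zeros(100)])
print("AUROC:", roc_auc_score(is_inlier, scores))
```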

Figure 2: Qualitative comparison of embedding spaces. T-SNE plots of (a) supervised contrastive learning (SCL-in), (b) RINCE-in and (c) RINCE-out-in on Cifar-100. Best seen in color, on screen and zoomed in. Color and marker type combined indicate the class. Labels are omitted for clarity. The Sup. Mat. contains a version of this plot with color indicating the superclass. RINCE learns a more structured embedding space than SCL, e.g. classes are linearly separable and can be modelled well by a Gaussian.

Method | Accuracy | AUROC, Dout: ImageNet-100† | AUROC, Dout: AwA2
Cross-entropy s.a. | 83.94 | 79.076 ± 1.477 | 79.04
SCL-out | 84.18 | 79.779 ± 1.274 | 79.05
RINCE-out-in | 84.90 | 80.473 ± 1.210 | 80.73

Table 3: ImageNet-100 classification accuracy and OOD detection for Din: ImageNet-100, and Dout: ImageNet-100† and AwA2 (Xian et al. 2018). ImageNet-100† denotes three ImageNet-100 datasets with non-overlapping classes.

Large Scale Data and Noisy Similarities

Additionally, we perform the same evaluations on ImageNet-100, a 100-class subset of ImageNet, see Tab. 3. Here, we use ResNet-18. We obtain the second rank classes for a given class via similarities of the RoBERTa (Liu et al. 2019) class name embeddings. In contrast to the previous experiments, where ground truth hierarchies are known, these similarity scores are noisy and inaccurate – yet they still provide valuable information to the model. We evaluate our model via linear classification on ImageNet-100 and two OOD tasks: AwA2 (Xian et al. 2018) as Dout, and ImageNet-100†, where we use the remaining ImageNet classes to define three non-overlapping splits and report the average OOD performance.

Results are shown in Tab. 3. Again, RINCE significantly improves over SCL and cross-entropy in linear evaluation as well as on the OOD tasks. This demonstrates 1) that RINCE can handle noisy rankings and 2) that RINCE leads to improvements on large scale datasets. Next, we move to an even less controlled setting and define a ranking based on temporal ordering for unsupervised video representation learning.

Unsupervised RINCE

In this section we demonstrate that RINCE can be used in a fully unsupervised setting with noisy hierarchies by applying it to unsupervised video representation learning. Inspired by (Tschannen et al. 2020), we construct three ranks for a given query video: same frames, same shot and same video, see Fig. 3. The first positive x_f is obtained by augmenting the query frames. The second positive x_s is a clip consecutive to the query frames, where small transformations of the objects, illumination changes, etc. occur. The third positive x_v is sampled from a different time interval of the same video, which may show visually distinct but semantically related scenes. Naturally, x_f shows the most similar content to the query frames, followed by x_s and finally x_v. We compare temporal ranking with RINCE to different baselines.

Baselines. We compare to the basic InfoNCE, where a single positive is generated via augmentations (Chen et al. 2020a; He et al. 2020), i.e. only frame positives x_f. When considering multiple clips from the same video, such as x_s and x_v, there are several possibilities: we can treat them all as positives (hard positive), we can use the distant x_v as a hard negative, or we can ignore it (easy positive). In both cases L^out, Eq. (2), and L^in, Eq. (3), are possible. Additionally, we compare to two recent methods trained in comparable settings, i.e. VIE (Zhuang et al. 2020) and LA-IDT (Tokmakov, Hebert, and Schmid 2020).

Ranking Frame-, Shot- and Video-level Positives. We sample short clips of a video, each consisting of 16 frames. We augment each clip with a set of standard video augmentations; for more details we refer to the Sup. Mat. For the anchor clip x, we define positives as in Fig. 3: p_1 = x_f consists of the same frames as x, p_2 = x_s is a sequence of 16 frames adjacent to x, and p_3 = x_v is sampled from a different time interval than x_f and x_s. Negatives x_n are sampled from different videos. Since each rank i contains only a single positive p_i, Eq. (2) = Eq. (3), and we call this variant RINCE-uni. By ranking the positives we ensure that the similarities satisfy sim(x, x_f) > sim(x, x_s) > sim(x, x_v) > sim(x, x_n), adhering to the temporal structure in videos.
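As an illustration of how the three ranked positives can be drawn from a single video, here is a small sketch of the temporal sampling for x, x_f, x_s and x_v with 16-frame clips. The random ranges and the minimal gap before the video-level positive are our own illustrative choices; augmentation and frame decoding are omitted.

```python
# Hedged sketch of sampling ranked positives from one video (16-frame clips).
# The gap used for the "same video" positive and the random ranges are illustrative.
import random

CLIP_LEN = 16

def sample_ranked_clips(num_frames, min_gap=2 * CLIP_LEN):
    """Return start indices for the anchor x, the shot positive x_s and the video positive x_v.

    x_f (frame positive) reuses the anchor's frames and only differs by augmentation,
    so it needs no separate start index.
    """
    assert num_frames >= 2 * CLIP_LEN + min_gap + CLIP_LEN
    # anchor clip x
    x_start = random.randint(0, num_frames - 2 * CLIP_LEN - min_gap - CLIP_LEN)
    # shot positive x_s: the 16 frames directly adjacent to the anchor
    s_start = x_start + CLIP_LEN
    # video positive x_v: a temporally distant clip from the same video
    v_start = random.randint(s_start + CLIP_LEN + min_gap, num_frames - CLIP_LEN)
    return x_start, s_start, v_start

# toy usage: a 300-frame video
x0, s0, v0 = sample_ranked_clips(300)
print(f"x: [{x0}, {x0 + CLIP_LEN}), x_s: [{s0}, {s0 + CLIP_LEN}), x_v: [{v0}, {v0 + CLIP_LEN})")
```

Feeding the resulting clips through the encoder and applying RINCE-uni with increasing temperatures then imposes the ordering sim(x, x_f) > sim(x, x_s) > sim(x, x_v) > sim(x, x_n) described above.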
Figure 3: Positives in videos. For a given query clip we use frame positives x_f, shot positives x_s and video positives x_v.

Method | Loss | Positives | Negatives | Top-1 Acc. HMDB | Top-1 Acc. UCF | Retrieval mAP HMDB | Retrieval mAP UCF
VIE | - | - | - | 44.8 | 72.3 | - | -
LA-IDT | - | - | - | 44.0 | 72.8 | - | -
InfoNCE | L | {x_f} | N | 41.5 | 71.3 | 0.0500 | 0.0688
hard positive | L^in | {x_f, x_s, x_v} | N | 42.6 | 74.3 | 0.0685 | 0.1119
hard positive | L^out | {x_f, x_s, x_v} | N | 41.4 | 73.6 | 0.0666 | 0.1204
easy positive | L^in | {x_f, x_s} | N | 42.7 | 74.5 | 0.0581 | 0.1257
easy positive | L^out | {x_f, x_s} | N | 40.7 | 73.5 | 0.0593 | 0.1297
hard negative | L^in | {x_f, x_s} | {x_v} ∪ N | 43.6 | 74.3 | 0.0678 | 0.1141
hard negative | L^out | {x_f, x_s} | {x_v} ∪ N | 43.5 | 75.2 | 0.0675 | 0.1193
RINCE | RINCE-uni | x_f > x_s > x_v | N | 44.9 | 75.4 | 0.0719 | 0.1395

Table 4: Finetuning on UCF and HMDB. L, L^in and L^out correspond to Eq. (1), Eq. (3) and Eq. (2), respectively. Positives and Negatives indicate how x_f, x_s, x_v were incorporated into contrastive learning, where N denotes the set of negative pairs from random clips. Since we consider only a single positive per rank, we use the RINCE-uni loss variant for RINCE.

Datasets and Evaluation. For self-supervised learning, we use Kinetics-400 (Kay et al. 2017) and discard the labels. Our version of the dataset consists of 234,584 training videos. We evaluate the learned representation via finetuning on UCF (Soomro, Zamir, and Shah 2012) and HMDB (Kuehne et al. 2011) and report top-1 accuracy. In this evaluation, the pretrained weights are used to initialize a network, which is then trained end-to-end using cross-entropy. Additionally, we evaluate the representation via nearest neighbor retrieval and report mAP. Precision-recall curves can be found in the Sup. Mat.

Experimental Results. For all experiments we use a 3D-ResNet-18 backbone. Training details can be found in the Sup. Mat. We report the results for RINCE as well as the baselines in Tab. 4. Adding shot- and video-level samples to InfoNCE improves the downstream accuracies. We observe that adding x_v to the set of negatives to provide a hard negative, rather than adding it to the set of positives, leads to higher performance, suggesting that it should not be treated as a true positive. This is further supported by the second and third row, where all three positives are treated as true positives. Here, L^out, which forces all positives to be similar, leads to inferior performance compared to L^in. L^in allows more noise in the set of positives by weakening the influence of the false positives x_v. With RINCE we can impose the temporal ordering x_f > x_s > x_v and treat x_v properly, leading to the highest downstream performance. The improvement of RINCE over L^out is less pronounced on UCF. This is due to the strong static bias (Li, Li, and Vasconcelos 2018) of UCF, and L^out encourages static features. Contrarily, the improvements of RINCE over L^out on HMDB are substantial, due to the weaker bias towards static features. Last, we compare our method in Tab. 4 to two recent unsupervised video representation learning methods that use the same backbone network. We outperform these methods on both datasets.

Conclusion

We introduced RINCE, a new member in the family of InfoNCE losses. We show that RINCE can exploit rankings to learn a more structured feature space with desired properties that are lacking with standard InfoNCE. Furthermore, representations learned through RINCE can improve accuracy, retrieval and OOD detection. Most importantly, we show that RINCE works well with noisy similarities, and is applicable to large scale datasets and to unsupervised training. We compare the different variants of RINCE. Here lies a limitation: different variants are optimal for different tasks and must be chosen based on domain knowledge. Future work will explore further ways of obtaining similarity scores, e.g. based on distance in a pretrained embedding space, distance between cameras in a multi-view setting, or distances between clusters.

Acknowledgments

JG has been supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - GA1927/4-2.

References

Behrmann, N.; Gall, J.; and Noroozi, M. 2021. Unsupervised Video Representation Learning by Bidirectional Feature Prediction. In WACV.
Burges, C.; Shaked, T.; Renshaw, E.; Lazier, A.; Deeds, M.; Hamilton, N.; and Hullender, G. 2005. Learning to Rank Using Gradient Descent. In ICML.
Cakir, F.; He, K.; Xia, X.; Kulis, B.; and Sclaroff, S. 2019. Deep metric learning to rank. In CVPR.
Cao, Z.; Qin, T.; Liu, T.-Y.; Tsai, M.-F.; and Li, H. 2007. Learning to rank: from pairwise approach to listwise approach. In ICML.
Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; and Joulin, A. 2020. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. In NeurIPS.
Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020a. A Simple Framework for Contrastive Learning of Visual Representations. In ICML.
Chen, X.; Fan, H.; Girshick, R.; and He, K. 2020b. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
Chen, X.; and He, K. 2021. Exploring simple siamese representation learning. In CVPR.
Dave, I.; Gupta, R.; Rizve, M. N.; and Shah, M. 2021. TCLR: Temporal Contrastive Learning for Video Representation. arXiv preprint arXiv:2101.07974.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR.
Deselaers, T.; and Ferrari, V. 2011. Visual and semantic similarity in imagenet. In CVPR.
Dosovitskiy, A.; Fischer, P.; Springenberg, J. T.; Riedmiller, M.; and Brox, T. 2016. Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks. In TPAMI.
Feichtenhofer, C.; Fan, H.; Xiong, B.; Girshick, R.; and He, K. 2021. A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning. In CVPR.
Ge, W. 2018. Deep metric learning with hierarchical triplet loss. In ECCV.
Grill, J.-B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; Piot, B.; Kavukcuoglu, K.; Munos, R.; and Valko, M. 2020. Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning. In NeurIPS.
Han, T.; Xie, W.; and Zisserman, A. 2020. Self-supervised Co-training for Video Representation Learning. In NeurIPS.
He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum Contrast for Unsupervised Visual Representation Learning. In CVPR.
Hendrycks, D.; Mazeika, M.; Kadavath, S.; and Song, D. 2019. Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty. In NeurIPS.
Hoffmann, D. T.; Behrmann, N.; Gall, J.; Brox, T.; and Noroozi, M. 2022. Ranking Info Noise Contrastive Estimation: Boosting Contrastive Learning via Ranked Positives. arXiv preprint arXiv:2201.11736.
Huynh, T.; Kornblith, S.; Walter, M. R.; Maire, M.; and Khademi, M. 2020. Boosting Contrastive Self-Supervised Learning with False Negative Cancellation. arXiv preprint arXiv:2011.11765.
Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; Suleyman, M.; and Zisserman, A. 2017. The Kinetics Human Action Video Dataset. arXiv preprint arXiv:1705.06950.
Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; and Krishnan, D. 2020. Supervised Contrastive Learning. In NeurIPS.
Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images. Tech Report.
Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; and Serre, T. 2011. HMDB: A large video database for human motion recognition. In ICCV.
Le, Y.; and Yang, X. 2015. Tiny imagenet visual recognition challenge. CS 231N, 7(7): 3.
Lee, D.; and Cheon, Y. 2020. Soft Labeling Affects Out-of-Distribution Detection of Deep Neural Networks. arXiv preprint arXiv:2007.03212.
Lee, K.; Lee, K.; Lee, H.; and Shin, J. 2018. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS.
Li, Y.; Li, Y.; and Vasconcelos, N. 2018. RESOUND: Towards Action Recognition without Representation Bias. In ECCV.
Liang, S.; Li, Y.; and Srikant, R. 2018. Enhancing the reliability of out-of-distribution image detection in neural networks. In ICLR.
Liu, T.-Y. 2009. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Miech, A.; Alayrac, J.-B.; Smaira, L.; Laptev, I.; Sivic, J.; and Zisserman, A. 2020. End-to-End Learning of Visual Representations from Uncurated Instructional Videos. In CVPR.
Misra, I.; and van der Maaten, L. 2020. Self-Supervised Learning of Pretext-Invariant Representations. In CVPR.
Neill, J. O.; and Bollegala, D. 2021. Semantically-Conditioned Negative Samples for Efficient Contrastive Learning. arXiv preprint arXiv:2102.06603.
Romijnders, R.; Mahendran, A.; Tschannen, M.; Djolonga, J.; Ritter, M.; Houlsby, N.; and Lucic, M. 2021. Representation Learning From Videos In-the-Wild: An Object-Centric Approach. In WACV.
Sastry, C. S.; and Oore, S. 2020. Detecting out-of-distribution examples with gram matrices. In ICML.
Sohn, K. 2016. Improved Deep Metric Learning with Multi-class N-pair Loss Objective. In NIPS.
Soomro, K.; Zamir, A. R.; and Shah, M. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In CVPR.
Tian, Y.; Krishnan, D.; and Isola, P. 2020. Contrastive Multiview Coding. In ECCV.
Tian, Y.; Sun, C.; Poole, B.; Krishnan, D.; Schmid, C.; and Isola, P. 2020. What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243.
Tokmakov, P.; Hebert, M.; and Schmid, C. 2020. Unsupervised Learning of Video Representations via Dense Trajectory Clustering. In ECCV Workshops.
Tschannen, M.; Djolonga, J.; Ritter, M.; Mahendran, A.; Houlsby, N.; Gelly, S.; and Lucic, M. 2020. Self-Supervised Learning of Video-Induced Visual Invariances. In CVPR.
van den Oord, A.; Li, Y.; and Vinyals, O. 2018. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748.
Wang, F.; and Liu, H. 2021. Understanding the Behaviour of Contrastive Loss. In CVPR.
Weinberger, K. Q.; Blitzer, J.; and Saul, L. 2006. Distance Metric Learning for Large Margin Nearest Neighbor Classification. In NIPS.
Winkens, J.; Bunel, R.; Roy, A. G.; Stanforth, R.; Natarajan, V.; Ledsam, J. R.; MacWilliams, P.; Kohli, P.; Karthikesalingam, A.; Kohl, S.; et al. 2020. Contrastive training for improved out-of-distribution detection. arXiv preprint arXiv:2007.05566.
Xian, Y.; Lampert, C. H.; Schiele, B.; and Akata, Z. 2018. Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly. In TPAMI.
Zhao, N.; Wu, Z.; Lau, R. W. H.; and Lin, S. 2021. What Makes Instance Discrimination Good for Transfer Learning? In ICLR.
Zhuang, C.; She, T.; Andonian, A.; Mark, M. S.; and Yamins, D. 2020. Unsupervised Learning From Video With Deep Neural Embeddings. In CVPR.