Ge Hyperbolic Contrastive Learning For Visual Representations Beyond Objects CVPR 2023 Paper


This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.

Except for this watermark, it is identical to the accepted version;

the final published version of the proceedings is available on IEEE Xplore.

Hyperbolic Contrastive Learning for Visual Representations beyond Objects

Songwei Ge*¹, Shlok Mishra*¹, Simon Kornblith², Chun-Liang Li², David Jacobs¹,³
¹University of Maryland, College Park, ²Google Research, ³Meta
{songweig,shlokm,dwj}@umd.edu, {chunliang,skornblith}@google.com

Abstract

Although self-/un-supervised methods have led to rapid progress in visual representation learning, these methods generally treat objects and scenes using the same lens. In this paper, we focus on learning representations for objects and scenes that preserve the structure among them. Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure based on their compositionality. To exploit such a structure, we propose a contrastive learning framework where a Euclidean loss is used to learn object representations and a hyperbolic loss is used to encourage representations of scenes to lie close to representations of their constituent objects in a hyperbolic space. This novel hyperbolic objective encourages the scene-object hypernymy among the representations by optimizing the magnitude of their norms. We show that when pretraining on the COCO and OpenImages datasets, the hyperbolic loss improves downstream performance of several baselines across multiple datasets and tasks, including image classification, object detection, and semantic segmentation. We also show that the properties of the learned representations allow us to solve various vision tasks that involve the interaction between scenes and objects in a zero-shot fashion.

Figure 1. Illustration of the representation space learned by our models. Object images of the same class tend to gather near the center around similar directions, while the scene images are far away in these directions with larger norms.

* Equal contribution. The order is decided randomly.

1. Introduction

Our visual world is diverse and structured. Imagine taking a close-up of a box of cereal in the morning. If we zoom out slightly, we may see different nearby objects such as a pitcher of milk, a cup of hot coffee, today's newspaper, or reading glasses. Zooming out further, we will probably recognize that these items are placed on a dining table with the kitchen as background rather than inside a bathroom. Such scene-object structure is diverse, yet not completely random. In this paper, we aim at learning visual representations of both the cereal box (objects) and the entire dining table (scenes) in the same space while preserving such hierarchical structures.

Un-/self-supervised learning has become a standard method to learn visual representations [7, 12, 24, 26, 27, 51]. Although these methods attain superior performance over supervised pretraining on object-centric datasets such as ImageNet [6], inferior results are observed on images depicting multiple objects such as OpenImages or COCO [68]. Several methods have been proposed to mitigate this issue, but all focus either on learning improved object representations [1, 68] or dense pixel representations [39, 64, 69], instead of explicitly modeling representations for scene images. The object representations learned by these methods present a natural topology [67]. That is, the objects from visually similar classes lie close to each other in the representation space. However, it is not clear how the representations of scene images should fit into that topology. Directly applying existing contrastive learning results in a sub-optimal topology of scenes and objects as well as unsatisfactory performance, as we will show in the experiments. To this end, we argue that a hierarchical structure can be naturally adopted. Considering that the same class of objects can be placed in different scenes, we construct a hierarchical structure to describe such relationships, where the root nodes are the visually similar objects, and the scene images consisting of them are placed as the descendants. We call this structure the object-centric scene hierarchy.

The immediate modeling difficulty induced by this structure is the combinatorial explosion. A finite number of objects leads to exponentially many different possible scenes. Consequently, Euclidean space may require an arbitrarily large number of dimensions to faithfully embed these scenes, whereas it is known that any infinite tree can be embedded with arbitrarily low distortion in a 2D hyperbolic space [25]. Therefore, we propose to employ a hyperbolic objective to regularize the scene representations. To learn representations of scenes, in the general setting of contrastive learning, we sample co-occurring scene-object pairs as positive pairs, and objects that are not part of that scene as negative samples, and use these pairs to compute an auxiliary hyperbolic contrastive objective. Our model is trained to reduce the distance between positive pairs and push away the negative pairs in a hyperbolic space.

Contrastive learning usually has objectives defined on a hypersphere [12, 27]. By discarding the norm information, these models circumvent the shortcut of minimizing losses through tuning the norms and obtain better downstream performance. However, the norm of the representation can also be used to encode useful representational structure. In hyperbolic space, the magnitude of a vector often plays the role of modeling the hypernymy of the hierarchical structure [45, 53, 59]. When projecting the representations to the hyperbolic space, the norm information is preserved and used to determine the Riemannian distance, which eventually affects the loss. Since hyperbolic space is diffeomorphic and conformal to Euclidean space, our hyperbolic contrastive loss is differentiable and complementary to the original contrastive objective.

When training simultaneously with the original contrastive objective for objects and our proposed hyperbolic contrastive objective for scenes, the resulting representation space exhibits a desired hierarchical structure while leaving the object clustering topology intact, as shown in Figure 1. We demonstrate the effectiveness of the hyperbolic objective under several frameworks on multiple downstream tasks. We also show that the properties of the representations allow us to perform various vision tasks in a zero-shot way, from label uncertainty quantification to out-of-context object detection. Our contributions are summarized below:

1. We propose a hyperbolic contrastive loss that regularizes scene representations so that they follow an object-centric hierarchy, with positive and negative pairs sampled from the hierarchy.
2. We demonstrate that our learned representations transfer better than representations learned using vanilla contrastive loss on a variety of downstream tasks, including object detection, semantic segmentation, and linear classification.
3. We show that the magnitude of representation norms effectively reflects the scene-object hypernymy.

2. Method

In this section, we elaborate upon our approach to learning visual representations of object and scene images. We start by describing the hierarchical structure between objects and scenes that we wish to enforce in the learned representation space.

2.1. Object-Centric Scene Hierarchy

From simple object co-occurrence statistics [19, 41] to finer object relationships [30, 32], using hierarchical relationships between objects and scenes to understand images is not new. Previous studies primarily work on an image-level hierarchy by dividing an image into its lower-level elements recursively: a scene contains multiple objects, an object has different parts, and each part may consist of even lower-level features [14, 29, 48]. While this is intuitive, it describes a hierarchical structure contained in the individual images. Instead, we study the structure presented among different images. Our goal is to learn a representation space for images of both objects and scenes across the entire dataset. To this end, we argue that it is more natural to consider an object-centric hierarchy.

It is known that when training an image classifier, the objects from visually similar classes often lie close to each other in the representation space [67], which has become the cornerstone of contrastive learning. Motivated by this observation, we believe that the representation of each scene image should also be close to the object clusters it consists of. However, modeling scenes requires a much larger volume due to the exponential number of possible compositions of objects. Another way to think about the object-centric hierarchy is through the generality and specificity as often discussed in the language literature [42, 45]. An object concept is general when standing alone in the visual world, and it will become specific when a certain context is given. For example, “a desk” is thought to be a more general concept than “a desk in a classroom with a boy sitting on it”.
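As a small illustration of the generality relation and of the combinatorial growth mentioned above, the following sketch models each region as the set of objects it contains. This set-based view and all names (`desk`, `desk_in_classroom`, `is_more_general`, `count_possible_scenes`) are hypothetical simplifications for illustration and are not taken from the paper's implementation.

```python
# Toy model (an assumption for illustration): a region is the set of
# objects it contains. Names below are hypothetical.
desk = frozenset({"desk"})
desk_in_classroom = frozenset({"desk", "boy"})


def is_more_general(a, b):
    """Return True if region `a` is strictly more general than `b`,
    i.e. the objects of `a` form a proper subset of the objects of `b`."""
    return a < b


def count_possible_scenes(num_objects):
    """Number of non-empty object combinations: 2^n - 1.

    This is why a finite object vocabulary already yields exponentially
    many candidate scenes."""
    return 2 ** num_objects - 1


print(is_more_general(desk, desk_in_classroom))  # the lone desk is more general
print(count_possible_scenes(10))
```

Under this simplification, a lone object sits above every scene that contains it, and the number of candidate scenes grows exponentially with the number of objects.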
Therefore, we propose to study an object-centric hierarchy across the entire dataset. Formally, given a set of images $S = \{s_1, s_2, \cdots, s_n\}$, $O_i = \{o_i^1, o_i^2, \cdots, o_i^{n_i}\}$ are the object bounding boxes contained in the image $s_i$. We define the regions of scene $R_i = \{r_i^1, r_i^2, \cdots, r_i^{m_i}\}$ to be partial areas of the image $s_i$ that contain multiple objects such that $r_i^j = \cup_k o_i^k$, where $o_i^k \in O_i$ and object $k$ is in the region $j$. We define the object-centric hierarchy $T = (V, E)$ to be that $V = S \cup O \cup R$, where $R = R_1 \cup \cdots \cup R_n$ and $O = O_1 \cup \cdots \cup O_n$. For $u, v \in V$, $e = (u, v)$ is an edge of $T$ if $u \subseteq v$ or $v \subseteq u$. Note that the natural scene images $S$ are always put as the leaf nodes.

2.2. Representation Learning beyond Objects

To describe our proposed model based on this hierarchy, we begin with a brief review of hyperbolic space and its properties used in our model. For comprehensive introductions to Riemannian geometry and hyperbolic space, we refer the readers to [16, 34].

2.2.1 Hyperbolic Space

A hyperbolic space $(\mathbb{H}^m, g)$ is a complete, connected Riemannian manifold with constant negative sectional curvature. These special manifolds are all isometric to each other with the isometries defined as $O^+(m, 1)$. Among these isometries, there are five common models that previous studies often work on [5]. In this paper, we choose the Poincaré ball $\mathbb{D}^n := \{p \in \mathbb{R}^n \mid \|p\|^2 < r^2\}$ as our basic model [21, 45, 59], where $r > 0$ is the radius of the ball. The Poincaré ball is coupled with a Riemannian metric $g_{\mathbb{D}}(p) = \frac{4}{(1 - \|p\|^2 / r^2)^2} g_E$, where $p \in \mathbb{D}^n$ and $g_E$ is the canonical metric of the Euclidean space. For $p, q \in \mathbb{D}^n$, the Riemannian distance on the Poincaré ball induced by its metric $g_{\mathbb{D}}$ is defined as follows:

$$d_{\mathbb{D}}(p, q) = 2r \tanh^{-1}\left(\frac{\|-p \oplus q\|}{r}\right), \tag{1}$$

where $\oplus$ is the Möbius addition and it is clearly differentiable. In addition, the Poincaré ball can be viewed as a natural counterpart of the hypersphere as it allows all directions, unlike the other models such as the halfspace or hemisphere models that have constraints on the directions.

The hyperbolic space is globally diffeomorphic to the Euclidean space, which is stated in the theorem below:

Theorem 1. (Cartan–Hadamard). For every point $p \in \mathbb{H}^n$ the exponential map $\exp_p : T_p\mathbb{H}^n \approx \mathbb{R}^n \to \mathbb{H}^n$ is a smooth covering map. Since $\mathbb{H}^n$ is simply connected, it is diffeomorphic to $\mathbb{R}^n$.

Specifically, for $p \in \mathbb{D}^n$ and $v \in T_p\mathbb{D}^n \approx \mathbb{R}^n$, the exponential map of the Poincaré ball $\exp_p : T_p\mathbb{D}^n \to \mathbb{D}^n$ is defined as

$$\exp_p(v) := p \oplus \left(\tanh\left(\frac{r\|v\|}{r^2 - \|p\|^2}\right) \frac{r v}{\|v\|}\right). \tag{2}$$

The exponential map gives us a way to map the output of a network, which is in the Euclidean space, to the Poincaré ball. In practice, to avoid numerical issues, we clip the maximal norm of $v$ with $r - \varepsilon$ before the projection, where $\varepsilon > 0$. During the backpropagation, we perform RSGD [4] by scaling the gradients by $g_{\mathbb{D}}(p)^{-1}$. Intuitively, this forces the optimizer to take a smaller step when $p$ is closer to the boundary. The scaling factor is lower bounded by $O(\varepsilon^2)$.

The immediate consequence of the negative curvature is that for any point $p \in \mathbb{H}^m$, there are no conjugate points along any geodesic starting from $p$. Therefore, the volume grows exponentially faster in hyperbolic space than in Euclidean space. Such a property makes it suitable to embed the hierarchical structure that has constant branching factors and an exponential number of nodes. This is formally stated in the theorem below:

Theorem 2. [25] Given a Poincaré ball $\mathbb{D}^n$ with an arbitrary dimension $n \geq 2$ and any set of points $p_1, \cdots, p_m \in \mathbb{D}^n$, there exists a finite weighted tree $(T, d_T)$ and an embedding $f : T \to \mathbb{D}^n$ such that for all $i, j$,

$$\left| d_T\left(f^{-1}(p_i), f^{-1}(p_j)\right) - d_{\mathbb{D}}(p_i, p_j) \right| = O\left(\log(1 + \sqrt{2}) \log(m)\right).$$

Intuitively, the theorem states that any tree can be embedded into a Poincaré disk ($n = 2$) with low distortion. On the contrary, it is known that the Euclidean space with an unbounded number of dimensions is not able to achieve such a low distortion [36]. One useful intuition [53] for the advantage of the hyperbolic space is that, given two points $p, q \in \mathbb{D}^n$ with $\|p\| = \|q\|$,

$$d_{\mathbb{D}}(p, q) \to d_{\mathbb{D}}(p, 0) + d_{\mathbb{D}}(0, q), \quad \text{as } \|p\| = \|q\| \to r. \tag{3}$$

This property basically reflects the fact that the shortest path in a tree is the path through the earliest common ancestor, and it is reproduced in the Poincaré ball when both points are close to the boundary.

2.2.2 Hyperbolic Contrastive Learning

Given the theoretical benefits of the hyperbolic space stated above, we propose a contrastive learning framework as shown in Figure 2. We adopt two losses to learn the object and scene representations. First, to learn object representations, we use the standard normalized temperature-scaled cross-entropy loss, which operates on the hypersphere in Euclidean space. As shown in the top branch of Figure 2, we crop two views of a jittered and slightly expanded object region as the positive pairs and feed them into the base and momentum encoders to calculate the object representations. We
denote the output after the normalization to be $z^1_{\mathrm{euc}}$ and $z^2_{\mathrm{euc}}$. We follow MoCo [27] and leverage a memory bank to store the negative representations $z^n_{\mathrm{euc}}$, which are the features $z^2_{\mathrm{euc}}$ from the previous batches. Note that our framework can be readily extended to other contrastive learning models. The Euclidean loss for each image is then calculated as:

$$\mathcal{L}_{\mathrm{euc}} = -\log \frac{\exp\left(z^1_{\mathrm{euc}} \cdot z^2_{\mathrm{euc}} / \tau\right)}{\exp\left(z^1_{\mathrm{euc}} \cdot z^2_{\mathrm{euc}} / \tau\right) + \sum_n \exp\left(z^1_{\mathrm{euc}} \cdot z^n_{\mathrm{euc}} / \tau\right)},$$

where $\tau$ is a temperature parameter.

Figure 2. Our Hyperbolic Contrastive Learning (HCL) framework has two branches: given a scene image, two object regions are cropped to learn the object representations with a loss defined in the Euclidean space focusing on the representation directions. A scene region as well as a contained object region are used to learn the scene representations with a loss defined in the hyperbolic space that affects the representation norms.

While the loss above aims to learn object representations, we propose a hyperbolic contrastive objective to learn the representations for scene images. We sample positive region pairs $u$ and $v$ from the object-centric scene hierarchy $T$ such that $(u, v) \in E$. In other words, as shown in the bottom branch of Figure 2, the objects contained in one region are required to be a subset of the objects in the other. We sample the negative samples of $u$ to be $N_u = \{v \mid (u, v) \notin E\}$. However, building and sampling exhaustively from the entire hierarchy explicitly is tricky. In practice, given an image $s$, we always sample $u \in R \cup \{s\}$ to be a scene region, $v \in O$ to be an object that occurs in $u$, and $N_u$ to be the other objects that are not in $u$.

The pair of scene and object images are fed into the base and momentum encoders that share the weights with the Euclidean branch. However, instead of normalizing the output of the encoders, we use the exponential map defined in equation 2 to project these features in the Euclidean space to the Poincaré ball, which are denoted as $z^1_{\mathrm{hyp}}$ and $z^2_{\mathrm{hyp}}$. Further, we replace the inner product in the cross-entropy loss with the negative hyperbolic distance as defined in equation 1. We calculate the hyperbolic contrastive loss as follows:

$$\mathcal{L}_{\mathrm{hyp}} = -\log \frac{\exp\left(-d_{\mathbb{D}}(z^1_{\mathrm{hyp}}, z^2_{\mathrm{hyp}}) / \tau\right)}{\exp\left(-d_{\mathbb{D}}(z^1_{\mathrm{hyp}}, z^2_{\mathrm{hyp}}) / \tau\right) + \sum_n \exp\left(-d_{\mathbb{D}}(z^1_{\mathrm{hyp}}, z^n_{\mathrm{hyp}}) / \tau\right)}.$$

When minimizing the distances of all the positive pairs, with the intuition from equation 3, it would be beneficial to put the nodes near the root, i.e. objects, close to the center to achieve an overall lower loss. The overall loss function of our model is as follows:

$$\mathcal{L} = \mathcal{L}_{\mathrm{euc}} + \lambda \mathcal{L}_{\mathrm{hyp}},$$

where $\lambda$ is a scaling parameter to control the trade-off between hyperbolic and Euclidean losses.

3. Experiments

3.1. Implementation Details

Pre-training phase. We pre-train on three datasets: COCO [35], the full OpenImages labelled dataset [33] (∼1.7 million samples) and a subset of OpenImages (∼212k) [44]. All these datasets are multi-object datasets; OpenImages contains 12 objects on average per image and COCO contains 6 objects on average. We experiment with both the ground truth bounding box (GT) and using selective search (SS) [61] to produce object bounding boxes in an unsupervised fashion, following previous work [68]. As the goal of this paper is not to present another state-of-the-art self-supervised learning method, we implement our sampling
procedure and hyperbolic loss on top of three popular contrastive learning methods: MoCo-v2 [13], Dense-CL [64], and ORL [68]. Dense-CL is a contrastive learning framework which extracts dense features from scene images and generally achieves better object detection results than MoCo-v2. ORL is a pipeline that learns improved object representations from scene images. We also consider HCL without the hyperbolic loss L_hyp. This approach, which we denote as “HCL w/o L_hyp”, adopts the same cropping strategy as HCL but applies only a standard contrastive loss. We show that adding the hyperbolic loss improves results under various settings. More details on the datasets as well as training setups can be found in Appendix A.

Downstream tasks. We evaluate our pre-trained models on image classification, object detection and semantic segmentation. For classification, we show linear evaluation (lineval) accuracy with MoCo-v2, i.e. we freeze the backbone and only train the final linear layer. We test on VOC [18], ImageNet-100 [58] and ImageNet-1k [15] datasets. For object detection and semantic segmentation, we show results with all 3 baselines on the COCO datasets using Mask R-CNN, following [13]. We closely follow the common protocols listed in Detectron2 [66].

3.2. Main Results

Object detection and semantic segmentation. Table 1 reports the object detection and semantic segmentation results by pre-training on COCO and the full OpenImages dataset (last 3 rows) using selective search boxes. HCL shows consistent improvements over the baselines on COCO object detection and COCO semantic segmentation. Although Dense-CL and ORL improve the object-level downstream performance over MoCo-v2 through improved object representations or dense pixel representations, they still lack the direct modeling of scene images. We show that learning representations for scene images in hyperbolic space is beneficial to object-level downstream performance. Note that pre-training Dense-CL on ImageNet for 200 epochs gives 40.3 mAP [64], while pre-training on OpenImages for only 75 epochs with our method gives 42.1 mAP. This shows the importance of efficient pre-training on datasets like OpenImages.

                    AP^b   AP^b_50   AP^b_75   AP^m   AP^m_50   AP^m_75
MoCo-v2 pre-trained on COCO:
  Baseline          38.5   58.1      42.1      34.8   55.3      37.3
  HCL w/o L_hyp     39.7   60.1      43.4      36.0   57.3      38.8
  HCL               40.6   61.1      44.5      37.0   58.3      39.7
Dense-CL pre-trained on COCO:
  Baseline          39.6   59.3      43.3      35.7   56.5      38.4
  HCL w/o L_hyp     41.3   61.5      44.7      37.5   59.5      40.4
  HCL               42.5   62.5      45.8      38.5   60.6      41.4
ORL pre-trained on COCO:
  Baseline          40.3   60.2      44.4      36.3   57.3      38.9
  HCL               41.4   61.4      45.5      37.3   58.5      40.0
Dense-CL pre-trained on OpenImages:
  Baseline          38.2   58.9      42.6      34.8   55.3      37.8
  HCL w/o L_hyp     41.1   61.5      44.4      37.2   58.3      39.7
  HCL               42.1   62.6      45.5      38.3   59.4      40.6

Table 1. Comparison with state-of-the-art methods. This table shows object detection (columns 1-3) and semantic segmentation (columns 4-6) results on COCO using MoCo-v2, Dense-CL and ORL by pre-training on COCO and OpenImages using unsupervised object bounding boxes generated by selective search. The first row in each sub-table shows the results using random crops on pre-training datasets. The rows labeled "HCL w/o L_hyp" set L_hyp to 0, which means we are pre-training baseline methods on just proposal boxes. Our model consistently improves both object detection and semantic segmentation tasks across multiple contrastive learning baselines by pre-training on both COCO (800 epochs) and the full OpenImages dataset (75 epochs, last 3 rows).

Method           Pre-train     Bbox   VOC     IN-100   IN-1k
MoCo-v2          COCO          -      64.79   64.84    51.17
HCL w/o L_hyp    COCO          SS     73.13   73.84    54.21
HCL w/o L_hyp    COCO          GT     75.55   76.22    54.52
HCL              COCO          SS     74.19   75.16    55.03
HCL              COCO          GT     76.51   76.74    55.63
MoCo-v2          OpenImages    -      69.95   72.80    54.12
HCL w/o L_hyp    OpenImages    SS     71.82   75.33    56.58
HCL w/o L_hyp    OpenImages    GT     73.79   77.36    57.57
HCL              OpenImages    SS     74.31   78.14    58.12
HCL              OpenImages    GT     75.40   79.08    58.51

Table 2. Classification results with linear evaluation. The first row shows the results using random crops on pre-training datasets. In the last two rows we use our hyperbolic loss and we see improved performance by using both Ground Truth (GT) boxes and Selective Search (SS) boxes. HCL improves scene-level classification on the VOC dataset, and object-level classification on ImageNet-100 and ImageNet-1k datasets.

Image classification. As shown in Table 2, HCL improves image classification on both scene-level (VOC) and object-level (ImageNet) datasets. When pretraining on OpenImages, HCL improves ImageNet lineval accuracy by 0.94 percentage points and VOC lineval classification accuracy by 1.61 mAP. We observe similar improvements when pretraining on COCO. HCL improves accuracy whether we use ground truth object bounding boxes or boxes generated by selective search. In general, we observe a larger improvement of using HCL on OpenImages than COCO, which supports our hypothesis that HCL provides larger improvements on datasets with
more objects per image.

3.3. Properties of Models Trained with HCL

The visual representations learned by HCL have several useful properties. In this section, we evaluate the representation norm as a measure of the label uncertainty for image classification datasets, and evaluate the object-scene similarity in terms of out-of-context detection.

3.3.1 Label Uncertainty Quantification

ImageNet [15] is an image classification dataset consisting of object-centered images, each of which has a single label. As the performance on this dataset has gradually saturated, the original labels have been scrutinized more carefully [3, 52, 55, 60, 62]. Prevailing labeling issues in the validation set have been recently identified, including labeling errors, multi-label images with only a single label provided, and so on. Although [3] provides reassessed labels for the entire validation set, relabeling the entire training set may be infeasible.

Our learned representations provide a potential automatic way to identify images with multiple labels from datasets like ImageNet. Specifically, we first show in Figure 3 that there is a strong correlation between the representation norms and the number of labels per image according to the reassessed labels. For each class of the ImageNet training set, we use a pre-trained OpenImages model and rank the images according to their norms. The extreme images of some classes are shown in Figure 4 and also in the Appendix. Images with smaller norms tend to capture a single object, while those with larger norms are likely to depict a scene.

Figure 3. Average representation norms of images with different number of labels in ImageNet-ReaL.

To quantitatively evaluate this property, we report the NDCG metric on the ranked images as shown in Table 3. NDCG assesses how often the scene images are ranked at the top. As a baseline, we rank the images based on the entropy of the class probability predicted by a classifier, which is a widely adopted indicator of label uncertainty [11, 47]. We use both MoCo-v2 and supervised ResNet-50 as the classifier. As shown in Table 3, using norms with HCL achieves similar rank quality as using entropy with the supervised ResNet-50 on the ImageNet-ReaL dataset. In addition, when combining two ranks using simple ensemble methods such as Borda count, the score is further improved to 0.717. This shows that the entropy and the norm provide complementary signals regarding the existence of multiple labels. For example, the entropy indicator can be affected by the bias of the model and the norm indicator can be wrong on the images with multiple objects from the same class.

                               Datasets
Method       Indicator         IN-Real   COCO
MoCo         Entropy           0.633     0.791
Supervised   Entropy           0.671     0.793
HCL          Norm              0.655     0.839
Ensemble     Entropy+Norm      0.717     0.823

Table 3. NDCG scores of the image rankings based on the different indicators and models, and evaluated by the number of labels per image.

Compared to supervised indicators of label uncertainty, HCL has the additional advantage that it is dataset-agnostic and can be applied to new data without further training. To demonstrate this benefit, we report the same metric on the COCO validation set, where we also have the number of labels for each image. Our method achieves much better NDCG scores than the supervised ResNet-50 as shown in Table 3. This finding can be potentially useful to guide label reassessment, or provide an extra signal for model training.

3.3.2 Out-of-Context Detection

Our hyperbolic loss L_hyp encourages the model to capture the similarity between the object and scene. We apply the resulting representations to detect out-of-context objects, which can be useful in designing data augmentation for object detection [17]. We are especially interested in out-of-context images with conflicting backgrounds. To this end, we use the out-of-context images proposed in the SUN09 dataset [14]. We first compute the representations of each object and of the entire scene image with that object masked out. We then calculate the hyperbolic distance between the representations mapped to the Poincaré ball. Some example images from this dataset as well as the distance of each contained object are shown in Figure 5. We find that the out-of-context objects generally have a large distance, i.e. smaller similarity, to the overall scene image. To quantify this finding, we compute the mAP of the object ranking on each image and obtain 0.61 for HCL. As a comparison, the MoCo similarity gives mAP = 0.52 and the random ranking gives mAP = 0.44.
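The distance-based ranking described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the exponential map is taken at the origin (p = 0 in equation 2), uses the standard Möbius addition for curvature -1/r², and the function names (`mobius_add`, `poincare_distance`, `expmap0`, `rank_out_of_context`), the toy feature vectors, and the default ε are all hypothetical.

```python
import math


def mobius_add(x, y, r=1.0):
    """Möbius addition x ⊕ y on a Poincaré ball of radius r (curvature -1/r^2)."""
    c = 1.0 / (r * r)
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [((1.0 + 2.0 * c * xy + c * y2) * a + (1.0 - c * x2) * b) / denom
            for a, b in zip(x, y)]


def poincare_distance(p, q, r=1.0):
    """Equation 1: d(p, q) = 2 r artanh(||-p ⊕ q|| / r)."""
    diff = mobius_add([-a for a in p], q, r)
    norm = math.sqrt(sum(a * a for a in diff))
    norm = min(norm, r * (1.0 - 1e-12))  # guard against rounding onto the boundary
    return 2.0 * r * math.atanh(norm / r)


def expmap0(v, r=1.0, eps=1e-5):
    """Equation 2 at p = 0 (assumed base point), with the norm of v clipped to r - eps."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return list(v)
    clipped = min(norm, r - eps)
    scale = r * math.tanh(clipped / r) / norm
    return [scale * a for a in v]


def rank_out_of_context(scene_feat, object_feats, r=1.0):
    """Sort objects by hyperbolic distance to the scene, largest first.

    Objects at the top of the list are the out-of-context candidates."""
    z_scene = expmap0(scene_feat, r)
    dists = {name: poincare_distance(expmap0(feat, r), z_scene, r)
             for name, feat in object_feats.items()}
    return sorted(dists.items(), key=lambda item: item[1], reverse=True)


# Toy Euclidean encoder outputs (made-up numbers, for illustration only).
ranking = rank_out_of_context(
    scene_feat=[0.2, 0.0],
    object_feats={"chair": [0.1, 0.0], "boat": [-0.5, 0.3]},
)
print(ranking[0][0])  # the object farthest from the scene
```

In this toy example the object whose embedding points away from the scene embedding receives the larger distance and is ranked first, mirroring the observation that out-of-context objects tend to lie far from their scene.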
Figure 4. Images from ImageNet training set. The 5 images on the left have the smallest representation norms among all the images from the same class (objects), and the 5 on the right have the largest norms (scenes).

Figure 5. Out-of-context images from the SUN09 dataset. The bounding box of each object and its hyperbolic distance to the scene are shown. Regular objects are in blue and out-of-context objects are in purple. Note that the out-of-context objects tend to have large distances.

4. Main Ablation Studies

In this section, we report the results of several important ablation studies with respect to HCL. All the models are trained on the subset of the OpenImages dataset and linearly evaluated on the ImageNet-100 dataset. The top-1 accuracy is reported.

Similarity measure and the center of the scene-object hierarchy. We propose to use the negative hyperbolic distance as the similarity measure of the scene-object pairs. As an alternative, one can use cosine similarity on the hypersphere as the measure as in the original contrastive objective. However, this would attempt to maximize the similarity between a single object and multiple objects. It is likely that these objects belong to different classes, and hence this strategy impairs the quality of the representation. As shown in Table 4, replacing the negative hyperbolic distance with the Euclidean similarity impairs downstream performance. The resulting model performs even worse than the baseline without loss function on the scene-object pairs, demonstrating the necessity of using hyperbolic distance. We also validate our choice of an object-centric hierarchy by comparing its performance with that of a scene-centric hierarchy [48, 49] generated by sampling the negative pairs as objects and unpaired scenes. This scene-centric hierarchy leads to substantially lower accuracy (Table 4).

Trade-off between the Euclidean and hyperbolic losses. We adopt the Euclidean loss to learn object-object similarity and the hyperbolic loss to learn object-scene similarity. A hyperparameter λ controls the trade-off between them. As shown in Table 5, we find that a smaller λ = 0.01 leads to marginal improvement. However, we also observe that larger values of λ can lead to unstable and even stalled training. With careful inspection, we find that in the early stage of the training, the gradient provided by the hyperbolic loss can be inaccurate but strong, which pushes the representations to be
Distance Center IN-100 Accuracy IN-100 Accuracy Optimizer IN-100 Accuracy
- - 77.36 0.01 77.70 RSGD 0.1 79.08
Hyperbolic Scene 79.08 0.1 79.08 RSGD 0.5 0
Hyperbolic Object 76.96 0.2 78.64 SGD 0.1 70.16
Euclidean Scene 76.68 0.5 0 SGD 0.5 74.18

Table 4. Similarity measure and hierarchy center. Table 5. Losses trade-off. Table 6. RSGD versus SGD optimizers.

close to the boundary. As a result, since Riemannian SGD divides gradients by the distance to the boundary, updates become small and training ceases to make progress.

Optimizer. Given the observation above, we ask whether RSGD is necessary for practical usage. We replace the RSGD optimizer with SGD. To avoid numerical issues when the representations are too close to the boundary, we increase ε from 1e−5 to 1e−1. This allows a larger λ to be used than with RSGD. However, SGD always yields inferior performance compared to RSGD.

5. Related Work

Representation Learning with Hyperbolic Space. Representations are typically learned in Euclidean space. Hyperbolic space has been adopted for its expressiveness in modeling tree-like structures existing in various domains such as language [45, 46, 53], graphs [2, 8, 50], and vision [10, 57]. The corresponding neural network modules have been designed to boost the progress of such applications [9, 21, 37, 56]. The hierarchical structure present in the datasets can arise from three factors that motivate the use of hyperbolic space. The first factor is generality: the hypernym-hyponym property is a natural feature of words (e.g. WordNet [42]), and hyperbolic space is extensively exploited to learn word and image embeddings that preserve that property [20, 38, 40, 53, 59, 70]. The second factor is uncertainty: several studies have found that applying hyperbolic neural network modules to different tasks leads to a natural modeling of uncertainty [23, 31, 57]. The third factor is compositionality of different basic elements to form a natural hierarchy. Motivated by these factors, previous work in computer vision has applied hierarchical representations learned in hyperbolic space to various tasks such as image classification [31] or segmentation [65], zero-/few-shot learning [38], action recognition [40], and video prediction [57]. In this paper, we focus on learning representations that capture the hierarchy between objects and scenes, with the goal of learning general-purpose image representations that can transfer to various downstream tasks.

Self-Supervised Learning on Scenes. Self-Supervised Learning (SSL) has made great strides in closing the performance gap with supervised methods [12, 13, 22] when pretrained on object-centric datasets like ImageNet. However, recent work has shown that SSL is limited on multi-object datasets like COCO [43, 54, 64] and OpenImages [33]. Several papers mitigate this issue by proposing different techniques. Dense-CL [64] operates on pre-average-pool features and uses dense pixel-level features to show improved performance on dense tasks such as semantic segmentation. DetCon [28] uses unsupervised semantic segmentation masks to generate features for the corresponding objects in the two views. PixContrast [69] uses a pixel-to-propagation consistency pretext task to build features for both dense and discriminative downstream tasks. Pixel-to-Pixel Contrast [63] uses pixel-level contrastive learning to learn better features for semantic segmentation. Self-EMD [39] uses the earth mover's distance with BYOL [24] for pretraining on the COCO dataset. ORL [68] uses selective search to generate object proposals, then applies an object-level contrastive loss to enforce object-level consistency. Below-par performance of SSL methods can be attributed to treating scenes and objects using similar techniques, which often results in similar representations. In our work, instead of treating scenes and objects similarly, we use a hyperbolic loss, which builds representations that disambiguate scenes and objects based on the norm of the embeddings. Our method not only separates scenes and objects, but also improves downstream tasks such as image classification.

6. Conclusion

We present HCL, a contrastive learning framework that learns visual representations for both objects and scenes in the same representation space. The major novelty of our method is a hyperbolic contrastive objective built on an object-centric scene hierarchy. We show the effectiveness of HCL on several benchmarks including image classification, object detection, and semantic segmentation. We also demonstrate useful properties of the representations under several zero-shot settings, from detecting out-of-context objects to quantifying the label uncertainty in datasets like ImageNet. More generally, we hope this paper will encourage future work towards building a more holistic visual representation space, and draw attention to the power of non-Euclidean representation learning.

7. Acknowledgements

Songwei Ge, Shlok Mishra, and David Jacobs were supported in part by the National Science Foundation under grant nos. IIS-1910132 and IIS-2213335.
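A note on the stalled-training behaviour reported in the ablations on λ and the optimizer. The sketch below is not the authors' implementation. It assumes the Poincaré ball model and the standard retraction-style Riemannian SGD update, in which the Euclidean gradient is rescaled by (1 − ‖x‖²)² / 4 and the result is clipped back to norm at most 1 − ε. The names `rsgd_step` and `step_length` are hypothetical. Under these assumptions, the same Euclidean gradient moves a point near the boundary far less than a point near the origin, which is consistent with training ceasing to make progress once representations are pushed outward.

```python
import numpy as np


def rsgd_step(x, euclid_grad, lr=0.1, eps=1e-5):
    """One retraction-style Riemannian SGD step on the Poincare ball (assumed model)."""
    sq_norm = np.sum(x * x)
    # Inverse metric of the Poincare ball: shrinks to zero as ||x|| approaches 1.
    scale = (1.0 - sq_norm) ** 2 / 4.0
    x_new = x - lr * scale * euclid_grad
    # Clip back inside the open unit ball, leaving a margin of eps to the boundary.
    norm = np.linalg.norm(x_new)
    max_norm = 1.0 - eps
    if norm > max_norm:
        x_new = x_new * (max_norm / norm)
    return x_new


def step_length(radius, lr=0.1):
    """Length of the update for a point at the given radius and a unit Euclidean gradient."""
    x = np.array([radius, 0.0])
    grad = np.array([1.0, 0.0])  # points outward, so the step moves the point inward
    return np.linalg.norm(rsgd_step(x, grad, lr=lr) - x)


if __name__ == "__main__":
    for r in (0.0, 0.9, 0.999):
        print(f"radius={r:<6} step length={step_length(r):.3e}")
```

With lr = 0.1 and a unit gradient, the step at the origin has length 0.025, while at radius 0.999 it is roughly 1e-7. Raising ε, as done in the SGD ablation, keeps points further from the boundary; in this sketch it only changes the clipping margin.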

References

[1] Yutong Bai, Xinlei Chen, Alexander Kirillov, Alan Yuille, and Alexander C Berg. Point-level region contrast for object detection pre-training. CVPR, 2022. 1
[2] Ivana Balazevic, Carl Allen, and Timothy Hospedales. Multi-relational poincaré graph embeddings. NeurIPS, 2019. 8
[3] Lucas Beyer, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with imagenet?, 2020. 6
[4] Silvere Bonnabel. Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013. 3
[5] James W Cannon, William J Floyd, Richard Kenyon, Walter R Parry, et al. Hyperbolic geometry. Flavors of geometry, 31(59-115):2, 1997. 3
[6] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS, 2020. 1
[7] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021. 1
[8] Ines Chami, Adva Wolf, Da-Cheng Juan, Frederic Sala, Sujith Ravi, and Christopher Ré. Low-dimensional hyperbolic knowledge graph embeddings. In ACL, 2020. 8
[9] Ines Chami, Zhitao Ying, Christopher Ré, and Jure Leskovec. Hyperbolic graph convolutional neural networks. NeurIPS, 2019. 8
[10] Jiaxin Chen, Jie Qin, Yuming Shen, Li Liu, Fan Zhu, and Ling Shao. Learning attentive and hierarchical representations for 3d shape recognition. In ECCV, 2020. 8
[11] Pengfei Chen, Ben Ben Liao, Guangyong Chen, and Shengyu Zhang. Understanding and utilizing deep neural networks trained with noisy labels. In ICML, 2019. 6
[12] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. 1, 2, 8
[13] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020. 5, 8
[14] Myung Jin Choi, Joseph J Lim, Antonio Torralba, and Alan S Willsky. Exploiting hierarchical context on a large database of object categories. In CVPR, 2010. 2, 6
[15] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. 5, 6
[16] Manfredo Perdigao Do Carmo and J Flaherty Francis. Riemannian geometry, volume 6. Springer, 1992. 3
[17] Nikita Dvornik, Julien Mairal, and Cordelia Schmid. On the importance of visual context for data augmentation in scene understanding. PAMI, 43(6):2014–2028, 2019. 6
[18] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010. 5
[19] Carolina Galleguillos, Andrew Rabinovich, and Serge Belongie. Object categorization using co-occurrence, location and appearance. In CVPR, 2008. 2
[20] Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In ICML, 2018. 8
[21] Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. NeurIPS, 2018. 3, 8
[22] Songwei Ge, Shlok Kumar Mishra, Haohan Wang, Chun-Liang Li, and David Jacobs. Robust contrastive learning using negative samples with diminished semantics. In NeurIPS, 2021. 8
[23] Mina GhadimiAtigh, Julian Schoep, Erman Acar, Nanne van Noord, and Pascal Mettes. Hyperbolic image segmentation. arXiv preprint arXiv:2203.05898, 2022. 8
[24] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In NeurIPS, 2020. 1, 8
[25] Mikhael Gromov. Hyperbolic groups. In Essays in group theory, pages 75–263. Springer, 1987. 2, 3
[26] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, June 2022. 1
[27] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020. 1, 2, 4
[28] Olivier J Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron van den Oord, Oriol Vinyals, and João Carreira. Efficient visual pretraining with contrastive detection. In ICCV, pages 10086–10096, 2021. 8
[29] Geoffrey Hinton. How to represent part-whole hierarchies in a neural network. arXiv preprint arXiv:2102.12627, 2021. 2
[30] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015. 2
[31] Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Ustinova, Ivan Oseledets, and Victor Lempitsky. Hyperbolic image embeddings. In CVPR, 2020. 8
[32] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017. 2
[33] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4. IJCV, 2020. 4, 8
[34] John M Lee. Introduction to Riemannian manifolds. Springer, 2018. 3
[35] Tsung-Yi Lin, M. Maire, Serge J. Belongie, James Hays, P. Perona, D. Ramanan, Piotr Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 4

[36] Nathan Linial, Eran London, and Yuri Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215–245, 1995. 3
[37] Qi Liu, Maximilian Nickel, and Douwe Kiela. Hyperbolic graph neural networks. NeurIPS, 2019. 8
[38] Shaoteng Liu, Jingjing Chen, Liangming Pan, Chong-Wah Ngo, Tat-Seng Chua, and Yu-Gang Jiang. Hyperbolic visual embedding learning for zero-shot recognition. In CVPR, 2020. 8
[39] Songtao Liu, Zeming Li, and Jian Sun. Self-emd: Self-supervised object detection without imagenet, 2021. 1, 8
[40] Teng Long, Pascal Mettes, Heng Tao Shen, and Cees G. M. Snoek. Searching for actions on the hyperbole. In CVPR, 2020. 8
[41] Thomas Mensink, Efstratios Gavves, and Cees GM Snoek. Costa: Co-occurrence statistics for zero-shot classification. In CVPR, 2014. 2
[42] George A Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J Miller. Introduction to wordnet: An on-line lexical database. International journal of lexicography, 3(4):235–244, 1990. 2, 8
[43] Shlok Kumar Mishra, Anshul B. Shah, Ankan Bansal, Jonghyun Choi, Abhinav Shrivastava, Abhishek Sharma, and David Jacobs. Learning visual representations for transfer learning by suppressing texture. ArXiv, abs/2011.01901, 2020. 8
[44] Shlok Kumar Mishra, Anshul B. Shah, Ankan Bansal, Abhyuday N. Jagannatha, Abhishek Sharma, David Jacobs, and Dilip Krishnan. Object-aware cropping for self-supervised learning. ArXiv, abs/2112.00319, 2021. 4
[45] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. NeurIPS, 2017. 2, 3, 8
[46] Maximillian Nickel and Douwe Kiela. Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In ICML, 2018. 8
[47] Curtis Northcutt, Lu Jiang, and Isaac Chuang. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021. 6
[48] Devi Parikh and Tsuhan Chen. Hierarchical semantics of objects (hsos). In ICCV, 2007. 2, 7
[49] Devi Parikh, C Lawrence Zitnick, and Tsuhan Chen. Unsupervised learning of hierarchical spatial structures in images. In CVPR, 2009. 7
[50] Jiwoong Park, Junho Cho, Hyung Jin Chang, and Jin Young Choi. Unsupervised hyperbolic representation learning via message passing auto-encoders. In CVPR, 2021. 8
[51] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 1
[52] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In ICML, 2019. 6
[53] Frederic Sala, Chris De Sa, Albert Gu, and Christopher Ré. Representation tradeoffs for hyperbolic embeddings. In ICML, 2018. 2, 3, 8
[54] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017. 8
[55] Vaishaal Shankar, Rebecca Roelofs, Horia Mania, Alex Fang, Benjamin Recht, and Ludwig Schmidt. Evaluating machine accuracy on imagenet. In ICML, 2020. 6
[56] Ryohei Shimizu, Yusuke Mukuta, and Tatsuya Harada. Hyperbolic neural networks++. In ICLR, 2021. 8
[57] Dídac Surís, Ruoshi Liu, and Carl Vondrick. Learning the predictability of the future. In CVPR, 2021. 8
[58] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In ECCV, 2020. 5
[59] Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. Poincaré glove: Hyperbolic word embeddings. In ICLR. OpenReview, 2018. 2, 3, 8
[60] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. From imagenet to image classification: Contextualizing progress on benchmarks. In ICML, 2020. 6
[61] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. IJCV, 104(2):154–171, 2013. 4
[62] Vijay Vasudevan, Benjamin Caine, Raphael Gontijo-Lopes, Sara Fridovich-Keil, and Rebecca Roelofs. When does dough become a bagel? analyzing the remaining mistakes on imagenet. arXiv preprint arXiv:2205.04596, 2022. 6
[63] Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool. Exploring cross-image pixel contrast for semantic segmentation. In ICCV, 2021. 8
[64] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense contrastive learning for self-supervised visual pre-training. In CVPR, 2021. 1, 5, 8
[65] Zhenzhen Weng, Mehmet Giray Ogut, Shai Limonchik, and Serena Yeung. Unsupervised discovery of the long-tail in instance segmentation using hierarchical self-supervision. In CVPR, 2021. 8
[66] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019. 5
[67] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018. 1, 2
[68] Jiahao Xie, Xiaohang Zhan, Ziwei Liu, Yew Soon Ong, and Chen Change Loy. Unsupervised object-level representation learning from scene images. In NeurIPS, 2021. 1, 4, 5, 8
[69] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In CVPR, 2021. 1, 8
[70] Jiexi Yan, Lei Luo, Cheng Deng, and Heng Huang. Unsupervised hyperbolic metric learning. In CVPR, 2021. 8

