Ge Hyperbolic Contrastive Learning For Visual Representations Beyond Objects CVPR 2023 Paper
Abstract
classes lie close to each other in the representation space. However, it is not clear how the representations of scene images should fit into that topology. Directly applying existing contrastive learning methods results in a sub-optimal topology of scenes and objects as well as unsatisfactory performance, as we will show in the experiments. To this end, we argue that a hierarchical structure can be naturally adopted. Considering that the same class of objects can be placed in different scenes, we construct a hierarchical structure to describe such relationships, where the root nodes are the visually similar objects, and the scene images consisting of them are placed as the descendants. We call this structure the object-centric scene hierarchy.

The immediate modeling difficulty induced by this structure is the combinatorial explosion. A finite number of objects leads to exponentially many different possible scenes. Consequently, Euclidean space may require an arbitrarily large number of dimensions to faithfully embed these scenes, whereas it is known that any infinite tree can be embedded with low distortion in a 2D hyperbolic space [25]. Therefore, we propose to employ a hyperbolic objective to regularize the scene representations. To learn representations of scenes, in the general setting of contrastive learning, we sample co-occurring scene-object pairs as positive pairs, and objects that are not part of that scene as negative samples, and use these pairs to compute an auxiliary hyperbolic contrastive objective. Our model is trained to reduce the distance between positive pairs and push away the negative pairs in a hyperbolic space.

Contrastive learning usually has objectives defined on a hypersphere [12, 27]. By discarding the norm information, these models circumvent the shortcut of minimizing losses through tuning the norms and obtain better downstream performance. However, the norm of the representation can also be used to encode useful representational structure. In hyperbolic space, the magnitude of a vector often plays the role of modeling the hypernymy of the hierarchical structure [45, 53, 59]. When projecting the representations to the hyperbolic space, the norm information is preserved and used to determine the Riemannian distance, which eventually affects the loss. Since hyperbolic space is diffeomorphic and conformal to Euclidean space, our hyperbolic contrastive loss is differentiable and complementary to the original contrastive objective.

When training simultaneously with the original contrastive objective for objects and our proposed hyperbolic contrastive objective for scenes, the resulting representation space exhibits a desired hierarchical structure while leaving the object clustering topology intact as shown in Figure 1. We demonstrate the effectiveness of the hyperbolic objective under several frameworks on multiple downstream tasks. We also show that the properties of the representations allow us to perform various vision tasks in a zero-shot way, from label uncertainty quantification to out-of-context object detection.

Our contributions are summarized below:

1. We propose a hyperbolic contrastive loss that regularizes scene representations so that they follow an object-centric hierarchy, with positive and negative pairs sampled from the hierarchy.
2. We demonstrate that our learned representations transfer better than representations learned using vanilla contrastive loss on a variety of downstream tasks, including object detection, semantic segmentation, and linear classification.
3. We show that the magnitudes of representation norms effectively reflect the scene-object hypernymy.

2. Method

In this section, we elaborate upon our approach to learning visual representations of object and scene images. We start by describing the hierarchical structure between objects and scenes that we wish to enforce in the learned representation space.

2.1. Object-Centric Scene Hierarchy

From simple object co-occurrence statistics [19, 41] to finer object relationships [30, 32], using hierarchical relationships between objects and scenes to understand images is not new. Previous studies primarily work on an image-level hierarchy by dividing an image into its lower-level elements recursively: a scene contains multiple objects, an object has different parts, and each part may consist of even lower-level features [14, 29, 48]. While this is intuitive, it describes a hierarchical structure contained in the individual images. Instead, we study the structure present among different images. Our goal is to learn a representation space for images of both objects and scenes across the entire dataset. To this end, we argue that it is more natural to consider an object-centric hierarchy.

It is known that when training an image classifier, the objects from visually similar classes often lie close to each other in the representation space [67], which has become the cornerstone of contrastive learning. Motivated by this observation, we believe that the representation of each scene image should also be close to the object clusters it consists of. However, modeling scenes requires a much larger volume due to the exponential number of possible compositions of objects. Another way to think about the object-centric hierarchy is through the generality and specificity as often discussed in the language literature [42, 45]. An object concept is general when standing alone in the visual world, and it will become specific when a certain context is given. For example, “a desk” is thought to be a more general concept than “a desk in a classroom with a boy sitting on it”.
Therefore, we propose to study an object-centric hierarchy across the entire dataset. Formally, given a set of images $S = \{s_1, s_2, \cdots, s_n\}$, $O_i = \{o_i^1, o_i^2, \cdots, o_i^{n_i}\}$ are the object bounding boxes contained in the image $s_i$. We define the regions of scene $R_i = \{r_i^1, r_i^2, \cdots, r_i^{m_i}\}$ to be partial areas of the image $s_i$ that contain multiple objects such that $r_i^j = \bigcup_k o_i^k$, where $o_i^k \in O_i$ and object $k$ is in the region $j$. We define the object-centric hierarchy $T = (V, E)$ to be that $V = S \cup O \cup R$, where $R = R_1 \cup \cdots \cup R_n$ and $O = O_1 \cup \cdots \cup O_n$. For $u, v \in V$, $e = (u, v)$ is an edge of $T$ if $u \subseteq v$ or $v \subseteq u$. Note that the natural scene images $S$ are always put as the leaf nodes.

2.2. Representation Learning beyond Objects

To describe our proposed model based on this hierarchy, we begin with a brief review of hyperbolic space and its properties used in our model. For comprehensive introductions to Riemannian geometry and hyperbolic space, we refer the readers to [16, 34].

2.2.1 Hyperbolic Space

A hyperbolic space $(\mathbb{H}^m, g)$ is a complete, connected Riemannian manifold with constant negative sectional curvature. These special manifolds are all isometric to each other with the isometries defined as $O^+(m, 1)$. Among these isometric manifolds, there are five common models that previous studies often work on [5]. In this paper, we choose the Poincaré ball $\mathbb{D}^n := \{p \in \mathbb{R}^n \mid \|p\|^2 < r^2\}$ as our basic model [21, 45, 59], where $r > 0$ is the radius of the ball. The Poincaré ball is coupled with a Riemannian metric $g_{\mathbb{D}}(p) = \frac{4}{(1 - \|p\|^2/r^2)^2}\, g_E$, where $p \in \mathbb{D}^n$ and $g_E$ is the canonical metric of the Euclidean space. For $p, q \in \mathbb{D}^n$, the Riemannian distance on the Poincaré ball induced by its metric $g_{\mathbb{D}}$ is defined as follows:

$$d_{\mathbb{D}}(p, q) = 2r \tanh^{-1}\left(\frac{\|-p \oplus q\|}{r}\right), \tag{1}$$

where $\oplus$ is the Möbius addition and it is clearly differentiable. In addition, the Poincaré ball can be viewed as a natural counterpart of the hypersphere as it allows all directions, unlike the other models such as the halfspace or hemisphere models that have constraints on the directions.

The hyperbolic space is globally diffeomorphic to the Euclidean space, which is stated in the theorem below:

Theorem 1. (Cartan–Hadamard). For every point $p \in \mathbb{H}^n$ the exponential map $\exp_p : T_p\mathbb{H}^n \approx \mathbb{R}^n \to \mathbb{H}^n$ is a smooth covering map. Since $\mathbb{H}^n$ is simply connected, it is diffeomorphic to $\mathbb{R}^n$.

Specifically, for $p \in \mathbb{D}^n$ and $v \in T_p\mathbb{D}^n \approx \mathbb{R}^n$, the exponential map of the Poincaré ball $\exp_p : T_p\mathbb{D}^n \to \mathbb{D}^n$ is defined as

$$\exp_p(v) := p \oplus \left(\tanh\left(\frac{r\|v\|}{r^2 - \|p\|^2}\right) \frac{r v}{\|v\|}\right). \tag{2}$$

The exponential map gives us a way to map the output of a network, which is in the Euclidean space, to the Poincaré ball. In practice, to avoid numerical issues, we clip the maximal norm of $v$ with $r - \epsilon$ before the projection, where $\epsilon > 0$. During the backpropagation, we perform RSGD [4] by scaling the gradients by $g_{\mathbb{D}}(p)^{-1}$. Intuitively, this forces the optimizer to take a smaller step when $p$ is closer to the boundary. The scaling factor is lower bounded by $O(\epsilon^2)$.

The immediate consequence of the negative curvature is that for any point $p \in \mathbb{H}^m$, there are no conjugate points along any geodesic starting from $p$. Therefore, the volume grows exponentially faster in hyperbolic space than in Euclidean space. Such a property makes it suitable to embed the hierarchical structure that has constant branching factors and exponential number of nodes. This is formally stated in the theorem below:

Theorem 2. [25] Given a Poincaré ball $\mathbb{D}^n$ with an arbitrary dimension $n \geq 2$ and any set of points $x_1, \cdots, x_m \in \mathbb{D}^n$, there exists a finite weighted tree $(T, d_T)$ and an embedding $f : T \to \mathbb{D}^n$ such that for all $i, j$,

$$\left| d_T\left(f^{-1}(x_i), f^{-1}(x_j)\right) - d_{\mathbb{D}}(x_i, x_j) \right| = O\left(\log(1 + \sqrt{2}) \log(m)\right).$$

Intuitively, the theorem states that any tree can be embedded into a Poincaré disk ($n = 2$) with low distortion. On the contrary, it is known that the Euclidean space with unbounded number of dimensions is not able to achieve such a low distortion [36]. One useful intuition [53] to help understand the advantage of the hyperbolic space is that, given two points $p, q \in \mathbb{D}^n$ s.t. $\|p\| = \|q\|$,

$$d_{\mathbb{D}}(p, q) \to d_{\mathbb{D}}(p, 0) + d_{\mathbb{D}}(0, q), \quad \text{as } \|p\| = \|q\| \to r. \tag{3}$$

This property basically reflects the fact that the shortest path in a tree is the path through the earliest common ancestor, and it is reproduced in the Poincaré ball when points are both close to the boundary.

2.2.2 Hyperbolic Contrastive Learning

Given the theoretical benefits of the hyperbolic space stated above, we propose a contrastive learning framework as shown in Figure 2. We adopt two losses to learn the object and scene representations. First, to learn object representations, we use the standard normalized temperature-scaled cross-entropy loss, which operates on the hypersphere in Euclidean space. As shown in the top branch of Figure 2, we crop two views of a jittered and slightly expanded object region as the positive pairs and feed them into the base and momentum encoders to calculate the object representations. We
[Figure 2: framework diagram. Object branch: encoder backbone and momentum encoder backbone, each followed by normalization, producing $z^1_{euc}$ and $z^2_{euc}$ (stop-gradient on the momentum side), compared by cosine similarity. Scene branch: encoder backbone and momentum encoder backbone, each followed by the exponential map, producing $z^1_{hyp}$ and $z^2_{hyp}$ (stop-gradient on the momentum side), compared by hyperbolic distance.]
Figure 2. Our Hyperbolic Contrastive Learning (HCL) framework has two branches: given a scene image, two object regions are cropped to learn the object representations with a loss defined in the Euclidean space focusing on the representation directions. A scene region as well as a contained object region are used to learn the scene representations with a loss defined in the hyperbolic space that affects the representation norms.
denote the output after the normalization to be $z^1_{euc}$ and $z^2_{euc}$. We follow MoCo [27] and leverage a memory bank to store the negative representations $z^n_{euc}$, which are the features $z^2_{euc}$ from the previous batches. Note that our framework can be readily extended to other contrastive learning models. The Euclidean loss for each image is then calculated as:

$$L_{euc} = -\log \frac{\exp\left(z^1_{euc} \cdot z^2_{euc} / \tau\right)}{\exp\left(z^1_{euc} \cdot z^2_{euc} / \tau\right) + \sum_n \exp\left(z^1_{euc} \cdot z^n_{euc} / \tau\right)},$$

where $\tau$ is a temperature parameter.

While the loss above aims to learn object representations, we propose a hyperbolic contrastive objective to learn the representations for scene images. We sample positive region pairs $u$ and $v$ from the object-centric scene hierarchy $T$ such that $(u, v) \in E$. In other words, as shown in the bottom branch of Figure 2, the objects contained in one region are required to be a subset of the objects in the other. We sample the negative samples of $u$ to be $N_u = \{v \mid (u, v) \notin E\}$. However, building and sampling exhaustively from the entire hierarchy explicitly is tricky. In practice, given an image $s$, we always sample $u \in R \cup \{s\}$ to be a scene region, $v \in O$ to be an object that occurs in $u$, and $N_u$ to be the other objects that are not in $u$.

The pair of scene and object images are fed into the base and momentum encoders that share the weights with the Euclidean branch. However, instead of normalizing the output of the encoders, we use the exponential map defined in equation 2 to project these features in the Euclidean space to the Poincaré ball, which are denoted as $z^1_{hyp}$ and $z^2_{hyp}$. Further, we replace the inner product in the cross-entropy loss with the negative hyperbolic distance as defined in equation 1. We calculate the hyperbolic contrastive loss as follows:

$$L_{hyp} = -\log \frac{\exp\left(-d_{\mathbb{D}}(z^1_{hyp}, z^2_{hyp}) / \tau\right)}{\exp\left(-d_{\mathbb{D}}(z^1_{hyp}, z^2_{hyp}) / \tau\right) + \sum_n \exp\left(-d_{\mathbb{D}}(z^1_{hyp}, z^n_{hyp}) / \tau\right)}.$$

When minimizing the distances of all the positive pairs, with the intuition from equation 3, it would be beneficial to put the nodes near the root, i.e. objects, close to the center to achieve an overall lower loss. The overall loss function of our model is as follows:

$$L = L_{euc} + \lambda L_{hyp},$$

where $\lambda$ is a scaling parameter to control the trade-off between hyperbolic and Euclidean losses.

3. Experiments

3.1. Implementation Details

Pre-training phase. We pre-train on three datasets: COCO [35], the full OpenImages labelled dataset [33] (∼1.7 million samples) and a subset of OpenImages (∼212k) [44]. All these datasets are multi-object datasets; OpenImages contains 12 objects on average per image and COCO contains 6 objects on average. We experiment with both the ground truth bounding box (GT) and using selective search (SS) [61] to produce object bounding boxes in an unsupervised fashion, following previous work [68]. As the goal of this paper is not to present another state-of-the-art self-supervised learning method, we implement our sampling
procedure and hyperbolic loss on top of three popular contrastive learning methods: MoCo-v2 [13], Dense-CL [64], and ORL [68]. Dense-CL is a contrastive learning framework which extracts dense features from scene images and generally achieves better object detection results than MoCo-v2. ORL is a pipeline that learns improved object representations from scene images. We also consider HCL without the hyperbolic loss $L_{hyp}$. This approach, which we denote as “HCL w/o $L_{hyp}$”, adopts the same cropping strategy as HCL but applies only a standard contrastive loss. We show that adding the hyperbolic loss improves results under various settings. More details on the datasets as well as training setups can be found in Appendix A.

Downstream tasks. We evaluate our pre-trained models on image classification, object detection and semantic segmentation. For classification, we show linear evaluation (lineval) accuracy with MoCo-v2, i.e. we freeze the backbone and only train the final linear layer. We test on VOC [18], ImageNet-100 [58] and ImageNet-1k [15] datasets. For object detection and semantic segmentation, we show results with all 3 baselines on the COCO datasets using Mask R-CNN, following [13]. We closely follow the common protocols listed in Detectron2 [66].

Method            AP^b   AP^b_50  AP^b_75  AP^m   AP^m_50  AP^m_75
MoCo-v2 pre-trained on COCO:
Baseline          38.5   58.1     42.1     34.8   55.3     37.3
HCL w/o L_hyp     39.7   60.1     43.4     36.0   57.3     38.8
HCL               40.6   61.1     44.5     37.0   58.3     39.7
Dense-CL pre-trained on COCO:
Baseline          39.6   59.3     43.3     35.7   56.5     38.4
HCL w/o L_hyp     41.3   61.5     44.7     37.5   59.5     40.4
HCL               42.5   62.5     45.8     38.5   60.6     41.4
ORL pre-trained on COCO:
Baseline          40.3   60.2     44.4     36.3   57.3     38.9
HCL               41.4   61.4     45.5     37.3   58.5     40.0
Dense-CL pre-trained on OpenImages:
Baseline          38.2   58.9     42.6     34.8   55.3     37.8
HCL w/o L_hyp     41.1   61.5     44.4     37.2   58.3     39.7
HCL               42.1   62.6     45.5     38.3   59.4     40.6

Table 1. Comparison with state-of-the-art methods. This table shows object detection (columns 1-3) and semantic segmentation (columns 4-6) results on COCO using MoCo-v2, Dense-CL and ORL by pre-training on COCO and OpenImages using unsupervised object bounding boxes generated by the selective search. The first row in each sub-table shows the results using random crops on pre-training datasets. The rows labeled HCL w/o $L_{hyp}$ set the hyperbolic loss to 0, which means we are pre-training baseline methods on just proposal boxes. Our model consistently improves both object detection and semantic segmentation tasks across multiple contrastive learning baselines by pre-training on both COCO (800 epochs) and the full OpenImages dataset (75 epochs, last 3 rows).

Method            Pre-train    Bbox   VOC     IN-100   IN-1k
MoCo-v2           COCO         -      64.79   64.84    51.17
HCL w/o L_hyp     COCO         SS     73.13   73.84    54.21
HCL w/o L_hyp     COCO         GT     75.55   76.22    54.52
HCL               COCO         SS     74.19   75.16    55.03
HCL               COCO         GT     76.51   76.74    55.63
MoCo-v2           OpenImages   -      69.95   72.80    54.12
HCL w/o L_hyp     OpenImages   SS     71.82   75.33    56.58
HCL w/o L_hyp     OpenImages   GT     73.79   77.36    57.57
HCL               OpenImages   SS     74.31   78.14    58.12
HCL               OpenImages   GT     75.40   79.08    58.51

Table 2. Classification results with linear evaluation. The first row of each block shows the results using random crops on pre-training datasets. In the last two rows of each block we use our hyperbolic loss and we see improved performance by using both Ground Truth (GT) boxes and Selective Search (SS) boxes. HCL improves scene-level classification on the VOC dataset, and object-level classification on ImageNet-100 and ImageNet-1k datasets.

3.2. Main Results

Object detection and semantic segmentation. Table 1 reports the object detection and semantic segmentation results by pre-training on COCO and the full OpenImages dataset (last 3 rows) using selective search boxes. HCL shows consistent improvements over the baselines on COCO object detection and COCO semantic segmentation. Although Dense-CL and ORL improve the object-level downstream performance over MoCo-v2 through improved object representations or dense pixel representations, they still lack the direct modeling of scene images. We show that learning representations for scene images in hyperbolic space is beneficial to object-level downstream performance. Note that pre-training Dense-CL on ImageNet for 200 epochs gives 40.3 mAP [64], while pre-training on OpenImages for only 75 epochs with our method gives 42.1 mAP. This shows the importance of efficient pre-training on datasets like OpenImages.

Image classification. As shown in Table 2, HCL improves image classification on both scene-level (VOC) and object-level (ImageNet) datasets. When pretraining on OpenImages, HCL improves ImageNet lineval accuracy by 0.94 percentage points and VOC lineval classification accuracy by 1.61 mAP. We observe similar improvements when pretraining on COCO. HCL improves accuracy whether we use ground truth object bounding boxes or boxes generated by selective search. In general, we observe a larger improvement of using HCL on OpenImages than COCO, which supports our hypothesis that HCL provides larger improvements on datasets with
more objects per image.

3.3. Properties of Models Trained with HCL

The visual representations learned by HCL have several useful properties. In this section, we evaluate the representation norm as a measure of the label uncertainty for image classification datasets, and evaluate the object-scene similarity in terms of out-of-context detection.

3.3.1 Label Uncertainty Quantification

ImageNet [15] is an image classification dataset consisting of object-centered images, each of which has a single label. As the performance on this dataset has gradually saturated, the original labels have been scrutinized more carefully [3, 52, 55, 60, 62]. Prevailing labeling issues in the validation set have been recently identified, including labeling errors, multi-label images with only a single label provided, and so on. Although [3] provides reassessed labels for the entire validation set, relabeling the entire training set may be infeasible.

Our learned representations provide a potential automatic way to identify images with multiple labels from datasets like ImageNet. Specifically, we first show in Figure 3 that there is a strong correlation between the representation norms and the number of labels per image according to the reassessed labels. For each class of the ImageNet training set, we use a pre-trained OpenImages model and rank the images according to their norms. The extreme images of some classes are shown in Figure 4 and also in the Appendix. Images with smaller norms tend to capture a single object, while those with larger norms are likely to depict a scene.

[Figure 3: plot of average representation norm against number of labels per image.]
Figure 3. Average representation norms of images with different number of labels in ImageNet-ReaL.

To quantitatively evaluate this property, we report the NDCG metric on the ranked images as shown in Table 3. NDCG assesses how often the scene images are ranked at the top. As a baseline, we rank the images based on the entropy of the class probability predicted by a classifier, which is a widely adopted indicator of label uncertainty [11, 47]. We use both MoCo-v2 and supervised ResNet-50 as the classifier. As shown in Table 3, using norms with HCL achieves similar rank quality as using entropy with the supervised ResNet-50 on the ImageNet-ReaL dataset. In addition, when combining two ranks using simple ensemble methods such as Borda count, the score is further improved to 0.717. This shows that the entropy and the norm provide complementary signals regarding the existence of multiple labels. For example, the entropy indicator can be affected by the bias of the model and the norm indicator can be wrong on the images with multiple objects from the same class.

Method       Indicator       IN-ReaL   COCO
MoCo         Entropy         0.633     0.791
Supervised   Entropy         0.671     0.793
HCL          Norm            0.655     0.839
Ensemble     Entropy+Norm    0.717     0.823

Table 3. NDCG scores of the image rankings based on the different indicators and models, and evaluated by the number of labels per image.

Compared to supervised indicators of label uncertainty, HCL has the additional advantage that it is dataset-agnostic and can be applied to new data without further training. To demonstrate this benefit, we report the same metric on the COCO validation set, where we also have the number of labels for each image. Our method achieves much better NDCG scores than the supervised ResNet-50 as shown in Table 3. This finding can be potentially useful to guide label reassessment, or provide an extra signal for model training.

3.3.2 Out-of-Context Detection

Our hyperbolic loss $L_{hyp}$ encourages the model to capture the similarity between the object and scene. We apply the resulting representations to detect out-of-context objects, which can be useful in designing data augmentation for object detection [17]. We are especially interested in out-of-context images with conflicting backgrounds. To this end, we use the out-of-context images proposed in the SUN09 dataset [14]. We first compute the representations of each object and of the entire scene image with that object masked out. We then calculate the hyperbolic distance between the representations mapped to the Poincaré ball. Some example images from this dataset as well as the distance of each contained object are shown in Figure 5. We find that the out-of-context objects generally have a large distance, i.e. smaller similarity, to the overall scene image. To quantify this finding, we compute the mAP of the object ranking on each image and obtain 0.61 for HCL. As a comparison, the MoCo similarity gives mAP = 0.52 and the random ranking gives mAP = 0.44.
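The out-of-context scoring described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names (`mobius_add`, `poincare_distance`, `exp_map_origin`, `out_of_context_ranking`) are hypothetical, the base point of the exponential map is assumed to be the origin (equation 2 with $p = 0$), and the inputs are assumed to be Euclidean encoder outputs for each object and for the scene with that object masked out.

```python
import numpy as np

def mobius_add(x, y, r=1.0):
    """Möbius addition x ⊕ y on a Poincaré ball of radius r."""
    c = 1.0 / r**2  # magnitude of the curvature
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c**2 * x2 * y2
    return num / den

def poincare_distance(p, q, r=1.0):
    """Equation 1: d(p, q) = 2 r artanh(||-p ⊕ q|| / r)."""
    ratio = np.linalg.norm(mobius_add(-p, q, r)) / r
    # Guard against ratio == 1 caused by rounding near the boundary.
    return 2 * r * np.arctanh(min(ratio, 1 - 1e-12))

def exp_map_origin(v, r=1.0, eps=1e-5):
    """Equation 2 at p = 0 (an assumption), clipping ||v|| to r - eps."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    if norm > r - eps:
        v = v * (r - eps) / norm
        norm = r - eps
    return np.tanh(norm / r) * r * v / norm

def out_of_context_ranking(object_feats, scene_feats, r=1.0):
    """Rank objects by decreasing hyperbolic distance to their scene.

    object_feats[i] and scene_feats[i] are Euclidean features of object i
    and of the scene with object i masked out (hypothetical inputs).
    """
    dists = np.array([
        poincare_distance(exp_map_origin(o, r), exp_map_origin(s, r), r)
        for o, s in zip(object_feats, scene_feats)
    ])
    return np.argsort(-dists), dists
```

Objects at the front of the returned order are the candidates for being out of context. Mapping at the origin has the convenient property that the distance from the origin to `exp_map_origin(v)` is `2 * ||v||` when no clipping occurs, so the Euclidean norm of a feature directly controls how far it sits from the center of the ball.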
[Figure 4: image grid with panels "Smallest norms (objects)" and "Largest norms (scenes)".]
Figure 4. Images from ImageNet training set. The 5 images on the left have the smallest representation norms among all the images from the same class, and the 5 on the right have the largest norms.

[Figure 5: example images with object bounding boxes, each annotated with its hyperbolic distance to the scene.]
Figure 5. Out-of-context images from the SUN09 dataset. The bounding box of each object and its hyperbolic distance to the scene are shown. Regular objects are in blue and out-of-context objects are in purple. Note that the out-of-context objects tend to have large distances.
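The norm-based ranking illustrated in Figure 4, and its evaluation in Section 3.3.1, can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names `ndcg` and `borda_scores` are hypothetical, the NDCG variant uses a linear gain with a log2 discount (the text does not specify which variant is used), and relevance is taken to be the number of labels per image.

```python
import numpy as np

def ndcg(scores, relevance):
    """NDCG of the ranking induced by `scores` (higher score ranked first).

    Assumes linear gain and a 1 / log2(rank + 1) discount.
    """
    scores = np.asarray(scores, dtype=float)
    rel = np.asarray(relevance, dtype=float)
    order = np.argsort(-scores, kind="stable")
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = np.sum(rel[order] * discounts)
    ideal = np.sum(np.sort(rel)[::-1] * discounts)
    return dcg / ideal if ideal > 0 else 0.0

def borda_scores(*indicators):
    """Borda count: sum each item's rank position over all indicators."""
    total = np.zeros(len(indicators[0]))
    for s in indicators:
        s = np.asarray(s, dtype=float)
        # argsort of argsort gives each item's rank (0 = lowest score).
        total += np.argsort(np.argsort(s, kind="stable"), kind="stable")
    return total

# Toy usage with made-up values: norms and entropies for three images.
norms = [0.2, 0.9, 0.5]
entropy = [0.1, 0.3, 0.8]
num_labels = [1, 3, 2]
print(ndcg(norms, num_labels))
print(ndcg(borda_scores(norms, entropy), num_labels))
```

The ensemble row of Table 3 corresponds to scoring the combined Borda ranking rather than either indicator alone; the toy values above are invented for illustration and do not reproduce any number in the table.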
4. Main Ablation Studies

In this section, we report the results of several important ablation studies with respect to HCL. All the models are trained on the subset of the OpenImages dataset and linearly evaluated on the ImageNet-100 dataset. The top-1 accuracy is reported.

Similarity measure and the center of the scene-object hierarchy. We propose to use the negative hyperbolic distance as the similarity measure of the scene-object pairs. As an alternative, one can use cosine similarity on the hypersphere as the measure as in the original contrastive objective. However, this would attempt to maximize the similarity between a single object and multiple objects. It is likely that these objects belong to different classes, and hence this strategy impairs the quality of the representation. As shown in Table 4, replacing the negative hyperbolic distance with the Euclidean similarity impairs downstream performance. The resulting model performs even worse than the baseline without a loss function on the scene-object pairs, demonstrating the necessity of using hyperbolic distance. We also validate our choice of an object-centric hierarchy by comparing its performance with that of a scene-centric hierarchy [48, 49] generated by sampling the negative pairs as objects and unpaired scenes. This scene-centric hierarchy leads to substantially lower accuracy (Table 4).

Trade-off between the Euclidean and hyperbolic losses. We adopt the Euclidean loss to learn object-object similarity and the hyperbolic loss to learn object-scene similarity. A hyperparameter $\lambda$ controls the trade-off between them. As shown in Table 5, we find that a smaller $\lambda = 0.01$ leads to marginal improvement. However, we also observe that larger values of $\lambda$ can lead to unstable and even stalled training. With careful inspection, we find that in the early stage of the training, the gradient provided by the hyperbolic loss can be inaccurate but strong, which pushes the representations to be
Distance     Center   IN-100 Accuracy
-            -        77.36
Hyperbolic   Scene    79.08
Hyperbolic   Object   76.96
Euclidean    Scene    76.68

Table 4. Similarity measure and hierarchy center.

λ      IN-100 Accuracy
0.01   77.70
0.1    79.08
0.2    78.64
0.5    0

Table 5. Losses trade-off.

Optimizer   λ     IN-100 Accuracy
RSGD        0.1   79.08
RSGD        0.5   0
SGD         0.1   70.16
SGD         0.5   74.18

Table 6. RSGD versus SGD optimizers.
close to the boundary. As a result, since Riemannian SGD scales gradients by a factor that shrinks as the representations approach the boundary, updates become small and training ceases to make progress.

Optimizer. Given the observation above, we ask whether RSGD is necessary for practical usage. We replace the RSGD optimizer with SGD. To avoid numerical issues when the representations are too close to the boundary, we increase $\epsilon$ from 1e-5 to 1e-1. This allows a larger $\lambda$ to be used as opposed to RSGD. However, the best SGD result remains inferior to the best RSGD result (Table 6).

5. Related Work

Representation Learning with Hyperbolic Space. Representations are typically learned in Euclidean space. Hyperbolic space has been adopted for its expressiveness in modeling tree-like structures existing in various domains such as language [45, 46, 53], graphs [2, 8, 50], and vision [10, 57]. The corresponding neural network modules have been designed to boost the progress of such applications [9, 21, 37, 56]. The hierarchical structure present in the datasets can arise from three factors that motivate the use of hyperbolic space. The first factor is generality: the hypernym-hyponym property is a natural feature of words (e.g. WordNet [42]) and the hyperbolic space is extensively exploited to learn word and image embeddings that preserve that property [20, 38, 40, 53, 59, 70]. The second factor is uncertainty: several studies have found that applying hyperbolic neural network modules to different tasks leads to a natural modeling of the uncertainty [23, 31, 57]. The third factor is compositionality of different basic elements to form a natural hierarchy. Motivated by these factors, previous work in computer vision has applied hierarchical representations learned in the hyperbolic space to various tasks such as image classification [31] or segmentation [65], zero-/few-shot learning [38], action recognition [40], and video prediction [57]. In this paper, we focus on learning the representations that capture the hierarchy between the objects and scenes with the goal of learning general-purpose image representations that can transfer to various downstream tasks.

Self-Supervised Learning on Scenes. Self-Supervised Learning (SSL) has made great strides in closing the performance gap with supervised methods [12, 13, 22] when pretrained on object-centric datasets like ImageNet. However, recent work has shown that SSL is limited on multi-object datasets like COCO [43, 54, 64] and OpenImages [33]. Several papers mitigate this issue by proposing different techniques. Dense-CL [64] operates on pre-average pool features and uses dense features on pixel level to show improved performance on dense tasks such as semantic segmentation. DetCon [28] uses unsupervised semantic segmentation masks to generate features for the corresponding objects in the two views. PixContrast [69] uses a pixel-to-propagation consistency pretext task to build features for both dense downstream tasks and discriminative downstream tasks. Pixel-to-Pixel Contrast [63] uses pixel-level contrastive learning to learn better features for semantic segmentation. Self-EMD [39] uses earth mover distance with BYOL [24] for pretraining on the COCO dataset. ORL [68] uses selective search to generate object proposals, then applies object-level contrastive loss to enforce object-level consistency. Below-par performance of SSL methods can be attributed to treating scenes and objects using similar techniques, which often results in similar representations. In our work, instead of treating scenes and objects similarly, we use a hyperbolic loss, which builds representations that disambiguate scenes and objects based on the norm of the embeddings. Our method not only separates scenes and objects, but also improves downstream tasks such as image classification.

6. Conclusion

We present HCL, a contrastive learning framework that learns visual representations for both objects and scenes in the same representation space. The major novelty of our method is a hyperbolic contrastive objective built on an object-centric scene hierarchy. We show the effectiveness of HCL on several benchmarks including image classification, object detection, and semantic segmentation. We also demonstrate useful properties of the representations under several zero-shot settings, from detecting out-of-context objects to quantifying the label uncertainty in datasets like ImageNet. More generally, we hope this paper will encourage future work towards building a more holistic visual representation space, and draw attention to the power of non-Euclidean representation learning.

7. Acknowledgements

Songwei Ge, Shlok Mishra, and David Jacobs were supported in part by the National Science Foundation under grant no. IIS-1910132 and IIS-2213335.
References
[1] Yutong Bai, Xinlei Chen, Alexander Kirillov, Alan Yuille, and Alexander C Berg. Point-level region contrast for object detection pre-training. CVPR, 2022. 1
[2] Ivana Balazevic, Carl Allen, and Timothy Hospedales. Multi-relational poincaré graph embeddings. NeurIPS, 2019. 8
[3] Lucas Beyer, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with imagenet?, 2020. 6
[4] Silvere Bonnabel. Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013. 3
[5] James W Cannon, William J Floyd, Richard Kenyon, Walter R Parry, et al. Hyperbolic geometry. Flavors of geometry, 31(59-115):2, 1997. 3
[6] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS, 2020. 1
[7] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021. 1
[8] Ines Chami, Adva Wolf, Da-Cheng Juan, Frederic Sala, Sujith Ravi, and Christopher Ré. Low-dimensional hyperbolic knowledge graph embeddings. In ACL, 2020. 8
[9] Ines Chami, Zhitao Ying, Christopher Ré, and Jure Leskovec. Hyperbolic graph convolutional neural networks. NeurIPS, 2019. 8
[10] Jiaxin Chen, Jie Qin, Yuming Shen, Li Liu, Fan Zhu, and Ling Shao. Learning attentive and hierarchical representations for 3d shape recognition. In ECCV, 2020. 8
[11] Pengfei Chen, Ben Ben Liao, Guangyong Chen, and Shengyu Zhang. Understanding and utilizing deep neural networks trained with noisy labels. In ICML, 2019. 6
[12] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. 1, 2, 8
[13] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020. 5, 8
[14] Myung Jin Choi, Joseph J Lim, Antonio Torralba, and Alan S Willsky. Exploiting hierarchical context on a large database of object categories. In CVPR, 2010. 2, 6
[15] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. 5, 6
[16] Manfredo Perdigao Do Carmo and J Flaherty Francis. Riemannian geometry, volume 6. Springer, 1992. 3
[17] Nikita Dvornik, Julien Mairal, and Cordelia Schmid. On the importance of visual context for data augmentation in scene understanding. PAMI, 43(6):2014–2028, 2019. 6
[18] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010. 5
[19] Carolina Galleguillos, Andrew Rabinovich, and Serge Belongie. Object categorization using co-occurrence, location and appearance. In CVPR, 2008. 2
[20] Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In ICML, 2018. 8
[21] Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. NeurIPS, 2018. 3, 8
[22] Songwei Ge, Shlok Kumar Mishra, Haohan Wang, Chun-Liang Li, and David Jacobs. Robust contrastive learning using negative samples with diminished semantics. In NeurIPS, 2021. 8
[23] Mina GhadimiAtigh, Julian Schoep, Erman Acar, Nanne van Noord, and Pascal Mettes. Hyperbolic image segmentation. arXiv preprint arXiv:2203.05898, 2022. 8
[24] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In NeurIPS, 2020. 1, 8
[25] Mikhael Gromov. Hyperbolic groups. In Essays in group theory, pages 75–263. Springer, 1987. 2, 3
[26] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, June 2022. 1
[27] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020. 1, 2, 4
[28] Olivier J Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron van den Oord, Oriol Vinyals, and João Carreira. Efficient visual pretraining with contrastive detection. In ICCV, pages 10086–10096, 2021. 8
[29] Geoffrey Hinton. How to represent part-whole hierarchies in a neural network. arXiv preprint arXiv:2102.12627, 2021. 2
[30] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015. 2
[31] Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Ustinova, Ivan Oseledets, and Victor Lempitsky. Hyperbolic image embeddings. In CVPR, 2020. 8
[32] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017. 2
[33] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4. IJCV, 2020. 4, 8
[34] John M Lee. Introduction to Riemannian manifolds. Springer, 2018. 3
[35] Tsung-Yi Lin, M. Maire, Serge J. Belongie, James Hays, P. Perona, D. Ramanan, Piotr Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 4
[36] Nathan Linial, Eran London, and Yuri Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215–245, 1995. 3
[37] Qi Liu, Maximilian Nickel, and Douwe Kiela. Hyperbolic graph neural networks. NeurIPS, 2019. 8
[38] Shaoteng Liu, Jingjing Chen, Liangming Pan, Chong-Wah Ngo, Tat-Seng Chua, and Yu-Gang Jiang. Hyperbolic visual embedding learning for zero-shot recognition. In CVPR, 2020. 8
[39] Songtao Liu, Zeming Li, and Jian Sun. Self-emd: Self-supervised object detection without imagenet, 2021. 1, 8
[40] Teng Long, Pascal Mettes, Heng Tao Shen, and Cees G. M. Snoek. Searching for actions on the hyperbole. In CVPR, 2020. 8
[41] Thomas Mensink, Efstratios Gavves, and Cees GM Snoek. Costa: Co-occurrence statistics for zero-shot classification. In CVPR, 2014. 2
[42] George A Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J Miller. Introduction to wordnet: An on-line lexical database. International journal of lexicography, 3(4):235–244, 1990. 2, 8
[43] Shlok Kumar Mishra, Anshul B. Shah, Ankan Bansal, Jonghyun Choi, Abhinav Shrivastava, Abhishek Sharma, and David Jacobs. Learning visual representations for transfer learning by suppressing texture. ArXiv, abs/2011.01901, 2020. 8
[44] Shlok Kumar Mishra, Anshul B. Shah, Ankan Bansal, Abhyuday N. Jagannatha, Abhishek Sharma, David Jacobs, and Dilip Krishnan. Object-aware cropping for self-supervised learning. ArXiv, abs/2112.00319, 2021. 4
[45] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. NeurIPS, 2017. 2, 3, 8
[46] Maximillian Nickel and Douwe Kiela. Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In ICML, 2018. 8
[47] Curtis Northcutt, Lu Jiang, and Isaac Chuang. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021. 6
[48] Devi Parikh and Tsuhan Chen. Hierarchical semantics of objects (hsos). In ICCV, 2007. 2, 7
[49] Devi Parikh, C Lawrence Zitnick, and Tsuhan Chen. Unsupervised learning of hierarchical spatial structures in images. In CVPR, 2009. 7
[50] Jiwoong Park, Junho Cho, Hyung Jin Chang, and Jin Young Choi. Unsupervised hyperbolic representation learning via message passing auto-encoders. In CVPR, 2021. 8
[51] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 1
[52] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In ICML, 2019. 6
[53] Frederic Sala, Chris De Sa, Albert Gu, and Christopher Ré. Representation tradeoffs for hyperbolic embeddings. In ICML, 2018. 2, 3, 8
[54] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017. 8
[55] Vaishaal Shankar, Rebecca Roelofs, Horia Mania, Alex Fang, Benjamin Recht, and Ludwig Schmidt. Evaluating machine accuracy on imagenet. In ICML, 2020. 6
[56] Ryohei Shimizu, Yusuke Mukuta, and Tatsuya Harada. Hyperbolic neural networks++. In ICLR, 2021. 8
[57] Dídac Surís, Ruoshi Liu, and Carl Vondrick. Learning the predictability of the future. In CVPR, 2021. 8
[58] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In ECCV, 2020. 5
[59] Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. Poincaré glove: Hyperbolic word embeddings. In ICLR. OpenReview, 2018. 2, 3, 8
[60] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. From imagenet to image classification: Contextualizing progress on benchmarks. In ICML, 2020. 6
[61] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. IJCV, 104(2):154–171, 2013. 4
[62] Vijay Vasudevan, Benjamin Caine, Raphael Gontijo-Lopes, Sara Fridovich-Keil, and Rebecca Roelofs. When does dough become a bagel? analyzing the remaining mistakes on imagenet. arXiv preprint arXiv:2205.04596, 2022. 6
[63] Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool. Exploring cross-image pixel contrast for semantic segmentation. In ICCV, 2021. 8
[64] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense contrastive learning for self-supervised visual pre-training. In CVPR, 2021. 1, 5, 8
[65] Zhenzhen Weng, Mehmet Giray Ogut, Shai Limonchik, and Serena Yeung. Unsupervised discovery of the long-tail in instance segmentation using hierarchical self-supervision. In CVPR, 2021. 8
[66] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019. 5
[67] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018. 1, 2
[68] Jiahao Xie, Xiaohang Zhan, Ziwei Liu, Yew Soon Ong, and Chen Change Loy. Unsupervised object-level representation learning from scene images. In NeurIPS, 2021. 1, 4, 5, 8
[69] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In CVPR, 2021. 1, 8
[70] Jiexi Yan, Lei Luo, Cheng Deng, and Heng Huang. Unsupervised hyperbolic metric learning. In CVPR, 2021. 8