
Contrastive Learning for Object Detection

Rishab Balasubramanian Kunal Rathore


Oregon State University
{balasuri, rathorek}@oregonstate.edu
arXiv:2208.06412v1 [cs.CV] 12 Aug 2022

1. Introduction

1.1. Problem Statement

In this project we work on object detection using contrastive learning. The goal of the project is to implement and evaluate the contrastive learning paradigm for learning better feature representations, and to use these representations for object detection.

Contrastive learning follows from the traditional triplet loss, where the similarity between an “anchor” and a “positive” is maximized, and the similarity between the “anchor” and a “negative” is minimized. Contrastive learning is commonly used as a method of self-supervised learning, with the “anchor” and “positive” being two random augmentations of a given input image, and the “negative” being the set of all other images. This has been shown to outperform traditional approaches such as the triplet loss and the N-pair loss [1]. However, the requirement of large batch sizes and memory banks has made these models difficult and slow to train ([1], [2], [3]). This motivated the rise of supervised contrastive approaches that overcome these problems by using annotated data [5]. However, there is no explicit emphasis on learning good representations; rather, the idea is to cluster points into regions that are separable in the higher-dimensional parameter space. The authors of [4] attempt to enforce better representation learning in the contrastive learning framework by clustering classes together based on their similarity to each other.

Inspired by this approach, we rank classes based on their similarity and observe the impact of human bias (in the form of ranking) on the learned representations. We feel this is an important question to address, as learning good feature embeddings has been a long-sought-after problem in computer vision. It is also important for related domains such as OOD detection, image matching/retrieval, and other tasks that require a good representation of the images. Code is available at https://github.com/rishabbala/Contrastive_Learning_For_Object_Detection.

1.2. Scope and Challenges

We work in a supervised setting, with labels for the bounding boxes of objects and their corresponding classes. We use VOC 2007, a sufficiently large dataset of annotated images belonging to 20 different classes. The main challenge in our work is the training time for multiple experiments: due to the long training time and the large batch sizes of contrastive learning, we need access to GPUs. The class-similarity rankings, which are a user input, are also a challenge, as they require manual tuning and multiple experiments to identify a useful ranking order.

2. Approach

Fig 1 shows the approach we follow for training our model. Given an input image, we pass it through the Object Detection module to predict bounding boxes (Sec 2.1). Once the bounding boxes are predicted, we perform a Two-Crop transformation on each object in the image (Sec 2.2), and pass the result through our Contrastive Learning framework (Sec 2.3). We divide the process into three main stages:

• Object Detection

• Two Crop Augmentations

• Object Classification

Figure 1. Our Approach

2.1. Object Detection

In this work we use Faster RCNN [6] for object detection. Faster RCNN has two main components: a region proposal network (RPN) and a classification network. We remove the classification network and retain only the bounding boxes proposed by the RPN. Fig 2 (adapted from d2l) shows the Faster RCNN pipeline used in our work. Given the locations of the bounding boxes, the image is cropped at these locations and a two-crop transformation is performed.

Figure 2. Faster RCNN
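As an illustration of this stage, below is a minimal sketch that loads a Faster RCNN (ResNet-50 FPN) through detectron2 and crops an image at the predicted boxes. The config, weights, and input path are illustrative stand-ins (we fine-tune on VOC2007 rather than using COCO weights), and for brevity it crops the predictor's final boxes rather than the raw RPN proposals used in our pipeline.

```python
# Sketch: predict boxes with detectron2's Faster RCNN (ResNet-50 FPN)
# and crop the input image at each box for the two-crop stage.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))  # stand-in config
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")   # stand-in weights
predictor = DefaultPredictor(cfg)

image = cv2.imread("example.jpg")  # hypothetical input image
boxes = predictor(image)["instances"].pred_boxes.tensor.cpu().numpy()

# Crop each predicted box (x0, y0, x1, y1) out of the image
crops = [image[int(y0):int(y1), int(x0):int(x1)] for x0, y0, x1, y1 in boxes]
```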

2.2. Two Crop Transformation

The cropped images at the locations of the bounding boxes are stacked together into a batch of images. We first normalize the images using the mean and standard deviation of the dataset. Then we follow the standard transformations for contrastive learning proposed in [5], as shown in Fig 3, except that we do not use the proposed cutout, blur, and Sobel filter augmentations. We apply two random combinations of a subset of these methods to produce two augmentations of each image. The first is the “anchor”, and the second falls in the “positive” class.

Figure 3. Proposed transformations
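The following is a minimal torchvision sketch of this transformation, in the style of the SupCon codebase [5]. The crop size, jitter strengths, and the ImageNet-style normalization constants are assumptions for illustration, not our exact settings.

```python
# Sketch: produce two independent augmentations ("anchor" and "positive")
# of each object crop. All magnitudes below are illustrative.
import torchvision.transforms as T

class TwoCropTransform:
    """Apply the same random augmentation pipeline twice, independently."""
    def __init__(self, transform):
        self.transform = transform

    def __call__(self, x):
        return [self.transform(x), self.transform(x)]  # [anchor, positive]

mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]  # assumed stats
base = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
    T.Normalize(mean, std),
])
two_crop = TwoCropTransform(base)  # two_crop(img) -> [view1, view2]
```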
2.3. Object Classification

Fig 4 (adapted from [4]) shows the contrastive learning approach we use. We follow [4] and use a ranked supervised contrastive learning method, where the ranking is user defined. This differs from traditional approaches, which use a single “positive” image/class.

For each anchor (query) image q, we rank a number of similar classes as P_1, ..., P_r, where r denotes the number of positive classes in our ranking. We also define a negative class N. Let h(q, x) be the cosine similarity between the query and any other image x. Then we can define our objective as enforcing:

    h(q, P_1) > h(q, P_2) > \cdots > h(q, P_r) > h(q, N)    (1)

We enforce this by defining a loss L = \sum_{i=1}^{r} l_i, where

    l_i = -\log \frac{\sum_{p \in P_i} \exp(h(q, p) / \tau_i)}{\sum_{p \in \bigcup_{j \geq i} P_j} \exp(h(q, p) / \tau_i) + \sum_{n \in N} \exp(h(q, n) / \tau_i)}    (2)

This can be thought of as recursively computing the loss L, treating the current highest-ranked class i as “positive” and all other classes as negative. After computing the loss for rank i, the current highest-ranked class is removed, and the loss is computed again for class i + 1. To ensure good separation, we set \tau_{i+1} > \tau_i, following the empirical studies provided in [4].

As opposed to [4], we rank classes instead of clustering them into groups. The difference is that in our ranking, any class can be rated as similar to any other class with a user-defined score, whereas in the clustering of [4], only classes within the same cluster are considered similar. For example, [4] puts the classes “aeroplane” and “ship” together as “vehicles”. However, from human knowledge, we know that an “aeroplane” is also (probably more) similar to a “bird” than to a “ship”. Hence, in our method, we create the ranking for the class “aeroplane” as {“bird”, “ship”, ...}, in decreasing order of similarity from left to right.

Figure 4. Traditional contrastive learning approaches are binary (left), where there is a single “anchor” and a single “positive” image/class. We use a ranking system to improve the learned features (right).
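To make Eqs. (1)-(2) concrete, below is a small PyTorch sketch of the ranked loss. The tensor layout is a simplifying assumption (per-rank similarity vectors rather than a batched similarity matrix), so this is a readable reference rather than an efficient or numerically hardened implementation.

```python
# Sketch of the ranked contrastive loss L = sum_i l_i from Eq. (2).
# sims[i] holds similarities h(q, p) for all p in rank set P_{i+1};
# neg holds similarities h(q, n) for all negatives n in N.
import torch

def ranked_contrastive_loss(sims, neg, taus):
    loss = torch.zeros(())
    for i, tau in enumerate(taus):
        # numerator: positives at the current rank, scaled by tau_i
        num = torch.exp(sims[i] / tau).sum()
        # denominator: positives ranked i or lower, plus all negatives
        den = sum(torch.exp(s / tau).sum() for s in sims[i:]) \
            + torch.exp(neg / tau).sum()
        loss = loss - torch.log(num / den)
    return loss

# Toy usage: r = 3 rank sets, 16 negatives, increasing temperatures
sims = [torch.rand(4) * 2 - 1 for _ in range(3)]  # cosine sims in [-1, 1]
neg = torch.rand(16) * 2 - 1
print(ranked_contrastive_loss(sims, neg, taus=[0.1, 0.2, 0.3]))
```

Note that the temperatures passed in the toy call increase with rank, matching the \tau_{i+1} > \tau_i constraint above.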
3. Evaluation

3.1. Implementation Details

• We use the detectron2 library [7], open-sourced by Facebook, for the bounding box predictions, with a ResNet-50 FPN backbone for object detection.

• We build upon the code provided in [4] to incorporate our ranking and experiments.

• We use a ResNet-50 backbone for all our contrastive learning experiments.


• We run our experiments with a batch size of 32 on the VOC2007 dataset.

• We train our model for 500 epochs.

• We use cosine similarity, with a learning rate of 0.5 and a learning-rate decay of 0.1.

• We set the temperature in the loss to τ ∈ [0.1, 0.6].

• The experiments are conducted with the number of positively ranked classes r ∈ {1, 3, 5}.

3.2. Dataset

We evaluate on the VOC2007 dataset, which has 20 classes, 9963 images, and 24640 objects (bounding boxes). It is split into 5011 images in the train/val set and 4952 images in the test set, with a similar number of objects in each. The dataset provides the images, the ground-truth bounding box annotations, and the corresponding class for each bounding box.
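For reference, these splits can be loaded with torchvision's built-in VOCDetection dataset; this is an illustrative sketch, not necessarily the loader used in our experiments.

```python
# Sketch: load the VOC2007 trainval/test splits and read one annotation.
from torchvision.datasets import VOCDetection

trainval = VOCDetection("data", year="2007", image_set="trainval", download=True)
test = VOCDetection("data", year="2007", image_set="test", download=True)

img, ann = trainval[0]
objs = ann["annotation"]["object"]
objs = objs if isinstance(objs, list) else [objs]  # single-object edge case
for obj in objs:
    print(obj["name"], obj["bndbox"])  # class name and ground-truth box
```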
3.3. Metrics and Comparison

We evaluate mAP for the object detection stage, and classification accuracy for the contrastive learning model. We compare this accuracy with SupCL ([5], where there is no ranking), RINCE ([4], where similar classes are clustered together), and SoftMax (the common discriminative approach of training a ResNet-50 with a softmax loss). Extra Credit: We also test our model for detecting OOD classes. In this case we plot the ROC curve of true positive rate vs. false positive rate and compare the area under the curve.
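As a sketch of this OOD evaluation, the snippet below computes the ROC points and AUROC with scikit-learn on synthetic stand-in scores; in our setting the score could be, for example, one minus the maximum similarity to any known class.

```python
# Sketch: ROC curve (TPR vs FPR) and AUROC for an OOD score.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
scores_known = rng.normal(0.3, 0.1, 500)  # synthetic in-distribution scores
scores_ood = rng.normal(0.5, 0.1, 100)    # synthetic withheld-class scores

y_true = np.concatenate([np.zeros(500), np.ones(100)])  # 1 = OOD
y_score = np.concatenate([scores_known, scores_ood])

fpr, tpr, _ = roc_curve(y_true, y_score)
print("AUROC:", roc_auc_score(y_true, y_score))
```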

Model                                                     AP      AP50    AP75
Faster RCNN ResNet50 FPN (trained on VOC2007 train+val)   45.254  72.746  49.338

Table 1. Object Detection scores

Figure 5. Results

3.4. Results & Evaluations

We first evaluate the efficiency of the Faster-RCNN model in predicting bounding boxes. To do so, we compute the AP scores using a pre-trained model. Table 1 shows the AP, AP50, and AP75 scores of the object detection module we used. These are lower than the values reported in the Faster RCNN paper, and also lower than those of more modern approaches. Since the objective was not only object detection, we did not try different detection models.

Table 2 shows the classification accuracy on the VOC2007 dataset, and Fig 5 shows the results from our model. We observe that our accuracy is comparable to RINCE [4] and SupCL [5], while the discriminative classifier receives a much lower score. However, our scores are slightly lower than those of SupCL and RINCE. Since all other parameters were kept the same during testing, the two factors that affect these results the most are the ranking and the user-tuned class-similarity scores. Due to the diverse nature of the classes in the dataset, these resulted in a sparser ranking, which hurts the performance of our approach.

Method         Classification Accuracy
SupCL (r=1)    0.6499
Ours (r=3)     0.6068
RINCE (r=5)    0.6368
SoftMax        0.5829

Table 2. Classification Accuracy on VOC2007

3.5. Extra Credit: Out of Distribution Detection

We finally evaluate our model’s performance for OOD object detection. We evaluate on the VOC2007 dataset with 2 classes withheld, and show the ROC curve in Fig 6 and the AUROC in Table 3.

Figure 6. ROC plots for VOC: (a) baseline, (b) 1 positive class (r = 1), (c) 5 positive classes (r = 5)

We see that our model does not perform better than the baselines. This shows that our method enforces good representation learning when human input is given, but the representations for new classes are poor. We can conclude that, given an unobserved object class, our model pushes it close to one of the known classes, thus resulting in poor results.

Method        AUROC
SupCL (r=1)   0.6679
Ours (r=5)    0.5532
SoftMax       0.5621

Table 3. AUROC on VOC2007 with 2 classes withheld

3.6. Runtimes & Hardware

We train all our models on the HPC cluster using a Tesla V100 GPU. For training we use a batch size of 32, and observe that it takes around 1.5-2 minutes per epoch. During testing, it takes about 2 minutes to generate results over the validation set. Evaluation of the Average Precision scores for Faster RCNN takes about 5 minutes.

3.7. Individual Contributions


Kunal worked on the object detection pipeline and its evaluation. Rishab worked on setting up the contrastive learning experiments and training them. We worked together on OOD detection. We each filled in our respective portions of the report and made changes together.

References

[1] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.

[2] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.

[3] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.

[4] David T. Hoffmann, Nadine Behrmann, Juergen Gall, Thomas Brox, and Mehdi Noroozi. Ranking info noise contrastive estimation: Boosting contrastive learning via ranked positives. arXiv preprint arXiv:2201.11736, 2022.

[5] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673, 2020.

[6] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.

[7] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
