0% found this document useful (0 votes)
100 views6 pages

R-CNN (Object Detection) - A Beginners Guide To One of The Most - by Sharif Elfouly - Medium

The document summarizes the R-CNN object detection system. It describes how R-CNN works by first generating region proposals from an input image using selective search. Each proposal is then fed into a pretrained CNN to extract features. These features are classified and bounded using independently trained SVMs and a bounding box regressor. The outputs are post-processed using non-maximum suppression to generate the final detected objects and bounding boxes. R-CNN pioneered the use of region-based object detection but had limitations that were addressed by later methods like Fast R-CNN.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
100 views6 pages

R-CNN (Object Detection) - A Beginners Guide To One of The Most - by Sharif Elfouly - Medium

The document summarizes the R-CNN object detection system. It describes how R-CNN works by first generating region proposals from an input image using selective search. Each proposal is then fed into a pretrained CNN to extract features. These features are classified and bounded using independently trained SVMs and a bounding box regressor. The outputs are post-processed using non-maximum suppression to generate the final detected objects and bounding boxes. R-CNN pioneered the use of region-based object detection but had limitations that were addressed by later methods like Fast R-CNN.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

11/24/2020 R-CNN (Object Detection).

A beginners guide to one of the most… | by Sharif Elfouly | Medium

Get started Open in app

Sharif Elfouly
209 Followers · About Follow

R-CNN (Object Detection)


A beginners guide to one of the most fundamental concepts in object detection.

Sharif Elfouly Jul 16, 2019 · 5 min read

When the paper “Rich feature hierarchies for accurate object detection and semantic
segmentation” came out of UC Berkely in 2014 no one could have predicted its impact.
After 5 years it now has nearly 9000 citations. In this paper, the authors introduced a
fundamental concept for all modern object detection networks: Combining region
proposals with CNN’s. They called this method R-CNN.

This will be the first entry in a 3 part series covering R-CNN, fast R-CNN and faster R-
CNN. You will need to fully understand the intuition behind the concepts here to
understand the following articles. Remember, knowing the fundamentals well is much
more important than “half-knowing” modern state-of-the-art approaches.

R-CNN System

https://medium.com/@selfouly/r-cnn-3a9beddfd55a 1/6
11/24/2020 R-CNN (Object Detection). A beginners guide to one of the most… | by Sharif Elfouly | Medium

The R-CNN system

The problem the R-CNN system tries to solve it is to locate objects in an image (object
detection). What do you do to solve this? You could start with a sliding window
approach. When using this method you just go over the whole image with different sized
rectangles and look at those smaller images in a brute-force-method. The problem is you
will have a giant number of smaller images to look at. To our luck, other smart people
developed algorithms to smartly choose those so-called region proposals. To simplify
this concept:

Region proposals are just smaller parts of the original image, that we think could contain
the objects we are searching for.

Region proposals
There are different region proposal algorithms we can choose from. These are “normal”
algorithms that work out of the box. We don’t have to train them or anything. In the case
of this paper, they use the selective search method to generate region proposals. I found
a very good and detailed explanation on how, the algorithm works here. But keep in
mind:

The R-CNN is agnostic to the region proposal method.

You can choose any method you like and it would work either way.

This will create nearly 2000 different regions we will have to look at. This sounds like a
big number, but it’s still very small compared to the brute-force sliding window
approach.

CNN
In the next step, we take each region proposal and we will create a feature vector
representing this image in a much smaller dimension using a Convolutional Neural
Network (CNN).

https://medium.com/@selfouly/r-cnn-3a9beddfd55a 2/6
11/24/2020 R-CNN (Object Detection). A beginners guide to one of the most… | by Sharif Elfouly | Medium

AlexNet

They use the AlexNet as a feature extractor. Don’t forget it’s 2014 and AlexNet is still
kind of state-of-the-art (Oh, how times have changed…).

One question we need to answer:

If you only use the AlexNet as a feature extractor how do we train this thing?

Well, this is one fundamental issue with this R-CNN system. You can’t train the whole
system in one go (This will be solved by the fast R-CNN system). Rather, you will need to
train every part independently. That means that the AlexNet was trained before on a
classification task. After the training, they removed the last softmax layer. Now the last
layer is the fully connected 4096-dimensional one. This means that our features are
4096 dimensional.

Another important thing to keep in mind is that the input to the AlexNet is always the
same (227, 227, 3). The image proposals have different shapes though. Many of them
are smaller or larger than the required size. So we will need to resize every region
proposal.

To summerize the task of the CNN:

Input and output of the CNN

SVM
https://medium.com/@selfouly/r-cnn-3a9beddfd55a 3/6
11/24/2020 R-CNN (Object Detection). A beginners guide to one of the most… | by Sharif Elfouly | Medium

We created feature vectors from the image proposals. Now we will need to classify those
feature vectors. We want to detect what class of object those feature vectors represent.
For this, we use an SVM classificator. We have one SVM for each object class and we use
them all. This means that for one feature vector we have n outputs, where n is the
number of different objects we want to detect. The output is a confidence score. How
confident are we that this particular feature vector represents this class.

The thing that confused me when I read this paper for the first time, was how we trained
those different SVM’s. Well, we train them on feature vectors created by the AlexNet.
That means, that we have to wait until we fully trained the CNN before we can train the
SVM. The training is not parallelizable. Because we know when training what feature
vector represented which class we can easily train the different SVM’s in a supervised-
learning way.

To summerize:
1. We created different image proposals from one image.

2. Then we created a feature vector from those proposals using the CNN.

3. In the end we classified each feature vector with the SVM’s for each object class.

The output:

Now we have image proposals that are classified on every object class. How do we bring
them all back to the image? We use something called greedy non-maximum
suppression. This is a fancy word for the following concept:

We reject a region (image proposal) if it has an intersection-over-union (IoU) overlap with a


higher scoring selected region.

We combine each region if there is an overlap we take the proposal with the higher score
(calculated by the SVM). We do this step for each object class independently. After this
ends we only keep regions with a score higher than 0.5.

Bounding Box Regressor (optional)


I want to mention the Bounding Box Regressor at the end because it is not a
fundamental building block of the R-CNN System. It’s a greate idea though and the
authors found that it improves the average precision by 3%. So how does it work?
https://medium.com/@selfouly/r-cnn-3a9beddfd55a 4/6
11/24/2020 R-CNN (Object Detection). A beginners guide to one of the most… | by Sharif Elfouly | Medium

When you are training the Bounding Box Regressor your input is the center, width and
height in pixels of the region proposal and the label is the ground truth bounding box.
The goal as stated in the paper is:

Our goal is to learn a transformation that maps a proposed box P to a ground-truth box G.

Conclusion
Another interesting discovery made in this paper is that it was highly effective to pre-
train the CNN on a task with a lot of data (for example image classification) and after
that to fine tune the network for the actual task, which was the object detection.

I really believe that it is necessary to fully understand the concepts presented in this
paper to fully grasp more modern approaches in object detection. I hope this article
helped you understand those ideas and more importantly helped you get an intuition
about the different parts in the R-CNN system. The next part of this series will be about
fast R-CNN which builds up directly on top of R-CNN system.

Thank you for reading and keep up the learning!

If you want more and stay up to date you can find me here:

@elfouly_sharif

LinkedIn

GitHub

Machine Learning AI Data Science Neural Networks Convolutional Network

About Help Legal

https://medium.com/@selfouly/r-cnn-3a9beddfd55a 5/6
11/24/2020 R-CNN (Object Detection). A beginners guide to one of the most… | by Sharif Elfouly | Medium

Get the Medium app

https://medium.com/@selfouly/r-cnn-3a9beddfd55a 6/6

You might also like