
Object Detection

Computer Vision Tasks
[Figure: image classification answers "what?", while object detection answers "what + where?", e.g., detecting and localizing a boat and a person in the same image.]
Comparing Boxes: Intersection over Union (IoU)
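IoU is the area of overlap between two boxes divided by the area of their union. A minimal sketch, assuming boxes in (x1, y1, x2, y2) corner format:

def iou(box_a, box_b):
    """Compute Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Intersection area is zero if the boxes do not overlap
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

    return inter / (area_a + area_b - inter)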
Region-based Convolutional Neural Network (RCNN)
• Instead of working on a massive number of regions, the RCNN algorithm proposes a bunch of boxes in the image and checks whether any of these boxes contain an object. RCNN uses selective search to extract these boxes from an image (these boxes are called regions).

Selective Search
• It first takes an image as input.
• Then, it generates initial sub-segmentations so that we have multiple regions from this image.
• Then it combines the similar regions to form a larger region (based on color similarity, texture similarity, size similarity, and shape compatibility).
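As an illustration, OpenCV's contrib package ships a selective search implementation. This sketch assumes opencv-contrib-python is installed and a sample image at "input.jpg":

import cv2

img = cv2.imread("input.jpg")

# Selective search lives in the ximgproc contrib module
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()  # trade proposal quality for speed

rects = ss.process()  # array of (x, y, w, h) region proposals
print(f"{len(rects)} region proposals")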
RCNN – Problems
• Extracting 2,000 regions for each image based on selective search.
• Extracting features using a CNN for every image region: if we have N images, the number of CNN feature extractions will be N x 2,000.
• The entire process of object detection using RCNN involves three models:
• a CNN for feature extraction,
• a linear SVM classifier for identifying objects,
• a regression model for tightening the bounding boxes.
• All these steps combine to make RCNN very slow: it takes around 40–50 seconds to make predictions for each new image.
RCNN – Problem
• The CNN is followed by fully connected layers, which can only accept input of a fixed size.
• This makes the CNN incapable of accepting variable-size inputs. Thus, images are first reshaped to a specific dimension before being fed into the CNN.
• This creates another issue: image warping and reduced resolution. Spatial Pyramid Pooling (SPP) was introduced to counter this problem.
What’s wrong with SPP-net?

• Training is still slow (though better).
• It introduces a new problem: parameters below the SPP layer cannot be updated during training.
FAST-RCNN
• Instead of running a CNN 2,000 times per image, we can run it just once per image and get all the regions of interest (regions containing some object).
• In Fast RCNN, we feed the input image to the CNN, which in turn generates the convolutional feature maps.
• Using these maps, the region proposals are extracted.
• We then use an RoI pooling layer to reshape all the proposed regions to a fixed size, so that they can be fed into a fully connected network.
• A softmax layer is used on top of the fully connected network to output classes. Alongside the softmax layer, a linear regression layer is used in parallel to output bounding box coordinates for the predicted classes.
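To make the two parallel heads concrete, here is a minimal PyTorch-style sketch. The 3x3x512 input size follows the RoI pooling example below; the 21-class output (20 classes plus background) and hidden sizes are illustrative assumptions:

import torch
import torch.nn as nn

class FastRCNNHead(nn.Module):
    """Two parallel heads on top of pooled RoI features (sizes are illustrative)."""
    def __init__(self, in_features=3 * 3 * 512, num_classes=21):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_features, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        self.cls_score = nn.Linear(4096, num_classes)      # softmax head (class logits)
        self.bbox_pred = nn.Linear(4096, num_classes * 4)  # box-regression head

    def forward(self, roi_features):
        x = self.fc(roi_features.flatten(start_dim=1))
        return self.cls_score(x), self.bbox_pred(x)

logits, boxes = FastRCNNHead()(torch.randn(8, 512, 3, 3))  # 8 pooled RoIs
print(logits.shape, boxes.shape)  # (8, 21) and (8, 84)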
Cropping Features: RoI Pool
The model takes an image input of size 512x512x3 (width x height x RGB), and VGG16 maps it to a 16x16x512 feature map.

Note that the output's width and height are exactly 32 times smaller than the input image (512/32 = 16). That's important because all RoIs must be scaled down by this factor.
Cropping Features: RoI Pool
• The RoI's original size is 145x200 and its top-left corner is at (x=296, y=192). As you can probably tell, most of those numbers are not divisible by 32:
• width: 200/32 = 6.25
• height: 145/32 = ~4.53
• x: 296/32 = 9.25
• y: 192/32 = 6
Cropping Features: RoI Pool
After the RoI pooling layer there is a fully connected layer with a fixed input size. Because our RoIs have different sizes, we have to pool them into the same size (3x3x512 in our example). At this point our mapped RoI has a size of 4x6x512, and as you can imagine, we cannot divide 4 evenly by 3.
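In practice the scaling and quantization are handled by the framework. A minimal sketch using torchvision's roi_pool with the numbers from this example:

import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 16, 16)  # VGG16 output for a 512x512 input

# One RoI in (batch_index, x1, y1, x2, y2) format, in input-image coordinates:
# top-left (296, 192), size 200x145 -> bottom-right (496, 337)
rois = torch.tensor([[0, 296.0, 192.0, 496.0, 337.0]])

# spatial_scale=1/32 maps image coordinates onto the feature map;
# roi_pool quantizes the fractional coordinates internally.
pooled = roi_pool(feature_map, rois, output_size=(3, 3), spatial_scale=1 / 32)
print(pooled.shape)  # torch.Size([1, 512, 3, 3])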
Problems with Fast RCNN
• Fast RCNN still has certain problem areas.
• It uses selective search as the proposal method to find the regions of interest, which is a slow and time-consuming process.
• It takes around 2 seconds per image to detect objects, which is much better than RCNN.
• But when we consider large real-life datasets, even Fast RCNN doesn't look so fast anymore.
Faster-RCNN
• Faster RCNN is the modified version of Fast RCNN. The major difference between them is that Fast RCNN uses selective search for generating regions of interest, whereas Faster RCNN replaces it with a learned Region Proposal Network (RPN) that slides over the shared convolutional feature map.
• We extract a descriptor per location.
YOLO (You Only Look Once!)
• YOLO is a real-time object detection algorithm. It was developed by Joseph
Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi at the University of
Washington (2015).

• YOLO is extremely fast because it passes the entire image at once into a CNN, rather than making predictions on many individual regions of the image.

• The key idea behind YOLO is to use a single neural network to predict the bounding boxes and class probabilities for all objects in an image.
YOLO (You Only Look Once!)
• YOLO divides the input image into a grid of cells and predicts the presence of objects in
each cell.
• If an object is detected in a cell, the algorithm also predicts the bounding box and the
class for the object.
• The bounding box coordinates and class probabilities are then used to localize and
classify the objects.

• Each object in a training image is assigned to the grid cell that contains that object's midpoint.
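A minimal sketch of the midpoint-to-cell assignment, assuming an S x S grid and box coordinates normalized to [0, 1]:

def assign_to_cell(box, S=7):
    """Return the (row, col) grid cell that owns a box's midpoint.

    box is (x1, y1, x2, y2) with coordinates normalized to [0, 1].
    """
    cx = (box[0] + box[2]) / 2  # midpoint x
    cy = (box[1] + box[3]) / 2  # midpoint y
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    return row, col

# Example: a box centred at (0.53, 0.41) on a 7x7 grid lands in cell (2, 3)
print(assign_to_cell((0.46, 0.33, 0.60, 0.49)))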
YOLO – Anchor Boxes
• One of the caveats of YOLO is that it can't detect multiple objects in the same grid cell.
• Solution: anchor boxes. An anchor box is a predefined bounding box used in object detection algorithms.
• The anchor box defines the size and aspect ratio of a detection window, and it is fixed prior to training the object detection model. The model is then trained to predict the bounding box coordinates and class probabilities for objects relative to the anchor box.
• Per-grid target label: each object in a training image is assigned to the grid cell that contains the object's midpoint, and to the anchor box with the highest IoU for that grid cell.
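The anchor choice can be sketched by comparing the object's shape against each anchor shape, reusing the iou helper from the IoU section; the two anchor shapes here are made-up examples:

# Anchor shapes as (width, height); these particular values are illustrative.
ANCHORS = [(0.9, 0.3), (0.3, 0.9)]  # one wide anchor, one tall anchor

def best_anchor(box_w, box_h, anchors=ANCHORS):
    """Pick the anchor whose shape has the highest IoU with the object's shape."""
    obj = (0, 0, box_w, box_h)  # place both boxes at the origin to compare shapes
    ious = [iou(obj, (0, 0, aw, ah)) for aw, ah in anchors]
    return max(range(len(anchors)), key=lambda i: ious[i])

print(best_anchor(0.8, 0.25))  # a wide box matches anchor 0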
Putting it together: YOLO algorithm
• Two anchor boxes are used per grid cell.
• The final detections are the non-max-suppressed outputs.
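A minimal sketch of greedy non-max suppression, again reusing the iou helper; the 0.5 threshold is a common but assumed choice:

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping boxes, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard remaining boxes that overlap the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep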
Detection evaluation
The mAP formula is based on the following sub-metrics:
• Confusion Matrix
• Intersection over Union (IoU)
• Recall
• Precision

Confusion Matrix
• To create a confusion matrix, we need four attributes:
• True Positives (TP): the model predicted a label that correctly matches the ground truth.
• True Negatives (TN): the model did not predict the label, and it is not part of the ground truth.
• False Positives (FP): the model predicted a label, but it is not part of the ground truth.
• False Negatives (FN): the model did not predict a label, but it is part of the ground truth.
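For detection, TP/FP/FN are determined by matching predictions to ground-truth boxes with an IoU threshold. A minimal greedy-matching sketch, reusing the iou helper (0.5 threshold assumed, predictions assumed pre-sorted by descending confidence):

def count_tp_fp_fn(pred_boxes, gt_boxes, iou_threshold=0.5):
    """Greedily match each prediction to an unused ground-truth box."""
    matched = set()
    tp = 0
    for pb in pred_boxes:
        # Find the best unmatched ground-truth box for this prediction
        best_iou, best_j = 0.0, None
        for j, gb in enumerate(gt_boxes):
            if j in matched:
                continue
            score = iou(pb, gb)
            if score > best_iou:
                best_iou, best_j = score, j
        if best_j is not None and best_iou >= iou_threshold:
            matched.add(best_j)
            tp += 1
    fp = len(pred_boxes) - tp   # predictions with no match
    fn = len(gt_boxes) - tp     # ground truths never matched
    return tp, fp, fn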
Detection evaluation
• Precision measures how many of the "positive" predictions made by the model were correct.
• Recall measures how many of the positive class samples present in the dataset were correctly identified by the model.
• Precision and recall involve a trade-off, i.e., improving one often comes at the cost of the other.
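In terms of the confusion-matrix counts above:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)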
mAP
• The mAP is calculated by finding the Average Precision (AP) for each class and then averaging over the number of classes.
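A simplified sketch of AP as the area under the precision-recall curve for one class; standard mAP implementations additionally interpolate the precision envelope, which is omitted here:

def average_precision(tp_flags, num_gt):
    """AP for one class: tp_flags marks each prediction (sorted by confidence) as TP/FP."""
    tp_cum = fp_cum = 0
    precisions, recalls = [], []
    for is_tp in tp_flags:
        tp_cum += is_tp
        fp_cum += not is_tp
        precisions.append(tp_cum / (tp_cum + fp_cum))
        recalls.append(tp_cum / num_gt)

    # Area under the precision-recall curve (rectangular rule over recall steps)
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

def mean_average_precision(per_class_aps):
    """mAP: mean of the per-class APs."""
    return sum(per_class_aps) / len(per_class_aps)

print(average_precision([True, False, True], num_gt=2))  # ~0.833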
