Object Detection Slides
Object Detection
[Example image: detected objects labeled "boat" and "person"]
• Features are extracted with a CNN for every image region. With roughly 2,000 region
proposals per image, N images require about N*2,000 CNN feature extractions.
• The entire object-detection process in RCNN involves three models: a CNN for feature
extraction, an SVM classifier to identify objects, and a regression model to tighten the
bounding boxes.
• All these steps combined make RCNN very slow: it takes around 40-50 seconds to
make predictions for each new image (the per-region pipeline is sketched below).
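A minimal sketch of this per-region pipeline; every helper passed in (selective_search, cnn_features, svm_classify, bbox_regress) is a hypothetical placeholder, not a real API:

```python
# Hypothetical R-CNN pipeline sketch; all helpers are placeholders passed in by the caller.
def rcnn_detect(image, selective_search, cnn_features, svm_classify, bbox_regress):
    detections = []
    proposals = selective_search(image)              # ~2,000 region proposals per image
    for (x1, y1, x2, y2) in proposals:
        region = image[y1:y2, x1:x2]                 # crop (the real pipeline warps it to the CNN input size)
        feats = cnn_features(region)                 # one full CNN forward pass *per region*
        label, score = svm_classify(feats)           # SVM classifier on the CNN features
        refined_box = bbox_regress(feats, (x1, y1, x2, y2))  # separate bounding-box regressor
        detections.append((label, score, refined_box))
    return detections

# With N images this is roughly N * 2,000 CNN forward passes,
# which is why R-CNN takes tens of seconds per image at test time.
```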
RCNN-PROBLEM
• Spatial pyramid pooling (SPP-net) introduces a new problem: parameters below the SPP
layer cannot be updated during training.
Fast RCNN
• Instead of running a CNN 2,000 times per image, we can run it just once per image and get
all the regions of interest (regions containing some object).
• In Fast RCNN, we feed the input image to the CNN, which in turn generates the
convolutional feature maps.
• We then use an RoI pooling layer to reshape all the proposed regions to a fixed size, so that
they can be fed into a fully connected network.
• A softmax layer on top of the fully connected network outputs the classes. Alongside it, a
linear regression layer is used in parallel to output bounding-box coordinates for the
predicted classes.
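A rough sketch of this forward pass in PyTorch: torchvision's roi_pool is a real function, but the backbone output, layer sizes, and class count below are illustrative assumptions, not Fast RCNN's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class FastRCNNHead(nn.Module):
    """Simplified Fast RCNN head: RoI pooling, shared FC layer, then class + box branches."""
    def __init__(self, channels=512, pool_size=7, num_classes=21):
        super().__init__()
        self.pool_size = pool_size
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * pool_size * pool_size, 4096),
            nn.ReLU(),
        )
        self.cls_score = nn.Linear(4096, num_classes)      # softmax branch: class scores
        self.bbox_pred = nn.Linear(4096, num_classes * 4)  # parallel branch: box coordinates

    def forward(self, feature_map, rois):
        # The backbone has already run ONCE on the whole image to produce feature_map;
        # every proposed region is cropped from it and pooled to a fixed size.
        pooled = roi_pool(feature_map, rois, output_size=self.pool_size, spatial_scale=1 / 32)
        x = self.fc(pooled)
        return self.cls_score(x), self.bbox_pred(x)

# Usage: one 512x512 image -> one 16x16x512 feature map, proposals given in image coordinates.
feature_map = torch.randn(1, 512, 16, 16)
rois = [torch.tensor([[296.0, 192.0, 400.0, 300.0]])]       # (x1, y1, x2, y2) per image
cls_logits, box_deltas = FastRCNNHead()(feature_map, rois)
print(cls_logits.shape, box_deltas.shape)                   # torch.Size([1, 21]) torch.Size([1, 84])
```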
Cropping Features: RoI Pool
The model takes an input image of size 512x512x3 (width x height x RGB), and VGG16 maps
it to a 16x16x512 feature map.
Note that the output's width and height are exactly 32 times smaller than the input
image (512/32 = 16). That is important because all RoIs must be scaled down by this
factor.
Cropping Features: RoI Pool
Example: scaling an RoI corner from the input image down to the feature map by that factor of 32:
• x: 296/32 = 9.25
• y: 192/32 = 6
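The same scaling as a tiny worked example; the flooring step at the end is an assumption about how RoI Pool typically snaps fractional coordinates onto the feature-map grid:

```python
# Scale an RoI corner from image coordinates down to feature-map coordinates.
stride = 512 // 16          # backbone downsampling factor = 32
x_img, y_img = 296, 192     # RoI corner in the 512x512 input image

x_fm = x_img / stride       # 296 / 32 = 9.25  -> fractional
y_fm = y_img / stride       # 192 / 32 = 6.0   -> exact

# RoI Pool cannot address a fractional cell, so (assumption) the coordinate is
# floored onto the 16x16 feature-map grid, losing a little spatial precision.
print(int(x_fm), int(y_fm))  # 9 6
```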
Cropping Features: RoI Pool
After the RoI pooling layer there is a fully connected
layer with a fixed input size. Because our RoIs have different
sizes, we have to pool them into the same size
(3x3x512 in our example). At this point our mapped
RoI has a size of 4x6x512, and as you can imagine
we cannot divide 4 evenly by 3.
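A small sketch of that pooling step using torchvision's roi_pool; the feature values are random and the RoI is given directly in feature-map cells, so only the shapes matter here:

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 16, 16)          # VGG16 output for the 512x512 image
# One RoI given in feature-map cells (batch_index, x1, y1, x2, y2): 4 cells wide, 6 tall.
roi = torch.tensor([[0.0, 9.0, 6.0, 13.0, 12.0]])

pooled = roi_pool(feature_map, roi, output_size=3, spatial_scale=1.0)
print(pooled.shape)  # torch.Size([1, 512, 3, 3])
# 4 columns and 6 rows do not divide evenly into a 3x3 grid, so RoI Pool quantizes
# the bins unevenly, which loses some spatial precision.
```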
Problems with Fast RCNN
• Fast RCNN still has a problem area: it relies on selective search to generate region
proposals, which runs outside the network, is slow, and becomes the bottleneck at test time.
• YOLO is extremely fast because it passes the entire image through a CNN at once,
rather than making predictions on many individual regions of the image.
• The key idea behind YOLO is to use a single neural network to predict the
bounding boxes and class probabilities for objects in an image
YOLO (You Only Look Once!)
• YOLO divides the input image into a grid of cells and predicts the presence of objects in
each cell.
• If an object is detected in a cell, the algorithm also predicts the bounding box and the
class for the object.
• The bounding box coordinates and class probabilities are then used to localize and
classify the objects.
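A rough sketch of the YOLO output layout, assuming the original YOLOv1 settings (S=7 grid, B=2 boxes per cell, C=20 classes); the tensor is random and the decoding only illustrates the shapes:

```python
import torch

S, B, C = 7, 2, 20                      # grid size, boxes per cell, classes (YOLOv1 defaults)
# One forward pass predicts everything at once: each cell outputs B boxes
# (x, y, w, h, confidence) plus C class probabilities.
pred = torch.randn(S, S, B * 5 + C)     # shape (7, 7, 30)

boxes = pred[..., : B * 5].reshape(S, S, B, 5)      # per-cell box predictions
class_probs = pred[..., B * 5 :].softmax(dim=-1)    # per-cell class distribution

# Score that "cell (i, j), box b contains class k" = box confidence * class probability.
scores = boxes[..., 4:5] * class_probs.unsqueeze(2)  # shape (S, S, B, C)
print(scores.shape)
```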
Confusion Matrix
• To create a confusion matrix, we need four attributes:
• True Positives (TP): The model predicted a label that matches the ground truth.
• True Negatives (TN): The model did not predict a label, and none is present in the
ground truth.
• False Positives (FP): The model predicted a label that is not part of the ground truth.
• False Negatives (FN): The model did not predict a label that is part of the ground
truth.
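For detection, these counts are usually obtained by matching predicted boxes to ground-truth boxes with an IoU threshold; below is a minimal single-class sketch assuming greedy one-to-one matching at IoU ≥ 0.5 (not any particular benchmark's official protocol):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def count_tp_fp_fn(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Greedy matching: a prediction is a TP if it overlaps an unmatched ground-truth box."""
    matched, tp, fp = set(), 0, 0
    for p in pred_boxes:
        best = max(range(len(gt_boxes)), key=lambda i: iou(p, gt_boxes[i]), default=None)
        if best is not None and best not in matched and iou(p, gt_boxes[best]) >= iou_thresh:
            matched.add(best)
            tp += 1
        else:
            fp += 1
    fn = len(gt_boxes) - len(matched)   # ground-truth boxes that no prediction matched
    return tp, fp, fn

tp, fp, fn = count_tp_fp_fn([(10, 10, 50, 50)], [(12, 12, 48, 48)])
precision, recall = tp / (tp + fp), tp / (tp + fn)   # both 1.0 for this toy example
```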
Detection evaluation