
Object Detection

Dr. Thomas Abraham J V


SCOPE, VIT Chennai.
Object Localization
Object Detection
• Localization: Essentially drawing boxes around the objects. This helps the computer
understand where things are in a picture.

• Classification: Beyond just saying, “Hey, there’s a thing here,” object detection also labels
each object. So, if there’s a car in the image, the model doesn’t just know something is there;
it knows exactly what it is, i.e., a car.
• The heart of object localization lies in four key parameters:

• Midpoint (bx, by): Representing the center of the bounding box, these coordinates pinpoint
the object’s location in the image.

• Height and Width (bh, bw): These parameters dictate the dimensions of the bounding box,
outlining the object’s size within the image.
• To teach a machine to localize objects, we introduce additional labels. These include:

• Object Probability (pc): A binary indicator (1 or 0) denoting whether an object of
interest is present in the image.

• Bounding Box Parameters: If an object is present (pc = 1), the machine learns the specific
coordinates (bx, by) and dimensions (bh, bw) of the bounding box.

• Class Labels: Indicating the category of the object (e.g., car, pedestrian), these labels provide
a comprehensive understanding of the object’s identity.
• How to do Bounding Box Evaluation?

• How to calculate IoU?

• Evaluation Metric – mean Average Precision
Bounding Box Evaluation – Intersection over Union (IoU)

[Figure slides: which bounding box is more accurate; area of intersection; area of union; IoU ranges (Scenarios 1 and 2); worked calculation of IoU.]
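The IoU calculation shown on these slides is easy to express in code. Below is a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates in pixels; the function name and box format are illustrative choices, not from the slides:

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes in (x1, y1, x2, y2) format."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A perfect match gives IoU = 1; disjoint boxes give IoU = 0
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```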
Evaluation Metric – mean Average Precision
Assuming the test image has 3 ground truth objects of the same class:

Predicted         IoU    TP/FP   Precision   Recall
Bounding Box 1    0.7    TP      1           0.33
Bounding Box 2    0.65   TP      1           0.67
Bounding Box 3    0.2    FP      0.67        0.67
Bounding Box 4    0.8    TP      0.75        1
Bounding Box 5    0.3    FP      0.6         1

Precision = TP / (TP + FP)

Recall = True Positive (TP) / Total Ground Truth Objects (GT)

Average Precision = (1/N) Σ Precision = (1/3)(1 + 1 + 0.75) ≈ 0.92
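The average precision arithmetic in the table can be reproduced with a short script. This is a sketch of the slide's calculation only, assuming the detections are already sorted by confidence and an IoU threshold of 0.5 separates TP from FP:

```python
def average_precision(ious, num_gt, iou_threshold=0.5):
    """AP as the mean of precision values at each true positive,
    matching the slide's calculation: (1/N) * sum of precisions at TPs."""
    tp = fp = 0
    precisions_at_tp = []
    for iou in ious:  # detections assumed sorted by confidence
        if iou >= iou_threshold:
            tp += 1
            precisions_at_tp.append(tp / (tp + fp))
        else:
            fp += 1
    return sum(precisions_at_tp) / num_gt

# IoUs of the five predicted boxes from the table, 3 ground-truth objects
print(average_precision([0.7, 0.65, 0.2, 0.8, 0.3], num_gt=3))  # (1 + 1 + 0.75) / 3 ≈ 0.92
```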
Object Detection Algorithm - Sliding Window Approach
• The sliding window technique involves moving a window of a fixed size across the entire
image, analyzing each window individually or applying the model on each region to
determine whether it contains an object of interest.

• The window slides pixel by pixel, capturing different portions of the image in a systematic
manner. For each position of the window, the object detection model assesses whether the
content within the window corresponds to a specific class, adjusting for scale and aspect ratio
variations.

• The same thing is repeated with different window sizes and aspect ratios, resulting in many
windows for which the model reports an object; the window that best matches the ground
truth box is then selected (a minimal version of this loop is sketched below).
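As a sketch of the mechanics described above, the following loop enumerates crops for several window sizes; `classify` is a hypothetical scoring function standing in for the detection model, and the stride and window sizes are illustrative values:

```python
import numpy as np

def sliding_windows(image, window_sizes, stride=8):
    """Yield (x, y, w, h) positions and the corresponding crops of the image
    for every window size, moving the window systematically."""
    img_h, img_w = image.shape[:2]
    for (win_w, win_h) in window_sizes:
        for y in range(0, img_h - win_h + 1, stride):
            for x in range(0, img_w - win_w + 1, stride):
                yield (x, y, win_w, win_h), image[y:y + win_h, x:x + win_w]

image = np.zeros((224, 224, 3), dtype=np.uint8)
for box, patch in sliding_windows(image, window_sizes=[(64, 64), (128, 64)]):
    pass  # score = classify(patch)  # run the model on each crop
```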
Object Detection Algorithm - Sliding Window Approach

Different Window Sizes

• The major drawback of the Sliding Windows Detection
Algorithm is its computational cost. With numerous
window positions and varying sizes, running each region
through the model becomes computationally expensive.

Sliding Window Approach
Object Detection - Deep Learning Approach
Basic Structure

• Encoder: The encoder in deep learning-based object detection extracts features from the input
image using convolutional neural networks (CNNs). This part of the network captures high-
level patterns such as shapes, textures, and other statistical features that are useful for object
detection.

• Decoder: The decoder is responsible for predicting bounding boxes and class labels for the
objects detected. In this case, the decoder is a regressor, which predicts the exact coordinates
(X, Y, width, height) of the bounding boxes and the class labels (what object is in that box).

• The simplest decoder is a pure regressor. The regressor is connected to the output of the
encoder and predicts the location and size of each bounding box directly. The output of the
model is the X, Y coordinate pair for the object and its extent in the image.
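A minimal PyTorch sketch of this encoder-plus-pure-regressor structure, assuming a single object per image; the layer sizes and names here are illustrative choices, not taken from the slides:

```python
import torch
import torch.nn as nn

class SingleBoxDetector(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Encoder: a small CNN that turns the image into feature maps
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(7),
            nn.Flatten(),
        )
        # Decoder: a pure regressor predicting (x, y, w, h), plus class scores
        self.box_head = nn.Linear(64 * 7 * 7, 4)
        self.cls_head = nn.Linear(64 * 7 * 7, num_classes)

    def forward(self, x):
        feats = self.encoder(x)
        return self.box_head(feats), self.cls_head(feats)

model = SingleBoxDetector(num_classes=3)
boxes, logits = model(torch.randn(1, 3, 224, 224))
print(boxes.shape, logits.shape)  # torch.Size([1, 4]) torch.Size([1, 3])
```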
Region Proposal Network
• An extension of the regressor approach is a region proposal network.

• In this decoder, the model proposes regions of an image where it believes an object might reside.

• The pixels of these regions are then fed into a classification subnetwork to determine a label
(or reject the proposal). The benefit of this method is a more accurate, flexible model that can
propose arbitrary numbers of regions that may contain a bounding box.

• The added accuracy, though, comes at the cost of computational efficiency.
Object Detection Algorithms
• Single-shot detectors (SSDs) are a type of object detection algorithm that predict the
bounding box and the class of the object in one single shot. This means that in a single
forward pass of the network, the presence of an object and the bounding box are predicted
simultaneously. This makes SSDs very fast and efficient, suitable for tasks that require real-
time detection.

• Two-shot or multi-shot object detection algorithms, on the other hand, use a two-step
process for detecting objects. The first step involves proposing a series of bounding boxes
that could potentially contain an object. This is often done using a method called region
proposal. The second step involves running these proposed regions through a convolutional
neural network to classify the object classes within the box.
R-CNN
1. Extraction of the regions from an input image, called region proposals. In other
words, region proposals are the regions where there is a possibility of finding the
object we are looking for in the image.

2. Label the category and ground truth bounding box of each proposed region.

3. Each proposed region is warped to have the same shape as the input of the CNN.

4. Computing the features of those regions with a Convolutional Neural Network
(CNN).

5. The features and category labels are used for classification with a support vector
machine (SVM).

6. Finally, regression is performed on the features and labeled bounding box to predict
the ground truth bounding box (the full pipeline is sketched in pseudocode below).
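Put together, the six steps read roughly like the following sketch. Every helper here (`selective_search`, `warp`, `cnn_features`, `svm`, `box_regressor`) is a hypothetical stub standing in for the component named in the corresponding step, included only so the sketch runs:

```python
# Hypothetical stand-ins for the real components (stubs so the sketch runs)
def selective_search(image):       # step 1: ~2000 region proposals
    return [(0, 0, 50, 50), (10, 10, 80, 80)]

def warp(image, region, size):     # step 3: crop + resize to the CNN input shape
    return size

def cnn_features(patch):           # step 4: CNN feature vector for the patch
    return [0.0] * 4096

def svm(feats):                    # step 5: class label + score from a binary SVM
    return "car", 0.9

def box_regressor(feats, region):  # step 6: refined bounding box
    return region

def rcnn_detect(image):
    detections = []
    for region in selective_search(image):            # step 1
        patch = warp(image, region, size=(227, 227))  # step 3
        feats = cnn_features(patch)                   # step 4
        label, score = svm(feats)                     # step 5
        if label != "background":
            box = box_regressor(feats, region)        # step 6
            detections.append((label, score, box))
    return detections

print(rcnn_detect(image=None))
```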
How to create boxes and how many?
1. Constrained Parametric Min-Cuts

2. Category Independent Object Proposals

3. Randomized Prim

4. Selective Search
Selective Search
• Step I: R-CNN uses Felsenszwalb’s efficient graph-based image segmentation to create
initial segmentation/regions
Selective Search
• Step II: Combining smaller regions into larger
ones based on similarity. In a way, it generates a
hierarchy of bounding boxes. The four
commonly used similarity measures are color,
texture, size, and fill/shape.

• For regions r1, r2 the similarity will be a linear
combination of all these four:

• s_final(r1,r2) = a1*s_color(r1,r2) +
a2*s_texture(r1,r2) + a3*s_size(r1,r2) +
a4*s_fill(r1,r2)

• where ai belongs to {0,1} depending on
whether we are considering that measure or
not, as the sketch below shows. (Figure: Selective Search Result)
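The linear combination above maps directly to a short function. A sketch, assuming the four per-pair similarity scores are already computed and each weight ai acts as a 0/1 switch:

```python
def s_final(s_color, s_texture, s_size, s_fill, weights=(1, 1, 1, 1)):
    """Combined similarity of two regions; each weight ai is 0 or 1,
    switching the corresponding similarity measure on or off."""
    a1, a2, a3, a4 = weights
    return a1 * s_color + a2 * s_texture + a3 * s_size + a4 * s_fill

# e.g. using only the colour and size similarities
print(s_final(0.8, 0.5, 0.9, 0.4, weights=(1, 0, 1, 0)))  # 1.7
```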
Feature Extraction
• The pre-trained CNN is applied to the proposed region after warping to the dimension of the
network. The features extracted from these regions along with labels are used for
classification and bounding box predictions.
For each proposal,

• Binary SVM is trained for classification.

• A simple linear regression is trained on the region proposal to generate tighter
bounding box coordinates.
R-CNN Recap…..
• Extracting 2,000 regions for each image based
on selective search

• Extracting features using CNN for every image
region. Suppose we have N images, then the
number of CNN features will be N*2,000

• The entire process of object detection using
RCNN has three models:

• CNN for feature extraction

• Linear SVM classifier for identifying objects

• Regression model for tightening the
bounding boxes.
Fast R-CNN
1. The image is passed to a ConvNet, which in turn generates the Regions of Interest.

2. A RoI pooling layer is applied on all of these regions to reshape them as per the input of the
ConvNet. Then, each region is passed on to a fully connected network.

3. A softmax layer is used on top of the fully connected network to output classes. Along with
the softmax layer, a linear regression layer is also used in parallel to output bounding box
coordinates for predicted classes.

• So, instead of using three different models (like in RCNN), Fast RCNN uses a single model
which extracts features from the regions, divides them into different classes, and returns the
bounding boxes for the identified classes simultaneously.
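torchvision exposes an RoI pooling operator, so the reshaping step described above can be sketched as follows; the feature-map size and proposal boxes below are invented for illustration:

```python
import torch
from torchvision.ops import roi_pool

# Feature map for one image: (batch, channels, H, W)
features = torch.randn(1, 256, 50, 50)

# Proposals as (batch_index, x1, y1, x2, y2) in feature-map coordinates
rois = torch.tensor([[0, 0.0, 0.0, 20.0, 20.0],
                     [0, 10.0, 15.0, 45.0, 40.0]])

# Every proposal, whatever its size, is pooled to the same 7x7 spatial shape
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```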
Faster R-CNN
• Pass the input image to the ConvNet which returns the feature map for that image.

• Region proposal network is applied on these feature maps. This returns the object proposals
along with their objectness score.

• A RoI pooling layer is applied on these proposals to bring down all the proposals to the same
size.

• Finally, the proposals are passed to a fully connected layer which has a softmax layer and a
linear regression layer at its top, to classify and output the bounding boxes for objects.
Faster R-CNN
• Anchor boxes are fixed-sized boundary boxes that are placed throughout the image and
have different shapes and sizes. For each anchor, RPN predicts two things:

• The first is the probability that an anchor is an object (it does not consider which class the
object belongs to)

• Second is the bounding box regressor for adjusting the anchors to better fit the object
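As a sketch of how such anchors might be tiled, the function below places one anchor centre per feature-map cell and expands it over a set of scales and aspect ratios; the stride, scales, and ratios are illustrative values, not the ones used in the Faster R-CNN paper:

```python
import itertools

def generate_anchors(fmap_w, fmap_h, stride, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return (cx, cy, w, h) anchors tiled over every feature-map cell."""
    anchors = []
    for j, i in itertools.product(range(fmap_h), range(fmap_w)):
        # Anchor centre in image coordinates
        cx, cy = (i + 0.5) * stride, (j + 0.5) * stride
        for scale, ratio in itertools.product(scales, ratios):
            w = scale * (ratio ** 0.5)   # keep area = scale^2, aspect w/h = ratio
            h = scale / (ratio ** 0.5)
            anchors.append((cx, cy, w, h))
    return anchors

# 7x7 feature map with stride 32 -> 7 * 7 * (2 scales * 3 ratios) = 294 anchors
print(len(generate_anchors(7, 7, stride=32)))  # 294
```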
Summary of the Algorithms
Algorithm: CNN
Features: Divides the image into multiple regions and then classifies each region into various classes.
Prediction time / image: –
Limitations: Needs a lot of regions to predict accurately and hence high computation time.

Algorithm: RCNN
Features: Uses selective search to generate regions. Extracts around 2000 regions from each image.
Prediction time / image: 40-50 seconds
Limitations: High computation time, as each region is passed to the CNN separately; it also uses three different models for making predictions.

Algorithm: Fast RCNN
Features: Each image is passed only once to the CNN and feature maps are extracted. Selective search is used on these maps to generate predictions. Combines all the three models used in RCNN together.
Prediction time / image: 2 seconds
Limitations: Selective search is slow and hence computation time is still high.

Algorithm: Faster RCNN
Features: Replaces the selective search method with a region proposal network, which made the algorithm much faster.
Prediction time / image: 0.2 seconds
Limitations: Object proposal takes time and, as there are different systems working one after the other, the performance of each system depends on how the previous system has performed.
You Only Look Once (YOLO)
One Stage Algorithm

• YOLO, developed by Joseph Redmon et al., frames object detection as a regression problem
to spatially separated bounding boxes and associated class probabilities.

• It looks at the whole image at test time so its predictions are informed by global context in
the image.

• The model takes an image or a series of images (video frames) as input and returns important
features like x-coordinates, y-coordinates, class name, and confidence score (probability).

• YOLO promises excellent learning capabilities, faster speed (up to 45 FPS), and high
accuracy compared to other algorithms.

• YOLO is known for its speed, making it suitable for real-time applications.
CNN - Backbone Network
• The input image is passed into a convolutional neural network (CNN), known as the
backbone network. This backbone has typically been pre-trained on large-
scale datasets like ImageNet for image classification tasks.

• After pre-training, the backbone network is adapted for object detection by


removing the last few layers, which were originally designed for
classification tasks.

• This adjustment transforms the backbone into a feature extractor that


generates a series of stacked feature maps. These maps represent the image
in a format with reduced spatial resolution but higher channel (feature)
resolution.

• For example, for a 224 x 224 input image, the resulting feature maps might
be organized in a grid-like structure, such as 7x7x512, where 512 represents
the number of feature maps capturing different aspects of the image content.
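With a recent version of torchvision, this adaptation can be sketched in a few lines: load a classification-pretrained ResNet and drop its pooling and classification head, leaving a feature extractor. Note that ResNet-50 yields 7x7x2048 feature maps for a 224x224 input, rather than the 7x7x512 of the slide's example:

```python
import torch
import torch.nn as nn
from torchvision import models

# Backbone pre-trained for ImageNet classification
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Drop the average-pool and fully connected classification layers
backbone = nn.Sequential(*list(resnet.children())[:-2])

features = backbone(torch.randn(1, 3, 224, 224))
print(features.shape)  # torch.Size([1, 2048, 7, 7]): low spatial, high channel resolution
```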
YOLO Workflow
• Pre-train a CNN: The process starts by pre-training a Convolutional Neural Network (CNN)
on an image classification task. This helps the network learn to extract relevant features from
images.

• Image Grid and Responsibility: The input image is divided into an S×S grid of cells. Each
cell is responsible for detecting objects whose center falls within that cell.

• Each grid cell in the feature map predicts information that helps identify and locate objects
within that cell. For each cell, this prediction includes class probabilities, bounding box
coordinates, and confidence scores.
• The prediction for each grid cell combines several elements:

• The likelihood that the cell contains an object, known as the confidence score or objectness.

• The class probabilities, indicating which class the object belongs to.

• The coordinates and dimensions of the bounding box that localizes the object.
• For each grid cell, if we predict B bounding boxes, each bounding box prediction includes five components:

• x: x-coordinate of the center of the bounding box relative to the grid cell.
• y: y-coordinate of the center of the bounding box relative to the grid cell.
• w: width of the bounding box, often normalized by the image width.
• h: height of the bounding box, often normalized by the image height.
• Confidence score, indicating the likelihood that the bounding box contains an object and how accurate
the box is.

• If there are C classes, each grid cell also predicts C class probabilities.
• The probability of each object class, conditioned on the existence of an object in the
bounding box.

• The bounding box coordinates are defined by four values bx, by, bw, bh, where bx and by are
the center coordinates of the bounding box; they decide where to put the bounding box in the image.

• Similarly, bw and bh are the width and height of the box; they define the size of the box. bx, by,
bw and bh are normalized by the width and height of the image and thus all lie between [0, 1].

• The confidence score for each bounding box is defined as the product of the probability that
the box contains an object and the Intersection over Union (IoU) between the predicted box
and the ground truth box.

• This score is high if the box is likely to contain an object and if the predicted box overlaps
significantly with the true box.
• If a cell contains an object, it predicts the probability that this object belongs to each class.
These are conditional probabilities: P(object belongs to class i | containing object) where i is
a specific class.

• This means that each cell only predicts one set of class probabilities, regardless of the
number of bounding boxes.

• For an image, the total number of bounding boxes predicted is S×S×B. Each bounding box
has 4 location predictions, 1 confidence score, and C class probabilities. Therefore, the total
number of predictions for one image is S×S×(B×5+C).

• The final prediction tensor, shaped as S×S×(B×5+C), is produced by two fully connected
layers applied over the convolutional feature map.
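The S×S×(B×5+C) bookkeeping is easy to check in code. A sketch using the values from the original YOLO paper (S = 7, B = 2, C = 20 for PASCAL VOC):

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

per_cell = B * 5 + C      # (x, y, w, h, confidence) per box + C class probabilities
total_boxes = S * S * B   # bounding boxes predicted per image
tensor_shape = (S, S, per_cell)

print(total_boxes)    # 98
print(tensor_shape)   # (7, 7, 30) -> S*S*(B*5+C) = 1470 predictions per image
```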
YOLO Network Architecture
Final Output Tensor Shape
• Bounding Box and Class Probability Prediction: Each grid cell predicts B bounding boxes
and confidence scores for those boxes. The confidence score reflects how certain the model
is that a box contains an object and how accurate it thinks the box is. Each grid cell also
predicts C conditional class probabilities (one per class for the potential objects). These
probabilities are conditioned on there being an object in the box.
• Intersection over Union: It describes how boxes overlap. The IoU equals 1 if the predicted
bounding box is the same as the real box. This mechanism eliminates bounding boxes that
deviate significantly from the real box. IoU is calculated as the ratio of ‘area of overlap’ to ‘area of
union’. Finally, by Non-Maximal Suppression, YOLO suppresses all bounding boxes with
lower probability scores to achieve the final bounding boxes.
• Here,

• pc defines whether an object is present in the grid or not (it is the probability)

• bx, by, bh, bw specify the bounding box if there is an object

• c1, c2, c3 represent the classes. So, if the object is a car, c2 will be 1 and c1
& c3 will be 0, and so on
• In YOLO, the coordinates assigned to all the grids are:

• bx, by are the x and y coordinates of the midpoint of the object with respect to this grid.
In this case, it will be (around) bx = 0.4 and by = 0.3:

• bh is the ratio of the height of the bounding box (red box in the above example) to the
height of the corresponding grid cell, which, in our case, is around 0.9. So, bh = 0.9. bw
is the ratio of the bounding box’s width to the grid cell’s width. So, bw = 0.5
(approximately).
The Non-Max Suppression Algorithm
• The Non-Max suppression algorithm:

• Discard all the boxes having probabilities less than or equal to a predefined threshold (say,
0.5)

• For the remaining boxes:

• Pick the box with the highest probability and take that as the output prediction

• Discard any other box that has IoU greater than the threshold with the output box from
the above step

• Repeat step 2 until all the boxes are either taken as the output prediction or discarded
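A direct implementation of these steps, assuming boxes in (x1, y1, x2, y2) format; the IoU helper is restated here so the sketch is self-contained, and the 0.5 thresholds are the illustrative values from the slide:

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes (same helper as in the IoU section)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Greedy NMS following the algorithm above; returns indices of kept boxes."""
    # Step 1: discard boxes at or below the probability threshold
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]
    # Sort the remaining boxes by probability, highest first
    candidates.sort(key=lambda i: scores[i], reverse=True)

    keep = []
    while candidates:
        best = candidates.pop(0)  # step 2a: highest-probability box is an output
        keep.append(best)
        # Step 2b: drop boxes whose IoU with the chosen box exceeds the threshold
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]: box 1 overlaps box 0 and is suppressed
```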
Non-Max Suppression
