
Deep Learning

Dr. Sanjeev Sharma


Object Detection
• In image classification, we assume that there is
only one main target object in the image, and the
model’s sole focus is to identify the target
category.
• However, in many situations, we are interested in
multiple targets in the image. We want to not only
classify them, but also obtain their specific
positions in the image. In computer vision, we
refer to such tasks as object detection.
• The figure illustrates the difference between image classification and object detection tasks.
• Object detection is widely used in many fields.
For example, in self-driving technology, we
need to plan routes by identifying the locations
of vehicles, pedestrians, roads, and obstacles in
a captured video image.
• Robots often perform this type of task to detect targets of interest, and systems in the security field need to detect abnormal targets, such as intruders or bombs.
General object detection framework
• Typically, an object detection framework has four
components:
• Region proposal—An algorithm or a DL model is used
to generate regions of interest (RoIs) to be further
processed by the system.
• These are regions that the network believes might
contain an object; the output is a large number of
bounding boxes, each of which has an objectness score.
• Boxes with large objectness scores are then passed
along the network layers for further processing.
• Feature extraction and network predictions—
Visual features are extracted for each of the
bounding boxes.
• Based on these visual features, each proposal is evaluated to determine whether an object is present and, if so, which class it belongs to (for example, by an object classification component).
• Non-maximum suppression (NMS)—In this
step, the model has likely found multiple
bounding boxes for the same object.
• NMS helps avoid repeated detection of the
same instance by combining overlapping
boxes into a single bounding box for each
object.
• Evaluation metrics—Similar to accuracy,
precision, and recall metrics in image
classification tasks (see chapter 4), object
detection systems have their own metrics to
evaluate their detection performance. In this
section, we will explain the most popular
metrics, like mean average precision (mAP),
precision-recall curve (PR curve), and
intersection over union (IoU).
Region proposals
• RoIs are regions that the system believes have a high likelihood of containing an object; this likelihood is expressed as an objectness score (figure).
• Regions with high objectness scores are passed
to the next steps; regions with low scores are
abandoned.
• The important thing to note is that this step
produces a lot (thousands) of bounding boxes to
be further analyzed and classified by the network.
During this step, the network analyzes these
regions in the image and classifies each region as
foreground (object) or background (no object)
based on its objectness score. If the objectness
score is above a certain threshold, then this region
is considered a foreground and pushed forward in
the network. Note that this threshold is
configurable based on your problem.
• If the threshold is too low, your network will
exhaustively generate all possible proposals, and
you will have a better chance of detecting all
objects in the image.
• On the flip side, this is very computationally
expensive and will slow down detection. So, the
trade-off with generating region proposals is the
number of regions versus computational
complexity—and the right approach is to use
problem-specific information to reduce the
number of RoIs.
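To make the threshold trade-off concrete, here is a minimal sketch of objectness-based proposal filtering. The (box, score) proposal format and the 0.7 threshold are illustrative assumptions, not any specific framework's API:

```python
# Minimal sketch: keep only proposals whose objectness score clears a threshold.
def filter_proposals(proposals, objectness_threshold=0.7):
    """proposals: list of (box, score) pairs, where box = (x, y, w, h).
    Returns only the regions classified as foreground."""
    return [(box, score) for box, score in proposals
            if score >= objectness_threshold]

# Example: three candidate regions; only the confident ones survive.
candidates = [((10, 10, 50, 80), 0.92),   # likely an object
              ((200, 40, 30, 30), 0.15),  # background
              ((90, 60, 40, 70), 0.71)]
print(filter_proposals(candidates))  # keeps the 0.92 and 0.71 boxes
```

Raising the threshold shrinks the candidate set (faster, but objects may be missed); lowering it has the opposite effect, as discussed above.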
Network predictions
• This component uses a pretrained CNN to extract features from the input image that are representative of the task at hand, and then uses these features to determine the class of each region. In object detection frameworks, people typically use pretrained image classification models to extract visual features, as these tend to generalize fairly well.
• For example, a model trained on the MS COCO or
ImageNet dataset is able to extract fairly generic
features.
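As an illustration of reusing a pretrained classifier as a feature extractor, here is a hedged sketch assuming PyTorch and torchvision are installed; the ResNet-50 backbone is an illustrative choice, not the one any particular detector mandates:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained classifier (illustrative backbone choice).
backbone = models.resnet50(pretrained=True)
# Drop the final fully connected (classification) layer, keeping the
# convolutional feature extractor and the global average pooling.
feature_extractor = nn.Sequential(*list(backbone.children())[:-1])
feature_extractor.eval()

image_batch = torch.randn(1, 3, 224, 224)        # stand-in for a real image
with torch.no_grad():
    features = feature_extractor(image_batch)    # shape: (1, 2048, 1, 1)
print(features.flatten(1).shape)                 # (1, 2048) feature vector
```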
• In this step, the network analyzes all the
regions that have been identified as having a
high likelihood of containing an object and
makes two predictions for each region:
• Bounding-box prediction—The coordinates that
locate the box surrounding the object. The
bounding box coordinates are represented as the
tuple (x, y, w, h), where x and y are the
coordinates of the center point of the bounding
box and w and h are the width and height of the
box.
• Class prediction—The classic softmax function that predicts the class probability for each object (both predictions are illustrated in the sketch below).
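The following sketch illustrates both predictions: converting an (x, y, w, h) center-format box to corner coordinates, and applying softmax to raw class scores. The class names and all numbers are made up for illustration:

```python
import math

def center_to_corners(x, y, w, h):
    """Convert a (center-x, center-y, width, height) box to
    (x_min, y_min, x_max, y_max) corner coordinates."""
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

def softmax(logits):
    """Classic softmax: turn raw class scores into probabilities."""
    exps = [math.exp(v - max(logits)) for v in logits]  # subtract max for stability
    total = sum(exps)
    return [v / total for v in exps]

# Example: a box centered at (50, 40) with size 20x30, plus raw scores
# for three hypothetical classes (dog, cat, background).
print(center_to_corners(50, 40, 20, 30))   # (40.0, 25.0, 60.0, 55.0)
print(softmax([2.0, 0.5, 0.1]))            # ~[0.73, 0.16, 0.11]
```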
• Since thousands of regions are proposed, each object will typically have multiple bounding boxes surrounding it with the correct classification.
• For example, take a look at the image of the dog in figure 7.3. The network was clearly able to find the object (dog) and successfully classify it. But the detection fired a total of five times because the dog was present in five of the RoIs produced in the previous step: hence the five bounding boxes around the dog in the figure. Although the detector was able to successfully locate the dog in the image and classify it correctly, this is not exactly what we need. For most problems, we need just one bounding box for each object.
• In some problems, we want only the single box that best fits the object. What if we are building a system to count dogs in an image? Our current system would count five dogs. We don't want that. This is where the non-maximum suppression technique comes in handy.
Non-maximum suppression (NMS)
• As you can see in figure 7.4, one of the problems of an
object detection algorithm is that it may find multiple
detections of the same object. So, instead of creating
only one bounding box around the object, it draws
multiple boxes for the same object.
• NMS is a technique that makes sure the detection
algorithm detects each object only once. As the name
implies, NMS looks at all the boxes surrounding an
object to find the box that has the maximum prediction
probability, and it suppresses or eliminates the other
boxes (hence the name).
• The general idea of NMS is to reduce the
number of candidate boxes to only one
bounding box for each object.
• For example, if the object in the frame is fairly
large and more than 2,000 object proposals
have been generated, it is quite likely that
some of them will have significant overlap
with each other and the object.
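A minimal sketch of greedy NMS is shown below, assuming corner-format (x_min, y_min, x_max, y_max) boxes with confidence scores; the 0.5 overlap threshold and the example boxes are illustrative:

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop every remaining box
    that overlaps it too much, and repeat with the survivors."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Example: three overlapping detections of one dog; NMS keeps only the best.
boxes = [(40, 25, 60, 55), (42, 27, 61, 56), (39, 24, 59, 54)]
scores = [0.9, 0.75, 0.6]
print(nms(boxes, scores))  # [0] -- only the highest-scoring box survives
```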
Object-detector evaluation metrics
• FRAMES PER SECOND (FPS) TO MEASURE
DETECTION SPEED
• The most common metric used to measure
detection speed is the number of frames per
second (FPS). For example, Faster R-CNN
operates at only 7 FPS, whereas SSD operates at
59 FPS. In benchmarking experiments, you will
see the authors of a paper state their network
results as: “Network X achieves mAP of Y% at Z
FPS,” where X is the network name, Y is the mAP
percentage, and Z is the FPS.
• MEAN AVERAGE PRECISION (MAP) TO
MEASURE NETWORK PRECISION
• The most common evaluation metric used in object recognition tasks is mean average precision (mAP). It is a percentage from 0 to 100, where higher values are better, but its calculation differs from the accuracy metric used in classification.
• To understand how mAP is calculated, you first need to
understand intersection over union (IoU) and the
precision-recall curve (PR curve). Let’s explain IoU
and the PR curve and then come back to mAP.
• INTERSECTION OVER UNION (IOU)
• This measure evaluates the overlap between two
bounding boxes: the ground truth bounding box
(Bground truth) and the predicted bounding box
(Bpredicted). By applying the IoU, we can tell
whether a detection is valid (True Positive) or not
(False Positive).
• Figure 7.5 illustrates the IoU between a ground
truth bounding box and a predicted bounding box.
• The IoU value ranges from 0 (no overlap at all) to 1 (the two bounding boxes overlap completely). The higher the overlap between the two bounding boxes (the IoU value), the better (figure 7.6).
• To calculate the IoU of a prediction, we need the following:
• The ground truth bounding box (Bground truth): the hand-labeled bounding box created during the labeling process
• The predicted bounding box (Bpredicted) from our model
• We calculate IoU by dividing the area of overlap by the area of union, as in the following equation:

$$\text{IoU} = \frac{\text{area of overlap}}{\text{area of union}} = \frac{|B_{\text{ground truth}} \cap B_{\text{predicted}}|}{|B_{\text{ground truth}} \cup B_{\text{predicted}}|}$$
• IoU is used to define a correct prediction,
meaning a prediction (True Positive) that has an
IoU greater than some threshold. This threshold is
a tunable value depending on the challenge, but
0.5 is a standard value. For example, some challenges, like Microsoft COCO, use mAP@0.5 (IoU threshold of 0.5) or mAP@0.75 (IoU threshold of 0.75). If the IoU value is above this threshold, the prediction is considered a True Positive (TP); if it is below the threshold, it is considered a False Positive (FP).
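Here is a small worked example applying the IoU formula and the 0.5 threshold; the two boxes are made up for illustration:

```python
# Worked example with made-up boxes in (x_min, y_min, x_max, y_max) format.
ground_truth = (0, 0, 10, 10)     # hand-labeled box, area 10 * 10 = 100
predicted    = (5, 5, 15, 15)     # model's box, also area 100

# Intersection: the region from (5, 5) to (10, 10) -> 5 * 5 = 25
intersection = 25
union = 100 + 100 - intersection  # 175
iou = intersection / union        # 25 / 175 ~= 0.143

print(f"IoU = {iou:.3f}")         # 0.143 < 0.5 threshold -> False Positive
```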
PRECISION-RECALL CURVE (PR
CURVE)
• With TP and FP defined, we can now calculate the precision and recall of our detection for a given class across the testing dataset. As explained in chapter 4, we calculate the precision and recall as follows (recall that FN stands for False Negative):

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$
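As a quick illustration, the following sketch computes precision and recall from assumed TP/FP/FN counts, tallied for one class at a fixed IoU threshold; the counts are made up:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts for a single class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Example: 8 correct detections, 2 spurious ones, 4 missed objects.
p, r = precision_recall(tp=8, fp=2, fn=4)
print(f"precision = {p:.2f}, recall = {r:.2f}")  # 0.80 and 0.67
```

Sweeping the detector's confidence threshold and plotting these two values against each other yields the PR curve.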
Region-based convolutional neural
networks (R-CNNs)
• The R-CNN family of object detection techniques, usually referred to as R-CNNs (short for region-based convolutional neural networks), was developed by Ross Girshick et al. in 2014.
• The R-CNN family expanded to include Fast R-CNN and Faster R-CNN in 2015 and 2016, respectively.
R-CNN
• R-CNN is the least sophisticated region-based architecture in its family, but it is the basis for understanding how the whole family of object-detection algorithms works.
• It was one of the first large, successful
applications of convolutional neural networks
to the problem of object detection and
localization, and it paved the way for the other
advanced detection algorithms.
