
ULC665: Deep Learning

Convolutional Neural Networks (CNN) for Object Detection

(DL, Dr. Ashish Gupta) ULC665 : Introduction 1 / 22


Agenda

■ What is Object Recognition?


■ R-CNN Model Family
■ YOLO Model Family



Key Terms / Tasks

■ Object recognition: a general term describing a collection of related computer vision tasks that involve identifying objects in digital photographs.
• Image classification typically involves predicting the class of one object in an image.
• Object localization refers to identifying the location of one or more objects in an image and drawing a bounding box around their extent.
• Object detection combines these two tasks: it localizes and classifies one or more objects in an image.



Classification

■ Image Classification: Predict the type or class of an object in an image.
• Input: An image with a single object, such as a photograph.
• Output: A class label (e.g. one or more integers that are mapped to class labels).



■ Object Localization: Locate the presence of objects in an image and
indicate their location with a bounding box.
• Input: An image with one or more objects, such as a photograph.
• Output: One or more bounding boxes (e.g. defined by a point,
width, and height).



Object Detection

■ Object Detection: Locate the presence of objects with bounding boxes and predict the type or class of each located object in an image.
• Input: An image with one or more objects.
• Output: Algorithms produce a list of object categories present in
the image along with an axis-aligned bounding box (e.g. defined
by a point, width, and height) indicating the position and scale of
every instance of each object category.
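As a concrete sketch (not from the slides), a detection can be represented as a class label and score plus an axis-aligned box given by a corner point, width, and height. A standard way to compare two such boxes is intersection over union (IoU):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x, y, w, h), where (x, y) is the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Corners of the intersection rectangle
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A detector's output: one entry per detected object instance
detections = [
    {"label": "dog", "box": (10, 10, 40, 40), "score": 0.92},
    {"label": "cat", "box": (30, 30, 40, 40), "score": 0.81},
]
print(iou(detections[0]["box"], detections[1]["box"]))  # 400/2800 ≈ 0.143
```

The labels and scores above are illustrative; the point is that a detection couples a class with a box, unlike plain classification.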



Object Detection Datasets



Dense prediction tasks: Image segmentation and its types

■ Object segmentation: instances of recognized objects are indicated by highlighting the specific pixels of the object instead of a coarse bounding box.
■ Object proposal models aim to produce a small set, typically a few hundred or a few thousand, of overlapping candidate object bounding boxes or region proposals.
■ Salient object detection: detecting and accurately segmenting the most salient object regions in the image.
■ Fixation prediction models typically try to predict where humans
look, i.e., a small set of fixation points.



■ Semantic segmentation aims at accurately partitioning each object region from the background region, i.e., it not only locates all the target objects but also accurately delineates their boundaries.
■ Instance segmentation aims to detect each object as an individual in the image.
■ Panoptic segmentation has the most demanding goal: it assigns both a semantic label and an instance label to each pixel.
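A toy illustration (not from the slides) of how these label maps differ, on a tiny 2 × 4 "image" containing two separate cat objects:

```python
# Semantic segmentation: one class label per pixel; both cats share "cat".
semantic = [
    ["cat", "cat", "bg",  "cat"],
    ["cat", "bg",  "bg",  "cat"],
]
# Instance segmentation: each object instance gets its own id
# (background pixels carry no instance id).
instance = [
    [1, 1, None, 2],
    [1, None, None, 2],
]
# Panoptic segmentation: every pixel gets BOTH a semantic label and an
# instance id (background counts as "stuff" with no instance).
panoptic = [[(semantic[r][c], instance[r][c]) for c in range(4)]
            for r in range(2)]
print(panoptic[0][0])  # ('cat', 1)
print(panoptic[0][3])  # ('cat', 2): same class, different instance
```

Note how semantic labels alone cannot separate the two cats, while panoptic labels distinguish them and still cover every pixel.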



Overview of Object Recognition Computer Vision Tasks

Object recognition refers to a suite of challenging computer vision tasks of increasing difficulty: image classification, object localization, and object detection.



R-CNN Model Family

■ The R-CNN family of methods refers to
• Regions with CNN Features, or
• Region-Based Convolutional Neural Networks (Ross Girshick, et al., 2014)
■ Includes the techniques
• R-CNN,
• Fast R-CNN, and
• Faster R-CNN,
designed and demonstrated for object localization and object recognition.



R-CNN

■ First large and successful application of CNNs to the problem of object localization, detection, and segmentation.
■ Their proposed R-CNN model is comprised of three modules:
• Region Proposal: Generate and extract category-independent region proposals, e.g. candidate bounding boxes. An algorithm called selective search is used to propose candidate regions or bounding boxes of potential objects in the image.



R-CNN

■ Their proposed R-CNN model is comprised of three modules:
• Region Proposal
• Feature Extractor: Extract features from each candidate region, e.g. using the AlexNet deep CNN. The output of AlexNet is a 4096-element feature vector that is fed to the next stage.
• Classifier: Classify features as one of the known classes, e.g. a linear SVM classifier model.

Downside: it is slow, requiring a CNN-based feature-extraction pass on each of the candidate regions generated by the region proposal algorithm.
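The three-module flow can be sketched as follows. All three components here are stand-ins (a real system would use selective search, an AlexNet-style CNN, and per-class linear SVMs), but the control flow shows why R-CNN is slow: one full feature-extraction pass per region.

```python
def propose_regions(image):
    """Stand-in for selective search: return candidate boxes (x, y, w, h)."""
    return [(0, 0, 32, 32), (16, 16, 32, 32)]

def extract_features(image, box):
    """Stand-in for the CNN: R-CNN produces a 4096-dim feature vector per crop."""
    return [0.0] * 4096

def classify(features):
    """Stand-in for the per-class SVMs: return (label, score)."""
    return ("dog", 0.9)

def rcnn_detect(image):
    detections = []
    # The key cost of R-CNN: the CNN runs once PER proposed region.
    for box in propose_regions(image):
        features = extract_features(image, box)
        label, score = classify(features)
        detections.append({"box": box, "label": label, "score": score})
    return detections

dets = rcnn_detect(image="dummy")
print(len(dets))  # one detection per proposed region
```

With thousands of proposals per image, that per-region loop is exactly what Fast R-CNN removes.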
Fast R-CNN

■ Limitations of R-CNN
• Training is a multi-stage pipeline. Involves the preparation and
operation of three separate models.
• Training is expensive in space and time. Training a deep CNN on
so many region proposals per image is very slow.
• Object detection is slow. Making predictions using a deep CNN on so many region proposals is very slow.
■ Fast R-CNN is proposed as a single model instead of a pipeline to learn
and output regions and classifications directly.



Fast R-CNN

• Input: An image and a set of region proposals.
• The image is passed through a deep CNN; a pre-trained CNN, such as VGG-16, is used for feature extraction.
• The end of the deep CNN is a custom layer called a Region of Interest Pooling layer, or RoI Pooling, that extracts features specific to a given input candidate region.
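A minimal NumPy sketch of RoI max pooling, assuming the region is already given in integer feature-map coordinates (real implementations also quantize fractional coordinates and scale image-space boxes down to the feature map):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool the region roi = (x1, y1, x2, y2), in integer feature-map
    coordinates, into a fixed output_size grid."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    oh, ow = output_size
    out = np.empty(output_size, dtype=feature_map.dtype)
    for i in range(oh):
        for j in range(ow):
            # Split the region into roughly equal bins; take the max of each
            r0, r1 = (i * h) // oh, max(((i + 1) * h) // oh, (i * h) // oh + 1)
            c0, c1 = (j * w) // ow, max(((j + 1) * w) // ow, (j * w) // ow + 1)
            out[i, j] = region[r0:r1, c0:c1].max()
    return out

fmap = np.arange(36, dtype=np.float32).reshape(6, 6)
pooled = roi_pool(fmap, roi=(0, 0, 6, 6))
print(pooled.shape)  # (2, 2), regardless of the input region's size
```

The fixed output size is the point: regions of any shape become a constant-size feature tensor, so the downstream fully connected layers can run on every proposal from a single shared feature map.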



RoI Pooling in Fast R-CNN

The model is significantly faster to train and to make predictions, yet still
requires a set of candidate regions to be proposed along with each input
image.
Faster R-CNN

■ Although it is a single unified model, the architecture is comprised of two modules:
• Region Proposal Network (RPN): A CNN-based architecture to both propose and refine region proposals as part of the training process.
• Fast R-CNN: These regions are then used in concert with a Fast R-CNN model in a single model design, extracting features from the proposed regions and outputting the bounding boxes and class labels.
■ Both modules operate on the same output of a deep CNN.
■ The region proposal network acts as an attention mechanism for the
Fast R-CNN network, informing the second network of where to look
or pay attention.
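The RPN scores and refines a fixed set of anchor boxes placed at every feature-map location. A sketch of anchor generation (the stride, scales, and aspect ratios below are illustrative defaults, not necessarily the paper's exact settings):

```python
def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Place scales x ratios anchor boxes (x, y, w, h) at each cell of a
    feat_h x feat_w feature map, in input-image coordinates."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Anchor center in input-image coordinates
            cx = col * stride + stride // 2
            cy = row * stride + stride // 2
            for s in scales:
                for r in ratios:
                    # Keep area ~ s*s while setting aspect ratio w/h = r
                    w, h = s * r ** 0.5, s / r ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2, w, h))
    return anchors

anchors = make_anchors(feat_h=4, feat_w=4)
print(len(anchors))  # 4 * 4 cells * (3 scales * 3 ratios) = 144 anchors
```

The RPN's outputs are then just per-anchor objectness scores and box offsets, which is what makes proposal generation a learned, nearly free part of the same forward pass.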



Feature Pyramid networks (FPN)

■ Most DL-based detectors run detection only on the feature maps of the network's top layer.
■ Although the features in deeper layers of a CNN are beneficial for category recognition, they are not conducive to localizing objects.
■ FPN
• leverages a ConvNet's pyramidal feature hierarchy, which has semantics from low to high levels, and
• builds a feature pyramid with high-level semantics throughout.
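One top-down merge step of FPN can be sketched as: upsample the coarser, semantically stronger map 2x and add it to the finer map after a lateral projection. Here the "1 x 1 conv" lateral projection is reduced to a scalar weight for illustration; a real FPN learns a 1 x 1 convolution per level.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a 2-D feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_merge(finer, coarser, lateral_weight=1.0):
    """Merge a coarse top-down map into the next finer level."""
    return lateral_weight * finer + upsample2x(coarser)

c4 = np.ones((2, 2))       # coarse, semantically strong map
c3 = np.full((4, 4), 0.5)  # finer, spatially more precise map
p3 = fpn_merge(c3, c4)
print(p3.shape)  # (4, 4): high-level semantics carried down to fine resolution
```

Repeating this step level by level yields a pyramid where every resolution carries high-level semantics, so detection can run on the level best matched to each object's scale.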



You only look once (YOLO)

The approach involves a single neural network trained end-to-end that takes an image as input and directly predicts bounding boxes and a class label for each bounding box.

1. The model works by first splitting the input image into a grid of cells, where each cell is responsible for predicting a bounding box if the center of a bounding box falls within it.
2. Each grid cell predicts a bounding box involving the x, y coordinates, the width and height, and the confidence. A class prediction is also based on each cell.
3. For example, an image may be divided into a 7 × 7 grid and each cell in the grid may predict 2 bounding boxes, resulting in 98 proposed bounding box predictions.
4. The class probability map and the bounding boxes with confidences are then combined into a final set of bounding boxes and class labels.
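The output sizes implied by the 7 × 7 example work out as follows, taking C = 20 classes as in the original YOLO model trained on PASCAL VOC:

```python
# S x S grid; each cell predicts B boxes of (x, y, w, h, confidence)
# plus C class probabilities shared across the cell's boxes.
S, B, C = 7, 2, 20

boxes_per_image = S * S * B
values_per_cell = B * 5 + C
output_size = S * S * values_per_cell

print(boxes_per_image)  # 98 proposed bounding boxes, matching the slide
print(output_size)      # 7 * 7 * 30 = 1470 predicted values per image
```

A single forward pass producing this one fixed-size tensor is what makes YOLO fast compared with the per-region passes of the R-CNN family.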



