
Lect-7 Segmentation Localization

Uploaded by

maimoonaziz2003

Deep Learning

CS-878
Week-07
Image Classification

(assume given a set of possible labels)


{dog, cat, truck, plane, ...}

cat

This image by Nikita is licensed under CC-BY 2.0

Fei-Fei Li, Ehsan Adeli 2


Computer Vision Tasks

Classification: CAT (No spatial extent)
Semantic Segmentation: GRASS, CAT, TREE, SKY (No objects, just pixels)
Object Detection: DOG, DOG, CAT (Multiple Object)
Instance Segmentation: DOG, DOG, CAT (Multiple Object)

This image is CC0 public domain




Semantic Segmentation

Assign a label to each pixel in an image:


• Pixel-level image annotation/analysis (vs. object-level analysis)

Common datasets: PASCAL VOC (2012) and MSCOCO


Semantic Segmentation

▪ A key part of Scene Understanding


▪ Applications
▪ Autonomous navigation
▪ Assisting the partially sighted
▪ Medical diagnosis
▪ Image editing
Segmentation Tasks

A sample image from the PASCAL VOC2011 dataset

Original (input) Image Semantic (class) Segmentation

Image Source: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/segexamples/index.html


Semantic Segmentation

With semantic segmentation, we are interested in the precise location of the object.

For example, for a 16x16 image, we would get 256 outputs arranged in a 16x16 matrix.

These outputs tell us which pixel belongs to which class.

For just one object in an image, a pixel could belong either to the object or to the background.
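
The 16x16 example can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the array `scores` is hypothetical random data standing in for a network's per-pixel class scores, with class 0 as background and class 1 as the object.

```python
import numpy as np

# Hypothetical per-pixel class scores for a 16x16 image and 2 classes
# (index 0 = background, index 1 = object). Random values stand in for
# the output of a real segmentation network.
rng = np.random.default_rng(0)
scores = rng.normal(size=(2, 16, 16))  # shape: (classes, height, width)

# The predicted label of each pixel is the class with the highest score.
label_map = scores.argmax(axis=0)      # shape: (16, 16), values in {0, 1}

print(label_map.shape, label_map.size)
```

The result is one label per pixel, i.e. 256 outputs arranged in a 16x16 matrix.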
Semantic Segmentation: The Problem

Paired training data: for each training image, each pixel is labeled with a semantic category (GRASS, CAT, TREE, SKY, ...).

At test time, classify each pixel of a new image.

Lecture 11 - April 30, 2024


Semantic Segmentation Idea: Sliding Window

Full image

Impossible to classify without context

Q: how do we include context?

Q: how do we model this?



Semantic Segmentation

One straightforward strategy is to modify our classification network:
run the classification network for each pixel in the image using a sliding window.

Problems with training and inference: this needs one forward pass per pixel, and nothing is shared between overlapping windows.

Another strategy is to modify our classification network by keeping the feature map size the same throughout the network.

Problems in training and inference persist: convolutions at full image resolution are expensive in compute and memory.


Fully Convolutional networks (FCN)

J. Long, E. Shelhamer, and T. Darrell, Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
Fully "Convolutional" networks (FCN)
• Reuse networks pre-trained for classification (VGG, AlexNet, etc.) for segmentation!

• Re-interpret the fully-connected layers as convolutional layers.

• Use skip connections ("skip layers") to improve the segmentation accuracy.


Fully Convolutional networks

Interpret the FC layers as conv layers.
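
A minimal NumPy sketch of this re-interpretation follows. The sizes (`C`, `H`, `W`, `K`) and the helper `conv_valid` are assumptions made for illustration, not taken from the FCN paper. The sketch reshapes the weights of one fully-connected layer into `K` filters that cover the whole feature map, checks that both give the same scores, and then applies the same filters to a larger feature map.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, K = 8, 7, 7, 10               # assumed sizes: channels, height, width, classes
W_fc = rng.normal(size=(K, C * H * W))  # weights of a fully-connected layer

def fc(features):
    """Fully-connected layer applied to a flattened C x H x W feature map."""
    return W_fc @ features.reshape(-1)

# The same weights viewed as K convolution filters of size C x H x W.
W_conv = W_fc.reshape(K, C, H, W)

def conv_valid(features, filters):
    """Naive stride-1 convolution without padding (cross-correlation)."""
    k, _, fh, fw = filters.shape
    _, h, w = features.shape
    out = np.zeros((k, h - fh + 1, w - fw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = features[:, i:i + fh, j:j + fw]
            out[:, i, j] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

x = rng.normal(size=(C, H, W))
same = np.allclose(fc(x), conv_valid(x, W_conv)[:, 0, 0])

# On a larger input the conv view yields a spatial map of class scores.
bigger = conv_valid(rng.normal(size=(C, 9, 10)), W_conv)
print(same, bigger.shape)
```

The point of the design is that the convolutional view no longer fixes the input size: a larger feature map simply produces a coarse grid of class scores instead of a single vector.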

FCN

Figure panels: No skip connection | 1-skip connection | 2-skip connections
In-Network upsampling: “Unpooling”

Nearest Neighbor

Input: 2 x 2        Output: 4 x 4
1 2                 1 1 2 2
3 4                 1 1 2 2
                    3 3 4 4
                    3 3 4 4

“Bed of Nails”

Input: 2 x 2        Output: 4 x 4
1 2                 1 0 2 0
3 4                 0 0 0 0
                    3 0 4 0
                    0 0 0 0
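
Both fixed unpooling schemes can be written directly in NumPy. The sketch below uses the 2 x 2 input from the slide; the variable names are chosen for illustration.

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# Nearest neighbor: copy every input value into a 2x2 block of the output.
nearest = x.repeat(2, axis=0).repeat(2, axis=1)

# "Bed of nails": put every input value in the top-left corner of its
# 2x2 block and fill the rest of the block with zeros.
bed_of_nails = np.zeros((4, 4), dtype=x.dtype)
bed_of_nails[::2, ::2] = x

print(nearest)
print(bed_of_nails)
```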


In-Network upsampling: “Max Unpooling”

Max Pooling: Remember which element was max!

Input: 4 x 4        Output: 2 x 2
1 2 6 3             5 6
3 5 2 1             7 8
1 2 2 1
7 3 4 8

… Rest of the network …

Max Unpooling: Use positions from pooling layer

Input: 2 x 2        Output: 4 x 4
1 2                 0 0 2 0
3 4                 0 1 0 0
                    0 0 0 0
                    3 0 0 4

Corresponding pairs of downsampling and upsampling layers
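
The pairing of max pooling and max unpooling can be sketched as follows, assuming 2 x 2 windows with stride 2 and even input sizes. The function names are hypothetical; the example values are the ones shown on the slide.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2; also returns the argmax inside each window."""
    h, w = x.shape
    blocks = (x.reshape(h // 2, 2, w // 2, 2)
               .transpose(0, 2, 1, 3)
               .reshape(h // 2, w // 2, 4))
    return blocks.max(axis=2), blocks.argmax(axis=2)

def max_unpool_2x2(y, idx):
    """Place each value of y at the position remembered from pooling; zeros elsewhere."""
    h, w = y.shape
    out = np.zeros((2 * h, 2 * w), dtype=y.dtype)
    rows = 2 * np.arange(h)[:, None] + idx // 2   # row offset inside the window
    cols = 2 * np.arange(w)[None, :] + idx % 2    # column offset inside the window
    out[rows, cols] = y
    return out

x = np.array([[1, 2, 6, 3],
              [3, 5, 2, 1],
              [1, 2, 2, 1],
              [7, 3, 4, 8]])
pooled, positions = max_pool_2x2(x)

# Later in the network, a 2x2 map is unpooled with the remembered positions.
y = np.array([[1, 2],
              [3, 4]])
unpooled = max_unpool_2x2(y, positions)
print(pooled)
print(unpooled)
```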


Learnable Upsampling

Recall: Normal 3 x 3 convolution, stride 1 pad 1

Input: 4 x 4    Output: 4 x 4

Dot product between filter and input

Recall: Normal 3 x 3 convolution, stride 2 pad 1

Input: 4 x 4    Output: 2 x 2

Dot product between filter and input

Filter moves 2 pixels in the input for every one pixel in the output

Stride gives ratio between movement in input and output

We can interpret strided convolution as “learnable downsampling”.
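
The two output sizes follow from the usual convolution size formula, floor((n + 2 * pad - kernel) / stride) + 1. The helper below is a small sketch with a hypothetical name, evaluated on the two cases from the slides.

```python
def conv_output_size(n, kernel, stride, pad):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * pad - kernel) // stride + 1

stride1 = conv_output_size(4, kernel=3, stride=1, pad=1)  # 4 x 4 input stays 4 x 4
stride2 = conv_output_size(4, kernel=3, stride=2, pad=1)  # 4 x 4 input becomes 2 x 2
print(stride1, stride2)
```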


Learnable Upsampling: Transposed Convolution

3 x 3 transposed convolution, stride 2 pad 1

Input: 2 x 2    Output: 4 x 4

Input gives weight for filter

Filter moves 2 pixels in the output for every one pixel in the input

Stride gives ratio between movement in output and input

Sum where output overlaps


Learnable Upsampling: 1D Example

Input: (a, b)    Filter: (x, y, z)    Output: (ax, ay, az + bx, by, bz)

Output contains copies of the filter weighted by the input, summing where they overlap in the output
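
The 1D example can be reproduced with a short loop. This is a sketch assuming stride 2 and no padding or cropping; the function name and the numeric values chosen for a, b, x, y, z are only for illustration.

```python
import numpy as np

def transposed_conv1d(inp, filt, stride):
    """Add a copy of the filter, scaled by each input value, at steps of `stride`."""
    inp = np.asarray(inp, dtype=float)
    filt = np.asarray(filt, dtype=float)
    out = np.zeros(stride * (len(inp) - 1) + len(filt))
    for i, value in enumerate(inp):
        out[i * stride:i * stride + len(filt)] += value * filt
    return out

a, b = 2.0, 3.0               # input
x, y, z = 1.0, 10.0, 100.0    # filter
result = transposed_conv1d([a, b], [x, y, z], stride=2)
print(result)  # (ax, ay, az + bx, by, bz)
```

The middle entry is the only place where the two scaled copies of the filter overlap, so it is the only sum.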


Convolution as Matrix Multiplication (1D Example)

We can express convolution in terms of a matrix multiplication

Example: 1D conv, kernel size=3, stride=2, padding=1

Transposed convolution multiplies by the transpose of the same matrix:

Example: 1D transposed conv, kernel size=3, stride=2, padding=0
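
The matrices themselves did not survive in this text, so the sketch below rebuilds them under stated assumptions: a filter (x, y, z) with illustrative numeric values, a length-4 input zero-padded by 1 on each side, and a hypothetical helper `conv1d_matrix` that writes the filter into each row, shifted by the stride.

```python
import numpy as np

def conv1d_matrix(filt, n_padded, stride):
    """Matrix X such that X @ padded_input is a strided 1D convolution."""
    k = len(filt)
    n_out = (n_padded - k) // stride + 1
    X = np.zeros((n_out, n_padded))
    for r in range(n_out):
        X[r, r * stride:r * stride + k] = filt
    return X

filt = np.array([1.0, 2.0, 3.0])                 # (x, y, z)
signal = np.array([4.0, 5.0, 6.0, 7.0])          # (a, b, c, d)
padded = np.pad(signal, 1)                        # padding=1 -> (0, a, b, c, d, 0)

X = conv1d_matrix(filt, len(padded), stride=2)    # shape (2, 6)
down = X @ padded                                 # strided conv: 4 values -> 2 values

# Transposed convolution: multiply by the transpose of the same matrix.
up = X.T @ np.array([10.0, 20.0])                 # 2 values -> 6 values
print(X)
print(down, up)
```

With stride 2 the transpose is no longer an ordinary convolution matrix, which is why transposed convolution is treated as its own operation.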


Deconvolution Network for Semantic Segmentation

H. Noh, S. Hong, and B. Han, Learning Deconvolution Network for Semantic Segmentation, ICCV 2015
Figure panels, in order: Input image; 14 × 14 deconvolutional layer; 28 × 28 unpooling layer; 28 × 28 deconvolutional layer; 56 × 56 unpooling layer; 56 × 56 deconvolutional layer; 112 × 112 unpooling layer; 112 × 112 deconvolutional layer; 224 × 224 unpooling layer; 224 × 224 deconvolutional layer

Image source: H. Noh, S. Hong, and B. Han, Learning Deconvolution Network for Semantic Segmentation, ICCV 2015
Learned upsampling architectures

SegNet

Uses VGG architecture!
No FC layer!

Image source: http://mi.eng.cam.ac.uk/projects/segnet/
V. Badrinarayanan et al., A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling, 2015
U-Net

Source: Olaf Ronneberger, Philipp Fischer, Thomas Brox “U-Net: Convolutional Networks for Biomedical Image Segmentation”, MICCAI, 2015
Object detection
Object Recognition

• Problem: Given an image A, does A contain an image of a person?

YES
Object localization
Human Detection

• UAV images
• Surveillance images
Object detection

• Multiple objects
Why detection?

• Self-driving car
A simple solution
Sliding window

• Slide a box around image and classify each crop

• There are many interesting papers
• Viola and Jones (2001): Face detector (22K citations)
• Dalal and Triggs (2005): HOG (33K citations)
Naïve approach: Template Matching

Find the chair in this image Output of correlation

This is a chair
Template Matching

Find the chair in this image

Epic fail!
Simple template matching is not going to make it
Sliding window

• General approach
• Scan all possible locations
• Extract features
• Classify features
• Post-processing
Evaluation

• Intersection over union


Sample IoU
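
Intersection over union is the area of overlap of two boxes divided by the area of their union. A minimal sketch, assuming boxes are given as corner coordinates (x1, y1, x2, y2) with x1 < x2 and y1 < y2; the slides do not fix a box format, so this convention is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))   # intersection 1, union 7
disjoint = iou((0, 0, 1, 1), (2, 2, 3, 3))  # no overlap
print(overlap, disjoint)
```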
Evaluation

• True positives
• False positives
• False negatives
• Only one is correct


Evaluation

• Precision
• Precision is the ability of a model to identify only the relevant objects.
• It is the percentage of correct positive predictions and is given by:
  Precision = TP / (TP + FP) = TP / (all detections)

• Recall
• Recall is the ability of a model to find all the relevant cases.
• It is the percentage of true positives detected among all relevant ground truths and is given by:
  Recall = TP / (TP + FN) = TP / (all ground truths)
Evaluation

• Sort all predicted boxes (for all images) according to scores
• For each position k in the sorted list
• Compute recall and precision over the top-k predictions

https://github.com/rafaelpadilla/Object-Detection-Metrics
Average precision (AP)

mAP: average AP over multiple classes
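
One common way to turn the sorted list into a single number is the area under the precision-recall curve. The sketch below uses all-point interpolation, which is one of several conventions (PASCAL VOC originally used 11-point interpolation), so treat the exact definition as an assumption. It also assumes each detection has already been marked as a true or false positive, for example by IoU matching against the ground truth.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class (all-point interpolation)."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # highest score first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_ground_truth
    precision = cum_tp / (cum_tp + cum_fp)
    # Precision envelope: best precision achievable at this recall or higher.
    envelope = np.maximum.accumulate(precision[::-1])[::-1]
    recall_steps = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(recall_steps * envelope))

ap = average_precision(scores=[0.9, 0.8, 0.7],
                       is_true_positive=[1, 0, 1],
                       num_ground_truth=2)
print(ap)
```

mAP is then the mean of this value over all classes.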


Simple Recipe for Object Detection

Problem: Is there a turtle in this picture? If yes, localize!

Step 1: Train (or download) a classification model (AlexNet, VGG, GoogLeNet)
Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores → Softmax loss

Step 2: Attach new fully-connected “regression head” to the network.
“Classification head”: Final conv feature map → Fully-connected layers → Class scores
“Regression head”: Final conv feature map → Fully-connected layers → Box coordinates

Step 3: Train the regression head only using L2 loss
Final conv feature map → Fully-connected layers → Box coordinates → L2 loss

Step 4: At test time use both heads.
Final conv feature map → Fully-connected layers → Class scores
Final conv feature map → Fully-connected layers → Box coordinates
Object Detection: Single Object (Classification + Localization)

Image → Vector: 4096

Fully Connected: 4096 to 1000 → Class Scores (Cat: 0.9, Dog: 0.05, Car: 0.01, ...) → Softmax Loss (Correct label: Cat)

Fully Connected: 4096 to 4 → Box Coordinates (x, y, w, h) → L2 Loss (Correct box: (x’, y’, w’, h’))

Multitask Loss = Softmax Loss + L2 Loss

Treat localization as a regression problem!

This image is CC0 public domain
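
The multitask loss can be sketched as the sum of a softmax cross-entropy on the class scores and an L2 loss on the box coordinates. The function names and the weighting factor `box_weight` are assumptions for illustration; the slide only shows the two losses being added.

```python
import numpy as np

def softmax_cross_entropy(scores, correct_class):
    """Softmax loss for one example, computed in a numerically stable way."""
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[correct_class])

def l2_loss(pred_box, true_box):
    """Squared L2 distance between predicted and correct (x, y, w, h)."""
    diff = np.asarray(pred_box, dtype=float) - np.asarray(true_box, dtype=float)
    return float(np.sum(diff ** 2))

def multitask_loss(scores, correct_class, pred_box, true_box, box_weight=1.0):
    # box_weight is a hypothetical knob for balancing the two terms.
    return softmax_cross_entropy(scores, correct_class) + box_weight * l2_loss(pred_box, true_box)

scores = np.zeros(3)  # three equally likely classes -> softmax loss of log(3)
loss = multitask_loss(scores, correct_class=0,
                      pred_box=(1.0, 0.0, 2.0, 2.0),
                      true_box=(0.0, 0.0, 2.0, 2.0))
print(loss)
```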
Detection as Regression?

DOG, (x, y, w, h)
CAT, (x, y, w, h)
CAT, (x, y, w, h)
DUCK, (x, y, w, h)
= 16 numbers

DOG, (x, y, w, h)
CAT, (x, y, w, h)
= 8 numbers

CAT, (x, y, w, h)
CAT, (x, y, w, h)
….
CAT, (x, y, w, h)
= many numbers

Need variable sized outputs


Detection as Classification

Window 1: CAT? NO    DOG? NO
Window 2: CAT? YES!  DOG? NO
Window 3: CAT? NO    DOG? YES!
Detection as Classification

Problem:
• Need to test many positions and scales
• Use a computationally demanding classifier (CNN)
• Search at different scales
• Search at different positions

Solution: Only look at a tiny subset of possible positions


Region Proposals

● Find “blobby” image regions that are likely to contain objects


● “Class-agnostic” object detector
● Look for “blob-like” regions
Region Proposals: Selective Search
Bottom-up segmentation, merging regions at multiple scales

Convert
regions
to boxes

Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013


R-CNN

Pipeline, from input to output:

Input image
Regions of Interest (RoI) from a proposal method (~2k)
Warped image regions (224x224 pixels)
Forward each region through ConvNet (ImageNet-pretrained)
Classify regions with SVMs
Bbox reg: predict “corrections” to the RoI: 4 numbers: (dx, dy, dw, dh)

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
R-CNN details

• Regions: uses ~2000 Selective Search proposals
• Network: uses AlexNet pre-trained on ImageNet (1000 classes), fine-tuned on PASCAL (21 classes)
• Final detector:
• first warp proposal regions,
• then extract fc7 network activations (4096 dimensions),
• finally, classify with linear SVM
• Bounding box regression is also used to refine box locations
R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014.
R-CNN Training (initialization)

Step 1: Train (or download) a classification model for ImageNet (AlexNet)

Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores (1000 classes) → Softmax loss
R-CNN Training (Fine-tuning)

Step 2: "Fine-tune" model for detection

- Instead of 1000 ImageNet classes, want 20 object classes + background
- Throw away final fully-connected layer, reinitialize from scratch
- Keep training model using positive / negative regions from detection images

Re-initialize this layer: was 4096 x 1000, now will be 4096 x 21

Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores (21 classes) → Softmax loss
R-CNN Training (feature extraction)

Step 3: Extract features

- Extract region proposals for all images
- For each region: warp to CNN input size, run forward through CNN, save pool5 features to disk
- Have a big hard drive: features are ~200GB for PASCAL dataset!

Image → Region Proposals → Crop + Warp → Forward pass (Convolution and Pooling) → pool5 features → Save to disk


R-CNN Training (train classifier)

Step 4: Train one binary SVM per class to classify region features

Training image regions → Cached region features

Cat SVM: positive samples and negative samples for cat
Dog SVM: positive samples and negative samples for dog
R-CNN Training (bounding box regression/prediction)

Step 5 (bbox regression): For each class, train a linear regression model to map from cached features to offsets to GT boxes, to make up for “slightly wrong” proposals

Training image regions → Cached region features → Regression targets (dx, dy, dw, dh), in normalized coordinates:

(0, 0, 0, 0): Proposal is good
(.25, 0, 0, 0): Proposal too far to left
(0, 0, -0.125, 0): Proposal too wide
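
A sketch of how such regression targets can be computed. It uses the parameterization from the R-CNN paper: centre offsets are divided by the proposal size and width/height use a log ratio. The slide's example numbers may come from a simpler normalization, so the log scale should be read as an assumption here. Boxes are (centre x, centre y, width, height), and the function names are hypothetical.

```python
import numpy as np

def regression_targets(proposal, ground_truth):
    """Targets (dx, dy, dw, dh) that map a proposal onto its ground-truth box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return np.array([(gx - px) / pw,      # shift in x, in units of proposal width
                     (gy - py) / ph,      # shift in y, in units of proposal height
                     np.log(gw / pw),     # log scale change of the width
                     np.log(gh / ph)])    # log scale change of the height

def apply_offsets(proposal, t):
    """Inverse mapping: correct a proposal with predicted offsets."""
    px, py, pw, ph = proposal
    return np.array([px + t[0] * pw, py + t[1] * ph,
                     pw * np.exp(t[2]), ph * np.exp(t[3])])

good = regression_targets((50, 50, 40, 40), (50, 50, 40, 40))
too_far_left = regression_targets((50, 50, 40, 40), (60, 50, 40, 40))
too_wide = regression_targets((50, 50, 40, 40), (50, 50, 35, 40))
print(good, too_far_left, too_wide)
```

A good proposal gives all-zero targets, a proposal that sits too far to the left gives a positive dx, and a proposal that is too wide gives a negative dw.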
Issue #1 with R-CNN

• Slow in run-time
• Multiple forward passes for each proposal
• There are thousands of proposals

• Solution
• Single forward pass for each image?
Issue #2 with R-CNN

• Separate classifier training


• CNN feature extractor is not trained with classifier and regressor

• Solution
• End-to-end training?
Issue #3 with R-CNN

• Complex training pipeline


• Proposals
• Feature extraction
• Classification

• Solution
• Single forward pass for each image?
Solution

• Fast R-CNN
• Single forward pass for each image
• No separate classifier
• End-to-end training
Fast R-CNN

Pipeline, from input to output (the figure shows “Slow” R-CNN alongside for comparison):

Input image
Run whole image through ConvNet (“Backbone” network: AlexNet, VGG, ResNet, etc) to get “conv5” features
Regions of Interest (RoIs) from a proposal method
Crop + Resize features
Per-Region Network (CNN)
Object category: Linear + softmax
Box offset: Linear

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.


Fast R-CNN: Another view

R. Girshick, Fast R-CNN, ICCV 2015


Cropping Features: RoI Pool

Input Image (e.g. 3 x 640 x 480) → CNN → Image features: C x H x W (e.g. 512 x 20 x 15)

Project proposal onto features

“Snap” to grid cells

Q: how do we resize the 512 x 5 x 4 region to, e.g., a 512 x 2 x 2 tensor?

Divide into 2x2 grid of (roughly) equal subregions

Max-pool within each subregion

Region features (here 512 x 2 x 2; in practice e.g. 512 x 7 x 7)

Region features always the same size even if input regions have different sizes!

Problem: Region features slightly misaligned

Girshick, “Fast R-CNN”, ICCV 2015.
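
The RoI Pool steps can be sketched in NumPy. Assumptions for this illustration: the RoI has already been projected onto the feature map and snapped to integer grid cells, it is given as (x1, y1, x2, y2) with exclusive end coordinates, and it is at least `out_size` cells in each direction. The function name and the rounding used to split the region are choices made here, not the reference implementation.

```python
import numpy as np

def roi_pool(features, roi, out_size=2):
    """Max-pool a snapped RoI of a C x H x W feature map to C x out_size x out_size."""
    x1, y1, x2, y2 = roi
    region = features[:, y1:y2, x1:x2]
    c, h, w = region.shape
    assert h >= out_size and w >= out_size, "sketch assumes the RoI is large enough"
    # Boundaries of (roughly) equal subregions along each axis.
    ys = np.linspace(0, h, out_size + 1).round().astype(int)
    xs = np.linspace(0, w, out_size + 1).round().astype(int)
    out = np.empty((c, out_size, out_size), dtype=features.dtype)
    for i in range(out_size):
        for j in range(out_size):
            cell = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
features = rng.normal(size=(512, 20, 15))            # C x H x W as on the slide
pooled = roi_pool(features, roi=(3, 2, 7, 7))        # a 512 x 5 x 4 region
other = roi_pool(features, roi=(0, 0, 10, 12))       # a differently sized region
print(pooled.shape, other.shape)
```

Both calls return a 512 x 2 x 2 tensor, which is the property the later layers rely on.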


Cropping Features: RoI Align

Input Image (e.g. 3 x 640 x 480) → CNN → Image features: C x H x W (e.g. 512 x 20 x 15)

Project proposal onto features

No “snapping”!

Sample at regular points in each subregion using bilinear interpolation

Feature f_xy for point (x, y) is a linear combination of features at its four neighboring grid cells:

f_11 ∈ R^512 at (x1, y1)    f_21 ∈ R^512 at (x2, y1)
f_12 ∈ R^512 at (x1, y2)    f_22 ∈ R^512 at (x2, y2)

f_xy = Σ_{i,j ∈ {1,2}} f_ij · max(0, 1 − |x − x_i|) · max(0, 1 − |y − y_j|)

Max-pool within each subregion

Region features (here 512 x 2 x 2; in practice e.g. 512 x 7 x 7)

He et al, “Mask R-CNN”, ICCV 2017
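
Bilinear sampling at a non-integer point can be sketched as below. Assumptions: the feature map is C x H x W with at least 2 cells per axis, grid cells sit at integer coordinates, and the sample point satisfies 0 <= x <= W - 1 and 0 <= y <= H - 1. The function name is hypothetical.

```python
import numpy as np

def bilinear_sample(features, x, y):
    """Feature vector at a real-valued point (x, y) of a C x H x W feature map."""
    _, h, w = features.shape
    # Top-left neighbour, clamped so that the 2x2 neighbourhood stays inside the map.
    x1 = min(max(int(np.floor(x)), 0), w - 2)
    y1 = min(max(int(np.floor(y)), 0), h - 2)
    x2, y2 = x1 + 1, y1 + 1
    wx2, wy2 = x - x1, y - y1          # weight of the right / bottom neighbour
    wx1, wy1 = 1.0 - wx2, 1.0 - wy2    # weight of the left / top neighbour
    return (features[:, y1, x1] * wx1 * wy1 + features[:, y1, x2] * wx2 * wy1 +
            features[:, y2, x1] * wx1 * wy2 + features[:, y2, x2] * wx2 * wy2)

# A feature map whose value is linear in position: f(y, x) = 15 * y + x.
grid = np.arange(20 * 15, dtype=float).reshape(1, 20, 15)
centre = bilinear_sample(grid, x=2.5, y=3.25)
corner = bilinear_sample(grid, x=14.0, y=19.0)
print(centre, corner)
```

Because the weights vary smoothly with (x, y), no coordinates need to be rounded, which avoids the misalignment introduced by snapping.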


R-CNN vs Fast R-CNN

Problem: Runtime dominated by region proposals!

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015


Faster R-CNN: Make CNN do proposals!

Insert Region Proposal Network (RPN) to predict proposals from features

Otherwise same as Fast R-CNN: Crop features for each proposal, classify each one

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
Figure copyright 2015, Ross Girshick; reproduced with permission


Region Proposal Network

Input Image (e.g. 3 x 640 x 480) → CNN → Image features (e.g. 512 x 20 x 15)

Imagine an anchor box of fixed size at each point in the feature map

Conv → Anchor is an object? 1 x 20 x 15
At each point, predict whether the corresponding anchor contains an object (binary classification)

Conv → Box corrections: 4 x 20 x 15
For positive boxes, also predict corrections from the anchor to the ground-truth box (regress 4 numbers per pixel)

In practice use K different anchor boxes of different size/scale at each point
Anchor is an object? K x 20 x 15
Box transforms: 4K x 20 x 15

Sort the K*20*15 boxes by their “objectness” score, take top ~300 as our proposals


Region proposal network (RPN)
• Slide a small window over the conv5 layer
• Predict object/no object
• Regress bounding box coordinates
• Box regression is with reference to anchors (3 scales x 3 aspect ratios)
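
A sketch of anchor generation and proposal selection for a 20 x 15 feature map with 3 scales x 3 aspect ratios (K = 9). The concrete scales (128, 256, 512), the ratios (0.5, 1, 2) and the stride of 32 pixels (640 / 20 = 480 / 15) are assumptions for illustration, and the objectness scores are random stand-ins for the output of the classification head.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Anchors as (cx, cy, w, h), shape (feat_h, feat_w, K, 4), in image pixels."""
    shapes = np.array([(s / np.sqrt(r), s * np.sqrt(r))   # ratio r = height / width
                       for s in scales for r in ratios])
    ys = (np.arange(feat_h) + 0.5) * stride                # centre of every feature cell
    xs = (np.arange(feat_w) + 0.5) * stride
    cy, cx = np.meshgrid(ys, xs, indexing="ij")
    anchors = np.zeros((feat_h, feat_w, len(shapes), 4))
    anchors[..., 0] = cx[:, :, None]
    anchors[..., 1] = cy[:, :, None]
    anchors[..., 2] = shapes[:, 0]
    anchors[..., 3] = shapes[:, 1]
    return anchors

anchors = make_anchors(20, 15, stride=32,
                       scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0))

# Hypothetical objectness scores, one per anchor; keep the 300 best as proposals.
rng = np.random.default_rng(0)
objectness = rng.random((20, 15, 9))
top = np.argsort(-objectness.ravel())[:300]
proposals = anchors.reshape(-1, 4)[top]
print(anchors.shape, proposals.shape)
```

In a full RPN the predicted box transforms would be applied to the selected anchors before they are used as proposals; that step is left out here.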
Figure labels: Classification head | Regression head | Anchor boxes
Faster R-CNN: Make CNN do proposals!

Jointly train with 4 losses:

1. RPN classify object / not object
2. RPN regress box coordinates
3. Final classification score (object classes)
4. Final box coordinates




Faster R-CNN: Make CNN do proposals!

Glossing over many details:
- Ignore overlapping proposals with non-max suppression
- How are anchors determined?
- How do we sample positive / negative samples for training the RPN?
- How to parameterize bounding box regression?
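
Non-max suppression, the first item in the list, can be sketched as a greedy loop: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. Assumptions for this sketch: boxes are (x1, y1, x2, y2) with positive area, and the IoU threshold of 0.5 is an illustrative default.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression; returns indices of the boxes that are kept."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(-scores)          # highest score first
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # IoU of the kept box with all remaining boxes.
        iw = np.clip(np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]), 0, None)
        ih = np.clip(np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]), 0, None)
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]
    return keep

kept = nms(boxes=[[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]],
           scores=[0.9, 0.8, 0.7])
print(kept)
```

The second box overlaps the first with an IoU of about 0.68 and is suppressed; the third box does not overlap and is kept.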


Faster R-CNN: Make CNN do proposals!

Faster R-CNN is a two-stage object detector

First stage: Run once per image
- Backbone network
- Region proposal network

Second stage: Run once per region
- Crop features: RoI pool / align
- Predict object class
- Predict bbox offset


R-CNN Summary:

R-CNN: Propose regions first. Classify proposed regions one at a time. Output contains: label + bounding box.

Fast R-CNN: Run the convolutional net once on the whole image and apply the region proposals to its feature map. Use convolution implementation of sliding windows to classify all the proposed regions. End-to-end.

Faster R-CNN: Use ConvNet to propose regions. End-to-end.

[Girshick et al., 2013. Rich feature hierarchies for accurate object detection and semantic segmentation]
[Girshick, 2015. Fast R-CNN]
[Ren et al., 2016. Faster R-CNN: Towards real-time object detection with region proposal networks]
Faster R-CNN: Make CNN do proposals!

Do we really need the second stage?


Single-Stage Object Detectors: YOLO / SSD / RetinaNet

Input image: 3 x H x W

Divide image into grid: 7 x 7

Imagine a set of base boxes centered at each grid cell. Here B = 3

Within each grid cell:
- Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
- Predict scores for each of C classes (including background as a class)
- Looks a lot like RPN, but category-specific!

Output: 7 x 7 x (5 * B + C)

Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
Lin et al, “Focal Loss for Dense Object Detection”, ICCV 2017
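
The output size can be checked with a few lines of NumPy. B = 3 is taken from the slide; C = 20 classes and the ordering of the channels (all box numbers first, then class scores) are assumptions made only for this illustration.

```python
import numpy as np

S, B, C = 7, 3, 20                       # grid size, base boxes per cell, classes (C assumed)
output = np.zeros((S, S, 5 * B + C))     # stand-in for the detector's raw output

# Split every grid cell into its B boxes and its C class scores
# (channel layout assumed: box numbers first, class scores last).
boxes = output[..., :5 * B].reshape(S, S, B, 5)   # (dx, dy, dh, dw, confidence) per box
class_scores = output[..., 5 * B:]                # one score per class

print(output.shape, boxes.shape, class_scores.shape)
```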


YOLO– real-time object detection


Redmon et al. "You only look once: unified, real-time object detection (2015)."



Object Detection: Lots of variables ...

Backbone Network: VGG16, ResNet-101, Inception V2, Inception V3, Inception ResNet, MobileNet
“Meta-Architecture”: Two-stage: Faster R-CNN; Single-stage: YOLO / SSD; Hybrid: R-FCN
Image Size
# Region Proposals
…

Takeaways
Faster R-CNN is slower but more accurate
SSD is much faster but not as accurate
Bigger / Deeper backbones work better

Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
Zou et al, “Object Detection in 20 Years: A Survey”, arXiv 2019
R-FCN: Dai et al, “R-FCN: Object Detection via Region-based Fully Convolutional Networks”, NIPS 2016
Inception-V2: Ioffe and Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015
Inception V3: Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”, arXiv 2016
Inception ResNet: Szegedy et al, “Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv 2016
MobileNet: Howard et al, “Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv 2017


Instance Segmentation


Object Detection: Faster R-CNN



Instance Segmentation: Mask R-CNN

Mask Prediction: Add a small mask network that operates on each RoI and predicts a 28x28 binary mask

He et al, “Mask R-CNN”, ICCV 2017
Mask R-CNN

CNN + RPN → RoI Align → 256 x 14 x 14

Classification Scores: C
Box coordinates (per class): 4 * C

Mask branch: 256 x 14 x 14 → Conv → 256 x 14 x 14 → Conv → C x 28 x 28
Predict a mask for each of C classes

He et al, “Mask R-CNN”, arXiv 2017


Mask R-CNN: Example Mask Training Targets



Mask R-CNN: Very Good Results!

He et al, “Mask R-CNN”, ICCV 2017



Mask R-CNN Also does pose

He et al, “Mask R-CNN”, ICCV 2017



Open Source Frameworks

Lots of good implementations on GitHub!

TensorFlow Detection API:
https://github.com/tensorflow/models/tree/master/research/object_detection
Faster RCNN, SSD, RFCN, Mask R-CNN, ...

Detectron2 (PyTorch)
https://github.com/facebookresearch/detectron2
Mask R-CNN, RetinaNet, Faster R-CNN, RPN, Fast R-CNN, R-FCN, ...

Finetune on your own dataset with pre-trained models
