
Lecture 9:
Object Detection and Image Segmentation


Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 9 - 1 April 26, 2022
Image Classification: A core task in Computer Vision

(assume given a set of possible labels)


{dog, cat, truck, plane, ...}

cat

This image by Nikita is licensed under CC-BY 2.0

Computer Vision Tasks
Classification: CAT (no spatial extent)
Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)

This image is CC0 public domain

Semantic Segmentation
Semantic Segmentation: The Problem

Paired training data: for each training image, each pixel is labeled with a semantic category (GRASS, CAT, TREE, SKY, ...).
At test time, classify each pixel of a new image.

Semantic Segmentation Idea: Sliding Window

Full image

Impossible to classify without context

Q: how do we include context?

Q: how do we model this?

Semantic Segmentation Idea: Sliding Window
Full image: extract a patch, then classify its center pixel with a CNN (e.g. Cow, Cow, Grass)

Problem: Very inefficient! Not reusing shared features between overlapping patches

Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014

Semantic Segmentation Idea: Convolution

Full image

An intuitive idea: encode the entire image with conv net, and do semantic segmentation
on top.

Problem: classification architectures often reduce feature spatial sizes to go deeper, but
semantic segmentation requires the output size to be the same as input size.

Semantic Segmentation Idea: Fully Convolutional
Design a network with only convolutional layers
without downsampling operators to make predictions
for pixels all at once!

Input: 3 x H x W
Convolutions (Conv, Conv, Conv, Conv): D x H x W
Scores: C x H x W
Predictions (argmax): H x W

Problem: convolutions at original image resolution will be very expensive ...

Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Downsampling: Pooling, strided convolution
Upsampling: ???

Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/4 x W/4
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Scores: C x H x W
Predictions: H x W

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

In-Network upsampling: “Unpooling”

Nearest Neighbor

Input: 2 x 2
1 2
3 4

Output: 4 x 4
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4

“Bed of Nails”

Input: 2 x 2
1 2
3 4

Output: 4 x 4
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0

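The two unpooling schemes can be sketched in a few lines of plain Python. This is only an illustration on nested lists; the function names and the `factor` parameter are made up here, and the "nail" is assumed to sit in the top-left corner of each block, as in the slide's example.

```python
def nearest_neighbor_unpool(x, factor=2):
    """Copy every input value over a factor x factor block."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def bed_of_nails_unpool(x, factor=2):
    """Put each input value in the top-left cell of its block, zeros elsewhere."""
    h, w = len(x), len(x[0])
    out = [[0] * (w * factor) for _ in range(h * factor)]
    for i in range(h):
        for j in range(w):
            out[i * factor][j * factor] = x[i][j]
    return out


x = [[1, 2], [3, 4]]
nn_out = nearest_neighbor_unpool(x)   # 4 x 4, values repeated
bon_out = bed_of_nails_unpool(x)      # 4 x 4, mostly zeros
```

Neither scheme has learnable parameters; both only fix how a 2 x 2 input is spread over a 4 x 4 output.
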
In-Network upsampling: “Max Unpooling”
Max Pooling
Remember which element was max!

Input: 4 x 4
1 2 6 3
3 5 2 1
1 2 2 1
7 3 4 8

Output: 2 x 2
5 6
7 8

… Rest of the network …

Max Unpooling
Use positions from pooling layer

Input: 2 x 2
1 2
3 4

Output: 4 x 4
0 0 2 0
0 1 0 0
0 0 0 0
3 0 0 4

Corresponding pairs of downsampling and upsampling layers

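A minimal sketch of the pooling / unpooling pair, using the numbers from the slide. The helper names are hypothetical, and the code assumes non-overlapping k x k windows on a single-channel map stored as nested lists.

```python
def max_pool_with_indices(x, k=2):
    """k x k max pooling with stride k; also returns the (row, col) of each max."""
    h, w = len(x) // k, len(x[0]) // k
    pooled = [[0] * w for _ in range(h)]
    indices = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            cells = [(x[i * k + di][j * k + dj], (i * k + di, j * k + dj))
                     for di in range(k) for dj in range(k)]
            pooled[i][j], indices[i][j] = max(cells, key=lambda c: c[0])
    return pooled, indices


def max_unpool(y, indices, out_h, out_w):
    """Write each value of y back to the position remembered by the pooling layer."""
    out = [[0] * out_w for _ in range(out_h)]
    for i, row in enumerate(y):
        for j, v in enumerate(row):
            r, c = indices[i][j]
            out[r][c] = v
    return out


x = [[1, 2, 6, 3],
     [3, 5, 2, 1],
     [1, 2, 2, 1],
     [7, 3, 4, 8]]
pooled, idx = max_pool_with_indices(x)
unpooled = max_unpool([[1, 2], [3, 4]], idx, 4, 4)
```
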
Learnable Upsampling
Recall: Normal 3 x 3 convolution, stride 1 pad 1

Dot product between filter and input

Input: 4 x 4 Output: 4 x 4

Learnable Upsampling
Recall: Normal 3 x 3 convolution, stride 2 pad 1

Dot product between filter and input

Input: 4 x 4 Output: 2 x 2

Filter moves 2 pixels in the input for every one pixel in the output
Stride gives ratio between movement in input and output

We can interpret strided convolution as “learnable downsampling”.

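The input and output sizes quoted for the two recalled convolutions follow from the standard output-size formula. A small check, with a hypothetical helper name:

```python
def conv_output_size(n, kernel, stride, pad):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * pad - kernel) // stride + 1


stride1 = conv_output_size(4, kernel=3, stride=1, pad=1)  # 4 x 4 input, stride 1
stride2 = conv_output_size(4, kernel=3, stride=2, pad=1)  # 4 x 4 input, stride 2
```
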
Learnable Upsampling: Transposed Convolution
3 x 3 transposed convolution, stride 2 pad 1

Input gives weight for filter
Sum where output overlaps

Input: 2 x 2 Output: 4 x 4

Filter moves 2 pixels in the output for every one pixel in the input
Stride gives ratio between movement in output and input

Q: Why is it called transposed convolution?

Learnable Upsampling: 1D Example
Input: [a, b]
Filter: [x, y, z]
Output: [ax, ay, az + bx, by, bz]

Output contains copies of the filter weighted by the input, summing where it overlaps in the output

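The 1D example can be reproduced directly. The sketch below assumes stride 2 and no padding or cropping of the output; the numeric values stand in for the symbols a, b and x, y, z.

```python
def transposed_conv1d(inputs, kernel, stride=2):
    """Each input value scales a copy of the kernel; overlapping copies are summed."""
    out = [0] * ((len(inputs) - 1) * stride + len(kernel))
    for i, v in enumerate(inputs):
        for k, w in enumerate(kernel):
            out[i * stride + k] += v * w
    return out


a, b = 2, 3
x, y, z = 1, 10, 100
out = transposed_conv1d([a, b], [x, y, z])
# out corresponds to [ax, ay, az + bx, by, bz]
```
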
Convolution as Matrix Multiplication (1D Example)
We can express convolution in terms of a matrix multiplication
Example: 1D conv, kernel size=3, stride=2, padding=1

Transposed convolution multiplies by the transpose of the same matrix:
Example: 1D transposed conv, kernel size=3, stride=2, padding=0

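A sketch of the matrix view, assuming a length-4 input and the kernel [x, y, z]. The helper names are hypothetical, and the matrix is built to act on the zero-padded input, which is one possible convention.

```python
def conv_matrix(kernel, in_len, stride, pad):
    """Matrix X such that X @ padded_input equals the strided 1D convolution."""
    padded = in_len + 2 * pad
    out_len = (padded - len(kernel)) // stride + 1
    rows = []
    for o in range(out_len):
        row = [0] * padded
        for t, w in enumerate(kernel):
            row[o * stride + t] = w
        rows.append(row)
    return rows


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


X = conv_matrix([1, 10, 100], in_len=4, stride=2, pad=1)   # x=1, y=10, z=100
conv_out = matvec(X, [0, 1, 2, 3, 4, 0])                   # padded input [0, a, b, c, d, 0]
tconv_out = matvec(transpose(X), [2, 3])                   # transposed conv on [a, b]
```

Multiplying by the transpose turns a length-2 vector into a length-6 one, so the same weights that downsample in one direction upsample in the other.
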
Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Downsampling: Pooling, strided convolution
Upsampling: Unpooling or strided transposed convolution

Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/4 x W/4
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

Semantic Segmentation: Summary

Label each pixel in the image with a category label (e.g. Sky, Trees, Cat, Cow, Grass)
Don’t differentiate instances, only care about pixels

This image is CC0 public domain

Object Detection
Object Detection: Single Object
(Classification + Localization)

This image is CC0 public domain

Vector: 4096

Fully Connected: 4096 to 1000
Class Scores: Cat: 0.9, Dog: 0.05, Car: 0.01, ...
Correct label: Cat
Softmax Loss

Fully Connected: 4096 to 4
Box Coordinates: (x, y, w, h)
Correct box: (x’, y’, w’, h’)
L2 Loss

Treat localization as a regression problem!

Multitask Loss: Softmax Loss + L2 Loss

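A minimal sketch of the multitask loss for one image. The `box_weight` factor is an assumption added here for illustration; the slide simply adds the two losses.

```python
import math


def softmax_loss(scores, label):
    """Cross-entropy of the softmax over raw class scores."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]


def l2_loss(pred_box, true_box):
    """Squared error between predicted and correct (x, y, w, h)."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box))


def multitask_loss(scores, label, pred_box, true_box, box_weight=1.0):
    # box_weight is a hypothetical knob for balancing the two terms.
    return softmax_loss(scores, label) + box_weight * l2_loss(pred_box, true_box)


loss = multitask_loss([0.0, 0.0], 0, (0, 0, 1, 1), (0, 0, 1, 3))
```
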
Object Detection: Multiple Objects
Each image needs a different number of outputs!

4 numbers:
CAT: (x, y, w, h)

12 numbers:
DOG: (x, y, w, h)
DOG: (x, y, w, h)
CAT: (x, y, w, h)

Many numbers!
DUCK: (x, y, w, h)
DUCK: (x, y, w, h)
….

Object Detection: Multiple Objects
Apply a CNN to many different crops of the image, CNN classifies each crop as object or background

Dog? NO
Cat? NO
Background? YES

Dog? YES
Cat? NO
Background? NO

Dog? NO
Cat? YES
Background? NO

Q: What’s the problem with this approach?

Problem: Need to apply CNN to huge number of locations, scales, and aspect ratios, very computationally expensive!

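To get a feel for how many crops "huge" means, one can count every axis-aligned box with whole-pixel extents in an image. The image size below is a made-up example, not a figure from the lecture.

```python
def count_boxes(height, width):
    """Number of axis-aligned boxes whose sides cover whole pixels."""
    x_ranges = width * (width + 1) // 2    # choices of (left, right)
    y_ranges = height * (height + 1) // 2  # choices of (top, bottom)
    return x_ranges * y_ranges


n_boxes = count_boxes(600, 800)  # tens of billions of candidate crops
```
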
Region Proposals: Selective Search
● Find “blobby” image regions that are likely to contain objects
● Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU

Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014

R-CNN (“Slow” R-CNN)

Input image
Regions of Interest (RoI) from a proposal method (~2k)
Warped image regions (224x224 pixels)
Forward each region through ConvNet (ImageNet-pretrained)
Classify regions with SVMs
Bbox reg: Predict “corrections” to the RoI: 4 numbers: (dx, dy, dw, dh)

Problem: Very slow! Need to do ~2k independent forward passes for each image!

Idea: Pass the image through convnet before cropping! Crop the conv feature instead!

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

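The slides do not say how the four "corrections" are applied. One common choice, used in the R-CNN paper, shifts the box center in units of the box size and scales width and height in log space. The sketch below assumes boxes stored as (center x, center y, width, height); the function name is hypothetical.

```python
import math


def apply_box_correction(roi, delta):
    """Apply (dx, dy, dw, dh) to a box given as (cx, cy, w, h)."""
    cx, cy, w, h = roi
    dx, dy, dw, dh = delta
    return (cx + w * dx,        # shift center by a fraction of the width
            cy + h * dy,        # shift center by a fraction of the height
            w * math.exp(dw),   # scale width in log space
            h * math.exp(dh))   # scale height in log space


unchanged = apply_box_correction((100, 50, 40, 20), (0, 0, 0, 0))
shifted = apply_box_correction((100, 50, 40, 20), (0.5, -0.5, 0, 0))
```

With this parameterization an all-zero prediction leaves the proposal unchanged.
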
Fast R-CNN

Input image
Run whole image through ConvNet (“Backbone” network: AlexNet, VGG, ResNet, etc) to get “conv5” features
Regions of Interest (RoIs) from a proposal method
Crop + Resize features
Per-Region Network (CNN)
Object category: Linear + softmax
Box offset: Linear

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Cropping Features: RoI Pool

Input Image (e.g. 3 x 640 x 480)
CNN
Image features: C x H x W (e.g. 512 x 20 x 15)

Project proposal onto features
“Snap” to grid cells

Q: how do we resize the 512 x 5 x 4 region to, e.g., a 512 x 2 x 2 tensor?

Divide into 2x2 grid of (roughly) equal subregions
Max-pool within each subregion

Region features (here 512 x 2 x 2; in practice e.g. 512 x 7 x 7)
Region features always the same size even if input regions have different sizes!

Problem: Region features slightly misaligned

Girshick, “Fast R-CNN”, ICCV 2015.

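A rough sketch of RoI Pool on a single-channel feature map. The snapping rule (floor the start, ceil the end) and the way bins are split are assumptions for illustration. The code also assumes the snapped region lies inside the map and is at least as large as the output grid.

```python
import math


def roi_pool(features, box, out_h=2, out_w=2):
    """Max-pool the region box = (x0, y0, x1, y1), in feature-map coordinates."""
    x0, y0, x1, y1 = box
    # "Snap" the projected box outward to whole grid cells (assumed rule).
    c0, r0 = math.floor(x0), math.floor(y0)
    c1, r1 = math.ceil(x1), math.ceil(y1)
    h, w = r1 - r0, c1 - c0
    out = []
    for i in range(out_h):
        # Roughly equal, non-overlapping row bins.
        rows = range(r0 + (i * h) // out_h, r0 + ((i + 1) * h) // out_h)
        out_row = []
        for j in range(out_w):
            cols = range(c0 + (j * w) // out_w, c0 + ((j + 1) * w) // out_w)
            out_row.append(max(features[r][c] for r in rows for c in cols))
        out.append(out_row)
    return out


features = [[r * 6 + c for c in range(6)] for r in range(6)]  # toy 6 x 6 map
pooled_roi = roi_pool(features, (0.6, 1.2, 4.4, 5.7))         # 5 x 5 cells -> 2 x 2
```

The snapping step is where the misalignment comes from: the pooled cells no longer cover exactly the area of the original proposal.
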
Cropping Features: RoI Align

Input Image (e.g. 3 x 640 x 480)
CNN
Image features: C x H x W (e.g. 512 x 20 x 15)

Project proposal onto features
No “snapping”!
Sample at regular points in each subregion using bilinear interpolation

Feature f_xy for point (x, y) is a linear combination of features at its four neighboring grid cells:
f_11 ∈ R^512 at (x1, y1), f_21 ∈ R^512 at (x2, y1), f_12 ∈ R^512 at (x1, y2), f_22 ∈ R^512 at (x2, y2)

Max-pool within each subregion

Region features (here 512 x 2 x 2; in practice e.g. 512 x 7 x 7)

He et al, “Mask R-CNN”, ICCV 2017

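Bilinear interpolation at a non-integer point can be sketched as below for a single-channel map. Each of the four neighbors is weighted by max(0, 1 - |x - x_i|) * max(0, 1 - |y - y_j|), which is the standard bilinear weight. The function name and the choice to skip out-of-range neighbors are assumptions.

```python
import math


def bilinear_sample(features, x, y):
    """Interpolate features (rows index y, columns index x) at a real-valued point."""
    x1, y1 = math.floor(x), math.floor(y)
    total = 0.0
    for yi in (y1, y1 + 1):
        for xi in (x1, x1 + 1):
            if 0 <= yi < len(features) and 0 <= xi < len(features[0]):
                weight = max(0.0, 1 - abs(x - xi)) * max(0.0, 1 - abs(y - yi))
                total += weight * features[yi][xi]
    return total


grid = [[0, 10],
        [20, 30]]
center = bilinear_sample(grid, 0.5, 0.5)      # average of the four cells
off_center = bilinear_sample(grid, 0.25, 0.75)
```

Because the weights vary smoothly with (x, y), the sampled features follow the proposal's exact position instead of the nearest grid cell.
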
R-CNN vs Fast R-CNN

Problem: Runtime dominated by region proposals!

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015

Faster R-CNN: Make CNN do proposals!

Insert Region Proposal Network (RPN) to predict proposals from features

Otherwise same as Fast R-CNN: Crop features for each proposal, classify each one

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
Figure copyright 2015, Ross Girshick; reproduced with permission

Region Proposal Network

Input Image (e.g. 3 x 640 x 480)
CNN
Image features (e.g. 512 x 20 x 15)

Imagine an anchor box of fixed size at each point in the feature map

Conv
Anchor is an object? 1 x 20 x 15
At each point, predict whether the corresponding anchor contains an object (binary classification)

Box corrections: 4 x 20 x 15
For positive boxes, also predict corrections from the anchor to the ground-truth box (regress 4 numbers per pixel)

In practice use K different anchor boxes of different size / scale at each point
Anchor is an object? K x 20 x 15
Box transforms: 4K x 20 x 15

Sort the K*20*15 boxes by their “objectness” score, take top ~300 as our proposals

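A sketch of the anchor layout and the top-N selection for the 20 x 15 feature map in the example. The stride of 32 follows from 640 / 20 and 480 / 15. The three anchor sizes and the dummy scores are made up for illustration.

```python
def make_anchors(feat_h, feat_w, stride, sizes):
    """K anchors (cx, cy, w, h) in image pixels, centered on each feature-map cell."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for w, h in sizes:
                anchors.append((cx, cy, w, h))
    return anchors


def top_proposals(anchors, scores, n=300):
    """Keep the n anchors with the highest objectness score."""
    order = sorted(range(len(anchors)), key=lambda k: scores[k], reverse=True)
    return [anchors[k] for k in order[:n]]


sizes = [(64, 64), (128, 128), (256, 256)]       # hypothetical anchors, K = 3
anchors = make_anchors(20, 15, stride=32, sizes=sizes)
scores = list(range(len(anchors)))               # stand-in for predicted objectness
proposals = top_proposals(anchors, scores, n=300)
```

In a full RPN the predicted box transforms would be applied to the anchors before ranking; that step is left out here.
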
Faster R-CNN: Make CNN do proposals!

Jointly train with 4 losses:
1. RPN classify object / not object
2. RPN regress box coordinates
3. Final classification score (object classes)
4. Final box coordinates

Faster R-CNN: Make CNN do proposals!

Glossing over many details:
- Ignore overlapping proposals with non-max suppression
- How are anchors determined?
- How do we sample positive / negative samples for training the RPN?
- How to parameterize bounding box regression?

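Non-max suppression is commonly implemented greedily: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. A minimal sketch, assuming boxes given as (x0, y0, x1, y1) and an IoU threshold of 0.5 (both are choices made here, not taken from the slides):

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, in order of decreasing score."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for k in order:
        if all(iou(boxes[k], boxes[j]) <= iou_threshold for j in keep):
            keep.append(k)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])
```
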
Faster R-CNN: Make CNN do proposals!

Faster R-CNN is a Two-stage object detector

First stage: Run once per image
- Backbone network
- Region proposal network

Second stage: Run once per region
- Crop features: RoI pool / align
- Predict object class
- Predict bbox offset

Do we really need the second stage?

Single-Stage Object Detectors: YOLO / SSD / RetinaNet
Input image: 3 x H x W
Divide image into grid: 7 x 7
Imagine a set of base boxes centered at each grid cell. Here B = 3

Within each grid cell:
- Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
- Predict scores for each of C classes (including background as a class)
- Looks a lot like RPN, but category-specific!

Output: 7 x 7 x (5 * B + C)

Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
Lin et al, “Focal Loss for Dense Object Detection”, ICCV 2017

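The output size and the layout of one grid cell's prediction can be checked with a short sketch. B = 3 is taken from the slide. C = 20 and the ordering of box numbers before class scores are assumptions for illustration.

```python
def output_size(grid, B, C):
    """Total number of outputs for a grid x grid x (5 * B + C) prediction."""
    return grid * grid * (5 * B + C)


def split_cell(pred, B, C):
    """Split one cell's vector into B boxes of 5 numbers and C class scores."""
    boxes = [tuple(pred[5 * b:5 * b + 5]) for b in range(B)]
    class_scores = pred[5 * B:5 * B + C]
    return boxes, class_scores


B, C = 3, 20                     # C = 20 is a hypothetical number of classes
total = output_size(7, B, C)
cell = list(range(5 * B + C))    # dummy prediction for one grid cell
cell_boxes, cell_scores = split_cell(cell, B, C)
```
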
Object Detection: Lots of variables ...
Backbone “Meta-Architecture” Takeaways
Network Two-stage: Faster R-CNN Faster R-CNN is slower
VGG16 Single-stage: YOLO / SSD but more accurate
ResNet-101 Hybrid: R-FCN
Inception V2 SSD is much faster but
Inception V3 Image Size not as accurate
Inception # Region Proposals
ResNet … Bigger / Deeper
MobileNet backbones work better
Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017

R-FCN: Dai et al, “R-FCN: Object Detection via Region-based Fully Convolutional Networks”, NIPS 2016
Inception-V2: Ioffe and Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015
Inception V3: Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”, arXiv 2016
Inception ResNet: Szegedy et al, “Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv 2016
MobileNet: Howard et al, “Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv 2017

Zou et al, “Object Detection in 20 Years: A Survey”, arXiv 2019

Instance Segmentation

Classification: CAT (no spatial extent)
Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)

Object Detection:
Faster R-CNN

Instance Segmentation: Mask Prediction

Mask R-CNN

Add a small mask network that operates on each RoI and predicts a 28x28 binary mask

He et al, “Mask R-CNN”, ICCV 2017

Mask R-CNN

CNN + RPN -> RoI Align -> 256 x 14 x 14 -> Conv -> 256 x 14 x 14 -> Conv -> C x 28 x 28

Classification Scores: C
Box coordinates (per class): 4 * C
Predict a mask for each of C classes: C x 28 x 28

He et al, “Mask R-CNN”, arXiv 2017
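
Per RoI, the head outputs are C class scores, 4 * C box coordinates, and a C x 28 x 28 mask tensor. The sketch below, in plain Python, shows one way the final binary mask could be read out: take the mask channel of the highest-scoring class and threshold it. The function names, the 0.5 threshold, and the nested-list layout are assumptions for illustration, not the reference implementation.

```python
def head_output_sizes(num_classes, mask_size=28):
    """Number of values each head branch predicts per RoI."""
    return {
        "class_scores": num_classes,
        "box_coordinates": 4 * num_classes,
        "mask_values": num_classes * mask_size * mask_size,
    }


def select_mask(class_scores, mask_probs, threshold=0.5):
    """Pick the mask channel of the top-scoring class and binarize it.

    class_scores: list of C scores.
    mask_probs: C masks, each a list of rows of probabilities in [0, 1].
    """
    best_class = max(range(len(class_scores)), key=lambda c: class_scores[c])
    binary = [[1 if p >= threshold else 0 for p in row]
              for row in mask_probs[best_class]]
    return best_class, binary


sizes = head_output_sizes(num_classes=80)  # 80 classes is an example value

# Tiny 2-class, 2x2 example (a real head would use 28x28 masks).
scores = [0.1, 0.9]
masks = [
    [[0.9, 0.9], [0.9, 0.9]],  # mask for class 0
    [[0.8, 0.2], [0.6, 0.4]],  # mask for class 1
]
best, mask = select_mask(scores, masks)
```

Predicting one mask per class means the mask branch does not have to decide which class it is segmenting; the classification branch makes that choice.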

Mask R-CNN: Example Mask Training Targets

Mask R-CNN: Very Good Results!

He et al, “Mask R-CNN”, ICCV 2017

Mask R-CNN
Also does pose

He et al, “Mask R-CNN”, ICCV 2017

Open Source Frameworks
Lots of good implementations on GitHub!

TensorFlow Detection API:
https://github.com/tensorflow/models/tree/master/research/object_detection
Faster R-CNN, SSD, R-FCN, Mask R-CNN, ...

Detectron2 (PyTorch):
https://github.com/facebookresearch/detectron2
Mask R-CNN, RetinaNet, Faster R-CNN, RPN, Fast R-CNN, R-FCN, ...

Fine-tune on your own dataset with pre-trained models

Beyond 2D Object Detection...

Object Detection + Captioning = Dense Captioning

Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016
Figure copyright IEEE, 2016. Reproduced for educational purposes.

Dense Video Captioning

Krishna et al, “Dense-Captioning Events in Videos”, ICCV 2017
Figure copyright IEEE, 2017. Reproduced with permission.

Objects + Relationships = Scene Graphs

Krishna et al, “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations”, IJCV 123(1):32-73, 2017

Scene Graph Prediction

Xu, Zhu, Choy, and Fei-Fei, “Scene Graph Generation by Iterative Message Passing”, CVPR 2017
Figure copyright IEEE, 2018. Reproduced for educational purposes.

3D Object Detection

2D Object Detection: 2D bounding box (x, y, w, h)

3D Object Detection: 3D oriented bounding box (x, y, z, w, h, l, r, p, y), where r, p, y are roll, pitch, and yaw

Simplified bbox: no roll & pitch

Much harder problem than 2D object detection!

This image is CC0 public domain
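
A minimal sketch of the simplified 3D box (no roll and pitch): the box is described by its center, its size, and a yaw angle, and its eight corners follow from a rotation in the ground plane. The axis convention used here (z up, yaw about z, w along x, l along y, h along z) and the function name are assumptions; datasets differ on both.

```python
import math


def box3d_corners(center, size, yaw):
    """Corners of a yaw-only 3D box.

    center: (x, y, z); size: (w, h, l); yaw: rotation about the z axis in radians.
    Assumed convention: w along x, l along y, h along z (before rotation).
    """
    cx, cy, cz = center
    w, h, l = size
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                # Offset in the box frame, then rotate in the ground plane.
                ox, oy, oz = sx * w, sy * l, sz * h
                corners.append((cx + cos_y * ox - sin_y * oy,
                                cy + sin_y * ox + cos_y * oy,
                                cz + oz))
    return corners


# 7 numbers (x, y, z, w, h, l, yaw) instead of the full 9 with roll and pitch.
corners = box3d_corners(center=(0.0, 0.0, 1.0), size=(2.0, 2.0, 4.0), yaw=0.0)
```

Dropping roll and pitch is a reasonable simplification when objects rest on a roughly flat ground plane, since only the heading then varies.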

3D Object Detection: Simple Camera Model

A point on the image plane corresponds to a ray in the 3D space.

A 2D bounding box on an image is a frustum in the 3D space.

Localize an object in 3D: the object can be anywhere in the camera viewing frustum!

Figure labels: camera, image plane, 2D point, 3D ray, camera viewing frustum

Image source: https://www.pcmag.com/encyclopedia_images/_FRUSTUM.GIF
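
The point-to-ray correspondence can be written down for a pinhole camera. The sketch below assumes intrinsics fx, fy (focal lengths in pixels) and cx, cy (principal point), none of which appear on the slide, and the function names are hypothetical. It returns the ray direction for a pixel and checks that points at different depths along that ray project back to the same pixel.

```python
def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Direction (in camera coordinates) of the 3D ray through pixel (u, v)."""
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D camera-frame point (X, Y, Z) with Z > 0."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics for a 640x480 camera.
fx = fy = 500.0
cx, cy = 320.0, 240.0

ray = pixel_to_ray(420.0, 140.0, fx, fy, cx, cy)
# Points at depth 2 and depth 10 along the same ray.
near = tuple(2.0 * c for c in ray)
far = tuple(10.0 * c for c in ray)
pixel_near = project(near, fx, fy, cx, cy)
pixel_far = project(far, fx, fy, cx, cy)
```

Both points land on the same pixel (up to floating-point rounding), which is why a single 2D point, or a single 2D box, does not determine depth on its own.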

3D Object Detection: Monocular Camera

- Same idea as Faster R-CNN, but proposals are in 3D
- 3D bounding box proposal, regress 3D box parameters + class score

Chen et al, “Monocular 3D Object Detection for Autonomous Driving”, CVPR 2016

3D Shape Prediction: Mesh R-CNN

Gkioxari et al, “Mesh R-CNN”, ICCV 2019

Recap: Lots of computer vision tasks!

Classification: CAT (no spatial extent)
Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)

This image is CC0 public domain

Next time: Recurrent Neural Networks
