Object Detection Using CNN

Lecture 9 discusses object detection and image segmentation, focusing on semantic segmentation as a core task in computer vision. It introduces various methods for semantic segmentation, including sliding window techniques and fully convolutional networks, which aim to classify each pixel in an image. The lecture also covers the importance of downsampling and upsampling in network design to maintain spatial resolution while processing images.

Uploaded by

geetha.r


Lecture 9: Object Detection and Image Segmentation

Fei-Fei Li, Jiajun Wu, Ruohan Gao, April 26, 2022

Image Classification: A core task in Computer Vision

(assume given a set of possible labels)

{dog, cat, truck, plane, ...}

cat

This image by Nikita is licensed under CC-BY 2.0

Computer Vision Tasks

Classification: CAT (no spatial extent)
Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)

This image is CC0 public domain

Semantic Segmentation
Semantic Segmentation: The Problem

Paired training data: for each training image, each pixel is labeled with a semantic category (GRASS, CAT, TREE, SKY, ...).

At test time, classify each pixel of a new image.

Semantic Segmentation Idea: Sliding Window

Full image

Impossible to classify without context

Q: how do we include context?

Q: how do we model this?

Extract a patch from the full image, then classify its center pixel with a CNN (e.g. Cow, Cow, Grass)

Problem: Very inefficient! Not reusing shared features between overlapping patches

Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014

Semantic Segmentation Idea: Convolution

Full image

An intuitive idea: encode the entire image with conv net, and do semantic segmentation on top.

Problem: classification architectures often reduce feature spatial sizes to go deeper, but semantic segmentation requires the output size to be the same as input size.

Semantic Segmentation Idea: Fully Convolutional
Design a network with only convolutional layers without downsampling operators to make predictions for pixels all at once!

Input: 3 x H x W
Conv, Conv, Conv, Conv
Convolutions: D x H x W
Scores: C x H x W
argmax
Predictions: H x W

Problem: convolutions at original image resolution will be very expensive ...

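
The last step above (scores of size C x H x W turned into predictions of size H x W by an argmax over classes) can be sketched as follows. This is a minimal illustration on nested lists; the function name and the toy scores are made up for the example and do not come from the slides.

```python
def argmax_predictions(scores):
    """Turn C x H x W class scores into an H x W map of class indices."""
    num_classes = len(scores)
    height = len(scores[0])
    width = len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][i][j]) for j in range(width)]
        for i in range(height)
    ]


# Toy scores (assumed values): C = 3 classes, H = 2, W = 2
scores = [
    [[0.1, 0.7], [0.2, 0.3]],  # class 0
    [[0.8, 0.2], [0.1, 0.3]],  # class 1
    [[0.1, 0.1], [0.7, 0.4]],  # class 2
]
predictions = argmax_predictions(scores)
print(predictions)
```

Each output entry is the index of the highest-scoring class at that pixel, so the spatial size of the prediction matches the input.
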
Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/4 x W/4
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
C x H x W
Predictions: H x W

Downsampling: Pooling, strided convolution
Upsampling: ???

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

In-Network upsampling: “Unpooling”

Nearest Neighbor
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4

“Bed of Nails”
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0

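
The two unpooling schemes above can be sketched in a few lines. This is a minimal illustration on nested lists with an upsampling factor of 2; the function names are chosen here for the example.

```python
def nearest_neighbor_unpool(x, factor=2):
    """Repeat every value factor times along both axes."""
    out = []
    for row in x:
        expanded = [v for v in row for _ in range(factor)]
        out.extend([list(expanded) for _ in range(factor)])
    return out


def bed_of_nails_unpool(x, factor=2):
    """Put every value in the top-left corner of its block, zeros elsewhere."""
    h, w = len(x), len(x[0])
    out = [[0] * (w * factor) for _ in range(h * factor)]
    for i in range(h):
        for j in range(w):
            out[i * factor][j * factor] = x[i][j]
    return out


x = [[1, 2], [3, 4]]
nn = nearest_neighbor_unpool(x)
nails = bed_of_nails_unpool(x)
for row in nn:
    print(row)
for row in nails:
    print(row)
```

Neither scheme has learnable parameters; both only fix how a 2 x 2 input is spread over a 4 x 4 output.
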
In-Network upsampling: “Max Unpooling”

Max Pooling: remember which element was max!
Input: 4 x 4
1 2 6 3
3 5 2 1
1 2 2 1
7 3 4 8
Output: 2 x 2
5 6
7 8

… Rest of the network …

Max Unpooling: use positions from pooling layer
Input: 2 x 2
1 2
3 4
Output: 4 x 4
0 0 2 0
0 1 0 0
0 0 0 0
3 0 0 4

Corresponding pairs of downsampling and upsampling layers

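
The pairing of max pooling and max unpooling can be sketched as below, using the same numbers as the slide. This is a minimal illustration for 2 x 2 windows with stride 2 on nested lists; the function names are chosen here for the example.

```python
def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2; also returns the position of each max."""
    h, w = len(x), len(x[0])
    pooled, positions = [], []
    for i in range(0, h, 2):
        pooled_row, pos_row = [], []
        for j in range(0, w, 2):
            window = [
                (x[i + di][j + dj], (i + di, j + dj))
                for di in range(2)
                for dj in range(2)
            ]
            value, pos = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            pos_row.append(pos)
        pooled.append(pooled_row)
        positions.append(pos_row)
    return pooled, positions


def max_unpool_2x2(x, positions, out_h, out_w):
    """Place each input value at the remembered max position, zeros elsewhere."""
    out = [[0] * out_w for _ in range(out_h)]
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            r, c = positions[i][j]
            out[r][c] = value
    return out


features = [[1, 2, 6, 3], [3, 5, 2, 1], [1, 2, 2, 1], [7, 3, 4, 8]]
pooled, positions = max_pool_2x2(features)
unpooled = max_unpool_2x2([[1, 2], [3, 4]], positions, 4, 4)
print(pooled)
for row in unpooled:
    print(row)
```

The unpooling layer needs the positions recorded by its matching pooling layer, which is why the two are used in corresponding pairs.
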
Learnable Upsampling

Recall: Normal 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4, Output: 4 x 4
Dot product between filter and input

Recall: Normal 3 x 3 convolution, stride 2 pad 1
Input: 4 x 4, Output: 2 x 2
Dot product between filter and input
Filter moves 2 pixels in the input for every one pixel in the output
Stride gives ratio between movement in input and output
We can interpret strided convolution as “learnable downsampling”.

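
The two input and output sizes above follow from the usual output-size formula for a convolution, floor((N + 2P - K) / S) + 1. A small sketch, with a function name chosen here for illustration:

```python
def conv_output_size(input_size, kernel_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


same = conv_output_size(4, kernel_size=3, stride=1, padding=1)
down = conv_output_size(4, kernel_size=3, stride=2, padding=1)
print(same, down)
```

With stride 1 the 4 x 4 input keeps its size; with stride 2 it shrinks to 2 x 2, which is the downsampling effect described above.
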
Learnable Upsampling: Transposed Convolution

3 x 3 transposed convolution, stride 2 pad 1
Input: 2 x 2, Output: 4 x 4
Input gives weight for filter
Filter moves 2 pixels in the output for every one pixel in the input
Stride gives ratio between movement in output and input
Sum where output overlaps

Q: Why is it called transposed convolution?

Learnable Upsampling: 1D Example

Input: [a, b]
Filter: [x, y, z]
Output: [ax, ay, az + bx, by, bz]

Output contains copies of the filter weighted by the input, summing where they overlap in the output

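
The 1D example can be reproduced with a short scatter-and-add loop. This is a minimal sketch assuming stride 2 and no padding or cropping; the numbers standing in for a, b and x, y, z are arbitrary.

```python
def transposed_conv1d(inputs, kernel, stride=2):
    """Each input value scales a copy of the kernel; overlapping copies are summed."""
    out_len = (len(inputs) - 1) * stride + len(kernel)
    output = [0] * out_len
    for i, value in enumerate(inputs):
        for k, weight in enumerate(kernel):
            output[i * stride + k] += value * weight
    return output


a, b = 1, 2            # input (assumed values)
x, y, z = 10, 20, 30   # filter (assumed values)
output = transposed_conv1d([a, b], [x, y, z], stride=2)
print(output)
```

The middle entry receives contributions from both copies of the filter (az + bx), matching the overlap in the slide.
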
Convolution as Matrix Multiplication (1D Example)

We can express convolution in terms of a matrix multiplication
Example: 1D conv, kernel size=3, stride=2, padding=1

Transposed convolution multiplies by the transpose of the same matrix:
Example: 1D transposed conv, kernel size=3, stride=2, padding=0

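
The matrix view can be made concrete with a small sketch. It assumes a length-4 input, builds the matrix that acts on the zero-padded input, and then multiplies by its transpose. The helper names and the numeric values are illustrative only.

```python
def conv1d_matrix(kernel, input_size, stride, padding):
    """Matrix X such that X @ padded_input equals the strided 1D convolution."""
    padded = input_size + 2 * padding
    out_size = (padded - len(kernel)) // stride + 1
    matrix = [[0] * padded for _ in range(out_size)]
    for row in range(out_size):
        for i, weight in enumerate(kernel):
            matrix[row][row * stride + i] = weight
    return matrix


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


kernel = [1, 2, 3]                       # stands in for (x, y, z)
X = conv1d_matrix(kernel, input_size=4, stride=2, padding=1)
padded_input = [0, 5, 6, 7, 8, 0]        # input (5, 6, 7, 8) with one zero each side
conv_out = matvec(X, padded_input)       # strided convolution: 4 values -> 2 values
transposed_out = matvec(transpose(X), [1, 2])  # transposed convolution: 2 -> 6 values
print(conv_out, transposed_out)
```

Multiplying by the transpose reproduces the pattern [ax, ay, az + bx, by, bz] from the 1D example, followed by one extra zero for the last padded position; cropping the padded positions is left out of this sketch.
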
Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Downsampling: Pooling, strided convolution
Upsampling: Unpooling or strided transposed convolution

Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/4 x W/4
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

Semantic Segmentation: Summary

Label each pixel in the image with a category label
Don’t differentiate instances, only care about pixels

Example labels: Sky, Trees, Cat, Cow, Grass

This image is CC0 public domain

Object Detection
Object Detection: Single Object (Classification + Localization)

Vector: 4096

Fully Connected: 4096 to 1000
Class Scores: Cat: 0.9, Dog: 0.05, Car: 0.01, ...
Correct label: Cat
Softmax Loss

Fully Connected: 4096 to 4
Box Coordinates (x, y, w, h)
Correct box: (x’, y’, w’, h’)
L2 Loss

Treat localization as a regression problem!

Multitask Loss: Softmax Loss + L2 Loss

This image is CC0 public domain

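
The multitask loss above (softmax loss on the class scores plus L2 loss on the box) can be sketched as follows. The `box_weight` parameter and all numeric values are assumptions added for illustration; the slides only state that the two losses are summed.

```python
import math


def softmax_cross_entropy(scores, correct_index):
    """Softmax loss for one example, computed in a numerically stable way."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    return -math.log(exps[correct_index] / sum(exps))


def l2_loss(pred_box, true_box):
    """Squared L2 distance between predicted and correct (x, y, w, h)."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box))


def multitask_loss(scores, correct_index, pred_box, true_box, box_weight=1.0):
    # box_weight is a hypothetical knob for balancing the two terms
    return softmax_cross_entropy(scores, correct_index) + box_weight * l2_loss(pred_box, true_box)


loss = multitask_loss(
    scores=[2.0, 0.5, -1.0],          # raw class scores (assumed values)
    correct_index=0,
    pred_box=(0.5, 0.5, 0.2, 0.3),
    true_box=(0.5, 0.4, 0.2, 0.3),
)
print(loss)
```

Both heads share the same 4096-dimensional feature vector, so gradients from the classification and regression terms flow into the same backbone.
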
Object Detection: Multiple Objects

Each image needs a different number of outputs!

CAT: (x, y, w, h)
4 numbers

DOG: (x, y, w, h)
DOG: (x, y, w, h)
CAT: (x, y, w, h)
12 numbers

DUCK: (x, y, w, h)
DUCK: (x, y, w, h)
….
Many numbers!

Object Detection: Multiple Objects
Apply a CNN to many different crops of the image, CNN classifies each crop as object or background

Crop 1: Dog? NO, Cat? NO, Background? YES
Crop 2: Dog? YES, Cat? NO, Background? NO
Crop 3: Dog? YES, Cat? NO, Background? NO
Crop 4: Dog? NO, Cat? YES, Background? NO

Q: What’s the problem with this approach?

Problem: Need to apply CNN to huge number of locations, scales, and aspect ratios, very computationally expensive!

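
To get a feel for how large that number is, one can count every axis-aligned box with integer corners inside an H x W image. The sketch below uses a closed form and a brute-force check; the 600 x 800 image size is an assumption chosen for illustration.

```python
def count_boxes(height, width):
    """Number of distinct boxes with integer size and position in an image."""
    return (height * (height + 1) // 2) * (width * (width + 1) // 2)


def count_boxes_brute_force(height, width):
    """Same count by summing the number of positions for every box size."""
    return sum(
        (height - h + 1) * (width - w + 1)
        for h in range(1, height + 1)
        for w in range(1, width + 1)
    )


total = count_boxes(600, 800)
print(total)
```

Even a modest image admits tens of billions of candidate boxes, which is why running a CNN on every crop is impractical.
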
Region Proposals: Selective Search
● Find “blobby” image regions that are likely to contain objects
● Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU

Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014

R-CNN (“Slow” R-CNN)

Input image
Regions of Interest (RoI) from a proposal method (~2k)
Warped image regions (224x224 pixels)
Forward each region through ConvNet (ImageNet-pretrained)
Classify regions with SVMs
Bbox reg: Predict “corrections” to the RoI: 4 numbers: (dx, dy, dw, dh)

Problem: Very slow! Need to do ~2k independent forward passes for each image!

Idea: Pass the image through convnet before cropping! Crop the conv feature instead!

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

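
The slides do not spell out how the four “corrections” (dx, dy, dw, dh) are applied to a RoI. The sketch below assumes the parameterization commonly used for R-CNN-style box regression: the center is shifted in units of the box size and the width and height are scaled exponentially. Treat the exact form as an assumption, not as the lecture's definition.

```python
import math


def apply_box_corrections(roi, deltas):
    """roi = (center x, center y, width, height); deltas = (dx, dy, dw, dh)."""
    x, y, w, h = roi
    dx, dy, dw, dh = deltas
    return (
        x + w * dx,          # shift center relative to box width
        y + h * dy,          # shift center relative to box height
        w * math.exp(dw),    # scale width
        h * math.exp(dh),    # scale height
    )


roi = (100.0, 80.0, 40.0, 20.0)
unchanged = apply_box_corrections(roi, (0.0, 0.0, 0.0, 0.0))
shifted = apply_box_corrections(roi, (0.5, -0.5, 0.0, 0.0))
print(unchanged, shifted)
```

With this form, all-zero corrections leave the proposal as it is, so the regressor only has to learn small adjustments.
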
Fast R-CNN

Input image
Run whole image through ConvNet (“Backbone” network: AlexNet, VGG, ResNet, etc)
“conv5” features
Regions of Interest (RoIs) from a proposal method
Crop + Resize features
Per-Region Network (CNN)
Object category: Linear + softmax
Box offset: Linear

[Figure: Fast R-CNN pipeline; side panel: “Slow” R-CNN]

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Cropping Features: RoI Pool

Input Image (e.g. 3 x 640 x 480)
CNN
Image features: C x H x W (e.g. 512 x 20 x 15)

Project proposal onto features
“Snap” to grid cells

Q: how do we resize the 512 x 5 x 4 region to, e.g., a 512 x 2 x 2 tensor?

Divide into 2x2 grid of (roughly) equal subregions
Max-pool within each subregion

Region features (here 512 x 2 x 2; in practice e.g. 512 x 7 x 7)
Region features always the same size even if input regions have different sizes!

Girshick, “Fast R-CNN”, ICCV 2015.

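
The three RoI Pool steps (snap, divide, max-pool) can be sketched on a single-channel feature map. This is a minimal illustration: snapping uses plain rounding, the region is assumed to be at least as large as the output grid, and the function name and toy values are made up for the example.

```python
def roi_pool(feature_map, roi, output_size=2):
    """feature_map: H x W nested list; roi: (x0, y0, x1, y1) in feature coordinates."""
    # "Snap" the projected proposal to integer grid cells (rounding is an assumption)
    x0, y0, x1, y1 = (int(round(v)) for v in roi)
    region_h, region_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(output_size):
        # Roughly equal row ranges for each bin
        r_start = y0 + (i * region_h) // output_size
        r_end = y0 + ((i + 1) * region_h) // output_size
        row = []
        for j in range(output_size):
            c_start = x0 + (j * region_w) // output_size
            c_end = x0 + ((j + 1) * region_w) // output_size
            # Max-pool within the subregion
            row.append(max(
                feature_map[r][c]
                for r in range(r_start, r_end)
                for c in range(c_start, c_end)
            ))
        pooled.append(row)
    return pooled


# Toy 6 x 8 feature map whose value encodes its position: 10 * row + column
feature_map = [[10 * r + c for c in range(8)] for r in range(6)]
pooled = roi_pool(feature_map, roi=(0.8, 1.2, 5.7, 5.1), output_size=2)
print(pooled)
```

After snapping, the region here is 5 cells wide and 4 cells tall, so the column bins have unequal widths (2 and 3), which is the “roughly equal” split mentioned above.
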
Cropping Features: RoI Align

Input Image (e.g. 3 x 640 x 480)
CNN
Image features: C x H x W (e.g. 512 x 20 x 15)

Project proposal onto features
No “snapping”!
Sample at regular points in each subregion using bilinear interpolation

Feature $f_{xy}$ for point (x, y) is a linear combination of features at its four neighboring grid cells:
$f_{11} \in \mathbb{R}^{512}$ at $(x_1, y_1)$, $f_{21} \in \mathbb{R}^{512}$ at $(x_2, y_1)$, $f_{12} \in \mathbb{R}^{512}$ at $(x_1, y_2)$, $f_{22} \in \mathbb{R}^{512}$ at $(x_2, y_2)$

Max-pool within each subregion

Region features (here 512 x 2 x 2; in practice e.g. 512 x 7 x 7)

He et al, “Mask R-CNN”, ICCV 2017

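
The bilinear sampling step can be sketched with the standard bilinear weights, where each of the four neighbors contributes max(0, 1 - |x - x_i|) * max(0, 1 - |y - y_j|). For brevity the sketch uses one scalar per grid cell instead of a 512-dimensional feature; the function name and toy values are illustrative.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Interpolate a single-channel H x W feature map at a real-valued point (x, y)."""
    h, w = len(feature_map), len(feature_map[0])
    x1, y1 = int(math.floor(x)), int(math.floor(y))
    x2, y2 = x1 + 1, y1 + 1
    total = 0.0
    for xi, yj in ((x1, y1), (x2, y1), (x1, y2), (x2, y2)):
        if 0 <= xi < w and 0 <= yj < h:  # neighbors outside the map contribute nothing
            weight = max(0.0, 1 - abs(x - xi)) * max(0.0, 1 - abs(y - yj))
            total += weight * feature_map[yj][xi]
    return total


feature_map = [[0.0, 10.0], [20.0, 30.0]]
center = bilinear_sample(feature_map, 0.5, 0.5)
corner = bilinear_sample(feature_map, 1.0, 0.0)
edge = bilinear_sample(feature_map, 0.25, 0.0)
print(center, corner, edge)
```

Because the sample points are not rounded to grid cells, the extracted features stay aligned with the original proposal.
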
R-CNN vs Fast R-CNN

Problem: Runtime dominated by region proposals!

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015

Faster R-CNN: Make CNN do proposals!

Insert Region Proposal Network (RPN) to predict proposals from features

Otherwise same as Fast R-CNN: Crop features for each proposal, classify each one

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
Figure copyright 2015, Ross Girshick; reproduced with permission

Region Proposal Network

Input Image (e.g. 3 x 640 x 480)
CNN
Image features (e.g. 512 x 20 x 15)

Imagine an anchor box of fixed size at each point in the feature map

Conv: Anchor is an object? 1 x 20 x 15
At each point, predict whether the corresponding anchor contains an object (binary classification)

Conv: Box corrections 4 x 20 x 15
For positive boxes, also predict a correction from the anchor to the ground-truth box (regress 4 numbers per pixel)

In practice use K different anchor boxes of different size / scale at each point
Anchor is an object? K x 20 x 15
Box transforms: 4K x 20 x 15

Sort the K * 20 * 15 boxes by their “objectness” score, take top ~300 as our proposals

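
The final selection step (sort all K * 20 * 15 anchors by objectness, keep the top ~300) can be sketched as follows. The value K = 9 and the random scores are assumptions for illustration; in a real RPN the scores come from the conv head.

```python
import random


def top_proposals(objectness, top_n=300):
    """objectness: K x H x W nested lists -> top_n (score, k, i, j), best first."""
    scored = [
        (score, k, i, j)
        for k, plane in enumerate(objectness)
        for i, row in enumerate(plane)
        for j, score in enumerate(row)
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_n]


K, H, W = 9, 20, 15          # K = 9 anchors per position is an assumed value
rng = random.Random(0)       # stand-in for the network's objectness scores
objectness = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(K)]
proposals = top_proposals(objectness)
print(len(proposals), K * H * W)
```

Each kept entry identifies an anchor by its index k and feature-map position (i, j); the matching box transform would then be applied to that anchor to produce the proposal.
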
Faster R-CNN: Make CNN do proposals!

Jointly train with 4 losses:

1. RPN classify object / not object
2. RPN regress box coordinates
3. Final classification score (object classes)
4. Final box coordinates

Glossing over many details:

- Ignore overlapping proposals with non-max suppression
- How are anchors determined?
- How do we sample positive / negative samples for training the RPN?
- How to parameterize bounding box regression?

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
Figure copyright 2015, Ross Girshick; reproduced with permission

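
Non-max suppression, the first glossed-over detail, can be sketched as a greedy loop over boxes sorted by score. This is a minimal version; boxes are given as (x1, y1, x2, y2) corners and the IoU threshold of 0.5 is an assumed value, not one stated in the slides.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for idx in order:
        if all(iou(boxes[idx], boxes[k]) <= iou_threshold for k in keep):
            keep.append(idx)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

The second box overlaps the first with IoU 81/119 and is dropped, while the distant third box is kept.
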
Faster R-CNN: Make CNN do proposals!

Faster R-CNN is a two-stage object detector

First stage: Run once per image

- Backbone network
- Region proposal network

Second stage: Run once per region

- Crop features: RoI pool / align
- Predict object class
- Predict bbox offset

Do we really need the second stage?

Single-Stage Object Detectors: YOLO / SSD / RetinaNet

Input image: 3 x H x W
Divide image into grid: 7 x 7
Imagine a set of base boxes centered at each grid cell. Here B = 3

Within each grid cell:
- Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
- Predict scores for each of C classes (including background as a class)
- Looks a lot like RPN, but category-specific!

Output: 7 x 7 x (5 * B + C)

Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
Lin et al, “Focal Loss for Dense Object Detection”, ICCV 2017

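
The output size 7 x 7 x (5 * B + C) can be checked with a one-line helper. B = 3 comes from the slide; C = 20 is an assumed number of classes used only for this example.

```python
def single_stage_output_shape(grid_size, boxes_per_cell, num_classes):
    """Per cell: 5 numbers for each of B base boxes plus C class scores."""
    return (grid_size, grid_size, 5 * boxes_per_cell + num_classes)


shape = single_stage_output_shape(grid_size=7, boxes_per_cell=3, num_classes=20)
num_values = shape[0] * shape[1] * shape[2]
print(shape, num_values)
```

The whole detection output is a single fixed-size tensor, which is what lets a single-stage detector skip the per-region second stage.
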
Object Detection: Lots of variables ...
Backbone “Meta-Architecture” Takeaways
Network Two-stage: Faster R-CNN Faster R-CNN is slower
VGG16 Single-stage: YOLO / SSD but more accurate
ResNet-101 Hybrid: R-FCN
Inception V2 SSD is much faster but
Inception V3 Image Size not as accurate
Inception # Region Proposals
ResNet … Bigger / Deeper
MobileNet backbones work better
Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017

R-FCN: Dai et al, “R-FCN: Object Detection via Region-based Fully Convolutional Networks”, NIPS 2016
Inception-V2: Ioffe and Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015
Inception V3: Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”, arXiv 2016
Inception ResNet: Szegedy et al, “Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv 2016
MobileNet: Howard et al, “Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv 2017

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 9 - 89 April 26, 2022
Object Detection: Lots of variables ...
Backbone Network:
- VGG16
- ResNet-101
- Inception V2
- Inception V3
- Inception ResNet
- MobileNet
“Meta-Architecture”:
- Two-stage: Faster R-CNN
- Single-stage: YOLO / SSD
- Hybrid: R-FCN
Other variables:
- Image Size
- # Region Proposals
- …
Takeaways:
- Faster R-CNN is slower but more accurate
- SSD is much faster but not as accurate
- Bigger / Deeper backbones work better
Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
Zou et al, “Object Detection in 20 Years: A Survey”, arXiv 2019
R-FCN: Dai et al, “R-FCN: Object Detection via Region-based Fully Convolutional Networks”, NIPS 2016
Inception V2: Ioffe and Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015
Inception V3: Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”, arXiv 2016
Inception ResNet: Szegedy et al, “Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv 2016
MobileNet: Howard et al, “Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv 2017

Instance Segmentation
Classification: CAT (no spatial extent)
Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)

Object Detection:
Faster R-CNN

Instance Segmentation: Mask Prediction

Mask R-CNN

Add a small mask network that operates on each RoI and predicts a 28x28 binary mask

He et al, “Mask R-CNN”, ICCV 2017

Mask R-CNN
[Figure: CNN + RPN, then RoI Align, then Conv layers; feature maps are labeled 256 x 14 x 14 and the mask output is C x 28 x 28]
Outputs per RoI:
- Classification Scores: C
- Box coordinates (per class): 4 * C
- Predict a mask for each of C classes: C x 28 x 28
He et al, “Mask R-CNN”, arXiv 2017
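
Because the mask branch emits one mask per class (C x 28 x 28), a single instance mask still has to be chosen for each RoI. The sketch below picks the channel of the top-scoring class and binarizes it at 0.5 after a sigmoid. This selection rule and threshold are not stated on the slide; they are an assumption based on the procedure described in the Mask R-CNN paper, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def select_instance_mask(class_scores, mask_logits, threshold=0.5):
    """Pick the mask of the top-scoring class and binarize it.

    class_scores: list of C scores for one RoI.
    mask_logits:  list of C masks, each a list of rows of raw logits
                  (C x 28 x 28 on the slide; any H x W works here).
    Returns (class_index, binary_mask).
    """
    if len(class_scores) != len(mask_logits):
        raise ValueError("need exactly one mask per class")
    best = max(range(len(class_scores)), key=lambda c: class_scores[c])
    binary = [
        [1 if sigmoid(v) >= threshold else 0 for v in row]
        for row in mask_logits[best]
    ]
    return best, binary


if __name__ == "__main__":
    scores = [0.1, 0.7, 0.2]
    masks = [
        [[-3.0, -3.0], [-3.0, -3.0]],
        [[2.0, -2.0], [0.0, -1000.0]],
        [[5.0, 5.0], [5.0, 5.0]],
    ]
    print(select_instance_mask(scores, masks))
```

The usage example uses 2 x 2 masks only to keep the data readable; the same function applies unchanged to 28 x 28 masks.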

Mask R-CNN: Example Mask Training Targets

Mask R-CNN: Very Good Results!

He et al, “Mask R-CNN”, ICCV 2017

Mask R-CNN
Also does pose

He et al, “Mask R-CNN”, ICCV 2017

Open Source Frameworks
Lots of good implementations on GitHub!

TensorFlow Detection API:
https://github.com/tensorflow/models/tree/master/research/object_detection
Faster R-CNN, SSD, R-FCN, Mask R-CNN, ...

Detectron2 (PyTorch)
https://github.com/facebookresearch/detectron2
Mask R-CNN, RetinaNet, Faster R-CNN, RPN, Fast R-CNN, R-FCN, ...

Finetune on your own dataset with pre-trained models

Beyond 2D Object Detection...

Object Detection + Captioning
= Dense Captioning

Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016
Figure copyright IEEE, 2016. Reproduced for educational purposes.

Dense Video Captioning

Ranjay Krishna et al., “Dense-Captioning Events in Videos”, ICCV 2017

Objects + Relationships = Scene Graphs

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz,
Stephanie Chen et al. "Visual genome: Connecting language and vision using
crowdsourced dense image annotations." International Journal of Computer Vision 123,
no. 1 (2017): 32-73.

Scene Graph Prediction

Xu, Zhu, Choy, and Fei-Fei, “Scene Graph Generation by Iterative Message Passing”, CVPR 2017
Figure copyright IEEE, 2018. Reproduced for educational purposes.

3D Object Detection
2D Object Detection:
2D bounding box
(x, y, w, h)

3D Object Detection:
3D oriented bounding box
(x, y, z, w, h, l, r, p, y), where r, p, and the final y are roll, pitch, and yaw

Simplified bbox: no roll & pitch

Much harder problem than 2D object detection!
This image is CC0 public domain
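
To make the simplified parameterization concrete, here is a minimal sketch of a yaw-only 3D box and its eight corners. The class name `Box3D`, the choice of z as the vertical axis, and the mapping of w, l, h to axes are all assumptions for illustration; datasets use differing conventions.

```python
import math
from dataclasses import dataclass


@dataclass
class Box3D:
    """Simplified 3D box: center, size, and yaw only (no roll or pitch)."""
    x: float
    y: float
    z: float
    w: float    # extent along the box's local x axis (assumed)
    h: float    # extent along the vertical z axis (assumed)
    l: float    # extent along the box's local y axis (assumed)
    yaw: float  # rotation about the vertical axis, in radians

    def corners(self):
        """Return the 8 corner points as (x, y, z) tuples."""
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        pts = []
        for dx in (-self.w / 2, self.w / 2):
            for dy in (-self.l / 2, self.l / 2):
                for dz in (-self.h / 2, self.h / 2):
                    # Rotate the horizontal offset by yaw, then translate.
                    rx = c * dx - s * dy
                    ry = s * dx + c * dy
                    pts.append((self.x + rx, self.y + ry, self.z + dz))
        return pts


if __name__ == "__main__":
    box = Box3D(x=0.0, y=0.0, z=0.0, w=2.0, h=4.0, l=6.0, yaw=0.0)
    for corner in box.corners():
        print(corner)
```

A 2D box needs 4 numbers, the simplified 3D box 7, and the fully oriented box 9, which is one reason the 3D problem is harder.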

3D Object Detection: Simple Camera Model
A point on the image plane corresponds to a ray in the 3D space

A 2D bounding box on an image is a frustum in the 3D space

Localize an object in 3D: The object can be anywhere in the camera viewing frustum!

[Figure labels: camera, image plane, 2D point, 3D ray, camera viewing frustum]

Image source: https://www.pcmag.com/encyclopedia_images/_FRUSTUM.GIF
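
The point-to-ray correspondence can be sketched with a pinhole camera. The intrinsics `fx`, `fy`, `cx`, `cy` and the function names are assumptions introduced here; the slide does not define a specific camera model beyond the figure.

```python
def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Direction of the 3D ray through pixel (u, v), in camera coordinates.

    The direction is scaled so that its Z component equals 1.
    """
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


def point_on_ray(ray, depth):
    """Point along a ray from the camera center whose Z value is `depth`."""
    return tuple(depth * r for r in ray)


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (X, Y, Z) with Z > 0 to a pixel."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * X / Z + cx, fy * Y / Z + cy)


if __name__ == "__main__":
    intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    ray = pixel_to_ray(400.0, 300.0, **intrinsics)
    # Every depth along the ray lands on the same pixel, so a single
    # image cannot tell these 3D points apart.
    for depth in (2.0, 10.0, 50.0):
        print(depth, project(point_on_ray(ray, depth), **intrinsics))
```

Applying the same back-projection to the four corners of a 2D box gives the four edges of its frustum.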

3D Object Detection: Monocular Camera

Faster R-CNN

- Same idea as Faster R-CNN, but proposals are in 3D
- 3D bounding box proposal, regress 3D box parameters + class score
Chen, Xiaozhi, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, and Raquel
Urtasun. "Monocular 3d object detection for autonomous driving." CVPR 2016.

3D Shape Prediction: Mesh R-CNN

Gkioxari et al, “Mesh R-CNN”, ICCV 2019

Recap: Lots of computer vision tasks!
Classification: CAT (no spatial extent)
Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)
This image is CC0 public domain

Next time: Recurrent Neural Networks

