Lect-7 Segmentation Localization
CS-878
Week-07
Image Classification
[Figure: computer vision tasks compared. Classification: CAT (no spatial extent). Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels). Object Detection / Instance Segmentation: DOG, DOG, CAT (multiple objects). This image is CC0 public domain.]
Semantic Segmentation
▪ Applications
▪ Assisting the partially sighted
▪ Medical diagnosis
▪ Image editing
Segmentation Tasks
• Paired training data: for each training image, each pixel is labeled with a semantic category (GRASS, CAT, TREE, SKY, ...).
• At test time, classify each pixel of a new image.
One straightforward strategy is to modify our classification network.
J. Long, E. Shelhamer, and T. Darrell, Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
Fully "Convolutional" networks (FCN)
• Reuse networks pre-trained for classification (VGG, AlexNet, etc.) for segmentation!
J. Long, E. Shelhamer, and T. Darrell, Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
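As a concrete illustration, here is a minimal sketch (assuming PyTorch and a recent torchvision; the layer choices are illustrative, not the exact FCN-32s configuration) of convolutionalizing a pre-trained classifier for segmentation:

```python
import torch
import torch.nn as nn
from torchvision import models

class TinyFCN(nn.Module):
    """Reuse a pre-trained classifier (here VGG-16) as a per-pixel predictor."""
    def __init__(self, num_classes=21):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.features = vgg.features                    # conv/pool layers, downsample by 32
        self.score = nn.Conv2d(512, num_classes, 1)     # 1x1 conv replaces the FC classifier
        # Learned upsampling back to the input resolution (factor 32)
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=64, stride=32, padding=16)

    def forward(self, x):
        h, w = x.shape[-2:]
        x = self.score(self.features(x))   # (N, num_classes, H/32, W/32)
        x = self.upsample(x)               # upsample back toward (H, W)
        return x[..., :h, :w]              # crop to the exact input size

out = TinyFCN()(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 21, 224, 224]) -- one score per class per pixel
```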
In-Network upsampling: “Unpooling”
[Figure: in-network upsampling of a 2 x 2 input to a 4 x 4 output. "Nearest Neighbor" copies each value into a 2 x 2 block; "Bed of Nails" places each value in one corner and fills the rest with zeros; Max Unpooling uses corresponding pairs of downsampling and upsampling layers (with the rest of the network in between): it remembers which position was the max and places each value back there, zeros elsewhere.]
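A small sketch of max unpooling (PyTorch assumed): the unpooling layer places each value back at the position remembered by its paired max-pooling layer.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)  # remembers which element was the max
unpool = nn.MaxUnpool2d(2, stride=2)                    # the corresponding upsampling layer

x = torch.tensor([[[[1., 2., 6., 3.],
                    [3., 5., 2., 1.],
                    [1., 2., 2., 1.],
                    [7., 3., 4., 8.]]]])

y, idx = pool(x)      # 2x2 maxima [[5, 6], [7, 8]] plus their positions
# ... rest of the network would go here ...
up = unpool(y, idx)   # 4x4 output: maxima restored at the remembered positions, zeros elsewhere
```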
Learnable Upsampling: Transposed Convolution
• In a normal (strided) convolution, each output value is a dot product between the filter and a patch of the input.
• In a transposed convolution, each input value instead gives the weight for a copy of the filter written into the output (input 2 x 2 → output 4 x 4).
• 1D example: input [a, b], filter [x, y, z], stride 2 → output [ax, ay, az + bx, by, bz]. The output contains copies of the filter weighted by the input, summing where the copies overlap.
• Example: 1D conv, kernel size=3, stride=2, padding=1 vs. 1D transposed conv, kernel size=3, stride=2, padding=0.
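A quick numerical check of the 1D example above (PyTorch assumed): a transposed convolution writes one copy of the filter per input element, weighted by that element, and sums where copies overlap.

```python
import torch
import torch.nn.functional as F

a, b = 2.0, 3.0                 # input [a, b]
x, y, z = 1.0, 10.0, 100.0      # filter [x, y, z]

inp = torch.tensor([[[a, b]]])        # shape (batch, channels, length)
filt = torch.tensor([[[x, y, z]]])    # shape (in_ch, out_ch, kernel_size)

out = F.conv_transpose1d(inp, filt, stride=2)
print(out)  # [[[ax, ay, az + bx, by, bz]]] = [[[2., 20., 203., 30., 300.]]]
```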
H. Noh, S. Hong, and B. Han, Learning Deconvolution Network for Semantic Segmentation, ICCV 2015
Input image 14 × 14 deconvolutional layer 28 × 28 unpooling layer 28 × 28 deconvolutional layer 56 × 56 unpooling layer
56 × 56 deconvolutional layer 112 × 112 unpooling layer 112 × 112 deconvolutional layer 224 × 224 unpooling layer 224 × 224 deconvolutional layer
Image source: H. Noh, S. Hong, and B. Han, Learning Deconvolution Network for Semantic Segmentation, ICCV 2015
Learned upsampling architectures
SegNet
Source: Olaf Ronneberger, Philipp Fischer, Thomas Brox “U-Net: Convolutional Networks for Biomedical Image Segmentation”, MICCAI, 2015
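A minimal encoder-decoder sketch in the spirit of these architectures (PyTorch assumed; the depth and channel counts are toy-sized, far smaller than the real SegNet/U-Net), with one U-Net-style skip connection:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Two-level encoder-decoder with one U-Net-style skip connection."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                                    # downsample 2x
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)  # learned upsampling
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, num_classes, 1))        # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                    # high-resolution encoder features
        m = self.mid(self.down(e))         # low-resolution features
        u = self.up(m)                     # back to high resolution
        return self.dec(torch.cat([u, e], dim=1))  # skip connection: concatenate encoder features

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])
```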
Object detection
Object Recognition
Object localization
Human Detection
• UAV images
• Surveillance images
Object detection
• Multiple objects
Why detection?
• Self-driving car
A simple solution
Sliding window
This is a chair
Template Matching
Epic fail!
Simple template matching is not going to work.
Sliding window
• General approach (see the sketch after this list)
• Scan all possible locations
• Extract features
• Classify features
• Post-processing
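A sketch of this recipe (PyTorch assumed; `classifier` is a hypothetical stand-in for any model that scores a fixed-size window):

```python
import torch

def sliding_window_detect(image, classifier, win=64, stride=32, thresh=0.9):
    """image: (3, H, W) tensor; classifier: maps a (N, 3, win, win) batch to (N,) scores."""
    _, H, W = image.shape
    boxes, scores = [], []
    for y in range(0, H - win + 1, stride):          # scan all possible locations
        for x in range(0, W - win + 1, stride):
            crop = image[:, y:y + win, x:x + win]    # extract a window ("features")
            score = classifier(crop.unsqueeze(0)).item()   # classify it
            if score > thresh:                       # post-processing: keep confident windows
                boxes.append((x, y, win, win))
                scores.append(score)
    # In practice: repeat at multiple scales and follow with non-maximum suppression.
    return boxes, scores
```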
Evaluation
• True positives
• False positives
• False negatives
Evaluation
• Precision
• Precision is the ability of a model to identify only the relevant objects.
• It is the percentage of correct positive predictions and is given by:
  Precision = TP / (TP + FP)
• Recall
• Recall is the ability of a model to find all the relevant cases.
• It is the percentage of true positives detected among all relevant ground truths and is given by:
  Recall = TP / (TP + FN)
Evaluation
• Reference implementation of these metrics: https://github.com/rafaelpadilla/Object-Detection-Metrics
Average precision (AP)
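A rough sketch of these metrics (NumPy assumed; it presumes detections have already been matched to ground truth, e.g. by an IoU threshold, and approximates AP as the area under the precision-recall curve rather than the interpolated PASCAL/COCO definitions):

```python
import numpy as np

def precision_recall_ap(scores, is_tp, num_gt):
    """scores: one confidence per detection; is_tp: 1 if that detection matches a
    ground-truth box (true positive), else 0 (false positive); num_gt: number of
    ground-truth boxes (unmatched ones are the false negatives)."""
    order = np.argsort(-np.asarray(scores, dtype=float))      # rank detections by confidence
    tp = np.cumsum(np.asarray(is_tp, dtype=float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=float)[order])
    precision = tp / (tp + fp)    # TP / (TP + FP)
    recall = tp / num_gt          # TP / (TP + FN)
    ap = np.trapz(precision, recall)   # crude area under the precision-recall curve
    return precision, recall, ap

p, r, ap = precision_recall_ap(scores=[0.9, 0.8, 0.6], is_tp=[1, 0, 1], num_gt=3)
```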
Simple Recipe for Object Detection
[Architecture: Image → Convolution and Pooling → Final conv feature map, then two sets of Fully-connected layers: a "classification head" producing class scores (softmax loss) and a "regression head" producing box coordinates (L2 loss).]
Object Detection: Single Object (Classification + Localization)
Treat localization as a regression problem!
[Figure: a 4096-d feature vector feeds two heads. Fully connected 4096 → 1000: class scores (Cat: 0.9, Dog: 0.05, Car: 0.01, ...), softmax loss against the correct label (Cat). Fully connected 4096 → 4: box coordinates (x, y, w, h), L2 loss against the correct box (x', y', w', h'). This image is CC0 public domain.]
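A minimal sketch of this two-head recipe (PyTorch assumed; the 4096/1000/4 sizes mirror the slide, everything else is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClsLocHead(nn.Module):
    """Two heads on a shared 4096-d feature vector: class scores and one box."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.cls_head = nn.Linear(4096, num_classes)  # "classification head": 4096 -> 1000
        self.box_head = nn.Linear(4096, 4)            # "regression head": 4096 -> 4 (x, y, w, h)

    def forward(self, feat):
        return self.cls_head(feat), self.box_head(feat)

feat = torch.randn(8, 4096)            # features from the conv backbone + FC layers
labels = torch.randint(0, 1000, (8,))  # correct label (e.g. "cat")
gt_boxes = torch.rand(8, 4)            # correct box (x', y', w', h')

scores, boxes = ClsLocHead()(feat)
loss = F.cross_entropy(scores, labels) + F.mse_loss(boxes, gt_boxes)  # softmax loss + L2 loss
loss.backward()
```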
Detection as Regression?
• DOG (x, y, w, h), CAT (x, y, w, h), CAT (x, y, w, h), DUCK (x, y, w, h) = 16 numbers
• DOG (x, y, w, h), CAT (x, y, w, h) = 8 numbers
• CAT (x, y, w, h), CAT (x, y, w, h), ..., CAT (x, y, w, h) = many numbers
• Each image needs a different number of outputs!
Detection as Classification
• CAT? NO, DOG? NO
• CAT? YES!, DOG? NO
• CAT? NO, DOG? YES
Problem:
• Need to test many positions and scales
  • Search at different scales
  • Search at different positions
• ... while using a computationally demanding classifier (CNN) for each window
R-CNN
• Input image
• Regions of Interest (RoI) from a proposal method (~2k); convert regions to boxes
• Warp image regions to 224 x 224 pixels
• Forward each region through the ConvNet (ImageNet-pretrained)
• Classify region features with SVMs
Girshick et al., "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014.
Figure copyright Ross Girshick, 2015; reproduced with permission.
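A sketch of the per-region forward pass that makes "slow" R-CNN slow (PyTorch/torchvision assumed; the proposal method and the SVM step are taken as given and omitted):

```python
import torch
import torch.nn.functional as F
from torchvision import models

backbone = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

def rcnn_region_features(image, proposals):
    """image: (3, H, W) tensor; proposals: list of (x1, y1, x2, y2) integer boxes,
    ~2k of them from an external proposal method."""
    feats = []
    with torch.no_grad():
        for (x1, y1, x2, y2) in proposals:
            crop = image[:, y1:y2, x1:x2]                                 # RoI crop
            warped = F.interpolate(crop.unsqueeze(0), size=(224, 224),
                                   mode="bilinear", align_corners=False)  # warp to 224 x 224
            feats.append(backbone.features(warped).flatten(1))            # one forward pass per region
    return torch.cat(feats)  # cached pool5-style features, later fed to per-class SVMs
```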
R-CNN Training (fine-tuning)
• Start from a network trained for ImageNet classification: Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores (1000 classes), softmax loss.
• Fine-tune for detection: same pipeline, but the classifier now outputs 21 classes (20 object classes + background), again with softmax loss.
R-CNN Training (feature extraction)
• Run each warped region through the Convolution and Pooling layers and cache its pool5 features.
R-CNN Training (train classifier)
• Step 4: Train one binary SVM per class to classify region features.
  • Positive vs. negative samples for the cat SVM; positive vs. negative samples for the dog SVM.
R-CNN Training (bounding box regression/prediction)
• Step 5 (bbox regression): For each class, train a linear regression model to map from cached features to offsets to GT boxes, to make up for "slightly wrong" proposals.
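A small sketch of the offset targets such a regressor learns, using the standard R-CNN parameterization (boxes assumed to be in center-size format):

```python
import torch

def box_to_deltas(proposal, gt):
    """proposal, gt: (N, 4) boxes as (cx, cy, w, h).
    Targets: how to shift/scale a 'slightly wrong' proposal onto the ground-truth box."""
    px, py, pw, ph = proposal.unbind(dim=1)
    gx, gy, gw, gh = gt.unbind(dim=1)
    tx = (gx - px) / pw              # center shift, in units of proposal width
    ty = (gy - py) / ph              # center shift, in units of proposal height
    tw = torch.log(gw / pw)          # log-scale change of width
    th = torch.log(gh / ph)          # log-scale change of height
    return torch.stack([tx, ty, tw, th], dim=1)   # regression targets for the linear model
```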
Issue #1 with R-CNN
• Slow at run-time
  • Multiple forward passes for each proposal
  • There are thousands of proposals
• Solution
  • A single forward pass for each image?
Issue #2 with R-CNN
• Solution
  • End-to-end training?
Issue #3 with R-CNN
• Solution
  • A single forward pass for each image?
Solution
• Fast R-CNN
• Single forward pass for each image
• No separate classifier
• End-to-end training
Fast R-CNN (vs. "slow" R-CNN)
• Run the whole image through the ConvNet to get "conv5" image features.
• "Backbone" network: AlexNet, VGG, ResNet, etc.
Girshick, "Fast R-CNN", ICCV 2015. Figure copyright Ross Girshick, 2015; reproduced with permission.
• Crop and resize features for each Region of Interest (RoI).
• Image features: C x H x W (e.g. 512 x 20 x 15, computed from a 3 x 640 x 480 input image).
• Region features: here 512 x 2 x 2; in practice e.g. 512 x 7 x 7.
• Region features are always the same size, even if the input regions have different sizes!
Girshick, "Fast R-CNN", ICCV 2015.
Problem: Region features slightly misaligned
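A sketch of cropping fixed-size region features from the shared image features (torchvision assumed); `roi_align` avoids the coordinate snapping that leaves `roi_pool` features slightly misaligned:

```python
import torch
from torchvision.ops import roi_pool, roi_align

feats = torch.randn(1, 512, 20, 15)          # image features C x H x W (e.g. 512 x 20 x 15)
# Boxes in original-image coordinates (x1, y1, x2, y2), with a batch index as the first column
rois = torch.tensor([[0, 40., 40., 200., 320.],
                     [0, 10., 60., 150., 400.]])

# spatial_scale maps image coordinates onto the feature map (here 1/32)
pooled = roi_pool(feats, rois, output_size=(7, 7), spatial_scale=1 / 32)   # (2, 512, 7, 7)
aligned = roi_align(feats, rois, output_size=(7, 7), spatial_scale=1 / 32,
                    sampling_ratio=2)                                       # (2, 512, 7, 7)
# Region features come out the same size even though the input regions differ.
```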
Problem: Runtime is dominated by region proposals!
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
Figure copyright 2015, Ross Girshick; reproduced with permission
Faster R-CNN: Make CNN do proposals!
Region Proposal Network (RPN) with anchor boxes
• Run a small conv network on top of the CNN image features.
• For a single anchor box at each location: "is the anchor an object?" score (1 x 20 x 15) and box corrections (4 x 20 x 15).
• For K anchor boxes at each location: objectness scores (K x 20 x 15) and box transforms (4K x 20 x 15).
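A minimal sketch of the RPN prediction head (PyTorch assumed; K = 9 anchors and the 20 x 15 feature map follow the slide, the rest is illustrative):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """For each of the K anchors at every feature-map location, predict an
    objectness score ('is this anchor an object?') and 4 box transforms."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, 3, padding=1)
        self.objectness = nn.Conv2d(512, num_anchors, 1)       # K x 20 x 15
        self.box_deltas = nn.Conv2d(512, 4 * num_anchors, 1)   # 4K x 20 x 15

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.objectness(h), self.box_deltas(h)

obj, deltas = RPNHead()(torch.randn(1, 512, 20, 15))
print(obj.shape, deltas.shape)  # torch.Size([1, 9, 20, 15]) torch.Size([1, 36, 20, 15])
```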
Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015.
Figure copyright 2015, Ross Girshick; reproduced with permission.
Faster R-CNN is a two-stage object detector.
[Girshick et al., 2013. Rich feature hierarchies for accurate object detection and semantic segmentation]
[Girshick, 2015. Fast R-CNN]
[Ren et al., 2016. Faster R-CNN: Towards real-time object detection with region proposal networks]
Do we really need the second stage?
Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection", 2015.
He et al., "Mask R-CNN", ICCV 2017
Mask R-CNN
• Classification scores: C
• Box coordinates (per class): 4 * C
• Mask prediction: C x 28 x 28
He et al., "Mask R-CNN", ICCV 2017
Detectron2 (PyTorch)
https://github.com/facebookresearch/detectron2
Mask R-CNN, RetinaNet, Faster R-CNN, RPN, Fast R-CNN, R-FCN, ...
Fine-tune on your own dataset with pre-trained models.
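A sketch of such fine-tuning with Detectron2 (the config file is a real model-zoo entry; the dataset name, paths, class count, and solver settings are placeholders):

```python
import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Placeholder dataset name and paths (COCO-format annotations assumed)
register_coco_instances("my_dataset_train", {}, "annotations.json", "images/")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")  # start from COCO-pretrained Mask R-CNN
cfg.DATASETS.TRAIN = ("my_dataset_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3   # set to the number of classes in your dataset
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.MAX_ITER = 1000            # illustrative schedule

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```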