lecture4

The document discusses image segmentation and the importance of context in identifying objects within images. It covers various techniques for automatic and semi-automatic segmentation, including semantic segmentation using fully convolutional networks and object detection methods like R-CNN and YOLO. The content emphasizes the need for efficient algorithms to classify and segment images accurately while managing computational resources.


Image Segmentation

How many zebras?

From Sandlot Science


Why is context important?

What is this?

slide by Takeo Kanade


Why is this a car?
…because it’s on the road!

Why is this a road?

Context is very important!


Same problem in real scenes
From images to objects

What defines an object?


• Subjective problem, but has been well-studied
Extracting objects

How could we do this automatically (or at least semi-automatically)?

Semi-automatic binary segmentation
Simplifying the user interaction
Grabcut [Rother et al., SIGGRAPH 2004]
Auto segmentation: toy example

[Figure: input image with regions labeled 1, 2, 3, and its histogram of pixel count vs. intensity, with peaks for black, gray, and white pixels]

• These intensities define the three groups.
• We could label every pixel in the image according to which of these primary intensities it is.
• i.e., segment the image based on the intensity feature.
• But … the image isn’t quite so simple …

Source: K. Grauman
[Figure: input image and its histogram of pixel count vs. intensity]

• Now how to determine the three main intensities that define our groups?
• We need to cluster.

Source: K. Grauman
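
A minimal sketch of clustering pixel intensities into three groups, assuming plain k-means on scalar values. The function name `kmeans_1d`, the evenly spaced initialization, and the synthetic noisy image (true intensities 0, 128, 255 with Gaussian noise) are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def kmeans_1d(values, k=3, iters=20):
    """Cluster scalar intensities into k groups with plain k-means."""
    values = np.asarray(values, dtype=float)
    # Assumption: initialize centers evenly across the observed intensity range.
    centers = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        # Assign each pixel to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean intensity of its assigned pixels.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return centers, labels

rng = np.random.default_rng(0)
# Hypothetical noisy image, flattened: black (0), gray (128), white (255) pixels.
clean = np.repeat([0.0, 128.0, 255.0], 100)
noisy = clean + rng.normal(0.0, 10.0, size=clean.shape)

centers, labels = kmeans_1d(noisy, k=3)
```

Each pixel's entry in `labels` is its segment, and `centers` approximates the three main intensities.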
Deep Learning

• Semantic Segmentation: GRASS, CAT, TREE, SKY (pixel-level)
• Classification + Localization: CAT (single object)
• Object Detection: DOG, DOG, CAT (multiple objects)
• Instance Segmentation: DOG, DOG, CAT (multiple objects)

Segmentation + Classification

Slide by: Justin Johnson
Fei-Fei Li & Justin Johnson & Serena Yeung, Lecture 11 - 13, May 10, 2017
Semantic Segmentation

Label each pixel in the image with a category label

[Figure: two example images with per-pixel labels: Sky, Trees, Cat, Grass and Sky, Trees, Cow, Grass]

Don’t differentiate instances, only care about pixels
Semantic Segmentation Idea:
Fully Convolutional
Design a network as a bunch of convolutional layers
to make predictions for pixels all at once!

Input: 3 x H x W -> Conv -> Conv -> Conv -> Conv (convolutions) -> Scores: C x H x W -> argmax -> Predictions: H x W

Each channel is a class: C channels -> C classes
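
The last step of this pipeline can be sketched as follows. The random `scores` array is only a stand-in for the network's C x H x W output, and the sizes are hypothetical.

```python
import numpy as np

C, H, W = 4, 2, 3  # hypothetical sizes: 4 classes, 2 x 3 image
rng = np.random.default_rng(0)
# Stand-in for the network output: one score map per class.
scores = rng.normal(size=(C, H, W))

# Each pixel's predicted label is the channel (class) with the highest score.
predictions = scores.argmax(axis=0)  # shape H x W, values in 0..C-1
```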
Semantic Segmentation Idea:
Fully Convolutional
Design network as a bunch of convolutional layers, with
downsampling and upsampling inside the network!

Input: 3 x H x W -> High-res: D1 x H/2 x W/2 -> Med-res: D2 x H/4 x W/4 -> Low-res: D3 x H/4 x W/4 -> Med-res: D2 x H/4 x W/4 -> High-res: D1 x H/2 x W/2 -> Predictions: H x W

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

Semantic Segmentation Idea:
Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Downsampling: pooling, strided convolution
Upsampling: ???

Input: 3 x H x W -> High-res: D1 x H/2 x W/2 -> Med-res: D2 x H/4 x W/4 -> Low-res: D3 x H/4 x W/4 -> Med-res: D2 x H/4 x W/4 -> High-res: D1 x H/2 x W/2 -> Predictions: H x W

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

In-Network upsampling: “Unpooling”

Nearest Neighbor

Input: 2 x 2      Output: 4 x 4
1 2               1 1 2 2
3 4               1 1 2 2
                  3 3 4 4
                  3 3 4 4

“Bed of Nails”

Input: 2 x 2      Output: 4 x 4
1 2               1 0 2 0
3 4               0 0 0 0
                  3 0 4 0
                  0 0 0 0
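
Both unpooling schemes above can be sketched in a few lines of NumPy. The function names and the `factor` parameter are illustrative assumptions. The example reuses the 2 x 2 input from the slide.

```python
import numpy as np

def nearest_neighbor_unpool(x, factor=2):
    """Repeat every element `factor` times along both spatial axes."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def bed_of_nails_unpool(x, factor=2):
    """Put each element in the top-left corner of its block; zeros elsewhere."""
    out = np.zeros((x.shape[0] * factor, x.shape[1] * factor), dtype=x.dtype)
    out[::factor, ::factor] = x
    return out

x = np.array([[1, 2],
              [3, 4]])
nn = nearest_neighbor_unpool(x)   # 4 x 4, each value copied into a 2 x 2 block
bon = bed_of_nails_unpool(x)      # 4 x 4, values at even rows/columns only
```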

In-Network upsampling: “Max Unpooling”
Max Pooling: remember which element was max!

Input: 4 x 4      Output: 2 x 2
1 2 6 3           5 6
3 5 2 1           7 8
1 2 2 1
7 3 4 8

… Rest of the network …

Max Unpooling: use positions from pooling layer

Input: 2 x 2      Output: 4 x 4
1 2               0 0 2 0
3 4               0 1 0 0
                  0 0 0 0
                  3 0 0 4

Corresponding pairs of downsampling and upsampling layers
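
A rough sketch of this pairing, assuming 2 x 2 pooling with stride 2. The helper names `max_pool_2x2` and `max_unpool_2x2` are hypothetical. The pooling step records the flat index of each maximum, and the unpooling step writes its input back to those positions.

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling, stride 2. Also returns the flat index in x of each max."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2), dtype=x.dtype)
    idx = np.zeros((H // 2, W // 2), dtype=int)
    for i in range(H // 2):
        for j in range(W // 2):
            window = x[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            r, c = np.unravel_index(window.argmax(), window.shape)
            out[i, j] = window[r, c]
            idx[i, j] = (2 * i + r) * W + (2 * j + c)  # remember where the max was
    return out, idx

def max_unpool_2x2(y, idx, out_shape):
    """Write each element of y to the position recorded by the matching pooling layer."""
    out = np.zeros(out_shape, dtype=y.dtype)
    np.put(out, idx.ravel(), y.ravel())
    return out

x = np.array([[1, 2, 6, 3],
              [3, 5, 2, 1],
              [1, 2, 2, 1],
              [7, 3, 4, 8]])
pooled, idx = max_pool_2x2(x)

# Later in the network, a 2 x 2 feature map is unpooled with the saved positions.
y = np.array([[1, 2],
              [3, 4]])
unpooled = max_unpool_2x2(y, idx, x.shape)
```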

Learnable Upsampling

3 x 3 transpose convolution, stride 2 pad 1

• Input gives weight for filter
• Filter moves 2 pixels in the output for every one pixel in the input
• Sum where output overlaps

Input: 2 x 2 Output: 4 x 4

Transpose Convolution: 1D Example

Input: [a, b]
Filter: [x, y, z]
Output: [ax, ay, az + bx, by, bz]

Output contains copies of the filter weighted by the input, summing where they overlap in the output.
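
The 1D example can be checked numerically with a short sketch. Stride 2 is an assumption inferred from the output pattern (the two filter copies overlap only at `az + bx`), and the function name and the concrete values of a, b, x, y, z are arbitrary.

```python
import numpy as np

def transpose_conv1d(inp, filt, stride=2):
    """Add one copy of the filter per input element, scaled by that element."""
    inp = np.asarray(inp, dtype=float)
    filt = np.asarray(filt, dtype=float)
    out = np.zeros((len(inp) - 1) * stride + len(filt))
    for i, v in enumerate(inp):
        # Each copy starts `stride` positions after the previous one;
        # overlapping positions are summed.
        out[i * stride:i * stride + len(filt)] += v * filt
    return out

a, b = 1.0, 2.0            # input
x, y, z = 3.0, 4.0, 5.0    # filter
out = transpose_conv1d([a, b], [x, y, z], stride=2)
# out matches [a*x, a*y, a*z + b*x, b*y, b*z]
```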

Adapted from Justin Johnson
Object Detection as Regression?

CAT: (x, y, w, h)
-> 4 numbers

DOG: (x, y, w, h)
DOG: (x, y, w, h)
CAT: (x, y, w, h)
-> 12 numbers

DUCK: (x, y, w, h)
DUCK: (x, y, w, h)
….
-> Many numbers!

Each image needs a different number of outputs!
Object Detection as Classification:
Sliding Window
Apply a CNN to many different crops of the
image, CNN classifies each crop as object
or background

Dog? NO
Cat? NO
Background? YES

Object Detection as Classification:
Sliding Window
Apply a CNN to many different crops of the
image, CNN classifies each crop as object
or background

Dog? YES
Cat? NO
Background? NO

Object Detection as Classification:
Sliding Window
Apply a CNN to many different crops of the
image, CNN classifies each crop as object
or background

Dog? NO
Cat? YES
Background? NO


Problem: Need to apply CNN to huge number of locations and scales, very computationally expensive!
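
To get a feel for the cost, here is a small sketch that counts how many crops a sliding-window detector would classify. The image size, window sizes, and stride are hypothetical choices, and only square windows are counted.

```python
def count_windows(image_h, image_w, window_sizes, stride):
    """Count the square crops visited at every location and scale."""
    total = 0
    for size in window_sizes:
        if size > image_h or size > image_w:
            continue  # this window does not fit in the image
        rows = (image_h - size) // stride + 1
        cols = (image_w - size) // stride + 1
        total += rows * cols
    return total

# Hypothetical setup: 480 x 640 image, three scales, stride of 8 pixels.
n = count_windows(480, 640, window_sizes=[64, 128, 256], stride=8)
```

With these settings `n` is 8215, so the CNN would run 8215 times on one image, before any other aspect ratios are considered.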

Region Proposals
● Find image regions that are likely to contain objects
● Relatively fast to run; e.g. Selective Search gives 1000 region
proposals in a few seconds on CPU

Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
Alexe et al., CVPR 2010
R-CNN

[Figure: R-CNN pipeline, with a ConvNet applied to each region]

Girshick et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation”, CVPR 2014
Detection without Proposals: YOLO

Input image: 3 x H x W
Divide image into grid: 7 x 7
Imagine a set of base boxes centered at each grid cell. Here B = 3.

Within each grid cell:
• Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
• Predict scores for each of C classes (including background as a class)

Output: 7 x 7 x (5 * B + C)


Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016

This parameterization fixes the output size
• Each cell predicts:
  - For each bounding box:
    - 4 coordinates (x, y, w, h)
    - 1 confidence value
  - Some number of class probabilities
• For Pascal VOC:
  - 7x7 grid
  - 2 bounding boxes / cell
  - 20 classes
• 7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs
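
The arithmetic above can be reproduced with a tiny helper. The function name `yolo_output_shape` is hypothetical. It only encodes the count from the list: 5 numbers per box plus one probability per class, for every grid cell.

```python
def yolo_output_shape(grid_size, boxes_per_cell, num_classes):
    """Output tensor shape: each cell holds 5 numbers per box plus class probabilities."""
    depth = boxes_per_cell * 5 + num_classes
    return (grid_size, grid_size, depth)

# Pascal VOC settings from the list above.
shape = yolo_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20)
num_outputs = shape[0] * shape[1] * shape[2]
```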

Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Split the image into a grid

Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Each cell predicts boxes and confidences:
P(Object)

Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Each cell also predicts a probability
P(Class | Object)

[Figure: class probability map over the grid, with regions labeled Bicycle, Car, Dog, Dining Table]
Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Combine the box and class predictions

Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
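
One way to sketch the combination step, assuming the per-box confidence stands for P(Object) and the per-cell class probabilities stand for P(Class | Object), as on the preceding slides. The random arrays are stand-ins for network outputs, and the variable names are illustrative.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (Pascal VOC settings)
rng = np.random.default_rng(0)

# Stand-ins for network outputs.
box_confidence = rng.uniform(size=(S, S, B))          # P(Object) for each box
class_probs = rng.dirichlet(np.ones(C), size=(S, S))  # P(Class | Object) per cell, S x S x C

# Class-specific score for every box: P(Class | Object) * P(Object).
class_scores = box_confidence[..., :, None] * class_probs[..., None, :]  # S x S x B x C
```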
Finally do non-maximum suppression and
threshold detections

Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
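
A minimal sketch of thresholding followed by greedy non-maximum suppression. It assumes boxes are given as corner coordinates (x1, y1, x2, y2) rather than (x, y, w, h), and the threshold values and example boxes are arbitrary choices for illustration.

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union between one box and an array of boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, score_threshold=0.3, iou_threshold=0.5):
    """Drop low-scoring boxes, then greedily keep the best box and suppress its overlaps."""
    order = np.argsort(-scores)                       # highest score first
    order = order[scores[order] >= score_threshold]   # threshold detections
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]       # discard boxes overlapping the kept one
    return keep

boxes = np.array([[0, 0, 10, 10],
                  [1, 1, 11, 11],     # overlaps heavily with the first box
                  [20, 20, 30, 30],
                  [0, 0, 5, 5]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.1])
kept = nms(boxes, scores)
```

Here `kept` contains the indices of the first and third boxes: the second is suppressed by its overlap with the first, and the fourth falls below the score threshold.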
It also generalizes well to new domains

Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
