Lecture 4
What is this?
[Figure: an input image and its intensity histogram; x-axis: intensity (black pixels, gray), y-axis: pixel count. Source: K. Grauman]
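Not from the slides, just to make the idea concrete: a grayscale image is a grid of intensity values, and the plot is its intensity histogram. A minimal NumPy sketch, assuming an 8-bit image (the random image here is purely illustrative):

```python
import numpy as np

# Hypothetical 8-bit grayscale image (intensities 0..255).
image = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)

# Histogram: x-axis = intensity (0..255), y-axis = pixel count.
counts, bin_edges = np.histogram(image, bins=256, range=(0, 256))

print(counts.shape)  # (256,) -- one bin per intensity value
print(counts.sum())  # total number of pixels: 480 * 640
```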
Deep Learning: Semantic Segmentation, Classification + Localization, Object Detection, Instance Segmentation
[Figure: two example scenes labeled per-pixel as Sky, Trees, Cat / Cow, Grass]
Semantic segmentation with a fully convolutional network: the input image (3 x H x W) passes through a stack of convolutions that produce class scores (C x H x W), and the argmax over classes gives per-pixel predictions (H x W).

With downsampling and upsampling inside the network:
Input: 3 x H x W → High-res: D1 x H/2 x W/2 → Med-res: D2 x H/4 x W/4 → Low-res: D3 x H/4 x W/4 → Med-res: D2 x H/4 x W/4 → High-res: D1 x H/2 x W/2 → Predictions: H x W

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al., “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
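A minimal PyTorch sketch of the downsample-then-upsample design; the channel widths (D1, D2, D3), the use of strided convolutions for downsampling, and the transposed convolutions for upsampling are illustrative assumptions, not the exact architectures of the cited papers:

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Illustrative fully convolutional net: downsample inside the
    network, then upsample back to per-pixel class scores."""
    def __init__(self, num_classes=21, d1=64, d2=128, d3=256):
        super().__init__()
        self.down1 = nn.Conv2d(3, d1, 3, stride=2, padding=1)   # 3 x H x W -> D1 x H/2 x W/2
        self.down2 = nn.Conv2d(d1, d2, 3, stride=2, padding=1)  # -> D2 x H/4 x W/4
        self.mid = nn.Conv2d(d2, d3, 3, padding=1)              # -> D3 x H/4 x W/4 (low-res)
        self.up1 = nn.ConvTranspose2d(d3, d1, 4, stride=2, padding=1)           # -> D1 x H/2 x W/2
        self.up2 = nn.ConvTranspose2d(d1, num_classes, 4, stride=2, padding=1)  # -> C x H x W
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.down1(x))
        x = self.relu(self.down2(x))
        x = self.relu(self.mid(x))
        x = self.relu(self.up1(x))
        return self.up2(x)  # scores: C x H x W

scores = TinyFCN()(torch.randn(1, 3, 64, 64))  # (1, C, 64, 64)
preds = scores.argmax(dim=1)                   # (1, 64, 64) per-pixel class labels
print(preds.shape)
```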
In-network upsampling with max unpooling: during max pooling, the network remembers which position in each window held the maximum; the corresponding unpooling layer (input: 2 x 2, output: 4 x 4) places each value back at that remembered position and fills the rest with zeros. Downsampling and upsampling layers are used as corresponding pairs, with the rest of the network in between.
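A minimal PyTorch sketch of such a corresponding pooling/unpooling pair; the 4 x 4 input values are just an example:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)  # downsampling half of the pair
unpool = nn.MaxUnpool2d(2, stride=2)                    # matching upsampling half

x = torch.tensor([[[[1., 2., 6., 3.],
                    [3., 5., 2., 1.],
                    [1., 2., 2., 1.],
                    [7., 3., 4., 8.]]]])

pooled, indices = pool(x)           # 2 x 2 maxima plus the positions they came from
# ... the "rest of the network" would transform `pooled` here ...
restored = unpool(pooled, indices)  # 4 x 4: values back at remembered positions, zeros elsewhere

print(pooled.squeeze())    # [[5., 6.], [7., 8.]]
print(restored.squeeze())  # zeros everywhere except the four max positions
```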
Learnable upsampling with transpose convolution (e.g. input: 2 x 2, output: 4 x 4): each input value gives the weight for a copy of the filter, and the output contains these input-weighted copies of the filter, summed wherever the copies overlap in the output.

1D example with stride 2: input [a, b], filter [x, y, z], output [ax, ay, az + bx, by, bz].
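A minimal NumPy sketch of that 1D, stride-2 transpose convolution: each input element scales a copy of the filter, copies are placed two output positions apart, and overlaps are summed:

```python
import numpy as np

def transpose_conv1d(inp, filt, stride=2):
    """Sum input-weighted copies of `filt`, placed `stride` apart in the output."""
    out = np.zeros(stride * (len(inp) - 1) + len(filt))
    for i, v in enumerate(inp):
        out[i * stride : i * stride + len(filt)] += v * filt
    return out

a, b = 2.0, 3.0             # input values
x, y, z = 1.0, 10.0, 100.0  # filter taps
print(transpose_conv1d(np.array([a, b]), np.array([x, y, z])))
# [ax, ay, az + bx, by, bz] = [2., 20., 203., 30., 300.]
```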
Object detection: each image needs a different number of outputs.
- One object: CAT: (x, y, w, h) → 4 numbers
- Four objects: DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h), DUCK: (x, y, w, h) → 16 numbers
- Many objects: DUCK: (x, y, w, h), DUCK: (x, y, w, h), … → many numbers
Object detection as classification, using a sliding window: apply a CNN to many different crops of the image, classifying each crop as an object class or as background, e.g.:
- Dog? NO, Cat? NO, Background? YES
- Dog? YES, Cat? NO, Background? NO
- Dog? YES, Cat? NO, Background? NO
- Dog? NO, Cat? YES, Background? NO
- Dog? NO, Cat? YES, Background? NO

Problem: the CNN needs to be applied to a huge number of locations and scales, which is very computationally expensive!
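A minimal sketch of that brute-force loop, assuming a classifier function `classify_crop` (hypothetical, standing in for the CNN) that returns per-class scores for a crop; the crop size and stride are illustrative:

```python
import numpy as np

def sliding_window_detect(image, classify_crop, crop=64, stride=32):
    """Classify every crop; keep the ones not labeled as background."""
    detections = []
    height, width = image.shape[:2]
    for top in range(0, height - crop + 1, stride):
        for left in range(0, width - crop + 1, stride):
            window = image[top:top + crop, left:left + crop]
            scores = classify_crop(window)  # e.g. {"dog": ..., "cat": ..., "background": ...}
            label = max(scores, key=scores.get)
            if label != "background":
                detections.append((left, top, crop, crop, label, scores[label]))
    return detections

# Dummy classifier just to show the call pattern: everything is background.
dummy = lambda w: {"dog": 0.1, "cat": 0.1, "background": 0.8}
print(sliding_window_detect(np.zeros((128, 128)), dummy))  # []
```

Even this small example runs the classifier on every (location, scale) combination, which is why the approach is so expensive.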
[Figure: a separate ConvNet forward pass is run for each crop]
YOLO: You Only Look Once
- 7x7 grid
- 2 bounding boxes / cell
- 20 classes

Split the image into a grid. Each cell predicts bounding boxes and confidences, P(Object). Each cell also predicts a class probability, P(Class | Object), over classes such as Bicycle, Car, Dog, and Dining Table. Combine the box and class predictions, then finally do non-maximum suppression and threshold the detections (sketched below). YOLO also generalizes well to new domains.

Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
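A minimal NumPy sketch of those last two steps, combining the per-box confidence P(Object) with the per-cell class probabilities P(Class | Object) and then running greedy non-maximum suppression; the shapes, thresholds, and IoU helper below are illustrative assumptions, not YOLO's exact post-processing:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best box, drop heavily overlapping ones, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        mask = np.array([iou(boxes[best], boxes[i]) < iou_thresh for i in rest], dtype=bool)
        order = rest[mask]
    return keep

# Class-specific confidence = P(Object) * P(Class | Object).
p_object = np.array([0.9, 0.8, 0.1])                      # one confidence per predicted box
p_class = np.array([[0.7, 0.3], [0.6, 0.4], [0.5, 0.5]])  # per-box class probabilities
class_scores = p_object[:, None] * p_class                # shape (boxes, classes)

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
dog_scores = class_scores[:, 0]
detected = dog_scores > 0.2                               # threshold the detections
print(nms(boxes[detected], dog_scores[detected]))         # indices of surviving boxes
```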