
Classic Convolutional Neural Networks
Common Augmentation Method: Color Shifting (slides: Andrew Ng)
Image Classification
Fei-Fei Li & Andrej Karpathy & Justin Johnson, CS231n Lecture 7, 27 Jan 2016
Demo: https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html

Other Computer Vision Tasks
http://cs231n.stanford.edu/2017/syllabus.html

Semantic Segmentation

Label each pixel in the image with a category label (e.g. Sky, Cat, Cow, Grass).
Don't differentiate instances; only care about pixels.


Semantic Segmentation Idea: Sliding Window

Extract patches from the full image and classify the center pixel of each patch with a CNN (e.g. Cow, Cow, Grass for neighboring patches).

Farabet et al, "Learning Hierarchical Features for Scene Labeling", TPAMI 2013
Pinheiro and Collobert, "Recurrent Convolutional Neural Networks for Scene Labeling", ICML 2014


Problem: very inefficient! Not reusing shared features between overlapping patches.
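To make this concrete, here is a minimal sketch of sliding-window segmentation in PyTorch; the patch size and the patch_classifier model are illustrative assumptions, not part of the slides.

```python
import torch
import torch.nn.functional as F

def sliding_window_segment(image, patch_classifier, patch=65):
    """Classify the center pixel of every patch with a CNN (very inefficient!).

    image: (3, H, W) tensor; patch_classifier: CNN mapping (1, 3, patch, patch)
    to (1, num_classes) scores. Returns an (H, W) label map.
    """
    _, H, W = image.shape
    pad = patch // 2
    padded = F.pad(image.unsqueeze(0), (pad, pad, pad, pad), mode="reflect")
    labels = torch.zeros(H, W, dtype=torch.long)
    for y in range(H):
        for x in range(W):                      # one forward pass per pixel!
            crop = padded[:, :, y:y + patch, x:x + patch]
            labels[y, x] = patch_classifier(crop).argmax()
    return labels
```

Every pixel triggers its own forward pass, which is exactly the inefficiency noted above.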


Semantic Segmentation Idea: Fully Convolutional

Design the network as a bunch of convolutional layers to make predictions for all pixels at once!

Input: 3 x H x W → [Conv → Conv → Conv → Conv] → Scores: C x H x W → argmax → Predictions: H x W
(intermediate convolutional feature maps: D x H x W)

Problem: convolutions at the original image resolution will be very expensive ...
Semantic Segmentation Idea: Fully Convolutional

Design the network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Input: 3 x H x W → High-res: D1 x H/2 x W/2 → Med-res: D2 x H/4 x W/4 → Low-res: D3 x H/4 x W/4 → Med-res: D2 x H/4 x W/4 → High-res: D1 x H/2 x W/2 → Predictions: H x W

Downsampling: pooling, strided convolution. Upsampling: ???

Long, Shelhamer, and Darrell, "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015
Noh et al, "Learning Deconvolution Network for Semantic Segmentation", ICCV 2015


In-Network Upsampling: "Unpooling"

Nearest Neighbor (input 2 x 2 → output 4 x 4):

  1 2        1 1 2 2
  3 4   →    1 1 2 2
             3 3 4 4
             3 3 4 4

"Bed of Nails" (input 2 x 2 → output 4 x 4):

  1 2        1 0 2 0
  3 4   →    0 0 0 0
             3 0 4 0
             0 0 0 0
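A minimal PyTorch sketch of both unpooling variants; the 2 x 2 input matches the example above, everything else is illustrative.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2.], [3., 4.]]).reshape(1, 1, 2, 2)  # N x C x H x W

# Nearest-neighbor unpooling: copy each value into its 2 x 2 block.
nearest = F.interpolate(x, scale_factor=2, mode="nearest")

# "Bed of nails" unpooling: put each value in the top-left corner of its
# 2 x 2 block and fill the rest with zeros.
bed_of_nails = torch.zeros(1, 1, 4, 4)
bed_of_nails[:, :, ::2, ::2] = x

print(nearest.squeeze())       # [[1,1,2,2],[1,1,2,2],[3,3,4,4],[3,3,4,4]]
print(bed_of_nails.squeeze())  # [[1,0,2,0],[0,0,0,0],[3,0,4,0],[0,0,0,0]]
```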


In-Network Upsampling: "Max Unpooling"

Max Pooling: remember which element was max!

  Input (4 x 4):     Output (2 x 2):
  1 2 6 3
  3 5 2 1      →     5 6
  1 2 2 1            7 8
  7 3 4 8

... rest of the network ...

Max Unpooling: use positions from the corresponding pooling layer.

  Input (2 x 2):     Output (4 x 4):
                     0 0 2 0
  1 2          →     0 1 0 0
  3 4                0 0 0 0
                     3 0 0 4

Use corresponding pairs of downsampling and upsampling layers.
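PyTorch supports exactly this pairing through pooling indices; a small sketch reproducing the example above:

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 2., 6., 3.],
                  [3., 5., 2., 1.],
                  [1., 2., 2., 1.],
                  [7., 3., 4., 8.]]).reshape(1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

pooled, indices = pool(x)       # pooled = [[5, 6], [7, 8]]; indices remember the max positions
# ... the rest of the network would process `pooled` here ...
y = torch.tensor([[1., 2.], [3., 4.]]).reshape(1, 1, 2, 2)
restored = unpool(y, indices)   # [[0,0,2,0],[0,1,0,0],[0,0,0,0],[3,0,0,4]]
```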


Transpose Convolution
• Increases the spatial size of the output feature map relative to the input feature map.
• Synonyms: often (incorrectly) called "deconvolution" (mathematically, deconvolution is defined as the inverse of convolution, which is different from transposed convolution). The terms "upconvolution" and "fractionally strided convolution" are also used.
Transpose Convolution

The complete transposed convolution operation is illustrated in "A guide to convolution arithmetic for deep learning".
Learnable Upsampling: Transpose Convolution

Recall: normal 3 x 3 convolution, stride 1, pad 1. Each output value is a dot product between the filter and the input. Input: 4 x 4 → Output: 4 x 4.

Recall: normal 3 x 3 convolution, stride 2, pad 1. The filter moves 2 pixels in the input for every one pixel in the output; the stride gives the ratio between movement in the input and movement in the output. Input: 4 x 4 → Output: 2 x 2.


Learnable Upsampling: Transpose Convolution

3 x 3 transpose convolution, stride 2, pad 1. Input: 2 x 2 → Output: 4 x 4.

Each input value gives the weight for a copy of the filter; where the copies overlap in the output, they are summed. The filter moves 2 pixels in the output for every one pixel in the input; the stride gives the ratio between movement in the output and movement in the input.
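A minimal PyTorch sketch of this learnable upsampling. Note that with kernel 3, stride 2, pad 1, PyTorch needs output_padding=1 to land exactly on 4 x 4 (the slide's figure implicitly crops to that size); the all-ones filter is purely illustrative.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 2.], [3., 4.]]).reshape(1, 1, 2, 2)   # input: 2 x 2

# 3 x 3 transpose convolution, stride 2, pad 1; output_padding=1 makes the
# output exactly 4 x 4: (2 - 1) * 2 - 2 * 1 + 3 + 1 = 4.
upconv = nn.ConvTranspose2d(in_channels=1, out_channels=1, kernel_size=3,
                            stride=2, padding=1, output_padding=1, bias=False)

with torch.no_grad():
    upconv.weight.fill_(1.0)   # illustrative filter: all ones

y = upconv(x)                  # copies of the filter, weighted by the input, summed where they overlap
print(y.shape)                 # torch.Size([1, 1, 4, 4])
```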


Putting it together for semantic segmentation: design the network as a bunch of convolutional layers, with downsampling (pooling, strided convolution) and upsampling (unpooling, transpose convolution) inside the network (Long, Shelhamer, and Darrell, CVPR 2015; Noh et al, ICCV 2015).
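A minimal sketch of such a fully convolutional encoder-decoder in PyTorch, following the shapes from the slide (D1, D2, D3 become channel arguments); the specific layer choices are assumptions, not the architectures from the cited papers.

```python
import torch
import torch.nn as nn

class FullyConvSegNet(nn.Module):
    """Downsample with strided convs, upsample with transpose convs,
    predict a C x H x W score map, then take an argmax over classes per pixel."""

    def __init__(self, num_classes, d1=64, d2=128, d3=256):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(3, d1, 3, stride=2, padding=1), nn.ReLU(),   # high-res: D1 x H/2 x W/2
            nn.Conv2d(d1, d2, 3, stride=2, padding=1), nn.ReLU(),  # med-res:  D2 x H/4 x W/4
            nn.Conv2d(d2, d3, 3, padding=1), nn.ReLU(),            # low-res:  D3 x H/4 x W/4
        )
        self.up = nn.Sequential(
            nn.ConvTranspose2d(d3, d2, 4, stride=2, padding=1), nn.ReLU(),  # back to H/2 x W/2
            nn.ConvTranspose2d(d2, d1, 4, stride=2, padding=1), nn.ReLU(),  # back to H x W
            nn.Conv2d(d1, num_classes, 1),                                  # scores: C x H x W
        )

    def forward(self, x):
        scores = self.up(self.down(x))
        return scores.argmax(dim=1)        # predictions: H x W per image

net = FullyConvSegNet(num_classes=21)
preds = net(torch.randn(1, 3, 224, 224))   # -> torch.Size([1, 224, 224])
```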


Classification + Localization

Treat localization as a regression problem!

A CNN (often pretrained on ImageNet; transfer learning) produces a feature vector (e.g. 4096-d), which feeds two fully connected heads:
• Fully connected 4096 → 1000: class scores (Cat: 0.9, Dog: 0.05, Car: 0.01, ...), trained with a softmax loss against the correct label (Cat).
• Fully connected 4096 → 4: box coordinates (x, y, w, h), trained with an L2 loss against the correct box (x', y', w', h').

The two losses are added together into a single multitask loss.
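A minimal PyTorch sketch of this two-headed setup; the backbone, the 4096-d feature size, and the loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassifyAndLocalize(nn.Module):
    """Shared backbone -> class-score head (4096 -> 1000) and box head (4096 -> 4)."""

    def __init__(self, backbone, num_classes=1000):
        super().__init__()
        self.backbone = backbone                    # e.g. a pretrained CNN minus its final layer
        self.cls_head = nn.Linear(4096, num_classes)
        self.box_head = nn.Linear(4096, 4)          # (x, y, w, h)

    def forward(self, images):
        feats = self.backbone(images)               # (N, 4096)
        return self.cls_head(feats), self.box_head(feats)

def multitask_loss(class_scores, boxes, labels, gt_boxes, box_weight=1.0):
    # Softmax (cross-entropy) loss on class scores + L2 loss on box coordinates.
    cls_loss = F.cross_entropy(class_scores, labels)
    box_loss = F.mse_loss(boxes, gt_boxes)
    return cls_loss + box_weight * box_loss
```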


Aside: Human Pose Estimation

Represent the pose as a set of 14 joint positions: left/right foot, left/right knee, left/right hip, left/right shoulder, left/right elbow, left/right hand, neck, head top.

Johnson and Everingham, "Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation", BMVC 2010

The network maps a feature vector (e.g. 4096-d) to an (x, y) position for each joint (left foot, right foot, ..., head top). Each predicted joint is compared to the correct position (x', y') with an L2 loss, and the per-joint losses are summed into a single loss.

Toshev and Szegedy, "DeepPose: Human Pose Estimation via Deep Neural Networks", CVPR 2014


Object Detection

Task comparison: semantic segmentation (GRASS, CAT, TREE, SKY: no objects, just pixels), classification + localization (CAT: single object), object detection and instance segmentation (DOG, DOG, CAT: multiple objects).


Object Detection: Impact of Deep Learning

Figure copyright Ross Girshick, 2015. Reproduced with permission.


Object Detection as Regression?

Each image needs a different number of outputs!
• CAT: (x, y, w, h) → 4 numbers
• DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h) → 16 numbers
• DUCK: (x, y, w, h), DUCK: (x, y, w, h), ... → many numbers!
Object Detection as Classification: Sliding Window

Apply a CNN to many different crops of the image; the CNN classifies each crop as an object class or background (e.g. Dog? NO, Cat? NO, Background? YES for a grass crop; Dog? YES for a crop around the dog; Cat? YES for a crop around the cat).

Problem: need to apply the CNN to a huge number of locations and scales, which is very computationally expensive!


Region Proposals
• Find "blobby" image regions that are likely to contain objects
• Relatively fast to run; e.g. Selective Search gives 1000 region proposals in a few seconds on a CPU

Alexe et al, "Measuring the objectness of image windows", TPAMI 2012
Uijlings et al, "Selective Search for Object Recognition", IJCV 2013
Cheng et al, "BING: Binarized normed gradients for objectness estimation at 300fps", CVPR 2014
Zitnick and Dollar, "Edge boxes: Locating object proposals from edges", ECCV 2014


R-CNN

(Pipeline figure: regions of interest from a proposal method such as Selective Search are warped to a fixed size, each region is run through a ConvNet, and the regions are then classified with SVMs and refined with bounding-box regression.)

Girshick et al, "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014.
Figure copyright Ross Girshick, 2015. Reproduced with permission.


R-CNN: Problems
• Ad hoc training objectives
  • Fine-tune network with softmax classifier (log loss)
  • Train post-hoc linear SVMs (hinge loss)
  • Train post-hoc bounding-box regressions (least squares)
• Training is slow (84h), takes a lot of disk space
• Inference (detection) is slow
  • 47s / image with VGG16 [Simonyan & Zisserman, ICLR 2015]
  • Fixed by SPP-net [He et al, ECCV 2014]

Girshick et al, "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014.
Slide copyright Ross Girshick, 2015. Reproduced with permission.


Fast R-CNN

(Pipeline figure: the whole image is run through a ConvNet once; region proposals are projected onto the resulting feature map, cropped and resized with RoI pooling, and a small per-region network predicts class scores and bounding-box offsets. Training optimizes a multitask loss over both outputs.)

Girshick, "Fast R-CNN", ICCV 2015.
Figure copyright Ross Girshick, 2015. Reproduced with permission.


R-CNN vs SPP vs Fast R-CNN

Problem: runtime is now dominated by computing region proposals!

Girshick et al, "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014
He et al, "Spatial pyramid pooling in deep convolutional networks for visual recognition", ECCV 2014
Girshick, "Fast R-CNN", ICCV 2015


Faster R-CNN: Make CNN do proposals!

Insert a Region Proposal Network (RPN) to predict proposals from features.

Jointly train with 4 losses:
1. RPN classify object / not object
2. RPN regress box coordinates
3. Final classification score (object classes)
4. Final box coordinates

Ren et al, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015
Figure copyright 2015, Ross Girshick; reproduced with permission




Detection without Proposals: YOLO / SSD (Single-Shot Detection)

Go from the input image to a tensor of scores with one big convolutional network!

Divide the input image (3 x H x W) into a 7 x 7 grid, with a set of B base boxes centered at each grid cell (here B = 3). Within each grid cell:
- Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
- Predict scores for each of C classes (including background as a class)

Output: 7 x 7 x (5 * B + C)

Redmon et al, "You Only Look Once: Unified, Real-Time Object Detection", CVPR 2016
Liu et al, "SSD: Single-Shot MultiBox Detector", ECCV 2016
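A minimal sketch of such a single-shot output head in PyTorch; the backbone layers and sizes are illustrative assumptions, the point is only the 7 x 7 x (5 * B + C) output shape.

```python
import torch
import torch.nn as nn

B, C = 3, 20   # base boxes per cell and number of classes (illustrative)

# One big convolutional network: the backbone downsamples a 224 x 224 image to
# a 7 x 7 grid, then a 1 x 1 conv predicts (5 * B + C) numbers per grid cell.
detector = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),    # 224 -> 112
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # 112 -> 56
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(), # 56 -> 28
    nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(), # 28 -> 14
    nn.Conv2d(512, 512, 3, stride=2, padding=1), nn.ReLU(), # 14 -> 7
    nn.Conv2d(512, 5 * B + C, 1),                           # per-cell box regressions + class scores
)

out = detector(torch.randn(1, 3, 224, 224))
print(out.shape)   # torch.Size([1, 5*B + C, 7, 7]): a 7 x 7 x (5*B + C) tensor per image
```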
Object Detection: Lots of Variables ...

Base network: VGG16, ResNet-101, Inception V2, Inception V3, Inception ResNet, MobileNet
Object detection architecture: Faster R-CNN, R-FCN, SSD
Other choices: image size, # region proposals

Takeaways: Faster R-CNN is slower but more accurate; SSD is much faster but not as accurate.

Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors", CVPR 2017
R-FCN: Dai et al, "R-FCN: Object Detection via Region-based Fully Convolutional Networks", NIPS 2016
Inception V2: Ioffe and Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", ICML 2015
Inception V3: Szegedy et al, "Rethinking the Inception Architecture for Computer Vision", arXiv 2016
Inception ResNet: Szegedy et al, "Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning", arXiv 2016
MobileNet: Howard et al, "Efficient Convolutional Neural Networks for Mobile Vision Applications", arXiv 2017


Aside: Object Detection + Captioning = Dense Captioning

Johnson, Karpathy, and Fei-Fei, "DenseCap: Fully Convolutional Localization Networks for Dense Captioning", CVPR 2016
Figure copyright IEEE, 2016. Reproduced for educational purposes.


Instance Segmentation

(Same task-comparison figure as above: semantic segmentation labels pixels with no notion of object instances; classification + localization handles a single object; object detection and instance segmentation handle multiple objects, e.g. DOG, DOG, CAT.)


Mask R-CNN
https://arxiv.org/pdf/1703.06870.pdf

CNN → RoI Align → 256 x 14 x 14 features → Conv → 256 x 14 x 14, then two branches:
• Classification scores: C, and box coordinates (per class): 4 * C
• Conv → C x 14 x 14: predict a mask for each of the C classes

He et al, "Mask R-CNN", arXiv 2017


Mask R-CNN: Very Good Results!

He et al, "Mask R-CNN", arXiv 2017
Figures copyright Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 2017. Reproduced with permission.


Mask R-CNN Also Does Pose

Same architecture as above, with an additional branch that predicts joint coordinates alongside the classification scores (C), per-class box coordinates (4 * C), and per-class masks (C x 14 x 14).

He et al, "Mask R-CNN", arXiv 2017
Figures copyright Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 2017. Reproduced with permission.

COCO dataset: https://cocodataset.org/#home
Detectron 2
https://github.com/facebookresearch/detectron2

• Detectron2 was built by Facebook AI Research (FAIR) to support rapid implementation and evaluation of novel computer vision research
• It is implemented in PyTorch
• It is flexible and extensible and provides fast training on GPU servers
• It can be used as a library to support different projects
• Detectron2 models are pre-trained on the COCO dataset
• For a custom dataset, fine-tune the pre-trained model


Detectron 2: Getting Started with Detectron2
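A minimal inference sketch following the official Detectron2 getting-started tutorial; the config name, score threshold, and input image are illustrative choices.

```python
# pip install detectron2 (see the repository for installation instructions)
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
# Mask R-CNN with a ResNet-50 FPN backbone, pre-trained on COCO.
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # confidence threshold for detections

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("input.jpg"))  # illustrative input image
instances = outputs["instances"]
print(instances.pred_classes, instances.pred_boxes)
```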
TensorFlow 2 Object Detection API

• The TensorFlow 2 Object Detection API is an open-source framework built on top of TensorFlow that makes it easy to construct, train, and deploy object detection models.
• It allows training a collection of state-of-the-art object detection models under a unified framework.
• https://github.com/tensorflow/models/tree/master/research/object_detection


Model Zoo: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md
Tutorial: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/
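A minimal inference sketch with a model downloaded from the TF2 Detection Zoo (the SavedModel path and input file below are illustrative); exported detection models return a dictionary of boxes, classes, and scores.

```python
import numpy as np
import tensorflow as tf

# Load a detection model downloaded from the TF2 Detection Zoo (illustrative path).
detect_fn = tf.saved_model.load("ssd_mobilenet_v2_coco/saved_model")

image = tf.io.decode_jpeg(tf.io.read_file("input.jpg"))   # H x W x 3, uint8
input_tensor = tf.expand_dims(image, axis=0)              # add a batch dimension

detections = detect_fn(input_tensor)
boxes = detections["detection_boxes"][0].numpy()          # normalized [ymin, xmin, ymax, xmax]
scores = detections["detection_scores"][0].numpy()
classes = detections["detection_classes"][0].numpy().astype(np.int32)
print(boxes[scores > 0.5], classes[scores > 0.5])
```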
U-Net: Convolutional Networks for Biomedical Image Segmentation
https://arxiv.org/pdf/1505.04597.pdf
(U-Net architecture figure: Figure 1, Scientific Reports, nature.com)
Part 2: CNN Adversarial Examples

Fast Gradient Sign Method (FGSM)
https://www.tensorflow.org/tutorials/generative/adversarial_fgsm
https://arxiv.org/pdf/1412.6572.pdf
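FGSM perturbs the input in the direction of the sign of the loss gradient: x_adv = x + eps * sign(grad_x J(theta, x, y)). A minimal PyTorch sketch (the linked TensorFlow tutorial implements the same idea); model, image, and label are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, eps=0.007):
    """Fast Gradient Sign Method: x_adv = x + eps * sign(dLoss/dx)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()                                 # gradient of the loss w.r.t. the input
    x_adv = image + eps * image.grad.sign()         # step in the direction of the gradient's sign
    return x_adv.clamp(0, 1).detach()               # keep a valid image
```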
Intriguing Properties of Neural Networks
https://arxiv.org/pdf/1312.6199.pdf

How to Hack Artificial Intelligence
https://medium.com/xix-ai/how-adversarial-attacks-work-87495b81da2d
http://databasecultures.irmielin.org/how-to-hack-artificial-intelligence/

Fooling Neural Networks in the Physical World with 3D Adversarial Objects
https://www.labsix.org/physical-objects-that-fool-neural-nets/
Countermeasures for Adversarial Examples
• Reactive strategy: detect adversarial examples after the deep neural network is built. Example: MagNet: a Two-Pronged Defense against Adversarial Examples, https://arxiv.org/pdf/1705.09064.pdf
• Proactive strategy: make deep neural networks more robust before adversaries generate adversarial examples, e.g. adversarial (re)training (see the sketch below).
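A minimal sketch of adversarial (re)training, assuming FGSM (as above) is used to generate the adversarial examples; the model, data loader, and optimizer are placeholders.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, images, labels, eps=0.03):
    """One training step on a mix of clean and FGSM-perturbed examples."""
    # Generate adversarial examples against the current model (FGSM).
    images = images.clone().detach().requires_grad_(True)
    F.cross_entropy(model(images), labels).backward()
    adv_images = (images + eps * images.grad.sign()).clamp(0, 1).detach()

    # Train on both the clean and the adversarial inputs.
    optimizer.zero_grad()
    loss = (F.cross_entropy(model(images.detach()), labels)
            + F.cross_entropy(model(adv_images), labels))
    loss.backward()
    optimizer.step()
    return loss.item()
```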
Survey: Adversarial Examples: Attacks and Defenses for Deep Learning
https://beerkay.github.io/cs529/content/papers/yuan.pdf
How Do CNNs Learn? (II)
https://arxiv.org/pdf/1611.03530.pdf

"Our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data ... We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice."
CIFAR-10
• The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class.
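A sketch of the random-label experiment from the paper above on CIFAR-10, using torchvision; the model and training loop are left abstract, since the point is only that the labels are replaced by random ones before fitting.

```python
import numpy as np
import torchvision

# Load CIFAR-10 and replace every training label with a uniformly random class.
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=torchvision.transforms.ToTensor())
rng = np.random.default_rng(0)
train_set.targets = rng.integers(0, 10, size=len(train_set.targets)).tolist()

# Training any standard CNN on `train_set` with SGD can still reach ~100%
# *training* accuracy even though the labels carry no information: the
# memorization phenomenon described in the quote above.
```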
ImageNet: 1.2M 256x256 color images in 1000 classes
Further reading:
https://arxiv.org/pdf/1706.05394.pdf
https://arxiv.org/pdf/1703.00810.pdf

Stanford Seminar: Information Theory of Deep Learning, 2018
https://www.youtube.com/watch?v=XL07WEc2TRI
