This document provides an introduction to deep learning for computer vision. It discusses how deep learning methods can learn features from data rather than relying on hand-designed features. Convolutional neural networks are described as a successful deep learning approach for computer vision tasks like object recognition. The document reviews historical neural network architectures and recent successes of deep learning on large-scale datasets like ImageNet.


High Level Computer Vision

Intro to Deep Learning for Computer Vision

Bernt Schiele - [email protected]


Mario Fritz - [email protected]

https://www.mpi-inf.mpg.de/hlcv

most slides from: Rob Fergus & Marc’Aurelio Ranzato


Deep Learning for Computer Vision

NIPS 2013 Tutorial

Rob Fergus
Dept. of Computer Science
New York University
Overview

• Primarily about object recognition, using supervised ConvNet models
• Focus on natural images
– Rather than digits
– Classification & Detection
• Brief discussion of other vision problems
Motivation
Existing Recognition Approach

Image/Video Pixels → Hand-designed Feature Extraction → Trainable Classifier → Object Class

• Features are not learned
• Trainable classifier is often generic (e.g. SVM)


Motivation
• Features are key to recent progress in recognition
• Multitude of hand-designed features currently in use
– SIFT, HOG, LBP, MSER, Color-SIFT, …
• Where next? Better classifiers? Or keep building more features?

Felzenszwalb, Girshick, McAllester and Ramanan, PAMI 2010
Yan & Huang (winner of PASCAL 2010 classification competition)
Hand-Crafted Features
• LP-β Multiple Kernel Learning (MKL)
– Gehler and Nowozin, On Feature Combination for Multiclass Object Classification, ICCV’09
• 39 different kernels
– PHOG, SIFT, V1S+, Region Cov., etc.
• MKL only gets a few % gain over averaging features
→ Features are doing the work
What about Learning the Features?

• Perhaps get better performance?
• Deep models: hierarchy of feature extractors
• All the way from pixels → classifier
• One layer extracts features from output of previous layer

Image/Video Pixels → Layer 1 → Layer 2 → Layer 3 → Simple Classifier

• Train all layers jointly
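A minimal sketch of this idea in Python: a deep model is essentially function composition, with each (hypothetical, randomly initialized) layer consuming the previous layer's output; during joint training, gradients would flow back through all layers from the classifier's loss.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Hypothetical stand-ins for learned layers: each is a linear map
# followed by a non-linearity, consuming the previous layer's output.
def make_layer(n_in, n_out, rng):
    W = rng.standard_normal((n_out, n_in)) * 0.01
    return lambda x: relu(W @ x)

rng = np.random.default_rng(0)
layer1 = make_layer(784, 256, rng)   # pixels -> low-level features
layer2 = make_layer(256, 128, rng)   # low-level -> mid-level features
layer3 = make_layer(128, 64, rng)    # mid-level -> high-level features

pixels = rng.random(784)                   # a fake flattened image
features = layer3(layer2(layer1(pixels)))  # hierarchy = composition
# A simple classifier (e.g. softmax) sits on top of `features`;
# joint training back-propagates its loss through all three layers.
```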


Deep Learning

A rough taxonomy (axes: DEEP ↔ SHALLOW, SUPERVISED ↔ UNSUPERVISED):

SUPERVISED
• Deep: Recurrent Neural Net, Convolutional Neural Net, Neural Net
• Shallow: Boosting, Perceptron, SVM

UNSUPERVISED
• Deep: Deep (sparse/denoising) Autoencoder, Deep Belief Net
• Shallow: Autoencoder, Sparse Coding, SP, GMM, Restricted BM, BayesNP

Slide: M. Ranzato
Multistage Hubel & Wiesel Architecture
Slide: Y. LeCun

• [Hubel & Wiesel 1962]
– Simple cells detect local features
– Complex cells “pool” the outputs of simple cells within a retinotopic neighborhood
• Cognitron / Neocognitron [Fukushima 1971-1982]
• Convolutional Networks [LeCun 1988-present]
• Also HMAX [Poggio 2002-2006]

[Reading: Chapter 5.1 - 5.3, Bishop 2006]

Short Intro: “Standard” Neural Networks
Short Intro: Perceptron
Short Intro: Perceptron - Activation Functions
Single Layer Perceptron
Short Intro: Two-Layer Perceptron
Short Intro: Multi-Layer Perceptron (MLP)

(slides taken from David Stutz, Aachen)
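A minimal NumPy sketch of a two-layer perceptron forward pass (sigmoid hidden units; all layer sizes here are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Weights and biases for a two-layer perceptron: 4 inputs -> 8 hidden -> 3 outputs.
W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = rng.standard_normal((3, 8)), np.zeros(3)

def forward(x):
    h = sigmoid(W1 @ x + b1)   # hidden layer activations
    y = W2 @ h + b2            # output layer (pre-softmax scores)
    return y

print(forward(rng.random(4)))
```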
Network Training
Network Training - Error Measures
Network Training - Approaches
Network Training - Parameter Optimization
Parameter Optimization by Gradient Descent
Backpropagation = Parameter Optimization by Gradient Descent

(slides taken from David Stutz, Aachen)
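A minimal NumPy sketch of backpropagation as gradient descent on a squared-error measure, for a tiny two-layer network on made-up data (sizes and learning rate are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 4))        # 100 made-up training inputs
T = rng.random((100, 1))        # made-up regression targets

W1, b1 = rng.standard_normal((4, 8)) * 0.5, np.zeros(8)
W2, b2 = rng.standard_normal((8, 1)) * 0.5, np.zeros(1)
lr = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(1000):
    # Forward pass.
    H = sigmoid(X @ W1 + b1)    # hidden activations
    Y = H @ W2 + b2             # network output
    E = Y - T                   # per-example error

    # Backward pass: chain rule, layer by layer.
    dW2 = H.T @ E / len(X)
    db2 = E.mean(axis=0)
    dH = E @ W2.T * H * (1 - H)   # sigmoid derivative is h * (1 - h)
    dW1 = X.T @ dH / len(X)
    db1 = dH.mean(axis=0)

    # Gradient descent update on all parameters.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print("final mean squared error:", float((E ** 2).mean()))
```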
Convolutional Neural Networks

• LeCun et al. 1989


• Neural network with specialized connectivity structure
Convnet Successes
• Handwritten text/digits
– MNIST (0.17% error [Ciresan et al. 2011])
– Arabic & Chinese [Ciresan et al. 2012]

• Simpler recognition benchmarks


– CIFAR-10 (9.3% error [Wan et al. 2013])
– Traffic sign recognition
• 0.56% error vs 1.16% for humans [Ciresan et al. 2011]

• But (until recently) less good at more complex datasets
– E.g. Caltech-101/256 (few training examples)
Characteristics of Convnets

• Feed-forward:
– Convolve input (learned convolution filters)
– Non-linearity (rectified linear)
– Pooling (local max / subsampling)
• Supervised
• Train convolutional filters by back-propagating classification error

Layer stack: Input Image → Convolution (learned) → Non-linearity → Pooling (= subsampling) → Feature maps

[LeCun et al. 1989]


Application to ImageNet
[Deng et al. CVPR 2009]

• ~14 million labeled images, 20k classes

• Images gathered from Internet

• Human labels via Amazon Turk

Krizhevsky et al. [NIPS 2012]
• Same model as LeCun’98 but:

- Bigger model (8 layers)
- More data (10^6 vs 10^3 images)
- GPU implementation (50x speedup over CPU)
- Better regularization (DropOut)

• 7 hidden layers, 650,000 neurons, 60,000,000 parameters


• Trained on 2 GPUs for a week
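A rough PyTorch sketch of an AlexNet-style stack: 5 conv layers + 3 fully connected (8 layers total), ReLU throughout, max pooling after conv 1, 2 and 5, and DropOut in the fully connected layers. Layer sizes follow the common torchvision variant, which is an assumption here; the original's local response normalization and two-GPU split are omitted.

```python
import torch
import torch.nn as nn

alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),   # DropOut regularization
    nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                                    # 1000 ImageNet classes
)

scores = alexnet_like(torch.randn(1, 3, 224, 224))  # one fake 224x224 RGB image
print(scores.shape)  # torch.Size([1, 1000])
```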
ImageNet Classification 2012

• Krizhevsky et al. - 16.4% error (top-5)


• Next best (non-convnet) – 26.2% error
[Bar chart: top-5 error rate (%) by team: SuperVision, ISI, Oxford, INRIA, Amsterdam]
Commercial Deployment
• Google & Baidu, Spring 2013 for personal image search
Intuitions Behind Deep Networks
(following slides from Marc Aurelio Ranzato - Google)
Large Convnets for Image Classification
Large Convnets for Image Classification

• Operations in each layer

• Architecture

• Training

• Results
Components of Each Layer

Pixels / Features → Filter with Dictionary (convolutional or tiled) → Non-linearity → Spatial/Feature Pooling (Sum or Max) → Normalization between feature responses [Optional] → Output Features
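A minimal NumPy/SciPy sketch of one such layer, for a single input and output feature map (a real layer applies a whole bank of filters to 3D inputs; the divisive normalization shown is just one simple possibility):

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer(x, filt, pool=2, eps=1e-5):
    """One layer: filter -> non-linearity -> spatial max pooling -> normalization."""
    f = correlate2d(x, filt, mode="valid")   # filter with a dictionary entry
    f = np.maximum(f, 0.0)                   # non-linearity (rectified linear)
    h, w = f.shape
    f = f[:h - h % pool, :w - w % pool]      # trim so pooling regions tile exactly
    f = f.reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))  # max pool
    return f / (np.linalg.norm(f) + eps)     # [optional] normalization

rng = np.random.default_rng(0)
out = conv_layer(rng.random((32, 32)), rng.standard_normal((5, 5)))
print(out.shape)  # (14, 14): 28x28 after valid filtering, 14x14 after 2x2 pooling
```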
Compare: SIFT Descriptor

Image Pixels → Apply Gabor filters → Spatial pool (Sum) → Normalize to unit length → Feature Vector
Non-Linearity

• Non-linearity
– Per-feature independent
– Tanh
– Sigmoid: 1/(1+exp(-x))
– Rectified linear
• Simplifies backprop
• Makes learning faster
• Avoids saturation issues


→ Preferred option
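The three options as a quick NumPy sketch; the rectified linear unit's gradient is simply 0 or 1, which is part of why it simplifies backprop and avoids the saturation of tanh/sigmoid:

```python
import numpy as np

def tanh(x):      # saturates for large |x|
    return np.tanh(x)

def sigmoid(x):   # 1/(1+exp(-x)), also saturates
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):      # rectified linear: no saturation for x > 0
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(tanh(x), sigmoid(x), relu(x), sep="\n")
```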
Pooling
• Spatial Pooling
– Non-overlapping / overlapping regions
– Sum or max
– Boureau et al. ICML’10 for theoretical analysis

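A quick NumPy sketch of non-overlapping 2x2 spatial pooling in both variants discussed above:

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """Non-overlapping 2x2 spatial pooling of a single feature map."""
    h, w = fmap.shape
    blocks = fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.sum(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
print(pool2x2(fmap, "max"))   # [[ 5.  7.] [13. 15.]]
print(pool2x2(fmap, "sum"))   # [[10. 18.] [42. 50.]]
```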
Architecture

Importance of Depth
Architecture of Krizhevsky et al.

• 8 layers total
• Trained on ImageNet dataset [Deng et al. CVPR’09]
• 18.2% top-5 error
• Our reimplementation: 18.1% top-5 error

Layer stack: Input Image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool → Layer 6: Full → Layer 7: Full → Softmax Output
Architecture of Krizhevsky et al.

• Remove top fully connected layer (Layer 7)
• Drop 16 million parameters
• Only 1.1% drop in performance!

Layer stack: Input Image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool → Layer 6: Full → Softmax Output
Architecture of Krizhevsky et al.

• Remove both fully connected layers (Layers 6 & 7)
• Drop ~50 million parameters
• 5.7% drop in performance

Layer stack: Input Image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool → Softmax Output
Architecture of Krizhevsky et al.

• Now try removing upper feature extractor layers: Layers 3 & 4
• Drop ~1 million parameters
• 3.0% drop in performance

Layer stack: Input Image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 5: Conv + Pool → Layer 6: Full → Layer 7: Full → Softmax Output
Architecture of Krizhevsky et al.

• Now try removing upper feature extractor layers & fully connected: Layers 3, 4, 6 & 7
• Now only 4 layers
• 33.5% drop in performance

→ Depth of network is key

Layer stack: Input Image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 5: Conv + Pool → Softmax Output
Tapping off Features at each Layer
Plug features from each layer into linear SVM or soft-max
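A minimal sketch of this probing setup with scikit-learn; `features_at_layer` is a hypothetical stand-in for running the convnet up to a given layer and flattening its activations:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def features_at_layer(images, layer):
    """Hypothetical stand-in: run the convnet up to `layer`, return flat features."""
    rng = np.random.default_rng(layer)
    return rng.random((len(images), 256))   # fake activations for illustration

images = list(range(200))                   # stand-in for a labeled image set
labels = np.repeat([0, 1], 100)

for layer in range(1, 8):
    X = features_at_layer(images, layer)
    acc = cross_val_score(LinearSVC(), X, labels, cv=3).mean()
    print(f"layer {layer}: linear SVM accuracy = {acc:.3f}")
```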
[Plots: performance of features from Layer 1 through Layer 7 and Output under vertical translation, scale change, and rotation of the input]
Visualizing ConvNets
Visualizing Convnets

• Raw coefficients of learned filters in higher layers are difficult to interpret
• Several approaches look to optimize the input to maximize activity in a high-level feature
– Erhan et al. [Tech Report 2009]
– Le et al. [NIPS 2010]
– Depend on initialization
– Model invariance with Hessian about (locally) optimal stimulus
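A rough PyTorch sketch of the optimize-the-input idea: gradient ascent on a random input image to maximize one unit's activation. `model` and the unit index are placeholders, and the result depends on the random initialization, as noted above.

```python
import torch

def maximize_activation(model, unit, steps=200, lr=0.1):
    """Gradient ascent on the input to maximize one output unit's activation."""
    x = torch.randn(1, 3, 224, 224, requires_grad=True)  # random initialization
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        act = model(x)[0, unit]   # activation of the chosen high-level unit
        (-act).backward()         # ascend by minimizing the negative activation
        opt.step()
    return x.detach()

# Usage (placeholders for a trained network and a feature index):
# stimulus = maximize_activation(trained_convnet, unit=42)
```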
Visualization using Deconvolutional Networks
[Zeiler et al. CVPR’10, ICCV’11, arXiv’13]

• Provide way to map activations at high layers back to the input
• Same operations as Convnet, but in reverse:
– Unpool feature maps
– Convolve unpooled maps (filters copied from Convnet)
• Used here purely as a probe
– Originally proposed as unsupervised learning method
– No inference, no learning

Reverse layer stack: Feature maps → Unpooling → Non-linearity → Convolution (learned) → Input Image
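A minimal NumPy sketch of the unpooling step: the forward max pool records which position (the “switch”) won in each 2x2 region, and unpooling places each value back at that position, zeros elsewhere:

```python
import numpy as np

def maxpool_with_switches(fmap):
    """2x2 max pool; also return the argmax 'switches' for later unpooling."""
    h, w = fmap.shape
    blocks = fmap.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    switches = blocks.argmax(axis=1)                  # winning position per region
    pooled = blocks.max(axis=1).reshape(h // 2, w // 2)
    return pooled, switches

def unpool(pooled, switches):
    """Place each pooled value back at its recorded switch position."""
    h2, w2 = pooled.shape
    blocks = np.zeros((h2 * w2, 4))
    blocks[np.arange(h2 * w2), switches] = pooled.ravel()
    return blocks.reshape(h2, w2, 2, 2).transpose(0, 2, 1, 3).reshape(h2 * 2, w2 * 2)

fmap = np.random.default_rng(0).random((4, 4))
pooled, sw = maxpool_with_switches(fmap)
print(unpool(pooled, sw))   # max values restored in place, zeros elsewhere
```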
Deconvnet Projection from Higher Layers
[Zeiler and Fergus. arXiv’13]

[Diagram: one activation in a Layer 2 feature map is kept (all others set to 0) and passed down through the deconvnet, with filters copied from the convnet: Layer 2 reconstruction → Layer 1 reconstruction → visualization in pixel space, alongside the forward convnet path from the input image]

[Figure: Unpooling Operation]
[Figure: Layer 1 Filters]
Visualizations of Higher Layers
[Zeiler and Fergus. arXiv’13]

• Use ImageNet 2012 validation set
• Push each image through network
• Take max activation from feature map associated with each filter
• Use Deconvnet to project back to pixel space
• Use pooling “switches” peculiar to that activation

Layer 1: Top-9 Patches
Layer 2: Top-9

• NOT SAMPLES FROM MODEL


• Just parts of input image that give strong activation of this feature map
• Non-parametric view on invariances learned by model
Layer 2: Top-9 Patches

• Patches from validation images that give maximal activation of a given feature map
Layer 3: Top-1
Layer 3: Top-9
Layer 3: Top-9 Patches
Layer 4: Top-1
Layer 4: Top-9
Layer 4: Top-9 Patches
Layer 5: Top-1
Layer 5: Top-9
Layer 5: Top-9 Patches
ImageNet Classification 2013 Results
• http://www.image-net.org/challenges/LSVRC/2013/results.php

[Bar chart: test error (top-5) by team: Clarifai (extra data), NUS, Andrew Howard, UvA-Euvision, Adobe, CognitiveVision]

• Pre-2012: 26.2% error → 2012: 16.5% error → 2013: 11.2% error


Sample Classification Results
[Krizhevsky et al. NIPS’12]
