Object Recognition
[Example image: a cat]
Challenges: Deformation
Challenges: Occlusion
Challenges: Background Clutter
Challenges: Intraclass Variation
Galaxy Classification
[Example: object detection labels a person and a horse in an image]
Image Classification: Building Block for other tasks!
Example: Image Captioning
Image classifier
Data-Driven Approach
1. Collect a dataset of images and labels
2. Use Machine Learning to train a classifier
3. Evaluate the classifier on new images
Example training set: MNIST

Datasets

Example Dataset: CIFAR10
- 10 classes
- 50,000 training images
- 10,000 testing images
Alex Krizhevsky, "Learning Multiple Layers of Features from Tiny Images", Technical Report, 2009.

MNIST Handwritten Digits:
- One of the most popular datasets in ML (many variants, still in use today)
- Based on data from the National Institute of Standards and Technology
- Handwritten by Census Bureau employees and high-school students
- Resolution: 28 x 28 pixels; 60k training samples with labels, 10k test samples
LeCun, Bottou, Bengio and Haffner. Gradient-based learning applied to document recognition. IEEE, 1998.
Caltech101:
- Caltech101 was the first major object recognition dataset, collected in 2003
- 101 object categories
- Hand-curated from Google Image Search
- Biased: canonical size and location
Fei-Fei, Fergus and Perona: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. CVPR, 2004.
Image Classification Datasets: CIFAR100
100 classes
50k training images (500 per class)
10k testing images (100 per class)
32x32 RGB images
Zhou et al, “Places: A 10 million Image Database for Scene Recognition”, TPAMI 2017
Classification Datasets: Number of Training Pixels
Approximate number of training pixels per dataset: MNIST ~47M, CIFAR10 ~154M, CIFAR100 ~154M, ImageNet ~251B, Places365 ~1.6T.
Image Classification Datasets: Omniglot
Lake et al, “Human-level concept learning through probabilistic program induction”, Science, 2015
First classifier: Nearest Neighbor
Train: memorize all data and labels
Predict: output the label of the most similar training image
Distance Metric to compare images
L1 distance: d(I1, I2) = Σ_p |I1(p) − I2(p)| (sum of absolute pixel-wise differences)
Nearest Neighbor classifier
Q: With N examples, how fast are training and prediction?
A: Training is O(1), prediction is O(N)
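A minimal NumPy sketch of this classifier (class and method names are illustrative): training just memorizes the data in O(1), and prediction scans all N training images.

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # "Training" just memorizes all data and labels: O(1)
        self.X_train = X  # (N, D): one flattened image per row
        self.y_train = y  # (N,): labels

    def predict(self, X):
        # Each test image is compared against all N training images: O(N)
        y_pred = np.empty(len(X), dtype=self.y_train.dtype)
        for i in range(len(X)):
            # L1 distance: sum of absolute pixel-wise differences
            dists = np.sum(np.abs(self.X_train - X[i]), axis=1)
            y_pred[i] = self.y_train[np.argmin(dists)]
        return y_pred
```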
Setting Hyperparameters
Your Dataset
Idea #2: Split data into train and test; choose hyperparameters that work best on the test data. BAD: no idea how the algorithm will perform on new data.
  train | test
Idea #3: Split data into train, val, and test; choose hyperparameters on val and evaluate on test. Better!
  train | validation | test
Idea #4: Cross-Validation: split the data into folds, try each fold as the validation set, and average the results. Useful for small datasets, but not used too frequently in deep learning.
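A sketch of choosing the hyperparameter k by cross-validation, assuming a k-NN classifier with L1 distance; both function names are illustrative.

```python
import numpy as np

def l1_knn_accuracy(X_tr, y_tr, X_ev, y_ev, k):
    # Accuracy of a k-NN classifier with L1 distance
    correct = 0
    for x, y in zip(X_ev, y_ev):
        dists = np.sum(np.abs(X_tr - x), axis=1)
        nearest = y_tr[np.argsort(dists)[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        correct += values[np.argmax(counts)] == y  # majority vote
    return correct / len(X_ev)

def cross_validate_k(X, y, ks=(1, 3, 5, 7), folds=5):
    # Try each fold as the validation set and average the results
    X_folds, y_folds = np.array_split(X, folds), np.array_split(y, folds)
    scores = {}
    for k in ks:
        accs = [l1_knn_accuracy(np.concatenate(X_folds[:i] + X_folds[i+1:]),
                                np.concatenate(y_folds[:i] + y_folds[i+1:]),
                                X_folds[i], y_folds[i], k)
                for i in range(folds)]
        scores[k] = float(np.mean(accs))
    return scores  # pick the k with the best mean validation accuracy
```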
Curse of dimensionality: for nearest neighbor to work well, the training points must densely cover the space, and the number of points needed grows exponentially with dimension (Dimensions = 1: 4 points; Dimensions = 2: 4² points; ...).
Nearest Neighbor Classifier:
- Choose image distance: d(I1, I2) = Σ_p |I1(p) − I2(p)|
- I* = argmin_{I' ∈ D} d(I, I')
- Return class of I*
- This is a slow (nearest-neighbor search at test time) and bad (pixel distances uninformative) classifier
Thought Experiment:
- Two checkerboards, horizontally displaced by one field ⇒ What is d(I1, I2)? The L1 distance is maximal, even though the two images look nearly identical, illustrating that pixel distances are uninformative.
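A quick numerical check of the thought experiment: shifting a checkerboard by one field flips every pixel, so the L1 distance is maximal despite the near-identical appearance.

```python
import numpy as np

board = np.indices((8, 8)).sum(axis=0) % 2  # 8x8 checkerboard of 0s and 1s
shifted = 1 - board                          # displaced by one field
print(np.abs(board - shifted).sum())         # 64: every single pixel differs
```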
Interest Point Detection (Feature Extraction)
Feature detectors and descriptors
Why do we need feature descriptors?
Scale Invariant Detectors
- Harris-Laplacian: find the local maximum of the Harris corner detector in space (image coordinates) and of the Laplacian in scale
- SIFT (Lowe): find the local maximum (minimum) of the Difference of Gaussians (DoG) in both space and scale
Blob detectors
SIFT features
SIFT vector formation:
- Computed on a rotated and scaled version of the window, according to the computed orientation & scale
  - resample the window
- Based on gradients weighted by a Gaussian of variance half the window (for smooth falloff)
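A minimal sketch of extracting SIFT keypoints and descriptors with OpenCV; it assumes opencv-python ≥ 4.4 (where SIFT is available as cv2.SIFT_create) and a placeholder image path.

```python
import cv2

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)  # replace with a real path
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# Each keypoint carries a location, scale and orientation; each descriptor
# is a 128-D vector of Gaussian-weighted gradient histograms computed on
# the window after rotating and scaling it to the keypoint's frame.
print(len(keypoints), descriptors.shape)  # (N, 128)
```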
Local binary pattern features
HOG features
Dalal, Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005
- Histogram of 'unsigned' gradients, with soft binning
- Cell (8x8 pixels): one gradient magnitude histogram per cell
- Block (2x2 cells): cell histograms are grouped and normalized per block
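A sketch of computing HOG features with scikit-image, using parameters that mirror the slide (8x8-pixel cells, 2x2-cell blocks, 9 unsigned orientation bins); the sample image is just a stand-in.

```python
from skimage import data
from skimage.feature import hog

img = data.astronaut()[:, :, 0]  # one channel as a grayscale image
features = hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
print(features.shape)  # one long, block-normalized feature vector
```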
Related histogram-based descriptors: shape context, SIFT, texton, HOG.
Texture classification
Example filter banks: Gaussian derivatives at different scales and orientations, e.g. the 'LM' (Leung-Malik) and 'MR8' filter banks.
Histogram of Textons descriptor: filter responses are 'encoded' against a texton dictionary and then 'pooled' into a histogram.
Learning Textons from data
Patches are extracted from training images and clustered; the cluster centers form the texton dictionary.
Julesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001; Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003
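A simplified sketch of texton dictionary learning with scikit-learn's KMeans. Note one deliberate shortcut: it clusters raw gray-level patches rather than filter-bank responses, and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_texton_dictionary(images, patch=7, k=64, samples=10000, seed=0):
    # Sample random patches and cluster them; centers are the textons
    rng = np.random.default_rng(seed)
    patches = []
    for _ in range(samples):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - patch)
        x = rng.integers(img.shape[1] - patch)
        patches.append(img[y:y+patch, x:x+patch].ravel())
    return KMeans(n_clusters=k, n_init=10).fit(np.array(patches))

def texton_histogram(img, km, patch=7):
    # Encode each patch by its nearest texton, then pool into a histogram
    codes = [img[y:y+patch, x:x+patch].ravel()
             for y in range(0, img.shape[0] - patch, patch)
             for x in range(0, img.shape[1] - patch, patch)]
    labels = km.predict(np.array(codes))
    return np.bincount(labels, minlength=km.n_clusters) / len(labels)
```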
Bag-of-Words (BoW)
Fei-Fei and Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR, 2005.
An object as a bag of visual words.
The BoW pipeline (see the sketch below):
1. Dictionary learning: learn visual words using clustering (the appearance codebook)
2. Encode: build a Bag-of-Words (BoW) vector for each image
3. Classify: train and test a classifier on the BoW vectors
Source: B. Leibe
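A sketch of the three BoW steps with scikit-learn; `local_descriptors` is a hypothetical helper that returns an (n, d) array of local descriptors (e.g. SIFT) for one image, and the linear SVM stands in for any classifier.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def bow_vector(img, codebook, local_descriptors):
    # Encode: map each descriptor to its nearest visual word,
    # then pool the word indices into a normalized histogram
    words = codebook.predict(local_descriptors(img))
    hist = np.bincount(words, minlength=codebook.n_clusters)
    return hist / max(hist.sum(), 1)

def train_bow_classifier(train_imgs, labels, local_descriptors, k=200):
    # 1. Dictionary learning: cluster all descriptors into k visual words
    all_desc = np.vstack([local_descriptors(im) for im in train_imgs])
    codebook = KMeans(n_clusters=k, n_init=10).fit(all_desc)
    # 2. Encode each training image as a BoW vector
    X = np.array([bow_vector(im, codebook, local_descriptors)
                  for im in train_imgs])
    # 3. Classify: train a linear SVM on the BoW vectors
    return codebook, LinearSVC().fit(X, labels)
```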
Representation Learning
ImageNet dataset
The Image Classification Challenge: 1,000 object classes, 1,431,167 images
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
ILSVRC top-5 error over the years: 16.4% (AlexNet, 2012), 11.7% (ZFNet, 2013), 7.3% (VGG, 2014, 19 layers), 6.7% (GoogLeNet, 2014, 22 layers).
Classical recognition pipeline vs. a deep network:
- Classical: dense descriptor grid (HOG, LBP) → coding (local coordinate, super-vector) → pooling, SPM → linear SVM
- Deep (VGG-style): conv-128, conv-128, maxpool, conv-256, conv-256, maxpool, conv-512, conv-512, maxpool, conv-512, conv-512, maxpool, fc-4096, fc-4096, fc-1000, softmax
[Szegedy arXiv 2014] [Simonyan arXiv 2014] [He ICCV 2015]
Convolutional Neural Networks (CNN) were not invented overnight:
1998: LeCun et al. (LeNet)
2012: Krizhevsky et al. (AlexNet)
Conv layer vs. FC layer
Fully Connected Layer: stretch the 32x32x3 image to a 3072 x 1 vector; with a 10 x 3072 weight matrix W, each output is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
Convolution Layer: keep the 32x32x3 image; convolve (slide) a 5x5x3 filter over all spatial locations; each location yields 1 number (the dot product between the filter and a 5x5x3 image patch), producing a 28x28x1 activation map.
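A naive NumPy sketch of the convolution just described: one 5x5x3 filter slid over a 32x32x3 image yields a 28x28 activation map.

```python
import numpy as np

def conv_forward(x, w, stride=1):
    # x: (H, W, C) input image; w: (F, F, C) filter
    H, W, C = x.shape
    F = w.shape[0]
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride+F, j*stride:j*stride+F, :]
            out[i, j] = np.sum(patch * w)  # F*F*C-dimensional dot product
    return out

x = np.random.randn(32, 32, 3)
w = np.random.randn(5, 5, 3)
print(conv_forward(x, w).shape)  # (28, 28)
```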
Conv layer
For example, if we had 6 5x5 filters, we'd get 6 separate activation maps: the 32x32x3 input volume becomes a 28x28x6 output volume.
"5x5 filter" -> "5x5 receptive field for each neuron"
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter.
Sliding the filter across all valid positions, one pixel at a time, gives a 5x5 output.
Conv Layer
Output size: (N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 (doesn't fit: stride 3 is not applicable here)
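The output-size formula as a small helper (function name is illustrative):

```python
def conv_output_size(n, f, stride):
    # Spatial output size of a conv layer with no padding
    size = (n - f) / stride + 1
    assert size.is_integer(), f"stride {stride} does not fit: {size}"
    return int(size)

for s in (1, 2):
    print(conv_output_size(7, 3, s))  # 5, 3
```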
Pooling layer
MAX POOLING: each output value is the maximum over a small spatial region of the input (e.g. 2x2 regions with stride 2).
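A minimal NumPy sketch of max pooling; the 4x4 example input is assumed for illustration.

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    # Take the max over each (size x size) region of the input
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i*stride:i*stride+size,
                          j*stride:j*stride+size].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool(x))  # [[6. 8.] [3. 4.]]
```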
Fully connected layer
Output and Loss Functions
[Diagram: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function ← Target]
- The output layer is the last layer in a neural network; it computes the output
- The loss function compares the result of the output layer to the target value(s)
- In image classification, we use a softmax output layer and a cross-entropy loss
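A minimal NumPy sketch of the softmax output layer and cross-entropy loss mentioned above:

```python
import numpy as np

def softmax(z):
    # Softmax output layer: turns raw scores into class probabilities
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(probs, target):
    # Cross-entropy loss against an integer class target
    return -np.log(probs[target])

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p, cross_entropy(p, target=0))
```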
Hierarchical learning
Preview [Zeiler and Fergus 2013]. Visualization of VGG-16 by Lane McIntosh; VGG-16 architecture from [Simonyan and Zisserman 2014].
Convolutional Neural Networks (CNN)
Recent advances in CNN architectures:
1. LeNet
2. AlexNet
3. ZFNet
4. VGGNet
5. GoogLeNet
6. ResNet
7. Inception
8. Xception
9. SqueezeNet
10. ShuffleNet
11. MobileNetV2
12. DenseNet201
13. NasNet-Mobile
CNN
1998: LeNet-5
AlexNet (2012):
- 8 layers, ReLUs, dropout, data augmentation, trained on 2 GTX 580 GPUs
- The number of feature channels increases with depth, while spatial resolution decreases
- Triggered the deep learning revolution; showed that CNNs work well in practice
2014: VGG
VGG:
- Uses 3x3 convolutions everywhere (same expressiveness, fewer parameters)
- Three 3x3 layers have the same receptive field as one 7x7 layer, but fewer parameters (see the sketch below)
- 2 variants: 16 and 19 layers. Showed that deeper networks are better.
Simonyan and Zisserman: Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.
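A quick parameter count (biases ignored, C input and C output channels) illustrating why three stacked 3x3 layers are cheaper than one 7x7 layer with the same receptive field:

```python
C = 64
one_7x7 = 7 * 7 * C * C          # 49 C^2 weights
three_3x3 = 3 * (3 * 3 * C * C)  # 27 C^2 weights
print(one_7x7, three_3x3)  # 200704 vs 110592: fewer params, same receptive field
```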
2014: Inception / GoogLeNet
Inception:
- 22 layers. Inception modules utilize conv/pool operations with varying filter sizes in parallel
- Multiple intermediate classification heads to improve gradient flow
- Global average pooling (no FC layers); roughly 27x fewer parameters (5 million) than VGG-16
2016: ResNet
ResNet:
- Very simple and regular network structure built from 3x3 convolutions (7x7 conv, 64, /2; pool, /2; stacked 3x3 conv, 64 blocks; ...; avg pool; fc 1000)
- Residual connections allow for training much deeper networks (up to 152 layers), as sketched below
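A schematic sketch of a residual connection; conv1 and conv2 are stand-ins for shape-preserving 3x3 convolutions (illustrative, not ResNet's exact layers):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, conv1, conv2):
    # output = relu(F(x) + x); the identity shortcut lets gradients
    # flow directly through the skip connection
    out = relu(conv1(x))
    out = conv2(out)
    return relu(out + x)

# Toy usage with a shape-preserving stand-in for a 3x3 convolution:
conv = lambda x: 0.5 * x
print(residual_block(np.ones(4), conv, conv))  # [1.25 1.25 1.25 1.25]
```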
Accuracy vs. Complexity