
Advanced Topics of Computer Graphics and Vision
Spring 2024
Object recognition
Image classification (object recognition)
Image Classification: a core task in Computer Vision
Given an image and a fixed set of discrete labels, e.g. {dog, cat, truck, plane, ...}, the task is to assign one of the labels (here: "cat") to the image.
Image classification (object recognition)
The Problem: Semantic Gap
What the computer sees: an image is just a big grid of numbers between [0, 255],
e.g. 800 x 600 x 3 (3 RGB channels).
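A minimal sketch of this view (assuming NumPy and Pillow are available; the file name is hypothetical):

    import numpy as np
    from PIL import Image

    img = np.array(Image.open("cat.jpg"))   # hypothetical file
    print(img.shape)   # e.g. (600, 800, 3): height x width x 3 RGB channels
    print(img.dtype)   # uint8, i.e. integers between 0 and 255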
Challenges: Illumination
Challenges: Viewpoint variation
All pixels change when the camera moves!

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 2 - 8 April 6, 2017

Challenges: Deformation
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 2 - 9 April 6, 2017
Challenges: Occlusion

Challenges: Background clutter
Challenges: Intra-class variation
Image Classification: Very Useful!
Examples: medical imaging, whale recognition (Levy et al, 2016), galaxy classification (Kaggle challenge; Dieleman et al, 2014).
Image Classification: Building Block for other tasks!
Example: Object Detection (e.g. locating the "Person" and the "Horse" in an image).
Image Classification: Building Block for other tasks!
Example: Image Captioning. At each step the model asks "what word to say next?" and chooses among words such as {man, riding, horse, cat, when, ...}, producing the caption "Man riding horse <STOP>".
Image classifier
An image classifier
Unlike, e.g., sorting a list of numbers, there is no obvious way to hard-code the algorithm for recognizing a cat or other classes.
Image classifier
Attempts have been made: find edges, find corners, ... ?
Image classifier
Data-Driven Approach
1. Collect a dataset of images and labels
2. Use Machine Learning to train a classifier
3. Evaluate the classifier on new images
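A minimal sketch of the interface these three steps imply, with a deliberately trivial "majority label" model (illustrative only; real classifiers follow below):

    from collections import Counter

    def train(images, labels):
        # 2. "Learn" a model: here, just remember the most common training label
        return Counter(labels).most_common(1)[0][0]

    def predict(model, test_images):
        # 3. Evaluate on new images: this toy model predicts the same label for every image
        return [model for _ in test_images]

    print(predict(train([...], ["cat", "dog", "cat"]), ["img1", "img2"]))  # ['cat', 'cat']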
Example training set: MNIST

Datasets
Example Dataset: CIFAR10
10 classes
50,000 training images
10,000 testing images
Alex Krizhevsky, "Learning Multiple Layers of Features from Tiny Images", Technical Report, 2009.

MNIST Handwritten Digits:
- One of the most popular datasets in ML (many variants, still in use today)
- Based on data from the National Institute of Standards and Technology
- Handwritten by Census Bureau employees and high-school children
- Resolution: 28 x 28 pixels, 60k training samples with labels, 10k test samples
LeCun, Bottou, Bengio and Haffner. Gradient-based learning applied to document recognition. IEEE, 1998.

Caltech101:
- Caltech101 was the first major object recognition dataset, collected in 2003
- 101 object categories
- Hand-curated from Google Image Search
- Biased: canonical size and location
Fei-Fei, Fergus and Perona: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. CVPR Workshop, 2004.
Datasets
Image Classification Datasets: CIFAR100
100 classes
50k training images (500 per class)
10k testing images (100 per class)
32x32 RGB images

20 superclasses with 5 classes each:
- Aquatic mammals: beaver, dolphin, otter, seal, whale
- Trees: maple, oak, palm, pine, willow
Alex Krizhevsky, “Learning Multiple Layers of Features from Tiny Images”, Technical Report, 2009.
Datasets
Image Classification Datasets: ImageNet
1000 classes
~1.3M training images (~1.3K per class)
50K validation images (50 per class)
100K test images (100 per class)
Test labels are secret!
Images have variable size, but are often resized to 256x256 for training.
There is also a 22k-category version of ImageNet, but it is less commonly used.
Deng et al, "ImageNet: A Large-Scale Hierarchical Image Database", CVPR 2009
Russakovsky et al, "ImageNet Large Scale Visual Recognition Challenge", IJCV 2015
Datasets
Image Classification Datasets: MIT Places

365 classes of different scene types
~8M training images
18.25K val images (50 per class)
328.5K test images (900 per class)
Images have variable size, but are often resized to 256x256 for training.
Zhou et al, "Places: A 10 million Image Database for Scene Recognition", TPAMI 2017
Datasets
Classification Datasets: Number of Training Pixels (approximate)
MNIST: ~47M | CIFAR10: ~154M | CIFAR100: ~154M | ImageNet: ~251B | Places365: ~1.6T
Datasets
Image Classification Datasets: Omniglot
1623 categories: characters from 50 different alphabets
20 images per category
Meant to test few-shot learning
Lake et al, "Human-level concept learning through probabilistic program induction", Science, 2015
Image classifier
First classifier: Nearest Neighbor

Memorize all data and labels.
Predict the label of the most similar training image.
Image classifier
Example Dataset: CIFAR10
10 classes
50,000 training images
10,000 testing images
Test images and their nearest neighbors

Alex Krizhevsky, “Learning Multiple Layers of Features from Tiny Images”, Technical Report, 2009.
Image classifier
Distance Metric to compare images
L1 distance: d_1(I_1, I_2) = \sum_p |I_1(p) - I_2(p)|
(add up the absolute pixel-wise differences)
Image classifier
What does this look like?
Image classifier
Nearest Neighbor classifier

Memorize training data


Image classifier
Nearest Neighbor classifier

For each test image:
- Find the closest training image
- Predict the label of that nearest image
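A minimal NumPy sketch of this Nearest Neighbor classifier with the L1 distance (variable names are illustrative):

    import numpy as np

    class NearestNeighbor:
        def train(self, X, y):
            # X: N x D array of flattened training images, y: N labels.
            # "Training" just memorizes the data.
            self.X_train = X
            self.y_train = y

        def predict(self, X):
            # For each test image, find the closest training image under
            # the L1 distance and copy its label.
            y_pred = np.empty(X.shape[0], dtype=self.y_train.dtype)
            for i in range(X.shape[0]):
                distances = np.sum(np.abs(self.X_train - X[i]), axis=1)
                y_pred[i] = self.y_train[np.argmin(distances)]
            return y_pred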
Image classifier
Nearest Neighbor classifier

Q: With N examples, how fast are training and prediction?
A: Training is O(1), prediction is O(N).
This is bad: we want classifiers that are fast at prediction; slow training is ok.
Image classifier
K-Nearest Neighbors
Instead of copying the label from the nearest neighbor, take a majority vote from the K closest points.
(Decision regions for K=1, K=3, K=5)
Image classifier
K-Nearest Neighbors: Distance Metric
L1 (Manhattan) distance: d_1(I_1, I_2) = \sum_p |I_1(p) - I_2(p)|
L2 (Euclidean) distance: d_2(I_1, I_2) = \sqrt{\sum_p (I_1(p) - I_2(p))^2}
Image classifier
Hyperparameters
What is the best value of k to use?
What is the best distance to use?
These are hyperparameters: choices about the algorithm that we set rather than learn.
Very problem-dependent. Must try them all out and see what works best.
Image classifier
Setting Hyperparameters
Idea #1: Choose hyperparameters that work best on the data. BAD: K = 1 always works perfectly on training data.
Idea #2: Split data into train and test; choose hyperparameters that work best on the test data. BAD: No idea how the algorithm will perform on new data.
Idea #3: Split data into train, val, and test; choose hyperparameters on val and evaluate on test. Better!
train | validation | test
Image classifier
Setting Hyperparameters
Idea #4: Cross-Validation: split data into folds, try each fold as validation and average the results.
fold 1 | fold 2 | fold 3 | fold 4 | fold 5 | test   (rotate which fold serves as validation)
Useful for small datasets, but not used too frequently in deep learning.
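A sketch of choosing k with 5-fold cross-validation, assuming scikit-learn is available (the lecture does not prescribe a particular library):

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def select_k(X_train, y_train, candidate_ks=(1, 3, 5, 7, 9, 11)):
        results = {}
        for k in candidate_ks:
            knn = KNeighborsClassifier(n_neighbors=k)
            # mean and std of 5-fold cross-validation accuracy for this k
            scores = cross_val_score(knn, X_train, y_train, cv=5)
            results[k] = (scores.mean(), scores.std())
        # pick the k with the highest mean validation accuracy
        best_k = max(results, key=lambda k: results[k][0])
        return best_k, results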
Image classifier
Setting Hyperparameters

Example of 5-fold cross-validation for


the value of k.

Each point: single outcome.

The line goes through the mean, bars


indicated standard deviation

(Seems that k ~ 7 works best


for this data)
Image classifier
k-Nearest Neighbor on raw pixels is never used in practice:
- Curse of dimensionality: covering the space densely requires a number of training points that grows exponentially with the dimension (Dimensions = 1: 4 points; Dimensions = 2: 4^2 points; Dimensions = 3: 4^3 points).
- Number of possible 32x32 binary images: 2^(32x32) ≈ 10^308.
Image classifier
Nearest Neighbor
Nearest Neighbor Classifier:
- Choose an image distance: d(I_1, I_2) = \sum_p |I_1(p) - I_2(p)|
- Given I, find the nearest neighbor: I^* = \operatorname{argmin}_{I' \in D} d(I, I')
- Return the class of I^*
- This is a slow (nearest-neighbor search at test time) and bad (pixel distances are uninformative) classifier
Image classifier
Nearest Neighbor
Thought Experiment:
- Two checkerboards, horizontally displaced by one field: what is d(I_1, I_2)? (The pixel-wise distance is large even though both images show the same pattern, illustrating why pixel distances are uninformative.)
Interest Point Detection (feature extraction)
Feature detectors and descriptors
Why do we need feature descriptors? If we know where the good features are, how do we match them?
Finding objects
Finding the "same" thing across images
Categories (e.g. find a bottle): can't do. Instances (e.g. find these two specific objects): can nail it.
Building a Panorama
What is the best descriptor for an image feature?
• Representative features
• Robust to photometric transformations
• Robust to geometric transformations: objects will appear at different scales, translations and rotations
Color histogram
Count the colors in the image using a histogram over colors.
Invariant to changes in scale and rotation.
What are the problems?
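A minimal sketch of such a descriptor (per-channel histograms with 8 bins, assuming an H x W x 3 uint8 NumPy image):

    import numpy as np

    def color_histogram(img, bins=8):
        hists = []
        for c in range(3):  # one histogram per RGB channel
            h, _ = np.histogram(img[:, :, c], bins=bins, range=(0, 256))
            hists.append(h)
        hist = np.concatenate(hists).astype(np.float64)
        return hist / hist.sum()  # normalize so the descriptor does not depend on image size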


Spatial histograms
Compute histograms over spatial 'cells'.
Retains rough spatial layout.
Darya Frolova, Denis Simakov, The Weizmann Institute of Science
Harris corner detector
C. Harris, M. Stephens. A Combined Corner and Edge Detector. 1988
The Basic Idea
• We should be able to easily localize the point by looking through a small window
• Shifting the window in any direction should give a large change in pixel intensities within the window
  - this makes the location precisely defined
Slides: Darya Frolova, Denis Simakov, The Weizmann Institute of Science
http://www.wisdom.weizmann.ac.il/~deniss/vision_spring04/files/InvariantFeatures.ppt
Corner Detector: Basic Idea
Flat region: no change in any direction. Edge: no change along the edge direction. Corner: significant change in all directions.
Harris Detector: Workflow
Harris Detector: Some Properties
• Rotation invariance: the ellipse rotates but its shape (i.e. the eigenvalues) remains the same.
• The corner response R is invariant to image rotation.
• Eigen-analysis allows us to work in the canonical frame of the linear form.
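For reference, the slides mention the second-moment matrix and the corner response R only by name; the usual textbook formulation (an addition here, not taken from the slides) is

    M = \sum_{x,y} w(x,y) \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix},
    \qquad
    R = \det(M) - k\,(\operatorname{tr} M)^2 = \lambda_1 \lambda_2 - k (\lambda_1 + \lambda_2)^2,

where I_x, I_y are image gradients, w(x,y) is a window function and k is an empirical constant (typically around 0.04-0.06). The eigenvalues \lambda_1, \lambda_2 are unchanged by rotation, which is why R is rotation invariant.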
Harris Detector: Some Properties
• Not invariant to image scale!
  A structure detected as a corner in the original image may have all of its points classified as edges in the zoomed image.
Laplacian of Gaussian for selection of characteristic scale
http://www.robots.ox.ac.uk/~vgg/research/affine/det_eval_files/mikolajczyk_ijcv2004.pdf
Scale Invariant Detectors
• Harris-Laplacian: find the local maximum of the Harris corner detector in space (image coordinates) and of the Laplacian in scale.
• SIFT (Lowe): find the local maximum (minimum) of the Difference of Gaussians (DoG) in space and scale.
Blob detectors
SIFT features
SIFT vector formation
• Computed on a rotated and scaled version of the window according to the computed orientation & scale
  - resample the window
• Based on gradients weighted by a Gaussian of variance half the window size (for smooth falloff)
SIFT features
Local binary pattern features
HOG features
HOG
Dalal, Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005
• Histogram of 'unsigned' gradients
• Cell (8x8 pixels): one gradient magnitude histogram per cell, with soft binning
• Block (2x2 cells): concatenate the cell histograms and apply L2 normalization
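A sketch of computing a HOG descriptor with the parameters listed above, assuming scikit-image is available (the lecture itself does not mandate a library; the file name is hypothetical):

    from skimage import color, io
    from skimage.feature import hog

    img = color.rgb2gray(io.imread("pedestrian.png"))
    descriptor = hog(img,
                     orientations=9,          # histogram of 'unsigned' gradient orientations
                     pixels_per_cell=(8, 8),  # cell = 8x8 pixels
                     cells_per_block=(2, 2),  # block = 2x2 cells
                     block_norm="L2")         # concatenate and L2-normalize per block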


HOG features
Pedestrian detector (classical method): feature extraction followed by a learning algorithm.
Figure 9.6: Efficiency of edge detection. The image on the right was formed by taking
each pixel in the original image and subtracting the value of its neighboring pixel on the
left. This shows the strength of all of the vertically oriented edges in the input image,
which can be a useful operation for object detection. Both images are 280 pixels tall.
The input image is 320 pixels wide while the output image is 319 pixels wide. This
transformation can be described by a convolution kernel containing two elements, and
requires 319 × 280 × 3 = 267, 960 floating point operations (two multiplications and
one addition per output pixel) to compute using convolution. To describe the same
transformation with a matrix multiplication would take 320 × 280 × 319 × 280, or over
eight billion, entries in the matrix, making convolution four billion times more efficient for
representing this transformation. The straightforward matrix multiplication algorithm
performs over sixteen billion floating point operations, making convolution roughly 60,000
times more efficient computationally. Of course, most of the entries of the matrix would be
zero. If we stored only the nonzero entries of the matrix, then both matrix multiplication
and convolution would require the same number of floating point operations to compute.
The matrix would still need to contain 2 × 319 × 280 = 178, 640 entries. Convolution
is an extremely efficient way of describing transformations that apply the same linear
transformation of a small, local region across the entire input. (Photo credit: Paula
Goodfellow)
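A NumPy sketch of the operation described in the caption, i.e. subtracting each pixel's left neighbor (equivalent to convolving each row with the kernel [1, -1]); the random array stands in for the photograph:

    import numpy as np

    img = np.random.rand(280, 320)      # stand-in for the 280 x 320 input image
    edges = img[:, 1:] - img[:, :-1]    # vertical-edge strength
    print(edges.shape)                  # (280, 319): the output is one pixel narrower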

Shape context
SIFT
Texton
HOG features
Texture classification
Example of Filter Banks:
• 'S': isotropic Gabor filters
• 'LM': Gaussian derivatives at different scales and orientations
• 'MR8'
Histogram of Textons descriptor
Training image → filter responses → texton map ('encoding') → histogram of textons in the image ('pooling')
Learning Textons from data
Multiple training images of the same texture → filter responses over a bank of filters → patches → clustering → texton dictionary.
We will learn more about clustering later in class (Bag of Words lecture).
Learning Textons from data
Each image is then represented as a histogram over a universal texton dictionary.
Julesz, 1981; Cula & Dana, 2001; Leung & Malik, 2001; Mori, Belongie & Malik, 2001; Schmid, 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003
Bag-of-Words (BoW)
• Idea: obtain spatial invariance by comparing histograms of local features
Fei-Fei and Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR, 2005.
Bag-of-Words (BoW)
• The idea of bag-of-words models originates from natural language processing
• Orderless document representation: frequencies of words from a dictionary
Bag-of-Words (BoW)
Some local features are very informative: what object do these parts belong to?
An object as a collection of local features (bag-of-features):
• deals well with occlusion
• scale invariant
• rotation invariant
Bag-of-Words (BoW)
Dictionary Learning: learn visual words using clustering
Encode: build Bag-of-Words (BoW) vectors for each image
Classify: train and test data using the BoW vectors

Bag-of-Words (BoW)
Dictionary Learning: learn visual words using clustering
1. Extract features (e.g., SIFT) from images
2. Learn a visual dictionary (e.g., K-means clustering)
Bag-of-Words (BoW)
1. Extract features (e.g., SIFT detector)
2. Learn a visual vocabulary (e.g., k-means on SIFT feature vectors)
3. Quantize features into visual words using the vocabulary (nearest neighbors)
4. Represent images by histograms of visual word frequencies
5. Train a classifier (e.g., k-NN, SVM, random forest, neural network)
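A compact sketch of the dictionary-learning and encoding steps using K-means from scikit-learn (the library choice and function names are assumptions, not from the slides; the local descriptors would come from a feature extractor such as SIFT):

    import numpy as np
    from sklearn.cluster import KMeans

    def learn_dictionary(all_descriptors, num_words=500):
        # Step 2: learn the visual vocabulary by clustering local descriptors
        return KMeans(n_clusters=num_words, n_init=10).fit(all_descriptors)

    def encode(kmeans, descriptors):
        # Steps 3-4: quantize each descriptor to its nearest visual word and
        # represent the image as a normalized histogram of word frequencies
        words = kmeans.predict(descriptors)
        hist = np.bincount(words, minlength=kmeans.n_clusters).astype(np.float64)
        return hist / max(hist.sum(), 1.0)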
Bag-of-Words (BoW)
A classifier is then trained on the BoW histograms; the learned visual words form an appearance codebook. (Source: B. Leibe)
Representation Learning
ImageNet dataset
The Image Classification Challenge: 1,000 object classes, 1,431,167 images (example ground-truth label: "steel drum").
ImageNet Large Scale Visual Recognition Challenge
ImageNet Large Scale Visual Recognition Challenge
ImageNet classification top-5 error (%):
ILSVRC'10: 28.2 (shallow) | ILSVRC'11: 25.8 (shallow) | ILSVRC'12: 16.4 (AlexNet, 8 layers) | ILSVRC'13: 11.7 (8 layers) | ILSVRC'14: 7.3 (VGG, 19 layers), 6.7 (GoogLeNet, 22 layers) | ILSVRC'15: 3.57 (ResNet, 152 layers)
Classification into 1000 object categories.
ImageNet Large Scale Visual Recognition
Challenge
Year 2010 Year 2012 Year 2014 Year 2015
NEC-UIUC SuperVision GoogLeNet VGG MSRA
Image
Pooling
Convolution conv-64
Softmax conv-64

Other maxpool
conv-128
Dense descriptor grid:
conv-128
HOG, LBP
maxpool

conv-256
Coding: local coordinate, conv-256
super-vector
maxpool

conv-512
conv-512
Pooling, SPM maxpool

conv-512
conv-512
Linear SVM maxpool

fc-4096
fc-4096
fc-1000
softmax

[Lin CVPR 2011] [Krizhevsky NIPS 2012]

Figure copyright Alex Krizhevsky, Ilya [Szegedy arxiv 2014] [Simonyan arxiv 2014] [He ICCV 2015]
Lion image by Swissfrog is
Sutskever, and Geoffrey Hinton, 2012.
licensed under CC BY 3.0
Convolutional Neural Networks (CNN) were not invented overnight
1998, LeCun et al.: input → convolutions → subsampling → fully connected → output (image maps)
  # of transistors: ~10^6; # of pixels used in training: ~10^7
2012, Krizhevsky et al.:
  # of transistors (GPUs): ~10^9; # of pixels used in training: ~10^14
First strong results
• Acoustic Modeling using Deep Belief Networks. Abdel-rahman Mohamed, George Dahl, Geoffrey Hinton, 2010
• Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition. George Dahl, Dong Yu, Li Deng, Alex Acero, 2012
• ImageNet classification with deep convolutional neural networks. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012
Conv layer vs FC layer
Fully Connected Layer: stretch the 32x32x3 image into a 3072 x 1 vector; with a 10 x 3072 weight matrix the output activation is 1 x 10. Each number is the result of a dot product between one row of W and the input (a 3072-dimensional dot product).
Convolution Layer: keep the 32x32x3 image; convolve (slide) a 5x5x3 filter over all spatial locations, taking a dot product at each location; the result is a 28x28x1 activation map.
Conv layer
For example, if we had 6 5x5 filters, we'd get 6 separate activation maps (each 28x28).
We stack these up to get a "new image" of size 28x28x6!
Conv layer
The brain/neuron view of the CONV layer
An activation map is a 28x28 sheet of neuron outputs:
1. Each neuron is connected to a small region of the input
2. All of them share parameters
"5x5 filter" -> "5x5 receptive field for each neuron"
Conv layer
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter => 5x5 output
Conv Layer
Output size: (N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\ (does not fit)
Conv layer
In practice: common to zero-pad the border.
e.g. input 7x7, 3x3 filter applied with stride 1, padded with a 1-pixel border => what is the output?
7x7 output!
In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2 (this preserves the spatial size).
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
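A small helper reproducing the arithmetic above (illustrative only):

    def conv_output_size(N, F, stride, pad=0):
        return (N - F + 2 * pad) // stride + 1

    print(conv_output_size(7, 3, 1))         # 5
    print(conv_output_size(7, 3, 2))         # 3
    print(conv_output_size(7, 3, 1, pad=1))  # 7: zero-padding with (F-1)/2 preserves the size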
Conv Layer
Example:
Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2
Number of parameters in this layer?
Each filter has 5*5*3 + 1 = 76 params (+1 for the bias) => 76*10 = 760
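Checking that count:

    params_per_filter = 5 * 5 * 3 + 1   # 75 weights + 1 bias = 76
    total_params = params_per_filter * 10
    print(total_params)                 # 760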
Conv layer
Two more layers to go: POOL/FC
Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently
Pooling layer
MAX POOLING
Single depth slice (4x4 input):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4
Max pooling with 2x2 filters and stride 2 gives:
6 8
3 4
Fully connected layer
Output and Loss Functions
Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function ← Target
- The output layer is the last layer in a neural network and computes the output
- The loss function compares the result of the output layer to the target value(s)
- In image classification, we use a softmax output layer and a cross-entropy loss
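A NumPy sketch of the softmax output and cross-entropy loss mentioned above, for a single example with an integer target class (an illustration, not taken from the slides):

    import numpy as np

    def softmax(logits):
        z = logits - logits.max()          # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def cross_entropy(logits, target):
        probs = softmax(logits)
        return -np.log(probs[target])      # loss = -log probability of the target class

    print(cross_entropy(np.array([2.0, 1.0, 0.1]), target=0))  # small loss: correct class scored highest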
Fully connected layer
Hierarchical learning
Preview: feature visualizations [Zeiler and Fergus 2013]; VGG-16 architecture from [Simonyan and Zisserman 2014].
Convolutional Neural Networks (CNN)
• Recent advancements in CNN architectures:
• 1- LeNet
2- AlexNet
3- ZFNet
4- VGGNet
5- GoogLeNet
6- ResNet
7- Inception
8- Xception
9- SqueezeNet
10- ShuffleNet
11- MobileNetV2
12- DenseNet201
13- NasNet-Mobile
CNN
1998: LeNet-5
- Stacks convolution layers (5 x 5) and pooling layers (2 x 2), followed by fully connected layers
- Achieved state-of-the-art accuracy on MNIST (prior to ImageNet)
CNN
2012: AlexNet
- 8 layers, ReLUs, dropout, data augmentation, trained on 2 GTX 580 GPUs
- The number of feature channels increases with depth while the spatial resolution decreases
- Triggered the deep learning revolution, showed that CNNs work well in practice
CNN
2014: VGG
- Uses 3 x 3 convolutions everywhere (same expressiveness, fewer parameters)
- Three 3 x 3 layers: same receptive field as one 7 x 7 layer, but fewer parameters
- 2 variants: 16 and 19 layers. Showed that deeper networks are better.
Simonyan and Zisserman: Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.
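The parameter comparison behind this design choice, for C input and C output channels (ignoring biases; a back-of-the-envelope check, not from the slides):

    C = 256
    three_3x3 = 3 * (3 * 3 * C * C)   # 27 C^2 parameters, receptive field 7x7
    one_7x7 = 7 * 7 * C * C           # 49 C^2 parameters, same receptive field
    print(three_3x3, one_7x7)         # the stacked 3x3 layers use fewer parameters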
CNN
2014: Inception / GoogLeNet
- 22 layers. Inception modules utilize conv/pool operations with varying filter sizes
- Multiple intermediate classification heads to improve gradient flow
- Global average pooling (no FC layers), far fewer parameters (~5 million) than VGG-16
CNN
2016: ResNet
- Very simple and regular network structure with 3 x 3 convolutions; e.g. the 34-layer variant: 7x7 conv 64 /2, pool /2, then 3x3 conv stages with 64, 128, 256 and 512 channels, average pooling, fc-1000
- Uses strided convolutions for downsampling
- Residual connections allow for training much deeper networks (e.g. the 152-layer ILSVRC model)
- ResNet and ResNet-like architectures are dominating today
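A minimal sketch of a residual block with the 3x3 convolution structure described above, assuming PyTorch (the slides do not specify a framework):

    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return F.relu(out + x)   # residual (skip) connection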
CNN
Accuracy vs. Complexity
- Performance: Top-1 or Top-5 accuracy (target label in the top N predictions)
- VGG has the most parameters and is the slowest
- ResNet/Inception/GoogLeNet are faster and have fewer parameters
