Object Recognition
[Example image: a cat]
Challenges: Deformation
Challenges: Occlusion
Challenges: Background Clutter
Challenges: Intraclass Variation
Galaxy Classification
[Example: object detection labels a person and a horse in an image]
Image Classification: Building Block for other tasks!
Example: Image Captioning
Image classifier
Data-Driven Approach
1. Collect a dataset of images and labels
2. Use Machine Learning to train a classifier
3. Evaluate the classifier on new images
Example training set: MNIST

Datasets

Example Dataset: CIFAR10
- 10 classes
- 50,000 training images
- 10,000 testing images
Alex Krizhevsky, "Learning Multiple Layers of Features from Tiny Images", Technical Report, 2009.

MNIST Handwritten Digits:
- One of the most popular datasets in ML (many variants, still in use today)
- Based on data from the National Institute of Standards and Technology
- Handwritten by Census Bureau employees and high-school students
- Resolution: 28 x 28 pixels; 60k training samples with labels, 10k test samples
LeCun, Bottou, Bengio and Haffner. Gradient-based learning applied to document recognition. IEEE, 1998.
Caltech101:
- Caltech101 was the first major object recognition dataset, collected in 2003
- 101 object categories
- Hand-curated from Google Image Search
- Biased: canonical size and location
Fei-Fei, Fergus and Perona: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. CVPR, 2004.
Image Classification Datasets: CIFAR100
100 classes
50k training images (500 per class)
10k testing images (100 per class)
32x32 RGB images
Zhou et al, “Places: A 10 million Image Database for Scene Recognition”, TPAMI 2017
Classification Datasets: Number of Training Pixels
Approximate number of training pixels per dataset: MNIST ~47M, CIFAR10 ~154M, CIFAR100 ~154M, ImageNet ~251B, Places365 ~1.6T.
Image Classification Datasets: Omniglot
Lake et al, “Human-level concept learning through probabilistic program induction”, Science, 2015
First classifier: Nearest Neighbor
Train: memorize all data and labels
Predict: output the label of the most similar training image
Distance Metric to compare images
L1 distance: d(I1, I2) = Σ_p |I1(p) − I2(p)| (sum of absolute pixel-wise differences)
Nearest Neighbor classifier
Q: With N examples, how fast are training and prediction?
A: Training is O(1), prediction is O(N)
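A minimal NumPy sketch of this classifier (class and method names are illustrative): training just memorizes the data in O(1), and prediction scans all N training images.

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # "Training" just memorizes all data and labels: O(1)
        self.X_train = X  # (N, D): one flattened image per row
        self.y_train = y  # (N,): labels

    def predict(self, X):
        # Each test image is compared against all N training images: O(N)
        y_pred = np.empty(len(X), dtype=self.y_train.dtype)
        for i in range(len(X)):
            # L1 distance: sum of absolute pixel-wise differences
            dists = np.sum(np.abs(self.X_train - X[i]), axis=1)
            y_pred[i] = self.y_train[np.argmin(dists)]
        return y_pred
```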
Setting Hyperparameters
Your Dataset
Idea #2: Split data into train and test; choose hyperparameters that work best on the test data. BAD: no idea how the algorithm will perform on new data.
  train | test
Idea #3: Split data into train, val, and test; choose hyperparameters on val and evaluate on test. Better!
  train | validation | test
Idea #4: Cross-Validation: split the data into folds, try each fold as the validation set, and average the results. Useful for small datasets, but not used too frequently in deep learning.
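A sketch of choosing the hyperparameter k by cross-validation, assuming a k-NN classifier with L1 distance; both function names are illustrative.

```python
import numpy as np

def l1_knn_accuracy(X_tr, y_tr, X_ev, y_ev, k):
    # Accuracy of a k-NN classifier with L1 distance
    correct = 0
    for x, y in zip(X_ev, y_ev):
        dists = np.sum(np.abs(X_tr - x), axis=1)
        nearest = y_tr[np.argsort(dists)[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        correct += values[np.argmax(counts)] == y  # majority vote
    return correct / len(X_ev)

def cross_validate_k(X, y, ks=(1, 3, 5, 7), folds=5):
    # Try each fold as the validation set and average the results
    X_folds, y_folds = np.array_split(X, folds), np.array_split(y, folds)
    scores = {}
    for k in ks:
        accs = [l1_knn_accuracy(np.concatenate(X_folds[:i] + X_folds[i+1:]),
                                np.concatenate(y_folds[:i] + y_folds[i+1:]),
                                X_folds[i], y_folds[i], k)
                for i in range(folds)]
        scores[k] = float(np.mean(accs))
    return scores  # pick the k with the best mean validation accuracy
```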
Curse of dimensionality: for nearest neighbor to work well, the training points must densely cover the space, and the number of points needed grows exponentially with dimension (Dimensions = 1: 4 points; Dimensions = 2: 4² points; ...).
Nearest Neighbor Classifier:
- Choose image distance: d(I1, I2) = Σ_p |I1(p) − I2(p)|
- I* = argmin_{I' ∈ D} d(I, I')
- Return class of I*
- This is a slow (nearest-neighbor search at test time) and bad (pixel distances uninformative) classifier
Thought Experiment:
- Two checkerboards, horizontally displaced by one field ⇒ What is d(I1, I2)? The L1 distance is maximal, even though the two images look nearly identical, illustrating that pixel distances are uninformative.
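A quick numerical check of the thought experiment: shifting a checkerboard by one field flips every pixel, so the L1 distance is maximal despite the near-identical appearance.

```python
import numpy as np

board = np.indices((8, 8)).sum(axis=0) % 2  # 8x8 checkerboard of 0s and 1s
shifted = 1 - board                          # displaced by one field
print(np.abs(board - shifted).sum())         # 64: every single pixel differs
```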
Interest Point Detection (Feature Extraction)
Feature detectors and descriptors
Why do we need feature descriptors?
Scale Invariant Detectors
- Harris-Laplacian: find the local maximum of the Harris corner detector in space (image coordinates) and of the Laplacian in scale
- SIFT (Lowe): find the local maximum (minimum) of the Difference of Gaussians (DoG) in both space and scale
Blob detectors
SIFT features
SIFT vector formation:
- Computed on a rotated and scaled version of the window, according to the computed orientation & scale
  - resample the window
- Based on gradients weighted by a Gaussian of variance half the window (for smooth falloff)
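A minimal sketch of extracting SIFT keypoints and descriptors with OpenCV; it assumes opencv-python ≥ 4.4 (where SIFT is available as cv2.SIFT_create) and a placeholder image path.

```python
import cv2

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)  # replace with a real path
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# Each keypoint carries a location, scale and orientation; each descriptor
# is a 128-D vector of Gaussian-weighted gradient histograms computed on
# the window after rotating and scaling it to the keypoint's frame.
print(len(keypoints), descriptors.shape)  # (N, 128)
```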
Local binary pattern features
HOG features
Dalal, Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005
- Histogram of 'unsigned' gradients, with soft binning
- Cell (8x8 pixels): one gradient magnitude histogram per cell
- Block (2x2 cells): cell histograms are grouped and normalized per block
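A sketch of computing HOG features with scikit-image, using parameters that mirror the slide (8x8-pixel cells, 2x2-cell blocks, 9 unsigned orientation bins); the sample image is just a stand-in.

```python
from skimage import data
from skimage.feature import hog

img = data.astronaut()[:, :, 0]  # one channel as a grayscale image
features = hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
print(features.shape)  # one long, block-normalized feature vector
```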
Related histogram-based descriptors: shape context, SIFT, texton, HOG.
Texture classification
Example filter banks: Gaussian derivatives at different scales and orientations, e.g. the 'LM' (Leung-Malik) and 'MR8' filter banks.
Histogram of Textons descriptor: filter responses are 'encoded' against a texton dictionary and then 'pooled' into a histogram.
Learning Textons from data
Patches are extracted from training images and clustered; the cluster centers form the texton dictionary.
Julesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001; Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003
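A simplified sketch of texton dictionary learning with scikit-learn's KMeans. Note one deliberate shortcut: it clusters raw gray-level patches rather than filter-bank responses, and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_texton_dictionary(images, patch=7, k=64, samples=10000, seed=0):
    # Sample random patches and cluster them; centers are the textons
    rng = np.random.default_rng(seed)
    patches = []
    for _ in range(samples):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - patch)
        x = rng.integers(img.shape[1] - patch)
        patches.append(img[y:y+patch, x:x+patch].ravel())
    return KMeans(n_clusters=k, n_init=10).fit(np.array(patches))

def texton_histogram(img, km, patch=7):
    # Encode each patch by its nearest texton, then pool into a histogram
    codes = [img[y:y+patch, x:x+patch].ravel()
             for y in range(0, img.shape[0] - patch, patch)
             for x in range(0, img.shape[1] - patch, patch)]
    labels = km.predict(np.array(codes))
    return np.bincount(labels, minlength=km.n_clusters) / len(labels)
```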
Bag-of-Words (BoW)
Fei-Fei and Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR, 2005.
An object as a bag of visual words.
The BoW pipeline (see the sketch below):
1. Dictionary learning: learn visual words using clustering (the appearance codebook)
2. Encode: build a Bag-of-Words (BoW) vector for each image
3. Classify: train and test a classifier on the BoW vectors
Source: B. Leibe
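A sketch of the three BoW steps with scikit-learn; `local_descriptors` is a hypothetical helper that returns an (n, d) array of local descriptors (e.g. SIFT) for one image, and the linear SVM stands in for any classifier.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def bow_vector(img, codebook, local_descriptors):
    # Encode: map each descriptor to its nearest visual word,
    # then pool the word indices into a normalized histogram
    words = codebook.predict(local_descriptors(img))
    hist = np.bincount(words, minlength=codebook.n_clusters)
    return hist / max(hist.sum(), 1)

def train_bow_classifier(train_imgs, labels, local_descriptors, k=200):
    # 1. Dictionary learning: cluster all descriptors into k visual words
    all_desc = np.vstack([local_descriptors(im) for im in train_imgs])
    codebook = KMeans(n_clusters=k, n_init=10).fit(all_desc)
    # 2. Encode each training image as a BoW vector
    X = np.array([bow_vector(im, codebook, local_descriptors)
                  for im in train_imgs])
    # 3. Classify: train a linear SVM on the BoW vectors
    return codebook, LinearSVC().fit(X, labels)
```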
Representation Learning
ImageNet dataset
The Image Classification Challenge: 1,000 object classes, 1,431,167 images
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
ILSVRC top-5 error over the years: 16.4% (AlexNet, 2012), 11.7% (ZFNet, 2013), 7.3% (VGG, 2014, 19 layers), 6.7% (GoogLeNet, 2014, 22 layers).
Classical recognition pipeline vs. a deep network:
- Classical: dense descriptor grid (HOG, LBP) → coding (local coordinate, super-vector) → pooling, SPM → linear SVM
- Deep (VGG-style): conv-128, conv-128, maxpool, conv-256, conv-256, maxpool, conv-512, conv-512, maxpool, conv-512, conv-512, maxpool, fc-4096, fc-4096, fc-1000, softmax
[Szegedy arXiv 2014] [Simonyan arXiv 2014] [He ICCV 2015]
Convolutional Neural Networks (CNN) were not invented overnight:
1998: LeCun et al. (LeNet)
2012: Krizhevsky et al. (AlexNet)
Conv layer vs. FC layer
Fully Connected Layer: stretch the 32x32x3 image to a 3072 x 1 vector; with a 10 x 3072 weight matrix W, each output is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
Convolution Layer: keep the 32x32x3 image; convolve (slide) a 5x5x3 filter over all spatial locations; each location yields 1 number (the dot product between the filter and a 5x5x3 image patch), producing a 28x28x1 activation map.
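A naive NumPy sketch of the convolution just described: one 5x5x3 filter slid over a 32x32x3 image yields a 28x28 activation map.

```python
import numpy as np

def conv_forward(x, w, stride=1):
    # x: (H, W, C) input image; w: (F, F, C) filter
    H, W, C = x.shape
    F = w.shape[0]
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride+F, j*stride:j*stride+F, :]
            out[i, j] = np.sum(patch * w)  # F*F*C-dimensional dot product
    return out

x = np.random.randn(32, 32, 3)
w = np.random.randn(5, 5, 3)
print(conv_forward(x, w).shape)  # (28, 28)
```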
Conv layer
For example, if we had 6 5x5 filters, we'd get 6 separate activation maps: the 32x32x3 input volume becomes a 28x28x6 output volume.
"5x5 filter" -> "5x5 receptive field for each neuron"
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter.
Sliding the filter across all valid positions, one pixel at a time, gives a 5x5 output.
Conv Layer
Output size: (N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 (doesn't fit: stride 3 is not applicable here)
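The output-size formula as a small helper (function name is illustrative):

```python
def conv_output_size(n, f, stride):
    # Spatial output size of a conv layer with no padding
    size = (n - f) / stride + 1
    assert size.is_integer(), f"stride {stride} does not fit: {size}"
    return int(size)

for s in (1, 2):
    print(conv_output_size(7, 3, s))  # 5, 3
```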
Pooling layer
MAX POOLING: each output value is the maximum over a small spatial region of the input (e.g. 2x2 regions with stride 2).
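A minimal NumPy sketch of max pooling; the 4x4 example input is assumed for illustration.

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    # Take the max over each (size x size) region of the input
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i*stride:i*stride+size,
                          j*stride:j*stride+size].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool(x))  # [[6. 8.] [3. 4.]]
```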
Fully connected layer
Output and Loss Functions
[Diagram: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function ← Target]
- The output layer is the last layer in a neural network; it computes the output
- The loss function compares the result of the output layer to the target value(s)
- In image classification, we use a softmax output layer and a cross-entropy loss
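A minimal NumPy sketch of the softmax output layer and cross-entropy loss mentioned above:

```python
import numpy as np

def softmax(z):
    # Softmax output layer: turns raw scores into class probabilities
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(probs, target):
    # Cross-entropy loss against an integer class target
    return -np.log(probs[target])

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p, cross_entropy(p, target=0))
```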
Hierarchical learning
Preview [Zeiler and Fergus 2013]. Visualization of VGG-16 by Lane McIntosh; VGG-16 architecture from [Simonyan and Zisserman 2014].
Convolutional Neural Networks (CNN)
Recent advances in CNN architectures:
1. LeNet
2. AlexNet
3. ZFNet
4. VGGNet
5. GoogLeNet
6. ResNet
7. Inception
8. Xception
9. SqueezeNet
10. ShuffleNet
11. MobileNetV2
12. DenseNet201
13. NasNet-Mobile
CNN
1998: LeNet-5
AlexNet (2012):
- 8 layers, ReLUs, dropout, data augmentation, trained on 2 GTX 580 GPUs
- The number of feature channels increases with depth, while spatial resolution decreases
- Triggered the deep learning revolution; showed that CNNs work well in practice
2014: VGG
VGG:
- Uses 3x3 convolutions everywhere (same expressiveness, fewer parameters)
- Three 3x3 layers have the same receptive field as one 7x7 layer, but fewer parameters (see the sketch below)
- 2 variants: 16 and 19 layers. Showed that deeper networks are better.
Simonyan and Zisserman: Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.
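A quick parameter count (biases ignored, C input and C output channels) illustrating why three stacked 3x3 layers are cheaper than one 7x7 layer with the same receptive field:

```python
C = 64
one_7x7 = 7 * 7 * C * C          # 49 C^2 weights
three_3x3 = 3 * (3 * 3 * C * C)  # 27 C^2 weights
print(one_7x7, three_3x3)  # 200704 vs 110592: fewer params, same receptive field
```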
2014: Inception / GoogLeNet
Inception:
- 22 layers. Inception modules utilize conv/pool operations with varying filter sizes in parallel
- Multiple intermediate classification heads to improve gradient flow
- Global average pooling (no FC layers); roughly 27x fewer parameters (5 million) than VGG-16
2016: ResNet
ResNet:
- Very simple and regular network structure built from 3x3 convolutions (7x7 conv, 64, /2; pool, /2; stacked 3x3 conv, 64 blocks; ...; avg pool; fc 1000)
- Residual connections allow for training much deeper networks (up to 152 layers), as sketched below
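A schematic sketch of a residual connection; conv1 and conv2 are stand-ins for shape-preserving 3x3 convolutions (illustrative, not ResNet's exact layers):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, conv1, conv2):
    # output = relu(F(x) + x); the identity shortcut lets gradients
    # flow directly through the skip connection
    out = relu(conv1(x))
    out = conv2(out)
    return relu(out + x)

# Toy usage with a shape-preserving stand-in for a 3x3 convolution:
conv = lambda x: 0.5 * x
print(residual_block(np.ones(4), conv, conv))  # [1.25 1.25 1.25 1.25]
```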
Accuracy vs. Complexity