
Image classification

16-385 Computer Vision


http://www.cs.cmu.edu/~16385/
Spring 2020, Lecture 18
Course announcements

• Programming assignment 4 is due tonight at 23:59.
  - Please make sure to download the updated version of PA4 (last updated Monday, 10 am ET).
  - Due Wednesday March 25th.
  - Any questions about the homework?

• Programming assignment 5 will be posted tonight and will be due April 8th.

• Take-home quiz 7 is posted and due Sunday March 29th.


Overview of today’s lecture
• Introduction to learning-based vision.

• Image classification.

• Bag-of-words.

• K-means clustering.

• Classification.

• K nearest neighbors.

• Naïve Bayes.

• Support vector machine.


Slide credits
Most of these slides were adapted from:

• Kris Kitani (16-385, Spring 2017).

• Noah Snavely (Cornell University).

• Fei-Fei Li (Stanford University).


Course overview

1. Image processing (Lectures 1 – 7). See also 18-793: Image and Video Processing.

2. Geometry-based vision (Lectures 7 – 13). See also 16-822: Geometry-based Methods in Vision.

3. Physics-based vision (Lectures 14 – 17). See also 16-823: Physics-based Methods in Vision
   and 15-463: Computational Photography.

4. Learning-based vision. We are starting this part now.

5. Dealing with motion.


What do we mean by learning-based vision or ‘semantic vision’?

• Is this a street light? (Recognition / classification)
• Where are the people? (Detection)
• Is that Potala palace? (Identification)
• What’s in the scene? (Semantic segmentation: sky, mountain, trees, building, vendors, people, ground)
• Object categorization: mountain, tree, building, banner, street lamp, vendor, people
• What type of scene is it? (Scene categorization: outdoor, marketplace, city)
• Activity / event recognition: what are these people doing?
Object recognition
Is it really so hard?

This is a chair. Find the chair in this image.
The output of normalized correlation is pretty much garbage:
simple template matching is not going to make it.

“A popular method is that of template matching, by point to point correlation of a model
pattern with the image pattern. These techniques are inadequate for three-dimensional scene
analysis for many reasons, such as occlusion, changes in viewing angle, and articulation of
parts.” Nevatia & Binford, 1977.

And it can get a lot harder:
Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422.
Why is this hard?

Variability: camera position, illumination, shape parameters.

• Challenge: variable viewpoint (Michelangelo, 1475-1564)
• Challenge: variable illumination (image credit: J. Koenderink)
• Challenge: scale
• Challenge: deformation
• Challenge: occlusion (Magritte, 1957)
• Challenge: background clutter (Kilmeny Niland, 1995)
• Challenge: intra-class variations (Svetlana Lazebnik)
Image Classification
Image Classification: Problem

Data-driven approach:
• Collect a database of images with labels
• Use ML to train an image classifier
• Evaluate the classifier on test images
(a minimal sketch of this pipeline is shown below)
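Below is a minimal, hedged sketch of the data-driven approach with scikit-learn. The `extract_features` placeholder and the choice of a 1-nearest-neighbor classifier are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch of the data-driven approach (assumptions: images are NumPy
# arrays, labels are integers; the feature extractor is a stand-in).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def extract_features(images):
    # placeholder: flatten each image into a vector (a BoW histogram in practice)
    return np.array([img.reshape(-1) for img in images])

def train_and_evaluate(images, labels):
    X = extract_features(images)
    # 1. collect a labeled database and split it into train / test sets
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
    # 2. use ML to train an image classifier
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X_train, y_train)
    # 3. evaluate the classifier on test images
    return clf.score(X_test, y_test)
```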
Bag of words
What object do these parts belong to?
Some local features are very informative.

An object as a collection of local features (bag-of-features):
• deals well with occlusion
• scale invariant
• rotation invariant

(Not so) crazy assumption: the spatial information of local features
can be ignored for object recognition (i.e., verification).

Works pretty well for image-level classification (e.g., on the Caltech-6 dataset).
Csurka et al. (2004), Willamowski et al. (2005), Grauman & Darrell (2005), Sivic et al. (2003, 2005)
Bag-of-features

Represent a data item (document, texture, image) as a histogram over features.
An old idea (e.g., texture recognition and information retrieval).

Texture recognition: represent a texture as a histogram over a universal texton dictionary.

Vector Space Model
G. Salton. ‘Mathematics and Information Retrieval.’ Journal of Documentation, 1979.

word:    Tartan  robot  CHIMP  CMU  bio  soft  ankle  sensor
doc 1:        1      6      2    1    0    0      0       1
doc 2:        0      4      0    1    4    5      3       2

http://www.fodey.com/generators/newspaper/snippet.asp
A document (datapoint) is a vector of counts over each word (feature).
It counts the number of occurrences: just a histogram over words.

What is the similarity between two documents?
Use any distance you want, but the cosine distance is fast.
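As a hedged illustration (not from the slides), the cosine similarity between the two word-count histograms in the table above can be computed as:

```python
# Cosine similarity between two word-count histograms (illustrative sketch).
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = [1, 6, 2, 1, 0, 0, 0, 1]   # Tartan, robot, CHIMP, CMU, bio, soft, ankle, sensor
doc2 = [0, 4, 0, 1, 4, 5, 3, 2]
print(cosine_similarity(doc1, doc2))  # roughly 0.49 for these two histograms
```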
but not all words are created equal
TF-IDF
Term Frequency / Inverse Document Frequency

Weigh each word by a heuristic:

  tf-idf weight = (term frequency) × (inverse document frequency)

(the inverse document frequency down-weights common terms)
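A short sketch of one standard tf-idf weighting, applied to the count vectors from the example above. The particular form idf = log(N / document frequency) is an assumption, not spelled out on the slide.

```python
# tf-idf weighting of bag-of-words count vectors (one common convention).
import numpy as np

def tfidf(counts):
    counts = np.asarray(counts, dtype=float)             # shape: (num_docs, num_words)
    tf = counts / counts.sum(axis=1, keepdims=True)      # term frequency per document
    df = (counts > 0).sum(axis=0)                        # number of docs containing each word
    idf = np.log(counts.shape[0] / np.maximum(df, 1))    # down-weights common terms
    return tf * idf

docs = np.array([[1, 6, 2, 1, 0, 0, 0, 1],
                 [0, 4, 0, 1, 4, 5, 3, 2]])
print(tfidf(docs))   # words appearing in every document get weight 0 here
```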


Standard BOW pipeline (for image classification)

1. Dictionary Learning: learn visual words using clustering.
2. Encode: build Bags-of-Words (BOW) vectors for each image.
3. Classify: train and test data using BOWs.

Dictionary Learning: learn visual words using clustering
1. Extract features (e.g., SIFT) from images.
2. Learn the visual dictionary (e.g., K-means clustering).
What kinds of features can we extract?

• Regular grid
  • Vogel & Schiele, 2003
  • Fei-Fei & Perona, 2005
• Interest point detector
  • Csurka et al. 2004
  • Fei-Fei & Perona, 2005
  • Sivic et al. 2005
• Other methods
  • Random sampling (Vidal-Naquet & Ullman, 2002)
  • Segmentation-based patches (Barnard et al. 2003)

Detect patches [Mikolajczyk and Schmid ’02; Matas, Chum, Urban & Pajdla ’02; Sivic & Zisserman ’03],
normalize each patch, then compute a SIFT descriptor [Lowe ’99]. (A short extraction sketch follows.)
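A hedged sketch of the feature extraction step with OpenCV; it assumes opencv-python with SIFT available (cv2.SIFT_create in OpenCV >= 4.4) and grayscale images on disk. The function name and inputs are illustrative.

```python
# Extract SIFT descriptors from a set of training images.
import cv2
import numpy as np

def extract_sift(image_paths):
    sift = cv2.SIFT_create()
    all_descriptors = []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        # detect patches and compute a 128-D SIFT descriptor for each
        keypoints, descriptors = sift.detectAndCompute(gray, None)
        if descriptors is not None:
            all_descriptors.append(descriptors)
    return np.vstack(all_descriptors)   # stack features from all training images
```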

How do we learn the dictionary?

Clustering: the cluster centers form the visual vocabulary.
K-means clustering

Given k:

1. Select initial centroids at random.
2. Assign each object to the cluster with the nearest centroid.
3. Compute each centroid as the mean of the objects assigned to it (go to 2).
4. Repeat the previous two steps until no change.

(A minimal implementation sketch follows.)
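A minimal NumPy sketch of the algorithm above; not optimized (for real dictionaries, something like sklearn's KMeans or MiniBatchKMeans is more typical).

```python
# K-means: random init, assign to nearest centroid, recompute means, repeat.
import numpy as np

def kmeans(X, k, num_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1. random initial centroids
    for _ in range(num_iters):
        # 2. assign each point to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):               # 4. stop when nothing changes
            break
        centroids = new_centroids
    return centroids, labels
```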


From what data should I learn the dictionary?

• The dictionary (codebook) can be learned on a separate training set.

• Provided the training set is sufficiently representative, the dictionary will be “universal”.
Example visual dictionaries (appearance codebooks). Source: B. Leibe.
Encode: build Bags-of-Words (BOW) vectors for each image

1. Quantization: each image feature gets associated to a visual word (the nearest cluster center).
2. Histogram: count the number of visual word occurrences (frequency over the codewords).

(A short encoding sketch follows.)
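A hedged sketch of this encoding step, assuming `centroids` comes from the K-means sketch above and `descriptors` from the SIFT sketch; the function name is illustrative.

```python
# Quantize descriptors to visual words, then histogram the word occurrences.
import numpy as np

def encode_bow(descriptors, centroids):
    # 1. quantization: index of the nearest cluster center for each descriptor
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    # 2. histogram: count visual word occurrences, normalized to sum to 1
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    return hist / hist.sum()
```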
Classify: train and test data using BOWs

• K nearest neighbors
• Support Vector Machine
• Naïve Bayes
K nearest neighbors
Distribution of data from two classes: which class does q belong to?
Look at the neighbors.

K-Nearest Neighbor (KNN) Classifier

A non-parametric pattern classification approach.
Consider a two-class problem where each sample consists of two measurements (x, y).

• k = 1: for a given query point q, assign the class of the nearest neighbor.
• k = 3: compute the k nearest neighbors and assign the class by majority vote.
(A minimal sketch follows.)
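A minimal sketch of the classifier just described, assuming Euclidean distance and NumPy arrays `X_train` (feature vectors) and `y_train` (integer labels); these names are illustrative.

```python
# k-nearest-neighbor prediction by majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, q, k=3):
    dists = np.linalg.norm(X_train - q, axis=1)              # distance to every training point
    nearest = np.argsort(dists)[:k]                          # indices of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]    # majority vote over their labels
```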
Nearest Neighbor is competitive

MNIST Digit Recognition (Yann LeCun)
– Handwritten digits
– 28x28 pixel images: d = 784
– 60,000 training samples
– 10,000 test samples

Method                                      Test Error Rate (%)
Linear classifier (1-layer NN)              12.0
K-nearest-neighbors, Euclidean              5.0
K-nearest-neighbors, Euclidean, deskewed    2.4
K-NN, Tangent Distance, 16x16               1.1
K-NN, shape context matching                0.67
1000 RBF + linear classifier                3.6
SVM deg 4 polynomial                        1.1
2-layer NN, 300 hidden units                4.7
2-layer NN, 300 HU, [deskewing]             1.6
LeNet-5, [distortions]                      0.8
Boosted LeNet-4, [distortions]              0.7
What is the best distance metric between data points?

• Typically Euclidean distance
• Locality sensitive distance metrics
• Important to normalize: dimensions have different scales

How many K?

• Typically k = 1 is good
• Cross-validation (try different k!)

Distance metrics: Euclidean, Cosine, Chi-squared.
(A sketch of these three metrics follows.)
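A hedged sketch of the three metrics named above for histogram vectors x and y; the chi-squared form used here is one common convention, not necessarily the exact formula on the slide.

```python
# Three distance metrics between feature histograms.
import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y)

def cosine_distance(x, y):
    return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def chi_squared(x, y, eps=1e-10):
    # eps avoids division by zero for bins that are empty in both histograms
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))
```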
Choice of distance metric
• Hyperparameter

(Visualization: L2 distance; CIFAR-10 and NN results)

k-nearest neighbor
• Find the k closest points from training data
• Labels of the k points “vote” to classify

Hyperparameters
• What is the best distance to use?
• What is the best value of k to use?
• i.e., how do we set the hyperparameters?
• Very problem-dependent
• Must try them all and see what works best

Validation / Cross-validation
How to pick hyperparameters?
• Methodology
  – Train and test
  – Train, validate, test
• Train for original model
• Validate to find hyperparameters
• Test to understand generalizability
(a short cross-validation sketch follows)
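A hedged sketch of picking k by cross-validation with scikit-learn; the BoW feature matrix `X`, labels `y`, and candidate values of k are assumptions for illustration.

```python
# Choose k for KNN by 5-fold cross-validation on the training data.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def pick_best_k(X, y, candidate_ks=(1, 3, 5, 7, 9)):
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
              for k in candidate_ks}
    return max(scores, key=scores.get), scores   # best k on the validation folds
```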
Pros

• simple yet effective

Cons

• search is expensive (can be sped up)

• storage requirements

• difficulties with high-dimensional data


kNN -- Complexity and Storage
• N training images, M test images

• Training: O(1)
• Testing: O(MN)

• Hmm…
– Normally need the opposite
– Slow training (ok), fast testing (necessary)
Naïve Bayes

Distribution of data from two classes: which class does q belong to?
• Learn a parametric model for each class.
• Compute the probability of the query under each model.

This is called the posterior: the probability of a class z given the observed features X,

  p(z | X)

For classification, z is a discrete random variable (e.g., car, person, building).
X is a set of observed features (e.g., features from a single image); each x is an
observed feature (e.g., a visual word).
(It’s a function that returns a single probability value.)


Recall: the posterior can be decomposed according to Bayes’ Rule:

  p(z | X) = p(X | z) p(z) / p(X)
  (posterior = likelihood × prior / evidence)

In our context, the naive Bayes’ classifier is solving this optimization, the MAP
(maximum a posteriori) estimate:

  argmax_z p(z | X) = argmax_z p(X | z) p(z) / p(X)   (Bayes’ Rule)
                    = argmax_z p(X | z) p(z)          (remove constants)

To optimize this, we need to compute the likelihood p(X | z).
A naive Bayes’ classifier assumes all features are conditionally independent:

  p(X | z) = ∏_i p(x_i | z)

Recall: to compute the MAP estimate, given (1) a set of known parameters and
(2) observations, compute which z has the largest probability.


Example: computing the likelihood for the first newspaper snippet.

  word      Tartan  robot  CHIMP  CMU   bio   soft  ankle  sensor
  count          1      6      2    1     0      0      0       1
  p(x|z)      0.09   0.55   0.18 0.09   0.0    0.0    0.0    0.09

  log p(X | z = grand challenge) = -14.58
  log p(X | z = bio inspired)    = -37.48

Example: computing the likelihood for the second newspaper snippet.

  word      Tartan  robot  CHIMP  CMU   bio   soft  ankle  sensor
  count          0      4      0    1     4      5      3       2
  p(x|z)       0.0   0.21    0.0 0.05  0.21   0.26   0.16    0.11

  log p(X | z = grand challenge) = -94.06
  log p(X | z = bio inspired)    = -32.41

Numbers get really small, so use log probabilities.
Typically add pseudo-counts (0.001).
These are examples of computing the likelihood; you need to multiply by the prior to get
the posterior. (A short sketch follows.)
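A hedged sketch of the per-class log-likelihood with pseudo-counts, in the spirit of the example above; the class count vectors, class names, and exact pseudo-count handling are illustrative assumptions, so the numbers will not match the slide exactly.

```python
# log p(X|z) of a word-count vector under a per-class multinomial with pseudo-counts.
import numpy as np

def log_likelihood(doc_counts, class_word_counts, pseudo=0.001):
    p = class_word_counts + pseudo            # pseudo-counts: no word gets zero probability
    p = p / p.sum()                           # per-class word distribution p(x|z)
    return float(np.sum(doc_counts * np.log(p)))   # conditional independence: sum word log-probs

# word order: Tartan, robot, CHIMP, CMU, bio, soft, ankle, sensor
doc = np.array([1, 6, 2, 1, 0, 0, 0, 1], dtype=float)
grand_challenge = np.array([1, 6, 2, 1, 0, 0, 0, 1], dtype=float)   # assumed training counts
bio_inspired = np.array([0, 4, 0, 1, 4, 5, 3, 2], dtype=float)      # assumed training counts
print(log_likelihood(doc, grand_challenge), log_likelihood(doc, bio_inspired))
# the class with the larger log-likelihood (plus log prior) wins the MAP decision
```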
Support Vector Machine

Image classification with a linear classifier: a score function applied to the data
(a histogram); convert the image to a histogram representation.

Distribution of data from two classes: which class does q belong to?
Learn the decision boundary.

First we need to understand hyperplanes…
Hyperplanes (lines) in 2D

A line can be written as a dot product plus a bias:

  w · x + b = 0        (offset/bias outside)

Another version: append a 1 to x and push the bias inside the weight vector:

  w' · x' = 0,  where  x' = [x, 1]  and  w' = [w, b]        (offset/bias inside)

Important property: we are free to choose any normalization of w.
The line w · x + b = 0 and the line k (w · x + b) = 0 define the same line.

What is the distance to the origin? (hint: use the normal form)
Scale by 1/||w|| and you get the normal form; the distance to the origin is |b| / ||w||.

What is the distance between two parallel lines? (hint: use the distance to origin)
The distance between two parallel lines w · x + b1 = 0 and w · x + b2 = 0 is the difference
of their distances to the origin: |b1 - b2| / ||w||.


Now we can go to 3D…

Hyperplanes (planes) in 3D: w · x + b = 0.
What are the dimensions of the vector w now? What happens if you change b?
What is the distance between two such parallel planes?
What’s the best w?

• Intuitively, the line that is the farthest from all interior points.
• Maximum margin solution: most stable to perturbations of the data.
• Want a hyperplane that is far away from the ‘inner points’; the closest points are the support vectors.
• Find the hyperplane w such that the margin, the gap between the parallel hyperplanes, is maximized.


Can be formulated as a maximization problem:

  maximize   2 / ||w||
  subject to y_i (w · x_i + b) ≥ 1   for all i

What does this constraint mean? y_i is the label of the data point. Why is it +1 and -1?

Equivalently (where did the 2 go? what happened to the labels?):

  minimize   (1/2) ||w||²
  subject to y_i (w · x_i + b) ≥ 1   for all i

‘Primal formulation’ of a linear SVM: the objective function is (1/2)||w||², the constraints
are y_i (w · x_i + b) ≥ 1.

This is a convex quadratic programming (QP) problem (a unique solution exists).
‘Soft’ margin

What’s the best w? When separating cats and dogs, a perfectly separating hyperplane may
have a very narrow margin. Intuitively, we should allow for some misclassification if we
can get more robust classification: a trade-off between the MARGIN and the MISTAKES
(which might be a better solution).

Adding slack variables ξ_i (one per misclassified point), the ‘soft’ margin problem is:

  objective    minimize  (1/2) ||w||² + C Σ_i ξ_i
  subject to   y_i (w · x_i + b) ≥ 1 - ξ_i,   ξ_i ≥ 0   for all i

The slack variables allow for mistakes, as long as the inverse margin is minimized.

• Every constraint can be satisfied if the slack is large.
• C is a regularization parameter:
  • small C: ignore constraints (larger margin)
  • big C: constraints dominate (smaller margin)
• Still a QP problem (a unique solution exists).
(A short sketch of the effect of C follows.)
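A hedged sketch of training a linear soft-margin SVM on BoW histograms with scikit-learn; the feature matrix, labels, and value of C are assumptions for illustration.

```python
# Soft-margin linear SVM; C trades margin width against classification mistakes.
from sklearn.svm import LinearSVC

def train_linear_svm(X_train, y_train, C=1.0):
    # small C: wider margin, more slack allowed; large C: narrower margin, fewer violations
    clf = LinearSVC(C=C)
    return clf.fit(X_train, y_train)
```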
References

Basic reading:
• Szeliski, Chapter 14.
