CS434a/541a: Pattern Recognition Prof. Olga Veksler

This lecture introduced the concepts of pattern recognition. It discussed what pattern recognition is, some applications like character recognition and medical diagnostics, and outlined the typical structure of a pattern recognition system. It used a toy example of classifying fish into salmon and sea bass to illustrate the design process, including collecting training data, extracting discriminative features, designing a classifier, and testing it on new data. Overfitting and the importance of generalization were also covered.


CS434a/541a: Pattern Recognition

Prof. Olga Veksler

Lecture 1

1
Outline of the lecture

Syllabus
Introduction to Pattern Recognition
Review of Probability/Statistics

2
Syllabus
Prerequisite
Analysis of algorithms (CS 340a/b)
First-year course in Calculus
Introductory Statistics (Stats 222a/b or equivalent) - will review
Linear Algebra (040a/b)
Grading
Midterm 30%
Assignments 30%
Final Project 40%
3
Syllabus
Assignments
bi-weekly
theoretical or programming in Matlab or C
no extensive programming
may include extra credit work
may discuss but work individually
due at the beginning of class
Midterm
open anything
roughly on November 8
4
Syllabus
Final project
Choose from the list of topics or design your own
May work in a group of 2, in which case the project is expected to be more extensive
5 to 8 page report
proposals due roughly November 1
due December 8

5
Intro to Pattern Recognition

Outline
What is pattern recognition?
Some applications
Our toy example
Structure of a pattern recognition system
Design stages of a pattern recognition system

6
What is Pattern Recognition ?

Informally
Recognize patterns in data
More formally
Assign an object or an event to one of several pre-specified categories (a category is usually called a class)

[Example images: a tea cup, a face, a phone]
7
Application: male or female?
[Diagram: objects (pictures) -> perfect PR system -> classes: male / female]

8
Application: photograph or not?
[Diagram: objects (pictures) -> perfect PR system -> classes: photo / not photo]

9
Application: Character Recognition

[Diagram: object (an image of handwritten text) -> perfect PR system -> "hello world"]

In this case, the classes are all possible characters: a, b, c, ..., z

10
Application: Medical diagnostics
[Diagram: objects (tumors) -> perfect PR system -> classes: cancer / not cancer]

11
Application: speech understanding

[Diagram: object (acoustic signal) -> perfect PR system -> phonemes: re-kig-'ni-sh&n]

In this case, the classes are all phonemes

12
Application: Loan applications
objects (people), described by features; classes: approve / deny

                 income   debt    married  age
  John Smith    200,000       0   yes       80
  Peter White    60,000   1,000   no        30
  Ann Clark     100,000  10,000   yes       40
  Susan Ho            0  20,000   no        25

13
Our Toy Application: fish sorting
[Diagram: a camera over the conveyor belt captures a fish image; the classifier determines the fish species; the sorting chamber routes salmon and sea bass into separate bins]

14
How to design a PR system?
Collect data (training data) and classify by hand
[Example training images, labeled by hand: salmon, sea bass, salmon, salmon, sea bass, sea bass]

Preprocess by segmenting fish from background

Extract possibly discriminating features


length, lightness, width, number of fins, etc.
Classifier design
Choose model
Train classifier on part of collected data (training data)
Test classifier on the rest of collected data (test data)
i.e. the data not used for training
Should classify new data (new fish images) well
15
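The train/test split in the last two steps can be illustrated with a minimal sketch; this is not part of the lecture, and the features, labels, and split ratio below are invented for illustration.

```python
import random

# Hand-labeled data: (features, label) pairs, e.g. ([length, lightness], "salmon").
# All numbers here are invented for illustration.
data = [([4.0, 1.2], "salmon"), ([11.0, 4.1], "sea bass"), ([5.5, 1.8], "salmon"),
        ([9.0, 3.6], "sea bass"), ([3.5, 1.0], "salmon"), ([12.0, 4.5], "sea bass")]

random.seed(0)
random.shuffle(data)

# Hold out part of the collected data for testing (here roughly one third).
n_test = len(data) // 3
test_data, train_data = data[:n_test], data[n_test:]

# ... train a classifier on train_data only, then estimate its error on test_data,
# i.e. on data that was not used for training.
```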
Classifier design
Notice salmon tends to be shorter than sea bass
Use fish length as the discriminating feature
Count number of bass and salmon of each length
length:  2  4   8  10  12  14
bass:    0  1   3   8  10   5
salmon:  2  5  10   5   1   0

[Histogram: counts of salmon and sea bass at each length]
16
Fish length as discriminating feature
Find the best length threshold L:
  fish length < L: classify as salmon
  fish length > L: classify as sea bass

For example, at L = 5, misclassified:


1 sea bass
16 salmon

length:  2  4   8  10  12  14
bass:    0  1   3   8  10   5
salmon:  2  5  10   5   1   0

error = 17/50 = 34%
17
Fish Length as discriminating feature
[Histogram of counts by length: fish to the left of the threshold are classified as salmon, fish to the right as sea bass]

After searching through all possible thresholds L, the best is L = 9, and still 20% of the fish are misclassified
18
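The threshold search on this slide is easy to reproduce. Below is a minimal sketch using the counts from the table on the previous slides; it recovers the 34% error at L = 5 and the best threshold L = 9 with 20% error.

```python
lengths = [2, 4, 8, 10, 12, 14]
bass    = [0, 1, 3,  8, 10,  5]     # counts of sea bass at each length
salmon  = [2, 5, 10, 5,  1,  0]     # counts of salmon at each length
total   = sum(bass) + sum(salmon)   # 50 fish in the training data

def error_at(L):
    """Error rate of the rule: length < L -> salmon, length >= L -> sea bass."""
    wrong_bass   = sum(b for l, b in zip(lengths, bass)   if l < L)   # bass called salmon
    wrong_salmon = sum(s for l, s in zip(lengths, salmon) if l >= L)  # salmon called bass
    return (wrong_bass + wrong_salmon) / total

print(error_at(5))                        # 0.34, i.e. 17/50 as on the previous slide
best_L = min(range(2, 16), key=error_at)
print(best_L, error_at(best_L))           # 9, 0.2  (20% error)
```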
Next Step

Lesson learned:
Length is a poor feature alone!
What to do?
Try another feature
Salmon tends to be lighter
Try average fish lightness

19
Fish lightness as discriminating feature
lightness:  1   2  3   4   5
bass:       0   1  2  10  12
salmon:     6  10  6   1   0

[Histogram: counts of salmon and sea bass at each lightness value]

Now the fish are well separated at a lightness threshold of 3.5, with a classification error of 8%
20
Can do even better by combining features
Use both length and lightness features
Feature vector [length, lightness]

[Plot in the (length, lightness) feature space: a decision boundary separates the decision regions for salmon and sea bass]

21
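As a rough sketch of what a classifier over the combined feature vector [length, lightness] might look like, here is a simple linear decision boundary. The weights, bias, and example feature values are made up for illustration; in practice they would be learned from the training data.

```python
import numpy as np

def classify(length, lightness, w=np.array([0.3, 1.0]), b=-4.0):
    """Linear decision boundary w . [length, lightness] + b = 0
    (coefficients are illustrative only, not learned)."""
    score = w @ np.array([length, lightness]) + b
    return "sea bass" if score > 0 else "salmon"

print(classify(4.0, 1.2))    # short fish with low lightness value  -> salmon
print(classify(11.0, 4.1))   # long fish with high lightness value  -> sea bass
```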
Better decision boundary
[Plot: a more complex decision boundary in the (length, lightness) space]

Ideal decision boundary, 0% classification error


22
Test Classifier on New Data
Classifier should perform well on new data
Test “ideal” classifier on new data: 25% error
[Plot: the same "ideal" boundary applied to new data in the (length, lightness) space]

23
What Went Wrong?
[Plot callout: the complicated decision boundary]
Poor generalization

Complicated boundaries do not generalize well to new data; they are too "tuned" to the particular training data rather than to some true model that separates salmon from sea bass well.
This is called overfitting the data

24
Generalization
[Plots: the simpler decision boundary on the training data (left) and on the testing data (right)]

A simpler decision boundary does not perform ideally on the training data but generalizes better to new data
Favor simpler classifiers
William of Occam (1284-1347): "entities are not to be multiplied without necessity"
25
Pattern Recognition System Structure
input
sensing (domain dependent)
  camera, microphones, medical imaging devices, etc.
segmentation
  Patterns should be well separated and should not overlap.
feature extraction
  Extract discriminating features. Good features make the work of the classifier easy.
classification
  Use features to assign the object to a category. A better classifier makes feature extraction easier. Our main topic in this course.
post-processing
  Exploit context (input-dependent information) to improve system performance, e.g. correcting "Tne cat" to "The cat".
decision
26
How to design a PR system?
[Design cycle flowchart:]
start
collect data
choose features   (informed by prior knowledge)
choose model
train classifier
evaluate classifier
end
27
Design Cycle cont.
Collect Data
  Can be quite costly
  How do we know when we have collected an adequately representative set of testing and training examples?
28
Design Cycle cont.
Choose features
  Should be discriminating, i.e. similar for objects in the same category, different for objects in different categories
  [Illustration: examples of good features vs. bad features]
  Prior knowledge plays a great role (domain dependent)
  Easy to extract
  Insensitive to noise and irrelevant transformations
29
Design Cycle cont.
Choose model
  What type of classifier to use?
  When should we try to reject one model and try another one?
  What is the best classifier for the problem?
30
Design Cycle cont.
Train classifier
  Process of using data to determine the parameters of the classifier
  Change parameters of the chosen model so that the model fits the collected data
  Many different procedures for training classifiers
  Main scope of the course
31
Design Cycle cont.
Evaluate Classifier
  Measure system performance
  Identify the need for improvements in system components
  How to adjust the complexity of the model to avoid overfitting? Any principled methods to do this?
  Trade-off between computational complexity and performance
32
Conclusion

useful
a lot of exciting and important applications
but hard
must solve many issues for a successful pattern recognition system

33
Review: mostly probability and some statistics

34
Content
Probability
Axioms and properties
Conditional probability and independence
Law of Total probability and Bayes theorem
Random Variables
Discrete
Continuous
Pairs of Random Variables
Random Vectors
Gaussian Random Variable
35
Basics
We are performing a random experiment (catching one fish from the sea)
S: all fish in the sea

event A

total number of events: 2^12 (for the 12 fish shown)

probability: a function P from all events in S to numbers, A -> P(A)
36
Axioms of Probability

1. P(A) ≥ 0
2. P(S) = 1
3. If A ∩ B = ∅ then P(A ∪ B) = P(A) + P(B)

37
Properties of Probability
P(∅) = 0

P(A) ≤ 1

P(Aᶜ) = 1 − P(A)

A ⊂ B  ⟹  P(A) ≤ P(B)

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

If Ai ∩ Aj = ∅ for all i ≠ j, then P(A1 ∪ A2 ∪ … ∪ AN) = P(A1) + P(A2) + … + P(AN)
38
Conditional Probability
If A and B are two events, and we know that event B has occurred, then (if P(B) > 0)

P(A|B) = P(A ∩ B) / P(B)

[Venn diagram: once B has occurred, A can only occur through A ∩ B]

multiplication rule: P(A ∩ B) = P(A|B) P(B)
39
Independence
A and B are independent events if
P(A ∩ B) = P(A) P(B)

By the law of conditional probability, if A and B are independent:

P(A|B) = P(A) P(B) / P(B) = P(A)

If two events are not independent, then they are said to be dependent
40
Law of Total Probability
B1, B2, …, Bn partition S
Consider an event A
[Venn diagram: S partitioned into B1, B2, B3, B4, each intersecting A]

A = (A ∩ B1) ∪ (A ∩ B2) ∪ (A ∩ B3) ∪ (A ∩ B4)
Thus P(A) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3) + P(A ∩ B4)
Or, using the multiplication rule:
P(A) = P(A | B1) P(B1) + … + P(A | B4) P(B4)

In general, P(A) = Σ_{k=1..n} P(A | Bk) P(Bk)
41
Bayes Theorem
Let B1, B2, …, Bn be a partition of the sample space S. Suppose event A occurs. What is the probability of event Bi?
Answer: Bayes Rule

P(Bi | A) = P(Bi ∩ A) / P(A) = P(A | Bi) P(Bi) / ( Σ_{k=1..n} P(A | Bk) P(Bk) )
42
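A small numeric sketch of the last two slides, with invented numbers: the classes Bk are salmon and sea bass, and A is an observed event such as "the fish looks light".

```python
# Priors P(B_k): fraction of each species in the sea (invented numbers).
P_B = {"salmon": 0.4, "sea bass": 0.6}
# Class-conditional probabilities P(A | B_k) that the fish looks light (also invented).
P_A_given_B = {"salmon": 0.8, "sea bass": 0.1}

# Law of total probability: P(A) = sum_k P(A | B_k) P(B_k)
P_A = sum(P_A_given_B[k] * P_B[k] for k in P_B)

# Bayes rule: P(B_k | A) = P(A | B_k) P(B_k) / P(A)
posterior = {k: P_A_given_B[k] * P_B[k] / P_A for k in P_B}
print(P_A)        # 0.38
print(posterior)  # salmon ~0.842, sea bass ~0.158
```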
Random Variables
In a random experiment, we usually assign some number to the outcome, for example the number of fish fins
A random variable X is a function from the sample space S to the real numbers
[Illustration: each fish (outcome) is mapped to its number of fins]

X is random due to the randomness of its argument:
P(X = a) = P(X(ω) = a) = P({ω ∈ Ω : X(ω) = a})
43
Two Types of Random Variables

Discrete random variable: has a countable number of values
  e.g. number of fish fins (0, 1, 2, …, 30)

Continuous random variable: takes values in a continuous range
  e.g. fish weight (any real number between 0 and 100)
44
Cumulative Distribution Function
Given a random variable X, the CDF is defined as
F(a) = P(X ≤ a)

[Plot: an example CDF]
45
Properties of the CDF F(a) = P(X ≤ a):

1. F(a) is non-decreasing
2. lim_{b→∞} F(b) = 1
3. lim_{b→−∞} F(b) = 0

Questions about X can be asked in terms of the CDF:
P(a < X ≤ b) = F(b) − F(a)

Example: P(20 < X ≤ 30) = F(30) − F(20)
46
Discrete RV: Probability Mass Function
Given a discrete random variable X, we define the probability mass function as
p(a) = P(X = a)
It satisfies all the axioms of probability

The CDF in the discrete case satisfies
F(a) = P(X ≤ a) = Σ_{x ≤ a} P(X = x) = Σ_{x ≤ a} p(x)

47
Continuous RV: Probability Density Function

Given a continuous RV X, we say f(x) is its probability density function if
F(a) = P(X ≤ a) = ∫_{−∞}^{a} f(x) dx

and, more generally, P(a ≤ X ≤ b) = ∫_{a}^{b} f(x) dx

48
Properties of Probability Density Function

d/dx F(x) = f(x)

P(X = a) = ∫_{a}^{a} f(x) dx = 0

P(−∞ ≤ X ≤ ∞) = ∫_{−∞}^{∞} f(x) dx = 1

f(x) ≥ 0
49
Probability mass (pmf) vs. probability density (pdf)

[Plots: a pmf over the number of fins, with p(2) = 0.3 and p(3) = 0.4, and a pdf over fish weight, with f(30) = 0.6]

pmf: take sums
  P(fish has 2 or 3 fins) = p(2) + p(3) = 0.3 + 0.4

pdf: f is a true probability density, not a probability: integrate
  P(fish weighs 30 kg) ≠ 0.6; in fact P(fish weighs exactly 30 kg) = 0
  P(fish weighs between 29 and 31 kg) = ∫_{29}^{31} f(x) dx
50
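A minimal sketch of the two computations: summing a pmf in the discrete case and numerically integrating a density in the continuous case. The pmf values not shown on the slide, and the Gaussian density assumed for fish weight, are illustrative assumptions.

```python
import numpy as np

# Discrete: pmf over the number of fins. p(2) = 0.3 and p(3) = 0.4 are from the
# slide; the remaining values are assumed so that the pmf sums to 1.
pmf = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.1, 5: 0.1}
print(pmf[2] + pmf[3])                     # P(2 or 3 fins) = 0.7

# Continuous: an assumed Gaussian density for fish weight (mean 30 kg, std 2 kg).
def f(x, mu=30.0, sigma=2.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(29, 31, 2001)
print(np.sum(f(xs)) * (xs[1] - xs[0]))     # P(29 <= weight <= 31), roughly 0.38
# P(weight is exactly 30 kg) is 0, even though f(30) > 0 for this density.
```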
Expected Value
Useful characterization of a r.v.
Also known as the mean, expectation, or first moment

discrete case:   µ = E(X) = Σ_{∀x} x p(x)
continuous case: µ = E(X) = ∫_{−∞}^{∞} x f(x) dx

Expectation can be thought of as the average or the center, or the expected average outcome over many experiments
51
Expected Value for Functions of X
Let g(x) be a function of the r.v. X. Then

discrete case:   E[g(X)] = Σ_{∀x} g(x) p(x)
continuous case: E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

An important function of X: [X − E(X)]²
Variance: var(X) = E[(X − E(X))²] = σ²
Variance measures the spread around the mean
Standard deviation = [var(X)]^{1/2}, which has the same units as the r.v. X
52
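A small sketch computing the mean, variance, and standard deviation directly from these definitions, using the same illustrative pmf over fin counts as before (an assumption, not data from the lecture).

```python
import math

pmf = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.1, 5: 0.1}    # assumed pmf, sums to 1

mean = sum(x * p for x, p in pmf.items())                 # E(X)
var  = sum((x - mean) ** 2 * p for x, p in pmf.items())   # E[(X - E(X))^2]
std  = math.sqrt(var)                                     # same units as X
print(mean, var, std)   # 2.8, 1.16, ~1.077
```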
Properties of Expectation
If X is a constant r.v., X = c, then E(X) = c

If a and b are constants, E(aX + b) = a E(X) + b

More generally,
E( Σ_{i=1..n} (a_i X_i + c_i) ) = Σ_{i=1..n} (a_i E(X_i) + c_i)

If a and b are constants, then var(aX + b) = a² var(X)

53
Pairs of Random Variables
Say we have 2 random variables:
Fish weight X
Fish lightness Y
Can define joint CDF
F(a, b) = P(X ≤ a, Y ≤ b) = P({ω ∈ Ω : X(ω) ≤ a, Y(ω) ≤ b})
Similar to the single-variable case, can define
  discrete: joint probability mass function p(a, b) = P(X = a, Y = b)
  continuous: joint density function f(x, y), with
  P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫∫_{a ≤ x ≤ b, c ≤ y ≤ d} f(x, y) dx dy
54
Marginal Distributions
Given the joint mass function p_{X,Y}(a, b), the marginal, i.e. the probability mass function for r.v. X, can be obtained from p_{X,Y}(a, b):

p_X(a) = Σ_{∀y} p_{X,Y}(a, y)        p_Y(b) = Σ_{∀x} p_{X,Y}(x, b)

The marginal densities f_X(x) and f_Y(y) are obtained from the joint density f_{X,Y}(x, y) by integrating:

f_X(x) = ∫_{y=−∞}^{y=∞} f_{X,Y}(x, y) dy        f_Y(y) = ∫_{x=−∞}^{x=∞} f_{X,Y}(x, y) dx
55
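A minimal sketch of marginalization in the discrete case: the joint pmf below is an invented 3x3 table, and summing over one variable gives the marginal of the other.

```python
import numpy as np

# Joint pmf p_{X,Y}(x, y) as a table: rows index values of X, columns values of Y.
# The numbers are invented and sum to 1.
p_xy = np.array([[0.10, 0.05, 0.05],
                 [0.20, 0.25, 0.05],
                 [0.05, 0.05, 0.20]])

p_x = p_xy.sum(axis=1)   # marginal pmf of X: sum over all y
p_y = p_xy.sum(axis=0)   # marginal pmf of Y: sum over all x
print(p_x, p_y)          # [0.2 0.5 0.3] and [0.35 0.35 0.3]
```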
Independence of Random Variables

r.v. X and Y are independent if
P(X ≤ x, Y ≤ y) = P(X ≤ x) P(Y ≤ y)

Theorem: r.v. X and Y are independent if and only if
p_{X,Y}(x, y) = p_X(x) p_Y(y)   (discrete)
f_{X,Y}(x, y) = f_X(x) f_Y(y)   (continuous)

56
More on Independent RV’s

If X and Y are independent, then

E(XY)=E(X)E(Y)
Var(X+Y)=Var(X)+Var(Y)
G(X) and H(Y) are independent

57
Covariance
Given r.v. X and Y, covariance is defined as:
cov ( X ,Y ) = E[( X − E( X ))(Y − E(Y ))] = E( XY ) − E( X )E(Y )
Covariance is useful for checking whether features X and Y give similar information
Covariance (from co-vary) indicates the tendency of X and Y to vary together
If X and Y tend to increase together, Cov(X,Y) > 0
If X tends to decrease when Y increases, Cov(X,Y) < 0
If a decrease (or increase) in X does not predict the behavior of Y, Cov(X,Y) is close to 0
58
Covariance and Correlation
If cov(X,Y) = 0, then X and Y are said to be uncorrelated (think "unrelated"). However, X and Y are not necessarily independent.

If X and Y are independent, then cov(X,Y) = 0

Can normalize the covariance to get the correlation:
−1 ≤ cor(X,Y) = cov(X,Y) / √(var(X) var(Y)) ≤ 1

59
Random Vectors
Generalize from pairs of r.v.'s to a vector of r.v.'s
X = [X1 X2 … Xn] (think multiple features)
Joint CDF, PDF, PMF are defined similarly to
the case of pair of r.v.’s
Example:
F (x1, x2,...,xn ) = P( X1 ≤ x1, X2 ≤ x2,...,Xn ≤ xn )

All the properties of expectation, variance, and covariance transfer with suitable modifications

60
Covariance Matrix
A summary of the characteristics of a random vector:

cov(X) = cov[X1 X2 … Xn] = Σ = E[(X − µ)(X − µ)ᵀ] =

    E[(X1 − µ1)(X1 − µ1)]  …  E[(Xn − µn)(X1 − µ1)]
    E[(X2 − µ2)(X1 − µ1)]  …  E[(Xn − µn)(X2 − µ2)]
    …
    E[(Xn − µn)(X1 − µ1)]  …  E[(Xn − µn)(Xn − µn)]

For example, for n = 3:

          σ1²  c12  c13
    Σ  =  c21  σ2²  c23
          c31  c32  σ3²

with the variances σi² on the diagonal and the covariances cij off the diagonal
61
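A small sketch estimating a covariance matrix from sample feature vectors (randomly generated, made-up data); the manual computation E[(X − µ)(X − µ)ᵀ] is compared with numpy's built-in estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 made-up samples of a 3-dimensional feature vector, e.g. [length, lightness, weight].
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.0],
                                          [1.0, 0.0, 3.0]])

mu = X.mean(axis=0)                       # mean vector
Sigma = (X - mu).T @ (X - mu) / len(X)    # E[(X - mu)(X - mu)^T], estimated from data
print(Sigma)
print(np.cov(X, rowvar=False, bias=True)) # the same matrix via numpy
```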
Normal or Gaussian Random Variable
Has density
f(x) = (1 / (σ √(2π))) · exp( −(1/2) ((x − µ)/σ)² )

Mean µ and variance σ²

62
Multivariate Gaussian
Has density
f(x) = (1 / ((2π)^{n/2} |Σ|^{1/2})) · exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

mean vector µ = [µ1, …, µn]
covariance matrix Σ

63
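A minimal sketch evaluating the multivariate Gaussian density above with numpy; the mean vector, covariance matrix, and query point are assumed values for illustration.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Density of N(mu, Sigma) at point x, following the formula on the slide."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

mu    = np.array([10.0, 3.0])                 # e.g. mean of [length, lightness]
Sigma = np.array([[4.0, 1.0],
                  [1.0, 1.0]])                # assumed covariance matrix
print(gaussian_density(np.array([11.0, 3.5]), mu, Sigma))
```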
Why Gaussian?

Frequently observed (central limit theorem)
The parameters µ and Σ are sufficient to characterize the distribution
Nice to work with
  Marginal and conditional distributions are also Gaussian
  If the Xi's are uncorrelated, then they are also independent

64
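A tiny sketch of the central limit theorem mentioned above: sums of many independent uniform random variables have an approximately Gaussian distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
# Each sample is the sum of 50 independent Uniform(0, 1) draws.
sums = rng.uniform(size=(10000, 50)).sum(axis=1)

# Mean ~ 50 * 0.5 = 25 and variance ~ 50 * (1/12) ~ 4.17; a histogram of `sums`
# looks close to a Gaussian with these parameters.
print(sums.mean(), sums.var())
```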
Summary

Intro to Pattern Recognition


Review of Probability and Statistics
Next time will review linear algebra

65
