Intro
DNA patterns: sequences over the bases A, G, C, T (e.g., AGCTCGAT)
Protein patterns: sequences over the 20 amino acids
Faces
Fingerprints
Other Patterns
Insurance and credit card applications
Applicants are characterized by a pattern:
# of accidents, make of car, model year,
income, # of dependents, creditworthiness,
mortgage amount
Dating services
Age, hobbies, income, etc. establish your
“desirability”
Other Patterns (cont.)
Web documents
Keyword-based description (e.g., documents
containing War, Baghdad, Hussein are different
from those containing football, NFL, AFL,
draft, quarterbacks)
Intrusion detection
Usage and connection patterns
Cancer detection
Image features of tumors, patient age,
treatment options, etc.
Other Patterns (cont.)
Housing market
Location, size, year, school district
University ranking
Student population, student-faculty ratio,
scholarship opportunities, location, faculty research
grants, etc.
Too many to list; see, e.g.,
http://www.ics.uci.edu/~mlearn/MLSummary.html
What is a pattern?
A pattern is a set of objects, processes, or events consisting of both deterministic and stochastic components
A pattern is a record of certain dynamic processes influenced by both deterministic and stochastic factors
What is a Pattern? (cont.)
Constellation patterns, texture patterns, EKG patterns, etc.
What is Pattern Recognition?
Classifies “patterns” into “classes”
Patterns ($x$)
have “measurements”, “traits”, or “features”
Classes ($\omega_i$)
likelihood (a priori probability $P(\omega_i)$)
class-conditional density $p(x|\omega_i)$
Classifier ($f(x) \rightarrow \omega_i$)
An example
four coin classes: penny, nickel, dime, and quarter
measurements: weight, color, size, etc.
Assign a coin to a class based on its size, weight, etc.
We use $P$ to denote a probability mass function (discrete) and
$p$ to denote a probability density function (continuous)
An Example
Another Example
Features
The intrinsic traits or characteristics that tell one pattern (object) apart from another
Feature extraction and representation allow
Focus on the relevant, distinguishing parts of a pattern
Data reduction and abstraction
Detection vs. Description
Detection: something happened
Heard noise
Saw something interesting
Non-flat signals
Description: what happened?
Gun shot, talking, laughing, crying, etc.
Lines, corners, textures
Mouse, cat, dog, bike, etc.
Feature Selection
More an art than a science
Effectiveness criteria:
[Figure: example feature distributions plotted as population vs. size]
[Figures: example shape features such as perimeter, compactness, elongatedness]
Optimal tradeoff between performance and generalization
Importance of Features
Cannot be overstated
We usually don't know which features to select,
what they represent, and how to tune them
(face recognition, gait recognition, tumor detection, etc.)
Classification and regression schemes are
mostly trying to make the best of whatever
features are available
Features
A single feature is usually not descriptive enough (no silver bullet)
Many features (the shotgun approach) can actually hurt
Many problems:
Relevance
Dimensionality
Co-dependency
Time- and space-varying characteristics
Accuracy
Uncertainty and error
Missing values
Feature Selection (cont.)
Q: How to decide if a feature is effective?
A: Through a training phase
Training on typical samples and typical features to discover
Whether features are effective
Whether there is any redundancy
Etc.
Classifiers
Decide $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$
$g_i(x) = P(\omega_i)$ if no measurements are made
$g_i(x) = P(\omega_i|x)$: minimizes the misclassification rate
$g_i(x) = -R(\alpha_i|x)$: maximizing $g_i$ minimizes the associated risk
[Diagram: $x$ feeds the discriminants $g_1(x), g_2(x), \ldots, g_n(x)$; the maximum determines the decision]
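A minimal sketch of this max-of-discriminants architecture in Python; the two discriminant functions are made-up placeholders, not from the slides:

```python
# Each class i supplies a discriminant g_i(x); the classifier picks the max.
def make_classifier(discriminants):
    def classify(x):
        scores = [g(x) for g in discriminants]
        return max(range(len(scores)), key=scores.__getitem__)  # the "max" box
    return classify

# Toy discriminants for two classes (arbitrary placeholder functions):
g1 = lambda x: -(x - 2.5) ** 2   # largest near x = 2.5
g2 = lambda x: -(x - 5.0) ** 2   # largest near x = 5.0
classify = make_classifier([g1, g2])
print(classify(3.0))             # -> 0, i.e., decide class omega_1
```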
Traditional Pattern Recognition
Parametric methods
Based on the class samples exhibiting a certain parametric distribution (e.g., Gaussian)
Learn the parameters through training
Density methods
Do not enforce a parametric form
Learn the density function directly
Parametric Methods
[Figure: four panels (I-IV) of class samples plotted as size vs. population]
$$p(x) = \frac{1}{(2\pi)^{n/2}\,\sigma^n}\; e^{-\frac{1}{2}\,\frac{\|x-\mu\|^2}{\sigma^2}}$$
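A minimal sketch of the parametric route for a single 1-D class, assuming a Gaussian form: estimate $\mu$ and $\sigma$ from training samples, then evaluate the learned density (the sample values are made up):

```python
import numpy as np

# Hypothetical training samples of one feature (e.g., coin weight in grams).
samples = np.array([2.48, 2.52, 2.50, 2.47, 2.53])

mu = samples.mean()      # maximum-likelihood estimate of the mean
sigma = samples.std()    # ML estimate of the spread (1/n normalization)

def gaussian_pdf(x, mu, sigma, n=1):
    # The density from the slide, with isotropic sigma in n dimensions.
    norm = (2 * np.pi) ** (n / 2) * sigma ** n
    return np.exp(-0.5 * np.abs(x - mu) ** 2 / sigma ** 2) / norm

print(gaussian_pdf(2.50, mu, sigma))
```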
Density Methods
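The slides illustrate this pictorially; a minimal sketch of one common nonparametric estimator (a Parzen-window / kernel density estimate, our choice of example) is:

```python
import numpy as np

def parzen_density(x, samples, h=0.1):
    # Average a Gaussian window of width h centered on each training sample;
    # no parametric form is imposed on the resulting density.
    kernels = np.exp(-0.5 * ((x - samples) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean()

samples = np.array([2.1, 2.3, 2.2, 3.8, 4.0])  # made-up 1-D training data
print(parzen_density(2.2, samples))             # high: inside a cluster
print(parzen_density(3.0, samples))             # low: between clusters
```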
Feature space
$d$-dimensional ($d$ = the number of features)
populated with features from training samples
[Figure: feature space with axes $f_1, f_2, \ldots, f_d$]
Decision Boundary Methods
[Figure: candidate decision boundaries (marked “?”) separating classes in the $f_1, f_2, \ldots, f_d$ feature space]
“Modern” vs “Traditional”
Pattern Recognition
Traditional:
Hand-crafted features
Simple and low-level: a concatenation of numbers or traits
Syntactic
Feature detection and description are separate tasks from classifier design (not jointly optimized with the classifier)
Modern:
Automatically learned features
Hierarchical and complex
Semantic
Feature detection and description are jointly optimized with the classifier
Traditional Features
Modern Features
Mathematical Foundation
No matter what methods or techniques you use, the underlying mathematical principle is quite simple:
Bayesian theory is the foundation
Review: Bayes Rule
Forward (synthesis) route:
From class to sample in a class
Grammar rules to sentences
Markov chain (or HMM) to pronunciation
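For reference, the rule itself, in the notation used throughout:

$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}, \qquad p(x) = \sum_i p(x \mid \omega_i)\,P(\omega_i)$$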
Review: Bayes Rule
Backward (analysis) is always harder
because the interpretation is not unique:
the presence of $x$ has multiple possible explanations
[Diagram: classes $\omega_1, \omega_2, \omega_3, \ldots, \omega_n$ can all give rise to the same observation $x$]
The simplest example
Two classes: pennies and dimes
No measurements
Classification: based on the a priori probabilities
Decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$
Decide $\omega_2$ if $P(\omega_1) < P(\omega_2)$
Decide $\omega_1$ or $\omega_2$ otherwise
Error rate: $\min(P(\omega_1), P(\omega_2))$
A slightly more complicated example
Two classes: pennies and dimes
A measurement $x$ is made (e.g., weight)
Classification: based on the a posteriori probabilities via Bayes rule
Decide $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$
Decide $\omega_2$ if $P(\omega_1|x) < P(\omega_2|x)$
Decide $\omega_1$ or $\omega_2$ otherwise
[Diagram: the measurement $x$ drives the choice between $\omega_1$ and $\omega_2$]
$$P(\omega_i \mid x) = \frac{p(x, \omega_i)}{p(x)} = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$$
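A minimal numeric sketch of this rule; the Gaussian weight models and the priors are made-up values for illustration:

```python
import numpy as np
from scipy.stats import norm

classes = ["penny", "dime"]
means, stds = [2.50, 2.27], [0.06, 0.05]  # assumed weight models (grams)
priors = [0.5, 0.5]                        # assumed P(omega_i)

def posterior(x):
    joint = np.array([norm.pdf(x, m, s) * P                # p(x|w_i) P(w_i)
                      for m, s, P in zip(means, stds, priors)])
    return joint / joint.sum()                             # divide by p(x)

post = posterior(2.4)                                      # measured weight
print(dict(zip(classes, post.round(3))))
print("decide:", classes[int(np.argmax(post))])
```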
[Figure: three plots over weight: the class-conditional densities $p(x|\omega_1)$ and $p(x|\omega_2)$; the scaled densities $p(x|\omega_1)P(\omega_1)$ and $p(x|\omega_2)P(\omega_2)$; and the posteriors $P(\omega_1|x)$ and $P(\omega_2|x)$ obtained by dividing by $p(x)$]
Why Both?
$p(x|\omega_i)$ and $P(\omega_i)$?
Essence
Turn a backward (analysis) problem into several forward (synthesis) problems
Or: analysis-by-synthesis
Error rate
Determined by
The likelihood of a class
The likelihood of measuring x in a class
$$P(\text{error} \mid x) = \frac{1}{p(x)}\,\min\big(p(x|\omega_1)P(\omega_1),\; p(x|\omega_2)P(\omega_2)\big)$$
Error Rate (cont.)
The Bayes decision rule minimizes the average error rate
$$E = \int \big(1 - P(\omega_{\alpha^*(x)} \mid x)\big)\, p(x)\, dx,$$
where
$$\alpha^*(x) = \arg\max_i P(\omega_i \mid x)$$
Various types of errors
Precision vs. Recall
A very common pair of measures in the PR and ML communities, used as a goodness measure
Tradeoff: as one goes up, the other typically has to go down
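A quick reminder of the definitions, as a sketch with made-up counts; flagging more items usually adds false positives faster than it removes false negatives, which is the tradeoff above:

```python
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of everything flagged, how much was right?
    recall = tp / (tp + fn)     # of everything there, how much was flagged?
    return precision, recall

print(precision_recall(tp=80, fp=20, fn=40))   # conservative: (0.80, 0.67)
print(precision_recall(tp=110, fp=70, fn=10))  # aggressive:   (0.61, 0.92)
```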
Various ways to measure error rates
Training error
Test error
Empirical error
PR , ANN, & ML 58
An even more complicated example
Two classes: pennies or dimes
A measurement $x$ is made
A risk is associated with making a wrong decision
Classification: based on the a posteriori probabilities and the Bayesian risk
$$R(\alpha_1|x) = \lambda_{11} P(\omega_1|x) + \lambda_{12} P(\omega_2|x)$$
$$R(\alpha_2|x) = \lambda_{21} P(\omega_1|x) + \lambda_{22} P(\omega_2|x)$$
$\lambda_{ij}$: the loss of taking action $\alpha_i$ when the true state is $\omega_j$
$R(\alpha_i|x)$: the conditional risk of action $\alpha_i$ given $x$
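A minimal sketch of minimum-risk classification; the loss matrix and the posteriors are assumed values:

```python
import numpy as np

# lam[i, j] = loss of taking action i when the true state is omega_j (assumed).
lam = np.array([[0.0, 5.0],   # act "penny": free if right, costly if wrong
                [1.0, 0.0]])  # act "dime":  small loss if it is really a penny

def min_risk_action(posteriors):
    risks = lam @ posteriors  # R(a_i|x) = sum_j lam_ij P(w_j|x)
    return int(np.argmin(risks)), risks

action, risks = min_risk_action(np.array([0.7, 0.3]))  # P(w_1|x), P(w_2|x)
print(action, risks)  # -> 1, [1.5 0.7]: risk overrides the more probable class
```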
[Diagram: states (State 1, State 2) give rise to an observation; recovering the state from the observation is the misclassification problem, a matter of math]
[Diagram: mapping the observation to a decision (Decision 1, Decision 2) is the misinterpretation problem, a matter of human factors]
[Diagram: incorrect decisions incur a domain-specific cost]
An even more complicated example
[Figure: $p(x|\text{pennies})P(\text{pennies})$ and $p(x|\text{dimes})P(\text{dimes})$ plotted over the measurement $x$]
R(used as pennies | x) = r(pennies used as pennies) · P(pennies | x) + r(dimes used as pennies) · P(dimes | x)
R(used as dimes | x) = r(pennies used as dimes) · P(pennies | x) + r(dimes used as dimes) · P(dimes | x)
A more credible example
R(call FD | smoke) = r(call, fire) · P(fire | smoke) + r(call, no fire) · P(no fire | smoke)
(the second term is the false-positive cost)
R(no call FD | smoke) = r(no call, no fire) · P(no fire | smoke) + r(no call, fire) · P(fire | smoke)
(the second term is the false-negative cost)
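Plugging in made-up numbers shows why one usually calls: the assumed loss of a missed fire dwarfs that of a false alarm.

```python
# Illustrative losses r(action, state); all values are assumptions.
r = {("call", "fire"): 0, ("call", "no fire"): 10,
     ("no call", "fire"): 1000, ("no call", "no fire"): 0}
p_fire = 0.05  # assumed P(fire | smoke)

R_call = (r[("call", "fire")] * p_fire
          + r[("call", "no fire")] * (1 - p_fire))
R_no_call = (r[("no call", "no fire")] * (1 - p_fire)
             + r[("no call", "fire")] * p_fire)

print(R_call, R_no_call)  # 9.5 vs 50.0 -> calling minimizes the risk
```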
A more credible example
R(attack | battlefield intelligence) = r(attack, <50%) · P(<50% | intelligence) + r(attack, >50%) · P(>50% | intelligence)
(false positive)
R(no attack | battlefield intelligence) = r(no attack, >50%) · P(>50% | intelligence) + r(no attack, <50%) · P(<50% | intelligence)
(false negative)
Bayesian Risk
Determined by
the likelihood of a class
the likelihood of measuring $x$ in a class
Classification: the Bayesian risk should be minimized
$$\min\big(R(\alpha_1|x),\, R(\alpha_2|x)\big) \quad\text{or}$$
$$\min\big(\lambda_{11}P(\omega_1|x) + \lambda_{12}P(\omega_2|x),\;\; \lambda_{21}P(\omega_1|x) + \lambda_{22}P(\omega_2|x)\big) \quad\text{or}$$
decide $\omega_1$ if $R(\alpha_1|x) < R(\alpha_2|x)$, i.e., if
$$(\lambda_{21} - \lambda_{11})\,P(\omega_1|x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2|x)$$
Bayesian Risk (cont.)
Again, decisions depend on
the likelihood of a class
the likelihood of observing $x$ in a class
modified by some positive risk factors
Why? Because in the real world it may not be the misclassification rate that matters, but the cost of the action you take
Other generalizations
Multiple classes
$n$ classes, with $\sum_{i=1}^{n} P(\omega_i) = 1$
Multiple measurements
$x$ is a vector instead of a scalar
Non-numeric measurements
Actions vs. decisions
Correlated vs. independent events
speech signals and images
Training allowed or not
Time-varying behaviors
Difficulties
Typical Approaches
Supervised (with tagged samples $x$): learn
parameters of a probability function, e.g., Gaussian $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$
density functions (w/o assuming any parametric form)
decision boundaries (when classes are indeed separable)
Unsupervised (w/o tagged samples $x$; see the sketch after this list):
minimum distance
hierarchical clustering
Reinforced (with hints):
right or wrong, but not the correct answer
learning with a critic, not a teacher (the supervised case)
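A minimal sketch of the unsupervised, minimum-distance idea: a few k-means-style passes over untagged data; the samples, seed, and $k$ are all placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),        # untagged samples drawn
               rng.normal(3, 0.5, (20, 2))])       # from two unknown groups
k = 2
centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers

for _ in range(10):  # alternate: assign to nearest center, then re-center
    labels = np.argmin(np.linalg.norm(X[:, None] - centers, axis=2), axis=1)
    centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])

print(centers.round(2))  # should land near (0, 0) and (3, 3)
```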
Pattern Recognition
Applications
DNA sequence analysis
Lie detectors
Handwritten digit recognition
Classification based on smell
Web document classification and search engine
Defect detection
Texture classification
Image database retrieval
Face recognition
etc.
Other formulations
We talked about 1/3 of the scenarios – that
of classification (discrete)
Regression – continuous
Extrapolation and interpolation
Clustering
Similarity
Abnormality detection
Concept drift (discovery), etc.
Classification vs. Regression
Classification:
Large vs. small hints on category
Absolute values do not matter as much (can actually hurt)
Normalization is often necessary
Fitting order stays low
Regression:
Large means large, small means small
Absolute values matter
Fitting orders matter
Recent Development
Data can be “massaged”: surprisingly, massaging the data and using simple classifiers works better than massaging the classifiers and using simple data (for simple problems and small data sets)
A hard-to-visualize concept: transforming the data into a higher-dimensional space (e.g., an infinite-dimensional one) has a tendency to separate the data and increase the margin
This is the concept behind SVMs and, later, kernel methods
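A minimal sketch of that idea using scikit-learn (our library choice, not the slides'): an RBF kernel implicitly maps the data into an infinite-dimensional space, where a wide-margin linear separator may exist even when none does in the original space.

```python
import numpy as np
from sklearn.svm import SVC

# A 1-D problem no single threshold can separate: class 1 sits inside class 0.
X = np.array([[-3.0], [-2.5], [2.5], [3.0], [-0.5], [0.0], [0.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1])

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)  # implicit feature map
print(clf.predict([[-2.8], [0.2], [2.8]]))            # -> [0 1 0]
```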
More Recent Development
Think about fitting (noisy) linear data with models of increasing order: linear, quadratic, cubic, etc.
The higher-order fits have no ability to extrapolate
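A minimal sketch of that failure: fit truly linear data with a cubic, then evaluate outside the training range (the noise is randomly generated, so exact numbers vary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 20)
y = 2 * x + 1 + rng.normal(0, 0.5, x.size)  # data that is truly linear

lin = np.polyfit(x, y, 1)   # degree-1 fit
cub = np.polyfit(x, y, 3)   # degree-3 fit

x_out = 10.0                # well outside the training range [0, 5]
print(np.polyval(lin, x_out))  # close to the true value 21
print(np.polyval(cub, x_out))  # can drift far from 21
```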