
Pattern Recognition, Artificial Neural Networks, and Machine Learning

Yuan-Fang Wang
Department of Computer Science
University of California
Santa Barbara, CA 93106, USA

“Pattern Recognition”

What is a Pattern?

• DNA patterns (e.g., AGCTCGAT)
• Protein patterns (the 20 amino acids)
• Faces
• Fingerprints
Other Patterns
• Insurance and credit card applications
  • applicants are characterized by a pattern: # of accidents, make of car, model year, income, # of dependents, creditworthiness, mortgage amount
• Dating services
  • age, hobbies, income, etc. establish your “desirability”
Other Patterns
• Web documents
  • keyword-based descriptions (e.g., documents containing war, Baghdad, and Hussein are different from those containing football, NFL, AFL, draft, and quarterbacks)
• Intrusion detection
  • usage and connection patterns
• Cancer detection
  • image features of tumors, patient age, treatment options, etc.
Other Patterns
• Housing market
  • location, size, year, school district
• University rankings
  • student population, student-faculty ratio, scholarship opportunities, location, faculty research grants, etc.
• Too many to enumerate
  • e.g., http://www.ics.uci.edu/~mlearn/MLSummary.html
What is a pattern?
• A pattern is a set of objects, processes, or events that consists of both deterministic and stochastic components
• A pattern is a record of certain dynamic processes influenced by both deterministic and stochastic factors
What is a Pattern? (cont.)
A spectrum between two extremes:
• completely regular, deterministic (e.g., crystal structure)
• completely random (e.g., white noise)
Constellation patterns, texture patterns, EKG patterns, etc. fall in between.
What is Pattern Recognition?
• Classifying “patterns” into “classes”
• Patterns ($x$)
  • have “measurements”, “traits”, or “features”
• Classes ($\omega_i$)
  • likelihood (a priori probability $P(\omega_i)$)
  • class-conditional density $p(x \mid \omega_i)$
• Classifier ($f(x) \rightarrow \omega_i$)
• An example
  • four coin classes: penny, nickel, dime, and quarter
  • measurements: weight, color, size, etc.
  • assign a coin to a class based on its size, weight, etc.
We use $P$ to denote a probability mass function (discrete) and $p$ to denote a probability density function (continuous).
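As a concrete sketch of this setup (all coin statistics, priors, and the choice of weight as the single feature are assumed for illustration, not taken from the slides):

```python
import numpy as np
from scipy.stats import norm

# Assumed (illustrative) statistics: mean/std of coin weight in grams,
# plus priors P(w_i); none of these numbers come from the slides.
classes = {
    "penny":   {"mu": 2.50, "sigma": 0.10, "prior": 0.40},
    "nickel":  {"mu": 5.00, "sigma": 0.10, "prior": 0.20},
    "dime":    {"mu": 2.27, "sigma": 0.10, "prior": 0.30},
    "quarter": {"mu": 5.67, "sigma": 0.10, "prior": 0.10},
}

def classify(x):
    """Assign weight x to the class maximizing P(w_i|x), which is
    proportional to p(x|w_i) * P(w_i)."""
    scores = {c: norm.pdf(x, p["mu"], p["sigma"]) * p["prior"]
              for c, p in classes.items()}
    return max(scores, key=scores.get)

print(classify(2.4))  # penny vs. dime is close; likelihoods and priors both matter
```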
An Example
Many visual inspection systems are like this: circuit board inspection, fruit grading, OCR, etc.
[Figure]

Another Example
[Figure]
Features
• The intrinsic traits or characteristics that tell one pattern (object) apart from another
• Feature extraction and representation allow
  • focusing on the relevant, distinguishing parts of a pattern
  • data reduction and abstraction
Detection vs. Description
• Detection: something happened
  • heard noise
  • saw something interesting
  • non-flat signals
• Description: what has happened?
  • gunshot, talking, laughing, crying, etc.
  • lines, corners, textures
  • mouse, cat, dog, bike, etc.
Feature Selection
• More an art than a science
• Effectiveness criteria, illustrated with the feature distributions of two populations:
  • Size alone is not effective
  • Perimeter is not effective; discrimination is accomplished by compactness alone
  • Elongatedness and compactness are correlated, so only one of them is needed

[Figures: decision boundaries that are too simple vs. too complicated; the goal is an optimal tradeoff between performance and generalization]
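To make these shape features concrete, here is a minimal sketch that computes size, perimeter, and compactness from a binary object mask. It assumes the common compactness definition $P^2/(4\pi A)$ and a crude pixel-count perimeter; the slides give no formulas:

```python
import numpy as np

def shape_features(mask):
    """Simple shape features from a boolean object mask.
    Compactness here is perimeter^2 / (4*pi*area), a common definition
    (assumed, not from the slides): about 1 for a disk, larger for
    elongated or ragged shapes."""
    area = int(mask.sum())
    # Crude perimeter estimate: count object pixels that have at least
    # one 4-neighbor outside the object.
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = int((mask & ~interior).sum())
    return {"size": area, "perimeter": perimeter,
            "compactness": perimeter**2 / (4 * np.pi * area)}

yy, xx = np.mgrid[-25:26, -25:26]
disk = xx**2 + yy**2 <= 20**2      # boolean mask of a disk
print(shape_features(disk))        # compactness close to 1
```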
Importance of Features
• Cannot be overstated
• We usually don’t know which features to select, what they represent, or how to tune them (face and gait recognition, tumor detection, etc.)
• Classification and regression schemes mostly try to make the best of whatever features are available
Features
• One feature alone is usually not descriptive enough (no silver bullet)
• Many features (the shotgun approach) can actually hurt
• Many problems:
  • relevance
  • dimensionality
  • co-dependency
  • time- and space-varying characteristics
  • accuracy
  • uncertainty and error
  • missing values
Feature Selection (cont.)
• Q: How do we decide whether a feature is effective?
• A: Through a training phase
• Training on typical samples and typical features to discover
  • whether the features are effective
  • whether there is any redundancy
  • the typical cluster shape (e.g., Gaussian)
  • decision boundaries between samples
  • cluster centers of particular samples
  • etc.
Classifiers
Decide $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$, where the discriminant $g_i$ can be
$$g_i(x) = P(\omega_i) \quad \text{(if no measurements are made)}$$
$$g_i(x) = P(\omega_i \mid x) \quad \text{(minimizes the misclassification rate)}$$
$$g_i(x) = -R(\alpha_i \mid x) \quad \text{(minimizes the associated risk; note the sign, since risk is to be minimized)}$$

[Block diagram: the measurement $x$ feeds discriminants $g_1(x), g_2(x), \ldots, g_n(x)$; a max unit makes the decision]
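An illustrative sketch of this discriminant-function view (toy numbers assumed, not from the slides): the decision machinery is a fixed argmax, and only the choice of $g_i$ changes:

```python
import numpy as np

def decide(x, discriminants):
    """Generic max-discriminant classifier: pick argmax_i g_i(x)."""
    return int(np.argmax([g(x) for g in discriminants]))

# Toy two-class setup (all numbers assumed for illustration).
priors = [0.7, 0.3]

def posterior(i, x):
    """Stand-in posterior P(w_i|x) from unnormalized Gaussian likelihoods."""
    like = [np.exp(-(x - 0.0) ** 2), np.exp(-(x - 2.0) ** 2)]
    joint = [l * p for l, p in zip(like, priors)]
    return joint[i] / sum(joint)

g_prior = [lambda x, i=i: priors[i] for i in range(2)]        # no measurement
g_post = [lambda x, i=i: posterior(i, x) for i in range(2)]   # min error rate
lam = [[0.0, 5.0],   # assumed losses: lam[i][j] = cost of action i in state j
       [1.0, 0.0]]
g_risk = [lambda x, i=i: -sum(lam[i][j] * posterior(j, x) for j in range(2))
          for i in range(2)]                                  # min risk

x = 1.2
print(decide(x, g_prior), decide(x, g_post), decide(x, g_risk))  # 0 0 1
```

With these numbers the risk-based discriminant flips the decision: class 0 is slightly more probable at $x = 1.2$, but acting as if it were class 0 is expensive when the truth is class 1.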
Traditional Pattern Recognition
• Parametric methods
  • assume class samples exhibit a certain parametric distribution (e.g., Gaussian)
  • learn the parameters through training
• Density methods
  • do not enforce a parametric form
  • learn the density function directly
• Decision boundary methods
  • learn the separation in the feature space
Parametric Methods

[Four panels, I to IV: distributions of size for different populations, each modeled by a parametric density]

For example, the isotropic Gaussian in $n$ dimensions:
$$p(x) = \frac{1}{(2\pi)^{n/2}\,\sigma^{n}}\; e^{-\frac{|x-\bar{x}|^{2}}{2\sigma^{2}}}$$
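For the Gaussian case, “learning the parameters through training” is just estimating a mean, covariance, and prior per class. A minimal sketch with assumed toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy training samples (assumed): two classes in a 2-D feature space.
samples = {
    0: rng.normal([0, 0], 1.0, size=(200, 2)),
    1: rng.normal([3, 3], 1.5, size=(200, 2)),
}

# "Training" a parametric (Gaussian) model = estimating mu_i, Sigma_i, P(w_i).
n_total = sum(len(s) for s in samples.values())
params = {}
for c, s in samples.items():
    params[c] = {
        "mu": s.mean(axis=0),
        "cov": np.cov(s, rowvar=False),
        "prior": len(s) / n_total,
    }
print(params[0]["mu"], params[1]["mu"])  # near [0, 0] and [3, 3]
```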
Density Methods
[Figure: a class-conditional density estimated directly from samples, without a parametric form]
Feature space
• $d$-dimensional ($d$ being the number of features), with axes $f_1, f_2, \ldots, f_d$
• populated with features from training samples
Decision Boundary Methods
[Figures in the feature space $f_1, f_2, \ldots, f_d$: where should the boundary go?]
• Decision surfaces
• Cluster centers
“Modern” vs “Traditional” Pattern Recognition
Traditional:
• Hand-crafted features
• Simple and low-level: a concatenation of numbers or traits
• Syntactic
• Feature detection and description are separate tasks from classifier design
Modern:
• Automatically learned features
• Hierarchical and complex
• Semantic
• Feature detection and description are jointly optimized with the classifier
Traditional Features
[Figure examples]

Modern Features
[Figure examples, spanning several slides]

“Modern” vs “Traditional” Pattern Recognition
[Summary figure]
Mathematical Foundation
• No matter what methods or techniques you use, the underlying mathematical principle is quite simple
• Bayesian theory is the foundation
Review: Bayes Rule
• Forward (synthesis) route: from class to sample in a class
  • grammar rules to sentences
  • Markov chain (or HMM) to pronunciation
  • texture rules (primitive + repetition) to textures
• Backward (analysis) route: from sample to class ID
  • a sentence parsed by a grammar
  • an utterance is “congratulations” (not “constitution”)
  • brick wall vs. plaid shirt
Review: Bayes Rule
• Backward is always harder
  • because the interpretation is not unique
  • the presence of $x$ has multiple possible explanations
[Figure: classes $\omega_1, \omega_2, \omega_3, \ldots, \omega_n$ can all generate the same observation $x$]
The simplest example
• Two classes: pennies and dimes
• No measurements
• Classification: based on the a priori probabilities alone
$$\text{decide } \omega_1 \text{ if } P(\omega_1) > P(\omega_2); \qquad \omega_2 \text{ if } P(\omega_1) < P(\omega_2); \qquad \omega_1 \text{ or } \omega_2 \text{ otherwise}$$
• Error rate: $\min\bigl(P(\omega_1), P(\omega_2)\bigr)$
A slightly more complicated example
• Two classes: pennies and dimes
• A measurement $x$ is made (e.g., weight)
• Classification: based on the a posteriori probabilities from Bayes rule
$$\text{decide } \omega_1 \text{ if } P(\omega_1 \mid x) > P(\omega_2 \mid x); \qquad \omega_2 \text{ if } P(\omega_1 \mid x) < P(\omega_2 \mid x); \qquad \omega_1 \text{ or } \omega_2 \text{ otherwise}$$
where
$$P(\omega_i \mid x) = \frac{p(x, \omega_i)}{p(x)} = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$$
[Three plots over weight $x$: (1) the class-conditional densities $p(x \mid \omega_1)$ and $p(x \mid \omega_2)$; (2) the densities scaled by unequal priors, $p(x \mid \omega_1)P(\omega_1)$ and $p(x \mid \omega_2)P(\omega_2)$; (3) the posteriors $p(\omega_1 \mid x)$ and $p(\omega_2 \mid x)$, i.e., the scaled densities divided by $p(x)$]
Why Both?
Why do we need both $p(x \mid \omega_i)$ and $P(\omega_i)$?
• In the daytime, when some animal runs in front of you on the bike path, you know exactly what it is ($p(x \mid \omega)$ is sufficient)
• At night, when some animal runs in front of you on the bike path, you can hardly distinguish the shape ($p(x \mid \omega)$ is low for all cases), but you know it is probably a squirrel, not a lion, because of $P(\omega)$
Essence
• Turn one backward (analysis) problem into several forward (synthesis) problems
  • i.e., analysis-by-synthesis
• Whichever model has the highest likelihood of synthesizing the observed outcome wins
• The formula is not mathematically provable
Error rate
• Determined by
  • the likelihood of a class
  • the likelihood of measuring $x$ in a class
$$\min\bigl(P(\omega_1 \mid x),\, P(\omega_2 \mid x)\bigr) \quad \text{or} \quad \frac{1}{p(x)}\,\min\bigl(p(x \mid \omega_1)P(\omega_1),\; p(x \mid \omega_2)P(\omega_2)\bigr)$$
Error Rate (cont.)
• The Bayes decision rule minimizes the average error rate:
$$\text{error} = \int p(\text{error} \mid x)\, p(x)\, dx$$
$$p(\text{error} \mid x) = \sum_{\omega_i \neq \omega^{*}(x)} p(\omega_i \mid x) = 1 - p(\omega^{*}(x) \mid x)$$
where
$$\omega^{*}(x) = \arg\max_{i}\, p(\omega_i \mid x)$$
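A quick numerical check of this integral on an assumed toy 1-D two-class problem: for two classes, the average error of the Bayes rule is the area under $\min\bigl(p(x \mid \omega_1)P(\omega_1),\, p(x \mid \omega_2)P(\omega_2)\bigr)$:

```python
import numpy as np
from scipy.stats import norm

# Assumed toy problem: two Gaussian class-conditionals and their priors.
P1, P2 = 0.6, 0.4
x = np.linspace(-6.0, 10.0, 4001)
j1 = norm.pdf(x, 0.0, 1.0) * P1   # p(x|w1) P(w1)
j2 = norm.pdf(x, 3.0, 1.0) * P2   # p(x|w2) P(w2)

# error = integral of p(error|x) p(x) dx = integral of min(j1, j2) dx
dx = x[1] - x[0]
bayes_error = np.minimum(j1, j2).sum() * dx
print(f"Bayes error rate ~ {bayes_error:.4f}")
```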
Various types of errors
[Figures]
Precision vs. Recall
• A very common measure used in the PR and MI communities
• As one goes up, the other HAS to go down
• A range of options (receiver operating characteristic curves)
• Area under the curve as a goodness measure
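A small sketch of the tradeoff (scikit-learn metrics; the labels and scores are made up): sweeping the decision threshold pushes precision up while recall falls:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Made-up ground truth and classifier scores for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.6, 0.3])

# Each threshold is one operating point on the precision/recall tradeoff.
for t in (0.3, 0.5, 0.7):
    y_pred = (scores >= t).astype(int)
    print(f"t={t}: precision={precision_score(y_true, y_pred):.2f}, "
          f"recall={recall_score(y_true, y_pred):.2f}")

# Area under the ROC curve summarizes the whole range of thresholds.
print("AUC:", roc_auc_score(y_true, scores))
```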
Various ways to measure error rates
• Training error
• Test error
• Empirical error
• Some are under your control (training and test); some are not (empirical error)
• How: n-fold validation
• Why: overfitting and underfitting problems
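A minimal n-fold cross-validation sketch (scikit-learn; the data set and classifier are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data set; any (X, y) with tagged samples works here.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# n-fold validation: train on n-1 folds, test on the held-out fold,
# rotate, and average. This estimates the error on unseen data.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("per-fold accuracy:", np.round(scores, 3))
print("mean +/- std: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```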
An even more complicated example
• Two classes: pennies and dimes
• A measurement $x$ is made
• A risk is associated with making a wrong decision
• Decide based on the a posteriori probabilities weighted into a Bayesian risk:
$$R(\alpha_1 \mid x) = \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x)$$
$$R(\alpha_2 \mid x) = \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)$$
where $\lambda_{ij}$ is the loss of taking action $\alpha_i$ in state $\omega_j$, and $R(\alpha_i \mid x)$ is the conditional risk of action $\alpha_i$ given $x$.
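A sketch of the risk-minimizing decision with an assumed loss matrix and assumed posteriors; note how the risk can overrule the more probable class:

```python
import numpy as np

def min_risk_action(posteriors, loss):
    """Conditional risk R(a_i|x) = sum_j loss[i][j] * P(w_j|x);
    choose the action with the smallest risk."""
    risks = loss @ posteriors
    return int(np.argmin(risks)), risks

# Assumed loss matrix: loss[i][j] = cost of action a_i in true state w_j.
# Using a dime as a penny (cheap mistake) costs less than the reverse.
loss = np.array([[0.0, 1.0],    # a_1: use the coin as a penny
                 [9.0, 0.0]])   # a_2: use the coin as a dime
posteriors = np.array([0.3, 0.7])   # P(penny|x), P(dime|x), assumed

action, risks = min_risk_action(posteriors, loss)
print(risks, "-> choose action", action + 1)
# R(a_1|x)=0.7, R(a_2|x)=2.7: even though "dime" is more probable,
# the cheap action wins because mistaking a penny for a dime is costly.
```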
[Diagrams: the true state ($\omega_1$ or $\omega_2$) produces an observation, which the math maps to Decision 1 or Decision 2; errors at this stage are misclassification. People then act on those decisions; errors at this stage are misinterpretation, a human-factors issue. Either way, incorrect decisions incur domain-specific cost.]
An even more complicated example
$$R(\text{used as pennies} \mid x) = r(\text{pennies used as pennies})\,P(\text{pennies} \mid x) + r(\text{dimes used as pennies})\,P(\text{dimes} \mid x)$$
$$R(\text{used as dimes} \mid x) = r(\text{pennies used as dimes})\,P(\text{pennies} \mid x) + r(\text{dimes used as dimes})\,P(\text{dimes} \mid x)$$
where $P(\text{pennies} \mid x) \propto p(x \mid \text{pennies})\,P(\text{pennies})$ and $P(\text{dimes} \mid x) \propto p(x \mid \text{dimes})\,P(\text{dimes})$.
A more credible example
$$R(\text{call FD} \mid \text{smoke}) = r(\text{call, fire})\,P(\text{fire} \mid \text{smoke}) + r(\text{call, no fire})\,P(\text{no fire} \mid \text{smoke})$$
(the second term is the false-positive case)
$$R(\text{no call FD} \mid \text{smoke}) = r(\text{no call, no fire})\,P(\text{no fire} \mid \text{smoke}) + r(\text{no call, fire})\,P(\text{fire} \mid \text{smoke})$$
(the second term is the false-negative case)
• The risk associated with a false negative is much higher than that of a false positive
A more credible example
$$R(\text{attack} \mid \text{battlefield intelligence}) = r(\text{attack}, {<}50\%)\,P({<}50\% \mid \text{intelligence}) + r(\text{attack}, {>}50\%)\,P({>}50\% \mid \text{intelligence})$$
(false positive)
$$R(\text{no attack} \mid \text{battlefield intelligence}) = r(\text{no attack}, {>}50\%)\,P({>}50\% \mid \text{intelligence}) + r(\text{no attack}, {<}50\%)\,P({<}50\% \mid \text{intelligence})$$
(false negative)
Bayesian Risk
• Determined by
  • the likelihood of a class
  • the likelihood of measuring $x$ in a class
  • the risk of taking a wrong action
• Classification: the Bayesian risk should be minimized
$$\min\bigl(R(\alpha_1 \mid x),\, R(\alpha_2 \mid x)\bigr), \quad \text{or}$$
$$\min\bigl(\lambda_{11}P(\omega_1 \mid x) + \lambda_{12}P(\omega_2 \mid x),\;\; \lambda_{21}P(\omega_1 \mid x) + \lambda_{22}P(\omega_2 \mid x)\bigr), \quad \text{or}$$
$$R(\alpha_1 \mid x) < R(\alpha_2 \mid x) \;\Leftrightarrow\; (\lambda_{21} - \lambda_{11})\,P(\omega_1 \mid x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2 \mid x) \;\Rightarrow\; \text{decide } \omega_1$$
Bayesian Risk (cont.)
• Again, decisions depend on
  • the likelihood of a class
  • the likelihood of observing $x$ in a class
  • modified by some positive risk factors
• Why? Because in the real world it might not be the misclassification rate that matters, but the action you take based on the classification:
$$(\lambda_{21} - \lambda_{11})\,P(\omega_1 \mid x) \;\gtrless\; (\lambda_{12} - \lambda_{22})\,P(\omega_2 \mid x)$$
Other generalizations
• Multiple classes
  • $n$ classes, with $\sum_{i=1}^{n} P(\omega_i) = 1$
• Multiple measurements
  • $x$ is a vector instead of a scalar
• Non-numeric measurements
• Actions vs. decisions
• Correlated vs. independent events
  • speech signals and images
• Training allowed or not
• Time-varying behaviors
Difficulties
• What features to use
• How many features (the curse of dimensionality)
• The a priori probability $P(\omega_i)$
• The class-conditional density $p(x \mid \omega_i)$
• The a posteriori probability $P(\omega_i \mid x)$
Typical Approaches
• Supervised (with tagged samples $x$): learn
  • parameters of a probability function (e.g., a Gaussian: $p(x \mid \omega_i) = N(\mu_i, \Sigma_i)$)
  • density functions (without assuming any parametric form)
  • decision boundaries (when classes are indeed separable)
• Unsupervised (without tagged samples $x$):
  • minimum distance (see the sketch below)
  • hierarchical clustering
• Reinforced (with hints):
  • told right or wrong, but not the correct answer
  • learning with a critic (not a teacher, as in supervised learning)
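For the unsupervised side, a minimal sketch (toy data assumed) of minimum-distance assignment to cluster centers, the simplest of the listed approaches:

```python
import numpy as np

rng = np.random.default_rng(2)
# Untagged toy samples drawn around two unknown centers.
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])

# Minimum-distance clustering (k-means style): assign each sample to the
# nearest center, re-estimate the centers, and repeat.
centers = X[rng.choice(len(X), 2, replace=False)]
for _ in range(10):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(centers)  # near (0, 0) and (3, 3)
```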
Pattern Recognition
• Uncorrelated events
  • Supervised learning
    • parametric methods
    • density methods
    • decision boundary methods
  • Unsupervised clustering
    • minimum distance
    • hierarchical clustering
• Correlated events
  • Hidden Markov Models
Applications
• DNA sequences
• Lie detectors
• Handwritten digit recognition
• Classification based on smell
• Web document classification and search engines
• Defect detection
• Texture classification
• Image database retrieval
• Face recognition
• etc.
Other formulations
• We have talked about one third of the scenarios: classification (discrete)
• Regression: continuous
  • extrapolation and interpolation
• Clustering
  • similarity
• Abnormality detection
• Concept drift (discovery), etc.
Classification vs. Regression
• Classification
  • large vs. small only hints at the category
  • absolute values do not matter as much (and can actually hurt)
  • normalization is often necessary
  • the fitting order stays low
• Regression
  • large means large, small means small
  • absolute values matter
  • fitting orders matter
Recent Development
• Data can be “massaged”. Surprisingly, massaging the data and using simple classifiers works better than massaging the classifiers and using simple data (for simple problems and small data sets)
• A hard-to-visualize concept: transforming data into a higher-dimensional (e.g., infinite-dimensional) space has a tendency to separate the data and increase the error margin
• This is the concept behind SVMs and, later, kernel methods
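A small illustration of this idea (scikit-learn; the toy data set is assumed): the RBF kernel implicitly maps the data into a much higher-dimensional space where a linear separator exists, while a linear SVM on the raw 2-D data cannot separate the rings:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data that is not linearly separable in its original 2-D space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.08, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)   # implicit high-dimensional mapping

print("linear kernel accuracy:", linear.score(X, y))  # near chance
print("RBF kernel accuracy:   ", rbf.score(X, y))     # near 1.0
```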
More Recent Development
• Think about fitting linear data with a model
  • linear, quadratic, cubic, etc.
  • the higher the order, the better the fit
  • $n$ data points can be fit perfectly by a polynomial of order $n-1$
• However
  • overfitting is likely
  • there is no ability to extrapolate
  • (see the sketch after this list)
• “Massage” the classifiers (using deep networks)
  • feature detection and description
  • classification
  • joint optimization
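A quick sketch of the overfitting point above (NumPy; the noisy linear data is assumed): a degree-$(n-1)$ polynomial hits every training point exactly but extrapolates wildly, while a linear fit generalizes:

```python
import numpy as np

rng = np.random.default_rng(1)
# Noisy but genuinely linear data (assumed for illustration).
x = np.linspace(0, 1, 8)
y = 2 * x + 0.5 + rng.normal(0, 0.05, x.size)

lo = np.polyfit(x, y, 1)           # order matched to the data
hi = np.polyfit(x, y, x.size - 1)  # order n-1: interpolates all n points

x_new = 1.5                        # extrapolation point outside [0, 1]
print("truth  :", 2 * x_new + 0.5)
print("linear :", np.polyval(lo, x_new))  # close to the truth
print("deg n-1:", np.polyval(hi, x_new))  # typically far off
```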
