Spoken Dialog Systems and Voice XML
Intro to Pattern Recognition
Esther Levin
Dept of Computer Science
CCNY
Credits and Acknowledgments
Materials used in this course were taken from the textbook “Pattern Classification” by Duda et al., John Wiley & Sons, 2001, with the permission of the authors and the publisher, and also from other material on the web:
Dr. A. Aydin Atalan, Middle East Technical University, Turkey
Dr. Djamel Bouchaffra, Oakland University
Dr. Adam Krzyzak, Concordia University
Dr. Joseph Picone, Mississippi State University
Dr. Robi Polikar, Rowan University
Dr. Stefan A. Robila, University of New Orleans
Dr. Sargur N. Srihari, State University of New York at Buffalo
David G. Stork, Stanford University
Dr. Godfried Toussaint, McGill University
Dr. Chris Wyatt, Virginia Tech
Dr. Alan L. Yuille, University of California, Los Angeles
Dr. Song-Chun Zhu, University of California, Los Angeles
Outline
Introduction
What is pattern recognition?
Background Material
Probability theory
PATTERN RECOGNITION AREAS
Optical Character Recognition (OCR)
Sorting letters by postal code.
Reconstructing text from printed materials (such as reading machines for blind
people).
Analysis and identification of human patterns
Speech and voice recognition.
Fingerprints and DNA mapping.
Banking and insurance applications
Credit card applicants classified by income, creditworthiness, mortgage amount, # of dependents, etc.
Car insurance (pattern including make of car, # of accidents, age, sex, driving habits, location, etc.).
Diagnosis systems
Medical diagnosis (disease vs. symptoms classification, X-ray, EKG and test analysis, etc.).
Diagnosis of automotive malfunctioning
Prediction systems
Weather forecasting (based on satellite data).
Analysis of seismic patterns
Dating services (where pattern includes age, sex, race, hobbies, income, etc).
More Pattern Recognition Applications
SENSORY:
Vision: face / handwriting / hand recognition
Speech: speaker / speech recognition
Olfaction: is the apple ripe?
DATA:
Text categorization
Information retrieval
Data mining
Genome sequence matching
What is a pattern?
“A pattern is the opposite of a chaos; it is an entity
vaguely defined, that could be given a name.”
PR Definitions
Example patterns: characters from many scripts, and handwriting:
A v t u I h D U w K
Ç ş ğ İ üÜ Ö Ğ
چك٤٧ع
К Ц Д
ζω Ψ Ω ξ θ
ם א ש ת ד נ
Terminology
Length
Lightness
Width
Number and shape of fins
Position of the mouth, etc…
This is the set of all suggested features to explore for use in our
classifier!
Solution by Stages
Preprocess raw data from camera
Segment isolated fish
Extract features from each fish (length, width, brightness, etc.)
Classify each fish
Preprocessing
Use a segmentation operation to isolate individual fish from one another and from the background.
Information from a single fish is sent to a feature extractor, whose purpose is to reduce the data by measuring certain features.
The features are passed to a classifier (a skeleton of these stages appears in the sketch below).
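A minimal Python sketch of these stages, assuming hypothetical image data and a made-up lightness threshold (none of the names or values here come from the lecture):

```python
import numpy as np

# Hypothetical pipeline skeleton for the fish-sorting example.
# Each stage mirrors the steps above: preprocess, segment, extract features, classify.

def preprocess(image):
    """Reduce noise in the raw camera image (here: simple normalization)."""
    return (image - image.mean()) / (image.std() + 1e-8)

def segment(image, threshold=0.5):
    """Isolate fish pixels from the background with a crude intensity threshold."""
    return image > threshold

def extract_features(mask, image):
    """Measure a few features of one segmented fish: length, width, lightness."""
    rows, cols = np.nonzero(mask)
    length = cols.max() - cols.min() + 1 if cols.size else 0
    width = rows.max() - rows.min() + 1 if rows.size else 0
    lightness = image[mask].mean() if mask.any() else 0.0
    return np.array([length, width, lightness])

def classify(features, lightness_threshold=0.6):
    """Toy classifier: call the fish 'salmon' if it is light enough."""
    return "salmon" if features[2] > lightness_threshold else "sea bass"

# Usage on a fake 2-D "camera image"
image = np.random.rand(64, 64)
mask = segment(preprocess(image))
print(classify(extract_features(mask, image)))
```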
Classification
The length is a poor feature alone!
“Customers do not want sea
bass in their cans of salmon”
Threshold decision boundary and cost relationship
Move our decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!)
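A minimal sketch of how a cost-sensitive lightness threshold could be chosen; the sample distributions and cost values are made-up assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical lightness measurements: salmon tend to be darker than sea bass.
salmon_lightness = rng.normal(loc=3.0, scale=1.0, size=500)
sea_bass_lightness = rng.normal(loc=6.0, scale=1.0, size=500)

# Asymmetric costs: a sea bass sold as salmon is worse than the reverse.
COST_BASS_AS_SALMON = 2.0
COST_SALMON_AS_BASS = 1.0

def total_cost(threshold):
    # Decision rule: lightness below the threshold -> "salmon", above -> "sea bass".
    bass_as_salmon = np.sum(sea_bass_lightness < threshold)
    salmon_as_bass = np.sum(salmon_lightness >= threshold)
    return COST_BASS_AS_SALMON * bass_as_salmon + COST_SALMON_AS_BASS * salmon_as_bass

thresholds = np.linspace(0, 9, 200)
best = thresholds[np.argmin([total_cost(t) for t in thresholds])]
print(f"cost-minimizing lightness threshold: {best:.2f}")
```

Increasing the cost of classifying a sea bass as salmon pushes the cost-minimizing threshold toward smaller lightness values, which is the effect described above.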
Adopt the lightness and add the width of the fish.
[Scatter plot: lightness vs. width]
We might add other features that are not correlated with the ones we already have. Care should be taken not to reduce performance by adding such “noisy features” (see the feature-screening sketch below).
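One way to screen candidate features before adding them is to score each by class separation and to check correlations with the features already in use; a sketch under assumed synthetic data (the Fisher-style score and the feature values are illustrative assumptions, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature matrix: two informative columns (lightness, width)
# plus one pure-noise column, with class labels y (0 = salmon, 1 = sea bass).
n = 300
y = rng.integers(0, 2, size=n)
lightness = np.where(y == 0, 3.0, 6.0) + rng.normal(0, 1, n)
width = np.where(y == 0, 10.0, 14.0) + rng.normal(0, 2, n)
noise = rng.normal(0, 1, n)                 # carries no class information
X = np.column_stack([lightness, width, noise])

def fisher_score(feature, labels):
    """Class-separation score: squared mean gap over pooled variance."""
    a, b = feature[labels == 0], feature[labels == 1]
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var() + 1e-12)

for name, col in zip(["lightness", "width", "noise"], X.T):
    print(f"{name:9s}  Fisher score = {fisher_score(col, y):.3f}")

# Correlation between candidate features: a new feature that is highly
# correlated with an existing one adds little new information.
print(np.corrcoef(X, rowvar=False).round(2))
```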
However, our satisfaction is
premature because the central aim
of designing a classifier is to
correctly classify novel input
Issue of generalization!
Decision Boundaries
Observe: Can do much better with two features
Caveat: overfitting!
Occam’s Razor
William of Occam
(1284-1347)
A Complete PR System
Problem Formulation
[Block diagram: input object → measurements & preprocessing → features → classification → class label]
Basic ingredients:
• Measurement space (e.g., image intensity, pressure)
• Features (e.g., corners, spectral energy)
• Classifier (soft and hard)
• Decision boundary
• Training sample
• Probability of error
Pattern Recognition Systems
Sensing
Use of a transducer (camera or microphone)
The PR system depends on the bandwidth, resolution, sensitivity, and distortion of the transducer.
Segmentation and grouping
Patterns should be well separated and
should not overlap
Feature extraction
Discriminative features
Invariant features with respect to translation, rotation and
scale.
Classification
Use a feature vector provided by a feature extractor to
assign the object to a category
Post Processing
Exploit context dependent information other than from the
target pattern itself to improve performance
The Design Cycle
Data collection
Feature Choice
Model Choice
Training
Evaluation
Computational Complexity
Learning and Adaptation
Learning: Any method that combines empirical information from
the environment with prior knowledge into the design of a
classifier, attempting to improve performance with time.
Empirical information: Usually in the form of training examples.
Prior knowledge: Invariances, correlations
Supervised learning
A teacher provides a category label or cost for each pattern in the
training set
Unsupervised learning
The system forms clusters or “natural groupings” of the input patterns
Syntactic Versus Statistical PR
Basic assumption: There is an underlying regularity
behind the observed phenomena.
Question: Based on noisy observations, what is the
underlying regularity?
Syntactic: Structure through a common generative mechanism. For example, all different manifestations of English share a common underlying set of grammatical rules.
Statistical: Objects characterized through statistical similarity. For example, all possible digits “2” share some common underlying statistical relationship.
Difficulties
Segmentation
Context
Temporal structure
Missing features
Aberrant data
Noise
How do we train…?
How do we combine prior knowledge with
empirical data?
How do we evaluate our performance?
Validate the results. Confidence in decision?
Conclusion
I expect you are overwhelmed by the number,
complexity and magnitude of the sub-problems
of Pattern Recognition
Toolkit for PR
Statistics
Decision Theory
Optimization
Signal Processing
Neural Networks
Fuzzy Logic
Decision Trees
Clustering
Genetic Algorithms
AI Search
Formal Grammars
….
Linear algebra
Matrix: $A = [a_{ij}]_{m \times n} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$
Matrix Transpose
$B = [b_{ij}]_{n \times m} = A^T \iff b_{ij} = a_{ji}, \quad 1 \le i \le n, \; 1 \le j \le m$
Vector a
$\mathbf{a} = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix}, \qquad \mathbf{a}^T = [a_1, \ldots, a_n]$
Matrix and vector multiplication
Matrix multiplication
$A = [a_{ij}]_{m \times p}, \; B = [b_{ij}]_{p \times n}$
$AB = C = [c_{ij}]_{m \times n}$, where $c_{ij} = \mathrm{row}_i(A) \cdot \mathrm{col}_j(B) = \sum_{k=1}^{p} a_{ik} b_{kj}$
$\|\mathbf{a}\|^2 = \mathbf{a}^T \mathbf{a} = \sum_{i=1}^{n} a_i^2$; $\mathbf{a}$ is normalized iff $\|\mathbf{a}\| = 1$
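A quick NumPy check of these definitions (transpose, the row-column form of the matrix product, and the vector norm); the matrices are arbitrary examples:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # 2 x 3
B = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 3.0]])               # 3 x 2

# Transpose: entry (i, j) of A.T equals entry (j, i) of A
assert np.allclose(A.T[2, 1], A[1, 2])

# Matrix product: c_ij is the dot product of row i of A with column j of B
C = A @ B
i, j = 1, 0
assert np.isclose(C[i, j], A[i, :] @ B[:, j])

# Vector norm: ||a||^2 = a^T a; a / ||a|| is normalized
a = np.array([3.0, 4.0])
assert np.isclose(a @ a, np.linalg.norm(a) ** 2)
print(C, np.linalg.norm(a / np.linalg.norm(a)))   # prints C and 1.0
```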
$\det(AB) = \det(A)\,\det(B)$
Trace: $A = [a_{ij}]_{n \times n}$; $\mathrm{tr}[A] = \sum_{j=1}^{n} a_{jj}$
Matrix Inversion
$A$ ($n \times n$) is nonsingular if there exists $B$ such that $AB = BA = I_n$; then $B = A^{-1}$ and $A A^{-1} = A^{-1} A = I$.
Eigenvectors and Eigenvalues
$A e_j = \lambda_j e_j, \quad j = 1, \ldots, n; \quad \|e_j\| = 1$
$\mathrm{tr}[A] = \sum_{j=1}^{n} \lambda_j, \qquad \det[A] = \prod_{j=1}^{n} \lambda_j$
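A small NumPy sketch verifying det(AB) = det(A)det(B) and the trace/determinant identities for eigenvalues on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))

# det(AB) = det(A) det(B)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))

# Eigenvalues of A (possibly complex for a non-symmetric A)
eigvals = np.linalg.eigvals(A)

# tr[A] = sum of eigenvalues, det[A] = product of eigenvalues
assert np.isclose(np.trace(A), eigvals.sum().real)
assert np.isclose(np.linalg.det(A), eigvals.prod().real)
print("trace:", np.trace(A), "det:", np.linalg.det(A))
```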
Probability Theory
Primary references:
Any probability and statistics textbook (e.g., Papoulis)
Appendix A.4 in “Pattern Classification” by Duda et al.
The table of joint probabilities is derived from the table of counts by dividing each entry by the total number of cookies under consideration, or 80 cookies.
Example 2: Power Plant Operation
The variables X, Y, Z describe the state of three power plants (X = 0 means plant X is idle). Denote by A the event that plant X is idle, and by B the event that at least 2 of the three plants are working. What are P(A) and P(A|B), the probability that X is idle given that at least 2 of the three are working?

X Y Z   P(x,y,z)
0 0 0   0.07
0 0 1   0.04
0 1 0   0.03
0 1 1   0.18
1 0 0   0.16
1 0 1   0.18
1 1 0   0.21
1 1 1   0.13
P(A) = P(0,0,0) + P(0,0,1) + P(0,1,0) + P(0,1,1) = 0.07 + 0.04 + 0.03 + 0.18 = 0.32
P(B) = P(0,1,1) + P(1,0,1) + P(1,1,0) + P(1,1,1) = 0.18 + 0.18 + 0.21 + 0.13 = 0.70
P(A and B) = P(0,1,1) = 0.18
P(A|B) = P(A and B) / P(B) = 0.18 / 0.70 ≈ 0.26
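The same computation in Python, reading the probabilities directly from the table above:

```python
# Joint probability table P(X, Y, Z) from the power-plant example.
P = {
    (0, 0, 0): 0.07, (0, 0, 1): 0.04, (0, 1, 0): 0.03, (0, 1, 1): 0.18,
    (1, 0, 0): 0.16, (1, 0, 1): 0.18, (1, 1, 0): 0.21, (1, 1, 1): 0.13,
}

# A: plant X is idle (x == 0); B: at least 2 of the 3 plants are working.
p_A = sum(p for (x, y, z), p in P.items() if x == 0)
p_B = sum(p for (x, y, z), p in P.items() if x + y + z >= 2)
p_A_and_B = sum(p for (x, y, z), p in P.items() if x == 0 and x + y + z >= 2)

print(f"P(A) = {p_A:.2f}")                      # 0.32
print(f"P(B) = {p_B:.2f}")                      # 0.70
print(f"P(A|B) = {p_A_and_B / p_B:.3f}")        # 0.18 / 0.70 ≈ 0.257
```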
z = x + y
Mean: $\bar{z} = \bar{x} + \bar{y}$
$\mathrm{Var}(z) = \mathrm{Var}(x) + \mathrm{Var}(y) + 2\,\mathrm{Cov}(x, y)$
If x, y are independent: $\mathrm{Var}(z) = \mathrm{Var}(x) + \mathrm{Var}(y)$
Distribution of z (x, y independent): $p(z) = (p_x * p_y)(z) = \int p_x(x)\, p_y(z - x)\, dx$
Examples:
1. x and y are uniform on [0,1]. Find p(z) for z = x + y, E(z), Var(z).
2. x is uniform on [-1,1], and P(y) = 0.5 for y = 0, P(y) = 0.5 for y = 10, and 0 elsewhere. Find p(z) for z = x + y, E(z), Var(z).
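A Monte Carlo sketch for the first example (z = x + y with x, y independent and uniform on [0,1]); the exact answers are E(z) = 1, Var(z) = 1/6, and a triangular density on [0, 2]:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

# Example 1: x, y independent and uniform on [0, 1]
x = rng.uniform(0.0, 1.0, N)
y = rng.uniform(0.0, 1.0, N)
z = x + y

print("E(z)  ~", z.mean())        # exact: 1
print("Var(z)~", z.var())         # exact: 1/12 + 1/12 = 1/6

# Histogram of z approximates the triangular density p(z) on [0, 2]
hist, edges = np.histogram(z, bins=20, range=(0.0, 2.0), density=True)
print(np.round(hist, 2))          # rises toward z = 1, then falls
```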
Normal Distributions
Gaussian distribution:
$p(x) = N(\mu_x, \sigma_x^2) = \frac{1}{\sqrt{2\pi}\,\sigma_x}\, e^{-(x - \mu_x)^2 / 2\sigma_x^2}$
Mean: $E(x) = \mu_x$
Variance: $E[(x - \mu_x)^2] = \sigma_x^2$
Mahalanobis distance: $r = \dfrac{|x - \mu_x|}{\sigma_x}$
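A short sampling check of these formulas with NumPy; the values of μ and σ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

# Draw samples from N(mu, sigma^2) and check the mean and variance
x = rng.normal(mu, sigma, size=200_000)
print("mean ~", x.mean(), " variance ~", x.var())   # ~2.0 and ~0.25

# One-dimensional Mahalanobis distance of a point from the distribution
point = 3.0
r = abs(point - mu) / sigma
print("r =", r)    # 2.0: the point lies two standard deviations from the mean
```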
Multivariate Normal Density
x is a vector of d Gaussian variables:
$p(\mathbf{x}) = N(\boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}$
$\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x}$
$\Sigma = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T p(\mathbf{x})\, d\mathbf{x}$
Mahalanobis distance: $r^2 = (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$
All conditionals and marginals are also Gaussian.
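A sketch of the multivariate density and the Mahalanobis distance in NumPy; the mean vector and covariance matrix are made-up examples:

```python
import numpy as np

# Hypothetical 2-D Gaussian parameters (mean vector and covariance matrix)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

def mahalanobis_sq(x, mu, Sigma):
    """r^2 = (x - mu)^T Sigma^{-1} (x - mu)"""
    d = x - mu
    return d @ np.linalg.solve(Sigma, d)

def gaussian_density(x, mu, Sigma):
    """Multivariate normal density, following the formula above."""
    d = len(mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * mahalanobis_sq(x, mu, Sigma)) / norm

x = np.array([2.0, 3.5])
print("r^2 =", mahalanobis_sq(x, mu, Sigma))
print("p(x) =", gaussian_density(x, mu, Sigma))
```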
Bivariate Normal Densities
Self-information: $I(x) = -\log_2 P(x)$
Mutual information: $I_{x,y} = H(x) - H(x|y) = \sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}$
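A sketch computing self-information and mutual information from a small joint distribution; the joint table here is a made-up example, not from the lecture:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary variables
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
p_x = p_xy.sum(axis=1)          # marginal p(x)
p_y = p_xy.sum(axis=0)          # marginal p(y)

# Self-information of each outcome of x: I(x) = -log2 P(x)
print("I(x):", -np.log2(p_x))

# Mutual information: sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]
I_xy = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))
print("I(x;y) =", I_xy, "bits")
```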