
Spoken Dialog Systems and VoiceXML:
Intro to Pattern Recognition

Esther Levin
Dept of Computer Science
CCNY
Credits and Acknowledgments
Materials used in this course were taken from the textbook "Pattern Classification" by Duda et al., John Wiley & Sons, 2001, with the permission of the authors and the publisher, and also from other material on the web:
 Dr. A. Aydin Atalan, Middle East Technical University, Turkey
 Dr. Djamel Bouchaffra, Oakland University
 Dr. Adam Krzyzak, Concordia University
 Dr. Joseph Picone, Mississippi State University
 Dr. Robi Polikar, Rowan University
 Dr. Stefan A. Robila, University of New Orleans
 Dr. Sargur N. Srihari, State University of New York at Buffalo
 David G. Stork, Stanford University
 Dr. Godfried Toussaint, McGill University
 Dr. Chris Wyatt, Virginia Tech
 Dr. Alan L. Yuille, University of California, Los Angeles
 Dr. Song-Chun Zhu, University of California, Los Angeles
Outline
Introduction
 What is pattern recognition?
Background Material
 Probability theory
PATTERN RECOGNITION AREAS
Optical Character Recognition (OCR)
 Sorting letters by postal code.
 Reconstructing text from printed materials (such as reading machines for blind
people).
Analysis and identification of human patterns
 Speech and voice recognition.
 Fingerprints and DNA mapping.
Banking and insurance applications
 Credit card applicants classified by income, creditworthiness, mortgage amount, # of dependents, etc.
 Car insurance (pattern including make of car, # of accidents, age, sex, driving habits, location, etc.).
Diagnosis systems
 Medical diagnosis (disease vs. symptom classification, X-ray, EKG, and test analysis, etc.).
 Diagnosis of automotive malfunctioning
Prediction systems
 Weather forecasting (based on satellite data).
 Analysis of seismic patterns

Dating services (where pattern includes age, sex, race, hobbies, income, etc).
More Pattern Recognition Applications

Sensory data:
 Vision: face / handwriting / hand recognition
 Speech: speaker / speech recognition
 Olfaction: is the apple ripe?

Text categorization
Information retrieval
Data mining
Genome sequence matching
What is a pattern?
“A pattern is the opposite of a chaos; it is an entity
vaguely defined, that could be given a name.”
PR Definitions
 Theory, algorithms, and systems to put patterns into categories
 Classification of noisy or complex data
 Relating a perceived pattern to previously perceived patterns
Characters

A v t u I h D U w K

Ç ş ğ İ üÜ Ö Ğ
‫چك‬٤٧‫ع‬
К Ц Д
ζω Ψ Ω ξ θ
‫ם א‬ ‫ש‬ ‫ת‬ ‫ד‬ ‫נ‬
Handwriting
Terminology
 Features, feature vector
 Decision boundary
 Error
 Cost of error
 Generalization
A Fishy Example I
"Sorting incoming fish on a conveyor according to species using optical sensing"
 Salmon or sea bass?

Problem Analysis
 Set up a camera and take some sample images to extract features:
 Length
 Lightness
 Width
 Number and shape of fins
 Position of the mouth, etc.
This is the set of all suggested features to explore for use in our classifier!
Solution by Stages
 Preprocess raw data from the camera
 Segment isolated fish
 Extract features from each fish (length, width, brightness, etc.)
 Classify each fish

Preprocessing
 Use a segmentation operation to isolate fish from one another and from the background
 Information from a single fish is sent to a feature extractor, whose purpose is to reduce the data by measuring certain features
 The features are passed to a classifier
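As a rough sketch of how these stages fit together in code (the feature definitions, helper functions, and threshold value are illustrative assumptions, not from the lecture):

```python
# Minimal sketch of the pipeline: segmented fish image -> feature vector -> label.
import numpy as np

def extract_features(fish_image: np.ndarray) -> np.ndarray:
    """Reduce a segmented fish image (2-D array of pixel intensities) to features."""
    lightness = fish_image.mean()        # average pixel intensity
    length = float(fish_image.shape[1])  # crude proxy: width of the bounding box
    return np.array([lightness, length])

def classify(features: np.ndarray, lightness_threshold: float = 0.5) -> str:
    """Toy classifier: decide the species from the lightness feature alone."""
    return "sea bass" if features[0] > lightness_threshold else "salmon"
```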

Classification
 Select the length of the fish as a possible feature for discrimination.
 The length is a poor feature alone!
 Select the lightness as a possible feature.
"Customers do not want sea bass in their cans of salmon"
 Threshold decision boundary and cost relationship
 Move our decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!)
 This is a task of decision theory.
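A small sketch of this idea in Python; the sample lightness measurements and the cost numbers below are invented for illustration:

```python
# Sketch: choosing a lightness threshold that minimizes expected cost.
import numpy as np

salmon_lightness = np.array([2.1, 2.8, 3.4, 4.2, 4.6])
seabass_lightness = np.array([3.9, 4.4, 5.0, 5.6, 6.1])
COST_BASS_AS_SALMON = 5.0   # customers really dislike sea bass in their salmon cans
COST_SALMON_AS_BASS = 1.0

def expected_cost(threshold: float) -> float:
    # rule: label a fish "salmon" if its lightness is below the threshold
    bass_errors = np.sum(seabass_lightness < threshold)     # sea bass labeled salmon
    salmon_errors = np.sum(salmon_lightness >= threshold)   # salmon labeled sea bass
    return COST_BASS_AS_SALMON * bass_errors + COST_SALMON_AS_BASS * salmon_errors

candidates = np.linspace(2.0, 6.5, 200)
best = min(candidates, key=expected_cost)
print(f"cost-minimizing threshold: {best:.2f}")   # lands toward smaller lightness values
```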

Adopt the lightness and add the width of the fish:
 Fish x = [x1, x2], where x1 = lightness and x2 = width
We might add other features that are not correlated with the ones we already have. A precaution should be taken not to reduce the performance by adding such "noisy features".

Ideally, the best decision boundary should be the one which provides optimal performance, such as in the following figure.
However, our satisfaction is premature because the central aim of designing a classifier is to correctly classify novel input.
 Issue of generalization!
Decision Boundaries
Observe: Can do much better with two features

Caveat: overfitting!
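To make the two-feature idea concrete, here is a minimal nearest-mean classifier on (lightness, width) pairs; the data values are invented, and the induced decision boundary is a straight line:

```python
# Sketch: a two-feature nearest-mean classifier and its implied linear boundary.
import numpy as np

salmon = np.array([[2.5, 10.0], [3.0, 11.5], [2.8, 9.5]])    # [lightness, width]
seabass = np.array([[5.0, 14.0], [5.5, 15.5], [4.8, 13.0]])
mu_s, mu_b = salmon.mean(axis=0), seabass.mean(axis=0)

def classify(x: np.ndarray) -> str:
    # assign to the nearer class mean; the boundary is the perpendicular
    # bisector of the segment joining the two means (a straight line)
    return "salmon" if np.linalg.norm(x - mu_s) < np.linalg.norm(x - mu_b) else "sea bass"

print(classify(np.array([3.2, 10.8])))   # expected: salmon
```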
Occam’s Razor

Entities are not to be multiplied without necessity

William of Occam
(1284-1347)
A Complete PR System
Problem Formulation:
 Input object → Measurements & Preprocessing → Features → Classification → Class label

Basic ingredients:
•Measurement space (e.g., image intensity, pressure)
•Features (e.g., corners, spectral energy)
•Classifier - soft and hard
•Decision boundary
•Training sample
•Probability of error
Pattern Recognition Systems
Sensing
 Use of a transducer (camera or microphone)
 The PR system depends on the bandwidth, resolution, sensitivity, and distortion of the transducer
Segmentation and grouping
 Patterns should be well separated and should not overlap
Feature extraction
 Discriminative features
 Invariant features with respect to translation, rotation and
scale.

Classification
 Use a feature vector provided by a feature extractor to
assign the object to a category

Post Processing
 Exploit context dependent information other than from the
target pattern itself to improve performance
The Design Cycle
 Data collection
 Feature choice
 Model choice
 Training
 Evaluation
 Computational complexity
Data Collection
 How do we know when we have collected an adequately large and representative set of examples for training and testing the system?
Feature Choice
 Depends on the characteristics of the problem domain. Features should be simple to extract, invariant to irrelevant transformations, and insensitive to noise.
Model Choice
 What do we do when we are unsatisfied with the performance of our linear fish classifier and want to jump to another class of model?
Training
 Use data to determine the classifier. There are many different procedures for training classifiers and choosing models.
Evaluation
 Measure the error rate (or performance) and switch from one set of features & models to another.
Computational Complexity
 What is the trade-off between computational ease and performance?
 (How does an algorithm scale as a function of the number of features, the number of training examples, and the number of patterns or categories?)
Learning and Adaptation
Learning: Any method that combines empirical information from
the environment with prior knowledge into the design of a
classifier, attempting to improve performance with time.
Empirical information: Usually in the form of training examples.
Prior knowledge: Invariances, correlations

Supervised learning
 A teacher provides a category label or cost for each pattern in the
training set

Unsupervised learning
 The system forms clusters or “natural groupings” of the input patterns
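A compact illustration of the two settings (a sketch assuming scikit-learn is available; the toy data are made up):

```python
# Contrast supervised and unsupervised learning on the same small dataset.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],    # cluster around (1, 1)
              [4.0, 4.2], [4.3, 3.9], [3.8, 4.1]])   # cluster around (4, 4)
y = np.array([0, 0, 0, 1, 1, 1])                     # labels given by a "teacher"

# Supervised: the labels guide the classifier.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[1.0, 1.0], [4.0, 4.0]]))         # -> [0 1]

# Unsupervised: no labels; the system forms its own groupings.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                                     # two clusters (arbitrary ids)
```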

Syntactic Versus Statistical
PR
Basic assumption: There is an underlying regularity
behind the observed phenomena.
Question: Based on noisy observations, what is the
underlying regularity?
Syntactic: Structure through common generative
mechanism. For example, all the different manifestations of English share a common underlying set of grammatical rules.
Statistical: Objects characterized through statistical
similarity. For example, all possible digits `2' share
some common underlying statistical relationship.
Difficulties
Segmentation
Context
Temporal structure
Missing features
Aberrant data
Noise

Do all these images represent an `A'?


Design Cycle
 How do we know what features to select, and how do we select them?
 What type of classifier shall we use? Is there a best classifier?
 How do we train?
 How do we combine prior knowledge with empirical data?
 How do we evaluate our performance?
 How do we validate the results? Confidence in the decision?
Conclusion
 I expect you are overwhelmed by the number, complexity, and magnitude of the sub-problems of pattern recognition.
 Many of these sub-problems can indeed be solved.
 Many fascinating unsolved problems still remain.
Toolkit for PR
Statistics
Decision Theory
Optimization
Signal Processing
Neural Networks
Fuzzy Logic
Decision Trees
Clustering
Genetic Algorithms
AI Search
Formal Grammars
….
Linear algebra

Matrix A:
$$A = [a_{ij}]_{m \times n} = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{bmatrix}$$

Matrix transpose:
$$B = [b_{ij}]_{n \times m} = A^T \;\Leftrightarrow\; b_{ij} = a_{ji}, \quad 1 \le i \le n,\; 1 \le j \le m$$

Vector a:
$$a = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix}, \qquad a^T = [a_1, \dots, a_n]$$
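These definitions map directly onto numpy; a quick check (not part of the original slides):

```python
# Matrices, transpose, and vectors in numpy.
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # a 2x3 matrix, A[i, j] = a_ij (0-indexed)
B = A.T                          # transpose: B[j, i] == A[i, j]
a = np.array([1.0, 2.0, 3.0])    # a vector; a.reshape(-1, 1) is the column form
print(A.shape, B.shape)          # (2, 3) (3, 2)
```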
Matrix and vector multiplication

Matrix multiplication:
$$A = [a_{ij}]_{m \times p}, \quad B = [b_{ij}]_{p \times n}, \quad AB = C = [c_{ij}]_{m \times n}, \text{ where } c_{ij} = \mathrm{row}_i(A) \cdot \mathrm{col}_j(B)$$

Outer vector product:
$$a = A = [a_{ij}]_{m \times 1}, \quad b^T = B = [b_{ij}]_{1 \times n}, \quad c = a \otimes b = AB, \text{ an } m \times n \text{ matrix}$$

Vector-matrix product:
$$A = [a_{ij}]_{m \times n}, \quad b = B = [b_{ij}]_{n \times 1}, \quad C = Ab, \text{ an } m \times 1 \text{ matrix, i.e., a vector of length } m$$
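The same three products in numpy, with small illustrative arrays:

```python
# Matrix product, outer product, and matrix-vector product.
import numpy as np

A = np.arange(6).reshape(2, 3)        # 2x3
B = np.arange(12).reshape(3, 4)       # 3x4
C = A @ B                             # matrix product: 2x4, C[i,j] = row_i(A) . col_j(B)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0])
outer = np.outer(a, b)                # 3x2 outer product a b^T

x = np.array([1.0, 0.0, -1.0])
y = A @ x                             # matrix-vector product: vector of length 2
print(C.shape, outer.shape, y)
```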
Inner Product

Inner (dot) product:
$$a^T \cdot b = \sum_{i=1}^{n} a_i b_i$$

Length (Euclidean norm) of a vector:
$$\|a\| = \sqrt{a^T \cdot a} = \sqrt{\sum_{i=1}^{n} a_i^2}$$
a is normalized iff ||a|| = 1.

The angle between two n-dimensional vectors:
$$\cos\theta = \frac{a^T \cdot b}{\|a\|\,\|b\|}$$

An inner product is a measure of collinearity:
 a and b are orthogonal iff $a^T \cdot b = 0$
 a and b are collinear iff $a^T \cdot b = \|a\|\,\|b\|$

A set of vectors is linearly independent if no vector is a linear combination of the other vectors.
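A quick numpy check of the inner product, norm, and angle (the vectors are chosen so that a and b happen to be orthogonal):

```python
# Inner product, Euclidean norm, angle, and an orthogonality check.
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 0.0, -1.0])

dot = a @ b                                   # sum_i a_i * b_i
norm_a = np.linalg.norm(a)                    # sqrt(a . a) = 3.0 here
cos_theta = dot / (norm_a * np.linalg.norm(b))
print(dot, norm_a, cos_theta)                 # 0.0 3.0 0.0
print(np.isclose(a @ b, 0.0))                 # True -> a and b are orthogonal
```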
Determinant and Trace

Determinant ($A = [a_{ij}]_{n \times n}$):
$$\det(A) = \sum_{j=1}^{n} a_{ij} A_{ij}, \quad \text{for any } i = 1, \dots, n, \qquad A_{ij} = (-1)^{i+j} \det(M_{ij})$$
$$\det(AB) = \det(A)\det(B)$$

Trace:
$$A = [a_{ij}]_{n \times n}, \qquad \mathrm{tr}[A] = \sum_{j=1}^{n} a_{jj}$$
Matrix Inversion

A (n x n) is nonsingular if there exists a matrix B such that
$$AB = BA = I_n, \qquad B = A^{-1}$$
Example: $A = \begin{bmatrix} 2 & 3 \\ 2 & 2 \end{bmatrix}$, $B = \begin{bmatrix} -1 & 3/2 \\ 1 & -1 \end{bmatrix}$

A is nonsingular iff $|A| \neq 0$.

Pseudo-inverse of a non-square matrix A, provided $A^T A$ is not singular:
$$A^{\#} = [A^T A]^{-1} A^T, \qquad A^{\#} A = I$$
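Checking the 2x2 example and a pseudo-inverse with numpy (the 3x2 matrix M below is an illustrative choice):

```python
# Inverse of the slide's 2x2 example and a pseudo-inverse of a non-square matrix.
import numpy as np

A = np.array([[2.0, 3.0], [2.0, 2.0]])
B = np.linalg.inv(A)
print(B)                               # [[-1.  1.5], [ 1. -1. ]], matching the slide
print(np.allclose(A @ B, np.eye(2)))   # True

M = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # 3x2, full column rank
M_pinv = np.linalg.pinv(M)             # equals (M^T M)^{-1} M^T here
print(np.allclose(M_pinv @ M, np.eye(2)))             # True
```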
Eigenvectors and Eigenvalues

$$A e_j = \lambda_j e_j, \quad j = 1, \dots, n, \quad \|e_j\| = 1$$

Characteristic equation: $\det[A - \lambda I_n] = 0$, an n-th order polynomial with n roots.

$$\mathrm{tr}[A] = \sum_{j=1}^{n} \lambda_j, \qquad \det[A] = \prod_{j=1}^{n} \lambda_j$$
Probability Theory

Primary references:
 Any probability and statistics textbook (e.g., Papoulis)
 Appendix A.4 in "Pattern Classification" by Duda et al.

The principles of probability theory, describing the behavior of systems with random characteristics, are of fundamental importance to pattern recognition.
Example 1 (Wikipedia)
•There are two bowls full of cookies.
•Bowl #1 has 10 chocolate chip cookies and 30 plain cookies; bowl #2 has 20 of each.
•Fred picks a bowl at random, and then picks a cookie at random.
•The cookie turns out to be a plain one.
•How probable is it that Fred picked it out of bowl #1? In other words, what's the probability that Fred picked bowl #1, given that he has a plain cookie?
•Event A is that Fred picked bowl #1; event B is that Fred picked a plain cookie.
•Pr(A|B) = ?
Example 1 - continued
Tables of occurrences and relative frequencies
It is often helpful when calculating conditional probabilities to create a simple table containing the number of occurrences of each outcome, or the relative frequencies of each outcome, for each of the independent variables. The tables below illustrate the use of this method for the cookies.

Number of cookies in each bowl, by type of cookie:

                 Bowl #1   Bowl #2   Totals
Chocolate chip      10        20        30
Plain               30        20        50
Total               40        40        80

Relative frequency of cookies in each bowl, by type of cookie:

                 Bowl #1   Bowl #2   Totals
Chocolate chip    0.125     0.250     0.375
Plain             0.375     0.250     0.625
Total             0.500     0.500     1.000

The second table is derived from the first by dividing each entry by the total number of cookies under consideration (80 cookies).
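For reference, the quantity Pr(A|B) asked for above can be checked with a few lines of Python (a sketch using the counts from the first table):

```python
# Checking Pr(bowl #1 | plain cookie) with Bayes' rule.
plain_in_bowl1, plain_in_bowl2 = 30, 20
cookies_per_bowl = 40

p_bowl1 = 0.5                                              # Fred picks a bowl at random
p_plain_given_bowl1 = plain_in_bowl1 / cookies_per_bowl    # 0.75
p_plain_given_bowl2 = plain_in_bowl2 / cookies_per_bowl    # 0.50
p_plain = p_bowl1 * p_plain_given_bowl1 + (1 - p_bowl1) * p_plain_given_bowl2  # 0.625
print(p_plain_given_bowl1 * p_bowl1 / p_plain)             # Pr(bowl #1 | plain) = 0.6
```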
Example 2
1. Power Plant Operation
 The variables X, Y, Z describe the state of three power plants (X = 0 means plant X is idle).
 Denote by A the event that plant X is idle, and by B the event that at least 2 of the three plants are working.
 What are P(A) and P(A|B), the probability that X is idle given that at least 2 of the three are working?

 X   Y   Z   P(x,y,z)
 0   0   0     0.07
 0   0   1     0.04
 0   1   0     0.03
 0   1   1     0.18
 1   0   0     0.16
 1   0   1     0.18
 1   1   0     0.21
 1   1   1     0.13
P(A) = P(0,0,0) + P(0,0,1) + P(0,1,0) + P(0,1,1) = 0.07 + 0.04 + 0.03 + 0.18 = 0.32
P(B) = P(0,1,1) + P(1,0,1) + P(1,1,0) + P(1,1,1) = 0.18 + 0.18 + 0.21 + 0.13 = 0.70
P(A and B) = P(0,1,1) = 0.18

P(A|B) = P(A and B)/P(B) = 0.18/0.70 = 0.257
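The same computation, done directly from the joint distribution table in Python (a sketch):

```python
# Marginals and a conditional probability from the joint distribution P(x, y, z).
P = {(0,0,0): 0.07, (0,0,1): 0.04, (0,1,0): 0.03, (0,1,1): 0.18,
     (1,0,0): 0.16, (1,0,1): 0.18, (1,1,0): 0.21, (1,1,1): 0.13}

p_A = sum(p for (x, y, z), p in P.items() if x == 0)                  # plant X idle
p_B = sum(p for (x, y, z), p in P.items() if x + y + z >= 2)          # >= 2 plants working
p_AB = sum(p for (x, y, z), p in P.items() if x == 0 and y + z >= 2)  # both events
print(round(p_A, 2), round(p_B, 2), round(p_AB / p_B, 3))             # 0.32 0.7 0.257
```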
2. Cars are assembled in four possible locations. Plant I supplies 20% of the cars; plant II, 24%; plant III, 25%; and plant IV, 31%. There is a 1-year warranty on every car. The company collected data showing
P(claim|plant I) = 0.05; P(claim|plant II) = 0.11;
P(claim|plant III) = 0.03; P(claim|plant IV) = 0.08.

Cars are sold at random. An owner just submitted a claim for her car. What are the posterior probabilities that this car was made in plant I, II, III, and IV?

P(claim) = P(claim|plant I)P(plant I) + P(claim|plant II)P(plant II) + P(claim|plant III)P(plant III) + P(claim|plant IV)P(plant IV) = 0.0687
P(plant I|claim) = P(claim|plant I) P(plant I)/P(claim) = 0.146
P(plant II|claim) = P(claim|plant II) P(plant II)/P(claim) = 0.384
P(plant III|claim) = P(claim|plant III) P(plant III)/P(claim) = 0.109
P(plant IV|claim) = P(claim|plant IV) P(plant IV)/P(claim) = 0.361
Example 3
3. It is known that 1% of the population suffers from a particular disease. A blood test has a 97% chance of identifying the disease for a diseased individual, but also has a 6% chance of falsely indicating that a healthy person has the disease.
a. What is the probability that a random person has a positive blood test?
b. If a blood test is positive, what's the probability that the person has the disease?
c. If a blood test is negative, what's the probability that the person does not have the disease?

A is the event that a person has the disease: P(A) = 0.01; P(A') = 0.99.
B is the event that the test result is positive:
 P(B|A) = 0.97; P(B'|A) = 0.03;
 P(B|A') = 0.06; P(B'|A') = 0.94.
(a) P(B) = P(A)P(B|A) + P(A')P(B|A') = 0.01*0.97 + 0.99*0.06 = 0.0691
(b) P(A|B) = P(B|A)P(A)/P(B) = 0.97*0.01/0.0691 = 0.1403
(c) P(A'|B') = P(B'|A')P(A')/P(B') = P(B'|A')P(A')/(1 - P(B)) = 0.94*0.99/(1 - 0.0691) = 0.9997
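The same three answers as a short Python computation (a sketch matching the numbers above):

```python
# Example 3 as a direct Bayes computation.
p_disease = 0.01
p_pos_given_disease, p_pos_given_healthy = 0.97, 0.06

p_pos = p_disease * p_pos_given_disease + (1 - p_disease) * p_pos_given_healthy
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
p_healthy_given_neg = (1 - p_pos_given_healthy) * (1 - p_disease) / (1 - p_pos)
print(round(p_pos, 4), round(p_disease_given_pos, 4), round(p_healthy_given_neg, 4))
# 0.0691 0.1404 0.9997
```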
Sums of Random Variables

z = x + y
$$\mu_z = \mu_x + \mu_y$$
$$\mathrm{Var}(z) = \mathrm{Var}(x) + \mathrm{Var}(y) + 2\,\mathrm{Cov}(x, y)$$
If x and y are independent: $\mathrm{Var}(z) = \mathrm{Var}(x) + \mathrm{Var}(y)$

Distribution of z:
$$p(z) = p_x(x) * p_y(y) = \int p_x(x)\, p_y(z - x)\, dx$$

Examples:
 x and y are uniform on [0,1].
 Find p(z = x + y), E(z), Var(z).
 x is uniform on [-1,1], and P(y) = 0.5 for y = 0, y = 10, and 0 elsewhere.
 Find p(z = x + y), E(z), Var(z).
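As a worked sketch of the first case (not part of the original slides): for independent x and y uniform on [0,1], the convolution gives a triangular density,
$$p(z) = \begin{cases} z, & 0 \le z \le 1 \\ 2 - z, & 1 \le z \le 2 \\ 0, & \text{otherwise,} \end{cases} \qquad E(z) = 1, \qquad \mathrm{Var}(z) = \tfrac{1}{12} + \tfrac{1}{12} = \tfrac{1}{6}.$$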
Normal Distributions

Gaussian distribution:
$$p(x) = N(\mu_x, \sigma_x) = \frac{1}{\sqrt{2\pi}\,\sigma_x}\, e^{-(x - \mu_x)^2 / 2\sigma_x^2}$$

Mean: $E(x) = \mu_x$
Variance: $E[(x - \mu_x)^2] = \sigma_x^2$

The Central Limit Theorem says that sums of random variables tend toward a Normal distribution.

Mahalanobis distance: $r = \dfrac{|x - \mu_x|}{\sigma_x}$
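A quick empirical illustration of the Central Limit Theorem in Python (the number of terms and samples are arbitrary choices):

```python
# Sums of uniform random variables look increasingly Gaussian.
import numpy as np

rng = np.random.default_rng(0)
sums = rng.uniform(0, 1, size=(100_000, 12)).sum(axis=1)   # sum of 12 uniforms
print(sums.mean(), sums.std())    # close to 6 and 1 (12 * 1/2 and sqrt(12 * 1/12))
```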
Multivariate Normal Density

x is a vector of d Gaussian variables:
$$p(x) = N(\mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

$$\mu = E[x] = \int x\, p(x)\, dx$$
$$\Sigma = E[(x - \mu)(x - \mu)^T] = \int (x - \mu)(x - \mu)^T\, p(x)\, dx$$

Mahalanobis distance:
$$r^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)$$

All conditionals and marginals are also Gaussian.
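Evaluating the density and the Mahalanobis distance numerically (a sketch; the mean and covariance values are illustrative, and scipy is assumed to be available):

```python
# Multivariate normal pdf and squared Mahalanobis distance.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([1.0, 1.0])

r2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)   # squared Mahalanobis distance
pdf = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(r2, pdf)
```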
Bivariate Normal Densities
 Level curves are ellipses.
 The x and y widths are determined by the variances, and the eccentricity by the correlation coefficient.
 The principal axes are the eigenvectors of the covariance matrix, and the width in these directions is the square root of the corresponding eigenvalue.
Information theory
Key principles:
 What is the information contained in a random event?
 A less probable event contains more information.
 For two independent events, the information adds.
$$I(x) = -\log_2 P(x)$$

 What is the average information, or entropy, of a distribution?
$$H(x) = -\sum_x P(x) \log_2 P(x)$$
 Examples: uniform distribution, Dirac distribution.

 Mutual information: the reduction in uncertainty about one variable due to knowledge of the other variable.
$$I_{x,y} = H(x) - H(x|y) = \sum_{x,y} p(x,y)\, \log_2 \frac{p(x,y)}{p(x)\,p(y)}$$
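A small numeric sketch of these quantities for a made-up 2x2 joint distribution:

```python
# Entropy and mutual information for a discrete joint distribution.
import numpy as np

p_xy = np.array([[0.4, 0.1],     # p(x, y): x indexes rows, y indexes columns
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

H_x = -np.sum(p_x * np.log2(p_x))                           # entropy of x (1 bit here)
I_xy = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))    # mutual information in bits
print(H_x, I_xy)
```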
