Statistical Perspective
Statistical Framework
The natural framework for studying the design and capabilities of pattern classification machines is statistical.
The nature of the information available for decision making is probabilistic.
Feedforward Neural Networks
Have a natural propensity for performing
classification tasks
Solve the problem of recognition of patterns in
the input space or pattern space
Pattern recognition:
Concerned with the problem of decision making based
on complex patterns of information that are
probabilistic in nature.
Network outputs can be given a proper interpretation in terms of conventional statistical pattern recognition concepts.
Pattern Classification
Linearly separable pattern sets are only the simplest ones.
Iris data: the classes overlap.
Important issue:
Find an optimal placement of the discriminant
function so as to minimize the number of
misclassifications on the given data set, and
simultaneously minimize the probability of
misclassification on unseen patterns.
Notion of Prior
The prior probability P(Ck) of a pattern belonging to class Ck is measured by the fraction of patterns in that class, assuming an infinite number of patterns in the training set.
Priors influence our decision to assign an unseen pattern to a class.
Assignment without Information
In the absence of all other information:
Experiment:
In a large sample of outcomes of a coin-toss experiment, the ratio of Heads to Tails is 60:40.
Is the coin biased?
Classify the next (unseen) outcome so as to minimize the probability of misclassification.
(Natural and safe) Answer: choose Heads!
Introduce Observations
Can do much better with an observation…
Suppose we are allowed to make a single
measurement of a feature x of each pattern
of the data set.
x takes values from a discrete set {x1, x2, …, xd}.
Conditional Probability
P(A|B) = P(A, B) / P(B)
Conditional Probability --Example
The next table shows the number of male and female members of the standing faculty in the departments of Math and English.

         Math   English   Total
Female      1        17      18
Male       37        20      57
Total      38        37      75
Conditional Probability --Example
Joint probability of gender and area:

         Math   English   Total
Female   .013     .227     .240
Male     .493     .267     .760
Total    .506     .494     1.00

If gender is the class of a professor and area is the attribute of a professor, then
P(female, math) = 0.013
Joint and Conditional Probability
The conditional probability P(xl|Ck) is the fraction of patterns that have value xl (math, English) among only the patterns from class Ck (male, female).
P(math|male)
= P(xl = x1 = math | Ck = C1 = male)
= P(math, male) / P(male)
= 0.493 / 0.760 = 0.649
That is, P(xl|Ck) = P(xl, Ck) / P(Ck)
Joint and Conditional Probability
From the previous result,
P(Ck, xl) = P(xl, Ck) = P(xl|Ck) P(Ck)
So the joint probability factors as the conditional probability times the class prior:
Joint Probability = Conditional Probability × Class Prior
where the prior P(Ck) is estimated from the number of patterns in class Ck.
Posterior Probability: Bayes' Theorem
P(Ck|xl) is the posterior probability: the probability that a pattern with feature value xl belongs to class Ck.
Bayes' Theorem:
P(Ck|xl) = P(xl|Ck) P(Ck) / P(xl)
Posterior = (likelihood × prior) / evidence
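As a quick check of these definitions, here is a minimal Python sketch (the dictionary layout and function names are illustrative, not from the slides) that recomputes the priors, likelihoods, and posteriors for the faculty table above.

```python
# Illustrative sketch: Bayes' theorem applied to the faculty counts above.
counts = {
    ("female", "math"): 1, ("female", "english"): 17,
    ("male", "math"): 37, ("male", "english"): 20,
}
total = sum(counts.values())  # 75

def prior(ck):
    """P(Ck): fraction of faculty with gender ck."""
    return sum(v for (g, a), v in counts.items() if g == ck) / total

def likelihood(xl, ck):
    """P(xl|Ck) = P(xl, Ck) / P(Ck)."""
    return (counts[(ck, xl)] / total) / prior(ck)

def evidence(xl):
    """P(xl) = sum over k of P(xl|Ck) P(Ck)."""
    return sum(likelihood(xl, g) * prior(g) for g in ("male", "female"))

def posterior(ck, xl):
    """Bayes' theorem: P(Ck|xl) = P(xl|Ck) P(Ck) / P(xl)."""
    return likelihood(xl, ck) * prior(ck) / evidence(xl)

print(likelihood("math", "male"))   # ~0.649, matching the slide
print(posterior("female", "math"))  # P(female|math) = 1/38 ~ 0.026
```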
Bayes' Theorem and Classification
Bayes' Theorem provides the key to classifier design:
Assign a pattern with x = xl to the class Ck for which the posterior is the highest!
Note therefore that all posteriors must sum to one:
Σk P(Ck|xl) = 1
And the evidence is the sum over classes of the joint probabilities:
P(xl) = Σk P(xl|Ck) P(Ck)
Bayes' Theorem and Classification -- Example
The objects can be classified as either GREEN or RED.
Our task is to classify new cases as they arrive, i.e., to decide which class label they belong to, based on the currently existing objects.
20 of the 60 objects are RED and 40 of them are GREEN.
Prior probability for GREEN: 40 / 60 = 2/3
Prior probability for RED: 20 / 60 = 1/3
Bayes' Theorem and Classification -- Example
We are now ready to classify a new object X (the WHITE circle in the accompanying diagram).
To estimate the likelihood of X under each class, we draw a circle around X which encompasses a number (chosen a priori) of points, irrespective of their class labels.
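The likelihood-from-a-neighbourhood idea can be sketched numerically. The class totals below come from the slides; the counts of points falling inside the circle are invented for illustration, since the original figure is not reproduced here.

```python
# Counts from the slides: 40 GREEN and 20 RED objects in total.
n_green, n_red = 40, 20
prior_green, prior_red = n_green / 60, n_red / 60

# Illustrative assumption: the circle drawn around the new point X happens
# to enclose 1 GREEN point and 3 RED points (made-up neighbourhood counts).
green_in_circle, red_in_circle = 1, 3

# Likelihoods estimated from the neighbourhood of X.
lik_green = green_in_circle / n_green   # 0.025
lik_red = red_in_circle / n_red         # 0.15

# Compare the unnormalized posteriors (likelihood x prior) and pick the larger.
post_green = lik_green * prior_green
post_red = lik_red * prior_red
print("classify X as", "GREEN" if post_green > post_red else "RED")  # RED
```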
Gaussian Distributions
Two-class, one-dimensional Gaussian probability density function:
p(x|Ck) = (1 / (√(2π) σk)) exp( −(x − μk)² / (2σk²) )
where μk is the mean, σk² the variance, and 1/(√(2π) σk) the normalizing factor.
Example of Gaussian Distribution
Two classes are assumed to be distributed about means 1.5 and 3 respectively, with equal variances of 0.25.
[Figure: the two class-conditional Gaussian density functions]
Extension to n-dimensions
The probability density function expression extends to
p(X|Ck) = (1 / ((2π)^(n/2) |Kk|^(1/2))) exp( −(1/2)(X − μk)^T Kk^-1 (X − μk) )
with mean vector μk and covariance matrix Kk.
Covariance Matrix and Mean
Covariance matrix
describes the shape and orientation of the
distribution in space
Mean
describes the translation of the scatter from the
origin
Covariance Matrix and Data Scatters
[Figures: data scatters corresponding to different covariance matrices]
Probability Contours
Contours of the probability density function are loci of equal Mahalanobis distance:
Δ² = (X − μ)^T K^-1 (X − μ)
Classification Decisions with Bayes' Theorem
Key: assign X to class Ck such that
P(Ck|X) > P(Cj|X) for all j ≠ k
or, equivalently,
p(X|Ck) P(Ck) > p(X|Cj) P(Cj) for all j ≠ k
Note: the evidence p(X) is common to all classes and can be dropped from the comparison.
Placement of a Decision Boundary
A decision boundary separates the classes in question.
Where do we place decision region boundaries so that the probability of misclassification is minimized?
Quantifying the Classification Error
Example: one dimension, two classes identified by decision regions R1, R2
Perror = P(x ∈ R1, C2) + P(x ∈ R2, C1)
       = ∫R1 p(x|C2) P(C2) dx + ∫R2 p(x|C1) P(C1) dx
Quantifying the Classification Error
Place the decision boundary such that:
point x lies in R1 (decide C1) if p(x|C1) P(C1) > p(x|C2) P(C2)
point x lies in R2 (decide C2) if p(x|C2) P(C2) > p(x|C1) P(C1)
Optimal Placement of a Decision Boundary
Bayesian decision boundary:
the point where the unnormalized probability density functions p(x|C1) P(C1) and p(x|C2) P(C2) cross over.
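A small sketch of this crossover for the earlier one-dimensional example (means 1.5 and 3, common variance 0.25); equal priors are assumed here since the slides do not state them.

```python
import numpy as np

def gaussian(x, mu, var):
    """One-dimensional Gaussian p.d.f."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu1, mu2, var = 1.5, 3.0, 0.25
p1, p2 = 0.5, 0.5                 # assumed equal priors

x = np.linspace(0, 5, 5001)
g1 = gaussian(x, mu1, var) * p1   # unnormalized density for C1
g2 = gaussian(x, mu2, var) * p2   # unnormalized density for C2

# The Bayesian decision boundary is where the two curves cross:
# restrict attention to the region between the means and find the crossover.
mask = (x > mu1) & (x < mu2)
boundary = x[mask][np.argmin(np.abs(g1[mask] - g2[mask]))]
print(boundary)   # ~2.25: the midpoint, for equal priors and equal variances
```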
Probabilistic Interpretation of a Neuron Discriminant Function
An artificial neuron implements a discriminant function yj(X).
Each of C neurons implements its own discriminant function for a C-class problem.
An arbitrary input vector X is assigned to class Ck if neuron k has the largest activation.
Probabilistic Interpretation of a Neuron Discriminant Function
An optimal Bayes' classification chooses the class with maximum posterior probability P(Cj|X).
Discriminant function: yj = P(X|Cj) P(Cj) (the yj notation is re-used for emphasis).
Two-class case: point x lies in R1 (decide C1) if p(x|C1) P(C1) > p(x|C2) P(C2).
Only the relative magnitudes matter: any monotonic function of the probabilities can be used to generate a new discriminant function.
Probabilistic Interpretation of a Neuron Discriminant Function
Assume an n-dimensional Gaussian density function for each class. Taking the logarithm (a monotonic function) yields
yj(X) = −(1/2)(X − μj)^T Kj^-1 (X − μj) − (1/2) ln|Kj| + ln P(Cj)
after dropping the class-independent constant.
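A minimal sketch of this log-discriminant rule for Gaussian classes; the means, covariance matrices, priors, and test point below are illustrative placeholders.

```python
import numpy as np

def log_discriminant(x, mu, K, prior):
    """y_j(X) = -1/2 (X-mu)^T K^-1 (X-mu) - 1/2 ln|K| + ln P(Cj),
    dropping the class-independent constant -n/2 ln(2*pi)."""
    d = x - mu
    return (-0.5 * d @ np.linalg.solve(K, d)
            - 0.5 * np.log(np.linalg.det(K))
            + np.log(prior))

# Illustrative two-class, two-dimensional problem.
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Ks = [np.array([[1.0, 0.3], [0.3, 1.0]]), np.array([[0.5, 0.0], [0.0, 2.0]])]
priors = [0.5, 0.5]

x = np.array([1.0, 1.5])
scores = [log_discriminant(x, m, K, p) for m, K, p in zip(mus, Ks, priors)]
print("assign x to class", int(np.argmax(scores)) + 1)
```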
Plotting a Bayesian Decision Boundary: 2-Class Example
Assume classes C1 and C2 with Gaussian discriminant functions y1(X) and y2(X) of the form above.
The decision boundary is the locus of points where y1(X) = y2(X).
This boundary is elliptic.
Bayesian Decision Boundary
[Figures: the resulting Bayesian decision boundary]
Cholesky Decomposition of Covariance Matrix K
Returns a matrix Q such that Q^T Q = K, where Q is upper triangular.
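One common use of this decomposition is to generate the Gaussian data scatters shown earlier. A sketch with NumPy follows; note that numpy.linalg.cholesky returns the lower-triangular factor L with L L^T = K, so Q = L^T in the slide's convention. The covariance, mean, and sample size are illustrative.

```python
import numpy as np

K = np.array([[2.0, 0.8],
              [0.8, 1.0]])          # covariance matrix
mu = np.array([1.0, -1.0])          # mean vector

L = np.linalg.cholesky(K)           # lower triangular, L @ L.T == K
Q = L.T                             # upper triangular, Q.T @ Q == K (slide's Q)

# Generate a correlated Gaussian scatter: each sample is mu + z Q, z ~ N(0, I).
rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 2))
X = mu + z @ Q

print(np.cov(X, rowvar=False))      # approximately K
```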
Interpreting Neuron Signals as Probabilities: Gaussian Data
Gaussian distributed data: two-class data with equal covariance matrices, K2 = K1 = K:
p(X|Ck) = (1 / ((2π)^(n/2) |K|^(1/2))) exp( −(1/2)(X − μk)^T K^-1 (X − μk) )
Interpreting Neuron Signals as Probabilities: Gaussian Data
Consider class 1:
P(C1|X) = p(X|C1) P(C1) / (p(X|C1) P(C1) + p(X|C2) P(C2)) = 1 / (1 + e^-a)
A sigmoidal neuron?
Interpreting Neuron Signals as Probabilities: Gaussian Data
We substituted
a = ln[ p(X|C1) P(C1) / (p(X|C2) P(C2)) ]
or, since the covariance matrices are equal,
a = W^T X + w0, with W = K^-1(μ1 − μ2) and w0 = −(1/2)μ1^T K^-1 μ1 + (1/2)μ2^T K^-1 μ2 + ln(P(C1)/P(C2))
A neuron activation!
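A sketch of this result: for two Gaussian classes sharing a covariance matrix, the posterior P(C1|X) is a sigmoid of a linear activation. The means, covariance, and priors are illustrative, and SciPy is used only for the cross-check against a direct application of Bayes' rule.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Shared covariance K, class means, and priors (illustrative values).
K = np.array([[1.0, 0.2], [0.2, 1.0]])
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
P1, P2 = 0.6, 0.4

# Linear "neuron" parameters: a(X) = W^T X + w0.
Kinv = np.linalg.inv(K)
W = Kinv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Kinv @ mu1 + 0.5 * mu2 @ Kinv @ mu2 + np.log(P1 / P2)

def posterior_C1(x):
    """P(C1|X) = sigmoid(W^T X + w0)."""
    return sigmoid(W @ x + w0)

# Cross-check against direct Bayes' rule with the Gaussian densities.
x = np.array([1.0, 0.5])
num = mvn.pdf(x, mu1, K) * P1
den = num + mvn.pdf(x, mu2, K) * P2
print(posterior_C1(x), num / den)   # the two values agree
```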
Interpreting Neuron Signals as Probabilities
Bernoulli distributed data:
each random variable xi takes values 0, 1 with the Bernoulli distribution
P(xi|Ck) = pki^xi (1 − pki)^(1−xi), where pki = P(xi = 1|Ck)
Interpreting Neuron Signals as Probabilities: Bernoulli Data
Bayesian discriminant (with independent components):
yk(X) = ln p(X|Ck) + ln P(Ck) = Σi [ xi ln pki + (1 − xi) ln(1 − pki) ] + ln P(Ck)
This is linear in X: a neuron activation.
Interpreting Neuron Signals as Probabilities: Bernoulli Data
Consider the posterior probability for class C1:
P(C1|X) = 1 / (1 + e^-a)
where
a = W^T X + w0, with wi = ln[ p1i(1 − p2i) / (p2i(1 − p1i)) ] and w0 = Σi ln[(1 − p1i)/(1 − p2i)] + ln(P(C1)/P(C2))
Again a sigmoidal neuron operating on a linear activation.
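A sketch of the Bernoulli case: the difference of the two class log-discriminants equals a linear activation, and the class 1 posterior is its sigmoid. The Bernoulli parameters and priors below are illustrative.

```python
import numpy as np

# Illustrative class-conditional Bernoulli parameters p_ki = P(x_i = 1 | C_k).
p1 = np.array([0.9, 0.2, 0.7])   # class C1
p2 = np.array([0.3, 0.6, 0.4])   # class C2
P1, P2 = 0.5, 0.5

def log_discriminant(x, p, prior):
    """y_k(x) = sum_i [x_i ln p_i + (1 - x_i) ln(1 - p_i)] + ln P(Ck)."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)) + np.log(prior)

# Equivalent linear (neuron) form: a(x) = w^T x + w0.
w = np.log(p1 / (1 - p1)) - np.log(p2 / (1 - p2))
w0 = np.sum(np.log((1 - p1) / (1 - p2))) + np.log(P1 / P2)

x = np.array([1, 0, 1])
a = w @ x + w0                              # activation of a single neuron
print(a, log_discriminant(x, p1, P1) - log_discriminant(x, p2, P2))  # equal
print(1 / (1 + np.exp(-a)))                 # P(C1|x) as a sigmoid
```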
Multilayered Networks
The computational power of neural
networks stems from their multilayered
architecture
What kind of interpretation can the outputs of
such networks be given?
Can we use some other (more appropriate) error
function to train such networks?
If so, then with what consequences in network
behaviour?
Likelihood
Assume a training data set T = {Xk, Dk} drawn from a joint p.d.f. p(X, D) defined on R^n × R^p.
Joint probability, or likelihood, of T:
L = Πk p(Xk, Dk) = Πk p(Dk|Xk) p(Xk)
Sum of Squares Error Function
Motivated by the concept of maximum likelihood.
Context: a neural network solving a classification or regression problem.
Objective: maximize the likelihood function.
Alternatively: minimize the negative log-likelihood
−ln L = −Σk ln p(Dk|Xk) − Σk ln p(Xk)
The second term does not depend on the network parameters: drop this constant.
Sum of Squares Error Function
The error function is the negative sum of the log-probabilities of desired outputs conditioned on inputs:
E = −Σk ln p(Dk|Xk)
A feedforward neural network provides a framework for modelling the conditional density p(D|X).
Normally Distributed Data
Decompose the p.d.f. into a product of individual density functions:
p(Dk|Xk) = Πj p(dj^k|Xk), j = 1, …, p
From Likelihood to Sum Square Errors
Model each desired output as the corresponding network output corrupted by additive Gaussian noise with zero mean and standard deviation σ:
p(dj^k|Xk) = (1/(√(2π) σ)) exp( −(dj^k − sj^k)² / (2σ²) )
Then
E = −Σk Σj ln p(dj^k|Xk) = (1/(2σ²)) Σk Σj (dj^k − sj^k)² + constant
which is the sum-of-squares error, up to a scale factor and an additive constant.
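A small numerical check (with made-up outputs and targets) that the Gaussian negative log-likelihood and the sum-of-squares error differ only by a scale factor and an additive constant.

```python
import numpy as np

sigma = 0.5
rng = np.random.default_rng(1)
s = rng.random((100, 3))                        # stand-in network outputs s_j^k
d = s + sigma * rng.standard_normal(s.shape)    # noisy desired outputs d_j^k

# Negative log-likelihood under d_j^k ~ N(s_j^k, sigma^2)
nll = np.sum(0.5 * (d - s) ** 2 / sigma**2 + 0.5 * np.log(2 * np.pi * sigma**2))

# Sum-of-squares error
sse = 0.5 * np.sum((d - s) ** 2)

# They differ only by a scale (1/sigma^2) and an additive constant.
const = d.size * 0.5 * np.log(2 * np.pi * sigma**2)
print(np.allclose(nll, sse / sigma**2 + const))   # True
```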
Interpreting Network Signal Vectors
Re-write the sum-of-squares error function and take the limit of an infinite data set.
Algebra yields the result that the error is minimized when each network output equals the conditional average of the corresponding desired output:
sj(X) = ⟨dj|X⟩
Classification Problems
For a C-class classification problem, there will be C outputs.
Only 1 of the C outputs will be one.
Input pattern Xk is classified into class J if
dj^k = 1 for j = J and dj^k = 0 otherwise.
That is, the desired output is one for node J and 0 for the other nodes.
The probability that X takes on desired output dj is p(dj|X).
NN Classifiers and Square Error Functions
Note:
δ(x) = 1 if x = 0, and δ(x) = 0 if x ≠ 0
For the two-class problem, the target of output j is δjk when X belongs to class Ck, so
p(dj|X) = Σk δ(dj − δjk) p(Ck|X)
and therefore
p(d1 = 1|X) = δ(1 − δ11) p(C1|X) + δ(1 − δ12) p(C2|X) = p(C1|X)
p(d1 = 0|X) = δ(0 − δ11) p(C1|X) + δ(0 − δ12) p(C2|X) = p(C2|X)
Network Output = Class Posterior
The jth output is therefore
sj(X) = ⟨dj|X⟩ = 1·p(dj = 1|X) + 0·p(dj = 0|X) = P(Cj|X)
the class posterior probability.
Relaxing the Gaussian Constraint
Design a new error function that:
drops the Gaussian noise assumption on the desired outputs,
retains the ability to interpret the network outputs as posterior probabilities,
subject to the constraints that each output signal is confined to (0, 1) and the outputs sum to 1.
Neural Network With A Single Output
The output s represents the Class 1 posterior; then 1 − s represents the Class 2 posterior.
The probability that we observe a target value dk on pattern Xk is
p(dk|Xk) = (sk)^dk (1 − sk)^(1−dk)
Cross Entropy Error Function
Maximizing the probability of observing the desired value dk for input Xk on each pattern in T.
Likelihood:
L = Πk (sk)^dk (1 − sk)^(1−dk)
It is convenient to minimize the negative log-likelihood, which we denote as the error:
E = −ln L = −Σk [ dk ln sk + (1 − dk) ln(1 − sk) ]
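A minimal sketch of this error for a single sigmoidal output, with illustrative outputs and targets, confirming that the cross-entropy error equals the negative log-likelihood.

```python
import numpy as np

d = np.array([1, 0, 1, 1, 0])            # desired outputs d^k
s = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # network outputs s^k (posteriors)

# Likelihood of the training set: prod_k s^d (1 - s)^(1 - d)
likelihood = np.prod(s**d * (1 - s) ** (1 - d))

# Cross-entropy error: E = -sum_k [d ln s + (1 - d) ln(1 - s)]
E = -np.sum(d * np.log(s) + (1 - d) * np.log(1 - s))

print(np.allclose(E, -np.log(likelihood)))   # True: E is the negative log-likelihood
```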
Architecture of Feedforward Network Classifier
[Figure: architecture of the feedforward network classifier]
Network Training
Using the chain rule (Chapter 6) with the cross-entropy error function, the error derivative at the sigmoidal output unit simplifies: the derivative of Ek with respect to the unit's net input is sk − dk.
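A finite-difference check (with illustrative numbers) that, for a sigmoidal output with cross-entropy error, the derivative of E with respect to the unit's net input is s − d, which is the quantity the chain rule propagates backwards.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def E(a, d):
    """Cross-entropy error of a single sigmoidal output with net input a."""
    s = sigmoid(a)
    return -(d * np.log(s) + (1 - d) * np.log(1 - s))

a, d = 0.3, 1.0
eps = 1e-6
numeric = (E(a + eps, d) - E(a - eps, d)) / (2 * eps)   # finite-difference dE/da
analytic = sigmoid(a) - d                               # s - d
print(numeric, analytic)                                # agree to high precision
```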
C-Class Problem
Assume a 1-of-C encoding scheme.
The network has C outputs sj^k, with targets dj^k ∈ {0, 1} such that
Σj dj^k = 1 and Σj sj^k = 1
Likelihood function:
L = Πk Πj (sj^k)^(dj^k)
Modified Error Function
Cross-entropy error function for the C-class case:
E = −Σk Σj dj^k ln sj^k
Minimum value:
Emin = −Σk Σj dj^k ln dj^k
which is zero for 1-of-C (0/1) targets.
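A sketch of the C-class cross-entropy with softmax outputs, which satisfy the confinement-to-(0,1) and sum-to-one constraints; the net inputs and 1-of-C targets below are illustrative.

```python
import numpy as np

def softmax(a):
    """Outputs confined to (0,1) and summing to 1 across the C classes."""
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Illustrative net inputs for 4 patterns, C = 3 classes, and 1-of-C targets.
a = np.array([[2.0, 0.5, -1.0],
              [0.1, 1.5, 0.3],
              [-0.5, 0.2, 2.2],
              [1.0, 1.0, 1.0]])
d = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

s = softmax(a)
E = -np.sum(d * np.log(s))      # cross-entropy error for the C-class case
print(E)                        # the minimum value 0 is reached when s = d
```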