
Gaussian Discriminant Analysis

S. Sumitra
Department of Mathematics
Indian Institute of Space Science and Technology

MA613 Data Mining


Introduction

Binary Classification
Let {(x1 , y1 ), (x2 , y2 ), . . . , (xN , yN )} be the given data, where
xi ∈ Rn and yi ∈ {1, 0}.

Data    Soleus   Gastrocnemius   yi : 1/0
x1^T    x11      x12             1
x2^T    x21      x22             0
x3^T    x31      x32             0
x4^T    x41      x42             1
x5^T    x51      x52             1

Sum Rule and Product rule

Consider two random variables X and Y. The values taken
by X are {x1 , x2 , . . . , xM } and those of Y are {y1 , y2 , . . . , yL }. Let the
experiment be conducted N times. Let the number of times in
which X = xi and Y = yj be nij , the number of times X = xi be ci ,
and the number of times Y = yj be rj . Then

p(xi , yj ) = nij / N,   i = 1, 2, . . . , M, j = 1, 2, . . . , L

p(xi ) = ci / N,   i = 1, 2, . . . , M

p(yj ) = rj / N,   j = 1, 2, . . . , L
Sum Rule and Product rule

Sum rule:

p(xi ) = Σ_{j=1}^{L} p(xi , yj )

Conditional probability:

p(yj | xi ) = nij / ci = p(xi , yj ) / p(xi )

Product rule:

p(xi , yj ) = nij / N
            = (nij / ci )(ci / N)
            = p(yj | xi ) p(xi )
            = p(xi | yj ) p(yj )
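The counting argument above can be checked numerically. A small sketch (not from the lecture; the counts nij below are made up for illustration):

```python
# A quick numerical check of the sum and product rules.
import numpy as np

n = np.array([[10, 20],     # n[i, j] = number of trials with X = x_i, Y = y_j
              [30, 40]])
N = n.sum()                 # total number of trials

p_xy = n / N                                     # p(x_i, y_j) = n_ij / N
p_x = n.sum(axis=1) / N                          # p(x_i) = c_i / N (row totals)
p_y_given_x = n / n.sum(axis=1, keepdims=True)   # p(y_j | x_i) = n_ij / c_i

# Sum rule: p(x_i) = sum_j p(x_i, y_j)
assert np.allclose(p_x, p_xy.sum(axis=1))
# Product rule: p(x_i, y_j) = p(y_j | x_i) p(x_i)
assert np.allclose(p_xy, p_y_given_x * p_x[:, None])
```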
Bayes Theorem

p(yj | xi ) = p(xi | yj ) p(yj ) / p(xi )

In general,

p(y | x) = p(y) p(x | y) / p(x)

p(y | x) is the posterior, p(x | y) is the likelihood function, p(y) is
the prior, and p(x) is the marginal likelihood.
Approaches of Probability

Frequentist:

P(x) = lim_{nt → ∞} nx / nt

Parameters are estimated by the maximum likelihood estimator.
Bayesian approach:

p(y | x) = p(y) p(x | y) / p(x)
Bayes theorem relates the posterior probability (what we
know about the parameter after seeing the data) to the
likelihood (derived from a statistical model for the observed
data) and the prior (what we knew about the parameter
before we saw the data).
Bayes Theorem: Two class Classification Problem

p(y = 1 | x) = p(y = 1) p(x | y = 1) / p(x)

where

p(x) = p(y = 1) p(x | y = 1) + p(y = 0) p(x | y = 0)


Conditional Probability

Consider two boxes: one red and one blue. Let there be 2
apples and 6 oranges in the red box and 3 apples and 1
orange in the blue box. A box is randomly picked and a
fruit is selected. After observing the fruit it is replaced in
the box. This process is repeated many times. In doing so,
let the red box be picked 40% of the time and blue box
60% of the time. What is the overall probability that the
selection procedure will pick an apple? Given that we
have chosen an orange, what is the probability that the box
we chose was the blue one?
Let B denote the identity of the box and F the identity of the fruit.
P(B = r) = 0.4, P(B = b) = 0.6
p(F = a | B = r) = 1/4, p(F = O | B = r) = 3/4
p(F = a | B = b) = 3/4, p(F = O | B = b) = 1/4

p(F = a) = p(B = r) p(F = a | B = r) + p(B = b) p(F = a | B = b)
         = 1/4 · 4/10 + 3/4 · 6/10
         = 11/20

p(F = O) = 1 − 11/20 = 9/20


p(B = r | F = O) = p(F = O | B = r) p(B = r) / p(F = O)
                 = 2/3

p(B = b | F = O) = 1/3

Therefore, if the observed fruit is an orange, it is more probable to
have come from the red box than from the blue box.
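The calculation above can be sketched in a few lines of Python (the dictionary layout is my own choice, not the lecture's):

```python
# Box-and-fruit example: red box has 2 apples, 6 oranges;
# blue box has 3 apples, 1 orange.
p_B = {"r": 0.4, "b": 0.6}                         # prior over boxes
p_F_given_B = {("a", "r"): 2 / 8, ("O", "r"): 6 / 8,
               ("a", "b"): 3 / 4, ("O", "b"): 1 / 4}

# Sum rule: overall probability of picking an apple
p_a = sum(p_B[b] * p_F_given_B[("a", b)] for b in p_B)   # 11/20
p_O = 1 - p_a                                            # 9/20

# Bayes theorem: posterior probability the box was red, given an orange
p_r_given_O = p_F_given_B[("O", "r")] * p_B["r"] / p_O   # 2/3
print(p_a, p_r_given_O)
```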
Random Vector

A random vector is a random variable with multiple dimensions.
Each element of the vector is a scalar random variable.
Each element has either a finite number of observed empirical
values or a finite or infinite number of potential values.
The potential values are specified by a theoretical joint
probability distribution.
X = (X1 , X2 , . . . , Xn )^T
Linear Relationship

y = mx + c (linear)
y = sin x (nonlinear)
Covariance

A1 A2
1 3
-2 1
5 7
4 5

Covariance(Aj , Aj ) = Variance(Aj ) = Σ_{i=1}^{N} (Aij − Āj )² / (N − 1),   j = 1, 2
Covariance
Covariance(A1 , A2 ) = Σ_{i=1}^{N} (Ai1 − Ā1 )(Ai2 − Ā2 ) / (N − 1)
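These formulas can be checked against NumPy on the small A1/A2 table above; `np.cov` uses the same N − 1 denominator (Bessel's correction):

```python
# Verify the hand formulas for variance and covariance with NumPy.
import numpy as np

A = np.array([[1, 3],      # columns are A1 and A2
              [-2, 1],
              [5, 7],
              [4, 5]], dtype=float)

S = np.cov(A, rowvar=False)     # sample covariance, divides by N - 1

mean = A.mean(axis=0)           # column means (2, 4)
var_A1 = ((A[:, 0] - mean[0]) ** 2).sum() / (len(A) - 1)
cov_A1A2 = ((A[:, 0] - mean[0]) * (A[:, 1] - mean[1])).sum() / (len(A) - 1)

assert np.isclose(S[0, 0], var_A1)      # 10.0
assert np.isclose(S[0, 1], cov_A1A2)    # 8.0
```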
Covariance

Covariance is a measure of the joint variability of two
random variables.
Variables are positively related if they move in the same
direction.
Variables are inversely related if they move in opposite
directions.
It can take any value in (−∞, +∞).
It captures only the linear relationship between variables.
Data: Matrix

 
X = [ x11  x12  . . .  x1n
      x21  x22  . . .  x2n
       .    .           .
      xN1  xN2  . . .  xNn ]
Covariance Matrix

 
Σ = [ cov(A1 , A1 )  cov(A1 , A2 )  . . .  cov(A1 , An )
      cov(A2 , A1 )  cov(A2 , A2 )  . . .  cov(A2 , An )
            .              .                    .
      cov(An , A1 )  cov(An , A2 )  . . .  cov(An , An ) ]

Σ is called the covariance matrix of X


Cov(Aj , Aj ) = Var(Aj ) = Σ_{i=1}^{N} (xij − Āj )² / (N − 1)

Cov(Aj , Ak ) = Σ_{i=1}^{N} (xij − Āj )(xik − Āk ) / (N − 1)
Covariance matrix: Formula

Xc = [ x11 − Ā1   x12 − Ā2   . . .   x1n − Ān
       x21 − Ā1   x22 − Ā2   . . .   x2n − Ān
           .          .                  .
       xN1 − Ā1   xN2 − Ā2   . . .   xNn − Ān ]

Find Xc^T Xc .
Covariance matrix: Formula

Σ = Xc^T Xc / (N − 1)
Unbiased Estimator

An unbiased estimator of a parameter is an estimator
whose expected value is equal to the parameter.
In statistics, Bessel's correction is the use of n − 1 instead
of n in the formula for the sample variance and sample
standard deviation, where n is the number of observations
in a sample. This correction removes the bias in the
estimation of the population variance.
One can understand Bessel's correction through the degrees of
freedom of the residual vector (residuals, not errors,
because the population mean is unknown):
(x1 − x̄, . . . , xn − x̄), where x̄ is the sample mean. While
there are n independent observations in the sample, there
are only n − 1 independent residuals, as they sum to 0.
Example: {x1 , x2 }, xi ∈ R²
A1 = (x11 , x21 )^T , A2 = (x12 , x22 )^T , µ = (µ1 , µ2 )^T
Find (x1 − µ)(x1 − µ)^T + (x2 − µ)(x2 − µ)^T .

cov(A1 , A1 ) = [(x11 − µ1 )² + (x21 − µ1 )²] / (N − 1)

cov(A1 , A2 ) = [(x11 − µ1 )(x12 − µ2 ) + (x21 − µ1 )(x22 − µ2 )] / (N − 1)

cov(A2 , A2 ) = [(x12 − µ2 )² + (x22 − µ2 )²] / (N − 1)

(x1 − µ)(x1 − µ)^T = [ (x11 − µ1 )²            (x11 − µ1 )(x12 − µ2 )
                       (x11 − µ1 )(x12 − µ2 )  (x12 − µ2 )²           ]

(x2 − µ)(x2 − µ)^T = [ (x21 − µ1 )²            (x21 − µ1 )(x22 − µ2 )
                       (x21 − µ1 )(x22 − µ2 )  (x22 − µ2 )²           ]

Hence

Σ = [(x1 − µ)(x1 − µ)^T + (x2 − µ)(x2 − µ)^T ] / (N − 1)
Covariance Matrix: Formula

Σ = Σ_{i=1}^{N} (xi − µ)(xi − µ)^T / (N − 1)
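The outer-product form of Σ above agrees with both the centered-matrix form Xc^T Xc / (N − 1) and NumPy's built-in estimator; a small sketch on random data:

```python
# Check that the outer-product formula for the covariance matrix
# matches the centered-matrix form and np.cov.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))        # N = 5 points in R^2
N = X.shape[0]
mu = X.mean(axis=0)

# Σ = sum_i (x_i - µ)(x_i - µ)^T / (N - 1)
S = sum(np.outer(x - mu, x - mu) for x in X) / (N - 1)

Xc = X - mu                        # centered data matrix
assert np.allclose(S, Xc.T @ Xc / (N - 1))   # Σ = Xc^T Xc / (N - 1)
assert np.allclose(S, np.cov(X, rowvar=False))
```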
Covariance and Independence

The value of the covariance between two random variables lies
in (−∞, +∞).
A positive value of covariance means that the two random
variables tend to vary in the same direction, a negative
value means that they vary in opposite directions, and 0
means that they do not vary together.
Covariance and Independence

Cov(X , Y ) = 0 means there is no linear correlation.
If X and Y are independent, then Cov(X , Y ) = 0.
Cov(X , Y ) = 0 does not imply that X and Y are independent.
Mahalanobis Distance

Euclidean distance does not consider the distribution of the
data points, so it cannot be used to measure the deviation
of a point from a data distribution.
The Mahalanobis distance is the distance between a point and
a distribution:

D² = (x − µ)^T Σ^{-1} (x − µ)

If Σ = I, the Mahalanobis distance reduces to the
Euclidean distance.
If the variables in the dataset are strongly correlated, the
covariance entries will be large; multiplying by Σ^{-1} rescales
the deviation, effectively reducing the distance along directions
of high variance.
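A minimal sketch of the distance with NumPy (the function name is mine); with Σ = I it coincides with the Euclidean distance, and inflating Σ shrinks the distance:

```python
# Mahalanobis distance D = sqrt((x - µ)^T Σ^{-1} (x - µ)).
import numpy as np

def mahalanobis(x, mu, Sigma):
    d = x - mu
    # Solve Σ z = d instead of forming the explicit inverse
    return np.sqrt(d @ np.linalg.solve(Sigma, d))

mu = np.zeros(2)
x = np.array([3.0, 4.0])

# With Σ = I the Mahalanobis distance equals the Euclidean distance
assert np.isclose(mahalanobis(x, mu, np.eye(2)), 5.0)

# With larger variance the same point is effectively closer to the mean
assert np.isclose(mahalanobis(x, mu, 4.0 * np.eye(2)), 2.5)
```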
Normal Distribution
X is a continuous real-valued random variable.
Probability density function (pdf):

p(x) = (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²))

P(a < x < b) = ∫_a^b p(x) dx
Multivariate Gaussian (Normal) Distribution

The multivariate Gaussian distribution is a generalization of
the one-dimensional (univariate) normal distribution to
higher dimensions.
One definition is that a random vector is said to be
k-variate normally distributed if every linear combination of
its k components has a univariate normal distribution.
X = (X1 , X2 , . . . , Xn ) ∼ N (µ, Σ)
X1 ∼ N (µ1 , σ1 ), X2 ∼ N (µ2 , σ2 ), . . . , Xn ∼ N (µn , σn )
Multivariate Gaussian (Normal) Distribution

The probability density function is

p(x) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp(−½ (x − µ)^T Σ^{-1} (x − µ))

where |Σ| is the determinant of the covariance matrix Σ.
The attributes are continuous valued.
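The density can be sketched directly from the formula. As a sanity check (my own, not the lecture's), for a diagonal Σ the joint density must factor into a product of univariate normal densities:

```python
# Multivariate Gaussian pdf, checked against the univariate product
# in the diagonal-covariance case.
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """p(x) = exp(-1/2 (x-µ)^T Σ^{-1} (x-µ)) / ((2π)^{n/2} |Σ|^{1/2})"""
    n = len(mu)
    d = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm

mu = np.array([0.0, 1.0])
sig = np.array([1.5, 0.5])          # per-coordinate standard deviations
x = np.array([1.0, -1.0])

# For diagonal Σ the joint density factorizes into univariate normals
uni = np.prod(np.exp(-((x - mu) ** 2) / (2 * sig ** 2))
              / (np.sqrt(2 * np.pi) * sig))
assert np.isclose(mvn_pdf(x, mu, np.diag(sig ** 2)), uni)
```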
Linear Discriminant Analysis (LDA): Binary
Classification

Bayes Theorem
Data: {(xi , yi ), i = 1, 2, . . . , N}, xi ∈ Rn , yi ∈ {1, 0}

p(y = 1 | x) = p(y = 1) p(x | y = 1) / p(x)

p(y): prior probability of y
p(x | y): the distribution of x given y
Parameters

y ∼ Bernoulli(φ)

x | (y = 0) ∼ N (µ0 , Σ)

x | (y = 1) ∼ N (µ1 , Σ)

The MLE of the Bernoulli parameter φ is the sample mean of the
labels; the class-conditional densities are multivariate Gaussians.
Determination of Parameters

φ = (number of times y = 1 appears) / (total number of data)
  = Σ_{i=1}^{N} 1(yi = 1) / N

µ1 = (sum of positive data) / (total number of positive data)
   = Σ_{i=1}^{N} xi 1(yi = 1) / Σ_{i=1}^{N} 1(yi = 1)

µ0 = (sum of negative data) / (total number of negative data)
   = Σ_{i=1}^{N} xi 1(yi = 0) / Σ_{i=1}^{N} 1(yi = 0)
Determination of Parameters

Positive class: {xp1 , xp2 , . . . , xpk }
Negative class: {xn1 , xn2 , . . . , xnl }

Σ1 = Σ_{i=1}^{k} (xpi − µ1 )(xpi − µ1 )^T / (k − 1)

Σ0 = Σ_{i=1}^{l} (xni − µ0 )(xni − µ0 )^T / (l − 1)

Σ = ((k − 1)Σ1 + (l − 1)Σ0 ) / (k + l − 2)
  = ((k − 1)Σ1 + (l − 1)Σ0 ) / (N − 2)
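The parameter estimates above can be sketched in NumPy; the function name `lda_fit` and the synthetic data are my own choices, not part of the notes:

```python
# Binary LDA parameter estimation: φ, class means, pooled covariance
# Σ = ((k-1)Σ1 + (l-1)Σ0) / (k + l - 2), as in the notes.
import numpy as np

def lda_fit(X, y):
    X1, X0 = X[y == 1], X[y == 0]
    phi = y.mean()                           # φ = fraction of positives
    mu1, mu0 = X1.mean(axis=0), X0.mean(axis=0)
    k, l = len(X1), len(X0)
    S1 = np.cov(X1, rowvar=False)            # Σ1, divides by k - 1
    S0 = np.cov(X0, rowvar=False)            # Σ0, divides by l - 1
    Sigma = ((k - 1) * S1 + (l - 1) * S0) / (k + l - 2)   # pooled Σ
    return phi, mu1, mu0, Sigma

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1, (40, 2)),
               rng.normal([3, 3], 1, (60, 2))])
y = np.array([0] * 40 + [1] * 60)
phi, mu1, mu0, Sigma = lda_fit(X, y)
assert np.isclose(phi, 0.6)
assert np.allclose(Sigma, Sigma.T)           # covariance is symmetric
```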
Algorithms

Discriminative learning algorithms
  Learn p(y | x) directly (such as logistic regression), i.e.,
  learn a mapping directly from the space of inputs X to the labels.
  Example: logistic regression
Generative learning algorithms
  Model p(x | y) (and the prior p(y)).
  Example: LDA
Output: Sigmoid Function

p(y = 1 | x) = p(y = 1) p(x | y = 1) / p(x)
             = p(y = 1) p(x | y = 1) / [p(y = 1) p(x | y = 1) + p(y = 0) p(x | y = 0)]
             = 1 / [1 + p(y = 0) p(x | y = 0) / (p(y = 1) p(x | y = 1))]
             = 1 / (1 + exp(−a))

where a = log [p(y = 1) p(x | y = 1) / (p(y = 0) p(x | y = 0))].

If a ≥ 0, then p(y = 1 | x) ≥ 0.5; if a < 0, then p(y = 1 | x) < 0.5.
Decision Boundary: LDA
Let l1 = log p(x | y = 1), l0 = log p(x | y = 0),
π1 = p(y = 1), π0 = p(y = 0). Then

a = log π1 + l1 − log π0 − l0

l1 − l0 = −½ (x − µ1 )^T Σ^{-1} (x − µ1 ) + ½ (x − µ0 )^T Σ^{-1} (x − µ0 )

        = ½ [x^T Σ^{-1} µ1 + µ1^T Σ^{-1} x − µ1^T Σ^{-1} µ1
             − x^T Σ^{-1} µ0 − µ0^T Σ^{-1} x + µ0^T Σ^{-1} µ0 ]
          (the x^T Σ^{-1} x terms cancel)

        = ½ [x^T Σ^{-1} (µ1 − µ0 ) + (µ1^T − µ0^T ) Σ^{-1} x]
          − ½ (µ1^T Σ^{-1} µ1 − µ0^T Σ^{-1} µ0 )

        = (µ1^T − µ0^T ) Σ^{-1} x − ½ (µ1^T Σ^{-1} µ1 − µ0^T Σ^{-1} µ0 )

using the symmetry of Σ^{-1}, so that x^T Σ^{-1} µ = µ^T Σ^{-1} x.
Decision Boundary: LDA

a = log(π1 /π0 ) + (µ1^T − µ0^T ) Σ^{-1} x − ½ (µ1^T Σ^{-1} µ1 − µ0^T Σ^{-1} µ0 )
  = w^T x + w0 ,

where w = Σ^{-1} (µ1 − µ0 ) and

w0 = −½ µ1^T Σ^{-1} µ1 + ½ µ0^T Σ^{-1} µ0 + log(π1 /π0 )

The decision boundary a = 0 is therefore linear in x.
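The closed-form w and w0 can be sketched directly (function names and the symmetric example means are my own choices):

```python
# LDA decision rule: a = w^T x + w0, predict class 1 when a >= 0.
import numpy as np

def lda_weights(mu1, mu0, Sigma, pi1):
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu0)                  # w = Σ^{-1}(µ1 - µ0)
    w0 = (-0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0
          + np.log(pi1 / (1 - pi1)))        # log(π1/π0)
    return w, w0

mu1, mu0 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
w, w0 = lda_weights(mu1, mu0, np.eye(2), pi1=0.5)

# a >= 0  <=>  p(y = 1 | x) >= 0.5
assert w @ np.array([1.0, 1.0]) + w0 >= 0    # classified as class 1
assert w @ np.array([-1.0, -1.0]) + w0 < 0   # classified as class 0
```

With equal priors and symmetric means, w0 = 0 and the boundary passes through the midpoint between the two class means.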
Decision Boundary: LDA
Determine the class

y = 0, 1. Find µk for each class. Find the common
covariance matrix Σ as the weighted average of the Σk .
x | (y = k) ∼ N (µk , Σ).

Method 1:

Ĝ(x) = arg max_k p(y = k) p(x | y = k)

Method 2:

p(y = 1 | x) = p(y = 1) p(x | y = 1) / p(x)

p(y = 0 | x) = 1 − p(y = 1 | x)
Multiclass
C = 1, 2, . . . , m. Find µk for each class k. Find the
common covariance matrix Σ as the weighted average of the Σk .
x | (C = k) ∼ N (µk , Σ).

Ĝ(x) = arg max_k p(C = k | X = x)
     = arg max_k p(C = k) p(x | C = k)
     = arg max_k log [p(C = k) p(x | C = k)]
     = arg max_k (−½ (x − µk )^T Σ^{-1} (x − µk ) + log πk )
     = arg max_k (½ (−x^T Σ^{-1} x + x^T Σ^{-1} µk + µk^T Σ^{-1} x − µk^T Σ^{-1} µk ) + log πk )
     = arg max_k (µk^T Σ^{-1} x − ½ µk^T Σ^{-1} µk + log πk )

since the term −½ x^T Σ^{-1} x is the same for every class.
Linear discriminant function

Define the linear discriminant function

δk (x) = µk^T Σ^{-1} x − ½ µk^T Σ^{-1} µk + log πk

where p(C = k) = πk .
Then Ĝ(x) = arg max_k δk (x).
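The discriminant δk can be sketched directly from the formula; the three class means and uniform priors below are made-up illustration values:

```python
# δ_k(x) = µ_k^T Σ^{-1} x - 1/2 µ_k^T Σ^{-1} µ_k + log π_k for each class.
import numpy as np

def lda_discriminants(x, mus, Sigma, pis):
    Sinv = np.linalg.inv(Sigma)
    return np.array([m @ Sinv @ x - 0.5 * m @ Sinv @ m + np.log(p)
                     for m, p in zip(mus, pis)])

mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
pis = [1 / 3, 1 / 3, 1 / 3]
x = np.array([2.5, 0.2])

# Ĝ(x) = arg max_k δ_k(x): x lies closest to the second mean
assert int(np.argmax(lda_discriminants(x, mus, np.eye(2), pis))) == 1
```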
Decision Boundary:Multi Class LDA

The decision boundary between classes k and l is
{x : δk (x) = δl (x)}, or equivalently,

µk^T Σ^{-1} x − ½ µk^T Σ^{-1} µk + log πk = µl^T Σ^{-1} x − ½ µl^T Σ^{-1} µl + log πl

log(πk /πl ) + (µk^T − µl^T ) Σ^{-1} x − ½ µk^T Σ^{-1} µk + ½ µl^T Σ^{-1} µl = 0

That is,

log(πk /πl ) − ½ (µk + µl )^T Σ^{-1} (µk − µl ) + (µk^T − µl^T ) Σ^{-1} x = 0
Quadratic Discriminant Analysis

Estimate a mean µ̂k and a covariance matrix Σ̂k for each class
separately:
x | (C = k) ∼ N (µk , Σk ), k = 1, 2, . . . , m.
Therefore

p(x | C = k) = (1 / ((2π)^{n/2} |Σk |^{1/2})) exp(−½ (x − µk )^T Σk^{-1} (x − µk ))
Quadratic Discriminant Analysis: Maximum A
Posteriori (MAP) Estimation

Ĝ(x) = arg max_k p(C = k | X = x)
     = arg max_k p(C = k) p(x | C = k)
     = arg max_k log [p(C = k) p(x | C = k)]
     = arg max_k (−½ log |Σk | − ½ (x − µk )^T Σk^{-1} (x − µk ) + log πk )

where p(C = k) = πk .
Quadratic Discriminant Analysis: Discriminant
Function

Quadratic discriminant function:

δk (x) = −½ log |Σk | − ½ (x − µk )^T Σk^{-1} (x − µk ) + log πk
       = µk^T Σk^{-1} x − ½ µk^T Σk^{-1} µk + log πk − ½ log |Σk | − ½ x^T Σk^{-1} x

This objective is now quadratic in x, and so are the decision
boundaries.
Classification rule:

Ĝ(x) = arg max_k δk (x)
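The quadratic discriminant can be sketched as follows; the two classes below are illustrative, with different covariances so that QDA genuinely differs from LDA:

```python
# δ_k(x) = -1/2 log|Σ_k| - 1/2 (x-µ_k)^T Σ_k^{-1} (x-µ_k) + log π_k
import numpy as np

def qda_discriminant(x, mu, Sigma, pi):
    d = x - mu
    return (-0.5 * np.linalg.slogdet(Sigma)[1]    # -1/2 log|Σ_k|
            - 0.5 * d @ np.linalg.solve(Sigma, d)
            + np.log(pi))

params = [(np.array([0.0, 0.0]), np.eye(2), 0.5),        # class 0
          (np.array([4.0, 0.0]), 4.0 * np.eye(2), 0.5)]  # class 1

x = np.array([0.5, 0.0])
deltas = [qda_discriminant(x, mu, S, pi) for mu, S, pi in params]
assert int(np.argmax(deltas)) == 0    # Ĝ(x): x is assigned to class 0
```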
Decision Boundary: QDA

The decision boundary between classes k and l is
{x : δk (x) = δl (x)},
which is a quadratic decision boundary.
GDA and Logistic Regression

GDA makes stronger modeling assumptions and is more
data efficient (i.e., requires less training data to learn well)
when the modeling assumptions are correct or at least
approximately correct.
In GDA, the attributes are continuous valued. In logistic
regression, attributes can be discrete or continuous.
Logistic regression makes weaker assumptions and is
significantly more robust to deviations from the modeling
assumptions.
Specifically, when the data is indeed non-Gaussian, then in
the limit of large datasets, logistic regression will almost
always do better than GDA. For this reason, in practice
logistic regression is used more often than GDA.
