Gaussian Discriminant Analysis
S. Sumitra
Department of Mathematics
Indian Institute of Space Science and Technology
Binary Classification
Let {(x1, y1), (x2, y2), . . . , (xN, yN)} be the given data, where xi ∈ Rn and yi ∈ {1, 0}.
Let $n_{ij}$ denote the number of observations with value $x_i$ and label $y_j$, and let $c_i = \sum_j n_{ij}$ be the number of observations with value $x_i$. Then
\[ p(x_i) = \sum_{j=1}^{L} p(x_i, y_j) \]
\[ p(y_j \mid x_i) = \frac{n_{ij}}{c_i} = \frac{p(x_i, y_j)}{p(x_i)} \]
\[ p(x_i, y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(y_j \mid x_i)\, p(x_i) = p(x_i \mid y_j)\, p(y_j) \]
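As a quick numerical illustration of the sum and product rules above, here is a small Python sketch; the count table and variable names are made up purely for illustration:

```python
import numpy as np

# Made-up count table n_ij: rows index values x_i, columns index labels y_j
n = np.array([[3, 1],   # counts for x_1 paired with y_1, y_2
              [2, 4]])  # counts for x_2 paired with y_1, y_2
N = n.sum()
c = n.sum(axis=1)                 # c_i: number of times x_i occurs

p_xy = n / N                      # joint: p(x_i, y_j) = n_ij / N
p_x = p_xy.sum(axis=1)            # sum rule: p(x_i) = sum_j p(x_i, y_j)
p_y_given_x = n / c[:, None]      # conditional: p(y_j | x_i) = n_ij / c_i

# Product rule check: p(x_i, y_j) = p(y_j | x_i) * p(x_i)
print(np.allclose(p_xy, p_y_given_x * p_x[:, None]))  # True
```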
Bayes Theorem
\[ p(y \mid x) = \frac{p(y)\, p(x \mid y)}{p(x)} \]
p(y | x) is the posterior, p(x | y) is the likelihood function, p(y) is the prior, and p(x) is the marginal likelihood.
Approaches to Probability
Frequentist:
\[ P(x) = \lim_{n_t \to \infty} \frac{n_x}{n_t} \]
Maximum likelihood Estimator
Bayesian Approach
\[ p(y \mid x) = \frac{p(y)\, p(x \mid y)}{p(x)} \]
Bayes theorem relates the posterior probability (what we
know about the parameter after seeing the data) to the
likelihood (derived from a statistical model for the observed
data) and the prior (what we knew about the parameter
before we saw the data).
Bayes Theorem: Two-class Classification Problem
\[ p(y = 1 \mid x) = \frac{p(y = 1)\, p(x \mid y = 1)}{p(x)} \]
where $p(x) = p(y = 1)\, p(x \mid y = 1) + p(y = 0)\, p(x \mid y = 0)$.
Consider two boxes: one red and one blue. Let there be 2
apples and 6 oranges in the red box and 3 apples and 1
orange in the blue box. A box is randomly picked and a
fruit is selected. After observing the fruit it is replaced in
the box. This process is repeated many times. In doing so,
let the red box be picked 40% of the time and blue box
60% of the time. What is the overall probability that the
selection procedure will pick an apple? Given that we
have chosen an orange, what is the probability that the box
we chose was the blue one?
Let B denote the identity of the box and F the identity of the fruit.
P(B = r) = 0.4, P(B = b) = 0.6
p(F = a | B = r) = 1/4, p(F = o | B = r) = 3/4, p(F = a | B = b) = 3/4, p(F = o | B = b) = 1/4
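Applying the sum and product rules to these numbers answers both questions:
\[ p(F = a) = p(F = a \mid B = r)\, P(B = r) + p(F = a \mid B = b)\, P(B = b) = \tfrac{1}{4} \times 0.4 + \tfrac{3}{4} \times 0.6 = 0.55 \]
\[ p(B = b \mid F = o) = \frac{p(F = o \mid B = b)\, P(B = b)}{p(F = o)} = \frac{\tfrac{1}{4} \times 0.6}{1 - 0.55} = \frac{0.15}{0.45} = \tfrac{1}{3} \]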
y = mx + c
y = sin x
Covariance
A1 A2
1 3
-2 1
5 7
4 5
\[ \mathrm{Covariance}(A_j, A_j) = \mathrm{Variance}(A_j) = \frac{\sum_{i=1}^{N} (A_{ij} - \bar{A}_j)^2}{N - 1}, \quad j = 1, 2 \]
Covariance
\[ \mathrm{Covariance}(A_1, A_2) = \frac{\sum_{i=1}^{N} (A_{i1} - \bar{A}_1)(A_{i2} - \bar{A}_2)}{N - 1} \]
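Applied to the small table above ($N = 4$, $\bar{A}_1 = 2$, $\bar{A}_2 = 4$), these formulas give:
\[ \mathrm{Var}(A_1) = \frac{(-1)^2 + (-4)^2 + 3^2 + 2^2}{3} = 10, \qquad \mathrm{Var}(A_2) = \frac{(-1)^2 + (-3)^2 + 3^2 + 1^2}{3} = \frac{20}{3} \]
\[ \mathrm{Cov}(A_1, A_2) = \frac{(-1)(-1) + (-4)(-3) + (3)(3) + (2)(1)}{3} = 8 \]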
Covariance
\[ X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nn} \end{pmatrix} \]
Covariance Matrix
\[ \Sigma = \begin{pmatrix} \mathrm{cov}(A_1, A_1) & \mathrm{cov}(A_1, A_2) & \cdots & \mathrm{cov}(A_1, A_n) \\ \mathrm{cov}(A_2, A_1) & \mathrm{cov}(A_2, A_2) & \cdots & \mathrm{cov}(A_2, A_n) \\ \vdots & \vdots & & \vdots \\ \mathrm{cov}(A_n, A_1) & \mathrm{cov}(A_n, A_2) & \cdots & \mathrm{cov}(A_n, A_n) \end{pmatrix} \]
\[ \Sigma = \frac{X_c^T X_c}{N - 1} \]
where $X_c$ is the data matrix $X$ with the mean of each column subtracted.
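A minimal numpy sketch checking the matrix formula on the two-attribute table above; np.cov with rowvar=False uses the same N − 1 denominator:

```python
import numpy as np

# The two-attribute example table (columns A1, A2)
X = np.array([[ 1.0, 3.0],
              [-2.0, 1.0],
              [ 5.0, 7.0],
              [ 4.0, 5.0]])
N = X.shape[0]

Xc = X - X.mean(axis=0)          # center each column
Sigma = Xc.T @ Xc / (N - 1)      # covariance matrix from the matrix formula

print(Sigma)                     # [[10.  8.], [ 8.  6.6667]]
print(np.cov(X, rowvar=False))   # numpy's estimator gives the same matrix
```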
Unbiased Estimator
For example, with $x_2 = (x_{21}, x_{22})^T$ and $\mu = (\mu_1, \mu_2)^T$,
\[ (x_2 - \mu)(x_2 - \mu)^T = \begin{pmatrix} (x_{21} - \mu_1)^2 & (x_{21} - \mu_1)(x_{22} - \mu_2) \\ (x_{21} - \mu_1)(x_{22} - \mu_2) & (x_{22} - \mu_2)^2 \end{pmatrix} \]
\[ \Sigma = \frac{\sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T}{N - 1} \]
Covariance and Independence
\[ D^2 = (x - \mu)^T \Sigma^{-1} (x - \mu) \]
If Σ = I, the Mahalanobis distance reduces to the Euclidean distance.
If the variables in the dataset are strongly correlated, the corresponding covariances are large; dividing by them (through Σ⁻¹) effectively reduces the distance along those directions.
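A short numpy sketch of the squared Mahalanobis distance; the mean, covariance, and test point below are made-up values for illustration:

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance D^2 = (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ np.linalg.solve(Sigma, d))  # solve avoids forming the inverse explicitly

# Made-up values, purely for illustration
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.5],
                  [1.5, 2.0]])
x = np.array([1.0, 1.0])

print(mahalanobis_sq(x, mu, Sigma))      # distance shrinks along the correlated direction
print(mahalanobis_sq(x, mu, np.eye(2)))  # with Sigma = I this is the squared Euclidean distance (2.0)
```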
Normal Distribution
X is a continuous real-valued random variable.
Probability density function (pdf):
\[ p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \]
\[ P(a < x < b) = \int_a^b p(x)\, dx \]
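As a quick numerical check of this integral (assuming scipy is available; µ = 0, σ = 1, a = −1, b = 1 are made-up values):

```python
from scipy.stats import norm

mu, sigma = 0.0, 1.0   # made-up parameters
a, b = -1.0, 1.0

# P(a < x < b) = F(b) - F(a), where F is the Gaussian CDF
prob = norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)
print(prob)            # approximately 0.6827
```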
Multivariate Gaussian (Normal) Distribution
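For reference, the multivariate normal density with mean $\mu \in \mathbb{R}^n$ and covariance $\Sigma$, which the class-conditional models below use, is
\[ p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right) \]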
Bayes Theorem
Data: {(xi , yi ), i = 1, 2, . . . N}, xi ∈ Rn , yi ∈ {1, 0}
\[ p(y = 1 \mid x) = \frac{p(y = 1)\, p(x \mid y = 1)}{p(x)} \]
p(y): prior probability of y
p(x | y): the distribution of x given y
Parameters
y ∼ Bernoulli(φ)
x | y = 0 ∼ N(µ0, Σ)
x | y = 1 ∼ N(µ1, Σ)
The MLE of the Bernoulli parameter φ is the sample mean of the labels.
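A short derivation of this standard fact, for i.i.d. labels $y_1, \ldots, y_N \in \{0, 1\}$:
\[ \ell(\phi) = \log \prod_{i=1}^{N} \phi^{y_i} (1 - \phi)^{1 - y_i} = \sum_{i=1}^{N} \left[ y_i \log \phi + (1 - y_i) \log(1 - \phi) \right] \]
\[ \frac{d\ell}{d\phi} = \frac{\sum_i y_i}{\phi} - \frac{N - \sum_i y_i}{1 - \phi} = 0 \;\Rightarrow\; \hat{\phi} = \frac{1}{N} \sum_{i=1}^{N} y_i \]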
Multivariate Gaussian Distribution
Determination of Parameters
\[ \phi = \frac{\text{number of times } y = 1 \text{ appears}}{\text{total number of data points}} = \frac{\sum_{i=1}^{N} 1(y_i = 1)}{N} \]
Let $x_{p1}, \ldots, x_{pk}$ be the $k$ inputs with $y_i = 1$ (with sample mean $\mu_1$) and $x_{n1}, \ldots, x_{nl}$ the $l$ inputs with $y_i = 0$ (with sample mean $\mu_0$), so that $k + l = N$. Then
\[ \Sigma_1 = \frac{\sum_{i=1}^{k} (x_{pi} - \mu_1)(x_{pi} - \mu_1)^T}{k - 1} \]
\[ \Sigma_0 = \frac{\sum_{i=1}^{l} (x_{ni} - \mu_0)(x_{ni} - \mu_0)^T}{l - 1} \]
\[ \Sigma = \frac{(k - 1)\Sigma_1 + (l - 1)\Sigma_0}{k + l - 2} = \frac{(k - 1)\Sigma_1 + (l - 1)\Sigma_0}{N - 2} \]
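A minimal numpy sketch of these estimates; the names X, y, and fit_lda_parameters are illustrative, and the class means µ1, µ0 are taken to be the usual per-class sample means:

```python
import numpy as np

def fit_lda_parameters(X, y):
    """Estimate phi, mu0, mu1 and the pooled covariance Sigma.

    X: (N, n) array of inputs, y: (N,) array of 0/1 labels.
    """
    Xp, Xn = X[y == 1], X[y == 0]   # positive (y = 1) and negative (y = 0) samples
    k, l = len(Xp), len(Xn)

    phi = k / len(X)                # fraction of the data with y = 1
    mu1 = Xp.mean(axis=0)           # per-class sample means
    mu0 = Xn.mean(axis=0)

    Sigma1 = (Xp - mu1).T @ (Xp - mu1) / (k - 1)
    Sigma0 = (Xn - mu0).T @ (Xn - mu0) / (l - 1)
    Sigma = ((k - 1) * Sigma1 + (l - 1) * Sigma0) / (k + l - 2)  # pooled covariance
    return phi, mu0, mu1, Sigma
```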
Algorithms
\[ p(y = 1 \mid x) = \frac{p(y = 1)\, p(x \mid y = 1)}{p(x)} \]
\[ = \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 1)\, p(x \mid y = 1) + p(y = 0)\, p(x \mid y = 0)} \]
\[ = \frac{1}{1 + \dfrac{p(y = 0)\, p(x \mid y = 0)}{p(y = 1)\, p(x \mid y = 1)}} \]
\[ = \frac{1}{1 + \exp(-a)} \]
where
\[ a = \log \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 0)\, p(x \mid y = 0)} \]
If a ≥ 0, then p(y = 1 | x) ≥ 0.5; if a < 0, then p(y = 1 | x) < 0.5.
Decision Boundary: LDA
Let $l_1 = \log p(x \mid y = 1)$, $l_0 = \log p(x \mid y = 0)$, $\pi_1 = p(y = 1)$, $\pi_0 = p(y = 0)$. Then
\[ a = \log \pi_1 + l_1 - \log \pi_0 - l_0 \]
Since both classes share the covariance $\Sigma$, the Gaussian normalization terms cancel in $l_1 - l_0$:
\[ l_1 - l_0 = -\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \]
\[ = \frac{1}{2}\left( -x^T \Sigma^{-1} x + x^T \Sigma^{-1} \mu_1 + \mu_1^T \Sigma^{-1} x - \mu_1^T \Sigma^{-1} \mu_1 + x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_0 - \mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0 \right) \]
\[ = \frac{1}{2}\left( x^T \Sigma^{-1} (\mu_1 - \mu_0) + (\mu_1^T - \mu_0^T) \Sigma^{-1} x - \mu_1^T \Sigma^{-1} \mu_1 + \mu_0^T \Sigma^{-1} \mu_0 \right) \]
\[ = (\mu_1^T - \mu_0^T) \Sigma^{-1} x - \frac{1}{2}\left( \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \right) \]
where the last step uses the symmetry of $\Sigma^{-1}$, so that $x^T \Sigma^{-1} (\mu_1 - \mu_0) = (\mu_1^T - \mu_0^T) \Sigma^{-1} x$.
Decision Boundary: LDA
\[ a = \log \frac{\pi_1}{\pi_0} + (\mu_1^T - \mu_0^T) \Sigma^{-1} x - \frac{1}{2}\left( \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \right) \]
\[ = w^T x + w_0, \quad \text{where } w = \Sigma^{-1} (\mu_1 - \mu_0) \text{ and } w_0 = -\frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \log \frac{\pi_1}{\pi_0} \]
The decision boundary $a = w^T x + w_0 = 0$ is linear in $x$.
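A minimal numpy sketch of this decision rule, reusing parameters such as those returned by the fitting sketch above (function and variable names are illustrative):

```python
import numpy as np

def lda_decision(x, mu0, mu1, Sigma, pi0, pi1):
    """Return a = w^T x + w0 and the predicted class (1 if a >= 0, else 0)."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu0 @ Sigma_inv @ mu0
          + np.log(pi1 / pi0))
    a = w @ x + w0
    return a, int(a >= 0)

# The posterior follows from the sigmoid form: p(y = 1 | x) = 1 / (1 + exp(-a))
```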
Decision Boundary: LDA
Determine the class
Method 1: compute $a = w^T x + w_0$ and assign class 1 if $a \geq 0$, class 0 otherwise.
Method 2: compute the posterior
\[ p(y = 1 \mid x) = \frac{p(y = 1)\, p(x \mid y = 1)}{p(x)} \]
\[ p(y = 0 \mid x) = 1 - p(y = 1 \mid x) \]
and assign the class with the larger posterior.
Multiclass
Let the classes be C = 1, 2, . . . , m. Find µk for each class k, and find the common covariance matrix Σ as the weighted average of the per-class covariance matrices Σk. Assume
\[ x \mid C = k \sim \mathcal{N}(\mu_k, \Sigma) \]
\[ \hat{G}(x) = \arg\max_k \; p(C = k \mid X = x) \]
\[ = \arg\max_k \; p(C = k)\, p(x \mid C = k) \]
\[ = \arg\max_k \; \log\left( p(C = k)\, p(x \mid C = k) \right) \]
\[ = \arg\max_k \left( -\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) + \log \pi_k \right) \]
\[ = \arg\max_k \left( \frac{1}{2}\left( -x^T \Sigma^{-1} x + x^T \Sigma^{-1} \mu_k + \mu_k^T \Sigma^{-1} x - \mu_k^T \Sigma^{-1} \mu_k \right) + \log \pi_k \right) \]
\[ = \arg\max_k \left( \mu_k^T \Sigma^{-1} x - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k \right) \]
where $p(C = k) = \pi_k$; the Gaussian normalization constant and the term $-\frac{1}{2} x^T \Sigma^{-1} x$ are dropped because they do not depend on $k$. This is a linear discriminant function in $x$.
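A minimal numpy sketch of the multiclass rule (names are illustrative; mus is the list of class means, pis the class priors, Sigma the shared covariance, all assumed to be estimated as above):

```python
import numpy as np

def lda_predict(x, mus, pis, Sigma):
    """Pick the class k maximizing the linear discriminant
    delta_k(x) = mu_k^T Sigma^{-1} x - 0.5 * mu_k^T Sigma^{-1} mu_k + log pi_k."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = [mu @ Sigma_inv @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)
              for mu, pi in zip(mus, pis)]
    return int(np.argmax(scores))  # index 0, ..., m-1 of the winning class
```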
Quadratic Discriminant Analysis: Discriminant Function