Lec-04 - Linear Discriminant Analysis

The document discusses Linear Discriminant Analysis (LDA), its advantages over Logistic Regression, and its application in classification problems with multiple classes. It explains the Bayes Decision Boundary and the use of Bayes Theorem for optimal classification, along with parameter estimation for LDA. Additionally, it contrasts LDA with Quadratic Discriminant Analysis (QDA) and provides examples and error rates from classification tasks.

Linear Discriminant Analysis

Dr. Sayak Roychowdhury


Department of Industrial & Systems Engineering,
IIT Kharagpur
Reference Books
• James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. New York: Springer.
• Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
Why LDA?
• When the classes are well separated, the parameter estimates for Logistic Regression can become unstable
• When 𝑛 is small and the distribution of 𝑋 is approximately normal, LDA is more stable
• LDA is also attractive when there are more than 2 classes, since it provides a low-dimensional view of the data
• When the assumed population model is right, the Bayes rule is the optimal classifier
Decision boundaries

[Figure: left, linear decision boundaries found by LDA; right, quadratic decision boundaries obtained using LDA.]
Bayes Decision Boundary
• The test error rate of a classification problem is minimized when each new observation is assigned to the class $j$ for which $P(Y = j \mid X = x_0)$ is largest
• A classifier that does this is called the Bayes classifier
• In a two-class problem with $j \in \{1, 2\}$, the Bayes classifier predicts class 1 when
  $P(Y = 1 \mid X = x_0) > 0.5$
• The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate
Bayes Theorem for Classification
• $\Pr(Y = k \mid X = x) = \dfrac{\Pr(X = x \mid Y = k)\,\Pr(Y = k)}{\Pr(X = x)}$
• $\Pr(Y = k \mid X = x) = \dfrac{\pi_k f_k(x)}{\sum_l \pi_l f_l(x)}$
• where $f_k(x) = \Pr(X = x \mid Y = k)$ is the conditional density of $X$ in class $k$, and $\pi_k = \Pr(Y = k)$ is the prior probability of class $k$
• We need to know the class posterior $\Pr(Y = k \mid X = x)$ for optimal classification
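As a quick numerical illustration (the priors and density values here are made up, not from the lecture): suppose $K = 2$, $\pi_1 = 0.3$, $\pi_2 = 0.7$, and at some point $x_0$ the class densities are $f_1(x_0) = 0.20$ and $f_2(x_0) = 0.05$. Then
$\Pr(Y = 1 \mid X = x_0) = \dfrac{0.3 \times 0.20}{0.3 \times 0.20 + 0.7 \times 0.05} = \dfrac{0.060}{0.095} \approx 0.63,$
so the observation would be assigned to class 1 despite its smaller prior.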
Linear Discriminant Analysis for One Predictor
• An observation is assigned to the class $k$ for which $p_k(x) = \Pr(Y = k \mid X = x)$ is greatest
• Assume each class density is Gaussian:
  $f_k(x) = \dfrac{1}{\sigma_k \sqrt{2\pi}} \exp\!\left(-\dfrac{1}{2\sigma_k^2}(x - \mu_k)^2\right)$
  where $\mu_k$ and $\sigma_k^2$ are the mean and variance parameters of the $k$th class
• For Linear Discriminant Analysis, it is assumed that all classes share a common variance:
  $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_K^2 = \sigma^2$
Linear Discriminant Analysis for One Predictor
• $p_k(x) = \Pr(Y = k \mid X = x) = \dfrac{\pi_k f_k(x)}{\sum_l \pi_l f_l(x)} = \dfrac{\pi_k \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}{\sum_l \pi_l \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)}$
• The Bayes classifier assigns an observation at $X = x$ to the class for which $p_k(x)$ is largest
• Taking logs and discarding terms that do not depend on $k$, this is equivalent to assigning the observation to the class for which the discriminant score $\delta_k(x)$ is largest:
  $\delta_k(x) = \dfrac{x\,\mu_k}{\sigma^2} - \dfrac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$
Bayes Decision Boundary
• For $K = 2$ and $\pi_1 = \pi_2$, an observation is assigned to class 1 if $\delta_1(x) > \delta_2(x)$, i.e., if
  $2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2$
• The Bayes decision boundary is the point where the two discriminant scores are equal:
  $x = \dfrac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} = \dfrac{\mu_1 + \mu_2}{2}$
  (for example, with $\mu_1 = 0$ and $\mu_2 = 2$ the boundary sits at the midpoint $x = 1$)
Parameter Estimation
• $\hat{\mu}_k = \dfrac{1}{n_k} \sum_{i: y_i = k} x_i$
• If no knowledge of the prior probability $\pi_k$ is available, it can be estimated from the training proportions:
  $\hat{\pi}_k = \dfrac{n_k}{N}$
• $\hat{\sigma}^2 = \dfrac{1}{N - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)^2$
• The LDA classifier plugs these estimates into the discriminant function and evaluates it at the observation $X = x$:
  $\hat{\delta}_k(x) = \dfrac{x\,\hat{\mu}_k}{\hat{\sigma}^2} - \dfrac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_k)$
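The examples in these slides use R, so a minimal R sketch of these plug-in estimates may help; the simulated sample sizes, means, and variance below are illustrative assumptions, not values from the lecture:

# One predictor, two classes: simulate data (illustrative values only)
set.seed(1)
n1 <- 60; n2 <- 40
x <- c(rnorm(n1, mean = 0), rnorm(n2, mean = 2))   # common sd = 1
y <- c(rep(1, n1), rep(2, n2))

N <- length(x); K <- 2
pi.hat     <- table(y) / N                        # pi_k = n_k / N
mu.hat     <- tapply(x, y, mean)                  # class sample means
sigma2.hat <- sum((x - mu.hat[y])^2) / (N - K)    # pooled variance estimate

# Plug-in discriminant: delta_k(x) = x mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
delta <- function(x0, k)
  x0 * mu.hat[k] / sigma2.hat - mu.hat[k]^2 / (2 * sigma2.hat) + log(pi.hat[k])

x0 <- 1.0
which.max(c(delta(x0, 1), delta(x0, 2)))   # class with the largest discriminant

With equal priors, comparing the two discriminants reduces to the midpoint rule from the previous slide: the prediction flips as $x_0$ crosses $(\hat{\mu}_1 + \hat{\mu}_2)/2$.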
Multivariate Gaussian

[Figure: two bivariate Gaussian densities. Left: $X_1$ and $X_2$ uncorrelated, with $\mathrm{Var}(X_1) = \mathrm{Var}(X_2)$; right: $X_1$ and $X_2$ correlated.]
Gaussian Density with Multiple Predictors
• Suppose each class density is multivariate Gaussian:
  $f_k(x) = \dfrac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}$
• LDA is the special case in which the covariance matrix is assumed to be the same for all classes:
  $\Sigma_k = \Sigma \quad \forall k$
LDA with Multiple Predictors

[Figure: left, three Gaussian distributions; right, samples from the three Gaussian distributions, with solid lines indicating the LDA boundaries.]
Linear Discriminant Function
• Discriminant function:
  $\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\, \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$
• Decision rule: $G(x) = \operatorname{argmax}_k\, \delta_k(x)$
  (the predicted class of $x$ is the one with the largest $\delta_k(x)$ value)
• Estimated values:
  $\hat{\pi}_k = \dfrac{N_k}{N}, \qquad \hat{\mu}_k = \dfrac{1}{N_k} \sum_{g_i = k} x_i, \qquad \hat{\Sigma} = \dfrac{1}{N - K} \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$
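As a sketch in R of the matrix form above, computing $\hat{\Sigma}$ and $\delta_k(x)$ directly; the two-predictor simulated data are an illustrative assumption:

# Two predictors, two classes (illustrative data)
set.seed(2)
X <- rbind(matrix(rnorm(100), 50, 2),
           matrix(rnorm(100, mean = 1.5), 50, 2))
g <- rep(1:2, each = 50)

N <- nrow(X); K <- 2
pi.hat <- table(g) / N
mu.hat <- rbind(colMeans(X[g == 1, ]), colMeans(X[g == 2, ]))   # K x p matrix of class means

Xc        <- X - mu.hat[g, ]            # center each row by its class mean
Sigma.hat <- crossprod(Xc) / (N - K)    # pooled covariance estimate
Sigma.inv <- solve(Sigma.hat)

# delta_k(x) = x' Sigma^{-1} mu_k - (1/2) mu_k' Sigma^{-1} mu_k + log(pi_k)
delta <- function(x0, k)
  drop(x0 %*% Sigma.inv %*% mu.hat[k, ]
       - 0.5 * mu.hat[k, ] %*% Sigma.inv %*% mu.hat[k, ]) + log(pi.hat[k])

x0 <- c(1, 1)
which.max(sapply(1:K, function(k) delta(x0, k)))   # G(x) = argmax_k delta_k(x)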
Example (Default Data)

• The LDA classifier predicts Yes when P(Default = Yes | X) > 0.5 (the Bayes classifier threshold)
• Overall training error rate: 2.75%
• Error rate among individuals who defaulted: 75.7%
• Sensitivity, the percentage of true defaulters that are correctly identified: 24.3%
• Specificity, the percentage of non-defaulters that are correctly identified: $1 - 23/9667 = 99.8\%$
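A hedged R sketch of this analysis using MASS::lda on the Default data from the ISLR package; the balance + student formula follows the textbook's example and is an assumption about this slide's exact fit:

library(MASS)   # lda()
library(ISLR)   # Default data: default, student, balance, income

lda.fit <- lda(default ~ balance + student, data = Default)
post    <- predict(lda.fit)$posterior[, "Yes"]   # P(Default = Yes | X)
pred    <- ifelse(post > 0.5, "Yes", "No")       # Bayes classifier threshold

table(Predicted = pred, Actual = Default$default)   # confusion matrix
mean(pred != Default$default)                       # overall training error rate
mean(pred[Default$default == "Yes"] == "No")        # error rate among defaulters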
Example (Default Data)

• The LDA classifier predicts Yes when P(Default = Yes | X) > 0.2, i.e., with a lower threshold than the Bayes classifier's 0.5
• Overall training error rate: 3.73%
• Error rate among individuals who defaulted: 41.4%
[Figure: error rates versus the posterior probability threshold, showing the overall error rate, the fraction of defaulters incorrectly classified, and the fraction of errors among non-defaulters.]
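Continuing the sketch above, lowering the threshold to 0.2 trades a slightly higher overall error for a much lower error among defaulters:

pred20 <- ifelse(post > 0.2, "Yes", "No")
mean(pred20 != Default$default)                  # overall training error rate
mean(pred20[Default$default == "Yes"] == "No")   # error rate among defaulters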
ROC Curve

• True Positive Rate = Sensitivity
• False Positive Rate = 1 − Specificity
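A minimal sketch tracing out the ROC curve by hand from the posterior probabilities post computed in the Default sketch above (dedicated packages such as pROC exist; the direct computation simply mirrors the definitions on this slide):

thresholds <- seq(0, 1, by = 0.01)
tpr <- sapply(thresholds, function(t)
  mean(post[Default$default == "Yes"] > t))   # sensitivity at threshold t
fpr <- sapply(thresholds, function(t)
  mean(post[Default$default == "No"]  > t))   # 1 - specificity at threshold t

plot(fpr, tpr, type = "l", xlab = "False Positive Rate",
     ylab = "True Positive Rate", main = "ROC Curve")
abline(0, 1, lty = 2)   # no-information classifier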
Quadratic Discriminant Function
• When the assumption of an equal covariance matrix for all classes is dropped, we get QDA
• Discriminant function:
  $\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$
• Decision rule: $G(x) = \operatorname{argmax}_k\, \delta_k(x)$
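A short sketch with MASS::qda, again on the assumed Default formula from the earlier example; predict() returns the class chosen by the quadratic discriminant:

library(MASS)
library(ISLR)

qda.fit  <- qda(default ~ balance + student, data = Default)
qda.pred <- predict(qda.fit)$class

table(Predicted = qda.pred, Actual = Default$default)
mean(qda.pred != Default$default)   # QDA training error rate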
QDA and LDA
• In LDA, the $K$ response classes are assumed to share a common covariance matrix, whereas in QDA this assumption is dropped
• For $p$ predictors, estimating a single covariance matrix requires estimating $\frac{p(p+1)}{2}$ parameters
• QDA requires a separate covariance matrix for each class, for a total of $\frac{Kp(p+1)}{2}$ parameters
• Hence QDA estimates many more parameters, which leads to higher variance; for example, with $p = 50$ predictors and $K = 3$ classes, QDA must estimate $3 \times 1275 = 3825$ covariance parameters
• The LDA model requires only $Kp$ linear coefficients to be estimated, so it is much less flexible and may lead to higher bias
• If there are relatively few training data points, LDA tends to perform better than QDA
• QDA is recommended when the training set is very large, so that the variance of the classifier is not a major concern
QDA and LDA

[Figure: left, two Gaussian classes with a common correlation between $X_1$ and $X_2$; right, two Gaussian classes with different covariances. Bayes decision boundary shown as a purple dashed line, LDA as black dotted, QDA as green solid.]
Example: Stock Market Data
> plot(lda.fit)
Test Error Rate
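A sketch following the ISLR lab shows where the lda.fit object behind the plot(lda.fit) call above and a test error rate could come from; the Smarket data, the Lag1/Lag2 predictors, and the pre-2005 training split are assumptions taken from the textbook rather than from this slide:

library(MASS)
library(ISLR)   # Smarket data

train   <- Smarket$Year < 2005
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
plot(lda.fit)   # histograms of the linear discriminant scores by class

lda.pred <- predict(lda.fit, Smarket[!train, ])
mean(lda.pred$class != Smarket$Direction[!train])   # test error rate on 2005 data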
