
Lecture 14: Discriminant Analysis

CS109A Introduction to Data Science


Pavlos Protopapas and Kevin Rader
Lecture Outline

• Discriminant Analysis
• LDA for one predictor
• LDA for p > 1
• QDA
• Comparison of Classification Methods (so far)



Recall the Heart Data (for classification)
The response variable Y is Yes/No.

Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca Thal AHD

63 1 typical 145 233 1 2 150 0 2.3 3 0.0 fixed No

67 1 asymptomatic 160 286 0 2 108 1 1.5 2 3.0 normal Yes

67 1 asymptomatic 120 229 0 2 129 1 2.6 2 2.0 reversable Yes

37 1 nonanginal 130 250 0 0 187 0 3.5 3 0.0 normal No

41 0 nontypical 130 204 0 2 172 0 1.4 1 0.0 normal No



Discriminant Analysis for Classification



Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) takes a different approach to classification than logistic regression. Rather than attempting to model the conditional distribution of Y given X, P(Y = k|X = x), LDA models the distribution of the predictors X given the different categories that Y takes on, P(X = x|Y = k).

In order to flip these distributions around to model P(Y = k|X = x), an analyst uses Bayes' theorem.

In this setting with one feature (one X), Bayes' theorem can then
be written as:
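In the usual notation, writing π_k = P(Y = k) for the prior and f_k(x) = P(X = x|Y = k) for the class-conditional density, the standard Bayes' rule expression is:

$$P(Y = k \mid X = x) \;=\; \frac{\pi_k\, f_k(x)}{\sum_{l=1}^{K} \pi_l\, f_l(x)}$$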

What does this mean?


LDA (cont.)

The left-hand side, P(Y = k|X = x), is called the posterior probability and gives the probability that the observation is in the kth category given that the feature, X, takes on a specific value, x. The numerator on the right is the conditional distribution of the feature within category k, f_k(x), times the prior probability that the observation is in the kth category, π_k.

The Bayes classifier then assigns the observation to the group for which the posterior probability is the largest.

Inventor of LDA: R.A. Fisher
The 'Father' of Statistics. More famous for work in genetics
(statistically concluded that Mendel's genetic experiments were
'massaged').
Novel statistical work includes:
• Experimental Design
• ANOVA
• F-test (why do you think it's called the F-test?)
• Exact test for 2 x 2 tables
• Maximum Likelihood Theory
• Use of the α = 0.05 significance level: “The value for which P = .05,
or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point
as a limit in judging whether a deviation is to be considered
significant or not”.
• And so much more...

LDA for one predictor

LDA has the simplest form when there is just one predictor/feature (p = 1). In order to estimate f_k(x), we have to assume it comes from a specific distribution. If X is quantitative, what distribution do you think we should use? One common assumption is that f_k(x) comes from a Normal distribution:
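That is, within class k the density is a Normal with class-specific mean μ_k and variance σ_k²:

$$f_k(x) \;=\; \frac{1}{\sqrt{2\pi}\,\sigma_k}\,\exp\!\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right)$$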

In shorthand notation, this is often written as X | Y = k ∼ N(μ_k, σ_k²), meaning the distribution of the feature X within category k is Normally distributed with mean μ_k and variance σ_k².

LDA for one predictor (cont.)
An extra assumption that the variances are equal across classes, σ_1² = ... = σ_K² = σ², will simplify our lives.
Plugging this assumed likelihood into the Bayes' formula
(to get the posterior) results in:
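With a common variance σ² the Normal normalizing constants cancel, giving:

$$P(Y = k \mid X = x) \;=\; \frac{\pi_k\,\exp\!\left(-\frac{(x-\mu_k)^2}{2\sigma^2}\right)}{\sum_{l=1}^{K} \pi_l\,\exp\!\left(-\frac{(x-\mu_l)^2}{2\sigma^2}\right)}$$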

The Bayes classifier assigns an observation with feature value x to the class that maximizes this posterior. How should we maximize? Since the log is a monotone transformation, we can take the log of this expression and rearrange to simplify our maximization...

LDA for one predictor (cont.)

So we maximize the following simplified expression:
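In the usual notation this simplified expression, often written δ_k(x), is:

$$\delta_k(x) \;=\; x \cdot \frac{\mu_k}{\sigma^2} \;-\; \frac{\mu_k^2}{2\sigma^2} \;+\; \log \pi_k$$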

How does this simplify if we have just two classes (K = 2) and if we set our prior probabilities to be equal? This is equivalent to choosing a decision boundary for x for which:
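With K = 2 and π_1 = π_2, setting δ_1(x) = δ_2(x) gives the familiar midpoint rule:

$$x \;=\; \frac{\mu_1 + \mu_2}{2}$$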

Intuitively, why does this expression make sense? What do we use in practice?

LDA for one predictor (cont.)

In practice we don't know the true mean, variance, and prior. So we estimate them with the classical estimates and plug them into the expression (the standard plug-in estimators are sketched below), where n is the total sample size and n_k is the sample size within class k (thus, n = n_1 + ... + n_K).
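A sketch of the usual estimators (class-wise sample means and the pooled variance estimate):

$$\hat{\mu}_k \;=\; \frac{1}{n_k}\sum_{i:\,y_i = k} x_i
\qquad\qquad
\hat{\sigma}^2 \;=\; \frac{1}{n - K}\sum_{k=1}^{K}\,\sum_{i:\,y_i = k} \left(x_i - \hat{\mu}_k\right)^2$$
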
LDA for one predictor (cont.)

This classifier works great if the classes are about equal in proportion, but can easily be extended to unequal class sizes.
Instead of assuming all priors are equal, we instead set
the priors to match the 'prevalence' in the data set:
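That is, each prior is estimated by the corresponding sample proportion:

$$\hat{\pi}_k \;=\; \frac{n_k}{n}$$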

Note: we can use a prior probability from knowledge of the subject as well; for example, if we expect the test set to have a different prevalence than the training set. How could we do this in the Dem. vs. Rep. data set?

LDA for one predictor (cont.)

Plugging all of these estimates back into the original logged maximization formula we get:
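In the usual plug-in form:

$$\hat{\delta}_k(x) \;=\; x \cdot \frac{\hat{\mu}_k}{\hat{\sigma}^2} \;-\; \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} \;+\; \log \hat{\pi}_k$$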

Thus this classifier is called the linear discriminant classifier: this discriminant function is a linear function of x.


Illustration of LDA when p = 1



LDA when p > 1



LDA when p > 1

LDA generalizes 'nicely' to the case when there is more than one predictor.

Instead of assuming the one predictor is Normally distributed, it assumes that the set of predictors for each class is 'multivariate Normally distributed' (shorthand: MVN). What does that mean?

This means that the vector of X for an observation has a multidimensional Normal distribution with a mean vector, μ, and a covariance matrix, Σ.

Multivariate Normal Distribution

Here is a visualization of the Multivariate Normal distribution with 2 variables:
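A minimal sketch of how such a visualization could be produced in Python; the mean vector and covariance matrix below are illustrative values assumed for the picture, not values from the lecture:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Illustrative (assumed) parameters of a bivariate Normal
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

# Evaluate the density on a grid and draw its contours
x1, x2 = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-5, 5, 200))
grid = np.dstack((x1, x2))
density = multivariate_normal(mean=mu, cov=Sigma).pdf(grid)

plt.contourf(x1, x2, density, levels=20, cmap="viridis")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.title("Bivariate Normal density")
plt.colorbar(label="density")
plt.show()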



MVN Distribution

The joint PDF of the Multivariate Normal distribution, MVN(μ, Σ), is:
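In standard form:

$$f(x) \;=\; \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\,\exp\!\left(-\tfrac{1}{2}\,(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$$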

where μ is a p-dimensional mean vector and |Σ| is the determinant of the p × p covariance matrix Σ.
Let's do a quick dimension analysis sanity check...
What do μ and Σ look like?


LDA when p > 1

Discriminant analysis in the multiple predictor case assumes the set of predictors for each class is then multivariate Normal: X | Y = k ∼ MVN(μ_k, Σ_k).

Just like with LDA for one predictor, we make an extra assumption that the covariances are equal in each group, Σ_1 = ... = Σ_K = Σ, in order to simplify our lives.

Now plugging this assumed likelihood into the Bayes' formula (to
get the posterior) results in:
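With MVN class-conditional densities f_k and common covariance Σ, the posterior keeps the same Bayes' rule form:

$$P(Y = k \mid X = x) \;=\; \frac{\pi_k\, f_k(x)}{\sum_{l=1}^{K} \pi_l\, f_l(x)},
\qquad
f_k(x) \;=\; \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\,\exp\!\left(-\tfrac{1}{2}\,(x-\mu_k)^{T}\Sigma^{-1}(x-\mu_k)\right)$$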



LDA when p > 1 (cont.)

Then doing the same steps as before (taking the log and maximizing), we see that the classification for an observation, based on its predictor vector x, will be the class that maximizes the discriminant function below (the maximum of the K values δ_1(x), ..., δ_K(x)).
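Following the usual multivariate LDA derivation, with common covariance Σ:

$$\delta_k(x) \;=\; x^{T}\Sigma^{-1}\mu_k \;-\; \tfrac{1}{2}\,\mu_k^{T}\Sigma^{-1}\mu_k \;+\; \log \pi_k$$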

Note: this is just the vector-matrix version of the formula we saw earlier in the lecture for p = 1.

What do we have to estimate now with the vector-matrix version? How many parameters are there? There are pK means, K prior proportions, and the p(p + 1)/2 distinct entries of the shared covariance matrix Σ to estimate.

LDA when K > 2

The linear discriminant nature of LDA still holds not only when p > 1, but also when K > 2. A picture can be very illustrative:



Quadratic Discriminant Analysis (QDA)



Quadratic Discriminant Analysis (QDA)

A generalization to linear discriminant analysis is quadratic discriminant analysis (QDA).

Why do you suppose the choice in name?

The implementation is just a slight variation on LDA. Instead of assuming the covariances of the MVN distributions within classes are equal, we instead allow them to be different.

This relaxation of an assumption completely changes the picture...

QDA in a picture

A picture can be very illustrative:



QDA (cont.)
When performing QDA, classification for an observation based on its predictors is equivalent to maximizing the following over the K classes:
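In the standard QDA form, with class-specific mean μ_k and covariance Σ_k:

$$\delta_k(x) \;=\; -\tfrac{1}{2}\log|\Sigma_k| \;-\; \tfrac{1}{2}\,(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k) \;+\; \log \pi_k$$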

Notice the 'quadratic form' of this expression. Hence the name QDA.
Now how many parameters are there to be estimated? There are pK means, K prior proportions, and K · p(p + 1)/2 covariance parameters (a separate p × p covariance matrix for each class) to estimate. This could slow us down very much if K is large...

Discriminant Analysis in Python

LDA is already implemented in Python via the sklearn.discriminant_analysis module, through the LinearDiscriminantAnalysis class.

QDA is in the same module and is the QuadraticDiscriminantAnalysis class.

It's very easy to use. Let's see how this works



Discriminant Analysis in Python (cont.)
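A minimal sketch (not the exact notebook from the lecture), assuming the Heart data is available as 'Heart.csv' with predictors 'Age' and 'MaxHR' and response 'AHD' (Yes/No):

import pandas as pd
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

# Assumed file name and predictor choice -- adjust to the actual data set
heart = pd.read_csv("Heart.csv")
X = heart[["Age", "MaxHR"]].values
y = (heart["AHD"] == "Yes").astype(int).values

# Fit LDA and QDA with their default settings
lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

print("LDA training accuracy:", lda.score(X, y))
print("QDA training accuracy:", qda.score(X, y))

# Posterior probabilities P(Y = 1 | X = x) for the first five observations
print(lda.predict_proba(X[:5])[:, 1])

Both estimators expose predict, predict_proba, and score, so they drop into the same workflow as LogisticRegression.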



QDA vs. LDA

So both QDA and LDA take a similar approach to solving this classification problem: they use Bayes' rule to flip the conditional probability statement and assume observations within each class are multivariate Normal (MVN) distributed.

QDA differs in that it does not assume a common covariance across classes for these MVNs. What advantage does this have? What disadvantage does this have?



QDA vs. LDA (cont.)

So generally speaking, when should QDA be used over LDA? LDA over QDA?

The extra covariance parameters that need to be estimated in QDA not only slow us down, but also allow for another opportunity for overfitting. Thus if your training set is small, LDA should perform better for out-of-sample prediction, aka, predicting future observations (how do we mimic this process?).



Comparison of Classification Methods (so far)



Comparison of Classification Methods (so far)

We have seen 3 major methods for doing classification:


• Logistic Regression
• k-NN
• Discriminant Analysis (LDA and QDA)
For a specific problem, which approach should be used?

Well of course, it depends on the nature of the data. So how should we decide?

Visualize the data!


Six Classification Models We'll Compare

Let's investigate which method will work the best (as measured by lowest overall classification error rate), by considering 6 different models for 4 different data sets (each data set has a pair of predictors... you can think of them as the first 2 PCA components, to come later in the lecture). The 6 models to consider are (sketched in code below):
• A logistic regression with only 'linear' main effects
• A logistic regression with 'linear' and 'quadratic' effects
• LDA
• QDA
• k-NN where k = 3
• k-NN where k = 25
What else will also be important to measure (besides error rate)?
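A rough sketch of how these six models could be set up with sklearn; it assumes a predictor array X and label vector y for one of the simulated data sets already exist (those names are illustrative, not from the lecture code):

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier

# The six classifiers compared in this lecture
models = {
    "logistic (linear)": LogisticRegression(),
    "logistic (quadratic)": make_pipeline(PolynomialFeatures(degree=2),
                                          LogisticRegression()),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "k-NN, k=3": KNeighborsClassifier(n_neighbors=3),
    "k-NN, k=25": KNeighborsClassifier(n_neighbors=25),
}

for name, model in models.items():
    model.fit(X, y)   # X, y assumed to be defined for the simulated data set
    # Training error rate; a held-out test set would be needed to judge overfitting
    print(f"{name}: error rate = {1 - model.score(X, y):.3f}")
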
Which method should perform better? #1

n = 20,000, p = 2, K = 2, π_1 = π_2 = 0.5

Notice anything fishy about our answers? What did Kevin do? What should he have done?



Easy to implement in Python



Which method should perform better? #2

n = 20,000, p = 2, K = 2, π_1 = π_2 = 0.5



Which method should perform better? #3

n = 20,000, p = 2, K = 2, π_1 = π_2 = 0.5



Which method should perform better? #4

n = 20,000, p = 2, K = 2, π_1 = π_2 = 0.5



Summary of Results

Generally speaking:
• LDA outperforms Logistic Regression if the distribution of
predictors is reasonably MVN (with constant covariance).
• QDA outperforms LDA if the covariances are not the same in
the groups.
• k-NN outperforms the others if the decision boundary is
extremely non-linear.
• Of course, we can always adapt our models (logistic and
LDA/QDA) to include polynomial terms, interaction terms, etc...
to improve classification (watch out for overfitting!)
• In order of computational speed (generally speaking, it
depends on K, p, and n of course):
LDA > QDA > Logistic > k-NN
