Discriminant Analysis

The document introduces discriminant analysis, a classification method used to separate groups based on multiple predictor variables. It discusses different discriminant analysis methods like linear discriminant analysis and quadratic discriminant analysis. The major steps of discriminant analysis include preparing the data, building a classification model, and evaluating the model's performance on test data. Discriminant analysis assumes normally distributed predictor variables and can perform both group description and prediction tasks.


Introduction to Discriminant Analysis

Session 12 – APG
Outline
• Introduction to Classification Methods
• Introduction to Discriminant Analysis
• Discriminant Analysis Methods
• The major steps of Discriminant Analysis
• Preparing Data
• Classification Model Evaluation

2 Discriminant Analysis
Introduction to Classification Methods
• Recall that the regression model is used to predict a quantitative or continuous
outcome variable based on one or more predictor variables.
• In classification, the outcome variable is qualitative (or categorical).
• Classification refers to a set of machine learning methods for predicting the class
(or category) of individuals on the basis of one or more predictor variables.

3 Discriminant Analysis
Classification methods:
• Logistic regression for binary classification tasks
• Multinomial logistic regression, an extension of the logistic regression for
multiclass classification tasks
• Discriminant analysis, for binary and multiclass classification problems
• Naive Bayes classifier
• Support vector machines
• Neural networks, etc.

Most classification algorithms compute the probability of belonging to a given class.
Observations are then assigned to the class with the highest probability score.

Generally, you need to decide on a probability cutoff above which an observation is
considered to belong to a given class, as in the sketch below.
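For instance, a minimal sketch in R, where post is a small illustrative matrix of class-membership probabilities (one row per observation, one column per class — for example, the $posterior component returned by predict() for an lda() fit) and the positive class is labelled "pos":

# Illustrative probability matrix: three observations, two classes
post <- matrix(c(0.9, 0.1,
                 0.3, 0.7,
                 0.6, 0.4),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("neg", "pos")))

# Default rule: assign each observation to the class with the highest probability
pred_default <- colnames(post)[max.col(post)]

# Alternative rule: declare "pos" whenever its probability exceeds a chosen cutoff
cutoff   <- 0.5
pred_cut <- ifelse(post[, "pos"] > cutoff, "pos", "neg")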

4 Discriminant Analysis
Introduction to Discriminant Analysis
There are two major objectives in separation of
groups:
1. Description of group separation, in which linear
functions of the variables (discriminant
functions) are used to describe the differences
between two or more groups. This includes
identifying the relative contribution of the 𝑝
independent variables to the separation of the
groups (discriminant analysis).
2. Prediction or allocation of observations to
groups, based on discriminant functions. This
includes the evaluation of the classification model
(classification analysis).

Example: 𝑌 = 𝛽₁𝑋₁ + 𝛽₂𝑋₂ + ⋯ + 𝛽ₚ𝑋ₚ
- Y: a categorical (non-metric) variable
- X: continuous (metric) variables
5 Discriminant Analysis
Discriminant Analysis Methods
• Linear discriminant analysis (LDA): Uses linear combinations of predictors to predict the class
of a given observation. Assumes that the predictor variables (𝑝) are normally distributed and the
classes have identical variances (for univariate analysis, 𝑝 = 1) or identical covariance matrices (for
multivariate analysis, 𝑝 > 1).
• Quadratic discriminant analysis (QDA): More flexible than LDA. Here, there is no assumption
that the covariance matrix of the classes is the same. LDA tends to perform better than QDA when you
have a small training set.
• Mixture discriminant analysis (MDA): Each class is assumed to be a Gaussian mixture of
subclasses. Equality of covariance matrix, among classes, is still assumed.
• Flexible Discriminant Analysis (FDA): Non-linear combinations of predictors, such as
splines, are used. FDA is useful for modelling multivariate non-normality or non-linear relationships among
variables within each group.
• Regularized discriminant analysis (RDA): Regularization improves the estimate of the
covariance matrices in situations where the number of predictors is larger than the number of
samples. This might be very useful for a large multivariate data set containing highly correlated
predictors. A sketch of fitting all five variants in R follows this list.
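As a rough sketch (assuming the MASS, mda and klaR packages are installed; shown on the built-in iris data purely for illustration), all five variants can be fitted with a formula interface:

library(MASS)   # lda(), qda()
library(mda)    # mda(), fda()
library(klaR)   # rda()

lda_fit <- lda(Species ~ ., data = iris)   # linear discriminant analysis
qda_fit <- qda(Species ~ ., data = iris)   # quadratic discriminant analysis
mda_fit <- mda(Species ~ ., data = iris)   # mixture discriminant analysis
fda_fit <- fda(Species ~ ., data = iris)   # flexible discriminant analysis
rda_fit <- rda(Species ~ ., data = iris)   # regularized discriminant analysis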
6 Discriminant Analysis
• MDA might outperform LDA and QDA in some situations, as illustrated below.

In this example data, we have 3 main groups of individuals, each having 3 non-adjacent
subgroups. The solid black lines on the plot represent the decision boundaries of LDA,
QDA and MDA. It can be seen that the MDA classifier has correctly identified the
subclasses, whereas LDA and QDA were not good at modelling these data.

7 Discriminant Analysis
Why Discriminant Analysis
• When the classes of the response variable Y (e.g. default = “Yes” vs. default = “No”) are well-
separated, the parameter estimates for the logistic regression model are surprisingly unstable.
LDA & QDA do not suffer from this problem.
• If n is small and the distribution of the predictors X is approximately normal in each of the
classes, the LDA & QDA models are again more stable than the logistic regression model.
• LDA & QDA are often preferred over logistic regression when we have more than two non-
ordinal response classes (i.e.: stroke, drug overdose, and epileptic seizure).
• It is always good to compare the results of different analytic techniques; this can either help
to confirm results or highlight how different modelling assumptions and characteristics
uncover new insights.

8 Discriminant Analysis
However, it's important to note that LDA & QDA have assumptions that are often more
restrictive than logistic regression:
• Both LDA and QDA assume the predictor variables X are drawn from a
multivariate Gaussian (aka normal) distribution.
• LDA assumes equality of covariances among the predictor variables X across
all levels of Y. This assumption is relaxed with the QDA model.
• LDA and QDA require the number of predictor variables (𝑝) to be less than the
sample size (𝑛). Furthermore, it's important to keep in mind that performance will
severely decline as 𝑝 approaches 𝑛. A simple rule of thumb is to use LDA & QDA
on data sets where 𝑛 ≥ 5 × 𝑝.

9 Discriminant Analysis
The major steps of Discriminant Analysis

10 Discriminant Analysis
11 Discriminant Analysis
Preparing Data
• Discriminant analysis can be affected by the scale/unit in which predictor variables
are measured.
• It's generally recommended to standardize/normalize continuous predictors before
the analysis.
1. Split the data into training and test sets
2. Normalize the data; categorical variables are automatically ignored (see the sketch after this list)

• The sample size must be large enough (at least 20 cases per group)
• The number of observations per predictor variable (X) should be ≥ 5
• In selecting independent variables, we can use univariate ANOVA to assess the
significance of differences between the group means of the independent variables for the
two or more groups.
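A minimal sketch of steps 1–2 with the caret package, assuming a data frame pima whose outcome column is the factor diabetes (both names are illustrative):

library(caret)
set.seed(123)

# 1. Split the data into training (80%) and test (20%) sets
idx   <- createDataPartition(pima$diabetes, p = 0.8, list = FALSE)
train <- pima[idx, ]
test  <- pima[-idx, ]

# 2. Estimate centering/scaling on the training set only, then apply it to both sets;
#    preProcess() skips categorical variables automatically
pp <- preProcess(train, method = c("center", "scale"))
train_norm <- predict(pp, train)
test_norm  <- predict(pp, test)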

12 Discriminant Analysis
Linear Discriminant Analysis (LDA)
• LDA starts by finding directions (linear combinations of the predictor variables) that maximize the
separation between classes, then uses these directions to predict the class of individuals.

• LDA assumes that predictors are normally distributed (Gaussian distribution) and that the
different classes have class-specific means and equal variance/covariance.

• Before performing LDA, consider:

• Inspecting the univariate distributions of each variable and making sure that they are
normally distributed. If not, you can transform them using log or root transformations for exponential
distributions and Box-Cox for skewed distributions.
• Removing outliers from your data and standardizing the variables to make their scales
comparable.
• LDA can be computed using the R function lda() [in MASS package]

Note that, if the predictor variables are standardized before computing LDA, the discriminant
weights can be used as measures of variable importance for feature selection.
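A minimal sketch, reusing the illustrative train_norm and test_norm sets from the data-preparation sketch above:

library(MASS)

lda_fit <- lda(diabetes ~ ., data = train_norm)
lda_fit$scaling        # discriminant weights (variable importance when predictors are standardized)

pred <- predict(lda_fit, newdata = test_norm)
head(pred$class)       # predicted classes
head(pred$posterior)   # class membership probabilities
head(pred$x)           # linear discriminant scores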

13 Discriminant Analysis
Quadratic discriminant analysis (QDA)
• QDA is a little more flexible than LDA, in the sense that it does not assume equality of
variance/covariance.
• In contrast to LDA, QDA is recommended if the training set is very large, so that the variance of the
classifier is not a major issue, or if the assumption of a common covariance matrix for the K
classes is clearly untenable (James et al. 2014).
• QDA can be computed using the R function qda() [MASS package]
• If the variability of the observations within each class differs, QDA (right plot) is able to
capture the differing covariances and provide more accurate non-linear classification decision
boundaries.

14 Discriminant Analysis
Classification Model Evaluation
• After building a predictive classification model, you need to evaluate the performance of
the model, that is, how well the model predicts the outcome of new observations
(test data that were not used to train the model):
• estimate the model prediction accuracy, and
• estimate the prediction errors using a new test data set.
• Methods for assessing the performance of predictive classification models:
• Average classification accuracy, representing the proportion of correctly classified
observations.
• Confusion matrix, which is a 2×2 table showing four parameters: the number
of true positives, true negatives, false negatives and false positives.
• Precision, Sensitivity and Specificity, which are three major performance metrics
describing a predictive classification model
• ROC curve, showing the proportion of true positives and false positives at all possible
values of probability cutoff. (typically used in binary classification)
• The Area Under the Curve (AUC) summarizes the overall performance of the
classifier. Values above 0.80 are an indication of a good classifier.

15 Discriminant Analysis
Average Classification Accuracy:
• The overall classification accuracy rate corresponds to the proportion of observations that
have been correctly classified.
accuracy <- mean(observed.classes == predicted.classes)  # proportion correctly classified
• Inversely, the classification error rate is defined as the proportion of observations that have
been misclassified. Error rate = 1 – accuracy

Confusion Matrix
It compares the observed and the predicted outcome values and shows the number of correct and
incorrect predictions categorized by type of outcome.
# Confusion matrix, counts
table(observed.classes, predicted.classes)
# Confusion matrix, proportion of cases
library(dplyr)  # provides the %>% pipe used below
table(observed.classes, predicted.classes) %>%
  prop.table() %>%
  round(digits = 3)

The correct classification rate is the sum of the numbers on the diagonal divided by the sample
size of the test data. In our example, that is (48 + 15)/78 ≈ 81%, or 0.615 + 0.192 = 0.807.
16 Discriminant Analysis
Example. Classification of diabetes patients

Confusion matrix

True positives (d): cases in which we predicted the individuals would be diabetes-positive and they were.
True negatives (a): we predicted diabetes-negative, and the individuals were diabetes-negative.
False positives (b): we predicted diabetes-positive, but the individuals did not actually have diabetes
(also known as a Type I error).
False negatives (c): we predicted diabetes-negative, but they did have diabetes (also known
as a Type II error).

Technically, the raw prediction accuracy of the model is defined as

(TruePositives + TrueNegatives)/SampleSize.

17 Discriminant Analysis
Precision, Sensitivity and Specificity

• Precision
Precision = TruePositives/(TruePositives + FalsePositives)

• Sensitivity, which is the True Positive Rate (TPR)


Sensitivity = TruePositives/(TruePositives + FalseNegatives)

• Specificity, which measures the True Negative Rate (TNR)


Specificity = TrueNegatives/(TrueNegatives + FalsePositives)

• These above mentioned metrics can be easily computed using the function
confusionMatrix() [caret package].

• Sensitivity and specificity are commonly used to measure the performance of a predictive
model; a sketch using confusionMatrix() is shown below.
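A minimal sketch, assuming predicted.classes and observed.classes are factors with the same levels and that the positive class is labelled "pos" (an illustrative label):

library(caret)

confusionMatrix(data      = predicted.classes,
                reference = observed.classes,
                positive  = "pos")
# The output includes the confusion matrix itself plus Accuracy, Sensitivity,
# Specificity and Pos Pred Value (precision), among other statistics.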

18 Discriminant Analysis
19 Discriminant Analysis
• In medical science, sensitivity and specificity are two important metrics that characterize the
performance of a classifier or screening test.
• In medical diagnostics, such as in our example, we are likely to be more concerned with minimizing
wrong positive diagnoses, so we are more concerned about high specificity. Here, the model
specificity is 92%, which is very good.
• In some situations, we may be more concerned with tuning a model so that the
sensitivity/precision is improved.
• To this end, you can test different probability cutoffs to decide which individuals are positive and
which are negative.
• Note that here we have used 𝑝 > 0.5 as the probability threshold above which we declare the
concerned individuals diabetes-positive.
• However, if we are concerned about missing individuals who are truly diabetes-positive
(false negatives), then we can consider lowering this threshold, for example to 𝑝 > 0.2, as in the sketch below.
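A minimal sketch of re-classifying with a lower cutoff, reusing the illustrative LDA prediction object pred and the test-set labels observed.classes:

# Posterior probability of the "pos" class for each test observation
p_pos <- pred$posterior[, "pos"]

# Default rule (cutoff 0.5) versus a more sensitive rule (cutoff 0.2)
pred_50 <- factor(ifelse(p_pos > 0.5, "pos", "neg"), levels = c("neg", "pos"))
pred_20 <- factor(ifelse(p_pos > 0.2, "pos", "neg"), levels = c("neg", "pos"))

# Lowering the cutoff raises sensitivity at the cost of specificity
table(observed.classes, pred_50)
table(observed.classes, pred_20)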

20 Discriminant Analysis
ROC curve (or receiver operating characteristic curve) & AUC
• The ROC curve is a popular graphical measure for assessing the performance or the
accuracy of a classifier, which corresponds to the total proportion of correctly classified
observations.
• The ROC plot simply displays the sensitivity against the specificity.
• For a good model, the ROC curve should rise steeply, indicating that the true positive rate
(y-axis) increases faster than the false positive rate (x-axis) as the probability threshold
decreases.
• So, the "ideal point" is the top left corner of the graph, that is a false positive rate of zero,
and a true positive rate of one.
• The ROC analysis can be easily performed using the R package pROC.

• The Area Under the Curve (AUC ) summarizes the overall performance of the classifier,
over all possible probability cutoffs. It represents the ability of a classification algorithm to
distinguish 1s from 0s (i.e, events from non-events or positives from negatives).
• The AUC metric varies between 0.50 (random classifier) and 1.00. Values above 0.80 are an
indication of a good classifier (see the sketch below).
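A minimal sketch with pROC, reusing the illustrative positive-class probabilities p_pos and the labels observed.classes:

library(pROC)

roc_obj <- roc(response = observed.classes, predictor = p_pos)
plot(roc_obj)   # sensitivity plotted against specificity
auc(roc_obj)    # area under the curve; values above 0.80 suggest a good classifier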

21 Discriminant Analysis
ROC Curve

A highly performant classifier will have an ROC curve that rises steeply toward the top-left
corner; that is, it will correctly identify many positives without misclassifying many
negatives as positives.

AUC values above 0.80 are an indication of a good classifier.

22 Discriminant Analysis
Prediction or allocation of observations to groups by
Linear Discriminant Analysis

Separation & Classification for Two Populations


We have two populations with 𝑿₁ ~ 𝑓₁(𝐱) and 𝑿₂ ~ 𝑓₂(𝐱). The data layout is:

Group   Obs.   Y   X1        X2        ...   Xp
π1      1      0   x_111     x_211     ...   x_p11
π1      2      0   x_112     x_212     ...   x_p12
        ⋮
π1      n1     0   x_11n1    x_21n1    ...   x_p1n1
π2      1      1   x_121     x_221     ...   x_p21
π2      2      1   x_122     x_222     ...   x_p22
        ⋮
π2      n2     1   x_12n2    x_22n2    ...   x_p2n2

23 Discriminant Analysis
• Let 𝑓₁(𝐱) and 𝑓₂(𝐱) be the probability density functions associated with the 𝑝 × 1 random
vector 𝑿 for the populations 𝜋₁ and 𝜋₂, respectively. Let 𝛺 be the sample space, that is,
the collection of all possible observations 𝐱. Let 𝑅₁ be the set of 𝐱 values for which we
classify objects as 𝜋₁, and 𝑅₂ = 𝛺 − 𝑅₁ be the remaining values, for which we classify
objects as 𝜋₂.
• Let 𝑝₁ be the prior probability of 𝜋₁ and 𝑝₂ be the prior probability of 𝜋₂, where 𝑝₁ + 𝑝₂ = 1.

24 Discriminant Analysis
The cost of misclassification:
Let 𝑐(2|1) be the cost of allocating an observation from 𝜋₁ to 𝜋₂, and 𝑐(1|2) the cost of
allocating an observation from 𝜋₂ to 𝜋₁.

Expected cost of misclassification (ECM):

$$\mathrm{ECM} = c(2\mid 1)\,P(2\mid 1)\,p_1 + c(1\mid 2)\,P(1\mid 2)\,p_2$$

The regions 𝑅₁ and 𝑅₂ that minimize the ECM are defined by

$$R_1:\ \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ \ge\ \left(\frac{c(1\mid 2)}{c(2\mid 1)}\right)\left(\frac{p_2}{p_1}\right), \qquad R_2:\ \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ <\ \left(\frac{c(1\mid 2)}{c(2\mid 1)}\right)\left(\frac{p_2}{p_1}\right).$$

25 Discriminant Analysis
Linear Discriminant Analysis (LDA)
Classification of Normal Populations when 𝜮₁ = 𝜮₂ = 𝜮

Suppose that the joint densities of 𝑿′ = [𝑋₁, 𝑋₂, ⋯, 𝑋ₚ] for populations 𝜋₁ and 𝜋₂ are given by

$$f_i(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\,|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)'\,\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right], \qquad i = 1, 2.$$

The minimum ECM regions become

$$R_1:\ (\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{x} \ -\ \tfrac{1}{2}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1+\boldsymbol{\mu}_2)\ \ge\ \ln\!\left[\left(\frac{c(1\mid 2)}{c(2\mid 1)}\right)\left(\frac{p_2}{p_1}\right)\right]$$

and 𝑅₂ otherwise. Then we can construct the classification rule: allocate 𝐱₀ to 𝜋₁ if the
inequality above holds at 𝐱 = 𝐱₀, and allocate 𝐱₀ to 𝜋₂ otherwise.

26 Discriminant Analysis
From these data matrices, the sample mean vectors and covariance matrices
are determined by

$$\bar{\mathbf{x}}_i = \frac{1}{n_i}\sum_{j=1}^{n_i}\mathbf{x}_{ij}, \qquad \mathbf{S}_i = \frac{1}{n_i-1}\sum_{j=1}^{n_i}(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)', \qquad i = 1, 2.$$

If we assume that the two populations have equal covariance matrices, an unbiased
estimate of 𝜮 is the pooled covariance matrix

$$\mathbf{S}_{\mathrm{pooled}} = \frac{(n_1-1)\mathbf{S}_1 + (n_2-1)\mathbf{S}_2}{n_1+n_2-2}.$$

27 Discriminant Analysis
If we set

$$\frac{c(1\mid 2)}{c(2\mid 1)}\,\frac{p_2}{p_1} = 1, \qquad \text{so that} \qquad \ln(1) = 0,$$

we have the sample discriminant function

$$\hat{y} = (\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)'\,\mathbf{S}_{\mathrm{pooled}}^{-1}\,\mathbf{x} = \hat{\mathbf{a}}'\mathbf{x}.$$

If we define the midpoint (cutoff point) between the group centroids $\bar{y}_1$ and $\bar{y}_2$ as

$$\hat{m} = \tfrac{1}{2}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)'\,\mathbf{S}_{\mathrm{pooled}}^{-1}\,(\bar{\mathbf{x}}_1+\bar{\mathbf{x}}_2) = \tfrac{1}{2}(\bar{y}_1+\bar{y}_2),$$

where $\bar{y}_1 = (\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)'\,\mathbf{S}_{\mathrm{pooled}}^{-1}\,\bar{\mathbf{x}}_1 = \hat{\mathbf{a}}'\bar{\mathbf{x}}_1$ and $\bar{y}_2 = (\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)'\,\mathbf{S}_{\mathrm{pooled}}^{-1}\,\bar{\mathbf{x}}_2 = \hat{\mathbf{a}}'\bar{\mathbf{x}}_2$, then:

Allocate 𝐱₀ to 𝜋₁ if $\hat{y}_0 = \hat{\mathbf{a}}'\mathbf{x}_0 \ge \hat{m}$
Allocate 𝐱₀ to 𝜋₂ if $\hat{y}_0 = \hat{\mathbf{a}}'\mathbf{x}_0 < \hat{m}$

This allocation rule is sketched in R below.
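A minimal hand-rolled sketch of this allocation rule in R (x1 and x2 are matrices of observations from π₁ and π₂, x0 is a new observation; all names are illustrative):

fisher_rule <- function(x1, x2, x0) {
  n1 <- nrow(x1); n2 <- nrow(x2)
  xbar1 <- colMeans(x1); xbar2 <- colMeans(x2)
  # Pooled (unbiased) estimate of the common covariance matrix
  S_pooled <- ((n1 - 1) * cov(x1) + (n2 - 1) * cov(x2)) / (n1 + n2 - 2)
  a  <- solve(S_pooled, xbar1 - xbar2)   # discriminant coefficients a-hat
  m  <- 0.5 * sum(a * (xbar1 + xbar2))   # midpoint m-hat = (ybar1 + ybar2)/2
  y0 <- sum(a * x0)                      # discriminant score of the new observation
  if (y0 >= m) "pi_1" else "pi_2"
}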
28 Discriminant Analysis
Two Groups Example. Discriminant Analysis of HBAT customers based on
geographic location (located within North America or outside)

Selection of independent variables:
the candidate variables that are significant are X6, X11, X12, X13, and X17.

29 Discriminant Analysis
• Stepwise estimation: Step 1 adds the first variable, X13, which has the largest
Mahalanobis distance D² between the two groups (the stepwise criterion maximizes the minimum D²).
• Step 2 adds the second variable, X17.
• And so on…

30 Discriminant Analysis
1. After X11 was entered into the equation, both X6 and X12 (which had significant univariate
differences across the groups) have relatively little additional discriminatory power, and the
estimation process stops with three variables (X11, X13 and X17).
2. We see in the canonical discriminant function output that the discriminant function is highly
significant and displays a canonical correlation of 0.749.

31 Discriminant Analysis
• We interpret this canonical correlation by squaring it: (0.749)² = 0.561.
• It means that 56.1 percent of the variation in the dependent variable can be explained
by this model, which includes only three independent variables (X11, X13 and X17).

3. The discriminant weights. There are unstandardized and standardized weights.


4. The discriminant loadings are reported under the heading “Structure Matrix” and are
ordered from highest to lowest.
• The discriminant loadings are considered the more appropriate measure of
discriminatory power.
• The discriminant loadings are calculated for every independent variable, even for those
not included in the discriminant function.
• They more accurately represent each variable's association with the discriminant score
(than the discriminant weights), because they are relatively unaffected by multicollinearity.
5. The classification function coefficients, also known as Fisher's linear discriminant
functions, are used in classification.
6. Group centroids represent the mean of the individual discriminant function scores for each
group.

32 Discriminant Analysis
Interpretation of the Results
Analyzing Discriminant Weight.
• The unstandardized weights (plus the constant) are used to calculate the discriminant scores
that can be used in classification, but can be affected by the scale of the independent variables (just
like multiple regression weights).
• The standardized weights more accurately reflect the impact of each independent variable on the
discriminant function.

Analyzing Discriminant Loadings.


• Loadings above ±0.40 should be used to identify substantive discriminating variables.
• With the discriminating variables identified and the discriminant function described in terms
of those variables with sufficiently high loadings, the researcher then proceeds to profile each
group on these variables to understand the differences between them.
• For example, the loadings of the three variables (X11, X13 and X17) entered in the
discriminant function are the three highest, and all exceed ±0.40. Two additional variables (X6
and X7) also have loadings above ±0.40: X6 was not included in the discriminant
function due to multicollinearity, and X7 did not have a significant univariate effect.
We can therefore use these five variables to profile each group.
33 Discriminant Analysis
Profiling the Discriminant Variables
• The researcher is interested in interpretations of the individual variables that have statistical
and practical significance.
• Such interpretations are accomplished by first identifying the variables with substantive
discriminatory power and then understanding what the differing group means on each variable
indicate.
• The mean profiles also illustrate the interpretation of signs (positive or negative) on the
discriminant weights and loadings. The signs indicate the pattern between the groups.
Example. Based on the HBAT customer data (see the group means in Table 5).
Higher scores on the independent variables indicate more favorable perceptions of HBAT on that
attribute (except for X13, where lower scores are preferable).
Profiles of the two groups on the five variables are:
• Group 0 (customers in the USA/North America) has higher perceptions on three variables:
X6 (Product Quality), X13 (Competitive Pricing) and X11 (Product Line).
• Group 1 (customers outside North America) has higher perceptions on the remaining two
variables: X7 (e-Commerce Activities) and X17 (Price Flexibility).
We can see that the USA customers have much better perceptions of the HBAT products, whereas
the international customers (outside North America) find the pricing policies conducive to their needs.
Management should use these results to develop strategies.
34 Discriminant Analysis
Validation of the Results
• The primary means of validation is through the use of a holdout sample and the assessment of its
predictive accuracy.
• In this manner, validity is established if the discriminant function performs at an acceptable
level in classifying observations that were not used in the estimation process.
• If the holdout sample is formed from the original sample, then this approach establishes
internal validity. (For the holdout classification table shown, the hit ratio is (9 + 25)/40 = 0.85.)
35 Discriminant Analysis
Before dividing the observations into an analysis sample (training data set) and a holdout sample
(validation data set), the total number of observations is 100, with 39% in group 0 (USA) and 61% in
group 1 (outside North America).
• The proportional chance criterion: 0.39² + 0.61² = 0.524 = 52.4%
• The maximum chance criterion: 61%
If we use the maximum chance criterion, our model should outperform the 61% level of
classification accuracy to be acceptable.
• Other thresholds we can use: 1.25 × 52.4% = 65.5% or 1.25 × 61% = 76.3% (checked in the sketch below).
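A quick check of this arithmetic in R:

p0 <- 0.39
p1 <- 0.61
c_prop <- p0^2 + p1^2    # proportional chance criterion: 0.5242 (52.4%)
c_max  <- max(p0, p1)    # maximum chance criterion: 0.61 (61%)
1.25 * c_prop            # practical threshold: ~0.655 (65.5%)
1.25 * c_max             # practical threshold: ~0.763 (76.3%)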

• We can see in Table 10 that the classification accuracies for the analysis sample, holdout sample and
cross-validated sample are 86.7%, 85% and 83.3%.
• So, in all instances (analysis sample, holdout sample and cross-validation), the level of
classification accuracy is substantially higher than the threshold values, indicating an
acceptable level of classification accuracy.

36 Discriminant Analysis
More than Two Groups
Interpretation of Two or More Functions
We can use these concepts:
- Rotation of Discriminant Functions
- Potency Index
Rotation of Discriminant Functions
• Calculate the loadings for each function and review the rotation of the functions for
purposes of simplifying interpretation
• Examine the contribution of the predictor variables:
• To each function separately (i.e., the discriminant loadings)
• Cumulatively across multiple discriminant functions with the potency index
• Graphically, in a two-dimensional solution, to understand the position of each group
Potency Index
• Is a summary measure for describing the contributions of a variable across all significant
functions.
• Represents the total discriminating effect of a variable across all significant discriminant
functions.
• Calculated by a two-step process.

37 Discriminant Analysis
38 Discriminant Analysis
A Three Group Illustrative Example
• In the three-group example, it is necessary to develop two separate discriminant functions to distinguish
among the three groups.
• The first function separates one group from the other two, and the second separates the remaining two
groups.






39 Discriminant Analysis
• The total amount of variance explained by the first function is 0.893² = 79.7%.
• The next function explains 0.517² = 26.7% of the remaining variance (20.3%).
• Therefore, the total variance explained by the two discriminant functions is 79.7% + (26.7% ×
20.3%) = 85.1% of the total variation in the dependent variable (checked below).
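This calculation can be verified quickly in R:

f1 <- 0.893^2               # 0.797: variance explained by the first function
f2 <- 0.517^2 * (1 - f1)    # 0.267 of the remaining 0.203, i.e. about 0.054
f1 + f2                     # about 0.851, i.e. 85.1% of the total variation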
40 Discriminant Analysis
Interpretation of the Discriminant Functions

• Function 1: described by three variables (X18, X9 and X16) that comprise the postsale
customer services, plus X11 and X17.
• Function 2: shows only one variable, X6 (Product Quality).

41 Discriminant Analysis
Contribution of Predictor Variable
• Start by examining the group centroids on the two functions and the distribution of cases in
each group.
• Function 1 primarily differentiates group 1 from groups 2 and 3, whereas function 2
distinguishes group 3 from groups 1 and 2.
• The overlap and misclassification of the cases of groups 2 and 3 can be addressed by
examining the strength of the discriminant functions and the groups differentiated by each.

42 Discriminant Analysis
43 Discriminant Analysis
44 Discriminant Analysis
45 Discriminant Analysis
