Discriminant Analysis
Session 12 - APG
Outline
• Introduction to Classification Methods
• Introduction to Discriminant Analysis
• Discriminant Analysis Methods
• The major steps of Discriminant Analysis
• Preparing Data
• Classification Model Evaluation
Introduction to Classification Methods
• Recall that a regression model is used to predict a quantitative or continuous outcome variable based on one or multiple predictor variables.
• In Classification, the outcome variable is qualitative (or categorical).
• Classification refers to a set of machine learning methods for predicting the class
(or category) of individuals on the basis of one or multiple predictor variables.
Classification methods:
• Logistic regression for binary classification tasks
• Multinomial logistic regression, an extension of the logistic regression for
multiclass classification tasks
• Discriminant analysis, for binary and multiclass classification problems
• Naive Bayes classifier
• Support vector machines
• Neural networks, etc.
Introduction to Discriminant Analysis
There are two major objectives in the separation of groups:
1. Description of group separation, in which linear functions of the variables (discriminant functions) are used to describe the differences between two or more groups. This includes identifying the relative contribution of the 𝑝 independent variables to the separation of the groups (discriminant analysis).
2. Prediction or allocation of observations to groups, based on the discriminant functions. This includes the evaluation of the classification model (classification analysis).
Example: 𝑌 = 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝
where 𝑌 is a categorical (non-metric) variable and 𝑋1, …, 𝑋𝑝 are continuous (metric) variables.
Discriminant Analysis Methods
• Linear discriminant analysis (LDA): Uses linear combinations of predictors to predict the class
of a given observation. Assumes that the predictor variables (𝑝) are normally distributed and the
classes have identical variances (for univariate analysis, 𝑝 = 1) or identical covariance matrices (for
multivariate analysis, 𝑝 > 1).
• Quadratic discriminant analysis (QDA): More flexible than LDA. Here, there is no assumption that the covariance matrix of the classes is the same. LDA tends to perform better than QDA when you have a small training set.
• Mixture discriminant analysis (MDA): Each class is assumed to be a Gaussian mixture of
subclasses. Equality of covariance matrix, among classes, is still assumed.
• Flexible discriminant analysis (FDA): Non-linear combinations of the predictors are used, such as splines. FDA is useful for modelling multivariate non-normality or non-linear relationships among variables within each group.
• Regularized discriminant analysis (RDA): Regularization improves the estimate of the
covariance matrices in situations where the number of predictors is larger than the number of
samples. This might be very useful for a large multivariate data set containing highly correlated
predictors.
• MDA might outperform LDA and QDA in some situations, as illustrated below.
In this example data, we have 3 main groups of individuals, each having 3 non-adjacent subgroups. The solid black lines on the plot represent the decision boundaries of LDA, QDA and MDA. It can be seen that the MDA classifier has correctly identified the subclasses, whereas LDA and QDA were not good at all at modelling these data.
Why Discriminant Analysis
• When the classes of the response variable Y (e.g., default = "Yes", default = "No") are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. LDA & QDA do not suffer from this problem.
• If n is small and the distribution of the predictors X is approximately normal in each of the
classes, the LDA & QDA models are again more stable than the logistic regression model.
• LDA & QDA are often preferred over logistic regression when we have more than two non-ordinal response classes (e.g., stroke, drug overdose, and epileptic seizure).
• It is always good to compare the results of different analytic techniques; this can either help
to confirm results or highlight how different modelling assumptions and characteristics
uncover new insights.
However, it is important to note that LDA & QDA have assumptions that are often more restrictive than logistic regression:
• Both LDA and QDA assume the predictor variables X are drawn from a multivariate Gaussian (i.e., normal) distribution.
• LDA assumes equality of the covariance matrix of the predictor variables X across all levels of Y. This assumption is relaxed in the QDA model.
• LDA and QDA require the number of predictor variables (𝑝) to be less than the sample size (𝑛). Furthermore, it is important to keep in mind that performance will severely decline as 𝑝 approaches 𝑛. A simple rule of thumb is to use LDA & QDA on data sets where 𝑛 ≥ 5 × 𝑝.
The major steps of Discriminant Analysis
Preparing Data
• Discriminant analysis can be affected by the scale/unit in which the predictor variables are measured.
• It is generally recommended to standardize/normalize the continuous predictors before the analysis:
1. Split the data into a training and a test set
2. Normalize the data; categorical variables are automatically ignored (see the R sketch below)
• The sample size must be large enough (at least 20 cases per group).
• The number of observations per predictor variable (X) should be at least 5.
• In selecting the independent variables, we can use univariate ANOVA to assess the significance of the differences between the means of the independent variables for the two or more groups.
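A minimal R sketch of these two preparation steps using the caret package; the data frame name df and its outcome column class are illustrative placeholders, not names from the lecture's example:

library(caret)

set.seed(123)
# 1. Split the data into training and test sets (80/20)
in.train   <- createDataPartition(df$class, p = 0.8, list = FALSE)
train.data <- df[in.train, ]
test.data  <- df[-in.train, ]

# 2. Estimate centering/scaling parameters on the training set only;
#    preProcess() automatically ignores categorical (factor) variables
preproc           <- preProcess(train.data, method = c("center", "scale"))
train.transformed <- predict(preproc, train.data)
test.transformed  <- predict(preproc, test.data)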
Linear Discriminant Analysis (LDA)
• LDA starts by finding directions (linear combinations of the predictor variables) that maximize the separation between classes, then uses these directions to predict the class of individuals.
• LDA assumes that the predictors are normally distributed (Gaussian distribution) and that the different classes have class-specific means and equal variance/covariance.
Note that, if the predictor variables are standardized before computing LDA, the discriminant weights can be used as measures of variable importance for feature selection.
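A minimal sketch of LDA with the lda() function from the MASS package, shown on R's built-in iris data purely for illustration (not the lecture's data set):

library(MASS)

set.seed(123)
idx   <- sample(nrow(iris), 0.8 * nrow(iris))   # simple 80/20 split
train <- iris[idx, ]
test  <- iris[-idx, ]

lda.fit <- lda(Species ~ ., data = train)   # finds the linear discriminant directions
lda.fit$scaling                             # discriminant weights (coefficients)

pred <- predict(lda.fit, test)
mean(pred$class == test$Species)            # test-set classification accuracy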
Quadratic discriminant analysis (QDA)
• QDA is a little more flexible than LDA, in the sense that it does not assume equality of the variance/covariance matrices.
• QDA is recommended if the training set is very large, so that the variance of the classifier is not a major issue, or if the assumption of a common covariance matrix for the K classes is clearly untenable (James et al. 2014).
• QDA can be computed using the R function qda() [MASS package]; a short sketch follows below.
• If the variability of the observations within each class differs, QDA is able to capture the differing covariances and provide more accurate non-linear classification decision boundaries.
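A matching sketch for QDA, reusing the illustrative train/test split from the LDA sketch above:

qda.fit  <- qda(Species ~ ., data = train)   # estimates a separate covariance matrix per class
qda.pred <- predict(qda.fit, test)
mean(qda.pred$class == test$Species)         # test-set accuracy, for comparison with LDA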
Classification Model Evaluation
• After building a predictive classification model, you need to evaluate the performance of the model, that is, how well the model predicts the outcome of new observations (test data) that were not used to train the model:
• estimate the model prediction accuracy and
• prediction errors using a new test data set
• Methods for assessing the performance of predictive classification models:
• Average classification accuracy, representing the proportion of correctly classified observations.
• Confusion matrix, a 2×2 table showing four quantities: the number of true positives, true negatives, false negatives and false positives.
• Precision, Sensitivity and Specificity, which are three major performance metrics describing a predictive classification model.
• ROC curve, showing the proportion of true positives and false positives at all possible values of the probability cutoff (typically used in binary classification).
• The Area Under the Curve (AUC) summarizes the overall performance of the classifier. Values above 0.80 are an indication of a good classifier.
Average Classification Accuracy:
• The overall classification accuracy rate corresponds to the proportion of observations that have been correctly classified.
accuracy <- mean(observed.classes == predicted.classes)
• Inversely, the classification error rate is defined as the proportion of observations that have been misclassified: error rate = 1 - accuracy.
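Continuing with the same observed/predicted class vectors, the error rate can be computed directly:

error.rate <- 1 - accuracy   # proportion of misclassified observations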
Confusion Matrix
It compares the observed and the predicted outcome values and shows the number of correct and
incorrect predictions categorized by type of outcome.
library(magrittr)   # provides the %>% pipe used below
table(observed.classes, predicted.classes)
# Confusion matrix, proportion of cases
table(observed.classes, predicted.classes) %>%
  prop.table() %>% round(digits = 3)
The correct classification rate is the sum of the numbers on the diagonal divided by the sample size in the test data. In our example, that is (48 + 15)/78 = 81%, or 0.615 + 0.192 = 0.807.
Example. Classification of diabetes patients
Confusion matrix
True positives (d): these are cases in which we predicted the individuals would be diabetes-
positive and they were.
True negatives (a): We predicted diabetes-negative, and the individuals were diabetes-
negative.
False positives (b): We predicted diabetes-positive, but the individuals didn't actually have
diabetes. (Also known as a Type I error .)
False negatives (c): We predicted diabetes-negative, but they did have diabetes. (Also known
as a Type II error .)
Precision, Sensitivity and Specificity
• Precision = TruePositives / (TruePositives + FalsePositives)
• Sensitivity (the true positive rate) = TruePositives / (TruePositives + FalseNegatives)
• Specificity (the true negative rate) = TrueNegatives / (TrueNegatives + FalsePositives)
• Sensitivity and Specificity are commonly used to measure the performance of a predictive model.
• These metrics can be easily computed using the function confusionMatrix() [caret package]; a brief sketch is shown below.
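A minimal sketch of confusionMatrix() from the caret package; it assumes observed.classes and predicted.classes are factors with the same levels and that the positive class is labelled "pos" (the label is an assumption, not from the slides):

library(caret)
confusionMatrix(data      = predicted.classes,
                reference = observed.classes,
                positive  = "pos")
# The output includes Accuracy, Sensitivity, Specificity and Precision (Pos Pred Value)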
• In medical science, sensitivity and specificity are two important metrics that characterize the performance of a classifier or screening test.
• In medical diagnostics, such as in our example, we are likely to be more concerned with minimizing wrong positive diagnoses; that is, we are more concerned about high Specificity. Here, the model specificity is 92%, which is very good.
• In some situations, we may be more concerned with tuning the model so that the sensitivity/precision is improved.
• To this end, you can test different probability cutoffs to decide which individuals are classified as positive and which as negative.
• Note that here we have used 𝑝 > 0.5 as the probability threshold above which we declare an individual diabetes-positive.
• However, if we are more concerned about failing to detect individuals who are truly diabetes-positive (false negatives), we can consider lowering this threshold, for example to 𝑝 > 0.2, as sketched below.
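An illustrative sketch of re-classifying at the lower cutoff, assuming an LDA fit lda.fit and a positive class labelled "pos" (both names are placeholders):

pred <- predict(lda.fit, test.data)                           # posterior class probabilities
new.classes <- ifelse(pred$posterior[, "pos"] > 0.2, "pos", "neg")
table(observed.classes, new.classes)                          # confusion matrix at cutoff 0.2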
ROC curve (or receiver operating characteristic curve) & AUC
• The ROC curve is a popular graphical measure for assessing the performance or the accuracy of a classifier, which corresponds to the total proportion of correctly classified observations.
• The ROC plot simply displays the sensitivity against the specificity at all possible probability cutoffs.
• For a good model, the ROC curve should rise steeply, indicating that the true positive rate (y-axis) increases faster than the false positive rate (x-axis) as the probability threshold decreases.
• So, the "ideal point" is the top left corner of the graph, that is, a false positive rate of zero and a true positive rate of one.
• The ROC analysis can be easily performed using the R package pROC (see the sketch below).
• The Area Under the Curve (AUC) summarizes the overall performance of the classifier over all possible probability cutoffs. It represents the ability of a classification algorithm to distinguish 1s from 0s (i.e., events from non-events, or positives from negatives).
• The AUC metric varies between 0.50 (random classifier) and 1.00. Values above 0.80 are an indication of a good classifier.
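A minimal ROC/AUC sketch with the pROC package, assuming the vector of predicted probabilities for the positive class "pos" from the previous step (names are illustrative):

library(pROC)
res.roc <- roc(observed.classes, pred$posterior[, "pos"])   # build the ROC object
plot.roc(res.roc, print.auc = TRUE)                         # ROC curve with the AUC displayed
auc(res.roc)                                                # area under the curve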
Prediction or allocation of observations to groups by
Linear Discriminant Analysis
      Obs.   Y    X1       X2       ⋯   Xp
π1:   1      0    x_111    x_211    ⋯   x_p11
      2      0    x_112    x_212    ⋯   x_p12
      ⋮      ⋮
      n1     0    x_11n1   x_21n1   ⋯   x_p1n1
π2:   1      1    x_121    x_221    ⋯   x_p21
      2      1    x_122    x_222    ⋯   x_p22
      ⋮      ⋮
      n2     1    x_12n2   x_22n2   ⋯   x_p2n2
(Here x_ijk denotes the value of variable X_i for observation k in population π_j.)
• Let 𝑓1(𝐱) and 𝑓2(𝐱) be the probability density functions associated with the 𝑝 × 1 random vector 𝑿 for the populations 𝜋1 and 𝜋2, respectively. Let 𝛺 be the sample space, that is, the collection of all possible observations 𝐱. Let 𝑅1 be the set of 𝐱 values for which we classify objects as 𝜋1, and 𝑅2 = 𝛺 − 𝑅1 be the remaining 𝐱 values, for which we classify objects as 𝜋2.
• Let 𝑝1 be the prior probability of 𝜋1 and 𝑝2 be the prior probability of 𝜋2, where 𝑝1 + 𝑝2 = 1.
The costs of misclassification: c(2|1) denotes the cost of misclassifying an object from 𝜋1 into 𝜋2, and c(1|2) the cost of misclassifying an object from 𝜋2 into 𝜋1.
Linear Discriminant Analysis (LDA)
Classification of Normal Populations When 𝜮1 = 𝜮2 = 𝜮
Suppose that the joint densities of 𝑿' = [ 𝑋1 , 𝑋2 , ⋯ , 𝑋𝑝 ] for populations 𝜋1 and 𝜋2 are given by,
𝑓𝑖(𝐱) = (2𝜋)^(−𝑝/2) |𝜮|^(−1/2) exp[ −½ (𝐱 − 𝝁𝑖)′ 𝜮⁻¹ (𝐱 − 𝝁𝑖) ],   𝑖 = 1, 2
The minimum ECM (expected cost of misclassification) regions become:
𝑅1: (𝝁1 − 𝝁2)′ 𝜮⁻¹ 𝐱 − ½ (𝝁1 − 𝝁2)′ 𝜮⁻¹ (𝝁1 + 𝝁2) ≥ ln[ (c(1|2)/c(2|1)) (𝑝2/𝑝1) ]
𝑅2: (𝝁1 − 𝝁2)′ 𝜮⁻¹ 𝐱 − ½ (𝝁1 − 𝝁2)′ 𝜮⁻¹ (𝝁1 + 𝝁2) < ln[ (c(1|2)/c(2|1)) (𝑝2/𝑝1) ]
From these data matrices, the sample mean vectors and covariance matrices are computed as
𝐱̄𝑖 = (1/𝑛𝑖) Σ_{j=1}^{𝑛𝑖} 𝐱𝑖𝑗,   𝑺𝑖 = 1/(𝑛𝑖 − 1) Σ_{j=1}^{𝑛𝑖} (𝐱𝑖𝑗 − 𝐱̄𝑖)(𝐱𝑖𝑗 − 𝐱̄𝑖)′,   𝑖 = 1, 2
where 𝐱𝑖𝑗 is the 𝑗-th observation vector from population 𝜋𝑖, and the pooled covariance matrix is
𝑺𝑝𝑜𝑜𝑙𝑒𝑑 = [ (𝑛1 − 1) 𝑺1 + (𝑛2 − 1) 𝑺2 ] / (𝑛1 + 𝑛2 − 2).
If we set (c(1|2)/c(2|1)) (𝑝2/𝑝1) = 1, then ln(1) = 0, and the rule simplifies as follows.
Define the midpoint (cutoff point) between the group centroids 𝑦̄1 and 𝑦̄2 as
𝑚̂ = ½ (𝑦̄1 + 𝑦̄2) = ½ (𝐱̄1 − 𝐱̄2)′ 𝑺𝑝𝑜𝑜𝑙𝑒𝑑⁻¹ (𝐱̄1 + 𝐱̄2)
where 𝑦̄1 = 𝐚̂′ 𝐱̄1 = (𝐱̄1 − 𝐱̄2)′ 𝑺𝑝𝑜𝑜𝑙𝑒𝑑⁻¹ 𝐱̄1 and 𝑦̄2 = 𝐚̂′ 𝐱̄2 = (𝐱̄1 − 𝐱̄2)′ 𝑺𝑝𝑜𝑜𝑙𝑒𝑑⁻¹ 𝐱̄2.
Then,
Allocate 𝐱0 to 𝜋1 if 𝑦̂0 = 𝐚̂′ 𝐱0 ≥ 𝑚̂
Allocate 𝐱0 to 𝜋2 if 𝑦̂0 = 𝐚̂′ 𝐱0 < 𝑚̂
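A minimal R sketch of this allocation rule, assuming x1 and x2 are numeric matrices of training observations (rows) from 𝜋1 and 𝜋2 and x0 is a new observation vector; all names are illustrative:

fisher.rule <- function(x1, x2, x0) {
  n1 <- nrow(x1); n2 <- nrow(x2)
  xbar1 <- colMeans(x1); xbar2 <- colMeans(x2)
  # pooled covariance matrix
  s.pooled <- ((n1 - 1) * cov(x1) + (n2 - 1) * cov(x2)) / (n1 + n2 - 2)
  a.hat <- solve(s.pooled, xbar1 - xbar2)       # a-hat = S_pooled^-1 (xbar1 - xbar2)
  m.hat <- 0.5 * sum(a.hat * (xbar1 + xbar2))   # midpoint between the group centroids
  y0 <- sum(a.hat * x0)                         # discriminant score of x0
  if (y0 >= m.hat) "pi1" else "pi2"             # allocation rule
}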
Two-Group Example. Discriminant analysis of HBAT customers based on geographic location (located within North America or outside).
Selection of independent variables: the candidate variables with significant differences are X6, X11, X12, X13, and X17.
• Stepwise estimation: the first variable added is X13, which has the largest Mahalanobis distance between the groups (the minimum D² criterion).
• Step 2: the second variable added is X17.
• And so on…
1. After X11 was entered in the equation, both X6 and X12 (which had significant univariate differences across the groups) have relatively little additional discriminatory power, and the estimation process stops with three variables (X11, X13 and X17).
2. We see in the canonical discriminant function output that the discriminant function is highly significant and displays a canonical correlation of 0.749.
• We interpret this canonical correlation by squaring it: (0.749)² = 0.561.
• It means that 56.1 percent of the variation in the dependent variable can be explained by this model, which includes only three independent variables (X11, X13 and X17).
Interpretation of the Results
Analyzing Discriminant Weight.
• The unstandardized weights (plus the constant) are used to calculate the discriminant scores used in classification, but they can be affected by the scale of the independent variables (just like multiple regression weights).
• The standardized weights more accurately reflect the impact of each independent variable on the discriminant function.
• We can see in Table 10 that the classification accuracies for the analysis sample, holdout sample and cross-validated sample are 86.7%, 85% and 83.3%, respectively.
• So, in all instances (analysis sample, holdout sample and cross-validated sample), the level of classification accuracy is substantially higher than the threshold values, indicating an acceptable level of classification accuracy.
More than Two Groups
Interpretation of Two or More Functions
We can use these concepts:
- Rotation of Discriminant Functions
- Potency Index
Rotation of Discriminant Functions
• Calculate the loadings for each function and review the rotation of the functions for the purpose of simplifying interpretation.
• Examine the contribution of the predictor variables:
• to each function separately (i.e., discriminant loadings),
• cumulatively across multiple discriminant functions with the potency index,
• graphically, in a two-dimensional solution, to understand the position of each group.
Potency Index
• A summary measure for describing the contribution of a variable across all significant functions.
• It represents the total discriminating effect of the variable across all significant discriminant functions.
• It is calculated by a two-step process.
A Three-Group Illustrative Example
• In the three-group example, it is necessary to develop two separate discriminant functions to distinguish among the three groups.
• The first function separates one group from the other two, and the second separates the remaining two groups.
• The total amount of variance explained by the first function is 0.893² = 79.7%.
• The next function explains 0.517² = 26.7% of the remaining variance (20.3%).
• Therefore, the total variance explained by the two discriminant functions is 79.7% + (26.7% × 20.3%) = 85.1% of the total variation in the dependent variable.
Interpretation of the Discriminant Functions
• Function 1: described by three variables (X18, X9 and X16) that comprise the postsale customer service dimension, plus X11 and X17.
• Function 2: shows only one variable, X6 (Product Quality).
Contribution of the Predictor Variables
• Start by examining the group centroids on the two functions.
• Examining the group centroids and the distribution of cases in each group, Function 1 primarily differentiates group 1 from groups 2 and 3, whereas Function 2 distinguishes group 3 from groups 1 and 2.
• The overlap and misclassification of the cases of groups 2 and 3 can be addressed by examining the strength of the discriminant functions and the groups differentiated by each.