Discriminant Analysis

 Purpose of Discriminant Analysis
 To classify objects (people, customers, things, etc.) into one of two or more
groups based on a set of features that describe the objects (e.g. gender,
age, income, weight, preference score, etc. )
 In general, we assign an object to one of a number of predetermined groups
based on observations made on the object.
 Groups are known or predetermined and do not have order (i.e. nominal
scale). We are looking for two things:
 Which set of features can best determine group membership of the object?
 What is the classification rule or model to best separate those groups?
 Discriminant analysis is a useful way to answer the questions…
 Are the groups different?
 On which of the variables are they different?
 Is it possible to predict which group a person belongs to using these
variables?
Example
 A mortgage company loan officer wants to
decide whether to approve an applicant’s
mortgage loan
 Past data contains information of people who
have successfully repaid the loan & those who
have defaulted
 Information available on these two groups – age,
income, marital status, outstanding debt and
ownership of certain durable goods
Discriminant analysis
 A technique for
 Analysing marketing data
 Where the criterion or dependent variable is categorical &
 The predictor or independent variables are interval in
nature
Discriminant Function
 A linear combination of independent
variables, which will best discriminate
between the categories of the dependent variable (groups)
Groups in DA
 Two-group discriminant analysis
 Discriminant analysis where the criterion variable
has 2 categories
 Multiple discriminant analysis
 Discriminant analysis technique where the
criterion variable involves three or more
categories
Examples
 Dependent variable is the choice of a PC
brand (A or B) and independent variables are
the ratings of attributes of PC on a 5 point
scale like Price, Battery life, Weight…
 Do heavy, medium and light users of soft
drink differ in terms of their consumption of
frozen foods?
 Distinguishing characteristics of consumers
who respond to direct mail solicitations
Objectives of DA
 Developing a discriminant function, i.e. a linear combination of
the predictor or independent variables that best
discriminates between the categories of the criterion or dependent
variable (groups)
 Examining whether significant differences exist among the
groups, in terms of the predictor variables
 Determining which predictor variable contributes most to the
intergroup differences
 Classification of cases to one of the groups based on the values
of the predictor variables
 Evaluation of the accuracy of the classification
DA model
 The discriminant analysis model involves a linear
combination of the following form
 D = b_0 + b_1 X_1 + b_2 X_2 + … + b_k X_k
 where D is the discriminant score
 the b's are the discriminant coefficients or weights
 the X's are the predictor or independent variables
 The coefficients (the b's) are estimated in such a way that the
groups differ as much as possible on the values of the
discriminant function
 Statistically, this means that the ratio of the between-group sum
of squares to the within-group sum of squares for the
discriminant scores is at a maximum
Comparing Regression and DA

                                  Regression   Discriminant Analysis
Similarities
Number of dependent variables     One          One
Number of independent variables   Multiple     Multiple
Differences
Nature of dependent variables     Metric       Categorical / Binary
Nature of independent variables   Metric       Metric
Conducting DA
1. Formulate the problem
2. Estimate the discriminant function coefficients
3. Determine the significance of the discriminant functions
4. Interpret the results
5. Assess the validity of discriminant analysis
Formulate the problem…1
 Identify the objectives, the criterion variable and the
independent variables
 Criterion variable must consist of two or more
mutually exclusive and collectively exhaustive
groups
 Dependent variable must be categorical
 For example, a five-point rating (1 – Very poor quality, 2 – Poor quality, 3 – Good
quality, 4 – Better quality, 5 – Excellent quality) can be
recoded into two groups:
1 – Poor quality (ratings 1 and 2) and 2 – Good quality (ratings 3, 4 and 5)
 Predictor variables are usually decided based on
previous research, experience…
Formulate the problem…2
 Next step is to divide the sample
 One part of the sample, Estimation or Analysis sample, is
used for the estimation of the discriminant function
 Other part of the sample, Holdout or Validation sample, is
reserved for validating the discriminant function
 Distribution of the dependent variable in the two samples
should be the same as in the entire sample
 For example, if a total sample of 600 contains 40% users
and 60% non-users, and the analysis sample is of size
400 and the holdout sample of size 200, then each sample should
contain 40% users and 60% non-users, as in the entire
600
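
The 600-case example can be worked through in a few lines. This is only a sketch of the arithmetic; the function name `stratified_sizes` is an assumption, not something from the slides.

```python
def stratified_sizes(total, group_shares, analysis_size):
    """Per-group case counts for the analysis and holdout samples,
    keeping each group's share the same as in the whole sample."""
    holdout_size = total - analysis_size
    analysis = {g: round(analysis_size * share) for g, share in group_shares.items()}
    holdout = {g: round(holdout_size * share) for g, share in group_shares.items()}
    return analysis, holdout

# Figures from the example: 600 cases, 40% users, analysis sample of 400
analysis, holdout = stratified_sizes(600, {"users": 0.40, "non-users": 0.60}, 400)
print(analysis)  # counts per group in the analysis sample
print(holdout)   # counts per group in the holdout sample
```

The analysis sample then holds 160 users and 240 non-users, and the holdout sample 80 users and 120 non-users.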
Data Assumptions
 Linearity: The discriminant functions are assumed to be linear
 Normality: The predictor variables should be normally
distributed
 Absence of perfect multicollinearity: There should be no perfect
multicollinearity between the independent variables
 Interval data: The independent variables should be measured on an
interval scale
 Variance: No independent variable should have a zero standard deviation in
any of the groups formed by the dependent variable
 Adequate sample size: There must be at least two cases for
each category of the dependent variable. However, it is
recommended that there should be at least four or five times as
many cases as independent variables
Multicollinearity
 Multicollinearity is a state of very high
intercorrelations or inter-associations among
the independent variables
 Multicollinearity is therefore a type of
disturbance in the data, and if present in the
data the statistical inferences made about the
data may not be reliable.
Canonical correlation
 Canonical correlation is a statistical technique that is used to examine
the degree of the relationship between two canonical (latent) variables
 In multiple correlation we find the relationship between one dependent
variable and many independent variables; canonical correlation
examines the relationship between many dependent variables and
many independent variables. In canonical correlation, we make one
variate from the many independent variables and one variate from the
many dependent variables. Then, we compare those variates to find the
degree of relationship between all variables.
 Wilks’ lambda is used to test the significance of the canonical correlation.
As with simple correlation, the squared canonical correlation coefficient gives
the percentage of variance in the dependent
variate that can be explained by the independent variate.
 Pooled R-square: This is the sum of the squares of all the canonical
correlations. Pooled R-square is used to assess how well one set of variables
can predict the other set of variables
Linear Discriminant Analysis
 If we can assume that the groups are linearly
separable, we can use linear discriminant
model (LDA)
 Linearly separable suggests that the groups
can be separated by a linear combination of
features that describe the objects
Bayes Rule…1
 Assign an object to the group with highest conditional
probability
 If there are several groups, the Bayes' rule is to assign the object
to group i where P(i|x) > P(j|x) for every other group j
 We want to know the probability P(i|x) that an object belongs
to group i, given a set of measurements x
 In practice, however, this quantity is difficult to obtain. What we
can get is P(x|i).
 This is the probability of getting a particular set of
measurements x given that the object comes from group i. For
example, if we know whether a soap is good or bad, we can
measure the object (weight, smell, color, etc.). What we want
is to determine the group of the soap (good or bad) based
on the measurements only.
Bayes Rule…2
 There is a relationship between the two
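conditional probabilities, known as
Bayes' Theorem:
P(i|x) = P(x|i) P(i) / Σ_j P(x|j) P(j)
where P(i) is the prior probability of group i and the sum runs over all groups j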
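
A minimal sketch of how Bayes' theorem turns P(x|i) into P(i|x) for the soap example. The priors and likelihoods are hypothetical numbers chosen for illustration, and the function name `posteriors` is an assumption.

```python
def posteriors(priors, likelihoods):
    """Bayes' theorem: P(i|x) = P(x|i) P(i) / sum over j of P(x|j) P(j)."""
    joint = {g: priors[g] * likelihoods[g] for g in priors}
    evidence = sum(joint.values())
    return {g: joint[g] / evidence for g in joint}

# Hypothetical values, not taken from the slides
priors = {"good": 0.5, "bad": 0.5}        # P(i)
likelihoods = {"good": 0.6, "bad": 0.2}   # P(x|i) for one observed measurement x
post = posteriors(priors, likelihoods)

# Bayes' rule: assign the object to the group with the highest P(i|x)
assigned = max(post, key=post.get)
print(post, assigned)
```

With these numbers the posterior for "good" is about 0.75, so the object is assigned to the "good" group.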
Discriminant function
 In practice, however, to use the Bayes rule directly is
impractical
 Obtaining the probabilities would need a great deal of data to get the
relative frequencies of each group for each measurement
 It is more practical to assume a distribution and obtain the
probabilities theoretically
 If we assume that each group has a multivariate Normal
distribution and all groups have the same covariance matrix,
we get what is called the Linear Discriminant Analysis formula
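
Under the two assumptions just stated, the usual textbook form of the linear discriminant function can be written as below. The notation is assumed here rather than taken from the slides: \mu_i is the mean vector of group i, \Sigma is the common covariance matrix, and P(i) is the prior probability of group i.

```latex
f_i(x) = \mu_i^{\top} \Sigma^{-1} x \;-\; \tfrac{1}{2}\, \mu_i^{\top} \Sigma^{-1} \mu_i \;+\; \ln P(i)
```

An object with measurements x is assigned to the group i with the largest value of f_i(x). The function is linear in x, which is where the method gets its name.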
Example for two-group DA
Data on 42 households are used to determine what distinguishes families
that take a vacation from those that do not
Dependent variable – Taken a vacation in the last 2 years: 1 = yes, 2 = no
Independent variables – Annual family income, attitude towards travel
(measured on a 9-point scale), importance attached to family vacations
(measured on a 9-point scale), household size and age of the head of
the household

30 households – Analysis sample; 12 households – Validation sample


No.  Taken Vacation  Annual HH Income  Travel attitude  Vacation imp  HH size  Age of HH head
  1               1              50.2                5             8        3              43
  2               1              70.3                6             7        4              61
  3               1              62.9                7             5        6              52
  4               1              48.5                7             5        5              36
  5               1              52.7                6             6        4              55
  6               1              75.0                8             7        5              68
  7               1              46.2                5             3        3              62
  8               1              57.0                2             4        6              51
  9               1              64.1                7             5        4              57
 10               1              68.1                7             6        5              45
 11               1              73.4                6             7        5              44
 12               1              71.9                5             8        4              64
 13               1              56.2                1             8        6              54
 14               1              49.3                4             2        3              56
 15               1              62.0                5             6        2              58
 16               2              32.1                5             4        3              58
 17               2              36.2                4             3        2              55
 18               2              43.2                2             5        2              57
 19               2              50.4                5             2        4              37
 20               2              44.1                6             6        3              42
 21               2              38.3                6             6        2              45
 22               2              55.0                1             2        2              57
 23               2              46.1                3             5        3              51
 24               2              35.0                6             4        5              64
 25               2              37.3                2             7        4              54
 26               2              41.8                5             1        3              56
 27               2              57.0                8             3        2              36
 28               2              33.4                6             8        2              50
 29               2              37.5                6             2        3              48
 30               2              41.3                3             3        2              42
Group Means

Taken Vacation  Income  Travel  Vacation  Hsize    Age
1                60.52    5.40      5.80   4.33  53.73
2                41.91    4.33      4.07   2.80  50.13
Total            51.22    4.87      4.93   3.57  51.93

Group Standard Deviations

Taken Vacation  Income  Travel  Vacation  Hsize    Age
1                 9.83    1.92      1.82   1.23   8.78
2                 7.55    1.95      2.05   0.94   8.27
Total            12.79    1.98      2.10   1.33   8.57

Looking at the group means and group standard deviations:

1. The 2 groups are more widely separated on income than on the other variables
2. There appears to be more separation on the importance attached to family
vacation than on attitude toward travel
3. The difference between the 2 groups on age of the head of the household is small, and the
standard deviation of this variable is large
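
As a check on the first observation, the income figures in the group tables can be recomputed from the analysis-sample data. The sketch below uses only the income column, and assumes the reported values are sample standard deviations (n − 1 in the denominator).

```python
from statistics import mean, stdev

# Annual household income for the 30 analysis-sample households, by group
income_group1 = [50.2, 70.3, 62.9, 48.5, 52.7, 75, 46.2, 57, 64.1, 68.1,
                 73.4, 71.9, 56.2, 49.3, 62]
income_group2 = [32.1, 36.2, 43.2, 50.4, 44.1, 38.3, 55, 46.1, 35, 37.3,
                 41.8, 57, 33.4, 37.5, 41.3]

mean1, mean2 = mean(income_group1), mean(income_group2)
sd1, sd2 = stdev(income_group1), stdev(income_group2)  # sample standard deviations

print(round(mean1, 2), round(sd1, 2))
print(round(mean2, 2), round(sd2, 2))
print(round(mean1 - mean2, 2))  # gap between the two group means
```

The gap between the income means is roughly twice either group's standard deviation, which is why income stands out as the variable on which the groups are most clearly separated.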
Pooled within-groups correlation matrix

           Income    Travel    Vacation  Hsize     Age
Income     1
Travel     0.19745   1
Vacation   0.09148   0.083434  1
Hsize      0.08887   -0.01681  0.07046   1
Age        -0.01431  -0.19709  0.01742   -0.04301  1

The correlations between the predictor variables need to be examined:

- The correlations are low, so multicollinearity is not a problem
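
The same check can be done programmatically by scanning the off-diagonal entries for the largest absolute correlation. The cut-off of 0.8 used below is a hypothetical rule of thumb chosen for illustration; the slides do not state a threshold.

```python
# Off-diagonal entries of the pooled within-groups correlation matrix
pooled_corr = {
    ("Income", "Travel"): 0.19745,
    ("Income", "Vacation"): 0.09148,
    ("Travel", "Vacation"): 0.083434,
    ("Income", "Hsize"): 0.08887,
    ("Travel", "Hsize"): -0.01681,
    ("Vacation", "Hsize"): 0.07046,
    ("Income", "Age"): -0.01431,
    ("Travel", "Age"): -0.19709,
    ("Vacation", "Age"): 0.01742,
    ("Hsize", "Age"): -0.04301,
}

THRESHOLD = 0.8  # assumed cut-off for "very high" correlation, not from the slides

strongest_pair = max(pooled_corr, key=lambda pair: abs(pooled_corr[pair]))
strongest_value = pooled_corr[strongest_pair]
multicollinearity_flag = abs(strongest_value) >= THRESHOLD

print(strongest_pair, strongest_value, multicollinearity_flag)
```

The strongest pair is income and travel attitude at about 0.20, far below any common cut-off.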
Wilks' Lambda and univariate F ratio with 1 and 28 degrees of freedom

Variable  Wilks' Lambda      F  Significance
Income             0.45  33.80          0.00
Travel             0.92   2.27          0.14
Vacation           0.82   5.99          0.02
Hsize              0.66  14.64          0.00
Age                0.95   1.39        0.2572

Significance of the F ratio indicates that when predictors are considered individually,
Income, Importance of vacation and household size significantly differentiate
between those who took a vacation and those who did not
Discriminant Function
 As there are 2 groups,
there will be 1
discriminant function

Canonical Discriminant Function

Function  Eigenvalue  Percent of Variance  Cumulative Percent  Canonical Correlation
1             1.7862                  100                 100                 0.8007

The function explains 100% of the variance and has a canonical correlation of 0.8007.
r² = (0.8007)² = 0.64,
which indicates that 64% of the variance in the dependent variable (taken
vacation) is explained by this model
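
The eigenvalue and the canonical correlation in this table are linked by a standard identity, r = sqrt(λ / (1 + λ)). The slides do not state this relation; the sketch below applies it to the reported eigenvalue as a consistency check.

```python
import math

eigenvalue = 1.7862  # eigenvalue of the single discriminant function

# Standard identity linking the eigenvalue to the canonical correlation
canonical_corr = math.sqrt(eigenvalue / (1 + eigenvalue))
r_squared = canonical_corr ** 2

print(round(canonical_corr, 4))
print(round(r_squared, 2))
```

The result agrees with the reported canonical correlation of 0.8007 and the squared value of 0.64.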
Significance of Discriminant function
 Further interpretation of discriminant analysis makes sense only if
the estimated discriminant function is statistically significant
 In SPSS, the statistic provided is Wilks’ Lambda and its
corresponding chi-square transformation

Wilks' Lambda  Chi-square  DF    Sig.
       0.3589       26.13   5  0.0001

This is significant at the 5% level (95% confidence). Thus the null hypothesis is
rejected, indicating significant discrimination, so the results can be interpreted
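
The chi-square value can be reproduced from Wilks' Lambda with Bartlett's approximation, χ² = −[n − 1 − (p + g)/2] ln Λ, with p(g − 1) degrees of freedom. The slides only say that SPSS reports "the corresponding chi-square transformation", so treating it as Bartlett's formula is an assumption; the sketch below shows that it matches the reported figures.

```python
import math

n_cases = 30        # analysis sample size
n_predictors = 5    # income, travel, vacation, hsize, age
n_groups = 2
wilks_lambda = 0.3589

# Bartlett's chi-square approximation for Wilks' Lambda (assumed to be the SPSS transformation)
chi_square = -(n_cases - 1 - (n_predictors + n_groups) / 2) * math.log(wilks_lambda)
degrees_of_freedom = n_predictors * (n_groups - 1)

print(round(chi_square, 2), degrees_of_freedom)
```

This gives a chi-square of about 26.13 on 5 degrees of freedom, in line with the table.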
Interpreting results
Standardized Canonical Discriminant Function Coefficients

          Func1
Income    0.74301
Travel    0.09611
Vacation  0.23329
Hsize     0.46911
Age       0.20922

Pooled within-groups correlations

          Func1
Income    0.82202
Hsize     0.54096
Vacation  0.34607
Travel    0.21337
Age       0.16354

Studying the standardised discriminant function coefficients:

1. Income is the most important predictor in discriminating between the
groups, followed by household size and then importance attached to vacation
2. The pooled within-groups correlations give similar results
The equation
Unstandardized Canonical Discriminant Function Coefficients

          Func1
Income     0.0848
Travel     0.0496
Vacation   0.1202813
Hsize      0.4273893
Age        0.0245
Constant  -7.98

These are applied to the raw values of the variables for classification
purposes. All the coefficients are positive, suggesting that higher family income,
household size, importance attached to vacation, attitude towards travel and age
each make it more likely that the family takes a vacation
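
A short sketch of how these coefficients are applied to raw values. The helper name `discriminant_score` is an assumption; the inputs are the group means reported earlier, and because the printed coefficients are rounded the results are only approximate.

```python
# Unstandardized coefficients and constant as printed on the slide (rounded)
coefficients = {
    "income": 0.0848,
    "travel": 0.0496,
    "vacation": 0.1202813,
    "hsize": 0.4273893,
    "age": 0.0245,
}
constant = -7.98

def discriminant_score(case):
    """D = b0 + b1*X1 + ... + bk*Xk, computed on raw (unstandardized) values."""
    return constant + sum(coefficients[name] * case[name] for name in coefficients)

# Group means from the Group Means table
group1_means = {"income": 60.52, "travel": 5.4, "vacation": 5.8, "hsize": 4.33, "age": 53.73}
group2_means = {"income": 41.91, "travel": 4.33, "vacation": 4.07, "hsize": 2.8, "age": 50.13}

score_group1 = discriminant_score(group1_means)
score_group2 = discriminant_score(group2_means)
print(round(score_group1, 2), round(score_group2, 2))
```

Scoring the two vectors of group means gives values close to +1.29 and −1.29, the discriminant scores of an "average" household in each group.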
Classification
Group Centroids

Group  Func1
1       1.29118
2      -1.29118

Those who have taken a vacation have a group centroid of 1.29118,
and those who have not have a group centroid of -1.29118.
The cases are assigned to a group based on their discriminant
scores and an appropriate decision rule. For example, a case can
be assigned to the group whose centroid is closest to its score.
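
The nearest-centroid rule described above can be sketched in a few lines. The function name `assign_group` and the two test scores are made up for illustration.

```python
# Group centroids on the discriminant function
centroids = {1: 1.29118, 2: -1.29118}

def assign_group(score):
    """Assign a case to the group whose centroid is closest to its discriminant score."""
    return min(centroids, key=lambda group: abs(score - centroids[group]))

# Hypothetical discriminant scores
print(assign_group(0.8))   # closer to the group 1 centroid
print(assign_group(-0.2))  # closer to the group 2 centroid
```

Because the two centroids are symmetric about zero, this rule amounts to a cut-off at D = 0: positive scores go to group 1 and negative scores to group 2. A score of exactly zero is a tie, which this sketch resolves in favour of group 1.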
Classification Results

Analysis sample
                     Predicted group
Original               1       2   Total
Count      1          12       3      15
           2           0      15      15
Percent    1         80%     20%
           2          0%    100%

Correct classification: 90%

Determine the hit ratio, which is the percentage of cases correctly classified

Holdout sample
                     Predicted group
Original               1       2   Total
Count      1           4       2       6
           2           0       6       6
Percent    1         67%     33%
           2          0%    100%

Correct classification: 83%

Given 2 groups of equal size, one would expect a hit ratio of ½ = 0.50, or 50%, by
chance. With a holdout hit ratio of 83%, the discriminant function shows more than 25%
improvement over chance, so the validity of the discriminant function is
judged as satisfactory
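
The hit ratios can be recomputed from the classification counts. In the sketch below the function name `hit_ratio` is an assumption, and reading "25% improvement over chance" as a threshold of 1.25 times the chance rate is one possible interpretation, not something the slides spell out.

```python
def hit_ratio(confusion):
    """Share of correctly classified cases: diagonal total over grand total."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Rows are actual groups, columns are predicted groups
analysis_counts = [[12, 3], [0, 15]]
holdout_counts = [[4, 2], [0, 6]]

analysis_hit = hit_ratio(analysis_counts)
holdout_hit = hit_ratio(holdout_counts)

chance = 1 / 2              # two groups of equal size
threshold = 1.25 * chance   # assumed reading of "25% improvement over chance"

print(round(analysis_hit, 2), round(holdout_hit, 2), holdout_hit > threshold)
```

Both the analysis sample (90%) and the holdout sample (about 83%) clear the threshold. The holdout figure is the more meaningful one, since those cases were not used to estimate the function.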
