
Chapter 4 – Dimension Reduction

Data Mining for Business Intelligence


Shmueli, Patel & Bruce

© Galit Shmueli and Peter Bruce 2010


Dimension reduction
 Reducing the number of variables in a dataset helps data mining algorithms operate efficiently
 Several data reduction techniques:
    Incorporate domain knowledge to combine categories
    Use data summaries
    Employ automated reduction techniques, such as Principal Components Analysis (PCA)
 Expert knowledge needs to be integrated with the automated tools
 In computer science this is often called feature selection or feature extraction
Advantages of Dimension Reduction
 Data mining tasks often involve a large number of variables
 Subsets of the variables may be highly correlated
 Some variables are unrelated to the outcome of interest in classification/prediction problems
 Such variables can lead to overfitting, hurting the accuracy and reliability of the model
 Superfluous variables also increase cost

Question: Which variables are important for the task at hand, and which are useless?
Correlations Between Pairs of Variables:
Correlation Matrix from Excel
• Large datasets usually have many redundant variables with much overlap
• Redundancies can be found by looking at the pairwise correlations (correlation matrix)
• This helps detect duplications and multicollinearity

          PTRATIO         B     LSTAT      MEDV
PTRATIO         1
B        -0.17738         1
LSTAT    0.374044  -0.36609         1
MEDV     -0.50779  0.333461  -0.73766         1

Remember heat maps?


Correlation Analysis
Correlation matrix for a portion of the Boston Housing data, showing the correlation between variable pairs:

           CRIM        ZN     INDUS      CHAS       NOX        RM
CRIM          1
ZN     -0.20047         1
INDUS  0.406583  -0.53383         1
CHAS   -0.05589   -0.0427  0.062938         1
NOX    0.420972   -0.5166  0.763651  0.091203         1
RM     -0.21925  0.311991  -0.39168  0.091251  -0.30219         1

Interpretation: pairs with high correlations have much overlap, so one variable from each such pair can be removed.
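
The same check can be scripted outside Excel. A minimal sketch in Python/pandas; the file name "BostonHousing.csv" and the 0.7 cutoff are illustrative assumptions, not part of the slides:

```python
import itertools

import pandas as pd

df = pd.read_csv("BostonHousing.csv")  # hypothetical file name

# Pairwise correlation matrix for a portion of the variables
subset = df[["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM"]]
corr = subset.corr()
print(corr.round(3))

# Flag pairs with |r| > 0.7 as candidates for dropping one member
for a, b in itertools.combinations(corr.columns, 2):
    if abs(corr.loc[a, b]) > 0.7:
        print(a, b, round(corr.loc[a, b], 3))
```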
Reducing Categories
 A single categorical variable with m categories is typically transformed into m-1 dummy variables
 Each dummy variable takes the value 0 or 1
    0 = “no” for the category
    1 = “yes”
 Problem: you can end up with too many variables
 Solution: reduce the number by combining categories that are close to each other
Combining categories
 Examine the sizes of the categories and how the response behaves in each category
 Generally, categories that contain very few observations can be combined with others
 Use the categories that are most relevant to the task and label the rest as “Other”
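
A quick sketch of both steps in Python/pandas; the example series and the cutoff of 3 observations are illustrative assumptions:

```python
import pandas as pd

# Hypothetical example: a categorical column "mfr" (cereal manufacturer)
s = pd.Series(["N", "Q", "K", "K", "R", "G", "K", "G", "R", "P", "Q", "G", "G"],
              name="mfr")

# Combine categories with fewer than 3 observations into "Other"
counts = s.value_counts()
rare = counts[counts < 3].index
s_reduced = s.where(~s.isin(rare), "Other")

# m categories -> m-1 dummy variables (drop_first avoids redundancy)
dummies = pd.get_dummies(s_reduced, prefix="mfr", drop_first=True)
print(dummies.head())
```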
Principal Components Analysis
 Goal: reduce a set of numerical variables
   https://www.youtube.com/watch?v=kw9R0nD69OU
 Idea: remove the overlap of information between these variables
   [“Information” is measured by the sum of the variances of the variables.]
 Final product: a smaller number of numerical variables that contain most of the information
 Use: useful when the measurements are highly correlated; works ONLY with numerical variables
Principal Components Analysis
How does PCA do this?
 It creates new variables that are linear combinations of the original variables (i.e., weighted averages of the original variables)
 These linear combinations are uncorrelated (no information overlap), and only a few of them contain most of the original information
 The new variables are called principal components
Example – Breakfast Cereals
 Data collected on nutritional information and consumer rating of 77 breakfast cereals; 13 numerical variables in the study
 Goal: reduce the dimension

name                       mfr  type  calories  protein  …  rating
100%_Bran                  N    C           70        4  …      68
100%_Natural_Bran          Q    C          120        3  …      34
All-Bran                   K    C           70        4  …      59
All-Bran_with_Extra_Fiber  K    C           50        4  …      94
Almond_Delight             R    C          110        2  …      34
Apple_Cinnamon_Cheerios    G    C          110        2  …      30
Apple_Jacks                K    C          110        2  …      33
Basic_4                    G    C          130        3  …      37
Bran_Chex                  R    C           90        2  …      49
Bran_Flakes                P    C           90        3  …      53
Cap'n'Crunch               Q    C          120        1  …      18
Cheerios                   G    C          110        6  …      51
Cinnamon_Toast_Crunch      G    C          120        1  …      20
Description of Variables
 name: name of cereal
 mfr: manufacturer
 type: cold or hot
 calories: calories per serving
 protein: grams
 fat: grams
 sodium: mg
 fiber: grams
 carbo: grams of complex carbohydrates
 sugars: grams
 potass: mg
 vitamins: % of FDA recommendation
 shelf: display shelf
 weight: oz. per serving
 cups: in one serving
 rating: consumer reports rating
Consider calories & rating
 Estimated covariance matrix:

             calories    rating
 calories      379.63   -189.68
 rating       -189.68    197.32

 The two variables are strongly correlated, with r = -0.69
 Total variance (= “information”) is the sum of the individual variances: 379.63 + 197.32 = 576.95
 “Calories” alone accounts for 379.63/576.95 = 66% of the total variance; given the strong correlation, much of the remaining 34% is overlap with “rating”
 It may make sense to reduce these two variables to a single variable without losing much information
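
These quantities are easy to verify in code. A sketch assuming the cereal data sit in a hypothetical file named "Cereals.csv" with columns calories and rating; the printed numbers match the slide only if the same 77-cereal data set is used:

```python
import pandas as pd

df = pd.read_csv("Cereals.csv")  # hypothetical file name

two = df[["calories", "rating"]]
cov = two.cov()                  # estimated covariance matrix
print(cov.round(2))

# Total variance ("information") is the sum of the individual variances
total_variance = cov.values.diagonal().sum()
share_calories = cov.loc["calories", "calories"] / total_variance
print(f"total variance: {total_variance:.2f}")   # 576.95 on the slide's data
print(f"calories' share: {share_calories:.0%}")  # 66%

print(two.corr().round(2))       # r = -0.69 between the two variables
```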
First & Second Principal Components
 Z1 and Z2 are two linear combinations
   Z1 has the largest variance (spread of values)
   Z2 has the second-largest variance

   Z1 = a1X1 + a2X2
   Z2 = b1X1 + b2X2

 where X1 and X2 are the original variables in the dataset and a1, a2, b1, b2 are called “weights”
 Z1: first principal component; Z2: second principal component
PCA output for these 2 variables (XLMiner)
 Top: the weights used to project the original data onto Z1 & Z2
   e.g., (-0.847, 0.532) are the weights for Z1 and (0.532, 0.847) are the weights for Z2
 Bottom: the variance reallocated to the new variables
   Z1: 86% of the total variance; Z2: 14%

                     Components
Variable             1             2
calories   -0.84705347    0.53150767
rating      0.53150767    0.84705347

Variance   498.0244751     78.932724
Variance%  86.31913757   13.68086338
Cum%       86.31913757           100
P-value              0             1

Z1 = -0.847 x calories + 0.532 x rating
Z2 =  0.532 x calories + 0.847 x rating
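
The same output can be reproduced with scikit-learn, up to the signs of the weights, which vary between implementations; "Cereals.csv" is again a hypothetical file name:

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("Cereals.csv")  # hypothetical file name

pca = PCA(n_components=2)
pca.fit(df[["calories", "rating"]])

# Rows of components_ are the weight vectors for Z1 and Z2; their signs
# may be flipped relative to the XLMiner output
print(pca.components_)
print(pca.explained_variance_)        # ~498 and ~79 on the slide's data
print(pca.explained_variance_ratio_)  # ~0.86 and ~0.14
```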
Principal Component Scores
XLMiner: Principal Components Analysis – Scores

Row Id.                         1       2
100%_Bran                   44.92    2.20
100%_Natural_Bran          -15.73   -0.38
All-Bran                    40.15   -5.41
All-Bran_with_Extra_Fiber   75.31   13.00
Almond_Delight              -7.04   -5.36
Apple_Cinnamon_Cheerios     -9.63   -9.49
Apple_Jacks                 -7.69   -6.38
Basic_4                    -22.57    7.52
Bran_Chex                   17.73   -3.51

 The weights are used to compute the scores above: column 1 holds the computed Z1 scores and column 2 the computed Z2 scores
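
To make the link between weights and scores concrete, here is a sketch showing that the scores are simply the weights applied to the mean-centered data; it reuses the same hypothetical "Cereals.csv":

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("Cereals.csv")  # hypothetical file name
X = df[["calories", "rating"]]
pca = PCA(n_components=2).fit(X)

# Manual computation: center each variable, then project onto the weights
manual = (X - X.mean()) @ pca.components_.T

# The library call gives the same numbers
scores = pca.transform(X)
print(scores[:5].round(2))  # first five rows of Z1 and Z2 scores
```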
Properties of the resulting variables
 New distribution of information:
   New variances = 498 (for Z1) and 79 (for Z2)
   The sum of the new variances equals the sum of the variances of the original variables, calories and rating
 The new variable Z1 has most of the total variance and might be used as a proxy for both calories and rating
 Z1 and Z2 have a correlation of zero (no information overlap)
Generalization
• X1, X2, X3, …, Xp are the original p variables
• Z1, Z2, Z3, …, Zp are weighted averages of the original variables
• All pairs of Z variables have zero correlation
• The Z's are ordered by variance (Z1 largest, Zp smallest)
• Usually the first few Z variables contain most of the information, so the rest can be dropped (achieving dimension reduction)
Principal components
 Thus the principal component scores are:
   Zi = ai,1X1 + ai,2X2 + … + ai,pXp,   i = 1, …, p
• The software computes the weights ai,j, which are then used to compute the principal component scores
• Breakfast cereal example: PCA on the 13 numerical variables shows that the first 3 PCs capture more than 96% of the total variation in all 13 original variables
   • That is less than 25% of the dimension of the original data
   • The first 2 PCs alone capture 93% of the variation
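
A sketch of this selection rule in scikit-learn; the file and column names are assumptions, and the 96% cutoff is taken from the slide:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("Cereals.csv")  # hypothetical file name
num_cols = ["calories", "protein", "fat", "sodium", "fiber", "carbo",
            "sugars", "potass", "vitamins", "shelf", "weight", "cups", "rating"]
X = df[num_cols].dropna()

pca = PCA().fit(X)  # unstandardized, as on the "PCA on full data set" slide
cum = np.cumsum(pca.explained_variance_ratio_)
print(cum.round(4))  # first 2 PCs ~0.93, first 3 ~0.97 on the slide's data

# Keep the smallest number of PCs that captures, say, 96% of the variance
k = int(np.searchsorted(cum, 0.96)) + 1
print("components kept:", k)
```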
PCA on full data set
Variable           1             2             3             4             5             6
calories    0.07624155   -0.01066097    0.61074823   -0.61706442    0.45754826    0.12601775
protein    -0.00146212    0.00873588    0.00050506    0.0019389     0.05533375    0.10379469
fat        -0.00013779    0.00271266    0.01596125   -0.02595884   -0.01839438   -0.12500292
sodium      0.98165619    0.12513085   -0.14073193   -0.00293341    0.01588042    0.02245871
fiber      -0.00479783    0.03077993   -0.01684542    0.02145976    0.00872434    0.271184
carbo       0.01486445   -0.01731863    0.01272501    0.02175146    0.35580006   -0.56089228
sugars      0.00398314   -0.00013545    0.09870714   -0.11555841   -0.29906386    0.62323487
potass     -0.119053      0.98861349    0.03619435   -0.042696     -0.04644227   -0.05091622
vitamins    0.10149482    0.01598651    0.7074821     0.69835609   -0.02556211    0.01341988
shelf      -0.00093911    0.00443601    0.01267395    0.00574066   -0.00823057   -0.05412053
weight      0.0005016     0.00098829    0.00369807   -0.0026621     0.00318591    0.00817035
cups        0.00047302   -0.00160279    0.00060208    0.00095916    0.00280366   -0.01087413
rating     -0.07615706    0.07254035   -0.30776858    0.33866307    0.75365263    0.41805118

Variance   7204.161133   4833.050293   498.4260864   357.2174377   72.47863007    4.33980322
Variance%  55.52834702   37.25226212    3.84177661    2.75336623    0.55865192    0.0334504
Cum%       55.52834702   92.78060913   96.62238312   99.37575531   99.93440247   99.96785736

 First 6 components shown
 The first 2 capture 93% of the total variation
 Note: these data differ slightly from the text
Normalizing data
 PCA helps in understanding the structure of the data
   Examine the weights to see how the original variables contribute to each PC
   In these results, sodium dominates the first PC
   This is simply because of the way it is measured (mg): its scale is larger than that of almost all the other variables
   Hence its variance is a dominant component of the total variance
 Normalize each variable to remove the scale effect
   Divide by the standard deviation (the mean may be subtracted first)
 Normalization (= standardization) is usually performed in PCA; otherwise the measurement units affect the results
 Note: in XLMiner, use the correlation matrix option to work with normalized variables
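
In scikit-learn, the equivalent of XLMiner's correlation-matrix option is to standardize before fitting, e.g. with StandardScaler; file and column names are again hypothetical:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Cereals.csv")  # hypothetical file name
X = df[["calories", "protein", "fat", "sodium", "fiber", "carbo", "sugars",
        "potass", "vitamins", "shelf", "weight", "cups", "rating"]].dropna()

# Standardize (subtract mean, divide by std. deviation) so that no variable
# dominates the PCs merely because of its measurement units
Z = StandardScaler().fit_transform(X)

pca = PCA().fit(Z)
print(pca.explained_variance_ratio_.round(4))
# The variance is now spread over more components than in the unstandardized run
```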
PCA using standardized variables
Variable           1             2             3             4             5             6
calories    0.32422706    0.36006299    0.13210163    0.30780381    0.08924425   -0.20683768
protein    -0.30220962    0.16462311    0.2609871     0.43252215    0.14542894    0.15786675
fat         0.05846959    0.34051308   -0.21144024    0.37964511    0.44644874    0.40349057
sodium      0.20198308    0.12548573    0.37701431   -0.16090299   -0.33231756    0.6789462
fiber      -0.43971062    0.21760374    0.07857864   -0.10126047   -0.24595702    0.06016004
carbo       0.17192839   -0.18648526    0.56368077    0.20293142    0.12910619   -0.25979191
sugars      0.25019819    0.3434512    -0.34577203   -0.10401795   -0.27725372   -0.20437138
potass     -0.3834067     0.32790738    0.08459517    0.00463834   -0.16622125    0.022951
vitamins    0.13955688    0.16689315    0.38407779   -0.52358848    0.21541923    0.03514972
shelf      -0.13469705    0.27544045    0.01791886   -0.4340663     0.59693497   -0.12134896
weight      0.07780685    0.43545634    0.27536476    0.10600897   -0.26767638   -0.38367996
cups        0.27874646   -0.24295618    0.14065795    0.08945525    0.06306333    0.06609894
rating     -0.45326898   -0.22710647    0.18307236    0.06392702    0.03328028   -0.16606605

Variance    3.59530377    3.16411042    1.86585701    1.09171081    0.96962351    0.72342771
Variance%  27.65618324   24.3393116    14.35274601    8.39777565    7.45864248    5.5648284
Cum%       27.65618324   51.99549484   66.34824371   74.74601746   82.20465851   87.76948547

 The first component now accounts for a smaller share of the variance
 More components are needed to capture the same amount of information
PCA in Classification/Prediction
 Apply PCA to the training data
 Decide how many PCs to use
 Apply the variable weights from those PCs to the validation/new data
 This creates a new, reduced set of predictors in the validation/new data
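
A sketch of this workflow; the file name, column choice, and 60/40 split are illustrative assumptions:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Cereals.csv")  # hypothetical file name
X = df[["calories", "protein", "fat", "sodium", "fiber", "carbo", "sugars",
        "potass", "vitamins", "shelf", "weight", "cups"]].dropna()

X_train, X_valid = train_test_split(X, test_size=0.4, random_state=1)

# Fit the scaler and PCA on the training data ONLY
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=3).fit(scaler.transform(X_train))

# Reuse the training weights to create the reduced set of predictors
train_pcs = pca.transform(scaler.transform(X_train))
valid_pcs = pca.transform(scaler.transform(X_valid))
print(train_pcs.shape, valid_pcs.shape)
```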
Other Approaches
 Graphs, summaries and PCA are exploratory methods
 PCA ignores the output variable
 Other approaches to dimension reduction:
   Multiple Linear Regression (prediction)
   Logistic Regression (classification)
 Both use subset selection, where the algorithm chooses a subset of the variables
 This procedure is integrated directly into the prediction/classification task
Summary
 Data summarization is an important tool for data exploration
 Data reduction is useful for compressing the information in the data into a smaller set of variables
 Categorical variables can be reduced by combining similar categories
 Principal components analysis transforms an original set of numerical variables into a smaller set of weighted averages of the original variables that contains most of the original information in fewer variables
