
Chapter 4 – Dimension Reduction

Data Mining for Business Intelligence


Shmueli, Patel & Bruce

© Galit Shmueli and Peter Bruce 2010


Dimension reduction
 Reducing the number of variables in a dataset helps data mining algorithms operate efficiently
 Several data reduction techniques:
    Incorporate domain knowledge to combine categories
    Use data summaries
    Employ automated reduction techniques, such as Principal Components Analysis (PCA)
 Expert knowledge needs to be integrated with the automated tools
 In computer science this is often called feature selection or feature extraction
Advantages of Dimension Reduction
 Data mining tasks often involve a large number of variables
 Subsets of the variables may be highly correlated
 Some variables are unrelated to the outcome of interest in classification/prediction problems
 Such variables can lead to overfitting, hurting the accuracy and reliability of the model
 Superfluous variables also increase cost

Question: Which variables are important for the task at hand, and which are useless?
Correlations Between Pairs of Variables:
Correlation Matrix from Excel
• Large datasets usually have many redundant variables with much overlap
• Redundancies can be found by looking at the pairwise correlations (correlation matrix)
• This helps detect duplications and multicollinearity

          PTRATIO         B     LSTAT      MEDV
PTRATIO         1
B        -0.17738         1
LSTAT    0.374044  -0.36609         1
MEDV     -0.50779  0.333461  -0.73766         1

Remember heat maps?


Correlation Analysis
Correlation matrix for a portion of the Boston Housing data, showing the correlation between variable pairs:

           CRIM        ZN     INDUS      CHAS       NOX        RM
CRIM          1
ZN     -0.20047         1
INDUS  0.406583  -0.53383         1
CHAS   -0.05589   -0.0427  0.062938         1
NOX    0.420972   -0.5166  0.763651  0.091203         1
RM     -0.21925  0.311991  -0.39168  0.091251  -0.30219         1

Interpretation: pairs with high correlations have much overlap, so one variable from each such pair can be removed.
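
The same check can be scripted outside Excel. A minimal sketch in Python/pandas; the file name "BostonHousing.csv" and the 0.7 cutoff are illustrative assumptions, not part of the slides:

```python
import itertools

import pandas as pd

df = pd.read_csv("BostonHousing.csv")  # hypothetical file name

# Pairwise correlation matrix for a portion of the variables
subset = df[["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM"]]
corr = subset.corr()
print(corr.round(3))

# Flag pairs with |r| > 0.7 as candidates for dropping one member
for a, b in itertools.combinations(corr.columns, 2):
    if abs(corr.loc[a, b]) > 0.7:
        print(a, b, round(corr.loc[a, b], 3))
```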
Reducing Categories
 A single categorical variable with m categories is typically transformed into m-1 dummy variables
 Each dummy variable takes the value 0 or 1
    0 = “no” for the category
    1 = “yes”
 Problem: you can end up with too many variables
 Solution: reduce the number by combining categories that are close to each other
Combining categories
 Examine the sizes of the categories and how the response behaves in each category
 Generally, categories that contain very few observations can be combined with others
 Use the categories that are most relevant to the task and label the rest as “Other”
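
A quick sketch of both steps in Python/pandas; the example series and the cutoff of 3 observations are illustrative assumptions:

```python
import pandas as pd

# Hypothetical example: a categorical column "mfr" (cereal manufacturer)
s = pd.Series(["N", "Q", "K", "K", "R", "G", "K", "G", "R", "P", "Q", "G", "G"],
              name="mfr")

# Combine categories with fewer than 3 observations into "Other"
counts = s.value_counts()
rare = counts[counts < 3].index
s_reduced = s.where(~s.isin(rare), "Other")

# m categories -> m-1 dummy variables (drop_first avoids redundancy)
dummies = pd.get_dummies(s_reduced, prefix="mfr", drop_first=True)
print(dummies.head())
```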
Principal Components Analysis
 Goal: reduce a set of numerical variables
   https://www.youtube.com/watch?v=kw9R0nD69OU
 Idea: remove the overlap of information between these variables
   [“Information” is measured by the sum of the variances of the variables.]
 Final product: a smaller number of numerical variables that contain most of the information
 Use: useful when the measurements are highly correlated; works ONLY with numerical variables
Principal Components Analysis
How does PCA do this?
 It creates new variables that are linear combinations of the original variables (i.e., weighted averages of the original variables)
 These linear combinations are uncorrelated (no information overlap), and only a few of them contain most of the original information
 The new variables are called principal components
Example – Breakfast Cereals
 Data collected on nutritional information and consumer rating of 77 breakfast cereals; 13 numerical variables in the study
 Goal: reduce the dimension

name                       mfr  type  calories  protein  …  rating
100%_Bran                  N    C           70        4  …      68
100%_Natural_Bran          Q    C          120        3  …      34
All-Bran                   K    C           70        4  …      59
All-Bran_with_Extra_Fiber  K    C           50        4  …      94
Almond_Delight             R    C          110        2  …      34
Apple_Cinnamon_Cheerios    G    C          110        2  …      30
Apple_Jacks                K    C          110        2  …      33
Basic_4                    G    C          130        3  …      37
Bran_Chex                  R    C           90        2  …      49
Bran_Flakes                P    C           90        3  …      53
Cap'n'Crunch               Q    C          120        1  …      18
Cheerios                   G    C          110        6  …      51
Cinnamon_Toast_Crunch      G    C          120        1  …      20
Description of Variables
 name: name of cereal
 mfr: manufacturer
 type: cold or hot
 calories: calories per serving
 protein: grams
 fat: grams
 sodium: mg
 fiber: grams
 carbo: grams of complex carbohydrates
 sugars: grams
 potass: mg
 vitamins: % of FDA recommendation
 shelf: display shelf
 weight: oz. per serving
 cups: in one serving
 rating: consumer reports rating
Consider calories & rating
 Estimated covariance matrix:

             calories    rating
 calories      379.63   -189.68
 rating       -189.68    197.32

 The two variables are strongly correlated, with r = -0.69
 Total variance (= “information”) is the sum of the individual variances: 379.63 + 197.32 = 576.95
 “Calories” alone accounts for 379.63/576.95 = 66% of the total variance; given the strong correlation, much of the remaining 34% is overlap with “rating”
 It may make sense to reduce these two variables to a single variable without losing much information
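
These quantities are easy to verify in code. A sketch assuming the cereal data sit in a hypothetical file named "Cereals.csv" with columns calories and rating; the printed numbers match the slide only if the same 77-cereal data set is used:

```python
import pandas as pd

df = pd.read_csv("Cereals.csv")  # hypothetical file name

two = df[["calories", "rating"]]
cov = two.cov()                  # estimated covariance matrix
print(cov.round(2))

# Total variance ("information") is the sum of the individual variances
total_variance = cov.values.diagonal().sum()
share_calories = cov.loc["calories", "calories"] / total_variance
print(f"total variance: {total_variance:.2f}")   # 576.95 on the slide's data
print(f"calories' share: {share_calories:.0%}")  # 66%

print(two.corr().round(2))       # r = -0.69 between the two variables
```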
First & Second Principal Components
 Z1 and Z2 are two linear combinations
   Z1 has the largest variance (spread of values)
   Z2 has the second-largest variance

   Z1 = a1X1 + a2X2
   Z2 = b1X1 + b2X2

 where X1 and X2 are the original variables in the dataset and a1, a2, b1, b2 are called “weights”
 Z1: first principal component; Z2: second principal component
PCA output for these 2 variables (XLMiner)
 Top: the weights used to project the original data onto Z1 & Z2
   e.g., (-0.847, 0.532) are the weights for Z1 and (0.532, 0.847) are the weights for Z2
 Bottom: the variance reallocated to the new variables
   Z1: 86% of the total variance; Z2: 14%

                     Components
Variable             1             2
calories   -0.84705347    0.53150767
rating      0.53150767    0.84705347

Variance   498.0244751     78.932724
Variance%  86.31913757   13.68086338
Cum%       86.31913757           100
P-value              0             1

Z1 = -0.847 x calories + 0.532 x rating
Z2 =  0.532 x calories + 0.847 x rating
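
The same output can be reproduced with scikit-learn, up to the signs of the weights, which vary between implementations; "Cereals.csv" is again a hypothetical file name:

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("Cereals.csv")  # hypothetical file name

pca = PCA(n_components=2)
pca.fit(df[["calories", "rating"]])

# Rows of components_ are the weight vectors for Z1 and Z2; their signs
# may be flipped relative to the XLMiner output
print(pca.components_)
print(pca.explained_variance_)        # ~498 and ~79 on the slide's data
print(pca.explained_variance_ratio_)  # ~0.86 and ~0.14
```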
Principal Component Scores
XLMiner: Principal Components Analysis – Scores

Row Id.                         1       2
100%_Bran                   44.92    2.20
100%_Natural_Bran          -15.73   -0.38
All-Bran                    40.15   -5.41
All-Bran_with_Extra_Fiber   75.31   13.00
Almond_Delight              -7.04   -5.36
Apple_Cinnamon_Cheerios     -9.63   -9.49
Apple_Jacks                 -7.69   -6.38
Basic_4                    -22.57    7.52
Bran_Chex                   17.73   -3.51

 The weights are used to compute the scores above: column 1 holds the computed Z1 scores and column 2 the computed Z2 scores
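
To make the link between weights and scores concrete, here is a sketch showing that the scores are simply the weights applied to the mean-centered data; it reuses the same hypothetical "Cereals.csv":

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("Cereals.csv")  # hypothetical file name
X = df[["calories", "rating"]]
pca = PCA(n_components=2).fit(X)

# Manual computation: center each variable, then project onto the weights
manual = (X - X.mean()) @ pca.components_.T

# The library call gives the same numbers
scores = pca.transform(X)
print(scores[:5].round(2))  # first five rows of Z1 and Z2 scores
```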
Properties of the resulting variables
 New distribution of information:
   New variances = 498 (for Z1) and 79 (for Z2)
   The sum of the new variances equals the sum of the variances of the original variables, calories and rating
 The new variable Z1 has most of the total variance and might be used as a proxy for both calories and rating
 Z1 and Z2 have a correlation of zero (no information overlap)
Generalization
• X1, X2, X3, …, Xp are the original p variables
• Z1, Z2, Z3, …, Zp are weighted averages of the original variables
• All pairs of Z variables have zero correlation
• The Z's are ordered by variance (Z1 largest, Zp smallest)
• Usually the first few Z variables contain most of the information, so the rest can be dropped (achieving dimension reduction)
Principal components
 Thus the principal component scores are:
   Zi = ai,1X1 + ai,2X2 + … + ai,pXp,   i = 1, …, p
• The software computes the weights ai,j, which are then used to compute the principal component scores
• Breakfast cereal example: PCA on the 13 numerical variables shows that the first 3 PCs capture more than 96% of the total variation in all 13 original variables
   • That is less than 25% of the dimension of the original data
   • The first 2 PCs alone capture 93% of the variation
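
A sketch of this selection rule in scikit-learn; the file and column names are assumptions, and the 96% cutoff is taken from the slide:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("Cereals.csv")  # hypothetical file name
num_cols = ["calories", "protein", "fat", "sodium", "fiber", "carbo",
            "sugars", "potass", "vitamins", "shelf", "weight", "cups", "rating"]
X = df[num_cols].dropna()

pca = PCA().fit(X)  # unstandardized, as on the "PCA on full data set" slide
cum = np.cumsum(pca.explained_variance_ratio_)
print(cum.round(4))  # first 2 PCs ~0.93, first 3 ~0.97 on the slide's data

# Keep the smallest number of PCs that captures, say, 96% of the variance
k = int(np.searchsorted(cum, 0.96)) + 1
print("components kept:", k)
```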
PCA on full data set
Variable           1             2             3             4             5             6
calories    0.07624155   -0.01066097    0.61074823   -0.61706442    0.45754826    0.12601775
protein    -0.00146212    0.00873588    0.00050506    0.0019389     0.05533375    0.10379469
fat        -0.00013779    0.00271266    0.01596125   -0.02595884   -0.01839438   -0.12500292
sodium      0.98165619    0.12513085   -0.14073193   -0.00293341    0.01588042    0.02245871
fiber      -0.00479783    0.03077993   -0.01684542    0.02145976    0.00872434    0.271184
carbo       0.01486445   -0.01731863    0.01272501    0.02175146    0.35580006   -0.56089228
sugars      0.00398314   -0.00013545    0.09870714   -0.11555841   -0.29906386    0.62323487
potass     -0.119053      0.98861349    0.03619435   -0.042696     -0.04644227   -0.05091622
vitamins    0.10149482    0.01598651    0.7074821     0.69835609   -0.02556211    0.01341988
shelf      -0.00093911    0.00443601    0.01267395    0.00574066   -0.00823057   -0.05412053
weight      0.0005016     0.00098829    0.00369807   -0.0026621     0.00318591    0.00817035
cups        0.00047302   -0.00160279    0.00060208    0.00095916    0.00280366   -0.01087413
rating     -0.07615706    0.07254035   -0.30776858    0.33866307    0.75365263    0.41805118

Variance   7204.161133   4833.050293   498.4260864   357.2174377   72.47863007    4.33980322
Variance%  55.52834702   37.25226212    3.84177661    2.75336623    0.55865192    0.0334504
Cum%       55.52834702   92.78060913   96.62238312   99.37575531   99.93440247   99.96785736

 First 6 components shown
 The first 2 capture 93% of the total variation
 Note: these data differ slightly from the text
Normalizing data
 PCA helps in understanding the structure of the data
   Examine the weights to see how the original variables contribute to each PC
   In these results, sodium dominates the first PC
   This is simply because of the way it is measured (mg): its scale is larger than that of almost all the other variables
   Hence its variance is a dominant component of the total variance
 Normalize each variable to remove the scale effect
   Divide by the standard deviation (the mean may be subtracted first)
 Normalization (= standardization) is usually performed in PCA; otherwise the measurement units affect the results
 Note: in XLMiner, use the correlation matrix option to work with normalized variables
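
In scikit-learn, the equivalent of XLMiner's correlation-matrix option is to standardize before fitting, e.g. with StandardScaler; file and column names are again hypothetical:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Cereals.csv")  # hypothetical file name
X = df[["calories", "protein", "fat", "sodium", "fiber", "carbo", "sugars",
        "potass", "vitamins", "shelf", "weight", "cups", "rating"]].dropna()

# Standardize (subtract mean, divide by std. deviation) so that no variable
# dominates the PCs merely because of its measurement units
Z = StandardScaler().fit_transform(X)

pca = PCA().fit(Z)
print(pca.explained_variance_ratio_.round(4))
# The variance is now spread over more components than in the unstandardized run
```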
PCA using standardized variables
Variable           1             2             3             4             5             6
calories    0.32422706    0.36006299    0.13210163    0.30780381    0.08924425   -0.20683768
protein    -0.30220962    0.16462311    0.2609871     0.43252215    0.14542894    0.15786675
fat         0.05846959    0.34051308   -0.21144024    0.37964511    0.44644874    0.40349057
sodium      0.20198308    0.12548573    0.37701431   -0.16090299   -0.33231756    0.6789462
fiber      -0.43971062    0.21760374    0.07857864   -0.10126047   -0.24595702    0.06016004
carbo       0.17192839   -0.18648526    0.56368077    0.20293142    0.12910619   -0.25979191
sugars      0.25019819    0.3434512    -0.34577203   -0.10401795   -0.27725372   -0.20437138
potass     -0.3834067     0.32790738    0.08459517    0.00463834   -0.16622125    0.022951
vitamins    0.13955688    0.16689315    0.38407779   -0.52358848    0.21541923    0.03514972
shelf      -0.13469705    0.27544045    0.01791886   -0.4340663     0.59693497   -0.12134896
weight      0.07780685    0.43545634    0.27536476    0.10600897   -0.26767638   -0.38367996
cups        0.27874646   -0.24295618    0.14065795    0.08945525    0.06306333    0.06609894
rating     -0.45326898   -0.22710647    0.18307236    0.06392702    0.03328028   -0.16606605

Variance    3.59530377    3.16411042    1.86585701    1.09171081    0.96962351    0.72342771
Variance%  27.65618324   24.3393116    14.35274601    8.39777565    7.45864248    5.5648284
Cum%       27.65618324   51.99549484   66.34824371   74.74601746   82.20465851   87.76948547

 The first component now accounts for a smaller share of the variance
 More components are needed to capture the same amount of information
PCA in Classification/Prediction
 Apply PCA to the training data
 Decide how many PCs to use
 Apply the variable weights from those PCs to the validation/new data
 This creates a new, reduced set of predictors in the validation/new data
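
A sketch of this workflow; the file name, column choice, and 60/40 split are illustrative assumptions:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Cereals.csv")  # hypothetical file name
X = df[["calories", "protein", "fat", "sodium", "fiber", "carbo", "sugars",
        "potass", "vitamins", "shelf", "weight", "cups"]].dropna()

X_train, X_valid = train_test_split(X, test_size=0.4, random_state=1)

# Fit the scaler and PCA on the training data ONLY
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=3).fit(scaler.transform(X_train))

# Reuse the training weights to create the reduced set of predictors
train_pcs = pca.transform(scaler.transform(X_train))
valid_pcs = pca.transform(scaler.transform(X_valid))
print(train_pcs.shape, valid_pcs.shape)
```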
Other Approaches
 Graphs, summaries and PCA are exploratory methods
 PCA ignores the output variable
 Other approaches to dimension reduction:
   Multiple Linear Regression (prediction)
   Logistic Regression (classification)
 Both use subset selection, where the algorithm chooses a subset of the variables
 This procedure is integrated directly into the prediction/classification task
Summary
 Data summarization is an important tool for data exploration
 Data reduction is useful for compressing the information in the data into a smaller set of variables
 Categorical variables can be reduced by combining similar categories
 Principal components analysis transforms an original set of numerical variables into a smaller set of weighted averages of the original variables that contains most of the original information in fewer variables
