Dimension Reduction Methods
Data Mining and Predictive Analytics, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2015.
Dimension Reduction and Data Preparation
Outline:
Need for Dimension Reduction in Data Mining
Need for Dimension Reduction in Data Mining (cont'd)
Dimension Reduction for Categorical Variables
• Boston Housing Dataset Dictionary
# Import data
houses_boston = read.csv("BostonHousing.csv")
Dimension Reduction for Categorical Variables
• The distribution of CAT.MEDV is identical for ZN = 17.5, 90, 95, and 100
(where all neighborhoods have CAT.MEDV = 1).
• Categories ZN = 12.5, 18, 21, 25, 28, 30, 52.5, 70, and 85 can be combined.
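A minimal sketch of how one might inspect this distribution in R (assuming the columns are named ZN and CAT.MEDV, as above):
# Proportion of CAT.MEDV values within each ZN category
prop.table(table(houses_boston$ZN, houses_boston$CAT.MEDV), margin = 1)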
Dimension Reduction for Numeric Variables
• California Housing Dataset Dictionary
• Provides census information on all block groups from the 1990 California census
# Import data
houses_california = read.csv("CaliforniaHousing.csv")
Dimension Reduction for Numeric Variables
# Create the heatmap
library(ggcorrplot)
ggcorrplot(cor(houses_california), type = "lower", lab = TRUE,
method = "circle", title = "Correlation Matrix Heatmap")
Dimension Reduction for Numeric Variables
• What if we use this data set (as is) in a regression model (e.g., with MEDV as the dependent variable)?
– The results would be quite unstable: because the predictors are strongly correlated, tiny shifts in the predictors would lead to large changes in the regression coefficients (see the sketch below).
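As an illustrative sketch (not from the text), one way to quantify this instability is with variance inflation factors; this assumes the car package is available and that the response column is named MEDV, as on the slide:
# Regress MEDV on all other variables and check variance inflation factors
library(car)
fit = lm(MEDV ~ ., data = houses_california)
vif(fit)  # values well above 10 signal severe multicollinearity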
Dimension Reduction for Numeric Variables
Applying PCA to the Data Set
First Component
$PC_1 = \mathbf{e}_1' \mathbf{Z} = e_{11} Z_1 + e_{12} Z_2 + \cdots + e_{1m} Z_m$
where $\mathbf{e}_1 = (e_{11}, \ldots, e_{1m})'$ is the first eigenvector and $Z_1, \ldots, Z_m$ are the standardized values of the original variables $X_1, \ldots, X_m$.
Applying PCA to the Data Set (cont'd)
Applying PCA to the Data Set (cont’d)
# PCA on the first eight columns, with standardization
pca = prcomp(houses_california[, c(1:8)], scale = TRUE)
summary(pca)
scale = TRUE standardizes the data
– All four variables have high (and very similar) component weights, indicating that they are highly correlated with the first principal component.
– The components themselves are mathematically independent, i.e., uncorrelated with one another.
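To see these component weights directly, a short sketch inspecting the rotation (loadings) matrix returned by prcomp:
# Component weights (loadings) of the original variables on each principal component
round(pca$rotation, 3)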
How many components should we extract?
• Three criteria are:
1. The Eigenvalue Criterion
2. The Proportion of Variance Explained Criterion
3. The Scree Plot Criterion
# Eigenvalues (squared standard deviations of the components)
eigenvalues = pca$sdev^2
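A small follow-on sketch for the Eigenvalue Criterion, which retains components whose eigenvalue is at least 1 (values near 1 are borderline, which is why component 4 is kept under consideration in the summary below):
# Which components have eigenvalue >= 1?
eigenvalues
which(eigenvalues >= 1)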
How many components should we extract?
2. The Proportion of Variance Explained Criterion
– Step 1: specify how much of the total variability to keep
– Step 2: keep the smallest number of components that together explain this much variability
• Suppose we would like our components to explain 85% of the variability in the
variables. Then, we would choose components 1–3, which together explain
86.05% of the variability.
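A one-line sketch of how this cumulative proportion could be computed from the eigenvalues above (summary(pca) reports the same figures):
# Cumulative proportion of variance explained by the first k components
cumsum(eigenvalues) / sum(eigenvalues)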
How many components should we extract? (cont’d)
# Scree plot
plot(eigenvalues, type = "b", lwd = 2, col = "red",
xlab = "Component Number", ylab = "Eigenvalue",
main = "Scree Plot for Houses Data")
How many components should we extract? (cont'd)
How many components should we extract? (cont’d)
To summarize:
• The Eigenvalue Criterion:
– Retain components 1–3, but do not throw away component 4 yet.
• The Proportion of Variance Explained Criterion:
– Components 1–3 account for a solid 86% of the variability, and
tacking on component 4 gives us a superb 96% of the variability.
• The Scree Plot Criterion:
– Do not extract more than four components.
Replace original data with principal components
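As a hedged sketch of this step in R, assuming the first four components are retained as suggested by the criteria above:
# Replace the original predictors with the first four principal component scores
pc_scores = as.data.frame(pca$x[, 1:4])
head(pc_scores)
# These scores can now be used as predictors in place of the original variables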
Other Techniques for Dimension Reduction
• Regression analysis
– Stepwise regression (see the sketch below)
• Classification techniques
– Feature importance in decision trees
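An illustrative sketch (not from the slides) of stepwise regression with base R's step(), again assuming MEDV is the response in the California housing data:
# Stepwise selection in both directions, starting from the full model
full_model = lm(MEDV ~ ., data = houses_california)
step_model = step(full_model, direction = "both", trace = 0)
summary(step_model)  # predictors retained after stepwise selection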
Preparing to Model the Data
Cross-validation
K-fold Cross-Validation Graphically
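A minimal base-R sketch of the idea pictured on that slide; k = 5 is an illustrative choice, and adult is the data set used in the code that follows:
# Randomly assign each record to one of k folds
k = 5
folds = sample(rep(1:k, length.out = nrow(adult)))
# Each fold is held out once as the test set while the remaining folds are used for training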
Cross-validation (four possible approaches)
1. Split the whole dataset into a training set and a testing set.
2. More robust approach: perform k-fold cross-validation on the whole dataset (previous slide).
3. Split the whole dataset into training, validation, and testing sets.
4. More robust approach: set aside a testing set, then perform k-fold cross-validation on the training dataset (previous slide).
# randomize the order of the records
adult.randomized = adult[sample(nrow(adult), replace = FALSE), ]
# split into training and testing sets (the 75/25 split here is illustrative)
train_size = round(0.75 * nrow(adult.randomized))
data.train = adult.randomized[1:train_size, ]
data.test = adult.randomized[(train_size + 1):nrow(adult.randomized), ]
Overfitting
[Figure: overfitting as the number of variables increases]
Balancing The Training Data Set
• Balancing methods:
– Resample rare-class records (oversampling)
– Set aside a number of majority-class records (undersampling)
– A mix of over- and undersampling
Balancing The Training Data Set
library(ROSE)
# method = "over": over-sample minority examples
# method = "under": under-sample majority examples
# method = "both": combination of over- and under-sampling

# balanced data set with over-sampling
data.train.balanced.over = ovun.sample(income ~ ., data = data.train,
                                       p = 0.5, method = "over")
data.train.balanced.over = data.train.balanced.over$data
table(data.train.balanced.over$income) / nrow(data.train.balanced.over)

# balanced data set with under-sampling
data.train.balanced.under = ovun.sample(income ~ ., data = data.train,
                                        p = 0.5, method = "under")
data.train.balanced.under = data.train.balanced.under$data
table(data.train.balanced.under$income) / nrow(data.train.balanced.under)
The slides are derived from the following publisher instructor material. This work is protected by United States copyright laws and is provided solely for the use of instructors in teaching their courses and assessing student learning. Dissemination or sale of any part of this work will destroy the integrity of the work and is not permitted. All recipients of this work are expected to abide by these restrictions.
Data Mining and Predictive Analytics, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2015.