
The slides are derived from the following publisher instructor material. This work is protected by United States copyright laws and is provided solely for the use of instructors in teaching their courses and assessing student learning. Dissemination or sale of any part of this work will destroy the integrity of the work and is not permitted. All recipients of this work are expected to abide by these restrictions.

Data Mining and Predictive Analytics, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2015.
Dimension Reduction and Data Preparation

Outline:

This chapter shows how to:
– Deal with the curse of dimensionality in a dataset
– Understand the reasons for reducing the number of variables
– Apply techniques for dimension reduction (also known as feature/variable selection)

[Figure: the CRISP-DM standard process – Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment phases]
Need for Dimension Reduction in Data Mining

• Multicollinearity – a condition in which some of the predictor variables are correlated with each other
– leads to instability in the solution space
– possibly resulting in incoherent results
• Inclusion of highly correlated variables overemphasizes particular components of the model
• Use of too many predictor variables can
– unnecessarily complicate the interpretation of the analysis
– violate the principle of parsimony: keep the number of predictors small enough to be interpreted easily
– lead to overfitting: the generality of the findings is hindered because new data do not behave the same way as the training data
Need for Dimension Reduction in Data Mining (cont’d)

• Several predictors may fall into a single group (also called a component)
• e.g., savings account balance, checking account balance, home equity, stock portfolio balance, and 401K balance may all fall under a single component, assets

• Goals of using the correlation structure among the predictor variables
– reduce the number of predictor components
– help ensure that these components are independent

• We will study dimension reduction for
– categorical variables
– numerical variables
Dimension Reduction for Categorical Variables

• A single categorical variable with 𝑚 categories is typically transformed into 𝑚 or 𝑚 − 1 dummy variables
• Each dummy variable takes the value 0 or 1
➢ 0 = “no” for the category
➢ 1 = “yes” for the category
• Problem: we can end up with too many variables
• Solution: reduce the number by combining categories that are close to each other
• Exception: some algorithms, such as Naïve Bayes, can handle categorical variables without transforming them into dummies
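As an illustration (an addition, not from the original slides), one common way to create dummy variables in R is model.matrix(); the data frame df and variable x below are hypothetical names used only for this sketch.

# Hypothetical example: a factor with m = 3 categories
df = data.frame(x = factor(c("red", "blue", "green", "red")))
# model.matrix() creates treatment-coded dummies; dropping the intercept column leaves m - 1 dummies
dummies = model.matrix(~ x, data = df)[, -1]
dummies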
Dimension Reduction for Categorical Variables
• Boston Housing dataset

#Import the data and cross-tabulate CAT.MEDV by ZN
houses_boston = read.csv("BostonHousing.csv")
count = table(houses_boston$CAT.MEDV, houses_boston$ZN,
              dnn = c("CAT.MEDV", "ZN"))
round(prop.table(count, margin = 2), 2)

[Data dictionary for the Boston Housing dataset shown on the slide]
Dimension Reduction for Categorical Variables

• The distribution of CAT.MEDV is identical for ZN = 17.5, 90, 95, and 100 (where all neighborhoods have CAT.MEDV = 1).
• These four categories can therefore be combined into a single category.
• What other categories can be combined?
• Categories ZN = 12.5, 18, 21, 25, 28, 30, 52.5, 70, and 85 can be combined (see the sketch below).
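A minimal sketch (not part of the original slides) of combining these categories in R; the variable ZN_grouped and the level names are hypothetical.

# Recode ZN into a smaller set of combined categories
houses_boston$ZN_grouped = as.character(houses_boston$ZN)
houses_boston$ZN_grouped[houses_boston$ZN %in% c(17.5, 90, 95, 100)] = "combined_A"
houses_boston$ZN_grouped[houses_boston$ZN %in% c(12.5, 18, 21, 25, 28, 30, 52.5, 70, 85)] = "combined_B"
houses_boston$ZN_grouped = factor(houses_boston$ZN_grouped)
table(houses_boston$ZN_grouped)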
Dimension Reduction for Numeric Variables
• California Housing dataset
• Provides census information for all of the block groups from the 1990 California census

#Import data
houses_california = read.csv("CaliforniaHousing.csv")

• Response variable: Median House Value
• Predictors: Median Income, Housing Median Age, Total Rooms, Total Bedrooms, Population, Households, Latitude, Longitude
Dimension Reduction for Numeric Variables

# Create the correlation heatmap
library(ggcorrplot)
ggcorrplot(cor(houses_california), type = "lower", lab = TRUE,
           method = "circle", title = "Correlation Matrix Heatmap")

– type = "lower": display only the lower triangle of the matrix
– lab = TRUE: display the correlation values on the plot
– method = "circle": draw circles (rather than squares) to represent the correlations

• What does this matrix tell us about the correlation among the variables (positive, negative, and no correlation)?
Dimension Reduction for Numeric Variables

• What if we use this data set (as is) in a regression model (e.g., with MEDV as the dependent variable)?
– the results would become quite unstable
• tiny shifts in the predictors would lead to large changes in the regression coefficients

• This is where Principal Components Analysis (PCA) comes in
– it can identify the components underlying the correlated variables
Dimension Reduction for Numeric Variables

• PCA explains the correlation structure of the predictor variables by using a smaller set of linear combinations of these variables
– these linear combinations are called components

• The total variability of a dataset of 𝑚 variables can often be almost entirely explained by 𝑘 components, where 𝑘 < 𝑚
– the analyst can then replace the original 𝑚 variables with the 𝑘 components
• the modified dataset has 𝑛 records with 𝑘 components
Applying PCA to the Data Set

• Suppose the original variables 𝑋1, 𝑋2, …, 𝑋𝑚 form a coordinate system in 𝑚-dimensional space
– the principal components represent a new coordinate system, found by rotating the original system along the directions of maximum variability

• First component:
𝑃𝐶1 = 𝑒1′𝒁 = 𝑒11𝑍1 + 𝑒12𝑍2 + ⋯ + 𝑒1𝑚𝑍𝑚
where 𝑒1 = (𝑒11, 𝑒12, …, 𝑒1𝑚) is the first eigenvector and 𝒁 = (𝑍1, 𝑍2, …, 𝑍𝑚) holds the standardized values of the original variables 𝑋
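As a check on this formula, the sketch below (assuming the pca object created with prcomp(..., scale = TRUE) two slides ahead) reproduces the first component scores directly from the standardized data and the first eigenvector.

# Standardize the predictors and multiply by the first eigenvector
Z = scale(houses_california[, 1:8])
pc1_manual = Z %*% pca$rotation[, 1]
# The manually computed scores should match the first column of pca$x
head(cbind(pc1_manual, pca$x[, 1]))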
Applying PCA to the Data Set (cont’d)

• Variation among the variables:

apply(houses_california, 2, sd)

➢ s(median income) < 2
➢ s(total rooms) > 2,100

– We must standardize the variables; otherwise total rooms would dominate the influence of median income (and of the other variables)
• Standardize each variable using the Z-score transformation: 𝑍 = (𝑋 − mean(𝑋)) / sd(𝑋)
Applying PCA to the Data Set (cont’d)

#PCA
pca = prcomp(houses_california[, c(1:8)], scale = TRUE)
summary(pca)

– scale = TRUE standardizes the data

• The table produced by summary(pca), shown on the slide, gives the percentage of the total variance explained by each component
• The first component alone accounts for nearly half of the total variability in these 8 predictors
Applying PCA to the Data Set (cont’d)

#Component matrix
round(pca$rotation, 4)

– Each column contains the eigenvector for one component
– The cell entries are called the component weights and represent the partial correlation between the variable and the component
– Component weights range between −1 and 1 (they are correlations)
– Four of the variables have high (and very similar) weights on the first component, indicating that all four are highly correlated with the first principal component
– The components themselves are mathematically independent (no correlation between them)
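A small sketch (an addition, not from the slides) for reading the component matrix: rank the variables by the absolute size of their weight on the first component.

# Variables ordered by the magnitude of their weight on PC1
sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE)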
How many components should we extract?
• Three criteria are:
1. The Eigenvalue Criterion
2. The Proportion of Variance Explained Criterion
3. The Scree Plot Criterion

1. The Eigenvalue Criterion
– The sum of the eigenvalues equals the number of variables
– Eigenvalues decrease in magnitude
– An eigenvalue of 1 represents one variable’s worth of information
– Keep only components whose eigenvalue is greater than 1
– However, this criterion is problematic when the number of variables is below 20 or above 50, since it then keeps too few or too many components, respectively

#Eigenvalues
eigenvalues = pca$sdev^2
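Continuing the code above, a one-line sketch of applying the eigenvalue criterion:

# Indices of the components whose eigenvalue exceeds 1
which(eigenvalues > 1)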
How many components should we extract?
2. The Proportion of Variance Explained Criterion
– Step 1: specify how much variability to keep
– Step 2: keep the number of components reflecting this variability

• Suppose we would like our components to explain 85% of the variability in the variables. Then we would choose components 1–3, which together explain 86.05% of the variability.
• If we wanted our components to explain 90% or 95% of the variability, then we would need to include component 4 as well.
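A short sketch (an addition to the slides) of computing the cumulative proportion of variance explained, which is what this criterion reads off:

# Proportion and cumulative proportion of variance explained by each component
prop_var = pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(prop_var), 4)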
How many components should we extract? (cont’d)

3. The Scree Plot Criterion


– Scree plot: a graphical plot of the eigenvalues against the
component number
• Useful for finding an upper bound for the # of components to
keep
– The maximum number of components that should be extracted is
just before where the plot first begins to straighten out into a
horizontal line.

# Scree plot
plot(eigenvalues, type = "b", lwd = 2, col = "red",
xlab = "Component Number", ylab = "Eigenvalue",
main = "Scree Plot for Houses Data")
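Optionally (an addition, not in the original code), a horizontal reference line at an eigenvalue of 1 makes the scree plot easier to read against the eigenvalue criterion:

# Reference line at one variable's worth of information
abline(h = 1, lty = 2, col = "gray")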

How many components should we extract? (cont’d)

[Figure: scree plot for the houses data]
How many components should we extract? (cont’d)

To summarize:
• The Eigenvalue Criterion:
– Retain components 1–3, but do not throw away component 4 yet.
• The Proportion of Variance Explained Criterion:
– Components 1–3 account for a solid 86% of the variability, and
tacking on component 4 gives us a superb 96% of the variability.
• The Scree Plot Criterion:
– Do not extract more than four components.

• If there is no clear-cut best solution, try both candidate numbers of components (e.g., three and four) and see what happens.
Replace original data with principal components

# Replace the original data with the principal components
new_dataset = as.data.frame(pca$x)

• Should you now retain all the PCs?
– Only use the components that you concluded to be the best based on the three criteria discussed on the previous slides
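A minimal sketch (an assumption, not from the slides) of keeping only the selected components and re-attaching the response for later modeling; the column position of the median house value is assumed here.

# Keep the first four components, per the criteria discussed earlier
reduced = new_dataset[, 1:4]
# Re-attach the response variable (assumed to be column 9 of the original data)
reduced$MedianHouseValue = houses_california[, 9]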
Other Techniques for Dimension Reduction

• Regression analysis
– Stepwise regression

• Classification techniques
– Feature importance in decision tree

These techniques will be discussed in their respective lectures.
Preparing to Model the Data

Cross-validation

• Cross-validation is a technique for ensuring that the results uncovered in an analysis are generalizable to an independent, unseen data set.
• In twofold cross-validation, the data are partitioned, using random assignment, into a training dataset and a testing dataset. The testing dataset is also called the holdout dataset.
• In k-fold cross-validation, the original data are partitioned into k independent and similar subsets
– k models are then built, each using the data from k − 1 of the subsets, with the remaining subset used as the test set
– the results from the k models are then combined by averaging or voting
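A minimal base-R sketch (not the textbook's code) of how the k-fold assignment can be made, assuming the adult data frame read in two slides below; the seed value is arbitrary.

# Randomly assign each record to one of k folds
k = 5
set.seed(123)
fold = sample(rep(1:k, length.out = nrow(adult)))
# For each i in 1:k, train on adult[fold != i, ] and test on adult[fold == i, ]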
K-fold Cross-Validation Graphically

• Assume five folds (k = 5): the whole dataset is split into Folds 1–5, and each fold in turn serves as the testing dataset while the remaining four folds form the training dataset
– Folds 1, 2, 3, 4: training dataset; Fold 5: testing dataset
– Folds 1, 2, 3, 5: training dataset; Fold 4: testing dataset
– Folds 1, 2, 4, 5: training dataset; Fold 3: testing dataset
– … and so on for Folds 2 and 1
Cross-validation (four possible approaches)

1. Split the whole dataset into a training set and a testing set.
2. More robust approach: perform k-fold cross-validation on the whole dataset (previous slide).
3. Split the whole dataset into a training set, a validation set, and a testing set.
4. More robust approach: split the whole dataset into a training set and a testing set, then perform k-fold cross-validation on the training set (previous slide).

• In approaches 3 and 4, the validation set (or the cross-validation on the training set) is used for parameter tuning to balance bias and variance.
Cross-validation

# Read in the adult dataset
adult = read.csv("adult.csv")

# Randomize the order of the records
adult.randomized = adult[sample(nrow(adult), replace = FALSE), ]

# Determine the number of rows in the dataset
num_rows = nrow(adult.randomized)

# Calculate the number of rows for the training set (70%)
train_size = round(0.7 * num_rows)

# Create the training and testing sets
data.train = adult.randomized[1:train_size, ]
data.test = adult.randomized[(train_size + 1):num_rows, ]
Overfitting

• Overfitting results when the provisional model tries to account for every possible trend or structure in the training set

[Figure: overfitting as a function of the number of variables]
Balancing The Training Data Set

• For classification models in which one of the target variable classes has a much lower relative frequency than the other classes, balancing is recommended.

• A benefit of balancing the data is to provide the classification algorithms with a rich balance of records for each classification outcome, so that the algorithms have a chance to learn about all types of records, not just those from the high-frequency target class.

• Balancing methods:
– Resample records (oversampling)
– Set aside a number of records (undersampling)
– A mix of over- and undersampling

• The test data set should never be balanced.
Balancing The Training Data Set

#Check the frequency of the response variable
table(data.train$income)/nrow(data.train)

library(ROSE)
# Balanced data set with over-sampling
data.train.balanced.over = ovun.sample(income~., data=data.train,
                                       p=0.5, method="over")

– method="over": over-sample the minority class
– method="under": under-sample the majority class
– method="both": a combination of over- and under-sampling

data.train.balanced.over = data.train.balanced.over$data
table(data.train.balanced.over$income)/nrow(data.train.balanced.over)

# Balanced data set with under-sampling
data.train.balanced.under = ovun.sample(income~., data=data.train,
                                        p=0.5, method="under")
data.train.balanced.under = data.train.balanced.under$data
table(data.train.balanced.under$income)/nrow(data.train.balanced.under)