
Dimensionality Reduction Techniques

Here I cover four methods:

1. Missing Value Ratio
2. High Correlation Filter
3. Low Variance Filter
4. Random Forest Feature Selection

1. Missing Value Ratio (Method 1)


In [1]: #importing the libraries
import pandas as pd
#reading the file
data = pd.read_csv('missing_value_ratio.csv')
# first 5 rows of the data
data.head()

Out[1]:
ID season holiday workingday weather temp atemp humidity windspeed count

0 AB101 1.0 0.0 0.0 1.0 9.84 14.395 81.0 NaN 16

1 AB102 1.0 NaN 0.0 NaN 9.02 13.635 80.0 NaN 40

2 AB103 1.0 0.0 NaN 1.0 9.02 13.635 80.0 NaN 32

3 AB104 NaN 0.0 NaN 1.0 9.84 14.395 75.0 NaN 13

4 AB105 1.0 NaN 0.0 NaN 9.84 14.395 NaN 16.9979 1

Notice that a lot of missing values are present in this dataset, so we apply the missing value ratio method.

In [2]: # percentage of missing values in each variable


data.isnull().sum()/len(data)*100

Out[2]: ID 0.000000
season 0.069337
holiday 48.497689
workingday 0.069337
weather 0.030817
temp 0.000000
atemp 0.000000
humidity 0.038521
windspeed 41.016949
count 0.000000
dtype: float64

In [3]: # saving missing values in a variable


a = data.isnull().sum()/len(data)*100
In [4]: # saving column names in a variable
variables = data.columns

In [5]: # new variable to store variables having missing values less than a threshold
variable = []
for i in range(data.columns.shape[0]):
    if a[i] <= 40:  # setting the threshold as 40%
        variable.append(variables[i])

Here I keep only the columns whose missing value ratio is at most 40%; columns above that are removed. The 40% figure is just a threshold; you can set it anywhere around 40-45% depending on your data.
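As a more idiomatic alternative (just a sketch using the Series a computed above, not a cell from the original notebook), the same selection can be written with a boolean mask:

# keep columns whose missing value ratio is at most the 40% threshold
variable = a[a <= 40].index.tolist()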

In [6]: print("--------------------Before remove feature------------------")


print("Before remove feature: ",variables)
print("--------------------After remove feature------------------")
print("After remove feature: ",variable)

--------------------Before remove feature------------------


Before remove feature: Index(['ID', 'season', 'holiday', 'workingday', 'weathe
r', 'temp', 'atemp',
'humidity', 'windspeed', 'count'],
dtype='object')
--------------------After remove feature------------------
After remove feature: ['ID', 'season', 'workingday', 'weather', 'temp', 'atem
p', 'humidity', 'count']

Here 'holiday' and 'windspeed' are removed, as they have missing value ratios above the threshold.

In [7]: # creating a new dataframe using the above variables


new_data = data[variable]
# first five rows of the new data
new_data.head()

Out[7]:
ID season workingday weather temp atemp humidity count

0 AB101 1.0 0.0 1.0 9.84 14.395 81.0 16

1 AB102 1.0 0.0 NaN 9.02 13.635 80.0 40

2 AB103 1.0 NaN 1.0 9.02 13.635 80.0 32

3 AB104 NaN NaN 1.0 9.84 14.395 75.0 13

4 AB105 1.0 0.0 NaN 9.84 14.395 NaN 1


In [8]: # percentage of missing values in each variable of new data
new_data.isnull().sum()/len(new_data)*100

Out[8]: ID 0.000000
season 0.069337
workingday 0.069337
weather 0.030817
temp 0.000000
atemp 0.000000
humidity 0.038521
count 0.000000
dtype: float64

In [9]: # shape of new and original data


print("Before removing missing value ratio filter: ", data.shape)
print("After removing missing value ratio filter: ", new_data.shape)

Before removing missing value ratio filter:  (12980, 10)
After removing missing value ratio filter:  (12980, 8)

In this way we reduce the dimensionality of the data using the missing value ratio.
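A compact alternative sketch (not from the original notebook) uses pandas dropna with a column-wise threshold; keeping columns with at least 60% non-missing values is equivalent to dropping those with more than 40% missing:

# drop columns whose missing value ratio exceeds 40%
min_non_missing = int(0.60 * len(data))
new_data = data.dropna(axis=1, thresh=min_non_missing)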

2. High Correlation Filter (Method 2)


Here I demonstrate how to remove highly correlated features.

In [10]: data_me2 = data.copy()

In [11]: data_me2 = data_me2.drop(['ID'],axis=1)

In [12]: correlation = data_me2.corr()


numeric_columns = data_me2.columns

high_corr = [ ]

for c1 in numeric_columns:
    for c2 in numeric_columns:
        if c1 != c2 and c2 not in high_corr and correlation[c1][c2] > 0.9:
            high_corr.append(c1)

In [13]: high_corr

Out[13]: ['temp']
In [14]: import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8,5))
sns.heatmap(correlation,annot=True)

Out[14]: <AxesSubplot:>

As the heatmap shows, atemp and temp are highly correlated, so this method flags one of the pair ('temp') for removal.
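The cells above only identify the correlated column; a minimal sketch of actually dropping it (assuming data_me2 and high_corr from above):

# drop the flagged highly correlated column(s)
data_me2_reduced = data_me2.drop(columns=high_corr)
print(data_me2_reduced.shape)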

3. Low Variance Filter (Method 3)
In [15]: #importing the libraries
import pandas as pd
from sklearn.preprocessing import normalize
#reading the file
data = pd.read_csv('low_variance_filter.csv')
# first 5 rows of the data
data.head()

Out[15]:
ID temp atemp humidity windspeed count

0 AB101 9.84 14.395 81 0.0 16

1 AB102 9.02 13.635 80 0.0 40

2 AB103 9.02 13.635 80 0.0 32

3 AB104 9.84 14.395 75 0.0 13

4 AB105 9.84 14.395 75 0.0 1

In [16]: #percentage of missing values in each variable


data.isnull().sum()/len(data)*100

Out[16]: ID 0.0
temp 0.0
atemp 0.0
humidity 0.0
windspeed 0.0
count 0.0
dtype: float64

Here there are no missing values, so we use each feature's variance to reduce dimensionality.

In [17]: data.var()

Out[17]: temp 61.291712


atemp 73.137484
humidity 398.549141
windspeed 69.322053
count 25843.419864
dtype: float64

Here you can see the variance of each feature.

Always normalize your dataset before applying this method, because variance depends on the scale of each feature.

In [18]: # dropping the non-numeric ID column


data = data.drop('ID', axis=1)

In [19]: data_normalized = normalize(data)


In [20]: data_scaled = pd.DataFrame(data_normalized)
data_scaled.var()

Out[20]: 0 0.005877
1 0.007977
2 0.093491
3 0.008756
4 0.111977
dtype: float64

Now we use a threshold value to eliminate all low-variance features.

In [21]: #storing the variance and name of variables


variance = data_scaled.var()
columns = data.columns

In [22]: #saving the names of variables having variance more than a threshold value
variable = []
for i in range(0, len(variance)):
    if variance[i] >= 0.006:  # setting the threshold as 0.006
        variable.append(columns[i])

In [23]: print("--------------------Before remove feature------------------")


print("Before remove feature: ",data.columns)
print("--------------------After remove feature------------------")
print("After remove feature: ",variable)

--------------------Before remove feature------------------


Before remove feature: Index(['temp', 'atemp', 'humidity', 'windspeed', 'coun
t'], dtype='object')
--------------------After remove feature------------------
After remove feature: ['atemp', 'humidity', 'windspeed', 'count']

Here it eliminates temp, as its variance (0.005877) is below the threshold.

In [24]: # creating a new dataframe using the above variables


new_data = data[variable]
# first five rows of the new data
new_data.head()

Out[24]:
atemp humidity windspeed count

0 14.395 81 0.0 16

1 13.635 80 0.0 40

2 13.635 80 0.0 32

3 14.395 75 0.0 13

4 14.395 75 0.0 1
In [25]: # shape of new and original data
print("Before removing low variance features: ", data.shape)
print("After removing low variance features: ", new_data.shape)

Before removing low variance features:  (12980, 5)
After removing low variance features:  (12980, 4)

In this way we apply the low variance filter.
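As an alternative sketch (not part of the original notebook), sklearn's VarianceThreshold implements the same idea; assuming the normalized data_scaled from above and the same 0.006 threshold:

from sklearn.feature_selection import VarianceThreshold

# keep only the columns whose variance exceeds the threshold
selector = VarianceThreshold(threshold=0.006)
reduced = selector.fit_transform(data_scaled)
print(reduced.shape)  # low-variance columns are dropped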

4. Random Forest Feature Selection (Method 4)
In [26]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [27]: from sklearn.datasets import load_iris


import pandas as pd
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target
df.head()

Out[27]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Target

0 5.1 3.5 1.4 0.2 0

1 4.9 3.0 1.4 0.2 0

2 4.7 3.2 1.3 0.2 0

3 4.6 3.1 1.5 0.2 0

4 5.0 3.6 1.4 0.2 0

In [28]: X = df.drop("Target",axis=1)
y = df['Target']

In [29]: from sklearn.ensemble import RandomForestRegressor


model = RandomForestRegressor(random_state=1, max_depth=10)
model.fit(X,y)

Out[29]: RandomForestRegressor(max_depth=10, random_state=1)

After fitting the model, plot the feature importance graph:
In [30]: features = X.columns
importances = model.feature_importances_
indices = np.argsort(importances)[-9:]  # indices of the most important features (all 4 here)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

From this graph you can determine which variables are important and which are not; accordingly, you can easily eliminate the less important features.
Based on the above graph, we can handpick the top-most features to reduce the
dimensionality in our dataset.
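A minimal sketch of such handpicking (illustrative only; the choice of k = 2 is an assumption, not from the notebook):

# keep the k most important features according to the fitted model
k = 2
top_idx = np.argsort(importances)[-k:]
X_top = X.iloc[:, top_idx]
print(list(X_top.columns))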
Alternatively, we can use sklearn's SelectFromModel to do so; it selects features based on the importance of their weights.

SelectFromModel of sklearn

In [31]: from sklearn.feature_selection import SelectFromModel


feature = SelectFromModel(model)
X_select = feature.fit_transform(X,y)

In [32]: print("--------------------Before remove feature------------------")


print("Before Select: ",X.shape)
print("--------------------After remove feature------------------")
print("After Select: ",X_select.shape)

--------------------Before remove feature------------------


Before Select: (150, 4)
--------------------After remove feature------------------
After Select: (150, 2)
In [33]: model.fit(X_select,y)

Out[33]: RandomForestRegressor(max_depth=10, random_state=1)

In this way you can select a reduced number of features and easily eliminate the rest.
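To inspect which columns SelectFromModel actually kept, a short sketch (assuming feature and X from above; get_support is sklearn's standard selector method):

# boolean mask of the columns retained by SelectFromModel
selected_mask = feature.get_support()
print(list(X.columns[selected_mask]))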
