22 Dim Reduction Part-1
Out[1]:
ID season holiday workingday weather temp atemp humidity windspeed count
Notice that this dataset contains a lot of missing values, so we start by computing the missing-value ratio of each column.
Out[2]: ID 0.000000
season 0.069337
holiday 48.497689
workingday 0.069337
weather 0.030817
temp 0.000000
atemp 0.000000
humidity 0.038521
windspeed 41.016949
count 0.000000
dtype: float64
In [5]: # keep the names of variables whose missing-value ratio is below a threshold
        a = data.isnull().sum() / len(data) * 100   # missing-value ratio per column
        variable = [ ]
        for i in range(data.columns.shape[0]):
            if a[i] <= 40:                          # setting the threshold as 40%
                variable.append(data.columns[i])
Here I drop the columns whose missing-value ratio is higher than 40%. The 40% cut-off is just one possible threshold; values around 40-45% are common choices.
In this dataset, 'holiday' and 'windspeed' are removed because their missing-value ratios exceed the threshold.
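The whole missing-value-ratio filter can be sketched end-to-end. The small DataFrame below is synthetic (in the notebook, `data` comes from a CSV) and the column names are only illustrative:

```python
# Sketch of the missing-value-ratio filter on a tiny synthetic frame.
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "season":    [1, 2, np.nan, 3],              # 25% missing
    "holiday":   [np.nan, np.nan, 1, np.nan],    # 75% missing
    "windspeed": [np.nan, np.nan, 0.5, 0.3],     # 50% missing
    "count":     [16, 40, 32, 13],               # 0% missing
})

# percentage of missing values per column
ratio = data.isnull().sum() / len(data) * 100

# keep only columns whose missing-value ratio is at most 40%
keep = [col for col in data.columns if ratio[col] <= 40]
new_data = data[keep]
print(keep)
```

Columns above the threshold ('holiday' and 'windspeed' here) are dropped; everything else is carried into `new_data`.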
Out[7]:
ID season workingday weather temp atemp humidity count
Out[8]: ID 0.000000
season 0.069337
workingday 0.069337
weather 0.030817
temp 0.000000
atemp 0.000000
humidity 0.038521
count 0.000000
dtype: float64
high_corr = [ ]
for c1 in numeric_columns:
    for c2 in numeric_columns:
        if c1 != c2 and c2 not in high_corr and correlation[c1][c2] > 0.9:
            high_corr.append(c1)
In [13]: high_corr
Out[13]: ['temp']
In [14]: import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8,5))
sns.heatmap(correlation,annot=True)
Out[14]: <AxesSubplot:>
The heatmap shows that atemp and temp are highly correlated, so the filter above removes one variable of each highly correlated pair.
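The high-correlation filter can be reproduced on a small synthetic frame where atemp is deliberately a near-copy of temp (the values below are illustrative, not the notebook's data):

```python
# Sketch of the high-correlation filter: drop one variable from each
# pair whose Pearson correlation exceeds 0.9.
import pandas as pd

df = pd.DataFrame({
    "temp":     [9.8, 9.0, 9.0, 9.8, 12.0],
    "atemp":    [14.4, 13.6, 13.6, 14.4, 17.0],  # nearly a shifted copy of temp
    "humidity": [81, 80, 80, 75, 60],
    "count":    [16, 40, 32, 13, 5],
})

correlation = df.corr()
numeric_columns = df.columns

high_corr = []
for c1 in numeric_columns:
    for c2 in numeric_columns:
        if c1 != c2 and c2 not in high_corr and correlation[c1][c2] > 0.9:
            high_corr.append(c1)

reduced = df.drop(columns=high_corr)
print(high_corr)
```

Once 'temp' is flagged, its partner 'atemp' is never flagged (the `c2 not in high_corr` check skips it), so only one variable of the pair is dropped.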
3. Low_variance_filter (Method 3)
In [15]: #importing the libraries
import pandas as pd
from sklearn.preprocessing import normalize
#reading the file
data = pd.read_csv('low_variance_filter.csv')
# first 5 rows of the data
data.head()
Out[15]:
ID temp atemp humidity windspeed count
Out[16]: ID 0.0
temp 0.0
atemp 0.0
humidity 0.0
windspeed 0.0
count 0.0
dtype: float64
Since there are no missing values here, we use the variance of each column to reduce dimensionality instead.
In [17]: data.var()
Out[20]: 0 0.005877
1 0.007977
2 0.093491
3 0.008756
4 0.111977
dtype: float64
In [22]: # saving the names of variables having variance more than a threshold value
         variable = [ ]
         for i in range(0, len(variance)):
             if variance[i] >= 0.006:    # setting the threshold at 0.006
                 variable.append(columns[i])
Out[24]:
atemp humidity windspeed count
0 14.395 81 0.0 16
1 13.635 80 0.0 40
2 13.635 80 0.0 32
3 14.395 75 0.0 13
4 14.395 75 0.0 1
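The low-variance filter can be sketched end-to-end on a small synthetic frame. Plain NumPy is used here in place of sklearn's `normalize` (scaling each column to unit L2 norm, which is what `normalize(data, axis=0)` does); the data and threshold are illustrative:

```python
# Sketch of the low-variance filter: scale columns to a comparable range,
# then drop columns whose variance falls below a threshold.
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "temp":     [0.50, 0.50, 0.50, 0.50, 0.51],  # nearly constant column
    "humidity": [81, 40, 80, 10, 86],
    "count":    [16, 40, 32, 13, 1],
})

# scale each column to unit L2 norm so the variances are comparable
scaled = data / np.sqrt((data ** 2).sum(axis=0))

variance = scaled.var()
threshold = 0.006
keep = [col for col in data.columns if variance[col] >= threshold]
new_data = data[keep]
print(keep)
```

The nearly constant 'temp' column carries almost no information, so its scaled variance falls below the threshold and it is dropped.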
In [25]: # shape of new and original data
print("Shape before the low-variance filter: ", data.shape)
print("Shape after the low-variance filter: ", new_data.shape)
4. Feature_importance (Method 4)
Out[27]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Target
In [28]: X = df.drop("Target",axis=1)
y = df['Target']
From this you can determine which variables are important and which are not, and then easily eliminate the less important features.
Based on the above graph, we can handpick the top-most features to reduce the
dimensionality in our dataset.
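As a sketch of how such an importance ranking is produced, a tree ensemble exposes `feature_importances_`. Here a random forest is fit on sklearn's built-in iris data; the notebook's actual model, data, and hyperparameters may differ:

```python
# Ranking features by a random forest's feature_importances_.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X, y)

# one importance score per feature; the scores sum to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Plotting this Series as a bar chart gives the kind of graph referred to above, from which the top features can be handpicked.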
Alternatively, we can use the SelectFromModel of sklearn to do so. It selects the features
based on the importance of their weights.
SelectFromModel of sklearn
In this way you can keep any number of features and easily eliminate the rest.
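A minimal sketch of `SelectFromModel` on the same iris data follows. By default it keeps the features whose importance exceeds the mean importance; the estimator and parameters here are assumptions, not the notebook's exact code:

```python
# SelectFromModel keeps features whose importance exceeds a threshold
# (the mean importance, by default).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=1))
selector.fit(X, y)

mask = selector.get_support()       # boolean mask of kept features
X_reduced = selector.transform(X)   # data restricted to the kept columns
print(X.shape, "->", X_reduced.shape)
```

A `threshold` argument (e.g. `threshold="median"` or a float) can be passed to `SelectFromModel` to control how aggressively features are pruned.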