20BCP021 Assignment 6
Title: Apply different feature selection approaches to a classification/regression
task and compare the performance of the different feature selection approaches.
Objective: The objective of this lab assignment is to explore various feature
selection techniques for classification and regression tasks.
Dataset: Use the UCI Iris dataset for the classification task and the Boston
Housing dataset for the regression task.
Tasks:
1) Load the Iris dataset and the Boston Housing dataset.
2) Split each dataset into features (X) and target variable (y).
3) For each dataset and each feature selection approach (minimum 3),
follow these steps:
a. Apply the feature selection technique to select a subset of
features.
b. Split the data into training and testing sets (e.g., 70% training,
30% testing).
c. Train a classification model (e.g., Logistic Regression, Random
Forest) for the Iris dataset and a regression model (e.g., Linear
Regression, Decision Tree) for the Boston Housing dataset using the
selected features.
d. Evaluate the model's performance on the testing set using
appropriate metrics (e.g., accuracy, mean squared error); a generic
helper for this train/evaluate step is sketched after this list.
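Since every method below repeats the same split/train/score step, here is a minimal sketch of that shared step. The helper name and its exact signature are my own (not from the original notebook); only the 70/30 split and the metrics come from the task description.

from sklearn.model_selection import train_test_split

# Hypothetical helper: fit `model` on a chosen feature subset and return
# its test score. `scorer(y_true, y_pred)` would be accuracy_score for
# Iris and (root) mean squared error for Boston Housing.
def evaluate_subset(model, X, y, features, scorer):
    # 70/30 train/test split, as specified in the tasks
    X_train, X_test, y_train, y_test = train_test_split(
        X[features], y, test_size=0.3, random_state=42)
    model.fit(X_train, y_train)
    return scorer(y_test, model.predict(X_test))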
IRIS Dataset:
1. Load and Preprocess:
Code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset as a DataFrame and compute summary statistics
iris = load_iris(as_frame=True)
iris_df, X, y = iris.frame, iris.data, iris.target
iris_stats = iris_df.describe()
iris_stats
Output:
2. RFE
Code:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Split into train and test sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize RFE with a logistic regression base estimator
base_estimator = LogisticRegression(max_iter=200)
rfe = RFE(estimator=base_estimator, n_features_to_select=3)

# Fit RFE
rfe = rfe.fit(X_train, y_train)

# Train on the selected features and evaluate on the test set
clf_rfe = LogisticRegression(max_iter=200).fit(X_train.loc[:, rfe.support_], y_train)
accuracy_rfe = accuracy_score(y_test, clf_rfe.predict(X_test.loc[:, rfe.support_]))

# Results
selected_features_rfe = X.columns[rfe.support_].tolist()
selected_features_rfe, accuracy_rfe
Output:
3. Feature Importance using Random Forest
Code:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
# Rank features by Random Forest importance and keep the top 3
forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)
top_indices = np.argsort(forest.feature_importances_)[::-1][:3]

# Train a logistic regression on the top-ranked features
clf_rf = LogisticRegression(max_iter=200)
clf_rf.fit(X_train.iloc[:, top_indices], y_train)
accuracy_rf = accuracy_score(y_test, clf_rf.predict(X_test.iloc[:, top_indices]))

# Results
selected_features_rf = X.columns[top_indices].tolist()
selected_features_rf, accuracy_rf
Output:
4. Pearson Correlation
Code:
# Compute Pearson correlation matrix (iris_df includes the 'target' column)
correlation_matrix = iris_df.corr()

# Select the 3 features most correlated (in absolute value) with the target
top_correlated_features = correlation_matrix['target'].drop('target').abs().nlargest(3).index.tolist()
X_train_corr, X_test_corr = X_train[top_correlated_features], X_test[top_correlated_features]

clf_corr = LogisticRegression(max_iter=200)
clf_corr.fit(X_train_corr, y_train)
accuracy_corr = accuracy_score(y_test, clf_corr.predict(X_test_corr))

# Results
top_correlated_features, accuracy_corr
Output:
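As an optional visual aid (my addition, not part of the original run), the correlation matrix can be rendered as a heatmap using the seaborn and matplotlib imports from the loading step; this assumes correlation_matrix from the code above.

# Heatmap of pairwise Pearson correlations, annotated with coefficients
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Iris feature correlations')
plt.show()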
5. Analysis of Methods
Method                                 | Selected Features (Iris)                               | Accuracy (%)
Feature Importance using Random Forest | Petal width (cm), Petal length (cm), Sepal length (cm) | 100
Insight:
All three feature selection methods reached 100% accuracy, indicating that
feature selection works well for this task regardless of the method used.
This is likely because the Iris dataset is small and its classes are well
separated, as the pairplot sketched below makes visible.
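As a quick check of that claim (my addition, assuming the iris_df DataFrame from the loading step), a seaborn pairplot shows the class separation directly:

# Pairwise feature relationships, colored by class; the petal
# measurements separate the three species almost linearly.
sns.pairplot(iris_df, hue='target')
plt.show()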
BOSTON Housing Dataset:
1. Load and Explore
Code:
# Import necessary libraries and reload the uploaded Boston Housing dataset
import pandas as pd

# (Filename assumed here; the original notebook read an uploaded file)
boston_df = pd.read_csv('BostonHousing.csv')

# Show basic statistics and first few rows to understand its structure
boston_stats = boston_df.describe()
boston_df.head()
boston_stats
Output:
2. Handle Missing Values
Code:
# Handle missing values by mean imputation
boston_df = boston_df.fillna(boston_df.mean())
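A quick verification (my addition, using the same boston_df) that the imputation left no gaps:

# Confirm that mean imputation removed every missing value
assert boston_df.isnull().sum().sum() == 0
print(boston_df.isnull().sum())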
3. RFE
Code:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from math import sqrt
# Initialize RFE
rfe_boston = RFE(estimator=base_estimator_boston,
n_features_to_select=5)
# Fit RFE
rfe_boston = rfe_boston.fit(X_train_boston, y_train_boston)
# Results
selected_features_rfe_boston =
X_boston.columns[rfe_boston.support_].tolist()
selected_features_rfe_boston, rmse_rfe
Output:
4. Feature Importance using Random Forest
Code:
from sklearn.ensemble import RandomForestRegressor
# Rank features by Random Forest importance and keep the top 5
forest_boston = RandomForestRegressor(random_state=42).fit(X_train_boston, y_train_boston)
top_indices_boston = np.argsort(forest_boston.feature_importances_)[::-1][:5]
X_train_rf_boston = X_train_boston.iloc[:, top_indices_boston]
X_test_rf_boston = X_test_boston.iloc[:, top_indices_boston]

# Train a linear regression on the top-ranked features
reg_rf = LinearRegression()
reg_rf.fit(X_train_rf_boston, y_train_boston)
rmse_rf = sqrt(mean_squared_error(y_test_boston, reg_rf.predict(X_test_rf_boston)))

# Results
selected_features_rf_boston = X_boston.columns[top_indices_boston].tolist()
selected_features_rf_boston, rmse_rf
Output:
5. Pearson Correlation
Code:
# Compute Pearson correlation matrix for the Boston Housing dataset
correlation_matrix_boston = boston_df.corr()

# Select the 5 features most correlated (in absolute value) with MEDV
top_correlated_features_boston = (
    correlation_matrix_boston['MEDV'].drop('MEDV').abs().nlargest(5).index.tolist())
X_train_corr_boston = X_train_boston[top_correlated_features_boston]
X_test_corr_boston = X_test_boston[top_correlated_features_boston]

reg_corr = LinearRegression()
reg_corr.fit(X_train_corr_boston, y_train_boston)
rmse_corr = sqrt(mean_squared_error(y_test_boston, reg_corr.predict(X_test_corr_boston)))

# Results
top_correlated_features_boston, rmse_corr
Output:
6. Analysis of Methods
Method                                 | Selected Features (Boston)    | RMSE
Recursive Feature Elimination (RFE)    | CHAS, NOX, RM, DIS, PTRATIO   | 5.31
Feature Importance using Random Forest | RM, LSTAT, DIS, CRIM, PTRATIO | 5.06
Insight:
Between the two reported methods, Feature Importance using Random Forest
achieves the lower test RMSE (5.06 vs. 5.31 for RFE). Both rankings agree
on RM, DIS, and PTRATIO, differing only in CHAS/NOX versus LSTAT/CRIM, so
the gap comes down to which two extra features each method keeps.
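To tabulate this comparison programmatically (my addition, assuming the result variables from the steps above), the scores can be collected into a small DataFrame:

import pandas as pd

# Collect each method's selected features and test RMSE for comparison
results = pd.DataFrame({
    'Method': ['RFE', 'Random Forest importance', 'Pearson correlation'],
    'Selected features': [selected_features_rfe_boston,
                          selected_features_rf_boston,
                          top_correlated_features_boston],
    'RMSE': [rmse_rfe, rmse_rf, rmse_corr],
})
print(results.sort_values('RMSE'))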