20BCP021 Assignment 6

This document describes a lab assignment that compares the performance of different feature selection approaches for classification and regression tasks. Two datasets are loaded and preprocessed: the Iris dataset for classification with logistic regression, and the Boston Housing dataset for regression with linear regression. Three feature selection techniques are applied to each dataset: Recursive Feature Elimination (RFE), Random Forest feature importance, and Pearson correlation. Models are trained on the selected features and evaluated on held-out test data to compare the approaches.


Name: Yash Prajapati

Roll No.: 20BCP021


Batch: G1
Assignment: 6

Title: Apply different feature selection approaches to a classification/regression task and compare the performance of the different approaches.
Objective: The objective of this lab assignment is to explore various feature
selection techniques for classification and regression tasks.
Dataset: Use the UCI Iris dataset for the classification task and the Boston
Housing dataset for the regression task.
Tasks:
1) Load the Iris dataset and the Boston Housing dataset.
2) Preprocess the data: handle missing values, encode categorical variables (if any), and normalize/standardize features.
3) Split each dataset into features (X) and target variable (y).
4) For each dataset and each feature selection approach (minimum 3), follow these steps (a helper sketch of steps a–d follows this list):
   a. Apply the feature selection technique to select a subset of features.
   b. Split the data into training and testing sets (e.g., 70% training, 30% testing).
   c. Train a classification model (e.g., Logistic Regression, Random Forest) for the Iris dataset and a regression model (e.g., Linear Regression, Decision Tree) for the Boston Housing dataset using the selected features.
   d. Evaluate the model's performance on the testing set using appropriate metrics (e.g., accuracy, mean squared error).
5) Compare and analyze the performance of each approach in terms of model performance and the number of selected features.
6) Summarize your findings and insights in a report, including a comparison table or visualization.
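
Each method section below repeats steps 4b–4d; as noted in task 4, here is a minimal helper sketch of that shared pattern (the name evaluate_selection and its signature are illustrative, not part of the assignment):

from sklearn.model_selection import train_test_split

def evaluate_selection(X_selected, y, model, metric, test_size=0.3, random_state=42):
    # Steps 4b-4d: split, train on the selected feature subset, score on the test set.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_selected, y, test_size=test_size, random_state=random_state)
    model.fit(X_tr, y_tr)
    return metric(y_te, model.predict(X_te))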

IRIS Dataset:
1. Load and Preprocess:

Code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset
iris_data = load_iris()
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_df['species'] = iris_data.target

# Show basic statistics
iris_stats = iris_df.describe()

# Plot a pairplot for visualization
sns.pairplot(iris_df, hue='species', markers=["o", "s", "D"])
plt.suptitle('Pairplot of Iris Dataset', y=1.02)
plt.show()

iris_stats

Output: pairplot of the four Iris measurements colored by species, followed by the describe() summary table.
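
Task 2 also calls for normalizing/standardizing the features; the cells below skip this because logistic regression separates the raw Iris measurements well, but a minimal sketch of that step, assuming scikit-learn's StandardScaler (iris_scaled is an illustrative name):

from sklearn.preprocessing import StandardScaler

# Standardize the four measurement columns to zero mean and unit variance.
scaler = StandardScaler()
iris_scaled = iris_df.copy()
iris_scaled[iris_data.feature_names] = scaler.fit_transform(iris_df[iris_data.feature_names])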
2. RFE

Code:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Prepare the data
X = iris_df.iloc[:, :-1]  # Features
y = iris_df.iloc[:, -1]   # Target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the base estimator
base_estimator = LogisticRegression()

# Initialize RFE to keep 3 of the 4 features
rfe = RFE(estimator=base_estimator, n_features_to_select=3)

# Fit RFE
rfe = rfe.fit(X_train, y_train)

# Keep only the features RFE selected
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)

# Train a classifier on the selected features
clf_rfe = LogisticRegression()
clf_rfe = clf_rfe.fit(X_train_rfe, y_train)

# Evaluate the classifier
y_pred_rfe = clf_rfe.predict(X_test_rfe)
accuracy_rfe = accuracy_score(y_test, y_pred_rfe)

# Results
selected_features_rfe = X.columns[rfe.support_].tolist()
selected_features_rfe, accuracy_rfe
Output: selected features ['sepal width (cm)', 'petal length (cm)', 'petal width (cm)']; test accuracy 100% (see Section 5).
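
RFE requires fixing n_features_to_select up front. A hedged alternative sketch using scikit-learn's RFECV, which repeats RFE inside cross-validation and keeps the feature count with the best mean CV accuracy (reusing X_train/y_train from above; max_iter is raised only to avoid convergence warnings):

from sklearn.feature_selection import RFECV

# Let cross-validation choose how many features to keep.
rfecv = RFECV(estimator=LogisticRegression(max_iter=200), step=1, cv=5, scoring='accuracy')
rfecv.fit(X_train, y_train)
print("Optimal number of features:", rfecv.n_features_)
print("Selected:", X.columns[rfecv.support_].tolist())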
3. Feature Importance using Random Forest

Code:
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Initialize Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model
rf_clf.fit(X_train, y_train)

# Get feature importances
feature_importances = rf_clf.feature_importances_

# Sort feature importances in descending order and get the indices
sorted_idx = np.argsort(feature_importances)[::-1]

# Select the top 3 important features
top_indices = sorted_idx[:3]

# Train a classifier on the selected features
X_train_rf = X_train.iloc[:, top_indices]
X_test_rf = X_test.iloc[:, top_indices]

clf_rf = LogisticRegression()
clf_rf.fit(X_train_rf, y_train)

# Evaluate the classifier
y_pred_rf = clf_rf.predict(X_test_rf)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

# Results
selected_features_rf = X.columns[top_indices].tolist()
selected_features_rf, accuracy_rf
Output: selected features ['petal width (cm)', 'petal length (cm)', 'sepal length (cm)']; test accuracy 100% (see Section 5).
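
The manual argsort ranking above can also be expressed with scikit-learn's SelectFromModel; a sketch reusing the fitted rf_clf, where threshold=-np.inf combined with max_features keeps exactly the top 3 features by importance:

from sklearn.feature_selection import SelectFromModel

# Wrap the already-fitted forest (prefit=True) and keep the 3 highest-importance features.
selector = SelectFromModel(rf_clf, max_features=3, threshold=-np.inf, prefit=True)
print("Selected:", X.columns[selector.get_support()].tolist())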
4. Pearson Correlation

Code:
# Compute the Pearson correlation matrix
correlation_matrix = iris_df.corr()

# Get the correlation of every feature with the target variable ('species')
correlation_with_target = correlation_matrix["species"].sort_values(ascending=False)

# Select the top 3 correlated features (index 0 is 'species' itself)
top_correlated_features = correlation_with_target.index[1:4].tolist()

# Train a classifier on the selected features
X_train_corr = X_train[top_correlated_features]
X_test_corr = X_test[top_correlated_features]

clf_corr = LogisticRegression()
clf_corr.fit(X_train_corr, y_train)

# Evaluate the classifier
y_pred_corr = clf_corr.predict(X_test_corr)
accuracy_corr = accuracy_score(y_test, y_pred_corr)

# Results
top_correlated_features, accuracy_corr
Output: selected features ['petal width (cm)', 'petal length (cm)', 'sepal length (cm)']; test accuracy 100% (see Section 5).
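
Correlating features against the integer-coded species treats an unordered class label as a number, a known caveat of this method. A univariate filter designed for multi-class targets, sketched with scikit-learn's SelectKBest and the ANOVA F-test (reusing X_train/y_train from above):

from sklearn.feature_selection import SelectKBest, f_classif

# Score each feature with a one-way ANOVA F-test against the class labels.
skb = SelectKBest(score_func=f_classif, k=3).fit(X_train, y_train)
print("Top 3 by F-score:", X.columns[skb.get_support()].tolist())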
5. Analysis of Methods

o Recursive Feature Elimination (RFE)
   ▪ Selected Features: Sepal width (cm), Petal length (cm), Petal width (cm)
   ▪ Accuracy: 100%
o Feature Importance using Random Forest
   ▪ Selected Features: Petal width (cm), Petal length (cm), Sepal length (cm)
   ▪ Accuracy: 100%
o Pearson Correlation
   ▪ Selected Features: Petal width (cm), Petal length (cm), Sepal length (cm)
   ▪ Accuracy: 100%

6. Table and Insight

Method                                   Selected Features (Iris)                                 Accuracy (Iris) (%)
Recursive Feature Elimination (RFE)      Sepal width (cm), Petal length (cm), Petal width (cm)    100
Feature Importance using Random Forest   Petal width (cm), Petal length (cm), Sepal length (cm)   100
Pearson Correlation                      Petal width (cm), Petal length (cm), Sepal length (cm)   100

Insight:
All three feature selection methods reached 100% test accuracy, so for this task the choice of method is immaterial. This is likely because the Iris dataset is small and its classes are well separated, so any three of the four features carry enough information to classify the test set perfectly.

BOSTON Housing Dataset:
1. Load and Explore

Code:
# Import the necessary libraries and load the Boston Housing dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
boston_df = pd.read_csv(r"C:\Users\meet1\Downloads\archive1\HousingData.csv")

# Show basic statistics and the first few rows to understand its structure
boston_stats = boston_df.describe()
boston_df.head(), boston_stats

# Plot a pairplot for a subset of features (plotting all would be too many)
selected_columns = ['RM', 'AGE', 'TAX', 'LSTAT', 'MEDV']
sns.pairplot(boston_df[selected_columns])
plt.suptitle('Pairplot of Selected Features from Boston Housing Dataset', y=1.02)
plt.show()

boston_stats

Output: pairplot of RM, AGE, TAX, LSTAT, and MEDV, followed by the describe() summary table.
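
A note on loading: the CSV path above is machine-specific, and sklearn.datasets.load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so a local file or the raw StatLib source is now the usual route. A sketch of the fallback suggested in scikit-learn's deprecation notice:

import numpy as np
import pandas as pd

# The raw StatLib file stores each record across two physical lines.
url = "http://lib.stat.cmu.edu/datasets/boston"
raw = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw.values[::2, :], raw.values[1::2, :2]])
target = raw.values[1::2, 2]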
2. Missing values

Code:
# Handle missing values by mean imputation
boston_df = boston_df.fillna(boston_df.mean())

# Confirm that there are no more missing values
missing_values_count = boston_df.isnull().sum().sum()
missing_values_count
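
Note that fillna here computes the column means over the full dataset before the train/test split, which leaks a little test information into the imputation. A sketch of the same step with scikit-learn's SimpleImputer, which could instead be fit on the training split only:

from sklearn.impute import SimpleImputer

# Learn column means and fill the gaps (fit on the training split only to avoid leakage).
imputer = SimpleImputer(strategy='mean')
boston_imputed = pd.DataFrame(imputer.fit_transform(boston_df), columns=boston_df.columns)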

3. RFE
Code:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from math import sqrt

# Prepare the data
X_boston = boston_df.iloc[:, :-1]  # Features
y_boston = boston_df.iloc[:, -1]   # Target (MEDV)

# Split the data into training and test sets
X_train_boston, X_test_boston, y_train_boston, y_test_boston = train_test_split(
    X_boston, y_boston, test_size=0.3, random_state=42)

# Initialize the base estimator
base_estimator_boston = LinearRegression()

# Initialize RFE to keep 5 of the 13 features
rfe_boston = RFE(estimator=base_estimator_boston, n_features_to_select=5)

# Fit RFE
rfe_boston = rfe_boston.fit(X_train_boston, y_train_boston)

# Keep only the features RFE selected
X_train_rfe_boston = rfe_boston.transform(X_train_boston)
X_test_rfe_boston = rfe_boston.transform(X_test_boston)

# Train a regressor on the selected features
reg_rfe = LinearRegression()
reg_rfe = reg_rfe.fit(X_train_rfe_boston, y_train_boston)

# Evaluate the regressor with root mean squared error
y_pred_rfe_boston = reg_rfe.predict(X_test_rfe_boston)
rmse_rfe = sqrt(mean_squared_error(y_test_boston, y_pred_rfe_boston))

# Results
selected_features_rfe_boston = X_boston.columns[rfe_boston.support_].tolist()
selected_features_rfe_boston, rmse_rfe
Output: selected features ['CHAS', 'NOX', 'RM', 'DIS', 'PTRATIO']; RMSE ≈ 5.31 (see Section 6).
4. Feature Importance using Random Forest
Code:
from sklearn.ensemble import RandomForestRegressor

# Initialize Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model
rf_reg.fit(X_train_boston, y_train_boston)

# Get feature importances
feature_importances_boston = rf_reg.feature_importances_

# Sort feature importances in descending order and get the indices
sorted_idx_boston = np.argsort(feature_importances_boston)[::-1]

# Select the top 5 important features
top_indices_boston = sorted_idx_boston[:5]

# Train a regressor on the selected features
X_train_rf_boston = X_train_boston.iloc[:, top_indices_boston]
X_test_rf_boston = X_test_boston.iloc[:, top_indices_boston]

reg_rf = LinearRegression()
reg_rf.fit(X_train_rf_boston, y_train_boston)

# Evaluate the regressor
y_pred_rf_boston = reg_rf.predict(X_test_rf_boston)
rmse_rf = sqrt(mean_squared_error(y_test_boston, y_pred_rf_boston))

# Results
selected_features_rf_boston = X_boston.columns[top_indices_boston].tolist()
selected_features_rf_boston, rmse_rf
Output: selected features ['RM', 'LSTAT', 'DIS', 'CRIM', 'PTRATIO']; RMSE ≈ 5.06 (see Section 6).
5. Pearson Correlation
Code:
# Compute the Pearson correlation matrix for the Boston Housing dataset
correlation_matrix_boston = boston_df.corr()

# Get the correlation of every feature with the target variable ('MEDV')
correlation_with_target_boston = correlation_matrix_boston["MEDV"].sort_values(ascending=False)

# Select the top 5 correlated features (index 0 is 'MEDV' itself)
top_correlated_features_boston = correlation_with_target_boston.index[1:6].tolist()

# Train a regressor on the selected features
X_train_corr_boston = X_train_boston[top_correlated_features_boston]
X_test_corr_boston = X_test_boston[top_correlated_features_boston]

reg_corr = LinearRegression()
reg_corr.fit(X_train_corr_boston, y_train_boston)

# Evaluate the regressor
y_pred_corr_boston = reg_corr.predict(X_test_corr_boston)
rmse_corr = sqrt(mean_squared_error(y_test_boston, y_pred_corr_boston))

# Results
top_correlated_features_boston, rmse_corr
Output: selected features ['RM', 'ZN', 'B', 'DIS', 'CHAS']; RMSE ≈ 5.73 (see Section 6).
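
One caveat: sorting descending and slicing index[1:6] keeps only the most positively correlated features, so strongly negative predictors such as LSTAT (whose correlation with MEDV is the largest in magnitude) are discarded; this likely contributes to the weaker RMSE reported below. A sketch that ranks by absolute correlation instead:

# Rank features by |r| so strong negative predictors (e.g., LSTAT) are kept.
abs_corr = correlation_matrix_boston["MEDV"].abs().sort_values(ascending=False)
top_abs_features = abs_corr.index[1:6].tolist()  # index 0 is MEDV itself
print(top_abs_features)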
6. Analysis of Methods

o Recursive Feature Elimination (RFE)
   ▪ Selected Features: CHAS, NOX, RM, DIS, PTRATIO
   ▪ RMSE: 5.31
o Feature Importance using Random Forest
   ▪ Selected Features: RM, LSTAT, DIS, CRIM, PTRATIO
   ▪ RMSE: 5.06
o Pearson Correlation
   ▪ Selected Features: RM, ZN, B, DIS, CHAS
   ▪ RMSE: 5.73

7. Table and Insight

Method                                   Selected Features (Boston)       RMSE (Boston)
Recursive Feature Elimination (RFE)      CHAS, NOX, RM, DIS, PTRATIO      5.31
Feature Importance using Random Forest   RM, LSTAT, DIS, CRIM, PTRATIO    5.06
Pearson Correlation                      RM, ZN, B, DIS, CHAS             5.73

Insight:

• Feature Importance using Random Forest produced the lowest RMSE (5.06), making it the most effective feature selection method for this dataset.
• Pearson Correlation produced the highest RMSE (5.73). Because the features were ranked by signed rather than absolute correlation, strongly negative predictors such as LSTAT were never considered, which likely explains the weaker result for this task.
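
For the comparison table requested in task 6, the metrics computed above can be gathered into a single frame; a minimal sketch assuming the variables from the earlier cells:

# Collect the metrics computed above into one comparison table.
summary = pd.DataFrame({
    "Method": ["RFE", "Random Forest importance", "Pearson correlation"],
    "Iris accuracy": [accuracy_rfe, accuracy_rf, accuracy_corr],
    "Boston RMSE": [rmse_rfe, rmse_rf, rmse_corr],
})
print(summary.to_string(index=False))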
