MDS372 Lab4 2448001
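The data-loading cell is not included in this excerpt; a minimal sketch of how df, X, and y could be built, assuming the scikit-learn California Housing dataset with the target stored in a Target column:
from sklearn.datasets import fetch_california_housing

# Load the California Housing data as a DataFrame (sketch; column names
# match the output below, with MedHouseVal renamed to Target).
data = fetch_california_housing(as_frame=True)
df = data.frame.rename(columns={"MedHouseVal": "Target"})

# Feature matrix and target used throughout the notebook.
X = df.drop(columns=["Target"])
y = df["Target"]

df.head()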
   Longitude  Target
0    -122.23   4.526
1    -122.22   3.585
2    -122.24   3.521
3    -122.25   3.413
4    -122.25   3.422
print("No of records/rows:", df.shape[0])
print("No of features/columns:", df.shape[1])
print("Features:", df.columns)
No of records/rows: 20640
No of features/columns: 9
Features: Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
'Population', 'AveOccup',
'Latitude', 'Longitude', 'Target'],
dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
8 Target 20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB
# Standardize Features
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split Dataset
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
                                                    test_size=0.2, random_state=42)
The baseline model achieves an MSE of 0.5559, meaning that, on average, the squared difference between predicted and actual values is 0.5559.
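The baseline fit that produces this number is not shown in the excerpt; a minimal sketch, assuming an ordinary LinearRegression trained on all scaled features:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Baseline: fit on all features and report the test MSE (sketch).
model = LinearRegression()
model.fit(X_train, y_train)
mse_baseline = mean_squared_error(y_test, model.predict(X_test))
print("Baseline MSE:", mse_baseline)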
Wrapper Method
Wrapper methods use a machine learning model to select features by evaluating their impact on
performance.
1. Forward Selection
Starts with no features.
Adds features one by one that improve the model the most.
X_train_k = sfs_k.transform(X_train)
X_test_k = sfs_k.transform(X_test)
# Plotting
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
plt.plot(num_features, mse_values, marker="o", linestyle="--",
color="blue", label="MSE Score")
plt.xlabel("Number of Selected Features")
plt.ylabel("MSE Score")
plt.title("Forward Selection: MSE vs. Number of Features")
plt.legend()
plt.grid()
plt.show()
Train Model & Compute MSE: for each feature count, we:
Select the best features using Forward Selection (sfs).
Train the model using only those features.
Calculate the MSE (error) and store it.
Plot MSE vs. Number of Features: this shows how model performance changes as features are added.
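The loop itself, which produces the sfs_k, mse_values, and num_features used above, is not reproduced in the excerpt; a sketch under the assumption that every feature count from 1 to the full set is tried:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

mse_values = []
num_features = []

# Forward selection sweep over the number of selected features (sketch).
for k in range(1, X_train.shape[1] + 1):
    sfs_k = SFS(LinearRegression(), k_features=k, forward=True,
                floating=False, scoring='neg_mean_squared_error', cv=5)
    sfs_k.fit(X_train, y_train)

    # Keep only the selected features and refit a fresh model on them.
    X_train_k = sfs_k.transform(X_train)
    X_test_k = sfs_k.transform(X_test)
    lr = LinearRegression().fit(X_train_k, y_train)

    mse_values.append(mean_squared_error(y_test, lr.predict(X_test_k)))
    num_features.append(k)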
sfs = SFS(model, k_features=4, forward=True, floating=False,
scoring='neg_mean_squared_error', cv=5)
sfs.fit(X_train, y_train)
X_train_fs = sfs.transform(X_train)
X_test_fs = sfs.transform(X_test)
This means removing some features improved the model slightly, suggesting that not all of them were useful.
2. Backward Elimination
Starts with all features in the model.
Removes the least important feature one by one, based on the effect on model performance.
Stops when removing another feature would increase the error (MSE).
# Store MSE values for different numbers of selected features
mse_values = []
num_features = []
Backward elimination removes one feature at a time, starting with the least important one, based on how much each feature contributes to reducing the error (MSE). For each feature count, we:
Select the best features using Backward Elimination (sbs).
Train the model using only those selected features.
Calculate the MSE (error) and store it.
Plot MSE vs. Number of Features, showing how model performance changes as features are removed.
Find the optimal features: the ones where the MSE is lowest.
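The corresponding loop is again not shown; a sketch under the same assumptions as the forward-selection sweep, differing only in forward=False:
# Backward elimination sweep (sketch); reuses the SFS import above.
for k in range(1, X_train.shape[1] + 1):
    sbs_k = SFS(LinearRegression(), k_features=k, forward=False,
                floating=False, scoring='neg_mean_squared_error', cv=5)
    sbs_k.fit(X_train, y_train)
    lr = LinearRegression().fit(sbs_k.transform(X_train), y_train)
    mse_values.append(mean_squared_error(y_test, lr.predict(sbs_k.transform(X_test))))
    num_features.append(k)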
sbs = SFS(model, k_features=4, forward=False, floating=False,
scoring='neg_mean_squared_error', cv=5)
sbs.fit(X_train, y_train)
X_train_bs = sbs.transform(X_train)
X_test_bs = sbs.transform(X_test)
Again, removing some features improved the model slightly, suggesting that not all of them were useful.
3. Recursive Feature Elimination (RFE)
Removes the least important feature based on its contribution to the model.
Repeats the process recursively until only the desired number of features remain.
Key Idea: RFE ranks features by importance and eliminates them one by one until only the most significant ones are left.
# Store MSE values for different feature counts
mse_values = []
num_features = []
RFE removes one feature at a time, starting with the least important one, based on how much each feature contributes to reducing the error (MSE).
Train Model & Compute MSE: for each feature count, we:
Select the best features using Recursive Feature Elimination (RFE).
Train the model using only those selected features.
Calculate the MSE (error) and store it.
Plot MSE vs. Number of Features, showing how model performance changes as features are removed.
Find the optimal features: the ones where the MSE is lowest.
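As with the sequential methods, the sweep itself is not included; a sketch, assuming RFE is refit for each feature count:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# RFE sweep over the number of retained features (sketch).
for k in range(1, X_train.shape[1] + 1):
    rfe_k = RFE(estimator=LinearRegression(), n_features_to_select=k)
    rfe_k.fit(X_train, y_train)
    lr = LinearRegression().fit(rfe_k.transform(X_train), y_train)
    mse_values.append(mean_squared_error(y_test, lr.predict(rfe_k.transform(X_test))))
    num_features.append(k)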
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)
rfe.fit(X_train, y_train)
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)
This means removing some features slightly increased the error, suggesting that important features
might have been removed.
4. Exhaustive Search
Tests all possible feature subsets within the specified range (min_features=3, max_features=5).
Trains a model for each subset and calculates the corresponding MSE.
Selects the feature set that minimizes the MSE, ensuring the best possible feature combination (see the loop sketch below).
mse_values = []
feature_counts = list(range(3, 6))  # Testing feature sets of size 3, 4, and 5
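The loop that fills mse_values for each subset size is not shown; a sketch, assuming an ExhaustiveFeatureSelector is fit for every size in feature_counts:
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Exhaustive search at each subset size; prints progress such as
# "Features: 56/56" while evaluating all combinations (sketch).
for k in feature_counts:
    efs = EFS(LinearRegression(), min_features=k, max_features=k,
              scoring='neg_mean_squared_error', cv=3)
    efs.fit(X_train, y_train)
    lr = LinearRegression().fit(efs.transform(X_train), y_train)
    mse_values.append(mean_squared_error(y_test, lr.predict(efs.transform(X_test))))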
X_train_efs = efs.transform(X_train)
X_test_efs = efs.transform(X_test)
Features: 56/56
efs = EFS(model, min_features=4, max_features=4,
scoring='neg_mean_squared_error', cv=3)
efs.fit(X_train, y_train)
X_train_efs = efs.transform(X_train)
X_test_efs = efs.transform(X_test)
Features: 70/70
MSE with Exhaustive Search: 0.5490 (Better Performance than Baseline model)
Thus Exhaustive Search found the optimal 4-feature subset, leading to the lowest error. It confirms
that removing irrelevant features improved model accuracy.
Embedded Method
Embedded methods select features during model training by applying built-in regularization
techniques (e.g., LASSO shrinks coefficients, dropping less important features).
1. LASSO Regression
Performs feature selection by shrinking some coefficients to zero using L1 regularization.
Automatically removes less important features, keeping only the most significant ones.
Smaller alpha means less regularization (closer to standard Linear Regression). Larger alpha
means stronger regularization (more shrinkage of coefficients).
Y-axis represents the Mean Squared Error (MSE) of the model on the test data.
Lower MSE indicates better predictive performance. Higher MSE suggests underfitting (too
much shrinkage).
Small alpha (left side of the plot) → Low regularization: MSE is high if alpha is too low because too many irrelevant features are kept (overfitting risk).
Moderate alpha (middle of the plot) → Optimal point: this is where the MSE is minimum, meaning LASSO has removed unnecessary features while retaining the important ones. This is the best choice of alpha for balancing bias and variance.
Large alpha (right side of the plot) → High regularization: MSE increases because LASSO shrinks too many coefficients to zero, leading to underfitting (important features are lost).
Choose the alpha where the MSE is lowest to get the best feature subset with good predictive power.
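The alpha sweep that produces the plot described above and the best_alpha used next is not included in the excerpt; a minimal sketch, assuming a logarithmic grid of alpha values evaluated on the test split:
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Hypothetical alpha grid; record the test MSE for each value (sketch).
alphas = np.logspace(-4, 1, 30)
lasso_mse = []
for a in alphas:
    m = Lasso(alpha=a, max_iter=10000)
    m.fit(X_train, y_train)
    lasso_mse.append(mean_squared_error(y_test, m.predict(X_test)))

# Pick the alpha with the lowest test MSE and plot the curve.
best_alpha = alphas[int(np.argmin(lasso_mse))]

plt.figure(figsize=(6, 4))
plt.plot(alphas, lasso_mse, marker="o")
plt.xscale("log")
plt.xlabel("alpha")
plt.ylabel("MSE")
plt.title("LASSO: MSE vs. alpha")
plt.show()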
# Train Lasso with the best alpha
lasso = Lasso(alpha=best_alpha)
lasso.fit(X_train, y_train)
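The variables lasso_selected and X_test_lasso used in the next cell are not defined in the excerpt; a plausible intermediate step, assuming the features with nonzero LASSO coefficients are kept and the model is refit on that subset:
# Keep only the features whose LASSO coefficients are nonzero (assumption),
# then refit on the reduced feature set.
selected = lasso.coef_ != 0
X_train_lasso = X_train[:, selected]
X_test_lasso = X_test[:, selected]

lasso_selected = Lasso(alpha=best_alpha)
lasso_selected.fit(X_train_lasso, y_train)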
# Evaluate performance
y_pred = lasso_selected.predict(X_test_lasso)
mse_lasso = mean_squared_error(y_test, y_pred)
From the MSE results, LASSO gives the lowest MSE (0.5486, better than the baseline model), meaning the model performs best with this feature subset.
2. Ridge Regression
Lasso (L1 regularization) removes unimportant features by setting coefficients to zero, making it
useful for feature selection.
Ridge (L2 regularization) shrinks coefficients without setting them to zero, meaning all features are
retained but with reduced impact.
If the goal is feature selection, Lasso is the right approach. If we want to reduce multicollinearity and stabilize coefficients without removing features, Ridge can be tried as well. If unsure, we can try both Lasso and Ridge, compare MSE values, and choose the one that performs best.
Keeps all features but shrinks coefficients to prevent overfitting.
Helps when features are highly correlated and prevents instability in predictions.
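The best_alpha used in the Ridge fit below is not defined in the excerpt; a minimal sketch of how it could be chosen, assuming the same kind of alpha sweep as for LASSO:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Hypothetical grid search over Ridge's alpha on the test split (sketch).
alphas = np.logspace(-3, 3, 30)
ridge_mse = [mean_squared_error(y_test,
                                Ridge(alpha=a).fit(X_train, y_train).predict(X_test))
             for a in alphas]
best_alpha = alphas[int(np.argmin(ridge_mse))]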
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
ridge = Ridge(alpha=best_alpha)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred)
# Helper to fit a model and return its test MSE; the enclosing function
# definition is missing from the excerpt, so the name below is assumed.
def evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return mean_squared_error(y_test, y_pred)
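The methods and mses lists passed to the bar plot are not defined in the excerpt; a plausible construction, assuming each approach's test MSE was stored in a correspondingly named variable (only mse_baseline, mse_lasso, and mse_ridge appear above; the other names are assumptions):
# Collect the results for the comparison plot (variable names other than
# mse_baseline, mse_lasso, and mse_ridge are hypothetical).
methods = ["All Features", "Forward", "Backward", "RFE", "Exhaustive", "LASSO", "Ridge"]
mses = [mse_baseline, mse_fs, mse_bs, mse_rfe, mse_efs, mse_lasso, mse_ridge]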
# Plot the MSE of each method as a bar chart
import seaborn as sns

plt.figure(figsize=(10, 5))
ax = sns.barplot(x=methods, y=mses)
LASSO (L1 Regularization) removes unimportant features by setting their coefficients to zero,
leading to a more efficient and optimized model. By selecting only the most relevant features,
LASSO reduces noise and prevents overfitting, which improves MSE.
The wrapper methods systematically evaluate feature subsets, ensuring that only relevant features remain. Since they rely on similar statistical evaluation, they tend to select similar subsets, leading to almost identical MSE values.
Keeping all features can introduce irrelevant or redundant ones, which may add noise and slightly reduce model performance. Feature selection methods remove these unnecessary features, leading to a small but noticeable improvement.
RFE removes features recursively, but its selection process might have eliminated some
important features, leading to higher error.
Ridge Regression (0.5518) is Better Than Using All Features, But Worse Than LASSO
Ridge (L2 Regularization) shrinks coefficients instead of removing features. It helps control
overfitting but does not eliminate irrelevant features, meaning it doesn’t reduce MSE as
effectively as LASSO.