PA DA2 - Merged
BCSE334L-Predictive Analytics
FALL SEM 2024-2025
Slot: E1+TE1
Submitted by-
Arnav Bahuguna
Reg - 21BCE3795
Q. Develop a comprehensive prediction model based on four machine learning techniques using a
real-time dataset of your choice. Your task includes the following components:
1. Modeling: Develop prediction models using some machine learning techniques of your choice.
(min 4 techniques)
2. Model Tuning: Discuss the tuning methods applied to optimize each regression model.
3. Model Validation: Validate the performance of your models using appropriate metrics. This should
include:
a. Split of data into training and testing sets.
b. Calculation of performance metrics such as Mean Absolute Error (MAE), Mean Squared Error
(MSE), Root Mean Squared Error (RMSE), and R-squared (R²) score.
c. Comparison of the performance of different models and selection of the best-performing
model.
4. Report: Compile a detailed report summarizing the entire process. Your report should include:
a. Introduction and objective of the prediction model.
b. Comprehensive details of the data preprocessing, modeling, tuning, and validation steps.
c. Interpretation of the results and insights gained from the models.
d. Conclusion and any potential future work or improvements.
Ensure that your assignment is well-structured, clearly written, and demonstrates a deep
understanding of regression techniques and their application to real-time datasets. Use high-quality
English and support your explanations with relevant references and citations where appropriate.
1. Introduction
In this project, we aim to develop a comprehensive prediction model using four machine learning
techniques on a real-time dataset. The dataset used is a housing dataset loaded from a CSV file
named Housing.csv. This dataset contains various features relevant to house pricing, such as the
number of bedrooms, bathrooms, square footage of living space, and more. The dataset is sourced
from Kaggle and is used for regression modeling to predict house prices. The objective is to
predict the target variable (house price, a continuous variable) by training and optimizing
several regression models.
The models evaluated include Ridge Regression, Decision Tree Regressor, Random Forest Regressor,
and Support Vector Regressor (SVR). The key goals of this report are:
• To implement and optimize at least four different machine learning models.
• To compare the performance of the models based on various metrics such as MAE, MSE,
RMSE, and R².
• To identify the best-performing model and discuss any potential improvements for future work.
2. Data Preprocessing
The dataset used for this project consists of features (independent variables) and a target variable
(price) to be predicted. Data preprocessing steps included:
1. Handling Missing Values: The dataset was checked for missing values; none were found (see df.isnull().sum() in the notebook), so no rows or columns needed to be dropped or imputed.
2. Scaling: We applied StandardScaler to standardize the features to have a mean of 0 and a
standard deviation of 1.
3. Train-Test Split: The dataset was split into 80% training and 20% testing sets to evaluate the
model performance on unseen data.
All the preprocessing steps were completed during the previous iteration of the project and are
therefore not described in detail in this document. The features were also standardized, which puts
extreme values on a comparable scale across columns; a brief sketch of the pipeline is given below.
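A minimal sketch of the preprocessing pipeline summarized above (the 80/20 split follows the report; the random_state value and the exact columns dropped are assumptions):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv('Housing.csv')
df = df.dropna()                              # no missing values in this dataset; kept as a safeguard

X = df.drop(['price', 'id', 'date'], axis=1)  # numeric feature columns (assumed selection)
y = df['price']                               # target: house price

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature

# 80% training / 20% testing split (random_state fixed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)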
3. Model Tuning (Regression)
Decision Tree Regressor
Impact of Tuning:
• The best combination found was max_depth = 10 and min_samples_split = 20, allowing the
tree to maintain a good balance between complexity and generalization.
Error Metrics | Default Hyperparameters | Tuned Hyperparameters
Mean Absolute Error (MAE) | 122431.539019124 | 110636.27198458016
Mean Squared Error (MSE) | 50792182369.69726 | 39237127498.66877
Root Mean Squared Error (RMSE) | 225371.21016158487 | 198083.63763488585
R Squared (R²) Score | 0.6421405678374152 | 0.6568032082982447
Random Forest Regressor
Impact of Tuning:
• The best combination was n_estimators = 200 and max_depth = 20, providing enough trees to
reduce variance and a controlled depth to prevent overfitting.
Error Metrics | Default Hyperparameters | Tuned Hyperparameters
Mean Absolute Error (MAE) | 86510.18535006807 | 86224.04226690672
Mean Squared Error (MSE) | 25090348717.306355 | 25156235320.69841
Root Mean Squared Error (RMSE) | 158399.3330709014 | 158607.1729799709
R Squared (R²) Score | 0.7740575447454721 | 0.7707744616996358
• Note that the tuned scores are nearly identical to the defaults (MAE improves marginally while
MSE, RMSE, and R² are marginally worse), indicating the default Random Forest configuration
was already close to optimal for this data.
Support Vector Regressor (SVR)
Impact of Tuning:
• The best combination found was C = 10 and kernel = 'linear'. The larger C relaxes the
regularization, and replacing the default RBF kernel with a linear one gave a much better
balance between bias and variance on this dataset, as the error metrics below show.
Error Metrics | Default Hyperparameters | Tuned Hyperparameters
Mean Absolute Error (MAE) | 222343.80394267367 | 137747.75623270954
Mean Squared Error (MSE) | 148191154279.18042 | 72231554707.12172
Root Mean Squared Error (RMSE) | 384956.04200892913 | 268759.28766671807
R Squared (R²) Score | -289701.7426570366 | -1.812122075633066
• The negative R² values indicate that the SVR predictions fit the test data worse than simply
predicting the mean price. Since R² = 1 - SS_res / SS_tot, it drops below 0 (outside the usual
[0, 1] range) whenever the residual sum of squares exceeds the total sum of squares, which is
what happens here for SVR.
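A toy example (hypothetical numbers, for illustration only) showing how R² becomes negative when predictions are worse than the mean baseline:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([400.0, 500.0, 600.0])   # every prediction is 300 too high
print(r2_score(y_true, y_pred))            # -12.5: far worse than predicting the mean (200)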
8. Huber Regression
• Purpose:
o Huber Regression is a robust regression method that combines the strengths of both L1
and L2 loss functions. It is especially useful when dealing with datasets containing
outliers.
o It minimizes squared errors for smaller residuals (like MSE) but uses absolute errors for
larger residuals (like MAE), providing resilience to outliers.
• Effect:
o Small residuals are treated with an MSE approach, encouraging the model to fit closely
to most data points.
o Large residuals are treated with MAE, reducing the influence of outliers on the model,
which minimizes their effect on the overall regression line.
o The Huber threshold parameter (delta; exposed as epsilon in scikit-learn) controls the
cut-off between the MSE and MAE regimes. A lower threshold treats more residuals as
outliers and makes the model more robust to them, while a higher threshold behaves
more like ordinary least squares (see the sketch after this list).
o Overall, Huber Regression provides a balance between robustness and accuracy,
making it ideal for datasets with a few extreme outliers.
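A minimal sketch of Huber Regression fitted on the train/test split from Section 2 (the epsilon and alpha values shown are illustrative defaults, not the settings used in the attached notebook):

from sklearn.linear_model import HuberRegressor

# epsilon is scikit-learn's name for the Huber threshold (delta above);
# smaller epsilon -> more residuals handled by the robust (absolute-error) branch.
huber = HuberRegressor(epsilon=1.35, alpha=0.0001, max_iter=1000)
huber.fit(X_train, y_train)
huber_preds = huber.predict(X_test)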
4. Modeling (Classification)
Now, the ‘price’ column has been categorized into four classes (Low, Medium, High, and Very High)
using quartiles. This splits the continuous price values into four roughly equal-sized groups based
on the distribution of the data, as sketched below.
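A minimal sketch of the quartile binning described above (the column name price_class and the use of pd.qcut are assumptions; the attached notebook stores the result in a dataframe called dfc):

import pandas as pd

# Bin the continuous price into four quartile-based classes.
dfc = df.copy()
dfc['price_class'] = pd.qcut(dfc['price'], q=4,
                             labels=['Low', 'Medium', 'High', 'Very High'])
dfc = dfc.drop('price', axis=1)   # the class label replaces the raw price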
1. Logistic Regression:
• Purpose: Logistic Regression is a simple and interpretable model that is commonly used for
binary and multiclass classification tasks. It models the probability that an instance belongs to
a particular class using a logistic (sigmoid) function, making it suitable for linearly separable
data.
• Effect: Logistic Regression tends to perform well when the relationship between the features
and the target class is linear. However, in more complex datasets with non-linear interactions
(such as this housing dataset, where price categories depend on both continuous and
categorical-like features), Logistic Regression can struggle to capture these complexities. It
relies heavily on well-separated classes, and in this case, its accuracy was lower than more
complex models like Random Forest.
5. Model Validation
To evaluate the performance of the models, we used the following metrics:
• Mean Absolute Error (MAE): The average of the absolute differences between predicted and
actual values. It measures how far the predictions are from the actual values on average.
Lower MAE indicates better model performance, with predictions closer to actual values.
• Mean Squared Error (MSE): The average of the squared differences between predicted and
actual values. It gives a larger penalty to larger errors because it squares the errors. Lower MSE
means better performance, but it is more sensitive to outliers than MAE.
• Root Mean Squared Error (RMSE): The square root of MSE, bringing the error back to the
original units of the target variable, making it more interpretable. Like MSE, a lower RMSE
indicates better performance, and it penalizes larger errors more heavily than MAE.
• R² Score: The proportion of the variance in the dependent variable that is predictable from the
independent variables. It indicates how well the model fits the data. R² typically ranges from 0 to
1, with values closer to 1 indicating a better fit (1 is a perfect fit); it can be negative when a model
fits the test data worse than simply predicting the mean, as seen for SVR above.
• Accuracy: Accuracy is one of the simplest and most commonly used metrics for evaluating
classification models. It measures the proportion of correctly classified instances out of the
total instances in the dataset.
• Classification Report: The classification report presents the main classification metrics on a
per-class basis. This gives deeper insight into classifier behavior than global accuracy alone,
which can mask weaknesses in an individual class of a multiclass problem. It includes
Precision, Recall, F1-Score, and Support.
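A minimal sketch of how these metrics can be computed with scikit-learn (variable names follow the attached notebook; y_test holds the regression targets in the first block and the class labels in the second, matching the two phases of the notebook):

import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, accuracy_score, classification_report)

# Regression metrics for the tuned Random Forest predictions
mae = mean_absolute_error(y_test, y_pred_rf)
mse = mean_squared_error(y_test, y_pred_rf)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_rf)
print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.4f}")

# Classification metrics for the Random Forest classifier predictions
print("Accuracy:", round(accuracy_score(y_test, rf_preds) * 100, 2))
print(classification_report(y_test, rf_preds))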
Link to ipynb
The .ipynb notebook is also attached below for reference.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('Housing.csv')
df.head()
[5 rows x 21 columns]
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 21613 non-null int64
1 date 21613 non-null object
2 price 21613 non-null float64
3 bedrooms 21613 non-null int64
4 bathrooms 21613 non-null float64
5 sqft_living 21613 non-null int64
6 sqft_lot 21613 non-null int64
7 floors 21613 non-null float64
8 waterfront 21613 non-null int64
9 view 21613 non-null int64
10 condition 21613 non-null int64
11 grade 21613 non-null int64
12 sqft_above 21613 non-null int64
13 sqft_basement 21613 non-null int64
14 yr_built 21613 non-null int64
15 yr_renovated 21613 non-null int64
16 zipcode 21613 non-null int64
17 lat 21613 non-null float64
18 long 21613 non-null float64
19 sqft_living15 21613 non-null int64
20 sqft_lot15 21613 non-null int64
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
df.isnull().sum()
id 0
price 0
bedrooms 0
bathrooms 0
sqft_living 0
sqft_lot 0
floors 0
waterfront 0
view 0
condition 0
grade 0
sqft_above 0
sqft_basement 0
yr_built 0
yr_renovated 0
zipcode 0
lat 0
long 0
sqft_living15 0
sqft_lot15 0
dtype: int64
df[df.columns].plot(kind='box', figsize=(20,10))
<Axes: >
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Standardize all numeric feature columns to zero mean and unit variance
ftransform = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
              'waterfront', 'view', 'condition', 'grade', 'sqft_above',
              'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
              'lat', 'long', 'sqft_living15', 'sqft_lot15']
df[ftransform] = scaler.fit_transform(df[ftransform])
df[df.columns].plot(kind='box', figsize=(20,10))
<Axes: >
# Drop the target and the non-feature columns (id and the raw date string) before PCA
features = df.drop(['price', 'id', 'date'], axis=1)
y = df['price']
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)          # keep components explaining 95% of the variance
pca_features = pca.fit_transform(features)
# 80/20 train-test split described in Section 2 (random_state assumed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    pca_features, y, test_size=0.2, random_state=42)
Linear Regression
from sklearn.linear_model import LinearRegression, Ridge
lin_model = LinearRegression()
lin_model.fit(X_train, y_train)
LinearRegression()
lin_pred = lin_model.predict(X_test)
rdg_model = Ridge()
rdg_model.fit(X_train, y_train)
Ridge()
rdg_preds = rdg_model.predict(X_test)
from sklearn.tree import DecisionTreeRegressor
dcst_model = DecisionTreeRegressor()
dcst_model.fit(X_train, y_train)
DecisionTreeRegressor()
dcst_preds = dcst_model.predict(X_test)
from sklearn.svm import SVR
svr_model = SVR()
svr_model.fit(X_train,y_train)
svr_preds = svr_model.predict(X_test)
MODEL TUNING
from sklearn.model_selection import GridSearchCV
grid_rdg_model = GridSearchCV(Ridge(), param_grid={'alpha': [0.01, 0.1, 1, 10, 100]},
                              cv=3, scoring='neg_mean_squared_error')
grid_rdg_model.fit(X_train, y_train)
GridSearchCV(cv=3, estimator=Ridge(),
             param_grid={'alpha': [0.01, 0.1, 1, 10, 100]},
             scoring='neg_mean_squared_error')
best_ridge = grid_rdg_model.best_estimator_
grid_rdg_preds = best_ridge.predict(X_test)
param_grid_dt = {
'max_depth': [5, 10, 20, None],
'min_samples_split': [2, 10, 20]
}
grid_dcst = GridSearchCV(dcst_model, param_grid_dt, cv=3,
scoring='neg_mean_squared_error')
grid_dcst.fit(X_train, y_train)
GridSearchCV(cv=3, estimator=DecisionTreeRegressor(),
param_grid={'max_depth': [5, 10, 20, None],
'min_samples_split': [2, 10, 20]},
scoring='neg_mean_squared_error')
best_dcst = grid_dcst.best_estimator_
y_pred_dcst = best_dcst.predict(X_test)
param_grid_rf = {
'n_estimators': [50, 100, 200],
'max_depth': [10, 20, None]
}
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=42)
grid_search_rf = GridSearchCV(rf, param_grid_rf, cv=3,
scoring='neg_mean_squared_error')
grid_search_rf.fit(X_train, y_train)
GridSearchCV(cv=3, estimator=RandomForestRegressor(random_state=42),
param_grid={'max_depth': [10, 20, None],
'n_estimators': [50, 100, 200]},
scoring='neg_mean_squared_error')
best_rf = grid_search_rf.best_estimator_
y_pred_rf = best_rf.predict(X_test)
param_grid_svr = {
'C': [0.1, 1, 10],
'kernel': ['linear', 'rbf']
}
svr = SVR()
grid_search_svr = GridSearchCV(svr, param_grid_svr, cv=3,
scoring='neg_mean_squared_error')
grid_search_svr.fit(X_train, y_train)
GridSearchCV(cv=3, estimator=SVR(),
param_grid={'C': [0.1, 1, 10], 'kernel': ['linear',
'rbf']},
scoring='neg_mean_squared_error')
best_svr = grid_search_svr.best_estimator_
y_pred_svr = best_svr.predict(X_test)
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train, y_train)
KNeighborsRegressor()
y_pred_knn = knn_reg.predict(X_test)
from sklearn.linear_model import ElasticNet
elastic_net_reg = ElasticNet()
elastic_net_reg.fit(X_train, y_train)
ElasticNet()
y_pred_elastic_net = elastic_net_reg.predict(X_test)
from sklearn.linear_model import BayesianRidge
bayesian_ridge_reg = BayesianRidge()
bayesian_ridge_reg.fit(X_train, y_train)
BayesianRidge()
y_pred_bayesian_ridge = bayesian_ridge_reg.predict(X_test)
from sklearn.linear_model import HuberRegressor
huber_reg = HuberRegressor()
huber_reg.fit(X_train, y_train)
HuberRegressor()
y_pred_huber = huber_reg.predict(X_test)
df.head()
# dfc: copy of df in which 'price' has been replaced by the quartile-based class
# labels described in Section 4 (its construction is omitted in this extract).
dfc.head()
# Classification phase: re-run PCA on the scaled features and split again, this time
# with the quartile class labels as the target (column name and split parameters assumed).
pca = PCA(n_components=0.95)
pca_features = pca.fit_transform(features)
X_train, X_test, y_train, y_test = train_test_split(
    pca_features, dfc['price_class'], test_size=0.2, random_state=42)
from sklearn.linear_model import LogisticRegression
log_c = LogisticRegression(max_iter=1000)
log_c.fit(X_train, y_train)
log_preds = log_c.predict(X_test)
print("Accuracy: ", round(accuracy_score(y_test, log_preds)*100, 2))
print("Classification Report: ", classification_report(y_test,
log_preds))
Accuracy: 64.57
Classification Report: precision recall f1-score support
from sklearn.ensemble import RandomForestClassifier
rf_c = RandomForestClassifier()
rf_c.fit(X_train, y_train)
rf_preds = rf_c.predict(X_test)
print("Accuracy: ", round(accuracy_score(y_test, rf_preds)*100, 2))
print("Classification Report: ", classification_report(y_test, rf_preds))
Accuracy: 72.62
Classification Report: precision recall f1-score support
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
svc_pred = svc.predict(X_test)
print("Accuracy: ", round(accuracy_score(y_test, svc_pred)*100, 2))
print("Classification Report: ", classification_report(y_test,
svc_pred))
Accuracy: 72.39
Classification Report: precision recall f1-score support
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)
print("Accuracy: ", round(accuracy_score(y_test, knn_pred)*100, 2))
print("Classification Report: ", classification_report(y_test, knn_pred))
Accuracy: 70.19
Classification Report: precision recall f1-score support