
DIGITAL ASSIGNMENT 2

BCSE334L-Predictive Analytics
FALL SEM 2024-2025
Slot: E1+TE1

Submitted by-

Arnav Bahuguna

Reg - 21BCE3795
Q. Develop a comprehensive prediction model based on four machine learning techniques using a
real-time dataset of your choice. Your task includes the following components:
1. Modeling: Develop prediction models using some machine learning techniques of your choice.
(min 4 techniques)
2. Model Tuning: Discuss the tuning methods applied to optimize each regression model.
3. Model Validation: Validate the performance of your models using appropriate metrics. This should
include:
a. Split of data into training and testing sets.
b. Calculation of performance metrics such as Mean Absolute Error (MAE), Mean Squared Error
(MSE), Root Mean Squared Error (RMSE), and R-squared (R²) score.
c. Comparison of the performance of different models and selection of the best-performing
model.
4. Report: Compile a detailed report summarizing the entire process. Your report should include:
a. Introduction and objective of the prediction model.
b. Comprehensive details of the data preprocessing, modeling, tuning, and validation steps.
c. Interpretation of the results and insights gained from the models.
d. Conclusion and any potential future work or improvements.

Ensure that your assignment is well-structured, clearly written, and demonstrates a deep
understanding of regression techniques and their application to real-time datasets. Use high-quality
English and support your explanations with relevant references and citations where appropriate.

1. Introduction
In this project, we aim to develop a comprehensive prediction model using four machine learning
techniques on a real-time dataset. The dataset used is a housing dataset loaded from a CSV file
named Housing.csv. This dataset contains various features relevant to house pricing, such as the
number of bedrooms, bathrooms, square footage of living space, and more. The dataset is sourced
from Kaggle and is used here for regression, i.e., to predict house prices. The objective is to predict the target variable (the house price, a continuous value) by training and optimizing several regression models.
The models evaluated include Ridge Regression, Decision Tree Regressor, Random Forest Regressor,
and Support Vector Regressor (SVR). The key goals of this report are:
• To implement and optimize at least four different machine learning models.
• To compare the performance of the models based on various metrics such as MAE, MSE,
RMSE, and R².
• To identify the best-performing model and discuss any potential improvements for future work.
2. Data Preprocessing
The dataset used for this project consists of features (independent variables) and a target variable
(price) to be predicted. Data preprocessing steps included:
1. Handling Missing Values: Rows or columns with missing data were dropped or imputed.
2. Scaling: We applied StandardScaler to standardize the features to have a mean of 0 and a
standard deviation of 1.
3. Train-Test Split: The dataset was split into 80% training and 20% testing sets to evaluate the
model performance on unseen data.

All the preprocessing steps were completed during the previous iteration of the project and are therefore not repeated in detail in this document. The data was also scaled, which reduces the influence of extreme values on scale-sensitive models.

3. Modeling and Tuning


We have used GridSearchCV which is a tool from the scikit-learn library used for hyperparameter
tuning in machine learning. It essentially automates the process of finding the optimal combination of
hyperparameters for a given machine learning model.
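As a minimal sketch of how GridSearchCV is used (assuming X_train and y_train from the train-test split described above; the exact grids for each model are given in the sections that follow):

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Illustrative grid; each model section below lists its actual grid.
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(Ridge(), param_grid, cv=3,
                      scoring='neg_mean_squared_error')
search.fit(X_train, y_train)         # cross-validates every candidate alpha on the training set
print(search.best_params_)           # best hyperparameter combination found
best_model = search.best_estimator_  # estimator refit on the full training set with the best params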
The following machine learning models were used in this project:

1. Ridge Regression Tuning


Hyperparameter: alpha
• Purpose: The alpha parameter in Ridge Regression controls the strength of L2 regularization, which penalizes large coefficient values, preventing overfitting by shrinking the coefficients.
• Effect:
o Low alpha (close to 0): The model behaves similarly to standard linear regression, with little or no regularization. This can result in overfitting, especially when the model is complex or has many features.
o High alpha: Large alpha values impose a greater penalty on the coefficients, making the model simpler by reducing the impact of less important features. However, this can also lead to underfitting if the model becomes too constrained.
Tuning Process:
• We used GridSearchCV to search for the best alpha value over a predefined range ([0.01, 0.1,
1, 10, 100]).
• Why GridSearchCV? It automatically evaluates the performance of the model for each alpha
and finds the best one based on cross-validation.
Impact of Tuning:
• The cross-validated optimum for Ridge Regression was alpha = 100. On the test set this improved MAE only marginally and left MSE, RMSE, and R² essentially unchanged, indicating that the regularization strength has little effect on this PCA-reduced feature set.
Error Metrics                     Default Hyperparameters    Tuned Hyperparameters
Mean Absolute Error (MAE)         127247.01883052982         127047.49797298762
Mean Squared Error (MSE)          42340277513.86764          42344471977.561775
Root Mean Squared Error (RMSE)    205767.5327010256          205777.72468749326
R Squared Score                   0.5501403299752385         0.5474111916440647

2. Decision Tree Regressor Tuning


Hyperparameters: max_depth, min_samples_split
• Purpose:
o max_depth: This controls the maximum depth of the decision tree. Limiting the depth
can prevent the tree from overfitting by stopping it from becoming overly complex.
o min_samples_split: This defines the minimum number of samples required to split an
internal node. It prevents the model from making splits that are based on too few
samples, which can lead to overfitting.
• Effect:
o Low max_depth: Limits the complexity of the model, reducing the chance of overfitting
but potentially increasing bias (underfitting).
o High max_depth: Allows the tree to grow deeper, capturing more information from the
data, but risks overfitting.
o Low min_samples_split: Encourages more splits, leading to a highly flexible model that
can overfit.
o High min_samples_split: Prevents splits when there are too few samples, resulting in a
more generalized model but possibly underfitting.
Tuning Process:
• We tuned both max_depth and min_samples_split over a range of values ([5, 10, 20] for
max_depth and [2, 10, 20] for min_samples_split).

Impact of Tuning:
• The best combination found was max_depth = 10 and min_samples_split = 20, allowing the
tree to maintain a good balance between complexity and generalization.
Error Metrics                     Default Hyperparameters    Tuned Hyperparameters
Mean Absolute Error (MAE)         122431.539019124           110636.27198458016
Mean Squared Error (MSE)          50792182369.69726          39237127498.66877
Root Mean Squared Error (RMSE)    225371.21016158487         198083.63763488585
R Squared Score                   0.6421405678374152         0.6568032082982447

3. Random Forest Regressor Tuning


Hyperparameters: n_estimators, max_depth
• Purpose:
o n_estimators: This controls the number of trees in the forest. More trees generally
improve performance by reducing variance, but beyond a certain point, the
improvement becomes marginal and increases computation time.
o max_depth: Controls the depth of individual trees in the forest. It helps control the trade-off between bias and variance.
• Effect:
o Low n_estimators: Fewer trees lead to a less stable model, increasing variance, while
too few trees may underfit.
o High n_estimators: More trees increase robustness, but after a certain point, the
benefit diminishes.
o Low max_depth: Shallow trees prevent overfitting but may cause the model to underfit.
o High max_depth: Deep trees allow the model to capture more data patterns but can
lead to overfitting.
Tuning Process:
• We used GridSearchCV to tune n_estimators ([50, 100, 200]) and max_depth ([10, 20, None]).

Impact of Tuning:
• The best cross-validated combination was n_estimators = 200 and max_depth = 20. Test-set performance is essentially the same as the default Random Forest (which already uses 100 trees with unrestricted depth): MAE improves slightly while MSE, RMSE, and R² are marginally worse, so the main benefit of tuning here is a controlled depth rather than a clear accuracy gain.
Error Metrics                     Default Hyperparameters    Tuned Hyperparameters
Mean Absolute Error (MAE)         86510.18535006807          86224.04226690672
Mean Squared Error (MSE)          25090348717.306355         25156235320.69841
Root Mean Squared Error (RMSE)    158399.3330709014          158607.1729799709
R Squared Score                   0.7740575447454721         0.7707744616996358

4. Support Vector Regressor (SVR) Tuning


Hyperparameters: C, kernel
• Purpose:
o C: This controls the penalty applied to training errors that fall outside the epsilon-insensitive margin. A higher C lets the model fit the training data more tightly, potentially reducing bias but increasing variance.
o kernel: This defines how the data is mapped into higher dimensions. Common options are linear (for approximately linear relationships) and rbf (for more complex relationships).
• Effect:
o Low C: Results in a simpler model with higher bias and less variance, but it can underfit
the data.
o High C: Allows the model to fit more closely to the training data, reducing bias but
potentially leading to overfitting.
o linear kernel: Works well when the relationship is approximately linear but may struggle with more complex data patterns.
o rbf kernel: Effective for capturing nonlinear relationships in the data but requires tuning
of additional parameters like gamma (not done here).
Tuning Process:
• We used GridSearchCV to tune C ([0.1, 1, 10]) and the kernel type (['linear', 'rbf']).

Impact of Tuning:
• The best combination found was C = 10 and kernel = ‘linear’. The linear kernel outperformed rbf in cross-validation, suggesting that, after scaling and PCA, the relationship between the features and price is captured reasonably well by a linear function, while the larger C allows a closer fit to the training data.
Error Metrics                     Default Hyperparameters    Tuned Hyperparameters
Mean Absolute Error (MAE)         222343.80394267367         137747.75623270954
Mean Squared Error (MSE)          148191154279.18042         72231554707.12172
Root Mean Squared Error (RMSE)    384956.04200892913         268759.28766671807
R Squared Score                   -289701.7426570366         -1.812122075633066

• A note on the SVR R² values: R² is not bounded below by 0; it becomes negative whenever a model predicts worse than simply predicting the mean of the target, which is why the poorly fitting default SVR yields a value far outside [0, 1]. In addition, scikit-learn's r2_score expects its arguments in the order (y_true, y_pred); the notebook passes the predictions first, which leaves the symmetric MAE/MSE/RMSE unchanged but further distorts the reported R² values.

5. K-Nearest Neighbors Regression (KNN)


• Purpose:
o KNN is a non-parametric algorithm used for regression (and classification) tasks. It
predicts the target variable by averaging the values of the k nearest data points in the
feature space.
o Ideal for datasets with non-linear relationships and lower-dimensional spaces.
• Effect:
o The simplicity of KNN allows it to capture patterns without assuming any specific
functional form for the data. However, it is sensitive to the choice of k, the number of
neighbors, and can struggle with high-dimensional data due to the curse of
dimensionality.
o Lower k (e.g., 1 or 3) results in a more responsive model with higher variance (risk of
overfitting).
o Higher k averages over more neighbors, leading to a smoother model that may underfit
if k is too large.
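The notebook fits KNeighborsRegressor with its default k = 5; the sketch below shows how the choice of k described above could be evaluated (the candidate values of k are hypothetical, and X_train and y_train come from the earlier split):

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

for k in [1, 3, 5, 10, 25]:                       # hypothetical candidate values of k
    knn = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=3,
                             scoring='neg_mean_squared_error')
    print(k, -scores.mean())                      # lower mean MSE indicates a better choice of k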

6. Elastic Net Regression


• Purpose:
o Elastic Net combines both L1 (Lasso) and L2 (Ridge) regularization, making it effective for feature selection and controlling overfitting.
o It’s useful for high-dimensional datasets or data with correlated features where both L1 and L2 regularization techniques may benefit the model.
• Effect:
o The combination of L1 and L2 regularization can improve the model’s robustness by shrinking coefficients and selectively removing irrelevant features, enhancing interpretability and performance.
o The L1 component (controlled by l1_ratio) can drive coefficients of less important features to zero, essentially performing feature selection.
o The L2 component adds stability, shrinking coefficients without setting them to zero, thus preventing high variance.
o The balance between L1 and L2 regularization is crucial. Higher L1 encourages feature selection, while higher L2 prevents the model from becoming overly sparse.
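The notebook fits ElasticNet with its defaults (alpha = 1.0, l1_ratio = 0.5); the following is a small sketch of how l1_ratio trades off the two penalties, with illustrative (hypothetical) values:

from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=0.1, l1_ratio=0.7)   # 70% L1 / 30% L2 penalty mix (illustrative values)
enet.fit(X_train, y_train)
n_zero = (enet.coef_ == 0).sum()             # coefficients driven exactly to zero by the L1 part
print(n_zero, "features effectively removed by the L1 component")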
7. Bayesian Ridge Regression
• Purpose:
o Bayesian Ridge is a probabilistic approach to linear regression, incorporating prior probability distributions over the coefficients. This regularization approach is particularly useful for cases with limited or uncertain data.
o Effective when we want a model that captures uncertainty in predictions, and it’s also robust to multicollinearity (correlated features).
• Effect:
o By modeling coefficients as random variables, Bayesian Ridge provides a distribution of possible values for the target variable rather than a single-point estimate. This can be beneficial when we want confidence intervals for predictions.
o The model’s prior assumptions about coefficients provide natural regularization,
reducing overfitting, especially when data is scarce.
o Stronger prior distributions yield smoother models, ideal for small datasets but may
cause underfitting if too restrictive.
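A minimal sketch of the predictive uncertainty mentioned above (the notebook reports only point predictions); BayesianRidge.predict can return the posterior standard deviation alongside the mean:

from sklearn.linear_model import BayesianRidge

br = BayesianRidge()
br.fit(X_train, y_train)
mean_pred, std_pred = br.predict(X_test, return_std=True)  # predictive mean and standard deviation
print(mean_pred[:3])   # point estimates for the first three test houses
print(std_pred[:3])    # corresponding uncertainty, usable for rough confidence intervals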

8. Huber Regression
• Purpose:
o Huber Regression is a robust regression method that combines the strengths of both L1
and L2 loss functions. It is especially useful when dealing with datasets containing
outliers.
o It minimizes squared errors for smaller residuals (like MSE) but uses absolute errors for
larger residuals (like MAE), providing resilience to outliers.
• Effect:
o Small residuals are treated with an MSE approach, encouraging the model to fit closely to most data points.
o Large residuals are treated with MAE, reducing the influence of outliers on the model, which minimizes their effect on the overall regression line.
o The Huber threshold parameter (delta in the literature, epsilon in scikit-learn’s HuberRegressor) controls where the loss switches from squared to absolute error. A lower threshold treats more residuals with the absolute-error loss and so makes the model less sensitive to outliers, while a higher threshold behaves more like ordinary least squares and is more sensitive to them.
o Overall, Huber Regression provides a balance between robustness and accuracy,
making it ideal for datasets with a few extreme outliers.
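The notebook uses HuberRegressor with its default threshold (epsilon = 1.35 in scikit-learn); below is a sketch of how the threshold could be varied, with hypothetical values:

from sklearn.linear_model import HuberRegressor

for eps in [1.1, 1.35, 2.0]:                 # hypothetical thresholds; scikit-learn requires epsilon > 1.0
    hub = HuberRegressor(epsilon=eps, max_iter=1000)
    hub.fit(X_train, y_train)
    print(eps, hub.score(X_test, y_test))    # R² on the test set for each threshold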

4. Modeling (Classification)
Next, we categorized the ‘price’ column into four classes—Low, Medium, High, and Very High—using quartiles. This method splits the continuous price values into four roughly equally sized groups based on the distribution of the data.
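A minimal sketch of the quartile binning (the full notebook cell appears in the appendix; dfc is the housing DataFrame):

import pandas as pd

dfc['price_category'] = pd.qcut(dfc['price'], q=4,
                                labels=['Low', 'Medium', 'High', 'Very High'])
print(dfc['price_category'].value_counts())  # roughly 25% of the rows fall into each class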

1. Logistic Regression:
• Purpose: Logistic Regression is a simple and interpretable model that is commonly used for
binary and multiclass classification tasks. It models the probability that an instance belongs to
a particular class using a logistic (sigmoid) function, making it suitable for linearly separable
data.
• Effect: Logistic Regression tends to perform well when the relationship between the features and the target class is linear. However, in more complex datasets with non-linear interactions (such as this housing dataset, where price categories depend on both continuous and categorical-like features), Logistic Regression can struggle to capture these complexities. It relies heavily on well-separated classes, and in this case its accuracy was lower than that of more complex models like Random Forest.

2. Random Forest Classifier:


• Purpose: The Random Forest Classifier is an ensemble learning model that builds multiple
decision trees and combines their outputs to make predictions. It is particularly useful for
handling complex datasets with non-linear relationships and interactions between features, as
it automatically detects these patterns without requiring extensive data preprocessing or
feature engineering.
• Effect: In this project, the Random Forest Classifier was the most effective model, achieving
the highest accuracy. Its ability to handle both numerical and categorical-like features (e.g.,
waterfront, view, zipcode), as well as complex feature interactions, gave it a distinct advantage.
It also reduces overfitting compared to single decision trees by averaging the predictions of
multiple trees, making it robust and capable of generalizing well to unseen data.
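As a hedged illustration of how the Random Forest's view of feature relevance could be inspected, the sketch below fits on the original scaled features rather than the PCA components used in the notebook (a hypothetical variant, since importances on components are not tied to named columns):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf_c = RandomForestClassifier(random_state=42)
rf_c.fit(features, y)                          # features/y as defined before the PCA step
importances = pd.Series(rf_c.feature_importances_,
                        index=features.columns).sort_values(ascending=False)
print(importances.head(10))                    # ten most influential columns for price_category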

3. Support Vector Machine (SVM):


• Purpose: SVM is a powerful classification algorithm that works by finding the hyperplane that
best separates data points into different classes. It can handle both linear and non-linear
classification by using a kernel trick (e.g., radial basis function, RBF) to map the data into
higher dimensions where it can be separated more easily.
• Effect: SVM performed well on this dataset, but not as effectively as Random Forest. While
SVM is good at handling complex decision boundaries, it is sensitive to feature scaling and may
struggle with very large datasets or noisy data. Additionally, SVM does not natively handle
multiclass classification (like this dataset’s four price categories) as easily as Random Forest
does, requiring additional techniques like one-vs-rest. Its accuracy was lower than Random
Forest but still competitive due to its ability to model non-linear relationships.
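A brief sketch of the explicit one-vs-rest strategy mentioned above (scikit-learn's SVC also handles multiclass problems internally via one-vs-one, so this wrapper is optional):

from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

ovr_svm = OneVsRestClassifier(SVC(kernel='rbf', C=1.0))  # one binary SVM per price category
ovr_svm.fit(X_train, y_train)
print(ovr_svm.score(X_test, y_test))                     # mean accuracy on the held-out set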

4. k-Nearest Neighbors (kNN):


• Purpose: kNN is a simple, instance-based learning algorithm that classifies new data points
based on the majority class of the nearest neighbors. It is a non-parametric model, meaning it
makes no assumptions about the underlying data distribution and works by "memorizing" the
training set and using it to classify new examples based on similarity.
• Effect: kNN tends to work well when the dataset has well-separated classes and the distance
metric (such as Euclidean distance) is meaningful for classification. However, in this case, with
mixed feature types (both continuous and categorical-like features) and potentially overlapping
classes, kNN did not perform as well as Random Forest or SVM. It also tends to be slower for
large datasets because it needs to compute distances to all training points for each new
prediction. In this housing dataset, its accuracy was lower, reflecting these challenges.

5. Model Validation
To evaluate the performance of the models, we used the following metrics:
• Mean Absolute Error (MAE): The average of the absolute differences between predicted and
actual values. It measures how far the predictions are from the actual values on average.
Lower MAE indicates better model performance, with predictions closer to actual values.
• Mean Squared Error (MSE): The average of the squared differences between predicted and
actual values. It gives a larger penalty to larger errors because it squares the errors. Lower MSE
means better performance, but it is more sensitive to outliers than MAE.
• Root Mean Squared Error (RMSE): The square root of MSE, bringing the error back to the
original units of the target variable, making it more interpretable. Like MSE, a lower RMSE
indicates better performance, and it penalizes larger errors more heavily than MAE.
• R² Score: The proportion of the variance in the dependent variable that is predictable from the
independent variables. It indicates how well the model fits the data. R² is at most 1: a value closer to 1 indicates a better fit, with 1 meaning a perfect fit, while values at or below 0 indicate a model no better than predicting the mean.
• Accuracy: Accuracy is one of the simplest and most commonly used metrics for evaluating
classification models. It measures the proportion of correctly classified instances out of the
total instances in the dataset.
• Classification Report: The classification report shows a representation of the main
classification metrics on a per-class basis. This gives a deeper intuition of the classifier
behavior over global accuracy which can mask functional weaknesses in one class of a
multiclass problem. It includes metrics: Precision, Recall, F1-Score, Support.
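For reference, the regression metrics above can be wrapped in one helper (a sketch; the appendix computes them inline for each model, and note that scikit-learn expects the true values as the first argument):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Print the four regression metrics used in this report."""
    mse = mean_squared_error(y_true, y_pred)
    print("MAE :", mean_absolute_error(y_true, y_pred))
    print("MSE :", mse)
    print("RMSE:", np.sqrt(mse))
    print("R²  :", r2_score(y_true, y_pred))   # y_true first, then y_pred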

6. Conclusion and Future Work


o In this project, we used a variety of regression models to predict housing prices based on features
such as the number of bedrooms, bathrooms, square footage of the living area and lot, the
condition and grade of the house, and its geographic location (latitude and longitude). After
comparing multiple models, including Linear Regression, Decision Tree Regressor, and Random
Forest Regressor, the Random Forest Regressor emerged as the best-performing model.
o The Random Forest Regressor excelled because it effectively captured the complex relationships
between the features and the target variable (price). For instance, the dataset contains both
numeric features (like square footage and number of bedrooms) and categorical-like features
(such as waterfront and view), which influence the price in nonlinear ways. Simple linear models,
such as Linear Regression, struggled to capture these interactions and dependencies because
they assume linear relationships between features and the target variable.
o Random Forest, however, uses an ensemble of decision trees, which allows it to:
1. Handle Nonlinear Relationships: Features like the number of bedrooms or the presence
of a waterfront may not have a linear impact on price. For instance, a house with a
waterfront may have a disproportionate effect on price compared to a slight increase in
square footage. Random Forests are well-equipped to model these types of nonlinear
interactions.
2. Capture Feature Interactions: The model can automatically identify interactions between
features. For instance, the impact of sqft_living on price may differ depending on the
zipcode. Random Forest's structure of multiple decision trees captures these interactions
without requiring explicit feature engineering.
3. Robustness Against Noise and Overfitting: While Decision Trees can overfit the training
data, Random Forest mitigates this by averaging the predictions of multiple trees built on
random subsets of the data. This ensures that the model generalizes well to unseen data,
as it reduces the variance typically seen in single decision tree models.
o In this dataset, where both numerical and categorical-like features have varying importance
depending on local factors (e.g., location, house quality), Random Forest's flexibility and ability to
aggregate multiple perspectives (trees) gave it a significant edge over simpler models.
o In the same way, the Random Forest Classifier emerged as the best-performing classification model, achieving the highest accuracy when predicting the ‘price_category’ column, which is a discrete (categorical) target.

Future Work and Potential Improvements


To further enhance the model’s predictive power and accuracy, we can implement the following:
1. Exploring Advanced Ensemble Techniques
o While Random Forest performed well, more advanced ensemble methods like Gradient
Boosting Machines (GBM) or XGBoost could yield additional gains. These models
typically outperform Random Forest in structured data scenarios, as they sequentially
build trees to correct errors from previous iterations, capturing even more complex
patterns in the data.
2. Enhanced Feature Engineering and Selection
o Additional feature engineering could reveal more meaningful patterns. For instance,
creating interaction terms, binning, or applying polynomial transformations might
capture non-linear relationships not initially apparent. Testing feature importance
across different models could also help identify the most predictive features, allowing
for a more streamlined and accurate model.
3. Refining Hyperparameter Tuning
o Although we used GridSearchCV, experimenting with a more refined grid of parameters
could yield further performance improvements. Additionally, techniques like
RandomizedSearchCV or Bayesian Optimization might expedite the tuning process
while still providing optimal parameters, especially if model training times are extensive.
4. Applying Cross-Validation for Robustness
o Implementing k-fold cross-validation would provide a more robust estimate of model performance by ensuring that the model’s accuracy is consistent across multiple data splits. This would reduce the likelihood of overfitting to a single train-test split, thereby enhancing generalization (a brief sketch illustrating points 3 and 4 appears after this list).
5. Testing on Diverse Datasets
o Finally, applying the model to different datasets or expanding the housing dataset to
include additional regions could test the model’s robustness and generalizability. A
broader dataset could reveal regional price differences and other factors, adding depth
to the model and potentially leading to additional insights.
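As a hedged illustration of points 3 and 4 above, the sketch below combines RandomizedSearchCV with 5-fold cross-validation for the Random Forest; the parameter ranges are hypothetical, not the ones used in this assignment:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

param_dist = {'n_estimators': [100, 200, 400],          # hypothetical search space
              'max_depth': [10, 20, 30, None],
              'min_samples_split': [2, 5, 10]}
search = RandomizedSearchCV(RandomForestRegressor(random_state=42), param_dist,
                            n_iter=10, cv=5, scoring='neg_mean_squared_error',
                            random_state=42)
search.fit(X_train, y_train)

# k-fold cross-validation of the selected model for a more robust performance estimate.
cv_r2 = cross_val_score(search.best_estimator_, X_train, y_train, cv=5, scoring='r2')
print(search.best_params_, cv_r2.mean())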

Link to ipynb
The .ipynb notebook is also attached below for reference.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Housing.csv')
df.head()

           id             date     price  bedrooms  bathrooms  sqft_living  \
0  7229300521  20141013T000000  231300.0         2       1.00         1180
1  6414100192  20141209T000000  538000.0         3       2.25         2570
2  5631500400  20150225T000000  180000.0         2       1.00          770
3  2487200875  20141209T000000  604000.0         4       3.00         1960
4  1954400510  20150218T000000  510000.0         3       2.00         1680

   sqft_lot  floors  waterfront  view  ...  grade  sqft_above  sqft_basement  \
0      5650     1.0           0     0  ...      7        1180              0
1      7242     2.0           0     0  ...      7        2170            400
2     10000     1.0           0     0  ...      6         770              0
3      5000     1.0           0     0  ...      7        1050            910
4      8080     1.0           0     0  ...      8        1680              0

   yr_built  yr_renovated  zipcode      lat     long  sqft_living15  sqft_lot15
0      1955             0    98178  47.5112 -122.257           1340        5650
1      1951          1991    98125  47.7210 -122.319           1690        7639
2      1933             0    98028  47.7379 -122.233           2720        8062
3      1965             0    98136  47.5208 -122.393           1360        5000
4      1987             0    98074  47.6168 -122.045           1800        7503

[5 rows x 21 columns]

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 21613 non-null int64
1 date 21613 non-null object
2 price 21613 non-null float64
3 bedrooms 21613 non-null int64
4 bathrooms 21613 non-null float64
5 sqft_living 21613 non-null int64
6 sqft_lot 21613 non-null int64
7 floors 21613 non-null float64
8 waterfront 21613 non-null int64
9 view 21613 non-null int64
10 condition 21613 non-null int64
11 grade 21613 non-null int64
12 sqft_above 21613 non-null int64
13 sqft_basement 21613 non-null int64
14 yr_built 21613 non-null int64
15 yr_renovated 21613 non-null int64
16 zipcode 21613 non-null int64
17 lat 21613 non-null float64
18 long 21613 non-null float64
19 sqft_living15 21613 non-null int64
20 sqft_lot15 21613 non-null int64
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB

df.drop('date', axis=1, inplace=True)

df.isnull().sum()

id 0
price 0
bedrooms 0
bathrooms 0
sqft_living 0
sqft_lot 0
floors 0
waterfront 0
view 0
condition 0
grade 0
sqft_above 0
sqft_basement 0
yr_built 0
yr_renovated 0
zipcode 0
lat 0
long 0
sqft_living15 0
sqft_lot15 0
dtype: int64

df.drop('id', axis=1, inplace = True)

df[df.columns].plot(kind='box', figsize=(20,10))

<Axes: >

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
ftransform = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
'floors', 'waterfront', 'view', 'condition',
'grade',
'sqft_above', 'sqft_basement', 'yr_built',
'yr_renovated',
'zipcode', 'lat', 'long', 'sqft_living15',
'sqft_lot15']

df[ftransform] = scaler.fit_transform(df[ftransform])
df[df.columns].plot(kind='box', figsize=(20,10))

<Axes: >
features = df.drop('price', axis=1)
y = df['price']

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
pca_features = pca.fit_transform(features)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(pca_features, y,


test_size=0.3, random_state=101)

Linear Regression

from sklearn.linear_model import LinearRegression

lin_model = LinearRegression()
lin_model.fit(X_train, y_train)

LinearRegression()

lin_pred = lin_model.predict(X_test)

from sklearn.metrics import mean_absolute_error, mean_squared_error, classification_report, r2_score

print("Mean Absolute error: ", mean_absolute_error(lin_pred, y_test))


print("Mean Squared error: ", mean_squared_error(lin_pred, y_test))
print("Root Mean Squared error: ",
np.sqrt(mean_squared_error(lin_pred, y_test)))
print("R-squared score: ", r2_score(lin_pred, y_test))

Mean Absolute error: 127249.10489453698


Mean Squared error: 42340254626.76828
Root Mean Squared error: 205767.4770870467
R-squared score: 0.5501677966367371

from sklearn.linear_model import Ridge

rdg_model = Ridge()
rdg_model.fit(X_train, y_train)

Ridge()

rdg_preds = rdg_model.predict(X_test)

print("Mean Absolute error: ", mean_absolute_error(rdg_preds, y_test))


print("Mean Squared error: ", mean_squared_error(rdg_preds, y_test))
print("Root Mean Squared error: ",
np.sqrt(mean_squared_error(rdg_preds, y_test)))
print("R-squared score: ", r2_score(rdg_preds, y_test))

Mean Absolute error: 127247.01883052982


Mean Squared error: 42340277513.86764
Root Mean Squared error: 205767.5327010256
R-squared score: 0.5501403299752385

from sklearn.tree import DecisionTreeRegressor

dcst_model = DecisionTreeRegressor()
dcst_model.fit(X_train, y_train)

DecisionTreeRegressor()

dcst_preds = dcst_model.predict(X_test)

print("Mean Absolute error: ", mean_absolute_error(dcst_preds,


y_test))
print("Mean Squared error: ", mean_squared_error(dcst_preds, y_test))
print("Root Mean Squared error: ",
np.sqrt(mean_squared_error(dcst_preds, y_test)))
print("R-squared score: ", r2_score(dcst_preds, y_test))

Mean Absolute error: 120855.29673041332


Mean Squared error: 49089921510.92567
Root Mean Squared error: 221562.45510222545
R-squared score: 0.6344749960736829

from sklearn.ensemble import RandomForestRegressor


rf_model = RandomForestRegressor()
rf_model.fit(X_train,y_train)
rf_preds = rf_model.predict(X_test)

print("Mean Absolute error: ", mean_absolute_error(rf_preds, y_test))


print("Mean Squared error: ", mean_squared_error(rf_preds, y_test))
print("Root Mean Squared error: ",
np.sqrt(mean_squared_error(rf_preds, y_test)))
print("R-squared score: ", r2_score(rf_preds, y_test))

Mean Absolute error: 86746.69128582078


Mean Squared error: 24859027200.217564
Root Mean Squared error: 157667.4576449356
R-squared score: 0.7762662337723687

from sklearn.svm import SVR

svr_model = SVR()
svr_model.fit(X_train,y_train)
svr_preds = svr_model.predict(X_test)

print("Mean Absolute error: ", mean_absolute_error(svr_preds, y_test))


print("Mean Squared error: ", mean_squared_error(svr_preds, y_test))
print("Root Mean Squared error: ",
np.sqrt(mean_squared_error(svr_preds, y_test)))
print("R-squared score: ", r2_score(svr_preds, y_test))

Mean Absolute error: 222343.80394267367


Mean Squared error: 148191154279.18042
Root Mean Squared error: 384956.04200892913
R-squared score: -289701.7426570366

MODEL TUNING
from sklearn.model_selection import GridSearchCV

param_grid_ridge = {'alpha': [0.01, 0.1, 1, 10, 100]}


grid_rdg_model = GridSearchCV(rdg_model, param_grid_ridge, cv=3,
scoring='neg_mean_squared_error')
grid_rdg_model.fit(X_train, y_train)

GridSearchCV(cv=3, estimator=Ridge(),
param_grid={'alpha': [0.01, 0.1, 1, 10, 100]},
scoring='neg_mean_squared_error')

best_ridge = grid_rdg_model.best_estimator_
grid_rdg_preds = best_ridge.predict(X_test)

print(f"Best Ridge Alpha:",grid_rdg_model.best_params_['alpha'])


print("Mean Absolute error: ", mean_absolute_error(grid_rdg_preds,
y_test))
print("Mean Squared error: ", mean_squared_error(grid_rdg_preds,
y_test))
print("Root Mean Squared error: ",
np.sqrt(mean_squared_error(grid_rdg_preds, y_test)))
print("R-squared score: ", r2_score(grid_rdg_preds, y_test))

Best Ridge Alpha: 100


Mean Absolute error: 127047.49797298762
Mean Squared error: 42344471977.561775
Root Mean Squared error: 205777.72468749326
R-squared score: 0.5474111916440647

param_grid_dt = {
'max_depth': [5, 10, 20, None],
'min_samples_split': [2, 10, 20]
}
grid_dcst = GridSearchCV(dcst_model, param_grid_dt, cv=3,
scoring='neg_mean_squared_error')
grid_dcst.fit(X_train, y_train)

GridSearchCV(cv=3, estimator=DecisionTreeRegressor(),
param_grid={'max_depth': [5, 10, 20, None],
'min_samples_split': [2, 10, 20]},
scoring='neg_mean_squared_error')

best_dcst = grid_dcst.best_estimator_
y_pred_dcst = best_dcst.predict(X_test)

print("Best Decision Tree Parameters:",grid_dcst.best_params_)


print("Mean Absolute error: ", mean_absolute_error(y_pred_dcst,
y_test))
print("Mean Squared error: ", mean_squared_error(y_pred_dcst, y_test))
print("Root Mean Squared error: ",
np.sqrt(mean_squared_error(y_pred_dcst, y_test)))
print("R-squared score: ", r2_score(y_pred_dcst, y_test))

Best Decision Tree Parameters: {'max_depth': 10, 'min_samples_split':


20}
Mean Absolute error: 110631.04159020874
Mean Squared error: 39239209652.15163
Root Mean Squared error: 198088.8933084125
R-squared score: 0.6567287156442908

param_grid_rf = {
'n_estimators': [50, 100, 200],
'max_depth': [10, 20, None]
}
rf = RandomForestRegressor(random_state=42)
grid_search_rf = GridSearchCV(rf, param_grid_rf, cv=3,
scoring='neg_mean_squared_error')
grid_search_rf.fit(X_train, y_train)

GridSearchCV(cv=3, estimator=RandomForestRegressor(random_state=42),
param_grid={'max_depth': [10, 20, None],
'n_estimators': [50, 100, 200]},
scoring='neg_mean_squared_error')

best_rf = grid_search_rf.best_estimator_
y_pred_rf = best_rf.predict(X_test)

print("Best Random Forest Parameters:",grid_search_rf.best_params_)


print("Mean Absolute error: ", mean_absolute_error(y_pred_rf, y_test))
print("Mean Squared error: ", mean_squared_error(y_pred_rf, y_test))
print("Root Mean Squared error: ",
np.sqrt(mean_squared_error(y_pred_rf, y_test)))
print("R-squared score: ", r2_score(y_pred_rf, y_test))

Best Random Forest Parameters: {'max_depth': 20, 'n_estimators': 200}


Mean Absolute error: 86224.04226690672
Mean Squared error: 25156235320.69841
Root Mean Squared error: 158607.1729799709
R-squared score: 0.7707744616996358

from sklearn.svm import SVR

param_grid_svr = {
'C': [0.1, 1, 10],
'kernel': ['linear', 'rbf']
}
svr = SVR()
grid_search_svr = GridSearchCV(svr, param_grid_svr, cv=3,
scoring='neg_mean_squared_error')
grid_search_svr.fit(X_train, y_train)

GridSearchCV(cv=3, estimator=SVR(),
param_grid={'C': [0.1, 1, 10], 'kernel': ['linear',
'rbf']},
scoring='neg_mean_squared_error')

best_svr = grid_search_svr.best_estimator_
y_pred_svr = best_svr.predict(X_test)

print("Best SVR Parameters:",grid_search_svr.best_params_)


print("Mean Absolute error: ", mean_absolute_error(y_pred_svr,
y_test))
print("Mean Squared error: ", mean_squared_error(y_pred_svr, y_test))
print("Root Mean Squared error: ",
np.sqrt(mean_squared_error(y_pred_svr, y_test)))
print("R-squared score: ", r2_score(y_pred_svr, y_test))
Best SVR Parameters: {'C': 10, 'kernel': 'linear'}
Mean Absolute error: 137747.75623270954
Mean Squared error: 72231554707.12172
Root Mean Squared error: 268759.28766671807
R-squared score: -1.812122075633066

from sklearn.neighbors import KNeighborsRegressor

knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train, y_train)

KNeighborsRegressor()

y_pred_knn = knn_reg.predict(X_test)

print("Mean Absolute error: ", mean_absolute_error(y_pred_knn,


y_test))
print("Mean Squared error: ", mean_squared_error(y_pred_knn, y_test))
print("Root Mean Squared error: ",
np.sqrt(mean_squared_error(y_pred_knn, y_test)))
print("R-squared score: ", r2_score(y_pred_knn, y_test))

Mean Absolute error: 87159.51838371376


Mean Squared error: 27524502043.479027
Root Mean Squared error: 165905.09951016883
R-squared score: 0.7228269728251158

from sklearn.linear_model import ElasticNet

elastic_net_reg = ElasticNet()
elastic_net_reg.fit(X_train, y_train)

ElasticNet()

y_pred_elastic_net = elastic_net_reg.predict(X_test)

print("Mean Absolute error: ", mean_absolute_error(y_pred_elastic_net,


y_test))
print("Mean Squared error: ", mean_squared_error(y_pred_elastic_net,
y_test))
print("Root Mean Squared error: ",
np.sqrt(mean_squared_error(y_pred_elastic_net, y_test)))
print("R-squared score: ", r2_score(y_pred_elastic_net, y_test))

Mean Absolute error: 125533.69246905086


Mean Squared error: 46411282818.92971
Root Mean Squared error: 215432.78027944054
R-squared score: 0.30885381578698745

from sklearn.linear_model import BayesianRidge


bayesian_ridge_reg = BayesianRidge()
bayesian_ridge_reg.fit(X_train, y_train)

BayesianRidge()

y_pred_bayesian_ridge = bayesian_ridge_reg.predict(X_test)

print("Mean Absolute error: ",


mean_absolute_error(y_pred_bayesian_ridge, y_test))
print("Mean Squared error: ",
mean_squared_error(y_pred_bayesian_ridge, y_test))
print("Root Mean Squared error: ",
np.sqrt(mean_squared_error(y_pred_bayesian_ridge, y_test)))
print("R-squared score: ", r2_score(y_pred_bayesian_ridge, y_test))

Mean Absolute error: 127220.492976071


Mean Squared error: 42340606769.06621
Root Mean Squared error: 205768.3327654336
R-squared score: 0.5497874363751005

from sklearn.linear_model import HuberRegressor

huber_reg = HuberRegressor()
huber_reg.fit(X_train, y_train)

HuberRegressor()

y_pred_huber = huber_reg.predict(X_test)

print("Mean Absolute error: ", mean_absolute_error(y_pred_huber,


y_test))
print("Mean Squared error: ", mean_squared_error(y_pred_huber,
y_test))
print("Root Mean Squared error: ",
np.sqrt(mean_squared_error(y_pred_huber, y_test)))
print("R-squared score: ", r2_score(y_pred_huber, y_test))

Mean Absolute error: 118872.32059026897


Mean Squared error: 47919156366.38232
Root Mean Squared error: 218904.445743759
R-squared score: 0.21181390040065873

Using Classification Models on the dataset


dfc = pd.read_csv('Housing.csv')

df.head()

      price  bedrooms  bathrooms  sqft_living  sqft_lot    floors  waterfront  \
0  231300.0 -1.473841  -1.447464    -0.979835 -0.228321 -0.915427   -0.087173
1  538000.0 -0.398669   0.175607     0.533634 -0.189885  0.936506   -0.087173
2  180000.0 -1.473841  -1.447464    -1.426254 -0.123298 -0.915427   -0.087173
3  604000.0  0.676503   1.149449    -0.130550 -0.244014 -0.915427   -0.087173
4  510000.0 -0.398669  -0.149007    -0.435422 -0.169653 -0.915427   -0.087173

       view  condition     grade  sqft_above  sqft_basement  yr_built  \
0 -0.305759  -0.629187 -0.558836   -0.734708      -0.658681 -0.544898
1 -0.305759  -0.629187 -0.558836    0.460841       0.245141 -0.681079
2 -0.305759  -0.629187 -1.409587   -1.229834      -0.658681 -1.293892
3 -0.305759   2.444294 -0.558836   -0.891699       1.397515 -0.204446
4 -0.305759  -0.629187  0.291916   -0.130895      -0.658681  0.544548

   yr_renovated   zipcode       lat      long  sqft_living15  sqft_lot15
0     -0.210128  1.870152 -0.352572 -0.306079      -0.943355   -0.260715
1      4.746678  0.879568  1.161568 -0.746341      -0.432686   -0.187868
2     -0.210128 -0.933388  1.283537 -0.135655       1.070140   -0.172375
3     -0.210128  1.085160 -0.283288 -1.271816      -0.914174   -0.284522
4     -0.210128 -0.073636  0.409550  1.199335      -0.272190   -0.192849

dfc.drop(['date', 'id'], axis=1, inplace=True)

dfc['price_category'] = pd.qcut(df['price'], q=4, labels=['Low',


'Medium', 'High', 'Very High'])
dfc.drop('price', axis=1, inplace = True)

dfc.head()

   bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  \
0         2       1.00         1180      5650     1.0           0     0
1         3       2.25         2570      7242     2.0           0     0
2         2       1.00          770     10000     1.0           0     0
3         4       3.00         1960      5000     1.0           0     0
4         3       2.00         1680      8080     1.0           0     0

   condition  grade  sqft_above  sqft_basement  yr_built  yr_renovated  \
0          3      7        1180              0      1955             0
1          3      7        2170            400      1951          1991
2          3      6         770              0      1933             0
3          5      7        1050            910      1965             0
4          3      8        1680              0      1987             0

   zipcode      lat     long  sqft_living15  sqft_lot15  price_category
0    98178  47.5112 -122.257           1340        5650             Low
1    98125  47.7210 -122.319           1690        7639            High
2    98028  47.7379 -122.233           2720        8062             Low
3    98136  47.5208 -122.393           1360        5000            High
4    98074  47.6168 -122.045           1800        7503            High

from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()

ftransform = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',


'floors', 'waterfront','view', 'condition', 'grade', 'sqft_above',
'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
'lat', 'long', 'sqft_living15', 'sqft_lot15']
dfc[ftransform] = scaler.fit_transform(dfc[ftransform])

dfc.head()

   bedrooms  bathrooms  sqft_living  sqft_lot    floors  waterfront      view  \
0 -1.473841  -1.447464    -0.979835 -0.228321 -0.915427   -0.087173 -0.305759
1 -0.398669   0.175607     0.533634 -0.189885  0.936506   -0.087173 -0.305759
2 -1.473841  -1.447464    -1.426254 -0.123298 -0.915427   -0.087173 -0.305759
3  0.676503   1.149449    -0.130550 -0.244014 -0.915427   -0.087173 -0.305759
4 -0.398669  -0.149007    -0.435422 -0.169653 -0.915427   -0.087173 -0.305759

   condition     grade  sqft_above  sqft_basement  yr_built  yr_renovated  \
0  -0.629187 -0.558836   -0.734708      -0.658681 -0.544898     -0.210128
1  -0.629187 -0.558836    0.460841       0.245141 -0.681079      4.746678
2  -0.629187 -1.409587   -1.229834      -0.658681 -1.293892     -0.210128
3   2.444294 -0.558836   -0.891699       1.397515 -0.204446     -0.210128
4  -0.629187  0.291916   -0.130895      -0.658681  0.544548     -0.210128

    zipcode       lat      long  sqft_living15  sqft_lot15  price_category
0  1.870152 -0.352572 -0.306079      -0.943355   -0.260715             Low
1  0.879568  1.161568 -0.746341      -0.432686   -0.187868            High
2 -0.933388  1.283537 -0.135655       1.070140   -0.172375             Low
3  1.085160 -0.283288 -1.271816      -0.914174   -0.284522            High
4 -0.073636  0.409550  1.199335      -0.272190   -0.192849            High

features = dfc.drop('price_category', axis=1)


y = dfc['price_category']

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
pca_features = pca.fit_transform(features)

from sklearn.model_selection import train_test_split


from sklearn.metrics import classification_report, accuracy_score

X_train, X_test, y_train, y_test = train_test_split(pca_features, y,


test_size=0.3, random_state=101)

from sklearn.linear_model import LogisticRegression

log_c = LogisticRegression(max_iter=1000)

log_c.fit(X_train, y_train)
log_preds = log_c.predict(X_test)
print("Accuracy: ", round(accuracy_score(y_test, log_preds)*100, 2))
print("Classification Report: ", classification_report(y_test,
log_preds))

Accuracy:  64.57
Classification Report:
               precision    recall  f1-score   support

         High       0.51      0.49      0.50      1626
          Low       0.78      0.80      0.79      1605
       Medium       0.53      0.55      0.54      1656
    Very High       0.77      0.76      0.76      1597

     accuracy                           0.65      6484
    macro avg       0.65      0.65      0.65      6484
 weighted avg       0.65      0.65      0.65      6484

from sklearn.ensemble import RandomForestClassifier

rf_c = RandomForestClassifier()

rf_c.fit(X_train, y_train)
rf_preds = rf_c.predict(X_test)

print("Accuracy: ", round(accuracy_score(y_test, rf_preds)*100, 2))


print("Classification Report: ", classification_report(y_test,
rf_preds))

Accuracy:  72.62
Classification Report:
               precision    recall  f1-score   support

         High       0.64      0.62      0.63      1626
          Low       0.82      0.80      0.81      1605
       Medium       0.64      0.65      0.64      1656
    Very High       0.81      0.83      0.82      1597

     accuracy                           0.73      6484
    macro avg       0.73      0.73      0.73      6484
 weighted avg       0.73      0.73      0.73      6484

from sklearn.svm import SVC

svc = SVC()

svc.fit(X_train, y_train)
svc_pred = svc.predict(X_test)
print("Accuracy: ", round(accuracy_score(y_test, svc_pred)*100, 2))
print("Classification Report: ", classification_report(y_test,
svc_pred))

Accuracy:  72.39
Classification Report:
               precision    recall  f1-score   support

         High       0.62      0.67      0.65      1626
          Low       0.83      0.78      0.81      1605
       Medium       0.63      0.65      0.64      1656
    Very High       0.84      0.79      0.81      1597

     accuracy                           0.72      6484
    macro avg       0.73      0.72      0.73      6484
 weighted avg       0.73      0.72      0.73      6484

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)

print("Accuracy: ", round(accuracy_score(y_test, knn_pred)*100, 2))


print("Classification Report: ", classification_report(y_test,
knn_pred))

Accuracy:  70.19
Classification Report:
               precision    recall  f1-score   support

         High       0.59      0.65      0.62      1626
          Low       0.78      0.81      0.80      1605
       Medium       0.63      0.57      0.60      1656
    Very High       0.82      0.78      0.80      1597

     accuracy                           0.70      6484
    macro avg       0.70      0.70      0.70      6484
 weighted avg       0.70      0.70      0.70      6484
