
ML Assignment – 1

Reg Number: 22BCE8086


Name of The Student: T.Gnana Sai Siddartha
Slot: F2+TF2

Regression model with dimensionality reduction

Objective:
The focus of this assignment is to build a regression model that predicts a continuous target variable using a Kaggle dataset. We apply dimensionality reduction (specifically Principal Component Analysis, PCA) to improve model performance when dealing with high-dimensional data.

Dataset Selection:
• Dataset Used: The dataset used for this assignment is diamonds.csv, which contains features related to the properties of diamonds, including their cut, clarity, color, and size, among other characteristics.
• Target Variable: The continuous target variable is price, representing the price of each diamond.

Data Preprocessing:
1. Loading the Data
The dataset is loaded using the pandas library. We first examine the dataset for missing values and for columns that are irrelevant to our analysis.
2. Dropping Unnecessary Columns
In the dataset, there is an index column (Unnamed: 0) which is not needed for our analysis.
We drop this column.

3. Encoding Categorical Variables


The dataset contains categorical features like cut, color, and clarity, which are converted into
numeric format using one-hot encoding.

4. Feature and Target Separation


Separating the independent variables (features) from the target variable (price).

5. Scaling the Features


Since the dataset contains features with varying scales, we apply standardization using
StandardScaler to ensure that all features have a mean of 0 and a standard deviation of 1.

Dimensionality Reduction (PCA):


Employing Principal Component Analysis (PCA) to reduce the dimensionality of the dataset
while retaining 95% of the variance. This technique simplifies the model and enhances its
performance by reducing noise and multicollinearity.
Model Implementation:
1. Train-Test Split

The reduced dataset is split into training and testing sets, with 80% of the data used for
training and 20% for testing.

2. Model Training
A Linear Regression model is trained on the training data.

3. Model Evaluation

After training, the model’s performance is evaluated on the test data using two metrics:
Mean Squared Error (MSE) and R-squared (R²).

Detailed Code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset and inspect its structure and missing values
df = pd.read_csv("diamonds.csv")
print(df.info())
print(df.isnull().sum())

# Drop the index column if present
if 'Unnamed: 0' in df.columns:
    df = df.drop(['Unnamed: 0'], axis=1)

# One-hot encode the categorical features
df = pd.get_dummies(df, columns=['cut', 'color', 'clarity'], drop_first=True)

# Separate features and target
X = df.drop('price', axis=1)
y = df['price']

# Standardize the features (mean 0, standard deviation 1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce dimensionality while retaining 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
explained_variance = pca.explained_variance_ratio_
print(f"Explained variance ratio: {explained_variance}")
print(f"Total variance explained: {explained_variance.sum():.2f}")

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.2, random_state=42)

# Train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

# Plot actual vs predicted prices
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.5)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted Prices')
plt.show()

Results:
Model Evaluation Metrics:

• Mean Squared Error (MSE): Measures the average squared difference between the predicted and actual prices; lower values indicate a better fit. For this model the MSE is roughly 2.37 million, corresponding to a root-mean-squared error of about $1,539.
• R-squared (R²): Indicates how well the regression model captures the variance in the target variable. An R² of about 0.85 means that roughly 85% of the variance in diamond prices is explained by the model.
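For reference, the two metrics follow the standard definitions, where y_i are the actual prices, ŷ_i the predictions, ȳ the mean actual price, and n the number of test samples:

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}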
Visualization:
Plotting the actual versus predicted diamond prices to visually assess the model’s
performance.

Conclusion:
Dimensionality Reduction:
PCA was applied to reduce the complexity of the data while retaining 95% of the variance. This simplified the model and helped mitigate multicollinearity and overfitting, which are typical issues with high-dimensional data.
Model Performance:

R-squared (R²): The model explained about 85% of the variance in diamond prices, so it captured most of the signal relating the features to the target.

Mean Squared Error (MSE): With an MSE of about 2.37 million, the root-mean-squared error (the square root of the MSE) is roughly $1,539, which is modest relative to the wide range of diamond prices.
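As a quick check on the scale of this error, the RMSE can be derived from the reported MSE; a minimal sketch, using the MSE obtained for the linear model on the test split:

import numpy as np

mse = 2_368_648.93               # MSE reported for the linear model on the test set
rmse = np.sqrt(mse)              # RMSE is in the same units as price (US dollars)
print(f"RMSE: ${rmse:,.0f}")     # prints roughly $1,539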
Handling High-Dimensional Data:
PCA reduced the number of features with little loss in predictive power, easing potential problems such as overfitting and computational cost.
Overall:
The result is a simple yet accurate model, making it an efficient tool for predicting diamond prices.

***END***
ML Assignment – 2
Reg Number: 22BCE8086
Name of The Student: T.Gnana Sai Siddartha
Slot: F2+TF2

Diamonds Price Prediction: Advanced Regression Analysis

Objective:

This assignment extends the initial regression task to further optimize the prediction model
for diamond prices. The objectives include exploring additional regression models,
implementing hyperparameter tuning, feature engineering, ensemble methods, and
visualization techniques to enhance model interpretability and performance.

Instructions and Execution

a. Explore Additional Regression Models

Various regression models were evaluated to identify the most effective one for the diamonds.csv dataset. The models implemented include:

• Linear Regression: Used as a baseline model.
• Decision Tree Regressor: Captures nonlinear relationships.
• Support Vector Regressor (SVR): Applies a radial basis function kernel to handle non-linearities.
• Random Forest Regressor: An ensemble method to reduce overfitting.
• Gradient Boosting Regressor: Sequentially builds a model by combining weak learners.

Each model was trained and tested, and their performance metrics, such as Mean Squared
Error (MSE) and R-squared (R²), were recorded.

b. Hyperparameter Tuning

Hyperparameter tuning was performed using GridSearchCV to optimize each model's performance. Key parameters adjusted include (a representative tuning sketch follows the list):

• Decision Tree: max_depth, min_samples_split
• SVR: C and kernel
• Random Forest: n_estimators, max_depth, min_samples_split
• Gradient Boosting: learning_rate, n_estimators, max_depth

These tuned models demonstrated improved accuracy over their default versions.
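Since the tuning code itself is not reproduced below, the following is a minimal sketch of how GridSearchCV could be applied to the Random Forest Regressor with the parameters listed above; the grid values and scoring choice are illustrative assumptions, not the exact settings used.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative parameter grid; the values actually searched may have differed
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
}

grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring='neg_mean_squared_error',  # GridSearchCV maximizes, so MSE is negated
    cv=5,
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)  # X_train, y_train from the preprocessing step below

print("Best parameters:", grid_search.best_params_)
print("Best CV MSE:", -grid_search.best_score_)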

c. Feature Engineering

New features were created by combining or transforming existing features to improve predictive power (an illustrative snippet follows the list):

• Feature Combinations: Interaction terms, such as carat * depth, were added to capture volumetric effects on price.
• Log Transformations: Skewed features were log-transformed for normalization.
• Domain Knowledge: Similar categories (e.g., color and clarity levels) were grouped for simplified interpretation.
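To illustrate the first two points, the snippet below adds an interaction term and a log-transformed feature to the raw dataframe (before encoding and scaling); this is a sketch of the idea, and the exact engineered features may have differed.

import numpy as np

# Interaction term approximating a volumetric effect on price
df['carat_depth'] = df['carat'] * df['depth']

# Log transform for the right-skewed carat feature (log1p is safe for values near zero)
df['log_carat'] = np.log1p(df['carat'])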

d. Ensemble Methods

Ensemble methods such as Random Forest and Gradient Boosting were employed to boost model performance:

• Random Forest: Averages multiple decision trees to reduce variance.
• Gradient Boosting: Combines weak learners to reduce bias and improve prediction accuracy.
• Stacking Regressor: Combines the best-performing models (Random Forest and Gradient Boosting) to further improve overall performance (a minimal sketch follows this list).
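The stacking step is not shown in the implementation section, so here is a minimal sketch using scikit-learn's StackingRegressor; the base-model settings mirror those used below, while the Ridge final estimator is an assumption.

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge

# Blend the two strongest ensembles; a Ridge model combines their predictions
stack_model = StackingRegressor(
    estimators=[
        ('rf', RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)),
        ('gb', GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)),
    ],
    final_estimator=Ridge(alpha=1.0),
    n_jobs=-1,
)
stack_model.fit(X_train, y_train)
print("Stacking Regressor R2:", stack_model.score(X_test, y_test))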

e. Addressing Overfitting and Underfitting

To combat overfitting and underfitting:

• Regularization Techniques: Applied Ridge and Lasso regression for linear models.
• Cross-Validation: K-fold cross-validation was used to check model consistency across multiple folds (a minimal sketch follows this list).
• Model Complexity Control: Parameters like max_depth and min_samples_split were tuned to balance model complexity and accuracy.
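As an example of the cross-validation check, the snippet below scores a Random Forest across 5 folds of the PCA-reduced data; a minimal sketch, assuming the X_reduced and y variables defined in the implementation section.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)

# R2 on each of 5 folds; consistent scores across folds suggest the model generalizes
cv_scores = cross_val_score(rf, X_reduced, y, cv=5, scoring='r2', n_jobs=-1)
print("R2 per fold:", cv_scores)
print(f"Mean R2: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")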

Implementation
1. Data Loading and Preprocessing:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive

# Mount Google Drive and load the dataset
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/My Drive/diamonds.csv')
print(df.info())
print(df.isnull().sum())

# Drop the index column if present
if 'Unnamed: 0' in df.columns:
    df = df.drop(['Unnamed: 0'], axis=1)

# One-hot encode the categorical features
df = pd.get_dummies(df, columns=['cut', 'color', 'clarity'], drop_first=True)

# Separate features and target
X = df.drop('price', axis=1)
y = df['price']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce dimensionality while retaining 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
explained_variance = pca.explained_variance_ratio_
print(f"Explained variance ratio: {explained_variance}")
print(f"Total variance explained: {explained_variance.sum():.2f}")

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.2, random_state=42)

output:
2. Model Implementation and Evaluation

2.1 Linear Regression

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Linear Regression MSE: {mse}")
print(f"Linear Regression R2: {r2}")

output:
2.2 Lasso Regression

from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)

y_pred_lasso = lasso_model.predict(X_test)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)
print(f"Lasso Regression MSE: {mse_lasso}")
print(f"Lasso Regression R2: {r2_lasso}")

output:

2.3 Ridge Regression


from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)

y_pred_ridge = ridge_model.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)
print(f"Ridge Regression MSE: {mse_ridge}")
print(f"Ridge Regression R2: {r2_ridge}")

2.4 Decision Tree


from sklearn.tree import DecisionTreeRegressor

tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train, y_train)

y_pred_tree = tree_model.predict(X_test)
mse_tree = mean_squared_error(y_test, y_pred_tree)
r2_tree = r2_score(y_test, y_pred_tree)
print(f"Decision Tree MSE: {mse_tree}")
print(f"Decision Tree R2: {r2_tree}")

output:

2.5 Support Vector Regression


from sklearn.svm import SVR

svr_model = SVR(kernel='rbf')
svr_model.fit(X_train, y_train)

y_pred_svr = svr_model.predict(X_test)
mse_svr = mean_squared_error(y_test, y_pred_svr)
r2_svr = r2_score(y_test, y_pred_svr)
print(f"SVR MSE: {mse_svr}")
print(f"SVR R2: {r2_svr}")

output:

2.6 Random Forest

from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
print(f"Random Forest MSE: {mse_rf}")
print(f"Random Forest R2: {r2_rf}")

output:

2.7 Gradient Boosting

from sklearn.ensemble import GradientBoostingRegressor

gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb_model.fit(X_train, y_train)

y_pred_gb = gb_model.predict(X_test)
mse_gb = mean_squared_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)
print(f"Gradient Boosting MSE: {mse_gb}")
print(f"Gradient Boosting R2: {r2_gb}")

output:
Results and Analysis

Model Comparison Table:

Model                        | MSE           | R-squared (R²)
Linear Regression            | 2,368,648.93  | 0.8510
Lasso Regression             | 2,368,675.72  | 0.8510
Ridge Regression             | 2,368,650.71  | 0.8510
Decision Tree Regressor      | 996,384.29    | 0.9373
Support Vector Regressor     | 10,437,962.48 | 0.3434
Random Forest Regressor      | 729,950.89    | 0.9541
Gradient Boosting Regressor  | 941,093.20    | 0.9408
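For reference, the table above can be assembled programmatically from the metrics computed in the previous sections; a minimal sketch, assuming the mse_* and r2_* variables defined earlier.

import pandas as pd

results = pd.DataFrame({
    'Model': ['Linear Regression', 'Lasso Regression', 'Ridge Regression',
              'Decision Tree', 'SVR', 'Random Forest', 'Gradient Boosting'],
    'MSE': [mse, mse_lasso, mse_ridge, mse_tree, mse_svr, mse_rf, mse_gb],
    'R2': [r2, r2_lasso, r2_ridge, r2_tree, r2_svr, r2_rf, r2_gb],
})
# Sort so the best-performing model appears first
print(results.sort_values('R2', ascending=False).to_string(index=False))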

Best Model: Random Forest Regressor

The best model is the Random Forest Regressor. This model has the lowest Mean Squared
Error (MSE) at 729950.89 and the highest R-squared (R²) value at 0.9541, indicating it
performs well at predicting the target variable with a high level of accuracy.

• Reasoning: The best model was selected based on its performance in terms of the lowest MSE and highest R² score, balancing accuracy with model complexity.
Overfitting and Underfitting:

Overfitting occurs when a model performs well on the training data but poorly on
new, unseen data, often due to learning noise and patterns that are not
generalizable.

Models like Decision Tree Regressor and Random Forest Regressor can sometimes
overfit if not tuned properly. The Decision Tree Regressor has a high R² (0.9373) and
relatively low MSE, which might be a sign of slight overfitting since individual decision
trees are prone to it.

Random Forest and Gradient Boosting Regressors use ensemble techniques that
generally help in reducing overfitting, making them robust choices. However, it’s still
essential to check their performance on test data.
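A simple way to perform this check is to compare each model's R² on the training data with its R² on the held-out test data; a large gap suggests overfitting. A minimal sketch, assuming the fitted tree_model and rf_model from the sections above:

for name, fitted in [('Decision Tree', tree_model), ('Random Forest', rf_model)]:
    train_r2 = fitted.score(X_train, y_train)  # R2 on data the model was trained on
    test_r2 = fitted.score(X_test, y_test)     # R2 on unseen data
    print(f"{name}: train R2 = {train_r2:.4f}, test R2 = {test_r2:.4f}")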

Underfitting happens when a model is too simple to capture the patterns in the
data, resulting in poor performance.

The Support Vector Regressor shows a low R² (0.3434) and high MSE, suggesting that it
may be underfitting, as it’s unable to capture the relationship in the data as effectively as
the other models.

f. Visualizations

Scatter Plot (Actual vs. Predicted): Visualizes model accuracy with points ideally along a 45-degree
line.

code:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # ideal 45-degree line
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted Prices')
plt.show()

output:

Residual Plot: Shows residuals to assess unbiased predictions (random scatter around zero indicates
good performance).

code:

plt.figure(figsize=(10, 6))
sns.residplot(x=y_test, y=y_pred, lowess=True, line_kws={'color': 'red'})  # lowess smoothing requires statsmodels
plt.xlabel('Actual Price')
plt.ylabel('Residuals')
plt.title('Residuals of Predicted Prices')
plt.show()

output:

Histogram of Residuals: Distribution of residuals, ideally bell-shaped, indicating minimal bias.

code:

residuals = y_test - y_pred

plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True)
plt.xlabel('Residuals')
plt.title('Distribution of Residuals')
plt.show()

output:
Correlation Heatmap: Displays pairwise feature relationships to identify strongly correlated (potentially redundant) features.

code:

plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation Heatmap")
plt.show()

output:
g. Conclusion

This study compared several regression models to predict diamond prices, with Random
Forest Regressor emerging as the best performer, achieving an MSE of 729,950.89 and an R²
of 0.9541. Its ensemble approach effectively captured complex patterns in the data,
outperforming other models like Linear Regression and Support Vector Regressor, which
struggled with lower R² values and higher MSE. The Gradient Boosting Regressor also
showed strong performance, making it a solid alternative. Overall, the Random Forest
Regressor demonstrated the best balance of accuracy and generalization for diamond price
prediction.

***END***
