ML Assignment (22BCE8086) 2
Objective:
The focus of this assignment is to build a regression model that predicts a continuous target variable using a Kaggle dataset. Dimensionality reduction methods (e.g., Principal Component Analysis, PCA) are applied to improve model performance when working with high-dimensional data.
Dataset Selection:
Dataset Used: The dataset used for this assignment is diamonds.csv, which contains
features related to the properties of diamonds, including their cut, clarity, color, and
size, among other characteristics.
Target Variable: The continuous target variable is price, representing the price of
each diamond.
Data Preprocessing:
1. Loading the Data
The dataset is loaded using the pandas library. We first examine it for missing values and for columns that are irrelevant to our analysis.
2. Dropping Unnecessary Columns
In the dataset, there is an index column (Unnamed: 0) which is not needed for our analysis.
We drop this column.
3. Train-Test Split
After encoding the categorical features, standardizing them, and applying PCA to retain 95% of the variance, the reduced dataset is split into training and testing sets, with 80% of the data used for training and 20% for testing.
4. Model Training
A Linear Regression model is trained on the training data.
5. Model Evaluation
After training, the model’s performance is evaluated on the test data using two metrics: Mean Squared Error (MSE) and R-squared (R²).
Detailed Code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset and check for irrelevant columns and missing values
df = pd.read_csv("diamonds.csv")
print(df.info())
print(df.isnull().sum())

# Drop the index column, one-hot encode the categorical columns (cut, color, clarity), and separate the target
df = df.drop(columns=['Unnamed: 0'])
X = pd.get_dummies(df.drop('price', axis=1), drop_first=True)
y = df['price']

# Standardize the features and apply PCA, keeping 95% of the variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
explained_variance = pca.explained_variance_ratio_
print(f"Explained variance ratio: {explained_variance}")
print(f"Total variance explained: {explained_variance.sum():.2f}")

# 80/20 train-test split on the reduced feature set
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.2, random_state=42)

# Train a Linear Regression model and evaluate it on the test set
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

# Plot actual vs predicted prices
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.5)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted Prices')
plt.show()
Results:
Model Evaluation Metrics:
Mean Squared Error (MSE): Measures the average squared difference between the predicted and actual prices; its square root (RMSE) gives the typical size of the prediction error in dollars.
R-squared (R²): Indicates how well the regression model captures the variance in the target variable; an R² of 0.85 means that 85% of the variance in diamond prices is explained by the model.
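Because MSE is expressed in squared dollars, reporting its square root alongside it makes the error easier to interpret. A minimal addition to the code above (assuming the mse variable computed there):

import numpy as np
rmse = np.sqrt(mse)  # typical prediction error in dollars
print(f"Root Mean Squared Error: {rmse:.2f}")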
Visualization:
Plotting the actual versus predicted diamond prices to visually assess the model’s
performance.
Conclusion:
Dimensionality Reduction:
PCA was applied to reduce the complexity of the data while still explaining 95% of the variance. This made the model simpler, helping to prevent multicollinearity and overfitting, which are typical issues with high-dimensional data.
Model Performance:
R-squared (R²): The model explained about 85% of the variance in diamond prices, showing that it successfully captured how the features relate to the target.
Mean Squared Error (MSE): An MSE of about 2.37 million corresponds to a root mean squared error of roughly $1,539 (√2,370,000 ≈ 1,539), a reasonable error given the wide range of diamond prices.
High-dimensional Data Discovery:
PCA mitigated potential problems such as overfitting and computational complexity by reducing the number of features with little loss in predictive power.
Overall:
The model is simple yet performant, making it an efficient and accurate tool for predicting diamond prices.
***END***
ML Assignment – 2
Reg Number: 22BCE8086
Name of The Student: T.Gnana Sai Siddartha
Slot: F2+TF2
Objective:
This assignment extends the initial regression task to further optimize the prediction model
for diamond prices. The objectives include exploring additional regression models,
implementing hyperparameter tuning, feature engineering, ensemble methods, and
visualization techniques to enhance model interpretability and performance.
a. Model Selection
Various regression models were evaluated to identify the most effective one for the diamonds.csv dataset. The models implemented include: Linear Regression, Lasso Regression, Ridge Regression, Decision Tree Regressor, Support Vector Regressor (SVR), Random Forest Regressor, and Gradient Boosting Regressor.
Each model was trained and tested, and their performance metrics, such as Mean Squared
Error (MSE) and R-squared (R²), were recorded.
b. Hyperparameter Tuning
c. Feature Engineering
Feature Combinations: Interaction terms, such as carat * depth, were added to capture
volumetric effects on price.
Log Transformations: Skewed features were log-transformed for normalization.
Domain Knowledge: Grouped similar categories (e.g., color and clarity levels) for simplified
interpretation.
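A minimal sketch of these feature-engineering steps, assuming the raw diamonds.csv dataframe; the specific interaction term, transformed columns, and clarity grouping shown here are illustrative assumptions rather than the exact features used:

import numpy as np
import pandas as pd

df = pd.read_csv('diamonds.csv').drop(columns=['Unnamed: 0'])
# Interaction term capturing a rough volumetric effect on price
df['carat_depth'] = df['carat'] * df['depth']
# Log-transform skewed features to reduce skewness
df['log_carat'] = np.log1p(df['carat'])
df['log_price'] = np.log1p(df['price'])
# Group similar clarity levels into broader bands (hypothetical grouping)
clarity_groups = {'I1': 'low', 'SI2': 'low', 'SI1': 'low',
                  'VS2': 'medium', 'VS1': 'medium',
                  'VVS2': 'high', 'VVS1': 'high', 'IF': 'high'}
df['clarity_group'] = df['clarity'].map(clarity_groups)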
d. Ensemble Methods
Ensemble methods such as Random Forest and Gradient Boosting were employed to boost model performance, since aggregating many decision trees reduces variance compared to a single tree.
e. Overfitting Prevention
Regularization Techniques: Applied Ridge and Lasso regression for linear models.
Cross-Validation: K-fold cross-validation was used to check model consistency across
multiple folds.
Model Complexity Control: Parameters like max_depth and min_samples_split were tuned to
balance model complexity and accuracy.
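A minimal sketch of how this tuning and cross-validation could be carried out with scikit-learn's GridSearchCV, assuming the X_train/y_train split produced in the implementation code below; the parameter grid is an assumed example, not the exact grid used:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Assumed grid over the complexity parameters mentioned above
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
}
# 5-fold cross-validation scores each combination on held-out folds
grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                    cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)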
Implementation
1. Data Loading and Preprocessing:
import pandas as pd
import numpy as np
from google.colab import drive
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Mount Google Drive and load the dataset
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/My Drive/diamonds.csv')
print(df.info())
print(df.isnull().sum())
# Drop the index column, one-hot encode the categorical columns, and separate the target
df = df.drop(columns=['Unnamed: 0'])
X = pd.get_dummies(df.drop('price', axis=1), drop_first=True)
y = df['price']
# Standardize and apply PCA, retaining 95% of the variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
explained_variance = pca.explained_variance_ratio_
# 80/20 train-test split on the reduced features
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.2, random_state=42)
output:
2. Model Implementation and Evaluation
2.1 Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
output:
2.2 Lasso Regression
from sklearn.linear_model import Lasso
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)
y_pred_lasso = lasso_model.predict(X_test)
output:
2.3 Ridge Regression
from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
y_pred_ridge = ridge_model.predict(X_test)
2.4 Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train, y_train)
y_pred_tree = tree_model.predict(X_test)
output:
2.5 Support Vector Regressor
from sklearn.svm import SVR
svr_model = SVR(kernel='rbf')
svr_model.fit(X_train, y_train)
y_pred_svr = svr_model.predict(X_test)
output:
2.6 Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(random_state=42)  # definition was missing; default hyperparameters assumed
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
output:
2.7 Gradient Boosting Regressor
from sklearn.ensemble import GradientBoostingRegressor
gb_model = GradientBoostingRegressor(random_state=42)  # definition was missing; default hyperparameters assumed
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
output:
Results and Analysis
The best model is the Random Forest Regressor. This model has the lowest Mean Squared
Error (MSE) at 729950.89 and the highest R-squared (R²) value at 0.9541, indicating it
performs well at predicting the target variable with a high level of accuracy.
Reasoning: The best model was selected based on its performance in terms of the lowest
MSE and highest R² score, balancing accuracy with model complexity.
Overfitting and Underfitting:
Overfitting occurs when a model performs well on the training data but poorly on
new, unseen data, often due to learning noise and patterns that are not
generalizable.
Models like Decision Tree Regressor and Random Forest Regressor can sometimes
overfit if not tuned properly. The Decision Tree Regressor has a high R² (0.9373) and
relatively low MSE, which might be a sign of slight overfitting since individual decision
trees are prone to it.
Random Forest and Gradient Boosting Regressors use ensemble techniques that
generally help in reducing overfitting, making them robust choices. However, it’s still
essential to check their performance on test data.
Underfitting happens when a model is too simple to capture the patterns in the
data, resulting in poor performance.
The Support Vector Regressor shows a low R² (0.3434) and high MSE, suggesting that it
may be underfitting, as it’s unable to capture the relationship in the data as effectively as
the other models.
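One simple way to check for the overfitting and underfitting described above is to compare training and test R² for each fitted model; a minimal sketch, assuming the models and splits defined earlier:

from sklearn.metrics import r2_score
# A large gap between train and test R² suggests overfitting;
# low scores on both suggest underfitting.
for name, fitted in [('Decision Tree', tree_model),
                     ('Random Forest', rf_model),
                     ('SVR', svr_model)]:
    train_r2 = r2_score(y_train, fitted.predict(X_train))
    test_r2 = r2_score(y_test, fitted.predict(X_test))
    print(f'{name}: train R² = {train_r2:.3f}, test R² = {test_r2:.3f}')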
f. Visualizations
Scatter Plot (Actual vs. Predicted): Visualizes model accuracy with points ideally along a 45-degree
line.
code:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred_rf, alpha=0.5)  # predictions from the best model (Random Forest assumed)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted Prices')
plt.show()
output:
Residual Plot: Shows residuals to assess unbiased predictions (random scatter around zero indicates
good performance).
code:
residuals = y_test - y_pred_rf  # residuals of the best model (Random Forest assumed)
plt.figure(figsize=(10, 6))
plt.scatter(y_test, residuals, alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Actual Price')
plt.ylabel('Residuals')
plt.show()
output:
code:
plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True)
plt.xlabel('Residuals')
plt.title('Distribution of Residuals')
plt.show()
output:
Correlation Heatmap: Displays pairwise feature relationships to identify multicollinearity among features.
code:
plt.figure(figsize=(12, 10))
sns.heatmap(df.select_dtypes(include=np.number).corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
output:
g. Conclusion
This study compared several regression models to predict diamond prices, with Random
Forest Regressor emerging as the best performer, achieving an MSE of 729,950.89 and an R²
of 0.9541. Its ensemble approach effectively captured complex patterns in the data,
outperforming other models like Linear Regression and Support Vector Regressor, which
struggled with lower R² values and higher MSE. The Gradient Boosting Regressor also
showed strong performance, making it a solid alternative. Overall, the Random Forest
Regressor demonstrated the best balance of accuracy and generalization for diamond price
prediction.
***END***