ML LAB 8

The document outlines a machine learning project focused on predicting mobile phone price ranges using various classification algorithms, including Decision Trees, Random Forests, Support Vector Machines (SVM), and K-Nearest Neighbors (KNN). The SVM model achieved the highest accuracy of 95.75% and an AUC-ROC score of 0.9988, making it the recommended model for mobile price prediction. It also discusses challenges in traditional methods, data preprocessing, model training/testing, and evaluation metrics.

Sugandh 06701192023

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, label_binarize
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("train.csv - train.csv.csv")
df.head()

   battery_power  blue  clock_speed  dual_sim  fc  four_g  int_memory  m_dep  \
0            842     0          2.2         0   1       0           7    0.6
1           1021     1          0.5         1   0       1          53    0.7
2            563     1          0.5         1   2       1          41    0.9
3            615     1          2.5         0   0       0          10    0.8
4           1821     1          1.2         0  13       1          44    0.6

   mobile_wt  n_cores  ...  px_height  px_width   ram  sc_h  sc_w  talk_time  \
0        188        2  ...         20       756  2549     9     7         19
1        136        3  ...        905      1988  2631    17     3          7
2        145        5  ...       1263      1716  2603    11     2          9
3        131        6  ...       1216      1786  2769    16     8         11
4        141        2  ...       1208      1212  1411     8     2         15

   three_g  touch_screen  wifi  price_range
0        0             0     1            1
1        1             1     0            2
2        1             1     0            2
3        1             0     0            2
4        1             1     0            1

[5 rows x 21 columns]

df.isnull().sum()

battery_power 0
blue 0
clock_speed 0
dual_sim 0
fc 0
four_g 0
int_memory 0
m_dep 0
mobile_wt 0
n_cores 0
pc 0
px_height 0
px_width 0
ram 0
sc_h 0
sc_w 0
talk_time 0
three_g 0
touch_screen 0
wifi 0
price_range 0
dtype: int64

# Split features and target
X = df.drop("price_range", axis=1)  # Independent variables
y = df["price_range"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

dt_model = DecisionTreeClassifier(criterion='entropy', max_depth=5,
                                  random_state=42)  # 'gini' is the alternative criterion
dt_model.fit(X_train, y_train)

y_pred = dt_model.predict(X_test)
y_prob_dt = dt_model.predict_proba(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.8725

Confusion Matrix:
 [[90 10  0  0]
 [ 4 89  7  0]
 [ 0 11 76 13]
 [ 0  0  6 94]]

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.90      0.93       100
           1       0.81      0.89      0.85       100
           2       0.85      0.76      0.80       100
           3       0.88      0.94      0.91       100

    accuracy                           0.87       400
   macro avg       0.87      0.87      0.87       400
weighted avg       0.87      0.87      0.87       400

rf_model = RandomForestClassifier(n_estimators=100, random_state=42,
                                  max_depth=10)
rf_model.fit(X_train, y_train)

RandomForestClassifier(max_depth=10, random_state=42)

dt_model = DecisionTreeClassifier(criterion='gini', max_depth=5,
                                  random_state=42)  # You can use 'entropy' instead of 'gini'
dt_model.fit(X_train, y_train)

y_pred = dt_model.predict(X_test)
y_prob_dt = dt_model.predict_proba(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.83

Confusion Matrix:
 [[89 11  0  0]
 [ 6 82 12  0]
 [ 0 17 75  8]
 [ 0  0 14 86]]

Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.89      0.91       100
           1       0.75      0.82      0.78       100
           2       0.74      0.75      0.75       100
           3       0.91      0.86      0.89       100

    accuracy                           0.83       400
   macro avg       0.83      0.83      0.83       400
weighted avg       0.83      0.83      0.83       400

y_pred_rf = rf_model.predict(X_test)
y_prob_rf = rf_model.predict_proba(X_test)

# Accuracy Score
accuracy = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy: {accuracy:.4f}")

# Classification Report
print("Classification Report:\n", classification_report(y_test,
y_pred_rf))

# Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))

Accuracy: 0.8900
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.95      0.95       100
           1       0.82      0.84      0.83       100
           2       0.84      0.82      0.83       100
           3       0.95      0.95      0.95       100

    accuracy                           0.89       400
   macro avg       0.89      0.89      0.89       400
weighted avg       0.89      0.89      0.89       400

Confusion Matrix:
 [[95  5  0  0]
 [ 5 84 11  0]
 [ 0 13 82  5]
 [ 0  0  5 95]]
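As a quick follow-up, the fitted forest exposes impurity-based feature importances, which speaks to the feature-selection challenge discussed later. This is a minimal sketch (not part of the original lab) for ranking them:

# A minimal sketch: rank features by the fitted forest's impurity-based
# importances to see which inputs drive its predictions.
importances = pd.Series(rf_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))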

# --- SVM Model with Probability Enabled ---
svm_model = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True)  # Enable predict_proba
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)
y_prob_svm = svm_model.predict_proba(X_test)  # Works because probability=True

print("\n--- SVM Results ---")


print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print(confusion_matrix(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))

--- SVM Results ---
Accuracy: 0.9575
[[100   0   0   0]
 [  2  97   1   0]
 [  0   7  89   4]
 [  0   0   3  97]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       100
           1       0.93      0.97      0.95       100
           2       0.96      0.89      0.92       100
           3       0.96      0.97      0.97       100

    accuracy                           0.96       400
   macro avg       0.96      0.96      0.96       400
weighted avg       0.96      0.96      0.96       400
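One caveat: the features above are unscaled, and RBF-kernel SVMs are distance-based, so standardization usually helps. A minimal sketch, assuming scaling is wanted (it is an addition, not part of the lab), using a pipeline so the scaler is fit on training data only and never leaks test-set statistics:

# A minimal sketch: StandardScaler + SVC chained in a Pipeline; fitting
# the pipeline fits the scaler on X_train only, then scales X_test at
# predict time automatically.
from sklearn.pipeline import make_pipeline

scaled_svm = make_pipeline(StandardScaler(),
                           SVC(kernel='rbf', C=1.0, gamma='scale',
                               probability=True))
scaled_svm.fit(X_train, y_train)
print("Scaled SVM accuracy:", scaled_svm.score(X_test, y_test))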

# --- KNN Model ---
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_test)
y_prob_knn = knn_model.predict_proba(X_test)
print("\n--- KNN Results ---")
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print(confusion_matrix(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))

--- KNN Results ---
Accuracy: 0.935
[[99  1  0  0]
 [ 2 93  5  0]
 [ 0  7 87  6]
 [ 0  0  5 95]]
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       100
           1       0.92      0.93      0.93       100
           2       0.90      0.87      0.88       100
           3       0.94      0.95      0.95       100

    accuracy                           0.94       400
   macro avg       0.93      0.94      0.93       400
weighted avg       0.93      0.94      0.93       400
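The choice of n_neighbors=5 is not justified above. A minimal sketch (the k range 1-20 is an assumption, not tuned for this dataset) for checking how sensitive KNN is to k:

# A minimal sketch: sweep k and plot test accuracy to gauge KNN's
# sensitivity to the n_neighbors hyperparameter.
ks = range(1, 21)
scores = [KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
          for k in ks]
plt.plot(list(ks), scores, marker='o')
plt.xlabel('n_neighbors (k)')
plt.ylabel('Test accuracy')
plt.title('KNN accuracy vs. k')
plt.show()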

# Challenges in Traditional Methods:

# Feature Engineering Complexity
# Traditional models like Decision Trees and KNN rely heavily on manual
# feature selection. Feature importance needs to be analyzed carefully to
# avoid irrelevant or redundant features.

# Scalability Issues
# KNN struggles with large datasets: each prediction computes distances to
# all n training points, so a full pass over the data costs O(n^2).
# Decision Trees may grow too deep, leading to overfitting.

# Hyperparameter Sensitivity
# Small changes to parameters such as max_depth, C, gamma, or n_neighbors
# can shift performance noticeably.

# High Computational Cost for Some Models
# SVM and KNN are computationally expensive for large datasets. Grid Search
# for hyperparameter tuning can be slow without optimization techniques
# (see the sketch below).
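A minimal sketch of the grid search mentioned above; the parameter grid is illustrative, not tuned for this dataset, and cv=3 with n_jobs=-1 keeps the search reasonably fast:

# A minimal sketch: exhaustive grid search over an illustrative C/gamma
# grid for the RBF SVM, with 3-fold cross-validation on the training set.
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.1]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=3, n_jobs=-1)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)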

# Model Training and Testing in ML

# Data Preprocessing – Clean and prepare data by handling missing values,
# encoding categorical features, and scaling numerical data.

# Train-Test Split – Divide data into training (70-80%) and testing (20-30%)
# sets to ensure proper model evaluation (a cross-validation sketch follows
# this list).

# Model Training – The model learns patterns from the training data by
# adjusting its parameters based on the input features and target variable.

# Model Testing – The trained model is tested on unseen data (the test set)
# to evaluate its ability to generalize to new examples.

# Performance Evaluation – The model is assessed using metrics such as
# accuracy, precision, recall, F1-score, AUC-ROC, and the confusion matrix
# to determine its effectiveness.
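As referenced in the list above, a single train/test split gives one performance estimate; k-fold cross-validation averages over several. A minimal sketch (the fold count and model choice are assumptions):

# A minimal sketch: 5-fold cross-validation yields a more stable accuracy
# estimate than a single 80/20 split.
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5)
print("Fold accuracies:", cv_scores)
print("Mean accuracy:", cv_scores.mean())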

# Metrics for Evaluating ML Algorithms

# Classification Metrics (for Categorical Targets)

# Accuracy – Measures overall correctness (suitable for balanced datasets).

# Precision – Measures correctness of positive predictions (useful for
# imbalanced datasets).

# Recall (Sensitivity) – Measures how well actual positives are detected.

# F1-Score – Harmonic mean of precision and recall (best for imbalanced
# classes).

# Confusion Matrix – Shows true positives, true negatives, false positives,
# and false negatives (the sketch after this list derives precision and
# recall from it).

# AUC-ROC (Area Under Curve – Receiver Operating Characteristic) – Evaluates
# classification ability across thresholds.
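To make the link between the confusion matrix and the report's precision/recall concrete, a minimal sketch that derives both per class from the SVM predictions computed earlier:

# A minimal sketch: in sklearn's confusion matrix, rows are true labels
# and columns are predictions, so precision is diagonal / column sums and
# recall is diagonal / row sums.
cm = confusion_matrix(y_test, y_pred_svm)
precision = np.diag(cm) / cm.sum(axis=0)  # TP / (TP + FP), per column
recall = np.diag(cm) / cm.sum(axis=1)     # TP / (TP + FN), per row
for c, (p, r) in enumerate(zip(precision, recall)):
    print(f"class {c}: precision={p:.2f}, recall={r:.2f}")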

# Binarizing the test set labels only (AFTER splitting)
y_bin_test = label_binarize(y_test, classes=[0, 1, 2, 3])

# --- Accuracy and AUC Calculation ---
# Note: y_pred and y_prob_dt come from the most recently fitted Decision
# Tree (the 'gini' one), so its accuracy below is 0.83, not 0.8725.
models = ['Decision Tree', 'Random Forest', 'SVM', 'KNN']
accuracies = [
    accuracy_score(y_test, y_pred),
    accuracy_score(y_test, y_pred_rf),
    accuracy_score(y_test, y_pred_svm),
    accuracy_score(y_test, y_pred_knn)
]

# Multi-class AUC scores using the TEST SET ONLY
auc_scores = [
    roc_auc_score(y_bin_test, y_prob_dt, multi_class='ovr'),
    roc_auc_score(y_bin_test, y_prob_rf, multi_class='ovr'),
    roc_auc_score(y_bin_test, y_prob_svm, multi_class='ovr'),
    roc_auc_score(y_bin_test, y_prob_knn, multi_class='ovr')
]

# --- Plotting Accuracy and AUC ---
fig, ax = plt.subplots(1, 2, figsize=(14, 6))

# Accuracy Bar Graph
ax[0].bar(models, accuracies, color='skyblue')
ax[0].set_title('Model Accuracy Comparison')
ax[0].set_ylabel('Accuracy')
ax[0].set_ylim(0, 1)

# AUC Bar Graph
ax[1].bar(models, auc_scores, color='lightgreen')
ax[1].set_title('Model AUC Comparison')
ax[1].set_ylabel('AUC Score')
ax[1].set_ylim(0, 1)

plt.tight_layout()
plt.show()

# --- Print Model Performances ---
print("\n--- Model Performance ---")
for i, model in enumerate(models):
    print(f"{model}: Accuracy = {accuracies[i]:.4f}, AUC = {auc_scores[i]:.4f}")

--- Model Performance ---
Decision Tree: Accuracy = 0.8300, AUC = 0.9459
Random Forest: Accuracy = 0.8900, AUC = 0.9793
SVM: Accuracy = 0.9575, AUC = 0.9988
KNN: Accuracy = 0.9350, AUC = 0.9914
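The AUC values above each summarize an entire ROC curve. A minimal sketch plotting per-class one-vs-rest ROC curves for the best model, reusing the binarized test labels and SVM probabilities from above:

# A minimal sketch: one-vs-rest ROC curve per price_range class for the
# SVM, with per-class AUC in the legend and the chance diagonal for
# reference.
from sklearn.metrics import roc_curve, auc

for c in range(4):
    fpr, tpr, _ = roc_curve(y_bin_test[:, c], y_prob_svm[:, c])
    plt.plot(fpr, tpr, label=f"class {c} (AUC = {auc(fpr, tpr):.3f})")
plt.plot([0, 1], [0, 1], 'k--', label='chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('SVM One-vs-Rest ROC Curves')
plt.legend()
plt.show()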

# Recommended Model for Mobile Price Prediction

# Based on the evaluation metrics (accuracy and AUC-ROC), the Support Vector
# Machine (SVM) model performs the best with:
# Accuracy: 95.75%
# AUC-ROC: 0.9988

# Best Model: SVM (with RBF Kernel)

# Pros:
# High accuracy and robustness in high-dimensional spaces.
# Works well with non-linear data using the RBF kernel.
# Good generalization with proper hyperparameter tuning.

# Cons:
# Computationally expensive for very large datasets.