Vertopal.com_ML LAB 8

The document outlines a machine learning project focused on predicting mobile phone price ranges using various classification algorithms, including Decision Trees, Random Forests, Support Vector Machines (SVM), and K-Nearest Neighbors (KNN). The SVM model achieved the highest accuracy of 95.75% and an AUC-ROC score of 0.9988, making it the recommended model for mobile price prediction. It also discusses challenges in traditional methods, data preprocessing, model training/testing, and evaluation metrics.

Sugandh 06701192023

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, label_binarize
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("train.csv - train.csv.csv")
df.head()

   battery_power  blue  clock_speed  dual_sim  fc  four_g  int_memory  m_dep  \
0            842     0          2.2         0   1       0           7    0.6
1           1021     1          0.5         1   0       1          53    0.7
2            563     1          0.5         1   2       1          41    0.9
3            615     1          2.5         0   0       0          10    0.8
4           1821     1          1.2         0  13       1          44    0.6

   mobile_wt  n_cores  ...  px_height  px_width   ram  sc_h  sc_w  talk_time  \
0        188        2  ...         20       756  2549     9     7         19
1        136        3  ...        905      1988  2631    17     3          7
2        145        5  ...       1263      1716  2603    11     2          9
3        131        6  ...       1216      1786  2769    16     8         11
4        141        2  ...       1208      1212  1411     8     2         15

   three_g  touch_screen  wifi  price_range
0        0             0     1            1
1        1             1     0            2
2        1             1     0            2
3        1             0     0            2
4        1             1     0            1

[5 rows x 21 columns]

df.isnull().sum()

battery_power 0
blue 0
clock_speed 0
dual_sim 0
fc 0
four_g 0
int_memory 0
m_dep 0
mobile_wt 0
n_cores 0
pc 0
px_height 0
px_width 0
ram 0
sc_h 0
sc_w 0
talk_time 0
three_g 0
touch_screen 0
wifi 0
price_range 0
dtype: int64

# Split features and target
X = df.drop("price_range", axis=1)  # Independent variables
y = df["price_range"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

dt_model = DecisionTreeClassifier(criterion='entropy', max_depth=5,
                                  random_state=42)  # 'gini' is also an option
dt_model.fit(X_train, y_train)

y_pred = dt_model.predict(X_test)
y_prob_dt = dt_model.predict_proba(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.8725

Confusion Matrix:
 [[90 10  0  0]
 [ 4 89  7  0]
 [ 0 11 76 13]
 [ 0  0  6 94]]

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.90      0.93       100
           1       0.81      0.89      0.85       100
           2       0.85      0.76      0.80       100
           3       0.88      0.94      0.91       100

    accuracy                           0.87       400
   macro avg       0.87      0.87      0.87       400
weighted avg       0.87      0.87      0.87       400

rf_model = RandomForestClassifier(n_estimators=100, random_state=42,
                                  max_depth=10)
rf_model.fit(X_train, y_train)

RandomForestClassifier(max_depth=10, random_state=42)

# Re-train the Decision Tree with the 'gini' criterion for comparison.
# Note: this overwrites dt_model, y_pred, and y_prob_dt, so the gini
# tree's predictions are the ones used in the model comparison below.
dt_model = DecisionTreeClassifier(criterion='gini', max_depth=5,
                                  random_state=42)
dt_model.fit(X_train, y_train)

y_pred = dt_model.predict(X_test)
y_prob_dt = dt_model.predict_proba(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.83

Confusion Matrix:
 [[89 11  0  0]
 [ 6 82 12  0]
 [ 0 17 75  8]
 [ 0  0 14 86]]

Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.89      0.91       100
           1       0.75      0.82      0.78       100
           2       0.74      0.75      0.75       100
           3       0.91      0.86      0.89       100

    accuracy                           0.83       400
   macro avg       0.83      0.83      0.83       400
weighted avg       0.83      0.83      0.83       400

y_pred_rf = rf_model.predict(X_test)
y_prob_rf = rf_model.predict_proba(X_test)

# Accuracy Score
accuracy = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy: {accuracy:.4f}")

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred_rf))

# Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))

Accuracy: 0.8900
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.95      0.95       100
           1       0.82      0.84      0.83       100
           2       0.84      0.82      0.83       100
           3       0.95      0.95      0.95       100

    accuracy                           0.89       400
   macro avg       0.89      0.89      0.89       400
weighted avg       0.89      0.89      0.89       400

Confusion Matrix:
 [[95  5  0  0]
 [ 5 84 11  0]
 [ 0 13 82  5]
 [ 0  0  5 95]]
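
# The fitted Random Forest also exposes per-feature importances, which bear
# on the feature-selection concerns raised in the challenges notes further
# down. A minimal inspection sketch (an addition, not part of the original
# lab output); on this dataset, ram usually ranks near the top.

# feature_importances_ sums to 1 across the 20 feature columns.
importances = pd.Series(rf_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))

# Bar chart of the ten most important features
importances.sort_values(ascending=False).head(10).plot(kind='barh')
plt.title('Random Forest Feature Importances (Top 10)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()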

# --- SVM Model with Probability Enabled ---
svm_model = SVC(kernel='rbf', C=1.0, gamma='scale',
                probability=True)  # probability=True enables predict_proba
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)
y_prob_svm = svm_model.predict_proba(X_test)  # Works now without error

print("\n--- SVM Results ---")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print(confusion_matrix(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))

--- SVM Results ---
Accuracy: 0.9575
[[100   0   0   0]
 [  2  97   1   0]
 [  0   7  89   4]
 [  0   0   3  97]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       100
           1       0.93      0.97      0.95       100
           2       0.96      0.89      0.92       100
           3       0.96      0.97      0.97       100

    accuracy                           0.96       400
   macro avg       0.96      0.96      0.96       400
weighted avg       0.96      0.96      0.96       400

# --- KNN Model ---
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_test)
y_prob_knn = knn_model.predict_proba(X_test)
print("\n--- KNN Results ---")
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print(confusion_matrix(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))

--- KNN Results ---
Accuracy: 0.935
[[99  1  0  0]
 [ 2 93  5  0]
 [ 0  7 87  6]
 [ 0  0  5 95]]
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       100
           1       0.92      0.93      0.93       100
           2       0.90      0.87      0.88       100
           3       0.94      0.95      0.95       100

    accuracy                           0.94       400
   macro avg       0.93      0.94      0.93       400
weighted avg       0.93      0.94      0.93       400

# Challenges in Traditional Methods:

# Feature Engineering Complexity
# Traditional models like Decision Trees and KNN rely heavily on manual
# feature selection. Feature importance needs to be analyzed carefully to
# avoid irrelevant or redundant features.

# Scalability Issues
# KNN struggles with large datasets: with brute-force search, every
# prediction is compared against all n training samples, so query cost
# grows linearly with the training set (and the naive all-pairs distance
# computation is O(n^2)).
# Decision Trees may become too deep, leading to overfitting.

# Hyperparameter Sensitivity

# High Computational Cost for Some Models
# SVM and KNN are computationally expensive for large datasets.
# Grid Search for hyperparameter tuning can be slow without optimization
# techniques (a tuning sketch follows this block).
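
# To make the grid-search point concrete, here is a minimal sketch (an
# addition, not part of the original lab) of tuning the SVM's C and gamma
# with sklearn's GridSearchCV; the parameter grid is an illustrative
# choice, not the configuration used above.

from sklearn.model_selection import GridSearchCV

# Illustrative grid; wider grids multiply the fit count quickly
# (len(C) * len(gamma) * cv folds fits in total).
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.001]}

grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5,
                    scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)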

# Model Training and Testing in ML

# Data Preprocessing – Clean and prepare data by handling missing values,
# encoding categorical features, and scaling numerical data (see the
# scaling sketch after this list).

# Train-Test Split – Divide data into training (70-80%) and testing
# (20-30%) sets to ensure proper model evaluation.

# Model Training – The model learns patterns from the training data by
# adjusting its parameters based on the input features and target variable.

# Model Testing – The trained model is tested on unseen data (test set)
# to evaluate its ability to generalize to new examples.

# Performance Evaluation – The model is assessed using metrics such as
# accuracy, precision, recall, F1-score, AUC-ROC, and the confusion matrix
# to determine its effectiveness.
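
# The notebook imports StandardScaler but never applies it; distance- and
# margin-based models such as KNN and SVM are generally sensitive to
# feature scale. A minimal sketch (an addition, not the original
# procedure) of folding scaling into the SVM via a Pipeline, so the scaler
# is fit on the training split only and no test-set information leaks in:

from sklearn.pipeline import make_pipeline

scaled_svm = make_pipeline(StandardScaler(),
                           SVC(kernel='rbf', probability=True))
scaled_svm.fit(X_train, y_train)
print("Scaled SVM accuracy:",
      accuracy_score(y_test, scaled_svm.predict(X_test)))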

# Metrics for Evaluating ML Algorithms

# Classification Metrics (for Categorical Targets)

# Accuracy – Measures overall correctness (suitable for balanced datasets).

# Precision – Measures correctness of positive predictions (useful for
# imbalanced datasets).

# Recall (Sensitivity) – Measures how well actual positives are detected.

# F1-Score – Harmonic mean of precision and recall (best for imbalanced
# classes).

# Confusion Matrix – Shows true positives, true negatives, false positives,
# and false negatives.

# AUC-ROC (Area Under Curve – Receiver Operating Characteristic) –
# Evaluates classification ability across thresholds (see the per-class
# ROC sketch after this list).
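
# To complement the AUC bar chart below, per-class (one-vs-rest) ROC
# curves can be drawn from binarized labels and predicted probabilities.
# A minimal sketch (an addition to the original notebook), shown here for
# the SVM's probabilities computed earlier:

from sklearn.metrics import roc_curve, auc

# One curve per price_range class, treating each class as "positive"
y_bin = label_binarize(y_test, classes=[0, 1, 2, 3])
for k in range(4):
    fpr, tpr, _ = roc_curve(y_bin[:, k], y_prob_svm[:, k])
    plt.plot(fpr, tpr, label=f"class {k} (AUC = {auc(fpr, tpr):.3f})")
plt.plot([0, 1], [0, 1], 'k--', label='chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('SVM One-vs-Rest ROC Curves')
plt.legend()
plt.show()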

# Binarizing the test set labels only (AFTER splitting)
y_bin_test = label_binarize(y_test, classes=[0, 1, 2, 3])

# --- Accuracy and AUC Calculation ---
models = ['Decision Tree', 'Random Forest', 'SVM', 'KNN']
accuracies = [
    accuracy_score(y_test, y_pred),  # gini Decision Tree (latest fit)
    accuracy_score(y_test, y_pred_rf),
    accuracy_score(y_test, y_pred_svm),
    accuracy_score(y_test, y_pred_knn)
]

# Multi-class AUC scores using the TEST SET ONLY
auc_scores = [
    roc_auc_score(y_bin_test, y_prob_dt, multi_class='ovr'),
    roc_auc_score(y_bin_test, y_prob_rf, multi_class='ovr'),
    roc_auc_score(y_bin_test, y_prob_svm, multi_class='ovr'),
    roc_auc_score(y_bin_test, y_prob_knn, multi_class='ovr')
]

# --- Plotting Accuracy and AUC ---
fig, ax = plt.subplots(1, 2, figsize=(14, 6))

# Accuracy Bar Graph
ax[0].bar(models, accuracies, color='skyblue')
ax[0].set_title('Model Accuracy Comparison')
ax[0].set_ylabel('Accuracy')
ax[0].set_ylim(0, 1)

# AUC Bar Graph
ax[1].bar(models, auc_scores, color='lightgreen')
ax[1].set_title('Model AUC Comparison')
ax[1].set_ylabel('AUC Score')
ax[1].set_ylim(0, 1)

plt.tight_layout()
plt.show()

# --- Print Model Performances ---
print("\n--- Model Performance ---")
for i, model in enumerate(models):
    print(f"{model}: Accuracy = {accuracies[i]:.4f}, AUC = {auc_scores[i]:.4f}")

--- Model Performance ---
Decision Tree: Accuracy = 0.8300, AUC = 0.9459
Random Forest: Accuracy = 0.8900, AUC = 0.9793
SVM: Accuracy = 0.9575, AUC = 0.9988
KNN: Accuracy = 0.9350, AUC = 0.9914

# Recommended Model for Mobile Price Prediction

# Based on the evaluation metrics above (accuracy and AUC-ROC), the
# Support Vector Machine (SVM) model performs best, with:
# Accuracy: 95.75%
# AUC-ROC: 0.9988

# Best Model: SVM (with RBF Kernel)

# Pros:
# High accuracy and robustness in high-dimensional spaces.
# Works well with non-linear data using the RBF kernel.
# Good generalization with proper hyperparameter tuning.

# Cons:
# Computationally expensive for very large datasets.
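
# As a usage sketch (an addition to the lab, using a made-up handset spec
# rather than real data), the chosen SVM can score a new phone by passing
# a one-row DataFrame with the same 20 feature columns used in training:

# Hypothetical handset spec; values are illustrative only.
new_phone = pd.DataFrame([{
    'battery_power': 1500, 'blue': 1, 'clock_speed': 2.0, 'dual_sim': 1,
    'fc': 5, 'four_g': 1, 'int_memory': 32, 'm_dep': 0.5, 'mobile_wt': 150,
    'n_cores': 4, 'pc': 12, 'px_height': 1080, 'px_width': 1920,
    'ram': 3000, 'sc_h': 14, 'sc_w': 7, 'talk_time': 12, 'three_g': 1,
    'touch_screen': 1, 'wifi': 1
}], columns=X.columns)  # reorder to the training column order

print("Predicted price_range:", svm_model.predict(new_phone)[0])
print("Class probabilities:", svm_model.predict_proba(new_phone)[0])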
