
EC2011E Foundations of Machine Learning

Programming Assignment Report

Team Members:

Kaigala Mani Charan – B230999EC

Kamana Narendra Subbaraj – B231001EC

K Vinay – B230996EC
1. Binary Classification: Breast Cancer Wisconsin (Diagnostic) Dataset

1.1 Dataset Description

The Breast Cancer Wisconsin (Diagnostic) dataset is widely used for binary classification tasks in the
medical domain. It consists of 569 instances with 30 real-valued input features computed from
digitized images of fine needle aspirates (FNA) of breast masses. The diagnosis (target variable) has
two classes:

• M = Malignant (cancerous)
• B = Benign (non-cancerous)

For each of the 10 base features (radius, texture, perimeter, area, smoothness, compactness, concavity,
concave points, symmetry, and fractal dimension), the dataset provides three statistics, giving the
30 features in total (10 × 3):

• Mean
• Standard Error
• Worst (largest) value
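For reference, the same dataset can also be loaded directly from scikit-learn, avoiding the raw wdbc.data file; a minimal sketch (note that scikit-learn's 0/1 label convention is the reverse of the M → 1 mapping used below):

from sklearn.datasets import load_breast_cancer

# Bunch object with .data (569 x 30 feature matrix) and .target (0/1 labels)
# Caution: here target 0 = malignant, 1 = benign
cancer = load_breast_cancer()
print(cancer.data.shape, cancer.target_names)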

1.2 Preprocessing Steps

# Data loading and preprocessing
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

data = pd.read_csv("wdbc.data", header=None)

# Column names follow the dataset layout: 10 base features x 3 statistics
columns = ['ID', 'Diagnosis'] + [
    f"{feat}_{stat}" for stat in ['mean', 'se', 'worst'] for feat in [
        'radius', 'texture', 'perimeter', 'area', 'smoothness', 'compactness',
        'concavity', 'concave_points', 'symmetry', 'fractal_dimension']]
data.columns = columns

# Drop the ID column and encode the target: M (malignant) -> 1, B (benign) -> 0
data.drop('ID', axis=1, inplace=True)
data['Diagnosis'] = data['Diagnosis'].map({'M': 1, 'B': 0})

# Use only the 10 mean-valued features
features = [col for col in data.columns if '_mean' in col]
X = data[features]
y = data['Diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (fit on the training set only to avoid leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
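One caveat: as written, train_test_split does not stratify by class, and the target is imbalanced (357 benign vs 212 malignant). A hedged variant that keeps the class proportions consistent across splits:

# Stratified variant: preserves the benign/malignant ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)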

1.3 Models Implemented

• Naive Bayes Classifier using GaussianNB()
• K-Nearest Neighbors (KNN) with k = 5 using KNeighborsClassifier()

from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Naive Bayes (scale-invariant, so trained on the unscaled features)
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)

# KNN (distance-based, so trained on the standardized features)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)
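The results below were presumably printed with something like the following minimal sketch, using the metric helpers imported above:

# Print accuracy, confusion matrix, and per-class report for each model
for name, y_pred in [("Naive Bayes", y_pred_nb), ("KNN (k=5)", y_pred_knn)]:
    print(f"{name} Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))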

1.4 Evaluation Results


Naive Bayes Classifier Output:

Accuracy: 0.9474

Confusion Matrix:
[[70  1]
 [ 5 38]]

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.99      0.96        71
           1       0.97      0.88      0.93        43

    accuracy                           0.95       114
   macro avg       0.95      0.93      0.94       114
weighted avg       0.95      0.95      0.95       114

KNN Classifier Output (k = 5):

Accuracy: 0.9474

Confusion Matrix:
[[68  3]
 [ 3 40]]

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.96      0.96        71
           1       0.93      0.93      0.93        43

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114

2. PCA-Based Dimensionality Reduction


from sklearn.decomposition import PCA

for k in [10, 9, 8]:
    # Fit PCA on the scaled training data and project both splits
    pca = PCA(n_components=k)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    nb_pca = GaussianNB()
    nb_pca.fit(X_train_pca, y_train)
    print(f"Naive Bayes Accuracy with PCA-{k}:",
          accuracy_score(y_test, nb_pca.predict(X_test_pca)))

    knn_pca = KNeighborsClassifier(n_neighbors=5)
    knn_pca.fit(X_train_pca, y_train)
    print(f"KNN Accuracy with PCA-{k}:",
          accuracy_score(y_test, knn_pca.predict(X_test_pca)))

PCA Results (k = number of principal components):

Principal Components    Naive Bayes Accuracy    KNN Accuracy
                  10                  0.9123          0.9474
                   9                  0.9211          0.9474
                   8                  0.9211          0.9474

Note that only the 10 mean features are used as input, so k = 10 retains every dimension and is a pure rotation. KNN with Euclidean distance is rotation-invariant, which is why its accuracy is unchanged; Gaussian Naive Bayes is not, since the rotation changes how well its feature-independence assumption holds.
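To choose k more systematically, one could inspect the cumulative explained variance of a full PCA fit; a minimal sketch (pca_full is a name introduced here for illustration):

import numpy as np

# Fit PCA with all components and report cumulative variance captured
pca_full = PCA().fit(X_train_scaled)
print(np.cumsum(pca_full.explained_variance_ratio_))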

3. KNN Hyperparameter Tuning


import matplotlib.pyplot as plt

k_values = list(range(1, 16))
accuracies = []

for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train_scaled, y_train)
    acc = model.score(X_test_scaled, y_test)
    accuracies.append(acc)

# Plot test accuracy as a function of k
plt.figure(figsize=(8, 5))
plt.plot(k_values, accuracies, marker='o')
plt.title("KNN Accuracy vs k")
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.grid()
plt.show()

[Figure: line plot of KNN test accuracy vs k for k = 1 to 15; accuracy peaks near k = 5]

Observation:

• The highest accuracy was observed around k = 5.
• Smaller k risks overfitting (high variance); larger k risks underfitting (high bias). A more robust way to select k is cross-validation, sketched below.
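Selecting k by test-set accuracy risks tuning to the test set; a hedged sketch of cross-validated selection on the training split instead, using scikit-learn's cross_val_score:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the training set for each candidate k
for k in range(1, 16):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train_scaled, y_train, cv=5)
    print(k, scores.mean())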

4. Multi-Class Classification: Car Evaluation Dataset

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

car_data = pd.read_csv("car.data", header=None)
car_data.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# Encode the six categorical attributes as integers
encoder = OrdinalEncoder()
X_car = encoder.fit_transform(car_data.drop('class', axis=1))
y_car = car_data['class']

Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_car, y_car, test_size=0.2, random_state=42)
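By default, OrdinalEncoder orders categories alphabetically, which does not match the attributes' natural ordering (e.g. low < med < high < vhigh for buying). A hedged drop-in alternative passing explicit orderings, with category values taken from the UCI dataset description:

# Explicit category orders per attribute (as listed in the UCI description)
encoder = OrdinalEncoder(categories=[
    ['low', 'med', 'high', 'vhigh'],   # buying
    ['low', 'med', 'high', 'vhigh'],   # maint
    ['2', '3', '4', '5more'],          # doors
    ['2', '4', 'more'],                # persons
    ['small', 'med', 'big'],           # lug_boot
    ['low', 'med', 'high'],            # safety
])

Tree-based models are fairly insensitive to the exact integer coding, so the alphabetical default still works here, but the explicit ordering makes the encoded values meaningful.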

# Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(Xc_train, yc_train)
yc_pred_dt = dt.predict(Xc_test)

Decision Tree Results:

Accuracy: 0.9740

Classification Report:
              precision    recall  f1-score   support

         acc       0.97      0.92      0.94        83
        good       0.62      0.91      0.74        11
       unacc       1.00      1.00      1.00       235
       vgood       1.00      0.94      0.97        17

    accuracy                           0.97       346
   macro avg       0.90      0.94      0.91       346
weighted avg       0.98      0.97      0.98       346

# Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(Xc_train, yc_train)
yc_pred_rf = rf.predict(Xc_test)

Random Forest Results:

Accuracy: 0.9740

Classification Report:
              precision    recall  f1-score   support

         acc       0.99      0.90      0.94        83
        good       0.65      1.00      0.79        11
       unacc       0.99      1.00      1.00       235
       vgood       1.00      0.94      0.97        17

    accuracy                           0.97       346
   macro avg       0.91      0.96      0.92       346
weighted avg       0.98      0.97      0.98       346
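The report tunes hyperparameters only for KNN; the Random Forest could be tuned the same way. A hedged GridSearchCV sketch (the grid values are illustrative assumptions, not settings from the assignment):

from sklearn.model_selection import GridSearchCV

# Illustrative grid; cross-validated search over forest size and depth
param_grid = {'n_estimators': [100, 200, 400], 'max_depth': [None, 8, 16]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(Xc_train, yc_train)
print(grid.best_params_, grid.best_score_)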

5. Conclusion

• KNN matched Naive Bayes on overall binary-classification accuracy (0.9474) but produced more balanced per-class recall, and it benefited clearly from feature scaling.

• PCA reduced the feature space from 10 to 8 components with no loss in KNN accuracy (0.9474), though Naive Bayes accuracy dropped slightly.

• k = 5 was optimal for KNN on this dataset.

• On the multi-class car dataset, Random Forest matched the Decision Tree's overall accuracy (0.9740) but achieved better macro-averaged precision and recall, reflecting the stronger generalization of ensemble learning.

• The assignment highlights the importance of preprocessing, model selection, and hyperparameter tuning in practical ML applications.

6. References

• UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) and Car Evaluation datasets

• scikit-learn documentation (https://scikit-learn.org/)

• Course lecture slides and notes
