Machine Learning Assignment (1)
Team Members:
K Vinay – B230996EC
1. Binary Classification: Breast Cancer Wisconsin (Diagnostic) Dataset
The Breast Cancer Wisconsin (Diagnostic) dataset is widely used for binary classification tasks in the
medical domain. It consists of 569 instances with 30 real-valued input features computed from
digitized images of fine needle aspirates (FNA) of breast masses. The diagnosis (target variable) has
two classes:
M = Malignant (cancerous)
B = Benign (non-cancerous)
For each of the 10 base features (radius, texture, perimeter, area, smoothness, compactness, concavity,
concave points, symmetry, and fractal dimension), the dataset provides three values:
Mean
Standard Error
Worst (mean of the three largest values)
These three statistics for each of the 10 base features give the 30 input features in total.
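As a quick check of these properties, the same data is also available as a built-in scikit-learn dataset; a minimal sketch (only a sanity check, the assignment itself works from the CSV file as shown below):

from sklearn.datasets import load_breast_cancer
import pandas as pd

# Built-in copy of the WDBC data: 569 samples, 30 features
cancer = load_breast_cancer()
X_all = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y_all = pd.Series(cancer.target).map({0: 'M', 1: 'B'})   # target 0 = malignant, 1 = benign

print(X_all.shape)            # (569, 30)
print(y_all.value_counts())   # class balance between B and M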
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data.columns = columns        # 'data', 'columns' and 'features' come from the CSV-loading step (not shown)
X = data[features]            # the 30 feature columns
y = data['Diagnosis']         # 'M' / 'B' labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # scaler fitted on the training data only
X_test_scaled = scaler.transform(X_test)
Naive Bayes
from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes on the raw (unscaled) features
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
KNN
from sklearn.neighbors import KNeighborsClassifier

# k-nearest neighbours (k = 5) on the standardised features
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)
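The confusion matrices and classification reports below were presumably produced with scikit-learn's metrics utilities; a minimal sketch of that evaluation step, reusing the prediction variables defined above:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

for name, y_pred in [("Naive Bayes", y_pred_nb), ("KNN", y_pred_knn)]:
    print(name)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("Classification Report:")
    print(classification_report(y_test, y_pred))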
Naive Bayes results
Confusion Matrix:
[[70  1]
 [ 5 38]]
Classification Report:

KNN results
Confusion Matrix:
[[68  3]
 [ 3 40]]
Classification Report:
from sklearn.decomposition import PCA

# Project the standardised features onto the top k principal components
pca = PCA(n_components=k)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

nb_pca = GaussianNB()
nb_pca.fit(X_train_pca, y_train)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
k (PCA components)   Naive Bayes accuracy   KNN accuracy
10                   0.9123                 0.9474
9                    0.9211                 0.9474
8                    0.9211                 0.9474
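The table above was presumably produced by repeating the PCA pipeline for each component count; a minimal sketch under that assumption (the column order, Naive Bayes then KNN, follows the order of the models above, and only k = 10, 9, 8 appear in the report):

from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

for k in [10, 9, 8]:    # component counts listed in the table
    pca = PCA(n_components=k)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    nb_k = GaussianNB().fit(X_train_pca, y_train)
    knn_k = KNeighborsClassifier(n_neighbors=5).fit(X_train_pca, y_train)

    nb_acc = accuracy_score(y_test, nb_k.predict(X_test_pca))
    knn_acc = accuracy_score(y_test, knn_k.predict(X_test_pca))
    print(k, round(nb_acc, 4), round(knn_acc, 4))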
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

k_values = range(1, 21)    # assumed range; the report does not list the exact k values tested
accuracies = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train_scaled, y_train)
    acc = accuracy_score(y_test, model.predict(X_test_scaled))
    accuracies.append(acc)

plt.figure(figsize=(8, 5))
plt.plot(k_values, accuracies, marker='o')
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.grid()
plt.show()
Plot: KNN test accuracy as a function of k (figure not reproduced here).
Observation:
The second task uses the multi-class car evaluation dataset, whose categorical attributes are ordinal-encoded before training.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()    # maps each categorical attribute to integer codes
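How the car data is loaded and split is not shown in the report; a minimal sketch, assuming the UCI Car Evaluation file (here called 'car.data') with its standard attribute names and an 80/20 hold-out split (both assumptions):

from sklearn.model_selection import train_test_split

# Assumed file name and attribute names for the Car Evaluation dataset
cols = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
car = pd.read_csv('car.data', header=None, names=cols)

# Ordinal-encode the six categorical features; keep the class label as the target
Xc = pd.DataFrame(encoder.fit_transform(car[cols[:-1]]), columns=cols[:-1])
yc = car['class']
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, test_size=0.2, random_state=42)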
Decision Tree
from sklearn.tree import DecisionTreeClassifier

# Single decision tree on the encoded car features
dt = DecisionTreeClassifier(random_state=42)
dt.fit(Xc_train, yc_train)
yc_pred_dt = dt.predict(Xc_test)
Classification Report:
Random Forest
from sklearn.ensemble import RandomForestClassifier

# Ensemble of decision trees; averaging over many trees reduces variance compared to a single tree
rf = RandomForestClassifier(random_state=42)
rf.fit(Xc_train, yc_train)
yc_pred_rf = rf.predict(Xc_test)
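As with the binary task, the two tree-based models were presumably compared with accuracy and a per-class classification report; a minimal sketch:

from sklearn.metrics import accuracy_score, classification_report

for name, y_pred in [("Decision Tree", yc_pred_dt), ("Random Forest", yc_pred_rf)]:
    print(name)
    print("Accuracy:", accuracy_score(yc_test, y_pred))
    print(classification_report(yc_test, y_pred))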
5. Conclusion
KNN slightly outperformed Naive Bayes for binary classification, especially with scaling.
PCA reduced dimensionality while maintaining high accuracy, especially for KNN.
Random Forest outperformed Decision Tree on the multi-class car dataset due to better
generalization from ensemble learning.