0% found this document useful (0 votes)
8 views11 pages

DSASSign 4

The document outlines an assignment on applying the K-Nearest Neighbors (KNN) algorithm for statistical and predictive modeling using the Breast Cancer dataset in Python. It details the steps of dataset selection, data preprocessing, exploratory data analysis, KNN implementation, model training, evaluation, and optimization through hyperparameter tuning. The optimized KNN model achieved improved performance metrics, demonstrating the algorithm's relevance in classification tasks.

Uploaded by

Nasir khan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views11 pages

DSASSign 4

The document outlines an assignment on applying the K-Nearest Neighbors (KNN) algorithm for statistical and predictive modeling using the Breast Cancer dataset in Python. It details the steps of dataset selection, data preprocessing, exploratory data analysis, KNN implementation, model training, evaluation, and optimization through hyperparameter tuning. The optimized KNN model achieved improved performance metrics, demonstrating the algorithm's relevance in classification tasks.

Uploaded by

Nasir khan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 11

Assignment #24

Name: Nasir Khan Sayyad


Reg. No: FA19-BCS-212
Teacher: Ma'am Tariq Urooj

Title: Applying K-Nearest


Neighbors (KNN) Algorithm for
Statistical and Predictive
Modeling in R or (Python)

1. Dataset Selection:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Set random seed for reproducibility


np.random.seed(42)

# Task 1: Dataset Selection


# Load the Breast Cancer dataset from scikit-learn
data = load_breast_cancer()
X, y = data.data, data.target

2. Data Exploration and Preprocessing:

Code and Steps

# Task 2: Data Preprocessing


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the feature values


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

3. Exploratory Data Analysis:

# Task 3: Exploratory Data Analysis


import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Convert the feature matrix X into a pandas DataFrame


df = pd.DataFrame(X_train, columns=data.feature_names)

# Add the target variable to the DataFrame


df['target'] = y_train

# Calculate and plot the correlation matrix


corr_matrix = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# Plot histograms of the features


plt.figure(figsize=(12, 10))
for i, feature in enumerate(data.feature_names):
plt.subplot(5, 6, i+1)
sns.histplot(df[feature], kde=True)
plt.title(feature)
plt.tight_layout()
plt.show()

# Plot box plots of the features by target variable


plt.figure(figsize=(12, 10))
for i, feature in enumerate(data.feature_names):
plt.subplot(5, 6, i+1)
sns.boxplot(x='target', y=feature, data=df)
plt.title(feature)
plt.tight_layout()
plt.show()
Output:

4. KNN Algorithm Implementation:


# Task 4: KNN Algorithm Implementation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Create a KNN classifier


knn_model = KNeighborsClassifier(n_neighbors=5)

5. Model Training and Evaluation:

# Task 5: Model Training and Evaluation


# Train the KNN model on the training dataset
knn_model.fit(X_train, y_train)

# Predict on the test set


y_pred = knn_model.predict(X_test)

# Evaluate the model's performance


accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("\n\n")
print("KNN Model Performance:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("\n\n")

Output:
6. Model Optimization and Hyperparameter Tuning:

# Task 6: Model Optimization and Hyperparameter Tuning


from sklearn.model_selection import GridSearchCV

# Define the parameter grid for grid search


param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}

# Perform grid search to find the best K value


grid_search = GridSearchCV(knn_model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best K value and retrain the model


best_k = grid_search.best_params_['n_neighbors']
knn_model = KNeighborsClassifier(n_neighbors=best_k)
knn_model.fit(X_train, y_train)

# Predict on the test set with the optimized model


y_pred_optimized = knn_model.predict(X_test)

# Evaluate the optimized model's performance


accuracy_optimized = accuracy_score(y_test, y_pred_optimized)
precision_optimized = precision_score(y_test, y_pred_optimized)
recall_optimized = recall_score(y_test, y_pred_optimized)
f1_optimized = f1_score(y_test, y_pred_optimized)

print("\n\nAfter optimization")

print("Optimized KNN Model Performance:")


print("Accuracy:", accuracy_optimized)
print("Precision:", precision_optimized)
print("Recall:", recall_optimized)
print("F1-score:", f1_optimized)

Output:
6. Conclusion and Discussion

Summary:
 The KNN algorithm was applied to the breast cancer dataset using Python.
 The initial KNN model achieved an accuracy, precision, recall, and F1-score
of 94.74%.
 Hyperparameter tuning using grid search found the optimal value of K to be
3.
 The optimized KNN model achieved an accuracy, precision, recall, and F1-
score of 95.74%.
 The breast cancer dataset contained 569 samples and 30 features, with class
labels of "malignant" and "benign".
 The KNN algorithm is relevant and applicable in real-world scenarios for
classification tasks.
 The algorithm has strengths in its simplicity, versatility, and ability to handle
various datasets.
 Limitations of KNN include computational complexity, sensitivity to
parameter choices, and equal weighting of features.
 Potential improvements include using advanced distance metrics, feature
selection, and dimensionality reduction techniques.
 The optimized KNN model showed promising results in classifying breast
cancer cases.
 Further research and experimentation can be done to explore other
techniques and algorithms for improved performance.
All Source Code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 5 18:39:26 2023

@author: nasir
"""

# Task 1: Dataset Selection


from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset


data = load_breast_cancer()

# Print a brief description of the dataset


print("Breast Cancer Dataset:")
print("Number of samples:", data.data.shape[0])
print("Number of features:", data.data.shape[1])
print("Class labels:", data.target_names)

# Task 2: Data Preprocessing


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2,
random_state=42)

# Perform feature scaling


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Task 3: Exploratory Data Analysis


import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Convert the feature matrix X into a pandas DataFrame


df = pd.DataFrame(X_train, columns=data.feature_names)

# Add the target variable to the DataFrame


df['target'] = y_train

# Calculate and plot the correlation matrix


corr_matrix = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# Plot histograms of the features


plt.figure(figsize=(12, 10))
for i, feature in enumerate(data.feature_names):
plt.subplot(5, 6, i+1)
sns.histplot(df[feature], kde=True)
plt.title(feature)
plt.tight_layout()
plt.show()

# Plot box plots of the features by target variable


plt.figure(figsize=(12, 10))
for i, feature in enumerate(data.feature_names):
plt.subplot(5, 6, i+1)
sns.boxplot(x='target', y=feature, data=df)
plt.title(feature)
plt.tight_layout()
plt.show()

# Task 4: KNN Algorithm Implementation


from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Create a KNN classifier


knn_model = KNeighborsClassifier(n_neighbors=5)

# Task 5: Model Training and Evaluation


# Train the KNN model on the training dataset
knn_model.fit(X_train, y_train)

# Predict on the test set


y_pred = knn_model.predict(X_test)
# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("\n\n")
print("KNN Model Performance:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("\n\n")

# Task 6: Model Optimization and Hyperparameter Tuning


from sklearn.model_selection import GridSearchCV

# Define the parameter grid for grid search


param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}

# Perform grid search to find the best K value


grid_search = GridSearchCV(knn_model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best K value and retrain the model


best_k = grid_search.best_params_['n_neighbors']
knn_model = KNeighborsClassifier(n_neighbors=best_k)
knn_model.fit(X_train, y_train)

# Predict on the test set with the optimized model


y_pred_optimized = knn_model.predict(X_test)

# Evaluate the optimized model's performance


accuracy_optimized = accuracy_score(y_test, y_pred_optimized)
precision_optimized = precision_score(y_test, y_pred_optimized)
recall_optimized = recall_score(y_test, y_pred_optimized)
f1_optimized = f1_score(y_test, y_pred_optimized)

print("\n\nAfter optimization")

print("Optimized KNN Model Performance:")


print("Accuracy:", accuracy_optimized)
print("Precision:", precision_optimized)
print("Recall:", recall_optimized)
print("F1-score:", f1_optimized)

You might also like