Experiment 4
Aim:
To use the KNN (K-Nearest Neighbours) technique for regression and classification tasks in the
domain of healthcare.
Objective:
▪ To identify datasets where KNN may be suitable.
▪ To determine the optimum K value for both regression and classification.
▪ To build, train and test models using the chosen value of K.
Outcomes:
▪ To understand the working of KNN.
▪ To be able to implement KNN for regression and classification tasks on a given dataset.
▪ To be able to compare models and select the optimal one among them.
Theory:
K-nearest neighbours is a simple, yet powerful machine learning algorithm that is often used
for both classification and regression tasks. It operates on the assumption that similar data
points exist in close proximity to each other in feature space. The KNN algorithm works by
identifying the nearest 'k' data points to a given query point and making predictions based on
these neighbours. The value of 'k' is a key parameter in the algorithm, determining how many
neighbours should influence the prediction. One of the primary advantages of KNN is its
simplicity, as it requires no explicit model training. Instead, it stores the entire dataset and
defers decision-making until a query point is presented. This characteristic makes KNN a lazy
learner, as it does not generalize from the training data beforehand. However, this also makes
KNN computationally expensive during prediction, especially for large datasets, as it requires
scanning the entire dataset to find the nearest neighbours. Additionally, the algorithm is
sensitive to feature scaling: if features are not appropriately scaled, those with larger numeric
ranges dominate the distance computation and hence the predictions.
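To illustrate the effect of feature scaling on the distance computation described above, the following small sketch (using hypothetical two-feature points, not the experiment's data) compares raw and standardized Euclidean distances:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different numeric scales
X = np.array([[1.0, 200.0], [2.0, 180.0], [1.5, 900.0]])
query = np.array([[1.2, 210.0]])

# Without scaling, the second feature dominates the Euclidean distance
raw_dist = np.linalg.norm(X - query, axis=1)

# After standardization, both features contribute comparably
scaler = StandardScaler().fit(X)
scaled_dist = np.linalg.norm(scaler.transform(X) - scaler.transform(query), axis=1)

print(raw_dist)
print(scaled_dist)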
In regression tasks, KNN predicts continuous values based on the values of the 'k' nearest
neighbours. Instead of assigning a class, it computes the average or weighted average of the
neighbours' target values and returns that as the prediction. This method makes KNN highly
intuitive for regression tasks because it assumes that similar data points should have similar
outputs. A lower value of 'k' might result in predictions being overly sensitive to noise, as only
very close neighbours will affect the outcome, leading to higher variance. Conversely, larger
values of 'k' tend to smooth out predictions by including more neighbours, but they might
also lead to bias if the neighbours are too distant or irrelevant. One of the key challenges of
using KNN for regression is deciding the optimal value of 'k', as different datasets may require
different values. Despite these challenges, KNN regression is widely used due to its simplicity
and effectiveness, especially when the dataset is small and noise levels are manageable.
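As an illustration of the averaging idea described above, a minimal NumPy sketch of KNN regression (on hypothetical data, not the experiment's dataset) could look like this:
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    # Distances from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # Prediction is the average of the neighbours' target values
    return y_train[nearest].mean()

# Hypothetical usage
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.1, 1.9, 3.2, 9.8])
print(knn_regress(X_train, y_train, np.array([2.5]), k=2))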
When used for classification, KNN predicts the class of a given query point based on the
majority class of its nearest neighbours. The algorithm checks the labels of the 'k' closest data
points and assigns the query point to the class that occurs most frequently among those
neighbours. If there is a tie, various strategies, such as distance weighting, can be applied to
break it. KNN classification works well for both binary and multi-class problems. Its
non-parametric nature means it makes no assumptions about the underlying distribution of
the data, making it flexible for different types of datasets.
However, its simplicity also introduces drawbacks, such as its sensitivity to outliers and
irrelevant features. Furthermore, as the size of the dataset grows, KNN classification can
become slower, since the algorithm must compute the distance from the query point to every
point in the dataset. Efficient implementation techniques, such as using KD-trees or Ball-trees,
can help alleviate these computational challenges, making KNN a versatile choice for many
classification problems.
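In scikit-learn, these tree-based neighbour searches can be requested through the estimator's algorithm parameter; a brief illustrative sketch is shown below:
from sklearn.neighbors import KNeighborsClassifier

# 'kd_tree' and 'ball_tree' build spatial index structures at fit time,
# speeding up neighbour search compared to a brute-force distance scan;
# 'auto' lets scikit-learn choose based on the data.
knn_kd = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn_ball = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')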
Dataset Description:
For the classification task, the Breast Cancer Wisconsin dataset was used.
The task was to identify whether a patient's tumour is malignant or benign based on various
physical measurements obtained from testing the patient.
For the regression task, a Hospital Stay dataset was used.
KNN was used to predict the number of days an admitted patient will stay in a particular
hospital based on the severity of the patient's illness, the doctor concerned, the department
concerned, and similar attributes.
Code:
For KNN Classification-
Filling in the missing data in the dataset using the column mean (simple average)
from sklearn.impute import SimpleImputer

# Replace missing values in each feature column with that column's mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(data.drop(columns=['diagnosis']))
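The snippets that follow refer to X_train_scaled, X_test_scaled, y_train and y_test, which are not created in the listing above; a likely preparation step (assumed here, with an 80/20 split and standard scaling) would be:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed split and scaling step; test_size and random_state are illustrative
y = data['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.2, random_state=42)

# Scale features so that no single measurement dominates the distances
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)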
Training KNN classifier models for k values in a range and plotting the validation error
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

errors = []
k_range = range(1, 16)
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    # Error rate = 1 - accuracy on the held-out test set
    error = 1 - knn.score(X_test_scaled, y_test)
    errors.append(error)

plt.figure(figsize=(10, 6))
plt.plot(k_range, errors, marker='o', linestyle='-')
plt.xlabel('Number of Neighbors K')
plt.ylabel('Error Rate')
plt.title('Elbow Method for Optimal K (Error vs K)')
plt.xticks(k_range)
plt.grid(True)
plt.show()
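The optimal k can also be read off programmatically from the same error list; the short addition below is an assumed convenience, not part of the original listing:
import numpy as np

# k value with the lowest validation error in the range tried
optimal_k = k_range[int(np.argmin(errors))]
print(f'Lowest error {min(errors):.3f} at k = {optimal_k}')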
Using the optimal k value found above to finalize model training
from sklearn.metrics import confusion_matrix, classification_report

optimal_k = 9
knn = KNeighborsClassifier(n_neighbors=optimal_k)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
print(f'Optimal K: {optimal_k}')
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))
Replacing the values of the 'Age' column with the midpoint of the range mentioned
def range_to_midpoint(age_range):
    # Convert a range string such as '21-30' to its midpoint
    start, end = age_range.split('-')
    return (int(start) + int(end)) / 2

df['Age'] = df['Age'].apply(range_to_midpoint)
Using one-hot encoding for all the categorical variables in the dataset
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

categorical_columns = ['Department', 'gender', 'Type of Admission',
                       'Severity of Illness', 'Insurance', 'Ward_Facility_Code',
                       'doctor_name', 'health_conditions']
numeric_columns = ['Available Extra Rooms in Hospital', 'staff_available', 'Age',
                   'Visitors with Patient', 'Admission_Deposit', 'Stay (in days)']

# One-hot encode the categorical columns and pass the numeric columns through
categorical_transformer = OneHotEncoder(sparse=False)
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_columns)
    ],
    remainder='passthrough'
)

df = preprocessor.fit_transform(df)
df = pd.DataFrame(df, columns=(
    list(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_columns))
    + numeric_columns
))
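As with the classifier, the regression snippets below use X_train_scaled and related variables that are not created in the listing; an assumed preparation step, separating the 'Stay (in days)' target from the encoded features, could be:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed target/feature split; test_size and random_state are illustrative
y = df['Stay (in days)']
X = df.drop(columns=['Stay (in days)'])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)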
Training KNN regressor models for k values in a range and plotting the R² score of each
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

neighbors = range(1, 16)
r2_scores = []
for k in neighbors:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    # R² on the held-out test set for this value of k
    r2 = r2_score(y_test, y_pred)
    r2_scores.append(r2)

plt.figure(figsize=(10, 6))
plt.plot(neighbors, r2_scores, marker='o', linestyle='-', color='b')
plt.title('R² Score vs. Number of Neighbors')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('R² Score')
plt.xticks(neighbors)
plt.grid(True)
plt.show()
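Analogously, the k with the highest R² could be located directly from the recorded scores; this is an assumed convenience step, not part of the original listing:
import numpy as np

# k value giving the highest R² on the test split
best_k = neighbors[int(np.argmax(r2_scores))]
print(f'Best R² {max(r2_scores):.4f} at k = {best_k}')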
Selecting the optimal k value for final training and printing the performance metrics of the final model
from sklearn.metrics import mean_squared_error

knn = KNeighborsRegressor(n_neighbors=9)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
r2 = r2_score(y_test, y_pred)
print(f'R² Score: {r2}')
Output:
For KNN Classifier-
Optimal K: 9
Confusion Matrix:
[[69  2]
 [ 2 41]]
Classification Report:
(per-class precision, recall, F1-score and support table as printed by classification_report)
Conclusion:
By performing this experiment, I was able to understand the concept of KNN and how it may
be used for regression and classification tasks. The following inferences can be drawn from
the experiment conducted:
▪ For KNN classification, plotting the validation error against different k values showed the
optimal k to be 9, with an error of 0.035; the accuracy of the optimal model was therefore
around 96%.
▪ For KNN regression, a similar plot (with R² on the y-axis) revealed the optimal k value as 9
(after which no further improvement was noted). The R² value of the final model was 0.8339,
while the MSE was 12.81.