Machine Learning Assignment 3

This document describes implementing a KNN classifier on a CSV dataset containing medical data. The KNN algorithm works by finding the distances between query data and examples in the dataset, selecting the K closest examples, and predicting the class or average of labels. The code loads the CSV file, preprocesses the data by filling missing values and normalizing features, splits the data into training and test sets, trains a KNN model with 5 neighbors on the training set, makes predictions on the test set, and evaluates accuracy.

Uploaded by

Yellow Moustache
Copyright
© All Rights Reserved

Machine Learning Lab Assessment 3

18BCE2301
Devangshu Mazumder

Aim:
Design and implement a KNN classifier using a CSV file

CSV file: processed.cleveland.data

Abstract:
KNN stands for “K-Nearest Neighbours”. It is a supervised machine learning
algorithm that can be used for both classification and regression problems.
'K' denotes the number of nearest neighbours consulted when predicting or
classifying a new, unseen data point.

KNN works by computing the distances between a query and all the examples in the data, selecting
the specified number (K) of examples closest to the query, and then voting for the most frequent
label (in the case of classification) or averaging the labels (in the case of regression).
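
The three steps described above (measure distances, keep the K closest examples, then vote or average) can be sketched in plain Python. This is only an illustrative sketch and not the scikit-learn implementation used in the sample code: the function names (euclidean, k_nearest, knn_classify, knn_regress) and the toy data are hypothetical, and Euclidean distance is assumed as the distance measure.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(train_X, train_y, query, k):
    """Return the labels of the k training examples closest to the query."""
    order = sorted(range(len(train_X)),
                   key=lambda i: euclidean(train_X[i], query))
    return [train_y[i] for i in order[:k]]


def knn_classify(train_X, train_y, query, k):
    """Classification: majority vote among the k nearest labels."""
    votes = Counter(k_nearest(train_X, train_y, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k):
    """Regression: mean of the k nearest labels."""
    labels = k_nearest(train_X, train_y, query, k)
    return sum(labels) / len(labels)


# Toy data, made up for illustration only
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (6.0, 6.0), (7.0, 7.5), (6.5, 6.0)]
classes = [0, 0, 0, 1, 1, 1]
values = [1.0, 2.0, 3.0, 10.0, 12.0, 14.0]

print(knn_classify(points, classes, (1.2, 1.4), k=3))
print(knn_regress(points, values, (6.4, 6.2), k=3))
```

Sorting every training example for each query is the simplest approach and is adequate for a dataset of a few hundred rows; library implementations may use faster neighbour-search structures.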

Sample Code:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

# Load the data. The file has no header row, so the column names are supplied here.
# na_values="?" makes pandas read "?" entries as missing values (NaN).
dataset = pd.read_csv("C:\\Users\\skull\\BCE2301\\vir\\env\\Machine Learning Lab CSE4020/processed.cleveland.data.csv",
names=['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','output'],
na_values="?")

# dataset_mean, dataset1 and df1 are all names for the same DataFrame as dataset,
# so filling missing values in place below also updates dataset.
dataset_mean = dataset
print("**Before filling missing values**")
print(dataset_mean.loc[287])
dataset1 = dataset_mean
df1 = dataset1
print("**Mean of column 11 ('ca')**")
print(df1['ca'].mean())
# Replace every missing value with the mean of its column.
# This single call covers both 'ca' and 'thal'.
df1.fillna(df1.mean(), inplace=True)
print("**After filling missing values**")
print(df1.loc[[166,192,287,302]])
print("**Mean of column 12 ('thal')**")
print(df1['thal'].mean())
print("**After filling missing values**")
print(df1.loc[[87,266]])

# The first 13 columns are the features; the last column ('output') is the target
feature_cols = list(dataset.columns[0:13])
print("Feature columns: \n{}".format(feature_cols))

# Separate the data into feature data and target data
X = dataset[feature_cols]
y = dataset['output'].values
print("\nFeature values:")
print(X.head())

# Split the dataset into training (70%) and testing (30%) data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=5)
print(X_train)

# Normalization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training data only, then apply the same scaling to both sets
scaler.fit(X_train)
X_train = scaler.transform(X_train)
print("**After Z-score normalization on X_train**")
print(X_train)
# Do not refit on the test data: reuse the mean and standard deviation learned from X_train
X_test = scaler.transform(X_test)
print("**After Z-score normalization on X_test**")
print(X_test)

print("KNN CLASSIFIER")
# Train a KNN model with K = 5 neighbours and predict the test set
clf2 = KNeighborsClassifier(n_neighbors=5)
clf2.fit(X_train, y_train)
y_predictions = clf2.predict(X_test)
cm1 = confusion_matrix(y_test, y_predictions)
print("Confusion matrix:")
print(cm1)
print("Accuracy=", accuracy_score(y_test, y_predictions))
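
Z-score normalization, as used in the sample code, rescales each feature to z = (x - mean) / std. Because KNN compares distances, test rows should be scaled with the same mean and standard deviation as the training rows. The following is a minimal sketch of that idea using only the standard library; the helper names (fit_standardizer, standardize) and the numbers are made up for illustration, and the population standard deviation is assumed.

```python
import statistics


def fit_standardizer(rows):
    """Compute the per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply z = (x - mean) / std column by column; constant columns map to 0."""
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Made-up values for two features, for illustration only
train_rows = [[50.0, 200.0], [60.0, 240.0], [70.0, 280.0]]
test_rows = [[60.0, 200.0]]

# Statistics come from the training rows only and are reused for the test rows
means, stds = fit_standardizer(train_rows)
train_scaled = standardize(train_rows, means, stds)
test_scaled = standardize(test_rows, means, stds)

print(train_scaled)
print(test_scaled)
```

Computing separate statistics for the test rows would place the two sets on different scales, so distances between a test row and the training rows would no longer be comparable.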
OUTPUT:
