Practical 7

Uploaded by

Vinut P Maradur
Practical: Week 7

Aim:

Write a program to analyse cancer patient data. Given a data set of cancer patients, we have to
determine whether each cancer is Primary or Secondary using the KNN (K-Nearest Neighbors) machine
learning model.

Theory:

KNN (K-Nearest Neighbors) is one of the most basic yet essential classification algorithms in machine
learning. It belongs to the supervised learning domain and is applied extensively in pattern
recognition, data mining, and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make
any underlying assumptions about the distribution of data (as opposed to other algorithms such as
GMM, which assume a Gaussian distribution of the given data). We are given some prior data (also
called training data), in which each point is labelled with the group it belongs to.
If we plot these points on a graph, we may be able to locate some clusters or groups. Now, given
an unclassified point, we can assign it to a group by observing which group its nearest neighbors
belong to. This means a point close to a cluster of points classified as ‘Red’ has a higher probability
of being classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’, and the second
point (5.5, 4.5) should be classified as ‘Red’.
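
This intuition can be checked with a small sketch. The training coordinates below are hypothetical, since the points of the original plot are not listed in the text; only the two query points (2.5, 7) and (5.5, 4.5) come from the discussion above.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training points: a 'Green' cluster near (2, 7)
# and a 'Red' cluster near (6, 4)
points = [[1, 6], [2, 7], [2, 8], [3, 7.5], [1.5, 7.5],
          [5, 4], [6, 5], [6, 3.5], [5, 5], [7, 4]]
labels = ['Green'] * 5 + ['Red'] * 5

toy_model = KNeighborsClassifier(n_neighbors=3)
toy_model.fit(points, labels)

# The two query points discussed in the text
predictions = toy_model.predict([[2.5, 7], [5.5, 4.5]])
print(predictions)
```

With these assumed clusters, the three nearest neighbors of each query point all belong to one group, so the vote is unanimous.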

The K-Nearest Neighbors (K-NN) algorithm is a versatile and widely used machine learning algorithm,
valued primarily for its simplicity and ease of implementation. It does not require any assumptions
about the underlying data distribution. Given a suitable distance measure, it can handle both
numerical and categorical data, making it a flexible choice for various types of datasets in
classification and regression tasks. It is a non-parametric method that makes predictions based on
the similarity of data points in a given dataset. With a sufficiently large K, a single outlier has
little influence on a K-NN prediction, although small values of K make the algorithm sensitive to
noisy points.
The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a
distance metric, such as Euclidean distance. The class or value of the data point is then
determined by the majority vote or average of the K neighbors. This approach allows the
algorithm to adapt to different patterns and make predictions based on the local structure of the
data.
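
Written out for feature vectors with n components, the Euclidean distance and the two prediction rules look as follows. The symbols (x, x', N_K(x) for the set of K nearest neighbors, and ŷ for the prediction) are notation introduced here for illustration.

```latex
d(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{j=1}^{n} (x_j - x'_j)^2}

\hat{y}_{\text{class}} = \arg\max_{c} \sum_{i \in N_K(\mathbf{x})} \mathbf{1}[y_i = c]
\qquad
\hat{y}_{\text{reg}} = \frac{1}{K} \sum_{i \in N_K(\mathbf{x})} y_i
```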
A step-by-step explanation of how KNN works is given below:
Step 1: Selecting the optimal value of K
• K represents the number of nearest neighbors that need to be considered while making a
prediction.
Step 2: Calculating distance
• To measure the similarity between the target and the training data points, Euclidean distance is
used. The distance is calculated between the target point and each data point in the dataset.
Step 3: Finding Nearest Neighbors
• The K data points with the smallest distances to the target point are the nearest neighbors.
Step 4: Voting for Classification or Taking Average for Regression
• In a classification problem, the predicted class is determined by majority voting among the
K nearest neighbors. The class with the most occurrences among the neighbors
becomes the predicted class for the target data point.
• In a regression problem, the predicted value is the average of the target values of the
K nearest neighbors.
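
The four steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used by scikit-learn: the function name `knn_predict` and the toy data are hypothetical, and ties are resolved by whichever label `Counter` returns first.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, target, k):
    # Step 2: Euclidean distance from the target to every training point
    distances = [math.dist(p, target) for p in train_points]
    # Step 3: indices of the k smallest distances
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    # Step 4: majority vote among the labels of those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data, labelled 2 or 4 like the class column in the program
train_points = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9)]
train_labels = [2, 2, 2, 4, 4, 4]

# Step 1: k is chosen by the caller (3 here)
result_low = knn_predict(train_points, train_labels, (2, 2), k=3)
result_high = knn_predict(train_points, train_labels, (8, 8), k=3)
print(result_low, result_high)
```

For regression, the vote in Step 4 would be replaced by the mean of the neighbors' target values.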
Program:

# Prediction for breast cancer using KNN Classifier


import pandas as pd

# Load the data from the data file into the data frame
df = pd.read_csv('breast-cancer-wisconsin.csv')
df.head()

# display the column names


df.columns

# since the column names have spaces, remove them
# by replacing each space with an empty string
df.columns = df.columns.str.replace(' ', '')
df.columns

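# the barenuclei column contains '?' where a value is missing

# find out such rows; there are 16 such rows
df[df['barenuclei'] == '?']
# keep only the rows where '?' is not found
df = df[df['barenuclei'] != '?'].copy()
# the column was read as text because of the '?' entries; convert it to integers
df['barenuclei'] = df['barenuclei'].astype(int)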
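
An alternative way to handle the '?' entries is to let pandas treat them as missing values while reading the file. The sketch below uses a hypothetical three-row sample in place of the real CSV; only the `barenuclei` and `id` column names are taken from the program, and the `class` column name is an assumption.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real file has more columns and rows
sample_csv = io.StringIO(
    "id,barenuclei,class\n"
    "1,1,2\n"
    "2,?,2\n"
    "3,10,4\n"
)

# Treat '?' as missing while reading, drop incomplete rows,
# then convert the column back from float to integers
sample = (
    pd.read_csv(sample_csv, na_values='?')
    .dropna()
    .astype({'barenuclei': int})
)
print(sample)
```

This keeps the column numeric from the start, so no string comparison against '?' is needed.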

df.drop(['id'], axis=1, errors='ignore', inplace=True)


df

# take columns 0 to 8 (the nine feature columns) in x

x = df.iloc[:, :9]
x
# take column 9, i.e. the class column, in y
y = df.iloc[:, 9]  # y can be 2 or 4
y

#split the data


from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state = 0)

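# k value is the square root of the number of rows in the test data


import math

# convert to an integer first, so the even/odd check below works
k = int(math.sqrt(len(y_test)))
k

# if the k value is even, convert it to odd so that a two-class vote cannot tie


if k % 2 == 0:
    k += 1
k
11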

# import KNeighborsClassifier class


from sklearn.neighbors import KNeighborsClassifier
# create the model with k value obtained above
model = KNeighborsClassifier(n_neighbors=k)
model.fit(x_train, y_train)

# find accuracy
accuracy = model.score(x_test, y_test)
accuracy
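
Accuracy alone does not show which class the errors fall in. As a hedged sketch, a confusion matrix and per-class report can be produced as below. The data here is a stand-in (scikit-learn's bundled breast cancer dataset), not the breast-cancer-wisconsin.csv file used in the program, so the numbers will differ from the program's results.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = KNeighborsClassifier(n_neighbors=11)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```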

# let us find the accuracy level for each k value from 1 to 15


k_range = range(1, 16)
scores = []
for k in k_range:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(x_train, y_train)
    accuracy = model.score(x_test, y_test)
    scores.append(accuracy)
    print('k= %d Accuracy= %.2f%%' % (k, accuracy*100))
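
The loop above picks k by its accuracy on the test data, which means the test data also influences model selection. A common alternative is to choose k by cross-validation on the training data only. The sketch below assumes stand-in data (scikit-learn's bundled breast cancer dataset rather than the CSV file used here) and 5 folds; both are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption), split the same way as in the program
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

cv_scores = {}
for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training data only
    cv_scores[k] = cross_val_score(clf, X_train, y_train, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print('best k = %d, cross-validated accuracy = %.2f%%'
      % (best_k, cv_scores[best_k] * 100))
```

The test set is then used once, at the end, to report the accuracy of the model built with the selected k.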

# show the k values and scores in a line plot


# we can see the highest accuracy when k=1,3,4,5,7
import matplotlib.pyplot as plt
plt.plot(k_range, scores)
plt.xlabel("Value of k")
plt.ylabel("Accuracy")
plt.show()
# take k=3, one of the k values with the highest accuracy
model = KNeighborsClassifier(n_neighbors=3)
model.fit(x_train, y_train)

# find accuracy
accuracy = model.score(x_test, y_test)
accuracy

# predict for the given data


model.predict([[4, 2, 1, 1, 1, 2, 3, 2, 1]])

model.predict([[4,2,1,1,1,2,3,2,1], [8,10,10,8,7,10,9,7,1]])
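
Beyond the predicted label, a fitted classifier can report how the neighbors voted and which training rows they were. The sketch below uses six hypothetical training rows (nine features each, labelled 2 or 4) in place of the real data; `predict_proba` and `kneighbors` are standard `KNeighborsClassifier` methods.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training rows with nine features each, labelled 2 or 4
X_toy = [[1, 1, 1, 1, 2, 1, 3, 1, 1],
         [2, 1, 1, 1, 2, 1, 2, 1, 1],
         [4, 1, 1, 3, 2, 1, 3, 1, 1],
         [8, 10, 10, 8, 7, 10, 9, 7, 1],
         [10, 7, 7, 6, 4, 10, 4, 1, 2],
         [7, 8, 9, 8, 6, 10, 8, 7, 2]]
y_toy = [2, 2, 2, 4, 4, 4]

toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
sample = [[4, 2, 1, 1, 1, 2, 3, 2, 1]]

# Share of the 3 neighbors voting for each class, in the order of toy.classes_
proba = toy.predict_proba(sample)
# Distances to, and row indices of, the 3 nearest training rows
distances, indices = toy.kneighbors(sample)
print(toy.classes_, proba)
print(indices, distances)
```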
