Practical 7
Aim:
Write a program to analyse cancer patient data. Given a data set of cancer patients, we have to
determine whether the cancer is Primary or Secondary with the help of the KNN machine learning model.
Theory:
K-Nearest Neighbors (KNN) is one of the most basic yet essential classification algorithms in
machine learning. It belongs to the supervised learning domain and finds extensive application in
pattern recognition, data mining, and intrusion detection.
It is widely applicable in real-life scenarios because it is non-parametric, meaning it does not make
any underlying assumptions about the distribution of data (as opposed to other algorithms such as
GMM, which assume a Gaussian distribution of the given data). We are given some prior data (also
called training data), which classifies coordinates into groups identified by an attribute.
If we plot these points on a graph, we may be able to locate some clusters or groups. Now, given
an unclassified point, we can assign it to a group by observing what group its nearest neighbors
belong to. This means a point close to a cluster of points classified as ‘Red’ has a higher probability
of getting classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’, and the second
point (5.5, 4.5) should be classified as ‘Red’.
The K-Nearest Neighbors (K-NN) algorithm is a versatile and widely used machine learning algorithm,
valued for its simplicity and ease of implementation. It does not require any assumptions about the
underlying data distribution. It can also handle both numerical and categorical data, provided a
suitable distance measure is chosen, making it a flexible choice for various types of datasets in
classification and regression tasks. It is a non-parametric method that makes predictions based on
the similarity of data points in a given dataset. With a sufficiently large K, K-NN is fairly robust
to isolated outliers, although a small K makes it sensitive to noisy points.
The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a
distance metric, such as Euclidean distance. The class or value of the data point is then
determined by the majority vote or average of the K neighbors. This approach allows the
algorithm to adapt to different patterns and make predictions based on the local structure of the
data.
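The distance and the two prediction rules described above can be written compactly. The notation here is an assumption made for illustration: x is the target point with n features, x^(j) is the j-th training point with label or value y^(j), and N_K(x) is the set of indices of the K training points closest to x.

```latex
% Euclidean distance between the target x and the j-th training point
d\bigl(x, x^{(j)}\bigr) = \sqrt{\sum_{i=1}^{n} \bigl(x_i - x^{(j)}_i\bigr)^2}

% Classification: the most frequent label among the K nearest neighbors
\hat{y} = \operatorname{mode}\,\bigl\{\, y^{(j)} : j \in N_K(x) \,\bigr\}

% Regression: the average value of the K nearest neighbors
\hat{y} = \frac{1}{K} \sum_{j \in N_K(x)} y^{(j)}
```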
A step-by-step explanation of how KNN works is given below:
Step 1: Selecting the optimal value of K
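K is the number of nearest neighbors considered when making a prediction. A small K makes the
prediction sensitive to noise, while a large K smooths over local structure; an odd K avoids ties
in two-class problems.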
Step 2: Calculating distance
To measure the similarity between the target point and the training data points, Euclidean distance
is used. The distance is calculated between the target point and each data point in the dataset.
Step 3: Finding Nearest Neighbors
The k data points with the smallest distances to the target point are the nearest neighbors.
Step 4: Voting for Classification or Taking Average for Regression
In a classification problem, the predicted class is determined by majority voting over the
class labels of the K nearest neighbors. The class with the most occurrences among the neighbors
becomes the predicted class for the target data point. In a regression problem, the prediction
is the average of the values of the K nearest neighbors.
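Steps 2 to 4 above can be sketched in plain Python as follows. This is a minimal illustration, not the program used in this practical: the function name `knn_predict` and the toy points are hypothetical, and `math.dist` assumes Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, target, k):
    """Classify `target` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the target to every training point.
    distances = [
        (math.dist(point, target), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k training points with the smallest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: majority vote over the labels of those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two small clusters in the plane.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
labels = ["Green", "Green", "Green", "Red", "Red", "Red"]

print(knn_predict(points, labels, (1.5, 1.5), k=3))  # Green
print(knn_predict(points, labels, (6.5, 6.5), k=3))  # Red
```

For regression, the vote in Step 4 would be replaced by the mean of the neighbors' values.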
Program:
# Import the libraries used in this program
import math
import pandas as pd
# Load the data from the CSV file into a data frame
df = pd.read_csv('breast-cancer-wisconsin.csv')
df.head()
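# Rule of thumb: take K close to the square root of the number of test samples
# (round to an odd integer before using it as the number of neighbors)
k = math.sqrt(len(y_test))
k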
# Find the accuracy of the model on the test set
accuracy = model.score(x_test, y_test)
accuracy
# Predict the class of two new samples (nine feature values each)
model.predict([[4, 2, 1, 1, 1, 2, 3, 2, 1], [8, 10, 10, 8, 7, 10, 9, 7, 1]])
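The square-root rule used for K in the program and the accuracy figure it reports can be sketched in plain Python. This is a minimal sketch under stated assumptions: it assumes `model` is a scikit-learn style classifier, whose `score` method returns the fraction of correctly classified test samples. The helper names `choose_k` and `accuracy` and the sample labels are hypothetical and are not results from the data set.

```python
import math


def choose_k(n_samples):
    """Square-root rule of thumb, rounded down to an odd integer >= 1."""
    k = max(1, int(math.sqrt(n_samples)))
    # An even K can tie in a two-class vote, so step down to the nearest odd value.
    return k if k % 2 == 1 else k - 1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels, for illustration only.
y_true = ["Primary", "Primary", "Secondary", "Secondary", "Primary"]
y_pred = ["Primary", "Secondary", "Secondary", "Secondary", "Primary"]

print(choose_k(140))             # 11
print(accuracy(y_true, y_pred))  # 0.8
```

Stepping down to an odd K is one common convention; comparing several values of K on held-out data is a more reliable way to pick it.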