Dynamic KNNF
Abstract- This study proposes a novel dynamic model to allocate the value of 'K' in the K Nearest Neighbour algorithm based on geometrical calculations. The K-nearest neighbour algorithm is a well-known machine-learning algorithm widely used to classify data. It faces certain disadvantages when working on a large dataset or data with several dimensions. One of the significant factors that affect the accuracy of the model is the value of K. Fundamentally, the value of K is fixed for all data points. In this study, the value is calculated at the local level for each test point. This study performs a comparative study between the suggested model and the traditional KNN model.

I. INTRODUCTION

On the same principle that one size does not fit all, we look for different values of k for different data points and evaluate the impact on the accuracy of the model of dynamically allocating the value of K. To improve the prediction of unknown values, we hypothesize that removing the dependency on K and shifting the voting criteria to the density of data points surrounding the test subject will improve the prediction. In this model, we construct geometric constraints that are flexible as per the location of each data point and also act as barriers to avoid taking votes from outliers, reducing the impact of noise in the decision-making process. The datasets will first be pre-processed, and then both models will be tested to compare their accuracy.

II. LITERATURE REVIEW

The traditional K Nearest Neighbour (KNN) algorithm was first proposed by Cover and Hart in 1967. It works by finding the k nearest neighbours to the unknown data point [1], taking a majority vote over those k data points, and assigning the majority-voted class. The KNN model has been improved in different ways over the years, resulting in a number of notable variations of the original model, including weighted KNN by Saha et al. in 1991 [2]. In weighted KNN, all votes are not treated equally: the closer a neighbour is, the more influence it has through its voting power. Another method was proposed by Gongde Guo and Hui Wang to overcome the limitation of depending on a fixed value of k: regions are outlined to represent sets of classified data points [3], and in the selection of each representative set an optimal but different k, allocated by the dataset itself, is used to abolish the dependency on k without the user's intervention. Another approach is Radius Neighbour Classification, proposed by Leo Breiman and Jerome Friedman in 1984 [4]. Here a certain radius is fixed for a particular dataset; all the points within that circle are considered for voting, and the majority vote is assigned to the unknown data point. More often than not, data is available in continuous form and the values are not classified. In such cases, the mean of the values from all the nearest voters is taken to estimate the value of the test data point, as suggested by Altman in 1992 [5]. Incorporating other geometrical constructions is often fruitful for the accuracy and efficiency of the model, as in the case of dynamic nearest neighbour queries in Euclidean space by Mohammed Eunus Ali, which revolves around the construction of Voronoi diagrams [6].
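As an aside, the classical variants discussed above have off-the-shelf counterparts. The following minimal sketch is not part of the original study; the dataset, the value of k, and the radius used here are arbitrary assumptions. It shows how plain KNN [1], distance-weighted voting in the spirit of [2], and radius-based voting in the spirit of [4] can be tried with scikit-learn.

```python
# Illustrative only: off-the-shelf versions of the KNN variants above.
# Dataset choice and parameter values are assumptions, not from the paper.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Scale the features so that Euclidean distances are comparable.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    # Traditional KNN: unweighted majority vote of the k nearest points [1].
    "knn": KNeighborsClassifier(n_neighbors=5),
    # Weighted KNN: closer neighbours carry more voting power [2].
    "weighted knn": KNeighborsClassifier(n_neighbors=5, weights="distance"),
    # Radius neighbours: every training point inside a fixed radius votes [4].
    "radius": RadiusNeighborsClassifier(radius=3.0,
                                        outlier_label="most_frequent"),
}
for name, model in models.items():
    print(name, model.fit(X_train, y_train).score(X_test, y_test))
```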
III. METHODOLOGY
As one shoe can't fit all, one value of K isn't optimal for all test points. In this model, we calculate the value of K for each test point using geometry. First, we calculate the distance from every training point to the test point. Then we find the index of the closest point and the distance between the test point and this nearest data point; let us call this distance "r". We then calculate K as half of the value of "r". We do this because the value of K needs to be flexible, yielding and adaptable to the close neighbourhood of the test point. A fixed value of K can be troublesome where the local density of data points varies drastically: overfitting can occur when noisy or isolated data points dominate the classification choice because k is too small, and underfitting can occur when the classification is influenced by data points that are too far from the test point because k is set too high. Where the data points are more densely packed, the closest neighbours are nearer and the value of K will be smaller; where the data points are more sparsely packed, the nearest neighbour is further away and a bigger value of K is required.

FIG 1: WORKING MODEL

The next step is to find all the training points within a radius equivalent to twice the distance to the nearest data point. A majority vote is taken over all points in that area, and the unknown data point takes the value of the majority vote. We repeat this process for all the test data points.

The following is the step-by-step algorithm of the process:

Step 1 : Load the dataset
Step 2 : Split the dataset into test and train sets
Step 3 : Create a function to predict the values
    3.1 : Calculate the distance from the test point to all training points
    3.2 : Find the index of the nearest neighbour and its distance d
    3.3 : Calculate K as half the distance to the nearest neighbour
    3.4 : Find all the training points within a radius of 2d
    3.5 : Take a majority vote of the classes of the training points within the radius
    3.6 : Make predictions for the test points
Step 4 : Calculate the accuracy of the model
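As a concrete illustration of steps 3.1-3.6, the following is a minimal sketch of the prediction function, assuming numeric feature arrays, Euclidean distance, and NumPy; the function and variable names are illustrative and not taken from the original implementation.

```python
# Minimal sketch of the proposed procedure (steps 3.1-3.6), assuming
# numeric NumPy arrays and Euclidean distance; names are illustrative.
import numpy as np
from collections import Counter

def predict_dynamic(X_train, y_train, X_test):
    predictions = []
    for x in X_test:
        # 3.1: distance from the test point to every training point
        dists = np.linalg.norm(X_train - x, axis=1)
        # 3.2: index of the nearest neighbour and its distance d (the "r" above)
        nearest = np.argmin(dists)
        d = dists[nearest]
        # 3.3: K as half the distance to the nearest neighbour
        k = d / 2
        # 3.4: all training points within a radius of 2d
        in_radius = np.where(dists <= 2 * d)[0]
        # 3.5: majority vote of the classes inside the radius
        votes = Counter(y_train[in_radius])
        # 3.6: the unknown point takes the majority-voted class
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)
```

Step 4 then reduces to comparing these predictions with the true test labels, for example as np.mean(predict_dynamic(X_train, y_train, X_test) == y_test).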
IV. RESULTS AND DISCUSSION
As a next step, we compare the predictions of both algorithms on various data sets. To analyse these algorithms, we have taken publicly available data sets: diabetes, wine, heart disease, seeds, and glass. All pre-processing steps have been performed on all datasets and they are adequate for the modelling process. Given below is the table and graph for the comparison.

The reason for the greater accuracy is that the model overcomes the fixed nature of the value of K by locally determining the value of K for each test point and, while voting, it also restricts the influence of data points that lie beyond a certain range.
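To illustrate how such a comparison could be scripted, the sketch below is an assumed setup, not the authors' experimental code: it evaluates the traditional KNN classifier against the predict_dynamic sketch from the methodology section on the wine dataset bundled with scikit-learn. The diabetes, heart disease, seeds, and glass datasets are not bundled with scikit-learn and would have to be loaded from their own sources (e.g., the UCI repository) in the same way.

```python
# Assumed evaluation harness (illustrative only): compare traditional KNN
# with the dynamic-K sketch on one of the publicly available datasets.
# Relies on predict_dynamic() defined in the methodology sketch above.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Basic pre-processing: scale the features so distances are comparable.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_acc = knn.score(X_test, y_test)

dyn_acc = np.mean(predict_dynamic(X_train, y_train, X_test) == y_test)

print(f"wine: traditional KNN = {knn_acc:.3f}, dynamic KNN = {dyn_acc:.3f}")
```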