
Part A 3. KNN Classification

The document discusses k-nearest neighbors (kNN) classification. kNN is a lazy learning algorithm that does not build a model from the training data. Instead, it compares a test instance to all training examples to find the k closest matches. It then assigns the test instance the most common class among those k neighbors. The document explains that kNN performance depends on selecting an appropriate value for k and a suitable distance metric. It notes some advantages of kNN, such as simplicity, but also disadvantages like slow classification speed when training data is large.


kNN (k-Nearest Neighbor)

k-Nearest Neighbors
Given a query item, find the k closest matches in a labeled dataset and return the most frequent label among them.
Example: with k = 3, all 3 nearest neighbors vote for "cat".
With a larger k: 2 votes for cat and 1 each for buffalo, deer, and lion; cat still wins.
K-Nearest Neighbor Learning
Basics

Learning methods based on decision trees, rule sets, posterior probabilities, hyperplanes, etc. are called eager learning methods because they learn a model from the training data.

k-nearest neighbor (kNN) is a lazy learning method in the sense that no model is
learned from the training data.

Learning only occurs when a test example needs to be classified.


K-Nearest Neighbor Learning
Working

• Let D be the training data set.


• Nothing is done on the training examples.
• When a test instance d is presented, the algorithm compares d with
every training example in D to compute the similarity or distance between them.
• The k most similar (closest) examples in D are then selected.
• This set of examples is called the k nearest neighbors of d.
• d then takes the most frequent class among the k nearest neighbors.
K-Nearest Neighbor Learning
k = 1 is usually not sufficient for determining the class of d due to noise and outliers
in the data.
A set of nearest neighbors is needed to accurately decide the class.

Algorithm kNN(D, d, k)
1. Compute the distance between d and every example in D.
2. Choose the k examples in D that are nearest to d; denote this set by P.
3. Assign d the class that is the most frequent (majority) class in P.
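
A minimal from-scratch sketch of this procedure, assuming numeric feature vectors and Euclidean distance (the function and variable names are illustrative, not from the slides):

import numpy as np
from collections import Counter

def knn_classify(D_X, D_y, d, k):
    # 1. Compute the distance between d and every example in D
    dists = np.linalg.norm(D_X - d, axis=1)
    # 2. Choose the k examples in D that are nearest to d (the set P)
    P = np.argsort(dists)[:k]
    # 3. Assign d the most frequent class in P
    return Counter(D_y[P]).most_common(1)[0][0]

# toy usage with the example data that appears later in these slides
X = np.array([[7, 7], [7, 4], [3, 4], [1, 4]])
y = np.array(["BAD", "BAD", "GOOD", "GOOD"])
print(knn_classify(X, y, np.array([3, 7]), k=3))   # -> GOOD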
K-Nearest Neighbor Learning
The key component of a kNN algorithm is the distance/similarity function, which is
chosen based on applications and the nature of the data.

The k value that gives the best accuracy on the validation set is usually selected.

For relational data, the Euclidean distance is commonly used. For text documents,
cosine similarity is a popular choice.
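
A small sketch of that selection step, using scikit-learn's cross_val_score on the iris data that appears later in these slides (the 5-fold setup and the 1-25 search range are assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
best_k, best_score = 1, 0.0
for k in range(1, 26):
    # mean 5-fold cross-validated accuracy for this value of k
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            iris.data, iris.target, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)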
Distance
Minkowski Distance
If X = R^d, the Minkowski distance of order p > 0 between two points x, y ∈ R^d is defined as:
d_p(x, y) = ( Σ_{i=1..d} |x_i − y_i|^p )^(1/p)
(p = 2 gives the Euclidean distance and p = 1 the Manhattan distance.)
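
A quick numeric check of this formula on two of the example points used later (P5 = (3, 7) and P2 = (7, 4)); scipy's minkowski is only used for comparison:

import numpy as np
from scipy.spatial.distance import minkowski

x, y = np.array([3.0, 7.0]), np.array([7.0, 4.0])
for p in (1, 2, 3):
    d_p = np.sum(np.abs(x - y) ** p) ** (1.0 / p)   # the Minkowski distance of order p
    print(p, d_p, minkowski(x, y, p))               # p=1: 7, p=2: 5, p=3: ~4.5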
K-NN metrics
Euclidean Distance: Simplest, fast to compute.

Cosine Distance: Good for documents, images, etc.

Jaccard Distance: For set data; one minus the size of the intersection divided by the size of the union.

Hamming Distance: For string data; the number of positions at which the two strings differ.


K-NN metrics
Manhattan Distance: Coordinate-wise (city-block) distance.

Edit Distance: For strings, especially genetic data.

Mahalanobis Distance: Normalized by the sample covariance matrix; unaffected by linear transformations of the coordinates.
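
Most of the metrics listed above are available in scipy; a short sketch (the sample vectors and strings are made up for illustration, and Mahalanobis additionally needs the inverse covariance matrix of the data):

import numpy as np
from scipy.spatial import distance

u, v = np.array([1.0, 0.0, 2.0]), np.array([2.0, 1.0, 0.0])
print(distance.euclidean(u, v))        # Euclidean distance
print(distance.cityblock(u, v))        # Manhattan (city-block) distance
print(distance.cosine(u, v))           # cosine distance = 1 - cosine similarity

a = np.array([1, 1, 0, 1], dtype=bool) # set data encoded as membership indicators
b = np.array([1, 0, 0, 1], dtype=bool)
print(distance.jaccard(a, b))          # Jaccard distance

s, t = list("karolin"), list("kathrin")
print(distance.hamming(s, t))          # fraction of positions where the strings differ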
K-Nearest Neighbor Learning
Key Points
• Despite its simplicity, the classification accuracy of kNN can be quite strong.
• kNN performs as well as SVM on some text classification tasks.
• kNN is also very flexible: it can work with arbitrarily shaped decision boundaries.
• kNN is, however, slow at classification time. Because there is no model, each test instance must be compared with every training example, which can be quite time-consuming when the training set D and the test set are large.
• kNN does not handle a large number of features well.
• Another disadvantage is that kNN does not produce an understandable model, so it is not applicable when an interpretable model is required.
k-NN issues
The Data is the Model
• No training needed.
• Accuracy generally improves with more data.
• Matching is simple and fast (and single pass).
• Usually need data in memory, but can be run off disk.
Minimal Configuration:
• Only parameter is k (number of neighbors)
• Two other choices are important:
• Weighting of neighbors (e.g. inverse distance)
• Similarity metric (both choices appear in the sketch below)
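
In scikit-learn both of these choices map directly onto KNeighborsClassifier parameters; a minimal sketch on the iris data used later in the slides (the particular metric is just an example):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# weights='distance' gives closer neighbors a larger, inverse-distance vote;
# metric chooses the similarity/distance function (e.g. 'euclidean', 'manhattan')
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='manhattan')
knn.fit(iris.data, iris.target)
print(knn.score(iris.data, iris.target))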
Questions upon k-NN
https://www.analyticsvidhya.com/blog/2017/09/30-questions-test-k-nearest-neighbors-algorithm/
Classification Using K-Nearest Neighbor

Supervised (labeled data):
X1    X2     Class
10    100    Square
2     4      Root

Unsupervised (unlabeled data):
X1    X2
10    100
2     4
Nearest Neighbor Search

Given: a set P of n points in R^d

Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P
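
One common such structure is a k-d tree; a brief sketch using sklearn.neighbors.KDTree (the random point set is illustrative):

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
P = rng.random((1000, 2))        # the point set P, here n = 1000 points in R^2
tree = KDTree(P)                 # build the search structure once
q = np.array([[0.5, 0.5]])       # query point q
dist, ind = tree.query(q, k=1)   # distance to and index of the nearest neighbor p
print(ind[0][0], dist[0][0])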
K-NN

(k, l)-NN: Reduce complexity by putting a threshold l on the majority; the association is restricted so that a class is assigned only when at least l of the k nearest neighbors agree (a short sketch follows after the K = 5 walkthrough below).

K = 5
K-NN

Select the 5 nearest neighbors (K = 5) by computing their Euclidean distances to the query.

K-NN

Decide by the majority class among these K = 5 nearest instances.
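
A hedged sketch of this thresholded rule, under the assumption that a label is returned only when at least l of the k nearest neighbors agree and the query is rejected otherwise (names are illustrative):

import numpy as np
from collections import Counter

def knn_with_threshold(X, y, query, k=5, l=4):
    # X: training features (NumPy array), y: training labels (NumPy array)
    dists = np.linalg.norm(X - query, axis=1)        # distance to every training example
    nearest = np.argsort(dists)[:k]                  # indices of the k nearest neighbors
    label, votes = Counter(y[nearest]).most_common(1)[0]
    # assign the majority class only if it has at least l votes; otherwise reject
    return label if votes >= l else None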
Example

Points    X1 (Acid Durability)    X2 (Strength)    Y = Classification

P1 7 7 BAD

P2 7 4 BAD

P3 3 4 GOOD

P4 1 4 GOOD
KNN Example

Points X1(Acid Durability) X2(Strength) Y(Classification)


P1 7 7 BAD
P2 7 4 BAD
P3 3 4 GOOD
P4 1 4 GOOD
P5 3 7 ?
Scatter Plot
Euclidean Distance From Each Point

KNN
Euclidean distance of P5 (3, 7) from:
P1 (7,7): sqrt((7-3)^2 + (7-7)^2) = 4
P2 (7,4): sqrt((7-3)^2 + (4-7)^2) = 5
P3 (3,4): sqrt((3-3)^2 + (4-7)^2) = 3
P4 (1,4): sqrt((1-3)^2 + (4-7)^2) = sqrt(13) ≈ 3.61

3 Nearest Neighbours

P1 (7,7): distance 4, class BAD
P2 (7,4): distance 5, class BAD
P3 (3,4): distance 3, class GOOD
P4 (1,4): distance sqrt(13) ≈ 3.61, class GOOD

The 3 nearest neighbours of P5 are P3, P4 and P1; two of the three are GOOD, so P5 is classified as GOOD.

KNN Classification

Points X1(Durability) X2(Strength) Y(Classification)


P1 7 7 BAD
P2 7 4 BAD
P3 3 4 GOOD
P4 1 4 GOOD
P5 3 7 GOOD
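
The same worked example can be reproduced with scikit-learn (a small sketch; the data is the P1 to P4 table above and the query is P5):

from sklearn.neighbors import KNeighborsClassifier

X = [[7, 7], [7, 4], [3, 4], [1, 4]]        # P1-P4: (Acid Durability, Strength)
y = ["BAD", "BAD", "GOOD", "GOOD"]
knn = KNeighborsClassifier(n_neighbors=3)   # the 3 nearest neighbours, as above
knn.fit(X, y)
print(knn.predict([[3, 7]]))                # P5 -> ['GOOD']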
Different Values of K
KNN (K=5)
# read in the iris data
from sklearn.datasets import load_iris
iris = load_iris()
# create X (features) and y (response)
X = iris.data
y = iris.target
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
y_pred = knn.predict(X)
# accuracy on the training data itself
print(metrics.accuracy_score(y, y_pred))
KNN (K=1)
# read in the iris data
from sklearn.datasets import load_iris
iris = load_iris()
# create X (features) and y (response)
X = iris.data
y = iris.target
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
y_pred = knn.predict(X)
# K=1 scores 100% on the training data: each point is its own nearest neighbor
print(metrics.accuracy_score(y, y_pred))
KNN (K=5) Train-test split
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
# print the shapes of X and y
# X is our features matrix with 150 x 4 dimension
print(X.shape)
# y is our response vector with 150 x 1 dimension
print(y.shape)
# STEP 1: split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)
# print the shapes of the new X objects
print(X_train.shape)
print(X_test.shape)
# print the shapes of the new y objects
print(y_train.shape)
print(y_test.shape)
# STEP 2: fit on the training set and evaluate on the held-out test set
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
kNN classifier

from sklearn.datasets import load_iris
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# create X (features) and y (response)
X = iris.data
y = iris.target
print(X.shape)

# try K=1 through K=25 and record the training accuracy
k_range = range(1, 26)
# scores is a Python list (created with [])
scores = []
# loop over k = 1 .. 25 and append each accuracy score
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    y_pred = knn.predict(X)
    scores.append(metrics.accuracy_score(y, y_pred))
print(scores)

Output:
(150, 4)
[1.0, 0.98, 0.96, 0.96, 0.9666666666666667, 0.9733333333333334, 0.9733333333333334, 0.98, 0.98, 0.98, 0.9733333333333334, 0.98, 0.98, 0.98, 0.9866666666666667, 0.9866666666666667, 0.98, 0.9733333333333334, 0.98, 0.98, 0.98, 0.98, 0.98, 0.9733333333333334, 0.98]
Changing values of K…
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# print the shapes of X and y
# X is our features matrix with 150 x 4 dimension
print(X.shape)
# y is our response vector with 150 x 1 dimension
print(y.shape)

# STEP 1: split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

# try K=1 through K=25 and record the testing accuracy
k_range = range(1, 26)
# scores is a Python list (created with [])
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))
print(scores)

Output:
[0.94999999999999996, 0.94999999999999996, 0.96666666666666667, 0.96666666666666667, 0.96666666666666667, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.96666666666666667, 0.98333333333333328, 0.96666666666666667, 0.96666666666666667, 0.96666666666666667, 0.96666666666666667, 0.94999999999999996, 0.94999999999999996]
import matplotlib.pyplot as plt
# allow plots to appear within the notebook
%matplotlib inline
# plot the relationship between K and testing accuracy: plt.plot(x_axis, y_axis)
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')
