
PACE INSTITUTE OF TECHNOLOGY AND SCIENCES

Machine Learning

Dr. G. Ganesh Naidu, M.E., Ph.D., PG (AI & ML) - IIT KGP


Professor
Department of Civil Engineering
[email protected]
9581456545
UNIT-2

K Nearest Neighbor Classification

Contents: Nearest Neighbor-Based Models: Introduction to Proximity Measures, Distance Measures, Non-Metric Similarity Functions, Proximity Between Binary Patterns, Different Classification Algorithms Based on the Distance Measures, K-Nearest Neighbor Classifier, Radius Distance Nearest Neighbor Algorithm, KNN Regression, Performance of Classifiers, Performance of Regression Algorithms.


Introduction to KNN

• Nearest Neighbor-based models are a class of machine learning algorithms that classify data points based on their similarity (or distance) to nearby data points.
• These models are widely used in classification, regression, and recommendation tasks.
• The core idea is simple: for a given test point, the model finds the most similar data points in the training set and uses their labels or values to make a prediction.
Nearest Neighbor Classifiers

• Basic idea:
  – If it walks like a duck and quacks like a duck, then it's probably a duck.

[Diagram: for a test record, compute its distance to the training records, then choose the k "nearest" records.]
Introduction to KNN
● The KNN classifier is a non-parametric, instance-based learning algorithm.
  ○ Non-parametric means it makes no assumptions about the distribution of the data, and so avoids the risk of misjudging the underlying distribution.
  ○ Instance-based learning means that the algorithm does not explicitly learn any parameters.
● For classification, the algorithm takes a majority vote among the K most similar training instances to a given "unseen" observation. K is a count.
● KNN is not suitable when the data are noisy and the target classes are not clearly demarcated in terms of attribute values.
● The closest class is identified using a distance measure such as Euclidean distance.
Introduction to Proximity Measures

• Proximity measures (or distance measures) are mathematical tools used to quantify the similarity or dissimilarity between two data points.
• These measures are critical in machine learning tasks such as clustering, classification, regression, anomaly detection, and recommendation systems.
• Different proximity measures are suited to different types of data and tasks. The most common proximity measures are discussed below with their mathematical definitions.
Distance measures

● Euclidean distance between any two points x and y in d dimensions:
  d(x, y) = [(x1 − y1)² + (x2 − y2)² + … + (xd − yd)²]^½

● Manhattan distance:
  d(x, y) = |x1 − y1| + |x2 − y2| + … + |xd − yd|
  (see the sketch below)
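A minimal sketch of these two measures with NumPy, using the points A = [3, 4] and B = [6, 8] from question 2 of the question bank:

```python
import numpy as np

x = np.array([3.0, 4.0])   # point A
y = np.array([6.0, 8.0])   # point B

euclidean = np.sqrt(np.sum((x - y) ** 2))   # square root of the sum of squared differences
manhattan = np.sum(np.abs(x - y))           # sum of absolute differences

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```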
Distance Metrics
Non-Metric Similarity Functions

• Non-metric similarity functions quantify the closeness or resemblance between two objects without adhering to the mathematical properties of a metric, such as symmetry or the triangle inequality.
• These functions are often used in domains where metric-based distances are unsuitable, such as text, categorical data, or high-dimensional spaces.
Common Non-Metric Similarity Functions
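One commonly cited similarity function that is not a distance metric is cosine similarity, which measures the angle between two vectors rather than their magnitude. A minimal sketch with illustrative vectors (not taken from the slides):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1 means identical direction)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])   # illustrative vectors
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # 1.0, since b is a scaled copy of a
```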
Data standardization
● The distance formula is highly dependent on how the features / attributes / dimensions are measured.
● Dimensions with a larger possible range of values will dominate the result of a distance calculation using the Euclidean formula.
● To ensure all dimensions have a similar scale, we normalize the data on all dimensions / attributes.
● There are multiple ways of normalizing the data. We can use Z-score standardization (see the sketch below):
  z = (x − x̄) / s
  where x̄ is the mean of the variable and s is its standard deviation.
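A minimal sketch of Z-score standardization applied column-wise with NumPy (the small feature matrix is illustrative, and the sample standard deviation ddof=1 is assumed for s):

```python
import numpy as np

X = np.array([[58.0, 19.0],     # illustrative feature matrix (humidity, temperature)
              [62.0, 26.0],
              [87.0, 19.0]])

x_bar = X.mean(axis=0)          # mean of each variable
s = X.std(axis=0, ddof=1)       # sample standard deviation of each variable

Z = (X - x_bar) / s             # z = (x - x_bar) / s for every value
print(Z)
```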
KNN methodology

● Suppose we have a new instance called x.
● The algorithm calculates the distance between x and every instance in the training set.
● These distances are arranged in increasing order.
● The k nearest neighbors are found. If k = 3, the three nearest instances are selected based on the similarity measure.
● The k neighbors determine the class of x using majority voting among the closest instances (see the sketch below).
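A minimal NumPy sketch of the steps just listed (distance computation, sorting, taking the k nearest, majority vote); the function and variable names are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Distance between x_new and every training instance (Euclidean)
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2. Sort the distances and take the indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among the labels of the k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Example: three training points, two classes
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0]])
y = np.array([0, 0, 1])
print(knn_predict(X, y, np.array([1.1, 1.0]), k=3))  # 0
```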
KNN Example

● Humidity (independent variable): the percentage of humidity in the atmosphere.
● Temperature (independent variable): the temperature during precipitation.
● Rain (target variable): indicates whether it rained or not; takes value 1 if it rained and 0 otherwise.

Observation   Humidity   Temperature   Rain
1             58         19            0
2             62         26            0
3             40         30            0
4             36         35            0
5             87         19            1
6             93         18            1
7             79         16            1
8             69         17            1
9             62         33            0
10            71         15            1
11            55         33            0
12            78         19            1
KNN Example

● Let us choose K = 5 and use the Euclidean distance.
● Compute the Euclidean distance between the new observation and each instance in the training data.

New observation:   Humidity = 84,   Temperature = 37,   Rain = ?

● For example, for the first observation (Humidity = 58, Temperature = 19), the Euclidean distance is
  [(58 − 84)² + (19 − 37)²]^½ = 31.62
KNN Example

● The Euclidean distance is computed between the new observation and each instance, and the instances are sorted in ascending order of distance.

Observation   Euclidean distance (sorted)   Class label (Rain)
5             18.25                         1
12            18.97                         1
6             21.02                         1
7             21.59                         1
9             22.36                         0
2             24.60                         0
8             25.00                         1
10            25.55                         1
11            29.27                         0
1             31.62                         0
3             44.55                         0
4             48.04                         0

● Since K = 5, consider the class labels of the first five observations.
● 1 appears 4 times and 0 appears 1 time.
● Thus, using majority voting, the class label for the new instance is 1.
● This implies that for Humidity = 84 and Temperature = 37, it will rain.
Implementation of KNN
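As a minimal scikit-learn sketch of KNN on the rain example above (scikit-learn availability assumed; K = 5, and Euclidean distance is the library's default):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training data from the rain example: [humidity, temperature] -> rain (0/1)
X = np.array([[58, 19], [62, 26], [40, 30], [36, 35], [87, 19], [93, 18],
              [79, 16], [69, 17], [62, 33], [71, 15], [55, 33], [78, 19]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1])

knn = KNeighborsClassifier(n_neighbors=5)   # K = 5, Euclidean distance by default
knn.fit(X, y)
print(knn.predict([[84, 37]]))              # [1] -> it will rain
```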
K as a hyperparameter
● K can range from 1 to n, the number of training data points.
● The value of K can affect the performance of the classifier.
● K in KNN is a hyperparameter: it has to be discovered through iterations (see the sketch below).
● We can imagine K as a way of influencing the shape of the boundary between classes.
● A simple approach is to use K = √n, but it depends on the individual case; it is good to run through various values of K.
● Odd values of K help to avoid ties between predicted classes.

Small K:
● Noise will have a higher influence on the result, i.e., the probability of overfitting is very high.

Large K:
● Increases confidence in the prediction.
● But if K is too large, the decision may be skewed.
● Computationally expensive.
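A minimal sketch of tuning K by iterating over several values with cross-validation (scikit-learn assumed; the synthetic dataset is used only so the example runs end to end):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data, purely for illustration
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# Try several (odd) values of K and keep the one with the best cross-validated accuracy
scores = {}
for k in range(1, 21, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```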
Let's answer some questions

1. In which of the following images will KNN fail to segregate the two classes denoted by red and green colors?
2. What happens if we build a KNN model with K = 1?
Problem

► For X1 = 3 and X2 = 7, what will be the classification? (A worked sketch follows below.)

X1 = Durability   X2 = Strength   Y = Classification
7                 7               Bad
7                 4               Bad
3                 4               Good
1                 4               Good
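A worked sketch for this problem using Euclidean distance; K = 3 is an assumption, since the slide does not state a value of K:

```python
import numpy as np
from collections import Counter

# Training data: (durability X1, strength X2) -> classification Y
X = np.array([[7, 7], [7, 4], [3, 4], [1, 4]])
y = np.array(["Bad", "Bad", "Good", "Good"])

x_new = np.array([3, 7])                        # X1 = 3, X2 = 7
d = np.sqrt(((X - x_new) ** 2).sum(axis=1))     # distances: 4.00, 5.00, 3.00, 3.61
nearest = np.argsort(d)[:3]                     # assumed K = 3 -> Good, Good, Bad
print(Counter(y[nearest]).most_common(1)[0][0]) # Good
```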
Principal Component Analysis

Numerical Example: calculation of principal components.

STEP 1: SAMPLE DATA SET

Let us consider X, a 3-variate dataset with 10 observations. Each observation consists of 3 measurements on a wafer: Thickness (x, 1st column), Horizontal displacement (y, 2nd column), and Vertical displacement (z, 3rd column).
Principal Component Analysis
STEP 2: COMPUTE THE CORRELATION MATRIX

First compute the correlation matrix.

      x           y           z
x     CC(x,x)     CC(x,y)     CC(x,z)
y     CC(y,x)     CC(y,y)     CC(y,z)
z     CC(z,x)     CC(z,y)     CC(z,z)

      x           y           z
x     1.00        0.668657    -0.10131
y     0.668657    1.00        -0.28793
z     -0.10131    -0.28793    1.00

• The correlation of a variable with itself is 1; for that reason, all the diagonal values are 1.00.
• Here, CC(x,y) means the correlation coefficient between x and y.
Principal Component Analysis
STEP 2: COMPUTE THE CORRELATION MATRIX

Let this correlation matrix be R:

      x           y           z
x     1.00        0.668657    -0.10131
y     0.668657    1.00        -0.28793
z     -0.10131    -0.28793    1.00
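A minimal sketch of this step with NumPy (the random matrix below only stands in for the 10 × 3 wafer data from STEP 1, so its correlations will not match R):

```python
import numpy as np

# Stand-in for the 10 x 3 wafer data (columns: thickness x, horizontal
# displacement y, vertical displacement z); random values are used here
# only so the sketch runs end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

R = np.corrcoef(X, rowvar=False)   # 3 x 3 correlation matrix; diagonal entries are 1.00
print(np.round(R, 6))
```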
Principal Component Analysis
► Example:

CC(x,y) = [N·Σxy − (Σx)(Σy)] / √{[N·Σx² − (Σx)²][N·Σy² − (Σy)²]}

where N is the number of data points.

x    y    xy    x²    y²
7    4    28    49    16
4    1     4    16     1
6    3    18    36     9
8    6    48    64    36
8    5    40    64    25
7    2    14    49     4
5    3    15    25     9
9    5    45    81    25
7    4    28    49    16
8    2    16    64     4

Similarly, the CC of the other combinations can be seen in matrix R.
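A minimal check of this correlation coefficient with NumPy, using the x and y columns from the table above:

```python
import numpy as np

x = np.array([7, 4, 6, 8, 8, 7, 5, 9, 7, 8], dtype=float)
y = np.array([4, 1, 3, 6, 5, 2, 3, 5, 4, 2], dtype=float)
N = len(x)

# Computational formula from the slide
cc = (N * (x * y).sum() - x.sum() * y.sum()) / np.sqrt(
    (N * (x ** 2).sum() - x.sum() ** 2) * (N * (y ** 2).sum() - y.sum() ** 2))

print(round(cc, 6))                       # 0.668665 ~ CC(x,y) in R
print(round(np.corrcoef(x, y)[0, 1], 6))  # same value via NumPy's built-in
```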
Principal Component Analysis

STEP 3: SOLVE FOR THE ROOTS OF R

Next solve for the roots (eigenvalues) of the correlation matrix R:

λ1 = 1.769,  λ2 = 0.927,  λ3 = 0.304


Principal Component Analysis
Note:

• Each eigenvalue satisfies |R − λI| = 0.
• The sum of the eigenvalues = 3 = p, which is equal to the trace of R (i.e., the sum of the main diagonal elements).
• The determinant of R is the product of the eigenvalues: λ1 × λ2 × λ3 = 0.499.
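A minimal NumPy check of these properties on R (the ordering of eigenvalues returned by NumPy may differ from the slide's):

```python
import numpy as np

R = np.array([[ 1.0,       0.668657, -0.10131],
              [ 0.668657,  1.0,      -0.28793],
              [-0.10131,  -0.28793,   1.0     ]])

eigvals = np.linalg.eigvalsh(R)[::-1]   # R is symmetric; reorder to descending
print(np.round(eigvals, 3))             # ~[1.769 0.927 0.304]
print(round(eigvals.sum(), 3))          # 3.0   = trace of R
print(round(eigvals.prod(), 3))         # 0.499 = determinant of R
```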


Principal Component Analysis

► STEP 4: COMPUTE THE FIRST COLUMN OF THE V MATRIX

Substituting the first eigenvalue, 1.769, and R into the eigenvector equation (R − λ1·I)·v1 = 0, we obtain a matrix expression for three homogeneous equations with three unknowns. Solving it yields the first column of V:

   0.64
   0.69
  -0.34
Principal Component Analysis
► STEP 5: COMPUTE THE REMAINING COLUMNS OF THE V MATRIX

The remaining columns of V are obtained from the other eigenvalues in the same way. Notice that if you multiply V by its transpose, the result is the identity matrix: V′V = I.
PCA loadings and the loading matrix
Principal Component Analysis
► STEP 6: COMPUTE THE E^½ MATRIX

Now form the matrix E^½, a diagonal matrix whose elements are the square roots of the eigenvalues of R. Then obtain S, the factor structure (loading) matrix, using S = V·E^½.

So, for example, 0.91 is the correlation between the second variable and the first principal component.
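A sketch of forming E^½ and the first column of S from the quantities computed so far (only the first column of V is given on the slides, so only the first column of S is reproduced here; small rounding differences are expected):

```python
import numpy as np

eigvals = np.array([1.769, 0.927, 0.304])   # from STEP 3
v1 = np.array([0.64, 0.69, -0.34])          # first column of V, from STEP 4

E_half = np.diag(np.sqrt(eigvals))          # E^(1/2): diagonal matrix of sqrt(eigenvalues)

s1 = v1 * np.sqrt(eigvals[0])               # first column of S = V . E^(1/2)
print(np.round(s1, 2))                      # [ 0.85  0.92 -0.45]; ~0.91 after rounding is the
                                            # correlation between variable 2 and the first PC
```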
Principal Component Analysis
► STEP 7: COMPUTE THE COMMUNALITY

Next compute the communality, using the first two eigenvalues only.
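In symbols, writing s_ij for the entries of the structure matrix S, the communality of variable i over the first two principal components is (standard definition, stated here for completeness):

  h_i² = s_i1² + s_i2²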
Principal Component Analysis

► STEP 8: THE DIAGONAL ELEMENTS REPORT HOW MUCH OF THE VARIABILITY IS EXPLAINED

The communality consists of the diagonal elements:

var
1    0.8662
2    0.8420
3    0.9876

This means that the first two principal components "explain" 86.62% of the first variable, 84.20% of the second variable, and 98.76% of the third.
Principal Component Analysis
► STEP 9: COMPUTE THE COEFFICIENT MATRIX

The coefficient matrix, B, is formed using the reciprocals of the diagonals of E^½.
Principal Component Analysis

► STEP 10: COMPUTE THE PRINCIPAL FACTORS

Finally, we can compute the factor scores from ZB, where Z is X converted to standard-score form. These columns are the principal factors (see the sketch below).
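A sketch of STEPS 9 and 10 with NumPy, assuming (as the text describes) that B uses the reciprocals of the diagonal of E^½, i.e. B = V·E^(−½), and that Z is the column-standardized data; the inputs are taken from the earlier steps:

```python
import numpy as np

def principal_factors(X, V, eigvals):
    """X: n x p raw data, V: eigenvector matrix of R, eigvals: eigenvalues of R."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # X in standard-score form
    B = V @ np.diag(1.0 / np.sqrt(eigvals))            # reciprocals of the diagonal of E^(1/2)
    return Z @ B                                       # factor scores: one column per principal factor
```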
Principal Component Analysis

► PRINCIPAL FACTORS CONTROL CHART

These factors can be plotted against their indices, which could be times. If time is used, the resulting plot is an example of a principal factors control chart.
QUESTION BANK
(Marks, BL and CO are shown in brackets.)

1. a) Explain the concept of proximity measures and their importance in machine learning. [7M, BL 2, CO 2]
   b) Differentiate between metric and non-metric similarity functions with suitable examples. [7M, BL 2, CO 2]
2. Solve a problem to calculate the Euclidean and Manhattan distances for given data points. Example: data points A = [3, 4], B = [6, 8]; calculate both distances. [14M, BL 3, CO 2]
3. a) Describe different distance measures such as Euclidean, Manhattan, and Minkowski with examples. [7M, BL 2, CO 2]
   b) Problem: For two binary patterns A = [1, 0, 1, 1, 0] and B = [1, 1, 1, 0, 0], calculate the Jaccard similarity. [7M, BL 3, CO 2]
4. a) Explain the K-Nearest Neighbor (KNN) algorithm and describe its working with an example. [7M, BL 2, CO 2]
   b) Problem: Given a dataset [(2, 3, "A"), (4, 5, "B"), (6, 7, "A")], classify the test point (3, 4) for K = 3. [7M, BL 3, CO 2]
5. Describe and solve a Radius Distance Nearest Neighbor (RNN) classification problem: use a sample dataset and classify points within a radius of 2 units. [14M, BL 3, CO 2]
QUESTION BANK

6. a) Differentiate between the KNN classifier and KNN regression with examples. [7M, BL 2, CO 2]
   b) Problem: Perform KNN regression on the given data points [(1, 2), (2, 3), (3, 4)]; predict for x = 2.5 using K = 2. [7M, BL 3, CO 2]
7. Analyze the performance of classifiers using a confusion matrix and compute evaluation metrics. Example: confusion matrix with True Positives = 50, False Positives = 10, False Negatives = 5, True Negatives = 35. [14M, BL 4, CO 2]
8. a) Discuss the factors affecting the performance of regression algorithms in KNN. [7M, BL 2, CO 2]
   b) Problem: Calculate the mean squared error (MSE) and mean absolute error (MAE) for a KNN regression problem: predicted values [2.5, 3.7, 4.2], actual values [3, 4, 4.5]. [7M, BL 4, CO 2]
Thank You!
