
k-Nearest Neighbour

Linear Regression
Logistic Regression
BUSINESS CASE - BLINKIT

Blinkit needs the optimal number of delivery partners for each store. Hence, stores are classified into 3 categories based on outgoing deliveries:

High Traffic
Moderate Traffic
Low Traffic
BLINKIT DATA

Will logistic regression work???

The data poses a multiclass problem, is non-linear, and is imbalanced. Logistic regression would require an extensive search for the correct polynomial features.

Hence, there is a need for a new algorithm without such feature engineering.
xq1 belongs to the (+) class.
xq2 belongs to the (o) class.

Just by looking at the neighbouring points, we are sure about xq1 and xq2.

The k-nearest neighbour (kNN) model works on the same intuition: the class of a data point (xq) depends on the class of its neighbouring points.
How does kNN work?

If xq = [2, 5] and the data contains 6 data points:

Step 1 : Find the Euclidean distance from xq to every data point, e.g. between [3, 6] and [2, 5].
Step 2 : Sort the data based on distance.
Step 3 : Pick the 3 data points having the minimum distance from xq.
Step 4 : Find the majority class of these selected data points —>> class label for xq.

Here the majority class is 2, so xq belongs to class 2.

This selection of data points is decided by the hyperparameter “k”, hence the name kNN.
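A minimal sketch of these four steps in Python/NumPy (the toy points, labels, and the query [2, 5] are illustrative assumptions, not data from the slides):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, xq, k=3):
    """Classify xq by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from xq to every training point.
    dists = np.sqrt(((X_train - xq) ** 2).sum(axis=1))
    # Steps 2 & 3: sort by distance and keep the k closest points.
    nearest = np.argsort(dists)[:k]
    # Step 4: majority class among those neighbours.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: 6 points with labels 1 or 2 (illustrative values).
X = np.array([[3, 6], [1, 4], [2, 7], [8, 1], [9, 2], [7, 3]])
y = np.array([2, 2, 2, 1, 1, 1])
print(knn_predict(X, y, np.array([2, 5]), k=3))  # -> 2
```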
POINTS TO REMEMBER

● kNN is a non-parametric algorithm.

● kNN predicts the class of test data [xq] on the basis of its neighbourhood.

What happens if k=4?

Making predictions based on the 4 nearest neighbours:

2 data points >> class 1
2 data points >> class 2

Tie problem: kNN cannot make a prediction.

Hence, it is advisable to keep k as an odd value.


What happens if k=5?

Making predictions based on the 5 nearest neighbours: with more than two classes, there can still be a tie even if we keep the k value odd.

Hack! Randomly pick a class label from the tied classes.

Here, kNN can pick class 1 or class 2 for xq.
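One hedged way to implement this hack in code (the neighbour labels below are illustrative):

```python
import random
from collections import Counter

def majority_vote(neighbour_labels):
    """Majority vote; if several classes tie for the top count, pick one at random."""
    counts = Counter(neighbour_labels)
    top = max(counts.values())
    tied = [label for label, count in counts.items() if count == top]
    return random.choice(tied)

# k=4 with two neighbours of class 1 and two of class 2 -> tie, resolved randomly.
print(majority_vote([1, 1, 2, 2]))
```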


POINTS TO REMEMBER

● kNN is a non-parametric algorithm.

● kNN predicts the class of test data [xq] on the basis of its neighbourhood.

WORKING OF kNN:

● Find distance (xq and all training data)

● Sort distances

● Pick k nearest neighbours

● Majority vote for class prediction


How does kNN have good performance on non-linear, multi-class data?

Assume the data contains three classes (+), (-), (o) and k=5. If most of xq's 5 nearest neighbours belong to class (+), then xq is assigned class (+).

kNN assumes the neighbourhood is homogeneous, i.e. the characteristics of the nearest neighbours and of xq will be the same. For the same reason, kNN fails when the data has a lot of noise/outliers.
POINTS TO REMEMBER

● kNN is a non-parametric algorithm.

● kNN predicts the class of test data [xq] on the basis of its neighbourhood.

WORKING OF kNN:

● Find distance (xq and all training data)

● Sort distances

● Pick k nearest neighbours

● Majority vote for class prediction

● kNN assumes a homogeneous neighbourhood

● It is heavily impacted if outliers increase.

Bias-variance tradeoff in kNN: yes or no?

Suppose the data contains 2 outliers, and consider two query points xq1 and xq2.

For k=1:
As xq1 lies closest to the (-) outlier, its class prediction is (-).
As xq2 lies closest to the (+) outlier, its class prediction is (+).

kNN is trying to fit every data point, giving a rough decision boundary.

Taking the same data, what will be the class label for xq if k=72?

Among the 72 neighbours, (+) = 31 and (-) = 41, so xq -> (-) class, as (-) > (+).

Even when xq is closer to (+), kNN does not fit the training data.

As k increases, kNN underfits.


Summary!

Training time complexity: O(1). No computation is done by kNN at training time; it only stores the data.

Space complexity: O(n x d). kNN stores the entire training data (n points, d dimensions).

Test time complexity:

Step 1 : Find distance b/w training data and xq = O(n x d)

Step 2 : Sort data = O(n log n)

Step 3 : Pick k nearest neighbours = O(k)

Step 4 : Majority vote = O(k)

As k << n and d, O(k) is ignored.

Test time complexity = O(nd + n log n)
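As a rough illustration (a sketch using scikit-learn's KNeighborsClassifier with brute-force search, which the slides do not mention), fitting essentially just stores the data, while the distance and neighbour-selection cost is paid at prediction time:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))        # n = 10,000 points, d = 20 features
y = rng.integers(0, 3, size=10_000)

clf = KNeighborsClassifier(n_neighbors=5, algorithm="brute")
clf.fit(X, y)        # roughly O(1) work beyond storing X and y
clf.predict(X[:1])   # distance computation + neighbour selection happen here
```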
POINTS TO REMEMBER

● kNN is a non-parametric algorithm.

● kNN predicts the class of test data [xq] on the basis of its neighbourhood.

WORKING OF kNN:

● Find distance (xq and all training data)

● Sort distances

● Pick k nearest neighbours

● Majority vote for class prediction

● kNN assumes a homogeneous neighbourhood

● It is heavily impacted if outliers increase.

● If k increases, kNN underfits. (bias increases, variance decreases)

● If k decreases, kNN overfits. (bias decreases, variance increases)


POINTS TO REMEMBER

● Train time complexity —>> O(1)

● Test time complexity —>> O(nd + n log n)

● Space complexity —>> O(nd)


Diabetic Patient Example

Suppose we take 40 diabetic (+) class and 40 non-diabetic (-) class patients.

Features: Gender (M, F), Age, BP, Glucose Level, Blood Group (A+, B+, O+, AB+, AB-, O-, B-, A-).

kNN does not work on categorical data, as Euclidean distance needs numeric data!

Convert the categorical data into numerical data by ONE HOT ENCODING (OHE).
OHE to convert categorical data into numeric data:

OHE of Gender becomes: (n, 2)

Similarly, OHE of Blood Group becomes: (n, 8)

The total number of dimensions grows when One Hot Encoding is applied. As OHE increases dimensions, it leads to the curse of dimensionality.

Hence, Target Encoding shall be used.
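A hedged sketch of the dimension blow-up using scikit-learn's OneHotEncoder (the column values below are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M"],
    "BloodGroup": ["A+", "O-", "B+", "AB+"],
})

# sparse_output requires scikit-learn >= 1.2
enc = OneHotEncoder(sparse_output=False)
X = enc.fit_transform(df)

# Gender contributes 2 columns; Blood Group contributes one column per
# category seen (up to 8), so the encoded matrix is much wider than df.
print(X.shape)
```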
What is Curse of Dimensionality?

Suppose x1 and x2 have dimension = 4. Due to the low dimension, the Euclidean distance between x1 & x2 is very large (the distances remain informative).

Due to high dimension, the Euclidean distance between x1 & x2 becomes very small (the distances stop being informative), so Euclidean distance cannot be used.

Conclusion : Euclidean distance fails when dimension is high
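The slides state this conclusion without a demo; one common way to see it (an assumption beyond the slides) is that the gap between the nearest and farthest point shrinks relative to the distances themselves as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(d, n=1000):
    """Relative gap between the farthest and nearest point from a random query."""
    X = rng.random((n, d))
    xq = rng.random(d)
    dists = np.linalg.norm(X - xq, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in [4, 50, 1000]:
    print(d, round(distance_contrast(d), 3))
# The contrast shrinks as d grows: all points look almost equally far away,
# which is why Euclidean distance becomes uninformative in high dimensions.
```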


POINTS TO REMEMBER

● kNN is a non-parametric algorithm.

● kNN predicts the class of test data [xq] on the basis of its neighbourhood.

WORKING OF kNN:

● Find distance (xq and all training data)

● Sort distances

● Pick k nearest neighbours

● Majority vote for class prediction

● kNN assumes a homogeneous neighbourhood

● It is heavily impacted if outliers increase.

● If k increases, kNN underfits. (bias increases, variance decreases)

● If k decreases, kNN overfits. (bias decreases, variance increases)


POINTS TO REMEMBER

● Train time complexity —>> O(1)

● Test time complexity —>> O(nd + n log n)

● Space complexity —>> O(nd)

● Euclidean distance fails when there is high dimensional data
What other distance to use?

Manhattan Distance: can be understood as the distance measured as we walk along a path (city blocks) from x1 to x2:

d(x1, x2) = Σ |x1i - x2i|

What other distance to use?

Minkowski Distance: a generalisation with parameter p (p=1 gives Manhattan, p=2 gives Euclidean):

d(x1, x2) = ( Σ |x1i - x2i|^p )^(1/p)
Manhattan distance & One Hot Encoding

OHE creates high dimensional, sparse data.

Manhattan Distance gives equal importance to all the features (even to irrelevant features).
Cosine Similarity for One Hot Encoding

Since cosine similarity focuses on the direction of vectors, it easily ignores irrelevant features.

cos(x1, x2) = (x1 · x2) / (||x1|| ||x2||)

It ranges from -1 (least similar) to 1 (most similar).
Distance metrics used for kNN

● Euclidean Distance – for low dimensional data

● Cosine similarity – for high dimensional data
● Manhattan – useful when the data is like a map
● Minkowski – for using a custom distance metric
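A short sketch computing the metrics listed above with SciPy/NumPy (the two example vectors are illustrative):

```python
import numpy as np
from scipy.spatial import distance

x1 = np.array([1.0, 0.0, 2.0])
x2 = np.array([0.0, 1.0, 2.0])

print(distance.euclidean(x1, x2))        # sqrt of the sum of squared differences
print(distance.cityblock(x1, x2))        # Manhattan: sum of absolute differences
print(distance.minkowski(x1, x2, p=3))   # Minkowski with a custom p
print(1 - distance.cosine(x1, x2))       # cosine similarity = 1 - cosine distance
```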

POINTS TO REMEMBER

● kNN is a non-parametric algorithm.

● kNN predicts the class of test data [xq] on the basis of its neighbourhood.

WORKING OF kNN:

● Find distance (xq and all training data)

● Sort distances

● Pick k nearest neighbours

● Majority vote for class prediction

● kNN assumes a homogeneous neighbourhood

● It is heavily impacted if outliers increase.

● If k increases, kNN underfits. (bias increases, variance decreases)

● If k decreases, kNN overfits. (bias decreases, variance increases)


POINTS TO REMEMBER

● Train time complexity —>> O(1)

● Test time complexity —>> O(nd + n log n)

● Space complexity —>> O(nd)

● Euclidean distance fails when there is high dimensional data

● Cosine similarity – for high dimensional data; ranges over [-1, 1]
● Manhattan – useful when the data is like a map
● Minkowski – for using a custom distance metric
How does kNN work so fast in Google searches?

Google Images uses kNN to show famous monuments just by searching a city name. This is made fast by a hashing algorithm: LSH (Locality Sensitive Hashing).
What is hashing?

Storing data as key-value pairs (analogous to a dictionary).

If query = Delhi, it returns India Gate, Red Fort, Qutub Minar.

It quickly returns the data: time complexity O(1).
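A tiny sketch of this key-value lookup with a Python dict (the second city entry is an illustrative addition):

```python
# Hash-table (dict) lookup: retrieving monuments for a city key in O(1) on average.
monuments = {
    "Delhi": ["India Gate", "Red Fort", "Qutub Minar"],
    "Agra": ["Taj Mahal"],  # illustrative extra key
}
print(monuments["Delhi"])  # ['India Gate', 'Red Fort', 'Qutub Minar']
```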


How does LSH work?

For the hash table, create a randomised hash function h(x). It gives the key for the hash table.

Suppose we take a random vector w and define the hash bit as 1 if w · x >= 0 and 0 otherwise; repeating this for a few random vectors gives a bit-vector key. Points with the same key are clubbed into one bucket of the hash table.

LSH’s role in speeding up kNN

LSH groups similar data points. Suppose for some xq, h(xq) = [0,1,0].

We run kNN only for the data points having h(x) = [0,1,0], instead of the whole data.

This reduces testing time complexity, as kNN is using only a subset of the data.
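A minimal sketch of this idea (random-hyperplane LSH, with illustrative data sizes and the bucket key built from the signs of three random projections):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hash(d, n_planes=3):
    """Random hyperplanes; the sign pattern of x @ W is the bucket key."""
    W = rng.normal(size=(d, n_planes))
    def h(x):
        return tuple((np.asarray(x) @ W >= 0).astype(int))
    return h

# Build the hash table over the training data.
X_train = rng.normal(size=(1000, 5))
h = make_hash(d=5)
buckets = {}
for i, x in enumerate(X_train):
    buckets.setdefault(h(x), []).append(i)

# At query time, run kNN only on the candidates in xq's bucket
# instead of the whole training set.
xq = rng.normal(size=5)
candidates = buckets.get(h(xq), [])
print(h(xq), len(candidates))
```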


POINTS TO REMEMBER

● kNN is a non-parametric algorithm.

● kNN predicts the class of test data [xq] on the basis of its neighbourhood.

WORKING OF kNN:

● Find distance (xq and all training data)

● Sort distances

● Pick k nearest neighbours

● Majority vote for class prediction

● kNN assumes a homogeneous neighbourhood

● It is heavily impacted if outliers increase.

● If k increases, kNN underfits. (bias increases, variance decreases)

● If k decreases, kNN overfits. (bias decreases, variance increases)


POINTS TO REMEMBER

● Train time complexity —>> O(1)

● Test time complexity —>> O(nd + n log n)

● Space complexity —>> O(nd)

● Euclidean distance fails when there is high dimensional data

● Cosine similarity – for high dimensional data; ranges over [-1, 1]
● Manhattan – useful when the data is like a map
● Minkowski – for using a custom distance metric
POINTS TO REMEMBER

● LSH reduces testing time complexity by selecting a subset of the data determined by h(x).
What are the techniques of imputing?

● Mean or median of the Fj feature

● Analysing the data and manually imputing a value
● Mean and median of the whole data
kNN for Imputation

Step 1 : Exclude the feature Fj (the one with the missing value) from the data.

Step 2 : Find the distance between xi and the rest of the data, and pick the k nearest neighbours.

Step 3 : For these nearest neighbours, check their values of Fj and impute from them (e.g. their mean).
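These three steps are what scikit-learn's KNNImputer does; a hedged sketch with an illustrative toy matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with a missing value in the last column of the second row.
X = np.array([[1.0, 2.0, 4.0],
              [1.1, 2.1, np.nan],
              [8.0, 9.0, 3.0]])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
# The NaN is replaced by the mean of that feature over the 2 nearest rows,
# with distances computed on the non-missing features.
print(X_filled)
```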


POINTS TO REMEMBER

● kNN is a non-parametric algorithm.

● kNN predicts the class of test data [xq] on the basis of its neighbourhood.

WORKING OF kNN:

● Find distance (xq and all training data)

● Sort distances

● Pick k nearest neighbours

● Majority vote for class prediction

● kNN assumes a homogeneous neighbourhood

● It is heavily impacted if outliers increase.

● If k increases, kNN underfits. (bias increases, variance decreases)

● If k decreases, kNN overfits. (bias decreases, variance increases)


POINTS TO REMEMBER

● Train time complexity —>> O(1)

● Test time complexity —>> O(nd + n log n)

● Space complexity —>> O(nd)

● Euclidean distance fails when there is high dimensional data

● Cosine similarity – for high dimensional data; ranges over [-1, 1]
● Manhattan – useful when the data is like a map
● Minkowski – for using a custom distance metric
POINTS TO REMEMBER

● LSH reduces testing time complexity by selecting a subset of the data determined by h(x).

● kNN can be used for imputation.


d Blue
dz Green

ds Green
d
dff da Blue
i di Blue

100
rs
i
i
HE

If IF

You might also like