Lecture 3 - kNN algorithm
K-Nearest-Neighbors Algorithm
and Data Conversion
Lê Anh Cường
TDTU
Supervised Learning
• Classification
• Regression

The main difference between them is that the output variable in regression is numerical (or continuous), while that for classification is categorical (or discrete).
Unsupervised Learning
Unsupervised Learning is a machine learning technique in which users do not need to supervise the model. Instead, the model works on its own to discover patterns and information that were previously undetected.
KNN is a Method in Instance-Based Learning
• Instance-based learning is often termed lazy learning, as there is typically no “transformation” of training instances into more general “statements”.

[Figure: two contrasting pipelines. In kNN, the label of a new instance is computed directly from the supervised examples; in model-based learning, the supervised examples are first generalized into a learning model, which then labels the new instance.]
K-Nearest-Neighbors Algorithm
• A case is classified by a majority voting of its neighbors, with the case
being assigned to the class most common among its K nearest
neighbors measured by a Distance Function.
• If K=1, then the case is simply assigned to the class of its nearest neighbor.
What is the most probable label for c?
• Solution: find the K nearest neighbors of c.
• Take the majority label among them as c’s label.
• Let’s suppose k = 3:
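As a concrete illustration, here is a minimal sketch of this procedure in Python (the helper names euclidean and knn_classify are hypothetical, and Euclidean distance is assumed):

```python
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(query, examples, k=3):
    # examples: list of (feature_vector, label) pairs.
    # Sort by distance to the query and keep the k nearest.
    neighbors = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

examples = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((4.0, 4.0), "B"), ((4.2, 3.9), "B")]
print(knn_classify((1.1, 0.9), examples, k=3))  # -> "A"
```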
Distance Function Measurements
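Typical distance functions for real-valued attribute vectors $x$ and $y$ with $n$ attributes include:

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

$$d_{\text{Minkowski}}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$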
Example
[Figure: worked examples assigning a class to a query point from its K nearest neighbors.]
kNN for Classification
• A simple implementation of KNN classification is a majority voting
mechanism.
• Replace the unweighted majority vote

$$\hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$$

by the distance-weighted vote:

$$\hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} \frac{1}{d(x_i, x_q)^2}\, \delta(v, f(x_i))$$

where $\delta(v, f(x_i)) = 1$ if $v = f(x_i)$ and $0$ otherwise.
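A minimal sketch of this weighted vote (the helper names are hypothetical; the 1/d² weighting follows the formula above, and an exact match at d = 0 is assumed to win outright):

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify_weighted(query, examples, k=3):
    # examples: list of (feature_vector, label) pairs.
    neighbors = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
    scores = defaultdict(float)
    for x, label in neighbors:
        d = euclidean(query, x)
        if d == 0.0:
            return label  # exact match: assume it wins outright
        # Each neighbor votes for its label with weight 1 / d(x_i, x_q)^2.
        scores[label] += 1.0 / d ** 2
    return max(scores, key=scores.get)
```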
Issues with Distance Metrics
• Most distance measures were designed for linear/real-valued
attributes
• Two important questions in the context of machine learning:
• How to handle nominal attributes
• What to do when attribute types are mixed
Hamming Distance
• For categorical variables, the Hamming distance can be used: it counts the number of attributes on which two instances disagree.
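A minimal sketch of Hamming distance between two categorical attribute vectors (the function name hamming is hypothetical):

```python
def hamming(a, b):
    # Count the attributes on which the two instances differ.
    return sum(1 for ai, bi in zip(a, b) if ai != bi)

print(hamming(["red", "male", "low"], ["red", "female", "high"]))  # -> 2
```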
Normalization
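Because kNN distances are dominated by attributes with large numeric ranges, attributes are typically rescaled to a common interval before computing distances. A standard choice (min-max scaling, assumed here) is:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

which maps each attribute into $[0, 1]$; z-score standardization, $x' = (x - \mu)/\sigma$, is a common alternative.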
Exercise
What to do when attribute types are mixed
• Convert categorical values into numerical values (see the sketch after this list):
• Binary values, for example ’male’/‘female’: convert to 0 and 1.
• Ordinal values with multiple degrees, such as ‘low’, ‘average’, and ‘high’: convert to 1, 2, 3.
• If the values are “Red”, “Green”, “Blue” (or more generally, something that has no intrinsic order): one-hot encode them as [1,0,0], [0,1,0], [0,0,1].
• Normalize or scale the data into the same interval.
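A minimal sketch of these conversions in Python (the column names and mapping dictionaries are hypothetical; min-max scaling is assumed for the final step):

```python
def one_hot(value, categories):
    # Unordered categories -> indicator vector, e.g. "Green" -> [0, 1, 0].
    return [1 if value == c else 0 for c in categories]

def min_max_scale(column):
    # Rescale a numeric column into [0, 1].
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

ORDINAL = {"low": 1, "average": 2, "high": 3}   # ordered categories -> ranks
BINARY = {"male": 0, "female": 1}               # binary category -> 0/1

rows = [("male", "low", "Red", 150),
        ("female", "high", "Blue", 180)]

encoded = [[BINARY[sex], ORDINAL[level],
            *one_hot(color, ["Red", "Green", "Blue"]), height]
           for sex, level, color, height in rows]

# Scale the numeric height column so no attribute dominates the distance.
heights = min_max_scale([row[-1] for row in encoded])
```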
Exercise
KDTree
https://viblo.asia/p/gioi-thieu-thuat-toan-kd-trees-nearest-
neighbour-search-RQqKLvjzl7z
Nearest Neighbor via KDTree
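As an illustration of what a KD-tree buys you, here is a minimal sketch using scipy.spatial.KDTree (the choice of SciPy is an assumption; the linked article describes the underlying algorithm):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 2))           # 1000 training points in 2-D

tree = KDTree(points)                    # build once: O(n log n)
dist, idx = tree.query([0.5, 0.5], k=3)  # 3 nearest neighbors, fast per query
print(dist, idx)
```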
Summary
• kNN can deal with complex and arbitrary decision
boundaries.
• Despite its simplicity, researchers have shown that the classification accuracy of kNN can be quite strong, and in many cases as high as that of more elaborate methods.
• kNN is slow at classification time.
• kNN does not produce an understandable model.