
CSE445 NSU Week - 5

Nearest Neighbor Classifiers utilize labeled records and a proximity metric to classify unknown records based on the majority class of their nearest neighbors. The algorithm requires careful selection of the number of neighbors (k) and may involve distance metrics like Euclidean or Manhattan distance. Feature scaling is essential to ensure that no single attribute dominates the distance calculations, and various query types such as exact match, range, and nearest-neighbor queries are discussed.


Nearest Neighbor Classifiers

● Basic idea:
– If it walks like a duck, quacks like a duck, then it’s probably a duck

(Diagram: given an unknown test record, compute its distance to all training records, then choose the k “nearest” records.)
Nearest-Neighbor Classifiers
● Requires the following:
– A set of labeled records (supervised learning)
– A proximity metric to compute the distance/similarity between a pair of records
– e.g., Euclidean distance, Manhattan distance
– The value of k, the number of nearest neighbors to retrieve (how many neighbors to consider)
– A method for using the class labels of the k nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
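A minimal from-scratch sketch of these ingredients (Euclidean distance as the proximity metric, the k nearest records, and an unweighted majority vote); the toy records and labels are invented for illustration and are not from the slides:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training records."""
    # Proximity metric: Euclidean distance to every training record.
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # Indices of the k nearest records.
    nearest = np.argsort(dists)[:k]
    # Majority vote over their class labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy labeled records (invented for illustration).
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array(["duck", "duck", "goose", "goose"])

print(knn_predict(X_train, y_train, np.array([1.2, 1.5]), k=3))  # -> "duck"
```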
How to Determine the class label of a Test Sample?

Example: unweighted (majority) voting, where each of the k nearest neighbors gets an equal vote
Query Types
• Exact match query: Asks for the object(s) whose key matches the query key exactly.

• Range query: Asks for the objects whose key lies in a specified query range (interval).

• Nearest-neighbor query: Asks for the objects whose key is “close” to the query key.
Exact Match Query
• Suppose that we store employee records in a database:
• Asks for the object(s) whose key matches the query key exactly.

(Table of employee records with columns: ID, Name, Age, Salary, #Children)

• Example:
• key=ID: retrieve the record with ID=12345
Range Query
• Example:
• key=Age: retrieve all records satisfying 20 < Age < 50
• key=#Children: retrieve all records satisfying 1 < #Children < 4

(Table of employee records with columns: ID, Name, Age, Salary, #Children)
Nearest-Neighbor(s) (NN) Query
• Example:
• key=Salary: retrieve the employee whose salary is closest to $50,000 (i.e., 1-NN).
• key=Age: retrieve the 5 employees whose ages are closest to 40 (i.e., k-NN, k=5).

(Table of employee records with columns: ID, Name, Age, Salary, #Children)
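A minimal sketch of the three query types over a small in-memory employee table; the records below are invented for illustration:

```python
# Toy employee records (invented): (ID, Name, Age, Salary, #Children)
employees = [
    (12345, "Alice", 34, 48000, 2),
    (20311, "Bob",   45, 52000, 1),
    (30972, "Carol", 29, 75000, 0),
    (40518, "Dave",  51, 41000, 3),
]

# Exact match query: key = ID, retrieve the record with ID = 12345.
exact = [e for e in employees if e[0] == 12345]

# Range query: key = Age, retrieve all records with 20 < Age < 50.
in_range = [e for e in employees if 20 < e[2] < 50]

# Nearest-neighbor query: key = Salary, the employee closest to $50,000 (1-NN).
nearest = min(employees, key=lambda e: abs(e[3] - 50000))

print(exact, in_range, nearest, sep="\n")
```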
Nearest Neighbor(s) Query
• What is the closest restaurant to my hotel?

Nearest Neighbor(s) Query (cont’d)
• Find the 4 closest restaurants to my hotel

Nearest Neighbor Query in High Dimensions
• Very important and practical problem!
• Image retrieval: represent each image as a feature vector (f1, f2, ..., fk) and find the N closest matches (i.e., the N nearest neighbors)
Nearest Neighbor Query in High Dimensions
• Face recognition: find the closest match (i.e., the nearest neighbor)
Interpreting Queries Geometrically
• Multi-dimensional keys can be thought of as “points” in high-dimensional spaces.

Queries about records → Queries about points
Example 1 – Range Search in 2D

age = 10,000 × year + 100 × month + day

Example 2 – Range Search in 3D
Example 3 – Nearest Neighbors Search

(Figure: a query point among the data points; measure the point-to-point distance.)
Classification…
● Data preprocessing is often required
– Attributes must be scaled to prevent distance measures from being dominated by one of the attributes
◆ Example:
– the height of a person may vary from 1.5 m to 1.8 m
– the weight of a person may vary from 90 lb to 300 lb
– the income of a person may vary from $10K to $1M
– Time series are often standardized to have zero mean and a standard deviation of 1

(Example figure: with k = 3, the predicted class is Y by unweighted majority voting.)
K-Nearest Neighbor (kNN)
● kNN is a lazy learner algorithm: it does not learn from the training set immediately; instead, it stores the dataset and performs the computation at classification time
● During the training phase, the kNN algorithm simply stores the entire training dataset as a reference
● Manhattan distance: the sum of absolute differences between points across all the dimensions
● Minkowski distance: the generalized form of the Euclidean and Manhattan distances
● When the order p is 1, the Minkowski and Manhattan distances are the same
● When the order p is 2, the Minkowski and Euclidean distances are the same
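A minimal sketch of these three distances, d(x, y) = (Σ|xᵢ − yᵢ|ᵖ)^(1/p), on a pair of points; the point values are invented for illustration:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

manhattan = np.sum(np.abs(x - y))          # sum of absolute differences
euclidean = np.sqrt(np.sum((x - y) ** 2))  # straight-line distance

print(manhattan, minkowski(x, y, p=1))     # identical values: order 1
print(euclidean, minkowski(x, y, p=2))     # identical values: order 2
```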
K-Nearest Neighbor (kNN)
● When making predictions, it calculates the distance between the input data point and all the training examples, using a chosen distance metric such as Euclidean distance.
● Next, the algorithm identifies the k nearest neighbors to the input data point based on their distances.
● In the case of classification, the algorithm assigns the most common class label among the k neighbors as the predicted label for the input data point.
● For regression, it calculates the average or weighted average of the target values of the k neighbors to predict the value for the input data point.
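A minimal sketch of the regression case using scikit-learn's KNeighborsRegressor; the toy 1-D data is invented for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D regression data (invented for illustration).
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

# 'uniform' averages the k neighbors' targets; 'distance' weights them
# by inverse distance instead.
reg = KNeighborsRegressor(n_neighbors=3, weights="uniform")
reg.fit(X_train, y_train)

print(reg.predict([[2.5]]))  # mean of the 3 nearest targets
```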
Need for Standardizing Attributes: Feature Scaling

• Feature Scaling is a technique to standardize the independent features present in the data to a fixed range
• If feature scaling is NOT done, the LOAN feature will dominate all other features when predicting the class of a given data point
• Min-Max scaling: re-scales a feature to the range 0 to 1
• Standardization: re-scales data so that they have a mean of 0 and a standard deviation of 1
• Good for normally distributed features
• sklearn provides MinMaxScaler and StandardScaler
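A minimal sketch of both sklearn scalers on a toy feature matrix; the columns and values are invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (invented): columns on very different scales, e.g. Age and Loan.
X = np.array([[25,  40000],
              [32,  60000],
              [47, 120000],
              [51,  25000]], dtype=float)

# Min-Max scaling: each column re-scaled to [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each column re-scaled to mean 0, standard deviation 1.
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```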

Need for Standardizing Attributes

(Example figure: after Min-Max scaling, with k = 3 the predicted class is N by unweighted majority voting.)
Classification…
● Choosing the value of k:
– Min k = 1; Max k = # of samples
– If k is too small, the classifier is sensitive/vulnerable to noise points/outliers
– If k is too large, the neighborhood may include points from other classes; in the extreme (k = # of samples) it behaves like a ZeroR classifier, always predicting the majority class
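A common way to choose k in practice (not shown on the slides) is cross-validation; a minimal sketch using sklearn's GridSearchCV on the iris dataset, chosen just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd values of k (odd values avoid ties in binary voting).
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # the k with the best cross-validated accuracy
print(search.best_score_)    # mean cross-validated accuracy for that k
```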
Hyperparameters
● sklearn.neighbors.KNeighborsClassifier
● class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
● weights='uniform': uniform weights. All points in each neighborhood are weighted equally.
● weights='distance': weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors that are further away.
● algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
● p: power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
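A minimal usage sketch of these parameters; the iris dataset and the pipeline with StandardScaler are choices made here for illustration, not part of the slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize features first (see the feature-scaling slides), then apply kNN
# with k = 5, inverse-distance voting, and Euclidean distance (p = 2).
clf = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5, weights="distance", p=2))
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))   # accuracy on the held-out test set
print(clf.predict(X_test[:3]))     # predicted labels for a few test points
```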
Classification…
● How to handle missing values in training and test sets?
– Proximity computations normally require the presence of all attributes
– Some approaches use only the subset of attributes present in both instances
◆ This may not produce good results, since it effectively uses a different proximity measure for each pair of instances
◆ Thus, the proximities are not comparable
K-NN Classifiers…
Handling Irrelevant and Redundant Attributes
– Irrelevant attributes add noise to the proximity measure
– Redundant attributes bias the proximity measure towards certain attributes
– Example: BMI and weight are redundant, since BMI is derived from weight (and height)

– kNN is slow at testing time, since prediction requires computing distances to the training records
Improving KNN Efficiency
● Avoid having to compute the distance to all objects in the training set
– Multi-dimensional access methods (k-d trees)
– Fast approximate similarity search
– Locality Sensitive Hashing (LSH)
● Condensing
– Determine a smaller set of objects that gives the same performance
● Editing
– Remove objects to improve efficiency
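A minimal sketch of avoiding brute-force distance computation with a k-d tree, assuming scikit-learn's KDTree; the data and query point are randomly generated for illustration:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 3))        # 1000 training points in 3 dimensions

tree = KDTree(X, leaf_size=30)   # build the k-d tree once

query = rng.random((1, 3))       # one query point
dist, ind = tree.query(query, k=5)   # distances and indices of the 5 nearest

print(ind[0])    # indices of the 5 nearest training points
print(dist[0])   # their distances to the query
```

The same effect can be had through KNeighborsClassifier(algorithm='kd_tree') or algorithm='ball_tree' instead of the brute-force search.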
