
Machine Learning

EE514 – CS535

kNN Algorithm: Overview, Analysis,


Convergence and Extensions

Zubair Khalid

School of Science and Engineering


Lahore University of Management Sciences

https://www.zubairkhalid.org/ee514_2023.html
Outline

- k-Nearest Neighbor (kNN) Algorithm Overview


- Algorithm Formulation
- Distance Metrics
- Choice of k
- Algorithm Convergence
- Storage, Time Complexity Analysis
- Fast kNN
- The Curse of Dimensionality
Supervised Learning
Classification Algorithms or Methods
Predicting a categorical output is called classification

- Frequency Table: Bayesian Methods, Decision Trees
- Covariance Matrix: Linear Discriminant Analysis, Logistic Regression
- Similarity Function: k-Nearest Neighbor
- Others: Neural Network, Support Vector Machine
k-Nearest Neighbor (kNN) Algorithm
Idea:

- Two classes, two features

- We want to assign a label to the unknown data point (marked '?' in the figure).

- The label should be red.

k-Nearest Neighbor (kNN) Algorithm
Idea:

- We have similar labels for similar features.


- We classify new test point using similar training data points.
Algorithm overview:
- Given some new test point x for which we need to predict the class y.
- Find most similar data-points in the training data.
- Classify x “like” these most similar data points.
Questions:
- How do we determine the similarity?
- How many similar training data points to consider?
- How to resolve inconsistencies among the training data points?
k-Nearest Neighbor (kNN) Algorithm
1-Nearest Neighbor (the simplest ML classifier):
Idea: Use the label of the closest known point. In the example, the label should be red.

Generalization (kNN):
Determine the labels of the k nearest neighbors and
assign the most frequent label.
In the example: with k=3 the label should be red; with k=7 the label should be blue.
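As a minimal illustrative sketch of this decision rule in Python (the toy data, function name, and k values are made up for illustration, not taken from the slides):

import math
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Distance from the test point to every training point.
    distances = [math.dist(x, x_test) for x in X_train]
    # Indices of the k closest training points.
    nearest = sorted(range(len(X_train)), key=lambda i: distances[i])[:k]
    # Majority vote among the labels of the k nearest neighbors.
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy example with two classes ("red"/"blue") and two features.
X_train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
y_train = ["red", "red", "red", "blue", "blue", "blue"]
print(knn_predict(X_train, y_train, (1.1, 1.0), k=1))   # 1-NN -> "red"
print(knn_predict(X_train, y_train, (4.8, 5.0), k=3))   # 3-NN -> "blue"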
k-Nearest Neighbor (kNN) Algorithm
Formal Definition:

Interpretation:
k-Nearest Neighbor (kNN) Algorithm
Formal Definition:

- Instance-based (lazy) learning algorithm; adapts easily as new training data becomes available


k-Nearest Neighbor (kNN) Algorithm
Decision Boundary:
k-Nearest Neighbor (kNN) Algorithm
Decision Boundary:

https://demonstrations.wolfram.com/KNearestNeighborKNNClassifier/
k-Nearest Neighbor (kNN) Algorithm
Characteristics of kNN:

- No assumptions about the distribution of the data


- Non-parametric algorithm
- No parameters learned from the data

- Hyper-Parameters
- k (number of neighbors)
- Distance metric (to quantify similarity)
k-Nearest Neighbor (kNN) Algorithm
Characteristics of kNN:

- Complexity (both time and storage) of prediction increases with the size
of training data.

- Can also be used for regression (average or inverse-distance-weighted
average of the target values of the k nearest neighbors).
- For example, in the unweighted case the prediction is the mean of the target
values y_i over the neighbor set S_x; a sketch follows below.
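A hedged sketch of the regression variant, with both the plain-average and inverse-distance-weighted rules (function and variable names are illustrative):

import math

def knn_regress(X_train, y_train, x_test, k=3, weighted=True):
    # Indices of the k training points closest to the test point.
    nearest = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x_test))[:k]
    if not weighted:
        # Plain average of the neighbors' target values.
        return sum(y_train[i] for i in nearest) / k
    # Inverse-distance-weighted average (epsilon avoids division by zero).
    weights = [1.0 / (math.dist(X_train[i], x_test) + 1e-12) for i in nearest]
    return sum(w * y_train[i] for w, i in zip(weights, nearest)) / sum(weights)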
k-Nearest Neighbor (kNN) Algorithm
Practical issues:

- For a binary classification problem, use an odd value of k. Why?

- In case of a tie:
- Use prior information
- Use a 1-NN classifier (or repeat with k-1 neighbors) to decide

- Missing values in the data:
- Impute with the average value of that feature.
Outline

- k-Nearest Neighbor (kNN) Algorithm Overview


- Algorithm Formulation
- Distance Metrics
- Choice of k
- Algorithm Convergence
- Storage, Time Complexity Analysis
- Fast kNN
- The Curse of Dimensionality
k-Nearest Neighbor (kNN) Algorithm

We need to define a distance metric to find the set of k nearest neighbors, S_x.
k-Nearest Neighbor (kNN) Algorithm
Distance Metric:
k-Nearest Neighbor (kNN) Algorithm
Norm of a vector

Properties of Norm
k-Nearest Neighbor (kNN) Algorithm
Distance Metric:
k-Nearest Neighbor (kNN) Algorithm
Distance Metric:
Properties of Distance Metrics:
k-Nearest Neighbor (kNN) Algorithm
Distance Metric:
k-Nearest Neighbor (kNN) Algorithm
Cosine Distance

What is the range of values of the angular distance, and what is the interpretation of these values?
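A small sketch of one common convention, the cosine distance 1 - cos(theta); the exact convention on the slides may differ (e.g. using the angle theta itself), and the helper name is illustrative:

import math

def cosine_distance(u, v):
    # cos(theta) = <u, v> / (||u|| * ||v||), which lies in [-1, 1].
    sim = sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    # Distance = 1 - similarity, which lies in [0, 2]:
    # 0 -> same direction, 1 -> orthogonal, 2 -> opposite direction.
    return 1.0 - sim

print(cosine_distance((1, 0), (0, 1)))   # 1.0  (orthogonal vectors)
print(cosine_distance((1, 1), (2, 2)))   # ~0.0 (same direction)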
k-Nearest Neighbor (kNN) Algorithm
Practical issues in computing distance:

- Mismatch in the scale of feature values.

- Issue: A distance metric maps from the d-dimensional feature
space to a scalar; the values should be of the same order
along each dimension, otherwise large-scale features dominate the distance.

- Solution: Data normalization (a sketch follows below).
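A minimal sketch of one common normalization (z-score scaling), assuming the features are stored column-wise in a NumPy array; names are illustrative:

import numpy as np

def zscore_normalize(X):
    # X has shape (n_samples, d_features); scale each feature to zero mean
    # and unit variance so that no single feature dominates the distance.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant features
    return (X - mu) / sigma, mu, sigma

# The same mu and sigma must be reused to normalize any test point:
# x_test_norm = (x_test - mu) / sigma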


Outline

- k-Nearest Neighbor (kNN) Algorithm Overview


- Algorithm Formulation
- Distance Metrics
- Choice of k
- Algorithm Convergence
- Storage, Time Complexity Analysis
- Fast kNN
- The Curse of Dimensionality
k-Nearest Neighbor (kNN) Algorithm
Choice of k:
- k=1:
Sensitive to noise, high variance.
Increasing k makes the algorithm less sensitive to noise.

- k=n:
Every prediction becomes the overall majority label (high bias).
Decreasing k enables capturing finer structure of the space.

Idea: Pick k not too large, but not too small (depends on data)
How?
k-Nearest Neighbor (kNN) Algorithm
Choice of k:
- Learn the best hyper-parameter, k using the data.

- Split data into training and validation.

- Start from k=1 and keep iterating, carrying out cross-validation (5-fold or 10-fold,
for example) and computing the loss on the validation data using the
training data.

- Choose the value for k that minimizes validation loss.

- This is the only learning required for kNN.
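A minimal sketch of this selection loop using scikit-learn's cross-validation utilities (the dataset and the candidate range 1..30 are illustrative assumptions):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, -np.inf
for k in range(1, 31):
    # 5-fold cross-validated accuracy for this candidate k.
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(f"selected k = {best_k} (validation accuracy {best_score:.3f})")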


Outline

- k-Nearest Neighbor (kNN) Algorithm Overview


- Algorithm Formulation
- Distance Metrics
- Choice of k
- Algorithm Convergence
- Storage, Time Complexity Analysis
- Fast kNN
- The Curse of Dimensionality
k-Nearest Neighbor (kNN) Algorithm
Error Convergence:
k-Nearest Neighbor (kNN) Algorithm
Learning Problem
k-Nearest Neighbor (kNN) Algorithm
Bayes Optimal Classifier

Error Rate:
k-Nearest Neighbor (kNN) Algorithm
Error Convergence:

Error Rate:

Reference: T. Cover and P. Hart, "Nearest neighbor pattern classification,"
IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
k-Nearest Neighbor (kNN) Algorithm
Error Convergence:

Error Rate:
k-Nearest Neighbor (kNN) Algorithm
Error Convergence:

Bound on Error Rate:


Outline

- k-Nearest Neighbor (kNN) Algorithm Overview


- Algorithm Formulation
- Distance Metrics
- Choice of k
- Algorithm Convergence
- Storage, Time Complexity Analysis
- Fast kNN
- The Curse of Dimensionality
k-Nearest Neighbor (kNN) Algorithm
Algorithm Computational and Storage Complexity:
Input/Output:

Steps:
k-Nearest Neighbor (kNN) Algorithm
Algorithm:
Steps (with computational complexity):

1. Find the distance between the given test point and the feature vector of every point in D.

2. Find the k points in D closest to the given test point to form the set S_x.

3. Find the most frequent label in the set S_x and assign it to the test point.

Computational Complexity: O(nd) to compute all n distances in d dimensions (Step 1 dominates).
Space Complexity: O(nd) to store the entire training set.
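A vectorized NumPy sketch of these three steps, making the O(nd) distance computation explicit (array and function names are illustrative):

import numpy as np

def knn_classify(X_train, y_train, x_test, k):
    # Step 1: distance to every training point -- O(n*d) time; the training
    # set itself takes O(n*d) storage.
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Step 2: indices of the k closest points (argpartition avoids a full sort).
    nearest = np.argpartition(dists, k)[:k]
    # Step 3: most frequent label among the k nearest neighbors.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]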
Outline

- k-Nearest Neighbor (kNN) Algorithm Overview


- Algorithm Formulation
- Distance Metrics
- Choice of k
- Algorithm Convergence
- Storage, Time Complexity Analysis
- Fast kNN
- The Curse of Dimensionality
k-Nearest Neighbor (kNN) Algorithm
Fast kNN:

- kNN Computational complexity: O(nd)

- How to make it faster?


- Dimensionality Reduction
- Feature Selection (to be covered later)
- PCA (to be covered later)

- Use efficient method to find nearest neighbors


- KD Tree
k-Nearest Neighbor (kNN) Algorithm
K-D Tree:
- k-dimensional tree.
- Extended version of a binary search tree to higher dimensions.

- Pick the splitting dimension:
- randomly, or
- the dimension with the largest variance.

- Pick the median value of the feature along the selected dimension after sorting along
that dimension.

- Use this point as the root node, split the remaining points into two halves, and recurse
to construct a binary tree (a construction sketch follows below).
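A minimal recursive construction sketch under these rules, splitting on the largest-variance dimension at the median (an illustration under those assumptions, not the exact implementation used in the slides):

import numpy as np

class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kdtree(points):
    # points: array of shape (n, d) holding the training points.
    if len(points) == 0:
        return None
    # Splitting dimension: the one with the largest variance.
    axis = int(np.argmax(points.var(axis=0)))
    # Sort along that dimension and split at the median point.
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return KDNode(points[mid], axis,
                  left=build_kdtree(points[:mid]),
                  right=build_kdtree(points[mid + 1:]))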
k-Nearest Neighbor (kNN) Algorithm
K-D Tree:
Splitting dimension
Example:
k-Nearest Neighbor (kNN) Algorithm
K-D Tree:
Example:
k-Nearest Neighbor (kNN) Algorithm
K-D Tree:
Connection with kNN:
Finding the nearest neighbor: descend the tree to the leaf region containing the test point
and take the best candidate found along the way.

Issue: This may miss the true nearest neighbor lying on the other side of a splitting plane.
The trick to handle this is to backtrack and also search a sibling subtree whenever the
splitting plane is closer to the test point than the current best distance.


k-Nearest Neighbor (kNN) Algorithm
K-D Tree - Summary:

- Enables a significant reduction in the time complexity of the
nearest neighbor search.
- Reduces search to O(log n) on average.

- Trade-offs:
- Computational overhead to construct the tree: O(n log n).
- Space complexity: O(n).
- May miss neighbors.
- Performance degrades as the dimension of the
feature space increases (Curse of Dimensionality).
Outline

- k-Nearest Neighbor (kNN) Algorithm Overview


- Algorithm Formulation
- Distance Metrics
- Choice of k
- Algorithm Convergence
- Storage, Time Complexity Analysis
- Fast kNN
- The Curse of Dimensionality
k-Nearest Neighbor (kNN) Algorithm
The Curse of Dimensionality:
- Refers to the problems or phenomena associated with classifying,
analyzing and organizing the data in high-dimensional spaces that
do not arise in low-dimensional settings.

- For high-dimensional datasets, the size of data space is huge.

- In other words, the size of the feature space grows exponentially


with the number of dimensions (d) of the data sets.

- To ensure the points stay close to each other, the size (n) of the
data set must also grow exponentially. That means we need a
very large dataset to maintain the density of points in the
high-dimensional space.
k-Nearest Neighbor (kNN) Algorithm
The Curse of Dimensionality:
- For high-dimensional datasets, the size of data space is huge.

For an exponentially large number


of cells, we need an exponentially
large amount of training data to
ensure that the cells are not
empty.

Ref: CB
k-Nearest Neighbor (kNN) Algorithm
The Curse of Dimensionality:
k-Nearest Neighbor (kNN) Algorithm
The Curse of Dimensionality:
k-Nearest Neighbor (kNN) Algorithm
The Curse of Dimensionality (Another viewpoint):

D            1      2      10     50     400    784
1 - 0.9^D    0.1    0.19   0.65   0.995  1.000  1.000
0.9^D        0.9    0.81   0.35   0.005  0.000  0.000
k-Nearest Neighbor (kNN) Algorithm
The Curse of Dimensionality (Another viewpoint):

D            1      2      10     50     400    784
1 - 0.99^D   0.01   0.02   0.096  0.395  0.982  0.999
0.99^D       0.99   0.98   0.904  0.605  0.018  0.0004
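The numbers in both tables follow 1 - (1 - eps)^D and (1 - eps)^D with eps = 0.1 and eps = 0.01; in Bishop's example (CB 1.4) this is the fraction of volume concentrated in a thin outer shell of a sphere versus its interior as the dimension D grows. The interpretation of the rows is reconstructed here from the numerical pattern. A short check:

for eps in (0.1, 0.01):
    for D in (1, 2, 10, 50, 400, 784):
        shell = 1 - (1 - eps) ** D      # mass within distance eps of the boundary
        interior = (1 - eps) ** D       # mass remaining in the interior
        print(f"eps={eps}, D={D}: 1-(1-eps)^D = {shell:.3f}, (1-eps)^D = {interior:.4f}")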
k-Nearest Neighbor (kNN) Algorithm
The Curse of Dimensionality:

Connection with kNN:

- With the increase in the number of features, i.e., the number of dimensions
of the feature space, data points are rarely near one another.

- The kNN algorithm makes predictions about the test point assuming
we have data points near the test point that are similar to the test
point.

- As we effectively have no near neighbors in high-dimensional space, kNN
becomes vulnerable and sensitive to the Curse of Dimensionality.
k-Nearest Neighbor (kNN) Algorithm
The Curse of Dimensionality: Why does kNN work?
Two related explanations:
- Real-world data in the higher dimensional space is confined to a region
with effective lower dimensionality.
- Dimensionality Reduction (to be covered later in the course)

- Real-world data exhibits smoothness that enables us to make


predictions exploiting interpolation techniques.

- For example,
- Data along a line or a plane in a higher-dimensional space.
- Detection of the orientation of an object in an image: the data lies on an effectively
one-dimensional manifold in a space of perhaps a million dimensions.
- Face recognition in an image (50 or 71 features).
- Spam filtering.
k-Nearest Neighbor (kNN) Algorithm
Reference:

Overall:
• https://www.cs.cornell.edu/courses/cs4780/2018fa/

• CB: sec 1.1

• HTF: 13.3 up to end of 13.3.2

• The curse of dimensionality


• CB: 1.4
• KM: 1.4.3
• N. Kouiroukidis and G. Evangelidis, "The Effects of Dimensionality Curse in High Dimensional kNN
Search," 2011 15th Panhellenic Conference on Informatics, Kastoria, Greece, 2011, pp. 41-45, doi:
10.1109/PCI.2011.45.
Machine Learning
EE514 – CS535

Dimensionality Reduction: Feature Selection


and Feature Extraction (PCA)

Zubair Khalid

School of Science and Engineering


Lahore University of Management Sciences

https://www.zubairkhalid.org/ee514_2023.html
Outline

- Dimensionality Reduction
- Feature Selection
- Feature Extraction - PCA
Dimensionality Reduction
Why?
- Increasing the number of inputs or features does not
always improve accuracy of classification.

- Performance of classifier may degrade with the inclusion


of irrelevant or redundant features.

- Curse of dimensionality; the "intrinsic" dimensionality of the
data may be smaller than the actual number of features.

Benefits:
- Improve the classification performance.

- Improve learning efficiency and enable faster classification.

- Better understanding of the underlying process mapping inputs to output.


Dimensionality Reduction
Feature Selection and Feature Extraction:
Given a set of features, reduce the number of features such that
“the learning ability of the classifier” is maximized.

Feature Selection: Select a subset of the existing features.

Feature Extraction: Transform existing features to obtain a set of
new features using some mapping function.
Dimensionality Reduction
Feature Selection:
Select a subset of the existing features.

Select the features in the subset that either
improve classification accuracy or maintain the same
accuracy.

How many subsets do we have?

How do we choose this subset?


Dimensionality Reduction
Feature Selection:
Example: Data set:
- Five Boolean features
- y = x1 (or) x2
- x3 = (not) x2
- x4 = (not) x5

Optimal subset:
{x1, x2} or {x1, x3}

Optimization in the space of all feature subsets
would have 2^d candidate subsets (2^5 = 32 here).

We can't search over all possibilities and
therefore we rely on heuristic methods.

* Source: A tutorial on genomics by Yu (2004).
Dimensionality Reduction
Feature Selection:
How do we choose this subset?
- Feature selection can be considered as an optimization
problem that involves
- Searching the space of possible feature subsets
- Choosing the subset that is optimal or near-optimal with
respect to some objective function
(Feature subset selection loop: search a candidate subset, then score its goodness with an objective function.)

- Filter Methods (unsupervised method)
- Evaluation is independent of the learning algorithm
- Consider the input only and select the subset that
has the most information

- Wrapper Methods (supervised method)
- Evaluation is carried out using model selection with the
machine learning algorithm
- Train on the selected subset and estimate the error on a
validation dataset
Dimensionality Reduction
Feature Selection:
How do we choose this subset?
Filter Methods: search a candidate feature subset and evaluate its
information content with an objective function (independent of the learning algorithm).

Wrapper Methods: search a candidate feature subset and evaluate its
prediction accuracy using the learning algorithm itself.
Dimensionality Reduction
Feature Selection:
Filter Methods:
- Univariate Methods
- Treats each feature independently of other features

- Calculate score of each feature against the label using the following metrics:
- Pearson correlation coefficient
- Mutual Information
- F-score
- Chi-square
- Signal-to-noise ratio (SNR), etc.

- Rank features with respect to the score

- Select the top k-ranked features (k is selected by the user)
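A minimal sketch of such a univariate filter using the absolute Pearson correlation as the score (NumPy-based; the function name and the choice of metric are illustrative):

import numpy as np

def top_k_by_correlation(X, y, k):
    # Score each feature by the absolute Pearson correlation with the label.
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    # Rank the features by score and return the indices of the top k.
    return np.argsort(scores)[::-1][:k]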


Dimensionality Reduction
Feature Selection:
Filter Methods – Ranking Metrics:
- Pearson correlation coefficient (measure of linear dependence)

- Signal-to-noise ratio (SNR)


Dimensionality Reduction
Feature Selection:
Wrapper Methods:
- Forward Search Feature Subset Selection Algorithm (Super intuitive)

- Start with empty set as feature subset


- Try adding one feature from the remaining features to the subset
- Estimate classification or regression error for adding each feature
- Add feature to the subset that gives max improvement

- Backward Search Feature Subset Selection Algorithm (Super intuitive)

- Start with full feature set as subset


- Try removing one feature from the subset
- Estimate classification or regression error for removing each feature
- Remove/drop the feature that gives minimal impact on error or reduces the error
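A minimal sketch of the forward search wrapper, scored by cross-validation (assuming scikit-learn is available; the logistic-regression model, 5-fold CV, and stopping rule are illustrative choices):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, n_features):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        # Try adding each remaining feature; keep the one with the best CV accuracy.
        scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected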
Outline

- Dimensionality Reduction
- Feature Selection
- Feature Extraction - PCA
Dimensionality Reduction
Feature Extraction:

Transform existing features to obtain a set of new features using some mapping function.

- The mapping function z=𝑓(x) can be linear or non-linear.

- Can be interpreted as projection or mapping of the data in the higher dimensional


space to the lower dimensional space.

- Mathematically, we want to find an optimum mapping z=𝑓(x) that preserves the


desired information as much as possible.
Dimensionality Reduction
Feature Extraction:
Idea:
- Finding optimum mapping is equivalent to optimizing an objective function.

- We use different objective functions in different methods;

- Minimize Information Loss: Mapping that represent the data as


accurately as possible in the lower-dimensional space, e.g., Principal
Components Analysis (PCA).

- Maximize Discriminatory Information: Mapping that best discriminates


the data in the lower-dimensional space, e.g., Linear Discriminant
Analysis (LDA).

- Here we focus on PCA, that is, a linear mapping.

- Why Linear: Simpler to Compute and Analytically Tractable.


Dimensionality Reduction
Feature Extraction - Principal Component Analysis:
- Given features in d-dimensional space

- Project into lower dimensional space using the following linear transformation

- For example (can you tell me size of matrix W for the following cases),
- find best planar approximation to 4D data
- find best planar approximation to 100D data

- We want to find this mapping while preserving as much information as possible, and ensuring

- Objective 1: the features after mapping are uncorrelated; cannot be reduced further

- Objective 2: the features after mapping have large variance


Dimensionality Reduction
Feature Extraction - Principal Component Analysis:
Geometric Intuition:

(Figure: toy illustration in two dimensions. Most of the contribution of each
class lies along the first principal component; the second principal component
is orthogonal to it.)


Dimensionality Reduction
Feature Extraction - Principal Component Analysis:
Geometric Intuition:

(Figure: change of coordinates, i.e., linear combinations of the original features;
and the effect of ignoring the second component/feature.)
Dimensionality Reduction
Feature Extraction - Principal Component Analysis:
Mathematical Formulation:
Dimensionality Reduction
Feature Extraction - Principal Component Analysis:
Mathematical Formulation:

Steps to find Principal Components:

Step 1: Compute Sample Mean:

Step 2: Subtract Sample Mean:


Dimensionality Reduction
Feature Extraction - Principal Component Analysis:
Mathematical Formulation:
Step 3: Calculate the Covariance Matrix (taken over all feature vectors).
- What is special about these vectors? They have zero mean.
- How do you interpret the entries of the matrix? Spend some time and try to
understand this!
Dimensionality Reduction
Feature Extraction - Principal Component Analysis:
Special about the Covariance Matrix: it is symmetric and positive semi-definite
(real, non-negative eigenvalues; orthogonal eigenvectors).

Step 4: Carry out Eigenvalue Decomposition of Covariance Matrix:


Dimensionality Reduction
Feature Extraction - Principal Component Analysis:
Step 5: Dimensionality Reduction

- Q: How to select k out of d?

- A: Simple; select the eigenvectors corresponding to the k largest eigenvalues.
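A NumPy sketch of Steps 1-5 (eigendecomposition of the sample covariance, dividing by n as in the example later; names are illustrative):

import numpy as np

def pca(X, k):
    # Step 1: sample mean of each feature.
    mu = X.mean(axis=0)
    # Step 2: subtract the sample mean.
    Xc = X - mu
    # Step 3: covariance matrix (dividing by n).
    C = (Xc.T @ Xc) / X.shape[0]
    # Step 4: eigenvalue decomposition (eigh, since C is symmetric).
    eigvals, eigvecs = np.linalg.eigh(C)
    # Step 5: keep the eigenvectors of the k largest eigenvalues and project.
    order = np.argsort(eigvals)[::-1][:k]
    W = eigvecs[:, order]            # d x k mapping matrix
    return Xc @ W, W, eigvals[order]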


Dimensionality Reduction
Feature Extraction - Principal Component Analysis:

Connection with the Objectives:


- Objective 1: the features after mapping are uncorrelated; cannot be reduced further

- Enabled by orthogonality of the principal components

- Objective 2: the features after mapping have large variance

- We have used covariance matrix to define the mapping and used eigenvectors with
largest eigenvalues, that is, those dimensions capturing the variations in the data.

- PCA maps the data along the directions where we have most of the
variations in the data.
Dimensionality Reduction
Feature Extraction - Principal Component Analysis:
How do we choose k?
- It depends on the amount of information, that is variance, we want to preserve in the
mapping process.

- We can define a variable T to quantify this preservation of information,
e.g., T = (sum of the k largest eigenvalues) / (sum of all d eigenvalues), the fraction of total variance retained.

- T = 1 when k = d; no reduction.

- T = 0.8 is interpreted as 80% of the variation in the data having been preserved.
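A short sketch of choosing k from the eigenvalues so that the retained-variance fraction T reaches a target (assumes the eigenvalues are sorted in decreasing order; names are illustrative):

import numpy as np

def choose_k(eigvals, target=0.8):
    # Eigenvalues in decreasing order; T(k) = sum of k largest / sum of all.
    T = np.cumsum(eigvals) / np.sum(eigvals)
    # Smallest k whose retained-variance fraction reaches the target.
    return int(np.searchsorted(T, target) + 1)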
Dimensionality Reduction
Feature Extraction - Principal Component Analysis:
Example:
Step 1: Compute sample mean. Step 2: Subtract sample mean. Step 3: Calculate the covariance matrix.

Data (x1, x2)        Mean-subtracted
2.5000  2.4000        0.6900   0.4900
0.5000  0.7000       -1.3100  -1.2100
2.2000  2.9000        0.3900   0.9900
1.9000  2.2000        0.0900   0.2900
3.1000  3.0000        1.2900   1.0900
2.3000  2.7000        0.4900   0.7900
2.0000  1.6000        0.1900  -0.3100
1.0000  1.1000       -0.8100  -0.8100
1.5000  1.6000       -0.3100  -0.3100
1.1000  0.9000       -0.7100  -1.0100

Note: We have divided by n. Some authors divide by n-1. It won't change the principal components.
Dimensionality Reduction
Feature Extraction - Principal Component Analysis:
Example:
Step 4: Carry out eigenvalue decomposition of the covariance matrix.

Step 5: Dimensionality reduction. Projection of each data point onto the first
principal component (one reduced feature per data point):
3.4591, 0.8536, 3.6233, 2.9054, 4.3069, 3.5441, 2.5320, 1.4866, 2.1931, 1.4073
Dimensionality Reduction
Feature Extraction - Principal Component Analysis:
Practical Considerations and Limitations:
- Data should be normalized before using PCA for dimensionality reduction.

- Usually, we normalize every feature by subtracting mean of that feature followed by


dividing with standard deviation of the feature.

- The covariance matrix of the reduced feature is projection along orthogonal components
(directions) and therefore features are uncorrelated to each other. In other words, PCA
decorrelates the features.

- Limitation:
- PCA does not consider the separation of the data with respect to the class label, and
therefore there is no guarantee that mapping the data along the dimensions of
maximum variance results in new features that are good enough for class discrimination.
Solution: Linear Discriminant Analysis (LDA) finds the mapping directions along which
the classes are best separated.
