2EL1730-ML-Lecture04-Non Parametric Learning and Nearest Neighbor
2EL1730
Lecture 4
Non-parametric learning and nearest neighbor methods
2
Acknowledgements
Supervised learning
4
Linear (Least-Squares) Regression
• Logistic regression: σ(z) = 1 / (1 + exp(−z)), where z = wᵀx
6
Maximum Likelihood Estimate (MLE)
• Maximum Likelihood Estimate (MLE): choose the parameters θ that maximize the likelihood P(data | θ)
• Recall that we have applied MLE for parameter estimation in the logistic regression classifier
• Maximum a Posteriori Estimate (MAP): choose the parameters θ that maximize P(data | θ) P(θ), where P(θ) is the prior
7
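As a quick illustration (not from the slides): a minimal coin-flip sketch contrasting the two estimates, assuming a Beta(a, b) prior on the probability of heads.

```python
import numpy as np

# Hypothetical data: 10 coin flips, 7 heads (1 = heads, 0 = tails)
flips = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
n, heads = len(flips), flips.sum()

# MLE: maximize P(data | theta) -> the sample frequency of heads
theta_mle = heads / n

# MAP: maximize P(data | theta) * P(theta) with an assumed Beta(a, b) prior;
# the posterior is Beta(a + heads, b + n - heads) and its mode is the MAP estimate
a, b = 2.0, 2.0                               # assumed prior pseudo-counts
theta_map = (heads + a - 1) / (n + a + b - 2)

print(theta_mle)   # 0.7
print(theta_map)   # (7 + 1) / (10 + 2) = 0.666...
```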
Bayes Classifier
posterior = (likelihood × prior) / evidence, i.e., P(y | x) = P(x | y) P(y) / P(x)
8
Naïve Bayes Classification Model
• Predict the label/category y of an instance, assuming the features are conditionally independent given y
9
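A minimal scikit-learn sketch of this model on made-up fruit measurements (the data and feature choices are purely illustrative):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy 2-feature data, e.g., (width, height); labels 0 = orange, 1 = lemon
X = np.array([[7.0, 7.2], [6.8, 7.0], [7.1, 6.9],   # oranges
              [5.0, 8.5], [5.2, 8.8], [4.9, 8.6]])  # lemons
y = np.array([0, 0, 0, 1, 1, 1])

# GaussianNB models P(x_j | y) as a per-feature, per-class Gaussian
# and combines them with the class priors P(y) via Bayes' rule
clf = GaussianNB().fit(X, y)
print(clf.predict([[6.9, 7.1]]))        # -> class 0 (orange)
print(clf.predict_proba([[6.9, 7.1]]))  # posterior P(y | x)
```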
Linear Discriminant Analysis (LDA)
• Idea: project all the data points into a new space, normally of lower dimension, which:
– Maximizes the between-class separability
– Minimizes the within-class variability
10
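A small sketch of the projection idea with scikit-learn (the toy data is invented; with two classes LDA can project onto at most one direction):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data: 3 features, 2 classes (values are made up for illustration)
X = np.array([[1.0, 2.0, 0.5], [1.2, 1.9, 0.4], [0.9, 2.1, 0.6],
              [3.0, 0.5, 2.0], [3.2, 0.4, 2.1], [2.9, 0.6, 1.9]])
y = np.array([0, 0, 0, 1, 1, 1])

# Project onto the single direction that best separates the two classes
lda = LinearDiscriminantAnalysis(n_components=1)
Z = lda.fit_transform(X, y)   # projected data, shape (6, 1)
print(Z.ravel())
```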
Non-parametric learning
11
Classification: Oranges and Lemons
• Parametric model
• Fixed number of parameters
12
Classification as Induction
• Classification as induction
• Comparison to instances already seen in training
• Non-parametric learning
13
Non-parametric learning
• Non-parametric learning algorithm (does not mean NO parameters)
• The complexity of the decision function grows with the number of data
points
• Examples:
• K-nearest neighbors (today's lecture)
• Tree-based methods
• Some cases of SVMs
14
Parametric vs. Non-parametric
• Parametric algorithms
– Pros: simple, fast, need less data
– Cons: constrained, limited complexity, can overfit
15
How Would You Color the Blank Circles?
16
How Would You Color the Blank Circles?
17
Partitioning the Space
19
Nearest Neighbors – The Idea
• Learning:
– Store all the training examples
– The function is only approximated locally
• Prediction:
– For a point x: assign a label based on the training example(s) closest to it (see the code sketch after this slide)
– Classification
• Majority vote: predict the class of the most frequent label among
the k neighbors
– Regression
• Predict the average of the labels of the k neighbors
20
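A minimal from-scratch sketch of the above (the function name, the toy data, and the Euclidean-distance choice are mine, not the lecture's):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, task="classification"):
    """Predict the label of a single query point x with plain kNN."""
    # Euclidean distance from x to every stored training example
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    neighbor_labels = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_labels).most_common(1)[0][0]
    # Regression: average of the neighbors' labels
    return neighbor_labels.mean()

# Tiny illustrative dataset (made up)
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=1))  # -> 1
```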
Instance-based Learning
• Learning
– Store training instances
• Prediction
– Compute the label for a new instance based on its similarity with the stored instances (this is where the “magic” happens!)
• Also called lazy learning
• Similar to case-based reasoning
• Doctors treating a patient based on how patients with similar
symptoms were treated
• Judges ruling court cases based on legal precedent
21
Computing distances and similarities
22
Distance Function
A distance function d must satisfy:
– Non-negativity: d(x, y) ≥ 0
– Identity of indiscernibles: d(x, y) = 0 if and only if x = y
– Symmetry: d(x, y) = d(y, x)
– Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)
23
Distance Between Instances
• Lp-norm (Minkowski) distance: d(x, y) = (Σᵢ |xᵢ − yᵢ|ᵖ)^(1/p); p = 1 gives the Manhattan distance, p = 2 the Euclidean distance (see the sketch after this slide)
24
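A short sketch of the Lp distance for a few values of p (assuming the usual Minkowski definition):

```python
import numpy as np

def lp_distance(x, y, p=2):
    """Lp-norm (Minkowski) distance between two feature vectors."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(lp_distance(x, y, p=1))   # Manhattan: 3 + 2 + 0 = 5
print(lp_distance(x, y, p=2))   # Euclidean: sqrt(9 + 4 + 0) = 3.605...
print(np.max(np.abs(x - y)))    # Chebyshev (limit p -> infinity): 3
```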
From Distance to Similarity
• Pearson’s correlation
Geometric interpretation?
25
Pearson’s Correlation
26
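To make the geometric interpretation concrete: Pearson's correlation is the cosine similarity of the two mean-centered vectors. A small sketch (the toy vectors are mine):

```python
import numpy as np

def pearson(x, y):
    """Pearson's correlation = cosine similarity of the mean-centered vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

print(pearson(x, y))             # hand-rolled version
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in, should match
```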
Categorical Features
• Example: molecules
– Features: atoms and bonds of a certain type
– C, H, S, O, N, …
– O-H, O=C, C-N, ...
27
Binary Representation (1/2)
0 1 1 0 0 1 0 0 0 1 0 1 0 0 1
28
Binary Representation (2/2)
0 1 1 0 0 1 0 0 0 1 0 1 0 0 1
x = 010101001
y = 010011000
• Hamming distance: the number of positions at which x and y differ
x = 010101001
y = 010011000
The strings differ in 3 positions, thus d(x,y) = 3
• Jaccard similarity
J = (# of 11) / ( # of 01 + # of 10 + # of 11)
= (2) / (1 + 2 + 2) = 2 / 5 = 0.4
30
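A short check of the numbers above, using the bit strings from the slide:

```python
x = "010101001"
y = "010011000"

# Hamming distance: number of positions where the bits differ
hamming = sum(a != b for a, b in zip(x, y))

# Jaccard similarity: (# of 11) / (# of 01 + # of 10 + # of 11)
n11 = sum(a == "1" and b == "1" for a, b in zip(x, y))
n01 = sum(a == "0" and b == "1" for a, b in zip(x, y))
n10 = sum(a == "1" and b == "0" for a, b in zip(x, y))
jaccard = n11 / (n01 + n10 + n11)

print(hamming)  # 3
print(jaccard)  # 2 / 5 = 0.4
```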
Let’s go back to the kNN classifier
31
Nearest Neighbor Algorithm
Algorithm 1
1. Find the example (x*, y*) in the stored training set closest to the test instance x, that is, x* = argmin over training examples x′ of d(x′, x)
2. Return y* as the prediction
32
k-Nearest Neighbors (kNN) Algorithm
(Figure: decision with 1NN vs. 3NN)
Algorithm 2
• Find the k examples (x*_i, y*_i), i = 1, …, k, closest to the test instance x
• The output is the majority class among y*_1, …, y*_k
33
Choice of Parameter k (1/2)
34
Choice of Parameter k (2/2)
m: # of training instances
Source: https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/
35
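One standard way to choose k, in line with the linked post, is cross-validation; here is a sketch using scikit-learn and the Iris dataset (the dataset and the grid of k values are my choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Estimate accuracy for several odd values of k and keep the best one
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```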
Advantages of kNN
36
Drawbacks of kNN
• Memory requirements
– Must store all training data
• Prediction can be slow (you will work this out yourself in the lab)
– Complexity of the query: O(knm)
– But kNN works best with lots of samples
– Can we further improve the running time?
• Efficient data structures (e.g., k-D trees)
• Approximate solutions based on hashing
• High dimensional data and the curse of dimensionality
– Computation of the distance in a high dimensional space may
become meaningless
– Need more training data
– Dimensionality reduction
Wikipedia: https://en.wikipedia.org/wiki/K-d_tree
37
k-D trees
• Definition
• A binary tree
• Any internal node implements a spatial partition by a hyperplane H, splitting the point cloud into two equal subsets
• Right subtree: points p on one side of H
• Left subtree: the remaining points
• The process halts when a node contains fewer than n₀ points
• Building complexity: O(n log²(n))
38
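A minimal sketch of building and querying a k-D tree with scikit-learn (random data, illustrative only):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 3))        # 1000 stored points in 3-D

tree = KDTree(X, leaf_size=30)   # build the tree once over the training points

# Query: indices and distances of the 5 nearest stored points to a new point
query = rng.random((1, 3))
dist, ind = tree.query(query, k=5)
print(ind[0])   # indices of the 5 nearest neighbors
print(dist[0])  # their distances to the query point
```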
k-D trees
39
kNN – Some More Issues
• Features may have very different scales, which can distort the distances
• Simple option: linearly scale the range of each feature to a fixed interval, e.g., [0, 1] (see the sketch below)
40
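A sketch of that simple option with scikit-learn's MinMaxScaler, following the usual fit-on-train / transform-both convention (the toy features are mine):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Features on very different scales, e.g., (age in years, income in euros)
X_train = np.array([[25, 30000], [40, 60000], [60, 45000]], dtype=float)
X_test = np.array([[35, 52000]], dtype=float)

scaler = MinMaxScaler()                    # maps each feature to [0, 1]
X_train_s = scaler.fit_transform(X_train)  # fit the ranges on training data only
X_test_s = scaler.transform(X_test)        # reuse the same ranges for test data

print(X_train_s)
print(X_test_s)
```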
Decision Boundary of kNN
41
Voronoi Tessellation
Consider the case of 1NN
• Voronoi cell of x:
– Set of all points of the space closer to x than to any other point of the training set
– Polyhedron
• Voronoi tessellation (or diagram) of the space
– Union of all Voronoi cells
• Complexity (for d = 2)
– O(n) space
– O(log n) query time
42
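For intuition, the Voronoi tessellation of a small random point set can be computed and drawn with SciPy (illustrative sketch only):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

rng = np.random.default_rng(0)
points = rng.random((15, 2))     # 15 "training" points in 2-D

# Each cell = region of the plane closer to one point than to any other
vor = Voronoi(points)
voronoi_plot_2d(vor)
plt.scatter(points[:, 0], points[:, 1], c="red")
plt.show()
```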
Voronoi Tessellation
(Figure: decision regions for k = 1 and k = 3)
Wikipedia: https://en.wikipedia.org/wiki/Voronoi_diagram
43
kNN Variants
• Weighted kNN
– Weight the vote of each neighbor x_i according to its distance to the test point x, e.g., giving closer neighbors a larger weight (a scikit-learn sketch follows)
Source: https://epub.ub.uni-muenchen.de/1769/1/paper_399.pdf
44
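In scikit-learn, one form of this corresponds to weights="distance" (inverse-distance weighting); the weighting scheme in the cited paper may differ. A minimal sketch:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# 'uniform': every neighbor votes equally; 'distance': closer neighbors count more
uniform = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)

query = np.array([[3.0]])
print(uniform.predict(query), weighted.predict(query))
```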
scikit-learn
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
45
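A minimal end-to-end usage sketch of KNeighborsClassifier (the dataset, split, and k value are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Scale features to [0, 1] (as discussed earlier), then classify with 5 nearest neighbors
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```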
Next Class
46
Thank You!
DiscoverGreece.com
47