
Machine Learning

2EL1730

Lecture 4
Non-parametric learning and nearest neighbor methods

Fragkiskos Malliaros and Maria Vakalopoulou

Friday, December 11, 2020


Some Updates

• The first individual assignment has been announced on Edunao
– Due on December 23 at 23:00
– You will have to submit it on Gradescope

2
Acknowledgements

• The lecture is partially based on material by


– Richard Zemel, Raquel Urtasun and Sanja Fidler (University of
Toronto)
– Chloé-Agathe Azencott (Mines ParisTech)
– Julian McAuley (UC San Diego)
– Dimitris Papailiopoulos (UW-Madison)
– Jure Leskovec, Anand Rajaraman, Jeff Ullman (Stanford Univ.)
• http://www.mmds.org
– Panagiotis Tsaparas (UOI)
– Evimaria Terzi (Boston University)
– Andrew Ng (Stanford University)
– Nina Balcan and Matt Gormley (CMU)
– Ricardo Gutierrez-Osuna (Texas A&M Univ.)
3
Last lectures

Supervised learning

4
Linear (Least-Squares) Regression

• Learning: finds the parameters that minimize some objective function

• We minimize the sum of the squares:

• (Stochastic) gradient descent


• Or, closed-form solution:
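In symbols (a standard formulation added here for readability, since the slide's own equations are images), with training pairs (x_i, y_i), i = 1, …, m:

  J(w) = \sum_{i=1}^{m} (y_i - w^T x_i)^2

and the closed-form (normal equations) solution, assuming X^T X is invertible:

  \hat{w} = (X^T X)^{-1} X^T y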
5
Logistic Regression

• How to turn a real-valued expression into a probability?

• Replace the sign() with the sigmoid or logistic function:

[Plot of the sigmoid function σ(z) against the score z]
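For reference (the formula itself is an image on the original slide), the logistic (sigmoid) function is

  \sigma(z) = \frac{1}{1 + e^{-z}},  applied to the score  z = w^T x + b,

so that the model outputs P(y = 1 | x) = \sigma(w^T x + b).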
6
Maximum Likelihood Estimate (MLE)

• Suppose that we have data

Maximum Likelihood
Estimate (MLE)

Recall that we applied MLE for parameter estimation in the logistic regression classifier

What happens if we have prior knowledge?

Maximum a Posteriori Estimate (MAP): incorporate prior knowledge through a prior distribution over the parameters
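In symbols (standard definitions, added since the slide's equations are images), for data D and parameters \theta:

  \hat{\theta}_{MLE} = \arg\max_{\theta} P(D \mid \theta)
  \hat{\theta}_{MAP} = \arg\max_{\theta} P(D \mid \theta) \, P(\theta)

where P(\theta) is the prior.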

7
Bayes Classifier

posterior = likelihood × prior / evidence

Bayes’ decision rule:
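In symbols (standard form; the slide's equation is an image):

  P(y \mid x) = \frac{P(x \mid y) \, P(y)}{P(x)}

and the decision rule picks the class with the largest posterior:

  \hat{y} = \arg\max_{y} P(y \mid x) = \arg\max_{y} P(x \mid y) \, P(y)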

8
Naïve Bayes Classification Model

Classification using the maximum a posteriori (MAP) rule:
(pick the hypothesis, i.e., the label/category, that is most probable)
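Written out (the standard naïve Bayes form, added here since the slide's formula is an image), with conditionally independent features x_1, …, x_d:

  \hat{y} = \arg\max_{y} P(y) \prod_{j=1}^{d} P(x_j \mid y)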

9
Linear Discriminant Analysis (LDA)

• Idea: project all the data points into a new space, normally of
lower dimension, which:
– Maximizes the between-class separability
– Minimizes their within-class variability

• We are looking for a projection (w) such that:
• Examples from the same class are projected very close to each other
• And, at the same time, the projected class means are as far apart as possible
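One standard way to write this objective (an addition here, the Fisher criterion, since the slide itself gives no formula) is

  \max_{w} J(w) = \frac{w^T S_B w}{w^T S_W w}

where S_B is the between-class scatter matrix and S_W the within-class scatter matrix.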

10
Non-parametric learning

11
Classification: Oranges and Lemons

• We can construct a linear decision boundary:

• Parametric model
• Fixed number of parameters

12
Classification as Induction

• Is there an alternative way to formulate the classification problem?

• Classification as induction
• Comparison to instances
already seen in training
• Non-parametric learning

13
Non-parametric learning
• Non-parametric learning algorithm (does not mean NO parameters)
• The complexity of the decision function grows with the number of data
points

• Contrast with linear/logistic regression (≈ as many parameters as features)

• Usually: the decision function is expressed directly in terms of the training examples

• Examples:
• K-nearest neighbors (today's lecture)
• Tree-based methods
• Some cases of SVMs

14
Parametric vs. Non-Parametric
• Parametric algorithms
– Pros: Simple, Fast, Less data
– Cons: Constrained, Limited complexity, Overfit

• Non-parametric algorithms:
– Pros: Flexibility, Power, Performance
– Cons: Need data, Slow, Overfit
15
How Would You Color the Blank
Circles?

16
How Would You Color the Blank
Circles?

17
Partitioning the Space

The training data partitions the entire space


18
Nearest Neighbors – The Idea
• Learning:
– Store all the training examples
– The function is only approximated locally
• Prediction:
– For a point x: assign the label of the training example closest to it

19
Nearest Neighbors – The Idea
• Learning:
– Store all the training examples
– The function is only approximated locally
• Prediction:
– For a point x: assign the label of the training example closest to it

– Classification
• Majority vote: predict the most frequent label among the k neighbors

– Regression
• Predict the average of the labels of the k neighbors

20
Instance-based Learning

• Learning
– Store training instances
• Prediction
– Compute the label for a new instance based on its similarity with the
stored instances
Where the “magic”
happens!
• Also called lazy learning
• Similar to case-based reasoning
• Doctors treating a patient based on how patients with similar
symptoms were treated
• Judges ruling court cases based on legal precedent

21
Computing distances and
similarities

22
Distance Function

• Distance function on a set X

• Properties of a distance function (or metric):
– Non-negativity
– Identity of indiscernibles
– Symmetry
– Triangle inequality
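Written out (standard definitions), a distance function d : X × X → R satisfies, for all x, y, z ∈ X:

  d(x, y) ≥ 0                      (non-negativity)
  d(x, y) = 0  ⇔  x = y            (identity of indiscernibles)
  d(x, y) = d(y, x)                (symmetry)
  d(x, z) ≤ d(x, y) + d(y, z)      (triangle inequality)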

23
Distance Between Instances

• Euclidean distance (L2)

• Manhattan distance (L1): the sum of the horizontal and vertical distances between points on a grid

• Lp-norm
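For two points x, y ∈ R^d (standard formulas, added for readability):

  d_2(x, y) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}                 (Euclidean, L2)
  d_1(x, y) = \sum_{j=1}^{d} |x_j - y_j|                          (Manhattan, L1)
  d_p(x, y) = \left( \sum_{j=1}^{d} |x_j - y_j|^p \right)^{1/p}   (Lp-norm)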

24
From Distance to Similarity

• Pearson’s correlation

• Assuming that the data is centered

Geometric interpretation?
25
Pearson’s Correlation

• Pearson's correlation (centered data): a normalized inner product

• Cosine similarity: the dot product can be used to measure similarities between vectors
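In symbols (standard definition):

  \cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}

and Pearson's correlation is exactly this cosine similarity computed on the centered vectors.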

26
Categorical Features

• Represent objects as the list of presence/absence (or counts) of the features that appear in them

• Example: molecules
– Features: atoms and bonds of a certain type
– C, H, S, O, N, …
– O-H, O=C, C-N, ...

27
Binary Representation (1/2)

0 1 1 0 0 1 0 0 0 1 0 1 0 0 1

no occurrence of the 1st feature; 1+ occurrences of the 10th feature

• Hamming distance between two binary representations
– Number of bits that are different (XOR operator)
– Equivalent to the L1 distance

28
Binary Representation (2/2)

0 1 1 0 0 1 0 0 0 1 0 1 0 0 1

no occurrence of the 1st feature; 1+ occurrences of the 10th feature

• Jaccard similarity (or Tanimoto similarity)
– Number of shared features (AND operator), normalized by the number of features present in either object (OR operator)

Jaccard index: intersection over union (Wikipedia: https://en.wikipedia.org/wiki/Jaccard_index)


29
Example

x = 010101001
y = 010011000

• Hamming distance
x = 010101001
y = 010011000
Thus, d(x,y) = 3

• Jaccard similarity
J = (# of 11) / ( # of 01 + # of 10 + # of 11)
= (2) / (1 + 2 + 2) = 2 / 5 = 0.4
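A minimal Python sketch (not from the lecture) that reproduces these two numbers:

x = "010101001"
y = "010011000"

# Hamming distance: number of positions where the bits differ (XOR)
hamming = sum(a != b for a, b in zip(x, y))

# Jaccard similarity: #11 / (#01 + #10 + #11), i.e. intersection over union
n11 = sum(a == "1" and b == "1" for a, b in zip(x, y))
n01 = sum(a == "0" and b == "1" for a, b in zip(x, y))
n10 = sum(a == "1" and b == "0" for a, b in zip(x, y))
jaccard = n11 / (n01 + n10 + n11)

print(hamming, jaccard)   # 3 0.4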

30
Let’s go back to the
kNN classifier

31
Nearest Neighbor Algorithm

• Training examples in the Euclidean space


• Idea: The label of a test data point is estimated from the known
value of the nearest training example
– The distance is typically defined to be the Euclidean one

Algorithm 1
1. Find example (x*, y*) from the stored training set closest to the
test instance x. That is:

2. Output y(x) = y* (The output label)
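In symbols (the slide's equation is an image), step 1 selects

  (x^*, y^*) = \arg\min_{(x_i, y_i)} d(x, x_i),

with d typically the Euclidean distance.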

32
k-Nearest Neighbors (kNN) Algorithm
1NN: every example in the blue shaded area will be misclassified as the blue class
3NN: every example in the blue shaded area will be classified correctly as the red class

• Algorithm 1 is sensitive to mis-labeled data (‘class noise’)


• Consider the vote of the k nearest neighbors (majority vote)

Algorithm 2
• Find the k examples (x*_i, y*_i), i = 1,…,k, closest to the test instance x
• The output is the majority class
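A minimal NumPy sketch of Algorithm 2 (an illustration only, not the lab's reference implementation; the function and variable names are chosen here):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distances from the test instance x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]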

33
Choice of Parameter k (1/2)

• Small k: noisy decision
– The idea behind using more than one neighbor is to average out the noise
• Large k
– May lead to better prediction performance
– If we set k too large, we may end up looking at samples that are not
neighbors (are far away from the point of interest)
– Also, computationally intensive. Why?
– Extreme case: set k=m (number of points in the dataset)
• For classification: the majority class
• For regression: the average value

34
Choice of Parameter k (2/2)

Set k by cross-validation, by examining the misclassification error

Rule of thumb for an initial guess: k ≈ √m (m: # of training instances)
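A possible scikit-learn sketch of this selection procedure (an illustration; it assumes a feature matrix X and labels y are already loaded):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

ks = range(1, 30, 2)                      # odd values of k to avoid ties
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in ks]
best_k = ks[int(np.argmax(scores))]       # k with the best cross-validated accuracy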

Source: https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/ 35
Advantages of kNN

• Training is very fast


– Just store the training examples
– Can use smart indexing procedures to speed-up testing
• The training data is part of the ‘model’
– Useful in case we want to do something else with it
• Quite robust to noisy data
– Averaging k votes
• Can learn complex functions (implicitly)

36
Drawbacks of kNN

• Memory requirements
– Must store all training data
• Prediction can be slow (you will figure this out yourself in the lab)
– Complexity of the query: O(knm)
– But kNN works best with lots of samples
– Can we further improve the running time?
• Efficient data structures (e.g., k-D trees)
• Approximate solutions based on hashing
• High dimensional data and the curse of dimensionality
– Computation of the distance in a high dimensional space may
become meaningless
– Need more training data
– Dimensionality reduction
Wikipedia: https://en.wikipedia.org/wiki/K-d_tree 37
k-D trees

• Definition
• A binary tree
• Any internal node implements a spatial partition by a
hyperplane H, splitting the point cloud into two equal subsets
• Right subtree: points on one side of H
• Left subtree: the remaining points
• The process halts when a node contains fewer than a given number of points (e.g., a single point)
• Build complexity: O(n log²(n))

38
k-D trees

• Input: training data points in d dimensions

• Output: k-d tree


Algorithm: build(points,k)
1. k = k%d
2. Project points on k-th axis
3. Create node representing the median point
4. Split into two equal subsets points1, points2
5. left_child = build(points1,k+1)
6. right_child = build(points2,k+1)
7. Return node
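A compact Python sketch of this build procedure (an illustration of the steps above; the class and function names are chosen here, and points are stored as rows of a NumPy array):

import numpy as np

class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build(points, depth=0):
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]               # step 1: cycle through the axes
    points = points[points[:, axis].argsort()]   # step 2: sort along that axis
    median = len(points) // 2                    # step 3: median point becomes the node
    return Node(points[median], axis,            # steps 4-7: recurse on the two halves
                left=build(points[:median], depth + 1),
                right=build(points[median + 1:], depth + 1))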

39
kNN – Some More Issues

• Normalize the scale of the attributes

• Simple option: linearly scale the range of each feature to be, e.g., in
the range of [0,1]

• Linearly scale each dimension to have 0 mean and variance 1


– Compute the mean μ and variance σ² for an attribute x_j and scale: (x_j - μ)/σ
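Both options in a short sketch (the scikit-learn class names are real; the toy array is made up for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])   # toy feature matrix

X_minmax = MinMaxScaler().fit_transform(X)    # each feature rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)     # each feature: zero mean, unit variance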

40
Decision Boundary of kNN

• Decision boundary in classification


– Line separating the positive from negative regions
• What decision boundary is the kNN building?
– The nearest neighbors algorithm does not explicitly compute
decision boundaries, but those can be inferred

41
Voronoi Tessellation
Consider the case of 1NN
• Voronoi cell of x:
– Set of all points of the space closer to x than to any other point of the training set
– Polyhedron
• Voronoi tessellation (or diagram) of the space
– Union of all Voronoi cells
• Complexity (for d = 2)
– O(n) space
– O(log n) query time

42
Voronoi Tessellation

• The Voronoi diagram defines the decision boundary of the 1NN


• The kNN algorithm also partitions the space, but in a more complex way
• For d > 2, the complexity grows to O(n^(d/2))

[Figure: decision boundaries for k = 1 and k = 3]
Wikipedia: https://en.wikipedia.org/wiki/Voronoi_diagram
43
kNN Variants

• Weighted kNN
– Weight the vote of each neighbor x_i according to its distance to the test point x

– Other kernel functions can be used to weight the distance of neighbors
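A common choice (one standard variant, added here since the slide's formula is an image) weights neighbor x_i by the inverse of its distance,

  w_i = \frac{1}{d(x, x_i)}   (or 1 / d(x, x_i)^2, or a Gaussian kernel),

and predicts the weighted majority vote (classification) or the weighted average (regression).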

Source: https://epub.ub.uni-muenchen.de/1769/1/paper_399.pdf

44
scikit-learn

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
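Typical usage (a minimal sketch; it assumes X_train, y_train and X_test are already prepared):

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5)   # k = 5, Euclidean distance by default
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)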

45
Next Class

• Trees/ Ensemble Methods

46
Thank You!

DiscoverGreece.com 47
