
Machine Learning

Lecture 4
Non-parametric Learning and kNNs

Stergios Christodoulidis
MICS Laboratory
CentraleSupélec
Université Paris-Saclay
https://stergioc.github.io/

Last Lecture

Linear Regression – Problem Formulation

• Data: inputs are continuous vectors of length $K$ (dimensions, features); outputs are continuous scalars
  $\mathcal{D} = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $\boldsymbol{x} \in \mathbb{R}^{K}$ and $y \in \mathbb{R}$

• Prediction: the output is a linear function of the inputs
  $\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_K x_K$
  $\hat{y} = h_\theta(x) = \boldsymbol{\theta}^{\top} \boldsymbol{x}$

• Learning: find the parameters $\theta$ that minimize some objective function
  $\theta^{*} = \operatorname{argmin}_{\theta} \, \mathcal{J}(x, y; \theta)$

Gradient Descent – The Idea (1/3)

• Start from a random point $u$

• How do we get closer to the solution?
  • Follow the direction opposite to the gradient
  • The gradient points in the direction of steepest ascent, so its opposite gives the direction of steepest descent

  $u' = u - \lambda \nabla f(u)$
  $\lambda$: learning rate
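A minimal sketch of this update rule applied to least-squares linear regression; the toy data, learning rate, and iteration count are made-up assumptions for illustration, not taken from the lecture:

```python
import numpy as np

# Toy data (made up): y ≈ 1 + 2x with a little noise
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.uniform(-1, 1, size=(100, 1))])  # bias column + one feature
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=100)

theta = np.zeros(2)   # start from an arbitrary point u
lam = 0.1             # learning rate (lambda on the slide)
for _ in range(500):
    grad = X.T @ (X @ theta - y) / len(y)  # gradient of the mean squared error
    theta = theta - lam * grad             # u' = u - lambda * grad f(u)

print(theta)          # approaches [1, 2]
```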

Model Complexity

“Among competing hypotheses (models), the one with the fewest assumptions should be selected”

• A “simple” model is one where θ has few non-zero parameters
  • Only a few features are relevant (sparsity)
  • $\|\theta\|_1$ will be small (L1 norm)

• A “simple” model is one where θ is almost uniform
  • Few features are significantly more relevant than others
  • $\|\theta\|_2$ will be small (L2 norm)

[Figure: values of the parameters θ for three candidate models: Θ(1) “complex model”, Θ(2) “simple model”, Θ(3) “simple model”]

Occam’s Razor (wikipedia): https://en.wikipedia.org/wiki/Occam%27s_razor

Ridge Regression

Prefers uniform parameters:
  $J_{RR}(\theta) = J(\theta) + \lambda \, \|\theta\|_2^2$

  $J_{RR}(\theta) = \frac{1}{2}\sum_{i=1}^{N}\left(\theta^{\top} x^{(i)} - y^{(i)}\right)^2 + \lambda \sum_{k=1}^{K} \theta_k^2$

• Hyperparameter λ: how much should we trade off accuracy versus model complexity (the λ-weighted term penalizes complexity)

LASSO Regression

Yields sparse parameters:
  $J_{LASSO}(\theta) = J(\theta) + \lambda \, \|\theta\|_1$

  $J_{LASSO}(\theta) = \frac{1}{2}\sum_{i=1}^{N}\left(\theta^{\top} x^{(i)} - y^{(i)}\right)^2 + \lambda \sum_{k=1}^{K} |\theta_k|$

• Hyperparameter λ: how much should we trade off accuracy versus model complexity (a scikit-learn sketch contrasting the two penalties follows below)
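Not from the slides: a brief, hedged scikit-learn sketch contrasting the two penalties; the dataset, feature count, and alpha values are arbitrary assumptions chosen only to show the sparse-vs-uniform behaviour.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Made-up data: only 2 of 10 features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: drives irrelevant ones to exactly 0

print(np.round(ridge.coef_, 2))      # small but non-zero everywhere
print(np.round(lasso.coef_, 2))      # sparse: most entries are exactly 0
```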

Logistic Regression (1/2)

• Focus on binary classification. Logistic regression aims to model
  $p(\text{label} \mid \text{data})$
  by training a classifier of the form
  $y_i = \begin{cases} 1 & \text{if } \theta^{\top} x \geq 0 \\ 0 & \text{otherwise} \end{cases}$

• Intuitively, it does not make sense for $y_i$ to take values larger than 1 or smaller than 0
• Note that the sign() function presented before (classification as a regression task) is not very useful here

Logistic Regression (2/2)

• How to turn a real-valued expression $\theta^{\top} x \in \mathbb{R}$ into a probability?
  $p_\theta(y \mid x) \in [0, 1]$

• Replace the sign() with the sigmoid or logistic function
  $\hat{y} = f(x) = \sigma(\theta^{\top} x) = \frac{1}{1 + e^{-\theta^{\top} x}}$
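A tiny sketch of the logistic function mapping real-valued scores $\theta^{\top}x$ to probabilities; the example scores are made up:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])   # assumed theta^T x values
print(sigmoid(scores))   # ~[0.007, 0.27, 0.5, 0.73, 0.993]
```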

Today’s Lecture

Today’s Lecture

• Non-parametric Learning


• kNN

Non-parametric Learning

Classification: Oranges and Lemons

• We can construct a linear decision boundary:
  $y = \mathrm{sign}(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$
• Parametric model
  • Fixed number of parameters

Classification as Induction

• Is there an alternative way to formulate the classification problem?

• Classification as induction
  • Comparison to instances already seen in training
  • Non-parametric learning

Non-parametric Learning

• Non-parametric learning algorithm (does not mean NO hyper-parameters)
  • The complexity of the decision function grows with the number of data points
  • Contrast with linear/logistic regression (≈ as many parameters as features)
  • Usually: the decision function is expressed directly in terms of the training examples
  • Examples:
    • k-nearest neighbors
    • Tree-based methods
    • Some cases of SVMs (next lecture)

How Would You Color the Blank Circles?

How Would You Color the Blank Circles?

Partitioning the Space

Nearest Neighbors – The Idea

• Learning (training):
• Store all the training examples
• Prediction:
• For a point x: assign the label of the training example closest to it

Nearest Neighbors – The Idea

• Learning (training):
• Store all the training examples
• Prediction:
• For a point x: assign the label of the training example closest to it

• Classification
  • Majority vote: predict the most frequent label among the k neighbors

• Regression
  • Predict the average of the labels of the k neighbors

Instance-based Learning

• Learning
• Store training instances
• Prediction
• Compute the label for a new instance based on its similarity with the stored instances

• Also called lazy learning


• Similar to case-based reasoning:
- Doctors treating a patient based on how patients with similar symptoms were treated
- Judges ruling court cases based on legal precedent

Computing distances and similarities

Distance Function

• Distance function on a set X
  $d: X \times X \to \mathbb{R}_{+}$
• Properties of a distance function (or metric)
  • $d(x, z) \geq 0$ (non-negativity)
  • $d(x, x) = 0$ (identity of indiscernibles)
  • $d(x, z) = d(z, x)$ (symmetry)
  • $d(x, z) \leq d(x, u) + d(u, z)$ (triangle inequality)

Distance Between Instances

• Euclidean distance (L2)
  $d(x_1, x_2) = \|x_1 - x_2\|_2 = \sqrt{\sum_{j=1}^{n} (x_{1j} - x_{2j})^2}$

• Manhattan distance (L1): the sum of the horizontal and vertical distances between points on a grid
  $d(x_1, x_2) = \|x_1 - x_2\|_1 = \sum_{j=1}^{n} |x_{1j} - x_{2j}|$

• Lp-norm
  $d(x_1, x_2) = \|x_1 - x_2\|_p = \left(\sum_{j=1}^{n} |x_{1j} - x_{2j}|^p\right)^{1/p}$
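A small numpy sketch of these three distances; the two example vectors are made up:

```python
import numpy as np

def minkowski(x1, x2, p):
    # Lp distance; p=2 gives Euclidean, p=1 gives Manhattan
    return np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 0.0, 3.0])

print(minkowski(x1, x2, 2))            # Euclidean: sqrt(9 + 4) ≈ 3.61
print(minkowski(x1, x2, 1))            # Manhattan: 3 + 2 = 5
print(np.linalg.norm(x1 - x2, ord=3))  # same idea via numpy's built-in norm, here p=3
```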

From Distance to Similarity

  $s = \frac{1}{1 + d}$

• Pearson's correlation
  $\rho(x, z) = \frac{\sum_{j=1}^{n} (x_j - \bar{x})(z_j - \bar{z})}{\sqrt{\sum_{j=1}^{n} (x_j - \bar{x})^2}\,\sqrt{\sum_{j=1}^{n} (z_j - \bar{z})^2}}$, with $\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j$

• Assuming that the data is centered
  $\rho(x, z) = \frac{\sum_{j=1}^{n} x_j z_j}{\sqrt{\sum_{j=1}^{n} x_j^2}\,\sqrt{\sum_{j=1}^{n} z_j^2}}$

Pearson’s Correlation

• Pearson's correlation (centered data)
  $\rho(x, z) = \frac{\sum_{j=1}^{n} x_j z_j}{\sqrt{\sum_{j=1}^{n} x_j^2}\,\sqrt{\sum_{j=1}^{n} z_j^2}} = \frac{\langle x, z \rangle}{\|x\|\,\|z\|} = \cos\theta$

• Cosine similarity: the dot product can be used to measure similarities between vectors
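A quick numpy check, not from the slides, that centering followed by the cosine of the angle reproduces Pearson's correlation; the two vectors are made up:

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0, 7.0])
z = np.array([2.0, 2.5, 6.0, 5.5])

# Pearson correlation = cosine similarity of the centered vectors
xc, zc = x - x.mean(), z - z.mean()
cosine = xc @ zc / (np.linalg.norm(xc) * np.linalg.norm(zc))

print(cosine)                      # cos(theta) between the centered vectors
print(np.corrcoef(x, z)[0, 1])     # numpy's Pearson correlation: same value
```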

Categorical Features

• Represent objects as the list of presence/absence (or counts) of the features that appear in them

• Example: dataset of molecules
  • Features: atoms and chemical bonds of a certain type
    • C, H, S, O, N, …
    • O-H, O=C, C-N, ...

Binary Representation (1/2)

0 1 1 0 0 1 0 0 0 1 0 1 0 0 1

  (no occurrence of the 1st feature; 1+ occurrences of the 10th feature)

• Hamming distance between two binary representations
  • Number of bits that are different
  $d(x_1, x_2) = \sum_{j=1}^{n} \left(x_{1j}\ \mathrm{XOR}\ x_{2j}\right)$

• Equivalent to the L1 distance
  $d(x_1, x_2) = \sum_{j=1}^{n} |x_{1j} - x_{2j}|$

  XOR operator:
    A  B | A XOR B
    0  0 |    0
    0  1 |    1
    1  0 |    1
    1  1 |    0
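A short numpy sketch of the Hamming distance via XOR; the first vector is the slide's example bit string, the second one is made up:

```python
import numpy as np

def hamming(x1, x2):
    # Number of positions where the two binary vectors differ (XOR, then count)
    return int(np.sum(np.logical_xor(x1, x2)))

x1 = np.array([0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1])
x2 = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1])   # assumed second vector

print(hamming(x1, x2))                # 3 bits differ
print(int(np.sum(np.abs(x1 - x2))))   # same result via the L1 distance
```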

Binary Representation (2/2)

0 1 1 0 0 1 0 0 0 1 0 1 0 0 1

  (no occurrence of the 1st feature; 1+ occurrences of the 10th feature)

• Jaccard similarity (or Tanimoto similarity)
  • Number of shared features (normalized)
  $J(x_1, x_2) = \frac{\sum_{j=1}^{n} \left(x_{1j}\ \mathrm{AND}\ x_{2j}\right)}{\sum_{j=1}^{n} \left(x_{1j}\ \mathrm{OR}\ x_{2j}\right)}$

  AND operator:            OR operator:
    A  B | A AND B           A  B | A OR B
    0  0 |    0              0  0 |   0
    0  1 |    0              0  1 |   1
    1  0 |    0              1  0 |   1
    1  1 |    1              1  1 |   1
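A matching numpy sketch for the Jaccard similarity, using the x and y from the worked example that follows:

```python
import numpy as np

def jaccard(x1, x2):
    # Shared features (AND) normalized by features present in either vector (OR)
    both = np.sum(np.logical_and(x1, x2))
    either = np.sum(np.logical_or(x1, x2))
    return both / either

x1 = np.array([0, 1, 0, 1, 0, 1, 0, 0, 1])   # x = 010101001 from the example
x2 = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0])   # y = 010011000 from the example
print(jaccard(x1, x2))                        # 2 / 5 = 0.4, matching the worked example
```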

Example

x = 010101001
y = 010011000

• Hamming distance
  x and y differ in 3 bit positions, thus d(x, y) = 3
• Jaccard similarity
  J = (# of 11) / (# of 01 + # of 10 + # of 11) = 2 / (1 + 2 + 2) = 2/5 = 0.4

kNN Classifiers

Nearest Neighbor Algorithm

• Training examples live in a Euclidean space, $x \in \mathbb{R}^{d}$

• Idea: the label of a test data point is estimated from the known label of the nearest training example
  • The distance is typically defined to be the Euclidean one

Algorithm 1 (nearest neighbor)
Input: training data $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1,\dots,N}$
1. Find the example $(x^{*}, y^{*})$ in the stored training set closest to the test instance $x$, i.e.
   $x^{*} = \operatorname{argmin}_{x^{(i)} \in \mathcal{D}} d(x^{(i)}, x)$
2. Output $f(x) = y^{*}$

k-Nearest Neighbors (kNN) Algorithm
[Figure (1NN): every example in the blue shaded area will be misclassified as the blue class]

• Algorithm 1 is sensitive to mis-labeled data (‘class noise’)

• Consider instead the vote of the k nearest neighbors (majority vote); a from-scratch sketch follows below

Algorithm 2 (k-nearest neighbors)
Input: training data $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1,\dots,N}$
1. Find the $k$ examples $(x^{*j}, y^{*j})$, $j = 1, \dots, k$, from the stored training set closest to the test instance $x$
2. Output $f(x)$ = the majority vote of the labels $y^{*1}, \dots, y^{*k}$
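A minimal from-scratch sketch of this majority-vote rule, assuming Euclidean distance and numpy arrays; the toy points and labels are made up:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Distances from the test point to every stored training example
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
y_train = np.array(['lemon', 'lemon', 'orange', 'orange'])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))   # 'lemon'
```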

Choice of Parameter k

• Small 𝑘: noisy decisions
  • The idea behind using more than one neighbor is to average out the noise
• Large 𝑘
  • May lead to better prediction performance
  • If we set 𝑘 too large, we may end up looking at samples that are not real neighbors (they are far away from the point of interest)
  • Also computationally more intensive. Why?
  • Extreme case: set 𝑘 = 𝑚 (the number of points in the dataset)
    • For classification: always predict the majority class
    • For regression: always predict the average value

Think also about generalization

How to Choose Parameter k

• Set k by cross-validation, examining the misclassification error (a scikit-learn sketch follows below)

• Rule of thumb for an initial guess: $k = \sqrt{m}$, where m is the number of training instances

[Figure: misclassification error as a function of k (the plot highlights k = 7)]
Source: https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/
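A hedged sketch of picking k by cross-validation with scikit-learn; the iris dataset, fold count, and grid of k values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate several values of k with 5-fold cross-validation
ks = list(range(1, 30, 2))
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in ks]

best_k = ks[int(np.argmax(scores))]
print(best_k, max(scores))   # k with the highest accuracy, i.e. lowest misclassification error
```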

Advantages of kNN

• Training is very fast


• Just store the training examples
• Can use smart indexing procedures to speed up testing
• The training data is part of the ‘model’
• Useful in case we want to do something else with it
• Quite robust to noisy data
• Averaging k votes
• Can learn complex functions (implicitly)

Drawbacks of kNN

• Memory requirements
  • Must store all training data
• Prediction can be slow (you will see this for yourself in the lab)
  • Complexity of labeling 1 new data point: 𝒪(𝑘𝑛𝑚)
  • But kNN works best with lots of samples
• Can we further improve the running time? (see the KD-tree sketch below)
  • Efficient data structures (e.g., k-D trees)
  • Approximate solutions based on hashing
• High-dimensional data and the curse of dimensionality
  • Computation of distances in a high-dimensional space may become meaningless
  • Need more training data
  • Dimensionality reduction
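A brief sketch of the "efficient data structure" idea using scikit-learn's KDTree; the data here is random and purely illustrative:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))   # stored training points (made up)
X_test = rng.normal(size=(5, 3))

tree = KDTree(X_train)                   # build the index once, at "training" time
dist, ind = tree.query(X_test, k=5)      # distances and indices of the 5 nearest neighbors
print(ind.shape)                         # (5, 5): 5 queries, 5 neighbors each
```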

kNN – Some More Issues

• Normalize the scale of the attributes (a pipeline sketch follows below)

  • Simple option: linearly scale the range of each feature to be, e.g., in the range [0, 1]

  • Alternatively, scale each dimension to have mean 0 and variance 1
    • Compute the mean $\mu$ and standard deviation $\sigma$ of an attribute $x_j$ and scale: $(x_j - \mu)/\sigma$
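A hedged illustration of why scaling matters for kNN, using a StandardScaler + kNN pipeline in scikit-learn; the wine dataset is an assumed stand-in whose features have very different ranges:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

raw = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Features in this data have very different scales, so standardization should help
print(cross_val_score(raw, X, y, cv=5).mean())
print(cross_val_score(scaled, X, y, cv=5).mean())
```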

Decision Boundary of kNN

• Decision boundary in classification
  • The line separating the positive from the negative regions
• What decision boundary is kNN building?
  • The nearest neighbors algorithm does not explicitly compute decision boundaries, but they can be inferred

Voronoi Tessellation

• Voronoi cell of x:
  • The set of all points of the space closer to x than to any other point of the training set
  • A polyhedron
• Voronoi tessellation (or diagram) of the space
  • The union of all Voronoi cells

Voronoi Tessellation

• The Voronoi diagram defines the decision boundary of 1NN
• The kNN algorithm also partitions the space, but in a more complex way (a small SciPy sketch follows below)

Wikipedia: https://en.wikipedia.org/wiki/Voronoi_diagram
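For completeness, a small sketch, not from the slides, that computes the Voronoi tessellation of a few made-up training points with SciPy:

```python
import numpy as np
from scipy.spatial import Voronoi

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
vor = Voronoi(points)       # Voronoi tessellation of the training points

print(vor.vertices)         # vertices of the Voronoi cells
print(vor.regions)          # each region lists the vertex indices of one cell
# Under 1NN, each cell is the region of the space that receives its point's label.
```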

kNN Variants

• Weighted kNN
  • Weight the vote of each neighbor $x_i$ according to its distance to the test point $x$
    $w_i = \frac{1}{d(x, x_i)^2}$
  • Other kernel functions can also be used to weight neighbors by their distance (a scikit-learn sketch follows below)

Source: https://epub.ub.uni-muenchen.de/1769/1/paper_399.pdf
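A hedged scikit-learn sketch of distance-weighted voting. Note that the built-in weights='distance' option uses 1/d; the custom callable below implements the 1/d² weighting from the slide. The dataset, k, and epsilon are arbitrary choices:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

def inverse_square(distances):
    # 1 / d^2 weighting as on the slide; small epsilon avoids division by zero
    return 1.0 / (distances ** 2 + 1e-12)

uniform = KNeighborsClassifier(n_neighbors=7, weights='uniform')
weighted = KNeighborsClassifier(n_neighbors=7, weights=inverse_square)

print(cross_val_score(uniform, X, y, cv=5).mean())
print(cross_val_score(weighted, X, y, cv=5).mean())
```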

Applications of kNN

• Handwritten digit classification


• Input: images of handwritten digits
• Output: classify images into 10 classes

MNIST dataset

Source: http://yann.lecun.com/exdb/mnist/

kNN in scikit-learn

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
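The slide points to the KNeighborsClassifier documentation; below is a minimal hedged usage sketch, with scikit-learn's small built-in digits dataset standing in for MNIST:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)        # 8x8 handwritten digits, 10 classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)  # "training" just stores the examples
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))           # accuracy of majority-vote prediction
```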

Summary

• Non-parametric Learning


• kNN

Thank you!

