
Lecture 03 - Supervised Learning by Computing Distances - Plain

This document provides an introduction to supervised learning and discusses computing distances between vectors. It defines key concepts like feature vectors, Euclidean distance, weighted Euclidean distance, and absolute distance. It also outlines common supervised learning problems like classification, regression, and ranking that can be formulated using distances between labeled training data vectors.


Getting Started with Supervised Learning,

Learning by Computing Distances (1)

CS771: Introduction to Machine Learning


Piyush Rai
2
Supervised Learning
[Figure: labeled training data – images labeled "dog" or "cat" – is fed to a supervised learning algorithm, which produces a cat-vs-dog prediction model; given a test image, the model outputs the predicted label (cat/dog)]

Important: In ML (not just supervised learning, but also unsupervised learning and RL), the training and test datasets should be "similar" (we don't like "out-of-syllabus" questions in exams). In the above example, it means that we can't have test data with B&W images or sketches of cats and dogs. More formally, the train and test data distributions should be the same.

Does it mean ML is useless if this assumption is violated? Of course not. Many ML techniques exist to handle such situations (a bit advanced – will touch upon those later): domain adaptation, covariate shift, transfer learning, etc. (just the names for now)
3
Some Types of Supervised Learning Problems
 Consider building an ML module for an e-mail client

 Some tasks that we may want this module to perform


 Predicting whether an email is spam or normal: Binary Classification
 Predicting which of the many folders the email should be sent to: Multi-class Classification
 Predicting all the relevant tags for an email: Tagging or Multi-label Classification
 Predicting the spam-score of an email: Regression
 Predicting which email(s) should be shown at the top: Ranking
 Predicting which emails are work/study-related: One-class Classification

 These predictive modeling tasks can be formulated as supervised learning problems

 Today: A very simple supervised learning model for binary/multi-class classification


 This model doesn’t require any fancy maths – just computing means and distances
4
Some Notation and Conventions
 In ML, inputs are usually represented by vectors, e.g., [0.5, 0.3, 0.6, 0.1, 0.2, 0.5, 0.9, 0.2, 0.1, 0.5]

 A vector consists of an array of scalar values


 Geometrically, a vector is just a point in a vector space, e.g.,
 A length 2 vector is a point in 2-dim vector space
 A length 3 vector is a point in 3-dim vector space
 Likewise for higher dimensions, even though harder to visualize
 E.g., the vector [0.5, 0.3] is the point (0.5, 0.3) in 2-dim space, and [0.5, 0.3, 0.6] is the point (0.5, 0.3, 0.6) in 3-dim space

 Unless specified otherwise


 Small letters in bold font will denote vectors, e.g., $\mathbf{x}$, $\mathbf{a}$, $\mathbf{b}$, etc.
 Small letters in normal font will denote scalars, e.g., $x$, $a$, $b$, etc.
 Capital letters in bold font will denote matrices (2-dim arrays), e.g., $\mathbf{X}$, $\mathbf{W}$, etc.
5
Some Notation and Conventions
 A single vector will be assumed to be of the form $\mathbf{x} = [x_1, x_2, \ldots, x_D]$

 Unless specified otherwise, vectors will be assumed to be column vectors

 So we will assume $\mathbf{x}$ to be a column vector of size $D \times 1$
 Assuming each element to be a real-valued scalar, $x_d \in \mathbb{R}$, or $\mathbf{x} \in \mathbb{R}^D$ ($\mathbb{R}$: space of reals)

 If $\mathbf{x}$ is a feature vector representing, say, an image, then

 $D$ denotes the dimensionality of this feature vector (number of features)
 $x_d$ (a scalar) denotes the value of feature $d$ in the image

 For denoting multiple vectors, we will use a subscript with each vector, e.g.,
 $N$ images denoted by $N$ feature vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$, or compactly as $\{\mathbf{x}_n\}_{n=1}^{N}$
 The vector $\mathbf{x}_n$ denotes the $n$-th image
 $x_{nd}$ (a scalar) denotes the feature $d$ of the $n$-th image
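In code, this notation maps naturally onto a 2-dim array. A minimal NumPy sketch (the numbers and sizes below are illustrative, not from the lecture):

  import numpy as np

  # N = 4 inputs, each a D = 3 dimensional feature vector, stacked into an N x D array
  X = np.array([[0.5, 0.3, 0.6],
                [0.1, 0.2, 0.5],
                [0.9, 0.2, 0.1],
                [0.4, 0.7, 0.8]])

  N, D = X.shape    # N inputs, D features each
  x_2 = X[1]        # the feature vector of the 2nd input (0-indexed in code)
  x_23 = X[1, 2]    # the 3rd feature of the 2nd input (a scalar)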
6
Some Basic Operations on Vectors
 Addition/subtraction of two vectors gives another vector of the same size

 The mean (average or centroid) of $N$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$

   $\boldsymbol{\mu} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$   (of the same size as each $\mathbf{x}_n$)

 The inner/dot product of two vectors $\mathbf{a}$ and $\mathbf{b}$

   $\langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a}^\top \mathbf{b} = \sum_{i=1}^{D} a_i b_i$   (a real-valued number denoting how "similar" $\mathbf{a}$ and $\mathbf{b}$ are, assuming both have unit Euclidean norm)

 For a vector $\mathbf{a}$, its Euclidean norm is defined via its inner product with itself

   $\|\mathbf{a}\|_2 = \sqrt{\mathbf{a}^\top \mathbf{a}} = \sqrt{\sum_{i=1}^{D} a_i^2}$

 Also the Euclidean distance of $\mathbf{a}$ from the origin
 Note: Euclidean norm is also called the L2 norm
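A minimal NumPy sketch of these operations (the vectors below are illustrative, not from the lecture):

  import numpy as np

  # Three illustrative 3-dimensional vectors
  x1 = np.array([0.5, 0.3, 0.6])
  x2 = np.array([0.1, 0.2, 0.5])
  x3 = np.array([0.9, 0.2, 0.1])

  # Mean (centroid): same size as each vector
  mu = (x1 + x2 + x3) / 3               # or np.mean(np.stack([x1, x2, x3]), axis=0)

  # Inner/dot product of two vectors: a single real number
  dot = np.dot(x1, x2)                  # same as (x1 * x2).sum()

  # Euclidean (L2) norm: square root of a vector's inner product with itself
  norm_x1 = np.sqrt(np.dot(x1, x1))     # same as np.linalg.norm(x1)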
7
Computing Distances
 Euclidean (L2 norm) distance between two vectors $\mathbf{a}$ and $\mathbf{b}$

   $d_2(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_2 = \sqrt{\sum_{i=1}^{D} (a_i - b_i)^2} = \sqrt{(\mathbf{a} - \mathbf{b})^\top (\mathbf{a} - \mathbf{b})} = \sqrt{\mathbf{a}^\top \mathbf{a} + \mathbf{b}^\top \mathbf{b} - 2\, \mathbf{a}^\top \mathbf{b}}$

 (Square root of the inner product of the difference vector with itself; the last form is another expression in terms of inner products of the individual vectors)

 Weighted Euclidean distance between two vectors $\mathbf{a}$ and $\mathbf{b}$

   $d_w(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{D} w_i (a_i - b_i)^2} = \sqrt{(\mathbf{a} - \mathbf{b})^\top \mathbf{W} (\mathbf{a} - \mathbf{b})}$

 Here $\mathbf{W}$ is a $D \times D$ diagonal matrix with the weights $w_i$ on its diagonal. The weights may be known or even learned from data (in ML problems).
 Useful tip: Can achieve the effect of feature scaling (recall last lecture) by using weighted Euclidean distances!
 Note: If $\mathbf{W}$ is a $D \times D$ symmetric matrix, then it is called the Mahalanobis distance (more on this later)

 Absolute (L1 norm) distance between two vectors $\mathbf{a}$ and $\mathbf{b}$

   $d_1(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_1 = \sum_{i=1}^{D} |a_i - b_i|$

 The L1 norm distance is also known as the Manhattan distance or the taxicab norm (it's a very natural notion of distance between two points in some vector space)
 Apart from L2 and L1, are there other ways of defining distances? Yes. Another, although less commonly used, distance is the L-infinity distance (equal to the max of the absolute values of the element-wise differences between the two vectors)
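A minimal NumPy sketch of these distance computations (the vectors and weights below are illustrative):

  import numpy as np

  a = np.array([0.5, 0.3, 0.6])
  b = np.array([0.9, 0.2, 0.1])

  # Euclidean (L2) distance
  d2 = np.sqrt(np.sum((a - b) ** 2))    # same as np.linalg.norm(a - b)

  # Weighted Euclidean distance, with W a diagonal matrix of per-feature weights
  w = np.array([1.0, 2.0, 0.5])         # illustrative weights
  W = np.diag(w)
  dw = np.sqrt((a - b) @ W @ (a - b))   # same as np.sqrt(np.sum(w * (a - b) ** 2))

  # Absolute (L1 / Manhattan) distance
  d1 = np.sum(np.abs(a - b))            # same as np.linalg.norm(a - b, ord=1)

  # L-infinity distance: max absolute element-wise difference
  dinf = np.max(np.abs(a - b))          # same as np.linalg.norm(a - b, ord=np.inf)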
8
Our First Supervised Learner
9
Prelude: A Very Primitive Classifier
 Consider a binary classification problem – cat vs dog

 Assume training data with just 2 images – one cat image and one dog image

 Given a new test image (cat/dog), how do we predict its label?

 A simple idea: Predict using its distance from each of the 2 training images

   If d(test image, cat image) < d(test image, dog image), predict "cat", else predict "dog"

 The idea also applies to multi-class classification: Use one image per class, and predict the label based on the distances of the test image from all such images

 Wait. Is it ML? Seems to be just a simple "rule". Where is the "learning" part in this?
 Excellent question! Glad you asked! Even this simple model can be learned, for example, the feature extraction/selection part and/or the distance computation part. Some possibilities: Use a feature learning/selection algorithm to extract features, and use a Mahalanobis distance where you learn the W matrix (instead of using a predefined W), using "distance metric learning" techniques
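A minimal sketch of this primitive classifier, assuming each image has already been converted to a feature vector (the feature vectors and the choice of Euclidean distance below are illustrative):

  import numpy as np

  def euclidean(a, b):
      # Euclidean (L2) distance between two feature vectors
      return np.sqrt(np.sum((a - b) ** 2))

  # One training image per class, represented by (illustrative) feature vectors
  cat_image = np.array([0.2, 0.8, 0.5])
  dog_image = np.array([0.9, 0.1, 0.4])

  def predict(test_image):
      # Predict "cat" if the test image is closer to the cat image, else "dog"
      if euclidean(test_image, cat_image) < euclidean(test_image, dog_image):
          return "cat"
      return "dog"

  print(predict(np.array([0.3, 0.7, 0.5])))   # closer to the cat image, so prints "cat"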
10
Improving Our Primitive Classifier
 Just one input per class may not sufficiently capture variations in a class

 A natural improvement can be by using more inputs per class


[Figure: multiple labeled training images per class – several images labeled "cat" and several labeled "dog"]

 We will consider two approaches to do this


 Learning with Prototypes (LwP)
 Nearest Neighbors (NN – not “neural networks”, at least not for now )

 Both LwP and NN will use multiple inputs per class but in different ways

11
Learning with Prototypes (LwP)
 Basic idea: Represent each class by a “prototype” vector

 Class Prototype: The “mean” or “average” of inputs from that class

[Figure: Averages (prototypes) of each of the handwritten digits 1-9]

 Predict label of each test input based on its distances from the class prototypes
 Predicted label will be the class that is the closest to the test input

 How we compute distances can have an effect on the accuracy of this model
(may need to try Euclidean, weighted Euclidean, Mahalanobis, or something else)
Pic from: https://www.reddit.com/r/dataisbeautiful/comments/3wgbv9/average_handwritten_digit_oc/
12
Learning with Prototypes (LwP): An Illustration
 Suppose the task is binary classification (two classes, assumed positive and negative)

 Training data: $N$ labelled examples $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, with labels $y_n \in \{-1, +1\}$

 Assume $N_+$ examples from the positive class and $N_-$ examples from the negative class
 Assume green is positive and red is negative

   $\boldsymbol{\mu}_{-} = \frac{1}{N_{-}} \sum_{n: y_n = -1} \mathbf{x}_n \qquad\qquad \boldsymbol{\mu}_{+} = \frac{1}{N_{+}} \sum_{n: y_n = +1} \mathbf{x}_n$

 [Figure: the training points of both classes, their prototypes $\boldsymbol{\mu}_-$ and $\boldsymbol{\mu}_+$, and a test example]

 For LwP, the prototype vectors ($\boldsymbol{\mu}_-$ and $\boldsymbol{\mu}_+$ here) define the "model"
 LwP straightforwardly generalizes to more than 2 classes as well (multi-class classification) – K prototypes for K classes
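A minimal NumPy sketch of LwP for binary classification (the toy data below is illustrative): compute one prototype per class, then predict by the nearest prototype.

  import numpy as np

  # Illustrative training data: N x D feature matrix and labels in {-1, +1}
  X = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.2],     # positive class
                [5.0, 5.5], [5.5, 5.0], [4.8, 5.2]])    # negative class
  y = np.array([+1, +1, +1, -1, -1, -1])

  # Class prototypes: mean of the inputs from each class
  mu_pos = X[y == +1].mean(axis=0)
  mu_neg = X[y == -1].mean(axis=0)

  def predict(x):
      # Predict +1 if x is closer (in Euclidean distance) to the positive prototype, else -1
      return +1 if np.linalg.norm(x - mu_pos) < np.linalg.norm(x - mu_neg) else -1

  print(predict(np.array([1.2, 2.1])))    # near the positive cluster: +1
  print(predict(np.array([5.1, 5.3])))    # near the negative cluster: -1

For K classes, the same sketch would store K prototypes and predict the class whose prototype is closest to the test input.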
13
LwP: The Prediction Rule, Mathematically
 What does the prediction rule for LwP look like mathematically?

 Assume we are using Euclidean distances here

   $\|\boldsymbol{\mu}_{-} - \mathbf{x}\|^2 = \|\boldsymbol{\mu}_{-}\|^2 + \|\mathbf{x}\|^2 - 2\,\langle \boldsymbol{\mu}_{-}, \mathbf{x} \rangle$
   $\|\boldsymbol{\mu}_{+} - \mathbf{x}\|^2 = \|\boldsymbol{\mu}_{+}\|^2 + \|\mathbf{x}\|^2 - 2\,\langle \boldsymbol{\mu}_{+}, \mathbf{x} \rangle$

 Prediction Rule: Predict the label as +1 if $\|\boldsymbol{\mu}_{-} - \mathbf{x}\|^2 > \|\boldsymbol{\mu}_{+} - \mathbf{x}\|^2$, otherwise -1

14
LwP: The Prediction Rule, Mathematically
 Let’s expand the prediction rule expression a bit more

 Thus LwP with Euclidean distance is equivalent to a linear model with

 Weight vector $\mathbf{w} = 2(\boldsymbol{\mu}_{+} - \boldsymbol{\mu}_{-})$
 Bias term $b = \|\boldsymbol{\mu}_{-}\|^2 - \|\boldsymbol{\mu}_{+}\|^2$
 (Will look at linear models more formally and in more detail later)

 Prediction rule therefore is: Predict +1 if $\mathbf{w}^\top \mathbf{x} + b > 0$, else predict -1


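Continuing the illustrative LwP sketch above, a short check that the nearest-prototype rule and the equivalent linear model give the same prediction (mu_pos and mu_neg are the prototypes computed earlier; the values below are illustrative):

  import numpy as np

  # Prototypes from the earlier LwP sketch (illustrative values)
  mu_pos = np.array([1.1, 2.0])
  mu_neg = np.array([5.1, 5.2])

  # Equivalent linear model: w = 2(mu_pos - mu_neg), b = ||mu_neg||^2 - ||mu_pos||^2
  w = 2 * (mu_pos - mu_neg)
  b = np.dot(mu_neg, mu_neg) - np.dot(mu_pos, mu_pos)

  x = np.array([1.2, 2.1])   # a test input

  # Nearest-prototype rule vs. the linear rule: both give the same label
  pred_prototype = +1 if np.linalg.norm(x - mu_pos) < np.linalg.norm(x - mu_neg) else -1
  pred_linear = +1 if w @ x + b > 0 else -1
  assert pred_prototype == pred_linear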
15
LwP: Some Failure Cases
 Here is a case where LwP with Euclidean distance may not work well

[Figure: the class prototypes $\boldsymbol{\mu}_-$ and $\boldsymbol{\mu}_+$ and a test example illustrating such a failure case]
Can use feature scaling or use the Mahalanobis distance to handle such cases (will discuss this in the next lecture)

 In general, if classes are not equisized and spherical, LwP with Euclidean
distance will usually not work well (but improvements possible; will discuss
later)
16
LwP: Some Key Aspects
 Very simple, interpretable, and lightweight model
 Just requires computing and storing the class prototype vectors

 Works with any number of classes (thus for multi-class classification as well)

 Can be generalized in various ways to improve it further, e.g.,


 Modeling each class by a probability distribution rather than just a prototype vector
 Using distances other than the standard Euclidean distance (e.g., Mahalanobis)

 With a learned distance function, can work very well even with very few
examples from each class (used in some “few-shot learning” models nowadays
– if interested, please refer to “Prototypical Networks for Few-shot Learning”)
17
Next Lecture
 Fixing LwP
 Nearest Neighbors

