
Warming-up to ML, and Some Simple Supervised Learners

(Distance-based “Local” Methods)

Piyush Rai

Introduction to Machine Learning (CS771A)

August 2, 2018

Announcements

Please sign up on Piazza if you haven't already

I'll be clearing all the add-drop requests by tomorrow

Maths refresher tutorial on Aug 4, 6:00-7:30pm in RM-101. It will mostly cover the basics of multivariate calculus, linear algebra, probability/statistics, and optimization (basically things you are expected to know for this course)
Some Notation/Nomenclature/Convention

Supervised Learning requires training data given as a set of input-output pairs {(x_n, y_n)}_{n=1}^N

Unsupervised Learning requires training data given as a set of inputs {x_n}_{n=1}^N

Each input x_n is (usually) a vector containing the values of the features or attributes or covariates that encode properties of the data it represents, e.g.,
Representing a 7 × 7 image: x_n can be a 49 × 1 vector of pixel intensities

Note: Good features can also be learned from data (feature learning) or extracted using hand-crafted rules defined by a domain expert. Having a good set of features is half the battle won!

Each y_n is the output or response or label associated with input x_n
The output y_n can be a scalar, a vector of numbers, or a structured object (more on this later)
Some Notation/Nomenclature/Convention

Will assume each input x_n to be a D × 1 column vector (its transpose x_n^T will be a row vector)

x_nd will denote the d-th feature of the n-th input

We will use X (the N × D feature matrix) to collectively denote all the N inputs
We will use y (the N × 1 output/response/label vector) to collectively denote all the N outputs

[Figure: the N × D feature matrix X, whose n-th row is x_n^T = (x_n1, x_n2, ..., x_nD), alongside the N × 1 vector y of outputs, whose n-th entry y_n is the output for input n]

Note: If each y_n itself is a vector (we will see such cases later) then we will use a matrix Y to collectively denote all the N outputs (with row n containing y_n) and also use boldfaced y_n
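As a concrete illustration of this notation, here is a minimal NumPy sketch (the sizes and values are made up):

import numpy as np

N, D = 5, 3                 # N = 5 inputs, each with D = 3 features (made-up sizes)
X = np.random.randn(N, D)   # feature matrix X: row n is x_n^T
y = np.random.randn(N)      # output vector y: entry n is y_n

x_2 = X[2]                  # the input x_n for n = 2 (a length-D vector)
x_21 = X[2, 1]              # x_nd: feature d = 1 of input n = 2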
Getting Features from Raw Data: A Simple Example

Consider the feature representation for some text data consisting of the following sentences:
John likes to watch movies
Mary likes movies too
John also likes football

Our feature "vocabulary" consists of 8 unique words

Here is the bag-of-words feature vector representation of these 3 sentences

Here the features are binary (presence/absence of each word)

Again, note that this may not necessarily be the best "feature" representation for a given task (which is why other techniques or feature learning may be needed)
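Below is a minimal Python sketch of how such binary bag-of-words vectors can be constructed. The vocabulary here is built directly from the three sentences; the slide's original figure is not reproduced, and its exact vocabulary and word ordering may differ (e.g., if some words were dropped):

# Build binary bag-of-words vectors for the three example sentences.
sentences = [
    "John likes to watch movies",
    "Mary likes movies too",
    "John also likes football",
]

# Vocabulary: one feature per unique (lower-cased) word in the sentences.
vocab = sorted({w.lower() for s in sentences for w in s.split()})

def bow_vector(sentence, vocab):
    """1 if the word occurs in the sentence, 0 otherwise (presence/absence)."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocab]

X = [bow_vector(s, vocab) for s in sentences]   # 3 x |vocab| binary feature matrix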
Types of Features and Types of Outputs

Features (in vector x_n) as well as outputs y_n can be real-valued, binary, categorical, ordinal, etc.
Real-valued: Pixel intensity, house area, house price, rainfall amount, temperature, etc.
Binary: Male/female, adult/non-adult, or any yes/no or present/absent type values
Categorical/Discrete: Pincode, blood group, or any "which one from this finite set" type values
Ordinal: Grade (A/B/C etc.) in a course, or any other type where the relative order matters

Often, the features can be of mixed types (some real, some categorical, some ordinal, etc.)

Appropriate handling of different types of features may be very important (even if your algorithm is designed to "learn" good features, given a set of heterogeneous features)

In Supervised Learning, different types of outputs may require different types of learning models
Supervised Learning

Supervised Learning

Supervised Learning comes in many flavors. The flavor depends on the type of each output y_n

Regression: y_n ∈ R (real-valued scalar)

Multi-Output Regression: y_n ∈ R^M (real-valued vector containing M outputs)
[Illustration: a 5-dim output vector for a multi-output regression problem, e.g., (0.3, 0.1, 0.2, 0.8, 0.4)]

Binary Classification: y_n ∈ {−1, +1} or {0, 1} (output in classification is also called "label")

Multi-class Classification: y_n ∈ {1, 2, ..., M} or {0, 1, ..., M − 1} (one of M classes is the correct label)
[Illustration: a 5-dim one-hot label vector for a multi-class classification problem, e.g., (0, 0, 0, 1, 0)]

Multi-label Classification: y_n ∈ {−1, +1}^M or {0, 1}^M (a subset of the M labels are correct)
[Illustration: a 5-dim binary label vector for a multi-label classification problem, e.g., (1, 0, 1, 0, 0); unlike one-hot, there can be multiple 1s]

Note: Multi-label classification is also informally called "tagging" (especially in Computer Vision)
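A small NumPy sketch of these output types (the vectors simply mirror the 5-dimensional examples above; all values are made up):

import numpy as np

y_reg = 2.7                                          # regression: a real-valued scalar
y_multi_reg = np.array([0.3, 0.1, 0.2, 0.8, 0.4])    # multi-output regression: a vector in R^M

y_bin = +1                                           # binary classification: label in {-1, +1}

M, c = 5, 3                                          # multi-class with M classes; class index c
y_onehot = np.zeros(M)
y_onehot[c] = 1                                      # one-hot vector: (0, 0, 0, 1, 0)

y_multilabel = np.array([1, 0, 1, 0, 0])             # multi-label: several of the M labels can be 1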
Supervised Learning (Contd.)

Structured Prediction (a.k.a. Structured Output Learning): Each y_n is a structured object

One-Class Classification (a.k.a. outlier/anomaly/novelty detection): y_n is "1" or "everything else"
[Illustration: examples from the class being modeled (e.g., animals) vs. all other examples ("outliers", e.g., humans, vehicles, etc.)]

Ranking: Each y_n is a ranked list of relevant stuff for a given input/query x
Computing Distances/Similarities

Assuming all real-valued features, an input x_n ∈ R^{D×1} is a point in a D-dimensional vector space of reals

Standard rules of vector algebra apply on such representations, e.g.,

Euclidean distance b/w two points (say two images or two documents) x_n ∈ R^D and x_m ∈ R^D:

d(x_n, x_m) = ||x_n − x_m|| = \sqrt{(x_n − x_m)^T (x_n − x_m)} = \sqrt{\sum_{d=1}^{D} (x_{nd} − x_{md})^2}

Inner-product similarity b/w x_n and x_m (equal to the cosine similarity when x_n and x_m are unit-length vectors):

s(x_n, x_m) = ⟨x_n, x_m⟩ = x_n^T x_m = \sum_{d=1}^{D} x_{nd} x_{md}

ℓ1 distance between two points x_n and x_m:

d_1(x_n, x_m) = ||x_n − x_m||_1 = \sum_{d=1}^{D} |x_{nd} − x_{md}|
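These quantities are straightforward to compute; below is a minimal NumPy sketch (the two vectors are made-up stand-ins for x_n and x_m):

import numpy as np

x_n = np.array([1.0, 2.0, 3.0])
x_m = np.array([2.0, 0.0, 4.0])

euclidean = np.sqrt(np.sum((x_n - x_m) ** 2))                  # ||x_n - x_m||
inner = np.dot(x_n, x_m)                                       # <x_n, x_m> = x_n^T x_m
cosine = inner / (np.linalg.norm(x_n) * np.linalg.norm(x_m))   # cosine similarity
l1 = np.sum(np.abs(x_n - x_m))                                 # ||x_n - x_m||_1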
Our First (Supervised) Learning Algorithm
(need to know nothing except how to compute distances/similarities between points!)
Prototype based Classification

Given: N labeled training examples {(x_n, y_n)}_{n=1}^N from two classes

Assume green is the positive and red is the negative class

N_+ examples from the positive class, N_− examples from the negative class

Our goal: Learn a model to predict the label (class) y for a new test example x

A simple "distance from means" model: predict the class that has a closer mean

Note: The basic idea easily generalizes to more than 2 classes as well
Prototype based Classification: More Formally

What does the decision rule look like, mathematically?

The mean of each class is given by

μ_− = (1/N_−) \sum_{n: y_n = −1} x_n    and    μ_+ = (1/N_+) \sum_{n: y_n = +1} x_n

Euclidean distances from each mean are given by

||μ_− − x||^2 = ||μ_−||^2 + ||x||^2 − 2⟨μ_−, x⟩
||μ_+ − x||^2 = ||μ_+||^2 + ||x||^2 − 2⟨μ_+, x⟩

Decision Rule: If f(x) := ||μ_− − x||^2 − ||μ_+ − x||^2 > 0 then predict +1, otherwise predict −1
Prototype based Classification: The Decision Rule

We saw that our decision rule was

f(x) := ||μ_− − x||^2 − ||μ_+ − x||^2 = 2⟨μ_+ − μ_−, x⟩ + ||μ_−||^2 − ||μ_+||^2

Imp.: f(x) effectively denotes a hyperplane-based classification rule f(x) = w^T x + b, with the vector w = μ_+ − μ_− representing the direction normal to the hyperplane

Imp.: Can show that the rule is equivalent to f(x) = \sum_{n=1}^{N} α_n ⟨x_n, x⟩ + b, where the α's and b can be estimated from the training data (try this as an exercise)

This form of the decision rule is very important. Decision rules for many (in fact most) supervised learning algorithms can be written like this (a weighted sum of similarities with all the training inputs)
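A minimal NumPy sketch of the "distance from means" classifier, assuming binary labels in {-1, +1} and toy data (an illustration of the rule above, not a reference implementation):

import numpy as np

def fit_prototypes(X, y):
    """Compute the class means (prototypes) from the training data."""
    mu_pos = X[y == +1].mean(axis=0)
    mu_neg = X[y == -1].mean(axis=0)
    return mu_pos, mu_neg

def predict(x, mu_pos, mu_neg):
    """f(x) = ||mu_- - x||^2 - ||mu_+ - x||^2; predict +1 if f(x) > 0, else -1."""
    f = np.sum((mu_neg - x) ** 2) - np.sum((mu_pos - x) ** 2)
    return +1 if f > 0 else -1

# Toy usage
X = np.array([[1.0, 1.0], [1.2, 0.8], [-1.0, -1.0], [-0.8, -1.2]])
y = np.array([+1, +1, -1, -1])
mu_pos, mu_neg = fit_prototypes(X, y)
print(predict(np.array([0.9, 1.1]), mu_pos, mu_neg))   # expected: +1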
Be Careful when Computing Distances

Euclidean distance d(x_n, x_m) = \sqrt{(x_n − x_m)^T (x_n − x_m)} may not always be appropriate

An alternative (still Euclidean-like) can be to use the Mahalanobis distance

d_M(x_n, x_m) = \sqrt{(x_n − x_m)^T M (x_n − x_m)}

Shown below is an illustration of what M = diag(1, 2), i.e., the 2 × 2 matrix [1 0; 0 2], will do (note: figure not to scale)

[Figure: the original space vs. the "effective" space under the Mahalanobis transformation]

How do I know what's the right M for my data? Some options:

Set it based on some knowledge of what your data looks like
Learn it from data (called Distance Metric Learning¹ - a whole research area in itself)

Distance Metric Learning is one of the many approaches for feature learning from data

¹ Distance Metric Learning. See "A Survey on Metric Learning for Feature Vectors and Structured Data" by Bellet et al.
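A small NumPy sketch of the Mahalanobis distance, using the illustrative M = diag(1, 2) from above (note that M equal to the identity recovers the plain Euclidean distance):

import numpy as np

def mahalanobis(x_n, x_m, M):
    """d_M(x_n, x_m) = sqrt((x_n - x_m)^T M (x_n - x_m))."""
    d = x_n - x_m
    return np.sqrt(d @ M @ d)

M = np.diag([1.0, 2.0])            # the illustrative M from the slide
x_n = np.array([1.0, 0.0])
x_m = np.array([0.0, 1.0])
print(mahalanobis(x_n, x_m, M))    # sqrt(1*1 + 2*1) = sqrt(3)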
Prototype based Classification: Some Comments

A very simple supervised learner. Works for any number of classes. Trivial to implement. :-)

This simple approach, if using Euclidean distances, can only learn linear decision boundaries
A reason: The basic approach implicitly assumes that classes are roughly spherical and equi-sized

Several nice improvements/generalizations possible (some of which we will see in coming lectures)
Instead of a point (mean), model classes by prob. distributions (to account for class shapes/sizes)
Instead of Euclidean distances, can use non-Euclidean distances, distance metric learning, or "kernels"

Another limitation: Needs plenty of training data from each class to reliably estimate the means
But with a good feature learner, even ONE (or very few) example per class may be enough (a state-of-the-art "Few-Shot Learning" model actually uses Prototype based classification)
Another Simple Supervised Learner:
Nearest Neighbors

Nearest Neighbor

Another classic distance-based supervised learning method

The label y for x ∈ R^D will be the label of its nearest neighbor in the training data. Also known as one-nearest-neighbor (1-NN)

Euclidean/Mahalanobis distance can be used to find the nearest neighbor (or can use a learned distance metric)

We typically use more (K > 1) neighbors in practice

Note: The method is widely applicable - works for both classification and regression problems
K-Nearest Neighbors (K-NN)

Makes one-nearest-neighbor more robust by using more than one neighbor
Test time simply does a majority vote (or average) of the labels of the K closest training inputs

For a test input x, the averaging version of the prediction rule for K-nearest neighbors is

y = (1/K) \sum_{n ∈ N_K(x)} y_n

... where N_K(x) is the set of the K closest training inputs for x

Above assumes the K neighbors have equal (1/K) weights. Can also use distance-based weights

Note: The rule works for multi-label classification too, where each y_n ∈ {0, 1}^M is a binary vector
Averaging will give a real-valued "label score vector" y ∈ R^M using which we can find the best label(s)
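A minimal NumPy sketch of the equal-weight averaging rule above (the function name and toy interface are my own; for binary {0, 1} labels the returned average can be thresholded at 0.5, and for multi-label vectors it gives the label score vector):

import numpy as np

def knn_predict(X_train, y_train, x_test, K=3):
    """Average the labels of the K training inputs closest to x_test (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:K]                    # indices of the K closest inputs, N_K(x)
    return y_train[nearest].mean(axis=0)               # (1/K) * sum of their labels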
K-NN for Multi-Label Learning: Pictorial Illustration

Suppose K = 3. The label averaging for a multi-label learning problem will look like

y = (1/3) × [(1, 0, 0, 1, 0) + (1, 0, 1, 1, 0) + (1, 0, 0, 0, 1)] = (1, 0, 0.33, 0.66, 0.33)

(per-label rank by score: #1, #4, #3, #2, #3)

Note that we can use the final y to rank the labels based on the real-valued scores
Can use it to predict the best, best-2, best-3, and so on..

Note: This is why multi-label learning is often used in some ranking problems where we wish to predict a ranking of the possible labels an input can have
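The averaging in this picture can be reproduced directly (a small NumPy sketch):

import numpy as np

neighbor_labels = np.array([[1, 0, 0, 1, 0],   # label vectors of the K = 3 nearest neighbors
                            [1, 0, 1, 1, 0],
                            [1, 0, 0, 0, 1]])
scores = neighbor_labels.mean(axis=0)          # approximately (1, 0, 0.33, 0.66, 0.33)
ranking = np.argsort(-scores)                  # label indices ordered from highest to lowest score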
How to Select K: Cross-Validation

We can use cross-validation to select the "optimal" value of K

Cross-validation: Divide the training data into two parts, an actual training set and a validation set

[Figure: the training data is split into an (actual) training set and a validation set; the test data is kept aside and never ever touched while training]

Try different values of K and look at the accuracies on the validation set
Note: For each K, we typically try multiple splits of train and validation sets

Select the K that gives the best accuracy on the validation set

Never touch the test set (even if you have access to it) during training to choose the best K
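A minimal sketch of this procedure with a single train/validation split, assuming NumPy and the knn_predict sketch from earlier, plus binary {0, 1} labels (in practice, as noted, one would average accuracies over multiple splits):

import numpy as np

def choose_K(X, y, candidate_Ks=(1, 3, 5, 7, 9), val_frac=0.2, seed=0):
    """Hold out a validation set and pick the K with the best validation accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_frac * len(X))
    val_idx, tr_idx = idx[:n_val], idx[n_val:]
    best_K, best_acc = None, -1.0
    for K in candidate_Ks:
        preds = np.array([knn_predict(X[tr_idx], y[tr_idx], x, K) for x in X[val_idx]])
        acc = np.mean((preds > 0.5).astype(int) == y[val_idx])   # accuracy on the validation set
        if acc > best_acc:
            best_K, best_acc = K, acc
    return best_K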
ε-Ball Nearest Neighbors

Instead of the K nearest neighbors, can instead consider an ε-radius ball centered at the test point

[Figure: a ball of radius ε centered at the test point]

Just like selecting K, we can select the optimal ε via cross-validation

Have to be careful to choose ε so as to not get zero neighbors within the ε-ball :-)
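A small NumPy sketch of ε-ball prediction (the fallback to the single nearest neighbor when the ball is empty is just one simple way to handle the zero-neighbor issue noted above):

import numpy as np

def eps_ball_predict(X_train, y_train, x_test, eps=1.0):
    """Average the labels of all training points within distance eps of x_test."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    inside = dists <= eps
    if not np.any(inside):                 # empty ball: fall back to the single nearest neighbor
        inside = dists == dists.min()
    return y_train[inside].mean(axis=0)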
Some Aspects about Nearest Neighbor

A simple yet very effective method in practice (if given lots of training data)
Provably has an error-rate that is no worse than twice that of the "Bayes optimal" classifier, which assumes knowledge of the true data distribution for each class

Also called a memory-based or instance-based or non-parametric method
No "model" is learned here. The prediction step uses all the training data

Requires lots of storage (need to keep all the training data at test time)

Prediction can be slow at test time
For each test point, need to compute its distance from all the training points
Clever data structures or data-summarization techniques can provide speed-ups

Need to be careful in choosing the distance function to compute distances (especially when the data dimension D is very large)

The 1-NN can suffer if the data contains outliers (we will soon see a geometric illustration), or if the amount of training data is small. Using more neighbors (K > 1) is usually more robust
Geometry of 1-NN

1-NN induces a Voronoi tessellation of the input space

The Decision Boundary of 1-NN (for binary classification)

The decision boundary is composed of hyperplanes that form perpendicular bisectors of pairs of
points from different classes

Pic credit: Victor Lavrenko


Effect of Outliers on 1-NN

How the decision boundary can drastically change when the data contains some outliers

Pic credit: Victor Lavrenko


Effect of Varying K

Larger K leads to smoother decision boundaries

Too small K (e.g., K = 1) can lead to overfitting, too large K can lead to underfitting

Pic credit: Chris Bishop (PRML)
K -NN Behavior for Regression

Pic credit: Victor Lavrenko


K -NN Behavior for Regression

Pic credit: Alex Smola and Vishy Vishwanathan


Summary

Looked at two distance-based methods for classification/regression
A "Distance from Means" Method
Nearest Neighbors Method

Both are essentially "local" methods (look at local neighborhood of the test point)

Both are simple to understand and only require knowledge of basic geometry

Have connections to other more advanced methods (as we will see)

Need to be careful when computing the distances (learned Mahalanobis distance metrics, or "learned features" + Euclidean distance can often do wonders!)
