
Machine Learning

Part II
A Biswas
IIEST, Shibpur
Syllabus

Introduction
Learning Problems, Well-posed learning problems, Designing learning systems.

Concept Learning
Concept learning task, Inductive hypothesis, Ordering of Hypothesis, General-to-specific ordering of
hypotheses. Version spaces, Inductive Bias.

Learning Rule Sets


Sequential Covering Algorithm, First Order Rules, Induction, First Order Resolution, Inverting Resolution.

Regression
Linear regression, Notion of cost function, Logistic Regression, Cost function for logistic regression,
application of logistic regression to multi-class classification.

Continued …
Syllabus (continued)

Supervised Learning
Support Vector Machine, Decision tree Learning, Representation, Problems, Decision Tree Learning Algorithm,
Attributes, Inductive Bias, Overfitting.
Bayes Theorem, Bayesian learning, Maximum likelihood, Least squared error hypothesis, Gradient Search, Naive
Bayes classifier, Bayesian Network, Expectation Maximization Algorithm.

Unsupervised learning
Clustering, K-means clustering, hierarchical clustering.

Instance-Based Learning
k-Nearest Neighbour Algorithm, Radial Basis Function, Locally Weighted Regression, Locally Weighted Function.

Neural networks
Linear threshold units, Perceptrons, Multilayer networks and back-propagation, recurrent networks. Probabilistic
Machine Learning, Maximum Likelihood Estimation.

Regularization, Preventing Overfitting, Ensemble Learning: Bagging and Boosting, Dimensionality reduction
Machine Learning Basics
A machine learning algorithm is an algorithm that is able to
learn from data.
Machine Learning Basics

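A computer program is said to learn from experience E with
respect to a class of tasks T and a performance measure P if
its performance at tasks in T, as measured by P, improves
with experience E (in short: P ↑ for T with E).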
Machine Learning Basics

T: how the machine learning system should process an example.

Example: a collection of features that have been
quantitatively measured from some object or event that the
machine learning system will process.
Machine Learning Basics

Representing an example:

a vector $x \in \mathbb{R}^n$ where each entry $x_i$ is a feature.

For instance, the features of an image are usually the values
of the pixels in the image.
Machine Learning Basics
Some of the most common machine learning tasks:

Classification:

The learning algorithm produces a function
$f : \mathbb{R}^n \to \{1, \dots, k\}$.

$y = f(x)$, where x is the input and y is the class.


Machine Learning Basics
Some of the most common machine learning tasks:

An example of a classification task is object recognition:

input = an image (described as a set of pixel values), and

output = a numeric code identifying the object in the image.
Machine Learning Basics
Classification with missing inputs :

It is not guaranteed that every measurement in the input
vector will always be provided.
Machine Learning Basics
Classification with missing inputs :

When some of the inputs may be missing, rather than
providing a single classification function, the learning
algorithm must learn a set of functions. Each function
corresponds to classifying x with a different subset of its
inputs missing.
Machine Learning Basics
Classification with missing inputs :

This kind of situation arises frequently in medical diagnosis,
because many kinds of medical tests are expensive or
invasive.
Machine Learning Basics
Classification with missing inputs :

One way to efficiently define such a large set of functions
is to learn a probability distribution over all of the relevant
variables, then solve the classification task by marginalizing
out the missing variables.
Machine Learning Basics
Classification with missing inputs :

With n input variables, we can now obtain all $2^n$ different
classification functions needed for each possible set of
missing inputs, but we only need to learn a single function
describing the joint probability distribution.
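A minimal sketch of classification by marginalization, assuming a tiny made-up joint distribution over two binary inputs and a binary class. The table values and the name `classify` are illustrative assumptions, not part of the slides. A single learned table answers every pattern of missing inputs.

```python
import numpy as np

# Hypothetical joint distribution p(x1, x2, y), indexed as joint[x1, x2, y].
# The numbers are made up for illustration and sum to 1.
joint = np.array([
    [[0.20, 0.05], [0.10, 0.05]],   # x1 = 0
    [[0.05, 0.15], [0.05, 0.35]],   # x1 = 1
])

def classify(joint, x1=None, x2=None):
    """Return the most probable class given whichever inputs are observed."""
    p = joint
    # Handle x2 (axis 1) first so that axis 0 still refers to x1 afterwards.
    p = p.sum(axis=1) if x2 is None else p[:, x2, :]   # marginalize or select x2
    p = p.sum(axis=0) if x1 is None else p[x1, :]      # marginalize or select x1
    # p is now proportional to p(y | observed inputs); pick the largest entry.
    return int(np.argmax(p))
```

Calling `classify(joint, x1=1)` marginalizes out the missing `x2`, while `classify(joint, x1=0, x2=0)` uses both inputs.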
Regression
The task is to predict a numerical value given some input:

$f : \mathbb{R}^n \to \mathbb{R}$

Examples:
- prediction of the expected claim amount (for example, in
insurance)
- prediction of future prices of securities
Unsupervised Learning
Unsupervised learning algorithms experience a dataset
containing many features, then learn useful properties of
the structure of this dataset.

Clustering
Supervised Learning
Supervised learning algorithms experience a dataset
containing features, but each example is also associated with
a label or target.

For example, the Iris dataset is annotated with the species
of each iris plant.
Supervised Learning and unsupervised learning

Unsupervised learning: observing several examples of a
random vector x, and attempting to implicitly or explicitly
learn the probability distribution p(x), or some interesting
properties of that distribution.

Supervised learning: observing several examples of a random
vector x and an associated value or vector y, and learning to
predict y from x, usually by estimating p(y | x).
Regression

Regression is the problem of predicting a real-valued label
(often called a target) given an unlabeled example.

Estimating a house price based on house features, such as
area, the number of bedrooms, location and so on, is a famous
example of regression.
Linear Regression Problem
Linear regression is a popular regression learning algorithm
that learns a model which is a linear combination of
features of the input example.
Linear Regression Problem
Problem Statement:
Given a collection of labeled examples $\{(x_i, y_i)\}_{i=1}^{N}$,
where N is the size of the collection,
$x_i$ is the D-dimensional feature vector of example $i = 1, \dots, N$,
$y_i$ is a real-valued target, and
every feature $x_i^{(j)}$, $j = 1, \dots, D$, is also a real number.
Linear Regression Problem
Problem Statement : …
The objective is to build a model $f_{w,b}(x)$ as a linear
combination of the features of example x:

$f_{w,b}(x) = wx + b$

where w is a D-dimensional vector of parameters and b is a
real number. The model f is parametrized by two values: w
and b.
Linear Regression Problem
Problem Statement : …
The model will predict the unknown y for a given x like
this:

$y \leftarrow f_{w,b}(x)$

Two models parametrized by two different pairs (w, b) will
likely produce two different predictions when applied to the
same example. We want to find the optimal values (w*, b*).
Linear Regression

Linear regression for one-dimensional examples.


Linear Regression
To find the optimal values w* and b* we have to
minimize

$\frac{1}{N} \sum_{i=1}^{N} (f_{w,b}(x_i) - y_i)^2$

The expression we want to minimise or maximise is called
the objective function. The expression $(f_{w,b}(x_i) - y_i)^2$
inside it is the loss function.
Linear Regression

This loss function is called squared error loss.

All model-based learning algorithms have a loss function.
To find the best model, we minimize an objective known as
the cost function.
In linear regression, the cost function is given by the
average loss, also called the empirical risk.
Linear Regression

The average loss, or empirical risk, for a model, is the
average of all penalties obtained by applying the model to
the training data.
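As a small illustration of the average loss, the sketch below computes the empirical risk of a linear model under squared error loss. The function names `predict` and `empirical_risk` and the toy data are assumptions made for this example only.

```python
import numpy as np

def predict(X, w, b):
    """Linear model f_{w,b}(x) = wx + b applied to every row of X."""
    return X @ w + b

def empirical_risk(X, y, w, b):
    """Average squared error loss over the training examples."""
    residuals = predict(X, w, b) - y
    return float(np.mean(residuals ** 2))

# Toy one-dimensional data generated by y = 2x (illustrative only).
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

risk_good = empirical_risk(X, y, np.array([2.0]), 0.0)  # perfect fit
risk_bad = empirical_risk(X, y, np.array([1.0]), 0.0)   # slope too small
```

The model with the lower empirical risk fits the training data better; here the first parameter pair fits it exactly.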
Linear Regression

Why linear model?

Simple.
Linear models rarely overfit.
Overfitting is the property of a model such that
- the model predicts very well the labels of the examples
used during training but
- frequently makes errors when applied to examples
that weren’t seen by the learning algorithm during training.
Linear Regression
Why linear model?

Polynomial regression with a polynomial of degree 10.


Linear Regression
Why linear model?
The regression curve predicts the targets of almost all
training examples almost perfectly, but will likely make
significant errors on new data, for example at $x_{new}$.
Linear Regression
Why squared loss?
The absolute value is not convenient, because it doesn’t
have a continuous derivative, which makes the function not
smooth.

Functions that are not smooth create unnecessary
difficulties when employing linear algebra to find closed
form solutions to optimization problems.
Linear Regression
Why squared loss?
Closed form solutions to finding an optimum of a function
are simple algebraic expressions and are often preferable to
using complex numerical optimization methods, such as
gradient descent (used, among others, to train neural
networks).
Linear Regression
Why squared loss?
One advantage:
Squaring exaggerates the difference between the true target
and the predicted one according to the value of this
difference, so large errors are penalized more heavily.
Linear Regression
Why squared loss?
Why is the derivative of the average loss important?
- If we can calculate the gradient of the function, we can
then set this gradient to zero and find the solution to a
system of equations that gives us the optimal values
w* and b*.
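The closed-form idea can be sketched as follows. This is an illustration under stated assumptions: the helper name `fit_linear_regression` and the toy data are made up, and NumPy's `np.linalg.lstsq` is used as the solver. It returns the least-squares solution, which is the point where the gradient of the squared-error cost is zero.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Return (w*, b*) minimizing the average squared error."""
    # Append a column of ones so that b is learned as the last coefficient.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    # Least-squares solution of X1 @ theta = y (zero-gradient point of the cost).
    theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return theta[:-1], theta[-1]

# Toy data generated exactly by y = 3x + 1 (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X[:, 0] + 1.0
w_star, b_star = fit_linear_regression(X, y)
```

On this noise-free data the recovered parameters match the generating line up to floating-point error.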
Logistic Regression

It is not a regression, but a classification learning
algorithm.

The mathematical formulation of logistic regression is
similar to that of linear regression.
Logistic Regression
Problem Statement

- We still want to model $y_i$ as a linear function of $x_i$;
however, with a binary $y_i$, this is not straightforward.

The linear combination of features such as $wx_i + b$ is a
function that spans from minus infinity to plus infinity, while
$y_i$ has only two possible values.
Logistic Regression
Problem Statement

If we define the negative label as 0 and the positive label
as 1, we just need a simple continuous function whose
codomain is (0,1).

If the value returned by the model for input x is closer to
0, then we assign a negative label to x; otherwise, the
example is labeled as positive.
Logistic Regression
Problem Statement

One such function is the standard logistic function, also
called the sigmoid function:

$f(x) = \frac{1}{1 + e^{-x}}$
Logistic Regression
Problem Statement

Logistic regression model:

$f_{w,b}(x) = \frac{1}{1 + e^{-(wx + b)}}$

Objective: to find w* and b*.
Logistic Regression

Objective: to find w* and b*.

In logistic regression, instead of using a squared loss and
trying to minimize the empirical risk, we maximize the
likelihood of our training set according to the model. In
statistics, the likelihood function defines how likely the
observation (an example) is according to our model.
Logistic Regression

Consider a labeled example $(x_i, y_i)$.

Assume also that we have found (guessed) some specific
values $\hat{w}$ and $\hat{b}$ of our parameters.
Logistic Regression
If we now apply our model $f_{\hat{w},\hat{b}}$ to $x_i$ we will
get some value $0 < p < 1$ as output.
If $y_i$ is the positive class, the likelihood of $y_i$ being the
positive class, according to our model, is given by p.

Similarly, if $y_i$ is the negative class, the likelihood of it
being the negative class is given by $1 - p$.
Logistic Regression
The optimization criterion in logistic regression is called
maximum likelihood. Instead of minimizing the average loss,
like in linear regression, we now maximize the likelihood of
the training data according to our model:

$L_{w,b} = \prod_{i=1}^{N} f_{w,b}(x_i)^{y_i} (1 - f_{w,b}(x_i))^{1 - y_i}$
Logistic Regression

Because of the exp function used in the model, in practice,
it’s more convenient to maximize the log-likelihood instead of
the likelihood:

$\ln L_{w,b} = \sum_{i=1}^{N} [y_i \ln f_{w,b}(x_i) + (1 - y_i) \ln(1 - f_{w,b}(x_i))]$
Logistic Regression
Because ln is a strictly increasing function, maximizing
this function is the same as maximizing its argument, and
the solution to this new optimization problem is the same as
the solution to the original problem.
Logistic Regression
Contrary to linear regression, there’s no closed form solution
to the above optimization problem. A typical numerical
optimization procedure used in such cases is gradient
descent.
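A minimal sketch of fitting logistic regression numerically, assuming the sigmoid model and the log-likelihood criterion described above. It uses plain gradient ascent on the log-likelihood, which is the same as gradient descent on its negative. The function names, learning rate, number of epochs and toy data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(X, y, w, b):
    """Log-likelihood of labels y in {0, 1} under the logistic model."""
    p = sigmoid(X @ w + b)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_logistic_regression(X, y, learning_rate=0.1, epochs=2000):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        # Gradient of the log-likelihood with respect to w and b.
        grad_w = X.T @ (y - p)
        grad_b = np.sum(y - p)
        # Ascent step (averaged over examples) increases the log-likelihood.
        w += learning_rate * grad_w / len(y)
        b += learning_rate * grad_b / len(y)
    return w, b

# Toy, deliberately non-separable data so the optimum is finite.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 1, 0, 1, 1])
w_hat, b_hat = fit_logistic_regression(X, y)
```

After training, `sigmoid(X @ w_hat + b_hat)` gives the model's estimate of the probability of the positive class for each example.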
Decision Tree Learning

A decision tree is an acyclic graph that can be used to
make decisions.
In each branching node of the graph, a specific feature j
of the feature vector is examined. If the value of the
feature is below a specific threshold, then the left branch
is followed; otherwise, the right branch is followed.
When a leaf node is reached, the decision is made about the
class to which the example belongs.
Decision Tree Learning

Problem statement:

We have a collection of labeled examples; labels belong to
the set {0, 1}.

We want to build a decision tree that would allow us to
predict the class of an example given a feature vector.
Decision Tree Learning

Decision Tree Learning model: ID3

It looks like the logistic regression.


Decision Tree Learning

Decision Tree Learning model: ID3

The optimization criterion is the average log-likelihood:

$\frac{1}{N} \sum_{i=1}^{N} [y_i \ln f_{ID3}(x_i) + (1 - y_i) \ln(1 - f_{ID3}(x_i))]$

where $f_{ID3}$ is a decision tree.


Decision Tree Learning
Decision Tree Learning model: ID3

The logistic regression learning algorithm builds a
parametric model $f_{w^*,b^*}$ by finding an optimal solution to
the optimization criterion.
The ID3 algorithm, in contrast, optimizes it approximately by
constructing a non-parametric model.
Decision Tree Learning
Decision Tree Learning model: ID3
The ID3 learning algorithm works as follows.

Let S denote a set of labeled examples.

At first, the decision tree only has a start node that
contains all examples.
Decision Tree Learning
Decision Tree Learning model: ID3
Start with a constant model $f_{ID3}^{S}$:

$f_{ID3}^{S} = \frac{1}{|S|} \sum_{(x, y) \in S} y$

The prediction given by the above model, $f_{ID3}^{S}(x)$, would
be the same for any input x.
Decision Tree Learning
Decision Tree Learning
Search through all features j = 1, …, D and all thresholds t.
Split the set S into two subsets:

$S_- = \{(x, y) \in S \mid x^{(j)} < t\}$
and
$S_+ = \{(x, y) \in S \mid x^{(j)} \ge t\}$
Decision Tree Learning

The two new subsets would go to two new leaf nodes, and
we evaluate, for all possible pairs (j, t), how good the split
with pieces S− and S+ is.
Finally, we pick the best such values (j, t), split S into
S− and S+, form two new leaf nodes, and continue
recursively on S− and S+ (or quit if no split produces a
model that’s sufficiently better than the current one).
Decision Tree Learning

b) The decision tree after the first split; it tests whether
feature 3 is less than 18.3 and, depending on the result, the
prediction is made in one of the two leaf nodes.
Decision Tree Learning
The goodness of a split
(“evaluate how good the split is”)
is estimated using a criterion called entropy.

Entropy is a measure of uncertainty about a random
variable. It reaches its maximum when all values of the
random variable are equiprobable. Entropy reaches its
minimum when the random variable can have only one value.
Decision Tree Learning
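The entropy of a set of examples S is given by:

$H(S) = -f_{ID3}^{S} \ln f_{ID3}^{S} - (1 - f_{ID3}^{S}) \ln(1 - f_{ID3}^{S})$

When we split a set of examples by a certain feature j and
a threshold t, the entropy of a split, H(S−, S+), is simply a
weighted sum of two entropies:

$H(S_-, S_+) = \frac{|S_-|}{|S|} H(S_-) + \frac{|S_+|}{|S|} H(S_+)$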
Decision Tree Learning
So, in ID3, at each step, at each leaf node, we find a split
that minimizes the entropy or we stop at this leaf node.
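The split search can be sketched as below, assuming binary labels, the natural-log entropy of the fraction of positive labels, and the rule that examples with feature value below the threshold go left. The function names and toy data are illustrative; features are indexed from 0 here, whereas the slides count from 1.

```python
import numpy as np

def entropy(labels):
    """Entropy of a set of binary labels, using the natural logarithm."""
    if len(labels) == 0:
        return 0.0
    f = float(np.mean(labels))  # constant model: fraction of positive labels
    if f == 0.0 or f == 1.0:
        return 0.0              # a pure set has no uncertainty
    return float(-f * np.log(f) - (1 - f) * np.log(1 - f))

def split_entropy(left, right):
    """Weighted sum of the entropies of the two pieces of a split."""
    n = len(left) + len(right)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

def best_split(X, y):
    """Return (feature index, threshold, entropy) of the best split found."""
    best = (None, None, entropy(y))
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            mask = X[:, j] < t
            if mask.all() or not mask.any():
                continue  # skip splits that leave one side empty
            h = split_entropy(y[mask], y[~mask])
            if h < best[2]:
                best = (j, float(t), h)
    return best

X = np.array([[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
j, t, h = best_split(X, y)
```

A full tree builder would call `best_split` recursively on the two pieces until one of the stopping conditions holds.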
Decision Tree Learning
The algorithm stops at a leaf node when:
- All examples in the leaf node are classified correctly by
the one-piece model.
- We cannot find an attribute to split upon.
- The split reduces the entropy by less than some ϵ (the
value of which has to be found experimentally).
- The tree reaches some maximum depth d (also has to be
found experimentally).
Decision Tree Learning
Because in ID3, the decision to split the dataset on each
iteration is local (doesn’t depend on future splits), the
algorithm doesn’t guarantee an optimal solution.

The model can be improved by using techniques like
backtracking during the search for the optimal decision
tree, at the cost of possibly taking longer to build a model.
Support Vector Machine
Problem: spam detection

Data collection: say 10,000 messages with labels spam or
not_spam.

Convert each email message into a feature vector.

How to convert a real-world entity into a feature vector?
The data analyst decides, based on experience, how to do
the conversion.
Support Vector Machine
One way to convert a text into a feature vector, called
bag of words, is to take a dictionary of English words
(containing, say, 20,000 alphabetically sorted words) and
define the features as follows:
Support Vector Machine
- the first feature is equal to 1 if the email message
contains the word “a”; otherwise, this feature is 0;
- the second feature is equal to 1 if the email message
contains the word “aaron”; otherwise, this feature equals 0;
………………
- the feature at position 20,000 is equal to 1 if the email
message contains the word “zulu”; otherwise, this feature
is equal to 0.
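A minimal sketch of the bag-of-words conversion, assuming a tiny five-word dictionary in place of the 20,000-word one and a simple lowercase word tokenizer. The name `bag_of_words` and the sample message are illustrative assumptions.

```python
import re

def bag_of_words(text, dictionary):
    """Return a 0/1 feature vector: one entry per dictionary word."""
    # Lowercase the text and collect the distinct words it contains.
    words = set(re.findall(r"[a-z']+", text.lower()))
    return [1 if word in words else 0 for word in dictionary]

# Tiny alphabetically sorted stand-in for the 20,000-word dictionary.
dictionary = ["a", "aaron", "free", "meeting", "zulu"]
vector = bag_of_words("Claim a FREE prize now", dictionary)
```

Each position of `vector` corresponds to one dictionary word, so every message maps to a vector of the same length.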
Support Vector Machine
Repeat the above procedure for every email message,
giving 10,000 feature vectors (each vector having the
dimensionality of 20,000), each with a label
(“spam”/“not_spam”).

Now the input data is machine-readable.

To make the output data machine-readable:
spam → 1, not_spam → 0
or, spam → +1 and not_spam → -1
Support Vector Machine
SVM sees every feature vector as a point in a
high-dimensional space (here, 20,000-dimensional).

The algorithm puts all feature vectors on an imaginary
20,000-dimensional plot and draws an imaginary
19,999-dimensional surface (a hyperplane) that separates
examples with positive labels from examples with negative
labels.

Decision boundary: the boundary that separates the
different classes.


Support Vector Machine
The equation of the hyperplane is given by two
parameters, a real-valued vector w of the same
dimensionality as our input feature vector x, and a real
number b:
$wx - b = 0$

where wx means $w^{(1)}x^{(1)} + w^{(2)}x^{(2)} + \dots + w^{(D)}x^{(D)}$
and D is the dimension of the feature vector x.
Support Vector Machine
Now, the predicted label for some input feature vector x is
given by
y = sign(wx - b)

where sign is a mathematical operator that takes any value
as input and returns +1 if the input is a positive number or
-1 if the input is a negative number.
Support Vector Machine
The objective of SVM is to find the optimal values w* and
b* for parameters w and b.

Once these optimal values are identified, the model f(x) is
then defined as
f(x) = sign(w*x - b*)
Support Vector Machine
To predict whether an email message is spam or not spam
using an SVM model:
- take a text of the message,
- convert it into a feature vector,
- then multiply this vector by w*, subtract b*
- and take the sign of the result.

i.e. evaluate f(x) = sign(w*x- b*)


If +1 then spam if -1 then not_spam
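The prediction steps above can be sketched as follows, assuming the parameters have already been learned. The values of `w_star` and `b_star` below are made up, and mapping an input of exactly zero to -1 is an assumption, since the slides define sign only for positive and negative inputs.

```python
import numpy as np

def svm_predict(x, w, b):
    """Return +1 (spam) or -1 (not_spam) as sign(wx - b)."""
    score = np.dot(w, x) - b
    # Assumption: a score of exactly 0 is treated as the negative class.
    return 1 if score > 0 else -1

# Hypothetical learned parameters for a 3-word dictionary.
w_star = np.array([2.0, -1.0, 0.5])
b_star = 0.5

label_a = svm_predict(np.array([1, 0, 1]), w_star, b_star)  # score 2.0
label_b = svm_predict(np.array([0, 1, 0]), w_star, b_star)  # score -1.5
```

In practice the feature vector would come from the bag-of-words conversion of the message text.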
Support Vector Machine
First of all, we want the model to predict the labels of our
10,000 examples correctly.
Each example i = 1, . . . , 10000 is given by a pair (xi,yi),
where xi is the feature vector of example i and yi is its
label that takes values either -1 or +1.

So the constraints are:

a) $wx_i - b \ge +1$ if $y_i = +1$,
b) $wx_i - b \le -1$ if $y_i = -1$.


Support Vector Machine
Also, the hyperplane should separate positive examples from
negative ones with the largest margin.

The margin is the distance between the closest examples of
two classes, as defined by the decision boundary.

A large margin contributes to better generalization, that
is, how well the model will classify new examples in future.
Support Vector Machine
To achieve this we need to minimise the Euclidean norm of
w, denoted by ||w|| and given by
$\|w\| = \sqrt{\sum_{j=1}^{D} (w^{(j)})^2}$

So, the optimisation problem now is:

Minimize ||w|| subject to $y_i(wx_i - b) \ge 1$ for $i = 1, \dots, N$.
Support Vector Machine
The solution of this optimization problem, given by
w* and b*, is called the statistical model, or, simply, the
model. The process of building the model is called training.
Support Vector Machine

For two-dimensional feature vectors: the blue and orange
circles represent positive and negative examples, and the
line given by wx - b = 0 is the decision boundary.
Support Vector Machine
Geometrically, the equations wx - b = 1 and wx - b = -1
define two parallel hyperplanes.
The distance between the two hyperplanes is given by
$\frac{2}{\|w\|}$.
Support Vector Machine
So, the smaller the norm ||w||, the larger the distance
between these two hyperplanes.
Support Vector Machine
This particular version of the algorithm builds the so-called
linear model. It’s called linear because the decision
boundary is a straight line (or a plane, or a hyperplane).
Support Vector Machine
SVM can also incorporate kernels that can make the
decision boundary arbitrarily non-linear.

In some cases, it could be impossible to perfectly separate
the two groups of points because of noise in the data,
errors of labeling, or outliers (examples very different from
a “typical” example in the dataset).
Support Vector Machine
Another version of SVM can also incorporate a penalty
hyperparameter for misclassification of training examples of
specific classes.
Support Vector Machine
Any classification learning algorithm that builds a model
implicitly or explicitly creates a decision boundary.

The decision boundary can be straight, or curved, or it can
have a complex form, or it can be a superposition of some
geometrical figures.
Support Vector Machine
The form of the decision boundary determines the
accuracy of the model (that is, the ratio of examples whose
labels are predicted correctly).

The form of the decision boundary, and the way it is
algorithmically or mathematically computed based on the
training data, differentiates one learning algorithm from
another.
Support Vector Machine
Two other essential differentiators of learning algorithms to
consider:
- speed of model building and
- prediction processing time.

In many practical cases, you would prefer a learning
algorithm that builds a less accurate model fast.
Additionally, you might prefer a less accurate model that is
much quicker at making predictions.
Support Vector Machine
Two critical questions need to be answered:

1. What if there’s noise in the data and no hyperplane can
perfectly separate positive examples from negative ones?
2. What if the data cannot be separated using a plane, but
could be separated by a higher-order polynomial?
Support Vector Machine
Support Vector Machine

Inherent nonlinearity.
Support Vector Machine
So, as of now,

Constraints:

$wx_i - b \ge +1$ if $y_i = +1$, and
$wx_i - b \le -1$ if $y_i = -1$,

and minimize ||w|| so that the hyperplane is equally
distant from the closest examples of each class.
Support Vector Machine
Minimising ||w|| is equivalent to minimising $\frac{1}{2}\|w\|^2$.
The use of this term makes it possible to perform quadratic
programming optimization.
The optimization problem for SVM:
$\min_{w, b} \frac{1}{2}\|w\|^2$ subject to $y_i(wx_i - b) - 1 \ge 0$, $i = 1, \dots, N$
Support Vector Machine: Dealing with Noise
To extend SVM to cases in which the data is not linearly
separable, we introduce the hinge loss function:
$\max(0, 1 - y_i(wx_i - b))$
The hinge loss function is zero if the constraints
a) $wx_i - b \ge +1$ for $y_i = +1$ and b) $wx_i - b \le -1$ for
$y_i = -1$ are satisfied, that is, if $x_i$ lies on the correct
side of the decision boundary and outside the margin.
For data on the wrong side of the decision boundary, the
function’s value is proportional to the distance from the
decision boundary.
Support Vector Machine: Dealing with Noise
So, we have to minimise the cost function:

$C\|w\|^2 + \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i(wx_i - b))$

where the hyperparameter C determines the trade-off
between increasing the size of the margin and ensuring that
each $x_i$ lies on the correct side of the decision boundary.
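A minimal sketch of this cost, assuming it has the form C||w||² plus the average hinge loss, which is consistent with how C is described in this section. The function names, parameter values and toy data are illustrative assumptions.

```python
import numpy as np

def hinge_loss(X, y, w, b):
    """Per-example hinge loss max(0, 1 - y_i (w x_i - b)), labels in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * (X @ w - b))

def svm_cost(X, y, w, b, C):
    """Assumed soft-margin cost: C * ||w||^2 + average hinge loss."""
    return float(C * np.dot(w, w) + np.mean(hinge_loss(X, y, w, b)))

X = np.array([[2.0, 0.0], [0.0, 2.0], [0.5, 0.0]])
y = np.array([1, -1, -1])
w = np.array([1.0, -1.0])
b = 0.0

losses = hinge_loss(X, y, w, b)       # third example is on the wrong side
cost = svm_cost(X, y, w, b, C=0.1)
```

The first two examples satisfy their constraints and contribute zero loss; only the third, misclassified one is penalized.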
Support Vector Machine: Dealing with Noise

The value of C is usually chosen experimentally.


Support Vector Machine: Dealing with Noise

For sufficiently high values of C, the second term in the
cost function will become negligible, so the SVM algorithm
will try to find the highest margin by completely ignoring
misclassification.
Support Vector Machine: Dealing with Noise

As we decrease the value of C, making classification errors
becomes more costly, so the SVM algorithm will try to
make fewer mistakes by sacrificing the margin size.
Support Vector Machine: Dealing with Noise

A larger margin is better for generalization.
Therefore, C regulates the tradeoff between classifying the
training data well (minimizing empirical risk) and classifying
future examples well (generalization).
Support Vector Machine: Dealing with Inherent Non-Linearity
SVM can be adapted to datasets that are not linearly
separable.

If we manage to transform the original space into a
space of higher dimensionality, we may see that the
examples become linearly separable in this transformed
space.
Support Vector Machine: Dealing with Inherent Non-Linearity
Kernel trick:
In SVMs, using a function to implicitly transform the
original space into a higher dimensional space during the
cost function optimization is called the kernel trick.
Support Vector Machine: Dealing with Inherent Non-Linearity
It’s possible to transform two-dimensional data that is not
linearly separable into linearly separable three-dimensional
data using a specific mapping
ϕ : x ↦ ϕ(x)
where ϕ(x) is a vector of higher dimensionality than x.
Support Vector Machine: Dealing with Inherent Non-Linearity
For the earlier non-linear data, it can be
$\phi([q, p]) = (q^2, \sqrt{2}qp, p^2)$
Support Vector Machine: Dealing with Inherent Non-Linearity

It is not known which ϕ will work for the given data.


Support Vector Machine: Dealing with Inherent Non-Linearity

The remaining question is how to use kernel functions (or,
simply, kernels) to efficiently work in higher-dimensional
spaces without doing this transformation explicitly.
Support Vector Machine: for Inherent Non-Linearity
Optimisation algorithm for SVM to find w and b

The above optimisation is done by the method of
Lagrange multipliers.

This is a convex quadratic optimisation problem, solvable
by quadratic programming algorithms.
Support Vector Machine: for Inherent Non-Linearity

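In the dual formulation obtained with Lagrange multipliers $\alpha_i$,

$\max_{\alpha_1, \dots, \alpha_N} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{N} y_i \alpha_i (x_i x_k) y_k \alpha_k$
subject to $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $\alpha_i \ge 0$, $i = 1, \dots, N$,

the term $x_i x_k$ is the place where the feature vectors are
used.
In order to transform the vector space into a higher
dimensional space, we would transform $x_i$ into $\phi(x_i)$ and
$x_k$ into $\phi(x_k)$ and then multiply them, which is
computationally costly.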
Support Vector Machine: for Inherent Non-Linearity
By using the kernel trick, we can get rid of a costly
transformation of original feature vectors into
higher-dimensional vectors and avoid computing their
dot-product.

We replace that by a simple operation on the original
feature vectors that gives the same result.
Support Vector Machine: for Inherent Non-Linearity
Example of kernel trick:
Instead of transforming $(q_1, p_1)$ into $(q_1^2, \sqrt{2} q_1 p_1, p_1^2)$ and
$(q_2, p_2)$ into $(q_2^2, \sqrt{2} q_2 p_2, p_2^2)$ and computing their
dot-product to get $q_1^2 q_2^2 + 2 q_1 q_2 p_1 p_2 + p_1^2 p_2^2$:
1. Take the dot-product of $(q_1, p_1)$ and $(q_2, p_2)$ to get
$q_1 q_2 + p_1 p_2$.
2. Then square it to get $q_1^2 q_2^2 + 2 q_1 q_2 p_1 p_2 + p_1^2 p_2^2$.
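The identity in this example can be checked numerically. The sketch below assumes the mapping (q, p) ↦ (q², √2·q·p, p²); the function names and the sample points are illustrative.

```python
import math

def phi(q, p):
    """Explicit mapping of a 2-D point into 3-D: (q^2, sqrt(2) q p, p^2)."""
    return (q * q, math.sqrt(2) * q * p, p * p)

def explicit_dot(a, b):
    """Dot-product computed in the transformed 3-D space (the costly route)."""
    return sum(u * v for u, v in zip(phi(*a), phi(*b)))

def quadratic_kernel(a, b):
    """Same value computed in the original space: (a . b)^2."""
    return (a[0] * b[0] + a[1] * b[1]) ** 2

x1, x2 = (1.0, 2.0), (3.0, 4.0)
via_mapping = explicit_dot(x1, x2)
via_kernel = quadratic_kernel(x1, x2)
```

Both routes give the same number, but the kernel route never builds the three-dimensional vectors.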
k-Nearest Neighbors kNN
A non-parametric learning algorithm.

Most other learning algorithms allow the training data to be
discarded once the model is built.

In kNN, all training examples are kept in memory.


k-Nearest Neighbors kNN
Once a new, previously unseen example x comes in,
the kNN algorithm finds the k training examples closest to x
and returns the majority label (in case of classification)
or the average label (in case of regression).
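A minimal sketch of kNN for both classification and regression, assuming Euclidean distance and training data stored as (feature vector, label) pairs. The function names and toy data are illustrative assumptions.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def k_nearest(train, x, k):
    """Return the k training pairs whose feature vectors are closest to x."""
    return sorted(train, key=lambda ex: euclidean(ex[0], x))[:k]

def knn_classify(train, x, k):
    """Majority label among the k nearest training examples."""
    labels = [label for _, label in k_nearest(train, x, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, x, k):
    """Average target among the k nearest training examples."""
    targets = [target for _, target in k_nearest(train, x, k)]
    return sum(targets) / len(targets)

train = [((0.0, 0.0), "neg"), ((0.0, 1.0), "neg"), ((1.0, 0.0), "neg"),
         ((0.4, 0.4), "pos"), ((3.0, 3.0), "pos")]
query = (0.5, 0.5)
label_k1 = knn_classify(train, query, 1)
label_k5 = knn_classify(train, query, 5)
```

With this data the single nearest neighbour is positive, while the majority of all five neighbours is negative, so the prediction depends on k.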
k-Nearest Neighbors kNN
The closeness of two points is given by a distance
function.

For example, Euclidean distance is frequently used in
practice.
k-Nearest Neighbors (kNN)
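Another popular choice of the distance function is the
negative cosine similarity. Cosine similarity is defined as

$s(x_i, x_k) = \cos(\angle(x_i, x_k)) = \frac{\sum_{j=1}^{D} x_i^{(j)} x_k^{(j)}}{\sqrt{\sum_{j=1}^{D} (x_i^{(j)})^2} \sqrt{\sum_{j=1}^{D} (x_k^{(j)})^2}}$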
k-Nearest Neighbors (kNN)
It is a measure of similarity of the directions of two
vectors.
If the angle between two vectors is 0 degrees, then the two
vectors point in the same direction, and cosine similarity
is equal to 1. If the vectors are orthogonal, the cosine
similarity is 0.
For vectors pointing in opposite directions, the cosine
similarity is -1. If we want to use cosine similarity as a
distance metric, we need to multiply it by -1.
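The three cases above can be checked with a short sketch. The function names are illustrative, and the vectors are assumed to be non-zero so that the norms are not zero.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = math.sqrt(sum(u * u for u in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    return dot / (norm_a * norm_b)

def negative_cosine_distance(a, b):
    """Cosine similarity multiplied by -1, so smaller means more similar."""
    return -cosine_similarity(a, b)

same = cosine_similarity((1.0, 0.0), (2.0, 0.0))         # same direction
orthogonal = cosine_similarity((1.0, 0.0), (0.0, 3.0))   # right angle
opposite = cosine_similarity((1.0, 0.0), (-1.0, 0.0))    # opposite direction
```

After negation, vectors pointing in the same direction get the smallest value, which is what a nearest-neighbour search needs.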
k-Nearest Neighbors (kNN)
The choice of the distance metric, as well as the value
for k, are the choices the analyst makes before running
the algorithm.

So these are hyperparameters.

The distance metric could also be learned from data.


k-Nearest Neighbors (kNN)
A relook:
kNN assumes all instances correspond to points in the
n-dimensional space $\mathbb{R}^n$.
Let an arbitrary instance x be described by the feature
vector
$\langle a_1(x), a_2(x), \dots, a_n(x) \rangle$
where $a_r(x)$ is the value of the r-th attribute of instance x.
k-Nearest Neighbors (kNN)
Then the (Euclidean) distance between two instances $x_i$
and $x_j$ is defined to be $d(x_i, x_j)$, where

$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} (a_r(x_i) - a_r(x_j))^2}$
k-Nearest Neighbors (kNN)
In nearest-neighbor learning the target function may be
either discrete-valued or real-valued.

Let us first consider learning discrete-valued target
functions of the form $f : \mathbb{R}^n \to V$, where V is the finite
set $\{v_1, v_2, \dots, v_s\}$.

The value $\hat{f}(x_q)$ returned by this algorithm as its estimate
of $f(x_q)$ is just the most common value of f among the k
training examples nearest to $x_q$.
k-Nearest Neighbors (kNN)
If we choose k = 1, then the 1-NEAREST NEIGHBOR
algorithm assigns to $\hat{f}(x_q)$ the value $f(x_i)$, where $x_i$ is
the training instance nearest to $x_q$.

For larger values of k, the algorithm assigns the most
common value among the k nearest training examples.
k-Nearest Neighbors (kNN)

Left: A set of positive and negative training examples


along with the query xq.
k-Nearest Neighbors (kNN)

The 1-NEAREST-NEIGHBOR algorithm classifies $x_q$ as positive,
whereas 5-NEAREST-NEIGHBOR classifies it as negative.
k-Nearest Neighbors (kNN)

Right: the decision surface induced by the
1-NEAREST-NEIGHBOR algorithm for a set of training examples.
k-Nearest Neighbors (kNN)

Convex polygon surrounding each training example: the
region of instance space closest to that training example.
k-Nearest Neighbors (kNN)
The k-NEAREST-NEIGHBOR algorithm is easily adapted
to approximating continuous-valued target functions.
To accomplish this, we have the algorithm calculate the
mean value of the k nearest training examples rather
than calculate their most common value.
Distance-Weighted NEAREST NEIGHBOR Algorithm
Weight the contribution of each of the k neighbors
according to their distance to the query point xq giving
greater weight to closer neighbors.
Distance-Weighted NEAREST NEIGHBOR Algorithm

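Distance-weight the instances for real-valued target
functions as

$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$, where $w_i = \frac{1}{d(x_q, x_i)^2}$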
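A minimal sketch of distance-weighted kNN for real-valued targets, assuming each neighbour is weighted by the inverse square of its Euclidean distance to the query and the prediction is the weighted average of the neighbours' targets. Returning the stored target when the query coincides with a training point is an added assumption to avoid division by zero. The names and toy data are illustrative.

```python
import math

def distance_weighted_knn(train, xq, k):
    """Weighted average of the k nearest targets, with weights 1 / d^2."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    neighbors = sorted(train, key=lambda ex: dist(ex[0], xq))[:k]
    numerator = denominator = 0.0
    for x, fx in neighbors:
        d = dist(x, xq)
        if d == 0.0:
            return fx  # assumption: an exact match returns its own target
        weight = 1.0 / d ** 2
        numerator += weight * fx
        denominator += weight
    return numerator / denominator

train = [((0.0,), 0.0), ((1.0,), 10.0), ((4.0,), 40.0)]
estimate = distance_weighted_knn(train, (1.5,), 2)
```

Here the neighbour at distance 0.5 receives nine times the weight of the one at distance 1.5, so the estimate is pulled strongly towards the closer target.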
Distance-Weighted NEAREST NEIGHBOR Algorithm
Note that all of the above variants of the k-NEAREST
NEIGHBOR algorithm consider only the k nearest
neighbors to classify the query point.

Once we add distance weighting, there is really no harm
in allowing all training examples to have an influence on
the classification of $x_q$, because very distant examples
will have very little effect on $\hat{f}(x_q)$. The only
disadvantage is that the classifier will slow down.
Remarks on kNN
The distance-weighted k-NEAREST NEIGHBOR algorithm
is a highly effective inductive inference method for many
practical problems.
It is robust to noisy training data and quite effective
when it is provided a sufficiently large set of training
data.
Note that by taking the weighted average of the k
neighbors nearest to the query point, it can smooth out
the impact of isolated noisy training examples.
Remarks on kNN
What is the inductive bias of k-NEAREST NEIGHBOR?

The inductive bias corresponds to an assumption that the
classification of an instance $x_q$ will be most similar to the
classification of other instances that are nearby in
Euclidean distance.
Remarks on kNN
One practical issue in applying k-NEAREST NEIGHBOR
algorithms is that the distance between instances is
calculated based on all attributes of the instance.

This is in contrast to rule and decision tree learning
systems, which select only a subset of the instance
attributes when forming the hypothesis.
Remarks on kNN
Effect:
Consider applying k-NEAREST NEIGHBOR to a problem in
which each instance is described by 20 attributes, but
where only 2 of these attributes are relevant to
determining the classi cation for the particular target
function.
In this case, instances that have identical values for the
2 relevant attributes may nevertheless be distant from
one another in the 20-dimensional instance space.
fi
Remarks on kNN
As a result, the similarity metric used by k-NEAREST
NEIGHBOR—depending on all 20 attributes will be
misleading.
The distance between neighbors will be dominated by
the large number of irrelevant attributes. This dif culty,
which arises when many irrelevant attributes are present,
is sometimes referred to as the curse of dimensionality.
Nearest-neighbor approaches are especially sensitive to
this problem.
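The effect can be made concrete with a small, hand-built example. The three instances below are hypothetical: `a` and `b` agree on the 2 relevant attributes but differ on the 18 irrelevant ones, while `c` differs from `a` on the relevant attributes and matches it on the irrelevant ones.

```python
import math

# First 2 attributes are relevant; the remaining 18 are irrelevant.
a = [0.3, 0.7] + [0.0] * 18
b = [0.3, 0.7] + [0.5] * 18   # same relevant values as a
c = [0.9, 0.1] + [0.0] * 18   # different relevant values

d_ab_relevant = math.dist(a[:2], b[:2])  # distance using relevant attributes only
d_ab = math.dist(a, b)                   # distance over all 20 attributes
d_ac = math.dist(a, c)

print(d_ab_relevant, d_ab, d_ac)
```

Measured over all 20 attributes, `c` is closer to `a` than `b` is, even though `b` is identical to `a` on everything that matters for the target function.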

One interesting approach to overcoming this problem is
to weight each attribute differently when calculating the
distance between two instances.

This corresponds to stretching the axes in the Euclidean
space, shortening the axes that correspond to less
relevant attributes, and lengthening the axes that
correspond to more relevant attributes.

The amount by which each axis should be stretched can
be determined automatically using a cross-validation
approach.

First, note that we wish to stretch (multiply) the jth
axis by some factor zj, where the values z1, …, zn are
chosen to minimize the true classification error of the
learning algorithm.

Second, note that this true error can be estimated
using cross-validation.

Hence, one algorithm is to select a random subset of
the available data to use as training examples, then
determine the values of z1, …, zn that lead to the
minimum error in classifying the remaining examples.

By repeating this process multiple times, the estimate
for these weighting factors can be made more accurate.

An even more drastic alternative is to completely
eliminate the least relevant attributes from the instance
space. This is equivalent to setting some of the zi scaling
factors to zero.

A useful variant is leave-one-out cross-validation, in
which the set of m training instances is repeatedly divided
into a training set of size m − 1 and a test set of size 1,
in all possible ways.
This leave-one-out approach is easily implemented in
k-NEAREST NEIGHBOR algorithms because no additional
training effort is required each time the training set is
redefined.
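The following sketch combines the two ideas above: it scores a candidate vector of scaling factors z by the leave-one-out error of k-NEAREST NEIGHBOR and picks the best candidate. It is an illustration under assumptions: the helper names, the tiny data set, and the exhaustive search over three hand-picked candidates are all hypothetical, and leave-one-out is used here in place of the random-subset procedure.

```python
import math

def scaled_distance(a, b, z):
    # Stretch axis j by the factor z[j] before measuring Euclidean distance.
    return math.sqrt(sum((zj * (aj - bj)) ** 2 for aj, bj, zj in zip(a, b, z)))

def loo_error(X, y, z, k=1):
    """Leave-one-out error rate of k-NN classification under axis scaling z."""
    errors = 0
    for i in range(len(X)):
        # "Retraining" on the other m-1 examples costs nothing: skip example i.
        others = [j for j in range(len(X)) if j != i]
        others.sort(key=lambda j: scaled_distance(X[i], X[j], z))
        votes = [y[j] for j in others[:k]]
        prediction = max(set(votes), key=votes.count)
        if prediction != y[i]:
            errors += 1
    return errors / len(X)

# Attribute 0 determines the class; attribute 1 is irrelevant but has a
# much larger spread, so it dominates the unscaled distance.
X = [[0.0, 0.0], [0.1, 5.0], [0.2, 10.0], [1.0, 0.2], [1.1, 5.1], [1.2, 10.1]]
y = [0, 0, 0, 1, 1, 1]

candidates = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
best_z = min(candidates, key=lambda z: loo_error(X, y, z))
print(best_z, loo_error(X, y, best_z))
```

Setting the second scaling factor to zero corresponds to eliminating the irrelevant attribute, and on this toy data it removes every leave-one-out error.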
Because this algorithm delays all processing until a new
query is received, significant computation can be required
to process each new query.

Various methods have been developed for indexing the
stored training examples so that the nearest neighbors
can be identified more efficiently at some additional cost
in memory.

One such indexing method is the kd-tree (Bentley
1975; Friedman et al. 1977), in which instances are
stored at the leaves of a tree, with nearby instances
stored at the same or nearby nodes.

The internal nodes of the tree sort the new query xq to
the relevant leaf by testing selected attributes of xq.
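A compact kd-tree sketch is given below. It is a simplified variant, not the exact structure described above: for brevity it stores one instance at every node rather than only at the leaves, cycles through the attributes in order, and splits at the median. The dictionary-based node layout and function names are hypothetical, and `math.dist` requires Python 3.8+.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split on one attribute per level, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],          # median instance along this axis
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, x_q, best=None):
    """Return (distance, point) for the stored point closest to x_q."""
    if node is None:
        return best
    d = math.dist(node["point"], x_q)
    if best is None or d < best[0]:
        best = (d, node["point"])
    # Test the node's attribute to decide which subtree holds the query.
    diff = x_q[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, x_q, best)
    # Visit the other side only if the splitting plane is closer than the
    # best match found so far.
    if abs(diff) < best[0]:
        best = nearest(far, x_q, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
dist, match = nearest(tree, (9, 2))
print(match, dist)
```

The pruning test is what saves work: whole subtrees are skipped whenever they cannot contain a point closer than the current best match.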
Locally weighted regression
The nearest-neighbor approaches discussed so far can be
thought of as approximating the target function f(x) at
the single query point x = xq.

Locally weighted regression is a generalization of this
approach.
It constructs an explicit approximation to f over a local
region surrounding xq.

Locally weighted regression uses nearby or
distance-weighted training examples to form this local
approximation to f.

For example, we might approximate the target function
in the neighborhood surrounding xq using a linear function,
a quadratic function, a multilayer neural network, or
some other functional form.

The phrase "locally weighted regression" is called
local because the function is approximated based only on
data near the query point,
weighted because the contribution of each training
example is weighted by its distance from the query point,
and regression because this is the term used widely in
the statistical learning community for the problem of
approximating real-valued functions.

Given a new query instance xq, the general approach in
locally weighted regression is to construct an
approximation f̂ that fits the training examples in the
neighborhood surrounding xq.
This approximation is then used to calculate the value
f̂(xq), which is output as the estimated target value for
the query instance.

The description of f̂ may then be deleted, because a
different local approximation will be calculated for each
distinct query instance.
Locally weighted linear regression
Let us consider the case of locally weighted regression
in which the target function f is approximated near xq
using a linear function of the form

$$\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$$

where a_i(x) denotes the value of the ith attribute of the
instance x.

Gradient descent can be used to find the coefficients
w0, …, wn that minimize the error in fitting such linear
functions to a given set of training examples.

Choosing weights that minimize the squared error
summed over the set D of training examples,

$$E \equiv \frac{1}{2} \sum_{x \in D} \left( f(x) - \hat{f}(x) \right)^2$$

leads to the gradient descent training rule

$$\Delta w_j = \eta \sum_{x \in D} \left( f(x) - \hat{f}(x) \right) a_j(x)$$

where η is a constant learning rate.


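To fit a local rather than a global approximation, the
error criterion can be redefined in three possible ways:

1. Minimize the squared error over just the k nearest
neighbors:
$$E_1(x_q) \equiv \frac{1}{2} \sum_{x \,\in\, k \text{ nearest nbrs of } x_q} \left( f(x) - \hat{f}(x) \right)^2$$

2. Minimize the squared error over the entire set D of
training examples, weighting the error of each example by
some decreasing function K of its distance from xq:
$$E_2(x_q) \equiv \frac{1}{2} \sum_{x \in D} \left( f(x) - \hat{f}(x) \right)^2 K(d(x_q, x))$$

3. Combine 1 and 2:
$$E_3(x_q) \equiv \frac{1}{2} \sum_{x \,\in\, k \text{ nearest nbrs of } x_q} \left( f(x) - \hat{f}(x) \right)^2 K(d(x_q, x))$$

Gradient descent rule rederived for 3:

$$\Delta w_j = \eta \sum_{x \,\in\, k \text{ nearest nbrs of } x_q} K(d(x_q, x)) \left( f(x) - \hat{f}(x) \right) a_j(x)$$

Note that the contribution of instance x to the weight
update is now multiplied by the distance penalty
K(d(xq, x)), and that the error is summed over only the
k nearest training examples.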
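The locally weighted update can be sketched as follows. This is a minimal illustration under stated assumptions: the Gaussian form of the kernel K, the function names, and the values of `eta`, `steps`, and `sigma` are hypothetical choices tuned for the toy data below, not recommended defaults.

```python
import math

def gaussian_kernel(d, sigma=1.0):
    # Assumed distance penalty K: decreases smoothly as d grows.
    return math.exp(-d * d / (2.0 * sigma * sigma))

def lwr_predict(X, y, x_q, k=4, eta=0.05, steps=5000, sigma=1.0):
    """Fit a local linear model around x_q by gradient descent, then evaluate it."""
    # Keep only the k training examples nearest to the query.
    order = sorted(range(len(X)), key=lambda i: math.dist(X[i], x_q))[:k]
    kern = {i: gaussian_kernel(math.dist(X[i], x_q), sigma) for i in order}
    n = len(x_q)
    w = [0.0] * (n + 1)   # w[0] is the intercept w_0

    def f_hat(x):
        return w[0] + sum(wj * aj for wj, aj in zip(w[1:], x))

    for _ in range(steps):
        grad = [0.0] * (n + 1)
        for i in order:
            # Kernel-weighted residual for this neighbour.
            err = kern[i] * (y[i] - f_hat(X[i]))
            grad[0] += err                # a_0(x) = 1 for the intercept
            for j in range(n):
                grad[j + 1] += err * X[i][j]
        # Batch update: all weights move after the full pass over neighbours.
        for j in range(n + 1):
            w[j] += eta * grad[j]
    # The local model is discarded after answering this one query.
    return f_hat(x_q)

# Toy data drawn from f(x) = 2x + 1, so a local linear fit can be exact.
X = [[float(v)] for v in range(6)]
y = [2.0 * v + 1.0 for v in range(6)]
prediction = lwr_predict(X, y, [2.5])
print(prediction)
```

A fixed learning rate only converges when it is small relative to the scale of the attributes, so in practice `eta` would need to be chosen per data set or the inputs rescaled first.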
