Lecture 14

The document discusses linear discriminant functions and the perceptron as an early classifier, explaining its learning algorithm and geometric view. It emphasizes the relationship between regression and classification, highlighting the goal of learning a function that captures the relationship between input and output. The document also covers linear regression, the criterion function for optimization, and the linear least squares method for finding optimal parameters.

Recap

• We have been considering linear discriminant functions.
• Such a linear classifier is given by

      h(X) = 1 if Σ_{i=1}^{d′} w_i φ_i(X) + w_0 > 0
           = 0 otherwise

  where the φ_i are fixed functions.
• We have been considering the case φ_i(X) = x_i for simplicity.
Perceptron

• The perceptron is the earliest such classifier.
• Assuming an augmented feature vector, h(X) = sgn(W^T X).
• In words: 'find the weighted sum and threshold it'.


Perceptron Learning Algorithm

• A simple iterative algorithm.
• In each iteration, we locally try to correct errors (a code sketch of the rule follows below).

  Let ΔW(k) = W(k+1) − W(k). Then

      ΔW(k) =  0      if W(k)^T X(k) > 0 and y(k) = 1, or
                         W(k)^T X(k) < 0 and y(k) = 0
            =  X(k)   if W(k)^T X(k) ≤ 0 and y(k) = 1
            = −X(k)   if W(k)^T X(k) ≥ 0 and y(k) = 0
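
A minimal sketch of this update rule in NumPy (an illustration, not part of the lecture; it assumes labels in {0, 1}, already-augmented feature vectors, and a cap on the number of passes over the data):

    import numpy as np

    def perceptron_train(X, y, max_epochs=100):
        """Perceptron learning rule: add X(k) for a misclassified positive
        pattern, subtract X(k) for a misclassified negative pattern.
        X is n x (d+1) with the constant feature appended; y is in {0, 1}."""
        W = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            changed = False
            for k in range(X.shape[0]):
                s = W @ X[k]
                if y[k] == 1 and s <= 0:      # positive pattern on wrong side
                    W = W + X[k]
                    changed = True
                elif y[k] == 0 and s >= 0:    # negative pattern on wrong side
                    W = W - X[k]
                    changed = True
            if not changed:                   # no corrections in a full pass: done
                break
        return W

If the training set is linearly separable, the loop above stops after finitely many corrections, which is exactly the convergence result recalled below.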


Perceptron: Geometric view

• The algorithm has a simple geometric view. Suppose W(k) misclassifies a pattern X(k). The correction adds X(k) to the weight vector (for a positive pattern) or subtracts it (for a negative pattern), moving W towards a misclassified positive pattern and away from a misclassified negative pattern, so that W(k)^T X(k) moves towards the correct sign.
• We showed that if the training set is linearly separable, then the algorithm finds a separating hyperplane in finitely many iterations.
• We also saw the 'batch' version of the algorithm. It can be shown to be gradient descent on a reasonable cost function.


Perceptron

• A simple 'device': weighted sum and threshold.
• A simple learning machine (a neuron model).


• Perceptron is an interesting algorithm to learn linear classifiers.
• It works only when the data are linearly separable.
• In general, it is not possible to know beforehand whether the data are linearly separable.
• We next look at other linear methods in classification and regression.


Regression Problems

• Recall that the regression or function learning problem is closely related to learning classifiers.
• The training set would be {(X_i, y_i), i = 1, ..., n} with X_i ∈ ℜ^d, y_i ∈ ℜ, ∀i.
• The main difference is that the 'target' or 'output' y_i is continuous-valued in a regression problem, while it can take only finitely many distinct values for a classifier.


• In a regression problem, the goal is to learn a function f : ℜ^d → ℜ that captures the relationship between X and y. We write ŷ = f(X).
• Note that any such function can also be viewed as a classifier: we can take h(X) = sgn(f(X)) as the classifier.
• We search over a suitably parameterized class of functions to find the best one.
• Once again, the problem is that of learning the best parameters.


Linear Regression

• We will now consider learning a linear function

      f(X) = Σ_{i=1}^{d} w_i x_i + w_0

  where W = (w_1, ..., w_d)^T ∈ ℜ^d and w_0 ∈ ℜ are the parameters.
• Thus a linear model can be expressed as f(X) = W^T X + w_0.
• As earlier, by using an augmented vector X, we can write this as f(X) = W^T X (see the short sketch below).
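
As a small illustration of the augmentation step (a sketch only; the data are hypothetical):

    import numpy as np

    # Hypothetical data: n = 100 samples with d = 3 features each.
    X = np.random.randn(100, 3)

    # Prepend a constant feature of 1 so that w_0 becomes one more weight.
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # n x (d+1)

    # With augmented vectors, f(X) = W^T X with W in R^(d+1).
    W = np.zeros(X_aug.shape[1])
    y_hat = X_aug @ W
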
• Now, to learn the 'optimal' W, we need a criterion function.
• The criterion function assigns a figure of merit or cost to each W ∈ ℜ^(d+1).
• Then the optimal W would be the one that optimizes the criterion function.
• A criterion function that is most often used is the sum of squares of errors.


Linear Least Squares Regression

• We want to find a W such that ŷ(X) = f(X) = W^T X is a good fit for the training data.
• Consider a function J : ℜ^(d+1) → ℜ defined by

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)²

• We take the 'optimal' W to be the minimizer of J(·).
• This is known as the linear least squares method.


• We want to find W to minimize

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)²

• If we are learning a classifier we can have y_i ∈ {−1, +1}.
• Note that finally we would use the sign of W^T X as the classifier output.
• Thus minimizing J is a good way to learn linear discriminant functions also.

• We want to find the minimizer of

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)²

• This is a quadratic function, so we can find the minimizer analytically.
• For this we rewrite J(W) in a more convenient form.


• Recall that we take all vectors to be column vectors.
• Hence each training sample X_i is a (d+1) × 1 matrix.
• Let A be the matrix given by

      A = [X_1 · · · X_n]^T

• A is an n × (d+1) matrix whose ith row is X_i^T.
• Hence AW is an n × 1 vector whose ith element is X_i^T W.


• Let Y be the n × 1 vector whose ith element is y_i.
• Hence AW − Y is an n × 1 vector whose ith element is (X_i^T W − y_i).
• Hence we have

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)² = (1/2) (AW − Y)^T (AW − Y)

  (a small numerical check of this identity follows below).
• To find the minimizer of J(·), we need to equate its gradient to zero.
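
The sketch below builds A and Y for hypothetical data and checks that the sum-of-squares form and the matrix form of J(W) agree (an illustration only):

    import numpy as np

    # Hypothetical data: n = 5 augmented samples (leading column of ones).
    A = np.hstack([np.ones((5, 1)), np.random.randn(5, 2)])   # n x (d+1)
    Y = np.random.randn(5)
    W = np.random.randn(3)

    # Sum-of-squares form and matrix form of J(W).
    J_sum = 0.5 * sum((A[i] @ W - Y[i]) ** 2 for i in range(len(Y)))
    r = A @ W - Y
    J_mat = 0.5 * (r @ r)
    assert np.isclose(J_sum, J_mat)   # the two expressions are equal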


• We have

      ∇J(W) = A^T (AW − Y)

• Equating the gradient to zero, we get

      (A^T A) W = A^T Y

• The optimal W satisfies this system of linear equations (called the normal equations).


• A^T A is a (d+1) × (d+1) matrix.
• A^T A is invertible if A has linearly independent columns. (This is because the null space of A is the same as the null space of A^T A.)
• The rows of A are the training samples X_i.
• Hence the jth column of A gives the values of the jth feature in all the examples.


• Hence the columns of A are linearly independent if no feature can be obtained as a linear combination of the other features.
• If we assume the features are linearly independent, then A would have linearly independent columns and hence A^T A would be invertible.
• This is a reasonable assumption.


• The optimal W is a solution of (A^T A) W = A^T Y.
• When A^T A is invertible, we get the optimal W as

      W* = (A^T A)^(−1) A^T Y = A† Y

  where A† = (A^T A)^(−1) A^T is called the generalized inverse of A.
• The above W* is the linear least squares solution for our regression (or classification) problem (a code sketch follows below).
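
A sketch of computing this solution in NumPy (illustrative data; the normal equations are solved directly because that mirrors the derivation above, though a routine such as np.linalg.lstsq is numerically more robust):

    import numpy as np

    def least_squares_fit(A, Y):
        """Solve (A^T A) W = A^T Y, i.e. W* = (A^T A)^(-1) A^T Y."""
        return np.linalg.solve(A.T @ A, A.T @ Y)

    # Hypothetical regression data: 50 augmented samples, 3 original features.
    A = np.hstack([np.ones((50, 1)), np.random.randn(50, 3)])
    true_W = np.array([1.0, 2.0, -1.0, 0.5])
    Y = A @ true_W + 0.01 * np.random.randn(50)

    W_star = least_squares_fit(A, Y)          # close to true_W
    # For classification with y_i in {-1, +1}, the prediction on a new
    # augmented sample x would be sign(x @ W_star).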


Geometry of Least Squares

• Our least squares method seeks to find a W to minimize ||AW − Y||².
• A is an n × (d+1) matrix and normally n >> d.
• Consider the (over-determined) system of linear equations AW = Y.
• The system may or may not be consistent. But we seek to find W* to minimize the squared error.
• As we saw, the solution is W* = A† Y, and hence the name generalized inverse for A†.


• The least squares method is trying to find a 'best-fit' W for the system AW = Y.
• Let C_0, C_1, ..., C_d be the columns of A.
• Then AW = w_0 C_0 + w_1 C_1 + · · · + w_d C_d.
• Thus, for any W, AW is a linear combination of the columns of A.
• Hence, if Y is in the space spanned by the columns of A, there is an exact solution.


• Otherwise, we want the projection of Y onto the column space of A.
• That is, we want to find a vector Z in the column space of A that is closest to Y.
• Any vector in the column space of A can be written as Z = AW for some W.
• Hence we want to find Z to minimize ||Z − Y||² subject to the constraint that Z = AW for some W.
• That is the least squares solution (the sketch below illustrates this projection property).
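
The projection view can be checked numerically: the residual AW* − Y is orthogonal to every column of A, which is just the normal equations restated (hypothetical data, illustration only):

    import numpy as np

    A = np.hstack([np.ones((20, 1)), np.random.randn(20, 2)])   # n x (d+1)
    Y = np.random.randn(20)

    W_star = np.linalg.solve(A.T @ A, A.T @ Y)   # least squares solution
    residual = A @ W_star - Y                    # A W* is the projection of Y

    # A^T (A W* - Y) = 0 up to round-off: the residual is orthogonal
    # to the column space of A.
    print(np.allclose(A.T @ residual, 0.0))      # True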


• Let us now take the original (not augmented) data vectors and write our model as ŷ(X) = f(X) = W^T X + w_0, where now W ∈ ℜ^d.
• Now we have

      J(W) = (1/2) Σ_{i=1}^{n} (W^T X_i + w_0 − y_i)²

• For any given W we can find the best w_0 by equating the partial derivative to zero. We have

      ∂J/∂w_0 = Σ_{i=1}^{n} (W^T X_i + w_0 − y_i)

  Equating the partial derivative to zero, we get

      Σ_{i=1}^{n} (W^T X_i + w_0 − y_i) = 0
      ⇒  n w_0 + W^T Σ_{i=1}^{n} X_i = Σ_{i=1}^{n} y_i


This gives us

      w_0 = (1/n) Σ_{i=1}^{n} y_i − W^T ( (1/n) Σ_{i=1}^{n} X_i )

• Thus, w_0 accounts for the difference between the average of W^T X and the average of y (see the sketch below).
• So, w_0 is often called the bias term.
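
A sketch of this bias formula on synthetic data (the weights, the true bias of 3.0, and the noise level are all made up for illustration):

    import numpy as np

    # Hypothetical data with a known bias of 3.0.
    X = np.random.randn(100, 3)                            # n x d, not augmented
    y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.01 * np.random.randn(100)

    W = np.array([1.0, -2.0, 0.5])        # assume W has been obtained already
    w0 = y.mean() - W @ X.mean(axis=0)    # w_0 = average(y) - W^T average(X)
    print(w0)                             # close to 3.0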


• We have taken our linear model to be

      ŷ(X) = f(X) = Σ_{j=0}^{d} w_j x_j

• As mentioned earlier, we could instead choose any fixed set of basis functions φ_j.
• Then the model would be

      ŷ(X) = f(X) = Σ_{j=0}^{d′} w_j φ_j(X)

• We can use the same criterion of minimizing the sum of squares of errors:

      J(W) = (1/2) Σ_{i=1}^{n} (W^T Φ(X_i) − y_i)²

  where Φ(X_i) = (φ_0(X_i), ..., φ_d′(X_i))^T.
• We want the minimizer of J(·).


• We can learn W using the same method as earlier.
• Thus, we will again have

      W* = (A^T A)^(−1) A^T Y

• The only difference is that now the ith row of the matrix A would be

      [φ_0(X_i)  φ_1(X_i)  · · ·  φ_d′(X_i)]


• As an example, let d = 1 (so X_i, y_i ∈ ℜ).
• Take φ_j(X) = X^j, j = 0, 1, ..., m.
• Now the model is

      ŷ(X) = f(X) = w_0 + w_1 X + w_2 X² + · · · + w_m X^m

• The model says that y is an mth degree polynomial in X.
• All such problems are tackled in a uniform fashion using the least squares method we presented (a polynomial-fit sketch follows below).
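
A sketch of such a polynomial fit as linear least squares (the degree, data, and noise level are illustrative choices):

    import numpy as np

    # Hypothetical scalar data (d = 1) and an illustrative degree m.
    m = 3
    x = np.linspace(-1.0, 1.0, 40)
    y = 1.0 + 2.0 * x - 3.0 * x**3 + 0.05 * np.random.randn(x.size)

    # Row i of A is [x_i^0, x_i^1, ..., x_i^m], i.e. phi_j(x) = x^j.
    A = np.vander(x, m + 1, increasing=True)        # n x (m+1)

    W_star = np.linalg.solve(A.T @ A, A.T @ y)      # same normal equations
    y_hat = A @ W_star                              # fitted polynomial values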


LMS algorithm

• We are finding the W* that minimizes

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)²

• We could have found the minimizer through an iterative scheme using gradient descent.
• The gradient of J is given by

      ∇J(W) = Σ_{i=1}^{n} X_i (X_i^T W − y_i)


• The iterative gradient descent scheme would be

      W(k+1) = W(k) − η Σ_{i=1}^{n} X_i (X_i^T W(k) − y_i)

• In analogy with what we saw in the Perceptron algorithm, this can be viewed as a 'batch' version.
• We use the current W to find the errors on all the training data and then apply all the 'corrections' together.
• We can instead have an incremental version of this algorithm (a sketch of the batch scheme is given below).
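
A sketch of the batch scheme (the step size η and the iteration count are illustrative choices, not values from the lecture; too large a step size makes the iteration diverge):

    import numpy as np

    def batch_gradient_descent(A, Y, eta=0.01, iters=500):
        """Batch gradient descent on J(W) = 0.5 * ||AW - Y||^2.
        A is n x (d+1) with augmented samples as rows, Y the target vector."""
        W = np.zeros(A.shape[1])
        for _ in range(iters):
            grad = A.T @ (A @ W - Y)     # sum_i X_i (X_i^T W - y_i)
            W = W - eta * grad
        return W
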
• For the incremental version, at each iteration we pick one of the training samples. Call this X(k).
• The error on this sample would be (1/2) (X(k)^T W(k) − y(k))².
• Using the gradient of only this term, we get the incremental version as

      W(k+1) = W(k) − η X(k) (X(k)^T W(k) − y(k))

• This is called the LMS algorithm.


• In the LMS algorithm, we iteratively update W as

      W(k+1) = W(k) − η X(k) (X(k)^T W(k) − y(k))

• Here (X(k), y(k)) is the training example picked at iteration k and W(k) is the weight vector at iteration k.
• We do not need to have all the training examples together with us. We can learn W from a stream of examples without needing to store them (see the sketch below).
• If η is sufficiently small, this algorithm also converges to the minimizer of J(W).
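
A sketch of LMS in a streaming setting (the data source, step size, and number of updates are hypothetical; each example is used once and never stored):

    import numpy as np

    def lms_update(W, x, y, eta=0.01):
        """One LMS step: W(k+1) = W(k) - eta * X(k) * (X(k)^T W(k) - y(k))."""
        return W - eta * x * (x @ W - y)

    # Hypothetical stream of examples; W is updated on the fly.
    W = np.zeros(4)
    for _ in range(10000):
        x = np.hstack([1.0, np.random.randn(3)])                 # augmented sample
        y = x @ np.array([3.0, 1.0, -2.0, 0.5]) + 0.01 * np.random.randn()
        W = lms_update(W, x, y)
    # After many updates, W is close to the least squares solution.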
