
CEG5301: Machine Learning with Applications

Part I: Fundamentals of Neural Networks

Lecture Two: The Perceptron

Xiang Cheng
Associate Professor
Department of Electrical & Computer Engineering
The National University of Singapore

Phone: 65166210 Office: Block E4-08-07


Email: [email protected]

What is a Neural Network (NN)?

A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use.

It employs a massive inter-connection of “simple” computing units, called neurons.

Knowledge is obtained by learning from the data/input signals presented to the network.

It is capable of organizing its structure, which consists of many neurons, to perform tasks many times faster than the fastest digital computers of today.
The artificial neural network resembles the brain in two respects:

1. How does a neural network acquire new knowledge?

Knowledge is acquired by the network through a learning process.

2. Where does the neural network store the knowledge?

Inter-neuron connection strengths, known as synaptic weights, are used to store the knowledge.
How do you implement an artificial neural network?
Artificial neural networks are either implemented on a general-purpose computer or built into dedicated hardware.

Is an artificial neural network really a parallel computing machine when it is implemented in MATLAB or Python on a PC?

The power of parallel processing can only be realized by building the ANN into a circuit!

Is there any such hardware available now?

Nowadays, field-programmable gate arrays (FPGAs) are a popular way to implement neural networks.

What is the most important keyword in neural networks?
Learning

What is learning in neural networks?


Learning is a process by which the free parameters (synaptic weights) of
a neural network are adapted through a process of stimulation by the
environment in which the network is embedded.
The type of learning is determined by the manner in which the parameter
changes take place.
Process of learning:

1. The neural network is stimulated by an environment.

2. The neural network undergoes changes in its free parameters, i.e.


synaptic weights, as a result of this stimulation.
3. The neural network responds in a new way to the environment
because of the changes that have occurred in its internal structure.

Learning with a teacher:
Supervised Learning
Process of learning:

1. The neural network is fed with an input and produces an output.

2. The teacher tells the network what the desired output should be, and an error signal is generated.

3. The weights are adjusted according to the error signal.

Example: Automatic Flying Helicopter


Learning without a teacher:
Reinforcement Learning
Process of learning:

1. The neural network interacts with the environment by taking various actions.

2. The learning system is rewarded or penalized for its actions, but no explicit error signal is provided!

3. The weights are adjusted according to the reinforcement signal.

The term is borrowed from psychology.

Real-life example? A learning robot.
Learning without a teacher:
Unsupervised or self-organized learning
Process of learning:

1. The neural network is fed with input.

2. The weights are adjusted based upon the input signals only!

How can the system adjust the weights without any error or reward signal?

It depends upon the purpose of the learning system. It can perform different tasks such as
principal components analysis and clustering.
You are going to learn all the three types of learning in this course.
Let’s start with the learning of the simplest neural network: perceptron
Perceptron—single layer neural networks
The perceptron was proposed by Frank Rosenblatt (1928-1971) at Cornell University in 1958.

The perceptron (感知器) was the first computer that could learn new skills by trial and error.
By the study of neural networks such as the Perceptron, Rosenblatt hoped that "the
fundamental laws of organization which are common to all information handling systems,
machines and men included, may eventually be understood."
Rosenblatt was a colorful character at Cornell in the early 1960s. A handsome bachelor,
he drove a classic MGA sports car and was often seen with his cat named Tobermory. He
enjoyed mixing with undergraduates, and for several years taught an interdisciplinary
undergraduate honors course entitled "Theory of Brain Mechanisms" that drew students
equally from Cornell's Engineering and Liberal Arts colleges.
The perceptron is built around the McCulloch-Pitts model of the neuron.
Goal: To correctly classify the set of externally applied stimuli x1, x2, …, xm into one of two classes, C1 and C2.

To simplify the mathematical notation, define:

the input vector:  x(n) = [+1, x1(n), x2(n), …, xm(n)]^T

the weight vector:  w(n) = [b(n), w1(n), w2(n), …, wm(n)]^T

where n denotes the iteration step.
At step n, the input vector x(n) is presented to the perceptron, how does the perceptron
assign the label to the input pattern (pattern classification)?
The process is very simple.

What is the induced local field v(n)?

v(n) = \sum_{i=1}^{m} w_i(n) x_i(n) + b(n) = \sum_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n)

What is the output of the neuron y(n)?

If v(n) = w^T(n) x(n) > 0, then y(n) = 1 (class C1).
If v(n) = w^T(n) x(n) ≤ 0, then y(n) = 0 (class C2).
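To make the decision rule concrete, here is a minimal Python/NumPy sketch (the function name and the example weight values are illustrative, not from the slides):

```python
import numpy as np

def perceptron_output(w, x):
    """Return 1 (class C1) if w^T x > 0, otherwise 0 (class C2).

    Both vectors are augmented: x[0] = +1 and w[0] = b (the bias).
    """
    v = np.dot(w, x)            # induced local field v = w^T x
    return 1 if v > 0 else 0

# Illustrative weights w = [b, w1, w2] and an augmented input [+1, x1, x2]
w = np.array([1.5, -1.0, -1.0])
x = np.array([1.0, 0.0, 1.0])
print(perceptron_output(w, x))  # -> 1, since v = 1.5 - 0 - 1 = 0.5 > 0
```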
The induced local field v(n):

v(n) = \sum_{i=1}^{m} w_i(n) x_i(n) + b(n) = \sum_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n)

What if v(n) is ZERO?


That corresponds to the decision boundary!
v(n) = w^T(n) x(n) = 0

What is the geometrical shape of the decision boundary?


If m = 1:  w1 x1 + b = 0, a point on a line

If m = 2:  w1 x1 + w2 x2 + b = 0, a line in a 2-dimensional plane

If m = 3:  w1 x1 + w2 x2 + w3 x3 + b = 0, a plane in the 3-dimensional space

How about larger m? A hyper-plane in the m-dimensional space (the input vector space)
Is it possible for the perceptron to produce a decision boundary that is a nonlinear curve?
No.
Given the synaptic weights and bias, how do you find out the
decision boundary produced by the perceptron?
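One worked step the slide leaves to the figure: set the induced local field to zero and solve for one of the coordinates. For m = 2, assuming w2 ≠ 0,

w1 x1 + w2 x2 + b = 0  ⟹  x2 = -(w1 / w2) x1 - b / w2,

a straight line with slope -w1/w2 and intercept -b/w2 in the (x1, x2) plane.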

In order for the perceptron to perform a desired task, its synaptic weights (including the bias) must be properly selected. Otherwise, it may not do the job correctly.

The Key Question: How to choose the proper weights?

This is also the “ONLY” question for machine learning!
The decision boundary produced by the perceptron:

The Key Question: How to choose the proper weights?

In general, two basic methods can be employed to select a suitable weight vector:

By off-line calculation of weights (without learning).
If the problem is relatively simple, it is often possible to calculate the weight vector from the specification of the problem.

By learning procedure.
The weight vector is determined from a given (training) set of input-output vectors (exemplars) in such a way as to achieve the best classification of the training vectors.

Let’s start with the simple method: selection of weights by off-line calculations.
Selection of weights by off-line calculations - Example
Consider the problem of building the NAND gate using a perceptron.

Truth table of NAND (x1, x2 → y): (0,0) → 1, (0,1) → 1, (1,0) → 1, (1,1) → 0. Can you formulate it as a pattern classification problem?

Three points (0,0), (0,1) and (1,0) belong to one class, and (1,1) belongs to another class.

The input patterns (vectors) belong to two classes and are marked in the input space (the (x1, x2) plane) with different symbols. The decision boundary is the straight line described by the following equation:
x2 = -x1 + 1.5, or equivalently -x1 - x2 + 1.5 = 0.
Is the decision line unique for this problem?
Let’s select w = (b, w1, w2) = (1.5, -1, -1).
For each input vector, the output signal can now be calculated, as in the sketch below.
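A quick numerical check of these weights (a minimal sketch, assuming the reconstructed weight vector w = (b, w1, w2) = (1.5, -1, -1)):

```python
import numpy as np

# NAND gate: input patterns and desired outputs
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
desired = [1, 1, 1, 0]

w = np.array([1.5, -1.0, -1.0])            # [b, w1, w2]

for (x1, x2), d in zip(patterns, desired):
    v = w @ np.array([1.0, x1, x2])        # induced local field
    y = 1 if v > 0 else 0
    print(f"x=({x1},{x2})  v={v:+.1f}  y={y}  desired={d}")
# All four outputs match the NAND truth table.
```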
Example: What if we have the following truth table (AND gate)? Now (0,0), (0,1) and (1,0) give y = 0, while (1,1) gives y = 1.

Is the decision boundary the same as that for the NAND gate? Yes.
Can we use the same weights? No. The output values of the truth table are important!
What are the proper weights? w’ = -w = (-1.5, +1, +1).

Example: Consider 2-dimensional samples (0, 0), (0, 1), (1, 0), (-1, -1) that
belong to one class, and samples (2.1, 0), (0, -2.5), (1.6, -1.6) that belong to
another class. A two-class partition of the input space and a perceptron that
separates samples of opposite classes are as follows:

Can we use (2,-1,+1) as the synaptic weights?


For a pattern recognition problem, the labels are user-defined and flexible.
For a logic gate, the outputs are fixed!

Definition: Linearly Separable

Two classes are linearly separable if they can be separated by one line (or a plane or hyper-plane in a higher-dimensional space).

Figure: (a) A pair of linearly separable patterns; (b) A pair of non-linearly separable patterns.

Two classes are linearly separable if and only if there exists a weight vector w
based on which the perceptron can correctly perform the classification.

If the decision line (hyper-plane) is provided, can you always find a perceptron to perform the task?
1. Find the equation for the hyper-plane and put it in the form w1 x1 + w2 x2 + … + wm xm + b = 0.
2. Then the corresponding perceptron can be easily constructed.

But what if the samples are NOT linearly separable (i.e. no straight line can possibly separate samples belonging to the two classes)? Can you find a perceptron to perform the task correctly?

There CANNOT be any simple perceptron that achieves the classification task.
This is the fundamental limitation of simple perceptrons (Minsky and Papert, 1969).
Example: Linearly separable?

Can they be separated by one line? No.
How about two lines? Yes.
Can a single perceptron produce a decision boundary with two lines? No.

How about the linearly separable case where the dimension of the
input is high?
Hyper-plane in m dimensions is defined by the equation

w0 + w1x1 + w2x2 + … + wmxm = 0


Dividing m-dimensional space into two regions
w0 + w1x1 + w2x2 + … + wmxm > 0 class 1
w0 + w1x1 + w2x2 +…+ wmxm < 0 class 2

For 2-dimensional input spaces, it is easy to determine by geometric construction whether two classes are linearly separable and to find a two-class classifier.

Can you do a similar thing for a 3-dimensional input space?

Still possible!

How about the linearly separable case where the dimension of the
input is high?
w0 + w1x1 + w2x2 + … + wmxm = 0

Is it easy to visualize a 4- or higher-dimensional space and draw the hyper-plane to separate the classes?
It is not so straightforward for 4-dimensional input spaces and beyond.

In real-world applications, the dimension of the input pattern is very high. For instance, the dimension of a 100x100 image is 10,000.
How to find out the decision boundary in 10000-dimensional space?

In 1958, Rosenblatt demonstrated that Perceptron can learn to do the job correctly!

Perceptron was the first computer that could learn new skills simply by trial and error.

Perceptron Learning
Now, suppose that the input variables of the perceptron originate from
TWO linearly separable classes C1 and C2.
We know that if C1 and C2 are linearly separable, there must exist at least one weight vector w0 such that
w0^T x > 0 for every input vector x belonging to class C1, and
w0^T x ≤ 0 for every input vector x belonging to class C2.

In other words, we know there is a solution. But where is it?

Training problem: Find a weight vector w such that the perceptron can
correctly classify the training set X.

But how?

It turns out that it can be found simply by trial and error!
Let’s try to figure out the learning algorithm step by step.

Feed a pattern x to the perceptron with weight vector w; it will produce a binary output y (1 or 0, i.e. fire or not fire).
First consider the case
v = w^T x ≤ 0, so y = 0.

If the correct label (all the labels of the training samples are known, i.e. there is a teacher!) is d = 0, should we update the weights?
If it ain’t broke, do not fix it.
If the desired output is d = 1, should we update the weights?
Of course. Assume the new weight vector is w’; then we have
w’ = w + Δw.

But how to choose Δw?
Feed a pattern x to the perceptron with weight vector w; it will produce a binary output y (1 or 0, i.e. fire or not fire).
First consider the case v = w^T x ≤ 0, so y = 0.

If the desired output is d = 1, we need to update the weights:
w’ = w + Δw.
But how to choose Δw?

Let’s calculate the induced local field of the updated weights: v’ = w’^T x.
Should we increase the induced local field or decrease it? Increase it:
v’ - v = w’^T x - w^T x = Δw^T x > 0.
Given a vector x, what is the simplest way to choose Δw such that Δw^T x > 0?
Δw = x.
To avoid a big jump of the weights, let’s introduce a small learning rate η:
Δw = ηx, η > 0, so Δw^T x = η x^T x > 0.
If the true label is d = 1 and the perceptron makes a mistake, its synaptic weights are adjusted by
w’ = w + ηx.
Now consider the case
v = w^T x > 0, so y = 1.
If the correct label is d = 1, should we update the weights?
No. We only adjust the weights when the perceptron makes a mistake (here, when d = 0).
Assume the new weight vector is w’; then we have
w’ = w + Δw.

But how to choose Δw?

Let’s calculate the induced local field v’ = w’^T x. Should we increase the induced local field or decrease it? Decrease it:
v’ - v = w’^T x - w^T x = Δw^T x < 0.

What is the simplest way to choose Δw such that Δw^T x < 0?

Δw = -ηx, so Δw^T x = -η x^T x < 0.

If the true label is d = 0 and the perceptron makes a mistake, its synaptic weights are adjusted by

w’ = w - ηx.
Let’s summarize what we have now.
If the true label is d = 1 and the perceptron makes a mistake, its synaptic weights are adjusted by
w’ = w + ηx.
If the true label is d = 0 and the perceptron makes a mistake, its synaptic weights are adjusted by
w’ = w - ηx.

It seems that these two learning rules take different forms.

Let’s try to unify these two rules into one algorithm.

Let’s consider the error signal e = d - y (d is the desired/target value, y is the actual output).

What is the error signal when d = 1?
e = d - y = 1 - 0 = 1, and w’ = w + ηx.
What is the error signal when d = 0?
e = d - y = 0 - 1 = -1, and w’ = w - ηx.
How to unify these two rules with the help of the error signal e?

w’ = w + ηex.

Perceptron Learning Algorithm

Start with a randomly chosen weight vector w(1);
while there exist input vectors that are misclassified by w(n)
  Do: Let x(n) be a misclassified input vector;
    e(n) = d(n) - y(n)
    Update the weight vector to
    w(n+1) = w(n) + η e(n) x(n)
    Increment n
end-while
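A minimal Python sketch of this algorithm (the function name, random seed and epoch cap are my own choices; the slides give only the pseudocode above):

```python
import numpy as np

def train_perceptron(X, d, eta=1.0, max_epochs=1000):
    """X: (n_samples, m+1) array whose first column is all 1s (bias input);
    d: desired outputs (0 or 1). Returns the learned weight vector."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])      # w(1): randomly chosen weights
    for _ in range(max_epochs):          # cap, in case the data is not separable
        mistakes = 0
        for x, target in zip(X, d):
            y = 1 if w @ x > 0 else 0    # current perceptron output
            e = target - y               # e(n) = d(n) - y(n)
            if e != 0:
                w = w + eta * e * x      # w(n+1) = w(n) + eta * e(n) * x(n)
                mistakes += 1
        if mistakes == 0:                # no misclassified input vectors remain
            break
    return w

# Example: learning the NAND gate from its truth table
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
d = np.array([1, 1, 1, 0])
print(train_perceptron(X, d))
```

For linearly separable data (such as NAND) the loop stops once an epoch passes with no mistakes, which is exactly the convergence behaviour proved later in the lecture.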

Imagine that you were a student in 1958: would you have been able to discover this simple algorithm as Rosenblatt did?

Creating a proper problem is as important as solving a problem, and sometimes even more important!

Example: There are two classes of patterns surrounding (0.75, 1.75) and
(2.75, 0.75) as shown in the following figure. For each class there are 5
data points:

Using the perceptron learning algorithm, after 100 epochs, the weight vector is

Thus the decision line is computed as

The classification result is shown below,

Does it always converge?

Frank Rosenblatt took another big step in mathematically proving the convergence in
1962, 4 years after the algorithm was proposed.
Break

AlphaGo (DeepMind):
Perceptron Convergence Theorem: (Rosenblatt, 1962)
If C1 and C2 are linearly separable, then the perceptron training
algorithm “converges” in the sense that after a finite number of
steps, the synaptic weights remain unchanged and the
perceptron correctly classifies all elements of the training set.
The Proof is too beautiful to skip!
We will only prove the case when w(1) = 0 and η = 1:
w(n+1) = w(n) + e(n) x(n).
What do we know?
There exists w0 such that
for samples in class C1, w0^T x(k) > 0, and
for samples in class C2, w0^T x(k) ≤ 0.


What do we need to prove?
After a finite number of steps, the weights stop changing.
If the weights stop changing, does it automatically imply that the perceptron
can classify all the training patterns correctly? Yes.
Proof: w(n+1) = w(n) + e(n) x(n), with w(1) = 0:
w(2) = w(1) + e(1) x(1) = e(1) x(1)
w(3) = w(2) + e(2) x(2) = e(1) x(1) + e(2) x(2)
…
w(n+1) = e(1) x(1) + e(2) x(2) + … + e(n) x(n)

We will try to find the upper bound and lower bound of ||w(n)||.
Let’s figure out the lower bound of ||w(n)|| first:
Let’s first project the weights along the direction of w0.

Multiply w(n+1) by w0^T:
w0^T w(n+1) = e(1) w0^T x(1) + e(2) w0^T x(2) + … + e(n) w0^T x(n)

Let’s check all the terms on the right side:

e(k) w0^T x(k) = (d(k) - y(k)) w0^T x(k)

If w0^T x(k) > 0, then d(k) = 1 and (on a mistake) y(k) = 0, so e(k) w0^T x(k) = |w0^T x(k)| > 0.

If w0^T x(k) ≤ 0, then d(k) = 0 and (on a mistake) y(k) = 1, so e(k) w0^T x(k) = |w0^T x(k)| ≥ 0.

Is it possible that e(k) = 0?

Yes, but nothing would happen if e(k) = 0, so we simply skip those steps.
Now let’s find a lower bound on the magnitudes of all possible w0^T x(i). Let
α = min{ |w0^T x(i)| } over all input patterns x(i) in the training set.
Is it possible that α = 0? No, we assume that no samples lie exactly on the boundary.

Since every term satisfies e(k) w0^T x(k) ≥ α, we get w0^T w(n+1) ≥ n α.

Using the Cauchy-Schwarz inequality,
||x|| · ||y|| ≥ |x^T y| for any vectors x and y,
so ||w0|| · ||w(n+1)|| ≥ |w0^T w(n+1)| ≥ n α, and therefore
||w(n+1)|| ≥ n α / ||w0||.
We can see that if the synaptic weights keep on changing forever, then as n → ∞, ||w(n+1)|| → ∞.

The crucial step in reaching this stage is multiplying w(n+1) by w0^T.

That step is not trivial at all. That’s why it took another 4 years for Frank to figure out the convergence proof!
Of course, the proof is not over yet. We need to show that the weights will certainly not grow to infinity.
Perceptron Convergence Theorem: (Rosenblatt, 1962)
Next, let’s try to obtain the upper bound of ||w(n)||. This time it is quite simple; we just need to figure out how the magnitude of the weights changes with time directly:
||w(n+1)||^2 = w^T(n+1) w(n+1) = (w(n) + e(n) x(n))^T (w(n) + e(n) x(n))
             = w^T(n) w(n) + 2 e(n) w^T(n) x(n) + e^2(n) x^T(n) x(n)
Let’s check the middle term:
e(n) w^T(n) x(n) = (d(n) - y(n)) w^T(n) x(n)

If w^T(n) x(n) > 0, then y(n) = 1 and d(n) = 0, so e(n) w^T(n) x(n) < 0.

If w^T(n) x(n) ≤ 0, then y(n) = 0 and d(n) = 1, so e(n) w^T(n) x(n) ≤ 0.

Since the middle term is never positive and e^2(n) = 1 on a mistake,
||w(n+1)||^2 = ||w(n)||^2 + 2 e(n) w^T(n) x(n) + ||x(n)||^2 ≤ ||w(n)||^2 + ||x(n)||^2
||w(2)||^2 ≤ ||w(1)||^2 + ||x(1)||^2
||w(3)||^2 ≤ ||w(2)||^2 + ||x(2)||^2
…
||w(n+1)||^2 ≤ ||w(n)||^2 + ||x(n)||^2
Summing up all the inequalities: ||w(n+1)||^2 ≤ ||w(1)||^2 + \sum_{k=1}^{n} ||x(k)||^2.
Let β = max{ ||x(i)||^2 } over all input patterns x(i) in the training set. Since w(1) = 0,
||w(n+1)||^2 ≤ n β.
Perceptron Convergence Theorem: (Rosenblatt, 1962)
On one hand, we have the upper bound of ||w(n)||:
||w(n+1)||^2 ≤ n β.

On the other hand, we have the lower bound of ||w(n)||:
||w(n+1)|| ≥ n α / ||w0||.

Let’s put them together:
n^2 α^2 / ||w0||^2 ≤ ||w(n+1)||^2 ≤ n β.
Note that α, β and w0 are all constants.

If the synaptic weights keep on changing forever, n → ∞.

Is it possible for the above inequality to hold forever?

Impossible!

So the synaptic weights will stop changing in finite time!
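Making the contradiction explicit (one extra step that the slide leaves implicit): the double inequality can only hold while

n^2 α^2 / ||w0||^2 ≤ n β,  i.e.  n ≤ n_max = β ||w0||^2 / α^2,

so the perceptron can make at most n_max weight corrections; after that, every training pattern must be classified correctly.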
Reductio ad absurdum (proof by contradiction; 反证法)
“Reductio ad absurdum, which Euclid loved
so much, is one of a mathematician’s finest
weapons. It is a far finer gambit than any
chess gambit: a chess player may offer the
sacrifice of a pawn or even a piece, but a
mathematician offers the game.”

Quoted from “A Mathematician’s Apology”,


G.H. Hardy, 1940.

G.H. Hardy (1877-1947)


English mathematician

Euclid used this method to prove the existence


of an infinity of prime numbers.

Pythagoras used this weapon to show that √2 is an irrational number.

If you have difficulty proving something by frontal


assault, do not forget this finest weapon!
Euclid
What would happen if the patterns are not linearly separable?

w(n+1) = w(n) + η e(n) x(n)
e(n) = d(n) - y(n)

Will the perceptron stop updating its weights?
No!

In real-world problems, the dimension of the pattern vectors is usually very high. How do we determine whether the problem is linearly separable or not?

It is simple! Just let the perceptron learn the training samples. If the weights converge, then the problem must be linearly separable. And if the perceptron does not stop learning, then it is not.
w(n+1) = w(n) + η e(n) x(n)
e(n) = d(n) - y(n)

How about the choice of the initial weights w(1)? Would the perceptron converge if the initial weights are chosen randomly?

Yes!

How about the choice of the learning rate η? Would the perceptron converge if other positive values are chosen?

Although η can be chosen to be any positive value, choosing it properly dictates how fast the learning algorithm converges to a right solution.
w(n+1) = w(n) + η e(n) x(n)
e(n) = d(n) - y(n)
How to choose the learning rate η?
What would happen if the learning rate is large?
Consider the case when the correct label is d(n) = 1 and the perceptron output is y(n) = 0:
w(n+1) = w(n) + η x(n), so w^T(n+1) x(n) = w^T(n) x(n) + η x^T(n) x(n).

What is the desired value of w^T(n+1) x(n)? Positive or negative? Positive!

Can we make it positive by properly choosing the learning rate η?
Yes. If η is chosen very large and applied to the example x(n), then the correct answer can be obtained in one step!
Can we conclude that choosing a large learning rate would speed up the convergence?
If η is chosen very large and applied to the example x(n), then learning is excellent as far as the present example is concerned, but at the cost of spoiling the learning that has taken place earlier with respect to other examples. Thus, a large value of η is not necessarily good.
If an extremely small value is chosen for η, that also leads to slow learning. Some intermediate value is best; usually the choice is problem dependent.
The perceptron can solve many real-world pattern classification problems, which made Rosenblatt very famous in the 1960s.
Can the perceptron also deal with another important application of neural networks?
Regression Problem
Consider a multiple input single output system whose mathematical model is unknown:

Given a set of observations of input-output data:

m = dimensionality of the input space; i = time index.


How to design a multiple input single output model for
the unknown system?
For example, 1D input

Error signal at time i: e(i) = d(i) - y(i).

How the error signal e(i) is used to adjust the synaptic weights in the model for the unknown system is determined mainly by the cost function used.
It can be formulated as an optimization problem.
What is the most common cost function to evaluate how good the model is?
The summation of squares of errors: E(w) = \sum_{i=1}^{n} e(i)^2 = \sum_{i=1}^{n} (d(i) - y(i))^2.

Why not use E(w) = \sum_{i=1}^{n} |e(i)|?

The absolute value function is not smooth (it is not differentiable at zero)!

Consider a cost function E(w) that is a continuously differentiable function of some unknown weight vector w.
Aim: to minimize the cost function E(w) with respect to w.
Consider a simple example first, a scalar function f(x):

Where do we find the minimal or maximal points? Where df(x)/dx = 0, and at the boundary!
Necessary condition for optimality: ∇E(w) = 0,

where ∇ is the gradient operator: ∇ = [∂/∂w1, ∂/∂w2, …, ∂/∂wm]^T.

Usually it is not easy to solve ∇E(w) = 0 directly.

Iterative descent algorithm: starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), …, such that the cost function E(w) is reduced at each iteration:
E(w(n+1)) < E(w(n)),
where w(n) and w(n+1) are the old and updated values of the weight vector, respectively.
w(n+1) = w(n) + Δw(n)
w(n+1) = w(n) + Δw(n)
How to choose Δw(n) such that E(w(n+1)) < E(w(n))?

What is the meaning of the direction represented by the gradient vector ∇E(w)?

Two-dimensional example: f(x, y).

The gradient is the direction along which the function value rises most quickly.
If you want the cost E(w) to decrease, should you let w move along the direction of the gradient or opposite to the direction of the gradient?
Method of Steepest Descent (Gradient Descent)

w(n+1) = w(n) + Δw(n)

Successive adjustments applied to the weight vector w are in the direction of steepest descent (a direction opposite to the gradient vector ∇E(w)).

Let g(n) = ∇E(w(n)); the steepest descent algorithm is formally described by

Δw(n) = -η g(n),

where η is a positive constant called the stepsize or learning-rate parameter, and g(n) is the gradient vector evaluated at the point w(n).
•Steepest-descent example: finding the absolute minimum of a one-dimensional error function f(x).
(Figure: starting from x0, with slope f’(x0), the next point is x1 = x0 - η f’(x0).)
Repeat this iteratively until, for some xi, f’(xi) is sufficiently close to 0.

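A minimal Python sketch of this one-dimensional steepest-descent loop (the quadratic example function, tolerance and iteration cap are illustrative choices, not from the slides):

```python
def gradient_descent_1d(f_prime, x0, eta=0.1, tol=1e-6, max_iter=1000):
    """Iterate x <- x - eta * f'(x) until the slope is close to zero."""
    x = x0
    for _ in range(max_iter):
        slope = f_prime(x)
        if abs(slope) < tol:        # f'(x_i) sufficiently close to 0
            break
        x = x - eta * slope         # step opposite to the gradient
    return x

# Illustrative error function f(x) = (x - 3)^2, so f'(x) = 2(x - 3)
x_min = gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)   # approaches 3, the minimiser of f
```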
Two-dimensional example

The updating of the weights is given as w(n+1) = w(n) - η g(n).

Does it satisfy the condition of iterative descent, E(w(n+1)) < E(w(n))?

We need to compare E(w(n+1)) and E(w(n)).
How to express E(w(n+1)) around w(n)? Can you write
E(w(n+1)) = E(w(n) + Δw(n)) ≈ E(w(n)) + ???
You learned that in calculus! The Taylor series:
f(x + Δx) = f(x) + f’(x) Δx + (f’’(x)/2) Δx^2 + …
E(w(n+1)) = E(w(n) + Δw(n)) ≈ E(w(n)) + (∂E/∂w) Δw(n) = E(w(n)) - η g^T(n) g(n) ≤ E(w(n)),
since ∂E/∂w = g^T(n) and Δw(n) = -η g(n).

Therefore the cost function is decreasing as the algorithm progresses from one iteration to the next, for a small positive learning rate η.

Linear Regression Problem


Consider that we are trying to fit a linear model to a set of input-output
pairs (x(1), d(1)), (x(2), d(2)) …, (x(n), d(n)) observed in an interval of
duration n.

y(x)=wTx=w1x1+w2x2+…+wmxm+b

Cost function: E(w) = \sum_{i=1}^{n} e(i)^2 = \sum_{i=1}^{n} (d(i) - y(i))^2

How to find the optimal parameters such that the cost is minimized?
What is the optimality condition?
∂E(w)/∂w = 0
Standard Linear Least Squares
We want to minimize the cost E(w) = \sum_{i=1}^{n} e(i)^2.
Let’s define the error vector e = [e(1), e(2), …, e(n)]^T, so that
E(w) = \sum_{i=1}^{n} e(i)^2 = e^T e.

Next, let’s express e in terms of the parameter w:

e(i) = d(i) - y(i), i.e. e = d - y,

y(i) = w^T x(i) = x(i)^T w.

Regression matrix: X = [x(1), x(2), …, x(n)]^T, with one row per input vector.

So we have e = d - Xw.
Let’s introduce some basics of matrix calculus.
Given a vector-valued function F(x), where its variable x is also an n-dimensional vector,

F(x) = [F1(x), F2(x), …, Fm(x)]^T,

the derivative of F(x) is defined as its Jacobian:

∂F/∂x = [∂Fi/∂xj]  (an m × n matrix).

If f(x) is a scalar-valued function, then it is just a special case of F(x):

∂f/∂x = [∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn]. It is a row vector!

Example 1. F(x) = c (a constant vector): ∂F/∂x = 0.

Example 2. F(x) = x, where x is an n-dimensional vector: ∂F/∂x = I, where I is the n × n identity matrix.

The two rules for computing derivatives:

Product rule: ∂(F^T G)/∂x = F^T (∂G/∂x) + G^T (∂F/∂x).

Example 3. f(x) = a^T x: ∂f/∂x = a^T I + x^T · 0 = a^T.

Example 4. f(x) = x^T x: ∂f/∂x = x^T I + x^T I = 2x^T.

Chain rule: ∂F(G(x))/∂x = (∂F/∂G)(∂G/∂x).
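These identities are easy to sanity-check numerically. A minimal sketch comparing finite differences with the closed-form results (purely illustrative; the helper function is my own):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Row vector of partial derivatives of a scalar function f at x."""
    g = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        g[j] = (f(x + step) - f(x - step)) / (2 * h)   # central difference
    return g

rng = np.random.default_rng(1)
a = rng.normal(size=3)
x = rng.normal(size=3)

# Example 3: d(a^T x)/dx = a^T
print(np.allclose(numerical_gradient(lambda z: a @ z, x), a))       # True
# Example 4: d(x^T x)/dx = 2 x^T
print(np.allclose(numerical_gradient(lambda z: z @ z, x), 2 * x))   # True
```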
Standard Linear Least Squares
We want to minimize the cost E(w) = \sum_{i=1}^{n} e(i)^2 = e^T e.

Do you know how to calculate ∂E/∂e?  ∂E/∂e = 2e^T.
We also have e = d - Xw.

Do you know how to calculate ∂e/∂w?  ∂e/∂w = -X.

How to calculate ∂E/∂w?
By the chain rule, ∂E/∂w = (∂E/∂e)(∂e/∂w) = -2e^T X = 0.

So e^T X = (d - Xw)^T X = (d^T - w^T X^T) X = d^T X - w^T X^T X = 0,

which gives w^T X^T X = d^T X, i.e. X^T X w = X^T d.

If X^T X is non-singular, w = (X^T X)^{-1} X^T d.
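A minimal NumPy sketch of this normal-equation solution on synthetic data (all names and numbers below are illustrative; in practice np.linalg.lstsq or np.linalg.solve is preferred over forming an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3
A = rng.normal(size=(n, m))                   # raw inputs x(i), one per row
X = np.hstack([np.ones((n, 1)), A])           # regression matrix with a bias column
w_true = np.array([0.5, 2.0, -1.0, 3.0])      # [b, w1, w2, w3] used to generate d
d = X @ w_true + 0.01 * rng.normal(size=n)    # noisy desired responses

# Normal equations: X^T X w = X^T d
w_hat = np.linalg.solve(X.T @ X, X.T @ d)
print(w_hat)                                  # close to w_true

# The least-squares residual is orthogonal to the column space of X
e = d - X @ w_hat
print(np.allclose(X.T @ e, 0, atol=1e-8))     # True, up to round-off
```

The final check anticipates the geometric interpretation on the next slide: at the optimum, X^T e = 0.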
Geometric Interpretation of Least Squares
Imagine you want to find out the distance from one
point to a plane. The distance is supposed to be the
shortest distance from this point to all the points in
the plane. It is well known that you should draw a
line from this point, which is perpendicular to the
plane.
This idea can be generalized to the high dimensional vector space.
The desired output d is a point in the n-dimensional space.
Xw is a point in the column space spanned by the column vectors of regression matrix X.
The error vector e = d - Xw corresponds to the line segment connecting the two points d and Xw.

We want to find w such that e has the smallest magnitude ||e||.
Then the error vector e (the line segment) must be orthogonal to the column space of X:

X^T e = 0,

which is the same condition we derived by differentiation!
Linear Regression Problem


Consider that we are trying to fit a linear model to a set of input-output
pairs (x(1), d(1)), (x(2), d(2)) …, (x(n), d(n)) observed in an interval of
duration n.
y(x)=w1x1+w2x2+…+wmxm+b

The standard linear least squares solution: w = (X^T X)^{-1} X^T d.

Regression matrix: X, with one row [1, x1(i), …, xm(i)] per sample.

What is the dimension of the regression matrix X?

n × (m + 1)

The computer may run out of memory if the number of samples n is very large!

How do we deal with “BIG DATA”?


Linear Regression Problem


Consider that we are trying to fit a linear model to a set of input-output
pairs (x(1), d(1)), (x(2), d(2)) …, (x(n), d(n)) observed in an interval of
duration n.
y(x)=w1x1+w2x2+…+wmxm+b

Can we directly use Rosenblatt’s perceptron to solve this linear regression problem?
No. The output of the perceptron is either 1 or 0 due to the hard limiter!
Can we modify the perceptron a little bit so that it can match the linear model?

We can just replace the hard limiter by the linear (identity) function:

y(i) = v(i) = w^T(i) x(i)

Can the linear neuron learn the function by itself, just like the perceptron?

Least-Mean-Square (LMS) Algorithm
Proposed by Widrow and Hoff at Stanford University in 1960.
Based on the instantaneous cost function at step n,
E(w) = (1/2) e^2(n),
where e(n) is the error signal measured at step n: e(n) = d(n) - x^T(n) w(n).

By the chain rule,

∂E/∂e = e(n),  ∂e/∂w(n) = -x^T(n),  so  ∂E(w)/∂w(n) = -e(n) x^T(n).

The gradient of E(w):

g(n) = (∂E(w)/∂w(n))^T = -e(n) x(n).

Applying the steepest descent method, we have

w(n+1) = w(n) - η g(n) = w(n) + η e(n) x(n), where η is the learning-rate parameter.

This is sometimes called the “Widrow-Hoff learning rule” or the “incremental gradient algorithm”.

Summary of the LMS algorithm:

Given n training samples {x(i), d(i)}, i = 1, 2, …, n, where x(i) is an input vector and d(i) is the corresponding desired response.
User-selected parameter: learning rate η.
Weight initialization.
Computation (LMS rule):
For n = 1, 2, …, compute
e(n) = d(n) - w^T(n) x(n)
w(n+1) = w(n) + η e(n) x(n)

Isn’t it the same learning algorithm as that for the perceptron?

YES!
Why didn’t we derive the learning algorithm for the perceptron following the simple idea of gradient descent?
We cannot, because the hard limiter is not differentiable!
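A minimal Python sketch of the LMS rule (the function and argument names are my own, not from the slides):

```python
import numpy as np

def lms(X, d, eta=0.1, epochs=50, w0=None):
    """LMS / Widrow-Hoff rule. X: (n_samples, dim) array, typically with a
    bias column of 1s; d: desired responses. Returns the weights after
    the requested number of passes (epochs) through the data."""
    w = np.zeros(X.shape[1]) if w0 is None else np.array(w0, dtype=float)
    for _ in range(epochs):
        for x, target in zip(X, d):
            e = target - w @ x           # e(n) = d(n) - w^T(n) x(n)
            w = w + eta * e * x          # w(n+1) = w(n) + eta * e(n) * x(n)
    return w
```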
Example: LMS algorithm
Training sample, (x(i), d(i)): {(1, 3.5), (1.5, 2.5), (3, 2), (3.5, 1), (4, 1)}.
Initial weight is chosen as w’(1) = [2, 0] . Learning rate is 0.1.

Find the solution using LMS algorithm for 50 epochs.


Adaptive filtering algorithm (annotated steps):
1. Initialize the weights.
2. Compute the filter output y(n) = w^T(n) x(n).
3. Compute the error signal e(n) = d(n) - y(n).
4. Update the weights.
5. Iterate.
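Reusing the lms sketch above, this example might be run as follows (each scalar input is augmented with a constant 1 for the bias; the exact numbers after 50 epochs depend on implementation details such as the order in which samples are presented, so treat the printed weights as approximate):

```python
import numpy as np

x = np.array([1.0, 1.5, 3.0, 3.5, 4.0])
d = np.array([3.5, 2.5, 2.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])    # rows of the form [1, x(i)]

w = lms(X, d, eta=0.1, epochs=50, w0=[2.0, 0.0])
print(w)   # compare with the slide's reported w' = [3.5654, -0.5309]
```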
Solution:
1. Learning process

y = w^T x (with the input augmented by a constant 1 for the bias term)

Up to now, all training patterns have been used once. We say “one epoch” of the training is completed.
More epochs are needed until a satisfactory fit is obtained.
2. The error function: Average of the cost e2/2 at the end of each epoch.

3. Result
After 50 epochs, the weight vector is w’ = [3.5654, -0.5309]’. The figure below
plots the result of the LMS algorithm for this example.

Perceptron v.s. LMS algorithm

The perceptron and the LMS algorithm emerged roughly about the same time,
during the late 1950s.
They represent different implementations of a single-layer perceptron based on error-correction learning.
The LMS algorithm uses a linear neuron!

The learning process in the perceptron stops after a finite number of iterations.
How about the LMS algorithm; does it converge in finite time?
In contrast, the LMS algorithm usually does not stop unless an extra stopping rule is applied, because a perfect fit is normally impossible!
The learning of the perceptron

w(n+1) = w(n) + η e(n) x(n)
Let’s take a closer look at what happens to each synaptic weight:
w_i(n+1) = w_i(n) + η e(n) x_i(n)
(Figure: the input x_i is connected through the weight w_i to the output y.)

The adjustment of the synaptic weight depends only upon the information of the input neuron and the output neuron, and nothing else.

The synaptic weight is changed along the direction of the input vector, and the size of the change is controlled by the output error.
Q & A…
