Neural Networks Two
Xiang Cheng
Associate Professor
Department of Electrical & Computer Engineering
The National University of Singapore
The artificial neural network resembles the brain in two respects: knowledge is acquired by the network from its environment through a learning process, and the acquired knowledge is stored in the interneuron connection strengths, known as synaptic weights.
The power of parallel processing can only be realized by building the ANN into a circuit!
Learning with a teacher:
Supervised Learning
Process of learning:
How can the system adjust the weights without any error or reward signal?
It depends upon the purpose of the learning system. It can perform different tasks such as
principal components analysis and clustering.
You are going to learn all three types of learning in this course.
Let’s start with the learning of the simplest neural network: perceptron
Perceptron: single-layer neural networks
Frank Rosenblatt (1928-1971) at Cornell University in 1958.
Perceptron was the first computer that could learn new skills by trial and error.
By the study of neural networks such as the Perceptron, Rosenblatt hoped that "the
fundamental laws of organization which are common to all information handling systems,
machines and men included, may eventually be understood."
Rosenblatt was a colorful character at Cornell in the early 1960s. A handsome bachelor,
he drove a classic MGA sports car and was often seen with his cat named Tobermory. He
enjoyed mixing with undergraduates, and for several years taught an interdisciplinary
undergraduate honors course entitled "Theory of Brain Mechanisms" that drew students
equally from Cornell's Engineering and Liberal Arts colleges.
Perceptron is built around the McCulloch-Pitts model.
Goal: To correctly classify the set of externally applied stimuli x1, x2,…, xn
into one of two classes, C1 and C2.
By a learning procedure.
The weight vector is determined from a given (training) set of input-output vectors
(exemplars) in such a way to achieve the best classification of the training vectors.
Let’s start with the simple method: Selection of weights by off-line calculations.
Selection of weights by off-line calculations - Example
Consider the problem of building the NAND gate using a perceptron.
Truth table of NAND:
x1 x2 | y
0  0  | 1
0  1  | 1
1  0  | 1
1  1  | 0
Can you formulate it as a pattern classification problem?
The input patterns (vectors) belong to two classes and are marked in the input space (the (x1, x2) plane) with two different symbols. The decision boundary is the straight line described by the following equation:
x2 = −x1 + 1.5, or equivalently −x1 − x2 + 1.5 = 0
Is the decision line unique for this problem?
Let’s select w = (w0, w1, w2) = (1.5, −1, −1).
For each input vector, the output signal is now calculated as y = φ(v), where v = w0 + w1x1 + w2x2 = 1.5 − x1 − x2 and φ is the hard limiter.
Example: What if we have the following truth table (AND gate)?
Is the decision boundary the same as that for the NAND gate? Yes.
Can we use the same weights? No. The output values of the truth table are important!
What are the proper weights? w’ = −w = (−1.5, +1, +1).
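As a quick check of the weights chosen above, here is a short Python sketch (the helper name is my own) that evaluates the hard-limiter output of both gates over all four input patterns, using the convention y = 1 when the induced local field is positive.

```python
import numpy as np

def perceptron_output(w, x):
    """McCulloch-Pitts unit: y = 1 if w0 + w1*x1 + w2*x2 > 0, else 0."""
    v = w[0] + np.dot(w[1:], x)   # induced local field (w[0] is the bias weight)
    return 1 if v > 0 else 0

patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
w_nand = np.array([1.5, -1.0, -1.0])   # weights selected off-line for NAND
w_and = -w_nand                        # w' = -w realizes the AND gate

for x in patterns:
    print(x, "NAND:", perceptron_output(w_nand, x), "AND:", perceptron_output(w_and, x))
```

Running it reproduces both truth tables, confirming that flipping the sign of the weight vector keeps the decision line but swaps the two output classes.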
Example: Consider 2-dimensional samples (0, 0), (0, 1), (1, 0), (-1, -1) that
belong to one class, and samples (2.1, 0), (0, -2.5), (1.6, -1.6) that belong to
another class. A two-class partition of the input space and a perceptron that
separates samples of opposite classes are as follows:
Figure: (a) A pair of linearly separable patterns; (b) A pair of non-linearly separable patterns.
Two classes are linearly separable if and only if there exists a weight vector w
based on which the perceptron can correctly perform the classification.
But what if the samples are NOT linearly separable (i.e. no straight line can possibly separate samples belonging to the two classes)? Can you find a perceptron that performs the task correctly?
There CANNOT be any simple perceptron that achieves the classification task.
Fundamental limitation of simple perceptrons (Minsky & Papert, 1969).
Example: Linearly separable?
How about the linearly separable case where the dimension of the input is high?
Still possible! A hyperplane in m dimensions is defined by the equation
w0 + w1x1 + w2x2 + … + wmxm = 0
In real-world applications, the dimension of the input pattern is very high. For instance, the dimension of a 100×100 image is 10,000.
How do we find the decision boundary in 10,000-dimensional space?
In 1958, Rosenblatt demonstrated that Perceptron can learn to do the job correctly!
Perceptron was the first computer that could learn new skills simply by trial and error.
Perceptron Learning
Now, suppose that the input variables of the perceptron originate from
TWO linearly separable classes C1 and C2.
We know that if C1 and C2 are linearly separable, there must exist at least one weight vector w0 such that
w0ᵀx > 0 for every input vector x belonging to class C1, and
w0ᵀx ≤ 0 for every input vector x belonging to class C2.
Training problem: Find a weight vector w such that the perceptron can
correctly classify the training set X.
But how?
It turns out that it can be found out simply by trial and error!
Let’s try to figure out the learning algorithm step by step.
Feed a pattern x to the perceptron with weight vector w, it will produce a binary output
y (1 or 0, i.e. Fire or Not Fire).
First consider the case
v = wᵀx ≤ 0, so the output is y = 0.
If the correct label (all the labels of the training samples are known, i.e. there is a
teacher!) is d=0; should we update the weights?
If it ain’t broke, do not fix it.
If the desired output is d=1, should we update the weights?
Of course. Assume the new weight vector is w’, then we have
w' = w + Δw
Let’s calculate the induced local field of the updated weights: v' = w'ᵀx.
Should we increase the induced local field or decrease it? Increase:
v' − v = w'ᵀx − wᵀx = Δwᵀx > 0.
Given a vector x, what is the simplest way to choose Δw such that Δwᵀx > 0?
Δw = x.
To avoid a big jump of the weights, let’s introduce a small learning rate η > 0:
Δw = ηx, so that Δwᵀx = ηxᵀx > 0.
If the true label is d = 1 and the perceptron makes a mistake, its synaptic weights are adjusted by
w' = w + ηx
Now consider the case
v = wᵀx > 0, so the output is y = 1.
If the correct label is d = 1, should we update the weights?
No. We will only adjust the weights when the perceptron makes a mistake (here, when d = 0).
Assume the new weight vector is w’, then we have
w' = w + Δw
Let’s calculate the induced local field v' = w'ᵀx. Should we increase the induced local field or decrease it? Decrease:
v' − v = w'ᵀx − wᵀx = Δwᵀx < 0, which is achieved by Δw = −ηx.
If the true label is d = 0 and the perceptron makes a mistake, its synaptic weights are adjusted by
w' = w − ηx
Let’s summarize what we have now.
If the true label is d = 1 and the perceptron makes a mistake, its synaptic weights are adjusted by
w' = w + ηx
If the true label is d = 0 and the perceptron makes a mistake, its synaptic weights are adjusted by
w' = w − ηx
Both cases can be written as a single rule:
w' = w + ηex, where e = d − y.
Perceptron learning algorithm:
  Initialize w(1) (e.g., w(1) = 0); set n = 1
  while any training sample is misclassified
      for each training pair (x(n), d(n)):
          compute the output y(n) = φ(wᵀ(n) x(n))
          update w(n+1) = w(n) + η [d(n) − y(n)] x(n)
          Increment n
  end-while
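Here is a minimal Python sketch of this training loop, assuming the hard-limiter convention y = 1 when v > 0 and the labeling d = 1 for class C1 and d = 0 for class C2; the function name and the reuse of the seven sample points from the earlier 2-D example are my own illustrative choices.

```python
import numpy as np

def train_perceptron(X, d, eta=1.0, max_epochs=1000):
    """Perceptron learning: X is (N, m) inputs, d holds desired labels in {0, 1}."""
    N, m = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])   # prepend x0 = 1 so w[0] acts as the bias
    w = np.zeros(m + 1)                    # w(1) = 0
    for epoch in range(max_epochs):
        mistakes = 0
        for x, target in zip(Xb, d):
            y = 1 if np.dot(w, x) > 0 else 0   # hard limiter
            e = target - y                      # e = d - y, in {-1, 0, +1}
            if e != 0:
                w = w + eta * e * x             # w' = w + eta * e * x
                mistakes += 1
        if mistakes == 0:                       # no misclassification: converged
            break
    return w

# The seven points from the earlier 2-D example (class C1 labeled 1, class C2 labeled 0)
X = np.array([[0, 0], [0, 1], [1, 0], [-1, -1],
              [2.1, 0], [0, -2.5], [1.6, -1.6]])
d = np.array([1, 1, 1, 1, 0, 0, 0])
print(train_perceptron(X, d))
```

Because the two classes here are linearly separable, the inner loop eventually makes a full pass with no mistakes and the weights stop changing, exactly as the convergence theorem below guarantees.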
Example: There are two classes of patterns surrounding (0.75, 1.75) and
(2.75, 0.75) as shown in the following figure. For each class there are 5
data points:
Using the perceptron learning algorithm, after 100 epochs, the weight vector is
Thus the decision line is computed as
Frank Rosenblatt took another big step by mathematically proving the convergence of the algorithm in 1962, four years after it was proposed.
Break
Perceptron Convergence Theorem: (Rosenblatt, 1962)
If C1 and C2 are linearly separable, then the perceptron training
algorithm “converges” in the sense that after a finite number of
steps, the synaptic weights remain unchanged and the
perceptron correctly classifies all elements of the training set.
The proof is too beautiful to skip!
We will only prove the case when w(1) = 0 and η = 1.
w(n+1) = w(n) + e(n) x(n)   (since η = 1)
What do we know?
There exists w0 such that for samples in class C1, w0ᵀx(k) > 0, and for samples in class C2, w0ᵀx(k) ≤ 0.
We will try to find out the upper bound and lower bound of ||w(n)||.
Let’s figure out the lower bound of ||w(n)|| first:
Let’s first project the weights along the direction of w0 .
If wᵀ(k)x(k) ≤ 0 while d(k) = 1 (so y(k) = 0 and e(k) = 1), then
e(k) w0ᵀx(k) = w0ᵀx(k) = |w0ᵀx(k)| > 0,
and similarly e(k) w0ᵀx(k) = |w0ᵀx(k)| > 0 whenever a sample with d(k) = 0 is misclassified.
Let α = min over the training set of |w0ᵀx(k)|. Summing the updates from w(1) = 0 gives w0ᵀw(n+1) ≥ nα, and the Cauchy-Schwarz inequality then yields the lower bound
||w(n+1)||² ≥ n²α² / ||w0||².
For the upper bound, expand ||w(k+1)||² = ||w(k) + e(k)x(k)||² and note that e(k) wᵀ(k)x(k) ≤ 0 whenever an update occurs; summing over the n updates gives
||w(n+1)||² ≤ nβ, where β = max over the training set of ||x(k)||².
Let’s put them together:
n²α² / ||w0||² ≤ ||w(n+1)||² ≤ nβ
Note that α, β, and w0 are all constants: the left-hand side grows like n² while the right-hand side grows only like n, so the inequality cannot hold for arbitrarily many updates n.
Impossible!
Reductio ad absurdum (proof by contradiction)
“Reductio ad absurdum, which Euclid loved
so much, is one of a mathematician’s finest
weapons. It is a far finer gambit than any
chess gambit: a chess player may offer the
sacrifice of a pawn or even a piece, but a
mathematician offers the game.” (G. H. Hardy, A Mathematician's Apology)
In real-world problems, the dimension of the pattern vectors is usually very high. How do we determine whether the problem is linearly separable or not?
It is simple! Just let the perceptron learn the training samples. If the weights converge, then the problem must be linearly separable; and if the perceptron never stops learning, then it is not.
w(n+1) = w(n) + η e(n) x(n)
e(n) = d(n) − y(n)
How about the choice of the initial weights w(1) ? Would the perceptron
converge if the initial weights are chosen randomly?
Yes!
How about the choice of the learning rate η? Would the perceptron converge if other positive values are chosen?
Yes. Although η can be chosen to be any positive value, choosing it properly dictates how fast the learning algorithm converges to a correct solution.
w(n+1) = w(n) + η e(n) x(n)
e(n) = d(n) − y(n)
How to choose the learning rate η?
What would happen if the learning rate is large?
Consider the case when the correct label is d(n) = 1 and the perceptron output is y(n) = 0:
w(n+1) = w(n) + η x(n)  ⟹  wᵀ(n+1) x(n) = wᵀ(n) x(n) + η xᵀ(n) x(n)
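To see the effect numerically, here is a tiny Python sketch with made-up numbers: for a misclassified pattern with d(n) = 1 and y(n) = 0, the corrected induced local field is wᵀ(n)x(n) + η xᵀ(n)x(n), so a large η can overshoot far beyond zero.

```python
import numpy as np

w = np.array([0.5, -1.0])        # current weights (illustrative values)
x = np.array([1.0, 2.0])         # misclassified pattern: w.x = -1.5, so y = 0 while d = 1
for eta in (0.1, 1.0, 10.0):
    v_new = np.dot(w, x) + eta * np.dot(x, x)   # new induced local field w(n+1)^T x(n)
    print(f"eta = {eta}: new induced local field = {v_new}")
```

A small η nudges the induced local field toward the correct side gradually, while a large η lets a single sample swing the weights far past the boundary.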
Regression Problem
Consider a multiple input single output system whose mathematical model is unknown:
How the error signal e(i) is used to adjust the synaptic weights in the model
for the unknown system is determined mainly by the cost function used.
It can be formulated as an optimization problem
What is the most common cost function to evaluate how good the model is?
Summation of squares of errors!
E(w) = Σ_{i=1}^{n} e(i)² = Σ_{i=1}^{n} (d(i) − y(i))²
w(n+1) = w(n) + Δw(n)
where w(n) and w(n+1) are the old and updated values of the weight vector, respectively.
w(n+1) = w(n) + Δw(n)
How to choose Δw(n) such that E(w(n+1)) < E(w(n))?
What is the meaning of the direction represented by the gradient vector ∇E(w)?
Two-dimensional example: f(x, y)
Gradient is the direction along which the function value rises most quickly.
If you want the cost E(w) to decrease, should you let w move along the direction of the gradient or opposite to the direction of the gradient?
Method of Steepest Descent (Gradient Descent)
• Steepest-descent example: finding the absolute minimum of a one-dimensional error function f(x):
Figure: the curve f(x); the tangent at x0 has slope f′(x0), and the update moves to x1 = x0 − η f′(x0).
Repeat this iteratively until for some xi, f’(xi) is sufficiently close to 0.
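As a minimal sketch of this one-dimensional iteration in Python, here is a loop on an assumed quadratic error function f(x) = (x − 3)²; the function, learning rate, and starting point are illustrative choices, not values from the lecture.

```python
def f(x):                 # an assumed one-dimensional error function (illustrative)
    return (x - 3.0) ** 2

def f_prime(x):           # its derivative f'(x)
    return 2.0 * (x - 3.0)

eta = 0.1                 # learning-rate (step-size) parameter
x = 0.0                   # starting point x0
while abs(f_prime(x)) > 1e-6:
    x = x - eta * f_prime(x)   # x_{i+1} = x_i - eta * f'(x_i)
print(x)                  # ends near the minimizer x = 3
```

Each step moves opposite to the slope, so the iterate slides downhill until the derivative is essentially zero.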
Two-dimensional example
The updating of the weights is given as w(n+1) = w(n) − η ∇E(w(n)).
y(x) = wᵀx = w1x1 + w2x2 + … + wmxm + b
Cost function: E(w) = Σ_{i=1}^{n} e(i)² = Σ_{i=1}^{n} (d(i) − y(i))²
How to find out the optimal parameter w such that the cost is minimized?
What is the optimality condition?
∂E(w)/∂w = 0
Standard Linear Least Squares
We want to minimize the cost E(w) = Σ_{i=1}^{n} e(i)².
Let’s define the error vector e = [e(1), e(2), …, e(n)]ᵀ, so that
E(w) = Σ_{i=1}^{n} e(i)² = eᵀe.
Each output is y(i) = wᵀx(i) = x(i)ᵀw.
Regression matrix: X is formed by stacking the input vectors x(i)ᵀ as its rows, and d = [d(1), …, d(n)]ᵀ.
So we have e = d − Xw.
Let’s introduce some basics of matrix calculus.
Given a function F(x) whose variable x = [x1, x2, …, xn]ᵀ is an n-dimensional vector, the derivative of F(x) is defined as its Jacobian. For a scalar-valued F,
∂F/∂x = [∂F/∂x1, ∂F/∂x2, …, ∂F/∂xn]. It is a row vector!
Example 1. F(x) = c, a constant: ∂F/∂x = 0.
Example 2. F(x) = aᵀx, where x is an n-dimensional vector: ∂F/∂x = aᵀ.
Product rule: ∂(u(x)ᵀv(x))/∂x = u(x)ᵀ ∂v/∂x + v(x)ᵀ ∂u/∂x.
Example 3. F(x) = aᵀx via the product rule (u = a, a constant; v = x): ∂F/∂x = 0 + aᵀ = aᵀ.
Example 4. F(x) = xᵀx: ∂F/∂x = xᵀ + xᵀ = 2xᵀ.
Chain rule: ∂F(g(x))/∂x = (∂F/∂g)(∂g/∂x).
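As a quick sanity check of the two identities used next, ∂(aᵀx)/∂x = aᵀ and ∂(xᵀx)/∂x = 2xᵀ, here is a small finite-difference comparison in Python; the helper function and test vectors are my own illustration, not part of the lecture.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Row-vector derivative [df/dx1, ..., df/dxn] by central differences."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, -2.0, 0.5])
a = np.array([3.0, 1.0, -1.0])

print(numerical_gradient(lambda z: a @ z, x))   # ~ a^T
print(numerical_gradient(lambda z: z @ z, x))   # ~ 2 x^T
print(a, 2 * x)                                 # analytic values for comparison
```

The numerical rows match the analytic row vectors aᵀ and 2xᵀ to within the finite-difference error.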
Standard Linear Least Squares
We want to minimize the cost E(w) = Σ_{i=1}^{n} e(i)² = eᵀe.
How to calculate ∂E/∂w?
By the chain rule,
∂E/∂w = (∂E/∂e)(∂e/∂w) = 2eᵀ · (−X) = −2eᵀX = 0,
which gives the optimality condition eᵀX = 0. Expanding,
eᵀX = (d − Xw)ᵀX = (dᵀ − wᵀXᵀ)X = dᵀX − wᵀXᵀX = 0
⟹ wᵀXᵀX = dᵀX, i.e., XᵀXw = Xᵀd
If XᵀX is non-singular, w = (XᵀX)⁻¹ Xᵀd.
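Here is a short NumPy sketch of this closed-form solution on synthetic data; the data, names, and noise level are my own illustration, not from the slides. It also prints Xᵀe, which should be numerically zero, matching the optimality condition just derived.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])   # regression matrix with a bias column
w_true = np.array([1.5, -2.0, 0.5, 3.0])
d = X @ w_true + 0.1 * rng.normal(size=n)                    # noisy desired outputs

w = np.linalg.solve(X.T @ X, X.T @ d)   # w = (X^T X)^{-1} X^T d, assuming X^T X is non-singular
e = d - X @ w                           # error vector
print(w)                                # close to w_true
print(X.T @ e)                          # ~ 0: the residual is orthogonal to the columns of X
```

Solving the normal equations XᵀXw = Xᵀd directly (rather than explicitly inverting XᵀX) is the standard, numerically safer way to evaluate the same formula.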
Geometric Interpretation of Least Squares
Imagine you want to find out the distance from one
point to a plane. The distance is supposed to be the
shortest distance from this point to all the points in
the plane. It is well known that you should draw a
line from this point, which is perpendicular to the
plane.
This idea can be generalized to the high dimensional vector space.
The desired output d is a point in the n-dimensional space.
Xw is a point in the column space spanned by the column vectors of regression matrix X.
The error vector e = d − Xw corresponds to the line segment connecting the two points d and Xw.
We want to find out w such that e has the smallest magnitude ||e||.
Then the error vector e (the line segment) must be orthogonal to the column space of X.
Xᵀe = Xᵀ(d − Xw) = 0,
which is the same condition we derived by differentiation!
Regression matrix: X has dimension n × (m + 1).
Can we directly use Rosenblatt’s perceptron to solve this linear regression problem?
No. The output of the perceptron is either 1 or 0 due to the hard limiter!
Can we modify the perceptron a little bit so that it matches the linear model?
y(i) = v(i) = wᵀ(i) x(i)
Can the linear neuron learn the function by itself, just like the perceptron?
Least-Mean-Square Algorithm
Proposed by Widrow and Hoff at Stanford University in 1960.
Based on the instantaneous cost function at step n,
E(w) = (1/2) e²(n),
where e(n) is the error signal measured at step n: e(n) = d(n) − xᵀ(n) w(n).
By the chain rule,
∂E/∂e(n) = e(n) and ∂e(n)/∂w(n) = −xᵀ(n),
so ∂E(w)/∂w(n) = −e(n) xᵀ(n), with e(n) = d(n) − wᵀ(n) x(n).
Moving opposite to the gradient gives the LMS update
w(n+1) = w(n) + η e(n) x(n)
In summary, at each step n: (1) present the input x(n); (2) compute the output y(n) = wᵀ(n) x(n); (3) compute the error signal e(n) = d(n) − y(n); (4) update the weights w(n+1) = w(n) + η e(n) x(n); (5) iterate.
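Here is a minimal Python sketch of the LMS update following the steps above; the synthetic data-generating line (slope 3.5, intercept −0.5), the learning rate, and the function name are my own illustrative choices, not values taken from the lecture example.

```python
import numpy as np

def lms(X, d, eta=0.01, epochs=50):
    """LMS (Widrow-Hoff): a linear neuron trained sample-by-sample."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, d):
            y = np.dot(w, x)           # steps 1-2: present x(n), compute y(n) = w^T(n) x(n)
            e = target - y             # step 3: error signal e(n) = d(n) - y(n)
            w = w + eta * e * x        # step 4: w(n+1) = w(n) + eta * e(n) * x(n)
    return w                           # step 5: iterate over epochs

rng = np.random.default_rng(1)
x1 = rng.uniform(-2.0, 2.0, size=100)
X = np.column_stack([x1, np.ones_like(x1)])       # augmented input [x, 1]
d = 3.5 * x1 - 0.5 + 0.2 * rng.normal(size=100)   # noisy samples of an assumed line
print(lms(X, d))                                  # roughly recovers [slope, intercept]
```

Because the data are noisy, the error never reaches exactly zero; the weights settle near the least-squares solution rather than stopping, which is the point made in the comparison with the perceptron below.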
Solution:
1. Learning process
y = wᵀx
2. The error function: the average of the cost e²/2 at the end of each epoch.
3. Result
After 50 epochs, the weight vector is w’ = [3.5654, -0.5309]’. The figure below
plots the result of the LMS algorithm for this example.
Perceptron vs. LMS algorithm
The perceptron and the LMS algorithm emerged at roughly the same time, during the late 1950s.
They represent different implementations of a single-layer perceptron based on error-correction learning.
The LMS algorithm uses a linear neuron!
The learning process in the perceptron stops after a finite number of iterations.
How about the LMS algorithm: does it converge in finite time?
In contrast, the LMS algorithm usually does not stop unless an extra stopping rule is applied, because perfect fitting is normally impossible!
The learning of the perceptrons
w(n+1) = w(n) + η e(n) x(n)
Let’s take a closer look at what happens to each synaptic weight:
wi(n+1) = wi(n) + η e(n) xi(n)
Figure: a single synapse with input xi, weight wi, and output y.
The adjustment of the synaptic weight only depends upon the information of the
input neuron and the output neuron, and nothing else.
The synaptic weight is changed along the direction of the input vector, and the size of the change is controlled by the output error.
Q & A…