Correlation Learning Rule
The correlation learning rule is derived by starting from the criterion function

J(w) = −∑_{i=1}^{m} y^i d^i    (2.49)

where y^i = (x^i)^T w, and performing gradient descent to minimize J. Note that minimizing J(w) is equivalent to maximizing the correlation between the desired target and the corresponding linear unit's output for all x^i, i = 1, 2, ..., m. Now, employing steepest gradient descent to minimize J(w) leads to the learning rule:
w^1 = 0
w^{k+1} = w^k + ρ d^k x^k    (2.50)
By setting ρ to 1 and completing one learning cycle using Equation (2.50), we arrive at the weight
vector w * given by
w* = ∑_{i=1}^{m} d^i x^i = Xd    (2.51)
where X and d are as defined above. Note that Equation (2.51) leads to the minimum SSE solution
in Equation (2.38) if X† = X. This is only possible if the training vectors x^k are encoded such that
XX^T is the identity matrix (i.e., the x^k vectors are orthonormal).
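To make the one-cycle computation concrete, the following sketch implements Equations (2.50)-(2.51) directly (the function name, the use of NumPy, and the orthonormal test data are our own illustrative choices, not from the text):

```python
import numpy as np

def correlation_rule(X, d):
    """One learning cycle of the correlation rule, Eqs. (2.50)-(2.51).

    X : (n+1) x m matrix whose columns are the x^i training vectors.
    d : length-m vector of desired targets.
    Returns w* = sum_i d^i x^i, i.e. X d.
    """
    w = np.zeros(X.shape[0])            # w^1 = 0
    for k in range(X.shape[1]):         # one pass with rho = 1
        w += d[k] * X[:, k]             # w^{k+1} = w^k + d^k x^k
    return w                            # equals X @ d

# When the x^i are orthonormal (X X^T = I), w* is also the minimum-SSE
# solution, as noted above:
X = np.linalg.qr(np.random.randn(4, 4))[0]   # orthonormal columns
d = np.array([1.0, -1.0, 1.0, -1.0])
print(np.allclose(X.T @ correlation_rule(X, d), d))   # True: y^i = d^i
```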
Another version of this type of learning is the covariance learning rule. This rule is obtained by
steepest gradient descent on the criterion function
J(w) = −∑_{i=1}^{m} (y^i − ȳ)(d^i − d̄)
Here, ȳ and d̄ are computed averages, over all training pairs, of the unit's output and the desired
target, respectively. Covariance learning provides the basis of the cascade-correlation net.
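A minimal sketch of the corresponding covariance computation (again our own formulation, written as a single learning cycle from w = 0 for symmetry with the correlation rule above):

```python
import numpy as np

def covariance_rule(X, d, rho=1.0):
    """One learning cycle of covariance learning: steepest descent on
    J(w) = -sum_i (y^i - y_bar)(d^i - d_bar), starting from w = 0."""
    d_centered = d - d.mean()
    # Since sum_i (d^i - d_bar) = 0, the average-output term drops out of
    # the gradient, leaving -grad J = X (d - d_bar); compare Eq. (2.51).
    return rho * (X @ d_centered)
```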
The following rule is similar to the μ -LMS rule except that it allows for units with a differentiable
nonlinear activation function f. Figure 2-7 illustrates a unit with a sigmoidal activation function.
Here, the unit's output is y = f(net), with net defined as the vector inner product x^T w.
Again, consider the training pairs {x^i, d^i}, i = 1, 2, ..., m, with x^i ∈ R^{n+1} (x^i_{n+1} = 1 for all i) and d^i ∈ [−1, +1]. Performing gradient descent on the instantaneous SSE criterion function J(w) = (1/2)(d − y)², whose gradient is given by

∇J(w) = −(d − y) f′(net) x    (2.52)

leads to the delta learning rule:
w^1 arbitrary
w^{k+1} = w^k + ρ[d^k − f(net^k)] f′(net^k) x^k = w^k + ρ δ^k x^k    (2.53)
where net^k = (x^k)^T w^k and f′ = df/d(net). If f is defined by f(net) = tanh(β net), then its derivative is given by f′(net) = β[1 − f²(net)]. For the "logistic" function, f(net) = 1/(1 + e^{−β net}), the derivative is f′(net) = β f(net)[1 − f(net)]. Figure 2-8 plots f and f′ for the hyperbolic tangent activation function with β = 1. Note how f asymptotically approaches +1 and −1 in the limit as net approaches +∞ and −∞, respectively.
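The delta rule in Equation (2.53) can be sketched as follows (an illustrative NumPy implementation under our own choice of hyperparameters; the hyperbolic tangent activation is the one discussed above):

```python
import numpy as np

def delta_rule(X, d, beta=1.0, rho=0.1, epochs=100, seed=0):
    """Incremental delta rule, Eq. (2.53), with f(net) = tanh(beta * net).

    X : (n+1) x m matrix of training vectors (last component = 1, the bias).
    d : length-m vector of targets in [-1, +1].
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[0])   # w^1 arbitrary
    for _ in range(epochs):
        for k in range(X.shape[1]):
            net = X[:, k] @ w
            y = np.tanh(beta * net)              # f(net)
            f_prime = beta * (1.0 - y ** 2)      # f'(net) = beta*(1 - f^2)
            w += rho * (d[k] - y) * f_prime * X[:, k]   # Eq. (2.53)
    return w
```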
Figure 2-8 Hyperbolic tangent activation function f and its derivative f ′ , plotted for −3 ≤ net ≤ +3 .
One disadvantage of the delta learning rule is immediately apparent upon inspection of the graph of
f′(net) in Figure 2-8. In particular, notice how f′(net) ≈ 0 when net has large magnitude (i.e.,
|net| > 3); these regions are called flat spots of f′. In these flat spots, we expect the delta learning
rule to progress very slowly (i.e., very small weight changes even when the error ( d − y ) is large),
because the magnitude of the weight change in Equation (2.53) directly depends on the magnitude of
f ′ ( net ) . Since slow convergence results in excessive computation time, it would be advantageous to
try to eliminate the flat spot phenomenon when using the delta learning rule. One common flat spot
elimination technique involves replacing f′ by f′ plus a small positive bias ε. In this case, the weight update equation reads as

w^{k+1} = w^k + ρ[d^k − f(net^k)][f′(net^k) + ε] x^k    (2.54)
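In code, this modification is a one-line change to the update in the delta-rule sketch above; here is a minimal variant (the value ε = 0.05 is our own illustrative choice):

```python
import numpy as np

def delta_rule_flat_spot(X, d, beta=1.0, rho=0.1, eps=0.05, epochs=100):
    """Delta rule with flat-spot elimination: f' replaced by f' + eps."""
    w = np.zeros(X.shape[0])
    for _ in range(epochs):
        for k in range(X.shape[1]):
            y = np.tanh(beta * (X[:, k] @ w))
            f_prime = beta * (1.0 - y ** 2)
            # The small positive bias eps keeps updates alive even when
            # |net| is large and f'(net) is nearly zero.
            w += rho * (d[k] - y) * (f_prime + eps) * X[:, k]
    return w
```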
One of the primary advantages of the delta rule is that it has a natural extension that may be used to
train multilayered neural nets. This extension, known as error back propagation, will be discussed
in Chapter 3.
Hassoun and Song (1992) proposed a set of adaptive learning rules for classification problems as
enhanced alternatives to the LMS and perceptron learning rules. In the following, three learning
rules, AHK I, AHK II, and AHK III, are derived based on gradient-descent strategies on an
appropriate criterion function. Two of the proposed learning rules, AHK I and AHK II, are well
suited for generating robust decision surfaces for linearly separable problems. The third training rule,
AHK III, extends these capabilities to find "good" approximate solutions for nonlinearly separable
problems. The three AHK learning rules preserve the simple incremental nature found in the LMS
and perceptron learning rules. The AHK rules also possess additional processing capabilities, such
as the ability to automatically identify critical cluster boundaries and place a linear decision surface
in such a way that it leads to enhanced classification robustness.
Consider a two-class {c₁, c₂} classification problem with m labeled feature vectors (training vectors) {x^i, d^i}, i = 1, 2, ..., m. It is assumed that x^i ∈ R^{n+1} (with the last component of x^i being a bias of value 1) and that d^i = +1 (−1) if x^i ∈ c₁ (c₂). Then, a single perceptron can be trained to correctly classify the preceding training pairs if an (n + 1)-dimensional weight vector w is computed that satisfies the following set of m inequalities (the sgn function is assumed to be the perceptron's activation function):
(x^i)^T w > 0 if d^i = +1, and (x^i)^T w < 0 if d^i = −1, for i = 1, 2, ..., m    (2.55)
If we define a new set of vectors z^i according to

z^i = +x^i if d^i = +1, and z^i = −x^i if d^i = −1, for i = 1, 2, ..., m    (2.56)

and we let

Z = [z^1 z^2 ... z^m]    (2.57)

then Equation (2.55) may be rewritten compactly as

Z^T w > 0    (2.58)
Now, defining an m-dimensional positive-valued margin vector b (b > 0) and using it in Equation
(2.58), we arrive at the following equivalent form of Equation (2.55):
Z^T w = b    (2.59)
Thus the training of the perceptron is now equivalent to solving Equation (2.59) for w, subject to the constraint b > 0. Ho and Kashyap (1965) proposed an iterative algorithm for solving Equation (2.59). In the Ho-Kashyap algorithm, the components of the margin vector are first initialized to small positive values, and the pseudoinverse is used to generate a solution for w (based on the initial guess of b) that minimizes the SSE criterion function J(w, b) = (1/2)‖Z^T w − b‖²:

w = Z†b    (2.60)

where Z† = (ZZ^T)^{−1} Z, for m > n + 1. Next, a new estimate for the margin vector is computed by
b^{k+1} = b^k + (1/2)[ε^k + |ε^k|]  with  ε^k = Z^T w^k − b^k    (2.61)
where |·| denotes the absolute value of the components of the argument vector, and b^k is the
"current" margin vector. A new estimate of w can now be computed using Equation (2.60) and
employing the updated margin vector from Equation (2.61). This process continues until all the
components of ε are zero (or are sufficiently small and positive), which is an indication of linear
separability of the training set, or until ε < 0 , which is an indication of nonlinear separability of the
training set (no solution is found). It can be shown (Ho and Kashyap, 1965) that the Ho-Kashyap
procedure converges in a finite number of steps if the training set is linearly separable. For
simulations comparing the preceding training algorithm with the LMS and perceptron training
procedures, the reader is referred to Hassoun and Clark (1988). This algorithm will be referred to
here as the direct Ho-Kashyap (DHK) algorithm.
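A minimal sketch of the DHK algorithm as described above (the stopping tests, tolerance, and function signature are our own choices):

```python
import numpy as np

def direct_ho_kashyap(Z, b0=0.1, max_iter=1000, tol=1e-8):
    """Direct Ho-Kashyap (DHK) iteration, Eqs. (2.60)-(2.61).

    Z : (n+1) x m matrix whose columns are the z^i vectors
        (x^i for class 1 and -x^i for class 2), with m > n + 1.
    Returns (w, b, separable).
    """
    m = Z.shape[1]
    b = np.full(m, b0)                      # small positive initial margins
    Zt_pinv = np.linalg.pinv(Z.T)           # equals (Z Z^T)^{-1} Z here
    w = Zt_pinv @ b                         # Eq. (2.60): w = Z^dagger b
    for _ in range(max_iter):
        eps = Z.T @ w - b                   # error vector
        if np.all(np.abs(eps) < tol):
            return w, b, True               # linearly separable solution
        if np.all(eps <= 0):
            return w, b, False              # nonlinear separability detected
        b = b + 0.5 * (eps + np.abs(eps))   # Eq. (2.61)
        w = Zt_pinv @ b                     # re-solve Eq. (2.60)
    return w, b, False
```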
The direct synthesis of the w estimate in Equation (2.60) involves a one-time computation of the pseudoinverse of Z. However, this computation can be expensive and requires special treatment when ZZ^T is ill-conditioned (i.e., the determinant of ZZ^T is close to zero). An
alternative algorithm that is based on gradient-descent principles and which does not require the
direct computation of Z† can be derived. This derivation is presented next.
Starting with the criterion function J(w, b) = (1/2)‖Z^T w − b‖², gradient descent may be performed with
respect to b and w so that J is minimized subject to the constraint b > 0. The gradient of J with
respect to w and b is given by
∇_b J(w, b)|_{w^k, b^k} = −(Z^T w^k − b^k)    (2.62a)
∇_w J(w, b)|_{w^k, b^{k+1}} = Z(Z^T w^k − b^{k+1})    (2.62b)
where the superscripts k and k + 1 represent current and updated values, respectively. One analytic
method for imposing the constraint b > 0 is to replace the gradient in Equation (2.62a) by
−0.5(ε + |ε|), with ε as defined in Equation (2.61). This leads to the following gradient-descent
formulation of the Ho-Kashyap procedure:
b^{k+1} = b^k + (ρ₁/2)(ε^k + |ε^k|)  with  ε^k = Z^T w^k − b^k    (2.63a)

and

w^{k+1} = w^k − ρ₂ Z(Z^T w^k − b^{k+1}) = w^k + (ρ₁ρ₂/2) Z[|ε^k| + ε^k(1 − 2/ρ₁)]    (2.63b)
where ρ1 and ρ 2 are strictly positive constant learning rates. Because of the requirement that all
training vectors z k (or x k ) be present and included in Z, this procedure is called the batch-mode
adaptive Ho-Kashyap (AHK) procedure. It can be easily shown that if ρ₁ = 0 and b^1 = 1, Equation (2.63) reduces to the μ-LMS learning rule. Furthermore, convergence can be guaranteed (Duda and Hart, 1973) if 0 < ρ₁ < 2 and 0 < ρ₂ < 2/λ_max, where λ_max is the largest eigenvalue of the positive definite matrix ZZ^T.
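A batch-mode AHK sketch following Equations (2.63a)-(2.63b) (setting ρ₂ = 1/λ_max by default, which satisfies the stated convergence condition; the remaining names and defaults are our own):

```python
import numpy as np

def batch_ahk(Z, rho1=0.5, rho2=None, b0=0.1, epochs=500):
    """Batch-mode adaptive Ho-Kashyap (AHK), Eqs. (2.63a)-(2.63b)."""
    m = Z.shape[1]
    if rho2 is None:
        # Convergence requires 0 < rho2 < 2/lambda_max (Duda and Hart, 1973).
        lam_max = np.linalg.eigvalsh(Z @ Z.T).max()
        rho2 = 1.0 / lam_max
    b = np.full(m, b0)                        # small positive initial margins
    w = np.zeros(Z.shape[0])
    for _ in range(epochs):
        eps = Z.T @ w - b                     # eps^k = Z^T w^k - b^k
        b = b + 0.5 * rho1 * (eps + np.abs(eps))   # Eq. (2.63a)
        w = w - rho2 * (Z @ (Z.T @ w - b))         # Eq. (2.63b), uses b^{k+1}
    return w, b
```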
A completely adaptive Ho-Kashyap procedure for solving Equation (2.59) is arrived at by starting
from the instantaneous criterion function
J(w, b) = (1/2)[(z^i)^T w − b_i]²

Performing gradient descent with respect to b and w, as before, gives

b_i^{k+1} = b_i^k + (ρ₁/2)(ε_i^k + |ε_i^k|)  with  ε_i^k = (z^i)^T w^k − b_i^k    (2.64a)

and

w^{k+1} = w^k − ρ₂ z^i[(z^i)^T w^k − b_i^{k+1}] = w^k + (ρ₁ρ₂/2)[|ε_i^k| + ε_i^k(1 − 2/ρ₁)] z^i    (2.64b)
Here, b_i represents a scalar margin associated with the input x^i. In all the preceding Ho-Kashyap learning procedures, the margin values are initialized to small positive values, and the perceptron
weights are initialized to zero (or small random) values. If full margin error correction is assumed in
Equation (2.64a), i.e., ρ1 = 1, the incremental learning procedure in Equation (2.64) reduces to the
heuristically derived procedure reported in Hassoun and Clark (1988). An alternative way of writing Equation (2.64) is

Δb_i^k = ρ₁ ε_i^k if ε_i^k > 0, and 0 otherwise    (2.65a)

Δw^k = ρ₂(ρ₁ − 1) ε_i^k z^i if ε_i^k > 0, and −ρ₂ ε_i^k z^i if ε_i^k ≤ 0    (2.65b)

where Δb and Δw signify the difference between the updated and current values of b and w,
respectively. This procedure is called the AHK I learning rule. For comparison purposes, it may be
noted that the μ-LMS rule in Equation (2.35) can be written as Δw = −μ ε_i^k z^i, with b_i held fixed at +1.
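The incremental AHK I rule of Equation (2.64) can be sketched as follows (initialization and hyperparameters are our own illustrative choices, following the initialization guidance above):

```python
import numpy as np

def ahk1(Z, rho1=1.0, rho2=0.1, b0=0.1, epochs=200):
    """Incremental AHK I rule, Eqs. (2.64a)-(2.64b): per-pattern margins."""
    n_plus_1, m = Z.shape
    w = np.zeros(n_plus_1)                 # weights start at zero
    b = np.full(m, b0)                     # small positive initial margins
    for _ in range(epochs):
        for i in range(m):
            eps = Z[:, i] @ w - b[i]       # eps_i^k = (z^i)^T w^k - b_i^k
            b[i] += 0.5 * rho1 * (eps + abs(eps))        # Eq. (2.64a)
            w -= rho2 * (Z[:, i] @ w - b[i]) * Z[:, i]   # Eq. (2.64b)
    return w, b
```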
The implied constraint b_i > 0 in Equations (2.64) and (2.65) was realized by starting with a positive initial margin and restricting the change Δb to positive real values. An alternative, more flexible way to realize this constraint is to allow both positive and negative changes in Δb, except for the cases where a decrease in b_i results in a negative margin. This modification results in the following alternative AHK II learning rule:

Δb_i^k = ρ₁ ε_i^k if ε_i^k > −b_i^k/ρ₁, and 0 otherwise    (2.66a)

Δw^k = ρ₂(ρ₁ − 1) ε_i^k z^i if ε_i^k > −b_i^k/ρ₁, and −ρ₂ ε_i^k z^i otherwise    (2.66b)
In the general case of an adaptive margin, as in Equation (2.66), Hassoun and Song (1992) showed
that a sufficient condition for the convergence of the AHK rules is given by
0 < ρ₂ < 2 / max_i ‖z^i‖²    (2.67a)
Another variation results in the AHK III rule, which is appropriate for both linearly separable and nonlinearly separable problems. Here, Δw is set to 0 in Equation (2.66b) whenever ε_i^k ≤ −b_i^k/ρ₁. The advantages of the AHK III rule are that (1) it is capable of identifying the critical training vectors that prevent separability, and (2) it uses such information to discard nonseparable training vectors and speed up convergence (Hassoun and Song, 1992).
Example 2.2 In this example the perceptron, LMS, and AHK learning rules are compared in terms
of the quality of the solutions they generate. Consider the simple two-class linearly separable
problem shown earlier in Figure 2-4. The μ -LMS rule of Equation (2.35) is used to obtain the
solution shown as a dashed line in Figure 2-9. Here, the initial weight vector was set to 0 and a
learning rate of μ = 0.005 was used. This solution is not one of the linearly separable solutions for this
problem. Four examples of linearly separable solutions are shown as solid lines in the figure. These
solutions are generated using the perceptron learning rule of Equation (2.2), with varying order of
input vector presentations and with a learning rate of ρ = 0.1. Here, it should be noted that the most robust solution, in the sense of tolerance to noisy input, is given by x₂ = x₁ + 1/2, which is shown as a dotted line in Figure 2-9. This robust solution was in fact automatically generated by the AHK I learning rule of Equation (2.65).
The SSE criterion function in Equation (2.32) is not the only possible choice. We have already seen
other alternative functions such as the ones in Equations (2.20), (2.24), and (2.25). In general, any
differentiable function that is minimized upon setting y i = d i , for i = 1,2,..., m, could be used. One
possible generalization of SSE is the Minkowski-r criterion function (Hanson and Burr, 1988) given
by
J(w) = (1/r) ∑_{i=1}^{m} |d^i − y^i|^r    (2.68)

or its instantaneous version

J(w) = (1/r) |d^i − y^i|^r    (2.69)
Figure 2-9 LMS-generated decision boundary (dashed line) for a two-class linearly separable problem. For comparison, four
solutions generated using the perceptron learning rule are shown (solid lines). The dotted line is the solution generated by the
AHK I rule.
Figure 2-10 shows a plot of |d^i − y^i|^r for r = 1, 1.5, 2, and 20. The general form of the gradient of Equation (2.68) is

∇J(w) = −sgn(d − y) |d − y|^{r−1} f′(net) x    (2.70)

Note that for r = 2 this reduces to the gradient of the SSE criterion function given by Equation (2.52). If r = 1, then J(w) = |d^i − y^i|, with gradient −sgn(d − y) f′(net) x [note that the gradient of J(w) does not exist at y = d]. In this case, the criterion function in Equation (2.68) is known as the Manhattan norm. For r → ∞, a supremum error measure is approached.
A small r gives less weight to large deviations and tends to reduce the influence of the outermost points in the input space during learning. It can be shown, for a linear unit with normally distributed inputs, that r = 2 is an appropriate choice in the sense of both minimum SSE and minimum probability of prediction error (maximum likelihood): under Gaussian-distributed deviations d − y, maximizing the likelihood of the training data is equivalent to minimizing the sum of squared errors.
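A sketch of the Minkowski-r version of the delta rule, obtained by descending along the negative of the gradient in Equation (2.70) (hyperparameter values are our own illustrative choices):

```python
import numpy as np

def minkowski_r_delta(X, d, r=1.5, beta=1.0, rho=0.05, epochs=200):
    """Delta rule driven by the Minkowski-r criterion, Eqs. (2.68)/(2.70)."""
    w = np.zeros(X.shape[0])
    for _ in range(epochs):
        for k in range(X.shape[1]):
            y = np.tanh(beta * (X[:, k] @ w))
            err = d[k] - y
            f_prime = beta * (1.0 - y ** 2)
            # -grad J = sgn(d - y) |d - y|^(r-1) f'(net) x,  Eq. (2.70);
            # r = 2 recovers the ordinary delta rule update.
            w += rho * np.sign(err) * abs(err) ** (r - 1) * f_prime * X[:, k]
    return w
```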
Another criterion function that can be used (Hopfield, 1987) is the instantaneous relative entropy
error measure (Kullback, 1959) defined by
J(w) = (1/2)[(1 + d) ln((1 + d)/(1 + y)) + (1 − d) ln((1 − d)/(1 − y))]    (2.76)
where d belongs to the open interval ( −1, +1) . As before, J ( w ) ≥ 0 , and if y = d, then J ( w ) = 0 . If
y = f ( net ) = tanh ( β net ) , the gradient of Equation (2.76) is
∇J(w) = −β(d − y) x    (2.77)
The factor f ′ ( net ) in Equations (2.53) and (2.70) is missing from Equation (2.77). This eliminates
the flat spot encountered in the delta rule and makes the training here more like μ -LMS [note,
however, that y here is given by y = f ( net ) ≠ net ]. This entropy criterion is "well formed" in the
sense that gradient descent over such a function will result in a linearly separable solution, if one
exists (Hertz et al., 1991). On the other hand, gradient descent on the SSE criterion function does not
share this property, since it may fail to find a linearly separable solution, as demonstrated in
Example 2.2.
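A sketch of training with the relative entropy criterion, using the gradient in Equation (2.77) (our own minimal formulation; note the absence of the f′(net) factor in the update):

```python
import numpy as np

def entropy_delta_rule(X, d, beta=1.0, rho=0.1, epochs=200):
    """Delta rule for the relative entropy criterion, Eqs. (2.76)-(2.77).

    Targets d^i must lie strictly inside (-1, +1), e.g. +/-0.9.
    """
    w = np.zeros(X.shape[0])
    for _ in range(epochs):
        for k in range(X.shape[1]):
            y = np.tanh(beta * (X[:, k] @ w))
            # No f'(net) factor, so there is no flat spot: the update looks
            # like mu-LMS, but with y = tanh(beta*net) rather than y = net.
            w += rho * beta * (d[k] - y) * X[:, k]
    return w
```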
In order for gradient-descent search to find a solution w * in the desired linearly separable region, we
need to use a well-formed criterion function. Consider the following general criterion function:
J(w) = ∑_{i=1}^{m} g((z^i)^T w)    (2.78)

where

z = +x if x ∈ class c₁, and z = −x if x ∈ class c₂
Let s = z^T w. The criterion function J(w) is said to be well formed if g(s) is differentiable and
satisfies the following conditions (Wittner and Denker, 1988):
1. For all s, −dg(s)/ds ≥ 0; i.e., g does not push in the wrong direction.

2. There exists ε > 0 such that −dg(s)/ds ≥ ε for all s ≤ 0; i.e., g keeps pushing if there is a misclassification.
For a single unit with weight vector w, it can be shown (Wittner and Denker, 1988) that if the
criterion function is well formed, then gradient descent is guaranteed to enter the region of linearly
separable solutions w * , provided that such a region exists.
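As a concrete illustration (our own choice, not an example from the text), g(s) = ln(1 + e^{−s}) is well formed: −dg/ds = 1/(1 + e^{s}) is nonnegative for all s and is at least ε = 1/2 for s ≤ 0. A quick numerical check:

```python
import numpy as np

def g(s):                 # softplus of -s: a candidate well-formed criterion
    return np.log1p(np.exp(-s))

def neg_g_prime(s):       # -dg/ds = 1 / (1 + e^s)
    return 1.0 / (1.0 + np.exp(s))

s = np.linspace(-10, 10, 2001)
print(np.all(neg_g_prime(s) >= 0))            # condition 1: never negative
print(np.all(neg_g_prime(s[s <= 0]) >= 0.5))  # condition 2 with eps = 1/2
```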
Table 2-1 Summary of Basic Learning Rules

The general form of each learning rule is w^{k+1} = w^k + ρ^k s^k, where ρ^k is the learning rate and s^k is the learning vector. Throughout the table, net = x^T w, and z^k = x^k if d^k = +1, z^k = −x^k if d^k = −1.

Perceptron rule with variable learning rate (supervised)
  Criterion function: J(w) = −∑_{z^T w ≤ b} (z^T w − b)
  Learning vector: s^k = z^k if (z^k)^T w^k ≤ b, 0 otherwise
  Conditions: b > 0; ρ^k ≥ 0 satisfying lim_{m→∞} [∑_{k=1}^{m} (ρ^k)²] / (∑_{k=1}^{m} ρ^k)² = 0
  Activation function: f(net) = sgn(net)
  Remarks: Converges to z^T w > b if the training set is linearly separable. Finite convergence if ρ^k = ρ, a positive constant.
May's rule (supervised)
  Criterion function: J(w) = (1/2) ∑_{z^T w ≤ b} (z^T w − b)² / ‖z‖²
  Learning vector: s^k = [(b − (z^k)^T w^k) / ‖z^k‖²] z^k if (z^k)^T w^k ≤ b, 0 otherwise
  Conditions: 0 < ρ < 2; b > 0
  Activation function: f(net) = sgn(net)
  Remarks: Finite convergence to the solution z^T w ≥ b > 0 if the training set is linearly separable.
Butz's rule (supervised)
  Criterion function: J(w) = −∑_i (z^i)^T w
  Learning vector: s^k = z^k if (z^k)^T w^k ≤ 0, γ z^k otherwise
  Conditions: 0 ≤ γ < 1; ρ > 0
  Activation function: f(net) = sgn(net)
  Remarks: Finite convergence if the training set is linearly separable. Places w in a region that tends to minimize the probability of error for nonlinearly separable cases.
Widrow-Hoff rule (α-LMS) (supervised)
  Criterion function: J(w) = (1/2) ∑_i [d^i − (x^i)^T w]² / ‖x^i‖²
  Learning vector: s^k = [d^k − (x^k)^T w^k] x^k, with ρ^k = α / ‖x^k‖²
  Conditions: 0 < α < 2
  Activation function: f(net) = net
  Remarks: Converges in the mean square to the minimum SSE or LMS solution if ‖x^i‖ = ‖x^j‖ for all i, j.
μ-LMS rule (supervised)
  Criterion function: J(w) = (1/2) ∑_i [d^i − (x^i)^T w]²
  Learning vector: s^k = [d^k − (x^k)^T w^k] x^k
  Conditions: μ > 0 and sufficiently small
  Activation function: f(net) = net
  Remarks: Converges in the mean square to the minimum SSE or LMS solution.
Correlation rule (supervised)
  Criterion function: J(w) = −∑_i d^i (x^i)^T w
  Learning vector: s^k = d^k x^k
  Conditions: ρ > 0
  Activation function: f(net) = net
  Remarks: Converges to the minimum SSE solution if the vectors x^k are mutually orthonormal.
Delta rule (supervised)
  Criterion function: J(w) = (1/2) ∑_i (d^i − y^i)², with y^i = f((x^i)^T w)
  Learning vector: s^k = (d^k − y^k) f′(net^k) x^k
  Conditions: 0 < ρ < 1
  Activation function: y = f(net), where f is a sigmoid function
  Remarks: Extends the μ-LMS rule to units with differentiable nonlinear activations.
Minkowski-r delta rule (supervised)
  Criterion function: J(w) = (1/r) ∑_i |d^i − y^i|^r
  Learning vector: s^k = sgn(d^k − y^k) |d^k − y^k|^{r−1} f′(net^k) x^k
  Conditions: 0 < ρ < 1
  Activation function: y = f(net), where f is a sigmoid function
  Remarks: 0 < r < 2 is suited to pseudo-Gaussian distributions p(x) with pronounced tails. r = 2 gives the delta rule. r = 1 arises when p(x) is a Laplace distribution.
Relative entropy delta rule (supervised)
  Criterion function: J(w) = (1/2) ∑_i [(1 + d^i) ln((1 + d^i)/(1 + y^i)) + (1 − d^i) ln((1 − d^i)/(1 − y^i))]
  Learning vector: s^k = β (d^k − y^k) x^k
  Conditions: 0 < ρ < 1
  Activation function: y = tanh(β net)
  Remarks: Eliminates the flat spot suffered by the delta rule. Converges to a linearly separable solution if one exists.
AHK I rule (supervised)
  Criterion function: J(w, b) = (1/2) ∑_i [(z^i)^T w − b_i]², with b_i > 0
  Margin vector: Δb_i^k = ρ₁ ε_i^k if ε_i^k > 0, 0 otherwise
  Weight vector: Δw^k = ρ₂(ρ₁ − 1) ε_i^k z^i if ε_i^k > 0, −ρ₂ ε_i^k z^i if ε_i^k ≤ 0, where ε_i^k = (z^i)^T w^k − b_i^k
  Conditions: 0 < ρ₂ < 2 / max_i ‖z^i‖² (Eq. 2.67a)
  Activation function: f(net) = sgn(net)
  Remarks: Well suited for generating robust decision surfaces for linearly separable problems (Hassoun and Song, 1992).

AHK II rule (supervised)
  Criterion function: as for AHK I, but with both positive and negative margin changes allowed
  Margin vector: Δb_i^k = ρ₁ ε_i^k if ε_i^k > −b_i^k/ρ₁, 0 otherwise
  Weight vector: Δw^k = ρ₂(ρ₁ − 1) ε_i^k z^i if ε_i^k > −b_i^k/ρ₁, −ρ₂ ε_i^k z^i if ε_i^k ≤ −b_i^k/ρ₁, where ε_i^k = (z^i)^T w^k − b_i^k
  Conditions: 0 < ρ₂ < 2 / max_i ‖z^i‖²
  Activation function: f(net) = sgn(net)
  Remarks: Well suited for generating robust decision surfaces for linearly separable problems (Hassoun and Song, 1992).
AHK III rule (supervised)
  Margin vector: as in AHK II
  Weight vector: Δw^k = ρ₂(ρ₁ − 1) ε_i^k z^i if ε_i^k > −b_i^k/ρ₁, 0 otherwise, where ε_i^k = (z^i)^T w^k − b_i^k
  Conditions: 0 < ρ₂ < 2 / max_i ‖z^i‖²
  Activation function: f(net) = sgn(net)
  Remarks: Appropriate for both linearly separable and nonlinearly separable problems; discards training vectors that cause persistent misclassifications (Hassoun and Song, 1992).
Delta rule for stochastic units (supervised)
  Learning vector: s^k = (d^k − y^k)[1 − tanh²(β net^k)] x^k
  Activation function: stochastic, with y = +1 with probability P(y = +1) and y = −1 with probability 1 − P(y = +1), where P(y = +1) = 1/(1 + e^{−2β net})
  Remarks: Performance in the average is equivalent to the delta rule applied to a unit with the deterministic activation y = tanh(β net).