
Correlation Learning Rule

The correlation learning rule is derived by starting from the criterion function

J(w) = -∑_{i=1}^{m} y^i d^i    (2.49)

where y^i = (x^i)^T w, and performing gradient descent to minimize J. Note that minimizing J(w) is equivalent to maximizing the correlation between the desired target and the corresponding linear unit's output for all x^i, i = 1, 2, ..., m. Now, employing steepest gradient descent to minimize J(w) leads to the learning rule:

w^1 = 0
w^{k+1} = w^k + ρ d^k x^k    (2.50)

By setting ρ to 1 and completing one learning cycle using Equation (2.50), we arrive at the weight vector w* given by

w* = ∑_{i=1}^{m} d^i x^i = X d    (2.51)

where X and d are as defined above. Note that Equation (2.51) leads to the minimum SSE solution in Equation (2.38) if X† = X. This is only possible if the training vectors x^k are encoded such that X X^T is the identity matrix (i.e., the x^k vectors are orthonormal).
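As a concrete illustration of Equations (2.50) and (2.51), the following minimal NumPy sketch (not part of the original text; the data are arbitrary orthonormal vectors) performs one learning cycle with ρ = 1 and checks it against the closed form w* = Xd:

```python
import numpy as np

# Columns of X are the training vectors x^i (already augmented with a bias
# component); d holds the corresponding targets d^i.
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])          # here the x^i happen to be orthonormal
d = np.array([1.0, -1.0])

# One learning cycle of Equation (2.50) with rho = 1 ...
w = np.zeros(X.shape[0])
for i in range(X.shape[1]):
    w = w + d[i] * X[:, i]

# ... equals the closed form w* = X d of Equation (2.51).
assert np.allclose(w, X @ d)

# Because X X^T = I here, w* also coincides with the minimum-SSE solution.
print(w)
```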

Another version of this type of learning is the covariance learning rule. This rule is obtained by
steepest gradient descent on the criterion function

J(w) = -∑_{i=1}^{m} (y^i - ȳ)(d^i - d̄)

Here, ȳ and d̄ are computed averages, over all training pairs, of the unit's output and the desired
target, respectively. Covariance learning provides the basis of the cascade-correlation net.


The Delta Rule

The following rule is similar to the μ-LMS rule except that it allows for units with a differentiable nonlinear activation function f. Figure 2-7 illustrates a unit with a sigmoidal activation function. Here, the unit's output is y = f(net), with net defined as the vector inner product x^T w.

Figure 2-7 A perceptron with a differentiable sigmoidal activation function.



Again, consider the training pairs {x^i, d^i}, i = 1, 2, ..., m, with x^i ∈ R^{n+1} (x^i_{n+1} = 1 for all i) and d^i ∈ [-1, +1]. Performing gradient descent on the instantaneous SSE criterion function J(w) = (1/2)(d - y)^2, whose gradient is given by

∇J(w) = -(d - y) f'(net) x    (2.52)

leads to the delta rule:

w^1 arbitrary
w^{k+1} = w^k + ρ [d^k - f(net^k)] f'(net^k) x^k = w^k + ρ δ^k x^k    (2.53)

where net^k = (x^k)^T w^k and f' = df/d net. If f is defined by f(net) = tanh(β net), then its derivative is given by

f'(net) = β [1 - f^2(net)]

For the "logistic" function, f(net) = 1/(1 + e^{-β net}), the derivative is

f'(net) = β f(net) [1 - f(net)]

Figure 2-8 plots f and f' for the hyperbolic tangent activation function with β = 1. Note how f asymptotically approaches +1 and -1 in the limit as net approaches +∞ and -∞, respectively.


Figure 2-8 Hyperbolic tangent activation function f and its derivative f', plotted for -3 ≤ net ≤ +3.


One disadvantage of the delta learning rule is immediately apparent upon inspection of the graph of f'(net) in Figure 2-8. In particular, notice how f'(net) ≈ 0 when net has large magnitude (i.e., |net| > 3); these regions are called flat spots of f'. In these flat spots, we expect the delta learning rule to progress very slowly (i.e., to make very small weight changes even when the error (d - y) is large), because the magnitude of the weight change in Equation (2.53) depends directly on the magnitude of f'(net). Since slow convergence results in excessive computation time, it is advantageous to eliminate the flat spot phenomenon when using the delta learning rule. One common flat spot elimination technique involves replacing f' by f' plus a small positive bias ε. In this case, the weight update equation reads

w^{k+1} = w^k + ρ [d^k - f(net^k)] [f'(net^k) + ε] x^k    (2.54)
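The following is a minimal sketch of the delta rule with the flat-spot bias of Equation (2.54), assuming a tanh unit; the toy data and the values of ρ, β, and ε are illustrative choices, not values taken from the text:

```python
import numpy as np

def delta_rule(X, d, rho=0.1, beta=1.0, eps=0.05, epochs=100):
    """Incremental delta rule for a single tanh unit, Equation (2.54).

    X: (m, n+1) array of training vectors (last component is the bias input 1).
    d: (m,) array of targets in [-1, +1].
    eps: small positive bias added to f'(net) to counter flat spots.
    """
    w = np.zeros(X.shape[1])              # w^1 = 0 (could also be small random)
    for _ in range(epochs):
        for x, target in zip(X, d):
            net = x @ w
            y = np.tanh(beta * net)       # f(net)
            f_prime = beta * (1.0 - y**2) # f'(net) for the tanh activation
            w += rho * (target - y) * (f_prime + eps) * x
    return w

# Toy AND-like problem, each vector augmented with a constant bias input of 1.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1], dtype=float)
print(delta_rule(X, d))
```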


One of the primary advantages of the delta rule is that it has a natural extension that may be used to
train multilayered neural nets. This extension, known as error back propagation, will be discussed
in Chapter 3.

Adaptive Ho-Kashyap (AHK) Learning Rules

Hassoun and Song (1992) proposed a set of adaptive learning rules for classification problems as
enhanced alternatives to the LMS and perceptron learning rules. In the following, three learning
rules, AHK I, AHK II, and AHK III, are derived based on gradient-descent strategies on an
appropriate criterion function. Two of the proposed learning rules, AHK I and AHK II, are well
suited for generating robust decision surfaces for linearly separable problems. The third training rule,
AHK III, extends these capabilities to find "good" approximate solutions for nonlinearly separable
problems. The three AHK learning rules preserve the simple incremental nature found in the LMS
and perceptron learning rules. The AHK rules also possess additional processing capabilities, such


as the ability to automatically identify critical cluster boundaries and place a linear decision surface
in such a way that it leads to enhanced classification robustness.

Consider a two-class {c_1, c_2} classification problem with m labeled feature vectors (training vectors) {x^i, d^i}, i = 1, 2, ..., m. Assume that x^i belongs to R^{n+1} (with the last component of x^i being a constant bias of value 1) and that d^i = +1 (-1) if x^i ∈ c_1 (c_2). Then a single perceptron can be trained to correctly classify the preceding training pairs if an (n + 1)-dimensional weight vector w is computed that satisfies the following set of m inequalities (the sgn function is assumed to be the perceptron's activation function):

(x^i)^T w > 0 if d^i = +1, and (x^i)^T w < 0 if d^i = -1, for i = 1, 2, ..., m    (2.55)

Next, if we define a set of m new vectors z^i according to


z^i = +x^i if d^i = +1; z^i = -x^i if d^i = -1, for i = 1, 2, ..., m    (2.56)

and we let

Z = [z^1 z^2 ... z^m]    (2.57)

then Equation (2.55) may be rewritten as the single matrix equation

Z^T w > 0    (2.58)

Now, defining an m-dimensional positive-valued margin vector b (b > 0) and using it in Equation
(2.58), we arrive at the following equivalent form of Equation (2.55):

Z^T w = b    (2.59)
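As a brief illustration (with made-up data), the vectors z^i of Equation (2.56) and the matrix Z of Equation (2.57) can be constructed as follows, after which the separability condition (2.58) is a one-line test:

```python
import numpy as np

# Training vectors x^i as rows, already augmented with a bias component of 1,
# and the corresponding class labels d^i in {+1, -1}.
X = np.array([[0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
d = np.array([-1.0, +1.0])

Z = (d[:, None] * X).T          # column i is z^i = d^i x^i, Equation (2.56)

w = np.array([1.0, 1.0, -1.0])  # a candidate weight vector
print(np.all(Z.T @ w > 0))      # Equation (2.58): True if w separates the set
```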


Thus the training of the perceptron is now equivalent to solving Equation (2.59) for w, subject to the constraint b > 0. Ho and Kashyap (1965) proposed an iterative algorithm for solving Equation (2.59). In the Ho-Kashyap algorithm, the components of the margin vector are first initialized to small positive values, and the pseudoinverse is used to generate a solution for w (based on the initial guess of b) that minimizes the SSE criterion function J(w, b) = (1/2) ‖Z^T w - b‖^2:

w = Z† b    (2.60)

where Z† = (Z Z^T)^{-1} Z, for m > n + 1. Next, a new estimate for the margin vector is computed by performing the constrained (b > 0) gradient descent

b^{k+1} = b^k + (1/2)(ε^k + |ε^k|), with ε^k = Z^T w^k - b^k    (2.61)


where |·| denotes the absolute value of the components of the argument vector, and b^k is the
"current" margin vector. A new estimate of w can now be computed using Equation (2.60) and
employing the updated margin vector from Equation (2.61). This process continues until all the
components of ε are zero (or are sufficiently small and positive), which is an indication of linear
separability of the training set, or until ε < 0 , which is an indication of nonlinear separability of the
training set (no solution is found). It can be shown (Ho and Kashyap, 1965) that the Ho-Kashyap
procedure converges in a finite number of steps if the training set is linearly separable. For
simulations comparing the preceding training algorithm with the LMS and perceptron training
procedures, the reader is referred to Hassoun and Clark (1988). This algorithm will be referred to
here as the direct Ho-Kashyap (DHK) algorithm.
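A compact sketch of the DHK iteration is given below. The initial margin value, tolerance, and iteration cap are illustrative; the pseudoinverse is computed once, as in Equation (2.60):

```python
import numpy as np

def direct_ho_kashyap(Z, b0=0.1, max_iter=1000, tol=1e-8):
    """Direct Ho-Kashyap (DHK): solve Z^T w = b subject to b > 0.

    Z: (n+1, m) matrix whose columns are the vectors z^i.
    Returns (w, b, separable_flag).
    """
    m = Z.shape[1]
    b = np.full(m, b0)                     # small positive initial margins
    Zt_pinv = np.linalg.pinv(Z.T)          # implements w = Z† b, Eq. (2.60)
    w = Zt_pinv @ b
    for _ in range(max_iter):
        eps = Z.T @ w - b                  # error vector, Eq. (2.61)
        if np.all(eps >= -tol):
            return w, b, True              # Z^T w >= b > 0: separating w found
        if np.all(eps <= 0):
            return w, b, False             # eps <= 0 everywhere: nonseparable
        b = b + 0.5 * (eps + np.abs(eps))  # constrained update, Eq. (2.61)
        w = Zt_pinv @ b                    # re-estimate w from the new margins
    return w, b, False

# Illustrative separable problem: columns of Z are z^i = d^i x^i.
Z = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, -1.0],
              [1.0, 1.0, -1.0, -1.0]])
w, b, ok = direct_ho_kashyap(Z)
print(ok, Z.T @ w)
```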

The direct synthesis of the w estimate in Equation (2.60) involves a one-time computation of the pseudoinverse of Z. However, this computation can be expensive and requires special treatment when Z Z^T is ill-conditioned (i.e., when the determinant of Z Z^T is close to zero). An


alternative algorithm that is based on gradient-descent principles and which does not require the
direct computation of Z† can be derived. This derivation is presented next.

Starting with the criterion function J(w, b) = (1/2) ‖Z^T w - b‖^2, gradient descent may be performed with respect to b and w so that J is minimized subject to the constraint b > 0. The gradient of J with respect to w and b is given by

∇_b J(w, b)|_{w^k, b^k} = -(Z^T w^k - b^k)    (2.62a)

∇_w J(w, b)|_{w^k, b^{k+1}} = -Z (Z^T w^k - b^{k+1})    (2.62b)

where the superscripts k and k + 1 represent current and updated values, respectively. One analytic method for imposing the constraint b > 0 is to replace the gradient in Equation (2.62a) by -(1/2)(ε + |ε|), with ε as defined in Equation (2.61). This leads to the following gradient-descent formulation of the Ho-Kashyap procedure:

b^{k+1} = b^k + (ρ_1/2)(ε^k + |ε^k|), with ε^k = Z^T w^k - b^k    (2.63a)

w^{k+1} = w^k - ρ_2 Z (Z^T w^k - b^{k+1})
        = w^k + (ρ_1 ρ_2/2) Z [|ε^k| + ε^k (1 - 2/ρ_1)]    (2.63b)

where ρ_1 and ρ_2 are strictly positive constant learning rates. Because of the requirement that all training vectors z^k (or x^k) be present and included in Z, this procedure is called the batch-mode adaptive Ho-Kashyap (AHK) procedure. It can easily be shown that if ρ_1 = 0 and b^1 = 1, Equation (2.63) reduces to the μ-LMS learning rule. Furthermore, convergence can be guaranteed (Duda and Hart, 1973) if 0 < ρ_1 < 2 and 0 < ρ_2 < 2/λ_max, where λ_max is the largest eigenvalue of the positive definite matrix Z Z^T.
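The second form of Equation (2.63b) is stated without derivation; it follows by substituting the margin update (2.63a) into the weight update. Since b^{k+1} = b^k + (ρ_1/2)(ε^k + |ε^k|),

Z^T w^k - b^{k+1} = ε^k - (ρ_1/2)(ε^k + |ε^k|)

and therefore

w^{k+1} = w^k - ρ_2 Z [ε^k (1 - ρ_1/2) - (ρ_1/2)|ε^k|] = w^k + (ρ_1 ρ_2/2) Z [|ε^k| + ε^k (1 - 2/ρ_1)]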


A completely adaptive Ho-Kashyap procedure for solving Equation (2.59) is arrived at by starting
from the instantaneous criterion function

J(w, b) = (1/2) [(z^i)^T w - b_i]^2

which leads to the following incremental update rules:

b_i^{k+1} = b_i^k + (ρ_1/2)(ε_i^k + |ε_i^k|), with ε_i^k = (z^i)^T w^k - b_i^k    (2.64a)

w^{k+1} = w^k - ρ_2 z^i [(z^i)^T w^k - b_i^{k+1}]
        = w^k + (ρ_1 ρ_2/2) [|ε_i^k| + ε_i^k (1 - 2/ρ_1)] z^i    (2.64b)

Here, b_i represents a scalar margin associated with the input x^i. In all the preceding Ho-Kashyap learning procedures, the margin values are initialized to small positive values, and the perceptron

weights are initialized to zero (or small random) values. If full margin error correction is assumed in
Equation (2.64a), i.e., ρ1 = 1, the incremental learning procedure in Equation (2.64) reduces to the
heuristically derived procedure reported in Hassoun and Clark (1988). An alternative way of writing
Equation (2.64) is

Δb_i = ρ_1 ε_i^k and Δw = ρ_2 (ρ_1 - 1) ε_i^k z^i, if ε_i^k > 0    (2.65a)

Δb_i = 0 and Δw = -ρ_2 ε_i^k z^i, if ε_i^k ≤ 0    (2.65b)

where Δb and Δw signify the difference between the updated and current values of b and w, respectively. This procedure is called the AHK I learning rule. For comparison purposes, it may be noted that the μ-LMS rule in Equation (2.35) can be written as Δw = -μ ε_i^k z^i, with b_i held fixed at +1.


The implied constraint b_i > 0 in Equations (2.64) and (2.65) was realized by starting with a positive initial margin and restricting the change Δb to positive real values. An alternative, more flexible way to realize this constraint is to allow both positive and negative changes in Δb, except for the cases where a decrease in b_i results in a negative margin. This modification results in the following alternative AHK II learning rule:

Δb_i = ρ_1 ε_i^k and Δw = ρ_2 (ρ_1 - 1) ε_i^k z^i, if b_i^k + ρ_1 ε_i^k > 0    (2.66a)

Δb_i = 0 and Δw = -ρ_2 ε_i^k z^i, if b_i^k + ρ_1 ε_i^k ≤ 0    (2.66b)

In the general case of an adaptive margin, as in Equation (2.66), Hassoun and Song (1992) showed
that a sufficient condition for the convergence of the AHK rules is given by

0 < ρ_2 < 2 / max_i ‖z^i‖^2    (2.67a)


0 < ρ_1 < 2    (2.67b)

Another variation results in the AHK III rule, which is appropriate for both linearly separable and
nonlinearly separable problems. Here, Δw is set to 0 in Equation (2.66b). The advantages of the
AHK III rule are that

(1) it is capable of adaptively identifying difficult-to-separate class boundaries and

(2) it uses such information to discard nonseparable training vectors and speed up convergence
(Hassoun and Song, 1992).
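All three variants can be realized in one incremental sweep over the training set. The sketch below is an illustrative implementation of Equations (2.65) and (2.66), with the mode argument selecting AHK I, II, or III; the data, learning rates, and initial margins are arbitrary choices satisfying the conditions in Equation (2.67):

```python
import numpy as np

def ahk_train(Z, mode="I", rho1=0.5, rho2=0.1, b0=0.1, epochs=500):
    """Incremental AHK training, Equations (2.65)-(2.66).

    Z: (n+1, m) array whose columns are the vectors z^i of Equation (2.56).
    mode: "I", "II", or "III" selects the AHK variant.
    """
    n_plus_1, m = Z.shape
    w = np.zeros(n_plus_1)                  # weights initialized to zero
    b = np.full(m, b0)                      # small positive initial margins
    for _ in range(epochs):
        for i in range(m):
            z = Z[:, i]
            eps = z @ w - b[i]              # eps_i^k = (z^i)^T w^k - b_i^k
            if mode == "I":
                increase = eps > 0                  # b_i may only grow
            else:
                increase = b[i] + rho1 * eps > 0    # b_i may also shrink
            if increase:
                b[i] += rho1 * eps                  # margin update
                w += rho2 * (rho1 - 1.0) * eps * z  # weight update
            elif mode != "III":
                w -= rho2 * eps * z         # correction step, (2.65b)/(2.66b)
            # In AHK III the second branch sets Delta w = 0, in effect
            # discarding training vectors that resist separation.
    return w, b

# Illustrative separable data: columns are z^i = d^i x^i (bias included).
Z = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, -1.0],
              [1.0, 1.0, -1.0, -1.0]])
w, b = ahk_train(Z, mode="I")
print(np.all(Z.T @ w > 0))                  # True for a separating solution
```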

Example 2.2 In this example the perceptron, LMS, and AHK learning rules are compared in terms
of the quality of the solutions they generate. Consider the simple two-class linearly separable
problem shown earlier in Figure 2-4. The μ -LMS rule of Equation (2.35) is used to obtain the
solution shown as a dashed line in Figure 2-9. Here, the initial weight vector was set to 0 and a
learning rate μ = 0.005 is used. This solution is not one of the linearly separable solutions for this

problem. Four examples of linearly separable solutions are shown as solid lines in the figure. These
solutions are generated using the perceptron learning rule of Equation (2.2), with varying order of
input vector presentations and with a learning rate of ρ = 0.1. Here, it should be noted that the most robust solution, in the sense of tolerance to noisy input, is given by x_2 = x_1 + 1/2, which is shown as a dotted line in Figure 2-9. This robust solution was in fact automatically generated by the AHK I learning rule of Equation (2.65).

Other Criterion Functions

The SSE criterion function in Equation (2.32) is not the only possible choice. We have already seen
other alternative functions such as the ones in Equations (2.20), (2.24), and (2.25). In general, any
differentiable function that is minimized upon setting y^i = d^i, for i = 1, 2, ..., m, could be used. One
possible generalization of SSE is the Minkowski-r criterion function (Hanson and Burr, 1988) given
by


J(w) = (1/r) ∑_{i=1}^{m} |d^i - y^i|^r    (2.68)

or its instantaneous version

J(w) = (1/r) |d^i - y^i|^r    (2.69)


Figure 2-9 LMS-generated decision boundary (dashed line) for a two-class linearly separable problem. For comparison, four
solutions generated using the perceptron learning rule are shown (solid lines). The dotted line is the solution generated by the
AHK I rule.


Figure 2-10 shows a plot of |d^i - y^i|^r for r = 1, 1.5, 2, and 20. The general form of the gradient of this criterion function is given by

∇J(w) = -sgn(d - y) |d - y|^{r-1} f'(net) x    (2.70)

Note that for r = 2 this reduces to the gradient of the SSE criterion function given by Equation (2.52). If r = 1, then J(w) = |d^i - y^i|, with the gradient [note that the gradient of J(w) does not exist at the solution points d = y]

∇J(w) = -sgn(d - y) f'(net) x    (2.71)

In this case, the criterion function in Equation (2.68) is known as the Manhattan norm. For r → ∞ , a
supremum error measure is approached.
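For illustration, one incremental Minkowski-r step following the gradient (2.70) might look as follows for a tanh unit; r and the remaining constants are example values, not values from the text:

```python
import numpy as np

def minkowski_r_step(w, x, d, r=1.5, rho=0.1, beta=1.0):
    """One incremental step of the Minkowski-r delta rule, Equation (2.70)."""
    net = x @ w
    y = np.tanh(beta * net)
    f_prime = beta * (1.0 - y**2)
    # grad J = -sgn(d - y)|d - y|^(r-1) f'(net) x, so step against it:
    return w + rho * np.sign(d - y) * np.abs(d - y)**(r - 1) * f_prime * x

w = np.zeros(3)
x = np.array([1.0, 0.5, 1.0])       # augmented with a bias component
w = minkowski_r_step(w, x, d=+1.0)
print(w)
```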

A small r gives less weight to large deviations and tends to reduce the influence of the outermost points in the input space during learning. It can be shown, for a linear unit with normally distributed inputs, that r = 2 is an appropriate choice in the sense of both minimum SSE and minimum probability of prediction error (maximum likelihood). The proof is as follows.

Figure 2-10 A family of instantaneous Minkowski-r criterion functions


Another criterion function that can be used (Hopfield, 1987) is the instantaneous relative entropy
error measure (Kullback, 1959) defined by

J(w) = (1/2) [(1 + d) ln((1 + d)/(1 + y)) + (1 - d) ln((1 - d)/(1 - y))]    (2.76)

where d belongs to the open interval (-1, +1). As before, J(w) ≥ 0, and if y = d, then J(w) = 0. If y = f(net) = tanh(β net), the gradient of Equation (2.76) is

∇J(w) = -β (d - y) x    (2.77)

The factor f'(net) in Equations (2.53) and (2.70) is missing from Equation (2.77). This eliminates the flat spot encountered in the delta rule and makes the training here more like μ-LMS [note, however, that y here is given by y = f(net) ≠ net]. This entropy criterion is "well formed" in the sense that gradient descent over such a function will result in a linearly separable solution, if one

exists (Hertz et al., 1991). On the other hand, gradient descent on the SSE criterion function does not
share this property, since it may fail to find a linearly separable solution, as demonstrated in
Example 2.2.
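A one-step sketch of the resulting update (2.77) is shown below; note that, unlike the delta rule, no f'(net) factor appears. The data and constants are illustrative:

```python
import numpy as np

def entropy_delta_step(w, x, d, rho=0.1, beta=1.0):
    """One step of the relative-entropy delta rule, Equation (2.77).

    Note the absence of the f'(net) factor: there are no flat spots.
    """
    y = np.tanh(beta * (x @ w))
    return w + rho * beta * (d - y) * x

w = np.zeros(3)
x = np.array([1.0, -0.5, 1.0])            # augmented with a bias component
print(entropy_delta_step(w, x, d=0.9))    # d must lie in (-1, +1)
```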

In order for gradient-descent search to find a solution w* in the desired linearly separable region, we need to use a well-formed criterion function. Consider the following general criterion function:

J(w) = ∑_{i=1}^{m} g((z^i)^T w)    (2.78)

where

z = +x if x ∈ class c_1; z = -x if x ∈ class c_2

Let s = z^T w. The criterion function J(w) is said to be well formed if g(s) is differentiable and satisfies the following conditions (Wittner and Denker, 1988):

1. For all s, -dg(s)/ds ≥ 0; i.e., g does not push in the wrong direction.

2. There exists ε > 0 such that -dg(s)/ds ≥ ε for all s ≤ 0; i.e., g keeps pushing if there is a misclassification.

3. g(s) is bounded from below.

For a single unit with weight vector w, it can be shown (Wittner and Denker, 1988) that if the criterion function is well formed, then gradient descent is guaranteed to enter the region of linearly separable solutions w*, provided that such a region exists.
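For instance, the logistic loss g(s) = ln(1 + e^{-s}) (an assumed example, not one discussed in the text) satisfies all three conditions with ε = 1/2, as the quick numerical check below suggests:

```python
import numpy as np

s = np.linspace(-10.0, 10.0, 2001)
g = np.log1p(np.exp(-s))                # g(s) = ln(1 + e^{-s}), >= 0 everywhere
minus_dg = 1.0 / (1.0 + np.exp(s))      # -dg/ds, derived analytically

print(np.all(minus_dg >= 0))            # condition 1: never pushes wrongly
print(np.all(minus_dg[s <= 0] >= 0.5))  # condition 2 with eps = 1/2 on s <= 0
print(g.min() >= 0)                     # condition 3: bounded from below
```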

Table 2-1 Summary of Basic Learning Rules

Notation: z^k = +x^k if d^k = +1 and z^k = -x^k if d^k = -1; net = x^T w. Each rule below is written in the general form w^{k+1} = w^k + ρ^k s^k, where ρ^k is the learning rate and s^k is the learning vector. All rules are supervised.

Perceptron rule
Criterion function: J(w) = -∑_{z^T w ≤ 0} z^T w
Learning vector: s^k = z^k if (z^k)^T w^k ≤ 0; s^k = 0 otherwise
Conditions: ρ > 0
Activation function: f(net) = sgn(net)
Remarks: Finite convergence time if the training set is linearly separable. w stays bounded for arbitrary training sets.

Perceptron rule with variable learning rate and fixed margin
Criterion function: J(w) = -∑_{z^T w ≤ b} (z^T w - b)
Learning vector: s^k = z^k if (z^k)^T w^k ≤ b; s^k = 0 otherwise
Conditions: b > 0, and ρ^k satisfies (1) ρ^k ≥ 0; (2) ∑_{k=1}^{m} ρ^k = ∞; (3) lim_{m→∞} [∑_{k=1}^{m} (ρ^k)^2] / [∑_{k=1}^{m} ρ^k]^2 = 0
Activation function: f(net) = sgn(net)
Remarks: Converges to z^T w > b if the training set is linearly separable. Finite convergence if ρ^k = ρ, where ρ is a finite positive constant.

May's rule
Criterion function: J(w) = (1/2) ∑_{z^T w ≤ b} (z^T w - b)^2 / ‖z‖^2
Learning vector: s^k = [(b - (z^k)^T w^k) / ‖z^k‖^2] z^k if (z^k)^T w^k ≤ b; s^k = 0 otherwise
Conditions: 0 < ρ < 2, b > 0
Activation function: f(net) = sgn(net)
Remarks: Finite convergence to the solution z^T w ≥ b > 0 if the training set is linearly separable.

Butz's rule
Criterion function: J(w) = -∑_i (z^i)^T w
Learning vector: s^k = z^k if (z^k)^T w^k ≤ 0; s^k = γ z^k otherwise
Conditions: 0 ≤ γ < 1, ρ > 0
Activation function: f(net) = sgn(net)
Remarks: Finite convergence if the training set is linearly separable. Places w in a region that tends to minimize the probability of error for nonlinearly separable cases.

Widrow-Hoff rule (α-LMS)
Criterion function: J(w) = (1/2) ∑_i [d^i - (x^i)^T w]^2 / ‖x^i‖^2
Learning vector: s^k = [d^k - (x^k)^T w^k] x^k, with ρ^k = α/‖x^k‖^2
Conditions: 0 < α < 2
Activation function: f(net) = net
Remarks: Converges in the mean square to the minimum SSE or LMS solution if ‖x^i‖ = ‖x^j‖ for all i, j.

μ-LMS rule
Criterion function: J(w) = (1/2) ∑_i [d^i - (x^i)^T w]^2
Learning vector: s^k = [d^k - (x^k)^T w^k] x^k
Conditions: 0 < ρ < 2/(3‖x‖^2)
Activation function: f(net) = net
Remarks: Converges in the mean square to the minimum SSE or LMS solution.

Stochastic μ-LMS rule
Criterion function: J(w) = (1/2) ⟨[d^i - (x^i)^T w]^2⟩, where ⟨·⟩ is the mean operator
Learning vector: s^k = [d^k - (x^k)^T w^k] x^k
Conditions: ρ^k satisfies (1) ρ^k ≥ 0; (2) ∑_{k=1}^{∞} ρ^k = +∞; (3) ∑_{k=1}^{∞} (ρ^k)^2 < ∞
Activation function: f(net) = net
Remarks: At each learning step the training vector x^k is drawn at random. Converges in the mean square to the minimum SSE or LMS solution.

Correlation rule
Criterion function: J(w) = -∑_i d^i (x^i)^T w
Learning vector: s^k = d^k x^k
Conditions: ρ > 0
Activation function: f(net) = net
Remarks: Converges to the minimum SSE solution if the vectors x^k are mutually orthonormal.

Delta rule
Criterion function: J(w) = (1/2) ∑_i (d^i - y^i)^2, with y^i = f((x^i)^T w)
Learning vector: s^k = (d^k - y^k) f'(net^k) x^k
Conditions: 0 < ρ < 1
Activation function: y = f(net), where f is a sigmoid function
Remarks: Extends the μ-LMS rule to cases with differentiable nonlinear activations.

Minkowski-r delta rule
Criterion function: J(w) = (1/r) ∑_i |d^i - y^i|^r
Learning vector: s^k = sgn(d^k - y^k) |d^k - y^k|^{r-1} f'(net^k) x^k
Conditions: 0 < ρ < 1
Activation function: y = f(net), where f is a sigmoid function
Remarks: 0 < r < 2 is appropriate for a pseudo-Gaussian distribution p(x) with pronounced tails; r = 2 gives the delta rule; r = 1 arises when p(x) is a Laplace distribution.

Relative entropy delta rule
Criterion function: J(w) = (1/2) ∑_i [(1 + d^i) ln((1 + d^i)/(1 + y^i)) + (1 - d^i) ln((1 - d^i)/(1 - y^i))]
Learning vector: s^k = β (d^k - y^k) x^k
Conditions: 0 < ρ < 1
Activation function: y = tanh(β net)
Remarks: Eliminates the flat spot suffered by the delta rule. Converges to a linearly separable solution if one exists.

AHK I rule
Criterion function: J(w, b) = (1/2) ∑_i [(z^i)^T w - b_i]^2, with ε_i^k = (z^i)^T w^k - b_i^k
Margin update: Δb_i = ρ_1 ε_i^k if ε_i^k > 0; Δb_i = 0 otherwise
Weight update: Δw = ρ_2 (ρ_1 - 1) ε_i^k z^i if ε_i^k > 0; Δw = -ρ_2 ε_i^k z^i if ε_i^k ≤ 0
Conditions: b^1 > 0, 0 < ρ_1 < 2, 0 < ρ_2 < 2 / max_i ‖z^i‖^2
Activation function: f(net) = sgn(net)
Remarks: The b_i values can only increase from their initial values. Converges to a robust solution for linearly separable problems.

AHK II rule (with margin vector b > 0)
Criterion function: same as AHK I
Margin update: Δb_i^k = ρ_1 ε_i^k if ε_i^k > -b_i^k/ρ_1; Δb_i^k = 0 otherwise
Weight update: Δw = ρ_2 (ρ_1 - 1) ε_i^k z^i if ε_i^k > -b_i^k/ρ_1; Δw = -ρ_2 ε_i^k z^i if ε_i^k ≤ -b_i^k/ρ_1
Conditions: b^1 > 0, 0 < ρ_1 < 2, 0 < ρ_2 < 2 / max_i ‖z^i‖^2
Activation function: f(net) = sgn(net)
Remarks: The b_i values can take any positive value. Converges to a robust solution for linearly separable problems.

AHK III rule (with margin vector b > 0)
Criterion function: same as AHK I
Margin update: Δb_i^k = ρ_1 ε_i^k if ε_i^k > -b_i^k/ρ_1; Δb_i^k = 0 otherwise
Weight update: Δw = ρ_2 (ρ_1 - 1) ε_i^k z^i if ε_i^k > -b_i^k/ρ_1; Δw = 0 otherwise
Conditions: b^1 > 0, 0 < ρ_1 < 2, 0 < ρ_2 < 2 / max_i ‖z^i‖^2
Activation function: f(net) = sgn(net)
Remarks: Converges for linearly separable as well as nonlinearly separable cases. It automatically identifies and discards the critical points affecting nonlinear separability, and results in a solution that tends to minimize misclassifications.

Delta rule for stochastic units
Criterion function: J(w) = (1/2) ∑_i (d^i - ⟨y^i⟩)^2
Learning vector: s^k = β [d^k - tanh(β net^k)] [1 - tanh^2(β net^k)] x^k
Conditions: 0 < ρ < 1
Activation function: stochastic, y = +1 with probability P(y = +1) and y = -1 with probability 1 - P(y = +1), where P(y = +1) = 1/(1 + e^{-2β net})
Remarks: Average performance is equivalent to the delta rule applied to a unit with the deterministic activation y = tanh(β net).
