Correlation Learning Rule
The correlation learning rule is derived by starting from the criterion function

J(w) = −∑_{i=1}^{m} y^i d^i    (2.49)

where y^i = (x^i)^T w, and performing gradient descent to minimize J. Note that minimizing J(w) is equivalent to maximizing the correlation between the desired target and the corresponding linear unit's output for all x^i, i = 1, 2, ..., m. Now, employing steepest gradient descent to minimize J(w) leads to the learning rule:
w^1 = 0
w^{k+1} = w^k + ρ d^k x^k    (2.50)
By setting ρ to 1 and completing one learning cycle using Equation (2.50), we arrive at the weight
vector w * given by
w* = ∑_{i=1}^{m} d^i x^i = Xd    (2.51)
where X and d are as defined above. Note that Equation (2.51) leads to the minimum SSE solution
in Equation (2.38) if X† = X. This is only possible if the training vectors x^k are encoded such that
XX^T is the identity matrix (i.e., the x^k vectors are orthonormal).
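To make the one-cycle computation concrete, the following sketch implements Equations (2.50)-(2.51) directly (the function name, the use of NumPy, and the orthonormal test data are our own illustrative choices, not from the text):

```python
import numpy as np

def correlation_rule(X, d):
    """One learning cycle of the correlation rule, Eqs. (2.50)-(2.51).

    X : (n+1) x m matrix whose columns are the x^i training vectors.
    d : length-m vector of desired targets.
    Returns w* = sum_i d^i x^i, i.e. X d.
    """
    w = np.zeros(X.shape[0])            # w^1 = 0
    for k in range(X.shape[1]):         # one pass with rho = 1
        w += d[k] * X[:, k]             # w^{k+1} = w^k + d^k x^k
    return w                            # equals X @ d

# When the x^i are orthonormal (X X^T = I), w* is also the minimum-SSE
# solution, as noted above:
X = np.linalg.qr(np.random.randn(4, 4))[0]   # orthonormal columns
d = np.array([1.0, -1.0, 1.0, -1.0])
print(np.allclose(X.T @ correlation_rule(X, d), d))   # True: y^i = d^i
```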
Another version of this type of learning is the covariance learning rule. This rule is obtained by
steepest gradient descent on the criterion function
J(w) = −∑_{i=1}^{m} (y^i − ȳ)(d^i − d̄)
Here, ȳ and d̄ are computed averages, over all training pairs, of the unit's output and the desired
target, respectively. Covariance learning provides the basis of the cascade-correlation net.
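A minimal sketch of the corresponding covariance computation (again our own formulation, written as a single learning cycle from w = 0 for symmetry with the correlation rule above):

```python
import numpy as np

def covariance_rule(X, d, rho=1.0):
    """One learning cycle of covariance learning: steepest descent on
    J(w) = -sum_i (y^i - y_bar)(d^i - d_bar), starting from w = 0."""
    d_centered = d - d.mean()
    # Since sum_i (d^i - d_bar) = 0, the average-output term drops out of
    # the gradient, leaving -grad J = X (d - d_bar); compare Eq. (2.51).
    return rho * (X @ d_centered)
```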
The following rule is similar to the μ -LMS rule except that it allows for units with a differentiable
nonlinear activation function f. Figure 2-7 illustrates a unit with a sigmoidal activation function.
Here, the unit's output is y = f(net), with net defined as the vector inner product x^T w.
Again, consider the training pairs {x^i, d^i}, i = 1, 2, ..., m, with x^i ∈ R^{n+1} (x^i_{n+1} = 1 for all i) and d^i ∈ [−1, +1]. Performing gradient descent on the instantaneous SSE criterion function J(w) = (1/2)(d − y)², whose gradient is given by

∇J(w) = −(d − y) f′(net) x    (2.52)

leads to the delta learning rule:
w^1 arbitrary
w^{k+1} = w^k + ρ[d^k − f(net^k)] f′(net^k) x^k = w^k + ρ δ^k x^k    (2.53)
where net^k = (x^k)^T w^k and f′ = df/d(net). If f is defined by f(net) = tanh(β net), then its derivative is given by f′(net) = β[1 − f²(net)]. For the "logistic" function, f(net) = 1/(1 + e^{−β net}), the derivative is f′(net) = β f(net)[1 − f(net)]. Figure 2-8 plots f and f′ for the hyperbolic tangent activation function with β = 1. Note how f asymptotically approaches +1 and −1 in the limit as net approaches +∞ and −∞, respectively.
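The delta rule in Equation (2.53) can be sketched as follows (an illustrative NumPy implementation under our own choice of hyperparameters; the hyperbolic tangent activation is the one discussed above):

```python
import numpy as np

def delta_rule(X, d, beta=1.0, rho=0.1, epochs=100, seed=0):
    """Incremental delta rule, Eq. (2.53), with f(net) = tanh(beta * net).

    X : (n+1) x m matrix of training vectors (last component = 1, the bias).
    d : length-m vector of targets in [-1, +1].
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[0])   # w^1 arbitrary
    for _ in range(epochs):
        for k in range(X.shape[1]):
            net = X[:, k] @ w
            y = np.tanh(beta * net)              # f(net)
            f_prime = beta * (1.0 - y ** 2)      # f'(net) = beta*(1 - f^2)
            w += rho * (d[k] - y) * f_prime * X[:, k]   # Eq. (2.53)
    return w
```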
Figure 2-8 Hyperbolic tangent activation function f and its derivative f ′ , plotted for −3 ≤ net ≤ +3 .
One disadvantage of the delta learning rule is immediately apparent upon inspection of the graph of
f′(net) in Figure 2-8. In particular, notice how f′(net) ≈ 0 when net has large magnitude (i.e.,
|net| > 3); these regions are called flat spots of f′. In these flat spots, we expect the delta learning
rule to progress very slowly (i.e., very small weight changes even when the error ( d − y ) is large),
because the magnitude of the weight change in Equation (2.53) directly depends on the magnitude of
f ′ ( net ) . Since slow convergence results in excessive computation time, it would be advantageous to
try to eliminate the flat spot phenomenon when using the delta learning rule. One common flat spot
elimination technique involves replacing f′ by f′ plus a small positive bias ε. In this case, the weight update equation reads as

w^{k+1} = w^k + ρ[d^k − f(net^k)][f′(net^k) + ε] x^k    (2.54)
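In code, this modification is a one-line change to the update in the delta-rule sketch above; here is a minimal variant (the value ε = 0.05 is our own illustrative choice):

```python
import numpy as np

def delta_rule_flat_spot(X, d, beta=1.0, rho=0.1, eps=0.05, epochs=100):
    """Delta rule with flat-spot elimination: f' replaced by f' + eps."""
    w = np.zeros(X.shape[0])
    for _ in range(epochs):
        for k in range(X.shape[1]):
            y = np.tanh(beta * (X[:, k] @ w))
            f_prime = beta * (1.0 - y ** 2)
            # The small positive bias eps keeps updates alive even when
            # |net| is large and f'(net) is nearly zero.
            w += rho * (d[k] - y) * (f_prime + eps) * X[:, k]
    return w
```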
One of the primary advantages of the delta rule is that it has a natural extension that may be used to
train multilayered neural nets. This extension, known as error back propagation, will be discussed
in Chapter 3.
Hassoun and Song (1992) proposed a set of adaptive learning rules for classification problems as
enhanced alternatives to the LMS and perceptron learning rules. In the following, three learning
rules, AHK I, AHK II, and AHK III, are derived based on gradient-descent strategies on an
appropriate criterion function. Two of the proposed learning rules, AHK I and AHK II, are well
suited for generating robust decision surfaces for linearly separable problems. The third training rule,
AHK III, extends these capabilities to find "good" approximate solutions for nonlinearly separable
problems. The three AHK learning rules preserve the simple incremental nature found in the LMS
and perceptron learning rules. The AHK rules also possess additional processing capabilities, such
as the ability to automatically identify critical cluster boundaries and place a linear decision surface
in such a way that it leads to enhanced classification robustness.
Consider a two-class {c₁, c₂} classification problem with m labeled feature vectors (training vectors) {x^i, d^i}, i = 1, 2, ..., m. It is assumed that x^i ∈ R^{n+1} (with the last component of x^i being a bias of value 1) and that d^i = +1 (−1) if x^i ∈ c₁ (c₂). Then, a single perceptron can be trained to correctly classify the preceding training pairs if an (n + 1)-dimensional weight vector w is computed that satisfies the following set of m inequalities (the sgn function is assumed to be the perceptron's activation function):
(x^i)^T w > 0 if d^i = +1, and (x^i)^T w < 0 if d^i = −1, for i = 1, 2, ..., m    (2.55)
If we define a new set of vectors z^i according to

z^i = +x^i if d^i = +1, and z^i = −x^i if d^i = −1, for i = 1, 2, ..., m    (2.56)

and we let

Z = [z^1 z^2 ... z^m]    (2.57)

then Equation (2.55) may be rewritten compactly as

Z^T w > 0    (2.58)
Now, defining an m-dimensional positive-valued margin vector b (b > 0) and using it in Equation
(2.58), we arrive at the following equivalent form of Equation (2.55):
Z^T w = b    (2.59)
Thus the training of the perceptron is now equivalent to solving Equation (2.59) for w, subject to the constraint b > 0. Ho and Kashyap (1965) proposed an iterative algorithm for solving Equation (2.59). In the Ho-Kashyap algorithm, the components of the margin vector are first initialized to small positive values, and the pseudoinverse is used to generate a solution for w (based on the initial guess of b) that minimizes the SSE criterion function J(w, b) = (1/2)‖Z^T w − b‖²:

w = Z†b    (2.60)

where Z† = (ZZ^T)^{−1} Z, for m > n + 1. Next, a new estimate for the margin vector is computed by
b^{k+1} = b^k + (1/2)[ε^k + |ε^k|]  with  ε^k = Z^T w^k − b^k    (2.61)
where |·| denotes the absolute value of the components of the argument vector, and b^k is the
"current" margin vector. A new estimate of w can now be computed using Equation (2.60) and
employing the updated margin vector from Equation (2.61). This process continues until all the
components of ε are zero (or are sufficiently small and positive), which is an indication of linear
separability of the training set, or until ε < 0 , which is an indication of nonlinear separability of the
training set (no solution is found). It can be shown (Ho and Kashyap, 1965) that the Ho-Kashyap
procedure converges in a finite number of steps if the training set is linearly separable. For
simulations comparing the preceding training algorithm with the LMS and perceptron training
procedures, the reader is referred to Hassoun and Clark (1988). This algorithm will be referred to
here as the direct Ho-Kashyap (DHK) algorithm.
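A minimal sketch of the DHK algorithm as described above (the stopping tests, tolerance, and function signature are our own choices):

```python
import numpy as np

def direct_ho_kashyap(Z, b0=0.1, max_iter=1000, tol=1e-8):
    """Direct Ho-Kashyap (DHK) iteration, Eqs. (2.60)-(2.61).

    Z : (n+1) x m matrix whose columns are the z^i vectors
        (x^i for class 1 and -x^i for class 2), with m > n + 1.
    Returns (w, b, separable).
    """
    m = Z.shape[1]
    b = np.full(m, b0)                      # small positive initial margins
    Zt_pinv = np.linalg.pinv(Z.T)           # equals (Z Z^T)^{-1} Z here
    w = Zt_pinv @ b                         # Eq. (2.60): w = Z^dagger b
    for _ in range(max_iter):
        eps = Z.T @ w - b                   # error vector
        if np.all(np.abs(eps) < tol):
            return w, b, True               # linearly separable solution
        if np.all(eps <= 0):
            return w, b, False              # nonlinear separability detected
        b = b + 0.5 * (eps + np.abs(eps))   # Eq. (2.61)
        w = Zt_pinv @ b                     # re-solve Eq. (2.60)
    return w, b, False
```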
The direct synthesis of the w estimate in Equation (2.60) involves a one-time computation of the pseudoinverse of Z. However, this computation can be expensive and requires special treatment when ZZ^T is ill-conditioned (i.e., the determinant of ZZ^T is close to zero). An
alternative algorithm that is based on gradient-descent principles and which does not require the
direct computation of Z† can be derived. This derivation is presented next.
Starting with the criterion function J(w, b) = (1/2)‖Z^T w − b‖², gradient descent may be performed with
respect to b and w so that J is minimized subject to the constraint b > 0. The gradient of J with
respect to w and b is given by
∇_b J(w, b)|_{w^k, b^k} = −(Z^T w^k − b^k)    (2.62a)
∇_w J(w, b)|_{w^k, b^{k+1}} = Z(Z^T w^k − b^{k+1})    (2.62b)
where the superscripts k and k + 1 represent current and updated values, respectively. One analytic
method for imposing the constraint b > 0 is to replace the gradient in Equation (2.62a) by
−0.5(ε + |ε|), with ε as defined in Equation (2.61). This leads to the following gradient-descent
formulation of the Ho-Kashyap procedure:
b^{k+1} = b^k + (ρ₁/2)(ε^k + |ε^k|)  with  ε^k = Z^T w^k − b^k    (2.63a)

and

w^{k+1} = w^k − ρ₂ Z(Z^T w^k − b^{k+1}) = w^k + (ρ₁ρ₂/2) Z[|ε^k| + ε^k(1 − 2/ρ₁)]    (2.63b)
where ρ1 and ρ 2 are strictly positive constant learning rates. Because of the requirement that all
training vectors z k (or x k ) be present and included in Z, this procedure is called the batch-mode
adaptive Ho-Kashyap (AHK) procedure. It can be easily shown that if ρ₁ = 0 and b^1 = 1, Equation (2.63) reduces to the μ-LMS learning rule. Furthermore, convergence can be guaranteed (Duda and Hart, 1973) if 0 < ρ₁ < 2 and 0 < ρ₂ < 2/λ_max, where λ_max is the largest eigenvalue of the positive definite matrix ZZ^T.
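A batch-mode AHK sketch following Equations (2.63a)-(2.63b) (setting ρ₂ = 1/λ_max by default, which satisfies the stated convergence condition; the remaining names and defaults are our own):

```python
import numpy as np

def batch_ahk(Z, rho1=0.5, rho2=None, b0=0.1, epochs=500):
    """Batch-mode adaptive Ho-Kashyap (AHK), Eqs. (2.63a)-(2.63b)."""
    m = Z.shape[1]
    if rho2 is None:
        # Convergence requires 0 < rho2 < 2/lambda_max (Duda and Hart, 1973).
        lam_max = np.linalg.eigvalsh(Z @ Z.T).max()
        rho2 = 1.0 / lam_max
    b = np.full(m, b0)                        # small positive initial margins
    w = np.zeros(Z.shape[0])
    for _ in range(epochs):
        eps = Z.T @ w - b                     # eps^k = Z^T w^k - b^k
        b = b + 0.5 * rho1 * (eps + np.abs(eps))   # Eq. (2.63a)
        w = w - rho2 * (Z @ (Z.T @ w - b))         # Eq. (2.63b), uses b^{k+1}
    return w, b
```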
A completely adaptive Ho-Kashyap procedure for solving Equation (2.59) is arrived at by starting
from the instantaneous criterion function
J(w, b) = (1/2)[(z^i)^T w − b_i]²

Performing gradient descent with respect to b and w, as before, gives

b_i^{k+1} = b_i^k + (ρ₁/2)(ε_i^k + |ε_i^k|)  with  ε_i^k = (z^i)^T w^k − b_i^k    (2.64a)

and

w^{k+1} = w^k − ρ₂ z^i[(z^i)^T w^k − b_i^{k+1}] = w^k + (ρ₁ρ₂/2)[|ε_i^k| + ε_i^k(1 − 2/ρ₁)] z^i    (2.64b)
Here, b_i represents a scalar margin associated with the input x^i. In all the preceding Ho-Kashyap learning procedures, the margin values are initialized to small positive values, and the perceptron
weights are initialized to zero (or small random) values. If full margin error correction is assumed in
Equation (2.64a), i.e., ρ1 = 1, the incremental learning procedure in Equation (2.64) reduces to the
heuristically derived procedure reported in Hassoun and Clark (1988). An alternative way of writing Equation (2.64) is

Δb_i^k = ρ₁ ε_i^k if ε_i^k > 0, and 0 otherwise    (2.65a)

Δw^k = ρ₂(ρ₁ − 1) ε_i^k z^i if ε_i^k > 0, and −ρ₂ ε_i^k z^i if ε_i^k ≤ 0    (2.65b)

where Δb and Δw signify the difference between the updated and current values of b and w,
respectively. This procedure is called the AHK I learning rule. For comparison purposes, it may be
noted that the μ-LMS rule in Equation (2.35) can be written as Δw = −μ ε_i^k z^i, with b_i held fixed at +1.
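The incremental AHK I rule of Equation (2.64) can be sketched as follows (initialization and hyperparameters are our own illustrative choices, following the initialization guidance above):

```python
import numpy as np

def ahk1(Z, rho1=1.0, rho2=0.1, b0=0.1, epochs=200):
    """Incremental AHK I rule, Eqs. (2.64a)-(2.64b): per-pattern margins."""
    n_plus_1, m = Z.shape
    w = np.zeros(n_plus_1)                 # weights start at zero
    b = np.full(m, b0)                     # small positive initial margins
    for _ in range(epochs):
        for i in range(m):
            eps = Z[:, i] @ w - b[i]       # eps_i^k = (z^i)^T w^k - b_i^k
            b[i] += 0.5 * rho1 * (eps + abs(eps))        # Eq. (2.64a)
            w -= rho2 * (Z[:, i] @ w - b[i]) * Z[:, i]   # Eq. (2.64b)
    return w, b
```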
The implied constraint b_i > 0 in Equations (2.64) and (2.65) was realized by starting with a positive initial margin and restricting the change Δb to positive real values. An alternative, more flexible way to realize this constraint is to allow both positive and negative changes in Δb, except for the cases where a decrease in b_i results in a negative margin. This modification results in the following alternative AHK II learning rule:

Δb_i^k = ρ₁ ε_i^k if ε_i^k > −b_i^k/ρ₁, and 0 otherwise    (2.66a)

Δw^k = ρ₂(ρ₁ − 1) ε_i^k z^i if ε_i^k > −b_i^k/ρ₁, and −ρ₂ ε_i^k z^i otherwise    (2.66b)
In the general case of an adaptive margin, as in Equation (2.66), Hassoun and Song (1992) showed
that a sufficient condition for the convergence of the AHK rules is given by
0 < ρ₂ < 2 / max_i ‖z^i‖²    (2.67a)
Another variation results in the AHK III rule, which is appropriate for both linearly separable and nonlinearly separable problems. Here, Δw is set to 0 in Equation (2.66b) whenever ε_i^k ≤ −b_i^k/ρ₁. The advantages of the AHK III rule are that (1) it is capable of identifying the critical training vectors that prevent separability, and (2) it uses such information to discard nonseparable training vectors and speed up convergence (Hassoun and Song, 1992).
Example 2.2 In this example the perceptron, LMS, and AHK learning rules are compared in terms
of the quality of the solutions they generate. Consider the simple two-class linearly separable
problem shown earlier in Figure 2-4. The μ -LMS rule of Equation (2.35) is used to obtain the
solution shown as a dashed line in Figure 2-9. Here, the initial weight vector was set to 0 and a
learning rate of μ = 0.005 was used. This solution is not one of the linearly separable solutions for this
problem. Four examples of linearly separable solutions are shown as solid lines in the figure. These
solutions are generated using the perceptron learning rule of Equation (2.2), with varying order of
input vector presentations and with a learning rate of ρ = 0.1. Here, it should be noted that the most robust solution, in the sense of tolerance to noisy input, is given by x₂ = x₁ + 1/2, which is shown as a dotted line in Figure 2-9. This robust solution was in fact automatically generated by the AHK I learning rule of Equation (2.65).
The SSE criterion function in Equation (2.32) is not the only possible choice. We have already seen
other alternative functions such as the ones in Equations (2.20), (2.24), and (2.25). In general, any
differentiable function that is minimized upon setting y i = d i , for i = 1,2,..., m, could be used. One
possible generalization of SSE is the Minkowski-r criterion function (Hanson and Burr, 1988) given
by
J(w) = (1/r) ∑_{i=1}^{m} |d^i − y^i|^r    (2.68)

or its instantaneous version

J(w) = (1/r) |d^i − y^i|^r    (2.69)
Figure 2-9 LMS-generated decision boundary (dashed line) for a two-class linearly separable problem. For comparison, four
solutions generated using the perceptron learning rule are shown (solid lines). The dotted line is the solution generated by the
AHK I rule.
Figure 2-10 shows a plot of |d^i − y^i|^r for r = 1, 1.5, 2, and 20. The general form of the gradient of Equation (2.68) is

∇J(w) = −sgn(d − y) |d − y|^{r−1} f′(net) x    (2.70)

Note that for r = 2 this reduces to the gradient of the SSE criterion function given by Equation (2.52). If r = 1, then J(w) = |d^i − y^i|, with gradient −sgn(d − y) f′(net) x [note that the gradient of J(w) does not exist at y = d]. In this case, the criterion function in Equation (2.68) is known as the Manhattan norm. For r → ∞, a supremum error measure is approached.
A small r gives less weight to large deviations and tends to reduce the influence of the outermost points in the input space during learning. It can be shown, for a linear unit with normally distributed inputs, that r = 2 is an appropriate choice in the sense of both minimum SSE and minimum probability of prediction error (maximum likelihood): under Gaussian-distributed deviations d − y, maximizing the likelihood of the training data is equivalent to minimizing the sum of squared errors.
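A sketch of the Minkowski-r version of the delta rule, obtained by descending along the negative of the gradient in Equation (2.70) (hyperparameter values are our own illustrative choices):

```python
import numpy as np

def minkowski_r_delta(X, d, r=1.5, beta=1.0, rho=0.05, epochs=200):
    """Delta rule driven by the Minkowski-r criterion, Eqs. (2.68)/(2.70)."""
    w = np.zeros(X.shape[0])
    for _ in range(epochs):
        for k in range(X.shape[1]):
            y = np.tanh(beta * (X[:, k] @ w))
            err = d[k] - y
            f_prime = beta * (1.0 - y ** 2)
            # -grad J = sgn(d - y) |d - y|^(r-1) f'(net) x,  Eq. (2.70);
            # r = 2 recovers the ordinary delta rule update.
            w += rho * np.sign(err) * abs(err) ** (r - 1) * f_prime * X[:, k]
    return w
```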
Another criterion function that can be used (Hopfield, 1987) is the instantaneous relative entropy
error measure (Kullback, 1959) defined by
J(w) = (1/2)[(1 + d) ln((1 + d)/(1 + y)) + (1 − d) ln((1 − d)/(1 − y))]    (2.76)
where d belongs to the open interval ( −1, +1) . As before, J ( w ) ≥ 0 , and if y = d, then J ( w ) = 0 . If
y = f ( net ) = tanh ( β net ) , the gradient of Equation (2.76) is
∇J(w) = −β(d − y) x    (2.77)
The factor f ′ ( net ) in Equations (2.53) and (2.70) is missing from Equation (2.77). This eliminates
the flat spot encountered in the delta rule and makes the training here more like μ -LMS [note,
however, that y here is given by y = f ( net ) ≠ net ]. This entropy criterion is "well formed" in the
sense that gradient descent over such a function will result in a linearly separable solution, if one
exists (Hertz et al., 1991). On the other hand, gradient descent on the SSE criterion function does not
share this property, since it may fail to find a linearly separable solution, as demonstrated in
Example 2.2.
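A sketch of training with the relative entropy criterion, using the gradient in Equation (2.77) (our own minimal formulation; note the absence of the f′(net) factor in the update):

```python
import numpy as np

def entropy_delta_rule(X, d, beta=1.0, rho=0.1, epochs=200):
    """Delta rule for the relative entropy criterion, Eqs. (2.76)-(2.77).

    Targets d^i must lie strictly inside (-1, +1), e.g. +/-0.9.
    """
    w = np.zeros(X.shape[0])
    for _ in range(epochs):
        for k in range(X.shape[1]):
            y = np.tanh(beta * (X[:, k] @ w))
            # No f'(net) factor, so there is no flat spot: the update looks
            # like mu-LMS, but with y = tanh(beta*net) rather than y = net.
            w += rho * beta * (d[k] - y) * X[:, k]
    return w
```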
In order for gradient-descent search to find a solution w * in the desired linearly separable region, we
need to use a well-formed criterion function. Consider the following general criterion function:
J(w) = ∑_{i=1}^{m} g((z^i)^T w)    (2.78)

where

z = +x if x ∈ class c₁, and z = −x if x ∈ class c₂
Let s = z^T w. The criterion function J(w) is said to be well formed if g(s) is differentiable and
satisfies the following conditions (Wittner and Denker, 1988):
1. For all s, −dg(s)/ds ≥ 0; i.e., g does not push in the wrong direction.

2. There exists ε > 0 such that −dg(s)/ds ≥ ε for all s ≤ 0; i.e., g keeps pushing if there is a misclassification.
For a single unit with weight vector w, it can be shown (Wittner and Denker, 1988) that if the
criterion function is well formed, then gradient descent is guaranteed to enter the region of linearly
separable solutions w * , provided that such a region exists.
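As a concrete illustration (our own choice, not an example from the text), g(s) = ln(1 + e^{−s}) is well formed: −dg/ds = 1/(1 + e^{s}) is nonnegative for all s and is at least ε = 1/2 for s ≤ 0. A quick numerical check:

```python
import numpy as np

def g(s):                 # softplus of -s: a candidate well-formed criterion
    return np.log1p(np.exp(-s))

def neg_g_prime(s):       # -dg/ds = 1 / (1 + e^s)
    return 1.0 / (1.0 + np.exp(s))

s = np.linspace(-10, 10, 2001)
print(np.all(neg_g_prime(s) >= 0))            # condition 1: never negative
print(np.all(neg_g_prime(s[s <= 0]) >= 0.5))  # condition 2 with eps = 1/2
```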
Table 2-1 Summary of Basic Learning Rules

The general form of each learning rule is w^{k+1} = w^k + ρ^k s^k, where ρ^k is the learning rate and s^k is the learning vector. Throughout the table, net = x^T w, and z^k = x^k if d^k = +1, z^k = −x^k if d^k = −1.

Perceptron rule with variable learning rate (supervised)
  Criterion function: J(w) = −∑_{z^T w ≤ b} (z^T w − b)
  Learning vector: s^k = z^k if (z^k)^T w^k ≤ b, 0 otherwise
  Conditions: b > 0; ρ^k ≥ 0 satisfying lim_{m→∞} [∑_{k=1}^{m} (ρ^k)²] / (∑_{k=1}^{m} ρ^k)² = 0
  Activation function: f(net) = sgn(net)
  Remarks: Converges to z^T w > b if the training set is linearly separable. Finite convergence if ρ^k = ρ, a positive constant.
May's rule (supervised)
  Criterion function: J(w) = (1/2) ∑_{z^T w ≤ b} (z^T w − b)² / ‖z‖²
  Learning vector: s^k = [(b − (z^k)^T w^k) / ‖z^k‖²] z^k if (z^k)^T w^k ≤ b, 0 otherwise
  Conditions: 0 < ρ < 2; b > 0
  Activation function: f(net) = sgn(net)
  Remarks: Finite convergence to the solution z^T w ≥ b > 0 if the training set is linearly separable.
Butz's rule (supervised)
  Criterion function: J(w) = −∑_i (z^i)^T w
  Learning vector: s^k = z^k if (z^k)^T w^k ≤ 0, γ z^k otherwise
  Conditions: 0 ≤ γ < 1; ρ > 0
  Activation function: f(net) = sgn(net)
  Remarks: Finite convergence if the training set is linearly separable. Places w in a region that tends to minimize the probability of error for nonlinearly separable cases.
Widrow-Hoff rule (α-LMS) (supervised)
  Criterion function: J(w) = (1/2) ∑_i [d^i − (x^i)^T w]² / ‖x^i‖²
  Learning vector: s^k = [d^k − (x^k)^T w^k] x^k, with ρ^k = α / ‖x^k‖²
  Conditions: 0 < α < 2
  Activation function: f(net) = net
  Remarks: Converges in the mean square to the minimum SSE or LMS solution if ‖x^i‖ = ‖x^j‖ for all i, j.
μ-LMS rule (supervised)
  Criterion function: J(w) = (1/2) ∑_i [d^i − (x^i)^T w]²
  Learning vector: s^k = [d^k − (x^k)^T w^k] x^k
  Conditions: μ > 0 and sufficiently small
  Activation function: f(net) = net
  Remarks: Converges in the mean square to the minimum SSE or LMS solution.
Correlation rule (supervised)
  Criterion function: J(w) = −∑_i d^i (x^i)^T w
  Learning vector: s^k = d^k x^k
  Conditions: ρ > 0
  Activation function: f(net) = net
  Remarks: Converges to the minimum SSE solution if the vectors x^k are mutually orthonormal.
Delta rule (supervised)
  Criterion function: J(w) = (1/2) ∑_i (d^i − y^i)², with y^i = f((x^i)^T w)
  Learning vector: s^k = (d^k − y^k) f′(net^k) x^k
  Conditions: 0 < ρ < 1
  Activation function: y = f(net), where f is a sigmoid function
  Remarks: Extends the μ-LMS rule to units with differentiable nonlinear activations.
Minkowski-r delta rule (supervised)
  Criterion function: J(w) = (1/r) ∑_i |d^i − y^i|^r
  Learning vector: s^k = sgn(d^k − y^k) |d^k − y^k|^{r−1} f′(net^k) x^k
  Conditions: 0 < ρ < 1
  Activation function: y = f(net), where f is a sigmoid function
  Remarks: 0 < r < 2 is suited to pseudo-Gaussian distributions p(x) with pronounced tails. r = 2 gives the delta rule. r = 1 arises when p(x) is a Laplace distribution.
Relative entropy delta rule (supervised)
  Criterion function: J(w) = (1/2) ∑_i [(1 + d^i) ln((1 + d^i)/(1 + y^i)) + (1 − d^i) ln((1 − d^i)/(1 − y^i))]
  Learning vector: s^k = β (d^k − y^k) x^k
  Conditions: 0 < ρ < 1
  Activation function: y = tanh(β net)
  Remarks: Eliminates the flat spot suffered by the delta rule. Converges to a linearly separable solution if one exists.
AHK I rule (supervised)
  Criterion function: J(w, b) = (1/2) ∑_i [(z^i)^T w − b_i]², with b_i > 0
  Margin vector: Δb_i^k = ρ₁ ε_i^k if ε_i^k > 0, 0 otherwise
  Weight vector: Δw^k = ρ₂(ρ₁ − 1) ε_i^k z^i if ε_i^k > 0, −ρ₂ ε_i^k z^i if ε_i^k ≤ 0, where ε_i^k = (z^i)^T w^k − b_i^k
  Conditions: 0 < ρ₂ < 2 / max_i ‖z^i‖² (Eq. 2.67a)
  Activation function: f(net) = sgn(net)
  Remarks: Well suited for generating robust decision surfaces for linearly separable problems (Hassoun and Song, 1992).

AHK II rule (supervised)
  Criterion function: as for AHK I, but with both positive and negative margin changes allowed
  Margin vector: Δb_i^k = ρ₁ ε_i^k if ε_i^k > −b_i^k/ρ₁, 0 otherwise
  Weight vector: Δw^k = ρ₂(ρ₁ − 1) ε_i^k z^i if ε_i^k > −b_i^k/ρ₁, −ρ₂ ε_i^k z^i if ε_i^k ≤ −b_i^k/ρ₁, where ε_i^k = (z^i)^T w^k − b_i^k
  Conditions: 0 < ρ₂ < 2 / max_i ‖z^i‖²
  Activation function: f(net) = sgn(net)
  Remarks: Well suited for generating robust decision surfaces for linearly separable problems (Hassoun and Song, 1992).
AHK III rule (supervised)
  Margin vector: as in AHK II
  Weight vector: Δw^k = ρ₂(ρ₁ − 1) ε_i^k z^i if ε_i^k > −b_i^k/ρ₁, 0 otherwise, where ε_i^k = (z^i)^T w^k − b_i^k
  Conditions: 0 < ρ₂ < 2 / max_i ‖z^i‖²
  Activation function: f(net) = sgn(net)
  Remarks: Appropriate for both linearly separable and nonlinearly separable problems; discards training vectors that cause persistent misclassifications (Hassoun and Song, 1992).
Delta rule for stochastic units (supervised)
  Learning vector: s^k = (d^k − y^k)[1 − tanh²(β net^k)] x^k
  Activation function: stochastic, with y = +1 with probability P(y = +1) and y = −1 with probability 1 − P(y = +1), where P(y = +1) = 1/(1 + e^{−2β net})
  Remarks: Performance in the average is equivalent to the delta rule applied to a unit with the deterministic activation y = tanh(β net).