Radial Basis Function Networks: Algorithms: Neural Computation: Lecture 14
1.
2.
3.
4.
5.
6.
7.
\[
y_k(\mathbf{x}) = \sum_{j=0}^{M} w_{kj}\,\phi_j(\mathbf{x})
\]
with Gaussian basis functions
\[
\phi_j(\mathbf{x}) = \exp\!\left( -\frac{\lVert \mathbf{x} - \boldsymbol{\mu}_j \rVert^2}{2\sigma_j^2} \right)
\]
which have centres {μ_j} and widths {σ_j}. Naturally, the way to proceed is to develop a
process for finding the appropriate values for M, {w_kj}, {μ_ij} and {σ_j}.
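As a concrete illustration, here is a minimal NumPy sketch of how this mapping might be evaluated for a single input, assuming Gaussian basis functions as above and a bias unit φ_0 = 1; the function and variable names are purely illustrative, not part of the lecture material.

```python
import numpy as np

def rbf_forward(x, centres, widths, W):
    """Compute the RBF network outputs y_k(x) for a single input vector x.

    centres : (M, d) array of basis function centres mu_j
    widths  : (M,)   array of basis function widths sigma_j
    W       : (K, M+1) array of output weights w_kj, with column 0 the bias weights
    """
    # Gaussian basis function activations phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma_j^2))
    sq_dists = np.sum((centres - x) ** 2, axis=1)
    phi = np.exp(-sq_dists / (2.0 * widths ** 2))
    # Prepend the bias activation phi_0 = 1, then form the weighted sums
    phi = np.concatenate(([1.0], phi))
    return W @ phi

# Example usage with random illustrative values
rng = np.random.default_rng(0)
centres = rng.normal(size=(5, 2))     # M = 5 centres in a 2-d input space
widths = np.ones(5)                   # all sigma_j = 1
W = rng.normal(size=(3, 6))           # K = 3 outputs, M + 1 = 6 weights each
print(rbf_forward(np.array([0.3, -0.7]), centres, widths, W))
```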
[Figure: RBF network architecture. The inputs x_i feed through the first layer weights μ_ij to the hidden basis function units (plus a bias unit), whose activations feed through the second layer weights w_kj (plus a bias unit) to the outputs y_k.]
The hidden to output layer part operates like a standard feed-forward MLP network, with
the sum of the weighted hidden unit activations giving the output unit activations. The
hidden unit activations are given by the basis functions φ_j(x, μ_j, σ_j), which depend on
the weights {μ_ij, σ_j} and input activations {x_i} in a non-standard manner.
where {μ_j} ⊂ {x^p}, and the widths σ_j are all related in the same way to the maximum or
average distance between the chosen centres μ_j. Common choices are
\[
\sigma_j = \frac{d_{\mathrm{max}}}{\sqrt{2M}}
\qquad \text{or} \qquad
\sigma_j = 2\,d_{\mathrm{ave}}
\]
which ensure that the individual RBFs are neither too wide, nor too narrow, for the
given training data. For large training sets, this approach gives reasonable results.
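As a rough sketch of how these width heuristics might be computed from a chosen set of centres (the helper name and the use of plain Euclidean distances are illustrative assumptions, not prescribed by the lecture):

```python
import numpy as np

def choose_widths(centres, rule="dmax"):
    """Set a common width sigma_j for all basis functions from the chosen centres.

    rule "dmax": sigma_j = d_max / sqrt(2M), with d_max the largest distance
                 between any pair of centres.
    rule "dave": sigma_j = 2 * d_ave, with d_ave the average distance between
                 pairs of distinct centres.
    """
    M = len(centres)
    # Pairwise Euclidean distances between the centres
    diffs = centres[:, None, :] - centres[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=-1))
    off_diag = dists[~np.eye(M, dtype=bool)]
    if rule == "dmax":
        sigma = off_diag.max() / np.sqrt(2 * M)
    else:
        sigma = 2.0 * off_diag.mean()
    return np.full(M, sigma)
```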
[Figure: Example 1, with M = N and σ_j = 2 d_ave. From: Neural Networks for Pattern Recognition, C. M. Bishop, Oxford University Press, 1995.]
[Further example figures, also from Neural Networks for Pattern Recognition, C. M. Bishop, Oxford University Press, 1995.]
The K-means clustering algorithm finds a set of M cluster means μ_j that minimise the sum-of-squares clustering function
\[
J = \sum_{j=1}^{M} \sum_{p \in S_j} \bigl\lVert \mathbf{x}^p - \boldsymbol{\mu}_j \bigr\rVert^2 ,
\qquad
\boldsymbol{\mu}_j = \frac{1}{N_j} \sum_{p \in S_j} \mathbf{x}^p
\]
where S_j is the set of data points assigned to cluster j and N_j is the number of points in it.
It does that by iteratively finding the nearest mean μ_j to each data point {x^p},
reassigning the data points to the associated clusters S_j, and then recomputing the cluster
means μ_j. The clustering process terminates when no more data points switch from one
cluster to another. Multiple runs can be carried out to find the lowest J.
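A minimal NumPy sketch of this procedure follows (assuming Euclidean distances; the function name and the initialisation with randomly chosen data points are my own illustrative choices):

```python
import numpy as np

def k_means(X, M, max_iters=100, seed=0):
    """Basic K-means clustering: returns M cluster means to use as RBF centres."""
    rng = np.random.default_rng(seed)
    # Initialise the means with M randomly chosen data points
    mu = X[rng.choice(len(X), size=M, replace=False)].astype(float).copy()
    assignments = None
    for _ in range(max_iters):
        # Assign each data point x^p to its nearest mean mu_j
        dists = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=-1)
        new_assignments = np.argmin(dists, axis=1)
        # Stop when no data points switch from one cluster to another
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Recompute each cluster mean from the points currently assigned to it
        for j in range(M):
            members = X[assignments == j]
            if len(members) > 0:
                mu[j] = members.mean(axis=0)
    return mu
```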
With the basis function parameters fixed, the output weights can be determined by minimising the sum-squared output error
\[
E = \tfrac{1}{2} \sum_{p} \sum_{k} \bigl( t_k^p - y_k(\mathbf{x}^p) \bigr)^2
\]
and here the outputs are a simple linear combination of the hidden unit activations, i.e.
\[
y_k(\mathbf{x}^p) = \sum_{j=0}^{M} w_{kj}\,\phi_j(\mathbf{x}^p)
\]
At the minimum of E the gradients with respect to all the weights w_ki will be zero, so
\[
\frac{\partial E}{\partial w_{ki}}
= \sum_{p} \Bigl( t_k^p - \sum_{j=0}^{M} w_{kj}\,\phi_j(\mathbf{x}^p) \Bigr) \phi_i(\mathbf{x}^p) = 0
\]
and linear equations like this are well known to be easy to solve analytically.
In matrix notation, with Φ the matrix of basis function activations, W the matrix of output weights, and T the matrix of targets, this condition becomes
\[
\Phi^{\mathrm{T}} \bigl( \mathbf{T} - \Phi \mathbf{W}^{\mathrm{T}} \bigr) = 0
\]
and the formal solution for the weights is
\[
\mathbf{W}^{\mathrm{T}} = \Phi^{\dagger} \mathbf{T}
\]
where the pseudo-inverse Φ† of Φ is defined as
\[
\Phi^{\dagger} = \bigl( \Phi^{\mathrm{T}} \Phi \bigr)^{-1} \Phi^{\mathrm{T}}
\]
which can be seen to have the property Φ†Φ = I. Thus the network weights can be
computed by fast linear matrix inversion techniques. In practice, it is normally best to
use Singular Value Decomposition (SVD) techniques that can avoid problems due to
possible ill-conditioning of Φ, i.e. Φ^TΦ being singular or near singular.
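The following is an illustrative sketch of this step; the helper names are my own, and np.linalg.lstsq is used because it is SVD-based and so copes with a near-singular design matrix Φ.

```python
import numpy as np

def design_matrix(X, centres, widths):
    """Build Phi with Phi[p, j] = phi_j(x^p), plus a leading column of ones for the bias."""
    sq_dists = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=-1)
    phi = np.exp(-sq_dists / (2.0 * widths ** 2))
    return np.hstack([np.ones((len(X), 1)), phi])

def solve_output_weights(Phi, T):
    """Solve T ~= Phi W^T for the output weights W by linear least squares.

    np.linalg.lstsq is SVD-based, so it behaves sensibly even when Phi^T Phi
    is singular or near singular.
    """
    W_T, *_ = np.linalg.lstsq(Phi, T, rcond=None)
    return W_T.T   # shape (K, M+1)
```

Given training inputs X and targets T, the weights would then be obtained as W = solve_output_weights(design_matrix(X, centres, widths), T).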
Alternatively, the whole network can be trained in a fully supervised fashion by gradient descent on the sum-squared error
\[
E = \sum_{p} \sum_{k} \bigl( t_k^p - y_k(\mathbf{x}^p) \bigr)^2
  = \sum_{p} \sum_{k} \Bigl( t_k^p - \sum_{j=0}^{M} w_{kj}\,\phi_j(\mathbf{x}^p, \boldsymbol{\mu}_j, \sigma_j) \Bigr)^2
\]
and one could iteratively update the weights/basis function parameters using
\[
\Delta w_{jk} = -\eta_w \frac{\partial E}{\partial w_{jk}} , \qquad
\Delta \mu_{ij} = -\eta_\mu \frac{\partial E}{\partial \mu_{ij}} , \qquad
\Delta \sigma_j = -\eta_\sigma \frac{\partial E}{\partial \sigma_j}
\]
We will have all the problems of choosing the learning rates η, avoiding local minima,
and so on, that we had for training MLPs by gradient descent. Also, there is a tendency
for the basis function widths to grow large, leaving non-localised basis functions.
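For completeness, here is a minimal full-batch sketch of one such update, assuming Gaussian basis functions and a bias weight. The analytic gradients follow from the error function above; the fixed learning rates eta_w, eta_mu, eta_sigma and all names are purely illustrative assumptions.

```python
import numpy as np

def rbf_gradient_step(X, T, W, centres, widths, eta_w, eta_mu, eta_sigma):
    """One full-batch gradient descent step on E = sum_p sum_k (t_k^p - y_k(x^p))^2.

    X : (P, d) inputs, T : (P, K) targets, W : (K, M+1) output weights
    (column 0 is the bias), centres : (M, d), widths : (M,).
    """
    diffs = X[:, None, :] - centres[None, :, :]        # (P, M, d): x^p - mu_j
    sq_dists = np.sum(diffs ** 2, axis=-1)             # (P, M)
    phi = np.exp(-sq_dists / (2.0 * widths ** 2))      # (P, M)
    Phi = np.hstack([np.ones((len(X), 1)), phi])       # (P, M+1), bias column first
    err = T - Phi @ W.T                                # (P, K): t_k^p - y_k(x^p)

    # dE/dw_kj = -2 sum_p err_k^p phi_j(x^p)
    grad_W = -2.0 * err.T @ Phi
    # Error signal reaching each basis function: sum_k err_k^p w_kj
    delta = err @ W[:, 1:]                             # (P, M)
    # dE/dmu_j    = -2 sum_p delta_j^p phi_j (x^p - mu_j) / sigma_j^2
    grad_mu = -2.0 * np.einsum('pm,pmd->md', delta * phi / widths ** 2, diffs)
    # dE/dsigma_j = -2 sum_p delta_j^p phi_j ||x^p - mu_j||^2 / sigma_j^3
    grad_sigma = -2.0 * np.sum(delta * phi * sq_dists / widths ** 3, axis=0)

    # Simple fixed-learning-rate updates: Delta(param) = -eta * dE/d(param)
    return (W - eta_w * grad_W,
            centres - eta_mu * grad_mu,
            widths - eta_sigma * grad_sigma)
```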
2.
3. We then saw how the two layers of network weights were rather different, and that different techniques were appropriate for training each of them.
4.
5.
Reading
1. Bishop, C. M., Neural Networks for Pattern Recognition, Oxford University Press, 1995.
2.