
Proceedings of the 1993 International Joint Conference on Neural Networks

An Empirical Comparison of Node Pruning Methods for
Layered Feed-forward Neural Networks

Giovanna Castellano, Anna Maria Fanelli, and Marcello Pelillo(*)
Dipartimento di Informatica
Università di Bari
Via G. Amendola, 173 - 70126 Bari (Italy)

Abstract-One popular approach to reduce the size of an artificial neural network is to
prune off hidden units after learning has taken place. This paper compares three
different node pruning algorithms in terms of the size and performance of the reduced
networks. Experimental results are reported and some useful conclusions are drawn.

1. Introduction
A practical question that arises when an artificial neural network is used to solve a particular
problem is how to determine its most appropriate architecture and size. These, in fact, are known to affect
considerably both the network's functional capabilities and its generalization performance (see e.g. [1]).
One popular approach to this problem, usually referred to as node pruning, involves training an
over-dimensioned network and then removing redundant units in a separate phase. As pointed out by
Karnin [2], this presents several advantages. Firstly, by starting with an over-estimated structure, the
network is guaranteed to learn the desired input-output mapping with the desired degree of accuracy.
Moreover, the training may even take place with a faster learning rate, and the pruning algorithm
need not interfere with the learning process.
Recently, several node pruning methods have been proposed, but it is difficult to fully assess their
potential because different testing problems were used by their inventors. This paper attempts to
provide a partial answer to this problem. It presents an empirical comparison of three different node
pruning methods, in terms of both the size and the performance of the reduced networks. The extensive
results presented here allow some significant conclusions to be drawn.

2. The Methods
In this section the methods involved in our comparison are briefly described; for further details the
reader is referred to the original papers, which are readily available. Here, we consider a layered feed-
forward network with L layers (we do count the input units as a layer) and assume that the network is
trained over a sample of P training patterns. As a conventional shorthand we shall denote unit i of
layer l by u_{il}. The number of units in layer l will be denoted by n_l, w_{ijl} will represent the weight
between units u_{il} and u_{j,l+1}, and y_{il} will denote the output of unit u_{il}. Each non-input unit u_{il}
receives from the preceding layer a net input t_{il} given by

    t_{il} = Σ_{j=1}^{n_{l-1}} w_{ji,l-1} y_{j,l-1} + θ_{il}                (1)

where θ_{il} is a bias value for u_{il}; the unit produces its own output y_{il} according to some nonlinear
activation function, which is commonly chosen to be the logistic function f(z) = (1 + exp{-z})^{-1}. In
the following, y_{il}^{(p)} will denote the actual output of unit u_{il} following presentation of pattern p.
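For concreteness, the short numpy sketch below implements the forward pass defined by Eq. (1) together with the logistic activation. The function names and the array layout (weights[l][j, i] standing for w_{jil}, biases[l] for the θ's of layer l+1) are our own choices for illustration and are not prescribed by the paper.

```python
import numpy as np

def logistic(z):
    """Logistic activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Propagate one input pattern through a layered feed-forward network.

    weights[l] has shape (n_l, n_{l+1}), so weights[l][j, i] plays the role of
    w_{jil} (the weight from unit j of layer l to unit i of layer l+1), and
    biases[l] holds the biases of layer l+1.  Returns the outputs of every
    layer, starting with the input layer itself.
    """
    outputs = [np.asarray(x, dtype=float)]
    for W, theta in zip(weights, biases):
        net = outputs[-1] @ W + theta   # net input of Eq. (1)
        outputs.append(logistic(net))   # nonlinear activation
    return outputs
```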

2.1. The S&D Method


Sietsma and Dow [3], [4] proposed a two-stage pruning procedure. In the first stage, units that do
not contribute to the solution of the problem at hand are detected using some ad-hoc rules and
removed; their outgoing weights are then redistributed in such a way that the performance of the
network does not change significantly. The second stage, instead, eliminates units that provide
redundant information to the next layer. This step, however, has limited applicability since it
requires that the hidden unit outputs be sufficiently close to 0 or 1. Indeed, this did not happen in our
simulations, as the output values were mostly scattered over the interval. Hence, only the first stage will
be considered here. According to Sietsma and Dow's criteria, useless units have approximately constant
output, or give the same or opposite output as another unit in the layer, across the entire training set.

(*) Author to whom correspondence should be addressed.

The removal procedure (hereafter referred to as the S&D method) works as follows. Suppose that hidden
unit u_{il} has constant output ȳ. Then, it is removed and every bias θ_{j,l+1} is increased by the quantity ȳ w_{ijl}.
Instead, assuming that u_{il} has approximately the same output as u_{kl}, it will be eliminated and its
outgoing weights will be added to those of node u_{kl}. Finally, if u_{il}'s output is approximately equal to
1 - y_{kl}, then u_{il} is removed, its weights are subtracted from those of u_{kl}, and the biases θ_{j,l+1} are
increased by the weights w_{ijl}.
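The three removal rules just described can be written compactly as operations on the weight matrix toward the next layer. The sketch below is an illustrative numpy rendering of those rules; the function and argument names are ours, and W is assumed to have shape (n_l, n_{l+1}) with W[i, j] playing the role of w_{ijl}.

```python
import numpy as np

def remove_constant_unit(W, theta_next, k, y_bar):
    """Remove hidden unit k whose output is (approximately) the constant y_bar.

    W has shape (n_l, n_{l+1}) and holds the weights to the next layer,
    theta_next the biases of the next layer.  The constant contribution
    y_bar * w_{ki} is folded into each bias theta_{i,l+1}.
    """
    theta_new = theta_next + y_bar * W[k, :]
    return np.delete(W, k, axis=0), theta_new

def merge_duplicate_units(W, theta_next, i, k, inverse=False):
    """Remove hidden unit i that (approximately) duplicates unit k.

    With inverse=False, unit i behaves like unit k, and its outgoing weights
    are added to those of k.  With inverse=True, unit i behaves like 1 - y_k:
    its weights are added to the next-layer biases and subtracted from the
    weights of k.
    """
    W = W.astype(float).copy()
    theta_new = theta_next.astype(float).copy()
    if inverse:
        theta_new += W[i, :]
        W[k, :] -= W[i, :]
    else:
        W[k, :] += W[i, :]
    return np.delete(W, i, axis=0), theta_new
```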

2.2. The P&F Method


In a more recent paper, Pelillo and Fanelli [6] developed a formal pruning method which is also
based on the idea of removing a hidden unit and adjusting the remaining weights of its layer so that
the net inputs to the succeeding layer remain approximately unchanged. Supposing that hidden unit
u_{kl} is to be removed, this amounts to solving the following linear system

    Σ_{j=1, j≠k}^{n_l} δ_{ji} y_{jl}^{(p)} + γ_i = w_{kil} y_{kl}^{(p)}                (2)

for all i = 1...n_{l+1} and p = 1...P, where the unknowns δ_{ji} and γ_i are appropriate adjusting factors for
the weights w_{jil} and the biases θ_{i,l+1}, respectively. In order to clarify our discussion, it is
convenient to represent system (2) in the more compact notation Ax = b. Instead of analyzing system
consistency and deriving rules for detecting and removing redundant units (like Sietsma and Dow's
approach), the P&F method uses a very efficient preconditioned conjugate-gradient method, called
CGPCNE, to solve the system in the least-squares sense [7]. The algorithm begins with an initial point
x_0 and iteratively produces a sequence of points {x_k} so as to decrease the residuals

    r_k = ||A x_k - b||.                (3)

The criterion for choosing the units to be eliminated strongly depends on the particular algorithm
used to solve the linear system. In fact, as CGPCNE is a residual-reducing method, the unit to be
removed is chosen in such a way that the initial residual r_0 is minimum among all hidden units of the
layer under consideration. As x_0 is usually chosen to be the null vector, this corresponds to choosing
the unit for which ||b|| is minimum. Denoting by ȳ_{kl} the P-dimensional vector consisting of the outputs
of u_{kl} upon presentation of the training patterns, the vector b is simply obtained by concatenating the n_{l+1}
vectors w_{kil} ȳ_{kl}, for i = 1...n_{l+1}. Summarizing, the P&F algorithm starts with a previously trained
network and iteratively produces a sequence of reduced networks. At each step, the minimum-norm
unit is located and the corresponding system is solved. Then, the performance of the reduced network is
tested over the training set: if it is sufficiently close to the performance of the initial trained network,
the process is repeated; otherwise, the last reduced network is rejected, and the algorithm stops.
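The following sketch shows one removal step in the spirit of the P&F method. For simplicity it solves system (2) with numpy's generic least-squares routine rather than the CGPCNE solver used in the paper, and the data layout (Y holding the hidden outputs over the training set) is an assumption made for illustration.

```python
import numpy as np

def select_unit_to_remove(Y, W):
    """Pick the hidden unit whose removal gives the smallest initial residual ||b||."""
    norms = [np.linalg.norm(Y[:, k]) * np.linalg.norm(W[k, :]) for k in range(Y.shape[1])]
    return int(np.argmin(norms))

def pf_removal_step(Y, W, theta_next, k):
    """Remove hidden unit k and adjust the surviving weights/biases of its layer.

    Y is the (P, n_l) matrix of hidden outputs over the training set, W the
    (n_l, n_{l+1}) weights to the next layer, theta_next its biases.  System (2)
    is solved here in the least-squares sense with numpy's lstsq; the paper
    uses the CGPCNE iterative solver instead.
    """
    P = Y.shape[0]
    b = np.outer(Y[:, k], W[k, :])              # columns are the vectors w_{kil} * y_bar_{kl}
    Y_rest = np.delete(Y, k, axis=1)            # outputs of the surviving hidden units
    A = np.hstack([Y_rest, np.ones((P, 1))])    # unknowns: the deltas plus one gamma per target unit
    sol, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
    deltas, gammas = sol[:-1, :], sol[-1, :]
    W_new = np.delete(W, k, axis=0) + deltas    # adjusted weights w_{jil} + delta_{ji}
    theta_new = theta_next + gammas             # adjusted biases theta_{i,l+1} + gamma_i
    return W_new, theta_new
```

Note that the selection rule only needs ||b||, which factors as ||ȳ_{kl}|| · ||w_{k·}||, so candidate units can be ranked cheaply before any system is solved.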

2.3. The B Method


Burkitt [5] developed a pruning method that differs significantly from the other ones analyzed in this
paper. The original network is modified by introducing, for each hidden layer, an auxiliary layer of
linear response units that allow redundant hidden nodes to be detected. Each auxiliary layer is connected to
its own hidden layer with excitatory weights, subject to the constraint that their sum on each node is
unity. After this extended network has been trained by back-propagation, the weights w_{jkl} between
the auxiliary layer l+1 and the corresponding hidden layer l are updated by a rule that forces them to
take value 1 for a particular hidden unit and to vanish for all the others (in practice, a threshold is
used to force the auxiliary weights to actually become 0 or 1). This means that each auxiliary unit
chooses exactly one unit in the hidden layer. Conversely, each hidden unit can be chosen by zero,
one, or more auxiliary nodes. This further training of the auxiliary units goes on until they all choose
one unit in the hidden layer. Here, it is customary to make use of a parameter β to speed up
convergence. After this last training phase, hidden units having all auxiliary weights equal to zero are
the redundant ones and can therefore be removed; for hidden units chosen by one or more auxiliary units,
the new weights to the next original hidden layer are obtained by adding together the weights
connecting those auxiliary units to the next layer.
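Burkitt's update rule for the auxiliary weights is given in [5] and is not reproduced here; the sketch below only illustrates the final selection step described above (thresholding the auxiliary weights and rebuilding the outgoing weights of the surviving hidden units). The assumed shapes of W_aux and W_next are our own convention.

```python
import numpy as np

def b_method_prune(W_aux, W_next, eps4=0.7):
    """Final selection step of the B method, as described in the text above.

    W_aux is assumed to have shape (n_aux, n_hidden): W_aux[a, h] is the
    trained auxiliary weight from hidden unit h to auxiliary unit a.  W_next
    is assumed to have shape (n_aux, n_next): the weights from auxiliary unit
    a to the next layer.  Auxiliary weights of at least eps4 are treated as 1
    ("unit a chooses hidden unit h"), the others as 0.  Hidden units chosen by
    no auxiliary unit are redundant and dropped; each surviving hidden unit
    gets, as its new outgoing weights, the sum of the next-layer weights of
    the auxiliary units that chose it.
    """
    chooses = W_aux >= eps4                   # boolean (n_aux, n_hidden) selection matrix
    keep = np.where(chooses.any(axis=0))[0]   # hidden units chosen at least once
    W_new = np.array([W_next[chooses[:, h], :].sum(axis=0) for h in keep])
    return keep, W_new
```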

3. Experimental Results
In order to compare the performance of the pruning methods described above, extensive simulations
were carried out over three different tasks. The first two are classical Boolean problems for which the
minimum number of hidden units is known, namely the parity and the symmetry problems [8]. In
addition, to test the methods over tasks that resemble real-world problems, a simulated pattern
recognition task was used. As the methods often require a further training phase after pruning to
achieve the original performance, we carried out simulations both with and without retraining. In our
analysis, three kinds of measures were used to assess the performance of the pruning methods: 1) the
size of the reduced networks; 2) the misclassification rate, measured as the proportion of exemplars for
which at least one network output value differs from the corresponding target by more than 0.5; and 3)
the usual mean squared error (MSE). For the two Boolean problems these were computed over the
training set, while for the pattern recognition task an additional testing set was used.
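A minimal implementation of the two error measures, matching the definitions above (an exemplar counts as misclassified when at least one output is off by more than 0.5):

```python
import numpy as np

def misclassification_rate(outputs, targets):
    """Proportion of exemplars with at least one output off by more than 0.5."""
    wrong = np.any(np.abs(outputs - targets) > 0.5, axis=1)
    return float(wrong.mean())

def mean_squared_error(outputs, targets):
    """Mean squared error over all patterns and output units."""
    return float(np.mean((outputs - targets) ** 2))
```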

3.1. Implementation
Each of the competing methods was run over ten over-dimensioned networks (labeled from A to
L) with randomly-generated initial weights. The training (as well as the retraining) of the networks
was carried out by back-propagation with learning rate η = 1.0 and momentum term α = 0.7. For the
two Boolean problems, learning was stopped when, for each exemplar in the training set, all the
network output values differed from the corresponding targets by less than 0.05; for the pattern
recognition task, instead, exactly 1,000 iterations of back-propagation were carried out.
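As a rough sketch of this training regime, the routine below performs batch back-propagation with the stated learning rate, momentum term, and stopping tolerance. The paper does not specify details such as batch versus per-pattern updates or the weight-initialization range, so those are assumptions made here for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bp_momentum(X, T, n_hidden, eta=1.0, alpha=0.7, tol=0.05,
                      max_epochs=3000, seed=0):
    """Batch back-propagation with momentum for a one-hidden-layer network.

    Training stops when every output differs from its target by less than tol
    for every exemplar (the criterion used for the Boolean problems), or when
    max_epochs is reached.
    """
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_out)); b2 = np.zeros(n_out)
    vW1 = np.zeros_like(W1); vb1 = np.zeros_like(b1)
    vW2 = np.zeros_like(W2); vb2 = np.zeros_like(b2)
    for epoch in range(max_epochs):
        H = logistic(X @ W1 + b1)             # hidden-layer outputs
        Y = logistic(H @ W2 + b2)             # network outputs
        if np.all(np.abs(Y - T) < tol):       # stopping criterion
            return (W1, b1, W2, b2), epoch
        d2 = (Y - T) * Y * (1 - Y)            # output-layer error terms
        d1 = (d2 @ W2.T) * H * (1 - H)        # hidden-layer error terms
        vW2 = -eta * H.T @ d2 + alpha * vW2;  vb2 = -eta * d2.sum(0) + alpha * vb2
        vW1 = -eta * X.T @ d1 + alpha * vW1;  vb1 = -eta * d1.sum(0) + alpha * vb1
        W2 += vW2;  b2 += vb2;  W1 += vW1;  b1 += vb1
    return (W1, b1, W2, b2), max_epochs
```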
In order to implement Sietsma and Dow's method, their pruning rules must be translated into more
precise criteria. In our implementation, a hidden unit is regarded as having constant behavior when the
variance of its output vector is lower than a threshold ε1. Moreover, we considered two hidden units u_{il}
and u_{jl} to have the same output values when the normalized distance ||ȳ_{il} - ȳ_{jl}||/P is less than another
threshold ε2. For units giving opposite output, the same condition was applied to ȳ_{il} and 1 - ȳ_{jl}. As can
be expected, we found that using small threshold values very few redundant units are removed, while
large values result in too many removed nodes and, as a consequence, the network performance worsens
excessively. After a long tuning phase we set ε1 = 0.01 and ε2 = 0.1, which appeared to be near-optimal.
Also, as Sietsma and Dow did in their simulations, before detecting the units to be removed, output
values less than 0.35 and greater than 0.65 were approximated to 0 and 1, respectively. A further
implementation problem with the S&D method involves the order in which units are removed. In their
original paper, Sietsma and Dow speak of a heuristic method of selecting the units to be eliminated
but, unfortunately, they do not specify how this is accomplished. One reasonable choice, which was
actually implemented in the simulations presented here, is to remove units as follows: first constant-
output units, then duplicated units, and finally inversely-duplicated units. When more than one pair of
either duplicated or inversely-duplicated units was detected, the pair having the closest output vectors was
selected and the unit with the more nearly constant output was removed. We experimented
with alternative criteria, but this turned out to be the best one.
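The detection criteria of our S&D implementation can be summarized as in the sketch below; the function is illustrative only (the ordering and tie-breaking heuristics described above are not included), and it assumes the hidden outputs have already been snapped toward 0/1 as mentioned in the text.

```python
import numpy as np

def sd_redundant_units(Y, eps1=0.01, eps2=0.1):
    """Detect redundant hidden units with the criteria used in the S&D implementation.

    Y is the (P, n_hidden) matrix of hidden outputs over the training set
    (values below 0.35 / above 0.65 assumed already approximated to 0 / 1).
    Returns the indices of (approximately) constant units and the pairs of
    duplicated and inversely-duplicated units.
    """
    P, n_hidden = Y.shape
    constant = [k for k in range(n_hidden) if np.var(Y[:, k]) < eps1]
    duplicated, inverse = [], []
    for i in range(n_hidden):
        for j in range(i + 1, n_hidden):
            if np.linalg.norm(Y[:, i] - Y[:, j]) / P < eps2:
                duplicated.append((i, j))
            elif np.linalg.norm(Y[:, i] - (1 - Y[:, j])) / P < eps2:
                inverse.append((i, j))
    return constant, duplicated, inverse
```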
In contrast with the S&D method, the other methods considered here do not require such a time-
consuming definition of working parameters. In particular, the P&F method only needs an
explicit criterion that states when to stop the pruning process. Our choice was to continue the
pruning process as long as E_r - E_0 < ε3 (and to stop otherwise), where E_r and E_0 are the misclassification rates on the training set of the
reduced and original network, respectively. In our simulations, a value of 0.01 was chosen for ε3. In the
implementation of the B method, instead, two parameters are needed: the convergence factor β, and
the threshold ε4 used to approximate the auxiliary weights to 0 or 1. We set β = 0.1 and ε4 = 0.7, as
suggested by Burkitt himself [5].
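The P&F stopping rule can be expressed as a small generic loop. In the schematic sketch below, prune_step and misclass are caller-supplied callables (their exact form is an assumption of this illustration); it only renders the criterion E_r - E_0 < ε3 described above.

```python
def prune_until_degraded(network, prune_step, misclass, eps3=0.01):
    """Keep removing units while the training-set misclassification rate stays
    within eps3 of the original network's; reject the last reduced network
    otherwise.  prune_step(network) returns a reduced network or None when no
    unit is left to remove; misclass(network) returns the misclassification rate.
    """
    e0 = misclass(network)
    while True:
        candidate = prune_step(network)
        if candidate is None or misclass(candidate) - e0 >= eps3:
            return network        # the last removal is rejected
        network = candidate
```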

3.2. Parity
In this series of simulations the 4-bit parity problem was considered, which requires at least 4 hidden
units. The initial over-sized networks had one input layer of 4 units, one 10-unit hidden layer, and 1
output node (referred to as a 4-10-1 architecture). For both the S&D and the P&F methods, the back-
propagation algorithm was applied as described previously (i.e. η = 1 and α = 0.7). For the B method,
instead, this set of parameters was found to perform very poorly, as none of the ten training trials
converged to a solution within 3,000 epochs. Therefore, the values η = 0.25 and α = 0.96 were employed,
which were used by Burkitt in his original successful experiments. However, even using Burkitt's
values, 4 of the 10 learning runs did not converge.
After the training phase, the pruning methods (without retraining) were run, and the results are
summarized in Table I, which shows, for each of the ten networks and for each of the competing
methods, the number of hidden units of the pruned networks, along with the corresponding
misclassification rate and the MSE. In the table, the asterisks (*) denote the networks that back-propagation
was not able to train.

Table I
Comparative results on the parity problem

      no. hidden units        misclass. (%)           MSE

net S&D P&F B S&D P&F B S&D P&F B


A 3 5 * 12.5 0.0 * 0.12 0.00 *
B 5 4 * 6.3 0.0 * 0.06 0.02 *
C 4 5 4 6.3 0.0 0.0 0.06 0.00 0.00
D 6 5 3 0.0 0.0 25.0 0.00 0.00 0.24
E 6 5 4 0.0 0.0 0.0 0.00 0.01 0.00
F 7 5 * 0.0 0.0 * 0.00 0.00 *
G 5 5 6 0.0 0.0 0.0 0.00 0.00 0.00
H 4 5 4 6.3 0.0 0.0 0.06 0.00 0.00
I 6 5 4 0.0 0.0 0.0 0.00 0.00 0.00
L 5 5 * 12.5 0.0 * 0.12 0.00 *

On average, the S&D pruned networks had 5.1 hidden units but, as can be seen from the table, a
great deal of variability exists. The P&F method appears to be much more robust, as 90% of the
pruned networks had 5 hidden units, a value which is quite close to the minimum required, and all the
reduced networks maintained a 0.0% misclassification rate. Finally, the B method often produced the
minimal network solution while maintaining good performance. Notice that both the S&D and the B
methods generated one network having fewer than the ideal minimum number of hidden units. Next,
we tried to retrain the pruned networks with back-propagation. Surprisingly, we found that the
retraining of 5 out of 10 S&D reduced networks (including, of course, the one having 3 hidden nodes) did
not converge within 3,000 epochs. For the remaining 5, a median number of 24 epochs was needed
(ranging from 0 to 119). The median number of retraining epochs for the P&F pruned nets was 70,
with a minimum of 0 and a maximum of 251. Burkitt's pruned networks, instead, were retrained with
a median number of epochs equal to 7 (min = 0, max = 14), apart from the 3-hidden-unit net, for which
convergence was not achieved.

3.3. Symmetry
The symmetry problem [8] is known to require a minimum of two hidden units for its solution. In our
experiments a 4-10-1 initial architecture was used. Back-propagation was applied with the parameters
defined above and, here, no problems arose with its convergence. The results of this series of simulations
(without retraining) are summarized in Table II.

Table II
Comparative results on the symmetry problem

      no. hidden units        misclass. (%)           MSE

net S&D P&F B S&D P&F B S&D P&F B


A 6 3 3 0.0 0.0 25.0 0.00 0.01 0.16
B 9 3 3 0.0 0.0 0.0 0.00 0.01 0.00
C 6 4 3 0.0 0.0 37.5 0.01 0.00 0.32
D 8 4 3 0.0 0.0 18.8 0.00 0.00 0.11
E 7 2 3 0.0 0.0 0.0 0.00 0.02 0.00
F 4 3 4 0.0 0.0 6.3 0.00 0.00 0.05
G 6 4 3 12.5 0.0 0.0 0.10 0.02 0.00
H 6 4 3 6.3 0.0 0.0 0.02 0.01 0.00
I 7 4 3 0.0 0.0 0.0 0.01 0.02 0.00
L 7 5 4 6.3 0.0 6.3 0.04 0.02 0.07

It is seen that, again, the S&D method is quite sensitive to the initial conditions and provided very poor
results, as the reduced networks had between 4 and 9 units. Both the P&F and the B methods gave
qualitatively similar results in terms of the number of hidden nodes (an average of about 3.5 for both),
but somewhat different ones in terms of performance. The pruned networks were later subjected to a retraining
phase. Here, all the retraining trials converged: for both the S&D and the B pruned nets a median
number of about 45 epochs was needed for convergence, while the P&F networks required a median
number of 77 epochs.

3.4. A Simulated Pattern Recognition Task

The third problem on which the selected pruning algorithms were compared is rather different from
the previous two. This is a simulated two-class, two-feature pattern recognition problem with known
probability distributions, suggested by Niles et al. [9] in an attempt at modeling the multimodal
distributions typical of real-world applications. The class-conditional densities p_1(x,y) and p_2(x,y) are
mixtures of one-dimensional Gaussians

    N(t, μ, σ²) = (2πσ²)^{-1/2} exp{-(t-μ)²/2σ²}

with means given by ±μ_x and ±μ_y and common variance σ²; the exact mixture forms are given in [9]. In the
presented simulations we set σ = 0.2, μ_x = 2.30σ, and μ_y = 2.106σ, as in [9]. We generated a training set and
a separate testing set of 200 and 1,000 samples, respectively. Both data sets contained equal numbers of
exemplars from the two classes, i.e. a uniform a-priori distribution was assumed. A 2-10-1 initial
architecture was used. Unlike the previously-discussed tasks, the two decision regions for this problem
present a considerable amount of overlap, making the training extremely hard. Therefore, in our
simulations the learning process took place by carrying out exactly 1,000 back-propagation steps. The
original trained networks were found to have a mean misclassification rate of 3.35% (with a standard
deviation of 0.58). The results obtained by applying the pruning methods (again without retraining) are
summarized in Table III, which also contains the performance of the original networks.

Table III
Comparative results on the pattern recognition problem

      no. hidden units        misclass. (%)           MSE

net S&D P&F B Orig S&D P&F B Orig S&D P&F B

A 6 6 3 3.5 3.0 3.5 4.0 0.03 0.03 0.03 0.03


B 6 5 3 2.5 4.0 3.5 4.5 0.02 0.04 0.03 0.03
C 4 4 3 4.0 5.0 5.0 4.0 0.03 0.03 0.03 0.03
D 5 5 3 3.0 3.5 4.0 4.0 0.03 0.04 0.03 0.03
E 4 5 4 3.0 6.0 3.0 4.5 0.03 0.04 0.02 0.03
F 6 6 2 3.5 3.5 3.5 5.5 0.03 0.03 0.03 0.05
G 5 5 3 3.0 3.0 3.0 4.5 0.03 0.03 0.03 0.03
H 6 6 3 3.5 4.5 4.0 3.5 0.03 0.04 0.04 0.03
I 5 4 4 4.5 4.5 5.5 4.0 0.03 0.03 0.04 0.03
L 4 5 3 3.0 6.5 2.5 4.5 0.02 0.05 0.02 0.03

As can be seen, both the S&D and P&F methods achieved qualitatively similar results in terms of the
number of hidden units (an average of 5.1) but slightly different ones in terms of misclassification rate (an
average of 4.35% for S&D and of 3.75% for P&F). The B method, instead, produced smaller networks (a
mean number of hidden units of 3.1) with the same performance as S&D. In this problem, the
retraining phase did not produce significant variations with respect to the original nets, for any of the three
methods. Next, we performed some experiments aimed at comparing the generalization ability. Both
the original and the reduced networks were run over the 1,000-sample test set, and the results are shown
in Table IV. We see that all the methods obtained networks having roughly equivalent performance,
which is slightly better than that of the original nets.
Table IV
Generalization results on the pattern recognition problem

             misclass. (%)              MSE

nets        mean    st. dev.     mean    st. dev.


original 7.16 0.99 0.05 0.01
S&D 6.46 0.86 0.05 0.01
P&F 6.95 1.12 0.05 0.01
B 6.53 0.45 0.05 0.00

Finally, we mention that in a separate series of simulations, not reported here, we noted that the
P&F method produces considerably better results on this problem when different stopping criteria are used (see
also [6]), both in terms of network size and generalization performance.

4. Conclusions
The experimental results presented in this paper allow us to draw some useful conclusions about the
performance of three node pruning algorithms. Firstly, the S&D method performs very poorly, not only
because of the laborious setting of working parameters but, mainly, because of its lack of robustness. By
contrast, both the P&F and the B methods do a very good job of reducing the network size while
maintaining satisfactory performance, without being sensibly affected by the initial conditions.
Moreover, they frequently produce the minimal solution network. This is especially true for the B
method. However, since it needs to train a network with auxiliary hidden layers, it may cause some
convergence problems and sometimes forces one to look for an appropriate set of parameters. This problem was
also encountered by Burkitt in his experiments, although less frequently. The P&F method, instead,
does not suffer from this drawback. One further nice feature of the P&F method, which stems
essentially from its iterative nature, is that it allows the performance of the reduced networks to be
determined in advance, and this usually results in small networks for which retraining is not needed. All
three methods seem not to improve the generalization ability of the reduced networks, but a much
more systematic analysis is needed. This is what we are currently trying to do.

References
[1] D. R. Hush and B. G. Horne, "Progress in supervised neural networks - What's new since
    Lippmann?," IEEE Signal Processing Mag., pp. 8-39, Jan. 1993.
[2] E. D. Karnin, "A simple procedure for pruning back-propagation trained neural networks," IEEE
    Trans. Neural Networks, vol. 1, no. 2, pp. 239-242, 1990.
[3] J. Sietsma and R. J. F. Dow, "Neural net pruning - Why and how," in Proc. Int. Conf. Neural
    Networks, San Diego, CA, 1988, pp. I:325-333.
[4] J. Sietsma and R. J. F. Dow, "Creating artificial neural networks that generalize," Neural
    Networks, vol. 4, pp. 67-79, 1991.
[5] A. N. Burkitt, "Optimization of the architecture of feed-forward neural networks with hidden
    layers by unit elimination," Complex Syst., vol. 5, pp. 371-380, 1991.
[6] M. Pelillo and A. M. Fanelli, "A method of pruning layered feed-forward neural networks," in
    Proc. IWANN'93, Sitges, Barcelona, June 1993 (Berlin: Springer-Verlag).
[7] Å. Björck and T. Elfving, "Accelerated projection methods for computing pseudoinverse solutions
    of systems of linear equations," BIT, vol. 19, pp. 145-163, 1979.
[8] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error
    propagation," in D. E. Rumelhart and J. L. McClelland (eds.), Parallel Distributed Processing,
    vol. 1, Cambridge, MA: MIT Press, 1986, pp. 318-362.
[9] L. Niles, B. Silverman, G. Tajchman, and M. Bush, "How limited training data can allow a
    neural network to outperform an 'optimal' statistical classifier," in Proc. ICASSP-89, Glasgow,
    1989, vol. 1, pp. 17-20.

