An Empirical Comparison of Node Pruning Methods For Layered Feed-Forward Neural Networks
1. Introduction
A practical question that arises when an artificial neural network is used to solve a particular
problem is that of determining its most appropriate architecture and size. These, in fact, are known to affect
considerably both the network's functional capabilities and its generalization performance (see, e.g., [1]).
One popular approach to solve this problem, usually referred to as node pruning, involves training an
over-dimensioned network and then removing redundant units, in a separate phase. As pointed out by
Karnin [2] this presents several advantages. Firstly, by starting with an over-estimated structure, the
network is guaranteed to learn the desired input-output mapping to the desired degree of accuracy.
Moreover, the training may even take place with a faster learning rate, and the pruning algorithm
need not interfere with the learning process.
Recently, several node pruning methods have been proposed but it is difficult to fully assess their
potential because different testing problems were used by their inventors. This paper attempts to
provide a partial answer to this problem. It presents an empirical comparison of three different node
pruning methods, in terms of both the size and the performance of the reduced networks. The extensive
results presented here allow us to draw some significant conclusions.
2. The Methods
In this section the methods involved in our comparison are briefly described; for further details the
reader is referred to the original papers, which are readily available. Here, we consider a layered feed-forward network with $L$ layers (we do count the input units as a layer) and assume that the network is trained over a sample of $P$ training patterns. As a conventional shorthand, we shall denote unit $i$ of layer $l$ by $u_{il}$. The number of units in layer $l$ will be denoted by $n_l$, $w_{ijl}$ will represent the weight between units $u_{il}$ and $u_{j,l+1}$, and $y_{il}$ will denote the output of unit $u_{il}$. Each non-input unit $u_{il}$ receives from the preceding layer a net input $t_{il}$ given by

$$t_{il} = \sum_{j=1}^{n_{l-1}} w_{ji,l-1}\, y_{j,l-1} + \theta_{il} \qquad (1)$$

where $\theta_{il}$ is a bias value for $u_{il}$; the unit will produce its own output $y_{il}$ according to some nonlinear activation function, which is commonly chosen to be the logistic function $f(x) = (1 + \exp\{-x\})^{-1}$. In the following, $y_{il}(p)$ will denote the actual output of unit $u_{il}$ following presentation of pattern $p$.
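For concreteness, the forward pass defined by Eq. (1) with logistic units can be sketched in a few lines of NumPy. This is only an illustrative sketch: the data layout (one weight matrix and one bias vector per non-input layer) and the function names are our own and are not part of the methods compared below.

```python
import numpy as np

def logistic(z):
    """Logistic activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, weights, biases):
    """Propagate one input pattern through a layered feed-forward network.

    weights[l] has shape (n_{l+1}, n_l): entry [i, j] is the weight w_{jil}
    between unit j of layer l and unit i of layer l+1.  biases[l] holds the
    bias values theta_{i,l+1}.  Returns the list of layer outputs, starting
    with the input layer itself.
    """
    outputs = [np.asarray(x, dtype=float)]
    for W, theta in zip(weights, biases):
        net = W @ outputs[-1] + theta   # Eq. (1): net input of layer l+1
        outputs.append(logistic(net))   # y = f(t)
    return outputs
```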
Sietsma and Dow's method [3], [4] (hereafter the S&D method) works as follows. Suppose that hidden unit $u_{il}$ has constant output $\bar{y}$. Then it is removed, and every bias $\theta_{j,l+1}$ is incremented by the quantity $\bar{y}\, w_{ijl}$. If, instead, $u_{il}$ has approximately the same output as $u_{kl}$, it is eliminated and its outgoing weights are added to those of node $u_{kl}$. Finally, if $u_{il}$'s output is approximately equal to $1 - y_{kl}$, then $u_{il}$ is removed, its weights are subtracted from those of $u_{kl}$, and the biases $\theta_{j,l+1}$ are incremented by the weights $w_{ijl}$.
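A minimal sketch of these three removal rules is given below, assuming the outgoing weights of the hidden layer are stored column-wise in a matrix `W_out` (so that `W_out[j, i]` plays the role of $w_{ijl}$) and the biases of layer $l+1$ in a vector `theta_next`; the function names and the data layout are illustrative assumptions, not part of the original S&D formulation.

```python
import numpy as np

def prune_constant_unit(W_out, theta_next, i, y_bar):
    """Rule 1: hidden unit i has (nearly) constant output y_bar.
    Its contribution y_bar * w_{ijl} is folded into the biases of the next
    layer, and its outgoing weights (column i of W_out) are dropped."""
    theta_next = theta_next + y_bar * W_out[:, i]
    W_out = np.delete(W_out, i, axis=1)
    return W_out, theta_next

def prune_duplicate_unit(W_out, theta_next, i, k, inverse=False):
    """Rules 2 and 3: hidden unit i behaves like unit k (inverse=False) or
    like 1 - y_k (inverse=True).  Unit i is removed; its outgoing weights are
    added to (or subtracted from) those of unit k, and in the inverse case
    they are also added to the next-layer biases."""
    W_out = W_out.copy()
    if inverse:
        theta_next = theta_next + W_out[:, i]
        W_out[:, k] -= W_out[:, i]
    else:
        W_out[:, k] += W_out[:, i]
    W_out = np.delete(W_out, i, axis=1)
    return W_out, theta_next
```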
In the P&F method [6], the removal of a hidden unit is compensated for by adjusting the remaining weights of the layer, which amounts to solving a system of linear equations by means of the CGPCNE algorithm [7]. At iteration $k$ of CGPCNE, the residual is

$$r_k = \|A\bar{x}_k - \bar{b}\|. \qquad (3)$$
The criterion for choosing the units to be eliminated strongly depends on the particular algorithm used to solve the linear system. In fact, as CGPCNE is a residual-reducing method, the unit to be removed is chosen in such a way that the initial residual $r_0$ is minimum among all hidden units of the layer under consideration. As $\bar{x}_0$ is usually chosen to be the null vector, this corresponds to choosing the unit for which $\|\bar{b}\|$ is minimum. Denoting by $\bar{y}_{kl}$ the $P$-dimensional vector consisting of the outputs of $u_{kl}$ upon presentation of the training patterns, the vector $\bar{b}$ is simply obtained by concatenating the $n_{l+1}$ vectors $w_{kil}\, \bar{y}_{kl}$, for $i = 1, \ldots, n_{l+1}$. Summarizing, the P&F algorithm starts with a previously trained network and iteratively produces a sequence of reduced networks. At each step, the minimum-norm unit is located and the corresponding system is solved. Then, the performance of the reduced network is tested over the training set: if it is sufficiently close to that of the initial trained network, the process is repeated; otherwise, the last reduced network is rejected and the algorithm stops.
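The outer loop of this procedure might look as follows. The helpers `build_system`, `remove_unit` and `error_rate`, as well as the `hidden_units` attribute, are hypothetical placeholders, and an ordinary least-squares solve stands in for the CGPCNE algorithm [7] used in the actual method; the sketch only illustrates the accept/reject logic described above.

```python
import numpy as np

def iterative_prune(net, train_set, error_rate, build_system, remove_unit,
                    t3=0.01):
    """Sketch of the outer pruning loop.  build_system(net, u) is assumed to
    return the matrix A and right-hand side b of the linear system associated
    with removing hidden unit u; remove_unit(net, u, delta) applies the
    computed weight corrections and deletes u."""
    e_orig = error_rate(net, train_set)
    while net.hidden_units:
        # pick the unit with the smallest ||b||, i.e. the smallest initial
        # residual when the solver is started from the null vector
        systems = {u: build_system(net, u) for u in net.hidden_units}
        k = min(systems, key=lambda u: np.linalg.norm(systems[u][1]))
        A, b = systems[k]
        delta, *_ = np.linalg.lstsq(A, b, rcond=None)  # stand-in for CGPCNE
        candidate = remove_unit(net, k, delta)
        if error_rate(candidate, train_set) - e_orig < t3:
            net = candidate    # accept the reduced network and continue
        else:
            break              # reject it and stop pruning
    return net
```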
3. Experimental Results
In order to compare the performance of the pruning methods described above, extensive simulations
were carried out over three different tasks. The first two are classical Boolean problems for which the
minimum number of hidden units is known, namely the parity and the symmetry problems [8]. In
addition, to test the methods over tasks that resemble real-world problems, a simulated pattern
recognition task was used. As the methods often require a further training phase after pruning to
achieve the original performance, we carried out simulations both with and without retraining. In our
analysis, three kinds of measures were used to assess the performance of the pruning methods: 1) the
size of the reduced networks, 2) the misclassification rate, measured as the proportion of exemplars for
which at least one network output value differs from the corresponding target by more than 0.5; and 3)
the usual mean squared error (MSE). For the two Boolean problems these were computed over the
training set, while for the pattern recognition task an additional testing set was used.
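For clarity, the two performance measures can be written out directly; this is a straightforward sketch assuming the network outputs and the targets are stored as $P \times n_L$ arrays.

```python
import numpy as np

def misclassification_rate(outputs, targets):
    """Proportion of patterns for which at least one output unit differs
    from its target by more than 0.5 (arrays of shape (P, n_L))."""
    wrong = np.any(np.abs(outputs - targets) > 0.5, axis=1)
    return wrong.mean()

def mean_squared_error(outputs, targets):
    """Usual MSE over all patterns and output units."""
    return np.mean((outputs - targets) ** 2)
```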
3.1. Implementation
Each of the competing methods was run over ten over-dimensioned networks (labeled from A to
J) with randomly-generated initial weights. The training (as well as the retraining) of the networks
was carried out by back-propagation with learning rate $\eta = 1.0$ and momentum term $\alpha = 0.7$. For the
two Boolean problems, learning was stopped when, for each exemplar in the training set, all the
network output values differed from the corresponding targets by less than 0.05; for the pattern
recognition task, instead, exactly 1,000 iterations of back-propagation were carried out.
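As an illustration, the weight update with momentum and the stopping rule used for the Boolean tasks could be sketched as follows; the function names are ours, and the computation of the error gradient itself is omitted.

```python
import numpy as np

def momentum_update(w, grad, prev_step, eta=1.0, alpha=0.7):
    """One back-propagation step with momentum; eta and alpha default to the
    values used here for the Boolean problems."""
    step = -eta * grad + alpha * prev_step
    return w + step, step

def boolean_task_converged(outputs, targets, tol=0.05):
    """Stopping rule for the Boolean tasks: every output value of every
    training exemplar must lie within tol of its target."""
    return bool(np.all(np.abs(outputs - targets) < tol))
```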
In order to implement Sietsma and Dow's method, their pruning rules must be translated into more precise criteria. In our implementation, a hidden unit is regarded as having constant behavior when the variance of its output vector is lower than a threshold $t_1$. Moreover, we considered two hidden units $u_{il}$ and $u_{jl}$ to have the same output values when the normalized distance $\|\bar{y}_{il} - \bar{y}_{jl}\|/P$ is less than another threshold $t_2$; for units giving opposite outputs, the same condition was applied to $\bar{y}_{il}$ and $\bar{1} - \bar{y}_{jl}$. As can be expected, we found that with small threshold values very few redundant units are removed, while large values result in too many removed nodes and, as a consequence, the network performance worsens excessively. After a long tuning phase we set $t_1 = 0.01$ and $t_2 = 0.1$, which appeared to be near-optimal.
Also, as Sietsma and Dow did in their simulations, before detecting the units to be removed, output
values less than 0.35 and greater than 0.65 were approximated to 0 and 1 respectively. A further
implementation problem with the S&D method involves the order in which units are removed. In their
original paper, Sietsma and Dow mention a heuristic method for selecting the units to be eliminated
but, unfortunately, they do not specify how this is accomplished. One reasonable choice, which was
actually implemented in the simulations presented here, is to remove units in the following order: first constant-output units, then duplicated units, and finally inversely-duplicated units. When more than one pair of
either duplicated or inversely-duplicated units was detected, the pair with the closest output vectors was
selected and the unit with the more nearly constant output was removed. We also ran some experiments
with alternative criteria, but this turned out to be the best one.
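Putting the above criteria together, the detection step of our S&D implementation can be sketched as below. The function name and array layout are illustrative assumptions; the thresholds and the 0.35/0.65 rounding follow the values given in the text.

```python
import numpy as np

def detect_redundant_units(Y, t1=0.01, t2=0.1):
    """Y has shape (P, n_hidden): column i is the output vector of hidden
    unit i over the P training patterns.  Returns the indices of (nearly)
    constant units and the pairs of duplicated / inversely-duplicated units."""
    Y = Y.copy()
    Y[Y < 0.35] = 0.0        # approximate near-saturated outputs to 0 and 1
    Y[Y > 0.65] = 1.0
    P, n = Y.shape
    constant = [i for i in range(n) if np.var(Y[:, i]) < t1]
    duplicate, inverse = [], []
    for i in range(n):
        for k in range(i + 1, n):
            if np.linalg.norm(Y[:, i] - Y[:, k]) / P < t2:
                duplicate.append((i, k))
            if np.linalg.norm(Y[:, i] - (1.0 - Y[:, k])) / P < t2:
                inverse.append((i, k))
    return constant, duplicate, inverse
```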
In contrast with the S&D method, the other methods considered here do not require such a time-consuming definition of working parameters. In particular, the P&F method only requires an explicit criterion stating when to stop the pruning process. Our choice was to terminate the algorithm when $E_r - E_o < t_3$, where $E_r$ and $E_o$ are the misclassification rates on the training set of the reduced and of the original network, respectively. In our simulations, a value of 0.01 was chosen for $t_3$. In the implementation of the B method, instead, two parameters are needed: the convergence factor $\beta$, and the threshold $t_4$ used to approximate the auxiliary weights to 0 or 1. We set $\beta = 0.1$ and $t_4 = 0.7$, as suggested by Burkitt himself [5].
3.2. Parity
In this series of simulations the 4-bit parity problem was considered, which requires at least 4 hidden
units. The initial over-sized networks had one input layer of 4 units, one 10-unit hidden layer, and 1
output node (referred to as a 4-10-1 architecture). For both the S&D and the P&F methods, the back-
propagation algorithm was applied as described previously (i.e., $\eta = 1$ and $\alpha = 0.7$). For the B method,
instead, this set of parameters was found to perform very poorly, as none of the ten training trials
converged to a solution within 3,000 epochs. Therefore, the values $\eta = 0.25$ and $\alpha = 0.96$ were employed,
which were used by Burkitt in his original successful experiments. However, even with Burkitt's
values, 4 of the 10 learning runs did not converge.
After the training phase, the pruning methods (without retraining) were run and the results are
summarized in Table I which shows, for each of the ten networks and for each of the competing
methods, the number of hidden units of the pruned networks, along with the corresponding
misclassification rate and the MSE. In the table, asterisks denote the networks that back-propagation
was not able to train.
Table I
Comparative results on the parity problem
On the average, the S&D pruned networks had 5.1 hidden units but, as can be seen from the table, a
great deal of variability exists. The P&F method appears to be much more robust, as 90% of the
pruned networks had 5 hidden units, a value which is quite close to the minimum required, and all the
reduced networks maintained a 0.0% misclassification rate. Finally, the B method often produced the
minimal network solution while maintaining good performance. Notice that both the S&D and the B
methods generated one network having fewer than the ideal minimum number of hidden units. Next,
we tried to retrain the pruned networks with back-propagation. Surprisingly, we found that the
retraining of 5 out of 10 S&D reduced networks (including, of course, the one with 3 hidden nodes) did
not converge within 3,000 epochs. For the remaining 5, a median of 24 epochs was needed
(ranging from 0 to 119). The median number of retraining epochs for the P&F pruned nets was 70,
with a minimum of 0 and a maximum of 251. Burkitt's pruned networks, instead, were retrained with
a median number of epochs equal to 7 (min = 0, max = 14), apart from the 3-hidden-unit net, for which
convergence was not achieved.
3.3. Symmetry
The symmetry problem [8] is known to require a minimum of two hidden units for solution. In our
experiments a 4-10-1 initial architecture was used. Back-propagation was applied with parameters as
defined above and, here, no problems arose with its convergence. The results of this series of simulations
(without retraining) are summarized in Table II.
Table II
Comparative results on the symmetry problem
It is seen that, again, the S&D method is quite sensitive to the initial conditions and provided very poor
results, as the reduced networks had between 4 and 9 units. Both the P&F and the B methods gave
qualitatively similar results in terms of the number of hidden nodes (an average of about 3.5 for both),
but somewhat different in terms of performance. The pruned networks were later subjected to a retraining
phase. Here, all the retraining trials converged: for both the S&D and the B pruned nets a median
of about 45 epochs was needed for convergence, while the P&F networks required a median of
77 epochs.
3.4. Pattern Recognition
Table III
Comparative results on the pattern recognition problem
As can be seen, both the S&D and P&F methods achieved qualitatively similar results in terms of the
number of hidden units (an average of 5.1) but slightly different in terms of misclassification rate (an
average of 4.35% for S&D and 3.75% for P&F). The B method, instead, produced smaller networks (a
mean number of hidden units of 3.1) with the same performance as S&D. In this problem, the
retraining phase did not produce significant variations with respect to the original nets, for all the three
methods. Next, we performed some experiments aimed at comparing the generalization ability. Both
the original and the reduced networks were run over the 1,000-sample test set and the results are shown
in Table IV. We see that all the methods obtained networks having roughly equivalent performance,
which is slightly better than that of the original nets.
Table IV
Generalization results on the pattern recognition problem
Finally, we mention that in a separate series of simulations, not reported here, we noted that the
P&F method produces considerably better results on this problem when different stopping criteria are used (see
also [9]), both in terms of network size and generalization performance.
4. Conclusions
The experimental results presented in this paper allow us to draw some useful conclusions about the
performance of three node pruning algorithms. Firstly, the S&D method performs very poorly, not only
because of the laborious setting of its working parameters but, mainly, because of its lack of robustness. By
contrast, both the P&F and the B methods do a very good job of reducing the network size while
maintaining satisfactory performance, without being appreciably affected by the initial conditions.
Moreover, they frequently produce the minimal solution network. This is especially true for the B
method. However, since the B method needs to train a network with auxiliary hidden layers, it may cause some convergence
problems, which sometimes forces one to look for an appropriate set of parameters. This problem was
also encountered by Burkitt in his experiments, although less frequently. The P&F method, instead,
does not suffer from this drawback. One further nice feature of the P&F method, which stems
essentially from its iterative nature, is that it allows one to determine the performance of the reduced
networks in advance, and this usually results in small networks for which retraining is not needed. All
three methods seem not to improve the generalization ability of the reduced networks, but a much
more systematic analysis is needed; this is what we are currently trying to do.
References
[1] D. R. Hush and B. G. Horne, Progress in supervised neural networks - What's new since
Lippmann? IEEE Signal Processing Mag., pp. 8-39, Jan. 1993.
[2] E. D. Karnin, A simple procedure for pruning back-propagation trained neural networks, IEEE
Trans. Neural Networks, vol. 1, no. 2, pp. 239-242, 1990.
[3] J. Sietsma and R. J. F. Dow, Neural net pruning - Why and how, in Proc. Int. Conf. Neural
Networks, San Diego, CA, 1988, pp. I:325-333.
[4] J. Sietsma and R. J. F. Dow, Creating artificial neural networks that generalize, Neural
Networks, vol. 4, pp. 67-79, 1991.
[5] A. N. Burkitt, Optimization of the architecture of feed-forward neural networks with hidden
layers by unit elimination, Complex Syst., vol. 5, pp. 371-380, 1991.
[6] M. Pelillo and A. M. Fanelli, A method of pruning layered feed-forward neural networks, in
Proc. IWANN'93, Sitges, Barcelona, June 1993 (Berlin: Springer-Verlag).
[7] A. Björck and T. Elfving, Accelerated projection methods for computing pseudoinverse solutions
of systems of linear equations, BIT, vol. 19, pp. 145-163, 1979.
[8] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by error
propagation, in D. E. Rumelhart and J. L. McClelland (eds.), Parallel Distributed Processing,
vol. 1, Cambridge, MA: MIT Press, 1986, pp. 318-362.
[9] L. Niles, B. Silverman, G. Tajchman, and M. Bush, How limited training data can allow a
neural network to outperform an optimal statistical classifier, in Proc. ICASSP-89, Glasgow,
1989, vol. 1, pp. 17-20.