A Set of Neural Network Benchmark Problems and Benchmarking Rules
Lutz Prechelt ([email protected]), Fakultät für Informatik, Universität Karlsruhe, 76128 Karlsruhe, Germany. Phone: +49 721 608-4068, Fax: +49 721 694092. September 30, 1994
Technical Report 21/94
Abstract
Proben1 is a collection of problems for neural network learning in the realm of pattern classification and function approximation, plus a set of rules and conventions for carrying out benchmark tests with these or similar problems. Proben1 contains 15 data sets from 12 different domains. All datasets represent realistic problems which could be called diagnosis tasks, and all but one consist of real world data. The datasets are all presented in the same simple format, using an attribute representation that can directly be used for neural network training. Along with the datasets, Proben1 defines a set of rules for how to conduct and how to document neural network benchmarking. The purpose of the problem and rule collection is to give researchers easy access to data for the evaluation of their algorithms and networks and to make direct comparison of the published results feasible. This report describes the datasets and the benchmarking rules. It also gives some basic performance measures indicating the difficulty of the various problems. These measures can be used as baselines for comparison.
Contents
1 Introduction
  1.1 Why a benchmark set?
  1.2 Why benchmarking rules?
  1.3 Scope of Proben1
  1.4 Why no artificial benchmarks?
  1.5 Related work
2 Benchmarking rules
  2.1 General principles
  2.2 Benchmark problem used
  2.3 Training set, validation set, test set
  2.4 Input and output representation
  2.5 Training algorithm
  2.6 Error measures
  2.7 Network used
  2.8 Training results
  2.9 Training times
  2.10 Important details
  2.11 Author's quick reference
3 Benchmarking problems
  3.1 Classification problems
    3.1.1 Cancer
    3.1.2 Card
    3.1.3 Diabetes
    3.1.4 Gene
    3.1.5 Glass
    3.1.6 Heart
    3.1.7 Horse
    3.1.8 Mushroom
    3.1.9 Soybean
    3.1.10 Thyroid
    3.1.11 Summary
  3.2 Approximation problems
    3.2.1 Building
    3.2.2 Flare
    3.2.3 Hearta
    3.2.4 Summary
  3.3 Some learning results
    3.3.1 Linear networks
    3.3.2 Choosing multilayer architectures
    3.3.3 Multilayer networks
    3.3.4 Comparison of multilayer results
List of Tables
1  Attribute structure of classification problems
2  Attribute structure of approximation problems
3  Linear network results of classification problems
4  Linear network results of approximation problems
5  Architecture finding results of classification problems
6  Architecture finding results of approximation problems
7  Pivot architectures for the datasets
8  Pivot architecture results of classification problems
9  Pivot architecture results of approximation problems
10 No-shortcut architecture results of classification problems
11 No-shortcut architecture results of approximation problems
12 t-test comparison of pivot and no-shortcut results
1 Introduction
This section discusses why standardized datasets and benchmarking rules for neural network learning are necessary at all, what the scope of Proben1 is, and why real data should be used instead of or in addition to the artificial problems that are often used today.
Its availability lays the ground for better algorithm evaluations, by enabling easy access to example data of real problems, and for better comparability of the results, if everybody uses the same problems and setup, while at the same time reducing the workload for the individual researcher. Aspects of learning algorithms that can be studied using Proben1 are for example learning speed, resulting generalization ability, ease of user parameter selection, and network resource consumption. What cannot be assessed well using a set of problems with fixed representation such as Proben1 are all those aspects of learning that have to do with the selection or creation of a suitable problem representation. Lack of standard problems is widespread in many areas of computer science. At least in some fields, though, standard benchmark problems exist and are used frequently. The most notable of these positive examples are performance evaluation for computing hardware and for compilers. For many other fields it is clear that defining a reasonable set of such standard problems is a very difficult task, but neural network training is not one of them.
others need the capability of continuous multivariate function approximation. Most problems have both continuous and binary input values. All problems are presented as static problems in the sense that all data to learn from is present at once and does not change during learning. All problems except one (the mushroom problem) consist of real data from real problem domains. The common properties of the learning tasks themselves are characterized by them all being what I call diagnosis tasks. Such tasks can be described as follows:

1. The input attributes used are similar to those that a human being would use in order to solve the same problem.
2. The outputs represent either a classification into a small number of understandable classes or the prediction of a small set of understandable quantities.
3. In practice, the same problems are in fact often solved by human beings.
4. Examples are expensive to get. This has the consequence that the training sets are not very large.
5. Often some of the attribute values are missing.

The scope of the Proben1 rules can be characterized as follows. The rules are meant to apply to all supervised training algorithms. Their presentation, however, is biased towards the training of feed forward networks with gradient descent or similar algorithms. Hence, some of the aspects mentioned in the rules do not apply to all algorithms and some of the aspects relevant to certain algorithms have been left out. The rules suggest certain choices for particular aspects of experimental setups as standard choices and say how to report such choices and the results of the experiments. Both parts of Proben1, problems as well as rules, cover only a small part of neural network learning algorithm research. Additional collections of benchmark problems are needed to cover more domains of learning (e.g. application domains such as vision, speech recognition, character recognition, control, time series prediction; learning paradigms such as reinforcement learning, unsupervised learning; network types such as recurrent networks, analog continuous-time networks, pulse frequency networks). Sufficient benchmarks are available today for only a few of these fields. Additions and changes to the rules will also be needed for most of these new domains, learning paradigms, and network types. This is why the digit 1 was included in the name of Proben1; maybe some day Proben100 will be published and the field will be mature.
is, similar to the ones mentioned above, that we know a priori that a simple exact solution exists, at least when using the right framework to express it. It is unclear how this property influences the observed capability of a learning algorithm or network to find a good solution: some algorithms may be biased towards the kind of regularity needed for a good solution of these problems and will do very well on these benchmarks, although other algorithms without such a bias would be better in more realistic domains. Summing up, we can conclude that the main problem with the early artificial benchmarks is that we do not know what the results obtained for them tell us about the behavior of our systems on real world tasks. One way to transcend this limitation is to make the data generation process for the artificial problems resemble realistic phenomena. The usual way to do that is to replace or complement the data generation based on a simple logic or arithmetic formula by stochastic noise processes and/or by realistic models of physical phenomena. Compared to the use of real world data this has the advantage that the properties of each dataset are known, making it easier to characterize for what kinds of problems (i.e., dataset characteristics) a particular algorithm works better or worse than another. Two problems are left by this approach. First, there is still the danger of preferring algorithms that happen to be biased towards the particular kind of data generation process used. Imagine classification of datasets of point clouds generated by multidimensional Gaussian noise using a Gaussian-based radial basis function classifier. This can be expected to work very well, since the class of models used by the learning algorithm is exactly the same as the class of models employed in the data generation. Second, it is often unclear what parameters for the data generation process are representative of real problems in any particular domain. When overlaying a functional and a noise component, the questions to be answered are how strong the non-linear components of the function should be, of what type these non-linearities should be, and what amount of noise of which type should be added. Choosing the wrong parameters may create a dataset that does not resemble any real problem domain. Clearly, artificial datasets based on realistic models and real data sets both have their place in algorithm development and evaluation. A reason for preferring real data over artificially generated data is that the former choice guarantees results that are relevant for at least a few real domains, namely the ones being tested. Multiple domains must be used in order to increase the confidence that the results obtained did not occur due to a particular domain selection only.
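To see the bias pitfall concretely, consider a minimal sketch (Python with numpy; the class centers, spread, and sizes are arbitrary illustrative choices, not values from this report) of the kind of artificial data generation just described: one multidimensional Gaussian point cloud per class. A radial basis function classifier built from Gaussian kernels matches this generation process exactly and can therefore be expected to look deceptively good on such data.

    import numpy as np

    def gaussian_point_clouds(n_per_class=100, dim=5, n_classes=3,
                              spread=1.0, seed=1):
        """Artificial classification data: one Gaussian cloud per class."""
        rng = np.random.default_rng(seed)
        # arbitrary class centers; 'spread' controls how much the clouds overlap
        means = rng.uniform(-3.0, 3.0, size=(n_classes, dim))
        xs, ys = [], []
        for c in range(n_classes):
            xs.append(means[c] + spread * rng.standard_normal((n_per_class, dim)))
            ys.append(np.full(n_per_class, c))
        return np.concatenate(xs), np.concatenate(ys)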
meant for general machine learning programs; most of them cannot readily be learned by neural networks, because an encoding of nominal attributes and missing attribute values has to be chosen first. In both collections, the individual datasets themselves were donated by various researchers. With a few exceptions, no partitioning of the dataset into training and test data is defined in the archives. In no case is a sub-partitioning of the training data into training set and validation set defined. The different variants that exist for some of the datasets in the UCI archive create a lot of confusion, because it is often not clear which one was used in an experiment. The Proben1 benchmark collection contains datasets that are taken from the UCI archive (with one exception). The data is, however, encoded for direct neural network use, is pre-partitioned into training, validation, and test examples, and is presented in a very exactly documented and reproducible form. Zheng's benchmark [23], which I recommend everybody to read, does not include its own data, but defines a set of 13 problems, predominantly from the UCI archive, to be used as a benchmark collection for classifier learning algorithms. The selection of the problems is made for good coverage of a taxonomy of classification problems with 16 two- or three-valued features, namely type of attributes, number of attributes, number of different nominal attribute values, number of irrelevant attributes, dataset size, dataset density, level of attribute value noise, level of class value noise, frequency of missing values, number of classes, default accuracy, entropy, predictive accuracy, relative accuracy, average information score, and relative information score. The Proben1 benchmark problems have not explicitly been selected for good coverage of all of these aspects. Nevertheless, for most of the aspects a good diversity of problems is present in the collection.
2 Benchmarking rules
This section describes how to conduct valid benchmark tests and how to publish them and their results. The purpose of the rules is to ensure the validity of the results and reproducibility by other researchers. An additional benefit of standardized benchmark setups is that results will more often be directly comparable.
Reproducibility: The rules name those aspects of a benchmarking setup that need to be published to attain reproducibility. For many of these aspects, standard formulations are suggested in order to simplify presentation and comprehension.

Comparability: It is very useful if one can compare results obtained by different researchers directly. This is possible if the same experimental setup is used. The rules hence suggest a number of so-called standard choices for experimental setups that are recommended to be used unless specific reasons stand against it. The use of such standard choices reduces the variability of benchmarking setups and thus improves comparability of results across different publications.

In the rules below, phrases typeset in sans serif font (like this) indicate suggested formulations to be used in publications in order to reduce the ambiguity of setup descriptions. The following sections present the Proben1 benchmarking rules.
complicated, since there may be many local minima in the validation set error curve, and since in order to recognize a minimum one has to train until the error rises again, so that resetting the network to an earlier state is needed in order to actually stop at the minimum. See section 3.3 for a more concrete description. Other forms of cross validation besides early stopping are also possible. The data of the validation set could be used in any way during training, since it is part of the training data. The actual name 'validation set', however, is only appropriate if the set is used to assess the generalization performance of the network. Note the differentiation: training data is the union of training set and validation set. Be sure to specify exactly which examples of a dataset are used for the training, validation, and test set. It is insufficient to indicate the number of examples used for each set, because it might make a significant difference which ones are used where. As a drastic example, think of a binary classification problem where only examples of one class happen to be in the training data. For Proben1, a suggested partitioning into training, validation, and test set is given for each dataset. The size of the training, validation, and test set in all Proben1 data files is 50%, 25%, and 25% of all examples, respectively. Note that this percentage information is not sufficient for an exact determination of the sets unless the total number of examples is divisible by four. Hence, the header of each Proben1 data file lists explicitly the number of examples to be used for each set. Assume that these numbers are X, Y, and Z. Then the standard partitioning is to use the first X examples for the training set, the following Y examples for the validation set, and the final Z examples for the test set. If no validation set is needed, the training set consists of the first X + Y examples instead. As said before, for problems with only a small number of examples, results may vary significantly for different partitionings (see also the results presented below in section 3.3). Hence it improves the significance of a benchmark result when different partitionings are used during the measurements and results are reported for each partitioning separately. Proben1 supports this approach. It contains three different permutations of each dataset. For instance the problem glass is available in three datasets glass1, glass2, and glass3, which differ only in the ordering of examples, thereby defining three different partitionings of the glass problem data. Additional partitionings (although not completely independent ones) are defined by the following rules for the order of examples in the dataset file:

(a) training set, validation set, test set.
(b) training set, test set, validation set.
(c) validation set, training set, test set.
(d) validation set, test set, training set.
(e) test set, validation set, training set.
(f) test set, training set, validation set.

This list is to be read as follows: From a partitioning, say glass1, six partitionings can be created by re-interpreting the data into a different order of training, validation, and test set. For instance glass1d means to take the data file of glass1 and use the first 25% of the examples for the validation set, the next 25% for the test set, and the final 50% for the training set. Obviously, when no validation set is used, (a) is the same as (c) and (e) is the same as (f); thus only (a), (b), (d), and (e) are available. glass1a is identical to glass1. The latter is the preferred name when none of (b) to (f) are used in the same context. Note that these partitionings are of lower quality than those created by the permutations 1 to 3, since the latter are independent of each other while the former are not. Therefore, the additional partitionings should be used only when necessary; in most cases, just using xx1, xx2, and xx3 for each problem xx will suffice. If you want to use a different partitioning than these standard ones for a Proben1 problem, specify
exactly how many examples for each set you use. If you do not take them from the data file in the order training examples, validation examples, test examples, specify the rule used to determine which examples are in which set. Examples: glass1 with 107 examples used for the training set and 107 examples used for the test set (for a standard order but nonstandard size of the sets), or glass1 with even-numbered examples used for the training set and odd-numbered examples used for the test set, the first example being number 0 (for a nonstandard size and order of sets). If you use the Proben1 conventions, just say glass1 and mention somewhere in your article that your benchmarks conform to the Proben1 conventions, e.g. All benchmark problems were taken from the Proben1 benchmark set; the standard Proben1 benchmarking rules were applied. An imprecise specification of the partitioning of a known data set into training, validation, and test set is probably the most frequent and the worst obstacle to reproducibility and comparability of published neural network learning results.
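The standard partitioning and its re-interpretations (a) to (f) are mechanical enough to state as code. A minimal sketch in Python (the function name and the list-based data representation are illustrative, not part of Proben1; x, y, z are the set sizes from the data file header):

    def partition(examples, x, y, z, order="a"):
        """Split a list of examples into training, validation, and test set.
        'order' selects one of the re-interpretations (a) to (f), i.e.,
        which role the first, second, and third block of the file plays."""
        assert x + y + z == len(examples)
        roles = {"a": ("train", "val", "test"), "b": ("train", "test", "val"),
                 "c": ("val", "train", "test"), "d": ("val", "test", "train"),
                 "e": ("test", "val", "train"), "f": ("test", "train", "val")}
        sizes = {"train": x, "val": y, "test": z}
        parts, pos = {}, 0
        for role in roles[order]:  # walk the blocks in file order
            parts[role] = examples[pos:pos + sizes[role]]
            pos += sizes[role]
        return parts["train"], parts["val"], parts["test"]

For example, partition(data, x, y, z, "d") reads the first Y examples as validation set, the next Z as test set, and the final X as training set, exactly as in the glass1d example above.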
Most of the above discussion applies to outputs as well, except for the fact that there never are missing outputs. Most Proben1 problems are classification problems; all of these are encoded using a 1-of-m output representation for the m classes, even for m = 2. The problem representation in Proben1 is fixed. This improves the comparability of results and reduces the work needed to run benchmarks. The Proben1 datasets are meant to be used exactly as they are. The fixed neural network input and output representation is actually one of the major improvements of Proben1 over the previous situation. In the past, most benchmarks consisting of real data were publicly available only in a symbolic representation, which can be encoded into a representation suitable for neural networks in many different ways. This fact made comparisons difficult. When you perform benchmarks that do not use problems from a well-defined benchmark collection, be sure to specify exactly which input and output representation you use. Since such a description consumes a lot of space, the only feasible way will usually be to make the data file used for the actual benchmark runs publicly available. Should you make small changes to the representation of Proben1 problems used in your benchmarks, specify these changes exactly. The most common cases of such changes will be concerned with the output representation. If you want to use only a single output for binary classification problems, say card1, using only one output or something similar. You may also want to ignore one of the outputs for problems having more than two classes, since one output is theoretically redundant, as the outputs always sum to one. If you ignore an output, you should always ignore the last output from the given representation. If you want to use outputs in the range −1…1 instead of 0…1, or in a somewhat reduced range in order to avoid saturation of the output nodes, say for example with the target outputs rescaled to the range −0.9…0.9. It will be assumed that the rescaling was done using a linear transformation of the form y' = ay + b. Other possibilities include for instance with the outputs rescaled to mean 0 and standard deviation 1, which will also be assumed to be made using a linear transformation. Of course, all these rescaling modifications can be done for inputs as well, but tell us if you make such changes. I do not recommend using Proben1 problems with representations that differ substantially from the standard ones, unless finding good representations is an important part of your work. The input and output representations used in Proben1 are certainly not optimal, but they are meant to be good or at least reasonable ones. Differences in problem representation, though, can make for large differences in the performance obtained (see for instance [2]), so be sure to specify your representation precisely.
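To make the linear transformation concrete: rescaling targets from the range 0…1 to −0.9…0.9 uses y' = 1.8y − 0.9 (i.e., a = 1.8, b = −0.9), since 1.8 · 0 − 0.9 = −0.9 and 1.8 · 1 − 0.9 = 0.9. Rescaling to mean 0 and standard deviation 1 uses a = 1/σ and b = −μ/σ, where μ and σ are the mean and standard deviation of the original output values.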
These parameters may include (depending on the algorithm) learning rate, momentum, weight decay, initialization, temperature, etc. For each such parameter there should be a clearly indicated unique name and perhaps also a symbol. For all of the parameters that are adaptive, the adaption rule and its parameters have to be specified as well. A particularly important aspect of a training algorithm is its stopping criterion, so do not forget to specify that as well (see section 3.3 for an example). For all user-selectable parameters, specify how you found the values used and try to characterize how sensitive the algorithm is to their choice. Note that you must not in any way use the performance on the test set while searching for good parameter values; this would invalidate your results! In particular, choosing parameters based on test set results is an error.
The squared error percentage used throughout this report is

    E = 100 · (o_max − o_min)/(N · P) · Σ_{p=1}^{P} Σ_{i=1}^{N} (o_pi − t_pi)²

where o_min and o_max are the minimum and maximum values in the output representation, N is the number of output nodes, P is the number of examples in the data set considered, and o_pi and t_pi are the actual and target values of output node i for example p.
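A small sketch of this error measure in Python (the numpy arrays o and t hold the actual and target outputs, one row per example; omin and omax describe the output range, which is 0 and 1 for most Proben1 problems):

    import numpy as np

    def squared_error_percentage(o, t, omin=0.0, omax=1.0):
        """E = 100 * (omax - omin) / (N * P) * sum of (o - t)^2
        over all P examples and N output nodes."""
        p, n = o.shape
        return 100.0 * (omax - omin) / (n * p) * float(np.sum((o - t) ** 2))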
uncommon case. Avoid the term classification performance; use classification accuracy and classification error instead. There are several possibilities for determining the classification a network has computed from the outputs of the network. We assume a 1-of-m encoding for m classes using output values 0 and 1. The simplest classification method is winner-takes-all, i.e., the output with the highest activation designates the class. Other methods involve the possibility of rejection, too. For instance one could require that there is exactly one output that is larger than 0.5, which designates the class if it exists and leads to rejection otherwise. To put an even stronger requirement on the credibility of the network output one can set thresholds, e.g. accept an output as 0 if it is below 0.3 and as 1 if it is above 0.7, and reject unless there is exactly one output that is 1 while all others are 0 by this interpretation. There are several other possibilities. When no rejection capability is needed, the winner-takes-all method is considered standard. In all other cases, describe your classification decision function explicitly.
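A minimal sketch of the two decision functions in Python (the REJECT marker and the 0.3/0.7 thresholds follow the example in the text; both are illustrative conventions, not Proben1 requirements):

    import numpy as np

    REJECT = -1  # marker for rejected examples

    def winner_takes_all(outputs):
        """Standard decision: the output with the highest activation wins."""
        return int(np.argmax(outputs))

    def threshold_decision(outputs, low=0.3, high=0.7):
        """Decision with rejection: read an output as 0 if below 'low' and
        as 1 if above 'high'; classify only if exactly one output reads 1
        and all others read 0, otherwise reject."""
        ones = [i for i, o in enumerate(outputs) if o > high]
        zeros = [i for i, o in enumerate(outputs) if o < low]
        if len(ones) == 1 and len(ones) + len(zeros) == len(outputs):
            return ones[0]
        return REJECT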
15
Other properties of the network architecture also have to be specified: the range and resolution of the weight parameters (unless plain 32-bit floating point is used), the activation function of each node in the network (except for the input nodes, which are assumed to use the identity function; see also section 2.10), and any other free parameters associated with the network.
standard deviation computed based on the degrees of freedom, which is one less than the number n of runs.
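In formula form, for run results x_1, …, x_n with mean x̄, this is the sample standard deviation s = sqrt( Σ_{i=1}^{n} (x_i − x̄)² / (n − 1) ).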
all the rest, you can remove these two outliers from the sample. Never remove more than 10% of the values from any one sample; usually one should remove much less. Never remove an outlier unless the resulting distribution satisfies the requirement well enough. Other data transformations than removing outliers may be more appropriate to satisfy the requirements of the statistical procedure; for instance, test errors are often log-normally distributed, so one must use the logarithm of the test error instead of the test error itself in order to produce valid results. See also section 3.3.4.
Network initialization. Specify the initialization conditions of the network. The most important point is the initialization of the network's weights, which must be done with random values for most algorithms in order to break the symmetry among the hidden nodes. Common choices for the initialization are for instance fixed methods, such as random weights from the range −0.1…0.1, where the distribution is assumed to be uniform unless stated otherwise, or methods that adapt to the network topology used, such as random weights from the range −1/√N…1/√N for connections into nodes with N input connections. Just like the termination criterion, the initialization can have significant impact on the results obtained, so it is important to specify it precisely. Specifying the exact sets of weights used is hopelessly difficult and should usually not be tried.

Termination and phase transition criteria. Specify exactly the criteria used to determine when training should stop, or when training should switch from one phase to the next, if any. For most algorithms, the results are very sensitive to these criteria. Nevertheless, in most publications the criteria are specified only roughly, if at all. This is one of the major weaknesses of many articles on neural network learning algorithms. See section 3.3 for an example of how to report stopping criteria; the GL family of stopping criteria, which is defined in that section, is recommended when using the early stopping method.
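A minimal sketch of the two quoted initialization conventions in Python/numpy (the function names are illustrative; weights are drawn for a layer with n_in inputs and n_out nodes):

    import numpy as np

    rng = np.random.default_rng(0)

    def init_fixed(n_in, n_out, r=0.1):
        """Fixed method: uniform random weights from -r ... r."""
        return rng.uniform(-r, r, size=(n_in, n_out))

    def init_fan_in(n_in, n_out):
        """Topology-adaptive method: uniform random weights from
        -1/sqrt(N) ... 1/sqrt(N), where N = n_in is the fan-in."""
        r = 1.0 / np.sqrt(n_in)
        return rng.uniform(-r, r, size=(n_in, n_out))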
3 Benchmarking problems
The following subsections each describe one of the problems of the Proben1 benchmark set. For each problem, a rough description of the semantics of the dataset is given, plus some information about the size of the dataset, its origin, and special properties, if any. For most of the problems, results have previously been published in the literature. Since these results never use exactly the same representation and training set/test set splitting as the Proben1 versions, the references are not given here; some of them can, however, be found in the documentation supplied with the original dataset, which is part of Proben1. The final section reports on the results of some learning runs with the Proben1 datasets.
3.1.2 Card
Predict the approval or non-approval of a credit card to a customer. Each example represents a real credit card application and the output describes whether the bank (or similar institution) granted the credit card or not. The meaning of the individual attributes is unexplained for confidentiality reasons. 51 inputs, 2 outputs, 690 examples. This dataset has a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values, occurring in 5% of the examples. 44% of the examples are positive; entropy 0.99 bits per example. This dataset was created based on the "crx" data of the "Credit screening" problem dataset from the UCI repository of machine learning databases.
3.1.3 Diabetes
Diagnose diabetes of Pima Indians. Based on personal data (age, number of times pregnant) and the results of medical examinations (e.g. blood pressure, body mass index, result of glucose tolerance test, etc.), try to decide whether a Pima Indian individual is diabetes positive or not. 8 inputs, 2 outputs, 768 examples. All inputs are continuous. 65.1% of the examples are diabetes negative; entropy 0.93 bits per example. Although there are no missing values in this dataset according to its documentation, there are several senseless 0 values. These most probably indicate missing data. Nevertheless, we handle this data as if it was real, thereby introducing some errors (or noise, if you want) into the dataset. This dataset was created based on the "Pima Indians diabetes" problem dataset from the UCI repository of machine learning databases.

(Footnote 4: Entropy E = −Σ_c P(c) log₂ P(c), where the sum runs over all classes c with class probabilities P(c).)
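A small sketch of the entropy computation from footnote 4 in Python, applied to the class distribution given above:

    import numpy as np

    def class_entropy(probs):
        """E = -sum over classes of P(c) * log2(P(c)), in bits per example."""
        p = np.asarray(probs, dtype=float)
        return float(-np.sum(p * np.log2(p)))

    print(round(class_entropy([0.651, 0.349]), 2))  # diabetes: 0.93 bits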
3.1.4 Gene
Detect intron/exon boundaries (splice junctions) in nucleotide sequences. From a window of 60 DNA sequence elements (nucleotides), decide whether the middle is either an intron/exon boundary (a donor), an exon/intron boundary (an acceptor), or none of these. 120 inputs, 3 outputs, 3175 examples. Each nucleotide, which is a four-valued nominal attribute, is encoded binary by two binary inputs. The input values used are −1 and 1, therefore the inputs are not declared as boolean. This is the only dataset that has input values not restricted to the range 0…1. There are 25% donors and 25% acceptors in the dataset; entropy 1.5 bits per example. This dataset was created based on the "splice junction" problem dataset from the UCI repository of machine learning databases.
3.1.5 Glass
Classify glass types. The results of a chemical analysis of glass splinters (percent content of 8 different elements) plus the refractive index are used to classify the sample to be either float processed or non-float processed building windows, vehicle windows, containers, tableware, or head lamps. This task is motivated by forensic needs in criminal investigation. 9 inputs, 6 outputs, 214 examples. All inputs are continuous; two of them have hardly any correlation with the result. As the number of examples is quite small, the problem is sensitive to algorithms that waste information. The sizes of the 6 classes are 70, 76, 17, 13, 9, and 29 instances, respectively; entropy 2.18 bits per example. This dataset was created based on the "glass" problem dataset from the UCI repository of machine learning databases.
3.1.6 Heart
Predict heart disease. Decide whether at least one of four major vessels is reduced in diameter by more than 50%. The binary decision is made based on personal data such as age, sex, smoking habits, subjective patient pain descriptions, and results of various medical examinations such as blood pressure and electrocardiogram results. 35 inputs, 2 outputs, 920 examples. Most of the attributes have missing values, some of them many: for attributes 10, 12, and 11, there are 309, 486, and 611 values missing, respectively. Most other attributes have around 60 missing values. Additional boolean inputs are used to represent the "missingness" of these values. The data is the union of four datasets: from the Cleveland Clinic Foundation, the Hungarian Institute of Cardiology, the V.A. Medical Center Long Beach, and the University Hospital Zurich. There is an alternate version of the dataset heart, called heartc, which contains only the Cleveland data (303 examples). This dataset represents the cleanest part of the heart data; it has only two missing attribute values overall, which makes the "value is missing" inputs of the neural network input representation almost redundant. Furthermore, there are still another two versions of the same data, hearta and heartac, corresponding to heart and heartc, respectively. The difference to the datasets described above is the representation of the output. Instead of using two binary outputs to represent the two-class decision "no vessel is reduced" against "at least one vessel is reduced", hearta and heartac use a single continuous output that represents by the magnitude of its activation the number of vessels that are reduced (zero to four). Thus, these versions of the heart problem are approximation tasks.
The heart and hearta datasets have 45% of patients with "no vessel is reduced" (entropy 0.99 bits per example); for heartc and heartac the value is 54% (entropy 1.00 bit per example). These datasets were created based on the "heart disease" problem datasets from the UCI repository of machine learning databases. Note that using these datasets requires including in any publication of the results the names of the institutions and persons who collected the data in the first place, namely (1) Hungarian Institute of Cardiology, Budapest; Andras Janosi, M.D., (2) University Hospital, Zurich, Switzerland; William Steinbrunn, M.D., (3) University Hospital, Basel, Switzerland; Matthias Pfisterer, M.D., (4) V.A. Medical Center, Long Beach and Cleveland Clinic Foundation; Robert Detrano, M.D., Ph.D. All four of these should be mentioned for the heart and hearta datasets, only the last one for the heartc and heartac datasets. See the detailed documentation of the original datasets in the proben1/heart directory.
3.1.7 Horse
Predict the fate of a horse that has colic. The results of a veterinary examination of a horse having colic are used to predict whether the horse will survive, will die, or will be euthanized. 58 inputs, 3 outputs, 364 examples. In 62% of the examples the horse survived, in 24% it died, and in 14% it was euthanized; entropy 1.32 bits per example. This problem has very many missing values (about 30% of the original attribute values overall), which are all represented explicitly as missing using additional inputs. This dataset was created based on the "horse colic" problem dataset from the UCI repository of machine learning databases.
3.1.8 Mushroom
Discriminate edible from poisonous mushrooms. The decision is made based on a description of the mushroom's shape, color, odor, and habitat. 125 inputs, 2 outputs, 8124 examples. Only one attribute has missing values (30% missing). This dataset is special within the benchmark set in several respects: it is the one with the most inputs, the one with the most examples, the easiest one (see footnote 5), and it is the only one that is not real in the sense that its examples are not actual observations made in the real world, but instead are hypothetical observations based on descriptions of species in a book ("The Audubon Society Field Guide to North American Mushrooms"). The examples correspond to 23 species of gilled mushrooms in the Agaricus and Lepiota family. In the book, each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. 52% of the examples are edible (ahem, I mean, have class attribute 'edible'); entropy 1.00 bit per example. This dataset was created based on the "agaricus lepiota" dataset in the "mushroom" directory from the UCI repository of machine learning databases.
(Footnote 5: The mushroom dataset is so simple that a net that performs only a linear combination of the inputs can learn it reliably to 0% classification error on the test set!)
3.1.9 Soybean
Recognize 19 different diseases of soybeans. The discrimination is done based on a description of the bean (e.g. whether its size and color are normal) and the plant (e.g. the size of spots on the leaves, whether these spots have a halo, whether plant growth is normal, whether roots are rotted), plus information about the history of the plant's life (e.g. whether changes in crop occurred in the last year or last two years, whether seeds were treated, what the environment temperature is). 82 inputs (from 35 original attributes), 19 outputs, 683 examples. This is the problem with the highest number of classes in the benchmark set. Most attributes have a significant number of missing values. The soybean problem has been used often in the machine learning literature, although with several different datasets, making comparisons difficult. Most of the past uses employ only 15 of the 19 classes, because the other four have only few instances. In this dataset, these are 8, 14, 15, and 16 instances (versus 20 for most of the other classes); entropy 3.84 bits per example. This dataset was created based on the "soybean large" problem dataset from the UCI repository of machine learning databases. Many results for this learning problem have been reported in the literature, but these were based on a large number of different versions of the data.
3.1.10 Thyroid
Diagnose thyroid hyper- or hypofunction. Based on patient query data and patient examination data, the task is to decide whether the patient's thyroid has overfunction, normal function, or underfunction. 21 inputs, 3 outputs, 7200 examples. For various attributes there are missing values, which are always encoded using a separate input. Since some results for this dataset using the same encoding are reported in the literature, thyroid1 is not a permutation of the original data, but retains the original order instead. The class probabilities are 5.1%, 92.6%, and 2.3%, respectively; entropy 0.45 bits per example. This dataset was created based on the "ann" version of the "thyroid disease" problem dataset from the UCI repository of machine learning databases.
3.1.11 Summary
For a quick overview of the classification problems, have a look at table 1. The table summarizes the external aspects of the training problems that you have already seen in the individual descriptions above. It also discriminates inputs that take on only two different values (binary inputs), inputs that take on more than two different values ("continuous" inputs), and inputs that are present only to indicate that values at some other inputs are missing. In addition, the table indicates the number of attributes of the original problem formulation that were used in the input representation, discriminated to be either binary attributes, "continuous" attributes, or nominal attributes with more than two values.
Problem      Problem attributes         Input values              Classes   Examples   E
             b    c    n    tot.        b    c    m    tot.
cancer       0    9    0    9           0    9    0    9          2         699        0.93
card         4    6    5    15          40   6    5    51         2         690        0.99
diabetes     0    8    0    8           0    8    0    8          2         768        0.93
gene         0    0    60   60          120  0    0    120        3         3175       1.50
glass        0    9    0    9           0    9    0    9          6         214        2.18
heart        1    6    6    13          18   6    11   35         2         920        0.99
heartc       1    6    6    13          18   6    11   35         2         303        1.00
horse        2    13   5    20          25   14   19   58         3         364        1.32
mushroom     0    0    22   22          125  0    0    125        2         8124       1.00
soybean      16   6    13   35          46   9    27   82         19        683        3.84
thyroid      9    6    0    21          9    6    6    21         3         7200       0.45

Table 1: Attribute structure of classification problems. Problems and the number of binary (b), continuous (c), and nominal (n) attributes in the original dataset; number of binary (b) and continuous (c) network inputs; number of network inputs used to represent missing values (m); number of classes; number of examples; class entropy E in bits per example. Continuous means more than two different ordered values.
3.2 Approximation problems

3.2.1 Building

14 inputs, 3 outputs, 4208 examples. This problem is in its original formulation an extrapolation task: complete hourly data for four consecutive months was given for training, and output data for the following two months should be predicted. The dataset building1 reflects this formulation of the task: its examples are in chronological order. The other two versions, building2 and building3, are random permutations of the examples, simplifying the problem to an interpolation problem. The dataset was created based on problem A of "The Great Energy Predictor Shootout: the first building data analysis and prediction problem" contest, organized in 1993 for the ASHRAE meeting in Denver, Colorado.
3.2.2 Flare
Prediction of solar flares. Try to guess the number of solar flares of small, medium, and large size that will happen during the next 24-hour period in a fixed active region of the sun surface. Input values describe previous flare activity and the type and history of the active region. 24 inputs, 3 outputs, 1066 examples. 81% of the examples are zero in all three output values. This dataset was created based on the "solar flare" problem dataset from the UCI repository of machine learning databases.
3.2.3 Hearta
The analog version of the heart disease diagnosis problem; see section 3.1.6 for the description. For hearta, 44.7%, 28.8%, 11.8%, 11.6%, and 3.0% of all examples have 0, 1, 2, 3, and 4 vessels reduced, respectively. For heartac these values are 54.1%, 18.2%, 11.9%, 11.6%, and 4.3%.
3.2.4 Summary
For a quick overview of the approximation problems, have a look at table 2.

Problem      Problem attribs.           Input values              Outputs   Examples
             b    c    n    tot.        b    c    m    tot.
building     0    6    0    6           8    6    0    14         3         4208
flare        5    2    3    10          22   2    0    24         3         1066
hearta       1    6    6    13          18   6    11   35         1         920
heartac      1    6    6    13          18   6    11   35         1         303

Table 2: Attribute structure of approximation problems. Problems and the number of binary (b), continuous (c), and nominal (n) attributes of the original problem representation used; number of binary and continuous network inputs; number of network inputs used to represent missing values (m); number of outputs; number of examples. Continuous means more than two different ordered values.

The table summarizes the external aspects of the training problems that you have already seen in the individual descriptions above. It also discriminates inputs that take on only two different values (binary inputs), inputs that take on more than two different values ("continuous" inputs), and inputs that are present only to indicate that values at some other inputs are missing. In addition, the table indicates the number of attributes of the original problem formulation that were used in the input representation, discriminated to be either binary attributes, "continuous" attributes, or nominal attributes with more than two values. The outputs have continuous values.
This method, called early stopping [6, 9, 12], is a good way to avoid overfitting [7] of the network to the particular training examples used, which would reduce the generalization performance. For optimal performance, the examples of the validation set should be used for further training afterwards, in order not to waste valuable data. Since the optimal stopping point for this additional training is not clear, it was not performed in the experiments reported here.

The GL5 stopping criterion is defined as follows. Let E be the squared error function. Let E_tr(t) be the average error per example over the training set, measured during epoch t. E_va(t) is the error on the validation set after epoch t and is used by the stopping criterion. E_te(t) is the error on the test set; it is not known to the training algorithm but characterizes the quality of the network resulting from training. The value E_opt(t) is defined to be the lowest validation set error obtained in epochs up to t:

    E_opt(t) = min_{t' ≤ t} E_va(t')

Now we define the generalization loss at epoch t to be the relative increase of the validation error over the minimum-so-far, in percent:

    GL(t) = 100 · (E_va(t) / E_opt(t) − 1)

A high generalization loss is one candidate reason to stop training. This leads us to a class of stopping criteria: stop as soon as the generalization loss exceeds a certain threshold α. We define the class GL_α as

    GL_α: stop after the first epoch t with GL(t) > α

To formalize the notion of training progress, we define a training strip of length k to be a sequence of k epochs numbered n+1 … n+k, where n is divisible by k. The training progress (measured in parts per thousand) after such a training strip is then

    P_k(t) = 1000 · ( Σ_{t' ∈ t−k+1…t} E_tr(t') / ( k · min_{t' ∈ t−k+1…t} E_tr(t') ) − 1 )

that is, "how much was the average training error during the strip larger than the minimum training error during the strip?" Note that this progress measure is high for unstable phases of training, where the training set error goes up instead of down. The progress is, however, guaranteed to approach zero in the long run unless the training is globally unstable (e.g. oscillating). Just like the progress, GL is also evaluated only at the end of each training strip.
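A minimal sketch of these definitions in Python (e_va and e_tr are lists of the per-epoch validation and training errors recorded so far; as stated above, both quantities are meant to be evaluated only at the end of each training strip of k epochs):

    def generalization_loss(e_va):
        """GL(t) = 100 * (E_va(t) / E_opt(t) - 1), with E_opt(t) the
        lowest validation error seen in epochs 1..t."""
        return 100.0 * (e_va[-1] / min(e_va) - 1.0)

    def progress(e_tr, k=5):
        """P_k(t): average training error over the last strip of k epochs,
        relative to the minimum training error within that strip,
        in parts per thousand."""
        strip = e_tr[-k:]
        return 1000.0 * (sum(strip) / (k * min(strip)) - 1.0)

    def stop_gl(e_va, alpha=5.0):
        """The GL_alpha criterion: stop after the first strip with
        GL(t) > alpha; GL5 is the case alpha = 5."""
        return generalization_loss(e_va) > alpha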
Problem     Training set  Validation set  Test set     Test classif.  Overfit      Total epochs  Relevant epochs
cancer1     4.25±0.00     2.91±0.01       3.52±0.04    2.93±0.18      0.55±0.59    129±13        104±31
cancer2     3.95±0.52     3.77±0.47       4.77±0.39    5.00±0.61      5.36±10.21   87±51         79±51
cancer3     3.30±0.00     4.23±0.04       4.11±0.03    5.17±0.00      0.35±0.64    115±18        92±29
card1       9.82±0.01     8.89±0.11       10.61±0.11   13.37±0.67     4.57±1.05    62±9          26±3
card2       8.24±0.01     10.80±0.16      14.91±0.55   19.24±0.43     4.22±1.08    65±10         23±5
card3       9.47±0.00     8.39±0.07       12.67±0.17   14.42±0.46     1.52±0.69    102±9         44±12
diabetes1   15.39±0.01    16.30±0.04      17.22±0.06   25.83±0.56     0.05±0.07    209±50        203±47
diabetes2   14.93±0.01    17.47±0.02      17.69±0.04   24.69±0.61     0.02±0.02    209±32        204±34
diabetes3   14.78±0.02    18.21±0.04      16.50±0.05   22.92±0.35     0.12±0.17    214±22        185±46
gene1       8.42±0.00     9.58±0.01       9.92±0.01    13.64±0.10     0.03±0.07    47±6          43±10
gene2       8.39±0.00     9.90±0.00       9.51±0.00    12.30±0.14     0.02±0.03    46±4          40±6
gene3       8.21±0.00     9.36±0.01       10.61±0.01   15.41±0.13     0.03±0.06    42±4          39±6
glass1      8.83±0.01     9.70±0.04       9.98±0.10    46.04±2.21     3.81±0.42    129±13        23±5
glass2      8.71±0.09     10.28±0.19      10.34±0.15   55.28±1.27     5.74±0.67    34±6          14±2
glass3      8.71±0.02     9.37±0.06       11.07±0.15   60.57±3.82     1.76±0.57    135±30        27±11
heart1      11.19±0.01    13.28±0.06      14.29±0.05   20.65±0.31     1.14±0.45    134±15        41±5
heart2      11.66±0.01    12.22±0.02      13.52±0.06   16.43±0.40     0.13±0.09    184±14        146±48
heart3      11.11±0.01    10.77±0.02      16.39±0.18   22.65±0.69     0.14±0.23    142±15        113±53
heartc1     10.17±0.01    9.65±0.03       16.12±0.04   19.73±0.56     0.15±0.11    128±10        114±23
heartc2     11.23±0.03    16.51±0.08      6.34±0.25    3.20±1.56      3.98±0.56    136±22        25±10
heartc3     10.48±0.31    13.88±0.33      12.53±0.44   14.27±1.67     6.23±1.15    26±9          12±3
horse1      11.31±0.16    15.53±0.29      12.93±0.38   26.70±1.87     6.22±0.57    27±7          9±2
horse2      8.62±0.28     15.99±0.21      17.43±0.45   34.84±1.38     5.54±0.47    42±16         13±3
horse3      10.43±0.27    15.59±0.30      15.50±0.45   32.42±2.65     6.34±1.07    26±6          8±3
mushroom1   0.014         0.014           0.011        0.00           0.00         3000          3000
soybean1    0.65±0.00     0.98±0.00       1.16±0.00    9.47±0.51      0.28±0.18    553±11        418±41
soybean2    0.80±0.00     0.81±0.00       1.05±0.00    4.24±0.25      0.02±0.02    509±19        504±18
soybean3    0.78±0.00     0.96±0.00       1.03±0.00    7.00±0.19      0.03±0.04    533±27        522±28
thyroid1    3.76±0.00     3.78±0.01       3.84±0.01    6.56±0.00      0.01±0.03    104±16        99±22
thyroid2    3.93±0.00     3.55±0.01       3.71±0.01    6.56±0.00      0.01±0.02    98±16         96±16
thyroid3    3.85±0.00     3.39±0.00       4.02±0.00    7.23±0.02      0.02±0.02    114±22        109±21

Table 3: Linear network results of classification problems. All entries are given as mean ± standard deviation (stddev). Training set: minimum squared error percentage on the training set reached at any time during training. Validation set: ditto, on the validation set. Test set: squared test set error percentage at the point of minimum validation set error. Test classif.: corresponding test set classification error. Overfit: GL value at the end of training. Total epochs: number of epochs trained. Relevant epochs: number of epochs until minimum validation error. For mushroom1, only a single run was made, so no standard deviations are given.
Problem     Training set  Validation set  Test set     Overfit      Total epochs  Relevant epochs
building1   0.21±0.01     0.92±0.06       0.78±0.02    2.15±4.64    407±138       401±142
building2   0.34±0.00     0.37±0.00       0.35±0.00    0.00±0.01    298±23        297±23
building3   0.37±0.04     0.38±0.07       0.38±0.08    1.99±4.45    229±107       217±102
flare1      0.37±0.00     0.34±0.01       0.52±0.01    2.17±1.61    41±5          12±4
flare2      0.42±0.00     0.46±0.00       0.31±0.02    0.72±0.90    37±3          16±10
flare3      0.39±0.00     0.46±0.00       0.35±0.00    0.57±0.73    35±6          18±12
hearta1     3.82±0.00     4.42±0.03       4.47±0.06    1.68±0.68    118±12        27±10
hearta2     4.17±0.00     4.28±0.02       4.19±0.01    0.06±0.13    112±10        107±15
hearta3     4.06±0.00     4.14±0.02       4.54±0.01    0.05±0.05    116±8         110±10
heartac1    4.05±0.00     4.70±0.02       2.69±0.02    0.01±0.02    98±10         96±11
heartac2    3.37±0.11     5.21±0.21       3.87±0.16    6.99±2.27    19±4          13±4
heartac3    2.85±0.09     5.66±0.16       5.43±0.23    6.06±0.99    29±9          14±3

Table 4: Linear network results of approximation problems. The explanation from table 3 applies, except that the test set classification error data is not present here.
Several interesting observations can be made from these results:

1. Some of the problems seem to be very sensitive to overfitting. They overfit heavily even with only a linear network (e.g. card1, card2, glass1, glass2, heartac2, heartac3, heartc2, heartc3, horse1, horse2, horse3). This suggests that using a cross validation technique such as early stopping is very useful for the Proben1 problems.

2. For some problems, there are quite large differences of behavior between the three permutations of the dataset (e.g. test errors of card, heartc, heartac; training times of heartc; overfitting of glass). This illustrates how dangerous it is to compare results for which the splitting of the data into training and test data was not the same.

3. Some of the problems can be solved pretty well with a linear network. So one should be aware that sometimes a 'real' neural network might be overkill.

4. The mushroom problem is boring. Therefore, only a single run was made. It reached zero test set classification error after only 80 epochs and zero validation set error after 1550 epochs. However, training stopped only because of the 3000 epoch limit; the errors themselves fell and fell and fell. Due to these results, the mushroom problem was excluded from the other experiments. Using the mushroom problem may be interesting, however, if one wants to explore the scaling behavior of an algorithm with respect to the number of available training examples.

5. Some problems exhibit an interesting "inverse" behavior of errors: their validation error is lower than the minimum training error (e.g. cancer1, cancer2, card1, card3, heart3, heartc1, thyroid2, thyroid3). In a few cases, this even extends to the test error (cancer1, thyroid2).
For each of these topologies, three runs were performed; two with linear output nodes and one with output nodes using the sigmoidal activation function. Note that in this case the sigmoid output nodes perform only a one-sided squashing of the outputs, because the sigmoid range is −1…1 whereas the target output range is only 0…1. The parameters for the RPROP procedure used in all these runs were η+ = 1.1, η− = 0.5, ∆0 ∈ 0.05…0.2 randomly per weight, ∆max = 50, ∆min = 0, initial weights from −0.5…0.5 randomly. Exchanging this with the parameter set used for the linear networks would, however, not make much of a difference. Training was stopped when either P5(t) fell below 0.1, or more than 3000 epochs were trained, or the following condition was satisfied: the GL5 stopping criterion was fulfilled at least once, and the validation error had increased for at least 8 successive strips at least once, and the quotient GL(t)/P5(t) had been larger than 3 at least once (see footnote 6). Tables 5 and 6 present the topology and results of the network that produced the lowest validation set error of all these runs for each dataset. The tables also contain some indication of the performance of other topologies by giving the number of other runs that were at most 5% or 10% worse than the best run with respect to the validation set error. The range of test set errors obtained for these other topologies is also indicated. The architectures presented in these tables are probably not the optimal ones, even among those considered in the set of runs presented. Due to the small number of runs per architecture for each problem, a suboptimal architecture has a decent probability of producing the lowest validation set error just by chance. Experience with the early stopping method suggests that using a network considerably larger than necessary often leads to the best results. As a consequence, the architectures shown in table 7 were computed from the results of the runs as the suggested architectures for the various datasets to be used for training of fully connected multilayer perceptrons. These architectures are called the pivot architectures of the respective problems. The rule for computing which architecture is the pivot architecture uses all runs from the within-5%-of-best category as candidates. From these, the largest architecture is chosen. Should the same largest topology appear among the candidates with both linear and sigmoidal output units, the one with the smaller validation set error is chosen, unless the linear architecture appears twice, in which case it is preferred regardless of its validation set error. The raw data used for this architecture selection is listed in appendix D. It should be noted that these pivot architectures are still not necessarily very good. In particular, for some of the problems it might be appropriate to train networks without shortcut connections in order to use networks with a much smaller number of parameters. For instance in the glass problems, the shortcut connections account for as many as 60 weights, which is about the same number as are needed for a complete network using 4 hidden nodes but no shortcut connections. Since the problem has only 107 examples in the training set, it may be a good idea to start without shortcut connections. Similar argumentation applies for several other problems as well.
Furthermore, since many of the pivot architectures are one of the two largest architectures available in the selection runs, namely 32+0 or 16+8, networks with still more hidden nodes may produce superior results for some of the problems. The following section presents results for multiple runs using the pivot architectures, a subsequent section presents results for multiple runs with the same architectures except for the shortcut connections.
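For orientation, the connection counts w given in table 7 are consistent with fully connected layered networks in which every node is also connected to all nodes of all later layers (shortcut connections) and every hidden and output node has one bias weight. A small sketch of that count in Python (the function name is illustrative):

    def n_connections(n_in, n_out, h1, h2=0):
        """Connections of an 'h1+h2' network with full shortcut
        connectivity, plus one bias per non-input node."""
        layers = [n for n in (n_in, h1, h2, n_out) if n > 0]
        links = sum(layers[i] * layers[j]
                    for i in range(len(layers))
                    for j in range(i + 1, len(layers)))
        return links + sum(layers[1:])  # links plus biases

    print(n_connections(9, 2, 4, 2))    # cancer, 4+2: 100 (cf. table 7)
    print(n_connections(21, 3, 16, 8))  # thyroid, 16+8: 794 (cf. table 7)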
(Footnote 6: The only reason for this complicated criterion is that the same set of runs was also used to investigate the behavior of different stopping criteria. Those results, however, are not reported here.)
Table 5: Architecture finding results of classification problems

Problem   | Arch   | Validation set    | Test set          | 5% | 10% | Epochs    | Test range
          |        | err     class.err | err     class.err |    |     |           |
cancer1   | 4+2 l  | 1.53    1.714     | 1.053   1.149     |  0 |   3 | 75-205    | 1.176-1.352
cancer2   | 8+4 l  | 1.284   1.143     | 4.013   5.747     |  0 |   0 | 95        | -
cancer3   | 4+4 l  | 2.679   2.857     | 2.145   2.299     |  2 |  12 | 55-360    | 2.112-2.791
card1     | 4+4 l  | 8.251   9.827     | 10.35   13.95     | 15 |  23 | 20-65     | 10.02-11.02
card2     | 4+0 l  | 10.30   10.98     | 14.88   18.02     |  6 |  20 | 20-50     | 14.27-16.25
card3     | 16+8 l | 7.236   8.092     | 13.00   18.02     |  0 |   1 | 50-55     | 14.52-14.52
diabetes1 | 2+2 l  | 15.07   19.79     | 16.47   25.00     | 11 |  23 | 65-525    | 16.3-17.52
diabetes2 | 16+8 l | 16.22   21.35     | 17.46   23.44     |  4 |  23 | 85-335    | 17.17-18.4
diabetes3 | 4+4 l  | 17.26   25.00     | 15.55   21.35     |  8 |  33 | 65-400    | 15.65-18.15
gene1     | 2+0 l  | 9.708   12.72     | 10.11   13.37     |  7 |  13 | 30-1245   | 9.979-11.25
gene2     | 4+2 s  | 7.669   11.71     | 7.967   12.11     |  0 |   0 | 1680      | -
gene3     | 4+2 s  | 8.371   10.96     | 9.413   13.62     |  0 |   1 | 1170-1645 | 9.702-9.702
glass1    | 8+0 l  | 8.604   31.48     | 9.184   32.08     |  4 |  21 | 40-510    | 8.539-10.32
glass2    | 32+0 l | 9.766   38.89     | 10.17   52.83     | 12 |  31 | 15-40     | 9.913-10.66
glass3    | 16+8 l | 8.622   33.33     | 8.987   33.96     |  9 |  35 | 20-1425   | 8.795-11.84
heart1    | 8+0 l  | 12.58   15.65     | 14.53   20.00     | 17 |  23 | 40-80     | 13.64-15.25
heart2    | 4+0 l  | 12.02   16.09     | 13.67   14.78     | 20 |  23 | 25-85     | 13.03-14.39
heart3    | 16+8 l | 10.27   12.61     | 16.35   23.91     | 18 |  23 | 40-65     | 16.25-17.21
heartc1   | 4+2 l  | 8.057   -         | 16.82   21.33     |  2 |   7 | 35-75     | 15.60-17.87
heartc2   | 8+8 l  | 15.17   -         | 5.950   4.000     |  2 |  12 | 15-90     | 5.257-7.989
heartc3   | 24+0 l | 13.09   -         | 12.71   16.00     |  3 |  10 | 10-105    | 12.95-16.55
horse1    | 4+0 l  | 15.02   28.57     | 13.38   26.37     |  8 |  29 | 15-45     | 12.7-14.97
horse2    | 4+4 l  | 15.92   30.77     | 17.96   38.46     | 21 |  34 | 15-45     | 16.55-19.85
horse3    | 8+4 l  | 15.52   29.67     | 15.81   29.67     | 14 |  31 | 15-35     | 15.63-17.88
soybean1  | 16+8 l | 0.6715  4.094     | 0.9111  8.824     |  0 |   0 | 1045      | -
soybean2  | 32+0 l | 0.5512  2.924     | 0.7509  4.706     |  0 |   1 | 895-2185  | 0.8051-0.8051
soybean3  | 16+0 l | 0.7147  4.678     | 0.9341  7.647     |  0 |   3 | 565-945   | 0.9539-0.9809
thyroid1  | 16+8 l | 0.7933  1.167     | 1.152   2.000     |  1 |   1 | 480-1170  | 1.194-1.194
thyroid2  | 8+4 l  | 0.6174  1.000     | 0.7113  1.278     |  0 |   0 | 2280      | -
thyroid3  | 16+8 l | 0.7998  1.278     | 0.8712  1.500     |  2 |   3 | 590-2055  | 0.9349-1.1

Arch: nodes in first hidden layer + nodes in second hidden layer, with sigmoidal (s) or linear (l) output nodes, for the `best' network, i.e., the network used in the run with the lowest validation set error. Validation set: squared error percentage and classification error on the validation set for the `best' run (missing values are due to technical-historical reasons). Test set: squared error percentage and classification error on the test set for the `best' run. 5%: number of other runs with validation squared error at most 5 percent worse than that of the best run shown in the second column. 10%: ditto, at most 10 percent worse. Epochs: range of the number of epochs trained for the best run and the within-10-percent-best runs. Test range: range of squared test set error percentages for the within-10-percent-best runs, excluding the `best' run.
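For reference: since all outputs and targets are scaled to the range $0 \ldots 1$ (see the file format description in the appendix), the squared error percentage reported in these tables plausibly takes the form below; treat this as my reconstruction rather than the report's literal definition:

$$ E \;=\; \frac{100}{N \cdot P} \sum_{p=1}^{P} \sum_{i=1}^{N} \left( o_{p,i} - t_{p,i} \right)^2 $$

where $P$ is the number of examples in the respective set, $N$ the number of output nodes, and $o_{p,i}$, $t_{p,i}$ the actual and target values of output node $i$ on example $p$.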
Table 6: Architecture finding results of approximation problems

Problem   | Arch   | Val. err | Test err | 5% | 10% | Epochs    | Test range
building1 | 2+2 l  | 0.7583   | 0.6450   |  4 |  13 | 625-2625  | 0.6371-0.6841
building2 | 16+8 s | 0.2629   | 0.2509   |  5 |  20 | 1100-2960 | 0.246-0.2731
building3 | 8+8 s  | 0.2460   | 0.2475   |  5 |  16 | 600-2995  | 0.2526-0.2739
flare1    | 4+0 s  | 0.3349   | 0.5283   | 10 |  30 | 35-160    | 0.5232-0.5687
flare2    | 2+0 s  | 0.4587   | 0.3214   | 14 |  30 | 35-135    | 0.3167-0.3695
flare3    | 2+0 l  | 0.4541   | 0.3568   | 14 |  32 | 40-155    | 0.3493-0.3772
hearta1   | 32+0 s | 4.199    | 4.428    | 19 |  33 | 35-180    | 4.249-4.733
hearta2   | 2+0 l  | 3.940    | 4.164    |  3 |  23 | 20-120    | 3.948-4.527
hearta3   | 4+0 s  | 3.928    | 4.961    | 15 |  31 | 20-105    | 4.337-5.089
heartac1  | 2+0 l  | 4.174    | 2.665    |  0 |   3 | 50-95     | 2.613-3.328
heartac2  | 8+4 l  | 4.589    | 4.514    |  0 |   2 | 15        | 4.346-4.741
heartac3  | 4+4 l  | 5.031    | 5.904    |  8 |  16 | 10-55     | 4.825-6.540

The explanation from table 5 applies, except that the test set classification error data is not present here.

Table 7: Pivot architecture and the corresponding number w of connections for each data set

Problem   | pivot arch | w
building1 | 16+0 l     | 333
cancer1   | 4+2 l      | 100
card1     | 32+0 l     | 1832
diabetes1 | 32+0 l     | 370
flare1    | 32+0 s     | 971
gene1     | 4+2 l      | 1115
glass1    | 16+8 l     | 572
heart1    | 32+0 l     | 1288
hearta1   | 32+0 l     | 1220
heartac1  | 2+0 l      | 110
heartc1   | 16+8 l     | 1112
horse1    | 16+8 l     | 1793
soybean1  | 16+8 l     | 4153
thyroid1  | 16+8 l     | 794
building2 | 16+8 l     | 605
cancer2   | 8+4 l      | 196
card2     | 24+0 l     | 1400
diabetes2 | 16+8 l     | 410
flare2    | 32+0 s     | 971
gene2     | 4+2 s      | 1115
glass2    | 16+8 l     | 572
heart2    | 32+0 l     | 1288
hearta2   | 16+0 l     | 628
heartac2  | 8+4 l      | 512
heartc2   | 8+8 l      | 744
horse2    | 16+8 l     | 1793
soybean2  | 32+0 l     | 4841
thyroid2  | 8+4 l      | 398
building3 | 16+8 s     | 605
cancer3   | 16+8 l     | 436
card3     | 16+8 l     | 1528
diabetes3 | 32+0 l     | 370
flare3    | 24+0 s     | 747
gene3     | 4+2 s      | 1115
glass3    | 16+8 l     | 572
heart3    | 32+0 l     | 1288
hearta3   | 32+0 l     | 1220
heartac3  | 16+8 s     | 1052
heartc3   | 32+0 l     | 1288
horse3    | 32+0 l     | 2161
soybean3  | 16+0 l     | 3209
thyroid3  | 16+8 l     | 794
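The connection counts w above can be reconstructed from the layer sizes. The following sketch is my reconstruction (the function name is mine); it assumes one bias weight per hidden and output node and shortcut connections between every pair of layers, and under these assumptions it reproduces the w column of table 7:

    def n_weights(layers):
        # layers = [inputs, hidden1, hidden2, outputs];
        # a hidden layer size of 0 means the layer is absent
        n = [size for size in layers if size > 0]
        # one weight for every pair of nodes in two different layers
        # (this includes all shortcut connections) ...
        w = sum(n[i] * n[j]
                for i in range(len(n)) for j in range(i + 1, len(n)))
        # ... plus one bias weight per hidden and output node
        return w + sum(n[1:])

    # thyroid has 21 inputs and 3 outputs:
    print(n_weights([21, 8, 4, 3]))   # 398, matching pivot 8+4 (thyroid2)
    print(n_weights([21, 16, 8, 3]))  # 794, matching pivot 16+8 (thyroid1)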
For each dataset, 60 runs were performed using the pivot architecture. The training parameters used are the same as for the linear networks, as indicated in section 3.3.1. The results are shown in tables 8 (classification problems) and 9 (approximation problems). Several interesting observations can be made (please compare also with the discussion of the linear network results in section 3.3.1):

1. The results for some of the problems are worse than those obtained using linear networks. This is most notable for the gene problems and less severe for the horse problems and many of the heart disease problems.

2. Not surprisingly, the standard deviations of validation and test set errors and the tendency to overfit are much higher than for linear networks in most of the cases.

3. The correlation of validation set errors with test set errors is quite small for some of the problems (less than 0.5 for cancer3, card3, flare3, glass1, heartac1, heartc1, horse1, horse2, soybean3). In two cases it is even slightly negative (card3, heartac1).
Table 8: Pivot architecture results of classification problems

Problem   | Training set | Validation set | Test set   | ρ     | Test classif. | Overfit    | Total epochs | Relevant epochs
cancer1   | 2.87±0.27    | 1.96±0.25      | 1.60±0.41  |  0.81 | 1.47±0.60     | 4.48±4.87  | 152±111      | 133±97
cancer2   | 2.08±0.35    | 1.77±0.32      | 3.40±0.33  |  0.51 | 4.52±0.70     | 5.76±6.70  | 93±75        | 81±72
cancer3   | 1.73±0.19    | 2.86±0.11      | 2.57±0.24  |  0.28 | 3.37±0.71     | 3.37±1.32  | 66±20        | 51±16
card1     | 8.92±0.54    | 8.89±0.59      | 10.53±0.57 |  0.92 | 13.64±0.85    | 3.77±4.47  | 33±7         | 25±5
card2     | 7.12±0.55    | 11.11±0.32     | 15.47±0.75 |  0.53 | 19.23±0.80    | 3.32±1.03  | 32±8         | 22±6
card3     | 7.58±0.87    | 8.42±0.37      | 13.03±0.50 | -0.03 | 17.36±1.61    | 3.52±1.46  | 37±10        | 28±9
diabetes1 | 14.74±2.03   | 16.36±2.14     | 17.30±1.91 |  0.99 | 24.57±3.53    | 2.31±0.67  | 196±98       | 118±72
diabetes2 | 13.12±1.35   | 17.10±0.91     | 18.20±1.08 |  0.77 | 25.91±2.50    | 2.75±2.54  | 119±42       | 85±31
diabetes3 | 13.34±1.11   | 17.98±0.62     | 16.68±0.67 |  0.55 | 23.06±1.91    | 2.34±0.65  | 307±193      | 200±132
gene1     | 6.45±0.42    | 10.27±0.31     | 10.72±0.31 |  0.76 | 15.05±0.89    | 2.67±0.49  | 46±9         | 29±6
gene2     | 7.56±1.81    | 11.80±1.19     | 11.39±1.28 |  0.97 | 15.59±1.83    | 2.12±0.44  | 321±698      | 222±595
gene3     | 6.88±1.76    | 11.18±1.06     | 12.14±0.95 |  0.95 | 17.79±1.73    | 2.06±0.50  | 435±637      | 289±508
glass1    | 7.68±0.79    | 9.48±0.24      | 9.75±0.41  |  0.33 | 39.03±8.14    | 2.76±0.71  | 67±44        | 45±39
glass2    | 8.43±0.53    | 10.44±0.48     | 10.27±0.40 |  0.72 | 55.60±2.83    | 4.27±1.75  | 29±9         | 20±7
glass3    | 7.56±0.98    | 9.23±0.25      | 10.91±0.48 |  0.54 | 59.25±7.83    | 2.68±0.47  | 66±46        | 45±41
heart1    | 9.25±1.07    | 13.22±1.32     | 14.33±1.26 |  0.97 | 19.89±2.27    | 2.83±1.89  | 65±16        | 43±12
heart2    | 9.85±1.68    | 13.06±3.29     | 14.43±3.29 |  0.98 | 17.88±1.57    | 3.27±2.34  | 57±19        | 38±13
heart3    | 9.43±0.64    | 10.71±0.78     | 16.58±0.39 |  0.67 | 23.43±1.29    | 3.35±3.72  | 51±10        | 37±9
heartc1   | 6.82±1.20    | 8.75±0.71      | 17.18±0.79 |  0.10 | 21.13±1.49    | 4.04±2.98  | 45±12        | 36±11
heartc2   | 10.41±1.76   | 17.02±1.12     | 6.47±2.86  |  0.83 | 5.07±3.37     | 4.05±1.89  | 29±14        | 21±11
heartc3   | 10.30±1.79   | 15.17±1.83     | 14.57±2.82 |  0.85 | 15.93±2.93    | 8.22±18.67 | 24±13        | 17±11
horse1    | 9.91±1.06    | 16.52±0.67     | 13.95±0.60 |  0.30 | 26.65±2.52    | 4.66±2.28  | 28±5         | 20±4
horse2    | 7.32±1.52    | 16.76±0.64     | 18.99±1.21 |  0.30 | 36.89±2.12    | 3.87±1.49  | 31±8         | 22±8
horse3    | 9.25±2.36    | 17.25±2.41     | 17.79±2.45 |  0.92 | 34.60±2.84    | 3.48±1.26  | 30±10        | 21±7
soybean1  | 0.32±0.08    | 0.85±0.07      | 1.03±0.05  |  0.54 | 9.06±0.80     | 2.55±1.37  | 665±259      | 551±218
soybean2  | 0.42±0.06    | 0.67±0.06      | 0.90±0.08  |  0.77 | 5.84±0.87     | 2.17±0.16  | 792±281      | 675±243
soybean3  | 0.40±0.07    | 0.82±0.06      | 1.05±0.09  |  0.33 | 7.27±1.16     | 2.16±0.13  | 759±233      | 639±205
thyroid1  | 0.60±0.53    | 1.04±0.61      | 1.31±0.55  |  0.99 | 2.32±0.67     | 3.06±3.16  | 491±319      | 432±266
thyroid2  | 0.59±0.24    | 0.88±0.19      | 1.02±0.18  |  0.85 | 1.86±0.41     | 2.58±1.07  | 660±460      | 598±417
thyroid3  | 0.69±0.20    | 0.97±0.13      | 1.16±0.16  |  0.91 | 2.09±0.31     | 2.39±0.43  | 598±624      | 531±564

Training set: mean and standard deviation (stddev) of the minimum squared error percentage on the training set reached at any time during training. Validation set: ditto, on the validation set. Test set: mean and stddev of the squared test set error percentage at the point of minimum validation set error. ρ: correlation between validation set error and test set error. Test classif.: mean and stddev of the corresponding test set classification error. Overfit: mean and stddev of the GL value at the end of training. Total epochs: mean and stddev of the number of epochs trained. Relevant epochs: mean and stddev of the number of epochs until the minimum validation error.
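Each row in tables 8 through 11 is an aggregate over the runs for one dataset. A minimal sketch of computing such a row from per-run error lists (the function and variable names are mine; the statistics helpers are from Python's standard library, with statistics.correlation requiring Python 3.10 or newer):

    from statistics import mean, stdev, correlation

    def table_row(val_errors, test_errors):
        # val_errors[i], test_errors[i]: squared error percentages of run i
        # (the test error is measured at the epoch of minimum validation error)
        return {
            "validation": (mean(val_errors), stdev(val_errors)),
            "test":       (mean(test_errors), stdev(test_errors)),
            "rho":        correlation(val_errors, test_errors),
        }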
Table 9: Pivot architecture results of approximation problems

Problem   | Training set | Validation set | Test set  | ρ     | Overfit
building1 | 0.63±0.50    | 2.43±1.50      | 1.70±1.01 |  0.96 | 31.93±44.07
building2 | 0.23±0.02    | 0.28±0.02      | 0.26±0.02 |  0.98 | 0.11±0.70
building3 | 0.22±0.02    | 0.26±0.01      | 0.26±0.01 |  0.93 | 0.42±1.09
flare1    | 0.39±0.26    | 0.55±0.81      | 0.74±0.80 |  1.00 | 3.13±2.48
flare2    | 0.42±0.16    | 0.55±0.43      | 0.41±0.47 |  1.00 | 3.20±3.73
flare3    | 0.36±0.01    | 0.49±0.01      | 0.37±0.01 |  0.32 | 2.58±0.58
hearta1   | 3.75±0.76    | 4.58±0.81      | 4.76±1.14 |  0.95 | 4.98±7.85
hearta2   | 3.69±0.87    | 4.47±1.00      | 4.52±1.10 |  0.97 | 7.18±24.23
hearta3   | 3.84±0.66    | 4.29±0.73      | 4.81±0.87 |  0.97 | 5.34±14.19
heartac1  | 3.86±0.32    | 4.87±0.23      | 2.82±0.22 | -0.06 | 3.98±2.25
heartac2  | 3.41±0.42    | 5.51±0.65      | 4.54±0.87 |  0.79 | 7.53±5.27
heartac3  | 2.23±0.57    | 5.38±0.37      | 5.37±0.56 |  0.80 | 4.64±2.96

The explanation from table 8 applies, except that the test set classification error data is not present here.
4. The correlation value also differs dramatically between the three variants of some of the problems (card, flare, heartac, heartc, horse).

5. However, low correlation does not necessarily imply bad overall test error results (see cancer, card, flare, heartac, horse).

6. The training times exhibit dramatic fluctuations in a few of the cases (building1, gene2, gene3, thyroid3, and less severely cancer1, cancer2, diabetes3, glass1, glass3, thyroid1, thyroid2).

7. The other numbers of training epochs tend to be of the same order as for linear networks, with a few exceptions that are much faster (most of the heart disease problems) or much slower (thyroid, building2, building3).

8. The "inverse" error behavior observed for some of the linear networks is no longer present for most of them (except cancer1, cancer2, card1).

As mentioned above, for some of the problems it might be more appropriate to work without shortcut connections. To quantify the effect of training without shortcut connections, another series of 60 runs per dataset was conducted using the same parameters as above. This time, however, the network architecture was modified to include only connections between adjacent layers, i.e., no direct connections from the inputs to the outputs and, for networks with two hidden layers, also no connections from the inputs to the second hidden layer or from the first hidden layer to the outputs. I call these architectures the no-shortcut architectures. The results of these runs are shown in tables 10 (classification problems) and 11 (approximation problems). Once again, a few interesting observations can be made (compare also with the above discussions of the linear network and pivot architecture results):

1. Leaving out the shortcut connections seems to be appropriate more often than expected (see also section 3.3.4).

2. The test error results for the gene problems are better than for linear networks (for the pivot architectures they were worse than for linear networks). However, the classification errors are even worse than for the pivot architectures.
3. The test error results for the horse problems have also improved, yet are still worse than for linear networks.

4. The correlations of validation and test error are sometimes very different than for the pivot architectures (see for example card, flare, glass, heartac).

5. For flare2 and flare3, although the correlation is much lower, the standard deviations of the test errors are very much smaller, compared to the pivot architectures.

Table 10: No-shortcut architecture results of classification problems

Problem   | Training set | Validation set | Test set   | ρ     | Test classif. | Overfit    | Total epochs | Relevant epochs
cancer1   | 2.83±0.15    | 1.89±0.12      | 1.32±0.13  |  0.64 | 1.38±0.49     | 3.10±2.54  | 116±123      | 95±115
cancer2   | 2.14±0.23    | 1.76±0.14      | 3.47±0.28  |  0.14 | 4.77±0.94     | 3.82±1.90  | 54±31        | 44±28
cancer3   | 1.83±0.26    | 2.83±0.13      | 2.60±0.22  |  0.59 | 3.70±0.52     | 3.33±1.64  | 54±20        | 41±17
card1     | 8.86±0.41    | 8.69±0.26      | 10.35±0.29 |  0.25 | 14.05±1.03    | 3.54±1.25  | 30±7         | 22±5
card2     | 7.18±0.51    | 10.87±0.27     | 14.94±0.64 |  0.44 | 18.91±0.86    | 3.99±1.52  | 26±7         | 17±5
card3     | 7.13±0.62    | 8.62±0.46      | 13.47±0.51 |  0.41 | 18.84±1.19    | 4.81±3.24  | 29±7         | 22±6
diabetes1 | 14.36±1.14   | 15.93±1.04     | 16.99±0.91 |  0.95 | 24.10±1.91    | 2.23±0.53  | 201±119      | 117±83
diabetes2 | 13.04±1.27   | 16.94±0.91     | 18.43±1.00 |  0.76 | 26.42±2.26    | 2.50±0.50  | 102±46       | 70±26
diabetes3 | 13.52±1.46   | 17.89±0.90     | 16.48±1.16 |  0.91 | 22.59±2.23    | 2.32±0.59  | 251±132      | 164±85
gene1     | 2.70±1.52    | 8.19±1.33      | 8.66±1.28  |  0.91 | 16.67±3.75    | 2.46±0.53  | 124±58       | 101±53
gene2     | 4.55±2.60    | 9.46±1.95      | 9.54±1.91  |  0.97 | 18.41±6.93    | 2.29±0.28  | 321±284      | 250±255
gene3     | 4.99±2.79    | 9.45±2.17      | 10.84±1.93 |  0.97 | 21.82±7.53    | 2.33±0.39  | 262±183      | 199±163
glass1    | 7.16±0.65    | 9.15±0.21      | 9.24±0.32  |  0.13 | 32.70±5.34    | 2.69±0.64  | 71±31        | 52±27
glass2    | 8.42±0.66    | 10.03±0.27     | 10.09±0.28 |  0.37 | 55.57±3.70    | 4.00±1.80  | 30±9         | 22±8
glass3    | 7.54±1.06    | 9.14±0.24      | 10.74±0.52 |  0.73 | 58.40±7.82    | 2.97±1.17  | 60±30        | 46±26
heart1    | 9.24±0.82    | 13.10±0.65     | 14.19±0.64 |  0.89 | 19.72±0.96    | 3.16±2.38  | 57±15        | 38±12
heart2    | 9.73±1.24    | 12.32±1.09     | 13.61±0.89 |  0.88 | 17.52±1.14    | 3.56±3.47  | 51±15        | 36±12
heart3    | 9.46±0.88    | 10.85±1.39     | 16.79±0.77 |  0.93 | 24.08±1.12    | 3.91±4.42  | 46±13        | 32±10
heartc1   | 5.98±1.33    | 8.08±0.49      | 16.99±0.77 |  0.22 | 20.82±1.47    | 5.08±2.64  | 38±10        | 30±9
heartc2   | 9.85±1.16    | 16.86±0.70     | 5.05±1.36  |  0.40 | 5.13±1.63     | 4.83±2.34  | 25±10        | 18±9
heartc3   | 10.35±1.07   | 14.30±1.21     | 13.79±2.62 |  0.75 | 15.40±3.20    | 9.73±10.48 | 17±6         | 11±5
horse1    | 10.43±1.23   | 15.47±0.37     | 13.32±0.48 |  0.24 | 29.19±2.62    | 6.09±2.53  | 19±3         | 13±3
horse2    | 6.68±1.85    | 16.07±0.79     | 17.68±1.41 | -0.19 | 35.86±2.46    | 4.28±1.67  | 25±7         | 18±6
horse3    | 10.54±1.68   | 15.91±1.19     | 15.86±1.17 |  0.88 | 34.16±2.32    | 5.51±3.89  | 20±5         | 14±5
soybean1  | 1.53±0.09    | 1.94±0.06      | 2.10±0.07  |  0.58 | 29.40±2.50    | 3.14±1.99  | 219±112      | 159±79
soybean2  | 0.46±0.19    | 0.59±0.13      | 0.79±0.22  |  0.96 | 5.14±1.05     | 5.06±6.49  | 417±222      | 362±202
soybean3  | 0.61±0.21    | 0.93±0.21      | 1.25±0.15  |  0.76 | 11.54±2.32    | 6.12±7.99  | 450±273      | 382±228
thyroid1  | 0.59±0.20    | 1.01±0.16      | 1.28±0.12  |  0.84 | 2.38±0.35     | 3.99±7.14  | 377±308      | 341±280
thyroid2  | 0.60±0.13    | 0.89±0.11      | 1.02±0.11  |  0.59 | 1.91±0.24     | 4.71±6.86  | 421±269      | 388±246
thyroid3  | 0.74±0.18    | 0.98±0.13      | 1.26±0.14  |  0.92 | 2.27±0.32     | 3.91±9.18  | 324±234      | 298±223
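For the no-shortcut topologies of tables 10 and 11, the connection count of the n_weights sketch given after table 7 reduces to adjacent-layer links only. A variant under the same assumptions (one bias per hidden and output node):

    def n_weights_no_shortcut(layers):
        # layers = [inputs, hidden1, hidden2, outputs]; 0 = layer absent
        n = [size for size in layers if size > 0]
        # weights only between adjacent layers ...
        w = sum(n[i] * n[i + 1] for i in range(len(n) - 1))
        # ... plus one bias weight per hidden and output node
        return w + sum(n[1:])

    # glass (9 inputs, 6 outputs): a complete 4-hidden-node network
    # without shortcuts has 9*4 + 4*6 + 10 = 70 weights, roughly the
    # cost of the shortcut connections alone, as discussed above.
    print(n_weights_no_shortcut([9, 4, 0, 6]))  # 70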
Table 11: No-shortcut architecture results of approximation problems

Problem   | Training set | Validation set | Test set  | ρ     | Overfit
building1 | 0.47±0.28    | 2.07±1.04      | 1.36±0.63 |  0.88 | 33.93±49.93
building2 | 0.24±0.15    | 0.30±0.19      | 0.28±0.20 |  1.00 | 0.14±0.78
building3 | 0.22±0.01    | 0.26±0.01      | 0.26±0.01 |  0.74 | 0.25±0.58
flare1    | 0.35±0.02    | 0.35±0.01      | 0.54±0.01 |  0.10 | 3.02±0.90
flare2    | 0.40±0.01    | 0.47±0.01      | 0.32±0.01 |  0.43 | 2.93±0.99
flare3    | 0.37±0.01    | 0.47±0.01      | 0.36±0.01 |  0.34 | 2.53±0.47
hearta1   | 3.55±0.53    | 4.48±0.35      | 4.55±0.41 |  0.93 | 4.17±7.53
hearta2   | 3.45±0.56    | 4.41±0.21      | 4.33±0.15 |  0.55 | 2.91±0.75
hearta3   | 3.74±0.72    | 4.46±1.01      | 4.89±0.91 |  0.99 | 5.35±9.90
heartac1  | 3.59±0.24    | 4.77±0.32      | 2.47±0.38 |  0.21 | 3.78±1.85
heartac2  | 2.58±0.42    | 5.16±0.32      | 4.41±0.56 | -0.15 | 6.43±4.43
heartac3  | 2.45±0.46    | 5.74±0.36      | 5.55±0.52 |  0.84 | 5.52±4.02

The explanation from table 8 applies, except that the test set classification error data is not present here.
Table 12: t-test comparison of pivot and no-shortcut results

Results of a statistical significance test performed for the differences of the mean logarithmic test error between the pivot architectures (P) and the no-shortcut architectures (N). Entries show differences that are significant at a 90% confidence level, plus the corresponding p-value in percent; the letter indicates which architecture is better. Dashes indicate non-significant differences. Parentheses indicate unreliable test results due to non-normality of at least one of the two samples. The test employed was a t-test using the Cochran-Cox approximation for the unequal variance case. 2.6% of the data points were removed as outliers.

The tests were computed using a statistical software package. Since a t-test assumes that the samples to be compared have normal distributions, the logarithm of the test errors was compared instead of the test errors themselves, because test errors usually have an approximately log-normal distribution. This logarithmic transformation does not change the test result, since the logarithm is strictly monotone; log-normal distributions occur quite often, and log-transformations are a very common statistical technique. Since a further assumption of the t-test is equal variance of the samples, the Cochran-Cox approximation for the unequal variance case had to be used, because at least for some of the sample pairs (cancer1, gene1, hearta1) the standard deviations differed by more than a factor of 2.
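A sketch of this per-problem comparison: it uses Welch's unequal-variance t-test from SciPy as a stand-in for the Cochran-Cox approximation used in the report (both serve the same purpose), and it omits the outlier removal described below; names are illustrative:

    import numpy as np
    from scipy import stats

    def compare_samples(errors_pivot, errors_noshort, alpha=0.1):
        # compare mean log test error of two samples of runs
        a = np.log(np.asarray(errors_pivot))
        b = np.log(np.asarray(errors_noshort))
        # Welch's t-test does not assume equal variances
        t, p = stats.ttest_ind(a, b, equal_var=False)
        if p >= alpha:
            return "-"                       # no significant difference
        better = "P" if a.mean() < b.mean() else "N"
        return "%s %.1f" % (better, 100 * p)  # letter plus p-value in percent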
Furthermore, a few outliers had to be removed in order to achieve an approximately normal distribution of the log-errors: in the 2520 runs for the pivot architectures, there were 4 outliers with too-low errors and 61 with too-high errors; for the no-shortcut architectures, there were no outliers with too-low errors and 66 outliers with too-high errors. Altogether this makes for 2.6% outliers. At most 10%, i.e., 6 of 60, were removed from any single sample, namely from heartac2 and heartc3 (pivot) and from horse3 (no-shortcut). A few of the samples deviated so significantly from a log-normal distribution that the results of the test are unreliable and thus must be interpreted with care. For the pivot architectures, these non-normal samples were those of building1, gene2, and gene3; for the no-shortcut architectures they were building1, gene3, heartac3, and soybean2. No outliers were removed from the non-normal samples. The respective test results are shown in parentheses in the table in order to indicate that they are unreliable. This discussion demonstrates how important it is to work very carefully when applying statistical methods to neural network training results. When applied carelessly, statistical methods can produce results that may look very impressive but in fact are just garbage.

For 10 of the sample pairs, no significant difference of the test set errors is found at the 90% confidence level (i.e., significance level 0.1). In 9 cases the pivot architecture was better, while in 23 cases the no-shortcut architecture was better. This result suggests that a further search for a good network architecture may be worthwhile for most of the problems, since the architectures used here were all found using candidate architectures with shortcut connections only, and just removing the shortcut connections is probably not the best way to improve on them.

Summing up, the network architectures and performance figures presented above provide a starting point for exploration and comparison using the Proben1 benchmark collection datasets. It must be noted that none of the above results used the validation set for training. Surely, improvements of the results are possible by using the validation set for training in a suitable way. The properties of the benchmark problems seem to be diverse enough to make Proben1 a useful basis for improved experimental evaluation of neural network learning algorithms. Hopefully, many more similar collections will follow.
The UCI machine learning databases repository is available by anonymous FTP on machine ics.uci.edu in directory /pub/machine-learning-databases. This archive is maintained at the University of California, Irvine, by Patrick M. Murphy and David W. Aha. Many thanks to them for their valuable service. The databases themselves were donated by various researchers; thanks to them as well. See the documentation files in the individual dataset directories for details. The building problem is from the energy predictor shootout archive at ftp.cs.colorado.edu in directory /pub/distribs/energy-shootout. If you publish an article about work that used Proben1, it would be great if you dropped me a note with the reference to [email protected].
Each line after the header lines represents one example; first come the examples of the training set, then those of the validation set, then those of the test set. The sizes of these sets are given in the last three header lines; the partitioning is always 50%, 25%, 25% of the total number of examples. The first four header lines describe the number of input coefficients and output coefficients per example. A boolean coefficient is always represented as either 0 (false) or 1 (true). A real coefficient is represented as a decimal number between 0 and 1. For all datasets, either bool_in or real_in is 0, and either bool_out or real_out is 0. Coefficients are separated by one or multiple spaces; examples (including the last) are terminated by a single newline character. First on each line are the input coefficients, then follow the output coefficients (i.e., each line contains bool_in + real_in + bool_out + real_out coefficients). Thus, lines can be quite long. That's all.

The encoding used in the data files has all inputs and outputs scaled to the range 0...1. The scaling was chosen so that the range is at least almost (but not always completely) used by the examples occurring in the dataset. The gene datasets are an exception in that they use binary inputs encoded as -1 and 1.
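A minimal reader for this format, as a sketch: it assumes the header lines have the form key=value with the keys bool_in, real_in, bool_out, real_out, training_examples, validation_examples, and test_examples, which is how I read the description above; treat the exact keywords as illustrative:

    def read_proben1(path):
        with open(path) as f:
            lines = [ln.strip() for ln in f if ln.strip()]
        # seven header lines: four coefficient counts, three set sizes
        header = dict(ln.split("=") for ln in lines[:7])
        n_in = int(header["bool_in"]) + int(header["real_in"])
        n_out = int(header["bool_out"]) + int(header["real_out"])
        # each remaining line: input coefficients followed by outputs
        examples = [[float(x) for x in ln.split()] for ln in lines[7:]]
        data = [(ex[:n_in], ex[n_in:n_in + n_out]) for ex in examples]
        n_train = int(header["training_examples"])
        n_val = int(header["validation_examples"])
        train = data[:n_train]
        val = data[n_train:n_train + n_val]
        test = data[n_train + n_val:]
        return train, val, test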
D Architecture ordering
The following list gives for each problem the order of architectures according to increasing squared test set error. Of all architectures tried (as listed in section 3.3.2), only those are listed whose test set error was at most 5% larger than the smallest test set error found in any run. Architectures with linear output units can occur twice, because two runs were made for them and each run is considered separately. (A sketch of how the pivot rule of section 3.3.2 operates on these lists follows at the end of the appendix.)

building1: 2+2l, 4+2l, 4+4l, 4+0l, 16+0l.
building2: 16+8s, 16+8l, 8+4s, 16+8l, 32+0s, 32+0l.
building3: 8+8s, 32+0s, 4+4l, 32+0l, 16+8s, 8+4s.
cancer1: 4+2l.
cancer2: 8+4l.
cancer3: 4+4l, 16+8l, 16+0l.
card1: 4+4l, 8+8l, 8+8l, 16+0l, 8+4l, 8+0l, 16+8l, 4+2l, 4+4l, 32+0l, 4+2l, 8+4l, 16+0l, 2+0l, 4+0l, 24+0l.
card2: 4+0l, 16+0l, 16+8l, 24+0l, 2+2l, 8+4l, 4+4l.
card3: 16+8l.
diabetes1: 2+2l, 2+2l, 4+4l, 4+4l, 32+0l, 8+4l, 16+0l, 4+0l, 16+8l, 8+0l, 24+0l, 16+0l.
diabetes2: 16+8l, 24+0l, 8+0l, 8+4l, 4+4l.
diabetes3: 4+4l, 8+0l, 32+0l, 24+0l, 8+8l, 24+0s, 2+2l, 32+0l, 8+8l.
flare1: 4+0s, 2+2l, 4+0l, 2+2s, 2+0l, 32+0s, 4+2l, 2+2l, 2+0s, 4+0l, 16+0s.
flare2: 2+0s, 4+0s, 8+0s, 8+8s, 16+0s, 2+2s, 24+0s, 2+0l, 4+0l, 32+0s, 4+0l, 2+0l, 4+2s, 4+4l, 8+4s.
flare3: 2+0l, 2+0l, 2+0s, 4+0s, 2+2s, 2+2l, 2+2l, 4+4s, 16+0s, 16+0l, 4+0l, 4+4l, 8+0l, 24+0s, 8+8l.
gene1: 2+0l, 2+0s, 4+0l, 2+2l, 2+0l, 2+2l, 4+0l, 4+2l.
gene2: 4+2s.
gene3: 4+2s.
glass1: 8+0l, 16+8l, 4+0l, 32+0s, 8+4l.
glass2: 32+0l, 2+2s, 16+0s, 32+0s, 2+0l, 16+8l, 4+4s, 8+0s, 16+8s, 4+0s, 16+8l, 16+0l, 2+0s.
glass3: 16+8l, 2+0s, 16+0l, 16+0l, 8+4s, 16+8s, 8+8s, 8+4l, 2+0l, 16+8l.
heart1: 8+0l, 24+0l, 4+0l, 32+0l, 16+8l, 8+4l, 32+0l, 8+8l, 16+0l, 4+2l, 4+0l, 4+4l, 8+4l, 24+0l, 2+2l, 4+4l, 8+0l, 16+8l.
heart2: 4+0l, 16+0l, 32+0l, 4+0l, 8+0l, 32+0l, 4+4l, 8+8l, 2+2l, 8+4l, 16+8l, 2+0l, 2+2l, 24+0l, 2+0l, 4+2l, 4+2l, 16+0l, 24+0l, 8+8l, 4+4l.
heart3: 16+8l, 2+0l, 32+0l, 4+0l, 8+0l, 2+2l, 32+0l, 8+8l, 16+0l, 4+2l, 4+4l, 16+0l, 16+8l, 4+4l, 8+4l, 8+4l, 4+2l, 24+0l, 24+0l.
hearta1: 32+0s, 8+4l, 4+0l, 4+4s, 32+0l, 2+0l, 8+0l, 8+4l, 8+8l, 16+8l, 8+0l, 4+0l, 2+2l, 32+0l, 24+0l, 4+2l, 4+4l, 16+0l, 16+8s, 8+8s.
hearta2: 2+0l, 8+4l, 16+0l, 4+2s.
hearta3: 4+0s, 16+0l, 4+4l, 4+0l, 8+0l, 4+4l, 8+4l, 8+0l, 2+0l, 32+0l, 4+2s, 24+0l, 8+8l, 16+8l, 4+2l, 16+8l.
heartac1: 2+0l.
heartac2: 8+4l.
heartac3: 4+4l, 8+4s, 8+0l, 4+0l, 24+0l, 16+8s, 4+0s, 16+0s, 8+0s.
heartc1: 4+2l, 8+8l, 16+8l.
heartc2: 8+8l, 2+2l, 4+0l.
heartc3: 24+0l, 32+0l, 8+8l, 16+8l.
horse1: 4+0l, 4+4l, 4+4l, 4+2s, 2+2l, 8+0s, 16+8l, 4+0s, 16+0l.
horse2: 4+4l, 4+0l, 8+0s, 2+2s, 8+4l, 4+4s, 8+0l, 2+2l, 8+8l, 4+4l, 2+0l, 4+2l, 16+0l, 16+8l, 8+0l, 16+8l, 2+0s, 16+8s, 4+0l, 8+8s, 4+2s, 8+4s.
horse3: 8+4l, 8+0l, 8+4s, 4+0l, 4+4s, 2+0l, 8+8l, 16+0l, 4+4l, 4+4l, 32+0l, 24+0s, 4+2s, 8+0l, 16+8l.
soybean1: 16+8l.
soybean2: 32+0l.
soybean3: 16+0l.
thyroid1: 16+8l, 8+8l.
thyroid2: 8+4l.
thyroid3: 16+8l, 8+4l, 8+8l.
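A sketch of the pivot selection rule of section 3.3.2 applied to such candidate lists; this is my reading of the rule, not the original code. n_weights is the connection counter sketched after table 7, and layers_of and val_error stand in for the per-problem layer sizes and the raw per-run validation errors, which are not reproduced here:

    def pivot(candidates, layers_of, val_error):
        # candidates: one (arch, out) pair per within-5%-of-best run,
        #   e.g. ("16+8", "l"); l = linear, s = sigmoidal output units
        # layers_of(arch) -> [inputs, hidden1, hidden2, outputs]
        # val_error(arch, out) -> validation error of that run
        largest = max({a for a, _ in candidates},
                      key=lambda a: n_weights(layers_of(a)))
        outs = [o for a, o in candidates if a == largest]
        if "l" in outs and "s" in outs:
            # both output types present: prefer linear if it occurs twice,
            # otherwise take the one with the smaller validation error
            if outs.count("l") >= 2:
                return largest, "l"
            return largest, min(set(outs),
                                key=lambda o: val_error(largest, o))
        return largest, outs[0]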
References
[1] Yann Le Cun, John S. Denker, and Sara A. Solla. Optimal brain damage. In [22], pages 598-605, 1990.
[2] T. G. Dietterich and G. Bakiri. Error-correcting output codes: A general method for improving multiclass inductive learning programs. In Proc. of the 9th National Conference on Artificial Intelligence (AAAI), pages 572-577, Anaheim, CA, 1991. AAAI Press.
[3] Scott E. Fahlman. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, September 1988.
[4] Scott E. Fahlman and Christian Lebiere. The cascade-correlation learning architecture. Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, February 1990.
[5] Scott E. Fahlman and Christian Lebiere. The Cascade-Correlation learning architecture. In [22], pages 524-532, 1990.
[6] William Finnoff, Ferdinand Hergert, and Hans Georg Zimmermann. Improving model selection by nonconvergent methods. Neural Networks, 6:771-783, 1993.
[7] Stuart Geman, Elie Bienenstock, and René Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58, 1992.
[8] Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994.
[9] K. J. Lang, A. H. Waibel, and G. E. Hinton. A time-delay neural network architecture for isolated word recognition. Neural Networks, 3(1):23-43, 1990.
[10] K. J. Lang and M. J. Witbrock. Learning to tell two spirals apart. In Proc. of the 1988 Connectionist Models Summer School. Morgan Kaufmann, 1988.
[11] Martin Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4):525-533, June 1993.
[12] N. Morgan and H. Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In [22], pages 630-637, 1990.
[13] Michael C. Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In [21], pages 107-115, 1989.
[14] Steven J. Nowlan and Geoffrey E. Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4):473-493, 1992.
[15] Lutz Prechelt. A study of experimental evaluations of neural network learning algorithms: Current research practice. Technical Report 19/94, Fakultät für Informatik, Universität Karlsruhe, D-76128 Karlsruhe, Germany, August 1994. Anonymous FTP: /pub/papers/techreports/1994/1994-19.ps.Z on ftp.ira.uka.de.
[16] Michael D. Richard and Richard P. Lippmann. Neural network classifiers estimate Bayesian a-posteriori probabilities. Neural Computation, 3:461-483, 1991.
[17] Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA, April 1993. IEEE.
[18] David Rumelhart and John McClelland, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1. MIT Press, Cambridge, MA, 1986.
[19] Steen Sjøgaard. A Conceptual Approach to Generalisation in Dynamic Neural Networks. PhD thesis, Aarhus University, Aarhus, Denmark, 1991.
[20] Brian A. Telfer and Harold H. Szu. Energy functions for minimizing misclassification error with minimum-complexity networks. Neural Networks, 7(5):809-818, 1994.
[21] David S. Touretzky, editor. Advances in Neural Information Processing Systems 1, San Mateo, CA, 1989. Morgan Kaufmann Publishers Inc.
[22] David S. Touretzky, editor. Advances in Neural Information Processing Systems 2, San Mateo, CA, 1990. Morgan Kaufmann Publishers Inc.
[23] Zijian Zheng. A benchmark for classifier learning. Technical Report TR474, Basser Department of Computer Science, University of Sydney, N.S.W. Australia 2006, November 1993. Anonymous FTP from ftp.cs.su.oz.au in /pub/tr.