Comparison Adaptive Methods Function Estimation From Samples

Download as pdf or txt
Download as pdf or txt
You are on page 1of 16

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 7, NO.

4, JULY 1996 969

Comparison of Adaptive Methods for


Function Estimation from Samples
Vladimir Cherkassky , Senior Member, IEEE, D o n Gehring, and Filip Mulier

Abstruct- The problem of estimating an unknown function a training set of known (input-output) samples. The mapping
from a finite number of noisy data points has fundamental is typically implemented as a computational procedure (in
importance for many applications. This problem has been studied software). Once the mapping is obtainedhnferred from the
in statistics, applied math, engineering, artificial intelligence,
and, more recently, in the fields of artificial neural networks, training data, it can be used for predicting the output values
fuzzy systems, and genetic optimization. In spite of many papers given only the values of the input variables. Inputs and
describing individual methods, very little is known about the outputs can be continuous and/or categorical variables. When
comparative predictive (generalization) performance of various outputs are continuous variables, the problem is known as
methods. We discuss subjective and objective factors contributing regression or function estimation; when outputs are categorical
to the difficult problem of meaningful comparisons. We also de-
(class labels), the problem is known as classification. There
scribe a pragmatic framework for comparisons between various
methods, and present a detailed comparison study comprising is a close connection between regression and classification
several thousand individual experiments. Our approach to com- in the sense that any classification problem can be reduced
parisons is biased toward general (nonexpert) users who do not to (multiple output) regression [3]. Here we consider only
have detailed knowledge of the methods used. Our study uses regression problems with a single (scalar) output, i.e., we
six representative methods described using a common taxonomy. seek to estimate a function f of N - 1 predictor variables
Comparisons performed on artificial data sets provide some
insights on applicability of various methods. No single method (denoted by vector 2) from a given set of n training data
proved to be the best, since a method’s performance depends points, or measurements, z, = (xt,gt), (i = 1, . . . , n ) in
significantly on the type of the target function (being estimated), N-dimensional sample space
and on the properties of training data (i.e., the number of samples,
amount of noise, etc.). Hence, our conclusions contradict many
known comparison studies (performed by experts) that usually
y = f(x) + error (1)
show performance superiority of a single method (promoted
by experts). We also observed the difference in a method’s where error is unknown (but zero mean) and its distribution
robustness, i.e., the variation in predictive performance caused by may depend on x. The distribution of training data in x is also
the (small) changes in the training data. In particular, statistical unknown and can be arbitrary.
methods using greedy (and fast) optimization procedures tend to Nonparametric methods make no or very few general as-
be less robust than neural-network methods using iterative (slow) sumptions about the unknown function f(x). Nonparametric
optimization for parameter (weight) estimation.
regression from finite training data is an ill-posed problem
and meaningful predictions are possible only for sufficiently
I. INTRODUCTION smooth functions. We emphasize that the function smoothness
is measured with respect to sampling density of the training
I N the last decade, neural networks have given rise to
high expectations for model-free statistical estimation from
a finite number of samples (examples). There is, however,
data. Additional complications arise due to inherent sparseness
of high-dimensional training data (known as the curse of
increasing awareness that artificial neural networks (ANN’S) dimensionality) and the difficulty in distinguishing between
represent inherently statistical techniques subject to well- signal and error terms in (1).
known statistical limitations [ 11, [ 2 ] . Whereas many early Recently, many adaptive computational methods for
neural network application studies have been mostly empirical, function estimation have been propoced independently in
more recent research successfully applies statistical notions statistics, machine learning, pattern recognition, fuzzy systems,
(such as overfitting, resampling, bias-variance trade-off, the nonlinear systems (chaos) and ANN’S. General lack of
curse of dimensionality, etc.) to improve neural-network per- communication between different fields combined with
formance. Statisticians can also gain much by viewing neural- highly specialized terminology often results in duplication
network methods as new tools for data analysis. or close similarity between methods. For example, there is
The goal of predictive learning [3] is to estimate/learn an a close similarity between tree-based methods in stalistics
unknown functional mapping between the input (explanatory, (CART) and machine learning (ID3); multilayer perceptron
predictor) variables and the output (response) variables, from networks use a functional representation similar to projection
pursuit regression (PPR) [4], Breiman’s PI-method [ 5 ] is
Manuscript received October 6, 1994; revised May 28, 1995 and January related to sigma-pi networks [6] as both seek an output in
4, 1996. This work was supported in part by the 3M Corporation. the sum-of-products form, etc. Unfortunately, the problem
The authors are with the Department of Electrical Engineering, University
of Minnesota, Minneapolis, MN 55455 USA. is not limited to the field-specific jargon, since each
Publisher Item Identifier S 1045-9227(96)04398-6. field develops its methodology based on its own set of
1045-9227/96$05 .OO 0 1996 IEEE
970 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 7, NO. 4, JULY 1996

implicit assumptions and modeling goals. Commonly used minimal user knowledge of the methods is assumed. Training
goals of modeling include: prediction (generalization); (model inference from data) is assumed off-line and computer
explanationhnterpretation of the model; usability of the model; time is assigned negligible cost.
biological plausibility of the method; datddimensionality Artgcial Versus Real-Life Data Sets: Most comparison
reduction, etc. studies focus on real-life applications. In such studies the
Such diversity makes meaningful comparisons difficult. main goal is to achieve best results on a given application
Moreover, adaptive methods for predictive leaming require data set, rather than to provide meaningful comparison of
custom/manual control (parameter tuning) by expert users [ 7 ] , a methods’ performance. Moreover, comparison results are
[8]. Even in a mature field (such as statistics), comparisons greatly affected by application domain-specific knowledge,
are usually controversial, due to the inherent complexity which can be used for appropriate data encoding/preprocessing
of adaptive methods and the subjective choice of data sets and the choice of a method itself. In our experience,
used to illustrate a method’s relative performance, i.e., see such domain-specific knowledge is far more important
discussions in [9], Also, performance comparisons usually for successful applications of predictive learning than the
do not separate a method from its software implementation. choice of a method itself. In addition, application data sets
It would be more accurate to discuss comparisons between are fixed, and their properties are unknown. Hence, it is
software implementations rather than methods. Another (less- generally difficult to evaluate how a methods’ performance
obvious) implementation bias is due to a method’s (im- is affected by the properties of training data, such as
plicit) requirements for computational speed and resources. sample size, amount of noise, etc. In contrast, characteristics
For example, adaptive methods developed by statisticians of artificial data are known and can be readily changed.
(MARS, projection pursuit) were (originally) intended for Therefore, from a methodological perspective, we advocate
the statistical community. Hence, they had to be fast to be using artificial data sets for comparisons. Of course, any
useful for statisticians accustomed to fast methods such as conclusions based on artificial data can be applied to real-life
linear regression. In contrast, neural-network implementations data only if they have similar properties. Characterization
were developed by engineers and computer scientists who of application data remains an important (open) research
are more familiar with compute-intensive applications. As problem.
a result, statistical methods tend to use fast, greedy op- Comparison Methodology: The following generic scheme
timization techniques, whereas neural-network implementa- for application of an adaptive method to predictive learning
tions use more brute force optimization techniques (such was used:
as gradient descent, simulated annealing, and genetic algo- 1) choose a flexible methodhepresentation, i.e., a family of
rithms). nonlinear (parametric) models indexed by a complexity
parameter;
2) estimate/learn model parameters;
11. A FRAMEWORK
FOR COMPARISON 3) choose complexity (regularization) parameter of a
Based on the above discussion, meaningful comparisons method (model selection);
require: 4) evaluate predictive performance of the final model.
1) Careful specification of comparison goals, methodology, Note: Steps 2 and 3 may not be distinct in some methods,
and design of experiments. i.e., early stopping rules in backpropagation training.
2) Use of fully or semiautomatic modeling methods by Each of the steps 2 4 above generally uses its own data set
nonexpert users. Note that the only way to separate the known as:
power of the method from the expertise of a person training set, used to estimate model parameters in step 2;
applying it is to make the method fully automatic validation set, used for choosing the complexity parameter
(no parameter tuning) or semiautomatic (only a few of a method in step 3 (i.e., the number of hidden units in
parameters tuned by a user). Under this approach, auto- a feedforward neural network);
matic methods can be widely used by nonexpert users. test set, for evaluating predictive performance of the final
The issue is whether adaptive methods can be made model in step 4.
semiautomatic without much compromise on their pre- These independent data sets can be readily generated in
dictive performance. Our experience shows that it can be comparison studies using artificial data. In application studies,
accomplished by relying on (compute-intensive) internal when the available data is scarce, the test and validation
optimization, rather than user expertise, to choose proper data can be obtained from the available (training) data via
parameter settings. resampling (cross-validation). Sometimes, the terms validation
Specific assumptions used in our comparison study are and test data are used interchangeably, since many studies use
detailed next. the same (test) samples to choose (optimally) the complexity
Goals of Comparison Modeling: Our main criterion is the parameter and to evaluate the predictive performance of the
predictive performance (generalization) of various methods final model [SI, [lo]. In our study, we also used the same
applied by nonexpert users. The comparison does not take samples (test set) in steps 3 and 4.
into account a methods’ explanationhnterpretation capabilities, Taking the validation set to be the same as the test set
computational (training) time, methods’ complexity, etc. All also avoids the problem of model selection in our study, as
methods (their implementations) should be easy to use, so only explained next. Empirically observed predictive performance
CHERKASSKY et al.: COMPARISON OF ADAPTIVE METHODS 97 1

depends on the outcome of all steps (1)-(3). Hence, com- where


parisons between methods may be complicated by the choice 1) x is a vector of input variables:
of the complexity terdparameter in Step (3), also known as 2) u3 are expansion coefficients (to be determined from
model selection, a very difficult problem by itself. Methods data);
for choosing a complexity parameter is an area of active 3) B ( x , p) are basis functions;
research, and they include data resampling (cross-validation) 4) p3 are parameters of each basis function;
and analytic methods for estimating model prediction error 5 ) usually B(x, PO) = 1;
[ 111, [ 121. The problem of model selection is avoided in our 6) M is the regularization parameter of a method.
comparisons to: Methods based on this taxonomy are also known as dic-
1) focus on the comparison of a methods’ representation tionary methods [ 3 ] , since the methods differ according to
power and optimization/estimation capabilities (that can the set of the basis functions (or dictionary) they use. We
be obscured by the results of model selection); can further distinguish between nonadaptive (parametric) and
2) avoid computational cost of resampling techniques for adaptive methods, as follows:
model selection. Instead, we choose to spend computa- 1) Nonadaptive methods use preset basis functions (and
tional resources on comparing a method’s performance their parameters), so that only coefficients u3 are fit to
on many different data sets. data. Optimal values for u3 are (usually) found by least
Software Package XTAL for Nonexpert Users: To enable/ squares from n training samples, by minimizing
improve usability of adaptive methods by nonexpert users,
several statistical and neural-network methods for nonpara-
metric regression (developed elsewhere) were integrated into a
single package XTAL (stands for crystal) with a uniform user
interface (common for all methods). Note that any large-scale There are two major classes of nonadaptive methods,
comparison involving hundreds or thousands experiments i.e., global parametric methods (such as linear and
would not be practically feasible with methods implemented polynomial regression), and local parametric methods
as stand alone modules. Thus, XTAL incorporates a sequencer (such as kernel smoothen, piecewise-linear regression
that allows it to cycle through large number of experiments and splines). For a good discussion of nonadaptive
without operator intervention. In addition, all methods were methods, see [9]. Note that global parametric methods
modified so that, at most, one or two parameters need to inevitably introduce bias and local parametric methods
be user-defined (no limit is imposed on internal parameter are applicable only to low-dimensional problems (due to
tuning transparent to a user). Since most adaptive methods in inherent sparseness of finite samples in high-dimensions
the package originally had a large number of user-tunable known as “the curse of dimensionality”). Hence, adap-
parameters (typically half a dozen or so), most of these tive methods are the only practical alternative for high-
parameters were either set to carefully chosen default values dimensional problems.
or intemally optimized in the final version included in the 2) Adaptive methods, where (in addition to coefficients a3
package. Naturally, the final choice of user-tunable parameters basis functions themselves and/or their parameters pJ
and the default values is somewhat subjective and it introduces are adapted to data. For adaptive methods optimization
a certain bias into comparisons between methods. This is the (3) becomes a difficult (nonlinear) problem. Hence,
price to pay for the simplicity of using adaptive methods the optimization strategy used becomes very important.
under our approach. The XTAL package was developed by Statistical methods usually adopt greedy optimization
the authors who had no detailed knowledge of every method strategy (stepwise selection) where each basis function
incorporated into the package. is estimated one at a time. In contrast, neural-network
methods usually optimize over the whole set of basis
111. TAXONOMY
OF METHODSFOR FUNCTION
ESTIMATION functions.
It is important to provide a common taxonomy of sta- In this paper, we are mostly concerned with adaptive
tistical and neural network methods for function estimation methods. All reasonable adaptive methods use a set
to interpret the results of empirical comparisons. Reasonable of basis functions rich enough to provide universal
taxonomies can be based on a method’s approximation, i.e., for all target functions f(x) of some
1) representation scheme for the target function (being specified smoothness and for any e > 0 there exist M ,
estimated); a3, pf ( j = 1, . . . , M ) such that
2) optimization strategy;
3) interpretation capability.
In this paper, we follow the representation-scheme taxon-
omy where the function is estimated as a linear combination
What is a good choice for the basis functions (method)
of basis functions (basis function expansion) [3]
used? In general, it depends on the (unknown) tar-
M get function being estimated, in the sense that the
f(4 = UJB.7 (x, PJ1 (2) best dictionary (method) is the one that provides the
f=O “simplest” representation in the form (2) for a given
972 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 7, NO. 4, JULY 1996

(prespecified) accuracy of estimation. The simplest-form They must be chosen very carefully so that overfitting does not
representation, however, does not mean just the number occur. GMBL uses cross-validation to select the smoothing
of terms in ( 2 ) , since different methods may use basis parameters, the distance scale used for each variable and
functions of different complexity. For example, ANN’S method with the best fit. The method’s parameters are adjusted
use fixed (sigmoid) univariate basis functions of linear to minimize the cross-validation estimate using optimization
combinations (projections) of input variables, whereas techniques. This parameter selection is very time consuming
PPR uses arbitrary univariate basis functions of projec- and is done off-line. After the parameter selection is com-
tions. Since PPR uses more complex basis functions than pleted, the power of the method is in its capability to perform
ANN’S, its estimates would generally have fewer terms prediction with data as it arrives in real-time. It also has the
in (2) than those of ANN. ability to deal with nonstationary processes by “forgetting”
Adaptive methods can be further classified as: past data. Since the GMBL model depends on weighted nearest
Global methods, which use basis functions globally neighbor or locally weighted linear interpolation, its estimates
defined in the domain of x. The most popular choice are are rather similar to (but usually more accurate than) the results
univariate functions of projections (linear combinations) of a naive k-nearest neighbors regression. GMBL performs
o f the input variables, as in ANN’S and PPR. This well in low-dimensional cases, but high-dimensional data sets
choice is very attractive since it automatically achieves make parameter selection critical and very time-consuming.
dimensionality reduction. Other methods for choosing Original GMBL code provided by Moore [I61 was used. The
global basis functions are known in statistics, such as GMBL version in the package has no user-defined parameters.
additive models [ 141, tensor-product splines in MARS Default values of the original GMBL implementation were
[9], sum-of-products basis functions used in PI-method used for the internal model selection.
[SI, etc.
Local methods, that use local basis functions (in x- C. Projection Pursui1
space). Such methods either use local basis functions Projection pursuit [4] is a global adaptive method which
explicitly (such as radial basis function networks with exhibits good performance in high-dimensional problems and
locally-tuned units) or implicitly via adaptive distance is invariant to linear coordinate transformations. The model
metric in x-space (as in adaptive-metric nearest neigh- generated by this method is the sum of univariate functions g;
bors, adaptive kernel smoothen, partitioning methods of linear combinations of the elements of x
such as CART, etc.). Such methods effectively perform M
data-adaptive local feature selection.
3=1
Iv. DESCRIPTION
OF REPRESENTATIVE
METHODS The parameters p3 and the functions g3 are adaptively op-
Based on the taxonomy of methods presented above, a timized based on the data. For each of the M projections,
number of regression methods have been selected for compar- the algorithm determines the best p:, using a gradient descent
ison and included in the XTAL package. These methods were technique to search for the projection which minimizes the
chosen to represent a member of each of the major classes unexplained variance. Each g:, is a smoothed version of the
of methods. Each method [except generalized memory-based projected data with smoothing parameters chosen according
learning (GMBL)] is available as public-domain software to a fit criteria such as cross-validation. Since the model is
developed by its original author(s). Next we describe these additive, the search for function projections is done iteratively
methods and their parameter settings. using the so-called backfitting algorithm. This is a greedy
optimization technique where each additive term is estimated
A. Nearest Neighbors one at a time. The model is decomposed based on unexplained
A simple version of k-nearest neighbors regression was variance
M
implemented in the XTAL package a benchmark. Nearest
neighbors is a locally parametric method where the response
,=1
value for a given input is an average of the k closest training 3 i E

samples (in x-space) to this input. The value of k controls the


The p k is optimally chosen using gradient descent while
amount of smoothing performed and is set by the user in the
holding the p:,,j # k fixed. In each iteration another term is
XTAL package.
pulled out of the summation and an optimal p k is found. This
procedure is repeated until the average residual does not vary
B. Generalized Memory-Based Learning significantly. In this way, each function projection is chosen
GMBL [lS], [16] is a statistical technique that was designed to best fit the largest unexplained variance of the data.
specifically for robotic control. The model is based on storing Theoretically it is possible to model any smooth function
past samples of training data to “learn by example.” When new with projection pursuit for large enough 211 [IS]. For large M ,
data arrives, an output is determined by interpolating. GMBL however, the approximation is computationally time consum-
is capable of using either weighted nearest neighbor or locally ing and difficult to interpret. The method approximates radial
weighted linear interpolation (loess) [ 171. The interpolating function well, but harmonic functions are better approximated
method and parameters used are a critical part of the model. by kernel estimators [19]. PP also exhibits a sensitivity to
CHERKASSKY et al.: COMPARISON OF ADAPTIVE METHODS 973

outliers, since their presence increases the chance of choosing


a spurious projection. This occurs because the search for pro-
jections can get caught in a local minima. In terms of speed of
execution, the method is limited by the speed of the smoother
and the rate of convergence of the optimizing algorithm. In
the original implementation of PP [ 131, the supersmoother is
employed for smoothing. Other implementations of projection t X t X
pursuit have used Hermite polynomials [20]. In general, a very
Fig. 1. Pair of one-dimensional basis functions used by the MARS method.
robust adaptive smoother is required, which may cause speed
limitations.
The original implementation of projection pursuit, called regions and modeling each region with a constant value. The
SMART (smooth multiple additive regression technique) [ 131, regions are chosen based on a greedy optimization procedure
employs a heuristic search strategy for selecting the number where in each step, the algorithm selects the split which causes
of projections to avoid poor solutions due to multiple local the largest decrease in mean squared error. A basis function
minima. The SMART user must select the largest number for each region can be described by
of projections ( M L ) to use in the search as wcll as the B,(x) I[xE Rj] (5)
final number of projections ( M F ) .The strategy is to start
with M L projections and remove projections based on their where I is the indicator function. In this case, I has the value
relative importance until the model has M F projections. The one if the vector x is in region R, and zero otherwise. The
model with M p projections is then returned as the regression model can then be described by the following expansion on
solution. To improve ease of use in the XTAL package, M F these basis functions
+
is set by the user, but M L is always taken to be MF 5 . In M

addition, the SMART package allows the user to control the f(X) = U P , (XI. (6)
thoroughness of optimization. In the XTAL implementation, ,=1

this was set to the highest level. The MARS method is based on similar principles of recursive
partitioning and greedy optimization, but uses continuous basis
D. Artificial Neural Networks (ANN’S) functions rather than ones based on the indicator function.
Multilayer perceptrons with a single hidden layer and a lin- The basis functions of the MARS algorithm can each be
ear output unit compute a linear combination of basis functions described in terms of a two-sided truncated power basis
(2), where the basis functions are fixed (sigmoid) univariate function (truncated spline) of the form
functions of linear combinations of input variables. This is
a global adaptive method in our taxonomy. Various training
B,i(z - t ) = [&(z - t)]t (7)

(learning) procedures for such networks differ primarily in the where t is the location of the knot, q is the order (of splines)
optimization strategy used to estimate parameters (weights) of and the + subscript denotes the positive part of the argument.
a network. The XTAL package uses a version of multilayer The basic building block of the MARS model is a pair of
feedforward networks with a single hidden layer described these basis functions which can be adjusted using coefficients
in [21]. This version employs conjugate gradient descent for to give a local approximation to data (Fig. 1). For multivariate
estimating model parameters (weights) and performs a very problems, products of the univariate basis functions are used.
thorough (internal) optimization via simulated annealing to The basis functions for MARS can be described by
escape from local minima (10 annealing cycles). The original K,
implementation from [21] was used with minor modifications.
The method’s implementation in XTAL has a single user-
B:%) = I-I
k=l
[ S I ,k . (XV(3, k ) - t,, k)I!. (8)

defined parameter-the number of hidden units. This is the This is a product of one-dimensional splines each with a
complexity parameter of the method. There is a close sim- directional term ( s g , k = kl). The variable K, defines the
ilarity between PPR and ANN’S in terms of representation, number of splits required to define the region j , U indicates
as both methods use nonlinear univariate basis functions of the particular variable of x used in the splitting and t 3 , k is
linear combinations (projections) of input variables. The two the split point.
methods use very different optimization procedures, however: The MARS model can be interpreted as a tree where each
PPR uses greedy (stepwise) optimization to estimate additive node in the tree consists of a basis function and uses a
terms in (2) one at a time, whereas ANN training estimates tree-based algorithm for constructing the model. Like other
all the basis functions simultaneously. recursive partitioning methods, nodes are split according to a
goodness of fit measure. MARS differs from other partitioning
E. Multivariate Adaptive Regression Splines methods in that all nodes (not just the leaves) of the tree are
MARS [9] is a global adaptive method in our taxonomy. candidates for splitting. Fig. 2 shows an example of a MARS
This method combines the idea of recursive partitioning re- tree. The function described is
gression (CART) [22] with function representation based on 7

tensor-product splines. The method of recursive partitioning f”(4= a,B,(x). (9)


consists of adaptively splitting the sample space into disjoint 3=1
974 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. I , NO. 4, JULY 1996

Fig. 2. Example of a MARS tree

The depth of the tree indicates the interaction level. On each to coordinate rotations. For this reason, the performance of
path down variables are split at most once. The algorithm for the MARS algorithm is dependent on the coordinate system
constructing the tree uses a forward stepwise and backward used to represent the data. This occurs because MARS par-
stepwise strategy. In the forward stepwise procedure, a search titions the space into axis-oriented subregions. The method
is performed over every node in the tree to find a node, which does have some advantages in terms of speed of execution,
when split improves the fit according to the fit criteria. This interpretation, and relatively automatic smoothing parameter
search is done over all candidate variables, split points t 3 . k , selection.
and basis coefficients. For example, in Fig. 2 the root node
Bl(x)is split first on variable 2 1 , and the two daughter nodes F. Constrained Topological Mapping (CTM)
&(x) and B ~ ( xare ) created. Then, the root node is split CTM [24] is an example of local adaptive method. In terms
again on variable 5 2 creating the nodes B ~ ( xand ) Bj(x). of representation of the regression estimate, CTM is similar to
Finally node &(x) is split on variable 2 3 . In the backward CART, where the input (x) space is partitioned into disjoint
stepwise procedure, leaves are removed which cause either (unequal) regions, each having a constant response (output)
an improved fit or slight degradation in fit as long as model value. Unlike CART’S greedy tree partitioning, however, CTM
complexity decreases. uses (nonrecursive) partitioning strategy originating from the
The measure of fit used by the MARS algorithm is the neural network model of self-organizing maps (SOM’s) [25].
generalized cross validation (GCV) estimate [23]. This GCV In the CTM model, a (high-dimensional) regression surface
measure provides an estimate of the future prediction accuracy is estimated (represented) using a fixed number of “units” (or
and is determined by measuring the mean squared error on local prototypes) arranged into a (low-dimensional) topologi-
the training set and penalizing this measurement to account cal map. Each unit has (x, y), coordinates associated with it,
for the increase of variance due to model complexity. The and the goal of training (self-organization) is to position the
user can select the amount of penalization imposed (in terms map units to achieve faithful approximation of the unknown
of degrees of freedom) for each additional split used by function. The CTM model uses a suitable modification (for
the algorithm. Theoretical and empirical studies seem to regression problems) of the original SOM algorithm [25] to
indicate that adaptive knot location adds between two and faithfully approximate the unknown regression surface from
four additional model parameters for each split [9]. The user the training samples. The main modification is that the best-
also selects the maximum number of basis functions and the matching unit step of SOM algorithm is performed in the space
interaction degree for the MARS algorithm. In the XTAL of predictor variables (x-space), rather than in the full (x, y)
implementation, the user selects the maximum number of basis sample space [24]. The effectiveness of SOM/CTM methods
functions and the degrees of freedom (recoded as an integer in modeling high-dimensional distributionslfunctions is due to
from zero to nine). The interaction degree is defaulted to allow the use of low-dimensional maps. The use of topological maps
all interactions. effectively results in performing kernel smoothing in the (low-
The MARS method is well suited for high- as well as dimensional) map space, which constitutes a new approach
low-dimensional problems with a small number of low-order to dimensionality reduction and dealing with the curse of
interactions. An interaction occurs when the effect of one dimensionality [26]. In the regression problem, one assumes
variable depends on the level of one or more other variables data of the form Z k = (xk,yk), (IC = 1, . . . , K ) . Effectively,
and the order of the interaction indicates the number of the CTM algorithm performs Kohonen self-organization in the
interacting variables. Like other recursive partitioning meth- space of the predictor variables x and performs an additional
ods, MARS is not robust in the case of outliers in the update to determine a piecewise constant regression estimate
training data. It also has the disadvantage of being sensitive of y for each unit. In this algorithm the unit locations are
CHERKASSKY ef al.: COMPARISON OF ADAPTIVE METHODS 975

denoted by the vectors wj where j is the vector topological lacks, however, some key features found in other statistical
coordinate for the unit. Units of the map are first initialized methods:
uniformly along the principal components of the data. Then, Piecewise-linear versus piecewise-constant approxima-
the following three iteration steps are used to update the tion: The original CTM algorithm uses a piecewise
units: constant regression surface, which is not an accurate
1) Partitioning: Bin the data according to the index of the representation scheme for smooth functions. Better accu-
nearest unit based on the predictor variables x. In this racy could be achieved using, for example, a piecewise-
step, the data are recoded so that each vector data sample linear fit.
is associated with the topological coordinates of the unit Control of model complexity: Up to this point, there
nearest to it in the predictor space. has been little understanding of how model complexity
in the CTM algorithm is adjusted. Interpreting the unit
ik = argmin I l x k - wj I I for each data vector, update equations as a kernel regression estimate [26]
j gives some insight by interpreting the neighborhood
Xk, ( I C = 1, . . . , K ) . width as a kernel span in the topological map space.
The neighborhood decrease schedule then plays a key
role in the control of complexity.
2) Conditional expectation estimate: Determine the new
Global variable selection: Global variable selection is a
unit locations based on nonparametric regression es-
popular statistical technique commonly used (in linear
timates using the recoded data as in the SOM algo-
regression) to reduce the number of predictor variables
rithm. The original SOM algorithm essentially used
by discarding low-importance variables. The original
kernel smoothing to estimate the conditional expecta-
CTM algorithm, however, provides no information about
tions, where the kernel was given by the neighborhood
variable importance, since it gives all variables equal
function in the topological space. Additionally, update
strength in the clustering step. Since the CTM al-
the response estimates for each unit. Treat the topologi-
gorithm performs self-organization (clustering) based
cal coordinates found above, i k as the predictors and the
on Euclidean distance in the space of the predictor
X k as a vector response. New unit locations are given
variables, the method is sensitive to predictor scaling.
by the regression estimates at all topological coordinate
Hence, variable selection can be implemented in CTM
locations
indirectly via adaptive scaling of predictor variables
K
during training. This scaling makes the method adaptive,
since the quality of the fit in the response variable affects
CXkHx(1lik -jII) the positioning of map units in the predictor space.
wj k=l
= K , this is an estimate of E(xlj) This feature is important for high dimensional problems
where training samples are sparse, since local parametric
k=l methods require dense samples and global parametric
K methods introduce bias. Hence, adaptive methods are
the only practical alternative.
Batch versus flow-through implementation. The original
CTM (as most neural-network methods) is a flow-
through algorithm, where samples are processed one
k=l at a time. Even though flow-through methods may
be desirable in some applications (i.e., control), they
for all valid topological coordinates j . Here H x is the are generally inferior to batch methods (that use all
kernel function used for updating unit locations and HY available training samples) commonly used in statistics
is the kernel function used for updating the response for estimation, both in terms of computational speed
estimates. and estimation accuracy. In particular, the results of
3) Neighborhood decrease: The complexity of each of the modeling using flow-through methods may depend on
estimates is independently increased. When using kernel the (heuristic) choice of the learning rate schedule [29].
smoothing, this corresponds to decreasing the span of Hence, the batch version of CTM has been developed
the smoother. in [30].
The trained CTM provides piecewise-constant interpolation These deficiencies in the original CTM algorithm can be
between the units. The constant-response regions are defined overcome using statistically-motivated improvements, ,as de-
in terms of the Voronoi regions of the units in the predictor tailed next. These improvements have been incorporated in
space. Prediction based on this model is essentially a table the latest version of CTM included in XTAL package.
lookup. For a given input x , the nearest unit is found in the 1) Local Linear Regression Estimates: The original CTM
space of the predictor variables and the piecewise constant algorithm, can be modified to provide piecewise linear ap-
estimate for that unit is given as response. proximation for the regression surface. Using locally weighted
Empirical results [24], [27], [28] have shown that the orig- multiple linear regression, the neighborhood function would
inal CTM algorithm provides accurate regression estimates. It be used to weight the observations and zero and first-order
916 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 7, NO. 4, JULY 1996

terms can be estimated for each unit. The regression estimate those variables that are most important in the regression are
for each unit is global, but local samples are given more given more weight in the distance calculation. The sensitivity
weight according to the neighborhood function. This differs of a variable on the regression surface can be determined
from the flow-through procedure proposed by Ritter et al. locally for each Voronoi region. These local sensitivities can
[28] which effectively uses linear regression over each Voronoi be averaged over the Voronoi regions to judge the global
region separately, and then forms a weighted average of the importance of a variable on the whole regression estimate.
regression coefficients using the neighborhood function to Since new regression estimates are given with each iteration
determine the first-order estimate for each unit. This method of the CTM algorithm, this scaling can be done adaptively,
does not take into account the density of points in each Voronoi i.e., the scalinghariable importance parameters are used in
region, since the coefficient averaging process is done over the distance calculation when clustering the data according
the regions independent of the number of samples falling in to the Voronoi regions. This effectively causes more units
each region. Such averaging also makes it difficult to see to be placed along variable axis which have larger average
what error functional is being minimized and is statistically sensitivity.
inefficient. The method proposed here is similar to loess 4 ) Implementation Details: To make a regression method
[17], in that a weighted mean-squared error criterion is mini- practical, there are a number of implementation issues that
mized. It differs from loess in that the neighborhood function must be addressed. The Batch CTM software package provides
rather than a nearest neighbor ranking is used to weight some additional features that improve the quality of the results
the observations. Using a piecewise linear regression surface and improve the ease of use. For example, in interpolation
rather than piecewise constant gives CTM more flexibility (no noise) problems with a small number of samples, model
in function fitting. Hence, fewer units are required to give selection based on cross-validation provides an overly smooth
the same level of accuracy. Also, the limiting case of a map estimate. For these problems it may be advantageous to do
with one unit corresponds to linear regression. The regression model selection based on the mean squared error on the
surface produced by CTM using linear fitting is not guaranteed training set. The package allows the user to select either cross-
to be continuous at the edges of the Voronoi regions. The validation or training set error or a mixture of the two to
neighborhoods of adjacent units overlap, however, so the perform model selection. This effectively provides the user
linear estimates for each region are based on common data control over a complexity penalty. The package is also capable
samples. This imposes a mild constraint which tends to induce of automatically estimating the number of units of the map
continuity. based on the error (cross-validation or training set) score. This
2) Using Cross-Validation to Select Final Neighborhood heuristic procedure provides good automatic selection of the
Width: The complexity of a model produced by the SOM number of units when the user does not wish to enter a specific
or CTM method is determined by the final neighborhood number. When used with XTAL, the user supplies the model
complexity penalty, an integer from 0 to 9 (max. smoothing)
width used in the training algorithm [26]. In other words,
the final neighborhood width is a smoothing (regulariza- and the dimensionality of the map.
tion) parameter of the CTM method. To estimate the correct
amount of smoothing from the data, one commonly uses cross- V. EXPERIMENTAL
SETUP
validation. In this procedure, a series of regression estimates Experiment design includes the specification of:
are determined based on portions of the training data, and 1) types of functions (mappings) used to generate samples;
the sum of squares error is measured using the remaining 2) properties of the training and test data sets;
validation samples. For each regression estimate, a different 3) specification of performance metric used for compar-
subset of validation samples are chosen from the original isons;
training set, so that each training sample is used exactly 4) description of modeling methods used (including default
once for validation. The final measure of generalization error parameter settings).
is the average of the sum of squares error. Because of the Functions Used: In the first part of our study, artificial data
computational advantages, leave-one-out cross validation was sets were generated for eight “representative” two-variable
chosen for CTM. In this case, each sample is systematically functions taken from statistical and ANN literature. These
removed from the training set to be used as the validation include different types of functions, such as harmonic, ad-
set. If the regression estimation procedure can be decomposed ditive, complicated interaction, etc. In the second part of
as a linear matrix operation on the training data, then the comparisons, several high-dimensional data sets were used,
leave one out cross validation score can be easily computed i.e., one six-variable function and four-variable functions.
~41. These high-dimensional functions include intrinsically low-
3) Variable Selection via Adaptive Sensitivity Scaling: The dimensional functions that can be easily estimated from data,
CTM algorithm effectively applies the Kohonen SOM in the as well as difficult functions for which model-free estimation
space of the predictor variables to determine the unit locations. (from limited-size training data) is not possible. The list of all
The SOM is essentially a clustering technique, which is 13 functions is shown in Appendix I.
sensitive to the particular distance scaling used. For CTM, Training Data Characteristics Include Its Distribution, Size,
a heuristic scaling technique can be implemented based on and Noise: Training set distribution was uniform in x-space.
the sensitivity of the linear fits for each Voronoi region. We Since using a random number generator to create a small
would like to adjust the scales of the predictor variables so that number of samples results in somewhat clustered samples
CHERKASSKY er al.: COMPARISON OF ADAPTIVE METHODS 911

2
* spiral distributions. These spiral distributions were created by
I * placing samples at evenly spaced points along a linear spiral
1.5 -
* I
a * in such a way that the samples were always of even density
8 throughout the surface of space. Thus, the spiral distribution
I *
1- is the polar equivalent of a uniform rectangular grid based on
* Cartesian coordinates, but it has the advantage that its points
0.5 I
do not lie on lines parallel to Cartesian axes. The uniform
* spiral distribution corresponds to the designed experiment
s! 0- setting as opposed to observational setting (that favors random
* distributions).
-0.5 - Training set size: Three sizes were used for each function
*
and distribution type (random and uniform spiral), i.e., small
-1 - D 8-
(25 samples), medium (100 samples), and large (400 samples).
-1.5 - * Training set noise: The training samples were corrupted
by three different levels of Gaussian noise: no noise, medium
* noise (SNR = 4), and high noise (SNR = 2). Thus there were a
-2 * , * a
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
total of 189 training data sets generated, i.e., eight two-variable
functions with two distribution types, three sample sizes, and
three noise levels (8*2*3*3 = 144), and five high-dimensional

* D
* *’ *. . 8
*
k
*
functions with a single distribution type, three sample sizes,
and three noise levels ( 5 * 1 * 3 * 3 = 45).
Test Data: A single data set was generated for each of the
1.5 - * *
’ 8
* * * * * thirteen functions used. For two-variable functions, the test set
1- * * * *- had 961 points uniformly spaced on a 31 x 31 square grid.
* For high-dimensional functions, the test data consisted of 961
* * * * * * I
0.5 - * a I * points randomly sampled in the domain of X .
* * * * * Pe$ormance Metric: The performance index used to
8 I *
g 0 .* * * * ** *
compare predictive performance (generalization capability)
I * of the methods is the normalized root mean square error
-0.5 - I 8 * * - (NRMS), i.e., the average RMS on the test set normalized
D * * * . by the standard deviation of the test set. This measure
* * e 8 8 L
-I-
* * represents the fraction of unexplained standard deviation.
* a
Hence, a small value of NRMS indicates good predictive
* a * a *
-1.5.
* * * a performance, whereas a large value indicates poor performance
a
* ,
* * * * * a (the value NRMS = 1 corresponds to the “mean response”
-2
-2 -1.5 -1 -0.5 0 0.5 I 1.5 2 model).
XI Modeling Methods: Methods included in the XTAL pack-
(b) age were described in Section IV.
User-Controlled Parameter Settings: Each method (except
Fig. 3 . (a) 25 samples from a uniform distribution. (b) Uniform spiral
distribution with 100 samples. GMBL) was run four times on every training data set, with
the following parameter settings:
[see Fig. 3(a)], a uniform spiral distribution was also used to 1) K”: IC = 2, 4, 8, 16.
produce more uniform samples for the two-variable functions 2) GMBL: no parameters (run only once).
[see Fig. 3(b)]. The random distribution was used for high- 3 ) CTM: map dimensionality set to 2, smoothing parame-
dimensional functions 9-12. The spiral distribution was used ter = 0, 2, 5, 9.
for function I3 to generate the values of two hidden variables 4) MARS: 100 maximum basis functions, smoothing pa-
that were transformed into four-dimensional training data. rameter (degrees-of-freedom) = 0, 2, 5 , 9.
The main motivation for introducing the uniform spiral 5) PPR: number of terms (in the smallest model) = 1, 2,
distribution was to eliminate the variability in model estimates 5, 8.
due the variance of finite random samples in 2-space [as shown 6) ANN: number of hidden units = 5 , 10, 20, 40.
in Fig. 3(a)], without resorting to averaging of the model Number of Experiments: With 189 training data sets and
estimates over many random samples. The usual averaging Fix modeling methods, each applied with four parameter
approach does not seem practically feasible, given the size settings (except GMBL applied once), the total number of
of this study (several thousand individual experiments). The experiments performed is: 189 * 5 * 4 +
189 * 1 = 3969.
solution to this dilemma taken in this study was to run two Most other comparison studies on regression typically use only
sequences of experiments for each of the two-dimensional tens of experiments. The sheer number of experimental data
functions; one sequence was run using random distributions reveals many interesting insights that (sometimes) contradict
and another, otherwise identical, sequence was run using the findings from smaller studies.
978 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 7, NO. 4, JULY 1996

7
E
5
q
3
2
1
U

Function 8
Fig. 4. Representations of the two variable functions used in the comparisons

VI. DISCUSSION
OF EXPERIMENTAL
RESULTS the comparative performance of methods for a particular
target function. For each method, four user-controlled pa-
Experimental results summarizing nearly 4000 individual rameter values were tried, and only the best model (with
experiments are presented in Appendix 11. Each table shows the smallest NRMS on the test set) was used for compar-
CHERKASSKY et al.: COMPARISON OF ADAPTIVE METHODS 979

jsons, Then the best methods were marked as crosses in a result is not surprising since CTM is similar to kernel methods
comparison table. Often, methods showed very close best which are known to work best with harmonic functions. CTM
performance (within 5%); hence several winners (crosses) performs rather poorly on functions of linear combinations
are entered in the table row. The absolute prediction per- of input variables (i.e., 4 and 6). Overall, CTM exhibits
formance is indicated by the best NRMS error value in the robust behavior, except at small samples. On a wide range
left column (this value corresponds to the best method and of function types CTM displayed a peculiar ability to give
is marked with bold cross). For two-variable functions, the exceptionally good results in interpolation situations where
two NRMS values in each row correspond to the random the sample size was large (400) and there was no added
and spiral distribution of training samples, respectively. For noise.
small samples and/or difficult (high-dimensional) functions, MARS Observations: For two-variable functions, MARS
often none of the methods provided a model better than the performed best with “additive” function 6. It also performed
response mean (Le,, NRMS _> 1). In such cases, there were best when estimating the high-dimensional “additive” func-
no winners, and no entries were made in the comparison tions (9 and 10). In fact, function 10 is much more linear
tables. than its analytical form suggests, due to the small range of its
Each method’s performance is discussed next with respect input variables. It can be accurately approximated by Taylor
to: series expansion around X = 0, which lends itself to linear
1) type of function (mapping); representation. MARS performed poorly when used with two-
2) characteristics of the training set, i.e., sample dimensional data sets of 25 samples.
size/distribution and the amount of added noise; PPR Observations: PPR’s performance was similar to, but
3) a method’s robustness with respect to characteristics of generally worse than, that of ANN. Both methods use a
training data and tunable parameters. Robust methods common representation, i.e., the sum of functions of linear
show small (predictable) variation in their predictive per- combinations of input variables. Thus, performance of these
formance in response to small changes in the (properties two methods is well suited to a function like four which can
of) training data or tunable parameters (of a method). be reduced to this form. On the two-dimensional “additive”
Methods exhibiting robust behavior are preferable for function (6) the performance of projection pursuit was su-
two reasons: first, they are easier to tune for optimal perior to all other methods tested. As noted by Maechler et
performance, and second, their performance is more al. [IO] harmonic functions are a worst case for projection
predictable and reliable. pursuit. This is evident in this study by the particularly poor
K-NN and GMBL Observations: These two local methods performance of projection pursuit on the harmonic functions
provide qualitatively similar performance, and their predictions 5 and 8.
are usually inferior in situations where more accurate model Projection pursuit is highly sensitive to the correct choice of
estimation (by other, more structured methods) is possible. K- it’s “term” parameter. This parameter is, however, difficult to
NN and GMBL, however, perform best in situations where choose and best results can only be obtained by testing a large
accurate estimation is impossible as indicated by high nor- number of different values. It is this facet of projection pursuit
malized RMS values (i.e., NRMS > 0.75). This can be that probably best accounts for the significant improvement in
seen most clearly in situations with small sample size and/or the results reported in this study over those previously reported
nonsmooth functions, i.e., functions I1 and 12 (the four- in Maechler et al. [lo] and Hwang et al. [20].
dimensional ‘‘multiplicative’’and “cascaded” functions) where ANN Observations: For most of the two-dimensional func-
other methods were completely unable to produce usable tions (functions 1 , 2 , 3 , 4 , 5 , 7 , and 8) and the four-dimensional
results (i.e., NRMS value > 1 indicating that estimation function with the underlying two-dimensional function (func-
error was greater than the standard deviation for the test tion 13), ANN gave the best overall estimation performance.
set). This was evident both in the score for total wins as well as the
Both methods utilize a memory based structure which makes score based on a winner take all basis. ANN was robust with
it possible to add new training samples without having to respect to sample size, noise level, and choice of its tuning
retrain the method. This can be an important advantage in parameter (number of hidden units). ANN was outperformed
some situations because a method such as ANN can take many on additive functions by projection pursuit in two dimensions
hours to retrain. (function 6) and by MARS in high-dimensional functions
Overall, both K-NN and GMBL showed very predictable (functions 9 and 10). The excellent performance of ANN
(robust) behavior with respect to changes in sample size, can be explained by its ability to approximate three common
noise levels, and function class which made their results types of functions: 1) functions of linear combinations of
dependable performance measures. All other methods showed input variables; 2) radial functions, due to the observation
far greater variability, particularly with respect to function that ANN and PP use a similar representation, and in view
class. of the known theoretical results on the suitability of PP to
CTM Observations: In this study CTM did well at es- approximate (asymptotically optimally) radial functions [ 191;
timating harmonic functions (functions 1, 5 , and 8). This and 3) harmonic functions, since a kernel function can be
result is consistent with those of Cherkassky, Lee, and Lari- constructed as a linear combination of sigmoids [lo], and
Najafi [27] which studied functions 5 , 6, and 7 and that of kernel methods are known to approximate harmonic functions
Cherkassky and Mulier [30] which included function 8. This well [19].
980 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. I, NO. 4, JULY 1996

The performance of ANN was generally somewhat better number of simple sigmoid functions, whereas PP assumes
than that reported in previous comparisons studied by Maech- that a function can be estimated by a small number of
ler et al. [lo], Hwang et al. [20], and Cherkassky et al. [27].We complex univariate functions of linear combinations of input
attribute this difference to the simulated annealing algorithm variables. Our results indicate that unstructured methods (such
used in this study to escape from local minima during training. as GMBL and K") are generally more robust than other
This optimization is computationally expensive, and results (more structured) methods. Of course, better robustness does
in training times in excess of several hours on the Sun 4 not imply better prediction performance.
workstations used. Run times for ANN were at least one, Also, neural-network methods (ANN, CTM) are more ro-
and typically two, orders of magnitude greater than for other bust than statistical ones (MARS, PP). This is due to differ-
methods. ences in the optimization procedures used. Specifically, greedy
General Comments on All Methods: Overall, most meth- (stepwise) optimization commonly used in statistical methods
ods provide comparable predictive performance for large results in more brittle model estimates than the neural network-
samples. This is not surprising, since all (reasonable) adaptive style optimization, where all the basis functions are estimated
methods are asymptotically optimal (universal approximators). together in an iterative fashion.
A methods' performance becomes more uneven at smaller Comparison with Earlier Studies: Since our comparison is
sample sizes. The comparative pel-foimance of diffaent biased toward (and performed by) nonexpert users, and the
methods is summarized below: number of user-tunable parameters and their values (for all
BEST WORST methods) was intentionally limited to only four parameter
Prediction accuracy settings, the quality of the reported results may be suspect.
(dense samples) ANN K", GMBL It is entirely possible that the same adaptive methods can
Prediction accuracy provide more accurate estimates if they are applied by expert
(sparse samples) GMBL, KNN MARS, PP users who can tune their parameters at will. To show that
Additive target functions MARS, PP K", GMBL our approach provides (almost) optimal results, we present
Harmonic functions CTM, ANN PP comparisons with earlier studies performed by the experts
Radial functions ANN, PP K" using the same (or very similar) experimental setup. The
Robustness wrt comparison tables for three two-variable functions (harmonic,
parameter tuning ANN, GMBL PP additive and complicated interaction) originally used in [lo],
Robustness wrt sample are shown in Appendix 111. All results from previous studies
properties ANN, GMBL PP, MARS. are rescaled to the same performance index (NRMS on the
test set) used in this study. The first table in Appendix I11
Here denseness of samples is measured with respect to the
shows NRMS error comparisons for several methods using
target function complexity (i.e., smoothness), so that in our
400 training samples. Original results for ANN, PP, and CTM
study dense sample observations refer mostly to mediudlarge
were given in [lo] and [24], whereas original MARS results
sample sizes for two-variable functions, and sparse sample
were provided by Friedman [9]. In all cases (except one) this
observations refer to small sample results for two-variable
study provides better NRMS than the original studies. The
functions as well as all sample sizes for high-dimensional
difference could be explained as follows:
functions.
The small number of high-dimensional target functions 1) in the case of ANN, by a more thorough optimization
included in this comparison study makes any definite con- (via simulated annealing) in our version;
clusions difficult. Our results, however, confirm the well- 2) in the case of PP, by better choice of default parameter
known notion that high-dimensional (sparse) data can be values in our version;
effectively estimated only if its target function has some 3) in the case of MARS, by a different choice of user-
special property. For example, additive target functions (9 tunable parameter values;
and 10) can be accurately estimated by MARS; whereas 4) in the case of CTM, our study employs a new batch ver-
functions with correlated input variables (function 13) can sion with piecewise-linear approximation [311, whereas
be accurately estimated by ANN, GMBL, and CTM. On the the original study used a flow-through version with
other hand, examples of inherently complex target functions piecewise-constant approximation.
(11 and 12) can not be accurately estimated by any method The remaining three tables in Appendix I11 show compar-
due to sparseness of training data. An interesting observation isons on the same three functions, but using 225 samples
is that whenever accurate estimation is not possible (i.e., sparse with no noise added and with added noise. The original study
samples), more structured methods generally fail, but memory- [20]uses the same SMART implementation of PP (as in our
based methods provide better accuracy. study), but a different version of ANN. Since our study does
Methods' Robustness: Even though all methods in our not use 225 samples. the tables show the results for 100
study are flexible nonparametric function estimators, there and 400 samples for comparison. It would be reasonable to
are two unstructured nonparametric methods (GMBL and expect that the results for 225 samples (of the original study)
K"), whereas the other methods (ANN, PP, MARS, and should fall in between 225 and 400 samples of our study.
CTM) introduce certain structure (implicit assumptions) into In fact, in most cases our results for 100 samples are better
estimation. For example, the ANN method assumes that than the results for 225 samples from the original study. This
unknown function can be accurately estimated by a large provides an additional confidence in our results, as well as
CHERKASSKY et al.: COMPARISON OF ADAPTIVE METHODS 981

an empirical evidence in favor of the premise that complex the software package developed for this project would be
adaptive methods can be automated without compromise on completely impossible to accomplish by manual methods.
their prediction accuracy. We emphasize that the results of any comparison (including
ours) on a method’s predictive performance should be taken
with caution. Such comparisons do not (and in most cases,
VII. CONCLUSIONS can not) differentiate between methods and their software
implementations. In some situations, however, poor relative
We discussed the important but intrinsically difficult sub-
performance of a method may be due to its software imple-
ject of comparisons between adaptive methods for predictive
mentation. Such problems caused by software implementation
learning. Many subjective and objective factors contributing
can be usually detected only by expert users, typically by
to the problem of meaningful comparisons were enumerated
the original authordimplementen of a method. In addition,
and discussed. A pragmatic framework for comparisons of
a method’s performance can be adversely affected by the
various methods by nonexpert users was presented. This
choice (or poor implementation) of the estimatiodoptimization
approach is proved feasible by developing a software pack-
procedure (i.e., step 3 in the generic method outline given in
age XTAL that incorporates several adaptive methods for
Section 11). For example, the final performance of multilayer
regression for nonexpert users, and successfully applying this
networks (ANN) greatly depends on the optimization tech-
software to perform a massive comparison study comprising
nique used to estimate model parameters (weights). Similarly,
thousands of individual experiments. As expected, none of
predictive performance of PP depends on the type of smoother
the methods provided superior prediction accuracy for all
used to estimate nonlinear functions of projections [20]. The
situations. Moreover, a method’s performance was found to
results of this study may suggest that a brute force compute-
depend significantly and nontrivially on the properties of
intensive approach to optimization pays off (assuming that
training data. For example, no (single) method was able
computing power is cheap), as indicated by the success of
to give optimal performance, for a given target function,
empirical methods such as ANN and CTM. Specifically in the
at all combinations of sample size and noise level. Hence,
case of ANN, this study achieves much better performance
we may conclude that the real value of comparisons lies
than in [lo] and [20] on the same data sets, due to the use
in the interpretation/explanation of comparison results. This
of simulated annealing to escape from local minima. More
implies the need for a meaningful characterization of the
research is needed, however, to evaluate and quantify the effect
modeling (estimation) methods as well as the data sets. Our
of optimization procedures on predictive performance.
paper provides a method’s characterization using a general
taxonomy of methods for function estimation based on the
type of basis functions and on the optimization procedure APPENDIXI
used. FUNCTIONS
USED TO GENERATE
DATASETS
Experimental results presented in this paper can be used to
draw certain conclusions on the applicability of representative
Funcfion 1 from Brei” 119911
methods to various data sets (with well-defined characteris- y = sin(x1 * xz) X uniform in [-2.21
tics). For example: Function 2 from Brei“ [1991]
y = exp(x1 * sin(n * xz)) X uniform in [-1, 1)
1) additive target functions are best estimated using statis- Function 3 - the GBCW function from Gu et al[1990]
tical methods (MARS, PP); a = 40 * exp(8*((xl- .5)2 + (xz - .5)2))
b = exp(8*((xl- .2)2 + (xz - .7)2))
2) projection-based methods (ANN and PP) are the best c = exp(8*((xl~.7)2 + (xz - .2)2))
y = a!(b + e ) X uniform in [O,11
choice for the (target) functions of linear combinations
Function 4 from Masers [1991]
of input variables; y = (1 + sin(2xi + 3x2))/(3.5 + sin (xi -xi)) X uniform in [-2,2]
3) harmonic functions are best estimated using CTM and Function 5 (harmonic)from Maechleret al[1990]
y = 42.659(2 + x1)/20 + Re(z5)) where z = X I + k z - .5(1 + i)
ANN;
or equivalently with xi = X I - .5, x2 = x2 -.5
4) neural-network methods tend to be more robust than y = 42.659(.1 + x1(.05 + xf - lOxtx$ + 5x4)) X uniform in [-S, .5]
statistical ones; Function 6 (additive) from Maechleret al[1990]
5 ) unstructured methods (nearest neighbor and kemel- y = 1.3356[1.5(1 - xi) + exp(2xl - 1 )sin(3n(xi - .6)2) + exp(3(xz - .5))sin(4x(xz - .9)*)]
X uniform in [OJ]
smoother type) tend to be more robust than more Function 7 (complicated interaction)from Maechleret al[1990]
y = 1.9r1.35 + exp(xi)sin(l3(xi - .6)2)exp(-xz)sin(7xz)] X uniform in 10, 11
structured methods.
Function 8 (harmonic) from Cherkasskyet a l [I9911
The problem of choosing a method for a given application y = sin(2r * sqrtcxf + x& x uniform in [-I, 11
data set is a difficult one. A very effective solution to this Function 9 (ddimensionaladditive) adapted from Friedman cl9911
problem was successfully demonstrated in this project by y = losin(mrix2) + 20(x3 - .5)2 + 10x4 + 5x5 + 0x6 X uniform in [-I, I]
Function 10 (4-dimensionaladditive)
automating the sequencing of a rather “brute force” compari- y = exp(2x1sin(xxd) + sin(xzx3) X uniform in [-.25, 2 5 1
son of available regression methodologies. A very substantial Function I 1 (4-dimensionalmultiplicative)
reduction in the amount of time required to familiarize a y = 4(x1 - .5)(x4 - . ~ ) s i n ( sqrt(x3
~x + x$) X uniform [-1, 11
novice user with advanced modeling methods is achieved Function 12 (4-dimensionai cascaded)
a = exp(Zxlsin(xq)). b = exp(2x2sin(nx3))
by providing a single universal control scheme that supports y = sin(a * h) X uniform in [-1, 11
all methods and hides unnecessary details from the user. Function 13 (4nominalvariables,2 hidden variables)
y = sin(a * b)
Even for an experienced user, the execution of the long where hidden variablesa, b are from uniform spiral in [-2.21.
Observed (nomid) X-vatiables are.: X I = atcos@), XZ = w ( a 2 + “3 = a + b, Xq = a
sequences of comprehensive experiments made possible by
982 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. I, NO. 4, JULY 1996

APPENDIX 11 i
Function 5 (harmonic)
EXPERIMENTALRESULTS y = 42.659(2 + x1)/'20 + Re(z5)) where z = x i + Lrz - .5(1 + i)

KNN GMBL CIM MARS W ANN KNN CMBL CIM MARS W ANN

1
K" GMBL CIM MAPS W ANN K" GMBL Cmr MARS W ANN s d # samples

1
no noise .821/.614 X X X

1
smnU #samples medium .934/.504 X X
nonoise .440/.314 X X x x x x x high .867/.664 X X x x X
medium 5171.390 X X

1 I
high .660/.534 X X X medium #samples
no noise 2291.187
medium #samples
nonoise .110/.061
medium .205/.210
hi; .369/.315 X X 1 g
medium ,3991.245
hi; .422/.347
large #samples
no noise .059/.038
X

x
x
x
x
large # samples
no noise .033/.017
medium .095/.098 X X x x x x
medium .172l.131
hi h .185/.199
E
X
X x x X
hi h .182/.125 X x x x x x x WINS
W.T.A. I! ? ? ? : 6
4
3
0
0
0
8
4
WINS
W.T.A. 1: 6
1
7
2
3
0
4
0
8
6 Fwmion 6
y = 1.3356[1.5(1 - XI)+ exp(2xi - 1 )sin(3rr(xl - .6)*) + exp(3(xz - .5))sin(4n(xz- .9P)1
RANDOM DISTRIBUTION SPIRAL DISTRIBUTION
KNN GMBL CIM MARS W ANN K" GMBL CIM MARS W ANN
I I
KNN GMBL CIM MARS FQ ANN KNN GMBL CIM MARS W *NN S d #samples
nonoise .400/.189 x x x X
small# samples medium .360/.250 X X
no noise .375/.253 X x x high .921/.450 X X
medium .511/.328 x x x x
high .622.572 X X X medium #samples
no noise .029/.OM x x x x
medium #samples medium .1491.113 X x x
no noise .117/.oM) x x x x x X high .321/.240 x x X X
medium .167/.154 X X x x x
high ,2661.261 X X X large #samples
no noise .010/.008 x x x x x
large #samples medium .068/.065 x x x x x
no noise .016/.015 x x x x x x high .144/.161 x x x x x
medium .098/.1M) x x x X x x x x x
high .136/.146 X X X WINS 0 1 3 8 4
W.T.A.
WINS 3 4 4 6 7
W.T.A.

KNN GMSL CIM MAPS FQ ANN KNN GMBL CIM MARS W ANN
XNN OMBL CIM MARS W ANN K" GMBL CIM MARS W ANN
S d # snmples
small# samples no noise ,498J.453 X X X
no noise .092/.050 X medium .493/.474
medium .343/.212 X X high .640/.509 X X x x X
high .414/.451 X X X medium # sMlpres
medium #samples no noise .218/.112 X x x X X
no noise .037/.029 x x X x x medium .255/.192 X X X
medium .099/.096 X X high ,394l.338 X X X
high .286/.202 X X x x x x huge #samples
large #samples
no noise .061/.034 x x x x x
no noise .016/.010 x x x x medium ,142.125 x x x x x x x
high .238/.148 x x X
medium .075/.060
high .125/.148
X x x
x x
x
x
x
x x x WINS 1 3 1 5 ;I;; 4 2 2 7
WINS
W.T.A. 1: 4
2
1
0
1
0
4
1
4
0
4
1
1
0
7
3
8
5
W.T.A. 3 0 0 4

RANDOM DISTRIBUTION SPIRAL DISTRIBUTION RANDOM DISTRIBUTION


KNN OMBL CIM MAPS W ANN XNN GMBL CIM MARS W ANN KNN GMBL CIM MARS W ANN K" GMBL CIM MARS W ANN

small# samples small #samples


no noise ,4191.359 X X nonoise ,792l.628 X X X X
medium .843/.626 x x X medium .920/.646 X X X x x
high .709/.650 X X high 1.842 x x
medium #samples medium # samples
no noise .175/.136 X X X nonoise 27U.156 X X X
xx x
xx
medium .285/.180 medium .245/.238 X X
high .366/.327 X high .436/.347 X X
large #samples
large #samples
no noise .120/.052 X x x X nonoise ,106f.049 X X
medium .150/.126 X x medium ,196J.162 X x x X X
high .198/.220 x x x high .227/.248 X X X X
WINS 0 0 1 0 6 8 0 0 1 0 5 6 WINS 1 2 3 0 2 6 0 3 6 0 1 5
W.T.A. 0 0 0 0 3 6 0 0 1 0 2 6 W.T.A. 0 0 2 0 1 5 0 3 2 0 0 4
CHERKASSKY et al.: COMPARISON OF ADAPTIVE METHODS 983

Fmnion 9 Function 13 (4 nominal variables, 2 hidden variables)


+
y = lOsin(mr1xz) 20(x3 - .5)2 + 1% + 5x5 + y =sin(a * b)
where hidden variables a, b are from uniform spiral in [-2,2].
RANDOM DISIRIBUTION Observed (nominal) X-variables are:
xi = a*cos(b). x2 = qrt(az + bz), x3 = a + h. nq = a
KNN GM8L CIM MAR3 W ANN
I RANWM DISTRIBUTION
small # samples
nonoise .496 X x x KNNGMBLCIMMAR3 W A"
medium 546 x x x smoll #samples
high ,677 X X X X
no noise .423 x x X
medium I s
nonoise
... . .
medium ,319
z ~.~ s I X
X
X
medium ,465
high 510
X
X
X

high .456 X X medun # samples


nonoise .052 X X
medium .237 X X X
high .308 X x x
large #samples
nonoise ,012 x x X
medium ,102 x x x x x
high ,179 X
WINS 0 5 5 3 3 7
W.T.A. 0 3 1 0 0 5
Function 10
y = exp(2xisin(mfl) + sin(x3nq)
RANDOM DISTRIBUTION
YNN GMBL CIM W W ANN

small #samples APPENDIXI11


nonoise ,283 x x
medium ,272
high S14
X
X
COMPARISONS WITH EARLIERSTUDIES
medium # samples
no noise .026 X
medium ,146 X
high .246 X Original Study: Maechler et a1 [19901; Cherkasskyet al [19911

large #samples Experimental set-up


no noise .011 x x nvmber Of Rd?Ihg S " p k S : 400
medium ,107 X X
high .182 x x x M ~ S Glevel: large (SNR = 4)
sample dislribufiam uniform grid (earlier studies)
WINS 0 1 8 2 2 uniform spiral (this study)
W.T.A.
ANN rP CIM MM3
function5 (harmonic)
originalstudy ,308 304 .131 .190
this study .151 ,230 .131 .169
function6 (additive) .G96 .128 ,170 .063
original study .095 .065 .147 .122
this study
function 7 (complicated)
original study 227 ,206 .197 ,179
this study .126 .I41 .125 .I38

small #samples
no noise Original study: Hwang et at [1994]
medium
high Function #5 (harmonic)
medium #samples
SAMPLESlZEISNRATIO GMBL CTM MAR3 W ANN
ORIGINAL STUDY:
2Wno noise NIA NJA NJA .498 .428
225lSN4 NJA NJA NJA SO4 ,457
large #samples
nonoise .668 THIS STUDY
medium .710 1Owno noise .256 .I87 ,199 .383 .206
lWSN4 .299 ,308 .308 .440 .245
W n o noise .149 .038 ,066 ,259 .133
WINS
W.T.A. I: 5
2
0
0
0
0
0
0
0
0
W S N 4

Function #6 (additive)
.154 .131 .169 .229 .I51

Fmrion 12 (4-dimensional waded) SAMPLESUEISNRATIO GMBL CIM hui(s W ANN


a = exp(2xlsin(mz)). h = exp(2x)sin(zxq)) ORIGINALSTUDY 022 ,057
y = sin(a b) 225lno noise NJA NIA NIA .I36 .141
225lSN-1 NIA NIA NJA
RANDOM DISIRIBUTION
THIS STUDY
XNN GMBL CTM hui(s W ANN
I 1Wno noise ,196 ,224 .030 ,035 .077
1WSN-1 ,244 ,300 ,187 .I13 ,146
small #samples W n o noise ,169 ,031 .008 .008 .065
no noise W S N 4 ,170 .I47 .I22 .065 .095

Function #7 (complicated interaction)


medium # " , d e s
no noise
SAMPLESlZEISNRATlO GMBL CTM W W ANN
ORIGINAL STUDY 168 ,146
225lno noise NJA NIA NJA .208 ,265
large #samples 225lSN4 NIA NIA NJA
THIS STUDY
1Wno noise ,182 .138 ,243 ,156 .113
1WSN4 .237 .399 .242 ,246 .I92
WINS W n o noise .155 .034 ,046 ,081 ,099
W.T.A. 0 0 0 W S N 4 .178 .125 .139 ,141 .126
984 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 7, NO. 4, JULY 1996

ACKNOWLEDGMENT [24] V. Cherkassky and H. Lari-Najafi, “Constrained topological mapping for


nonparametric regression analysis,” Neural Networks, vol. 4, pp. 27-40,
The authors wish to thank J. H. Friedman of Stanford 1991.
University for providing PP and MARS code and A. W. [25] T . Kohonen, SeEf-Organization and Associative Memory. Berlin:
Springer-Verlag, 1989.
Moore Of Camegie University for providing GMBL [26] F. Mulier and V. Cherkassky, “Self-organization as an iterative kernel
code used in this project. Several stimulating discussions with smoothing process,” Neural Computa., vol. 7, no. 6, pp. 1165-1177,
J. H. Friedman on the comparisons between methods are also 1995.
[27] V. Cherkassky, Y. Lee, and H. Lari-Najafi, “Self-organizing network
gratefully acknowledged. for regression: Efficient implementation and comparative evaluation,”
in Proc. IJCNN, vol. 1, 1991, pp. 79-84.
REFERENCES [28] H. Ritter, T. Martinetz, and K. Schulten, Neural Computation and Self-
Organizing Maps: An Introduction. Reading, MA: Addison-Wesley,
B. D. Ripley, “Pattern Recognition and Neural Networks,” in Statist. 1992.
Images. Cambridge, U.K.: Cambridge University Press, 1996. [29] F Mulier and V. Cherkassky, “Statistical analysis of self-organization,”
V. Cherkassky, J. H. Friedman, and H. Wechsler, Eds., From Statisrics Neural Networks, vol. 8 , no. 5 , pp. 717-127, 1995.
to Neural Networks: Theory and Pattern Recognition Applications, vol. [30] V. Cherkassky and F. Mulier, “Application of self-organizing maps to
136. Berlin: Springer-Verlag, NATO AS1 Series F, 1994. regression problems,” J. Amer. Statist. Assoc., submitted for publication,
J. H. Friedman, “An overview of predictive learning and function ap- 1994.
proximation,” in From Statistics to Neural Networks: Theory and Pattem
Recognition Applicaiions, vol. 136, V. Cheikaaaky, J. H. Friedman, and
H. Wechsler, Eds. Berlin: Springer-Verlag. NATO AS1 Series F. 1994.
J. H. Friedman and W. Stnetzle, “Projection pursuit regression,“ J. Amer.
Statist. Assoc., vol. 76, pp. 817-823, 1981. Vladimir Cherkassky (S’83-M’85-SM’92) re-
L. Breiman, “The PI method for estimating multivariate functions from ceived the M.S. degree in systems engineering from
noisy data,” Technometrics, vol. 3, no. 2 , pp. 125-160, 1991. the Moscow Aviation Institute, Russia, in 1976, and
D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Eds.. the Ph.D. degree in electrical engineering from the
Parallel Distributed Processing. Cambridge, MA: MIT Press. 1986. University of Texas at Austin in 1985.
A. S. Weigend and N.A. Gershenfeld, Eds., Time Series Prediction: He is Associate Professor of Electrical Engineer-
Forecasting the Future and Understanding the Past. Reading, MA: ing at the University of Minnesota, Twin Cities
Addison-Wesley, 1993. campus. His current research is on theory and
K. Ng and R. P. Lippmann, “A comparative study of the practical applications of neural networks to data analysis,
characteristics of neural network and conventional pattern classifiers,” knowledge extraction, and process modeling.
MIT Lincoln Lab., Tech. Rep. 894, 1991. Dr. Cherkasskv was activelv involved in the
J. H. Friedman, “Multivariate adaptive regression splines (with discus- organization of several conferences on artificial neural networks. He was
sion),” Ann. Statist., vol. 19, pp. 1-141, 1991. Director of the NATO Advanced Study Institute (ASI) From Statistics to
M. Maechler, D. Martin, J. Schimert, M. Csoppensky, and J. N. Hwang,
Neural Networks: Theory and Pattern Recognition Applications held in
“Projection pursuit learning networks for regression,” in Proc. Znr. Cont
France in 1993. He is a member of the program committee and session chair
Tools AI, 1990, pp. 350-358.
J. Moody, “Prediction risk and architecture selection for neural net- at the World Congress on Neural Networks (WCNN) in 1995 and 1996.
works,” in From Statistics to Neural Networks: Theory and Pattern
Recognition Applications, vol. 136, V. Cherkassky, J. H. Friedman, and
H. Wechsler, Eds. Berlin: Springer-Verlag, NATO AS1 Series F, 1994.
[I21 V. N. Vapnik, The Nature of Statistical Leaming Theor).. Berlin:
Springer-Verlag, 1995. Don Gehring received the B.S. degree summa cum
[13] J. H. Friedman, “SMART user’s guide,” Dep. Statistics, Stanford Univ.. laude in computer science at the University of
Tech. Rep. 1, 1984. Minnesota, Twin Cities campus, in 1996.
[ 141 T. Hastie and R. Tibshirani, Generalized Additive Models. New York: Previously, while at the University of California
Chapman and Hall, 1990. at Berkeley, he founded a company providing com-
[ 151 C. G. Atkeson, “Memory-based approaches to approximating continuous puter system design services for scientific research.
functions,” in Proc. Wkshp. Nonlinear Modeling Forecasting, Santa Fe, The study reported in this paper was presented as
NM, 1990. a thesis project for B.S. degree at the University of
[I61 A. W. Moore, “Fast robust adaptive control by learning only feedforward Minnesota.
models,” in Advances in NIPS-4J, E. Moody et al., Eds. San Mateo,
CA: Morgan Kaufmann, 1992.
[17] W. S. Cleveland and S. J. Delvin, “Locally weighted regression: An
approach to regression analysis by local fitting,” J. Amer. Statist. Assoc.,
vol. 83, no. 403, pp. 596-610, 1988.
[18] P. Diaconis and M. Shahshahani, “On nonlinear functions of linear
combinations,” SIAM J. Sci. Statist. Comput., vol. 5 , pp. 175-191, 1984.
[19] D. L. Donoho and I. M. Johnstone, “Projection-based approximation Filip Mulier received the B.S. degree, the M.S.
and a duality with kernel methods,” Ann. Statist., vol. 17, no. 1, pp. degree, and the Ph.D. degree in 1989, 1992, and
58-106, 1989. 1994, respectively, all in electrical engineering at the
[20] J. Hwang, S. Lay, M. Maechler, and R. D. Martin, “Regression modeling University of Minnesota, Twin Cities campus. His
in backpropagation and projection pursuit learning,” IEEE Trans. Neural dissertation work was on the analysis and charac-
Networks, vol. 5 , pp. 342-353, 1994. terization of self-organizing maps from a statistical
[21] T. Masters, Practical Neural Network Recipes in C++. New York: i viewpoint.
Academic, 1993. In 1993, he acted as the Administrative Assistant
[22] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classijcation for the NATO Advanced Study Institute on Statistics
and Regression Trees. Belmont, CA: Wadsworth, 1984. and Neural Networks. Presently, he is working at a
[23] P. Craven and G. Wahba, “Smoothing noisy data with spline functions,” large multinational corporation based in St. Paul,
Numer. Math., vol. 31, pp. 377403, 1979. Minnesota, in the areas of neural networks and statistical data analysis.

You might also like