Model Complexity Control for Regression Using VC Generalization Bounds
… (unknown) joint probability density function (pdf). The unknown function in (2) is the mean of the output conditional probability (aka regression function)

    f(x) = \int y \, p(y \mid x) \, dy                                   (3)

A learning method (or estimation procedure) selects the "best" model from a set of (parameterized) approximating functions (or possible models) f(x, ω), ω ∈ Ω, specified a priori, where the quality of an approximation is measured by the loss or discrepancy measure L(y, f(x, ω)). A common loss function for regression is the squared error

    L(y, f(x, \omega)) = \bigl(y - f(x, \omega)\bigr)^2                  (4)

The set of functions f(x, ω), ω ∈ Ω, supported by a learning method may or may not contain the regression function (3). Thus learning is the problem of finding the function f(x, ω₀) (regressor) that minimizes the prediction risk functional

    R(\omega) = \int \bigl(y - f(x, \omega)\bigr)^2 \, p(x, y) \, dx \, dy          (5)

using only the training data. This risk functional measures the accuracy of the learning method's predictions of the unknown target function.

Since the probability measure p(x, y) in (1) is not known, minimization of the prediction risk is a difficult (ill-posed) problem. Modern adaptive learning methods (in neural networks and statistics) use a wide (very flexible) set of approximating functions ordered according to some measure of complexity (or flexibility to fit the training data). Then the problem is to choose the model of optimal complexity for a given (finite) sample.

Many learning algorithms are based on the inductive principle known as "empirical risk minimization," which amounts to choosing the model (from a set of approximating functions) that minimizes the empirical risk, or the average loss on the training data

    R_{\mathrm{emp}}(\omega) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x_i, \omega)\bigr)^2          (6)
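As a concrete illustration (not from the paper), here is a minimal Python/NumPy sketch of the empirical risk (6) for a model that is linear in parameters (a polynomial fitted by least squares). The sample, the degree, and all names are hypothetical.

```python
import numpy as np

def empirical_risk(y, y_hat):
    """Average squared loss (6) on the training data."""
    return np.mean((y - y_hat) ** 2)

# Hypothetical training sample (x_i, y_i), i = 1..n.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
y = np.sin(2.0 * x) ** 2 + 0.1 * rng.standard_normal(30)

# Least-squares fit of a degree-5 polynomial and its empirical risk.
coeffs = np.polyfit(x, y, deg=5)
print(empirical_risk(y, np.polyval(coeffs, x)))
```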
However, the goal of learning is to obtain a model providing minimal prediction risk, i.e., minimal error for future data. It is well known that for a given training sample size there exists a model of optimal complexity corresponding to the smallest prediction (generalization) error for future data. Hence, any reasonable method for learning from finite samples needs to have some provisions for complexity control. Implementations of complexity control include [2], [4], [5]: penalization (regularization), weight decay (in neural networks), parameter (weight) initialization (in neural-network training), and various greedy procedures (aka constructive, growing, or pruning methods).

There are three distinct problems common to all methodologies for complexity control.

First, one needs to define a meaningful complexity index for a set of (parameterized) functions. The usual index is the number of (free) parameters; it works well for linear parameterization but is not appropriate for approximating functions nonlinear in parameters [2].

Second, one needs to estimate the unknown prediction risk from the known empirical risk, in order to choose the model corresponding to the smallest (estimated) prediction risk. These estimates are known as model selection criteria in statistics. Analytic model selection criteria have been developed in statistics using asymptotic (large-sample) theory. In practical (finite-sample) applications, empirical data-driven model selection using resampling is the usual choice. However, both analytic and resampling methods suffer from large variability with finite data.

Third, there is a problem of finding a (global) minimum of the empirical risk. Strictly speaking, this is possible only for a set of functions linear in parameters. With nonlinear estimators (such as multilayer perceptrons) an optimization algorithm can find, at best, only a local minimum.

Vapnik–Chervonenkis (VC) theory [1], [2] provides a principled solution to the first two problems. It defines a new measure of complexity (called the VC-dimension) which coincides with the classical definition (the number of parameters) for linear parameterization. VC-theory also provides analytical generalization bounds that can be used for estimating prediction risk. However, VC-theory cannot be rigorously applied to nonlinear estimators (such as neural networks), where the VC-dimension cannot be accurately estimated and the empirical risk cannot be reliably minimized [5]. A new method known as Support Vector Machines (SVM's) [2], developed at AT&T Bell Labs, provides a practical solution to the third problem. The SVM formulation ensures global minimization of the empirical risk using constrained linear functions in a very high-dimensional intermediate feature space.

There is a growing awareness that VC-theory provides a satisfactory theoretical and conceptual framework for learning with finite samples. This paper demonstrates the practical applicability of VC-bounds for regression for complexity control in the case of linear estimators. Note that in the general case of nonlinear estimators the issue of model complexity control becomes very difficult, as it involves solving all three problems outlined above. However, for linear estimators there is only one problem, i.e., estimation of the prediction risk. Hence, for linear estimators the comparison of different model selection methodologies becomes tractable.

This paper is organized as follows. Section II describes classical model selection criteria developed in statistics, and the VC-based approach to complexity control. Section III describes empirical comparisons for linear and penalized linear estimators. Section IV describes a situation where analytic model selection approaches may fail. Summary and discussion are given in Section V.

II. COMPLEXITY CONTROL AND ESTIMATION OF PREDICTION RISK

This section reviews (representative) classical statistical methods for model selection, and contrasts them with the method based on VC-theory. Classical methods for model selection are based on asymptotic results for linear models.
Nonasymptotic (guaranteed) bounds on the prediction risk based on VC-theory have been proposed in [1].

A. Classical Model Selection Criteria

There are two general approaches for estimating prediction risk for regression problems with finite data. One is based on data resampling. The other approach is to use analytic estimates of the prediction risk as a function of the empirical risk (training error) penalized (adjusted) by some measure of model complexity. Once an accurate estimate of the prediction risk is found, it can be used for model selection by choosing the model complexity which minimizes the estimated prediction risk. In the statistical literature, various prediction risk estimates have been proposed for model selection (in the linear case). In general, these estimates all take the form

    \hat{R} = r\!\left(\frac{d}{n}\right) \cdot \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2          (7)

where r(·) is a monotonically increasing function of the ratio p = d/n of model complexity d (degrees of freedom) to the training sample size n [6]. The function r(·) is often called a penalization factor because it inflates the average residual sum of squares for increasingly complex models. The following forms of r(p, n) have been proposed in the statistical literature:

    final prediction error (FPE):         r(p, n) = \frac{1 + p}{1 - p}
    Schwartz' criterion (SC):             r(p, n) = 1 + \frac{\ln n}{2} \cdot \frac{p}{1 - p}
    generalized cross-validation (GCV):   r(p, n) = \frac{1}{(1 - p)^2}
    Shibata's model selector (SMS):       r(p, n) = 1 + 2p
All these classical approaches are motivated by asymptotic arguments for linear models and therefore apply well for large training sets. In fact, for large n, prediction estimates provided by FPE, GCV, and SMS are asymptotically equivalent. Moreover, the model selection criteria above are all based on a parametric philosophy. That is, the goal of model selection is to select the terms of the approximating function in order to match the target function (under the assumption that the target function is contained in a set of linear approximating functions). The success of the model selection criteria is measured according to this philosophy [7], [10], [11], [12]. This classical approach can be contrasted with the VC-theory approach, where the goal of model selection is to choose the approximating function with the lowest prediction risk (irrespective of the number of terms chosen).

Classical model selection criteria were designed with specific applications in mind. For example, FPE was originally designed for model identification for autoregressive time series, and GCV was developed as an estimate for cross-validation (itself an estimate of prediction risk) in spline smoothing. It is only SC and SMS which were developed for the generic problem of regression. Note that application of the minimum description length (MDL) arguments [13] yields a penalization factor identical to the Schwartz criterion, though the latter was derived using a Bayesian formulation. A recent model selection criterion described in [14] has the goal of minimizing prediction risk for regression.

Typically, the classical criteria are constructed by first defining the prediction risk in terms of the linear approximating function. Then, asymptotic arguments are used to develop limit distributions for various components of the prediction risk, leading to an asymptotic form of the prediction risk. Finally, the data, along with estimates of the noise variance, are used to estimate the expected value of the asymptotic prediction risk. For example, the FPE criterion depends on assuming a Gaussian distribution to develop the asymptotic prediction risk. In addition, FPE depends on an estimate of the noise variance given by

    \hat{\sigma}^2 = \frac{1}{n - d} \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2

There are several common assumptions underlying all these model selection criteria:
1) The target function is linear.
2) The set of linear functions of the learning machine contains the target function. That is, the learning machine provides an unbiased estimate.
3) The noise is independent and identically distributed.
4) The empirical risk is minimized.
Additional assumptions reflecting the noise distribution and limit distributions are also applied in the development of each selection criterion.

Another popular alternative (to analytic methods) is to choose model complexity using resampling. In this paper we consider leave-one-out cross-validation (CV). Under this approach, the prediction risk is estimated via cross-validation, and the model providing the lowest estimated risk is chosen. It can be shown [4] that leave-one-out CV is asymptotically (for large n) equivalent to analytic model selection criteria (such as FPE, GCV, and SMS). Unfortunately, the computational cost of CV grows linearly with the number of samples, and often becomes prohibitively large for practical applications. Additional complications arise in the context of using resampling with nonlinear estimators (such as neural networks), due to the existence of multiple local minima and the dependence of the final solution (obtained by an optimization algorithm) on the initial conditions (weight initialization)—see [4]. Nevertheless, resampling remains the preferred approach for model selection in many learning methods. In this paper, we use CV as a benchmark method for comparing various analytic methods.

It can be shown [12] that the above estimates of the prediction risk (excluding SC) are not consistent in the following sense: the probability of selecting the model with the same number of terms as the target function does not converge to one as the number of observations is increased (with a fixed maximum number of basis functions). In addition, resampling methods for model selection may suffer from the same lack of consistency.
For example, leave-one-out CV, which has asymptotic equivalence to (7), is not consistent [12]. Note that this definition of consistency is developed from a parametric viewpoint (choosing the correct number of terms). It is not equivalent to consistency in estimating the prediction risk.

In most cases, these (classical) model selection approaches are applied in practical situations when the underlying assumptions do not hold, i.e., they are applied when the model may be biased or the number of samples is finite. In this paper we are interested in such realistic situations. Also, several generalizations of analytic estimates for prediction error suitable for nonlinear models have been recently proposed by [15] and [16]. These estimates generalize the notion of degrees of freedom to nonlinear models; however, they are still based on asymptotic assumptions.

Finally, we note that model selection criteria are used in this paper for complexity control, i.e., choosing an optimal number of terms (in a linear model). This is not equivalent to the goal of accurate estimation of prediction risk. In fact, prediction risk estimates are significantly affected by the variability of (finite) training samples. Hence, a model selection criterion can provide a poor estimate of prediction risk, yet the differences between its risk estimates (for models of different complexity) may yield accurate model selection. Likewise, a model selection criterion may provide an accurate estimate of prediction risk for a poorly chosen model complexity.

B. VC-Based Complexity Control

VC-theory provides a very general and powerful framework for complexity control called structural risk minimization (SRM). Under SRM, a set of possible models (approximating functions) is ordered according to their complexity (or flexibility to fit the data). Specifically, under SRM the set of approximating functions has a structure, that is, it consists of the nested subsets (or elements) S_k such that

    S_1 \subset S_2 \subset \cdots \subset S_k \subset \cdots

where each element S_k of the structure has finite VC-dimension h_k. By design, a structure provides ordering of its elements according to their complexity (i.e., VC-dimension):

    h_1 \le h_2 \le \cdots \le h_k \le \cdots

According to SRM, solving a learning problem with finite data requires a priori specification of a structure on a set of approximating functions. Then, for a given data set, optimal model estimation involves two tasks:
1) selecting an element (subset) of a structure having optimal complexity;
2) estimating the model from this subset. The model parameters are found via minimization of the empirical risk (i.e., training error).
Note that Step 1) corresponds to model selection, whereas Step 2) corresponds to parameter estimation in statistical methods. However, SRM provides analytic upper bounds on the prediction risk which can be used for model selection [1], [2], [17]. For regression problems with squared loss, the following bound on the prediction risk holds with probability 1 − η:

    R(\omega) \le \frac{R_{\mathrm{emp}}(\omega)}{\left(1 - a\,c\,\sqrt{\dfrac{h\bigl(\ln\frac{n}{h} + 1\bigr) - \ln\eta}{n}}\right)_{+}}          (8)

where h is the VC-dimension of the set of approximating functions, a is a constant which reflects the "tails of the loss function distribution," i.e., the probability of observing large values of the loss, and c is a theoretical constant. The above upper bound holds with probability 1 − η (the confidence level of an estimate). Note that the VC-bound (8) for regression has a general form similar to (7).

For practical use of the bound (8) for model selection, one needs to set the values of the constants and the confidence level 1 − η. Reference [17] shows that one can use this bound with constants close to one, so we choose a = c = 1. We set η based on the following (informal) arguments. Consider the case h = n: the bound should then yield an uncertainty of the type 0/0, which happens when the term (ln η)/n in (8) vanishes. From a practical viewpoint, the confidence level of the bound (8) should also depend on the sample size n, i.e., for larger sample sizes we should expect a higher confidence level; so we set η = 1/\sqrt{n}, which gives −ln η = (ln n)/2.

Further, we need to estimate the VC-dimension of a set of approximating functions. For linear methods the VC-dimension can be estimated as the number of free parameters (or degrees of freedom) [1], [2]. For example, for polynomial estimators of degree m the VC-dimension is h = m + 1. Making all these substitutions into (8) gives the following penalization factor, which we call Vapnik's measure (VM):

    r(p, n) = \left(1 - \sqrt{p - p \ln p + \frac{\ln n}{2n}}\right)_{+}^{-1}          (9)

where p = h/n. Penalization factor (9) is used for the model selection comparisons reported in Sections III and IV.

The common constructive implementation of SRM can be described as follows. For a given set of training data, the SRM principle selects the function minimizing the empirical risk over the functions from each element S_k. Then, for each element of the structure, the guaranteed risk is found using the bound provided by the right-hand side of (8). Finally, an optimal structure element providing minimal guaranteed risk is chosen.
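A sketch of this constructive procedure for a structure indexed by polynomial degree, using Vapnik's measure (9) as the penalization factor with h = m + 1 for a degree-m polynomial; the basis choice and all names are illustrative, not taken from the paper.

```python
import numpy as np

def vapnik_measure(h, n):
    """Vapnik's measure (9): 1 / (1 - sqrt(p - p*ln p + ln(n)/(2n)))_+, p = h/n.
    Returns infinity when the square-root term reaches one (vacuous bound)."""
    p = h / n
    root = np.sqrt(p - p * np.log(p) + np.log(n) / (2.0 * n))
    return np.inf if root >= 1.0 else 1.0 / (1.0 - root)

def srm_select(x, y, max_degree=25):
    """For each element of the structure (degree m), minimize the empirical risk,
    compute the guaranteed risk r(p, n) * R_emp, and keep the smallest one."""
    n = len(x)
    best_m, best_bound = None, np.inf
    for m in range(max_degree + 1):
        coeffs = np.polyfit(x, y, deg=m)
        r_emp = np.mean((y - np.polyval(coeffs, x)) ** 2)
        bound = vapnik_measure(m + 1, n) * r_emp   # h = m + 1 free parameters
        if bound < best_bound:
            best_m, best_bound = m, bound
    return best_m, best_bound
```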
Application of SRM in practice depends on a chosen structure. An example of a generic structure (commonly used in neural networks and statistical methods) is a dictionary representation [3], where the set of approximating functions is

    f_m(x, \mathbf{w}, \mathbf{v}) = \sum_{i=1}^{m} w_i \, g(x, v_i)          (10)

where g(x, v_i) is a set of basis functions with (possibly adjustable) parameters v_i, and w_i are linear coefficients.
Both w_i and v_i are estimated to fit the training data. Representation (10) defines a structure, since the sets of functions with m = 1, 2, … terms are nested:

    S_1 \subset S_2 \subset \cdots \subset S_m \subset \cdots

Hence, the number of terms m in expansion (10) specifies an element of a structure.

Further, we distinguish between adaptive methods, where the basis functions are nonlinear in parameters v_i, and nonadaptive (or linear) methods, where the basis functions are prespecified (or fixed a priori). Examples of adaptive methods include multilayer perceptron networks and recent statistical methods, such as classification and regression trees (CART) and projection pursuit regression [3].

In this paper, we consider complexity control only for linear methods (linear estimators), where the model is estimated as a linear combination of prespecified (fixed) basis functions

    f_m(x, \mathbf{w}) = \sum_{i=1}^{m} w_i \, g_i(x)          (11)

These methods differ mainly in the type of chosen basis functions and the procedure for choosing the optimal number of terms (model selection).

Penalization also represents a form of SRM [2]. Consider a set of functions f(x, \mathbf{w}), where \mathbf{w} is a vector of parameters having some fixed length. For example, the parameters can be the weights of a neural network (of fixed topology). Let us introduce the following structure on this set of functions:

    S_k = \{ f(x, \mathbf{w}) : \|\mathbf{w}\|^2 \le c_k \}, \qquad c_1 < c_2 < \cdots          (12)

Minimization of the empirical risk on each element S_k of a structure is a constrained optimization problem, which is achieved by minimizing the "penalized" risk functional

    R_{\mathrm{pen}}(\mathbf{w}, \lambda_k) = R_{\mathrm{emp}}(\mathbf{w}) + \lambda_k \|\mathbf{w}\|^2          (13)

with an appropriately chosen Lagrange multiplier λ_k such that the constraint in (12) is satisfied. Functional (13) represents the penalization formulation, where an optimal choice of the regularization parameter λ is conceptually similar to selecting the optimal number of terms in a dictionary method (10), (11). The particular structure (12), (13) is equivalent to a ridge penalty (used in statistical methods) or weight decay (used in neural networks). The complexity of the "penalized" risk functional (13), or an equivalent structure (12), can be estimated analytically if the approximating functions are linear (in parameters).

In this paper, we use penalized linear estimators for univariate regression, i.e., polynomial estimators of fixed degree 25 with a ridge penalty on the norm of the coefficients (free parameters). Model selection with penalized linear estimators uses the same expressions for the penalization factor as above; however, the "effective" degrees of freedom is used in place of the number of free parameters. Estimating the complexity (i.e., effective degrees of freedom) of a penalized linear estimator is usually based on the eigenvalues of its equivalent kernel representation. The approximations provided by a linear estimator for the training data can be written as

    \hat{\mathbf{y}} = S \, \mathbf{y}          (14)

where the vector \mathbf{y} = (y_1, \ldots, y_n)^\top contains the response samples, the matrix X contains the predictor samples, and S is an n × n matrix that transforms the response values into estimates for each sample. The matrix S is often called the "hat" matrix, since it transforms responses into estimates. Consider the ridge regression risk functional

    R_\lambda(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x_i, \mathbf{w})\bigr)^2 + \lambda \|\mathbf{w}\|^2          (15)

For a given penalty strength λ, the solution which minimizes (15) is a linear estimator with the "hat" matrix

    S_\lambda = X \bigl(X^\top X + n\lambda I\bigr)^{-1} X^\top          (16)

—see [18], [19] for details. In particular, the effective degrees of freedom is estimated via the equivalent smoother matrix of a penalized estimator

    d_{\mathrm{eff}}(\lambda) = \operatorname{trace}(S_\lambda)          (17)

or equivalently [19] as

    d_{\mathrm{eff}}(\lambda) = \sum_{j} \frac{\lambda_j}{\lambda_j + n\lambda}          (18)

where λ is the regularization parameter and λ_j are the eigenvalues of the Hessian matrix of the linear nonpenalized estimate. Expressions (17), (18) are usually described as the effective degrees of freedom (of a penalized estimator). In this study, we also use these expressions to estimate the VC-dimension, in order to apply the VC-bound (8), (9) for complexity control. However, it is not clear how accurately these expressions estimate the VC-dimension in the penalized case. For this reason, application of SRM for complexity control of penalized linear estimators is somewhat heuristic.
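A sketch of the smoother ("hat") matrix (16) and the two equivalent expressions (17), (18) for the effective degrees of freedom, for a degree-25 polynomial design matrix. The ridge normalization follows the reconstruction of (15)–(16) above, and all names are illustrative.

```python
import numpy as np

def hat_matrix(X, lam, n):
    """Ridge 'hat' matrix (16): S_lam = X (X^T X + n*lam*I)^{-1} X^T."""
    d = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T)

def effective_dof(X, lam, n):
    """Effective degrees of freedom via the trace (17) and via the
    eigenvalues of X^T X (18); the two values should agree up to rounding."""
    dof_trace = np.trace(hat_matrix(X, lam, n))
    eigvals = np.linalg.eigvalsh(X.T @ X)
    dof_eig = np.sum(eigvals / (eigvals + n * lam))
    return dof_trace, dof_eig

# Hypothetical univariate sample and a degree-25 polynomial design matrix.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=30)
X = np.vander(x, N=26, increasing=True)      # columns 1, x, ..., x^25
print(effective_dof(X, lam=1e-3, n=len(x)))
```

The resulting d_eff is then substituted for the number of free parameters d in the penalization factors of Section II-A or in Vapnik's measure (9).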
III. EMPIRICAL COMPARISONS OF METHODS FOR MODEL SELECTION

This section describes an empirical comparison of classical methods for model selection with the VC-based approach for linear/penalized linear estimators. First, we describe the comparison methodology, then the experimental setup (including the specification of the data sets and linear estimators used), and finally the comparison results.

Comparison Methodology: In order to compare various model selection criteria, we need to specify the type of basis functions of a linear estimator. Then comparisons between various model selection approaches are performed using the same type of (linear) approximating functions. Model parameters are estimated by linear least-squares fitting of the parameters to the training data. Model selection amounts to choosing the optimal complexity of a linear estimator, based on the available training data. The (optimal) model complexity provides the lowest prediction risk (mean-squared error) estimated from the empirical risk using a model selection criterion.
Two types of basis functions are used in the expansion (11).

Algebraic polynomials

    f_m(x, \mathbf{w}) = \sum_{i=0}^{m} w_i \, x^i          (19)

Trigonometric polynomials

    f_m(x, \mathbf{w}) = w_0 + \sum_{i=1}^{m} w_i \cos(i\,\omega_0 x)          (20)

We consider all polynomials of degree 0–25 as possible candidates for model selection.

We also used penalized linear estimators formed by using the polynomial of fixed (large) degree 25, with an additional constraint on the norm of its coefficients, leading to the penalization formulation

    R_{\mathrm{pen}}(\mathbf{w}, \lambda) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f_{25}(x_i, \mathbf{w})\bigr)^2 + \lambda \|\mathbf{w}\|^2          (21)

where the choice of the regularization parameter λ controls model complexity. In the penalization formulation (21), both algebraic (19) and trigonometric polynomials (20) of degree 25 were used.
Training data: This study used simulated training data (x_i, y_i), i = 1, …, n. The random x-values are uniformly distributed in the [0,1] interval, and the y-values are generated according to y = t(x) + ξ, where ξ is additive (Gaussian) noise and t(x) is the target function (being estimated). The following two target functions were used (see Fig. 2):
Sine-squared function t(x) = sin²(2x);
Discontinuous piecewise polynomial function, shown in Fig. 2(a).

Fig. 2. Target functions used in the comparison: (a) Piecewise polynomial. (b) Sine-squared function sin²(2x).

Three different sample sizes were used: small (30 samples), medium (100), and large (1000). Each training sample was generated with different levels of additive Gaussian noise. The noise is defined in terms of the signal-to-noise ratio (SNR) as the ratio of the standard deviation of the true (target function) output values for the given input samples to the standard deviation of the Gaussian noise. Seven different SNR levels were used.

Whereas it is not feasible to compare empirically all possible combinations of different target functions, sample sizes, etc., we tried to choose a reasonably broad cross-section of comparison data sets. For example, we use small and large sample sizes to observe the difference between the finite-sample and asymptotic settings for model selection. Similarly, we use a large range of SNR values in order to compare model selection methods for noisy and noiseless training data. The chosen target functions exemplify smooth (easy to estimate) and discontinuous (hard to estimate) mappings, because smooth functions can be approximated by low-order models whereas discontinuous functions require high-order models (e.g., polynomials of high degree). Also, the chosen target functions generally do not match the basis functions well, except for one set of experiments in which the sine-squared target function is estimated via trigonometric polynomials. Overall, the data sets used for comparisons reflect a wide range of practical conditions, in order to make the comparison of model selection methods more meaningful.
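A sketch of this simulated-data setup: x uniform on [0, 1], y = t(x) + noise, with the Gaussian noise standard deviation set from the SNR (signal standard deviation divided by noise standard deviation). The sine-squared target follows the Fig. 2 caption; the piecewise-polynomial target below is only an illustrative stand-in, since its exact definition is not reproduced in this text.

```python
import numpy as np

def sine_squared(x):
    """Smooth target from Fig. 2(b): t(x) = sin^2(2x)."""
    return np.sin(2.0 * x) ** 2

def piecewise_poly(x):
    """Illustrative discontinuous piecewise-polynomial target (stand-in;
    the paper's exact definition is not given in this excerpt)."""
    return np.where(x < 0.5, 4.0 * x ** 2, 1.0 - x)

def make_sample(target, n, snr, rng):
    """Generate (x_i, y_i): x uniform on [0,1], y = t(x) + Gaussian noise,
    with noise std = std(t(x)) / SNR."""
    x = rng.uniform(0.0, 1.0, size=n)
    t = target(x)
    sigma = np.std(t) / snr
    return x, t + sigma * rng.standard_normal(n)

rng = np.random.default_rng(7)
x, y = make_sample(sine_squared, n=30, snr=2.5, rng=rng)
```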
Comparison indexes: Various classical model selection methods and the VC-based method (VM) are compared for a given type of linear estimator and a particular choice of training data. The results are presented in the form of box plots representing the empirical distribution (of the comparison index) based on 300 repetitions of the fitting/model selection using different random samples with the same statistical characteristics. Specifically, we use the following performance indexes.
1) RISK (MSE) is defined as the distance between the target function and the regression estimate chosen by a given model selection method.
2) Degrees of Freedom (DOF) is the number of free parameters (model complexity) chosen by a given model selection criterion.
3) Risk Estimation Accuracy measures the accuracy of the estimates of prediction risk provided by each model selection approach. For a regression estimate found by a given model selection approach we calculate (a) the estimated risk via expression (7) and (b) the risk as the distance between the regression estimate and the target function. Then the risk estimation accuracy is defined as the (absolute value of the) difference between the estimated risk (a) and the risk (b). Notice that the risk estimation accuracy is lower bounded by the value of the noise variance, i.e., even for a perfectly accurate estimate we still observe an error due to additive noise.
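A sketch of how these three indexes might be computed for a single repetition, for a polynomial model of the selected degree; `risk_estimate` would come from (7) with the method's penalization factor, and the distance to the target is approximated on a dense grid. All names are illustrative.

```python
import numpy as np

def comparison_indexes(target, x, y, degree, risk_estimate):
    """Indexes for one fitted model of the chosen complexity:
    RISK (MSE to the target), DOF, and risk-estimation accuracy."""
    coeffs = np.polyfit(x, y, deg=degree)
    grid = np.linspace(0.0, 1.0, 1000)
    risk = np.mean((target(grid) - np.polyval(coeffs, grid)) ** 2)  # index 1
    dof = degree + 1                                                # index 2
    accuracy = abs(risk_estimate - risk)                            # index 3
    return risk, dof, accuracy
```

Repeating this computation over 300 independently generated samples gives the empirical distributions summarized by the box plots.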
Performance criteria: Note that the first index, that is, risk (MSE), is of primary importance for comparisons, because the quality of predictive learning (estimation) from finite samples has been defined in terms of the risk functional (5). The relative prediction performance of various model selection criteria can be judged from the box plots of risk (MSE) of each method. Box plots showing lower values of risk correspond to better model selection approaches. In particular, better model selection approaches select models providing the lowest guaranteed prediction risk (i.e., with the lowest RISK at the 95% mark), and also the smallest variation of the risk (i.e., narrow box plots). As can be seen from the results reported below, the methods providing the lowest guaranteed prediction risk do not necessarily provide the lowest average risk (i.e., the lowest risk at the 50% mark). The other two performance indexes provide additional insights into the properties of model selection approaches. The DOF index shows the model complexity (degrees of freedom) chosen by a given method. The DOF box plot, in combination with the risk box plot, provides insights about the overfitting (or underfitting) of a given method, relative to the optimally chosen DOF. Likewise, the risk estimation accuracy should be examined in conjunction with the risk and DOF box plots. The risk estimation accuracy measures the quality of risk estimates for the model of chosen complexity. Note that a poorly chosen model (i.e., one that overfits the training data) may yield a very accurate risk estimate, whereas a model of optimal complexity chosen by a different method (for the same data) would yield larger values of risk estimation accuracy. For this reason, direct comparison of methods in terms of the risk estimation accuracy is rather meaningless. Instead, a better interpretation may be based on the width of the box plots of the risk estimation accuracy. The width of these box plots reflects a method's sensitivity to random sample variations, and it can be used as a measure of the variability of the risk estimates. Specifically, narrow box plots indicate that a method is insensitive (robust) to random variations of the data.

Comparison results: All possible combinations of three sample size values, seven SNR values, two target functions, and four types of linear estimators yield 168 different comparison experiments. Due to space limitations, only a representative subset of the comparison results is discussed below. The results are presented separately for different sample sizes, because for small samples the estimation accuracy varies dramatically between model selection criteria, whereas for very large samples the difference becomes negligibly small. Likewise, we describe separately the comparison results for linear estimators and penalized linear estimators, which represent two different types of estimators (or two types of SRM structures). Comparisons are shown mainly for the classical analytic methods (FPE, SC, GCV, SMS) and the VC-based model selection (VM). In addition, we show representative results (for 30 and 100 samples) for model selection using leave-one-out CV. Since CV does not provide any improvement over the analytic VC-based model selection (VM), we do not show CV-based model selection for most comparison plots.

Comparisons for linear estimators: Figs. 3–7 show comparisons for linear estimators. Figure captions specify the data set used, i.e., the target function, sample size, and noise level (SNR). Part (a) of each figure shows comparison results for regression models estimated using polynomial basis functions, whereas part (b) shows results using trigonometric basis functions. Depending on the target function, two slightly different cosine basis functions were used, in order to create a good match between a target function and the (harmonic) approximating functions. Namely, for estimating the sine-squared target function we used one set of cosine basis functions, whereas for estimating the piecewise polynomial target function we used another in the expansion (11). In contrast, the polynomial basis functions illustrate a (more realistic) situation when the approximating functions do not match the target function well.

Small sample size: Comparison results for small sample size are shown in Figs. 3 and 4, for the sine-squared and piecewise-polynomial target functions, respectively. It can be clearly seen that Vapnik's measure (VM) significantly outperforms the other model selection criteria, i.e., provides lower values of Risk at 75 and 95%. It achieves superior prediction accuracy by conservative selection of the model complexity, i.e., selecting lower DOF than other methods. Likewise, the results in Figs. 3 and 4 show that the VC-bound (8) provides better risk estimation accuracy than the other methods, and the lowest variability of the risk estimates. Notice that the use of leave-one-out CV, shown in Fig. 3, does not yield any improvement over analytic VM model selection. As can be expected, cross-validation outperforms classical analytic model selection (with small samples).

Medium sample size: Comparison results for medium sample size are shown in Figs. 5 and 6, for the sine-squared and piecewise-polynomial target functions, respectively. Here the VC-based model selection also tends to outperform the other methods (at the 75 and 95 percentiles); however, the difference between methods is smaller than in the small-sample case. Results in Fig. 6(a) show that the VC-based approach yields the lowest prediction risk at 95% but is slightly inferior to GCV and SC at 75%. The VM tends to choose the lowest model complexity, in comparison with other methods. Notice that VM model selection also outperforms cross-validation,
as shown in Fig. 5. Likewise, the box plots of the risk estimation accuracy suggest that the VC-method provides the lowest variability of the risk estimates.

Fig. 3. Results for the sine-squared target function with sample size of 30 and SNR = 2.5. (a) Polynomial estimation. (b) Trigonometric estimation.

Large sample size: Fig. 7 shows the results for large sample size (1000) and high noise (SNR = 0.5). As expected, all methods yield very similar prediction accuracy (due to the asymptotic setting), with the VC method having a slight edge (i.e., the lowest guaranteed prediction risk). Notice that VM selects the lowest model complexity, and provides the lowest variability of risk estimates.

Comparisons for penalized linear estimators: Figs. 8 and 9 show comparison results for penalized estimators for the small-sample case. The results are (qualitatively) similar to the nonpenalized case, in that the VC-based model selection yields the most accurate predictive models. It does so by selecting lower (estimated) DOF than the other methods. Also note that VC model selection achieves the lowest variability of the chosen estimated DOF (the width of the box plots) relative to other methods. Likewise, comparison results for different sample sizes, noise levels, etc. (not shown here due to space constraints) show the superiority of the VC-based model selection in the penalized case.

Estimating pure noise: Finally, we present a simple example of modeling pure noise (i.e., Gaussian noise with a standard deviation of one) for 30 samples, using polynomial estimators. The results shown in Fig. 10 clearly illustrate the superiority of VC-based model selection over classical analytic and resampling approaches. In fact, all classical methods (including CV) detect a structure (i.e., select DOF greater than one) when there is no structure in the data. In contrast, VC-based model selection (almost) never detects any structure in the data (i.e., chooses DOF equal to one).
Empirical results for linear/penalized linear estimators suggest that the VC method is superior to classical methods for model selection, in terms of selecting the models with the lowest worst-case prediction risk and the lowest variability of risk estimates. The difference in performance may be quite dramatic for small samples, and gradually decreases for larger samples.
Notice that the performance of classical methods is greatly affected by the random variability of (small) training samples, i.e., methods vary by as much as several orders of magnitude between the top 25% and bottom 25% of the Risk box plots. In contrast, VC-based model selection is very insensitive to random sample variations (i.e., narrow box plots).

Fig. 4. Results for the piecewise polynomial target function with sample size of 30 and SNR = 2.5. (a) Polynomial estimation. (b) Trigonometric estimation.

Based on our empirical evidence, we conclude that for small samples the prediction (estimation) accuracy of a learning method depends mainly on the model selection procedure, rather than on the choice of approximating functions. However, for large samples the prediction accuracy is determined mainly by the suitable choice of approximating functions, as all model selection approaches perform equally well asymptotically (with an infinite number of samples).

The same VC-bound (8) was successfully used for complexity control of the penalized linear estimators, where the effective degrees of freedom (estimated via the smoother matrix) was used in place of the VC-dimension. Note that, strictly speaking, we should use the VC-dimension in the bound (8). Our empirical results indirectly suggest that the effective degrees of freedom estimates the VC-dimension rather accurately. However, it is possible to measure the VC-dimension in the penalized case directly, as suggested in [20]—this is a promising research area.

IV. WHEN VC-BASED COMPLEXITY CONTROL MAY FAIL

In this section we describe a class of function estimation problems unsuitable for analytic model selection approaches (including VC-based model selection). Specifically, consider unbounded approximating functions [i.e., algebraic polynomials (19)] used to estimate difficult (i.e., discontinuous) target functions from low-noise/noiseless samples. In this setting, "good" models fitting the training data (almost) exactly would be quite complex (i.e., polynomials of high degree). Such models interpolate (between training samples) quite well, but they tend to have large boundary errors (bad extrapolation outside the range of the training data) due to the unbounded approximating functions.
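A sketch (with an illustrative step target and hypothetical constants) of the boundary-error mechanism just described: a high-degree algebraic polynomial fitted to low-noise samples of a discontinuous target is well behaved between the samples but can take very large values between the outermost samples and the ends of the [0, 1] interval.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0.0, 1.0, size=30))
t = (x > 0.5).astype(float)                 # illustrative step target
y = t + 0.01 * rng.standard_normal(30)      # "small" noise (high SNR)

coeffs = np.polyfit(x, y, deg=20)           # high-degree unbounded fit

interior = np.linspace(x.min(), x.max(), 500)   # between outermost samples
full = np.linspace(0.0, 1.0, 500)               # including the boundary gaps
print("max |f| inside the data range:", np.max(np.abs(np.polyval(coeffs, interior))))
print("max |f| on the whole [0, 1]  :", np.max(np.abs(np.polyval(coeffs, full))))
```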
Fig. 5. Results for the sine-squared target function with sample size of 100 and SNR = 1.0. (a) Polynomial estimation. (b) Trigonometric estimation.

Let us consider a typical example illustrating this phenomenon: estimating a step function from 30 samples uniform in [0,1] with a "small" Gaussian noise (SNR = 10). Algebraic polynomials (19) are used to estimate this target function. Two model selection strategies are used, namely VC-based complexity control and leave-one-out cross-validation. This experimental setup closely follows [22], which also compares VC-based complexity control with a resampling approach for this problem.

Experimental results in Fig. 11 suggest that cross-validation provides more accurate estimates than the VC-based approach in this setting, by choosing lower-degree polynomials. Similar (relative) under-performance of the VC model selection can be observed for larger sample sizes under low-noise settings, but under higher noise settings (SNR = 1 or SNR = 2) the VC-based approach in fact outperforms cross-validation. Detailed (visual) analysis of the estimates produced by cross-validation and by the VC-method suggests the following explanation: under the low-noise setting VC model selection yields very accurate interpolation (between training samples), which requires higher-order polynomials due to the discontinuous target function. These estimates tend to produce huge errors at the boundaries of the [0,1] interval, i.e., between zero and the leftmost data point, and between the rightmost data point and one. We call this effect "overfitting at the boundaries." It does not happen with noisier data (lower SNR), when the VC model selection yields lower-order polynomials (and the boundary effects become less significant). In the case of cross-validation, the boundary effects are implicitly "detected" when the training sample closest to the boundary is "left out" for validation. Namely, for high-order polynomial models the validation error for the sample near the boundary is much larger than the validation error for the other samples. Consequently, the cross-validation approach will choose lower-order polynomial estimates. In order to verify this explanation, we repeated the model selection experiments with a slight modification of cross-validation, where the left- and rightmost samples are not left out for cross-validation. In this case, as expected, cross-validation results deteriorate and become similar to the VC model selection results (see Fig. 12).
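A sketch of the verification experiment just described: standard leave-one-out CV for choosing the polynomial degree, plus the modified variant in which the left- and rightmost training samples are never left out, so the implicit detection of boundary errors is suppressed. All names are illustrative.

```python
import numpy as np

def loo_cv_score(x, y, degree, skip_boundary=False):
    """Leave-one-out CV estimate of prediction risk for a given degree.
    If skip_boundary is True, the leftmost and rightmost samples are
    never left out (the modification discussed in the text)."""
    order = np.argsort(x)
    held_out = order[1:-1] if skip_boundary else order
    errors = []
    for i in held_out:
        mask = np.ones(len(x), dtype=bool)
        mask[i] = False
        c = np.polyfit(x[mask], y[mask], deg=degree)
        errors.append((y[i] - np.polyval(c, x[i])) ** 2)
    return np.mean(errors)

def cv_select(x, y, max_degree=25, skip_boundary=False):
    """Degree minimizing the (possibly modified) leave-one-out CV score."""
    scores = [loo_cv_score(x, y, m, skip_boundary) for m in range(max_degree + 1)]
    return int(np.argmin(scores))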
Fig. 6. Results for the piecewise polynomial target function with sample size of 100 and SNR = 1.0. (a) Polynomial estimation. (b) Trigonometric estimation.

In summary, the empirical results presented in this section show possible settings where analytic model selection is inferior to resampling approaches due to overfitting at the boundaries. These settings are characterized by a combination of three factors:
1) unbounded approximating functions;
2) mismatch between the target function and the approximating functions, leading to higher-order models;
3) low-noise training data.
These boundary effects can be overcome in several (obvious) ways:
1) use bounded approximating functions, i.e., trigonometric polynomials instead of algebraic ones;
2) when using unbounded approximating functions, avoid boundary effects by considering the prediction accuracy only on the interval given by the leftmost and rightmost training samples (a sketch follows this list). This solution has the disadvantage that it cannot be easily extended to the multivariate case.
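A sketch of remedy 2): the prediction risk of a fitted polynomial is measured only on the interval spanned by the training inputs, so that extrapolation beyond the outermost samples does not enter the comparison (names are illustrative).

```python
import numpy as np

def interior_risk(target, coeffs, x_train, num_points=1000):
    """MSE between target and polynomial estimate, restricted to
    [min(x_train), max(x_train)] (remedy 2 in the list above)."""
    grid = np.linspace(np.min(x_train), np.max(x_train), num_points)
    return np.mean((target(grid) - np.polyval(coeffs, grid)) ** 2)
```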
V. SUMMARY AND DISCUSSION

Empirical results presented in this paper and elsewhere [5], [21], [23] suggest that VC generalization bounds can be successfully used for model selection with linear estimators. Besides the (linear) dictionary methods and penalized estimators considered in this paper, such linear estimators include orthogonal basis functions (i.e., wavelet estimators), considered in [21].

Our empirical findings contradict a widely held opinion that VC-bounds are too conservative to be useful in practice for model selection. We comment next on several likely causes of these misconceptions. First, the VC generalization bounds are often applied with the upper-bound estimates of the parameter values [in the bound (8)] cited from Vapnik's original books or papers [1]. For practical problems, this leads to poor (too conservative) model selection. As noted in [5], VC-theory provides an analytical form of the bounds up to the value of the constants. Reasonable selection of the constant values has to be done empirically for a given type of learning problem (e.g., classification or regression).
The second cause of misconception is using the VC bounds to estimate the generalization error of feedforward neural networks. Here the common approach is to estimate the bound on the generalization error (of a trained network) using (theoretical) estimates of the VC-dimension as a function of the number of parameters (or network weights). This generalization bound is then compared against the actual generalization error (measured empirically), and a conclusion is made regarding the poor quality of VC bounds. Here the problem is that typical network training procedures inevitably introduce a regularization effect, so the "theoretical" VC-dimension can be quite different from the "actual" VC-dimension, which takes into account the regularization effect of a training algorithm. The third problem is that VC-bounds are often applied to nonlinear estimators (i.e., neural networks), where the empirical risk cannot be reliably minimized (due to the existence of multiple local minima and saddle points).

Fig. 7. Results for the sine-squared target function with sample size of 1000 and SNR = 0.5 for polynomial estimation.

Fig. 8. Results for the piecewise polynomial target function with sample size of 30 and SNR = 2.5. (a) Penalized polynomial estimation. (b) Penalized trigonometric estimation.
Fig. 9. Results for the piecewise polynomial target function with sample size of 30 and SNR = 5.0. (a) Penalized polynomial estimation. (b) Penalized trigonometric estimation.

Fig. 11. Results for the step target function with a sample size of 30 and SNR = 10. In this setting the analytical VC-theory approach does not perform as well as the cross-validation approach, due to boundary effects.

Xuhui Shao (S'98) received the B.S. and M.S. degrees in electrical engineering from Tsinghua University, Beijing, China, in 1993 and 1995, respectively, and the Ph.D. degree in electrical engineering from the University of Minnesota, St. Paul, in 1999. He is a Staff Scientist at HNC Software Inc., San Diego, CA. His research interests include model selection, large margin classifiers, and intelligent signal processing.

Vladimir N. Vapnik, for a photograph and biography, see this issue, p. 999.