Model Complexity Control for Regression Using VC Generalization Bounds
… (unknown) joint probability density function (pdf). The unknown function in (2) is the mean of the output conditional probability (aka regression function)

    f(x) = \int y \, p(y \mid x) \, dy                                   (3)

A learning method (or estimation procedure) selects the "best" model from a set of (parameterized) approximating functions (or possible models) f(x, ω), ω ∈ Ω, specified a priori, where the quality of an approximation is measured by the loss or discrepancy measure L(y, f(x, ω)). A common loss function for regression is the squared error

    L(y, f(x, \omega)) = \bigl(y - f(x, \omega)\bigr)^2                  (4)

The set of functions f(x, ω), ω ∈ Ω, supported by a learning method may or may not contain the regression function (3). Thus learning is the problem of finding the function f(x, ω₀) (regressor) that minimizes the prediction risk functional

    R(\omega) = \int \bigl(y - f(x, \omega)\bigr)^2 \, p(x, y) \, dx \, dy          (5)

using only the training data. This risk functional measures the accuracy of the learning method's predictions of the unknown target function.

Since the probability measure p(x, y) in (1) is not known, minimization of the prediction risk is a difficult (ill-posed) problem. Modern adaptive learning methods (in neural networks and statistics) use a wide (very flexible) set of approximating functions ordered according to some measure of complexity (or flexibility to fit the training data). Then the problem is to choose the model of optimal complexity for a given (finite) sample.

Many learning algorithms are based on the inductive principle known as "empirical risk minimization," which amounts to choosing the model (from a set of approximating functions) that minimizes the empirical risk, or the average loss on the training data

    R_{\mathrm{emp}}(\omega) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x_i, \omega)\bigr)^2          (6)
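As a concrete illustration (not from the paper), here is a minimal Python/NumPy sketch of the empirical risk (6) for a model that is linear in parameters (a polynomial fitted by least squares). The sample, the degree, and all names are hypothetical.

```python
import numpy as np

def empirical_risk(y, y_hat):
    """Average squared loss (6) on the training data."""
    return np.mean((y - y_hat) ** 2)

# Hypothetical training sample (x_i, y_i), i = 1..n.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
y = np.sin(2.0 * x) ** 2 + 0.1 * rng.standard_normal(30)

# Least-squares fit of a degree-5 polynomial and its empirical risk.
coeffs = np.polyfit(x, y, deg=5)
print(empirical_risk(y, np.polyval(coeffs, x)))
```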
However, the goal of learning is to obtain a model providing minimal prediction risk, i.e., minimal error for future data. It is well known that for a given training sample size there exists a model of optimal complexity corresponding to the smallest prediction (generalization) error for future data. Hence, any reasonable method for learning from finite samples needs to have some provisions for complexity control. Implementations of complexity control include [2], [4], [5]: penalization (regularization), weight decay (in neural networks), parameter (weight) initialization (in neural-network training), and various greedy procedures (aka constructive, growing, or pruning methods).

There are three distinct problems common to all methodologies for complexity control.

First, one needs to define a meaningful complexity index for a set of (parameterized) functions. The usual index is the number of (free) parameters; it works well for linear parameterization but is not appropriate for approximating functions nonlinear in parameters [2].

Second, one needs to estimate the unknown prediction risk from the known empirical risk, in order to choose the model corresponding to the smallest (estimated) prediction risk. These estimates are known as model selection criteria in statistics. Analytic model selection criteria have been developed in statistics using asymptotic (large-sample) theory. In practical (finite-sample) applications, empirical data-driven model selection using resampling is the usual choice. However, both analytic and resampling methods suffer from large variability with finite data.

Third, there is a problem of finding a (global) minimum of the empirical risk. Strictly speaking, this is possible only for a set of functions linear in parameters. With nonlinear estimators (such as multilayer perceptrons) an optimization algorithm can find, at best, only a local minimum.

Vapnik–Chervonenkis (VC) theory [1], [2] provides a principled solution to the first two problems. It defines a new measure of complexity (called the VC-dimension) which coincides with the classical definition (the number of parameters) for linear parameterization. VC-theory also provides analytical generalization bounds that can be used for estimating prediction risk. However, VC-theory cannot be rigorously applied to nonlinear estimators (such as neural networks), where the VC-dimension cannot be accurately estimated and the empirical risk cannot be reliably minimized [5]. A new method known as Support Vector Machines (SVM's) [2], developed at AT&T Bell Labs, provides a practical solution to the third problem. The SVM formulation ensures global minimization of the empirical risk using constrained linear functions in a very high-dimensional intermediate feature space.

There is a growing awareness that VC-theory provides a satisfactory theoretical and conceptual framework for learning with finite samples. This paper demonstrates the practical applicability of VC-bounds for regression for complexity control in the case of linear estimators. Note that in the general case of nonlinear estimators the issue of model complexity control becomes very difficult, as it involves solving all three problems outlined above. However, for linear estimators there is only one problem, i.e., estimation of the prediction risk. Hence, for linear estimators the comparison of different model selection methodologies becomes tractable.

This paper is organized as follows. Section II describes classical model selection criteria developed in statistics, and the VC-based approach to complexity control. Section III describes empirical comparisons for linear and penalized linear estimators. Section IV describes a situation where analytic model selection approaches may fail. Summary and discussion are given in Section V.

II. COMPLEXITY CONTROL AND ESTIMATION OF PREDICTION RISK

This section reviews (representative) classical statistical methods for model selection, and contrasts them with the method based on VC-theory. Classical methods for model selection are based on asymptotic results for linear models.
Nonasymptotic (guaranteed) bounds on the prediction risk based on VC-theory have been proposed in [1].

A. Classical Model Selection Criteria

There are two general approaches for estimating prediction risk for regression problems with finite data. One is based on data resampling. The other approach is to use analytic estimates of the prediction risk as a function of the empirical risk (training error) penalized (adjusted) by some measure of model complexity. Once an accurate estimate of the prediction risk is found, it can be used for model selection by choosing the model complexity which minimizes the estimated prediction risk. In the statistical literature, various prediction risk estimates have been proposed for model selection (in the linear case). In general, these estimates all take the form

    \hat{R} = r\!\left(\frac{d}{n}\right) \cdot \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2          (7)

where r(·) is a monotonically increasing function of the ratio p = d/n of model complexity d (degrees of freedom) to the training sample size n [6]. The function r(·) is often called a penalization factor because it inflates the average residual sum of squares for increasingly complex models. The following forms of r(p, n) have been proposed in the statistical literature:

    final prediction error (FPE):         r(p, n) = \frac{1 + p}{1 - p}
    Schwartz' criterion (SC):             r(p, n) = 1 + \frac{\ln n}{2} \cdot \frac{p}{1 - p}
    generalized cross-validation (GCV):   r(p, n) = \frac{1}{(1 - p)^2}
    Shibata's model selector (SMS):       r(p, n) = 1 + 2p
All these classical approaches are motivated by asymptotic arguments for linear models and therefore apply well for large training sets. In fact, for large n, prediction estimates provided by FPE, GCV, and SMS are asymptotically equivalent. Moreover, the model selection criteria above are all based on a parametric philosophy. That is, the goal of model selection is to select the terms of the approximating function in order to match the target function (under the assumption that the target function is contained in a set of linear approximating functions). The success of the model selection criteria is measured according to this philosophy [7], [10], [11], [12]. This classical approach can be contrasted with the VC-theory approach, where the goal of model selection is to choose the approximating function with the lowest prediction risk (irrespective of the number of terms chosen).

Classical model selection criteria were designed with specific applications in mind. For example, FPE was originally designed for model identification for autoregressive time series, and GCV was developed as an estimate for cross-validation (itself an estimate of prediction risk) in spline smoothing. It is only SC and SMS which were developed for the generic problem of regression. Note that application of the minimum description length (MDL) arguments [13] yields a penalization factor identical to the Schwartz criterion, though the latter was derived using a Bayesian formulation. A recent model selection criterion described in [14] has the goal of minimizing prediction risk for regression.

Typically, the classical criteria are constructed by first defining the prediction risk in terms of the linear approximating function. Then, asymptotic arguments are used to develop limit distributions for various components of the prediction risk, leading to an asymptotic form of the prediction risk. Finally, the data, along with estimates of the noise variance, are used to estimate the expected value of the asymptotic prediction risk. For example, the FPE criterion depends on assuming a Gaussian distribution to develop the asymptotic prediction risk. In addition, FPE depends on an estimate of the noise variance given by

    \hat{\sigma}^2 = \frac{1}{n - d} \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2

There are several common assumptions underlying all these model selection criteria:
1) The target function is linear.
2) The set of linear functions of the learning machine contains the target function. That is, the learning machine provides an unbiased estimate.
3) The noise is independent and identically distributed.
4) The empirical risk is minimized.
Additional assumptions reflecting the noise distribution and limit distributions are also applied in the development of each selection criterion.

Another popular alternative (to analytic methods) is to choose model complexity using resampling. In this paper we consider leave-one-out cross-validation (CV). Under this approach, the prediction risk is estimated via cross-validation, and the model providing the lowest estimated risk is chosen. It can be shown [4] that leave-one-out CV is asymptotically (for large n) equivalent to analytic model selection criteria (such as FPE, GCV, and SMS). Unfortunately, the computational cost of CV grows linearly with the number of samples, and often becomes prohibitively large for practical applications. Additional complications arise in the context of using resampling with nonlinear estimators (such as neural networks), due to the existence of multiple local minima and the dependence of the final solution (obtained by an optimization algorithm) on the initial conditions (weight initialization)—see [4]. Nevertheless, resampling remains the preferred approach for model selection in many learning methods. In this paper, we use CV as a benchmark method for comparing various analytic methods.

It can be shown [12] that the above estimates of the prediction risk (excluding SC) are not consistent in the following sense: the probability of selecting the model with the same number of terms as the target function does not converge to one as the number of observations is increased (with a fixed maximum number of basis functions). In addition, resampling methods for model selection may suffer from the same lack of consistency.
For example, leave-one-out CV, which has asymptotic equivalence to (7), is not consistent [12]. Note that this definition of consistency is developed from a parametric viewpoint (choosing the correct number of terms). It is not equivalent to consistency in estimating the prediction risk.

In most cases, these (classical) model selection approaches are applied in practical situations when the underlying assumptions do not hold, i.e., they are applied when the model may be biased or the number of samples is finite. In this paper we are interested in such realistic situations. Also, several generalizations of analytic estimates for prediction error suitable for nonlinear models have been recently proposed by [15] and [16]. These estimates generalize the notion of degrees of freedom to nonlinear models; however, they are still based on asymptotic assumptions.

Finally, we note that model selection criteria are used in this paper for complexity control, i.e., choosing an optimal number of terms (in a linear model). This is not equivalent to the goal of accurate estimation of prediction risk. In fact, prediction risk estimates are significantly affected by the variability of (finite) training samples. Hence, a model selection criterion can provide a poor estimate of prediction risk, yet the differences between its risk estimates (for models of different complexity) may yield accurate model selection. Likewise, a model selection criterion may provide an accurate estimate of prediction risk for a poorly chosen model complexity.

B. VC-Based Complexity Control

VC-theory provides a very general and powerful framework for complexity control called structural risk minimization (SRM). Under SRM, a set of possible models (approximating functions) is ordered according to their complexity (or flexibility to fit the data). Specifically, under SRM the set of approximating functions has a structure, that is, it consists of the nested subsets (or elements) S_k such that

    S_1 \subset S_2 \subset \cdots \subset S_k \subset \cdots

where each element S_k of the structure has finite VC-dimension h_k. By design, a structure provides ordering of its elements according to their complexity (i.e., VC-dimension):

    h_1 \le h_2 \le \cdots \le h_k \le \cdots

According to SRM, solving a learning problem with finite data requires a priori specification of a structure on a set of approximating functions. Then, for a given data set, optimal model estimation involves two tasks:
1) selecting an element (subset) of a structure having optimal complexity;
2) estimating the model from this subset. The model parameters are found via minimization of the empirical risk (i.e., training error).
Note that Step 1) corresponds to model selection, whereas Step 2) corresponds to parameter estimation in statistical methods. However, SRM provides analytic upper bounds on the prediction risk which can be used for model selection [1], [2], [17]. For regression problems with squared loss, the following bound on the prediction risk holds with probability 1 − η:

    R(\omega) \le \frac{R_{\mathrm{emp}}(\omega)}{\left(1 - a\,c\,\sqrt{\dfrac{h\bigl(\ln\frac{n}{h} + 1\bigr) - \ln\eta}{n}}\right)_{+}}          (8)

where h is the VC-dimension of the set of approximating functions, a is a constant which reflects the "tails of the loss function distribution," i.e., the probability of observing large values of the loss, and c is a theoretical constant. The above upper bound holds with probability 1 − η (the confidence level of an estimate). Note that the VC-bound (8) for regression has a general form similar to (7).

For practical use of the bound (8) for model selection, one needs to set the values of the constants and the confidence level 1 − η. Reference [17] shows that one can use this bound with constants close to one, so we choose a = c = 1. We set η based on the following (informal) arguments. Consider the case h = n: the bound should then yield an uncertainty of the type 0/0, which happens when the term (ln η)/n in (8) vanishes. From a practical viewpoint, the confidence level of the bound (8) should also depend on the sample size n, i.e., for larger sample sizes we should expect a higher confidence level; so we set η = 1/\sqrt{n}, which gives −ln η = (ln n)/2.

Further, we need to estimate the VC-dimension of a set of approximating functions. For linear methods the VC-dimension can be estimated as the number of free parameters (or degrees of freedom) [1], [2]. For example, for polynomial estimators of degree m the VC-dimension is h = m + 1. Making all these substitutions into (8) gives the following penalization factor, which we call Vapnik's measure (VM):

    r(p, n) = \left(1 - \sqrt{p - p \ln p + \frac{\ln n}{2n}}\right)_{+}^{-1}          (9)

where p = h/n. Penalization factor (9) is used for the model selection comparisons reported in Sections III and IV.

The common constructive implementation of SRM can be described as follows. For a given set of training data, the SRM principle selects the function minimizing the empirical risk over the functions from each element S_k. Then, for each element of the structure, the guaranteed risk is found using the bound provided by the right-hand side of (8). Finally, an optimal structure element providing minimal guaranteed risk is chosen.
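A sketch of this constructive procedure for a structure indexed by polynomial degree, using Vapnik's measure (9) as the penalization factor with h = m + 1 for a degree-m polynomial; the basis choice and all names are illustrative, not taken from the paper.

```python
import numpy as np

def vapnik_measure(h, n):
    """Vapnik's measure (9): 1 / (1 - sqrt(p - p*ln p + ln(n)/(2n)))_+, p = h/n.
    Returns infinity when the square-root term reaches one (vacuous bound)."""
    p = h / n
    root = np.sqrt(p - p * np.log(p) + np.log(n) / (2.0 * n))
    return np.inf if root >= 1.0 else 1.0 / (1.0 - root)

def srm_select(x, y, max_degree=25):
    """For each element of the structure (degree m), minimize the empirical risk,
    compute the guaranteed risk r(p, n) * R_emp, and keep the smallest one."""
    n = len(x)
    best_m, best_bound = None, np.inf
    for m in range(max_degree + 1):
        coeffs = np.polyfit(x, y, deg=m)
        r_emp = np.mean((y - np.polyval(coeffs, x)) ** 2)
        bound = vapnik_measure(m + 1, n) * r_emp   # h = m + 1 free parameters
        if bound < best_bound:
            best_m, best_bound = m, bound
    return best_m, best_bound
```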
Application of SRM in practice depends on a chosen structure. An example of a generic structure (commonly used in neural networks and statistical methods) is a dictionary representation [3], where the set of approximating functions is

    f_m(x, \mathbf{w}, \mathbf{v}) = \sum_{i=1}^{m} w_i \, g(x, v_i)          (10)

where g(x, v_i) is a set of basis functions with (possibly adjustable) parameters v_i, and w_i are linear coefficients.
Both w_i and v_i are estimated to fit the training data. Representation (10) defines a structure, since the sets of functions with m = 1, 2, … terms are nested:

    S_1 \subset S_2 \subset \cdots \subset S_m \subset \cdots

Hence, the number of terms m in expansion (10) specifies an element of a structure.

Further, we distinguish between adaptive methods, where the basis functions are nonlinear in parameters v_i, and nonadaptive (or linear) methods, where the basis functions are prespecified (or fixed a priori). Examples of adaptive methods include multilayer perceptron networks and recent statistical methods, such as classification and regression trees (CART) and projection pursuit regression [3].

In this paper, we consider complexity control only for linear methods (linear estimators), where the model is estimated as a linear combination of prespecified (fixed) basis functions

    f_m(x, \mathbf{w}) = \sum_{i=1}^{m} w_i \, g_i(x)          (11)

These methods differ mainly in the type of chosen basis functions and the procedure for choosing the optimal number of terms (model selection).

Penalization also represents a form of SRM [2]. Consider a set of functions f(x, \mathbf{w}), where \mathbf{w} is a vector of parameters having some fixed length. For example, the parameters can be the weights of a neural network (of fixed topology). Let us introduce the following structure on this set of functions:

    S_k = \{ f(x, \mathbf{w}) : \|\mathbf{w}\|^2 \le c_k \}, \qquad c_1 < c_2 < \cdots          (12)

Minimization of the empirical risk on each element S_k of a structure is a constrained optimization problem, which is achieved by minimizing the "penalized" risk functional

    R_{\mathrm{pen}}(\mathbf{w}, \lambda_k) = R_{\mathrm{emp}}(\mathbf{w}) + \lambda_k \|\mathbf{w}\|^2          (13)

with an appropriately chosen Lagrange multiplier λ_k such that the constraint in (12) is satisfied. Functional (13) represents the penalization formulation, where an optimal choice of the regularization parameter λ is conceptually similar to selecting the optimal number of terms in a dictionary method (10), (11). The particular structure (12), (13) is equivalent to a ridge penalty (used in statistical methods) or weight decay (used in neural networks). The complexity of the "penalized" risk functional (13), or an equivalent structure (12), can be estimated analytically if the approximating functions are linear (in parameters).

In this paper, we use penalized linear estimators for univariate regression, i.e., polynomial estimators of fixed degree 25 with a ridge penalty on the norm of the coefficients (free parameters). Model selection with penalized linear estimators uses the same expressions for the penalization factor as above; however, the "effective" degrees of freedom is used in place of the number of free parameters. Estimating the complexity (i.e., effective degrees of freedom) of a penalized linear estimator is usually based on the eigenvalues of its equivalent kernel representation. The approximations provided by a linear estimator for the training data can be written as

    \hat{\mathbf{y}} = S \, \mathbf{y}          (14)

where the vector \mathbf{y} = (y_1, \ldots, y_n)^\top contains the response samples, the matrix X contains the predictor samples, and S is an n × n matrix that transforms the response values into estimates for each sample. The matrix S is often called the "hat" matrix, since it transforms responses into estimates. Consider the ridge regression risk functional

    R_\lambda(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x_i, \mathbf{w})\bigr)^2 + \lambda \|\mathbf{w}\|^2          (15)

For a given penalty strength λ, the solution which minimizes (15) is a linear estimator with the "hat" matrix

    S_\lambda = X \bigl(X^\top X + n\lambda I\bigr)^{-1} X^\top          (16)

—see [18], [19] for details. In particular, the effective degrees of freedom is estimated via the equivalent smoother matrix of a penalized estimator

    d_{\mathrm{eff}}(\lambda) = \operatorname{trace}(S_\lambda)          (17)

or equivalently [19] as

    d_{\mathrm{eff}}(\lambda) = \sum_{j} \frac{\lambda_j}{\lambda_j + n\lambda}          (18)

where λ is the regularization parameter and λ_j are the eigenvalues of the Hessian matrix of the linear nonpenalized estimate. Expressions (17), (18) are usually described as the effective degrees of freedom (of a penalized estimator). In this study, we also use these expressions to estimate the VC-dimension, in order to apply the VC-bound (8), (9) for complexity control. However, it is not clear how accurately these expressions estimate the VC-dimension in the penalized case. For this reason, application of SRM for complexity control of penalized linear estimators is somewhat heuristic.
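A sketch of the smoother ("hat") matrix (16) and the two equivalent expressions (17), (18) for the effective degrees of freedom, for a degree-25 polynomial design matrix. The ridge normalization follows the reconstruction of (15)–(16) above, and all names are illustrative.

```python
import numpy as np

def hat_matrix(X, lam, n):
    """Ridge 'hat' matrix (16): S_lam = X (X^T X + n*lam*I)^{-1} X^T."""
    d = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T)

def effective_dof(X, lam, n):
    """Effective degrees of freedom via the trace (17) and via the
    eigenvalues of X^T X (18); the two values should agree up to rounding."""
    dof_trace = np.trace(hat_matrix(X, lam, n))
    eigvals = np.linalg.eigvalsh(X.T @ X)
    dof_eig = np.sum(eigvals / (eigvals + n * lam))
    return dof_trace, dof_eig

# Hypothetical univariate sample and a degree-25 polynomial design matrix.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=30)
X = np.vander(x, N=26, increasing=True)      # columns 1, x, ..., x^25
print(effective_dof(X, lam=1e-3, n=len(x)))
```

The resulting d_eff is then substituted for the number of free parameters d in the penalization factors of Section II-A or in Vapnik's measure (9).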
III. EMPIRICAL COMPARISONS OF METHODS FOR MODEL SELECTION

This section describes an empirical comparison of classical methods for model selection with the VC-based approach for linear/penalized linear estimators. First, we describe the comparison methodology, then the experimental setup (including the specification of the data sets and linear estimators used), and finally the comparison results.

Comparison Methodology: In order to compare various model selection criteria, we need to specify the type of basis functions of a linear estimator. Then comparisons between various model selection approaches are performed using the same type of (linear) approximating functions. Model parameters are estimated by linear least-squares fitting of the parameters to the training data. Model selection amounts to choosing the optimal complexity of a linear estimator, based on the available training data. The (optimal) model complexity provides the lowest prediction risk (mean-squared error) estimated from the empirical risk using a model selection criterion.
Two types of basis functions are used in the expansion (11).

Algebraic polynomials

    f_m(x, \mathbf{w}) = \sum_{i=0}^{m} w_i \, x^i          (19)

Trigonometric polynomials

    f_m(x, \mathbf{w}) = w_0 + \sum_{i=1}^{m} w_i \cos(i\,\omega_0 x)          (20)

We consider all polynomials of degree 0–25 as possible candidates for model selection.

We also used penalized linear estimators formed by using the polynomial of fixed (large) degree 25, with an additional constraint on the norm of its coefficients, leading to the penalization formulation

    R_{\mathrm{pen}}(\mathbf{w}, \lambda) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f_{25}(x_i, \mathbf{w})\bigr)^2 + \lambda \|\mathbf{w}\|^2          (21)

where the choice of the regularization parameter λ controls model complexity. In the penalization formulation (21), both algebraic (19) and trigonometric polynomials (20) of degree 25 were used.
Training data: This study used simulated training data (x_i, y_i), i = 1, …, n. The random x-values are uniformly distributed in the [0,1] interval, and the y-values are generated according to y = t(x) + ξ, where ξ is additive (Gaussian) noise and t(x) is the target function (being estimated). The following two target functions were used (see Fig. 2):
Sine-squared function t(x) = sin²(2x);
Discontinuous piecewise polynomial function, shown in Fig. 2(a).

Fig. 2. Target functions used in the comparison: (a) Piecewise polynomial. (b) Sine-squared function sin²(2x).

Three different sample sizes were used: small (30 samples), medium (100), and large (1000). Each training sample was generated with different levels of additive Gaussian noise. The noise is defined in terms of the signal-to-noise ratio (SNR) as the ratio of the standard deviation of the true (target function) output values for the given input samples to the standard deviation of the Gaussian noise. Seven different SNR levels were used.

Whereas it is not feasible to compare empirically all possible combinations of different target functions, sample sizes, etc., we tried to choose a reasonably broad cross-section of comparison data sets. For example, we use small and large sample sizes to observe the difference between the finite-sample and asymptotic settings for model selection. Similarly, we use a large range of SNR values in order to compare model selection methods for noisy and noiseless training data. The chosen target functions exemplify smooth (easy to estimate) and discontinuous (hard to estimate) mappings, because smooth functions can be approximated by low-order models whereas discontinuous functions require high-order models (e.g., polynomials of high degree). Also, the chosen target functions generally do not match the basis functions well, except for one set of experiments in which the sine-squared target function is estimated via trigonometric polynomials. Overall, the data sets used for comparisons reflect a wide range of practical conditions, in order to make the comparison of model selection methods more meaningful.
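A sketch of this simulated-data setup: x uniform on [0, 1], y = t(x) + noise, with the Gaussian noise standard deviation set from the SNR (signal standard deviation divided by noise standard deviation). The sine-squared target follows the Fig. 2 caption; the piecewise-polynomial target below is only an illustrative stand-in, since its exact definition is not reproduced in this text.

```python
import numpy as np

def sine_squared(x):
    """Smooth target from Fig. 2(b): t(x) = sin^2(2x)."""
    return np.sin(2.0 * x) ** 2

def piecewise_poly(x):
    """Illustrative discontinuous piecewise-polynomial target (stand-in;
    the paper's exact definition is not given in this excerpt)."""
    return np.where(x < 0.5, 4.0 * x ** 2, 1.0 - x)

def make_sample(target, n, snr, rng):
    """Generate (x_i, y_i): x uniform on [0,1], y = t(x) + Gaussian noise,
    with noise std = std(t(x)) / SNR."""
    x = rng.uniform(0.0, 1.0, size=n)
    t = target(x)
    sigma = np.std(t) / snr
    return x, t + sigma * rng.standard_normal(n)

rng = np.random.default_rng(7)
x, y = make_sample(sine_squared, n=30, snr=2.5, rng=rng)
```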
Comparison indexes: Various classical model selection methods and the VC-based method (VM) are compared for a given type of linear estimator and a particular choice of training data. The results are presented in the form of box plots representing the empirical distribution (of the comparison index) based on 300 repetitions of the fitting/model selection using different random samples with the same statistical characteristics. Specifically, we use the following performance indexes.
1) RISK (MSE) is defined as the distance between the target function and the regression estimate chosen by a given model selection method.
2) Degrees of Freedom (DOF) is the number of free parameters (model complexity) chosen by a given model selection criterion.
3) Risk Estimation Accuracy measures the accuracy of the estimates of prediction risk provided by each model selection approach. For a regression estimate found by a given model selection approach we calculate (a) the estimated risk via expression (7) and (b) the risk as the distance between the regression estimate and the target function. Then the risk estimation accuracy is defined as the (absolute value of the) difference between the estimated risk (a) and the risk (b). Notice that the risk estimation accuracy is lower bounded by the value of the noise variance, i.e., even for a perfectly accurate estimate we still observe an error due to additive noise.
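A sketch of how these three indexes might be computed for a single repetition, for a polynomial model of the selected degree; `risk_estimate` would come from (7) with the method's penalization factor, and the distance to the target is approximated on a dense grid. All names are illustrative.

```python
import numpy as np

def comparison_indexes(target, x, y, degree, risk_estimate):
    """Indexes for one fitted model of the chosen complexity:
    RISK (MSE to the target), DOF, and risk-estimation accuracy."""
    coeffs = np.polyfit(x, y, deg=degree)
    grid = np.linspace(0.0, 1.0, 1000)
    risk = np.mean((target(grid) - np.polyval(coeffs, grid)) ** 2)  # index 1
    dof = degree + 1                                                # index 2
    accuracy = abs(risk_estimate - risk)                            # index 3
    return risk, dof, accuracy
```

Repeating this computation over 300 independently generated samples gives the empirical distributions summarized by the box plots.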
Performance criteria: Note that the first index, that is, risk (MSE), is of primary importance for comparisons, because the quality of predictive learning (estimation) from finite samples has been defined in terms of the risk functional (5). The relative prediction performance of various model selection criteria can be judged from the box plots of risk (MSE) of each method. Box plots showing lower values of risk correspond to better model selection approaches. In particular, better model selection approaches select models providing the lowest guaranteed prediction risk (i.e., with the lowest RISK at the 95% mark), and also the smallest variation of the risk (i.e., narrow box plots). As can be seen from the results reported below, the methods providing the lowest guaranteed prediction risk do not necessarily provide the lowest average risk (i.e., the lowest risk at the 50% mark). The other two performance indexes provide additional insights into the properties of model selection approaches. The DOF index shows the model complexity (degrees of freedom) chosen by a given method. The DOF box plot, in combination with the risk box plot, provides insights about the overfitting (or underfitting) of a given method, relative to the optimally chosen DOF. Likewise, the risk estimation accuracy should be examined in conjunction with the risk and DOF box plots. The risk estimation accuracy measures the quality of risk estimates for the model of chosen complexity. Note that a poorly chosen model (i.e., one that overfits the training data) may yield a very accurate risk estimate, whereas a model of optimal complexity chosen by a different method (for the same data) would yield larger values of risk estimation accuracy. For this reason, direct comparison of methods in terms of the risk estimation accuracy is rather meaningless. Instead, a better interpretation may be based on the width of the box plots of the risk estimation accuracy. The width of these box plots reflects a method's sensitivity to random sample variations, and it can be used as a measure of the variability of the risk estimates. Specifically, narrow box plots indicate that a method is insensitive (robust) to random variations of the data.

Comparison results: All possible combinations of three sample size values, seven SNR values, two target functions, and four types of linear estimators yield 168 different comparison experiments. Due to space limitations, only a representative subset of the comparison results is discussed below. The results are presented separately for different sample sizes, because for small samples the estimation accuracy varies dramatically between model selection criteria, whereas for very large samples the difference becomes negligibly small. Likewise, we describe separately the comparison results for linear estimators and penalized linear estimators, which represent two different types of estimators (or two types of SRM structures). Comparisons are shown mainly for the classical analytic methods (FPE, SC, GCV, SMS) and the VC-based model selection (VM). In addition, we show representative results (for 30 and 100 samples) for model selection using leave-one-out CV. Since CV does not provide any improvement over the analytic VC-based model selection (VM), we do not show CV-based model selection for most comparison plots.

Comparisons for linear estimators: Figs. 3–7 show comparisons for linear estimators. Figure captions specify the data set used, i.e., the target function, sample size, and noise level (SNR). Part (a) of each figure shows comparison results for regression models estimated using polynomial basis functions, whereas part (b) shows results using trigonometric basis functions. Depending on the target function, two slightly different cosine basis functions were used, in order to create a good match between a target function and the (harmonic) approximating functions. Namely, for estimating the sine-squared target function we used one set of cosine basis functions, whereas for estimating the piecewise polynomial target function we used another in the expansion (11). In contrast, the polynomial basis functions illustrate a (more realistic) situation when the approximating functions do not match the target function well.

Small sample size: Comparison results for small sample size are shown in Figs. 3 and 4, for the sine-squared and piecewise-polynomial target functions, respectively. It can be clearly seen that Vapnik's measure (VM) significantly outperforms the other model selection criteria, i.e., provides lower values of Risk at 75 and 95%. It achieves superior prediction accuracy by conservative selection of the model complexity, i.e., selecting lower DOF than other methods. Likewise, the results in Figs. 3 and 4 show that the VC-bound (8) provides better risk estimation accuracy than the other methods, and the lowest variability of the risk estimates. Notice that the use of leave-one-out CV, shown in Fig. 3, does not yield any improvement over analytic VM model selection. As can be expected, cross-validation outperforms classical analytic model selection (with small samples).

Medium sample size: Comparison results for medium sample size are shown in Figs. 5 and 6, for the sine-squared and piecewise-polynomial target functions, respectively. Here the VC-based model selection also tends to outperform the other methods (at the 75 and 95 percentiles); however, the difference between methods is smaller than in the small-sample case. Results in Fig. 6(a) show that the VC-based approach yields the lowest prediction risk at 95% but is slightly inferior to GCV and SC at 75%. The VM tends to choose the lowest model complexity, in comparison with other methods. Notice that VM model selection also outperforms cross-validation,
as shown in Fig. 5. Likewise, the box plots of the risk estimation accuracy suggest that the VC-method provides the lowest variability of the risk estimates.

Fig. 3. Results for the sine-squared target function with sample size of 30 and SNR = 2.5. (a) Polynomial estimation. (b) Trigonometric estimation.

Large sample size: Fig. 7 shows the results for large sample size (1000) and high noise (SNR = 0.5). As expected, all methods yield very similar prediction accuracy (due to the asymptotic setting), with the VC method having a slight edge (i.e., the lowest guaranteed prediction risk). Notice that VM selects the lowest model complexity, and provides the lowest variability of risk estimates.

Comparisons for penalized linear estimators: Figs. 8 and 9 show comparison results for penalized estimators for the small-sample case. The results are (qualitatively) similar to the nonpenalized case, in that the VC-based model selection yields the most accurate predictive models. It does so by selecting lower (estimated) DOF than the other methods. Also note that VC model selection achieves the lowest variability of the chosen estimated DOF (the width of the box plots) relative to other methods. Likewise, comparison results for different sample sizes, noise levels, etc. (not shown here due to space constraints) show the superiority of the VC-based model selection in the penalized case.

Estimating pure noise: Finally, we present a simple example of modeling pure noise (i.e., Gaussian noise with a standard deviation of one) for 30 samples, using polynomial estimators. The results shown in Fig. 10 clearly illustrate the superiority of VC-based model selection over classical analytic and resampling approaches. In fact, all classical methods (including CV) detect a structure (i.e., select DOF greater than one) when there is no structure in the data. In contrast, VC-based model selection (almost) never detects any structure in the data (i.e., chooses DOF equal to one).
Empirical results for linear/penalized linear estimators suggest that the VC method is superior to classical methods for model selection, in terms of selecting the models with the lowest worst-case prediction risk and the lowest variability of risk estimates. The difference in performance may be quite dramatic for small samples, and gradually decreases for larger samples.
Notice that the performance of classical methods is greatly affected by the random variability of (small) training samples, i.e., methods vary by as much as several orders of magnitude between the top 25% and bottom 25% of the Risk box plots. In contrast, VC-based model selection is very insensitive to random sample variations (i.e., narrow box plots).

Fig. 4. Results for the piecewise polynomial target function with sample size of 30 and SNR = 2.5. (a) Polynomial estimation. (b) Trigonometric estimation.

Based on our empirical evidence, we conclude that for small samples the prediction (estimation) accuracy of a learning method depends mainly on the model selection procedure, rather than on the choice of approximating functions. However, for large samples the prediction accuracy is determined mainly by the suitable choice of approximating functions, as all model selection approaches perform equally well asymptotically (with an infinite number of samples).

The same VC-bound (8) was successfully used for complexity control of the penalized linear estimators, where the effective degrees of freedom (estimated via the smoother matrix) was used in place of the VC-dimension. Note that, strictly speaking, we should use the VC-dimension in the bound (8). Our empirical results indirectly suggest that the effective degrees of freedom estimates the VC-dimension rather accurately. However, it is possible to measure the VC-dimension in the penalized case directly, as suggested in [20]—this is a promising research area.

IV. WHEN VC-BASED COMPLEXITY CONTROL MAY FAIL

In this section we describe a class of function estimation problems unsuitable for analytic model selection approaches (including VC-based model selection). Specifically, consider unbounded approximating functions [i.e., algebraic polynomials (19)] used to estimate difficult (i.e., discontinuous) target functions from low-noise/noiseless samples. In this setting, "good" models fitting the training data (almost) exactly would be quite complex (i.e., polynomials of high degree). Such models interpolate (between training samples) quite well, but they tend to have large boundary errors (bad extrapolation outside the range of the training data) due to the unbounded approximating functions.
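A sketch (with an illustrative step target and hypothetical constants) of the boundary-error mechanism just described: a high-degree algebraic polynomial fitted to low-noise samples of a discontinuous target is well behaved between the samples but can take very large values between the outermost samples and the ends of the [0, 1] interval.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0.0, 1.0, size=30))
t = (x > 0.5).astype(float)                 # illustrative step target
y = t + 0.01 * rng.standard_normal(30)      # "small" noise (high SNR)

coeffs = np.polyfit(x, y, deg=20)           # high-degree unbounded fit

interior = np.linspace(x.min(), x.max(), 500)   # between outermost samples
full = np.linspace(0.0, 1.0, 500)               # including the boundary gaps
print("max |f| inside the data range:", np.max(np.abs(np.polyval(coeffs, interior))))
print("max |f| on the whole [0, 1]  :", np.max(np.abs(np.polyval(coeffs, full))))
```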
Fig. 5. Results for the sine-squared target function with sample size of 100 and SNR = 1.0. (a) Polynomial estimation. (b) Trigonometric estimation.

Let us consider a typical example illustrating this phenomenon: estimating a step function from 30 samples uniform in [0,1] with a "small" Gaussian noise (SNR = 10). Algebraic polynomials (19) are used to estimate this target function. Two model selection strategies are used, namely VC-based complexity control and leave-one-out cross-validation. This experimental setup closely follows [22], which also compares VC-based complexity control with a resampling approach for this problem.

Experimental results in Fig. 11 suggest that cross-validation provides more accurate estimates than the VC-based approach in this setting, by choosing lower-degree polynomials. Similar (relative) under-performance of the VC model selection can be observed for larger sample sizes under low-noise settings, but under higher noise settings (SNR = 1 or SNR = 2) the VC-based approach in fact outperforms cross-validation. Detailed (visual) analysis of the estimates produced by cross-validation and by the VC-method suggests the following explanation: under the low-noise setting VC model selection yields very accurate interpolation (between training samples), which requires higher-order polynomials due to the discontinuous target function. These estimates tend to produce huge errors at the boundaries of the [0,1] interval, i.e., between zero and the leftmost data point, and between the rightmost data point and one. We call this effect "overfitting at the boundaries." It does not happen with noisier data (lower SNR), when the VC model selection yields lower-order polynomials (and the boundary effects become less significant). In the case of cross-validation, the boundary effects are implicitly "detected" when the training sample closest to the boundary is "left out" for validation. Namely, for high-order polynomial models the validation error for the sample near the boundary is much larger than the validation error for the other samples. Consequently, the cross-validation approach will choose lower-order polynomial estimates. In order to verify this explanation, we repeated the model selection experiments with a slight modification of cross-validation, where the left- and rightmost samples are not left out for cross-validation. In this case, as expected, cross-validation results deteriorate and become similar to the VC model selection results (see Fig. 12).
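A sketch of the verification experiment just described: standard leave-one-out CV for choosing the polynomial degree, plus the modified variant in which the left- and rightmost training samples are never left out, so the implicit detection of boundary errors is suppressed. All names are illustrative.

```python
import numpy as np

def loo_cv_score(x, y, degree, skip_boundary=False):
    """Leave-one-out CV estimate of prediction risk for a given degree.
    If skip_boundary is True, the leftmost and rightmost samples are
    never left out (the modification discussed in the text)."""
    order = np.argsort(x)
    held_out = order[1:-1] if skip_boundary else order
    errors = []
    for i in held_out:
        mask = np.ones(len(x), dtype=bool)
        mask[i] = False
        c = np.polyfit(x[mask], y[mask], deg=degree)
        errors.append((y[i] - np.polyval(c, x[i])) ** 2)
    return np.mean(errors)

def cv_select(x, y, max_degree=25, skip_boundary=False):
    """Degree minimizing the (possibly modified) leave-one-out CV score."""
    scores = [loo_cv_score(x, y, m, skip_boundary) for m in range(max_degree + 1)]
    return int(np.argmin(scores))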
Fig. 6. Results for the piecewise polynomial target function with sample size of 100 and SNR = 1.0. (a) Polynomial estimation. (b) Trigonometric estimation.

In summary, the empirical results presented in this section show possible settings where analytic model selection is inferior to resampling approaches due to overfitting at the boundaries. These settings are characterized by a combination of three factors:
1) unbounded approximating functions;
2) mismatch between the target function and the approximating functions, leading to higher-order models;
3) low-noise training data.
These boundary effects can be overcome in several (obvious) ways:
1) use bounded approximating functions, i.e., trigonometric polynomials instead of algebraic ones;
2) when using unbounded approximating functions, avoid boundary effects by considering the prediction accuracy only on the interval given by the leftmost and rightmost training samples (a sketch follows this list). This solution has the disadvantage that it cannot be easily extended to the multivariate case.
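A sketch of remedy 2): the prediction risk of a fitted polynomial is measured only on the interval spanned by the training inputs, so that extrapolation beyond the outermost samples does not enter the comparison (names are illustrative).

```python
import numpy as np

def interior_risk(target, coeffs, x_train, num_points=1000):
    """MSE between target and polynomial estimate, restricted to
    [min(x_train), max(x_train)] (remedy 2 in the list above)."""
    grid = np.linspace(np.min(x_train), np.max(x_train), num_points)
    return np.mean((target(grid) - np.polyval(coeffs, grid)) ** 2)
```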
V. SUMMARY AND DISCUSSION

Empirical results presented in this paper and elsewhere [5], [21], [23] suggest that VC generalization bounds can be successfully used for model selection with linear estimators. Besides the (linear) dictionary methods and penalized estimators considered in this paper, such linear estimators include orthogonal basis functions (i.e., wavelet estimators), considered in [21].

Our empirical findings contradict a widely held opinion that VC-bounds are too conservative to be useful in practice for model selection. We comment next on several likely causes of these misconceptions. First, the VC generalization bounds are often applied with the upper-bound estimates of the parameter values [in the bound (8)] cited from Vapnik's original books or papers [1]. For practical problems, this leads to poor (too conservative) model selection. As noted in [5], VC-theory provides an analytical form of the bounds up to the value of the constants. Reasonable selection of the constant values has to be done empirically for a given type of learning problem (e.g., classification or regression).
The second cause of misconception is using the VC bounds to estimate the generalization error of feedforward neural networks. Here the common approach is to estimate the bound on the generalization error (of a trained network) using (theoretical) estimates of the VC-dimension as a function of the number of parameters (or network weights). This generalization bound is then compared against the actual generalization error (measured empirically), and a conclusion is made regarding the poor quality of VC bounds. Here the problem is that typical network training procedures inevitably introduce a regularization effect, so the "theoretical" VC-dimension can be quite different from the "actual" VC-dimension, which takes into account the regularization effect of a training algorithm. The third problem is that VC-bounds are often applied to nonlinear estimators (i.e., neural networks), where the empirical risk cannot be reliably minimized (due to the existence of multiple local minima and saddle points).

Fig. 7. Results for the sine-squared target function with sample size of 1000 and SNR = 0.5 for polynomial estimation.

Fig. 8. Results for the piecewise polynomial target function with sample size of 30 and SNR = 2.5. (a) Penalized polynomial estimation. (b) Penalized trigonometric estimation.
Fig. 9. Results for the piecewise polynomial target function with sample size of 30 and SNR = 5.0. (a) Penalized polynomial estimation. (b) Penalized trigonometric estimation.

Fig. 11. Results for the step target function with a sample size of 30 and SNR = 10. In this setting the analytical VC-theory approach does not perform as well as the cross-validation approach, due to boundary effects.

Xuhui Shao (S'98) received the B.S. and M.S. degrees in electrical engineering from Tsinghua University, Beijing, China, in 1993 and 1995, respectively, and the Ph.D. degree in electrical engineering from the University of Minnesota, St. Paul, in 1999. He is a Staff Scientist at HNC Software Inc., San Diego, CA. His research interests include model selection, large margin classifiers, and intelligent signal processing.

Vladimir N. Vapnik, for a photograph and biography, see this issue, p. 999.