Selecting Amongst Large Classes of Models
Brian D. Ripley
Professor of Applied Statistics, University of Oxford
[email protected]
http://stats.ox.ac.uk/ripley
Manifesto
Statisticians and other users of statistical methods have been choosing models for a long time, but the current availability of large amounts of data and of computational resources means that model choice is now being done on a scale which was not dreamt of 25 years ago. Unfortunately, the practical issues are probably less widely appreciated than they used to be, as statistical software and the advent of AIC, BIC and all that have made it so much easier for the end user to trawl through literally thousands of models (and in some cases many more). Traditional distinctions between parametric and non-parametric models are often moot, when people now (attempt to) fit neural networks with half a million parameters.
Explanation vs Prediction
This causes a lot of confusion. For explanation, Occam's razor applies and we want "an explanation that is as simple as possible, but no simpler" (attributed to Einstein), and we do have a concept of a true model, or at least a model that is a good working approximation to the truth, for "all models are false, but some are useful" (G. E. P. Box, 1976).
Explanation is like doing scientific research. On the other hand, prediction is like doing engineering development: all that matters is that it works. And if the aim is prediction, model choice should be based on the quality of the predictions. Workers in pattern recognition have long recognised this, and used validation sets to choose between models, and test sets to assess the quality of the predictions from the chosen model. One of my favourite teaching examples is Ein-Dor, P. & Feldmesser, J. (1987) Attributes of the performance of central processing units: a relative performance prediction model. Communications of the ACM 30, 308–317, which despite its title selects a subset of transformed variables. The paper is a wonderful example of how not to do that, too.
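The validation/test-set discipline is easy to sketch in code. The data, variable names and candidate models below are purely illustrative (not from the paper): a model is chosen on a validation set and its predictive quality is then assessed, once, on a separate test set.

set.seed(1)
n <- 300
dat <- data.frame(x = runif(n))
dat$y <- 1 + 2*dat$x + 0.5*dat$x^2 + rnorm(n, sd = 0.3)
split <- sample(rep(c("train", "valid", "test"), length.out = n))
train <- dat[split == "train", ]
valid <- dat[split == "valid", ]
test  <- dat[split == "test", ]

m1 <- lm(y ~ x, data = train)             # candidate 1: straight line
m2 <- lm(y ~ poly(x, 2), data = train)    # candidate 2: quadratic

rmse <- function(fit, newdata) sqrt(mean((newdata$y - predict(fit, newdata))^2))
best <- if (rmse(m1, valid) < rmse(m2, valid)) m1 else m2   # choose on the validation set
rmse(best, test)                          # assess the chosen model on the test set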
Box–Cox transformations
[Figure: profile log-likelihood for the Box–Cox transformation parameter, with the 95% confidence interval marked.]
For prediction I find a good analogy is that of choosing between expert opinions: if you have access to a large panel of experts, how would you use their opinions? People do tend to pick one expert (guru) and listen to him/her, but it would seem better to seek a consensus view, which translates to model averaging rather than model choice. Our analogy is with experts, which implies some prior selection of people with a track record: one related statistical idea is Occam's window (Madigan & Raftery, 1994), which keeps only models with a reasonable record. Because the model may be used in scenarios very different from those in which it was tested, generalization is still important, and other things being equal a mechanistic model or a simple empirical model has more chance of reflecting the data-generation mechanism and so of generalizing. But other things rarely are equal.
Computational cost
A major reason to choose a model appears still to be computational cost, a viewpoint of Geisser (1993). Even if we can fit large families of models, we may have time to consider the predictions from only a few. A much-quoted example is a NIST study on reading hand-written ZIP codes, which have to be read in about 1/2 second each to be useful in a sorting machine.
Cross-validation
A much misunderstood topic!
Leave-one-out CV
The idea is that given a dataset of N points, we use our model-building procedure on each subset of size N − 1, and predict the point we left out. Then the set of predictions can be summarized by some measure of prediction accuracy. The idea goes back at least as far as Mosteller & Wallace (1963), and Allen's (1971, 1974) PRESS (prediction sum-of-squares) used this to choose a set of variables in linear regression. Stone (1974) / Geisser (1975) pointed out we could apply this to many aspects of model choice, including parameter estimation. NB: this is not jackknifing à la Quenouille and Tukey. Having to do model-building N times can be prohibitive unless there are computational shortcuts.
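For a linear model the mechanics are simple enough to sketch directly, and the least-squares case also has the closed-form shortcut that makes PRESS cheap. The function names below are illustrative, not from any package.

## Brute-force leave-one-out CV: refit the model N times, each time
## predicting the case that was left out, and sum the squared errors.
loocv_press <- function(formula, data) {
    y <- model.response(model.frame(formula, data))
    sum(sapply(seq_len(nrow(data)), function(i) {
        fit <- lm(formula, data = data[-i, ])                      # refit without case i
        (y[i] - predict(fit, newdata = data[i, , drop = FALSE]))^2
    }))
}

## For least squares the same quantity needs no refitting (Allen's PRESS):
## the sum of (residual / (1 - leverage))^2 from the single full fit.
press <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)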
V-fold cross-validation
Divide the data into V sets, and amalgamate V − 1 of them, build a model and predict the result for the remaining set. Do this V times, leaving a different set out each time. How big should V be? We want the model-building problem to be realistic, so we want to leave out a small proportion. We don't want too much work. So usually V is 3–10. One early advocate of this was the CART book (Breiman, Friedman, Olshen & Stone, 1984) and program.
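A bare-bones version for a linear model might look like the sketch below (illustrative code, not from the talk); the reps argument allows averaging over several random partitions, a point taken up again under "Does it work?".

## V-fold CV: build on V - 1 folds, predict the held-out fold, and average
## the squared prediction error; optionally repeat over random partitions.
vfold_cv <- function(formula, data, V = 10, reps = 1) {
    y <- model.response(model.frame(formula, data))
    mean(replicate(reps, {
        folds <- sample(rep(seq_len(V), length.out = nrow(data)))  # random partition
        mean(sapply(seq_len(V), function(v) {
            fit <- lm(formula, data = data[folds != v, ])          # build on V - 1 folds
            mean((y[folds == v] -
                  predict(fit, newdata = data[folds == v, , drop = FALSE]))^2)
        }))
    }))
}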
Does it work?
Leave-one-out CV does not work well in general: it makes too small changes to the fit. 10-fold CV often works well, but sometimes the result is very sensitive to the partitioning used, so we can average over several random partitions. It is often better for comparisons than for absolute values of performance. How prediction accuracy is measured can be critical.
Schwarz's (1978) criterion, often called BIC or SBC, replaces the 2 in the AIC penalty by log n, for a suitable definition of n, the size of the dataset. In the original regression context this is just the number of cases. BIC was anticipated by work of Harold Jeffreys in the 1930s.
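In R the two criteria differ only in the penalty multiplier k, as a quick check shows (the cars data are just a convenient stand-in):

fit <- lm(dist ~ speed, data = cars)
AIC(fit)                        # penalty 2 per estimated parameter
AIC(fit, k = log(nrow(cars)))   # penalty log n per parameter: the same value as BIC(fit)
BIC(fit)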
Derivation of AIC
Suppose we have a dataset of size N, and we fit a model to it by maximum likelihood, and measure the fit by the deviance D (constant minus twice the maximized log-likelihood). Suppose we have m (finite) nested models. Hypothetically, suppose we have another dataset of the same size, and we compute the deviance D* for that dataset at the MLE for the first dataset. We would expect that D* would be bigger than D, on average. In between would be the value D0 if we had evaluated the deviance at the true parameter values. Some Taylor-series expansions show that

E D* − E D0 ≈ p,    E D0 − E D ≈ p
and hence AIC = D + 2p is (to this order) an unbiased estimator of E D*. And that is a reasonable measure of performance, the Kullback–Leibler divergence between the true model and the plug-in model (at the MLE). These expectations are over the dataset under the assumed model.
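The bookkeeping can be checked in a toy simulation. The sketch below assumes Gaussian linear regression with known unit variance, so that the deviance reduces to a residual sum of squares; the gap between the new-data deviance and the training deviance then averages close to 2p.

set.seed(1)
N <- 100; p <- 5
X <- cbind(1, matrix(rnorm(N * (p - 1)), N))    # fixed design with p coefficients
beta <- rep(1, p)
gap <- replicate(5000, {
    y     <- drop(X %*% beta) + rnorm(N)        # training data
    ynew  <- drop(X %*% beta) + rnorm(N)        # independent dataset, same design
    bhat  <- qr.solve(X, y)                     # least squares = ML estimate
    D     <- sum((y    - X %*% bhat)^2)         # deviance of training data at the MLE
    Dstar <- sum((ynew - X %*% bhat)^2)         # deviance of new data at the old MLE
    Dstar - D
})
mean(gap)                                       # close to 2p = 10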
Crucial assumptions
1. The model is true! Suppose we use this to select the order of an AR(p) model. If the data really came from an AR(p0) model, all models with p ≥ p0 are true, but those with p < p0 are not even approximately true (see the sketch after this list). This assumption can be relaxed: Takeuchi (1976) did so, and his result has been rediscovered by Stone (1977) and many times since; p gets replaced by a much more complicated formula.
2. The models are nested. AIC is widely used when they are not.
3. Fitting is by maximum likelihood. Nowadays many models are fitted by penalized methods or Bayesian averaging ... That can be worked through too, in NIC or Moody's p_eff.
4. The Taylor-series approximations are adequate. People have tried various refinements, notably AICC (or AICc) given by

AICC = D + 2p · N/(N − p − 1)

Also, the MLEs need to be in the interior of the parameter space, even when a simpler or alternative model is true. (Not likely to be true for variance components, for example.)
5. AIC is a reasonably good estimator of E D*, or at least differences between models in AIC are reasonably good estimators of differences in E D*. This seems the Achilles heel of AIC: AIC = Op(N) but its variability as an estimate is Op(√N). This reduces to Op(1) for differences between models provided they are nested.
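As a quick illustration of assumption 1, here is a sketch in R (the coefficients and sample size are illustrative): simulate an AR(2) series and let AIC choose the order among ML fits; orders below 2 are not even approximately true, and the chosen order is typically 2 or slightly more.

set.seed(2)
x <- arima.sim(model = list(ar = c(0.6, -0.3)), n = 500)   # data truly AR(2)
fit <- ar(x, order.max = 8, aic = TRUE, method = "mle")    # order chosen by minimising AIC
fit$order                                                  # usually 2, occasionally larger
fit$aic                                                    # AIC differences from the minimum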
AIC has been criticised in asymptotic studies and simulation studies for tending to over-fit, that is, to choose a model at least as large as the true model. That is a virtue, not a deficiency: this is a prediction-based criterion, not an explanation-based one. AIC is asymptotically equivalent to leave-one-out CV for iid samples using the deviance as the loss function (Stone, 1977), and in fact even when the model is not true NIC is equivalent (Ripley, 1996).
Bayesian approaches
Note the plural: I think Bayesians are rarely Bayesian in their model choices. Assume M (finite) models, exactly one of which is true. In the Bayesian formulation, models are compared via P{M | T}, the posterior probability assigned to model M:

P{M | T} ∝ p(T | M) p_M,    p(T | M) = ∫ p(T | M, θ) p(θ) dθ
so the ratio in comparing models M1 and M2 is proportional to p(T | M2)/p(T | M1), known as the Bayes factor. However, a formal Bayesian approach then averages predictions from models, weighting by P{M | T}, unless a very peculiar loss function is in use. And this has been used for a long time, despite recent attempts to claim the credit for Bayesian Model Averaging.
Suppose we just use the Bayes factor as a guide. The difficulty is in evaluating p(T | M). Asymptotics are not useful for Bayesian methods, as the prior on θ is often very important in providing smoothing, yet asymptotically negligible. We can expand out the log posterior density via a Laplace approximation and drop various terms, eventually reaching

log p(T | M) ≈ L(θ̂; T) − (1/2) log |H|

where L is the maximized log-likelihood, H is the Hessian of the log-likelihood, and we needed to assume that the prior is very diffuse. For an iid random sample of size n from the assumed model, the penalty might be roughly proportional to ((1/2) log n) p, provided the parameters are identifiable. This is Schwarz's BIC up to a factor of two. As with AIC, the model with minimal BIC is chosen.
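In practice BIC differences are often used to approximate the Bayes factor; a minimal sketch (with the cars data again as a stand-in, and implicitly assuming the diffuse priors above):

m1 <- lm(dist ~ speed, data = cars)
m2 <- lm(dist ~ poly(speed, 2), data = cars)
dBIC <- BIC(m2) - BIC(m1)
exp(-dBIC / 2)     # approximate Bayes factor for M2 against M1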
Crucial assumptions
1. The data were derived as an iid sample. (What about e.g. random effects models?) (Originally for linear models only.)
2. Choosing a single model is relevant in the Bayesian approach.
3. The model is true.
4. The prior can be neglected. We may not obtain much information about parameters which are rarely effective, even in very large samples.
5. The simple asymptotics are adequate, and the rate of data collection on each parameter would be the same. We should be interested in comparing different models for the same N, and in many problems p will be comparable with N.
Note that as this is trying to choose an explanation, we would expect it neither to over-fit nor under-fit, and there is some theoretical support for that. There are other (semi-)Bayesian approaches, including DIC.
Model averaging
For prediction purposes (and that applies to almost all Bayesians) we should average the predictions over models; we do not choose a single model. What do we average? The probability predictions made by the models. For linear regression this amounts to averaging the coefficients over the models (being zero where a regressor is excluded), and this becomes a form of shrinkage. [Other forms of shrinkage like ridge regression may be as good at very much lower computational cost.] Note that we may not want to average over all models. We may want to choose a subset for computational reasons, or for plausibility.
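As a small illustration of averaging predictions rather than picking one model, the sketch below weights a handful of polynomial fits by Akaike weights (a stand-in for posterior model probabilities; the models and prediction grid are just for illustration):

library(MASS)                                        # for the GAGurine data used below
cand <- lapply(1:4, function(d) lm(GAG ~ poly(Age, d), data = GAGurine))
aic  <- sapply(cand, AIC)
w    <- exp(-(aic - min(aic)) / 2); w <- w / sum(w)  # Akaike weights
new  <- data.frame(Age = seq(0, 17, by = 0.5))
pred <- sapply(cand, predict, newdata = new)         # one column of predictions per model
avg  <- drop(pred %*% w)                             # model-averaged prediction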
[Figure: scatterplot of GAG against Age for the GAGurine data.]
Clearly we want to fit a smooth curve. What? Polynomial? Exponential? Choosing the degree of a polynomial by F-tests gives degree 6.
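That degree-by-F-tests comparison can be reproduced along the following lines (a sketch; GAGurine is in package MASS):

library(MASS)
fits <- lapply(1:8, function(d) lm(GAG ~ poly(Age, d), data = GAGurine))
do.call(anova, fits)      # sequential F-tests between nested polynomial fits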
[Figure: the GAGurine scatterplot with the chosen degree-6 polynomial fit superimposed.]
Is this good enough? Smoothing splines would be the numerical analyst's way to fit a smooth curve to such a scatterplot. The issue is how smooth, and in this example the smoothness has been chosen automatically by GCV.
> plot(GAGurine, pch = 20)                               # GAGurine is in package MASS
> lines(smooth.spline(Age, GAG), lwd = 3, col = "blue")  # assumes Age and GAG are visible, e.g. via attach(GAGurine)
An alternative would be local polynomials, using a kernel to define "local" and choosing the bandwidth automatically.
> library(KernSmooth)     # dpill and locpoly come from KernSmooth
> plot(GAGurine, pch = 20)
> (h <- dpill(Age, GAG))  # direct plug-in bandwidth selector
> lines(locpoly(Age, GAG, degree = 0, bandwidth = h))
> lines(locpoly(Age, GAG, degree = 1, bandwidth = h), lty = 3)
> lines(locpoly(Age, GAG, degree = 2, bandwidth = h), lty = 4)
[Figure: local polynomial fits of degrees 0, 1 and 2 to the GAGurine data.]
[Figure: p-value image of a single fMRI brain slice, thresholded to show p-values below 10^-4 and overlaid onto an image of the slice; the colour scale runs from 10^-4 to 10^-10. Colours indicate differential responses within each cluster. An area of activation is shown in the visual cortex.]
Formal training/validation/test sets, or the cross-validatory equivalents, are a very general and safe approach. Regression diagnostics are often based on approximations to over-fitting or case deletion. Now we can (and some of us do) fit extended models with smooth terms, or use fitting algorithms that downweight groups of points. (I rarely use least squares these days.) It is still all too easy to select a complex model just to account for a tiny proportion of aberrant observations. Alternative explanations with roughly equal support are commonplace. Model averaging seems a good solution. Selecting several models, studying their predictions and taking a consensus is also a good idea, when time permits and when non-quantitative information is available.
Epilogue
My memory (which I hope is reliable enough) is that I first encountered Nelder as a commentator in an ornithology journal, playing Sherlock Holmes over the suspiciously large number of rare birds reported from near Hastings at around the turn of the 20th century. My friend and co-author Bill Venables (an avid birdwatcher) tells me John is celebrating his 80th birthday by birdwatching in Australia, including visiting Kakadu National Park in NT (highly recommended from our 2003 visit). So here is a little practice, with an Australian bias.