Selecting Amongst Large Classes of Models
Brian D. Ripley
Professor of Applied Statistics, University of Oxford
[email protected]
http://stats.ox.ac.uk/ripley
Manifesto
Statisticians and other users of statistical methods have been choosing models for a long time, but the current availability of large amounts of data and of computational resources means that model choice is now being done on a scale which was not dreamt of 25 years ago. Unfortunately, the practical issues are probably less widely appreciated than they used to be, as statistical software and the advent of AIC, BIC and all that have made it so much easier for the end user to trawl through literally thousands of models (and in some cases many more). Traditional distinctions between parametric and non-parametric models are often moot, when people now (attempt to) fit neural networks with half a million parameters.
Explanation vs Prediction
This causes a lot of confusion. For explanation, Occam's razor applies and we want "an explanation that is as simple as possible, but no simpler" (attributed to Einstein), and we do have a concept of a true model, or at least a model that is a good working approximation to the truth, for "all models are false, but some are useful" (G. E. P. Box, 1976).
Explanation is like doing scientific research. On the other hand, prediction is like doing engineering development: all that matters is that it works. And if the aim is prediction, model choice should be based on the quality of the predictions. Workers in pattern recognition have long recognised this, and used validation sets to choose between models, and test sets to assess the quality of the predictions from the chosen model. One of my favourite teaching examples is Ein-Dor, P. & Feldmesser, J. (1987) Attributes of the performance of central processing units: a relative performance prediction model. Communications of the ACM 30, 308–317, which despite its title selects a subset of transformed variables. The paper is a wonderful example of how not to do that, too.
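The validation/test-set discipline is easy to sketch in code. The data, variable names and candidate models below are purely illustrative (not from the paper): a model is chosen on a validation set and its predictive quality is then assessed, once, on a separate test set.

set.seed(1)
n <- 300
dat <- data.frame(x = runif(n))
dat$y <- 1 + 2*dat$x + 0.5*dat$x^2 + rnorm(n, sd = 0.3)
split <- sample(rep(c("train", "valid", "test"), length.out = n))
train <- dat[split == "train", ]
valid <- dat[split == "valid", ]
test  <- dat[split == "test", ]

m1 <- lm(y ~ x, data = train)             # candidate 1: straight line
m2 <- lm(y ~ poly(x, 2), data = train)    # candidate 2: quadratic

rmse <- function(fit, newdata) sqrt(mean((newdata$y - predict(fit, newdata))^2))
best <- if (rmse(m1, valid) < rmse(m2, valid)) m1 else m2   # choose on the validation set
rmse(best, test)                          # assess the chosen model on the test set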
Box–Cox transformations
[Figure: profile log-likelihood for the Box–Cox transformation parameter, with the 95% confidence interval marked.]
For prediction I find a good analogy is that of choosing between expert opinions: if you have access to a large panel of experts, how would you use their opinions? People do tend to pick one expert (guru) and listen to him/her, but it would seem better to seek a consensus view, which translates to model averaging rather than model choice. Our analogy is with experts, which implies some prior selection of people with a track record: one related statistical idea is Occam's window (Madigan & Raftery, 1994), which keeps only models with a reasonable record. Because the model may be used in scenarios very different from those in which it was tested, generalization is still important, and other things being equal a mechanistic model or a simple empirical model has more chance of reflecting the data-generation mechanism and so of generalizing. But other things rarely are equal.
Computational cost
A major reason to choose a model appears still to be computational cost, a viewpoint of Geisser (1993). Even if we can fit large families of models, we may have time to consider the predictions from only a few. A much-quoted example is a NIST study on reading hand-written ZIP codes, which have to be read in about 1/2 second each to be useful in a sorting machine.
Cross-validation
A much misunderstood topic!
Leave-one-out CV
The idea is that given a dataset of N points, we use our model-building procedure on each subset of size N − 1, and predict the point we left out. Then the set of predictions can be summarized by some measure of prediction accuracy. The idea goes back at least as far as Mosteller & Wallace (1963), and Allen's (1971, 1974) PRESS (prediction sum-of-squares) used this to choose a set of variables in linear regression. Stone (1974) / Geisser (1975) pointed out we could apply this to many aspects of model choice, including parameter estimation. NB: this is not jackknifing à la Quenouille and Tukey. Having to do model-building N times can be prohibitive unless there are computational shortcuts.
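For a linear model the mechanics are simple enough to sketch directly, and the least-squares case also has the closed-form shortcut that makes PRESS cheap. The function names below are illustrative, not from any package.

## Brute-force leave-one-out CV: refit the model N times, each time
## predicting the case that was left out, and sum the squared errors.
loocv_press <- function(formula, data) {
    y <- model.response(model.frame(formula, data))
    sum(sapply(seq_len(nrow(data)), function(i) {
        fit <- lm(formula, data = data[-i, ])                      # refit without case i
        (y[i] - predict(fit, newdata = data[i, , drop = FALSE]))^2
    }))
}

## For least squares the same quantity needs no refitting (Allen's PRESS):
## the sum of (residual / (1 - leverage))^2 from the single full fit.
press <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)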
V-fold cross-validation
Divide the data into V sets, and amalgamate V − 1 of them, build a model and predict the result for the remaining set. Do this V times, leaving a different set out each time. How big should V be? We want the model-building problem to be realistic, so we want to leave out a small proportion. We don't want too much work. So usually V is 3–10. One early advocate of this was the CART book (Breiman, Friedman, Olshen & Stone, 1984) and program.
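A bare-bones version for a linear model might look like the sketch below (illustrative code, not from the talk); the reps argument allows averaging over several random partitions, a point taken up again under "Does it work?".

## V-fold CV: build on V - 1 folds, predict the held-out fold, and average
## the squared prediction error; optionally repeat over random partitions.
vfold_cv <- function(formula, data, V = 10, reps = 1) {
    y <- model.response(model.frame(formula, data))
    mean(replicate(reps, {
        folds <- sample(rep(seq_len(V), length.out = nrow(data)))  # random partition
        mean(sapply(seq_len(V), function(v) {
            fit <- lm(formula, data = data[folds != v, ])          # build on V - 1 folds
            mean((y[folds == v] -
                  predict(fit, newdata = data[folds == v, , drop = FALSE]))^2)
        }))
    }))
}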
Does it work?
Leave-one-out CV does not work well in general: it makes too small changes to the fit. 10-fold CV often works well, but sometimes the result is very sensitive to the partitioning used, so we can average over several random partitions. It is often better for comparisons than for absolute values of performance. How prediction accuracy is measured can be critical.
Schwarz's (1978) criterion, often called BIC or SBC, replaces the 2 in the AIC penalty by log n, for a suitable definition of n, the size of the dataset. In the original regression context this is just the number of cases. BIC was anticipated by work of Harold Jeffreys in the 1930s.
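In R the two criteria differ only in the penalty multiplier k, as a quick check shows (the cars data are just a convenient stand-in):

fit <- lm(dist ~ speed, data = cars)
AIC(fit)                        # penalty 2 per estimated parameter
AIC(fit, k = log(nrow(cars)))   # penalty log n per parameter: the same value as BIC(fit)
BIC(fit)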
Derivation of AIC
Suppose we have a dataset of size N, and we fit a model to it by maximum likelihood, and measure the fit by the deviance D (constant minus twice the maximized log-likelihood). Suppose we have m (finite) nested models. Hypothetically, suppose we have another dataset of the same size, and we compute the deviance D* for that dataset at the MLE for the first dataset. We would expect that D* would be bigger than D, on average. In between would be the value D0 if we had evaluated the deviance at the true parameter values. Some Taylor-series expansions show that

E D* − E D0 ≈ p,    E D0 − E D ≈ p
and hence AIC = D + 2p is (to this order) an unbiased estimator of E D*. And that is a reasonable measure of performance, the Kullback–Leibler divergence between the true model and the plug-in model (at the MLE). These expectations are over the dataset under the assumed model.
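The bookkeeping can be checked in a toy simulation. The sketch below assumes Gaussian linear regression with known unit variance, so that the deviance reduces to a residual sum of squares; the gap between the new-data deviance and the training deviance then averages close to 2p.

set.seed(1)
N <- 100; p <- 5
X <- cbind(1, matrix(rnorm(N * (p - 1)), N))    # fixed design with p coefficients
beta <- rep(1, p)
gap <- replicate(5000, {
    y     <- drop(X %*% beta) + rnorm(N)        # training data
    ynew  <- drop(X %*% beta) + rnorm(N)        # independent dataset, same design
    bhat  <- qr.solve(X, y)                     # least squares = ML estimate
    D     <- sum((y    - X %*% bhat)^2)         # deviance of training data at the MLE
    Dstar <- sum((ynew - X %*% bhat)^2)         # deviance of new data at the old MLE
    Dstar - D
})
mean(gap)                                       # close to 2p = 10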
Crucial assumptions
1. The model is true! Suppose we use this to select the order of an AR(p) model. If the data really came from an AR(p0) model, all models with p ≥ p0 are true, but those with p < p0 are not even approximately true (see the sketch after this list). This assumption can be relaxed: Takeuchi (1976) did so, and his result has been rediscovered by Stone (1977) and many times since; p gets replaced by a much more complicated formula.
2. The models are nested. AIC is widely used when they are not.
3. Fitting is by maximum likelihood. Nowadays many models are fitted by penalized methods or Bayesian averaging ... That can be worked through too, in NIC or Moody's p_eff.
4. The Taylor-series approximations are adequate. People have tried various refinements, notably AICC (or AICc) given by

AICC = D + 2p · N/(N − p − 1)

Also, the MLEs need to be in the interior of the parameter space, even when a simpler or alternative model is true. (Not likely to be true for variance components, for example.)
5. AIC is a reasonably good estimator of E D*, or at least differences between models in AIC are reasonably good estimators of differences in E D*. This seems the Achilles heel of AIC: AIC = Op(N) but its variability as an estimate is Op(√N). This reduces to Op(1) for differences between models provided they are nested.
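As a quick illustration of assumption 1, here is a sketch in R (the coefficients and sample size are illustrative): simulate an AR(2) series and let AIC choose the order among ML fits; orders below 2 are not even approximately true, and the chosen order is typically 2 or slightly more.

set.seed(2)
x <- arima.sim(model = list(ar = c(0.6, -0.3)), n = 500)   # data truly AR(2)
fit <- ar(x, order.max = 8, aic = TRUE, method = "mle")    # order chosen by minimising AIC
fit$order                                                  # usually 2, occasionally larger
fit$aic                                                    # AIC differences from the minimum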
AIC has been criticised in asymptotic studies and simulation studies for tending to over-fit, that is, to choose a model at least as large as the true model. That is a virtue, not a deficiency: this is a prediction-based criterion, not an explanation-based one. AIC is asymptotically equivalent to leave-one-out CV for iid samples using the deviance as the loss function (Stone, 1977), and in fact even when the model is not true NIC is equivalent (Ripley, 1996).
Bayesian approaches
Note the plural: I think Bayesians are rarely Bayesian in their model choices. Assume M (finite) models, exactly one of which is true. In the Bayesian formulation, models are compared via P{M | T}, the posterior probability assigned to model M:

P{M | T} ∝ p(T | M) p_M,    p(T | M) = ∫ p(T | M, θ) p(θ) dθ
so the ratio in comparing models M1 and M2 is proportional to p(T | M2)/p(T | M1), known as the Bayes factor. However, a formal Bayesian approach then averages predictions from models, weighting by P{M | T}, unless a very peculiar loss function is in use. And this has been used for a long time, despite recent attempts to claim the credit for Bayesian Model Averaging.
Suppose we just use the Bayes factor as a guide. The difficulty is in evaluating p(T | M). Asymptotics are not useful for Bayesian methods, as the prior on θ is often very important in providing smoothing, yet asymptotically negligible. We can expand out the log posterior density via a Laplace approximation and drop various terms, eventually reaching

log p(T | M) ≈ L(θ̂; T) − (1/2) log |H|

where L is the maximized log-likelihood, H is the Hessian of the log-likelihood, and we needed to assume that the prior is very diffuse. For an iid random sample of size n from the assumed model, the penalty might be roughly proportional to ((1/2) log n) p, provided the parameters are identifiable. This is Schwarz's BIC up to a factor of two. As with AIC, the model with minimal BIC is chosen.
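In practice BIC differences are often used to approximate the Bayes factor; a minimal sketch (with the cars data again as a stand-in, and implicitly assuming the diffuse priors above):

m1 <- lm(dist ~ speed, data = cars)
m2 <- lm(dist ~ poly(speed, 2), data = cars)
dBIC <- BIC(m2) - BIC(m1)
exp(-dBIC / 2)     # approximate Bayes factor for M2 against M1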
Crucial assumptions
1. The data were derived as an iid sample. (What about e.g. random effects models?) (Originally for linear models only.)
2. Choosing a single model is relevant in the Bayesian approach.
3. The model is true.
4. The prior can be neglected. We may not obtain much information about parameters which are rarely effective, even in very large samples.
5. The simple asymptotics are adequate, and the rate of data collection on each parameter would be the same. We should be interested in comparing different models for the same N, and in many problems p will be comparable with N.
Note that as this is trying to choose an explanation, we would expect it neither to over-fit nor under-fit, and there is some theoretical support for that. There are other (semi-)Bayesian approaches, including DIC.
Model averaging
For prediction purposes (and that applies to almost all Bayesians) we should average the predictions over models; we do not choose a single model. What do we average? The probability predictions made by the models. For linear regression this amounts to averaging the coefficients over the models (being zero where a regressor is excluded), and this becomes a form of shrinkage. [Other forms of shrinkage like ridge regression may be as good at very much lower computational cost.] Note that we may not want to average over all models. We may want to choose a subset for computational reasons, or for plausibility.
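As a small illustration of averaging predictions rather than picking one model, the sketch below weights a handful of polynomial fits by Akaike weights (a stand-in for posterior model probabilities; the models and prediction grid are just for illustration):

library(MASS)                                        # for the GAGurine data used below
cand <- lapply(1:4, function(d) lm(GAG ~ poly(Age, d), data = GAGurine))
aic  <- sapply(cand, AIC)
w    <- exp(-(aic - min(aic)) / 2); w <- w / sum(w)  # Akaike weights
new  <- data.frame(Age = seq(0, 17, by = 0.5))
pred <- sapply(cand, predict, newdata = new)         # one column of predictions per model
avg  <- drop(pred %*% w)                             # model-averaged prediction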
[Figure: scatterplot of GAG against Age for the GAGurine data.]
Clearly we want to fit a smooth curve. What? Polynomial? Exponential? Choosing the degree of a polynomial by F-tests gives degree 6.
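That degree-by-F-tests comparison can be reproduced along the following lines (a sketch; GAGurine is in package MASS):

library(MASS)
fits <- lapply(1:8, function(d) lm(GAG ~ poly(Age, d), data = GAGurine))
do.call(anova, fits)      # sequential F-tests between nested polynomial fits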
[Figure: the GAGurine scatterplot with the chosen degree-6 polynomial fit superimposed.]
Is this good enough? Smoothing splines would be the numerical analyst's way to fit a smooth curve to such a scatterplot. The issue is how smooth, and in this example the smoothness has been chosen automatically by GCV.
> plot(GAGurine, pch = 20)                               # GAGurine is in package MASS
> lines(smooth.spline(Age, GAG), lwd = 3, col = "blue")  # assumes Age and GAG are visible, e.g. via attach(GAGurine)
An alternative would be local polynomials, using a kernel to define "local" and choosing the bandwidth automatically.
> library(KernSmooth)     # dpill and locpoly come from KernSmooth
> plot(GAGurine, pch = 20)
> (h <- dpill(Age, GAG))  # direct plug-in bandwidth selector
> lines(locpoly(Age, GAG, degree = 0, bandwidth = h))
> lines(locpoly(Age, GAG, degree = 1, bandwidth = h), lty = 3)
> lines(locpoly(Age, GAG, degree = 2, bandwidth = h), lty = 4)
[Figure: local polynomial fits of degrees 0, 1 and 2 to the GAGurine data.]
[Figure: p-value image of a single fMRI brain slice, thresholded to show p-values below 10^-4 and overlaid onto an image of the slice; the colour scale runs from 10^-4 to 10^-10. Colours indicate differential responses within each cluster. An area of activation is shown in the visual cortex.]
Formal training/validation/test sets, or the cross-validatory equivalents, are a very general and safe approach. Regression diagnostics are often based on approximations to over-fitting or case deletion. Now we can (and some of us do) fit extended models with smooth terms, or use fitting algorithms that downweight groups of points. (I rarely use least squares these days.) It is still all too easy to select a complex model just to account for a tiny proportion of aberrant observations. Alternative explanations with roughly equal support are commonplace. Model averaging seems a good solution. Selecting several models, studying their predictions and taking a consensus is also a good idea, when time permits and when non-quantitative information is available.
Epilogue
My memory (which I hope is reliable enough) is that I first encountered Nelder as a commentator in an ornithology journal, playing Sherlock Holmes over the suspiciously large number of rare birds reported from near Hastings at around the turn of the 20th century. My friend and co-author Bill Venables (an avid birdwatcher) tells me John is celebrating his 80th birthday by birdwatching in Australia, including visiting Kakadu National Park in NT (highly recommended from our 2003 visit). So here is a little practice, with an Australian bias.