About Model Selection

Guillaume Rochefort-Maranda
Contents

1 Introduction
2 Selecting a Model
  2.1 Constructing the Data Set
  2.2 Fitting a Polynomial Regression
  2.3 Fitting a Kernel Regression
4 Conclusion
1 Introduction
In the first section of this paper, I construct a data set and explain how we can choose a regression model with an additive error term and a linear smoother by using a parametric (polynomial regression) and a nonparametric (kernel regression) approach. This allows me to discuss five different concepts of simplicity in the second section. The R code needed to recreate the results of the analyses is included in the annex.
2 Selecting a Model
2.1 Constructing the Data Set
Figure 1: The Data Set
y_i = f(x_i) + \epsilon_i, \qquad \epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)
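The data-generating details (the exact f, the range of the x values, and the noise level) are not reproduced in this extract. As a minimal sketch, here is one way a data set of this form, with 200 observations, could be simulated in R; the particular f, x range, and sigma below are placeholders rather than the author's choices:

# simulate 200 observations of the form y = f(x) + noise (placeholder choices)
set.seed(1)
n     <- 200
f     <- function(x) exp(-(x - 15)^2/20)   # placeholder "true" function
x     <- runif(n, 0, 30)                   # placeholder range for x
sigma <- 0.1                               # placeholder noise level
y     <- f(x) + rnorm(n, 0, sigma)
d     <- data.frame(x, y)
plot(x, y, col = "dark blue", pch = 19)
curve(f, from = min(x), to = max(x), col = "red", lwd = 3, add = TRUE)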
To estimate f(x), I will compare a polynomial regression model with a kernel regression model. Both regressions are similar in the sense that their respective estimate \hat{f}(x) of the function f(x), evaluated on the observed data, can be defined with a linear operator S (a linear smoother) that does not depend on y:

\hat{f}(x) = Sy

However, the two regressions differ in the sense that a polynomial regression is a parametric model and a kernel regression is a nonparametric model.
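As a concrete illustration (a sketch in R, using the x and y vectors of the data set), the smoother S of a polynomial regression is the usual hat matrix, which is built from x alone:

X <- cbind(1, x, x^2, x^3)                # design matrix of a 3rd-order polynomial
S <- X %*% solve(t(X) %*% X) %*% t(X)     # the linear smoother; it does not involve y
fhat <- S %*% y                           # same fitted values as lm(y ~ x + I(x^2) + I(x^3))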
This distinction and its implications will become clearer. For the purpose of this paper, it is sufficient to say that a parametric model yields an estimate \hat{f}(x) such that we only need to know its parameters in order to compute it for any given x. A nonparametric model, on the other hand, provides an estimate \hat{f}(x) such that we always need to know the observations in our data set in order to compute \hat{f}(x) for any given x.
Before we move on with the estimation of f(x), it is also worth mentioning that I did not construct that function naively. f(x) has some properties that will highlight an important difference between the parametric and the nonparametric approach. It will help me to illustrate a way in which simplicity can lead to a better approximation of the truth.
2.2 Fitting a Polynomial Regression

To fit a polynomial regression, we must first assume that f(x) has the following form:

f(x) = \sum_{k=0}^{p} \beta_k x^k
Secondly, we need to estimate the parameters \beta_k. We can do so by solving the following equation, which determines the parameters that will minimise the square of the difference between the observed y and f(x) [1]:

\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \, (y - f(x))^2
Finally, we need to figure out the number of parameters p that will determine the best estimate for f(x) out of all the possible polynomial regression models that can fit the data set.
To understand the nature of this challenge, let us compare two different models. Figure 2 represents an estimate of f(x) provided by a model with 4 adjustable parameters. Figure 3, on the other hand, represents an estimate provided by a model with 11 adjustable parameters. In the two figures the orange dashed line represents the estimate of the polynomial regression; the red line, f(x); and the small blue dots, the data.

The question is to determine the best model out of the two. One intuitive criterion would be to compare the mean squared error for each model by using all the observations (x, y) in our data set. This quantity is called the training mean squared error (MSE_train):
\mathrm{MSE}_{\mathrm{train}} = \frac{1}{200} \sum_{i=1}^{200} \left( y_i - \hat{f}(x_i) \right)^2
[1] Under the assumptions made in section 2.1, least squares estimates and maximum likelihood estimates are the same.
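In R, the training MSE can be computed directly from a fitted model's residuals; for instance, for the 3rd-order fit reg3 of the annex:

mse_train <- mean((y - reg3$fitted.values)^2)   # MSE_train of the 3rd-order model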
Figure 2: Polynomial Regression, Adjustable Parameters=4
Accordingly, one might conclude that the second model is better than the first because the MSE_train for the second model is smaller. However, what we really care about is how well a model predicts observations (x_(new), y_(new)) that were not used to fit it. This is measured by the test mean squared error:

\mathrm{MSE}_{\mathrm{test}} = \frac{1}{n} \sum_{i=1}^{n} \left( y_{i(\mathrm{new})} - \hat{f}(x_{i(\mathrm{new})}) \right)^2
Unfortunately, the model that has the smallest MSE_train is not necessarily the one that has the smallest MSE_test. For example, when we are trying to fit a polynomial regression model, we can decrease MSE_train and increase MSE_test by adding too many adjustable parameters to our model (i.e., parameters whose values are not fixed before we fit the model to the data). When this happens, we say that our model is overfitting the data.
A more judicious choice would be to compute MSE_test directly with an independent data set. But in practice, we do not always have the luxury of an independent data set that we are willing to leave out of the construction of our model. A more common approach is to choose the model that minimises the sum of MSE_train and a penalty for the complexity of the model, where complexity is measured by the number of adjustable parameters k. The goal is to choose a model that does not overfit the data.
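If an independent test set were available, MSE_test could be estimated directly. A minimal sketch, where x_new and y_new stand for hypothetical observations generated by the same process but not used to fit the model:

pred_new <- predict(reg3, newdata = data.frame(x = x_new))   # 3rd-order model from the annex
mse_test <- mean((y_new - pred_new)^2)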
The Akaike Information Criterion (AIC) is one of the many criteria that implement that idea. For this analysis (under the assumptions made in section 2.1), the AIC can be expressed as follows:

\mathrm{AIC} = 200 \log\!\left( \frac{1}{200} \sum_{i=1}^{200} \left( y_i - \hat{f}(x_i) \right)^2 \right) + 2k
Another option is to estimate MSE_test with leave-one-out cross-validation (CV), where \hat{f}^{(-i)} denotes the estimate obtained without the i-th observation:

\mathrm{CV} = \frac{1}{200} \sum_{i=1}^{200} \left( y_i - \hat{f}^{(-i)}(x_i) \right)^2
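The annex computes CV by refitting each model 200 times, once without each observation. For a least-squares linear smoother there is a shortcut that avoids refitting: the leave-one-out residual equals the ordinary residual divided by (1 - S_ii). A sketch for the 3rd-order model of the annex, using R's hatvalues() to extract the diagonal of the hat matrix:

cv_reg3 <- mean(((y - reg3$fitted.values)/(1 - hatvalues(reg3)))^2)

The kernel-regression code in the annex uses the same shortcut with the diagonal of its smoother matrix.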
If we use both criteria to make our choice, we will find that the second model has a smaller AIC (-618.6998 < -578.8056) and a smaller CV score (0.04445178 < 0.05536721). In fact, further exploration indicates that the second model is the best polynomial model according to both criteria.
2.3 Fitting a Kernel Regression
Now that we have found our best polynomial model (given CV and AIC), let's try to find the best kernel regression model. As we will see, this task will be significantly different. When we constructed the polynomial regression model we used what is called a 'top-down' approach. We determined a priori the form of our estimate of f(x) and then tried to find the values of its adjustable parameters that best fit the data set. In other words, our estimate of f(x) was limited to the family of polynomial functions.

On the other hand, when we wish to fit a kernel regression model, we do not make such strong a priori restrictions about the form of f(x). In fact, we construct an estimate for f(x) with the assumption that close-by x values must have similar y values. This approach is said to be 'bottom-up' because the estimate will depend more heavily on the observations that we have made.
To be more precise, for any given x_0, a kernel regression will provide a weighted mean value of all the observed y values that are within a certain range h from x_0. Its expression can be written as follows, where K is some unspecified kernel:

\hat{f}(x_0) = \sum_{i=1}^{200} \frac{K\!\left(\frac{x_0 - x_i}{h}\right)}{\sum_{i=1}^{200} K\!\left(\frac{x_0 - x_i}{h}\right)} \, y_i
A kernel is a function that determines the weight of the nearby observations. In this paper, I will use an Epanechnikov kernel. It is defined as follows:
K(u) = \begin{cases} \frac{3}{4}\left(1 - u^2\right), & \text{if } |u| \le 1 \\ 0, & \text{otherwise} \end{cases}
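For concreteness, here is a direct R transcription of this estimator (the same kernel function, ep, appears in the annex); x and y are the observed vectors and h the bandwidth:

ep <- function(u) (3/4)*(1 - u^2)*(abs(u) <= 1)   # Epanechnikov kernel

fhat_kernel <- function(x0, h){
  w <- ep((x0 - x)/h)    # weights of the observations around x0
  sum(w*y)/sum(w)        # weighted mean of the observed y values
}

# e.g. fhat_kernel(x0 = x[1], h = 1.223) evaluates the estimate at the first observation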
The challenge here will be to find the appropriate value for h. If h is too small, our estimate of f(x) will overfit the data. But if h is too large, our estimate will tend to take the form of a horizontal line and the fit with the data set that we have will be awful. Just like in the parametric context, we will not be able to rely on MSE_train to choose our model. But we will be able to rely on the AIC and CV.

If we use CV, we find that the best estimate is obtained with h = 1.223. The CV score associated with that h is 0.04400635. We can visualise the resulting estimate in Figure 4. As before, the orange dashed line represents the estimate of the kernel regression; the red line, f(x); and the small blue dots, the data.
However, the application of the AIC criterion is not as straightforward in this case. As we can see, the only adjustable parameter here is h. It is the only expression in the equation of our model that is not fixed before we attempt to fit a kernel regression model to the data (the kernel has been determined a priori). Hence the number of adjustable parameters will be useless as a measure of complexity. To carry on, we will need a more general definition of a parameter in order to use the AIC for our kernel regression. We will have to determine what is called "the effective number of parameters" (Friedman et al. 2001, p. 232).
Figure 4: Epanechnikov Kernel Regression (CV)

As mentioned in section 2.1, both the polynomial and the kernel regression estimates \hat{y} on the observed data (x, y) can be defined with a linear smoother S:

\hat{f}(x) = Sy

The effective number of parameters can then be measured by tr(S), the trace of the matrix S, and the AIC becomes:

\mathrm{AIC} = 200 \log\!\left( \frac{1}{200} \sum_{i=1}^{200} \left( y_i - \hat{f}(x_i) \right)^2 \right) + 2\,\mathrm{tr}(S)
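In R, once the smoother matrix of the kernel regression has been built (it is called Lopt in the annex), the effective number of parameters is the trace of that matrix and the AIC follows the formula above; a short sketch:

eff_par <- sum(diag(Lopt))               # tr(S), the effective number of parameters
mse_tr  <- mean((y - Lopt %*% y)^2)      # training MSE of the kernel fit
aic_ker <- 200*log(mse_tr) + 2*eff_par   # AIC for the kernel regression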
Given our data set and the choice-criteria that we have defined, the upshot of the previous analysis is that we will choose a kernel (nonparametric) regression model over a polynomial (parametric) one. The choice of one particular kernel regression estimate, however, is underdetermined, since the two choice-criteria that we used do not converge.
In this section, I rely on that analysis to discuss the importance of simplicity in model selection. I will define five different concepts of simplicity. In doing so, I want to bring some important nuances to the existing literature on this topic.
My first objective is to correct a mistake that we often find in the philosophical literature about what makes a model simpler than another. My second objective is to show that the importance that we give to a particular notion of simplicity will depend on the goal that we pursue when we select a model. Therefore, when we wish to explain why simplicity matters in science, we have no choice but to take more than one definition of simplicity into account. In other words, I wish to support a view according to which different goals will justify the importance of different notions of simplicity. This is what I call a pluralist view of simplicity.
This kind of work is different from that of other philosophers, such as Kevin Kelly, who wish to explain why simplicity is important when our goal is to find the truth. See (Kelly 2007b) for example. By looking at other goals, we get a better understanding of the scientific practice of model selection. We will see that the importance of a particular concept of simplicity will depend on whether we are interested in a good predictive model; a model that can be constructed under computational or time constraints; an interpretable model; or in the validity of certain kinds of models. The fun fact is that we cannot always achieve all of these goals without making compromises. I will make this clearer in the following sections.
Looking back at section 2, we see that simplicity played a crucial role when we used the AIC to select our models. It is one of many criteria, like the Bayesian information criterion (BIC) and the Minimum Description Length criterion (MDL), that rely on the idea that our model should maximise its fit to the training data and be penalised for its complexity. The justification for these criteria is that we want to avoid models that overfit the data, i.e., we wish to avoid choosing models for which MSE_test is larger than MSE_train. This is essential to obtain a good predictive model.
For this particular reason, philosophers of science have been quick to underscore the importance of parametric simplicity in model selection.
The scientific relevance of simplicity has long been a matter of debate in philosophical circles. Therefore, it is easy to understand the appeal of a mathematically rigorous justification for the scientific relevance of parametric simplicity in model selection. It is no surprise that parametric simplicity has been the focus of several important articles written by philosophers such as Elliott Sober, Christopher Hitchcock, and Malcolm Forster.

However, the neglect of nonparametric models often results in false claims.
It is claimed, for instance, that a model with fewer adjustable parameters is simpler and that a criterion like the AIC defines simplicity in terms of the number of adjustable parameters. I believe that this kind of mistake is symptomatic of a lack of understanding of how parametric complexity can cause overfitting.
In the specific cases discussed in section 2, the number of parameters (effective parameters) is actually a measure of the weight given by a model to the observed y_i in order to compute their corresponding fitted values \hat{y}_i. This is what explains the link between parametric simplicity and overfitting models. The more weight is given to y_i in order to compute its fitted value, the more our model will fit the data and thus model the irreducible error.
But more importantly, there is much more to simplicity than a property that allows us to avoid overfitting models. In fact, a good criterion for avoiding overfitting models does not even need to take parametric simplicity into account. As we have seen, we can estimate MSE_test with CV and completely eliminate the need to rely on parametric simplicity. See also (Forster 2007; Hitchcock and Sober 2004).
In what follows, I will complete the picture [2]. By comparing parametric with nonparametric models, we can identify at least four other concepts of simplicity: theoretical, computational, epistemic, and dimensional. They are all important facets of simplicity that are not discussed in the literature mentioned in the introduction. They only become apparent when we compare parametric models with their nonparametric counterparts.

[2] I am not suggesting here that the previously mentioned philosophers are not aware that the picture is incomplete and that more work needs to be done. My intention is to bring the debates forward.
Going back to section 2.2, we can see that I have made a substantial assumption about the form of f(x) in order to estimate it with a polynomial model. The quality of the estimate depended heavily on this assumption (that is why I defined f(x) the way I did). If we look at Figures 2 and 3 and compare the red and the orange dashed lines, we can see that a polynomial estimate will always fail to model the tails of f(x). In other words, a false a priori assumption about the form of f(x) can impose a limit on the quality of the estimate. This is why theory-laden approaches can be problematic.
On the other hand, we made no such a priori assumptions when we fitted a kernel regression model. We can immediately see how this paid off by looking at Figure 4. We see that the estimate provided by the kernel regression is closer to the true function. Therefore, theoretical simplicity seems to be of the utmost importance in this case.

But let us remember that we are not supposed to know the true function f(x). Thus we are not supposed to see that a polynomial regression will fail to model the tails of f(x) and that the kernel regression estimate is closer to the true function. What we do know, however, is that we obtained the best CV score with a kernel regression. This gives us evidence that MSE_test is lower for the kernel regression than it is for the polynomial regression. Thus, we can now appreciate the importance of theoretical simplicity. Theoretically simpler models can have the best MSE_test. In other words, they can provide us with better predictive models.
Another price to pay for using nonparametric models is that they are much more difficult to interpret. In comparison with parametric regressions, such as linear or polynomial regressions, nonparametric regressions "can lead to such complicated estimates of f that it is difficult to understand how any individual predictor is associated with the response" (James et al. 2013, p. 25).
To see this, let us assume that a dependent variable y can be expressed as a function of x plus an additive error term. As before, let us assume that the errors are uncorrelated and follow a centred normal distribution. Now consider the following two estimates of the function:

\hat{f}(x) = 4 + 5x \qquad (1)

\hat{f}(x_0) = \sum_{i=1}^{200} \frac{K\!\left(\frac{x_0 - x_i}{1.223}\right)}{\sum_{i=1}^{200} K\!\left(\frac{x_0 - x_i}{1.223}\right)} \, y_i \qquad (2)
The first estimate is easy to interpret, but the second gives us very little insight into the relationship between our variables.
The fact of the matter is that there are research contexts where we do not necessarily wish to make predictions with a model, but where we want to know how an independent variable is related to a dependent variable. For instance, a scientist might be interested in knowing whether maternal depression is positively related (and to what extent) to a child's learning difficulties in school. In that context, it is important to be able to interpret the resulting estimate of the function between the two variables. We could therefore have to choose a parametric model over a nonparametric one even if the latter makes more accurate predictions and is parametrically simpler.
In fact, there is often a compromise to make if we prefer interpretable over predictive models or vice versa. Depending on our goal (understanding or predicting), we might value parametric simplicity and epistemic simplicity differently.
When a model involves a single predictor, we can plot that model in 2 dimensions in order to visualise and interpret it more easily. This is what I did in section 2. But this is not a solution when the number of dimensions is high. This brings me to one last notion of simplicity that is at play in model selection.
It is in such contexts that it will be useful to implement various techniques, such as principal component analysis, in order to reduce the dimension of our data set. In other words, dimensional simplification can be very important when we construct and choose a model. Not only does it allow us to avoid overfitting models, but it is essential to maintain the validity of a nonparametric model.
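For illustration, a minimal R sketch of dimensional simplification with principal component analysis; X_high here stands for a hypothetical matrix with many predictor columns:

pc    <- prcomp(X_high, scale. = TRUE)   # principal component analysis
X_low <- pc$x[, 1:2]                     # keep only the first two principal components
# a regression (parametric or not) can then be fitted on X_low instead of X_high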
This conclusion seems to add weight to remarks that Sober has made on this point.
4 Conclusion
• Besides parametric simplicity, there are at least four other important concepts of simplicity in model selection: theoretical, computational, epistemic, and dimensional.
In this paper, I have framed model selection in frequentist rather than Bayesian terms, and I have worked within the former. By making this choice, I do not mean to imply that other frameworks, such as the Bayesian framework, are less important or justified. In fact, it would be interesting to compare the frequentist and the Bayesian approaches to model selection by taking the nonparametric approaches into consideration. This is a topic for future work.
References
Forster, M. and E. Sober (1994). How to Tell When Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions. The British Journal for the Philosophy of Science 45(1), 1–35.
Hitchcock, C. and E. Sober (2004). Prediction Versus Accommodation and the Risk of Overfitting. The British Journal for the Philosophy of Science 55(1), 1–34.
Kelly, K. T. (2007a). How Simplicity Helps You Find the Truth Without
Pointing at it. In Induction, algorithmic learning theory, and philosophy, pp.
111–143. Springer.
Sober, E. (2009). Parsimony and Models of Animal Minds. In The Philosophy of Animal Minds, pp. 237–257. Cambridge University Press.
Annex
Polynomial regressions
3rd order model
# fit a 3rd-order polynomial regression by least squares
reg3<- lm(y ~ x+I(x^2)+I(x^3))
reg3$coefficients

## (Intercept) x I(x^2) I(x^3)
## -0.244178747 0.263346283 -0.026423155 0.000377407

fitv<-reg3$fitted.values
datp<-cbind(d, fitv)
datp<-as.data.frame(datp)
datp<-datp[order(datp$x),]
#plot(x, y, ylim=c(-0.5, 1), col="dark blue", pch=19, lwd=1)
#lines(datp$x, datp$fitv, lwd=3, col="orange", lty=2, ylim=c(-0.5, 1))
#par(new=T)
#curve(f, from=min(x), to=max(x), col="red", lwd=3, ylim=c(-0.5, 1), ylab="")
AIC

# AIC = 200*log(MSE_train) + 2k, with k = 4 adjustable parameters
logs<-200*log((sum((reg3$fitted.values-y)^2))/200)
pen=(2*4)
aicrreg3<-logs+pen
aicrreg3

## [1] -578.8056
CV

# leave-one-out CV: refit the model without observation i, predict y[i],
# and average the squared prediction errors
cv3<-rep(NA, 200)
for(i in 1:200){
  reg<- lm(y[-i] ~ x[-i]+I(x[-i]^2)+I(x[-i]^3))
  ypr<-(reg$coefficients[1]+reg$coefficients[2]*(x[i])
        +reg$coefficients[3]*(x[i]^2)+reg$coefficients[4]*(x[i]^3))
  cv3[i]<-(y[i]-ypr)
}
cvr3<-sum(cv3^2)
cvr3/200

## [1] 0.05536721
10th order model

# reconstruction: the 10th-order fit itself is not shown in this extract
reg10<- lm(y ~ x+I(x^2)+I(x^3)+I(x^4)+I(x^5)+I(x^6)+I(x^7)+I(x^8)+I(x^9)+I(x^10))

AIC
# AIC = 200*log(MSE_train) + 2k, with k = 11 adjustable parameters
logs<-200*log((sum((reg10$fitted.values-y)^2))/200)
pen=(2*11)
aicrreg10<-logs+pen
aicrreg10
## [1] -618.6998
CV

# same leave-one-out procedure for the 10th-order model
cv10<-rep(NA, 200)
for(i in 1:200){
  reg<- lm(y[-i] ~ x[-i]+I(x[-i]^2)+I(x[-i]^3)+I(x[-i]^4)
           +I(x[-i]^5)+I(x[-i]^6)+I(x[-i]^7)+I(x[-i]^8)
           +I(x[-i]^9)+I(x[-i]^10))
  ypr<-(reg$coefficients[1]+reg$coefficients[2]*(x[i])
        +reg$coefficients[3]*(x[i]^2)+reg$coefficients[4]*(x[i]^3)
        +reg$coefficients[5]*(x[i]^4)+reg$coefficients[6]*(x[i]^5)
        +reg$coefficients[7]*(x[i]^6)+reg$coefficients[8]*(x[i]^7)
        +reg$coefficients[9]*(x[i]^8)+reg$coefficients[10]*(x[i]^9)
        +reg$coefficients[11]*(x[i]^10))
  cv10[i]<-(y[i]-ypr)
}
cvr10<-sum(cv10^2)
cvr10/200

## [1] 0.04445178
Kernel regressions
How to find the best h with CV.
h = seq(1, 2, 0.001)
cv<-rep(NA, length(h))
for(i in 1:length(h)){
  # scaled distances u[j, ] = (x[j] - x)/h[i]
  u<-matrix(NA, nrow = 200, ncol = 200)
  for(j in 1:200){
    u[j,]<-(x[j]-x)/h[i]
  }
  # Epanechnikov kernel weights
  ep<-function(x){
    cond=abs(x)<=1
    ((3/4)*(1-(x^2)))*cond
  }
  M<- ep(u)
  N<-apply(M, 1, sum)
  # row-normalise to obtain the smoother matrix L, so that yhat = L %*% y
  L = matrix(NA, nrow = 200, ncol = 200)
  for(k in 1:200){
    L[k,] = M[k,]/N[k]
  }
  yhat = L%*%y
  # leave-one-out residuals via the linear-smoother shortcut
  v<-rep(NA, 200)
  for(l in 1:200){
    v[l]<-(y[l]-yhat[l])/(1-L[l,l])
  }
  cv[i]<-(sum(v^2))
}
min(cv)/200

## [1] 0.04400635

h[which.min(cv)]

## [1] 1.223
# Kernel regression fit with the optimal bandwidth. The matrix of scaled
# distances 'uopt' is not defined in this extract; it is presumably built
# like 'u' above, with h = 1.223:
hopt<-h[which.min(cv)]
uopt<-matrix(NA, nrow = 200, ncol = 200)
for(j in 1:200){
  uopt[j,]<-(x[j]-x)/hopt
}
ep<-function(x){
  cond=abs(x)<=1
  ((3/4)*(1-(x^2)))*cond
}
Mopt<- ep(uopt)
Nopt<-apply(Mopt, 1, sum)
Lopt = matrix(NA, nrow = 200, ncol = 200)
for(k in 1:200){
  Lopt[k,] = Mopt[k,]/Nopt[k]
}
yhatopt = Lopt%*%y
datpred<-cbind(d, yhatopt)
datpred<-as.data.frame(datpred)
datpred<-datpred[order(datpred$x),]
#plot(x, y, ylim=c(-0.5, 1), pch=19, lwd=1, col="dark blue")
#lines(datpred$x, datpred$yhatopt, lwd=3, col="orange", lty=2, ylim=c(-0.5, 1))
#par(new=T)
#curve(f, from=min(x), to=max(x), col="red", lwd=3, ylim=c(-0.5, 1), ylab="")