Model Selection I: Principles of Model Choice and Designed Experiments (Ch. 10)
Introduction: When dealing with a set of potential predictor variables, how do we determine which
combinations of predictor variables comprise the best model? In today's exercises we will explore this
question. In this first worksheet, we will examine polynomial terms (i.e. X vs. X^2 vs. X^3), which are
not truly independent variables (since X^2 is calculated from X) but which are useful for modelling
quadratic or cubic relationships (i.e. non-straight-line relationships). Here we demonstrate how
sequential sums of squares should be used when testing for an effect of polynomial predictors, and
show how sequential sums of squares can be partitioned into polynomial components (e.g.
identifying the relative contributions of the linear and quadratic relationships).
The dataset: In this example two sets of data are analysed, Y against X and Y against XS (all stored
in the XS dataset).
The second explanatory variable, XS, has been calculated from X by subtracting 0.2.
So X and XS are essentially the same variable, but measured on slightly different scales.
Source: pg. 206 G & H.
File: XS.txt
1. Look at Box 10.11a and 10.11b on p 206-207 in G&H. (This case is different from earlier
cases because it includes polynomials).
Note: In the textbook the notation Y = X|X|X denotes the polynomial model with terms
X, X^2 and X^3. In an R formula, powers must either be wrapped in I() (e.g.
Y ~ X + I(X^2) + I(X^3)) or created as new variables beforehand, because ^ inside a
formula means factor crossing rather than arithmetic.
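For instance, a cubic polynomial could be specified in R in either of these two equivalent ways (a sketch, assuming XS.txt has been saved in the working directory with columns X, XS and Y):

# Read the example data
xs <- read.table("XS.txt", header = TRUE)

# Option 1: wrap the powers in I() so they are computed arithmetically
cubic1 <- lm(Y ~ X + I(X^2) + I(X^3), data = xs)

# Option 2: create the polynomial terms as new variables first
xs$Xsq  <- xs$X^2
xs$Xcub <- xs$X^3
cubic2 <- lm(Y ~ X + Xsq + Xcub, data = xs)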
2. Which of the tests are preferred: those based on adjusted (i.e. Type III) SS or those based on
sequential (i.e. Type I) SS? Why?
Answer: the tests based on the sequential sums of squares are preferred, as these give a
consistent answer about the shape of the relationship between X and Y.
(Note: XS is excluded from this model.)
o Remember: In R, Anova() (from the car package) calculates adjusted SS, and anova() calculates sequential SS.
Adjusted SS: all p-values are greater than 0.05, so no term appears significant.
Sequential SS: X is a significant predictor of Y; Xsq (p < 0.05) is still a significant
predictor after taking X into account; Xcub (p > 0.05) is not a significant predictor.
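A sketch of this comparison in R, continuing from the sketch above (Anova() comes from the car package, which may need to be installed first):

library(car)  # provides Anova() for adjusted SS

model1 <- lm(Y ~ X + Xsq + Xcub, data = xs)  # as created in the sketch above

anova(model1)                # sequential (Type I) SS: terms tested in order
Anova(model1, type = "III")  # adjusted (Type III) SS: each term tested as if fitted last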
The dataset: A factorial experiment was conducted to investigate the yield of barley. Thirty-six plots
were divided into four blocks. Three varieties were compared at three different row-spacings. These
data are stored in the barley dataset in variables BYIELD, BSPACE, BVARIETY and BBLOCK.
Source: pg. 207 G & H.
File: Barley.txt
This is an orthogonal (balanced) design, so sequential and adjusted SS give the same results.
Draw an interaction diagram using the function interaction.plot(). You can visualize
the effect of these two predictor variables on the response variable in two ways, by
swapping which factor is drawn on the x-axis and which defines the separate lines
(see the sketch below):
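A sketch of both plots in R (assuming Barley.txt is in the working directory and that the design variables need converting to factors):

barley <- read.table("Barley.txt", header = TRUE)
barley$BBLOCK   <- factor(barley$BBLOCK)
barley$BVARIETY <- factor(barley$BVARIETY)
barley$BSPACE   <- factor(barley$BSPACE)

# Way 1: row spacing on the x-axis, one line per variety
with(barley, interaction.plot(x.factor = BSPACE, trace.factor = BVARIETY,
                              response = BYIELD))

# Way 2: swap the roles of the two factors
with(barley, interaction.plot(x.factor = BVARIETY, trace.factor = BSPACE,
                              response = BYIELD))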
(Opinion-based answer.) First analysis, from the interaction plot: the effect of spacing
differs between varieties. In variety 1 of barley the mean yield increases from spacing 1
to 3, while in variety 2 the mean yield decreases from spacing 1 to 3.
Now conduct a second GLM analysis, treating BSPACE as a continuous variable and
including a quadratic term for BSPACE:
i.e. BYIELD = BBLOCK + BVARIETY + BSPACE + BVARIETY*BSPACE +
BSPACE2 + BVARIETY*BSPACE2.
Hint: R won’t recognize “BSPACE2” (or “BSPACEsq”, etc.) unless you have created
a new variable with that name (as in the previous exercise).
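A sketch of this model in R, continuing from the barley data above (the names BSPACEnum, BSPACEsq and model_quad are illustrative):

# Treat spacing as a number rather than a factor
barley$BSPACEnum <- as.numeric(as.character(barley$BSPACE))
barley$BSPACEsq  <- barley$BSPACEnum^2   # the quadratic term ("BSPACE2")

model_quad <- lm(BYIELD ~ BBLOCK + BVARIETY + BSPACEnum + BVARIETY:BSPACEnum +
                   BSPACEsq + BVARIETY:BSPACEsq, data = barley)
anova(model_quad)  # sequential SS: quadratic terms tested after the linear ones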
This shows that the quadratic term adds nothing: its interaction plot is the same as the
BSPACE interaction plot, and the linear term BSPACE was adequate to describe the
relationship between BSPACE and BYIELD.
5. What is your conclusion? Do your conclusions from the second analysis differ from those
you drew after the first analysis?
No, the conclusions do not differ. BSPACEsq produced the same interaction plot as
BSPACE, and BSPACEsq was not a significant predictor.
Model selection II: datasets with several explanatory variables (Ch. 11)
Introduction: When dealing with multiple potential predictor variables we often need to choose a
selection of independent variables that best describes our dependent variable. Here we consider how
to identify the best model (i.e. with the best combination of predictor variables) when presented with
many possible combinations of predictors.
The dataset: The efficacy of two proprietary treatments for cat fleas was compared by a journalist
for a pet magazine. In a survey, households with cats were asked to choose a ‘focal cat’ and to report
information which has been stored in the fleas dataset (variables TRTMT, HAIRL, NCATS, CARPET and FLEAS).
File: Fleas.txt
3. Conduct the analysis FLEAS = TRTMT + HAIRL + NCATS + CARPET (model 2).
4. What do you conclude?
6. Conduct the analysis LOGFLEAS = TRTMT + HAIRL + NCATS + CARPET (model 3).
7. What do you conclude?
Use component + residual (CR) plots: these partial residual plots test assumptions and
show the effect of each variable in isolation. Only HAIRL is not a significant factor, as
seen from the original model fitted before any variables were removed.
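A sketch of models 2 and 3 and the CR plots in R (assuming Fleas.txt is in the working directory; if LOGFLEAS is not already in the file it can be created with log()):

library(car)  # crPlots() for component + residual (partial residual) plots

fleas <- read.table("Fleas.txt", header = TRUE)
# Create the log-transformed response if it is not already in the file
# (log(FLEAS + 1) would be needed instead if any flea counts are zero):
fleas$LOGFLEAS <- log(fleas$FLEAS)

model2 <- lm(FLEAS    ~ TRTMT + HAIRL + NCATS + CARPET, data = fleas)  # model 2
model3 <- lm(LOGFLEAS ~ TRTMT + HAIRL + NCATS + CARPET, data = fleas)  # model 3

summary(model2)
summary(model3)
crPlots(model3)  # one partial residual plot per predictor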
8. Look at R2, residual plots and p-values for analysis 2 and 3. Would you expect model
criticism techniques to show model 2 or 3 to be the better model?
Let’s create a better model now (predicting "LOGFLEAS"), by excluding any predictor
variables from model 3 that do not strongly improve the model when they are included.
There are several statistical approaches to this process of “variable selection” or “model
building” – we will illustrate three here.
9. First, we can compare the adjusted R2 values for different models. The model with the
highest adjusted R2 can be considered as the model that maximizes the proportion of
variation in the dependent variable that is explained, after compensating for the fact that
even adding a randomly-generated (i.e. biologically-meaningless) variable will increase the
raw R2 value for a model. In other words, comparing adjusted R2 values between two models
lets you ask if the model with more predictor variables is actually better than the model with
fewer predictor variables.
Run four new models based on model 3, where in each of the new models one of the
predictor variables is excluded. Calculate the adjusted R2 for all four of these models,
and compare with the adjusted R2 for model 3.
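One way to do this in R is with update(), which refits a model with one term dropped (a sketch, continuing from model3 above; the object names are illustrative):

# Drop one predictor at a time from model 3
drop_TRTMT  <- update(model3, . ~ . - TRTMT)
drop_HAIRL  <- update(model3, . ~ . - HAIRL)
drop_NCATS  <- update(model3, . ~ . - NCATS)
drop_CARPET <- update(model3, . ~ . - CARPET)

# Collect the adjusted R-squared values for comparison
sapply(list(model3 = model3,
            no_TRTMT = drop_TRTMT, no_HAIRL = drop_HAIRL,
            no_NCATS = drop_NCATS, no_CARPET = drop_CARPET),
       function(m) summary(m)$adj.r.squared)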
What does this tell us about the predictor variables? Which predictor variable(s) do
not increase adjusted R2 values when included in a model?
Based on these results, which variable(s) would you drop from model 3?
10. Second, we can use the AIC score for different models to rank them in terms of how well
each model fits the observed data. A smaller AIC score indicates a better trade-off between
goodness of fit and model complexity and, therefore, a better combination of predictor variables.
The stepAIC() function (from the MASS library) will automatically add and/or
remove predictor variables to estimate the best combination of predictor variables to
explain a given response variable. Run the stepAIC function to identify which
predictor variables can be excluded from the model based on how they affect
models’ AIC scores.
library(MASS)  # provides stepAIC()
model4 <- stepAIC(object = model3)  # stepwise removal/addition of terms by AIC
summary(model4)
The summary() function will then present the variables that have not been eliminated
and, therefore, can be considered to be part of the “best” model.
Does this “best model” differ from the best model you identified in step 9
above?
Remember: AIC values can only be used to compare models that have the same
response variable, and an AIC value is meaningful only relative to the AIC of another
model; its absolute value tells you nothing on its own.
11. Finally, we can test whether the improvement in a model from adding one (or more) variables
is greater than would be expected from simply adding a random (i.e. unrelated) variable. This
is done using an F-test, and is implemented using the anova() function.
anova(model1, model2)
The p-value then tells us if the reduction in residual sum of squares is larger than
expected by chance. A significant p-value indicates that the larger model (i.e. with
more explanatory variables) predicts the response variable significantly better than
the simpler model (i.e. with fewer predictor variables).
This test is only valid when comparing nested models – i.e. both models have exactly
the same y variable, and the larger model contains all the variables that are present in
the smaller (i.e. simpler) model.
What happens when you run anova(model2, model4)? Why do you think you get the
error message?
Compare models 3 and 4 this way to determine if your final model is better than the
full model (i.e. model 3 which includes all possible predictor variables).
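For example (a sketch, using model3 and model4 from above):

# Nested-model F-test: does the full model (model 3) explain significantly
# more variation than the simplified model (model 4)?
anova(model4, model3)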
It is important to note that the aim of model building (as illustrated above) is to simplify models; in other
words, to reduce the number of predictor variables as much as possible without impacting (much) on model
performance.
If your objective is simply to build a predictive model (i.e. to maximize the proportion of variability in the
response variable that can be explained), then having a model with tens or hundreds of predictor variables
may be appropriate. Many of those predictors will likely not improve the model’s predictive ability by
much, but there is no cost to including them.
However, in biology we usually have a slightly more refined aim: to build an explanatory model (i.e. where
we aim to understand which predictor variables have a statistically significant and biologically-meaningful
effect on the response variable). An explanatory model will not include predictor variables that have
minimal impacts on model performance, but will limit predictors to those that strongly and/or significantly
improve model performance. Explanatory models typically focus on hypothesis testing (i.e. is response Y
significantly related to predictor X) and pay attention to the relationships between each predictor and the
response variable.
ANOVA test vs ANOVA table
In this exercise we use both an ANOVA test and an ANOVA table. Remember that
an ANOVA table is a way to summarize a fitted model (produced with the anova()
function), while an ANOVA analysis is a statistical test (the model itself is fitted with
the lm() function).
(Marking note: if you state which model is better and explain why, you can still get partial marks.)