Model Selection I: Principles of Model Choice and Designed Experiments (Ch. 10)


Introduction: When dealing with a set of potential predictor variables, how do we determine which
combination of predictor variables comprises the best model? In today's exercises we will explore this
question. In this first worksheet, we will examine polynomial terms (i.e. X vs. X² vs. X³), which are
not truly independent variables (since X² is calculated from X) but which are useful for modelling
quadratic or cubic relationships (i.e. non-straight-line relationships). Here we demonstrate how
sequential sums of squares should be used when testing for an effect of polynomial predictors, and
show how sequential sums of squares can be partitioned into polynomial components (e.g.
identifying the relative contributions of the linear and quadratic relationships).

The dataset: In this example two sets of data are analysed, Y against X and Y against XS (all stored
in the XS dataset).
The second explanatory variable, XS, has been calculated from X by subtracting 0.2.
So X and XS are essentially the same variable, but measured on slightly different scales.
Source: pg. 206 G & H.

File: XS.txt

1. Look at Boxes 10.11a and 10.11b on pp. 206-207 in G&H. (This case is different from earlier
cases because it includes polynomials.)
 Note: In the textbook the notation Y = X|X|X is the same as Y = X + X^2 + X^3
(with the latter being the correct format for R), and X*X*X is the same as X^3
(again, with the latter being appropriate for R).

2. Which of the tests are preferred: those based on adjusted (i.e. Type III) SS or those based on
sequential (i.e. Type I) SS? Why?

The tests based on the sequential (Type I) sums of squares are preferred, because the
predictor variables are polynomial terms: X^2 and X^3 are calculated from X, so adding
these predictors means adding predictors that are related to each other. Fitting the terms
sequentially (linear, then quadratic, then cubic) gives a consistent answer about the shape
of the relationship between X and Y.

3. Repeat the analyses in R.


o Open the data from the file “XS.txt” and call it DATA.
o Create new columns with the polynomial terms in the dataset:
 e.g. DATA$Xsq <- DATA$X ^ 2
 e.g. DATA$Xcub <- DATA$X ^ 3
o Run a general linear model as in Box 10.11 and calculate the adjusted and sequential
SS. (Note that XS is excluded from this model.)
o Remember: In R, Anova() (from the car package) calculates adjusted SS, and anova()
calculates sequential SS. A sketch of the full workflow is given below.
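A minimal R sketch of this workflow (assuming XS.txt is whitespace-delimited with a header
row and columns named Y, X and XS; Anova() is from the car package; "model" is a name
chosen here for illustration):

# Read the data and create the polynomial terms
DATA <- read.table("XS.txt", header = TRUE)
DATA$Xsq  <- DATA$X ^ 2
DATA$Xcub <- DATA$X ^ 3

# General linear model with linear, quadratic and cubic terms (XS excluded)
model <- lm(Y ~ X + Xsq + Xcub, data = DATA)

anova(model)             # sequential (Type I) SS
library(car)
Anova(model, type = 3)   # adjusted (Type III) SS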

4. How do we interpret these results?

Sequential SS:
X: p < 2.2e-16, statistically significant (p < 0.05). X is a significant predictor of Y.
Xsq: p = 0.0002745, statistically significant (p < 0.05); the quadratic term is still a
significant predictor after taking X into account.
Xcub: p = 0.1779164, greater than 0.05; the cubic term is not a statistically significant
predictor.

Adjusted SS:
The p-values are greater than 0.05, because each polynomial term is adjusted for all of
the others and the terms are strongly correlated.

Overall: X is a significant predictor of Y, the quadratic term remains significant once X
is taken into account, and the cubic term is not significant.

The XS model could also have been run in the same way, but separately.

The dataset: A factorial experiment was conducted to investigate the yield of barley. Thirty-six plots
were divided into four blocks. Three varieties were compared at three different row-spacings. These
data are stored in the barley dataset in variables BYIELD, BSPACE, BVARIETY and BBLOCK.
Source: pg. 207 G & H.

File: Barley.txt

1. Run the analysis: BYIELD = BBLOCK + BSPACE + BVARIETY + BSPACE*BVARIETY,
treating the explanatory variables as categorical variables.
 Calculate both sequential SS and adjusted SS.

This is an orthogonal (balanced) design, so the sequential and adjusted SS agree.
 Draw an interaction diagram using the function interaction.plot(). You can visualize
the effect of these two predictor variables on the response variable in two ways:

i. interaction.plot(x.factor = dat$BSPACE, trace.factor = dat$BVARIETY,
   response = dat$BYIELD)

ii. interaction.plot(x.factor = dat$BVARIETY, trace.factor = dat$BSPACE,
    response = dat$BYIELD)

 Both plots visualize the same statistical interaction. Which plot do you find
easiest to interpret? (This is opinion based.) A sketch of the full analysis is given below.
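A minimal R sketch of this analysis (assuming Barley.txt is whitespace-delimited with a
header row; Anova() is from the car package, and note that Type III tests are sensitive to
the contrast coding; "model_b" is a name chosen here for illustration):

# Read the barley data
dat <- read.table("Barley.txt", header = TRUE)

# Treat all explanatory variables as categorical
dat$BBLOCK   <- factor(dat$BBLOCK)
dat$BSPACE   <- factor(dat$BSPACE)
dat$BVARIETY <- factor(dat$BVARIETY)

# Factorial model with the spacing-by-variety interaction
model_b <- lm(BYIELD ~ BBLOCK + BSPACE + BVARIETY + BSPACE:BVARIETY, data = dat)

anova(model_b)             # sequential (Type I) SS
library(car)
Anova(model_b, type = 3)   # adjusted (Type III) SS

# The two interaction plots from above
interaction.plot(x.factor = dat$BSPACE, trace.factor = dat$BVARIETY,
                 response = dat$BYIELD)
interaction.plot(x.factor = dat$BVARIETY, trace.factor = dat$BSPACE,
                 response = dat$BYIELD)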

2. What conclusion do you draw from this first analysis?

From the interaction plot, the effect of spacing on yield depends on the variety:

In variety 1 of barley, the mean yield increases from spacing 1 to spacing 3.

In variety 2 of barley, the mean yield decreases from spacing 1 to spacing 3.

In variety 3 of barley, the mean yield increases from spacing 1 to spacing 3.

Thus variety 3 (at spacing 3) gives the best yield.

3. Now conduct a second GLM analysis, treating BSPACE as a continuous variable and
including a quadratic term for BSPACE:
 i.e. BYIELD = BBLOCK + BVARIETY + BSPACE + BVARIETY*BSPACE +
BSPACE2 + BVARIETY*BSPACE2.
 Hint: R won’t recognize “BSPACE2” (or “BSPACEsq”, etc.) unless you have created
a new variable with that name (as in the previous exercise).

First convert BSPACE back to a continuous (numeric) variable, then create the squared
term, as in the sketch below.
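A sketch (assuming BSPACE was converted to a factor for the first analysis, hence the
as.numeric(as.character(...)) step; "BSPACEsq" and "model_q" are names chosen here):

# Convert BSPACE back to a continuous variable and create the quadratic term
dat$BSPACE   <- as.numeric(as.character(dat$BSPACE))
dat$BSPACEsq <- dat$BSPACE ^ 2

# Quadratic model with variety-by-spacing interactions
model_q <- lm(BYIELD ~ BBLOCK + BVARIETY + BSPACE + BVARIETY:BSPACE +
                BSPACEsq + BVARIETY:BSPACEsq, data = dat)

anova(model_q)   # sequential SS first, since BSPACEsq is a polynomial term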


4. What does this new model (with BSPACE as a continuous variable) tell us about the
relationship between BSPACE and BYIELD?
Does the quadratic term (i.e. BSPACE2) show that there is a strongly non-linear relationship
between barley spacing and yield?
Or is the linear term (BSPACE) adequate to describe the relationship?

Use anova() (sequential SS) first, since BSPACEsq is a polynomial term.

BSPACE: p = 0.00679, less than 0.05, so BSPACE is a significant predictor of yield.

Then draw an interaction plot to visualize it. The plot is the same as the BSPACE
interaction plot from the first analysis, so the linear term BSPACE was adequate to
describe the relationship.

The key interaction is between BSPACE and BVARIETY: the effect of spacing on yield
depends on the variety.

5. What is your conclusion - do your conclusions from the second analysis differ from those
you drew after the first analysis?

No, they do not differ. BSPACEsq gave the same interaction plot as BSPACE, and
BSPACEsq was not a significant predictor.

Model selection II: datasets with several explanatory variables (Ch. 11)
Introduction: When dealing with multiple potential predictor variables we often need to choose a
selection of independent variables that best describes our dependent variable. Here we consider how
to identify the best model (i.e. with the best combination of predictor variables) when presented with
many possible combinations of predictors.

The dataset: The efficacy of two proprietary treatments for cat fleas was compared by a journalist
for a pet magazine. In a survey, households with cats were asked to choose a ‘focal cat’, and to report
information which has been stored in the fleas dataset:

NCATS: the number of cats in the household.


CARPET: if the focal cat was allowed (CARPET=1) or not (CARPET=2) in rooms with carpets.
FLEAS: the average density of fleas on the focal cat over the year.
TRTMT: a code of 1 or 2 for the two proprietary treatments being compared.
HAIRL: the length in mm of the hair on the focal cat’s back.
The data are saved in the file ‘Fleas’.
Source: pg. 229 G & H.

File: Fleas.txt

1. Conduct the analysis FLEAS = TRTMT (model 1).
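A sketch of model 1 (assuming Fleas.txt is whitespace-delimited with a header row; TRTMT
and CARPET are 1/2 codes, so they are converted to factors here):

fleas <- read.table("Fleas.txt", header = TRUE)
fleas$TRTMT  <- factor(fleas$TRTMT)
fleas$CARPET <- factor(fleas$CARPET)

model1 <- lm(FLEAS ~ TRTMT, data = fleas)
anova(model1)     # ANOVA table for the treatment effect
summary(model1)   # coefficient table and R-squared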


2. What do you conclude?

Use the summary table and the ANOVA table.

anova: p = 0.69. The p-value is greater than 0.05, so treatment is not a significant
predictor of the average density of fleas on the focal cat over the year. We do not
reject the null hypothesis (H0).

3. Conduct the analysis FLEAS = TRTMT + HAIRL + NCATS + CARPET (model 2).
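A sketch of model 2 (continuing from the model 1 sketch above):

model2 <- lm(FLEAS ~ TRTMT + HAIRL + NCATS + CARPET, data = fleas)
anova(model2)
summary(model2)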
4. What do you conclude?

Use the summary table and the ANOVA table for model 2.
5. Have extra explanatory variables helped or hindered in the treatment comparison?

6. Conduct the analysis LOGFLEAS = TRTMT + HAIRL + NCATS + CARPET (model 3).
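A sketch of model 3 (the natural log is assumed here; it requires all FLEAS values to be
positive):

fleas$LOGFLEAS <- log(fleas$FLEAS)

model3 <- lm(LOGFLEAS ~ TRTMT + HAIRL + NCATS + CARPET, data = fleas)
anova(model3)
summary(model3)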
7. What do you conclude?

Use summary table


Anova

Use cr plots
Partial residual plots- test assumptions or effect of variable in isolation

Only hairl is not a sig factor -seen from original model made before variables removed.
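Partial residual plots can be drawn with crPlots() from the car package (a sketch,
assuming model3 from above):

library(car)
crPlots(model3)   # one component + residual plot per predictor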

8. Look at R², residual plots and p-values for analyses 2 and 3. Would you expect model
criticism techniques to show model 2 or model 3 to be the better model?

Let’s create a better model now (predicting "LOGFLEAS"), by excluding any predictor
variables from model 3 that do not strongly improve the model when they are included.
There are several statistical approaches to this process of “variable selection” or “model
building” – we will illustrate three here.

9. First, we can compare the adjusted R² values for different models. The model with the
highest adjusted R² can be considered the model that maximizes the proportion of
variation in the dependent variable that is explained, after compensating for the fact that
even adding a randomly-generated (i.e. biologically meaningless) variable will increase the
raw R² value of a model. In other words, comparing adjusted R² values between two models
lets you ask whether the model with more predictor variables is actually better than the
model with fewer predictor variables.

 Run four new models based on model 3, where each new model excludes one of the
predictor variables. Calculate the adjusted R² for all four of these models,
and compare with the adjusted R² for model 3 (see the sketch after this list).
 What does this tell us about the predictor variables? Which predictor variable(s) do
not increase the adjusted R² value when included in a model?

 Based on these results, which variable(s) would you drop from model 3?
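A sketch of this comparison, using update() to drop one predictor at a time (the model
names here are chosen for illustration):

# Four reduced models, each dropping one predictor from model 3
m_no_trtmt  <- update(model3, . ~ . - TRTMT)
m_no_hairl  <- update(model3, . ~ . - HAIRL)
m_no_ncats  <- update(model3, . ~ . - NCATS)
m_no_carpet <- update(model3, . ~ . - CARPET)

# Adjusted R-squared for the full model and each reduced model
sapply(list(full      = model3,
            no_TRTMT  = m_no_trtmt,
            no_HAIRL  = m_no_hairl,
            no_NCATS  = m_no_ncats,
            no_CARPET = m_no_carpet),
       function(m) summary(m)$adj.r.squared)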

10. Second, we can use the AIC score for different models to rank them in terms of how well
each model fits the observed data. A smaller AIC score indicates a better fit and, therefore, a
better combination of predictor variables.
 The stepAIC() function (from the MASS library) will automatically add and/or
remove predictor variables to estimate the best combination of predictor variables to
explain a given response variable. Run the stepAIC function to identify which
predictor variables can be excluded from the model based on how they affect
models’ AIC scores.
 library(MASS)
 model4 <- stepAIC(object = model3)
 summary(model4)

 This will perform “backwards elimination”, sequentially testing which predictor
variables can be removed (i.e. “eliminated”) from model3 without increasing the
model’s AIC value, continuing to drop predictor variables that do not strongly
contribute to explaining LOGFLEAS until dropping any further predictors would
increase the AIC value.

 The summary() function will then present the variables that have not been eliminated
and, therefore, can be considered to be part of the “best” model.
 Does this “best model” differ from the best model you identified in step 9
above?
 Remember: AIC values can only be used to compare models that have the same
response variable, and AIC values are only useful for comparing models (a single
AIC value is not meaningful on its own).

11. Finally, we can test whether the improvement in a model from the addition of one (or more)
variables is greater than expected from simply adding a random (i.e. unrelated) variable.
This is done using an F-test, and is implemented with the anova() function.
 anova(model1, model2)
 The p-value then tells us whether the reduction in the residual sum of squares is larger
than expected by chance. A significant p-value indicates that the larger model (i.e. with
more explanatory variables) predicts the response variable significantly better than
the simpler model (i.e. with fewer predictor variables).
 This test is only valid when comparing nested models – i.e. both models have exactly
the same y variable, and the larger model contains all the variables that are present in
the smaller (i.e. simpler) model.
 What happens when you run anova(model2, model4)? Why do you think you get the
error message?

The two models have different response variables (FLEAS vs. LOGFLEAS), so they
cannot be compared this way.

 Compare models 3 and 4 this way to determine if your final model is better than the
full model (i.e. model 3 which includes all possible predictor variables).
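For example (listing the smaller model first, by convention):

anova(model4, model3)   # F-test: does the full model explain significantly more?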

Use the nested-model anova() comparison (as for models 1 and 2) to see whether the
extra variables improve the model.

Null hypothesis: there is no difference in the variance explained by the two models.
Alternative hypothesis: the bigger model explains significantly more of the variance
than the smaller model (the alternative is directional).

The reason we are not comparing model 1 to model 3, or model 2 to model 3, is the log
transformation: models 1 and 2 predict FLEAS, while model 3 predicts LOGFLEAS.

12. Do these three approaches agree on which variables should be included in the final model?
Different model-building procedures can sometimes reach different conclusions, so it is
good to make an a priori decision about which variable selection methods you plan to use
before you start your analyses (it would be unethical to try different methods until you
get the answer that you want).

Does model building / variable selection build the “best” model?

It is important to note that the aim of model building (as illustrated above) is to simplify models; in other
words, to reduce the number of predictor variables as much as possible without impacting (much) on model
performance.

If your objective is simply to build a predictive model (i.e. to maximize the proportion of variability in the
response variable that can be explained), then having a model with tens or hundreds of predictor variables
may be appropriate. Many of those predictors will likely not improve the model’s predictive ability by
much, but there is no cost to including them.

However, in biology we usually have a slightly more refined aim: to build an explanatory model (i.e. where
we aim to understand which predictor variables have a statistically significant and biologically-meaningful
effect on the response variable). An explanatory model will not include predictor variables that have
minimal impacts on model performance, but will limit predictors to those that strongly and/or significantly
improve model performance. Explanatory models typically focus on hypothesis testing (i.e. is response Y
significantly related to predictor X) and pay attention to the relationships between each predictor and the
response variable.
ANOVA test vs ANOVA table

In this exercise we use both an ANOVA test and an ANOVA table. Remember that
an ANOVA table is a way to summarize a fitted model (produced with the anova() function),
while an ANOVA analysis is a statistical test (fitted with the lm() function).

NB: Only one of these methods tells you directly whether one model is better than another:
the anova() model comparison (the nested F-test) does that.

In a test, a question such as “is a model with 17 predictors better than a model with 3?” is
not asking for a full variable selection procedure; the wording of the question is an
indication of which method to use, even though it won’t say so directly. If you identify
which model is better and explain why, you can still get partial marks.

anova() is used to see whether a variable should be excluded or included, and is also used
when you need sequential SS, as mentioned in the textbook and in today’s lecture.
