
An introduction to multiple regression in R

Figure 1 – Well…

Last week we worked through the basics of simple linear regression analysis. Today and the next class
we’re going to take this into the multivariate realm and look at multiple regression. There are several
types of multiple regression, but we’ll focus on the most straightforward, linear case.

Background to the Problem

Extending OLS regression, we will move on to quantifying the relationships between one dependent
variable and two or more independent variables. As well as quantifying these more complex
relationships, we will learn about model selection – a concept that becomes more important the more
complicated a dataset we analyse. Through the course of this worksheet we will learn about the
following concepts, and how to implement them in R:

- Multiple Linear Regression and adjusted r2
- Model selection and the trade-off between model fit and number of explanatory variables
- Data transformation
Exercise

As with all of my exercises, the questions you will be expected to answer are interspersed throughout
the text. Please read carefully to make sure you don’t miss anything, and be sure to include answers to
every question in your write-up.

Part 1 – Preparation

As you did last week, create a directory to hold this week’s data and analyses somewhere convenient.
Switch to this directory as your working directory – Click “Misc”, “Change Working Directory” and
then find the directory you’ve just created. Remember that after each session you’ll want to save your
results by clicking “Workspace” and then “Save Workspace File”.

Part 2 – Analysis of US Life Expectancy

After I had spent a little while trying to compile some US demographic data by state for this case study,
it was pointed out to me that these (legacy) data already exist as a test dataset in R. So, thanks to
William B. King, your life and mine start more easily this class!

> state.data <- as.data.frame(state.x77)

You can look at the original data if you'd like (state.x77), but it's a matrix rather than a data frame, so it
isn't as useful to us. One useful way to view a summary of some data that you've imported is to use the
structure command:

> str(state.data)

This dataframe needs a little tidying up before we can start analysing it, so…:

Question 1) What does each variable mean, and in what units is it measured? Rename your columns
to remove the spaces in the variable names (you could just substitute the spaces for
periods) – that will save some hassle later. Add an appropriately named population
density column to your data set. Provide your answers and commands.

You may find the help command in R useful here. If you type a question mark, followed by any R
command, you will display the help file. For example, ?str() will tell you about the structure command.
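
If you get stuck on the renaming, here is a minimal sketch of one possible approach (using gsub() and
calling the new column Density are my choices, not requirements of the exercise):

> # Replace the spaces in the column names with periods, e.g. "Life Exp" becomes "Life.Exp"
> names(state.data) <- gsub(" ", ".", names(state.data))
> # Add a population density column (population per unit area)
> state.data$Density <- state.data$Population / state.data$Area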

As a first step, we're interested in determining if there are any potential causal relationships among the
variables in our dataset. In comes rule number one: if in doubt, plot it up. We could plot each pair
of variables separately, but that would be pretty tedious. We could also write a script to plot each pair
and output them into a matrix of plots, but you saw how much trouble we had with that last time…
Instead, fortunately, someone has already done the job for us. Try this:

> pairs(state.data)

Pretty nifty, eh?

Question 2) From the plot you have produced, state three pairs of variables that you would expect to
show moderate to strong correlations, and three pairs that you would expect to show no
correlation.

We can actually calculate these correlations very easily in R using the function cor(). This returns
Pearson's correlation coefficient, r – the square of which is the r2 we obtained from the OLS regressions
last week. When applied to a data frame, cor() will produce a matrix of the pairwise correlations of the
variables in the data frame. These will lie between -1 (perfect negative correlation) and 1 (perfect
positive correlation). We could simply output these using cor(state.data), but we can visualise them
even more easily using the following commands:

> library(lattice)
> levelplot(cor(state.data))

The first function tells R to use commands from an additional pre-existing, but not yet loaded, library of
commands called lattice. There are a huge number of these command libraries (packages) available;
some come packaged with R (like lattice), while others you have to download separately. The second line of code
uses one of these new functions to plot up a visual representation of the correlation matrix with strong
positive correlations in blue and strong negative correlations in pink. This may be the first time you’ve
used explicitly nested commands like this in R, but they work the same way as they do in algebra,
starting from the innermost set of parentheses and working outwards.
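
If the nesting looks confusing, you can always unpack it into separate steps – the two lines below are
exactly equivalent to the single nested command (the intermediate name cor.mat is just my choice):

> cor.mat <- cor(state.data)   # calculate the correlation matrix first...
> levelplot(cor.mat)           # ...then plot it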

Question 3) Looking at your correlation plot, do your predictions from question 2 hold? What are the
strongest positive and negative correlations? Remember that a variable will always
correlate 100% positively with itself, and so that information isn’t that useful…

If we were to simply carry out OLS regressions between these pairs of variables, we would be ignoring
the fact that variables left out of a bivariate model can influence the relationship we are trying to
understand, perhaps in complex and unexpected ways. To steal an example from our friend William
King, teacher pay is negatively correlated with SAT scores. Huh?? What are we paying these people
for? The issue here, which would not be clear from a bivariate model, is that SAT scores are strongly
negatively correlated with the proportion of students who sit the test (i.e. when only a few students sit
the test it is usually the best few, so scores are high; when everyone sits, scores decrease), and teacher
pay is positively correlated with the proportion of students sitting the SAT. Higher-paid teachers get
more of their students to sit the test, but this lowers the overall score.

Multiple regression overcomes this problem by looking at all the variables together and asking "what
would the influence of variable x be if I held the influence of all the other variables constant?" In the
case above, this restores the world to rights and shows that students of more highly paid teachers have
significantly higher SAT scores once the confounding effects are accounted for.

Let’s apply this to our dataset. It is reasonable to hypothesise that many of our variables will influence
life expectancy in some fashion. Fortunately for us, it is a simple matter to add these extra variables
into the function we called last week to carry out OLS regression, transforming it into multiple
regression:

> complete.model <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost
+ Area, data = state.data)

In this model, we are asking R to calculate the influence of each of the independent variables on the
dependent variable, accounting for the influence of all of the other independent variables. Nifty, eh?
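
As an aside, R has a shorthand for "use every other column as an IV". A sketch (the model name is
mine; note that if you added a density column in Question 1 this version would sweep it in too, so it is
not necessarily identical to the model written out above):

> # "~ ." means "regress Life.Exp on all of the other columns in the data frame"
> all.columns.model <- lm(Life.Exp ~ ., data = state.data)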

Question 4) Output a summary of your complete regression model. Is the model statistically
significant, and if so, what does this mean? What proportion of the variation in the
dependent variable (life expectancy) is explained by this combination of independent
variables? Which of the IVs appear significantly related to the DV and which don’t?
Provide a copy of your summary output to support your explanations.

As you can see, not all of our IVs are significantly related to our DV. The most obvious solution to this
problem would be to re-cast the model without all of the non-significant IVs at once. This would,
however, be inappropriate: as we remove IVs from our model, we remove their confounding effects, and
the relationships of the remaining IVs to our DV can change. To be more rigorous about this, we remove
the variables one at a time, from least significant to most significant, and see what influence this has on
our model.

Looking at our IVs, perhaps unsurprisingly, illiteracy shows the least significant correlation with life
expectancy. Let’s remove it and rerun the model. We could simply retype our previous command
without the “+ Illiteracy”. R has a shortcut for us, however:

> model2 <- update(complete.model, . ~ . - Illiteracy)

Here the periods mean "keep everything the same as in the previous model", so we're producing a new
model with the same DV and the same IVs, except without Illiteracy.

Question 5) Provide a summary of the new model. What has changed in the relationships of the IVs,
the p-values and the adjusted r2 value? How do you interpret this?

Repeat the step for the next least significant variable and take a look at your regression summary again.
Are we seeing a pattern? Continue on until you have removed all variables that aren't significant in your
model. Have you chosen a p-value that you are willing to accept as statistically significant? If not, you
should have done so, and stated it earlier. P = 0.05 is our standard, but we now know that we don't have
to stick to that.
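
To illustrate the pattern, a sketch only – check your own summary to see which variable really is the
next least significant; it may well not be the one I have assumed here:

> # Drop the next least significant IV from model2 (Income is an assumption -- use your own)
> model3 <- update(model2, . ~ . - Income)
> summary(model3)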

Question 6) Provide a summary of your final model. Which variables influence life expectancy?
Does this make sense? What issues did you encounter in narrowing down to this model?
Provide a table containing the p-values and adjusted r2 values for each model you
examined.

As you can see, r2 (the raw goodness of fit) generally decreases as we remove variables from the model
(i.e. models with fewer variables do a worse job of describing your data than models with more
variables), but not by much. However, a model with lots of parameters, most of which don't have a
large impact, isn't particularly useful when we are trying to understand general controlling principles.
What we want is a model that balances the best explanatory power with the fewest variables. How do
we figure out what this is?

A simple way is to use adjusted r2 – the “adjusted” in its title indicates that the goodness of fit of the
DV to the IVs has been modified to account for the number of parameters in the model, providing a
measure of the balance of model complexity and explanatory power. R2 is a tricky parameter in general,
however, as there is some portion of the variation in any real dataset that is random and hence can
never be explained by the model. Additionally, small differences in r2 can lead to large differences in
model fit, as illustrated on the following page (courtesy Charles Annis).
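
To make the adjustment concrete, here is a quick sketch of how the adjusted value is calculated from the
ordinary r2 (n is the number of observations, p the number of IVs); the result should match the adjusted
r2 reported by summary():

> r2 <- summary(complete.model)$r.squared
> n  <- nrow(state.data)
> p  <- length(coef(complete.model)) - 1     # number of IVs, excluding the intercept
> 1 - (1 - r2) * (n - 1) / (n - p - 1)       # adjusted r2, calculated by hand
> summary(complete.model)$adj.r.squared      # should agree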

Another method, which I prefer, is to use a technique based in information theory to assess model fit.
The model with the lowest calculated "Information Criterion" value (most commonly Akaike's "An
Information Criterion", or AIC) contains the most information for the least model complexity, and so is
the "best" model.
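
For reference, AIC is calculated as 2k - 2*log(likelihood), where k counts every estimated parameter
(the regression coefficients plus the residual variance), so the 2k term is the penalty for complexity. A
quick sanity check in R:

> 2 * (length(coef(complete.model)) + 1) - 2 * as.numeric(logLik(complete.model))
> AIC(complete.model)                        # should give the same number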

We can calculate AIC values for each of our candidate models using a command that looks something
like this:

> AIC(complete.model, …………., smallest.model)

Question 7) Provide a summary of the best fitting model according to AIC. Which variables
influence life expectancy now? How would you interpret this? How does this differ
from the model you selected in question 6?

You can actually carry out AIC-based model selection starting with the complete model using a single
command:

> step(complete.model, direction = "backward")

Check that this gives you the same answer as before. We now know that the factors controlling life
expectancy in the US are state population, murder rate, high school graduation rate and number of days
per year below freezing. Interesting and hurrah for R!
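
One practical tip (the name best.model is just my choice): step() returns the final fitted model as well as
printing the selection trace, so it is worth storing the result for the diagnostics that follow:

> best.model <- step(complete.model, direction = "backward")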

Wait, wait! Hang on. What did we forget? How about testing that we don't violate any of the
assumptions of linear regression? Oops. Let's do that now.
Question 8) Provide a 2x2 diagnostic plot of the regression test statistics for your best fitting model,
and evaluate whether any assumptions have been violated.
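
If you are unsure where to start, a minimal sketch, assuming you stored the selected model as
best.model as suggested above:

> par(mfrow = c(2, 2))    # a 2x2 grid of panels
> plot(best.model)        # residual, Q-Q, scale-location and leverage plots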

Looks like there might be a couple of issues, doesn't it? Firstly, I'd recommend removing the couple of
states with the highest leverage from your dataset, providing you can justify it... Maybe something else
is going on with those states that isn't captured by the variables we measured.
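
A sketch of one way to drop rows by state name – "Alaska" and "Hawaii" below are placeholders for
illustration only; substitute whichever states your own leverage plot actually flags:

> drop.states <- c("Alaska", "Hawaii")       # placeholders -- use the states you identified
> state.trimmed <- state.data[!(rownames(state.data) %in% drop.states), ]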

Question 9) Which states did you remove, and why? Provide the code you used to remove the states.

One other issue is that some of our individual IVs might not be normally distributed. Unlike with OLS
regression, we can't directly see which IVs are violating our assumptions. A quick way to eyeball this is
with box and whisker plots: if these are roughly symmetrical, our data are at least not badly skewed; if
not, something else is going on.

Let’s do this now:

> par(mfrow = c(3, 3))
> for(i in 1:ncol(state.data)) { boxplot(state.data[, i], main = names(state.data)[i]) }

Three of our variables seem to be strongly skewed away from normality. We have several options here –
ignore it (which we can do, as I talked about earlier), use a non-linear regression model (about which
more in a couple of weeks' time), or transform the variables. The latter is the simplest to pull off, so
let's give it a go.

Log transformation will substantially reduce even quite strong positive (right) skews, and can be
achieved in R using the log() or log10() functions, depending on the base you wish to use.
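
As a sketch of the mechanics, using Population as an example (your box plots will tell you whether it
really is one of the three offenders):

> state.data$log.Population <- log10(state.data$Population)
> boxplot(state.data$log.Population, main = "log10(Population)")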

Question 10) Log transform the three offending variables, plot new box plots showing that the
normality of these data has improved, and then repeat the regression/model selection
exercise using these new variables in place of the non-transformed variables. What
influence does this have on your model? Is the AIC better than before (i.e. does log
transforming provide a better model than the non-transformed data)? How do you
interpret your results? Provide all the plots and output necessary to support your
arguments.
