Simple Regression
An Introduction to Regression
In regression analysis we fit a predictive model to our data: we use a model to predict values of the dependent variable (DV) from one or more independent variables (IVs).¹ Simple regression seeks to predict an outcome from a single predictor, whereas multiple regression seeks to predict an outcome from several predictors. This is an incredibly useful tool because it allows us to go a step beyond the data that we actually possess. The model that we fit to our data is a linear one: picture it as summarizing a data set with a straight line.
With any data set there are a number of lines that could be used to summarize the general trend and so
we need a way to decide which of many possible lines to choose. For the sake of drawing accurate
conclusions we want to fit a model that best describes the data. The simplest way to fit a line is to use
your eye to gauge a line that looks as though it summarizes the data well. However, the eyeball method
is very subjective and so offers no assurance that the model is the best one that could have been chosen.
Instead, we use a mathematical technique called the method of least squares to find the line that best
describes the data collected (see Field, 2005, Chapter 5 for a description).
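Although the handout's analyses are run in SPSS, the least-squares solution is easy to see in code: the gradient is the covariance of the predictor and outcome divided by the variance of the predictor, and the intercept then forces the line through the point of means. A minimal Python sketch with made-up data:

    import numpy as np

    # Made-up example data: one predictor (x) and one outcome (y)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Least-squares estimates: the gradient is the covariance of x and y
    # divided by the variance of x; the intercept then forces the line
    # through the point of means (mean of x, mean of y).
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    print(f"intercept b0 = {b0:.3f}, gradient b1 = {b1:.3f}")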
Some Important Information about Straight Lines
Any straight line can be drawn if you know: (1) the slope (or gradient) of the line, and (2) the point at
which the line crosses the vertical axis of the graph (the intercept of the line). The equation of a straight
line is defined in equation (1), in which Y_i is the outcome variable that we want to predict and X_i is the ith subject's score on the predictor variable; b_1 is the gradient and b_0 is the intercept of the straight line fitted to the data. There is also a residual term, ε_i, which represents the difference between the score predicted by the line for subject i and the score that subject i actually obtained. The equation is often conceptualized without this residual term (so, ignore it if it's upsetting you); however, it is worth knowing that this term represents the fact that our model will not fit the collected data perfectly.
Y_i = b_0 + b_1 X_i + ε_i        (1)
A particular line has a specific intercept and gradient. Figure 1 shows a set of lines that have the same
intercept but different gradients, and a set of lines that have the same gradient but different intercepts.
Figure 1 also illustrates another useful point: that the gradient of the line tells us something about the
nature of the relationship being described: a line that has a gradient with a positive value describes a
positive relationship, whereas a line with a negative gradient describes a negative relationship. So, if you
look at the graph in Figure 1 in which the gradients differ but the intercepts are the same, then the
thicker line describes a positive relationship whereas the thinner line describes a negative relationship.
If it is possible to describe a line knowing only the gradient and the intercept of that line, then the model
that we fit to our data in linear regression (a straight line) can also be described mathematically by
equation (1). With regression we strive to find the line that best describes the data collected, then
estimate the gradient and intercept of that line. Having defined these values, we can insert different
values of our predictor variable into the model to estimate the value of the outcome variable.
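Once b_0 and b_1 have been estimated, prediction is just substitution into equation (1). A tiny Python sketch (the numbers are hypothetical):

    def predict(b0, b1, x):
        # Equation (1) without the residual term: the model's best
        # guess at the outcome for a given value of the predictor.
        return b0 + b1 * x

    # Hypothetical values, purely for illustration
    print(predict(b0=10.0, b1=0.5, x=20.0))  # prints 20.0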
¹ Unfortunately, you will come across people (and SPSS, for that matter) referring to regression variables as dependent and independent variables (as in controlled experiments). However, correlational research by its nature seldom controls the independent variables to measure their effect on a dependent variable. Instead, variables are measured simultaneously and without strict control. It is, therefore, inaccurate to label regression variables in this way. For this reason I label independent variables as predictors, and the dependent variable as the outcome.
Figure 1: Lines with the same gradient but different intercepts, and lines that share the same intercept but have different gradients
R² is the proportion of the total variation in the outcome that the model explains: the model sum of squares (SS_M) divided by the total sum of squares (SS_T):

R² = SS_M / SS_T        (2)
Interestingly, this value is the same as the square of the Pearson correlation between the two variables (see Field, 2005, Chapter 4), and it is interpreted in the same way. Therefore, in simple regression we can take the square root of this value to obtain the Pearson correlation coefficient. As such, the correlation coefficient provides us with a good estimate of the overall fit of the regression model, and R² provides us with a good gauge of the substantive size of the relationship.
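To see this equivalence concretely, here is a short Python sketch (with made-up data, since the handout's analyses are done in SPSS) that fits a simple regression and confirms that SS_M/SS_T equals the squared Pearson correlation:

    import numpy as np

    # Made-up data, for illustration only
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.0, 4.1, 5.9, 8.3, 9.7, 12.2])

    # Fit the least-squares line and form the model's predictions
    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x

    # R^2 = SS_M / SS_T: variation explained by the model as a
    # proportion of the total variation in the outcome
    ss_t = np.sum((y - y.mean()) ** 2)
    ss_m = np.sum((y_hat - y.mean()) ** 2)
    r_squared = ss_m / ss_t

    # In simple regression this equals the squared Pearson correlation
    r = np.corrcoef(x, y)[0, 1]
    print(r_squared, r ** 2)  # the two values agree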
A second use of the sums of squares in assessing the model is the F-test. The F-test is something we will cover in greater detail later in the term, but briefly this test is based upon the ratio of the improvement due to the model (SS_M) to the difference between the model and the observed data (SS_R). In fact, rather than using the sums of squares themselves, we take the mean sums of squares (referred to as the mean squares, or MS). The result is the mean square for the model (MS_M) and the residual mean square (MS_R); see Field (2005) for more detail. At this stage it isn't essential that you understand how the mean squares are derived (it is explained in Field, 2005, Chapter 8 and in your lectures on ANOVA). However, it is important that you understand that the F-ratio (equation (3)) is a measure of how much the model has improved the prediction of the outcome compared to the level of inaccuracy of the model.
F = MS_M / MS_R        (3)
If a model is good, then we expect the improvement in prediction due to the model to be large (so, MSM
will be large) and the difference between the model and the observed data to be small (so, MSR will be
small). In short, a good model should have a large F-ratio (greater than one at least) because the top half
of equation (3) will be bigger than the bottom. The exact magnitude of this F-ratio can be assessed using
critical values for the corresponding degrees of freedom.
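To make the calculation concrete, the following Python sketch (the sums of squares and degrees of freedom are hypothetical) forms the mean squares, the F-ratio of equation (3), and the probability of obtaining an F at least that large by chance:

    from scipy import stats

    # Hypothetical sums of squares and degrees of freedom for a simple
    # regression with one predictor and 200 cases (not the record-sales data)
    ss_m, df_m = 1200.0, 1
    ss_r, df_r = 4800.0, 198

    ms_m = ss_m / df_m     # mean square for the model
    ms_r = ss_r / df_r     # residual mean square
    f_ratio = ms_m / ms_r  # equation (3)

    # Probability of an F at least this large if the model explained
    # nothing, from the F distribution with (df_m, df_r) degrees of freedom
    p = stats.f.sf(f_ratio, df_m, df_r)
    print(f"F({df_m}, {df_r}) = {f_ratio:.3f}, p = {p:.4f}")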
Figure 2: Scatterplot showing the relationship between record sales and the amount spent promoting the record
Figure 3: Main dialog box for regression
ANOVA(b)

Model           Sum of Squares    df     Mean Square    F         Sig.
1  Regression   433687.833        1      433687.833     99.587    .000(a)
   Residual     862264.167        198    4354.870
   Total        1295952.0         199

a. Predictors: (Constant), Advertising Budget (thousands of pounds)
b. Dependent Variable: Record Sales (thousands)

SPSS Output 2
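As a quick check, the mean squares and F-ratio in SPSS Output 2 follow directly from the sums of squares: MS_M = 433687.833/1 = 433687.833, MS_R = 862264.167/198 = 4354.870, and F = MS_M/MS_R = 433687.833/4354.870 = 99.587, exactly as equation (3) says.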
Model Parameters
The ANOVA tells us whether the model, overall, results in a significantly good degree of prediction of the
outcome variable. However, the ANOVA doesnt tell us about the individual contribution of variables in
the model (although in this simple case there is only one variable in the model and so we can infer that
this variable is a good predictor). The table in SPSS Output 3 provides details of the model parameters
(the beta values) and the significance of these values. We saw in equation (1) that b_0 was the Y intercept, and this is the value given as B for the constant. So, from the table, we can say that b_0 is 134.14, and this
can be interpreted as meaning that when no money is spent on advertising (when X = 0), the model
predicts that 134,140 records will be sold (remember that our unit of measurement was thousands of
records). We can also read off the value of b_1 from the table; this value represents the gradient of the regression line. It is 9.612E−02, which in unabbreviated form is 0.09612.² Although this value is the slope
of the regression line, it is more useful to think of this value as representing the change in the outcome
associated with a unit change in the predictor. Therefore, if our predictor variable is increased by 1 unit (if the advertising budget is increased by 1), then our model predicts 0.096 extra units of outcome. Our units of measurement were thousands of pounds and thousands of records sold, so we can say that for an increase in advertising of £1000 the model predicts 96 (0.096 × 1000 = 96) extra record sales. As you might imagine, this is a pretty bad return for the record company: they invest £1000 and get only 96 extra sales! Fortunately, as we already know, advertising accounts for only one-third of the variance in record sales.
Coefficients(a)

                                      Unstandardized Coefficients    Standardized Coefficients
Model                                 B            Std. Error        Beta          t         Sig.
1  (Constant)                         134.140      7.537                           17.799    .000
   Advertising Budget
   (thousands of pounds)              9.612E-02    .010              .578          9.979     .000

a. Dependent Variable: Record Sales (thousands)

SPSS Output 3
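To put these estimates to work, the short Python sketch below reproduces the predictions just described using the coefficients from SPSS Output 3 (the helper function name is ours, purely for illustration):

    # Coefficients from SPSS Output 3 (budget in thousands of pounds,
    # sales in thousands of records)
    b0 = 134.140   # intercept: predicted sales with no advertising
    b1 = 0.09612   # gradient: change in (thousands of) sales per
                   # extra (thousand) pounds of advertising

    def predicted_sales(budget_thousands):
        # Hypothetical helper; this is just equation (1)
        return b0 + b1 * budget_thousands

    print(predicted_sales(0))                       # 134.14 (thousand records)
    print(predicted_sales(1) - predicted_sales(0))  # 0.09612, i.e. ~96 records per £1000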
The values of b represent the change in the outcome resulting from a unit change in the predictor. If the
model was useless at predicting the outcome, then if the value of the predictor changes, what might we
expect the change in the outcome to be? Well, if the model is very bad then we would expect the change
in the outcome to be zero. A regression coefficient of zero means: (a) a unit change in the predictor variable results in no change in the predicted value of the outcome, and (b) the gradient of the regression
line is flat. Hopefully, what should be clear at this stage is that if a variable significantly predicts an
outcome, then it should have a b value significantly different from zero. This hypothesis is tested using a
t-test (see Field, 2005, Chapter 7). The t-statistic tests the null hypothesis that the value of b is zero: therefore, if the test is significant, we reject that null hypothesis and conclude that the b value is significantly different from zero and that the predictor variable contributes significantly to our ability to estimate values of the outcome.
The values of t can be compared to the values that we would expect to find by chance alone: if t is very
large then it is unlikely to have occurred by chance. SPSS provides the exact probability that the observed
value of t is a chance result, and as a general rule, if this observed significance is less than 0.05, then
social scientists agree that the result reflects a genuine effect. For these two values, the probabilities are
0.000 (zero to 3 decimal places) and so we can say that the probability of these t values occurring by
chance is less than 0.001. Therefore, they reflect genuine effects. We can, therefore, conclude that
advertising budget makes a significant contribution to predicting record sales.
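For reference, each t in SPSS Output 3 is simply B divided by its standard error, evaluated against a t distribution with N − 2 = 198 degrees of freedom (N = 200, from the total df of 199 in SPSS Output 2). A minimal Python sketch using scipy:

    from scipy import stats

    # t = B / SE(B): how many standard errors the estimate lies from zero.
    # Values from the constant row of SPSS Output 3; the tiny discrepancy
    # from the printed t (17.799) is rounding in the displayed SE.
    b, se_b = 134.140, 7.537
    t = b / se_b  # ~17.80

    # Two-tailed p-value with N - 2 = 198 degrees of freedom
    p = 2 * stats.t.sf(abs(t), df=198)
    print(f"t(198) = {t:.3f}, p = {p:.3g}")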
² You might have noticed that this value is reported by SPSS as 9.612E−02, and many students find this notation confusing. Think of E−02 as meaning "move the decimal point two places to the left", so 9.612E−02 becomes 0.09612.
Task 1
Lacourse et al. (2001) conducted a study to see whether suicide risk was related to listening to heavy
metal music. They devised a scale to measure preference for bands falling into the category of heavy
metal. This scale included heavy metal bands (Black Sabbath, Iron Maiden), speed metal bands (Slayer,
Metallica), death/black metal bands (Obituary, Burzum) and gothic bands (Marilyn Manson, Sisters of Mercy). They then used this (and other variables) as predictors of suicide risk, based on a scale measuring suicidal ideation, etc., devised by Tousignant et al. (1988).
Lacourse, E., Claes, M., & Villeneuve, M. (2001). Heavy metal music and adolescent suicidal risk. Journal of Youth and Adolescence, 30(3), 321-332. [Available through the Sussex Electronic Library].
Let's imagine we replicated this study. The data file HMSuicide.sav (on the course website) contains the
data from such a replication. There are two variables representing scores on the scales described above:
hm (the extent to which the person listens to heavy metal music) and suicide (the extent to which
someone has suicidal ideation and so on). Using these data, carry out a regression analysis to see whether
listening to heavy metal predicts suicide risk.
Your Answers:
Does listening to heavy metal significantly predict suicide risk (quote relevant statistics)?
Your Answers:
What is the nature of the relationship between listening to heavy metal and suicide
risk? (sketch a scatterplot if it helps you to explain).
Your Answers:
As listening to heavy metal increases by 1 unit, how much does suicide risk increase?
Your Answers: