Linear Regression
The aim of this learning module is to test whether an apparently linear relationship
between two variables is real, or whether it could have happened by chance because of
variability. To do this we use regression analysis, strictly speaking linear regression.
This is based on the 'method of least squares' that we've already met in using Excel to fit
a trendline to data on a spreadsheet chart.
However, there are many cases when one measurement is clearly independent of the
other one. Some examples:
1. Ages and weights of people: the weight of a growing person clearly depends on their
age, but their age is independent of their weight.
2. Time and intracellular pH in cells: you can measure the pH in cells at particular times,
but it would be meaningless to plan to measure the time at particular pH values.
3. Reaction rate and temperature: you can control the temperature and this affects the
rate of the reaction, but you can't set a reaction rate that will affect the temperature of
the experiment.
Statistically speaking, you shouldn't investigate these cases using correlation. Instead
you should plot the data correctly and use regression analysis to see if there is a linear
association.
Regression analysis calculates the "line of best fit" through the data points.
It does this by finding the straight line, y = a + bx, that minimises the sum of the
squares of the vertical distances, Σsᵢ², of the points from the line.
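As a sketch of how the method of least squares works, the slope and intercept can be computed directly from the standard formulas. The x and y values below are made up for illustration; they are not data from this module:

```python
# Hypothetical example data (not from this module)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates for the line y = a + b*x:
# b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²,  a = ȳ - b*x̄
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

# Sum of squared vertical distances, Σsᵢ², which this line minimises
ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
print(round(a, 3), round(b, 3), round(ss, 3))
```

Any other line through these points would give a larger value of Σsᵢ² than the one printed here.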
The problem
You could get exactly the same line and exactly the same equation just by chance even
if there is no association between the independent variable and the dependent one. This
can easily happen when the data points are very scattered, as in panel (b) of the graphs
below.
In panel (a) you can clearly see that there is a significant linear relationship between the
two variables. This will be indicated by the fact that the sum of squares Σsᵢ² is low.
In panel (b) the points are all over the place, so the value of Σsᵢ² is likely to be very large,
indicating that there is not a significant association, even though the line of best fit is
exactly the same as in panel (a).
To illustrate the use of linear regression analysis in Prism, let's consider the relationship
between time and the mass of a batch of eggs. It looks as if the mass of the eggs fell as
they got older, but is this a significant fall? To determine this we must test whether the
slope of the regression line is significantly less than zero. The null hypothesis is that the
mass of the eggs does not change over time; in other words, that the slope of the line is
zero.
To enter the data we use an XY data table in Prism, putting the time values (the
independent variable in this example) in the X column and the mass of the eggs (the
dependent variable) in the first Y column:
After pressing the Analyze button we select Linear Regression from the list of XY
analyses:
Accepting all the default options in the next dialogue box we get to a Results page
looking like this:
Here we can see that the line of best fit has a Slope of -1.361 with a standard error of
0.0951, and a Y-intercept of 89.44 with a standard error of 2.279.
The equation of the line Y = -1.361*X + 89.44 is shown at the bottom of the window.
The P value for the difference between the slope and zero is much less than 0.05 so we
can reject our null hypothesis and conclude that the mass of the eggs does decrease
significantly with time.
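The P value Prism reports comes from a t-test on the slope. As a rough check, the t statistic can be reproduced from the Results page figures in a few lines of Python (the number of points, N = 25, is an assumption here for illustration, chosen to match the critical value tcrit = 2.069 used later in this module):

```python
# Slope and its standard error, taken from the Prism Results page
slope, se_slope = -1.361, 0.0951

# t statistic for the null hypothesis that the slope is zero
t = slope / se_slope  # about -14.3

# Assuming N = 25 points, df = N - 2 = 23, and the two-tailed
# 5% critical value is 2.069
t_crit = 2.069
print(abs(t) > t_crit)  # True: reject the null hypothesis
```

Because |t| is far beyond the critical value, the corresponding P value is much less than 0.05, matching Prism's conclusion.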
The Graph page shows an XY plot of the data together with the fitted regression line:
You can also carry out your own t-tests on the results of the linear regression obtained
with Prism, to work out whether the slope or intercept differs from any particular
value.
For example, you could test whether eggs are significantly lighter than, say, 90 g when
they are laid. In other words, test whether the intercept is significantly lower than 90.
We do this as follows:
1. Calculate the t statistic: t = (intercept − hypothesised value) / standard error of the
intercept.
2. Substitute the values from the Results page: t = (89.44 − 90) / 2.279 = −0.246.
3. Compare t with the critical value for N − 2 degrees of freedom, where N = number
of points:
tcrit = 2.069
Since |t| = 0.246 is well below tcrit, we cannot reject the null hypothesis: the intercept
is not significantly lower than 90, so we cannot conclude that the eggs weigh less than
90 g when they are laid.
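The same intercept test can be sketched in Python using the values from the regression output:

```python
# Intercept and its standard error from the Prism Results page
intercept, se_int = 89.44, 2.279
hypothesised = 90.0

# t statistic for the null hypothesis that the intercept equals 90
t = (intercept - hypothesised) / se_int
print(round(t, 3))  # -0.246

# Compare |t| with the critical value for N - 2 = 23 degrees of freedom
t_crit = 2.069
print(abs(t) > t_crit)  # False: cannot reject the null hypothesis
```

Because |t| is far smaller than the critical value, the data give no evidence that the intercept is below 90 g.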