Correlation and Regression
This site is no longer maintained and has been left for archival purposes
Text and links may be out of date
Correlation
Suppose that we took 7 mice and measured their body weight and their length from nose to tail. We obtained the following results and want to know if there is any relationship between the measured variables. [To keep the calculations simple, we will use small numbers.]

Mouse                  1    2    3    4    5    6    7
Units of weight (x)    1    4    3    4    8    9    8
Units of length (y)    2    5    8   12   14   19   22

Procedure

(1) Plot the results on graph paper. This is the essential first step, because only then can we see what the relationship might be - is it linear, logarithmic, sigmoid, etc.?
In our case the relationship seems to be linear, so we will continue on that assumption. If it does not seem to be linear we might need to transform the data.

(2) Set out a table as follows and calculate Σx, Σy, Σx², Σy², Σxy, and the means of x and y:

          Weight (x)   Length (y)    x²     y²     xy
Mouse 1        1            2         1      4      2
Mouse 2        4            5        16     25     20
Mouse 3        3            8         9     64     24
Mouse 4        4           12        16    144     48
Mouse 5        8           14        64    196    112
Mouse 6        9           19        81    361    171
Mouse 7        8           22        64    484    176
Total         37           82       251   1278    553
Mean       5.286       11.714

Σx = 37, Σy = 82, Σx² = 251, Σy² = 1278, Σxy = 553
(3) Calculate the sum of squared deviations of x from its mean:

Σd²x = Σx² − (Σx)²/n = 251 − 37²/7 = 55.43

(4) Calculate the sum of squared deviations of y from its mean:

Σd²y = Σy² − (Σy)²/n = 1278 − 82²/7 = 317.43

(5) Calculate the sum of the products of the deviations:

Σdxdy = Σxy − (Σx)(Σy)/n = 553 − (37 × 82)/7 = 119.57

(6) Calculate the correlation coefficient:

r = Σdxdy / √(Σd²x × Σd²y) = 119.57 / √(55.43 × 317.43) = 0.9014 in our case.
(7) Look up r in a table of correlation coefficients (ignoring the + or - sign). The number of degrees of freedom is two less than the number of points on the graph (5 df in our example, because we have 7 points). If our calculated r value exceeds the tabulated value at p = 0.05 then the correlation is significant. Our calculated value (0.9014) does exceed the tabulated value (0.754). It also exceeds the tabulated value for p = 0.01 but not for p = 0.001. If the null hypothesis were true (that there is no relationship between length and weight) we would have obtained a correlation coefficient as high as this less than 1 time in 100. So we can be confident that weight and length are positively correlated in our sample of mice.

Important notes:

1. If the calculated r value is positive (as in this case) then the slope will rise from left to right on the graph: as weight increases, so does length. If the calculated value of r is negative the slope will fall from left to right, indicating that length decreases as weight increases.

2. The r value will always lie between -1 and +1. If you have an r value outside this range you have made an error in the calculations.

3. Remember that a correlation does not necessarily demonstrate a causal relationship. A significant correlation only shows that two factors vary in a related way (positively or negatively). This is obvious in our example, because there is no logical reason to think that weight influences the length of the animal (both factors are influenced by age or growth stage). But it can be easy to fall into the "causality trap" when looking at other types of correlation.

What does the correlation coefficient mean?

The part above the line in the equation for r is a measure of the degree to which x and y vary together (using the deviations d of each from the mean). The part below the line is a measure of the degree to which x and y vary separately.
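As a quick check of the arithmetic, here is a minimal Python sketch of the same calculation (the data are simply our mouse example re-typed, and the variable names are our own); it reproduces the r value of 0.9014 obtained by hand.

import math

weight = [1, 4, 3, 4, 8, 9, 8]       # x: units of weight
length = [2, 5, 8, 12, 14, 19, 22]   # y: units of length
n = len(weight)

sum_x, sum_y = sum(weight), sum(length)
sum_x2 = sum(x * x for x in weight)
sum_y2 = sum(y * y for y in length)
sum_xy = sum(x * y for x, y in zip(weight, length))

# Sums of squared deviations, and sum of products of deviations
d2x = sum_x2 - sum_x ** 2 / n        # about 55.43
d2y = sum_y2 - sum_y ** 2 / n        # about 317.43
dxdy = sum_xy - sum_x * sum_y / n    # about 119.57

r = dxdy / math.sqrt(d2x * d2y)
print(round(r, 4))                   # 0.9014, as in the worked example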
Regression

In regression analysis we examine the effect of one variable on another. The variable that we choose or control is termed the independent variable and is plotted on the X axis. The other variable is termed the dependent variable and is plotted on the Y axis. Suppose that we had the following results from an experiment in which we measured the growth of a cell culture (as optical density) at different pH levels.
pH                 3     4     4.5    5     5.5    6     6.5    7     7.5
Optical density    0.1   0.2   0.25   0.32  0.33   0.35  0.47   0.49  0.53
We plot these results (see below) and they suggest a straight-line relationship.
Using the same procedures as for correlation, set out a table as follows and calculate Σx, Σy, Σx², Σy², Σxy, and the means of x and y:

        pH (x)   Optical density (y)    x²        y²       xy
          3            0.1               9       0.01      0.3
          4            0.2              16       0.04      0.8
          4.5          0.25             20.25    0.0625    1.125
          5            0.32             25       0.1024    1.6
          5.5          0.33             30.25    0.1089    1.815
          6            0.35             36       0.1225    2.1
          6.5          0.47             42.25    0.2209    3.055
          7            0.49             49       0.2401    3.43
          7.5          0.53             56.25    0.2809    3.975
Total    49            3.04            284       1.1882   18.2
Mean      5.444        0.3378

Σx = 49, Σy = 3.04, Σx² = 284, Σy² = 1.1882, Σxy = 18.2
Mean of x = 5.444, mean of y = 0.3378
Now calculate the sum of squared deviations of x:

Σd²x = Σx² − (Σx)²/n = 284 − 49²/9 = 17.22

Calculate the sum of squared deviations of y:

Σd²y = Σy² − (Σy)²/n = 1.1882 − 3.04²/9 = 0.1614

Calculate the sum of the products of the deviations:

Σdxdy = Σxy − (Σx)(Σy)/n = 18.2 − (49 × 3.04)/9 = 1.649

Now we want to use regression analysis to find the line of best fit to the data. We have done nearly all the work for this in the calculations above. The regression equation for y on x is:

y = bx + a

where b is the slope and a is the intercept (the point where the line crosses the y axis).

We calculate b as:

b = Σdxdy / Σd²x = 1.649 / 17.22 = 0.0958 in our case.

We calculate a as:

a = ȳ − b x̄

Substituting ȳ (0.3378), x̄ (5.444) and b (0.0958), we find a = −0.1837.
So the equation for the line of best fit is: y = 0.096x − 0.184 (to 3 decimal places). To draw the line through the data points, we substitute values of x into this equation. For example: when x = 4, y = 0.200, so one point on the line has the x,y coordinates (4, 0.200); when x = 7, y = 0.488, so another point on the line has the x,y coordinates (7, 0.488). It is also true that the line of best fit always passes through the point with coordinates (x̄, ȳ), so we actually need only one other calculated point in order to draw a straight line.
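The same slope and intercept can be obtained with a few lines of Python. This is a minimal sketch (the variable names are our own); the final loop prints two points that can be used to draw the fitted line.

ph = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]                      # x
od = [0.1, 0.2, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]     # y
n = len(ph)

sum_x, sum_y = sum(ph), sum(od)
d2x = sum(x * x for x in ph) - sum_x ** 2 / n                  # about 17.22
dxdy = sum(x * y for x, y in zip(ph, od)) - sum_x * sum_y / n  # about 1.649

b = dxdy / d2x                  # slope, about 0.096
a = sum_y / n - b * sum_x / n   # intercept, about -0.183
print(round(b, 4), round(a, 4))

# Two points are enough to draw the fitted line through the data:
for x in (4, 7):
    print(x, round(b * x + a, 3))   # about 0.199 and 0.487, matching the hand calculation after rounding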
Below is a printout of the Regression analysis from Microsoft "Excel". It is obtained simply by entering two columns of data (x and y) then clicking "Tools - Data analysis - Regression". We see that it gives us the correlation coefficient r (as "Multiple R"), the intercept and the slope of the line (seen as the "coefficient for pH" on the last line of the table). It also shows us the result of an Analysis of Variance (ANOVA) to calculate the significance of the regression (p = 4.36 × 10⁻⁷).
Regression Statistics
Multiple R           0.989133329
R Square             0.978384742
Adjusted R Square    0.975296848
Standard Error       0.022321488
Observations         9

            Coefficients    Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept   -0.18348387     0.030215         -6.07269   0.000504   -0.25493    -0.11204    -0.25493      -0.11204
pH           0.095741935    0.005379         17.80015   4.36E-07    0.083023    0.108461    0.083023      0.108461
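If Excel is not available, the scipy library in Python gives the same quantities. The sketch below is one way to do it (assuming scipy is installed); linregress returns the slope, the intercept, the correlation coefficient and the p-value for the slope.

from scipy import stats

ph = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
od = [0.1, 0.2, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]

result = stats.linregress(ph, od)
print(result.slope)      # about 0.0957, the "coefficient for pH"
print(result.intercept)  # about -0.183
print(result.rvalue)     # about 0.989, the "Multiple R"
print(result.pvalue)     # about 4.4e-07, the significance of the regression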
Presenting the results

The final graph should show: (i) all measured data points; (ii) the line of best fit; (iii) the equation for the line; (iv) the R² and p values.
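One possible way to produce such a graph is sketched below with matplotlib; the slope, intercept, R² and p values written into the annotation are those calculated above.

import matplotlib.pyplot as plt

ph = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
od = [0.1, 0.2, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]
b, a = 0.0957, -0.183   # slope and intercept from the regression above

plt.scatter(ph, od, label="measured data")
line_x = [min(ph), max(ph)]
plt.plot(line_x, [b * x + a for x in line_x], label="line of best fit")
plt.xlabel("pH")
plt.ylabel("Optical density")
plt.annotate("y = 0.096x - 0.184\n$R^2$ = 0.978, p < 0.001",
             xy=(0.05, 0.85), xycoords="axes fraction")
plt.legend()
plt.show()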
Logarithmic and sigmoid curves

Not all relationships are linear, and if we run a regression analysis on untransformed data it will do this by assuming that the relationship is linear! In such cases the data should be transformed before the regression is calculated.

(i) For plots of data that suggest exponential (logarithmic) growth, convert all y values to log of y (using either log10 or loge). Then go through the linear regression procedures above, using the log y data instead of the y data.

(ii) For sigmoid curves (drug dose response curves and UV killing curves are often sigmoid), the y values (proportion of the population responding to the treatment) can be converted using a logistic or probit transformation. Sometimes it is useful to convert the x (dose) data to logarithms; this condenses the x values, removing the long tails of non-responding individuals at the lowest and highest dose levels. A plot of logistic or probit (y) against dose (x) or log of dose (x) should show a straight-line relationship.
Examples of these conversions (proportion to logistic, % to probit, and % to arcsin) are tabulated below.

Proportion    Logistic     Probit    Arcsin
0.001         -2.99957     1.91       1.812
0.005         -2.29885     2.424      4.055
0.01          -1.99564     2.674      5.739
0.02          -1.6902      2.946      8.13
0.03          -1.50965     3.119      9.974
0.04          -1.38021     3.249     11.54
0.05          -1.27875     3.355     12.92
0.06          -1.19498     3.445     14.18
0.07          -1.12338     3.524     15.34
0.08          -1.0607      3.595     16.43
0.09          -1.0048      3.659     17.46
0.1           -0.95424     3.718     18.43
0.5            0           5         45
0.96           1.380211    6.751     78.46
0.97           1.50965     6.881     80.03
0.98           1.690196    7.054     81.87
0.995          2.298853    7.576     85.95
0.9999         3.999957    8.719     89.43
0.99999        4.999996    9.265     89.82
0.999999       6           9.768     89.94
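For readers who prefer to calculate rather than look up the conversions, the minimal Python sketch below (our own, assuming scipy is available for the normal quantile function) reproduces the tabulated values: the logistic value is log10 of the odds p/(1−p), the probit is the normal equivalent deviate plus 5, and the arcsin value is the angle (in degrees) whose sine is √p.

import math
from scipy.stats import norm

def logistic(p):
    """log10 of the odds ratio p/(1-p)."""
    return math.log10(p / (1 - p))

def probit(p):
    """Normal equivalent deviate plus 5, the classical probit."""
    return norm.ppf(p) + 5

def arcsin_sqrt(p):
    """Angular (arcsin square-root) transformation, in degrees."""
    return math.degrees(math.asin(math.sqrt(p)))

for p in (0.001, 0.5, 0.98):
    print(p, round(logistic(p), 5), round(probit(p), 3), round(arcsin_sqrt(p), 2))
# 0.001  -2.99957  1.91   1.81
# 0.5     0.0      5.0   45.0
# 0.98    1.6902   7.054 81.87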
As an example of the use of transformations, the data from a fictitious dose-response curve (table below) are shown in two curves - first, without transformation and then after transforming the proportion responding to logistic values.
Dose   Proportion   Logistic
 1     0.01         -1.99564
 2     0.015        -1.81734
 3     0.02         -1.6902
 4     0.04         -1.38021
 5     0.045        -1.32679
 6     0.05         -1.27875
 7     0.07         -1.12338
 8     0.1          -0.95424
 9     0.19         -0.62973
10     0.25         -0.47712
11     0.34         -0.28807
12     0.44         -0.10474
13     0.53          0.052178
14     0.62          0.212608
15     0.68          0.327359
16     0.74          0.454258
17     0.79          0.575408
18     0.83          0.688629
19     0.85          0.753328
20     0.88          0.865301
21     0.9           0.954243
22     0.92          1.060698
23     0.935         1.157898
24     0.95          1.278754
25     -             -
26     -             -
27     -             -
28     -             -
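To illustrate the whole procedure in code, here is a minimal sketch (our own) that converts the proportions above to logistic values and then fits an ordinary linear regression to the transformed data, as described in the text.

import math
from scipy import stats

dose = list(range(1, 25))   # doses 1 to 24 from the table above
proportion = [0.01, 0.015, 0.02, 0.04, 0.045, 0.05, 0.07, 0.1, 0.19, 0.25,
              0.34, 0.44, 0.53, 0.62, 0.68, 0.74, 0.79, 0.83, 0.85, 0.88,
              0.9, 0.92, 0.935, 0.95]

# Logistic transformation, e.g. 0.01 -> -1.99564 as in the table
logistic = [math.log10(p / (1 - p)) for p in proportion]

fit = stats.linregress(dose, logistic)
print(fit.slope, fit.intercept, fit.rvalue)   # the transformed data lie close to a straight line

The fitted slope and intercept can then be used, for example, to estimate the dose giving a 50% response, which is the dose at which the logistic value is zero.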