LEC 7
BIOSTATISTICS
CORRELATION AND SIMPLE LINEAR REGRESSION
Objectives
• To examine the linear relationship between two quantitative variables
using
• CORRELATION (see Lec 7a)
• REGRESSION
Comments on Graphs
• Always draw a graph
• Examples show that the correlation coefficient, as a summary statistic, is not
sufficient on its own for drawing final conclusions about the data
• If the coefficient of linear correlation between x and y is significant, y can be
expressed as a linear equation in x.
• This equation can be used to predict the values of y given values of x.
• This equation is called the regression equation.
• The value $r^2$ is the proportion of the variation in y that is explained by
the linear association between x and y.
SIMPLE LINEAR REGRESSION
Correlation and Regression
• Correlation describes the strength of a linear relationship between
two variables
• Linear means “straight line”
• Regression tells us how to draw the straight line described by the
correlation
• Calculates the “best-fit” line for a certain set of data
Simple Linear Regression Equation
Regression Equation: $\hat{y} = b_0 + b_1 x$
• Linear association between 2 quantitative variables
• $x$: independent variable (predictor variable or explanatory variable)
• $y$: dependent variable (response variable)
where $b_0$ is the intercept (estimate of the regression intercept) and
$b_1$ is the slope (estimate of the regression slope)
• $b_0$ and $b_1$ are sample estimates of $\beta_0$ and $\beta_1$ (population parameters)
• $y_i$: value of the $i$-th observation
• $\hat{y}_i$: estimated Y value for a given observation
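Simple Linear Regression Equation
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
[Figure: plot of Y against X showing the observed value of Y for $X_i$, the predicted value of Y for $X_i$, the intercept $\beta_0$ at X = 0, and the slope $\beta_1$ = (change in Y) / (change in X).]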
Examples
Assumptions/Requirements
• For each fixed value of x, corresponding values of y have a bell-shaped
distribution.
• For different values of x, distribution of y-values all have the same
variance (homoscedasticity)
- Variance increases when x increases: variance is not the same for all values of x (assumption violated)
- Variance remains constant when x increases: variance is approximately the same for all values of x (assumption met)
The Slope and Intercept
For $\hat{y} = b_0 + b_1 x$
SLOPE: $b_1 = \dfrac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$
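As a minimal sketch of how the slope and intercept could be computed by hand, the function below uses the usual least-squares computational formula for the slope, b1 = [nΣxy - (Σx)(Σy)] / [nΣx² - (Σx)²], and the standard companion result b0 = ȳ - b1·x̄ for the intercept (the intercept formula is not shown on the slide and is added here as an assumption about what was intended). The function name and the five data points are made up for illustration only.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    # Slope: computational form of the least-squares estimate
    b1 = (n * sum_xy - sum_x * sum_y) / denom
    # Intercept: the fitted line passes through (x-bar, y-bar)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Made-up illustrative data (NOT from the lecture)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(xs, ys)
print(f"y-hat = {b0:.2f} + {b1:.2f} x")  # y-hat = 2.20 + 0.60 x
```

Statistical packages report the same two numbers as the constant and the coefficient of the predictor.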
Example: Regression
Interpretation
• For every 1-unit increase in x, there is a 0.182-unit decrease in y
• The sign of the slope coefficient indicates the direction of the relationship
Example: Matched Pairs
• Examine the relationship between self-reported and measured
female heights (in.).
• Create a Scatterplot
• Recall from the session on the paired t-test:
we failed to reject the null hypothesis that the mean height
difference is equal to zero.
• Note the different axis limits in the two plots
• Linear Correlation Coefficient: r = 0.856863
• Coefficient of Determination: $r^2$ = 0.7342
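The coefficient of determination here is simply the square of the reported correlation coefficient. A short check of that arithmetic, using only the r value quoted above (the variable names are illustrative):

```python
r = 0.856863          # linear correlation coefficient reported above
r_squared = r ** 2    # coefficient of determination

print(f"r^2 = {r_squared:.4f}")                        # r^2 = 0.7342
print(f"share of variation explained: {r_squared:.1%}")  # 73.4%
```

So roughly 73% of the variation in one height measure is accounted for by its linear association with the other.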
Regression: Making Predictions
• Only predict within the relevant range of data
• Do not extrapolate beyond the range of observed X’s
[Figure: scatterplot of House Price ($1000s), axis 0 to 450, against Square Feet, axis 0 to 3000, marking the relevant range for interpolation.]
Exercise
• Data: height and age for 64 children aged 16 or less.
• RQ: Is children’s height related to age? If it is, describe the
association between them.
Scatter Diagram
r = 0.88
• Note:
Coefficient called “_cons” is the intercept ‘a’
The “age” coefficient shows the slope ‘b’
y = 62.2 + 7.2 x
Therefore: Height (cm) = 62.2 + 7.2 (Age (yrs))
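A small sketch of how the fitted equation might be used for prediction. The coefficients are the ones reported above; the function name is hypothetical, and the age range of 0 to 16 years is an assumption based on the description "children aged 16 or less" (the actual minimum age in the sample is not stated).

```python
INTERCEPT_CM = 62.2        # "_cons" in the regression output
SLOPE_CM_PER_YEAR = 7.2    # "age" coefficient in the regression output

# ASSUMPTION: observed ages span 0 to 16 years; the true minimum is not given.
MIN_AGE_YEARS, MAX_AGE_YEARS = 0.0, 16.0


def predicted_height_cm(age_years):
    """Predicted height (cm) from the fitted line, within the observed age range only."""
    if not MIN_AGE_YEARS <= age_years <= MAX_AGE_YEARS:
        raise ValueError("age is outside the observed range; do not extrapolate")
    return INTERCEPT_CM + SLOPE_CM_PER_YEAR * age_years


for age in (2, 5, 10):
    print(f"age {age:>2} yrs -> predicted height {predicted_height_cm(age):.1f} cm")
# age  2 yrs -> predicted height 76.6 cm
# age  5 yrs -> predicted height 98.2 cm
# age 10 yrs -> predicted height 134.2 cm
```

Refusing ages outside the assumed range is one way to enforce the rule about staying within the observed data.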
Constructing the Regression Line
• Method of Least Squares
An interpretation of the correlation coefficient, r
• $r^2$ measures how much of the variation in the y variable is accounted
for by the linear relationship with the x variable.
• The total variation in y can be thought of as the sum of the squared
distances from each y-point to the mean of y (Total SS).
• After fitting the regression line, there is considerably less variation
remaining: the “residual variation” (Residual SS, also called “Error SS”).
• The difference between the total sum of squares and the residual sum
of squares is the amount of variation explained by the regression
model
Total SS - Residual SS = Model SS
Total SS - Error SS = Regression SS
Measures of Variation
• Total variation is made up of two parts:
$SST = SSR + SSE$
(Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares)
$SST = \sum (Y_i - \bar{Y})^2$, $SSR = \sum (\hat{Y}_i - \bar{Y})^2$, $SSE = \sum (Y_i - \hat{Y}_i)^2$
where:
$\bar{Y}$ = Mean value of the dependent variable
$Y_i$ = Observed value of the dependent variable
$\hat{Y}_i$ = Predicted value of Y for the given $X_i$ value
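The decomposition of total variation can be checked numerically. The sketch below fits a least-squares line to a small made-up data set (not from the lecture), computes the total, regression and error sums of squares from their definitions, and confirms that the total equals the sum of the other two. It also computes $r^2$ as the explained share, SSR / SST. The function name is illustrative, and the slope is computed in deviation form, which is algebraically equivalent to the usual computational formula.

```python
def sums_of_squares(xs, ys):
    """Fit y-hat = b0 + b1*x by least squares and return (SST, SSR, SSE)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Least-squares slope (deviation form) and intercept
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    y_hat = [b0 + b1 * x for x in xs]

    sst = sum((y - y_bar) ** 2 for y in ys)                # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # unexplained variation
    return sst, ssr, sse


# Made-up illustrative data (NOT from the lecture)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

sst, ssr, sse = sums_of_squares(xs, ys)
r_squared = ssr / sst

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")  # SST = 6.00, SSR = 3.60, SSE = 2.40
print(f"SSR + SSE = {ssr + sse:.2f}")                        # SSR + SSE = 6.00
print(f"r^2 = SSR / SST = {r_squared:.2f}")                  # r^2 = SSR / SST = 0.60
```

In this toy example 60% of the variation in y is explained by the fitted line and 40% is left in the residuals.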
Measures of Variation
• Total variation is made up of two parts:
$SST = SSR + SSE$
• SST = total sum of squares (Total Variation)
Measures the variation of the Yi values around their mean
• SSR = regression sum of squares (Explained Variation)
Variation attributable to the relationship between X and Y
• SSE = error sum of squares (Unexplained Variation)
Variation in Y attributable to factors other than X
• Note:
“Model” is the Model/Regression Sum of Squares
“Residual” is the Residual/Error Sum of Squares
“Total” is the Total Sum of Squares
Measures of Variation
Coefficient of Determination, $r^2$
• $SST = \sum (Y_i - \bar{Y})^2$
• $SSE = \sum (Y_i - \hat{Y}_i)^2$ (the unexplained deviation)
• $SSR = \sum (\hat{Y}_i - \bar{Y})^2$ (the explained deviation)
• $r^2 = SSR / SST$: the proportion of variation explained by the model
• $0 \le r^2 \le 1$
• R-squared is a goodness-of-fit measure
• Indicates the percentage of the variance in the dependent variable that the
independent variable explains
• Measures the strength of the relationship between the model and the dependent variable
[Figure: plot of Y against X showing, at $X_i$, the deviations of $Y_i$ and $\hat{Y}_i$ from $\bar{Y}$ that make up SST, SSE and SSR.]
The proportion of variation
explained by the model
$r^2 = 0.7657$ implies that 76.6% of the variation is explained by the
regression model
76.6% of the variation in height is explained by age in this model
A simple linear regression was calculated to predict participants’ height based on their age. A significant regression equation was
found (p < 0.001), with 76.6% of the variation in height explained by the model. Participants’ predicted height in centimetres is equal to
62.2 + 7.2 (Age), with age measured in years. Participants’ average height increased by 7.2 cm for every year of age.
Sampling Error in The Regression Line
• Sample: $\hat{y} = b_0 + b_1 x$; correlation coefficient $r$
• Population: $y = \beta_0 + \beta_1 x$; correlation coefficient $\rho$
Null hypothesis: x and y are not linearly related
We can test either form, i.e. $H_0: \rho = 0$ or $H_0: \beta_1 = 0$
Testing $\beta_1 = 0$ is another way to assess whether there is a significant linear relationship.
This is given by SPSS
• $b_1 = 7.238393$ and s.e.$(b_1) = 0.5085$ (units?)
• t = (7.2384 - 0) / 0.5085 = 14.23 => p < 0.001
• Very strong evidence that the true slope is not equal to zero.
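The test statistic above can be reproduced from the reported slope and standard error. The sketch below uses only the Python standard library; because the standard library has no t-distribution, the p-value shown is a normal approximation (an assumption made here for illustration, not the method used by the statistical package). With n - 2 = 62 degrees of freedom the exact t-based p-value is somewhat larger than this approximation, but both are far below 0.001.

```python
import math

b1 = 7.238393      # estimated slope from the output
se_b1 = 0.5085     # standard error of the slope from the output
n = 64             # number of children in the exercise data

# Test H0: beta1 = 0
t_stat = (b1 - 0) / se_b1
df = n - 2

# Two-sided p-value under a NORMAL approximation to the t-distribution
# (illustrative assumption; software would use the t-distribution with df = 62).
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"t = {t_stat:.2f} on {df} degrees of freedom")  # t = 14.23 on 62 degrees of freedom
print("p < 0.001" if p_approx < 0.001 else f"p = {p_approx:.3f}")  # p < 0.001
```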
Exercise 1
Looking at associations between a biomarker of allergy, and
environmental factors
Look at the examples of regression output, and for each one:
1. Do a quick sketch of the regression line
2. What is the correlation between the variables - and is this
correlation statistically significant?
3. Interpret the output in words, in terms of the relationship (if any)
between the variables. (Think about the slope, confidence
interval, p-value)
4. How much of the variation in the response variable is due to
variation in the explanatory variable?
There is a significant association between the biomarker and
maxpm10, with an estimated increase in the biomarker of 0.07
(95% CI 0.06-0.07) units per 1 unit increase in maxpm10
(p<0.001).
Or: an estimated increase of 6.92 (6.48 – 7.37) units per 100 unit increase in maxpm10
Exercise: 2
Looking at associations between a biomarker of allergy, and environmental
factors
Look at the examples of regression output, and for each one:
1. Do a quick sketch of the regression line
2. What is the correlation between the variables - and is this correlation
statistically significant?
3. Interpret the output in words, in terms of the relationship (if any)
between the variables. (Think about the slope, confidence interval, p-
value)
4. How much of the variation in the response variable is due to variation in
the explanatory variable?
• For the 2nd one only: find the predicted value of the biomarker when
mintemp = 20
[Figure: scatterplot of biom (y-axis) against mintemp (x-axis).]
biom = 16.46892 - 0.8152396*mintemp
Predicted value of biom when mintemp = 20?
Predicted value at 20: 16.46892 - 0.8152396*20 = 0.16 units
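The prediction is a direct substitution into the fitted equation. A short sketch using the coefficients from the output above (the function name is illustrative):

```python
def predicted_biom(mintemp):
    """Predicted biomarker value from the fitted line biom = 16.46892 - 0.8152396*mintemp."""
    return 16.46892 - 0.8152396 * mintemp


value_at_20 = predicted_biom(20)
print(f"predicted biom at mintemp = 20: {value_at_20:.3f} units")  # 0.164 units
```

As with any regression prediction, this is only meaningful if mintemp = 20 lies within the range of temperatures actually observed.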
Predictions Using Regression Eqs
• Predicting Y given X:
• If linear correlation is NOT significant (i.e. fail to reject $H_0: \rho = 0$):
Do not use the regression line; the mean ($\bar{y}$) is the best predicted y-value
• If linear correlation is significant (i.e. $H_0: \rho = 0$ is rejected):
Use the regression equation to find the best predicted y-value; stay within the range of
the available/observed data
• To determine if correlation is significant:
• Calculate r and test $H_0: \rho = 0$ against $H_1: \rho \neq 0$
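One common way to carry out this test is with the statistic t = r·sqrt(n - 2) / sqrt(1 - r²) on n - 2 degrees of freedom; this formula is standard but is not shown in this section, so treat it as an assumption about the intended method. The sketch applies it to the height and age exercise (r² = 0.7657, n = 64). The critical value of about 2.0 for a two-sided 5% test with 62 degrees of freedom is an approximate table value, hard-coded here for illustration.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r = math.sqrt(0.7657)   # r recovered from the reported r^2 (about 0.875)
n = 64                  # children in the height and age exercise

t_stat = correlation_t_statistic(r, n)

# ASSUMPTION: approximate two-sided 5% critical value for 62 degrees of freedom
T_CRITICAL_APPROX = 2.0
significant = abs(t_stat) > T_CRITICAL_APPROX

print(f"t = {t_stat:.2f}")  # t = 14.23
if significant:
    print("Correlation is significant: use the regression equation to predict y")
else:
    print("Correlation is not significant: use the mean of y as the prediction")
```

The result agrees with the t value of 14.23 obtained from the slope and its standard error, which illustrates that in simple linear regression testing the correlation and testing the slope lead to the same conclusion.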
Reporting Regression: APA
Consider
• You want to know if height predicts weight