Biostatistics Lect 7b - 112025

The document covers the concepts of correlation and simple linear regression, focusing on the examination of linear relationships between two quantitative variables. It explains how to draw regression equations to predict values and emphasizes the importance of the correlation coefficient and the coefficient of determination (r²) in understanding the strength of these relationships. Additionally, it discusses the assumptions required for regression analysis and provides examples and exercises to illustrate these concepts.

Uploaded by

sm
LEC 7

BIOSTATISTICS

CORRELATION
AND
SIMPLE LINEAR REGRESSION
Objectives

• To examine the linear relationship between two quantitative variables using
• CORRELATION (see Lec 7a)
• REGRESSION
Comments on Graphs
• Always draw a graph
• Examples show that the correlation coefficient, as a summary statistic, is not sufficient on its own for drawing final conclusions about the data

• If the coefficient of linear correlation between (x, y) is significant, a linear equation can be written that expresses y in terms of x.
• This equation can be used to predict the values of y for given values of x.
• This equation is called the regression equation.

• The value r² is the proportion of the variation in y that is explained by the linear association between x and y.
SIMPLE LINEAR REGRESSION
Correlation and Regression

• Correlation describes the strength of a linear relationship between two variables
• Linear means “straight line”

• Regression tells us how to draw the straight line described by the correlation
• Calculates the “best-fit” line for a certain set of data
Simple Linear Regression Equation

Regression Equation
• Linear association between 2 quantitative variables
• x (independent variable or predictor variable or explanatory variable)
• y (dependent variable or response variable)

$\hat{y} = b_0 + b_1 x$

where $b_0$ is the intercept (estimate of the regression intercept)
and $b_1$ is the slope (estimate of the regression slope)

• $b_0$ and $b_1$ are sample estimates of $\beta_0$ and $\beta_1$ (population parameters)
• $x_i$: value of x for the i-th observation
• $\hat{y}_i$: estimated Y value for a given observation
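Simple Linear Regression Equation

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

[Figure: Y plotted against X with the regression line, marking the observed value of Y for Xi, the predicted value of Y for Xi, the intercept β0 at X = 0, and the slope β1 = (change in Y) / (change in X)]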
Examples
Assumptions/Requirements
• For each fixed value of x, the corresponding values of y have a bell-shaped distribution.
• For different values of x, the distributions of y-values all have the same variance (homoscedasticity)

[Figure, first panel: variance increases when x increases; variance is not the same for all values of x]
[Figure, second panel: variance remains constant when x increases; variance is approximately the same for all values of x]
The Slope and Intercept

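For $\hat{y} = b_0 + b_1 x$

SLOPE: $b_1 = \dfrac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$

INTERCEPT: $b_0 = \bar{y} - b_1\bar{x}$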
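The least-squares slope and intercept can be computed directly from the sums that appear in the slope formula. The following is a minimal sketch in plain Python; the function name `least_squares` and the five data points are illustrative assumptions and do not come from the lecture data.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b1 = (n * sum_xy - sum_x * sum_y) / denom   # slope
    b0 = sum_y / n - b1 * (sum_x / n)           # intercept = y-bar - b1 * x-bar
    return b0, b1


# Hypothetical data lying exactly on the line y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b0, b1 = least_squares(xs, ys)
print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}")
```

Because the illustrative points lie exactly on a line, the fitted intercept and slope recover that line, which makes the sketch easy to check by hand.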
Example: Regression Interpretation

For every 1-unit increase in x, there is a 0.182-unit decrease in y.
The sign of the slope coefficient indicates the direction of the relationship.
Example: Matched Pairs
• Examine the relationship between self-reported and measured female heights (in.).

• Create a scatterplot
• Remember that in the session on the paired t-test, we failed to reject the null hypothesis that the mean height difference equals zero.

Note the different limits in the two plots

• Linear Correlation Coefficient: r = 0.856863
• Coefficient of Determination: r² = 0.7342
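The coefficient of determination for the height example is simply the square of the reported correlation coefficient. A quick check in Python, using only the value of r quoted above (the variable names are illustrative):

```python
# Correlation between self-reported and measured heights, as reported on the slide
r = 0.856863

# In simple linear regression the coefficient of determination is r squared
r_squared = r ** 2

print(f"r^2 = {r_squared:.4f}")
print(f"About {100 * r_squared:.1f}% of the variation is explained by the linear association")
```

The squared value rounds to the 0.7342 shown on the slide.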
Regression: Making Predictions
• Only predict within the relevant range of data
• Do not extrapolate beyond the range of observed X’s

[Figure: House Price ($1000s) plotted against Square Feet (0 to 3000), with the relevant range for interpolation marked]
Exercise
• Data: height and age for 64 children aged 16 or less.
• RQ: Is children’s height related to age? If so, describe the association between them.
Scatter Diagram

r = 0.88
• Note:
Coefficient called “_cons” is the intercept ‘a’
The “age” coefficient shows the slope ‘b’
y = 62.2 + 7.2 x
Therefore: Height (cm) = 62.2 + 7.2 (Age (yrs))
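The fitted equation can be turned into a small prediction helper. This is a sketch using the rounded coefficients from the slide; the function name `predicted_height_cm` and the example age of 10 years are illustrative assumptions. Since the data cover children aged 16 or less, the equation should only be applied to ages inside the observed range.

```python
INTERCEPT_CM = 62.2          # "_cons" in the output: intercept a
SLOPE_CM_PER_YEAR = 7.2      # "age" coefficient: slope b


def predicted_height_cm(age_years):
    """Predicted height (cm) from Height = 62.2 + 7.2 * Age (years)."""
    return INTERCEPT_CM + SLOPE_CM_PER_YEAR * age_years


# Example age chosen for illustration; it lies within the observed range (16 or less)
height_at_10 = predicted_height_cm(10)
print(f"Predicted height at age 10: {height_at_10:.1f} cm")
```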
Constructing the Regression Line
• Method of Least Squares
An interpretation of the correlation coefficient, r

• r² measures how much of the variation in the y variable is accounted for by the linear relationship with the x variable.

The total variation in y can be thought of as the sum of the squared distances from each y-point to their mean (Total SS).

After fitting the regression line, there is considerably less variation remaining: the “residual variation” (Residual SS), also called “Error SS”.

• The difference between the total sum of squares and the residual sum of squares is the amount of variation explained by the regression model

Total SS – Residual SS = Model SS
Total SS – Error SS = Regression SS
Measures of Variation
• Total variation is made up of two parts:

$SST = SSR + SSE$

(Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares)

$SST = \sum (Y_i - \bar{Y})^2 \qquad SSR = \sum (\hat{Y}_i - \bar{Y})^2 \qquad SSE = \sum (Y_i - \hat{Y}_i)^2$

where:
$\bar{Y}$ = Mean value of the dependent variable
$Y_i$ = Observed value of the dependent variable
$\hat{Y}_i$ = Predicted value of Y for the given $X_i$ value
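The decomposition of the total variation can be verified numerically. The sketch below fits a least-squares line to a small made-up data set (the five points are an assumption for illustration, not lecture data), computes the three sums of squares, and checks that the total equals the regression part plus the error part.

```python
# Hypothetical data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

# Predicted values on the fitted line
y_hat = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                 # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))    # unexplained variation

r_squared = ssr / sst

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
print(f"r^2 = SSR / SST = {r_squared:.2f}")
```

For these points the fitted line is ŷ = 2.2 + 0.6x, so SST = 6, SSR = 3.6 and SSE = 2.4, giving r² = 0.6.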
Measures of Variation
• Total variation is made up of two parts:

$SST = SSR + SSE$


• SST = total sum of squares (Total Variation)

Measures the variation of the Yi values around their mean

• SSR = regression sum of squares (Explained Variation)

Variation attributable to the relationship between X and Y

• SSE = error sum of squares (Unexplained Variation)

Variation in Y attributable to factors other than X


• Note:
“Model” is the Model/Regression Sum of Squares
“Residual” is the Residual/Error Sum of Squares
“Total” is the Total Sum of Squares
Measures of Variation
Coefficient of Determination, r²

[Figure: Y plotted against X with the fitted line, marking for one point Xi the observed value Yi, the predicted value Ŷi and the mean Ȳ]

$SSE = \sum (Y_i - \hat{Y}_i)^2$ (the unexplained deviation)
$SST = \sum (Y_i - \bar{Y})^2$
$SSR = \sum (\hat{Y}_i - \bar{Y})^2$ (the explained deviation)

The proportion of variation explained by the model: $r^2 = SSR / SST$, with $0 \le r^2 \le 1$

• R-squared is a goodness-of-fit measure
• Indicates the percentage of the variance in the dependent variable that the independent variable explains
• Measures the strength of the relationship between the model and the dependent variable
The proportion of variation
explained by the model
r² = 0.7657 implies that 76.6% of the variation is explained by the regression model
76.6% of the variation in height is explained by age in this model
A simple linear regression was calculated to predict participants’ height based on their age. A significant regression equation was found (p<0.001), with 76.6% of the variation in height explained by the model. Participants’ predicted height is equal to 62.2 + 7.2 (Age) centimetres when age is measured in years. Participants’ average height increased by 7.2 cm for every year of age.
Sampling Error in The Regression Line

• Sample: $\hat{y} = b_0 + b_1 x$; correlation coefficient $r$
• Population: $y = \beta_0 + \beta_1 x$; correlation coefficient $\rho$

Null hypothesis: x and y are not linearly related

We can test either one, i.e. $H_0: \rho = 0$ or $H_0: \beta_1 = 0$

Testing $\beta_1 = 0$ is another way to assess whether there is a significant linear relationship.

This is given by SPSS
• $b_1$ = 7.238393 and s.e.($b_1$) = 0.5085 (units???)
• t = (7.2384 - 0) / 0.5085 = 14.23 => p < 0.001
• Very strong evidence that the true slope is not equal to zero.
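The test statistic for the slope can be reproduced from the two numbers in the output. This is a sketch only: it assumes all 64 children from the exercise were used in the fit, so that the degrees of freedom are n - 2 = 62, and it compares the statistic with a rounded critical value of about 2.0 (two-sided, 5% level, roughly 60 degrees of freedom) rather than computing an exact p-value.

```python
b1 = 7.238393      # estimated slope from the output
se_b1 = 0.5085     # standard error of the slope from the output
n = 64             # assumption: all 64 children in the exercise were used

# t statistic for H0: beta1 = 0
t_stat = (b1 - 0.0) / se_b1
df = n - 2         # degrees of freedom in simple linear regression

# Approximate two-sided 5% critical value for about 60 degrees of freedom
APPROX_T_CRIT = 2.0

print(f"t = {t_stat:.2f} on {df} degrees of freedom")
if abs(t_stat) > APPROX_T_CRIT:
    print("Reject H0: evidence that the true slope is not zero")
else:
    print("Fail to reject H0: no evidence of a linear relationship")
```

A statistic of about 14.23 is far beyond the critical value, in line with the p < 0.001 reported on the slide.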
Exercise 1
Looking at associations between a biomarker of allergy, and
environmental factors
Look at the examples of regression output, and for each one:
1. Do a quick sketch of the regression line
2. What is the correlation between the variables - and is this
correlation statistically significant?
3. Interpret the output in words, in terms of the relationship (if any)
between the variables. (Think about the slope, confidence
interval, p-value)
4. How much of the variation in the response variable is due to
variation in the explanatory variable?
There is a significant association between the biomarker and maxpm10, with an estimated increase in the biomarker of 0.07 (95% CI 0.06-0.07) units per 1-unit increase in maxpm10 (p<0.001).
Or: an estimated increase of 6.92 (95% CI 6.48-7.37) units per 100-unit increase in maxpm10
Exercise: 2
Looking at associations between a biomarker of allergy, and environmental
factors
Look at the examples of regression output, and for each one:
1. Do a quick sketch of the regression line
2. What is the correlation between the variables - and is this correlation statistically significant?
3. Interpret the output in words, in terms of the relationship (if any) between the variables. (Think about the slope, confidence interval, p-value)
4. How much of the variation in the response variable is due to variation in
the explanatory variable?
• For the 2nd one only: find the predicted value of the biomarker when
mintemp = 20
[Figure: scatterplot of biom against mintemp with the fitted regression line]

biom = 16.46892 - 0.8152396*mintemp

Predicted value of biom when mintemp = 20?

Predicted value at 20: 16.46892 - 0.8152396*20 = 0.16 units
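The prediction at mintemp = 20 is a direct substitution into the fitted equation, which can be checked with a few lines of Python. The coefficients are those shown in the output; the function name `predicted_biom` is an illustrative assumption.

```python
INTERCEPT = 16.46892     # fitted intercept from the regression output
SLOPE = -0.8152396       # fitted slope for mintemp from the regression output


def predicted_biom(mintemp):
    """Predicted biomarker value from biom = 16.46892 - 0.8152396 * mintemp."""
    return INTERCEPT + SLOPE * mintemp


biom_at_20 = predicted_biom(20)
print(f"Predicted biom at mintemp = 20: {biom_at_20:.2f} units")
```

The unrounded result is about 0.164, which is 0.16 to two decimal places.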
Predictions Using Regression Eqs
• Predicting Y given X:
• If the linear correlation is NOT significant (i.e. we fail to reject $H_0: \rho = 0$), do not use the regression line; the mean ($\bar{y}$) is the best predicted y-value
• If the linear correlation is significant (i.e. $H_0: \rho = 0$ is rejected), use the equation to find the best predicted y-value; stay within the range of the available/observed data
• To determine if the correlation is significant:
• Calculate r and test $H_0: \rho = 0$ against $H_1: \rho \neq 0$
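The decision rule above can be written as a short function. This is a minimal sketch: the name `best_prediction`, its parameters, and the mean height of 120.0 cm used in the example are hypothetical; only the height equation (62.2 + 7.2 × age) and the upper age of 16 come from the earlier exercise, and the lower bound of 0 is an assumption.

```python
def best_prediction(x, b0, b1, y_mean, significant, x_min, x_max):
    """Best predicted y for a given x, following the rule on the slide.

    - Correlation not significant: ignore the regression line, return the mean of y.
    - Correlation significant: use y-hat = b0 + b1 * x, but only inside the
      observed range [x_min, x_max] (no extrapolation).
    """
    if not significant:
        return y_mean
    if not (x_min <= x <= x_max):
        raise ValueError(f"x = {x} is outside the observed range [{x_min}, {x_max}]")
    return b0 + b1 * x


# Height example; y_mean = 120.0 and x_min = 0 are hypothetical values
print(best_prediction(10, 62.2, 7.2, 120.0, significant=True, x_min=0, x_max=16))
print(best_prediction(10, 62.2, 7.2, 120.0, significant=False, x_min=0, x_max=16))
```

Raising an error for out-of-range x is one possible design choice; a softer alternative would be to return the prediction together with a warning.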
Reporting Regression: APA
Consider
• You want to know if height predicts weight
