
Linear Regression and Correlation

This document discusses linear regression and correlation. It defines linear regression as a method to summarize the relationship between two variables with a straight line. It explains key aspects of linear regression including the intercept, slope, and least squares method. The document provides an example of linear regression using data on body weight and plasma volume. It then defines correlation as a measure of how closely the data points fit along the linear regression line. The Pearson correlation coefficient is presented as a measure of the strength of linear correlation between two variables.

Uploaded by

mayogebukapuka2

Linear regression and correlation

Learning objectives
• Interpret a scatter diagram
• Interpret a regression equation
  – Intercept
  – Slope
• Use and explain the correlation coefficient

Example
Subject   Body weight (kg)   Plasma volume (l)
1         58.0               2.75
2         70.0               2.86
3         74.0               3.37
4         63.5               2.76
5         62.0               2.62
6         70.5               3.49
7         71.0               3.05
8         66.0               3.12

The aim of the analysis is to see whether a change in plasma volume is associated with a change in body weight.

Scatter plot
• First step in investigating the relationship between two variables
• Diagram shows visually the shape and degree of closeness of the relationship

Scatter plot
[Figure: scatter plot of plasma volume (l) against body weight (kg)]

Plasma volume tends to increase with increasing body weight.


Linear regression
• Summarizes the relationship by a line drawn through the scatter of points.
• Regression looks at the change in one variable (the response or outcome or dependent variable) that corresponds to a given change in the other (the explanatory or predictor or independent variable)
• Used to predict or estimate the value of the response associated with a fixed value of the explanatory variable

Linear regression equation

• Any straight line drawn on a graph can be represented by the equation:
  y = a + bx
  where y refers to the values of the dependent variable and x to the values of the explanatory (independent) variable.

Linear regression line

[Figure: the line y = a + bx, with intercept a on the y-axis and slope b]

Linear regression
• The constant 'a' is the intercept, the point at which the line crosses the y-axis.
  – value of y when x = 0
• The coefficient of the x variable ('b') is the slope of the line.
  – the average change (increase or decrease) in y due to a unit change in x
  – measures association between y and x
• If b is positive, the expected value of y increases as x increases
• If b is negative, the expected value of y decreases as x increases
• b is sometimes called the regression coefficient.


Least squares method
• Best fitting line derived using the least squares method
• Coefficients a and b are calculated to minimize the sum of squares of the vertical deviations of the points about the line

Linear regression
$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

• Numerator = $\sum xy - (\sum x)(\sum y)/n$
• Denominator = $\sum x^2 - (\sum x)^2/n$

Linear regression
$$a = \bar{y} - b\bar{x}$$
where $\bar{y} = \sum y / n$ and $\bar{x} = \sum x / n$

• The resultant line is called the regression line, which estimates the average value of y for a given value of x.

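The formulas for b and a can be sketched in Python as follows. This is a minimal illustration rather than part of the original slides: the function names `fit_line` and `sum_squared_deviations` and the five data points are made up for the example. It computes the coefficients from the deviation form and then checks that nudging either coefficient away from the fitted value increases the sum of squared vertical deviations.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the line passes through the point of means.
    a = y_bar - b * x_bar
    return a, b


def sum_squared_deviations(xs, ys, a, b):
    """Sum of squared vertical distances of the points from y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data (not taken from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
best = sum_squared_deviations(xs, ys, a, b)

# Any other line through the same data has a larger sum of squares.
perturbed = [
    sum_squared_deviations(xs, ys, a + da, b + db)
    for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]
]
print(f"a = {a:.3f}, b = {b:.3f}, sum of squares = {best:.4f}")
print("all perturbed lines are worse:", all(p > best for p in perturbed))
```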
Example – data on plasma volume and body weight

$n = 8$
$\sum x = 535$
$\sum x^2 = 35983.5$
$\sum y = 24.02$
$\sum y^2 = 72.798$
$\sum xy = 1615.295$

Example

$$b = \frac{1615.295 - (535)(24.02)/8}{35983.5 - (535)^2/8} = \frac{8.96}{205.38} = 0.043615$$

and

$$a = 3.0025 - 0.043615 \times 66.875 = 0.0857$$

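The arithmetic of this example can be reproduced from the eight observations in the data table. The sketch below is an illustration only; the variable names are assumptions, and it uses the shortcut forms of the numerator and denominator rather than the deviations themselves.

```python
# Data from the plasma volume example.
weights = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # body weight (kg)
volumes = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # plasma volume (l)

n = len(weights)
sum_x = sum(weights)
sum_y = sum(volumes)
sum_xx = sum(x * x for x in weights)
sum_xy = sum(x * y for x, y in zip(weights, volumes))

# Shortcut forms: sum(xy) - sum(x)sum(y)/n and sum(x^2) - (sum(x))^2/n.
numerator = sum_xy - sum_x * sum_y / n
denominator = sum_xx - sum_x ** 2 / n

b = numerator / denominator          # slope
a = sum_y / n - b * (sum_x / n)      # intercept: y_bar - b * x_bar

print(f"sum x = {sum_x}, sum y = {sum_y:.2f}, sum xy = {sum_xy:.3f}")
print(f"b = {b:.6f}, a = {a:.4f}")
```

Carrying full precision through the calculation, rather than the rounded 8.96 and 205.38, is what gives the slope of 0.043615 quoted above.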
Example 1
Regression line is given by:

Plasma volume = 0.09 + 0.04 × body weight

Interpretation of slope
For every one-unit (1 kg) increase in body weight, there is on average a corresponding increase of 0.04 litres in plasma volume.

Simple linear regression
[Figure: scatter plot of plasma volume (l) against body weight (kg) with the fitted line Plasma volume = 0.09 + 0.04 × body weight]

Simple linear regression
Table 1 Age and systolic blood pressure (SBP) among 33 adult women

Age   SBP     Age   SBP     Age   SBP
22    131     41    139     52    128
23    128     41    171     54    105
24    116     46    137     56    145
27    106     47    111     57    141
28    114     48    115     58    153
29    123     49    133     59    157
30    117     49    128     63    155
32    122     50    183     67    176
33     99     51    130     71    172
35    121     51    133     77    178
40    147     51    144     81    217

[Figure: scatter plot of SBP (mm Hg) against age (years) with the fitted line SBP = 81.54 + 1.222 × Age]

Prediction
• Can use the regression equation to predict the value of y for a particular value of x
• Should not use the regression line to predict values outside the range of x in the original data
• E.g. the predicted plasma volume for a man weighing 64 kg is 0.0857 + 0.0436 × 64 = 2.876 litres

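A prediction helper that also enforces the "do not extrapolate" rule might look like the sketch below. The function name `predict_plasma_volume` and the decision to raise an error outside the observed range are assumptions made for illustration; the coefficients are the rounded values from the worked example, and the range is the smallest and largest body weight in the data.

```python
A = 0.0857  # intercept from the worked example (litres)
B = 0.0436  # slope from the worked example (litres per kg)
WEIGHT_RANGE = (58.0, 74.0)  # observed body weights in the example data (kg)


def predict_plasma_volume(weight_kg):
    """Predict plasma volume (l), refusing to extrapolate beyond the data."""
    low, high = WEIGHT_RANGE
    if not low <= weight_kg <= high:
        raise ValueError(
            f"{weight_kg} kg is outside the observed range {low}-{high} kg"
        )
    return A + B * weight_kg


print(round(predict_plasma_volume(64), 3))

try:
    predict_plasma_volume(90)
except ValueError as err:
    print("refused:", err)
```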
Sampling error in the regression line
• Higher levels of plasma volume are associated with higher values of weight
• Is there evidence that this association between plasma volume and weight is statistically significant?
• The estimated values for a and b are estimates of the population values for intercept and slope, so they are subject to sampling errors

Sampling error of regression

• Most software output will also show the standard error of the slope and 95% confidence intervals around the slope/correlation coefficient

plasvol     Coefficient   Std error   t       P value   95% CI
Weight      0.0436        0.0153      2.857   0.029     0.0063 to 0.081
Constant    0.0857        1.024       0.084   0.936     -2.420 to 2.591

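The standard errors, t value and confidence interval in this output can be reproduced by hand from the eight observations. The sketch below uses the usual simple-regression formulas, which the slides do not state explicitly, so treat them as an assumption: the residual variance is the residual sum of squares divided by n - 2, and the critical value 2.447 is the two-sided 5% point of the t distribution with 6 df, taken from standard t tables. P values are left out because the Python standard library has no t distribution.

```python
import math

weights = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # body weight (kg)
volumes = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # plasma volume (l)

n = len(weights)
x_bar = sum(weights) / n
y_bar = sum(volumes) / n
sxx = sum((x - x_bar) ** 2 for x in weights)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weights, volumes))

b = sxy / sxx
a = y_bar - b * x_bar

# Residual variance: squared vertical deviations about the line, on n - 2 df.
residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(weights, volumes))
residual_var = residual_ss / (n - 2)

se_b = math.sqrt(residual_var / sxx)
se_a = math.sqrt(residual_var * (1 / n + x_bar ** 2 / sxx))

T_CRIT = 2.447  # assumed: t table value for 6 df, two-sided 5%
t_b = b / se_b
ci_b = (b - T_CRIT * se_b, b + T_CRIT * se_b)
ci_a = (a - T_CRIT * se_a, a + T_CRIT * se_a)

print(f"slope     {b:.4f}  se {se_b:.4f}  t {t_b:.3f}  CI {ci_b[0]:.4f} to {ci_b[1]:.4f}")
print(f"constant  {a:.4f}  se {se_a:.3f}  CI {ci_a[0]:.3f} to {ci_a[1]:.3f}")
```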
Correlation
• Linear regression - straight line summarizing the relationship between two variables.
  – Does not tell how closely the data lie on a straight line.
• Correlation is defined as the quantification of the degree to which two continuous variables are related, provided that the relationship is linear
  – i.e. the closeness with which the points lie along the straight line

Correlation coefficient
• Denote the true underlying population correlation between X and Y by ρ (rho)
• The correlation quantifies the strength of the linear relationship between X and Y
• The population correlation can be estimated from a sample of data using the Pearson correlation coefficient r

Correlation coefficient
The correlation coefficient is calculated as

$$r = \frac{\sum xy - (\sum x)(\sum y)/n}{\sqrt{\left[\sum x^2 - (\sum x)^2/n\right]\left[\sum y^2 - (\sum y)^2/n\right]}}$$

Correlation coefficient
From Example 1, the correlation coefficient for the association between body weight and plasma volume is given by

$$r = \frac{8.96}{\sqrt{205.38 \times 0.678}} = 0.76$$

• There is a fairly strong linear relationship

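As a check on this value, the Pearson coefficient can be computed directly from the eight observations. The sketch below is illustrative; the function name `pearson_r` is an assumption, and it uses the shortcut sums of the formula for r.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient from the shortcut sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    return sxy / math.sqrt(sxx * syy)


weights = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # body weight (kg)
volumes = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # plasma volume (l)

r = pearson_r(weights, volumes)
print(f"r = {r:.2f}")
```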
Properties of correlation coefficient

• Has the same sign as the regression coefficient b
• It must lie between -1 and +1.
• If r = 0, there is no linear relationship
• If r = 1 or -1, the relationship is perfectly linear, i.e. all points lie exactly on the regression line.
• If r > 0, then y increases with increasing x values (positive correlation).
• If r < 0, then y decreases with increasing x values (negative correlation).

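These properties can be seen on small made-up data sets. In the sketch below (an illustration, with an assumed helper `pearson_r` written in the deviation form), points lying exactly on a rising line give r = 1, points on a falling line give r = -1, and a symmetric curve gives r = 0 even though y is completely determined by x, which is why r = 0 only rules out a linear relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-2, -1, 0, 1, 2]  # made-up x values, symmetric about 0

r_up = pearson_r(xs, [2 + 3 * x for x in xs])    # exact line, positive slope
r_down = pearson_r(xs, [5 - 2 * x for x in xs])  # exact line, negative slope
r_curve = pearson_r(xs, [x * x for x in xs])     # exact parabola, not linear

print(r_up, r_down, r_curve)
```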
Strength of linear relationship
Correlation coefficient value   Strength of linear relationship
At least 0.8                    Very strong
0.6 - 0.8                       Moderately strong
0.3 - 0.5                       Fair
Less than 0.3                   Poor

Inference about unknown population correlation
• We can make inference about the unknown population correlation ρ using the sample correlation coefficient r
• Most often we want to test H0: ρ = 0
• Can test the null hypothesis using a t-test

The test statistic is

$$t = \frac{r - 0}{\sqrt{(1 - r^2)/(n - 2)}}$$ with n - 2 df

For the DPT and mortality rate data,

$$t = \frac{-0.791}{\sqrt{(1 - (-0.791)^2)/(20 - 2)}} = -5.49$$

For a t distribution with 18 df, p < 0.001

We reject H0 at the 0.05 level of significance and conclude that the true population correlation is not equal to 0; in fact, it is less than 0

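The test statistic is simple enough to compute by hand or in a few lines of code. The sketch below is illustrative only; the function name `correlation_t` is an assumption, and it reproduces the t value for the DPT and mortality rate example from the quoted r = -0.791 and n = 20. The p value is not computed because the Python standard library has no t distribution.

```python
import math


def correlation_t(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    standard_error = math.sqrt((1 - r ** 2) / (n - 2))
    return (r - 0) / standard_error


r, n = -0.791, 20  # values quoted for the DPT and mortality rate data
t = correlation_t(r, n)
df = n - 2

print(f"t = {t:.2f} with {df} df")
```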
Limitations of the correlation coefficient
• It quantifies only the strength of the linear relationship between two variables
• It is very sensitive to outlying values, and thus can sometimes be misleading
• It cannot be extrapolated beyond the observed ranges of the variables
• A high correlation does not imply a cause-and-effect relationship

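The sensitivity to outlying values is easy to demonstrate. The sketch below takes the eight plasma volume observations and adds one hypothetical ninth subject (90 kg, 2.00 l) that is not in the original data; the helper `pearson_r` is likewise an assumption made for illustration. That single point is enough to turn a fairly strong positive correlation into a negative one.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


weights = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # body weight (kg)
volumes = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # plasma volume (l)

r_without = pearson_r(weights, volumes)

# Hypothetical outlier, invented for this illustration only.
r_with = pearson_r(weights + [90.0], volumes + [2.00])

print(f"r without outlier = {r_without:.2f}")
print(f"r with outlier    = {r_with:.2f}")
```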
Summary
• Regression and correlation allow us to examine the linear association between two quantitative variables
• There may be other (independent) variables that influence the response variable
  – Multiple regression
• Even when strong associations are found these do not necessarily mean that the relationship is causal
