Bio2 Module 1 - Simple Linear Regression and Correlation

This document provides an overview of simple linear regression and correlation. It defines linear regression as estimating the numerical relationship between variables using an equation of the form Ŷ= a + bX. Regression finds the line that minimizes the sum of squared errors between predicted and actual Y values. Correlation (measured by r) indicates the strength and direction of the linear relationship between two variables. The document provides examples of calculating the linear regression equation, correlation coefficient, and rank correlation coefficient from datasets. It also discusses limitations of correlation including that correlation does not necessarily indicate causation.

Uploaded by

tamirat hailu

University of Gondar

Gondar College of Medicine and Health Sciences

Biostatistics II _ module 1

Simple linear regression and correlation

Getu Degu

November 2008
Simple linear regression and correlation

Data are frequently given in pairs where one variable is dependent on the other.

E.g. 1. Weight and height
     2. House rent and income
     3. Yield and fertilizer

It is usually desirable to express their relationship by finding an appropriate mathematical equation. To form the equation, collect the data on these two variables. Let the observations be denoted by (X1, Y1), (X2, Y2), (X3, Y3), . . . , (Xn, Yn).

However, before trying to quantify this relationship, plot the data and get an idea of their nature.

Plot these points on the XY plane and obtain the scatter diagram.

[Figure: scatter diagram, "Relationship between heights of fathers and their oldest sons". Horizontal axis: heights of fathers (inches). Vertical axis: heights of oldest sons (inches).]

NB: The actual figures of the above scatter diagram are given on page 5.
A) Simple linear regression

The scatter diagram helps to choose the curve that best fits the data. The simplest type of curve is a straight line whose equation is given by Ŷ = a + bXi. This equation is a point estimate of Y = α + βXi.

b = the sample regression coefficient of Y on X.

β = the population regression coefficient of Y on X.

"Y on X" means Y is the dependent variable and X is the independent one.
Regression is a method of estimating the numerical relationship between variables. For example, we would like to know the mean or expected weight of factory workers of a given height, and what increase in weight is associated with a unit increase in height.

The purpose of a regression equation is to use one variable to predict another.

How is the regression equation determined?
The method of least squares

The difference between the given score Y and the predicted score Ŷ is known as the error of estimation, ei = Yi − Ŷi. The regression line, or the line which best fits the given pairs of scores, is the line for which the sum of the squares of these errors of estimation (Σei²) is minimized. That is, of all the possible lines, the one with minimum Σei² is the least squares regression line, which best fits the given data.

The least squares regression line for the set of observations (X1, Y1), (X2, Y2), (X3, Y3), . . . , (Xn, Yn) has the equation Ŷ = a + bXi.
The values ‘a’ and ‘b’ in the equation are constants, i.e., their values are fixed. The constant ‘a’ indicates the value of Ŷ when X = 0. It is also called the Y intercept. The value of ‘b’ shows the slope of the regression line and gives us a measure of the change in Ŷ for a unit change in X.

This slope (b) is frequently termed the regression coefficient of Y on X. If we know the values of ‘a’ and ‘b’, we can easily compute the value of Ŷ for any given value of X.
The constants ‘a’ and ‘b’ are determined by solving
simultaneously the equations (normal equations):

ΣY = an + bΣX
ΣXY = aΣX + bΣX²

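$$a = \bar{Y} - b\bar{X}$$

$$b = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}} = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$$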
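The step from the normal equations to explicit expressions for a and b can be sketched as follows. This is standard algebra rather than something reproduced from the slides; the bars denote sample means, i.e. X̄ = ΣX/n and Ȳ = ΣY/n.

```latex
\begin{align*}
\sum Y = an + b\sum X
  &\;\Rightarrow\; a = \bar{Y} - b\bar{X} \\
\sum XY = a\sum X + b\sum X^2
  &\;\Rightarrow\; \sum XY = (\bar{Y} - b\bar{X})\sum X + b\sum X^2 \\
  &\;\Rightarrow\; b = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}}
\end{align*}
```

The first line divides the first normal equation by n; the second substitutes that expression for a into the second normal equation and collects the terms in b.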

Example: Heights of 10 fathers (X) together with their
oldest sons (Y) are given below (in inches). Find the
regression of Y on X.

Father (X)   Oldest son (Y)   Product (XY)   X²

63           65               4095           3969
64           67               4288           4096
70           69               4830           4900
72           70               5040           5184
65           64               4160           4225
67           68               4556           4489
68           71               4828           4624
66           63               4158           4356
70           70               4900           4900
71           72               5112           5041

Total 676    679              45967          45784
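$$a = \bar{Y} - b\bar{X}, \qquad b = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}}$$

$$b = \frac{45967 - \frac{(676)(679)}{10}}{45784 - \frac{(676)^2}{10}} = \frac{45967 - 45900.4}{45784 - 45697.6} = \frac{66.6}{86.4} = 0.77$$

$$a = \frac{679}{10} - 0.77\left(\frac{676}{10}\right) = 67.9 - 52.05 = 15.85$$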

Therefore, Ŷ = 15.85 + 0.77 X

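The regression coefficient of Y on X (i.e., 0.77) tells us the change in Y associated with a unit change in X: on average, the oldest son's height increases by 0.77 inches for each additional inch of the father's height.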
Estimate the height of the oldest son for a father’s
height of 70 inches.

Ŷ = 15.85 + 0.77 (70) = 69.75 inches.
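The fit and the prediction for this example can be reproduced with a few lines of Python. This is a minimal sketch that uses the computational formula for b in terms of ΣX, ΣY, ΣXY and ΣX²; the function name `fit_line` and the variable names are illustrative choices, not part of the original material.

```python
# Father (X) and oldest-son (Y) heights in inches, copied from the example table.
fathers = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
sons = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]


def fit_line(xs, ys):
    """Return (a, b) for the least squares line Y-hat = a + b*X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope from the computational formula, then intercept from the means.
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = fit_line(fathers, sons)
predicted_son_height = a + b * 70  # father's height of 70 inches
print(f"b = {b:.4f}, a = {a:.4f}, prediction = {predicted_son_height:.2f}")
```

Because the code keeps b unrounded (about 0.7708), the intercept comes out near 15.79 rather than the 15.85 obtained by hand with b rounded to 0.77. The prediction for X = 70 is still 69.75 inches to two decimals.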

NB: 1) n is the number of pairs of X and Y scores which are used in determining the regression line. In the above example, n = 10.

2) Be careful to distinguish between (ΣX)² and ΣX². In the above example, (ΣX)² = 676² = 456976, whereas ΣX² = 45784.
Explained, unexplained (error), total variations

If all the points on the scatter diagram fall on the regression line, we could say that the entire variance of Y is due to variations in X.

Explained variation = Σ(Ŷ − Ȳ)²

The measure of the scatter of points away from the regression line gives an idea of the variance in Y that is not explained with the help of the regression equation.

Unexplained variation = Σ(Y − Ŷ)²
The variation of the Y’s about their mean can also be computed. The quantity Σ(Y − Ȳ)² is called the total variation.

Explained variation + unexplained variation = Total variation

The ratio of the explained variation to the total variation measures how well the linear regression line fits the given pairs of scores. It is called the coefficient of determination, and is denoted by r².

$$r^2 = \frac{\text{explained variation}}{\text{total variation}} = \frac{\sum(\hat{Y} - \bar{Y})^2}{\sum(Y - \bar{Y})^2}$$

The explained variation is never negative and is never larger than the total variation. Therefore, r² is always between 0 and 1. If the explained variation equals 0, r² = 0.

If r² is known, then r = ±√r². The sign of r is the same as the sign of b from the regression equation.

Since r² is between 0 and 1, r is between -1 and +1.
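As an illustration, the three variations and r² can be computed for the father and son heights used in the regression example. This is a sketch under the assumption that the line is fitted with unrounded a and b (the deviation-from-the-mean form of the slope is used here, which is algebraically the same as the formula with sums); all variable names are illustrative.

```python
# Father (X) and oldest-son (Y) heights in inches from the regression example.
fathers = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
sons = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]

n = len(fathers)
mean_x = sum(fathers) / n
mean_y = sum(sons) / n

# Least squares slope and intercept, written with deviations from the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(fathers, sons))
sxx = sum((x - mean_x) ** 2 for x in fathers)
b = sxy / sxx
a = mean_y - b * mean_x

fitted = [a + b * x for x in fathers]  # the Y-hat values

explained = sum((f - mean_y) ** 2 for f in fitted)
unexplained = sum((y - f) ** 2 for y, f in zip(sons, fitted))
total = sum((y - mean_y) ** 2 for y in sons)

r_squared = explained / total
print(f"explained = {explained:.2f}, unexplained = {unexplained:.2f}, total = {total:.2f}")
print(f"r squared = {r_squared:.4f}")
```

Here the explained and unexplained parts add up to the total variation (apart from floating-point rounding), and r² comes out at roughly 0.60, i.e. about 60% of the variation in the sons' heights is accounted for by the linear relationship with the fathers' heights.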

B) Linear Correlation (Karl Pearson’s coefficient of linear correlation): measures the degree of linear correlation between two variables (e.g. X and Y). This correlation coefficient is a pure number, independent of the units in which the variables are expressed. It also tells us whether the slope of the regression line is positive or negative.

Its formula is:

$$r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{n}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{n}\right)}}$$

Properties

1) −1 ≤ r ≤ 1
2) r is a pure number without any unit
3) If r is close to 1 → a strong positive relationship
4) If r is close to −1 → a strong negative relationship
5) If r = 0 → no linear correlation

Determine the value of ‘r’ for the scores in the above example.

r = 0.7776 ≈ 0.78
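One way to check this value is to compute r directly from the ten pairs of heights. The sketch below uses the standard computational form of Pearson's r, written with the same sums as the regression table plus ΣY²; the function name `pearson_r` is an illustrative choice.

```python
import math

# Father (X) and oldest-son (Y) heights in inches from the regression example.
fathers = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
sons = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]


def pearson_r(xs, ys):
    """Pearson's linear correlation coefficient from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt(
        (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
    )
    return numerator / denominator


r = pearson_r(fathers, sons)
print(f"r = {r:.4f}")
```

The numerator is the same quantity (66.6) that appears in the slope b, which is why r and b always share the same sign.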

Rank correlation coefficient

Karl Pearson’s coefficient of correlation cannot be used in cases where the direct quantitative measurement of the phenomenon under study is not possible. In such situations we can use Spearman’s rank correlation coefficient.

Spearman’s rank correlation coefficient, denoted by rs, measures the correlation between two paired samples of ranked data. This correlation coefficient is applied to the ranks in two paired samples (not to the original scores). The formula for computing rank correlation by this method is:

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$

1) List the n pairs of ranks; X, Y.
2) Find the differences (di) between the ranks.
3) Square these differences and add the squares (Σdi²).
4) Compute rs.
Example

Six paintings were ranked by two judges. Calculate the rank correlation coefficient.

Painting   First judge (X)   Second judge (Y)   di    di²

A          2                 2                   0    0
B          1                 3                  -2    4
C          4                 4                   0    0
D          5                 6                  -1    1
E          6                 5                   1    1
F          3                 1                   2    4

Σdi² = 10, n = 6.

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(10)}{6(6^2 - 1)} = 1 - \frac{60}{210} = 1 - 0.29 = 0.71$$
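The same calculation can be sketched in Python. The function below applies the formula rs = 1 − 6Σdi²/(n(n² − 1)) to two lists of ranks; it assumes there are no tied ranks, which holds for the six paintings. The names `spearman_rs`, `first_judge` and `second_judge` are illustrative.

```python
# Ranks given by the two judges to paintings A, B, C, D, E, F (from the table).
first_judge = [2, 1, 4, 5, 6, 3]
second_judge = [2, 3, 4, 6, 5, 1]


def spearman_rs(ranks_x, ranks_y):
    """Spearman's rank correlation for paired ranks without ties."""
    n = len(ranks_x)
    # Sum of squared rank differences, i.e. the sum of di squared.
    sum_d2 = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


rs = spearman_rs(first_judge, second_judge)
print(f"rs = {rs:.2f}")
```

For these ranks the sum of squared differences is 10, so the function returns 1 − 60/210, which rounds to 0.71.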

How do you interpret the above correlation coefficient?

Spurious correlation

The warnings that apply to regression analysis also apply to the interpretation of a correlation coefficient.

1. The correlation coefficient applies only to a linear relationship between X and Y
2. Correlation does not mean causation

Often one encounters what seem to be nonsense or spurious correlations between two variables that logically appear to be totally unrelated to one another. These often arise with correlations taken over time, usually over a period of several years.
What do you think about the correlation coefficient (r)
of 0.9 between the amount of rainfall in Canada and
the maize production in Ethiopia from 1990 to 2000?
Assume the yearly data of the amount of rainfall and
maize production for the years 1990 to 2000 are
available.
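A small numerical sketch may help with this question. The two series below are made-up numbers, not real rainfall or maize data: each is simply an upward trend over eleven years plus a small wiggle, and neither is built from the other. The names `series_a`, `series_b` and `pearson_r` are illustrative.

```python
import math

# Hypothetical yearly values for 1990-2000 (NOT real rainfall or yield data).
years = list(range(1990, 2001))
series_a = [100 + 3 * t + (2 if t % 2 == 0 else -2) for t in range(len(years))]
series_b = [50 + 2 * t + 1.5 * (t % 3 - 1) for t in range(len(years))]


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(series_a, series_b)
print(f"r = {r:.2f}")
```

The resulting r is well above 0.9 even though the only thing the two series share is that both increase with time. A high r between two yearly series can therefore reflect a common time trend rather than any link between the variables themselves.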

Exercise 5
Data on FEV1 (forced expiratory volume in one
second) (Y) and height (X) of 20 male medical
students are given below:

Height (cm) FEV1(litres)


164.0 3.54
167.0 3.54
170.4 3.19
171.2 2.85
171.2 3.42
171.3 3.20
172.0 3.60
172.0 3.78
174.0 4.32
176.0 3.75
177.0 3.09
177.0 4.05
177.0 5.43
177.4 3.60
178.0 2.98
180.7 4.80
181.0 3.96
183.1 4.78
183.6 4.56
183.7 4.68

A) Find the regression of Y on X.
B) What is the expected FEV1 for a male student whose height is 175 cm?
C) What is the expected FEV1 for a female student whose height is 166 cm?
D) What is the expected FEV1 for a male student whose height is 270 cm?
E) Determine the Karl Pearson’s linear correlation coefficient.
F) Compute the coefficient of determination and give an explanation for it.
