Lecture 7 - Correlation and Regression Batch 11

Correlation and regression are statistical methods used to describe the strength and direction of relationships between variables. Correlation measures the degree of association between two variables using a correlation coefficient (r) ranging from -1 to 1. Regression finds the linear relationship between variables to allow prediction of one variable based on the other using a regression equation. Scatter plots are used to visualize relationships and determine if they are linear, curvilinear, or random.

Correlation & Regression

Learning Outcome
At the end of this lecture, students should be able to:

1 Define and describe correlation and regression
2 Interpret different types of scatter diagrams
3 Explain the correlation coefficient and linear correlation relationships
4 Describe Pearson’s r and the coefficient of determination (r²)
5 Define and describe regression
6 Interpret the least-squares regression line
7 Apply the regression formula
8 Explain the regression coefficient

Correlation & Regression

Correlation measures the direction and strength of the linear relationship between two quantitative variables.

Linear Regression

Explanation: summarizes the linear relationship between two variables in the form of a regression equation, and describes how a response variable, y, changes as an explanatory variable, x, changes.

Prediction: is often used as a mathematical model to predict the value of a response variable, y, based on a value of an explanatory variable, x.

Examining the relationship
between two variables
There are situations where one needs
to examine the relationship between
two variables.

E.g. Do children grow taller with age?

E.g. For the analysis of a particular substance in blood, does the new method give results that are similar to the old method?

• In analytical biology and chemistry, methods are compared for several reasons, such as when a new instrument is purchased or a new analytical technique is introduced.
• Results obtained with the old instrument or technique need to be compared with results obtained with the new one, to determine whether there are changes in imprecision or bias that would affect interpretation of the data.

BIOMEDICAL SCIENCE, FACULTY OF MEDICINE


Correlation
The strength of the association between two variables (i.e. the degree
of correlation or interdependence) is determined statistically by
calculating the correlation coefficient (pekali korelasi).
(coefficient means a constant number that serves as a measure of some
property or characteristic)

Regression

Predictions (i.e. predicting one variable against another) are undertaken by using the method of least-squares regression.

Scatter Diagrams
• When the investigator has collected two series of
observations, he then needs to decide if there is a
relationship between them. For this it is best to construct
a scatter diagram first, where the vertical scale represents
one set of figures and the horizontal scale another.

• The vertical scale, or y-axis, usually carries the ‘dependent’ variable, which is the variable to be predicted or determined by experiment.
• The horizontal scale, or x-axis, carries the ‘independent’ variable, which is the known variable.
• This relationship is described as the regression of y on x.


Example - Age vs. Mean Height

Correlation

The shape of the scatter of plotted points will give you an idea about the strength of the correlation between the two sets of variables.

[Figure: Graph A (New method plotted against Old Method) and Graph B, scatter plots]

• If the dots approximate a straight line, we say that there is a linear correlation between the two variables.
• If the line moves from lower left to upper right, we say that there is a positive linear correlation between the two variables (Graph A).
• If the line moves from upper left to lower right, we say that there is a negative linear correlation between the two variables (Graph B).

[Figure: Graph C and Graph D, scatter plots]

• If the dots have linear regions and curved regions, as shown in Graph C, we say there is a curvilinear relationship between the two variables.
• If the dots are scattered randomly, as shown in Graph D, we say there is no relationship between the two variables.

Correlation Coefficient
• If you want a definitive numerical value to describe the linear relationship between two variables, you can calculate a statistic called the coefficient of correlation:
   – ρ (rho) for a population, or
   – r for a sample (r is an estimate of ρ)
• The correlation coefficient, r, is a measure of the linear relationship between two variables.

Coefficient of linear correlation, r
You don’t have to memorise this formula now, just know that it exists!

\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} \]

where x = value for one of the variables
y = value for the other variable
n = number of pairs of scores
(If a curved line is needed to express the relationship, other, more complicated measures of correlation must be used.)
This is also known as Pearson’s Coefficient of Linear Correlation, and it is used to compare normally distributed data sets.
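
As a rough illustration of the formula above, the sketch below computes r directly from the raw sums. The function name `pearson_r` and the paired old-method/new-method values are made up for this example and do not come from the lecture.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from the raw-sums formula."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Hypothetical paired results (old method, new method); illustrative only.
old_method = [2, 4, 6, 8, 10]
new_method = [3, 5, 6, 9, 11]
r = pearson_r(old_method, new_method)
print(f"r = {r:.4f}")
```

For these made-up values the two methods rise together almost on a straight line, so r comes out close to, but below, +1.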

• Note that if you want to determine whether two data sets that are NOT normally distributed are correlated, then you have to use a nonparametric test called the Spearman Rank Correlation Test.
• For this test, you will need to calculate the Spearman Rank Correlation Coefficient, R, which is used in much the same way that we use Pearson’s Correlation Coefficient, r.
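
The lecture does not give a formula for R. One common way to obtain it, sketched here as an assumption rather than as the lecture's method, is to replace each value by its rank (tied values share the average rank) and then apply Pearson's formula to the ranks. All function names and data below are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson's r in its deviation-from-the-mean form."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spearman_rank_r(xs, ys):
    """Spearman's coefficient as Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical skewed data: y always rises with x, but not in a straight line.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 100, 1000]
print("Pearson r: ", round(pearson_r(xs, ys), 3))
print("Spearman R:", round(spearman_rank_r(xs, ys), 3))
```

Because the ranks of x and y agree exactly in this example, R reaches its maximum even though the raw values are far from a straight line.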

Characteristics of Pearson’s r
• r measures the strength of the association between the variables tested
• r values range from -1.00 to +1.00
• If every x value is equal to its paired y value, then r = +1.00
• The closer r is to +1.00 or -1.00, the stronger the linear correlation between the two variables tested
• The correlation coefficient is +1.00 when x and y increase together on a perfect straight line.
• The correlation coefficient is -1.00 when one decreases as the other increases on a perfect straight line
• The r value approaches zero when there is no linear relationship between the variables tested.
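
The extreme cases listed above can be checked on small made-up data sets. This is only a sketch: the helper `pearson_r` and all values are hypothetical, chosen so that the points lie exactly on a rising line, exactly on a falling line, or on a symmetric curve.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r in its deviation-from-the-mean form."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]
rising = [2 * x + 1 for x in xs]    # perfect straight line, positive slope
falling = [-3 * x + 4 for x in xs]  # perfect straight line, negative slope
curved = [x * x for x in xs]        # symmetric curve, no linear trend

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
r_curved = pearson_r(xs, curved)
print(r_rising, r_falling, r_curved)
```

The curved case shows why a value of r near zero rules out only a linear relationship: y depends entirely on x there, yet r is zero.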

[Figure: scatter plot of New method against Old Method]

Positive linear correlation, r = +1.00

Positive linear correlation, but r < 1.00

Negative linear correlation, r ≈ -1.00

Negative linear correlation, but r > -1.00

Curvilinear relationship, r ≈ 0

Random scatter, r ≈ 0

Interpretation of r
• Colton: interpreting r
• 0 to 0.25 (or 0 to -0.25): little or no relationship
• 0.25 to 0.50 (or -0.25 to -0.50): fair degree of relationship
• 0.50 to 0.75 (or -0.50 to -0.75): moderate to good relationship
• > 0.75 (or < -0.75): very good to excellent relationship
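
The bands above can be written as a small lookup, shown here as a sketch. The function name is hypothetical, and because the bands as listed share their end points, assigning a boundary value such as 0.25 to the lower band is an assumption of this example.

```python
def colton_interpretation(r):
    """Return the descriptive band for r, judged on its magnitude |r|."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    # Assumption: a value exactly on a boundary falls in the lower band.
    if size <= 0.25:
        return "little or no relationship"
    if size <= 0.50:
        return "fair degree of relationship"
    if size <= 0.75:
        return "moderate to good relationship"
    return "very good to excellent relationship"

for value in (0.10, -0.30, -0.60, 0.90):
    print(f"r = {value:+.2f}: {colton_interpretation(value)}")
```

Only the size of r decides the band; the sign tells you the direction of the relationship, not its strength.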


What r is NOT !
• r is a measure of LINEAR ASSOCIATION
• r does NOT tell us if Y is a function of X
• Correlation does NOT mean causation
i.e.,
r does NOT tell us if X causes Y
r does NOT tell us if Y causes X
• r does NOT tell us what the scatter plot looks like or where
the points are plotted

r squared (r²), the Coefficient of Determination

• It has a value that ranges from 0 to 1.
• It is the fraction of the variance in the two variables that is shared (the degree to which x and y vary together).
• E.g. if we are comparing two variables, x (height) and y (weight), and r² = 0.49, then 49% of the variance in x, height, can be explained or accounted for by variation in y, weight. The remaining 51% (the error term) is accounted for by some factor(s) other than y, weight, for example genetic factors.
• Likewise, 49% of the variance in y, weight, can be explained or accounted for by the variation in x, height, and the remaining 51% (the error term) may be accounted for by some factor(s) other than x, height, for example the tendency to eat fatty foods.


• The greater the magnitude of the correlation coefficient, the greater the coefficient of determination, and the more of the variability in y that can be accounted for by the variability in x.
• r² is a statistical measure of goodness of fit. It measures how good the estimated regression equation is. The higher the r², the more confidence one can have in the regression equation (see next two graphs).
• Statistically, the coefficient of determination (r²) represents the proportion of the total variation in the y variable that is explained by the regression equation.
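
The last point above can be checked numerically. The sketch below fits a least-squares line to made-up data and computes r² as one minus the ratio of unexplained to total variation in y. The function names and data are hypothetical and are not taken from the lecture.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m*x + c."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    c = (sum_y - m * sum_x) / n
    return m, c

def r_squared(xs, ys):
    """Proportion of the total variation in y explained by the fitted line."""
    m, c = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_residual / ss_total

# Hypothetical data; illustrative only.
xs = [2, 4, 6, 8, 10]
ys = [3, 5, 6, 9, 11]
r2 = r_squared(xs, ys)
print(f"r squared = {r2:.4f}")
```

For a straight-line fit with one explanatory variable, this value equals the square of Pearson's r for the same data.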

[Figure] Age vs. Height: r² = 0.9888

[Figure] Age vs. Height: r² = 0.849

Regression
• In addition to establishing whether a relationship (i.e. correlation) exists between two variables, we sometimes want to predict an outcome or response variable from an independent variable. For this we use regression.
• Linear regression gives the best estimate of a dependent variable when an independent variable is known and the relationship is linear.

When given a prediction problem, what do we do?

a) We first plot all the dots on a scatter diagram.

b) Then we try to fit a straight line to the data in such a way that it best represents the linear relationship between the two variables. Such a fitted line is called an estimated regression line.
regression line - garis regresi

c) We then use the line to predict the value of a dependent variable from an independent variable.

To predict mean height at age 32 months?

Important Q: How do we best estimate the regression line, i.e. obtain the line of best fit?

We use a mathematical method known as the least-squares method.

• In the least squares method, we analyse the deviations of our observations (i.e. points) from the estimated regression line that we have drawn.
• There are three ways to measure the space between a point and a line:
   - vertically in the y direction,
   - horizontally in the x direction, and
   - on a perpendicular to the line
• We choose to measure the space vertically. Why? Because the purpose of drawing a regression line is to use it to predict the y value for a given x, and the vertical distances are how far off the predictions of y would be from the points we actually measured.

• To obtain an idea of the total deviation shown by the estimated regression line, we could sum all the vertical deviations of our observations from the line.
• However, each vertical deviation could be positive or negative, depending on whether the line falls above or below that point.
• Therefore, to overcome the nullifying effect of the opposing signs, we first square the vertical deviations from the estimated regression line and then sum them.

So what does the method of least squares do?

It is a mathematical method to obtain a regression equation such that the sum of the squares of the vertical distances of the data points from the regression line is as small as possible. Since we use the sum of squares, the method is called the method of least squares.
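
A small numerical sketch of why the deviations are squared before they are added. The data are made up. For these values the least-squares line works out to y = 1.0x + 0.8, and the second line is an arbitrary steeper line chosen for comparison. Deviations are taken as predicted minus measured.

```python
def deviations(xs, ys, m, c):
    """Vertical deviations (predicted minus measured) from y = m*x + c."""
    return [(m * x + c) - y for x, y in zip(xs, ys)]

def sum_of_squares(xs, ys, m, c):
    """Sum of the squared vertical deviations from y = m*x + c."""
    return sum(d * d for d in deviations(xs, ys, m, c))

# Hypothetical data; illustrative only.
xs = [2, 4, 6, 8, 10]
ys = [3, 5, 6, 9, 11]

best_line = (1.0, 0.8)    # least-squares line for this data
other_line = (1.2, -0.4)  # a steeper line through the same mean point

for m, c in (best_line, other_line):
    signed_total = sum(deviations(xs, ys, m, c))
    squared_total = sum_of_squares(xs, ys, m, c)
    print(f"m = {m}, c = {c}: signed sum = {signed_total:.3f}, "
          f"sum of squares = {squared_total:.3f}")
```

Both lines have signed deviations that cancel to essentially zero, so the plain sum cannot tell them apart. The sum of squares can, and it is smaller for the least-squares line.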

Least Squares Regression Line
It is the line that is ‘best’ in the sense that it minimizes the sum of the squared errors in the vertical (Y) direction.

[Figure: data points plotted on X and Y axes, with the errors from the line marked]

Formula for the Least Squares Regression Line

• Given that the association between the two variables we are comparing is described by a straight line, its equation must be in the form y = mx + c, and we have to define two features of the line, m and c, if we are to place the line correctly on the diagram.
   – c is its distance above the baseline (the intercept on the y-axis, when x = 0)
   – m is the slope and is called the regression coefficient. It is the amount by which a change in x must be multiplied to give the corresponding average change in y, or the amount y changes for a unit increase in x.

regression coefficient - pekali regresi
Mathematically, to fulfil the least squares criterion,

\[ m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \qquad \text{and} \qquad c = \frac{1}{n}\left(\sum y - m\sum x\right) \]

YOU DON’T HAVE TO MEMORISE THIS FORMULA! JUST NOTE THAT IT EXISTS!
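
A minimal sketch of these two formulas in code. The ages and mean heights are invented for illustration and are not the data behind the lecture's Age vs. Mean Height graph; the function name is also hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (m, c) for y = m*x + c using the raw-sums formulas."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    c = (sum_y - m * sum_x) / n
    return m, c

# Hypothetical ages (months) and mean heights (cm); illustrative values only.
ages = [18, 20, 22, 24, 26, 28]
heights = [76.0, 77.5, 78.5, 80.0, 81.0, 82.5]

m, c = least_squares_line(ages, heights)
predicted_height = m * 32 + c  # age 32 lies just beyond the observed range
print(f"m = {m:.4f}, c = {c:.4f}, predicted height at 32 months = "
      f"{predicted_height:.1f} cm")
```

Predicting at 32 months extrapolates past the last observed age, so the result is only as good as the assumption that the straight-line trend continues.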

What do we do with the regression equation?
• With this equation we can find a series of values of y, the dependent variable, that correspond to each of a series of values of x, the independent variable.
• With the coordinates calculated with the regression equation, we can then draw the least squares regression line on a scatter diagram and predict values of y that correspond to specific values of x.
• We can also draw the least squares regression line just knowing the values of m and c in the regression equation.


For your understanding,

• If the line is y = 3x + 2 and we have a point (2, 9), the predicted value is 3×2 + 2 = 8.
• If we subtract the actual measured value, 9, from the predicted value, 8, we say that the deviation is -1 (negative because the predicted value is less than the actual value).
• In general, the deviation (vertical gap) between any given point (x, y) and the line y = mx + c will be mx + c - y.
• The regression equation is often more useful than the correlation coefficient.
• It enables us to predict y from x and gives us a better summary of the relationship between the two variables.
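
The worked example with the line y = 3x + 2 and the point (2, 9) can be reproduced in a few lines. The function names are hypothetical, and the sign convention follows the text: deviation = predicted value minus measured value.

```python
def predict(m, c, x):
    """Value of y predicted by the line y = m*x + c."""
    return m * x + c

def deviation(m, c, x, y):
    """Vertical gap between the line and the point (x, y): m*x + c - y."""
    return predict(m, c, x) - y

predicted = predict(3, 2, 2)  # line y = 3x + 2 evaluated at x = 2
gap = deviation(3, 2, 2, 9)   # point (2, 9)
print(predicted, gap)         # prints: 8 -1
```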

• If the slope is positive, y increases as x increases.
• If the slope is negative, y decreases as x increases.
• If the slope = 0, then x does not help in predicting y (linearly).
