Correlation Analysis

Correlation analysis measures the association between two continuous variables using the Pearson correlation coefficient, denoted as r, which ranges from -1 to +1. A positive r indicates that higher values of one variable are associated with higher values of the other, while a value close to zero suggests no linear relationship. The document also discusses hypothesis testing for correlation coefficients and the importance of visualizing data to identify potential non-linear relationships.

Uploaded by

Berhanu Yelea

CORRELATION ANALYSIS

Correlation is the method of analysis to use when studying the possible association between two continuous variables.

If we want to measure the degree of association, we can do so by calculating the correlation coefficient.

The standard method (Pearson correlation) leads to a quantity called r that can take on any value from -1 to +1.

This correlation coefficient r measures the degree of 'straight-line association' between the values of two variables. Thus a value of +1.0 or -1.0 is obtained if all the points in a scatter plot lie on a perfectly straight line.
The correlation between two variables is positive if higher values of one variable are associated with higher values of the other, and negative if one variable tends to be lower as the other gets higher.

A correlation of around zero indicates that there is no linear relation between the values of the two variables (i.e. they are not linearly correlated).

However, a correlation coefficient close to zero does not mean there is no association at all: there may be a non-linear relationship, such as a parabolic or other curved one.

In essence, r is a measure of the scatter of the points around an underlying linear trend: the greater the spread of the points, the lower the correlation.

The correlation coefficient usually calculated is called Pearson's correlation coefficient (other coefficients are used for ranked data, etc.).

If we have two variables X and Y with values X_i and Y_i for the ith individual, the correlation between them, denoted by r(X,Y), is given by

$$
r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \, \sum (Y_i - \bar{Y})^2}}
  = \frac{\sum XY - [\sum X \sum Y]/n}{\sqrt{[\sum X^2 - (\sum X)^2/n]\,[\sum Y^2 - (\sum Y)^2/n]}}
$$
The equation is clearly symmetrical: it does not matter which variable is X and which is Y.

NB: this differs from the case of regression analysis, where the choice of dependent and independent variable matters.
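The two forms of the formula for r (deviations from the means, and raw sums) can be checked against each other numerically. The sketch below is only illustrative: the function names and the five-point toy data set are assumptions made here, not part of the original slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from the deviation (definitional) form."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pearson_r_computational(xs, ys):
    """Pearson r from the raw sums (computational form)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n))
    return numerator / denominator

# Hypothetical toy data, chosen only to keep the arithmetic easy to follow.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

r_def = pearson_r(xs, ys)
r_comp = pearson_r_computational(xs, ys)
r_swapped = pearson_r(ys, xs)  # swapping X and Y should not change r

print(r_def, r_comp, r_swapped)
```

Both forms should agree up to rounding error, and swapping the two arguments leaves r unchanged, which is the symmetry noted above.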
Example: The following data show the weights of a sample of 12 fathers and their oldest sons. Compute the correlation coefficient between the two weight measurements.

Wt of father (X)   Wt of son (Y)   X^2    Y^2    XY
65                 68              4225   4624   4420
63                 66              3969   4356   4158
67                 68              4489   4624   4556
64                 65              4096   4225   4160
68                 69              4624   4761   4692
62                 66              3844   4356   4092
70                 68              4900   4624   4760
66                 65              4356   4225   4290
68                 71              4624   5041   4828
67                 67              4489   4489   4489
69                 68              4761   4624   4692
71                 70              5041   4900   4970
[Figure: Scatter plot of father's by son's weight. X axis: weight of father (60 to 72); Y axis: weight of son (64 to 72).]

The correlation coefficient for the data on fathers' and sons' weights will be:

$$
\sum X = 800,\quad \sum X^2 = 53{,}418,\quad \sum Y = 811,\quad \sum Y^2 = 54{,}849,\quad \sum XY = 54{,}107
$$

$$
\sum (x - \bar{x})(y - \bar{y}) = \sum xy - (\sum x)(\sum y)/n = 54{,}107 - (800)(811)/12 = 40.33
$$

$$
\sum (x - \bar{x})^2 = \sum x^2 - (\sum x)^2/n = 53{,}418 - (800)^2/12 = 84.67
$$

$$
\sum (y - \bar{y})^2 = \sum y^2 - (\sum y)^2/n = 54{,}849 - (811)^2/12 = 38.92
$$

$$
r = \frac{40.33}{\sqrt{(84.67)(38.92)}} = 0.703
$$
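The arithmetic of this worked example can be reproduced from the twelve pairs of weights in the table. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import math

# Weights of the 12 fathers (X) and their oldest sons (Y), from the example table.
father = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]
son = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]
n = len(father)

# Raw sums used by the computational formula.
sum_x = sum(father)
sum_y = sum(son)
sum_xx = sum(x * x for x in father)
sum_yy = sum(y * y for y in son)
sum_xy = sum(x * y for x, y in zip(father, son))

# Corrected sums of products and squares.
sxy = sum_xy - sum_x * sum_y / n
sxx = sum_xx - sum_x ** 2 / n
syy = sum_yy - sum_y ** 2 / n

r = sxy / math.sqrt(sxx * syy)

print(sum_x, sum_xx, sum_y, sum_yy, sum_xy)
print(round(sxy, 2), round(sxx, 2), round(syy, 2))
print(round(r, 3))
```

The printed sums and the rounded value of r should match the figures quoted in the example (800, 53,418, 811, 54,849, 54,107 and r = 0.703).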
Inference on Correlation Coefficient

[Figure: three scatter-plot panels of Y against X, showing r = 0 with slope b = 0, r < 0 with b < 0, and r > 0 with b > 0.]
Hypothesis Testing on Correlation Coefficient

Under the null hypothesis that there is no association in the population (ρ = 0), the appropriate test statistic is

$$
t = r \sqrt{\frac{n-2}{1-r^2}}
$$

which has a t distribution with n-2 degrees of freedom.

The null hypothesis can then be tested by looking this value up in the table of the t distribution.

For the fathers' and sons' weight data:

$$
t = r \sqrt{\frac{n-2}{1-r^2}} = 0.703 \sqrt{\frac{12-2}{1-(0.703)^2}} \approx 3.12
$$

With 10 degrees of freedom, 0.01 < p < 0.02 (two-sided; the exact value is about 0.011), i.e., the correlation coefficient is significantly different from 0 at the 5% level.
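The test statistic and an approximate p-value can be computed without statistical tables. The sketch below uses only the Python standard library and obtains the two-sided p-value by numerically integrating the t density with Simpson's rule; that integration approach, and the function names, are choices made here for illustration (a statistics package would normally be used instead).

```python
import math

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3          # P(0 < T < |t|)
    return 2 * (0.5 - central)       # both tails beyond |t|

# Fathers' and sons' weights from the worked example.
father = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]
son = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]
n = len(father)
mean_x, mean_y = sum(father) / n, sum(son) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(father, son))
sxx = sum((x - mean_x) ** 2 for x in father)
syy = sum((y - mean_y) ** 2 for y in son)
r = sxy / math.sqrt(sxx * syy)

t = t_statistic(r, n)
p = two_sided_p(t, n - 2)
print(round(t, 2), round(p, 4))
```

Using the unrounded r, this gives t of about 3.12 and a two-sided p of about 0.011, in line with the Sig. value reported in the ANOVA table for the same data.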
Interpretation of Correlation

• Correlation coefficients lie within the range -1 to +1, with the mid-point of zero indicating no linear association between the two variables.

• A very small correlation does not necessarily indicate that two variables are not associated, however.

• To be sure of this we should study a plot of the data, because it is possible that the two variables display a non-linear relationship (for example cyclical or curved). In such cases r will underestimate the association, as it is a measure of linear association alone.
• Very small r values may be statistically significant in moderately large samples, but whether they are clinically relevant must be considered on the merits of each case.

• One way of looking at the correlation that helps to temper over-enthusiasm is to calculate 100r^2, the coefficient of determination (a measure of goodness of fit), which is the percentage of variability in the data that is 'explained' by the linear association.

• So a correlation of 0.7 implies that just about half (49%) of the variability may be put down to the observed association, and so on.

• Interpretation of association is often problematic because causation cannot be directly inferred. When looking at variables where there is no background knowledge, inferring a causal link is not justified.
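Two of the points above lend themselves to a quick numerical illustration: a perfect but non-linear relationship can give r = 0, and 100r^2 for r = 0.7 is 49%. The data in this sketch are hypothetical (a parabola on points symmetric about zero), chosen here only to make the first point concrete.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y is completely determined by x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r_parabola = pearson_r(xs, ys)

# Percentage of variability 'explained' by a linear association with r = 0.7.
percent_explained = 100 * 0.7 ** 2

print(r_parabola, round(percent_explained, 1))
```

Here r is zero even though Y is an exact function of X, which is why a plot of the data should be examined before concluding that two variables are unrelated.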
In the regression analysis context,

Total variation in Y = variation 'explained' by fitting the line + variation not explained by fitting the line (error)

The residual (error) tells you about the variation not explained by fitting the line.

Therefore, the goodness of fit of the regression line (r^2), the coefficient of determination, is given by:

$$
r^2 = \frac{\text{Variation in } Y \text{ explained by the fitted line}}{\text{Total variation of the outcome variable}} = \frac{SSR}{SST}
$$
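Computational formula

What is the relationship between the slope and the correlation coefficient?

Show that (Page 357, Ronald):

$$
r = b \, \frac{S_x}{S_y}, \qquad b = r \, \frac{S_y}{S_x}
$$

where b is the slope of the regression of Y on X, and S_x and S_y are the sample standard deviations of X and Y.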
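The relationship between the slope and the correlation coefficient can be checked numerically on the fathers' and sons' weight data. In this sketch the slope is taken to be the least-squares slope of Y on X, and S_x and S_y are computed as sample standard deviations with divisor n - 1; both are assumptions made here, since the slides do not spell them out.

```python
import math

father = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]
son = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]
n = len(father)

mean_x = sum(father) / n
mean_y = sum(son) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(father, son))
sxx = sum((x - mean_x) ** 2 for x in father)
syy = sum((y - mean_y) ** 2 for y in son)

b = sxy / sxx                      # least-squares slope of Y on X
r = sxy / math.sqrt(sxx * syy)     # Pearson correlation coefficient
s_x = math.sqrt(sxx / (n - 1))     # sample standard deviation of X
s_y = math.sqrt(syy / (n - 1))     # sample standard deviation of Y

slope_from_r = r * s_y / s_x
r_from_slope = b * s_x / s_y

print(round(b, 3), round(slope_from_r, 3))
print(round(r, 3), round(r_from_slope, 3))
```

The identity holds because the factor n - 1 cancels in the ratio S_y / S_x, leaving r multiplied by the square root of Syy / Sxx, which equals Sxy / Sxx.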
For the fathers' and sons' weight data, the ANOVA table is:

Model              Sum of Squares   df   Mean Square   F         Sig.
Regression (SSR)   19.21391         1    19.21391      9.75189   0.010822
Residual (SSE)     19.70276         10   1.970276
Total (SST)        38.91667         11
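The entries of this ANOVA table can be reproduced by fitting the least-squares line of son's weight on father's weight and splitting the total variation into explained and residual parts. The sketch below is illustrative; the variable names are chosen here.

```python
father = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]
son = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]
n = len(father)

mean_x = sum(father) / n
mean_y = sum(son) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(father, son))
sxx = sum((x - mean_x) ** 2 for x in father)

# Least-squares line: fitted = a + b * x
b = sxy / sxx
a = mean_y - b * mean_x
fitted = [a + b * x for x in father]

sst = sum((y - mean_y) ** 2 for y in son)              # total variation
ssr = sum((f - mean_y) ** 2 for f in fitted)           # explained by the line
sse = sum((y - f) ** 2 for y, f in zip(son, fitted))   # residual (error)

df_regression, df_residual = 1, n - 2
f_stat = (ssr / df_regression) / (sse / df_residual)
r_squared = ssr / sst

print(round(ssr, 5), round(sse, 5), round(sst, 5))
print(round(f_stat, 5), round(r_squared, 3))
```

SSR and SSE add up to SST, the ratio SSR/SST is about 0.494 (the square of r = 0.703), and F is the square of the t statistic of about 3.12 obtained in the hypothesis test.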
