Correlation: We Take Two Measurements of Two Different Physical Properties; Are They Related?

1) A high correlation coefficient between two variables does not necessarily mean there is a causal relationship between them; correlation alone does not prove that one causes the other.
2) With a large enough dataset, randomly generated uncorrelated variables may appear to have a significant correlation simply due to chance.
3) Correlation only indicates a relationship between variables; the statistical literature contains examples where an observed correlation did not reflect a true causal link.


Correlation

We take two measurements of two different physical properties; are they related?

What affects the degree (or amount) of correlation?

• number of observations;
• strength of relationship (slope);
• strength of correlation (scatter);
• significance;
• confidence.
How many points are in each quadrant?

[Scatter plot split into four quadrants, with per-quadrant counts:]
2 | 5
6 | 2
Simple case: centred on (0, 0)

• X is positive, Y is positive: X × Y is positive
• X is negative, Y is positive: X × Y is negative
• X is negative, Y is negative: X × Y is positive
• X is positive, Y is negative: X × Y is negative
Porosity, φ, and permeability, K, are both always positive:

[Scatter plot of K against φ: all points lie in the positive quadrant.]
But the difference between φ and mean(φ) plotted against the difference between K and mean(K) centres the plot on (0, 0):

[Scatter plot of (K − K̄) against (φ − φ̄), centred on the origin.]
So the difference between φ and mean(φ) and the difference between K and mean(K) give us the basis for the measure of correlation we want:

[Centred scatter plot of (K − K̄) against (φ − φ̄): 4 points in the (+, +) quadrant, 1 point in the (−, +) quadrant, 4 points in the (−, −) quadrant, 1 point in the (+, −) quadrant.]

8 points for which (K − K̄)(φ − φ̄) is positive versus 2 points for which it is negative.
A formula for calculating the correlation:

r = ∑(xᵢ − x̄)(yᵢ − ȳ) / √( ∑(xᵢ − x̄)² ∑(yᵢ − ȳ)² )
Notation
∑X = ∑xᵢ = x₁ + … + xₙ
∑Y = ∑yᵢ = y₁ + … + yₙ
But a better formula for calculating the correlation is:

r = ( ∑xᵢyᵢ − ∑xᵢ ∑yᵢ / n ) / √( (∑xᵢ² − (∑xᵢ)² / n) (∑yᵢ² − (∑yᵢ)² / n) )
Why is this a better formula for calculating the correlation?

Because each of the terms can be calculated relatively simply.
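As a sketch of why the rearranged formula is convenient (Python here is an assumption; the slides do this in a spreadsheet), each of the five sums can be accumulated in a single pass over the data and then combined:

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient via the 'better formula' above."""
    n = len(xs)
    sx = sum(xs)                              # sum of x_i
    sy = sum(ys)                              # sum of y_i
    sxx = sum(x * x for x in xs)              # sum of x_i^2
    syy = sum(y * y for y in ys)              # sum of y_i^2
    sxy = sum(x * y for x, y in zip(xs, ys))  # sum of x_i * y_i
    num = sxy - sx * sy / n
    den = math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
    return num / den

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly correlated -> 1.0
```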
Calculating the correlation coefficient (1)

Set up one cell of a spreadsheet for each of the terms in the equation for the correlation coefficient, r:

∑xᵢyᵢ
∑xᵢ ∑yᵢ / n
∑xᵢ²    ∑yᵢ²
(∑xᵢ)² / n    (∑yᵢ)² / n
Notation: ∑xᵢ
∑xᵢ = x₁ + x₂ + x₃ + x₄ + … + xₙ

In Excel, if the data are in rows 7 – 232 (n = 226) of column M, then:

∑xᵢ = SUM(M7:M232)

∑xᵢ² = SUMSQ(M7:M232)

∑xᵢyᵢ = SUMPRODUCT(M7:M232,P7:P232)
Notation: ∑xᵢ
∑xᵢ = x₁ + x₂ + x₃ + x₄ + … + xₙ
∑yᵢ = y₁ + … + yₙ

In Excel, if the data are in rows 7 – 232 (n = 226) of columns M and P, then:

∑xᵢyᵢ = SUMPRODUCT(M7:M232,P7:P232)
Calculating the correlation coefficient (2)

r = ( ∑xᵢyᵢ − ∑xᵢ ∑yᵢ / n ) / √( (∑xᵢ² − (∑xᵢ)² / n) (∑yᵢ² − (∑yᵢ)² / n) )

• ∑xᵢ = SUM(M7:M232)

• ∑xᵢ² = SUMSQ(M7:M232)

• ∑xᵢyᵢ = SUMPRODUCT(M7:M232,P7:P232)
Calculating the correlation coefficient (3)

• Calculate the correlation coefficient for the porosity and permeability data in USGS_poroperm_data\37-Lindquist-1988.xls
Alternatively, the whole expression can be evaluated in a single function call: PEARSON takes 2 arguments, the arrays of x and of y. Hence, in the above example:

= PEARSON(M7:M232, P7:P232)
Calculating the correlation coefficient (4)

• Calculate the correlation coefficient for the porosity and permeability data in USGS_poroperm_data\37-Lindquist-1988.xls

• Calculate the correlation coefficient for the porosity and log10(permeability) data in USGS_poroperm_data\37-Lindquist-1988.xls

• Calculate the correlation coefficient for the porosity and ln(permeability) data in USGS_poroperm_data\37-Lindquist-1988.xls

• Calculate the correlation coefficient for the porosity and (cube root of permeability) data in USGS_poroperm_data\37-Lindquist-1988.xls
Cautions:
1. False positives:
A high correlation coefficient does not necessarily mean there is a real correlation. It is the nature of random processes that, if we take a number of uncorrelated variables and plot them against each other, there will be a spread of correlation coefficients around zero and, if we take enough pairs, the highest and lowest will be significant at any pre-defined percentage point.
Experiment
• Create 10 sets of 10 pairs of random numbers.
• Tabulate each set.
• Calculate the correlation coefficient for each set.
Larger number of pairs reduces this risk: two data sets both with r = 0.84

[Two scatter plots, each of a data set with r = 0.84.]
Cautions:
2. Causality:
Just because two variables are correlated does not mean there is a causal relationship between them, even once we have ruled out the "false positive" effect. The statistical literature abounds with counter-examples, mostly accidental and some hilarious, at least to those who weren't involved.
Does this show what the authors think it shows?
