Correlation and Linear Regression
Linear Regression
Causation vs. Association
Causation vs. Association (cont.)
• Hill’s criteria of causation (proposed by A. B. Hill)
- Strength of the association: the stronger the association, the stronger the evidence for causation.
- Consistency of findings: refers to the repeated observation of an association in different populations, by different investigators, with different methods, etc.
- Specificity of the association: requires that a cause leads to a single effect, not multiple effects. However, a single cause often leads to multiple effects; smoking is a perfect example.
- Temporal relationship: exposure always precedes the outcome. First exposure, then disease.
Causation vs. Association (cont.)
• Hill’s criteria of causation (cont.)
- Biological plausibility: the finding is consistent with existing biological and medical knowledge.
- Dose-response relationship: incremental changes in disease rates in conjunction with corresponding changes in exposure.
- Consideration of alternate explanations
- Coherence: a causal interpretation fits with the known facts and the current knowledge of the natural history/biology of the disease. Experimental evidence demonstrating that, under controlled conditions, changing the exposure causes a change in the outcome is of great value.
[Figure: pairs of scatter plots of y vs. x contrasting association and causation]
Before we conduct any type of analysis, we should always create a two-way scatter plot of the data.
Scatter plot of the 2 variables

[Figure: four scatter plots of y vs. x showing different possible relationships]
Scatter plot of the 2 variables

[Figure: scatter plot of y vs. x showing no relationship]
Correlation Analysis
Some specific names for “correlation” in one’s data:
• The population correlation coefficient: ρ (rho).
• The sample correlation coefficient: r.
• Both range from -1 to 1 and are unit-free.
• The value of r can be substantially influenced by a small fraction of outliers (see the sketch below).
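A minimal sketch of this outlier sensitivity, assuming NumPy is available; the data are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)        # moderate positive relationship

r_clean = np.corrcoef(x, y)[0, 1]        # r on the clean data

# Add a single extreme point and recompute r.
x_out = np.append(x, 10.0)
y_out = np.append(y, -10.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without outlier: {r_clean:.3f}")
print(f"r with one outlier: {r_outlier:.3f}")   # noticeably different
```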
Correlation Analysis
Correlation coefficient
Pearson’s Correlation Coefficient

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}

where:
r = sample correlation coefficient
n = sample size
x = value of the independent variable
y = value of the dependent variable
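A small sketch implementing this formula directly, assuming NumPy; the sample data are made up, and the result is checked against NumPy's built-in np.corrcoef:

```python
import numpy as np

def pearson_r(x, y):
    """Sample correlation r: sum of cross-deviations divided by the
    square root of the product of the sums of squared deviations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 7]
print(pearson_r(x, y))          # formula above, computed by hand
print(np.corrcoef(x, y)[0, 1])  # NumPy built-in; same value
```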
Map r value with its scatter plot

[Figure: five scatter plots of y vs. x, labeled (A) through (E)]

1. r = 0.3
2. r = 1
3. r = -0.6
4. r = 0
5. r = -1
Examples of approximate r values

[Figure: five scatter plots of y vs. x illustrating r = -1, r = -0.6, r = 0, r = +0.3, and r = +1]
Significance Test for Correlation
• Hypotheses
H0: ρ = 0 (no correlation)
HA: ρ ≠ 0 (correlation exists)

What do the hypotheses mean in words?
- Null hypothesis: the population correlation coefficient is not significantly different from zero. There is not a significant linear relationship (correlation) between x and y in the population.
- Alternative hypothesis: the population correlation coefficient is significantly different from zero. There is a significant linear relationship (correlation) between x and y in the population.

• Test statistic

t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}}, with df = n - 2
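A hedged sketch of this test, assuming NumPy and SciPy are available; the data and the helper name corr_t_test are invented for illustration:

```python
import numpy as np
from scipy import stats

def corr_t_test(x, y):
    """t test of H0: rho = 0 against HA: rho != 0, with df = n - 2."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t = r / np.sqrt((1 - r ** 2) / (n - 2))   # test statistic
    p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-sided p-value
    return r, t, p

x = [60, 62, 64, 66, 68, 70, 72, 74]          # toy data
y = [110, 120, 118, 135, 140, 152, 155, 165]
r, t, p = corr_t_test(x, y)
print(f"r = {r:.3f}, t = {t:.2f}, p = {p:.4f}")
# scipy.stats.pearsonr(x, y) returns the same r and p directly.
```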
Example: r = ?
Example

H0: ?
HA: ?

α = 0.05, df = n - 2, r = 0.886, n = 8

t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}}
Example

t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}} = \frac{0.886}{\sqrt{(1 - 0.886^2)/(8 - 2)}} = 4.68
Example

t = \frac{0.886}{\sqrt{(1 - 0.886^2)/(8 - 2)}} = 4.68

Decision: Reject H0, since 4.68 exceeds the two-sided critical value t(0.025, df = 6) = 2.447.
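A quick check of this arithmetic in Python (values from the example above):

```python
import math

r, n = 0.886, 8
t = r / math.sqrt((1 - r ** 2) / (n - 2))
print(round(t, 2))   # 4.68, which exceeds t(0.025, df = 6) = 2.447
```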
Simple linear regression
• A technique to explore the nature of the relationship between two
continuous variables.
Simple linear regression model

y = \beta_0 + \beta_1 x + \varepsilon
Linear regression
Example of height and weight

[Figure: scatter plot of weight vs. height]

• Looking at this scatter plot, what do you think about the trend between height and weight?
• What is the best way to show this trend?
How about a line?
Which line?
The best fitting line
Conditional population & conditional distribution
The Linear Model

y = \beta_0 + \beta_1 x + \varepsilon

Assumptions:
- Linearity
- Constant standard deviation \sigma_{Y|X}

Where:
- y is the dependent or response variable.
- x is the independent or predictor variable.
- \varepsilon is the error term in the model.
Population Linear Regression
Population Linear Regression (continued)

y = \beta_0 + \beta_1 x + \varepsilon

[Figure: regression line with intercept \beta_0 and slope \beta_1; for an observation at x_i, the observed value of y, the predicted value of y, and the random error \varepsilon_i for this x value are marked]
Linear Regression Assumptions
Ordinary least squares (OLS)

Linear regression model
Regression Picture

[Figure: scatter plot with the fitted line \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i and the horizontal line at \bar{y}; for one observation y_i, the distances A (from y_i to \bar{y}), B (from \hat{y}_i to \bar{y}), and C (from y_i to \hat{y}_i) are marked]

\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (\hat{y}_i - y_i)^2

A^2 = SStotal: total squared distance of observations from the naïve mean of y (total variation).
B^2 = SSreg: distance from the regression line to the naïve mean of y; variability due to x (regression).
C^2 = SSresidual: variance around the regression line; additional variability not explained by x, which is what the least squares method aims to minimize.

R^2 = SSreg / SStotal

Least squares estimation gave us the line (\hat{\beta}) that minimized C^2.
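A sketch verifying this decomposition numerically, assuming NumPy; the six data points are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])

b1, b0 = np.polyfit(x, y, 1)      # least squares slope and intercept
y_hat = b0 + b1 * x

ss_total = ((y - y.mean()) ** 2).sum()     # A^2
ss_reg = ((y_hat - y.mean()) ** 2).sum()   # B^2
ss_resid = ((y - y_hat) ** 2).sum()        # C^2

print(np.isclose(ss_total, ss_reg + ss_resid))   # True
print("R^2 =", ss_reg / ss_total)
```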
Ordinary least squares (OLS)

Why squares? Because positive and negative residuals cancel out when simply summed; for the least squares line the residuals sum to zero, so we minimize the sum of squared residuals to measure how close the line is to all the dots.
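A tiny check of that cancellation, assuming NumPy; the data are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1, b0 = np.polyfit(x, y, 1)     # least squares fit
residuals = y - (b0 + b1 * x)
print(residuals.sum())           # ~0 up to floating-point error
```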
Ordinary least squares (OLS)

Solving this minimization problem gives the estimates

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
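A minimal sketch of these closed-form estimates, assuming NumPy; the helper name ols_fit and the toy data are invented:

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form OLS estimates for simple linear regression."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx = x - x.mean()
    b1 = (dx * (y - y.mean())).sum() / (dx ** 2).sum()   # slope
    b0 = y.mean() - b1 * x.mean()                        # intercept
    return b0, b1

x = [63, 64, 66, 69, 69, 71, 71, 72]           # toy heights
y = [127, 121, 142, 157, 162, 156, 169, 165]   # toy weights
b0, b1 = ols_fit(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
# np.polyfit(x, y, 1) returns the same slope and intercept.
```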
The estimated linear regression

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x
Simple linear regression
\hat{y} = -59.026 + 0.710 x
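Using this fitted line for prediction; a sketch where the value x = 160 is an assumption for illustration, since the slide does not give the underlying data or units:

```python
b0, b1 = -59.026, 0.710   # fitted intercept and slope from the slide

x_new = 160               # a hypothetical x value
y_hat = b0 + b1 * x_new
print(f"predicted y at x = {x_new}: {y_hat:.2f}")   # 54.57
```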
The coefficient of determination R^2

R^2 = SSreg / SStotal: the proportion of the total variation in y that is explained by the linear relationship with x.
Interpretation
Note
Example: Y: arm circumference; X: height

\hat{y} is the average arm circumference for a group of children all of the same height, x.