Correlation and Regression Analysis
Correlation Analysis
• The term “correlation” refers to a
measure of the strength of
association between two variables.
• Correlation analysis describes the relationship between two quantitative variables; on its own it does not allow causal relationships to be inferred.
• Correlation is a statistical
technique used to determine the
degree to which two variables
are related.
• If the two variables increase or decrease together,
they have a positive correlation.
• If increases in one variable are associated with decreases in the other, they have a negative correlation.
Visualizing Correlation
• A scatter plot (or scatter diagram) is used to show the
relationship between two variables.
• A linear relationship appears in a scatter plot as points clustered around a straight line.
Linear Correlation Only!
[Figure: scatter plot panels, each plotting Y against X]
Correlation Coefficient
\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} \]
where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
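The formula for r can be checked numerically. The sketch below is a minimal pure-Python version; the function name `pearson_r` and the data values are illustrative assumptions, not part of the original example.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of cross-products of deviations from the means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data chosen so the relationships are exactly linear
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]      # increases with x: positive correlation
ys_down = [10, 8, 6, 4, 2]    # decreases with x: negative correlation

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)
print(r_up, r_down)
```

The function divides by zero if either variable is constant, since r is undefined in that case.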
Statistical Inference for Correlation Coefficients
• Significance Test for Correlation
– Hypotheses
H0 : ρ = 0
H1 : ρ ≠ 0
                  Gestational Age   Birth Weight
Gestational Age   1
Birth Weight      0.818             1
There is a relatively strong linear relationship between gestational age at birth and birth weight (r = 0.818).
Using SPSS
[Figure: SPSS output, with r marked]
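The hypotheses H0: ρ = 0 versus H1: ρ ≠ 0 are commonly tested with the statistic t = r·sqrt(n - 2) / sqrt(1 - r²), which has n - 2 degrees of freedom under H0. The slides do not show this formula, so the sketch below is an assumption about the test used. It takes r = 0.818 from the correlation table and assumes n = 17, the number of observations reported in the regression output for the same example.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.818   # sample correlation from the table
n = 17      # assumption: same 17 observations as the regression output
t_stat = correlation_t_statistic(r, n)
df = n - 2

# Two-sided 5% critical value of the t distribution with 15 degrees of freedom
t_critical = 2.131
reject_h0 = abs(t_stat) > t_critical
print(round(t_stat, 2), df, reject_h0)
```

Under these assumptions the statistic is far above the critical value, which agrees with the description of the relationship as relatively strong.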
Spearman Rank Correlation Method
Introduced by Prof. Spearman in 1904.
R1   R2   d = R1 - R2   d²
1    2    -1            1
2    4    -2            4
3    1    2             4
4    5    -1            1
5    3    2             4
6    8    -2            4
7    7    0             0
8    6    2             4
Σd² = 22
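When there are no tied ranks, Σd² is usually converted to a coefficient with r_s = 1 - 6Σd² / (n(n² - 1)). That formula does not appear on the slide, so treat the sketch below as an illustration using the eight rank pairs from the table; the variable and function names are assumptions.

```python
def spearman_from_d2(sum_d2, n):
    """Spearman rank correlation from the sum of squared rank differences (no ties)."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# The two rank columns from the table
ranks_1 = [1, 2, 3, 4, 5, 6, 7, 8]
ranks_2 = [2, 4, 1, 5, 3, 8, 7, 6]

differences = [a - b for a, b in zip(ranks_1, ranks_2)]
sum_d2 = sum(d ** 2 for d in differences)
r_s = spearman_from_d2(sum_d2, len(ranks_1))
print(sum_d2, round(r_s, 3))
```

The rank differences sum to zero, which is a quick check that the d column has been filled in correctly.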
Rank Correlation
Ranks Not Given (Sales and Advertisement)
When values tie, each receives the average of the ranks they would otherwise occupy. For example, if three items tie at the 8th place, each is given the common rank (8 + 9 + 10) / 3 = 9, and the next item receives rank 11.
X    Y    Rank X   Rank Y   d      d²
15   40   7        3        4      16
20   30   5.5      5        0.5    0.25
28   50   4        2        2      4
12   30   8        5        3      9
40   20   3        7        -4     16
60   10   2        8        -6     36
20   30   5.5      5        0.5    0.25
80   60   1        1        0      0
Σd² = 81.5
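The average-rank rule for ties can be sketched in a few lines. The helper `average_ranks` is a hypothetical name; it ranks the largest value as 1, which matches the table, and gives tied values the mean of the positions they occupy.

```python
def average_ranks(values):
    """Rank values in descending order (largest = 1), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of values equal to the one at 'pos'
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions are 1-based, so the run covers ranks pos+1 .. end+1
        shared = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

# The two data columns from the table
x = [15, 20, 28, 12, 40, 60, 20, 80]
y = [40, 30, 50, 30, 20, 10, 30, 60]

rank_x = average_ranks(x)
rank_y = average_ranks(y)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
print(rank_x, rank_y, sum_d2)
```

The two values of 20 share rank 5.5 and the three values of 30 share rank 5, reproducing the rank columns and Σd² = 81.5.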
Cautions about Correlation
• Correlation is only a good
statistic to use if the relationship
is roughly linear.
• Correlation cannot be used to measure non-linear relationships.
• Always plot your data to make
sure that the relationship is
roughly linear!
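A small illustration of why plotting matters: in the sketch below y is completely determined by x, yet r is zero because the relationship is a symmetric curve rather than a line. The data and the helper name `pearson_r` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfect but non-linear (quadratic) relationship
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)
```

A scatter plot of these points would show the U shape immediately, while the coefficient alone would suggest no relationship.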
Regression Analysis
Regression analysis is used to:
– Predict the value of a dependent variable
based on the value of at least one
independent variable.
– Explain the impact of changes in an
independent variable on the dependent
variable.
Dependent variable: the variable we wish to
explain.
Independent variable: the variable used to explain the dependent variable.
Simple Linear Regression Model
\[ y = \beta_0 + \beta_1 x + \varepsilon \]
where:
y = Dependent variable
β0 = Population y-intercept
β1 = Population slope coefficient
x = Independent variable
ε = Random error term, or residual
Linear Regression Assumptions
• The assumption of linearity
– The relationship between the dependent and
independent variables is linear.
• The assumption of homoscedasticity
– The errors have the same variance
• The assumption of independence
– The errors are independent of each other
• The assumption of normality
– The errors are normally distributed
Estimated Regression Model
The sample regression line provides an
estimate of the population regression line
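\[ \hat{y}_i = b_0 + b_1 x_i \]
where:
ŷ = Estimated (or predicted) value of y
b0 = Estimate of the regression intercept
b1 = Estimate of the regression slope
x = Independent variable
b0 and b1 are the values that minimize the sum of squared residuals:
\[ \sum \left(y - (b_0 + b_1 x)\right)^2 \]
\[ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]
The slope and the correlation coefficient are related by
\[ r = b_1 \frac{s_x}{s_y} \]
where s_x is the standard deviation of X and s_y the standard deviation of Y.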
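The least-squares formulas for b1 and b0, and the link between the slope and r, can be sketched as follows. The data are hypothetical, and the names `fit_line` and `sample_sd` are assumptions made for this illustration.

```python
import math

def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

# Hypothetical data, not the birth-weight example
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = fit_line(xs, ys)
# Recover r from the slope and the two standard deviations
r_from_slope = b1 * sample_sd(xs) / sample_sd(ys)
print(round(b0, 2), round(b1, 2), round(r_from_slope, 4))
```

Because both standard deviations use the same divisor, the ratio s_x / s_y does not depend on whether n or n - 1 is used.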
Example
• Use the previous example, taking birth weight as the dependent variable and gestational age as the independent variable.
• Fit a linear-regression line relating birth weight to gestational age using these data.
• Predict the birth weight of a baby born to a woman with a gestational age of 40.5 weeks.
Using SPSS
Regression Statistics
Multiple R           0.818
R Square             0.668
Adjusted R Square    0.646
Standard Error       414.427
Observations         17
ANOVA
             df   SS        MS         F        Significance F
Regression   1    5191411   5191411    30.227   0.000
Residual     15   2576249   171749.9
Total        16   7767660
             Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept    -4020.05       1263.049         -3.183   0.006     -6712.18    -1327.93
Coefficient of
Determination, R2
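R² is the share of the total variation in the dependent variable that the regression explains, that is, the regression sum of squares divided by the total sum of squares. The sketch below recomputes it from the ANOVA figures of the simple regression example (regression SS 5191411, total SS 7767660, 17 observations, one predictor). The adjusted R² formula used here is the usual one and is an assumption, since the slides report only the value.

```python
# Sums of squares read from the ANOVA table of the simple regression example
ss_regression = 5191411
ss_total = 7767660
n = 17   # observations
k = 1    # number of independent variables

# Share of total variation explained by the regression
r_squared = ss_regression / ss_total

# Adjusted R-squared penalizes the number of predictors (assumed standard formula)
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(round(r_squared, 3), round(adjusted_r_squared, 3))
```

Both values agree with the reported R Square and Adjusted R Square after rounding to three decimals.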
[Figure: regression plots; recoverable axis labels: Gestational Age, Sample Percentile, Birth Weight]
Making Predictions
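Population model:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon \]
where β0 is the y-intercept, β1, ..., βk are the population slopes, and ε is the random error.
Estimated multiple regression model:
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k \]
where ŷ is the estimated (or predicted) value of y, b0 is the estimated intercept, and b1, ..., bk are the estimated slope coefficients.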
Example:
• Use the previous example, taking birth weight as the dependent variable and gestational age and maternal weight as the independent variables.
• Fit a linear-regression model relating birth weight to gestational age and maternal weight.
• Predict the birth weight of a baby born to a woman with a gestational age of 40.5 weeks and a maternal weight of 95 kg.
Example: Using Excel
Regression Statistics
Multiple R           0.93
R Square             0.86
Adjusted R Square    0.84
Standard Error       281.07
Observations         17
86% of the variation in birth weight is explained by variation in gestational age (in weeks) and maternal weight (in kg).
ANOVA
             df   SS           MS           F       Significance F
Regression   2    6661640.24   3330820.12   42.16   0.00
Residual     14   1106019.76   79001.41
Total        16   7767660.00
                  Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept         -4060.82       856.67           -4.74    0.00      -5898.21    -2223.44
Gestational Age   125.01         25.71            4.86     0.00      69.87       180.14
Maternal weight   29.96          6.95             4.31     0.00      15.07       44.86
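The requested prediction follows by substituting the two inputs into the estimated model with the coefficients from the output. The sketch below does that arithmetic; the variable names are illustrative, and the result is in whatever unit birth weight was recorded in, which the output does not state.

```python
# Coefficients read from the regression output
b0 = -4060.82            # intercept
b_gestational = 125.01   # slope for gestational age (weeks)
b_maternal = 29.96       # slope for maternal weight (kg)

# Inputs from the example
gestational_age = 40.5
maternal_weight = 95

# y-hat = b0 + b1 * x1 + b2 * x2
predicted_birth_weight = (
    b0 + b_gestational * gestational_age + b_maternal * maternal_weight
)
print(round(predicted_birth_weight, 2))
```

With these rounded coefficients the prediction is about 3848.29. Each extra week of gestation adds about 125 to the prediction and each extra kilogram of maternal weight about 30, holding the other variable fixed.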