The document provides an overview of correlation and regression analysis, explaining the concepts of correlation coefficients, significance testing, and the relationship between variables. It discusses methods for calculating correlation, including Pearson and Spearman, and outlines the principles of simple linear regression, including assumptions and interpretation of coefficients. Additionally, it emphasizes the importance of visualizing data and understanding the limitations of correlation in establishing causation.


Correlation and Regression Analysis
Correlation Analysis
• The term “correlation” refers to a measure of the strength of association between two variables.
• Correlation describes the relationship between two quantitative variables without allowing causal relationships to be inferred.
• Correlation is a statistical technique used to determine the degree to which two variables are related.
• If the two variables increase or decrease together, they have a positive correlation.
• If increases in one variable are associated with decreases in the other, they have a negative correlation.
Visualizing Correlation
• A scatter plot (or scatter diagram) is used to show the relationship between two variables.
• Linear relationships, i.e. straight-line associations, can be seen directly in a scatter plot.
Linear Correlation Only!

[Figure: example scatter plots of linear relationships (left column) and curvilinear relationships (right column); the correlation coefficient measures only the linear ones.]
Correlation Coefficient

• The population correlation coefficient ρ (rho) measures the strength of the association between the variables.
• The sample (Pearson) correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations.
Correlation Coefficient (Continued)
• r is a statistic that quantifies a relation between two variables.
• It can be either positive or negative.
• It falls between −1.00 and +1.00.
Correlation Coefficient (Continued)
• The magnitude of the value (not its sign) indicates the strength of the relation.
• The purpose is to measure the strength of a linear relationship between two variables.
• A correlation coefficient does not establish “causation” (i.e. that a change in X causes a change in Y).
Calculating the Correlation
Coefficient
• The sample (Pearson) correlation coefficient (r) is defined by:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}$$

where:
r = sample correlation coefficient
n = sample size
x = value of the independent variable
y = value of the dependent variable
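A minimal Python (NumPy) sketch of this formula; the function name pearson_r is just illustrative:

import numpy as np

def pearson_r(x, y):
    """Sample (Pearson) correlation coefficient, computed from the definition above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # deviations from the mean of x
    dy = y - y.mean()          # deviations from the mean of y
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

For any two equal-length arrays, np.corrcoef(x, y)[0, 1] should return the same value.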
Statistical Inference for
Correlation Coefficients
• Significance Test for Correlation
– Hypotheses:
  H0: ρ = 0 (no correlation)
  H1: ρ ≠ 0 (correlation exists)
• Test statistic (with n − 2 degrees of freedom):

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
Example
• A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.

Gestational Age (weeks)   Birth Weight (grams)
34.7                      1895
36.0                      2030
29.3                      1440
40.1                      2835
35.7                      3090
42.4                      3827
40.3                      3260
37.3                      2690
40.9                      3285
38.3                      2920
38.5                      3430
41.4                      3657
39.7                      3685
39.7                      3345
41.1                      3260
38.0                      2680
38.7                      2005
The scatter plot
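A scatter plot like this one can be reproduced with matplotlib; a sketch using the 17 (gestational age, birth weight) pairs from the table above:

import matplotlib.pyplot as plt

gestational_age = [34.7, 36, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9,
                   38.3, 38.5, 41.4, 39.7, 39.7, 41.1, 38, 38.7]
birth_weight = [1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285,
                2920, 3430, 3657, 3685, 3345, 3260, 2680, 2005]

plt.scatter(gestational_age, birth_weight)
plt.xlabel("Gestational age (weeks)")
plt.ylabel("Birth weight (grams)")
plt.title("Birth weight vs. gestational age")
plt.show()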
Using Excel

                   Gestational Age   Birth Weight
Gestational Age    1
Birth Weight       0.818             1

There is a relatively strong linear relationship between gestational age at birth and birth weight.
Using SPSS

[SPSS output: Pearson correlation r (≈ 0.818) for the test of H0: ρ = 0 against H1: ρ ≠ 0]
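Outside SPSS, the same correlation and significance test can be obtained with scipy; a sketch using the study data (the variable names are illustrative):

import numpy as np
from scipy import stats

age = np.array([34.7, 36, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9,
                38.3, 38.5, 41.4, 39.7, 39.7, 41.1, 38, 38.7])
weight = np.array([1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285,
                   2920, 3430, 3657, 3685, 3345, 3260, 2680, 2005])

r, p_value = stats.pearsonr(age, weight)          # Pearson r and two-sided p-value for H0: rho = 0
t = r * np.sqrt((len(age) - 2) / (1 - r**2))      # equivalent t statistic with n - 2 df

print(f"r = {r:.3f}, t = {t:.2f}, p = {p_value:.4f}")   # r should be about 0.82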
Spearman Rank Correlation Method
 Given by Prof. Spearman in 1904

 This method of correlation is used to determine the degree of correlation between two variables in the case of ordinal data.

 These variables can be assigned ranks, but their quantitative measurement is not possible.

 It is given by:

$$r_s = 1 - \frac{6\,\sum D^2}{N\,(N^2 - 1)}$$

 r_s = rank correlation coefficient
 D = difference between the two ranks (R1 − R2)
 N = number of pairs of observations
 As with r, −1 ≤ r_s ≤ 1
Spearman’
s Rank
Correlatio
n Method

When ranks When ranks When equal


are given are not or tied ranks
given exist
Ranks Given
Suppose we have the ranks of 8 students of B.Sc. in Statistics and Mathematics. On the basis of these ranks we would like to know to what extent the students' knowledge of Statistics and Mathematics is related.

Rank in Stats   Rank in Maths   Difference of Ranks (d)   Square of d (d²)
1               2               -1                        1
2               4               -2                        4
3               1                2                        4
4               5               -1                        1
5               3                2                        4
6               8               -2                        4
7               7                0                        0
8               6                2                        4

                                                          Σd² = 22
Rank Correlation
r_s = 1 − (6 × 22) / (8 × (8² − 1)) = 1 − 132/504 ≈ 0.74, indicating a fairly strong positive association between the two rankings.
Ranks Not Given (Sales and Advertisement.)

Since n = 8 and Σd² = 4:

ρ = 1 − (6 × 4) / (8 × (8² − 1))
ρ = 1 − 0.0476
ρ = 0.95
Another Example – Ranks Not Given
[Slide shows the raw values being converted to ranks; the resulting rank table is then used in the same formula.]
Tied or Repeated Ranks

 When there is more than one item having the same value, a different formula with a correction term is used:

$$r_s = 1 - \frac{6\left[\sum D^2 + \sum_i \dfrac{m_i^3 - m_i}{12}\right]}{N\,(N^2 - 1)}$$

 where m_i is the number of repetitions of the ith rank.
Thumb Rule For Tied Ranks
Assign a rank to each data value. It is customary to assign rank 1 to the largest value, rank 2 to the next largest, and so on.

 Note: If there are two or more observations with the same value, the mean of their ranks should be used.
Repeated ranks

 When two or more items have equal values (i.e., a tie) it is difficult to give ranks to them. In such cases the items are given the average of the ranks they would have received.

 For example, if two individuals are tied for 8th place, they are each given the rank (8 + 9)/2 = 8.5, and the next rank assigned is 10.

 If three individuals are tied for 8th place, they are each given the rank (8 + 9 + 10)/3 = 9, and the next rank assigned is 11.

 In this case the corrected formula above, which allows for more than one item having the same value, is used.
Tied ranks

 Compute the rank correlation coefficient for the following data on the marks obtained by 8 students in Commerce and Mathematics.
Tied Ranks Case: Example
Marks in Commerce   Marks in Maths   Rank R1   Rank R2   D      D²
15                  40               7         3          4     16
20                  30               5.5       5          0.5   0.25
28                  50               4         2          2     4
12                  30               8         5          3     9
40                  20               3         7         -4     16
60                  10               2         8         -6     36
20                  30               5.5       5          0.5   0.25
80                  60               1         1          0     0

                                                          ΣD² = 81.5
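As a check on this tied-ranks example, a sketch with scipy.stats.spearmanr, which assigns average ranks to ties automatically (it ranks from smallest to largest, but reversing both rankings leaves the coefficient unchanged):

from scipy import stats

commerce = [15, 20, 28, 12, 40, 60, 20, 80]
maths = [40, 30, 50, 30, 20, 10, 30, 60]

rho, p_value = stats.spearmanr(commerce, maths)
print(f"Spearman rho = {rho:.3f}")

# Tie correction by hand: Sum D^2 = 81.5, plus (2^3 - 2)/12 = 0.5 for the tied Commerce
# marks and (3^3 - 3)/12 = 2 for the tied Maths marks, so
# rho = 1 - 6 * (81.5 + 2.5) / (8 * (8^2 - 1)) = 0 for these data.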
Cautions about Correlation
• Correlation is only a good statistic to use if the relationship is roughly linear.
• Correlation cannot be used to measure non-linear relationships.
• Always plot your data to make sure that the relationship is roughly linear!
Regression Analysis
Regression analysis is used to:
– Predict the value of a dependent variable
based on the value of at least one
independent variable.
– Explain the impact of changes in an
independent variable on the dependent
variable.
Dependent variable: the variable we wish to
explain.
Independent variable: the variable used to
explain the
dependent variable.
Simple Linear Regression
Model

• Only one independent variable, X.
• The relationship between X and Y is described by a linear function.
• Changes in Y are assumed to be caused by changes in X.
The formula for a simple linear regression is:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where:
y = dependent variable
β0 = population y-intercept
β1 = population slope coefficient
x = independent variable
ε = random error term (residual)

β0 + β1x is the linear component; ε is the random error component.

 The regression coefficients β0 and β1 are unknown and have to be estimated from the observed data (sample).

[Figure: the population regression line y = β0 + β1x, with intercept β0 and slope β1. For an observed value of y at a given x_i, the vertical distance between the observed value and the predicted value on the line is the random error ε_i.]
Linear Regression Assumptions
• The assumption of linearity
– The relationship between the dependent and
independent variables is linear.
• The assumption of homoscedasticity
– The errors have the same variance
• The assumption of independence
– The errors are independent of each other
• The assumption of normality
– The errors are normally distributed
Estimated Regression Model
The sample regression line provides an
estimate of the population regression line

$$\hat{y}_i = b_0 + b_1 x$$

where:
ŷ_i = estimated (or predicted) value of y
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
x = independent variable

 The individual random error terms e_i have a mean of zero.
Least Squares Method
• b0 and b1 are called the regression coefficients and are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:

$$\sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2$$

• A residual e is defined as the difference between the observed y and the fitted ŷ, that is, e = y − ŷ. Residuals are interpreted as estimates of the random errors ε.
The Least Squares Equation
• The formulas for b1 and b0 are:

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \qquad b_0 = \bar{y} - b_1 \bar{x}$$

• b0 is the estimated average value of y when the value of x is zero.
• b1 is the estimated change in the average value of y as a result of a one-unit change in x.
• The coefficients b0 and b1 will usually be found using computer software, such as SPSS; a Python sketch is shown below.
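A minimal NumPy sketch of these formulas on the 17-infant data; it should reproduce the coefficients reported in the SPSS and Excel output later in this section:

import numpy as np

# Gestational age (weeks) and birth weight (grams) from the earlier example
x = np.array([34.7, 36, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9,
              38.3, 38.5, 41.4, 39.7, 39.7, 41.1, 38, 38.7])
y = np.array([1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285,
              2920, 3430, 3657, 3685, 3345, 3260, 2680, 2005])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)   # slope
b0 = y.mean() - b1 * x.mean()                                              # intercept

y_hat = b0 + b1 * x
r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)                # coefficient of determination

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, R^2 = {r2:.3f}")
print(f"Predicted birth weight at 40.5 weeks: {b0 + b1 * 40.5:.1f} g")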
Relationship between the Regression
Coefficient (b1 ) and the Correlation
Coefficient (r )
• What is the relationship between the
sample regression coefficient (b1) and the
sample correlation coefficient (r)?

$$r = b_1 \frac{s_x}{s_y}$$

where s_x is the standard deviation of X and s_y is the standard deviation of Y.
Example
• Use the previous example, assuming birth weight is the dependent variable and gestational age is the independent variable.
• Fit a linear regression line relating birth weight to gestational age using these data.
• Predict the birth weight of a baby from a woman with gestational age 40.5 weeks.
Using SPSS

[SPSS output: estimated coefficients b0 and b1]

birth weight = −4020.054 + 180.455 × (gestational age)
Using Excel

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.818
R Square              0.668
Adjusted R Square     0.646
Standard Error        414.427
Observations          17

ANOVA
             df    SS        MS        F       Significance F
Regression    1    5191411   5191411   30.227  0.000
Residual     15    2576249   171749.9
Total        16    7767660

             Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept    -4020.05       1263.049         -3.183   0.006     -6712.18    -1327.93
Coefficient of
Determination, R2

• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.

• The coefficient of determination is also called R-squared and is denoted R². For simple linear regression:

$$R^2 = r^2$$
Coefficient of
Determination, R2

• R² = explained variation / total variation.
• R² is expressed as a percentage and always lies between 0% and 100%:
  – 0% indicates that the model explains none of the variability of the response data around its mean.
  – 100% indicates that the model explains all the variability of the response data around its mean.
• In general, the higher the R-squared, the better the model fits your data.
Coefficient of
Determination, R2
Regression Statistics
Multiple R            0.818
R Square              0.668
Adjusted R Square     0.646
Standard Error        414.427
Observations          17

r² = 0.668: 66.8% of the variation in birth weight is explained by variation in gestational age in weeks.
F- test for Simple Linear Regression
• The criterion for goodness of fit is the ratio of the regression sum of squares to the residual sum of squares.
• A large ratio indicates a good fit, whereas a small ratio indicates a poor fit.
• In hypothesis-testing terms we want to test the hypothesis:
  H0: β1 = 0 vs. H1: β1 ≠ 0
ANOVA
             df    SS        MS        F       Significance F
Regression    1    5191411   5191411   30.227  0.000
Residual     15    2576249   171749.9
Total        16    7767660

 The P-value < 0.05. Therefore H0 is rejected, implying a significant linear relationship between birth weight and gestational age.
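The F statistic and its p-value can be recomputed from the sums of squares in this table; a short scipy sketch:

from scipy import stats

ss_regression, df_regression = 5191411, 1
ss_residual, df_residual = 2576249, 15

f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)   # ratio of mean squares
p_value = stats.f.sf(f_stat, df_regression, df_residual)                 # upper-tail probability

print(f"F = {f_stat:.2f}, p = {p_value:.5f}")   # F should be about 30.2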
Interpretation of b0

birth weight = −4020.054 + 180.455 × (gestational age)

• b0 is the estimated mean value of Y when the value of X is zero (if X = 0 is in the range of observed X values).
• Because a baby cannot have a gestational age of 0 weeks, b0 has no practical interpretation here.
Interpreting b1

birth weight = −4020.054 + 180.455 × (gestational age)

• b1 estimates the change in the mean value of Y as a result of a one-unit increase in X.
• Here, b1 = 180.455 tells us that mean birth weight increases by about 180.5 grams, on average, for each additional week of gestational age.
Checking the Regression
Assumptions
There are two strategies for checking the
regression assumptions:

1. Examining the degree to which the variables satisfy the criteria (e.g. normality and linearity) before the regression is computed, by plotting relationships and computing diagnostic statistics.

2. Studying plots of the residuals e_i = Y_i − Ŷ_i and computing diagnostic statistics after the regression has been computed.
Check the Linearity assumption:
• A scatter plot (or scatter diagram) of y against each predictor x should show a roughly straight-line relationship.

Check the Independence assumption:
• Error terms associated with individual observations should be independent of each other.
• Rule of thumb: random samples ensure independence.
• A scatterplot of residuals against predicted values should show no trends.

Check the Equal Variance assumption (Homoscedasticity):
• Variability of the error terms should be the same (constant) for all values of each predictor.
• Check 1: a scatterplot of residuals against the predicted values shows consistent spread.
• Check 2: a boxplot of y against each predictor x should show consistent spread.
Check Normality Assumption:

• Check the normality of the residuals and individual variables, and identify outliers, using a normal probability plot.
• Run normality tests. All or almost all of them should have P-value > 0.05.
• Plot a histogram of the residuals. A bell-shaped curve centered around zero should be displayed.
• Construct a normal probability plot (Q-Q plot) of the residuals, as in the sketch below.
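A sketch of these residual checks with matplotlib and scipy; the fitted values and residuals would come from a fitted model such as the earlier least-squares sketch:

import matplotlib.pyplot as plt
from scipy import stats

def residual_diagnostics(fitted, residuals):
    """Residual plot, histogram, and normal Q-Q plot for checking the assumptions."""
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))

    # Homoscedasticity / independence: no trend or funnel shape expected here
    axes[0].scatter(fitted, residuals)
    axes[0].axhline(0, linestyle="--")
    axes[0].set_xlabel("Fitted values")
    axes[0].set_ylabel("Residuals")

    # Normality: histogram should be roughly bell-shaped around zero
    axes[1].hist(residuals, bins=8)
    axes[1].set_xlabel("Residuals")

    # Normality: points should lie close to the straight reference line
    stats.probplot(residuals, plot=axes[2])

    plt.tight_layout()
    plt.show()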
Using Excel
[Figure: Excel diagnostic plots. Normal Probability Plot of Gestational Age against Sample Percentile, and Birth Weight Residual Plot showing the residuals plotted against birth weight.]
Making Predictions

 Predict the birth weight of a baby from a woman with gestational age 40.5 weeks.

birth weight = −4020.054 + 180.455 × (40.5) ≈ 3288.37 g
Multiple Regression
• In practice, there is often more than
one independent variable and we
would like to look at the relationship
between each of the independent
variables (X1,…, Xk) and the
dependent variable (Y) after taking
into account the remaining
independent variables.
• This type of problem is the subject
matter of multiple-regression analysis
The Multiple Regression Model
Idea: Examine the linear relationship between
1 dependent (y) & 2 or more independent variables (xi)

Population model (β0 = Y-intercept; β1, …, βk = population slopes; ε = random error):

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

Estimated multiple regression model (b0 = estimated intercept; b1, …, bk = estimated slope coefficients; ŷ = estimated or predicted value of y):

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
Example:
• Use the previous example, assuming birth weight is the dependent variable and gestational age and maternal weight are the independent variables.
• Fit a regression line relating birth weight to gestational age and maternal weight.
• Predict the birth weight of a baby from a woman with gestational age 40.5 weeks and maternal weight 95 kg.
Example: Using Excel

Regression Statistics
Multiple R            0.93
R Square              0.86
Adjusted R Square     0.84
Standard Error        281.07
Observations          17

86% of the variation in birth weight is explained by variation in gestational age in weeks and maternal weight in kg.

ANOVA
             df    SS           MS           F      Significance F
Regression    2    6661640.24   3330820.12   42.16  0.00
Residual     14    1106019.76   79001.41
Total        16    7767660.00

                  Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept         -4060.82       856.67           -4.74    0.00      -5898.21    -2223.44
Gestational Age    125.01        25.71             4.86    0.00        69.87       180.14
Maternal weight     29.96         6.95             4.31    0.00        15.07        44.86
