Correlation & Regression

Uploaded by

Saumya Landge

Correlation and Regression

Pratap L Jadhav
Assistant professor Statistics and Demography
Department of Community Medicine
Seth G. S. Medical College & KEM Hospital, Mumbai
Introduction: Correlation
➢Sometimes two continuous variables are measured in the same person,
such as height and weight, temperature and pulse rate, age and
weight, age and height, etc.
➢At other times the same character (variable) is measured in two
related groups, such as tallness in parents and tallness in children, or
intelligence quotient (IQ) in brothers and in their corresponding sisters
(siblings).
➢The relationship or association between two quantitatively measured
or continuous variables is called correlation.
➢Correlation is the statistical measure of the degree (strength)
of association between two variables.
➢By “association” we mean the tendency of the variables to move
together.
Introduction: Correlation
• If two variables X and Y are so related that movements (or
variation) in one tend to be accompanied by corresponding
movements (or variation) in the other, then X and Y are said to be
correlated.
• The movements (variation) may be in the same direction (i.e. either
both X and Y increase or both decrease, called directly
proportional) or in opposite directions (i.e. X increases while
Y decreases, called inversely proportional).
• Correlation is said to be positive or negative according as these
movements are in the same or in opposite directions.
• If Y is unaffected by any change in X, then X and Y are said to be
uncorrelated.
Introduction: Correlation
• Correlation may be linear or non-linear.
• If the amount of variation in X bears a constant ratio to the
corresponding amount of variation in Y, then the correlation
between X and Y is said to be linear. Otherwise it is non-linear.
• The degree or extent of the relationship between two variables
is measured by Karl Pearson’s coefficient of correlation, or
simply the correlation coefficient, denoted by “r”.
• The extent or degree of correlation varies between -1 and +1,
i.e. -1 ≤ r ≤ +1
Types of correlation
➢There are five types of correlation depending on its extent and
direction
1) Perfect positive correlation: the two variables X and Y are
directly proportional and fully correlated with each other (i.e. r =
+1). It exists only in hypothetical conditions. In nature there is not a
single example of perfect positive correlation, but some variables
approach it (e.g. height and weight up to a certain age).
2) Perfect negative correlation: the two variables X
and Y are inversely proportional to each other and r = -1. Like
perfect positive correlation, it exists only in hypothetical situations (e.g.
pressure applied and volume of gas).
3) Partial positive correlation: X and Y move in the
same direction but not perfectly; the value of r lies
between 0 and 1, i.e. 0 < r < +1 (e.g. weight and cholesterol level).
Types of correlation
4) Partial negative correlation: the two variables
move in opposite directions to some extent, i.e. the value of r lies
between -1 and 0 ( -1 < r < 0) (e.g. number of men working and
duration to complete a task).
5) Absolutely no correlation: there is no movement in a specific
direction; the movements are haphazard, indicating that no linear
relationship exists between the two variables, i.e. r = 0
➢Correlation between two variables may be determined by any of the
following methods
1) Scatter diagram
2) Covariance method or Karl Pearson method
3) Rank method / Spearman (beyond scope)
Scatter Diagram
• The existence of correlation can be shown graphically by means of a
scatter diagram. Statistical data relating to the simultaneous movement
(variation) of two variables can be represented by dots.
• One of the two variables, X (independent variable), is taken
along the horizontal axis and the other variable (dependent variable), say Y,
is taken along the vertical axis.
• All the pairs of X and Y are shown by dots on the graph. This
diagrammatic presentation of bivariate data is known as a scatter
diagram.
[Figure: Scatter diagram showing correlation between height and weight. Horizontal axis: Height (cm), 155 to 180; vertical axis: Weight (kg), 40 to 80.]
Scatter Diagram interpretation
➢If all the dots lie on a straight line and no scatter or
deviation is observed around this line, there exists perfect correlation.
➢If little scatter or deviation is observed around the line, there
exists moderate correlation.
➢If wider scatter or deviation is observed around the line, there exists
weak correlation.
➢If the dots scatter haphazardly without any pattern and we can draw
multiple trend lines, no correlation exists.
➢If the line makes an acute angle (< 90°) with the positive x axis, there exists
positive correlation, and if the line makes an obtuse angle (> 90°) with the
positive x axis, there exists negative correlation between the two variables.
Correlation coefficient by covariance method
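• When the associated variables are normally distributed, such as height and
weight, the correlation coefficient “r” is calculated by the covariance method.
• If (x1, y1), (x2, y2), ……. (xn, yn) are n pairs of observations on two variables X
and Y, then the covariance of X and Y, i.e. Cov(X, Y), is given by
$$ \mathrm{Cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n} $$
• and the correlation coefficient “r” between X and Y is calculated by the following
formula
$$ r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n \, \sigma_X \sigma_Y} \qquad (1) $$
or
$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} \qquad (2) $$
or
$$ r = \frac{\sum XY - \dfrac{\sum X \sum Y}{n}}{\sqrt{\left(\sum X^2 - \dfrac{(\sum X)^2}{n}\right)\left(\sum Y^2 - \dfrac{(\sum Y)^2}{n}\right)}} \qquad (3) $$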
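The three formulas are algebraically equivalent, which can be checked numerically. The following is a minimal Python sketch, not part of the original notes; the function name `pearson_r` and the height and weight values are hypothetical and chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Return Pearson's r computed by formulas (1), (2) and (3)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squares and cross-products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)

    # Formula (1): covariance divided by the product of standard deviations
    cov = sxy / n
    sd_x = math.sqrt(sxx / n)
    sd_y = math.sqrt(syy / n)
    r1 = cov / (sd_x * sd_y)

    # Formula (2): deviations from the means, without dividing by n
    r2 = sxy / math.sqrt(sxx * syy)

    # Formula (3): raw sums, no means required
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    r3 = (sum_xy - sum_x * sum_y / n) / math.sqrt(
        (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
    )
    return r1, r2, r3

# Hypothetical illustrative data (not taken from the notes)
heights = [155, 160, 165, 170, 175, 180]
weights = [48, 55, 58, 66, 70, 77]
r1, r2, r3 = pearson_r(heights, weights)
print(round(r1, 4), round(r2, 4), round(r3, 4))
```

All three values agree up to floating-point rounding; formula (3) is usually the most convenient for hand calculation because it avoids computing the means first.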
Interpretation of “r”
• The correlation coefficient is also subject to sampling variation and
hence the observed value of “r” is to be tested for its significance.
• The test of significance for “r” is given by
$$ t_{N-2} = r \sqrt{\frac{N-2}{1-r^2}} $$
• Where r = correlation coefficient, N = number of paired observations
• Null hypothesis: no correlation exists between the two
variables X and Y (i.e. r = 0)
• Compare this test statistic with the t table value at (N − 2) degrees of freedom
and 5 % level of significance.
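As a rough illustration, the test statistic can be computed as below. This is a sketch only: the function name `t_statistic_for_r` is hypothetical, the inputs r = 0.9 and N = 10 are the values used in Exercise 14.2 of these notes, and the critical value 2.306 is the standard two-sided 5 % t table entry for 8 degrees of freedom.

```python
import math

def t_statistic_for_r(r, n_pairs):
    """t statistic with (n_pairs - 2) degrees of freedom for testing r = 0."""
    if n_pairs < 3:
        raise ValueError("at least 3 paired observations are needed")
    if abs(r) >= 1:
        raise ValueError("the statistic is undefined for r = +1 or -1")
    return r * math.sqrt((n_pairs - 2) / (1 - r ** 2))

# Values from Exercise 14.2: r = 0.9, N = 10
t_value = t_statistic_for_r(0.9, 10)

# Two-sided 5 % critical value of t at 8 degrees of freedom (from a t table)
T_CRITICAL_8_DF = 2.306
significant = abs(t_value) > T_CRITICAL_8_DF

print(round(t_value, 2), significant)
```

The critical value is hard-coded here to mirror the table lookup described above; for other sample sizes it has to be read from the t table at the matching degrees of freedom.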
Interpretation of “r”
• The crude interpretation of the correlation coefficient “r” is as follows.
• The correlation coefficient “r” behaves like an ordinal scale: the larger the
absolute value of r, the stronger the correlation, and the smaller the absolute
value of r (r closer to 0), the weaker the correlation.
• r = ±1 Perfect positive/negative correlation
• r = ±0.90 to ±0.99 Very strong positive/negative correlation
• r = ±0.75 to ±0.90 Strong positive/negative correlation
• r = ±0.60 to ±0.75 Moderate positive/negative correlation
• r = ±0.30 to ±0.60 Weak positive/negative correlation
• r = 0.00 to ±0.30 Very weak (by chance) positive/negative correlation
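The crude scale above can be written as a small lookup. This is a sketch under one stated assumption: the listed ranges share their end points (for example 0.90 appears in two bands), so a value that falls exactly on a boundary is assigned here to the stronger band. The function name `describe_correlation` is hypothetical.

```python
def describe_correlation(r):
    """Label r using the crude scale; boundary values go to the stronger band (assumption)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no linear correlation"

    strength = abs(r)
    if strength == 1.0:
        label = "perfect"
    elif strength >= 0.90:
        label = "very strong"
    elif strength >= 0.75:
        label = "strong"
    elif strength >= 0.60:
        label = "moderate"
    elif strength >= 0.30:
        label = "weak"
    else:
        label = "very weak"

    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction} correlation"

for value in (1.0, 0.8, -0.5, 0.1):
    print(value, "->", describe_correlation(value))
```

Such labels are only a rough guide; whether an observed r differs from zero still has to be judged with the test of significance.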
Regression
➢A good measure of the relationship between two variables is given by the
correlation coefficient, which tells us about the strength of the relationship and
its direction as well.
➢We also discussed that the correlation coefficient measures only the linear
relationship between two variables.
➢After determining the correlation between two variables, we wish to
determine a mathematical relationship between them so that we can (1)
predict the value of one variable from the value of the other variable and (2)
explain the impact of a change in the value of the independent variable on the
value of the dependent variable.
➢Fitting a mathematical function between two correlated variables using
paired observations on them is studied in regression analysis.
Regression: simple linear regression model
• Out of the two variables, one is considered the dependent variable and the
other the independent variable. Using a regression model, we study the
changes in the value of the dependent variable that result from a
change in the independent variable.
• In other words, we regress the dependent variable on the independent
variable. Therefore the dependent variable is also called the regressed
variable or study variable, while the independent variable is called the regressor
variable or explanatory variable.
• Simple linear regression model: a linear regression model between a
single dependent (study) variable and a single independent
(explanatory) variable is termed simple linear regression.
• Let us denote the dependent (study) variable by Y and the independent
(explanatory) variable by X. We collect paired observations (x1, y1), (x2,
y2), ….., (xn, yn) on the variables X and Y.
Simple Linear regression model
➢The simplest relationship between X and Y is a linear relationship,
which is given by: Y = a + bX or X = a′ + b′Y
➢Fitting a model means obtaining the value of the intercept term of
the line (a) and the slope of the line (b) on the basis of the collected
observations on X and Y. Different values of a and b will give different
lines.
➢The first equation is used to obtain the most suitable or best average value of Y
for a given value of X; the second equation is used for
estimating the value of X for a given value of Y.
➢Therefore the two lines of regression are
$$ Y - \bar{Y} = b_{yx}\,(X - \bar{X}) \qquad (1) \quad \text{regression of Y on X} $$
$$ X - \bar{X} = b_{xy}\,(Y - \bar{Y}) \qquad (2) \quad \text{regression of X on Y} $$
Where $\bar{X}$ and $\bar{Y}$ are the means of the X and Y series respectively, and
$b_{yx}$ and $b_{xy}$ are the regression coefficients.
Simple Linear regression model
• The estimate of Y for a given X, using the linear regression equation of Y on
X, is Y = a + bX or $Y - \bar{Y} = b_{yx}\,(X - \bar{X})$
• The values of a and b are calculated by the least squares method by solving the
two equations
$$ \sum y = n a + b \sum x \qquad (1) $$
$$ \sum xy = a \sum x + b \sum x^2 \qquad (2) $$
• The constant b or $b_{yx}$ is called the regression coefficient of Y on X. In
positive correlation b > 0 and in negative correlation b < 0.
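Solving the two normal equations for a and b can be sketched in a few lines of Python. The function name `fit_line` and the toy data are assumptions made for illustration; the points were chosen to lie exactly on Y = 1 + 2X so that the result is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Solve the two least squares normal equations for intercept a and slope b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Eliminating a from equations (1) and (2) gives b; a then follows from (1)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal, so the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denominator
    a = (sum_y - b * sum_x) / n
    return a, b

# Toy data lying exactly on Y = 1 + 2X (hypothetical, for checking only)
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)
```

With real, scattered data the same two equations give the line that minimises the sum of squared vertical deviations of the points from the line.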
Regression coefficient 𝒃𝒚𝒙 and 𝒃𝒙𝒚
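Regression coefficient of Y on X is denoted by $b_{yx}$ and given by the following formulae
➢If the value of the correlation coefficient (r) and the values of the standard deviations of x and y are known
$$ b_{yx} = r \times \frac{\sigma_y}{\sigma_x} \qquad (1) $$
➢If the means are already calculated
$$ b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} \qquad (2) $$
➢If the means are not calculated, then it is calculated directly by
$$ b_{yx} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} \qquad (3) $$
Regression coefficient of X on Y is denoted by $b_{xy}$ and given by the following formulae
➢If the value of the correlation coefficient (r) and the values of the standard deviations of x and y are known
$$ b_{xy} = r \times \frac{\sigma_x}{\sigma_y} \qquad (1) $$
➢If the means are already calculated
$$ b_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (Y - \bar{Y})^2} \qquad (2) $$
➢If the means are not calculated, then it is calculated directly by
$$ b_{xy} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum y^2 - \dfrac{(\sum y)^2}{n}} \qquad (3) $$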
Difference between correlation and regression
• The correlation coefficient shows the degree or strength of the relationship
between two variables X and Y, whereas regression enables us to
predict the value of one variable (say Y) on the basis of the other variable
(X).
• Thereby the relationship between the two variables is
understood more precisely: (i) when there is a perfect correlation (i.e. r = ±1),
the two regression lines Y = a + bX and X = a′ + b′Y coincide.
• (ii) When the correlation (r) = 0, the variables are linearly uncorrelated and the two
regression lines intersect at 90°.
• (iii) The closer the correlation coefficient is to ±1, the better the regression
is for prediction.
Relation between (r ) , Regression coefficient 𝒃𝒚𝒙 and 𝒃𝒙𝒚
• The correlation coefficient (r) and the regression coefficients $b_{yx}$ and $b_{xy}$
are related as follows
$$ r^2 = b_{yx} \times b_{xy} \quad \text{and} \quad r^2 \le 1 $$
$$ r = \pm \sqrt{b_{yx} \times b_{xy}} $$
➢r is positive when $b_{yx}$ is positive and r is negative when $b_{yx}$ is
negative
➢r, $b_{yx}$ and $b_{xy}$ always have the same sign.
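This relation follows directly from the standard-deviation form of the two regression coefficients given earlier. As a short worked step:

$$ b_{yx} \times b_{xy} = \left(r \, \frac{\sigma_y}{\sigma_x}\right) \times \left(r \, \frac{\sigma_x}{\sigma_y}\right) = r^2 $$

Because $\sigma_x$ and $\sigma_y$ are always positive, both coefficients carry the sign of r, which is why the sign of the square root is taken from the sign of the regression coefficients. For instance, with the values of Exercise 14.3 in these notes (r = 0.8, $\sigma_x$ = 8, $\sigma_y$ = 10):

$$ b_{yx} = 0.8 \times \frac{10}{8} = 1, \qquad b_{xy} = 0.8 \times \frac{8}{10} = 0.64, \qquad b_{yx} \times b_{xy} = 0.64 = 0.8^2 $$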
Exercise no 14.2 page no- 88
➢Using the usual notation, given
N = 10, $\sum (x - \bar{x})(y - \bar{y}) = 72$, $\sum (x - \bar{x})^2 = 64$, $\sum (y - \bar{y})^2 = 100$
Find the correlation coefficient (r) and its significance.
H0: The correlation coefficient (r) between two variables X and Y = 0
H1: The correlation coefficient (r) between two variables X and Y ≠ 0
• The correlation coefficient (r) is calculated using the formula
$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} = \frac{72}{\sqrt{64 \times 100}} = \frac{72}{80} = 0.9 $$
(There exists a strong positive correlation.)
• The test of significance for “r” is given by
$$ t_{N-2} = r \sqrt{\frac{N-2}{1-r^2}} = 0.9 \times \sqrt{\frac{8}{1 - 0.81}} = 5.84 $$
• T tabulated at 8 degrees of freedom and 5 % level of significance = 2.31
• Since T calculated > T tabulated, reject the null hypothesis and accept the alternative
hypothesis: there is a significant correlation between X and Y.
Exercise no 14.3 Page no:
➢The correlation coefficient between age (x) in years and systolic blood pressure (y) in mm
of Hg is 0.8. Mean age is 50 years. Mean systolic blood pressure is 130 mm of Hg.
Standard deviation of age is 8 yrs., standard deviation of blood pressure is 10
mm of Hg. Find the regression equation of Y on X and estimate the systolic blood pressure
for a person whose age is 55 yrs.
➢Given r = 0.8, $\bar{x} = 50$, $\bar{y} = 130$, $\sigma_x = 8$, $\sigma_y = 10$; we have to calculate $b_{yx}$ and estimate the
value of Y for x = 55.
➢
$$ b_{yx} = r \times \frac{\sigma_y}{\sigma_x} = 0.8 \times \frac{10}{8} = 1 \qquad (1) $$
• The estimate of Y for a given X, using the linear regression equation of Y on X, is
$Y - \bar{Y} = b_{yx}\,(X - \bar{X})$, i.e. (Y − 130) = 1 × (X − 50), or Y = X + 80
For X = 55: (Y − 130) = 1 × (55 − 50)
Y = 135
So for a person whose age is 55, the estimated value of systolic blood
pressure = 135 mm of Hg.
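The arithmetic of this exercise can be reproduced with a short Python sketch. The function name `regression_of_y_on_x` is hypothetical; the inputs are the values given in the exercise.

```python
def regression_of_y_on_x(r, mean_x, mean_y, sd_x, sd_y):
    """Return intercept a and slope b_yx of the line Y = a + b_yx * X."""
    b_yx = r * sd_y / sd_x          # regression coefficient of Y on X
    a = mean_y - b_yx * mean_x      # the line passes through (mean_x, mean_y)
    return a, b_yx

# Values given in Exercise 14.3
a, b_yx = regression_of_y_on_x(r=0.8, mean_x=50, mean_y=130, sd_x=8, sd_y=10)

# Estimated systolic blood pressure (mm of Hg) at age 55
predicted_sbp = a + b_yx * 55
print(a, b_yx, predicted_sbp)
```

Writing the line in intercept form (Y = 80 + 1 × X) makes it easy to estimate the blood pressure for any other age covered by the data.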
Chapter no 15 & 16
Pratap L Jadhav
Assistant Professor Statistics & Demography
Department of Community Medicine
Seth G.S. Medical College & K.E.M. Hospital
Corrections in the formulae in chapter 15
• Corrected formulae for Chapter no 15 (page no 92)
1) Total Fertility Rate (TFR)
$$ \mathrm{TFR} = \frac{\sum_{w_1}^{w_2} \mathrm{ASFR}}{1000} \times h $$
Where h = length of class interval, ASFR = Age Specific Fertility Rate
2) Gross Reproduction Rate (GRR)
$$ \mathrm{GRR} = \frac{\sum_{w_1}^{w_2} (\mathrm{ASFR\ for\ female\ live\ births})}{1000} \times h $$
Where h = length of class interval, ASFR = Age Specific Fertility Rate
Additional formula for example 15.2
3) Standardized Death Rate (SDR)
$$ \mathrm{SDR} = \frac{\sum P_s \times D_x}{\sum P_s} $$
Where Ps = population of the standard locality, Dx = death rate of the comparable population
4) SDR of Standard Population = Crude Death Rate of Standard Population
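Formulae 1) and 2) have the same shape: a sum of age specific rates per 1000, divided by 1000 and multiplied by the class width h. The sketch below assumes that all age groups have the same width; the function name `rate_per_woman` and the ASFR values are hypothetical and are not taken from the textbook exercises.

```python
def rate_per_woman(asfr_per_1000, interval_years):
    """Sum of age specific rates (per 1000 women) x class width / 1000."""
    return sum(asfr_per_1000) * interval_years / 1000

# Hypothetical ASFRs (all live births per 1000 women) for seven 5-year age groups
asfr_all_births = [40, 180, 160, 90, 40, 15, 5]
tfr = rate_per_woman(asfr_all_births, interval_years=5)

# Hypothetical ASFRs counting female live births only, same age groups
asfr_female_births = [20, 88, 78, 44, 19, 7, 2]
grr = rate_per_woman(asfr_female_births, interval_years=5)

print(tfr, "births per female")
print(grr, "female births per female")
```

The only difference between the two indicators is which births enter the age specific rates, so GRR can never exceed TFR for the same population.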
Example 15.2 Page no 95
• In the following table the populations of localities A and B in different age
groups, together with the age specific death rates, are given.
• Taking locality A as the standard population, find the standardized death rate
of localities A and B separately and hence find which of the two
localities A and B is healthier. Also find the crude death rate of locality B.

Age group       Locality A                    Locality B
                Population    Deaths/1000     Population    Deaths/1000
                (Ps)          (Ds)            (Px)          (Dx)
0-10            600           30              400           40
10-20           1000          05              1500          04
20-60           3000          08              2400          10
60 and above    400           50              700           30
Example 15.2 Page no 95
• Crude death rate
$$ \mathrm{CDR} = \frac{\text{No. of deaths during the year}}{\text{Mid-year population}} \times 1000 $$

Age group     Locality A                     Locality B                     Ps × Dx
              Ps      Ds    No of deaths     Px      Dx    No of deaths
0-10          600     30    18               400     40    16               24000
10-20         1000    05    05               1500    04    06               4000
20-60         3000    08    24               2400    10    24               30000
> 60 years    400     50    20               700     30    21               12000
Total         5000          67               5000          67               70000

$$ \mathrm{CDR}(A) = \frac{67}{5000} \times 1000 = 13.4 \text{ per 1000 population} $$
$$ \mathrm{CDR}(B) = \frac{67}{5000} \times 1000 = 13.4 \text{ per 1000 population} $$
SDR (A) = CDR (A) = 13.4/1000 population
$$ \mathrm{SDR}(B) = \frac{\sum P_s \times D_x}{\sum P_s} = \frac{70000}{5000} = 14 \text{ per 1000 population} $$
• Since SDR (A) < SDR (B), population A is healthier compared to population B.
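The table calculations can be checked with a short Python sketch. The data are those of Example 15.2; the function names `crude_death_rate` and `standardized_death_rate` are hypothetical.

```python
# Data of Example 15.2: populations and death rates per 1000 by age group
pop_a = [600, 1000, 3000, 400]    # Ps, locality A (standard population)
rate_a = [30, 5, 8, 50]           # Ds
pop_b = [400, 1500, 2400, 700]    # Px, locality B
rate_b = [40, 4, 10, 30]          # Dx

def crude_death_rate(populations, rates_per_1000):
    """Total deaths divided by total population, per 1000."""
    deaths = sum(p * d / 1000 for p, d in zip(populations, rates_per_1000))
    return deaths / sum(populations) * 1000

def standardized_death_rate(standard_populations, rates_per_1000):
    """Death rates of a locality weighted by the standard population (per 1000)."""
    weighted = sum(ps * dx for ps, dx in zip(standard_populations, rates_per_1000))
    return weighted / sum(standard_populations)

cdr_a = crude_death_rate(pop_a, rate_a)
cdr_b = crude_death_rate(pop_b, rate_b)
sdr_a = standardized_death_rate(pop_a, rate_a)   # equals CDR of the standard population
sdr_b = standardized_death_rate(pop_a, rate_b)

print(round(cdr_a, 1), round(cdr_b, 1), round(sdr_a, 1), round(sdr_b, 1))
```

The two crude rates are identical, so only the standardized rates, which remove the effect of the different age structures, reveal the difference between the localities.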
Unsolved exercises
➢From Exercise no 15.3 onwards, use the appropriate formula for
the given problem and give the answer with the appropriate unit.
➢The unit of the indicator should be the unit of the quantity which is in the
denominator.
➢For Total Fertility Rate (TFR) the unit is ____ births/female (it can be
described as the child bearing capacity of the female during her reproductive
age group).
➢For Gross Reproduction Rate (GRR) the unit is ____ female births/female
(described as the number of female births per female during her reproductive age
group).
➢In Chapter no 16, all the statements are statistical fallacies and we have to
disagree with them, giving logical reasoning and using
appropriate statistical measures.
