Correlation and Regression
$$r = \frac{\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}}{\sqrt{\left\{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}\right\}\left\{\sum y_i^2 - \dfrac{(\sum y_i)^2}{n}\right\}}}$$

since

$$\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n},$$

and similarly for y; also

$$\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y} = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n},$$

so that

$$r = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum x_i^2 - n\bar{x}^2\right)\left(\sum y_i^2 - n\bar{y}^2\right)}}.$$
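As a quick numerical check (a minimal Python sketch; the data values are made up purely for illustration), the sum-based formula above can be computed directly and compared with NumPy's built-in correlation:

```python
import numpy as np

# Illustrative data (hypothetical values, for demonstration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
n = len(x)

# r = (Σx_i y_i − Σx_i Σy_i / n) /
#     sqrt{(Σx_i² − (Σx_i)²/n)(Σy_i² − (Σy_i)²/n)}
num = (x * y).sum() - x.sum() * y.sum() / n
den = np.sqrt((x**2).sum() - x.sum()**2 / n) * \
      np.sqrt((y**2).sum() - y.sum()**2 / n)
r = num / den

print(r)                        # sum-based formula
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in r, agrees with the above
```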
Note: 1. The covariance between the standardized x and y data is called the correlation between x and y.
2. It may be noted that cor(x, y), or r(x, y), measures only the linear relationship between x and y.
3. Karl Pearson's correlation coefficient is also called the "product-moment correlation coefficient", since
Cov(x, y) = E[{x − E(x)}{y − E(y)}] = μ₁₁.
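Note 1 can be verified numerically: standardizing each variable and taking the covariance of the standardized data reproduces Pearson's r. A minimal sketch with made-up data:

```python
import numpy as np

# Illustrative data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Standardize: subtract the mean, divide by the (population) standard deviation
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# Covariance of the standardized data equals the correlation coefficient
print(np.mean(zx * zy))         # covariance of standardized x, y
print(np.corrcoef(x, y)[0, 1])  # Pearson's r, same value
```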
Scatter diagram: This is the simplest diagrammatic representation of bivariate data. Thus, for the bivariate distribution (xᵢ, yᵢ), i = 1, 2, …, n, if the values of the variables x and y are plotted along the x-axis and y-axis respectively in the x-y plane, the diagram so obtained is known as a "scatter diagram". From the scatter diagram we can form a fairly good, though vague, idea of whether the variables are correlated or not; e.g., if the points are very dense, i.e., very close to each other, we should expect a fairly good amount of correlation between the variables, whereas if the points are widely scattered, poor correlation is expected. This method, however, is not suitable if the number of observations is fairly large.
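A sketch of how such diagrams can be drawn (assuming matplotlib is available; both samples are simulated purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Two hypothetical samples: one strongly correlated, one nearly uncorrelated
x = rng.normal(size=100)
y_dense = 2 * x + rng.normal(scale=0.3, size=100)  # points cluster near a line
y_scattered = rng.normal(size=100)                 # points show no pattern

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(x, y_dense)
ax1.set_title("Dense points: strong correlation")
ax2.scatter(x, y_scattered)
ax2.set_title("Scattered points: poor correlation")
plt.show()
```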
Scatter diagram and correlation coefficient:
$$\mathrm{Cor}(x, y) = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\frac{1}{n}\sum (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y}$$
Regression Analysis: In correlation analysis we discussed the degree of relationship without considering which variable is the cause and which is the effect. In regression analysis there are two types of variables. The variable whose value is influenced or is to be predicted is called the dependent variable, and the variable which influences the values or is used for prediction is called the independent variable.
In regression analysis the independent variable is also known as the regressor, predictor or explanatory variable, while the dependent variable is also known as the regressed or explained variable. In regression analysis we find an algebraic function of the form y = f(x), i.e., we express the dependent variable as a function of the independent variable. Thus regression analysis makes it possible to estimate or predict unknown values of the dependent variable for known values of the independent variable.
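As an illustration of this idea, a minimal least-squares sketch (with made-up data) that expresses y as a function of x and predicts y for a known value of x:

```python
import numpy as np

# Hypothetical sample: x is the independent (predictor) variable,
# y is the dependent (explained) variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Fit y = b0 + b1*x by least squares
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

# Predict y for a new, known value of x
x_new = 6.0
print(b0 + b1 * x_new)
```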
The term regression literally means "stepping back towards the average". It was first used by the British biometrician Sir Francis Galton (1822-1911), a cousin of Charles Darwin, in connection with the inheritance of stature.
Types of regression models: There are two basic types of regression: simple regression, which involves a single independent variable, and multiple regression, which involves two or more independent variables. Here we consider the simple linear model.
Simplifying assumptions:
i) Linearity: The random error term εᵢ has a mean equal to zero for each x. When the mean value of εᵢ is zero, the mean value of y for a given x is
E(yᵢ) = β₀ + β₁xᵢ.
ii) Homoscedasticity: This equal-variance assumption means that in the underlying population, the variance of the variable yᵢ, denoted by σ², is the same at each X = xᵢ. Equivalently, the variance of εᵢ is σ² at each X = xᵢ.
iii) Independence: The error terms εᵢ are statistically independent. We call yᵢ = β₀ + β₁xᵢ + εᵢ the underlying population model and yᵢ = b₀ + b₁xᵢ + eᵢ the estimated model.
iv) Normality: This assumption specifies that the distribution of the εᵢ values should be normal (a simulation sketch follows this list).
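The four assumptions can be made concrete by simulating data from the population model (the parameter values below are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population parameters (chosen arbitrarily for illustration)
beta0, beta1, sigma = 1.0, 2.0, 0.5
x = np.linspace(0, 10, 200)

# Errors are independent N(0, sigma^2) at every x: this encodes linearity
# (mean zero), homoscedasticity (constant sigma), independence, and normality
eps = rng.normal(loc=0.0, scale=sigma, size=x.size)
y = beta0 + beta1 * x + eps

# The mean of y at a given x is E(y_i) = beta0 + beta1 * x_i
print(y.mean() - (beta0 + beta1 * x).mean())  # close to zero
```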
Coefficient of determination:
We may ask, "How well does the independent variable explain the dependent variable in the regression model?" The coefficient of determination is one concept that answers this question. The coefficient of determination is a way to measure the contribution of the independent variable in predicting the dependent variable. It is denoted by the symbol r². Its value lies between zero and one (0 ≤ r² ≤ 1). If x contributes information for predicting y, r² will be greater than zero; when x contributes no information for predicting y, r² will be near zero. In regression with a single independent variable, r² is the same as the square of the correlation between the dependent and independent variables.
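A minimal sketch (made-up data) computing r² in one standard way, as the explained proportion of variation 1 − SSE/SST, and confirming that it equals the squared correlation in simple regression:

```python
import numpy as np

# Hypothetical sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Least-squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# r^2 = 1 - SSE/SST: the proportion of variation in y explained by x
sse = np.sum((y - y_hat)**2)
sst = np.sum((y - y.mean())**2)
r_squared = 1 - sse / sst

print(r_squared)
print(np.corrcoef(x, y)[0, 1]**2)  # equals r^2 in simple regression
```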
PROPERTIES:
(1) The correlation coefficient between two variables x and y is the geometric mean of the two regression coefficients b_xy and b_yx. This is known as the fundamental property of regression coefficients, i.e., ρ² = β_yx · β_xy
⇒ ρ = √(β_yx · β_xy)
For a sample,
r = √(b_xy · b_yx)
(a numerical check is given in the sketch after this list).
(2) The signs of the regression coefficients and the correlation coefficient are always the same. This is known as the signature property of regression coefficients.
(3) If β_yx > 1, then β_xy < 1, since their product ρ² cannot exceed one.
(4) If the variables x and y are independent, the regression coefficients are zero. This is known as the independence property of regression coefficients.
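Properties (1) and (2) can be checked numerically (a minimal sketch with made-up data):

```python
import numpy as np

# Hypothetical sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

sxy = np.sum((x - x.mean()) * (y - y.mean()))
b_yx = sxy / np.sum((x - x.mean())**2)  # regression coefficient of y on x
b_xy = sxy / np.sum((y - y.mean())**2)  # regression coefficient of x on y

r = np.corrcoef(x, y)[0, 1]
print(r**2, b_yx * b_xy)  # fundamental property: r^2 = b_yx * b_xy
print(np.sign(r) == np.sign(b_yx) == np.sign(b_xy))  # signature property
```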
Notes: (1) Multiple regression suffers from multicollinearity, autocorrelation and heteroscedasticity.
(2) Linear regression is very sensitive to outliers. They may terribly affect the regression line and, eventually, the forecasted values.
(3) Multicollinearity exists when two or more of the predictors in the regression model are moderately or highly correlated.
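A sketch of how multicollinearity can be detected by inspecting the pairwise correlations among the predictors (the data are simulated so that two predictors are nearly identical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical predictors: x2 is built to be highly correlated with x1
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # near-duplicate of x1
x3 = rng.normal(size=100)                  # unrelated predictor

# Pairwise correlation matrix of the predictors; off-diagonal entries
# near +/-1 signal multicollinearity
X = np.column_stack([x1, x2, x3])
print(np.corrcoef(X, rowvar=False).round(2))
```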
t-test for testing the significance of an observed regression coefficient: Here the problem is to test, on the basis of a random sample (xᵢ, yᵢ), i = 1, 2, …, n, drawn from a bivariate normal population, whether the regression coefficient of y on x is β. The regression line of y on x (for the given sample) is:

$$Y - \bar{y} = b(X - \bar{x}), \qquad b = \frac{\mu_{11}}{\sigma_x^2},$$

so that the fitted values are

$$\hat{y}_i = \bar{y} + b(x_i - \bar{x}).$$
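A minimal sketch (made-up data) of computing b = μ₁₁/σₓ² and the fitted values from the sample regression line:

```python
import numpy as np

# Hypothetical sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# b = mu_11 / sigma_x^2, with mu_11 the sample covariance and
# sigma_x^2 the sample variance (both with divisor n, as in the text)
mu11 = np.mean((x - x.mean()) * (y - y.mean()))
sigma_x2 = np.mean((x - x.mean())**2)
b = mu11 / sigma_x2

# Fitted values from the sample regression line of y on x
y_hat = y.mean() + b * (x - x.mean())
print(b)
print(y_hat)
```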
Under H₀, that the regression coefficient is β, Prof. R. A. Fisher proved that the statistic –