
e-notes on Correlation and regression

Prepared by: T. Loidang Chanu


Assistant Professor, Statistics
BEAS department, CAEPHT, CAU, Ranipool, Sikkim
Covariance: Covariance indicates how two variables are related. A positive covariance
indicates that the two variables are positively related, and a negative covariance means the
variables are inversely related. The formula for calculating the covariance of sample data is

        Cov(x, y) = Σ(xᵢ − x̄)(yᵢ − ȳ)/(n − 1) = (Σxᵢyᵢ − n x̄ ȳ)/(n − 1)

where xᵢ and yᵢ refer to the values of the two variables for the ith observation, x̄ and ȳ are the
means of the two variables, and "n" is the number of data points in the sample.
Note: a) If Cov(x,y) > 0, then there is a +ve relationship.
b) If Cov(x,y) < 0, then there is a -ve relationship.
c) Cov(x,y) does not tell us much about the strength of such a relationship because it is
affected by changes in the units of measurement. To avoid this disadvantage of the
covariance, we standardize the data.
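The sample covariance formula above can be sketched in a few lines of Python; the function and data below are illustrative, not part of the notes.

```python
# Minimal sketch of the sample covariance formula:
# Cov(x, y) = sum((xi - xbar)(yi - ybar)) / (n - 1)

def sample_cov(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

# Positive covariance: y rises with x; negative: y falls as x rises.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]
y_down = [10, 8, 6, 4, 2]
print(sample_cov(x, y_up))    # 5.0 (positive relationship)
print(sample_cov(x, y_down))  # -5.0 (inverse relationship)
```

Note that scaling x or y (say, converting inches to centimetres) rescales the covariance, which is exactly why the standardized version (the correlation coefficient, below) is preferred for judging strength.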
Correlation coefficient: Let x and y be any two random variables (discrete or
continuous) with standard deviations σx and σy respectively. The correlation coefficient of
x and y, denoted by cor(x,y) or ρxy, is defined as
        ρxy = cor(x, y) = cov(x, y)/(σx σy) = σxy/(σx σy)

            = [(1/n) Σ(xᵢ − x̄)(yᵢ − ȳ)] / {√[(1/n) Σ(xᵢ − x̄)²] · √[(1/n) Σ(yᵢ − ȳ)²]}

            = [Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n] / √{[Σxᵢ² − (Σxᵢ)²/n] [Σyᵢ² − (Σyᵢ)²/n]}

            = [Σxᵢyᵢ − n x̄ ȳ] / √[(Σxᵢ² − n x̄²)(Σyᵢ² − n ȳ²)]

since Σ(xᵢ − x̄)² = Σxᵢ² − (Σxᵢ)²/n = Σxᵢ² − n x̄², and similarly
Σ(xᵢ − x̄)(yᵢ − ȳ) = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n = Σxᵢyᵢ − n x̄ ȳ.
Note: 1. The covariance between the standardized x and y data is called the correlation
between x and y.
2. It may be noted that cor(x,y) or r(x,y) measures only the linear relationship between x and y.
3. Karl Pearson's correlation coefficient is also called the "product-moment" correlation
coefficient since
        Cov(x,y) = E[{x − E(x)}{y − E(y)}] = μ₁₁
Scatter diagram: It is the simplest way of diagrammatic representation of bivariate data.
Thus, for the bivariate distribution (xᵢ, yᵢ); i = 1, 2, …, n, if the values of the variables x and y
are plotted along the x-axis and y-axis respectively in the x-y plane, the diagram so obtained is
known as a "scatter diagram". From the scatter diagram we can form a fairly good, though
vague, idea of whether the variables are correlated or not; e.g., if the points are very dense, i.e.,
very close to each other, we should expect a fairly good amount of correlation between the
variables, and if the points are widely scattered, a poor correlation is expected. This method,
however, is not suitable if the number of observations is fairly large.
Scatter diagram and correlation coefficient:

[Figure: scatter diagrams illustrating typical patterns for r near +1, r near −1, and r near 0]

Testing the significance of correlation coefficient (𝝆 = 𝟎)


To test the hypothesis that the correlation coefficient of the bivariate normal population is
zero, we can use the observed correlation coefficient (r) in a sample of n pairs of
observations in the t-statistic

        t = [r/√(1 − r²)] √(n − 2)   with (n − 2) d.f.

Note: H₀: ρ = 0
      H₁: ρ ≠ 0
If the null hypothesis (H₀) is accepted, we conclude that the variables may be regarded as
uncorrelated in the population.
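The statistic above is easy to compute by hand or in code. The following sketch uses made-up values of r and n for illustration; the result is then compared against a t-table at (n − 2) degrees of freedom.

```python
import math

# Significance test for an observed correlation coefficient r:
# t = r * sqrt(n - 2) / sqrt(1 - r^2), with (n - 2) degrees of freedom.

def t_for_r(r, n):
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.6, 27            # hypothetical observed r and sample size
t = t_for_r(r, n)
print(round(t, 2))        # 3.75, to be compared with the t-table at 25 d.f.
```

Since 3.75 exceeds the usual 5% two-tailed critical value for 25 d.f. (about 2.06), H₀: ρ = 0 would be rejected for this hypothetical sample.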
Properties of correlation coefficient:
(a) The value of correlation coefficient is a pure number which is independent of the units
of measurement of the two variables.
(b) The value of the correlation coefficient lies between -1 and +1.
(c) r², the square of the correlation coefficient, is referred to as the coefficient of
determination.
(d) The quantity (1 − r²) is referred to as the coefficient of non-determination.
(e) The square root √(1 − r²) gives the proportion of variance not shared between the two
variables and is called the coefficient of alienation.
Q. Calculate the correlation coefficient for the following heights (in inches) of fathers (x) and
their sons (y):
X: 65 66 67 67 68 69 70 72
Y: 67 68 65 68 72 72 69 71
Solution:
        x      y      x²       y²       xy
        65     67     4225     4489     4355
        ….     ….     ….       ….       ….

        x̄ = ………….., ȳ = …………..

        Cor(x, y) = cov(x, y)/(σx σy) = [(1/n) Σ(xᵢ − x̄)(yᵢ − ȳ)]/(σx σy)
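Since the question gives the full data, the answer can be checked numerically. The sketch below uses the short-cut (computational) formula derived earlier; the helper code itself is not part of the notes.

```python
import math

# Father/son heights from the question, using
# r = (Σxy − n·x̄·ȳ) / sqrt((Σx² − n·x̄²)(Σy² − n·ȳ²)).
x = [65, 66, 67, 67, 68, 69, 70, 72]
y = [67, 68, 65, 68, 72, 72, 69, 71]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n                           # 68.0, 69.0
sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar  # 24.0
sxx = sum(xi ** 2 for xi in x) - n * xbar ** 2                # 36.0
syy = sum(yi ** 2 for yi in y) - n * ybar ** 2                # 44.0
r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # 0.603
```

The computed r ≈ 0.603 indicates a moderate positive correlation between fathers' and sons' heights.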

Regression Analysis: In correlation analysis we discussed the degree of relationship
without considering which variable is the cause and which the effect. In regression analysis there
are two types of variables. The variable whose value is influenced or is to be predicted is called
the dependent variable, and the variable which influences the values or is used for prediction is
called the independent variable.
In regression analysis the independent variable is also known as the regressor, predictor or
explanatory variable, while the dependent variable is also known as the regressed or explained
variable. In regression analysis we find an algebraic function of the form y = f(x), i.e., we
express the dependent variable as a function of the independent variable. Thus regression
analysis makes it possible to estimate or predict the unknown values of the dependent variable
for known values of the independent variable.
        The term regression literally means "stepping back towards the average". It was
first used by the British biometrician Sir Francis Galton (1822-1911), a cousin of Charles
Darwin, in connection with the inheritance of stature.

He was interested in predicting the height of a son based on the height of his father.


Looking at the scatter plot of these heights, Galton saw that the trend was linear and
increasing. After fitting a line to these data, he observed that fathers taller than the
average tended to have sons shorter than themselves, while fathers shorter than the
average tended to have taller sons: there is a regression towards the mean.

Linear Regression: Linear regression analysis is a method for predicting the
values of one or more response variables (dependent variables) from one or more
predictor variables (or independent variables).

        The independent variable is also known as the regressor, predictor or
explanatory variable, while the dependent variable is also known as the
response or explained variable.

An economist may want to investigate the relationship between expenditure
and income. Let us see what factors a household considers when deciding how
much money it should spend on food every day, every week or every month. The
income of the household is the main factor, and many other variables also affect the
food expenditure, e.g., family size, preferences for food and other household items, etc.
These are all explanatory variables because they all vary independently and they
explain the variation in expenditure among different households.

Types of regression models: There are two basic types of regression. They
are given below:

i) Simple linear regression


ii) Multiple linear regression
i) Simple linear regression model: Let us tentatively assume that the regression line
of variable Y on X has the form β₀ + β₁X. Then we can write the linear regression model
        Y = β₀ + β₁X + ε,
where:
Y is called the response variable (or dependent variable),
X is called the predictor variable (or independent variable),
β₀ and β₁ are structural parameters, and
ε is the random component.
If it is appropriate to think of X as determined independently of the equation and
of Y as explained by the equation, then X is often called an exogenous variable
and Y an endogenous variable; in this case X is called an explanatory variable. More
generally, X is also referred to as a regressor. When we fit the line y = a + bx to data,
we speak of regressing y on X.

For n observations on Y and X the model is written as follows:

        Yᵢ = β₀ + β₁Xᵢ + εᵢ,   i = 1, 2, …, n
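Fitting the line y = a + bx by least squares uses b = Σ(xᵢ − x̄)(yᵢ − ȳ)/Σ(xᵢ − x̄)² and a = ȳ − b·x̄. A minimal sketch, with made-up data chosen so the fit is exact:

```python
# Least-squares estimates for the simple linear model Y = b0 + b1*X + e:
# b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²,  b0 = ȳ - b1*x̄

def fit_line(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    return b0, b1

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]          # exactly y = 1 + 2x, so the fit recovers it
b0, b1 = fit_line(x, y)
print(b0, b1)             # 1.0 2.0
```

With real data the points will not lie exactly on a line, and the residuals eᵢ = yᵢ − (b0 + b1·xᵢ) carry the unexplained variation.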

Variables may be quantitative (e.g., income) or qualitative (e.g., rural or urban
residence). A quantitative variable may be either discrete (e.g., number of children ever born)
or continuous (e.g., time). Qualitative variables are also called categorical variables or
factors. In regression analysis all variables, whether quantitative or qualitative, are
represented numerically. Qualitative variables are represented by dummy variables.
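As a small illustration of the dummy-variable idea (the example values are made up), a two-level factor such as rural/urban residence becomes a 0/1 column that can enter the regression like any numeric variable:

```python
# Dummy coding for a qualitative (categorical) variable: each category
# level is mapped to a 0/1 indicator so it can be used numerically.
residence = ["rural", "urban", "urban", "rural"]
dummy = [1 if r == "urban" else 0 for r in residence]
print(dummy)  # [0, 1, 1, 0]
```

A factor with k levels needs k − 1 dummy columns, one level being left out as the reference category.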

Generally speaking, a model is a description of the relationships connecting the
variables of interest. The process of model-building consists of putting together a
set of formal expressions of these relationships to the point where the behaviour of the
model adequately mimics the behaviour of the system.

The choice of the mathematical form of the model (including the choice of variables
and the statement of underlying assumptions) is referred to as model specification.
The term specification error is used to indicate an incorrect model specification.

Simplifying assumptions:

i) Linearity: The random error term εᵢ has a mean equal to zero for each x.
When the mean value of ε is zero, the mean value of y for a given x is
        E(yᵢ) = β₀ + β₁x
ii) Homoscedasticity: This equal-variance assumption means that in the
underlying population, the variance of the variable yᵢ, denoted by σ², is the
same at each X = xᵢ. Equivalently, the variance of εᵢ is σ² at each X = xᵢ.
iii) Independence: The error terms εᵢ are statistically independent. We
call yᵢ = β₀ + β₁xᵢ + εᵢ the underlying population model and yᵢ = b₀ + b₁xᵢ + eᵢ
the estimated model.
iv) Normality: This assumption specifies that the distribution of the εᵢ values
should be normal.

Coefficient of determination:

We may ask, "How well does the independent variable explain the dependent variable in
the regression model?" The coefficient of determination is one concept that answers
this question. The coefficient of determination is a way to measure the contribution
of the independent variable in predicting the dependent variable. It is denoted by the
symbol r². Its value lies between zero and one (0 ≤ r² ≤ 1). If x contributes
information for predicting y, r² will be greater than zero. When x contributes no
information for predicting y, r² will be near zero. In regression with a single
independent variable, r² is the same as the square of the correlation between the
dependent and independent variables.

PROPERTIES:

(1) The correlation coefficient between two variables x and y is the geometric mean of the two
    regression coefficients bxy and byx. This is known as the fundamental property of
    regression coefficients, i.e., ρ² = βyx · βxy

        ⇒ ρ = √(βyx · βxy)

    For a sample,
        r = √(bxy · byx)
(2) The signs of the regression coefficients and the correlation coefficient are always the same.
    This is known as the signature property of regression coefficients.
(3) If βyx > 1, then βxy < 1, and conversely (since βyx · βxy = ρ² ≤ 1).
(4) If the variables x and y are independent, the regression coefficients are zero. This is
    known as the independence property of regression coefficients.
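The fundamental property (1) can be verified numerically. The sketch below uses illustrative data and the sums-of-products notation Sxy, Sxx, Syy (assumed names, not from the notes), with byx = Sxy/Sxx and bxy = Sxy/Syy:

```python
import math

# Check r = sqrt(byx * bxy), the geometric-mean (fundamental) property.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
byx, bxy = sxy / sxx, sxy / syy    # the two regression coefficients
r = sxy / math.sqrt(sxx * syy)     # correlation coefficient
print(math.isclose(r, math.sqrt(byx * bxy)))  # True
```

Both regression coefficients share the sign of Sxy, which is why the signature property (2) holds: r, byx and bxy are all positive or all negative together.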
Notes: (1) Multiple regression may suffer from multicollinearity, autocorrelation and
heteroscedasticity.

(2) Linear regression is very sensitive to outliers. They may badly affect the regression line
and eventually the forecasted values.

(3) Multicollinearity exists when two or more of the predictors in the regression model
are moderately or highly correlated.

(4) Autocorrelation: It is also known as serial correlation. It is the similarity between
observations as a function of the time lag between them.

t-test for testing the significance of an observed regression coefficient: Here the problem is
to test whether a random sample (xᵢ, yᵢ), (i = 1, 2, …, n) has been drawn from a bivariate normal
population in which the regression coefficient of y on x is β. The regression line of y on x (for
the given sample) is:

        Y − ȳ = b(X − x̄),   where b = μ₁₁/σx²

The estimate of Y for a given value xᵢ of X, as given by the line, is

        ŷᵢ = ȳ + b(xᵢ − x̄)

Under H₀ that the regression coefficient is β, Prof. R.A. Fisher proved that the statistic

        t = (b − β) [(n − 2) Σᵢ(xᵢ − x̄)² / Σᵢ(yᵢ − ŷᵢ)²]^(1/2)

follows the t-distribution with (n − 2) d.f.
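The statistic above can be computed directly from the data. The sketch below uses made-up observations and tests the hypothesised slope β = 0 (i.e., no linear relationship); the residual sum of squares is formed from the fitted values ŷᵢ = ȳ + b(xᵢ − x̄):

```python
import math

# t-statistic for H0: slope = beta0, with (n - 2) degrees of freedom:
# t = (b - beta0) * sqrt((n - 2) * Σ(xi - x̄)² / Σ(yi - ŷi)²)
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]   # illustrative data
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
resid_ss = sum((yi - (ybar + b * (xi - xbar))) ** 2
               for xi, yi in zip(x, y))
beta0 = 0.0                             # hypothesised slope under H0
t = (b - beta0) * math.sqrt((n - 2) * sxx / resid_ss)
print(round(t, 2))  # compare with the t-table at n - 2 = 4 d.f.
```

Here the points lie nearly on a line, so the t value is far beyond any tabled critical value and H₀: β = 0 is emphatically rejected.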
