Simple Linear Correlation and Regression
Credits 5
Semester 1
When the relationship is of a quantitative nature, the appropriate statistical tool for
discovering and measuring the relationship and expressing it in a brief formula is
known as correlation. - Croxton and Cowden
A lack of relationship between the variables implies that they are independent.
Both variables may also be mutually influencing each other, so that neither can be designated as the cause and the other as the effect.
If both variables vary in the same direction, i.e., if an increase in one variable results, on average, in an increase in the other, or if a decrease in one variable results, on average, in a decrease in the other, correlation is said to be positive.
If the variables vary in opposite directions, i.e., if an increase in one variable results, on average, in a decrease in the other, or if a decrease in one variable results, on average, in an increase in the other, correlation is said to be negative.
In partial correlation, we recognize more than two variables, but consider only two
variables to be influencing each other simultaneously, the effect of other influencing
variables being kept constant. (analysis of yield of rice per acre and amount of
rainfall limited to periods with a certain constant temperature)
If the amount of change in one variable tends to bear constant ratio to the amount of
change in the other variable, then correlation is said to be linear. If the variables
were plotted, all the plotted points would fall on a straight line.
If the amount of change in one variable does not tend to bear constant ratio to the
amount of change in the other variable, then correlation is said to be non-linear or
curvilinear.
For each pair of X and Y observations, we plot a dot, and therefore obtain as many dots as there are observations.
The greater the scatter of the points on the chart, the weaker the relationship between the two variables.
If all the points lie on a straight line running from the lower left-hand corner to the upper right-hand corner, correlation is said to be perfectly positive (r = +1).
If the points do not all lie on such a line but form a narrow band running in the same direction, there is said to be a high degree of positive correlation.
If all the points lie on a straight line running from the upper left-hand corner to the lower right-hand corner, correlation is said to be perfectly negative (r = -1).
If the points do not all lie on such a line but form a narrow band running in that direction, there is said to be a high degree of negative correlation.
If the points are widely scattered from the line of best fit, there is a low degree of correlation, and if they are closely scattered about the line of best fit, there is a high degree of correlation.
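As a rough illustration of reading a scatter diagram, here is a minimal sketch that plots made-up (X, Y) pairs together with a least-squares line of best fit; the data values and the use of numpy/matplotlib are assumptions for illustration only.

```python
# Minimal scatter-diagram sketch with made-up illustrative data.
import matplotlib.pyplot as plt
import numpy as np

X = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
Y = np.array([12, 18, 33, 38, 52, 55, 68, 77], dtype=float)

plt.scatter(X, Y, label="observed pairs")

# Line of best fit (least squares) to judge how closely the points cluster.
slope, intercept = np.polyfit(X, Y, 1)
plt.plot(X, intercept + slope * X, color="red", label="line of best fit")

plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()
```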
Merits
It is a simple and non-mathematical way of studying correlation between the variables. It can be easily understood, and a rough idea of the relationship can be formed at a single glance.
If the variables are related, it is possible to see whether a line, or estimating equation, describes the relationship.
Limitations
Cannot establish the exact degree of correlation between the variables.
Graphic Method
The individual values of the two variables are plotted, forming two individual curves
for X and Y.
If both curves are moving in the same direction, either upward or downward,
correlation is said to be positive.
If the curves are moving in the opposite direction, correlation is said to be negative.
This method is normally used when we are given data over a period of time (time
series).
The symbol r is used almost universally to denote the degree of correlation:

$$r = \frac{\sum xy}{N \sigma_x \sigma_y}$$

where
$x = (X - \bar{X})$; $y = (Y - \bar{Y})$
$\sigma_x$ = standard deviation of X
$\sigma_y$ = standard deviation of Y
N = number of pairs of observations
The value of the correlation coefficient shall always lie between ±1.
When r = +1, there is perfectly positive correlation, and when r = −1, there is
perfectly negative correlation. When r = 0, there is no relationship between the
variables.
The coefficient describes not only the magnitude but also direction of relationship.
Direct Method
A revised formula that is widely used (the direct method):

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \times \sum y^2}}$$

that is,

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$

When deviations $d_x$, $d_y$ are taken from an assumed mean, and when the data are grouped with frequencies f, the corresponding forms are

$$r = \frac{N\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{N\sum d_x^2 - (\sum d_x)^2}\,\sqrt{N\sum d_y^2 - (\sum d_y)^2}}$$

$$r = \frac{N\sum f d_x d_y - (\sum f d_x)(\sum f d_y)}{\sqrt{N\sum f d_x^2 - (\sum f d_x)^2}\,\sqrt{N\sum f d_y^2 - (\sum f d_y)^2}}$$
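A minimal sketch (made-up data, numpy assumed) of computing r both by the basic formula r = Σxy / (N σx σy) and by the direct method, cross-checked against numpy's built-in corrcoef:

```python
# Pearson's r by the deviation formula and by the direct method.
import numpy as np

X = np.array([65, 66, 67, 67, 68, 69, 70, 72], dtype=float)
Y = np.array([67, 68, 65, 68, 72, 72, 69, 71], dtype=float)
N = len(X)

x = X - X.mean()                 # deviations from the mean of X
y = Y - Y.mean()                 # deviations from the mean of Y

# r = sum(xy) / (N * sigma_x * sigma_y), using population std (ddof = 0)
r1 = np.sum(x * y) / (N * X.std() * Y.std())

# r = sum(xy) / sqrt(sum(x^2) * sum(y^2))   (direct method)
r2 = np.sum(x * y) / np.sqrt(np.sum(x**2) * np.sum(y**2))

print(r1, r2, np.corrcoef(X, Y)[0, 1])   # all three values agree
```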
Assumptions underlying r
There is a linear relationship between the variables.
There is a cause and effect relationship between the forces affecting the distribution of items in X and Y. If no such relationship exists between the variables, i.e., they are independent, there cannot be any correlation. For example, there is no relationship between income and height, because the forces that affect them are not common.
The coefficient is very often misinterpreted, so great care must be taken when interpreting it.
Interpreting r
General rules to interpret r :
The closeness of the relationship is not proportional to r. For example, an r of 0.8 does not indicate a relationship twice as close as an r of 0.4; it is in fact much closer.
Properties of r
The coefficient of correlation lies between -1 and +1. Symbolically, −1 ≤ r ≤ +1.
The coefficient of correlation is independent of change of origin and scale for the variables X and Y. (A change of origin means adding or subtracting a constant, and a change of scale means multiplying or dividing by a constant. A change of origin shifts the mean but leaves the deviations from the mean unchanged, while a change of scale multiplies the deviations and the standard deviations by the same factor, which cancels between the numerator and the denominator.)
$$r = \sqrt{b_{xy} \times b_{yx}}$$

i.e., the correlation coefficient is the geometric mean of the two regression coefficients (taking the sign of the regression coefficients). Each regression coefficient is nothing but the change in one variable per unit change in the other, i.e., the slope $\hat{\beta}_2$, with the roles of X and Y interchanged between the two.

$$r_{xy} = \frac{\sum xy}{N \sigma_x \sigma_y} = \frac{\sum yx}{N \sigma_y \sigma_x} = r_{yx}$$

i.e., the coefficient is symmetric in X and Y.
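A quick numerical check (same made-up data as above, numpy assumed) that r is the geometric mean of the two regression coefficients and that r_xy = r_yx:

```python
# r as the geometric mean of b_yx and b_xy, and symmetry of r.
import numpy as np

X = np.array([65, 66, 67, 67, 68, 69, 70, 72], dtype=float)
Y = np.array([67, 68, 65, 68, 72, 72, 69, 71], dtype=float)
x, y = X - X.mean(), Y - Y.mean()

b_yx = np.sum(x * y) / np.sum(x**2)   # slope of the regression of Y on X
b_xy = np.sum(x * y) / np.sum(y**2)   # slope of the regression of X on Y

r = np.corrcoef(X, Y)[0, 1]
print(np.sqrt(b_yx * b_xy), abs(r))                       # equal in magnitude
print(np.corrcoef(X, Y)[0, 1], np.corrcoef(Y, X)[0, 1])   # r_xy = r_yx
```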
We deal with stochastic variables, i.e., variables that are random and have some intrinsic random variability within them.
Regression itself NEVER implies causation. As Kendall and Stuart said, ideas of
causation can only arise outside of statistics, from some related theory.
Basis for comparison: studying causation.
Regression analysis: given that one variable is said to explain the other(s), it is possible to study causation with the aid of regression. Regression analysis alone, however, cannot imply causation.
Correlation: it is not possible to study causation with the help of correlation, as there is no establishment of a dependent and an explanatory variable; they are simply related variables.
All the predicted mean values of y with respect to fixed values of x are called
conditional expected values, as they depend on the given values of x. They are
denoted by E(Y ∣X), read as the expected value of Y given the value of X. What is
the expected value of Y given X? The knowledge of X helps to better predict the
value of Y.
This is distinguished from the unconditional expected value of Y, which is $\bar{Y}$, i.e., the total of all observations divided by the number of observations.
If we plot the conditional expected values of Y against the various X values and join
them, we get the population regression line (PRL) or the population regression
curve. It is the regression of Y on X.
The population regression curve is the locus of the conditional means of the
dependent variable for the fixed values of the explanatory variable(s) X.

$$E(Y \mid X_i) = f(X_i)$$

This states merely that the expected value of the distribution of Y given $X_i$ is functionally related to $X_i$. Assuming the function is linear:

$$E(Y \mid X_i) = \beta_1 + \beta_2 X_i$$
where $\beta_1$ is the intercept and $\beta_2$ is the slope coefficient.
$$u_i = Y_i - E(Y \mid X_i)$$

or

$$Y_i = E(Y \mid X_i) + u_i$$

The deviation $u_i$ is an unobservable random variable that can take positive or negative values; it is called the stochastic disturbance or stochastic error term.
The equation can be interpreted as follows: the expenditure of an individual family, given its income level, can be expressed as a sum of two components:
i. $E(Y \mid X_i)$, the mean consumption expenditure of all the families with the same level of income (the systematic, or deterministic, component); and
ii. $u_i$, the random, or nonsystematic, component.
$$Y_i = \beta_1 + \beta_2 X_i + u_i$$
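A small simulation sketch of the PRF $Y_i = \beta_1 + \beta_2 X_i + u_i$: the β values, the income levels and the disturbance spread are made up, and numpy is assumed. The sample conditional means of Y at each fixed X should sit close to the population regression line.

```python
# Simulating Y_i = beta1 + beta2*X_i + u_i and checking conditional means.
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2 = 17.0, 0.6                      # hypothetical PRF parameters

X_levels = np.array([80, 100, 120, 140, 160, 180, 200, 220], dtype=float)
for X in X_levels:
    u = rng.normal(0.0, 5.0, size=500)        # stochastic disturbance term
    Y = beta1 + beta2 * X + u
    # The sample conditional mean should be close to E(Y|X) = beta1 + beta2*X.
    print(X, Y.mean().round(2), beta1 + beta2 * X)
```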
The sample regression function (SRF) is written as

$$\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i$$

where:
$\hat{\beta}_1$ = estimator of $\beta_1$
$\hat{\beta}_2$ = estimator of $\beta_2$
The particular numerical value obtained by the estimator when applied is called an
estimate. An estimate is non-random as it is a particular point value obtained from
the estimator.
In stochastic form, the SRF can be written as

$$Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i = \hat{Y}_i + \hat{u}_i$$

where $\hat{u}_i$ denotes the (sample) residual term.
But to actually determine the SRF, express the above equation as:
$$\hat{u}_i = Y_i - \hat{Y}_i = Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i$$
This shows that the residual term is simply the difference between the actual and
estimated Y values.
A simple sum of residuals is not satisfactory, as all residuals are given equal weight in the sum no matter how close to or far from the SRF they are. This is avoided by fixing the SRF in such a way that the sum of squared residuals,

$$\sum \hat{u}_i^2 = \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i)^2,$$

is as small as possible.
By squaring $\hat{u}_i$, this method gives more weight to residuals that lie farther from the SRF.
This improves on the simple sum, which could be small even when the residuals are widely spread about the SRF (because positive and negative residuals cancel), since the larger $\hat{u}_i$ is in absolute value, the larger $\hat{u}_i^2$ becomes.
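A tiny numerical sketch (made-up points, numpy assumed) of why the plain sum of residuals is a poor criterion while the sum of squared residuals is not: two hypothetical candidate lines both give a residual sum of zero, but only the squared sum tells them apart.

```python
# Plain sum of residuals vs sum of squared residuals for two candidate lines.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 3.0, 5.0, 4.0])

def residuals(b1, b2):
    """Residuals of Y about the line b1 + b2*X."""
    return Y - (b1 + b2 * X)

u_ls = residuals(1.5, 0.8)     # the least-squares line for these points
u_flat = residuals(3.5, 0.0)   # a flat line through the mean of Y

print(u_ls.sum(), u_flat.sum())            # both plain sums are 0 (cancellation)
print((u_ls**2).sum(), (u_flat**2).sum())  # 1.8 vs 5.0: squares discriminate
```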
Direct Method
For estimating $\hat{\beta}_1$ and $\hat{\beta}_2$, differentiating the sum of squared residuals with respect to both parameters and setting the derivatives to zero gives the two normal equations:

$$\sum Y_i = n\hat{\beta}_1 + \hat{\beta}_2 \sum X_i$$

$$\sum Y_i X_i = \hat{\beta}_1 \sum X_i + \hat{\beta}_2 \sum X_i^2$$
Solving the above equations simultaneously gives the value of both estimators.
Indirect Method
Expressed in terms of deviations from the means, the solutions are:

$$\hat{\beta}_2 = \frac{\sum x_i y_i}{\sum x_i^2}$$

$$\hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X}$$

where:
$\bar{X}$ = mean of X
$\bar{Y}$ = mean of Y
$x_i = X_i - \bar{X}$
$y_i = Y_i - \bar{Y}$
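A minimal estimation sketch on made-up data (numpy assumed): $\hat{\beta}_2$ and $\hat{\beta}_1$ are computed from the deviation formulas, and the same values are recovered by solving the two normal equations simultaneously.

```python
# OLS estimates via the deviation formulas and via the normal equations.
import numpy as np

X = np.array([80, 100, 120, 140, 160, 180, 200, 220], dtype=float)
Y = np.array([65, 70, 90, 95, 110, 115, 120, 140], dtype=float)
n = len(X)

# Indirect method: b2 = sum(x*y) / sum(x^2),  b1 = mean(Y) - b2*mean(X)
x, y = X - X.mean(), Y - Y.mean()
b2 = np.sum(x * y) / np.sum(x**2)
b1 = Y.mean() - b2 * X.mean()

# Direct method: solve the two normal equations simultaneously
#   sum(Y)  = n*b1 + b2*sum(X)
#   sum(XY) = b1*sum(X) + b2*sum(X^2)
A = np.array([[n, X.sum()], [X.sum(), (X**2).sum()]])
rhs = np.array([Y.sum(), (X * Y).sum()])
b1_ne, b2_ne = np.linalg.solve(A, rhs)

print(b1, b2)            # both routes give the same estimates
print(b1_ne, b2_ne)
```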
1. Estimators obtained from these methods are known as least-squares estimators, as they are obtained using the least-squares principle.
2. They are point estimators, i.e., they provide only a single value as compared to
interval estimators which provide a range of values.
3. Once the estimators are derived, they can be substituted in the equation to form the
SRF. The SRF thus derived has two properties:
i. It passes through $\bar{X}$ and $\bar{Y}$. This follows because the formulas used to derive the estimators involve both, as can be shown by substitution.
ii. $\bar{\hat{Y}} = \bar{Y}$, i.e., the mean of the estimated Y values equals the mean of the actual Y values, as can be seen by substituting $\bar{X}$ into the equation.
iii. The mean value of residuals is zero.
$$Y_i = \beta_1 + \beta_2 X_i + u_i$$
Sometimes the intercept value is not taken as a variable as it does not vary.
$$\operatorname{var}(\hat{\beta}_2) = \frac{\sigma^2}{\sum x_i^2}$$

$$\operatorname{se}(\hat{\beta}_2) = \frac{\sigma}{\sqrt{\sum x_i^2}}$$

$$\operatorname{var}(\hat{\beta}_1) = \frac{\sum X_i^2}{n \sum x_i^2}\,\sigma^2$$

$$\operatorname{se}(\hat{\beta}_1) = \sqrt{\frac{\sum X_i^2}{n \sum x_i^2}}\;\sigma$$
where:
var = variance
se= standard error
σ 2 = constant variance of ui
All values in the above formulas can be estimated from the data except $\sigma^2$, which is itself estimated by:

$$\hat{\sigma}^2 = \frac{\sum \hat{u}_i^2}{n - 2}$$

where the residual sum of squares can be computed as

$$\sum \hat{u}_i^2 = \sum y_i^2 - \hat{\beta}_2^2 \sum x_i^2$$

and

$$\hat{\sigma} = \sqrt{\frac{\sum \hat{u}_i^2}{n - 2}}$$

$\hat{\sigma}$ is the standard error of estimate or the standard error of the regression (se). It is the standard deviation of the Y values about the estimated regression line.
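Continuing the same made-up example (numpy assumed), a sketch of estimating $\hat{\sigma}^2$ and the standard errors of the two estimators from the formulas above:

```python
# Estimating sigma^2 and the standard errors of beta1_hat and beta2_hat.
import numpy as np

X = np.array([80, 100, 120, 140, 160, 180, 200, 220], dtype=float)
Y = np.array([65, 70, 90, 95, 110, 115, 120, 140], dtype=float)
n = len(X)
x = X - X.mean()

b2 = np.sum(x * (Y - Y.mean())) / np.sum(x**2)
b1 = Y.mean() - b2 * X.mean()

u_hat = Y - (b1 + b2 * X)                  # residuals about the fitted line
sigma2_hat = np.sum(u_hat**2) / (n - 2)    # sigma_hat^2 = sum(u_hat^2) / (n - 2)

se_b2 = np.sqrt(sigma2_hat / np.sum(x**2))
se_b1 = np.sqrt(sigma2_hat * np.sum(X**2) / (n * np.sum(x**2)))

print(sigma2_hat, se_b1, se_b2)
```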
TSS (total sum of squares) = $\sum (Y_i - \bar{Y})^2$, the total variation of the actual Y values about their sample mean.
ESS (explained sum of squares) = $\sum (\hat{Y}_i - \bar{Y})^2$, the sum of squares due to regression, i.e., due to the explanatory variable.
RSS (residual sum of squares) = $\sum \hat{u}_i^2$, the unexplained variation of the Y values about the regression line.
TSS = ESS + RSS.
This shows that the total variation in observed Y values about their mean can be
partitioned into two parts: one attributable to the regression line and the other to
random forces because not all actual Y observations lie on the fitted line.
Dividing both sides by TSS:

$$1 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} + \frac{\sum \hat{u}_i^2}{\sum (Y_i - \bar{Y})^2}$$
Properties of $r^2$:
It is a non-negative quantity
$0 \leq r^2 \leq 1$. An $r^2$ of 1 means a perfect fit, i.e., $\hat{Y}_i = Y_i$ for each i. An $r^2$ of 0 means there is no relationship whatsoever between the regressand and the regressor (i.e., $\hat{\beta}_2 = 0$).
$$r^2 = \frac{\sum \hat{y}_i^2}{\sum y_i^2} = \frac{\hat{\beta}_2^2 \sum x_i^2}{\sum y_i^2} = \hat{\beta}_2^2 \left(\frac{\sum x_i^2}{\sum y_i^2}\right)$$
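A short check (same made-up data, numpy assumed) that TSS = ESS + RSS and that the two expressions for $r^2$ agree, also matching the square of the correlation coefficient:

```python
# Decomposition TSS = ESS + RSS and two ways of computing r^2.
import numpy as np

X = np.array([80, 100, 120, 140, 160, 180, 200, 220], dtype=float)
Y = np.array([65, 70, 90, 95, 110, 115, 120, 140], dtype=float)
x, y = X - X.mean(), Y - Y.mean()

b2 = np.sum(x * y) / np.sum(x**2)
b1 = Y.mean() - b2 * X.mean()
Y_hat = b1 + b2 * X

TSS = np.sum((Y - Y.mean())**2)
ESS = np.sum((Y_hat - Y.mean())**2)
RSS = np.sum((Y - Y_hat)**2)

print(TSS, ESS + RSS)                         # the decomposition holds
r2_a = ESS / TSS
r2_b = b2**2 * np.sum(x**2) / np.sum(y**2)    # r^2 = b2^2 * (sum(x^2)/sum(y^2))
print(r2_a, r2_b, np.corrcoef(X, Y)[0, 1]**2)
```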
$E(\hat{\beta}_2) = \beta_2$, i.e., $\hat{\beta}_2$ is an unbiased estimator of $\beta_2$, although any single estimate may differ from the true value.
Thus instead of relying on the point estimator alone, an interval may be constructed
around the point estimator, with a small number of standard errors on either side,
say, the interval has 95% probability of containing the true value.
Assume we want to know how close $\hat{\beta}_2$ is to $\beta_2$. We try to find two positive numbers $\delta$ and $\alpha$, with $\alpha$ lying between 0 and 1, such that the probability that the random interval $(\hat{\beta}_2 - \delta,\ \hat{\beta}_2 + \delta)$ contains the true $\beta_2$ is $1 - \alpha$. Symbolically,

$$\Pr(\hat{\beta}_2 - \delta \leq \beta_2 \leq \hat{\beta}_2 + \delta) = 1 - \alpha$$
For example, if $\alpha = 0.05$, or 5%, it would be read as: the probability that the random interval shown includes the true $\beta_2$ is 0.95, or 95%.
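A sketch of building such an interval for $\beta_2$ on the made-up data used above, with $\delta = t_{\alpha/2,\,n-2} \cdot \operatorname{se}(\hat{\beta}_2)$; numpy and scipy.stats are assumed.

```python
# A 95% confidence interval for beta2 around beta2_hat.
import numpy as np
from scipy import stats

X = np.array([80, 100, 120, 140, 160, 180, 200, 220], dtype=float)
Y = np.array([65, 70, 90, 95, 110, 115, 120, 140], dtype=float)
n = len(X)
x = X - X.mean()

b2 = np.sum(x * (Y - Y.mean())) / np.sum(x**2)
b1 = Y.mean() - b2 * X.mean()
u_hat = Y - (b1 + b2 * X)
se_b2 = np.sqrt((np.sum(u_hat**2) / (n - 2)) / np.sum(x**2))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # critical t value with n-2 df
delta = t_crit * se_b2

# Random interval (b2 - delta, b2 + delta) with 1 - alpha coverage probability
print(b2 - delta, b2 + delta)
```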
For example, to choose any five numbers whose total is 100, there is free choice for only four of them, as the fifth must equal 100 minus the sum of the four already chosen. Choice is reduced by the one restriction placed, i.e., df = 5 − 1 = 4.
Similarly, if there are 10 classes into which frequencies must be assigned such that
the number of cases, the mean and standard deviation agree with the original
distribution, then there are three restrictions placed and thus df = 10-3 = 7.
Thus, df = ν
The term number of degrees of freedom means the total number of observations in
the
sample (= n) less the number of independent (linear) constraints or restrictions put
on them.
In other words, it is the number of independent observations out of a total of n
observations.
For example, before the RSS can be computed,
β^1 and β^2 must first be obtained. These
two estimates therefore put two restrictions on the RSS. Therefore, there are n − 2, not n, independent observations to compute the RSS. The general rule is this:
df = (n − number of parameters estimated).
$H_0: \beta_2 = 0$ (in words: there is no significant relationship between X and Y)
$H_1: \beta_2 \neq 0$ (in words: there is a significant relationship between X and Y)
If the evidence shows that the model is statistically significant, the null hypothesis $H_0$ is rejected; otherwise the model is statistically insignificant and there is no significant relationship between Y and X.
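A sketch of the corresponding t test on the made-up data used above, with $t = \hat{\beta}_2 / \operatorname{se}(\hat{\beta}_2)$ on n − 2 degrees of freedom; numpy and scipy.stats are assumed.

```python
# Testing H0: beta2 = 0 against H1: beta2 != 0 with a two-sided t test.
import numpy as np
from scipy import stats

X = np.array([80, 100, 120, 140, 160, 180, 200, 220], dtype=float)
Y = np.array([65, 70, 90, 95, 110, 115, 120, 140], dtype=float)
n = len(X)
x = X - X.mean()

b2 = np.sum(x * (Y - Y.mean())) / np.sum(x**2)
b1 = Y.mean() - b2 * X.mean()
u_hat = Y - (b1 + b2 * X)
se_b2 = np.sqrt((np.sum(u_hat**2) / (n - 2)) / np.sum(x**2))

t_stat = b2 / se_b2                               # test statistic under H0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p value

# A small p value means H0 is rejected: the slope is statistically significant.
print(t_stat, p_value)
```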