2. Simple Regression Analysis (Chapter 6)
Since the points are unlikely to fall precisely on the line, the exact linear
relationship in Eq. (6.1) must be modified to include a random disturbance,
error, or stochastic term, 𝑢,
$$Y_i = b_0 + b_1 X_i + u_i \qquad (6.2)$$
State each of the five assumptions of the classical regression model (OLS)
and give an intuitive explanation of the meaning and need for each of them.
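(1) The first assumption of the classical linear regression model (OLS) is that
the random error term $u$ is normally distributed.
(2) The second assumption is that the expected value (mean) of the error term
is zero:
$$E(u_i) = 0$$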
(3) The third assumption is that the variance of the error term is constant in
each period and for all values of 𝑋:
$$E(u_i^2) = \sigma_u^2$$
Together with normality and a zero mean, this can be summarized as
$$u \sim N(0, \sigma_u^2)$$
(4) The fourth assumption is that the value which the error term assumes in
one period is uncorrelated or unrelated to its value in any other period:
$$E(u_i u_j) = 0 \quad \text{for } i \neq j$$
This ensures that the average value of 𝑌 depends only on 𝑋 and not on 𝑢,
and it is, once again, required in order to have efficient estimates of the
regression coefficients and unbiased tests of their significance.
(5) The fifth assumption is that the explanatory variable assumes fixed
values that can be obtained in repeated samples, so that the explanatory
variable is also uncorrelated with the error term:
𝐸(𝑋𝑖 𝑢𝑖 ) = 0
EXAMPLE 1.
Table 1 gives the bushels of corn per acre, 𝒀, resulting from the use
of various amounts of fertilizer in pounds per acre, 𝑿, produced on a farm
in each of 10 years from 1971 to 1980. These are plotted in the scatter
diagram of Fig. 6-1. The relationship between 𝑿 and 𝒀 in Fig. 6-1 is
approximately linear (i.e., the points would fall on or near a straight line).
Table 1: Corn yield (Y, bushels per acre) and fertilizer use (X, pounds per acre)
Year    n     Y_i    X_i
1971    1     40      6
1972    2     44     10
1973    3     46     12
1974    4     48     14
1975    5     52     16
1976    6     58     18
1977    7     60     22
1978    8     68     24
1979    9     74     26
1980   10     80     32
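The scatter diagram mentioned in the text is not reproduced in these notes. A minimal MATLAB sketch of how the data of Table 1 might be entered and plotted is given here; the axis labels and title are illustrative choices, not taken from the original figure.

```matlab
% Data of Table 1 as column vectors (1971-1980)
Y = [40 44 46 48 52 58 60 68 74 80]';   % corn yield, bushels per acre
X = [ 6 10 12 14 16 18 22 24 26 32]';   % fertilizer, pounds per acre

% Scatter diagram of Y against X
figure;
plot(X, Y, 'o');
xlabel('Fertilizer, X (pounds per acre)');
ylabel('Corn yield, Y (bushels per acre)');
title('Corn yield versus fertilizer use');
grid on;
```

If the points lie close to a straight line, a linear regression of Y on X is a reasonable starting model.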
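$$\sum Y_i = n\hat{b}_0 + \hat{b}_1 \sum X_i \qquad (1)$$
$$\sum X_i Y_i = \hat{b}_0 \sum X_i + \hat{b}_1 \sum X_i^2 \qquad (2)$$
$$\hat{b}_1 = \frac{\sum x_i y_i}{\sum x_i^2} \qquad (3)$$
where $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$ are deviations from the
sample means.
$$\hat{Y}_i = \hat{b}_0 + \hat{b}_1 X_i \qquad (4)$$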
CHAPTER 6 SIMPLE REGRESSION ANALYSIS 6/27
$$\hat{Y}_i = 27.12 + 1.66 X_i \qquad \text{(the estimated regression equation)}$$
Thus, when $X_i = 0$, $\hat{Y}_i = 27.12 = \hat{b}_0$. When $X_i = 18 = \bar{X}$,
$\hat{Y}_i = 27.12 + 1.66(18) = 57 = \bar{Y}$.
As a result, the regression line passes through the point $(\bar{X}, \bar{Y})$.
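As a worked check of the arithmetic behind these estimates, the sums below are computed from the data of Table 1 (they are not stated in the text itself), with $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$:

```latex
\bar{X} = \frac{\sum X_i}{n} = \frac{180}{10} = 18, \qquad
\bar{Y} = \frac{\sum Y_i}{n} = \frac{570}{10} = 57

\hat{b}_1 = \frac{\sum x_i y_i}{\sum x_i^2} = \frac{956}{576} \approx 1.66

\hat{b}_0 = \bar{Y} - \hat{b}_1 \bar{X} = 57 - 1.66(18) = 27.12
```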
The vectors X and Y are already defined in MATLAB, so we now have to
calculate $\hat{b}_0$ and $\hat{b}_1$.
There are two ways to calculate $\hat{b}_1$; we work through both methods
in turn.
First Method
Second Method
The statements above are executed one at a time at the command prompt
in the Command Window.
Alternatively, we can combine all the statements in one script file in the
Editor window and execute them with a single click.
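The MATLAB statements for the two methods are not reproduced in these notes. The following is a minimal sketch of what such a script might look like, using the data of Table 1. It assumes that the two routes are (i) solving the normal equations directly and (ii) the deviation form of Eq. (3); which of these the notes call the first or second method is not recoverable, and the variable names are illustrative.

```matlab
% Data of Table 1 as column vectors
Y = [40 44 46 48 52 58 60 68 74 80]';   % corn yield
X = [ 6 10 12 14 16 18 22 24 26 32]';   % fertilizer
n = numel(X);

% Route 1: slope from the solution of the two normal equations
b1_normal = (n*sum(X.*Y) - sum(X)*sum(Y)) / (n*sum(X.^2) - sum(X)^2);

% Route 2: slope from deviations about the means, Eq. (3)
x = X - mean(X);
y = Y - mean(Y);
b1_dev = sum(x.*y) / sum(x.^2);

% Intercept: the fitted line passes through (mean(X), mean(Y))
b0 = mean(Y) - b1_dev*mean(X);

fprintf('b1 (normal equations) = %.4f\n', b1_normal);
fprintf('b1 (deviation form)   = %.4f\n', b1_dev);
fprintf('b0                    = %.4f\n', b0);
```

Both routes should return the same slope, which rounds to the 1.66 reported in the estimated regression equation.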
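$$\mathrm{Var}\,\hat{b}_0 = \sigma_u^2 \, \frac{\sum X_i^2}{n \sum x_i^2}$$
$$\mathrm{Var}\,\hat{b}_1 = \sigma_u^2 \, \frac{1}{\sum x_i^2}$$
Since $\sigma_u^2$ is unknown, the residual variance $s^2$ is used as an
estimate of it:
$$s^2 = \hat{\sigma}_u^2 = \frac{\sum e_i^2}{n - k}$$
where $k$ is the number of estimated parameters ($k = 2$ in simple regression).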
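Unbiased estimates of the variances of $\hat{b}_0$ and $\hat{b}_1$ are then given by
$$s_{\hat{b}_0}^2 = \frac{\sum e_i^2}{n - k} \cdot \frac{\sum X_i^2}{n \sum x_i^2}$$
$$s_{\hat{b}_1}^2 = \frac{\sum e_i^2}{n - k} \cdot \frac{1}{\sum x_i^2}$$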
so that $s_{\hat{b}_0}$ and $s_{\hat{b}_1}$ are the standard errors of the
estimates. Since $u_i$ is normally distributed, $Y_i$, and therefore
$\hat{b}_0$ and $\hat{b}_1$, are also normally distributed, so that we can use
the t distribution with $n - k$ degrees of freedom to test hypotheses about,
and construct confidence intervals for, $b_0$ and $b_1$.
EXAMPLE 3
Table 3 (an extension of Table 2) shows the calculations required to test
the statistical significance of 𝑏̂0 and 𝑏̂1 .
We need to calculate
$n$, $\sum e_i$, $\sum e_i^2$, $\sum X_i^2$, $\sum x_i^2$, $\sum y_i^2$,
$s_{\hat{b}_0}^2$, $s_{\hat{b}_1}^2$, $s_{\hat{b}_0}$, $s_{\hat{b}_1}$, $t_0$ and $t_1$.
Estimation of 𝑡0 and 𝑡1
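The table and MATLAB statements for this example did not survive in these notes. As a hedged sketch, the quantities listed can be computed for the corn-fertilizer data of Table 1 as follows. It assumes that $t_0$ and $t_1$ are the usual t ratios for the null hypothesis that each parameter is zero, $t_0 = \hat{b}_0 / s_{\hat{b}_0}$ and $t_1 = \hat{b}_1 / s_{\hat{b}_1}$; the variable names are illustrative.

```matlab
% Data of Table 1 as column vectors
Y = [40 44 46 48 52 58 60 68 74 80]';
X = [ 6 10 12 14 16 18 22 24 26 32]';
n = numel(X);
k = 2;                          % number of estimated parameters

% OLS estimates in deviation form
x  = X - mean(X);
y  = Y - mean(Y);
b1 = sum(x.*y) / sum(x.^2);
b0 = mean(Y) - b1*mean(X);

% Residuals and residual variance s^2
e  = Y - (b0 + b1*X);
s2 = sum(e.^2) / (n - k);

% Estimated variances and standard errors of b0 and b1
s2_b0 = s2 * sum(X.^2) / (n*sum(x.^2));
s2_b1 = s2 / sum(x.^2);
s_b0  = sqrt(s2_b0);
s_b1  = sqrt(s2_b1);

% t ratios under the null hypotheses b0 = 0 and b1 = 0
t0 = b0 / s_b0;
t1 = b1 / s_b1;

fprintf('sum(e) = %.2e, sum(e^2) = %.4f, s^2 = %.4f\n', sum(e), sum(e.^2), s2);
fprintf('s_b0 = %.4f, s_b1 = %.4f\n', s_b0, s_b1);
fprintf('t0 = %.2f, t1 = %.2f\n', t0, t1);
```

Each t ratio is then compared with the critical t value for $n - k$ degrees of freedom.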
STUDENT’S T-DISTRIBUTION
We can find the critical value of the t distribution in MATLAB for a specified
number of degrees of freedom (df) and level of significance.
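The original statement is not reproduced in these notes. One way to do this, assuming the Statistics and Machine Learning Toolbox is available (in GNU Octave, the statistics package), is the inverse t distribution function `tinv`. The sketch below looks up the two-tailed 5% critical value for 10 degrees of freedom; the values of `alpha` and `df` are illustrative.

```matlab
% Two-tailed critical value of Student's t distribution
alpha = 0.05;                    % level of significance
df    = 10;                      % degrees of freedom, e.g. n - k = 12 - 2

% tinv(p, df) returns the t value with cumulative probability p
t_crit = tinv(1 - alpha/2, df);

fprintf('t critical (df = %d, alpha = %.2f) = %.3f\n', df, alpha, t_crit);
```

For a one-tailed test, `tinv(1 - alpha, df)` would be used instead.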
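$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$$
Total variation in Y (total sum of squares, TSS)
  = Explained variation in Y (regression sum of squares, RSS)
  + Residual variation in Y (error sum of squares, ESS)
$$TSS = RSS + ESS$$
$$R^2 = \frac{RSS}{TSS} = 1 - \frac{ESS}{TSS}$$
$R^2$ can be calculated by
$$R^2 = \frac{\sum \hat{y}_i^2}{\sum y_i^2} = 1 - \frac{\sum e_i^2}{\sum y_i^2}$$
where $\sum \hat{y}_i^2 = \sum (\hat{Y}_i - \bar{Y})^2$.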
The following figure shows the total, the explained, and the residual
variation of Y.
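The decomposition of the total variation can also be checked numerically. The sketch below is an illustration on the corn-fertilizer data of Table 1, with illustrative variable names. It follows the convention of these notes, in which RSS is the regression (explained) sum of squares and ESS is the error (residual) sum of squares.

```matlab
% Data of Table 1 as column vectors
Y = [40 44 46 48 52 58 60 68 74 80]';
X = [ 6 10 12 14 16 18 22 24 26 32]';

% OLS fit in deviation form
x    = X - mean(X);
y    = Y - mean(Y);
b1   = sum(x.*y) / sum(x.^2);
b0   = mean(Y) - b1*mean(X);
Yhat = b0 + b1*X;

% Sums of squares
TSS = sum((Y - mean(Y)).^2);      % total variation in Y
RSS = sum((Yhat - mean(Y)).^2);   % explained (regression) variation
ESS = sum((Y - Yhat).^2);         % residual (error) variation

% Two equivalent expressions for the coefficient of determination
R2_a = RSS / TSS;
R2_b = 1 - ESS / TSS;

fprintf('TSS = %.4f, RSS = %.4f, ESS = %.4f\n', TSS, RSS, ESS);
fprintf('R^2 = %.4f (RSS/TSS), %.4f (1 - ESS/TSS)\n', R2_a, R2_b);
```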
Table 4 reports the aggregate consumption (Y, in billions of U.S.
dollars) and disposable income (X, also in billions of U.S. dollars) for a
developing economy for the 12 years from 1988 to 1999.
Draw a scatter diagram for the data and determine by inspection if there
exists an approximate linear relationship between Y and X.
𝑌𝑒𝑎𝑟 𝑛 𝑌𝑖 𝑋𝑖
1988 1 102 114
1989 2 106 118
1990 3 108 126
1991 4 110 130
1992 5 122 136
1993 6 124 140
1994 7 128 148
1995 8 130 156
1996 9 142 160
1997 10 148 164
1998 11 150 170
1999 12 154 178
From the figure above, it can be seen that the relationship between consumption
expenditures 𝑌 and disposable income 𝑋 is approximately linear, as required
by the linear regression model.
(c) Why would you expect most observed values of 𝒀 not to fall exactly on
a straight line?
where 𝑖 refers to each year in time-series analysis (as with the data in
Table ) or to each economic unit (such as a family) in cross-sectional
analysis.
(b) The exact linear relationship in Eq. (6.1) can be made stochastic by
adding a random disturbance or error term, 𝑢𝑖, giving
𝑌𝑖 = 𝑏0 + 𝑏1 𝑋𝑖 + 𝑢𝑖
The OLS method gives the best straight line that fits the sample of 𝑋𝑌
observations in the sense that it minimizes the sum of the squared
(vertical) deviations of each observed point on the graph from the
straight line.
(c) Why do we not simply take the sum of the deviations without squaring
them?
We cannot take the sum of the deviations of each of the observed points
from the OLS line because deviations that are equal in size but opposite
in sign cancel out, so the sum of the deviations equals 0.
Taking the sum of the absolute deviations avoids the problem of having
the sum of the deviations equal to 0. However, the sum of the squared
deviations is preferred so as to penalize larger deviations relatively more
than smaller deviations.
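The cancellation described above can be checked numerically. The following small sketch uses the consumption data of Table 4 and an OLS fit in deviation form; the variable names are illustrative.

```matlab
% Data of Table 4 as column vectors (1988-1999)
Y = [102 106 108 110 122 124 128 130 142 148 150 154]';   % consumption
X = [114 118 126 130 136 140 148 156 160 164 170 178]';   % disposable income

% OLS fit in deviation form
x  = X - mean(X);
y  = Y - mean(Y);
b1 = sum(x.*y) / sum(x.^2);
b0 = mean(Y) - b1*mean(X);

% Deviations of the observed points from the fitted line
e = Y - (b0 + b1*X);

fprintf('Sum of deviations          = %.2e\n', sum(e));       % zero up to rounding
fprintf('Sum of absolute deviations = %.4f\n', sum(abs(e)));
fprintf('Sum of squared deviations  = %.4f\n', sum(e.^2));
```

The plain sum is zero apart from floating-point rounding, so it cannot distinguish a good fit from a poor one, whereas the absolute and squared sums are strictly positive here.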
Starting from Eq. (6.3) calling for the minimization of the sum of the squared
deviations or residuals, derive (a) normal Eq. (6.4) and (b) normal Eq. (6.5).
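(a)
$$\sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \hat{b}_0 - \hat{b}_1 X_i)^2$$
(b) Normal Eq. (6.4) is derived by minimizing $\sum e_i^2$ with respect to $\hat{b}_0$:
$$\frac{\partial \sum e_i^2}{\partial \hat{b}_0} = \frac{\partial \sum (Y_i - \hat{b}_0 - \hat{b}_1 X_i)^2}{\partial \hat{b}_0} = 0$$
$$2 \sum (Y_i - \hat{b}_0 - \hat{b}_1 X_i)(-1) = 0$$
$$\sum (Y_i - \hat{b}_0 - \hat{b}_1 X_i) = 0$$
$$\sum Y_i = n\hat{b}_0 + \hat{b}_1 \sum X_i \qquad (B1)$$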
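The derivation of the second normal equation, referred to as (B2) in the next step, is not shown in these notes. Assuming (B2) denotes normal Eq. (6.5), it follows in the same way by minimizing the sum of squared residuals with respect to $\hat{b}_1$:

```latex
\frac{\partial \sum e_i^2}{\partial \hat{b}_1}
  = \frac{\partial \sum (Y_i - \hat{b}_0 - \hat{b}_1 X_i)^2}{\partial \hat{b}_1} = 0

2 \sum (Y_i - \hat{b}_0 - \hat{b}_1 X_i)(-X_i) = 0

\sum (X_i Y_i - \hat{b}_0 X_i - \hat{b}_1 X_i^2) = 0

\sum X_i Y_i = \hat{b}_0 \sum X_i + \hat{b}_1 \sum X_i^2 \qquad (B2)
```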
Solve simultaneously (B1) and (B2) to get values of 𝑏̂1 and 𝑏̂0
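(a) Multiplying Eq. (?) by $n$ and Eq. (?) by $\sum X_i$, we get
$$n \sum X_i Y_i = \hat{b}_0 \, n \sum X_i + \hat{b}_1 \, n \sum X_i^2 \qquad (A1)$$
$$\sum X_i \sum Y_i = \hat{b}_0 \, n \sum X_i + \hat{b}_1 \left(\sum X_i\right)^2 \qquad (A2)$$
Subtracting Eq. (A2) from Eq. (A1), we get
$$n \sum X_i Y_i - \sum X_i \sum Y_i = \hat{b}_1 \left[ n \sum X_i^2 - \left(\sum X_i\right)^2 \right] \qquad (A3)$$
Solving Eq. (A3) for $\hat{b}_1$, we get
$$\hat{b}_1 = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2} \qquad (A4)$$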
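(b) Equation (A5) is obtained by simply solving Eq. (B1) for $\hat{b}_0$.
Dividing (B1) through by $n$ and rearranging gives
$$\hat{b}_0 = \bar{Y} - \hat{b}_1 \bar{X} \qquad (A5)$$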
n    Y_i    X_i    X_i Y_i    X_i^2
1    102    114    11,628     12,996
2    106    118    12,508     13,924
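Only the first rows of the worksheet survive above. A hedged MATLAB sketch of how the full worksheet and the estimates from Eq. (A4) and Eq. (B1) might be computed for the data of Table 4 is given below; the variable names are illustrative. Note that the unrounded intercept comes out a little below the 2.30 used in the text, which is what results when $\hat{b}_1$ is first rounded to 0.86: 127 - 0.86(145) = 2.30.

```matlab
% Data of Table 4 as column vectors (1988-1999)
Y = [102 106 108 110 122 124 128 130 142 148 150 154]';   % consumption
X = [114 118 126 130 136 140 148 156 160 164 170 178]';   % disposable income
n = numel(X);

% Worksheet columns: n, Y, X, X*Y, X^2
XY = X .* Y;
X2 = X .^ 2;
worksheet = [(1:n)' Y X XY X2];
disp(worksheet);

% Slope from Eq. (A4)
b1 = (n*sum(XY) - sum(X)*sum(Y)) / (n*sum(X2) - sum(X)^2);

% Intercept from Eq. (B1) solved for b0
b0 = mean(Y) - b1*mean(X);

% Intercept obtained when the slope is first rounded to two decimals
b0_rounded = mean(Y) - (round(b1*100)/100)*mean(X);

fprintf('b1 = %.4f, b0 = %.4f\n', b1, b0);
fprintf('b0 using b1 rounded to 2 decimals = %.2f\n', b0_rounded);
```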
(b) Plot the regression line and show the deviations of each $Y_i$ from the
corresponding $\hat{Y}_i$.
(b) To plot the regression equation, we need to define any two points on the
regression line.
For example, when 𝑋𝑖 = 114, 𝑌̂𝑖 = 2.30 + 0.86(114) = 100.34.
When 𝑋𝑖 = 178, 𝑌̂𝑖 = 2.30 + 0.86(178) = 155.38.
The consumption regression line is plotted in Fig. (?), where the positive
and negative residuals are also shown.
The regression line represents the best fit to the random sample of
consumption-disposable income observations in the sense that it
minimizes the sum of the squared (vertical) deviations from the line.
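The figure itself is not reproduced in these notes. A minimal MATLAB sketch of how such a plot might be drawn is shown below, using the rounded estimates from the text (2.30 and 0.86) and the two end points computed above; the line styles, labels, and variable names are illustrative choices.

```matlab
% Data of Table 4 as column vectors (1988-1999)
Y = [102 106 108 110 122 124 128 130 142 148 150 154]';   % consumption
X = [114 118 126 130 136 140 148 156 160 164 170 178]';   % disposable income

% Rounded estimates reported in the text
b0 = 2.30;
b1 = 0.86;
Yhat = b0 + b1*X;               % fitted value for each observation

% Two points that define the regression line
Xline = [114 178];
Yline = b0 + b1*Xline;

figure; hold on;
plot(X, Y, 'o');                % observed points
plot(Xline, Yline, '-');        % regression line
for i = 1:numel(X)
    % vertical segment from the observed point to the line (residual)
    plot([X(i) X(i)], [Y(i) Yhat(i)], ':');
end
hold off;
xlabel('Disposable income, X');
ylabel('Consumption, Y');
legend('Observed', 'Regression line', 'Location', 'northwest');
```

Points above the line have positive residuals and points below it have negative residuals.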
Assignment
(a) Starting with Eq. (?), derive the equation for $\hat{b}_1$ in deviation
form for the case where $\bar{X} = \bar{Y} = 0$.
(b) What is the value of $\hat{b}_0$ when $\bar{X} = \bar{Y} = 0$?
Construct the 95% confidence interval for (a) $b_0$ and (b) $b_1$ in the above
problem.
(a) The 95% confidence interval for $b_0$ is given by
$$b_0 = \hat{b}_0 \pm 2.228\, s_{\hat{b}_0} = 2.30 \pm 2.228(7.17) = 2.30 \pm 15.97$$
So $b_0$ is between -13.67 and 18.27 with 95% confidence. Note how wide
(and meaningless) the 95% confidence interval for $b_0$ is, reflecting the fact
that $\hat{b}_0$ is highly insignificant.
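The interval in part (a) can be reproduced with a few MATLAB statements. The sketch below takes the estimate, its standard error, and the tabulated t value directly from the text rather than recomputing them; the variable names are illustrative.

```matlab
% Values taken from the text
b0_hat = 2.30;     % estimated intercept
s_b0   = 7.17;     % standard error of the intercept
t_crit = 2.228;    % t value for 10 df at the 5% level (two-tailed)

% Half-width and end points of the 95% confidence interval
half_width = t_crit * s_b0;
ci = [b0_hat - half_width, b0_hat + half_width];

fprintf('95%% confidence interval for b0: [%.2f, %.2f]\n', ci(1), ci(2));
```

Because the interval contains zero, the hypothesis that $b_0 = 0$ cannot be rejected at the 5% level.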
Assignment:
C1.1 Find $R^2$ for the estimated consumption regression of the previous
problem using the equations (a) $R^2 = \sum \hat{y}_i^2 / \sum y_i^2$ and
(b) $R^2 = 1 - \sum e_i^2 / \sum y_i^2$. Also find the results using MATLAB
statements.
C1.3 Table (C1) gives the per capita income to the nearest $100 (𝑌) and
the percentage of the economy represented by agriculture (𝑋)
reported by the World Bank World Development Indicators for 1999
for 15 Latin American countries.
(a) Estimate the regression equation of 𝑌 on 𝑋.
(b) Test at the 5% level of significance for the statistical significance
of the parameters.
(c) Find the coefficient of determination.
(d) Use MATLAB statements to carry out all the computations in
(a), (b) and (c).
(e) Report the results obtained in parts (a), (b) and (c) in standard
summary form.
Table (C1)
𝐶𝑜𝑢𝑛𝑡𝑟𝑦 (1) (2) (3) (4) (5) (6) (7) (8)
𝑛 1 2 3 4 5 6 7 8
𝑌𝑖 76 10 44 47 23 19 13 19
𝑋𝑖 6 16 9 8 14 11 12 10
*Key: (1) Argentina; (2) Bolivia; (3) Brazil; (4) Chile; (5) Colombia; (6)
Dominican Republic; (7) Ecuador; (8) El Salvador; (9) Honduras; (10)
Mexico; (11) Nicaragua; (12) Panama; (13) Peru; (14) Uruguay; (15)
Venezuela.
Source: World Bank World Development Indicators.
C1.4 Draw a scatter diagram for the data in Table (C2) and determine by
inspection if there is an approximate linear relationship between 𝑌𝑖
and 𝑋𝑖.
C1.5 For the data in Table (C2), find the value of (a) $\hat{b}_1$ and (b) $\hat{b}_0$.
(c) Write the equation for the estimated OLS regression line.
Table (C2)
Observations on variables Y and X
𝑛 𝑌𝑖 𝑋𝑖
1 20 2
2 28 3
3 40 5
4 45 4
5 37 3
6 52 5
7 54 7
8 43 6
9 65 7
10 56 8
C1.6 (a) On a set of axes, plot the data in Table (C2), plot the estimated
OLS regression line and show the residuals.
(b) Show algebraically that the regression line goes through the point
$(\bar{X}, \bar{Y})$.
C1.7 For the data in Table (C2), find (a) $s^2$, (b) $s_{\hat{b}_0}^2$ and
$s_{\hat{b}_0}$, and (c) $s_{\hat{b}_1}^2$ and $s_{\hat{b}_1}$.