CHAPTER THREE
ESTIMATION OF MULTIPLE LINEAR REGRESSION MODEL
3.1 The Multiple Regression Model
The simple regression analysis studied previously is often inadequate in practice because it assumes
that only one independent variable explains the variation in the dependent variable. In most
economic theory, however, the variation in the dependent variable is explained by a number of
explanatory variables. Therefore, we need to extend the simple regression analysis so that two or
more explanatory variables are involved in explaining the variation in the dependent variable.
Multiple regression analysis allows us to explicitly include many factors that simultaneously
affect the dependent variable. In this situation, more of the variation in the dependent variable
can be explained by the potential explanatory variables. This is important both for testing
economic theories and for evaluating policy effects when we must rely on non-experimental data.
As a result, multiple regression analysis is still the most widely used vehicle for empirical
analysis in economics and other social sciences.
Since multiple regression analysis allows many observed factors (X's) to affect the dependent
variable (Y), the general multiple linear regression model with K independent variables can be
written in the population as:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_K X_{Ki} + U_i$ ...............(3.1)
where $\beta_0$ is the intercept, $\beta_1$ is the parameter associated with $X_1$, $\beta_2$ is the parameter associated
with $X_2$, and so on. We can also refer to the parameters other than the intercept as slope
parameters; $U$ is the error term or disturbance that contains factors other than $X_1, X_2, \dots, X_K$
that affect Y, because no matter how many explanatory variables are included in the model, there
will always be factors we cannot include in the model.
To illustrate the multiple regression model with many independent variables, it is useful to recall
the wage example. Besides the single independent variable (education) used in the simple
regression model, there are other variables, such as work experience, job training, and ability,
that affect the wage level.
The simplest possible multiple regression model is a model with two independent variables,
which can be written in the population as:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + U_i$ ........................................(3.2)
where Y is the dependent variable, $X_1$ and $X_2$ are explanatory variables, and the $\beta$'s and $U_i$
are as defined in the earlier equation.
Note that the intercept term $\beta_0$ gives the mean or average value of Y when all explanatory
variables are excluded from the model (i.e., $X_1 = X_2 = 0$). The slope coefficients $\beta_1$ and $\beta_2$
give the partial effect of the explanatory variable under consideration on the dependent variable,
given that the value of the other explanatory variable is held constant. This is the concept of a
partial effect, in which the effect of a particular explanatory variable is analyzed by holding
constant the values of the other explanatory variables.
As a result, the two-variable multiple regression model above is interpreted as the conditional
mean or expected value of Y, conditional upon the given or fixed values of $X_1$ and $X_2$.
Given n observations (i = 1, 2, ..., n) on Y, $X_1$ and $X_2$, we can obtain the estimates
$\hat\beta_0$, $\hat\beta_1$ and $\hat\beta_2$ using the method of ordinary least squares (OLS), which
minimizes the sum of squared residuals.
Minimize $\sum \hat U_i^2 = \sum (Y_i - \hat Y_i)^2$, where $\hat Y_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i}$.

Thus, we minimize $\sum \hat U_i^2 = \sum (Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i})^2$.

The partial derivative of $\sum \hat U_i^2$ with respect to $\hat\beta_0$ gives the following result:

$\dfrac{\partial \sum \hat U_i^2}{\partial \hat\beta_0} = -2 \sum (Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i}) = 0$
In similar fashion, we compute the partial derivatives of $\sum \hat U_i^2$ with respect to $\hat\beta_1$ and $\hat\beta_2$:

$\dfrac{\partial \sum \hat U_i^2}{\partial \hat\beta_1} = -2 \sum (Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i})(X_{1i}) = 0$

$\dfrac{\partial \sum \hat U_i^2}{\partial \hat\beta_2} = -2 \sum (Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i})(X_{2i}) = 0$
By solving the three normal equations (3.8), (3.10) and (3.11), which are obtained from the
partial derivatives, we can find $\hat\beta_0$, $\hat\beta_1$ and $\hat\beta_2$. Alternatively, we can solve for
$\hat\beta_1$ and $\hat\beta_2$ by substituting the value of $\hat\beta_0$ from equation (3.9) into the two normal
equations (3.10) and (3.11) and undertaking some algebraic manipulation.
Thus, from the simultaneous system of normal equations, we obtain the values of $\hat\beta_1$ and $\hat\beta_2$ as:

$\hat\beta_1 = \dfrac{\sum(X_{1i}-\bar X_1)(Y_i-\bar Y)\sum(X_{2i}-\bar X_2)^2 - \sum(X_{2i}-\bar X_2)(Y_i-\bar Y)\sum(X_{1i}-\bar X_1)(X_{2i}-\bar X_2)}{\sum(X_{1i}-\bar X_1)^2\sum(X_{2i}-\bar X_2)^2 - \left[\sum(X_{1i}-\bar X_1)(X_{2i}-\bar X_2)\right]^2}$ ...(3.12)

$\hat\beta_2 = \dfrac{\sum(X_{2i}-\bar X_2)(Y_i-\bar Y)\sum(X_{1i}-\bar X_1)^2 - \sum(X_{1i}-\bar X_1)(Y_i-\bar Y)\sum(X_{1i}-\bar X_1)(X_{2i}-\bar X_2)}{\sum(X_{1i}-\bar X_1)^2\sum(X_{2i}-\bar X_2)^2 - \left[\sum(X_{1i}-\bar X_1)(X_{2i}-\bar X_2)\right]^2}$ ...(3.13)
If we write the above equations (3.12) and (3.13) in lower-case letters for deviations from the
means (i.e., $x_{1i} = X_{1i} - \bar X_1$, $x_{2i} = X_{2i} - \bar X_2$, and $y_i = Y_i - \bar Y$), the equations reduce to:

$\hat\beta_1 = \dfrac{(\sum y_i x_{1i})(\sum x_{2i}^2) - (\sum y_i x_{2i})(\sum x_{1i} x_{2i})}{(\sum x_{1i}^2)(\sum x_{2i}^2) - (\sum x_{1i} x_{2i})^2}$ .......................(3.14)

$\hat\beta_2 = \dfrac{(\sum y_i x_{2i})(\sum x_{1i}^2) - (\sum y_i x_{1i})(\sum x_{1i} x_{2i})}{(\sum x_{1i}^2)(\sum x_{2i}^2) - (\sum x_{1i} x_{2i})^2}$ .......................(3.15)
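To make the mechanics concrete, the following minimal sketch (Python with NumPy, using made-up illustrative data; the variable names are not from the text) computes $\hat\beta_1$ and $\hat\beta_2$ directly from equations (3.14) and (3.15), and recovers $\hat\beta_0$ from the first normal equation as $\hat\beta_0 = \bar Y - \hat\beta_1 \bar X_1 - \hat\beta_2 \bar X_2$.

```python
import numpy as np

# Illustrative (made-up) data on Y, X1 and X2 for i = 1, ..., n
Y  = np.array([10.0, 12.0, 15.0, 14.0, 18.0, 20.0, 21.0, 24.0])
X1 = np.array([ 1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0])
X2 = np.array([ 2.0,  1.0,  4.0,  3.0,  6.0,  5.0,  8.0,  7.0])

# Deviations from the means: y = Y - Ybar, x1 = X1 - X1bar, x2 = X2 - X2bar
y, x1, x2 = Y - Y.mean(), X1 - X1.mean(), X2 - X2.mean()

# Building blocks of equations (3.14) and (3.15)
Sx1x1, Sx2x2, Sx1x2 = np.sum(x1 * x1), np.sum(x2 * x2), np.sum(x1 * x2)
Syx1, Syx2 = np.sum(y * x1), np.sum(y * x2)

denom = Sx1x1 * Sx2x2 - Sx1x2**2                      # common denominator

beta1_hat = (Syx1 * Sx2x2 - Syx2 * Sx1x2) / denom     # equation (3.14)
beta2_hat = (Syx2 * Sx1x1 - Syx1 * Sx1x2) / denom     # equation (3.15)

# Intercept from the first normal equation
beta0_hat = Y.mean() - beta1_hat * X1.mean() - beta2_hat * X2.mean()

print(beta0_hat, beta1_hat, beta2_hat)
```

On any data set, the values produced this way should agree with the coefficients reported by a standard OLS routine.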
The intercept $\hat\beta_0$ is the predicted value of Y when $X_1 = 0$ and $X_2 = 0$. The slope estimates
$\hat\beta_1$ and $\hat\beta_2$ are the partial effects of each independent variable on Y, holding the other
explanatory variable constant. This is also termed the 'ceteris paribus' interpretation of the
slope estimates.
For example, when $X_2$ is held fixed, so that $\Delta X_2 = 0$, the predicted change in Y results only
from the change in $X_1$ (note that the intercept plays no role in changes in Y), and the associated
estimated equation becomes $\Delta\hat Y = \hat\beta_1 \Delta X_1$; similarly, when $X_1$ is held fixed,
$\Delta\hat Y = \hat\beta_2 \Delta X_2$.
Note that sometimes we may want to know the effect of changing more than one independent variable
simultaneously. This is easily done using equation (3.16). For example, if we want to know the
effect on Y of a one-unit increase in both explanatory variables $X_1$ and $X_2$ simultaneously, we
add the coefficients of $X_1$ and $X_2$ and multiply by 100 to turn the effect into a percentage.
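As a rough numerical sketch of this calculation (the coefficient values below are hypothetical; the multiplication by 100 applies when the dependent variable is measured in logarithms, as in the wage example):

```python
# Hypothetical slope estimates from a log(wage)-type regression
beta1_hat = 0.092   # effect of a one-unit increase in X1 (e.g. education)
beta2_hat = 0.004   # effect of a one-unit increase in X2 (e.g. experience)

# Predicted change in log(Y) when both X1 and X2 rise by one unit
combined_effect = beta1_hat + beta2_hat

# Approximate percentage change in Y (multiply by 100)
print(f"Predicted change: {100 * combined_effect:.1f}%")
```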
To derive $R^2$, we simply follow the procedure from the simple regression model, first defining the
total sum of squares (TSS), the explained sum of squares (ESS), and the residual sum of squares
(RSS) as:

TSS $= \sum (Y_i - \bar Y)^2 = \sum y_i^2$ ......................................(3.19)
ESS $= \sum (\hat Y_i - \bar Y)^2 = \sum \hat y_i^2$ ......................................(3.20)
RSS $= \sum (Y_i - \hat Y_i)^2 = \sum \hat U_i^2$ ......................................(3.21)
Using the same argument as in the simple regression case, the total variation in $Y_i$ is the sum of
the total variation in $\hat Y_i$ and the total variation in $\hat U_i$:
i.e., TSS = ESS + RSS
$\sum y_i^2 = \sum \hat y_i^2 + \sum \hat U_i^2$ ..................................(3.22)
Assuming that the total variation in Y is nonzero, as is the case unless Yi is constant in the
sample, we can divide (3.22) by TSS to get:
ESS/TSS + RSS/TSS = 1
Just as in the simple regression case, $R^2$ is defined as:
R2 = ESS/TSS = 1 – RSS/TSS ………………..….(3.23)
Since ESS $= \hat\beta_1 \sum y_i x_{1i} + \hat\beta_2 \sum y_i x_{2i}$ (through some algebraic manipulation),
the equation for $R^2$ can also be written as:

$R^2 = \dfrac{\hat\beta_1 \sum y_i x_{1i} + \hat\beta_2 \sum y_i x_{2i}}{\sum y_i^2}$

where $y_i$, $x_{1i}$ and $x_{2i}$ are lower-case letters indicating deviations from the mean values.
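A minimal sketch of this computation (Python with NumPy; the function name and the made-up data below are illustrative, and the fitted values are assumed to come from an earlier OLS estimation such as the sketch above):

```python
import numpy as np

def r_squared(Y, Y_hat):
    """R^2 = ESS/TSS = 1 - RSS/TSS for fitted values Y_hat of Y (eq. 3.23)."""
    tss = np.sum((Y - Y.mean()) ** 2)   # total sum of squares, eq. (3.19)
    rss = np.sum((Y - Y_hat) ** 2)      # residual sum of squares, eq. (3.21)
    return 1.0 - rss / tss              # equivalently ESS/TSS, since TSS = ESS + RSS

# Illustrative use with made-up data and hypothetical OLS fitted values:
Y     = np.array([10.0, 12.0, 15.0, 14.0, 18.0])
Y_hat = np.array([10.5, 12.2, 14.1, 14.8, 17.4])
print(r_squared(Y, Y_hat))
```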
3.4.1 Assumptions
The assumptions of the multiple linear regression model are:
Assumption 1: Linearity in parameters
The model is linear in the parameters; that is, the dependent variable is a linear function of the
unknown parameters. Thus the model in the population can be written as:
$Y = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_K X_{Ki} + U$ ..................(3.25)
where $\beta_0, \beta_1, \dots, \beta_K$ are the unknown parameters (constants) and U is the unobserved
random term or disturbance term.
Assumption 5: Homoskedasticity
The variance of the error term, U, conditional on the explanatory variables, is the same for all
combinations of outcomes of the explanatory variables.
For instance, given three explanatory variables, the variance of U can be written as:
$Var(U \mid X_1, X_2, X_3) = \sigma^2$ ........................(3.28)
The homoskedasticity assumption requires that the variance of the unobserved error term U does
not depend on the particular combination of explanatory variables; rather, it is constant regardless
of the values of the explanatory variables. If this assumption fails, the model exhibits
heteroskedasticity, which means that the variance of the error term varies across observations.
Assumption 6: Normality of U
The population error U is independent of the explanatory variables and is normally distributed
with zero mean and variance $\sigma^2$:
$U \sim Normal(0, \sigma^2)$, i.e., $U \sim N(0, \sigma^2)$ ..............................(3.29)
Assumption 7: No autocorrelation
The values of $U_i$ corresponding to $X_i$ are independent of the values of any other $U_j$ (for $i \neq j$).
ii. Unbiasedness: the OLS estimators $\hat\beta_j$ are unbiased estimators of the population parameters:
$E(\hat\beta_j) = \beta_j$, for $j = 0, 1, \dots, k$,
i.e., $E(\hat\beta_j) - \beta_j = 0$ ..........................................................(3.32)
iii. Best: the word best refers to the smallest variance of the OLS estimators within the class of
linear unbiased estimators. Therefore, since the method of OLS gives the minimum variance of the
estimators, it is the best one compared with other competing methods such as maximum likelihood,
two-stage least squares, etc.
Due to the BLUE properties of the OLS estimators, assumptions 1 through 5 are called the
Gauss-Markov assumptions. Finally, the Gauss-Markov theorem states that under assumptions 1
through 5, $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_K$ are the best linear unbiased estimators (BLUEs).
Therefore, the means of the estimators $\hat\beta_0$, $\hat\beta_1$ and $\hat\beta_2$ equal the corresponding
population parameters $\beta_0$, $\beta_1$ and $\beta_2$, respectively.
Similarly, we can derive the variances and standard errors of these estimates in a manner as follows:

$Var(\hat\beta_0) = \sigma^2_U \left[ \dfrac{1}{n} + \dfrac{\bar X_1^2 \sum x_{2i}^2 + \bar X_2^2 \sum x_{1i}^2 - 2\bar X_1 \bar X_2 \sum x_{1i}x_{2i}}{\sum x_{1i}^2 \sum x_{2i}^2 - (\sum x_{1i}x_{2i})^2} \right]$ .........(3.34)

$Var(\hat\beta_1) = \dfrac{\sigma^2_U \sum x_{2i}^2}{\sum x_{1i}^2 \sum x_{2i}^2 - (\sum x_{1i}x_{2i})^2}$ .....................................(3.35)

$Var(\hat\beta_2) = \dfrac{\sigma^2_U \sum x_{1i}^2}{\sum x_{1i}^2 \sum x_{2i}^2 - (\sum x_{1i}x_{2i})^2}$ .....................................(3.36)

where $\hat\sigma^2_U = \sum \hat U_i^2 / (n - k)$, n is the number of observations, k is the number of
parameters, and the lower-case x's denote deviations from the means as before.
The corresponding standard errors are the square roots of these variances: $se(\hat\beta_j) = \sqrt{Var(\hat\beta_j)}$.
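A minimal sketch of formulas (3.34)-(3.36) and the associated standard errors (Python with NumPy; the function name and the assumption k = 3, i.e., two slopes plus the intercept, are illustrative):

```python
import numpy as np

def ols_se_two_regressors(Y, X1, X2, beta0, beta1, beta2):
    """Standard errors of the OLS estimates in the two-regressor model."""
    n, k = len(Y), 3                                  # k parameters incl. intercept
    x1, x2 = X1 - X1.mean(), X2 - X2.mean()           # deviations from means
    resid = Y - (beta0 + beta1 * X1 + beta2 * X2)     # OLS residuals
    sigma2 = np.sum(resid**2) / (n - k)               # sigma_hat^2 = sum(U^2)/(n - k)

    Sx1x1, Sx2x2, Sx1x2 = np.sum(x1**2), np.sum(x2**2), np.sum(x1 * x2)
    denom = Sx1x1 * Sx2x2 - Sx1x2**2

    var_b1 = sigma2 * Sx2x2 / denom                   # equation (3.35)
    var_b2 = sigma2 * Sx1x1 / denom                   # equation (3.36)
    var_b0 = sigma2 * (1.0 / n +                      # equation (3.34)
                       (X1.mean()**2 * Sx2x2 + X2.mean()**2 * Sx1x1
                        - 2 * X1.mean() * X2.mean() * Sx1x2) / denom)

    return np.sqrt([var_b0, var_b1, var_b2])          # standard errors
```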
From this wage equation, for example, the null hypothesis $H_0: \beta_2 = 0$ means that, once
education and tenure have been accounted for, the number of years in the work force (exp) has no
effect on the hourly wage. The alternative hypothesis states the opposite.
The statistic used to test the null hypothesis (against any alternative hypothesis) is called the
t-statistic or t-ratio and is defined as:

$t_{\hat\beta_j} = \dfrac{\hat\beta_j - \beta_j}{se(\hat\beta_j)} = \dfrac{\hat\beta_j - 0}{se(\hat\beta_j)} = \dfrac{\hat\beta_j}{se(\hat\beta_j)}$ ....................................(3.41)
In addition to the t-statistic, the theoretical or critical value of t, $t_c$, at some significance
level α and degrees of freedom n - k - 1, is required to determine the decision rule for rejecting
or not rejecting the null hypothesis.
The critical values of t are obtained directly from the t-distribution table using the level of
significance and the degrees of freedom. As the degrees of freedom get large, the t-distribution
approaches the normal distribution. For example, when n - k - 1 = 120, the 5% critical value for a
one-tailed test is 1.658, compared with the standard normal value of 1.645. These are close enough
for practical purposes; for degrees of freedom greater than 120, one can use the standard normal
critical values.
Some treatments of regression analysis define the t-statistic as an absolute value, so that it is
always positive. However, this practice has the drawback of making tests against one-sided
alternatives clumsy. Therefore, throughout this text, for the one-sided case the t-statistic
always has the same sign as the corresponding OLS coefficient estimate.
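A minimal sketch of the comparison between the t-statistic and the critical value (Python with scipy.stats; the estimate, standard error and sample size below are hypothetical):

```python
from scipy import stats

beta_hat, se, n, k = 0.046, 0.010, 408, 3     # hypothetical estimate and inputs
t_stat = beta_hat / se                        # t = beta_hat / se(beta_hat), eq. (3.41)

df = n - k - 1                                # degrees of freedom
t_crit_one_sided = stats.t.ppf(0.95, df)      # 5% one-tailed critical value
t_crit_two_sided = stats.t.ppf(0.975, df)     # 5% two-tailed critical value

print(t_stat, t_crit_one_sided, t_crit_two_sided)
print("reject H0 (right-tailed)?", t_stat > t_crit_one_sided)
```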
In order to determine the decision rule for rejecting or not rejecting the null hypothesis
$H_0: \beta_j = 0$, we need to specify the relevant alternative hypothesis. There are two ways in
which the alternative hypothesis can be formulated.
i. One-Sided Alternative
Positive Alternative: this is the right-tailed test in which the hypothesized population parameter $\beta_j$ is
positive, so the alternative hypothesis is formulated as $H_1: \beta_j > 0$. Since $\beta_j$ is greater than zero, the
t-statistic is expected to be positive, and the decision rule is to reject $H_0$ if $t_s > t_c$.
Negative Alternative: this is the left-tailed test in which the hypothesized population parameter is negative,
so the alternative hypothesis is formulated as $H_1: \beta_j < 0$. Since $\beta_j$ is less than zero, the
t-statistic is expected to be negative. Thus, to compare this negative t-statistic with the t-critical value we
must take the negative of the tabulated critical value, since the table reports only positive values.
Decision rule:
If $t_s < -t_c$, then reject $H_0$.
Illustration
Suppose that the estimated wage equation, with standard errors in parentheses, is:
$\widehat{\log(wage)} = 2.274 + 0.00046\,educ + 0.48\,exp + 0.0002\,ten$
                        (6.113)   (0.00010)      (0.40)    (0.00022) ............................(3.42)
n = 408, $R^2$ = 0.0541
Test the significance of education (right-tailed test) and the significance of tenure (left-tailed
test) at the 5% significance level.
Significance of education: $t_s = 0.00046/0.00010 = 4.6 > t_c \approx 1.645$, so we reject $H_0$; education is statistically significant in determining wage.
Significance of tenure: $t_s = 0.0002/0.00022 \approx 0.91$.
Since $t_s > -t_c$, we fail to reject $H_0$, so tenure is not statistically significant in determining wage.
When the alternative is two-sided, we are interested in the absolute value of the t-statistic. Thus the decision
rule for the null hypothesis $H_0: \beta_j = 0$ against the two-sided alternative $H_1: \beta_j \neq 0$ is to reject
$H_0$ if $|t_s| > t_c$.
Note that to find the t-critical value for a two-tailed test, the significance level α has to be divided by
2 so as to share the significance level equally between the two tails of the t-distribution. For the sake of
illustration, we can test the significance of experience in the earlier estimated wage equation
(3.42).
Here $|t_s| = 0.48/0.40 = 1.2 < t_c \approx 1.96$, so we fail to reject $H_0$; hence experience is not statistically
significant in determining wage at the 5% level of significance.
Although $H_0: \beta_j = 0$ is the most common hypothesis, we sometimes want to test whether $\beta_j$ equals
some other given constant; two common examples are $\beta_j = 1$ and $\beta_j = -1$. More generally, if the
null hypothesis is $H_0: \beta_j = a_j$ for some constant $a_j$, the t-statistic becomes
$t = (\hat\beta_j - a_j)/se(\hat\beta_j)$.
Instead of testing at different significance levels, we can use the p-value for testing the parameter.
The p-value is the probability of obtaining a t-statistic at least as extreme as the one observed if the
null hypothesis is true; equivalently, it is the smallest significance level at which the null
hypothesis would be rejected.
We obtain the p-value by computing the probability that a t-distributed random variable, denoted by
T, with n - k - 1 degrees of freedom is larger in absolute value than the observed value of the
t-statistic, |t|:

p-value = $P(|T| > |t|)$ .................................................(3.44)

Therefore, the p-value is the significance level of the test when we use the value of the t-statistic
as the critical value for the test; the smaller the p-value, the stronger the evidence against the
null hypothesis.
In most cases the p-value is computed by an econometric package such as STATA rather than read from
a table. Once the p-value has been computed, a classical test can be carried out at any desired
level: if α denotes the significance level of the test (in decimal form), then $H_0$ is rejected if
the p-value < α.
Computing the p-value for one-sided alternatives is also quite simple. Although some regression
packages compute p-values only for two-sided alternatives, the one-sided p-value can be obtained by
simply dividing the two-sided p-value by 2.
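A minimal sketch of these p-value calculations (Python with scipy.stats; the t-statistic and degrees of freedom below are hypothetical):

```python
from scipy import stats

t_stat, df = 1.72, 404                          # hypothetical t-statistic and df (n - k - 1)

p_two_sided = 2 * stats.t.sf(abs(t_stat), df)   # P(|T| > |t|), eq. (3.44)
p_one_sided = p_two_sided / 2                   # one-sided p-value

alpha = 0.05
print("two-sided:", p_two_sided, "reject H0?", p_two_sided < alpha)
print("one-sided:", p_one_sided, "reject H0?", p_one_sided < alpha)
```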
Illustration
Suppose you estimate a regression model and obtain a p-value of 0.086 for a particular parameter estimate
$\hat\beta_1$. Given a 5% significance level, test the significance of the parameter $\beta_1$ against the
one-tailed and two-tailed alternatives.
For the two-tailed (classical) test we fail to reject $H_0$, since the p-value of 0.086 exceeds α = 0.05,
while for the one-tailed test we reject $H_0$, since p/2 = 0.043 < 0.05.
Note: (1) When $H_0$ is not rejected, we prefer to use the language "we fail to reject $H_0$ at the x%
level" rather than "$H_0$ is accepted at the x% level." (2) Statistical significance is distinct from
economic (practical) significance.
Confidence Interval
A confidence interval (also called an interval estimate) for the population parameter $\beta_j$ provides a
range of likely values for $\beta_j$. For example, a 95% confidence interval (CI) for the unknown $\beta_j$ is
given by:

$\hat\beta_j \pm t_c \cdot se(\hat\beta_j)$

where $t_c$ is the two-tailed 5% critical value of the t-distribution with n - k - 1 degrees of freedom.
As an example, for df = n - k - 1 = 25, a 95% CI for a parameter whose estimate $\hat\beta_j$ is -0.056 is
$-0.056 \pm 0.0148$, and such intervals contain the true parameter in 95% of samples. That is, if 100
samples of size 29 are selected and 100 confidence intervals like (3.46) are constructed, we expect 95
of them to contain the true population parameter $\beta_j$.
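A minimal sketch of the interval construction (Python with scipy.stats; the standard error below is hypothetical, while the point estimate and degrees of freedom follow the example):

```python
from scipy import stats

beta_hat, se, df = -0.056, 0.02, 25           # hypothetical se; estimate and df from the example
t_crit = stats.t.ppf(0.975, df)               # two-tailed 5% critical value (about 2.06 for df = 25)

lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se
print(f"95% CI: ({lower:.4f}, {upper:.4f})")
```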
Now we consider the joint effect of two or more coefficients in the multiple regression model using
the F-test. For the sake of simplicity, take the simplified form of the multiple regression with
three variables (i.e., with two independent variables) and consider the following null hypothesis:

$H_0: \beta_1 = \beta_2 = 0$

This null hypothesis is a joint hypothesis that $\beta_1$ and $\beta_2$ are jointly, or simultaneously, equal to
zero. A test of such a hypothesis is called a test of the overall significance of the regression.
Unlike testing an individual hypothesis using the t-test, two or more parameters are tested jointly
using the F-test, whose value is calculated through the analysis of variance (ANOVA) technique.
That is, we first compute the TSS, ESS and RSS with their respective degrees of freedom n - 1, k - 1
and n - k, where n is the total number of observations and k is the total number of parameters to be
estimated, including the intercept term.
Thus, the F-statistic is computed using the formula:

$F_S = \dfrac{ESS/(k-1)}{RSS/(n-k)}$
Once the F-statistic ($F_S$) is computed using the above formula, we can test the overall significance
by comparing this calculated value with the critical value ($F_C$) obtained from the F-table at the
numerator degrees of freedom (k - 1) and denominator degrees of freedom (n - k), at the α level of
significance. Therefore, the decision rule is:
If $F_S > F_C$, then reject $H_0$.
As an example of testing overall significance, consider the following model to explain students'
scores in terms of factors such as class size (classize) and teachers' compensation (tchcomp):

$Score = \beta_0 + \beta_1\,classize + \beta_2\,tchcomp + U$

Given 20 observations, with TSS = 363 and ESS = 257, test the statistical significance of the slope
parameters jointly at the 1% level of significance.
$F_S = \dfrac{ESS/(k-1)}{RSS/(n-k)} = \dfrac{257/(3-1)}{106/(20-3)} = \dfrac{257/2}{106/17} = 20.60$

$F_C = F_{\alpha,\,ndf,\,ddf} = F_{0.01,\,2,\,17} = 6.11$
Since $F_S > F_C$, we reject $H_0$, so the factors class size and teachers' compensation are jointly
statistically significant in affecting students' performance.
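The arithmetic of this example can be verified with a short sketch (Python with scipy.stats for the 1% critical value):

```python
from scipy import stats

tss, ess, n, k = 363.0, 257.0, 20, 3       # values from the example
rss = tss - ess                            # 106

F_stat = (ess / (k - 1)) / (rss / (n - k))     # F = (ESS/(k-1)) / (RSS/(n-k))
F_crit = stats.f.ppf(0.99, k - 1, n - k)       # 1% critical value, F(2, 17)

print(F_stat, F_crit)                      # approximately 20.6 and 6.11
print("reject H0?", F_stat > F_crit)
```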
Thus, the F-test, which is a measure of the overall significance of the estimated regression, is
equivalent to a test of the significance of $R^2$; that is, testing the null hypothesis
$H_0: \beta_1 = \beta_2 = 0$ using the F-test is equivalent to testing the null hypothesis $H_0: R^2 = 0$.
Note that there are two types of models in this lesson: unrestricted and restricted models. The
unrestricted model includes all k potential independent variables (and k + 1 parameters, adding one
for the intercept), whereas the restricted model includes fewer than k independent variables because
some of them have been dropped. Because the unadjusted R-squared, $R^2$, never decreases when
explanatory variables are added, it cannot by itself tell us whether excluding some variables is
justified; for this purpose the adjusted R-squared, $\bar R^2$, which penalizes additional regressors, is
used. In other words, the fit of the unrestricted model (with more explanatory variables) is compared
with the fit of the restricted model (with fewer explanatory variables) using the adjusted R-squared.
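A minimal sketch of the adjusted R-squared used for such comparisons (Python; the standard formula $\bar R^2 = 1 - (1 - R^2)(n-1)/(n-k-1)$ is assumed, with k counting the slope coefficients, and the $R^2$ values below are hypothetical):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n is the sample size and k the number of slope coefficients."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Comparing a restricted and an unrestricted model (hypothetical R^2 values):
print(adjusted_r2(0.30, n=100, k=2))   # restricted model, fewer regressors
print(adjusted_r2(0.31, n=100, k=5))   # unrestricted model, more regressors
```

A model that adds regressors is preferred on this criterion only if the gain in fit outweighs the degrees-of-freedom penalty.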