3 Multiple Regression Model
Ani Katchova
Multiple regression model definition and advantages
• A multiple regression model has several independent variables (also called regressors).
• Multiple regression incorporates more independent variables into the model, making it more realistic.
• It explicitly holds fixed other factors (by including them in the regression) that would otherwise end up in the error term.
• It allows for more flexible functional forms (including squared variables, interactions, logs, etc.).
Terminology
Multiple regression model (population): $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u$
$y$ is the dependent variable, $x_1, x_2, \dots, x_k$ are the independent variables, $u$ is the error term, and $\beta_0, \beta_1, \dots, \beta_k$ are parameters. There are $k$ independent variables.
$x_j$ denotes any one of the independent variables and $\beta_j$ is its parameter, $j = 1, \dots, k$.
Estimated equation (sample): $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k$
$\hat{y}$ is the predicted value, $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$ are the coefficients, and $n$ is the number of observations.
Interpretation of coefficients
$\hat{\beta}_j = \dfrac{\Delta y}{\Delta x_j} = \dfrac{\text{change in } y}{\text{change in } x_j}$
Regression model example
• A multiple regression model explaining wage with education, experience, and tenure.
• Multiple regression model:
$wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u$
• Wage is measured in $/hour; education, experience, and tenure in this company are measured in years.
• Estimated equation for the predicted value of wage:
$\widehat{wage} = \hat{\beta}_0 + \hat{\beta}_1 educ + \hat{\beta}_2 exper + \hat{\beta}_3 tenure$
• Residuals: $\hat{u} = wage - \widehat{wage}$
• We estimate the regression model to find the coefficients.
• $\hat{\beta}_1$ measures the change in wage associated with one more year of education, holding other factors fixed.
Estimated equation and interpretation
• Estimated equation:
$\widehat{wage} = \hat{\beta}_0 + \hat{\beta}_1 educ + \hat{\beta}_2 exper + \hat{\beta}_3 tenure = -2.87 + 0.60\,educ + 0.02\,exper + 0.17\,tenure$
• Wage is measured in $/hour; education, experience, and tenure in this company are measured in years.
• Interpretation of $\hat{\beta}_1$: the hourly wage increases by $0.60 for each additional year of education, holding other factors fixed.
• Interpretation of $\hat{\beta}_2$: the hourly wage increases by $0.02 for each additional year of experience, holding other factors fixed.
• Interpretation of $\hat{\beta}_3$: the hourly wage increases by $0.17 for each additional year of tenure in the company, holding other factors fixed.
• Interpretation of $\hat{\beta}_0$: if all regressors are zero, a person's wage is −$2.87 (but no one in the sample has zero for all regressors).
Stata output for multiple regression
. regress wage educ exper tenure
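A sketch of how the fitted values and residuals defined earlier can be obtained after this command (the names wagehat and uhat are illustrative, not part of the original output):

. regress wage educ exper tenure    // estimate the multiple regression by OLS
. predict wagehat, xb               // fitted (predicted) values of wage
. predict uhat, residuals           // residuals: uhat = wage - wagehat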
Regression model example
             Reg with 1 variable   Reg with 2 variables   Reg with 3 variables
VARIABLES    wage                  wage                   wage
educ         0.541***              0.644***               0.599***
             (0.0532)              (0.0538)               (0.0513)
exper                              0.0701***              0.0223*
                                   (0.0110)               (0.0121)
tenure                                                    0.169***
                                                          (0.0216)
Constant     -0.905                -3.391***              -2.873***
             (0.685)               (0.767)                (0.729)
Standard errors in parentheses.
When education increases by 1 year, wage increases by $0.60, holding other factors fixed.
The coefficient on education changes slightly from one model to the next depending on
what other independent variables are included in the model.
Derivation of the OLS estimates
For a regression model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u$
• We need to estimate the regression equation:
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k$
and find the coefficients $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$ by looking at the residuals
• $\hat{u} = y - \hat{y} = y - (\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k)$
• Obtain a random sample of data with $n$ observations $(x_{ij}, y_i)$, where $i = 1, \dots, n$ indexes the observation and $j = 1, \dots, k$ indexes the independent variable.
Derivation of the OLS estimates
• The goal is to obtain as good a fit as possible for the estimated regression equation.
• Minimize the sum of squared residuals:
$\min_{\hat{\beta}_0, \dots, \hat{\beta}_k} \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik}) \right)^2$
• Solving this minimization gives the OLS coefficients $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$ (the first-order conditions are sketched below).
• OLS stands for Ordinary Least Squares, based on minimizing the sum of squared residuals.
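For reference, differentiating the sum of squared residuals with respect to each coefficient and setting the derivatives to zero gives the $k+1$ first-order conditions (a standard step, stated here to connect the minimization to the OLS properties on the next slide):
$\frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^{n} \hat{u}_i^2 = -2 \sum_{i=1}^{n} \hat{u}_i = 0, \qquad \frac{\partial}{\partial \hat{\beta}_j} \sum_{i=1}^{n} \hat{u}_i^2 = -2 \sum_{i=1}^{n} x_{ij}\, \hat{u}_i = 0, \quad j = 1, \dots, k$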
OLS properties
$\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}_1 + \hat{\beta}_2 \bar{x}_2 + \cdots + \hat{\beta}_k \bar{x}_k$
The sample averages of the dependent and independent variables are on the regression line.
$\sum_{i=1}^{n} \hat{u}_i = 0$
The residuals sum to zero (note that it is the sum of squared residuals, not the residuals themselves, that is minimized).
$\sum_{i=1}^{n} x_{ij} \hat{u}_i = 0$
The sample covariance between each independent variable $x_j$ and the residual $\hat{u}$ is zero.
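These properties can be verified numerically in Stata (a sketch; uhat is an illustrative name):

. regress wage educ exper tenure
. predict uhat, residuals             // residuals from the regression
. summarize uhat                      // mean is zero up to rounding error
. correlate uhat educ exper tenure    // correlation with each regressor is zero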
Partialling out
• Partialling out shows an alternative way to obtain a regression coefficient.
• Population regression model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$
• First regress the independent variable on all other independent variables:
$x_1 = \alpha_0 + \alpha_2 x_2 + \alpha_3 x_3 + e$
• Get the residuals from this regression: $\hat{e} = x_1 - \hat{x}_1$ (the part of the independent variable not explained by the other variables).
• Run a regression of the dependent variable on these residuals to obtain the same coefficient: $y = \gamma_0 + \beta_1 \hat{e} + v$
• The coefficient $\hat{\beta}_1$ shows the relationship between $y$ and $x_1$ that is not explained by the other variables. This is why "holding everything else fixed" is added when interpreting coefficients. (See the Stata sketch below.)
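A sketch of the partialling-out steps in Stata for the wage example (ehat is the residual name used in the next slide's table):

. regress educ exper tenure      // step 1: regress educ on the other regressors
. predict ehat, residuals        // step 2: part of educ not explained by exper and tenure
. regress wage ehat              // step 3: coefficient on ehat equals the coefficient on educ (0.599)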
Partialling out example
             Dep. variable on all   Indep. variable on other   Dep. variable on
             indep. variables       indep. variables           residuals
VARIABLES    wage                   educ                       wage
educ         0.599***
             (0.0513)
exper        0.0223*                -0.0738***
             (0.0121)               (0.00976)
tenure       0.169***               0.0477***
             (0.0216)               (0.0183)
ehat                                                           0.599***
                                                               (0.0556)
Constant     -2.873***              13.57***                   5.896***
             (0.729)                (0.184)                    (0.146)
The coefficient on education in the original regression (0.599) is the same as the coefficient on $\hat{e}$.
Interpretation: if education increases by 1 year, wage increases by $0.60, holding everything else fixed.
Goodness of fit measures
R-squared
• Remember that SST = SSE + SSR, or
• total sum of squares = explained sum of squares + residual sum of squares.
• R-squared: $R^2 = SSE/SST = 1 - SSR/SST$
• R-squared is the explained sum of squares divided by the total sum of squares. It measures the proportion of the total variation that is explained by the regression. R-squared is a goodness of fit measure.
• An R-squared of 0.7 is interpreted as 70% of the variation being explained by the regression, with the rest due to error.
• As a rough rule of thumb, an R-squared greater than 0.25 is considered a good fit.
• A problem with the R-squared is that it always (weakly) increases when an additional regressor is added, because SST stays the same but SSE increases.
Adjusted R-squared
• The adjusted R-squared is an R-squared that has been adjusted for the number of regressors in the model. The adjusted R-squared increases only if a new regressor improves the model.
• Recall that $R^2 = 1 - \dfrac{SSR}{SST}$
• Adjusted R-squared: $\text{Adj } R^2 = 1 - \dfrac{SSR/(n-k-1)}{SST/(n-1)}$
• where $n$ is the number of observations and $k$ is the number of independent variables.
• Equivalently, $\text{Adj } R^2 = 1 - (1 - R^2)\dfrac{n-1}{n-k-1}$
• As the number of regressors $k$ increases, the adjusted R-squared is penalized.
• When adding more regressors, the adjusted $R^2$ may increase, but it may also decrease or even become negative.
• Rule: when deciding whether to keep a variable in the model, choose the model with the higher adjusted R-squared. (A Stata check appears below.)
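A sketch of checking both measures in Stata after estimation (e(r2), e(r2_a), e(N), and e(df_m) are Stata's stored results for the R-squared, the adjusted R-squared, the number of observations, and the model degrees of freedom k):

. regress wage educ exper tenure
. display e(r2)                                             // R-squared
. display e(r2_a)                                           // adjusted R-squared
. display 1 - (1 - e(r2))*(e(N) - 1)/(e(N) - e(df_m) - 1)   // same value from the formula above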
Adjusted R-squared calculation
      Source |       SS           df       MS        Number of obs =     526
-------------+----------------------------------     F(3, 522)     =   76.87
       Model |  2194.11162         3  731.370541     Prob > F      =  0.0000
    Residual |  4966.30269       522  9.51398982     R-squared     =  0.3064
-------------+----------------------------------     Adj R-squared =  0.3024
       Total |  7160.41431       525  13.6388844     Root MSE      =  3.0845
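The values in this output can be verified from the formulas (using SSR = 4966.30, SST = 7160.41, n = 526, k = 3):
$R^2 = 1 - \frac{4966.30}{7160.41} = 0.3064, \qquad \text{Adj } R^2 = 1 - \frac{4966.30/522}{7160.41/525} = 1 - \frac{9.514}{13.639} = 0.3024$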
Assumption 1: linearity in parameters
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u$
Assumption 2: random sampling
The data are a random sample of $n$ observations from the population model, so for each observation $i$:
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + u_i$
Assumption 3: no perfect collinearity
• No perfect collinearity: none of the independent variables is constant, and there are no exact linear relationships among them.
• An independent variable cannot be a constant, because it would be collinear with the constant/intercept in the model.
• An independent variable cannot be a perfect linear combination of other independent variables (perfect collinearity); such a variable must be dropped from the model.
• Independent variables can still be highly correlated with each other (multicollinearity), which is not a violation of this assumption, though multicollinearity is also problematic.
• If independent variables are shares that sum to 100% (e.g., proportion of income spent on food and proportion of income not spent on food), one of the variables must be dropped. But income spent on food and income not spent on food do not sum to 100%, so both variables can be used in the model.
Perfect collinearity example
• Model with female:
$wage = \beta_0 + \beta_1 educ + \beta_2 female + u$
• Model with female and male:
$wage = \beta_0 + \beta_1 educ + \beta_2 female + \beta_3 male + u$
where $male = 1 - female$.
This model cannot be estimated because female and male are perfectly collinear. Solutions:
- Drop the collinear variable $male$ (see the Stata sketch below)
- Drop the constant (not commonly used)
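In Stata, estimating the second model drops one of the collinear variables automatically (a sketch; the note shown is the message Stata prints in this situation):

. generate male = 1 - female     // male is an exact linear function of female
. regress wage educ female male
note: male omitted because of collinearity.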
Perfect collinearity example
             Model with   Model with female   Model with female and
             female       and male            male but no constant
                          (male is dropped)
VARIABLES    wage         wage                wage
male                                          0.623
                                              (0.673)
Constant     0.623        0.623
             (0.673)      (0.673)
Multicollinearity: VIF
• Multicollinearity is when two or more independent variables are highly linearly related (correlated) with each other. The solution is to remove some of the variables or combine them.
• Multicollinearity can be checked by using variance inflation factors (VIF):
$VIF_j = 1/(1 - R_j^2)$
• $R_j^2$ is the R-squared from a regression of $x_j$ on the other independent variables.
• A higher $R_j^2$ means that $x_j$ is well explained by the other independent variables.
• From the formula above, when $R_j^2$ is higher, $VIF_j$ is higher.
• The rule of thumb is to drop variable $x_j$ if $VIF_j > 10$, which means that $R_j^2 > 0.9$; such a variable is well explained by the other variables in the model.
• $VIF = 1$ for a simple regression, because there are no other variables and $R_j^2 = 0$.
• Another rule of thumb is to drop a variable if it has a correlation coefficient above 0.9 with another variable. But it is better to use VIF, which captures multicollinearity among all independent variables, instead of the correlation coefficient between two variables. (A Stata sketch follows.)
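A sketch of computing VIFs in Stata after a regression (the variable names are hypothetical stand-ins for the test-score example on the next slides):

. regress testscore pareduc pargrad parcoll   // hypothetical variable names
. estat vif                                   // reports VIF_j = 1/(1 - R_j^2) for each regressor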
Multicollinearity example: correlation
Correlation table             Parents' average   Parents went to   Parents went
                              education          grad school       to college
Parents' average education    1
Parents went to grad school   0.79               1
Parents went to college       0.81               0.42              1
Model: how test scores of a student depend on parents’ average education and whether parents
went to grad school or college.
First, find the correlation between two variables at a time.
The variable of parents’ average education is highly correlated with whether parents
went to grad school or college; the correlation coefficients are 0.79 and 0.81.
Multicollinearity example: VIF
VIF                           Model with          Model without       Model without
                              multicollinearity   multicollinearity   multicollinearity
VARIABLES                     Test score          Test score          Test score
Parents' average education    10.79               1
Parents went to grad school   4.78                                    1.25
Parents went to college       4.54                                    1.25
Multicollinearity example
                              Model with          Model without       Model without
                              multicollinearity   multicollinearity   multicollinearity
VARIABLES                     Test score          Test score          Test score
Parents' average education    203.9***            148.2***
                              (18.78)             (5.847)
Parents went to grad school   -1.091                                  5.829***
                              (0.758)                                 (0.475)
Parents went to college       -2.426***                               2.648***
                              (0.587)                                 (0.350)
Constant                      162.7***            545.1***            251.4***
The coefficients in the models with and without multicollinearity are very different.
In the model with multicollinearity, we could wrongly conclude that if parents went to college, the test scores of the student will be lower.
Drop the variable that is causing multicollinearity (parents' average education) and use the last two models.
Standard errors are higher in the model with multicollinearity.
Assumption 4: zero conditional mean (exogeneity)
$E(u_i \mid x_{i1}, x_{i2}, \dots, x_{ik}) = 0$
• The expected value of the error term $u$ given the independent variables $x$ is zero.
• The expected value of the error must not differ based on the values of the independent variables.
• With more variables, this assumption is more likely to hold, because fewer unobservable factors are left in the error term.
• Independent variables that are correlated with the error term are called endogenous; endogeneity is a violation of this assumption.
• Independent variables that are uncorrelated with the error term are called exogenous; this assumption holds if all independent variables are exogenous.
Zero conditional mean example
Regression model:
$wage = \beta_0 + \beta_1 educ + \beta_2 female + u$
Example of exogeneity vs endogeneity
[Figure: two plots of residuals against educ. Left panel, "Exogeneity - zero conditional mean": $E(u|x) = 0$, the error term looks the same at every level of education. Right panel, "Endogeneity - conditional mean is not zero": $E(u|x) > 0$ at higher education levels, i.e., ability in the error term is higher when education is higher.]
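A sketch of producing such a residual plot in Stata (uhat is an illustrative name; the right panel on the original slide used residuals modified for illustration):

. regress wage educ female
. predict uhat, residuals
. scatter uhat educ        // under exogeneity, the residuals show no pattern in educ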
Unbiasedness of the OLS estimators
• Gauss-Markov Assumptions 1-4 (linearity, random sampling, no perfect collinearity, and zero conditional mean) lead to the unbiasedness of the OLS estimators:
$E(\hat{\beta}_j) = \beta_j$, where $j = 0, 1, \dots, k$
• The expected values of the sample coefficients $\hat{\beta}$ are the population parameters $\beta$.
• If we estimate the regression model with many random samples, the average of these coefficients will be the population parameter.
• For a given sample, the coefficients may be very different from the population parameters.
Omitted variable bias
• The "true" population regression model is: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
• We need to estimate: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$
• But instead we estimate a misspecified model: $\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1$, where $x_2$ is the omitted variable.
• If $x_1$ and $x_2$ are correlated, there is a relationship between them:
$x_2 = \delta_0 + \delta_1 x_1 + v$
Substitute into the equation above to get:
$y = \beta_0 + \beta_1 x_1 + \beta_2 (\delta_0 + \delta_1 x_1 + v) + u = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1) x_1 + \beta_2 v + u$
The coefficient estimated for $x_1$ when $x_2$ is omitted will therefore be biased.
Omitted variable bias
• An unbiased coefficient has $E(\hat{\beta}_1) = \beta_1$, but this coefficient is biased because $E(\tilde{\beta}_1) = \beta_1 + \beta_2 \delta_1$, where $\beta_2 \delta_1$ is the bias.
• With an omitted variable, the coefficient will not be biased if:
• $\beta_2 = 0$. Looking at $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$, this means that $x_2$ does not belong in the model ($x_2$ is irrelevant).
• $\delta_1 = 0$. Looking at $x_2 = \delta_0 + \delta_1 x_1 + v$, this means that $x_2$ and $x_1$ are not correlated.
• In other words, if the omitted variable is irrelevant ($\beta_2 = 0$) or uncorrelated with $x_1$ ($\delta_1 = 0$), there is no omitted variable bias. (A simulation sketch follows.)
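A small Stata simulation sketch of the bias (all names and numbers here are invented for the demonstration):

. clear
. set obs 1000
. set seed 12345
. generate x1 = rnormal()
. generate x2 = 0.5*x1 + rnormal()            // delta1 = 0.5: x2 is correlated with x1
. generate y = 1 + 2*x1 + 3*x2 + rnormal()    // true model: beta1 = 2, beta2 = 3
. regress y x1 x2                             // coefficient on x1 is close to 2 (unbiased)
. regress y x1                                // coefficient on x1 is close to 2 + 3*0.5 = 3.5 (biased)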
Omitted variable bias example
• Suppose the "true" population regression model is:
$wage = \beta_0 + \beta_1 educ + \beta_2 abil + u$
• But instead we estimate a misspecified model:
$wage = \alpha_0 + \alpha_1 educ + e$
where $abil$ is the omitted variable.
• If $abil$ and $educ$ are correlated, there is a relationship between them:
$abil = \delta_0 + \delta_1 educ + v$
Substitute into the equation above to get:
$wage = \beta_0 + \beta_1 educ + \beta_2 (\delta_0 + \delta_1 educ + v) + u = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1) educ + \beta_2 v + u$
When $abil$ is omitted, the coefficient estimated for $educ$ is $\tilde{\beta}_1 = \beta_1 + \beta_2 \delta_1$, where $\beta_2 \delta_1$ is the bias.
Omitted variable bias example
             Model with educ   Model with educ     Model of abil
             and abil          (abil is omitted)   and educ
VARIABLES    wage              wage                abil
educ         1.153***          1.392***            0.551***
             (0.127)           (0.103)             (0.0213)
abil         0.433***
             (0.137)
Constant     -2.523            -4.857***           -5.389***
             (1.543)           (1.360)             (0.282)
Checking the bias formula with the table: 1.392 = 1.153 + 0.433 × 0.551 = original coefficient + $\hat{\beta}_2 \hat{\delta}_1$, an upward bias from omitting ability.
Now suppose instead that we have experience in place of ability in the true model (estimates from a separate regression, not shown in the table).
Coefficient in the model with the omitted variable = 1.392 = original coefficient + bias = 1.948 + 0.614 × (−0.905) = 1.948 − 0.556.
In the model with the omitted variable, for each additional year of education, wage increases by $1.39 instead of $1.95, so there is a negative bias of −$0.56.
The effect of education on wage is underestimated because people with higher education have less experience.
The coefficient on education also reflects the −$0.56 indirect effect of experience on wage through its relationship with educ.
Including irrelevant variables
• Suppose the "true" population regression model is:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
• But instead we estimate a misspecified model:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + e$
where $x_3$ is an irrelevant variable that is included.
• In the population regression model, $E(\hat{\beta}_3) = \beta_3 = 0$ because the variable does not belong in the model.
• Adding an irrelevant variable does not cause bias, but it increases the sampling variance.
Assumption 5: homoscedasticity
• Homoscedasticity: $var(u_i \mid x_{i1}, x_{i2}, \dots, x_{ik}) = \sigma^2$
• The variance of the error term $u$ must not differ with the independent variables.
• Heteroscedasticity, $var(u_i \mid x_{i1}, x_{i2}, \dots, x_{ik}) \neq \sigma^2$, is when the variance of the error term $u$ is not constant across values of $x$.
Homoscedasticity vs heteroscedasticity example
[Figure: two residual plots, one of residuals against exper and one of residuals (uhat) against educ, illustrating constant versus non-constant error variance.]
Variance of the OLS estimators
With multiple regression: $var(\hat{\beta}_j) = \dfrac{\sigma^2}{SST_j (1 - R_j^2)}$
• $SST_j = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$ is the total sample variation in variable $x_j$. High variance of the regressor leads to lower variance of its coefficient.
• $R_j^2$ is the R-squared from a regression of $x_j$ on all other independent variables. A high R-squared means that the regressor is very well explained by the other regressors (multicollinearity), and its coefficient will have a higher variance.
• $\sigma^2$ is the variance of the error term. High variance of the error term leads to high variance of the coefficients.
• Estimators with lower variance are also called more precise, which is desirable. This means that high variance in the independent variable, a low R-squared for $x_j$ (less multicollinearity), and low variance in the error term are desirable.
Unbiasedness of the error variance
• The variance of the error term, $\sigma^2$, is not known, but it can be estimated from the variance of the residuals, corrected for the number of regressors $k$:
$\hat{\sigma}^2 = \dfrac{1}{n-k-1} \sum_{i=1}^{n} \hat{u}_i^2$
• The degrees of freedom are $n - k - 1$.
• Gauss-Markov Assumptions 1-5 (linearity, random sampling, no perfect collinearity, zero conditional mean, and homoscedasticity) lead to the unbiasedness of the error variance:
$E(\hat{\sigma}^2) = \sigma^2$
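As a check against the Stata output shown earlier (SSR = 4966.30, n − k − 1 = 522):
$\hat{\sigma}^2 = \frac{SSR}{n-k-1} = \frac{4966.30}{522} \approx 9.514, \qquad \hat{\sigma} = \sqrt{9.514} \approx 3.0845$
which matches the Root MSE reported by Stata.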
Standard errors of the regression coefficients
• Standard errors of the regression coefficients:
$se(\hat{\beta}_j) = \sqrt{var(\hat{\beta}_j)} = \sqrt{\dfrac{\hat{\sigma}^2}{SST_j (1 - R_j^2)}}$
Variance in misspecified models
• The "true" population regression model is: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
• We need to estimate: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$
• But instead we estimate a misspecified model: $\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1$, where $x_2$ is the omitted variable.
$var(\hat{\beta}_1) = \dfrac{\sigma^2}{SST_1 (1 - R_1^2)} > var(\tilde{\beta}_1) = \dfrac{\sigma^2}{SST_1}$
• There is a trade-off between bias and variance.
• Omitted variables lead to a misspecified model and biased coefficients, which is undesirable. But the coefficients have lower variance, which is desirable.
• Typically, unbiased coefficients come first and lower variance (precision) second, so it is better to include the independent variable in the model.
• But irrelevant regressors should not be included: they will not cause bias but will increase variance.
Variance in misspecified model example (same as omitted variable bias example)
             Model with educ   Model with educ
             and abil          (abil is omitted)
VARIABLES    wage              wage
educ         1.153***          1.392***
             (0.127)           (0.103)
abil         0.433***
             (0.137)
Constant     -2.523            -4.857***
             (1.543)           (1.360)
The standard error of the coefficient on education is 0.103 in the misspecified model, which is lower than 0.127 in the original model.
Remember that the standard error is the square root of the variance.
Gauss-Markov theorem
• The Gauss-Markov theorem says that under Assumptions 1-5 (linearity in parameters, random sampling, no perfect collinearity, zero conditional mean, homoscedasticity), the OLS estimators are the best linear unbiased estimators (BLUE) of the coefficients.
• The estimators are linear in the dependent variable:
$\hat{\beta}_j = \sum_{i=1}^{n} w_{ij} y_i$; for example, in simple regression $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
• Among all linear unbiased estimators, the OLS estimators have the lowest variance ("best" means lowest variance).
Review questions
1. Define regression model, estimated equation, and residuals.
2. What method is used to obtain the coefficients? What are the OLS properties?
3. Explain the goodness of fit measures R-squared and adjusted R-squared.
4. Define perfect collinearity and multicollinearity. What test is used to detect
multicollinearity? What is done to prevent these problems?
5. Explain omitted variable bias. If a variable is omitted, are the coefficients
biased? What happens to the variance of the coefficients?
6. List and explain the five Gauss-Markov assumptions.
7. Which assumptions are needed for the unbiasedness of the coefficients?
8. Which assumptions are needed to calculate the variance of the OLS
coefficients?
9. Explain the Gauss-Markov Theorem.