
Statistical Modelling: Regression: Choosing The Independent Variables

This document discusses best practices for choosing independent variables in regression analysis. It notes that omitting a relevant variable can lead to omitted variable bias, while including an irrelevant variable does not bias estimates but increases standard errors. The document provides an example showing the effects of omitting an income variable from a model of chicken consumption. It also discusses sequential specification searches, sensitivity analysis, and data mining approaches to variable selection.


FACULTY OF ECONOMICS AND BUSINESS

CAMPUS BRUSSEL
Master of Business Engineering

Statistical Modelling
Regression: Choosing the independent variables

Studenmund (2013). Using econometrics: A practical guide (6th edition). Edinburgh: Pearson Education Limited.

Choosing the independent variables
• Specifying an econometric equation involves choosing
o the correct independent variables to include in the equation (Chapter 6)
o the correct functional form of the regression model (Chapter 7)
o the correct form of the stochastic error term (see the part on time series models)
• Two potential errors can be made when choosing the independent
variables
o Failing to include a relevant variable. This generally biases the OLS
estimates of the coefficients of the other predictors (so-called omitted
variable bias), and hence also leads to wrong predictions of the dependent
variable.
o Including an irrelevant variable in the equation. This does not bias the
estimated coefficients of the other predictors or the predictions of the
dependent variable, but it increases the standard errors of the estimated
regression coefficients.

Example: Annual chicken consumption
• Consider the following model of the annual consumption of chicken in
the US:

$$Y_t = \beta_0 + \beta_1 PC_t + \beta_2 PB_t + \beta_3 YD_t + \varepsilon_t$$

with
o $Y_t$: per capita chicken consumption (in pounds) in year $t$
o $PC_t$: the price of chicken (in cents per pound) in year $t$
o $PB_t$: the price of beef (in cents per pound) in year $t$
o $YD_t$: US per capita disposable income (in hundreds of dollars) in year $t$
• We fit the model using data for the years 1974-2002. Using one-sided
tests, we see that each of the individual regression coefficients is
significantly different from 0 in the expected direction (p < 0.05) and that
$\bar{R}^2 \approx .990$.
Omitting a relevant variable
• To illustrate the impact of omitting a relevant variable, we also fit the
model without disposable income. The results indicate that omitting the
relevant variable YD leads to biased estimates:
o the estimated coefficient of PC changes from -.11 to -.359
o the estimated coefficient of PB changes from .032 to .284
• In this example the omitted variable bias is large.
• In this example the fit of the model also decreases substantially when YD
is omitted.

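Mechanics of omitting a relevant variable
• Suppose the following regression models hold for the population
(1) $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$
(2) $X_{2i} = \alpha_0 + \alpha_1 X_{1i} + u_i$
• Substituting (2) in (1) we see what happens if we omit $X_2$ from the
regression model:
$$Y_i = (\beta_0 + \beta_2 \alpha_0) + (\beta_1 + \beta_2 \alpha_1) X_{1i} + (\varepsilon_i + \beta_2 u_i) = \beta_0^* + \beta_1^* X_{1i} + \varepsilon_i^*$$
• Using OLS to fit this regression model we see that
$$E[\hat{\beta}_1^*] = \beta_1 + \beta_2 \alpha_1$$
and hence
$$\text{bias} = E[\hat{\beta}_1^*] - \beta_1 = \beta_2 \alpha_1$$

Mechanics of omitting a relevant variable
• It follows that $\hat{\beta}_1^*$ is unbiased if one or both of the following
conditions hold
o $\beta_2 = 0$: after controlling for $X_1$, $X_2$ does not affect $Y$ (i.e. $X_2$ is
not relevant).
o $\alpha_1 = 0$: $X_1$ and $X_2$ are uncorrelated
and that $\hat{\beta}_1^*$ is biased if $\beta_2 \neq 0$ and $\alpha_1 \neq 0$.
• Omitting a relevant variable (i.e. $\beta_2 \neq 0$) is usually problematic because
predictors tend to be correlated to some extent.
• In the example on annual chicken consumption, the coefficient of the
omitted variable YD is positive ($\beta_2 > 0$) and PC and YD are negatively
correlated ($\alpha_1 < 0$).
• The bias in the estimated coefficient of PC is therefore expected to be
negative, because $\beta_2 \alpha_1 = (+)(-) < 0$.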
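The bias result can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: Y depends on X1 and X2 with coefficients `beta1` and `beta2`, X2 depends on X1 with slope `alpha1`, and all parameter values, sample sizes and function names are hypothetical choices, not taken from the course material.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients of y on X (X must already contain a constant column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def simulate_omitted_variable(beta1=1.0, beta2=2.0, alpha1=0.5,
                              n=200, reps=2000, seed=0):
    """Average slope on X1 over many samples when X2 is left out of the model."""
    rng = np.random.default_rng(seed)
    slopes = np.empty(reps)
    for r in range(reps):
        x1 = rng.normal(size=n)
        x2 = alpha1 * x1 + rng.normal(size=n)                   # X2 related to X1
        y = 1.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)  # true model
        X_short = np.column_stack([np.ones(n), x1])             # X2 omitted
        slopes[r] = ols(X_short, y)[1]
    return slopes.mean()

mean_slope = simulate_omitted_variable()
theory = 1.0 + 2.0 * 0.5  # beta1 + beta2 * alpha1
print(f"mean slope with X2 omitted: {mean_slope:.3f} (theory: {theory:.3f})")

# With alpha1 = 0 the predictors are uncorrelated and the bias should vanish.
mean_slope_uncorr = simulate_omitted_variable(alpha1=0.0)
print(f"mean slope with alpha1 = 0: {mean_slope_uncorr:.3f} (true beta1: 1.000)")
```

The average slope from the short regression should sit close to beta1 + beta2 * alpha1 rather than beta1, and close to beta1 once alpha1 is set to 0.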

Including an irrelevant variable
• Including an irrelevant variable (i.e. a variable whose true regression
coefficient equals 0) will not cause bias in the estimated regression
coefficients of the other predictor variables.
• However, including an irrelevant variable will typically increase the
standard errors of the estimated coefficients of the other variables.
• Example: suppose we add the variable TEMP (annual average change in
temperature in 0.1 degrees) to the model for annual chicken consumption.
As expected, the regression coefficient of TEMP is not significantly
different from 0. The standard errors for PC, PB and YD increase slightly
when the irrelevant variable is added:
o PC: .032 to .033
o PB: .017 to .018
o YD: .014 to .015
• $\bar{R}^2$ decreases from .99041 to .99036.
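Mechanics of including an irrelevant variable
• Suppose the true model reads
$$Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i$$
• We add an irrelevant variable $X_2$ and use the model
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i^{**}$$
with
$$\varepsilon_i^{**} = \varepsilon_i - \beta_2 X_{2i}$$
o If $X_2$ is irrelevant, $\beta_2 = 0$ and hence $\varepsilon^{**}$ and $X_2$ will be uncorrelated. So
there is no violation of the Gauss-Markov assumptions.
o Even if $X_1$ and $X_2$ are correlated, $X_1$ will not be correlated with $\varepsilon^{**}$ and
hence $\hat{\beta}_1$ is still unbiased.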
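A simulation can make both effects of an irrelevant variable visible. The following is a minimal sketch under assumed values: the true model contains only X1, the added variable X2 has a true coefficient of 0 but is correlated with X1 (correlation `rho`), and the function names and parameter values are hypothetical.

```python
import numpy as np

def coef_and_se(X, y):
    """OLS coefficients and conventional standard errors (X includes a constant)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ X.T @ y
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - k)           # estimated error variance
    return coef, np.sqrt(sigma2 * np.diag(XtX_inv))

def simulate_irrelevant(n=50, reps=2000, rho=0.8, seed=1):
    """Average slope and SE for X1, without and with an irrelevant X2."""
    rng = np.random.default_rng(seed)
    out = np.empty((reps, 4))
    for r in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)  # true coefficient 0
        y = 1.0 + 2.0 * x1 + rng.normal(size=n)                   # true model: X1 only
        c_s, se_s = coef_and_se(np.column_stack([np.ones(n), x1]), y)
        c_l, se_l = coef_and_se(np.column_stack([np.ones(n), x1, x2]), y)
        out[r] = c_s[1], se_s[1], c_l[1], se_l[1]
    return out.mean(axis=0)

b_short, se_short, b_long, se_long = simulate_irrelevant()
print(f"slope on X1 without X2: {b_short:.3f} (average SE {se_short:.3f})")
print(f"slope on X1 with X2:    {b_long:.3f} (average SE {se_long:.3f})")
```

Both average slopes should be close to the true value of 2, while the average standard error is larger in the model that includes X2. The stronger the correlation between X1 and X2, the larger the increase.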

Specification searches
The question of how to search for the best specification of an econometric
equation is quite controversial. Different approaches are used in practice:
Sequential specification searches
• Estimate an initial equation and then sequentially adapt the model
(drop/add variables, change functional form, etc.) to find a plausible
equation. Present the final model as if it were the only one estimated.
• This approach overstates the statistical significance of the final results
because the earlier regressions are ignored. When conducting many
statistical tests, the probability of making at least one type-I error in the
entire collection of tests is larger than the significance level $\alpha$ of each
individual test.
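How quickly the risk of at least one type-I error grows can be sketched with a short calculation. It assumes the tests are independent, which is an assumption made here for illustration only; tests within a real specification search are usually dependent, so the numbers are indicative rather than exact. The function name is hypothetical.

```python
def familywise_error_rate(alpha, m):
    """P(at least one type-I error) in m independent tests, each at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 5, 10, 20):
    print(m, round(familywise_error_rate(0.05, m), 3))
# prints:
# 1 0.05
# 5 0.226
# 10 0.401
# 20 0.642
```

Under this assumption, ten tests at the 5% level already give roughly a 40% chance of at least one false rejection.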
Sensitivity analysis
• Purposely run a number of alternative model specifications (using
different functional forms, variable definitions, subsets of the data) to
evaluate whether certain results are robust.

Specification searches
Data mining
• Many domains use data mining to model a dependent variable on the
basis of a large set of available predictor variables. The main goal is to use
the model for making predictions on new data about future behavior:
o Predict whether a customer will buy a product in the next three months
o Predict whether a customer of a bank will be able to pay off a loan
• As data mining models are often constructed using a large set of available
predictor variables, variable selection is an important issue:
o It is important to select the predictors with the best predictive performance
on test data (i.e., data not used for estimating the models)
o Including irrelevant variables increases the variance of the estimator, and
tends to increase the test error (RSS computed on test data)

Specification searches
• Data mining often uses automated procedures for variable selection:
o e.g. forward selection begins with the null model and then adds, one at a
time, the predictor that yields the greatest additional improvement in model
fit (i.e. the largest increase in $R^2$).
o To avoid overfitting the data, it is important to estimate candidate models
(i.e. models with a certain set of predictors) on a training data set, and
evaluate the fit of the candidate models on a test set. The final model
selected is the one with the best fit on the test set.
• The data mining approach is very much data-driven, and the usefulness
of models is derived from their performance in predicting future behavior.
• This contrasts with a theory-driven approach in which one aims to build
a theoretically and statistically valid econometric model.
• In this course we focus on the theory-driven approach to econometrics.
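The forward selection procedure described above can be sketched as follows. This is one possible illustration, not a reference implementation: the data are synthetic, the helper names (`design`, `fit_rss`, `forward_selection`) are hypothetical, and training RSS is used as the fit criterion (for a fixed data set, the lowest RSS corresponds to the largest increase in R^2).

```python
import numpy as np

def design(X, cols):
    """Constant column plus the selected predictor columns of X."""
    return np.column_stack([np.ones(X.shape[0])] + [X[:, j] for j in cols])

def fit_rss(X_fit, y_fit, X_eval, y_eval, cols):
    """Fit OLS on (X_fit, y_fit) using `cols`; return the RSS on (X_eval, y_eval)."""
    coef, *_ = np.linalg.lstsq(design(X_fit, cols), y_fit, rcond=None)
    resid = y_eval - design(X_eval, cols) @ coef
    return float(resid @ resid)

def forward_selection(X_train, y_train, X_test, y_test):
    selected, remaining = [], list(range(X_train.shape[1]))
    # Start from the null model (constant only) and record its test RSS.
    path = [([], fit_rss(X_train, y_train, X_test, y_test, []))]
    while remaining:
        # Add the predictor that gives the lowest training RSS.
        best = min(remaining,
                   key=lambda j: fit_rss(X_train, y_train, X_train, y_train,
                                         selected + [j]))
        selected = selected + [best]
        remaining.remove(best)
        path.append((selected, fit_rss(X_train, y_train, X_test, y_test, selected)))
    # Final model: the candidate with the lowest RSS on the test set.
    return min(path, key=lambda step: step[1]), path

# Hypothetical synthetic data: only columns 0 and 2 truly affect y.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=300)

(best_cols, best_rss), path = forward_selection(X[:200], y[:200], X[200:], y[200:])
print("order of entry:", path[-1][0])
print("model selected on the test set:", best_cols)
```

The two relevant predictors should enter first. Whether any of the irrelevant columns survive in the final model depends on the particular test sample, which is one reason the slides treat this approach as prediction-oriented rather than theory-driven.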

Best practices in selecting independent variables
• The following guidelines can be used to decide whether a variable should
be included in the model:
o theory: is the variable’s place in the equation unambiguous and
theoretically sound?
o t-test: is the variable’s estimated coefficient significant in the expected
direction?
o $\bar{R}^2$: does the overall fit of the equation (adjusted for degrees of freedom)
improve when the variable is added to the equation?
o Do other variables’ coefficients change significantly when the variable is
added to the equation?
• Do not remove a variable from the model just because it is not
significant, because this can lead to omitted variable bias.
• Minimize the number of equations estimated (except for sensitivity
analysis).
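The three data-based checks in this list (t-test, adjusted R^2, change in the other coefficients) can be computed side by side. The sketch below uses hypothetical synthetic data in which the candidate variable `x2` is relevant and correlated with `x1`; the function name and all values are assumptions for illustration. The theory criterion cannot be checked by code.

```python
import numpy as np

def ols_summary(X, y):
    """Coefficients, t-statistics and adjusted R^2 for OLS (X includes a constant)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ X.T @ y
    resid = y - X @ coef
    rss = resid @ resid
    tss = ((y - y.mean()) ** 2).sum()
    se = np.sqrt(rss / (n - k) * np.diag(XtX_inv))
    adj_r2 = 1.0 - (rss / (n - k)) / (tss / (n - 1))  # adjusted for degrees of freedom
    return coef, coef / se, adj_r2

# Hypothetical data: the candidate x2 is relevant and correlated with x1.
rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

const = np.ones(n)
coef_without, t_without, adj_without = ols_summary(np.column_stack([const, x1]), y)
coef_with, t_with, adj_with = ols_summary(np.column_stack([const, x1, x2]), y)

print(f"t-statistic of the candidate: {t_with[2]:.2f}")
print(f"adjusted R^2:                 {adj_without:.3f} -> {adj_with:.3f}")
print(f"coefficient of x1:            {coef_without[1]:.3f} -> {coef_with[1]:.3f}")
```

Here all three checks point towards including the candidate: its t-statistic is large, adjusted R^2 rises, and the coefficient of `x1` shifts noticeably once `x2` is added.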

Exercise: Pick new location of Woody’s restaurant
• For the example of Woody’s (see Chapter 5 on hypothesis testing) we list
the output of the estimated regression model and the correlations between
predictors.
• Suppose you refit the model without the independent variable P. Discuss
the kind of omitted-variable bias you expect in the estimated coefficients of
the two remaining predictors in the new model.

Solution: Pick new location of Woody’s restaurant
We expect a positive bias in the estimated coefficients of both remaining
predictors.
