Stata Lecture Unit Root
Mark E. McGovern
Geary WP2012/01
January 2012
UCD Geary Institute Discussion Papers often represent preliminary work and are circulated to
encourage discussion. Citation of such a paper should account for its provisional character. A revised
version may be available directly from the author.
Any opinions expressed here are those of the author(s) and not those of UCD Geary Institute. Research
published in this series may include views on policy, but the institute itself takes no institutional policy
positions.
A Practical Introduction to Stata
Mark E. McGovern∗
January 2012
Abstract
This document provides an introduction to the use of Stata. It is designed to be an overview rather than
a comprehensive guide, aimed at covering the basic tools necessary for econometric analysis. Topics cov-
ered include data management, graphing, regression analysis, binary outcomes, ordered and multinomial
regression, time series and panel data. Stata commands are shown in the context of practical examples.
Contents
1 Introduction 4
1.1 Opening Stata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Audit Trails . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Getting Help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Importing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 User Written Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 Menus and Command Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8 Data browser and editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.9 Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.10 Types of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Data Manipulation 9
2.1 Describing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Generating Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 if Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Summarising with tab and tabstat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Introduction to Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Joining Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.7 Tabout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7.1 Tabout with Stata 9/10/11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7.2 Tabout with Stata 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.8 Recoding and Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.9 Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.10 Macros, Looping and Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.11 Counting, sorting and ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.12 Reshaping Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.13 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
∗ Geary Institute & School of Economics, University College Dublin. PGDA Fellow, Harvard Center for Population and
Development Studies. I gratefully acknowledge funding from the Health Research Board. This document is based on notes
for the UCD MA econometrics module and a two day course in the UCD School of Politics. Preliminary, comments welcome.
Email: [email protected]
3 Regression Analysis 21
3.1 Dummy Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Outreg2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Post Regression Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5 Interaction Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6 Specification and Misspecification Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4 Binary Regression 26
4.1 The Problem With OLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Logit and Probit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3 Marginal Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5 Time Series 32
5.1 Initial Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.2 Testing For Unit Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3 Dealing With Non-Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
7 Panel Data 44
7.1 Panel Set Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
7.2 Panel Data Is Special . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.3 Random and Fixed Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.4 The Hausman Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.5 Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
8 Instrumental Variables 53
8.1 Endogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
8.2 Two Stage Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
8.3 Weak Instruments, Endogeneity and Overidentification . . . . . . . . . . . . . . . . . . . . . . 56
List of Tables
1 Logical Operators in Stata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Tabout Example 1 - Crosstabs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Tabout Example 2 - Variable Averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 OLS Regression Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5 OLS Regression Output With Dummy Variables . . . . . . . . . . . . . . . . . . . . . . . . . 23
6 Outreg Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7 Linear Probability Model Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
8 Logit and Probit Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
9 Marginal Effects Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
10 Alternative Binary Estimators for HighWage . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
11 Dickey Fuller Test Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
12 A Comparison of Time Series Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
13 Ordered Probit Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
14 OLS and MFX for Ordinal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
15 Multinomial Logit Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
16 MFX for Multinomial Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
17 xtdescribe Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
18 xtsum Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
19 xttrans Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
20 Test For A Common Intercept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
21 Random Effects Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
22 Fixed Effects Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
23 Comparison of Panel Data Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
24 Correlation Between Income, Openness and Area . . . . . . . . . . . . . . . . . . . . . . . . . 55
25 OLS and IV Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
26 Testing for Weak Instruments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
List of Figures
1 An example of a graph matrix chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2 Graph Example 1: Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 Graph Example 2: Labelled Scatterplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4 A Problem With OLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5 Problem Solved With Probit and Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6 Autocorrelation Functions For Infant Mortality and GDP . . . . . . . . . . . . . . . . . . . . 32
7 Partial Autocorrelation Functions For Infant Mortality and GDP . . . . . . . . . . . . . . . . 33
8 Using OLS To Detrend Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
9 Health Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
10 Height Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
11 Ordered Logit Predicted Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
12 Multinomial Logit Predicted Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
13 BHPS Income . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
14 BHPS Income by Job Satisfaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
15 Graph Matrix for Openness, Area and Income Per Capita . . . . . . . . . . . . . . . . . . . . . 53
16 Openness and Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Objective
The aim of this document is to provide an introduction to Stata, and to describe the requirements necessary
to undertake the basics of data management and analysis. This document is designed to complement rather
than substitute for a comprehensive set of econometric notes; no advice on theory is intended. Although
originally intended to accompany an econometrics course in UCD, the following may be of interest to anyone
getting started with Stata. Topics covered fall under the following areas: data management, graphing,
regression analysis, binary regression, ordered and multinomial regression, time series and panel data. Stata
commands are shown in red. It is assumed the reader is using version 11, although this is generally not
necessary to follow the commands.
1 Introduction
1.1 Opening Stata
Stata 11 is available on UCD computers by clicking on the “Networked Applications”. Select the “Mathe-
matics and Statistics” folder and Stata v11. It is also possible to run Stata from your own computer. Log into
UCD connect and click “Software for U” on the main page. You will first need to download and install the
client software, then you will be able to access Stata 11, again in the “Mathematics and Statistics” folder. For
further details see http://www.ucd.ie/itservices/teachinglearningit/applications/softwareforu/d.en.21241
Stata 11 is recommended, however Stata 8.0 may also be available on the NAL (Novell Application
Launcher). Click Start and open the NAL. Open the Specialist Applications folder and click into Economics.
Open wsestata.exe, or right-click and add as a shortcut to your desktop. Alternatively, click Start > Run,
paste in Y:\nalapps\W95\STATASE\v8.0 and press Enter.
1.2 Preliminaries
Before starting, we need to cover a very important principle of data analysis. It is vital that you keep track
of any changes you make to data. There is nothing worse than not knowing how you arrived at a particular
result, or accidentally making a silly mistake and then saving your data. This can lead to completely incorrect
conclusions. For example you might confuse your values for male and female and conclude that men are
more at risk of certain outcomes, etc. These mistakes are embarrassing at best, and career threatening at
worst. There are three simple tips to avoid these problems. Firstly keep a log of everything. Secondly, to
ensure you don’t embed any mistakes you’ve made in future work, most econometricians never save their
datasets. Generally people initially react badly to this suggestion. However, you don’t need to save changes to the dataset itself if you implement all manipulations using do files. The final tip, therefore, is to use do
files. We will cover each of these in what follows.
The first thing we need to do is open our data. If we have a file saved somewhere on our hard disk
we could use the menus to load it. FILE, OPEN. Or we could write out the full path for the file, e.g.
“h:\Desktop\”. The path for your desktop will differ depending on the computer you are using; however, if you are on a UCD machine this should be it. This is awkward, and we will also need somewhere to store
results, and analysis. So we will create a new folder on our desktop called “Stata”. Right click on your
desktop, and select NEW, FOLDER. Rename this to “Stata”. We will also create a new folder within this
called “Ado” which we will use to install new commands. Save the files for this class into the “Stata” folder.
Stata starts with a default working directory, but it is well hidden and not very convenient, so we want to
change the working directory to our new folder. First we check the current working directory with pwd. Now
we can change it: cd "h:\Desktop\Stata". If you are unsure where your new “Stata” folder is, right click on it and go to PROPERTIES. You will see the path under LOCATION. Add “\Stata” to this. Now we
can load our data files. One final piece of housekeeping, because we can only write to the personal drive
(“h:\”) on UCD computers we need to be able to install user written commands here. So we set this folder
with sysdir set PLUS "h:\Desktop\Stata\Ado". This is only necessary if you are running Stata from
a UCD computer.
Now we have this set up, accessing files saved in Stata format (.dta) is straightforward. use icecream2.
If you make changes to the data, you will not be allowed to open another dataset without clearing Stata’s
memory first. gen year=2010. We will encounter the gen command later. Now if we try and load the data
again use icecream2 we get the error message “no; data in memory would be lost”. We need to use the
command clear first, then we can reload the dataset use icecream2. Alternatively, using the clear option
automatically drops the dataset in current use: use icecream2, clear. This raises a very important point:
we need to keep track of our analysis and our changes to the data. Never ever save changes to a dataset.
If you have no record of what you have done not only will you get lost and not be able to reproduce your
results, neither will anyone else. And you won’t be able to prove that you’re not just making things up.
This is where do files come in. A do file (not to be confused with an ado file)1 is simply a list of commands
1 An ado file is a do file which contains a programme. Stata uses these to run most of its commands. This is also how we are able to install new user written commands. Usually we will be able to install these automatically, however sometimes we need to do this manually. All that is involved here is saving the appropriate ado file into the appropriate directory, which you can locate with sysdir.
that you wish to perform on your data. Instead of saving changes to the dataset, you will run the do file
on the original data. You can add new commands to the do file as you progress in your analysis. This way
you will always have a copy of the original data, you will always be able to reproduce your results exactly,
as will anyone else who has the do file. You will also only need to make the same mistake once. The top
journals require copies of both data and do files so that your analysis is available to all. It is not uncommon
for people to find mistakes in the analysis of published papers. We will look at a simple example. Do files have the suffix “.do”. You can execute a do file like this: do intro.2 For example, do tutorial1 would run all of the analysis
for this particular tutorial. There are several ways to open, view and edit do files. The first is through Stata.
Using the menus go to WINDOW DO-FILE EDITOR, NEW DO-FILE. Or click on the notepad icon below
the menus. Or type doedit in the command window. Or press CTRL F8. Each of these will open the do-file editor. Alternatively you can write do files in Notepad or Word. They must be saved as .do files, however.
You don’t have to execute a whole do file, you can also copy and paste commands into the command window.
Here we will create our own do file using the commands in this document.
As well as using do-files to keep track of your analysis, it is important to keep a log (a record of all
commands and output) in case Stata or your computer crashes during a session. Therefore you should open
a log at the start of every session. log using newlog, replace. To examine the contents of a log using the
menus go to FILE, VIEW. Alternatively type view logexample. Also useful is set more off, which allows
Stata to give as much output as it wants. This setting is optional but otherwise Stata will give only one
page of output at a time. Finally, you must have enough memory to use your data. You can set the amount
of memory Stata uses. By default, it starts out with 10 megabytes which is not always enough. If you run
out of memory you will get the error message “no room to add more observations”. For most data files 30
megabytes will be enough, so we will start by setting this as the memory allocation. set mem 30m. To check
the current memory usage type memory. You could set memory to several hundred megabytes to ensure that
Stata will never run out, but this makes your computer slow (especially if you have a slow computer) and so
is not recommended. None of the files we will be examining require more than this. Note that if you run out of
memory you will have to clear your data, set the memory higher and re-run your analysis before proceeding.
In general all of these items are things you will want to place at the start of every do file.
clear
set mem 30m
cd "h:\Desktop\Stata"
sysdir set PLUS "h:\Desktop\Stata\Ado"
set more off
capture log close
local x=c(current_date)
log using "h:\Desktop\Stata\`x'", append
Lines 7 and 8 require some explanation. The outcome of this is that Stata will record all analysis you
conduct on a particular day in a log file, the name of which will be that day’s date. We will explain how this
works when we discuss macros. Note that Stata ignores lines which begin with “*”, so we will use this to write comments. The command “capture” is also important. If you are running a do file and it encounters an
error, the analysis will stop. The “capture” command tells Stata to proceed even if it encounters a mistake.
If you are running Stata on your own computer, there is a way to alter the default settings that Stata
starts with. When it launches, Stata looks for a do file called “profile.do” and runs any commands it contains.
You can create this file so that these changes are made automatically every time you launch Stata. (i.e.
memory is set, directory is set and a log is started). As well as a working directory, Stata also has other
directories where programmes are stored. We need to put our “profile.do” into the “personal” folder. To
find it, type sysdir. We now paste the following into a text file (either using notepad or Stata), and save it
as “profile.do” into that directory.
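The contents of the original “profile.do” are not reproduced in this extract; the following is a minimal sketch of what such a file might contain, based on the settings discussed above (the path and memory figure are assumptions to be adjusted to your own setup).
set mem 30m
set more off
cd "h:\Desktop\Stata"
capture log close
local x=c(current_date)
log using "h:\Desktop\Stata\`x'", append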
2 run intro executes the do file but suppresses any output.
1.3 Audit Trails
1. Remember to keep a record of everything
2. Never alter the original dataset
3. Use do files to implement all changes to your data
1.5 Importing Data
Often you will have to make do with Microsoft Excel (.xls) files. Fortunately Stata can import these quite
easily. If you have an Excel file named myfile.xls, you can import it using Stata’s insheet command.
First, in Microsoft Excel, click File > Save As. Now instead of saving as a Microsoft Office Excel file, save
the file as a CSV (Comma Delimited) file using the dropdown menu. You can then load the data using the
command insheet using myfile.csv. The “infile” command is an alternative for loading other types of
data. If a direct import with this command fails, try opening it in excel and following the instructions above.
We will discuss how to import SPSS files when discussing an example of a user written command.
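For example, once myfile.xls has been saved as myfile.csv, a minimal import sequence might look as follows (the clear option and the final save step are optional additions, not part of the text above):
* import the comma delimited file and keep a Stata-format copy
insheet using myfile.csv, clear
save myfile, replace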
The programme “usespss” is used to import data saved in SPSS format.3 If we try to import the SPSS
file in our directory, use spssfile.sav we get the error message “file SPSSfile.sav not Stata format”. You
would imagine that it should be relatively easy to transfer files between different statistics packages but this
is not the case. Without this command you might need to use the expensive StatTransfer programme. As we
know what we’re looking for, the process of installing the programme is easy. We use the “findit” command.
findit usespss. In fact if you’re sure of the name you can simply type ssc install usespss. If you are
not exactly sure of the name, but have a general idea of what it’s called you can use the command search
usespss, all. A new window will appear, and clicking on the blue link will take you to a new page. Click
on CLICK HERE to install. This programme is now ready to run. You can also access the help file, help
usespss. We can now open the SPSS file in our folder. usespss using spssfile.sav. We could now save
this in Stata format for future use. save, replace. It is saved as “SPSSfile.dta”. However as you can see,
this is just the same as our icecream dataset.4 We will re-load the Stata version. use icecream2, clear.
1.9 Syntax
Getting to grips with how to communicate with Stata is perhaps the most daunting aspect of starting out.
Generally programmes and commands take the form of “command name” “variable name(s)” “, options.”
We will shortly see examples with tab and later regress. The exact syntax for a particular command is
detailed in the help file. For example, help tab. Here the aim is to introduce you to some of the most
important commands. As you become more familiar with them you will be able to use the various options
available, depending on the particular task you wish to perform.
Stata understands abbreviations, as long as the abbreviation can only be interpreted one way. For example,
the full command to run a regression is “regress”, however Stata understands what you mean if you only type
“reg”. The same principle applies to variable names. Using the icecream dataset, typing tab ti is equivalent
3 SPSS is a statistics package popular in the other social sciences
4 An alternative for transferring SPSS files into Stata is to download SPSS which is available from UCD Connect, open the
SPSS file and save in Stata format.
Table 1: Logical Operators in Stata
And &
Or |
Not ! or ~
Multiplication *
Division /
Addition +
Subtraction -
Less Than <
Greater Than >
Less Than or Equal <=
Greater Than or Equal >=
To The Power Of ^
Wildcard *
to tab time. However, the command tab t will return the error message “t ambiguous abbreviation.” In
the help file for a command, the shortest acceptable abbreviated version is underlined.
2 Data Manipulation
Here we introduce the basic commands for manipulating your data. The most important logical operators in
Stata are outlined in table 1. The most frequently used are & (and), | (or) and ! (not). These are essential
for manipulating the data correctly. We can illustrate some of these using the “display” command, which we
can shorten to “di”: di 10*20 and di 6/(5-2) +18. Notice that strings require double quotes di Welcome
does not work but di "Welcome" does. You can also access system values and programme results with this command, for example today’s date: di "`c(current_date)'". Note that di `c(current_date)' without the double quotes returns an error message. We need the single quotes because "c(current_date)" is a macro. This is how we were able to name our log file; see section 2.10 for more details on macros. We will explain the use of the wildcard
in section 2.9. Stata will know whether you mean multiplication or the wildcard depending on the situation.
2.3 if Commands
Oftentimes we want Stata to run a command conditional on some requirement. For example, the correlation
between price and consumption if the temperature is greater than 25°C. This is easily achieved: corr
price cons if temp>25. To add more conditions to a command, for example to examine the correlation
if temperature is both greater than 25°C and less than 35°C, we use the & operator: corr price cons if
temp>25 & temp<35.
We can also use if to investigate data more closely: summ cons if price >.275. We can create
dummy variables with if commands. Typically two steps are needed. First we create a variable set equal
to zeros: gen expensive = 0. Now we replace it: replace expensive = 1 if price > avg_price. See
the tutorial on regression analysis for more details on dummy variables. Similarly we can control for outliers
using if commands. For example if you want to eliminate the most expensive 5% of observations, the
following would work:
* store the 95th percentile of price as a new variable, then drop observations above it
egen top_fivepercent_prices = pctile(price), p(95)
drop if price > top_fivepercent_prices
We remove these variables from the data with the “drop” command5 as we do not need them in this
analysis. drop loggedprice twice root_temp avg_price high_price expensive.
result of this in the variable window where this label will appear. The label will also be used in tables and
other output generated by Stata.
We have already encountered the other type of labels, value labels. These are labels attached to particular
values of a variable and are mainly used with categorical data. In the case of the variable province, we have
already seen how this works. For a quick way of seeing exactly which labels are attached to each value, type
codebook province to obtain the name of the value label. Then labelbook province1 will show all the
values and their labels. From this we can see that in the variable province, the value “1” is labelled with
the name “Ulster”, 2 “Leinster” etc. If there are many value labels you may need to use the option all.
codebook province, all.
You will often need to either create your own value labels, or else modify existing ones. The procedure
is almost the same in both cases. If you are starting from scratch, you will need to pick a name for your
label. In this case we will create value labels for the weekend variable, and call the label weekend1. Suppose
we know that “1” refers to “weekend” and “2” refers to “weekday”. We use the “label define” command,
followed by the values and their labels in double quotation marks: label define weekend1 1 "Weekend" 2 "Weekday". We then need to attach our value label to the existing variable using label values weekend
weekend1. We tab weekend to confirm the change. The only difference in the case of modifying an existing
set of values labels is that you need to obtain the name of the value label (“codebook” will supply you with
this). Then you use the “label define” command with the “modify” option to change one or more of the
labels. You do not need to reattach the modified value label to the variable, this is done automatically. You
also do not need to write out the full set of labels, only the one you want to change. For example, if we
wanted to change the label on the province variable for the value “2” from “Leinster” to “Dublin”, we would
use label define prov 2 "Dublin", modify.
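Gathered into do-file form, the labelling steps described in this subsection look roughly as follows (a sketch; the label names weekend1 and prov are the ones used above):
* define and attach value labels for the weekend variable
label define weekend1 1 "Weekend" 2 "Weekday"
label values weekend weekend1
tab weekend
* modify an existing value label attached to province
label define prov 2 "Dublin", modify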
the master dataset, the dataset we merged with, or both. See help merge for details. For now we will drop this variable with drop _merge.
2.7 Tabout
We have already discussed tabulating variables and looking at crosstabs. Here we will examine how to
extract these results in a way that can be easily used in presentations or papers. To do this we need the user
written command “tabout”. To find and install, we first search for it. search tabout, all. The option all
allows us to search the internet. We find an entry in blue for tabout, with the description:
‘TABOUT’: module to export publication quality cross-tabulations / tabout is a table building program
for oneway and twoway / tables of frequencies and percentages, and for summary tables. It / produces
publication quality tables for export to a text file.
Table 2: Tabout Example 1 - Crosstabs
Province
Ulster Dublin Munster Connacht Total
Weekend Num % Num % Num % Num % Num %
Weekend 5 55.6 7 58.3 3 60.0 3 50.0 18 56.3
Weekday 4 44.4 5 41.7 2 40.0 3 50.0 14 43.8
Total 9 100.0 12 100.0 5 100.0 6 100.0 32 100.0
“Tabout” is at its most powerful when used in conjunction with LaTeX, which is the software which was
used to create this document. Note that the more recent version has a slightly different syntax, but the
general idea is the same. You can also get confidence intervals using the latest version.
Several issues arise when trying to manipulate string variables. For example, if we try replace province2=15 if county==Armagh, we get an error message. This is because in general Stata requires strings to be surrounded with double quotation marks: replace province2=15 if county=="Armagh" works. But we should undo that change: replace province2=1 if county=="Armagh". As we will see later on,
single quotation marks have other uses. In general, dealing with strings can be tricky, for example, we cannot
replace a string value with a numerical value. We also want to avoid having to type out a string every time
we want to manipulate the data. Expressions involving > and < clearly don’t work with strings. Fortunately,
there is a command which converts string variables into categorical variables with the appropriate value labels
(e.g. the province variable). Here we use this on the county variable. encode county, gen(county2). Now
we have a categorical numerical variable county2 which is appropriately labelled. We can check this in
the data browser. encode county, replace would have given us almost the same result, except with the
original county variable replaced rather than generating a new variable. There is a command, “decode”
which reverses the process.
macro with the “local” command, and access it by wrapping its name in single quotation marks, as in `x'.89 First we define a macro x which takes the numerical value 10: local x 10. Now every time we call the macro `x' we have the value 10. To check this we type di `x'. We can now use this macro in expressions, for example di 100-`x'. We can also use it to manipulate variables: gen income2=income*`x'. We can also store words in macros.
Suppose you wanted to add the word “icecream” to each variable name. You could type rename price
icecreamprice and rename cons icecreamcons etc. To save time you could store the word “icecream” in a macro: local y icecream. Then rename price `y'price would give the same result. You may wonder
as to how useful this is, and in these cases it is probably not particularly helpful. A better example is when
we want to store a list of variables. Rather than typing out the whole list every time, we can save the
variables in a macro: local z price income temp. Then suppose we wanted to recode all missing values for all of these: instead of typing mvdecode price income temp, mv(100) we can type mvdecode `z',
mv(100). This is a small dataset so it’s not a particularly big deal here, but it’s a different matter when you
have 100s of variables.
Macros are also important for accessing results stored by programmes.10 The macros saved by a pro-
gramme are listed in the help file. For example, if we look at help summarize, we see the list for this
command.11 Suppose we are interested in constructing new versions of our variables which are in the form
of a z score (standardised deviation from a variable’s mean: z_i = (µ − x_i)/sd(x)). We can see from the help file that
the command “summarize” stores the two results we need, the mean and standard deviation in the macros
“r(mean)” and “r(sd)” respectively.12 We can use these to form our new variables. To access the stored
results we need an “=” that we did not need when we were defining our own macros. First we run the command
sum time. Then we define our macros local a=r(mean) and local b=r(sd). To check we have the correct
results: di `a' and di `b'. Now we can generate our new variable: gen ztime=(`a'-time)/`b'. If we now
sum ztime, we see that the mean is effectively zero (it’s actually a very small number due to rounding), as
it should be, since deviations from the mean average to zero by definition. The standard deviation is
also as expected. We will later write our own programme which will allow us to transform all our variables
in this way in a single line of code. Think how long it would take to do this in excel by hand.
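Collected together, the steps just described look like this (a sketch using the time variable as above):
* obtain the mean and standard deviation of time and store them in locals
sum time
local a = r(mean)
local b = r(sd)
* generate the z score, using the (mean - x)/sd convention above, and check it
gen ztime = (`a' - time)/`b'
sum ztime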
Loops are another time saving device which employ the use of macros. For example in our icecream
dataset, we notice from the data browser that the variable hour has been badly inputted. These should all
be in the format of the 24 hour clock, however you can see that the final two zeros are missing from some
of the entries. This makes analysis difficult, for example sum hour will give misleading results. In order to
correct this we make use of a loop. This involves the forvalues command. Essentially we want to add two
zeros to every entry less than 100.
forvalues i=1(1)24 {
    replace hour=hour*100 if hour==`i'
}
Notice the syntax, we are creating a macro “i” which will start at the value 1, execute the command for
that value, move on to the next value (“2”), execute the command for that value etc until the loop ends.
The first number refers to the starting value, the number in brackets is the increment, and the final number
is the end value. We need a curly brace at the end of this line, the command on a separate line, and another
curly brace again on a separate line. Executing this command will give us a well behaved variable with every
entry in the same format. sum hour.
The other type of loop (the “foreach” command) is generally used when you want to perform the same
task on a number of different variables. We will write our own programme to transform every variable into a
z score. We will call it “zscore”. We then use the foreach loop, with “i” being the macro that corresponds to
8 Note that ‘ is the inward pointing single quotation mark, and is usually the button to the left of the number 1 on your keyboard.
11 Type mac list to see which macros are currently in use.
12 For more on accessing macros see help extended fcn.
every individual variable we wish to transform. The macro “0” refers to the list of variables we are interested
in, essentially everything we type after our new command.13 Like forvalues, we need a curly brace at the
end of this line, and each of our commands also on a separate line. For every variable in “0” (our variable
list) we are running the sum command, obtaining the macros for the mean and standard deviation, and then
generating a new variable which will have the prefix “zscore”. We also summarise this new variable. After
the final curly brace we need “end” to tell Stata the programme is finished.
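The programme itself is not reproduced in this extract; a minimal sketch consistent with the description above might look like this (the exact code in the original notes may differ):
capture program drop zscore
program define zscore
    * `0' holds everything typed after the command, i.e. the variable list
    foreach i of local 0 {
        quietly sum `i'
        local a = r(mean)
        local b = r(sd)
        gen zscore`i' = (`a' - `i')/`b'
        sum zscore`i'
    }
end
Running zscore time cons price would then create and summarise zscoretime, zscorecons and zscoreprice.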
We can now run the programme on whichever variables we are interested in. For example, zscore time
cons price. The summary statistics confirm we have what we wanted. Note that if you try to call a new
programme by the same name as an existing programme you get an error message. program define zscore.
So if you want to modify an existing programme you will first need to drop it. program drop zscore. But
if there is no program called zscore this will produce an error message. Hence the use of “capture” just like
when we were opening our log file. Also, you will need to define your programme again each time you start
a new Stata session, unless you save it as an ado file.14
forvalues i=1(1)10 {
    by province: replace var10=var10[`i'] if var10==. & var10[`i']!=.
}
13 Within the macro “0”, the macro “1” refers to the first variable, “2” the second etc.
14 It so happens that Stata already has a way of creating standardised variables with egen newvar=std(var).
15 These two steps could be combined with “bysort”: bysort province: gen provinceno=_N
Within each province we are simply replacing var10 with the value of the i-th observation in that province, provided the current value is missing and the i-th value is not. We loop over ten values as there are at most 10 observations per province.
The variable is only replaced if there is a missing value for that observation. Another useful command in
this context is gsort, which sorts the observations by one or more variables, in either ascending or descending order.
2.13 Graphs
Like the tables we discussed above, graphs are a powerful tool for exploring, summarising and presenting
your data. The basic graph commands for Stata are straightforward, however getting to grips with all the
available options is tricky. It would be impossible to discuss all the different types of graph, however we will
discuss the most common types. Like with tables we will divide graphs into two types, those that deal with
continuous variables and those that deal with categorical data. We will load our icecream data again. use
icecream2, clear.
For continuous data, the easiest way to visualise the relationship between two variables is to produce a
scatterplot of them, e.g. scatter cons temp. If, instead, we want a graph of the line of best fit between
the two variables, the relevant command is graph twoway lfit cons temp. We can also combine multiple
plots in one graph using the twoway command. For example, try using twoway (scatter cons temp) (lfit
cons temp). We can add some complexities very easily. For example you may wish to add confidence interval
“bands” around your line of best fit. To achieve this, use lfitci instead of lfit.16 Stata can also produce several graphs in one chart. For example, we can create a 3x3 matrix of scatterplots by typing
graph matrix cons temp price, scheme(s1mono). To investigate the distribution of a single variable, we
can create a histogram of it using the histogram command.17 For example, histogram temp.
We can produce a bar chart showing the mean of our variables using graph bar time cons price
income temp. We can display this breakdown for each value of a particular variable within the same graph (graph bar time cons price income temp, over(province)) or in separate graphs (graph bar time cons price income temp, by(province)). We may also want to graph categorical variables like province, with the
aim of showing the per cent in each category. The best way to do this is with the user written command
“catplot”. As before we use the search command search catplot, all or ssc install catplot, and click
on the blue link to install. Now we can use this to graph our province variable by itself catplot province,
or by another variable catplot province weekend.
We will discuss this further in the time series section, but if you have time series data then line graphs
can be important. First we need to sort our data by the time variable, in this case time. sort time. Then
we use the “line” command to graph the variables. The last variable needs to be our time variable. So in
this case we could have line cons income time.
16 Note that the easiest way to change the overall look of a graph is with schemes. See help schemes. We will use
“scheme(s1mono)” to generate the graphs in this document as we want them in black and white.
17 A similar chart is produced with graph7.
Figure 1: An example of a graph matrix chart (scatterplot matrix of cons, temp and price)
We can save any graph we produce in Stata using the graph export command. After drawing our graph,
Stata will open it in a new window. It is possible to save it using the menus FILE, SAVE AS. We can also
type graph export filename.png, replace. Stata can save graphs in various different formats, but .png
is the most straightforward.18 We can then use the graphs in other documents and presentations.19
It would take too long to go through all of the available options, but some of the most important ones
refer to the title and axis labels. For example:
line income temp time, title(Time Series Graph of Income and Temperature Over Time) ///
xtitle(Time Period) ytitle(Euro(Income) and Degrees(Temp)) caption(Source: icecream.dta)
The “///” tells Stata to read the next line as part of the same command. Two examples of the kind
of graphs that are possible are provided below. The first graph was made using the user written command
“spmap”. The second graph was generated using the cyear dataset with the following code:
And:
http://www.ats.ucla.edu/stat/stata/library/GraphExamples/default.htm
Figure 2: Graph Example 1: Map (“Proportion In Cluster 4, Family And Community Integrated”; European countries shaded by percent)
Figure 3: Graph Example 2: Labelled Scatterplot (country-labelled scatter with linear fit; Note: Correlation=−0.1737; Source: WHO and Penn World Tables)
3 Regression Analysis
The purpose of this lab is to provide an introduction to regression analysis using Stata, particularly the
interpretation of the output produced by this and other statistical packages.
If we type help regress we see that the basic command takes the form regress y x1 x2 x3. Load the
icecream data. use icecream2, clear. If we want to regress consumption on income temperature and price
from our icecream dataset we type reg cons price temp income.20 The output from this is presented in
table 4. Stata will present the output almost immediately as this is a small dataset. We see that there are
30 observations used in the regression. Also note that Stata automatically includes a constant term, the
variable _cons, not to be confused with your y variable, cons. You do not need to create a constant term
yourself. 21 Interpreting output like this needs to become second nature if you are interested in pursuing
research in economics. The most important of these numbers are the coefficients and their p values so we
start with these. The coefficients (second column) tell us the effect of each X on Y (dy/dx). In this case
none of the variables are in log form so the interpretation is straightforward. The coefficients tell us how Y
changes if X changes by one unit. So in this case if price goes up by one unit, then consumption changes by
-1.044, i.e. consumption falls by 1.044. Likewise, if income goes up by one unit, consumption will go up by
.003. This is not the whole story however, some of these coefficients may be indistinguishable from zero (in a
statistical sense). It is important to remember the difference between a coefficient and a t-stat. Coefficients
are the “real” effects of variables but aren’t very useful without knowledge of the variance. Each coefficient is
assigned a t statistic in the 4th column: t = coefficient / se(coefficient). The standard error of the coefficient is displayed in the third column.
As we know the distribution of the t statistic, we are able to assess the probability that the population
coefficient is equal to zero. This probability is the number displayed in the 5th column. So in this case we
cannot reject the possibility that price has no effect on consumption, however we can reject the possibility
that temperature and income have no effect. Another way of saying this is that the 95% confidence interval
for price includes 0. Also included in the output are the Residual Sum of Squares, the Model (or Explained)
Sum of Squares, and the Total Sum of Squares. We also have the R2 , which describes the proportion of
variation in the outcome explained by the model. Remember that R2 = MSS/TSS. We also have the F
statistic, which is a test of whether the coefficients are all jointly equal to zero. An F test is analogous to
a t test, except it examines several hypotheses at the same time. In this case F=22. As we know the F
distribution we can assign it a p value in the same way we did with the t values. In this case we reject the
possibility that the model coefficients are all zero. The F stat can also be seen as a test of whether the R2
value has arisen at random. The R2 term can only increase as more variables are added to the regression
and can therefore be misleading, as the model may contain several redundant variables which have little
explanatory power. The adjusted R2 takes account of the number of variables used.
Interpretation of coefficients when some of the variables are in log form is a little different. If both X
and Y variables are in logs, then the X coefficient tells you the percentage change in Y when X goes up
by 1%. If Y is in log form and X is not, then the X coefficient (multiplied by 100) tells you the approximate percentage change in Y when X goes up by one unit. If Y is not in log form but X is, then the coefficient divided by 100 tells you the unit change in Y if X goes up by 1%. For example, suppose we generate a new variable which is the log of income gen logincome=ln(income), and include this in our regression instead of our original variable. reg cons price temp logincome. Now the coefficient on logincome is .3, which we interpret as meaning that a 1% rise in income results in consumption rising by about .003 units (.3/100).
Table 4: OLS Regression Output
------------------------------------------------------------------------------
cons | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
price | -1.044413 .834357 -1.25 0.222 -2.759458 .6706322
temp | .0034584 .0004455 7.76 0.000 .0025426 .0043743
income | .0033078 .0011714 2.82 0.009 .0008999 .0057156
_cons | .1973149 .2702161 0.73 0.472 -.3581223 .752752
------------------------------------------------------------------------------
which runs from 1-4 with tab province). Instead we need to include a separate dummy variable for each
category.22 We could create 3 new variables, taking the value one for each category and zero otherwise.23 gen
leinster=(province==2) and gen munster=(province==3) and gen connacht=(province==4).24 Now
we can include these 3 variables in our regression. reg cons price temp income leinster munster
connacht. There is an easier way of doing this however. We do not actually need to generate the sep-
arate dummy variables ourselves if we use the xi command, which expands categorical variables for us. xi:
reg cons price temp income i.province. Remember to include “i.” before the categorical variable you
wish to expand. Note that these dummies will appear in your variable window.
It is important to be able to interpret the coefficients on these dummy variables correctly. Notice again
that there are only three dummies, despite the fact that there are four categories in the province variable.
Each of these refers to the effect of each category relative to the omitted category. We can clearly see that
the omitted category is province 1 (Ulster). In any case, Stata tells us which one is omitted. If we want to
change the omitted category, for example if we wanted to compare the effect of the other provinces to being
in Leinster, we can set which group Stata omits. char province[omit] 2. If we run the regression again
we see this is indeed the case xi: reg cons price temp income i.province. The output is presented in
table 5.
3.2 Outreg2
You will encounter dozens if not hundreds of regressions, and if you need to present these results copying
and pasting from Stata is not really an option. Fortunately there is a user written command, similar to
tabout, which automates the export of regression results. First we need to install the programme. findit
outreg2 or search outreg2, all, or ssc install outreg2. Having done this, we can check the help file
help outreg2. We run the outreg2 command after our regression to save our results in Excel format (this is
probably best, unless you are using LaTeX). We need to specify the filename. outreg2 using test, excel
replace. This exports all coefficients, if required you can specify the variables for export, e.g. outreg2
temp income using test, excel append. The replace and append options operate as with tabout. If you
choose append, new regression results will appear in different columns in the file. The excel option is also
22 Including binary variables on the right hand side of a regression does not violate any of the assumptions which make OLS
BLUE.
23 We don’t need 4 variables, this would be falling into the Dummy Variable Trap.
24 A quicker way to do this is tab province, gen(province).
Table 5: OLS Regression Output With Dummy Variables
------------------------------------------------------------------------------
cons | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
price | -1.272991 .9760936 -1.30 0.205 -3.292195 .7462123
temp | .0034345 .0004815 7.13 0.000 .0024385 .0044306
income | .0030756 .0012691 2.42 0.024 .0004503 .005701
_Iprovince_2 | .0093664 .0190921 0.49 0.628 -.0301285 .0488614
_Iprovince_3 | .003358 .0225584 0.15 0.883 -.0433076 .0500236
_Iprovince_4 | .0181066 .0210197 0.86 0.398 -.025376 .0615892
_cons | .2737497 .3113172 0.88 0.388 -.370259 .9177583
------------------------------------------------------------------------------
important as this ensures that Excel will recognise the way the standard errors are formatted. An example is shown in table 6. It is also possible to export the results directly to a Word document with the “word” option
outreg2 using test2, word replace. In this case you will get a document with the extension “.rtf”.
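Putting the pieces together, a typical export sequence along the lines described above might be (a sketch; the filename test is the one used in the text):
* run a regression and start a new results file
reg cons price temp income
outreg2 using test, excel replace
* run a second specification and add it as a new column
xi: reg cons price temp income i.province
outreg2 using test, excel append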
Table 6: Outreg Example
(1)
VARIABLES cons
price -1.273
(0.976)
temp 0.00343***
(0.000482)
income 0.00308**
(0.00127)
_Iprovince_1 -0.00937
(0.0191)
_Iprovince_3 -0.00601
(0.0233)
_Iprovince_4 0.00874
(0.0202)
Constant 0.283
(0.318)
Observations 30
R2 0.728
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1
We can plot the residuals against the x variables: scatter res temp and scatter res income. Both of these plots show the residuals centred
around zero which is not indicative of any misspecification, however we can test this more formally.
If you have a problem with heteroskedasticity, you can use White (1980) heteroskedastic-consistent stan-
dard errors by adding the “robust” option, e.g. regress y x1 x2 x3, robust. Robust errors are not
always appropriate. You should also consider whether you should cluster your standard errors with the
“cluster(variable)” option. See chapter 8 in Mostly Harmless Econometrics for more details on both these
issues. With regards to autocorrelation you can produce a correlogram with the ac command. The Durbin-
Watson test is called with dwstat. If you’d like to run a Cochrane-Orcutt regression instead of OLS, you
can use the prais command. The standard test for model specification is the Ramsey RESET test. The relevant command is ovtest. A high F statistic suggests your model is improperly specified. To test for multicollinearity, enter the vif command after a regression. This reports the variance inflation factor, which is
basically the effect of collinearity. One suggestion is that values greater than ten warrant further investiga-
tion. You can use Stata’s sktest to test normality. You could plot the residuals with predict myresids,
resid which produces the residuals in a variable named myresids. Then you can produce a histogram of
this variable, or use a qnorm plot. For example try kdensity myresids, normal to compare your graph to
the normal bell-curve.
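Collected together, the diagnostic commands mentioned above might be run as follows (a sketch re-using the icecream regression from earlier; which checks are appropriate depends on your data):
* baseline regression
reg cons price temp income
* Ramsey RESET test and variance inflation factors
ovtest
vif
* residual diagnostics: save residuals and compare to a normal density
predict myresids, resid
kdensity myresids, normal
* re-estimate with heteroskedasticity-robust standard errors
reg cons price temp income, robust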
Table 7: Linear Probability Model Output
4 Binary Regression
Including categorical or binary measures on the right hand side of a regression does not pose any problems
for estimation, however it is a different matter when the Y variable is categorical. We first examine the
case of a binary dependent variable. We will use the wages data. use wages1. We have information on
an individual’s years of schooling, their current wage(in logs), their experience (in years), and their gender.
We first summarise the data describe and summarize. So we have 4 variables and 3296 observations. We
examine the wage variable with histogram wage. As there are clearly a number of outliers we restrict this
graph to values less than 20, and add a kernel density plot. histogram wage if wage <20, kdensity. The
variable appears roughly lognormal. Suppose we are interested in the determinants of wages. As an initial
step, we could examine the correlation between our variables. pwcorr wage exper male school, sig.
Wages are significantly positively correlated with being male and years of schooling. We could also examine
the average wage by years of schooling tabstat wage, by(school). Again we see evidence of positive
correlation, however the relationship appears to be non-linear. The lowest average wages are for those with 8 and 9 years of schooling. This is easier to see graphically: graph bar wage, over(school) title(Wages
by Years of Schooling). We could run a standard regression with this data, but instead suppose we are
interested in the determinants of earning a wage with a value greater than 7. We generate a variable that
takes the value 1 if the individual earns this value or greater, and zero otherwise. gen highwage=(wage>7).
28% of the sample are in this category tab highwage. Males are more likely to be in this category tab
highwage male, row.
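The linear probability model reported in Table 7 is not reproduced in this extract; it can be estimated along these lines (a sketch assuming the same regressors as the models in Table 8, with the predicted values used below):
* linear probability model for highwage, then predicted values by schooling
reg highwage exper male school
predict xbhat
tabstat xbhat, by(school)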
As our outcome is binary, all observations will lie either on 1 or 0. twoway(scatter highwage school)(lfit
highwage school), scheme(s1mono) title(Scatterplot of High Wage and Schooling) ytitle(High
Wage) xtitle(Years of Schooling). Because there are a limited number of outcomes, each point could represent several hundred observations or just ten; it is not possible to tell. To get a better idea of the distribution
of our variables, we add some random noise to our plot using the jitter option twoway(scatter highwage
school, jitter(10))(lfit highwage school), title(Scatterplot of High Wage and Schooling) ytitle(High
Wage) xtitle(Years of Schooling). The issue is clear from this graph: for individuals with less than 9
years of schooling, the predicted probability of earning a high wage is negative, which clearly does not make
sense. To confirm this we obtain the predicted probabilities from our model. predict xbhat and check
them against years of schooling tabstat xbhat, by(school).
Figure 4: A Problem With OLS (High Wage against Years of Schooling with linear fit)
Table 8: Logit and Probit Output
------------------------------------------------------------------------------
highwage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
exper | .0320678 .0113016 2.84 0.005 .009917 .0542186
male | .5130279 .0496018 10.34 0.000 .4158102 .6102456
school | .2565379 .0161262 15.91 0.000 .2249311 .2881447
_cons | -4.138153 .2306573 -17.94 0.000 -4.590233 -3.686073
------------------------------------------------------------------------------
------------------------------------------------------------------------------
highwage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
exper | .0530249 .0194082 2.73 0.006 .0149855 .0910644
male | .8678092 .0847772 10.24 0.000 .7016489 1.03397
school | .4386614 .0283132 15.49 0.000 .3831685 .4941544
_cons | -7.028232 .4058646 -17.32 0.000 -7.823712 -6.232752
------------------------------------------------------------------------------
Figure 5: Problem Solved With Probit and Logit
[Predicted values from the probit and logit models plotted against Years of Schooling.]
29 Odds ratios from logit models are often reported in psychology, sociology and epidemiology.
Table 9: Marginal Effects Output
By default, Stata calculates these marginal effects at the mean of the independent variables, however it is
also possible to evaluate them at other values. For example, suppose you suspect that the effect of experience
and schooling on wages differs for men and women. Then you could evaluate the marginal effects for women.
mfx compute, at(male=0) and outreg2 using test, mfx excel append ctitle(mfx probit women).
For men mfx compute, at(male=1) and outreg2 using test, mfx excel append ctitle(mfx probit
men). To reiterate, because OLS is a linear estimator the estimated marginal effect is the same at every set
of X values. You can think about this in terms of the slope of the regression line in the bivariate case. OLS
is a straight line, so the slope of the line is the same at every point, unlike with logit and probit.
The marginal effects of experience and schooling appear larger for men than for women. There are other
ways of estimating marginal effects. An alternative is to calculate the marginal effects for every
value and then take the average to obtain the average partial effect. For this we use the user written command
margeff. ssc install margeff to install. Then we run the command with the replace option as we wish to
export the results. margeff, replace. We use outreg2 without the mfx option this time. outreg2 using
test, excel append ctitle(amfx probit). We can see that in this case both approaches give similar
results. Mfx2 is another user written command which produces marginal effects. If you are using a probit
model, you can obtain the marginal effects directly (without the coefficients) with the command dprobit
highwage exper male school.
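A consolidated sketch of the marginal effects workflow just described (the outreg2 export steps are omitted):

* Probit model; mfx evaluates marginal effects at the means of the X variables
probit highwage exper male school
mfx compute
* Marginal effects evaluated separately for women and for men
mfx compute, at(male=0)
mfx compute, at(male=1)
* Average partial effects with the user written margeff command
ssc install margeff
margeff, replace
* dprobit reports probit marginal effects directly
dprobit highwage exper male school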
Table 10: Alternative Binary Estimators for HighWage
(1) (2) (3) (4) (5)
VARIABLES mfx probit mfx probit women mfx probit men amfx probit mfx hetprob
If you are concerned about potential misspecification, the heteroskedastic probit model allows you to
specify the variance as a function of one or more of the independent variables. Suppose we suspect that
the variance of our error term depends on the level of schooling; we could then estimate hetprob
highwage exper male school, het(school), compute the marginal effects mfx compute, and export
them outreg2 using test, excel mfx append ctitle(mfx hetprob). The LR test of lnsigma is a test
of our assumption that the variance depends on schooling. We reject at the 10% significance level but fail to
reject at the 5% level. Finally, note that if you are including interaction effects in these kinds of binary
models you must be careful about interpreting them. The user written command “inteff” is recommended.
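A short sketch of the heteroskedastic probit step described above:

* Heteroskedastic probit: variance modelled as a function of schooling
hetprob highwage exper male school, het(school)
* The output includes an LR test of lnsigma2 = 0 (the homoskedastic probit)
mfx compute
outreg2 using test, excel mfx append ctitle(mfx hetprob)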
[Figure: Autocorrelations of im_rate_per_annum and gdp_maddison; x-axis: Lag; Bartlett’s formula for MA(q) 95% confidence bands.]
5 Time Series
We will examine the basic fundamentals of time series analysis using data on Irish infant mortality. First
we load the data use im, clear and apply some labels. lab var netm ’’Net Migration’’ and lab var
at ’’Average Annual Temperature’’ lab var ar ’’Average Annual Rainfall’’. We also generate a
time squared variable gen year2=year^2. We first summarise the data describe and sum, and plot
our variables. As we have time series data we need to tell Stata: we first sort the data sort year and
then declare the time variable tsset year.
[Figure: Partial autocorrelation functions (two panels, including im_rate_per_annum); x-axis: Lag; 95% confidence bands, se = 1/sqrt(n).]
It is important to be able to interpret these tests properly. The null hypothesis is the presence of a unit
root. If the test statistic does not exceed the 5% critical value in absolute terms (i.e. is not more negative
than the 5% critical value), we cannot reject non-stationarity. In each case here we fail to reject the null
hypothesis of a unit root, and conclude that both variables are integrated of order 1. There are several
potential ways of addressing this. Remember that non-stationarity violates the assumptions on which
standard OLS inference relies. To deal with this problem, we could detrend the variables,
we could check for the presence of a cointegrating relationship between gdp and infant mortality, and finally
we could run a model in first differences.
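The tests being interpreted here can be run with commands along the following lines (a sketch: the lag lengths shown are assumptions and would in practice be chosen from the varsoc output):

* Declare the time series and check lag length for the ADF tests
tsset year
varsoc im
varsoc gdp
* Augmented Dickey-Fuller tests with a trend; the null hypothesis is a unit root
dfuller im, trend lags(1)
dfuller gdp, trend lags(1)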
Figure 8: Using OLS To Detrend Variables
[Panel: Residuals and Fitted Values From Infant Mortality.]
In this case, detrending the variables is not enough to induce stationarity. So we cannot simply include these
detrended variables in our regression. Note that in fact this kind of detrending procedure is hardly ever
necessary, as it is equivalent to simply including the time trend on the right hand side of our regression.32
Compare the coefficients on gdp from these two regressions: reg im gdp year year2 and reg imres
gdpres.33 There is a second possibility for running a valid regression: although infant mortality and gdp
are non-stationary, they may be cointegrated, i.e. share a common trend. If we were running Stata version
10 or above we could run a Johansen test for cointegrating vectors between the two variables, however
there are alternatives. The principle in testing for cointegration is that if such a relationship exists between
two variables, the residuals from their regression should be stationary. So we run our basic model, this
time including year and year squared. We know from the above analysis that we cannot rely on this
detrending to induce stationarity, however we will examine the residuals to test for a cointegrating
relationship.34 reg im gdp at ar netm year year2. We obtain the residuals (for example with predict
cointres, resid) and graph them line cointres year and ac cointres and pac cointres. varsoc
cointres and then dfuller cointres, trend lags(1). We cannot reject the null that the residuals are
non-stationary, and so find no evidence of a cointegrating relationship. We finally return to the option of
running a regression in
first differences. Fortunately Stata provides a number of time series operators which simplify the process.
L.var is the lag of that variable, i.e. l.im is im_{t-1}. D.im is the first difference (im_t − im_{t-1}), and f.im
is the first lead (im_{t+1}). We can easily generate a new variable that is the first difference of gdp. gen
dgdp=d.gdp and for infant mortality gen dim=d.im. Both now look roughly stationary, although gdp
32 This is the Frisch-Waugh-Lovell theorem.
33 Note that we need to add underscores in order not to confuse Stata as we created two new variables imres and gdpres.
34 This residual-based unit root test is known as an Engle-Granger cointegration test.
Table 12: A Comparison of Time Series Models
(1) (2) (3)
VARIABLES Naive Regression Detrended Regression First Differences
Observations 35 35 35
R2 0.860 0.379 0.119
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1
still causes some concerns. line dim year and line dgdp year. We reject non-stationarity in both cases in
formal tests. varsoc dim and varsoc dgdp. dfuller dim, lags(1) trend. dfuller dgdp, trend lags(0).
We now proceed with a regression in first differences reg dim dgdp at ar netm and export the results
outreg2 using timeseries, excel append ctitle(First Differences).35 We conclude that the only
valid model is the first differences regression, and that gdp has no effect on infant mortality.
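Putting these steps together, a sketch of the Engle-Granger test and the first differences regression (the predict step is an assumption about how the residual variable cointres was created):

* Engle-Granger: test whether the residuals from the levels regression are stationary
reg im gdp at ar netm year year2
predict cointres, resid
varsoc cointres
dfuller cointres, trend lags(1)
* Regression in first differences using the D. time series operator
gen dim = d.im
gen dgdp = d.gdp
reg dim dgdp at ar netm
outreg2 using timeseries, excel append ctitle(First Differences)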
6 Ordinal and Multinomial Regression
6.1 Ordinal Data
We have seen an example of how to analyse data where the outcome variable is binary. We will now examine
the case where the Y variable is categorical. There are two situations: where the variable has a clear ranking
(e.g. health or education) we will consider the ordered logit and ordered probit models. When the Y variable
has no clear ranking the multinomial logit or multinomial probit is appropriate. We will load data taken
from the Health Survey for England. use ordinal, clear. As always we examine our data. describe
and sum. We will also examine our outcome variable in more detail. rename genhelf health renames
the variable. tab genhelf. We see that this is the standard health question asked in surveys. Individuals
are asked to assess their general health on a scale of 1 to 5, with 1 being very bad and 5 very good. If we
examine the labelling on this variable, we see that the coding is reversed. codebook health and labelbook
genhelf. We will fix this, and also tidy the value labels. recode health (1=5)(2=4)(4=2)(5=1). We
can now change the labels lab def genhelf 1’’Very Bad’’ 2’’Bad’’ 3’’Fair’’ 4’’Good’’ 5’’Very
Good’’, modify. Now this looks better tab health. Suppose we are interested in comparing health
across gender. We could take the average health score, tabstat health, by(gender), but this is not
ideal and in fact there is little difference here. Instead we should compare across categories. We could
graph this relationship catplot health sex, percent scheme(s1mono) title(Self Assessed Health
by Gender) blabel(bar, format(%8.1f)).
[Figure: Self Assessed Health by Gender; horizontal bars by category with percentage labels; x-axis: percent.]
There are more women in the very good and good categories, but also more in the bottom two categories.
Note how we added labels onto the end of the bars with the “blabel” option. We also examine height.
rename htval height and codebook height and labelbook htval. We replace missing values mvdecode
height, mv(-1).
Figure 10: Height Distribution
[Histogram of height with normal curve overlay; y-axis: Density.]
We see that height is roughly normally distributed hist height, normal scheme(s1mono) title(Height
Distribution). Suppose we wish to examine the determinants of self reported health status. We generate
an age squared term gen age2=age^2. We could start with a simple OLS model xi:reg health age age2
i.sex i.topqual3 height. We save the results outreg2 using ordinal, excel replace.
However, the difficulty with OLS is potentially more important in this situation than in the previous
case where we had a binary variable. The assumption behind these kinds of models is that the 5 point
scale is really a manifestation of an underlying continuous health index. OLS assumes that movement
from very bad to bad is equivalent to movement from good to very good. A simple examination of the
distribution shows that it is highly skewed towards good health. The problem is that we do not know
where these thresholds lie in this unobserved continuous index. The ordered logit and probit models take
account of these thresholds. They are similar to the standard probit and logit models we encountered in
the last section. xi:oprobit health age age2 i.sex i.topqual3 height and xi:ologit health age
age2 i.sex i.topqual3 height. The output from the ordered probit model is shown in table 13. The
“cut” values refer to estimates of the thresholds.
The same issue as in the binary case arises here. These are coefficients and not marginal effects, so we
can only identify the significance of the variable and not the magnitude. We can obtain marginal effects, but
in this case it will be the marginal effect of the variable for being in that particular category compared to all
others. In the ordered logit case, if we were interested in the marginal effect for being in very good health
mfx compute, predict(outcome(5)). Note that these are computationally intensive and can take a long
time to estimate. In this case we see that a unit increase in height increases the probability of being in the
very good category by .3%. Being in education category 8 reduces the probability of being in this category
by nearly 20%. mfx2 will estimate marginal effects for all categories. This raises a difficult question about
interpretation. Here there are 5 categories and 11 variables, which means 55 marginal effects to interpret.
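A sketch of the ordered models and the category-specific marginal effects described above:

* Ordered probit and ordered logit for self assessed health
xi: oprobit health age age2 i.sex i.topqual3 height
xi: ologit health age age2 i.sex i.topqual3 height
* Marginal effects on the probability of reporting very good health (outcome 5)
mfx compute, predict(outcome(5))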
Table 13: Ordered Probit Output
------------------------------------------------------------------------------
health | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0056614 .003921 -1.44 0.149 -.0133464 .0020236
age2 | -.0000358 .0000398 -0.90 0.369 -.0001139 .0000423
_Isex_2 | .1133844 .0373295 3.04 0.002 .0402199 .1865488
_Itopqual3_2 | .0691763 .0826651 0.84 0.403 -.0928444 .2311969
_Itopqual3_3 | -.0504849 .08567 -0.59 0.556 -.218395 .1174252
_Itopqual3_4 | -.1730846 .080075 -2.16 0.031 -.3300286 -.0161406
_Itopqual3_5 | -.1893723 .0780889 -2.43 0.015 -.3424238 -.0363208
_Itopqual3_6 | -.3217053 .0921638 -3.49 0.000 -.5023429 -.1410677
_Itopqual3_7 | -.2502062 .097808 -2.56 0.011 -.4419063 -.0585061
_Itopqual3_8 | -.5067183 .0812331 -6.24 0.000 -.6659322 -.3475045
height | .0085651 .0020132 4.25 0.000 .0046193 .012511
-------------+----------------------------------------------------------------
/cut1 | -1.47487 .3508896 -2.162601 -.7871385
/cut2 | -.7184513 .3490263 -1.40253 -.0343723
/cut3 | .1835116 .3488053 -.5001342 .8671575
/cut4 | 1.393712 .3490541 .7095789 2.077846
------------------------------------------------------------------------------
Also, suppose you find that a variable increases the probability of being in the top category and the
bottom category. It is difficult to know what to conclude from this kind of result. Often researchers will
divide the variable into two categories (e.g. very good and good health vs all other categories). OLS is a
potential alternative if there are a large number of categories.
A useful user written command is “spost”, which is aimed at the analysis of this type of categorical data. It
provides a number of diagnostic tests, as well as tools for interpretation. We will make use of it to graph
the predicted probabilities for our outcome. ssc install spost. We obtain the predicted probabilities for
each of our outcome categories by height prgen height, gen(heigh), and graph these with the following
code. This is useful as it is apparent from this that height increases the chances of being in the top category
and reduces the chances of being in the “fair” category. Interestingly, it appears to have little effect on the
probability of being in the bottom two groups.
twoway connected heighp1 heighp2 heighp3 heighp4 heighp5 heighx, scheme(s1mono) ///
title(’’Predicted Probabilities - Health Condition by Height’’, span) ///
subtitle(’’Ordered Logit Model’’, span) ytitle(Probability) ///
legend(lab(1 ’’Pr(Very Bad)=1’’) lab(2 ’’Pr(Bad)=1’’) lab(3 ’’Pr(Fair)=1’’) ///
lab(4 ’’Pr(Good)=1’’) lab(5 ’’Pr(Very Good)=1’’)) xtitle(Height)
Table 14: OLS and MFX for Ordinal Data
(1) (2)
VARIABLES OLS Logit MFX For Very Good Health
This time we will look at the predicted probabilities for men by age prgen age, x(_Isex_2=0) f(20)
t(80) gen(male). The probability of having a mortgage unsurprisingly decreases with age, and increases
for the other two options.
[Figure: Predicted probabilities of Own, Mortgage and Rent by Age (20 to 80).]
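As an illustration only, the multinomial logit behind these predicted probabilities might be estimated along the following lines (the outcome variable name tenure and the covariate list are hypothetical placeholders):

* Multinomial logit for housing tenure (Own / Mortgage / Rent); variable names are placeholders
xi: mlogit tenure age age2 i.sex, baseoutcome(1)
* Predicted probabilities by age for men (prgen from the spost package)
prgen age, x(_Isex_2=0) f(20) t(80) gen(male)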
Table 15: Multinomial Logit Output
Table 16: MFX for Multinomial Logit
(1) (2) (3)
VARIABLES Own Mortgage Rent
7 Panel Data
7.1 Panel Set Up
This tutorial presents the basics of panel data analysis, using a sample of employed workers in the British
Household Panel dataset.37 It is available for download from the UK Data Archive http://www.data-archive.ac.uk/.
First we load the data use bhps1, clear. Note that this is quite a large dataset. If we browse, we see that
the data is in the incorrect “wide” format so we need to use the reshape command to convert it to long. reshape long
mlstat jbstat hlstat smoker jbsat fisit caruse age qfedhi hhsize hhtype ncars fiyr fihhyr jbhrs,
i(pid) j(year). Now we have a unique identifier (“pid”), a time variable (“year”), and our outcomes.
As always, we first label and tidy our data.
lab var mlstat ‘‘Marital Status’’
lab var jbstat ‘‘Job Status’’
rename hlstat health
lab var health ‘‘General Health’’
lab var smoker ‘‘Smoker’’
lab var jbsat ‘‘Job Satisfaction’’
lab var fisit ‘‘Financial Situation’’
lab var caruse ‘‘Use A Car’’
lab var age ‘‘Age’’
lab var qfedhi ‘‘Highest Educational Qualification’’
lab var hhsize ‘‘Household Size’’
lab var hhtype ‘‘Type of Household’’
lab var ncars ‘‘Number of Cars’’
lab var fiyr ‘‘Annual Income’’
lab var fihhyr ‘‘Household Annual Income’’
lab var year ‘‘Wave’’
lab var jbhrs ‘‘Hours Worked Per Week’’
gen age2=age^2
lab var age2 ‘‘Age Squared’’
Suppose we are interested in the relationship between income and job satisfaction. We first examine
our job satisfaction variable tab jbsat. There are over 40,000 observations on this variable in the data;
remember that this will involve information on the same individuals over time. We will ignore missing
values and individuals who didn’t answer. First we check these values codebook jbsat and labelbook
kjbsat. Alternatively, the user written command lablist is useful, which you can install with ssc install
lablist. Then we can recode these to missing values replace jbsat=. if jbsat<1. And check tab
jbsat. We also do this for missing values in our health and education variables mvdecode health, mv(-9
-1) and mvdecode qfedhi, mv(-9 -8). Let’s examine our income variable hist fiyr. This suggests
the presence of outliers which we may wish to exclude from our analysis. We first remove the negative
values which denote missing replace fiyr=. if fiyr<0. One way to deal with the outliers is to exclude the bottom and top 5%
of observations. First we find these values using the “centile” command. centile fiyr, centile(5
95) and recode these values as missing replace fiyr=. if fiyr<2500 | fiyr>32000. Now we
have a much more sensible distribution hist fiyr. We take the log of income gen logincome=ln(fiyr)
and graph the distribution hist logincome, kdensity scheme(s1mono) title(Income Distribution)
xtitle(Log Income) caption(’’Source: BHPS’’, span).
We can also examine our job satisfaction variable tab jbsat and graph it catplot jbsat sex, percent
blabel(bar, format(%8.1f)) title(Job Satisfaction) subtitle(By Gender). We can obtain the av-
erage income for each category tabstat logincome, by(jbsat). We notice that those in the highest cate-
gory for job satisfaction have the lowest income. We can use a scatterplot to confirm that the relationship
is non-linear.
37 This is a fantastic resource covering 18 waves, see http://www.iser.essex.ac.uk/survey/bhps for more details.
Figure 13: BHPS Income
[Income Distribution: histogram of log income with kernel density overlay; y-axis: Density. Source: BHPS.]
If we look at the health variable, tab health and codebook health, we notice that the highest level of
health corresponds to “1” and the lowest to “5”. This may be confusing so we reverse this recode health
(1=5)(2=4)(4=2)(5=1). Don’t forget to change the labels labelbook khlstat and label define khlstat
1’’Very Bad’’ 2’’Bad’’ 3’’Fair’’ 4’’Good’’ 5’’Very Good’’, modify.
We will now start our regression analysis. We could estimate a simple OLS model xi:reg logincome
sex age age2 i.jbsat i.health i.year i.qfedhi with income as a function of gender, age, age squared,
job satisfaction, health and education, with controls for survey year. This involves ignoring the fact that we
have information on the same individuals over time, which may not be ideal as you can imagine.
If we try to run a panel model xi:xtreg logincome sex age age2 i.jbsat i.health i.year i.qfedhi,
re we get an error message. Like with the time series case, we must inform Stata that we have a panel, and
identify our time variable and panel identifier. xtset pid year. However, before running the model, we should examine
the panel aspects of the data. With the xtdescribe command we obtain the characteristics of our data.
We have a strongly balanced panel (that is, all individuals are present in all waves), with 6429 people
tracked over 11 years. We can check that this gives us over 70,000 observations in total bysort pid: gen
no=_N and tab no. “xtsum” is another useful command. Remember that there is variation within individuals
as well as across individuals in a panel. We examine this for our job satisfaction variable xtsum jbsat.
Time invariant variables have no variation within individuals over time xtsum sex. This will be impor-
tant when we examine the fixed effects model. Another command is “xttrans” which gives the transition
probabilities for a variable. xttrans jbsat. See table 19. So if you are in category 7, you have a 50/50
chance of remaining there next period.
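A sketch pulling together the panel set up and exploration commands used in this subsection:

* Declare the panel: individual identifier and time variable
xtset pid year
* Structure and balance of the panel
xtdescribe
* Observations per person
bysort pid: gen no = _N
tab no
* Within and between variation (time invariant variables have no within variation)
xtsum jbsat
xtsum sex
* Transition probabilities across waves
xttrans jbsat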
Figure 14: BHPS Income by Job Satisfaction
[Log income plotted against Job Satisfaction category.]
Table 19: xttrans Output
         Job |                       Job Satisfaction
Satisfaction |      1       2       3       4       5       6       7 |   Total
-------------+---------------------------------------------------------+---------
           1 |  20.69   15.71   12.45   11.88   12.26   19.92    7.09 |  100.00
           2 |   8.15   15.54   20.76    9.94   18.73   22.93    3.95 |  100.00
           3 |   3.57    8.31   24.72   14.58   24.92   20.91    2.99 |  100.00
           4 |   2.73    4.29   14.00   23.48   28.49   22.43    4.57 |  100.00
           5 |   1.10    2.59    8.26   10.64   35.05   37.85    4.51 |  100.00
           6 |   0.57    1.25    3.67    3.87   19.40   60.99   10.25 |  100.00
           7 |   0.74    0.56    1.25    1.62    6.97   40.04   48.82 |  100.00
-------------+---------------------------------------------------------+---------
       Total |   1.60    2.71    6.99    7.45   21.52   45.24   14.48 |  100.00
It is also likely that the Y variable is correlated over time. We can confirm this by looking at the
correlation within income over time pwcorr logincome l.logincome l2.logincome l3.logincome, sig.
Note that we are using time series operators, which also work in a panel context. As expected, income
is serially correlated. Note that the pwcorr command uses pairwise deletion of missing values, whereas corr
uses listwise deletion.
Table 20: Test For A Common Intercept
Estimated results:
| Var sd = sqrt(Var)
---------+-----------------------------
logincome | .6037161 .7769917
e | .1510079 .3885974
u | .2929869 .5412827
Test: Var(u) = 0
chi2(1) = 33273.67
Prob > chi2 = 0.0000
sigma_u | .38981447
sigma_e | .25416102
rho | .70169989 (fraction of variance due to u_i)
Table 22: Fixed Effects Output
F(32,26999) = 339.13
corr(u_i, Xb) = -0.0551 Prob > F = 0.0000
sigma_u | .54079082
sigma_e | .25416102
rho | .81908033 (fraction of variance due to u_i)
This eliminates the influence of all time invariant characteristics of the individual, including unobserved
heterogeneity. The intuition is as follows. As personality doesn’t change over time (let’s just assume it
doesn’t for now!), it cannot explain the change in the outcome over time. Of course it may explain the
level, but this doesn’t matter when we’re looking at first differences or deviations from a mean. This is
commonly referred to as the fixed effects estimator, since it is equivalent to adding a fixed effect (dummy
variable) for each individual. xi:xtreg logincome sex age age2 i.jbsat i.health i.year i.qfedhi,
fe and export the results outreg2 age age2 sex _Ijbsat_2 _Ijbsat_3 _Ijbsat_4 _Ijbsat_5 _Ijbsat_6
_Ijbsat_7 using panel, excel append ctitle(FE). Notice that the overall explained variance is much
lower.
Table 23 compares these estimates across the different approaches. We see that the coefficient on job
satisfaction category 7 (completely satisfied) is -.04 (i.e. being in this category reduces income by 4%,
remember it is measured in logs) for POLS, -.03 for RE, and -.025 for FE, and is also only marginally
significant. So taking account of unobserved heterogeneity reduces the effect substantially. It is important
to understand how the FE coefficient is generated. Only movement from one category of job satisfaction to
another (changes over time within individuals), is used to estimate the effect on income. People who move
categories are generally referred to as “switchers”.
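A sketch of the three estimators compared in Table 23 (the clustered pooled OLS specification is an assumption, matching the OLS(Clustered) column heading):

* Pooled OLS with standard errors clustered by person
xi: reg logincome sex age age2 i.jbsat i.health i.year i.qfedhi, vce(cluster pid)
* Random effects
xi: xtreg logincome sex age age2 i.jbsat i.health i.year i.qfedhi, re
* Fixed effects (time invariant variables such as sex are dropped)
xi: xtreg logincome sex age age2 i.jbsat i.health i.year i.qfedhi, fe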
7.4 The Hausman Test
The drawback of the FE estimator is that it is less efficient than the RE estimator (i.e. it has a higher variance),
and that you cannot recover the coefficients on time invariant characteristics. So if the assumption
of no correlation between the individual error and explanatory variables holds, then you should use Random
Effects. Fortunately there is a way of testing which estimator is more appropriate in any given situation.
This is based on the fact that under the null hypothesis of random individual effects the estimators should
give similar coefficients. The Hausman test can be implemented by comparing the estimates from the two
models. Suppose we run a simple case xi:xtreg logincome i.jbsat, re. We need to store these estimates
so we can use them for comparison estimates store RE. Now we run our FE model xi:xtreg logincome
i.jbsat, fe and store its estimates estimates store FE. Now we test hausman FE RE, sigmamore.
chi2(6) = (b-B)’[(V_b-V_B)^(-1)](b-B)
= 193.25
Prob>chi2 = 0.0000
In this case, we can reject the null hypothesis that the coefficients are equal, and therefore conclude that
the FE model is the appropriate choice (the RE estimator is inconsistent here).
7.5 Dynamics
We have already seen that there is a high degree of correlation in our income variable, and we may wish to
account for this with a more dynamic model, for example by including income in the previous period (y_{t-1})
as a regressor. Unfortunately the FE estimator is biased (RE is worse here as α_i is incorporated into the error term) in
the case where T is small (T>30 is recommended). We run this specification for comparison only xi:xtreg
logincome l.logincome sex age age2 i.jbsat i.health i.year i.qfedhi, fe. Note again the use of
the lag operator (l.logincome). A potential solution is to use an instrumental variables approach. The
details are not straightforward, and require further assumptions about the structure of the error term. We
implement the Arellano and Bond estimator for illustration xi:xtabond logincome age age2 i.jbsat
i.health i.year i.qfedhi, which uses the second and further lags of the dependent variable as instruments in the first differenced equation.
Table 23: Comparison of Panel Data Estimators
(1) (2) (3) (7) (8)
VARIABLES OLS(Clustered) RE FE FE Lag FE ABond
8 Instrumental Variables
8.1 Endogeneity
This tutorial will provide an introduction to Instrumental Variables regression. One of the basic assumptions
required for the unbiasedness of OLS is exogeneity, that there is no correlation between the X variables and
the error term in the model. Endogeneity can arise under a number of scenarios, including omitted variables,
simultaneity and measurement error. Common examples include the effect of schooling on wages and the
effect of aid on growth. In these circumstances IV can be used to produce consistent estimates of the effect of
the endogenous variable on the outcome. The idea is to find some variable (Z) that predicts the X variable,
but not the Y. This isolates the exogenous part of X, which we can use to examine the true causal effect
of X on Y. In the case of the schooling example, this could be a change in the compulsory school leaving
age which exogenously (i.e. is outside the decision making capacity or characteristics of the individual)
increases the amount of schooling received. Here we will illustrate the technique using a simple example of
how openness affects growth. First we load the data use openness, clear and have a preliminary look
at it sum and describe. sum pcinc open land. If we graph scatterplots of these three variables graph
matrix pcinc open land, scheme(s1mono) title(’’Income, Openness and Area’’) we see that there
are a number of outliers, which we remove from our analysis.
Figure 15: Graph Matrix for Openness, Area and Income Per Capita
[Scatterplot matrix of income per capita, openness and land area.]
Oil producers have higher income, as do countries with “good” data. Higher inflation has a negative effect.
However it is important to recognise that the relationship between GDP and openness may be determined
by another unobserved factor, or that higher income may cause higher levels of openness. In other words,
openness is potentially endogenous, and therefore the model estimated above has no causal interpretation.
To deal with this issue we need to instrument for this variable. The two criteria for an instrument are that
the variable must be correlated with the X (openness) but must not be correlated with the error term in the
Y (income) equation. One possibility here is land area. If you believe that smaller countries are more likely to engage in international trade
(because of economies of scale, specialisation etc.) but that area does not necessarily affect GDP per capita
otherwise, then area is a valid instrument. The first assumption is testable: if we graph the relationship we can
clearly see a significant negative correlation twoway (scatter open land if land<600000) (lfit open
land if land<600000), ytitle(Imports % GDP) title(Openness and Land Area) scheme(s1mono).
If you do not believe the second assumption in this case, that is probably wise! We will just use this as
an example of how to implement the procedure. IV papers ultimately depend on how much you believe this
assumption, and you will encounter many you do not. Unfortunately there is no econometric test which will
settle the matter entirely. We could look at the correlation between these three variables pwcorr pcinc open
land, sig. First, openness is significantly negatively correlated with area. Secondly income is positively
correlated with openness, although not significantly. The assumption of no correlation between Z and U
refers to the relationship in the population; here we only have a sample, and a relatively small sample at
that. However, if we have more instruments than endogenous variables, it is possible to test how close the
corresponding sample moment is to zero. We will illustrate this later.
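As a sketch, the just-identified IV regression and its standard diagnostics might be run as follows (treating inf and good as the included exogenous controls, as in the overidentified specification below):

* 2SLS: instrument openness with land area; inf and good enter as exogenous controls
ivregress 2sls pcinc inf good (open = land), vce(robust)
* First stage strength and test of endogeneity of open
estat firststage
estat endogenous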
Table 24: Correlation Between Income, Openness and Area
Table 26: Testing for Weak Instruments
Tests of endogeneity
Ho: variables are exogenous
Robust score chi2(1) = .467088 (p = 0.4943)
Robust regression F(1,98) = .423783 (p = 0.5166)
Earlier we stated that it was possible to examine the validity of an instrument if the model was over-
identified (i.e. more instruments than endogenous variables). This is based on the fact that the sample
correlation between the instrument and error term should be close to zero if the instruments are valid. We
will illustrate this using a second instrument, whether the country is a major oil producer. Clearly this
will affect GDP as well as openness, so this is not a good instrument. Again we are just using this example
for illustration of how to implement the test. First we run our model with our two instruments, and the
gmm option. The generalised method of moments is similar to the 2sls estimator, but is suitable for use in
the overidentified case. ivregress gmm pcinc inf good (open = land oil), wmatrix(robust). We
can now implement the test, which is often referred to as the “overidentifying restrictions test” or the
“Hansen-Sargan” test. estat overid.
In this case we reject the null hypothesis that all instruments are valid, which is hardly surprising.
Finally, there are a number of other important commands relating to IV that you may require in your
analysis. The user written command “ivreg2” provides extensions to the Stata “ivregress” command. When
the endogenous variable is a binary treatment, the “treatreg” command is applicable, and when your Y variable is binary, the
“ivprobit” command. With weak instruments, the LIML (limited information maximum likelihood) and
JIVE (Jackknife IV) estimators can be used to provide better small sample estimates. The command for IV
in a panel context is “xtivreg”. There are also a number of different estimators for dealing with dynamic
panel models, such as “xtabond” which implements the Arellano and Bond estimator.
9 Recommended Reading and References
Dublin Microeconomics Blog http://dublinmicroblog.blogspot.com/
Graphs 1 http://www.survey-design.com.au/Usergraphs.html
Graphs 2 http://www.ats.ucla.edu/stat/stata/library/GraphExamples/default.htm
Regression Models for Categorical Dependent Variables Using Stata, J. Scott Long and Jeremy Freese
http://www.stata.com/bookstore/regmodcdvs.html
University of Essex. Institute for Social and Economic Research. ESDS Longitudinal and University of Essex. UK
Data Archive. ESDS Longitudinal, British Household Panel Survey: Waves 1-11, 1991-2002: Teaching Dataset
(Work, Family and Health) [computer file]. 2nd Edition. University of Essex. Institute for Social and Economic
Research, [original data producer(s)]. Colchester, Essex: UK Data Archive [distributor], November 2011. SN: 4901,
http://dx.doi.org/10.5255/UKDA-SN-4901-2