Gravity13 Stata
Introduction to Stata
for regression analysis
Instructor: Yong Yoon, PhD
Chulalongkorn University
March 19, 2013
Part 1
• Overview of Stata
– User interface, command syntax, help system, file
management, working with do-file editor
– Updating Stata and accessing user-written
routines
– Data management: basic principles of organization
and transformation
– Data management tools and data validation
– Introduction to graphics
– Producing publication-quality output
User interface, command syntax, help system, file management,
working with do-file editor
getting started
• Starting Stata
Double-click the Stata icon on the desktop (if there is one) or
select Stata from the Start menu.
• Closing Stata
Choose Exit from the File menu, click the Windows close box
(the "x" in the top right corner), or type exit at the command
line. If you have any data in memory, you will have to type clear
first (or simply type exit, clear).
• The Stata screen is divided into four windows. “Review” shows the
commands that have been executed. “Variables” lists all the
variables in the current dataset. “Results” shows the commands’
output. Finally, “Command” is where you enter commands.
• Quick Notes
1. Stata is case-sensitive.
2. The period (.) at the start of a line is the Stata prompt; do not type it.
3. When you work, always use a do-file.
4. To see the contents of a do-file, type, e.g.,
. type profile.do
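As a minimal sketch of what a do-file might contain, the lines below could be saved as session1.do and run by typing do session1. The file and log names are hypothetical; PennTab is the course dataset and is assumed to sit in the working directory.

```stata
* session1.do: hypothetical do-file for a first session
capture log close                 // close any log left open by an earlier run
log using session1, replace text  // keep a plain-text record of the results
clear all                         // start with nothing in memory

use PennTab                       // assumes PennTab.dta is in the working directory
describe
summarize

log close
```

Keeping every step in a do-file means the whole session can be rerun and checked later.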
first commands
more first commands
. use PennTab
. describe
. summarize
. list country wbcode year pop rgdpch openk grgdpch in 1/10
. list country wbcode year pop rgdpch openk grgdpch if wbcode == "THA"
. list country year if missing(rgdpch)
. browse
first regression
help system
file management (1)
• Stata reads and saves data in the working directory, usually c:\data,
unless you specify otherwise (say, if using a thumb drive).
• You can change the working directory using the command
. cd [drive:]directoryname
and to see which working directory you are using, type pwd (type help
cd for details).
• I recommend that you create a separate directory for each research
project you are involved in, and start your Stata session by changing to
that directory.
• Stata has other commands for interacting with the operating system,
including mkdir to create a directory, dir to list the names of the files in
a directory, type to list their contents, copy to copy files, and erase to
delete a file.
• You can (and probably should) do these tasks using the operating system
directly, but the Stata commands may come in handy if you want to write a
program to perform repetitive tasks.
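The commands mentioned above can be tried together in a short session. This is only a sketch: the folder name gravity_project and the file names are hypothetical, and auto is an example dataset shipped with Stata.

```stata
pwd                               // show the current working directory
capture mkdir gravity_project     // create the folder; capture ignores the error if it exists
cd gravity_project                // make it the working directory
dir                               // list the files it contains

sysuse auto, clear                // example dataset shipped with Stata
save auto_copy, replace           // writes auto_copy.dta to the working directory
copy auto_copy.dta auto_backup.dta, replace
erase auto_backup.dta             // delete the backup again
```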
file management (2)
accessing user-written routines
• Stata's native graph types are not ideal for viewing distributions and
histograms of categorical variables. In this case it is convenient to
use the user-written program catplot, which you can find by typing:
. findit catplot
• Simply follow the links to install. Then try:
. catplot income_grp, percent
. catplot rgdp_m, percent by(income_grp)
• There are many specially written commands and routines (by Stata and by
independent authors) that you can use, and they are easy to find on the
web.
• You can even get Stata data online (e.g. Penn Tables, World Bank
data, etc.)
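As an alternative to following the findit links, a user-written package can usually be installed directly from the SSC archive. The sketch below assumes an internet connection and that catplot is still hosted on SSC; update query covers the "updating Stata" item of the overview.

```stata
update query                      // check whether official Stata updates are available
ssc install catplot, replace      // install catplot from the SSC archive
which catplot                     // confirm the installation and show the file location
help catplot                      // read the documentation for the new command
```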
comments and annotations
Data management: basic principles of organization and
transformation
variable types
variable names
missing values
observation subscripts _n and _N
making categorical variable
loading data (1)
loading data (2)
loading data (3)
sorting (and merging/combining datasets)
Introduction to graphics
Producing publication-quality output
• To save the current graph on disk using Stata's own format, type graph save
file_name. This command has two options, replace, which you need to use if
the file already exists, and asis, which freezes the graph (including its current style)
and then saves it.
• The default is to save the graph in a live format that can be edited in future sessions, for
example by changing the scheme. After saving a graph in Stata format you can load it
from the disk with the command graph use file_name. (Note that graph
save and graph use are analogous to save and use for Stata files.) Any graph stored in
memory can be displayed using graph display. You can also list, describe,
rename, copy, or drop graphs stored in memory. To learn more, type
. help graph_manipulation
• If you plan to incorporate the graph in another document you will probably need to
save it in a more portable format. Stata's command graph export filename can export
the graph in a wide variety of vector or raster formats; the format is usually inferred
from the file extension. Vector formats such as Windows metafile (wmf or emf) or
Adobe's PostScript and its variants (ps, eps, pdf) essentially contain drawing instructions
and are thus resolution independent, so they are best for inclusion in other documents
where they may be resized. Raster formats such as Portable Network Graphics (png)
save the image pixel by pixel using the current display resolution, and are best for
inclusion in web pages. An example I use is:
. graph export fig61.png, width(400) replace
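The save, use, and export steps described above can be sketched as follows. The graph name fig1 and the file names are hypothetical, and the example uses the auto dataset shipped with Stata rather than the course data.

```stata
sysuse auto, clear
scatter mpg weight, name(fig1, replace)       // draw a graph and keep it in memory as fig1
graph save fig1 "fig1.gph", replace           // live Stata format, editable later
graph use "fig1.gph"                          // reload the saved graph from disk
graph export "fig1.png", width(400) replace   // raster copy, 400 pixels wide
graph export "fig1.eps", replace              // vector copy, resolution independent
graph dir                                     // list graphs in memory and on disk
```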
some examples of Stata’s graphical capabilities
seven basic types of graphs
. histogram       histograms
. graph twoway    two-variable scatterplots, line plots, and many others
. graph matrix    scatterplot matrices
. graph box       box plots
. graph pie       pie charts
. graph bar       bar charts
. graph dot       dot plots
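To see each of the seven types in action, one possible set of calls is sketched below. The variables come from the auto dataset shipped with Stata and are illustrative choices only, not the course data.

```stata
sysuse auto, clear
histogram mpg                                  // histogram
graph twoway scatter mpg weight                // two-variable scatterplot
graph matrix price mpg weight                  // scatterplot matrix
graph box mpg, over(foreign)                   // box plots by group
graph pie, over(rep78)                         // pie chart of category frequencies
graph bar (mean) price, over(foreign)          // bar chart of group means
graph dot (mean) mpg, over(rep78)              // dot plot of group means
```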
fitting lines
tidying up labels
titles, legends and captions
Part 2
• Regression Basics (with Stata)
– Inference and the idea of sampling distribution
– Regression: Estimation and inference
– OLS assumptions and properties
– Post-estimation regression diagnosis: Violating the
classical assumptions (consequences, detection,
solution)
– The problem of endogeneity again!
inference
simple random sample
some common functions in Stata
simulation in Stata
• Let’s toss a fair coin 100 times, repeat that experiment 10,000 times, and
keep a record of the number of heads in each run.
• To do this you need to write a small program in Stata.
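capture program drop cointoss   // allow the program to be redefined
program cointoss
    drop _all
    set obs 100                     // 100 tosses
    gen toss = runiform()
    replace toss = 0 if toss <= .5  // tails
    replace toss = 1 if toss > .5   // heads
    egen tt = total(toss)           // number of heads in this run
end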
• To run the program, simply type:
. simulate x = tt, seed(10101) reps(10000) nodots: cointoss
. histogram x, bin(10000)
[Figure: histogram of the simulated number of heads; x-axis tt, y-axis Density]
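A variant of the coin-toss simulation is sketched below. It is an alternative to the program on the slide, not a replacement for it: the program name cointoss2 is hypothetical, the result is returned in r() instead of being stored in a variable, and the histogram uses the discrete option because the number of heads is an integer. In theory the number of heads in 100 fair tosses is binomial with mean 50 and standard deviation 5, so the summary statistics should come out close to those values.

```stata
capture program drop cointoss2
program cointoss2, rclass
    drop _all
    set obs 100                          // 100 tosses per run
    generate byte toss = runiform() > .5 // 1 = heads, 0 = tails
    quietly count if toss == 1
    return scalar heads = r(N)           // number of heads in this run
end

simulate heads = r(heads), reps(10000) seed(10101) nodots: cointoss2
summarize heads                          // compare with the theoretical mean 50 and sd 5
histogram heads, discrete                // one bar per integer value
```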
does openness matter for economic growth?
always nice to start with histograms
[Figure: histograms of Openness; y-axes Frequency and Percent]
OLS in Stata
lfit
[Figure: scatterplot with fitted line, points labelled by country code (e.g. DEU, THA, COD); x-axis Openness (logs)]
R-squared
• Stata stores regression results in e(), also known as e-class results. These
can be regarded as repositories into which Stata puts the various results for
its own housekeeping.
. quietly regress ln_rgdp ln_open
. ereturn list
• Try also matrix list e(b) after the ereturn list
$$ R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}, \qquad 0 \le R^2 \le 1 $$
• How “good” is the regression? One measure is the so-called R-squared (R2),
which can be interpreted as the fraction of the sample variation in Y that is
explained by X. You can easily read the R-squared in the regression output, but
below I show how to calculate it manually.
. display 10.30/47.606
. display e(mss)/(e(mss)+e(rss))
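The manual calculation can be checked on any dataset. The sketch below uses the auto dataset shipped with Stata (an assumption for illustration, not the course data) and takes SSE to be the explained sum of squares, e(mss), and SSR the residual sum of squares, e(rss). All three displayed numbers should agree.

```stata
sysuse auto, clear
quietly regress price weight
scalar sse = e(mss)               // explained (model) sum of squares
scalar ssr = e(rss)               // residual sum of squares
scalar sst = sse + ssr            // total sum of squares
display "SSE/SST     = " sse/sst
display "1 - SSR/SST = " 1 - ssr/sst
display "e(r2)       = " e(r2)
```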
Assumptions of OLS
Properties of OLS
properties of OLS (2)
inference
outlier, leverage and influence
• To get leverage (the diagonal of the hat matrix) and Cook's distance
we use two more options of predict, hat and cook
. predict hres, hat // leverage
. predict cres, cook // Cook's distance (influence)
• Leverage points are observations with unusual x-values that may
control certain model properties. Often they do not affect the
estimates of the regression coefficients, but they can have a
dramatic effect on model summary statistics such as R2
and the standard errors of the regression coefficients.
• Influence points usually have a moderately unusual x-coordinate,
and the y value is unusual as well. Influence points have a
noticeable impact on the model coefficients, in that they pull
the regression line in their direction.
• To detect an outlier, we can find cases with standardized or
jackknifed residuals greater than 2 in magnitude
. list ehat stdres jkres hres cres if abs(stdres) > 2 | abs(jkres) > 2, clean
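A self-contained run of these diagnostics might look like the sketch below. It uses the auto dataset shipped with Stata and assumes that ehat, stdres, and jkres were created with the residuals, rstandard, and rstudent options of predict. The cut-offs 2k/n for leverage and 4/n for Cook's distance are common rules of thumb, not values taken from these slides.

```stata
sysuse auto, clear
regress price weight mpg
predict ehat, residuals           // raw residuals
predict stdres, rstandard         // standardized residuals
predict jkres, rstudent           // jackknifed (studentized) residuals
predict hres, hat                 // leverage
predict cres, cooksd              // Cook's distance

* possible outliers
list make ehat stdres jkres if abs(stdres) > 2 | abs(jkres) > 2, clean

* rule-of-thumb cut-offs for leverage and influence
scalar npar = e(df_m) + 1         // number of parameters, including the constant
scalar nobs = e(N)
list make hres cres if hres > 2*npar/nobs | cres > 4/nobs, clean
```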
leverage and influence (2)
residual plots
serial correlation in residuals
$$ DW = \frac{\sum_{t=2}^{T} (\hat{u}_t - \hat{u}_{t-1})^2}{\sum_{t=1}^{T} \hat{u}_t^2} $$
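The Durbin-Watson statistic can be obtained with estat dwatson after declaring the data to be a time series, or computed by hand from the residuals. The sketch below uses uslifeexp, a time-series example dataset shipped with Stata, purely for illustration. The two numbers should match.

```stata
sysuse uslifeexp, clear
tsset year                                  // declare the time variable
regress le year
estat dwatson                               // Durbin-Watson d statistic

* the same statistic by hand
predict double uhat, residuals
generate double num = (uhat - L.uhat)^2     // missing for the first year, so the sum starts at t = 2
generate double den = uhat^2
quietly summarize num
scalar snum = r(sum)
quietly summarize den
display "DW = " snum / r(sum)
```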
Multivariate regression
basic trade gravity model
• The basic gravity equation is used to explain the value of trade that
takes place between two countries:
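A commonly used log-linear form of the gravity equation is shown here as an illustration; it is an assumption and not necessarily the exact specification used in this course. $T_{ij}$ denotes the value of trade between countries $i$ and $j$, $Y_i$ and $Y_j$ their GDPs, and $D_{ij}$ the distance between them.

```latex
\ln T_{ij} = \beta_0 + \beta_1 \ln Y_i + \beta_2 \ln Y_j + \beta_3 \ln D_{ij} + u_{ij}
```

In this form one would typically expect $\beta_1$ and $\beta_2$ to be positive and $\beta_3$ to be negative.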
• Interpret this and check SSS and try these (note that many of the
commands we performed before are not allowed after robust
estimation).
. predict yhat // fitted values of y
. predict uhat, residuals // raw residuals
. estat vif // variance inflation factors (check for multicollinearity)
. estat ovtest // Ramsey’s RESET
testing the overall significance of the regression
• The F-test is employed to test whether all the X’s taken together (jointly) are
statistically significant.
• One form of the test statistic is (k is the number of parameters):
$$ F = \frac{R^2 / (k-1)}{(1-R^2) / (n-k)} $$
• As usual, if the observed F-statistic exceeds the critical F at k-1 and n-k
degrees of freedom (at some appropriate α), we reject the null hypothesis
β1 = β2 = … = βk-1 = 0 (under which the regression as a whole is statistically
insignificant). Equivalently, reject the null if the p-value is smaller than α.
• In Stata, you can do this test manually by:
. dis "F-stat = " e(F)
. dis "p-value is " Ftail(e(df_m),e(df_r),e(F))
. dis "Critical-F at 5% is " invFtail(e(df_m),e(df_r),0.05)
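The R-squared form of the statistic can be verified against the F reported by regress. The sketch below uses the auto dataset shipped with Stata as a stand-in for the course data; the scalar names are arbitrary.

```stata
sysuse auto, clear
quietly regress price weight mpg foreign
scalar npar = e(df_m) + 1         // k: number of parameters, including the constant
scalar nobs = e(N)                // n: number of observations
scalar Fman = (e(r2)/(npar - 1)) / ((1 - e(r2))/(nobs - npar))
display "F from R-squared    = " Fman
display "F reported by Stata = " e(F)
display "p-value             = " Ftail(npar - 1, nobs - npar, Fman)
```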
testing (linear restrictions) in Stata
$$ F = \frac{(SSR_R - SSR_{UR}) / m}{SSR_{UR} / (n-k)} \qquad \text{or} \qquad F = \frac{(R^2_{UR} - R^2_R) / m}{(1 - R^2_{UR}) / (n-k)} $$
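As a sketch of how the sum-of-squares form can be computed, the example below fits a restricted and an unrestricted model on the auto dataset shipped with Stata and compares the hand-computed F with the built-in test command. The choice of variables and of m = 2 restrictions (the coefficients on mpg and length both equal to zero) is purely illustrative.

```stata
sysuse auto, clear
quietly regress price weight foreign              // restricted model
scalar ssr_r = e(rss)
quietly regress price weight foreign mpg length   // unrestricted model
scalar ssr_ur = e(rss)
scalar nrestr = 2                                 // m: number of restrictions
scalar Fman = ((ssr_r - ssr_ur)/nrestr) / (ssr_ur/e(df_r))
display "F by hand = " Fman
display "p-value   = " Ftail(nrestr, e(df_r), Fman)
test mpg length                                   // the same test done by Stata
```

Here e(df_r) is the residual degrees of freedom (n-k) of the unrestricted model, because that was the last regression run.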
the Wald test for comparing models
comparing models (2)
The problem of endogeneity
confounding
[Diagram: X, Y, and Z (confounding factor)]
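A small simulation can illustrate why a confounding factor matters. In the sketch below all numbers are invented for illustration: Z affects both X and Y, and the true effect of X on Y is set to 0.5. Leaving Z out of the regression pushes the estimated coefficient on x away from 0.5, while including Z should bring it back close to 0.5.

```stata
clear
set seed 12345
set obs 1000
generate z = rnormal()                    // confounding factor
generate x = 0.8*z + rnormal()            // Z affects X
generate y = 1 + 0.5*x + z + rnormal()    // Z also affects Y; true effect of X is 0.5
regress y x                               // Z omitted: coefficient on x absorbs part of Z's effect
regress y x z                             // Z included: coefficient on x should be near 0.5
```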
non-linear forms
common errors you should avoid
GOOD LUCK!
references (some of my favorites)
• http://www.stata.com/support/faqs/
• http://www.stata.com/links/resources1.html
• Statistics with Stata (Updated for Version 10) by Lawrence C. Hamilton
• An Introduction to Modern Econometrics Using Stata by Christopher F. Baum
• Microeconometrics Using Stata, Revised Edition by A. Colin Cameron and Pravin
K. Trivedi
• Hill, R. C., W. E. Griffiths, and G. C. Lim. 2011. Principles of Econometrics. 4th
ed. Hoboken, NJ: Wiley.
• Adkins, L. C., and R. C. Hill. 2011. Using Stata for Principles of Econometrics.
4th ed. Hoboken, NJ: Wiley.
• Stock, J. H., and M. W. Watson. 2011. Introduction to Econometrics. 3rd ed.
Addison-Wesley.
• Verbeek, M. 2012. A Guide to Modern Econometrics. 4th ed. Wiley & Sons.
• Wooldridge, J. M. 2012. Introductory Econometrics: A Modern Approach. 5th
ed. Cincinnati, OH: South-Western.
• Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data.
Cambridge, MA: MIT Press.
• Angrist, J. D., and J.-S. Pischke. 2009. Mostly Harmless Econometrics: An
Empiricist’s Companion. Princeton, NJ: Princeton University Press.