An Introduction To Stata For Economists: Data Analysis
Part II:
Data Analysis
Kerry L. Papps
2. Overview
• Do-files
• Summary statistics
• Correlation
• Linear regression
• Generating predicted values and hypothesis testing
• Instrumental variables and other estimators
• Panel data capabilities
• Panel estimators
2. Overview (cont.)
• Writing loops
• Graphs
4. Comment on notation used
• Consider the following syntax description:
list [varlist] [in range]
– Text in typewriter-style font should
be typed exactly as it appears (although there
are possibilities for abbreviation).
– Italicised text should be replaced by desired
variable names etc.
– Square brackets (i.e. []) enclose optional parts of a
Stata command (do not actually type the brackets).
5. Comment on notation used
(cont.)
• For example, an actual Stata command might be:
list name occupation
• This notation is consistent with notation in Stata
Help menu and manuals.
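As a minimal sketch of how the syntax description maps to real commands, the lines below fill in list [varlist] [in range] in several ways. The auto dataset (loaded with sysuse) and its variables make and price are assumptions chosen for illustration; they are not part of the course data.

```stata
* Load an example dataset that ships with Stata (assumed here for illustration)
sysuse auto, clear

* Both optional parts supplied: two variables, first five observations
list make price in 1/5

* Only the varlist supplied: every observation of make
list make

* Abbreviation: l is accepted in place of list
l make price in 1/3
```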
6. Do-files
• Do-files allow commands to be saved and
executed in “batch” form.
• We will use the Stata do-file editor to write do-
files.
• To open the do-file editor, click Window > Do-File
Editor or click the do-file editor button on the toolbar.
• Can also use WordPad or Notepad: Save as “Text
Document” with extension “.do” (instead of
“.txt”). Allows larger files than do-file editor.
7. Do-files (cont.)
• Note: a blank line must be included at the end of a
WordPad do-file (otherwise last line will not run).
• To run a do-file from within the do-file editor,
either select Tools > Do or click the corresponding toolbar button.
• If you highlight certain lines of code, only those
commands will run.
• To run a do-file from the main Stata window,
either select File > Do or type:
do dofilename
8. Do-files (cont.)
• Can “comment out” lines by preceding with * or
by enclosing text within /* and */.
• Can save the contents of the Review window as a
do-file by right-clicking on window and selecting
“Save All...”.
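A short do-file along these lines shows both comment styles. The file name example.do and the use of the auto dataset are hypothetical choices for this sketch.

```stata
* example.do (hypothetical file name)
* A line that begins with an asterisk is a comment
sysuse auto, clear

/* Everything between these delimiters is ignored,
   even when it runs over several lines */
summarize price mpg

* The next command is commented out, so it does not run:
* summarize weight
```

If the file is saved in the current working directory, typing do example in the Command window runs it.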
9. Univariate summary
statistics
• tabstat produces a table of summary statistics:
tabstat varlist [, statistics(statlist)]
• Example:
tabstat age educ, stats(mean sd semean n)
• summarize displays a variety of univariate
summary statistics (number of non-missing
observations, mean, standard deviation, minimum,
maximum):
summarize [varlist]
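The two commands can be compared on an example dataset. This is a sketch only: the auto dataset is an assumption, and the by() and detail options go slightly beyond the syntax shown on this slide.

```stata
* Example dataset shipped with Stata (an assumption of this sketch)
sysuse auto, clear

* Mean, standard deviation, standard error of the mean and count
tabstat price mpg, statistics(mean sd semean n)

* The same kind of table, split by the values of a grouping variable
tabstat price mpg, statistics(mean median min max) by(foreign)

* Default summary: observations, mean, std. dev., min and max
summarize price mpg

* detail adds percentiles, skewness and kurtosis
summarize price, detail
```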
10. Multivariate summary
statistics
• table displays table of statistics:
table rowvar [colvar] [, contents(clist
varname)]
• clist can be freq, mean, sum etc.
• rowvar and colvar may be numeric or string
variables.
• Example:
table sex educ, c(mean age median
inc)
11. Multivariate summary
statistics (cont.)
• One “super-column” and up to 4 “super-rows” are
also allowed.
• Missing values are excluded from tables by
default. To include them as a group, use the
missing option with table.
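A possible use of table with the syntax described here is sketched below, assuming the auto dataset. Note that the table command was redesigned in Stata 17, so the contents() form applies to Stata 16 and earlier; the comments give what is believed to be the equivalent for later versions.

```stata
* Example dataset shipped with Stata (an assumption of this sketch)
sysuse auto, clear

* Rows: foreign; columns: rep78; cells: mean price and median mpg.
* The missing option keeps observations with missing rep78 as a group.
table foreign rep78, contents(mean price median mpg) missing

* In Stata 17 or later, either prefix the command above with
*   version 16:
* or use the redesigned syntax:
*   table (foreign) (rep78), statistic(mean price) statistic(median mpg)
```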
EXERCISE 1
12. Generating simple statistics
• Open the do-file editor in Stata. Run all your solutions
to the exercises from here.
• Open nlswork.dta from the internet as follows:
webuse nlswork
• Type summarize to look at the summary statistics
for all variables in the dataset.
• Generate a wage variable, which exponentiates
ln_wage:
gen wage=exp(ln_wage)
EXERCISE 1 (cont.)
13. Generating simple statistics
• Restrict summarize to hours and wage and
perform it separately for non-married and married
(i.e. msp==0 and 1).
• Use tabstat to report the mean, median,
minimum and maximum for hours and wage.
• Report the mean and median of wage by age
(along the rows) and race (across the columns) :
table age race, c(mean wage median
wage)
14. Sets of dummy variables
• Dummy variables take the values 0 and 1 only.
• Large sets of dummy variables can be created
with:
tab varname, gen(dummyname)
• When using large numbers of dummies in
regressions, it is useful to name them with a pattern, e.g. id1,
id2… Then id* can be used to refer to all
variables whose names begin with id.
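The following sketch creates a set of dummies and then refers to them with a wildcard. The auto dataset and the stub name repair are assumptions for illustration. The stub is deliberately not called rep, because rep* would also match the existing variable rep78.

```stata
* Example dataset shipped with Stata (an assumption of this sketch)
sysuse auto, clear

* rep78 takes the values 1 to 5, so this creates repair1 ... repair5
tab rep78, gen(repair)

* The wildcard picks up every variable whose name starts with repair
summarize repair*

* Omit one category (here repair1) to avoid perfect collinearity
regress price mpg repair2-repair5
```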
15. Correlation
• To obtain the correlation between a set of
variables, type:
correlate [varlist] [[weight]] [,
covariance]
• covariance option displays the covariances
rather than the correlation coefficients.
• pwcorr displays all the pairwise correlation
coefficients between the variables in varlist:
pwcorr [varlist] [[weight]] [, sig]
16. Correlation (cont.)
• sig option adds a line to each row of matrix
reporting the significance level of each correlation
coefficient.
• Difference between correlate and pwcorr is
that the former performs listwise deletion of
missing observations while the latter performs
pairwise deletion.
• To display the estimated covariance matrix of the
coefficient estimates after a regression command, use:
estat vce
17. Correlation (cont.)
• (This matrix can also be displayed using Stata’s
matrix commands, which we will not cover in this
course.)
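The difference between listwise and pairwise deletion shows up when one variable has missing values. The sketch below assumes the auto dataset, in which rep78 is missing for some cars; the obs option of pwcorr is an addition not shown on the slides.

```stata
* Example dataset shipped with Stata (an assumption of this sketch)
sysuse auto, clear

* Listwise deletion: every correlation uses only cars with no missing values
correlate price mpg rep78

* Pairwise deletion: each correlation uses all cars available for that pair;
* sig adds significance levels and obs reports the number of observations
pwcorr price mpg rep78, sig obs

* Covariances instead of correlation coefficients
correlate price mpg, covariance

* Covariance matrix of the coefficient estimates after a regression
regress price mpg weight
estat vce
```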
18. Linear regression
• To perform a linear regression of depvar on
varlist, type:
regress depvar varlist [[weight]] [if
exp] [, noconstant robust]
• depvar is the dependent variable.
• varlist is the set of independent variables
(regressors).
• By default Stata includes a constant. The
noconstant option excludes it.
19. Linear regression (cont.)
• robust specifies that Stata report the Huber-
White standard errors (which account for
heteroskedasticity).
• Weights are often used, e.g. when data are group
averages, as in:
regress inflation unemplrate year
[aweight=pop]
• This is weighted least squares (a special case of GLS).
• Note that here year allows for a linear time trend.
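The options above might be combined as in the following sketch. The auto and census datasets ship with Stata and are assumptions here, as is the variable name drate; the regressions are illustrative rather than substantively meaningful.

```stata
* Example dataset shipped with Stata (an assumption of this sketch)
sysuse auto, clear

* OLS with a constant (the default)
regress price mpg weight

* Same model with Huber-White (heteroskedasticity-robust) standard errors
regress price mpg weight, robust

* Restrict the sample with if and drop the constant
regress price mpg weight if foreign == 0, noconstant

* Analytic weights: state-level rates weighted by state population
sysuse census, clear
generate drate = death / pop
regress drate medage [aweight = pop]
```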
20. Post-estimation commands
• After estimation commands (e.g. regress,
logit) several predicted values can be computed
using predict.
• predict refers to the most recent model
estimated.
• predict yhat, xb creates a new variable yhat
equal to the predicted values of the dependent
variable.
• predict res, residual creates a new
variable res equal to the residuals.
21. Post-estimation commands
(cont.)
• Linear hypotheses can be tested (e.g. t-test or F-
test) after estimating a model by using test.
• test varlist tests that the coefficients
corresponding to every element in varlist jointly
equal zero.
• test eqlist tests the restrictions in eqlist, e.g.:
test sex==3
• The option accumulate allows a hypothesis to
be tested jointly with the previously tested
hypotheses.
22. Post-estimation commands
(cont.)
• Example:
regress lnw sex race school age
test sex race
test school == age, accum
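A fuller sketch of the same workflow, including predict, is given below. It assumes the auto dataset; the variable names yhat and res follow the earlier slides.

```stata
* Example dataset shipped with Stata (an assumption of this sketch)
sysuse auto, clear
regress price mpg weight foreign

* Fitted values and residuals from the most recent model
predict yhat, xb
predict res, residuals

* With a constant in the model, the residuals average to zero
* (up to rounding error)
summarize res

* Joint F-test that the coefficients on mpg and weight are both zero
test mpg weight

* Build up a joint hypothesis one restriction at a time
test mpg = 0
test foreign = 0, accumulate

* Test equality of two coefficients
test mpg = weight
```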
EXERCISE 2
23. Linear regression
• Compute the correlation between wage and
grade. Is it significant at the 1% level?
• Generate a variable called age2 that is equal to the
square of age (the power operator in Stata is ^, so the square is age^2).
• Create a set of race dummies with:
tab race, gen(race)
• Regress ln_wage on: age, age2, race2,
race3, msp, grade, tenure, c_city.
EXERCISE 2 (cont.)
24. Linear regression
• Display the covariance matrix from this
regression.
• Use predict to generate a variable res
containing the residuals from the equation.
• Use summarize to confirm that the mean of the
residuals is zero.
• Rerun the regression and report Huber-White
standard errors.
25. Additional estimators
• Instrumental variables:
ivregress 2sls depvar exogvars
(endogvars=ivvars)
• Both exogvars and ivvars are used as instruments
for endogvars.
• For example:
ivregress 2sls price inc pop
(qty=cost)
• Logit:
logit depvar indepvars
26. Additional estimators
(cont.)
• Probit:
probit depvar indepvars
• Ordered probit:
oprobit depvar indepvars
• Tobit:
tobit depvar indepvars, ll(cutoff)
• For example, tobit could be used to estimate
labour supply:
tobit hrs educ age child, ll(0)
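Each of these estimators can be tried on an example dataset, as in the sketch below. The auto dataset is an assumption, the variable wgt is a hypothetical rescaling, and the instruments are chosen only to show the syntax, not because they are economically valid.

```stata
* Example dataset shipped with Stata (an assumption of this sketch)
sysuse auto, clear

* 2SLS: mpg is treated as endogenous; displacement and gear_ratio are the
* excluded instruments, and weight is an included exogenous regressor
ivregress 2sls price weight (mpg = displacement gear_ratio)

* Binary outcome models
logit foreign mpg weight
probit foreign mpg weight

* Ordered outcome: repair record takes the values 1 to 5
oprobit rep78 mpg weight

* Tobit with left-censoring at 17 miles per gallon
generate wgt = weight / 1000
tobit mpg wgt, ll(17)
```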
EXERCISE 3
27. IV and probit
• Repeat the regression from Exercise 2 using
ivregress 2sls and instrument for tenure
using union and south. Compare the results
with those from Exercise 2.
• Estimate a probit model for union with the
following regressors: age, age2, race2,
race3, msp, grade, c_city, south.
28. Panel data manipulation
• Panel data generally refer to the repeated
observation of a set of fixed entities at fixed
intervals of time (also known as longitudinal data).
• Stata is particularly good at arranging and analysing
panel data.
• Stata refers to two panel display formats:
– Wide form: useful for display purposes and often
the form in which data are obtained.
– Long form: needed for regressions etc.
29. Panel data manipulation
(cont.)
Example of wide form:
i xij