Stata Tutorial
using Stata
(ver. 4.6)
Oscar Torres‐Reyna
Data Consultant
http://dss.princeton.edu/training/
PU/DSS/OTR
List of topics
• What is Stata?
• Stata screen and general description
• First steps (log, memory and directory)
• From SPSS/SAS to Stata
• Example of a dataset in Excel
• From Excel to Stata (copy‐and‐paste, *.csv)
• Saving the dataset
• Describe and summarize (command, menu)
• Rename and label variables (command, menu)
• Creating new variables (generate)
• Recoding variables (recode)
• Recoding variables using egen
• Changing values (replace)
• Extracting characters from regular expressions
• Value labels using the menu
• Indexing (using _n and _N)
– Creating ids and ids by categories
– Lags and forward values
– Countdown and specific values
• Sorting
• Deleting variables (drop)
• Merge
• Append
• Merging fuzzy text (reclink)
• Frequently used Stata commands
• Exploring data:
– Frequencies (tab, table)
– Crosstabulations (with test for associations)
– Descriptive statistics (tabstat)
• Examples of frequencies and crosstabulations
• Creating dummies
• Graphs
– Scatterplot
– Histograms
– Catplot (for categorical data)
– Bars (graphing mean values)
• Regression:
– Overview and basic setting
– Correlation matrix
– Output interpretation (what to look for)
– Graph matrix
– Saving regression coefficients
– F‐test
– Testing for linearity
– Testing for normality
– Testing for homoskedasticity
– Testing for omitted‐variable bias
– Testing for multicollinearity
– Robust standard errors
– Specification error
– Outliers
– Summary of influence indicators
– Summary of distance measures
– Interaction terms
– Publishing regression table (outreg2)
• Useful sites (links only)
– Is my model OK?
– I can’t read the output of my model!!!
– Topics in Statistics
– Recommended books
What is Stata?
• It is a multi‐purpose statistical package to help you explore, summarize and
analyze datasets.
• A dataset is a collection of several pieces of information called variables (usually
arranged by columns). A variable can have one or several values (information for
one or several cases).
• Other statistical packages are SPSS, SAS and R.
• Stata is widely used in social science research and is the most used statistical
software on campus.
This is the Stata screen…
and here is a brief description …
First steps
Three basic procedures you may want to do first:
• Set your working directory (see next slide for more info)
• Create a log file (sort of Stata’s built-in tape recorder and where you
can retrieve the output of your work)
• Using the menu go to File – Log – Begin (see the next slide for
more info)
• Set the correct memory allocation for your data. Some datasets need more
memory; depending on the size you can type set mem 700m to open
the big ones
First steps: graphic view
Three basic procedures you may want to do first: create a log file (sort of Stata’s built-in tape recorder and where you can
retrieve the output of your work), set your working directory, and set the correct memory allocation for your data.
1 ‐ Click on “Save as type:” right below “File name:” and select Log (*.log). This will create the file
called Log1.log (or whatever name you want with extension *.log) which can be read by any word
processor or by Stata (go to File – Log – View). If you save it as *.smcl (Formatted Log) only Stata
can read it. It is recommended to save the log file as *.log
2 ‐ The screen shows your current working directory. You can change it by typing
cd c:\mydirectory
3 ‐ When dealing with really big datasets you may want to increase the memory:
set mem 700m /*You type this in the command window */
To estimate the size of the file you can use the formula:
Size (in bytes) = (8*Number of cases or rows*(Number of variables + 8))
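The three first steps can also be typed as commands. The lines below are a minimal sketch, not the tutorial's own code: the log name Log1.log comes from the slide, while the dataset dimensions (10,000 cases, 50 variables) are made up for illustration. Note that in Stata 12 and later memory is managed automatically, so set mem is not needed there.

```stata
* Display the current working directory
pwd
* To change it, edit and run the next line (hypothetical path)
* cd "c:\mydirectory"

* Start a plain-text log; replace overwrites an existing Log1.log
log using "Log1.log", text replace

* Size formula for a hypothetical dataset with 10,000 cases and 50 variables
display 8 * 10000 * (50 + 8)

* Close the log when you are done
log close
```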
From SPSS/SAS to Stata
You can use the command usespss to read SPSS files in Stata or
the command usesas to read SAS files. If you have a file in SAS
XPORT format you can use fdause (or go to File‐Import). For SPSS, you
may need to install it by typing
ssc install usespss
Once installed just type
usespss using "c:\mydata.sav"
Type help usespss for more details.
There is a similar command for SAS (usesas), just repeat the
previous steps.
For ASCII data please see the ‘Stata’ module at
http://dss.princeton.edu/training/
Example of a dataset in Excel.
Variables are arranged by columns and cases by rows. Each variable has more than one value
Excel to Stata (copy‐and‐paste)
3 ‐ Press Ctrl‐v to paste the data…
Saving the dataset
1 ‐ Close the data editor by pressing the “X” button on the upper‐right corner of the editor
2 ‐ The “Variables” window will show all the variables in your data
4 ‐ This is what you will see in the output window; the data has been saved as students.dta
Excel to Stata (import *.csv)
You can also save the Excel file as *.csv (comma‐separated values) and import it in Stata. In Excel go to File‐Save as.
You may get the following messages, click OK and YES…
In Stata go to File‐Import‐ASCII data created by spreadsheet. Click on ‘Browse’ to find the file and then OK.
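The same import can be done from the command window. The sketch below is an illustration under assumptions: it uses the auto dataset that ships with Stata and a hypothetical file name cars.csv, writing the file first so the example is self-contained. It relies on insheet/outsheet, which match the Stata versions this tutorial targets; newer versions also offer import delimited.

```stata
sysuse auto, clear
* Write a comma-separated file to the working directory (hypothetical file name)
outsheet make price mpg using "cars.csv", comma replace
* Read it back, replacing the data in memory
insheet using "cars.csv", comma clear
describe
```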
Commands: describe and summarize
To start exploring the data we’ll use two commands: describe and summarize
To describe the data go to Data – Describe data – Describe data in memory and press OK
To summarize the data go to Data – Describe data – Summary statistics and press OK
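The menu steps above have direct command equivalents. A minimal sketch, assuming the auto dataset that ships with Stata in place of the students data:

```stata
sysuse auto, clear
* Variable names, storage types and labels
describe
* Number of observations, mean, sd, min and max for every variable
summarize
* Add percentiles, skewness and kurtosis for selected variables
summarize price mpg, detail
```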
Exploring data: frequencies
Frequency refers to the number of times a value is repeated. Frequencies are used to analyze
categorical data. The tables below are frequency tables, values are in ascending order. In Stata use
the command tab (type help tab for more details)
‘Freq.’ provides a raw count of each value. In this case 10
students for each major.
‘Percent’ gives the relative frequency for each value. For
example, 33.33% of the students in this group are econ
majors.
‘Cum.’ is the cumulative frequency in ascending order of
the values. For example, 66.67% of the students are
econ or math majors.
‘Freq.’ Here 6 students read the newspaper 3 days a
week, 9 students read it 5 days a week.
‘Percent’. Those who read the newspaper 3 days a week
represent 20% of the sample, 30% of the students in the
sample read the newspaper 5 days a week.
‘Cum.’ 66.67% of the students read the newspaper 3 to 5
days a week.
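Frequency tables like the ones described above can be produced as follows. This is a sketch that assumes the auto dataset shipped with Stata rather than the students data, so the variable names differ from the slide.

```stata
sysuse auto, clear
* One-way frequency table: Freq., Percent and Cum.
tab rep78
* Include missing values as their own category
tab rep78, missing
* Show numeric codes instead of value labels
tab foreign, nolabel
```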
Exploring data: frequencies (using table)
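The command table gives counts in a more compact layout than tab. A minimal sketch, assuming the auto dataset that ships with Stata (the variables are stand-ins for var1 and var2):

```stata
sysuse auto, clear
* Frequencies of one variable
table rep78
* Frequencies of one variable by another
table rep78 foreign
```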
Exploring data: crosstabs
Crosstabs, also known as contingency tables, help you to analyze the relationship between two or more
variables (mostly categorical). Below is a crosstab between the variable ‘ecostatu’ and ‘gender’. We
use the command tab (but with two variables to make the crosstab).
Options ‘col’ and ‘row’ give you the column and row percentages.
The first value in a cell tells you the number of observations for each xtab. In this case, 90
respondents are ‘male’ and said that the economy is doing ‘very well’, 59 are ‘female’
and believe the economy is doing ‘very well’.
The second value in a cell gives you row percentages for the first variable in the xtab.
Out of those who think the economy is doing ‘very well’, 60.40% are males and 39.60% are
females.
The third value in a cell gives you column percentages for the second variable in the xtab.
Among males, 14.33% think the economy is doing ‘very well’ while 7.92% of females have
the same opinion.
– For nominal data use chi2, lrchi2, V
– For ordinal data use gamma and taub
– Fisher’s exact test: use exact instead of chi2 when frequencies are less than 5 across the
table.
χ2 (chi‐square) tests for relationships between variables. The null
hypothesis (Ho) is that there is no relationship. To reject this we need a
Pr < 0.05 (at 95% confidence). Here both chi2 are significant. Therefore
we conclude that there is some relationship between perceptions of the
economy and gender.
Cramer’s V is a measure of association between two nominal variables. It
goes from 0 to 1 where 1 indicates strong association (for rXc tables). In
2x2 tables, the range is ‐1 to 1. Here the V is 0.15, which shows a small
association.
Gamma and taub are measures of association between two ordinal
variables (both have to be in the same direction, i.e. negative to positive,
low to high). Both go from ‐1 to 1. Negative shows inverse relationship,
closer to 1 a strong relationship. Gamma is recommended when there
are lots of ties in the data. Taub is recommended for square tables.
Fisher’s exact test is used when there are very few cases in the cells
(usually less than 5). It tests the relationship between two variables. The
null is that variables are independent. Here we reject the null and
conclude that there is some kind of relationship between variables
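The options discussed above can be combined with tab as in the sketch below. It assumes the auto dataset that ships with Stata (rep78 and foreign stand in for ‘ecostatu’ and ‘gender’), so the numbers will not match the slide.

```stata
sysuse auto, clear
* Counts plus column and row percentages
tab rep78 foreign, col row
* Tests for nominal data: Pearson chi2, likelihood-ratio chi2, Cramer's V
tab rep78 foreign, chi2 lrchi2 V
* Measures for ordinal data
tab rep78 foreign, gamma taub
* Fisher's exact test, useful when cell frequencies are small
tab rep78 foreign, exact
```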
Exploring data: descriptive statistics
For continuous data we use descriptive statistics. These statistics are a collection of measurements of
two things: location and variability. Location tells you the central value of your variables (the mean is
the most common measure of this) . Variability refers to the spread of the data from the center value
(i.e. variance, standard deviation). Statistics is basically the study of what causes such variability. We
use the command tabstat to get these stats.
•The mean is the sum of the observations divided by the total number of observations.
•The median (p50 in the table above) is the number in the middle . To get the median you have to order the data
from lowest to highest. If the number of cases is odd the median is the single value, for an even number of cases
the median is the average of the two numbers in the middle.
•The standard deviation is the square root of the variance. It indicates how close the data is to the mean. Assuming
a normal distribution, about 68% of the values are within 1 sd from the mean, 95% within 2 sd and 99.7% within 3 sd
•The variance measures the dispersion of the data from the mean. It is the mean of the squared distances
from the mean (Stata divides by N‐1 rather than N).
•Count (N in the table) refers to the number of observations per variable.
•Range is a measure of dispersion. It is the difference between the largest and smallest value, max – min.
•Min is the lowest value in the variable.
•Max is the largest value in the variable.
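The statistics listed above map directly onto tabstat options. A minimal sketch, assuming the auto dataset that ships with Stata; the second command shows the same idea by subgroup.

```stata
sysuse auto, clear
* Location and variability measures for several variables at once
tabstat price mpg weight, s(mean median sd variance count range min max)
* The same kind of table broken down by a categorical variable
tabstat price mpg weight, s(mean sd count) by(foreign)
```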
Exploring data: descriptive statistics
You could also estimate descriptive statistics by subgroups. For example, by gender below
Examples of frequencies and crosstabulations
More examples of frequencies and crosstabulations
rename var1 id
rename var2 country
rename var3 party
rename var4 imports
rename var5 exports
Menu options for rename and label variable
Renaming variables using the menu…
Creating new variables
To generate a new variable use the command generate (gen for short), type
generate [newvar] = [expression]
… results for the first five students…
generate x = 5
generate y = 4*15
generate z = y/x
You can also use generate with string variables. For example:
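As an illustration of generate with strings, the sketch below joins text with the + operator. It assumes the auto dataset that ships with Stata; the new variable names tag and make_mpg are made up for this example.

```stata
sysuse auto, clear
* Concatenate a string variable with fixed text
generate str30 tag = make + " (car)"
* Convert a number to text before joining it to a string
generate str40 make_mpg = make + ": " + string(mpg) + " mpg"
list make tag make_mpg in 1/5
```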
Recoding variables
1.- Recoding ‘age’ into three groups.
recode age (18 19 = 1 "18 to 19") (20/29 = 2 "20 to 29") (30/39 = 3 "30 to 39") (else=.), generate(agegroups) label(agegroups)
Recoding variables using egen
You can recode variables using the command egen and options cut/group.
egen [newvar] = cut (oldvar), at (break1, break2, break3, etc.)
Notice that the breaks show ranges. Below we type four breaks. The first starts at 18 and ends before 20, the
second starts at 20 and ends before 30, the third starts at 30 and ends before 40.
You could also use the option group, which specifies groups with equal frequency (you have to add value
labels):
egen [newvar] = cut (oldvar), group(number of groups)
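Both forms of egen cut can be tried as follows. This is a sketch under assumptions: it uses mpg from the auto dataset that ships with Stata instead of age, and the breaks 10, 20, 30, 50 are chosen only to cover the range of mpg.

```stata
sysuse auto, clear
* Four breaks define three ranges: 10 to under 20, 20 to under 30, 30 to under 50
egen mpgcat = cut(mpg), at(10, 20, 30, 50)
tab mpgcat
* Three groups of roughly equal frequency, coded 0, 1 and 2
egen mpg3 = cut(mpg), group(3)
tab mpg3
```

Each group created with at() is coded with the lower break of its range, which is why value labels are worth adding afterwards.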
Extracting characters from regular expressions
To remove the non‐numeric characters from var1 (keeping the numbers) use the following command
gen var2=regexr(var1,"[.\}\)\*a-zA-Z]+","")
To do the opposite and remove the numbers (keeping the text), use instead
gen var2=regexr(var1,"[.0-9]+","")
Value labels using the menu: step 1
Adding value labels using the menu…
You could also type in the command window (this is what will appear in the results window):
label define sex 1 "Female" 2 "Male"
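Defining a label does not attach it to a variable; label values does that. A minimal self-contained sketch with three made-up observations:

```stata
clear
input sex
1
2
1
end
* Define the value label, then attach it to the variable sex
label define sex 1 "Female" 2 "Male"
label values sex sex
tab sex
```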
Indexing: creating ids
Using _n, you can create a unique identifier for each case in your data, type
Check the results in the data editor, ‘idall’ is equal to ‘id’
Using _N you can also create a variable with the total number of cases in your
dataset:
Check the results in the data editor:
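A minimal sketch of both uses, assuming the auto dataset that ships with Stata; idall follows the slide's naming and total is a made-up name for the case count.

```stata
sysuse auto, clear
* _n is the number of the current observation
generate idall = _n
* _N is the total number of observations
generate total = _N
list make idall total in 1/5
```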
Indexing: creating ids by categories
We can create id by categories. For example, let's create an id by major.
Check the results in the data editor:
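Ids by category combine bysort with _n. The sketch below assumes the auto dataset that ships with Stata, with foreign standing in for major; the new variable names are made up.

```stata
sysuse auto, clear
* Within each category, _n restarts at 1
bysort foreign: generate id_in_group = _n
* Within each category, _N is the size of that category
bysort foreign: generate group_size = _N
list foreign make id_in_group group_size in 1/5
```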
Indexing: lag and forward values
You can create lagged values with _n. Let's rename ‘idall’ as ‘months’ (time
variable) and create a lagged variable containing the value of the previous
case:
Check the results in the data editor:
If you want to lag more than one period just change [_n-1] to [_n-2] for a lag of
two periods, [_n-3] for three, etc.
A more advanced alternative to create lags uses the “L” operator within a time
series setting (the tsset command must be specified first)
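Both approaches to lags can be compared side by side. This is a sketch under assumptions: it uses price from the auto dataset that ships with Stata as the series and builds a made-up time variable months from _n.

```stata
sysuse auto, clear
generate months = _n
* Lags and a forward value using explicit subscripts
generate lag1 = price[_n-1]
generate lag2 = price[_n-2]
generate forward1 = price[_n+1]
* The same one-period lag using the time-series operator L
tsset months
generate lag1_ts = L1.price
list months price lag1 lag1_ts forward1 in 1/5
```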
You can create a variable based on one value of another variable. For example,
let's create a variable with the highest SAT value in the sample.
Check the results in the data editor:
NOTE: You could get the same result without sorting by using
egen and the max function
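A sketch of both routes to the highest value, assuming the auto dataset that ships with Stata with price standing in for SAT; the new variable names are made up.

```stata
sysuse auto, clear
* After sorting in ascending order the last case, [_N], holds the highest value
sort price
generate highprice = price[_N]
* The same result without sorting
egen highprice2 = max(price)
```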
Sorting
sort var1 var2 …
Gsort is another command to sort data. The difference between gsort and
sort is that with gsort you can sort in ascending or descending order, while
with sort you can sort only in ascending order. Use +/- to indicate whether you
want to sort in ascending/descending order. Here are some examples:
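A few gsort calls as an illustration, assuming the auto dataset that ships with Stata:

```stata
sysuse auto, clear
* Descending order of price: the most expensive car comes first
gsort -price
list make price in 1/3
* Ascending by foreign, then descending by mpg within each group
gsort +foreign -mpg
* Ascending order of a string variable
gsort +make
```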
Deleting variables
We have created lots of variables, now we need to do some clean up. Two
commands can do this: drop and keep.
Notice the dash between ‘total’ and ‘readnews2’, you can use this format to indicate a list so you
do not have to type in the name of all the variables
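Both commands and the dash notation are sketched below, assuming the auto dataset that ships with Stata (its variable names replace ‘total’ and ‘readnews2’ from the slide).

```stata
sysuse auto, clear
* Drop a single variable
drop trunk
* Drop every variable stored from headroom through length
drop headroom-length
* Or name the variables to keep; everything else is deleted
keep make price mpg foreign
describe
```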
Deleting cases (selectively)
The World Values Survey (http://www.worldvaluessurvey.org/) provides data for several countries and different years
(waves). Let's say you want to use data for the United States only. Looking at the codebook the variable for country/year is
s025. A frequency of that variable (with and without labels) gives us the following:
tab s025
tab s025, nolabel
The option nolabel gives you the numeric codes for each country/year. Click on –more- to continue.
Deleting cases (selectively) cont.
Frequencies make it difficult to determine the numeric codes for the United States. To find these we use the command
labelbook. Type:
labelbook s025
Click on –more- to continue.
We want to keep data for United States (1999) only. The code is 8401999 (see above). To do this we type
drop if s025!=8401999 /*The operator != means ‘not equal’ */
NOTE: you can drop cases with missing values by typing: drop if missing(var1, var2, var3, …)
Merge/Append
MERGE - You merge when you want to add more variables to an existing dataset.
(type help merge in the command window for more details)
What you need:
– Both files must be in Stata format
– Both files should have at least one variable in common (id)
Step 1. You need to sort the data by the id or ids common to both files you want to merge. For both datasets type:
– sort [id1] [id2] …
– save [datafile name], replace
Step 2. Open the master data (main dataset you want to add more variables to, for example data1.dta) and type:
– merge [id1] [id2] … using [i.e. data2.dta]
For example, opening a hypothetical data1.dta we type
– merge lastname firstname using data2.dta
To verify the merge type
– tab _merge
Here are the codes for _merge:
_merge==1 obs. from master data only
_merge==2 obs. from using data only
_merge==3 obs. from both master and using data
If you want to keep the observations common to both datasets you can drop the rest by typing:
– drop if _merge!=3 /*This will drop observations where _merge is not equal to 3 */
APPEND - You append when you want to add more cases (more rows to your data, type help append for more details).
Open the master file (i.e. data1.dta) and type:
– append using [i.e. data2.dta]
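The merge steps above can be tried on two tiny made-up datasets. This sketch keeps the tutorial's older merge syntax, which needs both files sorted by the id; the file names data1 and data2 follow the slide and are saved in the working directory. Stata 11 and later also accept merge 1:1 id using data2, which does not require sorting.

```stata
* Master data: ids 1 to 3
clear
input id x
1 10
2 20
3 30
end
sort id
save data1, replace

* Using data: ids 2 to 4
clear
input id y
2 200
3 300
4 400
end
sort id
save data2, replace

* Merge: adds the variable y to the master data
use data1, clear
merge id using data2
tab _merge

* Append: adds the cases of data2 below those of data1
use data1, clear
append using data2
list
```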
Merging fuzzy text (reclink)
RECLINK - Matching fuzzy text. Reclink stands for ‘record linkage’. It is a program written by Michael Blasnik to merge imperfect
string variables. For example
Data1 Data2
Princeton University Princeton U
Reclink helps you to merge the two databases by using a matching algorithm for these types of variables. Since it is a user
created program, you may need to install it by typing ssc install reclink. Once installed you can type help reclink
for details
As in merge, the merging variables must have the same name: state, university, city, name, etc. Both the master and the using
files should have an id variable identifying each observation.
Note: the names of the ids must be different, for example id1 (id master) and id2 (id using). Sort both files by the matching (merging)
variables. The basic syntax is:
The variable myscore indicates the strength of the match; a perfect match will have a score of 1. Description (from reclink help
pages):
“reclink uses record linkage methods to match observations between two datasets where no perfect key fields exist --
essentially a fuzzy merge. reclink allows for user-defined matching and non-matching weights for each variable and
employs a bigram string comparator to assess imperfect string matches.
The master and using datasets must each have a variable that uniquely identifies observations. Two new variables are
created, one to hold the matching score (scaled 0-1) and one for the merge variable. In addition, all of the
matching variables from the using dataset are brought into the master dataset (with newly prefixed names) to allow
for manual review of matches.”
Graphs: scatterplot
Scatterplots are good to explore possible relationships or patterns between variables. Let's see if there is some relationship between age
and SAT scores. For many more bells and whistles type help scatter in the command window.
twoway scatter age sat, mlabel(last) || lfit age sat
twoway scatter age sat, mlabel(last) || lfit age sat, yline(30) xline(1800)
Graphs: scatterplot
By categories
Graphs: histogram
Histograms are another good way to visually explore data, especially to check for a normal
distribution; here are some examples (type help histogram in the command window for further
details):
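A few histogram variants as a sketch, assuming the auto dataset that ships with Stata:

```stata
sysuse auto, clear
* Histogram with frequencies on the vertical axis
histogram mpg, frequency
* Overlay a normal density to judge the shape of the distribution
histogram mpg, frequency normal
* One panel per category
histogram mpg, percent by(foreign)
```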
Graphs: catplot
Catplot is used to graph categorical data. Since it is a user‐defined program you may
have to install it by typing: ssc install catplot
Now, type
tab agegroups major, col row cell
catplot bar major agegroups, blabel(bar)
Graphs: catplot
catplot hbar major agegroups, percent(major sex) blabel(bar) by(sex)
Raw counts by major and sex
Graphs: means
Stata can also help to visually present
summaries of data. If you do not want to
type you can go to ‘graphics’ in the menu.
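Typed versions of such graphs look like the sketch below, which assumes the auto dataset that ships with Stata:

```stata
sysuse auto, clear
* Horizontal bars showing the mean of two variables by category
graph hbar (mean) price mpg, over(foreign)
* Vertical bars with the mean printed on top of each bar
graph bar (mean) mpg, over(rep78) blabel(bar)
```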
Regression: a practical approach (intro)
Regression: a practical approach (overview)
We use regression to estimate the unknown effect of changing one variable over
another (Stock and Watson, 2003, ch. 4)
When we run a regression we assume a linear relationship between two variables (i.e.
X and Y). Technically, it estimates how much Y changes when X changes one unit.
In Stata we use the command regress, type:
regress [dependent variable] [independent variable(s)]
regress y x
In a multivariate setting we type:
regress y x1 x2 x3 …
Before running a regression it is recommended to have a clear idea of what you
are trying to estimate (i.e. which are your dependent and independent
variables).
A regression makes sense only if there is a sound theory behind it.
Regression: a practical approach (overview) cont.
Data and examples for this section come from the book Statistics with Stata (updated
for version 9) by Lawrence C. Hamilton (chapter 6). Search for the data at
http://www.duxbury.com/highered/. Use the file states.dta
(educational data for the U.S.).
Regression: a practical approach (setting)
Starting question: Are SAT scores higher in states that spend more money on education, controlling for
other factors?
– Dependent (or predicted, Y) variable – SAT scores, variable csat in dataset
– Independent (or predictor, X) variable(s) – Expenditures on education, variable expense
in dataset. Other variables percent, income, high, college.
Regression: graph matrix
Here is another option for the graph.
Regression: what to look for
Let's run the regression. Adding the option robust gives robust standard errors (to control
for heteroskedasticity).
Regression: functional form/linearity
As a footnote, another graphical way to explore a possible linear relationship between variables or to detect
nonlinearity to define a functional form is by using the command acprplot (augmented component‐plus‐residual
plot). Right after running the regression:
The option lowess (locally weighted scatterplot smoothing) draws the observed pattern in the data to help identify
nonlinearities. Percent shows a quadratic relation, so it makes sense to add a squared version of it. High shows a
polynomial pattern as well but goes around the regression line (except on the right). We could keep it as is for now.
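A sketch of the acprplot step, assuming the auto dataset that ships with Stata in place of states.dta (so mpg and weight stand in for percent and high):

```stata
sysuse auto, clear
regress price mpg weight
* One augmented component-plus-residual plot per predictor of interest
acprplot mpg, lowess
acprplot weight, lowess
```

If the lowess curve bends away from the straight line, a squared term (for example generate mpg2 = mpg^2) is one functional form to try.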
The p-value is 0.0451, under the usual 0.05 threshold (95% confidence), so we conclude that both variables
do have some joint effect on SAT. In a way, this is saying that both have similar effect or are measuring the same
thing (which could suggest multicollinearity). We could keep high since it was borderline significant.
Note: Not to be confused with ttest. Type help test and help ttest for more details
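A joint F‐test of this kind is run with test right after the regression. The sketch below assumes the auto dataset that ships with Stata; the variables tested are placeholders, not the ones from the slide.

```stata
sysuse auto, clear
regress price mpg weight length
* Null hypothesis: the coefficients of mpg and length are both zero
test mpg length
* The F statistic and p-value are stored in r(F) and r(p)
display r(F)
display r(p)
```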
Regression: output
Let's try the new model. It now has a higher R‐squared (0.92) and all the variables are significant.
Percent’s coefficient is -6.52. So, if percent increases by one unit, csat will decrease by 6.52 units.
With a statistically significant p-value of 0.000 (which means that -6.52 is statistically different from 0),
percent has an important impact on csat controlling for other variables (holding them constant). You
could read percent2 (which explains the upward effect) the same way. The net effect of percent is the
difference between both coefficients (which is still negative).
High’s coefficient is 2.98. So, if high increases by one unit, csat will increase by 2.98 units.
The constant 844.82 means that if all variables are 0, the average csat score would be 844.82. It is
where the regression line crosses the Y axis.
Regression: saving regression coefficients/getting predicted values
How good the model is will depend on how well it predicts Y and on the validity of the tests.
There are two ways to generate the predicted values of Y (usually called Yhat) given the model:
predict csat_predict
label variable csat_predict "csat predicted"
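The two routes to Yhat are predict, as above, and a generate statement built from the stored coefficients (_b[varname]). The sketch below shows both under the assumption that the auto dataset shipped with Stata replaces states.dta; the variable names are made up.

```stata
sysuse auto, clear
regress price mpg weight
* Option 1: predict right after the regression
predict price_predict
label variable price_predict "price predicted"
* Option 2: generate using the saved coefficients
generate price_predict2 = _b[_cons] + _b[mpg]*mpg + _b[weight]*weight
list price price_predict price_predict2 in 1/5
```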
Regression: observed vs. predicted values
Regression: testing for normality
A main assumption of the regression model (OLS) that guarantees the validity of all tests (p, t and F) is that residuals behave
‘normal’. Residuals (here indicated by the letter “e”) are the difference between the observed values (Y) and the predicted values
(Yhat): e = Y – Yhat.
Three graphs will help us check for normality in the residuals: kdensity, pnorm and qnorm.
kdensity e, normal
A kernel density plot produces a kind of histogram for the
residuals, the option normal overlays a normal distribution to
compare. Here residuals seem to follow a normal distribution.
Below is an example using histogram.
pnorm e
The standardized normal probability plot (pnorm) checks for non-normality in the middle range of residuals.
Again, slightly off the line but looks ok.
qnorm e
Quantile-normal plots (qnorm) check for non-normality in the extremes of the data (tails). It plots quantiles of residuals vs
quantiles of a normal distribution. Tails are a bit off the normal.
A non-graphical test is the Shapiro-Wilk test for normality. It tests the hypothesis that the distribution is normal; in this case the
null hypothesis is that the distribution of the residuals is normal. Type
swilk e
Here the p-value is 0.64 (way over the usual 0.05 threshold), therefore we fail to reject the null. We find no evidence
that the residuals depart from a normal distribution.
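The residuals have to be saved before any of these commands will run. A sketch of the full sequence, assuming the auto dataset that ships with Stata in place of states.dta:

```stata
sysuse auto, clear
regress price mpg weight
* Save the residuals in a new variable called e
predict e, resid
* Graphical checks
kdensity e, normal
histogram e, kdensity normal
pnorm e
qnorm e
* Shapiro-Wilk test; the null hypothesis is normality
swilk e
```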
Regression: testing for homoskedasticity
Another important assumption is that the variance in the residuals has to be homoskedastic, which means constant. Residuals
should not vary for lower or higher values of X (i.e. fitted values of Y since Y=Xb). A definition:
“The error term [e] is homoskedastic if the variance of the conditional distribution of [ei] given Xi [var(ei|Xi)], is constant for i=1…n, and in particular
does not depend on x; otherwise, the error term is heteroskedastic” (Stock and Watson, 2003, p.126)
When plotting residuals vs. predicted values (Yhat) we should not observe any pattern at all. In Stata we do this using rvfplot
right after running the regression; it will automatically draw a scatterplot between residuals and predicted values. Use estat hettest
to produce a non-graphical test.
rvfplot, yline(0)
estat hettest
These two tests suggest the presence of heteroskedasticity in our model. The problem with this is that we may have the wrong
estimates of the standard errors for the coefficients and therefore their t-values.
By default Stata assumes homoskedastic standard errors, so we need to adjust our model to account for heteroskedasticity. To do
this we use the option robust in the regress command.
Notice the difference in the standard errors and the t-values. Following Stock and Watson, as a rule
of thumb, you should always assume heteroskedasticity in your model and use robust standard errors
by adding the option robust (or r for short) to the regression command (see Stock and Watson,
2003, chapter 4)
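The whole check can be run in sequence as in this sketch, which assumes the auto dataset that ships with Stata in place of states.dta:

```stata
sysuse auto, clear
regress price mpg weight
* Residuals vs. fitted values; look for any pattern around the zero line
rvfplot, yline(0)
* Breusch-Pagan / Cook-Weisberg test; the null hypothesis is constant variance
estat hettest
* Same model with robust standard errors
regress price mpg weight, robust
```

The coefficients are identical in both regressions; only the standard errors, t-values and p-values change.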
Regression: omitted‐variable test
How do we know we have included all variables we need to explain Y?
Testing for omitted variable bias is important for our model since it is related to the assumption that the
error term and the independent variables in the model are not correlated (E(e|X) = 0)
If we are missing one variable in our model and “[(1)] is correlated with the included regressor; and
[(2)] the omitted variable is a determinant of the dependent variable” (Stock and Watson, 2003, p.144),
then our regression coefficients are inconsistent.
In Stata we test for omitted-variable bias using the ovtest command. After running the regression
type:
ovtest
The null hypothesis is that the model does not have omitted-variables bias. The p-value is 0.2319,
higher than the usual threshold of 0.05, so we fail to reject the null and conclude that we do not need
more variables.
Regression: specification error
Another command to test model specification is linktest. It basically checks whether we need more
variables in our model by running a new regression with the observed Y (csat) against Yhat
(csat_predict) and Yhat-squared as independent variables (see note 1).
The thing to look for here is the significance of _hatsq. The null hypothesis is that there is no
specification error. If the p-value of _hatsq is not significant then we fail to reject the null and
conclude that our model is correctly specified.
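linktest is typed right after the regression. A minimal sketch, assuming the auto dataset that ships with Stata in place of states.dta:

```stata
sysuse auto, clear
regress price mpg weight
* Regresses price on the prediction (_hat) and its square (_hatsq)
linktest
* p-value of _hatsq, computed from the linktest results
display 2*ttail(e(df_r), abs(_b[_hatsq]/_se[_hatsq]))
```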
1 For more details see http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm, and/or type help linktest.
Regression: outliers
To check for outliers we use the avplots command (added-variable plots). Outliers are data points
with extreme values that could have a negative effect on our estimators. After running the regression
type:
avplots
These plots regress each variable against all others, notice the coefficients on each. All data points
seem to be in range, no outliers observed.
For more details and tests on this and influential and leverage variables please check
http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm
Regression: summary of influence indicators
DfBeta
– What it measures: the influence (in standard‐error terms) of each observation on the coefficient of a particular independent variable (for example, x1). Note: Stata estimates standardized DfBetas.
– Cutoff: a case is an influential outlier if |DfBeta| > 2/SQRT(N), where N is the sample size.
– In Stata, after running the regression type:
reg y x1 x2 x3
dfbeta x1
Note: you could also type:
predict DFx1, dfbeta(x1)
To estimate the dfbetas for all predictors just type:
dfbeta
To flag the cutoff:
gen cutoffdfbeta = abs(DFx1) > 2/sqrt(e(N)) & e(sample)
– In SPSS: Analyze‐Regression‐Linear; click Save. Select under “Influence Statistics” to add as a new variable (DFB1_1) or in syntax type
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Y
/METHOD=ENTER X1 X2 X3
/CASEWISE PLOT(ZRESID) OUTLIERS(3) DEFAULTS DFBETA
/SAVE MAHAL COOK LEVER DFBETA SDBETA DFFIT SDFIT COVRATIO .
DfFit
– What it measures: it is a summary measure of leverage and high residuals. Measures how much an observation influences the regression model as a whole, that is, how much the predicted values change as a result of including and excluding a particular observation.
– Cutoff: high influence if |DfFIT| > 2*SQRT(k/N), where k is the number of parameters (including the intercept) and N is the sample size.
– In Stata, after running the regression type:
predict dfits if e(sample), dfits
To generate the flag for the cutoff type:
gen cutoffdfit = abs(dfits) > 2*sqrt((e(df_m)+1)/e(N)) & e(sample)
– In SPSS: same as DfBeta above (DFF_1)
Covariance ratio
– What it measures: the impact of an observation on the standard errors.
– Cutoff: high impact if |COVRATIO-1| ≥ 3*k/N, where k is the number of parameters (including the intercept) and N is the sample size.
– In Stata, after running the regression type:
predict covratio if e(sample), covratio
– In SPSS: same as DfBeta above (COV_1)
Regression: summary of distance measures
Cook’s distance
– What it measures: how much an observation influences the overall model or predicted values. It is a summary measure of leverage and high residuals.
– Cutoff: high influence if D > 4/N, where N is the sample size. A D > 1 indicates a big outlier problem.
– In Stata, after running the regression type:
predict D, cooksd
– In SPSS: Analyze‐Regression‐Linear; click Save. Select under “Distances” to add as a new variable (COO_1) or in syntax type
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Y
/METHOD=ENTER X1 X2 X3
/CASEWISE PLOT(ZRESID) OUTLIERS(3) DEFAULTS DFBETA
/SAVE MAHAL COOK LEVER DFBETA SDBETA DFFIT SDFIT COVRATIO.
Leverage
– What it measures: how much an observation influences regression coefficients.
– Cutoff: high influence if leverage h > 2*k/N, where k is the number of parameters (including the intercept) and N is the sample size.
– In Stata, after running the regression type:
predict lev, leverage
– In SPSS: same as above (LEV_1)
Mahalanobis distance
– What it measures: it is a rescaled measure of leverage. M = leverage*(N-1), where N is the sample size.
– Cutoff: higher levels indicate higher distance from average values. The M‐distance follows a Chi‐square distribution with k-1 df and alpha=0.001 (where k is the number of independent variables).
– In Stata: not available
– In SPSS: same as above (MAH_1)
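Several of the indicators in the two summaries can be computed and flagged in one pass. The sketch below assumes the auto dataset that ships with Stata; the flag variable names are made up, and k is taken as e(df_m)+1 (predictors plus the intercept), following the cutoffs listed in the summaries.

```stata
sysuse auto, clear
regress price mpg weight

* Influence and distance measures
predict DFmpg, dfbeta(mpg)
predict dfits if e(sample), dfits
predict D, cooksd
predict lev, leverage

* Flags based on the cutoffs in the summaries
gen flagdfbeta = abs(DFmpg) > 2/sqrt(e(N)) & e(sample)
gen flagdfits  = abs(dfits) > 2*sqrt((e(df_m)+1)/e(N)) & e(sample)
gen flagD      = D > 4/e(N) & e(sample)
gen flaglev    = lev > 2*(e(df_m)+1)/e(N) & e(sample)

* Cases flagged by at least one indicator
list make DFmpg dfits D lev if flagdfbeta | flagdfits | flagD | flaglev
```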
Sources for the summary tables:
influence indicators and distance measures
• Statnotes:
http://faculty.chass.ncsu.edu/garson/PA765/regress.htm#outlier2
• An Introduction to Econometrics Using Stata/Christopher F. Baum, Stata
Press, 2006
• Statistics with Stata (updated for version 9) / Lawrence Hamilton,
Thomson Brooks/Cole, 2006
Regression: multicollinearity
An important assumption for the multiple regression model is that independent variables are not perfectly
multicollinear. That is, one regressor should not be an exact linear function of another. When perfect multicollinearity is
present, Stata will drop one of the variables to avoid a division by zero in the OLS procedure (see Stock and
Watson, 2003, chapter 5). A major problem with high (but not perfect) multicollinearity is that standard errors may be inflated. The
Stata command to check for multicollinearity is vif (variance inflation factor). Right after running the
regression type:
vif
In this example all VIFs are under 10, so we do not observe multicollinearity problems.
For more details see http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm, and/or type help vif.
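As a hedged illustration of both cases described above, the sketch below uses Stata's bundled auto dataset (an assumption, not the data used in these slides). It first requests variance inflation factors with estat vif, which is the current name of the vif command, and then builds an exactly collinear regressor (the hypothetical variable weight2) to show Stata omitting a variable.

```stata
* Illustrative data and model (assumption: bundled auto dataset)
sysuse auto, clear
regress price mpg weight length

* Variance inflation factors for the regressors (older syntax: vif)
estat vif

* Perfect collinearity: weight2 is an exact linear function of weight,
* so Stata omits one of the two variables and reports it in a note
generate weight2 = 2*weight
regress price mpg weight weight2
```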
Regression: publishing regression output (outreg2)
The command outreg2 gives you the type of presentation you see in published papers. If outreg2 is not available you need to
install it by typing ssc install outreg2
Let’s say the regression is regress csat percent percent2 high, robust
The basic syntax for outreg2 is: outreg2 using [pick a name], [type either word or excel]
After the regression type the following if you want to export the results to excel*
outreg2 using results, excel
*See the following document for some additional info/tips http://www.fiu.edu/~tardanic/brianne.pdf
Regression: publishing regression output (outreg2) (cont.)
You can add more models to compare. Let’s say you want to add another model without percent2:
regress csat percent high, robust
Now type the following to add the results to the existing output file (notice we add the append option; here the output format is word)
outreg2 using results, word append
NOTE: If you run a logit/probit regression with odds ratios you need to add the option eform to export the odds ratios.
Type help outreg2 for more details. If you do not see outreg2, you may have to install it by typing ssc install outreg2. If this does not work type
findit outreg2, select from the list and click “install”.
Note: If you get an error message when you use the option append or replace, it means that you need to close the excel/word window.
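The export steps above can be put together in one short do-file. This is a sketch under stated assumptions rather than the slides' own example: it uses Stata's bundled auto dataset instead of the csat data, the file names results and results_logit are arbitrary, and outreg2 must already be installed from SSC.

```stata
* ssc install outreg2        // run once if outreg2 is not installed
sysuse auto, clear

* Model 1: start a new file (replace overwrites an existing results file)
regress price mpg weight, robust
outreg2 using results, excel replace

* Model 2: add a second column to the same file
regress price mpg, robust
outreg2 using results, excel append

* Logit exported as odds ratios, here in Word format
logit foreign mpg weight
outreg2 using results_logit, word replace eform
```

Starting the first call with replace avoids appending to a leftover file from an earlier session.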
Regression: publishing regression output (outreg2) (cont.)
For a customized look, here are some options (in a do-file, /// continues the command on the next line):
*** Excel
outreg2 using results, bdec(2) tdec(2) rdec(2) adec(2) alpha(0.01, 0.05, 0.10) ///
    addstat(Adj. R-squared, e(r2_a)) excel
*** Word
outreg2 using results, bdec(2) tdec(2) rdec(2) adec(2) alpha(0.01, 0.05, 0.10) ///
    addstat(Adj. R-squared, e(r2_a)) word
Regression: interaction between dummies
Interaction terms are needed whenever there is reason to believe that the effect of one independent variable depends on the value of
another independent variable. We will explore here the interaction between two dummy (binary) variables. In the example below
it could be the case that the effect of student‐teacher ratio on test scores depends on the percent of English learners in the district*.
– Dependent variable (Y) – Average test score, variable testscr in dataset.
– Independent variables (X)
• Binary hi_str, where ‘0’ if student‐teacher ratio (str) is lower than 20, ‘1’ equal to 20 or higher.
– In Stata, first generate hi_str = 0 if str<20. Then replace hi_str=1 if str>=20.
• Binary hi_el, where ‘0’ if English learners (el_pct) is lower than 10%, ‘1’ equal to 10% or higher
– In Stata, first generate hi_el = 0 if el_pct<10. Then replace hi_el=1 if el_pct>=10.
• Interaction term str_el = hi_str * hi_el. In Stata: generate str_el = hi_str*hi_el
We run the regression
regress testscr hi_el hi_str str_el, robust
The effect of hi_str on the test scores is ‐1.9 but, given the interaction term (and assuming all coefficients are significant), the net effect is
-1.9 - 3.5*hi_el. If hi_el is 0 then the effect is ‐1.9 (which is the hi_str coefficient), but if hi_el is 1 then the effect is ‐1.9 ‐ 3.5 = ‐5.4.
In this case, the effect of student‐teacher ratio is more negative in districts where the percent of English learners is higher.
See the next slide for more detailed computations.
*The data used in this section is the “California Test Score” data set (caschool.dta) from chapter 6 of the book Introduction to Econometrics from Stock and Watson, 2003. Data can be downloaded from
http://wps.aw.com/aw_stock_ie_2/50/13016/3332253.cw/index.html. For a detailed discussion please refer to the respective section in the book.
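The setup and regression for the two-dummy interaction can be sketched as follows. The sketch assumes caschool.dta has been downloaded to the working directory. It builds each dummy in a single generate statement instead of the generate/replace pair described above: because Stata treats missing values as larger than any number, a condition such as str>=20 would also be true for missing values, and the if !missing() qualifier keeps them missing. The lincom line is an addition for illustration; it reports the effect of hi_str when hi_el is 1 together with its standard error.

```stata
* Assumes caschool.dta has been downloaded to the working directory
use caschool, clear

* (str >= 20) evaluates to 1 or 0; the if qualifier leaves missing values missing
generate hi_str = (str >= 20) if !missing(str)
generate hi_el  = (el_pct >= 10) if !missing(el_pct)
generate str_el = hi_str*hi_el

regress testscr hi_el hi_str str_el, robust

* Effect of hi_str when hi_el = 1 (hi_str coefficient plus interaction)
lincom hi_str + str_el
```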
Regression: interaction between dummies (cont.)
You can compute the expected values of test scores given different values of hi_str and hi_el. To see the effect of hi_str given
hi_el, compute the predicted test score for each combination of the two dummies right after running the regression in the previous slide.
These are different scenarios holding constant hi_el and varying hi_str, and each one can be labeled.
We then obtain the average of the estimations for the test scores (for all four scenarios; notice the same values for all cases).
Here we estimate the net effect of low/high student-teacher ratio holding constant the percent of
English learners. When hi_el is 0, going from low to high student-teacher ratio lowers the expected
score from 664.2 to 662.2, a difference of 1.9 (the hi_str coefficient; the scores shown are rounded).
From a policy perspective you could argue that moving from high str to low str improves test scores
by 1.9 points in districts with a low percentage of English learners.
When hi_el is 1, going from low to high student-teacher ratio lowers the expected score from
645.9 to 640.5, a decline of 5.4 points (1.9+3.5). From a policy perspective you could say
that reducing the str in districts with a high percentage of English learners could improve test scores
by 5.4 points.
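One way to obtain the four expected test scores discussed above is to combine the estimated coefficients with lincom. This is a sketch of one possible approach, not necessarily the commands used in the original slide, and it assumes caschool.dta is in the working directory.

```stata
* Assumes caschool.dta has been downloaded to the working directory
use caschool, clear
generate hi_str = (str >= 20) if !missing(str)
generate hi_el  = (el_pct >= 10) if !missing(el_pct)
generate str_el = hi_str*hi_el
regress testscr hi_el hi_str str_el, robust

* Expected test score in each of the four scenarios
lincom _cons                              // hi_el = 0, hi_str = 0
lincom _cons + hi_str                     // hi_el = 0, hi_str = 1
lincom _cons + hi_el                      // hi_el = 1, hi_str = 0
lincom _cons + hi_el + hi_str + str_el    // hi_el = 1, hi_str = 1
```

Each lincom call also reports a standard error and confidence interval for the scenario, which the point estimates alone do not give.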
Regression: interaction between a dummy and a continuous variable
Let’s explore the same interaction as before, but now we keep student‐teacher ratio continuous and the English learners variable binary. The
question remains the same*.
– Dependent variable (Y) – Average test score, variable testscr in dataset.
– Independent variables (X)
• Continuous str, student‐teacher ratio.
• Binary hi_el, where ‘0’ if English learners (el_pct) is lower than 10%, ‘1’ equal to 10% or higher
• Interaction term str_el2 = str * hi_el. In Stata: generate str_el2 = str*hi_el
We will run the regression
regress testscr str hi_el str_el2, robust
*The data used in this section is the “California Test Score” data set (caschool.dta) from chapter 6 of the book Introduction to Econometrics from Stock and Watson, 2003. Data can be downloaded from
http://wps.aw.com/aw_stock_ie_2/50/13016/3332253.cw/index.html. For a detailed discussion please refer to the respective section in the book.
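With a continuous str and a binary hi_el, the slope of str is the str coefficient when hi_el is 0, and the str coefficient plus the interaction coefficient when hi_el is 1. The sketch below, which assumes caschool.dta is in the working directory, shows one way to recover both slopes. The second half uses factor-variable notation and margins, which require Stata 11 or later and are an alternative added here for illustration, not something the slides use.

```stata
* Assumes caschool.dta has been downloaded to the working directory
use caschool, clear
generate hi_el   = (el_pct >= 10) if !missing(el_pct)
generate str_el2 = str*hi_el
regress testscr str hi_el str_el2, robust

* Slope of str when hi_el = 0 is _b[str]; when hi_el = 1 it is:
lincom str + str_el2

* Same model with factor-variable notation (Stata 11 or later)
regress testscr c.str##i.hi_el, vce(robust)

* Slope of str at each level of hi_el
margins hi_el, dydx(str)
```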
Regression: interaction between two continuous variables
Let’s now keep both variables continuous. The question remains the same*.
– Dependent variable (Y) – Average test score, variable testscr in dataset.
– Independent variables (X)
• Continuous str, student‐teacher ratio.
• Continuous el_pct, percent of English learners.
• Interaction term str_el3 = str * el_pct. In Stata: generate str_el3 = str*el_pct
We will run the regression
regress testscr str el_pct str_el3, robust
*The data used in this section is the “California Test Score” data set (caschool.dta) from chapter 6 of the book Introduction to Econometrics from Stock and Watson, 2003. Data can be downloaded from
http://wps.aw.com/aw_stock_ie_2/50/13016/3332253.cw/index.html. For a detailed discussion please refer to the respective section in the book.
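When both variables are continuous, the effect of one more student per teacher is no longer a single number: it equals the str coefficient plus the str_el3 coefficient times el_pct. The following sketch evaluates that expression at a few values of el_pct. The values 0, 10 and 30 are arbitrary choices for illustration, and the code assumes caschool.dta is in the working directory.

```stata
* Assumes caschool.dta has been downloaded to the working directory
use caschool, clear
generate str_el3 = str*el_pct
regress testscr str el_pct str_el3, robust

* Effect of str on testscr = _b[str] + _b[str_el3]*el_pct
lincom str                    // at el_pct = 0
lincom str + 10*str_el3       // at el_pct = 10
lincom str + 30*str_el3       // at el_pct = 30
```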
Creating dummies
You can create dummy variables by either using recode or using a combination of tab/gen commands:
tab major, generate(major_dum)
Creating dummies (cont.)
Here is another example:
tab agegroups, generate(agegroups_dum)
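Both approaches to creating dummies can be tried on Stata's bundled auto dataset. The dataset, the stub rep_dum and the variable rep_high are assumptions chosen for illustration and are not part of the original slides.

```stata
* Illustrative data (assumption: bundled auto dataset)
sysuse auto, clear

* tab/gen: one dummy per category of rep78 (rep_dum1 ... rep_dum5)
tabulate rep78, generate(rep_dum)

* recode: a single dummy, 1 if rep78 is 4 or 5 and 0 if it is 1 to 3
recode rep78 (1/3 = 0) (4/5 = 1), generate(rep_high)

* Check the new variable against the original, including missing values
tabulate rep78 rep_high, missing
```

Observations with a missing rep78 stay missing in all the new variables, which the final tabulate makes visible.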
Frequently used Stata commands
Type help [command name] in the Command window for details
Source: http://www.ats.ucla.edu/stat/stata/notes2/commands.htm

Category: Stata commands
Getting on‐line help: help, search
Operating‐system interface: pwd, cd, sysdir, mkdir, dir / ls, erase, copy, type
Using and saving data from disk: use, clear, save, append, merge, compress
Inputting data into Stata: input, edit, infile, infix, insheet
The Internet and Updating Stata: update, net, ado, news
Basic data reporting: describe, codebook, inspect, list, browse, count, assert, summarize, table, tabulate (tab)
Data manipulation: generate, replace, egen, recode, rename, drop, keep, sort, encode, decode, order, by, reshape
Formatting: format, label
Keeping track of your work: log, notes
Convenience: display
Is my model OK? (links)
Time series: dfuller test for unit roots (for R and Stata)
http://www.econ.uiuc.edu/~econ472/tutorial9.html
– http://www.stata.com/support/faqs/stat/panel.html
– http://www.stata.com/support/faqs/stat/xtreg.html
– http://www.stata.com/support/faqs/stat/xt.html
– http://dss.princeton.edu/online_help/analysis/panel.htm
I can’t read the output of my model!!! (links)
Regression
http://www.ats.ucla.edu/stat/stata/topics/regression.htm
Topics in Statistics (links)
Stata Library. Graph Examples (some may not work with STATA 10)
http://www.ats.ucla.edu/STAT/stata/library/GraphExamples/default.htm
Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS
http://www.indiana.edu/~statmath/stat/all/ttest/
Useful links / Recommended books
• DSS Online Training Section http://dss.princeton.edu/training/
• UCLA Resources to learn and use STATA http://www.ats.ucla.edu/stat/stata/
• DSS help‐sheets for STATA http://dss/online_help/stats_packages/stata/stata.htm
• Introduction to Stata (PDF), Christopher F. Baum, Boston College, USA. “A 67‐page description of Stata, its key
features and benefits, and other useful information.” http://fmwww.bc.edu/GStat/docs/StataIntro.pdf
• STATA FAQ website http://stata.com/support/faqs/
• Princeton DSS Libguides http://libguides.princeton.edu/dss
Books
• Introduction to econometrics / James H. Stock, Mark W. Watson. 2nd ed., Boston: Pearson Addison
Wesley, 2007.
• Data analysis using regression and multilevel/hierarchical models / Andrew Gelman, Jennifer Hill.
Cambridge ; New York : Cambridge University Press, 2007.
• Econometric analysis / William H. Greene. 6th ed., Upper Saddle River, N.J. : Prentice Hall, 2008.
• Designing Social Inquiry: Scientific Inference in Qualitative Research / Gary King, Robert O.
Keohane, Sidney Verba, Princeton University Press, 1994.
• Unifying Political Methodology: The Likelihood Theory of Statistical Inference / Gary King, Cambridge
University Press, 1989
• Statistical Analysis: an interdisciplinary introduction to univariate & multivariate methods / Sam
Kachigan, New York : Radius Press, c1986
• Statistics with Stata (updated for version 9) / Lawrence Hamilton, Thomson Brooks/Cole, 2006