Stata 14 Tutorial PDF

This tutorial introduces STATA software for statistical analysis and econometrics. It covers importing and exploring cross-sectional data, descriptive statistics, graphs, simple and multiple regression, and data transformations. The tutorial explains both interactive and batch ("do-file") modes, and highlights STATA's ability to store results for further analysis. While an introduction, it notes the program's limitations and directs users to reference manuals for more advanced questions.

Uploaded by

Anand Retharekar

STATA 14 Tutorial

by Manfred W. Keil

to Accompany

Introduction to Econometrics, 4th Edition (2018)


by James H. Stock and Mark W. Watson

------------------------------------------------------------------------------------------------------------------

1. STATA: INTRODUCTION

2. CROSS-SECTIONAL DATA

Interactive Use: Data Input and Simple Data Analysis

a) The Easy and Tedious Way: Manual Data Entry
b) Summary Statistics
c) Graphical Presentations
d) Simple Regression
e) Entering Data from a Spreadsheet
f) Importing Data Files directly into STATA
g) Multiple Regression Model
h) Data Transformations

Batch (Do-Files)

3. SUMMARY OF FREQUENTLY USED STATA COMMANDS

4. FINAL NOTE

-----------------------------------------------------------------------------------------------------------------
1. STATA: INTRODUCTION

This tutorial will introduce you to a statistical and econometric software package called
STATA. The tutorial is an introduction to some of the most commonly used features in
STATA. These features were used by the authors of your textbook to generate the statistical
analysis reported in Chapters 3-9 (Stock and Watson, 2018). The tutorial provides the necessary
background to reproduce the results of Chapters 3-9 and to carry out related exercises. It does
not cover panel data (Chapter 10), binary dependent variables (Chapter 11), instrumental
variable analysis (Chapter 12), or time-series analysis (Chapters 15-17), nor the estimates
presented in Big Data (Chapter 14).

The most current professional version is STATA 15. STATA 13 and STATA 14 are
sufficiently similar that those who only have access to STATA 13 can also use this
tutorial. As with many statistical packages, newer versions of a program allow you to use more
advanced and recently developed techniques that you, as a first time user, most likely will not
encounter in a first course of statistics or econometrics. There are several versions of STATA
14, such as STATA/IC, STATA/SE, and STATA/MP. The difference is basically in terms of
the number of variables STATA can handle and the speed at which information is processed.
Most users will probably work with the “Intercooled” (IC) version.

STATA runs on Windows, Mac, and Unix platforms. I assume most of you will
be using STATA on Windows computers. It is produced by StataCorp in College Station, TX.
You can read about various product information at the firm’s Web site, www.stata.com . There
are 21 subject-specific statistics reference manuals in addition to four general reference
manuals (Base, Data Management, Graphics, Functions) and the User’s Guide, all of
which can be downloaded with STATA 15 (STATA 14 is not that different as far as you, as a
beginner, are concerned). Perhaps the most useful of these are the User’s Guide and the Base
Reference Manuals. You can order STATA by calling (800) 782-8272 or writing to
[email protected]. In addition, if you purchase the Student Version, you can acquire
STATA at a steep discount. Prices vary, but you could get a “perpetual license” for STATA/IC
for $198, or a six-month license for as low as $45.

Econometrics deals with three types of data: cross-sectional data, time series data, and panel
(longitudinal) data (see Chapter 1 of Stock and Watson (2018)). In a cross-section you
analyze data from multiple entities at a single point in time. In a time series you observe the
behavior of a single entity over multiple time periods. This can range from high frequency data
such as financial data (hours, days); to data observed at somewhat lower (monthly)
frequencies, such as industrial production, inflation, and unemployment rates; to quarterly data
(GDP) or annual (historical) data. One big difference between cross-sectional and time series
analysis is that the order of the observation numbers does not matter in cross-sections. With
time series, you would lose some of the most interesting features of the data if you shuffled the
observations. Finally, panel data can be viewed as a combination of cross-sectional and time
series data, since multiple entities are observed at multiple time periods. STATA allows you to
work with all three types of data.

STATA is most commonly used for cross-sectional and panel data in academia, business, and
government, but you can also work with it relatively easily when you analyze time-series data.
STATA allows you to store results within a program and to “retrieve” these results for further
calculations later. Remember how you calculated confidence intervals in statistics, say, for a
population mean? Basically you needed the sample mean, the standard error, and some value
from a statistical table. In STATA, you can calculate the mean and standard deviation of a
sample and then temporarily “store” these. You then work with these numbers in a standard
formula for confidence intervals. In addition, STATA provides the required numbers from the
relevant distribution (normal, $\chi^2$, F, etc.).
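
The idea of storing and retrieving results can be sketched as follows. This is only an illustration, not part of the textbook material: the ten test scores are the sub-sample entered by hand later in this tutorial, and the scalar names (xbar, se_mean, tcrit, ci_lo, ci_hi) are made up for this example.

```stata
* Sketch: 95% confidence interval for a mean, built from stored results
clear
input testscr
606.8
631.1
631.4
631.8
631.9
632.0
632.0
638.5
638.7
639.3
end

* summarize leaves r(mean), r(sd), and r(N) behind for later use
quietly summarize testscr
scalar xbar    = r(mean)
scalar se_mean = r(sd) / sqrt(r(N))
* critical value from the t distribution with N-1 degrees of freedom
scalar tcrit   = invttail(r(N) - 1, 0.025)
scalar ci_lo   = xbar - tcrit * se_mean
scalar ci_hi   = xbar + tcrit * se_mean

display "mean = " xbar ", standard error = " se_mean
display "95% confidence interval: [" ci_lo ", " ci_hi "]"
```

Here invttail() plays the role of the statistical table: it returns the critical value that you would otherwise look up by hand.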

While STATA is truly “interactive,” you can also run a program in “batch” mode:

• Interactive use: you type a STATA command in the STATA Command Window (see
below) and hit the Return/Enter key on your keyboard. STATA executes the command
and the results are displayed in the STATA Results Window. Then you enter the next
command, STATA executes it, and so forth, until the analysis is complete. Even the
simplest statistical analysis typically will involve several STATA commands.
• Batch mode: all of the commands for the analysis are listed in a file, and STATA is told
to read the file and execute all of the commands. These files are called Do-Files and are
saved using a .do suffix.

In the good old days, the equivalent of writing a Do-File was to submit a “batch” of cards, each
card containing a single command (now line), to a technician, who would use a card reader to
enter these into the computer. The computer would then execute the sequence of statements.
(You stored this batch of cards typically in a filing cabinet, and the deck was referred to as a
“file.”) While you will work at first in interactive mode by clicking on buttons or writing single
line commands, you will very soon discover the advantage of running your regressions in batch
mode. This method allows you to see the history of commands, and you can also analyze
where exactly things went wrong if there are problems (“errors”) with any of your commands.
This tutorial will initially explain the interactive use of STATA since it is more intuitive.
However, we will switch to batch mode as soon as it makes sense, and you should
seriously try to do your research/class work using this mode (“Do-Files”).
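
To make the batch idea concrete, here is a minimal sketch of what a Do-File might contain. The file name example.do is hypothetical, and the data are the ten observations used later in this tutorial; a real Do-File would usually load a saved data set instead of typing the numbers in.

```stata
* example.do (hypothetical file name): a minimal Do-File sketch
clear
set more off
* ten observations of test scores and student-teacher ratios
input testscr str
606.8 19.5
631.1 20.1
631.4 21.5
631.8 20.1
631.9 20.4
632.0 22.4
632.0 22.9
638.5 19.1
638.7 20.2
639.3 19.7
end

* the analysis itself: every command is recorded and can be rerun
summarize testscr str
regress testscr str, robust
```

If these lines are saved as example.do in the current working directory, typing do example.do in the Command Window executes them from top to bottom.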

STATA produces highly professional looking graphs and charts. However, it requires some
practice to generate these. A separate manual (Graphics) is devoted solely to this topic. Since
STATA works in a Windows format, it allows you to cut and paste the data into other
Windows-based programs, such as Word or WordPerfect.

Finally, there is a warning about the limitations of this tutorial. The purpose is to help you gain
an initial understanding of how to work with STATA. I hope that the tutorial looks less
daunting than the manuals. However, it cannot replace the accompanying manuals, which you
will have to consult for more detailed questions (alternatively use “Help” within the program).
Feel free to provide me with feedback of how the tutorial can be improved for future
generations of students ([email protected]). Colleagues of mine and I have decided to set up a
“Wiki” run by students but supervised by faculty at my academic institution. We have found
that the “wisdom of crowds” often produces valuable information for those who follow. This
is, of course, just a suggestion. Finally you may want to think about working with statistical
software as learning a new language: practicing it routinely will result in improvement. If you
set it aside for too long, you will only remember the most important lines but will forget the
important details. Another danger of tutorials like this is that you simply follow the instructions
and when you are done, you do not remember the commands. It is therefore a good idea to
keep a separate sheet and to write down commands and examples of them if you think you will
use them later. I will give you short exercises so that you can practice the commands on your
own. At the end of this tutorial, I have provided a summary of selected STATA commands.

2. CROSS-SECTIONAL DATA

Interactive Use: Data Input and Simple Data Analysis

Let’s get started. Click on the STATA icon to begin your session, or choose STATA 14 from
your START window. Once you have started STATA, you will see a large window containing
several smaller windows. At this point you can load a data set or enter data (described below)
and begin the statistical analysis.

The results of your various operations will be displayed in the so-called Results Window. On
the bottom right, there is a Variables Window, which shows the names of variables currently
active in the datafile. Above it is the Review Window, which lets you view previously used
STATA commands. In interactive use, STATA allows you to execute commands either by
clicking on command buttons or by typing the equivalent command into the Command Window
on the bottom of the initial page.

In this tutorial, we will work with two data applications, both cross-sectional: the California
Test Score Data Set used in Chapters 4-9 and, as an exercise, the Current Population Survey
Data Set used in Chapters 3 and 8.

a) The Easy and Tedious Way: Manual Data Entry

In Chapters 4 to 9 you will work with the California Test Score Data Set. These are
cross-sectional data. There are 420 observations from K-6 and K-8 school districts for the years
1998 and 1999. You will not want to enter a large amount of data manually, since it is tedious and
leaves room for human error. As a result, it is generally not a recommended method of
inputting data. However, there are occasions when you have collected data by yourself
(something that economists are doing more and more). The alternative is to enter the data into a
spreadsheet (Excel) and then to cut and paste the data (see below).

Entering data manually is used here for pedagogical purposes since it gives you an initial
understanding of how to work with data in STATA. In other words, it will be useful for you to
become familiar with entering and editing data in the program. Here I will use a sub-sample of 10
observations from the California Test Score Data Set.

To start, click on the Data Editor (Edit) button on the toolbar, or type the command edit into
the Command Window. This will open the following screen:

To enter data manually, start typing in the observations (no need to label the columns now; you
will name the variables subsequently). Here I have chosen 10 observations of test scores
(testscr) and the student-teacher ratio (str) from the data set you will use in Chapter 4 of the
textbook (type in the numbers for all three columns).

School   testscr   str

1        606.8     19.5
2        631.1     20.1
3        631.4     21.5
4        631.8     20.1
5        631.9     20.4
6        632.0     22.4
7        632.0     22.9
8        638.5     19.1
9        638.7     20.2
10       639.3     19.7
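
As an alternative to clicking through the Data Editor, the same ten rows can be typed with the input command. This is a sketch rather than the method the tutorial follows; the column names school, testscr, and str are taken from the table above.

```stata
* Sketch: enter the ten observations from the Command Window or a Do-File
clear
input school testscr str
1  606.8 19.5
2  631.1 20.1
3  631.4 21.5
4  631.8 20.1
5  631.9 20.4
6  632.0 22.4
7  632.0 22.9
8  638.5 19.1
9  638.7 20.2
10 639.3 19.7
end

* show what was entered
list
```

With input, the variables are named as they are created, so no renaming of var1, var2, and var3 is needed afterwards.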

After entering the data, double-click the grey box at the top of the first column (the box
directly above the blue one in the above picture). This will cause the following box to
appear:

In the Name box, replace var1 (school) with the name of the first column variable, here obs.
Do a similar operation for the second column, that is, rename var2 as testscr. In the Label box,
you may want to enter information that helps you remember how the data was created
originally or as information for others who may subsequently work with your data. I suggest
you enter here

Avg test score (=(read_scr+math_scr)/2)

Similarly you could enter for the third variable str

Student teacher ratio (enrl_tot/teachers)
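
The renaming and labeling steps can also be typed as commands. The sketch below assumes three columns that still carry the default names var1, var2, and var3; the two placeholder rows exist only so that the example runs on its own.

```stata
* Sketch: rename and label variables without using the dialog box
clear
set obs 2
generate var1 = _n
generate var2 = 631.1
generate var3 = 20.1

* give the columns meaningful names
rename var1 obs
rename var2 testscr
rename var3 str

* attach descriptive labels (shown by describe and used in graphs)
label variable obs "School District No."
label variable testscr "Avg test score (=(read_scr+math_scr)/2)"
label variable str "Student teacher ratio"

describe
```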

After completing this task, the Data Editor screen should look as follows:

Next close the box. Note that your commands to edit the data now appear in the Results Box,
your command to edit is listed in the Command Box, and your newly created variables are
shown in the variable list on the upper right-hand side:

Entering data in this way is very tedious, and you will make data input errors frequently. You
will see below how to enter data directly from a spreadsheet or an ASCII file, which are the
most common forms of data you will receive in the future.

In general, you can look at variables that already exist by typing in the command

list varname1 varname2 …

where varnamei refers to a variable that exists in your workfile (the names are separated by
spaces, not commas). Try it here by typing

list testscr str

This command will list, one screen at a time, the data on the variables for every observation in
the data set. (Missing values are denoted by a period or “.” in STATA.) Later on, you will
work with large data sets, and you will probably not want to see all observations. You can
imagine how long this may take with 5,000 observations or more. Failing to look at the data
observation by observation of course takes away the ability to spot errors in the data set,
perhaps generated by others during data entry. However, there are other methods to spot such
problems such as summarizing the data.

You can always stop the listing by hitting the break button on the toolbar (it looks like a red
pentagon with a white “x” in the middle). This button can be used to stop the execution of any
command in STATA.

You should see the following:

b) Summary Statistics

For the moment, let’s just see if we are working with the same data set. Type in the following
command

sum testscr str, detail

sum stands for “summarize” and the option detail gives you a more extensive list of summary
statistics for each of the variables you have entered. These include the median and certain
percentiles of the frequency distribution. You will learn later that you can also obtain summary
statistics for a subset of your data by adding an if or in qualifier following the variable name.

The summary statistics are explained in Chapter 2 of your textbook (for example, Kurtosis is
defined in equation (2.15) on page 22 in Stock and Watson (2018)).

If your summary statistics differ, then check the data again. To return to the data observations,
edit the data using the Data Editor or simply return to the other open window in the STATA
program. Once you have located the data problem, click on the observation and change it.
After correcting the problem, press the preserve button again.

Once you have entered the data, there are various things you can do with it. You may want to
keep a hard copy of what you just entered. If so, click on the Print button. This will print the
entire output of what you have produced so far.

In general, it is a good idea to save the data and your work frequently in some form. Many of
us have learned through painful experiences how easy it is to lose hours of work by not
backing up data/results in some fashion. To save the data set you created, either press the Save
button or click on File and then Save As. Follow the usual Windows format for saving files
(drives, directories, file type, etc.). If you save datasets in STATA readable format, then you
should use the extension “.dta.” Once you have saved your work, you can call it up the next
time you intend to use it by clicking on File and then Open. Try these operations by saving the
current workfile under the name “SW14smpl.dta.”
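
Saving and reopening can also be done from the Command Window. In this sketch the file name SW14smpl_demo.dta is hypothetical, chosen so that the example does not overwrite your own file, and only three of the ten rows are typed in to keep it short.

```stata
* Sketch: save a data set in STATA format and load it again
clear
input testscr str
606.8 19.5
631.1 20.1
631.4 21.5
end

* write the data in memory to the current working directory
save "SW14smpl_demo.dta", replace

* empty the memory, then read the file back in
clear
use "SW14smpl_demo.dta"
describe
```

The replace option allows save to overwrite an existing file of the same name, so use it with care.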

-9-
c) Grap
phical Preseentations

Most often
o it is a good
g idea to generate graaphs (“picturres”) to get ssome “feel” for the data.. You
will be able to deteect outliers which
w may be the result oof data entryy errors or yoou will be abble to
see if th
he data “mak kes sense.” Although
A STTATA offerss many graphhing optionss, we will onnly go
through h a few comm monly used ones here.1 There
T are thhree graphs thhat you will use most offten:

 histograms or bar chartss;


 line graphs,, where one or more varriables are pplotted acrosss entities (thhese will become
more imporrtant in time series analyysis when yoou are plottinng variables over time);
 scatterplots (crossplots)), where one variable is ggraphed agaainst another.

The purpose of histograms is to display absolute or relative frequencies for a single variable. In
general, the command is

histogram varname, percent title( )

The ‘percent’ option produces relative frequencies, and the title option adds whatever name
you place between ( ) to the top of the graph. You can either save the graph you have
generated, or copy and paste it into another Windows based document, such as Word
(replacing ‘percent’ with ‘frequency’ would have resulted in absolute, rather than relative,
frequencies to be plotted; there are other options for you to explore, such as the number of
classes (“bins”) to choose, etc.).

Try

histogram testscr, percent title(Testscores)
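
The other options mentioned above can be tried in the same way. The following is a sketch with arbitrary choices (five bins, or bins of width 10 starting at 600), not settings recommended by the textbook.

```stata
* Sketch: histogram variations on the ten test scores
clear
input testscr
606.8
631.1
631.4
631.8
631.9
632.0
632.0
638.5
638.7
639.3
end

* absolute frequencies, with the number of bins chosen by hand
histogram testscr, frequency bin(5) title("Test scores")

* relative frequencies, with bin width and starting point chosen by hand
histogram testscr, percent width(10) start(600) title("Test scores")
```

Each histogram command replaces the graph currently shown in the Graph window.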

1 I found the following STATA site particularly useful for graphs:
http://www.stata.com/support/faqs/graphics/gph/statagraphs.html

To create a line graph in a cross section, you can add a third variable in your data set which
takes on the number of the observation (here: 1, 2, 3, …, 10). Name it “obs” and label it
“School District No.” Let’s plot the student-teacher ratio for the first 10 observations using the
scatter command. The command is followed by the two variables you would like to see
plotted, where the first one appears on the Y axis and the second on the X axis.

scatter varname1 varname2

plots variable 1 against variable 2. Try this with the student-teacher ratio and the just created
variable obs.

scatter str obs

The resulting graph just gives you the data points here.

There are two ways to make this more informative. One is to connect the points by using the
line command followed by the two variable names. Alternatively you can use the twoway
connected command to have both the points and the lines displayed. Try both here:

line str obs
twoway connected str obs
he graph app
After th pears, you can edit it using the Grapph Editor (eeither use File and then Start
Graph h Editor or push the Grraph Editor button). Allter the grapph until it loooks like thee one
below.

- 11 -
Let mee help you geetting startedd and then you
y do the reest. We willl begin withh the x-axis. You
can ediit specific ax
xis labels or numbers byy first clickinng what youu would like to change. Click
on the x-axis
x and a red box shoould surrounnd the numbeers. Then cllick the varioous options, such
as tick numbers, labbels, and griid lines.

Some of
o the alternaations can bee made in thee resulting ddialog boxes
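
Many of these alterations can also be typed as options of the graph command, which makes the graph reproducible. The sketch below is one possible set of choices; the titles and the tick spacing are assumptions for illustration, and the /// line continuation only works inside a Do-File.

```stata
* Sketch: a connected line graph with axis options set on the command line
clear
input testscr str
606.8 19.5
631.1 20.1
631.4 21.5
631.8 20.1
631.9 20.4
632.0 22.4
632.0 22.9
638.5 19.1
638.7 20.2
639.3 19.7
end

* _n is the number of the current observation
generate obs = _n
label variable obs "School District No."

twoway connected str obs,                               ///
    xlabel(1(1)10)                                      ///
    xtitle("School District")                           ///
    ytitle("Student teacher ratio")                     ///
    title("Student-Teacher Ratio Across 10 School Districts")
```

The option xlabel(1(1)10) places a tick at every district number from 1 to 10.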

Frequently you will be interested either in causal relationships between variables or in the
ability of one variable to forecast another. As a result, it is a good idea to plot two variables in
the same graph.

The first way to look for a relationship is to plot the observations of both variables. This can be
done by generalizing the command twoway connected to include more than two variable names
(one for the Y axis and one for the X axis). Try this here with

twoway connected str testscr obs

The resulting graph is pretty uninformative, since test scores and student-teacher ratios are on a
different scale. You can allow for two (or more) scales by entering the following command
(the character inside c( ) is the lowercase letter l, short for connect(l), which joins the points
with lines):

twoway (scatter str obs, c(l) yaxis(1)) (scatter testscr obs, c(l) yaxis(2))

This command instructs STATA to use two Y axes, one for the student-teacher ratio on the left
side of the graph, and the other for test scores on the right side of the graph. You may want to
“beautify” the resulting graph by using the graph editor. See if you can produce something like
the graph below:

[Graph 2: Test Scores and Student-Teacher Ratio Across 10 School Districts. Left Y axis:
Student teacher ratio (18 to 24); right Y axis: Avg test score (600 to 640); X axis: School
District (1 to 10). Series: Student-Teacher Ratio, Avg Test Score.]

To get an even better idea about the relationship, you can display a two-dimensional
relationship in a scatterplot (see page 85-6 of your Stock and Watson (2018) textbook). Given
our discussion above, you could simply use the command scatter testscr str. However, you
may want to see what a fitted line through that scatter plot would look like, in which case you
have to modify the command slightly:

scatter testscr str || lfit testscr str

where ‘||’ is the key ‘|’ typed twice.

This will result in the following graph (after beautification):

[Graph 3: Scatterplot of Test Scores vs Student-Teacher Ratio. Y axis: Test Scores (600 to
640); X axis: Student-Teacher Ratio (19 to 23); line: Fitted values.]

(Not to worry about the positive slope here. Remember, this is a sample, and a very small one
at that. After all, you may get 10 heads in 10 flips of a coin.)

d) Simple Regression

There is a commonly held belief among many parents that lower student-teacher ratios will
result in better student performance. Consequently, in California, for example, all K-3 classes
were reduced to a maximum student-teacher ratio of 20 (“Class Size Reduction Act” – CSR) in
the late ‘90s. This comes at a cost, of course. Initially, it was $1.8 billion a year. At such a high
cost, the natural question arises whether or not it is worth it. That is why you are analyzing the
effect of reducing student-teacher ratios in Chapters 4-9 of the Stock and Watson textbook.

For the 10 school districts in our sample, we seem to have found a positive relationship
between class size and student performance. Not to worry – we will soon work with
all 420 observations from the California School Data Set, and we will then find the negative
relationship you have seen in the textbook – for now, we are more concerned about learning
techniques in STATA.

In the previous section, we included a regression line in the scatterplot, something that you
should have encountered towards the end of your statistics course. However, the graph of the
regression line does not allow you to make quantitative statements about the relationship; you
want to know the exact values of the slope and the intercept. For example, in general
applications, you may want to predict the effect of an increase by one in the explanatory
variable (here the student-teacher ratio) on the dependent variable (here the test scores).

To answer the questions relating to the more precise nature of the relationship between class
size and student performance, you need to estimate the regression intercept and slope. A
regression line is little else than fitting a line through the observations in the scatterplot
according to some principle. You could, for example, draw a line from the test score for the
lowest student-teacher ratio to the test score for the highest student-teacher ratio, ignoring all
the observations in between. Or you could sort the data by student-teacher ratio and split the
sample in half so that the five observations with the lowest student-teacher ratios are in one set,
and the five observations with the highest student-teacher ratios are in the other set. For each of
the two sets you could calculate the average student-teacher ratio and the corresponding
average test score, and then connect the two resulting points. Or you could just eyeball the
relationship. Some of these principles have better properties than others to infer the true
underlying (population) relationship from the given sample. The principle of ordinary least
squares (OLS), for example, will give you desirable properties under certain restrictive
assumptions that are discussed in Chapter 4 of the Stock/Watson textbook.

Back to computing. If the dependent variable, Y, is only determined by a single explanatory
variable X in a linear fashion of the type

$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, 2, \ldots, N$

with “u” representing the error, or random disturbance, not accounted for by the linear
equation, then the task is to find a value for $\beta_0$ and $\beta_1$. If you had values for these
coefficients, then $\beta_1$ describes the effect of a unit increase in X on Y. Often a regression line is
a linear approximation to an underlying relationship and the intercept $\beta_0$ only has a useful
meaning if observations around X=0 occur in the data. As we have seen in the scatterplot
above, there are no observations around the student-teacher ratio of zero, and it is therefore
better not to interpret the numerical value of the intercept at all. Your professor most likely will
give you a serious penalty in the exam for interpreting the intercept here because with no
students present, there is no score to record. (What would be the function of the teacher in that
case?)
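
For reference, the least squares principle mentioned earlier can be written out as follows. These are the standard OLS formulas for the single-regressor model; the symbols $b_0$ and $b_1$ for trial values and the hats for the estimates are notation assumed here, so check them against Chapter 4 of your textbook.

```latex
% OLS chooses the intercept and slope that minimize the sum of squared mistakes
\min_{b_0,\, b_1} \; \sum_{i=1}^{N} \left( Y_i - b_0 - b_1 X_i \right)^2

% The minimizing values are
\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}
                     {\sum_{i=1}^{N} (X_i - \bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
```

The second formula shows that the fitted line always passes through the point of sample means, $(\bar{X}, \bar{Y})$.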

There are various ways to estimate the regression line. The command for regressing a variable
Y on a constant (intercept) and another variable X is:

reg Y X

where “reg” stands for least squares regression. For the current application, type

reg testscr str, r

where the “r” following the comma indicates that you are using heteroskedasticity-robust
standard errors (even though you have not requested an intercept to be included, STATA will
automatically do so. There is an option for you to suppress the intercept, but you will most
likely never use it).

The output appears as follows:

According to these results, lowering the student-teacher ratio by one student per class results in
a decrease of 0.6 points, on average, in the district wide test score. Using the notation of your
textbook, you should display the results as follows:

$\widehat{TestScore} = \underset{(51.1)}{618.9} + \underset{(2.33)}{0.61} \times STR, \quad R^2 = 0.007, \quad SER = 9.8$
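
The numbers in this equation can be retrieved from STATA's stored results after the regression, rather than copied from the screen. The sketch below reruns the regression on the ten observations and displays the pieces; the wording of the display lines is of course arbitrary.

```stata
* Sketch: retrieve coefficients and fit statistics after a regression
clear
input testscr str
606.8 19.5
631.1 20.1
631.4 21.5
631.8 20.1
631.9 20.4
632.0 22.4
632.0 22.9
638.5 19.1
638.7 20.2
639.3 19.7
end

regress testscr str, robust

* _b[] holds the coefficients and _se[] their standard errors
display "intercept = " _b[_cons] ", standard error = " _se[_cons]
display "slope     = " _b[str]   ", standard error = " _se[str]

* e(r2) is the regression R-squared and e(rmse) the SER
display "R-squared = " e(r2) ", SER = " e(rmse)

* t-statistic for the slope, computed by hand
display "t-statistic for str = " _b[str]/_se[str]
```

Typing ereturn list after any estimation command shows everything that has been stored.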

Note that the result for the 10 chosen school districts is quite different from the sample of all
420 school districts. However, this is a rather small sample and the regression $R^2$ is quite low.
As a matter of fact, in Chapter 5 of your textbook, you will learn that the above slope
coefficient is not statistically significant.

e) Entering Data from a Spreadsheet

So far you entered data manually. Most often you will work with larger data sets that are
external to the STATA program, i.e., they will not be included in, or be part of, the program
itself. This makes sense as data sets either become very large or are generated by another
program, such as a spreadsheet.

Stock and Watson present the California Test Score Data Set in Chapter 4 of the textbook.
Locate the corresponding Excel file caschool.xlsx on the accompanying web site (where you
found this tutorial) and open it. Highlight all data and copy it. Next, start STATA and type the
word “edit” into the command line. This will open the Data Editor. Make sure to select the
grey box to the immediate right of “1” before pasting. Now paste the data into the new data
editor, choosing the option “Treat First Row as Variable Names.” Note that STATA has now
conveniently included the name of the variables in the Data Editor.

This is what you should see in STATA:

When you are done, you are ready to save the file. Name it caschool.dta.

You can now reproduce Equation (4.11) from the textbook. Use the regression command you
previously learned to generate the following output.

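. reg testscr str, r

Linear regression                               Number of obs     =        420
                                                F(1, 418)         =      19.26
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0512
                                                Root MSE          =     18.581

------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         str |  -2.279808   .5194892    -4.39   0.000    -3.300945   -1.258671
       _cons |    698.933   10.36436    67.44   0.000     678.5602    719.3057
------------------------------------------------------------------------------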

(You can find the standard errors and the t-statistic on p. 139 of the Stock and Watson (2018)
textbook. The regression $R^2$, sum of squared residuals (SSR), and standard error of the
regression (SER) are presented in Section 4.3.)

f) Importing Data Files directly into STATA

Excel (Spreadsheet) Files

Even though the cut and paste method seemed straightforward enough, there is a second, more
direct way to import data into STATA from Excel, which does not involve copying and pasting
data points.

Start again with a new STATA file. In general, make sure your data is organized with the
variable names in Row 1 of your spreadsheet with each column representing a different
variable, and the observations in the rows beneath the variable names. Then, save your data set
in Excel (or an alternative spreadsheet program) as a .csv file, specifically “CSV (comma
delimited)”; CSV stands for comma separated values. Next, type the following command into
the command window in STATA:

insheet using filename

where filename is the directory location of your file. (To find this, locate the file and right-
click, selecting the Properties button.) You must add the file name at the end of the directory
location, preceded by a backslash; example: C:\Econometrics\StockWatson\caschool.csv. If
your filename has any spaces or any symbol that appears on the number keys of the keyboard,
then you should put quotation marks around your filename. STATA reads spaces as denoting
separations between words, and therefore will only read the filename up until the first space or
symbol, and then considers the rest to be a separate command.

NOTE: In order to insheet data, there must be no data already stored in memory. To get rid of
any data that is already stored, type the command

clear

before “insheeting.”
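
The steps above can be sketched as follows. The path is the example path given earlier and should be replaced with your own. The second and third variants are alternatives that this tutorial does not otherwise use: insheet accepts a clear option, and import delimited is the newer command for delimited text files (available since STATA 13).

```stata
* Variant 1: clear memory first, then read the .csv file
clear
insheet using "C:\Econometrics\StockWatson\caschool.csv"

* Variant 2: let insheet clear memory itself
insheet using "C:\Econometrics\StockWatson\caschool.csv", clear

* Variant 3: the newer import delimited command
import delimited using "C:\Econometrics\StockWatson\caschool.csv", clear

* Confirm that the variables were read in
describe
```

Only one of the three variants is needed; they are shown together for comparison.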

Once you have insheeted your data, you should see this reflected in your Results box and your
variables should appear in your Variables List box. You can type edit to see your data in the
data editor.

To save your data as a STATA file, click on File on the upper toolbar, then select Save As.
When you save your file, make sure it is saved as a .dta file. This type of file can only be
opened in STATA. While there are alternative methods, this one is the most straightforward.

Note: When you save a STATA dataset, you are really only saving the dataset as it exists at the
time you chose to save. You are not retaining any of the analysis you may have conducted,
although if you have changed the data since opening the file, these changes will be reflected.
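
The same save can also be done from the Command Window. A minimal sketch, assuming a hypothetical folder C:\statafiles that already exists on your computer:

```stata
* Save the data currently in memory as a STATA dataset (.dta).
* The folder C:\statafiles is an assumption; use your own location.
save "C:\statafiles\caschool.dta", replace
```

The replace option overwrites an existing file of the same name. Leave it out if you would rather have STATA stop with an error than overwrite a file.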

As an exercise, copy the caschool.xls or caschool.xlsx data file from the Stock and Watson
website and save the Excel file in some subdirectory on your computer as a .csv file. Then
import the data set using the insheet command. Finally run the simple regression of testscr on
str and check that your output contains 420 observations and corresponds to the STATA
regression output in the previous section.

ASCII data

You can also import data from an ASCII file (text file). This assumes that you either saved data
from a different source as an ASCII file or that you received data in ASCII file format. The file
must be organized with one observation in each row, and the variables in the data set must be
in separate columns.

Using the infile command, type the name of the variable that represents each column, followed
by the file name. For example, consider an ASCII dataset that looks as follows:

10.75 12 6 1 0
16.50 16 3 0 0

…..

12.10 12 8 1 1

and which you want to import into STATA. Each row corresponds to observations on an entity
(here an individual). The first column above is the hourly wage, the second is years of
education, the third is potential experience, the fourth is a binary variable which equals one if
the individual belongs to a union and is zero otherwise, and the last column is another binary
variable which takes on the value of one if the individual is married and is zero otherwise.

To import the data, you type the following command:

infile var1 var2 var3 var4 var5 using location/filename


If you do this correctly, STATA will display “15 observations read”.
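
For the five-column example above, a concrete version of the command might look as follows. The variable names and the file location are hypothetical, chosen only to match the column descriptions.

```stata
* Read a free-format ASCII file with five numeric columns.
* Variable names and path are illustrative assumptions.
clear
infile wage educ exper union married using "C:\statafiles\wages.txt"

* Check that the columns were read as intended
list in 1/5
summarize
```

One name is listed per column. If you list fewer names than there are columns, STATA still reads the numbers in order and assigns them to the wrong variables.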

STATA dataset

Data files that have been saved in STATA format carry the extension .dta.

To open a dataset that is already saved as a .dta file, you can either go to File and then Open to
select your dataset, or you can type the command

use (directory location\filename.dta)

This will open your dataset in STATA. If you type only the file name instead of the full
directory location, you must first change your working directory to the location on your
computer where the data file is stored. The command to change the working directory is

cd C:\(location)

Here are two tricks that will be of help down the road.

(i) If you are not sure how to type in the location of your data file, just right-click on
your ‘Start’ button and select ‘Explore.’ Then find your data set. Next right-click on
the data set and choose ‘Properties.’ A new window opens up. Copy the ‘Location.’
Return to the Command Window in STATA, type ‘use "’ and then paste the
location. Add ‘\’ and the name of the file, including the extension, followed by a
closing quotation mark. Then finish the command with a ‘, clear’.

Here is an example from my computer:

use "C:\ClaremontLectures\ECON125\STATA\baseb.dta", clear

(ii) The ‘clear’ option is very important. It erases previous data, if there was any,
from memory. I, and others, have wasted time trying to find errors in programming
that were caused simply by not clearing memory. Even if you don’t understand the reason,
the advice is always to include ‘clear’ when you read in a new data set.

You can try doing this with the caschool.dta data set from the Stock and Watson website.
Simply save that data set on your computer, then double click on it. This will open STATA
with the data loaded already. Obviously this is the easiest method to import data into STATA.

Regardless of which method you use to import data, it is always a good idea to inspect the
data to check if there are some abnormalities. To do this, click on the ‘Data Editor
(Browse)’ button below the drop down menus.
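
The same inspection can be started from the Command Window. The following is only a sketch of a few quick checks, assuming caschool.dta is in memory:

```stata
* Open the Data Editor in browse (read-only) mode
browse

* Quick numerical checks for abnormalities
describe
summarize

* Count observations with a missing test score
count if missing(testscr)
```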

g) Multiple Regression Model

Economic theory most often suggests that the behavior of a certain variable is influenced not
only by a single variable, but by a multitude of factors. The demand for a product, e.g. LA
Lakers tickets, depends not only on the price of the product but also on the price of other goods,
income, taste, etc. Similarly, the Phillips curve suggests that inflation depends not only on the
unemployment rate, but also on inflationary expectations and possibly supply shocks, etc.

An extension of the simple regression model is the multiple regression model, which
incorporates more than one regressor (see Equation (6.7) in the textbook on page 177).

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i, \quad i = 1, \dots, n.$$

To estimate the coefficients of the multiple regression model, you proceed in a similar way as
in the simple regression model. The difference is that you now need to list the additional
explanatory variables. In general, the command is:

reg Y X1 X2 … Xk, (options)

where (options) can be omitted (this is the default and gives you homoskedasticity-only
standard errors) or can be replaced by various possible entries (e.g. “r” for
heteroskedasticity-robust standard errors).

Let’s continue to work with the caschool data set that we used for the simple regression. See if
you can reproduce the following regression output, which corresponds to Column 5 in Table
7.1 of the Stock and Watson (2018) textbook (page 224). The option used below is (r) to
produce heteroskedasticity-robust standard errors (STATA refers to these as “Robust Standard
Errors”).

. reg testscr str el_pct meal_pct calw_pct, r

Linear regression                               Number of obs     =        420
                                                F( 4, 415)        =     361.68
                                                Prob > F          =     0.0000
                                                R-squared         =     0.7749
                                                Root MSE          =     9.0843

------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         str |  -1.014353   .2688613    -3.77   0.000    -1.542853   -.4858534
      el_pct |  -.1298219   .0362579    -3.58   0.000     -.201094   -.0585498
    meal_pct |  -.5286191   .0381167   -13.87   0.000    -.6035449   -.4536932
    calw_pct |  -.0478537   .0586541    -0.82   0.415    -.1631498    .0674424
       _cons |   700.3918   5.537418   126.48   0.000      689.507    711.2767
------------------------------------------------------------------------------

The interpretation of the coefficients is equivalent to that of a controlled science experiment:
each coefficient indicates the effect of a unit change in the relevant variable on the dependent
variable, holding all other factors constant (“ceteris paribus”).

Section 7.2 of the Stock and Watson (2018) textbook discusses the F-statistic for testing
restrictions involving multiple coefficients, the so-called Wald test. To test whether all of the
above coefficients are zero with the exception of the intercept, you can use the test command
followed by each restriction that you want to test in parentheses (STATA uses the name of the
variable associated with the coefficient in combination with the restriction). Type

test (str=0) (el_pct=0) (meal_pct=0) (calw_pct=0)

STATA will generate the following output:


. test (str=0) (el_pct=0) (meal_pct=0) (calw_pct=0)

( 1) str = 0
( 2) el_pct = 0
( 3) meal_pct = 0
( 4) calw_pct = 0

F( 4, 415) = 361.68
Prob > F = 0.0000

Note that this F-statistic is identical to the one listed in the regression output.

See if you can generate the F-statistic of 5.43 following Equation (7.12) in the Stock and
Watson (2018) text and listed at the bottom of page 226 (restrict the coefficients of STR and
Expn to be zero).
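
One possible way to approach this exercise is sketched below. It assumes that the regression in Equation (7.12) uses the student teacher ratio, expenditures per student, and the percentage of English learners, and that these correspond to the variables str, expn_stu and el_pct in caschool.dta. Check both assumptions against the textbook before relying on the result.

```stata
* Unrestricted regression (assumed specification, see the text above)
reg testscr str expn_stu el_pct, r

* Joint test that the coefficients on str and expn_stu are both zero
test (str=0) (expn_stu=0)

* Equivalent shorthand for testing that a list of coefficients is zero
testparm str expn_stu
```

Rescaling a regressor, for example measuring expenditures in thousands of dollars rather than dollars, does not change the F-statistic for these restrictions.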

h) Data Transformations

So far, we have only used data in regressions that already existed in some file that we either
created or used. Almost always, you will be required to transform some of the raw data that
you received before you run a regression. In STATA you transform variables by using the
“gen” (as in generate) command. For example, Chapter 8 of the Stock/Watson textbook
introduces the polynomial regression model, logarithms, and interactions between variables.
Let us reproduce Equations (8.2), (8.11), (8.18), and (8.37) here. The following commands
generate the necessary variables²:

gen avginc2=avginc^2
Stock and Watson call this Income²
gen avginc3=avginc^3
Stock and Watson call this Income³
gen lavginc=log(avginc)
Stock and Watson call this ln(Income)
gen ltestscr=log(testscr)
Stock and Watson call this ln(TestScores)
gen strpctel=str*el_pct
Stock and Watson call this STR×PctEL

² For example, I have generated a variable called “avginc2”, and assigned it to be the square of the previously
defined variable “avginc”. Note that I am generating variable names that are self-explanatory. They could have
been called “variable1”, “variable2”, “variable3”, etc. but it is a good idea to create variable names that you can
remember.

Note how the commands and generated variables are displayed in STATA, including those in
red when you make a mistake in the command (e.g. genr instead of gen).

Next run the four regressions corresponding to the four equations using the same technique as
for multiple regression analysis. Finally save your workfile again and exit STATA.
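
The regressions can be run with the variables just generated. The sketch below assumes that the four equations are the quadratic, cubic, linear-log and interaction specifications. That mapping is an assumption and should be checked against Chapter 8 of the textbook. The file name in the save command is hypothetical.

```stata
* Quadratic in income
reg testscr avginc avginc2, r

* Cubic in income
reg testscr avginc avginc2 avginc3, r

* Linear-log: test scores on the logarithm of income
reg testscr lavginc, r

* Interaction of the student teacher ratio with the percentage of English learners
reg testscr str el_pct strpctel, r

* Save the data, including the new variables (path is an assumption)
save "C:\statafiles\caschool_ch8.dta", replace
```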

Exercise

One of the problems with the type of tutorial you are working on is that you just follow
instructions without internalizing them. A typical student will finish the tutorial with few
problems but then little is retained. If I asked you to retrieve a data set and to run a few
regressions, for example, would you be able to do that? Or would you say “how do I do this?”

Let’s see how much you understood. Go to the Stock and Watson website for the 4th edition
(http://www.pearsonhighered.com/stock_watson). Click on the Companion Website, go to
Student Resources. Go to the Data Sets for Replicating Empirical Results, and download the
CPS data set for Chapter 8 (CPS2015 Data (STATA Dataset)). Next open STATA.

Then replicate the results for column (1) from Table 8.1 on page 263 of the Stock and Watson
(2018) textbook. Why do you think your results differ from those listed in the table? What if
you found a way to restrict your sample to only include individuals who are at least 30 but not
older than 64? To find a way to restrict your sample, look up the if qualifier in Help. Then,
restricting your sample to those individuals in that age group, replicate columns (1) to (3). For
column (4), define potential experience as the Mincer experience variable (age – Years of
education – 6).
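
As a hint on syntax only, the if qualifier is placed after the variable list and before the comma. The variable names ahe, yrseduc and age below are assumptions about the CPS data set. Run describe first to see the actual names. The regression shown illustrates the qualifier and is not claimed to match any column of the table.

```stata
* Check the actual variable names in the data set
describe

* Restrict the estimation sample with the if qualifier
reg ahe yrseduc if age>=30 & age<=64, r

* The same restriction written with the inrange() function
reg ahe yrseduc if inrange(age, 30, 64), r

* Potential (Mincer) experience as defined in the exercise
gen exper = age - yrseduc - 6
```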

Batch Files

So far, you have either clicked on buttons in STATA or used the “Command Window” to type
executable statements (commands one by one, or line by line). But what if you wanted to keep
a permanent record of all the transformations you made, regressions you tried, graphs you
created, etc.? In that case, you would need to create a “program” that consists of a list of line
commands similar to those that you used in the “Command Window” previously. After having
created such a program, which is a “text” or “ASCII” file, you can then execute (“run”) it and
view the output afterwards (if it did not contain any errors). Batch files can also include loops
and conditional branching (if you don’t know what these are, not to worry). Batch files in
STATA are called Do-Files.

Using STATA in batch mode has two important advantages over using STATA interactively:

• the Do-File provides an audit trail for your work. The file provides an exact record of
  each STATA command;
• even the best computer programmers will make typing or other errors when using
  STATA. When a command contains an error, it won’t be executed by STATA, or
  worse, it will be executed but produce the wrong result. Following an error, it is often
  necessary to start the analysis from the beginning. If you are using STATA
  interactively, you must retype all of the commands. If you are using a Do-File, then you
  only need to correct the command containing the error and rerun the file.

Let’s create such a program. Click on the New Do-File Editor button. This opens the “STATA
Do-File Editor” box. Type in the following commands exactly as they appear below (changing
lines 1 and 2 depending on where you saved files). Computers are “stupid” in that they
differentiate between upper and lowercase letters, and do not understand what you want them
to do if you use the wrong case. So make sure all commands are in lower case. Luckily, your
commands turn purple when you have typed a valid command.

log using \statafiles\stata1.log, replace
use \statafiles\caschool.dta
describe
generate income = avginc*1000
summarize income
log close
exit

Here is the meaning of the seven lines of this program:

Line 1: This is an administrative command that tells STATA where to display the results of
your analysis. STATA output files are called log files. The current line tells STATA to
open a log file called stata1.log (you could have used any name, such as
love_metrix.log, meaning, the word “stata1” is not required here). If there is already a
file with the same name in the folder, STATA is instructed to replace it. Before you
save the Do-File, replace the path in this line with the relevant path on the computer
you are using.

Line 2: This line concerns the data set. As you learned earlier in the tutorial, datasets in
STATA are called dta files. The dataset which you will use here is caschool.dta, which
you downloaded earlier. The current line tells STATA the location and name of the
dataset to be used for the analysis. Before you save the Do-File, replace the path in this
line with the relevant path of the location where you saved caschool.dta to.

Line 3: This line also concerns the data set. It tells STATA to “describe” the dataset (a shorter
version of the command is “des” instead of “describe”). This command produces a list
of the variable names and any variable descriptions stored in the data set.

Line 4: This line tells STATA to create a new variable called income (a shorter version of the
command is “gen” instead of “generate”). The new variable is constructed by
multiplying the variable avginc by 1000. The variable avginc is contained in the dataset
and is the average household income in a school district expressed in thousands of
dollars. The new variable income will be the average household income expressed in
dollars instead of thousands of dollars.

Line 5: This line tells STATA to compute some summary statistics (a shorter version of the
command is “sum” instead of “summarize”). STATA will produce the mean, standard
deviation, etc.

Line 6: This line closes the file stata1.log which contains the output.

Line 7: This line tells STATA that the program has ended.

As long as you have replaced the path in line 1 and line 2 with the relevant paths from the
computer you are working on, and if you downloaded/saved the California Test Score Data
Set, then we are good to go. Save the Do-File, using the .do suffix. Next execute this Do-File
by first opening STATA on your computer. Next, click on the File menu, then Do…, and then
select the stata1.do file you just saved. This will “run” or “execute” the program.
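
A Do-File can also be started from the Command Window with the do command. A minimal sketch, assuming the file was saved as stata1.do in the \statafiles folder used in the program:

```stata
* Run the Do-File by giving its full path (the folder is an assumption)
do "\statafiles\stata1.do"

* Alternatively, change to the folder first and then run the file;
* the .do extension may be omitted
cd "\statafiles"
do stata1
```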

You will be able to see the program being executed in the Results Window. Since the execution
will not fit into one screen, you can scroll up and see everything that happened during the
“run.” Sometimes (although not here) you may see that the program execution pauses, and that

“--more--”

is displayed at the bottom of the Results Window. If this happens, push any key on the
keyboard and execution will continue.

To exit STATA, click on the usual exit button at the top right of STATA (alternatively click on
File and then Exit.) STATA will ask you if you really want to exit, and you will respond Yes.

Your output has been saved in stata1.log and you can look at it by opening the file with any
text editor (Notepad, for example) or in Word/WordPerfect. Here is what you should see:

-------------------------------------------------------------------------------
      log:  your path here
 log type:  text
opened on:  your date and time here

. use \your path here

. describe

Contains data from \your path here\caschool.dta
  obs:           420
 vars:            18                          15 Dec 2010 07:57
 size:        60,060 (94.3% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
observat        float  %9.0g
dist_cod        float  %9.0g
county          str18  %18s
district        str53  %53s
gr_span         str8   %8s
enrl_tot        float  %9.0g
teachers        float  %9.0g
calw_pct        float  %9.0g
meal_pct        float  %9.0g
computer        float  %9.0g
testscr         float  %9.0g
comp_stu        float  %9.0g
expn_stu        float  %9.0g
str             float  %9.0g
avginc          float  %9.0g
el_pct          float  %9.0g
read_scr        float  %9.0g
math_scr        float  %9.0g
-------------------------------------------------------------------------------
Sorted by:

. generate income = avginc*1000

. summarize income

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       420    15316.59     7225.89       5335      55328

. log close
      log:  C:\your path here\stata1.log
 log type:  text
closed on:  your date and time here
-------------------------------------------------------------------------------
 
You now have an initial idea of how to work with Do-Files in STATA. The rest of this part of
the tutorial will guide you through further commands and make the initial Do-File more
complex. I suggest that you continue to work with the batch file you just created and add
new lines to this program (if you use the .pdf version of this tutorial or have printed
the tutorial using a color printer, then the new commands will appear in red).

#delimit ;
*********************************************************;
*Administrative Commands;
*********************************************************;
set more off;
clear;
log using \statafiles\stata1.log, replace;
*********************************************************;
*Read in the Dataset;
*********************************************************;
use \statafiles\caschool.dta;
des;
*********************************************************;
*Transform Data and Create New Variables;
*********************************************************;
***** Construct Average District Income in $s;
gen income = avginc*1000;
*********************************************************;
*Carry Out Statistical Analysis;
*********************************************************;
***** Summary Statistics for Income;
sum
income;
*********************************************************;
*End of Program;
*********************************************************;
log close;
exit;

The new version of the Do-File carries out exactly the same calculations as before. However, it
uses several features of STATA that are helpful for more complicated analysis. The first new
command is

#delimit ;

This command tells STATA that each STATA command ends with a semicolon. If STATA
does not see a semicolon at the end of the line, then it assumes that the command carries over
to the following line. This is useful because complicated commands in STATA are often too
long to fit on a single line. (Make sure to place a “;” at the end of the seven old commands.)
The above Do-File contains an example of a STATA command written on two lines: near the
bottom of the file you see the command sum income written on two lines. STATA combines
these two lines into one command because the first line does not end with a semicolon. While
two lines are not necessary for this command, some STATA commands can get quite long, so
it is good to get used to employing this feature.

A word of warning: if you use the #delimit ; command, it is critical that you end each
command with a semicolon. Forgetting the semicolon on even a single line means that the
Do-File will not run properly (again, don’t forget to add the seven “;” in the first version of
the program).
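
Two related devices may be useful here. Neither is used in this tutorial's own programs. The command #delimit cr switches back to the default, in which each line is one command, and /// at the end of a line continues a command on the next line without changing the delimiter. The sketch below is meant for a Do-File and assumes that caschool.dta is in memory and that income has already been generated.

```stata
#delimit ;
* With the semicolon delimiter a command may span several lines;
sum
  income;

#delimit cr
* Back to the default: one command per line
sum income

* Line continuation with /// under the default delimiter
reg testscr str el_pct ///
    meal_pct calw_pct, r
```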

The second new feature of the above Do-File is that many of the lines begin with an asterisk.
STATA ignores the text that comes after “*”, so that these lines can be used for comments or
to describe what the commands that follow are doing. Note that each of these lines ends with a
semicolon. Without the semicolon, STATA would include the next line as part of the text
description.

A final new feature in the program is the command

set more off

This command eliminates the need to hit a key on your keyboard in the case when STATA fills
the Results Window and stops displaying further results (the -- more -- would appear).

Run the program and check your results by looking at the resulting log file.

Next, change the previous version of the Do-File by adding commands until the new version
looks as follows (again, new commands can be seen in red if your tutorial displays colors):

#delimit ;
*********************************************************;
*Administrative Commands;
*********************************************************;
set more off;
clear;
log using \statafiles\stata1.log, replace;
*********************************************************;
*Read in the Dataset;
*********************************************************;
use \statafiles\caschool.dta;
des;
*********************************************************;
*Transform Data and Create New Variables;
*********************************************************;
***** Construct Average District Income in $s;
gen income = avginc*1000;
***** Define variables for subset of data;
gen testscr_lo = testscr if (str<20);
gen testscr_hi = testscr if (str>=20);
*********************************************************;
*Carry Out Statistical Analysis;
*********************************************************;
***** Summary Statistics for Income;
sum
income;
sum testscr;
ttest testscr=0;
ttest testscr_lo=0;
ttest testscr_hi=0;
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*Repeat the Analysis using STR = 19;
*********************************************************;
replace testscr_lo=testscr if (str<19);
replace testscr_hi=testscr if (str>=19);
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*End of Program;
*********************************************************;
log close;
exit;

There are three new features in this new version.

1) You should already be familiar with the command generate or gen. New variables are
created using only a portion of the dataset. Two of the variables in the dataset are testscr
(the average test score in a school district) and str (the district’s average class size or
student teacher ratio). The STATA command

gen testscr_lo = testscr if (str<20)

generates a new variable testscr_lo that is equal to testscr, but this variable is only
defined for districts that have an average class size of less than twenty students (that is,
for which str < 20).

The statement str<20 is an example of a “relational operation.” STATA uses several relational
operators:

<    less than
>    greater than
<=   less than or equal to
>=   greater than or equal to
==   equal to
~=   not equal to (can also be written !=)
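
Relational operators can be combined with & (and) and | (or). The lines below are a small sketch using caschool.dta. The cutoffs and the county name are illustrative assumptions only.

```stata
* Districts with a class size of at least 19 but less than 20
count if str>=19 & str<20

* Districts that are very small or very large on this measure
count if str<15 | str>25

* Comparisons with text use == and quotation marks
sum testscr if county=="Los Angeles"
```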

2) The ttest command constructs tests and confidence intervals for the mean of a
population or for the difference between two means (see Stock and Watson, 2018, pp. 71-85).
The command is used in two different ways in the program.

The first is

ttest testscr=0

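. ttest testscr=0

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 testscr |     420    654.1565    .9297082    19.05335    652.3291     655.984
------------------------------------------------------------------------------
    mean = mean(testscr)                                          t = 703.6149
Ho: mean = 0                                     degrees of freedom =      419

    Ha: mean < 0                 Ha: mean != 0                 Ha: mean > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000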

This command computes the sample mean and standard deviation of the variable testscr,
computes a t-test that the population mean is equal to zero, and computes a 95%
confidence interval for the population mean. (In this example, the t-test that the
population mean of test scores is equal to zero is not really of interest; the
confidence interval for the mean is what we are looking for.) The same
command is then used for testscr_lo and testscr_hi (see Sections 3.2 and 3.3 in Stock and
Watson (2018)).

The second form of the command is

ttest testscr_lo=testscr_hi, unequal unpaired

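. ttest testscr_lo=testscr_hi, unequal unpaired

Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
testsc~o |     238    657.3513    1.254794    19.35801    654.8793    659.8232
testsc~i |     182    649.9788    1.323379    17.85336    647.3676    652.5901
---------+--------------------------------------------------------------------
combined |     420    654.1565    .9297082    19.05335    652.3291     655.984
---------+--------------------------------------------------------------------
    diff |             7.37241    1.823689                3.787296    10.95752
------------------------------------------------------------------------------
    diff = mean(testscr_lo) - mean(testscr_hi)                    t =   4.0426
Ho: diff = 0                     Satterthwaite's degrees of freedom =  403.607

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0001          Pr(T > t) = 0.0000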

Executing this statement will test the hypothesis that testscr_lo and testscr_hi come from
populations with the same mean. That is, the command computes the t-statistic for the
null hypothesis that the (population) mean of test scores for districts with class sizes of less
than 20 students is the same as the mean of test scores for districts with class sizes
of 20 or more students. The command uses two “options” that are listed after the
comma in the command. These options are unequal and unpaired. The option unequal
tells STATA that the variances in the two populations may not be the same. The option
unpaired tells STATA that the observations are for different districts, that is, these are
not panel data representing the same entity at two different time periods (see Section 3.4
in Stock and Watson (2018)).
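
The same comparison can also be written with a grouping variable instead of two separate variables. This is an alternative form of ttest, not the one used in the program. The variable name small is hypothetical.

```stata
* Indicator for small classes: 1 if str<20, 0 otherwise
gen small = (str<20)

* Two-sample t test of testscr across the two groups, unequal variances
ttest testscr, by(small) unequal
```

With by(), STATA reports the difference as the mean of the group coded 0 minus the mean of the group coded 1, so the sign of the difference is reversed relative to the output above.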

3) A third new feature in the Do-File is the command replace. This appears near the
bottom of the file. Here, the analysis is to be carried out again, but using 19 as the cutoff
for small classes. Since the variables testscr_lo and testscr_hi already exist (they were
defined by the gen command earlier in the program), STATA cannot “generate” variables
with the same name. Instead, the command replace is used to replace the existing series
with new series. In essence, the command instructs the program to overwrite the
previously stored data.
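
One detail is worth checking when you rerun the analysis with the new cutoff. A replace command with an if qualifier changes only the observations that satisfy the condition. Districts with a student teacher ratio of at least 19 but less than 20 therefore keep the testscr_lo values they were given under the old cutoff, and they also receive a value in testscr_hi. The following sketch shows one way to reset the variables first. It is a suggestion, not part of the original program.

```stata
* Set both variables to missing, then refill them with the new cutoff
replace testscr_lo = .
replace testscr_hi = .
replace testscr_lo = testscr if (str<19)
replace testscr_hi = testscr if (str>=19)

* Check: the two counts should add up to the number of districts
count if !missing(testscr_lo)
count if !missing(testscr_hi)

ttest testscr_lo=testscr_hi, unequal unpaired
```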

You are now ready to execute (“run”) the program as done before.

As before, change the previous version of the Do-File by adding commands until the new
version looks as follows (again, new commands can be seen in red if your tutorial displays
colors):

#delimit ;
*********************************************************;
*Administrative Commands;
*********************************************************;
set more off;
clear;
log using \statafiles\stata1.log, replace;
*********************************************************;
*Read in the Dataset;
*********************************************************;
use \statafiles\caschool.dta;
des;
*********************************************************;
*Transform Data and Create New Variables;
*********************************************************;
***** Construct Average District Income in $s;
gen income = avginc*1000;
***** Define variables for subset of data;
gen testscr_lo = testscr if (str<20);
gen testscr_hi = testscr if (str>=20);
*********************************************************;
*Carry Out Statistical Analysis;
*********************************************************;
***** Summary Statistics for Income;
sum
income;
*********************************************************;
***** Table 4.1 *****;
*********************************************************;
sum str testscr, detail;
*********************************************************;
***** Figure 4.2 *****;
*********************************************************;
twoway scatter testscr str || lfit testscr str;
*********************************************************;
***** Correlation *****;
*********************************************************;
cor str testscr;
*********************************************************;
***** Equation 4.11 and 5.8 *****;
*********************************************************;
reg testscr str, robust;
*********************************************************;
***** Equation 5.18 *****;
gen d = (str<20);
reg testscr d, r;
*********************************************************;
sum testscr;
ttest testscr=0;
ttest testscr_lo=0;
ttest testscr_hi=0;
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*Repeat the Analysis using STR = 19;
*********************************************************;
replace testscr_lo=testscr if (str<19);
replace testscr_hi=testscr if (str>=19);
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*End of Program;
*********************************************************;
log close;
exit;

The new commands reproduce some of the empirical results shown in Chapters 4 and 5 of
Stock and Watson (2018). There are several features of STATA included in the new commands
which have not been used in the previous examples:

1) The summarize command (“sum”) now includes the option detail, which provides
more detailed summary statistics. The command is written as

sum str testscr, detail

This command tells STATA to compute summary statistics for the two variables str and
testscr. The option detail produces detailed summary statistics that include, for
example, the percentiles that are reported in Table 4.1 on p. 105 of Stock and Watson
(2018).
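
STATA also stores the results of the most recent summarize command so that they can be used in later commands. A brief sketch, assuming caschool.dta is in memory:

```stata
sum testscr, detail

* List everything that summarize stored in r()
return list

* Use individual stored results, e.g. the median and the interquartile range
display "median = " r(p50)
display "IQR    = " r(p75) - r(p25)
```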

2) The command

twoway scatter testscr str || lfit testscr str

constructs a scatterplot of testscr versus str and includes the estimated regression line for
the simple regression of the California Test Score Data Set, shown on p. 106 of Stock
and Watson (2018). In case you have difficulties finding the symbol “||”, it appears on
your keyboard above the backslash.

3) The command

cor str testscr

tells STATA to compute the correlation between the student teacher ratio and test
scores.

4) Next you will reproduce equations (4.11) and (5.8) in Stock and Watson (2018) by using
the regress (or short reg) command:

reg testscr str, r

instructs STATA to run an OLS regression with testscr as the dependent variable and str
as the regressor. The robust (short r) option tells STATA to use heteroskedasticity-
robust formulas for the standard errors of the regression coefficient estimators. Omitting
this option results in the display of homoskedasticity-only standard errors.
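
To see the effect of the option, the two sets of standard errors can be placed side by side. The sketch below uses the estimates store and estimates table commands. The names homosk and hetrob are arbitrary labels chosen for this illustration.

```stata
* Homoskedasticity-only standard errors
quietly reg testscr str
estimates store homosk

* Heteroskedasticity-robust standard errors
quietly reg testscr str, robust
estimates store hetrob

* Coefficients are identical; only the standard errors differ
estimates table homosk hetrob, b(%9.4f) se(%9.4f)
```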

5) The final innovation over the previous version of the Do-File is contained in the two
commands following the line Equation 5.18. First a binary (sometimes referred to as
dummy or indicator) variable “d” is created suing the STATA command

gen d = (str<20)

This variable is equal to 1 if the expression in parenthesis is true, that is, when the
student teacher ratio is less than 20. Otherwise it is equal to 0, in other words, when the
expression is false, or when the student teacher ratio  20. STATA allows you to use
any of the relational operators defined above. The final regression command tells
STATA to run a regression of test scores on the binary variable just created. The output
reproduces equation (5.18) on p. 146 of Stock and Watson (2018).
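
As a quick check of the binary-variable regression, the following sketch compares the coefficient on d with the difference between the two group means. It assumes that caschool.dta is stored in \statafiles, as in the Do-Files of this tutorial, and that the default line delimiter is in effect (no #delimit ; command). The scalar names mean0, mean1, and diff are chosen here for illustration only.

```stata
* Sketch only: assumes caschool.dta is stored in \statafiles
use \statafiles\caschool.dta, clear
gen d = (str<20)
* Count how many districts fall into each group
tabulate d
* Mean test score for districts with str >= 20
sum testscr if d==0
scalar mean0 = r(mean)
* Mean test score for districts with str < 20
sum testscr if d==1
scalar mean1 = r(mean)
scalar diff = mean1 - mean0
* The OLS coefficient on d should equal the difference in group means
reg testscr d, r
display "Difference in means = " diff
display "Coefficient on d    = " _b[d]
```

The two displayed numbers should agree up to rounding, because a regression on a single binary regressor estimates the difference between the two group means.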

Run the program now and look at the output in the log-file.

The upcoming Do-File will be the last program in this tutorial. Having understood all five
should give you a solid grounding in STATA programming. As before, there are several
commands added to the previous version of the Do-File. Add these commands to your older
version until the new version looks as follows (new commands can be seen in red if your
tutorial displays colors):

#delimit ;
*********************************************************;
*Administrative Commands;
*********************************************************;
set more off;
clear;
log using \statafiles\stata1.log, replace;
*********************************************************;
*Read in the Dataset;
*********************************************************;
use \statafiles\caschool.dta;
des;
*********************************************************;
*Transform Data and Create New Variables;
*********************************************************;
***** Construct Average District Income in $s;
gen income = avginc*1000;
***** Define variables for subset of data;
gen testscr_lo = testscr if (str<20);
gen testscr_hi = testscr if (str>=20);
*********************************************************;
*Carry Out Statistical Analysis;
*********************************************************;
***** Summary Statistics for Income;
sum income;
*********************************************************;
***** Table 4.1 *****;
*********************************************************;
sum str testscr, detail;
*********************************************************;
***** Figure 4.2 *****;
*********************************************************;
twoway scatter testscr str || lfit testscr str;
*********************************************************;
***** Correlation *****;
*********************************************************;
cor str testscr;
*********************************************************;
***** Equation 4.11 and 5.8 *****;
*********************************************************;
reg testscr str, r;
*********************************************************;
***** Equation 5.18 *****;
gen d = (str<20);
reg testscr d, r;
*********************************************************;
sum testscr;
ttest testscr=0;
ttest testscr_lo=0;
ttest testscr_hi=0;
ttest testscr_lo=testscr_hi, unequal unpaired;

*********************************************************;
*Repeat the Analysis using STR = 19;
*********************************************************;
replace testscr_lo=testscr if (str<19);
replace testscr_hi=testscr if (str>=19);
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
***** Table 6.1 *****;
*********************************************************;
gen str_20 = (str<20);
gen ts_lostr = testscr if str_20==1;
gen ts_histr = testscr if str_20==0;
gen elq1 = (el_pct<1.94);
gen elq2 = (el_pct>=1.94)*(el_pct<8.78);
gen elq3 = (el_pct>=8.78)*(el_pct<23.01);
gen elq4 = (el_pct>=23.01);
ttest ts_lostr=ts_histr, unp une;
ttest ts_lostr=ts_histr if elq1==1, unp une;
ttest ts_lostr=ts_histr if elq2==1, unp une;
ttest ts_lostr=ts_histr if elq3==1, unp une;
ttest ts_lostr=ts_histr if elq4==1, unp une;
*********************************************************;
***** Equation 7.5 *****;
*********************************************************;
reg testscr str el_pct, r;
*********************************************************;
***** Equation 7.6 *****;
*********************************************************;
replace expn_stu = expn_stu/1000;
reg testscr str expn_stu el_pct, r;
*********************************************************;
* Display Variance-Covariance Matrix;
*********************************************************;
vce;
*********************************************************;
***** F-test reported in text;
*********************************************************;
test str expn_stu;
*********************************************************;
***** Correlations reported in text;
*********************************************************;
cor testscr str expn_stu el_pct meal_pct calw_pct;
*********************************************************;
*****Table 7.1 *****;
*********************************************************;
* Column (1);
reg testscr str, r;
display "adjusted Rsquared = " e(r2_a);
* Column (2);
reg testscr str el_pct, r;
display "adjusted Rsquared = " e(r2_a);
* Column (3);
reg testscr str el_pct meal_pct, r;
display "adjusted Rsquared = " e(r2_a);
* Column (4);
reg testscr str el_pct calw_pct, r;
display "adjusted Rsquared = " e(r2_a);
* Column (5);
reg testscr str el_pct meal_pct calw_pct, r;
display "adjusted Rsquared = " e(r2_a);
*********************************************************;
***** Appendix – rule of thumb F-Statistic;
*********************************************************;
reg testscr str expn el_pct;
test str expn;
reg testscr el_pct;
*********************************************************;
*End of Program;
*********************************************************;
log close;
exit;

The file reproduces several of the empirical results from Chapter 7 of Stock and Watson (2018).
As before, STATA commands are abbreviated wherever there is no possibility of confusion
(generate becomes gen, regress becomes reg, etc.).

In essence there are two new commands:

1) The first new command involves the testing of restrictions in equation 7.6 (page 209 of
Stock and Watson (2018)). The command

reg testscr str expn_stu el_pct, r

instructs STATA to compute the regression. The command vce asks STATA to print out
the estimated variances and covariances of the estimated regression coefficients. The
command

test str expn_stu

gets STATA to carry out the joint test that the coefficients on str and expn_stu are both
equal to zero.

2) The second new command is in the analysis of Table 7.1 on page 224 of Stock and
Watson (2018). When STATA computes an OLS regression, it computes the adjusted
R-squared ($\bar{R}^2$) as described in Section 6.4, page 181 of Stock and Watson (2018).
However, STATA does not display all of the results it computes; when the robust option
is used, the adjusted R-squared is among the results that are not shown. The command

display "Adjusted Rsquared = " e(r2_a)

instructs STATA to print out (“display”) the adjusted R-squared. Whatever appears
between the two quotation marks (" ") will be displayed in your output (you did not
have to display the words Adjusted Rsquared but could have chosen anything else, such
as My Measure of Fit). However, e(r2_a) tells STATA where to retrieve the stored
result from and cannot be changed. The adjusted R-squared is not the only statistic that
STATA stores and does not display. You can use the Help function or look in the
Reference volume under Saved Results for the reg command to find other statistics.
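
To illustrate how stored results can be retrieved, the sketch below lists everything that reg has stored and then displays a few items by name. It assumes that caschool.dta is stored in \statafiles and that the default line delimiter is in effect. The scalar name r2a is hypothetical. The last lines recompute the adjusted R-squared from the R-squared, the number of observations, and the residual degrees of freedom, using the formula $\bar{R}^2 = 1 - (1 - R^2)(n-1)/(n-k-1)$.

```stata
* Sketch only: assumes caschool.dta is stored in \statafiles
use \statafiles\caschool.dta, clear
reg testscr str el_pct, r
* List every result that the regress command has stored
ereturn list
* Retrieve a few of the stored results by name
display "Observations      = " e(N)
display "R-squared         = " e(r2)
display "Adjusted Rsquared = " e(r2_a)
display "Root MSE (SER)    = " e(rmse)
* Recompute the adjusted R-squared by hand; e(df_r) holds n - k - 1
scalar r2a = 1 - (1 - e(r2))*(e(N) - 1)/e(df_r)
display "Adjusted (by hand) = " r2a
```

The hand-computed value should match e(r2_a) up to rounding.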

Other Examples of Do-Files

You will find other examples of Do-Files on the accompanying Web site for the Stock and
Watson (2018) econometrics textbook. You can download STATA Do-Files from there to
reproduce all of the analysis in Chapters 3-13. You will also find a STATA Do-File for the
time series chapters 15-17 there. STATA programming for time series is somewhat more
complicated than for cross-sectional or panel data. EViews and RATS are econometric
programs specifically designed for time series data, and the web site contains EViews and
RATS programs for Chapters 15-17, as well as a tutorial for EViews.

3. A SUMMARY OF SELECTED STATA COMMANDS

This section lists several of the most useful STATA commands. Many of these commands have
options. For example, the command summarize has the option detail and the command regress
has the option robust. In the descriptions below, options are shown in square brackets [ ]. Many
of these commands have several options and can be used in many different ways. The
descriptions below show how these commands are commonly used. Other uses and options can
be found in STATA’s Help menu and in the other sources listed at the beginning of this
tutorial.

The list of commands provided here is a small fraction of the commands in STATA, but these
are the important commands that you will need to get started for your econometrics course.
You should extend the list or create your own in addition to what is listed here.

Administrative Commands

#delimit
sets the character that marks the end of a command. For example, the command
#delimit ; tells STATA that all commands will end with a semicolon. This command is
used in Do-Files.

clear
deletes/erases all variables from the current STATA session.

exit
in a Do-File, the command tells STATA that the program has ended. If you type exit in
the STATA Command Window, then STATA will close.

log
controls STATA log files, which is where STATA writes output. There are two
common uses of this command:

• log using filename [, append replace]. This opens the file given by filename as a
log file for STATA output. The options append and replace are used when there
is already a file with the same name. With append, STATA will append the
output to the bottom of the existing file. With replace, STATA will replace the
existing file with the new output file.
• log close. This closes the current log file.

set matsize #
sets the maximum number of variables that can be used in a regression. The default
maximum is 40. If you have a huge number of observations and want to run a
regression with 45 variables, then you will need to use this command, where # is a
number greater than 45.

set memory #m
is used in Windows and Unix versions of STATA to set the amount of memory used by
the program. For details, see the discussion within the tutorial.

set more off
tells STATA not to pause and display the --more-- message in the Results Window.

Data Management

describe
describes the contents of data in memory or on disk. A related command is describe
using filename, which describes the dataset in filename.

drop list of variables
deletes/erases the variables in list of variables from the current STATA session.
For example, drop str testscr will delete the two variables str and testscr.

keep list of variables
deletes/erases all of the variables from the current STATA session except those in list of
variables. In other words, it keeps the variables in the list and drops everything else. For
example, keep str testscr will keep the two variables str and testscr and delete all of
the other variables in the current STATA session.

list list of variables
tells STATA to print all of the observations for the variables listed in list of variables.

save filename [, replace]
tells STATA to save the dataset that is currently in memory as a file with name
filename. The option replace tells STATA that it may replace any other file with the
name filename.

use filename
tells STATA to load a dataset from the file filename.

Transforming and Creating New Variables

New variables are created using the command generate, and existing variables are modified
using the command replace.

Examples:

generate newts = testscr/100

creates a new variable called newts that is constructed as the variable testscr divided by 100.

replace testscr = testscr/100

changes the variable testscr so that all observations are divided by 100.

You can use the standard arithmetic operations of addition (+), subtraction (-), multiplication
(*), division (/), and exponentiation (^) in generate/replace commands. For example,

generate ts_squared = testscr*testscr

creates a new variable ts_squared as the square of testscr. (This could also have been
accomplished by using the command gen ts_squared = testscr^2.)

You can also use relational operators to construct binary variables. For example, in the fourth
batch file, the following command was included:

gen d = (str<20);

This created the binary variable d that was equal to 1 when str<20 and was equal to 0
otherwise.

Standard functions can also be used. Three of the most useful are:

abs(x) computes the absolute value of x
exp(x) computes the exponential function of x (e raised to the power x)
ln(x) computes the natural logarithm of x

For example, the command

gen ln_ts = ln(testscr)

creates the variable ln_ts, which is equal to the natural logarithm of the variable testscr.

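Finally, the if qualifier can be used to restrict a command to a subset of observations. For example,

gen testscr_lo = testscr if (str<20);

creates a variable testscr_lo that is equal to testscr, but only for those observations for which
str<20. For all other observations, testscr_lo is set to missing.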
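
The effect of the if qualifier can be inspected with a short sketch such as the one below. It assumes that caschool.dta is stored in \statafiles and that the default line delimiter is in effect. The commands count and missing() are standard STATA; the example itself is only an illustration and is not part of the Do-Files above.

```stata
* Sketch only: assumes caschool.dta is stored in \statafiles
use \statafiles\caschool.dta, clear
gen testscr_lo = testscr if (str<20)
* Districts with str >= 20 receive a missing value (.) in testscr_lo
count if missing(testscr_lo)
* Commands such as sum skip missing values automatically
sum testscr_lo
* The same statistics can be obtained without creating a new variable
sum testscr if (str<20)
* No observation with str < 20 should be missing in testscr_lo
count if missing(testscr_lo) & str<20
```

The two sum commands should report the same number of observations and the same mean, since both use only the districts with str<20.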

Statistical Operations

cor list of variables
tells STATA to compute the correlation between each pair of variables in list of
variables.

twoway scatter var1 var2 || lfit var1 var2
produces a scatter plot of var1 on the Y-axis and var2 on the X-axis. If the || lfit part is
included then the fitted OLS line is also displayed.

predict newvarname [, residuals]
when this command follows the regress command, the OLS predicted values or
residuals are calculated and stored under the name newvarname. When the option
residuals is used, the residuals are computed; otherwise the predicted values are
computed and placed into newvarname.

Example:

reg testscr str expn_stu el_pct, r
predict tshat
predict uhat, residuals

Here, testscr is regressed on str, expn_stu, el_pct (first command); the fitted values are saved
and stored under the name tshat (second command), and the residuals are saved under the
name uhat (third command).
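
As a sanity check on predict, the sketch below verifies that each residual equals the actual value minus the fitted value. It assumes that caschool.dta is stored in \statafiles and that the default line delimiter is in effect. The variable name check is hypothetical.

```stata
* Sketch only: assumes caschool.dta is stored in \statafiles
use \statafiles\caschool.dta, clear
reg testscr str expn_stu el_pct, r
* Fitted values and residuals from the regression just estimated
predict tshat
predict uhat, residuals
* Residual = actual value minus fitted value, so check should be zero
gen check = testscr - tshat - uhat
* uhat has mean zero by construction; check is zero up to rounding error
sum uhat check
```

Because predict stores its results in single precision by default, check may differ from zero by a very small rounding error.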

regress depvar list of variables [if expression] [, robust noconstant]
carries out an OLS regression of the variable depvar on list of variables. When if
expression is used, then the regression is estimated using observations for which
expression is true. The option robust tells STATA to use the heteroskedasticity-robust
formula for the standard errors of the coefficient estimators. The option noconstant
tells STATA not to include a constant (intercept) in the regression.

Examples:

reg testscr str, r
reg testscr str expn_stu el_pct, r

summarize [list of variables] [, detail]
computes summary statistics. If the command is used without a list of variables, then
summary statistics are computed for all of the variables in the dataset. If the command
is used with a list of variables, then summary statistics are computed for all variables in
the list. If the option detail is used, more detailed summary statistics (including
percentiles) are computed.

Examples:

sum testscr str

computes summary statistics for the variables testscr and str.

sum testscr str, detail

computes detailed summary statistics for the variables testscr and str.

test
this command is used to test hypotheses about regression coefficients. It can be used to
test many types of hypotheses. The most common use of this command is to carry out a
joint test that several coefficients are equal to zero. Used this way, the form of the
command is test list of variables, where the test is carried out on the coefficients
corresponding to the variables given in list of variables.

Example:

reg testscr str expn_stu el_pct, r
test str expn_stu

Here testscr is regressed on str, expn_stu, and el_pct (first command), and a joint test of
the hypothesis that the coefficients on str and expn_stu are both equal to zero is carried
out (second command).
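
The same joint hypothesis can also be written out explicitly, one restriction per pair of parentheses. The sketch below assumes that caschool.dta is stored in \statafiles and that the default line delimiter is in effect. After test, the F-statistic and its p-value are available as r(F) and r(p).

```stata
* Sketch only: assumes caschool.dta is stored in \statafiles
use \statafiles\caschool.dta, clear
reg testscr str expn_stu el_pct, r
* Joint test that both coefficients are zero (same as: test str expn_stu)
test (str = 0) (expn_stu = 0)
* The F-statistic and its p-value are stored after the test command
display "F-statistic = " r(F)
display "p-value     = " r(p)
* r(df) holds the number of restrictions being tested
display "Restrictions = " r(df)
```

Writing the restrictions explicitly is mainly useful when the hypothesized values are not zero, for example test (str = -1).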
ttest
this command is used to test a hypothesis about the mean or the difference between
two means. The command has several forms. Here are a few:

ttest varname = # [if expression] [, level(#)]

Here you test the null hypothesis that the population mean of the series varname is
equal to #. When if expression is used, then the test is computed using observations for
which expression is true. The option level(#) is the desired level of the confidence
interval. If this option is not used, then a confidence level of 95% is used.

Examples:

ttest testscr = 0;

tests the null hypothesis that the population mean of testscr is equal to 0 and computes a 95%
confidence interval.

ttest testscr = 0, level(90);

tests the null hypothesis that the population mean of testscr is equal to 0 and computes a 90%
confidence interval.

ttest testscr = 0 if (str<20);

tests the null hypothesis that the population mean of testscr is equal to 0 and computes a 95%
confidence interval only using observations for which str<20.

ttest varname1 = varname2 [if expression] [, level(#) unpaired unequal]
tests the null hypothesis that the population mean of series varname1 is equal to the
population mean of series varname2. When if expression is used, then the test is
computed using observations for which expression is true. The option level(#) is the
desired level of the confidence interval. If this option is not used, then a confidence
level of 95% is used. The option unpaired means that the observations are not paired
(they are not panel data), and the option unequal means that the population variances
may not be the same. (Section 3.4 of Stock and Watson (2018) describes the equality of
means tests under the unpaired and unequal assumptions.)

Examples:

ttest testscr_lo=testscr_hi, une unp;

tests the null hypothesis that the population mean of testscr_lo is equal to the population mean
of testscr_hi and computes a 95% confidence interval. Calculations are performed using the
unequal variance and unpaired formula of Section 3.4 of Stock and Watson (2018).

ttest ts_lostr=ts_histr if elq1==1, unp une;

tests the null hypothesis that the population mean of ts_lostr is equal to the population mean of
ts_histr, and computes a 95% confidence interval. Calculations are performed for those
observations for which elq1 is equal to 1. Calculations are performed using the unequal variance
and unpaired formula of Section 3.4 of Stock and Watson (2018).
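
An alternative that avoids creating two separate variables is the by() option of ttest, which compares the mean of one variable across the two groups defined by a binary variable. This form is not used in the Do-Files above; the sketch assumes that caschool.dta is stored in \statafiles and that the default line delimiter is in effect.

```stata
* Sketch only: assumes caschool.dta is stored in \statafiles
use \statafiles\caschool.dta, clear
gen str_20 = (str<20)
* Compare the mean of testscr across the two groups defined by str_20
ttest testscr, by(str_20) unequal
* The t-statistic and two-sided p-value are stored after the ttest command
display "t-statistic = " r(t)
display "p-value     = " r(p)
```

Note that with by(), STATA reports the difference as the mean of the group coded 0 minus the mean of the group coded 1.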

4. FINAL NOTE

For a complete list of commands, consult the STATA User’s Guide and Base Reference
Manuals. In addition, there are more detailed manuals on Graphics, Time Series, Data
Management, etc.; a total of 21 manuals which you can download. Alternatively, you may want to
use the “Help” command inside STATA. It will display details of STATA commands
including options. Under the Search… tab, for example, you will find most of what you are
looking for. As mentioned before, this tutorial is not intended to replace the Reference or
User’s Guide. The best way to learn how to use the program is to spend some time exploring
and working with it.

STATA replication batch files for all the results in the Stock/Watson textbook are available
from the Web site. You are invited to download these and study them.
