Stata Guide 1
STATA GUIDE
UNIT 1
GETTING STARTED WITH STATA 8
Introduction
This is a self-study guide to the use of STATA. Always bring it along with you to the workshops so
that you can consult it whenever the need arises. It can best be used when you are sitting behind a
computer with STATA up and running.
This guide assumes that you are familiar with the Windows operating system and with Excel.
STATA uses a command system, which means that you cannot just click (with your mouse) on a particular
option (say, a graph or a calculation) from a pull-down menu; instead, you have to instruct STATA
exactly what to do by writing a command. You may find this cumbersome at first, but (after some
time and practice) you will see that STATA’s command structure allows you to approach the data from
whatever angle you choose to take.
Now make sure your computer is switched on and STATA is up and running!
The first thing you’ll note when you look at the screen is that STATA consists of a set of separate
windows with the following titles (on the title bars of the respective windows):
• Intercooled Stata 8.0
• Stata Results
• Stata Command
• Review
• Variables
Of the five windows listed above, the last two are optional, but it is useful to retain them on the
screen. The main window is titled Intercooled Stata 8.0. This window contains the STATA main menu, the
toolbar and two more windows, viz. Stata Results and Stata Command. We suggest that you arrange the
windows as shown in the following picture by dragging them appropriately and resizing them wherever
necessary.
AUD, MA Economics 2012-2014
Apart from the five visible windows there are three other windows in Stata which are hidden and only
appear when requested by a command or by clicking an icon on the toolbar. They are:
• Graph
• Log
• Help/lookup
The windows serve the following purposes:
• the command window is the only window in which you can type and converse with STATA;
• the results window lists your commands along with the results (calculations, etc.);
• the review window lists all previous commands (when starting up, this is empty);
• the variables window lists the variables in your file (when starting up, this is empty);
• the graph window will appear whenever you instruct STATA to graph something;
• the log window (invisible, unless explicitly requested) allows you to keep a running log (account)
of everything that appears in the results window (you can import the log later into Word if you
want);
• the help/lookup window gives you detailed information on all STATA commands, if requested.
As we go along we shall explain how to use these various windows (there are still other windows which
we shall mention when appropriate).
Stata Guide: Chandan Mukherjee
Important Note:
2. Starting a session: Default Working Directory, Log file of numerical results and Log file of
commands
Before you start a session with STATA, you should first ensure that your working directory, where you
have your source data files for the session and where you would like to keep your output files
(numerical results, graphs, commands used), is set as you want it. I have set C:\SDE2012 as STATA’s
default working directory. On all the days of this workshop, we shall always use this as the
default directory. The current default directory is always indicated by STATA at the bottom-left corner
of its main window. Please check this now.
At any time during a working session, if you have any doubt about the default directory, you can check
it with the following command (try it now):
pwd [don’t forget to press the <Enter> key after typing the command]
STATA allows you to keep track of your instructions and results on a separate
file, called a log file. After a session, you can import the file in Word to verify
the instructions you gave and STATA error messages, and to keep a record of
the results of your work. Graphs are not included in a log file. If you want to
keep them, you will have to save them separately (we shall come to this later).
To open a log file, you have two alternatives. First, you can instruct STATA to start recording your
commands and the results by typing the following command:
log using XXX, text
XXX is just the name you give to the file; for example, this is the first workshop you are working on,
so you can give the name ws01, indicating that this is your first session with Stata. The specification
text after the comma tells Stata to keep the log file in text (ASCII) format so that it can be read by a
text editor or a word processor such as Microsoft Word.
You are now ready to execute your third Stata command. Type the following and press the <Enter>
key:
log using ws01, text
The second method of opening a log file is to use the mouse: click on the icon for LOG (4th from the
left on the toolbar) and a dialog box will appear in which you can write the name of the log file (for
example, ws01). We shall try this method later.
Once you have executed the command opening a log file, it will keep a running account of your
activities. Opening a log also creates a log window which you can look at any time you want by clicking
the icon once more.
If you want to stop the log file temporarily, type log off. To resume the log file, type log on. To
close the file at the end of your work, type log close.
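Put together, a logged session could look like the sketch below. The file name ws01 is the example
name used above; pwd and describe merely stand in for whatever commands you want to run. The sketch
assumes that no log file is open yet and that no file ws01.log exists in your working directory.

```stata
* start recording commands and results in ws01.log (text format)
log using ws01, text
* everything from here on is written to the log
pwd
* suspend the log: the next command is not recorded
log off
describe
* resume the log
log on
* close the log file at the end of the session
log close
```

If ws01.log already exists, STATA will refuse to overwrite it; adding the option replace (or append)
after text should let you reuse the name.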
Notes:
• We strongly advise you always to keep a log file of your work. It allows you to
check what you did during a session with STATA, and you can always edit the
log file as a report of your results. This saves time. Both functions are
extremely convenient.
• Add your comments, whenever appropriate, after typing *.
• Keep in mind that graphs are never stored in the log file!
Finally, you can also keep a record of all the STATA commands you execute during a session by
opening what is called a command log file. Note that, unlike the log file, the command log file keeps
only the commands and not the results or the error messages. The utility of a command log file is that
you can re-execute all the commands of a session without having to type them again, by converting this
file into what is called a do file. We shall show you this facility later.
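As a rough sketch of how a command log is kept: the cmdlog command opens and closes it in much the
same way as log does for a log file. The file name ws01cmd is only an illustration, and worker.dta is
assumed to be in your working directory.

```stata
* start a command log; STATA adds the extension .txt by default
cmdlog using ws01cmd
use worker, clear
describe
list wi in 1/5
* stop recording commands
cmdlog close
```

The resulting file ws01cmd.txt holds only the lines you typed, so it can later be checked in a text
editor and turned into a do file.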
STATA is a command-driven programme and, hence, will issue an error warning if you write something
that does not conform with its syntax. But often you want just to jot down some comments on what you
are doing. These may be conclusions, hints, clues, etc., or just general comments. To do so, put * (an
asterisk) in front of your entry in the command window and STATA will ignore what follows, but print
it in the results window anyhow. For example, type a line such as
* this is a comment, not a command
and look at the results window. STATA reproduces your text, but does not consider it as a command.
This facility will prove to be very convenient when you are carrying out data analysis and want to
record your ideas along with it!
Hence, * just tells STATA to ignore what follows, but print it anyhow into the results window (and,
hence, into your log file).
STATA without data is not much use. To get data (already stored as STATA files) into STATA is
simple. STATA data files always take the form XXX.dta where XXX is the file name. The use
command allows you to load a STATA data file. For example,
use worker.dta
Note: if you stored the india file in a directory other than the a: directory, adjust the command
accordingly!
Note: The data set india.dta stores data on a sample of urban workers in a South Indian town. The data
give information on the sex of the worker, age, educational level, weekly earnings, and
whether the job is permanent or temporary.
You will now notice that the variables window is no longer empty, but instead features 5 variables –
labeled respectively: sex, age, edu, wi, and pt.
If you want to save this data set under a new name, say worker, you can do this as follows:
save worker
This will save the file in your a: directory. The file will be saved as worker.dta (hence, .dta is added
as the default extension, even if you do not type it in full). Hence, save worker is equivalent to typing
save worker.dta explicitly.
But if you try to do it again, STATA will issue an error warning and refuse to execute your command.
To see this, type,
save worker
. save a:\worker
file a:\worker.dta already exists
r(602);
Note: r(602) labels the error you made with a number and gives a cryptic explanation. Here STATA
notifies you that you are saving to a file that already exists! STATA does not allow you to do this,
unless you explicitly tell it to overwrite the earlier file with its new version.
To do so, you use the option replace along with the save command as follows:
save worker, replace
Hence, to save a file again when it already exists, you need to combine the save command with the
replace option! Otherwise, STATA will refuse to save a file over an already existing one. You must
tell STATA that you want to overwrite the existing file!
The command clear will clear out all data from active memory, but does not affect existing data files.
Try this:
clear
use worker
You’ll note that the data file is loaded again. The command clear, therefore, does not operate on files,
but simply wipes out the data in the active memory.
5. Getting to know your data: describe, list, sort, codebook and inspect
Make sure the data set worker.dta is loaded. To see what a data set contains, use the describe command
by typing,
describe
Hence, the data set consists of 5 variables over 261 observations. sex, edu, and pt are categorical
variables and have labels attached to them. These labels are identified by separate names given in the
fourth column above. To see what the labels mean, you can use the following command:
label list
. label list
lpt:
0 Otherwis
1 Permanen
2 Otherwis
lsex:
1 MALE
2 FEMALE
ledu:
1 NONE
2 UPPER PR
3 SECONDAR
4 HIGHER
Hence, for example, the variable sex with labels lsex takes the value 1 for male and 2 for female. The
variable edu with labels ledu ranges over the integers from 1 to 4, listing levels of education (none,
upper primary, secondary and higher). The variable pt with labels lpt has labels over the range of
integers from 1 to 5, indicating the status of the job [permanent, otherwise (temporary), substitute,
helper, or other].
If you want to look at the values entered for a variable, you can use the list command as follows:
list wi
Try it. STATA will show you a numbered list of values for wi. This list is truncated at the bottom of the
results window and ends with the statement – more – , indicating that there is more to come. Press the
space bar to see the next set of values, and continue doing so until you reach the end of the data
(observation 261). If you get fed up with the list going on forever (and with a big data set this can
happen), just press the break key to halt the procedure. (Note: depending on your keyboard, you may
have to press Ctrl and Break, or Fn and Break, simultaneously.)
You can list more than one variable at a time; for example, try list wi age sex. If you do this, you’ll
find that the data will be displayed in matrix form, one row for each observation.
Looking back at the description STATA gave of the data, you’ll note that the last line mentioned
‘Sorted by: ’. In this case, the space after ‘sorted by’ is left blank, meaning that the data base has not
been sorted.
You can always sort your data base with the sort command. For example, type,
sort wi
To check that the data have been sorted, just type describe. STATA will now report that the data have
been sorted by the variable wi. To check this, type list wi, and you’ll see that the data on wi are now
sorted from the smallest to the largest values.
You can also sort by more than one variable. For example, try the following:
sort sex wi
Using the describe command you’ll note that the data are now sorted by sex and wi. What does this
mean? STATA will first sort by sex and subsequently by wi (leaving sorting by sex unchanged!). You
can verify this by looking at your data as follows:
list sex wi
and you’ll note that all men come first (remember that male = 1 and female = 2, hence, 1 comes first),
and women later. Within each category (male/female), the data are then sorted by weekly wages (wi).
Also try sort pt and then list pt. If you do this you’ll note that only two categories out of the 5 listed are
effectively used – namely, permanent and otherwise. Presumably, when the data set was constructed the
intention was to use more categories, but – in the end – only two were applied.
EXERCISE
There are two further useful commands to get to know your variables better: codebook and inspect.
They require little explanation. Just try them out:
codebook wi
yields:
mean: 165.886
std. dev: 133.658
and,
inspect wi
yields,
Sometimes you are only interested in part of your data and not in the whole range. For example, you
may want to look at the smallest 5 values or, alternatively at the highest 5 values. You can select a range
of values using the in qualifier after a command.
sort wi
list wi in 1/5
. list wi in 1/5
wi
1. 15
2. 15
3. 16
4. 18
5. 18
listing the smallest 5 values. It is more fun, however, if you try the following:
list wi sex age in 1/5
Note that, at the bottom of the wage scale, the men are invariably very young, but not the women. Try it
again for the range 1/10 to see whether this is true for the lowest 10 as well.
Now try
list wi in 15/10
. list wi in 15/10
Obs. nos. out of range
r(198);
STATA always moves from the first to the last observation (as sorted in the data base). Using the
qualifier in 15/10, therefore, tells STATA something it cannot do – that is, to go from 15 to 10 (since
STATA will encounter observation 10 first).
What happens if you try list wi in 260/265? Try it! STATA will again tell you that you are out of range.
Why? The reason is that there are only 261 observations in your sample. The range 260 to 265,
therefore, is not possible.
What to do if you want to know the top 5 earnings? Try it first and then read on!
If you subtracted 5 from 261, and then typed list wi in 256/261, you’ll have noticed that you end up
with 6 rather than 5 top values. If you subtracted 4 from 261, you got it right. But the procedure is
cumbersome. First, you have to check the size of your sample and then subtract the required number of
top values minus one from the sample size. Not difficult, but cumbersome. STATA offers a shortcut:
list wi in -5/-1
STATA replies:
. list wi in -5/-1
wi
257. 500
258. 530
259. 600
260. 622
261. 950
Using the minus sign tells STATA to count from the top down rather than from the bottom up. Hence,
-5 means the fifth highest value! But why not write list wi in -1/-5? Try it. STATA will tell you that
you’re out of range. Why is that?
As explained earlier, STATA always moves from the first to the last observation and, hence, will
encounter -5 (the fifth from the top) before it gets to -1. If you specify the range as -1 to -5,
therefore, STATA will tell you that you are out of range.
Now try,
list wi sex age in -5/-1
You’ll note that the top 5 are all men, none aged below 20. Check whether this is true for the top ten as
well!
The last two are interesting. They tell you that the bottom 20 earners all have no or only primary
education, but several of the top 20 earners also have no or only primary education.
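A comparison of this kind can be sketched as follows. This is only one plausible way of producing it:
the choice of variables to list is an assumption, and worker.dta is assumed to be in your working
directory.

```stata
use worker, clear
* sort by weekly wage so that position in the data reflects earnings
sort wi
* bottom 20 earners with their educational level
list wi edu in 1/20
* top 20 earners with their educational level
list wi edu in -20/-1
```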
An important lesson of these examples is that using the list command after prior sorting of the data can
be very revealing if you look separately at the top or bottom values. You get to know your data better,
even before you engage in more sophisticated statistical analysis.
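In data analysis and statistics, logical operators are very frequently used. They allow you to trim your
sample to particular subsets relevant to your analysis: for example, men only, or women only; those
with higher education; those younger than 20 years of age, etc. STATA’s main relational and logical
operators are:
operator   meaning
==         equal to
~=         not equal to (!= also works)
>          greater than
<          less than
>=         greater than or equal to
<=         less than or equal to
&          and
|          or
~          not (! also works)
Here are some simple examples (try them as you go along!):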
list age wi sex if age<20            Lists age, income and sex for those below 20 years of age
list edu if sex==2 & wi<45           Lists educational level of women with incomes below 45
list wi sex if sex==1 | sex==2       A silly command! Why? [1]
list sex age if edu==1 | edu==2      List all those with educational level of primary education or below.
list wi if sex==2 & edu==4           List income of women with higher education (you’ll note that there
                                     are only 6 women in this category).
[1] The reason why it is silly is that it does the same as list wi sex because sex only takes values 1
and 2. Hence the logical statement if sex==1 | sex==2 picks both men and women.
An extremely useful command is count. It hardly does anything, but combined with if qualifiers, it can
be very useful. What it does is just count the number of observations in question.
count
261
But it gets more interesting if you use count with the if qualifier. For example, type
count if sex==2
and Stata will respond that there are 55 women in the sample.
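A few more counts along the same lines, as a sketch (worker.dta is assumed to be in your working
directory; the particular conditions are just illustrations built from the variables described earlier):

```stata
use worker, clear
* number of men (sex==1)
count if sex==1
* number of workers younger than 20
count if age<20
* number of women with higher education
count if sex==2 & edu==4
```

According to the examples earlier in this guide, the last count should come to 6.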
In data analysis we often generate new variables as we go along. To do this, we use the generate
command. For example, suppose you want to use the logarithms of weekly wages instead of the wages
themselves. You can create them as follows:
generate log10wi = log10(wi)
Note: log10wi is the name given to the new variable (you can use any name!)
After creating a variable, it is useful to label it. You can do this using the label variable command as
follows:
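A minimal sketch, assuming the variable log10wi described above; the wording of the label is only an
illustration, and worker.dta is assumed to be in your working directory:

```stata
use worker, clear
* recreate the variable so that the example runs on its own
generate log10wi = log10(wi)
* attach a descriptive label (the label text is an assumption)
label variable log10wi "Log (base 10) of weekly wage"
* the label now shows up in the description of the variable
describe log10wi
```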
If you now use the describe command, you’ll note that the new variable has been added to the list.
You will often use the if qualifier to construct a new variable. Suppose, for example, that you want to
create a new variable of weekly wage income for men only. You can do this as follows:
generate wi_men = wi if sex==1
To check what you have done, try list wi wi_men sex and you’ll see that the newly created variable
only lists incomes for men, leaving incomes of women as missing values.
Now create another variable only listing weekly incomes for women.
At times we also want to create a categorical variable. Suppose, for example, that we want to create a
dummy variable that picks out women: hence, dummy = 1 for women and 0 for men. There are two
ways to do this.
The first way is more cumbersome, but instructive in what it teaches you about data manipulation in
STATA. It proceeds in two steps: first pick the women and assign them the value 1; next pick the men
and assign them 0. You can do this as follows:
generate dummy = 1 if sex==2
But if you now type generate dummy = 0 if sex==1, STATA will protest and tell you that you made an
error since dummy is already defined as a variable. In other words, you cannot generate an existing
variable twice. To add the men, you need to use the command replace as follows:
replace dummy = 0 if sex==1
The second way is more elegant, but quite intricate. It makes use of the fact that a logical expression
returns 1 if the statement is true and 0 if false. You do this as follows (calling the new dummy
dummy2):
generate dummy2 = sex==2
This is very handy! When sex equals 2 (= female) the logical statement is true, returns 1 and, hence,
assigns the value 1 to dummy2. When sex equals 1 (= male), the logical statement is false, returns 0
and, hence, assigns 0 to dummy2. All in one go! Check what you have done using list sex dummy
dummy2. Both dummies are identical.
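With 261 observations, checking by eye is feasible but tedious. As a sketch, the assert command can do
the comparison for you; it is not introduced elsewhere in this guide, and worker.dta is assumed to be
in your working directory.

```stata
use worker, clear
* first way: two steps
generate dummy = 1 if sex==2
replace dummy = 0 if sex==1
* second way: one step, using the 1/0 value of the logical expression
generate dummy2 = sex==2
* stops with an error if the two variables differ in any observation
assert dummy == dummy2
```

Note that the two ways would differ if sex had missing values or values other than 1 and 2: the first
way leaves dummy missing for such observations, while the second assigns them 0.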
You may now wish to assign specific labels to each of the numerical values of the dummy variable –
say, female to 1 and male to 0. You do this by first defining the label and subsequently assigning the
label to the variable in question.
label define dumlbl 0 "male" 1 "female"
[This step defines a label called dumlbl (just a name) and assigns the specific labels female to 1 and
male to 0.]
label values dummy2 dumlbl
[This step assigns the labels defined above to the variable dummy2.]
Check what you did with list sex dummy dummy2 and you’ll note that the variable dummy2 now
features the labels, while the variable dummy doesn’t (since we did not assign labels to dummy).
STATA is command driven and, hence, involves some typing into the command window. You can,
however, shorten the time you spend on this considerably using a few simple tricks.
First, STATA allows you to abbreviate its commands, usually to the first one or two letters of the command.
To know which abbreviation to use, just type help followed by the name of the command. Say, try help
list. This will prompt STATA to give you quite a bit of information in return. What matters here is just
the bit where STATA shows the command line with its various options. The relevant extract is as
follows:
list [varlist] [if exp] [in range] [, [no]display nolabel noobs doublespace ]
Note that STATA underlined the first letter of the list command. This means that the command can be
abbreviated by its first letter. Hence, l wi does the same as list wi.
If you have forgotten the abbreviation to use, but you think it is l, check it by using the command which
followed by the abbreviation:
which l
STATA replies:
. which l
built-in command: list
confirming that you got it right. Try which li and you’ll note that STATA also accepts that one.
Next try which g and STATA will tell you that this is the abbreviation for generate. But, if, for
example, you type which b, STATA replies with an error message as follows:
. which b
command b not found as either built-in or ado-file
r(111);
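A short sketch of abbreviations in use. The abbreviations l and g are the ones confirmed above; d for
describe is an assumption that you can verify with which d. The file worker.dta is assumed to be in
your working directory.

```stata
use worker, clear
* d is short for describe
d
* l is short for list
l sex age wi in 1/5
* g is short for generate
g logwi = log(wi)
```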
A second trick you can use to simplify your work is that you can edit earlier commands, either by using
the Pg Up and Pg Dn keys on your keyboard (which allow you to walk through earlier commands) or
by clicking (with your mouse) the relevant command in the review window (which stores the commands
you have already executed).
Hence, for example, do list wi. Now you want to add the variables sex and age to the list. To do this,
press Pg Up once, the previous command now appears in the command window, and add sex and age.
The third trick is that you do not have to type the names of your variables. You can input a variable into
the command line by just clicking on its name in the variables window.
Here is an example using the first and last tricks. You want the command list sex age wi. Instead of
typing it out in full, proceed as follows: type l (the first letter: the abbreviation of the command list) and
then click respectively on sex age wi in the variables window. Hence, the whole command list sex age
wi can be obtained very simply by first typing one letter and then executing three mouse clicks in the
variables window!
The command exit allows you to stop a STATA session. If, however, you undertook data manipulations
during a session (say, by using the generate command), STATA will refuse to allow you to exit
because you may lose data in the process. To exit anyway, type
exit, clear
These commands, taken together, tell STATA that you want to exit and that you are not interested in
keeping whatever data manipulations you made during this session. If you want to keep the newly
created variables, either use save, replace or, better still, save the file under a new name.
For example,
save worker2
log close
exit
The file worker2.dta now contains the original data and those generated in your first work session. By
keeping track of different versions, you will still keep the original data file worker.dta without it being
swamped with the clutter of things you tried out.
Sometimes you may be working with a data file with a large number of variables, although for the
problem at hand you only need a sub-set of them. To trim down your data set, you can either use the
drop or the keep commands.
To see this, start up STATA again, and load the file you just saved:
use worker2
Now try,
drop sex
and you’ll note that the sex variable no longer features in the variables window.
Next try,
keep wi age
and you’ll notice that the data base is now reduced to only two variables.
Important note: Be careful not to use save, replace after you dropped variables from a data set!
Some of your data will be lost!
Save the truncated file under a new name instead!
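A minimal sketch of this safer routine, assuming worker2.dta from the previous session is in your
working directory; the name worker_small is hypothetical and can be anything not yet in use:

```stata
use worker2, clear
* keep only the two variables needed
keep wi age
* store the trimmed data under a new name, leaving worker2.dta untouched
save worker_small
```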
If you start up STATA and load a file (say, worker.dta), and then you issue the command save worker,
STATA will store the file into the default directory (usually, but not always, c:\data). If you want the
data to be stored elsewhere, you can change your default directory for purpose of the project in
question. You do this by (1) specifying the default directory, and (2) creating a special sub-directory, if
needed.
To see what directory you are presently working in, just type pwd and STATA will tell you what your
current directory is (to remember this command, I think of it as please which directory).
Suppose you now want to create a subdirectory – say, worker. You can do this as follows:
mkdir worker
and then you can change the default to this new directory as follows:
cd worker
. cd worker
c:\data\worker
which shows that your default directory is now c:\data\worker (if your previous default directory was
correctly set as c:\data).
Now your data file and log file can both be found within the same sub-directory specifically created for
the problem at hand. Before we go to the next section, let us close the log file (which will be an empty
file because no command has been executed after opening it).
log close
clear
Now that you have had a bit of practice, you should be able to discern, at least vaguely, the general
rules for composing a STATA command.
STATA commands can be a bit confusing at first. If, for example, you forget
to type a comma “ , “ or a bracket “ ) “ , or you type two brackets “ )) ”, or
you misspell a command, STATA will answer with an error message.
• [command word]: You have already used several command words: describe, list, generate,
sort. A STATA command must start with a command word. For example, if you want to create a
new variable named logwi which is the logarithm of the variable wi (weekly wage earnings), it is
not enough to type
logwi = log(wi)
STATA interprets the first word of your instruction as a command word in its own vocabulary, and
logwi is not a command; it is simply the name you have decided to give to the variable which is the
logarithm of wi. The correct command is
generate logwi=log(wi)
which, in the language of STATA, reads as: generate a new variable named logwi by taking the
logarithm of the variable wi.
• [variable name(s) or expression]: Commands are mostly concerned with something you want
STATA to do on a variable or variables existing in your dataset (such as creating a histogram of a
variable or, creating a new variable). The command word therefore must obviously be followed by
the name(s) of existing variable(s) or the name of a new variable to be used in the command. For
example, sort wi or list age. Some commands, however, do not feature a variable name or
expression; for example, the save and use commands are followed by filenames.
• [qualifiers]: There are two qualifying STATA words which follow the first two parts [command &
variable name(s)]. They are in and if. Unlike the first two parts, the qualifiers are optional; you
use them only when they are necessary. Without the qualifiers STATA executes the command
(graph or generate, for example) for the whole dataset. Qualifiers are used to restrict the command
execution to a part of the data.
• [, options]: The comma separates the main command from the various options that go with the
command. You have already come across some cases like save, replace and exit, clear. There is only
one comma, after which the various options follow.
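As an illustration, here is a sketch of one command that uses every part described above. The
particular combination is only an example; nolabel is one of the options shown in the help extract for
list, and worker.dta is assumed to be in your working directory.

```stata
use worker, clear
sort wi
* command word:   list
* variable names: wi age edu
* qualifiers:     if sex==2   and   in 1/20
* option:         nolabel (after the single comma)
list wi age edu if sex==2 in 1/20, nolabel
```

This lists the women found among the 20 lowest earners, showing the numerical codes of edu rather than
its labels.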
You can export and import data from and into STATA by using copy & paste
across Windows applications.
Let us first export data from STATA. To do this, open the STATA editor: click
editor, and use your mouse to select the whole data set (top left corner of data to
bottom right corner). Click edit and select copy editor data.
Open up Excel and click edit, followed by paste, to paste the data into Excel.
You’ll note that variable names will also be copied along with the data!
You can import data into STATA in exactly the same way. In Excel, copy the whole range (including
the variable names on top). Go to STATA, type clear to empty the data set, and open the editor. Click on
the top left corner cell. Then click on edit, followed by paste. You have now brought your data back into
STATA.
Note: There are many more ways to import data into STATA. For the time being, we shall only make
use of those explained above.