A Quick Introduction To STATA
A Quick Introduction To STATA
HG
Revised September 2010
1. How to access STATA from the pc’s at the computer lab and
elsewhere at UiO.
At the computer lab, after having logged in, you may find STATA 11 (or StataIC 11 or
Small Stata if the StataIC11 does not open) under Programs on your computer, and you
can start it straight away.
Otherwise you have to log in again at a remote desktop at the server “statwin2” or
“statwin3”. In that case do the following: Go to
Start -> Programs -> Accessories ->( Communications ->) 1 Remote Desktop Connection
Write statwin2 (or statwin3 ) in the window in the menu that appears, and then press
connect. Log in, and a new desktop should appear on the screen.
Start -> Programs -> Stata 11 -> StataIC 11 (or Small Stata)
The present tutorial is a very short introduction to some of the topics from the Getting
started ( [GS] -> [GSW]) manual in the pdf documentation file. Later you may find it
useful to familiarize yourself with the Getting started [GSW] manual (and the User [U]
manual) on your own as need arises.
1
On some computers you do not need to enter Communications.
2
2. The windows:
STATA has 4 separate windows for typing in commands and for viewing results. The
command window is for writing and running commands interactively. In the review
window you can view (and activate) the command lines you have previously written.
Clicking a command in the review window reproduces the command int command
window for editing and rerunning. In the variables window all variables and labels are
listed. Clicking on a variable name in the varaiables window reproduces that name in the
command window.
When loading Stata the first time you probably get the Stata window in standard form –
i.e. without any colors. By going to the menu 2
2
Or, shorter: simply right-click inside the result window and choose preferences…
3
you can choose color combinations for your output with different colors for different
kinds of information – which makes the output easier to read (a matter of taste). In the
figure I have chosen “Classic” which is one of the options.
Excercise; Load the exercise data, stored in a file called auto.dta, which is included in
STATA. I.e., write sysuse auto (or sysuse auto.dta) in the command window and press
enter (more details below).
You can obtain a description of the data set via the menu:
File -> Example datasets -> Example datasets installed with Stata
and then click the describe link outside auto.dta.
In any case you will see that the command line enters the review window and the results
window (this illustrates how the menus can be used to learn the command lines. Learning
the commands facilitates greater flexibility, quicker computing, and clearly a better
understanding of how the program operates). For example, from the review window we
see that the data description we obtained via the menus, we could have obtained directly
by entering the command sysdescribe auto.dta. Clicking on that line reproduces the
command in the command window. Pressing enter then runs the command again.
Note also that now all the variables in the data set are listed in the variables window.
3. More on how to load, save and use the built-in data set,
auto.dta
To obtain a list of all built-in data sets that follow with STATA, you can use the
command sysuse dir.
The command dir gives a listing of your own home directory (i.e., M:\).
You can save a copy of the data set (or any other data set) in your own directory by the
command save, and later retrieve it by the command use. By save the data are stored in a
binary file (with extension .dta) – in a special STAT format that is easy to read for
STATA. For example, the command save auto, makes a copy of the data set, with name
“auto.dta”, in your home directory. This is specially useful if you have made changes and
additions to the data set during your STATA session. Then, next time you start STATA
for a new session, you can simply give the command use auto, and the data are reloaded
into STATA as you left them last time.
If you have many files in your home directory, it may be smart to make a new
subdirectory and change the working directory to that one. For example, before starting
STATA, suppose you have already made a subdirectory, called statadat (say), to contain
all your stata files. Then, within STATA, the command cd statadat changes the working
4
directory to M:\statadat, and commands like save, dir, use, etc. will refer to that directory
(instead of to M:\ which is the default). The next time you open STATA, the first thing
you should do, is to give the command cd statadat, (or with your chosen name for the
directory) so that your working directory is the correct one.
4. The spreadsheet:
If you write edit or browse in the command box (or press the edit or browse icon), a
spreadsheet window will pop up (there are also a menu and short cut buttons for opening
the spreadsheet window). Note that if you used the browse command, you can only view
and not edit the spreadsheet. To edit the data you need to use the edit command or press
the edic icon.
If you right click on the variable names and choose “variable properties”, you can edit the
name, or the variable labels. (Alternatively you can edit the variable information by
pressing the variable manager icon on the tool bar.)
You are also given information on the format of the variable. In the spreadsheet you have
opened, clicking on the variable name “make” tells you that the label is “Make and
model” and the format is “%-18s”. The s indicates that the variable “make” is a string
variable (consists of letters, not numbers), and that it will be stored using (maximum) 18
letters. The variable “price” has the different format “%8.0gc”. Here, the letter “c”
5
indicates that a comma is used to separate at the thousands, while “g” indicates that the
variable is stored as an integer. If we change the format to read “%8.1fc”, the variable is
no longer stored as an integer, but as a number on the real line where one decimal place is
shown. If we edit it to “%8.2fc”, two decimal places are shown etc. Writing only “%8.2f”
will take away the comma separation at the thousands.
A note on formats. Number variables can indeed be stored as string variables. This will
often be the case when the data that are loaded are not originally in STATA format.
When such data are loaded, it is therefore good practice to check whether the number
variables are stored correctly.
In STATA you can refer to each variable by the variable name. You can also refer to the
line number by using the reference “in”.
Important! Note that you cannot run commands when either the “edit spreadsheet”
window or the “browser” window is open. You have to close these windows before you
can issue any commands.
Exercise; Write the command list make in 2, and the command list weight in 1/7. What
is returned in the results window?
The information given by the help command is quite detailed and concentrated, and can
be hard to read by the beginner. Alternatively it may be easier to locate generate in the
[D] manual, Data management, in the documentation pdf-file, and study the examples
given there.
Or, by going via the menu to help -> search… and search for generate, you obtain much
information on various sources where you can find information on generate.
You can read more on the help facilities in the [GSW] manual section 4.
The following figure shows the result of the command: help generate
6
7
The brackets indicate that specification within the brackets is optional for the command.
For example, the [by varlist:] formulation is optional (i.e. is not required ) and specifies
that the command is to be repeated for each variable in the variable list. Not all
commands can use this formulation.
EXAMPLE; In the tutorial dataset we may want to construct a new variable that equals
mpg/weight. The command generate (or gen for short) is one of the commands to create
new variables. Writing help generate in the command window returns the following
syntax from the help window.
Here the command name (generate), the name of the new variable to be generated
(newvar) chosen by you, and the function that describes how the new variable is to be
constructed (=exp) has to be specified. The help text explains that [type] has to be
specified only if the variable that you want to create is to become a string variable, or if it
is important to specify the decimal precision of the new variable. If a string variable is to
be generated type can be specified to str10 showing that the variable is to be stored with
10 letters. If a number variable that is generated has to have decimal precision type can
be specified to double. The :lbname formulation is optional and allows you to specify a
variable label that describes the content of the new variable.
If you want to change the content of an existing variable in the data base, you can use the
replace command:
8
Exercise. Do this and check in the browser (or edit) window that a new variable has been
created. Remember to close the data window before you can proceed.
Exercise; You can use the help function to establish what the following commands does;
(these are some must-to-know STATA commands). You do not need to go through all of
them now, but try, e.g., “list” and “summarize”. Try also to execute the command,
summarize x.
7. Logical expressions:
If you decide to use the optional [if exp] specification you must use a special syntax for logical
operators.
How many different makes did you get in each of the three cases?
Note 1: Note (in the browse or edit window) that the variable “foreign” has two values, 1 (with
label “Foreign”) and 0 (with label “Domestic”). The actual values, 1 and 0, are stored, but the
labels “Foreign” and “Domestic” are displayed in the data base. If you click one value (i.e. one of
the “Foreign”s), you will see the corresponding numerical value in the small window at the top of
9
the data base window. The command, label list, will give a list of labels defined. You can learn
how to define labels in your data set by help label.
Note 2: Remember to close the data window before issuing any command.
Note 3: When you need to write a command that is similar to a command you have used before
that is shown in the review window, then often the easiest way is to click on the command in the
review window. This causes the command to appear in the command window where you can edit
it before running it. Similarly, if you click on a variable name in the variables window, the
variable name pops up in the command window where the cursor is.
8. Graphics
The graphics facility in STATA is quite well developed and allows numerous variations and
editing facilities. For a start it is recommended to experiment with the graphics menu. You can
then note the syntax that is automatically written in the results window. Use the auto dataset.
Make a histogram over price using 10 bins (Command: histogram price, bin(10)). You can also
use the menu, Graphics -> Histogram. Compare box-plots of foreign and domestic cars (graph
box price, medtype(line) over(foreign)). Draw a scatter diagram of miles pr. gallon and weight
(twoway (scatter mpg weight)). Try also to reproduce these three graphs by using the graphics –
menu.
9. Linear regressions:
To fit simple or multiple linear regressions, use the regress command (or by the menu: statistics -
> linear regression …. ). Using the auto data, generate variable x=mpg/weight (if you have not
done it before) and type:
Exercise; Interpret the estimated model 3. Check out the syntax for the predict command used
after the regress command and use it to obtain the predicted price, predicted residuals and
squared residuals. [e.g. predict predy produces a new variable in the database containing the
predicted values. It is here called predy, or another name that you choose. The command predict
res , residuals produces a variable, containing the residuals (i.e. observed price minus predicted
price), with name res, or another name of your choice.]
3
In regression analysis we try to explain a dependent variable (price) by a linear combination of some explanatory
variables (mpg, weight, x, foreign). In other words:
price = a0 + a1 ⋅ mpg + a2 ⋅ weight + a3 ⋅ x + a4 ⋅ foreign + error
where the constants a0 , a1 ,, a4 , which in the output from Stata appear under “Coef”, are determined (estimated)
by the program from the data in an optimal manner so that the last term “error” (i.e. the unexplained part of price -
also called residual) becomes as small as possible on average (in a certain sense). The linear combination (before the
error term) represents the explained part of price and is called predicted price. If the model is good, the error term
should be evenly distributed around zero (i.e. have constant variance) whatever the value of predicted price. If that is
not the case, we call the model heteroscedastic.
10
Can you use the scatter command, or the graphics->Two way graphics (scatterplot, line etc)
menu to assess whether the model specification is likely to be heteroscedastic (e.g. plot the
residuals against the predicted price)? Use the help facility to list the functions library. Generate a
variable that equals the natural logarithm of the price and re-estimate the model. How would you
interpret the model now?
This facility allows you to print or save all commands you used with the output produced in the
result window for a session with STATA. It is particularly useful when you hand in written
papers in class, so that the teacher can see how you obtained your results. To start logging a
session, type log using sessionname, where sessionname is the name you decide for the session.
This logging you can turn off and on again temporarily during the session by the commands log
off and log on. When the session is completed,however, type; log close. In the results window
you will now be told where the log file is saved. When you want to view or print the log file you
type; view address\sessionname.smcl.
Note. “smcl” is the program STATA uses to produce output files (see help smcl for more
information). It produces files with the extension smcl. You can copy content in the viewer to a
word document in the usual manner (i.e., marking, copy and paste). If you want the log-file in the
usual text format, you must add the option “text” to the log using – command, i.e.,
If you wish to log only the commands you have used in a session for use in a new and later
session, you can do that by the command cmdlog, or, if you late in a session think it would be
useful to save all the commands in the review window for a later session, simply right-click on
the review window and save the list of commands in a do-file (see section 13 below). For more
information on this read page 118 in the Getting started manual [GSW] under [GS] in the
documentation pdf-file.
11. Num(ber)lists:
Often you will find reference to numlist in the STATA syntax description. Numlist is simply a
sequence of numbers, which can be specified in various ways. As an example; the sequence 2 4 6
8 10 and the numlist 2(2)10 will be synonymous to STATA. To get an overview of different
ways to specify numlists, type help numlist.
11
Input the following lines and figure out what they do.
Command Notes
clear
browse Close the browse window to get back to the
command level.
Set obs 100
browse
egen year = fill(1900 1901) “egen” is an extended version of “generate” that we
need for defining new variables, e.g. consisting of
patterned data and other types.
browse
egen trend = fill(0.1(0.1)10)
browse
generate a = sin(trend)
browse
generate cycle=trend+a
twoway (line trend cycle year)
set seed ? Replace ? by an integer of your choice, e.g. your
birthday like for example 100786. This starts the
algorithm for generating random data. By using the
same seed you can produce the same data later. If
seed is not specified, stata will choose a seed by
default which changes every time you draw random
numbers.
drawnorm u
generate gdp = cycle+u
twoway (line trend cycle gdp year)
Exercise; Load the auto dataset. Explain why this sequence of commands can be used to draw a
random sample of 20 cars:
gen u = uniform()
sort u
mark sample in 1/20
Often you will need to type a sequence of commands several times. In this case you should use
the do-file editor (press the short cut icon with a picture of an envelope). In the do-file you can
write in multiple lines and run them in a sequence. You can save the do-file for later use. Often
12
you will want to specify loops in the do-file editor. As an example, suppose you have variables;
year1, year2, year3, … , year100, and that you want to transform these variables from variables
with string values to variables real numbers as values. You can then type (don’t do this here
unless you actually have 100 string variables called year1, year2,… defined in the data base!)
forvalues x = 1/100 {
generate y`x’ = real(year`x’)
}
STATA will then perform this command successively for `x’ running from 1 to 100. Note that all
the definitions for numlist can be used with this command. The command will generate new
variables, y1, y2, …., y100 that represent nummeric versions of year1, year2,…,year100.
To test out a loop, try the following command in a do-file. (For the di command see below. Note
also that the single quotes in `x’ are different. The first one I find on my keyboard on the top of
the back-slash (\) key, and the last one on my *-key.).
forvalues x = 2/20 {
di “I will do `x’ attempts to do my homework properly”
}
To execute the commands in the do-file directly from the do-file editor, push the execute icon or
use the menu with tools -> execute (do) . If you want to use commands in the do-file in a later
session, you can save the do-file under some name, exmpldo.do, for example, and load into the
do-file editor in a later session.
Note: This small program, consisting of only one command, you could also have entered directly
in the command window. Note that the three lines must be entered separately: After the first line
press enter. Then a number, 2, appears waiting for the second line and so on. The last line
consisting of the end braceonly, closes the program input and runs the program. (Note that the
single quotes in `x’ are different. The first one I find on my keyboard on the top of the back-
slash (\) key, and the last one on my *-key.)
14. Calculator
di is short for the display command. display is used for printing strings or scalar numbers. It can
be used as a calculator. Try out the following (-> denotes output):
You want to describe the output, you can add a comment in double quotes:
di “e = “ exp(1)
di sqrt(2*_pi)
Now, make a table of P(T ≤ t + 1| T > t ) for t = 0.5, 1, 1.5, 2, 2.5, 3, 3.5 and for
α = 0.5, 1, 2 (with λ = 1 always). Note that writing F (t ) for the cdf of T, we get
4
Note that gammap() assumes that λ = 1 in the Γ(α , λ ) -distribution. If you need to caculate P(T ≤ t ) when
λ ≠ 1 , use that λT ~ Γ(α ,1) (see e.g. Rice exercize 2.61), which implies P(T ≤ t ) = gammap (α , λt ) in
Stata. For example, if T ~ Γ(2, 2) , then you get P (T ≤ 1.5) = 0.8008... using the command di gammap(2, 3).
Note also that when STATA talks of Γ( a, b) , it means that a is the same as our α , but b means 1 λ . So, e.g.,
that our Γ(2, 2) - distribution corresponds to STATA’s Γ(2, 0.5) - distribution.
14
1 − F (t + 1) F (t + 1) − F (t )
P(T ≤ t + 1| T > t ) = 1 − =
1 − F (t ) 1 − F (t )
[Hint: First empty STATA of all data by the clear command. Then open the data
spreadsheet e.g. by the data editor icon. Enter the data for t (0.5, 1, …., 3.5) in the first
column. I.e., start with writing 0.5. Note that it appears in the little window at the top. Press
enter (or the ↓ arrow). Now, the number is entered in the first cell of column 1 (with name
var1). Write the second number, 1, and press enter and so on. Change the name of the first
column from var1 to, for example, t, by right- clicking on the head of the column and
variable properties, and then write in the new name.
After thus having entered the numbers for t, generate a new column named e.g. y1 with
P (T ≤ t + 1| T > t ) for α = 0.5 . This can be done by the gen command:
Note that you now can save the results in a Stata data set by the save command. Say you
want to name the dataset “gam”, for example. Then use the command save gam. The table
is then saved as a Stata data set in your working directory under the name gam.dta. You can
easily retrieve the data in a later session (using the same working directory; see section 3),
simply by the command use gam.]
b. Suppose we want to calculate P(T ≤ 4) when T ~ Γ(α= 2, λ= 3.2) , with λ different from
1. In order to use gammap for this we must transform T to Γ(α ,1) . According to the
=
solution of Rice exercise 2:61, Y λT ~ Γ(α ,1) . Hence, for λ = 3.2 we get