Stata For Dummies v1m
Gwilym Pryce
4 March 2009

These notes are designed to accompany a 2-hour practical workshop on the widely used statistics programme, Stata. The session includes guidance on how to open and exit Stata, how to open a data file, how to use a syntax file, how to save output, how to create a simple table, how to create a graph, and how to run a simple regression.
dataset in memory. Since we have not opened a data file yet, the Results window should list the following details or something similar:
. describe
Contains data
  obs:
 vars:
 size:
Sorted by:
Notice that the first line of the output lists the command you have entered. It then tells you that you have zero observations (obs), zero variables (vars), the data file has zero size, leaving you with 100% of memory, and that your data are not sorted by any particular variable.

In the Command window, notice also that you can scroll through previous commands by pressing <page up> and <page down> on your keyboard. Once the command is in view, you can edit it. If you want to re-enter a command (whether edited or unedited), simply press <Enter> when the command is in view.

Review:
Having been truly enlightened by the outcome of your first command, you will notice that your describe instruction has also appeared in the Review window. This window keeps a record of your commands and saves you from having to retype them. If you want to repeat an instruction, simply click on the appropriate line of the Review window and it will appear in the Command window. You can then edit it, if you wish, in the Command window, before pressing the <Enter> key to run the command.

Variables:
This window simply lists the variables in memory and the labels ascribed to them. Since we have not loaded a dataset, the Variables window should be blank. Once you do have some variables to play with, you can click on a variable name in the Variables window and the name of the variable will be pasted to the Command window, sparing you the trouble of having to type it. This is quite a useful facility when you want to perform an operation on lots of variables.

Data:
The Data-editor remains out of view until you open it, either by typing edit in the Command window or by selecting Data-editor from the list that pops up when you select Window from the menu bar. The Data-editor looks like a spreadsheet but is in fact a lot less flexible. For example, variables are always presented as columns, with the variable names at the top, and observations as rows. You cannot enter formulas in the cells, only data, either numerical or string. Nevertheless, the Data-editor is probably the easiest way to enter your own data (an alternative is the input command, or you can import data from Excel and other formats [1]).

[1] Stat/Transfer is a useful companion program: it converts data from a wide variety of formats.

Viewer:
This window is only opened if you select Help from the menu bar or if you want to view a Log-file (one that keeps track of your output and commands; see below). It is worth noting at this point that Stata has an excellent Help facility. It takes a little while to get used to the format, but it is very comprehensive and has a very consistent structure. The Help facility is so good, in fact, that you could probably get by without ever having to refer to the printed manuals.

For example, click on Help, Search, then type edit, and press <Enter>. Scroll down the list of entries on offer in the Viewer window (which will have opened automatically) until you come to the edit hyperlink, and click on it. (Alternatively, you could have simply typed help edit in the Command window.) You should then see a detailed description of the edit command. Items in bold and in square brackets refer to the manual volume where you can find more detailed information. In this case, it should say [D], which refers to the Data Manual. This is only worth knowing if you have a set of manuals (expensive).

Log-file:
It is important to note that nothing you have done so far has been saved or recorded for posterity. Once you close Stata, all the commands you've entered and outputs you've created are lost forever. If you have created or edited a Data-file or Do-file (see below), Stata will ask you if you want to save the changes, but it will not offer such useful prompts for entries to the Command or Results window. To save a record of your Stata session you must open a Log-file. The easiest way to do this is to click File, Log, Begin, then decide on the folder and file name.

Do this now so that you have a record of the remainder of this session: click File, Log, Begin. Call the file Stata for Dummies (or whatever you like) and save it to your H: drive, or temporarily onto the C: drive (note that the latter will be deleted when the computer is turned off, assuming you are using a lab computer).
Alternatively, you can enter the log using instruction in the Command window, followed by the directory and filename. For example,
log using "H:\My Documents\whatever_you_like.smcl"
where smcl is the file extension for Stata log-files. You can view the contents of the log file at any time by going to File, Log, View.
NB Remember to close the log-file before you exit Stata, otherwise the file will be lost! To close the log-file, simply type log close in the Command window (or click File, Log, Close). Don't do this just yet, however. Wait until the end of the session.
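The opening and closing of a log can also be written into a Do-file, so that the record is kept automatically each time the file is run. The lines below are a minimal sketch only: the file name session_log.smcl is hypothetical, and the built-in auto dataset is used purely so that there is something to record. The replace option tells Stata to overwrite a log of the same name if one already exists.

```stata
* Minimal sketch: record a short session in a log file.
* "session_log.smcl" is a hypothetical file name; it is written to the
* current working directory.
log using "session_log.smcl", replace

* Load one of the example datasets shipped with Stata (an assumption for
* illustration; any commands could go here)
sysuse auto, clear
describe

* Close the log so that the record is complete
log close
```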
It's a good idea to add titles and labels to your Do-file to make it easier to follow when you return to it at a future date. If you start a line with an asterisk, Stata will ignore everything that follows on that line. Type *=============================== on the first line of your Do-file. Then press <Enter> and type *Stata Training Session on the second line. Then copy the first line (highlight and press <Ctrl+C>), and paste it onto the third line (<Ctrl+V>). Your Do-file should now look something like this:
*===============================
*Stata Training Session
*===============================
Then close the Data Editor by clicking on X in the top right corner of the Data Editor, and Accept Changes. You will see that in the Variables window, you now have three variables listed: var1, var2, and var3. We now want to give these variables more meaningful names. Type and run (highlight the lines then press <Ctrl+R>) the following commands in your Do-file:
rename var1 id
rename var2 income
rename var3 sex
You will see in the Variables window that the names of the variables have changed accordingly. The next step is to label the variables. Type and run the following three lines from your Do-file:
label variable id "Respondent Identification Code"
label variable income "Respondent basic income ()"
label variable sex "Sex of respondent"
You will see in the Variables window that the variables now have labels (you might need to widen the Variables window to see this: simply use your mouse to drag the edge of the Variables window until you can read the variable labels). Save the data in an appropriate folder by typing and running the save command in your Do-file. For example, if you wanted to save the file in the H:\My Documents folder (probably not a good idea if you are using a lab computer), you would type:
save "H:\My Documents\income_data.dta"
where dta is the extension used to identify the file as a Stata dataset.
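If you run the same Do-file a second time, the save command will stop with an error because the file already exists. A common way round this is the replace option, sketched below. The sketch assumes the built-in auto dataset and a hypothetical file name, auto_copy.dta, saved to the current working directory; substitute your own data and path.

```stata
* Minimal sketch: save a dataset so that the Do-file can be re-run.
* The auto dataset and the name "auto_copy.dta" are assumptions made
* for illustration only.
sysuse auto, clear

* Without ", replace" this line fails if auto_copy.dta already exists
save "auto_copy.dta", replace

* Re-open the saved file, discarding whatever data are in memory
use "auto_copy.dta", clear
```

Use replace with care: it overwrites the existing file without asking.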
You will see that the Variables window is now blank. Now open the Data-file you have just created: enter and run the use command in your Do-file. Depending on the folder in which you saved your data, the command will look something like:
use "H:\My Documents\income_data.dta", clear
On this occasion, you don't actually need the comma followed by the clear option, since you had already entered clear as a separate command prior to running the use command. Normally, however, you wouldn't run clear as a separate command but as an option at the end of the use command. The option only clears the data from memory, whereas the separate command can wipe a good deal more, depending on your version of Stata and on whether you type clear or clear all (scalars, matrices, Mata routines, and lots of other stuff you don't need to know about just now). If you had not cleared the data (either separately or as an option), Stata would have come up with an error message warning you that it did not open the data file because the data in memory would have been lost.
Now use the gen command to create a new variable called total_income which is the sum of overtime and basic income:
gen total_income = income + overtime
label variable total_income "Total Income ()"
Now run a simple frequency table for sex of respondent using the tab command:
tab sex
This should result in the following table appearing in the Results window:
     Sex of |
 respondent |      Freq.     Percent        Cum.
------------+-----------------------------------
     female |          2       40.00       40.00
This is a useful command because one of the options (typed after a comma) is to generate a series of dummy (binary) variables, one for each category of the variable in question. To do this for the sex variable, type:
tab sex, gen(sex_)
which should repeat the frequency table and create two new variables: sex_1, which equals 1 if the observation is female and zero otherwise, and sex_2, which equals 1 if the observation is male and zero otherwise. This is a most useful facility, particularly when one has a variable with many potential categories for which a separate dummy variable has to be created for each category (as is often the case when one needs to include the effect of a categorical variable in a regression equation).
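The same idea can be tried on a variable with known categories. The lines below are a sketch that assumes the built-in auto dataset, whose foreign variable has two categories (Domestic and Foreign); the stub foreign_ is simply a name chosen for illustration.

```stata
* Minimal sketch: create one dummy variable per category with tab, gen().
* Assumes the auto dataset shipped with Stata.
sysuse auto, clear

* Creates foreign_1 (=1 for the first category) and foreign_2 (=1 for the second)
tab foreign, gen(foreign_)

* Inspect the first few observations to check the coding
list foreign foreign_1 foreign_2 in 1/5

* Include one dummy in a regression, leaving the other as the reference category
regress price weight foreign_2
```

Only one of the two dummies is entered in the regression; entering both alongside the constant would make them perfectly collinear, and Stata would drop one.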
By adding the detail option, a more comprehensive list of descriptive statistics is revealed. Running the following command from your Do-file,
sum income overtime, detail
yields:
                 Respondent basic income ()
-------------------------------------------------------------
      Percentiles      Smallest
 1%        15845          15845
 5%        15845          20323
10%        15845          22000       Obs                   5
25%        20323          31000       Sum of Wgt.           5

50%        22000                      Mean            32733.6
                        Largest       Std. Dev.      23989.04
75%        31000          20323
90%        74500          22000       Variance       5.75e+08
95%        74500          31000       Skewness        1.31378
99%        74500          74500       Kurtosis       2.983174

To run descriptive statistics by category of another variable, such as income by gender, you can use the tab command with the sum() option: type tab, then the categorical variable, then a comma, then sum() with the continuous variable inside the brackets. For example, try entering:
tab sex, sum(income)
You should obtain the following table:

            | Summary of Respondent basic income
     Sex of |                 ()
 respondent |        Mean   Std. Dev.       Freq.
------------+------------------------------------
     female |     18922.5   4352.2422           2
       male |       41941   28697.839           3
------------+------------------------------------
      Total |     32733.6   23989.037           5
8. Creating a Graph
Type hist income to get a histogram of income:
[Figure: histogram of income; vertical axis: Density]
Type scatter income overtime to get a scatter plot of basic income and overtime income:
[Figure: scatter plot; vertical axis: Respondent basic income ()]
Enter and run graph bar (mean) total_income, over(sex) to get a bar chart of the mean income of respondents by sex:
[Figure: bar chart of mean total_income for the categories female and male]
9. Running a Regression
The syntax for running a regression is very simple: type regress followed by the dependent variable, followed by the independent variables (separated by spaces). Run a regression of overtime on basic income and sex using the following syntax:
regress overtime income sex_1
You should get a table of regression results that looks like the following:

      Source |       SS       df       MS              Number of obs =       5
-------------+------------------------------           F(  2,     2) =    4.77
       Model |  12447426.2     2  6223713.12           Prob > F      =  0.1734
    Residual |  2610573.77     2  1305286.88           R-squared     =  0.8266
-------------+------------------------------           Adj R-squared =  0.6533
       Total |    15058000     4     3764500           Root MSE      =  1142.5

------------------------------------------------------------------------------
    overtime |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |  -.0777339   .0279902    -2.78   0.109    -.1981659    .0426982
       sex_1 |   -3197.65   1225.908    -2.61   0.121    -8472.309    2077.008
       _cons |    5593.57    1346.56     4.15   0.053    -200.2087    11387.35
------------------------------------------------------------------------------
Now try running the regression only on males (the quotation marks are needed because sex holds text rather than numbers):
regress overtime income if sex == "male"
which should yield the following output:

      Source |       SS       df       MS
-------------+------------------------------
       Model |  10255839.8     1  10255839.8
    Residual |  2410826.85     1  2410826.85
-------------+------------------------------
       Total |  12666666.7     2  6333333.33

------------------------------------------------------------------------------
    overtime |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |  -.0789081   .0382577    -2.06   0.287    -.5650182    .4072021
       _cons |   5642.817   1837.999     3.07   0.200    -17711.18    28996.81
------------------------------------------------------------------------------
Now try running the original regression using White's standard errors (which give more reliable t-values when you have heteroskedasticity) by including the robust option. Run the following regression:
regress overtime income sex_1, robust

Linear regression                                      Number of obs =       5
                                                       F(  2,     2) =    6.38
                                                       Prob > F      =  0.1354
                                                       R-squared     =  0.8266
                                                       Root MSE      =  1142.5

------------------------------------------------------------------------------
             |               Robust
    overtime |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |  -.0777339   .0244834    -3.17   0.087    -.1830773    .0276096
       sex_1 |   -3197.65   1385.386    -2.31   0.147    -9158.487    2763.186
       _cons |    5593.57   1787.983     3.13   0.089    -2099.499    13286.64
------------------------------------------------------------------------------
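Notice that the coefficients are identical in the two sets of results and only the standard errors (and hence the t-values and confidence intervals) change. The sketch below repeats the comparison on a larger dataset. It assumes the built-in auto dataset; estat hettest, a Breusch-Pagan test for heteroskedasticity, is added here as one possible way of checking whether robust standard errors are likely to matter, and is not part of the original session.

```stata
* Minimal sketch: conventional versus robust standard errors.
* Assumes the auto dataset shipped with Stata.
sysuse auto, clear

* Conventional (OLS) standard errors
regress price weight

* Breusch-Pagan test for heteroskedasticity; run it after the
* conventional regression, not after the robust one
estat hettest

* Same coefficients, White (heteroskedasticity-robust) standard errors
regress price weight, robust
```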
10.
Remember to save changes to your Do-file (in the Do-file editor, click File, Save). Also, close the log-file before you exit Stata, otherwise the file will be lost. To close the log-file, simply type log close in the Command window and press <Enter>, or click File, Log, Close in the Results window.
2. Create two new variables:
   a. First, create a variable equal to the natural log of price:
      gen price_ln = ln(price)
   b. Now create a variable equal to the ratio of trunk to length and call this t_to_l_ratio.
3. Now label these two new variables and create summary statistics and histograms for all continuous variables in the data.
4. Create frequency tables and bar charts for categorical variables.
5. Create dummy variables for foreign and make.
6. Run a scatter plot of price on weight.
7. Run a regression of price on weight, and the dummies you have created.
8. Repeat for log price.
9. After running the regression, enter the following command: ereturn list. This command displays the results that Stata saves automatically following a regression (though note that the information is lost as soon as you run another regression or terminate your Stata session). You can access these saved scalars, matrices and macros in subsequent commands. This is very useful if, for example, you want to compute new variables or run tests that require this information.
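As an illustration of step 9, the sketch below shows one way the saved results might be used after a regression. It assumes the built-in auto dataset (loaded with sysuse auto) and a deliberately simple regression; the names r2_first and price_hat are chosen here for illustration only.

```stata
* Minimal sketch: using results saved by regress.
* Assumes the auto dataset shipped with Stata.
sysuse auto, clear
regress price weight

* Show everything Stata has stored from the regression
ereturn list

* Individual saved scalars can be used in expressions
display "R-squared    = " e(r2)
display "Observations = " e(N)

* Copy a result into a scalar so that it survives the next regression
scalar r2_first = e(r2)

* Build fitted values by hand from the stored coefficients
gen price_hat = _b[_cons] + _b[weight]*weight
```

Copying e(r2) into a scalar matters because, as noted in step 9, the e() results are overwritten as soon as another regression is run.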