Getting Started With Stata
Note: if you are going to use Stata outside Harvard's network, then you will also need to download
and install VPN Client (from the same download page). Every time you want to use Stata
from outside the network, you will need to first start VPN Client to connect to the Harvard servers
(use your FAS username and password in the prompts).
Installing
Once you have downloaded the two programs as outlined above, you need to install them. First
install the program KeyAccess. Click on the program FAS_k2Client.exe, click OK, click Next
through all the prompts in the window, and then click Install. This should only take a few seconds.
When asked, you do not need to restart your computer at this time.
Next you will need to install Stata. Open the program you just downloaded: Stata10.exe. Click
OK to confirm that you are a Harvard affiliate, and the install should begin automatically. After about 1 minute,
another window will pop open. Click Next three times, and the installation will continue. After about
30 seconds, a third prompt window will open. Click Next through these prompts, and set-up will
continue. After about 3-4 minutes, set-up should be complete. Click OK once the set-up finishes.
You can delete the two programs you downloaded (presumably on your desktop); these were used for
installation only. Now, restart your computer before starting Stata for the first time.
21 June 2009
Note: the first time you open Stata, it will ask if you want to install updates. Please make sure
you go through this process the first time, as it will remove some bugs from older versions of the
software. After you do this once, you will no longer need to update throughout the semester.
Notice that there are four main windows (along with the menus at the top):
1. Results: Stata prints any analysis output or communication (like error messages) here.
2. Command: The user enters commands here for Stata to run analysis.
3. Variables: This window lists the variables that have been entered into Stata.
4. Review: This window displays the commands that Stata has processed.
Data Entry
There are three main ways to bring data into Stata: by importing data created by another program or
editor, by manual entry, and by reading a data set that Stata has saved in its native format.
We will mainly be using the importing option in this class whenever you use a data set for
the first time. It is important to know about the manual entry technique, as it may be useful when
project time rolls around at the end of the semester. The tutorial that is part of problem set 1 describes
how to save and re-use data in Stata's native format, which is useful when you want to save your
progress when doing your homework.
Importing Data
Most datasets are stored as simple text files (with extensions .csv, .txt, .dat, or even .raw) which can
easily be imported into Stata. However, you will need to do a little bit of work to import an Excel file
that ends with .xls.
This tutorial uses the 2004_Election.csv data file found on the course website here (under the Stata
Information tab):
http://isites.harvard.edu/fs/docs/icb.topic553772.files/2004%20Election.csv
Save this file on your computer's desktop to start. To import, click on the menu File -> Import ->
ASCII data created by a spreadsheet. In the window that pops open, click on the Browse button, and
select the file you want. You may have to change "Files of Type" to Comma Separated Values (*.csv).
Click on OK, and the new variables should be entered into Stata's memory.
Note: If you get an error message like "you must start with an empty dataset", then the simplest fix is
to type clear in the Command window and press Enter. Be careful though, as this will remove any
old data floating around in Stata's memory.
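The menu steps above can also be typed as commands. The following is a minimal sketch, assuming the file was saved under the name 2004_Election.csv in Stata's current working directory (the file name and location are assumptions; adjust them to wherever you saved the file):

```stata
* Show the current working directory so you know where Stata looks for files
pwd

* Remove any data already in memory (this discards unsaved changes)
clear

* Read a comma-separated file whose first row holds the variable names
* (the file name below is an assumption; use the name you saved it under)
insheet using "2004_Election.csv", comma

* List the variables that were read in
describe
```

If Stata reports that the file cannot be found, the working directory shown by pwd is probably not the folder where the file was saved.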
Creating a .csv file from a .xls Excel File
Sometimes data will come as an Excel file ending in .xls. The easiest way to deal with this is to open
the file in Excel, and then save it as a .csv file. Once you have the file opened in Excel, go to the
menu: File -> Save As. In the window that pops open, change the option "Save as type" to
CSV (Comma delimited) (*.csv). Save the file in this format in an easy-to-reach location on your
computer (like the desktop). In the windows that pop open, click on OK and Yes; we really do want to
change this to a file that can be read into other software.
Manually Entering Data (use at your own risk)
Near the top of the Stata window, you will see a button that looks like a table/spreadsheet. If you click
on this button, the Data Editor window should open. Once this is open, you can simply click on a cell
and enter data, or just copy and paste data directly from Excel. After getting the data set up the way
you want it, click on the Preserve button to save your changes for later. You can close the Data Editor by clicking
on the close button, as with any window in MS-Windows (you must close this window to do any analysis).
3. Data Visualization
Once the dataset is read in, the main concern is how to manipulate the data. Problem set 1 discusses
how to use the Stata menus to produce simple graphs and summary statistics, so that material is not
reproduced here. This section of the tutorial discusses how to enter commands directly into the Stata
Command window. Here, we will learn to get summary statistics (think measures of center, spread,
etc.) and graph/plot our data. Except for complex commands, menu choices and direct commands
produce identical results.
Please note: some of the methods illustrated in this tutorial will not be used until the second week of
class or later.
Summary Statistics
To get some quick statistics on a quantitative variable, use the command summarize followed by the
list of variables you are interested in (menu: Statistics -> Summaries, Tables, and Tests ->
Summaries and Descriptive Statistics -> Display Additional Statistics; select bush_perc and gsp
in the variables line, and then click Submit). For example:
. summarize bush_perc gsp
When typing in any Stata commands, you can just copy and paste the commands from this tutorial, but
do not copy the dot at the beginning of each command (that's just from the Stata output screen).
Doing so will lead to an error. You will notice that when Stata prints results in the Results window, it
adds the dot at the beginning of the line to indicate a command it has just executed.
Unfortunately, the above does not give the median or quartiles. To get percentiles, you have to add
the option detail, like this (or highlight Display additional statistics in the Display
Additional Statistics menu from above):
. summarize bush_perc gsp, detail
To get frequencies of a categorical variable, use the command tabulate (menu: Statistics ->
Summaries, Tables, and Tests -> Summaries and Descriptive Statistics -> Tables -> One-way tables):
. tabulate region
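After summarize with the detail option, Stata keeps the percentiles it just computed, so individual values can be pulled out and combined. The sketch below assumes the election data are already loaded and that bush_perc is numeric; the choice of the median and interquartile range is only an illustration:

```stata
* Compute detailed statistics for one variable
summarize bush_perc, detail

* Show everything summarize stored from that run
return list

* Pull out the median and compute the interquartile range from stored results
display "Median: " r(p50)
display "IQR:    " r(p75) - r(p25)
```

The stored results refer to the most recent summarize command only, so run these lines immediately after it.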
Histograms
To get a quick view of the distribution of a variable, use the command hist (menu: Graphics
-> Histogram). Enter:
. hist bush_perc
A new window should pop up with a histogram of your chosen variable like this one:
[Figure: histogram of bush_perc; y-axis: Density]
Boxplots
To produce a boxplot of a variable, use the command graph box. (menu: Graphics
-> Box plots). Enter:
. graph box bush_perc
You can also split a boxplot into different categories. (menu: click on the Categories tab in the Box
plots menu window). For example, we can do:
. graph box bush_perc, over(region)
[Figure: boxplots of bush_perc by region; visible region labels include MW and NE]
Scatterplots
To get a quick visual of how two variables are related, use the command twoway. Note that the first
variable is the y-variable, and the second is the x-variable. (menu: Graphics -> Twoway graph (scatter, lines, etc.). In the window that opens, click Create, choose your
Y and X variables appropriately, and click Accept. Then in the next window, click Submit.)
. graph twoway scatter bush_perc gsp
To add the regression line to the scatterplot, add the command lfit like this (menu: from the Twoway window, create the scatterplot graphic above, and then click on Create to add a second plot. In this
plot window, click on the Fit plots option and choose the same X and Y variables as above. Click
Accept, and then in the graphics window there should be two plot objects. Click Submit and it should
add the line to the scatterplot):
. graph twoway (lfit bush_perc gsp) (scatter bush_perc gsp)
[Figure: scatterplot of bush_perc against gsp with the fitted regression line; legend: bush_perc, Fitted values]
4. Data Analysis
Next week, we will learn to measure and analyze the association between two variables (correlation
and regression). Later in the course, we will see many more ways to do analysis (confidence intervals,
hypothesis testing, ANOVA, etc.). Let's do some work on what we know for now:
Correlation
To find the correlation coefficient between two (or more) variables, use the command corr (menu:
Statistics -> Summaries, Tables, and Tests -> Summaries and
Descriptive Statistics -> Correlations and covariances):
. corr bush_perc gsp
Regression
To get the printout of a regression (to find the estimates for the slope and intercept of a line), use the
command regress (menu: Statistics -> Linear models and related ->
Linear regression):
. regress bush_perc gsp
The results should look like this:
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =    5.63
       Model |  .036676619     1  .036676619           Prob > F      =  0.0217
    Residual |  .312433678    48  .006509035           R-squared     =  0.1051
-------------+------------------------------           Adj R-squared =  0.0864
       Total |  .349110297    49    .0071247           Root MSE      =  .08068

------------------------------------------------------------------------------
   bush_perc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gsp |  -3.63e-06   1.53e-06    -2.37   0.022    -6.71e-06   -5.56e-07
       _cons |   .6795575   .0634325    10.71   0.000     .5520179     .807097
------------------------------------------------------------------------------
We often want to look at the residual plot for a regression to see if the assumptions are
met (to see if there is a pattern in the residuals, like a U-shape). To get the residual vs. fitted plot (the
fitted values being your predicted values ŷ_i), use the command rvfplot. Note that this command should be entered
directly following the regress command, since it refers back to it (menu: Statistics ->
Linear models and related -> Regression diagnostics -> Residual-versus-fitted plot):
. rvfplot
[Figure: residual-versus-fitted plot; y-axis: Residuals, x-axis: Fitted values]
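The pieces behind rvfplot can also be built by hand, which makes it clearer what the plot shows. This is only a sketch: it assumes the election data are loaded, the new variable names fitted and resid are hypothetical, and the gsp value of 40000 is an arbitrary illustration chosen from within the range of the data.

```stata
* Fit the regression; later commands refer back to this model
regress bush_perc gsp

* Store the fitted values (intercept + slope * gsp) in a new variable
predict fitted, xb

* Store the residuals (observed minus fitted) in a new variable
predict resid, residuals

* Plot residuals against fitted values with a reference line at zero
twoway scatter resid fitted, yline(0)

* Predicted bush_perc for a state with gsp = 40000, using stored coefficients
display _b[_cons] + _b[gsp]*40000
```

With the rounded coefficients printed in the regression output, the last line works out to about 0.53; Stata uses the unrounded estimates, so the displayed value may differ slightly.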
Practice Problem
S&P 500 Stock Index. This dataset can be downloaded here:
ichart.finance.yahoo.com/table.csv?s=%5EGSPC&a=00&b=3&c=1950&d=01&e=01&f=2009&g=m&ignore=.csv
For this problem we will be using the monthly S&P 500 Index prices since 1950. We are going to see
if we can predict the S&P 500 price by the volume traded that day.
Download the above file onto your desktop (I called it SP500.csv). Open Stata. Within Stata,
read in the file using the menu: File -> Import -> ASCII data created by a spreadsheet, as above.
Alternatively, you can copy and paste the data directly from Excel. Open the file in Excel,
copy the columns of interest, paste them into Stata's Data Editor, and then hit Preserve.
Once you have the file read into Stata correctly, we can start analyzing the data. Now type in the
commands into the command window (one at a time & without the dot):
. summarize close volume, detail
. summarize close volume

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       close |       709    366.6579    444.0311      17.05    1549.38
      volume |       709    4.39e+08    9.54e+08    1024300   7.23e+09

. hist close
. hist volume
[Figures: histogram of Close (y-axis: Density); histogram of Volume (y-axis: Density)]
a) What do you notice about the two variables we are interested in?
[Figure: scatterplot of Close against Volume with the fitted regression line; legend: Close, Fitted values]

Number of obs =     709
F(  1,   707) = 1004.21
Prob > F      =  0.0000
R-squared     =  0.5868
Adj R-squared =  0.5863
Root MSE      =  285.61

------------------------------------------------------------------------------
       close |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      volume |   3.56e-07   1.12e-08    31.69   0.000     3.34e-07    3.79e-07
       _cons |   210.1585   11.80873    17.80   0.000     186.9742    233.3429
------------------------------------------------------------------------------

. rvfplot
[Figure: residual-versus-fitted plot; y-axis: Residuals, x-axis: Fitted values]
c) What is the predicted closing price for a day that had a billion (10^9) shares traded? What about for
3.12 billion (3.12*10^9)?
d) What are our results? Does this seem to be a good fit? How do you know?
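The whole practice problem can be run as a short sequence of commands. This is a sketch under stated assumptions: the file is saved as SP500.csv in Stata's working directory, and the imported variables are named close and volume, as they appear in the output above. The commands for the scatterplot and regression are inferred from that output rather than quoted from it.

```stata
* Start from an empty dataset and read the downloaded file
* (file name and location are assumptions)
clear
insheet using "SP500.csv", comma

* Summary statistics and distributions of the two variables
summarize close volume, detail
histogram close
histogram volume

* Scatterplot with the fitted line, then the regression itself
twoway (lfit close volume) (scatter close volume)
regress close volume

* Residual-versus-fitted plot for the regression just run
rvfplot

* Predicted closing price at 1 billion and at 3.12 billion shares traded
display _b[_cons] + _b[volume]*1e9
display _b[_cons] + _b[volume]*3.12e9
```

Each display line evaluates intercept + slope * volume using the coefficients stored by the most recent regress, so it must come after that command.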