Introduction To STATA
About STATA
Basic Operations
Regression Analysis
Panel Data Analysis
About
STATA provides commands to analyze panel data (cross-sectional time-series, longitudinal, repeated-measures, and
correlated data), cross-sectional data, time-series data,
survival-time data, and cohort-study data, among others.
Getting ready
Basic Operations
Entering Data
Exploring Data
Modifying Data
Managing Data
Analyzing Data
Entering Data
Insheet: Read ASCII (text) data created by a spreadsheet (.csv files only)
Save: Store the dataset currently in memory on disk in Stata data format
Example
cd u:\stata
dir
insheet using hs0.csv (if the file has variable names on the first line)
save hs
insheet gender id race ses schtyp prgtype read write math science socst using hs0_noname.csv, clear (if the file doesn't have variable names on the first line)
count
describe
compress
clear
use hs, clear (only for files in Stata format; can also be used over the internet)
memory
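
In newer Stata releases (13 and later), insheet has been superseded by import delimited. The following is a minimal sketch of the same read-and-save workflow. It assumes the auto dataset that ships with Stata in place of hs0.csv, and the file names auto_small.csv and auto_small are hypothetical.

```stata
* Assumption: Stata's bundled auto data stands in for the hs0.csv file
sysuse auto, clear
* write a small comma-separated file so there is something to read back in
export delimited make price mpg using auto_small.csv, replace
* read the CSV, taking variable names from the first line
import delimited using auto_small.csv, varnames(1) clear
count
describe
compress
* store the data in Stata format
save auto_small, replace
```

The save step is what makes the data available to a later use command.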
Exploring Data
Correlate: Correlations
Example
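
A minimal sketch of exploring a dataset with correlate, assuming the auto dataset that ships with Stata rather than the hs data used elsewhere in these notes (the scalar name rho_pm is illustrative):

```stata
* Assumption: Stata's bundled auto data stands in for the hs data
sysuse auto, clear
describe price mpg weight
summarize price mpg weight
* correlation matrix, using only observations complete on all three variables
correlate price mpg weight
* r(rho) holds the correlation between the first two variables listed
scalar rho_pm = r(rho)
display "correlation of price and mpg = " rho_pm
* pairwise correlations with observation counts and significance levels
pwcorr price mpg weight, obs sig
```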
Modifying Data
Egen: Extended generate - has special functions that can be used when
creating a new variable
Example
use hs0
order id gender
label variable schtyp "The type of school the student attended."
label define scl 1 public 2 private
label values schtyp scl
codebook schtyp
list schtyp in 1/10
list schtyp in 1/10, nolabel
encode prgtype, gen(prog) (create a new numeric version of the
string variable prgtype)
label variable prog "The type of program in which the student was enrolled."
codebook prog
list prog in 1/10
list prog in 1/10, nolabel
Example (cont)
rename gender female (easier to work with, since the variable name now says what the 0s and 1s mean)
codebook female
generate total = read + write + socst (create the total before labeling it)
label variable total "The total of read, write and socst."
recode race 5 = .
sum total
codebook total
save hs1
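
The special functions that egen offers can be sketched as follows. This is an illustration only, assuming the auto dataset that ships with Stata; the new variable names (cargo, mean_mpg, mean_mpg_grp, zprice) are hypothetical.

```stata
* Assumption: Stata's bundled auto data stands in for the hs data
sysuse auto, clear
* row-wise function: sum across variables within each observation
egen cargo = rowtotal(trunk headroom)
* column-wise function: the overall mean of mpg, repeated on every row
egen mean_mpg = mean(mpg)
* group-wise: the mean of mpg within each level of foreign
bysort foreign: egen mean_mpg_grp = mean(mpg)
* standardized (z-score) version of price
egen zprice = std(price)
summarize cargo mean_mpg mean_mpg_grp
summarize zprice
```

A plain generate could compute the row total, but the group-wise mean and the z-score need egen.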
Managing Data
cd Change directory
Example
We take the hs1 data file, make a separate folder called honors, and store
there a copy of our data containing only the students with reading scores of 60 or
higher.
pwd
dir
ls
cd honors
describe
summarize read
drop ses
describe
list in 1/20
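
The steps described above (make a folder, keep a subset, save a copy) can be sketched as follows. This assumes the auto dataset that ships with Stata, with mpg >= 25 playing the role of read >= 60; the file name auto_high is hypothetical. Note that Stata treats missing values as larger than any number, so with a variable that has missing values a condition such as read >= 60 would also keep the missing cases unless !missing(read) is added.

```stata
* Assumption: Stata's bundled auto data stands in for hs1
sysuse auto, clear
pwd
* capture suppresses the error if the folder already exists
capture mkdir honors
* keep only the observations that meet the condition
keep if mpg >= 25 & !missing(mpg)
* drop a variable we do not need in the copy
drop rep78
describe
summarize mpg
save honors/auto_high, replace
```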
Analyzing Data
Ttest: t-test
Regress: Regression
Predict: Predicts after model estimation
Kdensity: Kernel density estimates and graphs
Pnorm: Graphs a standardized normal plot
Qnorm: Graphs a quantile plot
Rvfplot: Graphs a residual versus fitted plot
Rvpplot: Graphs a residual versus individual predictor plot
Xi: Creates dummy variables during model estimation
Test: Test linear hypotheses after model estimation
Oneway: One-way analysis of variance
Anova: Analysis of variance
Logistic: Logistic regression
Logit: Logistic regression
Example
ttest write = 50 (This is the one-sample t-test, testing whether the sample of
writing scores was drawn from a population with a mean of 50 )
ttest write = read (This is the paired t-test, testing whether or not the mean of
write equals the mean of read)
ttest write, by(female) (This is the two-sample independent t-test with pooled
(equal) variances)
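
When the equal-variance assumption is doubtful, the two-sample test can be run with the unequal option. A minimal sketch, assuming the auto dataset that ships with Stata rather than the hs data:

```stata
* Assumption: Stata's bundled auto data stands in for the hs data
sysuse auto, clear
* one-sample test: is mean mpg equal to 20?
ttest mpg == 20
* two-sample test with pooled (equal) variances
ttest mpg, by(foreign)
* two-sample test allowing unequal variances
ttest mpg, by(foreign) unequal
```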
Example (cont)
regress write read female, robust (we run the regression with robust
standard errors. This is very useful when there is heterogeneity of
variance. This option does not affect the estimates of the regression
coefficients.)
predict r, resid (When using the resid option the predict command
calculates the residual)
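
The claim that robust standard errors leave the coefficients unchanged can be checked directly. The sketch below assumes the auto dataset that ships with Stata and uses vce(robust), the current spelling of the robust option; the scalar names are illustrative. It then passes the residuals to some of the diagnostic commands listed earlier.

```stata
* Assumption: Stata's bundled auto data stands in for the hs data
sysuse auto, clear
regress price mpg weight
scalar b_ols = _b[mpg]
scalar se_ols = _se[mpg]
regress price mpg weight, vce(robust)
scalar b_rob = _b[mpg]
scalar se_rob = _se[mpg]
* same coefficient, different standard error
display "coefficient: " b_ols " vs " b_rob
display "std. error:  " se_ols " vs " se_rob
* residuals from the most recent regression
predict r, residuals
* residual diagnostics
kdensity r, normal
qnorm r
rvfplot
```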
Example (cont)
xi: regress write read i.prog (The xi prefix is used to dummy code categorical
variables such as prog. The predictor prog has three levels and requires two
dummy-coded variables)
test _Iprog_2 _Iprog_3 (The test command is used to test the collective effect of
the two dummy-coded variables; in other words, it tests the main effect of prog)
xi: regress write i.prog*read (create dummy variables for prog and for the
interaction of prog and read)
test _IproXread_2 _IproXread_3 (tests the overall interaction)
test _Iprog_2 _Iprog_3 (tests the main effect of prog)
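
Since Stata 11, factor-variable notation can do the same job without the xi prefix. A minimal sketch, assuming the auto dataset that ships with Stata, with rep78 (five levels) standing in for prog; the scalar name df_main is illustrative.

```stata
* Assumption: Stata's bundled auto data stands in for the hs data
sysuse auto, clear
* i.rep78 creates the indicator variables on the fly
regress price mpg i.rep78
* joint test of all rep78 indicators (the main effect of rep78)
testparm i.rep78
scalar df_main = r(df)
* ## adds both main effects and their interaction; c. marks mpg as continuous
regress price c.mpg##i.rep78
* overall test of the interaction
testparm i.rep78#c.mpg
* test of the main effect of rep78
testparm i.rep78
```

With factor variables, testparm picks up the indicator terms by name, so there is no need to type generated names such as _Iprog_2.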
gen honcomp = write >= 60 (create a dichotomous variable called honcomp
(honors composition) to use as our dependent variable)
tab honcomp
The logistic command defaults to producing the output in odds ratios but can
display the coefficients if the coef option is used. The exact same results can be
obtained by using the logit command, which produces coefficients as the default
but will display the odds ratio if the or option is used:
logit honcomp read female
logit honcomp read female, or
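
The equivalence of the two commands can be sketched as follows, assuming the auto dataset that ships with Stata, with foreign as the dichotomous dependent variable in place of honcomp (the scalar names are illustrative):

```stata
* Assumption: Stata's bundled auto data stands in for the hs data
sysuse auto, clear
* odds ratios by default
logistic foreign mpg weight
scalar b_logistic = _b[mpg]
* same model, displayed as coefficients
logistic foreign mpg weight, coef
* coefficients by default
logit foreign mpg weight
scalar b_logit = _b[mpg]
* same model, displayed as odds ratios
logit foreign mpg weight, or
* an odds ratio is the exponentiated coefficient
display "odds ratio for mpg = " exp(_b[mpg])
```

Both commands fit the same model; only the default display differs.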
Logistic Regression
Classical Regression vs Logistic Regression
All of the previous regression examples have used continuous dependent variables.
With a dichotomous dependent variable, two assumptions of classical regression fail:
The population means of the dependent variable at each level of the independent
variable are not on a straight line, i.e., no linearity.
The variance of the errors is not constant, i.e., no homogeneity of variance.
Logistic Regression - 2
Logit:
Consider admission into a graduate program in which 70% of the males and 30% of the
females are admitted.
Let P equal the probability of being admitted and Q = 1 - P the probability of not being admitted.
The odds of a male being admitted are odds(M) = P/Q = P/(1-P) = .7/.3 = 2.3333
The odds of a female being admitted are odds(F) = P/Q = P/(1-P) = .3/.7 = .42857
The odds ratio is odds(M)/odds(F) = 2.3333/.42857 = 5.44, so the odds of being admitted to the program are about 5.44 times as large for males as
for females.
The logit is the natural log of the odds, ln(P/(1-P)). In effect, this represents a transformation of the dependent variable such that the
resulting logistic regression equation better meets the assumptions of linearity,
normality and homogeneity of variance.
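
The arithmetic of the admission example can be written out as a worked equation. The subscripts M and F on P are notation introduced here for illustration; the logit values follow from taking natural logs of the odds.

```latex
\begin{align*}
\text{odds}(M) &= \frac{P_M}{1-P_M} = \frac{0.7}{0.3} \approx 2.3333 \\
\text{odds}(F) &= \frac{P_F}{1-P_F} = \frac{0.3}{0.7} \approx 0.42857 \\
\text{OR} &= \frac{\text{odds}(M)}{\text{odds}(F)} = \frac{2.3333}{0.42857} \approx 5.44 \\
\operatorname{logit}(P) &= \ln\frac{P}{1-P} \\
\operatorname{logit}(P_M) &= \ln(2.3333) \approx 0.8473 \\
\operatorname{logit}(P_F) &= \ln(0.42857) \approx -0.8473 \\
\ln(\text{OR}) &= \operatorname{logit}(P_M) - \operatorname{logit}(P_F) \approx 1.6946
\end{align*}
```

The last line shows why a logit coefficient for a 0/1 predictor is the log of the odds ratio.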
Interpreting logit coefficients:
Logistic slope coefficients can be interpreted as the effect of a one-unit change in the X
variable on the predicted logits with the other variables in the model held constant. That
is, how a one-unit change in X affects the log of the odds when the other variables in the
model are held constant.
Interpreting Odds Ratios:
Odds ratios in logistic regression can be interpreted as the multiplicative effect of a one-unit change
in X on the predicted odds with the other variables in the model held constant.
Example 1: Categorical Independent Variable
lstat
Do file
Do-files are created with the do-file editor or any other text editor. Any
command which can be executed from the command line can be placed in a do-file.
To open the do-file editor: Window > Do-file Editor, or Ctrl + 8
set more off
use hsb2, clear
generate lang = read + write
label variable lang "language score"
tabulate lang
tabulate lang female
tabulate lang prog
tabulate lang schtyp
summarize lang, detail
table female, contents(n lang mean lang sd lang)
table prog, contents(n lang mean lang sd lang)
table ses, contents(n lang mean lang sd lang)
correlate lang math science socst
regress lang math science female
set more on
Do file cont.
To look at the commands that a do-file contains:
type hsbbatch.do
To run the do-file:
do hsbbatch
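
A do-file is often paired with a log file so that the output is kept alongside the commands. A minimal sketch, assuming the auto dataset that ships with Stata; the file name example and the variable name ratio are hypothetical.

```stata
* example.do: hypothetical do-file name
* Assumption: Stata's bundled auto data stands in for hsb2
* close any log left open by an earlier run
capture log close
* record all output in a plain-text log
log using example, replace text
set more off
sysuse auto, clear
generate ratio = price / weight
label variable ratio "price per pound"
summarize ratio, detail
regress price mpg weight
log close
```

Starting with capture log close lets the do-file be rerun after an error without complaining that a log is already open.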
Panel Data
Create the do-file as follows:
sort group
ttest pre, by(group) /* check to see if the groups differ on the pretest depression score */
hotel dep1 dep2 dep3 dep4 dep5 dep6, by(group) /* There isn't much of a difference
between groups on the pretest, so let's try a Hotelling's T2.
Using Hotelling's T2 we find a significant difference between the two groups. The T2 did not
make use of any of the information concerning the pretest, but that's okay for the moment,
especially since we know that the pretest differences were not significant. */
xtgee dep pre group visit, fam(gaus) link(iden) i(subj) t(visit) corr(ind) /*The three
previous analyses provide identical incorrect results.
The common thread among them is that they all assume that the observations within the
subjects are independent. This seems, on the face of it, to be highly unlikely. Scores on the
depression scale are not likely to be independent from one visit to the next.
Of the three, only xtgee makes the assumption concerning the correlations explicit.*/
xtsum dep
Panel data 2
/* We can analyze these data using compound symmetry for the correlational structure.
This approach can be tried using exchangeable for the correlation matrix in xtgee. */
xtgee dep pre group visit, fam(gaus) link(iden) i(subj) t(visit) corr(exc)
xtcorr
/*Note in particular the change in the standard errors between this analysis and the
previous one.
Now let's try a different correlation structure, auto regressive with lag one.*/
xtgee dep pre group visit, fam(gaus) link(iden) i(subj) t(visit) corr(ar1)
/*back up and reconsider the group by visit interaction.
We will try a model with the interaction using the ar1 correlations. */
generate gxv = group*visit
xtgee dep pre group visit gxv, fam(gaus) link(iden) i(subj) t(visit) corr(ar1)
/* The group by visit interaction still is not significant even though this may be a better
approach for testing it.
So far we have been treating visit as a continuous variable.
Is it possible that our analysis might change if we were to treat visit as a categorical
variable, the way that the anova did?
Let's try one last analysis using xi to create dummy variables on-the-fly. */
xi: xtgee dep pre group i.visit, fam(gaus) link(iden) i(subj) t(visit) corr(ar1)
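
The do-file above uses the data in wide form (dep1 to dep6) for hotel and in long form (dep with visit) for xtgee, so a reshape is needed in between. The sketch below shows that step on simulated data. Everything about the data (20 subjects, the seed, the means and standard deviations) is an assumption for illustration and not the depression data from the slides. It uses xtset and estat wcorrelation, the current equivalents of the i() and t() options and of xtcorr.

```stata
* Assumption: simulated data, not the depression data from the slides
clear
set seed 12345
set obs 20
generate subj = _n
generate group = mod(_n, 2)
generate pre = rnormal(20, 4)
* a subject-level effect makes scores within a subject correlated
generate u = rnormal(0, 2)
forvalues v = 1/6 {
    generate dep`v' = pre + u - `v' + rnormal(0, 3)
}
drop u
* wide to long: one row per subject per visit
reshape long dep, i(subj) j(visit)
* declare the panel and time variables once
xtset subj visit
* independent working correlation
xtgee dep pre group visit, family(gaussian) link(identity) corr(independent)
* exchangeable (compound symmetry) working correlation
xtgee dep pre group visit, family(gaussian) link(identity) corr(exchangeable)
* estimated within-subject correlation matrix
estat wcorrelation
```

Comparing the standard errors of the two xtgee runs shows the effect of the working correlation structure.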
The help command can be used from the command line or from the Help
window. To use help the command must be spelled correctly and the full
name of the command must be used. help contents will list all
commands that can be accessed using help
help if
help anova
help regress
The search command searches for information in Stata manuals, FAQs,
and Stata Technical Bulletins (STBs). The search options include: manual,
which restricts searches to the Stata manuals; author, for searching for
an author by name; stb, which restricts searches to STBs; and faq, which
restricts searches to FAQs. The search command can be used from either
the command line or the Help window.
search if
search regression
search ttest, manual
Each copy of Stata comes with a built-in tutorial. Typing tutorial brings
up information about the tutorials. tutorial regress will bring up the
tutorial on regression.
tutorial
tutorial regress
End of Session