Overview of Stata Estimation Commands
Contents
26.1 Introduction
26.2 Means, proportions, and related statistics
26.3 Linear regression with simple error structures
26.4 Structural equation modeling (SEM)
26.5 ANOVA, ANCOVA, MANOVA, and MANCOVA
26.6 Generalized linear models
26.7 Binary-outcome qualitative dependent-variable models
26.8 ROC analysis
26.9 Conditional logistic regression
26.10 Multiple-outcome qualitative dependent-variable models
26.11 Count dependent-variable models
26.12 Exact estimators
26.13 Linear regression with heteroskedastic errors
26.14 Stochastic frontier models
26.15 Regression with systems of equations
26.16 Models with endogenous sample selection
26.17 Models with time-series data
26.18 Panel-data models
26.18.1 Linear regression with panel data
26.18.2 Censored linear regression with panel data
26.18.3 Generalized linear models with panel data
26.18.4 Qualitative dependent-variable models with panel data
26.18.5 Count dependent-variable models with panel data
26.18.6 Random-coefficients model with panel data
26.19 Multilevel mixed-effects models
26.20 Survival-time (failure-time) models
26.21 Treatment-effect models
26.22 Generalized method of moments (GMM)
26.23 Estimation with correlated errors
26.24 Survey data
26.25 Multiple imputation
26.26 Multivariate and cluster analysis
26.27 Pharmacokinetic data
26.28 Specification search tools
26.29 Power and sample-size analysis
26.30 Obtaining new estimation commands
26.31 References
26.1 Introduction
Estimation commands fit models such as linear regression and probit. Stata has many such
commands, so it is easy to overlook a few. Some of these commands differ greatly from each other,
others are gentle variations on a theme, and still others are equivalent to each other.
Estimation commands share features that this chapter will not discuss; see [U] 20 Estimation and
postestimation commands. Especially see [U] 20.21 Obtaining robust variance estimates, which
discusses an alternative calculation for the estimated variance matrix (and hence standard errors) that
many of Stata’s estimation commands provide, and [U] 20.12 Performing hypothesis tests on the
coefficients.
Here, however, this chapter will put aside all of that — and all issues of syntax — and deal solely
with matching commands to their statistical concepts. This chapter will not cross-reference specific
commands. To find the details on a particular command, look up its name in the index.
26.3 Linear regression with simple error structures

Consider the model yj = xj β + εj for a continuous y variable. In this category, estimation is
restricted to when σ², the variance of εj, is constant across observations j. The model is called the
linear regression model, and the estimator is often called the
(ordinary) least-squares (OLS) estimator.
regress is Stata’s linear regression command. (regress produces the robust estimate of variance
as well as the conventional estimate, and regress has a collection of commands that can be run
after it to explore the nature of the fit.)
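As a minimal sketch, assuming a dataset in memory with outcome y and regressors x1 and x2
(hypothetical names), the conventional and robust-variance fits are

        regress y x1 x2                  // linear regression, conventional standard errors
        regress y x1 x2, vce(robust)     // same point estimates, robust (Huber/White) standard errors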
Also, the following commands will do linear regressions, as does regress, but offer special
features:
1. ivregress fits models in which some of the regressors are endogenous, using either instrumental
variables or generalized method of moments (GMM) estimators.
2. areg fits models yj = xj β + dj γ + εj , where dj is a mutually exclusive and exhaustive dummy
variable set. areg obtains estimates of β (and associated statistics) without ever forming dj ,
meaning that it also does not report the estimated γ. If your interest is in fitting fixed-effects
models, Stata has a better command—xtreg—discussed in [U] 26.18.1 Linear regression with
panel data. Most users who find areg appealing will probably want to use xtreg because
it provides more useful summary and test statistics. areg duplicates the output that regress
would produce if you were to generate all the dummy variables. This means, for instance, that
the reported R2 includes the effect of γ.
3. boxcox obtains maximum likelihood estimates of the coefficients and the Box–Cox transform
parameters in a model of the form
yj(θ) = β0 + β1 x1j(λ) + β2 x2j(λ) + · · · + βk xkj(λ) + γ1 z1j + γ2 z2j + · · · + γl zlj + εj
where εj ∼ N(0, σ²). Here the y is subject to a Box–Cox transform with parameter θ. Each of
the x1 , x2 , . . . , xk is transformed by a Box–Cox transform with parameter λ. The z1 , z2 , . . . , zl
are independent variables that are not transformed. In addition to the general form specified
above, boxcox can fit three other versions of this model defined by the restrictions λ = θ,
λ = 1, and θ = 1.
4. tobit allows estimation of linear regression models when yj has been subject to left-censoring,
right-censoring, or both. Say that yj is not observed if yj < 1,000, but for those observations,
it is known that yj < 1,000. tobit fits such models.
ivtobit does the same but allows for endogenous regressors.
5. intreg (interval regression) is a generalization of tobit. In addition to allowing open-ended
intervals, intreg allows closed intervals. Rather than observing yj , it is assumed that y0j and
y1j are observed, where y0j ≤ yj ≤ y1j . Survey data might report that a subject’s monthly
income was in the range $1,500–$2,500. intreg allows such data to be used to fit a regression
model. intreg allows y0j = y1j and so can reproduce results reported by regress. intreg
allows y0j to be −∞ and y1j to be +∞ and so can reproduce results reported by tobit.
6. truncreg fits the regression model when the sample is drawn from a restricted part of the
population and so is similar to tobit, except that here the independent variables are not
observed. Under the normality assumption for the whole population, the error terms in the
truncated regression model have a truncated-normal distribution.
7. cnsreg allows you to place linear constraints on the coefficients.
8. eivreg adjusts estimates for errors in variables.
9. nl provides the nonlinear least-squares estimator of yj = f (xj , β) + εj .
10. rreg fits robust regression models, which are not to be confused with regression with robust
standard errors. Robust standard errors are discussed in [U] 20.21 Obtaining robust vari-
ance estimates. Robust regression concerns point estimates more than standard errors, and it
implements a data-dependent method for downweighting outliers.
11. qreg produces quantile regression estimates, a variation that is not linear regression at all but
is an estimator of yj = xj β + εj . In the basic form of this model, sometimes called median
regression, xj β measures not the predicted mean of yj conditional on xj , but its median. As
such, qreg is of most interest when εj does not have constant variance. qreg allows you to
specify the quantile, so you can produce linear estimates for the predicted 1st, 2nd, . . . , 99th
percentile; see the sketch after this list.
Another command, bsqreg, is identical to qreg but presents bootstrap standard errors.
The sqreg command estimates multiple quantiles simultaneously; standard errors are obtained
via the bootstrap.
The iqreg command estimates the difference between two quantiles; standard errors are obtained
via the bootstrap.
12. vwls (variance-weighted least squares) produces estimates of yj = xj β + εj , where the variance
of εj is calculated from group data or is known a priori. vwls is therefore of most interest to
categorical-data analysts and physical scientists.
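To illustrate item 11 above, here is a sketch of quantile regression with hypothetical variables y,
x1, and x2:

        qreg y x1 x2                                     // median (0.5 quantile) regression
        qreg y x1 x2, quantile(0.25)                     // regression for the 25th percentile
        sqreg y x1 x2, quantiles(.25 .5 .75) reps(100)   // several quantiles at once, bootstrap standard errors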
26.6 Generalized linear models

The generalized linear model is
g{E(yj )} = xj β,  yj ∼ F
where g() is called the link function and F is a member of the exponential family, both of which
you specify before estimation. glm fits this model.
The GLM framework encompasses a surprising array of models known by other names, including
linear regression, Poisson regression, exponential regression, and others. Stata provides dedicated
estimation commands for many of these. Stata has, for instance, regress for linear regression,
poisson for Poisson regression, and streg for exponential regression, and that is not all of the
overlap.
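For instance, with hypothetical variables y, x1, and x2, these glm calls fit the same models as the
dedicated commands:

        glm y x1 x2, family(gaussian) link(identity)   // same model as:  regress y x1 x2
        glm y x1 x2, family(poisson)  link(log)        // same model as:  poisson y x1 x2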
glm by default uses maximum likelihood estimation and alternatively estimates via iteratively
reweighted least squares (IRLS) when the irls option is specified. For each family, F , there is
a corresponding link function, g(), called the canonical link, for which IRLS estimation produces
results identical to maximum likelihood estimation. You can, however, match families and link func-
tions as you wish, and, when you match a family to a link function other than the canonical link,
you obtain a different but valid estimator of the standard errors of the regression coefficients. The
estimator you obtain is asymptotically equivalent to the maximum likelihood estimator, which, in
small samples, produces slightly different results.
For example, the canonical link for the binomial family is logit. glm, irls with that combination
produces results identical to the maximum-likelihood logit (and logistic) command. The binomial
family with the probit link produces the probit model, but probit is not the canonical link here. Hence,
glm, irls produces standard error estimates that differ slightly from those produced by Stata’s
maximum-likelihood probit command.
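A sketch of that comparison, with a hypothetical binary outcome y and regressors x1 and x2:

        logit  y x1 x2                                     // maximum-likelihood logit
        glm y x1 x2, family(binomial) link(logit) irls     // canonical link: identical results
        probit y x1 x2                                     // maximum-likelihood probit
        glm y x1 x2, family(binomial) link(probit) irls    // same coefficients, slightly different standard errors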
Many researchers feel that the maximum-likelihood standard errors are preferable to IRLS estimates
(when they are not identical), but they would have a difficult time justifying that feeling. Maximum
likelihood probit is an estimator with (solely) asymptotic properties; glm, irls with the binomial
family and probit link is an estimator with (solely) asymptotic properties, and in finite samples, the
standard errors differ a little.
Still, we recommend that you use Stata’s dedicated estimators whenever possible. IRLS — the
theory — and glm, irls — the command — are all-encompassing in their generality, meaning that
they rarely use the right jargon or provide things in the way you wish they would. The narrower
commands, such as logit, probit, and poisson, focus on the issue at hand and are invariably
more convenient.
glm is useful when you want to match a family to a link function that is not provided elsewhere.
glm also offers several estimators of the variance–covariance matrix that are consistent, even when
the errors are heteroskedastic or autocorrelated. Another advantage of a glm version of a model
over a model-specific version is that many of these VCE estimators are available only for the glm
implementation. You can also obtain the ML–based estimates of the VCE from glm.
Related to logit, the skewed logit model adds a power to the logit link function; it is fit by Stata’s
scobit command.
Turning to probit, you have two choices: probit and ivprobit. probit fits a maximum-likelihood
probit model. ivprobit fits a probit model where one or more of the regressors are endogenously
determined.
Stata also provides bprobit and gprobit. The bprobit command is a maximum likelihood
estimator — equivalent to probit — but works with grouped data in which each observation records
the number of positive outcomes and the number of trials. gprobit is the weighted-regression,
grouped-data estimator.
Continuing with probit: hetprobit fits heteroskedastic probit models. In these models, the variance
of the error term is parameterized.
heckprobit fits probit models with sample selection.
Also, Stata’s biprobit command fits bivariate probit models, meaning two correlated outcomes.
biprobit also fits partial-observability models in which only the outcomes (0, 0) and (1, 1) are
observed.
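As minimal sketches of these probit-family commands, using hypothetical variable names:

        probit y x1 x2                       // standard probit
        ivprobit y x1 (x2 = z1 z2)           // x2 endogenous, instrumented by z1 and z2
        hetprobit y x1 x2, het(z1)           // error variance modeled as a function of z1
        biprobit y1 y2 x1 x2                 // two correlated binary outcomes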
rocgold performs tests of equality of ROC area against a “gold standard” ROC curve and can
adjust significance levels for multiple tests across classifiers via Šidák’s method.
rocreg performs ROC regression; it can adjust both sensitivity and specificity for prognostic factors
such as age and gender; it is by far the most general of all the ROC commands.
rocregplot graphs ROC curves as modeled by rocreg. ROC curves may be drawn across covariate
values, across classifiers, and across both.
See [R] roc.
In the context denoted by the name conditional logistic regression — mentioned above — subjects
are members of pools, and one or more are chosen, typically to be infected by some disease or to
have some other unfortunate event befall them. Thus the characteristics of the chosen and not chosen
are known, and the issue of the characteristics of the chooser never arises. Either way, it is the same
model.
In their choice-model interpretations, mlogit and clogit assume that the odds ratios are inde-
pendent of other alternatives, known as the independence of irrelevant alternatives (IIA) assumption.
This assumption is often rejected by the data, and the nested logit model relaxes it.
nlogit is also popular for fitting the random utility choice model.
asmprobit is for use with outcomes that have no natural ordering and with regressors that are
alternative specific. It is weakly related to mlogit. Unlike mlogit, asmprobit does not assume the
IIA.
mprobit is also for use with outcomes that have no natural ordering but with models that do not
have alternative-specific regressors.
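For concreteness, minimal sketches with hypothetical variables (choice is an unordered outcome;
caseid identifies matched groups):

        mlogit choice x1 x2                  // multinomial logit; assumes IIA
        mprobit choice x1 x2                 // multinomial probit without alternative-specific regressors
        clogit chosen x1 x2, group(caseid)   // conditional logit within matched groups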
26.11 Count dependent-variable models

These models concern dependent variables that count the number of occurrences of an event. For
Poisson regression, the model is
E(count) = Ej exp(xj β)
where Ej is the exposure time. poisson fits this model; see [R] poisson. There is also an exact
Poisson estimator; see [U] 26.12 Exact estimators. ivpoisson fits a Poisson model where one or
more of the regressors are endogenously determined. It can also be used for modeling nonnegative
continuous outcomes instead of counts. See [R] ivpoisson.
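A brief sketch, assuming a count outcome count, an exposure variable time, and hypothetical
regressors:

        poisson count x1 x2, exposure(time) irr   // Poisson regression; report incidence-rate ratios
        ivpoisson gmm count x1 (x2 = z1 z2)       // x2 endogenous, GMM estimator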
Negative binomial regression refers to estimating with data that are a mixture of Poisson counts.
One derivation of the negative binomial model is that individual units follow a Poisson regression
model but there is an omitted variable that follows a gamma distribution with parameter α. Negative
binomial regression estimates β and α. nbreg fits such models. A variation on this, unique to Stata,
allows you to model α. gnbreg fits those models. See [R] nbreg.
Truncation refers to count models in which the outcome count variable is observed only above a
certain threshold. In truncated data, the threshold is typically zero. Commands tpoisson and tnbreg
fit such models; see [R] tpoisson and [R] tnbreg.
Zero inflation refers to count models in which the number of zero counts is more than would
be expected in the regular model. The excess zeros are explained by a preliminary probit or logit
process. If the preliminary process produces a positive outcome, the usual counting process occurs,
and otherwise the count is zero. Thus whenever the preliminary process produces a negative outcome,
excess zeros are produced. The zip and zinb commands fit such models; see [R] zip and [R] zinb.
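As a sketch with hypothetical variables, where z1 and z2 are thought to drive the excess zeros:

        zip  count x1 x2, inflate(z1 z2)     // zero-inflated Poisson
        zinb count x1 x2, inflate(z1 z2)     // zero-inflated negative binomial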
26.14 Stochastic frontier models

Stochastic production or cost frontier models may be written as
yi = xi β + vi − s ui
where vi is the idiosyncratic error, ui ≥ 0 is a one-sided inefficiency term, and s = 1 for production
functions or s = −1 for cost functions. frontier fits such models.
26.15 Regression with systems of equations

Consider a system of m linear equations,
y1j = x1j β1 + ε1j
y2j = x2j β2 + ε2j
. . .
ymj = xmj βm + εmj
where εkj and εlj are correlated with correlation ρkl , a quantity to be estimated from the data. This
is called Zellner’s seemingly unrelated regressions, and sureg fits such models. When x1j = x2j =
· · · = xmj , the model is known as multivariate regression, and the corresponding command is mvreg.
The equations need not be linear; if they are not linear, use nlsur.
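A minimal sketch, assuming two equations with hypothetical outcomes y1 and y2:

        sureg (y1 x1 x2) (y2 x2 x3)          // seemingly unrelated regressions
        mvreg y1 y2 = x1 x2 x3               // multivariate regression: same regressors in each equation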
UCM stands for unobserved components model and decomposes a time series into trend, seasonal,
cyclic, and idiosyncratic components after controlling for optional exogenous variables. UCM provides
a flexible and formal approach to smoothing and decomposition problems. The ucm command fits
UCM models. See [TS] ucm.
Relatedly, band-pass and high-pass filters are also used to decompose a time series into trend
and cyclic components, even though the tsfilter commands are not estimation commands; see
[TS] tsfilter. Provided are Baxter–King, Butterworth, Christiano–Fitzgerald, and Hodrick–Prescott
filters.
Concerning ARIMA, ARFIMA, and UCM, the estimated parameters are sometimes more easily
interpreted in terms of the implied spectral density. psdensity transforms results; see [TS] psdensity.
Stata’s prais command performs regression with AR(1) disturbances using the Prais – Winsten
or Cochrane – Orcutt transformation. Both two-step and iterative solutions are available, as well as a
version of the Hildreth – Lu search procedure. See [TS] prais.
newey produces linear regression estimates with the Newey – West variance estimates that are robust
to heteroskedasticity and autocorrelation of specified order. See [TS] newey.
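For example, with a tsset dataset and hypothetical variables:

        tsset year                           // declare the time variable
        prais y x1 x2                        // Prais-Winsten regression with AR(1) disturbances
        prais y x1 x2, corc                  // Cochrane-Orcutt variant
        newey y x1 x2, lag(4)                // OLS with Newey-West standard errors, 4 lags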
Stata provides estimators for univariate and multivariate ARCH and GARCH models. These models
are for time-varying volatility. ARCH models allow for conditional heteroskedasticity by including
lagged squared innovations (errors). GARCH models also include lagged conditional variances.
ARCH stands for autoregressive conditional heteroskedasticity. GARCH stands for generalized ARCH.
arch fits univariate ARCH and GARCH models, and the command provides many popular extensions,
including multiplicative conditional heteroskedasticity. Errors may be normal or Student’s t or may
follow a generalized error distribution. Robust standard errors are optionally provided. See [TS] arch.
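A sketch of a GARCH(1,1) model for a hypothetical series y with regressor x1 in the mean equation:

        tsset time
        arch y x1, arch(1) garch(1)                    // GARCH(1,1)
        arch y x1, arch(1) garch(1) distribution(t)    // same model with Student's t errors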
mgarch fits multivariate ARCH and GARCH models, including the diagonal vech model and the
constant, dynamic, and varying conditional correlation models. Errors may be multivariate normal or
multivariate Student’s t. Robust standard errors are optionally provided. See [TS] mgarch.
Stata provides VAR, SVAR, and VEC estimators for modeling multivariate time series. VAR and
SVAR deal with stationary series, and SVAR places additional constraints on the VAR model that
identify the impulse–response functions. VEC is for cointegrating VAR models. VAR stands for vector
autoregression. SVAR stands for structural VAR. VEC stands for vector error-correction model.
var fits VAR models, svar fits SVAR models, and vec fits VEC models. These commands share many
of the same features for specification testing, forecasting, and parameter interpretation; see [TS] var
intro for both var and svar, [TS] vec intro for vec, and [TS] irf for all three impulse–response
functions and forecast-error variance decomposition. For lag-order selection, residual analysis, and
Granger causality tests, see [TS] var intro (for var and svar) and [TS] vec intro.
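An illustrative sketch with hypothetical series y1, y2, and y3:

        tsset quarter
        var y1 y2 y3, lags(1/2)                     // VAR with two lags
        irf set myirfs
        irf create order1
        irf graph oirf, impulse(y1) response(y2)    // orthogonalized impulse-response function
        vec y1 y2                                   // VECM for a pair of cointegrated series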
sspace estimates the parameters of multivariate state-space models using the Kalman filter. The
state-space representation of time-series models is extremely flexible and can be used to estimate
the parameters of many different models, including vector autoregressive moving-average (VARMA)
models, dynamic-factor (DF) models, and structural time-series (STS) models. It can also solve some
stochastic dynamic-programming problems. See [TS] sspace.
dfactor estimates the parameters of dynamic-factor models. These flexible models for multivariate
time series provide for a vector-autoregressive structure in both observed outcomes and in unobserved
factors. They also allow exogenous covariates for observed outcomes or unobserved factors. See
[TS] dfactor.
xtgls fits panel-data models by using generalized least squares. The model is of the form
yit = xit β + εit
where you may specify the variance structure of εit . If you specify that εit is independent for all i and
t, xtgls produces the same results as regress up to a small-sample degrees-of-freedom correction
applied by regress but not by xtgls.
You may choose among three variance structures concerning i and three concerning t, producing
a total of nine different models. Assumptions concerning i deal with heteroskedasticity and cross-
sectional correlation. Assumptions concerning t deal with autocorrelation and, more specifically, AR(1)
serial correlation.
Alternative methods report the OLS coefficients and a version of the GLS variance–covariance
estimator. xtpcse produces panel-corrected standard error (PCSE) estimates for linear cross-sectional
time-series models, where the parameters are estimated by OLS or Prais–Winsten regression. When
you are computing the standard errors and the variance–covariance estimates, the disturbances are,
by default, assumed to be heteroskedastic and contemporaneously correlated across panels.
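For example, with a panel declared by xtset and hypothetical variables:

        xtset id year
        xtgls y x1 x2, panels(heteroskedastic) corr(ar1)   // FGLS with panel heteroskedasticity and AR(1) errors
        xtpcse y x1 x2, correlation(ar1)                    // Prais-Winsten with panel-corrected standard errors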
In the jargon of GLS, the random-effects model fit by xtreg has exchangeable correlation
within i — xtgls does not model this particular correlation structure. xtgee, however, does.
xtgee fits population-averaged models, and it optionally provides robust estimates of variance.
Moreover, xtgee allows other correlation structures. One that is of particular interest to those with
many data goes by the name unstructured. The within-panel correlations are simply estimated in an
unconstrained way. [U] 26.18.3 Generalized linear models with panel data will discuss this estimator
further because it is not restricted to linear regression models.
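A sketch, assuming a panel identified by id and year and hypothetical regressors:

        xtset id year
        xtgee y x1 x2, family(gaussian) link(identity) corr(unstructured) vce(robust)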
xthtaylor uses instrumental variables estimators to estimate the parameters of panel-data random-
effects models of the form
yit = X1it β1 + X2it β2 + Z1i δ1 + Z2i δ2 + ui + εit
The individual effects ui are correlated with the explanatory variables X2it and Z2i but are uncorrelated
with X1it and Z1i , where Z1 and Z2 are constant within panel.
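A hedged sketch, treating x2 (time varying) and z2 (time invariant) as the endogenous variables:

        xtset id year
        xthtaylor y x1 x2 z1 z2, endog(x2 z2)    // Hausman-Taylor estimator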
xtfrontier fits stochastic production or cost frontier models for panel data. You may choose from
a time-invariant model or a time-varying decay model. In both models, the nonnegative inefficiency
term is assumed to have a truncated-normal distribution. In the time-invariant model, the inefficiency
term is constant within panels. In the time-varying decay model, the inefficiency term is modeled as
a truncated-normal random variable multiplied by a specific function of time. In both models, the
idiosyncratic error term is assumed to have a normal distribution. The only panel-specific effect is
the random inefficiency term.
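A minimal sketch with hypothetical logged output and inputs:

        xtset firm year
        xtfrontier lnoutput lncapital lnlabor, ti    // time-invariant inefficiency
        xtfrontier lnoutput lncapital lnlabor, tvd   // time-varying decay model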
See [U] 26.19 Multilevel mixed-effects models for a generalization of xtreg that allows for
multiple levels of panels, random coefficients, and variance-component estimation in general.
Recall from [U] 26.6 Generalized linear models that the generalized linear model is
g{E(yj )} = xj β,  yj ∼ F
where g() is the link function and F is a member of the exponential family, both of which you
specify before estimation.
There are two ways to extend the generalized linear model to panel data. They are the generalized
linear mixed model (GLMM) and generalized estimating equations (GEE).
GEE uses a working correlation structure to model within-panel correlation. GEEs may be fit with
the xtgee command; see [XT] xtgee.
For generalized linear models with multilevel data, including panel data, see [U] 26.19 Multilevel
mixed-effects models.
26.19 Multilevel mixed-effects models

Multilevel data arise when observations are grouped, such as students within schools, and the group
itself may have an effect on the outcome. Even though the effect on outcome is not directly observed,
one can control for the effect if one
is willing to assume that the effect is the same for all observations within a group and that, across
groups, the effect is a random draw from a statistical distribution that is uncorrelated with the overall
residual of the model and other group effects.
We have just described multilevel models.
A more complicated scenario might have three levels: students nested within teachers within a
high school, patients nested within doctors within a hospital, or tractors nested within an assembly
line within a factory.
An alternative to three-level hierarchical data is crossed data. We have workers and their occupation
and the industry in which they work.
In any case, multilevel data arise in a variety of situations. One possible way of analyzing such
data is simply to ignore the multilevel aspect of the data. If you do that, and assuming that the ignored
effect is uncorrelated with the residual, you will still obtain unbiased coefficients, although standard
errors produced by standard methods will be incorrect. Many estimation commands in Stata provide
cluster–robust standard errors to get around that problem.
You can obtain more efficient parameter estimates, however, if you use an estimator that explicitly
accounts for the multilevel nature of the data. And if you want to perform comparisons across groups,
you must use such estimators.
Stata provides a suite of multilevel estimation commands for continuous, binary, ordinal, and
count outcomes. These estimators provide random intercepts and random coefficients and allow
constraints to be placed on coefficients and on variance components. (The QR decomposition
estimators do not allow constraints.)
See the [ME] Stata Multilevel Mixed-Effects Reference Manual; in particular, see [ME] me.
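As an illustration, assuming a recent Stata in which the me suite (for example, mixed and melogit)
is available, with hypothetical school and teacher identifiers:

        mixed score x1 || school: || teacher: x1    // students within teachers within schools; random slope on x1 at the teacher level
        melogit pass x1 x2 || school:               // two-level mixed-effects logistic regression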
MI is named for the imputations it produces to replace the missing values in the data. MI does not
just form replacement values for the missing data; it produces multiple replacements. The purpose is
not to create replacement values as close as possible to the true ones, but to handle missing data in
a way resulting in valid statistical inference.
There are three steps in an MI analysis. First, one forms M imputations for each missing value
in the data. Second, one fits the model of interest separately on each of the M resulting datasets.
Finally, one combines those M estimation results into the desired single result.
The mi command does this for you. It can be used with most of Stata’s estimation commands,
including survey, survival, and panel and multilevel models. See [MI] intro.
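A sketch of the workflow, assuming x1 has missing values and x2, x3, and y are complete
(hypothetical names):

        mi set mlong                              // choose an MI data style
        mi register imputed x1                    // declare the variable to be imputed
        mi impute regress x1 = x2 x3, add(20)     // create M = 20 imputations
        mi estimate: regress y x1 x2              // fit on each imputation and combine the results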
14. discrim and candisc perform discriminant analysis. candisc performs linear discriminant
analysis (LDA). discrim also performs LDA, and it performs quadratic discriminant analysis
(QDA), kth-nearest-neighbor (KNN), and logistic discriminant analysis. The two commands differ
in default output. discrim shows the classification summary, candisc shows the canonical linear
discriminant functions, and both will produce either; see the sketch below.
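A sketch of both commands, assuming a grouping variable species and hypothetical measurements:

        candisc x1 x2 x3, group(species)             // canonical linear discriminant functions
        discrim lda x1 x2 x3, group(species)         // same LDA; classification table by default
        discrim knn x1 x2 x3, group(species) k(5)    // kth-nearest-neighbor discriminant analysis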
PSS analysis can also answer other questions that may arise during the planning stage of a study.
For example, what is the power of a test given an available sample size, and how likely is it to
detect an effect of interest given limited study resources? The answers to these questions may help
reduce the cost of a study by preventing an overpowered study or may avoid wasting resources on
an underpowered study.
See [PSS] intro for more information about PSS analysis.
The power command performs PSS analysis. It provides PSS analysis for comparison of means,
variances, proportions, and correlations. One-sample, two-sample, and paired analyses are supported.
power provides both tabular output and graphical output, or power curves; see [PSS] power, table
and [PSS] power, graph for details.
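For instance, hedged examples of sample-size and power-curve calculations:

        power twomeans 0 1, sd(2) power(0.9)            // sample size for a two-sample means test, 90% power
        power twomeans 0 1, sd(2) n(40(10)100) graph    // power curve over a range of sample sizes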
See [PSS] power for a full list of supported methods and the description of the command.
You can work with power commands either interactively or via a convenient point-and-click
interface; see [PSS] GUI for details.