Panel Data Econometrics in R: The plm Package

Yves Croissant, Giovanni Millo
Abstract
This introduction to the plm package is a slightly modified version of Croissant and
Millo (2008), published in the Journal of Statistical Software.
Panel data econometrics is obviously one of the main fields in the profession, but most
of the models used are difficult to estimate with R. plm is a package for R which intends
to make the estimation of linear panel models straightforward. plm provides functions to
estimate a wide variety of models and to make (robust) inference.
1. Introduction
Panel data econometrics is a continuously developing field. The increasing availability of
data observed on cross-sections of units (like households, firms, countries etc.) and over time
has given rise to a number of estimation approaches exploiting this double dimensionality to
cope with some of the typical problems associated with economic data, first of all that of
unobserved heterogeneity.
Timewise observation of data from different observational units has long been common in
other fields of statistics (where they are often termed longitudinal data). In the panel data
field as well as in others, the econometric approach is nevertheless peculiar with respect to
experimental contexts, as it emphasizes model specification and testing and tackles a
number of issues arising from the particular statistical problems associated with economic
data.
Thus, while a very comprehensive software framework for (among many other features) max-
imum likelihood estimation of linear regression models for longitudinal data, packages nlme
(Pinheiro, Bates, DebRoy, and the R Core team 2007) and lme4 (Bates 2007), is available in
the R (R Development Core Team 2008) environment and can be used, e.g., for estimation
of random effects panel models, its use is not intuitive for a practicing econometrician, and
maximum likelihood estimation is only one of the possible approaches to panel data econo-
metrics. Moreover, economic panel datasets often happen to be unbalanced (i.e., they have a
different number of observations between groups), a case which needs some adaptation to the
methods and is not compatible with those in nlme. Hence the need for a package doing panel
data “from the econometrician’s viewpoint” and featuring at a minimum the basic techniques
econometricians are used to: random and fixed effects estimation of static linear panel data
models, variable coefficients models, generalized method of moments estimation of dynamic
models; and the basic toolbox of specification and misspecification diagnostics.
Furthermore, we felt there was the need for automation of some basic data management
tasks such as lagging, summing and, more generally, applying (in the R sense) functions
to the data, which, although conceptually simple, become cumbersome and error-prone on
two-dimensional data, especially in the case of unbalanced panels.
This paper is organized as follows: Section 2 presents a very short overview of the typical
model taxonomy1. Section 3 discusses the software approach used in the package. The next
three sections present the functionalities of the package in more detail: data management
(Section 4), estimation (Section 5) and testing (Section 6), giving a short description and
illustrating them with examples. Section 7 compares the approach in plm to that of nlme
and lme4, highlighting the features of the latter two that an econometrician might find most
useful. Section 8 concludes the paper.
2. The linear panel model

The basic linear panel models used in econometrics can be described through suitable restrictions of the following general model:

$$y_{it} = \alpha_{it} + \beta_{it}^{\top} x_{it} + u_{it} \qquad (1)$$

where $i = 1, \ldots, n$ is the individual (group, country, ...) index, $t = 1, \ldots, T$ is the time index and $u_{it}$ a random disturbance. Assuming parameter homogeneity, $\alpha_{it} = \alpha$ and $\beta_{it} = \beta$ for all $i, t$, yields the pooled model:

$$y_{it} = \alpha + \beta^{\top} x_{it} + u_{it} \qquad (2)$$

To model individual heterogeneity, one often assumes that the error term has a component $\mu_i$ which is specific to the individual and constant over time2:

$$y_{it} = \alpha + \beta^{\top} x_{it} + \mu_i + \epsilon_{it} \qquad (3)$$
The appropriate estimation method for this model depends on the properties of the two error
components. The idiosyncratic error $\epsilon_{it}$ is usually assumed well-behaved and independent of
both the regressors $x_{it}$ and the individual error component $\mu_i$. The individual component
may be in turn either independent of the regressors or correlated.
If it is correlated, the ordinary least squares (ols) estimator of $\beta$ would be inconsistent, so
it is customary to treat the $\mu_i$ as a further set of $n$ parameters to be estimated, as if in the
1 Comprehensive treatments are to be found in many econometrics textbooks, e.g. Baltagi (2001) or Wooldridge (2002): the reader is referred to these, especially to the first 9 chapters of Baltagi (2001).
2 For the sake of exposition we are considering only the individual effects case here. There may also be time effects, which is a symmetric case, or both of them, so that the error has three components: $u_{it} = \mu_i + \lambda_t + \epsilon_{it}$.
general model $\alpha_{it} = \alpha_i$ for all $t$. This is called the fixed effects (a.k.a. within or least squares
dummy variables) model, usually estimated by ols on transformed data, and gives consistent
estimates for $\beta$.
If the individual-specific component $\mu_i$ is uncorrelated with the regressors, a situation which is
usually termed random effects, the overall error $u_{it}$ also is, so the ols estimator is consistent.
Nevertheless, the common error component over individuals induces correlation across the
composite error terms, making ols estimation inefficient, so one has to resort to some form
of feasible generalized least squares (gls) estimators. This is based on the estimation of the
variance of the two error components, for which there are a number of different procedures
available.
If the individual component is missing altogether, pooled ols is the most efficient estimator
for $\beta$. This set of assumptions is usually labelled pooling model, although this actually refers
to the errors’ properties and the appropriate estimation method rather than the model itself.
If one relaxes the usual hypotheses of well-behaved, white noise errors and allows for the
idiosyncratic error $\epsilon_{it}$ to be arbitrarily heteroskedastic and serially correlated over time, a more
general kind of feasible gls is needed, called the unrestricted or general gls. This specification
can also be augmented with individual-specific error components possibly correlated with the
regressors, in which case it is termed fixed effects gls.
Another way of estimating unobserved effects models through removing time-invariant individual components is by first-differencing the data: lagging the model and subtracting, the time-invariant components (the intercept and the individual error component) are eliminated, and the model

$$\Delta y_{it} = \beta^{\top} \Delta x_{it} + \Delta u_{it} \qquad (4)$$

(where $\Delta y_{it} = y_{it} - y_{i,t-1}$, $\Delta x_{it} = x_{it} - x_{i,t-1}$ and, from (3), $\Delta u_{it} = u_{it} - u_{i,t-1} = \Delta\epsilon_{it}$ for $t = 2, \ldots, T$) can be consistently estimated by pooled ols. This is called the first-difference, or fd estimator. Its relative efficiency, and so reasons for choosing it against other consistent alternatives, depends on the properties of the error term. The fd estimator is usually preferred if the errors $u_{it}$ are strongly persistent in time, because then the $\Delta u_{it}$ will tend to be serially uncorrelated.
Lastly, the between model, which is computed on time (group) averages of the data, discards
all the information due to intragroup variability but is consistent in some settings (e.g., non-
stationarity) where the others are not, and is often preferred to estimate long-run relationships.
Variable coefficients models relax the assumption that $\beta_{it} = \beta$ for all $i, t$. Fixed coefficients
models allow the coefficients to vary along one dimension, like $\beta_{it} = \beta_i$ for all $t$. Random
coefficients models instead assume that coefficients vary randomly around a common average,
as $\beta_{it} = \beta + \eta_i$ for all $t$, where $\eta_i$ is a group– (time–) specific effect with mean zero.
The hypotheses on parameters and error terms (and hence the choice of the most appropriate
estimator) are usually tested by means of:
• pooling tests to check poolability, i.e. the hypothesis that the same coefficients apply
across all individuals,
• if the homogeneity assumption over the coefficients is established, the next step is to
establish the presence of unobserved effects, comparing the null of spherical residuals
with the alternative of group (time) specific effects in the error term,
• the choice between fixed and random effects specifications is based on Hausman-type
tests, comparing the two estimators under the null of no significant difference: if this is
not rejected, the more efficient random effects estimator is chosen,
• even after this step, departures of the error structure from sphericity can further affect
inference, so that either screening tests or robust diagnostics are needed.
Dynamic models, and in general a lack of strict exogeneity of the regressors, pose further problems to estimation which are usually dealt with in the generalized method of moments (gmm) framework.
These were, in our opinion, the basic requirements of a panel data econometrics package
for the R language and environment. Some, as often happens with R, were already fulfilled
by packages developed for other branches of computational statistics, while others (like the
fixed effects or the between estimators) were straightforward to compute after transforming
the data, but in every case there were either language inconsistencies w.r.t. the standard
econometric toolbox or subtleties to be dealt with (like, for example, appropriate computation
of standard errors for the demeaned model, a common pitfall), so we felt there was a need for an
"all in one" econometrics-oriented package allowing the user to perform specification searches,
estimation and inference in a natural way.
3. Software approach
3.1. Data structure

The estimation functions accept an ordinary data.frame together with an index argument describing the panel structure of the data. This argument may be:

• NULL (the default value), in which case it is assumed that the first two columns contain the
individual and the time index and that observations are ordered by individual and by
time period,
• a character vector of length two containing the names of the individual and the time
index,
• an integer which is the number of individuals (only in case of a balanced panel with
observations ordered by individual).
The pdata.frame function is then called internally; it returns a pdata.frame, which is
a data.frame with an attribute called index. This attribute is a data.frame containing
the individual and the time indexes.
It is also possible to call pdata.frame directly and then use the resulting pdata.frame
in the estimation functions, as sketched below.
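For instance, with the Grunfeld data (whose index columns are firm and year), the two routes below are equivalent; the pooled specification is used purely for illustration:

R> # declare the panel structure once, then estimate on the pdata.frame
R> pGrun <- pdata.frame(Grunfeld, index = c("firm", "year"))
R> plm(inv ~ value + capital, data = pGrun, model = "pooling")
R> # or let plm build the pdata.frame internally via the index argument
R> plm(inv ~ value + capital, data = Grunfeld,
+      index = c("firm", "year"), model = "pooling")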
3.2. Interface
Estimation interface
plm provides four functions for estimation:
• plm: estimation of the basic panel models, i.e. within, between and random effects
models. Models are estimated by applying the lm function to transformed data,
• pvcm: estimation of models with variable coefficients,
• pgmm: estimation of generalized method of moments models,
• pggls: estimation of general feasible generalized least squares models.
The interface of these functions is consistent with the lm() function. Namely, their first two
arguments are formula and data (which should be a data.frame and is mandatory). Three
additional arguments are common to these functions:
• index: this argument enables the estimation functions to identify the structure of the
data, i.e. the individual and the time period for each observation,
• effect: the kind of effects to include in the model, i.e. individual effects, time effects
or both3,
• model: the kind of model to be estimated, most of the time a model with fixed effects
or a model with random effects.
The results of these four functions are stored in an object whose class has the same name
as the function. They all inherit from class panelmodel. A panelmodel object contains:
coefficients, residuals, fitted.values, vcov, df.residual and call; functions that
extract these elements are provided.
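For example, the usual extractor functions apply to any panelmodel object; a brief sketch (gw is a hypothetical object name):

R> gw <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
R> coef(gw)             # estimated coefficients
R> head(residuals(gw))  # model residuals, a pseries
R> vcov(gw)             # covariance matrix of the coefficients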
Testing interface
The diagnostic testing interface provides both formula and panelmodel methods for most
functions, with some exceptions. The user may thus choose whether to employ results stored
in a previously estimated panelmodel object or to re-estimate it for the sake of testing.
Although the first strategy is the most efficient one, diagnostic testing on panel models mostly
employs ols residuals from pooling model objects, whose estimation is computationally in-
expensive. Therefore most examples in the following are based on formula methods, which
are perhaps the cleanest for illustrative purposes.
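As a sketch of the two usages, the Lagrange multiplier test of individual effects (plmtest, described in Section 6) may be invoked either way:

R> g <- plm(inv ~ value + capital, data = Grunfeld, model = "pooling")
R> plmtest(g)                                        # panelmodel method
R> plmtest(inv ~ value + capital, data = Grunfeld)   # formula method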
Given an estimate $\hat V$ of the covariance matrix of the errors, the gls estimator of $\beta$ is:

$$\hat\beta = (X^{\top} \hat V^{-1} X)^{-1} (X^{\top} \hat V^{-1} y) \qquad (5)$$
Nevertheless, in practice plain computation of $\hat\beta$ has long been an intractable problem even
for moderate-sized datasets because of the need to invert the $N \times N$ $\hat V$ matrix. With the
advances in computer power, this is no longer so, and it is possible to program the "naive"
estimator (5) in R with standard matrix algebra operators and have it working seamlessly for
the standard "guinea pigs", e.g. the Grunfeld data, as sketched below. Estimation with a couple of thousands
of data points also becomes feasible on a modern machine, although excruciatingly slow and
definitely not suitable for everyday econometric practice. Memory limits would also be very
near because of the storage needs related to the huge $\hat V$ matrix. An established solution
exists for the random effects model which reduces the problem to an ordinary least squares
computation.
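A naive implementation of (5) with base matrix algebra, assuming the regressor matrix X, the response y and an estimated error covariance Vhat (hypothetical objects) are already in memory:

R> Vinv <- solve(Vhat)                 # invert the N x N covariance matrix
R> XtVinv <- t(X) %*% Vinv
R> beta.hat <- solve(XtVinv %*% X, XtVinv %*% y)   # (X'V^-1 X)^-1 X'V^-1 y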
R has general facilities for fast matrix computation based on object orientation: particular
types of matrices (symmetric, sparse, dense etc.) are assigned the relevant class and the
additional information on structure is used in the computations, sometimes with dramatic
effects on performance (see Bates 2004) and packages Matrix (see Bates and Maechler 2007)
and SparseM (see Koenker and Ng 2007). Some optimized linear algebra routines are available
in the R package bdsmatrix (see Atkinson and Therneau 2007) which exploit the particular
block-diagonal and symmetric structure of V̂ making it possible to implement a fast and
reliable full-matrix solution to problems of any practically relevant size.
The $\hat V$ matrix is constructed as an object of class bdsmatrix. The peculiar properties of this
matrix class are used for efficiently storing the object in memory and then by ad-hoc versions
of the solve and crossprod methods, dramatically reducing computing times and memory
usage. The resulting matrix is then used "the naive way" as in (5) to compute $\hat\beta$, resulting in
speed comparable to that of the demeaning solution.
$$\hat V_R(\beta) = (X^{\top} X)^{-1} \sum_{i=1}^{n} X_i^{\top} E_i X_i \, (X^{\top} X)^{-1} \qquad (7)$$

where $E_i$ is a weighting of the residuals for group $i$ that depends on the chosen method.
4 See packages lmtest (Zeileis and Hothorn 2002) and car (Fox 2007).
5 Moreover, coeftest() provides a compact way of looking at coefficient estimates and significance diagnostics.
R> library("plm")
The four datasets used are EmplUK, which was used by Arellano and Bond (1991), the Grunfeld
data (Kleiber and Zeileis 2008), which is used in several econometric books, the Produc data
used by Munnell (1990) and the Wages data used by Cornwell and Rupert (1988).
R> head(Grunfeld)
R> E <- pdata.frame(EmplUK, index = c("firm", "year"), drop.index = TRUE,
+                   row.names = TRUE)
R> head(E)
firm year
1 1 1977
2 1 1978
3 1 1979
4 1 1980
5 1 1981
6 1 1982
Two further arguments are logical: drop.index drops the indexes from the data.frame
and row.names computes "fancy" row names by pasting the individual and the time indexes.
When a series is extracted from a pdata.frame, a pseries is created, which is the original
series with the index attribute. Specific methods, like summary and as.matrix, are provided
for this object. The former indicates the total variation of the variable and the share of this
variation that is due to the individual and the time dimensions. The latter gives the matrix
representation of the series, with, by default, individuals as rows and time periods as columns.
R> summary(E$emp)
R> head(as.matrix(E$emp))
The lag method shifts a pseries back in time within each individual; several lags may be computed at once by supplying a vector, each lag becoming a column:

R> head(lag(E$emp, 0:2))

            0     1     2
1-1977  5.041    NA    NA
1-1978  5.600 5.041    NA
1-1979  5.015 5.600 5.041
1-1980  4.715 5.015 5.600
1-1981  4.093 4.715 5.015
1-1982  3.166 4.093 4.715
Further functions called Between, between and Within are also provided to compute the
between and the within transformations. between returns one value per individual (the
group means), whereas Between duplicates these values and returns a vector whose length
is the number of observations.
R> head(lag(E$emp, 2), 10)

 1-1977  1-1978  1-1979  1-1980  1-1981  1-1982  1-1983  2-1977  2-1978  2-1979
     NA      NA   5.041   5.600   5.015   4.715   4.093      NA      NA  71.319

R> head(Within(E$emp))
R> head(between(E$emp), 4)

        1         2         3         4
 4.366571 71.362428 19.040143 26.035000

R> head(Between(E$emp), 10)

        1         1         1         1         1         1         1         2
 4.366571  4.366571  4.366571  4.366571  4.366571  4.366571  4.366571 71.362428
        2         2
71.362428 71.362428
4.3. Formulas
There are circumstances where standard formulas are not very useful to describe a model,
notably when using instrumental variable estimators: to deal with these situations, we
use the Formula package.
The Formula package provides a class which enables the construction of multi-part formulas, each
part being separated by a pipe sign. plm provides a pFormula object, which is a Formula with
specific methods.
The two formulas below are identical:
R> emp~wage+capital|lag(wage,1)+capital
R> emp~wage+capital|.-wage+lag(wage,1)
In the second case, the . means the previous part, which describes the covariates, and this
part is then "updated". This is particularly interesting when there are only a few external
instruments. Such formulas are passed directly to the estimation functions, as sketched below.
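As a sketch, a two-part formula may be supplied to an estimation function; here on the pdata.frame E built above, with a purely illustrative instrument choice:

R> plm(emp ~ wage + capital | lag(wage, 1) + capital,
+      data = E, model = "pooling")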
5. Model estimation
The basic use of plm is to indicate the model formula, the data and the model to be estimated.
For example, the fixed effects model and the random effects model are estimated using:

R> grun.fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
R> grun.re <- plm(inv ~ value + capital, data = Grunfeld, model = "random")
R> summary(grun.re)
Call:
plm(formula = inv ~ value + capital, data = Grunfeld, model = "random")
Effects:
var std.dev share
idiosyncratic 2784.46 52.77 0.282
individual 7089.80 84.20 0.718
theta: 0.8612
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-178.00 -19.70 4.69 19.50 253.00
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
(Intercept) -57.834415 28.898935 -2.0013 0.04674 *
value 0.109781 0.010493 10.4627 < 2e-16 ***
capital 0.308113 0.017180 17.9339 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For a random effects model, the summary method gives information about the variance of the
components of the errors. Fixed effects may be extracted easily using fixef. An argument type
indicates how fixed effects should be computed: in levels (type = 'level', the default), in
deviations from the overall mean (type = 'dmean') or in deviations from the first individual
(type = 'dfirst').
R> fixef(grun.fe, type = "dmean")

          1           2           3           4           5           6
 -11.552778  160.649753 -176.827902   30.934645  -55.872873   35.582644
          7           8           9          10
  -7.809534    1.198282  -28.478333   52.176096
The fixef function returns an object of class fixef. A summary method is provided, which
prints the effects (in deviation from the overall intercept), their standard errors and the test
of equality to the overall intercept.
In case of a two-ways effects model, an additional argument effect is required to extract fixed
effects, as sketched below:
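A minimal sketch, assuming a two-ways within model (grun.twfe is a hypothetical name):

R> grun.twfe <- plm(inv ~ value + capital, data = Grunfeld,
+                   model = "within", effect = "twoways")
R> fixef(grun.twfe, effect = "time")   # extract the time effects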
Four estimators of this parameter are available, depending on the value of the argument
random.method: "swar" (Swamy and Arora 1972, the default), "walhus" (Wallace and
Hussain 1969), "amemiya" and "nerlove" (Nerlove 1971).
The estimation of the variance of the error components is performed by the ercomp
function, which has method and effect arguments and can also be used by itself, as sketched below:
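A minimal sketch of a direct call (the argument values are illustrative):

R> ercomp(inv ~ value + capital, data = Grunfeld,
+         method = "amemiya", effect = "twoways")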
For example, to estimate a two-ways effects model for the Grunfeld data:

R> grun.tways <- plm(inv ~ value + capital, data = Grunfeld,
+                    effect = "twoways", model = "random",
+                    random.method = "amemiya")
R> summary(grun.tways)

Call:
plm(formula = inv ~ value + capital, data = Grunfeld, effect = "twoways",
    model = "random", random.method = "amemiya")
Effects:
var std.dev share
idiosyncratic 2644.13 51.42 0.236
individual 8294.72 91.08 0.740
time 270.53 16.45 0.024
theta : 0.8747 (id) 0.2969 (time) 0.296 (total)
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-176.00 -18.00 3.02 18.00 233.00
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
(Intercept) -64.351811 31.183651 -2.0636 0.04036 *
value 0.111593 0.011028 10.1192 < 2e-16 ***
capital 0.324625 0.018850 17.2214 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the "effects" section of the result, the variance of the three elements of the error term and
the three parameters used in the transformation are now printed. The two-ways effects model
is for the moment only available for balanced panels.
Unbalanced panels
Most of the features of plm are implemented for unbalanced panel models, with some limitations:
• the only estimator of the variance of the error components is the one proposed by Swamy
and Arora (1972).
The following example uses the well-known hedonic housing prices data to estimate a hedonic
housing price function. It is reproduced in Baltagi (2001), p. 174.

R> Hed <- plm(mv ~ crim + zn + indus + chas + nox + rm + age + dis +
+             rad + tax + ptratio + blacks + lstat, data = Hedonic,
+             model = "random", index = "townid")
R> summary(Hed)

Call:
plm(formula = mv ~ crim + zn + indus + chas + nox + rm + age +
dis + rad + tax + ptratio + blacks + lstat, data = Hedonic,
model = "random", index = "townid")
Effects:
var std.dev share
idiosyncratic 0.01718 0.13106 0.572
individual 0.01287 0.11342 0.428
theta :
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2439 0.5409 0.6218 0.6078 0.7093 0.7936
Residuals :
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.62700 -0.06680 -0.00127 -0.00219 0.06860 0.55400
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 9.6871e+00 1.9603e-01 49.4159 < 2.2e-16 ***
crim -7.4447e-03 1.0502e-03 -7.0891 4.713e-12 ***
zn 8.4524e-05 6.4423e-04 0.1312 0.8956688
indus 1.4759e-03 3.9871e-03 0.3702 0.7114187
chasyes -3.3181e-03 2.9256e-02 -0.1134 0.9097487
nox -5.8387e-03 1.2451e-03 -4.6895 3.552e-06 ***
rm 9.0316e-03 1.1903e-03 7.5877 1.643e-13 ***
age -8.4567e-04 4.6851e-04 -1.8050 0.0716850 .
dis -1.4620e-01 4.3834e-02 -3.3352 0.0009166 ***
rad 9.5854e-02 2.6343e-02 3.6387 0.0003030 ***
tax -3.7787e-04 1.7506e-04 -2.1585 0.0313730 *
ptratio -2.9445e-02 8.9631e-03 -3.2851 0.0010921 **
blacks 5.6059e-01 1.0214e-01 5.4886 6.497e-08 ***
lstat -2.9213e-01 2.3940e-02 -12.2025 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Instrumental variables estimation is obtained by supplying a two-part formula, the second
part listing the instruments. In the example below (the Crime data), log(prbarr) and
log(polpc) are treated as endogenous and replaced in the instrument set by log(taxpc)
and log(mix):

Call:
plm(formula = log(crmrte) ~ log(prbarr) + log(polpc) + log(prbconv) +
log(prbpris) + log(avgsen) + log(density) + log(wcon) + log(wtuc) +
log(wtrd) + log(wfir) + log(wser) + log(wmfg) + log(wfed) +
log(wsta) + log(wloc) + log(pctymle) + log(pctmin) + region +
smsa + factor(year) | . - log(prbarr) - log(polpc) + log(taxpc) +
log(mix), data = Crime, model = "random")
Effects:
var std.dev share
idiosyncratic 0.02244 0.14981 0.327
individual 0.04629 0.21515 0.673
theta: 0.7455
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-0.74900 -0.07100 0.00401 0.07840 0.47600
Coefficients :
The Hausman-Taylor model (see Hausman and Taylor 1981) may be estimated with the pht
function. The following example is from Baltagi (2001), p. 130.
Effects:
var std.dev share
idiosyncratic 0.02304 0.15180 0.025
individual 0.88699 0.94180 0.975
theta: 0.9392
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.92000 -0.07070 0.00657 0.07970 2.03000
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 2.7818e+00 3.0765e-01 9.0422 < 2.2e-16 ***
wks 8.3740e-04 5.9973e-04 1.3963 0.16263
southyes 7.4398e-03 3.1955e-02 0.2328 0.81590
smsayes -4.1833e-02 1.8958e-02 -2.2066 0.02734 *
marriedyes -2.9851e-02 1.8980e-02 -1.5728 0.11578
exp 1.1313e-01 2.4710e-03 45.7851 < 2.2e-16 ***
I(exp^2) -4.1886e-04 5.4598e-05 -7.6718 1.696e-14 ***
bluecolyes -2.0705e-02 1.3781e-02 -1.5024 0.13299
ind 1.3604e-02 1.5237e-02 0.8928 0.37196
unionyes 3.2771e-02 1.4908e-02 2.1982 0.02794 *
sexmale 1.3092e-01 1.2666e-01 1.0337 0.30129
blackyes -2.8575e-01 1.5570e-01 -1.8352 0.06647 .
ed 1.3794e-01 2.1248e-02 6.4919 8.474e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The pvcm function enables the estimation of variable coefficients models. Time or individual
effects are introduced if effect is fixed to "time" or "individual" (the default value).
Coefficients are assumed to be fixed if model="within" or random if model="random". In the
first case, a different model is estimated for each individual (or time period). In the second
case, the Swamy model (see Swamy 1970) is estimated. It is a generalized least squares
model which uses the results of the previous model. Denoting by $\hat\beta_i$ the vectors of coefficients
obtained for each individual, we get:
$$\hat\beta = \left(\sum_{i=1}^{n} \left(\hat\Delta + \hat\sigma_i^2 (X_i^{\top} X_i)^{-1}\right)^{-1}\right)^{-1} \sum_{i=1}^{n} \left(\hat\Delta + \hat\sigma_i^2 (X_i^{\top} X_i)^{-1}\right)^{-1} \hat\beta_i \qquad (8)$$

where $\hat\sigma_i^2$ is the unbiased estimator of the variance of the errors for individual $i$ obtained from
the preliminary estimation and:

$$\hat\Delta = \frac{1}{n-1} \sum_{i=1}^{n} \left(\hat\beta_i - \frac{1}{n}\sum_{i=1}^{n} \hat\beta_i\right) \left(\hat\beta_i - \frac{1}{n}\sum_{i=1}^{n} \hat\beta_i\right)^{\top} - \frac{1}{n} \sum_{i=1}^{n} \hat\sigma_i^2 (X_i^{\top} X_i)^{-1} \qquad (9)$$
R> grun.varr <- pvcm(inv ~ value + capital, data = Grunfeld, model = "random")
R> summary(grun.varr)
Call:
pvcm(formula = inv ~ value + capital, data = Grunfeld, model = "random")
Residuals:
total sum of squares : 2177914
id time
0.67677732 0.02974195
Dynamic models include lags of the dependent variable among the regressors, as in:

$$y_{it} = \rho y_{i,t-1} + \beta^{\top} x_{it} + \mu_i + \epsilon_{it} \qquad (10)$$

First-differencing removes the individual effect:

$$\Delta y_{it} = \rho \Delta y_{i,t-1} + \beta^{\top} \Delta x_{it} + \Delta\epsilon_{it} \qquad (11)$$

Least squares are inconsistent because $\Delta\epsilon_{it}$ is correlated with $\Delta y_{i,t-1}$. $y_{i,t-2}$ is a valid, but
weak, instrument (see Anderson and Hsiao 1981). The gmm estimator uses the fact that the
number of valid instruments grows with $t$:
• $t = 3$: $y_{i1}$,
• $t = 4$: $y_{i1}, y_{i2}$,
• $t = 5$: $y_{i1}, y_{i2}, y_{i3}$.
The estimator minimizes the quadratic form

$$\left(\sum_{i=1}^{n} e_i(\beta)^{\top} W_i\right) A \left(\sum_{i=1}^{n} W_i^{\top} e_i(\beta)\right) \qquad (13)$$

where $e_i(\beta)$ is the residual vector and $W_i$ the matrix of instruments for individual $i$.
One-step estimators are computed using a known weighting matrix. For the model in first
differences, one uses:

$$A^{(1)} = \left(\sum_{i=1}^{n} W_i^{\top} H^{(1)} W_i\right)^{-1} \qquad (14)$$
with:
$$H^{(1)} = d^{\top} d = \begin{pmatrix}
2 & -1 & 0 & \ldots & 0 \\
-1 & 2 & -1 & \ldots & 0 \\
0 & -1 & 2 & \ldots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & -1 & 2
\end{pmatrix} \qquad (15)$$
Two-steps estimators are obtained using $H_i^{(2)} = \sum_{i=1}^{n} e_i^{(1)} e_i^{(1)\top}$, where $e_i^{(1)}$ are the residuals of
the one-step estimate.
Blundell and Bond (1998) show that, under mild hypotheses on the data generating process,
supplementary moment conditions exist for the equation in levels.
More precisely, they show that $\Delta y_{i,t-2} = y_{i,t-2} - y_{i,t-3}$ is a valid instrument. The estimator is
obtained using the residual vector in differences and in levels:

$$e_i^{+} = (\Delta e_i, e_i)$$
together with the moment conditions:

$$\sum_{i=1}^{n} Z_i^{+\top} \begin{pmatrix} \bar e_i(\beta) \\ e_i(\beta) \end{pmatrix} = \Bigl( \sum_{i=1}^{n} y_{i1} \bar e_{i3},\ \sum_{i=1}^{n} y_{i1} \bar e_{i4},\ \sum_{i=1}^{n} y_{i2} \bar e_{i4},\ \ldots,\ \sum_{i=1}^{n} y_{i1} \bar e_{iT},\ \sum_{i=1}^{n} y_{i2} \bar e_{iT},\ \ldots,\ \sum_{i=1}^{n} y_{i,T-2} \bar e_{iT},\ \sum_{i=1}^{n} \sum_{t=3}^{T} x_{it} \bar e_{it},\ \sum_{i=1}^{n} e_{i3} \Delta y_{i2},\ \sum_{i=1}^{n} e_{i4} \Delta y_{i3},\ \ldots,\ \sum_{i=1}^{n} e_{iT} \Delta y_{i,T-1} \Bigr)^{\top}$$

where $\bar e_i$ denotes the residuals in differences.
The gmm estimator is provided by the pgmm function. Its main argument is a dynformula,
which describes the variables of the model and the lag structure.
In a gmm estimation, there are "normal" instruments and "gmm" instruments. gmm instruments are indicated in the second part of the formula. By default, all the variables of the
model that are not used as gmm instruments are used as normal instruments, with the same
lag structure; "normal" instruments may also be indicated in the third part of the formula.
The effect argument is either NULL, "individual" (the default) or "twoways". In the first
case, the model is estimated in levels. In the second case, the model is estimated in first
differences to get rid of the individual effects. In the last case, the model is estimated in first
differences and time dummies are included.
The model argument specifies whether a one-step or a two-steps model is required ("onestep"
or "twosteps").
The following example is from Arellano and Bond (1991). Employment is explained by past
values of employment (two lags), current and first lag of wages and output and current value
of capital.
R> z1 <- pgmm(log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1) +
+             log(capital) + lag(log(output), 0:1) | lag(log(emp), 2:99),
+             data = EmplUK, effect = "twoways", model = "twosteps")
R> summary(z1)

Call:
pgmm(formula = log(emp) ~ lag(log(emp), 1:2) + lag(log(wage),
0:1) + log(capital) + lag(log(output), 0:1) | lag(log(emp),
2:99), data = EmplUK, effect = "twoways", model = "twosteps")
Residuals
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.6191000 -0.0255700 0.0000000 -0.0001339 0.0332000 0.6410000
Coefficients
Estimate Std. Error z-value Pr(>|z|)
lag(log(emp), 1:2)1 0.474151 0.185398 2.5575 0.0105437 *
lag(log(emp), 1:2)2 -0.052967 0.051749 -1.0235 0.3060506
lag(log(wage), 0:1)0 -0.513205 0.145565 -3.5256 0.0004225 ***
lag(log(wage), 0:1)1 0.224640 0.141950 1.5825 0.1135279
log(capital) 0.292723 0.062627 4.6741 2.953e-06 ***
lag(log(output), 0:1)0 0.609775 0.156263 3.9022 9.530e-05 ***
lag(log(output), 0:1)1 -0.446373 0.217302 -2.0542 0.0399605 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following example is from Blundell and Bond (1998). The "sys" estimator is obtained
using transformation = "ld" for level and difference. The robust argument of the summary
method enables the use of the robust covariance matrix proposed by Windmeijer (2005).
R> z2 <- pgmm(log(emp) ~ lag(log(emp), 1) + lag(log(wage), 0:1) +
+             lag(log(capital), 0:1) | lag(log(emp), 2:99) +
+             lag(log(wage), 2:99) + lag(log(capital), 2:99),
+             data = EmplUK, effect = "twoways", model = "onestep",
+             transformation = "ld")
R> summary(z2, robust = TRUE)

Call:
pgmm(formula = log(emp) ~ lag(log(emp), 1) + lag(log(wage), 0:1) +
lag(log(capital), 0:1) | lag(log(emp), 2:99) + lag(log(wage),
2:99) + lag(log(capital), 2:99), data = EmplUK, effect = "twoways",
model = "onestep", transformation = "ld")
Residuals
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.7530000 -0.0369000 0.0000000 0.0002882 0.0466100 0.6002000
Coefficients
Estimate Std. Error z-value Pr(>|z|)
lag(log(emp), 1) 0.935605 0.026295 35.5810 < 2.2e-16 ***
lag(log(wage), 0:1)0 -0.630976 0.118054 -5.3448 9.050e-08 ***
lag(log(wage), 0:1)1 0.482620 0.136887 3.5257 0.0004224 ***
lag(log(capital), 0:1)0 0.483930 0.053867 8.9838 < 2.2e-16 ***
lag(log(capital), 0:1)1 -0.424393 0.058479 -7.2572 3.952e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
General feasible gls models are estimated by the pggls function, with either a "random
effect"6 or a fixed effects specification; its interface is similar to plm's. For example:

R> zz <- pggls(log(emp) ~ log(wage) + log(capital), data = EmplUK,
+              model = "pooling")
R> summary(zz)

Call:
pggls(formula = log(emp) ~ log(wage) + log(capital), data = EmplUK,
model = "pooling")
Residuals
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.80700 -0.36550 0.06181 0.03230 0.44280 1.58700
Coefficients
Estimate Std. Error z-value Pr(>|z|)
6 The "random effect" is better termed "general fgls" model, as in fact it does not have a proper random effects structure, but we keep this terminology for general language consistency.
The fixed effects pggls (see Wooldridge 2002, p. 276) is based on the estimation of a within
model in the first step; the rest follows as above. It is estimated by:

R> zz.wi <- pggls(log(emp) ~ log(wage) + log(capital), data = EmplUK,
+                 model = "within")
The pggls function is similar to plm in many respects. An exception is that the estimate
of the group covariance matrix of errors (zz$sigma, a matrix, not shown) is reported in the
model objects instead of the usual estimated variances of the two error components.
6. Tests
As sketched in Section 2, specification testing in panel models involves essentially testing
for poolability, for individual or time unobserved e↵ects and for correlation between these
latter and the regressors (Hausman-type tests). As for the other usual diagnostic checks, we
provide a suite of serial correlation tests, while not touching on the issue of heteroskedasticity
testing. Instead, we provide heteroskedasticity-robust covariance estimators, to be described
in Subsection 6.7.
Tests of poolability

pooltest is an F test of stability (a Chow test) under the null hypothesis that the same
coefficients apply to each individual; its first arguments are typically two panelmodel objects,
the results of a plm and of a within pvcm estimation on the same data. The same test can
be computed using a formula as first argument of the pooltest function:
R> pooltest(inv~value+capital,data=Grunfeld,model="within")
Tests for individual and time effects

plmtest implements Lagrange multiplier tests of individual or time effects (or both), based
on the results of the pooling model. The effects tested are indicated with the effect
argument (one of individual, time or twoways).
To test the presence of individual and time effects in the Grunfeld example, using the
Gourieroux, Holly, and Monfort (1982) test, we use:

R> g <- plm(inv ~ value + capital, data = Grunfeld, model = "pooling")
R> plmtest(g, effect = "twoways", type = "ghm")

or

R> plmtest(inv~value+capital,data=Grunfeld,effect="twoways",type="ghm")
pFtest computes F tests of effects based on the comparison of the within and the pooling
models. Its main arguments are either two plm objects (the results of a pooling and a within
model) or a formula.
R> pFtest(inv~value+capital,data=Grunfeld,effect="twoways")
Hausman test

The correlation between individual effects and regressors, motivating the choice between
fixed and random effects specifications, may be tested with phtest by comparing two sets
of estimates, e.g.:

R> phtest(grun.fe, grun.re)

Tests of serial correlation

Serial correlation tests may be either marginal, based on the correct specification of only
part of the model, or joint/conditional, based on the full likelihood: while the former remain
valid as long as that part of the model is correctly specified, the latter, based on the likelihood
framework, are crucially dependent on normality and homoskedasticity of the errors.
In plm we provide a number of joint, marginal and conditional ml-based tests, plus some semi-
parametric alternatives which are robust vs. heteroskedasticity and free from distributional
assumptions.
The semiparametric test for unobserved effects à la Wooldridge (2002, 10.4.4), implemented
as pwtest, is (n-) asymptotically distributed as a standard Normal regardless of the distribution
of the errors. It also does not rely on homoskedasticity.
It has power both against the standard random effects specification, where the unobserved
effects are constant within every group, as well as against any kind of serial correlation. As
such, it "nests" both random effects and serial correlation tests, trading some power against
more specific alternatives in exchange for robustness.
While not rejecting the null favours the use of pooled ols, rejection may follow from serial
correlation of different kinds, and in particular, quoting Wooldridge (2002), "should not be
interpreted as implying that the random effects error structure must be true".
Below, the test is applied to the data and model in Munnell (1990):

R> pwtest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp,data=Produc)

data: formula
z = 3.9383, p-value = 8.207e-05
alternative hypothesis: unobserved effect
R> pbsytest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp,data=Produc,test="j")
data: formula
chisq = 4187.6, df = 2, p-value < 2.2e-16
alternative hypothesis: AR(1) errors or random effects
Rejection of the joint test, though, gives no information on the direction of the departure
from the null hypothesis, i.e.: is rejection due to the presence of serial correlation, of random
effects or of both?
Bera, Sosa-Escudero, and Yoon (2001) derive locally robust tests both for individual random
effects and for first-order serial correlation in residuals as "corrected" versions of the standard
LM test (see plmtest). While still dependent on normality and homoskedasticity, these
are robust to local departures from the hypotheses of, respectively, no serial correlation or
no random effects. The authors observe that, although suboptimal, these tests may help
detect the right direction of the departure from the null, thus complementing the use of
joint tests. Moreover, being based on pooled ols residuals, the BSY tests are computationally
far less demanding than likelihood-based conditional tests.
On the other hand, the statistical properties of these "locally corrected" tests are inferior
to those of the non-corrected counterparts when the latter are correctly specified. If there
is no serial correlation, then the optimal test for random effects is the likelihood-based LM
test of Breusch and Godfrey (with refinements by Honda, see plmtest), while if there are no
random effects the optimal test for serial correlation is, again, Breusch-Godfrey's test9. If the
presence of a random effect is taken for granted, then the optimal test for serial correlation
is the likelihood-based conditional LM test of Baltagi and Li (1995) (see pbltest).
The serial correlation version is the default:
R> pbsytest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp,data=Produc)
data: formula
chisq = 52.636, df = 1, p-value = 4.015e-13
alternative hypothesis: AR(1) errors sub random effects
The BSY test for random effects is implemented in the one-sided version10, which takes heed
that the variance of the random effect must be non-negative:
R> pbsytest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp,data=Produc,test="re")
data: formula
z = 57.914, p-value < 2.2e-16
alternative hypothesis: random effects sub AR(1) errors
9 LM3 in Baltagi and Li (1995).
10 Corresponding to $RSO^{*}_{\mu}$ in the original paper.
R> pbltest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp,data=Produc,alternative="onesided")
As usual, the LM test statistic is based on residuals from the maximum likelihood estimate of
the restricted model (random effects with serially uncorrelated errors). In this case, though,
the restricted model cannot be estimated by ols any more, therefore the testing function
depends on lme() in the nlme package for estimation of a random effects model by maximum
likelihood. For this reason, the test is applicable only to balanced panels.
No test has been implemented to date for the symmetric hypothesis of no random effects in
a model with errors following an AR(1) process, but an asymptotically equivalent likelihood
ratio test is available in the nlme package (see Section 7).
Conversely, in short panels the test is severely biased towards rejection (or, as the
induced correlation is negative, towards acceptance in the case of the one-sided DW test with
alternative="greater"). See below for a serial correlation test applicable to "short" fe
panel models.
plm objects retain the “demeaned” data, so the procedure is straightforward for them. The
wrapper functions pbgtest and pdwtest re-estimate the relevant quasi-demeaned model by
ols and apply, respectively, standard Breusch-Godfrey and Durbin-Watson tests from package
lmtest:
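For instance, both wrappers may be applied to the within model estimated earlier (the order of the BG test is chosen arbitrarily):

R> pbgtest(grun.fe, order = 2)   # Breusch-Godfrey test, second order
R> pdwtest(grun.fe)              # Durbin-Watson test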
The tests share the features of their ols counterparts; in particular, pbgtest allows testing
for higher-order serial correlation, which might prove useful, e.g., on quarterly data. Analogously, from a software point of view, as the functions are simple wrappers around
bgtest and dwtest, all arguments of the latter two apply and may be passed on through
the '...' operator.
Rejecting the restriction that the coefficient on the lagged residual equals $-1/(T-1)$ makes us
conclude against the original null of no serial correlation.
The building blocks available in plm, together with the function linearHypothesis() in package car, make it easy to construct a function carrying out this procedure: first the fe model is
estimated and the residuals retrieved, then they are lagged and a pooling AR(1) model is estimated. The test statistic is obtained by applying linearHypothesis() to the latter model to
test the above restriction on the autoregressive coefficient, supplying a heteroskedasticity- and autocorrelation-consistent
covariance matrix (vcovHC with the appropriate options, in particular method="arellano")12; a rough sketch of these steps is given after the footnotes.
11 […] model as a Breusch-Godfrey LM test on within residuals (see Baltagi and Li 1995, par. 2.3 and formula 12). They also observe that the test on within residuals can be used for testing on the re model, as "the within transformation [time-demeaning, in our terminology] wipes out the individual effects, whether fixed or random". Generalizing the Durbin-Watson test to fe models by applying it to fixed effects residuals is documented in Bhargava, Franzini, and Narendranathan (1982).
12 See Subsection 6.7.
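A rough sketch of the procedure described above (object names are hypothetical; sandwich's vcovHC on the auxiliary regression stands in for the arellano-type estimator):

R> library("car"); library("sandwich")
R> fe  <- plm(log(emp) ~ log(wage) + log(capital), data = EmplUK,
+             model = "within")
R> d   <- data.frame(e  = as.numeric(residuals(fe)),        # within residuals
+                    el = as.numeric(lag(residuals(fe))))   # their first lag
R> aux  <- lm(e ~ el - 1, data = d)                         # pooling AR(1) model
R> Tbar <- mean(pdim(fe)$Tint$Ti)                           # average group size
R> linearHypothesis(aux, paste("el =", -1/(Tbar - 1)),
+                   vcov. = vcovHC(aux, type = "HC0"))      # robust restriction test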
This procedure is implemented in plm as pwartest:

R> pwartest(log(emp) ~ log(wage) + log(capital), data = EmplUK)

data: plm.model
chisq = 312.3, p-value < 2.2e-16
alternative hypothesis: serial correlation
The test is applicable to any fe panel model, and in particular to “short” panels with small
T and large n.
and testing the restriction that the autoregressive coefficient equals $-0.5$, corresponding to the
null of no serial correlation. Drukker (2003) provides Monte Carlo evidence of the good
empirical properties of the test.
At the other extreme (see Wooldridge 2002, 10.6.1), if the differenced errors $e_{it}$ are uncorrelated,
then by definition $u_{it} = u_{i,t-1} + e_{it}$, so that $u_{it}$ is a random walk. In this latter case, the
most efficient estimator is the first difference (fd) one; in the former case, it is the fixed effects
one (within).
The function pwfdtest allows testing either hypothesis; the default behaviour h0="fd" is to
test for serial correlation in first-differenced errors:

R> pwfdtest(log(emp) ~ log(wage) + log(capital), data = EmplUK)
data: plm.model
chisq = 1.5251, p-value = 0.2169
alternative hypothesis: serial correlation in differenced errors
while specifying h0="fe" the null hypothesis becomes no serial correlation in the original
errors, which is similar to pwartest:

R> pwfdtest(log(emp) ~ log(wage) + log(capital), data = EmplUK, h0 = "fe")
data: plm.model
chisq = 131.55, p-value < 2.2e-16
alternative hypothesis: serial correlation in original errors
Not rejecting one of the two is evidence in favour of using the estimator corresponding to
h0. Should the truth lie in the middle (both rejected), whichever estimator is chosen will
have serially correlated errors: therefore it will be advisable to use the autocorrelation-robust
covariance estimators of Subsection 6.7 in inference.
i.e., as averages over the time dimension of pairwise correlation coefficients for each pair of
cross-sectional units.
The Breusch-Pagan (Breusch and Pagan 1980) LM test, based on the squares of $\rho_{ij}$, is valid
for $T \to \infty$ with $n$ fixed; it is defined as

$$LM = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} T_{ij}\, \hat\rho_{ij}^2$$
where in the case of an unbalanced panel only pairwise complete observations are considered,
and $T_{ij} = \min(T_i, T_j)$ with $T_i$ being the number of observations for individual $i$; else, if the

14 This is the case, e.g., in an unobserved effects model when cross-sectional dependence is due to an unobservable factor structure, with factors that are uncorrelated with the regressors. In this case the within or random estimators are still consistent, although inefficient (see De Hoyos and Sarafidis 2006).
panel is balanced, $T_{ij} = T$ for each $i, j$. The test is distributed as $\chi^2_{n(n-1)/2}$. It is inappropriate
whenever the $n$ dimension is "large". A scaled version, applicable also if $T \to \infty$ and then
$n \to \infty$ (as in some pooled time series contexts), is defined as

$$SCLM = \sqrt{\frac{1}{n(n-1)}} \left( \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \sqrt{T_{ij}}\, \hat\rho_{ij}^2 \right)$$

while Pesaran's (2004) CD test,

$$CD = \sqrt{\frac{2}{n(n-1)}} \left( \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \sqrt{T_{ij}}\, \hat\rho_{ij} \right)$$
based on $\hat\rho_{ij}$ without squaring (also distributed as a standard Normal), is appropriate both in
$n$– and in $T$–asymptotic settings. It has remarkable properties in samples of any practically
relevant size and is robust to a variety of settings. The only big drawback is that the test
loses power against the alternative of cross-sectional dependence if the latter is due to a factor
structure with factor loadings averaging zero, that is, some units react positively to common
shocks, others negatively.
The default version of the test is "cd". These tests are originally meant to use the residuals
of separate estimation of one time-series regression for each cross-sectional unit, so this is the
default behaviour of pcdtest.
R> pcdtest(inv ~ value + capital, data = Grunfeld)

data: formula
z = 5.3401, p-value = 9.292e-08
alternative hypothesis: cross-sectional dependence
If a different model specification (within, random, ...) is assumed consistent, one can resort
to its residuals for testing15 by specifying the relevant model type. The main argument of
this function may be either a model of class panelmodel or a formula and a data.frame; in
the second case, unless model is set to NULL, all the usual parameters relative to the estimation
of a plm model may be passed on. The test is compatible with any consistent panelmodel
for the data at hand, with any specification of effect. E.g., specifying effect="time" or
effect="twoways" allows testing for residual cross-sectional dependence after the introduction
of time fixed effects to account for common shocks.
data: formula
z = 4.6612, p-value = 3.144e-06
alternative hypothesis: cross-sectional dependence
If the time dimension is insufficient and model=NULL, the function defaults to estimation of a
within model and issues a warning.
where $[w(p)]_{ij}$ is the $(i,j)$-th element of the $p$-th order proximity matrix, so that if $h, k$ are
not neighbours, $[w(p)]_{hk} = 0$ and $\hat\rho_{hk}$ gets "killed"; this is easily seen to reduce to formula
(14) in Pesaran (2004) for the special case considered in that paper. The same principle can
be applied to the LM and SCLM tests.
Therefore, the local version of either test can be computed supplying an $n \times n$ matrix (of any
kind coercible to logical), providing information on whether any pair of observations are
neighbours or not, to the w argument. If w is supplied, only neighbouring pairs will be used in
computing the test; else, w will default to NULL and all observations will be used. The matrix
need not really be binary, so commonly used "row-standardized" matrices can be employed
as well: it is enough that neighbouring pairs correspond to nonzero elements in w16. A sketch follows.
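A minimal sketch with a hypothetical proximity structure, here a simple chain in which firm i neighbours firm i + 1 in the Grunfeld data:

R> w <- matrix(FALSE, 10, 10)    # 10 firms in the Grunfeld data
R> w[cbind(1:9, 2:10)] <- TRUE   # mark neighbouring pairs
R> pcdtest(inv ~ value + capital, data = Grunfeld, w = w)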
Preliminary results

We consider the following model:

$$\Delta y_{it} = \rho_i y_{i,t-1} + \sum_{L=1}^{p_i} \theta_{iL} \Delta y_{i,t-L} + \alpha_{mi} d_{mt} + \epsilon_{it}$$
16 The very comprehensive package spdep for spatial dependence analysis (see Bivand 2008) contains features for creating, lagging and manipulating neighbour list objects of class nb, which can be readily converted to and from proximity matrices by means of the nb2mat function. Higher orders of the CD(p) test can be obtained by lagging the corresponding nbs through nblag.
$$\Delta y_{it} = \rho y_{i,t-1} + \sum_{L=1}^{p_i} \theta_{iL} \Delta y_{i,t-L} + \alpha_{mi} d_{mt} + \epsilon_{it}$$
• the Hall method, which consists in removing higher lags as long as they are not significant.

The ADF regression is run on $T - p_i - 1$ observations for each individual, so that the total
number of observations is $n \times \tilde T$ where $\tilde T = T - \bar p - 1$, $\bar p$ being the average number of lags.
Call $e_i$ the vector of residuals.
Estimate the variance of the $\epsilon_i$ as:

$$\hat\sigma_{\epsilon_i}^2 = \frac{\sum_{t=p_i+1}^{T} e_{it}^2}{df_i}$$
Levin-Lin-Chu model

Then, compute artificial regressions of $\Delta y_{it}$ and $y_{i,t-1}$ on $\Delta y_{i,t-L}$ and $d_{mt}$ and get the two
vectors of residuals $z_{it}$ and $v_{it}$.
Standardize these two residuals and run the pooled regression of $z_{it}/\hat\sigma_i$ on $v_{it}/\hat\sigma_i$ to get $\hat\rho$, its
standard deviation $\hat\sigma(\hat\rho)$ and the t-statistic $t_{\hat\rho} = \hat\rho / \hat\sigma(\hat\rho)$.
Compute the long run variance of $y_i$:

$$\hat\sigma_{y_i}^2 = \frac{1}{T-1} \sum_{t=2}^{T} \Delta y_{it}^2 + 2 \sum_{L=1}^{\bar K} w_{\bar K L} \left[ \frac{1}{T-1} \sum_{t=2+L}^{T} \Delta y_{it} \Delta y_{i,t-L} \right]$$
Define $\bar s_i$ as the ratio of the long and short run variances and $\bar s$ as the mean for all the individuals
of the sample:

$$s_i = \frac{\hat\sigma_{y_i}}{\hat\sigma_{\epsilon_i}}, \qquad \bar s = \frac{\sum_{i=1}^{n} s_i}{n}$$
The adjusted t-statistic

$$t^{*}_{\rho} = \frac{t_{\rho} - n \tilde T \bar s\, \hat\sigma_{\tilde\epsilon}^{-2}\, \hat\sigma(\hat\rho)\, \mu^{*}_{m\tilde T}}{\sigma^{*}_{m\tilde T}}$$

follows a normal distribution under the null hypothesis of stationarity. $\mu^{*}_{m\tilde T}$ and $\sigma^{*}_{m\tilde T}$ are
given in table 2 of the original paper and are also available in the package.
The Im-Pesaran-Shin statistic is the mean of the individual t-statistics:

$$\bar t = \frac{1}{n} \sum_{i=1}^{n} t_{\rho_i}$$

The corresponding moments $\mu^{*}_{m\tilde T}$ and $\sigma^{*}_{m\tilde T}$ are given in table 2 of the original paper and are
also available in the package.
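These panel unit root procedures are exposed through the purtest function; a minimal sketch (the argument values are illustrative):

R> purtest(inv ~ 1, data = Grunfeld, index = c("firm", "year"),
+          test = "levinlin", lags = "AIC", pmax = 4)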
while "white2" is "white1" restricted to a common variance inside every group, estimated
as $\sigma_i^2 = \sum_{t=1}^{T} \hat u_{it}^2 / T$, so that $\Omega_i = \sigma_i^2 I_T$ (see Greene (2003, 13.7.1–2) and Wooldridge (2002,
10.7.2)); "arellano" (see ibid. and the original reference Arellano 1987) allows a fully general
structure w.r.t. heteroskedasticity and serial correlation:
17 See White (1980) and White (1984).
$$\Omega_i = \begin{pmatrix}
\sigma_{i1}^2 & \sigma_{i1,i2} & \ldots & \ldots & \sigma_{i1,iT} \\
\sigma_{i2,i1} & \sigma_{i2}^2 & & & \vdots \\
\vdots & & \ddots & & \vdots \\
\vdots & & & \sigma_{iT-1}^2 & \sigma_{iT-1,iT} \\
\sigma_{iT,i1} & \ldots & \ldots & \sigma_{iT,iT-1} & \sigma_{iT}^2
\end{pmatrix} \qquad (17)$$
The latter is, as already observed, consistent w.r.t. timewise correlation of the errors but,
conversely, unlike the white1 and white2 methods, it relies on large-n asymptotics with small
T.
The fixed effects case, as already observed in Section 6.4 on serial correlation, is complicated
by the fact that the demeaning induces serial correlation in the errors. The original White
estimator (white1) turns out to be inconsistent for fixed T as n grows, so in this case it is
advisable to use the arellano version (see Stock and Watson 2006).
The errors may be weighted according to the schemes proposed by MacKinnon and White
(1985) and Cribari-Neto (2004) to improve small-sample performance18 .
The main use of vcovHC is together with testing functions from the lmtest and car packages.
These typically allow passing the vcov parameter either as a matrix or as a function (see
Zeileis 2004). If one is happy with the defaults, it is easiest to pass the function itself:
R> library("lmtest")
R> re <- plm(inv~value+capital,data=Grunfeld,model="random")
R> coeftest(re,vcovHC)
t test of coefficients:
or one may do the covariance computation inside the call to coeftest, thus passing on a
matrix:
R> coeftest(re,vcovHC(re,method="white2",type="HC3"))
For some tests, e.g. for multiple model comparisons by waldtest, one should always provide
a function19 . In this case, optional parameters are provided as shown below (see also Zeileis
2004, p.12):
18 The HC3 and HC4 weighting schemes are computationally expensive and may hit memory limits for nT in the thousands, where on the other hand it makes little sense to apply small sample corrections.
19 Joint zero-restriction testing still allows providing the vcov of the unrestricted model as a matrix; see the documentation of package lmtest.
R> waldtest(re, update(re, . ~ . - capital),
+           vcov = function(x) vcovHC(x, method = "white2", type = "HC3"))

Wald test
Moreover, linearHypothesis from package car may be used to test for linear restrictions:
R> library("car")
R> linearHypothesis(re, "2*value=capital", vcov.=vcovHC)
Hypothesis:
2 value - capital = 0
A specific vcovHC method for pgmm objects is also provided which implements the robust
covariance matrix proposed by Windmeijer (2005) for generalized method of moments esti-
mators.
7. plm versus nlme and lme4

In this section we present a brief comparison between the most common
panel data specifications used in econometrics and the general framework used in statistics
for mixed models20.
R is particularly strong on mixed models’ estimation, thanks to the long-standing nlme pack-
age (see Pinheiro et al. 2007) and the more recent lme4 package, based on S4 classes (see
Bates 2007)21 . In the following we will refer to the more established nlme to give some ex-
amples of “econometric” panel models that can be estimated in a likelihood framework, also
including some likelihood ratio tests. Some of them are not feasible in plm and make a useful
complement to the econometric “toolbox” available in R.
20 This discussion does not consider gmm models. One of the basic reasons for econometricians not to choose maximum likelihood methods in estimation is that the strict exogeneity of regressors assumption required for consistency of the ml models reported in the following is often inappropriate in economic settings.
21 The standard reference on the subject of mixed models in S/R is Pinheiro and Bates (2000).
22 Lagrange multiplier tests based on the likelihood principle are suitable for testing against more general alternatives on the basis of a maintained model with spherical residuals and therefore find application in testing for departures from the classical hypotheses on the error term. The seminal reference is Breusch and Pagan (1980).
23 For fixed effects estimation, as the sample grows (on the dimension on which the fixed effects are specified) so does the number of parameters to be estimated. Estimation of individual fixed effects is T– (but not n–) consistent, and vice versa.
24 In doing so, we stress that "equivalence" concerns only the specification of the model, and neither the appropriateness nor the relative efficiency of the relevant estimation techniques, which will of course be dependent on the context. Unlike their mixed model counterparts, the specifications in plm are, strictly speaking, distribution-free. Nevertheless, for the sake of exposition, in the following we present them in the setting which ensures consistency and efficiency (e.g., we consider the hypothesis of spherical errors part of the specification of pooled ols and so forth).
In the Laird and Ware (1982) notation, a mixed model may be written as

$$y_{ij} = \beta_1 x_{1ij} + \ldots + \beta_p x_{pij} + b_{i1} z_{1ij} + \ldots + b_{iq} z_{qij} + \epsilon_{ij}$$

where the $x_1, \ldots, x_p$ are the fixed effects regressors and the $z_1, \ldots, z_q$ are the random effects
regressors, assumed to be normally distributed across groups. The covariance of the random
effects coefficients, $\psi_{kk'}$, is assumed constant across groups and the covariances between the
errors in group $i$, $\sigma^2 \lambda_{ijj'}$, are described by the term $\lambda_{ijj'}$ representing the correlation structure
of the errors within each group (e.g., serial correlation over time) scaled by the common error
variance $\sigma^2$.
Random effects

In the Laird and Ware notation, the re specification is a model with only one random effects
regressor: the intercept. Formally, $z_{1ij} = 1$ for all $i, j$; $z_{qij} = 0$ for all $i$, all $j$ and all $q \neq 1$;
and $\lambda_{ijj'} = 1$ for $j = j'$, $0$ else. The composite error is therefore $u_{ij} = 1 b_{i1} + \epsilon_{ij}$. Below we
report the coefficients of Grunfeld's model estimated by gls and then by ml:
R> require(nlme)
R> reGLS<-plm(inv~value+capital,data=Grunfeld,model="random")
R> reML<-lme(inv~value+capital,data=Grunfeld,random=~1|firm)
R> coef(reGLS)
R> summary(reML)$coef$fixed
Estimation of a mixed model with random coefficients on all regressors is rather demanding
from the computational side. Some models from our examples fail to converge. The example
below is estimated on the Grunfeld data and model with time effects.
R> vcm<-pvcm(inv~value+capital,data=Grunfeld,model="random",effect="time")
R> vcmML<-lme(inv~value+capital,data=Grunfeld,random=~value+capital|year)
R> coef(vcm)
y
(Intercept) -18.5538638
value 0.1239595
capital 0.1114579
R> summary(vcmML)$coef$fixed
The within (fixed coefficients) counterpart may be compared with nlme's lmList:

R> vcmf<-pvcm(inv~value+capital,data=Grunfeld,model="within",effect="time")
R> vcmfML<-lmList(inv~value+capital|year,data=Grunfeld)
Unrestricted fgls

The general, or unrestricted, feasible gls, pggls in the plm nomenclature, is equivalent to
a model with no random effects regressors ($b_{iq} = 0$ for all $i, q$) and an error covariance structure
which is unrestricted within groups apart from the usual requirements. The function for
estimating such models with correlation in the errors but no random effects is gls().
This very general serial correlation and heteroskedasticity structure is not estimable for the
original Grunfeld data, which have more time periods than firms; therefore we restrict them
to firms 4 to 6, as sketched below.
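The estimation calls are omitted above; a minimal sketch with hypothetical object names and an illustrative choice of nlme correlation structure (unrestricted correlation across the three firms within each year):

R> library("nlme")
R> sGrun  <- subset(Grunfeld, firm %in% 4:6)   # keep firms 4 to 6 only
R> gglsML <- gls(inv ~ value + capital, data = sGrun,
+                correlation = corSymm(form = ~ 1 | year))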
R> summary(gglsML)$coef
The within case is analogous, with the regressors' set augmented by $n - 1$ group dummies.
A pooled model with AR(1) errors may be fit by gls:

R> lmAR1ML<-gls(inv~value+capital,data=Grunfeld,
+               correlation=corAR1(0,form=~year|firm))

and analogously the random effects panel with, e.g., AR(1) errors (see Baltagi 2001, chap. 5),
which is a very common specification in econometrics, may be fit by lme specifying an additional random intercept:
R> reAR1ML<-lme(inv~value+capital,data=Grunfeld,random=~1|firm,
+ correlation=corAR1(0,form=~year|firm))
The regressors’ coefficients and the error’s serial correlation coefficient may be retrieved this
way:
R> summary(reAR1ML)$coef$fixed
R> coef(reAR1ML$modelStruct$corStruct,unconstrained=FALSE)
Phi
0.823845
Significance statistics for the regressors’ coefficients are to be found in the usual summary
object, while to get the significance test of the serial correlation coefficient one can do a
likelihood ratio test as shown in the following.
R> lmML<-gls(inv~value+capital,data=Grunfeld)
R> anova(lmML,lmAR1ML)
The AR(1) test on the random effects model is to be done in much the same way, using the
random effects model objects estimated above:
R> anova(reML,reAR1ML)
A likelihood ratio test for random effects compares the specifications with and without random
effects and spherical idiosyncratic errors:
R> anova(lmML,reML)
The random effects, AR(1) errors model in turn nests the AR(1) pooling model; therefore
a likelihood ratio test for random effects sub AR(1) errors may be carried out, again, by
comparing the two autoregressive specifications:
R> anova(lmAR1ML,reAR1ML)
whence we see that the Grunfeld model specification does not seem to need any random effects
once we control for serial correlation in the data.
8. Conclusions
With plm we aim at providing a comprehensive package containing the standard functionali-
ties that are needed for the management and the econometric analysis of panel data. In partic-
ular, we provide: functions for data transformation; estimators for pooled, random and fixed
effects static panel models and variable coefficients models, general gls for general covariance
structures, and generalized method of moments estimators for dynamic panels; specification
and diagnostic tests. Instrumental variables estimation is supported. Most estimators allow
working with unbalanced panels. While among the different approaches to longitudinal data
analysis we take the perspective of the econometrician, the syntax is consistent with the basic
linear modeling tools, like the lm function.
On the input side, formula and data arguments are used to specify the model to be estimated.
Special functions are provided to make writing formulas easier, and the structure of the data
is indicated with an index argument.
On the output side, the model objects (of the new class panelmodel) are compatible with
the general restriction testing frameworks of packages lmtest and car. Specialized methods
are also provided for the calculation of robust covariance matrices; heteroskedasticity- and
correlation-consistent testing is accomplished by passing these on to testing functions, together
with a panelmodel object.
The main functionalities of the package have been illustrated here by applying them on some
well-known datasets from the econometric literature. The similarities and differences with
the maximum likelihood approach to longitudinal data have also been briefly discussed.
We plan to expand the methods in this paper to systems of equations and to the estimation
of models with autoregressive errors. The addition of covariance estimators robust to cross-sectional correlation is also in the offing. Lastly, conditional visualization features in the R
environment seem to offer a promising toolbox for visual diagnostics, which is another subject
for future work.
Acknowledgments
While retaining responsibility for any error, we thank Jeffrey Wooldridge, Achim Zeileis and
three anonymous referees for useful comments. We also acknowledge kind editing assistance
by Lisa Benedetti.
References
Arellano M (1987). “Computing Robust Standard Errors for Within Group Estimators.”
Oxford Bulletin of Economics and Statistics, 49, 431–434.
Arellano M, Bond S (1991). "Some Tests of Specification for Panel Data: Monte Carlo
Evidence and an Application to Employment Equations." Review of Economic Studies, 58,
277–297.
Baltagi B (2001). Econometric Analysis of Panel Data. 3rd edition. John Wiley and Sons
ltd.
Baltagi B, Li Q (1991). "A Joint Test for Serial Correlation and Random Individual Effects."
Statistics and Probability Letters, 11, 277–280.
Bates D (2007). lme4: Linear Mixed-Effects Models Using S4 Classes. R package version
0.99875-9, URL http://CRAN.R-project.org.
Bates D, Maechler M (2007). Matrix: A Matrix Package for R. R package version 0.99875-2,
URL http://CRAN.R-project.org.
Bera A, Sosa-Escudero W, Yoon M (2001). “Tests for the Error Component Model in the
Presence of Local Misspecification.” Journal of Econometrics, 101, 1–23.
Bhargava A, Franzini L, Narendranathan W (1982). "Serial Correlation and the Fixed Effects
Model." Review of Economic Studies, 49, 533–554.
Bivand R (2008). spdep: Spatial Dependence: Weighting Schemes, Statistics and Models. R
package version 0.4-17.
Blundell R, Bond S (1998). "Initial Conditions and Moment Restrictions in Dynamic Panel
Data Models." Journal of Econometrics, 87, 115–143.
Breusch T, Pagan A (1980). “The Lagrange Multiplier Test and Its Applications to Model
Specification in Econometrics.” Review of Economic Studies, 47, 239–253.
Cornwell C, Rupert P (1988). “Efficient Estimation With Panel Data: an Empirical Compar-
ison of Instrumental Variables Estimators.” Journal of Applied Econometrics, 3, 149–155.
Croissant Y, Millo G (2008). “Panel Data Econometrics in R: The plm Package.” Journal of
Statistical Software, 27(2). URL http://www.jstatsoft.org/v27/i02/.
Drukker D (2003). “Testing for Serial Correlation in Linear Panel–Data Models.” The Stata
Journal, 3(2), 168–177.
Fox J (2007). car: Companion to Applied Regression. R package version 1.2-5, URL
http://CRAN.R-project.org/, http://socserv.socsci.mcmaster.ca/jfox/.
Gourieroux C, Holly A, Monfort A (1982). “Likelihood Ratio Test, Wald Test, and Kuhn–
Tucker Test in Linear Models With Inequality Constraints on the Regression Parameters.”
Econometrica, 50, 63–80.
Hausman J, Taylor W (1981). “Panel Data and Unobservable Individual E↵ects.” Economet-
rica, 49, 1377–1398.
Honda Y (1985). “Testing the Error Components Model With Non–Normal Disturbances.”
Review of Economic Studies, 52, 681–690.
Kleiber C, Zeileis A (2008). Applied Econometrics with R. Springer-Verlag, New York. ISBN
978-0-387-77316-2, URL http://CRAN.R-project.org/package=AER.
Koenker R, Ng P (2007). SparseM: Sparse Linear Algebra. R package version 0.74, URL
http://CRAN.R-project.org.
Laird N, Ware J (1982). "Random-Effects Models for Longitudinal Data." Biometrics, 38,
963–974.
Mundlak Y (1978). “On the Pooling of Time Series and Cross Section Data.” Econometrica,
46(1), 69–85.
Munnell A (1990). “Why Has Productivity Growth Declined? Productivity and Public In-
vestment.” New England Economic Review, pp. 3–22.
Nerlove M (1971). “Further Evidence on the Estimation of Dynamic Economic Relations from
a Time–Series of Cross–Sections.” Econometrica, 39, 359–382.
Pesaran M (2004). “General Diagnostic Tests for Cross Section Dependence in Panels.” CESifo
Working Paper Series, 1229.
Pinheiro J, Bates D, DebRoy S, the R Core Team (2007). nlme: Linear and Nonlinear
Mixed Effects Models. R package version 3.1-86, URL http://CRAN.R-project.org.
R Development Core Team (2008). R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL
http://www.R-project.org/.
Swamy P, Arora S (1972). “The Exact Finite Sample Properties of the Estimators of Coeffi-
cients in the Error Components Regression Models.” Econometrica, 40, 261–275.
Wallace T, Hussain A (1969). “The Use of Error Components Models in Combining Cross
Section With Time Series Data.” Econometrica, 37(1), 55–72.
Windmeijer F (2005). "A Finite Sample Correction for the Variance of Linear Efficient Two-Step GMM Estimators." Journal of Econometrics, 126, 25–51.
Wooldridge J (2002). Econometric Analysis of Cross–Section and Panel Data. MIT press.
Zeileis A (2004). “Econometric Computing With HC and HAC Covariance Matrix Estimators.”
Journal of Statistical Software, 11(10), 1–17. URL http://www.jstatsoft.org/v11/i10/.
Affiliation:
Yves Croissant
LET-ISH
Avenue Berthelot
F-69363 Lyon cedex 07
Telephone: +33/4/78727249
Fax: +33/4/78727248
E-mail: [email protected]
Giovanni Millo
DiSES, Un. of Trieste and R&D Dept., Generali SpA
Via Machiavelli 4
34131 Trieste (Italy)
Telephone: +39/040/671184
Fax: +39/040/671160
E-mail: [email protected]