Croissant and Millo, Panel Data Econometrics
Abstract
This introduction to the plm package is a slightly modified version of Croissant and
Millo (2008), published in the Journal of Statistical Software.
Panel data econometrics is obviously one of the main fields in the profession, but most
of the models used are difficult to estimate with R. plm is a package for R which intends
to make the estimation of linear panel models straightforward. plm provides functions to
estimate a wide variety of models and to make (robust) inference.
1. Introduction
Panel data econometrics is a continuously developing field. The increasing availability of
data observed on cross-sections of units (like households, firms, countries etc.) and over time
has given rise to a number of estimation approaches exploiting this double dimensionality to
cope with some of the typical problems associated with economic data, first of all that of
unobserved heterogeneity.
Timewise observation of data from different observational units has long been common in
other fields of statistics (where such data are often termed longitudinal data). In the panel data
field as well as in others, the econometric approach is nevertheless peculiar with respect to
experimental contexts, as it emphasizes model specification and testing and tackles a
number of issues arising from the particular statistical problems associated with economic
data.
Thus, a very comprehensive software framework for (among many other features) maximum
likelihood estimation of linear regression models for longitudinal data is available in
the R (R Development Core Team 2008) environment through the packages nlme
(Pinheiro, Bates, DebRoy, and the R Core team 2007) and lme4 (Bates 2007), and it can be
used, e.g., for estimation of random effects panel models. Its use, however, is not intuitive
for a practicing econometrician, and maximum likelihood estimation is only one of the
possible approaches to panel data econometrics. Moreover, economic panel datasets are often
unbalanced (i.e., the number of observations differs between groups), a case which requires
some adaptation of the methods and is not compatible with those in nlme. Hence the need for a package doing panel
data “from the econometrician’s viewpoint” and featuring at a minimum the basic techniques
econometricians are used to: random and fixed effects estimation of static linear panel data
models, variable coefficients models, generalized method of moments estimation of dynamic
models; and the basic toolbox of specification and misspecification diagnostics.
2 Panel Data Econometrics in R: The plm Package
Furthermore, we felt there was a need for automation of some basic data management tasks
such as lagging, summing and, more in general, applying (in the R sense) functions to the
data, which, although conceptually simple, become cumbersome and error-prone on two-
dimensional data, especially in the case of unbalanced panels.
This paper is organized as follows: Section 2 presents a very short overview of the typical
model taxonomy (see footnote 1). Section 3 discusses the software approach used in the package. The next
three sections present the functionalities of the package in more detail: data management
(Section 4), estimation (Section 5) and testing (Section 6), giving a short description and
illustrating them with examples. Section 7 compares the approach in plm to that of nlme
and lme4, highlighting the features of the latter two that an econometrician might find most
useful. Section 8 concludes the paper.
The appropriate estimation method for this model depends on the properties of the two error
components. The idiosyncratic error $\epsilon_{it}$ is usually assumed well-behaved and independent of
both the regressors $x_{it}$ and the individual error component $\mu_i$. The individual component
may be in turn either independent of the regressors or correlated.
1
Comprehensive treatments are to be found in many econometrics textbooks, e.g. Baltagi (2005, 2013) or
Wooldridge (2002, 2010): the reader is referred to these, especially to the first 9 chapters of Baltagi (2005,
2013).
2
For the sake of exposition we are considering only the individual effects case here. There may also be time
effects, which is a symmetric case, or both of them, so that the error has three components: $u_{it} = \mu_i + \lambda_t + \epsilon_{it}$.
Yves Croissant, Giovanni Millo 3
(where $\Delta y_{it} = y_{it} - y_{i,t-1}$, $\Delta x_{it} = x_{it} - x_{i,t-1}$ and, from (3), $\Delta u_{it} = u_{it} - u_{i,t-1} = \Delta\epsilon_{it}$ for
$t = 2, \dots, T$) can be consistently estimated by pooled ols. This is called the first-difference,
or fd estimator. Its relative efficiency, and so reasons for choosing it against other consistent
alternatives, depends on the properties of the error term. The fd estimator is usually preferred
if the errors $u_{it}$ are strongly persistent in time, because then the $\Delta u_{it}$ will tend to be serially
uncorrelated.
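As a rough illustration of the first-difference idea, the following sketch uses only base R on simulated data. The data-generating process and all variable names are assumptions made for this example; they are not part of plm.

```r
set.seed(1)
n  <- 50                                  # number of individuals (assumed)
tt <- 6                                   # number of time periods (assumed)
id <- rep(seq_len(n), each = tt)
mu <- rep(rnorm(n), each = tt)            # unobserved individual effect
x  <- mu + rnorm(n * tt)                  # regressor correlated with mu
y  <- 1 + 2 * x + mu + rnorm(n * tt, sd = 0.5)

# Difference within each individual; one observation per unit is lost.
dy <- unlist(tapply(y, id, diff))
dx <- unlist(tapply(x, id, diff))

# Pooled ols on the differenced data: intercept and mu are differenced out.
fd_fit <- lm(dy ~ dx - 1)
coef(fd_fit)

# Pooled ols in levels is biased here because x is correlated with mu.
coef(lm(y ~ x))["x"]
```

In this simulated setting the differenced regression should recover a slope close to the true value of 2, while the regression in levels should not.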
Lastly, the between model, which is computed on time (group) averages of the data, discards
all the information due to intragroup variability but is consistent in some settings (e.g., non-
stationarity) where the others are not, and is often preferred to estimate long-run relationships.
Variable coefficients models relax the assumption that $\beta_{it} = \beta$ for all $i, t$. Fixed coefficients
models allow the coefficients to vary along one dimension, like $\beta_{it} = \beta_i$ for all $t$. Random
coefficients models instead assume that coefficients vary randomly around a common average,
as $\beta_{it} = \beta + \eta_i$ for all $t$, where $\eta_i$ is a group- (time-) specific effect with mean zero.
The hypotheses on parameters and error terms (and hence the choice of the most appropriate
estimator) are usually tested by means of:
• pooling tests to check poolability, i.e. the hypothesis that the same coefficients apply
across all individuals,
• if the homogeneity assumption over the coefficients is established, the next step is to
establish the presence of unobserved effects, comparing the null of spherical residuals
with the alternative of group (time) specific effects in the error term,
• the choice between fixed and random effects specifications is based on Hausman-type
tests, comparing the two estimators under the null of no significant difference: if this is
not rejected, the more efficient random effects estimator is chosen,
• even after this step, departures of the error structure from sphericity can further affect
inference, so that either screening tests or robust diagnostics are needed.
Dynamic models and, more generally, lack of strict exogeneity of the regressors pose further
problems for estimation, which are usually dealt with in the generalized method of moments (gmm)
framework.
These were, in our opinion, the basic requirements of a panel data econometrics package
for the R language and environment. Some, as often happens with R, were already fulfilled
by packages developed for other branches of computational statistics, while others (like the
fixed effects or the between estimators) were straightforward to compute after transforming
the data, but in every case there were either language inconsistencies w.r.t. the standard
econometric toolbox or subtleties to be dealt with (like, for example, appropriate computation
of standard errors for the demeaned model, a common pitfall), so we felt there was a need for an
“all in one” econometrics-oriented package that allows specification searches, estimation
and inference to be carried out in a natural way.
3. Software approach
• NULL (the default value), it is then assumed that the first two columns contain the
individual and the time index and that observations are ordered by individual and by
time period,
• a character vector of length two containing the names of the individual and the time
index,
• an integer which is the number of individuals (only in case of a balanced panel with
observations ordered by individual).
The pdata.frame function is then called internally; it returns a pdata.frame, which is
a data.frame with an attribute called index. This attribute is a data.frame that contains
the individual and the time indexes.
It is also possible to call the pdata.frame function directly and then to pass the resulting
pdata.frame to the estimation functions.
3.2. Interface
Estimation interface
plm provides four functions for estimation:
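• plm: estimation of the basic panel models, i.e. within, between and random effect
models. Models are estimated by applying the lm function to transformed data,
• pvcm: estimation of models with variable coefficients,
• pgmm: estimation of generalized method of moments models,
• pggls: estimation of general feasible generalized least squares models.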
The interface of these functions is consistent with the lm() function. Namely, their first two
arguments are formula and data (which should be a data.frame and is mandatory). Three
additional arguments are common to these functions:
• index: this argument enables the estimation functions to identify the structure of the
data, i.e. the individual and the time period for each observation,
• effect: the kind of effects to include in the model, i.e. individual effects, time effects
or both (see footnote 3),
• model: the kind of model to be estimated, most of the time a model with fixed effects
or a model with random effects.
The results of these four functions are stored in an object whose class has the same name
as the function. They all inherit from class panelmodel. A panelmodel object contains
coefficients, residuals, fitted.values, vcov, df.residual and call; functions that
extract these elements are provided.
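A minimal sketch of this interface, assuming the plm package is installed and using its bundled Grunfeld data. The object name fe is chosen here for illustration only.

```r
library(plm)
data("Grunfeld", package = "plm")

# Within (fixed effects) model; index names the individual and time variables.
fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within")

class(fe)             # includes "plm" and "panelmodel"
coef(fe)              # estimated slope coefficients
head(residuals(fe))   # residuals of the fitted model
vcov(fe)              # coefficients' covariance matrix
df.residual(fe)       # residual degrees of freedom
```

The extractors are the usual generics known from lm(), so code written for linear model objects should carry over with little change.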
Testing interface
The diagnostic testing interface provides both formula and panelmodel methods for most
functions, with some exceptions. The user may thus choose whether to employ results stored
in a previously estimated panelmodel object or to re-estimate it for the sake of testing.
Although the first strategy is the most efficient one, diagnostic testing on panel models mostly
employs ols residuals from pooling model objects, whose estimation is computationally in-
expensive. Therefore most examples in the following are based on formula methods, which
are perhaps the cleanest for illustrative purposes.
3
Although in most models the individual and time effects cases are symmetric, there are exceptions: es-
timating the fd model on time effects is meaningless because cross-sections do not generally have a natural
ordering, so trying effect="time" stops with an error message as does effect="twoways" which is not defined
for fd models.
Nevertheless, in practice plain computation of β̂ has long been an intractable problem even
for moderate-sized datasets because of the need to invert the N × N V̂ matrix. With the
advances in computer power, this is no longer the case, and it is possible to program the “naive”
estimator (5) in R with standard matrix algebra operators and have it working seamlessly for
the standard “guinea pigs”, e.g. the Grunfeld data. Estimation with a couple of thousand
data points also becomes feasible on a modern machine, although excruciatingly slow and
definitely not suitable for everyday econometric practice. Memory limits would also be
reached quickly because of the storage needs related to the huge V̂ matrix. An established solution
exists for the random effects model which reduces the problem to an ordinary least squares
computation.
where $\theta = 1 - [\sigma_u^2/(\sigma_u^2 + T\sigma_e^2)]^{1/2}$, $\bar{y}$ and $\bar{X}$ denote time means of $y$ and $X$, and the disturbance
$v_{it} - \theta\bar{v}_i$ is homoskedastic and serially uncorrelated. Thus the feasible re estimate for $\beta$ may
be obtained by estimating $\hat{\theta}$ and running an ols regression on the transformed data with lm().
The other estimators can be computed as special cases: for $\theta = 1$ one gets the fixed effects
estimator, for $\theta = 0$ the pooled ols one.
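The two special cases can be checked numerically with a small base R sketch. The simulated data and the helper names quasi_demean and slope_for are assumptions introduced for this illustration; this is not how plm is implemented internally.

```r
set.seed(42)
n  <- 30                                  # individuals (assumed)
tt <- 5                                   # time periods (assumed)
id <- rep(seq_len(n), each = tt)
x  <- rnorm(n * tt) + rep(rnorm(n), each = tt)
y  <- 1 + 0.5 * x + rep(rnorm(n), each = tt) + rnorm(n * tt)

# Quasi-demeaning: subtract a fraction theta of the individual time mean.
quasi_demean <- function(v, theta) v - theta * ave(v, id)

# Slope from ols on the transformed data for a given theta.
slope_for <- function(theta) {
  ys   <- quasi_demean(y, theta)
  xs   <- quasi_demean(x, theta)
  ones <- quasi_demean(rep(1, length(y)), theta)   # transformed intercept
  if (theta == 1) {
    coef(lm(ys ~ xs - 1))[["xs"]]                  # intercept column vanishes
  } else {
    coef(lm(ys ~ ones + xs - 1))[["xs"]]
  }
}

pooled <- coef(lm(y ~ x))[["x"]]                   # pooled ols slope
lsdv   <- coef(lm(y ~ x + factor(id)))[["x"]]      # dummy-variable fixed effects slope

c(theta0 = slope_for(0), pooled = pooled)
c(theta1 = slope_for(1), lsdv = lsdv)
```

Each pair of values should agree up to numerical precision; intermediate values of theta give the quasi-demeaned regression used for random effects.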
Moreover, instrumental variable estimators of all these models may also be obtained using
several calls to lm().
For this reason the three above estimators have been grouped inside the same function.
On the output side, a number of diagnostics and a very general coefficients’ covariance matrix
estimator also benefit from this framework, as they can be readily calculated by applying the
standard ols formulas to the demeaned data, which are contained inside plm objects. This
will be the subject of Subsection 3.4.
$$\hat{V}_R(\beta) = (X^\top X)^{-1} \sum_{i=1}^{n} X_i^\top E_i X_i \, (X^\top X)^{-1} \qquad (7)$$
where $E_i$ is a function of the residuals $\hat{e}_{it}$, $t = 1, \dots, T$, chosen according to the relevant
heteroskedasticity and correlation structure. Moreover, it turns out that the White covariance
matrix calculated on the demeaned model’s regressors and residuals (both part of plm
objects) is a consistent estimator of the relevant model’s parameters’ covariance matrix, thus the
4
See packages lmtest (Hothorn, Zeileis, Farebrother, Cummins, Millo, and Mitchell 2015) and car (Fox
2016).
5
Moreover, coeftest() provides a compact way of looking at coefficient estimates and significance diag-
nostics.
method is readily applicable to models estimated by random or fixed effects, first difference
or pooled ols methods. Different pre-weighting schemes taken from package sandwich (see
Zeileis 2004; Lumley and Zeileis 2015) are also implemented to improve small-sample per-
formance. Robust estimators with any combination of covariance structures and weighting
schemes can be passed on to the testing functions.
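A short sketch of how a robust covariance matrix might be requested, assuming plm is installed. The choice of the "arellano" method and the "HC3" weighting is only an example, and the object names are illustrative.

```r
library(plm)
data("Grunfeld", package = "plm")

fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")

# Classical standard errors versus robust ones (Arellano-type estimator
# with HC3 small-sample weighting).
se_classic <- sqrt(diag(vcov(fe)))
se_robust  <- sqrt(diag(vcovHC(fe, method = "arellano", type = "HC3")))
cbind(se_classic, se_robust)
```

The robust matrix has the same dimensions as vcov(fe), so it can be supplied wherever a testing function accepts a covariance matrix.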
R> library("plm")
The four datasets used are EmplUK which was used by Arellano and Bond (1991), the Grunfeld
data (Kleiber and Zeileis 2008) which is used in several econometric books, the Produc data
used by Munnell (1990) and the Wages used by Cornwell and Rupert (1988).
R> head(Grunfeld)
firm year
1 1 1977
2 1 1978
3 1 1979
4 1 1980
5 1 1981
6 1 1982
Two further arguments are logical: drop.index = TRUE drops the indexes from the data.frame
and row.names = TRUE computes “fancy” row names by pasting the individual and the time
indexes. While extracting a series from a pdata.frame, a pseries is created, which is the
original series with the index attribute. This object has specific methods, like summary and
as.matrix. The former indicates the total variation of the variable and the shares of this
variation due to the individual and the time dimensions. The latter gives the matrix repre-
sentation of the series, with, by default, individuals as rows and times as columns.
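The object E used in the commands of this section is not defined in the text as it stands. The sketch below shows one way it could be built, assuming it is a pdata.frame made from EmplUK with both logical arguments set to TRUE (which matches row names such as 1-1977).

```r
library(plm)
data("EmplUK", package = "plm")

# Assumed construction of E: indexes dropped from the columns,
# "fancy" row names built from individual and time indexes.
E <- pdata.frame(EmplUK, index = c("firm", "year"),
                 drop.index = TRUE, row.names = TRUE)

head(attr(E, "index"))   # the index attribute holds firm and year
emp <- E$emp             # extracting a column yields a pseries
class(emp)
summary(emp)             # variation split by individual and time
head(as.matrix(emp))     # individuals as rows, periods as columns
```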
R> summary(E$emp)
R> head(as.matrix(E$emp))
R> head(lag(E$emp, 0:2))
0 1 2
1-1977 5.041 NA NA
1-1978 5.600 5.041 NA
1-1979 5.015 5.600 5.041
1-1980 4.715 5.015 5.600
1-1981 4.093 4.715 5.015
1-1982 3.166 4.093 4.715
Further functions called Between, between and Within are also provided to compute the
between and the within transformation. The between function returns unique values, whereas
Between duplicates the values and returns a vector whose length is the number of
observations.
R> head(lag(E$emp, 2), 10)
1-1977 1-1978 1-1979 1-1980 1-1981 1-1982 1-1983 2-1977 2-1978 2-1979
NA NA 5.041 5.600 5.015 4.715 4.093 NA NA 71.319
R> head(Within(E$emp))
R> head(between(E$emp), 4)
1 2 3 4
4.366571 71.362428 19.040143 26.035000
R> head(Between(E$emp), 10)
1 1 1 1 1 1 1 2
4.366571 4.366571 4.366571 4.366571 4.366571 4.366571 4.366571 71.362428
2 2
71.362428 71.362428
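The three transformations can be mimicked in base R on a toy vector. This is only a sketch of the idea with made-up data; plm's own functions also handle the panel index and unbalanced data.

```r
id <- rep(c("a", "b", "c"), each = 4)            # individual identifier
x  <- c(1, 2, 3, 4, 10, 20, 30, 40, 5, 5, 5, 5)  # toy series

between_x <- tapply(x, id, mean)   # one mean per individual
Between_x <- ave(x, id)            # means repeated for every observation
Within_x  <- x - Between_x         # deviations from individual means

between_x
Between_x
Within_x
```

The between-type result has one value per individual, whereas the Between- and Within-type results have the same length as the original series.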
4.3. Formulas
In some circumstances, standard formulas are not very useful to describe a model, notably
when using instrumental-variable-like estimators: to deal with these situations, we use the
Formula package.
The Formula package provides a class which makes it possible to construct multi-part formulas, each
part being separated by a pipe sign. plm provides a pFormula object which is a Formula with
specific methods.
The two formulas below are identical:
R> emp~wage+capital|lag(wage,1)+capital
R> emp~wage+capital|.-wage+lag(wage,1)
In the second case, the . refers to the previous part, which describes the covariates, and this
part is “updated”. This is particularly useful when there are only a few external instruments.
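To see how a two-part formula is represented, the following sketch uses the Formula package directly. It assumes Formula is installed; the object name f is illustrative.

```r
library(Formula)

# Two right-hand-side parts separated by a pipe: covariates | instruments.
f <- Formula(emp ~ wage + capital | lag(wage, 1) + capital)

length(f)                      # number of left- and right-hand-side parts
formula(f, lhs = 1, rhs = 1)   # the model part
formula(f, lhs = 0, rhs = 2)   # the instruments part, as a one-sided formula
```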
5. Model estimation
The basic use of plm is to indicate the model formula, the data and the model to be estimated.
For example, the fixed effects model and the random effects model are estimated using:
R> grun.fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
R> grun.re <- plm(inv ~ value + capital, data = Grunfeld, model = "random")
R> summary(grun.re)
Call:
plm(formula = inv ~ value + capital, data = Grunfeld, model = "random")
Effects:
var std.dev share
idiosyncratic 2784.46 52.77 0.282
individual 7089.80 84.20 0.718
theta: 0.8612
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-177.6063 -19.7350 4.6851 19.5105 252.8743
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) -57.834415 28.898935 -2.0013 0.04674 *
value 0.109781 0.010493 10.4627 < 2e-16 ***
capital 0.308113 0.017180 17.9339 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
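The value of theta printed in this summary can be reproduced from the two variance components with the usual quasi-demeaning formula. The sketch assumes 20 yearly observations per firm in the Grunfeld data, and reads the numerator as the idiosyncratic variance and the term multiplied by T as the individual variance.

```r
sigma2_idios <- 2784.46   # idiosyncratic variance, from the summary output
sigma2_id    <- 7089.80   # individual variance, from the summary output
n_periods    <- 20        # assumption: 20 time periods per firm

theta <- 1 - sqrt(sigma2_idios / (sigma2_idios + n_periods * sigma2_id))
round(theta, 4)           # 0.8612
```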
For a random model, the summary method gives information about the variance of the com-
ponents of the errors. Fixed effects may be extracted easily using fixef. An argument type
indicates how fixed effects should be computed: in levels type = "level" (the default), in
deviations from the overall mean type = "dmean" or in deviations from the first individual
type = "dfirst".
R> fixef(grun.fe, type = "dmean")
1 2 3 4 5 6
-11.552778 160.649753 -176.827902 30.934645 -55.872873 35.582644
7 8 9 10
-7.809534 1.198282 -28.478333 52.176096
The fixef function returns an object of class fixef. A summary method is provided, which
prints the effects (in deviation from the overall intercept), their standard errors and the test
of equality to the overall intercept.
In case of a two-ways effect model, an additional argument effect is required to extract fixed
effects:
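A hedged sketch of the different fixef calls, assuming plm is installed. The object names fe and tw are illustrative and do not come from the text.

```r
library(plm)
data("Grunfeld", package = "plm")

fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")

fixef(fe)                    # fixed effects in levels (the default)
fixef(fe, type = "dmean")    # deviations from the overall mean
fixef(fe, type = "dfirst")   # deviations from the first individual

# Two-ways model: the effect argument selects which effects to extract.
tw <- plm(inv ~ value + capital, data = Grunfeld,
          model = "within", effect = "twoways")
fixef(tw, effect = "time")
```

For this balanced panel the "dmean" effects sum to zero, which is a quick way to recognize output of that type.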
The estimation of the variance of the error components is performed using the ercomp
function, which has a method and an effect argument, and can be used by itself:
For example, to estimate a two-ways effect model for the Grunfeld data:
Call:
plm(formula = inv ~ value + capital, data = Grunfeld, effect = "twoways",
model = "random", random.method = "amemiya")
Effects:
var std.dev share
idiosyncratic 2644.13 51.42 0.256
individual 7452.02 86.33 0.721
time 243.78 15.61 0.024
theta: 0.868 (id) 0.2787 (time) 0.2776 (total)
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-176.9062 -18.0431 3.2697 17.1719 234.1735
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) -63.767791 29.851537 -2.1362 0.0339 *
value 0.111386 0.010909 10.2102 <2e-16 ***
capital 0.323321 0.018772 17.2232 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the “effects” section of the result, the variance of the three elements of the error term and
the three parameters used in the transformation are now printed. The two-ways effect model
is for the moment only available for balanced panels.
Unbalanced panels
Most of the features of plm are implemented for unbalanced panel models, with some limitations:
• the only estimator of the variance of the error components is the one proposed by Swamy
and Arora (1972)
The following example uses data from Harrison and Rubinfeld (1978) to estimate a
hedonic housing price function. It is reproduced in Baltagi (2005), p. 174, and Baltagi (2013),
p. 197.
Call:
plm(formula = mv ~ crim + zn + indus + chas + nox + rm + age +
dis + rad + tax + ptratio + blacks + lstat, data = Hedonic,
model = "random", index = "townid")
Effects:
var std.dev share
idiosyncratic 0.01696 0.13025 0.562
individual 0.01324 0.11505 0.438
theta:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2505 0.5483 0.6284 0.6141 0.7147 0.7976
Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.62902 -0.06712 -0.00156 -0.00216 0.06858 0.54973
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 9.6859e+00 1.9751e-01 49.0398 < 2.2e-16 ***
crim -7.4120e-03 1.0478e-03 -7.0738 5.211e-12 ***
zn 7.8877e-05 6.5001e-04 0.1213 0.9034662
indus 1.5563e-03 4.0349e-03 0.3857 0.6998718
chasyes -4.4247e-03 2.9212e-02 -0.1515 0.8796662
nox -5.8425e-03 1.2452e-03 -4.6921 3.510e-06 ***
rm 9.0552e-03 1.1886e-03 7.6182 1.331e-13 ***
age -8.5787e-04 4.6793e-04 -1.8333 0.0673581 .
dis -1.4442e-01 4.4094e-02 -3.2753 0.0011301 **
rad 9.5984e-02 2.6611e-02 3.6069 0.0003415 ***
tax -3.7740e-04 1.7693e-04 -2.1331 0.0334132 *
ptratio -2.9476e-02 9.0698e-03 -3.2499 0.0012336 **
blacks 5.6278e-01 1.0197e-01 5.5188 5.529e-08 ***
lstat -2.9107e-01 2.3927e-02 -12.1650 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
as illustrated in the following example from Baltagi (2005), p. 120, and Baltagi (2013), p. 137
(“G2SLS”).
Call:
plm(formula = log(crmrte) ~ log(prbarr) + log(polpc) + log(prbconv) +
log(prbpris) + log(avgsen) + log(density) + log(wcon) + log(wtuc) +
log(wtrd) + log(wfir) + log(wser) + log(wmfg) + log(wfed) +
log(wsta) + log(wloc) + log(pctymle) + log(pctmin) + region +
smsa + factor(year) | . - log(prbarr) - log(polpc) + log(taxpc) +
log(mix), data = Crime, model = "random")
Effects:
var std.dev share
idiosyncratic 0.02227 0.14923 0.326
individual 0.04604 0.21456 0.674
theta: 0.7458
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.7485123 -0.0710015 0.0040742 0.0784401 0.4756493
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) -0.4538241 1.7029840 -0.2665 0.789955
log(prbarr) -0.4141200 0.2210540 -1.8734 0.061498 .
log(polpc) 0.5049285 0.2277811 2.2167 0.027014 *
log(prbconv) -0.3432383 0.1324679 -2.5911 0.009798 **
log(prbpris) -0.1900437 0.0733420 -2.5912 0.009796 **
log(avgsen) -0.0064374 0.0289406 -0.2224 0.824052
log(density) 0.4343519 0.0711528 6.1045 1.847e-09 ***
log(wcon) -0.0042963 0.0414225 -0.1037 0.917426
log(wtuc) 0.0444572 0.0215449 2.0635 0.039495 *
log(wtrd) -0.0085626 0.0419822 -0.2040 0.838456
log(wfir) -0.0040302 0.0294565 -0.1368 0.891220
log(wser) 0.0105604 0.0215822 0.4893 0.624798
log(wmfg) -0.2017917 0.0839423 -2.4039 0.016520 *
log(wfed) -0.2134634 0.2151074 -0.9924 0.321421
log(wsta) -0.0601083 0.1203146 -0.4996 0.617544
log(wloc) 0.1835137 0.1396721 1.3139 0.189383
log(pctymle) -0.1458448 0.2268137 -0.6430 0.520458
log(pctmin) 0.1948760 0.0459409 4.2419 2.565e-05 ***
regionwest -0.2281780 0.1010317 -2.2585 0.024272 *
regioncentral -0.1987675 0.0607510 -3.2718 0.001129 **
smsayes -0.2595423 0.1499780 -1.7305 0.084046 .
factor(year)82 0.0132140 0.0299923 0.4406 0.659676
factor(year)83 -0.0847676 0.0320008 -2.6489 0.008286 **
factor(year)84 -0.1062004 0.0387893 -2.7379 0.006366 **
factor(year)85 -0.0977398 0.0511685 -1.9102 0.056587 .
factor(year)86 -0.0719390 0.0605821 -1.1875 0.235512
factor(year)87 -0.0396520 0.0758537 -0.5227 0.601345
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Hausman-Taylor model (see Hausman and Taylor 1981) may be estimated with the pht
function. The following example is from Baltagi (2005), p.130/Baltagi (2013), p.146.
+ sex+black+bluecol+south+smsa+ind,
+ data=Wages,index=595)
R> summary(ht)
Effects:
var std.dev share
idiosyncratic 0.02304 0.15180 0.134
individual 0.14913 0.38618 0.866
theta: 0.853
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-2.070475 -0.116106 0.013176 0.125657 2.139104
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) 2.9317e+00 1.7849e-01 16.4256 < 2.2e-16 ***
wks 8.3787e-04 7.7790e-04 1.0771 0.2814400
southyes 2.9987e-02 3.2519e-02 0.9221 0.3564628
smsayes -3.7427e-02 2.2243e-02 -1.6826 0.0924499 .
marriedyes -3.0798e-02 2.4596e-02 -1.2522 0.2105119
exp 1.1284e-01 3.2032e-03 35.2261 < 2.2e-16 ***
I(exp^2) -4.2043e-04 7.0808e-05 -5.9376 2.892e-09 ***
bluecolyes -1.7773e-02 1.7855e-02 -0.9954 0.3195249
ind -8.9816e-03 1.8608e-02 -0.4827 0.6293372
unionyes 3.3535e-02 1.9281e-02 1.7393 0.0819856 .
sexfemale -1.3980e-01 7.1167e-02 -1.9643 0.0494919 *
blackyes -2.9270e-01 8.4004e-02 -3.4844 0.0004932 ***
ed 1.3698e-01 1.2456e-02 10.9973 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
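$$\hat{\beta} = \left( \sum_{i=1}^{n} \left( \hat{\Delta} + \hat{\sigma}_i^2 (X_i^\top X_i)^{-1} \right)^{-1} \right)^{-1} \sum_{i=1}^{n} \left( \hat{\Delta} + \hat{\sigma}_i^2 (X_i^\top X_i)^{-1} \right)^{-1} \hat{\beta}_i \qquad (8)$$
where $\hat{\sigma}_i^2$ is the unbiased estimator of the variance of the errors for individual $i$ obtained from
the preliminary estimation and:
$$\hat{\Delta} = \frac{1}{n-1} \sum_{i=1}^{n} \left( \hat{\beta}_i - \frac{1}{n} \sum_{i=1}^{n} \hat{\beta}_i \right) \left( \hat{\beta}_i - \frac{1}{n} \sum_{i=1}^{n} \hat{\beta}_i \right)^\top - \frac{1}{n} \sum_{i=1}^{n} \hat{\sigma}_i^2 (X_i^\top X_i)^{-1} \qquad (9)$$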
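In plm, variable coefficients models of this kind are fitted with pvcm. The sketch below assumes plm is installed; the object names are illustrative, and the random model corresponds to the call shown in the output that follows.

```r
library(plm)
data("Grunfeld", package = "plm")

# One regression per individual (fixed coefficients).
vc_within <- pvcm(inv ~ value + capital, data = Grunfeld, model = "within")

# Random coefficients model, combining the individual estimates.
vc_random <- pvcm(inv ~ value + capital, data = Grunfeld, model = "random")

head(coef(vc_within))   # one row of coefficients per firm
coef(vc_random)         # a single averaged coefficient vector
```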
Call:
pvcm(formula = inv ~ value + capital, data = Grunfeld, model = "random")
Residuals:
total sum of squares: 2177914
id time
0.67677732 0.02974195
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Least squares are inconsistent because $\Delta\epsilon_{it}$ is correlated with $\Delta y_{it-1}$. $y_{it-2}$ is a valid, but
weak instrument (see Anderson and Hsiao 1981). The gmm estimator uses the fact that the
number of valid instruments is growing with $t$:
• $t = 3$: $y_1$,
• $t = 4$: $y_1, y_2$,
• $t = 5$: $y_1, y_2, y_3$
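The pattern in this list can be written down as a tiny base R helper. The function name gmm_instruments is hypothetical and only labels the available lagged levels; it does not build an instrument matrix.

```r
# Levels of y usable as instruments for the differenced equation at time t:
# y_1, ..., y_{t-2}.
gmm_instruments <- function(t) {
  stopifnot(t >= 3)
  paste0("y", seq_len(t - 2))
}

lapply(3:5, gmm_instruments)

# The number of instruments grows by one with each additional period.
sapply(3:8, function(t) length(gmm_instruments(t)))
```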
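The moment conditions are: $\sum_{i=1}^{n} W_i^\top e_i(\beta)$, where $e_i(\beta)$ is the vector of residuals for individual
$i$. The gmm estimator minimizes:
$$\left( \sum_{i=1}^{n} e_i(\beta)^\top W_i \right) A \left( \sum_{i=1}^{n} W_i^\top e_i(\beta) \right) \qquad (13)$$
where $A$ is the weighting matrix of the moments. For the one-step estimator:
$$A^{(1)} = \left( \sum_{i=1}^{n} W_i^\top H^{(1)} W_i \right)^{-1} \qquad (14)$$
with:
$$H^{(1)} = d^\top d = \begin{pmatrix} 2 & -1 & 0 & \dots & 0 \\ -1 & 2 & -1 & \dots & 0 \\ 0 & -1 & 2 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & -1 & 2 \end{pmatrix} \qquad (15)$$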
Two-steps estimators are obtained using $H_i^{(2)} = \sum_{i=1}^{n} e_i^{(1)} e_i^{(1)\top}$, where $e_i^{(1)}$ are the residuals of
the one-step estimate.
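Blundell and Bond (1998) show that under weak assumptions on the data generating process,
supplementary moment conditions exist for the equation in levels:
$$e_i^+ = (\Delta e_i, e_i)$$
$$\begin{aligned}
\left( \sum_{i=1}^{n} Z_i^{+\top} \begin{pmatrix} \bar{e}_i(\beta) \\ e_i(\beta) \end{pmatrix} \right)^\top = \Bigg( & \sum_{i=1}^{n} y_{i1} \bar{e}_{i3}, \sum_{i=1}^{n} y_{i1} \bar{e}_{i4}, \sum_{i=1}^{n} y_{i2} \bar{e}_{i4}, \dots, \\
& \sum_{i=1}^{n} y_{i1} \bar{e}_{iT}, \sum_{i=1}^{n} y_{i2} \bar{e}_{iT}, \dots, \sum_{i=1}^{n} y_{iT-2} \bar{e}_{iT}, \sum_{i=1}^{n} \sum_{t=3}^{T} x_{it} \bar{e}_{it}, \\
& \sum_{i=1}^{n} e_{i3} \Delta y_{i2}, \sum_{i=1}^{n} e_{i4} \Delta y_{i3}, \dots, \sum_{i=1}^{n} e_{iT} \Delta y_{iT-1} \Bigg)^\top
\end{aligned}$$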
The gmm estimator is provided by the pgmm function. Its main argument is a dynformula
which describes the variables of the model and the lag structure.
In a gmm estimation, there are “normal” instruments and “gmm” instruments. gmm instru-
ments are indicated in the second part of the formula. By default, all the variables of the
model that are not used as gmm instruments are used as normal instruments, with the same
lag structure; “normal” instruments may also be indicated in the third part of the formula.
The effect argument is either NULL, "individual" (the default), or "twoways". In the first
case, the model is estimated in levels. In the second case, the model is estimated in first
differences to get rid of the individual effects. In the last case, the model is estimated in first
differences and time dummies are included.
The model argument specifies whether a one-step or a two-steps model is required ("onestep"
or "twosteps").
The following example is from Arellano and Bond (1991). Employment is explained by past
values of employment (two lags), current and first lag of wages and output and current value
of capital.
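The estimation command itself can be read off the Call line of the output that follows. A sketch, assuming plm is installed and using the object name ab (an illustrative choice):

```r
library(plm)
data("EmplUK", package = "plm")

# Model part | gmm instruments: lags 2 and deeper of log(emp).
ab <- pgmm(log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1) +
             log(capital) + lag(log(output), 0:1) | lag(log(emp), 2:99),
           data = EmplUK, effect = "twoways", model = "twosteps")

summary(ab)
```

The upper bound 99 in the instrument lags simply exceeds the number of available periods, so that all available lags from the second onwards are used.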
Call:
pgmm(formula = log(emp) ~ lag(log(emp), 1:2) + lag(log(wage),
0:1) + log(capital) + lag(log(output), 0:1) | lag(log(emp),
2:99), data = EmplUK, effect = "twoways", model = "twosteps")
Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.6190677 -0.0255683 0.0000000 -0.0001339 0.0332013 0.6410272
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
lag(log(emp), 1:2)1 0.474151 0.185398 2.5575 0.0105437 *
lag(log(emp), 1:2)2 -0.052967 0.051749 -1.0235 0.3060506
lag(log(wage), 0:1)0 -0.513205 0.145565 -3.5256 0.0004225 ***
lag(log(wage), 0:1)1 0.224640 0.141950 1.5825 0.1135279
log(capital) 0.292723 0.062627 4.6741 2.953e-06 ***
lag(log(output), 0:1)0 0.609775 0.156263 3.9022 9.530e-05 ***
lag(log(output), 0:1)1 -0.446373 0.217302 -2.0542 0.0399605 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following example is from Blundell and Bond (1998). The “sys” estimator is obtained
using transformation = "ld" for level and difference. The robust argument of the summary
method makes it possible to use the robust covariance matrix proposed by Windmeijer (2005).
Call:
pgmm(formula = log(emp) ~ lag(log(emp), 1) + lag(log(wage), 0:1) +
lag(log(capital), 0:1) | lag(log(emp), 2:99) + lag(log(wage),
2:99) + lag(log(capital), 2:99), data = EmplUK, effect = "twoways",
model = "onestep", transformation = "ld")
Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.7530341 -0.0369030 0.0000000 0.0002882 0.0466069 0.6001503
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
lag(log(emp), 1) 0.935605 0.026295 35.5810 < 2.2e-16 ***
lag(log(wage), 0:1)0 -0.630976 0.118054 -5.3448 9.050e-08 ***
lag(log(wage), 0:1)1 0.482620 0.136887 3.5257 0.0004224 ***
lag(log(capital), 0:1)0 0.483930 0.053867 8.9838 < 2.2e-16 ***
lag(log(capital), 0:1)1 -0.424393 0.058479 -7.2572 3.952e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
pggls(formula = log(emp) ~ log(wage) + log(capital), data = EmplUK,
model = "pooling")
Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
6
The “random effect” is better termed “general fgls” model, as in fact it does not have a proper random
effects structure, but we keep this terminology for general language consistency.
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) 2.023480 0.158468 12.7690 < 2.2e-16 ***
log(wage) -0.232329 0.048001 -4.8401 1.298e-06 ***
log(capital) 0.610484 0.017434 35.0174 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 1853.6
Residual Sum of Squares: 402.55
Multiple R-squared: 0.78283
The fixed effects pggls (see Wooldridge 2002, p.276) is based on the estimation of a within
model in the first step; the rest follows as above. It is estimated by:
The pggls function is similar to plm in many respects. An exception is that the estimate
of the group covariance matrix of errors (zz$sigma, a matrix, not shown) is reported in the
model objects instead of the usual estimated variances of the two error components.
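A sketch of the two pggls variants discussed here, assuming plm is installed. The pooling call is taken from the Call line of the output above; the name zz follows the text, while zz_fe is an illustrative name for the fixed effects variant.

```r
library(plm)
data("EmplUK", package = "plm")

# General fgls on the pooled model, as in the output shown above.
zz <- pggls(log(emp) ~ log(wage) + log(capital),
            data = EmplUK, model = "pooling")
summary(zz)

# Fixed effects variant: a within model is estimated in the first step.
zz_fe <- pggls(log(emp) ~ log(wage) + log(capital),
               data = EmplUK, model = "within")

dim(zz$sigma)   # estimated group covariance matrix of the errors
```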
6. Tests
As sketched in Section 2, specification testing in panel models involves essentially testing
for poolability, for individual or time unobserved effects and for correlation between these
latter and the regressors (Hausman-type tests). As for the other usual diagnostic checks, we
provide a suite of serial correlation tests, while not touching on the issue of heteroskedasticity
testing. Instead, we provide heteroskedasticity-robust covariance estimators, to be described
in Subsection 6.7.
F statistic
The same test can be computed using a formula as first argument of the pooltest function:
The effects tested are indicated with the effect argument (one of "individual", "time" or
"twoways"). The test statistics implemented are also suitable for unbalanced panels.8
To test the presence of individual and time effects in the Grunfeld example, using the Gourier-
oux et al. (1982) test, we use:
or
pFtest computes F tests of effects based on the comparison of the within and the pooling
models. Its main arguments are either two plm objects (the results of a pooling and a within
model) or a formula.
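Both interfaces can be sketched as follows for the Grunfeld data, assuming plm is installed. The object names are illustrative; type = "ghm" selects the Gourieroux et al. (1982) statistic.

```r
library(plm)
data("Grunfeld", package = "plm")

g_pool <- plm(inv ~ value + capital, data = Grunfeld, model = "pooling")
g_fe   <- plm(inv ~ value + capital, data = Grunfeld,
              model = "within", effect = "twoways")

# Lagrange multiplier test for individual and time effects,
# from a fitted pooling model or directly from a formula.
lm_obj  <- plmtest(g_pool, effect = "twoways", type = "ghm")
lm_form <- plmtest(inv ~ value + capital, data = Grunfeld,
                   effect = "twoways", type = "ghm")

# F test of effects: within model against pooling model, or a formula.
f_obj  <- pFtest(g_fe, g_pool)
f_form <- pFtest(inv ~ value + capital, data = Grunfeld, effect = "twoways")

lm_obj
f_obj
```

The model-object and formula versions of each test should return the same statistic.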
Hausman Test
the null of spherical residuals (see footnote 10). There may also be serial correlation of the “usual” kind in
the idiosyncratic error terms, e.g. as an AR(1) process. By “testing for serial correlation” we
mean testing for this latter kind of dependence.
For these reasons, the subjects of testing for individual error components and for serially
correlated idiosyncratic errors are closely related. In particular, simple (marginal) tests for one
direction of departure from the hypothesis of spherical errors usually have power against the
other one: in case it is present, they are substantially biased towards rejection. Joint tests are
correctly sized and have power against both directions, but usually do not give any information
about which one actually caused rejection. Conditional tests for serial correlation that take
into account the error components are correctly sized under presence of both departures from
sphericity and have power only against the alternative of interest. While most powerful if
correctly specified, the latter, based on the likelihood framework, are crucially dependent on
normality and homoskedasticity of the errors.
In plm we provide a number of joint, marginal and conditional ml-based tests, plus some
semiparametric alternatives which are robust to heteroskedasticity and free from distributional
assumptions.
This test is (n-) asymptotically distributed as a standard Normal regardless of the distribution
of the errors. It also does not rely on homoskedasticity.
It has power both against the standard random effects specification, where the unobserved
effects are constant within every group, as well as against any kind of serial correlation. As
such, it “nests” both random effects and serial correlation tests, trading some power against
more specific alternatives in exchange for robustness.
While not rejecting the null favours the use of pooled ols, rejection may follow from serial
correlation of different kinds, and in particular, quoting Wooldridge (2002), “should not be
interpreted as implying that the random effects error structure must be true”.
Below, the test is applied to the data and model in Munnell (1990):
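A sketch of the call, assuming the Produc data set in plm holds the Munnell (1990) data and that the production function is specified as in the pbltest call shown later in this section (the name wt is illustrative):

```r
library("plm")
data("Produc", package = "plm")

# Semiparametric test for unobserved effects, based on pooled ols residuals.
wt <- pwtest(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
             data = Produc)
wt
```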
[10] Neglecting time effects may also lead to serial correlation in residuals (as observed in
Wooldridge 2002, 10.4.1).
data: formula
z = 3.9383, p-value = 8.207e-05
alternative hypothesis: unobserved effect
data: formula
chisq = 4187.6, df = 2, p-value < 2.2e-16
alternative hypothesis: AR(1) errors or random effects
Rejection of the joint test, though, gives no information on the direction of the departure
from the null hypothesis, i.e.: is rejection due to the presence of serial correlation, of random
effects or of both?
Bera, Sosa-Escudero, and Yoon (2001) derive locally robust tests both for individual random
effects and for first-order serial correlation in residuals as “corrected” versions of the standard
LM test (see plmtest). While still dependent on normality and homoskedasticity, these
are robust to local departures from the hypotheses of, respectively, no serial correlation or
no random effects. The authors observe that, although suboptimal, these tests may help
detect the right direction of the departure from the null, thus complementing the use of
joint tests. Moreover, being based on pooled ols residuals, the BSY tests are computationally
far less demanding than likelihood-based conditional tests.
On the other hand, the statistical properties of these “locally corrected” tests are inferior
to those of the non-corrected counterparts when the latter are correctly specified. If there
is no serial correlation, then the optimal test for random effects is the likelihood-based LM
test of Breusch and Pagan (with refinements by Honda, see plmtest), while if there are no
random effects the optimal test for serial correlation is the Breusch-Godfrey test[11]. If the
presence of a random effect is taken for granted, then the optimal test for serial correlation
is the likelihood-based conditional LM test of Baltagi and Li (1995) (see pbltest).
The serial correlation version is the default:
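As a sketch, assuming the same Produc specification as above: calling pbsytest without a test argument gives the serial correlation version, while test = "j" requests the joint test discussed earlier (object names are illustrative).

```r
library("plm")
data("Produc", package = "plm")

fm <- log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp

# Default: locally robust test for AR(1) errors, robust to random effects.
bsy_ar <- pbsytest(fm, data = Produc)

# Joint test of AR(1) errors or random effects.
bsy_j <- pbsytest(fm, data = Produc, test = "j")
bsy_ar
```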
[11] LM3 in Baltagi and Li (1995).
data: formula
chisq = 52.636, df = 1, p-value = 4.015e-13
alternative hypothesis: AR(1) errors sub random effects
The BSY test for random effects is implemented in the one-sided version[12], which takes into
account that the variance of the random effect must be non-negative:
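A sketch of the random effects version, assuming the same Produc specification; test = "re" selects it, and the name bsy_re is illustrative.

```r
library("plm")
data("Produc", package = "plm")

# Locally robust one-sided test for random effects, robust to AR(1) errors.
bsy_re <- pbsytest(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
                   data = Produc, test = "re")
bsy_re
```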
data: formula
z = 57.914, p-value < 2.2e-16
alternative hypothesis: random effects sub AR(1) errors
R> pbltest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp,
+ data=Produc, alternative="onesided")
As usual, the LM test statistic is based on residuals from the maximum likelihood estimate of
the restricted model (random effects with serially uncorrelated errors). In this case, though,
the restricted model cannot be estimated by ols any more, therefore the testing function
depends on lme() in the nlme package for estimation of a random effects model by maximum
likelihood. For this reason, the test is applicable only to balanced panels.
No test has been implemented to date for the symmetric hypothesis of no random effects in
a model with errors following an AR(1) process, but an asymptotically equivalent likelihood
ratio test is available in the nlme package (see Section 7).
Recall that plm model objects are the result of ols estimation performed on “demeaned” data,
where, in the case of individual effects (else symmetric), this means time-demeaning for the
fe (within) model, quasi-time-demeaning for the re (random) model and original data, with
no demeaning at all, for the pooled ols (pooling) model (see Section 3).
For the random effects model, Wooldridge (2002) observes that under the null of
homoskedasticity and no serial correlation in the idiosyncratic errors, the residuals from the
quasi-demeaned regression must be spherical as well. Else, as the individual effects are wiped
out in the demeaning, any remaining serial correlation must be due to the idiosyncratic
component. Hence, a simple way of testing for serial correlation is to apply a standard serial
correlation test to the quasi-demeaned model. The same applies in a pooled model, w.r.t. the
original data.
The fe case needs some qualification. It is well-known that if the original model’s errors are
uncorrelated then fe residuals are negatively serially correlated, with
cor(û_it, û_is) = −1/(T − 1) for each t, s (see Wooldridge 2002, 10.5.4). This correlation
clearly dies out as T increases, so this kind of AR test is applicable to within model objects
only for T “sufficiently large”[13]. Conversely, in short panels the test gets severely biased
towards rejection (or, as the induced correlation is negative, towards acceptance in the case of
the one-sided DW test with alternative="greater"). See below for a serial correlation test
applicable to “short” fe panel models.
plm objects retain the “demeaned” data, so the procedure is straightforward for them. The
wrapper functions pbgtest and pdwtest re-estimate the relevant quasi-demeaned model by
ols and apply, respectively, standard Breusch-Godfrey and Durbin-Watson tests from package
lmtest:
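A minimal sketch of the two wrappers on a within model, assuming the Grunfeld data (the names grun.fe, bg and dw are illustrative):

```r
library("plm")
data("Grunfeld", package = "plm")

grun.fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")

# Breusch-Godfrey test for serial correlation up to order 2.
bg <- pbgtest(grun.fe, order = 2)

# Durbin-Watson test on the same model.
dw <- pdwtest(grun.fe)
bg
```

With T = 20 in this data set the caveat on short panels discussed above is less of a concern, but the result should still be read with it in mind.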
The tests share the features of their ols counterparts; in particular, pbgtest allows testing
for higher-order serial correlation, which may prove useful, e.g., on quarterly data.
Analogously, from the point of view of software, as the functions are simple wrappers towards
bgtest and dwtest, all arguments from the latter two apply and may be passed on through
the ‘...’ argument.
for each t, s. Wooldridge suggests basing a test for this null hypothesis on a pooled regression
of fe residuals on themselves, lagged one period:
$$\hat{\epsilon}_{i,t} = \alpha + \delta \hat{\epsilon}_{i,t-1} + \eta_{i,t}$$
Rejecting the restriction δ = −1/(T − 1) makes us conclude against the original null of no
serial correlation.
The building blocks available in plm make it easy to construct a function carrying out this
procedure: first the fe model is estimated and the residuals retrieved, then they are lagged
and a pooling AR(1) model is estimated. The test statistic is obtained by applying the above
restriction on δ and supplying a heteroskedasticity- and autocorrelation-consistent covariance
matrix (vcovHC with the appropriate options, in particular method="arellano")[14].
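plm packages this procedure in the function pwartest (referred to further below). A sketch of a call, assuming the EmplUK data from plm and an illustrative employment equation; the specification is an assumption, not necessarily the one behind the output that follows.

```r
library("plm")
data("EmplUK", package = "plm")

# Wooldridge's test for AR(1) errors in fe panels: regresses the within
# residuals on their lag and tests delta = -1/(T - 1) with a robust vcov.
wa <- pwartest(log(emp) ~ log(wage) + log(capital), data = EmplUK)
wa
```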
data: plm.model
F = 312.3, df1 = 1, df2 = 889, p-value < 2.2e-16
alternative hypothesis: serial correlation
The test is applicable to any fe panel model, and in particular to “short” panels with small
T and large n.
and testing the restriction δ = −0.5, corresponding to the null of no serial correlation. Drukker
(2003) provides Monte Carlo evidence of the good empirical properties of the test.
On the other extreme (see Wooldridge 2002, 10.6.1), if the differenced errors e_it are
uncorrelated, as by definition u_it = u_{i,t−1} + e_it, then u_it is a random walk. In this latter
case, the most efficient estimator is the first difference (fd) one; in the former case, it is the
fixed effects one (within).
The function pwfdtest allows testing either hypothesis: the default behaviour h0="fd" is to
test for serial correlation in first-differenced errors:
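A sketch of both variants, assuming the EmplUK data and an illustrative employment equation (object names are hypothetical):

```r
library("plm")
data("EmplUK", package = "plm")

fm <- log(emp) ~ log(wage) + log(capital)

# Default, h0 = "fd": null of no serial correlation in differenced errors.
fd0 <- pwfdtest(fm, data = EmplUK)

# h0 = "fe": null of no serial correlation in the original errors.
fe0 <- pwfdtest(fm, data = EmplUK, h0 = "fe")
fd0
```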
[14] See Subsection 6.7.
[15] Here, e_it for notational simplicity (and as in Wooldridge): equivalent to Δε_it in the
general notation of the paper.
data: plm.model
F = 0.9316, df1 = 1, df2 = 749, p-value = 0.3348
alternative hypothesis: serial correlation in differenced errors
while specifying h0="fe" the null hypothesis becomes no serial correlation in original errors,
which is similar to the pwartest.
data: plm.model
F = 131.01, df1 = 1, df2 = 749, p-value < 2.2e-16
alternative hypothesis: serial correlation in original errors
Not rejecting one of the two is evidence in favour of using the estimator corresponding to
h0. Should the truth lie in the middle (both rejected), whichever estimator is chosen will
have serially correlated errors: it is then advisable to use the autocorrelation-robust
covariance estimators from Subsection 6.7 in inference.
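$$\hat{\rho}_{ij} = \frac{\sum_{t=1}^{T} \hat{u}_{it} \hat{u}_{jt}}{\left( \sum_{t=1}^{T} \hat{u}_{it}^2 \right)^{1/2} \left( \sum_{t=1}^{T} \hat{u}_{jt}^2 \right)^{1/2}}$$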
i.e., as averages over the time dimension of pairwise correlation coefficients for each pair of
cross-sectional units.
The Breusch-Pagan (Breusch and Pagan 1980) LM test, based on the squares of ρ_ij, is valid
for T → ∞ with n fixed; it is defined as

$$LM = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} T_{ij} \hat{\rho}_{ij}^2$$

where in the case of an unbalanced panel only pairwise complete observations are considered,
and T_ij = min(T_i, T_j) with T_i being the number of observations for individual i; else, if the
panel is balanced, T_ij = T for each i, j. The test is distributed as χ² with n(n − 1)/2 degrees
of freedom. It is inappropriate whenever the n dimension is “large”. A scaled version,
applicable also if T → ∞ and then n → ∞ (as in some pooled time series contexts), is defined as
$$SCLM = \sqrt{\frac{1}{n(n-1)}} \left( \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \sqrt{T_{ij}} \, \hat{\rho}_{ij}^2 \right)$$
$$CD = \sqrt{\frac{2}{n(n-1)}} \left( \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \sqrt{T_{ij}} \, \hat{\rho}_{ij} \right)$$
based on ρ_ij without squaring (also distributed as a standard Normal) is appropriate both in
n- and in T-asymptotic settings. It has remarkable properties in samples of any practically
relevant size and is robust to a variety of settings. The only big drawback is that the test
loses power against the alternative of cross-sectional dependence if the latter is due to a factor
structure with factor loadings averaging zero, that is, some units react positively to common
shocks, others negatively.
The default version of the test is "cd". These tests are originally meant to use the residuals
of separate estimation of one time-series regression for each cross-sectional unit, so this is the
default behaviour of pcdtest.
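A sketch of the default behaviour, assuming the Grunfeld data, whose 20 time periods are enough for firm-by-firm regressions with two regressors (object names are illustrative):

```r
library("plm")
data("Grunfeld", package = "plm")

# Default: CD test on residuals from separate time-series regressions,
# one for each firm.
cd0 <- pcdtest(inv ~ value + capital, data = Grunfeld)

# The other statistics are selected through the test argument.
lm0 <- pcdtest(inv ~ value + capital, data = Grunfeld, test = "lm")
cd0
```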
If a different model specification (within, random, ...) is assumed consistent, one can resort
to its residuals for testing[17] by specifying the relevant model type. The main argument of
this function may be either a model of class panelmodel or a formula and a data.frame; in
the second case, unless model is set to NULL, all usual parameters relative to the estimation
of a plm model may be passed on. The test is compatible with any consistent panelmodel
for the data at hand, with any specification of effect. E.g., specifying effect="time" or
effect="twoways" allows testing for residual cross-sectional dependence after the introduction
of time fixed effects to account for common shocks.
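As a sketch of testing on the residuals of a chosen specification, one may estimate the panelmodel first and pass it to pcdtest. The Grunfeld data and the object names below are assumptions for illustration.

```r
library("plm")
data("Grunfeld", package = "plm")

# Within model with individual effects only.
gw1 <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
cd_ind <- pcdtest(gw1, test = "cd")

# Within model with two-ways effects: time effects absorb common shocks,
# so the test now looks at the remaining cross-sectional dependence.
gw2 <- plm(inv ~ value + capital, data = Grunfeld,
           model = "within", effect = "twoways")
cd_two <- pcdtest(gw2, test = "cd")
cd_two
```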
If the time dimension is insufficient and model=NULL, the function defaults to estimation of a
within model and issues a warning.
$$CD = \sqrt{\frac{1}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} [w(p)]_{ij}}} \left( \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} [w(p)]_{ij} \sqrt{T_{ij}} \, \hat{\rho}_{ij} \right)$$
where [w(p)]_ij is the (i, j)-th element of the p-th order proximity matrix, so that if h, k are
not neighbours, [w(p)]_hk = 0 and ρ̂_hk gets “killed”; this is easily seen to reduce to formula
(14) in Pesaran (2004) for the special case considered in that paper. The same can
be applied to the LM and SCLM tests.
Therefore, the local version of either test can be computed by supplying an n × n matrix (of
any kind coercible to logical), providing information on whether any pair of observations are
neighbours or not, to the w argument. If w is supplied, only neighbouring pairs will be used in
computing the test; else, w will default to NULL and all observations will be used. The matrix
[17] This is also the only solution when the time dimension’s length is insufficient for
estimating the heterogeneous model.
need not really be binary, so commonly used “row-standardized” matrices can be employed
as well: it is enough that neighbouring pairs correspond to nonzero elements in w[18].
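A sketch of the local test with a purely hypothetical proximity matrix: the ten Grunfeld firms are treated as if arranged on a line, each one neighbouring the next. The data carry no real spatial structure, so w here only illustrates the mechanics.

```r
library("plm")
data("Grunfeld", package = "plm")

gw <- plm(inv ~ value + capital, data = Grunfeld, model = "within")

# Hypothetical first-order proximity matrix: firm i neighbours firm i + 1.
n <- length(unique(Grunfeld$firm))
w <- matrix(0, nrow = n, ncol = n)
for (i in seq_len(n - 1)) {
  w[i, i + 1] <- 1
  w[i + 1, i] <- 1
}

# Local CD test: only pairs with a nonzero entry in w enter the statistic.
cd_local <- pcdtest(gw, test = "cd", w = w)
cd_local
```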
Preliminary results
We consider the following model:
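$$y_{it} = \delta y_{it-1} + \sum_{L=1}^{p_i} \theta_i \Delta y_{it-L} + \alpha_{mi} d_{mt} + \epsilon_{it}$$

Subtracting y_{it−1} from both sides gives the equivalent form in differences, with ρ = δ − 1:

$$\Delta y_{it} = \rho y_{it-1} + \sum_{L=1}^{p_i} \theta_i \Delta y_{it-L} + \alpha_{mi} d_{mt} + \epsilon_{it}$$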
• the Hall method, which consists in removing the highest lag as long as it is not significant.
The ADF regression is run on T − p_i − 1 observations for each individual, so that the total
number of observations is n × T̃ where T̃ = T − p̄ − 1 and p̄ is the average number of lags.
Call e_i the vector of residuals.
Estimate the variance of the ε_i as:

$$\hat{\sigma}^2_{\epsilon_i} = \frac{\sum_{t=p_i+1}^{T} e_{it}^2}{df_i}$$
Levin-Lin-Chu model
Then, compute artificial regressions of Δy_it and y_{it−1} on Δy_{it−L} and d_mt and get the two
vectors of residuals z_it and v_it.
[18] The very comprehensive package spdep for spatial dependence analysis (see Bivand 2008)
contains features for creating, lagging and manipulating neighbour list objects of class nb, that
can be readily converted to and from proximity matrices by means of the nb2mat function. Higher
orders of the CD(p) test can be obtained by lagging the corresponding nbs through nblag.
Standardize these two residuals and run the pooled regression of z_it/σ̂_εi on v_it/σ̂_εi to get
ρ̂, its standard deviation σ̂(ρ̂) and the t-statistic t_ρ̂ = ρ̂/σ̂(ρ̂).
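Compute the long run variance of y_i:

$$\hat{\sigma}^2_{y_i} = \frac{1}{T-1} \sum_{t=2}^{T} \Delta y_{it}^2 + 2 \sum_{L=1}^{\bar{K}} w_{\bar{K}L} \left[ \frac{1}{T-1} \sum_{t=2+L}^{T} \Delta y_{it} \Delta y_{it-L} \right]$$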
Define s_i as the ratio of the long and short term variance and s̄ the mean for all the
individuals of the sample

$$s_i = \frac{\hat{\sigma}_{y_i}}{\hat{\sigma}_{\epsilon_i}}$$

$$\bar{s} = \frac{\sum_{i=1}^{n} s_i}{n}$$
follows a normal distribution under the null hypothesis of a unit root. $\mu^*_{m\tilde{T}}$ and
$\sigma^*_{m\tilde{T}}$ are given in table 2 of the original paper and are also available in the
package.
$$\bar{t} = \frac{1}{n} \sum_{i=1}^{n} t_{\rho_i}$$
$\mu^*_{m\tilde{T}}$ and $\sigma^*_{m\tilde{T}}$ are given in table 2 of the original paper and
are also available in the package.
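The unit root tests outlined above are available through the purtest function. The sketch below assumes the Grunfeld data, an intercept as the only deterministic component and AIC lag selection with at most four lags; these settings are illustrative choices, not recommendations.

```r
library("plm")
data("Grunfeld", package = "plm")

# Levin-Lin-Chu test on gross investment.
llc <- purtest(inv ~ 1, data = Grunfeld, index = c("firm", "year"),
               test = "levinlin", exo = "intercept",
               lags = "AIC", pmax = 4)

# Im-Pesaran-Shin test with the same settings.
ips <- purtest(inv ~ 1, data = Grunfeld, index = c("firm", "year"),
               test = "ips", exo = "intercept",
               lags = "AIC", pmax = 4)
summary(llc)
```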
All types assume no correlation between errors of different groups while allowing for
heteroskedasticity across groups, so that the full covariance matrix of errors is
V = I_n ⊗ Ω_i; i = 1, ..., n. As for the intragroup error covariance matrix of every single group
of observations, "white1" allows for general heteroskedasticity but no serial correlation, i.e.

$$\Omega_i = \begin{bmatrix} \sigma_{i1}^2 & \dots & \dots & 0 \\ 0 & \sigma_{i2}^2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \dots & \dots & \sigma_{iT}^2 \end{bmatrix} \qquad (16)$$
while "white2" is "white1" restricted to a common variance inside every group, estimated
as $\sigma_i^2 = \sum_{t=1}^{T} \hat{u}_{it}^2 / T$, so that $\Omega_i = I_T \otimes \sigma_i^2$
(see Greene (2003, 13.7.1–2) and Wooldridge (2002, 10.7.2)); "arellano" (see ibid. and the
original ref. Arellano 1987) allows a fully general structure w.r.t. heteroskedasticity and
serial correlation:
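$$\Omega_i = \begin{bmatrix} \sigma_{i1}^2 & \sigma_{i1,i2} & \dots & \dots & \sigma_{i1,iT} \\ \sigma_{i2,i1} & \sigma_{i2}^2 & & & \vdots \\ \vdots & & \ddots & & \vdots \\ \vdots & & & \sigma_{iT-1}^2 & \sigma_{iT-1,iT} \\ \sigma_{iT,i1} & \dots & \dots & \sigma_{iT,iT-1} & \sigma_{iT}^2 \end{bmatrix} \qquad (17)$$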
The latter is, as already observed, consistent w.r.t. timewise correlation of the errors, but,
unlike the White 1 and 2 methods, it relies on large n asymptotics with small T.
The fixed effects case, as already observed in Section 6.4 on serial correlation, is complicated
by the fact that the demeaning induces serial correlation in the errors. The original White
estimator (white1) turns out to be inconsistent for fixed T as n grows, so in this case it is
advisable to use the arellano version (see Stock and Watson 2008).
The errors may be weighted according to the schemes proposed by MacKinnon and White
(1985) and Cribari-Neto (2004) to improve small-sample performance[20].
The main use of vcovHC is together with testing functions from the lmtest and car packages.
These typically allow passing the vcov parameter either as a matrix or as a function (see
Zeileis 2004). If one is happy with the defaults, it is easiest to pass the function itself:
R> library("lmtest")
R> re <- plm(inv~value+capital, data=Grunfeld, model="random")
R> coeftest(re,vcovHC)
t test of coefficients:
else one may do the covariance computation inside the call to coeftest, thus passing on a
matrix:
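A sketch of passing a precomputed matrix, assuming the Grunfeld random effects model used in this subsection; the choice of method = "white2" and type = "HC3" is only an illustration of non-default options.

```r
library("plm")
library("lmtest")
data("Grunfeld", package = "plm")

re <- plm(inv ~ value + capital, data = Grunfeld, model = "random")

# vcovHC is evaluated here, so coeftest receives a matrix, not a function.
ct <- coeftest(re, vcov. = vcovHC(re, method = "white2", type = "HC3"))
ct
```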
For some tests, e.g. for multiple model comparisons by waldtest, one should always provide
a function[21]. In this case, optional parameters are provided as shown below (see also Zeileis
2004, p.12):
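A sketch of the function-based form, assuming the same Grunfeld random effects model: the options are fixed by wrapping vcovHC in an anonymous function, which waldtest can then apply to each model it compares.

```r
library("plm")
library("lmtest")
data("Grunfeld", package = "plm")

re <- plm(inv ~ value + capital, data = Grunfeld, model = "random")

# Compare the full model with one dropping capital, using a robust vcov
# recomputed for whichever model waldtest hands to the function.
wt <- waldtest(re, update(re, . ~ . - capital),
               vcov = function(x) vcovHC(x, method = "white2", type = "HC3"))
wt
```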
Wald test
Moreover, linearHypothesis from package car may be used to test for linear restrictions:
R> library("car")
R> linearHypothesis(re, "2*value=capital", vcov.=vcovHC)
Hypothesis:
2 value - capital = 0
A specific vcovHC method for pgmm objects is also provided which implements the robust
covariance matrix proposed by Windmeijer (2005) for generalized method of moments
estimators.
nlme and lme4 are estimated by (restricted or unrestricted) maximum likelihood. While under
normality, homoskedasticity and no serial correlation of the errors ols is also the maximum
likelihood estimator, in all the other cases there are important differences.
The econometric gls approach has closed-form analytical solutions computable by standard
linear algebra and, although the latter can sometimes get computationally heavy on the
machine, the expressions for the estimators are usually rather simple. ml estimation of
longitudinal models, on the contrary, is based on numerical optimization of nonlinear functions
without closed-form solutions and is thus dependent on approximations and convergence criteria.
For example, the “gls” functionality in nlme is rather different from its “econometric”
counterpart. “Feasible gls” estimation in plm is based on a single two-step procedure, in which
an inefficient but consistent estimation method (typically ols) is employed first in order to get
a consistent estimate of the errors’ covariance matrix, to be used in gls at the second step;
conversely, “gls” estimators in nlme are based on iteration until convergence of two-step
optimization of the relevant likelihood.
[25] For fixed effects estimation, as the sample grows (on the dimension on which the fixed
effects are specified) so does the number of parameters to be estimated. Estimation of individual
fixed effects is T- (but not n-) consistent, and conversely for time fixed effects.
Random effects
In the Laird and Ware notation, the re specification is a model with only one random effects
regressor: the intercept. Formally, z_1ij = 1 ∀i, j; z_qij = 0 ∀i, ∀j, ∀q ≠ 1 (λ_ij = 1 for i = j,
0 else). The composite error is therefore u_ij = 1·b_i1 + ε_ij. Below we report coefficients of
Grunfeld’s model estimated by gls and then by ml:
R> library(nlme)
R> reGLS <- plm(inv~value+capital, data=Grunfeld, model="random")
R> reML <- lme(inv~value+capital, data=Grunfeld, random=~1|firm)
R> coef(reGLS)
R> summary(reML)$coefficients$fixed
y
(Intercept) -18.5538638
value 0.1239595
capital 0.1114579
R> summary(vcmML)$coefficients$fixed
Unrestricted fgls
The general, or unrestricted, feasible gls, pggls in the plm nomenclature, is equivalent to
a model with no random effects regressors (biq = 0 ∀i, q) and an error covariance structure
which is unrestricted within groups apart from the usual requirements. The function for
estimating such models with correlation in the errors but no random effects is gls().
This very general serial correlation and heteroskedasticity structure is not estimable for the
original Grunfeld data, which have more time periods than firms, therefore we restrict them
to firms 4 to 6.
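A sketch of the comparison on the restricted sample. The corSymm correlation structure grouped by year is an assumption about how to express an unrestricted error covariance across firms in nlme, and the object names are illustrative.

```r
library("plm")
library("nlme")
data("Grunfeld", package = "plm")

# Restrict the sample to firms 4 to 6.
sGrunfeld <- Grunfeld[Grunfeld$firm %in% 4:6, ]

# General feasible gls in plm.
ggls <- pggls(inv ~ value + capital, data = sGrunfeld, model = "pooling")

# Counterpart in nlme: no random effects, general error correlation
# among the observations sharing the same year.
gglsML <- gls(inv ~ value + capital, data = sGrunfeld,
              correlation = corSymm(form = ~ 1 | year))

coef(ggls)
```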
R> summary(gglsML)$coefficients
The within case is analogous, with the regressors’ set augmented by n − 1 group dummies.
and analogously the random effects panel with, e.g., AR(1) errors (see Baltagi 2005, 2013,
chap 5), which is a very common specification in econometrics, may be fit by lme specifying
an additional random intercept:
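A sketch of such a model, assuming the Grunfeld data: the random intercept is specified by firm and the AR(1) structure runs over years within each firm. The starting value 0 for the autoregressive parameter is an arbitrary choice.

```r
library("plm")   # only for the Grunfeld data
library("nlme")
data("Grunfeld", package = "plm")

# Random intercept by firm, AR(1) idiosyncratic errors within firm.
reAR1ML <- lme(inv ~ value + capital, data = Grunfeld,
               random = ~ 1 | firm,
               correlation = corAR1(0, form = ~ year | firm))

# Estimated serial correlation coefficient on its natural scale.
phi <- coef(reAR1ML$modelStruct$corStruct, unconstrained = FALSE)
phi
```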
The regressors’ coefficients and the error’s serial correlation coefficient may be retrieved this
way:
R> summary(reAR1ML)$coefficients$fixed
Phi
0.823845
Significance statistics for the regressors’ coefficients are to be found in the usual summary
object, while to get the significance test of the serial correlation coefficient one can do a
likelihood ratio test as shown in the following.
The AR(1) test on the random effects model is to be done in much the same way, using the
random effects model objects estimated above:
A likelihood ratio test for random effects compares the specifications with and without random
effects and spherical idiosyncratic errors:
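The two likelihood ratio tests just described can be sketched with anova on nested nlme fits. The data and object names are assumptions, with lmML standing for the model without random effects; all fits use the default estimation method and share the same fixed part.

```r
library("plm")   # only for the Grunfeld data
library("nlme")
data("Grunfeld", package = "plm")

lmML    <- gls(inv ~ value + capital, data = Grunfeld)
reML    <- lme(inv ~ value + capital, data = Grunfeld, random = ~ 1 | firm)
reAR1ML <- lme(inv ~ value + capital, data = Grunfeld, random = ~ 1 | firm,
               correlation = corAR1(0, form = ~ year | firm))

# AR(1) errors, given random effects.
lr_ar <- anova(reML, reAR1ML)

# Random effects, given spherical idiosyncratic errors.
lr_re <- anova(lmML, reML)
lr_ar
```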
The random effects, AR(1) errors model in turn nests the AR(1) pooling model, therefore
a likelihood ratio test for random effects sub AR(1) errors may be carried out, again, by
comparing the two autoregressive specifications:
whence we see that the Grunfeld model specification doesn’t seem to need any random effects
once we control for serial correlation in the data.
8. Conclusions
With plm we aim at providing a comprehensive package containing the standard functionalities
that are needed for the management and the econometric analysis of panel data. In particular,
we provide: functions for data transformation; estimators for pooled, random and fixed
effects static panel models and variable coefficients models, general gls for general covariance
structures, and generalized method of moments estimators for dynamic panels; specification
and diagnostic tests. Instrumental variables estimation is supported. Most estimators allow
working with unbalanced panels. While among the different approaches to longitudinal data
analysis we take the perspective of the econometrician, the syntax is consistent with the basic
linear modeling tools, like the lm function.
On the input side, formula and data arguments are used to specify the model to be estimated.
Special functions are provided to make writing formulas easier, and the structure of the data
is indicated with an index argument.
On the output side, the model objects (of the new class panelmodel) are compatible with
the general restriction testing frameworks of packages lmtest and car. Specialized methods
are also provided for the calculation of robust covariance matrices; heteroskedasticity- and
correlation-consistent testing is accomplished by passing these on to testing functions, together
with a panelmodel object.
The main functionalities of the package have been illustrated here by applying them on some
well-known datasets from the econometric literature. The similarities and differences with
the maximum likelihood approach to longitudinal data have also been briefly discussed.
We plan to expand the methods in this paper to systems of equations and to the estimation
of models with autoregressive errors. The addition of covariance estimators robust to
cross-sectional correlation is also in the offing. Lastly, conditional visualization features in
the R environment seem to offer a promising toolbox for visual diagnostics, which is another
subject for future work.
Acknowledgments
While retaining responsibility for any error, we thank Jeffrey Wooldridge, Achim Zeileis and
three anonymous referees for useful comments. We also acknowledge kind editing assistance
by Lisa Benedetti.
References
Arellano M, Bond S (1991). “Some Tests of Specification for Panel Data: Monte Carlo
Evidence and an Application to Employment Equations.” Review of Economic Studies,
58(2), 277–297.
Baltagi B (2005). Econometric Analysis of Panel Data. 3rd edition. John Wiley and Sons
ltd.
Baltagi B (2013). Econometric Analysis of Panel Data. 5th edition. John Wiley and Sons
ltd.
Baltagi B, Chang Y, Li Q (1992). “Monte Carlo results on several new and existing tests for
the error component model.” Journal of Econometrics, 54(1–3), 95–120.
Baltagi B, Chang Y, Li Q (1998). “Testing for random individual and time effects using
unbalanced panel data.” Advances in Econometrics, 13, 1–20.
Baltagi B, Li Q (1990). “A lagrange multiplier test for the error components model with
incomplete panels.” Econometric Reviews, 9(1), 103–107.
Baltagi B, Li Q (1991). “A Joint Test for Serial Correlation and Random Individual Effects.”
Statistics and Probability Letters, 11(3), 277–280.
Bates D (2007). lme4: Linear Mixed–Effects Models Using S4 Classes. R package version
0.99875-9, URL https://cran.r-project.org/package=lme4.
Bates D, Maechler M (2016). Matrix: Sparse and Dense Matrix Classes and Methods. R
package version 1.2-7.1, URL https://cran.r-project.org/package=Matrix.
Bera A, Sosa-Escudero W, Yoon M (2001). “Tests for the Error Component Model in the
Presence of Local Misspecification.” Journal of Econometrics, 101(1), 1–23.
Bhargava A, Franzini L, Narendranathan W (1982). “Serial Correlation and the Fixed Effects
Model.” Review of Economic Studies, 49(4), 533–554.
Bivand R (2008). spdep: Spatial Dependence: Weighting Schemes, Statistics and Models. R
package version 0.4-17, URL https://cran.r-project.org/package=spdep.
Blundell R, Bond S (1998). “Initial Conditions and Moment Restrictions in Dynamic Panel
Data Models.” Journal of Econometrics, 87(1), 115–143.
Breusch T, Mizon G, Schmidt P (1989). “Efficient Estimation Using Panel Data.”
Econometrica, 57(3), 695–700.
Breusch T, Pagan A (1980). “The Lagrange Multiplier Test and Its Applications to Model
Specification in Econometrics.” Review of Economic Studies, 47(1), 239–253.
Cornwell C, Rupert P (1988). “Efficient Estimation With Panel Data: An Empirical Com-
parison of Instrumental Variables Estimators.” Journal of Applied Econometrics, 3(2),
149–155.
Croissant Y, Millo G (2008). “Panel Data Econometrics in R: The plm Package.” Journal of
Statistical Software, 27(2). URL http://www.jstatsoft.org/v27/i02/.
Drukker D (2003). “Testing for Serial Correlation in Linear Panel–Data Models.” The Stata
Journal, 3(2), 168–177.
Fox J (2016). car: Companion to Applied Regression. R package version 2.1-3, URL
https://cran.r-project.org/package=car, http://socserv.socsci.mcmaster.ca/jfox/.
Gourieroux C, Holly A, Monfort A (1982). “Likelihood Ratio Test, Wald Test, and Kuhn–
Tucker Test in Linear Models With Inequality Constraints on the Regression Parameters.”
Econometrica, 50(1), 63–80.
Harrison D, Rubinfeld D (1978). “Hedonic housing prices and the demand for clean air.”
Journal of Environmental Economics and Management, 5(1), 81–102.
Hausman J, Taylor W (1981). “Panel Data and Unobservable Individual Effects.”
Econometrica, 49(6), 1377–1398.
Honda Y (1985). “Testing the Error Components Model With Non–Normal Disturbances.”
Review of Economic Studies, 52(4), 681–690.
Hothorn T, Zeileis A, Farebrother RW, Cummins C, Millo G, Mitchell D (2015). lmtest:
Testing Linear Regression Models. R package version 0.9-34, URL
https://cran.r-project.org/package=lmtest.
Kleiber C, Zeileis A (2008). Applied Econometrics with R. Springer-Verlag, New York. ISBN
978-0-387-77316-2, URL https://cran.r-project.org/package=AER.
Koenker R, Ng P (2016). SparseM: Sparse Linear Algebra. R package version 1.72, URL
https://cran.r-project.org/package=SparseM.
Laird N, Ware J (1982). “Random–Effects Models for Longitudinal Data.” Biometrics, 38(4),
963–974.
Mundlak Y (1978). “On the Pooling of Time Series and Cross Section Data.” Econometrica,
46(1), 69–85.
Munnell A (1990). “Why Has Productivity Growth Declined? Productivity and Public In-
vestment.” New England Economic Review, pp. 3–22.
Nerlove M (1971). “Further Evidence on the Estimation of Dynamic Economic Relations from
a Time Series of Cross Sections.” Econometrica, 39(2), 359–382.
Pesaran M (2004). “General Diagnostic Tests for Cross Section Dependence in Panels.” CESifo
Working Paper Series, 1229.
Pinheiro J, Bates D, DebRoy S, the R Core team (2007). nlme: Linear and Nonlinear
Mixed Effects Models. R package version 3.1-86, URL
https://cran.r-project.org/package=nlme.
R Development Core Team (2008). R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL
http://www.r-project.org/.
Stock JH, Watson MW (2008). “Heteroskedasticity–Robust Standard Errors for Fixed Effects
Panel Data Regression.” Econometrica, 76(1), 155–174.
Swamy P, Arora S (1972). “The Exact Finite Sample Properties of the Estimators of Coeffi-
cients in the Error Components Regression Models.” Econometrica, 40(2), 261–275.
Therneau T (2014). bdsmatrix: Routines for Block Diagonal Symmetric matrices. R package
version 1.3-2, URL https://cran.r-project.org/package=bdsmatrix.
Wallace T, Hussain A (1969). “The Use of Error Components Models in Combining Cross
Section With Time Series Data.” Econometrica, 37(1), 55–72.
Windmeijer F (2005). “A Finite Sample Correction for the Variance of Linear Efficient Two–
Step GMM Estimators.” Journal of Econometrics, 126(1), 25–51.
Wooldridge J (2002). Econometric Analysis of Cross Section and Panel Data. MIT Press.
Wooldridge J (2010). Econometric Analysis of Cross Section and Panel Data. 2nd edition.
MIT Press.
Zeileis A (2004). “Econometric Computing With HC and HAC Covariance Matrix Estimators.”
Journal of Statistical Software, 11(10), 1–17. URL http://www.jstatsoft.org/v11/i10/.
Affiliation:
Yves Croissant
CEMOI
Faculté de Droit et d’Economie
Université de La Réunion
15 avenue René Cassin
CS 92003
F-97744 Saint-Denis Cedex 9
Telephone: +262262938446
E-mail: [email protected]
Giovanni Millo
DiSES, Un. of Trieste and R&D Dept., Generali SpA
Via Machiavelli 4
34131 Trieste (Italy)
Telephone: +39/040/671184
Fax: +39/040/671160
E-mail: [email protected]