0% found this document useful (0 votes)

125 views31 pages

Bai and NG 2002

This paper develops some econometric theory for factor models of large dimensions. The focus is the determination of the number of factors (r), which is an unresolved issue in the rapidly growing literature on multifactor models. Simulations show that the proposed criteria have good finite sample properties in many configurations of the panel data encountered in practice.

Uploaded by

Milan Martinovic

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

125 views31 pages

Bai and NG 2002

Uploaded by

Milan Martinovic

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 31

Econometrica, Vol. 70, No.

1 (January, 2002), 191–221

DETERMINING THE NUMBER OF FACTORS IN

APPROXIMATE FACTOR MODELS

By Jushan Bai and Serena Ng1

In this paper we develop some econometric theory for factor models of large dimen-
sions. The focus is the determination of the number of factors r, which is an unresolved
issue in the rapidly growing literature on multifactor models. We first establish the con-
vergence rate for the factor estimates that will allow for consistent estimation of r. We
then propose some panel criteria and show that the number of factors can be consistently
estimated using the criteria. The theory is developed under the framework of large cross-
sections (N ) and large time dimensions (T ). No restriction is imposed on the relation
between N and T . Simulations show that the proposed criteria have good finite sample
properties in many configurations of the panel data encountered in practice.

Keywords: Factor analysis, asset pricing, principal components, model selection.

1 introduction
The idea that variations in a large number of economic variables can be
modeled by a small number of reference variables is appealing and is used in
many economic analyses. For example, asset returns are often modeled as a
function of a small number of factors. Stock and Watson (1989) used one ref-
erence variable to model the comovements of four main macroeconomic aggre-
gates. Cross-country variations are also found to have common components; see
Gregory and Head (1999) and Forni, Hallin, Lippi, and Reichlin (2000b). More
recently, Stock and Watson (1999) showed that the forecast error of a large num-
ber of macroeconomic variables can be reduced by including diffusion indexes,
or factors, in structural as well as nonstructural forecasting models. In demand
analysis, Engel curves can be expressed in terms of a finite number of factors.
Lewbel (1991) showed that if a demand system has one common factor, budget
shares should be independent of the level of income. In such a case, the num-
ber of factors is an object of economic interest since if more than one factor is
found, homothetic preferences can be rejected. Factor analysis also provides a
convenient way to study the aggregate implications of microeconomic behavior,
as shown in Forni and Lippi (1997).
Central to both the theoretical and the empirical validity of factor models is
the correct specification of the number of factors. To date, this crucial parameter
1
We thank three anonymous referees for their very constructive comments, which led to a much
improved presentation. The first author acknowledges financial support from the National Science
Foundation under Grant SBR-9709508. We would like to thank participants in the econometrics
seminars at Harvard-MIT, Cornell University, the University of Rochester, and the University of
Pennsylvania for help suggestions and comments. Remaining errors are our own.

191
192 jushan bai and serena ng

is often assumed rather than determined by the data.2 This paper develops a
formal statistical procedure that can consistently estimate the number of factors
from observed data. We demonstrate that the penalty for overfitting must be
a function of both N and T (the cross-section dimension and the time dimen-
sion, respectively) in order to consistently estimate the number of factors. Con-
sequently the usual AIC and BIC, which are functions of N or T alone, do
not work when both dimensions of the panel are large. Our theory is developed
under the assumption that both N and T converge to infinity. This flexibility is
of empirical relevance because the time dimension of datasets relevant to factor
analysis, although small relative to the cross-section dimension, is too large to
justify the assumption of a fixed T .
A small number of papers in the literature have also considered the problem
of determining the number of factors, but the present analysis differs from these
works in important ways. Lewbel (1991) and Donald (1997) used the rank of a
matrix to test for the number of factors, but these theories assume either N or T
is fixed. Cragg and Donald (1997) considered the use of information criteria when
the factors are functions of a set of observable explanatory variables, but the data
still have a fixed dimension. For large dimensional panels, Connor and Korajczyk
(1993) developed a test for the number of factors in asset returns, but their test
is derived under sequential limit asymptotics, i.e., N converges to infinity with a
fixed T and then T converges to infinity. Furthermore, because their test is based
on a comparison of variances over different time periods, covariance stationarity
and homoskedasticity are not only technical assumptions, but are crucial for the
validity of their test. Under the assumption that N → for fixed T , Forni and
Reichlin (1998) suggested a graphical approach to identify√ the number of factors,
but no theory is available. Assuming N T → with N /T → , Stock and
Watson (1998) showed that a modification to the BIC can be used to select
the number of factors optimal for forecasting a single series. Their criterion is
restrictive not only because it requires N T , but also because there can be
factors that are pervasive for a set of data and yet have no predictive ability
for an individual data series. Thus, their rule may not be appropriate outside of
the forecasting framework. Forni, Hallin, Lippi, and Reichlin (2000a) suggested
a multivariate variant of the AIC but neither the theoretical nor the empirical
properties of the criterion are known.
We set up the determination of factors as a model selection problem. In con-
sequence, the proposed criteria depend on the usual trade-off between good fit
and parsimony. However, the problem is nonstandard not only because account
needs to be taken of the sample size in both the cross-section and the time-
series dimensions, but also because the factors are not observed. The theory we
developed does not rely on sequential limits, nor does it impose any restrictions
between N and T . The results hold under heteroskedasticity in both the time and

2
Lehmann and Modest (1988), for example, tested the APT for 5, 10, and 15 factors. Stock and
Watson (1989) assumed there is one factor underlying the coincident index. Ghysels and Ng (1998)
tested the afﬁne term structure model assuming two factors.
approximate factor models 193

the cross-section dimensions. The results also hold under weak serial and cross-
section dependence. Simulations show that the criteria have good finite sample
properties.
The rest of the paper is organized as follows. Section 2 sets up the preliminaries
and introduces notation and assumptions. Estimation of the factors is considered
in Section 3 and the estimation of the number of factors is studied in Section 4.
Specific criteria are considered in Section 5 and their finite sample properties
are considered in Section 6, along with an empirical application to asset returns.
Concluding remarks are provided in Section 7. All the proofs are given in the
Appendix.

2 factor models
Let Xit be the observed data for the ith cross-section unit at time t, for i =
1 N , and t = 1 T . Consider the following model:

(1) Xit = i Ft + eit

where Ft is a vector of common factors, i is a vector of factor loadings associated

with Ft , and eit is the idiosyncratic component of Xit . The product i Ft is called
the common component of Xit . Equation (1) is then the factor representation of
the data. Note that the factors, their loadings, as well as the idiosyncratic errors
are not observable.
Factor analysis allows for dimension reduction and is a useful statistical tool.
Many economic analyses fit naturally into the framework given by (1).
1. Arbitrage pricing theory. In the finance literature, the arbitrage pricing the-
ory (APT) of Ross (1976) assumes that a small number of factors can be used to
explain a large number of asset returns. In this case, Xit represents the return of
asset i at time t Ft represents the vector of factor returns, and eit is the idiosyn-
cratic component of returns. Although analytical convenience makes it appealing
to assume one factor, there is growing evidence against the adequacy of a single
factor in explaining asset returns.3 The shifting interest towards use of multifactor
models inevitably calls for a formal procedure to determine the number of fac-
tors. The analysis to follow allows the number of factors to be determined even
when N and T are both large. This is especially suited for financial applications
when data are widely available for a large number of assets over an increasingly
long span. Once the number of factors is determined, the factor returns Ft can
also be consistently estimated (up to an invertible transformation).
2. The rank of a demand system. Let p be a price vector for J goods and ser-
vices, eh be total spending on the J goods by household h. Consumer theory
postulates that Marshallian demand for good j by consumer h is Xjh = gj p eh .
Let wjh = Xjh /eh be the budget share for household h on the jth good. The
3
Cochrane (1999) stressed that financial economists now recognize that there are multiple sources
of risk, or factors, that give rise to high returns. Backus, Forsei, Mozumdar, and Wu (1997) made
similar conclusions in the context of the market for foreign assets.
194 jushan bai and serena ng

rank of a demand system holding prices fixed is the smallest integer r such that
wj e = j1 G1 e + · · · + jr Gr e. Demand systems are of the form (1) where the
r factors, common across goods, are Fh = G1 eh · · · Gr eh . When the number
of households, H , converges to infinity with a fixed J G1 e · · · Gr e can be esti-
mated simultaneously, such as by nonparametric methods developed in Donald
(1997). This approach will not work when the number of goods, J , also converges
to infinity. However, the theory to be developed in this paper will still provide a
consistent estimation of r and without the need for nonparametric estimation of
the G· functions. Once the rank of the demand system is determined, the non-
parametric functions evaluated at eh allow Fh to be consistently estimable (up to
a transformation). Then functions G1 e · · · Gr e may be recovered (also up to
a matrix transformation) from Fh h = 1 H via nonparametric estimation.
3. Forecasting with diffusion indices. Stock and Watson (1998, 1999) considered
forecasting inflation with diffusion indices (“factors”) constructed from a large
number of macroeconomic series. The underlying premise is that these series may
be driven by a small number of unobservable factors. Consider the forecasting
equation for a scalar series

yt+1 = Ft + Wt + t

The variables Wt are observable. Although we do not observe Ft , we observe

Xit i = 1 N . Suppose Xit bears relation with Ft as in (1). In the present
context, we interpret (1) as the reduced-form representation of Xit in terms of
the unobservable factors. We can ﬁrst estimate Ft from (1). Denote it by Ft . We
can then regress yt on Ft−1 and Wt−1 to obtain the coefﬁcients ˆ and , ˆ from
which a forecast

ŷT +1T = ˆ FT + W
ˆ T

can be formed. Stock and Watson (1998, 1999) showed that this approach of
forecasting outperforms many competing forecasting methods. But as pointed
out earlier, the dimension of F in Stock and Watson (1998, 1999) was determined
using a criterion that minimizes the mean squared forecast errors of y. This may
not be the same as the number of factors underlying Xit , which is the focus of
this paper.

21 Notation and Preliminaries

Let Ft0 0i ,
and r denote the true common factors, the factor loadings, and the
true number of factors, respectively. Note that Ft0 is r dimensional. We assume
that r does not depend on N or T . At a given t, we have
Xt = 0 Ft0 + et
(2)
N × 1 N × r r × 1 N × 1

where Xt = X1t X2t XNt 0 = 01 02 0N , and et = e1t e2t

eNt . Our objective is to determine the true number of factors, r. In classical
approximate factor models 195

factor analysis (e.g., Anderson (1984)), N is assumed fixed, the factors are inde-
pendent of the errors et , and the covariance of et is diagonal. Normalizing the

covariance matrix of Ft to be an identity matrix, we have = 0 0 + !, where
and ! are the covariance matrices of Xt and et , respectively. Under these
assumptions, a root-T consistent and asymptotically normal estimator of , say,

the sample covariance matrix = 1/T Tt=1 Xt − XX

t − X can be obtained.
The essentials of classical factor analysis carry over to the case of large N but
fixed T since the N × N problem can be turned into a T × T problem, as noted
by Connor and Korajczyk (1993) and others.
Inference on r under classical assumptions can, in theory, be based on the
eigenvalues of since a characteristic of a panel of data that has an r factor
representation is that the first r largest population eigenvalues of the N × N
covariance of Xt diverge as N increases to infinity, but the r + 1th eigenvalue
is bounded; see Chamberlain and Rothschild (1983). But it can be shown that all
nonzero sample eigenvalues (not just the first r) of the matrix increase with
N , and a test based on the sample eigenvalues is thus not feasible. A likelihood
ratio test can also, in theory, be used to select the number of factors if, in addi-
tion, normality of et is assumed. But as found by Dhrymes, Friend, and Glutekin
(1984), the number of statistically significant factors determined by the likelihood
ratio test increases with N even if the true number of factors is fixed. Other
methods have also been developed to estimate the number of factors assuming
the size of one dimension is fixed. But Monte Carlo simulations in Cragg and
Donald (1997) show that these methods tend to perform poorly for moderately
large N and T . The fundamental problem is that the theory developed for clas-
sical factor models does not apply when both N and T → . This is because
consistent estimation of (whether it is an N × N or a T × T matrix) is not a
well defined problem. For example, when N > T , the rank of is no more than
T , whereas the rank of can always be N . New theories are thus required to
analyze large dimensional factor models.
In this paper, we develop asymptotic results for consistent estimation of the
number of factors when N and T → . Our results complement the sparse but
growing literature on large dimensional factor analysis. Forni and Lippi (2000)
and Forni et al. (2000a) obtained general results for dynamic factor models, while
Stock and Watson (1998) provided some asymptotic results in the context of
forecasting. As in these papers, we allow for cross-section and serial dependence.
In addition, we also allow for heteroskedasticity in et and some weak dependence
between the factors and the errors. These latter generalizations are new in our
analysis. Evidently, our assumptions are more general than those used when the
sample size is fixed in one dimension.
Let X i be a T × 1 vector of time-series observations for the ith cross-section
unit. For a given i, we have

Xi = F 0 0i + ei
(3)
T × 1 T × r r × 1 T × 1
196 jushan bai and serena ng

where X i = Xi1 Xi2 XiT F 0 = F10 F20 FT0 , and ei = ei1 ei2
eiT . For the panel of data X = X 1 X N , we have

X = F0 0 + e
(4)
T × N T × r r × N T × N

with e = e1 eN .
Let trA denote the trace of A. The norm of the matrix A is then A =
trA A1/2 . The following assumptions are made:
T 0 0
Assumption A—Factors: EFt0 4 < and T −1 t=1 Ft Ft → F as T →
for some positive deﬁnite matrix F .

Assumption B—Factor Loadings: i ≤ ¯ < , and 0 0 /N − D → 0 as

N → for some r × r positive deﬁnite matrix D.

Assumption C—Time and Cross-Section Dependence and Heteroskedasticity:

There exists a positive constant M < , such that for all N and T ,
1. Eeit = 0 Eeit 8 ≤ M;

2. Ees et /N = EN −1 N i=1 eis eit = )N s t )N s s ≤ M for all s, and

T −1 Ts=1 Tt=1 )N s t ≤ M;
3. Eeit ejt = *ij t with *ij t ≤ *ij for some *ij and for all t; in addition,
N
N −1 N i=1 j=1 *ij ≤ M;
N T T
4. Eeit ejs = *ij ts and N T −1 N i=1 j=1 t=1 s=1 *ij ts ≤ M;

5. for every t s EN −1/2 N i=1 e is e it − Ee is e it 4
≤ M.

Assumption D—Weak Dependence between Factors and Idiosyncratic Errors:

N
2
1 1 T
E Ft eit
0
N i=1 √
T t=1 ≤ M

Assumption A is standard for factor models. Assumption B ensures that each

factor has a nontrivial contribution to the variance of Xt . We only consider
nonrandom factor loadings for simplicity. Our results still hold when the i are
random, provided they are independent of the factors and idiosyncratic errors,
and E i 4 ≤ M. Assumption C allows for limited time-series and cross-section
dependence in the idiosyncratic component. Heteroskedasticity in both the time
and cross-section dimensions is also allowed. Under stationarity in the time
dimension, )N s t = )N s − t, though the condition is not necessary. Given
Assumption C1, the remaining assumptions in C are easily satisﬁed if the eit are
independent for all i and t. The allowance for some correlation in the idiosyn-
cratic components sets up the model to have an approximate factor structure. It is
more general than a strict factor model, which assumes eit is uncorrelated across
i, the framework in which the APT theory of Ross (1976) is based. Thus, the
results to be developed will also apply to strict factor models. When the factors
approximate factor models 197

and idiosyncratic errors are independent (a standard assumption for conventional

factor models), Assumption D is implied by Assumptions A and C. Independence
is not required for D to be true. For example, suppose that eit = it Ft with
it being independent of Ft and it satisﬁes Assumption C; then Assumption D
holds. Finally, the developments proceed assuming that the panel is balanced.
We also note that the model being analyzed is static, in the sense that Xit has a
contemporaneous relationship with the factors. The analysis of dynamic models
is beyond the scope of this paper.
For a factor model to be an approximate factor model in the sense of
Chamberlain and Rothschild (1983), the largest eigenvalue (and hence all of
the eigenvalues) of the N × N covariance matrix ! = Eet et must be bounded.
Note that Chamberlain and Rothschild focused on the cross-section behavior of
the model and did not make explicit assumptions about the time-series behavior
of the model. Our framework allows for serial correlation and heteroskedastic-
ity and is more general than their setup. But if we assume et is stationary with
Eeit ejt = *ij , then from matrix theory, the largest eigenvalue of ! is bounded
N
by max i N j=1 *ij . Thus if we assume j=1 *ij ≤ M for all i and all N , which
implies Assumption C3, then (2) will be an approximate factor model in the
sense of Chamberlain and Rothschild.

3 estimation of the common factors

When N is small, factor models are often expressed in state space form, nor-
mality is assumed, and the parameters are estimated by maximum likelihood. For
example, Stock and Watson (1989) used N = 4 variables to estimate one factor,
the coincident index. The drawback is that because the number of parameters
increases with N ,4 computational difficulties make it necessary to abandon infor-
mation on many series even though they are available.
We estimate common factors in large panels by the method of asymptotic prin-
cipal components.5 The number of factors that can be estimated by this (non-
parametric) method is min+N T ,, much larger than permitted by estimation
of state space models. But to determine which of these factors are statistically
important, it is necessary to first establish consistency of all the estimated com-
mon factors when both N and T are large. We start with an arbitrary number
k k < min+N T ,. The superscript in ki and Ftk signifies the allowance of k
factors in the estimation. Estimates of k and F k are obtained by solving the
optimization problem

N
T 2
V k = minN T −1 Xit − ki Ftk
F k i=1 t=1

4
Gregory, Head, and Raynauld (1997) estimated a world factor and seven country specific fac-
tors from output, consumption, and investment for each of the G7 countries. The exercise involved
estimation of 92 parameters and perhaps stretched the state-space model to its limit.
5
The method of asymptotic principal components was studied by Connor and Korajzcyk (1986)
and Connor and Korajzcyk (1988) for fixed T . Forni et al. (2000a) and Stock and Watson (1998)
considered the method for large T .
198 jushan bai and serena ng

subject to the normalization of either k k /N = Ik or F k F k /T = Ik . If we con-

centrate out k and use the normalization that F k F k /T = Ik , the optimization

to maximizing trF k XX F k . The estimated factor matrix,
problem is identical √
k
denoted by F , is T times the eigenvectors corresponding to the k largest
k = F k F k −1 F k X = F k X/T
eigenvalues of the T × T matrix XX . Given F k
is the corresponding matrix of factor loadings.
The solution to the above minimization problem is not unique, even though the
sum of squared residuals V k k k
√ is unique. Another solution is given by F ,
k
where is constructed as N times the eigenvectors corresponding to the k
largest eigenvalues of the N ×N matrix X X. The normalization that k
k /N =
k k
Ik implies F = X /N . The second set of calculations is computationally less
costly when T > N , while the first is less intensive when T < N .6
Define

Fk = F k F k F k /T 1/2

a rescaled estimator of the factors. The following theorem summarizes the asymp-
totic properties of the estimated factors.

Theorem 1: For any ﬁxed k ≥ 1, √ there √exists a r × k matrix H k with

k
rankH = min+k r,, and CN T = min+ N T ,, such that
T
1
Fk − H k F 0 2 = Op 1
(5) CN2 T
T t=1 t t

Because the true factors (F 0 ) can only be identiﬁed up to scale, what is being
considered is a rotation of F 0 . The theorem establishes that the time average of
the squared deviations between the estimated factors and those that lie in the
true factor space vanish as N T → . The rate of convergence is determined by
the smaller of N or T , and thus depends on the panel structure.
Under the additional assumption that Ts=1 )N s t2 ≤ M for all t and T , the
result7
2
CN2 T Ft − H k Ft0 = Op 1 for each t

(6)

can be obtained. Neither Theorem 1 nor (6) implies uniform convergence in t.

Uniform convergence is considered by Stock and Watson (1998). These authors
obtained a much slower convergence rate than CN2 T , and their result requires
√
N T . An important insight of this paper is that, to consistently estimate
the number of factors, neither (6) nor uniform convergence is required. It is
the average convergence rate of Theorem 1 that is essential. However, (6) could
be useful for statistical analysis on the estimated factors and is thus a result of
independent interest.
6
A more detailed account of computation issues, including how to deal with unbalanced panels, is
given in Stock and Watson (1998).
7
The proof is actually simpler than that of Theorem 1 and is thus omitted to avoid repetition.
approximate factor models 199

4 estimating the number of factors

Suppose for the moment that we observe all potentially informative factors
but not the factor loadings. Then the problem is simply to choose k factors that
best capture the variations in X and estimate the corresponding factor loadings.
Since the model is linear and the factors are observed, i can be estimated by
applying ordinary least squares to each equation. This is then a classical model
selection problem. A model with k + 1 factors can ﬁt no worse than a model
with k factors, but efﬁciency is lost as more factor loadings are being estimated.
Let F k be a matrix of k factors, and
1 N T
2
V k F k = min Xit − ki Ftk
N T i=1 t=1

be the sum of squared residuals (divided by N T ) from time-series regressions of

X i on the k factors for all i. Then a loss function V k F k + kgN T , where
gN T is the penalty for overﬁtting, can be used to determine k. Because the
estimation of i is classical, it can be shown that the BIC with gN T = lnT /T
can consistently estimate r. On the other hand, the AIC with gN T = 2/T
may choose k > r even in large samples. The result is the same as in Geweke
and Meese (1981) derived for N = 1 because when the factors are observed,
the penalty factor does not need to take into account the sample size in the
cross-section dimension. Our main result is to show that this will no longer be
true when the factors have to be estimated, and even the BIC will not always
consistently estimate r.
Without loss of generality, we let

1 N T 2
V k Fk = min Xit − ki Ftk

(7)
NT
i=1 t=1

denote the sum of squared residuals (divided by N T ) when k factors are esti-
mated. This sum of squared residuals does not depend on which estimate of F is
used because they span the same vector space. That is, V k F k = V k F k =
V k Fk . We want to ﬁnd penalty functions, gN T , such that criteria of the
form

PCk = V k Fk + kgN T

can consistently estimate r. Let kmax be a bounded integer such that r ≤ kmax.

Theorem 2: Suppose that Assumptions A–D hold and that the k factors
are estimated by principal components. Let k̂ = arg min0≤k≤kmax PCk. Then
2
limN T → Probk̂ = r = 1 if√(i) gN
√ T → 0 and (ii) CN T · gN T → as N ,
T → , where CN T = min+ N T ,.

Conditions (i) and (ii) are necessary in the sense that if one of the conditions
is violated, then there will exist a factor model satisfying Assumptions A–D, and
200 jushan bai and serena ng

yet the number of factors cannot be consistently estimated. However, conditions

(i) and (ii) are not always required to obtain a consistent estimate of r.
A formal proof of Theorem 2 is provided in the Appendix. The crucial ele-
ment in consistent estimation of r is a penalty factor that vanishes at an appro-
priate rate such that under and overparameterized models will not be chosen.
An implication of Theorem 2 is the following:

Corollary 1: Under the Assumptions of Theorem 2, the class of criteria

deﬁned by

ICk = lnV k Fk + kgN T

will also consistently estimate r.

Note that V k Fk is simply the average residual variance when k factors are
assumed for each cross-section unit. The IC criteria thus resemble information
criteria frequently used in time-series analysis, with the important difference that
the penalty here depends on both N and T .
Thus far, it has been assumed that the common factors are estimated by the
method of principle components. Forni and Reichlin (1998) and Forni et al.
(2000a) studied alternative estimation methods. However the proof of Theorem 2
mainly uses the fact that Ft satisﬁes Theorem 1, and does not rely on principal
components per se. We have the following corollary:

Corollary 2: Let G k be an arbitrary estimator of F 0 . Suppose there exists a

k k = min+k r,, and for some C
matrix H such that rank H 2 ≤ C 2 ,
NT NT

T
1
(8) 2
C G k F 0 2 = Op 1
k − H
NT t t
T t=1

k and CN T replaced by C
Then Theorem 2 still holds with Fk replaced by G N T .

The sequence of constants C 2 does not need to equal C 2 = min+N T ,.

NT NT
t satisfying
Theorem 2 holds for any estimation method that yields estimators G
8 2
(8). Naturally, the penalty would then depend on CN T , the convergence rate
for Gt .

5 the PCp and the ICp

In this section, we assume that the method of principal components is used to

estimate the factors and propose speciﬁc formulations of gN T to be used in

8
We are grateful for a referee whose question led to the results reported here.
approximate factor models 201
N T
practice. Let 3̂ 2 be a consistent estimate of N T −1 i=1 t=1 Eeit
2
. Consider
the following criteria:

N +T NT
PCp1 k = V k Fk + k3̂ 2 ln 4
NT N +T

k 2 N +T
PCp2 k = V k F + k3̂ ln CN2 T 4
NT
2

k 2 ln CN T
PCp3 k = V k F + k3̂
CN2 T

Since V k Fk = N −1 N 2 2
i=1 3̂i , where 3̂i = êi êi /T , the criteria generalize the Cp
criterion of Mallows (1973) developed for selection of models in strict time-
series or cross-section contexts to a panel data setting. For this reason, we refer
to these statistics as Panel Cp PCp criteria. Like the Cp criterion, 3̂ 2 provides
the proper scaling to the penalty term. In applications, it can be replaced by
V kmax Fkmax . The proposed penalty functions are based on the sample size in
the smaller of the two dimensions. All three criteria satisfy conditions (i) and (ii)
of Theorem 2 since CN−2T ≈ N + T /N T → 0 as N T → . However, in ﬁnite
samples, CN−2T ≤ N + T /N T . Hence, the three criteria, although asymptotically
equivalent, will have different properties in ﬁnite samples.9
Corollary 1 leads to consideration of the following three criteria:

k N +T NT
ICp1 k = lnV k F + k ln 4
NT N +T

N +T
(9) ICp2 k = lnV k Fk + k ln CN2 T 4
NT

ln CN2 T
ICp3 k = lnV k Fk + k
CN2 T

The main advantage of these three panel information criteria (ICp ) is that they
do not depend on the choice of kmax through 3̂ 2 , which could be desirable in
practice. The scaling by 3̂ 2 is implicitly performed by the logarithmic transfor-
mation of V k Fk and thus not required in the penalty term.
The proposed criteria differ from the conventional Cp and information criteria
used in time-series analysis in that gN T is a function of both N and T . To
understand why the penalty must be speciﬁed as a function of the sample size in

9
Note that PCp1 and PCp2 , and likewise, ICp1 and ICp2 , apply speciﬁcally to the principal com-
ponents estimator because CN2 T = min+N T , is used in deriving them. For alternative estimators
N T .
satisfying Corollary 2, criteria PCp3 and ICp3 are still applicable with CN T replaced by C
202 jushan bai and serena ng

both dimensions, consider the following:

k 2 2
AIC1 k = V k F + k3̂ 4
T

ln T
BIC1 k = V k Fk + k3̂ 2 4
T

k 2 2
AIC2 k = V k F + k3̂ 4
N

ln N
BIC2 k = V k Fk + k3̂ 2 4
N

k 2 N + T − k
AIC3 k = V k F + k3̂ 2 4
NT

N + T − k lnN T
BIC3 k = V k Fk + k3̂ 2
NT
The penalty factors in AIC1 and BIC1 are standard in time-series applications.
Although gN T → 0 as T → AIC1 fails the second condition of Theorem 2
for all N and T . When N T and N logT /T −→ , the BIC1 also fails
condition (ii) of Theorem 2. Thus we expect the AIC1 will not work for all N
and T , while the BIC1 will not work for small N relative to T . By analogy, AIC2
also fails the conditions of Theorem 2, while BIC2 will work only if N T .
The next two criteria, AIC3 and BIC3 , take into account the panel nature of the
problem. The two specifications of gN T reflect first, that the effective number
of observations is N · T , and second, that the total number of parameters being
estimated is kN + T − k. It is easy to see that AIC3 fails the second condition
of Theorem 2. While the BIC3 satisfies this condition, gN T does not always
vanish. For example, if N = expT , then gN T → 1 and the first condition
of Theorem 2 will not be satisfied. Similarly, gN T does not vanish when T =
expN . Therefore BIC3 may perform well for some but not all configurations
of the data. In contrast, the proposed criteria satisfy both conditions stated in
Theorem 2.

6 simulations and an empirical application

We ﬁrst simulate data from the following model:

r √
Xit = ij Ftj + 6eit
j=1
√
= cit + 6eit

where the factors are T × r matrices of N 0 1 variables, and the factor load-
ings are N 0 1 variates. Hence, the common component of Xit , denoted by cit ,
has variance r. Results with ij uniformly distributed are similar and will not
approximate factor models 203

be reported. Our base case assumes that the idiosyncratic component has the
same variance as the common component (i.e. 6 = r. We consider thirty con-
figurations of the data. The first five simulate plausible asset pricing applications
with five years of monthly data (T = 60) on 100 to 2000 asset returns. We then
increase T to 100. Configurations with N = 60 T = 100 and 200 are plausible
sizes of datasets for sectors, states, regions, and countries. Other configurations
are considered to assess the general properties of the proposed criteria. All com-
putations were performed using Matlab Version 5.3.
Reported in Tables I to III are the averages of k̂ over 1000 replications, for
r = 1 3, and 5 respectively, assuming that eit is homoskedastic N 0 1. For all
cases, the maximum number of factors, kmax, is set to 8.10 Prior to computation
of the eigenvectors, each series is demeaned and standardized to have unit vari-
ance. Of the three PCp criteria that satisfy Theorem 2, PCp3 is less robust than
PCp1 and PCp2 when N or T is small. The ICp criteria generally have prop-
erties very similar to the PCp criteria. The term N T /N + T provides a small
sample correction to the asymptotic convergence rate of CN2 T and has the effect
of adjusting the penalty upwards. The simulations show this adjustment to be
desirable. When min+N T , is 40 or larger, the proposed tests give precise esti-
mates of the number of factors. Since our theory is based on large N and T , it is
not surprising that for very small N or T , the proposed criteria are inadequate.
Results reported in the last five rows of each table indicate that the ICp crite-
ria tend to underparameterize, while the PCp tend to overparameterize, but the
problem is still less severe than the AIC and the BIC, which we now consider.
The AIC and BIC’s that are functions of only N or T have the tendency
to choose too many factors. The AIC3 performs somewhat better than AIC1
and AIC2 , but still tends to overparameterize. At first glance, the BIC3 appears
to perform well. Although BIC3 resembles PCp2 , the former penalizes an extra
factor more heavily since lnN T > ln CN2 T . As can be seen from Tables II and
III, the BIC3 tends to underestimate r, and the problem becomes more severe
as r increases.
Table IV relaxes the assumption of homoskedasticity. Instead, we let eit = eit1
for t odd, and eit = eit1 + eit2 for t even, where eit1 and eit2 are independent N 0 1.
Thus, the variance in the even periods is twice as large as the odd periods.
Without loss of generality, we only report results for r = 5. PCp1 PCp2 ICp1 , and
ICp2 continue to select the true number of factors very accurately and dominate
the remaining criteria considered.
We then vary the variance of the idiosyncratic errors relative to the common
component. When 6 < r, the variance of the common component is relatively
large. Not surprisingly, the proposed criteria give precise estimates of r. The
results will not be reported without loss of generality. Table V considers the case
6 = 2r. Since the variance of the idiosyncratic component is larger than the

10
In time-series analysis, a rule such as 8 int[T /1001/4 ] considered in Schwert (1989) is sometimes
used to set kmax, but no such guide is available for panel analysis. Until further results are available,
a rule that replaces T in Schwert’s rule by min+N T , could be considered.
204 jushan bai and serena ng

TABLE I
r √
DGP: Xit = j=1 ij Ftj + 6eit ; r = 1; 6 = 1.