Lasso RRR

Zhipeng Liao*    Peter C. B. Phillips†
Abstract
Model selection and associated issues of post-model selection inference present well known
challenges in empirical econometric research. These modeling issues are manifest in all applied
work but they are particularly acute in multivariate time series settings such as cointegrated
systems where multiple interconnected decisions can materially affect the form of the model
and its interpretation. In cointegrated system modeling, empirical estimation typically proceeds
in a stepwise manner that involves the determination of cointegrating rank and autoregressive
lag order in a reduced rank vector autoregression followed by estimation and inference. This
paper proposes an automated approach to cointegrated system modeling that uses adaptive
shrinkage techniques to estimate vector error correction models with unknown cointegrating
rank structure and unknown transient lag dynamic order. These methods enable simultaneous
order estimation of the cointegrating rank and autoregressive order in conjunction with oracle-
like e¢ cient estimation of the cointegrating matrix and transient dynamics. As such they o¤er
considerable advantages to the practitioner as an automated approach to the estimation of
cointegrated systems. The paper develops the new methods, derives their limit theory, discusses
implementation, reports simulations and presents an empirical illustration with macroeconomic
aggregates.
1 Introduction
Cointegrated system modeling is now one of the main workhorses in empirical time series research.
Much of this empirical research makes use of vector error correction (VECM) formulations. While
*Department of Economics, UC Los Angeles, 8379 Bunche Hall, Mail Stop: 147703, Los Angeles, CA 90095. Email: [email protected]
†Yale University, University of Auckland, University of Southampton and Singapore Management University. Support from the NSF under Grant No. SES 09-56687 is gratefully acknowledged. Email: [email protected]
there is often some prior information concerning the number of cointegrating vectors, most practical
work involves (at least confirmatory) pre-testing to determine the cointegrating rank of the system
as well as the lag order in the autoregressive component that embodies the transient dynamics.
These order selection decisions can be made by sequential likelihood ratio tests (e.g. Johansen,
1988, for rank determination) or the application of suitable information criteria (Phillips, 1996).
Both approaches are popular in empirical research.
Information criteria offer certain advantages such as joint determination of the cointegrating
rank and autoregressive order, consistent estimation of both order parameters (Chao and Phillips,
1999; Athanasopoulos et al., 2011), robustness to heterogeneity in the errors, and the convenience
and generality of semi-parametric estimation in cases where the focus is simply the cointegrating
rank (Cheng and Phillips, 2010, 2012). Sequential testing procedures have recent enhancements
including bootstrap modifications to improve test performance and under certain conditions provide
consistent order estimation by adaptation if test size is driven to zero as the sample size expands to
infinity. However, these adaptive methods have not been systematically investigated in the VECM
framework and there is little research on rate control and testing order, and no asymptotics for
such adaptive procedures to offer guidance for empirical implementation. More importantly in
the VECM setting, sequential tests involve different test statistics for lags and cointegrating rank,
and model selection is inevitably unstable in the sense that different models may be selected when
different sequential orders are used. Moreover, general to specific and specific to general testing
algorithms encounter obstacles to consistent model selection even when test size is driven to zero
(see section 9 for an example). Finally, while they are appealing to practitioners, all of these
methods are nonetheless subject to pre-test bias and post model selection inferential problems
(Leeb and Pötscher, 2005).
The present paper explores a different approach. The goal is to liberate the empirical researcher
from some of the difficulties of sequential testing and order estimation procedures in inference
about cointegrated systems and in policy work that relies on associated impulse responses. The
ideas originate in recent work on sparse system estimation using shrinkage techniques such as Lasso
and bridge regression. These procedures utilize penalized least squares criteria in regression that
can succeed, at least asymptotically, in selecting the correct regressors in a linear regression frame-
work while consistently estimating the non-zero regression coefficients. Caner and Knight (2009)
first showed how this type of estimator may be used in a univariate autoregressive model with a
potential unit root. While apparently effective asymptotically these procedures do not avoid post
model selection inference issues in finite samples because the estimators implicitly carry effects from
the implementation of shrinkage which can result in bias, multimodal distributions and difficulty
discriminating local alternatives that can lead to unbounded risk (Leeb and Pötscher, 2008). On
the other hand, the methods do radically simplify empirical research with large dimensional sys-
tems where order parameters must be chosen and sparsity is expected. When data-based tuning
parameter selection is employed, the methods also enable automated implementation making them
convenient for empirical practice.
One of the contributions of this paper is to develop new adaptive versions of shrinkage methods
that apply in vector error correction modeling which by their nature involve reduced rank coefficient
matrices and order parameters for lag polynomials and trend specifications. The implementation
of these methods in this econometric setting is by no means immediate. In particular, multivariate
models with some unit roots and cointegration involve dimension reductions and nonlinear restric-
tions which present new di¢ culties of both formulation and asymptotics in the Lasso framework
that go beyond existing work in the statistics literature such as Yuan et al (2007). The present pa-
per contributes to the Lasso and econometric literatures by providing a new penalty function that
handles these complications, developing a rigorous limit theory of order selection and estimation
for this multivariate nonlinear nonstationary setting, and devising a straightforward method of im-
plementation that is well suited to empirical econometric research. When reduced to the univariate
case, our results cover the methodology and implicit unit root test procedure suggested in Caner
and Knight (2009) and extend their univariate results to cases where there is misspecification in
the transient dynamics.
The paper designs a mechanism of estimation and selection that works through the eigenvalues of
the levels coefficient matrix and the coefficient matrices of the transient dynamic components. This
formulation is necessary because of the nonlinearities involved in potential reduced rank structures
and the interdependence of decision making concerning the form of the transient dynamics and
the cointegrating rank structure. The resulting methods apply in quite general vector systems
with unknown cointegrating rank structure and unknown lag dynamics. They permit simultaneous
order estimation of the cointegrating rank and autoregressive order in conjunction with oracle-
like efficient estimation of the cointegrating matrix and transient dynamics. As such they offer
considerable advantages to the practitioner. In effect, it becomes unnecessary to implement pre-testing procedures because the empirical results reveal all of the order parameters as a consequence of the fitting procedure.
A novel contribution of the paper in this nonlinear setting where eigenvalues play a key role is the
use of a penalty which is a simple convex function of the coefficient matrix. The new penalty makes
penalized estimation stable and accurate, facilitates the limit theory, and simplifies implementation
because existing code for grouped L-1 penalized estimation can be used for computation. All the
theoretical results are rigorously derived in a general nonstationary set-up that allows for unit
roots, cointegration and transient dynamics, which combines with the new penalty formulation
to complement recent asymptotic theory for Lasso estimation in stationary vector autoregressive
(VAR) models (Song and Bickel, 2009; Kock and Callot, 2012) and multivariate regression (Yuan
et al, 2007; Peng et al., 2010).
The paper is organized as follows. Section 2 lays out the model and assumptions and shows how
to implement adaptive shrinkage methods in VECM systems. Section 3 considers a simplified first
order version of the VECM without lagged differences which reveals the approach to cointegrating
rank selection and develops key elements in the limit theory. Here we show that the cointegrating
rank $r_o$ is identified by the number of zero eigenvalues of $\Pi_o$ and the latter is consistently recovered
by suitably designed shrinkage estimation. Section 4 extends this system and its asymptotics to the
general case of cointegrated systems with weakly dependent errors. Here it is demonstrated that
the cointegration rank $r_o$ can be consistently selected despite the fact that $\Pi_o$ itself may not be consistently estimable. Section 5 deals with the practically important case of a general VECM system
driven by independent identically distributed (iid) shocks, where shrinkage estimation simultaneously performs consistent lag selection, cointegrating rank selection, and optimal estimation of the
system coefficients. Section 6 considers adaptive selection of the tuning parameter and Section 7
reports some simulation findings. Section 8 applies our method to an empirical example. Section
9 concludes and outlines some useful extensions of the methods and limit theory to other models.
Proofs and some supplementary technical results are given in the Appendix.
Notation is standard. For vector-valued, zero mean, covariance stationary stochastic processes $\{a_t\}_{t\ge1}$ and $\{b_t\}_{t\ge1}$, $\Sigma_{ab}(h) = E[a_t b_{t+h}']$ and $\Lambda_{ab} = \sum_{h=0}^{\infty}\Sigma_{ab}(h)$ denote the lag $h$ autocovariance matrix and one-sided long-run covariance matrix. Moreover, we use $\Sigma_{ab}$ for $\Sigma_{ab}(0)$ and $\Sigma_{n,ab} = n^{-1}\sum_{t=1}^{n} a_t b_t'$ for the corresponding sample average. The notation $\|\cdot\|$ denotes the Euclidean norm and $|A|$ is the determinant of a square matrix $A$. $A'$ refers to the transpose of any matrix $A$ and $\|A\|_B \equiv \|A'BA\|$ for any conformable matrices $A$ and $B$. $I_k$ and $0_l$ are used to denote the $k\times k$ identity matrix and $l\times l$ zero matrix respectively. The symbolism $A \equiv B$ means that $A$ is defined as $B$; the expression $a_n = o_p(b_n)$ signifies that $\Pr(|a_n/b_n| \ge \epsilon) \to 0$ for all $\epsilon > 0$ as $n$ goes to infinity; and $a_n = O_p(b_n)$ when $\Pr(|a_n/b_n| \ge M) \to 0$ as $n$ and $M$ go to infinity. As usual, "$\to_p$" and "$\to_d$" denote convergence in probability and convergence in distribution, respectively.
2 Vector Error Correction and Adaptive Shrinkage
Throughout this paper we consider the following parametric VECM representation of a cointegrated system
$$\Delta Y_t = \Pi_o Y_{t-1} + \sum_{j=1}^{p} B_{o,j}\,\Delta Y_{t-j} + u_t, \qquad (2.1)$$
where $\Delta Y_t = Y_t - Y_{t-1}$, $Y_t$ is an $m$-dimensional vector-valued time series, $\Pi_o = \alpha_o\beta_o'$ has rank $r_o$ with $0 \le r_o \le m$, the $B_{o,j}$ $(j = 1,\ldots,p)$ are $m\times m$ (transient) coefficient matrices and $u_t$ is an $m$-vector error term with mean zero and nonsingular covariance matrix $\Sigma_{uu}$. The rank $r_o$ of $\Pi_o$ is an order parameter measuring the cointegrating rank or the number of (long run) cointegrating relations in the system. The index set of non-zero matrices $B_{o,j}$ $(j = 1,\ldots,p)$ is a second order parameter, characterizing the transient dynamics in the system.
As $\Pi_o = \alpha_o\beta_o'$ has rank $r_o$, we can choose $\alpha_o$ and $\beta_o$ to be $m\times r_o$ matrices with full rank. When $r_o = 0$, we simply take $\Pi_o = 0$. Let $\alpha_{o,\perp}$ and $\beta_{o,\perp}$ be the matrix orthogonal complements of $\alpha_o$ and $\beta_o$ and, without loss of generality, assume that $\alpha_{o,\perp}'\alpha_{o,\perp} = I_{m-r_o}$ and $\beta_{o,\perp}'\beta_{o,\perp} = I_{m-r_o}$.¹

Suppose $\alpha_o \neq 0$ and define $Q = [\beta_o, \alpha_{o,\perp}]'$. In view of the well known relation (e.g., Johansen, 1995)
$$\alpha_o(\beta_o'\alpha_o)^{-1}\beta_o' + \beta_{o,\perp}(\alpha_{o,\perp}'\beta_{o,\perp})^{-1}\alpha_{o,\perp}' = I_m, \qquad (2.2)$$
it follows that $Q^{-1} = \left[\alpha_o(\beta_o'\alpha_o)^{-1},\; \beta_{o,\perp}(\alpha_{o,\perp}'\beta_{o,\perp})^{-1}\right]$,
$$Q\Pi_o = \begin{bmatrix} \beta_o'\alpha_o\beta_o' \\ 0 \end{bmatrix} \quad\text{and}\quad Q\Pi_oQ^{-1} = \begin{bmatrix} \beta_o'\alpha_o & 0 \\ 0 & 0 \end{bmatrix}. \qquad (2.3)$$
Under Assumption RR in Section 3, $\beta_o'\alpha_o$ is an invertible matrix and hence the matrix $\beta_o'\alpha_o\beta_o'$ has full rank. Cointegrating rank is the number $r_o$ of non-zero eigenvalues of $\Pi_o$ or, equivalently, the number of non-zero rows of $Q\Pi_o$. When $\alpha_o = 0$, the result holds trivially with $r_o = 0$ and $\alpha_{o,\perp} = I_m$. The matrices $\alpha_{o,\perp}$ and $\beta_{o,\perp}$ are composed of normalized left and right eigenvectors, respectively, corresponding to the zero eigenvalues in $\Pi_o$.

¹Note that when $m - r_o > 1$, the normalizations $\alpha_{o,\perp}'\alpha_{o,\perp} = I_{m-r_o}$ and $\beta_{o,\perp}'\beta_{o,\perp} = I_{m-r_o}$ are not sufficient to ensure the uniqueness of $\alpha_{o,\perp}$ and $\beta_{o,\perp}$. In the paper, we only need the existence of normalized $\alpha_{o,\perp}$ and $\beta_{o,\perp}$ such that $\alpha_{o,\perp}'\alpha_o = 0$ and $\beta_{o,\perp}'\beta_o = 0$.
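As a quick check (a derivation added here for clarity, using only the orthogonality relations $\beta_o'\beta_{o,\perp} = 0$ and $\alpha_{o,\perp}'\alpha_o = 0$), the matrix displayed above is indeed the inverse of $Q$:
$$QQ^{-1} = \begin{bmatrix} \beta_o' \\ \alpha_{o,\perp}' \end{bmatrix}\left[\alpha_o(\beta_o'\alpha_o)^{-1},\; \beta_{o,\perp}(\alpha_{o,\perp}'\beta_{o,\perp})^{-1}\right] = \begin{bmatrix} I_{r_o} & 0 \\ 0 & I_{m-r_o} \end{bmatrix} = I_m,$$
while $Q^{-1}Q = I_m$ is exactly the relation (2.2).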
Conventional methods of estimation of (2.1) include reduced rank regression or maximum likelihood based on the assumption of Gaussian $u_t$ and a Gaussian likelihood. This approach relies on known $r_o$ and known transient dynamics structure, so implementation requires preliminary order parameter estimation. The system can also be estimated by unrestricted fully modified vector
autoregression (Phillips, 1995), which leads to consistent estimation of the unit roots in (2.1), the
cointegrating vectors and the transient dynamics. This method does not require knowledge of ro
but does require knowledge of the transient dynamics structure. In addition, a semiparametric ap-
proach can be adopted in which ro is estimated semiparametrically by order selection as in Cheng
and Phillips (2010, 2012) followed by fully modified least squares regression to estimate the cointegrating matrix. That approach achieves asymptotically efficient estimation of the long run relations
(under Gaussianity) but does not estimate the transient relations.
The present paper explores direct estimation of the parameters of (2.1) by Lasso-type regression.
The resulting estimator is a shrinkage estimator that takes account of potential degeneracies in the
system involving both long run reduced rank structures and transient dynamics. Specifically, the least squares (LS) shrinkage estimator of $(\Pi_o, B_o)$, where $B_o = (B_{o,1},\ldots,B_{o,p})$, is defined as
$$(\widehat\Pi_n, \widehat B_n) = \arg\min_{\Pi, B_1,\ldots,B_p\in\mathbb{R}^{m\times m}} \sum_{t=1}^{n}\Big\|\Delta Y_t - \Pi Y_{t-1} - \sum_{j=1}^{p} B_j\,\Delta Y_{t-j}\Big\|^2 + n\sum_{j=1}^{p}\lambda_{b,j,n}\|B_j\| + n\sum_{k=1}^{m}\lambda_{r,k,n}\|\Phi_{n,k}(\Pi)\|, \qquad (2.4)$$
where $\lambda_{b,j,n}$ and $\lambda_{r,k,n}$ ($j = 1,\ldots,p$ and $k = 1,\ldots,m$) are tuning parameters that directly control the penalization, $\Phi_{n,k}(\Pi)$ is the $k$-th row vector of $Q_n\Pi$, and $Q_n$ denotes the normalized left eigenvector matrix of $\widehat\Pi_{1st}$. The matrix $\widehat\Pi_{1st}$ is some first step (e.g., OLS) estimate of $\Pi_o$. The penalty function on the coefficients $B_j$ ($j = 1,\ldots,p$) of the lagged differences is called the group Lasso penalty (see Yuan and Lin, 2006). On the other hand, the penalty function on $\Pi$ is different from the group Lasso, because it works on the rows of the adaptively transformed matrix $Q_n\Pi$, not on the rows (or any deterministic function, e.g. the eigenvalues) of $\Pi$ directly.²
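To fix ideas, the following minimal sketch (in Python; the array names, shapes and the use of NumPy are illustrative assumptions, not part of the paper) shows how a normalized left eigenvector matrix $Q_n$ can be obtained from a first step estimate and how the criterion in (2.4) would be evaluated; in practice the minimization itself would be handed to a group-Lasso solver.

```python
import numpy as np

def left_eigenvector_matrix(Pi_1st):
    """Rows of Q_n are left eigenvectors of the first-step estimate Pi_1st,
    ordered by decreasing eigenvalue modulus (cf. footnote 6 below)."""
    eigvals, right_vecs = np.linalg.eig(Pi_1st)
    order = np.argsort(-np.abs(eigvals))
    Qn = np.linalg.inv(right_vecs[:, order])   # Qn @ Pi_1st = diag(ordered eigvals) @ Qn
    return Qn, eigvals[order]

def shrinkage_criterion(Pi, B, dY, Y_lag, dY_lags, Qn, lam_r, lam_b):
    """Penalized LS criterion in (2.4): squared-error loss plus a group-Lasso
    penalty on each lag matrix B_j and an adaptive penalty on the rows of Qn @ Pi."""
    n = dY.shape[0]
    resid = dY - Y_lag @ Pi.T - sum(dY_lags[j] @ B[j].T for j in range(len(B)))
    loss = np.sum(resid ** 2)
    lag_pen = n * sum(lam_b[j] * np.linalg.norm(B[j]) for j in range(len(B)))
    rows = Qn @ Pi
    rank_pen = n * sum(lam_r[k] * np.linalg.norm(rows[k]) for k in range(rows.shape[0]))
    return loss + lag_pen + rank_pen
```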
Given the tuning parameters, this procedure delivers a one step estimator of the model (2.1) with an implied estimate of the cointegrating rank (based on the number of non-zero rows of $Q_n\widehat\Pi_n$) and an implied estimate of the transient dynamic structure (that is, the $B_{o,j}$ in $B_o$ with $\|B_{o,j}\| = 0$ for $j = 1,\ldots,p$) based on the fitted value $\widehat B_n$. It is therefore well suited to empirical implementation where information is limited concerning model specification. By definition, the penalized LS estimate is invariant to permutation of the lagged differences, which implies that the rank and lagged differences selected in the penalized LS estimation are stable regardless of the potential structure of the true model. This feature is a particular advantage of Lasso-type model selection methods over traditional sequential testing procedures, which typically work from general to specific formulations.

²The transformation of the matrix $\Pi$ is important for rank selection because, by the consistency of the first step estimator $\widehat\Pi_{1st}$, $Q_n\Pi_o$ has, and only has, $m - r_o$ rows that are asymptotically zero. Note that the fact that a matrix $\Pi$ does not have full rank does not necessarily mean that any element of $\Pi$ is zero. Hence the penalized LS regression in (2.4) with a group Lasso penalty on the rows of $\Pi$ itself does not deliver rank selection in the general case.
A novel contribution of this paper is that it provides an adaptive penalty function $f(\Pi) = \sum_{k=1}^{m}\lambda_{r,k,n}\|\Phi_{n,k}(\Pi)\|$, which enables the penalized LS estimation in (2.4) to perform rank selection.³ Importantly, this penalty function differs from those proposed in the statistics literature for dimension reduction in multivariate regression with iid data. For example, Peng et al. (2009) assume that the coefficient matrix has many zero components and suggest dimension reduction by penalizing the estimates of the components in the coefficient matrix with L-1 and L-2 penalty functions. Yuan et al. (2007) propose to penalize the singular values of the estimate of the coefficient matrix with an L-1 penalty to achieve dimension reduction. While this approach is intuitive and the idea of working through the eigenvalues of $\Pi$ was used independently in our own earlier work, Yuan et al. (2007) provide theory only under an orthonormal regressor design, which is unrealistic in VECM structures with nonstationary data.⁴
Let $\Phi(\Pi_o) = [\Phi_1(\Pi_o)',\ldots,\Phi_m(\Pi_o)']'$ denote the matrix of row vectors of $Q\Pi_o$. When $\{u_t\}_{t\ge1}$ is iid or a martingale difference sequence, the LS estimators $(\widehat\Pi_{1st}, \widehat B_{1st})$ of $(\Pi_o, B_o)$ are well known to be consistent. The eigenvalues and corresponding eigenspaces of $\Pi_o$ can also be consistently estimated. Thus it seems intuitively clear that some form of adaptive penalization can be devised to consistently distinguish the zero and non-zero components in $B_o$ and $\Phi(\Pi_o)$.⁵ We show that the shrinkage LS estimator defined in (2.4) enjoys these oracle-like properties, in the sense that the zero components in $B_o$ and $\Phi(\Pi_o)$ are estimated as zeros with probability approaching 1 (w.p.a.1). Thus, $\Pi_o$ and the non-zero elements in $B_o$ are estimated as if the form of the true model were known and inference can be conducted as if we knew the true cointegration rank $r_o$.
If the transient behavior of (2.1) is misspecified and (for some given lag order $p$) the error process $\{u_t\}_{t\ge1}$ is weakly dependent and $r_o > 0$, then consistent estimators of the full matrix $(\Pi_o, B_o)$ are typically unavailable without further assumptions. However, the $m - r_o$ zero eigenvalues of $\Pi_o$ can still be consistently estimated with an order $n$ convergence rate, while the remaining eigenvalues of $\Pi_o$ are estimated with asymptotic bias at a $\sqrt{n}$ convergence rate. The different convergence rates of the eigenvalues are important, because when the non-zero eigenvalues of $\Pi_o$ are occasionally (asymptotically) estimated as zeros, the different convergence rates are useful in consistently distinguishing the zero eigenvalues from the biasedly estimated non-zero eigenvalues of $\Pi_o$. Specifically, we show that if the estimator of some non-zero eigenvalue of $\Pi_o$ has probability limit zero under misspecification of the lag order, then this estimator will converge in probability to zero at the rate $\sqrt{n}$, while estimates of the zero eigenvalues of $\Pi_o$ all have convergence rate $n$. Hence the tuning parameters $\{\lambda_{r,k,n}\}_{k=1}^m$ can be constructed in such a way that the adaptive penalties associated with estimates of zero eigenvalues of $\Pi_o$ diverge to infinity at a rate faster than those of estimates of the non-zero eigenvalues of $\Pi_o$, even though the latter also converge to zero in probability. As we have prior knowledge about these different divergence rates in a potentially cointegrated system, we can impose explicit conditions on the convergence rate of the tuning parameters $\{\lambda_{r,k,n}\}_{k=1}^m$ to ensure that only $m - r_o$ rows of $Q_n\widehat\Pi_n$ are adaptively shrunk to zero w.p.a.1.

³The new penalty is defined as a function on $\mathbb{R}^{m\times m}$, i.e. on the square matrix $\Pi$. While this formulation is relevant in the present setting, it is clear that the approach can be trivially extended to the general case of any matrix.
⁴As indicated, the idea in Yuan et al. (2007) is related to the original approach pursued in an earlier version (2010) of the present paper. In that version, we showed that when adding the L-1 penalty on the eigenvalues to the LS criterion, the $m - r_o$ smallest eigenvalues of the penalized LS estimate of the cointegration matrix $\Pi_o$ have convergence rate faster than $n^{-1}$. This result has implications for efficient estimation of the VECM when the true model is nested. But it does not necessarily imply model selection because selection requires that zero eigenvalues be estimated as zeros with positive probability. That is a challenging problem due to the highly nonlinear relation between $\Pi_o$ and its eigenvalues. The approach pursued in the present paper is far simpler, enhancing implementation and leading directly to the required asymptotic result.
⁵Adaptive penalization means that the penalization on the estimators of zero components (e.g., zero matrices $B_{o,j}$) is large, while the penalization on the estimators of non-zero components (e.g., non-zero matrices $B_{o,j}$) is small.
For the empirical implementation of our approach, we provide data-driven procedures for selecting the tuning parameter of the penalty function in finite samples. For practical purposes our method is executed in the following steps, which are explained and demonstrated in detail as the paper progresses.

(1) After preliminary LS estimation of the system, perform a first step GLS shrinkage estimation with tuning parameters $\lambda_{r,k,n}$ and $\lambda_{b,j,n}$ for $k = 1,\ldots,m$ and $j = 1,\ldots,p$ constructed as described in Section 6, where $\|\lambda_k(\Pi)\|$ denotes the $k$-th largest modulus among the eigenvalues $\{\lambda_k(\Pi)\}_{k=1}^{m}$ of the matrix $\Pi$⁶ and $\widehat B_{j,1st}$ is some first step (OLS) estimate of $B_{o,j}$ ($j = 1,\ldots,p$).

(2) Construct adaptive tuning parameters using the first step GLS shrinkage estimates and the formulas in (6.10) and (6.11). Using the adaptive tuning parameters, obtain the GLS shrinkage estimator $(\widehat\Pi_{g,n}, \widehat B_{g,n})$ of $(\Pi_o, B_o)$; see (5.12). The cointegration rank selected by the shrinkage method is implied by the rank of the shrinkage estimator $\widehat\Pi_{g,n}$ and the lagged differences selected by the shrinkage method are implied by the non-zero matrices in $\widehat B_{g,n}$.

(3) The GLS shrinkage estimator contains shrinkage bias introduced by the penalty on the non-zero eigenvalues of $\widehat\Pi_{g,n}$ and the non-zero matrices in $\widehat B_{g,n}$. To remove this bias, run a reduced rank regression based on the cointegration rank and the model selected in the GLS shrinkage estimation in step (2).

⁶Throughout this paper, for any $m\times m$ matrix $\Pi$, we order the eigenvalues of $\Pi$ in decreasing order of modulus, i.e. $\|\lambda_1(\Pi)\| \ge \|\lambda_2(\Pi)\| \ge \cdots \ge \|\lambda_m(\Pi)\|$. When there is a pair of complex conjugate eigenvalues, we order the one with positive imaginary part before the other.
3 First Order VECM

This section considers the following simplified first order version of (2.1),
$$\Delta Y_t = \Pi_o Y_{t-1} + u_t = \alpha_o\beta_o' Y_{t-1} + u_t. \qquad (3.1)$$
The model contains no deterministic trend and no lagged differences. Our focus in this simplified system is to outline the approach to cointegrating rank selection and develop key elements in the limit theory, showing consistency in rank selection and reduced rank coefficient matrix estimation. The theory is extended in subsequent sections to models of the form (2.1).

We start with the following condition on the innovation $u_t$.

Assumption 3.1 (WN) $\{u_t\}_{t\ge1}$ is an $m$-dimensional iid process with zero mean and nonsingular covariance matrix $\Sigma_u$.
Assumption 3.1 ensures that the full parameter matrix $\Pi_o$ is consistently estimable in this simplified system. The iid condition could, of course, be weakened to martingale differences with no material changes in what follows. Under Assumption 3.1, partial sums of $u_t$ satisfy the functional law
$$n^{-1/2}\sum_{t=1}^{\lfloor n\cdot\rfloor} u_t \to_d B_u(\cdot), \qquad (3.2)$$
where $B_u(\cdot)$ is vector Brownian motion with covariance matrix $\Sigma_u$.
Assumption 3.2 (RR) (i) The determinantal equation $|I_m - (I_m + \Pi_o)\lambda| = 0$ has roots on or outside the unit circle; (ii) the matrix $\Pi_o$ has rank $r_o$, with $0 \le r_o \le m$; (iii) if $r_o > 0$, then the matrix $R = I_{r_o} + \beta_o'\alpha_o$ has eigenvalues within the unit circle.

Under Assumptions WN and RR, the time series $Y_t$ has the partial sum representation
$$Y_t = C\sum_{s=1}^{t} u_s + \alpha_o(\beta_o'\alpha_o)^{-1}R(L)\beta_o' u_t + CY_0, \qquad (3.3)$$
where $C = \beta_{o,\perp}(\alpha_{o,\perp}'\beta_{o,\perp})^{-1}\alpha_{o,\perp}'$. Using the matrix $Q$, (3.1) transforms as
$$\Delta Z_t = \Lambda_o Z_{t-1} + w_t, \qquad (3.4)$$
where
$$Z_t = \begin{pmatrix} \beta_o' Y_t \\ \alpha_{o,\perp}' Y_t \end{pmatrix} \equiv \begin{pmatrix} Z_{1,t} \\ Z_{2,t} \end{pmatrix}, \qquad w_t = \begin{pmatrix} \beta_o' u_t \\ \alpha_{o,\perp}' u_t \end{pmatrix} \equiv \begin{pmatrix} w_{1,t} \\ w_{2,t} \end{pmatrix},$$
and $\Lambda_o = Q\Pi_o Q^{-1}$. Under Assumption 3.2 and (3.2), we have the functional law
$$n^{-1/2}\sum_{t=1}^{\lfloor n\cdot\rfloor} w_t \to_d B_w(\cdot) = QB_u(\cdot) = \begin{bmatrix} \beta_o' B_u(\cdot) \\ \alpha_{o,\perp}' B_u(\cdot) \end{bmatrix} \equiv \begin{bmatrix} B_{w_1}(\cdot) \\ B_{w_2}(\cdot) \end{bmatrix}.$$
In this simplified system the LS shrinkage estimator of $\Pi_o$ is
$$\widehat\Pi_n = \arg\min_{\Pi\in\mathbb{R}^{m\times m}} \sum_{t=1}^{n}\|\Delta Y_t - \Pi Y_{t-1}\|^2 + n\sum_{k=1}^{m}\lambda_{r,k,n}\|\Phi_{n,k}(\Pi)\|. \qquad (3.5)$$
Theorem 3.1 (Consistency) Let $\delta_{r,n} = \max_{k\in S_\Phi}\lambda_{r,k,n}$, where $S_\Phi = \{1,\ldots,r_o\}$ indexes the non-zero rows of $Q\Pi_o$; then under Assumptions WN, RR and $\delta_{r,n} = o_p(1)$, the LS shrinkage estimator $\widehat\Pi_n$ is consistent, i.e. $\widehat\Pi_n \to_p \Pi_o$.
When consistent shrinkage estimators are considered, Theorem 3.1 extends Theorem 1 of Caner and Knight (2009), who used shrinkage techniques to perform a unit root test. As the eigenvalues $\lambda_k(\Pi)$ of the matrix $\Pi$ are continuous functions of $\Pi$, we deduce from the consistency of $\widehat\Pi_n$ and continuous mapping that $\lambda_k(\widehat\Pi_n) \to_p \lambda_k(\Pi_o)$ for all $k = 1,\ldots,m$. Theorem 3.1 implies that the non-zero eigenvalues of $\Pi_o$ are estimated as non-zeros, which means that the rank of $\Pi_o$ will not be under-selected. However, consistency of the estimates of the non-zero eigenvalues is not necessary for consistent cointegration rank selection. In that case what is essential is that the probability limits of the estimates of those (non-zero) eigenvalues are not zero, or at least that their convergence rates are slower than those of the estimates of the zero eigenvalues. This point is pursued in the following section, where it is demonstrated that consistent estimation of the cointegrating rank continues to hold for weakly dependent innovations $\{u_t\}_{t\ge1}$ even though full consistency of $\widehat\Pi_n$ does not generally apply in that case.
Theorem 3.2 (Rate of Convergence) Define $D_n = \mathrm{diag}(n^{1/2}I_{r_o},\, nI_{m-r_o})$; then under the conditions of Theorem 3.1, the LS shrinkage estimator $\widehat\Pi_n$ satisfies the following:
(a) if $r_o = 0$, then $\widehat\Pi_n - \Pi_o = O_p(n^{-1} + n^{-1}\delta_{r,n})$;
(b) if $0 < r_o \le m$, then $(\widehat\Pi_n - \Pi_o)Q^{-1}D_n = O_p(1 + n^{1/2}\delta_{r,n})$.
The term $\delta_{r,n}$ represents the shrinkage bias that the penalty function introduces into the LS shrinkage estimator. If the convergence rate of $\lambda_{r,k,n}$ ($k \in S_\Phi$) is fast enough that $n^{1/2}\delta_{r,n} = O_p(1)$, then Theorem 3.2 implies that $\widehat\Pi_n - \Pi_o = O_p(n^{-1})$ when $r_o = 0$ and $(\widehat\Pi_n - \Pi_o)Q^{-1}D_n = O_p(1)$ otherwise. Hence, under Assumptions WN, RR and $n^{1/2}\delta_{r,n} = O_p(1)$, the LS shrinkage estimator $\widehat\Pi_n$ has the same convergence rate as the LS estimator $\widehat\Pi_{1st}$ (see Lemma 10.2 in the appendix). However, we next show that if the tuning parameter $\lambda_{r,k,n}$ ($k \in S_\Phi^c$) does not converge to zero too fast, then the correct rank restriction $r = r_o$ is automatically imposed on the LS shrinkage estimator $\widehat\Pi_n$ w.p.a.1.

Let $S_{n,\Phi}$ denote the index set of the non-zero rows of $Q_n\widehat\Pi_n$ and let its complement $S_{n,\Phi}^c$ be the index set of the zero rows of $Q_n\widehat\Pi_n$. We subdivide the matrix $Q_n$ as $Q_n' = [Q_{\Phi,n}', Q_{\Phi_\perp,n}']$, where $Q_{\Phi,n}$ and $Q_{\Phi_\perp,n}$ are the first $r_o$ rows and the last $m - r_o$ rows of $Q_n$ respectively. Under Lemma 10.2 and Theorem 3.1,
$$Q_{\Phi,n}\widehat\Pi_n = Q_{\Phi,n}\widehat\Pi_{1st} + o_p(1) = \Lambda_{\Phi,n}Q_{\Phi,n} + o_p(1) \qquad (3.6)$$
and similarly
$$Q_{\Phi_\perp,n}\widehat\Pi_n = Q_{\Phi_\perp,n}\widehat\Pi_{1st} + o_p(1) = \Lambda_{\Phi_\perp,n}Q_{\Phi_\perp,n} + o_p(1) = o_p(1), \qquad (3.7)$$
where $\Lambda_{\Phi,n} = \mathrm{diag}[\lambda_1(\widehat\Pi_{1st}),\ldots,\lambda_{r_o}(\widehat\Pi_{1st})]$ and $\Lambda_{\Phi_\perp,n} = \mathrm{diag}[\lambda_{r_o+1}(\widehat\Pi_{1st}),\ldots,\lambda_m(\widehat\Pi_{1st})]$. The result in (3.6) implies that the first $r_o$ rows of $Q_n\widehat\Pi_n$ are non-zero w.p.a.1, while the result in (3.7) means that the last $m - r_o$ rows of $Q_n\widehat\Pi_n$ are arbitrarily close to zero w.p.a.1. Under (3.6) we deduce that $S_\Phi \subseteq S_{n,\Phi}$. However, (3.7) is insufficient for showing that $S_\Phi^c \subseteq S_{n,\Phi}^c$, because for that what we need to show is $Q_{\Phi_\perp,n}\widehat\Pi_n = 0$ w.p.a.1.
Theorem 3.3 (Super Efficiency) Suppose that Assumptions WN and RR are satisfied. If $n^{1/2}\delta_{r,n} = O_p(1)$ and $\lambda_{r,k,n} \to_p \infty$ for $k \in S_\Phi^c$, then
$$\Pr\left(Q_{\Phi_\perp,n}\widehat\Pi_n = 0\right) \to 1 \quad\text{as } n \to \infty. \qquad (3.8)$$

Theorem 3.3 requires that the tuning parameters related to the zero and non-zero components have different asymptotic behaviors. As we do not have any prior information about the zero and non-zero components, it is clear that some sort of adaptive penalization should appear in the tuning parameters $\{\lambda_{r,k,n}\}_{k=1}^m$. Such an adaptive penalty is constructed in (6.1) of Section 6, and sufficient conditions for $n^{1/2}\delta_{r,n} = O_p(1)$ and $\lambda_{r,k,n} \to_p \infty$ for $k \in S_\Phi^c$ are provided in Lemma 6.1.
Combining Theorem 3.1 and Theorem 3.3, we deduce that
$$\Pr\left(S_{n,\Phi} = S_\Phi\right) \to 1, \qquad (3.9)$$
which implies consistent cointegration rank selection, giving the following result.

Corollary 3.4 (Consistent Rank Selection) Under the conditions of Theorem 3.3,
$$\Pr\left(r(\widehat\Pi_n) = r_o\right) \to 1 \quad\text{as } n \to \infty. \qquad (3.10)$$

From Corollary 3.4, we can deduce that the rank constraint $r(\Pi) = r_o$ is imposed on the LS shrinkage estimator $\widehat\Pi_n$ w.p.a.1. As $\widehat\Pi_n$ satisfies the rank constraint w.p.a.1, we expect it to have better properties than the OLS estimator $\widehat\Pi_{1st}$, which treats the true rank as unknown. This conjecture is confirmed in the following theorem.
Theorem 3.5 (Limiting Distribution) Suppose that the conditions of Theorem 3.3 and $n^{1/2}\delta_{r,n} = o_p(1)$ are satisfied. Then
$$(\widehat\Pi_n - \Pi_o)Q^{-1}D_n \to_d \left[B_{m,1},\; \alpha_o(\beta_o'\alpha_o)^{-1}\beta_o' B_{m,2}\right], \qquad (3.11)$$
where
$$B_{m,1} \equiv N\left(0,\, \Sigma_u\otimes\Sigma_{z_1z_1}^{-1}\right) \quad\text{and}\quad B_{m,2} \equiv \int dB_u B_{w_2}'\left(\int B_{w_2}B_{w_2}'\right)^{-1}.$$

Compared with the OLS estimator, we see that in the LS shrinkage estimation the lower right $(m - r_o)\times(m - r_o)$ submatrix of $Q\Pi_oQ^{-1}$ is estimated at a rate faster than $n$. The improved property of the LS shrinkage estimator $\widehat\Pi_n$ arises from the fact that the correct rank restriction $r(\widehat\Pi_n) = r_o$ is satisfied w.p.a.1, leading to the lower right zero block in the limit distribution (3.11) after normalization.
Compared with the oracle reduced rank regression (RRR) estimator (i.e. the RRR estimator informed by knowledge of the true rank; see e.g. Johansen, 1995, Phillips, 1998, and Anderson, 2002), the LS shrinkage estimator suffers from second order bias in the limit distribution (3.11), which is evident in the endogeneity bias of the factor $\int dB_u B_{w_2}'$ in the limit matrix $B_{m,2}$. Accordingly, to remove the endogeneity bias we introduce the generalized least squares (GLS) shrinkage estimator $\widehat\Pi_{g,n}$ which solves the weighted extremum problem
$$\widehat\Pi_{g,n} = \arg\min_{\Pi\in\mathbb{R}^{m\times m}} \sum_{t=1}^{n}\|\Delta Y_t - \Pi Y_{t-1}\|^2_{\widehat\Sigma_{u,n}^{-1}} + n\sum_{k=1}^{m}\lambda_{r,k,n}\|\Phi_{n,k}(\Pi)\|, \qquad (3.14)$$
where $\widehat\Sigma_{u,n}$ is some consistent estimator of $\Sigma_u$. GLS methods enable efficient estimation in cointegrating systems with known rank (Phillips, 1991a, 1991b). Here they are used to achieve efficient estimation with unknown rank. In fact, the asymptotic distribution of $\widehat\Pi_{g,n}$ is the same as that of the oracle RRR estimator.
Corollary 3.6 (Oracle Properties) Suppose Assumptions 3.1 and 3.2 hold. If $\widehat\Sigma_{u,n} \to_p \Sigma_u$ and the tuning parameters satisfy $n^{1/2}\delta_{r,n} = o_p(1)$ and $\lambda_{r,k,n} \to_p \infty$ for $k \in S_\Phi^c$, then
$$\Pr\left(r(\widehat\Pi_{g,n}) = r_o\right) \to 1 \quad\text{as } n \to \infty \qquad (3.15)$$
and
$$(\widehat\Pi_{g,n} - \Pi_o)Q^{-1}D_n \to_d \left[B_{m,1},\; \alpha_o(\beta_o'\alpha_o)^{-1}\beta_o'\int dB_{u\cdot w_2} B_{w_2}'\Big(\int B_{w_2}B_{w_2}'\Big)^{-1}\right], \qquad (3.16)$$
where $B_{u\cdot w_2}(\cdot) \equiv B_u(\cdot) - \Sigma_{uw_2}\Sigma_{w_2w_2}^{-1}B_{w_2}(\cdot)$, which implies that the GLS shrinkage estimator $\widehat\Pi_{g,n}$ has the same limiting distribution as that of the oracle RRR estimator.
Remark 3.7 In the triangular representation of a cointegrated system studied in Phillips (1991a), we have $\alpha_o = [-I_{r_o}, 0_{r_o\times(m-r_o)}]'$, $\beta_o = [I_{r_o}, -O_o]'$ and $w_2 = u_2$. Moreover, we obtain
$$\Pi_o = \begin{pmatrix} -I_{r_o} & O_o \\ 0 & 0_{m-r_o} \end{pmatrix}, \quad Q = \begin{pmatrix} I_{r_o} & -O_o \\ 0 & I_{m-r_o} \end{pmatrix} \quad\text{and}\quad Q^{-1} = \begin{pmatrix} I_{r_o} & O_o \\ 0 & I_{m-r_o} \end{pmatrix}.$$
By the consistent rank selection, the GLS shrinkage estimator can be decomposed as $\widehat\Pi_{g,n} = \widehat\alpha_{g,n}\widehat\beta_{g,n}'$ w.p.a.1, where $\widehat\alpha_{g,n} \equiv [\widehat A_{g,n}', \widehat B_{g,n}']'$ is formed from the first $r_o$ columns of $\widehat\Pi_{g,n}$ and $\widehat\beta_{g,n} = [I_{r_o}, -\widehat O_{g,n}]'$. From Corollary 3.6 we deduce that
$$\sqrt{n}\left(\widehat A_{g,n} + I_{r_o}\right) \to_d N\left(0,\, \Sigma_{u_1u_1}\otimes\Sigma_{z_1z_1}^{-1}\right) \qquad (3.18)$$
and
$$n\,\widehat A_{g,n}\left(\widehat O_{g,n} - O_o\right) \to_d -\int dB_{u_1\cdot 2}B_{u_2}'\left(\int B_{u_2}B_{u_2}'\right)^{-1}, \qquad (3.19)$$
where $B_{u_1}$ and $B_{u_2}$ denote the first $r_o$ and last $m-r_o$ components of $B_u$, and $B_{u_1\cdot 2} = B_{u_1} - \Sigma_{u,12}\Sigma_{u,22}^{-1}B_{u_2}$. Under (3.18), (3.19) and the CMT, we deduce that
$$n\left(\widehat O_{g,n} - O_o\right) \to_d \int dB_{u_1\cdot 2}B_{u_2}'\left(\int B_{u_2}B_{u_2}'\right)^{-1}. \qquad (3.20)$$
4 Extension I: Estimation with Weakly Dependent Innovations

In this section we study shrinkage reduced rank estimation in a scenario where the equation innovations $\{u_t\}_{t\ge1}$ are weakly dependent. Specifically, we assume that $\{u_t\}_{t\ge1}$ is generated by a linear process satisfying the following condition.

Assumption 4.1 (LP) Let $D(L) = \sum_{j=0}^{\infty} D_jL^j$, where $D_0 = I_m$ and $D(1)$ has full rank. Let $u_t$ have the Wold representation
$$u_t = D(L)\varepsilon_t = \sum_{j=0}^{\infty} D_j\varepsilon_{t-j}, \quad\text{with}\quad \sum_{j=0}^{\infty} j^{1/2}\|D_j\| < \infty, \qquad (4.1)$$
where $\varepsilon_t$ is iid $(0, \Sigma_{\varepsilon\varepsilon})$ with $\Sigma_{\varepsilon\varepsilon}$ positive definite and finite fourth moments.
Denote the long-run variance of $\{u_t\}_{t\ge1}$ as $\Omega_u = \sum_{h=-\infty}^{\infty}\Sigma_{uu}(h)$. From the Wold representation in (4.1), we have $\Omega_u = D(1)\Sigma_{\varepsilon\varepsilon}D(1)'$, which is positive definite because $D(1)$ has full rank and $\Sigma_{\varepsilon\varepsilon}$ is positive definite. The fourth moment assumption is needed for the limit distribution of sample autocovariances in the case of misspecified transient dynamics.

As expected, under general weak dependence assumptions on $u_t$, the simple reduced rank regression models (2.1) and (3.1) are susceptible to the effects of potential misspecification in the transient dynamics. These effects bear on the stationary components in the system. In particular, due to the centering term $\Lambda_{uz_1}(1)$ in (10.62), both the OLS estimator $\widehat\Pi_{1st}$ and the shrinkage estimator $\widehat\Pi_n$ are asymptotically biased. Specifically, we show that $\widehat\Pi_{1st}$ has the following probability limit (see Lemma 10.4 in the appendix),
$$\widehat\Pi_{1st} \to_p \Pi_1 \equiv Q^{-1}H_oQ + \Pi_o, \qquad (4.2)$$
where $H_o = Q\left[\Lambda_{uz_1}(1)\Sigma_{z_1z_1}^{-1},\; 0_{m\times(m-r_o)}\right]$. Note that
$$\Pi_1 = Q^{-1}H_oQ + \Pi_o = \Pi_o + \Lambda_{uz_1}(1)\Sigma_{z_1z_1}^{-1}\beta_o' \equiv \widetilde\alpha_o\beta_o', \qquad (4.3)$$
which implies that the asymptotic bias of the OLS estimator $\widehat\Pi_{1st}$ is introduced via the bias in the pseudo true value $\widetilde\alpha_o$. Observe also that $\Pi_1 = \widetilde\alpha_o\beta_o'$ has rank at most equal to $r_o$, the number of rows in $\beta_o'$.
Indeed, $\Pi_1$ has rank $r_1 < r_o$ whenever $\widetilde\alpha_o$ is not a full column rank matrix. By the definition of $\Pi_1$, we know that $\beta_{o,\perp}$ contains right eigenvectors of the zero eigenvalues of $\Pi_1$, so the row space of $\Pi_1$ lies in a subspace of the space spanned by $\beta_o$. Let $Q_1$ denote the ordered⁷ left eigenvector matrix of $\Pi_1$ and define $\Phi_{1,k}(\Pi) = Q_1(k)\Pi$, where $Q_1(k)$ denotes the $k$-th row of $Q_1$.

Corollary 4.1 Let $\widetilde\delta_{r,n} = \max_{k\in\widetilde S_\Phi}\lambda_{r,k,n}$, where $\widetilde S_\Phi = \{1,\ldots,r_1\}$ indexes the non-zero rows of $Q_1\Pi_1$; then under Assumptions RR, LP and $\widetilde\delta_{r,n} = o_p(1)$, the LS shrinkage estimator $\widehat\Pi_n$ is consistent, i.e. $\widehat\Pi_n \to_p \Pi_1$.

Corollary 4.1 implies that the shrinkage estimator $\widehat\Pi_n$ has the same probability limit as that of the OLS estimator $\widehat\Pi_{1st}$. As the pseudo limit $\Pi_1$ may have more zero eigenvalues, compared with Theorem 3.1, Corollary 4.1 imposes a weaker condition on the tuning parameters $\{\lambda_{r,k,n}\}_{k=1}^m$.

⁷The eigenvectors in $Q_1$ are ordered according to the magnitudes of the eigenvalues, i.e. the ordering of the eigenvalues of $\Pi_1$.
The next corollary provides the convergence rate of the LS shrinkage estimator to the pseudo true parameter matrix $\Pi_1$.

Corollary 4.2 Under Assumptions RR, LP and $\widetilde\delta_{r,n} = o_p(1)$, the LS shrinkage estimator $\widehat\Pi_n$ satisfies
(a) if $r_o = 0$, then $\widehat\Pi_n - \Pi_1 = O_p(n^{-1} + n^{-1}\widetilde\delta_{r,n})$;
(b) if $0 < r_o \le m$, then $(\widehat\Pi_n - \Pi_1)Q^{-1}D_n = O_p(1 + n^{1/2}\widetilde\delta_{r,n})$.

Recall that $Q_n$ is the normalized left eigenvector matrix of $\widehat\Pi_{1st}$. Decompose $Q_n'$ as $[Q_{\widetilde\Phi,n}', Q_{\widetilde\Phi_\perp,n}']$, where $Q_{\widetilde\Phi,n}$ and $Q_{\widetilde\Phi_\perp,n}$ are the first $r_1$ and last $m - r_1$ rows of $Q_n$ respectively. Under Corollary 4.1 and Lemma 10.4.(a),
$$Q_{\widetilde\Phi,n}\widehat\Pi_n = Q_{\widetilde\Phi,n}\widehat\Pi_{1st} + o_p(1) = \Lambda_{\widetilde\Phi,n}Q_{\widetilde\Phi,n} + o_p(1), \qquad (4.4)$$
where $\Lambda_{\widetilde\Phi,n}$ is a diagonal matrix with the ordered first (largest) $r_1$ eigenvalues of $\widehat\Pi_{1st}$. (4.4) and Lemma 10.4.(b) imply that the first $r_1$ rows of $Q_n\widehat\Pi_n$ are estimated as non-zero w.p.a.1. On the other hand, by Corollary 4.1 and Lemma 10.4.(a),
$$Q_{\widetilde\Phi_\perp,n}\widehat\Pi_n = Q_{\widetilde\Phi_\perp,n}\widehat\Pi_{1st} + o_p(1) = \Lambda_{\widetilde\Phi_\perp,n}Q_{\widetilde\Phi_\perp,n} + o_p(1) = o_p(1), \qquad (4.5)$$
where $\Lambda_{\widetilde\Phi_\perp,n}$ is a diagonal matrix with the ordered last (smallest) $m - r_1$ eigenvalues of $\widehat\Pi_{1st}$. Under Lemma 10.4.(b) and (c), we know that $Q_{\widetilde\Phi_\perp,n}\widehat\Pi_n$ converges to zero in probability, while its first $r_o - r_1$ rows and its last $m - r_o$ rows have convergence rates $n^{1/2}$ and $n$ respectively. We next show that the last $m - r_o$ rows of $Q_n\widehat\Pi_n$ are estimated as zeros w.p.a.1.
Corollary 4.3 (Super Efficiency) Under Assumptions LP and RR, if $\lambda_{r,k,n} \to_p \infty$ for $k \in S_\Phi^c$ and $n^{1/2}\widetilde\delta_{r,n} = O_p(1)$, then we have
$$\Pr\left(Q_n(k)\widehat\Pi_n = 0\right) \to 1 \quad\text{as } n \to \infty, \qquad (4.6)$$
for any $k \in S_\Phi^c$.

Corollary 4.3 implies that $\widehat\Pi_n$ has at least $m - r_o$ eigenvalues estimated as zero w.p.a.1. However, the matrix $\Pi_1$ may have more zero eigenvalues than $\Pi_o$. To ensure consistent cointegration rank selection, we need to show that the extra $r_o - r_1$ zero eigenvalues of $\Pi_1$ are estimated as non-zeros w.p.a.1. From Lemma 10.4, we see that $\widehat\Pi_{1st}$ has $m - r_o$ eigenvalues which converge to zero at the rate $n$ and $r_o - r_1$ eigenvalues which converge to zero at the rate $\sqrt{n}$. The different convergence rates of
the estimates of the zero eigenvalues of $\Pi_1$ enable us to empirically distinguish the estimates of the $m - r_o$ zero eigenvalues of $\Pi_1$ from the estimates of the $r_o - r_1$ zero eigenvalues of $\Pi_1$, as illustrated in the following corollary.

Corollary 4.4 Under Assumptions LP and RR, if $n^{1/2}\lambda_{r,k,n} = o_p(1)$ for $k \in \{r_1+1,\ldots,r_o\}$ and $n^{1/2}\widetilde\delta_{r,n} = O_p(1)$, then we have
$$\Pr\left(Q_n(k)\widehat\Pi_n \neq 0\right) \to 1 \quad\text{as } n \to \infty, \qquad (4.7)$$
for any $k \in \{r_1+1,\ldots,r_o\}$.

Combining Corollaries 4.2, 4.3 and 4.4 yields consistent cointegration rank selection, as stated in the following theorem.

Theorem 4.5 Under the conditions of Corollaries 4.3 and 4.4,
$$\Pr\left(r(\widehat\Pi_n) = r_o\right) \to 1 \quad\text{as } n \to \infty. \qquad (4.8)$$
Compared with Theorem 3.3, Theorem 4.5 imposes similar conditions on the tuning parameters $\{\lambda_{r,k,n}\}_{k=1}^m$. It is clear that when the pseudo limit $\Pi_1$ preserves the rank of $\Pi_o$, i.e. $r_o = r_1$, Corollary 4.4 is not needed, because Theorem 4.5 then follows from Corollary 4.2 and Corollary 4.3. In that case, Theorem 4.5 imposes the same conditions on the tuning parameters, i.e. $n^{1/2}\widetilde\delta_{r,n} = O_p(1)$ and $\lambda_{r,k,n} \to_p \infty$ for $k \in S_\Phi^c$, where $\widetilde\delta_{r,n} = \delta_{r,n}$. On the other hand, when $r_1 < r_o$, the conditions in Theorem 4.5 are stronger, because they require $n^{1/2}\lambda_{r,k,n} = o_p(1)$ for $k \in \{r_1+1,\ldots,r_o\}$. In Section 6, we construct empirically available tuning parameters which are shown to satisfy the conditions of Theorem 4.5 without knowing whether $r_1 = r_o$ or $r_1 < r_o$.

Theorem 4.5 states that the true cointegration rank $r_o$ can be consistently selected even though the matrix $\Pi_o$ is not consistently estimable. Moreover, when the probability limit $\Pi_1$ of the LS shrinkage estimator has rank less than $r_o$, Theorem 4.5 ensures that the rank $r_o$ is nonetheless selected in the LS shrinkage estimation. This result is new in the shrinkage based model selection literature, as Lasso-type techniques are usually advocated because of their ability to shrink small estimates (in magnitude) to zero. However, in Corollary 4.4, we show that the LS shrinkage estimation does not shrink the estimates of the extra $r_o - r_1$ zero eigenvalues of $\Pi_1$ to zero.
5 Extension II: Estimation with Explicit Transient Dynamics
In this section we consider shrinkage estimation of the general model
$$\Delta Y_t = \Pi_o Y_{t-1} + \sum_{j=1}^{p} B_{o,j}\,\Delta Y_{t-j} + u_t \qquad (5.1)$$
with simultaneous cointegrating rank selection and lag order selection. Recall that the unknown
parameters $(\Pi_o, B_o)$ are estimated by the penalized LS criterion
$$(\widehat\Pi_n, \widehat B_n) = \arg\min_{\Pi, B_1,\ldots,B_p\in\mathbb{R}^{m\times m}} \sum_{t=1}^{n}\Big\|\Delta Y_t - \Pi Y_{t-1} - \sum_{j=1}^{p} B_j\,\Delta Y_{t-j}\Big\|^2 + n\sum_{j=1}^{p}\lambda_{b,j,n}\|B_j\| + n\sum_{k=1}^{m}\lambda_{r,k,n}\|\Phi_{n,k}(\Pi)\|. \qquad (5.2)$$
For consistent lag order selection the model should be consistently estimable, and it is assumed that the given $p$ in (5.1) is such that the error term $u_t$ satisfies Assumption 3.1. Define
$$C(\lambda) = (1-\lambda)I_m - \Pi_o\lambda - \sum_{j=1}^{p} B_{o,j}(1-\lambda)\lambda^j.$$
The following assumption extends Assumption 3.2 to accommodate the general structure in (5.1).

Assumption 5.1 (GRR) (i) The determinantal equation $|C(\lambda)| = 0$ has roots on or outside the unit circle; (ii) the matrix $\Pi_o$ has rank $r_o$, with $0 \le r_o \le m$; (iii) the $(m-r_o)\times(m-r_o)$ matrix
$$\alpha_{o,\perp}'\Big(I_m - \sum_{j=1}^{p} B_{o,j}\Big)\beta_{o,\perp} \qquad (5.3)$$
is nonsingular.
Under Assumption 5.1, the time series $Y_t$ has the following partial sum representation,
$$Y_t = C_B\sum_{s=1}^{t} u_s + \Xi(L)u_t + C_B Y_0, \qquad (5.4)$$
where $C_B = \beta_{o,\perp}\Big[\alpha_{o,\perp}'\Big(I_m - \sum_{j=1}^{p}B_{o,j}\Big)\beta_{o,\perp}\Big]^{-1}\alpha_{o,\perp}'$ and $\Xi(L)u_t = \sum_{s=0}^{\infty}\Xi_s u_{t-s}$ is a stationary process. From the partial sum representation in (5.4), we deduce that $\beta_o'Y_t = \beta_o'\Xi(L)u_t$ and $\Delta Y_{t-j}$ ($j = 1,\ldots,p$) are stationary processes.
Define an $m(p+1)\times m(p+1)$ rotation matrix $Q_B$ and its inverse $Q_B^{-1}$ as
$$Q_B \equiv \begin{bmatrix} \beta_o' & 0 \\ 0 & I_{mp} \\ \alpha_{o,\perp}' & 0 \end{bmatrix} \quad\text{and}\quad Q_B^{-1} = \begin{bmatrix} \alpha_o(\beta_o'\alpha_o)^{-1} & 0 & \beta_{o,\perp}(\alpha_{o,\perp}'\beta_{o,\perp})^{-1} \\ 0 & I_{mp} & 0 \end{bmatrix}.$$
Denote $X_{t-1} = [\Delta Y_{t-1}',\ldots,\Delta Y_{t-p}']'$; then the model in (5.1) can be written as
$$\Delta Y_t = \begin{bmatrix} \Pi_o & B_o \end{bmatrix}\begin{bmatrix} Y_{t-1} \\ X_{t-1} \end{bmatrix} + u_t. \qquad (5.5)$$
Let
$$Z_{t-1} = Q_B\begin{bmatrix} Y_{t-1} \\ X_{t-1} \end{bmatrix} = \begin{bmatrix} Z_{3,t-1} \\ Z_{2,t-1} \end{bmatrix}, \qquad (5.6)$$
where $Z_{3,t-1} = [Y_{t-1}'\beta_o,\, X_{t-1}']'$ is a stationary process and $Z_{2,t-1} = \alpha_{o,\perp}'Y_{t-1}$ comprises the I(1) components. Denote the index set of the zero components in $B_o$ as $S_B^c$, such that $\|B_{o,j}\| = 0$ for all $j \in S_B^c$ and $\|B_{o,j}\| \neq 0$ otherwise. We next derive the asymptotic properties of the LS shrinkage estimator $(\widehat\Pi_n, \widehat B_n)$ defined in (5.2).
Lemma 5.1 Suppose that Assumptions WN and GRR are satisfied. If $\delta_{r,n} = o_p(1)$ and $\delta_{b,n} = o_p(1)$, where $\delta_{b,n} \equiv \max_{j\in S_B}\lambda_{b,j,n}$, then the LS shrinkage estimator $(\widehat\Pi_n, \widehat B_n)$ satisfies
$$\left[(\widehat\Pi_n, \widehat B_n) - (\Pi_o, B_o)\right]Q_B^{-1}D_{n,B} = O_p(1 + n^{1/2}\delta_{r,n} + n^{1/2}\delta_{b,n}), \qquad (5.7)$$
where $D_{n,B} = \mathrm{diag}(n^{1/2}I_{r_o+mp},\, nI_{m-r_o})$.

Lemma 5.1 implies that the LS shrinkage estimators $(\widehat\Pi_n, \widehat B_n)$ have the same convergence rates as the OLS estimators $(\widehat\Pi_{1st}, \widehat B_{1st})$ (see Lemma 10.6.a). We next show that if the tuning parameters $\lambda_{r,k,n}$ and $\lambda_{b,j,n}$ ($k \in S_\Phi^c$ and $j \in S_B^c$) converge to zero but not too fast, then the zero rows of $Q\Pi_o$ and the zero matrices in $B_o$ are estimated as zero w.p.a.1. Let the zero rows of $Q_n\widehat\Pi_n$ be indexed by $S_{n,\Phi}^c$ and the zero matrices in $\widehat B_n$ be indexed by $S_{n,B}^c$.

Theorem 5.1 Suppose that Assumptions WN and GRR are satisfied. If the tuning parameters satisfy $n^{1/2}(\delta_{r,n} + \delta_{b,n}) = O_p(1)$, $\lambda_{r,k,n} \to_p \infty$ and $n^{1/2}\lambda_{b,j,n} \to_p \infty$ for $k \in S_\Phi^c$ and $j \in S_B^c$, then we have
$$\Pr\left(Q_{\Phi_\perp,n}\widehat\Pi_n = 0\right) \to 1 \quad\text{as } n \to \infty, \qquad (5.8)$$
and for all $j \in S_B^c$
$$\Pr\left(\widehat B_{n,j} = 0_{m\times m}\right) \to 1 \quad\text{as } n \to \infty. \qquad (5.9)$$

Theorem 5.1 indicates that the zero rows of $Q\Pi_o$ (and hence the zero eigenvalues of $\Pi_o$) and the zero matrices in $B_o$ are estimated as zeros w.p.a.1. Thus Lemma 5.1 and Theorem 5.1 imply consistent cointegration rank selection and consistent lag order selection.
We next derive the asymptotic distribution of $\widehat\Theta_S \equiv (\widehat\Pi_n, \widehat B_{S_B})$, where $\widehat B_{S_B}$ denotes the LS shrinkage estimator of the non-zero matrices in $B_o$. Let $I_{S_B} = \mathrm{diag}(I_{1,m},\ldots,I_{d_{S_B},m})$, where the $I_{j,m}$ ($j = 1,\ldots,d_{S_B}$) are $m\times m$ identity matrices and $d_{S_B}$ is the dimensionality of the index set $S_B$. Define
$$Q_S \equiv \begin{bmatrix} \beta_o' & 0 \\ 0 & I_{S_B} \\ \alpha_{o,\perp}' & 0 \end{bmatrix} \quad\text{and}\quad D_{n,S} \equiv \mathrm{diag}(n^{1/2}I_{r_o},\, n^{1/2}I_{S_B},\, nI_{m-r_o}),$$
where the identity matrix $I_{S_B} = I_{md_{S_B}}$ in $Q_S$ serves to accommodate the non-zero matrices in $B_o$. Let $\Delta X_{S,t-1}$ denote the non-zero lagged differences in (5.1); then the true model can be written as
$$\Delta Y_t = \Pi_o Y_{t-1} + B_{o,S_B}\Delta X_{S,t-1} + u_t = \Theta_{o,S}\,Q_S^{-1}Z_{S,t-1} + u_t, \qquad (5.10)$$
where $\Theta_{o,S} \equiv (\Pi_o, B_{o,S_B})$, $Z_{S,t-1} = Q_S[Y_{t-1}', \Delta X_{S,t-1}']'$, $Z_{3S,t-1} = [Y_{t-1}'\beta_o,\, \Delta X_{S,t-1}']'$ and $Z_{2,t-1} = \alpha_{o,\perp}'Y_{t-1}$. From Lemma 10.5, we obtain
$$n^{-1}\sum_{t=1}^{n} Z_{3S,t-1}Z_{3S,t-1}' \to_p E\left[Z_{3S,t-1}Z_{3S,t-1}'\right] \equiv \Sigma_{z_{3S}z_{3S}}.$$

Theorem 5.2 Under the conditions of Theorem 5.1, if $n^{1/2}(\delta_{r,n} + \delta_{b,n}) = o_p(1)$, then
$$\left(\widehat\Theta_S - \Theta_{o,S}\right)Q_S^{-1}D_{n,S} \to_d \left[B_{m,S},\; \alpha_o(\beta_o'\alpha_o)^{-1}\beta_o'B_{m,2}\right], \qquad (5.11)$$
where
$$B_{m,S} \equiv N\left(0,\, \Sigma_u\otimes\Sigma_{z_{3S}z_{3S}}^{-1}\right) \quad\text{and}\quad B_{m,2} \equiv \int dB_u B_{w_2}'\left(\int B_{w_2}B_{w_2}'\right)^{-1}.$$
Theorem 5.2 extends the result of Theorem 3.5 to the general VEC model with lagged differences. From Theorem 5.2, the LS shrinkage estimator $\widehat\Theta_S$ is more efficient than the OLS estimator $(\widehat\Pi_{1st}, \widehat B_{1st})$ in the sense that: (i) the zero components in $B_o$ are estimated as zeros w.p.a.1 and thus their LS shrinkage estimators are super efficient; (ii) under the consistent lagged difference selection, the true non-zero components in $B_o$ are more efficiently estimated, in the sense of smaller asymptotic variance; and (iii) the true cointegration rank is estimated and therefore, when $r_o < m$, some parts of the matrix $\Pi_o$ are estimated at a rate faster than root-$n$.

The LS shrinkage estimator suffers from second order asymptotic bias, evident in the component $B_{m,2}$ of the limit (5.11). As in the simpler model, this asymptotic bias is eliminated by GLS estimation. Accordingly we define the GLS shrinkage estimator of the general model as
$$(\widehat\Pi_{g,n}, \widehat B_{g,n}) = \arg\min_{\Pi, B_1,\ldots,B_p\in\mathbb{R}^{m\times m}} \sum_{t=1}^{n}\Big\|\Delta Y_t - \Pi Y_{t-1} - \sum_{j=1}^{p} B_j\,\Delta Y_{t-j}\Big\|^2_{\widehat\Sigma_{u,n}^{-1}} + n\sum_{j=1}^{p}\lambda_{b,j,n}\|B_j\| + n\sum_{k=1}^{m}\lambda_{r,k,n}\|\Phi_{n,k}(\Pi)\|. \qquad (5.12)$$
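To make the weighting in (5.12) concrete, the following minimal sketch (Python; the array names and the estimator supplied for $\Sigma_u$ are assumptions, not part of the paper) evaluates the GLS-weighted residual sum of squares; the penalty terms are unchanged from (2.4) and (5.2).

```python
import numpy as np

def gls_weighted_loss(Pi, B, dY, Y_lag, dY_lags, Sigma_u_inv):
    """Weighted squared-error part of the GLS criterion (5.12): each residual
    e_t is measured in the metric of an estimated inverse error covariance,
    i.e. e_t' Sigma_u_inv e_t, summed over t."""
    resid = dY - Y_lag @ Pi.T - sum(dY_lags[j] @ B[j].T for j in range(len(B)))
    return float(np.einsum('ti,ij,tj->', resid, Sigma_u_inv, resid))
```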
To conclude this section, we show that the GLS shrinkage estimator $(\widehat\Pi_{g,n}, \widehat B_{g,n})$ is oracle efficient in the sense that it has the same asymptotic distribution as the RRR estimate obtained when the true cointegration rank and lagged differences are known.

Corollary 5.3 (Oracle Properties of GLS) Suppose the conditions of Theorem 5.2 are satisfied. If $\widehat\Sigma_{u,n} \to_p \Sigma_u$, then
$$\Pr\left(r(\widehat\Pi_{g,n}) = r_o\right) \to 1 \quad\text{and}\quad \Pr\left(\widehat B_{g,j,n} = 0\right) \to 1 \text{ for } j \in S_B^c, \qquad (5.13)$$
and
$$\left(\widehat\Theta_{g,S} - \Theta_{o,S}\right)Q_S^{-1}D_{n,S} \to_d \left[B_{m,S},\; \alpha_o(\beta_o'\alpha_o)^{-1}\beta_o'\int dB_{u\cdot w_2}B_{w_2}'\Big(\int B_{w_2}B_{w_2}'\Big)^{-1}\right]. \qquad (5.14)$$

Corollary 5.3 is proved using the same arguments as Corollary 3.6 and Theorem 5.2, and its proof is omitted. The asymptotic distributions of the penalized LS/GLS estimates can be used to conduct inference on $\Pi_o$ and $B_o$. However, use of these asymptotic distributions implies that the true cointegrating rank and lag order are selected with probability one. In consequence, these distributions may provide poor approximations to the finite sample distributions of the penalized LS/GLS estimates when model selection errors occur in finite samples, leading to potential size distortions in inference based on (5.11) or (5.14). The development of robust approaches to confidence interval construction therefore seems an important task for future research.
Remark 5.4 Although the grouped Lasso penalty function $P(B) = \|B\|$ is used in the LS shrinkage estimation (5.2) and the GLS shrinkage estimation (5.12), we remark that a full Lasso penalty function can also be used, and the resulting GLS shrinkage estimate enjoys the same properties stated in Corollary 5.3. The GLS shrinkage estimation using the (full) Lasso penalty takes the following form
$$(\widehat\Pi_{g,n}, \widehat B_{g,n}) = \arg\min_{\Pi, B_1,\ldots,B_p\in\mathbb{R}^{m\times m}} \sum_{t=1}^{n}\Big\|\Delta Y_t - \Pi Y_{t-1} - \sum_{j=1}^{p} B_j\,\Delta Y_{t-j}\Big\|^2_{\widehat\Sigma_{u,n}^{-1}} + n\sum_{j=1}^{p}\sum_{l=1}^{m}\sum_{s=1}^{m}\lambda_{b,j,l,s,n}|B_{j,ls}| + n\sum_{k=1}^{m}\lambda_{r,k,n}\|\Phi_{n,k}(\Pi)\|, \qquad (5.15)$$
where $B_{j,ls}$ denotes the $(l,s)$-th element of $B_j$. The advantage of the grouped Lasso penalty $P(B)$ is that it shrinks the elements in $B$ to zero groupwise, which makes it a natural choice for lag order selection (as well as lag elimination) in VECM models. The Lasso penalty is more flexible and, when used in shrinkage estimation, it can do more than select the zero matrices: it can also select the non-zero elements within the non-zero matrices $B_{o,j}$ ($j \in S_B$) w.p.a.1.
Remark 5.5 The flexibility of the Lasso penalty enables GLS shrinkage estimation to achieve further goals in one step, in addition to model selection and efficient estimation. Suppose that the vector $Y_t$ can be divided into $r$ and $m-r$ dimensional subvectors $Y_{1,t}$ and $Y_{2,t}$; then the VECM can be rewritten as
$$\begin{bmatrix} \Delta Y_{1,t} \\ \Delta Y_{2,t} \end{bmatrix} = \begin{bmatrix} \Pi_o^{11} & \Pi_o^{12} \\ \Pi_o^{21} & \Pi_o^{22} \end{bmatrix}\begin{bmatrix} Y_{1,t-1} \\ Y_{2,t-1} \end{bmatrix} + \sum_{j=1}^{p}\begin{bmatrix} B_{o,j}^{11} & B_{o,j}^{12} \\ B_{o,j}^{21} & B_{o,j}^{22} \end{bmatrix}\begin{bmatrix} \Delta Y_{1,t-j} \\ \Delta Y_{2,t-j} \end{bmatrix} + u_t, \qquad (5.16)$$
where $\Pi_o$ and $B_{o,j}$ ($j = 1,\ldots,p$) are partitioned in line with $Y_t$. By definition, $Y_{2,t}$ does not Granger-cause $Y_{1,t}$ if and only if
$$\Pi_o^{12} = 0 \quad\text{and}\quad B_{o,j}^{12} = 0 \text{ for any } j \in S_B.$$
One can attach a (grouped) Lasso penalty to the blocks $\Pi^{12}$ and $B_j^{12}$ in (5.16) such that the causality test is automatically executed in GLS shrinkage estimation.
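As a rough illustration of Remark 5.5 (a sketch under assumed array conventions, not the paper's implementation), the extra penalty attached to the blocks that must vanish under non-causality could take the following form.

```python
import numpy as np

def granger_block_penalty(Pi, B_list, r, lam_gc, n):
    """Group-Lasso style penalty on the upper-right r x (m - r) blocks of Pi
    and of each lag matrix B_j, i.e. the blocks that are zero when Y_2 does
    not Granger-cause Y_1 (cf. (5.16)); lam_gc is a hypothetical tuning value."""
    pen = np.linalg.norm(Pi[:r, r:])
    pen += sum(np.linalg.norm(Bj[:r, r:]) for Bj in B_list)
    return n * lam_gc * pen
```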
Remark 5.6 In this paper, we only consider the Lasso penalty function in the LS or GLS shrinkage estimation. The main advantage of the Lasso penalty is that it is a convex function, which, combined with the convexity of the LS or GLS criterion, makes the computation of the shrinkage estimate faster and more accurate. It is clear that, as long as the tuning parameters satisfy certain rate requirements, our main results continue to hold if other penalty functions (e.g., the bridge penalty) are used in the LS or GLS shrinkage estimation.
6 Adaptive Selection of the Tuning Parameters

The tuning parameters are specified in the adaptive forms
$$\lambda_{r,k,n} = \frac{\bar\lambda_{r,k,n}}{\|\lambda_k(\widehat\Pi_{1st})\|^{\omega}} \quad\text{and}\quad \lambda_{b,j,n} = \frac{m^{\omega}\bar\lambda_{b,j,n}}{\|\widehat B_{1st,j}\|^{\omega}}, \qquad (6.1)$$
where $\bar\lambda_{r,k,n}$ and $\bar\lambda_{b,j,n}$ are non-increasing positive sequences and $\omega$ is some positive finite constant. The adaptive component of $\lambda_{r,k,n}$ is $\|\lambda_k(\widehat\Pi_{1st})\|^{-\omega}$ ($k = 1,\ldots,m$), because for any $k \in S_\Phi^c$ we have $\|\lambda_k(\widehat\Pi_{1st})\|^{-\omega} \to_p \infty$, while for any $k \in S_\Phi$ we have $\|\lambda_k(\widehat\Pi_{1st})\|^{-\omega} \to_p \|\lambda_k(\Pi_o)\|^{-\omega} = O(1)$ under Assumption WN.⁸ Similarly, the adaptive component of $\lambda_{b,j,n}$ is $m^{\omega}\|\widehat B_{1st,j}\|^{-\omega}$, where the extra term $m^{\omega}$ is used to adjust for the effect of the dimensionality of $B_j$ on the adaptive penalty. Such an adjustment does not affect the asymptotic properties of the LS/GLS shrinkage estimation, but it is used to improve finite sample performance. To see the effect of dimensionality on the adaptive penalty, write
$$\|\widehat B_{1st,j}\|^{\omega} = \left[\sum_{l=1}^{m}\sum_{h=1}^{m}\widehat B_{1st,j,lh}^2\right]^{\omega/2},$$
so that its magnitude reflects the number $m^2$ of the square terms $|\widehat B_{1st,j,lh}|^2$: if the individual elements are of comparable magnitude, $\|\widehat B_{1st,j}\|^{\omega}$ behaves like $\left[m^2|\widehat B_{1st,j,lh}|^2\right]^{\omega/2} = m^{\omega}|\widehat B_{1st,j,lh}|^{\omega}$, and the factor $m^{\omega}$ in (6.1) offsets this dimensionality effect in the adaptive penalty. Under some general rate conditions on $\bar\lambda_{r,k,n}$ and $\bar\lambda_{b,j,n}$, the following lemma shows that the tuning parameters specified in (6.1) satisfy the conditions in our theorems of super efficiency and oracle properties.
Lemma 6.1 (i) If $n^{1/2}\bar\lambda_{r,k,n} = o(1)$ and $n^{\omega}\bar\lambda_{r,k,n} \to \infty$, then under Assumptions WN and RR we have
$$n^{1/2}\delta_{r,n} = o_p(1) \quad\text{and}\quad \lambda_{r,k,n} \to_p \infty$$
for any $k \in S_\Phi^c$; (ii) if $n^{(1+\omega)/2}\bar\lambda_{r,k,n} = o(1)$ and $n^{\omega}\bar\lambda_{r,k,n} \to \infty$, then under Assumptions LP and RR
$$n^{1/2}\widetilde\delta_{r,n} = o_p(1), \quad n^{1/2}\lambda_{r,k,n} = o_p(1) \quad\text{and}\quad \lambda_{r,k',n} \to_p \infty$$
for any $k \in \{r_1+1,\ldots,r_o\}$ and $k' \in S_\Phi^c$; (iii) if $n^{1/2}\bar\lambda_{r,k,n} = o(1)$ and $n^{\omega}\bar\lambda_{r,k,n} \to \infty$ for any $k = 1,\ldots,m$, and $n^{1/2}\bar\lambda_{b,j,n} = o(1)$ and $n^{(1+\omega)/2}\bar\lambda_{b,j,n} \to \infty$ for any $j = 1,\ldots,p$, then under Assumptions WN and GRR
$$n^{1/2}(\delta_{r,n} + \delta_{b,n}) = o_p(1), \quad \lambda_{r,k,n} \to_p \infty \quad\text{and}\quad \lambda_{b,j,n} \to_p \infty$$
for any $k \in S_\Phi^c$ and $j \in S_B^c$.

It is notable that, when $u_t$ is iid, $\bar\lambda_{r,k,n}$ is required to converge to zero at a rate faster than $n^{-1/2}$, while when $u_t$ is weakly dependent, $\bar\lambda_{r,k,n}$ has to converge to zero at a rate faster than $n^{-(1+\omega)/2}$. The faster convergence rate of $\bar\lambda_{r,k,n}$ in Lemma 6.1.(ii) ensures that the pseudo $r_o - r_1$ zero eigenvalues of $\Pi_1$ are estimated as non-zeros w.p.a.1. When $r_1 = r_o$, $\Pi_1$ contains no pseudo zero eigenvalues and it has the true rank $r_o$. It is clear that in this case we only need $n^{1/2}\bar\lambda_{r,k,n} = o(1)$ and $n^{\omega}\bar\lambda_{r,k,n} \to \infty$ to show that the tuning parameters in (6.1) satisfy $n^{1/2}\delta_{r,n} = o_p(1)$ and $\lambda_{r,k',n} \to_p \infty$ for any $k' \in S_\Phi^c$.
From Lemma 6.1, we see that the conditions imposed on $\{\bar\lambda_{r,k,n}\}_{k=1}^m$ and $\{\bar\lambda_{b,j,n}\}_{j=1}^p$ to ensure oracle properties in GLS shrinkage estimation only restrict the rates at which the sequences $\bar\lambda_{r,k,n}$ and $\bar\lambda_{b,j,n}$ go to zero. But in finite samples these conditions are not precise enough to provide a clear choice of tuning parameter for practical implementation. On one hand, these sequences should converge to zero as fast as possible so that the shrinkage bias in the estimation of the non-zero components of the model is as small as possible. In the extreme case where $\lambda_{r,k,n} = 0$ and $\lambda_{b,j,n} = 0$, LS shrinkage estimation reduces to LS estimation and there is no shrinkage bias in the resulting estimators. (Of course there may still be finite sample estimation bias.) On the other hand, these sequences should converge to zero as slowly as possible so that in finite samples the zero components in the model are estimated as zeros with higher probability. In the opposite extreme $\lambda_{r,k,n} = \infty$ and $\lambda_{b,j,n} = \infty$, and then all parameters of the model are estimated as zeros with probability one in finite samples. Thus there is a bias and variance trade-off in the selection of the sequences $\{\bar\lambda_{r,k,n}\}_{k=1}^m$ and $\{\bar\lambda_{b,j,n}\}_{j=1}^p$.
By definition $\widehat T_n = Q_n\widehat\Pi_n$ and the $k$-th row of $\widehat T_n$ is estimated as zero only if the following first order condition holds:
$$\left\| Q_n(k)\,\widehat\Sigma_{u,n}^{-1}\,\frac{1}{n}\sum_{t=1}^{n}\Big(\Delta Y_t - \widehat\Pi_n Y_{t-1} - \sum_{j=1}^{p}\widehat B_{n,j}\Delta Y_{t-j}\Big)Y_{t-1}' \right\| < \frac{\bar\lambda_{r,k,n}}{2\|\lambda_k(\widehat\Pi_{1st})\|^{\omega}}. \qquad (6.2)$$
Let $T \equiv Q\Pi_o$ and let $T(k)$ be the $k$-th row of the matrix $Q\Pi_o$. If a non-zero row $T(k)$ ($k \le r_o$) is estimated as zero, then the left hand side of the above inequality will be asymptotically close to a non-zero real number, because the under-selected cointegration rank leads to inconsistent estimation. To ensure that the shrinkage bias and the errors from under-selecting the cointegration rank are small in finite samples, one would like $\bar\lambda_{r,k,n}$ to converge to zero as fast as possible.
On the other hand, the zero rows of $T$ are estimated as zero only if the same inequality in (6.2) is satisfied. As $n\lambda_k(\widehat\Pi_{1st}) = O_p(1)$, we can rewrite the inequality in (6.2) as
$$\left\| Q_n(k)\,\widehat\Sigma_{u,n}^{-1}\,\frac{1}{n}\sum_{t=1}^{n}\Big(\Delta Y_t - \widehat\Pi_n Y_{t-1} - \sum_{j=1}^{p}\widehat B_{n,j}\Delta Y_{t-j}\Big)Y_{t-1}' \right\| < \frac{n^{\omega}\bar\lambda_{r,k,n}}{2\|n\lambda_k(\widehat\Pi_{1st})\|^{\omega}}. \qquad (6.3)$$
The sample average on the left side of this inequality is asymptotically a vector of linear combinations of non-degenerate random variables, and it is desirable to have $n^{\omega}\bar\lambda_{r,k,n}$ diverge to infinity as fast as possible to ensure that the true cointegration rank is selected with high probability in finite samples. We propose to choose $\bar\lambda_{r,k,n} = c_{r,k}\,n^{-\omega/2}$ (here $c_{r,k}$ is some positive constant whose selection is discussed later) to balance the requirement that $\bar\lambda_{r,k,n}$ converges to zero and $n^{\omega}\bar\lambda_{r,k,n}$ diverges to infinity as fast as possible.
Using similar arguments, we see that the component $B_{o,j}$ in $B_o$ will be estimated as zero if the following condition holds:
$$\left\| n^{-1/2}\,\widehat\Sigma_{u,n}^{-1}\sum_{t=1}^{n}\Big(\Delta Y_t - \widehat\Pi_n Y_{t-1} - \sum_{j=1}^{p}\widehat B_{n,j}\Delta Y_{t-j}\Big)\Delta Y_{t-j}' \right\| < \frac{n^{1/2}\bar\lambda_{b,j,n}}{2\|\widehat B_{1st,j}\|^{\omega}}. \qquad (6.4)$$
As $B_{o,j} \neq 0$, the left side of the above inequality will be asymptotically close to a non-zero real number, because the under-selected lagged differences also lead to inconsistent estimation. To ensure that the shrinkage bias and the error of under-selecting the lagged differences are small in finite samples, it is desirable to have $n^{1/2}\bar\lambda_{b,j,n}$ converge to zero as fast as possible.
On the other hand, the zero components $B_{o,j}$ in $B_o$ are estimated as zero only if the same inequality in (6.4) is satisfied. As $\widehat B_{1st,j} = O_p(n^{-1/2})$, the inequality in (6.4) can be written as
$$\left\| n^{-1/2}\,\widehat\Sigma_{u,n}^{-1}\sum_{t=1}^{n}\Big(\Delta Y_t - \widehat\Pi_n Y_{t-1} - \sum_{j=1}^{p}\widehat B_{n,j}\Delta Y_{t-j}\Big)\Delta Y_{t-j}' \right\| < \frac{n^{(1+\omega)/2}\bar\lambda_{b,j,n}}{2\|\sqrt{n}\,\widehat B_{1st,j}\|^{\omega}}. \qquad (6.5)$$
The sample average on the left side of this inequality is asymptotically a vector of linear combinations of non-degenerate random variables, and again it is desirable to have $n^{(1+\omega)/2}\bar\lambda_{b,j,n}$ diverge to infinity as fast as possible to ensure that the zero components in $B_o$ are selected with high probability in finite samples. We propose to choose $\bar\lambda_{b,j,n} = c_{b,j}\,n^{-1/2-\omega/4}$ (again $c_{b,j}$ is some positive constant whose selection is discussed later) to balance the requirement that $\bar\lambda_{b,j,n}$ converges to zero and $n^{(1+\omega)/2}\bar\lambda_{b,j,n}$ diverges to infinity as fast as possible.
We next discuss how to choose the loading coefficients in $\bar\lambda_{r,k,n}$ and $\bar\lambda_{b,j,n}$. Note that the sample average on the left hand side of (6.3) can be written as
$$F_{\Phi,n}(k) \equiv Q_n(k)\,\widehat\Sigma_{u,n}^{-1}\,\frac{1}{n}\sum_{t=1}^{n}\left[u_t - \big(\widehat\Theta_n - \Theta_o\big)Q_B^{-1}Z_{t-1}\right]Y_{t-1}'.$$
Similarly, the sample average on the left hand side of (6.5) can be written as
$$F_{b,n}(j) \equiv \frac{\widehat\Sigma_{u,n}^{-1}}{\sqrt{n}}\sum_{t=1}^{n}\left[u_t - \big(\widehat\Theta_n - \Theta_o\big)Q_B^{-1}Z_{t-1}\right]\Delta Y_{t-j}'.$$
The next lemma provides the asymptotic distributions of $F_{\Phi,n}(k)$ and $F_{b,n}(j)$ for $k = 1,\ldots,m$ and $j = 1,\ldots,p$.⁹
Lemma 6.2 Suppose that the conditions of Corollary 5.3 are satisfied; then
$$F_{\Phi,n}(k) = Q_n(k)\,T_{1,\Pi_o}\int dB_u B_u'\, T_{2,\Pi_o} + o_p(1), \qquad (6.6)$$
where
$$T_{1,\Pi_o} = \Sigma_u^{-1} - \Sigma_u^{-1}\alpha_o(\alpha_o'\Sigma_u^{-1}\alpha_o)^{-1}\alpha_o'\Sigma_u^{-1} \quad\text{and}\quad T_{2,\Pi_o} = \beta_{o,\perp}(\alpha_{o,\perp}'\beta_{o,\perp})^{-1}\alpha_{o,\perp}';$$
further, for $j = 1,\ldots,p$,
$$F_{b,n}(j) \to_d \Sigma_u^{-1/2}\,B_{m\times m}(1)\,\Sigma_{\Delta y_j\mid z_{3S}}^{1/2}. \qquad (6.7)$$

⁹The proof of Lemma 6.2 is in the supplemental appendix of this paper.
We propose to select $c_{r,k}$ to normalize the random matrix limit in (6.6), using estimates $\widehat T_{1,\Pi}$ and $\widehat T_{2,\Pi}$ of $T_{1,\Pi_o}$ and $T_{2,\Pi_o}$; the resulting empirical loading coefficient $\widehat c_{r,k}$ is embodied in (6.10) below. Of course, the rank of $\Pi_o$ needs to be estimated before $T_{1,\Pi_o}$ and $T_{2,\Pi_o}$ can be estimated. We propose to run a first step shrinkage estimation with $\bar\lambda_{r,k,n} = 2\log(n)\,n^{-\omega/2}$ and $\bar\lambda_{b,j,n} = 2\log(n)\,n^{-1/2-\omega/4}$ to get initial estimates of the rank $r_o$ and the order of the lagged differences. Then, based on this first-step shrinkage estimation, one can construct $\widehat T_{1,\Pi}$, $\widehat T_{2,\Pi}$ and thus the empirical loading coefficient $\widehat c_{r,k}$. Similarly, we propose to select $c_{b,j}$ to normalize the random limit in (6.7), i.e.
$$\widehat c_{b,j} = 2\,\big\|\widehat\Sigma_{u,n}^{-1/2}\big\|\,\big\|\widehat\Sigma_{\Delta y_j\Delta y_j}^{1/2}\big\|, \qquad (6.9)$$
where $\widehat\Sigma_{\Delta y_j\Delta y_j} = n^{-1}\sum_{t=1}^{n}\Delta Y_{t-j}\Delta Y_{t-j}'$. From the expression in (6.7), it might seem that the empirical analog of $\Sigma_{\Delta y_j\mid z_{3S}}$ is a more appropriate term with which to normalize $F_{b,n}(j)$. However, if $\Delta Y_{t-j}$ is a redundant lag and the residual of its projection on $\beta_o'Y_{t-1}$ and the non-redundant lagged differences is close to zero, then $\Sigma_{\Delta y_j\mid z_{3S}}$ and its estimate will be close to zero. As a result, $\widehat c_{b,j}$ would tend to be small, which would increase the probability of including $\Delta Y_{t-j}$ in the selected model in finite samples. To avoid this unappealing scenario, we use $\widehat\Sigma_{\Delta y_j\Delta y_j}$ instead of the empirical analog of $\Sigma_{\Delta y_j\mid z_{3S}}$ in (6.9). It is clear that $\widehat c_{b,j}$ can be directly constructed from the preliminary LS estimation.
The choice of $\omega$ is a more complicated issue which is not pursued in this paper. For the empirical applications, we propose to choose $\omega = 2$ because such a choice is popular in the Lasso-based variable selection literature, it satisfies all our rate criteria, and simulations show that the choice works remarkably well. Based on all the above results, we propose the following data dependent tuning parameters for LS shrinkage estimation:
$$\lambda_{r,k,n} = \frac{2}{n}\,\big\|Q_n(k)\hat T_{1,\alpha}\hat\Omega_{u,n}^{1/2}\big\|\;\big\|\hat\Omega_{u,n}^{1/2}\hat T_{2,\beta}\big\|\;\big\|\lambda_k(\hat\Pi_{1st})\big\|^{-2} \qquad (6.10)$$
and
$$\lambda_{b,j,n} = \frac{2m^2}{n}\,\big\|\hat\Omega_{u,n}^{-1/2}\big\|\;\big\|\hat\Sigma_{\Delta y_j\Delta y_j}^{1/2}\big\|\;\big\|\hat B_{1st,j}\big\|^{-2} \qquad (6.11)$$
for $k = 1,\ldots,m$ and $j = 1,\ldots,p$. The above discussion is based on the general VECM with iid $u_t$. In the simple ECM where cointegration rank selection is the only concern, the adaptive tuning parameters proposed in (6.10) remain valid. The expression in (6.10) becomes invalid when $u_t$ is weakly dependent and $r_1 < r_o$; in that case, we propose to replace the leading term $2n^{-1}$ in (6.10) by $2n^{-3/2}$.
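To illustrate how such data-dependent tuning parameters can be constructed in practice, the following is a minimal Python sketch. It is only an illustration under stated assumptions: it uses the simpler first-step rates $2\log(n)n^{-\omega/2}$ and $2\log(n)n^{-1/2-\omega/4}$ described above together with adaptive weights built from the first-step LS estimates, rather than the full normalizations in (6.10)-(6.11), and all function and variable names are hypothetical.

```python
import numpy as np

def adaptive_tuning(dY, Ylag, dYlags, omega=2.0):
    """First-step adaptive tuning parameters in the spirit of Section 6.

    dY     : (n, m) array of first differences Delta Y_t
    Ylag   : (n, m) array of lagged levels Y_{t-1}
    dYlags : list of p arrays (n, m), the lagged differences Delta Y_{t-j}

    The T_1, T_2 normalizations of (6.10) are omitted; the rates
    2*log(n)*n**(-omega/2) and 2*log(n)*n**(-1/2-omega/4) are those
    proposed in the text for the first-step shrinkage estimation.
    """
    n, m = dY.shape
    X = np.hstack([Ylag] + dYlags)                  # regressors (Y_{t-1}, dY_{t-1}, ..., dY_{t-p})
    coef, *_ = np.linalg.lstsq(X, dY, rcond=None)   # first-step LS estimates
    Pi_1st = coef[:m].T                             # m x m estimate of Pi
    B_1st = [coef[m + j*m: m + (j+1)*m].T for j in range(len(dYlags))]

    eig = np.linalg.eigvals(Pi_1st)                 # eigenvalue estimates of Pi
    lam_r = 2*np.log(n) * n**(-omega/2) * np.abs(eig)**(-omega)
    lam_b = [2*np.log(n) * n**(-0.5 - omega/4) * np.linalg.norm(Bj)**(-omega)
             for Bj in B_1st]
    return lam_r, lam_b
```

A second pass would then replace these first-step rates by the empirical loading coefficients $\hat c_{r,k}$ and $\hat c_{b,j}$ once initial estimates of the rank and the lag order are available.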
7 Simulation Study

We conducted simulations to assess the finite sample performance of the shrinkage estimates in terms of cointegrating rank selection and efficient estimation. Three models were investigated. In the first model, the simulated data are generated from
$$\begin{pmatrix} \Delta Y_{1,t}\\ \Delta Y_{2,t}\end{pmatrix} = \Pi_o \begin{pmatrix} Y_{1,t-1}\\ Y_{2,t-1}\end{pmatrix} + \begin{pmatrix} u_{1,t}\\ u_{2,t}\end{pmatrix}, \qquad (7.1)$$
where $u_t \sim \text{iid }N(0,\Omega_u)$ with $\Omega_u = \begin{pmatrix} 1 & 0.5\\ 0.5 & 0.75\end{pmatrix}$. The initial observation $Y_0$ is set to zero for simplicity. $\Pi_o$ is specified as a zero matrix, a rank one matrix with eigenvalues $(0,-0.5)$, or a full rank matrix with eigenvalues $(-0.6,-0.5)$, corresponding to cointegrating rank $r_o = 0$, $1$ and $2$ respectively (7.2). In the second model the errors are weakly dependent and are generated from innovations $\varepsilon_t \sim \text{iid }N(0,\Sigma_\varepsilon)$ with $\Sigma_\varepsilon = \mathrm{diag}(1.25, 0.75)$; the initial values $Y_0$ and $\varepsilon_0$ are set to zero. The third model has the following form
$$\begin{pmatrix} \Delta Y_{1,t}\\ \Delta Y_{2,t}\end{pmatrix} = \Pi_o \begin{pmatrix} Y_{1,t-1}\\ Y_{2,t-1}\end{pmatrix} + B_{1,o}\begin{pmatrix} \Delta Y_{1,t-1}\\ \Delta Y_{2,t-1}\end{pmatrix} + B_{3,o}\begin{pmatrix} \Delta Y_{1,t-3}\\ \Delta Y_{2,t-3}\end{pmatrix} + u_t, \qquad (7.3)$$
where $u_t$ is generated under the same conditions as in (7.1), $\Pi_o$ is specified as in (7.2), and $B_{1,o}$ and $B_{3,o}$ are taken to be $\mathrm{diag}(0.4, 0.4)$ so that Assumption 5.1 is satisfied. The initial values $(Y_t, \varepsilon_t)$ for $t = -3,\ldots,0$ are set to zero. In all three cases, we add 50 additional observations to the simulated sample of size $n$ to eliminate start-up effects from the initialization.
In the first two models, we assume that the econometrician specifies the following model
$$\begin{pmatrix} \Delta Y_{1,t}\\ \Delta Y_{2,t}\end{pmatrix} = \Pi \begin{pmatrix} Y_{1,t-1}\\ Y_{2,t-1}\end{pmatrix} + u_t, \qquad (7.4)$$
where $u_t$ is iid$(0,\Omega_u)$ with some unknown positive definite matrix $\Omega_u$. The above empirical model is correctly specified under the data generating assumption (7.1), but is misspecified under the second (weakly dependent) design. We are interested in investigating the performance of the shrinkage method in selecting the correct rank of $\Pi_o$ under both data generating assumptions, and in the efficient estimation of $\Pi_o$ under design (7.1).
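As a concrete illustration of the first design, the following Python sketch simulates data from (7.1). It is not the paper's code: the particular rank one matrix used for $\Pi_o$ (chosen only so that its eigenvalues are $(0,-0.5)$) and the burn-in handling are illustrative assumptions.

```python
import numpy as np

def simulate_model1(n, Pi_o, seed=0, burn=50):
    """Simulate the first design (7.1): dY_t = Pi_o Y_{t-1} + u_t,
    u_t iid N(0, Omega_u) with Omega_u = [[1, .5], [.5, .75]], Y_0 = 0.
    The first `burn` observations are discarded to remove start-up effects."""
    rng = np.random.default_rng(seed)
    Omega_u = np.array([[1.0, 0.5], [0.5, 0.75]])
    L = np.linalg.cholesky(Omega_u)
    Y = np.zeros(2)
    levels = []
    for _ in range(n + burn):
        u = L @ rng.standard_normal(2)
        Y = Y + Pi_o @ Y + u          # Y_t = Y_{t-1} + dY_t
        levels.append(Y.copy())
    levels = np.array(levels)[burn:]   # keep the last n observations
    dY = np.diff(levels, axis=0)
    return levels, dY

# e.g. a rank one Pi_o with eigenvalues (0, -0.5), as in the r_o = 1 design
# (the exact matrix used in the paper is not recoverable here)
Pi_o = np.array([[-1.0, -0.5], [1.0, 0.5]])
levels, dY = simulate_model1(400, Pi_o)
```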
In the third model, we assume that the econometrician specifies the following model
$$\begin{pmatrix} \Delta Y_{1,t}\\ \Delta Y_{2,t}\end{pmatrix} = \Pi \begin{pmatrix} Y_{1,t-1}\\ Y_{2,t-1}\end{pmatrix} + \sum_{j=1}^{3} B_j\begin{pmatrix} \Delta Y_{1,t-j}\\ \Delta Y_{2,t-j}\end{pmatrix} + u_t, \qquad (7.5)$$
where $u_t$ is iid$(0,\Omega_u)$ with some unknown positive definite matrix $\Omega_u$. The above empirical model is over-parameterized relative to (7.3). We are interested in investigating the performance of the shrinkage method in selecting the correct rank of $\Pi_o$ and the order of the lagged differences, and in the efficient estimation of $\Pi_o$ and $B_{2,o}$.
Table 11.1 presents finite sample probabilities of rank selection under the different model specifications. Overall, the GLS shrinkage method performs very well in selecting the true rank of $\Pi_o$. When the sample size is small (i.e. $n = 100$) and the data are iid, the probability of selecting the true rank $r_o = 0$ is close to 1 (around 0.96) and the probabilities of selecting the true ranks $r_o = 1$ and $r_o = 2$ are almost equal to 1. When the sample size is increased to 400, the probabilities of selecting the true ranks $r_o = 0$ and $r_o = 1$ are almost equal to 1 and the probability of selecting the true rank $r_o = 2$ equals 1. Similar results show up when the data are weakly dependent (model 2). The only difference is that when the pseudo-true eigenvalues are close to zero, the probability of falsely selecting these small eigenvalues is increased, as illustrated in the weakly dependent case with $r_o = 2$. However, as the sample size grows, the probability of selecting the true rank moves closer to 1.
Tables 11.3, 11.4 and 11.5 provide finite sample properties of the GLS shrinkage estimate, the OLS estimate and the oracle estimate (under the first simulation design) in terms of bias, standard deviation and root mean square error. When the true rank is $r_o = 0$, the unknown parameter $\Pi_o$ is a zero matrix. In this case, the GLS shrinkage estimate clearly dominates the LS estimate due to the high probability of the shrinkage method selecting the true rank. When the true rank is $r_o = 1$, we do not observe an efficiency advantage of the GLS shrinkage estimator over the LS estimate, but the finite sample bias of the shrinkage estimate is remarkably smaller (Table 11.4). From Corollary 3.6, we see that the GLS shrinkage estimator is free of high order bias, which explains its smaller bias in finite samples. Moreover, Lemma 10.2 and Corollary 3.6 indicate that the OLS estimator and the GLS shrinkage estimator (and hence the oracle estimator) have almost the same variance. This explains why the GLS shrinkage estimate does not look more efficient than the OLS estimate. To better compare the OLS estimate, the GLS shrinkage estimate and the oracle estimate, we transform the three estimates using the matrix $Q$ and its inverse (i.e. the estimate $\hat\Pi$ is transformed to $Q\hat\Pi Q^{-1}$). Note that in this case $Q\Pi_o Q^{-1} = \mathrm{diag}(-0.5, 0)$. The finite sample properties of the transformed estimates are presented in the last two panels of Table 11.4. We see that the elements in the last column of the transformed GLS shrinkage estimator enjoy very small bias and small variance even when the sample size is only 100. The elements in the last column of the OLS estimator, when compared with the elements in its first column, have smaller variance but larger bias. It is clear that as the sample size grows, the GLS shrinkage estimator approaches the oracle estimator in terms of overall performance. When the true rank is $r_o = 2$, the LS estimator is better than the shrinkage estimator, as the latter suffers from shrinkage bias in finite samples. If shrinkage bias is a concern, one can run a reduced rank regression based on the rank selected by the GLS shrinkage estimation to get the so-called post-Lasso estimator (cf. Belloni and Chernozhukov, 2013). The post-Lasso estimator also enjoys oracle properties and is free of shrinkage bias in finite samples.
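To make the post-Lasso step concrete, the following Python sketch runs a standard Johansen-type reduced rank regression at the rank selected by the shrinkage step. It is a sketch under simplifying assumptions (no deterministic terms and no lagged differences), not the authors' implementation, and all names are illustrative.

```python
import numpy as np

def post_lasso_rrr(dY, Ylag, rank):
    """Reduced rank regression of dY_t on Y_{t-1} at a given cointegrating
    rank (Johansen-type RRR without lagged differences), intended to be run
    with `rank` equal to the rank selected by the GLS shrinkage step."""
    n = dY.shape[0]
    S00 = dY.T @ dY / n
    S01 = dY.T @ Ylag / n
    S11 = Ylag.T @ Ylag / n
    # generalized eigenvalue problem |lam*S11 - S10 S00^{-1} S01| = 0
    A = np.linalg.solve(S11, S01.T @ np.linalg.solve(S00, S01))
    eigval, eigvec = np.linalg.eig(A)
    order = np.argsort(eigval.real)[::-1]
    beta = eigvec.real[:, order[:rank]]                       # cointegrating vectors
    alpha = S01 @ beta @ np.linalg.inv(beta.T @ S11 @ beta)   # loading matrix
    Pi_rrr = alpha @ beta.T                                   # rank-restricted estimate of Pi
    return Pi_rrr, alpha, beta
```

In the general VECM case the same RRR step would be applied after partialling out whichever lagged differences the shrinkage step retains.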
Table 11.2 shows finite sample selection probabilities for the new shrinkage method in joint rank and lag order selection for model 3. Evidently, the method performs very well in selecting the true rank and the true lagged differences (and thus the true model) in all scenarios.¹⁰ It is interesting to see that the probabilities of selecting the true ranks are not negatively affected either by adding lags to the model or by the lagged order selection being performed simultaneously with rank selection.

¹⁰Joint determination of the lagged differences and cointegration rank can also be performed using information criteria, such as AIC and/or BIC, as suggested in Phillips and McFarland (1997) and Chao and Phillips (1999). In the supplemental appendix, we conduct further simulation studies to compare the finite sample performance of the information criteria with LS shrinkage estimation.

[Figure 8.1. US real income (GNP), consumption and investment, quarterly, 1947:1-2009:3. Data source: Federal Reserve Economic Data (FRED), St. Louis Fed.]

Tables 11.6, 11.7 and 11.8 present the finite sample properties of GLS shrinkage, OLS, and oracle estimation. When compared with the oracle estimates, some components in the GLS shrinkage
estimate even have smaller variances, though their finite sample biases are slightly larger. As a result, their root mean square errors are smaller than those of their counterparts in oracle estimation. Moreover, the GLS shrinkage estimate generally has smaller variance when compared with the OLS estimate, though the finite sample bias of the shrinkage estimate of the nonzero components is slightly larger, as expected. The intuition that explains how the GLS shrinkage estimate can outperform the oracle estimate lies in the fact that there are some zero components in $B_o$ and shrinking their estimates towards zero (but not exactly to zero) helps to reduce their bias and variance. From this perspective, the shrinkage estimates of the zero components in $B_o$ share features similar to traditional shrinkage estimates, revealing that finite sample shrinkage bias is not always harmful.
8 An Empirical Example

This section reports an empirical example to illustrate the application of these techniques to time series modeling of the long-run and short-run behavior of aggregate income, consumption and investment in the US economy. The sample¹¹ used in the empirical study consists of quarterly data over the period 1947-2009 from the Federal Reserve Economic Data (FRED).

The sample data are shown in Figure 8.1. Evidently, the time series display long-term trend growth, which is especially clear in GNP and consumption, and some commonality in the growth mechanism over time. In particular, the series show evidence of some co-movement over the entire period. We therefore anticipate that modeling the series in terms of a VECM might reveal some non-trivial cointegrating relations. That is to say, we would expect the cointegration rank $r_o$ to satisfy $0 < r_o < 3$. These data were studied in Athanasopoulos et al. (2011), who found on the same sample period and data that information criteria model selection produced a zero rank estimate for $r_o$ and a single lag ($\Delta Y_{t-1}$) in the ECM.

¹¹We thank George Athanasopoulos for providing the data.

Let $Y_t = (C_t, G_t, I_t)$, where $C_t$, $G_t$ and $I_t$ denote the logarithms of real consumption per capita, real GNP per capita and real investment per capita at period $t$ respectively. For the same data as Athanasopoulos et al. (2011) we applied our shrinkage methods to estimate the following system¹²
$$\Delta Y_t = \Pi Y_{t-1} + \sum_{k=1}^{3} B_k \Delta Y_{t-k} + u_t. \qquad (8.1)$$
Unrestricted LS estimation of this model produced eigenvalue estimates 0.0025 and $-0.0493 \pm 0.0119i$, which indicates that $\Pi$ might contain at least one zero eigenvalue, as the positive eigenvalue estimate 0.0025 is close to zero. The LS estimates of the lag coefficients $B_k$ are
$$\hat B_{1,1st} = \begin{pmatrix} .14 & -.03 & .16\\ .72 & -.18 & .97\\ .19 & .02 & .35\end{pmatrix},\quad \hat B_{2,1st} = \begin{pmatrix} .33 & -.09 & .10\\ .43 & -.06 & .23\\ .16 & -.06 & .07\end{pmatrix},\quad \hat B_{3,1st} = \begin{pmatrix} .31 & -.20 & .24\\ .19 & -.11 & -.15\\ .09 & -.03 & .06\end{pmatrix}.$$
From these estimates it is by no means clear which lagged differences should be excluded from (8.1). From their magnitudes, it seems that $\Delta Y_{t-1}$, $\Delta Y_{t-2}$ and $\Delta Y_{t-3}$ might all be included in the empirical model.
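The unrestricted LS step just described amounts to a multivariate regression of $\Delta Y_t$ on $Y_{t-1}$ and three lagged differences, followed by an eigenvalue calculation. The following Python sketch shows one way to do this; the function and variable names are assumptions for illustration, not the authors' code.

```python
import numpy as np

def vecm_ols(Y, p=3):
    """Unrestricted LS estimation of the VECM (8.1):
    dY_t = Pi Y_{t-1} + B_1 dY_{t-1} + ... + B_p dY_{t-p} + u_t,
    returning Pi_hat, the B_k estimates and the eigenvalues of Pi_hat.
    Y is an (N, m) array of levels (here the three log series)."""
    dY = np.diff(Y, axis=0)
    T, m = dY.shape
    lhs = dY[p:]                                    # dY_t
    regs = [Y[p:-1]]                                # Y_{t-1}
    regs += [dY[p-j:T-j] for j in range(1, p+1)]    # dY_{t-j}, j = 1..p
    X = np.hstack(regs)
    coef, *_ = np.linalg.lstsq(X, lhs, rcond=None)
    Pi_hat = coef[:m].T
    B_hat = [coef[m+(j-1)*m: m+j*m].T for j in range(1, p+1)]
    eig = np.linalg.eigvals(Pi_hat)                 # used to judge possible zero eigenvalues
    return Pi_hat, B_hat, eig
```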
We applied LS shrinkage estimation to the model (8.1). Using the LS estimates, we constructed an adaptive penalty for GLS shrinkage estimation. We first tried GLS shrinkage estimation with the first-step tuning parameters $\lambda_{r,k,n} = 2\log(n)\,n^{-\omega/2}$ and $\lambda_{b,j,n} = 2\log(n)\,n^{-1/2-\omega/4}$ proposed in Section 6, for $k, j = 1, 2, 3$. The eigenvalues of the GLS shrinkage estimate of $\Pi$ are 0.0000394, $-0.0001912$ and 0, which implies that $\Pi$ contains one zero eigenvalue. There are two nonzero eigenvalue estimates which are both close to zero. The effect of the adaptive penalty on these two estimates is substantial because of the small magnitudes of the eigenvalues of the original LS estimate of $\Pi$. As a result, the shrinkage bias in the two nonzero eigenvalue estimates is likely to be large.

¹²The system (8.1) was fitted with and without an intercept. The findings were very similar and in both cases the cointegrating rank was found to be 2. Results are reported here for the fitted intercept case. Of course, Lasso methods can also be applied to determine whether an intercept should appear in each equation or in any long-run relation that might be found. That extension of Lasso is not considered in the present paper. It is likely to be important in forecasting.

The GLS shrinkage
estimates of $B_2$ and $B_3$ are zero, while the GLS shrinkage estimate of $B_1$ is
$$\hat B_1 = \begin{pmatrix} .0687 & .1076 & .0513\\ .4598 & .1212 & .4053\\ .0986 & .1123 & .2322\end{pmatrix}.$$
Using the results from the above GLS shrinkage estimation, we construct the adaptive loading parameters in (6.8) and (6.9). Using the adaptive tuning parameters in (6.10) and (6.11), we perform a further GLS shrinkage estimation of the empirical model (8.1). The eigenvalues of the new GLS shrinkage estimate of $\Pi$ are $-0.0226 \pm 0.0158i$ and 0, which again implies that $\Pi$ contains one zero eigenvalue. Of course, the new nonzero eigenvalue estimates also contain nontrivial shrinkage bias. The new GLS shrinkage estimates of $B_2$ and $B_3$ are zero, but the estimate of $B_1$ becomes
$$\hat B_1 = \begin{pmatrix} .0681 & .1100 & .0115\\ .4288 & .1472 & .4164\\ .1054 & .1136 & .1919\end{pmatrix}.$$
Finally, we run a post-Lasso RRR estimation based on the cointegration rank and lagged differences selected in the above GLS shrinkage estimation. The RRR estimates are the following:
$$\Delta Y_t = \begin{pmatrix} .026 & -.022\\ .082 & -.026\\ -.012 & .013\end{pmatrix}\begin{pmatrix} .822 & -.555 & -.128\\ -.265 & .378 & -.887\end{pmatrix} Y_{t-1} + \begin{pmatrix} .127 & .028 & .312\\ .598 & -.088 & 1.098\\ .161 & .055 & .364\end{pmatrix}\Delta Y_{t-1} + \hat u_t,$$
where the eigenvalues of the RRR estimate of $\Pi$ are $-0.0262$, $-0.0039$ and 0. To sum up, this empirical implementation of our approach estimates the cointegrating rank $r_o$ to be 2 and selects one lagged difference in the ECM (8.1). These results corroborate the manifestation of co-movement in the three time series $G_t$, $C_t$ and $I_t$ through the presence of two cointegrating vectors in the fitted model, whereas traditional information criteria fail to find any co-movement in the data and set the cointegrating rank to zero.
9 Conclusion

One of the main challenges in any applied econometric work is the selection of a good model for practical implementation. The conduct of inference and model use in forecasting and policy analysis are inevitably conditioned on the empirical process of model selection, which typically leads to issues of post-model selection inference. Adaptive Lasso and bridge estimation methods provide a methodology by which these difficulties may be partly attenuated through simultaneous model selection and estimation, facilitating empirical research in complex models like reduced rank regressions where many selection decisions need to be made to construct a satisfactory empirical model. On the other hand, as indicated in the Introduction, the methods certainly do not eliminate post-shrinkage selection inference issues in finite samples because the estimators carry the effects of the in-built selections.
This paper shows how to use the methodology of shrinkage in a multivariate system to develop an automated approach to cointegrated system modeling that enables simultaneous estimation of the cointegrating rank and autoregressive order in conjunction with oracle-like efficient estimation of the cointegrating matrix and the transient dynamics. As such, the methods offer practical advantages to the empirical researcher by avoiding sequential techniques in which cointegrating rank and transient dynamics are estimated prior to model fitting.
As indicated in the Introduction, sequential methods can encounter obstacles to consistent order estimation even when test size is driven to zero as the sample size $n \to \infty$. For instance, in the model (7.3) considered earlier,
$$\Delta Y_t = \Pi_o Y_{t-1} + \sum_{j=1}^{p} B_{o,j}\Delta Y_{t-j} + u_t,$$
where $p$ is large but fixed, the model selection limitations of standard sequential testing are inevitably worse, although these may be mitigated by orthonormalization, parsimonious encompassing, and other automated devices (Hendry and Krolzig, 2005; Hendry and Johansen, 2013). The methods of the present paper do not require any specific order or format of the lag differences to ensure consistent model selection. As a result, the approach is invariant to permutations of the order of the lag differences. Moreover, the method is easier to implement in empirical work, requires no intensive cross lag search procedures, is automated with data-based tuning parameter selection, and is computationally straightforward.
Various extensions of the methods developed here seem desirable. One rather obvious (and simple) extension is to allow for parametric restrictions on the cointegrating matrix which may relate to theory-induced specifications. Lasso type procedures have so far been confined to parametric models, whereas cointegrated systems are often formulated with some nonparametric elements relating to unknown features of the model. A second extension of the present methodology, therefore, is to semiparametric formulations in which the error process in the VECM is weakly dependent, which is partly considered already in Section 4. Third, it will be interesting and useful, given the growing availability of large dimensional data sets in macroeconomics and finance, to extend the results of the paper to high dimensional VECM systems where the dimension $m$ of the matrix $\Pi_o$ and the lag order $p$ are large. The effects of post-shrinkage inference issues also merit detailed investigation. These matters and other generalizations of the framework will be explored in future work.
10 Appendix
We start with some standard preliminary results and then prove the main results in each of the
sections of the paper in turn, together with various lemmas that are useful in those derivations.
Denote
$$\hat S_{12} = \sum_{t=1}^{n}\frac{Z_{1,t-1}Z_{2,t-1}'}{n}, \qquad \hat S_{21} = \sum_{t=1}^{n}\frac{Z_{2,t-1}Z_{1,t-1}'}{n},$$
$$\hat S_{11} = \sum_{t=1}^{n}\frac{Z_{1,t-1}Z_{1,t-1}'}{n} \qquad\text{and}\qquad \hat S_{22} = \sum_{t=1}^{n}\frac{Z_{2,t-1}Z_{2,t-1}'}{n}.$$
(e) $n^{-1}\sum_{t=1}^{n} u_t Z_{2,t-1}' \to_d \int dB_u B_{w_2}'$.
The quantities in (b), (c), (d), and (e) converge jointly.
Proof of Lemma 10.1. See Johansen (1995) and Cheng and Phillips (2009).
10.2 Proof of Main Results in Section 3

$$\hat\Pi_{1st} = \arg\min_{\Pi\in\mathbb{R}^{m\times m}}\sum_{t=1}^{n}\|\Delta Y_t - \Pi Y_{t-1}\|^2 = \left(\sum_{t=1}^{n}\Delta Y_t Y_{t-1}'\right)\left(\sum_{t=1}^{n}Y_{t-1}Y_{t-1}'\right)^{-1}. \qquad (10.1)$$
The asymptotic properties of $\hat\Pi_{1st}$ and its eigenvalues are described in the following result.
$$(\hat\Pi_{1st} - \Pi_o)\,Q^{-1}D_n^{-1} \to_d (B_{m,1}, B_{m,2}), \qquad (10.2)$$
$$n\big[\lambda_{r_o+1}(\hat\Pi_{1st}),\ldots,\lambda_{m}(\hat\Pi_{1st})\big] \to_d \big[e_{o,1},\ldots,e_{o,m-r_o}\big], \qquad (10.3)$$
where the $e_{o,j}$ ($j = 1,\ldots,m-r_o$) are the solutions of the following determinantal equation
$$\left|\lambda I_{m-r_o} - \int dB_{w_2}B_{w_2}'\left(\int B_{w_2}B_{w_2}'\right)^{-1}\right| = 0. \qquad (10.4)$$
The proof of Lemma 10.2 is in the supplemental appendix of this paper. Lemma 10.2 is useful because the OLS estimate $\hat\Pi_{1st}$ and the related eigenvalue estimates can be used to construct the adaptive penalty in the tuning parameters. The convergence rates of $\hat\Pi_{1st}$ and $\lambda_k(\hat\Pi_{1st})$ are important for delivering consistent model selection and cointegrating rank selection.
Let $P_n$ be the inverse of $Q_n$. We subdivide the matrices $P_n$ and $Q_n$ as $P_n = [P_{\alpha,n}, P_{\alpha_\perp,n}]$ and $Q_n' = [Q_{\alpha,n}', Q_{\alpha_\perp,n}']$, where $Q_{\alpha,n}$ and $P_{\alpha,n}$ are the first $r_o$ rows of $Q_n$ and the first $r_o$ columns of $P_n$ respectively ($Q_{\alpha_\perp,n}$ and $P_{\alpha_\perp,n}$ are defined accordingly). By definition,
$$\hat\Pi_{1st} = P_n\begin{pmatrix}\Lambda_{\alpha,n} & 0\\ 0 & \Lambda_{\alpha_\perp,n}\end{pmatrix}Q_n, \qquad (10.5)$$
where $\Lambda_{\alpha_\perp,n}$ is a diagonal matrix with the ordered last (smallest) $m - r_o$ eigenvalues of $\hat\Pi_{1st}$. Using the results in (10.5), we can define a useful estimator of $\Pi_o$ as
$$\Pi_{n,f} = \hat\Pi_{1st} - P_{\alpha_\perp,n}\Lambda_{\alpha_\perp,n}Q_{\alpha_\perp,n}. \qquad (10.6)$$
This estimator applies a rank reduction to the unrestricted estimate $\hat\Pi_{1st}$ which removes the components in the eigen-representation of the unrestricted estimate that correspond to the smallest $m - r_o$ eigenvalues.
By definition,
$$Q_{\alpha,n}\Pi_{n,f} = Q_{\alpha,n}\hat\Pi_{1st} - Q_{\alpha,n}P_{\alpha_\perp,n}\Lambda_{\alpha_\perp,n}Q_{\alpha_\perp,n} = \Lambda_{\alpha,n}Q_{\alpha,n}, \qquad (10.7)$$
where $\Lambda_{\alpha,n}$ is a diagonal matrix with the ordered first (largest) $r_o$ eigenvalues of $\hat\Pi_{1st}$, and more importantly
$$Q_{\alpha_\perp,n}\Pi_{n,f} = Q_{\alpha_\perp,n}\hat\Pi_{1st} - Q_{\alpha_\perp,n}P_{\alpha_\perp,n}\Lambda_{\alpha_\perp,n}Q_{\alpha_\perp,n} = 0_{(m-r_o)\times m}. \qquad (10.8)$$
From Lemma 10.2.(b), (10.7) and (10.8), we can deduce that $Q_{\alpha,n}\Pi_{n,f}$ is an $r_o \times m$ matrix which is nonzero w.p.a.1 and $Q_{\alpha_\perp,n}\Pi_{n,f}$ is an $(m-r_o)\times m$ zero matrix for all $n$. Moreover,
$$\Pi_{n,f} - \Pi_o = (\hat\Pi_{1st} - \Pi_o) - P_{\alpha_\perp,n}\Lambda_{\alpha_\perp,n}Q_{\alpha_\perp,n}\quad\text{and}\quad(\Pi_{n,f} - \Pi_o)\,Q^{-1}D_n^{-1} = O_p(1). \qquad (10.9)$$
Thus, the estimator $\Pi_{n,f}$ is at least as good as the OLS estimator $\hat\Pi_{1st}$ in terms of its rate of convergence. Using (10.9) we can compare the LS shrinkage estimator $\hat\Pi_n$ with $\Pi_{n,f}$ to establish the consistency and convergence rate of $\hat\Pi_n$.
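For completeness, the construction of the rank-reduced estimator $\Pi_{n,f}$ in (10.6) is easy to code once the eigendecomposition of $\hat\Pi_{1st}$ is available. The Python sketch below is illustrative only; the function name is hypothetical and the handling of complex eigenvalue pairs is a simplification.

```python
import numpy as np

def rank_reduced_pi(Pi_1st, r_o):
    """Construct the estimator Pi_{n,f} in (10.6): drop the components of the
    eigen-representation of Pi_1st associated with the m - r_o smallest
    eigenvalues (ordered by modulus)."""
    eigval, P = np.linalg.eig(Pi_1st)       # Pi_1st = P diag(eigval) Q with Q = P^{-1}
    Q = np.linalg.inv(P)
    order = np.argsort(-np.abs(eigval))     # largest eigenvalues first
    keep = order[:r_o]                       # indices of the r_o retained eigenvalues
    Pi_nf = (P[:, keep] * eigval[keep]) @ Q[keep, :]
    return Pi_nf.real                        # real part kept as an illustrative simplification
```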
Proof of Theorem 3.1. Define
$$V_n(\Pi) = \sum_{t=1}^{n}\|\Delta Y_t - \Pi Y_{t-1}\|^2 + n\sum_{k=1}^{m}\lambda_{r,k,n}\,\|Q_n(k)\Pi\|.$$
We can write
$$\sum_{t=1}^{n}\|\Delta Y_t - \Pi Y_{t-1}\|^2 = \big[y - (Y_{-1}'\otimes I_m)\mathrm{vec}(\Pi)\big]'\big[y - (Y_{-1}'\otimes I_m)\mathrm{vec}(\Pi)\big],$$
where $y = \mathrm{vec}(\Delta Y)$, $\Delta Y = (\Delta Y_1,\ldots,\Delta Y_n)_{m\times n}$ and $Y_{-1} = (Y_0,\ldots,Y_{n-1})_{m\times n}$.
Xn
vec( n;f
b n )0 Yt 1 Yt0 1 Im vec( n;f b n )
t=1
Xn
+2vec( n;f
b n )0 vec Yt 1 u0t
t=1
Xn
+2vec( n;f
b n )0 Yt 1 Yt0 1 Im vec( o n;f )
t=1
m
X h i
n r;k;n jj n;k ( n;f )jj jj n;k ( b n )jj : (10.10)
k=1
n
X Z 1 n
X
2 0
n Yt 1 Yt 1 !d Bu (a)Bu0 (a)da and n 2
Yt 0
1 ut = Op (n 1
): (10.11)
t=1 0 t=1
n;min jj
bn n;f jj
2
2(c1;n + c2;n )jj b n n;f jj dn 0; (10.12)
2
Pn 0
where n;min denotes the smallest eigenvalue of n t=1 Yt 1 Yt 1 ; which is positive w.p.a.1,
Xn
2 0
c1;n = jjn Yt 1 ut jj;
t=1
Xn
2 0
c2;n = m n Yt 1 Yt 1 jj n;f o jj;
t=1
m
X
1
and dn = n r;k;n jj n;k ( n;f )jj. (10.13)
k=1
Under (10.9) and (10.11), c1;n = op (1) and c2;n = op (1). Under (10.7), (10.8) and r;k;n = op (1) for
all k 2 S ,
ro
X
1 1
dn = n r;k;n jj n;k ( n;f )jj = op (n ): (10.14)
k=1
From (10.12), (10.13) and (10.14), it is straightforward to deduce that jj b n n;f jj = op (1). The
consistency of b n follows from the triangle inequality and the consistency of n;f .
n
X n
X 1
1 0 0 1 0
n Yt 1 Yt 1 !p yy = R(1) u R(1) and n Yt 1 ut = Op (n 2 ). (10.15)
t=1 t=1
From the results in (10.10) and (10.15), we get
n;min jj
bn n;f jj
2
2n(c1;n + c2;n )jj b n n;f jj ndn 0 (10.16)
1
Pn 0
where n;min denotes the smallest eigenvalue of n t=1 Yt 1 Yt 1 , which is positive w.p.a.1, c1;n ,
c2;n and dn are de…ned in (10.14). It is clear that nc1;n = op (1) and nc2;n = op (1) under (10.15) and
(10.9), and ndn = op (1) under (10.14). So, consistency of b n follows directly from the inequality
in (10.16), triangle inequality and the consistency of n;f .
1
Denote Bn = (Dn Q) , then when 0 < ro < m, we can use the results in Lemma 10.1 to deduce
that
n
X n
X
0 1
Yt 1 Yt 1 = Q Dn 1 Dn Zt 0 1 0 1
1 Zt 1 Dn Dn Q
t=1 t=1
20 1 3
z1 z1 0
= Bn 4@ R A + op (1)5 Bn0 ;
0 0
Bw2 Bw 2
and thus
n
!
X
vec( n;f
b n) 0
Yt 1 Yt0 1 Im vec( n;f
b n) n;min jj(
bn 2
n;f )Bn jj ; (10.17)
t=1
Pn 0
where n;min is the smallest eigenvalue of Dn t=1 Zt 1 Zt 1 Dn and is positive w.p.a.1. Next
observe that
!
h i0 n
X
vec( n;f
b n ) vec Bn Dn Zt 0
1 ut jj( b n n;f )Bn jje1;n (10.18)
t=1
and
n
!
X
vec( n;f
b n) 0
Yt 1 Yt0 1 Im vec( o n;f ) jj( b n n;f )Bn jje2;n (10.19)
t=1
where
Xn Xn
0 0
e1;n = jjDn Zt 1 ut jj and e2;n = mjjDn Zt 1 Zt 1 Dn jj jj( n;f o )Bn jj: (10.20)
t=1 t=1
Under Lemma 10.1 and (10.9), e1;n = Op (1) and e2;n = Op (1). From (10.10), (10.17), (10.18),
(10.19), we have the inequality
n;min jj(
bn n;f )Bn jj
2
2(e1;n + e2;n )jj( b n n;f )Bn jj ndn 0; (10.21)
which implies
p 1
(bn n;f )Bn = Op (1 + ndn2 ): (10.22)
1 1
bn o = Op (n 2 + dn2 ) = op (1);
m
X h i
r;k;n jj n;k ( n;f )jj jj n;k (
b n )jj
k=1
ro
X h i
r;k;n jj n;k ( n;f )jj jj n;k (
b n )jj
k=1
ro max r;k;n jj
bn n;f jj: (10.23)
k2S
Xn
vec( n;f
b n )0 Yt 1 Yt0 1 Im vec( n;f b n )
t=1
Xn
+2vec( n;f
b n )0 vec Yt 1 u0t
t=1
Xn
+2vec( n;f
b n )0 Yt 1 Yt0 1 Im vec( o n;f )
t=1
nro r;n jj b n n;f jj: (10.24)
n;min jj
bn n;f jj
2
2(c1;n + c2;n + n 1
ro r;n )jj
bn n;f jj 0 (10.25)
where under (10.11) c1;n = Op (n 1) and c2;n = Op (n 1 ). We deduce from the inequality (10.25)
and (10.9) that
bn o = Op (n 1
+n 1
r;n ): (10.26)
When ro = m, we use (10.24) to obtain
n;min jj
bn n;f jj
2
2n(c1;n + c2;n + n 1
ro r;n )jj
bn n;f jj 0 (10.27)
Pn 1 1
where nc1;n = jj n1 0
t=1 Yt 1 ut jj = Op (n 2 ) and nc2;n = Op (n 2 ) by Lemma 10.1 and (10.9). The
inequality (10.27) and (10.9) lead to
1
bn o = Op (n 2 + r;n ): (10.28)
When 0 < ro < m, we can use the results in (10.17), (10.18), (10.19), (10.20) and (10.24) to
deduce that
Pn 0
where e1;n = kDn Q t=1 Yt 1 ut k = Op (1) and e2;n = Op (1) by Lemma 10.1 and (10.9). By the
de…nition of Bn ,
1
jj( n;f
b n )Bn B 1
jj cn 2 jj( n;f
b n )Bn jj (10.30)
n
where c is some …nite positive constant. Using (10.29), (10.30) and (10.9), we get
1
(bn o )Bn = Op (1 + n 2 r;n ) (10.31)
Proof of Theorem 3.3. To facilitate the proof, we rewrite the LS shrinkage estimation problem
as n
X Xm
Tbn = arg min k Yt Pn T Yt 1k
2
+n r;k;n k n;k (Pn T )k : (10.32)
T 2Rm m k=1
t=1
By de…nition, b n = Pn Tbn and Tbn = Qn b n for all n. Under (3.6) and (3.7),
0 1 0 1
Q ;n nb b
Q ;n 1st
Tbn = @ A=@ A + op (1): (10.33)
b
Q ? ;n n b
Q ? ;n 1st
The result in (3.8) follows if we can show that the last $m - r_o$ rows of $\hat T_n$ are estimated as zeros w.p.a.1.
By de…nition, n;k (Pn T ) = Qn (k)Pn T = T (k) and the problem in (10.32) can be rewritten as
n
X Xm
Tbn = arg min k Yt Pn T Yt 1k
2
+n r;k;n kT (k)k ; (10.34)
T 2Rm m k=1
t=1
which has the following Karush-Kuhn-Tucker (KKT) optimality conditions
8 Pn Tbn (k)
< 1
t=1 ( Yt Pn Tbn Yt 0 0
1 ) Pn (k)Yt 1 = r;k;n
if Tbn (k) 6= 0
n 2 jjTbn (k)jj
Pn ; (10.35)
: 1
Pn Tbn Yt 0 0 r;k;n
if Tbn (k) = 0
n t=1 ( Yt 1 ) Pn (k)Yt 1 2
for $k = 1,\ldots,m$. Conditional on the event $\{Q_n(k_o)\hat\Pi_n \neq 0\}$ for some $k_o$ satisfying $r_o < k_o \le m$, we
obtain the following equation from the KKT optimality conditions
1 Xn r;ko ;n
( Yt Pn Tbn Yt 0 0
1 ) Pn (ko )Yt 1 = : (10.36)
n t=1 2
The sample average in the left hand side of (10.36) can be rewritten as
n
1X
( Yt Pn Tbn Yt 0 0
1 ) Pn (ko )Yt 1
n
t=1
n
1X
= [ut ( b n 0 0
o )Yt 1 ] Pn (ko )Yt 1
n
t=1
P Pn
Pn (ko ) nt=1 ut Yt0
0
1 Pn0 (ko )( b n o)
0
t=1 Yt 1 Yt 1
= : (10.37)
n n
and
Pn
Pn0 (ko )( b n o)
0
t=1 Yt 1 Yt 1
n Pn 0
1 Dn t=1 Zt 1 Zt 1
= Pn0 (ko )( b n o )Q
1
Dn Q0 1
= Op (1): (10.39)
n
1 Xn
( Yt Pn Tbn Yt 0 0
1 ) Pn (ko )Yt 1 = Op (1): (10.40)
n t=1
r;ko ;n
By the assumption on the tuning parameters, we have 2 !p 1, which together with the results
in (10.36) and (10.40) implies that
$$\Pr\big(Q_n(k_o)\hat\Pi_n = 0\big) \to 1 \quad\text{as } n \to \infty.$$
As the above result holds for any $k_o$ such that $r_o < k_o \le m$, this finishes the proof.
Proof of Theorem 3.5. From Corollary 3.4, for large enough $n$ the shrinkage estimator $\hat\Pi_n$ can be decomposed as $\hat\alpha_n\hat\beta_n'$ w.p.a.1, where $\hat\alpha_n$ and $\hat\beta_n$ are some $m \times r_o$ matrices. Without loss of generality, we assume the first $r_o$ columns of $\Pi_o$ are linearly independent. To ensure identification, we normalize $\beta_o$ as $\beta_o = [I_{r_o}, O_{r_o}]'$, where $O_{r_o}$ is some $r_o \times (m - r_o)$ matrix such that
$$\Pi_o = \alpha_o\beta_o' = [\alpha_o,\ \alpha_o O_{r_o}]. \qquad (10.41)$$
Hence $\alpha_o$ is the first $r_o$ columns of $\Pi_o$, which is an $m \times r_o$ matrix with full rank, and $O_{r_o}$ is uniquely determined by the equation $\alpha_o O_{r_o} = \Pi_{o,2}$, where $\Pi_{o,2}$ denotes the last $m - r_o$ columns of $\Pi_o$.
0 0 0 0 0
1;o;? + 2;o;? Oro = 0 and 1;o;? 1;o;? + 2;o;? 2;o;? = Im ro (10.42)
1
0 0 0
1;o;? = 2;o;? Oro and 2;o;? = (Im ro + Or0 o Oro ) 2 : (10.43)
1
From Theorem 3.2 and n 2 r;n = op (1), we have
p
Op (1) = ( b n o )Q
1
Dn 1 = ( b n o) n 0 1 0
o ( o o ) ; n o;? ( o;? o;? )
1
(10.44)
p
Op (1) = n( b n 0
o) o( o o)
1
p h i
= n (b n b0 b
o) n + o( n
0 0 1
o) o( o o) (10.45)
and
0
nb n b n o
0
o;? ( o;? o;? )
1
= Op (1): (10.46)
h i
0 bn 0 1
Op (1) = o bn n(O Oro ) 2;o;? ( o;? o;? )
which implies that
1 1
bn
n(O Oro ) = 0
+ op (1) Op (1)( 0
+ Or0 o Oro ) 2 = Op (1)
o o o;? o;? )(Im ro (10.47)
where 0
o bn = 0
o o + op (1) is by the consistency of b n . By the de…nition of b n , (10.47) means that
n( b
n o) = Op (1), which together with (10.45) implies that
p h p i
1
n (b n o ) = Op (1) o n( b n o)
0
o
0
o o + op (1) = Op (1): (10.48)
From Corollary 3.4, we can deduce that b n and b n minimize the following criterion function
w.p.a.1
n
X ro
X
0 2 0
Vn ( ; ) = Yt Yt 1 +n r;k;n jj n;k ( )jj: (10.49)
t=1 k=1
p 0 h i
De…ne U1;n = n (b n o) and U3;n = n b n o
bn
= 0ro ; n O Oo 0ro ; U2;n , then
0
bn o Q 1
Dn 1 = b n bn o + (b n o)
0
o Q 1
Dn 1
h 1
i
0 1 0 1
= n 2 b n U3;n o( o o) + U1;n ; b n U3;n o;? ( o;? o;? ) :
De…ne
h 1
i
0 1 0 1
n (U ) = n 2 b n U3 o( o o) + U1 ; b n U3 o;? ( o;? o;? ) ;
where U3 = [0ro ; U2 ]. Then by de…nition, Un = U1;n ; U2;n minimizes the following criterion
function w.p.a.1
n
X 2 2
Vn (U ) = k Yt o Yt 1 n (U )Dn Zt 1 k k Yt o Yt 1 k
t=1
ro
X
+n r;k;n [jj n;k ( n (U )Dn Q + o )jj jj n;k ( o )jj] :
k=1
1
n (U )Dn Q = Op (n 2 ):
Hence, from the triangle inequality, we can deduce that for all k 2 S
uniformly over U 2 K.
From (10.48),
0 1
n (U ) !p U1 ; o U3 o;? ( o;? o;? ) 1 (U ) (10.51)
n
X 2 2
k Yt o Yt 1 n (U )Dn Zt 1 kE k Yt o Yt 1 kE
t=1
n
!
0
X
0
= vec [ n (U )] Dn Zt 1 Zt 1 Dn Im vec [ n (U )]
t=1
n
!
0
X
2vec [ n (U )] vec ut Zt0 1 Dn
t=1
20 1 3
z1 z1 0
1 (U )] 4@ A Im 5 vec [
0
! d vec [ R 1 (U )]
0 0
Bw2 Bw 2
0
2vec [ 1 (U )] vec [(V1;m ; V2;m )] V (U ) (10.52)
R 0
uniformly over U 2 K, where V1;m N (0; u z1 z1 )and V2;m Bw2 dBu0 .
h i
By de…nition 1 (U ) = U1 ; o U2 2;o;? ( 0o;? o;? ) 1 , thus
1 0 0
vec [ 1 (U )] = vec(U1 )0 ; vec( 0
o U2 2;o;? ( o;? o;? ) )
and
0 1 0 1 0
vec( o U2 2;o;? ( o;? o;? ) ) = ( o;? o;? ) 2;o;? o vec(U2 ):
V (U ) = vec(U1 )0 [ z1 z1 Im ] vec(U1 )
Z
0 0 1 0 0 1 0 0
+vec(U2 ) 2;o;? ( o;? o;? ) Bw2 Bw 2
( o;? o;? ) 2;o;? o o vec(U2 )
The expression in (10.53) makes it clear that V (U ) is uniquely minimized at
h i
0 1
U1 ; U2 ( o;? o;? ) 2;o;?
where
0 1 0
U1 = Bm;1 and U2 = ( o o) o Bm;2 . (10.54)
From (10.47) and (10.48), we can see that Un is asymptotically tight. Invoking the Argmax Con-
tinuous Mapping Theorem (ACMT), we can deduce that
h i
0 1
Un = (U1;n ; U2;n ) !d U1 ; U2 ( o;? o;? ) 2;o;?
bn Q 1
Dn 1 !d Bm;1 0 1 0B :
o o( o o) o m;2
Proof of Corollary 3.6. The consistency, convergence rate and super e¢ ciency of b g;n can be
established using similar arguments in the proof of Theorem 3.1, Theorem 3.2 and Theorem 3.3.
Under the super e¢ ciency of b g;n , the true rank ro is imposed on b g;n w.p.a.1. Thus for large
0
enough n, the GLS shrinkage estimator b g;n can be decomposed as b g;n b w.p.a.1, where b g;n g;n
and b g;n are some m ro matrices and they minimize the following criterion function w.p.a.1
n
X ro
X
0 0 b 1 0 0
Yt Yt 1 u;n Yt Yt 1 +n r;k;n jj n;k ( )jj: (10.55)
t=1 k=1
0
o = o o =[ o; o Oro ] and o = [Iro ; Oro ]0
where Oro is some ro (m ro ) matrix uniquely determined by the equation o Oro = o;2 , where
o;2 denotes the last m ro columns of o.
p h i
De…ne U1;n = n (b g;n b
o ) and U3;n = n( g;n ) ro
bg;n
0 = 0 ;n O Oo 0ro ; U2;n ,
o
then
0
bn o Q 1
Dn 1 = b g;n b g;n o + (b g;n o)
0
o Q 1
Dn 1
h 1
i
0 1 0 1
= n 2 b g;n U3;n o( o o) + U1;n ; b g;n U3;n o;? ( o;? o;? ) :
De…ne
h 1
i
0 1 0 1
n (U ) = n 2 b g;n U3 o( o o) + U1 ; b g;n U3 o;? ( o;? o;? ) ;
then by de…nition, Un = U1;n ; U2;n minimizes the following criterion function w.p.a.1
n h
X i
0 b 1
Vn (U ) = (ut n (U )Dn Zt 1) u;n (ut n (U )Dn Zt 1) u0t b u;n
1
ut
t=1
ro
X
+n r;k;n [jj n;k ( n (U )Dn Q + o )jj jj n;k ( o )jj] : (10.56)
k=1
Following similar arguments in the proof of Theorem 3.5, we can deduce that for any k 2 S
and
n
X n
X
(ut n (U )Dn Zt 1 )
b 1 (ut 0
n (U )Dn Zt 1 ) u0t b u;n
1
ut
u;n
t=1 t=1
0 1
! d vec(U1 ) z1 z1 u vec(U1 )
Z
+vec(U2 )0 0
2;o;? ( o;? o;? )
1 0
Bw2 Bw 2
( 0
o;? o;? )
1 0
2;o;?
0
o u
1
o vec(U2 )
2vec(U1 )0 vec u
1
V1;m 2vec(U2 )0 vec 0
o u
1
V2;m ( 0
o;? o;? )
1 0
2;o;?
V (U ) (10.58)
Z 1
0 1 1 0 1 0 0 1 1
Ug;2 = ( o u o) o u V2;m Bw2 Bw 2
( o;? o;? ) 2;o;? :
Invoking the ACMT, we obtain
0
b g;n o Q 1
Dn 1 = b g;n b g;n o + (b g;n o)
0
o Q 1
Dn 1
" Z #
1
1 0 1 1 0 1 0
!d V1;m z1 z1 ; o( o u o) o u V2;m Bw2 Bw 2
:
(10.59)
Note that
0 1 1 0 1 0 0 1 1 0 0 1
( o u o) o u = ( oQ e
u Q o) oQ e
u Q
0 0 1 0 1
= ( o o) u
e (11)( o o ) [( e Q
o o ); 0] u
0 1 1 0 0
= ( o o) e
u (11) e (11) o
u + e (12) o;?
u : (10.60)
Under 1
e (12)
u = e (11) w1 w2
u w2 w2 ,
0 1 1 0 1 0 1 0 1 0
( o u o) o u =( o o) ( o w1 w2 w2 w2 o;? ): (10.61)
b g;n 1 R 0 R 1
o Q Dn 1 !d Bm;1 0
o( o o)
1 Bw2 dBu0 w2 0
Bw2 Bw 2
:
The following lemma is useful in establishing the asymptotic properties of the shrinkage estimator
with weakly dependent innovations.
Lemma 10.3 Under Assumptions 3.2 and 4.1, (a), (b) and (c) of Lemma 10.1 are unchanged,
while Lemma 10.1.(d) becomes
n
X
1
0
n 2 ut Z1;t 1 uz1 (1) !d N (0; Vuz1 ); (10.62)
t=1
P1 0
where uz1 (1) = j=0 uu (j) o Rj < 1 and Vuz1 is the long run variance matrix of ut Z1;t 1;
n
X Z 0
1 0
n ut Z2;t 1 !d Bw2 dBu0 +( uu uu ) o? : (10.63)
t=1
The proof of Lemma 10.3 is in the supplemental appendix of this paper. Let P1 = (P11 ; P12 ) be
the orthonormalized right eigenvector matrix of 1 and 1 be a r1 r1 diagonal matrix of nonzero
eigenvalues of 1, where P11 is an m r1 matrix (of eigenvectors of nonzero eigenvalues) and P12
is an m (m r1 ) matrix (of eigenvectors of zero eigenvalues). By the eigenvalue decomposition,
0 10 1
1 0 Q11
1 = (P11 ; P12 ) @ A@ A = P11 1 Q11 (10.64)
0 0m r1 Q12
which implies that Q11 P11 = Ir1 . From (10.64), without loss of generality, we can de…ne e 1 = P11
and e 1 = Q011 1 . By (10.65), we deduce that
0
which imply that e 1 e 1 and e 01 e 1 are nonsingular r1 r1 matrix. Without loss of generality, we let
0
e 1? = P12 and e 1? = Q012 , then e 1? e 1? = Im r1 and under (10.65),
e 0 e 1 = Q12 P11 = 0
1?
0
which implies that e ? e 1 = 0 as e 1? = ( e ? ; o? ).
Let [ 1 ( b 1st ); :::; m ( b 1st )] and [ 1 ( 1 ); :::; m ( 1 )] be the ordered eigenvalues of b 1st and
1 respectively. For the ease of notation, we de…ne
1 1 0
N1 N (0; Vuz1 ) + uz1 (1) z1 z1 N (0; Vz1 z1 ) z1 z1 o
where N (0; Vuz1 ) is a random matrix de…ned in (10.62) and N (0; Vz1 z1 ) denotes the matrix limit
p
distribution of n Sb11 z1 z1 . We also de…ne
Z Z 1
N2 dBu Bu0 +( uu uu ) o?
0
Bw2 Bw 2
0
o? :
The next lemma provides asymptotic properties of the OLS estimate and its eigenvalues when the data are weakly dependent.
Lemma 10.4 Under Assumptions 3.2 and 4.1, we have the following results:
(a) the OLS estimator $\hat\Pi_{1st}$ satisfies
b 1st 1 Q 1
Dn 1 = Op (1) (10.66)
n[ ro +1 (
b 1st ); :::; m(
b 1st )] !d [e0 e0
ro +1 ; :::; m ] (10.67)
0
where ej (j = ro + 1; :::; m) are the ordered solutions of
0 1
uIm ro
0
N2 + N1 e ? e ? N1 e ? e 0 N2 = 0; (10.68)
o? ? o?
p
n[ r1 +1 (
b 1st ); :::; ro (
b 1st )] !d [e0r +1 ; :::; e0r ] (10.69)
1 o
0
where ej (j = r1 + 1; :::; ro ) are the ordered solutions of
uIro r1
e 0 N1 e = 0: (10.70)
? ?
The proof of Lemma 10.4 is in the supplemental appendix of this paper. Recall that Pn is
h i
de…ned as the inverse of Qn . We divide Pn and Qn as Pn = P e ;n ; P e ? ;n and Q0n = Q0e ;n ; Q0e ? ;n ,
where Q e ;n and P e ;n are the …rst r1 rows of Qn and …rst r1 columns of Pn respectively (Q e ? ;n and
P e ? ;n are de…ned accordingly). By de…nition,
where e ? ;n is an diagonal matrix with the ordered last (smallest) m r1 eigenvalues of b 1st . Using
the results in (10.71), we can de…ne a useful estimator of 1 as
By de…nition
Q e ;n e n;f = Q e ;n b 1st Q e ;n P e ? ;n e ? ;n Q e ? ;n = e ;n Q e ;n (10.73)
where e ;n is an diagonal matrix with the ordered …rst (largest) ro eigenvalues of b 1st , and more
importantly
Q e ? ;n e n;f = Q e ? ;n b 1st Q e ? ;n P e ? ;n e ? ;n Q e ? ;n = 0(m r1 ) m : (10.74)
From Lemma 10.4.(b), (10.73) and (10.74), we can deduce that Q e ;n e n;f is a r1 m matrix which
is nonzero w.p.a.1 and Q e ? ;n e n;f is a (m r1 ) m zero matrix for all n. Using (10.71), we can
write
e n;f 1 = ( b 1st 1) P e ? ;n e ? ;n Q e ? ;n
1 p
P e ? ;n Q e ? ;n 1Q Dn 1 = nP e ? ;n Q e ? ;n 1 Q 1
p p
= nP e ? ;n Q e ? ;n b 1st 1 Q 1
+ nP e ? ;n Q e ? ;n b 1st Q 1
p 1
= nP e ? ;n e ? ;n Q e ? ;n Q + Op (1) = Op (1): (10.77)
e n;f 1 Q 1
Dn 1 = Op (1): (10.78)
Comparing (10.76) with (10.78), we see that e n;f is as good as the OLS estimate b 1st in terms of
its rate of convergence.
Proof of Corollary 4.1. First, when $r_o = 0$, then $\Pi_1 = \tilde\alpha_o\tilde\beta_o' = 0 = \Pi_o$. Hence, the consistency of $\hat\Pi_n$ follows by arguments similar to those in the proof of Theorem 3.1. To finish the proof, we only need to consider the scenarios where $r_o = m$ and $r_o \in (0, m)$.
Using the same notation for Vn ( ) de…ned in the proof of Theorem 3.1, by de…nition we have
Vn ( b n ) Vn ( e n;f ), which implies
!
h i0 n
X h i
vec( e n;f b n) Yt 1 Yt0 1 Im vec( e n;f b n)
t=1
" n #
h i0 X n
X
+2 vec( e n;f b n ) vec ut Yt0 1 ( 1 o) Yt 0
1 Yt 1
t=1 t=1
!
h i0 n
X
2 vec( e n;f b n) Yt 0
1 Yt 1 Im vec( e n;f 1)
t=1
(m )
X h i
n r;k;n jj n;k (
e n;f )jj jj n;k ( b n )jj : (10.79)
k=1
n
1X 0 0
Yt 1 Yt 1 !p yy = R(1) u R(1) : (10.80)
n
t=1
n;min jj
bn e n;f jj jj b n e n;f jj(c1n + c2n ) dn 0; (10.81)
1 Pn 0
where n;min denotes the smallest eigenvalue of n t=1 Yt 1 Yt 1 , which is positive w.p.a.1,
Pn 0
Pn 0
t=1 ut Yt 1 t=1 Yt 1 Yt 1
c1n = ( 1 o)
n n
1
!p uy (1) uy (1) yy yy =0 (10.82)
Xn
c2n = m n 1
Yt 0
1 Yt 1 jj e n;f 1 jj = op (1) (10.83)
t=1
m
X h i r1
X
dn = r;k;n jj e
n;k ( n;f )jj jj n;k ( b n )jj r;k;n jj n;k (
e n;f )jj = op (1) (10.84)
k=1 k=1
by Lemma 10.4, (10.74) and r;k;n = op (1) for k = 1; :::; r1 . So the consistency of b n follows directly
from (10.78), the inequality in (10.81) and the triangle inequality.
When 0 < ro < m,
n
!
X
vec( b n e n;f ) 0
Yt 1 Yt0 1 Im vec( b n e n;f )
t=1
n
!
X
= vec( b n e n;f )0 Bn Dn Zt 0 0
1 Zt 1 Dn Bn Im vec( b n e n;f )
t=1
b
n;min jj( n
e n;f )Bn jj2 (10.85)
Pn 0
where n;min denotes the smallest eigenvalue of Dn t=1 Zt 1 Zt 1 Dn which is positive de…nite
w.p.a.1 under Lemma 10.3. Next, note that
( n n
)
X X
ut Zt0 1 ( 1 o )Q
1
Zt 1 Zt0 1 Dn
t=1 t=1
2 Pn 30 2 Pn 30
1 1
n 0 0 1 0 (1)
t=1 Z1;t 1 ut n t=1 Z1;t 1 Z1;t 1 z1 z1
2 2
uz1
= 4 Pn 5 4 Pn 5 : (10.86)
n 1 0 1 0 1 0 (1)
t=1 Z2;t 1 ut n t=1 Z2;t 1 Z1;t 1 z1 z1 uz1
n
X n
X
1 0 1 0 1 0
n Z2;t 1 ut = Op (1) and n Z2;t 1 Z1;t 1 z1 z1 uz1 (1) = Op (1): (10.87)
t=1 t=1
Similarly, we get
n
X
1 1
0 0 1 0
n 2 Z1;t 1 ut uz1 (1) n 2 [Sn;11 z1 z1 ] z1 z1 uz1 (1) = Op (1): (10.88)
t=1
Pn 0 1
Pn 0
De…ne e1n = t=1 ut Zt 1 ( 1 o )Q t=1 Zt 1 Zt 1 Dn , then from (10.86)-(10.88) we
can deduce that e1n = Op (1). By the Cauchy-Schwarz inequality, we have
hX n Xn i
vec( b n e n;f )0 vec ut Yt0 1 ( 1 o) Yt 1 Yt0 1
t=1 t=1
hnX n Xn o i
= vec( b n e n;f )0 vec ut Zt0 1 ( 1 o )Q
1
Zt 1 Zt0 1 Dn Bn0
t=1 t=1
jj( b n e n;f )Bn jje1n : (10.89)
Under Lemma 10.3 and (10.78),
n
!
X
e2n vec( e n;f b n) 0
Yt 1 Yt0 1 Im vec( e n;f 1)
t=1
n
!
X
= vec( e n;f b n )0 Bn Dn Zt 0 0
1 Zt 1 Dn Bn Im vec( e n;f 1)
t=1
Xn
jj( b n e n;f )Bn jj jj( e n;f 1 )Bn jj jjDn Zt 0
1 Zt 1 Dn jj = Op (1):
t=1
(10.90)
n;min jj(
bn e n;f )Bn jj2 2jj( b n e n;f )Bn jj2 (e1n + e2n ) dn 0 (10.91)
where dn = op (1) by (10.84). Now, the consistency of b n follows by (10.91) and the same arguments
in Theorem 3.1.
Proof of Corollary 4.2. From Lemma 10.4 and Corollary 4.1, we deduce that w.p.a.1
m
X h i
r;k;n jj n;k (
e n;f )jj jj n;k (
b n )jj
k=1
X h i
r;k;n jj e
n;k ( n;f )jj jj n;k ( b n )jj
k2Se
where c > 0 is a generic positive constant. When ro = 0, the convergence rate of b n could be
derived using the same arguments in Theorem 3.2. Hence, to …nish the proof, we only need to
consider scenarios where ro = m or 0 < ro < m.
When ro = m, following similar arguments to those of Theorem 3.2, we get
n;min jj
e n;f b n jj2 cjj e n;f b n jj c1n + c2n + er;n 0; (10.94)
where
n
X n
X
1
c1n = n ut Yt0 1 n 1
( 1 o) Yt 0
1 Yt 1
t=1 t=1
1 1
n
X h 1
i
= n 2 n 2 ut Yt0 1 uy (1)
1
uy (1) z1 z1 n 2 Sb11 z1 z1
t=1
1
= Op (n 2 ) (10.95)
Xn 1
c2n = n 1
Yt 0
1 Yt 1
e n;f 1 = Op (n 2 ) (10.96)
t=1
by Lemma 10.3 and 10.78. From the results in (10.78), (10.94), (10.95) and (10.96), we deduce that
1
bn 1 = Op (n 2 + er;n ): (10.97)
When 0 < ro < m, we can use (10.89) and (10.90) in the proof of Corollary 4.1 and (10.93) and
to get w.p.a.1
n;min jj(
e n;f b n )Bn jj2 2jj( e n;f b n )Bn jj(e1;n + e2;n ) cn n jj e n;f b n jj; (10.98)
where e1;n = Op (1) and e2;n = Op (1) as illustrated in the proof of Corollary 4.1. By the Cauchy-
Schwarz inequality,
1
jj( e n;f b n )Bn B
n
1
jj cn 2 jj( e n;f b n )Bn jj: (10.99)
n;min jj(
e n;f b n )Bn jj2 cjj( e n;f b n )Bn jj(e1;n + e2;n + n 21 er;n ) 0: (10.100)
1
(bn 1 )Bn = (bn e n;f )Bn + ( e n;f 1 )Bn = Op (1 + n 2 er;n );
which …nishes the proof.
Proof of Corollary 4.3. Using arguments similar to those in the proof of Theorem 3.3, we can rewrite the LS shrinkage estimation problem as
n
X Xm
Tbn = arg min k Yt Pn T Yt 1k
2
+n r;k;n kT (k)k : (10.101)
T 2Rm m k=1
t=1
Result in (4.6) is equivalent to Tbn (k) = 0 for any k 2 fro + 1; :::; mg. Conditional on the event
fQn (ko ) b n 6= 0g for some ko satisfying ro < ko m, we get the following equation from the KKT
optimality conditions,
1 Xn r;ko ;n
( Yt Pn Tbn Yt 0 0
1 ) Pn (ko )Yt 1 = : (10.102)
n t=1 2
The sample average in the left hand side of (10.102) can be rewritten as
Pn Pn
t=1 ( Yt Pn Tbn Yt 0 0
1 ) Pn (ko )Yt 1 Pn0 (ko ) t=1 [ut (bn o )Yt
0
1 ]Yt 1
=
" n n n #
n
Pn0 (ko ) X X
= [ut ( 1 o )Yt 1 ]Yt
0
1 (bn 1) Yt 0
1 Yt 1 : (10.103)
n
t=1 t=1
1 Xn
( Yt Pn Tbn Yt 0 0
1 ) Pn (ko )Yt 1 = Op (1): (10.106)
n t=1
While by the assumption on the tuning parameters, r;ko ;n !p 1, which together with the results
in (10.102) and (10.106) implies that
$$\Pr\big(Q_n(k_o)\hat\Pi_n = 0\big) \to 1 \quad\text{as } n \to \infty.$$
As the above result holds for any $k_o$ such that $r_o < k_o \le m$, this finishes the proof.
Let Pro ;n and Qro ;n be the …rst ro columns of Pn and the …rst ro rows of Qn respectively.
Let Pro r1 ;n and Qro r1 ;n be the last ro r1 columns of Pro ;n and the last ro r1 rows of Qro ;n
respectively. Under Lemma 10.4.(c),
Qro r1 ;n
b n Bn = Qro r ;n ( b n b 1st )Bn + Qro r ;n ( b 1st 1 )Bn + Qro r1 ;n 1 Bn
1 1
p
= nQro r1 ;n 1 Q 1 + Op (1)
p p
= nQro r1 ;n ( 1 b 1st )Q 1 + nQro r1 ;n b 1st Q 1 + Op (1)
p
= n ro r1 ;n Qro r1 ;n Q 1 + Op (1) = Op (1) (10.107)
where is a diagonal matrix with the (r1 + 1)-th to the ro -th eigenvalues of b 1st . Let Tb ;n be
ro r1 ;n
h i
the …rst ro rows of Tbn = Qn b n , then Tb ;n = Qro ;n b n . De…ne T 0 ;n = 01 Q0e ;n ; 0m (ro r1 ) , then
2 3
Q e ;n b n 1 Bn
Tb ;n T ;n Bn = 4 5 = Op (1) (10.108)
Qro r1
b
;n n Bn
Proof of Corollary 4.4. Using the results of Corollary 4.3, we can rewrite the LS shrinkage
estimation problem as
n
X Xro
Tbn = arg min k Yt Pn T Yt 1k
2
+n r;k;n kT (k)k (10.109)
T 2Rm m k=1
t=1
with the constraint T (k) = 0 for k = ro + 1; ::; m. Recall that Tb ;n is the …rst ro rows of Tbn , then
the problem in (10.109) can be rewritten as
n
X X ro
Tb ;n = arg min k Yt Pro ;n T Yt 1k
2
+n r;k;n kT (k)k (10.110)
T 2Rro m k=1
t=1
n h
X i
2 2
Vn (U ) = Yt Pro ;n (U Bn 1 + T ;n )Yt 1 k Yt Pro ;n T ;n Yt 1 k
t=1
Xro
+n r;k;n U Bn 1 + T ;n )(k) kT ;n (k)k
k=1
Xro
= V1;n (U ) + n r;k;n U Bn 1 + T ;n )(k) kT ;n (k)k :
k=1
1 1
For any U in some compact subset of Rro m, n 2 U Dn Q = O(1). Thus n 2 er;n = op (1) and
Lemma 10.4.d imply that
1 1
n r;k;n (U Bn 1 + T ;n )(ko ) kT ;n (ko )k n2 r;k;n n 2 (U Bn 1 )(ko ) = op (1) (10.111)
1
for ko = 1; :::; r1 . On the other hand, n 2 r;k;n = op (1) implies that
1 1
n r;k;n (U Bn 1 + T ;n )(ko ) kT ;n (ko )k n2 r;k;n n 2 (U Bn 1 )(ko ) = op (1) (10.112)
where
Xn
An;t (U ) vec (U )0 Bn 1 Yt 0 0 1
1 Yt 1 Bn Pr0o ;n Pro ;n vec (U )
t=1
and
h Xn i
Bn;t (U ) vec (U )0 vec Pr0o ;n ( Yt Pro ;n T 0 0 1
;n Yt 1 ) Yt 1 Bn :
t=1
n n
! 1
X X
Un = (Pr0o ;n Pro ;n ) 1 Pr0o ;n ( Yt Pro ;n T 0
;n Yt 1 ) Yt 1 Yt 1 Yt0 1 Bn
t=1 t=1
h i
= (Pr0o ;n Pro ;n ) 1
Pr0o ;n b 1st T ;n Bn :
By de…nition, Pn = [Pro ;n ; Pm ro ;n ], where Pro ;n and Pm ro ;n are the right normalized eigenvec-
tors of the largest ro and smallest m ro eigenvalues of b 1st respectively. From Lemma 10.4.(c)
and (d), we deduce that Pr0o ;n Pm ro ;n = 0 w.p.a.1. Thus, we can rewrite Un as
h i
Un = (Pr0o ;n Pro ;n ) 1
Pr0o ;n Pn Qn b 1st T ;n Bn = Qro ;n b 1st T ;n Bn
w.p.a.1. Results in (10.111) and (10.112) imply that un = Un +op (1). Thus the limiting distribution
of the last ro r1 rows of un is identical to the limiting distribution of the last ro r1 rows of Un .
Let Uro r1 ;n be the last ro r1 rows of Un , then by de…nition
Qro r1 ;n
b n Bn = U + op (1) = ro r1 ;n Qro r1 ;n Bn + op (1) (10.113)
ro r1 ;n
h i
where ro r1 ;n diag b
r1 +1 ( 1st ); :::; ro ( b 1st . From (10.113) and Lemma 10.4, we obtain
)
1
n 2 Qro r1 ;n
b n = n 21 ro r1 ;n Qro r1 ;n + op (1) = ro r1 (
e0 )Qr r1 ;o + op (1) (10.114)
o
where ro r1 (
e0 ) 0 0
diag(er1 +1 ; :::; ero ) is a non-degenerated full rank random matrix, and Qro r1 ;o
denotes the probability limit of Qro r1 ;n and it is a full rank matrix. From (10.114), we deduce
that
1
lim sup Pr n 2 Qro r1 ;n
bn = 0 = 0
n!1
Lemma 10.5 follows by standard arguments like those in Lemma 10.1 and its proof is omitted.
We next establish the asymptotic properties of the OLS estimator $(\hat\Pi_{1st}, \hat B_{1st})$ of $(\Pi_o, B_o)$ and the asymptotic properties of the eigenvalues of $\hat\Pi_{1st}$. The estimate $(\hat\Pi_{1st}, \hat B_{1st})$ has the following closed-form solution
0 1 1
Sb b
S y 1 x0
b 1st ; B
b1st = Sby0 y1 Sby0 x0 @ y1 y1 A ; (10.115)
Sbx y 0 1 Sbx x
0 0
where
n n
1X 1X
Sby0 y1 = Yt Yt0 1 ; Sby0 x0 = Yt Xt0 1;
n n
t=1 t=1
Xn Xn
1 1
Sby1 y1 = Yt 0
1 Yt 1 ; Sby1 x0 = Yt 1 Xt0 1;
n n
t=1 t=1
n
1X
Sbx0 y1 = Sby0 1 x0 and Sbx0 x0 = Xt 1 Xt0 1. (10.116)
n
t=1
Denote Y = (Y0 ; :::; Yn 1 )m n , Y = ( Y1 ; :::; Yn )m n and
c0 = In
M n 1
X 0 Sbx01x0 X,
where X = ( X0 ; :::; Xn 1 )mp n , then b 1st has the explicit partitioned regression representa-
tion
1 1
b 1st = c0 Y 0
YM c0 Y 0
Y M = o
c0 Y 0
+ UM c0 Y 0
Y M ; (10.117)
h R R i
N (0; 1 0 ( B B0 ) 1 ; (10.118)
u z3 z3 ); dBu Bw 2 w2 w2
(c) For 8k = ro + 1; :::; m, the eigenvalues k ( b 1st ) of b 1st satisfy Lemma 10.2.(c).
The proof of Lemma 10.6 is in the supplemental appendix of this paper. Lemma 10.6 is useful,
because the …rst step estimator ( b 1st ; B
b1st ) and the eigenvalues of b 1st are used in the construction
of the penalty function.
Proof of Lemma 5.1. Let = ( ; B) and
n
X Xp 2
Vn ( ) = Yt Yt 1 Bj Yt j
j=1
t=1
Xp Xm
+n b;j;n kBj k +n r;k;n k n;k ( )k :
j=1 k=1
Set b n = ( b n ; B
bn ) and de…ne an infeasible estimator e n = ( n;f ; Bo ), where n;f is de…ned in
(10.6). Then by de…nition
(en o )QB
1 1
Dn;B =( n;f o ; 0)QB
1 1
Dn;B = Op (1) (10.119)
By de…nition Vn ( b n ) Vn ( e n ), so that
n h io0 n h io
vec ( e n b n )QB1 Dn;B
1
Wn vec ( e n b n )QB1 Dn;B1
n h io0 n Xn o
+2 vec ( e n b n )QB1 Dn;B1
vec Dn;B Zt 1 u0t
t=1
n h io0 n h io
+2 vec ( e n b n )QB1 Dn;B1
Wn vec ( o e n )QB1 Dn;B 1
where
n
X
0
Wn = Dn;B Zt 1 Zt 1 Dn;B Im(p+1) ;
t=1
X h i
d1;n = n b;j;n kBo;j k
bn;j jj ;
jjB
j2SB
X h i
d2;n = n r;k;n k n;k ( n;f )k jj n;k ( b n )jj :
k2S
2
n (bn e n )Q
B
1 1
Dn;B (bn e n )Q
B
1 1
Dn;B (c1;n + c2;n ) (d1;n + d2;n ) ;
(10.121)
where n denotes the smallest eigenvalue of Wn , which is bounded away from zero w.p.a.1,
Xn
c1;n = Dn;B Zt 0
1 ut and c2;n = kWn k ( o
e n )Q 1 1
Dn;B : (10.122)
t=1 B
By the de…nition of the penalty function, Lemma 10.6 and the Slutsky Theorem, we …nd that
X
d1;n n b;j;n kBo;j k = Op (n b;n ) and (10.123)
j2SB
X
d2;n n r;k;n k n;k ( n;f )k = Op (n r;n ): (10.124)
k2S
From the inequality in (10.121), the results in (10.123), (10.124) and (10.125), we deduce that
1=2
(bn e n )Q
B
1 1
Dn;B = OP (1 + n1=2 b;n + n1=2 1=2
r;n ):
1=2 1=2
which implies jj b n e n jj = OP (n
= op (1). This shows the consistency of b n .
1=2 + b;n + r;n )
We next derive the convergence rate of the LS shrinkage estimator b n . Using the similar
arguments in the proof of Theorem 3.2, we get
1
jd1;n j cn 2 b;n
bn o QB1 Dn;B
1
(10.126)
and
1
jd2;n j cn 2 r;n
bn o QB1 Dn;B
1
: (10.127)
1
jd1;n + d2;n j cn 2 n
bn o QB1 Dn;B
1
(10.128)
where n = b;n + r;n . From the inequality in (10.121) and the result in (10.128),
2 1
n (bn e n )Q
B
1 1
Dn;B (bn e n )Q
B
1 1
Dn;B (c1;n + c2;n + n 2 n) 0; (10.129)
1
which together with (10.125) implies that ( b n e n )Q
B
1 1
Dn;B = Op (1 + n 2 n ). This …nishes the
proof.
Proof of Theorem 5.1. The first result can be proved using arguments similar to those in the proof of Theorem 3.3. Specifically, we rewrite the LS shrinkage estimation problem as
n
X Xp 2
bn ) =
(Tbn ; B arg min Yt Pn T Yt 1 Bj Yt j
T;B1 ;:::;Bp 2Rm m t=1 j=1
Xm Xp
+n r;k;n kT (k)k +n b;j;n kBj k : (10.130)
k=1 j=1
By de…nition, b n = Pn Tbn and Tbn = Qn b n for all n. Results in (5.8) follows if we can show that
the last m ro rows of Tbn are estimated as zeros w.p.a.1.
The KKT optimality conditions for Tbn are
8 n
> P b n Yt Pp b n b
r;k;n Tn (k)
>
< ( Yt 1 j=1 Bn;j Yt 0 0
j ) Pn (k)Yt 1 = if Tbn (k) 6= 0
t=1 2jjTbn (k)jj
P
n Pp ;
> b n Yt b
>
: n 1 ( Yt 1 j=1 Bn;j Yt
0
j ) Pn (k)Yt
0
1 < r;k;n
2 if Tbn (k) = 0
t=1
for k = 1; :::; m. Conditional on the event fQ ;n (ko )
b n 6= 0g for some ko satisfying ro < ko m,
we obtain the following equation from the KKT optimality conditions
n
X Xp
1 b n Yt bn;j Yt 0 0 r;k;n
n ( Yt 1 B j ) Pn (ko )Yt 1 = : (10.131)
j=1 2
t=1
The sample average in the left hand side of (10.36) can be rewritten as
n
1X b n Yt
Xp
bn;j Yt 0 0
( Yt 1 B j ) Pn (ko )Yt 1
n j=1
t=1
Xn
1
= [ut (bn o )QB
1
Zt 0 0
1 ] Pn (ko )Yt 1
n
t=1
P 1 Pn
Pn (ko ) nt=1 ut Yt0 1
0 Pn0 (ko )( b n o )QB
0
t=1 Zt 1 Yt 1
= = Op (1)
n n
(10.132)
where the last equality is by Lemma 10.5 and Lemma 5.1. However, under the assumptions on the
tuning parameters r;ko ;n !p 1, which together with the results in (10.131) and (10.132) implies
that
Pr Q ;n (ko )
b n = 0 ! 1 as n ! 1:
As the above result holds for any ko such that ro < ko m, this …nishes the proof of (5.8).
We next show the second result. The LS shrinkage estimators of the transient dynamic matrices
satisfy the following KKT optimality conditions:
8
> P
n
b n Yt Pp b n b
b;j;n Bn;j bn;j 6= 0
>
< ( Yt 1 j=1 Bn;j Yt j) Yt0 j = if B
b
2jjBn;j jj
t=1
;
1 P Pp
n b
>
> b n Yt b b;j;n Bn;j bn;j = 0
: n ( Yt 1 j=1 Bn;j Yt j) Yt0 j < bn;j jj if B
t=1 2jjB
bn;j 6= 0m
for any j = 1; :::; p. On the event fB mg
c , we get the following equation
for some j 2 SB
from the optimality conditions,
n 1
1 X Xp n2 b;j;n
n 2 ( Yt b n Yt 1
bn;j Yt
B j) Yt0 j = : (10.133)
j=1 2
t=1
The sample average in the left hand side of (10.133) can be rewritten as
n
X Xp
1
n 2 ( Yt b n Yt 1
bn;j Yt
B j) Yt0 j
j=1
t=1
Xn
1
= n 2 [ut (bn o )QB
1
Zt 1] Yt0 j
t=1
Xn n
X
1 1
= n 2 ut Yt0 j n 2 (bn o )QB
1
Zt 1 Yt0 j = Op (1) (10.134)
t=1 t=1
where the last equality is by Lemma 10.5 and Lemma 5.1. However, by the assumptions on the
1
tuning parameters n 2 b;j;n ! 1, which together with (10.133) and (10.134) implies that
bn;j = 0m
Pr B m ! 1 as n ! 1
Proof of Theorem 5.2. Following arguments similar to those in the proof of Theorem 3.5, we normalize $\beta_o$ as $\beta_o = [I_{r_o}, O_{r_o}]'$ to ensure identification, where $O_{r_o}$ is some $r_o \times (m - r_o)$ matrix such that $\Pi_o = \alpha_o\beta_o' = [\alpha_o, \alpha_o O_{r_o}]$. From Lemma 5.1, we have
1 1
n2 (bn 0
o) o( o o)
1 bn
n 2 (B Bo ) n( b n 0
o ) o;? ( o;? o;? )
1 = Op (1);
bn
n O Oo = Op (1); (10.135)
1
bn
n (B 2 Bo ) = Op (1); (10.136)
1
n 2 (b n o) = Op (1); (10.137)
where (10.135) and (10.137) hold with similar arguments in showing (10.47) and (10.48) in the
proof of Theorem 3.5.
From the results of Theorem 5.1, we deduce that b n , b n and B
bS minimize the following
B
2
n
X X
0
Vn ( S) = Yt Yt 1 Bj Yt j
t=1 j2SB
X X
0
+n r;k;n n;k ( ) +n b;j;n kBj k :
k2S j2SB
p 0 bn
De…ne U1;n = n (b n o) and U2;n = 0ro ; U2;n , where U2;n = n O Oo and U3;n =
p bS
n B B
Bo;SB . Then
h i
bn o
bS
; B Bo;SB QS 1 Dn;S
1
B
h 1
i
0 1 0 1
= n 2 b n U2;n o( o o) + U1;n ; U3;n ; b n U2;n o;? ( o;? o;? ) :
Denote
h 1
i
0 1 0 1
n (U ) = n 2 b n U2 o( o o) + U1 ; U3 ; b n U2 o;? ( o;? o;? ) ;
then by de…nition, Un = U1;n ; U2;n ; U3;n minimizes the following criterion function
n
X 2
Vn (U ) = ut 1
n (U )Dn;S ZS;t 1 kut k2
t=1
X h h i i
1
+n r;k;n n;k n (U )Dn;S QS L1 + o k n;k ( o )k
k2S
X h i
1
+n b;j;n n (U )Dn;S QS Lj+1 + Bo;j kBo;j k :
j2SB
where Lj = diag(Aj;1 ; :::; Aj;dSB +1 ) with Aj;j = Im and Ai;j = 0 for i 6= j and j = 1; :::; dSB +1 .
For any compact set K 2 Rm ro R ro (m ro ) Rm mdSB
and any U 2 K, there is
1
1
n (U )Dn;S QS = Op (n 2 ):
Hence using similar arguments in the proof of Theorem 3.5, we can deduce that
X h h i i
1
n r;k;n n;k n (U )Dn;S QS L1 + o k n;k ( o )k = op (1) (10.138)
k2S
and
X h i
1
n b;j;n n (U )Dn;S QS Lj+1 + Bo;j kBo;j k = op (1) (10.139)
j2SB
uniformly over U 2 K.
Next, note that
0 1
n (U ) !p U1 ; U3 ; o U2 o;? ( o;? o;? ) 1 (U ) (10.140)
uniformly over U 2 K. By Lemma 10.5 and (10.140), we can deduce that
n
X 2
ut 1
n (U )Dn;S ZS;t 1 kut k2
t=1
20 1 3
z3S z3S 0
1 (U )] 4@ A Im 5 vec [
0
! d vec [ R 1 (U )]
0 0
Bw2 Bw 2
0
2vec [ 1 (U )] vec [(V3;m ; V2;m )] V (U ) (10.141)
R 0
uniformly over U 2 K, where V3;m = N (0; u z3S z3S ) and V2;m = Bw2 dBu0 .
Using similar arguments in the proof of Theorem 3.5, we can rewrite V (U ) as
The expression in (10.142) makes it clear that V (U ) is uniquely minimized at (U1 ; U2 ; U3 ), where
(U1 ; U3 ) = V3;m 1 and
z3S z3S
Z 1
0 1 0 0 0 1
U2 = ( o o) o V2;m Bw2 Bw 2
( o;? o;? ) 2;o;? . (10.143)
From (10.135), (10.136) and (10.137), we see that Un is asymptotically tight. Invoking the ACMT,
we deduce that Un !d U . The results in (5.11) follow by applying the CMT.
fr1 + 1; :::; ro g, by Lemma 10.4.(d), jjn 2
1
k(
b 1st )jj! !d jje0 jj! which is a non-degenerated and
k
1+!
1 n 2
r;k;n
n 2
r;k;n = 1 = op (1) (10.146)
jjn 2
k(
b 1st )jj!
References
[1] Anderson, T.W., "Reduced rank regression in cointegrated models", Journal of Econometrics,
106, pp. 203-216, 2002
[2] Athanasopoulos, G., Guillen, O.T.C., Issler, J.V., and Vahid, F., "Model selection, estimation
and forecasting in VAR models with short-run and long-run restrictions", Journal of Econo-
metrics, vol. 164, no. 1, pp. 116-129, 2011
[3] Belloni, A. and V. Chernozhukov, “Least squares after model selection in high-Dimensional
sparse models,” Bernoulli, 19, 521-547, 2013.
[4] Caner, M. and K. Knight, "No country for old unit root tests: bridge estimators differentiate
between nonstationary versus stationary models and select optimal lag", unpublished manu-
script, 2009.
[5] Chao, J. and P.C.B. Phillips, "Model selection in partially nonstationary vector autoregressive
processes with reduced rank structure", Journal of Econometrics, vol. 91, no. 2, pp. 227-271,
1999.
[6] Cheng, X. and P.C.B. Phillips, "Semiparametric cointegrating rank selection", Econometrics
Journal, vol. 12, pp. S83-S104, 2009.
[7] Cheng, X. and P.C.B. Phillips, "Cointegrating rank selection in models with time-varying
variance", Journal of Econometrics, vol. 142, no. 1, pp. 201-211, 2012
[8] Fan J. and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle prop-
erties", Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348-1360, 2001.
[9] Hendry, D. F. and S. Johansen, "Model discovery and Trygve Haavelmo’s legacy", Econometric
Theory, forthcoming.
[10] Hendry, D. F. and H.-M. Krolzig, "The properties of automatic Gets modelling", Economic
Journal, 115, C32-C61, 2005.
[11] Johansen, S., "Statistical analysis of cointegration vectors", Journal of Economic Dynamics
and Control, vol. 12, no. 2-3, pp. 231-254, 1988.
[12] Johansen, S., Likelihood-based inference in cointegrated vector autoregressive models. Oxford
University Press, USA, 1995.
[13] Knight K. and W. Fu, "Asymptotics for lasso-type estimators", Annals of Statistics, vol. 28,
no. 5, pp. 1356-1378, 2000.
[14] Kock, A. and L. Callot, "Oracle inequalities for high dimensional vector autoregressions",
CREATES Research Paper 2012-16, 2012.
[15] Leeb, H. and B. M. Pötscher, "Model selection and inference: facts and fiction", Econometric
Theory, vol. 21, no. 01, pp. 21-59, 2005.
[16] Leeb, H. and B. M. Pötscher, "Sparse estimators and the oracle property, or the return of the
Hodges estimator", Journal of Econometrics, vol. 142, no. 1, pp. 201-211, 2008.
[17] Peng, J., Zhu, J., Bergamaschi, A., Han, W., Noh, D.-Y., Pollack, J. R., and Wang, P., "Regu-
larized multivariate regression for identifying master predictors with application to integrative
genomics study of breast cancer", Annals of Applied Statistics, vol. 4, pp. 53-77, 2010.
[18] Phillips, P. C. B., "Optimal inference in cointegrated systems", Econometrica, vol. 59, no. 2,
pp. 283-306, 1991a.
[19] Phillips, P. C. B., "Spectral regression for cointegrated time series", In W. Barnett, J. Pow-
ell and G. Tauchen (eds.), Nonparametric and Semiparametric Methods in Economics and
Statistics, 413-435. New York: Cambridge University Press, 1991b.
[20] Phillips, P. C. B., "Fully modified least squares and vector autoregression", Econometrica, vol.
63, no. 5, pp. 1023-1078, 1995.
[21] Phillips, P. C. B., "Econometric model determination", Econometrica, vol. 64, no. 4, pp. 763-
812, 1996.
[22] Phillips, P. C. B., "Impulse response and forecast error variance asymptotics in nonstationary
VARs", Journal of Econometrics, 83, 21-56, 1998.
[23] Phillips, P.C.B. and J.W. McFarland, "Forward exchange market unbiasedness: The case of
the Australian dollar since 1984", Journal of International Money and Finance, vol. 16 pp.
885–907, 1997.
[24] Phillips, P. C. B. and V. Solo, "Asymptotics for linear processes", Annals of Statistics, vol.
20, no. 2, pp. 971-1001, 1992.
[25] Song, S. and P. Bickel (2009): "Large vector auto regressions", SFB 649 Discussion Paper
2011-048.
[26] Yuan, M., A. Ekici, Z. Lu and R. Monteiro, "Dimension reduction and coefficient estimation in multivariate linear regression", Journal of the Royal Statistical Society, Series B, vol. 69, pp. 329-346, 2007.
[27] Yuan, M. and Y. Lin, "Model selection and estimation in regression with grouped variables",
Journal of the Royal Statistical Society, Series B, 68, pp. 49-67, 2006
[28] Yuan M., and Y. Lin, "Model selection and estimation in the Gaussian graphical model",
Biometrika, vol. 94, pp. 19-35, 2007.
[29] Zou, H., "The adaptive lasso and its oracle properties", Journal of the American Statistical
Association, vol. 101, no. 476, pp. 1418-1429, 2006.
Table 11.1 Cointegration Rank Selection with Adaptive Lasso Penalty
Model 1
r_o = 0, λ_o = (0, 0)          r_o = 1, λ_o = (0, -0.5)          r_o = 2, λ_o = (-0.6, -0.5)
n = 100 n = 400 n = 100 n = 400 n = 100 n = 400
rbn = 0 0.9588 0.9984 0.0000 0.0002 0.0000 0.0000
rbn = 1 0.0412 0.0016 0.9954 0.9996 0.0000 0.0000
rbn = 2 0.0000 0.0000 0.0046 0.0002 1.0000 1.0000
Model 2
r_o = 0, λ_1 = (0, 0)          r_o = 1, λ_1 = (0, -0.25)          r_o = 2, λ_1 = (-0.30, -0.15)
n = 100 n = 400 n = 100 n = 400 n = 100 n = 400
rbn = 0 0.9882 0.9992 0.0010 0.0000 0.0006 0.0000
rbn = 1 0.0118 0.0008 0.9530 0.9962 0.1210 0.0008
rbn = 2 0.0010 0.0000 0.0460 0.0038 0.8784 0.9992
Replications=5000, !=2, adaptive tuning parameter n given in eqation (6.15). o represents the eigenvalues of the
true matrix o , while 1 represents the eigenvalues of the pseudo true matrix 1 .
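The frequencies reported in Table 11.1 are simply the proportions of the 5,000 replications in which the adaptive lasso procedure selects each candidate rank. The following minimal sketch shows how such a tabulation could be produced; simulate_vecm and select_rank are hypothetical placeholders for the data-generating process of Models 1-2 and for the adaptive-lasso rank selector, not library functions.

import numpy as np

def rank_selection_frequencies(simulate_vecm, select_rank, n, reps=5000, max_rank=2, seed=0):
    # Tabulate the empirical frequency with which each rank 0, ..., max_rank is selected.
    rng = np.random.default_rng(seed)
    counts = np.zeros(max_rank + 1)
    for _ in range(reps):
        sample = simulate_vecm(n, rng)   # one simulated sample of length n
        r_hat = select_rank(sample)      # estimated cointegrating rank for this sample
        counts[r_hat] += 1
    return counts / reps                 # one block of Table 11.1 for a given (model, n) pair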
Table 11.2  Rank Selection and Lagged Order Selection with Adaptive Lasso Penalty

Cointegration Rank Selection
            r_o = 0, λ_o = (0, 0)       r_o = 1, λ_o = (0, -0.5)    r_o = 2, λ_o = (-0.6, -0.5)
            n = 100      n = 400        n = 100      n = 400        n = 100      n = 400
r̂_n = 0     0.9818       1.0000         0.0000       0.0000         0.0000       0.0000
r̂_n = 1     0.0182       0.0000         0.9980       1.0000         0.0000       0.0008
r̂_n = 2     0.0000       0.0000         0.0020       0.0000         1.0000       0.9992

Lagged Difference Selection
            r_o = 0, λ_o = (0, 0)       r_o = 1, λ_o = (0, -0.5)    r_o = 2, λ_o = (-0.6, -0.5)
            n = 100      n = 400        n = 100      n = 400        n = 100      n = 400
p̂_n ∈ T     0.9856       0.9976         0.9960       0.9998         0.9634       1.0000
p̂_n ∈ C     0.0058       0.0004         0.0040       0.0002         0.0042       0.0000
p̂_n ∈ I     0.0086       0.0020         0.0000       0.0000         0.0324       0.0000

Model Selection
            r_o = 0, λ_o = (0, 0)       r_o = 1, λ_o = (0, -0.5)    r_o = 2, λ_o = (-0.6, -0.5)
            n = 100      n = 400        n = 100      n = 400        n = 100      n = 400
m̂_n ∈ T     0.9692       0.9976         0.9942       0.9998         0.9634       0.9992
m̂_n ∈ C     0.0222       0.0004         0.0058       0.0002         0.0042       0.0000
m̂_n ∈ I     0.0086       0.0020         0.0000       0.0000         0.0324       0.0008

Notes: Replications = 5000, ω = 2, adaptive tuning parameters λ_n given in (6.15) and (6.16). λ_o in each column represents the eigenvalues of Π_o. "T" denotes selection of the true lag model, "C" denotes selection of a consistent lag model (i.e., a model with no incorrect shrinkage), and "I" denotes selection of an inconsistent lag model (i.e., a model with incorrect shrinkage).
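The T/C/I classification in Table 11.2 depends only on how the set of lag-difference matrices retained by the shrinkage estimator compares with the set of truly nonzero lag matrices. A minimal sketch of that classification rule, written for illustration under the index-set representation just described (the helper and its arguments are illustrative, not from the paper):

def classify_lag_selection(selected, true_nonzero):
    # "T": exactly the true lags are retained;
    # "C": all true lags are retained but some zero lags survive (no incorrect shrinkage);
    # "I": at least one truly nonzero lag has been shrunk to zero (incorrect shrinkage).
    selected, true_nonzero = set(selected), set(true_nonzero)
    if selected == true_nonzero:
        return "T"
    if true_nonzero <= selected:
        return "C"
    return "I"

For example, if the true model has nonzero lag matrices at lags 1 and 3 only, classify_lag_selection({1, 3}, {1, 3}) returns "T", classify_lag_selection({1, 2, 3}, {1, 3}) returns "C", and classify_lag_selection({3}, {1, 3}) returns "I".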
Table 11.3  Finite Sample Properties of the Shrinkage Estimates

Model 1 with r_o = 0, λ_o = (0.0, 0.0) and n = 100
            Lasso Estimates               OLS                           Oracle Estimates
            Bias      Std       RMSE      Bias      Std       RMSE      Bias      Std       RMSE
Π_11        -0.0005   0.0073    0.0073    -0.0251   0.0361    0.0440    0.0000    0.0000    0.0000
Π_12         0.0000   0.0052    0.0052     0.0005   0.0406    0.0406    0.0000    0.0000    0.0000
Π_21         0.0000   0.0035    0.0035     0.0002   0.0301    0.0301    0.0000    0.0000    0.0000
Π_22         0.0004   0.0069    0.0069    -0.0244   0.0349    0.0426    0.0000    0.0000    0.0000

Model 1 with r_o = 0, λ_o = (0.0, 0.0) and n = 400
            Lasso Estimates               OLS                           Oracle Estimates
            Bias      Std       RMSE      Bias      Std       RMSE      Bias      Std       RMSE
Π_11         0.0000   0.0000    0.0000    -0.0084   0.0118    0.0145    0.0000    0.0000    0.0000
Π_12         0.0000   0.0000    0.0000    -0.0001   0.0101    0.0101    0.0000    0.0000    0.0000
Π_21         0.0000   0.0000    0.0000    -0.0001   0.0134    0.0134    0.0000    0.0000    0.0000
Π_22         0.0000   0.0000    0.0000    -0.0082   0.0116    0.0142    0.0000    0.0000    0.0000
Notes: Replications = 5000, ω = 2, adaptive tuning parameter λ_n given in equation (6.15). λ_o in each column represents the eigenvalues of Π_o. The oracle estimate in this case is simply the 2 by 2 zero matrix.
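The Bias, Std and RMSE columns in Tables 11.3-11.7 are standard Monte Carlo summaries of each element of the estimated coefficient matrices. A minimal sketch, assuming Bias is the mean estimation error, Std the standard deviation of the estimates across replications, and RMSE the root mean squared error (so that RMSE^2 = Bias^2 + Std^2; e.g. for Π_11 under OLS at n = 100, sqrt(0.0251^2 + 0.0361^2) is approximately 0.0440, matching the table):

import numpy as np

def mc_summary(estimates, truth):
    # `estimates` holds the estimate of one coefficient from each replication;
    # `truth` is its true value.
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - truth                      # Bias
    std = estimates.std(ddof=0)                          # Std across replications
    rmse = np.sqrt(np.mean((estimates - truth) ** 2))    # RMSE = sqrt(Bias^2 + Std^2)
    return bias, std, rmse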
Table 11.4  Finite Sample Properties of the Shrinkage Estimates

Model 1 with r_o = 1, λ_o = (0.0, -0.5) and n = 100
            Lasso Estimates               OLS                           Oracle Estimates
            Bias      Std       RMSE      Bias      Std       RMSE      Bias      Std       RMSE
Π_11         0.0032   0.0609    0.0610    -0.0067   0.0551    0.0555    -0.0046   0.0548    0.0550
Π_12        -0.0023   0.0308    0.0308    -0.0066   0.0285    0.0293    -0.0023   0.0275    0.0276
Π_21         0.0015   0.0617    0.0617    -0.0035   0.0478    0.0480    -0.0018   0.0476    0.0477
Π_22        -0.0012   0.0308    0.0308    -0.0045   0.0246    0.0250    -0.0009   0.0238    0.0238

Model 1 with r_o = 1, λ_o = (0.0, -0.5) and n = 400
            Lasso Estimates               OLS                           Oracle Estimates
            Bias      Std       RMSE      Bias      Std       RMSE      Bias      Std       RMSE
Π_11         0.0008   0.0343    0.0343    -0.0027   0.0307    0.0308    -0.0020   0.0306    0.0307
Π_12         0.0004   0.0171    0.0171    -0.0013   0.0155    0.0157    -0.0007   0.0153    0.0154
Π_21        -0.0007   0.0312    0.0312    -0.0025   0.0276    0.0277    -0.0010   0.0275    0.0275
Π_22        -0.0004   0.0156    0.0156    -0.0016   0.0140    0.0140    -0.0003   0.0138    0.0138

Model 1 with r_o = 1, λ_o = (0.0, -0.5) and n = 100
            Lasso Estimates               OLS                           Oracle Estimates
            Bias      Std       RMSE      Bias      Std       RMSE      Bias      Std       RMSE
Q_11         0.0022   0.0833    0.0833     0.0008   0.0728    0.0728    -0.0055   0.0712    0.0714
Q_12        -0.0003   0.0069    0.0069    -0.0130   0.0243    0.0276     0.0000   0.0033    0.0033
Q_21         0.0008   0.0778    0.0779     0.0012   0.0658    0.0658    -0.0046   0.0643    0.0644
Q_22        -0.0003   0.0052    0.0052    -0.0119   0.0220    0.0251     0.0000   0.0004    0.0004

Model 1 with r_o = 1, λ_o = (0.0, -0.5) and n = 400
            Lasso Estimates               OLS                           Oracle Estimates
            Bias      Std       RMSE      Bias      Std       RMSE      Bias      Std       RMSE
Q_11         0.0004   0.0415    0.0415    -0.0003   0.0405    0.0405    -0.0023   0.0401    0.0401
Q_12         0.0000   0.0010    0.0010     0.0000   0.0081    0.0092    -0.0019   0.0010    0.0010
Q_21         0.0000   0.0371    0.0371    -0.0044   0.0368    0.0368     0.0000   0.0364    0.0364
Q_22         0.0000   0.0001    0.0001    -0.0040   0.0073    0.0083     0.0000   0.0001    0.0001

Notes: Replications = 5000, ω = 2, adaptive tuning parameter λ_n given in equation (6.15). λ_o in each column represents the eigenvalues of Π_o. The oracle estimate in this case is the RRR estimate with rank restriction r = 1.
Table 11.5  Finite Sample Properties of the Shrinkage Estimates

Model 1 with r_o = 2, λ_o = (-0.6, -0.5) and n = 100
            Lasso Estimates               OLS                           Oracle Estimates
            Bias      Std       RMSE      Bias      Std       RMSE      Bias      Std       RMSE
Π_11        -0.0228   0.0897    0.0926    -0.0104   0.0934    0.0940    -0.0104   0.0934    0.0940
Π_12         0.0384   0.0914    0.0992    -0.0008   0.0904    0.0904    -0.0008   0.0904    0.0904
Π_21        -0.0247   0.0995    0.1025     0.0016   0.0813    0.0813     0.0016   0.0813    0.0813
Π_22         0.0505   0.1459    0.1544    -0.0099   0.0780    0.0786    -0.0099   0.0780    0.0786

Model 1 with r_o = 2, λ_o = (-0.6, -0.5) and n = 400
            Lasso Estimates               OLS                           Oracle Estimates
            Bias      Std       RMSE      Bias      Std       RMSE      Bias      Std       RMSE
Π_11        -0.0058   0.0524    0.0527    -0.0025   0.0523    0.0523    -0.0025   0.0523    0.0523
Π_12         0.0051   0.0545    0.0547     0.0009   0.0508    0.0509     0.0009   0.0508    0.0509
Π_21        -0.0049   0.0546    0.0548    -0.0019   0.0459    0.0459    -0.0019   0.0459    0.0459
Π_22         0.0075   0.0750    0.0754    -0.0037   0.0438    0.0440    -0.0037   0.0438    0.0440

Notes: Replications = 5000, ω = 2, adaptive tuning parameter λ_n given in equation (6.15). λ_o in each column represents the eigenvalues of Π_o. The oracle estimate in this case is simply the OLS estimate.
Table 11.6  Finite Sample Properties of the Shrinkage Estimates

Notes: Replications = 5000, ω = 2, adaptive tuning parameters λ_n given in equations (6.15) and (6.16). λ_o in each column represents the eigenvalues of Π_o. The oracle estimate in this case is simply the OLS estimate assuming that Π_o and B_2o are zero matrices.
Table 11.7  Finite Sample Properties of the Shrinkage Estimates

Model 3 with r_o = 1, λ_o = (0.0, -0.5) and n = 400
            Lasso Estimates               OLS                           Oracle Estimates
            Bias      Std       RMSE      Bias      Std       RMSE      Bias      Std       RMSE
Π_11        -0.0012   0.0653    0.0653    -0.0015   0.0653    0.0653    -0.0006   0.0647    0.0647
Π_21        -0.0005   0.0564    0.0564    -0.0011   0.0563    0.0563    -0.0003   0.0558    0.0558
Π_12        -0.0006   0.0326    0.0326    -0.0009   0.0327    0.0327    -0.0003   0.0324    0.0324
Π_22        -0.0002   0.0282    0.0282    -0.0007   0.0282    0.0282    -0.0002   0.0279    0.0279
B_{1,11}    -0.1086   0.0536    0.1211    -0.0028   0.0572    0.0572    -0.0022   0.0532    0.0533
B_{1,21}    -0.0766   0.0432    0.0880    -0.0024   0.0490    0.0491    -0.0021   0.0461    0.0462
B_{1,12}    -0.0351   0.0660    0.0747    -0.0019   0.0769    0.0769    -0.0022   0.0727    0.0728
B_{1,22}    -0.0281   0.0643    0.0702    -0.0018   0.0672    0.0672    -0.0019   0.0633    0.0633
B_{2,11}     0.0000   0.0000    0.0000    -0.0010   0.0438    0.0438     0.0000   0.0000    0.0000
B_{2,21}     0.0000   0.0000    0.0000    -0.0012   0.0378    0.0378     0.0000   0.0000    0.0000
B_{2,12}     0.0000   0.0000    0.0000    -0.0015   0.0789    0.0789     0.0000   0.0000    0.0000
B_{2,22}     0.0000   0.0000    0.0000    -0.0005   0.0674    0.0674     0.0000   0.0000    0.0000
B_{3,11}    -0.1206   0.0336    0.1252    -0.0032   0.0424    0.0425    -0.0023   0.0375    0.0375
B_{3,21}    -0.0825   0.0295    0.0876    -0.0029   0.0373    0.0374    -0.0021   0.0327    0.0328
B_{3,12}    -0.1010   0.0388    0.1082    -0.0020   0.0701    0.0701    -0.0017   0.0523    0.0523
B_{3,22}    -0.0730   0.0460    0.0862    -0.0029   0.0611    0.0611    -0.0020   0.0461    0.0462

Notes: Replications = 5000, ω = 2, adaptive tuning parameters λ_n given in equations (6.15) and (6.16). λ_o in each column represents the eigenvalues of Π_o. The oracle estimate in this case refers to the RRR estimate with r = 1 and the restriction that B_2o = 0.
Notes: Replications = 5000, ω = 2, adaptive tuning parameters λ_n given in equations (6.15) and (6.16). λ_o in each column represents the eigenvalues of Π_o. The oracle estimate in this case is simply the OLS estimate with the restriction that B_2o = 0.