
Biometrika (2014), 101, 1, pp. 103–120. doi: 10.1093/biomet/ast059
Printed in Great Britain. Advance Access publication 12 February 2014

Sparse precision matrix estimation via lasso penalized D-trace loss

BY TENG ZHANG
Department of Mathematics, Princeton University, Fine Hall, Washington Rd, Princeton, New Jersey 08544, U.S.A.
[email protected]

AND HUI ZOU
School of Statistics, University of Minnesota, 224 Church St SE, Minneapolis, Minnesota 55455, U.S.A.
[email protected]

SUMMARY
We introduce a constrained empirical loss minimization framework for estimating high-dimensional sparse precision matrices and propose a new loss function, called the D-trace loss, for that purpose. A novel sparse precision matrix estimator is defined as the minimizer of the lasso penalized D-trace loss under a positive-definiteness constraint. Under a new irrepresentability condition, the lasso penalized D-trace estimator is shown to have the sparse recovery property. Examples demonstrate that the new condition can hold in situations where the irrepresentability condition for the lasso penalized Gaussian likelihood estimator fails. We establish rates of convergence for the new estimator in the elementwise maximum, Frobenius and operator norms. We develop a very efficient algorithm based on alternating direction methods for computing the proposed estimator. Simulated and real data are used to demonstrate the computational efficiency of our algorithm and the finite-sample performance of the new estimator. The lasso penalized D-trace estimator is found to compare favourably with the lasso penalized Gaussian likelihood estimator.

Some key words: Constrained minimization; D-trace loss; Graphical lasso; Graphical model selection; Precision matrix; Rate of convergence.

1. INTRODUCTION
Assume that we have $n$ independent and identically distributed $p$-dimensional random variables. Let $\Sigma^*$ be the population covariance matrix and let $\Omega^* = (\Sigma^*)^{-1}$ denote the corresponding precision matrix. Massive high-dimensional data frequently arise in computational biology, medical imaging, genomics, climate studies, finance and other fields, and it is of both theoretical and practical importance to estimate high-dimensional covariance or precision matrices. In this paper we focus on estimating a sparse precision matrix $\Omega^*$ when the dimension is large. Sparsity in $\Omega^*$ is interesting because each nonzero entry of $\Omega^*$ corresponds to an edge in a Gaussian graphical model for describing the conditional dependence structure of the observed variables (Whittaker, 1990). Specifically, if $x \sim N_p(\mu, \Sigma^*)$, then $\Omega^*_{ij} = 0$ if and only if $x_i \perp\!\!\!\perp x_j \mid \{x_k : k \neq i, j\}$. The construction of Gaussian graphical models has applications in a
wide range of fields, including genomics, image analysis and macroeconomics (Li & Gui, 2006; Wille & Bühlmann, 2006; Dobra et al., 2009; Li, 2009). Meinshausen & Bühlmann (2006) proposed a neighbourhood selection scheme in which one can sequentially estimate the support of each row of $\Omega^*$ by fitting an $\ell_1$ or lasso penalized least squares regression model (Tibshirani, 1996). Yuan (2010) used the Dantzig selector (Candès & Tao, 2007) to replace the lasso penalized least squares in the neighbourhood selection scheme. Peng et al. (2009) proposed a joint neighbourhood estimator using the lasso penalization. Cai et al. (2011) proposed a constrained $\ell_1$ minimization estimator for estimating sparse precision matrices and established its convergence rates under the elementwise $\ell_\infty$ norm and Frobenius norm. A common drawback of the methods mentioned above is that they do not always guarantee that the final estimator is positive definite. One can also use Cholesky decomposition to estimate the precision or covariance matrix, as in Huang et al. (2006). With this approach, a sparse regularized estimator of the Cholesky factor is first derived and then the estimated Cholesky factor is used to construct the final estimator of $\Omega^*$. The regularized Cholesky decomposition approach always gives a positive-semidefinite matrix but does not necessarily produce a sparse estimator of $\Omega^*$.
To the best of our knowledge, the only existing method for deriving a positive-definite sparse precision matrix is via the lasso or $\ell_1$ penalized Gaussian likelihood estimator or its variants. Yuan & Lin (2007) proposed the lasso penalized likelihood criterion and suggested using the maxdet algorithm to compute the estimator. Motivated by Banerjee et al. (2008), Friedman et al. (2008) developed a blockwise coordinate descent algorithm, called the graphical lasso, for solving the lasso penalized Gaussian likelihood estimator. Witten et al. (2011) presented some computational tricks to further boost the efficiency of the graphical lasso. In the literature, the graphical lasso is often used as an alternative name for the lasso penalized Gaussian likelihood estimator. Convergence rates for the graphical lasso have been established by Rothman et al. (2008) and Ravikumar et al. (2011).

The graphical lasso estimator is not confined to the penalized maximum likelihood estimation paradigm, as it remains valid for non-Gaussian data (Ravikumar et al., 2011). To gain a better understanding, we propose a constrained convex optimization framework for estimating large precision matrices, within which the graphical lasso can be viewed as a special case. We further introduce a new loss function, the D-trace loss, which is convex and minimized at $\Sigma^{-1}$. We define a novel estimator as the minimizer of the lasso penalized D-trace loss under the constraint that the solution be positive definite. The D-trace loss is much simpler than the graphical lasso loss, thus permitting a more direct theoretical analysis and offering significant computational advantages. Under a new irrepresentability condition, we prove the sparse recovery property of the new estimator and show through examples that our irrepresentability condition is satisfied while that for the graphical lasso fails. Asymptotically, the new estimator and the graphical lasso have comparable rates of convergence in the elementwise maximum, Frobenius and operator norms. Through simulation, we show that in finite samples the new estimator outperforms the graphical lasso, even when the data are generated from Gaussian distributions.

2. METHODOLOGY
2·1. An empirical loss minimization framework
We begin with some notation and definitions. For a $p \times p$ matrix $X = (X_{i,j}) \in \mathbb{R}^{p \times p}$, its Frobenius norm is $\|X\|_F = (\sum_{i,j} X_{i,j}^2)^{1/2}$. We also use $\|X\|_{1,\mathrm{off}}$ to denote the off-diagonal $\ell_1$ norm: $\|X\|_{1,\mathrm{off}} = \sum_{i \neq j} |X_{i,j}|$. Let $S(p)$ denote the space of all $p \times p$ positive-definite matrices. For any two symmetric matrices $X, Y \in \mathbb{R}^{p \times p}$, we write $X \succeq Y$ when $X - Y$ is positive semidefinite. We use $\mathrm{vec}(X)$ to denote the $p^2$-vector formed by stacking the columns of $X$, and $\langle X, Y \rangle$ means $\mathrm{tr}(XY^{\mathrm{T}})$ throughout the paper.
Suppose that we want to use an $\Omega$ from $S(p)$ to estimate $(\Sigma^0)^{-1}$. We use a loss function $L(\Omega, \Sigma^0)$ for this estimation problem, and we require it to satisfy the following two conditions.

Condition 1. The loss function $L(\Omega, \Sigma^0)$ is a smooth convex function of $\Omega$.

Condition 2. The unique minimizer of $L(\Omega, \Sigma^0)$ occurs at $(\Sigma^0)^{-1}$.

Condition 1 is required for computational reasons, and Condition 2 is needed so that we get the desired precision matrix when the loss function $L(\Omega, \Sigma^0)$ is minimized. It is also important that $L(\Omega, \Sigma^0)$ be constructed directly through $\Sigma^0$, not $(\Sigma^0)^{-1}$, because in practice we need to use its empirical version $L(\Omega, \hat\Sigma^0)$, where $\hat\Sigma^0$ is an estimate of $\Sigma^0$, to compute the estimator of $(\Sigma^0)^{-1}$. With such a loss function in hand, we can construct a sparse estimator of $(\Sigma^0)^{-1}$ via the convex program

$$\arg\min_{\Omega \in S(p)} L(\Omega, \hat\Sigma) + \lambda_n \|\Omega\|_{1,\mathrm{off}}, \qquad (1)$$

where $\hat\Sigma$ denotes the sample covariance matrix and $\lambda_n > 0$ is the $\ell_1$ penalization parameter.

The graphical lasso can be seen as an application of the empirical loss minimization framework, defined as

$$\arg\min_{\Omega \in S(p)} \langle \Omega, \hat\Sigma \rangle - \log\det(\Omega) + \lambda_n \|\Omega\|_{1,\mathrm{off}}. \qquad (2)$$

Yuan & Lin (2007) proposed this estimator by following the penalized maximum likelihood estimation paradigm: $\langle \Omega, \hat\Sigma \rangle - \log\det(\Omega)$ corresponds to the negative loglikelihood function of the multivariate Gaussian model. Comparing (2) to (1), we see that the graphical lasso is an empirical loss minimizer where the loss function is $L_G(\Omega, \Sigma^0) = \langle \Omega, \Sigma^0 \rangle - \log\det(\Omega)$. One can verify that $L_G(\Omega, \Sigma^0)$ satisfies Conditions 1 and 2. Although $L_G(\Omega, \Sigma^0)$ has dual interpretations, it has been shown that the graphical lasso provides a consistent estimator even when the data do not follow a multivariate Gaussian distribution (Ravikumar et al., 2011). Thus, the empirical loss minimization view of the graphical lasso is more fundamental and can better explain its broader successes with non-Gaussian data.

2·2. A new estimator


From the empirical loss minimization viewpoint, $L_G$ is not the most natural and convenient loss function for precision matrix estimation because of the log-determinant term. We show in this paper that there is a much simpler loss function than $L_G$ for estimating sparse precision matrices. The new loss function is

$$L_D(\Omega, \Sigma^0) = \frac{1}{2}\langle \Omega^2, \Sigma^0 \rangle - \mathrm{tr}(\Omega). \qquad (3)$$

As $L_D$ is expressed as the difference of two trace operators, we call it the D-trace loss. We first verify that $L_D$ satisfies the two conditions above. To check Condition 1, observe that

$$L_D(\Omega_1, \Sigma^0) + L_D(\Omega_2, \Sigma^0) - 2L_D\!\left(\frac{\Omega_1 + \Omega_2}{2}, \Sigma^0\right) = \left\langle \left(\frac{\Omega_1 - \Omega_2}{2}\right)^{2}, \Sigma^0 \right\rangle \geq 0.$$

To check Condition 2, we show that the derivative of (3) is $(\Sigma^0\Omega + \Omega\Sigma^0)/2 - I$ and that the Hessian of $L_D$ can be expressed as $(\Sigma^0 \otimes I + I \otimes \Sigma^0)/2$, where $\otimes$ denotes the Kronecker product. Since $\Sigma^0$ is positive definite, the Hessian has only positive eigenvalues (see, e.g., Pease, 1965, § XIV.7) and so is positive definite. It is then easy to see that $\Omega = (\Sigma^0)^{-1}$ is the unique minimizer of $L_D(\Omega, \Sigma^0)$ as a function of $\Omega$.
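To make the Condition 2 check concrete, the following minimal NumPy sketch evaluates the D-trace loss and its gradient and confirms numerically that the gradient vanishes at $\Omega = (\Sigma^0)^{-1}$; it is our illustration, not code from the paper, and the function names are ours.

```python
import numpy as np

def d_trace_loss(Omega, Sigma):
    """D-trace loss L_D(Omega, Sigma) = <Omega^2, Sigma>/2 - tr(Omega)."""
    return 0.5 * np.trace(Omega @ Omega @ Sigma) - np.trace(Omega)

def d_trace_grad(Omega, Sigma):
    """Gradient of the D-trace loss: (Sigma Omega + Omega Sigma)/2 - I."""
    p = Omega.shape[0]
    return 0.5 * (Sigma @ Omega + Omega @ Sigma) - np.eye(p)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Sigma0 = A @ A.T + 5 * np.eye(5)        # a positive-definite covariance matrix
Omega_star = np.linalg.inv(Sigma0)      # its precision matrix
print(np.max(np.abs(d_trace_grad(Omega_star, Sigma0))))   # numerically zero
```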
We have verified that the D-trace loss is a valid loss function to be used in the empirical loss minimization framework. The corresponding estimator is then defined according to (1):

$$\hat\Omega = \arg\min_{\Omega \in S(p)} \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n \|\Omega\|_{1,\mathrm{off}}, \qquad (4)$$

where $\lambda_n$ is a nonnegative penalization parameter. One can also use the $\ell_1$ norm, $\|\Omega\|_1 = \sum_{i,j} |\Omega_{i,j}|$, in (4). In many applications we know a priori that the smallest eigenvalue of the true precision matrix is at least $\epsilon$, where $\epsilon$ is a certain threshold. We can easily incorporate this into the estimator by considering

$$\hat\Omega = \arg\min_{\Omega \succeq \epsilon I} \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n \|\Omega\|_{1,\mathrm{off}}. \qquad (5)$$

In § 3 we derive an efficient algorithm for solving (5), setting $\epsilon = 10^{-8}$ as the default.
From a computational point of view, $L_D$ is more convenient than $L_G$. We can view the D-trace loss as an analogue of the least squares loss, used in regression, for precision matrix estimation. It is difficult to come up with a simpler loss function than $L_D$ that satisfies Conditions 1 and 2. One might argue that $L_G$ should be the optimal loss function at least for estimating $\Omega^*$ for Gaussian distributions, owing to its likelihood interpretation. However, the conventional wisdom does not necessarily hold true in the empirical loss minimization framework for precision matrix estimation. For simplicity, let $\lambda_n = 0$ and compare the minimizer of the empirical loss with the maximum likelihood estimator when $\hat\Sigma^{-1}$ exists. Then we see that if the loss function satisfies Conditions 1 and 2, the solution in (1) is always $\hat\Sigma^{-1}$, regardless of the actual form of the loss function. This is different from what happens in conventional regression problems, where unpenalized loss functions produce different estimates, such as in the case of Huber's regression versus least squares. In the rest of the paper we study the theoretical and numerical properties of the lasso penalized D-trace loss estimator for estimating sparse precision matrices. We have found that the new estimator enjoys theoretical and empirical advantages over the lasso penalized Gaussian likelihood estimator.
Our estimator has an interesting connection to the constrained $\ell_1$ minimization estimator (Cai et al., 2011) defined through

$$\text{minimize } \sum_{i,j=1}^{p} |\Omega_{ij}| \quad \text{subject to} \quad \max_{i,j} |\hat\Sigma\Omega - I|_{i,j} \leq \lambda_n. \qquad (6)$$

Cai et al. (2011) regularized the diagonal elements of $\Omega$. To simplify the discussion, we can do the same for our estimator by using $\|\Omega\|_1$ in (4); then the penalized $L_D$ estimator is

$$\arg\min_{\Omega \in S(p)} \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n \|\Omega\|_1. \qquad (7)$$
The solution of (7) satisfies

$$\frac{1}{2}(\hat\Sigma\hat\Omega + \hat\Omega\hat\Sigma) - I = \lambda_n \hat Z, \qquad (8)$$

where $\hat Z$ represents the subgradient, taking values in $[-1, 1]$. Therefore, following the derivation of the Dantzig selector (Candès & Tao, 2007), we can relax (8) and drop the positive-definiteness constraint to define a constrained minimization estimator through

$$\text{minimize } \sum_{i,j=1}^{p} |\Omega_{ij}| \quad \text{subject to} \quad \max_{i,j} \left|\frac{1}{2}(\hat\Sigma\Omega + \Omega\hat\Sigma) - I\right|_{i,j} \leq \lambda_n. \qquad (9)$$

Comparing (9) and (6), we see that the Dantzig version of the penalized $L_D$ estimator is very similar to the estimator of Cai et al. (2011). An important difference between (9) and (6) is that the solution of (9) is guaranteed to be symmetric, which is not the case for (6).
A referee called our attention to an unpublished manuscript by Liu and Luo, available at http://arxiv.org/abs/1203.3896. Let $\theta_k$ be the $k$th column vector of $\Omega$, and let $e_k$ denote a $p$-dimensional vector with 1 in the $k$th coordinate and 0 in all other coordinates. Liu and Luo's estimator is motivated by the fact that the constrained $\ell_1$ minimization estimator in (6) has the following equivalent formulation:

$$\text{minimize } |\theta_k|_1 \quad \text{subject to} \quad |\hat\Sigma\theta_k - e_k|_\infty \leq \lambda_n \quad (k = 1, \ldots, p). \qquad (10)$$

See Lemma 1 of Cai et al. (2011). Liu and Luo's estimator of $\theta_k$ is defined by

$$\arg\min_{\theta_k} \frac{1}{2}\theta_k^{\mathrm{T}}\hat\Sigma\theta_k - e_k^{\mathrm{T}}\theta_k + \lambda_n|\theta_k|_1. \qquad (11)$$

Liu and Luo used a reverse Dantzig selector step to get (11) from (10). A major advantage of doing so is that solving (11) can be computationally more efficient than solving (10). On the other hand, the penalized $L_D$ estimator in (7) can be rewritten as

$$\arg\min_{\Omega = [\theta_1, \ldots, \theta_p] \in S(p)} \sum_{k=1}^{p}\left\{\frac{1}{2}\theta_k^{\mathrm{T}}\hat\Sigma\theta_k - e_k^{\mathrm{T}}\theta_k + \lambda_n|\theta_k|_1\right\}. \qquad (12)$$

Therefore, if we drop the positive-definiteness constraint, (12) reduces to solving (11) for $k = 1, \ldots, p$. The fundamental difference between our estimator and Liu and Luo's estimator is that our method respects the positive-definite nature of a precision matrix, while Liu and Luo's method treats a precision matrix estimation problem as $p$ separate vector estimation problems; their estimator is not even guaranteed to be symmetric.
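Each column problem (11) is an $\ell_1$-penalized quadratic program, so any lasso-type solver applies. The sketch below uses plain proximal gradient purely for illustration; it is not Liu and Luo's algorithm, and the function name and step-size choice are ours.

```python
import numpy as np

def column_lasso(Sigma_hat, k, lam, max_iter=1000):
    """Proximal-gradient sketch for (11): min 0.5*t' Sigma_hat t - t[k] + lam*|t|_1."""
    p = Sigma_hat.shape[0]
    e_k = np.zeros(p)
    e_k[k] = 1.0
    step = 1.0 / np.linalg.eigvalsh(Sigma_hat)[-1]   # 1/L, L = largest eigenvalue of Sigma_hat
    theta = np.zeros(p)
    for _ in range(max_iter):
        grad = Sigma_hat @ theta - e_k               # gradient of the smooth quadratic part
        z = theta - step * grad
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return theta
```

Solving this for $k = 1, \ldots, p$ gives a column-by-column estimate which, as noted above, need not be symmetric or positive definite.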

3. ALGORITHM
3·1. Architecture of the algorithm based on the alternating direction method
In this section we develop an efficient algorithm for solving the constrained optimization
problem in (5), based on the alternating direction method. Before delving into the technical
details, it is interesting to first review the efforts that have been devoted to solving the lasso
penalized Gaussian likelihood estimator. Yuan & Lin (2007) used the maxdet algorithm to com-
pute the lasso penalized Gaussian likelihood estimator, but that algorithm is very slow for
high-dimensional data. Banerjee et al. (2008) and Friedman et al. (2008) developed blockwise
descent algorithms. Duchi et al. (2008) proposed a projected gradient method, and Lu (2009)
proposed a method that involves applying Nesterov’s smooth optimization technique; in both
these papers the authors showed that their algorithms perform faster than blockwise descent
algorithms. More recently, Scheinberg et al. (2010) developed an alternating direction method
for solving the lasso penalized Gaussian likelihood estimator and showed that their method is
faster than the projected gradient method (Duchi et al., 2008) as well as Nesterov’s smooth opti-
mization method (Lu, 2009). Based on previous work, the alternating direction method is the
state-of-the-art algorithm for solving the lasso penalized Gaussian likelihood estimator. In order
to compare the D-trace loss and Gaussian likelihood function in computational terms, we derive an alternating direction method for solving the lasso penalized D-trace estimator and compare the
computational efficiency of the lasso penalized D-trace loss with that of the Gaussian likelihood
estimators, showing that the new estimator is faster.
We introduce two new matrices, $\Theta_0$ and $\Theta_1$, and rewrite (5) as

$$\arg\min_{\Omega,\,\Theta_1 \succeq \epsilon I} \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n \|\Theta_0\|_{1,\mathrm{off}} \quad \text{subject to } [\Omega, \Omega] = [\Theta_0, \Theta_1]. \qquad (13)$$

From (13), we consider the augmented Lagrangian

$$L(\Omega, \Theta_0, \Theta_1, \Lambda_0, \Lambda_1) = \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n \|\Theta_0\|_{1,\mathrm{off}} + h(\Theta_1 \succeq \epsilon I) + \langle \Lambda_0, \Omega - \Theta_0 \rangle + \langle \Lambda_1, \Omega - \Theta_1 \rangle + (\rho/2)\|\Omega - \Theta_0\|_F^2 + (\rho/2)\|\Omega - \Theta_1\|_F^2,$$

where $h(\Theta_1 \succeq \epsilon I)$ is an indicator function defined by

$$h(\Theta_1 \succeq \epsilon I) = \begin{cases} 0, & \Theta_1 \succeq \epsilon I; \\ \infty, & \text{otherwise.} \end{cases}$$

Let $(\Omega^k, \Theta_0^k, \Theta_1^k, \Lambda_0^k, \Lambda_1^k)$ be the solution at step $k$, for $k = 0, 1, 2, \ldots$. We update $(\Omega, \Theta_0, \Theta_1, \Lambda_0, \Lambda_1)$ according to

$$\Omega^{k+1} = \arg\min_{\Omega = \Omega^{\mathrm{T}}} L(\Omega, \Theta_0^k, \Theta_1^k, \Lambda_0^k, \Lambda_1^k), \qquad (14)$$

$$[\Theta_0^{k+1}, \Theta_1^{k+1}] = \arg\min_{\Theta_0 = \Theta_0^{\mathrm{T}},\,\Theta_1 \succeq \epsilon I} L(\Omega^{k+1}, \Theta_0, \Theta_1, \Lambda_0^k, \Lambda_1^k), \qquad (15)$$

$$[\Lambda_0^{k+1}, \Lambda_1^{k+1}] = [\Lambda_0^k, \Lambda_1^k] + \rho[\Omega^{k+1} - \Theta_0^{k+1},\ \Omega^{k+1} - \Theta_1^{k+1}]. \qquad (16)$$

Step (16) is trivial. For (14), we can write

$$\Omega^{k+1} = \arg\min_{\Omega = \Omega^{\mathrm{T}}} \frac{1}{2}\langle \Omega^2, \hat\Sigma + 2\rho I \rangle - \langle \Omega, I + \rho\Theta_0^k + \rho\Theta_1^k - \Lambda_0^k - \Lambda_1^k \rangle.$$

Let $G(A, B)$ denote the solution to the optimization problem

$$\arg\min_{\Omega = \Omega^{\mathrm{T}}} \frac{1}{2}\langle \Omega^2, A \rangle - \langle \Omega, B \rangle, \quad A > 0. \qquad (17)$$

Then we can write

$$\Omega^{k+1} = G(\hat\Sigma + 2\rho I,\ I + \rho\Theta_0^k + \rho\Theta_1^k - \Lambda_0^k - \Lambda_1^k). \qquad (18)$$

The explicit solution to (17) is given in the following theorem.

THEOREM 1. Given any $p$-dimensional symmetric positive-definite matrix $A$ and any $p$-dimensional matrix $B$, let $A = U_A \Sigma_A U_A^{\mathrm{T}}$ be the eigenvalue decomposition of $A$, with ordered eigenvalues $\sigma_1 \geq \cdots \geq \sigma_p$. Define

$$G(A, B) = \arg\min_{\Omega = \Omega^{\mathrm{T}}} \frac{1}{2}\langle \Omega^2, A \rangle - \langle B, \Omega \rangle.$$

Then

$$G(A, B) = U_A\{(U_A^{\mathrm{T}} B U_A) \circ C\} U_A^{\mathrm{T}}, \qquad (19)$$

where $\circ$ denotes the Hadamard product of matrices and $C_{i,j} = 2/(\sigma_i + \sigma_j)$.
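In code, (19) amounts to one symmetric eigendecomposition plus elementwise operations. A minimal NumPy sketch (ours; it assumes $A$ is symmetric positive definite and symmetrizes $U_A^{\mathrm{T}} B U_A$ as a numerical safeguard):

```python
import numpy as np

def G(A, B):
    """Solve argmin over symmetric Omega of 0.5*<Omega^2, A> - <B, Omega>, via Theorem 1."""
    sigma, U = np.linalg.eigh(A)                   # eigenvalues and eigenvectors of A
    C = 2.0 / (sigma[:, None] + sigma[None, :])    # C_ij = 2 / (sigma_i + sigma_j)
    M = U.T @ B @ U
    M = 0.5 * (M + M.T)                            # keep the minimizer exactly symmetric
    return U @ (M * C) @ U.T
```

One can check directly that the output satisfies the stationarity condition $\{A\,G(A,B) + G(A,B)\,A\}/2 = B$ for symmetric $B$.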

To update $\Theta_0^{k+1}$, from (15) we write

$$\Theta_0^{k+1} = \arg\min_{\Theta_0 = \Theta_0^{\mathrm{T}}} \frac{\rho}{2}\langle \Theta_0^2, I \rangle - \langle \Theta_0, \rho\Omega^{k+1} + \Lambda_0^k \rangle + \lambda_n \|\Theta_0\|_{1,\mathrm{off}}.$$

Let $S(A, \lambda)$ denote the solution to the optimization problem

$$\arg\min_{\Theta_0 = \Theta_0^{\mathrm{T}}} \frac{1}{2}\langle \Theta_0^2, I \rangle - \langle \Theta_0, A \rangle + \lambda \|\Theta_0\|_{1,\mathrm{off}}.$$

Then we can write

$$\Theta_0^{k+1} = S\!\left(\Omega^{k+1} + \frac{1}{\rho}\Lambda_0^k,\ \frac{\lambda_n}{\rho}\right), \qquad (20)$$

where the operator $S$ is defined by

$$S(A, \lambda)_{i,j} = \begin{cases} A_{i,j}, & i = j, \\ A_{i,j} - \lambda, & i \neq j,\ A_{i,j} > \lambda, \\ A_{i,j} + \lambda, & i \neq j,\ A_{i,j} < -\lambda, \\ 0, & i \neq j,\ -\lambda \leq A_{i,j} \leq \lambda. \end{cases}$$

To update $\Theta_1^{k+1}$, we write

$$\Theta_1^{k+1} = \arg\min_{\Theta_1 \succeq \epsilon I} \frac{\rho}{2}\langle \Theta_1^2, I \rangle - \langle \Theta_1, \rho\Omega^{k+1} + \Lambda_1^k \rangle. \qquad (21)$$

For a symmetric matrix $X$ we define the matrix operator $[X]_{+\epsilon}$ as follows: let the eigenvalue decomposition of $X$ be $U_X\,\mathrm{diag}(\lambda_1, \ldots, \lambda_p)\,U_X^{\mathrm{T}}$; then

$$[X]_{+\epsilon} = U_X\,\mathrm{diag}\{\max(\lambda_1, \epsilon), \ldots, \max(\lambda_p, \epsilon)\}\,U_X^{\mathrm{T}}.$$

The solution to (21) is then

$$\Theta_1^{k+1} = \left[\Omega^{k+1} + \frac{\Lambda_1^k}{\rho}\right]_{+\epsilon}.$$
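The operator $[\cdot]_{+\epsilon}$ simply floors the eigenvalues of a symmetric matrix at $\epsilon$; a short sketch (our helper name):

```python
import numpy as np

def eig_floor(X, eps):
    """[X]_{+eps}: raise every eigenvalue of the symmetric matrix X to at least eps."""
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(np.maximum(lam, eps)) @ U.T
```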

We now have all the pieces needed to carry out the alternating direction method for solving
(5). Algorithm 1 summarizes the details.

Algorithm 1. Alternating direction method for solving (5).

Step 1. Initialization: $k = 0$; choose initial values $\Lambda_0^0, \Lambda_1^0$ and $\Theta_0^0 = \Theta_1^0$.

Step 2. Repeat (a)–(d) until convergence:

(a) $k = k + 1$;
(b) use Theorem 1 to compute $G(\hat\Sigma + 2\rho I,\ I + \rho\Theta_0^k + \rho\Theta_1^k - \Lambda_0^k - \Lambda_1^k)$, and set $\Omega^{k+1} = G(\hat\Sigma + 2\rho I,\ I + \rho\Theta_0^k + \rho\Theta_1^k - \Lambda_0^k - \Lambda_1^k)$;
(c) let $\Theta_0^{k+1} = S(\Omega^{k+1} + \Lambda_0^k/\rho,\ \lambda_n/\rho)$ and $\Theta_1^{k+1} = [\Omega^{k+1} + \Lambda_1^k/\rho]_{+\epsilon}$;
(d) let $\Lambda_0^{k+1} = \Lambda_0^k + \rho(\Omega^{k+1} - \Theta_0^{k+1})$ and $\Lambda_1^{k+1} = \Lambda_1^k + \rho(\Omega^{k+1} - \Theta_1^{k+1})$.
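Combining the three updates, one sweep of Algorithm 1 can be sketched in a few lines of NumPy. This is our illustrative re-implementation, not the authors' Matlab code; the initialization, default ρ and stopping rule are simple choices made for the sketch, and it reuses the helpers `G`, `soft_threshold_offdiag` and `eig_floor` defined above.

```python
import numpy as np

def dtrace_admm(Sigma_hat, lam, eps=1e-8, rho=1.0, max_iter=500, tol=1e-7):
    """Alternating direction sketch of Algorithm 1 for the penalized D-trace estimator (5)."""
    p = Sigma_hat.shape[0]
    I = np.eye(p)
    Omega = np.diag(1.0 / np.diag(Sigma_hat))   # simple diagonal starting value
    Theta0, Theta1 = Omega.copy(), Omega.copy()
    Lam0, Lam1 = np.zeros((p, p)), np.zeros((p, p))
    for _ in range(max_iter):
        # Omega update: Theorem 1 with A = Sigma_hat + 2*rho*I
        B = I + rho * (Theta0 + Theta1) - Lam0 - Lam1
        Omega_new = G(Sigma_hat + 2 * rho * I, B)
        # Theta updates: off-diagonal soft-thresholding and eigenvalue flooring
        Theta0 = soft_threshold_offdiag(Omega_new + Lam0 / rho, lam / rho)
        Theta1 = eig_floor(Omega_new + Lam1 / rho, eps)
        # Dual updates
        Lam0 += rho * (Omega_new - Theta0)
        Lam1 += rho * (Omega_new - Theta1)
        change = np.linalg.norm(Omega_new - Omega, 'fro') / max(1.0, np.linalg.norm(Omega, 'fro'))
        Omega = Omega_new
        if change < tol:
            break
    return Omega   # at convergence Omega, Theta0 and Theta1 coincide up to the tolerance
```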

3·2. Implementation
Here we discuss the implementation details for Algorithm 1. The most computationally expensive part is the update of $\Theta_1^{k+1}$, owing to the eigenvalue constraint. If we drop that constraint and consider

$$\breve\Omega = \arg\min_{\Omega \in \mathbb{R}^{p \times p},\,\Omega = \Omega^{\mathrm{T}}} \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n \|\Omega\|_{1,\mathrm{off}}, \qquad (22)$$

then we can derive a much simpler alternating direction method for computing $\breve\Omega$. If $\breve\Omega \succeq \epsilon I$, we must have $\hat\Omega = \breve\Omega$. If we find that $\breve\Omega$ has an eigenvalue less than $\epsilon$, then we can always use Algorithm 1 to find $\hat\Omega$, in which $\breve\Omega$ can be taken as the initial value of $\Theta^0$. This implementation strategy could save a lot of computational time.

We now work out the simplified alternating direction method for computing $\breve\Omega$. Following the same steps as in § 3·1, we consider the augmented Lagrangian

$$L(\Omega, \Theta_0, \Lambda) = \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n \|\Theta_0\|_{1,\mathrm{off}} + \langle \Lambda, \Omega - \Theta_0 \rangle + (\rho/2)\|\Omega - \Theta_0\|_F^2.$$

We update $(\Omega, \Theta_0, \Lambda)$ according to the following three steps:

$$\Omega^{k+1} = \arg\min_{\Omega = \Omega^{\mathrm{T}}} L(\Omega, \Theta_0^k, \Lambda^k), \qquad (23)$$

$$\Theta_0^{k+1} = \arg\min_{\Theta_0} L(\Omega^{k+1}, \Theta_0, \Lambda^k), \qquad (24)$$

$$\Lambda^{k+1} = \Lambda^k + \rho(\Omega^{k+1} - \Theta_0^{k+1}).$$

The solutions to (23) and (24) are given in (18) and (20), respectively. Algorithm 2 summarizes the details for computing $\breve\Omega$ and the final estimator $\hat\Omega$.
Algorithm 2. Alternating direction method implementation for our estimator.

Step 1. Initialization: $k = 0$, $\Lambda^0$, $\Theta_0^0 = \{\mathrm{diag}(\hat\Sigma)\}^{-1}$, where $\mathrm{diag}(\hat\Sigma)$ is a diagonal matrix which keeps the diagonal elements of $\hat\Sigma$.

Step 2. Repeat (a)–(d) until convergence:

(a) $k = k + 1$;
(b) $\Omega^{k+1} = G(\hat\Sigma + \rho I,\ I + \rho\Theta_0^k - \Lambda^k)$;
(c) $\Theta_0^{k+1} = S(\Omega^{k+1} + \rho^{-1}\Lambda^k,\ \rho^{-1}\lambda_n)$;
(d) $\Lambda^{k+1} = \Lambda^k + \rho(\Omega^{k+1} - \Theta_0^{k+1})$.

Step 3. Report the converged $\Omega^k$ as the solution $\breve\Omega$ defined in (22).

Step 4. If $\lambda_{\min}(\breve\Omega) > \epsilon$, report $\breve\Omega$ as $\hat\Omega$.

Step 5. Otherwise, use Algorithm 1 to calculate $\hat\Omega$, and in its Step 1 use $\breve\Omega$ as the initial value for $\Theta_0^0$ and $\Theta_1^0$.

We have implemented Algorithm 2 in Matlab. In our implementation, we take $\rho = 1$ and stop the algorithm when both of the following criteria are satisfied:

$$\frac{\|\Omega^{k+1} - \Omega^k\|_F}{\max(1, \|\Omega^k\|_F, \|\Omega^{k+1}\|_F)} < 10^{-7}, \qquad \frac{\|\Theta_0^{k+1} - \Theta_0^k\|_F}{\max(1, \|\Theta_0^k\|_F, \|\Theta_0^{k+1}\|_F)} < 10^{-7}.$$

4. NUMERICAL RESULTS

Among existing methods, the lasso penalized Gaussian likelihood estimator is the only pop-
ular precision matrix estimator that can simultaneously retain sparsity and positive definiteness.
To show the virtue of the D-trace loss, we use simulations to compare the performance of our
estimator with that of the lasso penalized Gaussian likelihood estimator.
In the simulation study, data were generated from $N(0, \Sigma^*)$. The following three forms of $\Omega^*$ were considered.

Model 1: $\Omega^*_{i,i} = 1$, $\Omega^*_{i,j} = 0{\cdot}2$ for $1 \leq |i - j| \leq 2$ and $\Omega^*_{i,j} = 0$ otherwise.

Model 2: $\Omega^*_{i,i} = 1$, $\Omega^*_{i,j} = 0{\cdot}2$ for $1 \leq |i - j| \leq 4$ and $\Omega^*_{i,j} = 0$ otherwise.

Model 3: $\Omega^*_{i,i} = 1$, $\Omega^*_{i,i+1} = 0{\cdot}2$ for $\mathrm{mod}(i, p^{1/2}) \neq 0$, $\Omega^*_{i,i+p^{1/2}} = 0{\cdot}2$ and $\Omega^*_{i,j} = 0$ otherwise; this is the grid model in Ravikumar et al. (2011) and requires $p^{1/2}$ to be an integer.
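For concreteness, the banded precision matrices of Models 1 and 2 can be generated as follows; this is our illustration of the setup, not the authors' simulation code.

```python
import numpy as np

def banded_precision(p, bandwidth, value=0.2):
    """Omega*_{ii} = 1, Omega*_{ij} = value for 1 <= |i-j| <= bandwidth, 0 otherwise."""
    idx = np.arange(p)
    dist = np.abs(idx[:, None] - idx[None, :])
    Omega = np.where((dist >= 1) & (dist <= bandwidth), value, 0.0)
    np.fill_diagonal(Omega, 1.0)
    return Omega

Omega1 = banded_precision(500, 2)                       # Model 1
Omega2 = banded_precision(500, 4)                       # Model 2
rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(500), np.linalg.inv(Omega1), size=400)  # n = 400 draws
```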
The sample size was taken to be $n = 400$ in all three models. We let $p = 500$ in Models 1 and 2, and $p = 484$ in Model 3. Each estimator was tuned by five-fold crossvalidation. Simulation results based on 100 independent replications are reported in Table 1, where we compare the two estimators in terms of five quantities: the Frobenius risk $E(\|\hat\Omega - \Omega^*\|_F)$, the operator risk $E(\|\hat\Omega - \Omega^*\|_2)$, the matrix $\ell_{1,\infty}$ risk $E(\|\hat\Omega - \Omega^*\|_{1,\infty})$, and the percentages of correctly estimated nonzeros and zeros. Table 1 shows that our estimator performs better than the lasso penalized Gaussian likelihood estimator, even though the data are Gaussian. We also recorded the running time of each estimator by fixing the parameter $\lambda_n$ at the value chosen by crossvalidation. We computed the lasso penalized Gaussian likelihood estimator by using the alternating direction method as implemented by Scheinberg et al. (2010). The average running time for our estimator was 1·2 seconds, whereas that for the lasso penalized Gaussian likelihood estimator was 2 seconds.

Table 1. Results of simulation study: comparison of our estimator with the lasso penalized Gaussian likelihood estimator, i.e., graphical lasso, in terms of three different matrix norms and the percentages of correctly estimated nonzeros and zeros. Reported numbers are averages over 100 independent runs, with standard errors given in parentheses. In the first three columns smaller numbers are better; in the last two columns larger numbers are better

                   Frobenius      Operator      ℓ_{1,∞}       TP             TN
Model 1
Our estimator      7·19 (0·06)    0·77 (0·02)   1·06 (0·04)   88·80 (0·86)   98·77 (0·03)
Graphical lasso    7·49 (0·19)    0·78 (0·02)   1·26 (0·09)   88·12 (2·82)   97·65 (0·71)
Model 2
Our estimator      11·70 (0·09)   1·59 (0·01)   1·92 (0·03)   63·47 (1·57)   98·66 (0·20)
Graphical lasso    11·88 (0·03)   1·61 (0·01)   2·11 (0·05)   64·88 (0·69)   97·40 (0·06)
Model 3
Our estimator      5·07 (0·06)    0·56 (0·02)   0·91 (0·04)   99·41 (0·22)   98·57 (0·04)
Graphical lasso    5·26 (0·06)    0·58 (0·02)   1·06 (0·06)   99·76 (0·13)   97·48 (0·07)

TP, percentage of correctly estimated nonzeros; TN, percentage of correctly estimated zeros.

5. THEORETICAL RESULTS

5·1. Notation
In this section we study the theoretical properties of the proposed estimator in the ultrahigh-dimensional setting. Under suitable regularity conditions, the proposed estimator is consistent under various matrix norms and has a sparse recovery property with high probability. In particular, when the $x_i$ are sampled from a sub-Gaussian distribution, consistency holds if $\log(p)$ is small compared to $n$.
We assume that the true precision matrix $\Omega^*$ is sparse. Let $S = \{(i, j) : \Omega^*_{i,j} \neq 0\}$ denote the support of $\Omega^*$ and $S^c$ the complement of $S$. Let $d$ be the maximum node degree in $\Omega^*$, and denote by $s$ the number of edges in the graph corresponding to $\Omega^*$. We introduce some additional notation to facilitate the presentation. For a vector $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, the $\ell_1$ norm $\sum_i |x_i|$ is written as $|x|_1$, and the $\ell_2$ norm $(\sum_{i=1}^{n} x_i^2)^{1/2}$ is written as $\|x\|$. For a matrix $X$, the elementwise matrix norm $\max_{i,j} |X_{i,j}|$ is written as $\|X\|_\infty$, the $\ell_1$ norm $\sum_{i,j} |X_{i,j}|$ is written as $\|X\|_1$, the $\ell_{1,\infty}$ norm $\max_i(\sum_j |X_{i,j}|)$ is written as $\|X\|_{1,\infty}$, and the operator norm $\max_{\|x\|=1} \|Xx\|$ is written as $\|X\|_2$. For any subset $T$ of $\{1, \ldots, p\} \times \{1, \ldots, p\}$, we denote by $\mathrm{vec}(X)_T$ the subvector of $\mathrm{vec}(X)$ indexed by $T$. For any two subsets $T_1$ and $T_2$ of $\{1, \ldots, p\} \times \{1, \ldots, p\}$, we denote by $X_{T_1 T_2}$ the submatrix of $X$ with rows and columns indexed by $T_1$ and $T_2$, respectively. We use $\lambda_{\max}(X)$ and $\lambda_{\min}(X)$ to denote the largest and smallest eigenvalues of a symmetric matrix $X$. We write $\theta_{\min} = \min_{(i,j) \in S} |\Omega^*_{i,j}|$, $\alpha = 1 - \max_{e \in S^c} \|\Gamma^*_{e,S}(\Gamma^*_{S,S})^{-1}\|_1$, $\Delta_\Gamma = \hat\Gamma - \Gamma^*$, $\Delta_\Sigma = \hat\Sigma - \Sigma^*$, $\varepsilon = \|\Delta_\Sigma\|_\infty$, $\kappa_\Gamma = \|(\Gamma^*_{S,S})^{-1}\|_{1,\infty}$ and $\kappa_\Sigma = \|\Sigma^*\|_{1,\infty}$, where $\Gamma^*$ and $\hat\Gamma$ are defined in § 5·2.
S,S

5·2. The irrepresentability condition


We first present the irrepresentability condition for establishing the model selection consistency of our estimator. An irrepresentability condition is also required for the lasso penalized Gaussian likelihood estimator for estimating sparse precision matrices (Ravikumar et al., 2011). Denoting the Kronecker matrix sum by $\oplus$ and the Kronecker matrix product by $\otimes$, our irrepresentability condition involves the function

$$\Gamma(\Sigma) = \frac{1}{2}(\Sigma \oplus \Sigma) = \frac{1}{2}(\Sigma \otimes I + I \otimes \Sigma).$$

Upon using the definition of the Kronecker matrix sum, we see that $\Gamma(\Sigma)$ is a $p^2 \times p^2$ matrix indexed by vertex pairs and that

$$\Gamma(\Sigma)_{(j,k),(l,m)} = \frac{1}{2}\{\Sigma_{k,m}\delta(j,l) + \Sigma_{j,l}\delta(k,m)\}, \qquad (25)$$

where $\delta(j,l) = 1$ if $j = l$ and $\delta(j,l) = 0$ if $j \neq l$. For simplicity, we write $\Gamma^* = \Gamma(\Sigma^*)$ and $\hat\Gamma = \Gamma(\hat\Sigma)$. In our theoretical analysis, the following irrepresentability condition is assumed:

$$\max_{e \in S^c} \|\Gamma^*_{e,S}(\Gamma^*_{S,S})^{-1}\|_1 < 1. \qquad (26)$$

It is interesting to compare (26) with the irrepresentability condition for the lasso penalized Gaussian likelihood estimator (Ravikumar et al., 2011, Assumption 1), which is

$$\max_{e \in S^c} \|(\Sigma^* \otimes \Sigma^*)_{e,S}\{(\Sigma^* \otimes \Sigma^*)_{S,S}\}^{-1}\|_1 < 1. \qquad (27)$$

Notice that (26) involves the Kronecker sum $\Sigma^* \oplus \Sigma^*$ while (27) uses the Kronecker product $\Sigma^* \otimes \Sigma^*$.
It is difficult to compare (26) and (27) in general. Here we compare them on a specific example used by Meinshausen (2008) and Ravikumar et al. (2011), with $\Omega^* \in \mathbb{R}^{4 \times 4}$, $\Omega^*_{i,i} = 1$, $\Omega^*_{2,3} = \Omega^*_{3,2} = 0$, $\Omega^*_{1,4} = \Omega^*_{4,1} = 2c^2$ and $\Omega^*_{i,j} = c$ otherwise, where we assume $c \in [-2^{-1/2}, 2^{-1/2}]$ so that $\Omega^*$ is positive definite. For this example, we can verify numerically that (26) holds for $|c| \leq 0{\cdot}31$ while (27) requires that $|c| < 0{\cdot}2017$ (Ravikumar et al., 2011, § 3.1.1). Thus, when $|c| \in [0{\cdot}2017, 0{\cdot}31]$, (26) holds while (27) fails.
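The left-hand side of (26) for such an example can be evaluated with a few lines of NumPy; the script below is our illustration (the helper name is ours), building $\Gamma(\Sigma) = (\Sigma \otimes I + I \otimes \Sigma)/2$ explicitly and taking row-wise $\ell_1$ norms over $S^c$.

```python
import numpy as np

def irrepresentability_lhs(Omega):
    """Left-hand side of (26): max over e in S^c of the l1 norm of Gamma*_{e,S} (Gamma*_{S,S})^{-1}."""
    p = Omega.shape[0]
    Sigma = np.linalg.inv(Omega)
    Gamma = 0.5 * (np.kron(Sigma, np.eye(p)) + np.kron(np.eye(p), Sigma))
    support = np.abs(Omega.flatten(order='F')) > 1e-12     # vec() stacks columns
    S, Sc = np.where(support)[0], np.where(~support)[0]
    M = Gamma[np.ix_(Sc, S)] @ np.linalg.inv(Gamma[np.ix_(S, S)])
    return np.max(np.abs(M).sum(axis=1))                   # row-wise l1 norms

c = 0.25
Omega = np.full((4, 4), c)
np.fill_diagonal(Omega, 1.0)
Omega[1, 2] = Omega[2, 1] = 0.0
Omega[0, 3] = Omega[3, 0] = 2 * c ** 2
print(irrepresentability_lhs(Omega))   # expected < 1 at c = 0.25, since (26) holds for |c| <= 0.31
```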
We also compared (26) and (27) on two autoregressive models of orders 1 and 3. In the first, we let $\Omega^* \in \mathbb{R}^{p \times p}$, $\Omega^*_{i,i} = 1$, $\Omega^*_{i,j} = c$ for $|i - j| = 1$ and $\Omega^*_{i,j} = 0$ otherwise. In the second, we let $\Omega^* \in \mathbb{R}^{p \times p}$, $\Omega^*_{i,i} = 1$, $\Omega^*_{i,j} = c$ for $1 \leq |i - j| \leq 3$ and $\Omega^*_{i,j} = 0$ otherwise. The condition (26) was less restrictive than (27) for all values of $p$ that we tested. For example, consider $p = 30$. For the autoregressive model of order 1, (26) holds for $|c| < 0{\cdot}41$ and (27) holds only for $|c| < 0{\cdot}35$; for the autoregressive model of order 3, (26) holds for $|c| < 0{\cdot}22$ while (27) holds only for $|c| < 0{\cdot}14$.

5·3. Rates of convergence


We establish rates of convergence and the model selection consistency of the penalized D-trace estimator under the assumption that $x_1, \ldots, x_n$ are independent and identically sampled from a sub-Gaussian distribution with covariance $\Sigma^*$ such that all the $X_i/(\Sigma^*_{i,i})^{1/2}$ are sub-Gaussian with parameter $\sigma$. Here $X_i$ is the $i$th coordinate of the random vector $X$, so we assume that

$$E[\exp\{tX_i(\Sigma^*_{i,i})^{-1/2}\}] \leq \exp(\sigma^2 t^2/2) \quad (t \in \mathbb{R}). \qquad (28)$$

THEOREM 2. Under (28) and the irrepresentability condition (26), choose

$$\lambda_n = 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\left\{128(1 + 4\sigma^2)^2 \max_i(\Sigma^*_{i,i})^2 (\eta\log p + \log 4)/n\right\}^{1/2}$$

for some $\eta > 2$ and

$$n > C_1 \max\Big[\lambda_{\min}(\Omega^*)^{-1}\min\{(s+p)^{1/2}, d\}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\},\ 12d\kappa_\Gamma,\ 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma),\ \{8(1 + 4\sigma^2)\max_i \Sigma^*_{i,i}\}^{-1}\Big]^2 (\eta\log p + \log 4),$$

where $C_1 = 128(1 + 4\sigma^2)^2 \max_i(\Sigma^*_{i,i})^2$. Then, with probability greater than $1 - 1/p^{\eta-2}$, we have

$$\|\hat\Omega - \Omega^*\|_\infty \leq \{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}\{C_1(\eta\log p + \log 4)/n\}^{1/2},$$

$$\|\hat\Omega - \Omega^*\|_F \leq (s+p)^{1/2}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}\{C_1(\eta\log p + \log 4)/n\}^{1/2},$$

$$\|\hat\Omega - \Omega^*\|_2 \leq \min\{(s+p)^{1/2}, d\}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}\{C_1(\eta\log p + \log 4)/n\}^{1/2}.$$

In addition, $\hat\Omega$ recovers all zeros in $\Omega^*$. Moreover, if

$$n > C_1\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}^2(\eta\log p + \log 4)/\theta_{\min}^2,$$

then $\hat\Omega$ recovers all zeros and nonzeros in $\Omega^*$.

Next, we establish rates of convergence and model selection consistency of the penalized D-trace estimator under a weaker polynomial tail assumption. Assume that $x_1, \ldots, x_n$ are independent and identically sampled from a distribution with polynomial tails having covariance $\Sigma^*$ such that $(\Sigma^*_{i,i})^{-1/2}X_i$ has finite $4m$th moments, i.e., there exist $m$ and $K_m \in \mathbb{R}$ such that

$$E\{(\Sigma^*_{i,i})^{-1/2}X_i\}^{4m} \leq K_m \quad (i = 1, \ldots, p). \qquad (29)$$

THEOREM 3. Under (29) and the irrepresentability condition (26), choose

$$\lambda_n = 24n^{-1/2}\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)(\max_i \Sigma^*_{i,i})(K_m + 1)^{1/(2m)}p^{\eta/(2m)}$$

for some $\eta > 2$ and

$$n > C_2\, p^{\eta/m} \max\Big[\lambda_{\min}(\Omega^*)^{-1}\min\{(s+p)^{1/2}, d\}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\},\ 12d\kappa_\Gamma,\ 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\Big]^2,$$

where $C_2 = \{2^{2m}(\max_i \Sigma^*_{i,i})^{2m}(K_m + 1)\}^{1/m}$. Then, with probability $1 - 1/p^{\eta-2}$, we have

$$\|\hat\Omega - \Omega^*\|_\infty \leq \{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}C_2^{1/2}p^{\eta/(2m)}n^{-1/2},$$

$$\|\hat\Omega - \Omega^*\|_F \leq (s+p)^{1/2}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}C_2^{1/2}p^{\eta/(2m)}n^{-1/2},$$

$$\|\hat\Omega - \Omega^*\|_2 \leq \min\{(s+p)^{1/2}, d\}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}C_2^{1/2}p^{\eta/(2m)}n^{-1/2}.$$

In addition, $\hat\Omega$ recovers all zeros in $\Omega^*$. Moreover, if

$$n > C_2\, p^{\eta/m}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}^2/\theta_{\min}^2,$$

then $\hat\Omega$ recovers all zeros and nonzeros in $\Omega^*$.

These rate-of-convergence results look similar to those of Ravikumar et al. (2011). However,
our technical analysis is different from theirs. The key component in their analysis is Brouwer’s
fixed-point theorem, but we can use a more direct approach to analyse the penalized D-trace
estimator, thanks to its simple expression.

6. DISCUSSION
In the empirical loss minimization framework, the D-trace loss, which is basically a quadratic function of the precision matrix, is much simpler than the Gaussian likelihood loss. Its simplicity leads to theoretical and computational advantages. We have provided theoretical and empirical evidence to support the D-trace loss and the lasso penalized D-trace estimator. On the other hand, our results do not imply that the D-trace loss estimator is superior to the graphical lasso. Conceptually, the D-trace loss is to the Gaussian likelihood as the hinge loss underlying the support vector machine is to the binomial likelihood for logistic regression. Each has its own merits, and neither dominates the other. An open question remains concerning the irrepresentability condition: we can neither prove nor disprove that (26) is always weaker than (27). This technical problem will be studied in another paper.

ACKNOWLEDGEMENT
We thank Dr Shiqian Ma for sharing the Matlab code that implements his alternating direction
method for computing the lasso penalized Gaussian likelihood estimator. We thank the editor,
associate editor and referees for their suggestions, as well as Professors Peter Bühlmann and Tony
Cai for helpful discussions. Zou’s research was supported in part by the U.S. National Science
Foundation and the Office of Naval Research.

APPENDIX: TECHNICAL PROOFS

Proof of Theorem 1
With a positive definite $A$, $\langle \Omega^2, A \rangle/2 - \langle B, \Omega \rangle$ is a strictly convex function over $\Omega$. Therefore, we only need to check that its derivative is zero at $G(A, B)$, i.e., $2^{-1}\{AG(A,B) + G(A,B)A\} - B = 0$. Equivalently, we need to check that

$$\frac{1}{2}\big[\mathrm{diag}(\sigma_1, \ldots, \sigma_p)\{U_A^{\mathrm{T}}G(A,B)U_A\} + \{U_A^{\mathrm{T}}G(A,B)U_A\}\,\mathrm{diag}(\sigma_1, \ldots, \sigma_p)\big] = U_A^{\mathrm{T}}BU_A.$$

The above equation can be verified by calculation for $G(A, B)$ defined in (19), and so Theorem 1 is proved.
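A quick numerical sanity check of (19), using the G helper sketched in § 3 (our code, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
A = A @ A.T + 6 * np.eye(6)          # symmetric positive-definite A
B = rng.standard_normal((6, 6))
B = 0.5 * (B + B.T)                  # symmetric B
Om = G(A, B)                         # G as defined from Theorem 1
print(np.max(np.abs(0.5 * (A @ Om + Om @ A) - B)))   # close to machine precision
```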

Proofs of Theorems 2 and 3


We prove these two theorems simultaneously. For clarity of presentation, we first sketch the proof and
then fill in the details of the technical lemmas and their proofs.
Following Definition 1 in Ravikumar et al. (2011), we assume that there exists a constant $v_* > 0$ and a function $f$ such that

$$\mathrm{pr}(|\hat\Sigma_{i,j} - \Sigma^*_{i,j}| \geq \delta) \leq 1/f(n, \delta) \quad (1 \leq i, j \leq p;\ 0 < \delta < 1/v_*). \qquad (A1)$$

We also define

$$n_f(\delta, r) = \arg\max\{n : f(n, \delta) \leq r\}, \qquad \delta_f(n, r) = \arg\max\{\delta : f(n, \delta) \leq r\}.$$

The tail assumption (A1) holds for a large class of random vectors. Two special cases, sub-Gaussian tails and polynomial tails, are defined in (28) and (29). When (28) holds, we have $v_* = \{8(1 + 4\sigma^2)\max_i \Sigma^*_{i,i}\}^{-1}$ and $f(n, \delta) = \exp(c_* n\delta^2)/4$, where $c_* = \{128(1 + 4\sigma^2)^2\max_i(\Sigma^*_{i,i})^2\}^{-1}$ (Ravikumar et al., 2011, § 2.3.1). Straightforward calculation gives $\delta_f(n, p^\eta) = \{128(1 + 4\sigma^2)^2\max_i(\Sigma^*_{i,i})^2(\eta\log p + \log 4)/n\}^{1/2}$ and $n_f(\delta, p^\eta) = 128(1 + 4\sigma^2)^2\max_i(\Sigma^*_{i,i})^2(\eta\log p + \log 4)/\delta^2$. When (29) holds, we have $v_* = 0$ and $f(n, \delta) = c_* n^m\delta^{2m}$, where $c_* = 2^{-2m}(\max_i \Sigma^*_{i,i})^{-2m}(K_m + 1)^{-1}$ (Ravikumar et al., 2011, § 2.3.2). Thus $\delta_f(n, p^\eta) = p^{\eta/(2m)}c_*^{-1/(2m)}n^{-1/2}$ and $n_f(\delta, p^\eta) = p^{\eta/m}c_*^{-1/m}\delta^{-2}$. With these preparations in place, Theorems 2 and 3 can be proved using the following technical lemma.

LEMMA A1. Define $\breve\Omega$ by

$$\breve\Omega = \arg\min_{\Omega \in \mathbb{R}^{p \times p},\,\Omega = \Omega^{\mathrm{T}}} \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n\|\Omega\|_{1,\mathrm{off}}. \qquad (A2)$$

Then the following hold:

(a) $\mathrm{vec}(\breve\Omega)_{S^c} = 0$ if

$$\max_{e \in S^c}\big|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(I)_S\big| < \alpha\lambda_n/2, \qquad \max_{e \in S^c}\|\hat\Gamma_{e,S}(\hat\Gamma_{S,S})^{-1}\|_1 \leq 1 - \alpha/2; \qquad (A3)$$

(b) $\mathrm{vec}(\breve\Omega)_{S^c} = 0$ if

$$\varepsilon < \frac{1}{12d\kappa_\Gamma}, \qquad (A4)$$

$$6\varepsilon(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma) \leq 0{\cdot}5\,\alpha\min(\lambda_n, 1); \qquad (A5)$$

(c) assuming the conditions in part (b), we also have

$$\|\breve\Omega - \Omega^*\|_\infty < \lambda_n\kappa_\Gamma + \frac{5}{2}d(1 + \lambda_n)\varepsilon\kappa_\Gamma^2. \qquad (A6)$$

The proof of Lemma A1 is based on the following auxiliary lemma, which is used to control $\|\hat\Gamma_{S,S}^{-1} - \Gamma_{S,S}^{*-1}\|_\infty$ and $\|\hat\Gamma_{S,S}^{-1} - \Gamma_{S,S}^{*-1}\|_{1,\infty}$ by $\varepsilon = \|\hat\Sigma - \Sigma^*\|_\infty$. For convenience we present it here.

LEMMA A2. Assuming (A4), we have

$$\|R(\Delta_\Gamma)\|_{1,\infty} \leq 6d^2\varepsilon^2\kappa_\Gamma^3, \qquad \|R(\Delta_\Gamma)\|_\infty \leq 12d\varepsilon^2\kappa_\Gamma^3, \qquad (A7)$$

where $R(\Delta_\Gamma) = \{\Gamma^*_{S,S} + (\Delta_\Gamma)_{S,S}\}^{-1} - \Gamma^{*-1}_{S,S} + \Gamma^{*-1}_{S,S}(\Delta_\Gamma)_{S,S}\Gamma^{*-1}_{S,S}$. Moreover, we have

$$\|\hat\Gamma_{S,S}^{-1} - \Gamma_{S,S}^{*-1}\|_{1,\infty} \leq 6d^2\varepsilon^2\kappa_\Gamma^3 + 2d\varepsilon\kappa_\Gamma^2, \qquad (A8)$$

$$\|\hat\Gamma_{S,S}^{-1} - \Gamma_{S,S}^{*-1}\|_\infty \leq 12d\varepsilon^2\kappa_\Gamma^3 + 2\varepsilon\kappa_\Gamma^2. \qquad (A9)$$

In this proof we assume the general choices of $n$ and $\lambda_n$:

$$n > n_f\Big(1\Big/\max\big[\lambda_{\min}(\Omega^*)^{-1}\min\{(s+p)^{1/2}, d\}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\},\ \theta_{\min}^{-1}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\},\ 12d\kappa_\Gamma,\ 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma),\ v_*\big],\ p^\eta\Big) \qquad (A10)$$

and $\lambda_n = 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\delta_f(n, p^\eta)$ for some $\eta > 2$.

(a) By the definition of $n_f$, with probability at least $1 - 1/p^{\eta-2}$ we have

$$\varepsilon = \|\hat\Sigma - \Sigma^*\|_\infty \leq \delta_f(n, p^\eta) < 1\big/\max\{12d\kappa_\Gamma,\ 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma),\ v_*\}. \qquad (A11)$$

Now we verify the two assumptions in Lemma A1(b). Assumption (A4) is easy to verify using (A11). From (A11) and the definition of $\lambda_n$ we also have $\lambda_n \leq 1$, and (A5) follows from the definition of $\lambda_n$ and the fact that $\lambda_n \leq 1$.

The convergence rate of $\|\breve\Omega - \Omega^*\|_\infty$ then follows from (A6), (A4), the control of $\varepsilon$ by $\delta_f(n, p^\eta)$ in (A11), the definition of $\lambda_n$, and the fact that $\lambda_n \leq 1$:

$$\|\breve\Omega - \Omega^*\|_\infty < \lambda_n\kappa_\Gamma + \frac{5}{2}d(1 + \lambda_n)\varepsilon\kappa_\Gamma^2 \leq \lambda_n\kappa_\Gamma + 5d\varepsilon\kappa_\Gamma^2 \leq \{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}\delta_f(n, p^\eta). \qquad (A12)$$

The estimation of $\|\hat\Omega - \Omega^*\|_\infty$ follows from (A12), the fact that $\hat\Omega = \breve\Omega$, which will be shown at the end of the proof of Lemma A1, and the estimation of $v_*$, $f(n, p^\eta)$ and $\delta_f(n, p^\eta)$.

(b) Combining the bound on $\|\breve\Omega - \Omega^*\|_\infty$ with the fact that there are at most $s + p$ nonzero elements in $\breve\Omega$ and that the nonzeros of $\breve\Omega$ form a subset of those of $\Omega^*$, we obtain

$$\|\breve\Omega - \Omega^*\|_F \leq (s+p)^{1/2}\|\breve\Omega - \Omega^*\|_\infty \leq (s+p)^{1/2}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}\delta_f(n, p^\eta) \qquad (A13)$$

and

$$\|\breve\Omega - \Omega^*\|_2 \leq \min\{(s+p)^{1/2}, d\}\|\breve\Omega - \Omega^*\|_\infty \leq \min\{(s+p)^{1/2}, d\}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}\delta_f(n, p^\eta). \qquad (A14)$$

The estimation of $\|\hat\Omega - \Omega^*\|_2$ and $\|\hat\Omega - \Omega^*\|_F$ follows from (A13), (A14), the equality $\hat\Omega = \breve\Omega$, and the estimation of $v_*$, $f(n, p^\eta)$ and $\delta_f(n, p^\eta)$.

(c) By part (b) of Lemma A1, $\breve\Omega$ recovers all zeros in $\Omega^*$. When (A10) holds, with probability at least $1 - 1/p^{\eta-2}$ we have that $\delta_f(n, p^\eta) \leq \theta_{\min}/\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}$. By combining this with (A12), $\breve\Omega$ recovers all zeros and nonzeros in $\Omega^*$. Finally, we show that $\hat\Omega = \breve\Omega$. Using the fact that with probability at least $1 - 1/p^{\eta-2}$, $\delta_f(n, p^\eta) \leq \lambda_{\min}(\Omega^*)/[\min\{(s+p)^{1/2}, d\}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}]$, together with (A14), we deduce that $\lambda_{\min}(\breve\Omega) > 0$ and therefore $\hat\Omega = \breve\Omega$. This completes the proof of Theorems 2 and 3.

Proof of Lemma A1
(a) First, we define $\tilde\Omega$ as the solution to the hypothetical problem

$$\tilde\Omega = \arg\min_{\Omega = \Omega^{\mathrm{T}},\,\Omega_{S^c} = 0} \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n\|\Omega\|_{1,\mathrm{off}}. \qquad (A15)$$

From its directional derivative, we obtain the equality

$$\{(\hat\Sigma\tilde\Omega + \tilde\Omega\hat\Sigma)/2 - I + \lambda_n Z\}_S = 0,$$

where

$$Z_{i,j}\ \begin{cases} = 0, & (i,j) \in S^c \text{ or } i = j, \\ = \mathrm{sign}(\tilde\Omega_{i,j}), & (i,j) \in S,\ i \neq j,\ \tilde\Omega_{i,j} \neq 0, \\ \in [-1, 1], & (i,j) \in S,\ i \neq j,\ \tilde\Omega_{i,j} = 0. \end{cases}$$

Applying the definition of $\hat\Gamma = \Gamma(\hat\Sigma)$ in (25), this can be rewritten as

$$\{\hat\Gamma\,\mathrm{vec}(\tilde\Omega) - \mathrm{vec}(I) + \lambda_n\mathrm{vec}(Z)\}_S = 0. \qquad (A16)$$

Recall that $\tilde\Omega_{S^c} = 0$; thus (A16) is equivalent to $\hat\Gamma_{S,S}\mathrm{vec}(\tilde\Omega)_S - \mathrm{vec}(I)_S + \lambda_n\mathrm{vec}(Z)_S = 0$, and the explicit solution to (A15) is

$$\mathrm{vec}(\tilde\Omega)_S = \hat\Gamma_{S,S}^{-1}\{\mathrm{vec}(I)_S - \lambda_n\mathrm{vec}(Z)_S\}. \qquad (A17)$$
Now we verify that $\tilde\Omega$ is also the solution to (A2). Since the objective function in (A2) is convex, we only need to verify that its derivative at $\Omega = \tilde\Omega$ is zero; that is,

$$\left|\left\{\frac{1}{2}(\hat\Sigma\tilde\Omega + \tilde\Omega\hat\Sigma) - I\right\}_{i,j}\right| \leq \lambda_n \quad (1 \leq i \neq j \leq p),$$

$$\left\{\frac{1}{2}(\hat\Sigma\tilde\Omega + \tilde\Omega\hat\Sigma) - I\right\}_{i,i} = 0 \quad (i = 1, \ldots, p). \qquad (A18)$$

Applying (A16), we have that (A18) holds when $(i,j) \in S$. Therefore we need only verify (A18) for $(i,j) \in S^c$. As $\mathrm{vec}(I)_{S^c} = 0$, to prove (A18) it is sufficient to prove that for $e \in S^c$,

$$|\hat\Gamma_{e,S}\mathrm{vec}(\tilde\Omega)_S| \leq \lambda_n. \qquad (A19)$$

Upon combining (A17) with (A19), it suffices to prove that for $e \in S^c$,

$$|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(I)_S - \lambda_n\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(Z)_S| \leq \lambda_n. \qquad (A20)$$

Since $\|\mathrm{vec}(Z)_S\|_\infty \leq 1$, we have $|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(Z)_S| \leq \|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\|_1$. Combining this upper bound of $|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(Z)_S|$ with the assumptions in (A3), we prove (A20) as follows:

$$|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(I)_S - \lambda_n\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(Z)_S| \leq |\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(I)_S| + \lambda_n|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(Z)_S| \leq |\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(I)_S| + \lambda_n\|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\|_1 \leq \alpha\lambda_n/2 + \lambda_n(1 - \alpha/2) = \lambda_n.$$

Since (A20) implies (A18), we have shown that $\tilde\Omega$ is also the solution $\breve\Omega$ in (A2). By the definition of $\tilde\Omega$, we obtain $\mathrm{vec}(\breve\Omega)_{S^c} = 0$.
(b) We prove this part in two steps. First, we show that (A21) implies the two conditions in (A3):

$$\max_{e \in S^c}\|\hat\Gamma_{e,S}(\hat\Gamma_{S,S})^{-1} - \Gamma^*_{e,S}(\Gamma^*_{S,S})^{-1}\|_1 \leq 0{\cdot}5\,\alpha\min(\lambda_n, 1). \qquad (A21)$$

Then we prove (A21). Therefore we get $\mathrm{vec}(\breve\Omega)_{S^c} = 0$ upon applying the result of part (a).

Combining $\alpha = 1 - \max_{e \in S^c}\|\Gamma^*_{e,S}(\Gamma^*_{S,S})^{-1}\|_1$ with the triangle inequality, we obtain the second assumption in (A3) from (A21). Using the fact that

$$\Gamma^{*-1}_{S,S}\mathrm{vec}(I)_S = \mathrm{vec}(\Omega^*)_S, \qquad (A22)$$

we have $\Gamma^*_{S^c,S}\{\Gamma^{*-1}_{S,S}\mathrm{vec}(I)_S\} = \Gamma^*_{S^c,S}\{\mathrm{vec}(\Omega^*)_S\} = \mathrm{vec}\{(\Sigma^*\Omega^* + \Omega^*\Sigma^*)/2\}_{S^c} = 0$, and the first condition in (A3) can be verified as follows:

$$|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(I)_S| = |(\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1} - \Gamma^*_{e,S}\Gamma^{*-1}_{S,S})\mathrm{vec}(I)_S| + |\Gamma^*_{e,S}\Gamma^{*-1}_{S,S}\mathrm{vec}(I)_S| \leq \|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1} - \Gamma^*_{e,S}\Gamma^{*-1}_{S,S}\|_1 + 0 \leq \alpha\lambda_n/2.$$
We now prove (A21). Since the right-hand side of (A21) is equivalent to the right-hand side of (A5), we need only prove that the left-hand side of (A21) is smaller than the left-hand side of (A5). Note that $\|\Gamma^*\|_\infty \leq 2\|\Sigma^*\|_\infty$ and $\|\Gamma^*\|_{1,\infty} \leq 2\|\Sigma^*\|_{1,\infty}$; the left-hand side of (A21) can be controlled as follows, by applying (A9), (A26) and (A27): for any $e \in S^c$,

$$\|\hat\Gamma_{e,S}(\hat\Gamma_{S,S})^{-1} - \Gamma^*_{e,S}(\Gamma^*_{S,S})^{-1}\|_1 = \|(\hat\Gamma_{e,S} - \Gamma^*_{e,S})(\Gamma^*_{S,S})^{-1} + \Gamma^*_{e,S}(\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S}) + (\hat\Gamma_{e,S} - \Gamma^*_{e,S})(\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S})\|_1$$
$$\leq \|(\hat\Gamma_{e,S} - \Gamma^*_{e,S})\Gamma^{*-1}_{S,S}\|_1 + \|\Gamma^*_{e,S}(\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S})\|_1 + \|(\hat\Gamma_{e,S} - \Gamma^*_{e,S})(\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S})\|_1$$
$$\leq \|\hat\Gamma_{e,S} - \Gamma^*_{e,S}\|_\infty\|\Gamma^{*-1}_{S,S}\|_{1,\infty} + 2\|\Sigma^*\|_{1,\infty}\|\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S}\|_\infty + 2d\varepsilon\|\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S}\|_\infty$$
$$\leq 2\varepsilon\kappa_\Gamma + (2\kappa_\Sigma + 2d\varepsilon)(12d\varepsilon^2\kappa_\Gamma^3 + 2\varepsilon\kappa_\Gamma^2). \qquad (A23)$$

Inserting (A4) into the right-hand side of (A23) yields the simplification

$$2\varepsilon\kappa_\Gamma + (2\kappa_\Sigma + 2d\varepsilon)(12d\varepsilon^2\kappa_\Gamma^3 + 2\varepsilon\kappa_\Gamma^2) \leq 2\varepsilon\kappa_\Gamma + \left(2\kappa_\Sigma + \frac{1}{6\kappa_\Gamma}\right)(\varepsilon\kappa_\Gamma^2 + 2\varepsilon\kappa_\Gamma^2) = \varepsilon\left(2\kappa_\Gamma + 6\kappa_\Sigma\kappa_\Gamma^2 + \frac{\kappa_\Gamma}{2}\right) < 6\varepsilon(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma). \qquad (A24)$$

Upon combining (A5), (A23) and (A24), we obtain (A21).


(c) By using the fact that $\breve\Omega = \tilde\Omega$, along with (A17) and (A22), we obtain

$$\|\breve\Omega - \Omega^*\|_\infty = \|\mathrm{vec}(\tilde\Omega)_S - \mathrm{vec}(\Omega^*)_S\|_\infty = \|(\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S})\mathrm{vec}(I)_S - \lambda_n\hat\Gamma_{S,S}^{-1}\mathrm{vec}(Z)_S\|_\infty$$
$$\leq \|\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S}\|_{1,\infty} + \lambda_n\|\hat\Gamma_{S,S}^{-1}\|_{1,\infty} \leq (1 + \lambda_n)\|\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S}\|_{1,\infty} + \lambda_n\|\Gamma^{*-1}_{S,S}\|_{1,\infty}. \qquad (A25)$$

Then we prove (A6) by applying (A4) and (A8) to the right-hand side of (A25):

$$\|\breve\Omega - \Omega^*\|_\infty \leq \lambda_n\kappa_\Gamma + (1 + \lambda_n)(6d^2\varepsilon^2\kappa_\Gamma^3 + 2d\varepsilon\kappa_\Gamma^2) < \lambda_n\kappa_\Gamma + \frac{5}{2}(1 + \lambda_n)d\varepsilon\kappa_\Gamma^2.$$

This completes the proof of Lemma A1.

Proof of Lemma A2
Using the definition of $\hat\Gamma$ and $\Gamma^*$, we have

$$\|(\Delta_\Gamma)_{S,S}\|_{1,\infty} \leq 2d\varepsilon, \qquad (A26)$$

and then (A4) implies that $\|\Gamma^{*-1}_{S,S}\|_{1,\infty}\|(\Delta_\Gamma)_{S,S}\|_{1,\infty} < 1/3$. Following the proof of Ravikumar et al. (2011, Appendix B), we obtain that $\|R(\Delta_\Gamma)\|_\infty \leq 3\|(\Delta_\Gamma)_{S,S}\|_\infty\|(\Delta_\Gamma)_{S,S}\|_{1,\infty}\kappa_\Gamma^3/2$ and $\|R(\Delta_\Gamma)\|_{1,\infty} \leq 3\|(\Delta_\Gamma)_{S,S}\|_{1,\infty}^2\kappa_\Gamma^3/2$. Then we prove (A7) by combining (A26) with the fact that

$$\|(\Delta_\Gamma)_{S,S}\|_\infty \leq \|\hat\Gamma - \Gamma^*\|_\infty \leq 2\|\hat\Sigma - \Sigma^*\|_\infty = 2\varepsilon. \qquad (A27)$$

Moreover,

$$\|\Gamma^{*-1}_{S,S}(\Delta_\Gamma)_{S,S}\Gamma^{*-1}_{S,S}\|_{1,\infty} \leq \|(\Delta_\Gamma)_{S,S}\|_{1,\infty}\|\Gamma^{*-1}_{S,S}\|_{1,\infty}^2 \leq 2d\varepsilon\kappa_\Gamma^2, \qquad (A28)$$

$$\|\Gamma^{*-1}_{S,S}(\Delta_\Gamma)_{S,S}\Gamma^{*-1}_{S,S}\|_\infty \leq \|(\Delta_\Gamma)_{S,S}\|_\infty\|\Gamma^{*-1}_{S,S}\|_{1,\infty}^2 = \|(\Delta_\Gamma)_{S,S}\|_\infty\kappa_\Gamma^2 \leq 2\varepsilon\kappa_\Gamma^2. \qquad (A29)$$

Then (A8) and (A9) are obtained by combining (A7), (A28), (A29) and the definition of $R(\Delta_\Gamma)$. This completes the proof of Lemma A2.
REFERENCES
BANERJEE, O., EL GHAOUI, L. & D’ASPREMONT, A. (2008). Model selection through sparse maximum likelihood
estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9, 485–516.
CAI, T., LIU, W. & LUO, X. (2011). A constrained ℓ1 minimization approach to sparse precision matrix estimation.
J. Am. Statist. Assoc. 106, 594–607.
CANDÈS, E. & TAO, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist.
35, 2313–51.
DOBRA, A., EICHER, T. & LENKOSKI, A. (2009). Modeling uncertainty in macroeconomic growth determinants using
Gaussian graphical models. Statist. Methodol. 7, 292–306.
DUCHI, J., GOULD, S. & KOLLER, D. (2008). Projected subgradient methods for learning sparse Gaussians. In Proc.
24th Annual Conf. Uncertainty Artif. Intel. (UAI 2008). Corvallis, Oregon: AUAI Press, pp. 145–52.
FRIEDMAN, J. H., HASTIE, T. J. & TIBSHIRANI, R. J. (2008). Sparse inverse covariance estimation with the graphical

Downloaded from http://biomet.oxfordjournals.org/ at University of Minnesota,Walter Library Serial Processing on March 5, 2014
lasso. Biostatistics 9, 432–41.
HUANG, J., LIU, N., POURAHMADI, M. & LIU, L. (2006). Covariance matrix selection and estimation via penalised
normal likelihood. Biometrika 93, 85–98.
LI, H. & GUI, J. (2006). Gradient directed regularization for sparse Gaussian concentration graphs, with applications
to inference of genetic networks. Biostatistics 7, 302–17.
LI, S. (2009). Markov Random Field Modeling in Image Analysis. New York: Springer.
LU, Z. (2009). Smooth optimization approach for sparse covariance selection. SIAM J. Optimiz. 19, 1807–27.
MEINSHAUSEN, N. (2008). A note on the lasso for Gaussian graphical model selection. Statist. Prob. Lett. 78, 880–4.
MEINSHAUSEN, N. & BÜHLMANN, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann.
Statist. 34, 1436–62.
PEASE, M. (1965). Methods of Matrix Algebra. London: Academic Press.
PENG, J., WANG, P., ZHOU, N. & ZHU, J. (2009). Partial correlation estimation by joint sparse regression models.
J. Am. Statist. Assoc. 104, 735–46.
RAVIKUMAR, P., WAINWRIGHT, M., RASKUTTI, G. & YU, B. (2011). High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Statist. 5, 935–80.
ROTHMAN, A., BICKEL, P., LEVINA, E. & ZHU, J. (2008). Sparse permutation invariant covariance estimation. Electron.
J. Statist. 2, 494–515.
SCHEINBERG, K., MA, S. & GOLDFARB, D. (2010). Sparse inverse covariance selection via alternating linearization methods. In Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel & A. Culotta, eds. New York: Curran Associates, pp. 2101–9.
TIBSHIRANI, R. J. (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–88.
WHITTAKER, J. (1990). Graphical Models in Applied Multivariate Statistics. Chichester: Wiley.
WILLE, A. & BÜHLMANN, P. (2006). Low-order conditional independence graphs for inferring genetic networks.
Statist. Appl. Genet. Molec. Biol. 5, Issue 1, Article 1.
WITTEN, D., FRIEDMAN, J. H. & SIMON, N. (2011). New insights and faster computations for the graphical lasso.
J. Comp. Graph. Statist. 20, 892–900.
YUAN, M. (2010). High dimensional inverse covariance matrix estimation via linear programming. J. Mach. Learn.
Res. 11, 2261–86.
YUAN, M. & LIN, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94, 19–35.

[Received May 2012. Revised October 2013]
