
Biometrika (2014), 101, 1, pp. 103–120. doi: 10.1093/biomet/ast059
Printed in Great Britain. Advance Access publication 12 February 2014

Sparse precision matrix estimation via lasso penalized D-trace loss

BY TENG ZHANG
Department of Mathematics, Princeton University, Fine Hall, Washington Rd, Princeton, New Jersey 08544, U.S.A.
[email protected]

AND HUI ZOU
School of Statistics, University of Minnesota, 224 Church St SE, Minneapolis, Minnesota 55455, U.S.A.
[email protected]

SUMMARY
We introduce a constrained empirical loss minimization framework for estimating high-dimensional sparse precision matrices and propose a new loss function, called the D-trace loss, for that purpose. A novel sparse precision matrix estimator is defined as the minimizer of the lasso penalized D-trace loss under a positive-definiteness constraint. Under a new irrepresentability condition, the lasso penalized D-trace estimator is shown to have the sparse recovery property. Examples demonstrate that the new condition can hold in situations where the irrepresentability condition for the lasso penalized Gaussian likelihood estimator fails. We establish rates of convergence for the new estimator in the elementwise maximum, Frobenius and operator norms. We develop a very efficient algorithm based on alternating direction methods for computing the proposed estimator. Simulated and real data are used to demonstrate the computational efficiency of our algorithm and the finite-sample performance of the new estimator. The lasso penalized D-trace estimator is found to compare favourably with the lasso penalized Gaussian likelihood estimator.

Some key words: Constrained minimization; D-trace loss; Graphical lasso; Graphical model selection; Precision matrix; Rate of convergence.

1. INTRODUCTION
Assume that we have $n$ independent and identically distributed $p$-dimensional random variables. Let $\Sigma^*$ be the population covariance matrix and let $\Omega^* = (\Sigma^*)^{-1}$ denote the corresponding precision matrix. Massive high-dimensional data frequently arise in computational biology, medical imaging, genomics, climate studies, finance and other fields, and it is of both theoretical and practical importance to estimate high-dimensional covariance or precision matrices. In this paper we focus on estimating a sparse precision matrix $\Omega^*$ when the dimension is large. Sparsity in $\Omega^*$ is interesting because each nonzero entry of $\Omega^*$ corresponds to an edge in a Gaussian graphical model for describing the conditional dependence structure of the observed variables (Whittaker, 1990). Specifically, if $x \sim N_p(\mu, \Sigma^*)$, then $\Omega^*_{ij} = 0$ if and only if $x_i \perp\!\!\!\perp x_j \mid \{x_k : k \neq i, j\}$. The construction of Gaussian graphical models has applications in a
wide range of fields, including genomics, image analysis and macroeconomics (Li & Gui, 2006; Wille & Bühlmann, 2006; Dobra et al., 2009; Li, 2009). Meinshausen & Bühlmann (2006) proposed a neighbourhood selection scheme in which one can sequentially estimate the support of each row of $\Omega^*$ by fitting an $\ell_1$ or lasso penalized least squares regression model (Tibshirani, 1996). Yuan (2010) used the Dantzig selector (Candès & Tao, 2007) to replace the lasso penalized least squares in the neighbourhood selection scheme. Peng et al. (2009) proposed a joint neighbourhood estimator using the lasso penalization. Cai et al. (2011) proposed a constrained $\ell_1$ minimization estimator for estimating sparse precision matrices and established its convergence rates under the elementwise $\ell_\infty$ norm and Frobenius norm. A common drawback of the methods mentioned above is that they do not always guarantee that the final estimator is positive definite. One can also use Cholesky decomposition to estimate the precision or covariance matrix, as in Huang et al. (2006). With this approach, a sparse regularized estimator of the Cholesky factor is first derived and then the estimated Cholesky factor is used to construct the final estimator of $\Omega^*$. The regularized Cholesky decomposition approach always gives a positive-semidefinite matrix but does not necessarily produce a sparse estimator of $\Omega^*$.
To the best of our knowledge, the only existing method for deriving a positive-definite sparse precision matrix is via the lasso or $\ell_1$ penalized Gaussian likelihood estimator or its variants. Yuan & Lin (2007) proposed the lasso penalized likelihood criterion and suggested using the maxdet algorithm to compute the estimator. Motivated by Banerjee et al. (2008), Friedman et al. (2008) developed a blockwise coordinate descent algorithm, called the graphical lasso, for solving the lasso penalized Gaussian likelihood estimator. Witten et al. (2011) presented some computational tricks to further boost the efficiency of the graphical lasso. In the literature, the graphical lasso is often used as an alternative name for the lasso penalized Gaussian likelihood estimator. Convergence rates for the graphical lasso have been established by Rothman et al. (2008) and Ravikumar et al. (2011).

The graphical lasso estimator is not confined to the penalized maximum likelihood estimation paradigm, as it remains valid for non-Gaussian data (Ravikumar et al., 2011). To gain a better understanding, we propose a constrained convex optimization framework for estimating large precision matrices, within which the graphical lasso can be viewed as a special case. We further introduce a new loss function, the D-trace loss, which is convex and minimized at $\Sigma^{-1}$. We define a novel estimator as the minimizer of the lasso penalized D-trace loss under the constraint that the solution be positive definite. The D-trace loss is much simpler than the graphical lasso loss, thus permitting a more direct theoretical analysis and offering significant computational advantages. Under a new irrepresentability condition, we prove the sparse recovery property of the new estimator and show through examples that our irrepresentability condition is satisfied while that for the graphical lasso fails. Asymptotically, the new estimator and the graphical lasso have comparable rates of convergence in the elementwise maximum, Frobenius and operator norms. Through simulation, we show that in finite samples the new estimator outperforms the graphical lasso, even when the data are generated from Gaussian distributions.

2. METHODOLOGY
2·1. An empirical loss minimization framework
We begin with some notation and definitions. For a $p \times p$ matrix $X = (X_{i,j}) \in \mathbb{R}^{p \times p}$, its Frobenius norm is $\|X\|_F = (\sum_{i,j} X_{i,j}^2)^{1/2}$. We also use $\|X\|_{1,\mathrm{off}}$ to denote the off-diagonal $\ell_1$ norm: $\|X\|_{1,\mathrm{off}} = \sum_{i \neq j} |X_{i,j}|$. Let $S(p)$ denote the space of all $p \times p$ positive-definite matrices. For any two symmetric matrices $X, Y \in \mathbb{R}^{p \times p}$, we write $X \succeq Y$ when $X - Y$ is positive semidefinite. We use $\mathrm{vec}(X)$ to denote the $p^2$-vector formed by stacking the columns of $X$, and $\langle X, Y \rangle$ means $\mathrm{tr}(XY^{\mathrm{T}})$ throughout the paper.
Suppose that we want to use an $\Omega$ from $S(p)$ to estimate $(\Sigma^0)^{-1}$. We use a loss function $L(\Omega, \Sigma^0)$ for this estimation problem, and we require it to satisfy the following two conditions.

Condition 1. The loss function $L(\Omega, \Sigma^0)$ is a smooth convex function of $\Omega$.

Condition 2. The unique minimizer of $L(\Omega, \Sigma^0)$ occurs at $(\Sigma^0)^{-1}$.

Condition 1 is required for computational reasons, and Condition 2 is needed so that we get the desired precision matrix when the loss function $L(\Omega, \Sigma^0)$ is minimized. It is also important that $L(\Omega, \Sigma^0)$ be constructed directly through $\Sigma^0$, not $(\Sigma^0)^{-1}$, because in practice we need to use its empirical version $L(\Omega, \hat\Sigma^0)$, where $\hat\Sigma^0$ is an estimate of $\Sigma^0$, to compute the estimator of $(\Sigma^0)^{-1}$. With such a loss function in hand, we can construct a sparse estimator of $(\Sigma^0)^{-1}$ via the convex program

$$\arg\min_{\Omega \in S(p)} L(\Omega, \hat\Sigma) + \lambda_n \|\Omega\|_{1,\mathrm{off}}, \qquad (1)$$

where $\hat\Sigma$ denotes the sample covariance matrix and $\lambda_n > 0$ is the $\ell_1$ penalization parameter.

The graphical lasso can be seen as an application of the empirical loss minimization framework, defined as

$$\arg\min_{\Omega \in S(p)} \langle \Omega, \hat\Sigma \rangle - \log\det(\Omega) + \lambda_n \|\Omega\|_{1,\mathrm{off}}. \qquad (2)$$

Yuan & Lin (2007) proposed this estimator by following the penalized maximum likelihood estimation paradigm: $\langle \Omega, \hat\Sigma \rangle - \log\det(\Omega)$ corresponds to the negative loglikelihood function of the multivariate Gaussian model. Comparing (2) to (1), we see that the graphical lasso is an empirical loss minimizer where the loss function is $L_G(\Omega, \Sigma^0) = \langle \Omega, \Sigma^0 \rangle - \log\det(\Omega)$. One can verify that $L_G(\Omega, \Sigma^0)$ satisfies Conditions 1 and 2. Although $L_G(\Omega, \Sigma^0)$ has dual interpretations, it has been shown that the graphical lasso provides a consistent estimator even when the data do not follow a multivariate Gaussian distribution (Ravikumar et al., 2011). Thus, the empirical loss minimization view of the graphical lasso is more fundamental and can better explain its broader successes with non-Gaussian data.

2·2. A new estimator


From the empirical loss minimization viewpoint, $L_G$ is not the most natural and convenient loss function for precision matrix estimation because of the log-determinant term. We show in this paper that there is a much simpler loss function than $L_G$ for estimating sparse precision matrices. The new loss function is

$$L_D(\Omega, \Sigma^0) = \frac{1}{2}\langle \Omega^2, \Sigma^0 \rangle - \mathrm{tr}(\Omega). \qquad (3)$$

As $L_D$ is expressed as the difference of two trace operators, we call it the D-trace loss. We first verify that $L_D$ satisfies the two conditions above. To check Condition 1, observe that

$$L_D(\Omega_1, \Sigma^0) + L_D(\Omega_2, \Sigma^0) - 2L_D\!\left(\frac{\Omega_1 + \Omega_2}{2}, \Sigma^0\right) = \left\langle \left(\frac{\Omega_1 - \Omega_2}{2}\right)^{2}, \Sigma^0 \right\rangle \geq 0.$$

To check Condition 2, we show that the derivative of (3) is $(\Sigma^0\Omega + \Omega\Sigma^0)/2 - I$ and that the Hessian of $L_D$ can be expressed as $(\Sigma^0 \otimes I + I \otimes \Sigma^0)/2$, where $\otimes$ denotes the Kronecker product. Since $\Sigma^0$ is positive definite, the Hessian has only positive eigenvalues (see, e.g., Pease, 1965, § XIV.7) and so is positive definite. It is then easy to see that $\Omega = (\Sigma^0)^{-1}$ is the unique minimizer of $L_D(\Omega, \Sigma^0)$ as a function of $\Omega$.
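To make the Condition 2 check concrete, the following minimal NumPy sketch evaluates the D-trace loss and its gradient and confirms numerically that the gradient vanishes at $\Omega = (\Sigma^0)^{-1}$; it is our illustration, not code from the paper, and the function names are ours.

```python
import numpy as np

def d_trace_loss(Omega, Sigma):
    """D-trace loss L_D(Omega, Sigma) = <Omega^2, Sigma>/2 - tr(Omega)."""
    return 0.5 * np.trace(Omega @ Omega @ Sigma) - np.trace(Omega)

def d_trace_grad(Omega, Sigma):
    """Gradient of the D-trace loss: (Sigma Omega + Omega Sigma)/2 - I."""
    p = Omega.shape[0]
    return 0.5 * (Sigma @ Omega + Omega @ Sigma) - np.eye(p)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Sigma0 = A @ A.T + 5 * np.eye(5)        # a positive-definite covariance matrix
Omega_star = np.linalg.inv(Sigma0)      # its precision matrix
print(np.max(np.abs(d_trace_grad(Omega_star, Sigma0))))   # numerically zero
```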
We have verified that the D-trace loss is a valid loss function to be used in the empirical loss minimization framework. The corresponding estimator is then defined according to (1):

$$\hat\Omega = \arg\min_{\Omega \in S(p)} \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n \|\Omega\|_{1,\mathrm{off}}, \qquad (4)$$

where $\lambda_n$ is a nonnegative penalization parameter. One can also use the $\ell_1$ norm, $\|\Omega\|_1 = \sum_{i,j} |\Omega_{i,j}|$, in (4). In many applications we know a priori that the smallest eigenvalue of the true precision matrix is at least $\epsilon$, where $\epsilon$ is a certain threshold. We can easily incorporate this into the estimator by considering

$$\hat\Omega = \arg\min_{\Omega \succeq \epsilon I} \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n \|\Omega\|_{1,\mathrm{off}}. \qquad (5)$$

In § 3 we derive an efficient algorithm for solving (5), setting $\epsilon = 10^{-8}$ as the default.
From a computational point of view, $L_D$ is more convenient than $L_G$. We can view the D-trace loss as an analogue of the least squares loss, used in regression, for precision matrix estimation. It is difficult to come up with a simpler loss function than $L_D$ that satisfies Conditions 1 and 2. One might argue that $L_G$ should be the optimal loss function at least for estimating $\Omega^*$ for Gaussian distributions, owing to its likelihood interpretation. However, the conventional wisdom does not necessarily hold true in the empirical loss minimization framework for precision matrix estimation. For simplicity, let $\lambda_n = 0$ and compare the minimizer of the empirical loss with the maximum likelihood estimator when $\hat\Sigma^{-1}$ exists. Then we see that if the loss function satisfies Conditions 1 and 2, the solution in (1) is always $\hat\Sigma^{-1}$, regardless of the actual form of the loss function. This is different from what happens in conventional regression problems, where unpenalized loss functions produce different estimates, such as in the case of Huber's regression versus least squares. In the rest of the paper we study the theoretical and numerical properties of the lasso penalized D-trace loss estimator for estimating sparse precision matrices. We have found that the new estimator enjoys theoretical and empirical advantages over the lasso penalized Gaussian likelihood estimator.
Our estimator has an interesting connection to the constrained $\ell_1$ minimization estimator (Cai et al., 2011) defined through

$$\text{minimize } \sum_{i,j=1}^{p} |\Omega_{ij}| \quad \text{subject to} \quad \max_{i,j} |\hat\Sigma\Omega - I|_{i,j} \leq \lambda_n. \qquad (6)$$

Cai et al. (2011) regularized the diagonal elements of $\Omega$. To simplify the discussion, we can do the same for our estimator by using $\|\Omega\|_1$ in (4); then the penalized $L_D$ estimator is

$$\arg\min_{\Omega \in S(p)} \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n \|\Omega\|_1. \qquad (7)$$
The solution of (7) satisfies

$$\frac{1}{2}(\hat\Sigma\hat\Omega + \hat\Omega\hat\Sigma) - I = \lambda_n \hat Z, \qquad (8)$$

where $\hat Z$ represents the subgradient, taking values in $[-1, 1]$. Therefore, following the derivation of the Dantzig selector (Candès & Tao, 2007), we can relax (8) and drop the positive-definiteness constraint to define a constrained minimization estimator through

$$\text{minimize } \sum_{i,j=1}^{p} |\Omega_{ij}| \quad \text{subject to} \quad \max_{i,j} \left|\frac{1}{2}(\hat\Sigma\Omega + \Omega\hat\Sigma) - I\right|_{i,j} \leq \lambda_n. \qquad (9)$$

Comparing (9) and (6), we see that the Dantzig version of the penalized $L_D$ estimator is very similar to the estimator of Cai et al. (2011). An important difference between (9) and (6) is that the solution of (9) is guaranteed to be symmetric, which is not the case for (6).
A referee called our attention to an unpublished manuscript by Liu and Luo, available at http://arxiv.org/abs/1203.3896. Let $\theta_k$ be the $k$th column vector of $\Omega$, and let $e_k$ denote a $p$-dimensional vector with 1 in the $k$th coordinate and 0 in all other coordinates. Liu and Luo's estimator is motivated by the fact that the constrained $\ell_1$ minimization estimator in (6) has the following equivalent formulation:

$$\text{minimize } |\theta_k|_1 \quad \text{subject to} \quad |\hat\Sigma\theta_k - e_k|_\infty \leq \lambda_n \quad (k = 1, \ldots, p). \qquad (10)$$

See Lemma 1 of Cai et al. (2011). Liu and Luo's estimator of $\theta_k$ is defined by

$$\arg\min_{\theta_k} \frac{1}{2}\theta_k^{\mathrm{T}}\hat\Sigma\theta_k - e_k^{\mathrm{T}}\theta_k + \lambda_n|\theta_k|_1. \qquad (11)$$

Liu and Luo used a reverse Dantzig selector step to get (11) from (10). A major advantage of doing so is that solving (11) can be computationally more efficient than solving (10). On the other hand, the penalized $L_D$ estimator in (7) can be rewritten as

$$\arg\min_{\Omega = [\theta_1, \ldots, \theta_p] \in S(p)} \sum_{k=1}^{p}\left\{\frac{1}{2}\theta_k^{\mathrm{T}}\hat\Sigma\theta_k - e_k^{\mathrm{T}}\theta_k + \lambda_n|\theta_k|_1\right\}. \qquad (12)$$

Therefore, if we drop the positive-definiteness constraint, (12) reduces to solving (11) for $k = 1, \ldots, p$. The fundamental difference between our estimator and Liu and Luo's estimator is that our method respects the positive-definite nature of a precision matrix, while Liu and Luo's method treats a precision matrix estimation problem as $p$ separate vector estimation problems; their estimator is not even guaranteed to be symmetric.
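Each column problem (11) is an $\ell_1$-penalized quadratic program, so any lasso-type solver applies. The sketch below uses plain proximal gradient purely for illustration; it is not Liu and Luo's algorithm, and the function name and step-size choice are ours.

```python
import numpy as np

def column_lasso(Sigma_hat, k, lam, max_iter=1000):
    """Proximal-gradient sketch for (11): min 0.5*t' Sigma_hat t - t[k] + lam*|t|_1."""
    p = Sigma_hat.shape[0]
    e_k = np.zeros(p)
    e_k[k] = 1.0
    step = 1.0 / np.linalg.eigvalsh(Sigma_hat)[-1]   # 1/L, L = largest eigenvalue of Sigma_hat
    theta = np.zeros(p)
    for _ in range(max_iter):
        grad = Sigma_hat @ theta - e_k               # gradient of the smooth quadratic part
        z = theta - step * grad
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return theta
```

Solving this for $k = 1, \ldots, p$ gives a column-by-column estimate which, as noted above, need not be symmetric or positive definite.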

3. ALGORITHM
3·1. Architecture of the algorithm based on the alternating direction method
In this section we develop an efficient algorithm for solving the constrained optimization
problem in (5), based on the alternating direction method. Before delving into the technical
details, it is interesting to first review the efforts that have been devoted to solving the lasso
penalized Gaussian likelihood estimator. Yuan & Lin (2007) used the maxdet algorithm to com-
pute the lasso penalized Gaussian likelihood estimator, but that algorithm is very slow for
high-dimensional data. Banerjee et al. (2008) and Friedman et al. (2008) developed blockwise
descent algorithms. Duchi et al. (2008) proposed a projected gradient method, and Lu (2009)
proposed a method that involves applying Nesterov’s smooth optimization technique; in both
these papers the authors showed that their algorithms perform faster than blockwise descent
algorithms. More recently, Scheinberg et al. (2010) developed an alternating direction method
for solving the lasso penalized Gaussian likelihood estimator and showed that their method is
faster than the projected gradient method (Duchi et al., 2008) as well as Nesterov’s smooth opti-
mization method (Lu, 2009). Based on previous work, the alternating direction method is the
state-of-the-art algorithm for solving the lasso penalized Gaussian likelihood estimator. In order
to compare the D-trace loss and Gaussian likelihood function in computational terms, we derive an alternating direction method for solving the lasso penalized D-trace estimator and compare the
computational efficiency of the lasso penalized D-trace loss with that of the Gaussian likelihood
estimators, showing that the new estimator is faster.
We introduce two new matrices, $\Theta_0$ and $\Theta_1$, and rewrite (5) as

$$\arg\min_{\Omega,\,\Theta_1 \succeq \epsilon I} \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n \|\Theta_0\|_{1,\mathrm{off}} \quad \text{subject to } [\Omega, \Omega] = [\Theta_0, \Theta_1]. \qquad (13)$$

From (13), we consider the augmented Lagrangian

$$L(\Omega, \Theta_0, \Theta_1, \Lambda_0, \Lambda_1) = \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n \|\Theta_0\|_{1,\mathrm{off}} + h(\Theta_1 \succeq \epsilon I) + \langle \Lambda_0, \Omega - \Theta_0 \rangle + \langle \Lambda_1, \Omega - \Theta_1 \rangle + (\rho/2)\|\Omega - \Theta_0\|_F^2 + (\rho/2)\|\Omega - \Theta_1\|_F^2,$$

where $h(\Theta_1 \succeq \epsilon I)$ is an indicator function defined by

$$h(\Theta_1 \succeq \epsilon I) = \begin{cases} 0, & \Theta_1 \succeq \epsilon I; \\ \infty, & \text{otherwise.} \end{cases}$$

Let $(\Omega^k, \Theta_0^k, \Theta_1^k, \Lambda_0^k, \Lambda_1^k)$ be the solution at step $k$, for $k = 0, 1, 2, \ldots$. We update $(\Omega, \Theta_0, \Theta_1, \Lambda_0, \Lambda_1)$ according to

$$\Omega^{k+1} = \arg\min_{\Omega = \Omega^{\mathrm{T}}} L(\Omega, \Theta_0^k, \Theta_1^k, \Lambda_0^k, \Lambda_1^k), \qquad (14)$$

$$[\Theta_0^{k+1}, \Theta_1^{k+1}] = \arg\min_{\Theta_0 = \Theta_0^{\mathrm{T}},\,\Theta_1 \succeq \epsilon I} L(\Omega^{k+1}, \Theta_0, \Theta_1, \Lambda_0^k, \Lambda_1^k), \qquad (15)$$

$$[\Lambda_0^{k+1}, \Lambda_1^{k+1}] = [\Lambda_0^k, \Lambda_1^k] + \rho[\Omega^{k+1} - \Theta_0^{k+1},\ \Omega^{k+1} - \Theta_1^{k+1}]. \qquad (16)$$

Step (16) is trivial. For (14), we can write

$$\Omega^{k+1} = \arg\min_{\Omega = \Omega^{\mathrm{T}}} \frac{1}{2}\langle \Omega^2, \hat\Sigma + 2\rho I \rangle - \langle \Omega, I + \rho\Theta_0^k + \rho\Theta_1^k - \Lambda_0^k - \Lambda_1^k \rangle.$$

Let $G(A, B)$ denote the solution to the optimization problem

$$\arg\min_{\Omega = \Omega^{\mathrm{T}}} \frac{1}{2}\langle \Omega^2, A \rangle - \langle \Omega, B \rangle, \quad A > 0. \qquad (17)$$

Then we can write

$$\Omega^{k+1} = G(\hat\Sigma + 2\rho I,\ I + \rho\Theta_0^k + \rho\Theta_1^k - \Lambda_0^k - \Lambda_1^k). \qquad (18)$$

The explicit solution to (17) is given in the following theorem.

THEOREM 1. Given any $p$-dimensional symmetric positive-definite matrix $A$ and any $p$-dimensional matrix $B$, let $A = U_A \Sigma_A U_A^{\mathrm{T}}$ be the eigenvalue decomposition of $A$, with ordered eigenvalues $\sigma_1 \geq \cdots \geq \sigma_p$. Define

$$G(A, B) = \arg\min_{\Omega = \Omega^{\mathrm{T}}} \frac{1}{2}\langle \Omega^2, A \rangle - \langle B, \Omega \rangle.$$

Then

$$G(A, B) = U_A\{(U_A^{\mathrm{T}} B U_A) \circ C\} U_A^{\mathrm{T}}, \qquad (19)$$

where $\circ$ denotes the Hadamard product of matrices and $C_{i,j} = 2/(\sigma_i + \sigma_j)$.
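In code, (19) amounts to one symmetric eigendecomposition plus elementwise operations. A minimal NumPy sketch (ours; it assumes $A$ is symmetric positive definite and symmetrizes $U_A^{\mathrm{T}} B U_A$ as a numerical safeguard):

```python
import numpy as np

def G(A, B):
    """Solve argmin over symmetric Omega of 0.5*<Omega^2, A> - <B, Omega>, via Theorem 1."""
    sigma, U = np.linalg.eigh(A)                   # eigenvalues and eigenvectors of A
    C = 2.0 / (sigma[:, None] + sigma[None, :])    # C_ij = 2 / (sigma_i + sigma_j)
    M = U.T @ B @ U
    M = 0.5 * (M + M.T)                            # keep the minimizer exactly symmetric
    return U @ (M * C) @ U.T
```

One can check directly that the output satisfies the stationarity condition $\{A\,G(A,B) + G(A,B)\,A\}/2 = B$ for symmetric $B$.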

To update $\Theta_0^{k+1}$, from (15) we write

$$\Theta_0^{k+1} = \arg\min_{\Theta_0 = \Theta_0^{\mathrm{T}}} \frac{\rho}{2}\langle \Theta_0^2, I \rangle - \langle \Theta_0, \rho\Omega^{k+1} + \Lambda_0^k \rangle + \lambda_n \|\Theta_0\|_{1,\mathrm{off}}.$$

Let $S(A, \lambda)$ denote the solution to the optimization problem

$$\arg\min_{\Theta_0 = \Theta_0^{\mathrm{T}}} \frac{1}{2}\langle \Theta_0^2, I \rangle - \langle \Theta_0, A \rangle + \lambda \|\Theta_0\|_{1,\mathrm{off}}.$$

Then we can write

$$\Theta_0^{k+1} = S\!\left(\Omega^{k+1} + \frac{1}{\rho}\Lambda_0^k,\ \frac{\lambda_n}{\rho}\right), \qquad (20)$$

where the operator $S$ is defined by

$$S(A, \lambda)_{i,j} = \begin{cases} A_{i,j}, & i = j, \\ A_{i,j} - \lambda, & i \neq j,\ A_{i,j} > \lambda, \\ A_{i,j} + \lambda, & i \neq j,\ A_{i,j} < -\lambda, \\ 0, & i \neq j,\ -\lambda \leq A_{i,j} \leq \lambda. \end{cases}$$

To update $\Theta_1^{k+1}$, we write

$$\Theta_1^{k+1} = \arg\min_{\Theta_1 \succeq \epsilon I} \frac{\rho}{2}\langle \Theta_1^2, I \rangle - \langle \Theta_1, \rho\Omega^{k+1} + \Lambda_1^k \rangle. \qquad (21)$$

For a symmetric matrix $X$ we define the matrix operator $[X]_{+\epsilon}$ as follows: let the eigenvalue decomposition of $X$ be $U_X\,\mathrm{diag}(\lambda_1, \ldots, \lambda_p)\,U_X^{\mathrm{T}}$; then

$$[X]_{+\epsilon} = U_X\,\mathrm{diag}\{\max(\lambda_1, \epsilon), \ldots, \max(\lambda_p, \epsilon)\}\,U_X^{\mathrm{T}}.$$

The solution to (21) is then

$$\Theta_1^{k+1} = \left[\Omega^{k+1} + \frac{\Lambda_1^k}{\rho}\right]_{+\epsilon}.$$
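The operator $[\cdot]_{+\epsilon}$ simply floors the eigenvalues of a symmetric matrix at $\epsilon$; a short sketch (our helper name):

```python
import numpy as np

def eig_floor(X, eps):
    """[X]_{+eps}: raise every eigenvalue of the symmetric matrix X to at least eps."""
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(np.maximum(lam, eps)) @ U.T
```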

We now have all the pieces needed to carry out the alternating direction method for solving
(5). Algorithm 1 summarizes the details.

Algorithm 1. Alternating direction method for solving (5).

Step 1. Initialization: $k = 0$; choose initial values $\Lambda_0^0, \Lambda_1^0$ and $\Theta_0^0 = \Theta_1^0$.

Step 2. Repeat (a)–(d) until convergence:

(a) $k = k + 1$;
(b) use Theorem 1 to compute $G(\hat\Sigma + 2\rho I,\ I + \rho\Theta_0^k + \rho\Theta_1^k - \Lambda_0^k - \Lambda_1^k)$, and set $\Omega^{k+1} = G(\hat\Sigma + 2\rho I,\ I + \rho\Theta_0^k + \rho\Theta_1^k - \Lambda_0^k - \Lambda_1^k)$;
(c) let $\Theta_0^{k+1} = S(\Omega^{k+1} + \Lambda_0^k/\rho,\ \lambda_n/\rho)$ and $\Theta_1^{k+1} = [\Omega^{k+1} + \Lambda_1^k/\rho]_{+\epsilon}$;
(d) let $\Lambda_0^{k+1} = \Lambda_0^k + \rho(\Omega^{k+1} - \Theta_0^{k+1})$ and $\Lambda_1^{k+1} = \Lambda_1^k + \rho(\Omega^{k+1} - \Theta_1^{k+1})$.
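Combining the three updates, one sweep of Algorithm 1 can be sketched in a few lines of NumPy. This is our illustrative re-implementation, not the authors' Matlab code; the initialization, default ρ and stopping rule are simple choices made for the sketch, and it reuses the helpers `G`, `soft_threshold_offdiag` and `eig_floor` defined above.

```python
import numpy as np

def dtrace_admm(Sigma_hat, lam, eps=1e-8, rho=1.0, max_iter=500, tol=1e-7):
    """Alternating direction sketch of Algorithm 1 for the penalized D-trace estimator (5)."""
    p = Sigma_hat.shape[0]
    I = np.eye(p)
    Omega = np.diag(1.0 / np.diag(Sigma_hat))   # simple diagonal starting value
    Theta0, Theta1 = Omega.copy(), Omega.copy()
    Lam0, Lam1 = np.zeros((p, p)), np.zeros((p, p))
    for _ in range(max_iter):
        # Omega update: Theorem 1 with A = Sigma_hat + 2*rho*I
        B = I + rho * (Theta0 + Theta1) - Lam0 - Lam1
        Omega_new = G(Sigma_hat + 2 * rho * I, B)
        # Theta updates: off-diagonal soft-thresholding and eigenvalue flooring
        Theta0 = soft_threshold_offdiag(Omega_new + Lam0 / rho, lam / rho)
        Theta1 = eig_floor(Omega_new + Lam1 / rho, eps)
        # Dual updates
        Lam0 += rho * (Omega_new - Theta0)
        Lam1 += rho * (Omega_new - Theta1)
        change = np.linalg.norm(Omega_new - Omega, 'fro') / max(1.0, np.linalg.norm(Omega, 'fro'))
        Omega = Omega_new
        if change < tol:
            break
    return Omega   # at convergence Omega, Theta0 and Theta1 coincide up to the tolerance
```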

3·2. Implementation
Here we discuss the implementation details for Algorithm 1. The most computationally expensive part is the update of $\Theta_1^{k+1}$, owing to the eigenvalue constraint. If we drop that constraint and consider

$$\breve\Omega = \arg\min_{\Omega \in \mathbb{R}^{p \times p},\,\Omega = \Omega^{\mathrm{T}}} \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n \|\Omega\|_{1,\mathrm{off}}, \qquad (22)$$

then we can derive a much simpler alternating direction method for computing $\breve\Omega$. If $\breve\Omega \succeq \epsilon I$, we must have $\hat\Omega = \breve\Omega$. If we find that $\breve\Omega$ has an eigenvalue less than $\epsilon$, then we can always use Algorithm 1 to find $\hat\Omega$, in which $\breve\Omega$ can be taken as the initial value of $\Theta^0$. This implementation strategy could save a lot of computational time.

We now work out the simplified alternating direction method for computing $\breve\Omega$. Following the same steps as in § 3·1, we consider the augmented Lagrangian

$$L(\Omega, \Theta_0, \Lambda) = \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n \|\Theta_0\|_{1,\mathrm{off}} + \langle \Lambda, \Omega - \Theta_0 \rangle + (\rho/2)\|\Omega - \Theta_0\|_F^2.$$

We update $(\Omega, \Theta_0, \Lambda)$ according to the following three steps:

$$\Omega^{k+1} = \arg\min_{\Omega = \Omega^{\mathrm{T}}} L(\Omega, \Theta_0^k, \Lambda^k), \qquad (23)$$

$$\Theta_0^{k+1} = \arg\min_{\Theta_0} L(\Omega^{k+1}, \Theta_0, \Lambda^k), \qquad (24)$$

$$\Lambda^{k+1} = \Lambda^k + \rho(\Omega^{k+1} - \Theta_0^{k+1}).$$

The solutions to (23) and (24) are given in (18) and (20), respectively. Algorithm 2 summarizes the details for computing $\breve\Omega$ and the final estimator $\hat\Omega$.
Algorithm 2. Alternating direction method implementation for our estimator.

Step 1. Initialization: $k = 0$, $\Lambda^0$, $\Theta_0^0 = \{\mathrm{diag}(\hat\Sigma)\}^{-1}$, where $\mathrm{diag}(\hat\Sigma)$ is a diagonal matrix which keeps the diagonal elements of $\hat\Sigma$.

Step 2. Repeat (a)–(d) until convergence:

(a) $k = k + 1$;
(b) $\Omega^{k+1} = G(\hat\Sigma + \rho I,\ I + \rho\Theta_0^k - \Lambda^k)$;
(c) $\Theta_0^{k+1} = S(\Omega^{k+1} + \rho^{-1}\Lambda^k,\ \rho^{-1}\lambda_n)$;
(d) $\Lambda^{k+1} = \Lambda^k + \rho(\Omega^{k+1} - \Theta_0^{k+1})$.

Step 3. Report the converged $\Omega^k$ as the solution $\breve\Omega$ defined in (22).

Step 4. If $\lambda_{\min}(\breve\Omega) > \epsilon$, report $\breve\Omega$ as $\hat\Omega$.

Step 5. Otherwise, use Algorithm 1 to calculate $\hat\Omega$, and in its Step 1 use $\breve\Omega$ as the initial value for $\Theta_0^0$ and $\Theta_1^0$.

We have implemented Algorithm 2 in Matlab. In our implementation, we take $\rho = 1$ and stop the algorithm when both of the following criteria are satisfied:

$$\frac{\|\Omega^{k+1} - \Omega^k\|_F}{\max(1, \|\Omega^k\|_F, \|\Omega^{k+1}\|_F)} < 10^{-7}, \qquad \frac{\|\Theta_0^{k+1} - \Theta_0^k\|_F}{\max(1, \|\Theta_0^k\|_F, \|\Theta_0^{k+1}\|_F)} < 10^{-7}.$$

4. NUMERICAL RESULTS

Among existing methods, the lasso penalized Gaussian likelihood estimator is the only pop-
ular precision matrix estimator that can simultaneously retain sparsity and positive definiteness.
To show the virtue of the D-trace loss, we use simulations to compare the performance of our
estimator with that of the lasso penalized Gaussian likelihood estimator.
In the simulation study, data were generated from $N(0, \Sigma^*)$. The following three forms of $\Omega^*$ were considered.

Model 1: $\Omega^*_{i,i} = 1$, $\Omega^*_{i,j} = 0{\cdot}2$ for $1 \leq |i - j| \leq 2$ and $\Omega^*_{i,j} = 0$ otherwise.

Model 2: $\Omega^*_{i,i} = 1$, $\Omega^*_{i,j} = 0{\cdot}2$ for $1 \leq |i - j| \leq 4$ and $\Omega^*_{i,j} = 0$ otherwise.

Model 3: $\Omega^*_{i,i} = 1$, $\Omega^*_{i,i+1} = 0{\cdot}2$ for $\mathrm{mod}(i, p^{1/2}) \neq 0$, $\Omega^*_{i,i+p^{1/2}} = 0{\cdot}2$ and $\Omega^*_{i,j} = 0$ otherwise; this is the grid model in Ravikumar et al. (2011) and requires $p^{1/2}$ to be an integer.
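For concreteness, the banded precision matrices of Models 1 and 2 can be generated as follows; this is our illustration of the setup, not the authors' simulation code.

```python
import numpy as np

def banded_precision(p, bandwidth, value=0.2):
    """Omega*_{ii} = 1, Omega*_{ij} = value for 1 <= |i-j| <= bandwidth, 0 otherwise."""
    idx = np.arange(p)
    dist = np.abs(idx[:, None] - idx[None, :])
    Omega = np.where((dist >= 1) & (dist <= bandwidth), value, 0.0)
    np.fill_diagonal(Omega, 1.0)
    return Omega

Omega1 = banded_precision(500, 2)                       # Model 1
Omega2 = banded_precision(500, 4)                       # Model 2
rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(500), np.linalg.inv(Omega1), size=400)  # n = 400 draws
```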
The sample size was taken to be $n = 400$ in all three models. We let $p = 500$ in Models 1 and 2, and $p = 484$ in Model 3. Each estimator was tuned by five-fold crossvalidation. Simulation results based on 100 independent replications are reported in Table 1, where we compare the two estimators in terms of five quantities: the Frobenius risk $E(\|\hat\Omega - \Omega^*\|_F)$, the operator risk $E(\|\hat\Omega - \Omega^*\|_2)$, the matrix $\ell_{1,\infty}$ risk $E(\|\hat\Omega - \Omega^*\|_{1,\infty})$, and the percentages of correctly estimated nonzeros and zeros. Table 1 shows that our estimator performs better than the lasso penalized Gaussian likelihood estimator, even though the data are Gaussian. We also recorded the running time of each estimator by fixing the parameter $\lambda_n$ at the value chosen by crossvalidation. We computed the lasso penalized Gaussian likelihood estimator by using the alternating direction method as implemented by Scheinberg et al. (2010). The average running time for our estimator was 1·2 seconds, whereas that for the lasso penalized Gaussian likelihood estimator was 2 seconds.

Table 1. Results of simulation study: comparison of our estimator with the lasso penalized Gaussian likelihood estimator, i.e., graphical lasso, in terms of three different matrix norms and the percentages of correctly estimated nonzeros and zeros. Reported numbers are averages over 100 independent runs, with standard errors given in parentheses. In the first three columns smaller numbers are better; in the last two columns larger numbers are better

                   Frobenius      Operator      ℓ_{1,∞}       TP             TN
Model 1
Our estimator      7·19 (0·06)    0·77 (0·02)   1·06 (0·04)   88·80 (0·86)   98·77 (0·03)
Graphical lasso    7·49 (0·19)    0·78 (0·02)   1·26 (0·09)   88·12 (2·82)   97·65 (0·71)
Model 2
Our estimator      11·70 (0·09)   1·59 (0·01)   1·92 (0·03)   63·47 (1·57)   98·66 (0·20)
Graphical lasso    11·88 (0·03)   1·61 (0·01)   2·11 (0·05)   64·88 (0·69)   97·40 (0·06)
Model 3
Our estimator      5·07 (0·06)    0·56 (0·02)   0·91 (0·04)   99·41 (0·22)   98·57 (0·04)
Graphical lasso    5·26 (0·06)    0·58 (0·02)   1·06 (0·06)   99·76 (0·13)   97·48 (0·07)

TP, percentage of correctly estimated nonzeros; TN, percentage of correctly estimated zeros.

5. THEORETICAL RESULTS

5·1. Notation
In this section we study the theoretical properties of the proposed estimator in the ultrahigh-dimensional setting. Under suitable regularity conditions, the proposed estimator is consistent under various matrix norms and has a sparse recovery property with high probability. In particular, when the $x_i$ are sampled from a sub-Gaussian distribution, consistency holds if $\log(p)$ is small compared to $n$.
We assume that the true precision matrix $\Omega^*$ is sparse. Let $S = \{(i, j) : \Omega^*_{i,j} \neq 0\}$ denote the support of $\Omega^*$ and $S^c$ the complement of $S$. Let $d$ be the maximum node degree in $\Omega^*$, and denote by $s$ the number of edges in the graph corresponding to $\Omega^*$. We introduce some additional notation to facilitate the presentation. For a vector $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, the $\ell_1$ norm $\sum_i |x_i|$ is written as $|x|_1$, and the $\ell_2$ norm $(\sum_{i=1}^{n} x_i^2)^{1/2}$ is written as $\|x\|$. For a matrix $X$, the elementwise matrix norm $\max_{i,j} |X_{i,j}|$ is written as $\|X\|_\infty$, the $\ell_1$ norm $\sum_{i,j} |X_{i,j}|$ is written as $\|X\|_1$, the $\ell_{1,\infty}$ norm $\max_i(\sum_j |X_{i,j}|)$ is written as $\|X\|_{1,\infty}$, and the operator norm $\max_{\|x\|=1} \|Xx\|$ is written as $\|X\|_2$. For any subset $T$ of $\{1, \ldots, p\} \times \{1, \ldots, p\}$, we denote by $\mathrm{vec}(X)_T$ the subvector of $\mathrm{vec}(X)$ indexed by $T$. For any two subsets $T_1$ and $T_2$ of $\{1, \ldots, p\} \times \{1, \ldots, p\}$, we denote by $X_{T_1 T_2}$ the submatrix of $X$ with rows and columns indexed by $T_1$ and $T_2$, respectively. We use $\lambda_{\max}(X)$ and $\lambda_{\min}(X)$ to denote the largest and smallest eigenvalues of a symmetric matrix $X$. We write $\theta_{\min} = \min_{(i,j) \in S} |\Omega^*_{i,j}|$, $\alpha = 1 - \max_{e \in S^c} \|\Gamma^*_{e,S}(\Gamma^*_{S,S})^{-1}\|_1$, $\Delta_\Gamma = \hat\Gamma - \Gamma^*$, $\Delta_\Sigma = \hat\Sigma - \Sigma^*$, $\varepsilon = \|\Delta_\Sigma\|_\infty$, $\kappa_\Gamma = \|(\Gamma^*_{S,S})^{-1}\|_{1,\infty}$ and $\kappa_\Sigma = \|\Sigma^*\|_{1,\infty}$, where $\Gamma^*$ and $\hat\Gamma$ are defined in § 5·2.
S,S

5·2. The irrepresentability condition


We first present the irrepresentability condition for establishing the model selection consistency of our estimator. An irrepresentability condition is also required for the lasso penalized Gaussian likelihood estimator for estimating sparse precision matrices (Ravikumar et al., 2011). Denoting the Kronecker matrix sum by $\oplus$ and the Kronecker matrix product by $\otimes$, our irrepresentability condition involves the function

$$\Gamma(\Sigma) = \frac{1}{2}(\Sigma \oplus \Sigma) = \frac{1}{2}(\Sigma \otimes I + I \otimes \Sigma).$$

Upon using the definition of the Kronecker matrix sum, we see that $\Gamma(\Sigma)$ is a $p^2 \times p^2$ matrix indexed by vertex pairs and that

$$\Gamma(\Sigma)_{(j,k),(l,m)} = \frac{1}{2}\{\Sigma_{k,m}\delta(j,l) + \Sigma_{j,l}\delta(k,m)\}, \qquad (25)$$

where $\delta(j,l) = 1$ if $j = l$ and $\delta(j,l) = 0$ if $j \neq l$. For simplicity, we write $\Gamma^* = \Gamma(\Sigma^*)$ and $\hat\Gamma = \Gamma(\hat\Sigma)$. In our theoretical analysis, the following irrepresentability condition is assumed:

$$\max_{e \in S^c} \|\Gamma^*_{e,S}(\Gamma^*_{S,S})^{-1}\|_1 < 1. \qquad (26)$$

It is interesting to compare (26) with the irrepresentability condition for the lasso penalized Gaussian likelihood estimator (Ravikumar et al., 2011, Assumption 1), which is

$$\max_{e \in S^c} \|(\Sigma^* \otimes \Sigma^*)_{e,S}\{(\Sigma^* \otimes \Sigma^*)_{S,S}\}^{-1}\|_1 < 1. \qquad (27)$$

Notice that (26) involves the Kronecker sum $\Sigma^* \oplus \Sigma^*$ while (27) uses the Kronecker product $\Sigma^* \otimes \Sigma^*$.
It is difficult to compare (26) and (27) in general. Here we compare them on a specific example used by Meinshausen (2008) and Ravikumar et al. (2011), with $\Omega^* \in \mathbb{R}^{4 \times 4}$, $\Omega^*_{i,i} = 1$, $\Omega^*_{2,3} = \Omega^*_{3,2} = 0$, $\Omega^*_{1,4} = \Omega^*_{4,1} = 2c^2$ and $\Omega^*_{i,j} = c$ otherwise, where we assume $c \in [-2^{-1/2}, 2^{-1/2}]$ so that $\Omega^*$ is positive definite. For this example, we can verify numerically that (26) holds for $|c| \leq 0{\cdot}31$ while (27) requires that $|c| < 0{\cdot}2017$ (Ravikumar et al., 2011, § 3.1.1). Thus, when $|c| \in [0{\cdot}2017, 0{\cdot}31]$, (26) holds while (27) fails.
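The left-hand side of (26) for such an example can be evaluated with a few lines of NumPy; the script below is our illustration (the helper name is ours), building $\Gamma(\Sigma) = (\Sigma \otimes I + I \otimes \Sigma)/2$ explicitly and taking row-wise $\ell_1$ norms over $S^c$.

```python
import numpy as np

def irrepresentability_lhs(Omega):
    """Left-hand side of (26): max over e in S^c of the l1 norm of Gamma*_{e,S} (Gamma*_{S,S})^{-1}."""
    p = Omega.shape[0]
    Sigma = np.linalg.inv(Omega)
    Gamma = 0.5 * (np.kron(Sigma, np.eye(p)) + np.kron(np.eye(p), Sigma))
    support = np.abs(Omega.flatten(order='F')) > 1e-12     # vec() stacks columns
    S, Sc = np.where(support)[0], np.where(~support)[0]
    M = Gamma[np.ix_(Sc, S)] @ np.linalg.inv(Gamma[np.ix_(S, S)])
    return np.max(np.abs(M).sum(axis=1))                   # row-wise l1 norms

c = 0.25
Omega = np.full((4, 4), c)
np.fill_diagonal(Omega, 1.0)
Omega[1, 2] = Omega[2, 1] = 0.0
Omega[0, 3] = Omega[3, 0] = 2 * c ** 2
print(irrepresentability_lhs(Omega))   # expected < 1 at c = 0.25, since (26) holds for |c| <= 0.31
```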
We also compared (26) and (27) on two autoregressive models of orders 1 and 3. In the first, we let $\Omega^* \in \mathbb{R}^{p \times p}$, $\Omega^*_{i,i} = 1$, $\Omega^*_{i,j} = c$ for $|i - j| = 1$ and $\Omega^*_{i,j} = 0$ otherwise. In the second, we let $\Omega^* \in \mathbb{R}^{p \times p}$, $\Omega^*_{i,i} = 1$, $\Omega^*_{i,j} = c$ for $1 \leq |i - j| \leq 3$ and $\Omega^*_{i,j} = 0$ otherwise. The condition (26) was less restrictive than (27) for all values of $p$ that we tested. For example, consider $p = 30$. For the autoregressive model of order 1, (26) holds for $|c| < 0{\cdot}41$ and (27) holds only for $|c| < 0{\cdot}35$; for the autoregressive model of order 3, (26) holds for $|c| < 0{\cdot}22$ while (27) holds only for $|c| < 0{\cdot}14$.

5·3. Rates of convergence


We establish rates of convergence and the model selection consistency of the penalized D-trace estimator under the assumption that $x_1, \ldots, x_n$ are independent and identically sampled from a sub-Gaussian distribution with covariance $\Sigma^*$ such that all the $X_i/(\Sigma^*_{i,i})^{1/2}$ are sub-Gaussian with parameter $\sigma$. Here $X_i$ is the $i$th coordinate of the random vector $X$, so we assume that

$$E[\exp\{tX_i(\Sigma^*_{i,i})^{-1/2}\}] \leq \exp(\sigma^2 t^2/2) \quad (t \in \mathbb{R}). \qquad (28)$$

THEOREM 2. Under (28) and the irrepresentability condition (26), choose

$$\lambda_n = 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\left\{128(1 + 4\sigma^2)^2 \max_i(\Sigma^*_{i,i})^2 (\eta\log p + \log 4)/n\right\}^{1/2}$$

for some $\eta > 2$ and

$$n > C_1 \max\Big[\lambda_{\min}(\Omega^*)^{-1}\min\{(s+p)^{1/2}, d\}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\},\ 12d\kappa_\Gamma,\ 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma),\ \{8(1 + 4\sigma^2)\max_i \Sigma^*_{i,i}\}^{-1}\Big]^2 (\eta\log p + \log 4),$$

where $C_1 = 128(1 + 4\sigma^2)^2 \max_i(\Sigma^*_{i,i})^2$. Then, with probability greater than $1 - 1/p^{\eta-2}$, we have

$$\|\hat\Omega - \Omega^*\|_\infty \leq \{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}\{C_1(\eta\log p + \log 4)/n\}^{1/2},$$

$$\|\hat\Omega - \Omega^*\|_F \leq (s+p)^{1/2}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}\{C_1(\eta\log p + \log 4)/n\}^{1/2},$$

$$\|\hat\Omega - \Omega^*\|_2 \leq \min\{(s+p)^{1/2}, d\}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}\{C_1(\eta\log p + \log 4)/n\}^{1/2}.$$

In addition, $\hat\Omega$ recovers all zeros in $\Omega^*$. Moreover, if

$$n > C_1\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}^2(\eta\log p + \log 4)/\theta_{\min}^2,$$

then $\hat\Omega$ recovers all zeros and nonzeros in $\Omega^*$.

Next, we establish rates of convergence and model selection consistency of the penalized D-trace estimator under a weaker polynomial tail assumption. Assume that $x_1, \ldots, x_n$ are independent and identically sampled from a distribution with polynomial tails having covariance $\Sigma^*$ such that $(\Sigma^*_{i,i})^{-1/2}X_i$ has finite $4m$th moments, i.e., there exist $m$ and $K_m \in \mathbb{R}$ such that

$$E\{(\Sigma^*_{i,i})^{-1/2}X_i\}^{4m} \leq K_m \quad (i = 1, \ldots, p). \qquad (29)$$

THEOREM 3. Under (29) and the irrepresentability condition (26), choose

$$\lambda_n = 24n^{-1/2}\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)(\max_i \Sigma^*_{i,i})(K_m + 1)^{1/(2m)}p^{\eta/(2m)}$$

for some $\eta > 2$ and

$$n > C_2\, p^{\eta/m} \max\Big[\lambda_{\min}(\Omega^*)^{-1}\min\{(s+p)^{1/2}, d\}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\},\ 12d\kappa_\Gamma,\ 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\Big]^2,$$

where $C_2 = \{2^{2m}(\max_i \Sigma^*_{i,i})^{2m}(K_m + 1)\}^{1/m}$. Then, with probability $1 - 1/p^{\eta-2}$, we have

$$\|\hat\Omega - \Omega^*\|_\infty \leq \{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}C_2^{1/2}p^{\eta/(2m)}n^{-1/2},$$

$$\|\hat\Omega - \Omega^*\|_F \leq (s+p)^{1/2}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}C_2^{1/2}p^{\eta/(2m)}n^{-1/2},$$

$$\|\hat\Omega - \Omega^*\|_2 \leq \min\{(s+p)^{1/2}, d\}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}C_2^{1/2}p^{\eta/(2m)}n^{-1/2}.$$

In addition, $\hat\Omega$ recovers all zeros in $\Omega^*$. Moreover, if

$$n > C_2\, p^{\eta/m}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}^2/\theta_{\min}^2,$$

then $\hat\Omega$ recovers all zeros and nonzeros in $\Omega^*$.

These rate-of-convergence results look similar to those of Ravikumar et al. (2011). However,
our technical analysis is different from theirs. The key component in their analysis is Brouwer’s
fixed-point theorem, but we can use a more direct approach to analyse the penalized D-trace
estimator, thanks to its simple expression.

6. DISCUSSION
In the empirical loss minimization framework, the D-trace loss, which is basically a quadratic function of the precision matrix, is much simpler than the Gaussian likelihood loss. Its simplicity leads to theoretical and computational advantages. We have provided theoretical and empirical evidence to support the D-trace loss and the lasso penalized D-trace estimator. On the other hand, our results do not imply that the D-trace loss estimator is superior to the graphical lasso. Conceptually, the D-trace loss is to the Gaussian likelihood as the hinge loss underlying the support vector machine is to the binomial likelihood for logistic regression. Each has its own merits, and neither dominates the other. An open question remains concerning the irrepresentability condition: we can neither prove nor disprove that (26) is always weaker than (27). This technical problem will be studied in another paper.

ACKNOWLEDGEMENT
We thank Dr Shiqian Ma for sharing the Matlab code that implements his alternating direction
method for computing the lasso penalized Gaussian likelihood estimator. We thank the editor,
associate editor and referees for their suggestions, as well as Professors Peter Bühlmann and Tony
Cai for helpful discussions. Zou’s research was supported in part by the U.S. National Science
Foundation and the Office of Naval Research.

APPENDIX: TECHNICAL PROOFS

Proof of Theorem 1
With a positive definite $A$, $\langle \Omega^2, A \rangle/2 - \langle B, \Omega \rangle$ is a strictly convex function over $\Omega$. Therefore, we only need to check that its derivative is zero at $G(A, B)$, i.e., $2^{-1}\{AG(A,B) + G(A,B)A\} - B = 0$. Equivalently, we need to check that

$$\frac{1}{2}\big[\mathrm{diag}(\sigma_1, \ldots, \sigma_p)\{U_A^{\mathrm{T}}G(A,B)U_A\} + \{U_A^{\mathrm{T}}G(A,B)U_A\}\,\mathrm{diag}(\sigma_1, \ldots, \sigma_p)\big] = U_A^{\mathrm{T}}BU_A.$$

The above equation can be verified by calculation for $G(A, B)$ defined in (19), and so Theorem 1 is proved.
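A quick numerical sanity check of (19), using the G helper sketched in § 3 (our code, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
A = A @ A.T + 6 * np.eye(6)          # symmetric positive-definite A
B = rng.standard_normal((6, 6))
B = 0.5 * (B + B.T)                  # symmetric B
Om = G(A, B)                         # G as defined from Theorem 1
print(np.max(np.abs(0.5 * (A @ Om + Om @ A) - B)))   # close to machine precision
```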

Proofs of Theorems 2 and 3


We prove these two theorems simultaneously. For clarity of presentation, we first sketch the proof and
then fill in the details of the technical lemmas and their proofs.
Following Definition 1 in Ravikumar et al. (2011), we assume that there exists a constant $v_* > 0$ and a function $f$ such that

$$\mathrm{pr}(|\hat\Sigma_{i,j} - \Sigma^*_{i,j}| \geq \delta) \leq 1/f(n, \delta) \quad (1 \leq i, j \leq p;\ 0 < \delta < 1/v_*). \qquad (A1)$$

We also define

$$n_f(\delta, r) = \arg\max\{n : f(n, \delta) \leq r\}, \qquad \delta_f(n, r) = \arg\max\{\delta : f(n, \delta) \leq r\}.$$

The tail assumption (A1) holds for a large class of random vectors. Two special cases, sub-Gaussian tails and polynomial tails, are defined in (28) and (29). When (28) holds, we have $v_* = \{8(1 + 4\sigma^2)\max_i \Sigma^*_{i,i}\}^{-1}$ and $f(n, \delta) = \exp(c_* n\delta^2)/4$, where $c_* = \{128(1 + 4\sigma^2)^2\max_i(\Sigma^*_{i,i})^2\}^{-1}$ (Ravikumar et al., 2011, § 2.3.1). Straightforward calculation gives $\delta_f(n, p^\eta) = \{128(1 + 4\sigma^2)^2\max_i(\Sigma^*_{i,i})^2(\eta\log p + \log 4)/n\}^{1/2}$ and $n_f(\delta, p^\eta) = 128(1 + 4\sigma^2)^2\max_i(\Sigma^*_{i,i})^2(\eta\log p + \log 4)/\delta^2$. When (29) holds, we have $v_* = 0$ and $f(n, \delta) = c_* n^m\delta^{2m}$, where $c_* = 2^{-2m}(\max_i \Sigma^*_{i,i})^{-2m}(K_m + 1)^{-1}$ (Ravikumar et al., 2011, § 2.3.2). Thus $\delta_f(n, p^\eta) = p^{\eta/(2m)}c_*^{-1/(2m)}n^{-1/2}$ and $n_f(\delta, p^\eta) = p^{\eta/m}c_*^{-1/m}\delta^{-2}$. With these preparations in place, Theorems 2 and 3 can be proved using the following technical lemma.

LEMMA A1. Define $\breve\Omega$ by

$$\breve\Omega = \arg\min_{\Omega \in \mathbb{R}^{p \times p},\,\Omega = \Omega^{\mathrm{T}}} \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n\|\Omega\|_{1,\mathrm{off}}. \qquad (A2)$$

Then the following hold:

(a) $\mathrm{vec}(\breve\Omega)_{S^c} = 0$ if

$$\max_{e \in S^c}\big|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(I)_S\big| < \alpha\lambda_n/2, \qquad \max_{e \in S^c}\|\hat\Gamma_{e,S}(\hat\Gamma_{S,S})^{-1}\|_1 \leq 1 - \alpha/2; \qquad (A3)$$

(b) $\mathrm{vec}(\breve\Omega)_{S^c} = 0$ if

$$\varepsilon < \frac{1}{12d\kappa_\Gamma}, \qquad (A4)$$

$$6\varepsilon(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma) \leq 0{\cdot}5\,\alpha\min(\lambda_n, 1); \qquad (A5)$$

(c) assuming the conditions in part (b), we also have

$$\|\breve\Omega - \Omega^*\|_\infty < \lambda_n\kappa_\Gamma + \frac{5}{2}d(1 + \lambda_n)\varepsilon\kappa_\Gamma^2. \qquad (A6)$$

The proof of Lemma A1 is based on the following auxiliary lemma, which is used to control $\|\hat\Gamma_{S,S}^{-1} - \Gamma_{S,S}^{*-1}\|_\infty$ and $\|\hat\Gamma_{S,S}^{-1} - \Gamma_{S,S}^{*-1}\|_{1,\infty}$ by $\varepsilon = \|\hat\Sigma - \Sigma^*\|_\infty$. For convenience we present it here.

LEMMA A2. Assuming (A4), we have

$$\|R(\Delta_\Gamma)\|_{1,\infty} \leq 6d^2\varepsilon^2\kappa_\Gamma^3, \qquad \|R(\Delta_\Gamma)\|_\infty \leq 12d\varepsilon^2\kappa_\Gamma^3, \qquad (A7)$$

where $R(\Delta_\Gamma) = \{\Gamma^*_{S,S} + (\Delta_\Gamma)_{S,S}\}^{-1} - \Gamma^{*-1}_{S,S} + \Gamma^{*-1}_{S,S}(\Delta_\Gamma)_{S,S}\Gamma^{*-1}_{S,S}$. Moreover, we have

$$\|\hat\Gamma_{S,S}^{-1} - \Gamma_{S,S}^{*-1}\|_{1,\infty} \leq 6d^2\varepsilon^2\kappa_\Gamma^3 + 2d\varepsilon\kappa_\Gamma^2, \qquad (A8)$$

$$\|\hat\Gamma_{S,S}^{-1} - \Gamma_{S,S}^{*-1}\|_\infty \leq 12d\varepsilon^2\kappa_\Gamma^3 + 2\varepsilon\kappa_\Gamma^2. \qquad (A9)$$

In this proof we assume the general choices of $n$ and $\lambda_n$:

$$n > n_f\Big(1\Big/\max\big[\lambda_{\min}(\Omega^*)^{-1}\min\{(s+p)^{1/2}, d\}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\},\ \theta_{\min}^{-1}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\},\ 12d\kappa_\Gamma,\ 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma),\ v_*\big],\ p^\eta\Big) \qquad (A10)$$

and $\lambda_n = 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\delta_f(n, p^\eta)$ for some $\eta > 2$.

(a) By the definition of $n_f$, with probability at least $1 - 1/p^{\eta-2}$ we have

$$\varepsilon = \|\hat\Sigma - \Sigma^*\|_\infty \leq \delta_f(n, p^\eta) < 1\big/\max\{12d\kappa_\Gamma,\ 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma),\ v_*\}. \qquad (A11)$$

Now we verify the two assumptions in Lemma A1(b). Assumption (A4) is easy to verify using (A11). From (A11) and the definition of $\lambda_n$ we also have $\lambda_n \leq 1$, and (A5) follows from the definition of $\lambda_n$ and the fact that $\lambda_n \leq 1$.

The convergence rate of $\|\breve\Omega - \Omega^*\|_\infty$ then follows from (A6), (A4), the control of $\varepsilon$ by $\delta_f(n, p^\eta)$ in (A11), the definition of $\lambda_n$, and the fact that $\lambda_n \leq 1$:

$$\|\breve\Omega - \Omega^*\|_\infty < \lambda_n\kappa_\Gamma + \frac{5}{2}d(1 + \lambda_n)\varepsilon\kappa_\Gamma^2 \leq \lambda_n\kappa_\Gamma + 5d\varepsilon\kappa_\Gamma^2 \leq \{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}\delta_f(n, p^\eta). \qquad (A12)$$

The estimation of $\|\hat\Omega - \Omega^*\|_\infty$ follows from (A12), the fact that $\hat\Omega = \breve\Omega$, which will be shown at the end of the proof of Lemma A1, and the estimation of $v_*$, $f(n, p^\eta)$ and $\delta_f(n, p^\eta)$.

(b) Combining the bound on $\|\breve\Omega - \Omega^*\|_\infty$ with the fact that there are at most $s + p$ nonzero elements in $\breve\Omega$ and that the nonzeros of $\breve\Omega$ form a subset of those of $\Omega^*$, we obtain

$$\|\breve\Omega - \Omega^*\|_F \leq (s+p)^{1/2}\|\breve\Omega - \Omega^*\|_\infty \leq (s+p)^{1/2}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}\delta_f(n, p^\eta) \qquad (A13)$$

and

$$\|\breve\Omega - \Omega^*\|_2 \leq \min\{(s+p)^{1/2}, d\}\|\breve\Omega - \Omega^*\|_\infty \leq \min\{(s+p)^{1/2}, d\}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}\delta_f(n, p^\eta). \qquad (A14)$$

The estimation of $\|\hat\Omega - \Omega^*\|_2$ and $\|\hat\Omega - \Omega^*\|_F$ follows from (A13), (A14), the equality $\hat\Omega = \breve\Omega$, and the estimation of $v_*$, $f(n, p^\eta)$ and $\delta_f(n, p^\eta)$.

(c) By part (b) of Lemma A1, $\breve\Omega$ recovers all zeros in $\Omega^*$. When (A10) holds, with probability at least $1 - 1/p^{\eta-2}$ we have that $\delta_f(n, p^\eta) \leq \theta_{\min}/\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}$. By combining this with (A12), $\breve\Omega$ recovers all zeros and nonzeros in $\Omega^*$. Finally, we show that $\hat\Omega = \breve\Omega$. Using the fact that with probability at least $1 - 1/p^{\eta-2}$, $\delta_f(n, p^\eta) \leq \lambda_{\min}(\Omega^*)/[\min\{(s+p)^{1/2}, d\}\{5d\kappa_\Gamma^2 + 12\alpha^{-1}(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma)\}]$, together with (A14), we deduce that $\lambda_{\min}(\breve\Omega) > 0$ and therefore $\hat\Omega = \breve\Omega$. This completes the proof of Theorems 2 and 3.

Proof of Lemma A1
(a) First, we define $\tilde\Omega$ as the solution to the hypothetical problem

$$\tilde\Omega = \arg\min_{\Omega = \Omega^{\mathrm{T}},\,\Omega_{S^c} = 0} \frac{1}{2}\langle \Omega^2, \hat\Sigma \rangle - \mathrm{tr}(\Omega) + \lambda_n\|\Omega\|_{1,\mathrm{off}}. \qquad (A15)$$

From its directional derivative, we obtain the equality

$$\{(\hat\Sigma\tilde\Omega + \tilde\Omega\hat\Sigma)/2 - I + \lambda_n Z\}_S = 0,$$

where

$$Z_{i,j}\ \begin{cases} = 0, & (i,j) \in S^c \text{ or } i = j, \\ = \mathrm{sign}(\tilde\Omega_{i,j}), & (i,j) \in S,\ i \neq j,\ \tilde\Omega_{i,j} \neq 0, \\ \in [-1, 1], & (i,j) \in S,\ i \neq j,\ \tilde\Omega_{i,j} = 0. \end{cases}$$

Applying the definition of $\hat\Gamma = \Gamma(\hat\Sigma)$ in (25), this can be rewritten as

$$\{\hat\Gamma\,\mathrm{vec}(\tilde\Omega) - \mathrm{vec}(I) + \lambda_n\mathrm{vec}(Z)\}_S = 0. \qquad (A16)$$

Recall that $\tilde\Omega_{S^c} = 0$; thus (A16) is equivalent to $\hat\Gamma_{S,S}\mathrm{vec}(\tilde\Omega)_S - \mathrm{vec}(I)_S + \lambda_n\mathrm{vec}(Z)_S = 0$, and the explicit solution to (A15) is

$$\mathrm{vec}(\tilde\Omega)_S = \hat\Gamma_{S,S}^{-1}\{\mathrm{vec}(I)_S - \lambda_n\mathrm{vec}(Z)_S\}. \qquad (A17)$$
Now we verify that $\tilde\Omega$ is also the solution to (A2). Since the objective function in (A2) is convex, we only need to verify that its derivative at $\Omega = \tilde\Omega$ is zero; that is,

$$\left|\left\{\frac{1}{2}(\hat\Sigma\tilde\Omega + \tilde\Omega\hat\Sigma) - I\right\}_{i,j}\right| \leq \lambda_n \quad (1 \leq i \neq j \leq p),$$

$$\left\{\frac{1}{2}(\hat\Sigma\tilde\Omega + \tilde\Omega\hat\Sigma) - I\right\}_{i,i} = 0 \quad (i = 1, \ldots, p). \qquad (A18)$$

Applying (A16), we have that (A18) holds when $(i,j) \in S$. Therefore we need only verify (A18) for $(i,j) \in S^c$. As $\mathrm{vec}(I)_{S^c} = 0$, to prove (A18) it is sufficient to prove that for $e \in S^c$,

$$|\hat\Gamma_{e,S}\mathrm{vec}(\tilde\Omega)_S| \leq \lambda_n. \qquad (A19)$$

Upon combining (A17) with (A19), it suffices to prove that for $e \in S^c$,

$$|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(I)_S - \lambda_n\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(Z)_S| \leq \lambda_n. \qquad (A20)$$

Since $\|\mathrm{vec}(Z)_S\|_\infty \leq 1$, we have $|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(Z)_S| \leq \|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\|_1$. Combining this upper bound of $|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(Z)_S|$ with the assumptions in (A3), we prove (A20) as follows:

$$|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(I)_S - \lambda_n\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(Z)_S| \leq |\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(I)_S| + \lambda_n|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(Z)_S| \leq |\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(I)_S| + \lambda_n\|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\|_1 \leq \alpha\lambda_n/2 + \lambda_n(1 - \alpha/2) = \lambda_n.$$

Since (A20) implies (A18), we have shown that $\tilde\Omega$ is also the solution $\breve\Omega$ in (A2). By the definition of $\tilde\Omega$, we obtain $\mathrm{vec}(\breve\Omega)_{S^c} = 0$.
(b) We prove this part in two steps. First, we show that (A21) implies the two conditions in (A3):

$$\max_{e \in S^c}\|\hat\Gamma_{e,S}(\hat\Gamma_{S,S})^{-1} - \Gamma^*_{e,S}(\Gamma^*_{S,S})^{-1}\|_1 \leq 0{\cdot}5\,\alpha\min(\lambda_n, 1). \qquad (A21)$$

Then we prove (A21). Therefore we get $\mathrm{vec}(\breve\Omega)_{S^c} = 0$ upon applying the result of part (a).

Combining $\alpha = 1 - \max_{e \in S^c}\|\Gamma^*_{e,S}(\Gamma^*_{S,S})^{-1}\|_1$ with the triangle inequality, we obtain the second assumption in (A3) from (A21). Using the fact that

$$\Gamma^{*-1}_{S,S}\mathrm{vec}(I)_S = \mathrm{vec}(\Omega^*)_S, \qquad (A22)$$

we have $\Gamma^*_{S^c,S}\{\Gamma^{*-1}_{S,S}\mathrm{vec}(I)_S\} = \Gamma^*_{S^c,S}\{\mathrm{vec}(\Omega^*)_S\} = \mathrm{vec}\{(\Sigma^*\Omega^* + \Omega^*\Sigma^*)/2\}_{S^c} = 0$, and the first condition in (A3) can be verified as follows:

$$|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1}\mathrm{vec}(I)_S| = |(\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1} - \Gamma^*_{e,S}\Gamma^{*-1}_{S,S})\mathrm{vec}(I)_S| + |\Gamma^*_{e,S}\Gamma^{*-1}_{S,S}\mathrm{vec}(I)_S| \leq \|\hat\Gamma_{e,S}\hat\Gamma_{S,S}^{-1} - \Gamma^*_{e,S}\Gamma^{*-1}_{S,S}\|_1 + 0 \leq \alpha\lambda_n/2.$$
We now prove (A21). Since the right-hand side of (A21) is equivalent to the right-hand side of (A5), we need only prove that the left-hand side of (A21) is smaller than the left-hand side of (A5). Note that $\|\Gamma^*\|_\infty \leq 2\|\Sigma^*\|_\infty$ and $\|\Gamma^*\|_{1,\infty} \leq 2\|\Sigma^*\|_{1,\infty}$; the left-hand side of (A21) can be controlled as follows, by applying (A9), (A26) and (A27): for any $e \in S^c$,

$$\|\hat\Gamma_{e,S}(\hat\Gamma_{S,S})^{-1} - \Gamma^*_{e,S}(\Gamma^*_{S,S})^{-1}\|_1 = \|(\hat\Gamma_{e,S} - \Gamma^*_{e,S})(\Gamma^*_{S,S})^{-1} + \Gamma^*_{e,S}(\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S}) + (\hat\Gamma_{e,S} - \Gamma^*_{e,S})(\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S})\|_1$$
$$\leq \|(\hat\Gamma_{e,S} - \Gamma^*_{e,S})\Gamma^{*-1}_{S,S}\|_1 + \|\Gamma^*_{e,S}(\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S})\|_1 + \|(\hat\Gamma_{e,S} - \Gamma^*_{e,S})(\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S})\|_1$$
$$\leq \|\hat\Gamma_{e,S} - \Gamma^*_{e,S}\|_\infty\|\Gamma^{*-1}_{S,S}\|_{1,\infty} + 2\|\Sigma^*\|_{1,\infty}\|\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S}\|_\infty + 2d\varepsilon\|\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S}\|_\infty$$
$$\leq 2\varepsilon\kappa_\Gamma + (2\kappa_\Sigma + 2d\varepsilon)(12d\varepsilon^2\kappa_\Gamma^3 + 2\varepsilon\kappa_\Gamma^2). \qquad (A23)$$

Inserting (A4) into the right-hand side of (A23) yields the simplification

$$2\varepsilon\kappa_\Gamma + (2\kappa_\Sigma + 2d\varepsilon)(12d\varepsilon^2\kappa_\Gamma^3 + 2\varepsilon\kappa_\Gamma^2) \leq 2\varepsilon\kappa_\Gamma + \left(2\kappa_\Sigma + \frac{1}{6\kappa_\Gamma}\right)(\varepsilon\kappa_\Gamma^2 + 2\varepsilon\kappa_\Gamma^2) = \varepsilon\left(2\kappa_\Gamma + 6\kappa_\Sigma\kappa_\Gamma^2 + \frac{\kappa_\Gamma}{2}\right) < 6\varepsilon(\kappa_\Sigma\kappa_\Gamma^2 + \kappa_\Gamma). \qquad (A24)$$

Upon combining (A5), (A23) and (A24), we obtain (A21).


(c) By using the fact that $\breve\Omega = \tilde\Omega$, along with (A17) and (A22), we obtain

$$\|\breve\Omega - \Omega^*\|_\infty = \|\mathrm{vec}(\tilde\Omega)_S - \mathrm{vec}(\Omega^*)_S\|_\infty = \|(\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S})\mathrm{vec}(I)_S - \lambda_n\hat\Gamma_{S,S}^{-1}\mathrm{vec}(Z)_S\|_\infty$$
$$\leq \|\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S}\|_{1,\infty} + \lambda_n\|\hat\Gamma_{S,S}^{-1}\|_{1,\infty} \leq (1 + \lambda_n)\|\hat\Gamma_{S,S}^{-1} - \Gamma^{*-1}_{S,S}\|_{1,\infty} + \lambda_n\|\Gamma^{*-1}_{S,S}\|_{1,\infty}. \qquad (A25)$$

Then we prove (A6) by applying (A4) and (A8) to the right-hand side of (A25):

$$\|\breve\Omega - \Omega^*\|_\infty \leq \lambda_n\kappa_\Gamma + (1 + \lambda_n)(6d^2\varepsilon^2\kappa_\Gamma^3 + 2d\varepsilon\kappa_\Gamma^2) < \lambda_n\kappa_\Gamma + \frac{5}{2}(1 + \lambda_n)d\varepsilon\kappa_\Gamma^2.$$

This completes the proof of Lemma A1.

Proof of Lemma A2
Using the definition of $\hat\Gamma$ and $\Gamma^*$, we have

$$\|(\Delta_\Gamma)_{S,S}\|_{1,\infty} \leq 2d\varepsilon, \qquad (A26)$$

and then (A4) implies that $\|\Gamma^{*-1}_{S,S}\|_{1,\infty}\|(\Delta_\Gamma)_{S,S}\|_{1,\infty} < 1/3$. Following the proof of Ravikumar et al. (2011, Appendix B), we obtain that $\|R(\Delta_\Gamma)\|_\infty \leq 3\|(\Delta_\Gamma)_{S,S}\|_\infty\|(\Delta_\Gamma)_{S,S}\|_{1,\infty}\kappa_\Gamma^3/2$ and $\|R(\Delta_\Gamma)\|_{1,\infty} \leq 3\|(\Delta_\Gamma)_{S,S}\|_{1,\infty}^2\kappa_\Gamma^3/2$. Then we prove (A7) by combining (A26) with the fact that

$$\|(\Delta_\Gamma)_{S,S}\|_\infty \leq \|\hat\Gamma - \Gamma^*\|_\infty \leq 2\|\hat\Sigma - \Sigma^*\|_\infty = 2\varepsilon. \qquad (A27)$$

Moreover,

$$\|\Gamma^{*-1}_{S,S}(\Delta_\Gamma)_{S,S}\Gamma^{*-1}_{S,S}\|_{1,\infty} \leq \|(\Delta_\Gamma)_{S,S}\|_{1,\infty}\|\Gamma^{*-1}_{S,S}\|_{1,\infty}^2 \leq 2d\varepsilon\kappa_\Gamma^2, \qquad (A28)$$

$$\|\Gamma^{*-1}_{S,S}(\Delta_\Gamma)_{S,S}\Gamma^{*-1}_{S,S}\|_\infty \leq \|(\Delta_\Gamma)_{S,S}\|_\infty\|\Gamma^{*-1}_{S,S}\|_{1,\infty}^2 = \|(\Delta_\Gamma)_{S,S}\|_\infty\kappa_\Gamma^2 \leq 2\varepsilon\kappa_\Gamma^2. \qquad (A29)$$

Then (A8) and (A9) are obtained by combining (A7), (A28), (A29) and the definition of $R(\Delta_\Gamma)$. This completes the proof of Lemma A2.
REFERENCES
BANERJEE, O., EL GHAOUI, L. & D’ASPREMONT, A. (2008). Model selection through sparse maximum likelihood
estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9, 485–516.
CAI, T., LIU, W. & LUO, X. (2011). A constrained ℓ1 minimization approach to sparse precision matrix estimation.
J. Am. Statist. Assoc. 106, 594–607.
CANDÈS, E. & TAO, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist.
35, 2313–51.
DOBRA, A., EICHER, T. & LENKOSKI, A. (2009). Modeling uncertainty in macroeconomic growth determinants using
Gaussian graphical models. Statist. Methodol. 7, 292–306.
DUCHI, J., GOULD, S. & KOLLER, D. (2008). Projected subgradient methods for learning sparse Gaussians. In Proc.
24th Annual Conf. Uncertainty Artif. Intel. (UAI 2008). Corvallis, Oregon: AUAI Press, pp. 145–52.
FRIEDMAN, J. H., HASTIE, T. J. & TIBSHIRANI, R. J. (2008). Sparse inverse covariance estimation with the graphical

Downloaded from http://biomet.oxfordjournals.org/ at University of Minnesota,Walter Library Serial Processing on March 5, 2014
lasso. Biostatistics 9, 432–41.
HUANG, J., LIU, N., POURAHMADI, M. & LIU, L. (2006). Covariance matrix selection and estimation via penalised
normal likelihood. Biometrika 93, 85–98.
LI, H. & GUI, J. (2006). Gradient directed regularization for sparse Gaussian concentration graphs, with applications
to inference of genetic networks. Biostatistics 7, 302–17.
LI, S. (2009). Markov Random Field Modeling in Image Analysis. New York: Springer.
LU, Z. (2009). Smooth optimization approach for sparse covariance selection. SIAM J. Optimiz. 19, 1807–27.
MEINSHAUSEN, N. (2008). A note on the lasso for Gaussian graphical model selection. Statist. Prob. Lett. 78, 880–4.
MEINSHAUSEN, N. & BÜHLMANN, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann.
Statist. 34, 1436–62.
PEASE, M. (1965). Methods of Matrix Algebra. London: Academic Press.
PENG, J., WANG, P., ZHOU, N. & ZHU, J. (2009). Partial correlation estimation by joint sparse regression models.
J. Am. Statist. Assoc. 104, 735–46.
RAVIKUMAR, P., WAINWRIGHT, M., RASKUTTI, G. & YU, B. (2011). High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Statist. 5, 935–80.
ROTHMAN, A., BICKEL, P., LEVINA, E. & ZHU, J. (2008). Sparse permutation invariant covariance estimation. Electron.
J. Statist. 2, 494–515.
SCHEINBERG, K., MA, S. & GOLDFARB, D. (2010). Sparse inverse covariance selection via alternating linearization methods. In Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel & A. Culotta, eds. New York: Curran Associates, pp. 2101–9.
TIBSHIRANI, R. J. (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–88.
WHITTAKER, J. (1990). Graphical Models in Applied Multivariate Statistics. Chichester: Wiley.
WILLE, A. & BÜHLMANN, P. (2006). Low-order conditional independence graphs for inferring genetic networks.
Statist. Appl. Genet. Molec. Biol. 5, Issue 1, Article 1.
WITTEN, D., FRIEDMAN, J. H. & SIMON, N. (2011). New insights and faster computations for the graphical lasso.
J. Comp. Graph. Statist. 20, 892–900.
YUAN, M. (2010). High dimensional inverse covariance matrix estimation via linear programming. J. Mach. Learn.
Res. 11, 2261–86.
YUAN, M. & LIN, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94, 19–35.

[Received May 2012. Revised October 2013]
