
A Stepwise Approach for High-Dimensional Gaussian Graphical Models

Ginette Lafit∗
University of Leuven, Belgium

Francisco Nogales
Universidad Carlos III de Madrid, Spain

Marcelo Ruiz
Universidad Nacional de Río Cuarto, Argentina

Ruben Zamar
University of British Columbia, Canada

January 12, 2020

Abstract

We present a stepwise approach to estimate high dimensional Gaussian graphical models. We exploit the relation between the partial correlation coefficients and the distribution of the prediction errors, and parametrize the model in terms of the Pearson correlation coefficients between the prediction errors of the nodes' best linear predictors. We propose a novel stepwise algorithm for detecting pairs of conditionally dependent variables. We compare the proposed algorithm with existing methods including graphical lasso (Glasso), constrained $\ell_1$-minimization (CLIME) and equivalent partial correlation (EPC), via simulation studies and real life applications. In our simulation study we consider several model settings and report the results using different performance measures that look at desirable features of the recovered graph.

Keywords: Covariance Selection, Gaussian Graphical Model, Forward and Backward Selection, Partial Correlation Coefficient.

The authors thank the generous support of NSERC, Canada, the Institute of Financial Big Data, University Carlos III of Madrid, and the CSIC, Spain.

1 Introduction

High-dimensional Gaussian graphical models (GGM) are widely used in practice to represent the linear dependency between variables. The underlying idea in GGM is to measure linear dependencies by estimating partial correlations to infer whether there is an association between a given pair of variables, conditionally on the remaining ones. Moreover, there is a close relation between the nonzero partial correlation coefficients and the nonzero entries in the inverse of the covariance matrix. Covariance selection procedures take advantage of this fact to estimate the GGM conditional dependence structure given a sample (Dempster, 1972; Lauritzen, 1996; Edwards, 2000).

When the dimension p is larger than the number n of observations, the sample covariance matrix S is not invertible and the maximum likelihood estimate (MLE) of Σ does not exist. When p/n is smaller than 1 but close to 1, S is invertible but ill-conditioned, which increases the estimation error (Ledoit and Wolf, 2004). To deal with this problem, several covariance selection procedures have been proposed based on the assumption that the inverse of the covariance matrix Ω, called the precision matrix, is sparse.

We present an approach to perform covariance selection in a high dimensional GGM based on a forward-backward algorithm, which we call StepGraph. Our procedure takes advantage of the relation between the partial correlation and the Pearson correlation coefficient of the residuals.

Existing methods to estimate the GGM can be classified into three classes: nodewise regression methods, maximum likelihood methods, and limited order partial correlation methods. The nodewise regression method was proposed by Meinshausen and Bühlmann (2006). This method estimates a lasso regression for each node in the graph. See for example Peng et al. (2009), Yuan (2010), Liu and Wang (2012), Zhou et al. (2011) and Ren et al. (2015). Penalized likelihood methods include Yuan and Lin (2007), Banerjee et al. (2008), Friedman et al. (2008), Johnson et al. (2011) and Ravikumar et al. (2011), among others. Cai et al. (2011) propose an estimator called CLIME that estimates precision matrices by solving the dual of an $\ell_1$ penalized maximum likelihood problem. Limited order partial correlation procedures use lower order partial correlations to test for conditional independence relations. See Spirtes et al. (2000), Kalisch and Bühlmann (2007), Rütimann et al. (2009), Liang et al. (2015) and Huang et al. (2016).

The rest of the article is organized as follows. Section 2 introduces the stepwise approach along with some notation. Section 3 gives simulation results and a real data example. Section 4 presents some concluding remarks. Appendix A reports a detailed description of the cross-validation procedure used to determine the required parameters in our StepGraph algorithm, and Appendix B gives additional simulation results.

2 Stepwise Approach to Covariance Selection

2.1 Definitions and Notation

In this section we review some definitions and technical concepts needed later on. Let $G = (V, E)$ be a graph, where $V \neq \emptyset$ is the set of nodes or vertices and $E \subseteq V \times V = V^2$ is the set of edges. For simplicity we assume that $V = \{1, \ldots, p\}$. The graph $G$ is undirected, that is, $(i, j) \in E$ if and only if $(j, i) \in E$. Two nodes $i$ and $j$ are called connected, adjacent or neighbors if $(i, j) \in E$.

A graphical model (GM) is a graph such that $V$ indexes a set of variables $\{X_1, \ldots, X_p\}$ and $E$ is defined by
$$(i, j) \notin E \ \text{if and only if} \ X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}. \qquad (1)$$
Here $\perp\!\!\!\perp$ denotes conditional independence.

Given a node $i \in V$, its neighborhood $A_i$ is defined as
$$A_i = \{l \in V \setminus \{i\} : (i, l) \in E\}. \qquad (2)$$
Notice that $A_i$ gives the nodes directly connected with $i$; therefore, a GM can be effectively described by giving the system of neighborhoods $\{A_i\}_{i=1}^p$.

We further assume that $(X_1, \ldots, X_p)^\top \sim N(0, \Sigma)$, where $\Sigma = (\sigma_{ij})_{i,j=1,\ldots,p}$ is a positive-definite covariance matrix. In this case the graph is called a Gaussian graphical model (GGM). The matrix $\Omega = (\omega_{ij})_{i,j=1,\ldots,p} = \Sigma^{-1}$ is called the precision matrix.

There exists an extensive literature on GM and GGM. For a detailed treatment of the theory see for instance Lauritzen (1996), Edwards (2000), and Bühlmann and Van De Geer (2011).

2.2 Conditional dependence in a GGM

In a GGM the set of edges $E$ represents the conditional dependence structure of the vector $(X_1, \ldots, X_p)$. To represent this dependence structure as a statistical model it is convenient to find a parametrization for $E$.

In this subsection we introduce a convenient parametrization of $E$ using well known results from classical multivariate analysis. For an exhaustive treatment of these results see, for instance, Anderson (2003), Cramér (1999), Lauritzen (1996) and Eaton (2007).

Given a subset $A$ of $V$, $X_A$ denotes the vector of variables with subscripts in $A$ in increasing order. For a given pair of nodes $(i, l)$, set $X_1^\top = (X_i, X_l)$, $X_2^\top = X_{V \setminus \{i,l\}}^\top$ and $X^\top = (X_1^\top, X_2^\top)$.

Note that $X$ has a multivariate normal distribution with mean 0 and covariance matrix
$$\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \qquad (3)$$
such that $\Sigma_{11}$ has dimension $2 \times 2$, $\Sigma_{12}$ has dimension $2 \times (p-2)$, and so on. The matrix in (3) is a partition of a permutation of the original covariance matrix $\Sigma$ and, after a small abuse of notation, will also be denoted by $\Sigma$.

Moreover, we set
$$\Omega = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} = \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix}.$$
Then, by (B.2) of Lauritzen (1996), the blocks $\Omega_{ij}$ can be written explicitly in terms of the $\Sigma_{ij}$ and $\Sigma_{ij}^{-1}$. In particular
$$\Omega_{11}^{-1} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}, \quad \text{where} \quad \Omega_{11} = \begin{pmatrix} \omega_{ii} & \omega_{il} \\ \omega_{li} & \omega_{ll} \end{pmatrix}$$
is the submatrix of $\Omega$ with rows $i$ and $l$ and columns $i$ and $l$. Hence,
$$\mathrm{cov}(X_1 \mid X_2) = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} = \Omega_{11}^{-1} = \frac{1}{\omega_{ii}\omega_{ll} - \omega_{il}\omega_{li}} \begin{pmatrix} \omega_{ll} & -\omega_{il} \\ -\omega_{li} & \omega_{ii} \end{pmatrix} \qquad (4)$$
and, in consequence, the partial correlation between $X_i$ and $X_l$ can be expressed as
$$\mathrm{corr}\big(X_i, X_l \mid X_{V \setminus \{i,l\}}\big) = -\frac{\omega_{il}}{\sqrt{\omega_{ii}\,\omega_{ll}}}. \qquad (5)$$

This gives the standard parametrization of $E$ in terms of the support of the precision matrix
$$\mathrm{supp}(\Omega) = \{(i, l) \in V^2 : i \neq l, \ \omega_{il} \neq 0\}. \qquad (6)$$

We now introduce another parametrization of $E$, which we need to define and implement our proposed method. We consider the regression error for the regression of $X_1$ on $X_2$,
$$\varepsilon = X_1 - \widehat{X}_1 = X_1 - \beta^\top X_2,$$
and let $\varepsilon_i$ and $\varepsilon_l$ denote the entries of $\varepsilon$ (i.e. $\varepsilon^\top = (\varepsilon_i, \varepsilon_l)$). The regression error $\varepsilon$ is independent of $\widehat{X}_1$ and has a normal distribution with mean 0 and covariance matrix $\Psi_{11}$ with elements denoted by
$$\Psi_{11} = \begin{pmatrix} \psi_{ii} & \psi_{il} \\ \psi_{li} & \psi_{ll} \end{pmatrix}. \qquad (7)$$
A straightforward calculation shows that
$$\Psi_{11} = \mathrm{cov}(X_1) + \mathrm{cov}(\widehat{X}_1) - 2\,\mathrm{cov}(X_1, \widehat{X}_1) = \Sigma_{11} + \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22}\Sigma_{22}^{-1}\Sigma_{21} - 2\,\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \Omega_{11}^{-1}.$$
See Cramér (1999, Section 23.4).

Therefore, by this equality, (4) and (5), the partial correlation coefficient and the conditional correlation are equal:
$$\rho_{il \cdot V \setminus \{i,l\}} = \mathrm{corr}\big(X_i, X_l \mid X_{V \setminus \{i,l\}}\big) = \frac{\psi_{il}}{\sqrt{\psi_{ii}\,\psi_{ll}}}.$$

Summarizing, the problem of determining the conditional dependence structure in a GGM (represented by $E$) is equivalent to finding the pairs of nodes of $V$ that belong to the set
$$\{(i, l) \in V^2 : i \neq l, \ \psi_{il} \neq 0\}, \qquad (8)$$
which is equal to the support of the precision matrix, $\mathrm{supp}(\Omega)$, defined by (6).

Remark 1 As noticed above, under normality, partial and conditional correlation are the same. However, in general they are different concepts (Lawrance, 1976).

Remark 2 Let $\beta_{i,l}$ be the regression coefficient of $X_l$ in the regression of $X_i$ versus $X_{V \setminus \{i\}}$ and, similarly, let $\beta_{l,i}$ be the regression coefficient of $X_i$ in the regression of $X_l$ versus $X_{V \setminus \{l\}}$. Then it follows that $\rho_{il \cdot V \setminus \{i,l\}} = \mathrm{sign}(\beta_{l,i})\sqrt{\beta_{l,i}\,\beta_{i,l}}$. This allows for another popular parametrization of $E$. Moreover, let $\epsilon_i$ be the error term in the regression of the $i$th variable on the remaining ones. Then, by Lemma 1 in Peng et al. (2009), we have that $\mathrm{cov}(\epsilon_i, \epsilon_l) = \omega_{il}/(\omega_{ii}\,\omega_{ll})$ and $\mathrm{var}(\epsilon_i) = 1/\omega_{ii}$.
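
The equality between (5) and the correlation of the prediction errors can be checked numerically. The following R sketch (purely illustrative; the random matrix and the indices are arbitrary choices of ours) computes $\Psi_{11}$ as the Schur complement of a covariance matrix and compares the resulting conditional correlation with $-\omega_{il}/\sqrt{\omega_{ii}\omega_{ll}}$:

```r
## Illustrative check of equation (5): the conditional correlation computed
## from Psi11 = cov(X1 | X2) equals -omega_il / sqrt(omega_ii * omega_ll).
set.seed(1)
p <- 4
Omega <- crossprod(matrix(rnorm(p^2), p)) + p * diag(p)  # a positive definite precision matrix
Sigma <- solve(Omega)
i <- 1; l <- 2
rest <- setdiff(1:p, c(i, l))
## Schur complement: covariance of (X_i, X_l) given the remaining variables
Psi11 <- Sigma[c(i, l), c(i, l)] -
  Sigma[c(i, l), rest] %*% solve(Sigma[rest, rest]) %*% Sigma[rest, c(i, l)]
Psi11[1, 2] / sqrt(Psi11[1, 1] * Psi11[2, 2])    # psi_il / sqrt(psi_ii * psi_ll)
-Omega[i, l] / sqrt(Omega[i, i] * Omega[l, l])   # -omega_il / sqrt(omega_ii * omega_ll)
```

Both printed values coincide, illustrating that the two parametrizations of $E$ identify the same set of edges.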

2.3 The Stepwise Algorithm

Conditionally on its neighbors, $X_i$ is independent of all the other variables. Therefore, given a system of neighborhoods $\{A_i\}_{i=1}^p$ and $l \notin A_i$ (and so $i \notin A_l$), the partial correlation between $X_i$ and $X_l$ can be obtained by the following procedure: (i) regress $X_i$ on $X_{A_i}$ and compute the regression residual $\varepsilon_i$; (ii) regress $X_l$ on $X_{A_l}$ and compute the regression residual $\varepsilon_l$; (iii) calculate the Pearson correlation between $\varepsilon_i$ and $\varepsilon_l$.
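
In sample terms, steps (i)-(iii) amount to two least squares fits and one correlation. A minimal R sketch, assuming centered data in an n x p matrix X and neighborhoods stored as a list A of index vectors (the names resid_on and partial_cor are ours, not from the paper):

```r
resid_on <- function(X, j, nbrs) {
  ## Residuals of column j regressed on the columns in nbrs; with an empty
  ## neighborhood the "residual" is X_j itself (the data are centered).
  if (length(nbrs) == 0) return(X[, j])
  resid(lm(X[, j] ~ X[, nbrs, drop = FALSE] - 1))
}

partial_cor <- function(X, i, l, A) {
  ## Steps (i)-(iii): Pearson correlation of the two residual vectors
  cor(resid_on(X, i, A[[i]]), resid_on(X, l, A[[l]]))
}
```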

This reasoning motivates the StepGraph algorithm. At each step $k$ of StepGraph, we have a working system of neighborhoods $\widehat{A}_1^k, \ldots, \widehat{A}_p^k$. Then, if $l \notin \widehat{A}_j^k$, one would expect, under this working assumption, that the empirical partial correlation coefficient $\widehat{\rho}_{jl \cdot \widehat{A}_j^k}$ is close to zero. If the maximum absolute partial correlation computed this way is large, then we conclude that the working system of neighborhoods needs to be updated. We then add the most likely new edge, the one with the largest partial correlation. This constitutes the forward step. In the backward step, if the minimum absolute partial correlation coefficient between presently connected nodes $j$ and $l$ is too small, then this edge is removed.

A step by step description of StepGraph is given below.

Graphical Stepwise Algorithm

Input: the (centered) data $\{x_1, \ldots, x_n\}$ and the forward and backward thresholds $\alpha_f$ and $\alpha_b$.

Initialization ($k = 0$): set $\widehat{A}_1^0 = \widehat{A}_2^0 = \cdots = \widehat{A}_p^0 = \emptyset$.

Iteration step: given $\widehat{A}_1^k, \widehat{A}_2^k, \ldots, \widehat{A}_p^k$, compute $\widehat{A}_1^{k+1}, \widehat{A}_2^{k+1}, \ldots, \widehat{A}_p^{k+1}$ as follows.

Forward. For each $j = 1, \ldots, p$ and each $l \notin \widehat{A}_j^k$, calculate the partial correlation $f_{jl}^k$ as follows:

(a) Regress the $j$th variable on the variables with subscripts in the set $\widehat{A}_j^k$ and compute the regression residuals $e_j^k = (e_{1j}^k, e_{2j}^k, \ldots, e_{nj}^k)$.

(b) Regress the $l$th variable on the variables with subscripts in the set $\widehat{A}_l^k$ and compute the regression residuals $e_l^k = (e_{1l}^k, e_{2l}^k, \ldots, e_{nl}^k)$.

(c) Obtain the partial correlation $f_{jl}^k$ by calculating the Pearson correlation between $e_j^k$ and $e_l^k$.

If
$$\max_{j \in V,\, l \notin \widehat{A}_j^k} \big|f_{jl}^k\big| = \big|f_{j_0 l_0}^k\big| \geq \alpha_f,$$
set $\widehat{A}_{j_0}^{k+1} = \widehat{A}_{j_0}^k \cup \{l_0\}$, $\widehat{A}_{l_0}^{k+1} = \widehat{A}_{l_0}^k \cup \{j_0\}$, and $\widehat{A}_l^{k+1} = \widehat{A}_l^k$ for $l \neq j_0, l_0$. If
$$\max_{j \in V,\, l \notin \widehat{A}_j^k} \big|f_{jl}^k\big| < \alpha_f,$$
stop.

Backward. For each $j = 1, \ldots, p$ and each $l \in \widehat{A}_j^{k+1}$, calculate the partial correlation $b_{jl}^k$ as follows:

(a) Regress the $j$th variable on the variables with subscripts in the set $\widehat{A}_j^{k+1} \setminus \{l\}$ and compute the regression residuals $r_j^k = (r_{1j}^k, r_{2j}^k, \ldots, r_{nj}^k)$.

(b) Regress the $l$th variable on the variables with subscripts in the set $\widehat{A}_l^{k+1} \setminus \{j\}$ and compute the regression residuals $r_l^k = (r_{1l}^k, r_{2l}^k, \ldots, r_{nl}^k)$.

(c) Compute the partial correlation $b_{jl}^k$ by calculating the Pearson correlation between $r_j^k$ and $r_l^k$.

If
$$\min_{j \in V,\, l \in \widehat{A}_j^{k+1}} \big|b_{jl}^k\big| = \big|b_{j_0 l_0}^k\big| \leq \alpha_b,$$
set $\widehat{A}_{j_0}^{k+1} \to \widehat{A}_{j_0}^{k+1} \setminus \{l_0\}$ and $\widehat{A}_{l_0}^{k+1} \to \widehat{A}_{l_0}^{k+1} \setminus \{j_0\}$.

Output:

1. A collection of estimated neighborhoods $\widehat{A}_j$, $j = 1, \ldots, p$.

2. The set of estimated edges $\widehat{E} = \{(i, l) \in V^2 : i \in \widehat{A}_l\}$.

3. An estimate $\widehat{\Omega} = (\widehat{\omega}_{il})_{i,l=1}^p$ of $\Omega$, with $\widehat{\omega}_{il}$ defined as follows. In the case $i = l$, $\widehat{\omega}_{ii} = n/(e_i^\top e_i)$ for $i = 1, \ldots, p$, where $e_i$ is the vector of prediction errors in the regression of the $i$th variable on $X_{\widehat{A}_i}$. In the case $i \neq l$ we must distinguish two cases: if $l \notin \widehat{A}_i$ then $\widehat{\omega}_{il} = 0$; otherwise $\widehat{\omega}_{il} = n\, e_i^\top e_l / \big((e_i^\top e_i)(e_l^\top e_l)\big)$ (see Remark 2).
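
The iteration and the output in step 3 can be sketched in R, reusing resid_on and partial_cor from the sketch in the previous subsection. This is an illustrative skeleton under our own naming, not the authors' implementation:

```r
stepgraph_step <- function(X, A, alpha_f, alpha_b) {
  p <- ncol(X)
  ## Forward: largest absolute partial correlation among non-edges
  best <- c(0, NA, NA)
  for (j in 1:(p - 1)) for (l in (j + 1):p) {
    if (l %in% A[[j]]) next
    f <- abs(partial_cor(X, j, l, A))
    if (f > best[1]) best <- c(f, j, l)
  }
  if (best[1] < alpha_f) return(list(A = A, converged = TRUE))
  A[[best[2]]] <- c(A[[best[2]]], best[3])
  A[[best[3]]] <- c(A[[best[3]]], best[2])
  ## Backward: remove the weakest current edge if it falls below alpha_b
  worst <- c(Inf, NA, NA)
  for (j in 1:p) for (l in A[[j]]) if (l > j) {
    b <- abs(cor(resid_on(X, j, setdiff(A[[j]], l)),
                 resid_on(X, l, setdiff(A[[l]], j))))
    if (b < worst[1]) worst <- c(b, j, l)
  }
  if (worst[1] <= alpha_b) {
    A[[worst[2]]] <- setdiff(A[[worst[2]]], worst[3])
    A[[worst[3]]] <- setdiff(A[[worst[3]]], worst[2])
  }
  list(A = A, converged = FALSE)
}

omega_hat <- function(X, A) {
  ## Output 3: precision matrix estimate from the final neighborhoods
  n <- nrow(X); p <- ncol(X)
  E <- sapply(1:p, function(j) resid_on(X, j, A[[j]]))  # n x p residual matrix
  W <- matrix(0, p, p)
  for (i in 1:p) {
    W[i, i] <- n / sum(E[, i]^2)
    for (l in A[[i]]) W[i, l] <- n * sum(E[, i] * E[, l]) /
                                   (sum(E[, i]^2) * sum(E[, l]^2))
  }
  W
}

## Usage sketch: start from empty neighborhoods and iterate to convergence.
## A <- rep(list(integer(0)), ncol(X))
## repeat { s <- stepgraph_step(X, A, 0.17, 0.09); A <- s$A; if (s$converged) break }
```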

2.4 Threshold selection by cross-validation

Let $\mathbf{X}$ be the $n \times p$ matrix with rows $x_i = (x_{i1}, \ldots, x_{ip})$, $i = 1, \ldots, n$, corresponding to $n$ observations. We randomly partition the dataset $\{x_i\}_{1 \leq i \leq n}$ into $K$ disjoint subsets of approximately equal sizes, the $t$th subset being of size $n_t \geq 2$, with $\sum_{t=1}^K n_t = n$. For every $t$, let $\{x_i^{(t)}\}_{1 \leq i \leq n_t}$ be the $t$th validation subset and its complement $\{\tilde{x}_i^{(t)}\}_{1 \leq i \leq n - n_t}$ the $t$th training subset. For every $t$ and for every pair $(\alpha_f, \alpha_b)$ of threshold parameters, let $\widehat{A}_1^{(t)}, \ldots, \widehat{A}_p^{(t)}$ be the estimated neighborhoods given by StepGraph using the $t$th training subset. For every $j = 1, \ldots, p$, let $\widehat{\beta}_{\widehat{A}_j^{(t)}}$ be the estimated coefficient of the regression of the variable $X_j$ on the neighborhood $\widehat{A}_j^{(t)}$.

Consider now the $t$th validation subset. For every $j$, using $\widehat{\beta}_{\widehat{A}_j^{(t)}}$, we obtain the vector of predicted values $\widehat{X}_j^{(t)}(\alpha_f, \alpha_b)$. If $\widehat{A}_j^{(t)} = \emptyset$, we predict each observation of $X_j$ by the sample mean of the observations of this variable in the $t$th subset.

Then we define the $K$-fold cross-validation function as
$$CV(\alpha_f, \alpha_b) = \frac{1}{n} \sum_{t=1}^{K} \sum_{j=1}^{p} \big\| X_j^{(t)} - \widehat{X}_j^{(t)}(\alpha_f, \alpha_b) \big\|^2,$$
where $\|\cdot\|$ denotes the $L_2$ (Euclidean) norm. Hence the $K$-fold cross-validation forward and backward thresholds are
$$(\widehat{\alpha}_f, \widehat{\alpha}_b) = \operatorname*{argmin}_{(\alpha_f, \alpha_b) \in H} CV(\alpha_f, \alpha_b),$$
where $H$ is a grid of ordered pairs $(\alpha_f, \alpha_b)$ in $[0, 1] \times [0, 1]$ over which we perform the search. For a detailed description see Appendix A.

2.5 Example

To illustrate the algorithm we consider the GGM with 16 edges given in the first panel of Figure 1. We draw n = 1000 independent observations from this model (see the next section for details). The values of the threshold parameters, $\alpha_f = 0.17$ and $\alpha_b = 0.09$, are determined by 5-fold cross-validation. The figure also displays the selected edges at each step in a sequence of successive updates of $\widehat{A}_j^k$, for $k = 1, 4, 9, 12$ and the final step $k = 16$, showing that the estimated graph is identical to the true graph.


[Figure: six panels on 20 nodes showing the true graph and the estimated graphs at steps k = 1, 4, 9, 12 and 16.]

Figure 1: True graph and sequence of successive updates of $\widehat{A}_j^k$, for k = 1, 4, 9, 12, 16 of StepGraph.

3 Numerical results and real data example

We conducted extensive Monte Carlo simulations to investigate the performance of StepGraph. In this section we report some results from this study and a numerical experiment using real data.

3.1 Monte Carlo simulation study

Simulated Models

We consider three dimension values, p = 50, 100, 150, and three different models for Ω:

Model 1. Autoregressive model of order 1, denoted AR(1). In this case $\Sigma_{ij} = 0.4^{|i-j|}$ for $i, j = 1, \ldots, p$.

Model 2. Nearest neighbors model of order 2, denoted NN(2). For each node we randomly select two neighbors and choose a pair of symmetric entries of Ω using the NeighborOmega function of the R package Tlasso.

Model 3. Block diagonal matrix model with q blocks of size p/q, denoted BG. For p = 50, 100 and 150 we use q = 10, 20 and 30 blocks, respectively. Each block, of size p/q = 5, has diagonal elements equal to 1 and off-diagonal elements equal to 0.5.

For each p and each model we generate R = 50 random samples of size n = 100. These graph models are widely used in the genetics literature to model gene expression data. See for example Lee and Liu (2015) and Li and Gui (2006). Figure 2 displays graphs from Models 1-3 with p = 100 nodes.
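
For concreteness, data from Models 1 and 3 can be generated along the following lines in R. This is a sketch under our reading that the BG blocks describe Ω (the text leaves this implicit); Model 2 would use Tlasso::NeighborOmega as stated above:

```r
library(MASS)    # mvrnorm

## Model 1: AR(1) covariance, Sigma_ij = 0.4^|i-j|
make_ar1_sigma <- function(p, rho = 0.4) rho^abs(outer(1:p, 1:p, "-"))

## Model 3: block diagonal precision matrix with 5 x 5 blocks
## (diagonal elements 1, off-diagonal elements 0.5)
make_bg_omega <- function(p, block = 5) {
  B <- matrix(0.5, block, block)
  diag(B) <- 1
  as.matrix(Matrix::bdiag(replicate(p / block, B, simplify = FALSE)))
}

n <- 100; p <- 50
X_ar1 <- mvrnorm(n, rep(0, p), make_ar1_sigma(p))
X_bg  <- mvrnorm(n, rep(0, p), solve(make_bg_omega(p)))
```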

[Figure: three panels showing realizations of the three graphical models with p = 100 nodes.]

Figure 2: Graphs of AR(1), NN(2) and BG graphical models for p = 100 nodes.

Methods

We compare the performance of StepGraph with graphical lasso (Glasso), constrained $\ell_1$-minimization for inverse matrix estimation (CLIME) and equivalent partial correlation (EPC), proposed by Friedman et al. (2008), Cai et al. (2011) and Liang et al. (2015), respectively. More precisely, the methods compared in our simulation study are:

1. The Glasso estimate, obtained by solving the $\ell_1$ penalized-likelihood problem
$$\min_{\Omega \succ 0} \; -\log\{\det[\Omega]\} + \mathrm{tr}\{\Omega \mathbf{X}^\top \mathbf{X}\} + \lambda \|\Omega\|_1. \qquad (9)$$
In our simulations and examples we use the R package CVglasso with the tuning parameter $\lambda$ selected by 5-fold cross-validation (the package default).

2. The CLIME estimate, obtained by symmetrization of the solution of
$$\min\big\{\|\Omega\|_1 \ \text{subject to} \ |S\Omega - I|_\infty \leq \lambda\big\}, \qquad (10)$$
where $S$ is the sample covariance, $I$ is the identity matrix, $|\cdot|_\infty$ is the elementwise $\ell_\infty$ norm, and $\lambda$ is a tuning parameter. For computations we use the R package clime with the tuning parameter $\lambda$ selected by 5-fold cross-validation (the package default).

3. The EPC method, which performs multiple hypothesis tests based on a measure equivalent to the partial correlation coefficient. This method starts with a screening step, based on Pearson correlation coefficients, to determine a reduced system of neighborhoods. We use the R package equSA with the default choice of parameters.

4. The proposed method StepGraph, with the forward and backward thresholds, $\alpha_f > \alpha_b$, determined by 5-fold cross-validation as described in Appendix A. Our procedure allows for an optional screening step, as in EPC, and the resulting method is then denoted StepGraph2.

To evaluate the graph recovery we compute the Matthews correlation coefficient (Matthews, 1975)
$$\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}, \qquad (11)$$
the Specificity = TN/(TN + FP) and the Sensitivity = TP/(TP + FN). Here TP, TN, FP and FN are the number of true positives, true negatives, false positives and false negatives, respectively. Larger values of MCC, Sensitivity and Specificity indicate better performances (Fan et al., 2009; Baldi et al., 2000).
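
These three measures are simple to compute from the true and estimated adjacency structures; a small R helper of our own (counting each undirected edge once) could look as follows:

```r
recovery_measures <- function(E_true, E_hat) {
  ## E_true, E_hat: symmetric 0/1 adjacency matrices of the same dimension
  up <- upper.tri(E_true)                 # count each undirected edge once
  tp <- sum(E_hat[up] == 1 & E_true[up] == 1)
  tn <- sum(E_hat[up] == 0 & E_true[up] == 0)
  fp <- sum(E_hat[up] == 1 & E_true[up] == 0)
  fn <- sum(E_hat[up] == 0 & E_true[up] == 1)
  mcc <- (tp * tn - fp * fn) /
    sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  c(MCC = mcc, Sensitivity = tp / (tp + fn), Specificity = tn / (tn + fp))
}
```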

The performance of $\widehat{\Omega}$ as an estimate of $\Omega$ is measured by $m_F = \|\widehat{\Omega} - \Omega\|_F$ (where $\|\cdot\|_F$ denotes the Frobenius norm) and by the normalized Kullback-Leibler divergence defined by $m_{NKL} = D_{KL}/(1 + D_{KL})$, where
$$D_{KL} = \frac{1}{2}\Big(\mathrm{tr}\big\{\widehat{\Omega}\Omega^{-1}\big\} - \log\big\{\det\big[\widehat{\Omega}\Omega^{-1}\big]\big\} - p\Big)$$
is the Kullback-Leibler divergence between $\widehat{\Omega}$ and $\Omega$.
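
Both measures are direct to evaluate; a sketch in R (our own helper, using the log-determinant for numerical stability):

```r
precision_errors <- function(Omega_hat, Omega) {
  p <- ncol(Omega)
  M <- Omega_hat %*% solve(Omega)
  log_det <- as.numeric(determinant(M, logarithm = TRUE)$modulus)
  d_kl <- 0.5 * (sum(diag(M)) - log_det - p)  # Kullback-Leibler divergence
  c(mF = norm(Omega_hat - Omega, type = "F"), mNKL = d_kl / (1 + d_kl))
}
```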

Results

Table 1 shows the MCC performance of the compared methods under Models 1-3. For Models 1 and 2, StepGraph and EPC clearly outperform the other two methods, with CLIME being only slightly better than Glasso. EPC is slightly better than StepGraph and worse than StepGraph2. Moreover, the equSA package often crashes in the case of Model 3 (NA values reported in the tables). Cai et al. (2011) pointed out that a procedure yielding a sparser $\widehat{\Omega}$ is preferable because this facilitates interpretation of the data. The sensitivity and specificity results, reported in Table 4 in Appendix B, show that in general StepGraph, StepGraph2 and EPC are sparser than CLIME and Glasso, yielding fewer false positives (higher specificity) but a few more false negatives (lower sensitivity). Table 2 shows that all the methods are roughly comparable under AR(1) and show equally poor performance under NN(2). StepGraph and StepGraph2 outperform the competitors under model BG.

The axes in the panels of Figure 3 display the p nodes of the graph in a given order. Each cell displays a gray level proportional to the frequency with which the corresponding pair of nodes appears in the estimated graph over the R = 50 simulation runs. Hence a white cell (i, j) means that nodes i and j are never adjacent in the estimated graph; on the other hand, a pair of nodes that are always adjacent is given a black cell. Notice that the sparsity patterns estimated by StepGraph and StepGraph2 best match those of the true models. As noticed before, EPC results are missing for the case of BG. Figures 4-6 in Appendix B display similar heatmaps, with the same conclusions, for 100 and 150 nodes.

[Figure: for each model AR(1), NN(2) and BG, heatmaps for the true graph and the graphs estimated by StepGraph, StepGraph2, Glasso, CLIME and EPC.]

Figure 3: Model heatmaps of the frequency of adjacency for each pair of nodes, for models AR(1), NN(2) and BG, with p = 50 nodes. The axes display the p nodes of the graph in a given order.

Table 1: Comparison of means and standard deviations (in brackets) of MCC over R = 50 replicates.

Model   p      StepGraph       StepGraph2      Glasso          CLIME           EPC
AR(1)   50     0.741 (0.009)   0.863 (0.005)   0.419 (0.016)   0.492 (0.006)   0.831 (0.005)
        100    0.751 (0.004)   0.847 (0.005)   0.433 (0.020)   0.464 (0.004)   0.803 (0.005)
        150    0.730 (0.004)   0.837 (0.004)   0.474 (0.017)   0.499 (0.003)   0.778 (0.004)
NN(2)   50     0.751 (0.004)   0.857 (0.006)   0.404 (0.014)   0.401 (0.007)   0.870 (0.004)
        100    0.802 (0.005)   0.875 (0.005)   0.382 (0.006)   0.407 (0.005)   0.862 (0.000)
        150    0.695 (0.007)   0.799 (0.004)   0.337 (0.008)   0.425 (0.003)   0.762 (0.004)
BG      50     0.898 (0.005)   0.832 (0.028)   0.356 (0.009)   0.482 (0.005)   NA
        100    0.857 (0.005)   0.857 (0.005)   0.348 (0.004)   0.461 (0.002)   NA
        150    0.780 (0.008)   0.780 (0.008)   0.314 (0.003)   0.408 (0.003)   NA

Table 2: Comparison of means and standard deviations (in brackets) of mF and mNKL over R = 50 replicates.

               StepGraph        StepGraph2       Glasso           CLIME            EPC
Model   p      mNKL    mF       mNKL    mF       mNKL    mF       mNKL    mF       mNKL    mF
AR(1)   50     0.70    3.82     0.66    3.59     0.64    3.90     0.63    3.91     0.67    3.75
               (0.00)  (0.00)   (0.00)  (0.03)   (0.00)  (0.02)   (0.00)  (0.01)   (0.00)  (0.03)
        100    0.83    5.73     0.81    5.24     0.80    5.72     0.79    5.75     0.82    5.56
               (0.00)  (0.00)   (0.00)  (0.03)   (0.00)  (0.02)   (0.00)  (0.01)   (0.00)  (0.03)
        150    0.89    7.16     0.87    6.53     0.86    7.21     0.86    7.25     0.88    7.03
               (0.00)  (0.00)   (0.00)  (0.03)   (0.02)  (0.02)   (0.01)  (0.01)   (0.00)  (0.02)
NN(2)   50     0.99    6.98     0.99    6.88     0.99    6.65     0.99    6.64     1.00    6.39
               (0.00)  (0.00)   (0.00)  (0.01)   (0.00)  (0.01)   (0.00)  (0.00)   (0.00)  (0.00)
        100    1.00    10.11    1.00    10.09    1.00    9.64     1.00    9.60     1.00    9.30
               (0.00)  (0.00)   (0.00)  (0.01)   (0.00)  (0.01)   (0.00)  (0.01)   (0.00)  (0.00)
        150    1.00    12.37    1.00    12.34    1.00    11.90    1.00    11.79    1.00    11.51
               (0.00)  (0.00)   (0.00)  (0.01)   (0.00)  (0.01)   (0.00)  (0.00)   (0.00)  (0.00)
BG      50     0.46    1.44     0.50    1.97     0.85    5.45     0.82    5.03     NA      NA
               (0.00)  (0.00)   (0.02)  (0.23)   (0.00)  (0.10)   (0.00)  (0.05)   NA      NA
        100    0.71    2.94     0.71    2.94     0.93    9.16     0.92    8.71     NA      NA
               (0.00)  (0.00)   (0.00)  (0.00)   (0.00)  (0.07)   (0.00)  (0.02)   NA      NA
        150    0.88    6.10     0.88    6.10     0.96    11.59    0.96    11.42    NA      NA
               (0.00)  (0.00)   (0.00)  (0.00)   (0.00)  (0.06)   (0.00)  (0.02)   NA      NA
3.2 Analysis of Breast Cancer Data

In preoperative chemotherapy, the complete eradication of all invasive cancer cells is referred to as pathological complete response, abbreviated as pCR. It is known in medicine that pCR is associated with the long-term cancer-free survival of a patient. Gene expression profiling (GEP), the measurement of the activity (expression level) of genes in a patient, could in principle be a useful predictor of the patient's pCR.

Using normalized gene expression data of patients in stages I-III of breast cancer, Hess et al. (2006) aim to identify patients that may achieve pCR under sequential anthracycline-paclitaxel preoperative chemotherapy. When a patient does not achieve the pCR state, the patient is classified into the residual disease (RD) group, indicating that cancer still remains. Their data consist of 22283 gene expression levels for 133 patients, 34 with pCR and 99 with RD. Following Fan et al. (2009) and Cai et al. (2011), we randomly split the data into a training set and a testing set. The testing set is formed by randomly selecting 5 pCR patients and 16 RD patients (roughly 1/6 of the subjects) and the remaining patients form the training set. From the training set, a two-sample t-test is performed to select the 50 most significant genes. The data are then standardized using the standard deviation estimated from the training set.

We apply linear discriminant analysis (LDA) to predict whether a patient may achieve pathological complete response (pCR), based on the estimated inverse covariance matrix of the gene expression levels. We label the pCR group with r = 1 and the RD group with r = 2, and assume that the data are normally distributed, with common covariance matrix Σ and different means $\mu_r$. From the training set we obtain $\widehat{\mu}_r$ and $\widehat{\Omega}$, and for the test data we compute the linear discriminant score
$$\delta_r(x) = x^\top \widehat{\Omega}\, \widehat{\mu}_r - \frac{1}{2}\, \widehat{\mu}_r^\top \widehat{\Omega}\, \widehat{\mu}_r + \log \widehat{\pi}_r, \qquad r = 1, 2, \qquad (12)$$
where $\widehat{\pi}_r$ is the proportion of group $r$ subjects in the training set. The classification rule is
$$\widehat{r}(x) = \operatorname*{argmax}_{r \in \{1, 2\}} \delta_r(x). \qquad (13)$$
For every method we use 5-fold cross-validation on the training data to select the tuning constants. We repeat this scheme 100 times.
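
As an illustration, the score (12) and rule (13) can be written compactly in R (a sketch under our own naming; mu_hat is a p x 2 matrix of group means and pi_hat the training-set group proportions):

```r
lda_classify <- function(x, Omega_hat, mu_hat, pi_hat) {
  ## Linear discriminant scores delta_r(x) of equation (12), r = 1, 2
  delta <- sapply(1:2, function(r) {
    drop(t(x) %*% Omega_hat %*% mu_hat[, r]) -
      0.5 * drop(t(mu_hat[, r]) %*% Omega_hat %*% mu_hat[, r]) +
      log(pi_hat[r])
  })
  which.max(delta)   # classification rule (13): 1 = pCR, 2 = RD
}
```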

Table 3 displays the means and standard errors (in brackets) of Sensitivity, Specificity, MCC and the number of selected edges using $\widehat{\Omega}$ over the 100 replications. As measured by MCC, the performances of StepGraph and CLIME are similar. However, StepGraph is preferable because the recovered graph is much sparser. On the other hand, the performances of Glasso and EPC are similarly poor. The results of StepGraph2 are similar to those of StepGraph and are therefore omitted.

Table 3: Comparison of means and standard deviations (in brackets) of Sensitivity, Specificity, MCC and number of selected edges over 100 replications.

                    StepGraph       Glasso          CLIME           EPC
Sensitivity         0.798 (0.020)   0.612 (0.021)   0.786 (0.020)   0.682 (0.021)
Specificity         0.784 (0.010)   0.754 (0.011)   0.788 (0.010)   0.712 (0.007)
MCC                 0.520 (0.020)   0.342 (0.021)   0.516 (0.020)   0.346 (0.017)
Number of edges     54 (2)          1712 (63)       4823 (8)        13 (0)

4 Concluding remarks

This paper introduces a stepwise procedure, called StepGraph, to perform covariance selection in high dimensional Gaussian graphical models. StepGraph uses a different parametrization of the Gaussian graphical model, based on the Pearson correlations between the prediction errors of the best linear predictors. The algorithm begins with a family of empty neighborhoods and, using basic forward and backward steps, adds or deletes edges until appropriate thresholds are reached. These thresholds are automatically determined by cross-validation.

StepGraph is compared with Glasso, CLIME and EPC under different Gaussian graphical models (AR(1), NN(2) and BG), using different performance measures for network recovery and for sparse estimation of the precision matrix Ω. StepGraph is shown to have good support recovery performance and to produce sparser models than Glasso and CLIME (i.e. StepGraph is a parsimonious estimation procedure). StepGraph and StepGraph2 (a variant including a pre-processing correlation screening step) compare well with standard procedures including Glasso, CLIME and EPC. Particularly good simulation results are obtained under block models, where the other approaches face some difficulties.

We apply StepGraph to the analysis of breast cancer data and show that our method is a useful tool for applications in medicine and other fields.

References

Anderson, T. (2003). An Introduction to Multivariate Statistical Analysis. John Wiley.

Baldi, P., S. Brunak, Y. Chauvin, C. Andersen, and H. Nielsen (2000). Assessing the accuracy of prediction algorithms for classification: An overview. Bioinformatics 16(5), 412–424.

Banerjee, O., L. El Ghaoui, and A. d'Aspremont (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research 9, 485–516.

Bühlmann, P. and S. Van De Geer (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.

Cai, T., W. Liu, and X. Luo (2011). A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106(494), 594–607.

Cramér, H. (1999). Mathematical Methods of Statistics. Princeton University Press.

Dempster, A. P. (1972). Covariance selection. Biometrics 28, 157–175.

Eaton, M. L. (2007). Multivariate Statistics: A Vector Space Approach. Institute of Mathematical Statistics.

Edwards, D. (2000). Introduction to Graphical Modelling. Springer Science & Business Media.

Fan, J., Y. Feng, and Y. Wu (2009). Network exploration via the adaptive lasso and SCAD penalties. The Annals of Applied Statistics 3(2), 521–541.

Friedman, J., T. Hastie, and R. Tibshirani (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441.

Hess, K. R., K. Anderson, W. F. Symmans, V. Valero, N. Ibrahim, J. A. Mejia, D. Booser, R. L. Theriault, A. U. Buzdar, P. J. Dempsey, et al. (2006). Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. Journal of Clinical Oncology 24(26), 4236–4244.

Huang, S., J. Jin, and Z. Yao (2016). Partial correlation screening for estimating large precision matrices, with applications to classification. The Annals of Statistics 44(5), 2018–2057.

Johnson, C. C., A. Jalali, and P. Ravikumar (2011). High-dimensional sparse inverse covariance estimation using greedy methods. arXiv preprint arXiv:1112.6411.

Kalisch, M. and P. Bühlmann (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. The Journal of Machine Learning Research 8, 613–636.

Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.

Lawrance, A. J. (1976). On conditional and partial correlation. The American Statistician 30(3), 146–149.

Ledoit, O. and M. Wolf (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88(2), 365–411.

Lee, W. and Y. Liu (2015). Joint estimation of multiple precision matrices with common structures. Journal of Machine Learning Research 16(1), 1035–1062.

Li, H. and J. Gui (2006). Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics 7(2), 302–317.

Liang, F., Q. Song, and P. Qiu (2015). An equivalent measure of partial correlation coefficients for high-dimensional Gaussian graphical models. Journal of the American Statistical Association 110(511), 1248–1265.

Liu, H. and L. Wang (2012). TIGER: A tuning-insensitive approach for optimally estimating Gaussian graphical models. arXiv preprint arXiv:1209.2437.

Matthews, B. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta 405(2), 442–451.

Meinshausen, N. and P. Bühlmann (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34(3), 1436–1462.

Peng, J., P. Wang, N. Zhou, and J. Zhu (2009). Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association 104(486), 735–746.

Ravikumar, P., M. J. Wainwright, G. Raskutti, and B. Yu (2011). High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics 5, 935–980.

Ren, Z., T. Sun, C.-H. Zhang, and H. H. Zhou (2015). Asymptotic normality and optimalities in estimation of large Gaussian graphical models. The Annals of Statistics 43(3), 991–1026.

Rütimann, P. and P. Bühlmann (2009). High dimensional sparse covariance estimation via directed acyclic graphs. Electronic Journal of Statistics 3, 1133–1160.

Spirtes, P., C. N. Glymour, and R. Scheines (2000). Causation, Prediction, and Search. MIT Press.

Yuan, M. (2010). High dimensional inverse covariance matrix estimation via linear programming. The Journal of Machine Learning Research 11, 2261–2286.

Yuan, M. and Y. Lin (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94(1), 19–35.

Zhou, S., P. Rütimann, M. Xu, and P. Bühlmann (2011). High-dimensional covariance estimation based on Gaussian graphical models. The Journal of Machine Learning Research 12, 2975–3026.
A Selection of the threshold parameters by cross-validation

In this section we describe the selection of the forward and backward thresholds for StepGraph.

Let $\mathbf{X}$ be the $n \times p$ matrix with rows $x_i = (x_{i1}, \ldots, x_{ip})$, $i = 1, \ldots, n$, corresponding to $n$ observations. For each $j = 1, \ldots, p$, let $X_j = (x_{1j}, \ldots, x_{nj})^\top$ denote the $j$th column of the matrix $\mathbf{X}$.

We randomly partition the dataset $\{x_i\}_{1 \leq i \leq n}$ into $K$ disjoint subsets of approximately equal size, the $t$th subset being of size $n_t \geq 2$, with $\sum_{t=1}^K n_t = n$. For every $t$, let $\{x_i^{(t)}\}_{1 \leq i \leq n_t}$ be the $t$th validation subset and its complement $\{\tilde{x}_i^{(t)}\}_{1 \leq i \leq n - n_t}$ the $t$th training subset.

For every $t = 1, \ldots, K$ and threshold parameters $(\alpha_f, \alpha_b) \in [0,1] \times [0,1]$, let $\widehat{A}_1^{(t)}, \ldots, \widehat{A}_p^{(t)}$ be the estimated neighborhoods given by StepGraph using the $t$th training subset $\{\tilde{x}_i^{(t)}\}_{1 \leq i \leq n - n_t}$, with $\tilde{x}_i^{(t)} = (\tilde{x}_{i1}^{(t)}, \ldots, \tilde{x}_{ip}^{(t)})$, $1 \leq i \leq n - n_t$. Consider for every node $j$ the estimated neighborhood $\widehat{A}_j^{(t)} = \{l_1, \ldots, l_q\}$ and let $\widehat{\beta}_{\widehat{A}_j^{(t)}}$ be the estimated coefficient of the regression of $\widetilde{X}_j^{(t)} = (\tilde{x}_{1j}^{(t)}, \ldots, \tilde{x}_{n-n_t,j}^{(t)})^\top$ on $X_{l_1}, \ldots, X_{l_q}$, represented in the top block of (15).

Consider the $t$th validation subset $\{x_i^{(t)}\}_{1 \leq i \leq n_t}$, with $x_i^{(t)} = (x_{i1}^{(t)}, \ldots, x_{ip}^{(t)})$, $1 \leq i \leq n_t$. For every $j$, let $X_j^{(t)} = (x_{1j}^{(t)}, \ldots, x_{n_t j}^{(t)})^\top$ and define the vector of predicted values
$$\widehat{X}_j^{(t)}(\alpha_f, \alpha_b) = \mathbf{X}_{\widehat{A}_j^{(t)}}^{(t)} \, \widehat{\beta}_{\widehat{A}_j^{(t)}},$$
where $\mathbf{X}_{\widehat{A}_j^{(t)}}^{(t)}$ is the matrix with rows $(x_{i l_1}^{(t)}, \ldots, x_{i l_q}^{(t)})$, $1 \leq i \leq n_t$, represented in the bottom block of (15). If the neighborhood $\widehat{A}_j^{(t)} = \emptyset$ we define
$$\widehat{X}_j^{(t)}(\alpha_f, \alpha_b) = (\bar{x}_j^{(t)}, \ldots, \bar{x}_j^{(t)})^\top,$$
where $\bar{x}_j^{(t)}$ is the mean of the sample of observations $x_{1j}^{(t)}, \ldots, x_{n_t j}^{(t)}$.

We define the $K$-fold cross-validation function as
$$CV(\alpha_f, \alpha_b) = \frac{1}{n} \sum_{t=1}^{K} \sum_{j=1}^{p} \big\| X_j^{(t)} - \widehat{X}_j^{(t)}(\alpha_f, \alpha_b) \big\|^2,$$
where $\|\cdot\|$ denotes the $L_2$ (Euclidean) norm. Hence the $K$-fold cross-validation forward and backward thresholds are
$$(\widehat{\alpha}_f, \widehat{\alpha}_b) = \operatorname*{argmin}_{(\alpha_f, \alpha_b) \in H} CV(\alpha_f, \alpha_b), \qquad (14)$$
where $H$ is a grid of ordered pairs $(\alpha_f, \alpha_b)$ in $[0, 1] \times [0, 1]$ over which we perform the search.

$$\begin{pmatrix}
\cdots & \tilde{x}_{1j}^{(t)} & \cdots & \tilde{x}_{1 l_1}^{(t)} & \cdots & \tilde{x}_{1 l_q}^{(t)} & \cdots \\
& \vdots & & \vdots & & \vdots & \\
\cdots & \tilde{x}_{n-n_t, j}^{(t)} & \cdots & \tilde{x}_{n-n_t, l_1}^{(t)} & \cdots & \tilde{x}_{n-n_t, l_q}^{(t)} & \cdots \\
\cdots & x_{1j}^{(t)} & \cdots & x_{1 l_1}^{(t)} & \cdots & x_{1 l_q}^{(t)} & \cdots \\
& \vdots & & \vdots & & \vdots & \\
\cdots & x_{n_t j}^{(t)} & \cdots & x_{n_t l_1}^{(t)} & \cdots & x_{n_t l_q}^{(t)} & \cdots
\end{pmatrix} \qquad (15)$$

The top block of (15) is the $t$th training subset and the bottom block is the $t$th validation subset.

Remark 3 Matrix (15) represents, for every node $j$, the data used to compare observed and predicted values in cross-validation: $\widehat{\beta}_{\widehat{A}_j^{(t)}}$ is computed using the observations $\widetilde{X}_j^{(t)} = (\tilde{x}_{1j}^{(t)}, \ldots, \tilde{x}_{n-n_t,j}^{(t)})^\top$ and the matrix $\widetilde{\mathbf{X}}_{\widehat{A}_j^{(t)}}^{(t)}$ with rows $(\tilde{x}_{i l_1}^{(t)}, \ldots, \tilde{x}_{i l_q}^{(t)})$, $i = 1, \ldots, n - n_t$, in the $t$th training subset (top block). Based on the $t$th validation subset, $\widehat{X}_j^{(t)}$ is computed using $\mathbf{X}_{\widehat{A}_j^{(t)}}^{(t)}$ and compared with $X_j^{(t)}$ (bottom block).
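
The grid search in (14) can be organized as follows in R (a sketch under our own naming; stepgraph_fit stands for any routine, such as iterating stepgraph_step from Section 2.3, that returns the estimated neighborhoods for given thresholds):

```r
cv_thresholds <- function(X, grid, fit, K = 5) {
  ## grid: matrix whose rows are candidate pairs (alpha_f, alpha_b)
  n <- nrow(X); p <- ncol(X)
  fold <- sample(rep_len(1:K, n))          # random partition into K folds
  cv <- apply(grid, 1, function(a) {
    err <- 0
    for (t in 1:K) {
      train <- X[fold != t, , drop = FALSE]
      valid <- X[fold == t, , drop = FALSE]
      A <- fit(train, a[1], a[2])          # neighborhoods from training fold
      for (j in 1:p) {
        if (length(A[[j]]) == 0) {
          pred <- mean(valid[, j])         # empty neighborhood: sample mean
        } else {
          beta <- coef(lm(train[, j] ~ train[, A[[j]], drop = FALSE] - 1))
          pred <- valid[, A[[j]], drop = FALSE] %*% beta
        }
        err <- err + sum((valid[, j] - pred)^2)
      }
    }
    err / n
  })
  grid[which.min(cv), ]                    # (alpha_f, alpha_b) minimizing CV
}

## Example grid: H <- as.matrix(expand.grid(seq(0.1, 0.5, 0.05), seq(0.05, 0.3, 0.05)))
```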
B Additional simulation results

In this section we give additional simulation results. Table 4 reports additional specificity and sensitivity results from our simulation study. Figures 4-6 display the heatmaps for the three considered models and p equal to 50, 100 and 150.
Table 4: Comparison of means and standard deviations (in brackets) of Specificity (TN%), Sensitivity (TP%) and MCC over R = 50 replicates.

                       StepGraph                  StepGraph2                 Glasso                     CLIME                      EPC
Model   p      TP%     TN%     MCC        TP%     TN%     MCC        TP%     TN%     MCC        TP%     TN%     MCC        TP%     TN%     MCC
AR(1)   50     0.756   0.988   0.741      0.812   0.997   0.863      0.994   0.823   0.419      0.988   0.891   0.492      0.750   0.998   0.831
               (0.015) (0.002) (0.009)    (0.011) (0.000) (0.005)    (0.002) (0.012) (0.016)    (0.002) (0.003) (0.006)    (0.011) (0.000) (0.005)
        100    0.632   0.999   0.751      0.771   0.999   0.847      0.989   0.897   0.433      0.983   0.934   0.464      0.689   0.999   0.803
               (0.007) (0.000) (0.004)    (0.008) (0.000) (0.005)    (0.002) (0.009) (0.020)    (0.002) (0.001) (0.004)    (0.009) (0.000) (0.005)
        150    0.607   0.999   0.730      0.749   0.999   0.837      0.981   0.943   0.474      0.972   0.964   0.499      0.636   1.000   0.778
               (0.006) (0.000) (0.004)    (0.007) (0.000) (0.004)    (0.002) (0.007) (0.017)    (0.002) (0.001) (0.003)    (0.007) (0.000) (0.004)
NN(2)   50     0.632   0.999   0.751      0.787   0.999   0.857      0.971   0.864   0.404      0.984   0.875   0.401      0.798   0.999   0.870
               (0.007) (0.000) (0.004)    (0.012) (0.000) (0.006)    (0.004) (0.010) (0.014)    (0.003) (0.004) (0.007)    (0.008) (0.000) (0.004)
        100    0.730   0.999   0.802      0.831   0.999   0.875      0.987   0.924   0.382      0.985   0.937   0.407      0.791   0.999   0.862
               (0.008) (0.000) (0.005)    (0.007) (0.000) (0.005)    (0.002) (0.004) (0.006)    (0.002) (0.001) (0.005)    (0.007) (0.000) (0.000)
        150    0.555   0.999   0.695      0.693   0.999   0.799      0.952   0.936   0.337      0.934   0.965   0.425      0.621   1.000   0.762
               (0.017) (0.000) (0.007)    (0.006) (0.000) (0.004)    (0.004) (0.002) (0.008)    (0.003) (0.001) (0.003)    (0.007) (0.000) (0.004)
BG      50     0.994   0.981   0.898      0.904   0.983   0.832      0.867   0.697   0.356      0.962   0.807   0.482      NA      NA      NA
               (0.002) (0.001) (0.005)    (0.039) (0.001) (0.028)    (0.032) (0.021) (0.009)    (0.004) (0.005) (0.005)
        100    0.949   0.989   0.857      0.949   0.989   0.857      0.569   0.908   0.348      0.818   0.920   0.462      NA      NA      NA
               (0.007) (0.000) (0.005)    (0.007) (0.000) (0.005)    (0.039) (0.011) (0.004)    (0.005) (0.005) (0.002)
        150    0.782   0.994   0.780      0.782   0.994   0.780      0.426   0.952   0.314      0.626   0.959   0.408      NA      NA      NA
               (0.021) (0.000) (0.008)    (0.021) (0.000) (0.008)    (0.035) (0.006) (0.003)    (0.006) (0.001) (0.003)
[Figure: heatmaps for the true graph and the graphs estimated by StepGraph, StepGraph2, Glasso, CLIME and EPC, for p = 50, 100 and 150.]

Figure 4: Model AR(1). Heatmaps of the frequency of adjacency for each pair of nodes. The axes display the p nodes of the graph in a given order.

[Figure: heatmaps for the true graph and the graphs estimated by StepGraph, StepGraph2, Glasso, CLIME and EPC, for p = 50, 100 and 150.]

Figure 5: Model NN(2). Heatmaps of the frequency of adjacency for each pair of nodes. The axes display the p nodes of the graph in a given order.

[Figure: heatmaps for the true graph and the graphs estimated by StepGraph, StepGraph2, Glasso and CLIME (EPC omitted), for p = 50, 100 and 150.]

Figure 6: Model BG. Heatmaps of the frequency of adjacency for each pair of nodes. The axes display the p nodes of the graph in a given order.
