CVX Lecture Graphs
Outline
1 Graphical modeling
6 Experiments
Graphical models
[Figure: an example graph on nodes x1, ..., x9.]
Examples
Graphical models provide a graph representation that encodes relationships between entities.
In many cases, the relationships between entities are straightforward:
- Are two people friends in a social network?
- Are two researchers co-authors on a published paper?
In many other cases, relationships are not known and must be learned:
- Does one gene regulate the expression of others?
- Which drug alters the pharmacologic effect of another drug?
Schematic of graph learning
- Given a data matrix X ∈ R^(n×p) = [x1, x2, ..., xp], each column xi ∈ R^n is assumed to reside on one of the p nodes, and each of the n rows of X is a signal (or feature) on the same graph.
- The goal is to obtain a graph representation of the data.

[Figure: the nine-node example graph, with edge weights wij on the edges connecting nodes xi and xj.]
Graph and its matrix representation
Connectivity matrix C, adjacency matrix W, and Laplacian matrix L:

[C]ij = 1 if (i, j) ∈ E, and 0 otherwise.
[W]ij = wij if (i, j) ∈ E, and 0 otherwise.
[L]ij = −wij if (i, j) ∈ E, [L]ii = Σ_{j=1}^p wij, and 0 otherwise.

Example: V = {1, 2, 3, 4}, E = {(1, 2), (1, 3), (2, 3), (2, 4)} with weights {w12, w13, w23, w24} = {2, 2, 3, 1}:

C = [ 0 1 1 0       W = [ 0 2 2 0       L = [  4 −2 −2  0
      1 0 1 1             2 0 3 1             −2  6 −3 −1
      1 1 0 0             2 3 0 0             −2 −3  5  0
      0 1 0 0 ],          0 1 0 0 ],           0 −1  0  1 ].
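As a sanity check, the three matrices of the example can be reproduced in a few lines of numpy (an illustrative sketch, not part of the slides; node indices are 0-based):

```python
import numpy as np

# Example graph from the slide: V = {1,2,3,4},
# E = {(1,2),(1,3),(2,3),(2,4)} with weights w12=2, w13=2, w23=3, w24=1
edges = [(0, 1, 2.0), (0, 2, 2.0), (1, 2, 3.0), (1, 3, 1.0)]
p = 4

C = np.zeros((p, p))   # connectivity (0/1) matrix
W = np.zeros((p, p))   # weighted adjacency matrix
for i, j, w in edges:
    C[i, j] = C[j, i] = 1.0
    W[i, j] = W[j, i] = w

D = np.diag(W.sum(axis=1))  # diagonal degree matrix
L = D - W                   # graph Laplacian

# Matches the matrix displayed on the slide; every row of L sums to zero
assert np.allclose(L, [[4, -2, -2, 0], [-2, 6, -3, -1],
                       [-2, -3, 5, 0], [0, -1, 0, 1]])
assert np.allclose(L.sum(axis=1), 0)
```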
Graph matrices
- The adjacency matrix W = [wij] and the Laplacian matrix L both represent the same weighted graph and are related by
L = D − W,
where D = Diag(W1) is the diagonal degree matrix.
Types of graphical models
Graph learning from smooth signals
Quantifying smoothness:

tr(XLX⊤) = (1/2) Σ_{i,j} wij ‖xi − xj‖²₂.

- A smaller distance ‖xi − xj‖²₂ between data points xi and xj forces the learned graph to have a larger affinity value wij, and vice versa.
- A higher weight wij implies that the features xi and xj are similar, and hence strongly connected.
- h(L) is a regularization function (e.g., ‖L‖₁, ‖L‖²_F, log det(L)).
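The smoothness identity is easy to verify numerically (a sketch, not from the slides; the factor 1/2 appears because the double sum over all ordered pairs counts each edge twice):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 3
W = rng.random((p, p)); W = (W + W.T) / 2   # symmetric weights
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W              # Laplacian
X = rng.standard_normal((n, p))             # columns x_i live on the nodes

lhs = np.trace(X @ L @ X.T)
rhs = 0.5 * sum(W[i, j] * np.linalg.norm(X[:, i] - X[:, j])**2
                for i in range(p) for j in range(p))
assert np.isclose(lhs, rhs)   # tr(X L X^T) equals the weighted pairwise sum
```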
Ex 1. Learning graphs under constraints
Contd...
- Define eij = ‖xi − xj‖²₂, and let ei be the vector whose j-th element is eij.
- For each wi, we solve the following problem:

minimize_{wi} (1/2) ‖wi + ei/(2λ)‖²₂,
subject to wi⊤1 = 1, wij ≥ 0, wii = 0.

The Lagrangian is

L(wi, ηi, βi) = (1/2) ‖wi + ei/(2λ)‖²₂ − ηi(wi⊤1 − 1) − βi⊤wi,

where ηi > 0 and βi ∈ Rp are the Lagrange multipliers, with βij ≥ 0 ∀ i ≠ j.
KKT optimality
The optimal solution ŵi must satisfy stationarity of the Lagrangian w.r.t. wi:

ŵi + ei/(2λ) − ηi1 − βi = 0.

Then for the j-th element of ŵi, we have

ŵij + eij/(2λ) − ηi − βij = 0.

Noting that ŵij βij = 0 (complementary slackness), the KKT conditions give

ŵij = (−eij/(2λ) + ηi)+, where (a)+ = max(a, 0).

ηi is chosen to satisfy the constraint wi⊤1 = 1 ∀ i = 1, . . . , p.
Additional goal: sparsity in the graph. How do we enforce sparsity?
i) Use a sparsity-enforcing regularizer, e.g., the ℓ1-norm.
ii) Choose λ such that each node has exactly m ≪ p neighbors, i.e., ‖wi‖₀ = m.
We will explore the second path for enforcing sparsity.
Sparsity by choosing λi: number of neighbors ‖ŵi‖₀ = m
Without loss of generality, suppose ei1, ei2, . . . , eip are sorted in increasing order. By design, we have wii = 0.
The constraint ‖ŵi‖₀ = m implies ŵim > 0 and ŵi,m+1 = 0. Therefore, we have

−eim/(2λi) + ηi > 0, and −ei,m+1/(2λi) + ηi ≤ 0.

Combining with the constraint ŵi⊤1 = 1,

Σ_{j=1}^m (−eij/(2λi) + ηi) = 1 =⇒ ηi = 1/m + (1/(2mλi)) Σ_{j=1}^m eij.
Contd...
Combining the previous results, the optimal {ŵij}j≠i can be obtained in closed form:

ŵij = (ei,m+1 − eij) / (m ei,m+1 − Σ_{h=1}^m eih),  j ≤ m,
ŵij = 0,  j > m.
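The closed-form weights can be sketched directly in numpy (an illustrative implementation with made-up distances, not from the slides):

```python
import numpy as np

def neighbor_weights(e, m):
    """Closed-form w_i for one node: e holds squared distances e_ij,
    sorted in increasing order (self-distance excluded); m = #neighbors."""
    e = np.asarray(e, dtype=float)
    denom = m * e[m] - e[:m].sum()      # m*e_{i,m+1} - sum_{h<=m} e_ih
    w = np.zeros_like(e)
    w[:m] = (e[m] - e[:m]) / denom      # (e_{i,m+1} - e_ij) / denom for j <= m
    return w

e = np.array([0.1, 0.3, 0.5, 2.0, 4.0])   # assumed sorted squared distances
w = neighbor_weights(e, m=3)
assert np.isclose(w.sum(), 1.0)                       # w_i^T 1 = 1
assert (w >= 0).all() and np.count_nonzero(w) == 3    # exactly m neighbors
```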
Ex 2. Graph based clustering
Goal of graph based clustering: Given an initial connected graph W infer
the k-component graph S.
[Figure: heat maps of (iii) the initial affinity matrix W and (iv) the learned k-component matrix S.]
The Laplacian admits the eigendecomposition

L = U Diag(λ1, λ2, · · · , λp) U⊤,

where U collects the eigenvectors and [λ1, λ2, . . . , λp] are the eigenvalues in increasing order, with the property

{{λj = 0}_{j=1}^k, c1 ≤ λk+1 ≤ · · · ≤ λp ≤ c2},

where the k zero eigenvalues denote the number of components.
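The component-counting property is easy to check numerically (a sketch on a made-up two-component graph, not from the slides):

```python
import numpy as np

# Laplacian of a graph with k = 2 components: a triangle {0,1,2} and an edge {3,4}
W = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

lam = np.sort(np.linalg.eigvalsh(L))
# Number of (numerically) zero eigenvalues equals the number of components
assert np.sum(np.abs(lam) < 1e-9) == 2
```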
[Figure: a 3-component graph on 60 nodes and a plot of its Laplacian eigenvalues, showing exactly three zero eigenvalues. Axis: number of nodes (eigenvalues).]
Constrained Laplacian Rank (CLR) for clustering
minimize_{S=[s1,...,sp]} ‖S − W‖²_F,
subject to si⊤1 = 1, si ≥ 0, sii = 0,
rank(Ls) = p − k.
CLR problem reformulation
minimize_{S=[s1,...,sp]} ‖S − W‖²_F + β Σ_{i=1}^k λi(Ls),
subject to si⊤1 = 1, si ≥ 0, sii = 0.

For sufficiently large β, the optimal solution S to this problem will make the second term Σ_{i=1}^k λi(Ls) equal to zero.

Ky Fan's theorem [Fan, 1949]:

Σ_{i=1}^k λi(Ls) = minimize_{F∈R^{p×k}} tr(F⊤LsF), subject to F⊤F = I.

Using Ky Fan's theorem, we can force the first k eigenvalues to zero via the following formulation:

minimize_{S, F∈R^{p×k}} ‖S − W‖²_F + β tr(F⊤LsF),
subject to si⊤1 = 1, si ≥ 0, sii = 0,
F⊤F = I.
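Ky Fan's theorem can be verified numerically: the minimizer F is spanned by the eigenvectors of the k smallest eigenvalues, and the minimum equals their sum (an illustrative check on a random PSD matrix, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 6, 2
A = rng.standard_normal((p, p))
Ls = A @ A.T                              # any PSD stand-in for L_s
vals, vecs = np.linalg.eigh(Ls)           # eigenvalues in ascending order

F = vecs[:, :k]                           # eigenvectors of the k smallest eigenvalues
assert np.allclose(F.T @ F, np.eye(k))    # F^T F = I
# tr(F^T Ls F) attains the sum of the k smallest eigenvalues
assert np.isclose(np.trace(F.T @ Ls @ F), vals[:k].sum())
```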
Solving for F and S alternately
- Sub-problem for F:

minimize_{F∈R^{p×k}} tr(F⊤LsF), subject to F⊤F = I.

By Ky Fan's theorem, the solution is given by the k eigenvectors of Ls associated with its k smallest eigenvalues.
Gaussian Markov random field (GMRF)
A random vector x = (x1, x2, . . . , xp)⊤ is called a GMRF with parameters (0, Θ) if its density follows

p(x) = (2π)^(−p/2) (det(Θ))^(1/2) exp(−(1/2) x⊤Θx).

Θij ≠ 0 ⇐⇒ {i, j} ∈ E ∀ i ≠ j
xi ⊥ xj | x∖{xi, xj} ⇐⇒ Θij = 0
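The Markov property can be illustrated on a chain GMRF: a zero in the precision matrix means conditional independence, even though the covariance is dense (a numpy sketch with a made-up tridiagonal Θ, not from the slides):

```python
import numpy as np

# Chain GMRF on 3 nodes: edges (1,2) and (2,3) only, so Theta_13 = 0
Theta = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Theta)

assert Theta[0, 2] == 0          # x1 and x3 conditionally independent given x2
assert abs(Sigma[0, 2]) > 1e-6   # yet marginally correlated (dense covariance)
```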
Penalized MLE Gaussian Graphical Modeling (GGM)
Additional structure
maximize_{Θ≻0} log det(Θ) − tr(ΘS) − α‖Θ‖₁.

- Laplacian structure:
SL = {Θ | Θij = Θji ≤ 0 for i ≠ j, Θii = −Σ_{j≠i} Θij}. But a Laplacian is a PSD matrix, and log det is only defined for a PD matrix.
- Modified Laplacian structure [Lake and Tenenbaum, 2010]:

maximize_{Θ̃≻0} log det(Θ̃) − tr(Θ̃S) − α‖Θ̃‖₁, subject to Θ̃ = Θ + (1/σ²) I.

- ℓ1-regularized MLE with Laplacian structure after adding J = (1/p)11⊤ [Egilmez et al., 2017, Zhao et al., 2019]:

maximize_{Θ∈SL} log det(Θ + J) − tr(ΘS) − α‖Θ‖₁.
GLasso algorithm [Friedman et al., 2008]
maximize_{Θ≻0} log det(Θ) − tr(ΘS) − α‖Θ‖₁.

The stationarity condition is

Θ⁻¹ − S − αΓ = 0,

where Γ is a subgradient of ‖Θ‖₁. In particular, the diagonal satisfies

Σ̂ii = Sii + α, i = 1, . . . , p,

where Σ̂ = Θ⁻¹.
GLasso uses a block-coordinate method for solving the problem. Consider a partitioning of Θ and Σ̂:

Θ = [ Θ11   θ12       Σ̂ = [ Σ̂11   σ̂12
      θ12⊤  Θ22 ],           σ̂12⊤  Σ̂22 ],

where Θ11 ∈ R^((p−1)×(p−1)), θ12 ∈ R^(p−1), and Θ22 is a scalar, and similarly for the other partitions. Next, using ΘΣ̂ = I, the inverse Σ̂ = Θ⁻¹ can be expressed blockwise as

Σ̂ = [ Θ11⁻¹ + (Θ11⁻¹θ12θ12⊤Θ11⁻¹)/(Θ22 − θ12⊤Θ11⁻¹θ12)    −(Θ11⁻¹θ12)/(Θ22 − θ12⊤Θ11⁻¹θ12)
       ·                                                     1/(Θ22 − θ12⊤Θ11⁻¹θ12) ],

and the stationarity condition for the last column reads

−σ̂12 + s12 + αγ12 = 0.
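The partitioned-inverse expressions for the last column of Σ̂ can be verified numerically (a sketch on a random PD precision matrix, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
A = rng.standard_normal((p, p))
Theta = A @ A.T + p * np.eye(p)       # a PD precision matrix
Sigma = np.linalg.inv(Theta)

T11, t12, T22 = Theta[:-1, :-1], Theta[:-1, -1], Theta[-1, -1]
T11inv = np.linalg.inv(T11)
schur = T22 - t12 @ T11inv @ t12      # scalar Schur complement

# Last column of Sigma matches the partitioned-inverse formula
assert np.isclose(Sigma[-1, -1], 1.0 / schur)
assert np.allclose(Sigma[:-1, -1], -T11inv @ t12 / schur)
```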
GLasso [Mazumder and Hastie, 2012]
Reading off σ̂12 from the partitioned inverse, the stationarity condition for the last column becomes

Θ11⁻¹θ12 / (Θ22 − θ12⊤Θ11⁻¹θ12) + s12 + αγ12 = 0.

Solving this condition (an ℓ1-regularized, lasso-type subproblem) yields ν⋆, from which

θ⋆12 = ν⋆ Σ̂22,
Θ⋆22 = 1/Σ̂22 + (θ⋆12)⊤ Θ11⁻¹ θ⋆12.
GLasso algorithm summary
1: Initialize: Σ̂ = Diag(S) + αI, and Θ = Σ̂⁻¹.
2: Repeat until the convergence criterion is met:
(a) Rearrange rows and columns such that the target column is last (implicitly).
(b) Compute Θ11⁻¹ = Σ̂11 − σ̂12σ̂12⊤/Σ̂22.
(c) Obtain ν⋆ and update θ⋆12 and Θ⋆22.
(d) Update Θ and Σ̂ using the partitioned-inverse expressions, ensuring ΘΣ̂ = I.
3: Output the precision matrix Θ.
Learning a graph with Laplacian constraint
Solving for the Laplacian constraint [Zhao et al., 2019]
maximize_{Θ∈SL} log det(Θ + J) − tr(ΘS) − α‖Θ‖₁,

where SL = {Θ | Θij = Θji ≤ 0 for i ≠ j, Θii = −Σ_{j≠i} Θij}.

Since Θ satisfies the Laplacian constraints, the off-diagonal elements of Θ are non-positive and the diagonal elements are non-negative, so

‖Θ‖₁ = tr(ΘH), with H = 2I − 11⊤,

and the objective becomes

log det(Θ + J) − tr(ΘS) − α tr(ΘH) = log det(Θ + J) − tr(ΘK),

where K = S + αH.
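The identity ‖Θ‖₁ = tr(ΘH) for Laplacian matrices is easy to confirm numerically (a sketch on a random Laplacian, with H = 2I − 11⊤ as above; not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 6
W = rng.random((p, p)); W = (W + W.T) / 2
np.fill_diagonal(W, 0)
Theta = np.diag(W.sum(axis=1)) - W     # a Laplacian: off-diag <= 0, diag >= 0

H = 2 * np.eye(p) - np.ones((p, p))
# Elementwise l1 norm equals the linear trace form tr(Theta H)
assert np.isclose(np.abs(Theta).sum(), np.trace(Theta @ H))
```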
Solving with known connectivity information: Approach 1
maximize log det(Θ + J) − tr ΘK
Θ∈SL (C)
We further suppose the graph has no self-loops, so the diagonal elements of C are all zero. Then the constraint set SL(C) can be compactly rewritten as

Θ = Ω,
Θ ∈ SΘ = {Θ | Θ ⪰ 0, Θ1 = 0},
Ω ∈ SΩ = {Ω | I ∘ Ω ≥ 0, B ∘ Ω = 0, C ∘ Ω ≤ 0},

where ∘ denotes the elementwise (Hadamard) product. With the change of variable Θ = PΞP⊤, the problem becomes

minimize_{Ξ,Ω} tr(ΞK̃) − log det(Ξ),
subject to Ξ ≻ 0, PΞP⊤ = Ω, I ∘ Ω ≥ 0, B ∘ Ω = 0, C ∘ Ω ≤ 0.

This fits the standard ADMM template

minimize_{x,z} f(x) + g(z), subject to Ax + Bz = c,

where x ∈ Rn, z ∈ Rm, A ∈ Rp×n, B ∈ Rp×m, and c ∈ Rp. The augmented Lagrangian is written as

Lρ(x, z, y) = f(x) + g(z) + y⊤(Ax + Bz − c) + (ρ/2)‖Ax + Bz − c‖²₂,

where ρ > 0 is the penalty parameter. The ADMM subroutine cycles through the following updates until it converges:

1: Initialize: z(0), y(0), ρ
2: t ← 0
3: while stopping criterion is not met do
4:   z(t+1) = arg min_{z∈Z} Lρ(x(t), z, y(t))
5:   x(t+1) = arg min_{x∈X} Lρ(x, z(t+1), y(t))
6:   y(t+1) = y(t) + ρ(Ax(t+1) + Bz(t+1) − c)
7:   t ← t + 1
8: end while
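The generic ADMM loop above can be sketched on a toy instance; everything here (the quadratic f, the ℓ1 penalty g, the constraint x − z = 0, ρ = 1) is an illustrative assumption, not the graph problem from the slides:

```python
import numpy as np

# Toy ADMM: f(x) = 0.5*||x - a||^2, g(z) = lam*||z||_1, constraint x - z = 0
# (A = I, B = -I, c = 0).  The solution is the soft-threshold of a.
a = np.array([3.0, -0.5, 1.2])
lam, rho = 1.0, 1.0

x = z = u = np.zeros(3)                  # u = y/rho (scaled dual variable)
for _ in range(200):
    x = (a + rho * (z - u)) / (1 + rho)  # x-update: ridge-type step
    v = x + u
    z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0)  # z-update: soft-threshold
    u = u + x - z                        # dual ascent on the residual

assert np.allclose(z, np.sign(a) * np.maximum(np.abs(a) - lam, 0), atol=1e-4)
```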
Edge based formulation: Approach 2
Suppose there are M edges present in the graph, and the m-th edge connects vertices im and jm. Then we can always perform the following decomposition of the graph Laplacian matrix Θ:

Θ = Σ_{m=1}^M w_{im jm} (e_{im}e_{im}⊤ + e_{jm}e_{jm}⊤ − e_{im}e_{jm}⊤ − e_{jm}e_{im}⊤)
  = Σ_{m=1}^M w_{im jm} (e_{im} − e_{jm})(e_{im} − e_{jm})⊤
  = E Diag(w) E⊤,

where the m-th column of E is e_{im} − e_{jm}.
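The incidence-style factorization E Diag(w) E⊤ can be checked against the usual D − W construction (a sketch on a small made-up graph, not from the slides):

```python
import numpy as np

# Edges (im, jm) and weights for a small graph on p = 4 nodes
p = 4
edges = [(0, 1), (0, 2), (1, 2), (1, 3)]
w = np.array([2.0, 2.0, 3.0, 1.0])

E = np.zeros((p, len(edges)))            # column m is e_im - e_jm
for m, (i, j) in enumerate(edges):
    E[i, m], E[j, m] = 1.0, -1.0

Theta = E @ np.diag(w) @ E.T             # = E Diag(w) E^T

W = np.zeros((p, p))
for (i, j), wm in zip(edges, w):
    W[i, j] = W[j, i] = wm
# The factorization reproduces the Laplacian D - W
assert np.allclose(Theta, np.diag(W.sum(axis=1)) - W)
```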
Problem reformulation
minimize_{w≥0} −log det(E Diag(w) E⊤ + J) + tr(E Diag(w) E⊤ K).

Using J = (1/p)11⊤,

E Diag(w) E⊤ + (1/p)11⊤ = [E, 1] Diag([w⊤, 1/p]⊤) [E, 1]⊤ = G Diag([w⊤, 1/p]⊤) G⊤.

The problem is convex; we will obtain a simple closed-form update rule via the majorization-minimization (MM) approach [Sun et al., 2016]. With F0 = GX0G⊤, where X0 is the previous iterate, and substituting X = Diag([w⊤, 1/p]⊤), the minimization problem becomes

minimize_{w≥0} tr(E Diag(w) E⊤ K) + tr(F0⁻¹ G Diag([w⊤, 1/p]⊤) G⊤).

Yet this minimization problem does not yield a simple closed-form solution. For the sake of algorithmic simplicity, we need to further majorize the objective.
Double majorization and optimal solution
For any YXY⊤ ≻ 0, the following matrix inequality holds:
Structured graphs
State-of-the-art direction
- The effort has been on characterizing families of structures for which learning can be made feasible, e.g., the maximum-weight spanning tree for tree structures [Chow and Liu, 1968], and local separation and walk-summability for Erdős–Rényi graphs, power-law graphs, and small-world graphs [Anandkumar et al., 2012].
- Existing methods are restricted to particular structures and are difficult to extend to other useful structures, e.g., multi-component, bipartite, etc.
Problem statement
Motivating example 1: structure via Laplacian eigenvalues
Θ = U Diag(λ) U⊤.

For a multi-component graph, the first k eigenvalues of its Laplacian matrix are zero:

Sλ = {{λj = 0}_{j=1}^k, c1 ≤ λk+1 ≤ · · · ≤ λp ≤ c2}.
[Figure: the 3-component graph on 60 nodes and its Laplacian eigenvalues; the first three eigenvalues are zero.]
Motivating example 2: structure via adjacency eigenvalues
Adjacency matrix ΘA: ΘA = Diag(diag(Θ)) − Θ, with eigendecomposition ΘA = V Diag(ψ) V⊤.

For a bipartite graph, the adjacency eigenvalues are symmetric about the origin:

Sψ = {ψi = −ψ_{p−i+1}, ∀ i = 1, . . . , p}.

[Figure: a bipartite graph and its symmetric adjacency spectrum.]
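The symmetric-spectrum property of bipartite adjacency matrices can be verified on a small example (a sketch with a made-up bipartite graph, not from the slides):

```python
import numpy as np

# Adjacency of a small bipartite graph: parts {0, 1} and {2, 3}, p = 4
A = np.zeros((4, 4))
for i, j, w in [(0, 2, 1.0), (0, 3, 2.0), (1, 2, 0.5)]:
    A[i, j] = A[j, i] = w

psi = np.sort(np.linalg.eigvalsh(A))
# psi_i = -psi_{p-i+1}: the spectrum is symmetric about the origin
assert np.allclose(psi, -psi[::-1])
```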
Proposed unified framework for SGL
maximize_Θ log gdet(Θ) − tr(ΘS) − α h(Θ),
subject to Θ ∈ SΘ, λ(T(Θ)) ∈ ST.

With h(Θ) = ‖Θ‖₁ and explicit spectral variables:

maximize_{Θ,λ,U} log gdet(Θ) − tr(ΘS) − α‖Θ‖₁,
subject to Θ ∈ SΘ, Θ = U Diag(λ) U⊤, λ ∈ Sλ, U⊤U = I,

SΘ = {Θ | Θij = Θji ≤ 0 for i ≠ j, Θii = −Σ_{j≠i} Θij}.

For p = 4 nodes, the Laplacian operator Lw maps a weight vector w = (w1, . . . , w6) to

Lw = [ Σ_{i=1,2,3} wi   −w1              −w2              −w3
       −w1              Σ_{i=1,4,5} wi   −w4              −w5
       −w2              −w4              Σ_{i=2,4,6} wi   −w6
       −w3              −w5              −w6              Σ_{i=3,5,6} wi ].
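The operator w ↦ Lw can be sketched in numpy for the p = 4 case; the edge ordering (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) is an assumption inferred from the displayed matrix:

```python
import numpy as np

def L_op(w):
    """Laplacian operator for p = 4: w = (w1,...,w6) ordered over the edges
    (1,2),(1,3),(1,4),(2,3),(2,4),(3,4), matching the matrix on the slide."""
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    L = np.zeros((4, 4))
    for wm, (i, j) in zip(w, edges):
        L[i, j] = L[j, i] = -wm
    np.fill_diagonal(L, -L.sum(axis=1))   # diagonal = node degrees
    return L

w = np.arange(1.0, 7.0)
L = L_op(w)
assert np.allclose(L, L.T) and np.allclose(L @ np.ones(4), 0)
assert L[0, 0] == w[0] + w[1] + w[2]      # sum over i = 1,2,3, as on the slide
```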
Problem reformulation
maximize_{Θ,λ,U} log gdet(Θ) − tr(ΘS) − α‖Θ‖₁,
subject to Θ ∈ SΘ, Θ = U Diag(λ) U⊤, λ ∈ Sλ, U⊤U = I.

Using i) Θ = Lw and ii) tr(ΘS) + α h(Θ) = tr(ΘK), with K = S + H and H = α(2I − 11⊤), the proposed problem formulation becomes:

minimize_{w,λ,U} −log gdet(Diag(λ)) + tr(K Lw) + (β/2) ‖Lw − U Diag(λ) U⊤‖²_F,
subject to w ≥ 0, λ ∈ Sλ, U⊤U = I.
SGL algorithm for k-component graph learning
I Variables: X = (w, λ, U)
Update for w
Sub-problem for w:

minimize_{w≥0} tr(K Lw) + (β/2) ‖Lw − U Diag(λ) U⊤‖²_F,

which can be rewritten as

minimize_{w≥0} f(w) = (1/2) ‖Lw‖²_F − c⊤w.

This problem is a convex quadratic program, but it does not have a closed-form solution due to the non-negativity constraint w ≥ 0.
The function f(w) is majorized at wt by the function

g(w|wt) = f(wt) + (w − wt)⊤ ∇f(wt) + (L/2) ‖w − wt‖²,

where wt is the update from the previous iteration and L = 2p [Sun et al., 2016]. Minimizing the majorizer yields the closed-form MM update

wt+1 = (wt − (1/(2p)) ∇f(wt))+,

where (a)+ = max(a, 0).
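The projected MM update can be sketched on a small synthetic instance (p = 4, six edge weights); the matrix A encoding w ↦ vec(Lw) and the vector c are illustrative assumptions, not the quantities from the slides:

```python
import numpy as np

# MM / projected-gradient update w_{t+1} = (w_t - grad f(w_t)/(2p))_+
# for f(w) = 0.5*||A w||^2 - c^T w, where A maps w to vec(Lw)
p, m = 4, 6
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
A = np.zeros((p * p, m))
for idx, (i, j) in enumerate(edges):
    Lm = np.zeros((p, p))                 # elementary Laplacian of one edge
    Lm[i, j] = Lm[j, i] = -1.0
    Lm[i, i] = Lm[j, j] = 1.0
    A[:, idx] = Lm.ravel()

rng = np.random.default_rng(4)
c = A.T @ rng.standard_normal(p * p)      # arbitrary linear term

w = np.ones(m)
for _ in range(500):
    grad = A.T @ (A @ w) - c
    w = np.maximum(w - grad / (2 * p), 0)  # step 1/(2p): valid since ||A||^2 <= 2p

# First-order optimality: w >= 0, grad >= 0, and grad_i = 0 wherever w_i > 0
grad = A.T @ (A @ w) - c
assert (w >= 0).all() and (grad >= -1e-6).all()
assert np.allclose(grad[w > 1e-8], 0, atol=1e-6)
```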
Update for U
Sub-problem for U:

maximize_U tr(U⊤ (Lw) U Diag(λ)),
subject to U⊤U = I_{p−k}.
Update for λ
Sub-problem for λ:

minimize_{λ∈Sλ} −log det(Diag(λ)) + (β/2) ‖U⊤ (Lw) U − Diag(λ)‖²_F,

which reduces to

minimize_{c1≤λk+1≤···≤λp≤c2} Σ_{i=1}^{p−k} −log λk+i + (β/2) ‖λ − d‖²,

where d = diag(U⊤(Lw)U).

Sandeep Kumar, Jiaxi Ying, José Vinícius de M. Cardoso, and Daniel P. Palomar, "A Unified Framework for Structured Graph Learning via Spectral Constraints," arXiv preprint arXiv:1904.09792 (2019).
SGL algorithm summary
minimize_{w,λ,U} −log gdet(Diag(λ)) + tr(K Lw) + (β/2) ‖Lw − U Diag(λ) U⊤‖²_F,
subject to w ≥ 0, λ ∈ Sλ, U⊤U = I_{p−k}.

Algorithm:
1: Input: SCM S, k, c1, c2, β
2: Output: Lw
3: t ← 0
4: while stopping criterion is not met do
5:   wt+1 = (wt − (1/(2p)) ∇f(wt))+
6:   Update Ut+1 by solving the U sub-problem.
7:   Update λt+1 (via an isotonic-regression method with at most p − k iterations).
8:   t ← t + 1
9: end while
10: return w(t+1)
Convergence and the computational complexity
Sandeep Kumar, Jiaxi Ying, José Vinícius de M. Cardoso, and Daniel P. Palomar,
“Structured graph learning via Laplacian spectral constraints,” in Advances in Neural
Information Processing Systems (NeurIPS), 2019.
Synthetic experiment setup
Relative Error = ‖Θ̂⋆ − Θtrue‖_F / ‖Θtrue‖_F,    F-Score = 2tp / (2tp + fp + fn),

where Θ̂⋆ is the final estimate produced by the algorithm, Θtrue is the true reference graph Laplacian matrix, and tp, fp, fn denote true positives, false positives, and false negatives, respectively.
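Both metrics are simple to implement; a small numpy sketch (the example matrices below are made up for illustration):

```python
import numpy as np

def relative_error(Theta_hat, Theta_true):
    return np.linalg.norm(Theta_hat - Theta_true) / np.linalg.norm(Theta_true)

def f_score(Theta_hat, Theta_true, tol=1e-8):
    # Edge detection on the off-diagonal support of the Laplacians
    est = np.abs(Theta_hat) > tol
    ref = np.abs(Theta_true) > tol
    np.fill_diagonal(est, False); np.fill_diagonal(ref, False)
    tp = np.sum(est & ref); fp = np.sum(est & ~ref); fn = np.sum(~est & ref)
    return 2 * tp / (2 * tp + fp + fn)

Theta_true = np.array([[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]])
Theta_hat = np.array([[1.0, -0.9, 0.0], [-0.9, 1.9, -1.0], [0.0, -1.0, 1.0]])
assert f_score(Theta_hat, Theta_true) == 1.0          # support recovered exactly
assert relative_error(Theta_hat, Theta_true) < 0.1    # small weight error
```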
Grid graph
[Figure: grid-graph learning results.]
Real data: cancer dataset [Weinstein et al., 2013]
Animal dataset [Osherson et al., 1991]
[Figure: graphs learned on the animal dataset by (xxv) GGL [Egilmez et al., 2017] and (xxvi) GLasso [Friedman et al., 2008]; nodes are animals.]
Animal dataset contd...
[Figure: additional graphs learned on the animal dataset.]
Bipartite structure via adjacency spectral constraints
[Figure: bipartite graph learning results.]
Multi-component bipartite structure via joint spectral
constraints
[Figure: multi-component bipartite graph learning results.]
Resources
An R package “spectralGraphTopology” containing code for all the experimental results
is available at
https://cran.r-project.org/package=spectralGraphTopology
NeurIPS paper: Sandeep Kumar, Jiaxi Ying, José Vinícius de M. Cardoso, and Daniel P.
Palomar, “Structured graph learning via Laplacian spectral constraints,” in Advances in
Neural Information Processing Systems (NeurIPS), 2019.
https://arxiv.org/pdf/1909.11594.pdf
Extended version paper: Sandeep Kumar, Jiaxi Ying, José Vinícius de M. Cardoso, and
Daniel P. Palomar, “A Unified Framework for Structured Graph Learning via Spectral
Constraints, (2019).” https://arxiv.org/pdf/1904.09792.pdf
Thanks
https://www.danielppalomar.com
References
Absil, P.-A., Mahony, R., and Sepulchre, R. (2009).
Optimization algorithms on matrix manifolds.
Princeton University Press.
Chung, F. R. (1997).
Spectral graph theory.
No. 92. American Mathematical Society.
Dempster, A. P. (1972).
Covariance selection.
Biometrics, pages 157–175.
Fan, K. (1949).
On a theorem of Weyl concerning eigenvalues of linear transformations I.
Proceedings of the National Academy of Sciences of the United States of America,
35(11):652.
Lauritzen, S. L. (1996).
Graphical models, volume 17.
Clarendon Press.
Mazumder, R. and Hastie, T. (2012).
The graphical lasso: New insights and alternatives.
Electronic Journal of Statistics, 6:2125.
Osherson, D. N., Stern, J., Wilkie, O., Stob, M., and Smith, E. E. (1991).
Default probability.
Cognitive Science, 15(2):251–269.
Slawski, M. and Hein, M. (2015).
Estimation of positive definite M-matrices and structure learning for attractive
Gaussian Markov random fields.
Linear Algebra and its Applications, 473:145–179.
Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R. M., Ozenberger, B. A.,
Ellrott, K., Shmulevich, I., Sander, C., Stuart, J. M., et al. (2013).
The Cancer Genome Atlas Pan-Cancer analysis project.
Nature Genetics, 45(10):1113.