Sparsity and Its Mathematics
ELEG5481, Lecture 15
Acknowledgment: Qiang Li for helping prepare the slides.
Outline
Majorization Minimization (MM)
Convergence
Applications
Block Coordinate Descent (BCD)
Applications
Convergence
Summary
Majorization Minimization
Consider the following problem
min_x f(x)
s.t. x ∈ X                                        (1)

MM idea: at iteration r, instead of (1), minimize a surrogate u(·, x^{r-1}) that majorizes f:
x^r ∈ arg min_x u(x, x^{r-1})
s.t. x ∈ X                                        (2)

[Figure: f(x) together with the majorizing surrogates u(x; x^0) and u(x; x^1); the iterates x^0, x^1, x^2, ... move toward the minimizer x*.]
The nonincreasing property of {f(x^r)} implies that f(x^r) converges to some limit f̄ (assuming f is bounded below on X). But how about the convergence of the iterates {x^r}?
Technical Preliminaries
Limit point: x̄ is a limit point of {x^k} if there exists a subsequence of {x^k} that converges to x̄. Note that every bounded sequence in R^n has a limit point (or convergent subsequence).
Directional derivative: let f : D → R be a function, where D ⊆ R^m is a convex set. The directional derivative of f at a point x in direction d is defined by
f'(x; d) ≜ liminf_{λ↓0} [f(x + λd) − f(x)] / λ.                    (3)
Convergence of MM
Assumption 1. u(·, ·) satisfies the following conditions:
u(y, y) = f(y), ∀y ∈ X,                                            (4a)
u(x, y) ≥ f(x), ∀x, y ∈ X,                                         (4b)
u'(x, y; d)|_{x=y} = f'(y; d), ∀d with y + d ∈ X,                  (4c)
u(x, y) is continuous in (x, y).                                   (4d)
(4c) means that the first-order local behavior of u(·, x^{r-1}) is the same as that of f(·).
Convergence of MM
Theorem 1 [Razaviyayn-Hong-Luo]. Suppose that Assumption 1 is satisfied. Then every limit point of the iterates generated by the MM algorithm is a stationary point of problem (1).
Proof. From Property 1, we know that f(x^{r+1}) ≤ u(x^{r+1}, x^r) ≤ u(x, x^r), ∀x ∈ X. Now assume that there exists a subsequence {x^{r_j}} of {x^r} converging to a limit point z, i.e., lim_{j→∞} x^{r_j} = z. Then
u(x^{r_{j+1}}, x^{r_{j+1}}) = f(x^{r_{j+1}}) ≤ f(x^{r_j + 1}) ≤ u(x^{r_j + 1}, x^{r_j}) ≤ u(x, x^{r_j}), ∀x ∈ X.
Letting j → ∞, we obtain u(z, z) ≤ u(x, z), ∀x ∈ X, which implies
u'(x, z; d)|_{x=z} ≥ 0, ∀d with z + d ∈ X.                          (5)
Combining the above inequality with (4c) (i.e., u'(x, y; d)|_{x=y} = f'(y; d), ∀d with y + d ∈ X), we have
f'(z; d) ≥ 0, ∀d with z + d ∈ X,
i.e., z is a stationary point of problem (1). ∎
Example: consider the nonnegativity-constrained least-squares problem
min_{x ≥ 0} f(x) ≜ ½‖Ax − b‖²₂,
where b ∈ R^m_+, b ≠ 0, and A ∈ R^{m×n}_{++}.
MM leads to the multiplicative update
x^r_l = c^r_l x^{r-1}_l,  l = 1, …, n,                              (6)
where c^r_l ≜ [A^T b]_l / [A^T A x^{r-1}]_l. Starting from x^0 > 0, the iterates remain strictly positive.
[Figure: convergence of the multiplicative MM updates; error on a log scale (from 10^1 down to 10^{-4}) versus the number of iterations (0 to 140).]
Here (6) is the minimizer of the separable quadratic surrogate
u(x, x^{r-1}) ≜ f(x^{r-1}) + (x − x^{r-1})^T ∇f(x^{r-1}) + ½ (x − x^{r-1})^T Φ(x^{r-1}) (x − x^{r-1}),
where
Φ(x^{r-1}) = Diag( [A^T A x^{r-1}]_1 / x^{r-1}_1, …, [A^T A x^{r-1}]_n / x^{r-1}_n ).
Observations:
u(x, x^{r-1}) is a quadratic approximation of f(x), and Φ(x^{r-1}) ⪰ A^T A
⟹ u(x, x^{r-1}) ≥ f(x), ∀x ∈ R^n, and u(x^{r-1}, x^{r-1}) = f(x^{r-1}).
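A minimal numerical sketch of the multiplicative update (6) in Python/NumPy (the quadratic objective f(x) = ½‖Ax − b‖²₂ and the all-ones starting point are assumptions used only for illustration):

```python
import numpy as np

def mm_multiplicative(A, b, num_iter=200):
    """Multiplicative MM update (6) for f(x) = 0.5*||Ax - b||_2^2 with A > 0, b >= 0.

    Starting from a strictly positive point, the iterates stay strictly positive.
    """
    x = np.ones(A.shape[1])             # any strictly positive starting point (assumption)
    Atb = A.T @ b
    for _ in range(num_iter):
        x = x * Atb / (A.T @ (A @ x))   # x_l^r = c_l^r * x_l^{r-1}, cf. (6)
    return x
```

Each iteration costs two matrix-vector products and is, by construction, the exact minimizer of the diagonal quadratic surrogate u(·, x^{r-1}) above.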
More generally, if f(x) = g(x) + h(x) with h concave and differentiable, a majorizing surrogate is obtained by linearizing h at x^r:
u(x, x^r) = g(x) + h(x^r) + ∇_x h(x^r)^T (x − x^r),
where the last two terms are the linearization of h at x^r, and u(x^r, x^r) = f(x^r).
Application: Reweighted ℓ1 Minimization
The (concave) log-sum penalty promotes sparsity more aggressively than the ℓ1 norm:
min_{x ∈ R^n} Σ_{i=1}^n log(1 + |x_i|/ε)   s.t. y = Ax.            (7)

[Figure: the penalty log(1 + |x|/ε) on [-2, 2]; compared with |x|, it is a closer surrogate of the ℓ0 indicator.]

Problem (7) can be written in epigraph form as
min_{x, z ∈ R^n} Σ_{i=1}^n log(z_i + ε)   s.t. y = Ax, |x_i| ≤ z_i, i = 1, …, n.            (8)
Applying MM, i.e., linearizing the concave objective of (8) at z^r, yields the reweighted ℓ1 update
(x^{r+1}, z^{r+1}) = arg min_{x, z} Σ_{i=1}^n z_i / (z^r_i + ε)   s.t. y = Ax, |x_i| ≤ z_i, i = 1, …, n.
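A minimal sketch of the resulting reweighted ℓ1 loop (Candès-Wakin-Boyd style), using cvxpy to solve each weighted basis-pursuit subproblem; the problem sizes, random data, and the use of cvxpy are illustrative assumptions:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n, k = 40, 100, 8
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true                              # noiseless measurements of a sparse signal

eps = 1e-1
w = np.ones(n)                              # start from plain (unweighted) l1 minimization
x = cp.Variable(n)
for r in range(5):                          # a few reweighting iterations usually suffice
    cp.Problem(cp.Minimize(cp.norm1(cp.multiply(w, x))), [A @ x == y]).solve()
    w = 1.0 / (np.abs(x.value) + eps)       # new weights from the log-sum majorization
print("recovery error:", np.linalg.norm(x.value - x_true))
```

Each pass solves a weighted ℓ1 problem; the weights 1/(|x^r_i| + ε) come from the z-update above with z^r_i = |x^r_i|.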
[Fig. 2 from Candès-Wakin-Boyd: Sparse signal recovery through reweighted ℓ1 iterations. (a) Original length n = 512 signal x_0 with 130 spikes. (b) Scatter plot, coefficient by coefficient, of x_0 versus its reconstruction x^(0) using unweighted ℓ1 minimization. (c) Reconstruction x^(1) after the first reweighted iteration. (d) Reconstruction x^(2) after the second reweighted iteration.]
Application: ℓ2-ℓp Optimization
Many problems involve solving the following problem (e.g., basis-pursuit denoising):
min_x f(x) ≜ ½‖y − Ax‖²₂ + ‖x‖_p                                   (9)
where p ≥ 1.
If A = I or A is unitary, the optimal x* is computed in closed form as
x* = A^T y − Proj_C(A^T y),                                        (10)
where C is the unit ball of the dual norm ‖·‖_q, 1/p + 1/q = 1; for p = 1 this reduces to component-wise soft-thresholding, x*_i = soft([A^T y]_i, 1), i = 1, …, n.
For a general A, apply MM with the surrogate u(x, x^r) ≜ f(x) + dist(x, x^r),
where dist(x, x^r) ≜ (c/2)‖x − x^r‖²₂ − ½‖Ax − Ax^r‖²₂ and c > λ_max(A^T A).
dist(x, x^r) ≥ 0, ∀x ⟹ u(x, x^r) majorizes f(x).
u(x, x^r) can be re-expressed as
u(x, x^r) = (c/2)‖x − x̄^r‖²₂ + ‖x‖_p + const.,
where x̄^r = (1/c) A^T (y − Ax^r) + x^r.
The modified ℓ2-ℓp problem min_x (c/2)‖x − x̄^r‖²₂ + ‖x‖_p is of the same form as the unitary case, so by (10) it has a simple closed-form (for p = 1, soft-thresholding) solution.
Repeatedly solving this subproblem leads to an optimal solution of the ℓ2-ℓp problem (by the MM convergence in Theorem 1).
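For p = 1 the resulting MM iteration is the well-known iterative soft-thresholding scheme. A minimal sketch (the choice of c via the spectral norm and the zero initialization are assumptions for illustration):

```python
import numpy as np

def soft(z, tau):
    """Component-wise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def mm_l2_l1(A, y, num_iter=500):
    """MM for min_x 0.5*||y - Ax||_2^2 + ||x||_1 (the p = 1 case of (9))."""
    c = 1.01 * np.linalg.norm(A, 2) ** 2   # c > lambda_max(A^T A)
    x = np.zeros(A.shape[1])
    for _ in range(num_iter):
        x_bar = x + A.T @ (y - A @ x) / c  # center of the quadratic majorizer
        x = soft(x_bar, 1.0 / c)           # closed-form minimizer of u(., x^r)
    return x
```

Each iteration costs two matrix-vector products plus a component-wise thresholding.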
Application: Expectation Maximization (EM)
Suppose that there are some missing data or hidden variables z in the model; w denotes the observed data and θ the unknown parameter. The EM algorithm iteratively computes an ML estimate as follows:
E-step: given θ^r, form the surrogate u(θ, θ^r) in (11a) below, i.e., compute E_{z|w,θ^r}[ln p(w, z|θ)];
M-step: θ^{r+1} = arg max_θ u(θ, θ^r).
MM interpretation of the EM algorithm:
ln p(w|θ) = ln E_{z|θ}[ p(w|z, θ) ]
          = ln E_{z|θ}[ p(z|w, θ^r) p(w|z, θ) / p(z|w, θ^r) ]
          = ln E_{z|w,θ^r}[ p(z|θ) p(w|z, θ) / p(z|w, θ^r) ]            (interchange the integrations)
          ≥ E_{z|w,θ^r}[ ln ( p(z|θ) p(w|z, θ) / p(z|w, θ^r) ) ]        (Jensen's inequality)
          = E_{z|w,θ^r}[ ln p(w, z|θ) ] − E_{z|w,θ^r}[ ln p(z|w, θ^r) ]  ≜ u(θ, θ^r)            (11a)
Thus u(·, θ^r) minorizes ln p(w|·), with equality at θ = θ^r, so EM is an MM (minorize-maximize) algorithm.
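A minimal EM sketch for a two-component 1-D Gaussian mixture with unit variances (the mixture model, sample sizes, and initial guesses are illustrative assumptions; the loop makes the E-step/M-step structure concrete):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])  # observed data

pi, mu1, mu2 = 0.5, -1.0, 1.0            # initial guesses for the weight and the means
for _ in range(100):
    # E-step: responsibilities gamma_i = p(z_i = 1 | w_i, theta^r)
    p1 = pi * np.exp(-0.5 * (w - mu1) ** 2)
    p2 = (1 - pi) * np.exp(-0.5 * (w - mu2) ** 2)
    gamma = p1 / (p1 + p2)
    # M-step: maximize u(theta, theta^r) = E_{z|w,theta^r}[ln p(w, z | theta)]
    pi = gamma.mean()
    mu1 = (gamma * w).sum() / gamma.sum()
    mu2 = ((1 - gamma) * w).sum() / (1 - gamma).sum()
print(pi, mu1, mu2)                       # approaches (0.3, -2, 3) for this data
```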
Outline
Majorization Minimization (MM)
Convergence
Applications
Block Coordinate Descent (BCD)
Applications
Convergence
Summary
Block Coordinate Descent (BCD)
Consider the problem
min_x f(x)
s.t. x ∈ X = X_1 × X_2 × … × X_m ⊆ R^n                              (12)
BCD Algorithm:
1: Find a feasible point x^0 ∈ X and set r = 0
2: repeat
3:    r = r + 1,  i = ((r − 1) mod m) + 1
4:    Let x*_i ∈ arg min_{x ∈ X_i} f(x^{r-1}_1, …, x^{r-1}_{i-1}, x, x^{r-1}_{i+1}, …, x^{r-1}_m)
5:    Set x^r_i = x*_i and x^r_k = x^{r-1}_k, ∀k ≠ i
6: until some convergence criterion is met
Merits of BCD
Example: ℓ2-ℓ1 least squares,
min_{x ∈ R^n} f(x) ≜ ½‖y − Ax‖²₂ + ‖x‖₁                              (13)
BCD reduces (13) to a sequence of scalar problems: with all coordinates but x_k fixed,
min_{x_k} f_k(x_k) ≜ ½‖ ỹ^r − a_k x_k ‖²₂ + |x_k|,   where ỹ^r ≜ y − Σ_{j≠k} a_j x^r_j.
The optimal x_k has a closed form:
x*_k = soft(a_k^T ỹ^r, 1) / ‖a_k‖²₂,   with soft(z, τ) ≜ sign(z) max(|z| − τ, 0).
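A minimal BCD (cyclic coordinate descent) sketch for (13), maintaining a running residual so each coordinate update stays cheap (the zero initialization and the fixed number of sweeps are assumptions for illustration):

```python
import numpy as np

def soft(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def bcd_l2_l1(A, y, num_sweeps=50):
    """Cyclic BCD for min_x 0.5*||y - Ax||_2^2 + ||x||_1, cf. (13)."""
    m, n = A.shape
    x = np.zeros(n)
    col_sq = (A ** 2).sum(axis=0)          # ||a_k||_2^2 for every column
    resid = y - A @ x                      # running residual y - Ax
    for _ in range(num_sweeps):
        for k in range(n):
            y_tilde = resid + A[:, k] * x[k]            # y - sum_{j != k} a_j x_j
            x_new = soft(A[:, k] @ y_tilde, 1.0) / col_sq[k]
            resid += A[:, k] * (x[k] - x_new)           # keep the residual consistent
            x[k] = x_new
    return x
```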
Application: MIMO Channel Capacity Maximization
Received signal: y(t) = Hx(t) + n(t), where x(t) is the Tx signal, H is the MIMO channel matrix, and n(t) is standard additive Gaussian noise, i.e., n(t) ∼ CN(0, I).
[Figure: point-to-point MIMO link with multiple Tx and Rx antennas.]
The transmit covariance Q = E[x(t)x(t)^H] is designed to maximize the capacity:
max_{Q ⪰ 0} log det(I + HQH^H)
s.t. Tr(Q) ≤ P
The optimal Q has the water-filling structure Q* = V Diag(p*_1, …, p*_n) V^H, where V collects the eigenvectors of H^H H, p*_i = (μ − 1/σ_i)_+ with σ_i the i-th eigenvalue of H^H H, and μ ≥ 0 being the water-level such that Σ_i p*_i = P.
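A minimal water-filling sketch in NumPy (the bisection search for the water-level μ is an implementation assumption; the slides only specify Σ_i p*_i = P):

```python
import numpy as np

def waterfill(gains, P, tol=1e-9):
    """Return p with p_i = max(mu - 1/gains_i, 0) and sum(p) = P (mu found by bisection)."""
    lo, hi = 0.0, P + np.max(1.0 / gains)
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.maximum(mu - 1.0 / gains, 0.0).sum() < P:
            lo = mu
        else:
            hi = mu
    return np.maximum(mu - 1.0 / gains, 0.0)

def capacity_opt(H, P):
    """Optimal Q* = V diag(p*) V^H for max_{Q >= 0, Tr(Q) <= P} logdet(I + H Q H^H)."""
    gains, V = np.linalg.eigh(H.conj().T @ H)   # eigen-decomposition of H^H H
    p = np.zeros_like(gains)
    pos = gains > 1e-12                         # water-fill only over the nonzero modes
    p[pos] = waterfill(gains[pos], P)
    return (V * p) @ V.conj().T
```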
Now consider the K-user multiple-access channel (MAC):
y(t) = Σ_{k=1}^K H_k x_k(t) + n(t),
where x_k(t) is user k's Tx signal, H_k is user k's channel matrix, and n(t) ∼ CN(0, I).
[Figure: K users with signals x_1(t), …, x_K(t) and channels H_1, …, H_K transmitting to a common receiver.]
With transmit covariances {Q_k}, the achievable sum rate is log det( Σ_{k=1}^K H_k Q_k H_k^H + I ).
Sum-capacity maximization:
max_{{Q_k}_{k=1}^K} log det( Σ_{k=1}^K H_k Q_k H_k^H + I )            (14)
s.t. Tr(Q_k) ≤ P_k, Q_k ⪰ 0, k = 1, …, K
Problem (14) is convex w.r.t. {Q_k}, but it has no simple closed-form solution.
Alternatively, we can apply BCD to (14) and cyclically update Q_k while fixing Q_j, j ≠ k:
max_{Q_k} log det( H_k Q_k H_k^H + Ψ )
s.t. Tr(Q_k) ≤ P_k, Q_k ⪰ 0,
where Ψ ≜ Σ_{j≠k} H_j Q_j H_j^H + I.
This per-user subproblem has a closed-form water-filling solution (on the whitened effective channel Ψ^{-1/2} H_k), just like the previous single-user MIMO case; the resulting BCD scheme is known as iterative water-filling.
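A minimal BCD (iterative water-filling) sketch that reuses capacity_opt from the single-user sketch above; for simplicity it assumes all users see the same number of receive antennas:

```python
import numpy as np

def iterative_waterfilling(H_list, P_list, num_sweeps=20):
    """Cyclic BCD for (14): each pass water-fills every user against the others' interference."""
    K, m = len(H_list), H_list[0].shape[0]
    Q = [np.zeros((H.shape[1], H.shape[1]), dtype=complex) for H in H_list]
    for _ in range(num_sweeps):
        for k in range(K):
            # interference-plus-noise covariance seen by user k
            Psi = np.eye(m, dtype=complex)
            for j in range(K):
                if j != k:
                    Psi += H_list[j] @ Q[j] @ H_list[j].conj().T
            # whiten: effective channel Psi^{-1/2} H_k, then single-user water-filling
            w, U = np.linalg.eigh(Psi)
            Psi_inv_sqrt = (U / np.sqrt(w)) @ U.conj().T
            Q[k] = capacity_opt(Psi_inv_sqrt @ H_list[k], P_list[k])
    return Q
```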
Application: Low-Rank Matrix Completion
[Example: a recommendation matrix M with users along one dimension and items along the other; only some ratings M_ij ∈ {1, …, 5} are observed, the remaining entries are unknown ('?'). The goal is to predict the missing entries.]
A natural formulation is rank minimization subject to consistency with the observed entries Ω:
min_{W ∈ R^{m×n}} rank(W)
s.t. W_ij = M_ij, (i, j) ∈ Ω,
or its convex relaxation with the nuclear norm ‖W‖_* in place of rank(W).
A nonconvex alternative [Wen-Yin-Zhang] factorizes W = XY with X ∈ R^{m×k}, Y ∈ R^{k×n}:
min_{X, Y, Z} ½‖XY − Z‖²_F
s.t. Z_ij = M_ij, (i, j) ∈ Ω,
which can be handled by BCD over the three blocks X, Y, Z; each block update is a simple least-squares or projection step.
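A minimal three-block BCD sketch for this factorization model (the random initialization, the pseudoinverse-based least-squares updates, and the boolean observation mask are assumptions for illustration):

```python
import numpy as np

def matrix_completion_bcd(M, mask, k, num_sweeps=200):
    """BCD over (X, Y, Z) for min 0.5*||XY - Z||_F^2 s.t. Z agrees with M on the mask."""
    m, n = M.shape
    rng = np.random.default_rng(0)
    X = rng.standard_normal((m, k))
    Y = rng.standard_normal((k, n))
    Z = np.where(mask, M, 0.0)                 # observed entries, zeros elsewhere
    for _ in range(num_sweeps):
        X = Z @ np.linalg.pinv(Y)              # X-block: least squares given Y, Z
        Y = np.linalg.pinv(X) @ Z              # Y-block: least squares given X, Z
        Z = X @ Y                              # Z-block: best fit ...
        Z[mask] = M[mask]                      # ... then restore the observed entries
    return X @ Y                               # completed (low-rank) estimate of M
```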
Application: Nonconvex QP
max_x ½ x^T Q x + c^T x
s.t. x ∈ X,
where X is a polyhedron; Q is not necessarily negative semidefinite, so the problem is in general nonconvex.
An equivalent¹ bilinear reformulation splits x into two blocks:
max_{x_1, x_2} ½ x_1^T Q x_2 + ½ c^T x_1 + ½ c^T x_2
s.t. x_1, x_2 ∈ X.
When fixing either x_1 or x_2, the problem becomes an LP, thereby efficiently solvable.
¹ The equivalence is in the following sense: if x* is an optimal solution of the original QP, then (x*, x*) is optimal for the bilinear problem; conversely, if (x*_1, x*_2) is an optimal solution of the bilinear problem, then both x*_1 and x*_2 are optimal for the original QP.
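A minimal BCD sketch for the bilinear reformulation, taking X to be a box (an illustrative assumption; any polyhedron works with an LP solver) and using scipy's linprog for each block:

```python
import numpy as np
from scipy.optimize import linprog

def bcd_bilinear_qp(Q, c, lb, ub, num_sweeps=50):
    """Alternately maximize 0.5*x1^T Q x2 + 0.5*c^T x1 + 0.5*c^T x2 over x1, x2 in [lb, ub]^n."""
    n = len(c)
    x1, x2 = np.zeros(n), np.zeros(n)
    bounds = [(lb, ub)] * n
    for _ in range(num_sweeps):
        # x1-block: the objective is linear in x1; linprog minimizes, so negate
        x1 = linprog(-(0.5 * Q @ x2 + 0.5 * c), bounds=bounds, method="highs").x
        # x2-block: the symmetric LP with x1 fixed
        x2 = linprog(-(0.5 * Q.T @ x1 + 0.5 * c), bounds=bounds, method="highs").x
    return x1, x2
```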
Application: Nonnegative Matrix Factorization (NMF)
min_{U ∈ R^{m×k}, V ∈ R^{k×n}} ‖M − UV‖²_F
s.t. U ≥ 0, V ≥ 0                                                    (15)
where M ≥ 0 (entry-wise).
Usually k ≪ min(m, n), or mk + nk ≪ mn, so NMF can be seen as a linear dimensionality-reduction technique for nonnegative data.
NMF Examples
Image Processing: the U ≥ 0 constraint forces the basis elements to be nonnegative, leading to a parts-based representation. From Lee and Seung: "As can be seen from Fig. 1, the NMF basis and encodings contain a large fraction of vanishing coefficients, so both the basis images and image encodings are sparse. The basis images are sparse because they are non-global and contain several versions of mouths, noses and other facial parts, where the various versions are in different locations or forms. The variability of a whole face is generated by combining these different parts. Although all parts are used by at …"
[Figure: original face images versus their NMF basis images.]
Application: Text Mining
Basis elements recover the different topics;
Weights assign each text to its corresponding topics.
Application: Hyperspectral Unmixing
BCD for NMF
Recall the NMF problem
min_{U ∈ R^{m×k}, V ∈ R^{k×n}} ‖M − UV‖²_F   s.t. U ≥ 0, V ≥ 0.                    (16)
With U fixed, the problem decouples across the columns of V into nonnegative least-squares (NLS) problems,
min_{V(:,i) ∈ R^k} ‖M(:, i) − U V(:, i)‖²₂   s.t. V(:, i) ≥ 0,  i = 1, …, n,        (17)
and similarly for U with V fixed. This leads to the following two-block BCD (alternating NLS) scheme:
1: Initialize U^0 ≥ 0, V^0 ≥ 0 and set r = 0
2: repeat
3:    Solve the NLS problem V* ∈ arg min_{V ≥ 0} ‖M − U^r V‖²_F
4:    V^{r+1} = V*
5:    Solve the NLS problem U* ∈ arg min_{U ≥ 0} ‖M − U V^{r+1}‖²_F
6:    U^{r+1} = U*
7:    r = r + 1
8: until some convergence criterion is met
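A minimal alternating-NLS sketch that solves each block column-by-column (row-by-row for U) with scipy's nnls, following the decomposition in (17); the random initialization and fixed sweep count are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import nnls

def nmf_bcd(M, k, num_sweeps=50):
    """Two-block BCD for (16): alternate nonnegative least squares in V and U."""
    m, n = M.shape
    rng = np.random.default_rng(0)
    U = rng.random((m, k))
    V = rng.random((k, n))
    for _ in range(num_sweeps):
        # V-block: min_{V >= 0} ||M - U V||_F^2 decouples over the columns of V, cf. (17)
        V = np.column_stack([nnls(U, M[:, i])[0] for i in range(n)])
        # U-block: min_{U >= 0} ||M - U V||_F^2 = ||M^T - V^T U^T||_F^2, decouples over rows of U
        U = np.column_stack([nnls(V.T, M[j, :])[0] for j in range(m)]).T
    return U, V
```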
Outline
Majorization Minimization (MM)
Convergence
Applications
Block Coordinate Descent (BCD)
Applications
Convergence
Summary
BCD Convergence
The idea of BCD is to divide and conquer. However, there is no free lunch; BCD
may get stuck or converge to some point of no interest.
BCD Convergence
min_x f(x)
s.t. x ∈ X = X_1 × X_2 × … × X_m ⊆ R^n                              (18)
Let us revisit the previous examples and examine when BCD is guaranteed to converge.
Example: MIMO MAC sum-capacity maximization, cf. (14):
max_{{Q_k}_{k=1}^K} log det( Σ_{k=1}^K H_k Q_k H_k^H + I )
s.t. Tr(Q_k) ≤ P_k, Q_k ⪰ 0, ∀k
Example: matrix-completion factorization model:
min_{X, Y, Z} ½‖XY − Z‖²_F   s.t. Z_ij = M_ij, (i, j) ∈ Ω
For the MAC sum-capacity problem
max_{{Q_k}_{k=1}^K} log det( Σ_{k=1}^K H_k Q_k H_k^H + I )
s.t. Tr(Q_k) ≤ P_k, Q_k ⪰ 0, k = 1, …, K:
f is convex (viewing (14) as minimization of the negative log-det), thus pseudoconvex³;
each feasible set {Q_k | Tr(Q_k) ≤ P_k, Q_k ⪰ 0} is compact;
⟹ iterative water-filling converges to a globally optimal solution.
³ f is pseudoconvex if for all x, y ∈ X such that ∇f(x)^T (y − x) ≥ 0, we have f(y) ≥ f(x). Notice that convex ⊂ pseudoconvex ⊂ quasiconvex.
Example: NMF:
min_{U ∈ R^{m×k}, V ∈ R^{k×n}} ‖M − UV‖²_F   s.t. U ≥ 0, V ≥ 0
Summary
MM and BCD have great potential in handling nonconvex problems and in realizing fast/distributed implementations for large-scale convex problems;
Many well-known algorithms can be interpreted as special cases of MM and BCD;
Under some conditions, convergence to a stationary point can be guaranteed for MM and BCD.
References
M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," submitted to SIAM Journal on Optimization, available online at http://arxiv.org/abs/1209.2385.
L. Grippo and M. Sciandrone, "On the convergence of the block nonlinear Gauss-Seidel method under convex constraints," Operations Research Letters, vol. 26, pp. 127-136, 2000.
E. J. Candès, M. B. Wakin, and S. P. Boyd, "Enhancing sparsity by reweighted ℓ1 minimization," J. Fourier Anal. Appl., vol. 14, pp. 877-905, 2008.
M. Zibulevsky and M. Elad, "ℓ1-ℓ2 optimization in signal and image processing," IEEE Signal Processing Magazine, pp. 76-88, May 2010.
D. P. Bertsekas, Nonlinear Programming, Athena Scientific, 1st ed., 1995.
W. Yu and J. M. Cioffi, "Sum capacity of a Gaussian vector broadcast channel," IEEE Trans. Inf. Theory, vol. 50, no. 1, pp. 145-152, Jan. 2004.
Z. Wen, W. Yin, and Y. Zhang, "Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm," Rice CAAM Tech. Report 10-07.
D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference, MIT Press, pp. 556-562, 2001.