
SCIENCE CHINA

Information Sciences
. RESEARCH PAPER . December 2018, Vol. 61 122101:1–122101:12
https://doi.org/10.1007/s11432-017-9367-6

Convergence of multi-block Bregman ADMM for nonconvex composite problems

Fenghui WANG1,2 , Wenfei CAO1,3 & Zongben XU1*


1 School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China;
2 Department of Mathematics, Luoyang Normal University, Luoyang 471022, China;
3 School of Mathematics and Information Science, Shaanxi Normal University, Xi'an 710119, China

Received 12 September 2017/Revised 5 December 2017/Accepted 26 February 2018/Published online 21 June 2018

Abstract  The alternating direction method of multipliers (ADMM) is one of the most powerful and successful methods for solving various composite problems. The convergence of the conventional (i.e., 2-block) ADMM for convex objective functions has been known for a long time, whereas its convergence for nonconvex objective functions has been established only very recently. The multi-block ADMM, a natural extension of ADMM, is a widely used scheme that has also proved very useful in solving various nonconvex optimization problems. It is therefore desirable to establish the convergence of the multi-block ADMM in nonconvex settings. In this paper, we first establish the convergence of the 3-block Bregman ADMM for nonconvex objective functions. We then extend these results to the N-block case (N > 3), which underlines the feasibility of multi-block ADMM applications in nonconvex settings. Finally, we present a simulation study and a real-world application to support the correctness of the obtained theoretical assertions.
Keywords nonconvex regularization, alternating direction method, subanalytic function, K-L inequality,
Bregman distance
Citation Wang F H, Cao W F, Xu Z B. Convergence of multi-block Bregman ADMM for nonconvex composite
problems. Sci China Inf Sci, 2018, 61(12): 122101, https://doi.org/10.1007/s11432-017-9367-6

1 Introduction
Many problems arising in the fields of signal & image processing and machine learning involve finding a
minimizer of the sum of N (N ≥ 2) functions with a linear equality constraint [1]. If N = 2, the problem
then consists of solving

min f (x) + g(y) s.t. Ax + By = 0, (1)

where A ∈ Rm×n1 and B ∈ Rm×n2 are given matrices, f : Rn1 → R and g : Rn2 → R are proper lower
semicontinuous functions. Because of its separable structure, problem (1) can be efficiently solved by
ADMM, namely, through the procedure:


\begin{cases}
x^{k+1} = \arg\min_{x\in\mathbb{R}^{n_1}} L_\alpha(x, y^k, p^k),\\[2pt]
y^{k+1} = \arg\min_{y\in\mathbb{R}^{n_2}} L_\alpha(x^{k+1}, y, p^k),\\[2pt]
p^{k+1} = p^k + \alpha(Ax^{k+1} + By^{k+1}),
\end{cases} \tag{2}

* Corresponding author (email: [email protected])


where α is a penalty parameter and


L_\alpha(x, y, p) := f(x) + g(y) + \langle p, Ax + By\rangle + \frac{\alpha}{2}\|Ax + By\|^2
is the associated augmented Lagrangian function with multiplier p. So far, various variants of the con-
ventional ADMM have been suggested. Among such varieties, Bregman ADMM (BADMM) is designed
to improve the performance of procedure (2) [2]. More specifically, BADMM takes the following iterative
form:

\begin{cases}
x^{k+1} = \arg\min_{x\in\mathbb{R}^{n_1}} L_\alpha(x, y^k, p^k) + \triangle_\phi(x, x^k),\\[2pt]
y^{k+1} = \arg\min_{y\in\mathbb{R}^{n_2}} L_\alpha(x^{k+1}, y, p^k) + \triangle_\psi(y, y^k),\\[2pt]
p^{k+1} = p^k + \alpha(Ax^{k+1} + By^{k+1}),
\end{cases} \tag{3}

where △φ and △ψ are the Bregman distances with respect to the functions φ and ψ, respectively. ADMM was introduced in the early 1970s, and its convergence properties for convex objective functions have been extensively studied [3, 4]. It has been shown that ADMM converges at a sublinear rate of O(1/k) [5], and at a rate of O(1/k^2) for the accelerated version [6]. The convergence of BADMM for convex objective functions has been examined in [2].
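To make procedure (2) concrete, the following MATLAB sketch (ours, not part of the original paper) runs the 2-block iteration on a hypothetical toy instance with f(x) = ‖x‖_1, g(y) = (1/2)‖y − c‖^2, A = I and B = −I, so that both subproblems have closed-form solutions: the x-step is a soft-thresholding (shrinkage) step and the y-step is a simple weighted average.

    % Minimal 2-block ADMM sketch for  min ||x||_1 + 0.5*||y - c||^2  s.t.  x - y = 0.
    % Illustrative only: the subproblem solvers depend entirely on the chosen f and g.
    n = 50; c = randn(n, 1);                        % hypothetical problem data
    alpha = 1.0;                                    % penalty parameter
    x = zeros(n, 1); y = zeros(n, 1); p = zeros(n, 1);
    soft = @(v, t) sign(v) .* max(abs(v) - t, 0);   % soft-thresholding operator
    for k = 1:200
        x = soft(y - p / alpha, 1 / alpha);         % x-step: prox of ||.||_1
        y = (c + p + alpha * x) / (1 + alpha);      % y-step: closed-form quadratic minimizer
        p = p + alpha * (x - y);                    % multiplier update
    end
    fprintf('final constraint violation: %.2e\n', norm(x - y));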
Recently, there has been an increasing interest in the study of ADMM for nonconvex objective functions.
On one hand, the ADMM algorithm is highly successful in solving various nonconvex examples ranging
from nonnegative matrix factorization, distributed matrix factorization, distributed clustering, sparse
zero variance discriminant analysis, tensor decomposition, to matrix completion (see [7–9]). On the other
hand, the convergence analysis of nonconvex ADMM is generally very difficult, due to the failure of
the Fejér monotonicity of iterates. Very recently, the convergence of ADMM as well as BADMM for
nonconvex objective functions has been established in [10–12].
We now consider the 3-block composite optimization problem:

min f (x) + g(y) + h(z) s.t. Ax + By + Cz = 0, (4)

where A ∈ Rm×n1 , B ∈ Rm×n2 and C ∈ Rm×n3 are given matrices, f : Rn1 → R, g : Rn2 → R are proper
lower semicontinuous functions, and h : Rn3 → R is a continuously differentiable function. To solve this
problem, it is thus natural to extend (2) to the following form:

\begin{cases}
x^{k+1} = \arg\min_{x\in\mathbb{R}^{n_1}} L_\alpha(x, y^k, z^k, p^k),\\[2pt]
y^{k+1} = \arg\min_{y\in\mathbb{R}^{n_2}} L_\alpha(x^{k+1}, y, z^k, p^k),\\[2pt]
z^{k+1} = \arg\min_{z\in\mathbb{R}^{n_3}} L_\alpha(x^{k+1}, y^{k+1}, z, p^k),\\[2pt]
p^{k+1} = p^k + \alpha(Ax^{k+1} + By^{k+1} + Cz^{k+1}),
\end{cases} \tag{5}

where the augmented Lagrangian function Lα : Rn1 × Rn2 × Rn3 × Rm → R is defined by


L_\alpha(x, y, z, p) := f(x) + g(y) + h(z) + \langle p, Ax + By + Cz\rangle + \frac{\alpha}{2}\|Ax + By + Cz\|^2. \tag{6}
However, as shown in [13], the 3-block ADMM (5) does not necessarily converge, even in the convex setting. To guarantee its global convergence, some restrictive conditions are required;
for example, the strong convexity condition of all objective functions [14], or at least one function being
strongly convex [15, 16].
The purpose of the present study is to examine convergence of ADMM with N blocks for non-
convex objective functions. Following the idea of (3), we first propose 3-block BADMM for solving
problem (4), and establish its global convergence for some nonconvex functions. Next, we extend the
convergence result to the N -block case (N > 3), which underlines the feasibility of multi-block ADMM
applications in nonconvex settings. Finally we present a simulation study and a real-world application
to support the correctness of the obtained theoretical assertions.

2 Preliminaries
In what follows, Rn will stand for the n-dimensional Euclidean space,
\langle x, y\rangle = x^{\mathrm{T}} y = \sum_{i=1}^{n} x_i y_i, \qquad \|x\| = \sqrt{\langle x, x\rangle},

where x, y ∈ Rn and T stands for the transpose operation. For convenience, we fix the following notations:

uk = (xk , y k , z k ), wk = (xk , y k , z k , pk ), ŵk = (xk , y k , z k , pk , z k−1 ),


kwk = (kxk2 + kyk2 + kzk2 + kpk2 )1/2 , kwk1 = kxk + kyk + kzk + kpk.

2.1 Subdifferentials

Given a function f : R^n → R, we denote by dom f the domain of f, namely, dom f := {x ∈ R^n : f(x) < +∞}. A function f is said to be proper if dom f ≠ ∅, and lower semicontinuous at x_0 if lim inf_{x→x_0} f(x) ≥ f(x_0). If f is lower semicontinuous at every point of its domain of definition, then it is simply called a lower semicontinuous function.
Definition 1. Let f : Rn → R be a proper lower semi-continuous function.
(i) Given x ∈ dom f, the Fréchet subdifferential of f at x, written \hat{\partial} f(x), is the set of all elements u ∈ R^n which satisfy

\liminf_{y\neq x,\ y\to x} \frac{f(y) - f(x) - \langle u, y - x\rangle}{\|x - y\|} \geq 0.

(ii) The limiting subdifferential, or simply the subdifferential, of f at x, written ∂f(x), is defined as

\partial f(x) = \big\{u \in \mathbb{R}^n : \exists\, x^k \to x,\ f(x^k) \to f(x),\ u^k \in \hat{\partial} f(x^k),\ u^k \to u,\ k \to \infty\big\}.

(iii) A stationary point of f is a point x∗ in the domain of f satisfying 0 ∈ ∂f (x∗ ).


(iv) f is said to be L-Lipschitz continuous if ‖f(x) − f(y)‖ ≤ L‖x − y‖ for any x, y ∈ dom f.
Definition 2. An element w∗ := (x∗ , y ∗ , z ∗ , p∗ ) is called a stationary point of the Lagrangian function
Lα defined as in (6) if it satisfies
\begin{cases}
A^{\mathrm{T}} p^* \in -\partial f(x^*), \quad B^{\mathrm{T}} p^* \in -\partial g(y^*),\\[2pt]
C^{\mathrm{T}} p^* = -\nabla h(z^*), \quad Ax^* + By^* + Cz^* = 0.
\end{cases} \tag{7}

Properties of proper lower semicontinuous functions and of the subdifferential can be found in [17]. We collect, in particular, some basic properties of the subdifferential.
Proposition 1. Let f : Rn → R and g : Rn → R be proper lower semi-continuous functions. Then the
following holds:
(i) \hat{\partial} f(x) ⊂ ∂f(x) for each x ∈ R^n. Moreover, the first set is closed and convex, while the second is closed, but not necessarily convex.
(ii) Let (uk , xk ) be sequences such that xk → x, uk → u, f (xk ) → f (x) and uk ∈ ∂f (xk ). Then
u ∈ ∂f (x).
(iii) Fermat’s rule: if x0 ∈ Rn is a local minimizer of f , then x0 is a stationary point of f , that is,
0 ∈ ∂f (x0 ).
(iv) If f is a continuously differentiable function, then ∂(f + g)(x) = ∇f(x) + ∂g(x).

2.2 Kurdyka-Lojasiewicz inequality

The Kurdyka-Lojasiewicz (K-L) inequality was first introduced by Lojasiewicz [18] for real analytic
functions, and then was extended by Kurdyka [19] to smooth functions whose graph belongs to an
o-minimal structure.

Definition 3 (K-L inequality). A function f : R^n → R is said to have the K-L property at x̃ if there exist η > 0, δ > 0 and ϕ ∈ A_η such that for all x ∈ O(x̃, δ) ∩ {x : f(x̃) < f(x) < f(x̃) + η},

\varphi'(f(x) - f(\tilde{x}))\,\operatorname{dist}(0, \partial f(x)) \geq 1,

where dist(x̃, ∂f(x)) := inf{‖x̃ − y‖ : y ∈ ∂f(x)}, and A_η stands for the class of functions ϕ : [0, η) → R_+ with the following properties: (i) ϕ is continuous on [0, η); (ii) ϕ is smooth and concave on (0, η); (iii) ϕ(0) = 0 and ϕ′(x) > 0 for all x ∈ (0, η).
Let Φ be a proper lower semicontinuous function, and a, b be two fixed positive constants. In the
sequel, we consider a sequence {xk } satisfying the following conditions:
(H1) For each k ∈ N, Φ(x^{k+1}) ≤ Φ(x^k) − a‖x^k − x^{k+1}‖²;
(H2) For each k ∈ N, dist(0, ∂Φ(x^{k+1})) ≤ b‖x^k − x^{k+1}‖;
(H3) There exists a subsequence {x^{k_j}} converging to x̃ such that Φ(x^{k_j}) → Φ(x̃) as j → ∞.
Lemma 1 ([20]). Let {x^k} be a sequence that satisfies H1–H3. If Φ has the K-L property, then the sequence {x^k} converges to x̃, which is a stationary point of Φ. Moreover, the sequence {x^k} has finite length, i.e., ∑_{k=1}^{∞} ‖x^{k+1} − x^k‖_1 < ∞.
Typical functions satisfying the K-L inequality include strongly convex functions, real analytic func-
tions, semi-algebraic functions and subanalytic functions.
A differentiable function f is called convex if the following inequality holds for all x, y in its domain:

f(y) \geq f(x) + \langle\nabla f(x), y - x\rangle;

and ρ-strongly convex with ρ > 0 if the following inequality holds for all x, y in its domain:

f(y) \geq f(x) + \langle\nabla f(x), y - x\rangle + \frac{\rho}{2}\|y - x\|^2. \tag{8}
A subset C ⊂ Rn is said to be semi-algebraic if it can be written as
C = \bigcup_{j=1}^{r}\bigcap_{i=1}^{s} \{x \in \mathbb{R}^n : g_{i,j}(x) = 0,\ h_{i,j}(x) < 0\},

where g_{i,j}, h_{i,j} : R^n → R are real polynomial functions. Then a function f : R^n → R is called semi-algebraic if its graph G(f) := {(x, y) ∈ R^{n+1} : f(x) = y} is a semi-algebraic subset of R^{n+1}. For example, the L_q norm ‖x‖_q := (∑_i |x_i|^q)^{1/q} with 0 < q ≤ 1, the sup-norm ‖x‖_∞ := max_i |x_i|, the Euclidean norm ‖x‖, as well as ‖Ax − b‖_q^q, ‖Ax − b‖ and ‖Ax − b‖_∞ for any matrix A, are all semi-algebraic functions.
A real function on R is said to be analytic if it possesses derivatives of all orders and agrees with its
Taylor series in a neighborhood of every point. For a real function f on Rn , it is said to be analytic if
the function of one variable g(t) := f (x + ty) is analytic for any x, y ∈ Rn . It is readily seen that real
polynomial functions such as the quadratic function ‖Ax − b‖² are analytic. Moreover, the ε-smoothed L_q norm ‖x‖_{ε,q} := ∑_i (x_i² + ε)^{q/2} with 0 < q ≤ 1 and the logistic loss function log(1 + e^{−t}) are examples of real analytic functions. A subset C ⊂ R^n is said to be subanalytic if it can be written as

C = \bigcup_{j=1}^{r}\bigcap_{i=1}^{s} \{x \in \mathbb{R}^n : g_{i,j}(x) = 0,\ h_{i,j}(x) < 0\},

where g_{i,j}, h_{i,j} : R^n → R are real analytic functions. Then a function f : R^n → R is called subanalytic if its graph G(f) is a subanalytic subset of R^{n+1}. It is clear that both real analytic and semi-algebraic functions are subanalytic. Generally speaking, the sum of two subanalytic functions is not necessarily subanalytic. It is known, however, that for two subanalytic functions, if at least one of them maps bounded sets to bounded sets, then their sum is also subanalytic, as shown in [9]. In particular, the sum of a subanalytic function and an analytic function is subanalytic. Typical subanalytic functions include:

\|Ax - b\|^2 + \lambda\|y\|_q^q; \qquad \|Ax - b\|^2 + \lambda\sum_i (y_i^2 + \varepsilon)^{q/2};
\frac{1}{n}\sum_{i=1}^{n}\log\big(1 + \exp(-c_i(a_i^{\mathrm{T}}x + b))\big) + \lambda\|y\|_q^q; \qquad \frac{1}{n}\sum_{i=1}^{n}\log\big(1 + \exp(-c_i(a_i^{\mathrm{T}}x + b))\big) + \lambda\sum_i (y_i^2 + \varepsilon)^{q/2}.

2.3 Bregman distance

The Bregman distance plays an important role in various iterative algorithms. As a generalization of the squared Euclidean distance, the Bregman distance shares many of its nice properties. However, the Bregman distance is not a true metric, since it satisfies neither the triangle inequality nor symmetry. For a convex differentiable function φ, the associated Bregman distance is defined as

\triangle_\phi(x, y) = \phi(x) - \phi(y) - \langle\nabla\phi(y), x - y\rangle.

In particular, if we let φ(x) = ‖x‖² in the above, then it reduces to ‖x − y‖², namely, the classical squared Euclidean distance. Moreover, if φ is ρ-strongly convex, it follows from (8) that

\triangle_\phi(x, y) \geq \frac{\rho}{2}\|x - y\|^2. \tag{9}
For more information on Bregman distance, we refer the reader to [21, 22].
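As a quick numerical illustration (ours, not from the paper), the following MATLAB sketch evaluates △φ for a user-supplied pair (φ, ∇φ); it checks that the choice φ(x) = ‖x‖² reproduces the squared Euclidean distance and that a ρ-strongly convex φ obeys the lower bound (9).

    % Bregman distance for a differentiable convex phi, given phi and its gradient.
    bregman = @(phi, grad_phi, x, y) phi(x) - phi(y) - grad_phi(y)' * (x - y);

    x = randn(5, 1); y = randn(5, 1);

    % phi(x) = ||x||^2 recovers the squared Euclidean distance ||x - y||^2.
    phi1  = @(v) v' * v;
    grad1 = @(v) 2 * v;
    fprintf('|Delta_phi - ||x-y||^2| = %.2e\n', abs(bregman(phi1, grad1, x, y) - norm(x - y)^2));

    % A rho-strongly convex phi satisfies (9): Delta_phi(x, y) >= (rho/2)*||x - y||^2.
    rho   = 3;
    phi2  = @(v) (rho / 2) * (v' * v) + log(1 + exp(v(1)));     % strongly convex example
    grad2 = @(v) rho * v + [1 / (1 + exp(-v(1))); zeros(4, 1)];
    fprintf('Delta_phi = %.4f >= %.4f\n', bregman(phi2, grad2, x, y), (rho / 2) * norm(x - y)^2);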

3 Convergence analysis
Motivated by (3), we propose the following algorithm for solving problem (4):


\begin{cases}
x^{k+1} = \arg\min_{x\in\mathbb{R}^{n_1}} L_\alpha(x, y^k, z^k, p^k) + \triangle_{\phi_1}(x, x^k),\\[2pt]
y^{k+1} = \arg\min_{y\in\mathbb{R}^{n_2}} L_\alpha(x^{k+1}, y, z^k, p^k) + \triangle_{\phi_2}(y, y^k),\\[2pt]
z^{k+1} = \arg\min_{z\in\mathbb{R}^{n_3}} L_\alpha(x^{k+1}, y^{k+1}, z, p^k) + \triangle_{\phi_3}(z, z^k),\\[2pt]
p^{k+1} = p^k + \alpha(Ax^{k+1} + By^{k+1} + Cz^{k+1}),
\end{cases} \tag{10}

where △_{φ_i} is an appropriately chosen Bregman distance with respect to the function φ_i, i = 1, 2, 3. Compared with the traditional ADMM, our algorithm has advantages in both effectiveness and efficiency. First, the global convergence of our algorithm does not require any strong convexity of the objective function. Second, a proper choice of the Bregman distances simplifies the subproblems, which in turn improves the performance of the algorithm. For example, for the y-subproblem, let g(y) = ‖y‖_{1/2}^{1/2}. In this situation, the traditional ADMM requires solving the following optimization problem:

\min_{y\in\mathbb{R}^{n_2}} \|y\|_{1/2}^{1/2} + \frac{\alpha}{2}\Big\|By + Ax^{k+1} + Cz^k + \frac{p^k}{\alpha}\Big\|^2.

In general, finding a solution to the above problem is not an easy task. However, if we set

\phi_2(y) = \frac{\mu\alpha}{2}\|y\|^2 - \frac{\alpha}{2}\Big\|By + Ax^{k+1} + Cz^k + \frac{p^k}{\alpha}\Big\|^2

with µ > ‖B‖² in our algorithm, then by a simple calculation the y-subproblem is transformed into minimizing

\|y\|_{1/2}^{1/2} + \frac{\mu\alpha}{2}\Big\|y - \Big(y^k - \mu^{-1}B^{\mathrm{T}}\Big(By^k + Ax^{k+1} + Cz^k + \frac{p^k}{\alpha}\Big)\Big)\Big\|^2.

This problem can be easily solved, since its solution has a closed form [23].
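The following MATLAB sketch (ours; a schematic under the stated assumptions, not code from the paper) illustrates this linearization trick: with the above choice of φ_2, the Bregman term cancels the coupling through B, and the y-subproblem collapses to a proximal step of g at the point y^k − µ^{-1}B^T(By^k + Ax^{k+1} + Cz^k + p^k/α). To keep the example self-contained we take g(y) = λ‖y‖_1, whose proximal operator is the soft shrinkage; for g(y) = ‖y‖_{1/2}^{1/2} one would instead plug in the half-thresholding operator of [23].

    % One Bregman-linearized y-step, illustrated with g(y) = lambda*||y||_1.
    % All data below are hypothetical placeholders standing in for the outer BADMM iterates.
    m = 30; n1 = 20; n2 = 40; n3 = 10;
    A = randn(m, n1); B = randn(m, n2); C = randn(m, n3);
    xk1 = randn(n1, 1); yk = randn(n2, 1); zk = randn(n3, 1); pk = randn(m, 1);
    alpha = 1.0; lambda = 0.1;
    mu = 1.01 * norm(B)^2;                          % mu > ||B||^2 keeps phi_2 strongly convex

    soft = @(v, t) sign(v) .* max(abs(v) - t, 0);   % prox of t*||.||_1 (soft shrinkage)

    % Linearization point: y^k - mu^{-1} * B' * (B*y^k + A*x^{k+1} + C*z^k + p^k/alpha).
    v = yk - (B' * (B * yk + A * xk1 + C * zk + pk / alpha)) / mu;

    % The quadratic has weight mu*alpha/2, so the threshold becomes lambda/(mu*alpha).
    yk1 = soft(v, lambda / (mu * alpha));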
In what follows, we assume:
(A1) Φ has the K-L property;
(A2) There is σ > 0 such that σ‖x‖² ≤ ‖C^T x‖², ∀x ∈ R^m;
(A3) h is continuously differentiable such that ∇h is L-Lipschitz continuous;
(A4) φ_i is ρ_i-strongly convex and ∇φ_i is L_i-Lipschitz continuous for i = 1, 2, 3;
(A5) The parameters are chosen so that αρσ > 6(L² + 2L_3²), where ρ = min{ρ_1, ρ_2, ρ_3}.
Also, define a function Φ : R^{n_1} × R^{n_2} × R^{n_3} × R^m × R^{n_3} → R by

\Phi(x, y, z, p, \hat{z}) = L_\alpha(x, y, z, p) + \frac{\tau}{2}\|z - \hat{z}\|^2,

where τ = 6L_3²(ασ)^{-1}.


We establish a series of lemmas to support the proof of convergence of procedure (10).
Lemma 2. For each k ∈ N, there exists a > 0 such that Φ(ŵ^{k+1}) ≤ Φ(ŵ^k) − a‖ŵ^{k+1} − ŵ^k‖².
Proof. Applying Fermat’s rule to the z-subproblem, we get

∇h(z k+1 ) + C T pk+1 + ∇φ3 (z k+1 ) − ∇φ3 (z k ) = 0. (11)

It then follows from the Cauchy-Schwarz inequality that

\begin{aligned}
\|C^{\mathrm{T}}(p^{k+1} - p^k)\|^2 &= \|(\nabla h(z^{k+1}) - \nabla h(z^k)) + (\nabla\phi_3(z^{k+1}) - \nabla\phi_3(z^k)) - (\nabla\phi_3(z^k) - \nabla\phi_3(z^{k-1}))\|^2\\
&\leq \|\nabla h(z^{k+1}) - \nabla h(z^k)\|^2 + \|(\nabla\phi_3(z^{k+1}) - \nabla\phi_3(z^k)) - (\nabla\phi_3(z^k) - \nabla\phi_3(z^{k-1}))\|^2\\
&\quad + 2\|\nabla h(z^{k+1}) - \nabla h(z^k)\|\,\|(\nabla\phi_3(z^{k+1}) - \nabla\phi_3(z^k)) - (\nabla\phi_3(z^k) - \nabla\phi_3(z^{k-1}))\|\\
&\leq 3\|\nabla h(z^{k+1}) - \nabla h(z^k)\|^2 + \frac{3}{2}\|(\nabla\phi_3(z^{k+1}) - \nabla\phi_3(z^k)) - (\nabla\phi_3(z^k) - \nabla\phi_3(z^{k-1}))\|^2\\
&\leq 3L^2\|z^{k+1} - z^k\|^2 + 3\big(\|\nabla\phi_3(z^{k+1}) - \nabla\phi_3(z^k)\|^2 + \|\nabla\phi_3(z^k) - \nabla\phi_3(z^{k-1})\|^2\big)\\
&\leq 3(L^2 + L_3^2)\|z^{k+1} - z^k\|^2 + 3L_3^2\|z^k - z^{k-1}\|^2.
\end{aligned}

Thus, in view of condition (A2), we get


\|p^{k+1} - p^k\|^2 \leq \frac{3(L^2 + L_3^2)}{\sigma}\|z^{k+1} - z^k\|^2 + \frac{3L_3^2}{\sigma}\|z^k - z^{k-1}\|^2. \tag{12}
On the other hand, it follows from (10) and (9) that
\begin{aligned}
L_\alpha(x^{k+1}, y^k, z^k, p^k) &\leq L_\alpha(x^k, y^k, z^k, p^k) - \frac{\rho}{2}\|x^{k+1} - x^k\|^2,\\
L_\alpha(x^{k+1}, y^{k+1}, z^k, p^k) &\leq L_\alpha(x^{k+1}, y^k, z^k, p^k) - \frac{\rho}{2}\|y^{k+1} - y^k\|^2,\\
L_\alpha(x^{k+1}, y^{k+1}, z^{k+1}, p^k) &\leq L_\alpha(x^{k+1}, y^{k+1}, z^k, p^k) - \frac{\rho}{2}\|z^{k+1} - z^k\|^2,\\
L_\alpha(x^{k+1}, y^{k+1}, z^{k+1}, p^{k+1}) &= L_\alpha(x^{k+1}, y^{k+1}, z^{k+1}, p^k) + \frac{1}{\alpha}\|p^{k+1} - p^k\|^2,
\end{aligned}

from which we have

L_\alpha(w^{k+1}) \leq L_\alpha(w^k) - \frac{\rho}{2}\|u^{k+1} - u^k\|^2 + \frac{1}{\alpha}\|p^{k+1} - p^k\|^2. \tag{13}

Adding up inequalities (12) and (13), we have

L_\alpha(w^{k+1}) + \frac{\tau}{2}\|z^{k+1} - z^k\|^2 \leq L_\alpha(w^k) + \frac{\tau}{2}\|z^{k-1} - z^k\|^2 - a\|\hat{w}^{k+1} - \hat{w}^k\|^2,

where a := ρ/2 − 3(L² + 2L_3²)(ασ)^{-1} is clearly a positive real number.
Lemma 3. If {u^k} is bounded, then ∑_{k=1}^{∞} ‖w^k − w^{k+1}‖² < ∞. In particular, {w^k} is asymptotically regular, namely, ‖w^k − w^{k+1}‖ → 0 as k → ∞. Moreover, any cluster point of {w^k} is a stationary point of the augmented Lagrangian function L_α.
Proof. In view of (11), (A2) and (A4), we have

\sqrt{\sigma}\,\|p^k\| \leq \|C^{\mathrm{T}} p^k\| \leq \|\nabla h(z^k)\| + L_3\|z^k - z^{k-1}\|.

Since ∇h is continuous and {uk } is bounded, this implies that {pk } is bounded, and so are {wk } and
{ŵk }. Thus, there exists a subsequence {ŵkj } convergent to ŵ∗ . By our hypothesis, the function Φ is
lower semicontinuous, which leads to lim inf_{j→∞} Φ(ŵ^{k_j}) ≥ Φ(ŵ^*), so that Φ(ŵ^{k_j}) is bounded from below. By Lemma 2, Φ(ŵ^k) is nonincreasing, and thus convergent. Furthermore, Φ(ŵ^k) ≥ Φ(ŵ^*) for each k, which by Lemma 2 yields

a\sum_{i=1}^{k}\|u^{i+1} - u^i\|^2 \leq \Phi(\hat{w}^1) - \Phi(\hat{w}^{k+1}) \leq \Phi(\hat{w}^1) - \Phi(\hat{w}^*).

This together with (12) implies ∑_{k=1}^{∞} ‖w^k − w^{k+1}‖² < ∞; in particular, ‖w^k − w^{k+1}‖ → 0.
Let w∗ = (x∗ , y ∗ , z ∗ , p∗ ) be any cluster point of {wk } and let {wkj } be a subsequence of {wk } converging
to w∗ . It then follows from (10) that

pkj +1 = pkj + α(Axkj +1 + By kj +1 + Cz kj +1 ),


− ∂f (xkj +1 ) ∋ AT [pkj +1 + αB(y kj − y kj +1 ) + αC(z kj − z kj +1 )] + ∇φ1 (xkj +1 ) − ∇φ1 (xkj ),
− ∂g(y kj +1 ) ∋ B T [pkj +1 + αC(z kj − z kj +1 )] + ∇φ2 (y kj +1 ) − ∇φ2 (y kj ),
− ∇h(z kj +1 ) = C T pkj +1 + ∇φ3 (z kj +1 ) − ∇φ3 (z kj ).

As ∇φi , i = 1, 2, 3 is continuous and kwk − wk+1 k → 0, letting j → ∞ above yields that w∗ is a stationary
point of the augmented Lagrangian function Lα .
Lemma 4. There exists b > 0 such that dist(0, ∂Φ(ŵ^{k+1})) ≤ b‖ŵ^k − ŵ^{k+1}‖ for each k ∈ N.
Proof. By a simple calculation, we have

\begin{aligned}
\partial\Phi_x(\hat{w}^{k+1}) &\ni \alpha A^{\mathrm{T}}B(y^{k+1} - y^k) + \alpha A^{\mathrm{T}}C(z^{k+1} - z^k) + \nabla\phi_1(x^{k+1}) - \nabla\phi_1(x^k) + A^{\mathrm{T}}(p^{k+1} - p^k),\\
\partial\Phi_y(\hat{w}^{k+1}) &\ni \alpha B^{\mathrm{T}}C(z^{k+1} - z^k) + B^{\mathrm{T}}(p^{k+1} - p^k) + \nabla\phi_2(y^{k+1}) - \nabla\phi_2(y^k),\\
\partial\Phi_z(\hat{w}^{k+1}) &= \nabla\phi_3(z^{k+1}) - \nabla\phi_3(z^k) + C^{\mathrm{T}}(p^{k+1} - p^k) + \tau(z^{k+1} - z^k),\\
\partial\Phi_p(\hat{w}^{k+1}) &= \frac{1}{\alpha}(p^{k+1} - p^k), \qquad \partial\Phi_{\hat{z}}(\hat{w}^{k+1}) = \tau(z^k - z^{k+1}).
\end{aligned}
As matrices A, B, C are all bounded, the above together with (12) and (A4) implies that there exists
b > 0 such that the desired inequality follows.
Theorem 1. Each bounded sequence {w^k} generated by procedure (10) converges to a stationary point of L_α. Moreover, ∑_{k=1}^{∞} ‖w^{k+1} − w^k‖_1 < ∞.
Proof. It is easy to see that conditions H1–H2 in Lemma 1 hold. To verify condition H3, we assume that
there exists a subsequence {ŵkj } that converges to ŵ∗ = (x∗ , y ∗ , z ∗ , p∗ , z ∗ ). By the lower semicontinuity
of Φ, lim inf_{j→∞} Φ(ŵ^{k_j}) ≥ Φ(ŵ^*). On the other hand, we have

\begin{aligned}
&f(x^{k_j+1}) + \langle p^{k_j}, Ax^{k_j+1}\rangle + \frac{\alpha}{2}\|Ax^{k_j+1} + By^{k_j} + Cz^{k_j}\|^2 + \triangle_{\phi_1}(x^{k_j+1}, x^{k_j})\\
&\qquad \leq f(x^*) + \langle p^{k_j}, Ax^*\rangle + \frac{\alpha}{2}\|Ax^* + By^{k_j} + Cz^{k_j}\|^2 + \triangle_{\phi_1}(x^*, x^{k_j}).
\end{aligned}
Since {x^k} is asymptotically regular, this implies lim sup_{j→∞} f(x^{k_j+1}) ≤ f(x^*). In a similar way, we conclude that lim sup_{j→∞} g(y^{k_j+1}) ≤ g(y^*). Since

\lim_{j\to\infty} h(z^{k_j+1}) = h(z^*) \quad\text{and}\quad \lim_{j\to\infty}\|z^{k_j+1} - z^{k_j}\| = 0,

we have lim sup_{j→∞} Φ(ŵ^{k_j}) ≤ Φ(ŵ^*). Altogether, lim_{j→∞} Φ(ŵ^{k_j}) = Φ(ŵ^*). Thus, condition H3 holds.
Applying Lemma 1, we conclude that {ŵ^k} converges to ŵ^*, which is a stationary point of Φ. In particular, it is easy to see that {w^k} converges to w^*. By Lemma 3, w^* is a stationary point of L_α. Moreover, {w^k} has finite length, i.e., ∑_{k=1}^{∞} ‖w^{k+1} − w^k‖_1 < ∞.
Remark 1. There are various choices of the Bregman distance in (10). For instance, if we let

\triangle_{\phi_3}(x, y) = \|x - y\|_Q^2, \qquad \text{where } \|x\|_Q^2 := \langle Qx, x\rangle,

with Q a symmetric positive definite matrix, then our first assumption (A1) is satisfied whenever the objective function f + g + h is subanalytic. Indeed, since the function ‖z − ẑ‖_Q² is analytic, Φ is also subanalytic as the sum of a subanalytic function and an analytic function, which in turn implies the K-L property. Typical examples of subanalytic functions are exhibited in the previous section.
We now extend the above result to the N -block case. Thus, let us consider the following composite
optimization problem:
min f1 (x1 ) + f2 (x2 ) + · · · + fN (xN )
(14)
s.t. A1 x1 + A2 x2 + · · · + AN xN = 0,

where Ai ∈ Rm×ni , fi : Rni → R, i = 1, 2, . . . , N − 1 are proper lower semicontinuous functions, and


fN : RnN → R is a continuously differentiable function. The Lagrangian function Lα : Rn1 × Rn2 × · · · ×
RnN × Rm → R of problem (14) is defined by
L_\alpha(x_1, x_2, \dots, x_N, p) = \sum_{i=1}^{N} f_i(x_i) + \Big\langle p, \sum_{i=1}^{N} A_i x_i\Big\rangle + \frac{\alpha}{2}\Big\|\sum_{i=1}^{N} A_i x_i\Big\|^2. \tag{15}

Accordingly, the associated algorithm takes the form:


\begin{cases}
x_1^{k+1} = \arg\min_{x_1\in\mathbb{R}^{n_1}} L_\alpha(x_1, x_2^k, \dots, x_N^k, p^k) + \triangle_{\phi_1}(x_1, x_1^k),\\[2pt]
x_2^{k+1} = \arg\min_{x_2\in\mathbb{R}^{n_2}} L_\alpha(x_1^{k+1}, x_2, \dots, x_N^k, p^k) + \triangle_{\phi_2}(x_2, x_2^k),\\[2pt]
\quad\vdots\\[2pt]
x_N^{k+1} = \arg\min_{x_N\in\mathbb{R}^{n_N}} L_\alpha(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N, p^k) + \triangle_{\phi_N}(x_N, x_N^k),\\[2pt]
p^{k+1} = p^k + \alpha(A_1 x_1^{k+1} + A_2 x_2^{k+1} + \cdots + A_N x_N^{k+1}).
\end{cases} \tag{16}
Following the idea of Theorem 1, it is not hard to extend the results to this case whenever the following conditions are satisfied:
(B1) Ψ has the K-L property;
(B2) there is σ > 0 such that σ‖x‖² ≤ ‖A_N^T x‖², ∀x ∈ R^m;
(B3) f_N is continuously differentiable such that ∇f_N is L-Lipschitz continuous;
(B4) φ_i is ρ_i-strongly convex and ∇φ_i is L_i-Lipschitz continuous for i = 1, 2, . . . , N;
(B5) the parameters are chosen so that αρσ > 6(L² + 2L_N²), where ρ = min{ρ_1, ρ_2, . . . , ρ_N}.
Analogously, we define a function Ψ : R^{n_1} × · · · × R^{n_N} × R^m × R^{n_N} → R by

\Psi(x_1, x_2, \dots, x_N, p, \hat{x}_N) = L_\alpha(x_1, x_2, \dots, x_N, p) + \frac{\tau}{2}\|x_N - \hat{x}_N\|^2,

where τ = 6L_N²(ασ)^{-1}.
Theorem 2. If conditions B1–B5 are satisfied, then each bounded sequence {xk1 , xk2 , . . . , xkN , pk } gen-
erated by procedure (16) converges to a stationary point of Lα defined as in (15).

4 Demonstration examples
Consider the non-convex optimization problem with 3-block variables deduced from matrix decomposition
applications (see [24, 25]):
\min_{L,S,T}\ \|L\|_\circledast + \lambda\|S\|_1 + \frac{\mu}{2}\|T - M\|_F^2 \quad \text{s.t.}\quad T = L + S, \tag{17}

where M is an m × n observation matrix, \|L\|_\circledast := \sum_{i=1}^{\min(m,n)} |\sigma_i(L)|^{1/2}, \|S\|_1 := \sum_{i=1}^{m}\sum_{j=1}^{n} |S_{ij}|, λ is a trade-off parameter between the low-rank term ‖L‖⊛ and the sparse term ‖S‖_1, and µ is a penalty parameter related to the noise level.
The augmented Lagrangian function of problem (17) is given by
L_\alpha(L, S, T, \Lambda) = \|L\|_\circledast + \lambda\|S\|_1 + \frac{\mu}{2}\|T - M\|_F^2 + \langle\Lambda, T - (L + S)\rangle + \frac{\alpha}{2}\|T - (L + S)\|_F^2, \tag{18}
where Λ is the Lagrangian multiplier. According to the 3-block BADMM (10), the optimization prob-
lem (17) can be solved by the following procedure:
\begin{cases}
L^{k+1} = \arg\min_{L} L_\alpha(L, S^k, T^k, \Lambda^k) + \frac{\rho}{2}\|L - L^k\|_F^2,\\[2pt]
S^{k+1} = \arg\min_{S} L_\alpha(L^{k+1}, S, T^k, \Lambda^k) + \frac{\rho}{2}\|S - S^k\|_F^2,\\[2pt]
T^{k+1} = \arg\min_{T} L_\alpha(L^{k+1}, S^{k+1}, T, \Lambda^k) + \frac{\rho}{2}\|T - T^k\|_F^2,\\[2pt]
\Lambda^{k+1} = \Lambda^k + \alpha\big(T^{k+1} - (L^{k+1} + S^{k+1})\big).
\end{cases} \tag{19}


Simplifying the procedure (19), we then obtain the closed-form iterative formulas:
\begin{cases}
L^{k+1} = H\!\left(\dfrac{\alpha(T^k - S^k + \Lambda^k/\alpha) + \rho L^k}{\alpha + \rho},\ \dfrac{1}{\alpha + \rho}\right),\\[8pt]
S^{k+1} = S\!\left(\dfrac{\alpha(T^k - L^{k+1} + \Lambda^k/\alpha) + \rho S^k}{\alpha + \rho},\ \dfrac{\lambda}{\alpha + \rho}\right),\\[8pt]
T^{k+1} = \dfrac{\mu M + \alpha(L^{k+1} + S^{k+1} - \Lambda^k/\alpha) + \rho T^k}{\mu + \alpha + \rho},\\[8pt]
\Lambda^{k+1} = \Lambda^k + \alpha\big(T^{k+1} - (L^{k+1} + S^{k+1})\big),
\end{cases} \tag{20}

where H(A, ·) denotes the half shrinkage operator [23, 26] applied to the singular values of A, and S(A, ·) denotes the well-known soft shrinkage operator applied to the entries of A. The procedure (20) is the specialization of BADMM (10) to problem (17) with the functions f, g, h defined by f(L) = ‖L‖⊛, g(S) = λ‖S‖_1, h(T) = (µ/2)‖T − M‖_F², and the matrices A, B, C defined by A = −I, B = −I, C = I, where I is the identity matrix (so that Ax + By + Cz = 0 is exactly the constraint T − (L + S) = 0 used in (18)). It is straightforward to verify that all the assumptions of Theorem 1 are satisfied. Consequently, Theorem 1 can be applied to predict the convergence of (20) in theory. We conduct a simulation study and an application example below to support this theoretical assertion.
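For reference, the following MATLAB sketch of iteration (20) is ours and not from the paper. The soft shrinkage S is implemented exactly; the half shrinkage H is only outlined structurally by applying a scalar shrinkage rule to the singular values, and to keep the sketch self-contained and runnable we substitute soft thresholding of the singular values (i.e., a nuclear-norm proximal step) as a stand-in for the exact half-thresholding formula of [23, 26].

    function [L, S, T] = badmm_lst(M, lambda, mu, alpha, rho, maxit, tol)
    % Sketch of iteration (20): decompose the observation M into T = L + S.
    % The half shrinkage on singular values is replaced by soft shrinkage here purely
    % to keep the sketch self-contained; the exact half-thresholding rule is in [23, 26].
    [m, n] = size(M);
    L = zeros(m, n); S = zeros(m, n); T = zeros(m, n); Lam = zeros(m, n);
    soft = @(X, t) sign(X) .* max(abs(X) - t, 0);
    for k = 1:maxit
        Lold = L; Sold = S; Told = T;
        % L-step: shrinkage of the singular values of the averaged point.
        P = (alpha * (T - S + Lam / alpha) + rho * L) / (alpha + rho);
        [U, Sg, V] = svd(P, 'econ');
        L = U * diag(soft(diag(Sg), 1 / (alpha + rho))) * V';   % stand-in for H(P, 1/(alpha+rho))
        % S-step: entrywise soft shrinkage.
        Q = (alpha * (T - L + Lam / alpha) + rho * S) / (alpha + rho);
        S = soft(Q, lambda / (alpha + rho));
        % T-step: closed-form quadratic update.
        T = (mu * M + alpha * (L + S - Lam / alpha) + rho * Told) / (mu + alpha + rho);
        % Multiplier update.
        Lam = Lam + alpha * (T - (L + S));
        % Stop when the relative change of (L, S, T) is small (cf. RelChg in Section 4.1).
        relchg = norm([L - Lold, S - Sold, T - Told], 'fro') / (norm([Lold, Sold, Told], 'fro') + 1);
        if relchg <= tol, break; end
    end
    end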

4.1 Simulation study

Let M = L∗ + S ∗ + N be an observation matrix, where L∗ and S ∗ are, respectively, the original low-rank
and sparse matrices that we wish to recover by the problem (17), and N is the Gaussian noise matrix.
In the following, r and spr represent, respectively, matrix rank and sparsity ratio. The MATLAB script
for generating matrix M is as follows:
• L = randn(m, r) * randn(r, n);
• S = zeros(m, n); q = randperm(m * n); K = round(spr * m * n); S(q(1 : K)) = randn(K, 1);
• sigma = 0;   % noiseless case; set sigma = 0.01 for the Gaussian noise case
• N = randn(m, n) * sigma; T = L + S; M = T + N;
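As a usage illustration, the generated observation M can be fed directly to the badmm_lst sketch given after (20) above (a hypothetical helper of ours, not part of the paper); the values of λ and µ below are illustrative choices only, while α and ρ follow the setting described next.

    % Assumes the generation script above has been run, so that m, n and M exist.
    lambda = 0.1 / sqrt(max(m, n));      % illustrative choice, cf. Section 4.2
    mu = 10;                             % illustrative noise-penalty value
    alpha = 0.3; rho = alpha;            % as in the text below
    [Lhat, Shat, That] = badmm_lst(M, lambda, mu, alpha, rho, 2000, 1e-8);
    fprintf('rank(Lhat) = %d, nnz(Shat) = %d\n', rank(Lhat), nnz(Shat));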
Specifically, we set m = n = 100, and tested

(r, spr) = (1, 0.05), (5, 0.05), (10, 0.05), (20, 0.05), (1, 0.1), (5, 0.1), (10, 0.1), and (20, 0.1),

for which the decomposition problem roughly changes from easy to hard. Regarding the implementation issues, we empirically set the parameters α = 0.3 and ρ = α in (20). The matrices L, S, and T in the procedure (20) are initialized with zero matrices. We terminated the procedure (20) when the relative change falls below 10^{-8}, i.e.,

\mathrm{RelChg} := \frac{\|(L^{k+1}, S^{k+1}, T^{k+1}) - (L^k, S^k, T^k)\|_F}{\|(L^k, S^k, T^k)\|_F + 1} \leq 10^{-8},

where ‖·‖_F indicates the Frobenius norm. Let L̂, Ŝ, and T̂ be a numerical solution of problem (17) obtained by the proposed BADMM. We measure the quality of recovery by the relative error to (L*, S*, T*), which is defined by

\mathrm{RelErr} := \frac{\|(\hat{L}, \hat{S}, \hat{T}) - (L^*, S^*, T^*)\|_F}{\|(L^*, S^*, T^*)\|_F + 1}.
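In MATLAB these stacked Frobenius norms can be computed directly; a small sketch of ours, assuming the true factors are stored as Lstar, Sstar, Tstar and the recovered ones as Lhat, Shat, That (hypothetical variable names):

    % Relative error of the recovered triple against the ground truth.
    RelErr = norm([Lhat - Lstar, Shat - Sstar, That - Tstar], 'fro') / ...
             (norm([Lstar, Sstar, Tstar], 'fro') + 1);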

In Table 1, we report the recovery results for the noiseless and Gaussian noise cases. From this table, it can be seen that when the true sparsity ratio spr of S increases or when noise is introduced, the relative error RelErr becomes larger, which indicates that the recovery performance declines as the decomposition problem changes from easy to hard. In addition, for the noiseless case, the proposed BADMM can exactly recover the rank of L and the sparsity number of S. However, for the Gaussian noise case, since the noise imposes an additional impact on the recovery, the sparsity number of S cannot be exactly recovered.

Table 1  The matrix decomposition results on simulated matrices with the size 100 × 100

Noiseless case (σ = 0)
(r, spr)     RelErr        Rank(L*)   Rank(L̂)   ‖S*‖0   ‖Ŝ‖0
(1, 0.05)    4.8674E−06    1          1          500     500
(1, 0.1)     5.0446E−06    1          1          1000    1000
(5, 0.05)    2.2342E−06    5          5          500     500
(5, 0.1)     2.4366E−06    5          5          1000    1000
(10, 0.05)   1.5039E−06    10         10         500     500
(10, 0.1)    1.8572E−06    10         10         1000    1000
(20, 0.05)   1.2889E−06    20         20         500     500
(20, 0.1)    1.6974E−06    20         20         1000    1000

Gaussian noise (σ = 0.01)
(r, spr)     RelErr        Rank(L*)   Rank(L̂)   ‖S*‖0   ‖Ŝ‖0
(1, 0.05)    0.0049        1          1          500     1723
(1, 0.1)     0.0060        1          1          1000    3797
(5, 0.05)    0.0025        5          5          500     1541
(5, 0.1)     0.0033        5          5          1000    3551
(10, 0.05)   0.0022        10         10         500     1318
(10, 0.1)    0.0024        10         10         1000    3183
(20, 0.05)   0.0020        20         20         500     1110
(20, 0.1)    0.0024        20         20         1000    3612

[Figure 1: two panels, "Noiseless (σ = 0)" and "Gaussian noise (σ = 0.01)", plotting log10 of RelChg and RelErr against the number of iterations.]
Figure 1  (Color online) Convergence results for (a) the noiseless case and (b) Gaussian noise with the standard deviation σ = 0.01.

In Figure 1, we further present the convergence results for the (r = 10, spr = 0.05) case with no noise and with Gaussian noise. From this figure, it can be observed that by the time the relative change RelChg falls below 10^{-8}, the relative error RelErr has settled at a stable value, which indicates that the proposed BADMM is convergent.

4.2 An application example

We further applied the model (17) with BADMM (20) to the background subtraction application. Background subtraction is a fundamental task in video surveillance. Its aim is to subtract the background from a video clip and meanwhile detect the anomalies (i.e., moving objects). From the webpage1), we downloaded four video clips: Lobby, Bootstrap, Hall, and ShoppingMall. Then we chose 600 frames from each video clip and input these 600 frames into our algorithm. The parameter λ was fixed at the value 0.1/\sqrt{\max(m, n)}. In Figure 2, we exhibit the separation results of some frames in the four video clips.

1) http://perception.i2r.a-star.edu.sg/bk_model/bk_index.




Figure 2  Background subtraction results in the real-world video clips. (a) Lobby; (b) Bootstrap; (c) Hall; (d) ShoppingMall.

From Figure 2, it can be seen that our algorithm can produce a clean video background and meanwhile detect a satisfactory video foreground, which supports the validity and convergence of the proposed BADMM.

Acknowledgements This work was supported by National Natural Science Foundation of China (Grant No.
61603235), and Program for Science and Technology Innovation Talents in Universities of Henan Province (Grant
No. 15HASTIT013). We thank all anonymous reviewers for their thoughtful and constructive comments, which greatly improved the analysis and writing of the manuscript.

References
1 Boyd S, Parikh N, Chu E, et al. Distributed optimization and statistical learning via the alternating direction method
of multipliers. Found Trends Mach Learn, 2011, 3: 1–122
2 Wang H, Banerjee A. Bregman alternating direction method of multipliers. In: Proceedings of Advances in Neural
Information Processing Systems (NIPS), Montréal, 2014. 2816–2824
3 Gabay D, Mercier B. A dual algorithm for the solution of nonlinear variational problems via finite element approxi-
mation. Comput Math Appl, 1976, 2: 17–40
4 Alcouffe A, Enjalbert M, Muratet G. Méthodes de résolution du probléme de transport et de production d’une entreprise
á établissements multiples en présence de coûts fixes. RAIRO Recherche opérationnelle, 1975, 9: 41–55
5 He B, Yuan X. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM J Numer
Anal, 2012, 50: 700–709
6 Goldstein T, O’Donoghue B, Setzer S, et al. Fast alternating direction optimization methods. SIAM J Imag Sci, 2014,
7: 1588–1623
7 Xu Y, Yin W, Wen Z, et al. An alternating direction algorithm for matrix completion with nonnegative factors. Front
Math China, 2012, 7: 365–384
8 Bolte J, Sabach S, Teboulle M. Proximal alternating linearized minimization for nonconvex and nonsmooth problems.
Math Program, 2014, 146: 459–494
9 Xu Y, Yin W. A block coordinate descent method for regularized multiconvex optimization with applications to
nonnegative tensor factorization and completion. SIAM J Imag Sci, 2013, 6: 1758–1789
10 Hong M, Luo Z Q, Razaviyayn M. Convergence analysis of alternating direction method of multipliers for a family of
nonconvex problems. SIAM J Optim, 2016, 26: 337–364
11 Li G, Pong T K. Global convergence of splitting methods for nonconvex composite optimization. SIAM J Optim, 2015,
25: 2434–2460
12 Wang F, Xu Z, Xu H. Convergence of Bregman alternating direction method with multipliers for nonconvex composite problems. ArXiv:1410.8625, 2014


13 Chen C, He B, Ye Y, et al. The direct extension of ADMM for multi-block convex minimization problems is not
necessarily convergent. Math Program, 2016, 155: 57–79
14 Han D, Yuan X. A note on the alternating direction method of multipliers. J Optim Theor Appl, 2012, 155: 227–238
15 Cai X, Han D, Yuan X. On the convergence of the direct extension of ADMM for three-block separable convex
minimization models with one strongly convex function. Comput Optim Appl, 2017, 66: 39–73
16 Li M, Sun D, Toh K C. A convergent 3-block semi-proximal ADMM for convex minimization problems with one
strongly convex block. Asia Pac J Oper Res, 2015, 32: 1550024
17 Mordukhovich B. Variational Analysis And Generalized Differentiation I: Basic Theory. Berlin: Springer, 2006. 30–35
18 Lojasiewicz S. Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles,
1963, 117: 87–89
19 Kurdyka K. On gradients of functions definable in o-minimal structures. Ann de l’institut Fourier, 1998, 48: 769–783
20 Attouch H, Bolte J, Svaiter B F. Convergence of descent methods for semi-algebraic and tame problems: proximal
algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math Program, 2013, 137: 91–129
21 Si S, Tao D, Geng B. Bregman divergence-based regularization for transfer subspace learning. IEEE Trans Knowl
Data Eng, 2010, 22: 929–942
22 Wu L, Hoi S C H, Jin R, et al. Learning Bregman distance functions for semi-supervised clustering. IEEE Trans
Knowl Data Eng, 2012, 24: 478–491
23 Xu Z B, Chang X Y, Xu F M, et al. L1/2 regularization: a thresholding representation theory and a fast solver. IEEE
Trans Neural Netw Learning Syst, 2012, 23: 1013–1027
24 Behmardi B, Raich R. On provable exact low-rank recovery in topic models. In: Proceedings of IEEE Statistical Signal
Processing Workshop (SSP), Nice, 2011. 265–268
25 Xu H, Caramanis C, Mannor S. Outlier-robust PCA: the high-dimensional case. IEEE Trans Inform Theor, 2013, 59:
546–572
26 Zeng J, Xu Z, Zhang B, et al. Accelerated regularization based SAR imaging via BCR and reduced Newton skills.
Signal Process, 2013, 93: 1831–1844
