
Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods

Hedy Attouch, Jérôme Bolte, Benar Svaiter

To cite this version:

Hedy Attouch, Jérôme Bolte, Benar Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, Series A, Springer, 2011, 137 (1), pp. 91-124. DOI: 10.1007/s10107-011-0484-9. hal-00790042.

HAL Id: hal-00790042
https://hal.archives-ouvertes.fr/hal-00790042
Submitted on 19 Feb 2013

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Convergence of descent methods for semi-algebraic
and tame problems: proximal algorithms,
forward-backward splitting, and regularized
Gauss-Seidel methods
Hedy ATTOUCH∗ Jérôme BOLTE† Benar Fux SVAITER ‡
∗ I3M UMR CNRS 5149, Université Montpellier II, Place Eugène Bataillon, 34095 Montpellier, France ([email protected]). Partially supported by ANR-08-BLAN-0294-03.
† TSE (GREMAQ, Université Toulouse I), Manufacture des Tabacs, 21 allée de Brienne, Toulouse, France ([email protected]). Partially supported by ANR-08-BLAN-0294-03.
‡ IMPA, Estrada Dona Castorina 110, 22460-320 Rio de Janeiro, Brazil ([email protected]). Partially supported by CNPq grants 474944/2010-7, 303583/2008-8, FAPERJ grant E-26/102.821/2008 and PRONEX-Optimization.

December 15, 2010; revised July 21, 2011

Abstract. In view of the minimization of a nonsmooth nonconvex function f, we prove an abstract convergence result for descent methods satisfying a sufficient-decrease assumption and allowing a relative error tolerance. Our result guarantees the convergence of bounded sequences under the assumption that the function f satisfies the Kurdyka-Lojasiewicz inequality. This assumption allows us to cover a wide range of problems, including nonsmooth semi-algebraic (or, more generally, tame) minimization. The specialization of our result to different kinds of structured problems provides several new convergence results for inexact versions of the gradient method, the proximal method, the forward-backward splitting algorithm, the gradient projection method, and some proximal regularizations of the Gauss-Seidel method in a nonconvex setting. Our results are illustrated through feasibility problems and iterative thresholding procedures for compressive sensing.

2010 Mathematics Subject Classification: 34G25, 47J25, 47J30, 47J35, 49M15, 49M37,
65K15, 90C25, 90C53.

Keywords: Nonconvex nonsmooth optimization, semi-algebraic optimization, tame optimization, Kurdyka-Lojasiewicz inequality, descent methods, relative error, sufficient decrease, forward-backward splitting, alternating minimization, proximal algorithms, iterative thresholding, block-coordinate methods, o-minimal structures.

1 Introduction

Given a proper lower semicontinuous function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, we consider descent methods that generate sequences $(x^k)_{k \in \mathbb{N}}$ complying with the following conditions:

– for each $k \in \mathbb{N}$, $f(x^{k+1}) + a\|x^{k+1} - x^k\|^2 \le f(x^k)$;
– for each $k \in \mathbb{N}$, there exists $w^{k+1} \in \partial f(x^{k+1})$ such that

$\|w^{k+1}\| \le b\|x^{k+1} - x^k\|;$

where $a, b$ are positive constants and $\partial f(x^{k+1})$ denotes the set of limiting subgradients of $f$ at $x^{k+1}$ (see Section 2.1 for a definition). The first condition is intended to model a descent property: since it involves a measure of the quality of the descent, we call it a sufficient-decrease condition (see [20] for an early paper on this subject, [7] for an interpretation of this condition in decision sciences, and [43] for a discussion of this type of condition in a particular nonconvex nonsmooth optimization setting). The second condition originates from the well-known fact that most algorithms in optimization are generated by an infinite sequence of subproblems which involve exact or inexact minimization processes. This is the case for gradient methods, Newton's method, the forward-backward algorithm, the Gauss-Seidel method, proximal methods, and so on. The second set of conditions precisely reflects relative inexact optimality conditions for such minimization subproblems.
When dealing with descent methods for convex functions, it is natural to expect that the algorithm provides globally convergent sequences (i.e., for an arbitrary starting point, the algorithm generates a sequence that converges to a solution). The standard recipe for obtaining convergence is to prove that the sequence is (quasi-)Fejér monotone relative to the set of minimizers of f. This fact has also been used intensively in the study of algorithms for nonexpansive mappings (see e.g. [24]). When the functions under consideration are not convex (or quasiconvex), the monotonicity properties are in general "broken", and descent methods may provide sequences that exhibit highly oscillatory behaviors. Apparently this phenomenon was first observed by Curry (see [27]). Similar behaviors occur in the framework of differential equations: a nonconverging bounded curve of a 2-dimensional gradient system of a $C^\infty$ function is provided in [28], and this example was adapted to gradient methods in [1].
In order to circumvent such behaviors, it seems necessary to work with functions that
present a certain structure. This structure can be of an algebraic nature, e.g. quadratic func-
tions, polynomial functions, real analytic functions, but it can also be captured by adequate
analytic assumptions, e.g. metric regularity [2, 41, 42], cohypomonotonicity [51, 36], self-
concordance [49], partial smoothness [40, 59]. In this paper, our central assumption for the
study of such algorithms is that the function f satisfies the (nonsmooth) Kurdyka-Lojasiewicz
inequality, which means, roughly speaking, that the functions under consideration are sharp
up to a reparametrization (see Section 2.2). The reader is referred to [44, 45, 38] for the
smooth cases, and to [15, 17] for nonsmooth inequalities. Kurdyka-Lojasiewicz inequalities
have been successfully used to analyze various types of asymptotic behavior: gradient-like sys-
tems [15, 34, 35, 39], PDE [55, 22], gradient methods [1, 48], proximal methods [3], projection
methods or alternating methods [5, 14].
In the context of optimization, the importance of the Kurdyka-Lojasiewicz inequality is due to the fact that many problems involve functions satisfying such inequalities, and it is often elementary to check that such an inequality is satisfied; real semi-algebraic functions provide a very rich class of functions satisfying the Kurdyka-Lojasiewicz inequality (see [5] for a thorough discussion of these aspects, and also Section 2.2 for a simple illustration).
Many other functions met in real-world problems, although not semi-algebraic, often satisfy the Kurdyka-Lojasiewicz inequality. An important class is given by functions definable in an o-minimal structure. The monographs [26, 30] are good references on o-minimal structures; concerning Kurdyka-Lojasiewicz inequalities in this context, the reader is referred to [38, 17]. Functions definable in o-minimal structures, or functions whose graphs are locally definable, are often called tame functions. We do not give a precise definition of definability in this work, but the flexibility of this concept is briefly illustrated in Example 5.4(b). Functions that are not necessarily tame but that satisfy the Lojasiewicz inequality are given in [5]; the basic assumptions involve metric regularity and transversality (see also [41, 42] and Example 5.5).
From a technical viewpoint, our work blends the approach to nonconvex problems provided in [1, 15, 3, 5] with the relative-error philosophy developed in [56, 57, 58, 36]. A valuable guideline for the error aspects is the development of an inexact proximal algorithm for equations governed by a monotone operator which is based on an estimation of the relative error; see [56, 57, 58]. Related results without monotonicity (with a control on the lack of monotonicity) have been obtained in [36].
Thus, in summary, this article aims at:
– providing a unified framework for the analysis of classical descent methods,
– relaxing exact descent conditions,
– extending convergence results obtained in [1, 3, 5, 56, 57, 58, 36] to richer and more
flexible algorithms,
– providing theorems which cover general nonsmooth problems under easily verifiable
assumptions (e.g. semi-algebraicity).

Let us proceed with a more precise description of the contents of this article.
In Section 2, we consider functions satisfying the Kurdyka-Lojasiewicz inequality. We
first give the definition and a brief analysis of this basic property. Then in subsection 2.3,
we provide an abstract convergence result for sequences satisfying the sufficient-decrease
condition and the relative inexact optimality condition mentioned above.
This result is then applied to the analysis of several descent methods with relative error
tolerance.
We recover and improve previous work on gradient methods (Section 3) and proximal algorithms (Section 4). Our results are illustrated through semi-algebraic feasibility problems by means of an inexact version of the averaged projection method.
We also provide, in Section 5, an in-depth analysis of forward-backward splitting algorithms in a nonsmooth nonconvex setting. Setting aside the convex case, we did not find any general convergence results for this kind of algorithm, so the results we present here seem to be new. These results can be applied to general semi-algebraic (or tame) problems and to nonconvex problems presenting a well-conditioned structure. An important and enlightening consequence of our study is that the bounded sequences $(x^k)_{k \in \mathbb{N}}$ generated by the nonconvex gradient projection algorithm

$x^{k+1} \in P_C\left(x^k - \frac{1}{2L}\nabla h(x^k)\right)$

are convergent so long as $C$ is a closed semi-algebraic subset of $\mathbb{R}^n$ and $h : \mathbb{R}^n \to \mathbb{R}$ is $C^1$ semi-algebraic with $L$-Lipschitz gradient (see [9] for some applications in signal processing).
As an application of our general results on forward-backward splitting, we consider the following type of problem:

$(P)\qquad \min\left\{\lambda\|x\|_0 + \tfrac{1}{2}\|Ax - b\|^2 : x \in \mathbb{R}^n\right\}$

where $\lambda > 0$, $\|\cdot\|_0$ is the counting norm (or $\ell^0$ norm), $A$ is an $m \times n$ real matrix, and $b \in \mathbb{R}^m$. We recall that for $x$ in $\mathbb{R}^n$, $\|x\|_0$ is the number of nonzero components of $x$. This kind of problem is central in compressive sensing [29]. In [11, 12] this problem is tackled by using a "hard iterative thresholding" algorithm

$x^{k+1} \in \operatorname{prox}_{\gamma_k \lambda\|\cdot\|_0}\left(x^k - \gamma_k(A^T A x^k - A^T b)\right),$

where $(\gamma_k)_{k \in \mathbb{N}}$ is a sequence of stepsizes evolving in a convenient interval (the definition of the proximal mapping $\operatorname{prox}_{\lambda f}$ is given in Section 4). The convergence results the authors obtained involve different assumptions on the linear operator $A$: they either assume that $\|A\| < 1$ [11, Theorem 3] or that $A$ satisfies the restricted isometry property [12, Theorem 4]. Our results show that convergence actually occurs for any linear map so long as the sequence $(x^k)_{k \in \mathbb{N}}$ is bounded. We also consider iterative thresholding with $\ell^p$ "norms" for sparse approximation (in the spirit of [21]) and hard-constrained feasibility problems; in both cases convergence of the bounded sequences is established.
In the last section, we study the proximal regularization of a $p$-block alternating method (with $p \ge 2$). This method was introduced by Auslender [8] for convex minimization; see also [32] in a nonconvex setting. Convergence results for such methods are usually stated in terms of cluster points. To our knowledge, the first convergence result in a nonconvex setting, under fairly general assumptions, was obtained in [5] for a two-block exact version. Our generalization is twofold: we consider methods involving an arbitrary number of blocks, and we provide a proper convergence result.

2 An abstract convergence result for inexact descent methods


The Euclidean scalar product of $\mathbb{R}^n$ and its corresponding norm are respectively denoted by $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$.

2.1 Some definitions from variational analysis


Standard references are [23, 54, 47].
If $F : \mathbb{R}^n \rightrightarrows \mathbb{R}^m$ is a point-to-set mapping, its graph is defined by

$\operatorname{Graph} F := \{(x, y) \in \mathbb{R}^n \times \mathbb{R}^m : y \in F(x)\},$

while its domain is given by $\operatorname{dom} F := \{x \in \mathbb{R}^n : F(x) \neq \emptyset\}$. Similarly, the graph of a real-extended-valued function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is defined by

$\operatorname{Graph} f := \{(x, s) \in \mathbb{R}^n \times \mathbb{R} : s = f(x)\},$

and its domain by $\operatorname{dom} f := \{x \in \mathbb{R}^n : f(x) < +\infty\}$. The epigraph of $f$ is defined as usual as

$\operatorname{epi} f := \{(x, \lambda) \in \mathbb{R}^n \times \mathbb{R} : f(x) \le \lambda\}.$
When $f$ is a proper function, i.e. when $\operatorname{dom} f \neq \emptyset$, the set of its global minimizers, possibly empty, is denoted by

$\operatorname{argmin} f := \{x \in \mathbb{R}^n : f(x) = \inf f\}.$
The notion of subdifferential plays a central role in the following theoretical and algorithmic developments.
For each $x \in \operatorname{dom} f$, the Fréchet subdifferential of $f$ at $x$, written $\hat{\partial} f(x)$, is the set of vectors $v \in \mathbb{R}^n$ which satisfy

$\liminf_{y \to x,\ y \neq x} \frac{1}{\|x - y\|}\left[f(y) - f(x) - \langle v, y - x\rangle\right] \ge 0.$

When $x \notin \operatorname{dom} f$, we set $\hat{\partial} f(x) = \emptyset$.
The limiting processes used in an algorithmic context necessitate the introduction of the more stable notion of limiting subdifferential ([47]) (or simply subdifferential) of $f$. The subdifferential of $f$ at $x \in \operatorname{dom} f$, written $\partial f(x)$, is defined as follows:

$\partial f(x) := \{v \in \mathbb{R}^n : \exists\, x^k \to x,\ f(x^k) \to f(x),\ v^k \in \hat{\partial} f(x^k) \to v\}.$

It is straightforward to check from the definition the following closedness property of $\partial f$:
Let $(x^k, v^k)_{k \in \mathbb{N}}$ be a sequence in $\mathbb{R}^n \times \mathbb{R}^n$ such that $(x^k, v^k) \in \operatorname{Graph} \partial f$ for all $k \in \mathbb{N}$. If $(x^k, v^k)$ converges to $(x, v)$ and $f(x^k)$ converges to $f(x)$, then $(x, v) \in \operatorname{Graph} \partial f$.
These generalized notions of differentiation give birth to generalized notions of critical point. A necessary (but not sufficient) condition for $x \in \mathbb{R}^n$ to be a minimizer of $f$ is

$\partial f(x) \ni 0. \qquad (1)$

A point that satisfies (1) is called limiting-critical, or simply critical.


We end this section with a few words on an important class of functions which are intimately linked to projection mappings: the indicator functions. Recall that if $C$ is a closed subset of $\mathbb{R}^n$, its indicator function $i_C$ is defined, for $x$ ranging over $\mathbb{R}^n$, by

$i_C(x) = \begin{cases} 0 & \text{if } x \in C, \\ +\infty & \text{otherwise.} \end{cases}$

Given $x$ in $C$, the limiting subdifferential of $i_C$ at $x$ is called the normal cone to $C$ at $x$; it is denoted by $N_C(x)$ (for $x \notin C$ we set $N_C(x) = \emptyset$).
The projection onto $C$, written $P_C$, is the point-to-set mapping

$P_C : \mathbb{R}^n \rightrightarrows \mathbb{R}^n, \qquad x \mapsto P_C(x) := \operatorname{argmin}\{\|x - z\| : z \in C\}.$

When $C$ is nonempty, the closedness of $C$ and the compactness of the closed unit ball imply that $P_C(x)$ is nonempty for all $x$ in $\mathbb{R}^n$.
2.2 Kurdyka-Lojasiewicz inequality: the nonsmooth case
We begin this section with a brief discussion of real semi-algebraic sets and functions, which provide a very rich class of functions satisfying the Kurdyka-Lojasiewicz inequality.

Definition 2.1. (a) A subset $S$ of $\mathbb{R}^n$ is a real semi-algebraic set if there exists a finite number of real polynomial functions $P_{ij}, Q_{ij} : \mathbb{R}^n \to \mathbb{R}$ such that

$S = \bigcup_{j=1}^{p} \bigcap_{i=1}^{q} \{x \in \mathbb{R}^n : P_{ij}(x) = 0,\ Q_{ij}(x) < 0\}.$

(b) A function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ (resp. a point-to-set mapping $F : \mathbb{R}^n \rightrightarrows \mathbb{R}^m$) is called semi-algebraic if its graph $\{(x, \lambda) \in \mathbb{R}^{n+1} : f(x) = \lambda\}$ (resp. $\{(x, y) \in \mathbb{R}^{n+m} : y \in F(x)\}$) is a semi-algebraic subset of $\mathbb{R}^{n+1}$ (resp. $\mathbb{R}^{n+m}$).

One easily sees that the class of semi-algebraic sets is stable under finite union, finite intersection, Cartesian product, and complementation, and that polynomial functions are, of course, semi-algebraic.
The high flexibility of the concept of semi-algebraic sets is captured by the following fundamental theorem, known as the Tarski-Seidenberg principle.

Theorem 2.2 (Tarski-Seidenberg). Let $A$ be a semi-algebraic subset of $\mathbb{R}^{n+1}$; then its canonical projection on $\mathbb{R}^n$, namely

$\{(x_1, \ldots, x_n) \in \mathbb{R}^n : \exists z \in \mathbb{R},\ (x_1, \ldots, x_n, z) \in A\},$

is a semi-algebraic subset of $\mathbb{R}^n$.
Let us illustrate the power of this theorem by proving that max functions associated with polynomial functions are semi-algebraic. Let $S$ be a nonempty semi-algebraic subset of $\mathbb{R}^m$ and $g : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ a real polynomial function. Set $f(x) = \sup\{g(x, y) : y \in S\}$ (note that $f$ can assume infinite values). Let us prove that $f$ is semi-algebraic.
Using the definition and the stability with respect to finite intersection, we see that the set

$\{(x, \lambda, y) \in \mathbb{R}^n \times \mathbb{R} \times S : g(x, y) > \lambda\} = \{(x, \lambda, y) \in \mathbb{R}^n \times \mathbb{R} \times \mathbb{R}^m : g(x, y) > \lambda\} \cap (\mathbb{R}^n \times \mathbb{R} \times S)$

is semi-algebraic. For $(x, \lambda, y)$ in $\mathbb{R}^n \times \mathbb{R} \times \mathbb{R}^m$, define the projection $\Pi(x, \lambda, y) = (x, \lambda)$ and use $\Pi$ to project the above set onto $\mathbb{R}^n \times \mathbb{R}$. One obtains the semi-algebraic set

$\{(x, \lambda) \in \mathbb{R}^n \times \mathbb{R} : \exists y \in S,\ g(x, y) > \lambda\}.$

The complement of this set is

$\{(x, \lambda) \in \mathbb{R}^n \times \mathbb{R} : \forall y \in S,\ g(x, y) \le \lambda\} = \operatorname{epi} f.$

Hence $\operatorname{epi} f$ is semi-algebraic. Similarly $\operatorname{hypo} f := \{(x, \mu) : f(x) \ge \mu\}$ is semi-algebraic, hence $\operatorname{Graph} f = \operatorname{epi} f \cap \operatorname{hypo} f$ is semi-algebraic. Of course, this result also holds when sup is replaced by inf.
As a byproduct of these stability results, we recover the following standard lemma, which will be useful for further developments.
Lemma 2.3. Let $S$ be a nonempty semi-algebraic subset of $\mathbb{R}^m$; then the function

$\mathbb{R}^m \ni x \mapsto \operatorname{dist}(x, S)^2$

is semi-algebraic.

Proof. It suffices to consider the polynomial function $g(x, y) = \|x - y\|^2$ for $x, y$ in $\mathbb{R}^m$ and to use the definition of the distance function.

The facts that the composition of semi-algebraic mappings gives a semi-algebraic mapping
or that the image (resp. the preimage) of a semi-algebraic set by a semi-algebraic mapping
is a semi-algebraic set are also consequences of the Tarski-Seidenberg principle. The reader
is referred to [10, 13] for those and many other consequences of this principle.

As already mentioned in the introduction, a prominent feature of semi-algebraic functions is that they locally admit a sharp reparametrization, leading to what we call here the Kurdyka-Lojasiewicz inequality. The most fundamental works on this subject are of course due to Lojasiewicz [44] (1963) and Kurdyka [38] (1998).
We proceed now to a formal definition of this inequality.
Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper lower semicontinuous function. For $\eta_1, \eta_2$ such that $-\infty < \eta_1 < \eta_2 \le +\infty$, we set

$[\eta_1 < f < \eta_2] = \{x \in \mathbb{R}^n : \eta_1 < f(x) < \eta_2\}.$

The following definition is taken from [5] (see also [18]).

Definition 2.4 (Kurdyka-Lojasiewicz property). (a) The function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is said to have the Kurdyka-Lojasiewicz property at $x^* \in \operatorname{dom} \partial f$ if there exist $\eta \in (0, +\infty]$, a neighborhood $U$ of $x^*$, and a continuous concave function $\varphi : [0, \eta) \to \mathbb{R}_+$ such that:
(i) $\varphi(0) = 0$,
(ii) $\varphi$ is $C^1$ on $(0, \eta)$,
(iii) for all $s \in (0, \eta)$, $\varphi'(s) > 0$,
(iv) for all $x$ in $U \cap [f(x^*) < f < f(x^*) + \eta]$, the Kurdyka-Lojasiewicz inequality holds:

$\varphi'(f(x) - f(x^*))\,\operatorname{dist}(0, \partial f(x)) \ge 1. \qquad (2)$

(b) Proper lower semicontinuous functions which satisfy the Kurdyka-Lojasiewicz inequality at each point of $\operatorname{dom} \partial f$ are called KL functions.

Remark 2.5. (a) One can easily check that the Kurdyka-Lojasiewicz property is automatically satisfied at any noncritical point $x^* \in \operatorname{dom} \partial f$; see for example Lemma 2.1 and Remark 3.2 (b) of [5].
(b) When $f$ is smooth, finite-valued, and $f(x^*) = 0$, inequality (2) can be rewritten as

$\|\nabla(\varphi \circ f)(x)\| \ge 1$

for each convenient $x$ in $\mathbb{R}^n$. This inequality may be interpreted as follows: up to a reparametrization of the values of $f$ via $\varphi$, we face a sharp function. Since the function $\varphi$ is used here to turn a singular region (a region in which the gradients are arbitrarily small) into a regular region, i.e. a place where the gradients are bounded away from zero, it is called a desingularizing function for $f$. For theoretical and geometrical developments concerning this inequality, see [18].
(c) The concavity assumption imposed on the function $\varphi$ does not explicitly belong to the usual formulation of the Kurdyka-Lojasiewicz inequality. However, this inequality holds in many instances with a concave function $\varphi$; see [5] for illuminating examples.
(d) It is important to observe that the KL inequality implies that the critical points lying in $U \cap [f(x^*) < f < f(x^*) + \eta]$ have the same critical value $f(x^*)$.
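As a simple one-dimensional sanity check (a standard computation, given here for intuition): for $f(x) = |x|^q$ with $q > 1$ and $x^* = 0$, the concave function $\varphi(s) = s^{1/q}$ is desingularizing, since for every $x \neq 0$

$\varphi'(f(x) - f(0))\,|f'(x)| = \tfrac{1}{q}\,|x|^{q(1/q - 1)} \cdot q\,|x|^{q-1} = |x|^{1-q}\,|x|^{q-1} = 1,$

so inequality (2) holds, with equality, on all of $\mathbb{R} \setminus \{0\}$.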

Among real-extended-valued lower semicontinuous functions, typical KL functions are semi-algebraic functions or, more generally, functions definable in an o-minimal structure; see [15, 16, 17]. References on functions definable in an o-minimal structure are [26, 30]. Such examples are discussed at length in [5], and they strongly motivate the present study. Other types of examples, based on more analytical assumptions like uniform convexity, transversality, or metric regularity, can be found in [5], inequality (8.7) of [42], and Remark 3.6.

2.3 An inexact descent convergence result for KL functions

In this section, $a$ and $b$ are fixed positive constants. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper lower semicontinuous function. In the sequel, we consider sequences $(x^k)_{k \in \mathbb{N}}$ which satisfy the following conditions, subsequently referred to as H1, H2, H3:

H1. (Sufficient decrease condition). For each $k \in \mathbb{N}$,

$f(x^{k+1}) + a\|x^{k+1} - x^k\|^2 \le f(x^k);$

H2. (Relative error condition). For each $k \in \mathbb{N}$, there exists $w^{k+1} \in \partial f(x^{k+1})$ such that

$\|w^{k+1}\| \le b\|x^{k+1} - x^k\|;$

H3. (Continuity condition). There exist a subsequence $(x^{k_j})_{j \in \mathbb{N}}$ and $\tilde{x}$ such that

$x^{k_j} \to \tilde{x}$ and $f(x^{k_j}) \to f(\tilde{x})$, as $j \to \infty$.

The next sections feature several important methods that force the sequences to satisfy the three conditions. Conditions H1 and H2 have been commented on in the introduction; concerning condition H3, it is important to note that $f$ itself is not required, in general, to be continuous, or even to be continuous on its domain. Indeed, as we will see in the next sections, the nature of some algorithms (e.g. forward-backward splitting, Gauss-Seidel methods) forces the sequences to comply with condition H3 under a simple lower semicontinuity assumption.
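For concreteness, conditions H1 and H2 can be checked numerically on a finite iterate history; a minimal sketch (the function name, arguments, and tolerance below are our own illustration, not part of the algorithms of this paper):

    import numpy as np

    def satisfies_H1_H2(xs, fs, ws, a, b, tol=1e-12):
        # xs[k]: iterate x^k; fs[k] = f(xs[k]);
        # ws[k]: subgradient w^{k+1} in the subdifferential at xs[k+1].
        for k in range(len(xs) - 1):
            step = np.linalg.norm(xs[k + 1] - xs[k])
            if fs[k + 1] + a * step ** 2 > fs[k] + tol:   # H1 fails
                return False
            if np.linalg.norm(ws[k]) > b * step + tol:    # H2 fails
                return False
        return True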
The following abstract result is at the core of our convergence analysis.

Lemma 2.6. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper lower semicontinuous function which satisfies the Kurdyka-Lojasiewicz property at some $x^* \in \mathbb{R}^n$. Denote by $U$, $\eta$, and $\varphi : [0, \eta) \to \mathbb{R}_+$ the objects appearing in Definition 2.4 of the KL property at $x^*$. Let $\delta, \rho > 0$ be such that $B(x^*, \delta) \subset U$ with $\rho \in (0, \delta)$.
Consider a sequence $(x^k)_{k \in \mathbb{N}}$ which satisfies conditions H1, H2. Assume moreover that

$f(x^*) \le f(x^0) < f(x^*) + \eta, \qquad (3)$

$\|x^* - x^0\| + 2\sqrt{\frac{f(x^0) - f(x^*)}{a}} + \frac{b}{a}\,\varphi(f(x^0) - f(x^*)) < \rho, \qquad (4)$

and

$\forall k \in \mathbb{N},\quad x^k \in B(x^*, \rho) \Rightarrow x^{k+1} \in B(x^*, \delta) \text{ with } f(x^{k+1}) \ge f(x^*). \qquad (5)$

Then the sequence $(x^k)_{k \in \mathbb{N}}$ satisfies

$\forall k \in \mathbb{N},\ x^k \in B(x^*, \rho),$

$\sum_{k=0}^{+\infty} \|x^{k+1} - x^k\| < +\infty,$

$f(x^k) \to f(x^*), \text{ as } k \to \infty,$

and converges to a point $\bar{x} \in B(x^*, \delta)$ such that $f(\bar{x}) \le f(x^*)$.
If the sequence $(x^k)_{k \in \mathbb{N}}$ also satisfies condition H3, then $\bar{x}$ is a critical point of $f$, and $f(\bar{x}) = f(x^*)$.

Proof. The key point is to establish the following claim: for $j = 1, 2, \ldots$,

$x^j \in B(x^*, \rho), \qquad (6)$

and

$\sum_{i=1}^{j} \|x^{i+1} - x^i\| + \|x^{j+1} - x^j\| \le \|x^1 - x^0\| + \frac{b}{a}\left[\varphi(f(x^1) - f(x^*)) - \varphi(f(x^{j+1}) - f(x^*))\right]. \qquad (7)$

Concerning the above claim, first note that condition H1 implies that the sequence $(f(x^k))_{k \in \mathbb{N}}$ is nonincreasing, which by (3) gives $f(x^{j+1}) \le f(x^0) < f(x^*) + \eta$. On the other hand, by assumption (5), the property $x^j \in B(x^*, \rho)$ implies $f(x^{j+1}) \ge f(x^*)$. Hence the quantity $\varphi(f(x^{j+1}) - f(x^*))$ appearing in (7) makes sense.
Let us observe beforehand that, for all $k \ge 1$, the set $\partial f(x^k)$ is nonempty, and therefore $x^k$ belongs to $\operatorname{dom} f$. As we already noticed, condition H1 implies that the sequence $(f(x^k))_{k \in \mathbb{N}}$ is nonincreasing, and it immediately yields

$\|x^{k+1} - x^k\| \le \sqrt{\frac{f(x^k) - f(x^{k+1})}{a}}, \quad \forall k \in \mathbb{N}. \qquad (8)$

Fix $k \ge 1$. We claim that if $f(x^k) < f(x^*) + \eta$ and $x^k \in B(x^*, \rho)$, then

$2\|x^{k+1} - x^k\| \le \|x^k - x^{k-1}\| + \frac{b}{a}\left[\varphi(f(x^k) - f(x^*)) - \varphi(f(x^{k+1}) - f(x^*))\right]. \qquad (9)$

If $x^{k+1} = x^k$ this inequality holds trivially. So assume that $x^{k+1} \neq x^k$. In this case, using (5) and (8), we conclude that $f(x^k) > f(x^{k+1}) \ge f(x^*)$, which, combined with the KL inequality
and H2, shows that $w^k \neq 0$ and $x^{k-1} \neq x^k$. Since $w^k \in \partial f(x^k)$, using (again) the KL inequality and H2, we obtain

$\varphi'(f(x^k) - f(x^*)) \ge \frac{1}{\|w^k\|} \ge \frac{1}{b\|x^k - x^{k-1}\|}.$

The concavity assumption on $\varphi$, $\varphi' > 0$, (5), and H1 imply

$\varphi(f(x^k) - f(x^*)) - \varphi(f(x^{k+1}) - f(x^*)) \ge \varphi'(f(x^k) - f(x^*))\,(f(x^k) - f(x^{k+1})) \ge \varphi'(f(x^k) - f(x^*))\,a\|x^{k+1} - x^k\|^2.$

Direct combination of the two above inequalities yields

$\frac{b}{a}\left[\varphi(f(x^k) - f(x^*)) - \varphi(f(x^{k+1}) - f(x^*))\right] \ge \frac{\|x^{k+1} - x^k\|^2}{\|x^k - x^{k-1}\|}.$

Multiplying this inequality by $\|x^k - x^{k-1}\|$, taking the square root on both sides, and using the inequality $2\sqrt{\alpha\beta} \le \alpha + \beta$, we conclude that inequality (9) is satisfied.
Let us prove claims (6), (7) by induction on $j$. From (5) with $k = 0$, we obtain that $x^1 \in B(x^*, \delta)$ and $f(x^1) \ge f(x^*)$. Using now (8) with $k = 0$, we have

$\|x^1 - x^0\| \le \sqrt{\frac{f(x^0) - f(x^1)}{a}} \le \sqrt{\frac{f(x^0) - f(x^*)}{a}}. \qquad (10)$

Combining the above inequality with assumption (4), and using the triangle inequality, we obtain

$\|x^* - x^1\| \le \|x^* - x^0\| + \|x^0 - x^1\| \le \|x^* - x^0\| + \sqrt{\frac{f(x^0) - f(x^*)}{a}} < \rho,$

which expresses that $x^1$ belongs to $B(x^*, \rho)$. Direct use of (9) with $k = 1$ shows that (7) holds with $j = 1$.
Suppose that (6) and (7) hold for some $j \ge 1$. Then, using the triangle inequality and (7), we have

$\|x^* - x^{j+1}\| \le \|x^* - x^0\| + \|x^0 - x^1\| + \sum_{i=1}^{j} \|x^{i+1} - x^i\| \le \|x^* - x^0\| + 2\|x^0 - x^1\| + \frac{b}{a}\left[\varphi(f(x^1) - f(x^*)) - \varphi(f(x^{j+1}) - f(x^*))\right].$

Using the above inequality, (10), and assumption (4), we conclude that $x^{j+1} \in B(x^*, \rho)$. Hence (9) holds with $k = j + 1$, i.e.

$2\|x^{j+2} - x^{j+1}\| \le \|x^{j+1} - x^j\| + \frac{b}{a}\left[\varphi(f(x^{j+1}) - f(x^*)) - \varphi(f(x^{j+2}) - f(x^*))\right].$

Adding this inequality to (7) yields (7) with $j + 1$ in place of $j$, which completes the induction proof.
Direct use of (7) shows that

$\sum_{i=1}^{j} \|x^{i+1} - x^i\| \le \|x^1 - x^0\| + \frac{b}{a}\,\varphi(f(x^1) - f(x^*)).$

Therefore

$\sum_{i=1}^{\infty} \|x^{i+1} - x^i\| < +\infty,$

which implies that the sequence $(x^k)_{k \in \mathbb{N}}$ converges to some $\bar{x}$. From H2 and (5) (note that the concavity of $\varphi$ yields that $\varphi'$ is decreasing) we infer that $w^k \to 0$ and $f(x^k) \to \beta \ge f(x^*)$. If $\beta > f(x^*)$, then using (2) of Definition 2.4 we have

$\varphi'(\beta - f(x^*))\,\|w^k\| \ge 1, \quad k = 0, 1, \ldots,$

which is absurd, because $w^k \to 0$. Therefore $\beta = f(x^*)$ and, since $f$ is lower semicontinuous, $f(\bar{x}) \le \beta = f(x^*)$.
To end the proof, note that if the sequence $(x^k)_{k \in \mathbb{N}}$ satisfies H3, then $\tilde{x} = \bar{x}$, $\bar{x}$ is critical, and $f(\bar{x}) = \lim_{k \to \infty} f(x^k) = f(x^*)$.

Corollary 2.7. Let $f$, $x^*$, $\rho$, $\delta$ be as in the previous lemma. For $q \ge 1$, consider a finite family $x^0, \ldots, x^q$ which satisfies H1 and H2, conditions (3), (4), and

$\forall k \in \{0, \ldots, q\},\quad \left(x^k \in B(x^*, \rho)\right) \Rightarrow \left(x^{k+1} \in B(x^*, \delta) \text{ with } f(x^{k+1}) \ge f(x^*)\right).$

Then $x^j \in B(x^*, \rho)$ for all $j = 0, \ldots, q$.

Proof. Simply reproduce the beginning of the proof of the previous lemma.

Corollary 2.8. If we replace assumption (5) in Lemma 2.6 by the set of assumptions

$\eta < a(\delta - \rho)^2, \qquad (11)$

$f(x^k) \ge f(x^*), \quad \forall k \in \mathbb{N}, \qquad (12)$

the conclusion remains unchanged.

Proof. It suffices to prove that (11) and (12) imply (5). Let $x^k \in B(x^*, \rho)$. By H1, we have

$\|x^{k+1} - x^k\| \le \sqrt{\frac{f(x^k) - f(x^{k+1})}{a}} \le \sqrt{\frac{\eta}{a}} < \delta - \rho.$

Hence $\|x^{k+1} - x^*\| \le \|x^{k+1} - x^k\| + \|x^k - x^*\| < \delta$.

Lemma 2.6 and its corollaries have several important consequences that we now proceed
to discuss.

Theorem 2.9 (Convergence to a critical point). Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper lower semicontinuous function. Consider a sequence $(x^k)_{k \in \mathbb{N}}$ that satisfies H1, H2, and H3.
If $f$ has the Kurdyka-Lojasiewicz property at the cluster point $\tilde{x}$ specified in H3, then the sequence $(x^k)_{k \in \mathbb{N}}$ converges to $\bar{x} = \tilde{x}$ as $k$ goes to infinity, and $\bar{x}$ is a critical point of $f$. Moreover, the sequence $(x^k)_{k \in \mathbb{N}}$ has finite length, i.e.

$\sum_{k=0}^{+\infty} \|x^{k+1} - x^k\| < +\infty.$

Proof. Let $\bar{x} = \tilde{x}$ be a cluster point of $(x^k)_{k \in \mathbb{N}}$ as given by H3 (i.e., $x^{k_j} \to \bar{x}$ and $f(x^{k_j}) \to f(\bar{x})$). Since $(f(x^k))_{k \in \mathbb{N}}$ is a nonincreasing sequence (a direct consequence of H1), we deduce that $f(x^k) \to f(\bar{x})$ and $f(x^k) \ge f(\bar{x})$ for all integers $k$. The function $f$ has the KL property around $\bar{x}$, hence there exist $\varphi$, $U$, $\eta$ as in Definition 2.4. Let $\delta > 0$ be such that $B(\bar{x}, \delta) \subset U$ and take $\rho \in (0, \delta)$. If necessary, shrink $\eta$ so that $\eta < a(\delta - \rho)^2$. Use the continuity property of $\varphi$ to obtain the existence of an integer $k_0$ such that $f(x^k) \in [f(\bar{x}), f(\bar{x}) + \eta)$ for all $k \ge k_0$ and

$\|\bar{x} - x^{k_0}\| + 2\sqrt{\frac{f(x^{k_0}) - f(\bar{x})}{a}} + \frac{b}{a}\,\varphi(f(x^{k_0}) - f(\bar{x})) < \rho.$

Since $f(x^k) \ge f(\bar{x})$ for all integers $k$, the conclusion follows by applying Corollary 2.8 to the sequence $(y^k)_{k \in \mathbb{N}}$ defined by $y^k = x^{k_0 + k}$ for all integers $k$.

As will be shown later on, sequences complying with conditions H1, H2, and H3 are not necessarily generated by a local model (see Section 6), and therefore the proximity of the starting point $x^0$ to a local minimizer $x^*$ does not imply, in general, that the limit point of the sequence lies in a neighbourhood of $x^*$.
However, under the following specific assumption, we can establish a convergence result to a local minimizer.

H4: For any $\delta > 0$ there exist $0 < \rho < \delta$ and $\nu > 0$ such that

$x \in B(x^*, \rho),\ f(x) < f(x^*) + \nu,\ y \notin B(x^*, \delta) \ \Longrightarrow\ f(x) < f(y) + a\|y - x\|^2.$

Theorem 2.10 (Local convergence to local minima). Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper lower semicontinuous function which has the KL property at some local minimizer $x^*$. Assume that H4 holds at $x^*$.
Then, for any $r > 0$, there exist $u \in (0, r)$ and $\mu > 0$ such that the inequalities

$\|x^0 - x^*\| < u, \qquad f(x^*) < f(x^0) < f(x^*) + \mu$

imply that any sequence $(x^k)_{k \in \mathbb{N}}$ starting from $x^0$ that satisfies H1, H2 has the finite length property, remains in $B(x^*, r)$, and converges to some critical point $\bar{x} \in B(x^*, r)$ of $f$ with $f(\bar{x}) = f(x^*)$.

Proof. Take $r > 0$. Since $x^*$ is a local minimizer (hence critical) and $f$ satisfies the Kurdyka-Lojasiewicz property, there exist $\eta_0 \in (0, +\infty]$, $\delta \in (0, r)$, and a continuous concave function $\varphi : [0, \eta_0) \to \mathbb{R}_+$ such that:
- $\varphi(0) = 0$,
- $\varphi$ is $C^1$ on $(0, \eta_0)$,
- for all $s \in (0, \eta_0)$, $\varphi'(s) > 0$,
- for all $x$ in $B(x^*, \delta) \cap [f(x^*) < f < f(x^*) + \eta_0]$, the Kurdyka-Lojasiewicz inequality holds:

$\varphi'(f(x) - f(x^*))\,\operatorname{dist}(0, \partial f(x)) \ge 1, \qquad (13)$

- for all $x$ in $B(x^*, \delta)$,

$f(x) \ge f(x^*). \qquad (14)$

We infer from assumption H4 that there exist $\rho \in (0, \delta)$ and $\nu > 0$ such that

$x \in B(x^*, \rho),\ f(x) < f(x^*) + \nu,\ y \notin B(x^*, \delta) \ \Longrightarrow\ f(x) < f(y) + a\|y - x\|^2. \qquad (15)$

Set $\eta = \min\{\eta_0, \nu\}$ and let $k \in \mathbb{N}$. If $x^k$ is such that $f(x^k) < f(x^*) + \eta$ and $\|x^k - x^*\| < \rho$, then H4, together with H1, implies that $x^{k+1} \in B(x^*, \delta)$, and thus that $f(x^{k+1}) \ge f(x^*)$ (recall that $x^*$ is a local minimizer on $B(x^*, \delta)$). This is precisely property (5) of Lemma 2.6.
Choose $u, \mu > 0$ such that

$u < \rho/3, \qquad \mu < \eta, \qquad 2\sqrt{\frac{\mu}{a}} + \frac{b}{a}\,\varphi(\mu) < \frac{2\rho}{3}.$

If $x^0$ satisfies the inequalities $\|x^0 - x^*\| < u$ and $f(x^*) < f(x^0) < f(x^*) + \mu$, we therefore have

$\|x^* - x^0\| + 2\sqrt{\frac{f(x^0) - f(x^*)}{a}} + \frac{b}{a}\,\varphi(f(x^0) - f(x^*)) < \rho,$

which is precisely property (4) of Lemma 2.6. Using Lemma 2.6, we conclude that the sequence $(x^k)_{k \in \mathbb{N}}$ has the finite length property, remains in $B(x^*, \rho)$, converges to some $\bar{x} \in B(x^*, \delta)$, $f(x^k) \to f(x^*)$, and $f(\bar{x}) \le f(x^*)$. Since $f(x^*)$ is the minimum value of $f$ in $B(x^*, \delta)$, $f(\bar{x}) = f(x^*)$ and the sequence $(x^k)_{k \in \mathbb{N}}$ also has property H3. So $\bar{x}$ is a critical point of $f$.

Remark 2.11. Let us verify that condition H4 is satisfied when $x^* \in \operatorname{dom} f$ is a local minimizer and the function $f$ satisfies the following global growth condition:

$f(y) \ge f(x^*) - \frac{a}{4}\|y - x^*\|^2 \quad \text{for all } y \in \mathbb{R}^n. \qquad (16)$

Let $\delta > \rho$ and $\nu$ be positive real numbers. Take $y \in \mathbb{R}^n$ such that $\|y - x^*\| \ge \delta$ and $x \in \mathbb{R}^n$ such that $\|x - x^*\| \le \rho$ and $f(x) < f(x^*) + \nu$. From (16) and the triangle inequality we infer

$f(y) \ge f(x) - \nu - \frac{a}{4}\|y - x^*\|^2 \ge f(x) - \nu - \frac{a}{2}\|y - x^*\|^2 + \frac{a}{4}\|y - x^*\|^2 \ge f(x) - \nu - a\|y - x\|^2 - a\|x - x^*\|^2 + \frac{a}{4}\|y - x^*\|^2 \ge f(x) - \nu - a\|y - x\|^2 - a\rho^2 + \frac{a}{4}\delta^2.$

Hence

$f(y) + a\|y - x\|^2 \ge f(x) + \left(-\nu - a\rho^2 + \frac{a}{4}\delta^2\right) \quad \text{for all } y \notin B(x^*, \delta). \qquad (17)$

We conclude by noticing that $-\nu - a\rho^2 + \frac{a}{4}\delta^2$ is nonnegative for $\rho$ and $\nu$ sufficiently small.
We end this section with a result on convergence toward a global minimum, similar to [5], Theorem 3.3. Observe that, in this context, the set of global minimizers may be a continuum.

Theorem 2.12 (Local convergence to global minima). Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a lower semicontinuous function which has the KL property at some $x^*$, a global minimum point of $f$. For each $r > 0$, there exist $u \in (0, r)$ and $\mu > 0$ such that the inequalities

$\|x^0 - x^*\| < u, \qquad \min f < f(x^0) < \min f + \mu$

imply that any sequence $(x^k)_{k \in \mathbb{N}}$ that satisfies H1, H2 and starts from $x^0$ satisfies:
(i) $x^k \in B(x^*, r)$, $\forall k \in \mathbb{N}$,
(ii) $x^k$ converges to some $\bar{x}$ and $\sum_{k=1}^{+\infty} \|x^{k+1} - x^k\| < +\infty$,
(iii) $f(\bar{x}) = \min f$.

Proof. It is a straightforward variant of Theorems 2.9 and 2.10.

3 Inexact gradient methods


The first natural domain of application of our previous results concerns the simplest first-order methods, namely gradient methods. As we shall see, our abstract framework (Theorem 2.9) allows us to recover some of the results of [1]. In order to illustrate the versatility of our algorithmic framework, we also consider a fairly general semi-algebraic feasibility problem, and we provide, in the line of [42], a local convergence proof for an inexact averaged projection method.

3.1 General convergence result

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a $C^1$ function whose gradient is Lipschitz continuous with constant $L$ (or $L$-Lipschitz continuous). We consider the following algorithm.

Algorithm 1: Take positive parameters $a, b$ with $a > L$. Fix $x^0$ in $\mathbb{R}^n$. For $k = 0, 1, \ldots$, consider

$\langle\nabla f(x^k), x^{k+1} - x^k\rangle + \frac{a}{2}\|x^{k+1} - x^k\|^2 \le 0, \qquad (18)$

$\|\nabla f(x^k)\| \le b\|x^{k+1} - x^k\|. \qquad (19)$
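For intuition, an explicit gradient step $x^{k+1} = x^k - \frac{1}{c}\nabla f(x^k)$ satisfies both (18) and (19) as soon as $\frac{a}{2} \le c \le b$: inequality (18) then reads $-\frac{1}{c}\|\nabla f(x^k)\|^2 + \frac{a}{2c^2}\|\nabla f(x^k)\|^2 \le 0$, and (19) reads $c \le b$. A minimal sketch of this instance of Algorithm 1, on a toy quadratic objective of our own choosing:

    import numpy as np

    # f(x) = 0.5 <Qx, x> is semi-algebraic; grad f(x) = Qx is L-Lipschitz
    # with L = largest eigenvalue of Q.
    Q = np.array([[3.0, 1.0], [1.0, 2.0]])
    L = np.linalg.eigvalsh(Q).max()

    a, b = 1.5 * L, 2.0 * L   # Algorithm 1 asks for a > L
    c = a                     # any c in [a/2, b] makes (18)-(19) hold

    x = np.array([1.0, -2.0])
    for _ in range(200):
        x = x - (Q @ x) / c   # x^{k+1} = x^k - (1/c) grad f(x^k)
    print(x)                  # tends to the critical point (0, 0)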

To illustrate the variety of dynamics covered by Algorithm 1, let us show how variable metric gradient algorithms can be cast in this framework. Consider a sequence $(A^k)_{k \in \mathbb{N}}$ of symmetric positive definite matrices in $\mathbb{R}^{n \times n}$ such that for each $k \in \mathbb{N}$ the eigenvalues $\lambda_i^k$ of $A^k$ satisfy

$0 < \underline{\lambda} \le \lambda_i^k \le \overline{\lambda},$

where $\underline{\lambda}$ and $\overline{\lambda}$ are given thresholds. For each integer $k$, consider the following subproblem, built on a second-order model of $f$ around the point $x^k$:

$\text{minimize}\ \left\{\langle\nabla f(x^k), u - x^k\rangle + \frac{1}{2}\langle A^k(u - x^k), u - x^k\rangle : u \in \mathbb{R}^n\right\}.$

This type of quadratic model arises, for instance, in trust-region methods (see [1], which is also connected to the Lojasiewicz inequality). When solving the above problem exactly, we obtain the following method:

$x^{k+1} = x^k - (A^k)^{-1}\nabla f(x^k),$

which satisfies

$\langle\nabla f(x^k), x^{k+1} - x^k\rangle + \underline{\lambda}\,\|x^{k+1} - x^k\|^2 \le 0, \qquad (20)$

$\|\nabla f(x^k)\| \le \overline{\lambda}\,\|x^{k+1} - x^k\|. \qquad (21)$

So long as $\underline{\lambda} > \frac{L}{2}$, the sequence $(x^k)_{k \in \mathbb{N}}$ falls into the general category delineated by Algorithm 1.
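In code, one such exact variable metric step, with eigenvalue clipping used to enforce the thresholds, might look as follows (a sketch; the function and its arguments are our own illustration):

    import numpy as np

    def variable_metric_step(x, grad_f, H, lam_lo, lam_hi):
        # Build A^k from a symmetric model matrix H by clipping its
        # eigenvalues to [lam_lo, lam_hi], so that (20)-(21) hold, then
        # return x^{k+1} = x^k - (A^k)^{-1} grad f(x^k).
        vals, vecs = np.linalg.eigh(H)
        A = (vecs * np.clip(vals, lam_lo, lam_hi)) @ vecs.T
        return x - np.linalg.solve(A, grad_f(x))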
For the convergence analysis of Algorithm 1, we shall of course use the elementary but important descent lemma (see for example [50, 3.2.12]).

Lemma 3.1 (Descent lemma). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function and $C$ a convex subset of $\mathbb{R}^n$ with nonempty interior. Assume that $f$ is $C^1$ on a neighborhood of each point in $C$ and that $\nabla f$ is $L$-Lipschitz continuous on $C$. Then, for any two points $x, u$ in $C$,

$f(u) \le f(x) + \langle\nabla f(x), u - x\rangle + \frac{L}{2}\|u - x\|^2. \qquad (22)$
We then have the following result.

Theorem 3.2. Assume that $f : \mathbb{R}^n \to \mathbb{R}$ is a $C^1$ function with $L$-Lipschitz continuous gradient and that $f$ is bounded from below. If $f$ is a KL function, then each bounded sequence $(x^k)_{k \in \mathbb{N}}$ generated by Algorithm 1 converges to some critical point $\bar{x}$ of $f$.
Moreover, the sequence $(x^k)_{k \in \mathbb{N}}$ has finite length, i.e. $\sum_k \|x^{k+1} - x^k\| < +\infty$.

Proof. Applying the descent lemma at the points $u = x^{k+1}$ and $x = x^k$, inequality (18) becomes

$f(x^{k+1}) - f(x^k) + \frac{a - L}{2}\|x^{k+1} - x^k\|^2 \le 0.$

Since $a > L$, condition H1 of Theorem 2.9 is satisfied. To see that H2 is satisfied, use the Lipschitz continuity of $\nabla f$ and (19) to obtain

$\|\nabla f(x^{k+1})\| \le \|\nabla f(x^{k+1}) - \nabla f(x^k)\| + \|\nabla f(x^k)\| \le (L + b)\|x^{k+1} - x^k\|.$

The sequence $(x^k)_{k \in \mathbb{N}}$ has been assumed to be bounded. Thus it admits a convergent subsequence and, by continuity of $f$, H3 is trivially fulfilled. We can therefore apply Theorem 2.9 to conclude.
Remark 3.3. The conclusion of Theorem 3.2 remains unchanged if the assumptions that $\nabla f$ is Lipschitz continuous on $\mathbb{R}^n$ and that $f$ is a KL function are replaced by the following:
There exists a closed subset $S$ of $\mathbb{R}^n$ such that
(i) $\nabla f$ is $L$-Lipschitz continuous on $\operatorname{co} S$;
(ii) $x^k \in S$ for all $k \in \mathbb{N}$;
(iii) $f$ satisfies the KL inequality at each point of $S$,
where $\operatorname{co} S$ denotes the convex envelope of $S$. The result is evident from the proof: just notice that the $L$-Lipschitz continuity of $\nabla f$ on $\operatorname{co} S$ is needed in order to apply the descent lemma.

3.2 Prox-regularity

When considering nonconvex feasibility problems, we are led to consider squared distance functions to nonconvex sets. Contrary to what happens in the standard convex setting, such functions may fail to be differentiable. If we want to handle feasibility problems through gradient methods (e.g. Algorithm 1), this lack of regularity causes serious trouble. The key concept of prox-regularity provides a characterization of the local differentiability of these functions and, as we will see in the next section, it allows in turn to design averaged projection methods with interesting convergence properties.
A closed subset $F$ of $\mathbb{R}^n$ is prox-regular if its projection operator $P_F$ is single-valued around each point $x$ in $F$ (see [53, Theorem 1.3, (a) $\Leftrightarrow$ (f)]). Prominent examples of prox-regular sets are closed convex sets and $C^2$ submanifolds of $\mathbb{R}^n$ (see [53] and the references therein).
Set $g(x) = \frac{1}{2}\operatorname{dist}(x, F)^2$ and assume that $F$ is prox-regular. Let us gather the following definition/properties concerning $F$ that are fundamental for our purpose.

Theorem 3.4 ([53]). Let $F$ be a closed prox-regular set. Then for each $\bar{x}$ in $F$ there exists $r > 0$ such that:
(a) the projection $P_F$ is single-valued on $B(\bar{x}, r)$,
(b) the function $g$ is $C^1$ on $B(\bar{x}, r)$ and $\nabla g(x) = x - P_F(x)$,
(c) the gradient mapping $\nabla g$ is 1-Lipschitz continuous on $B(\bar{x}, r)$.
Item (c) is not explicitly developed in [53]; a proper proof can be found in [42, Proposition 8.1].

3.3 Averaged projections for feasibility problems

Let $F_1, \ldots, F_p$ be nonempty closed semi-algebraic, prox-regular subsets of $\mathbb{R}^n$ such that

$\bigcap_{i=1}^{p} F_i \neq \emptyset.$

A classical approach to the problem of finding a common point of the sets $F_1, \ldots, F_p$ is to find a global minimizer of the function $f : \mathbb{R}^n \to [0, +\infty)$,

$f(x) := \frac{1}{2}\sum_{i=1}^{p} \operatorname{dist}(x, F_i)^2, \qquad (23)$
where $\operatorname{dist}(\cdot, F_i)$ is the distance function to the set $F_i$.
As one can easily verify, in the convex case the averaged projection method corresponds exactly to an explicit gradient method applied to the function $\frac{1}{p} f$. In a nonconvex setting, we are thus led to study the following algorithm.

Inexact averaged projection algorithm: Take $\theta \in (0, 1)$, $\alpha < \frac{1}{2}$, and $M > 0$ such that

$\frac{1 - \alpha}{\theta} > \frac{1}{2}. \qquad (24)$

Given a starting point $x^0$ in $\mathbb{R}^n$, consider the following algorithm:

$x^{k+1} \in (1 - \theta)\,x^k + \theta\left(\frac{1}{p}\sum_{i=1}^{p} P_{F_i}(x^k)\right) + \epsilon_k, \qquad (25)$

where $(\epsilon_k)_{k \in \mathbb{N}}$ is a sequence of errors which satisfies

$\langle\epsilon_k, x^{k+1} - x^k\rangle \le \alpha\|x^{k+1} - x^k\|^2, \qquad (26)$

$\|\epsilon_k\| \le M\|x^{k+1} - x^k\| \qquad (27)$

for all $k \in \mathbb{N}$.
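In the exact case $\epsilon_k = 0$, conditions (26) and (27) hold trivially. A minimal sketch of this exact case on two prox-regular semi-algebraic sets, the unit circle and the line $y = x$ (a toy instance of our own, not taken from the paper):

    import numpy as np

    def P_circle(x):                  # projection onto the unit circle;
        return x / np.linalg.norm(x)  # single-valued away from the origin

    def P_line(x):                    # projection onto the line y = x
        m = 0.5 * (x[0] + x[1])
        return np.array([m, m])

    theta = 0.5
    x = np.array([2.0, 0.5])          # start near the intersection
    for _ in range(200):
        avg = 0.5 * (P_circle(x) + P_line(x))
        x = (1 - theta) * x + theta * avg   # step (25) with eps_k = 0
    print(x)  # approaches the feasible point (1/sqrt(2), 1/sqrt(2))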
We then have the following result.

Theorem 3.5 (Inexact averaged projection method). Let $F_1, \ldots, F_p$ be semi-algebraic, prox-regular subsets of $\mathbb{R}^n$ which satisfy

$\bigcap_{i=1}^{p} F_i \neq \emptyset.$

If $x^0$ is sufficiently close to $\bigcap_{i=1}^{p} F_i$, then the inexact averaged projection algorithm reduces to the gradient method

$x^{k+1} = x^k - \frac{\theta}{p}\nabla f(x^k) + \epsilon_k,$

with $f$ given by (23), and therefore defines a unique sequence. Moreover, this sequence has finite length and converges to a feasible point $\bar{x}$, i.e. such that

$\bar{x} \in \bigcap_{i=1}^{p} F_i.$

Proof. Let us first observe that the function $f$ (given by (23)) is semi-algebraic, because the distance function to any nonempty semi-algebraic set is semi-algebraic (see Lemma 2.3 or [30, 15]). This implies in particular that $f$ is a KL function (see the end of Section 2.2).
Take a point $x^*$ in $\bigcap_{i=1}^{p} F_i$ and use Theorem 3.4 to obtain $\delta > 0$ such that, for each $i = 1, \ldots, p$,

(a) the projection $P_{F_i}$ is single-valued on $B(x^*, \delta)$,

(b) the function $g_i := \frac{1}{2}\operatorname{dist}(\cdot, F_i)^2$ is $C^1$ on $B(x^*, \delta)$ and $\nabla g_i(x) = x - P_{F_i}(x)$,

(c) the gradient mapping $\nabla g_i$ is 1-Lipschitz continuous on $B(x^*, \delta)$.

Since the function $f$ has the KL property around $x^*$, there exist $\varphi$, $U$, $\eta$ as in Definition 2.4. Shrinking $\delta$ if necessary, we may assume that $B(x^*, \delta) \subset U$. Take $\rho \in (0, \delta)$ and shrink $\eta$ so that

$\eta < \frac{1 - 2\alpha}{2p}(\delta - \rho)^2. \qquad (28)$

Choose a starting point $x^0$ such that $0 = f(x^*) \le f(x^0) < \eta$ and

$\|x^* - x^0\| + 2\sqrt{\frac{f(x^0)}{a}} + \frac{b}{a}\,\varphi(f(x^0)) < \rho. \qquad (29)$

Introduce $a = p\left(\frac{1 - \alpha}{\theta} - \frac{1}{2}\right) > 0$ (cf. (24)) and $b = p\left(1 + \frac{1 + M}{\theta}\right)$.
Let us prove by induction that the averaged projection algorithm defines a unique sequence that satisfies:

– conditions H1 and H2 of Section 2.3 with respect to the function $f$ and the constants $a$, $b$,

– $x^k \in B(x^*, \rho)$ for all integers $k \ge 0$.

The case $k = 0$ follows from (29). Before proceeding, note that if a point $x$ belongs to $B(x^*, \delta)$, we have

$\nabla f(x) = \sum_{i=1}^{p} (x - P_{F_i}(x)).$

Using the Cauchy-Schwarz inequality (one may as well use the convexity of $\|\cdot\|^2$), we obtain

$\|\nabla f(x)\|^2 \le \left(\sum_{i=1}^{p} \|x - P_{F_i}(x)\|\right)^2 \le p\sum_{i=1}^{p} \|x - P_{F_i}(x)\|^2 = 2p\,f(x). \qquad (30)$

Let $k \ge 0$. Assume now that $x^k \in B(x^*, \rho)$ and that properties H1, H2 hold for the $(k+1)$-tuple $x^0, \ldots, x^k$. Using Theorem 3.4, the inclusion (25) defining $x^{k+1}$ may be rewritten as

$x^{k+1} = x^k - \frac{\theta}{p}\nabla f(x^k) + \epsilon_k,$

hence $x^{k+1}$ is uniquely defined. The above equality yields (note that $\theta \in (0, 1)$ and $p \ge 1$)

$\|x^{k+1} - x^k\|^2 - 2\langle x^{k+1} - x^k, \epsilon_k\rangle + \|\epsilon_k\|^2 \le \|\nabla f(x^k)\|^2,$

thus, in view of (26), (28), and (30),

$\|x^{k+1} - x^k\|^2 \le \frac{2p}{1 - 2\alpha}\,f(x^k) \le (\delta - \rho)^2. \qquad (31)$
Since $\|x^{k+1} - x^*\| \le \|x^{k+1} - x^k\| + \|x^k - x^*\|$, this implies that $x^{k+1} \in B(x^*, \delta)$. Using (26) and (27), let us verify that property H1 is satisfied for $x^0, x^1, \ldots, x^{k+1}$. We have

$\langle\nabla f(x^k), x^{k+1} - x^k\rangle = \frac{p}{\theta}\langle (x^k - x^{k+1}) + \epsilon_k, x^{k+1} - x^k\rangle \le -\frac{p}{\theta}(1 - \alpha)\|x^{k+1} - x^k\|^2.$

By Theorem 3.4, we know that $\nabla f$ is $p$-Lipschitz on $B(x^*, \delta)$; we can thus combine the above inequality with the descent lemma to obtain

$f(x^{k+1}) + \frac{2\frac{p}{\theta}(1 - \alpha) - p}{2}\|x^{k+1} - x^k\|^2 \le f(x^k),$

that is,

$f(x^{k+1}) + a\|x^{k+1} - x^k\|^2 \le f(x^k),$

which is exactly property H1. On the other hand, we have

$\|\nabla f(x^{k+1})\| \le \|\nabla f(x^{k+1}) - \nabla f(x^k)\| + \|\nabla f(x^k)\| \le p\|x^{k+1} - x^k\| + \frac{p}{\theta}\left(\|x^{k+1} - x^k\| + \|\epsilon_k\|\right) \le p\left(1 + \frac{1 + M}{\theta}\right)\|x^{k+1} - x^k\| = b\|x^{k+1} - x^k\|,$

where the second inequality comes from the Lipschitz property of $\nabla f$ and the definition of the sequence, while the last one follows from the error stepsize inequality (27). Property H2 is therefore satisfied.
Applying now Corollary 2.7, we get $x^{k+1} \in B(x^*, \rho)$, and our induction proof is complete.
As a consequence, the algorithm defines a unique sequence that satisfies the assumptions of Lemma 2.6 (or Theorem 3.2), hence it generates a finite length sequence which converges to a point $\bar{x}$ such that $f(\bar{x}) = 0$.

Remark 3.6. In [42], a paper that inspired the above development, the authors establish similar results for sets $F_i$ having a linearly regular intersection at some point $\bar{x}$, an important concept that originates from [47, Theorem 2.8]. A linearly regular intersection at $\bar{x}$ means that the equation

$\sum_{i=1}^{p} y_i = 0, \quad \text{with } y_i \in N_{F_i}(\bar{x}),$

admits $y_i = 0$, $i = 1, \ldots, p$, as its unique solution.
An important fact, tightly linked to the convergence result for averaged projections given in [42, Theorem 7.3], is that the objective $f(x) := \frac{1}{2}\sum_i \operatorname{dist}(x, F_i)^2$ satisfies the inequality

$\|\nabla f(x)\|^2 \ge c\,f(x),$

where $x$ is in a neighborhood of $\bar{x}$ and $c$ is a positive constant (see [42, Proposition 8.6]). One recognizes the Lojasiewicz inequality with a desingularizing function of the form $\varphi(s) = \frac{2}{\sqrt{c}}\sqrt{s}$, $s \ge 0$.
4 Inexact proximal algorithm

Let us first recall the exact version of the proximal algorithm for nonconvex functions [36, 3].
Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper lower semicontinuous function which is bounded from below, and let $\lambda$ be a positive parameter. It is convenient to introduce formally the proximal correspondence $\operatorname{prox}_{\lambda f} : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$, which is defined through the formula

$\operatorname{prox}_{\lambda f} x := \operatorname{argmin}\left\{f(y) + \frac{1}{2\lambda}\|y - x\|^2 : y \in \mathbb{R}^n\right\}.$

Note that for any $\mu > 0$ we have $\operatorname{prox}_{\lambda(\mu f)} = \operatorname{prox}_{(\lambda\mu) f}$, so that these objects may simply be denoted by $\operatorname{prox}_{\lambda\mu f}$.
In view of the assumption $\inf f > -\infty$, the lower semicontinuity of $f$ and the coercivity of the squared norm imply that $\operatorname{prox}_{\lambda f}$ has nonempty values. Observe finally that, contrary to the case when $f$ is convex, we generally do not face here a single-valued operator.
The classical proximal algorithm writes

$x^{k+1} \in \operatorname{prox}_{\lambda_k f}(x^k), \qquad (32)$

where $(\lambda_k)$ is a sequence of stepsize parameters lying in an interval $[\underline{\lambda}, \overline{\lambda}] \subset (0, +\infty)$, and $x^0 \in \mathbb{R}^n$. Writing successively the definition of the proximal operator and the associated first optimality condition (use the sum rule [54]), we obtain

$f(x^{k+1}) + \frac{1}{2\lambda_k}\|x^{k+1} - x^k\|^2 \le f(x^k), \qquad (33)$

$w^{k+1} \in \partial f(x^{k+1}), \qquad (34)$

$\lambda_k w^{k+1} + x^{k+1} - x^k = 0. \qquad (35)$
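For a one-dimensional nonconvex $f$, the (possibly multivalued) proximal step can be approximated by minimizing over a dense grid; the following rough sketch (our own illustration, not a method advocated by the paper) runs the exact iteration (32) on a nonconvex semi-algebraic objective:

    import numpy as np

    def prox_step(f, x, lam, grid):
        # Approximate one element of prox_{lam f}(x) by minimizing
        # y -> f(y) + ||y - x||^2 / (2 lam) over a dense grid.
        return grid[np.argmin(f(grid) + (grid - x) ** 2 / (2.0 * lam))]

    f = lambda y: (y ** 2 - 1.0) ** 2        # nonconvex, semi-algebraic
    grid = np.linspace(-3.0, 3.0, 20001)

    x = 2.5
    for _ in range(50):
        x = prox_step(f, x, lam=0.1, grid=grid)
    print(x)  # settles at the critical point x = 1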

4.1 Convergence of an inexact proximal algorithm for KL functions

Let us introduce an inexact version of the proximal point method. Consider the sequence $(x^k)_{k \in \mathbb{N}}$ generated by the following algorithm.

Algorithm 2: Take $x^0 \in \mathbb{R}^n$, $0 < \underline{\lambda} \le \overline{\lambda} < \infty$, $0 \le \sigma < 1$, $0 < \theta \le 1$.
For $k = 0, 1, \ldots$, choose $\lambda_k \in [\underline{\lambda}, \overline{\lambda}]$ and find $x^{k+1} \in \mathbb{R}^n$, $w^{k+1} \in \mathbb{R}^n$ such that

$f(x^{k+1}) + \frac{\theta}{2\lambda_k}\|x^{k+1} - x^k\|^2 \le f(x^k), \qquad (36)$

$w^{k+1} \in \partial f(x^{k+1}), \qquad (37)$

$\|\lambda_k w^{k+1} + x^{k+1} - x^k\|^2 \le \sigma\left(\|\lambda_k w^{k+1}\|^2 + \|x^{k+1} - x^k\|^2\right). \qquad (38)$

The error criterion (38) is a particular case of the error criterion considered in [57], but here, contrary to [57], we are not dealing with a maximal monotone operator and no extragradient step is performed. In our setting, condition (38) can be replaced by a weaker condition: for some positive $b > 0$,

$\|\lambda_k w^{k+1}\| \le b\|x^{k+1} - x^k\|. \qquad (39)$

The fact that Algorithm 2 is an inexact version of the proximal algorithm is transparent: the first inequality (36) reflects the fact that a sufficient decrease of the value must be achieved, while the last lines (38), (39) both correspond to an inexact optimality condition.
The following elementary lemma is useful for the convergence analysis of the algorithm.

Lemma 4.1. Let $\sigma \in (0, 1]$. If $x, y \in \mathbb{R}^n$ and

$\|x + y\|^2 \le \sigma(\|x\|^2 + \|y\|^2), \qquad (40)$

then

$\frac{1 - \sigma}{2}(\|x\|^2 + \|y\|^2) \le -\langle x, y\rangle.$

Assuming moreover $\sigma \in (0, 1)$,

$\frac{1 - \sqrt{1 - (1 - \sigma)^2}}{1 - \sigma}\,\|y\| \le \|x\| \le \frac{1 + \sqrt{1 - (1 - \sigma)^2}}{1 - \sigma}\,\|y\|.$

Proof. Note that (40) is equivalent to

$\|x\|^2 + 2\langle x, y\rangle + \|y\|^2 \le \sigma(\|x\|^2 + \|y\|^2).$

Direct algebraic manipulation of the above inequality yields the first inequality. For proving the second and third inequalities, combine the above inequality with the Cauchy-Schwarz inequality to obtain

$(1 - \sigma)\|x\|^2 - 2\|x\|\|y\| + (1 - \sigma)\|y\|^2 \le 0.$

Viewing the left-hand side of the above inequality as a quadratic function of $\|x\|/\|y\|$ yields the conclusion.
The main result of this section is the following theorem.

Theorem 4.2 (Inexact proximal algorithm). Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper lower semicontinuous KL function which is bounded from below. Assume that the restriction of $f$ to its domain is a continuous function. If a sequence $(x^k)_{k \in \mathbb{N}}$ generated by Algorithm 2 (or by (36), (37), and (39)) is bounded, then it converges to some critical point $\bar{x}$ of $f$.
Moreover, the sequence $(x^k)_{k \in \mathbb{N}}$ has finite length, i.e. $\sum_k \|x^{k+1} - x^k\| < +\infty$.

Proof. First use Lemma 4.1 to conclude that condition (38) implies (39). Therefore, we assume that (36), (37), and (39) hold. If $(x^k)_{k \in \mathbb{N}}$ is bounded, there exist a subsequence $(x^{k_j})$ and $\bar{x}$ such that

$x^{k_j} \to \bar{x} \text{ as } j \to \infty.$

Since $f$ is continuous on its effective domain and $f(x^{k_j}) \le f(x^0) < +\infty$ for all $j$, we conclude that

$f(x^{k_j}) \to f(\bar{x}) \text{ as } j \to \infty.$

We can now apply Theorem 2.9 and thus obtain the convergence of the sequence $(x^k)_{k \in \mathbb{N}}$ to a critical point of $f$.

4.2 A variant for convex functions

When the function under consideration is convex and satisfies the Kurdyka-Lojasiewicz property, Algorithm 2 can be simplified while its convergence properties are maintained.
Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper lower semicontinuous convex function. Consider the sequence $(x^k)_{k \in \mathbb{N}}$ generated by the following algorithm.

Algorithm 2bis: Take $0 < \underline{\lambda} \le \overline{\lambda} < \infty$, $0 \le \sigma < 1$.
For $k = 0, 1, \ldots$, choose $\lambda_k \in [\underline{\lambda}, \overline{\lambda}]$ and find $x^{k+1} \in \mathbb{R}^n$, $w^{k+1} \in \mathbb{R}^n$ such that

$w^{k+1} \in \partial f(x^{k+1}), \qquad (41)$

$\|\lambda_k w^{k+1} + x^{k+1} - x^k\|^2 \le \sigma\left(\|\lambda_k w^{k+1}\|^2 + \|x^{k+1} - x^k\|^2\right). \qquad (42)$

Before stating our main results, let us establish some elementary inequalities. We claim that for each $k$,

$f(x^{k+1}) + \frac{1 - \sigma}{2\overline{\lambda}}\|x^{k+1} - x^k\|^2 + \frac{(1 - \sigma)\underline{\lambda}}{2}\|w^{k+1}\|^2 \le f(x^k), \qquad (43)$

and

$\|w^{k+1}\| \le \frac{1 + \sqrt{1 - (1 - \sigma)^2}}{\underline{\lambda}(1 - \sigma)}\,\|x^{k+1} - x^k\|. \qquad (44)$

For proving (43), use the convexity of $f$ and inclusion (41) to obtain

$f(x^k) \ge f(x^{k+1}) + \langle x^k - x^{k+1}, w^{k+1}\rangle.$

Using the above inequality, the algebraic identity

$\langle x^k - x^{k+1}, w^{k+1}\rangle = \frac{1}{2\lambda_k}\left[\|\lambda_k w^{k+1}\|^2 + \|x^{k+1} - x^k\|^2 - \|\lambda_k w^{k+1} + x^{k+1} - x^k\|^2\right],$

and (42), we obtain

$f(x^k) \ge f(x^{k+1}) + \frac{1 - \sigma}{2}\left[\lambda_k\|w^{k+1}\|^2 + \frac{1}{\lambda_k}\|x^{k+1} - x^k\|^2\right]. \qquad (45)$

Combining this inequality with the assumption $\underline{\lambda} \le \lambda_k \le \overline{\lambda}$ yields (43).
Inequality (44) follows from Lemma 4.1, inequality (42), and the assumption $\underline{\lambda} \le \lambda_k \le \overline{\lambda}$.
Theorem 4.3. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper lower semicontinuous convex function. Assume that $f$ is a KL function which is bounded from below, and let $(x^k)_{k \in \mathbb{N}}$ be a sequence generated by Algorithm 2bis.
If $(x^k)_{k \in \mathbb{N}}$ is bounded, then it converges to a minimizer of $f$ and the sequence of values $f(x^k)$ converges to the program value $\min f$. Moreover, the sequence has finite length, i.e. $\sum_k \|x^{k+1} - x^k\| < +\infty$.

Proof. Since $f$ is bounded from below, it follows from (43) and $\sigma < 1$ that

$\sum_{k=1}^{+\infty} \|x^{k+1} - x^k\|^2 < +\infty, \qquad \sum_{k=1}^{+\infty} \|w^k\|^2 < +\infty.$

Therefore

$w^k \to 0 \text{ as } k \to +\infty.$

Since $(x^k)$ has been assumed to be bounded, there exists a subsequence $(x^{k_j})$ which converges to some $\bar{x}$. By (43) and $\sigma < 1$, we also see that the sequence $(f(x^k))$ is decreasing. From this property and the lower semicontinuity of $f$, we deduce that

$f(\bar{x}) \le \liminf_{j \to \infty} f(x^{k_j}) = \lim_{k \to \infty} f(x^k).$

Using the convexity of $f$ and the inclusion $w^k \in \partial f(x^k)$ for $k \ge 1$, we obtain

$f(\bar{x}) \ge f(x^{k_j}) + \langle\bar{x} - x^{k_j}, w^{k_j}\rangle, \quad j = 2, 3, \ldots.$

Passing to the limit, as $j \to \infty$, in the above inequality, we conclude that

$f(\bar{x}) \ge \lim_{k \to \infty} f(x^k).$

Therefore

$f(\bar{x}) = \lim_{k \to \infty} f(x^k) = \lim_{j \to \infty} f(x^{k_j}).$

Then use (43), (44), and Theorem 2.9 to obtain the convergence of the sequence $(x^k)$ to $\bar{x}$. From $w^k \in \partial f(x^k)$, $w^k \to 0$ as $k \to +\infty$, and the closedness property of $\partial f$, we deduce that $\partial f(\bar{x}) \ni 0$, which expresses that $\bar{x}$ is a minimizer of $f$.

Remark 4.4. (a) As mentioned in the introduction, many functions encountered in finite-dimensional applications are of semi-algebraic (or tame) nature and are thus KL functions. So are, in particular, many convex functions: this fact was a strong motivation for the above result.
(b) Building a convex function that does not satisfy the Kurdyka-Lojasiewicz property is not easy. It is however possible to do so in dimension 2 (see [18]), but such functions must somehow have a highly oscillatory collection of sublevel sets (a behavior which is unlikely as far as applications are concerned).

5 Inexact forward-backward algorithm

Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper lower semicontinuous function which is bounded from below and which satisfies the Kurdyka-Lojasiewicz property.
We assume that $f$ is a structured function that can be split as

$f = g + h, \qquad (46)$

where $h : \mathbb{R}^n \to \mathbb{R}$ is a $C^1$ function whose gradient $\nabla h$ is Lipschitz continuous (note that this is not a restrictive assumption on $f$; one can for example take $h = 0$ and $g = f$). The Lipschitz constant of $\nabla h$ is denoted by $L$. This kind of structured problem occurs frequently; see for instance [25, 6] and Example 5.4.
We consider sequences generated according to the following algorithm.

Algorithm 3: Take $a, b > 0$ with $a > L$. Take $x^0 \in \operatorname{dom} g$.
For $k = 0, 1, \ldots$, find $x^{k+1} \in \mathbb{R}^n$, $v^{k+1} \in \mathbb{R}^n$ such that

$g(x^{k+1}) + \langle x^{k+1} - x^k, \nabla h(x^k)\rangle + \frac{a}{2}\|x^{k+1} - x^k\|^2 \le g(x^k); \qquad (47)$

$v^{k+1} \in \partial g(x^{k+1}); \qquad (48)$

$\|v^{k+1} + \nabla h(x^k)\| \le b\|x^{k+1} - x^k\|. \qquad (49)$
This section is divided into three distinct parts. In the first part, we recall the classical forward-backward algorithm and explain how Algorithm 3 provides an inexact version of it; the special case of projection methods is also discussed. In the second part, we provide a general convergence result for KL functions. We end the section with illustrations of our results through problems coming from compressive sensing and hard-constrained feasibility problems.

5.1 The forward-backward splitting algorithm for nonconvex functions

Let us further assume that $g$ is bounded from below. Given a sequence of positive parameters $\gamma_k$ that satisfies

$0 < \underline{\gamma} < \gamma_k < \overline{\gamma} < \frac{1}{L},$

where $\underline{\gamma}$ and $\overline{\gamma}$ are given thresholds, the forward-backward splitting algorithm reads

$x^{k+1} \in \operatorname{prox}_{\gamma_k g}\left(x^k - \gamma_k\nabla h(x^k)\right). \qquad (50)$

An important observation here is that the sequence is not uniquely defined, since $\operatorname{prox}_{\gamma_k g}$ may be multivalued; a surprising fact is that this freedom in the choice of the sequence does not impact the convergence properties of the algorithm (see Theorem 5.1).
Let us show how this algorithm fits into the general framework of Algorithm 3. By definition of the proximal operator, we have

$\gamma_k g(x^{k+1}) + \frac{1}{2}\|x^{k+1} - x^k + \gamma_k\nabla h(x^k)\|^2 \le \gamma_k g(x^k) + \frac{1}{2}\|\gamma_k\nabla h(x^k)\|^2,$

which after simplification gives

$\gamma_k g(x^{k+1}) + \frac{1}{2}\|x^{k+1} - x^k\|^2 + \gamma_k\langle\nabla h(x^k), x^{k+1} - x^k\rangle \le \gamma_k g(x^k).$

Thus

$g(x^{k+1}) + \langle\nabla h(x^k), x^{k+1} - x^k\rangle + \frac{1}{2\gamma_k}\|x^{k+1} - x^k\|^2 \le g(x^k),$

so that the sufficient-decrease condition (47) holds; this is precisely where we use $\overline{\gamma} < \frac{1}{L}$.
Writing down the optimality condition yields

$\gamma_k v^{k+1} + \gamma_k\nabla h(x^k) + x^{k+1} - x^k = 0,$

where $v^{k+1} \in \partial g(x^{k+1})$. Dividing by $\gamma_k$, we end up with

$\|v^{k+1} + \nabla h(x^k)\| = \frac{1}{\gamma_k}\|x^{k+1} - x^k\| \le \frac{1}{\underline{\gamma}}\|x^{k+1} - x^k\|,$

which is the inexact optimality condition announced in (49).
As for the proximal algorithm, the inexact version offers some flexibility in the choice of $x^{k+1}$ by relaxing both the descent condition and the optimality conditions.
Gradient projection algorithm. Let us specialize the forward-backward splitting algorithm to functions of the form $i_C + h$ (where $C$ is a nonempty closed subset of $\mathbb{R}^n$). For all positive $\lambda$, we have the elementary equality

$\operatorname{prox}_{\lambda i_C} x = P_C(x).$

We thus find the nonconvex nonsmooth gradient projection method

$x^{k+1} \in P_C\left(x^k - \gamma_k\nabla h(x^k)\right). \qquad (51)$
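A minimal sketch of (51), with $C$ the unit circle (a closed semi-algebraic set) and $h$ a least-squares term; the matrix, vector, and starting point are a toy instance of our own:

    import numpy as np

    # h(x) = 0.5 * ||Ax - b||^2; grad h is L-Lipschitz with
    # L = largest eigenvalue of A^T A.
    A = np.array([[2.0, 0.0], [1.0, 1.0]])
    b = np.array([1.0, 0.0])
    L = np.linalg.eigvalsh(A.T @ A).max()

    def P_circle(x):
        return x / np.linalg.norm(x)   # projection onto the unit circle

    gamma = 0.5 / L                    # stepsize strictly below 1/L
    x = P_circle(np.array([1.0, 1.0]))
    for _ in range(500):
        x = P_circle(x - gamma * (A.T @ (A @ x - b)))
    print(x)  # a point x* with 0 in grad h(x*) + N_C(x*), cf. Theorem 5.3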

5.2 Convergence of an inexact forward-backward splitting algorithm

Let us now return to the general inexact forward-backward splitting Algorithm 3 and show the following convergence result.

Theorem 5.1 (Nonconvex nonsmooth forward-backward splitting). Let $f = g + h : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper lower semicontinuous KL function which is bounded from below. Assume further that $h : \mathbb{R}^n \to \mathbb{R}$ is finite-valued and differentiable with an $L$-Lipschitz continuous gradient, and that the restriction of $g$ to its domain is continuous.
If $(x^k)_{k \in \mathbb{N}}$ is a bounded sequence generated by Algorithm 3, then it converges to some critical point of $f = g + h$.
Moreover, the sequence $(x^k)_{k \in \mathbb{N}}$ has finite length, i.e. $\sum_k \|x^{k+1} - x^k\| < +\infty$.

Proof. Using the descent lemma for the C 1 function h at xk+1 and xk , and the sufficient
decrease property (47) of Algorithm 3, we obtain
a − L k+1
g(xk+1 ) + h(xk+1 ) + kx − xk k2 ≤g(xk+1 ) + h(xk )
2
a
+ hxk+1 − xk , ∇h(xk )i + kxk+1 − xk k2
2
k k
≤g(x ) + h(x ).

Therefore, setting
ã = a − L > 0
we have

f (xk+1 ) + kxk+1 − xk k2 ≤ f (xk ). (52)
2
Define
wk+1 = v k+1 + ∇h(xk+1 ).
The classical derivation rule for the sum, see [54], and property (48) of Algorithm 3 yield

wk+1 ∈ ∂f (xk+1 ). (53)

Moreover, by property (49) of Algorithm 3, and the triangle inequality, we obtain

kwk+1 k ≤kv k+1 + ∇h(xk )k + k∇h(xk+1 ) − ∇h(xk )k


≤bkxk+1 − xk k + Lkxk+1 − xk k.

We are thus precisely in the case examined in Theorem 4.2 (functions that are continuous
on their domain), and the conclusion follows.

Remark 5.2. (a) For the exact forward-backward splitting algorithm the continuity assumption on g is unnecessary. Indeed, in that case we have, for all u in $\mathbb{R}^n$,
$$\gamma_k g(x^{k+1}) + \frac{1}{2}\|x^{k+1} - x^k + \gamma_k \nabla h(x^k)\|^2 \le \gamma_k g(u) + \frac{1}{2}\|u - x^k + \gamma_k \nabla h(x^k)\|^2,$$
so that
$$g(x^{k+1}) + \frac{1}{2\gamma_k}\|x^{k+1} - x^k\|^2 + \langle x^{k+1} - x^k, \nabla h(x^k)\rangle \le g(u) + \frac{1}{2\gamma_k}\|u - x^k\|^2 + \langle u - x^k, \nabla h(x^k)\rangle. \qquad (54)$$
Let $x^{k_j}$ be a subsequence of $x^k$ which converges to $\bar{x}$. Take $u = \bar{x}$, $k = k_j$ in (54) and let
$j \to +\infty$. Since $x^{k+1} - x^k \to 0$, we obtain
$$\limsup_{j\to+\infty}\, g(x^{k_j+1}) \le g(\bar{x}),$$
and, since g is lower semicontinuous, $\lim_{j\to+\infty} g(x^{k_j+1}) = g(\bar{x})$. The end of the proof is the same
as that of Theorem 5.1.
(b) Forward-backward splitting algorithms have many applications to parallel splitting of
coupled systems. For applications involving monotone operators one may consult [6].

An important consequence of the above result is a general convergence result for gradient
projection methods.

Theorem 5.3 (Nonconvex gradient projection method). Let $h : \mathbb{R}^n \to \mathbb{R}$ be a differentiable
function whose gradient is L-Lipschitz continuous, and C a nonempty closed subset of $\mathbb{R}^n$.
Being given $\epsilon \in (0, \frac{1}{2L})$ and a sequence of stepsizes $\gamma_k$ such that $\epsilon < \gamma_k < \frac{1}{L} - \epsilon$, we consider
a sequence $(x^k)_{k\in\mathbb{N}}$ that complies with
$$x^{k+1} \in P_C\big(x^k - \gamma_k \nabla h(x^k)\big), \quad \text{with } x^0 \in C.$$
If the function $h + i_C$ is a KL function and if $(x^k)_{k\in\mathbb{N}}$ is bounded, then the sequence $(x^k)_{k\in\mathbb{N}}$
converges to a point $x^*$ in C such that
$$\nabla h(x^*) + N_C(x^*) \ni 0.$$

Proof. It is a direct consequence of Remark 5.2 (a).

As mentioned in the Introduction and in Section 2.2, the assumption that $h + i_C$ is a KL
function is very general. For instance, when h is $C^1$ semi-algebraic and C is a nonempty closed
semi-algebraic set, $h + i_C$ is a KL function and the above result applies. Let us also emphasize
here that our convergence result, contrary to those of Theorem 3.5 and [42], does not rely on
any regularity properties of the set C (in the sense of variational analysis). In particular, C
need not be prox-regular, so that the projection mapping $P_C$ may fail to be single-valued on
every neighborhood of C.
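As an illustration of Theorem 5.3 (ours, under the stated assumptions), consider minimizing $h(x) = \frac{1}{2}\|Ax - b\|^2$ over the unit sphere, a closed semi-algebraic nonconvex set whose projection $y \mapsto y/\|y\|$ is multivalued at the origin; a minimal Python sketch:

```python
import numpy as np

def project_sphere(y):
    """Projection onto the unit sphere {x : ||x|| = 1}; multivalued at 0
    (any unit vector is a projection there), so we simply pick e_1 in that case."""
    n = np.linalg.norm(y)
    if n == 0.0:
        e1 = np.zeros_like(y)
        e1[0] = 1.0
        return e1
    return y / n

def gradient_projection(A, b, max_iter=1000, tol=1e-12):
    """Scheme of Theorem 5.3 for h(x) = 0.5*||Ax - b||^2 and C = unit sphere."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad h = A^T(Ax - b)
    gamma = 0.9 / L                      # stepsize within (epsilon, 1/L - epsilon)
    x = project_sphere(np.ones(A.shape[1]))
    for _ in range(max_iter):
        x_new = project_sphere(x - gamma * (A.T @ (A @ x - b)))
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```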

5.3 Examples

Example 5.4 (Forward-backward splitting for compressive sensing). (a) The central issue
in compressive sensing is to recover sparse solutions of under-determined linear systems (see
[29]). The model problem is the following:
$$(P) \quad \min\{\|x\|_0 : Ax = b\},$$
where $\|\cdot\|_0$ is the counting norm (or $\ell^0$ norm), $A \neq 0$ is an $m \times n$ real matrix, and $b \in \mathbb{R}^m$.
We recall that for x in $\mathbb{R}^n$, $\|x\|_0$ is the number of nonzero components of x.
As in [11], we proceed in the spirit of Tikhonov regularization for the least squares method.
Fix a parameter $\lambda > 0$. We aim at solving the nonsmooth nonconvex problem
$$(P') \quad \min\Big\{\lambda\|x\|_0 + \frac{1}{2}\|Ax - b\|^2\Big\}.$$
If we set $g(x) = \lambda\|x\|_0$ and $h(x) = \frac{1}{2}\|Ax - b\|^2$, it is straightforward to check that the
topological assumptions of Remark 5.2(a) are satisfied (observe indeed that $\|\cdot\|_0$ is lower
semicontinuous). To see that $g + h$ is a KL function, we simply note that h is a polynomial
function and that $\|\cdot\|_0$ has a piecewise linear graph, hence the sum $g + h$ is semi-algebraic.
Consider now the proximal operator $\mathrm{prox}_{\gamma\lambda\|\cdot\|_0}$ (see footnote (1)). When $n = 1$, the counting norm is
denoted by $|\cdot|_0$; in that case, comparing the value $\gamma\lambda$ of the objective $x \mapsto \gamma\lambda|x|_0 + \frac{1}{2}(x - u)^2$ at $x = u$ with its value $\frac{1}{2}u^2$ at $x = 0$, one easily establishes that
$$\mathrm{prox}_{\gamma\lambda|\cdot|_0} u = \begin{cases} u & \text{if } |u| > \sqrt{2\gamma\lambda},\\ \{0, u\} & \text{if } |u| = \sqrt{2\gamma\lambda},\\ 0 & \text{otherwise.} \end{cases}$$
When n is arbitrary, trivial algebraic manipulations yield, with $u = (u_1, \ldots, u_n) \in \mathbb{R}^n$,
$$\mathrm{prox}_{\gamma\lambda\|\cdot\|_0} u = \big(\mathrm{prox}_{\gamma\lambda|\cdot|_0} u_1, \ldots, \mathrm{prox}_{\gamma\lambda|\cdot|_0} u_n\big),$$
and thus $\mathrm{prox}_{\gamma\lambda\|\cdot\|_0}$ is a perfectly known object.


Let $\|\cdot\|_F$ denote the Frobenius norm of a matrix. Applying the previous result (Remark 5.2(a)) to the bounded sequences generated by the thresholding process
$$x^{k+1} \in \mathrm{prox}_{\gamma_k\lambda\|\cdot\|_0}\big(x^k - \gamma_k(A^T A x^k - A^T b)\big),$$
where $0 < \underline{\gamma} < \gamma_k < \overline{\gamma} < \frac{1}{\|A^T A\|_F}$, shows that the sequence $(x^k)_{k\in\mathbb{N}}$ converges to a critical
point of $\lambda\|x\|_0 + \frac{1}{2}\|Ax - b\|^2$, i.e. towards a point $x^*$ that satisfies
$$(A^T A x^*)_i = (A^T b)_i$$
for all i such that $x^*_i \neq 0$.
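For concreteness, the following sketch (ours, not code from [11, 12]) implements this iterative hard-thresholding process, with the explicit formula for $\mathrm{prox}_{\gamma\lambda\|\cdot\|_0}$ obtained above; in the ambiguous case $|u_i| = \sqrt{2\gamma\lambda}$ it keeps $u_i$, one of the two admissible values.

```python
import numpy as np

def hard_threshold(u, gamma, lam):
    """prox_{gamma*lam*||.||_0}: zero the entries with |u_i| < sqrt(2*gamma*lam),
    keep the others (at equality we keep u_i, one of the two admissible values)."""
    out = u.copy()
    out[np.abs(u) < np.sqrt(2.0 * gamma * lam)] = 0.0
    return out

def iht(A, b, lam, max_iter=2000, tol=1e-10):
    """Iterative hard thresholding for lam*||x||_0 + 0.5*||Ax - b||^2."""
    gamma = 0.9 / np.linalg.norm(A.T @ A, 'fro')   # stepsize below 1/||A^T A||_F
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        x_new = hard_threshold(x - gamma * (A.T @ (A @ x - b)), gamma, lam)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```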


As mentioned in the introduction, these results offer a complementary view to the theoretical developments of [11, 12]. They also provide at the same time a very general convergence
result which can be immediately generalized to compressive sensing problems involving semi-algebraic or real-analytic nonlinear measurements.
(1) Recall that $\mathrm{prox}_{\gamma\lambda\|\cdot\|_0} = \mathrm{prox}_{(\gamma\lambda)\|\cdot\|_0} = \mathrm{prox}_{\gamma(\lambda\|\cdot\|_0)}$.

(b) Alternative approaches to (P) are based on the following approximation:
$$(P'') \quad \min\Big\{\lambda\|x\|_p + \frac{1}{2}\|Ax - b\|^2\Big\},$$
where p is in (0, 1) and $\|x\|_p = \sum_{i=1}^n |x_i|^p$ (see [21]). Some encouraging numerical results have
been reported in [21]. In [19] some theoretical results in the framework of Hilbert spaces
are announced but, even when the space is finite dimensional, no convergence result and no
estimate result are provided.
Using the separable structure of $\|\cdot\|_p$, the computation of the proximal operator $\mathrm{prox}_{\gamma\lambda\|\cdot\|_p}$
can be reduced to the following one-dimensional minimization problem: for $u \in \mathbb{R}$, find x solution of
$$\min\big\{2\gamma\lambda|x|^p + (x - u)^2 : x \in \mathbb{R}\big\},$$
which can be solved numerically by standard methods. Thus the forward-backward splitting
algorithm may be run in a simple way. To obtain convergence, the only nontrivial fact that
has to be checked is that $f = \lambda\|x\|_p + \frac{1}{2}\|Ax - b\|^2$ is a KL function. For this, we recall
that there exists a (polynomially bounded) o-minimal structure that contains the family of
functions $\{x^\alpha : x > 0, \alpha \in \mathbb{R}\}$ and restricted analytic functions (see [31, Example (5), p. 505
and Property 5.2, p. 513]). As a consequence, the results of [17] apply and f is a KL function
with a desingularizing function of the form $\varphi(s) = cs^{1-\theta}$, where $c > 0$ and $\theta \in [0, 1)$. Hence the
previous convergence and estimate results apply to the algorithm
$$x^{k+1} \in \mathrm{prox}_{\gamma_k\lambda\|\cdot\|_p}\big(x^k - \gamma_k(A^T A x^k - A^T b)\big),$$
and to its inexact counterparts (note that $g(\cdot) = \lambda\|\cdot\|_p$ is continuous and that $\gamma_k$ is taken as
in remark (a) above).
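As an illustration of the scalar subproblem (our sketch, not code from [21]): the minimizer is either 0 or shares the sign of u with magnitude at most |u|, so a bounded scalar search suffices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox_lp_scalar(u, gamma, lam, p):
    """One element of prox_{gamma*(lam*|.|^p)}(u), i.e. a minimizer of
    gamma*lam*|x|^p + 0.5*(x - u)^2 (equivalently 2*gamma*lam*|x|^p + (x - u)^2)."""
    if u == 0.0:
        return 0.0
    s, a = np.sign(u), abs(u)
    obj = lambda t: gamma * lam * t ** p + 0.5 * (t - a) ** 2   # t = |x|, 0 <= t <= a
    res = minimize_scalar(obj, bounds=(0.0, a), method='bounded')
    # compare the best interior candidate with the candidate x = 0
    return s * res.x if obj(res.x) < 0.5 * a ** 2 else 0.0
```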

Example 5.5 (Hard-constrained feasibility problems). Let $F, F_1, \ldots, F_p$ be a finite collection
of nonempty closed subsets of $\mathbb{R}^n$, and assume that $F_1, \ldots, F_p$ are convex. The hard
constraint F is not supposed to be convex. We consider the following minimization problem:
$$\min\Big\{\frac{1}{2}\sum_{i=1}^p \omega_i\, \mathrm{dist}(x, F_i)^2 : x \in F\Big\},$$
where the $\omega_i$ are positive constants such that $\sum_i \omega_i = 1$. By applying the forward-backward
splitting algorithm to this problem, we aim at finding a point which satisfies the hard constraint modelled by F, while the other constraints are satisfied in a possibly weaker sense
(see [25] and references therein). Set
$$h(x) = \frac{1}{2}\sum_{i=1}^p \omega_i\, \mathrm{dist}(x, F_i)^2.$$
By a standard convex analysis result, each function $h_i(x) = \frac{1}{2}\mathrm{dist}(x, F_i)^2$ is convex and $C^1$, and
its gradient, equal to $\nabla h_i(x) = x - P_{F_i}(x)$, is Lipschitz continuous with Lipschitz constant
equal to 1. By convex combination, the same properties hold true for h, and we can take
$L = 1$ as a Lipschitz constant of $\nabla h$.
Thus the forward-backward splitting algorithm (50) (here a gradient-projection method) reads:

Take $0 < \underline{\theta} < \overline{\theta} < 1$ and $x^0 \in \mathbb{R}^n$. For $k = 0, 1, \ldots$,
$$x^{k+1} \in P_F\Big((1 - \theta_k)\,x^k + \theta_k \sum_{i=1}^p \omega_i P_{F_i}(x^k)\Big), \qquad (55)$$
where $\theta_k \in [\underline{\theta}, \overline{\theta}]$ (here $\theta_k$ plays the role of the stepsize $\gamma_k$, and the weights $\omega_i$ come from $\nabla h(x) = x - \sum_i \omega_i P_{F_i}(x)$).
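A minimal Python sketch of iteration (55) (our illustration; proj_F, the list projs, and the weights are assumed user-supplied, with proj_F returning one element of the possibly multivalued projection onto F):

```python
import numpy as np

def hard_feasibility(x0, proj_F, projs, weights, theta=0.5, max_iter=1000, tol=1e-10):
    """Iteration (55): x_{k+1} in P_F((1 - theta)*x_k + theta * sum_i w_i * P_{F_i}(x_k)),
    with theta in (0, 1)."""
    x = proj_F(np.asarray(x0, dtype=float))
    for _ in range(max_iter):
        avg = sum(w * P(x) for w, P in zip(weights, projs))
        x_new = proj_F((1.0 - theta) * x + theta * avg)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```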


Let us consider successively the convergence properties of this algorithm in two different
situations, based respectively on the concepts of semi-algebraic sets and of linearly regular
intersection.

Theorem 5.6. Assume that the sets $F, F_1, \ldots, F_p$ are semi-algebraic. Let $(x^k)_{k\in\mathbb{N}}$ be a
sequence generated by the forward-backward splitting algorithm (55). If $(x^k)_{k\in\mathbb{N}}$ is bounded
and $x^0$ is sufficiently close to the intersection of the sets $F, F_1, \ldots, F_p$, then the sequence
$(x^k)_{k\in\mathbb{N}}$ converges to a point which lies in that intersection.
Proof. The proof relies on the fact that the underlying function
$$f(x) = i_F(x) + \frac{1}{2}\sum_{i=1}^p \omega_i\, \mathrm{dist}(x, F_i)^2, \quad x \in \mathbb{R}^n, \qquad (56)$$
is a KL function. This follows immediately from the fact that the distance function to a semi-algebraic set is semi-algebraic (see Lemma 2.3), and hence f satisfies the KL inequality.
Applying Theorem 5.1 gives the finite length property and the convergence of the sequence
$(x^k)_{k\in\mathbb{N}}$ to a critical point of f. By direct application of the local convergence to global
minima result (Theorem 2.12), we obtain the convergence of the sequence $(x^k)_{k\in\mathbb{N}}$ to a point
which lies in the intersection of the sets $F, F_1, \ldots, F_p$.
Let us now consider the KL analysis in the linearly regular intersection case (see the definition
in Remark 3.6). To this end, we will use the following result [42, Proposition 8.5] (itself based
on a characterization given in [37]).

Lemma 5.7. Let $C_1, \ldots, C_m$ be closed subsets of $\mathbb{R}^n$ whose intersection is nonempty, and let
$\bar{x} \in \cap_i C_i$. Assume that the intersection of $C_1, \ldots, C_m$ is linearly regular at $\bar{x}$. Then there
exists a positive constant α such that, for each x sufficiently close to $\bar{x}$, we have
$$\alpha\sqrt{\sum_{i=1}^m \|y_i\|^2} \le \Big\|\sum_{i=1}^m y_i\Big\|, \qquad \forall (y_1, \ldots, y_m) \in N_{C_1}(x) \times \ldots \times N_{C_m}(x). \qquad (57)$$

We shall see below that this property entails that the function
$$\tilde{f}(x) = i_F(x) + \frac{1}{2}\sum_{i=1}^p \mathrm{dist}(x, F_i)^2, \quad x \in \mathbb{R}^n, \qquad (58)$$
satisfies the KL inequality. Thus, in that case, we are led to consider the algorithm obtained
by applying to $\tilde{f}$ the forward-backward splitting algorithm (50), which gives:
Take $0 < \underline{\theta} < \overline{\theta} < \frac{1}{p}$ and $x^0 \in \mathbb{R}^n$. For $k = 0, 1, \ldots$,
$$x^{k+1} \in P_F\Big((1 - p\,\theta_k)\,x^k + \theta_k \sum_{i=1}^p P_{F_i}(x^k)\Big), \qquad (59)$$
where $\theta_k \in [\underline{\theta}, \overline{\theta}]$ (the factor p reflects the fact that the gradient of the smooth part of $\tilde{f}$ is $x \mapsto p\,x - \sum_i P_{F_i}(x)$, with Lipschitz constant p).

29
Theorem 5.8. Assume that the sets $F, F_1, \ldots, F_p$ have a linearly regular intersection around
a point $\bar{x}$, and that one of them is compact. If $x^0$ is sufficiently close to $\bar{x}$, then the
sequence $(x^k)_{k\in\mathbb{N}}$ generated by algorithm (59) converges to a point which lies in the intersection of the sets $F, F_1, \ldots, F_p$.

Proof. The convergence proof can be obtained as in Theorem 5.1, by using Theorem 2.12.
We simply need to verify that the function $\tilde{f}$, as defined by (58), satisfies the KL inequality.
Let K be a compact neighborhood of $\bar{x}$ on which (57) holds, and take x in K; we have
$$\partial\tilde{f}(x) = N_F(x) + \sum_{i=1}^p \big(x - P_{F_i}(x)\big).$$
For each $i = 1, \ldots, p$, set $y_i = x - P_{F_i}(x)$ and observe that $y_i \in N_{F_i}(x)$. If x is in $\mathrm{dom}\,\partial\tilde{f}$
(i.e., $\partial\tilde{f}(x) \neq \emptyset$), use Lemma 5.7 and inequality (57) to obtain
$$\mathrm{dist}\big(0, \partial\tilde{f}(x)\big) = \min\Big\{\Big\|z + \sum_{i=1}^p y_i\Big\| : z \in N_F(x)\Big\} \ge \alpha \min\Big\{\sqrt{\|z\|^2 + \sum_{i=1}^p \|y_i\|^2} : z \in N_F(x)\Big\} \ge \alpha\sqrt{\sum_{i=1}^p \|y_i\|^2} \ge c\,\tilde{f}(x)^{1/2},$$
where c is a positive constant; for the last inequality, note that $\|y_i\| = \mathrm{dist}(x, F_i)$ and
$i_F(x) = 0$ whenever $\partial\tilde{f}(x) \neq \emptyset$, so that $\sum_{i=1}^p \|y_i\|^2 = 2\tilde{f}(x)$ and one may take $c = \alpha\sqrt{2}$.
This shows that $\tilde{f}$ is a KL function.

6 An inexact regularized Gauss-Seidel method


Fix an integer $p \ge 2$, and let $n_1, \ldots, n_p$ be positive integers. The current vector x belongs to
the product space $\mathbb{R}^{n_1} \times \ldots \times \mathbb{R}^{n_p}$; it is denoted by $x = (x_1, \ldots, x_p)$, where each $x_i$ belongs to
$\mathbb{R}^{n_i}$. We are concerned with the minimization of functions $f : \mathbb{R}^{n_1} \times \ldots \times \mathbb{R}^{n_p} \to \mathbb{R} \cup \{+\infty\}$
having the following structure:
$$f(x) = Q(x_1, \ldots, x_p) + \sum_{i=1}^p f_i(x_i), \qquad (60)$$
where Q is a $C^1$ function with locally Lipschitz continuous gradient, and $f_i : \mathbb{R}^{n_i} \to \mathbb{R} \cup \{+\infty\}$
is a proper lower semicontinuous function, $i = 1, 2, \ldots, p$.
For each i in $\{1, \ldots, p\}$, we consider a bounded sequence of symmetric positive definite
matrices $(B_i^k)_{k\in\mathbb{N}}$ of size $n_i$. We assume that the eigenvalues of the matrices $\{B_i^k : k \in \mathbb{N},\, i \in \{1, \ldots, p\}\}$ are bounded away from zero.
Our model algorithm is the following.

A proximal modification of the Gauss-Seidel method (see [8]). Take a starting point
$x^0 = (x_1^0, \ldots, x_p^0)$ in $\mathbb{R}^{n_1} \times \ldots \times \mathbb{R}^{n_p}$ and consider the alternating minimizing procedure:
for $x^k$ given in $\mathbb{R}^{n_1} \times \ldots \times \mathbb{R}^{n_p}$, construct $x^{k+1}$ as follows.
$$x_1^{k+1} \in \mathrm{argmin}\Big\{f(u_1, x_2^k, \ldots, x_p^k) + \frac{1}{2}\langle B_1^k(u_1 - x_1^k), u_1 - x_1^k\rangle : u_1 \in \mathbb{R}^{n_1}\Big\}. \qquad (61)$$
Successively, for $i = 2, \ldots, p-1$:
$$x_i^{k+1} \in \mathrm{argmin}\Big\{f(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, u_i, x_{i+1}^k, \ldots, x_p^k) + \frac{1}{2}\langle B_i^k(u_i - x_i^k), u_i - x_i^k\rangle : u_i \in \mathbb{R}^{n_i}\Big\}; \qquad (62)$$
$$x_p^{k+1} \in \mathrm{argmin}\Big\{f(x_1^{k+1}, \ldots, x_{p-1}^{k+1}, u_p) + \frac{1}{2}\langle B_p^k(u_p - x_p^k), u_p - x_p^k\rangle : u_p \in \mathbb{R}^{n_p}\Big\}. \qquad (63)$$
Set $x^{k+1} = (x_1^{k+1}, \ldots, x_p^{k+1})$.
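For illustration, here is a minimal Python sketch of the scheme (61)-(63) (ours), with the simple choice $B_i^k = \alpha I$ and assumed user-supplied exact block minimizers:

```python
import numpy as np

def proximal_gauss_seidel(x0_blocks, block_argmin, alpha=1.0, max_iter=200, tol=1e-10):
    """Scheme (61)-(63) with B_i^k = alpha*I, so each subproblem adds
    (alpha/2)*||u - x_i^k||^2 to the partial minimization of f in block i.

    block_argmin[i](blocks, alpha) must return a minimizer over u of
        f(blocks[0], ..., blocks[i-1], u, blocks[i+1], ...) + (alpha/2)*||u - blocks[i]||^2.
    """
    blocks = [np.asarray(b, dtype=float).copy() for b in x0_blocks]
    for _ in range(max_iter):
        old = [b.copy() for b in blocks]
        for i in range(len(blocks)):
            blocks[i] = block_argmin[i](blocks, alpha)   # blocks j < i are already updated
        step = np.sqrt(sum(np.linalg.norm(b - o) ** 2 for b, o in zip(blocks, old)))
        if step < tol:                                   # ||x_{k+1} - x_k|| small: stop
            break
    return blocks
```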

Remark 6.1. When $B_i^k = 0$ for all integers i and k (which is not allowed in our framework),
one recovers the classical Gauss-Seidel method. When $B_i^k = \alpha_k I$, where $\alpha_k$ is a positive real
number and I is the identity matrix, we recover the exact methods studied in [8, 32, 5].
Some ingredients of our approach (like the sufficient-decrease condition) can be found in [60],
where a block-coordinate relaxation method with proximal linearized subproblems is used for
solving similarly structured minimization problems.

Let us now introduce an inexact version of the above alternating method.


Algorithm 4. Take $0 < \underline{\lambda} < \overline{\lambda} < \infty$.
For each i in $\{1, \ldots, p\}$, take a sequence of symmetric positive definite matrices $(A_i^k)_{k\in\mathbb{N}}$ of
size $n_i$ such that the eigenvalues of each $A_i^k$ ($k \in \mathbb{N}$, $i \in \{1, \ldots, p\}$) lie in $[\underline{\lambda}, \overline{\lambda}]$.
Take some positive parameters $b_i$ ($i = 1, \ldots, p$).
Take a starting point $x^0 = (x_1^0, \ldots, x_p^0)$ in $\mathbb{R}^{n_1} \times \ldots \times \mathbb{R}^{n_p}$.
For $k = 0, 1, \ldots$, find $x^{k+1}$ and $v^{k+1} \in \mathbb{R}^{n_1} \times \ldots \times \mathbb{R}^{n_p}$ such that
$$f_i(x_i^{k+1}) + Q(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x_i^{k+1}, x_{i+1}^k, \ldots, x_p^k) + \frac{1}{2}\langle A_i^k(x_i^{k+1} - x_i^k), x_i^{k+1} - x_i^k\rangle \le f_i(x_i^k) + Q(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x_i^k, \ldots, x_p^k); \qquad (64)$$
$$v_i^{k+1} \in \partial f_i(x_i^{k+1}); \qquad (65)$$
$$\|v_i^{k+1} + \nabla_{x_i} Q(x_1^{k+1}, \ldots, x_i^{k+1}, x_{i+1}^k, \ldots, x_p^k)\| \le b_i\,\|x_i^{k+1} - x_i^k\|, \qquad (66)$$
where i ranges over $\{1, \ldots, p\}$.


Elementary computations show that the model algorithm (61)-(62)-(63) is a special in-
stance of Algorithm 4.

Convergence analysis of the regularized Gauss-Seidel method


Define
$$w^{k+1} = \big(v_i^{k+1} + \nabla_{x_i} Q(x_1^{k+1}, \ldots, x_p^{k+1})\big)_{i=1,\ldots,p} \in \mathbb{R}^{n_1} \times \ldots \times \mathbb{R}^{n_p}.$$
Using the differentiation rules for separable functions, we obtain
$$w^{k+1} \in \partial f(x^{k+1}). \qquad (67)$$

Assume that the sequence $(x^k)_{k\in\mathbb{N}}$ is bounded, and denote by L the Lipschitz constant of
$\nabla Q$ on a product of balls $\bar{B}_1 \times \ldots \times \bar{B}_p$ containing the sequence $(x^k)_{k\in\mathbb{N}}$. For all $i = 1, \ldots, p$,
we have
$$\|v_i^{k+1} + \nabla_{x_i} Q(x_1^{k+1}, \ldots, x_p^{k+1})\| \le \|v_i^{k+1} + \nabla_{x_i} Q(x_1^{k+1}, \ldots, x_i^{k+1}, x_{i+1}^k, \ldots, x_p^k)\| + \|\nabla_{x_i} Q(x_1^{k+1}, \ldots, x_i^{k+1}, x_{i+1}^k, \ldots, x_p^k) - \nabla_{x_i} Q(x_1^{k+1}, \ldots, x_p^{k+1})\| \le b_i\,\|x_i^{k+1} - x_i^k\| + L\,\|x^{k+1} - x^k\|.$$
Therefore, for some $M > 0$,
$$\|w^{k+1}\| \le M\,\|x^{k+1} - x^k\|. \qquad (68)$$
Summing the inequalities (64) from $i = 1$ to $i = p$ (the Q-terms telescope), and using the
inequalities $\underline{\lambda}\|u\|^2 \le \langle A_i^k u, u\rangle$ for all integers i and k, we conclude that
$$f(x^{k+1}) + \frac{\underline{\lambda}}{2}\|x^{k+1} - x^k\|^2 \le f(x^k).$$

We are in position to apply Theorem 2.9 to obtain

Theorem 6.2 (Proximal regularization of the Gauss-Seidel method). Assume that f defined in
(60) is a KL function which is bounded from below. Let $(x^k)_{k\in\mathbb{N}}$ be a sequence generated by
Algorithm 4. If $(x^k)_{k\in\mathbb{N}}$ is bounded, then it converges to some critical point $\bar{x}$ of f.
Moreover, the sequence $(x^k)_{k\in\mathbb{N}}$ has finite length, i.e. $\sum_k \|x^{k+1} - x^k\| < +\infty$.

Remark 6.3 (Convex minimization). Observe that this result is new even in the context of
convex optimization, where this problem was first considered (see the seminal work [8] and
the recent study [4]). Indeed, it allows one both to choose a general smooth convex coupling term
Q and to adapt the geometry of the proximal operators (through the choice of a metric $A_i^k$)
to the geometry of the problem. Since a convex function has at most one
critical value, the bounded sequences generated by the above algorithms converge to a global
minimizer.

7 Conclusion
Very often, iterative minimization algorithms rely on inexact solution of minimization sub-
problems, whose exact solution may be almost as difficult to obtain as the solution of the
original minimization problem.
Even when the minimization subproblems can be solved with high accuracy, their solutions
are mere approximations of the solution of the original problem. In such cases, over-solving
the minimization subproblems would increase the computational burden of the method, and
may slow down the final computation of a good approximation of the solution. On the
other hand, under-solving the minimization subproblems may result in a breakdown of the
algorithm, and convergence to a solution may be lost.
In this paper we gave a theoretical basis for the application of numerical methods for minimizing a class of functions (those satisfying the KL inequality). In particular, our abstract
scheme was designed to handle relative errors, because practical methods always involve numerical approximation, e.g., the representation of a real number by floating-point numbers
with a fixed byte-length. We provided practical examples where an approximate solution of
the minimization subproblems within the proposed error tolerance can be obtained in a single step.
We also supplied stopping criteria for the solution of the minimization subproblems
in general.
The computational implementation of the methods analyzed in this paper, as well as of these
stopping rules, is a topic for future research.

References
[1] Absil, P.-A., Mahony, R. , Andrews, B., Convergence of the iterates of descent methods
for analytic cost functions, SIAM J. Optim., 16, no. 2, (2005), 531–547.

[2] Aragon, A., Dontchev, A. , Geoffroy, M., Convergence of the proximal point method for
metrically regular mappings, ESAIM Proc., 17, EDP Sci., Les Ulis, (2007), 1–8.

[3] Attouch, H., Bolte, J., On the convergence of the proximal algorithm for nonsmooth func-
tions involving analytic features, Math. Program., Ser. B, 116 (2009), 5-16.

[4] Attouch, H., Bolte, J., Redont, P., Soubeyran, A., Alternating Proximal Algorithms for
Weakly Coupled Convex Minimization Problems. Applications to Dynamical Games and PDE’s,
Journal of Convex Analysis 15 (2008), 485–506

[5] Attouch, H., Bolte, J., Redont, P., Soubeyran, A. Proximal alternating minimization
and projection methods for nonconvex problems. An approach based on the Kurdyka-Lojasiewicz
inequality, Mathematics of Operations Research, 35, no. 2, (2010), 438-457.

[6] Attouch, H., Briceño-Arias, L.M., Combettes, P.L. A parallel splitting method for cou-
pled monotone inclusions, SIAM J. Control Optim., 48, no. 5, (2010), 3246-3270.

[7] Attouch, H., Soubeyran, A. Local search proximal algorithms as decision dynamics with
costs to move, Set Valued and Variational Analysis, Online First, 12 May 2010.

[8] Auslender, A., Asymptotic properties of the Fenchel dual functional and applications to de-
composition problems, J. Optim. Theory Appl., 73 (1992), 427–449.

[9] Beck, A., Teboulle, M., A Linearly Convergent Algorithm for Solving a Class of Nonconvex/Affine Feasibility Problems, July 2010, to appear in the book Fixed-Point Algorithms for
Inverse Problems in Science and Engineering, part of the Springer Verlag series Optimization
and Its Applications. Available online http://ie.technion.ac.il/Home/Users/becka.html

[10] Benedetti, R., Risler, J.-J., Real Algebraic and Semialgebraic Sets, Hermann, Éditeur des
Sciences et des Arts, (Paris, 1990).

[11] Blumensath, T., Davies, M. E., Iterative Thresholding for Sparse Approximations, J. of Fourier
Anal. App. 14 (2008), 629–654.

[12] Blumensath, T., Davies, M. E., Iterative hard thresholding for compressed sensing, App. Comput. Harmon. Anal., 27 (2009), 265–274.

[13] Bochnak, J., Coste, M., Roy, M.-F., Real Algebraic Geometry, (Springer, 1998).

[14] Bolte, J., Combettes, P.L., Pesquet, J.-C., Alternating proximal algorithm for blind image
recovery, Proceedings of the IEEE International Conference on Image Processing. Hong-Kong,
September 26-29, 2010.

[15] Bolte, J., Daniilidis, A. , Lewis, A., The Lojasiewicz inequality for nonsmooth subanalytic
functions with applications to subgradient dynamical systems, SIAM J. Optim., 17 , no. 4,
(2006), 1205–1223.

[16] Bolte, J., Daniilidis, A., Lewis, A., A nonsmooth Morse-Sard theorem for subanalytic
functions, J. Math. Anal. Appl., 321, no. 2, (2006), 729–740.

[17] Bolte, J., Daniilidis, A., Lewis, A., Shiota, M., Clarke subgradients of stratifiable func-
tions, SIAM J. Optim., 18, no. 2, (2007), 556–572.

[18] Bolte, J., Daniilidis, A., Ley, O., Mazet, L., Characterizations of Lojasiewicz inequalities:
Subgradient flows, talweg, convexity, Trans. Amer. Math. Soc., 362, (2010), 3319-3363.

[19] Bredies, K., Lorenz, D.A., Minimization of non-smooth, non-convex functionals by iterative
thresholding, preprint available at http://www.uni-graz.at/ bredies/publications.html

[20] Burke, J.V., Descent methods for composite nondifferentiable optimization problems, Mathe-
matical Programming, 33 (1985), 260–279.

[21] Chartrand, R., Exact Reconstruction of Sparse Signals via Nonconvex Minimization, Signal
Processing Letters IEEE, 14 (2007), 707–710.

[22] Chill, R., Jendoubi, M.A. Convergence to steady states in asymptotically autonomous semi-
linear evolution equations, Nonlinear Analysis, 53, (2003), 1017–1039.

[23] Clarke, F.H., Ledyaev, Yu., Stern, R.I. , Wolenski, P.R., Nonsmooth analysis and
control theory, Graduate texts in Mathematics 178, (Springer-Verlag, New-York, 1998).

[24] Combettes, P.L., Quasi-Fejerian analysis of some optimization algorithms, in Inherently Paral-
lel Algorithms in Feasibility and Optimization and Their Applications, (D. Butnariu, Y. Censor,
and S. Reich, Eds.), New York: Elsevier, 2001, 115-152.

[25] Combettes, P.L., Wajs, V.R., Signal recovery by proximal forward-backward splitting., Mul-
tiscale Model. Simul., 4 (2005), 1168–1200.

[26] Coste, M., An introduction to o-minimal geometry, RAAG Notes, 81 p., Institut de Recherche
Mathématiques de Rennes, November 1999.

[27] Curry, H.B., The method of steepest descent for non-linear minimization problems, Quart.
Appl. Math., 2 (1944), 258–261.

[28] Palis, J.,& De Melo, W., Geometric theory of dynamical systems. An introduction, (Trans-
lated from the Portuguese by A. K. Manning), Springer-Verlag, New York-Berlin, 1982.

[29] Donoho, D. L., Compressed Sensing, IEEE Trans. Inform. Theory 52 (2006), no. 4, 1289–1306.

[30] van den Dries, L., Tame topology and o-minimal structures. London Mathematical Society
Lecture Note Series, 248, Cambridge University Press, Cambridge, (1998) x+180 pp.

[31] van den Dries, L., & Miller, C., Geometric categories and o-minimal structures, Duke Math.
J. 84 (1996), 497-540.

[32] Grippo, L., Sciandrone, M., Globally convergent block-coordinate techniques for uncon-
strained optimization, Optimization Methods and Software, 10 (4), (1999), 587–637.

[33] Hare, W., Sagastizábal, C. Computing proximal points of nonconvex functions, Math. Pro-
gram., 116 (2009), 1-2, Ser. B, 221–258.

[34] Haraux, A., Jendoubi, M.A. Convergence of solutions of second-order gradient-like systems
with analytic nonlinearities, J. Differential Equations, 144 (2), (1999), 313–320.

[35] Huang, S.-Z., Takač, P. Convergence in gradient-like systems which are asymptotically au-
tonomous and analytic, Nonlinear Anal., Ser. A, Theory Methods, 46, (2001), 675–698.

[36] Iusem, A.N., Pennanen, T., Svaiter, B.F., Inexact variants of the proximal point algorithm
without monotonicity, SIAM Journal on Optimization, 13, no. 4 (2003), 1080–1097.

[37] Kruger, A.Y., About regularity of collections of sets, Set Valued Analysis, 14, (2006), 187–206.

[38] Kurdyka, K., On gradients of functions definable in o-minimal structures, Ann. Inst. Fourier,
48, (1998), 769-783.

[39] Lageman, C., Pointwise convergence of gradient-like systems, Math. Nachr., 280, (2007), no.
13-14, 1543-1558.

[40] Lewis, A.S., Active sets, nonsmoothness and sensitivity, SIAM Journal on Optimization, 13,
(2003), 702–725.

[41] Lewis, A.S., Malick, J., Alternating projection on manifolds, Mathematics of Operations
Research, 33, no. 1, (2008), 216-234.

[42] Lewis, A.S., Luke, D.R., Malick, J., Local linear convergence for alternating and averaged
nonconvex projections., Found. Comput. Math. 9, (2009), 485–513.

[43] Lewis, A.S., Wright, S.J., A proximal method for composite minimization, Optimization
online, (2008).

[44] Lojasiewicz, S., Une propriété topologique des sous-ensembles analytiques réels, in: Les
Équations aux Dérivées Partielles, pp. 87–89, Éditions du centre National de la Recherche Sci-
entifique, Paris 1963.

[45] Lojasiewicz, S., Sur la géométrie semi- et sous-analytique, Ann. Inst. Fourier 43, (1993),
1575-1595.

[46] Mordukhovich, B., Maximum principle in the problem of time optimal response with nons-
mooth constraints, J. Appl. Math. Mech., 40 (1976), 960–969 ; [translated from Prikl. Mat. Meh.
40 (1976), 1014–1023].

[47] Mordukhovich, B., Variational analysis and generalized differentiation. I. Basic theory,
Grundlehren der Mathematischen Wissenschaften, 330, Springer-Verlag, Berlin, 2006.

[48] Nesterov, Yu., Accelerating the cubic regularization of Newton’s method on convex problems,
Math. Program., 112 (2008), no. 1, Ser. B, 159–181.

[49] Nesterov, Yu., Nemirovskii, A., Interior-point polynomial algorithms in convex program-
ming, SIAM Studies in Applied Mathematics, 13, Philadelphia, PA, 1994.

[50] Ortega, J.M., Rheinboldt, W.C., Iterative solutions of nonlinear equations in several vari-
ables, Academic Press, New York and London, 1970.

[51] Pennanen, T., Local convergence of the proximal point algorithm and multiplier methods
without monotonicity, Math. Oper. Res. 27, (2002), 170–191 .

[52] Peypouquet, J., Sorin, S., Evolution equations for maximal monotone operators: asymptotic
analysis in continuous and discrete time, J. Convex Analysis, 17, (2010), 1113–1163.

[53] Poliquin, R.A., Rockafellar, R.T., Thibault, L., Local differentiability of distance func-
tions, Trans. AMS, 352, (2000), 5231–5249.

[54] Rockafellar, R.T. , Wets, R., Variational Analysis, Grundlehren der Mathematischen Wis-
senschaften, 317, Springer, 1998.

[55] Simon, L., Asymptotics for a class of non-linear evolution equations, with applications to geo-
metric problems, Ann. of Math., 118 (1983), 525–571.

[56] Solodov, M.V., Svaiter, B.F., A hybrid projection-proximal point algorithm, Journal of
Convex Analysis, 6, no. 1, (1999), 59–70.

[57] Solodov, M.V., Svaiter, B.F., A hybrid approximate extragradient-proximal point algorithm
using the enlargement of a maximal monotone operator, Set-Valued Analysis, 7, (1999), 323–345.

[58] Solodov, M.V., Svaiter, B.F., A unified framework for some inexact proximal point algo-
rithms, Numerical Functional Analysis and Optimization, 22, (2001), 1013-1035.

[59] Wright, S.J., Identifiable surfaces in constrained optimization. SIAM Journal on Control and
Optimization, 31, (1993), 1063-1079.

[60] Wright, S.J., Accelerated block-coordinate relaxation for regularized optimization, Optimiza-
tion online, (2010).
