
Journal of Machine Learning Research 24 (2023) 1-37 Submitted 3/20; Revised 12/22; Published 1/23

A Relaxed Inertial Forward-Backward-Forward Algorithm for Solving Monotone Inclusions with Application to GANs

Radu I. Boţ [email protected]


Faculty of Mathematics
University of Vienna
Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria
Michael Sedlmayer [email protected]
Research Network Data Science @ Uni Vienna
University of Vienna
Kolingasse 14-16, 1090 Vienna, Austria
Phan Tu Vuong [email protected]
Mathematical Sciences
University of Southampton
Southampton SO17 1BJ, United Kingdom

Editor: Suvrit Sra

Abstract
We introduce a relaxed inertial forward-backward-forward (RIFBF) splitting algorithm for
approaching the set of zeros of the sum of a maximally monotone operator and a single-
valued monotone and Lipschitz continuous operator. This work aims to extend Tseng’s
forward-backward-forward method by both using inertial effects as well as relaxation pa-
rameters. We formulate first a second order dynamical system that approaches the solution
set of the monotone inclusion problem to be solved and provide an asymptotic analysis for
its trajectories. We provide for RIFBF, which follows by explicit time discretization, a
convergence analysis in the general monotone case as well as when applied to the solving
of pseudo-monotone variational inequalities. We illustrate the proposed method by ap-
plications to a bilinear saddle point problem, in the context of which we also emphasize
the interplay between the inertial and the relaxation parameters, and to the training of
Generative Adversarial Networks (GANs).
Keywords: forward-backward-forward algorithm, inertial effects, relaxation parameters,
continuous time approach, application to GANs

1. Introduction
1.1 Motivation
The main motivation for the investigation of monotone inclusions and variational inequal-
ities governed by monotone and Lipschitz continuous operators is represented by convex-
concave minimax problems. It is well-known that determining primal-dual pairs of optimal
solutions of convex optimization problems means actually solving convex-concave minimax
problems (Bauschke and Combettes, 2017); nevertheless, minimax problems are of interest
in their own right. Minimax problems arise traditionally in game theory and, more recently, in the

©2023 Radu Ioan Boţ, Michael Sedlmayer and Phan Tu Vuong.


License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v24/20-267.html.

training of Generative Adversarial Networks (GANs), as we will see in Section 4. The


general convex-concave minimax, or saddle point, problem is of the form

min_{u∈H} max_{v∈G} Ψ(u, v) := f(u) + Φ(u, v) − g(v),    (1)

where H and G are real Hilbert spaces, Φ : H × G → R is a coupling function, and


f : H → R ∪ {+∞} and g : G → R ∪ {+∞} are regularizers. The coupling function Φ is
assumed to be convex-concave, i.e., Φ(·, v) is convex for all v ∈ G and Φ(u, ·) is concave for
all u ∈ H, and differentiable with L-Lipschitz continuous gradient (L > 0), which means
that for all u, u′ ∈ H and all v, v′ ∈ G we have

‖∇Φ(u, v) − ∇Φ(u′, v′)‖ ≤ L‖(u, v) − (u′, v′)‖ = L√(‖u − u′‖² + ‖v − v′‖²).

The regularizers f and g are assumed to be proper, convex and lower semicontinuous. A
solution of (1) is given by a so-called saddle point (u*, v*), fulfilling for all u ∈ H and all v ∈ G

Ψ(u*, v) ≤ Ψ(u*, v*) ≤ Ψ(u, v*)
or, equivalently, the system of optimality conditions

0 ∈ ∂Ψ(·, v*)(u*) = ∂f(u*) + ∇_u Φ(u*, v*),
0 ∈ ∂(−Ψ(u*, ·))(v*) = ∂g(v*) − ∇_v Φ(u*, v*),

where ∂f : H → 2^H and ∂g : G → 2^G denote the convex subdifferentials of the functions f
and g, respectively. This system can be rewritten as the following monotone inclusion

0 ∈ A(u*, v*) + B(u*, v*),    (2)

where A = ∂f × ∂g : H × G → 2^{H×G} is a maximally monotone operator and B : H × G → H × G
with B(u, v) = (∇_u Φ(u, v), −∇_v Φ(u, v)) is monotone and Lipschitz continuous with
constant L > 0.
Monotone inclusions governed by maximally monotone operators are often seen as generalizations
of systems of optimality conditions for convex optimization problems; however,
(2) shows that they can also be of interest in their own right. In order to solve (2) one cannot rely on
solution methods developed for optimization problems, but needs to design and develop
specific algorithms in the framework of the theory of monotone operators. This will be one
of our goals in this paper and in this way we will provide further evidence for the importance
of monotone operators beyond the convex optimization setting.

1.2 Problem Formulation


Let H be a real Hilbert space endowed with inner product ⟨·, ·⟩ and corresponding norm
‖·‖, A : H → 2^H a maximally monotone set-valued operator, and B : H → H a monotone
and Lipschitz continuous operator with Lipschitz constant L > 0, such that Zeros(A + B) :=
{x ∈ H : 0 ∈ Ax + Bx} ≠ ∅. We are interested in solving the following inclusion problem:
Find x* ∈ H such that

0 ∈ Ax* + Bx*.    (3)


We recall that the operator A : H → 2^H is called monotone if

⟨u − v, x − y⟩ ≥ 0  ∀(x, u), (y, v) ∈ Graph(A),

where Graph(A) = {(x, u) ∈ H × H : u ∈ Ax} denotes its graph. The operator A is called
maximally monotone if it is monotone and its graph is not properly included in the graph of
another monotone operator. The single-valued operator B : H → H is said to be Lipschitz
continuous with Lipschitz constant L > 0 if

‖Bx − By‖ ≤ L‖x − y‖  ∀x, y ∈ H.
An important special case is when A = NC , the normal cone of a nonempty closed
convex subset C of H. Then (3) reduces to a variational inequality (VI):
Find x∗ ∈ C such that
hBx∗ , x − x∗ i ≥ 0 ∀x ∈ C. (4)

1.3 Related Literature


Solution methods for solving (3), when the operator B is cocoercive, have been devel-
oped intensively in the last decades (Abbas and Attouch, 2015; Attouch and Cabot, 2019a;
Bauschke and Combettes, 2017; Boţ and Csetnek, 2016a, 2018). Recall that the operator
B : H → H is cocoercive if there is a constant L > 0 such that

L⟨Bx − By, x − y⟩ ≥ ‖Bx − By‖²  ∀x, y ∈ H.
Notice that every cocoercive operator is Lipschitz continuous and that the gradient of
a convex and Fréchet differentiable function is a cocoercive operator if and only if it is
Lipschitz continuous (Bauschke and Combettes, 2017).
The simplest method for solving (3), when B is cocoercive, is the forward-backward
(FB) method, which generates an iterative sequence (xk )k≥0 via
(∀k ≥ 0)  xk+1 := JλA(I − λB)xk,    (5)

where x0 ∈ H is the starting point and JλA := (I + λA)^{−1} : H → H is the resolvent of the
operator A. Here, I denotes the identity operator of H and λ is a positive stepsize chosen
in (0, 2/L). The resolvent of a maximally monotone operator is a single-valued and cocoercive
operator with constant L = 1. The iterative scheme (5) results by time-discretizing, with
time stepsize equal to 1, the first order dynamical system

ẋ(t) + x(t) = JλA(I − λB)x(t),   x(0) = x0.
For more about dynamical systems of implicit type associated to monotone inclusions and
convex optimization problems, see for example the work by (Abbas and Attouch, 2015),
(Antipin, 1994), (Bolte, 2003), and (Boţ and Csetnek, 2018).
Recently, the following second order dynamical system associated with the monotone
inclusion problem (3), when B is cocoercive,

ẍ(t) + γ(t)ẋ(t) + τ(t)[x(t) − JλA(I − λB)x(t)] = 0,   x(0) = x0, ẋ(0) = v0,


where γ, τ : [0, +∞) → [0, +∞), was proposed and studied by (Boţ and Csetnek, 2016a)
(see also the work of Alvarez, 2000; Antipin, 1994; Boţ and Csetnek, 2018). Explicit time
discretization of this second order dynamical system gives rise to so-called relaxed inertial
forward-backward algorithms, which combine inertial effects and relaxation parameters.
In recent years, Attouch and Cabot have promoted in a series of papers relaxed iner-
tial algorithms for monotone inclusions and convex optimization problems, as they combine
the advantages of both inertial effects and relaxation techniques. More precisely, they ad-
dressed the relaxed inertial proximal method (RIPA) (Attouch and Cabot, 2019b,c) and the
relaxed inertial forward-backward method (RIFB) (Attouch and Cabot, 2019a). A relaxed
inertial Douglas-Rachford algorithm for monotone inclusions has been proposed by (Boţ
et al., 2015). (Iutzeler and Hendrickx, 2019) investigated the influence that inertial effects and
relaxation techniques have on the numerical performance of optimization algorithms. The
interplay between relaxation and inertial parameters for relative-error inexact under-relaxed
algorithms has been addressed by (Alves and Marcavillaca, 2020; Alves et al., 2020).
Relaxation techniques are essential ingredients in the formulation of algorithms for
monotone inclusions, as they provide more flexibility to the iterative schemes (Bauschke
and Combettes, 2017; Eckstein and Bertsekas, 1992). Inertial effects have been introduced
in order to accelerate the convergence of the numerical methods. This technique traces back
to the pioneering work of (Polyak, 1964), who introduced the heavy ball method in order
to speed up the convergence behavior of the gradient algorithm and allow the detection
of different critical points. This idea was employed and refined by (Nesterov, 1983) and
by (Alvarez, 2000) and by (Alvarez and Attouch, 2001) in the context of solving smooth
convex minimization problems and monotone inclusions/nonsmooth convex minimization
problems, respectively. When applied to convex minimization problems, the acceleration
techniques introduced by Nesterov in (Nesterov, 1983) or those defined via asymptotically
similar constructions lead to an improved convergence rate for the sequence of objective
function values. When applied to monotone inclusions, the same lead, as seen in (Attouch
and Cabot, 2019a,b,c) for the relaxed inertial proximal and forward-backward methods, to
improved convergence rates for the sequences of discrete velocities and Yosida regulariza-
tions of the governing operator.
In this paper we will focus on solving the monotone inclusion (3) in the case when
B is merely monotone and Lipschitz continuous. To this end we will formulate a relaxed
inertial forward-backward-forward (RIFBF) algorithm, which we obtain through the time
discretization of a second order dynamical system approaching the solution set of (3). The
forward-backward-forward (FBF) method was proposed by (Tseng, 2000) and it generates
an iterative sequence (xk )k≥0 via
(∀k ≥ 0)   yk = Jλk A(I − λk B)xk,
           xk+1 = yk − λk(Byk − Bxk),

where x0 ∈ H is the starting point. The sequence (xk)k≥0 converges weakly to a solution of
(3) if the sequence of stepsizes (λk)k≥0 is chosen in the interval (0, 1/L), where L > 0 is the
Lipschitz constant of B. An inertial version of FBF, however in the absence of relaxation


variables, was proposed by (Boţ and Csetnek, 2016b). The convergence of the sequence of
generated iterates has been proved in a very restrictive setting which requires that the inertial


parameters take very small values. This is in accordance with the inertial parameter choices
in the earliest papers on inertial algorithms (Alvarez, 2000) and (Alvarez and Attouch,
2001). It is one of our aims to show that relaxation parameters allow more flexibility in the
choice of the inertial ones, which leads to better convergence results.
Recently, a forward-backward algorithm for solving (3), when B is monotone and Lipschitz
continuous, was proposed by (Malitsky and Tam, 2020). This method requires in
every iteration only one forward step instead of two; however, the sequence of stepsizes has
to be chosen constant in the interval (0, 1/(2L)), which slows the algorithm down in comparison
to FBF. A popular algorithm used to solve the variational inequality (4), when B is
monotone and Lipschitz continuous, is Korpelevich's extragradient method (Korpelevich,
1976). The stepsizes are to be chosen in the interval (0, 1/L); however, this method requires
two projection steps onto C and two forward steps.

1.4 Outline

First, we will approach the solution set of (3) from a continuous perspective by means of the
trajectories generated by a second order dynamical system of FBF type. We will prove an
existence and uniqueness result for the generated trajectories and provide a general setting
in which these converge to a zero of A + B as time goes to infinity. In addition, we will
show that explicit time discretization of the dynamical system gives rise to an algorithm of
forward-backward-forward type with inertial and relaxation parameters (RIFBF).
In Section 3 we will discuss the convergence of RIFBF and investigate the interplay
between the inertial and the relaxation parameters. It is worth noticing that the standard
FBF method, the algorithm by (Malitsky and Tam, 2020) and the extragradient method all
require knowledge of the Lipschitz constant of B, which is not always available. This can
be avoided by performing a line-search procedure, which usually incurs additional
computational costs. On the contrary, we will use an adaptive stepsize rule
to additional computation costs. On the contrary, we will use an adaptive stepsize rule
which does not require knowledge of the Lipschitz constant of B. We will also comment on
the convergence of RIFBF when applied to the solving of the variational inequality (4) in
the case when the operator B is pseudo-monotone but not necessarily monotone. Pseudo-
monotone operators appear in the consumer theory of mathematical economics (Hadjisavvas
et al., 2012) and as gradients of pseudo-convex functions (Cottle and Ferland, 1971), such
as ratios of convex and concave functions in fractional programming (Borwein and Lewis,
2006).
Finally, we present two numerical experiments supporting our theoretical
results in Section 4. On the one hand we deal with a bilinear saddle point problem which can
be understood as a two-player zero-sum constrained game. In this context, we emphasize the
interplay between the inertial and the relaxation parameters. On the other hand we employ
variants of RIFBF for training Generative Adversarial Networks (GANs), which is a class of
machine learning systems where two opposing artificial neural networks compete in a zero-
sum game. GANs have achieved outstanding results for producing photorealistic pictures
and are typically known to be difficult to optimize. We show that our method outperforms
“Extra Adam”, a GAN training approach inspired by the extra-gradient algorithm, which
recently achieved state-of-the-art results (Gidel et al., 2019).


2. A Second Order Dynamical System of FBF Type


In this section we will focus on the study of the dynamical system

y(t) = JλA(I − λB)x(t),
ẍ(t) + γ(t)ẋ(t) + τ(t)[x(t) − y(t) − λ(Bx(t) − By(t))] = 0,    (6)
x(0) = x0, ẋ(0) = v0,

where γ, τ : [0, +∞) → [0, +∞) are Lebesgue measurable functions, 0 < λ < 1/L and
x0, v0 ∈ H, in connection with the monotone inclusion problem (3).
We define M : H → H by

Mx = x − JλA(I − λB)x − λ[Bx − B ∘ JλA(I − λB)x].    (7)

Then (6) can be equivalently written as

ẍ(t) + γ(t)ẋ(t) + τ(t)Mx(t) = 0,   x(0) = x0, ẋ(0) = v0.    (8)
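In finite dimensions, M can be evaluated with one resolvent step and two forward evaluations of B. A minimal Python sketch, assuming the resolvent JλA and the operator B are supplied as callables (all names below are ours, not part of the paper):

```python
def M(x, J_lam_A, B, lam):
    # One FBF sweep: backward (resolvent) step followed by a forward correction.
    # J_lam_A : callable evaluating the resolvent (I + lam*A)^(-1)   (assumed given)
    # B       : callable evaluating the monotone Lipschitz operator  (assumed given)
    # lam     : stepsize with 0 < lam < 1/L
    Bx = B(x)
    y = J_lam_A(x - lam * Bx)           # y = J_{lam A}(I - lam B)x
    return x - y - lam * (Bx - B(y))    # M x as in (7)
```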

The following result collects some properties of M .

Proposition 1 Let M be defined as in (7). Then the following statements are true:
(i) Zeros(M ) = Zeros(A + B);

(ii) M is Lipschitz continuous;

(iii) for all x* ∈ Zeros(M) and all x ∈ H it holds

⟨Mx, x − x*⟩ ≥ ((1 − λL)/(1 + λL)²)‖Mx‖².    (9)

Proof (i) For x ∈ H we set y := JλA(I − λB)x, thus Mx = x − y − λ(Bx − By). Using
the Lipschitz continuity of B we have

(1 − λL)‖x − y‖ ≤ ‖Mx‖ = ‖x − y − λ(Bx − By)‖ ≤ (1 + λL)‖x − y‖.    (10)

Therefore, x ∈ Zeros(M) if and only if x = y = JλA(I − λB)x, which is further equivalent
to x ∈ Zeros(A + B).
(ii) Let x, x′ ∈ H and y := JλA(I − λB)x, y′ := JλA(I − λB)x′. The Lipschitz
continuity of B yields

‖Mx − Mx′‖ = ‖x − y − λ(Bx − By) − x′ + y′ + λ(Bx′ − By′)‖
           ≤ (1 + λL)(‖x − x′‖ + ‖y − y′‖).

In addition, by the nonexpansiveness (Lipschitz continuity with Lipschitz constant 1) of
JλA and again by the Lipschitz continuity of B we obtain

‖y − y′‖ = ‖JλA(I − λB)x − JλA(I − λB)x′‖ ≤ ‖(I − λB)x − (I − λB)x′‖ ≤ (1 + λL)‖x − x′‖.

Therefore,

‖Mx − Mx′‖ ≤ (1 + λL)(2 + λL)‖x − x′‖,

which shows that M is Lipschitz continuous with Lipschitz constant (1 + λL)(2 + λL) > 0.
(iii) Let x* ∈ H be such that 0 ∈ (A + B)x* and x ∈ H. We denote y := JλA(I − λB)x
and can write (I − λB)x ∈ (I + λA)y or, equivalently,

(1/λ)(x − y) − (Bx − By) ∈ (A + B)y.    (11)

Using the monotonicity of A + B we obtain

⟨(1/λ)(x − y) − (Bx − By), y − x*⟩ ≥ 0,

which is equivalent to

⟨x − y − λ(Bx − By), x − x*⟩ ≥ ⟨x − y − λ(Bx − By), x − y⟩.

This means that

⟨Mx, x − x*⟩ ≥ ⟨x − y − λ(Bx − By), x − y⟩ = ‖x − y‖² − λ⟨Bx − By, x − y⟩
            ≥ ‖x − y‖² − λ‖Bx − By‖‖x − y‖ ≥ (1 − λL)‖x − y‖²
            ≥ ((1 − λL)/(1 + λL)²)‖Mx‖²,

where the last inequality follows from (10).

The following definition makes explicit which kind of solutions of the dynamical system
(6) we are looking for. We recall that a function x : [0, b] → H (where b > 0) is said to be
absolutely continuous if there exists an integrable function y : [0, b] → H such that

x(t) = x(0) + ∫₀ᵗ y(s) ds   ∀t ∈ [0, b].

This means precisely that x is continuous and its distributional derivative ẋ is Lebesgue
integrable on [0, b].

Definition 2 We say that x : [0, +∞) → H is a strong global solution of (6) if the following
properties are satisfied:

(i) x, ẋ : [0, +∞) → H are locally absolutely continuous, in other words, absolutely
continuous on each interval [0, b] for 0 < b < +∞;

(ii) ẍ(t) + γ(t)ẋ(t) + τ (t)M x(t) = 0 for almost every t ∈ [0, +∞);

(iii) x(0) = x0 and ẋ(0) = v0 .

Since M : H → H is Lipschitz continuous, the existence and uniqueness of the trajectory


of (6) follows from the Cauchy-Lipschitz Theorem for absolutely continuous trajectories.


Theorem 3 (see Boţ and Csetnek, 2016a, Theorem 4) Let γ, τ : [0, +∞) → [0, +∞) be
Lebesgue measurable functions such that γ, τ ∈ L¹_loc([0, +∞)) (that is, γ, τ ∈ L¹([0, b]) for
all 0 < b < +∞). Then for each x0, v0 ∈ H there exists a unique strong global solution of
the dynamical system (6).

We will prove the convergence of the trajectories of (6) in a setting which requires the
damping function γ and the relaxation function τ to fulfil the assumptions below. We refer
to (Boţ and Csetnek, 2016a) for examples of functions which fulfil this assumption and
want also to emphasize that when the two functions are constant we recover the conditions
from (Attouch and Maingé, 2011).

Assumption 1 γ, τ : [0, +∞) → [0, +∞) are locally absolutely continuous and there exists
ν > 0 such that for almost every t ∈ [0, +∞) it holds

γ̇(t) ≤ 0 ≤ τ̇(t)   and   γ²(t)/τ(t) ≥ (1 + ν)(1 + λL)²/(1 − λL).
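For constant damping and relaxation functions γ(t) ≡ γ and τ(t) ≡ τ the monotonicity conditions hold trivially, and Assumption 1 reduces to a single inequality that is easy to check numerically; a small illustrative helper in Python (the function name is ours):

```python
def check_assumption_1(gamma, tau, lam, L, nu):
    # For constant gamma and tau the conditions gamma' <= 0 <= tau' hold with
    # equality; only the coupling inequality of Assumption 1 remains to verify.
    assert 0 < lam < 1.0 / L and nu > 0
    return gamma**2 / tau >= (1 + nu) * (1 + lam * L)**2 / (1 - lam * L)
```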

The result stating the convergence of the trajectories is adapted from (Boţ and
Csetnek, 2016a, Theorem 8). It cannot, however, be obtained as a direct
consequence of it, since the operator M is not cocoercive as required there. Nevertheless,
as seen in Proposition 1 (iii), M has a property, sometimes called “cocoercive with respect
to its set of zeros”, which is by far weaker than cocoercivity, but strong enough in order to
allow us to partially apply the techniques used to prove Theorem 8 by (Boţ and Csetnek,
2016a).

Theorem 4 Let γ, τ : [0, +∞) → [0, +∞) be functions satisfying Assumption 1 and x0 , v0 ∈
H. Let x : [0, +∞) → H be the unique strong global solution of (6). Then the following
statements are true:

(i) the trajectory x is bounded and ẋ, ẍ, Mx ∈ L²([0, +∞); H);

(ii) limt→+∞ ẋ(t) = limt→+∞ ẍ(t) = limt→+∞ M x(t) = limt→+∞ [x(t) − y(t)] = 0;

(iii) x(t) converges weakly to an element in Zeros(A + B) as t → +∞.

Proof Take an arbitrary x* ∈ Zeros(A + B) = Zeros(M) and define for all t ∈ [0, +∞) the
Lyapunov function h(t) = (1/2)‖x(t) − x*‖². For almost every t ∈ [0, +∞) we have

ḣ(t) = ⟨x(t) − x*, ẋ(t)⟩   and   ḧ(t) = ‖ẋ(t)‖² + ⟨x(t) − x*, ẍ(t)⟩.

Taking into account (8) we obtain for almost every t ∈ [0, +∞) that

ḧ(t) + γ(t)ḣ(t) + τ(t)⟨x(t) − x*, Mx(t)⟩ = ‖ẋ(t)‖²,

which, together with Proposition 1 (iii), implies

ḧ(t) + γ(t)ḣ(t) + ((1 − λL)/(1 + λL)²)τ(t)‖Mx(t)‖² ≤ ‖ẋ(t)‖².


From this point, we can proceed as in the proof of (Boţ and Csetnek, 2016a, Theorem 8)
– for a complete explanation see Appendix A. Consequently, we obtain the statements in
(i) and (ii) and the fact that the limit limt→+∞ kx(t) − x∗ k ∈ R exists, which is the first
assumption in the continuous version of the Opial Lemma as it was used, for example,
by (Boţ and Csetnek, 2016a, Lemma 7). In order to show that the second assumption
of the Opial Lemma is fulfilled, which means actually proving that every weak sequential
cluster point of the trajectory x is a zero of M , one cannot use the arguments from (Boţ and
Csetnek, 2016a, Theorem 8), since M is not maximally monotone. We have to use, instead,
different arguments relying on the maximal monotonicity of A + B.
Indeed, let x̄ be a weak sequential cluster point of x, which means that there exists
a sequence tk → +∞ such that (x(tk ))k≥0 converges weakly to x̄ as k → +∞. Since,
according to (ii), limt→+∞ M x(t) = limt→+∞ [x(t) − y(t)] = 0, we conclude that (y(tk ))k≥0
also converges weakly to x̄. According to (11) we have

(1/λ)(x(tk) − y(tk)) − (Bx(tk) − By(tk)) ∈ (A + B)y(tk)   ∀k ≥ 0.    (12)

Since B is Lipschitz continuous and limk→+∞ kx(tk ) − y(tk )k = 0, the left hand side of
(12) converges strongly to 0 as k → +∞. Since A + B is maximally monotone, its graph is
sequentially closed with respect to the weak-strong topology of the product space H × H.
Therefore, taking the limit as tk → +∞ in (12) we obtain x̄ ∈ Zeros(A + B).
Thus, the continuous Opial Lemma implies that x(t) converges weakly to an element in
Zeros(A + B) as t → +∞.

Remark 5 (explicit discretization) Explicit time discretization of (6) with stepsize sk > 0,
relaxation variable τk > 0, damping variable γk > 0, and initial points x0 and x1 yields
for all k ≥ 1 the following iterative scheme:

(1/sk²)(xk+1 − 2xk + xk−1) + (γk/sk)(xk − xk−1) + τk M zk = 0,    (13)

where zk is an extrapolation of xk and xk−1 that will be chosen later. The Lipschitz
continuity of M provides a certain flexibility to this choice. We can write (13) equivalently
as

(∀k ≥ 1)  xk+1 = xk + (1 − γk sk)(xk − xk−1) − sk² τk M zk.

Setting αk := 1 − γk sk, ρk := sk² τk and choosing zk := xk + αk(xk − xk−1) for all k ≥ 1, we
can write the above scheme as

            zk = xk + αk(xk − xk−1),
(∀k ≥ 1)    yk = JλA(I − λB)zk,
            xk+1 = (1 − ρk)zk + ρk(yk − λ(Byk − Bzk)),

which is a relaxed version of the FBF algorithm with inertial effects.


3. A Relaxed Inertial FBF Algorithm


In this section we investigate the convergence of the relaxed inertial algorithm derived in
the previous section via the time discretization of (6), however, we also assume that the
stepsizes (λk )k≥1 are variable. More precisely, we will address the following algorithm

                     zk = xk + αk(xk − xk−1),
(RIFBF)  (∀k ≥ 1)    yk = Jλk A(I − λk B)zk,
                     xk+1 = (1 − ρk)zk + ρk(yk − λk(Byk − Bzk)),

where x0 , x1 ∈ H are starting points, (λk )k≥1 and (ρk )k≥1 are sequences of positive numbers,
and (αk )k≥1 is a sequence of nonnegative numbers. The following iterative schemes can be
obtained as particular instances of RIFBF:

• ρk = 1 for all k ≥ 1: inertial Forward-Backward-Forward algorithm

                      zk = xk + αk(xk − xk−1),
  (IFBF)  (∀k ≥ 1)    yk = Jλk A(I − λk B)zk,
                      xk+1 = yk − λk(Byk − Bzk)

• αk = 0 for all k ≥ 1: relaxed Forward-Backward-Forward algorithm

  (RFBF)  (∀k ≥ 1)    yk = Jλk A(I − λk B)xk,
                      xk+1 = (1 − ρk)xk + ρk(yk − λk(Byk − Bxk))

• αk = 0, ρk = 1 for all k ≥ 1: Forward-Backward-Forward algorithm

  (FBF)   (∀k ≥ 1)    yk = Jλk A(I − λk B)xk,
                      xk+1 = yk − λk(Byk − Bxk).

Stepsize rules: Depending on the availability of the Lipschitz constant L of B, we have
two different options for the choice of the sequence of stepsizes (λk)k≥1:

• constant stepsize: λk := λ ∈ (0, 1/L) for all k ≥ 1;

• adaptive stepsize: let µ ∈ (0, 1) and λ1 > 0. The stepsizes for k ≥ 1 are adaptively
updated as follows:

  λk+1 := min{λk, µ‖yk − zk‖/‖Byk − Bzk‖}  if Byk − Bzk ≠ 0,  and  λk+1 := λk  otherwise.    (14)

If the Lipschitz constant L of B is known in advance, then a constant stepsize can be
chosen. Otherwise, the adaptive stepsize rule (14) is highly recommended. In the following,
we provide the convergence analysis for the adaptive stepsize rule, as the constant stepsize
rule can be obtained as a particular case of it by setting λ1 := λ and µ := λL.
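For concreteness, the following Python sketch implements RIFBF with the adaptive rule (14) in finite dimensions, under the assumption that the resolvent J_A and the operator B are supplied as callables and that constant inertial and relaxation parameters are used; all names are ours, not part of the paper:

```python
import numpy as np

def rifbf(x0, x1, J_A, B, lam1, mu, alpha, rho, n_iter=1000):
    # J_A(lam, z) : resolvent (I + lam*A)^(-1) applied to z   (assumed given)
    # B           : single-valued monotone Lipschitz operator (assumed given)
    # lam1 > 0, mu in (0, 1); alpha, rho held constant for simplicity
    x_prev, x, lam = x0, x1, lam1
    for _ in range(n_iter):
        z = x + alpha * (x - x_prev)              # inertial extrapolation
        y = J_A(lam, z - lam * B(z))              # forward-backward step
        By, Bz = B(y), B(z)
        x_prev, x = x, (1 - rho) * z + rho * (y - lam * (By - Bz))
        d = np.linalg.norm(By - Bz)               # adaptive stepsize rule (14),
        if d > 0:                                 # no knowledge of L required
            lam = min(lam, mu * np.linalg.norm(y - z) / d)
    return x
```

Setting alpha = 0 recovers RFBF, rho = 1 recovers IFBF, and both together recover FBF from the list above.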


Proposition 6 Let µ ∈ (0, 1) and λ1 > 0. The sequence (λk)k≥1 generated by (14) is
nonincreasing and

lim_{k→+∞} λk = λ ≥ min{λ1, µ/L}.

In addition,

‖Byk − Bzk‖ ≤ (µ/λk+1)‖yk − zk‖   ∀k ≥ 1.    (15)

Proof It is obvious from (14) that λk+1 ≤ λk for all k ≥ 1. Since B is Lipschitz continuous
with Lipschitz constant L, it holds

µ‖yk − zk‖/‖Byk − Bzk‖ ≥ µ/L   if Byk − Bzk ≠ 0,

which together with (14) yields

λk+1 ≥ min{λ1, µ/L}   ∀k ≥ 1.

Thus, there exists

λ := lim_{k→+∞} λk ≥ min{λ1, µ/L}.

The inequality (15) follows directly from (14).

Proposition 7 Let wk := yk − λk(Byk − Bzk) and θk := λk/λk+1 for all k ≥ 1. Then for all
x* ∈ Zeros(A + B) it holds

‖wk − x*‖² ≤ ‖zk − x*‖² − (1 − µ²θk²)‖yk − zk‖²   ∀k ≥ 1.    (16)

Proof Let k ≥ 1. Using (15) we have that

‖zk − x*‖² = ‖zk − yk + yk − wk + wk − x*‖²
           = ‖zk − yk‖² + ‖yk − wk‖² + ‖wk − x*‖² + 2⟨zk − yk, yk − x*⟩ + 2⟨yk − wk, wk − x*⟩
           = ‖zk − yk‖² − ‖yk − wk‖² + ‖wk − x*‖² + 2⟨zk − wk, yk − x*⟩
           = ‖zk − yk‖² − λk²‖Byk − Bzk‖² + ‖wk − x*‖² + 2⟨zk − wk, yk − x*⟩
           ≥ ‖zk − yk‖² − (µ²λk²/λk+1²)‖yk − zk‖² + ‖wk − x*‖² + 2⟨zk − wk, yk − x*⟩.    (17)

On the other hand, since

(I + λk A)yk ∋ (I − λk B)zk,

we have

zk ∈ yk + λk Ayk + λk Bzk = yk − λk(Byk − Bzk) + λk(A + B)yk = wk + λk(A + B)yk.


Therefore,

(1/λk)(zk − wk) ∈ (A + B)yk,

which, together with 0 ∈ (A + B)x* and the monotonicity of A + B, implies

⟨zk − wk, yk − x*⟩ ≥ 0.

Hence,

‖zk − x*‖² ≥ ‖zk − yk‖² − (µ²λk²/λk+1²)‖yk − zk‖² + ‖wk − x*‖²,

which is (16).

The following result introduces a discrete Lyapunov function for which a decreasing
property is established.

Proposition 8 Let (αk)k≥1 be a nondecreasing sequence of nonnegative numbers and (ρk)k≥1
be such that for some kρ ≥ 1 we have

0 < ρk ≤ 2/(1 + µθk)   ∀k ≥ kρ,    (18)

let x* ∈ Zeros(A + B) and for all k ≥ 1 define

Hk := ‖xk − x*‖² − αk‖xk−1 − x*‖² + 2αk(αk + (1 − αk)/(ρk(1 + µθk)))‖xk − xk−1‖².    (19)

Then there exists k0 ≥ kρ such that for all k ≥ k0 it holds

Hk+1 − Hk ≤ −δk‖xk+1 − xk‖²,    (20)

where

δk := (2/(ρk(1 + µθk)) − 1)(1 − αk) − 2αk+1(αk+1 + (1 − αk+1)/(ρk+1(1 + µθk+1)))   ∀k ≥ 1.

Proof From Proposition 7, with wk = yk − λk(Byk − Bzk), we have for all k ≥ 1

‖xk+1 − x*‖² = ‖(1 − ρk)zk + ρk wk − x*‖²
             = ‖(1 − ρk)(zk − x*) + ρk(wk − x*)‖²
             = (1 − ρk)‖zk − x*‖² + ρk‖wk − x*‖² − ρk(1 − ρk)‖wk − zk‖²
             ≤ (1 − ρk)‖zk − x*‖² + ρk‖zk − x*‖² − ρk(1 − µ²θk²)‖yk − zk‖² − ((1 − ρk)/ρk)‖xk+1 − zk‖²
             = ‖zk − x*‖² − ρk(1 − µ²θk²)‖yk − zk‖² − ((1 − ρk)/ρk)‖xk+1 − zk‖².    (21)


Using (15) we obtain for all k ≥ 1

(1/ρk)‖xk+1 − zk‖ = ‖wk − zk‖ ≤ ‖wk − yk‖ + ‖yk − zk‖ = λk‖Byk − Bzk‖ + ‖yk − zk‖
                  ≤ (1 + µλk/λk+1)‖yk − zk‖ = (1 + µθk)‖yk − zk‖.

Since lim_{k→+∞}(1 − µθk) = 1 − µ > 0, there exists k1 ≥ 1 such that

1 − µθk > 0   ∀k ≥ k1.

This means that for all k ≥ k1 we have

((1 − µθk)/(ρk(1 + µθk)))‖xk+1 − zk‖² ≤ ρk(1 − µ²θk²)‖yk − zk‖².

We obtain from (21) that for all k ≥ k1

‖xk+1 − x*‖² ≤ ‖zk − x*‖² − ((1 − µθk)/(ρk(1 + µθk)) + (1 − ρk)/ρk)‖xk+1 − zk‖²
             = ‖zk − x*‖² − (2/(ρk(1 + µθk)) − 1)‖xk+1 − zk‖².    (22)

Now we will estimate the right-hand side of (22). For all k ≥ 1 we have

‖zk − x*‖² = ‖xk + αk(xk − xk−1) − x*‖²
           = ‖(1 + αk)(xk − x*) − αk(xk−1 − x*)‖²
           = (1 + αk)‖xk − x*‖² − αk‖xk−1 − x*‖² + αk(1 + αk)‖xk − xk−1‖²,    (23)

and

‖xk+1 − zk‖² = ‖(xk+1 − xk) − αk(xk − xk−1)‖²
             = ‖xk+1 − xk‖² + αk²‖xk − xk−1‖² − 2αk⟨xk+1 − xk, xk − xk−1⟩
             ≥ (1 − αk)‖xk+1 − xk‖² + (αk² − αk)‖xk − xk−1‖².    (24)

Combining (22) with (23) and (24), together with (18) and using that (αk)k≥1 is
nondecreasing, we obtain for all k ≥ k0 := max{k1, kρ}

‖xk+1 − x*‖² − αk+1‖xk − x*‖²
  ≤ ‖xk+1 − x*‖² − αk‖xk − x*‖²
  ≤ ‖xk − x*‖² − αk‖xk−1 − x*‖² + αk(1 + αk)‖xk − xk−1‖²
    − (2/(ρk(1 + µθk)) − 1)((1 − αk)‖xk+1 − xk‖² + (αk² − αk)‖xk − xk−1‖²)
  = ‖xk − x*‖² − αk‖xk−1 − x*‖² + 2αk(αk + (1 − αk)/(ρk(1 + µθk)))‖xk − xk−1‖²
    − (2/(ρk(1 + µθk)) − 1)(1 − αk)‖xk+1 − xk‖²,


which is precisely (20).

In order to further proceed with the convergence analysis, we have to choose the
sequences (αk)k≥1 and (ρk)k≥1 such that

lim inf_{k→+∞} δk > 0.

This is a manageable task, since we can choose for example the two sequences such that
lim_{k→+∞} αk = α ≥ 0 and lim_{k→+∞} ρk = ρ > 0. Recalling that lim_{k→+∞} θk = 1, we obtain

lim_{k→+∞} δk = (2/(ρ(1 + µ)) − 1)(1 − α) − 2α(α + (1 − α)/(ρ(1 + µ)))
             = 2(1 − α)²/(ρ(1 + µ)) − (2α² − α + 1),

thus, in order to guarantee lim_{k→+∞} δk > 0 it is sufficient to choose ρ such that

0 < ρ < 2(1 − α)²/((1 + µ)(2α² − α + 1)).    (25)

Moreover, for α ≥ 0 we have

(1 − α)²/(2α² − α + 1) ≤ 1,

and thus

ε := (1/2)(2/(1 + µ) − ρ) > 0.

Then there exist k0, k1 ≥ 1 such that

ρk ≤ ρ + ε   ∀k ≥ k0,   and   2/(1 + µ) − ε ≤ 2/(1 + µθk)   ∀k ≥ k1.

We obtain that for all k ≥ kρ := max{k0, k1} we have

0 < ρk ≤ ρ + ε = (1/2)(ρ + 2/(1 + µ)) = 2/(1 + µ) − ε ≤ 2/(1 + µθk),

hence (18) is fulfilled naturally with the choice (25).

Remark 9 (inertia versus relaxation) Inequality (25) represents the necessary trade-off
between inertia and relaxation (see Figure 1 for two particular choices of µ). The expression
is similar to the one obtained by (Attouch and Cabot, 2019b, Remark 2.13), the exception
being an additional factor incorporating the stepsize parameter µ. This means that for given
0 ≤ α < 1 the upper bound for the relaxation parameter is ρ(α, µ) = 2(1 − α)²/((1 + µ)(2α² − α + 1)).
We further see that α ↦ ρ(α, µ) is a decreasing function on the interval [0, 1]. Hence, the
maximal value for the limit of the sequence of relaxation parameters is obtained when α = 0
and is ρmax(µ) := ρ(0, µ) = 2/(1 + µ). On the other hand, when α ↗ 1, then ρ(α, µ) ↘ 0. In
addition, the function µ ↦ ρmax(µ) is also decreasing on [0, 1], with limiting values 2 as
µ ↘ 0 and 1 as µ ↗ 1.
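The boundary curve (25), plotted in Figure 1, is straightforward to evaluate; a small Python helper (name ours):

```python
def rho_bound(alpha, mu):
    # Upper bound (25) on the relaxation parameter for inertia alpha in [0, 1)
    # and stepsize parameter mu in (0, 1), cf. Remark 9.
    return 2 * (1 - alpha)**2 / ((1 + mu) * (2 * alpha**2 - alpha + 1))

# For example, rho_bound(0.0, 0.5) = 4/3, i.e. rho_max(0.5) = 2/(1 + 0.5).
```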


Figure 1: Trade-off between inertia and relaxation for µ = 0.5 (left) and µ = 0.95 (right).

Proposition 10 Let (αk )k≥1 be a nondecreasing sequence of nonnegative numbers with


0 ≤ αk ≤ α < 1 for all k ≥ 1 and (ρk )k≥1 a sequence of positive numbers such that (18)
holds, with the property that lim inf k→+∞ δk > 0. For x∗ ∈ Zeros(A + B) we define the
sequence (Hk )k≥1 as in (19). Then the following statements are true:

(i) The sequence (xk )k≥0 is bounded.

(ii) There exists lim_{k→+∞} Hk ∈ R.

(iii) Σ_{k=1}^{+∞} δk‖xk+1 − xk‖² < +∞.

Proof (i) From (20) and lim inf k→+∞ δk > 0 we can conclude that there exists k0 ≥ 1 such
that the sequence (Hk )k≥k0 is nonincreasing. Therefore for all k ≥ k0 we have

Hk0 ≥ Hk ≥ ‖xk − x*‖² − αk‖xk−1 − x*‖² ≥ ‖xk − x*‖² − α‖xk−1 − x*‖²

and, from here,

‖xk − x*‖² ≤ α‖xk−1 − x*‖² + Hk0
           ≤ α^{k−k0}‖xk0 − x*‖² + Hk0(1 + α + ... + α^{k−k0−1})
           = α^{k−k0}‖xk0 − x*‖² + Hk0(1 − α^{k−k0})/(1 − α),

which implies that (xk )k≥0 is bounded and so is (Hk )k≥1 .


(ii) Since (Hk )k≥k0 is nonincreasing and bounded, it converges to a real number.
(iii) Follows from (ii) and Proposition 8.

We are now in a position to prove the main result of this section. In order to do so,
we first recall two useful lemmas.


Lemma 11 (Alvarez and Attouch, 2001) Let (ϕk)k≥0, (αk)k≥1, and (ψk)k≥1 be sequences
of nonnegative real numbers satisfying

ϕk+1 ≤ ϕk + αk(ϕk − ϕk−1) + ψk   ∀k ≥ 1,   Σ_{k=1}^{+∞} ψk < +∞,

and such that 0 ≤ αk ≤ α < 1 for all k ≥ 1. Then the limit lim_{k→+∞} ϕk ∈ R exists.

Lemma 12 (discrete Lemma of Opial, 1967) Let C be a nonempty subset of H and


(xk )k≥0 be a sequence in H such that the following two conditions hold:

(i) For every x ∈ C, limk→+∞ kxk − xk exists.

(ii) Every weak sequential cluster point of (xk )k≥0 is in C.

Then (xk )k≥0 converges weakly to an element in C.

Theorem 13 Let (αk )k≥1 be a nondecreasing sequence of nonnegative numbers with 0 ≤


αk ≤ α < 1 for all k ≥ 1 and (ρk )k≥1 a sequence of positive numbers such that (18) holds,
with the property that lim inf k→+∞ δk > 0. Then the sequence (xk )k≥0 converges weakly to
some element in Zeros(A + B) as k → +∞.

Proof The result will be a consequence of the discrete Opial Lemma. To this end we will
prove that the conditions (i) and (ii) in Lemma 12 for C := Zeros(A + B) are satisfied.
Let x* ∈ Zeros(A + B). It follows from (22) and (23) that for k large enough

‖xk+1 − x*‖² ≤ (1 + αk)‖xk − x*‖² − αk‖xk−1 − x*‖² + αk(1 + αk)‖xk − xk−1‖².

Therefore, according to Lemma 11 and Proposition 10 (iii), lim_{k→+∞} ‖xk − x*‖ exists.
Let x̄ be a weak sequential cluster point of (xk)k≥0 and (x_{k_l})_{l≥0} a subsequence which
converges weakly to x̄ as l → +∞. From Proposition 10 (iii) we have

lim_{k→+∞} δk‖xk+1 − xk‖² = 0,

which, as lim inf_{k→+∞} δk > 0, yields lim_{k→+∞} ‖xk+1 − xk‖ = 0. Recall that for k ≥ 1 we
have wk = yk − λk(Byk − Bzk). Since

‖wk − zk‖ = (1/ρk)‖xk+1 − zk‖ = (1/ρk)‖(xk+1 − xk) − αk(xk − xk−1)‖   ∀k ≥ 1,

we have lim_{k→+∞} ‖wk − zk‖ = 0. On the other hand, for all k ≥ 1 it holds

‖wk − zk‖ = ‖yk − zk − λk(Byk − Bzk)‖ ≥ ‖yk − zk‖ − λk‖Byk − Bzk‖ ≥ (1 − µθk)‖yk − zk‖.

Since lim_{k→+∞}(1 − µθk) = 1 − µ > 0, we can conclude that ‖yk − zk‖ → 0 as k → +∞.
Furthermore, for all k ≥ 1 we have zk = xk + αk(xk − xk−1) and, since ‖xk+1 − xk‖ → 0 as
k → +∞, we deduce that (z_{k_l})_{l≥0} converges weakly to x̄ as l → +∞. Combining this with
the fact that ‖yk − zk‖ → 0 as k → +∞ then shows that (y_{k_l})_{l≥0} converges weakly to x̄
as l → +∞, too. The definition of (y_{k_l})_{l≥0} gives

(1/λ_{k_l})(z_{k_l} − y_{k_l}) + By_{k_l} − Bz_{k_l} ∈ (A + B)y_{k_l}   ∀l ≥ 0.

Using that ((1/λ_{k_l})(z_{k_l} − y_{k_l}) + By_{k_l} − Bz_{k_l})_{l≥0} converges strongly to 0 and that the graph of
the maximally monotone operator A + B is sequentially closed with respect to the weak-strong
topology of the product space H × H, we obtain 0 ∈ (A + B)x̄, thus x̄ ∈ Zeros(A + B).

Remark 14 In the particular case of the variational inequality (4), which corresponds to
the case when A is the normal cone of a nonempty closed convex subset C of H, by taking
into account that JλNC = PC is for all λ > 0 the projection operator onto C, the relaxed
inertial FBF algorithm reads

                        zk = xk + αk(xk − xk−1),
(RIFBF-VI)  (∀k ≥ 1)    yk = PC(I − λk B)zk,
                        xk+1 = (1 − ρk)zk + ρk(yk − λk(Byk − Bzk)),

where x0 , x1 ∈ H are starting points, (λk )k≥1 and (ρk )k≥1 are sequences of positive numbers,
and (αk )k≥1 is a sequence of nonnegative numbers.
The algorithm converges weakly to a solution of (4) when B is a monotone and Lipschitz
continuous operator, under the hypotheses of Theorem 13.
As it has been shown in (Boţ et al., 2020), it also converges when B is pseudo-monotone
on H, Lipschitz continuous, and sequentially weak-to-weak continuous, and also when H is
finite dimensional, and B is pseudo-monotone on C and Lipschitz continuous.
We recall that B is said to be pseudo-monotone on C (on H) if for all x, y ∈ C (x, y ∈ H)
it holds
⟨Bx, y − x⟩ ≥ 0  ⇒  ⟨By, y − x⟩ ≥ 0.
Denoting wk := yk − λk(Byk − Bzk) and θk := λk/λk+1 for all k ≥ 1, then for all x* ∈
Zeros(NC + B) it holds

‖wk − x*‖² ≤ ‖zk − x*‖² − (1 − µ²θk²)‖yk − zk‖²   ∀k ≥ 1,

which is nothing else than relation (16).


Indeed, since yk ∈ C, we have ⟨Bx*, yk − x*⟩ ≥ 0 and, further, the pseudo-monotonicity
of B gives ⟨Byk, yk − x*⟩ ≥ 0 for all k ≥ 1. On the other hand, since yk = PC(I − λk B)zk,
we have ⟨x* − yk, yk − zk + λk Bzk⟩ ≥ 0 for all k ≥ 1. The two inequalities yield

⟨yk − x*, zk − wk⟩ ≥ 0   ∀k ≥ 1,

which, combined with (17), leads as in the proof of Proposition 7 to the conclusion.
Now, since (16) holds, the statements in Proposition 8 and Proposition 10 remain true
and, as seen in the proof of Theorem 13, they guarantee that the limit limk→+∞ kxk − x∗ k ∈


R exists, and that lim_{k→+∞} ‖yk − zk‖ = lim_{k→+∞} ‖Byk − Bzk‖ = 0. Having that, the weak
convergence of (xk)k≥0 to a solution of (4) follows by arguing as in the proof by (Boţ et al., 2020,
Theorem 3.1), when B is pseudo-monotone on H, Lipschitz-continuous and sequentially
weak-to-weak continuous, and as in the proof by (Boţ et al., 2020, Theorem 3.2), when H
is finite dimensional, and B is pseudo-monotone on C and Lipschitz continuous.

4. Numerical Experiments
In this section, we provide two numerical experiments that complement our theoretical results.
The first one, a bilinear saddle point problem, is a standard deterministic example where all
the assumptions guaranteeing convergence are fulfilled. For the second experiment we leave
the safe harbour of justified assumptions and frankly use RIFBF to treat a more complex
(stochastic) problem and train a special kind of generative machine learning system that
has received a lot of attention recently.

4.1 Bilinear Saddle Point Problem


We want to solve the following bilinear saddle-point problem

min_{u∈U} max_{v∈V} Φ(u, v) := uᵀAv + aᵀu + bᵀv,    (26)

with A ∈ R^{m×n}, a ∈ R^m and b ∈ R^n, and where U ⊆ R^m and V ⊆ R^n are nonempty, closed
and convex sets, in the sense that we want to find u* ∈ U and v* ∈ V such that

Φ(u∗ , v) ≤ Φ(u∗ , v ∗ ) ≤ Φ(u, v ∗ )

for all u ∈ U and all v ∈ V . The monotone inclusion to be solved in this case is of the form

0 ∈ NU ×V (u, v) + F (u, v),

where N_{U×V} is the maximally monotone normal cone operator of U × V and

F(u, v) = M (u, v)ᵀ + (a, −b)ᵀ,   with   M = ( 0   A ; −Aᵀ   0 ),

is monotone and Lipschitz continuous with Lipschitz constant L = ‖M‖₂. Notice that F is
not cocoercive.
For our experiments we choose m = n = 500, and A, a and b to have entries drawn from
different random distributions; we consider the uniform distribution on the interval [0, 1], the
standard normal distribution and the Poisson distribution with parameter 1. For the constraint
sets U and V we take unit balls in the Euclidean norm. We use the constant stepsize λk = λ = µ/L,
where 0 < µ < 1, a constant inertial parameter αk = α and a constant relaxation parameter ρk = ρ
for all k ≥ 1. We set xk := (uk, vk) to fit the framework of our algorithm. The starting
point x0 is initialized randomly (with entries drawn from the uniform distribution on [0, 1]
for all three settings) and we set x1 = x0. We did not observe different behavior across
random trials, so each time we report the results for only one run in the following.
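A sketch of this setup in Python (uniform case; all names ours). The operator F and the projection below can be plugged into the RIFBF sketch from Section 3 as B and J_A, since J_{λ N_{U×V}} = P_{U×V} for every λ > 0:

```python
import numpy as np

rng = np.random.default_rng(0)           # one fixed random trial
m = n = 500
A = rng.uniform(0.0, 1.0, size=(m, n))   # uniform case; normal/Poisson analogous
a = rng.uniform(0.0, 1.0, size=m)
b = rng.uniform(0.0, 1.0, size=n)

def F(x):
    # Monotone operator of (26); x = (u, v) stacked into one vector.
    u, v = x[:m], x[m:]
    return np.concatenate([A @ v + a, -(A.T @ u + b)])

def proj_ball(w):
    # Projection onto the Euclidean unit ball.
    s = np.linalg.norm(w)
    return w if s <= 1.0 else w / s

def proj_UxV(lam, x):
    # Resolvent of lam*N_{U x V}: the projection onto U x V (independent of lam),
    # matching the J_A(lam, z) slot of the earlier RIFBF sketch.
    return np.concatenate([proj_ball(x[:m]), proj_ball(x[m:])])

# Lipschitz constant L = ||M||_2 (spectral norm of the skew block matrix)
L = np.linalg.norm(np.block([[np.zeros((m, m)), A], [-A.T, np.zeros((n, n))]]), 2)
```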
In Figure 2 we can see the development of ‖yk − zk‖ for problem (26) with data drawn
from the different distributions for µ = 0.5, ρ = 0.5 and various values of α. The quantity
‖yk − zk‖ is the fixed point residual of the operator JλA(I − λB), for which according
to the convergence analysis we have ‖yk − zk‖ → 0 as k → +∞. As solutions of the

Figure 2: Behavior of the fixed point residual ‖yk − zk‖ for the constrained bilinear saddle-point
problem with data drawn from different distributions (µ = 0.5); panels: (a) uniform,
(b) standard normal, (c) Poisson.

α \ ρ | 0.01 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 1.10 1.20 1.30 1.32
0.00 ≥ 10,000 ≥ 10,000 6,493 4,327 3,245 2,596 2,166 1,860 1,629 1,447 1,234 1,108 1,017 941 929
0.04 ≥ 10,000 ≥ 10,000 6,233 4,154 3,115 2,493 2,080 1,787 1,562 1,375 1,171 1,065 977 - -
0.08 ≥ 10,000 ≥ 10,000 5,973 3,981 2,985 2,389 1,995 1,713 1,496 1,277 1,121 1,024 937 - -
0.12 ≥ 10,000 ≥ 10,000 5,713 3,807 2,856 2,286 1,910 1,635 1,427 1,194 1,077 984 - - -
0.16 ≥ 10,000 ≥ 10,000 5,453 3,634 2,726 2,184 1,824 1,551 1,336 1,140 1,038 - - - -
0.20 ≥ 10,000 ≥ 10,000 5,192 3,461 2,597 2,082 1,735 1,461 1,231 1,099 - - - - -
0.24 ≥ 10,000 9,871 4,932 3,287 2,468 1,976 1,621 1,374 1,170 - - - - - -
0.28 ≥ 10,000 9,350 4,671 3,114 2,339 1,858 1,502 1,282 - - - - - - -
0.32 ≥ 10,000 8,830 4,411 2,943 2,204 1,734 1,396 - - - - - - - -
0.36 ≥ 10,000 8,309 4,150 2,770 2,035 1,589 1,327 - - - - - - - -
0.40 ≥ 10,000 7,787 3,891 2,571 1,866 1,486 - - - - - - - - -
0.44 ≥ 10,000 7,266 3,633 2,351 1,730 - - - - - - - - - -
0.48 ≥ 10,000 6,744 3,340 2,149 - - - - - - - - - - -
0.52 ≥ 10,000 6,223 3,012 2,000 - - - - - - - - - - -
0.56 ≥ 10,000 5,707 2,709 - - - - - - - - - - - -
0.60 ≥ 10,000 5,130 - - - - - - - - - - - - -
0.64 ≥ 10,000 4,480 - - - - - - - - - - - - -
0.68 ≥ 10,000 3,968 - - - - - - - - - - - - -
0.72 ≥ 10,000 - - - - - - - - - - - - - -
0.76 ≥ 10,000 - - - - - - - - - - - - - -
0.80 ≥ 10,000 - - - - - - - - - - - - - -
0.84 ≥ 10,000 - - - - - - - - - - - - - -
0.88 ≥ 10,000 - - - - - - - - - - - - - -

Table 1: Number of iterations necessary to achieve ‖yk − zk‖ ≤ 10⁻⁵ in the constrained
bilinear saddle-point problem with data drawn from a uniform distribution (µ = 0.5).

problem (26) are not available explicitly, we look at this meaningful surrogate instead. Also,
if we have yk = zk for some k ≥ 1 then xk+1 = yk = zk is a solution of (26).
We see that the behavior of the residual is similar for most combinations of parameters
and all three settings. When the inertial parameter α is larger, the algorithm takes fewer
iterations for the fixed point residual to reach the considered threshold of 10⁻⁵. However,
when α gets close to the limiting case the performance is not consistently better throughout
the entire run anymore when the data is drawn from the uniform or the Poisson distribution.


At times the residual is even worse than for smaller α; nevertheless, the algorithm still
terminates in fewer iterations in the end.
α \ ρ | 0.01 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 1.04
0.00 ≥ 10,000 7,850 3,927 2,620 1,967 1,574 1,309 1,110 948 837 734 705
0.04 ≥ 10,000 7,536 3,770 2,516 1,889 1,511 1,256 1,054 908 791 704 -
0.08 ≥ 10,000 7,222 3,613 2,411 1,810 1,447 1,201 999 865 751 - -
0.12 ≥ 10,000 6,908 3,456 2,307 1,731 1,383 1,141 946 822 - - -
0.16 ≥ 10,000 6,595 3,300 2,202 1,652 1,317 1,071 893 774 - - -
0.20 ≥ 10,000 6,281 3,143 2,098 1,572 1,248 999 843 - - - -
0.24 ≥ 10,000 5,967 2,987 1,993 1,491 1,170 934 - - - - -
0.28 ≥ 10,000 5,653 2,830 1,887 1,405 1,083 879 - - - - -
0.32 ≥ 10,000 5,340 2,674 1,780 1,302 1,000 - - - - - -
0.36 ≥ 10,000 5,026 2,517 1,664 1,196 - - - - - - -
0.40 ≥ 10,000 4,713 2,357 1,531 1,105 - - - - - - -
0.44 ≥ 10,000 4,400 2,189 1,390 - - - - - - - -
0.48 ≥ 10,000 4,088 1,999 - - - - - - - - -
0.52 ≥ 10,000 3,773 1,789 - - - - - - - - -
0.56 ≥ 10,000 3,443 - - - - - - - - - -
0.60 ≥ 10,000 3,077 - - - - - - - - - -
0.64 ≥ 10,000 2,663 - - - - - - - - - -
0.68 ≥ 10,000 - - - - - - - - - - -
0.72 ≥ 10,000 - - - - - - - - - - -
0.76 ≥ 10,000 - - - - - - - - - - -
0.80 ≥ 10,000 - - - - - - - - - - -
0.84 ≥ 10,000 - - - - - - - - - - -

Table 2: Number of iterations necessary to achieve ‖yk − zk‖ ≤ 10⁻⁵ in the constrained
bilinear saddle-point problem with data drawn from a uniform distribution (µ = 0.9).

In Table 1 we can see the number of iterations necessary for the algorithm to achieve
‖yk − zk‖ ≤ 10⁻⁵ for µ = 0.5 and different choices of the parameters α and ρ. As the
results are very similar for all three considered distributions, here we only report the case
of data drawn from the uniform distribution on the interval [0, 1]. For the tables regarding
the experiments for data drawn from the standard normal distribution and the 1-Poisson
distribution please refer to Appendix B.
As mentioned in Remark 9, there is a trade-off between inertia and relaxation. The
parameters α and ρ also need to fulfil the relations

0 ≤ α < 1   and   0 < ρ < 2(1 − α)²/((1 + µ)(2α² − α + 1)),
which is the reason why not every combination of α and ρ is valid.
We see that for a particular choice of the relaxation parameter the least number of
iterations is achieved when the inertial parameter is as large as possible. If, on the other
hand, we fix the inertial parameter, we observe that also larger values of ρ are better
and lead to fewer iterations. To get a conclusion regarding the trade-off between the two
parameters, Table 1 suggests that the influence of the relaxation parameter is stronger than
that of the inertial parameter. Given that no numerical experiments are available for the

relaxed inertial proximal or forward-backward algorithms, we cannot say whether this is a
common phenomenon of relaxed inertial methods or one that is specific to RIFBF.

α \ ρ | 0.01 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 1.10 1.20 1.30 1.40 1.50 1.60 1.70 1.80
0.00 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,337 7,143 6,247 5,553 4,975 4,274 4,200 3,818 3,406 3,272 3,222 3,146 3,230
0.04 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 9,607 8,003 6,856 5,997 5,334 4,639 4,246 4,041 3,579 3,333 3,235 3,161 2,846 -
0.08 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 9,206 7,668 6,569 5,746 5,114 4,314 4,204 3,837 3,382 3,256 3,211 3,209 - -
0.12 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,804 7,332 6,281 5,493 4,774 4,255 4,038 3,617 3,298 3,242 3,314 - - -
0.16 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,402 6,996 5,992 5,227 4,395 4,215 3,837 3,428 3,338 3,371 - - - -
0.20 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 7,998 6,658 5,685 4,906 4,261 4,048 3,656 3,415 3,574 - - - - -
0.24 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 9,501 7,592 6,309 5,263 4,497 4,225 3,841 3,626 - - - - - - -
0.28 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,994 7,178 5,872 4,861 4,297 4,065 3,835 - - - - - - - -
0.32 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,481 6,666 5,364 4,612 4,269 4,105 - - - - - - - - -
0.36 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 7,906 6,098 5,053 4,548 4,450 - - - - - - - - - -
0.40 ≥ 10,000 ≥ 10,000 ≥ 10,000 9,961 7,205 5,649 5,008 4,924 - - - - - - - - - - -
0.44 ≥ 10,000 ≥ 10,000 ≥ 10,000 9,109 6,572 5,520 5,490 - - - - - - - - - - - -
0.48 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,174 6,356 6,115 - - - - - - - - - - - - -
0.52 ≥ 10,000 ≥ 10,000 ≥ 10,000 7,633 6,785 - - - - - - - - - - - - - -
0.56 ≥ 10,000 ≥ 10,000 ≥ 10,000 7,642 - - - - - - - - - - - - - - -
0.60 ≥ 10,000 ≥ 10,000 9,747 - - - - - - - - - - - - - - - -
0.64 ≥ 10,000 ≥ 10,000 - - - - - - - - - - - - - - - - -
0.68 ≥ 10,000 ≥ 10,000 - - - - - - - - - - - - - - - - -
0.72 ≥ 10,000 ≥ 10,000 - - - - - - - - - - - - - - - - -
0.76 ≥ 10,000 - - - - - - - - - - - - - - - - - -
0.80 ≥ 10,000 - - - - - - - - - - - - - - - - - -
0.84 ≥ 10,000 - - - - - - - - - - - - - - - - - -
0.88 ≥ 10,000 - - - - - - - - - - - - - - - - - -

Table 3: Number of iterations necessary to achieve ‖yk − zk‖ ≤ 10⁻⁵ in the constrained
bilinear saddle-point problem with data drawn from a uniform distribution (µ = 0.1).

Even though the possible values for α get smaller as ρ approaches ρmax(µ) = 2/(1 + µ), the number
of iterations decreases for α > 0. In particular, we want to point out that over-relaxation
(ρ > 1) seems to be highly beneficial.
Comparing the results for different choices of µ in Table 2 (µ = 0.9) and Table 3
(µ = 0.1), we can observe a very interesting behavior. Even though smaller µ allows for
larger values of the relaxation parameter (see Remark 9), larger µ leads to better results
in general. This behavior presumably is due to the role µ plays in the definition of the
stepsize.

Figure 3: Behavior of the gap function |G(uk, vk)| for the constrained bilinear saddle-point
problem with data drawn from different distributions (µ = 0.5); panels: (a) uniform,
(b) standard normal, (c) Poisson.

To get further insight into the convergence behavior, we also look at the following gap
function,

G(s, t) := inf_{u∈U} Φ(u, t) − sup_{v∈V} Φ(s, v).

The quantity (G(uk, vk))k≥0 is a meaningful measure of the quality of the iterates
(uk, vk)k≥0, as for an optimum (u*, v*) we have Φ(u*, v) ≤ Φ(u*, v*) ≤ Φ(u, v*) for all
u ∈ U and all v ∈ V, and hence G(u*, v*) = 0.
Because of the particular choice of a bilinear objective and of the constraint sets U and V, the
expressions can actually be computed in closed form and we get

G(uk, vk) = −‖Avk + a‖₂ + bᵀvk − ‖Aᵀuk + b‖₂ − aᵀuk   ∀k ≥ 0.
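With the setup sketched earlier, this gap can be evaluated in a few lines (assuming the arrays A, a, b from the earlier snippet):

```python
def gap(u, v):
    # Closed-form gap G(u, v) under unit-ball constraints; zero at a saddle point of (26).
    return (-np.linalg.norm(A @ v + a) + b @ v
            - np.linalg.norm(A.T @ u + b) - a @ u)
```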

In Figure 3 we see the development of the absolute value of the gap |G(uk , vk )| for
problem (26) with data drawn from the different distributions for µ = 0.5, ρ = 0.5 and
various values of α. We see that, as in the case of the residual of the fixed point iteration, the
behavior is similar for most combinations of parameters – the larger the inertial parameter
α the better the results in general. However, in the case of the data being drawn from
the uniform or the Poisson distribution, the behavior of the gap is not consistently better
anymore when α gets close to the limiting case. In the meantime the gap for maximal α is


the worst compared to all other parameter combinations, although the algorithm still gives the
best results in the end. As the theory suggests, the gap indeed decreases and tends to zero
as the number of iterations grows.
Finally, in Figure 4 we look at the discrete velocity of the iterates ‖xk+1 − xk‖, which
is square summable and thus vanishes as k → +∞. Again we look at the three settings
where the data is drawn from a uniform, the standard normal and a Poisson distribution
and plot the results for µ = 0.5, ρ = 0.5 and various values of α. Once more larger inertial
parameters give better results, with the biggest difference to the previous instances being an
intermediate phase approximately up to iteration 100 where we do not have a clear picture
yet and which occurs for all distributions.

Figure 4: Behavior of the discrete velocity of the iterates ‖xk+1 − xk‖ for the constrained
bilinear saddle-point problem with data drawn from different distributions (µ = 0.5);
panels: (a) uniform, (b) standard normal, (c) Poisson.

4.2 Generative Adversarial Networks (GANs)

Generative Adversarial (Artificial Neural) Networks (GANs) are a class of machine learning
systems in which two "adversarial" networks compete in a (zero-sum) game against each
other. Given a training set, this technique aims to learn to generate new data with the
other. Given a training set, this technique aims to learn to generate new data with the
same statistics as the training set. The generative network, called generator, tries to mimic
the original genuine data (distribution) and strives to produce samples that fool the other
network. The opposing, discriminative network, called discriminator, evaluates both true
samples as well as generated ones and tries to distinguish between them. The generator’s
objective is to increase the error rate of the discriminative network by producing novel
samples that the discriminator thinks were not generated, while its opponent’s goal is to
successfully judge (with high certainty) whether the presented data is true or not. Typically,
the generator learns to map from a latent space to a data distribution which is (hopefully)
similar to the original distribution. However, one does not have access to the distribu-
tion itself, but only random samples drawn from it. The original formulation of GANs


by (Goodfellow et al., 2014) is as follows,

min_u max_v Φ(u, v),

where Φ(u, v) = E_{x∼p}[log(Dv(x))] + E_{x′∼qu}[log(1 − Dv(x′))] is the value function, u and v
are the parametrisations of the generator and the discriminator, respectively, Dv(x) is the
discriminator's estimated probability that the input x is real, and p and qu are the real and
the learned distribution, respectively.

Figure 5: Schematic illustration of Generative Adversarial Networks: a latent random
variable is mapped by the generator to a fake sample, while real-world images provide real
samples; the discriminator classifies inputs as real or fake, yielding the loss.

Again, the problem

min_{u∈U} max_{v∈V} Φ(u, v)

is understood in the sense that we want to find u∗ ∈ U and v ∗ ∈ V such that

Φ(u∗ , v) ≤ Φ(u∗ , v ∗ ) ≤ Φ(u, v ∗ )

for all u ∈ U and all v ∈ V . The corresponding inclusion problem then reads

0 ∈ NU ×V (u, v) + F (u, v),

where NU ×V is the normal cone operator to U ×V and F (u, v) = (∇u Φ(u, v), −∇v Φ(u, v))T .
If U and V are nonempty, convex and closed sets, and Φ(u, v) is convex-concave, Fréchet
differentiable and has a Lipschitz continuous gradient, then we have a variational inequality,
the obtained theory holds and we can apply RIFBF-VI.
This motivates the use of (variants of) FBF methods for the training of GANs, even though
in practice the value functions used are typically not convex-concave and, further, the
gradient might not be Lipschitz continuous, if it exists at all. Additionally, in general one
needs stochastic versions of the algorithms, which we do not provide in the case
of RIFBF. A stochastic variant of the relaxed inertial forward-backward-forward algorithm
RIFBF has been proposed and analyzed in (Cui et al., 2021), one and a half years after the
first version of this article. The convergence analysis of that stochastic numerical method
relies heavily on the one carried out for RIFBF.


Recently, a first successful attempt at using methods from the field of variational
inequalities for GAN training was made by (Gidel et al., 2019). In particular, they applied
the well established extra-gradient algorithm and some derived variations. In this spirit
we frankly apply the FBF method and a variant with inertial effect (α = 0.05), as well as
a primal-dual algorithm for saddle point problems, introduced by (Hamedani and Aybat,
2021) (PDSP), which to the best of our knowledge has not been used for GAN training
before, and compare the results to the best method (Extra Adam) from the work by (Gidel
et al., 2019).
For our experiments we use the standard DCGAN architecture (Radford et al., 2016)
where generator and discriminator consist (among other elements) of several convolutional
and convolutional-transpose layers, respectively, which have been shown to work very well with
images. The opposing networks are trained on the CIFAR10 dataset (Krizhevsky, 2009)
with the WGAN objective and weight clipping (Arjovsky et al., 2017), which uses the idea
to minimize the Wasserstein distance between the true distribution and the one learned by
the generator. Note that in absence of bound constraints on the weights of at least one
of the two networks, the backward step in the FBF algorithm would be redundant, as we
would project on the whole space and we obtain the unconstrained extra-gradient method.
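In this constrained setting the backward step reduces to clipping the discriminator weights after each update. A minimal PyTorch-style sketch (the clipping threshold c = 0.01 is our assumption, following common practice for WGAN weight clipping; the helper name is ours):

```python
import torch

def backward_step(discriminator, c=0.01):
    # Projection of the discriminator parameters onto the box [-c, c]^d,
    # i.e. the resolvent of the normal cone of the weight-clipping constraint.
    with torch.no_grad():
        for p in discriminator.parameters():
            p.clamp_(-c, c)
```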
Furthermore, in our experiments, instead of plain stochastic gradients we use the Adam
optimizer (Kingma and Ba, 2014) with the hyperparameters (β1 = 0.5, β2 = 0.9) that were
used by (Gidel et al., 2019), as the best results there were achieved with this choice. We would
also like to mention that we only did a hyperparameter search for the stepsizes of the
newly introduced methods; all other parameters were chosen to be the same as in the
aforementioned work.
Method                 IS             FID
PDSP Adam              4.20 ± 0.04    53.97 ± 0.28
Extra Adam             4.07 ± 0.05    56.67 ± 0.61
FBF Adam               4.54 ± 0.04    45.85 ± 0.35
IFBF Adam (α = 0.05)   4.59 ± 0.04    45.25 ± 0.60

Table 4: Best IS and FID scores (averaged over 5 runs) achieved on CIFAR10.

The model is evaluated using the inception score (IS), in the reworked implementation by
Barratt and Sharma (2018) that fixes some issues of the original one, as well as the Fréchet
inception distance (FID) (Heusel et al., 2017), both computed on 50,000 samples. Exper-
iments were run with 5 random seeds for 500,000 updates of the generator on an NVIDIA
GeForce RTX 2080Ti GPU. Table 4 reports the best IS and FID achieved by each consid-
ered method. Note that the IS values for Extra Adam differ from those stated by Gidel
et al. (2019), due to the usage of the corrected implementation of the score.
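As a reminder of what FID measures, the following sketch (ours, assuming precomputed Inception activations act_real and act_fake) evaluates the formula of Heusel et al. (2017), ‖µ₁ − µ₂‖² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^(1/2)); production implementations add further numerical stabilization to the matrix square root.

import numpy as np
from scipy import linalg

def frechet_inception_distance(act_real, act_fake):
    # Fit Gaussians to the activation statistics (rows = samples).
    mu1, mu2 = act_real.mean(axis=0), act_fake.mean(axis=0)
    s1 = np.cov(act_real, rowvar=False)
    s2 = np.cov(act_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(s1 + s2 - 2.0 * covmean)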
We see that even though we have only proved convergence in the monotone case and in
the deterministic setting, the variants of RIFBF perform well in the training of GANs.
IFBF Adam outperforms all other considered methods, both in terms of IS and FID. As
the theory suggests, making use of inertial effects (even though Adam already incorporates
some momentum) seems to provide an additional improvement of the numerical method
in practice. The results suggest that employing methods that are designed to capture the
nature of a problem, in this case a constrained minimax/saddle-point problem, is highly
beneficial. In Figure 6 we provide samples of the generator trained with the different
methods.

Acknowledgments

The authors would like to thank Julius Berner and Axel Böhm for fruitful discussions and
their valuable suggestions and comments, as well as the anonymous reviewers for their
recommendations, which have improved the quality of the paper. Radu Boţ would like to
acknowledge partial support from the Austrian Science Fund (FWF), projects I 2419-N32
and W 1260. Michael Sedlmayer would like to acknowledge support from the Austrian
Research Promotion Agency (FFG), project “Smart operation of wind turbines under icing
conditions (SOWINDIC)”.


[Figure: generated sample images for each training method — (a) PDSP Adam, (b) Extra Adam, (c) FBF Adam, (d) IFBF Adam.]

Figure 6: Comparison of the samples of a WGAN with weight clipping trained with the
different methods.


Appendix A. Proof of Theorem 4


We will prove that from

ḧ(t) + γ(t)ḣ(t) + ((1 − λL)/(1 + λL)²) τ(t)‖Mx(t)‖² ≤ ‖ẋ(t)‖²   ∀t ∈ [0, +∞)

one obtains the statements (i) and (ii) in Theorem 4 as well as the existence of the limit
limt→+∞ ‖x(t) − x∗‖ ∈ R. We denote κ := (1 − λL)/(1 + λL)² > 0.

Taking again into account (8) one obtains for every t ∈ [0, +∞)

ḧ(t) + γ(t)ḣ(t) + (κ/τ(t))‖ẍ(t) + γ(t)ẋ(t)‖² ≤ ‖ẋ(t)‖²

or, equivalently, after expanding the square and using d/dt ‖ẋ(t)‖² = 2⟨ẍ(t), ẋ(t)⟩,

ḧ(t) + γ(t)ḣ(t) + (κγ(t)/τ(t)) d/dt ‖ẋ(t)‖² + ((κγ²(t)/τ(t)) − 1)‖ẋ(t)‖² + (κ/τ(t))‖ẍ(t)‖² ≤ 0.

Combining this inequality with

(γ(t)/τ(t)) d/dt ‖ẋ(t)‖² = d/dt ((γ(t)/τ(t))‖ẋ(t)‖²) − ((γ̇(t)τ(t) − γ(t)τ̇(t))/τ²(t))‖ẋ(t)‖²

and

γ(t)ḣ(t) = d/dt (γh)(t) − γ̇(t)h(t) ≥ d/dt (γh)(t),
it yields for every t ∈ [0, +∞)

ḧ(t) + d/dt (γh)(t) + κ d/dt ((γ(t)/τ(t))‖ẋ(t)‖²)
+ ((κγ²(t)/τ(t)) + κ((−γ̇(t)τ(t) + γ(t)τ̇(t))/τ²(t)) − 1)‖ẋ(t)‖² + (κ/τ(t))‖ẍ(t)‖² ≤ 0.

According to Assumption 1 we have for almost every t ∈ [0, +∞) the inequality

ḧ(t) + d/dt (γh)(t) + κ d/dt ((γ(t)/τ(t))‖ẋ(t)‖²) + ν‖ẋ(t)‖² + (κ/τ̄)‖ẍ(t)‖² ≤ 0,   (27)

where τ̄ denotes a positive upper bound for the function τ. From here we obtain that
the function t ↦ ḣ(t) + γ(t)h(t) + κ(γ(t)/τ(t))‖ẋ(t)‖², which is locally absolutely continuous, is
monotonically decreasing. Hence there exists a real number M such that

ḣ(t) + γ(t)h(t) + κ(γ(t)/τ(t))‖ẋ(t)‖² ≤ M   ∀t ∈ [0, +∞),   (28)

which yields

ḣ(t) + γ̲h(t) ≤ M   ∀t ∈ [0, +∞),


where γ̲ denotes a positive lower bound for the function γ. By multiplying this inequality
with exp(γ̲t) and then integrating from 0 to T, where T > 0, it yields

h(T) ≤ h(0) exp(−γ̲T) + (M/γ̲)(1 − exp(−γ̲T)),

thus h is bounded and, consequently, the trajectory x is bounded.
On the other hand, from (28) it follows that for every t ∈ [0, +∞)

ḣ(t) + (κγ̲/τ̄)‖ẋ(t)‖² ≤ M,

hence

⟨x(t) − x∗, ẋ(t)⟩ + (κγ̲/τ̄)‖ẋ(t)‖² ≤ M.

Since x is bounded, this yields that ẋ is bounded, and further that ḣ is bounded.
Integrating (27) we obtain that there exists a real number N ∈ R such that for every
t ∈ [0, +∞)

ḣ(t) + γ(t)h(t) + κ(γ(t)/τ(t))‖ẋ(t)‖² + ν ∫₀ᵗ ‖ẋ(s)‖² ds + (κ/τ̄) ∫₀ᵗ ‖ẍ(s)‖² ds ≤ N,

which allows us to conclude that ẋ, ẍ ∈ L²([0, +∞); H). Finally, from (8) and Assumption
1 we deduce Mx ∈ L²([0, +∞); H), and the proof of statement (i) is complete.
In order to prove (ii), we notice that for every t ∈ [0, +∞) it holds

d/dt ((1/2)‖ẋ(t)‖²) = ⟨ẋ(t), ẍ(t)⟩ ≤ (1/2)‖ẋ(t)‖² + (1/2)‖ẍ(t)‖²,

which, according to (i), leads to limt→+∞ ẋ(t) = 0.
Further, for every t ∈ [0, +∞) we have

d/dt ((1/2)‖M(x(t))‖²) = ⟨M(x(t)), d/dt (Mx(t))⟩ ≤ ((1 + λL)²/2)‖M(x(t))‖² + ((2 + λL)²/2)‖ẋ(t)‖².
By using again (i), we obtain limt→+∞ M(x(t)) = 0, while limt→+∞ ẍ(t) = 0 follows from
(8) and Assumption 1.
Finally, we have seen that the function t ↦ ḣ(t) + γ(t)h(t) + κ(γ(t)/τ(t))‖ẋ(t)‖² is monotonically
decreasing; thus from (i), (ii) and Assumption 1 we deduce that limt→+∞ γ(t)h(t) exists and
is a real number. Taking also into account that limt→+∞ γ(t) ∈ (0, +∞) exists, we
obtain that limt→+∞ ‖x(t) − x∗‖ ∈ R exists.


Appendix B. Additional Tables


B.1 Standard Normal Distribution

α \ ρ   0.01   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00   1.04
0.00 ≥ 10,000 4,217 2,110 1,407 1,056 845 705 604 529 471 424 400
0.04 ≥ 10,000 4,049 2,025 1,351 1,014 811 677 580 508 452 399 -
0.08 ≥ 10,000 3,880 1,941 1,295 972 778 648 556 487 433 - -
0.12 ≥ 10,000 3,711 1,857 1,238 929 744 620 532 466 - - -
0.16 ≥ 10,000 3,543 1,772 1,182 887 710 592 508 441 - - -
0.20 ≥ 10,000 3,374 1,688 1,126 845 677 564 481 - - - -
0.24 ≥ 10,000 3,206 1,604 1,070 803 643 533 - - - - -
0.28 ≥ 10,000 3,037 1,520 1,014 761 607 498 - - - - -
0.32 ≥ 10,000 2,868 1,435 958 717 564 - - - - - -
0.36 ≥ 10,000 2,700 1,351 901 668 - - - - - - -
0.40 ≥ 10,000 2,531 1,267 841 610 - - - - - - -
0.44 ≥ 10,000 2,363 1,182 770 - - - - - - - -
0.48 ≥ 10,000 2,194 1,093 - - - - - - - - -
0.52 ≥ 10,000 2,026 986 - - - - - - - - -
0.56 ≥ 10,000 1,857 - - - - - - - - - -
0.60 ≥ 10,000 1,679 - - - - - - - - - -
0.64 ≥ 10,000 1,459 - - - - - - - - - -
0.68 ≥ 10,000 - - - - - - - - - - -
0.72 ≥ 10,000 - - - - - - - - - - -
0.76 ≥ 10,000 - - - - - - - - - - -
0.80 8,437 - - - - - - - - - - -
0.84 6,747 - - - - - - - - - - -

Table 5: Number of iterations necessary to achieve ‖yₖ − zₖ‖ ≤ 10⁻⁵ in the constrained
bilinear saddle-point problem with data drawn from the standard normal distribution
(µ = 0.9)

α \ ρ   0.01   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00   1.10   1.20   1.30   1.32
0.00 ≥ 10,000 7,127 3,565 2,378 1,785 1,429 1,191 1,022 894 795 716 614 600 537 522
0.04 ≥ 10,000 6,842 3,423 2,283 1,713 1,372 1,144 981 859 764 668 623 572 - -
0.08 ≥ 10,000 6,557 3,281 2,188 1,642 1,315 1,096 940 823 731 615 604 542 - -
0.12 ≥ 10,000 6,272 3,138 2,093 1,571 1,258 1,049 900 787 693 623 578 - - -
0.16 ≥ 10,000 5,988 2,996 1,999 1,500 1,201 1,001 858 744 626 607 - - - -
0.20 ≥ 10,000 5,703 2,853 1,904 1,429 1,144 953 813 695 621 - - - - -
0.24 ≥ 10,000 5,418 2,711 1,809 1,358 1,087 900 751 642 - - - - - -
0.28 ≥ 10,000 5,133 2,569 1,714 1,286 1,025 841 682 - - - - - - -
0.32 ≥ 10,000 4,848 2,427 1,619 1,212 952 758 - - - - - - - -
0.36 ≥ 10,000 4,564 2,284 1,524 1,128 861 725 - - - - - - - -
0.40 ≥ 10,000 4,279 2,142 1,421 1,028 795 - - - - - - - - -
0.44 ≥ 10,000 3,995 1,999 1,299 925 - - - - - - - - - -
0.48 ≥ 10,000 3,710 1,847 1,148 - - - - - - - - - - -
0.52 ≥ 10,000 3,426 1,663 1,102 - - - - - - - - - - -
0.56 ≥ 10,000 3,141 1,449 - - - - - - - - - - - -
0.60 ≥ 10,000 2,839 - - - - - - - - - - - - -
0.64 ≥ 10,000 2,458 - - - - - - - - - - - - -
0.68 ≥ 10,000 2,172 - - - - - - - - - - - - -
0.72 ≥ 10,000 - - - - - - - - - - - - - -
0.76 ≥ 10,000 - - - - - - - - - - - - - -
0.80 ≥ 10,000 - - - - - - - - - - - - - -
0.84 ≥ 10,000 - - - - - - - - - - - - - -
0.88 8,066 - - - - - - - - - - - - - -

Table 6: Number of iterations necessary to achieve ‖yₖ − zₖ‖ ≤ 10⁻⁵ in the constrained
bilinear saddle-point problem with data drawn from the standard normal distribution
(µ = 0.5)

α \ ρ   0.01   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00   1.10   1.20   1.30   1.40   1.50   1.60   1.70   1.80
0.00 ≥ 10,000 ≥ 10,000 ≥ 10,000 9,690 7,269 5,816 4,848 4,156 3,637 3,233 2,910 2,416 2,434 2,162 1,878 1,943 1,702 1,696 1,459
0.04 ≥ 10,000 ≥ 10,000 ≥ 10,000 9,303 6,978 5,584 4,654 3,990 3,492 3,104 2,677 2,515 2,316 1,981 1,929 1,828 1,616 1,588 -
0.08 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,916 6,688 5,351 4,460 3,824 3,346 2,970 2,404 2,448 2,186 1,853 1,916 1,650 1,685 - -
0.12 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,528 6,397 5,119 4,267 3,658 3,198 2,802 2,506 2,343 2,044 1,859 1,816 1,572 - - -
0.16 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,141 6,107 4,887 4,073 3,487 3,014 2,460 2,460 2,237 1,893 1,860 1,678 - - - -
0.20 ≥ 10,000 ≥ 10,000 ≥ 10,000 7,754 5,816 4,654 3,876 3,299 2,796 2,485 2,343 2,130 1,752 1,795 - - - - -
0.24 ≥ 10,000 ≥ 10,000 ≥ 10,000 7,366 5,526 4,419 3,653 3,022 2,551 2,473 2,227 2,023 - - - - - - -
0.28 ≥ 10,000 ≥ 10,000 ≥ 10,000 6,979 5,235 4,165 3,396 2,703 2,448 2,344 2,111 - - - - - - - -
0.32 ≥ 10,000 ≥ 10,000 9,885 6,592 4,929 3,847 3,014 2,622 2,490 2,214 - - - - - - - - -
0.36 ≥ 10,000 ≥ 10,000 9,304 6,201 4,568 3,439 2,894 2,639 2,345 - - - - - - - - - -
0.40 ≥ 10,000 ≥ 10,000 8,723 5,775 4,128 3,148 2,920 2,513 - - - - - - - - - - -
0.44 ≥ 10,000 ≥ 10,000 8,139 5,246 3,667 3,219 2,737 - - - - - - - - - - - -
0.48 ≥ 10,000 ≥ 10,000 7,507 4,563 3,724 3,049 - - - - - - - - - - - - -
0.52 ≥ 10,000 ≥ 10,000 6,708 4,416 3,518 - - - - - - - - - - - - - -
0.56 ≥ 10,000 ≥ 10,000 5,748 4,298 - - - - - - - - - - - - - - -
0.60 ≥ 10,000 ≥ 10,000 5,719 - - - - - - - - - - - - - - - -
0.64 ≥ 10,000 9,879 - - - - - - - - - - - - - - - - -
0.68 ≥ 10,000 8,680 - - - - - - - - - - - - - - - - -
0.72 ≥ 10,000 8,203 - - - - - - - - - - - - - - - - -
0.76 ≥ 10,000 - - - - - - - - - - - - - - - - - -
0.80 ≥ 10,000 - - - - - - - - - - - - - - - - - -
0.84 ≥ 10,000 - - - - - - - - - - - - - - - - - -
0.88 ≥ 10,000 - - - - - - - - - - - - - - - - - -

Table 7: Number of iterations necessary to achieve ‖yₖ − zₖ‖ ≤ 10⁻⁵ in the constrained
bilinear saddle-point problem with data drawn from the standard normal distribution
(µ = 0.1)

B.2 1-Poisson Distribution

α \ ρ   0.01   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00   1.04
0.00 ≥ 10,000 6,183 3,093 2,063 1,549 1,241 1,036 888 769 675 589 565
0.04 ≥ 10,000 5,936 2,969 1,981 1,487 1,192 995 851 733 638 564 -
0.08 ≥ 10,000 5,689 2,846 1,899 1,426 1,143 953 812 698 603 - -
0.12 ≥ 10,000 5,441 2,722 1,816 1,365 1,094 911 768 662 - - -
0.16 ≥ 10,000 5,194 2,598 1,734 1,304 1,044 866 723 623 - - -
0.20 ≥ 10,000 4,947 2,475 1,653 1,242 993 814 679 - - - -
0.24 ≥ 10,000 4,700 2,352 1,571 1,180 941 756 - - - - -
0.28 ≥ 10,000 4,452 2,229 1,490 1,117 882 706 - - - - -
0.32 ≥ 10,000 4,205 2,106 1,407 1,046 810 - - - - - -
0.36 ≥ 10,000 3,958 1,984 1,321 967 - - - - - - -
0.40 ≥ 10,000 3,711 1,861 1,224 888 - - - - - - -
0.44 ≥ 10,000 3,465 1,733 1,125 - - - - - - - -
0.48 ≥ 10,000 3,219 1,592 - - - - - - - - -
0.52 ≥ 10,000 2,975 1,441 - - - - - - - - -
0.56 ≥ 10,000 2,724 - - - - - - - - - -
0.60 ≥ 10,000 2,444 - - - - - - - - - -
0.64 ≥ 10,000 2,145 - - - - - - - - - -
0.68 ≥ 10,000 - - - - - - - - - - -
0.72 ≥ 10,000 - - - - - - - - - - -
0.76 ≥ 10,000 - - - - - - - - - - -
0.80 ≥ 10,000 - - - - - - - - - - -
0.84 9,852 - - - - - - - - - - -

Table 8: Number of iterations necessary to achieve ‖yₖ − zₖ‖ ≤ 10⁻⁵ in the constrained
bilinear saddle-point problem with data drawn from a Poisson distribution (µ = 0.9)

α \ ρ   0.01   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00   1.10   1.20   1.30   1.32
0.00 ≥ 10,000 ≥ 10,000 5,149 3,432 2,573 2,059 1,717 1,474 1,293 1,153 998 894 822 763 753
0.04 ≥ 10,000 9,888 4,943 3,294 2,470 1,977 1,648 1,416 1,242 1,103 944 861 791 - -
0.08 ≥ 10,000 9,476 4,737 3,157 2,368 1,895 1,580 1,359 1,190 1,037 904 829 762 - -
0.12 ≥ 10,000 9,064 4,530 3,019 2,264 1,812 1,513 1,300 1,138 964 871 799 - - -
0.16 ≥ 10,000 8,651 4,324 2,882 2,162 1,731 1,446 1,237 1,080 919 842 - - - -
0.20 ≥ 10,000 8,239 4,118 2,744 2,059 1,650 1,379 1,164 996 888 - - - - -
0.24 ≥ 10,000 7,827 3,911 2,607 1,956 1,568 1,296 1,093 942 - - - - - -
0.28 ≥ 10,000 7,414 3,705 2,469 1,855 1,477 1,202 1,031 - - - - - - -
0.32 ≥ 10,000 7,002 3,498 2,332 1,751 1,381 1,119 - - - - - - - -
0.36 ≥ 10,000 6,589 3,292 2,197 1,624 1,273 1,067 - - - - - - - -
0.40 ≥ 10,000 6,176 3,085 2,049 1,490 1,196 - - - - - - - - -
0.44 ≥ 10,000 5,762 2,881 1,874 1,386 - - - - - - - - - -
0.48 ≥ 10,000 5,349 2,658 1,714 - - - - - - - - - - -
0.52 ≥ 10,000 4,935 2,404 1,611 - - - - - - - - - - -
0.56 ≥ 10,000 4,524 2,169 - - - - - - - - - - - -
0.60 ≥ 10,000 4,085 - - - - - - - - - - - - -
0.64 ≥ 10,000 3,576 - - - - - - - - - - - - -
0.68 ≥ 10,000 3,198 - - - - - - - - - - - - -
0.72 ≥ 10,000 - - - - - - - - - - - - - -
0.76 ≥ 10,000 - - - - - - - - - - - - - -
0.80 ≥ 10,000 - - - - - - - - - - - - - -
0.84 ≥ 10,000 - - - - - - - - - - - - - -
0.88 ≥ 10,000 - - - - - - - - - - - - - -

Table 9: Number of iterations necessary to achieve ‖yₖ − zₖ‖ ≤ 10⁻⁵ in the constrained
bilinear saddle-point problem with data drawn from a Poisson distribution (µ = 0.5)

α \ ρ   0.01   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00   1.10   1.20   1.30   1.40   1.50   1.60   1.70   1.80
0.00 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,060 6,714 5,752 5,031 4,471 4,015 3,532 3,438 3,129 2,893 2,827 2,767 2,728 2,761
0.04 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 9,675 7,737 6,445 5,521 4,829 4,294 3,773 3,504 3,303 2,964 2,890 2,790 2,784 2,646 -
0.08 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 9,271 7,414 6,175 5,290 4,627 4,119 3,556 3,448 3,152 2,893 2,838 2,786 2,672 - -
0.12 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,867 7,090 5,905 5,058 4,424 3,874 3,523 3,316 3,026 2,920 2,829 2,863 - - -
0.16 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,462 6,766 5,634 4,826 4,210 3,619 3,471 3,198 2,968 2,946 2,895 - - - -
0.20 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,057 6,441 5,362 4,581 3,960 3,563 3,354 3,142 3,022 3,032 - - - - -
0.24 ≥ 10,000 ≥ 10,000 ≥ 10,000 ≥ 10,000 7,652 6,114 5,080 4,263 3,677 3,522 3,288 3,192 - - - - - - -
0.28 ≥ 10,000 ≥ 10,000 ≥ 10,000 9,670 7,244 5,781 4,750 3,957 3,650 3,469 3,366 - - - - - - - -
0.32 ≥ 10,000 ≥ 10,000 ≥ 10,000 9,128 6,831 5,383 4,373 3,831 3,695 3,588 - - - - - - - - -
0.36 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,582 6,374 4,951 4,210 3,982 3,863 - - - - - - - - - -
0.40 ≥ 10,000 ≥ 10,000 ≥ 10,000 8,023 5,842 4,628 4,352 4,233 - - - - - - - - - - -
0.44 ≥ 10,000 ≥ 10,000 ≥ 10,000 7,357 5,382 4,824 4,681 - - - - - - - - - - - -
0.48 ≥ 10,000 ≥ 10,000 ≥ 10,000 6,655 5,474 5,215 - - - - - - - - - - - - -
0.52 ≥ 10,000 ≥ 10,000 9,424 6,385 5,869 - - - - - - - - - - - - - -
0.56 ≥ 10,000 ≥ 10,000 8,431 6,725 - - - - - - - - - - - - - - -
0.60 ≥ 10,000 ≥ 10,000 8,298 - - - - - - - - - - - - - - - -
0.64 ≥ 10,000 ≥ 10,000 - - - - - - - - - - - - - - - - -
0.68 ≥ 10,000 ≥ 10,000 - - - - - - - - - - - - - - - - -
0.72 ≥ 10,000 ≥ 10,000 - - - - - - - - - - - - - - - - -
0.76 ≥ 10,000 - - - - - - - - - - - - - - - - - -
0.80 ≥ 10,000 - - - - - - - - - - - - - - - - - -
0.84 ≥ 10,000 - - - - - - - - - - - - - - - - - -
0.88 ≥ 10,000 - - - - - - - - - - - - - - - - - -

Table 10: Number of iterations necessary to achieve ‖yₖ − zₖ‖ ≤ 10⁻⁵ in the constrained
bilinear saddle-point problem with data drawn from a Poisson distribution (µ = 0.1)

Appendix C. DCGAN Architecture

Generator

Input: z ∈ R^128 ∼ N(0, I)


Linear 128 → 512 × 4 × 4
Batch Normalization
ReLU
transposed conv. (kernel: 4 × 4, 512 → 256, stride: 2, pad: 1)
Batch Normalization
ReLU
transposed conv. (kernel: 4 × 4, 256 → 128, stride: 2, pad: 1)
Batch Normalization
ReLU
transposed conv. (kernel: 4 × 4, 128 → 3, stride: 2, pad: 1)
Tanh(.)

Discriminator

Input: x ∈ R^(3×32×32)
conv. (kernel: 4 × 4, 3 → 64, stride: 2, pad: 1)
LeakyReLU (negative slope: 0.2)
conv. (kernel: 4 × 4, 64 → 128, stride: 2, pad: 1)
Batch Normalization
LeakyReLU (negative slope: 0.2)
conv. (kernel: 4 × 4, 128 → 256, stride: 2, pad: 1)
Batch Normalization
LeakyReLU (negative slope: 0.2)
Linear 256 × 4 × 4 → 1

Table 11: DCGAN architecture for our experiments on CIFAR10.
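As an illustration of how Table 11 translates into code, here is a minimal PyTorch sketch of the generator column; it is a reconstruction under the assumption that batch normalization acts on the flattened features before reshaping, not the exact training code.

import torch.nn as nn

class Generator(nn.Module):
    # DCGAN generator from Table 11: latent z in R^128 -> 3 x 32 x 32 image.
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(128, 512 * 4 * 4),
            nn.BatchNorm1d(512 * 4 * 4),
            nn.ReLU())
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 3, kernel_size=4, stride=2, padding=1),
            nn.Tanh())

    def forward(self, z):
        h = self.fc(z).view(-1, 512, 4, 4)  # reshape to 512 feature maps of size 4 x 4
        return self.deconv(h)               # 4x4 -> 8x8 -> 16x16 -> 32x32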


References
Boushra Abbas and Hedy Attouch. Dynamical systems and forward-backward algorithms
associated with the sum of a convex subdifferential and a monotone cocoercive operator.
Optimization, 64(10):2223–2252, 2015.

Felipe Alvarez. On the minimizing property of a second order dissipative system in Hilbert
spaces. SIAM Journal on Control and Optimization, 38(4):1102–1119, 2000.

Felipe Alvarez and Hedy Attouch. An inertial proximal method for maximal monotone
operators via discretization of a nonlinear oscillator with damping. Set-Valued Analysis,
9(1-2):3–11, 2001.

Maicon M Alves and Raul T Marcavillaca. On inexact relative-error hybrid proximal ex-
tragradient, forward-backward and Tseng’s modified forward-backward methods with
inertial effects. Set-Valued and Variational Analysis, 28:301–325, 2020.

Maicon M Alves, Jonathan Eckstein, Marina Geremia, and Jefferson G Melo. Relative-error
inertial-relaxed inexact versions of Douglas-Rachford and ADMM splitting algorithms.
Computational Optimization and Applications, 75(2):389–422, 2020.

Anatoly S Antipin. Minimization of convex functions on convex sets by means of differential
equations. Differential Equations, 30(9):1365–1375, 1994.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein Generative Adversarial
Networks. In International Conference on Machine Learning (ICML 2017), volume 70,
pages 214–223. PMLR, 2017.

Hedy Attouch and Alexandre Cabot. Convergence of a relaxed inertial forward-backward
algorithm for structured monotone inclusions. Applied Mathematics & Optimization, 80
(3):547–598, 2019a.

Hedy Attouch and Alexandre Cabot. Convergence of a relaxed inertial proximal algorithm
for maximally monotone operators. Mathematical Programming, pages 1–45, 2019b.

Hedy Attouch and Alexandre Cabot. Convergence rate of a relaxed inertial proximal algo-
rithm for convex minimization. Optimization, pages 1–32, 2019c.

Hedy Attouch and Paul-Emile Maingé. Asymptotic behavior of second-order dissipative
evolution equations combining potential with non-potential effects. ESAIM: Control,
Optimisation and Calculus of Variations, 17(3):836–857, 2011.

Shane Barratt and Rishi Sharma. A note on the inception score. In ICML 2018 Workshop
on Theoretical Foundations and Applications of Deep Generative Models, 2018.

Heinz H Bauschke and Patrick L Combettes. Convex Analysis and Monotone Operator
Theory in Hilbert Spaces. Springer, New York, 2017.

Jérôme Bolte. Continuous gradient projection method in Hilbert spaces. Journal of Opti-
mization Theory and Applications, 119(2):235–259, 2003.


Jonathan M Borwein and Adrian S Lewis. Convex Analysis and Nonlinear Optimization:
Theory and Examples. Springer, New York, 2006.

Radu I Boţ and Ernö R Csetnek. Second order forward-backward dynamical systems for
monotone inclusion problems. SIAM Journal on Control and Optimization, 54(3):1423–
1443, 2016a.

Radu I Boţ and Ernö R Csetnek. An inertial forward-backward-forward primal-dual split-
ting algorithm for solving monotone inclusion problems. Numerical Algorithms, 71(3):
519–540, 2016b.

Radu I Boţ and Ernö R Csetnek. Convergence rates for forward-backward dynamical sys-
tems associated with strongly monotone inclusions. Journal of Mathematical Analysis
and Applications, 457(2):1135–1152, 2018.

Radu I Boţ, Ernö R Csetnek, and Christopher Hendrich. Inertial Douglas-Rachford splitting
for monotone inclusion problems. Applied Mathematics and Computation, 256(1):472–
487, 2015.

Radu I Boţ, Ernö R Csetnek, and Phan T Vuong. The Forward-Backward-Forward method
from discrete and continuous perspective for pseudo-monotone variational inequalities in
Hilbert spaces. European Journal of Operational Research, 287:49–60, 2020.

Richard W Cottle and Jacques A Ferland. On pseudo-convex functions of nonnegative
variables. Mathematical Programming, 1(1):95–101, 1971.

Shisheng Cui, Uday V Shanbhag, Mathias Staudigl, and Phan T Vuong. Stochastic Relaxed
Inertial Forward-Backward-Forward splitting for monotone inclusions in Hilbert spaces.
arXiv:2107.10335, 2021.

Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas-Rachford splitting method
and the proximal point algorithm for maximal monotone operators. Mathematical Pro-
gramming, 55(1-3):293–318, 1992.

Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien.
A variational inequality perspective on Generative Adversarial Networks. In International
Conference on Learning Representations (ICLR 2019), 2019.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in
Neural Information Processing Systems, volume 27, pages 2672–2680, 2014.

Nicolas Hadjisavvas, Siegfried Schaible, and Ngai-Ching Wong. Pseudomonotone opera-
tors: a survey of the theory and its applications. Journal of Optimization Theory and
Applications, 152(1):1–20, 2012.

Erfan Y Hamedani and Necdet S Aybat. A primal-dual algorithm for general convex-concave
saddle point problems. SIAM Journal on Optimization, 31(2):1299–1329, 2021.


Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp
Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash
equilibrium. In Advances in Neural Information Processing Systems, volume 30, pages
6626–6637, 2017.

Franck Iutzeler and Julien M Hendrickx. A generic online acceleration scheme for optimiza-
tion algorithms via relaxation and inertia. Optimization Methods and Software, 34(2):
383–405, 2019.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
arXiv:1412.6980, 2014.

Galina M Korpelevich. The extragradient method for finding saddle points and other prob-
lems. Ekonomika i Matematicheskie Metody, 12:747–756, 1976.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis,
University of Toronto, Canada, 2009.

Yura Malitsky and Matthew K Tam. A forward-backward splitting method for monotone
inclusions without cocoercivity. SIAM Journal on Optimization, 30(2):1451–1472, 2020.

Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of
convergence O(1/k²). Doklady Akademija Nauk USSR, 269:543–547, 1983.

Zdzislaw Opial. Weak convergence of the sequence of successive approximations for nonex-
pansive mappings. Bulletin of the American Mathematical Society, 73(4):591–597, 1967.

Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning
with deep convolutional generative adversarial networks. In International Conference on
Learning Representations (ICLR 2016), 2016.

Paul Tseng. A modified forward-backward splitting method for maximal monotone map-
pings. SIAM Journal on Control and Optimization, 38(2):431–446, 2000.
