Rectified Flow: A Marginal Preserving Approach to Optimal Transport
Qiang Liu
University of Texas at Austin
[email protected]
Abstract
We present a flow-based approach to the optimal transport (OT) problem between two continuous
distributions π0, π1 on Rd, of minimizing a transport cost E[c(X1 − X0)] in the set of couplings (X0, X1)
whose marginal distributions on X0, X1 equal π0, π1, respectively, where c is a cost function. Our
method iteratively constructs a sequence of neural ordinary differential equations (ODEs), each learned
by solving a simple unconstrained regression problem, which monotonically reduces the transport cost
while automatically preserving the marginal constraints. This yields a monotonic interior approach that
traverses inside the set of valid couplings to decrease the transport cost, which distinguishes itself from
most existing approaches that enforce the coupling constraints from the outside. The main idea of the
method draws from rectified flow [LGL22], a recent approach that simultaneously decreases the whole
family of transport costs induced by convex functions c (and is hence multi-objective in nature), but is
not tailored to minimize a specific transport cost. Our method is a single-objective variant of rectified flow
that is guaranteed to solve the OT problem for a fixed, user-specified convex cost function c.
1 Introduction
The Monge–Kantorovich (MK) optimal transport (OT) problem concerns finding an optimal coupling be-
tween two distributions π0, π1:

$$\inf_{(X_0, X_1)}\ \mathbb{E}\big[c(X_1 - X_0)\big] \quad \text{s.t.}\quad \mathrm{Law}(X_0) = \pi_0,\ \ \mathrm{Law}(X_1) = \pi_1, \qquad (1)$$
where we seek to find (the law of) an optimal coupling (X0 , X1 ) of π0 and π1 , for which marginal laws
of X0 , X1 equal π0 , π1 , respectively, to minimize E[c(X1 − X0 )], called the c-transport cost, for a cost
function c. Theories, algorithms, and applications of optimal transport have attracted a vast literature; see,
for example, the monographs of [Vil21, Vil09, ABS21, San15, PC+ 19] for overviews. Notably, OT has
been growing into a popular and powerful technique in machine learning, for key tasks such as learning
generative models, transfer learning, and approximate inference [e.g., PC+ 19, ACB17, SRGB14, EMM12,
CFT14, MMPS16].
The OT problem should be treated differently depending on whether π0 , π1 are discrete or continuous mea-
sures. In this work, we focus on the continuous case when π0 , π1 are high dimensional absolutely con-
tinuous measures on Rd that are observed through empirical observations, a setting called data-driven OT
in [TT16]. A well known result in OT [e.g., Vil09] shows that, if π0 is continuous, the optimization in
(1) can be restricted to the set of deterministic couplings satisfying X1 = T (X0 ) for some continuous
transport mapping T : Rd → Rd , which is often approximated in practice with deep neural networks [e.g.,
MTOL20, KLG+ 21, KSB22, HCTC20].
However, continuous OT remains highly challenging computationally. One major difficulty is to handle the
coupling constraints of Law(X0 ) = π0 and Law(X1 ) = π1 , which are infinite dimensional when π0 and π1
are continuous. As a result, (1) can not be solved as a “clean” unconstrained optimization problem. There
are essentially two types of approaches to solving (1) in the literature. One uses Lagrange duality to turn
(1) into a certain minimax game, and the other one approximates the constraint with an integral (often
entropic-like) penalty function. However, the minimax approaches suffer from convergence and instability
issues and are difficult to solve in practice, while the regularization approach can not effectively enforce the
infinite-dimensional coupling constraints.
This work We present a different approach to continuous OT that re-frames (1) into a sequence of simple
unconstrained nonlinear least squares optimization problems, which monotonically reduce the transport cost
of a coupling while automatically preserving the marginal constraints. Different from the minimax and reg-
ularization approaches that enforce the constraints from outside, our method is an interior approach which
starts from a valid coupling (typically the naive independent coupling), and traverses inside the constraint
set to decrease the transport cost. Such an interior approach is non-trivial and has not been realized before,
because there exists no obvious unconstrained parameterization of the set of couplings of π0 and π1 .
Our method is made possible by leveraging rectified flow [LGL22], a recent approach to constructing (non-
optimal) transport maps for generative modeling and domain transfer. What makes rectified flow special is
that it provides a simple procedure that turns a given coupling into a new one that obeys the same marginal
laws, while yielding no worse transport cost w.r.t. all convex functions c simultaneously. Despite this
attractive property, as pointed out in [LGL22], rectified flow can not be used to optimize any fixed cost c, as
it is essentially a special multi-objective optimization procedure that targets no specific cost. Our method is
a variant of rectified flow that targets a user-specified cost function c and hence yields a new approach to the
OT problem (1).
Rectified flow We provide a high-level overview of the rectified flow of [LGL22] and the main results
of this work. For a given coupling (X0 , X1 ) of π0 and π1 , the rectified flow induced by (X0 , X1 ) is the
time-differentiable process Z = {Zt : t ∈ [0, 1]} over an artificial notion of time t ∈ [0, 1], that solves the
following ordinary differential equation (ODE):
$$dZ_t = v_t^X(Z_t)\,dt, \quad t \in [0,1], \quad \text{starting from } Z_0 = X_0, \qquad (2)$$

where the drift $v^X$ is learned by solving the least squares regression problem

$$\inf_{v}\ \int_0^1 \mathbb{E}\Big[\big\|(X_1 - X_0) - v_t(X_t)\big\|^2\Big]\, dt, \qquad (3)$$

and $X_t = tX_1 + (1-t)X_0$ is the linear interpolation between $X_0$ and $X_1$. Eq (3) is a least squares regression problem of predicting the line direction $(X_1 - X_0)$ from every space-time point $(X_t, t)$ on the linear interpolation path, yielding the solution

$$v_t^X(z) = \mathbb{E}[X_1 - X_0 \mid X_t = z],$$

which is the average of the directions $(X_1 - X_0)$ of all lines that pass through the point $X_t = z$ at time $t$. The (conditional)
expectations E[·] above are w.r.t. the randomness of (X0 , X1 ). We assume that the solution of (2) exists
and is unique, and hence vtX (z) is assumed to exist at least on the trajectories of the ODE. The start-end
pair (Z0 , Z1 ) induced by Z is called the rectified coupling of (X0 , X1 ), and we denote it by (Z0 , Z1 ) =
Rectify((X0 , X1 )).
In practice, the expectation E[·] is approximated by empirical observations of (X0 , X1 ), and v is approxi-
mated by a parametric family, such as deep neural networks. In this case, the optimization in Eq (3) can
be solved conveniently with off-the-shelf stochastic optimizers such as stochastic gradient descent (SGD),
without resorting to minimax algorithms or expensive inner loops. This makes rectified flow attractive for
deep learning applications such as those considered in [LGL22].
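For concreteness, the following is a minimal sketch of this regression in PyTorch; the network architecture, optimizer, and hyperparameters are illustrative assumptions and not the setup of [LGL22].

```python
# A minimal sketch (illustrative assumptions, not the implementation of [LGL22]) of
# fitting the rectified flow velocity field v_t(x) by the least squares problem (3).
import torch
import torch.nn as nn

class Velocity(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        # v_t(x) is parameterized as a network taking [x, t] as input.
        return self.net(torch.cat([x, t], dim=-1))

def train_rectified_flow(x0, x1, dim, steps=5000, batch=256, lr=1e-3):
    """Fit v by minimizing E || (X1 - X0) - v_t(X_t) ||^2 with X_t = t X1 + (1-t) X0."""
    v = Velocity(dim)
    opt = torch.optim.Adam(v.parameters(), lr=lr)
    n = x0.shape[0]
    for _ in range(steps):
        idx = torch.randint(0, n, (batch,))
        a, b = x0[idx], x1[idx]          # a draw from the current coupling (X0, X1)
        t = torch.rand(batch, 1)         # uniform time in [0, 1]
        xt = t * b + (1 - t) * a         # linear interpolation X_t
        loss = ((b - a) - v(xt, t)).pow(2).sum(-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return v
```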
The importance of (Z0 , Z1 ) = Rectify((X0 , X1 )) is justified by two key properties:
1) (Z0 , Z1 ) shares the same marginal laws as (X0 , X1 ) and is hence a valid coupling of π0 and π1 ;
2) (Z0 , Z1 ) yields no larger convex transport costs than (X0 , X1 ), that is, E[c(Z1 − Z0 )] ≤ E[c(X1 − X0 )],
for every convex function c : Rd → R.
Hence, it is natural to recursively apply the Rectify mapping, that is, $(Z_0^{k+1}, Z_1^{k+1}) = \texttt{Rectify}((Z_0^k, Z_1^k))$
starting from $(Z_0^0, Z_1^0) = (X_0, X_1)$, yielding a sequence of couplings that is monotonically non-increasing
in terms of all convex transport costs. The initialization can be taken to be the independent coupling
$(Z_0^0, Z_1^0) \sim \pi_0 \times \pi_1$, or any other coupling that can be constructed from marginal (unpaired) observa-
tions of $\pi_0$ and $\pi_1$. In practice, each step of Rectify is empirically approximated by first drawing samples
of $(Z_0^k, Z_1^k)$ from the ODE with drift $v^k$, and then constructing the next flow $v^{k+1}$ from the optimization
in (3). Although this process accumulates errors, it was shown that one or two iterations are sufficient for
practical applications [LGL22].
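A minimal sketch of this recursive procedure, reusing the hypothetical `train_rectified_flow` and `Velocity` helpers from the previous snippet, is given below; the Euler discretization and the number of iterations are illustrative choices.

```python
# A minimal sketch of recursive rectification ("reflow"): simulate the learned ODE to
# obtain a new coupling, then re-fit the velocity field on that coupling.
import torch

@torch.no_grad()
def sample_ode(v, z0, n_steps=100):
    """Simulate dZ_t = v_t(Z_t) dt from t=0 to t=1 with the Euler method."""
    z, dt = z0.clone(), 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((z.shape[0], 1), i * dt)
        z = z + v(z, t) * dt
    return z

def reflow(x0, x1, dim, n_iters=2):
    """Apply (Z0^{k+1}, Z1^{k+1}) = Rectify((Z0^k, Z1^k)) for a few iterations."""
    z0, z1 = x0, x1                     # e.g., an independent coupling of pi_0, pi_1
    for _ in range(n_iters):
        v = train_rectified_flow(z0, z1, dim)   # assumed helper from the previous sketch
        z1 = sample_ode(v, z0)                  # new coupling (Z0, Z1) with Z1 = ODE(Z0)
    return z0, z1, v
```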
Note that the Rectify procedure is “cost-agnostic” in that it does not depend on any specific cost c.
Although the recursive Rectify update is monotonically non-increasing in the transport cost for all convex
c, it does not necessarily converge to the optimal coupling for any pre-specified c, as the update would
stop whenever two cost functions conflict with each other. In [LGL22], a coupling (X0, X1) is
called straight if it is a fixed point of Rectify, that is, (X0, X1) = Rectify((X0, X1)). It was shown
that rectifiable couplings that are optimal w.r.t. a convex c must be straight, but the opposite is not true in
general. One exception is the one-dimensional case (d = 1), for which all convex functions c (for which a
c-optimal coupling exists) share a common optimal coupling that is also straight. But this does not hold when d ≥ 2.
c-Rectified flow In this work, we modify the Rectify procedure so that it can be used to solve (1) given a
user-specified cost function c. We show that this can be done easily by properly restricting the optimization
domain of v and modifying the loss function in (3). The case of quadratic cost $c(x) = \frac12\|x\|^2$ is particularly
simple, for which we simply need to restrict $v$ to be a gradient field $v_t = \nabla f_t$ in the optimization of (3).
For more general convex c, we need to restrict v to have the form $v_t(x) = \nabla c^*(\nabla f_t(x))$, with f minimizing
the following loss function:

$$\inf_f\ \int_0^1 \mathbb{E}\Big[c^*(\nabla f_t(X_t)) - (X_1 - X_0)^\top \nabla f_t(X_t) + c(X_1 - X_0)\Big]\, dt, \qquad (4)$$

where $c^*$ denotes the convex conjugate of c. Obviously, when $c(x) = \frac12\|x\|^2$, (4) reduces to (3) with
$v = \nabla f$. The loss function in (4) is closely related to the Bregman divergence [e.g., BMD+05] and the so-
called matching loss [e.g., AHW95]. We call $Z = \{Z_t : t \in [0,1]\}$ that follows $dZ_t = \nabla c^*(\nabla f_t(Z_t))\,dt$
with Z0 = X0 and f solving (4) the c-rectified flow of (X0 , X1 ), and the corresponding (Z0 , Z1 ) the
c-rectified coupling of (X0 , X1 ), denoted as (Z0 , Z1 ) = c-Rectify((X0 , X1 )).
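The following is a minimal sketch (under assumed network and training choices, not the paper's code) of the objective (4): a scalar potential $f_t$ is parameterized and its gradient is obtained by automatic differentiation, so that the induced velocity is $\nabla c^*(\nabla f_t)$.

```python
# A minimal sketch of the c-rectified flow loss (4) with a learned potential f_t(x).
import torch
import torch.nn as nn

class Potential(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def grad(self, x, t):
        # Return grad_x f_t(x) via autograd; keep the graph so the loss can backprop.
        x = x.detach().requires_grad_(True)
        f = self.net(torch.cat([x, t], dim=-1)).sum()
        return torch.autograd.grad(f, x, create_graph=True)[0]

def c_rectified_loss(f, x0, x1, c, c_conj):
    """One Monte Carlo estimate of Eq (4) for a convex cost c with conjugate c_conj."""
    t = torch.rand(x0.shape[0], 1)
    xt = t * x1 + (1 - t) * x0            # linear interpolation X_t
    g = f.grad(xt, t)                     # grad f_t(X_t)
    d = x1 - x0                           # line direction X1 - X0
    return (c_conj(g) - (d * g).sum(-1) + c(d)).mean()

# For the quadratic cost, c = c* = 0.5||x||^2 and the learned velocity is grad f_t:
quad = lambda x: 0.5 * x.pow(2).sum(-1)
# loss = c_rectified_loss(Potential(dim), x0_batch, x1_batch, quad, quad)
```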
Similar to the original rectified coupling, the c-rectified coupling (Z0, Z1) also shares the same marginal laws
as (X0 , X1 ) and hence is a coupling of π0 and π1 . In addition, (Z0 , Z1 ) yields no larger transport cost than
(X0 , X1 ) w.r.t. c, that is, E[c(Z1 − Z0 )] ≤ E[c(X1 − X0 )]. But this only holds for the specific c that is used
to define the flow, rather than all convex functions like Rectify.
More importantly, recursively performing c-Rectify allows us to find c-optimal couplings that solve the
OT problem (1). Under mild conditions, we have that $(X_0, X_1)$ is a c-optimal coupling if and only if $\ell^*_{X,c} = 0$,
where $\ell^*_{X,c}$ denotes the minimum value of the loss function in (4), which provides a criterion of c-optimality
of a given coupling without solving the OT problem. Moreover, when following the recursive update
$(Z_0^{k+1}, Z_1^{k+1}) = \texttt{c-Rectify}((Z_0^k, Z_1^k))$, the value $\ell^*_{Z^k,c}$ is guaranteed to decay to zero with
$\min_{k\le K} \ell^*_{Z^k,c} = O(1/K)$.
Notation Let $C^1(\mathbb{R}^d)$ be the set of continuously differentiable functions $f : \mathbb{R}^d \to \mathbb{R}$, and $C_c^1(\mathbb{R}^d)$ the
functions in $C^1(\mathbb{R}^d)$ whose support is compact. For a time-dependent velocity field $v : \mathbb{R}^d \times [0,1] \to \mathbb{R}^d$, we
write $v_t(x) = v(x, t)$ and use $\dot v_t(x) := \partial_t v(x, t)$ and $\nabla v_t(x) := \partial_x v(x, t)$ to denote the partial derivatives w.r.t.
time $t$ and variable $x$, respectively. We denote by $C^{2,1}(\mathbb{R}^d \times [0,1])$ the set of functions $f : \mathbb{R}^d \times [0,1] \to \mathbb{R}$
that are second-order continuously differentiable w.r.t. $x$ and first-order continuously differentiable w.r.t. $t$.
In this work, an ordinary differential equation (ODE) $dz_t = v_t(z_t)\,dt$ should be interpreted as the integral
equation $z_t = z_0 + \int_0^t v_s(z_s)\,ds$. For $x \in \mathbb{R}^d$, $\|x\|$ denotes the Euclidean norm. We always write $c^*$ for the
convex conjugate of $c : \mathbb{R}^d \to \mathbb{R}$, that is, $c^*(x) = \sup_{y\in\mathbb{R}^d}\{x^\top y - c(y)\}$.
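As a simple worked example of the conjugate (a standard computation, not part of the original text): for $c(x) = \frac{1}{p}\|x\|^p$ with $p > 1$,

$$c^*(y) = \frac{1}{q}\|y\|^q, \qquad \nabla c^*(y) = \|y\|^{q-2}\, y, \qquad \frac{1}{p} + \frac{1}{q} = 1,$$

which recovers $c^* = c$ and $\nabla c^* = \mathrm{id}$ in the quadratic case $p = q = 2$.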
Random variables are capitalized (e.g., X, Y, Z) to distinguish them from deterministic values (e.g., x, y, z).
Recall that an Rd-valued random variable X = X(ω) is a measurable function X : Ω → Rd, where Ω is an
underlying sample space equipped with a σ-algebra F and a probability measure P. The triplet (Ω, F, P)
forms the underlying probability space, which is omitted in most places. We use Law(X) to
denote the probability law of X, which is the probability measure L that satisfies L(B) = P({ω : X(ω) ∈
B}) for all measurable sets on Rd . For a functional F (X) of a random variable X, the optimization problem
minX F (X) technically means to find a measurable function X(ω) to minimize F , even though we omit the
underlying sample space Ω. When F (X) depends on X only through Law(X), the optimization problem is
equivalent to finding the optimal Law(X).
Outline The rest of the work is organized as follows. Section 2 introduces the background of optimal
transport. Section 3 reviews rectified flow of [LGL22] from an optimization-based view. Section 4 charac-
terizes the if and only if condition for two differentiable stochastic processes to have equal marginal laws.
Section 5 introduces the main c-rectified flow method and establishes its theoretical properties.
flow approach. The readers can find systematic introductions to OT in a collection of excellent textbooks
[Vil21, FG21, ABS21, PC+ 19, OPV14, San15, Vil09].
Static formulations The optimal transport problem was first formulated by Gaspard Monge in 1781 when
he studied the problem of how to redistribute mass, e.g., a pile of soil, with minimal effort. Monge’s problem
can be formulated as

$$\inf_{T}\ \mathbb{E}\big[c\big(T(X_0) - X_0\big)\big] \quad \text{s.t.}\quad \mathrm{Law}(X_0) = \pi_0,\ \ \mathrm{Law}(T(X_0)) = \pi_1, \qquad (5)$$
where we minimize the c-transport cost in the set of deterministic couplings (X0 , X1 ) that satisfy X1 =
T (X0 ) for a transport mapping T : Rd → Rd . The Monge–Kantorovich (MK) problem in (1) is the relax-
ation of (5) to the set of all (deterministic and stochastic) couplings of π0 and π1 . The two problems are
equivalent when the optimum of (1) is achieved by a deterministic coupling, which is guaranteed if π0 is an
absolutely continuous measure on Rd .
A key feature of the MK problem is that it is a linear program w.r.t. the law of the coupling (X0, X1),
and yields a dual problem of the form

$$\sup_{\mu,\nu}\ \pi_0(\mu) + \pi_1(\nu) \quad \text{s.t.}\quad \mu(x_0) + \nu(x_1) \le c(x_1 - x_0),\ \ \forall x_0, x_1 \in \mathbb{R}^d, \qquad (6)$$

where we write $\pi_i(\mu) := \int \mu(x)\, d\pi_i(x)$, and $\mu, \nu$ are optimized over all functions from $\mathbb{R}^d$ to $\mathbb{R}$. For any
coupling (X0, X1) of π0 and π1, and (µ, ν) satisfying the constraint in (6), it is easy to see that

$$\mathbb{E}\big[c(X_1 - X_0)\big] \ \ge\ \mathbb{E}\big[\mu(X_0) + \nu(X_1)\big] = \pi_0(\mu) + \pi_1(\nu). \qquad (7)$$
As the left side of (7) only depends on (X0 , X1 ) and the right side only on (µ, ν), one can show that
(X0 , X1 ) is c-optimal and (µ, ν) solves (6) iff µ(X0 ) + ν(X1 ) = c(X1 − X0 ) holds with probability one,
which provides a basic optimality criterion. Many existing OT algorithms are developed by exploiting the
primal dual relation of (1) and (6) (see e.g., [KSB22]), but have the drawback of yielding minimax problems
that are challenging to solve in practice.
If c is strictly convex, the optimal transport map of (5) is unique (almost surely) and has the form

$$T(x) = x + \nabla c^*\big(\nabla \nu(x)\big),$$

where $c^*$ is the convex conjugate function of c, and ν is an optimal solution of (6), which is c-convex in that
$\nu(x) = \sup_y\{-c(y - x) + \mu(y)\}$ with µ the associated solution. In the canonical case of quadratic cost
$c(x) = \frac12\|x\|^2$, we can write $T(x) = \nabla\phi(x)$, where $\phi(x) := \frac12\|x\|^2 + \nu(x)$ is a convex function.
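As a further illustration of the quadratic case (a classical closed-form result, stated here for reference): if $\pi_0 = N(m_0, \Sigma_0)$ and $\pi_1 = N(m_1, \Sigma_1)$ with $\Sigma_0 \succ 0$, the optimal map for $c(x) = \frac12\|x\|^2$ is the affine map

$$T(x) = m_1 + A\,(x - m_0), \qquad A = \Sigma_0^{-1/2}\big(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2}\big)^{1/2}\Sigma_0^{-1/2},$$

which is indeed of the form $T = \nabla\phi$ with the convex potential $\phi(x) = m_1^\top x + \frac12(x - m_0)^\top A\,(x - m_0)$.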
Dynamic formulations Both the MK and Monge problems can be equivalently framed in dynamic ways
as finding continuous-time processes that transfer π0 to π1 . Let {xt : t ∈ [0, 1]} be a smooth path connecting
x0 and x1 , whose time derivative is denoted as ẋt . For convex c, by Jensen’s inequality, we can represent
the cost c(x1 − x0 ) in an integral form:
$$c(x_1 - x_0) = c\Big(\int_0^1 \dot x_t\, dt\Big) = \inf_{x}\int_0^1 c(\dot x_t)\, dt,$$
where the infimum is attained when xt is the linear interpolation (geodesic) path: xt = tx1 + (1 − t)x0 .
Hence, the MK optimal transport problem (1) is equivalent to
$$\inf_{X}\ \int_0^1 \mathbb{E}\big[c(\dot X_t)\big]\, dt \quad \text{s.t.}\quad \mathrm{Law}(X_0) = \pi_0,\ \ \mathrm{Law}(X_1) = \pi_1, \qquad (8)$$
where we optimize in the set of time-differentiable stochastic processes X = {Xt : t ∈ [0, 1]}. The op-
timum of (8) is attained by Xt = tX1 + (1 − t)X0 when (X0 , X1 ) is a c-optimal coupling of (1), which
is known as the displacement interpolation [McC97]. We call the objective function in (8) the path-wise
c-transport cost.
The Monge problem can also be framed in a dynamic way. Assume the transport map T can be induced by
an ODE model dXt = vt (Xt )dt such that X1 = T (X0 ). Then the Monge problem is equivalent to
$$\inf_{v, X}\ \int_0^1 \mathbb{E}\big[c(v_t(X_t))\big]\, dt \quad \text{s.t.}\quad dX_t = v_t(X_t)\,dt,\ \ \mathrm{Law}(X_0) = \pi_0,\ \ \mathrm{Law}(X_1) = \pi_1, \qquad (9)$$
which is equivalent to restricting X in (8) to the set of processes that can be induced by ODEs.
Assume that Xt following dXt = vt (Xt )dt yields a density function ̺t . Then it is well known that ̺t
satisfies the continuity equation:
̺˙ t + ∇ · (vt ̺t ) = 0.
Hence, we can rewrite (9) into an optimization problem on (v, ̺), yielding the celebrated Benamou-Brenier
formula [BB00]:
$$\inf_{v, \varrho}\ \int_0^1\!\!\int c(v_t(x))\,\varrho_t(x)\,dx\,dt \quad \text{s.t.}\quad \dot\varrho_t + \nabla\cdot(v_t\varrho_t) = 0,\ \ \varrho_0 = d\pi_0/dx,\ \ \varrho_1 = d\pi_1/dx, \qquad (10)$$
where dπi/dx denotes the density function of πi. The key idea of (9) and (10) is to restrict the optimization
of (8) to the set of deterministic processes induced by ODEs, which significantly reduces the search space.
Intuitively, Jensen’s inequality E[c(Z)] ≥ c(E[Z]) shows that we should be able to reduce the expected cost
of a stochastic process by “marginalizing” out the randomness. In fact, we will show that, for a differentiable
stochastic process X, its (c-)rectified flow yields no larger path-wise c-transport cost in (8) than X (see
Lemma 3.3 and Theorem 5.3).
However, all the dynamic formulations above are still highly challenging to solve in practice. We will show
that c-rectified flow can be viewed as a special coordinate descent like approach to solving (8) (Section 5.4).
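As a small illustration (our own sketch, not a method from the paper), the path-wise cost $\int_0^1 \mathbb{E}[c(v_t(X_t))]\,dt$ appearing in (9)-(10) can be estimated by Monte Carlo along an Euler discretization of the ODE:

```python
# Illustrative Monte Carlo / Euler estimate of the path-wise transport cost in Eq (9).
import torch

def pathwise_cost(v, x0, c, n_steps=100):
    """Estimate int_0^1 E[c(v_t(X_t))] dt along dX_t = v_t(X_t) dt started from x0."""
    x, dt, total = x0.clone(), 1.0 / n_steps, 0.0
    for i in range(n_steps):
        t = torch.full((x.shape[0], 1), i * dt)
        vel = v(x, t)
        total += c(vel).mean().item() * dt   # accumulate E[c(v_t(X_t))] * dt
        x = x + vel * dt                     # Euler step of the ODE
    return total

# For c(x) = 0.5||x||^2 this approximates the kinetic energy in the
# Benamou-Brenier formula (10).
```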
where Ẋt denotes the time derivative of Xt . Obviously, v X is the solution of
$$\inf_v\ L_X(v) := \int_0^1 \mathbb{E}\Big[\big\|\dot X_t - v_t(X_t)\big\|^2\Big]\, dt, \qquad (12)$$
where the optimization is over the set of all measurable velocity fields $v : \mathbb{R}^d \times [0,1] \to \mathbb{R}^d$. The importance of $v^X$
lies in the fact that it characterizes the time-evolution of the marginal laws $\rho_t := \mathrm{Law}(X_t)$ of X, through
the continuity equation in the distributional sense:

$$\dot\rho_t + \nabla\cdot\big(v_t^X \rho_t\big) = 0. \qquad (13)$$
Precisely, Equation (13) should be interpreted in its weak, integral form:

$$\rho_t(h) - \rho_0(h) - \int_0^t \rho_s\big(\nabla h^\top v_s^X\big)\, ds = 0, \qquad \rho_0 = \mathrm{Law}(X_0), \quad \forall h \in C_c^1(\mathbb{R}^d),\ t \in [0,1], \qquad (14)$$

where $\rho_t(h) := \int h(x)\, d\rho_t(x)$ and $C_c^1(\mathbb{R}^d)$ denotes the set of continuously differentiable functions on $\mathbb{R}^d$
with compact support. Hence, if the solution of Eq (13)-(14) is unique, then the marginal laws $\{\mathrm{Law}(X_t)\}_t$
of X are uniquely determined by $v^X$ and the initial law $\mathrm{Law}(X_0)$.
We define the rectified flow of X, denoted by $Z = \texttt{Rectflow}(X)$, as the ODE driven by $v^X$:

$$dZ_t = v_t^X(Z_t)\, dt, \quad t \in [0,1], \quad \text{starting from } Z_0 = X_0. \qquad (15)$$
Moreover, the rectified flow of a coupling (X0 , X1 ) is defined as the rectified flow of X when X is the
linear interpolation of (X0 , X1 ).
Definition 3.1. A stochastic process X is called rectifiable if $v^X$ exists and is locally bounded, and Equa-
tion (15) has a unique solution.
A coupling (X0 , X1 ) is called rectifiable if its linear interpolation process X, following Xt = tX1 + (1 −
t)X0 , is rectifiable. In this case, we call Z = Rectflow(X) the rectified flow of (X0 , X1 ), and write it
(with an abuse of notation) as Z = Rectflow((X0 , X1 )). The corresponding (Z0 , Z1 ) is called the rectified
coupling of (X0 , X1 ), denoted as (Z0 , Z1 ) = Rectify((X0 , X1 )).
By the definition in (15), we have v Z = v X , and hence the marginal laws Law(Zt ) of Z are governed by
the same continuity equation (13)-(14), which is a well known fact. As shown in [Kur11], Equation (15)
has a unique solution iff Equation (14) has a unique solution, which implies that Z and X share the same
marginal laws. We also assume that the solution of (12) is unique; if not, the results in the paper hold for all
solutions of (12).
Theorem 3.2 (Theorem 3.3 of [LGL22]). Assume that X is rectifiable. We have

$$\mathrm{Law}(Z_t) = \mathrm{Law}(X_t), \quad \forall t \in [0, 1].$$
Hence, rectified flow turns a rectifiable stochastic process into a flow while preserving the marginal laws.
An optimization view of rectified flow We show that the rectified flow Z of X achieves the minimum
of the path-wise c-transport cost in the set of time-differentiable stochastic processes whose expected ve-
locity field equals $v^X$. This explains the property of non-increasing convex transport costs of rectified
flow/coupling.
Lemma 3.3. The rectified flow $Z = \texttt{Rectflow}(X)$ in (15) attains the minimum of

$$\inf_Y\ F_c(Y) := \int_0^1 \mathbb{E}\big[c(\dot Y_t)\big]\, dt, \quad \text{s.t.}\quad v^Y = v^X, \qquad (16)$$
Proof. For any stochastic process Y with $v_t^X(z) = v_t^Y(z) = \mathbb{E}[\dot Y_t \mid Y_t = z]$, we have

$$\begin{aligned}
F_c(Y) &= \int_0^1 \mathbb{E}\big[c(\dot Y_t)\big]\, dt \\
&\ge \int_0^1 \mathbb{E}\big[c\big(\mathbb{E}[\dot Y_t \mid Y_t]\big)\big]\, dt && \text{(Jensen's inequality)} \\
&= \int_0^1 \mathbb{E}\big[c\big(v_t^Y(Y_t)\big)\big]\, dt \\
&= \int_0^1 \mathbb{E}\big[c\big(v_t^X(X_t)\big)\big]\, dt && (v^X = v^Y \text{, and hence } \mathrm{Law}(X_t) = \mathrm{Law}(Y_t)) \\
&= \int_0^1 \mathbb{E}\big[c\big(v_t^X(Z_t)\big)\big]\, dt && (\mathrm{Law}(X_t) = \mathrm{Law}(Z_t)) \\
&= \int_0^1 \mathbb{E}\big[c(\dot Z_t)\big]\, dt = F_c(Z).
\end{aligned}$$
Lemma 3.3 suggests that the rectified flow decreases the path-wise c-transport cost: $F_c(Z) \le F_c(X)$, for
all convex c. Note that $\mathbb{E}[c(Z_1 - Z_0)] \le F_c(Z)$ by Jensen's inequality, and $\mathbb{E}[c(X_1 - X_0)] = F_c(X)$ if
X is the linear interpolation of $(X_0, X_1)$. Hence, in this case, we have

$$\mathbb{E}[c(Z_1 - Z_0)] \ \le\ F_c(Z) \ \le\ F_c(X) \ =\ \mathbb{E}[c(X_1 - X_0)],$$

which yields a proof of Theorem 3.2 of [LGL22] that the rectified coupling $(Z_0, Z_1)$ yields no larger convex
transport costs than $(X_0, X_1)$.
A primal-dual relation Let us generalize the least squares loss $L_X(v)$ in (12) to a Bregman divergence
based loss:

$$\tilde L_{X,c}(v) := \int_0^1 \mathbb{E}\Big[b_c\big(\dot X_t;\ v_t(X_t)\big)\Big]\, dt, \qquad b_c(x; y) = c(x) - c(y) - (x - y)^\top\nabla c(y),$$

where $b_c(\cdot\,;\cdot)$ is the Bregman divergence w.r.t. c. The least squares loss $L_X$ is recovered with $c(x) = \frac12\|x\|^2$.
Rectified flow can be alternatively implemented by minimizing $\tilde L_{X,c}$ with a differentiable strictly convex c,
as in this case the minimum of $\tilde L_{X,c}$ is also attained by $v_t^X(z) = \mathbb{E}[\dot X_t \mid X_t = z]$. The c-rectified flow is
obtained if we minimize $\tilde L_{X,c}$ with v restricted to the form $v_t = \nabla c^* \circ \nabla f_t$. See more in Section 5.
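As a quick sanity check (a standalone snippet with illustrative choices), $b_c$ can be computed with automatic differentiation, and for the quadratic cost it reduces to the squared error, recovering $L_X$:

```python
# Numeric check that b_c(x; y) = c(x) - c(y) - (x - y)^T grad c(y), and that it equals
# the least squares error 0.5||x - y||^2 for the quadratic cost c(x) = 0.5||x||^2.
import torch

def bregman(c, x, y):
    y = y.detach().requires_grad_(True)
    grad_c = torch.autograd.grad(c(y).sum(), y)[0]
    return c(x) - c(y) - ((x - y) * grad_c).sum(-1)

quad = lambda z: 0.5 * z.pow(2).sum(-1)
x, y = torch.randn(4, 3), torch.randn(4, 3)
assert torch.allclose(bregman(quad, x, y), 0.5 * (x - y).pow(2).sum(-1), atol=1e-6)
```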
In the following, we show that the optimization in (16) can be viewed as the dual problem (11).
Theorem 3.4. For any differentiable convex function c and rectifiable process X, we have

$$\inf_v\ \tilde L_{X,c}(v) = \int_0^1 \mathbb{E}\big[\mathrm{var}_c(\dot X_t \mid X_t)\big]\, dt = F_c(X) - \inf_Y\big\{F_c(Y) : v^Y = v^X\big\}.$$
Proof. Let $\mathrm{var}_c(\dot X_t \mid X_t) := \mathbb{E}\big[c(\dot X_t) - c(\mathbb{E}[\dot X_t \mid X_t]) \,\big|\, X_t\big]$. For any v, we have

$$\begin{aligned}
\tilde L_{X,c}(v) &= \int_0^1 \mathbb{E}\Big[c(\dot X_t) - c(v_t(X_t)) - \big(\dot X_t - v_t(X_t)\big)^\top\nabla c(v_t(X_t))\Big]\, dt \\
&= \int_0^1 \mathbb{E}\Big[c(\dot X_t) - c(v_t(X_t)) - \big(v_t^X(X_t) - v_t(X_t)\big)^\top\nabla c(v_t(X_t))\Big]\, dt && (v_t^X(X_t) = \mathbb{E}[\dot X_t \mid X_t]) \\
&\ge \int_0^1 \mathbb{E}\Big[c(\dot X_t) - c\big(v_t^X(X_t)\big)\Big]\, dt && \big(c(v^X) \ge c(v) + (v^X - v)^\top\nabla c(v)\big) \\
&= \int_0^1 \mathbb{E}\big[\mathrm{var}_c(\dot X_t \mid X_t)\big]\, dt.
\end{aligned}$$
Similarly, if X is the linear interpolation of the coupling (X0 , X1 ), then ℓ̃∗X,c = 0 with strictly convex c if
and only if (X0 , X1 ) is a fixed point of the Rectify mapping, that is, (X0 , X1 ) = Rectify((X0 , X1 )),
following [LGL22]. Such couplings are called straight, or fully rectified in [LGL22]. Obtaining straight
couplings is useful for learning fast ODE models because the trajectories of the associated rectified flow
Z are straight lines and hence can be calculated in closed form without iterative numerical solvers. See
[LGL22] for more discussion.
Moreover, [LGL22] showed that rectifiable c-optimal couplings must be straight. In the one-dimensional
case (d = 1), the straight coupling, if it exists, is unique and attains the minimum of $\mathbb{E}[c(X_1 - X_0)]$ for
all convex functions c for which a c-optimal coupling exists. For higher dimensions (d ≥ 2), however, straight
couplings are not unique, and the specific straight coupling obtained at the convergence of the recursive
Rectify update (i.e. (Z0k+1 , Z1k+1 ) = Rectify((Z0k , Z1k ))) is implicitly determined by the initial coupling
(Z00 , Z10 ), and is not expected to be optimal w.r.t. any pre-fixed c.
The following counterexample shows a somewhat stronger negative result: there exist straight couplings
that are not optimal w.r.t. any second-order differentiable convex function with an everywhere-invertible Hessian matrix.
Example 3.5. Take $\pi_0 = \pi_1 = N(0, I)$. Hence, for $c(x) = \|x\|^p$ with $p > 0$, the c-optimal coupling is the
trivial identity coupling $(X_0, X_0)$ with $X_0 \sim \pi_0$.
However, consider the coupling $(X_0, AX_0)$, where A is a non-identity and non-reflecting rotation matrix
(namely $A^\top A = I$, $\det(A) = 1$, $A \ne I$, and A does not have $\lambda = -1$ as an eigenvalue). Then $(X_0, AX_0)$
is a straight coupling of $\pi_0$ and $\pi_1$, but it is not c-optimal for any second-order differentiable convex function
c whose Hessian matrix is invertible everywhere. See the Appendix for the proof.
It is the rotation transform that makes (X0 , AX0 ) sub-optimal, which is removed in the proposed c-rectified
flow in Section 5 via a Helmholtz like decomposition.
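A short numeric illustration of Example 3.5 (our own check, with an arbitrary rotation angle): the rotated coupling preserves the Gaussian marginals yet pays a strictly positive quadratic transport cost, whereas the optimal identity coupling pays zero.

```python
# Numeric illustration of Example 3.5: (X0, A X0) is marginal-preserving but suboptimal.
import math
import torch

torch.manual_seed(0)
d, n = 2, 100_000
theta = math.pi / 3
A = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])
x0 = torch.randn(n, d)
x1 = x0 @ A.T                                  # X1 = A X0, still distributed as N(0, I)
cost = 0.5 * (x1 - x0).pow(2).sum(-1).mean()   # E[c(X1 - X0)] with c = 0.5||.||^2
# Analytically this equals 0.5 * trace((A - I)^T (A - I)) = d * (1 - cos(theta)) = 1.0,
# whereas the identity coupling (X0, X0) has zero cost.
print(float(cost))
```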
which yields a dynamic OT problem with a continuum of marginal constraints. In Section 5, we show that
the solution of (17) yields our c-rectified flow, which solves the OT problem at its fixed point. Solving (17)
allows us to remove the rotational components of $v^X$, which is what renders rectified flow non-optimal.
In this section, we first characterize the necessary and sufficient condition for having equivalent marginal
laws.
Definition 4.1. A time-dependent vector field r : Rd × [0, 1] → Rd is said to be X-marginal-preserving if
$$\int_0^t \mathbb{E}\big[\nabla h(X_s)^\top r_s(X_s)\big]\, ds = 0, \quad \forall t \in [0, 1],\ h \in C_c^1(\mathbb{R}^d). \qquad (18)$$
Equation (18) is equivalent to saying that $\mathbb{E}[\nabla h(X_t)^\top r_t(X_t)] = 0$ holds almost surely assuming that t is a
random variable following Uniform([0, 1]) (i.e., t-almost surely). Suppose that $\rho_t = \mathrm{Law}(X_t)$ has a density
function $\varrho_t$. Using integration by parts, we have

$$0 = \mathbb{E}\big[\nabla h(X_t)^\top r_t(X_t)\big] = \int \nabla h(x)^\top r_t(x)\,\varrho_t(x)\, dx = -\int h(x)\,\nabla\cdot\big(r_t(x)\varrho_t(x)\big)\, dx, \quad \forall h \in C_c^1(\mathbb{R}^d),$$

which gives $\nabla\cdot(r_t\varrho_t) = 0$. This says that $r_t\varrho_t$ is a rotation-only (or divergence-free) vector field in the
classical sense.
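As a concrete example of such a marginal-preserving field (our own illustration): if $X_t \sim N(0, I)$ so that $\varrho_t(x) \propto \exp(-\|x\|^2/2)$, and $r_t(x) = Jx$ with $J^\top = -J$ antisymmetric, then

$$\nabla\cdot\big(r_t\,\varrho_t\big) = \varrho_t\,\mathrm{tr}(J) - \varrho_t\, x^\top J x = 0,$$

since $\mathrm{tr}(J) = 0$ and $x^\top J x = 0$. Such purely rotational components are exactly what the c-rectified flow of Section 5 removes via the (Bregman) Helmholtz decomposition.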
Lemma 4.2. Let X and Y be two stochastic processes with the same initial distributions Law(X0 ) =
Law(Y0 ). Assume that X is rectifiable, and vtY (z) := E[Ẏt |Yt = z] exists and is locally bounded.
Then X and Y share the same marginal laws at all times, that is, Law(Xt) = Law(Yt), ∀t ∈ [0, 1], if and
only if v X − v Y is Y -marginal-preserving.
which suggests that ρ̃t := Law(Yt ) solves the same continuity equation (20), starting from the same ini-
tialization as Law(X0 ) = Law(Y0 ). Hence, we have ρt = ρ̃t if the solution of (20) is unique, which is
equivalent to the uniqueness of the solution of dZt = vtX (Zt ) in (15) following Corollary 1.3 of [Kur11].
On the other hand, if $\rho_t = \mathrm{Law}(X_t) = \mathrm{Law}(Y_t) = \tilde\rho_t$, then following (19) and (21), we have for any
$h \in C_c^1(\mathbb{R}^d)$,

$$0 = \int_0^t \tilde\rho_s\big(\nabla h^\top v_s^X\big) - \tilde\rho_s\big(\nabla h^\top v_s^Y\big)\, ds = \int_0^t \mathbb{E}\Big[\nabla h(Y_s)^\top \big(v_s^X(Y_s) - v_s^Y(Y_s)\big)\Big]\, ds,$$

which shows that $v^X - v^Y$ is Y-marginal-preserving following Definition 4.1.
5 c-Rectified Flow
We introduce c-rectified flow, a c-dependent variant of rectified flow that is guaranteed to minimize the c-
transport cost when applied recursively. This section is organized as follows: Section 5.1 defines and dis-
cusses the c-rectified flow of a differentiable stochastic process X, which we show yields the solution of the
infinite-marginal OT problem (17). Section 5.2 considers the c-rectified flow of a coupling (X0, X1), which
we show does not increase the c-transport cost. Section 5.3 proves that the fixed points of c-Rectify
are c-optimal. Section 5.4 interprets c-rectified flow as an alternating direction descent method for the
dynamic OT problem (8), and as a majorize-minimization (MM) algorithm for the static OT problem (1). Sec-
tion 5.5 discusses a key lemma relating c-optimal couplings and their associated displacement interpolations
to the Hamilton-Jacobi equation.
where $c^*(x) := \sup_y\{x^\top y - c(y)\}$ is the convex conjugate of c, and $f^{X,c} : \mathbb{R}^d \times [0,1] \to \mathbb{R}$ is the optimal
solution of

$$\inf_f\ L_{X,c}(f) := \int_0^1 \mathbb{E}\Big[m_c\big(\dot X_t;\ \nabla f_t(X_t)\big)\Big]\, dt, \qquad (23)$$

where $m_c(x; y) := c^*(y) - x^\top y + c(x)$ is the matching loss associated with c.
Bregman divergence, Helmholtz decomposition, marginal preserving We can equivalently write (23)
using the Bregman divergence associated with c, that is, $b_c(x; y) = c(x) - c(y) - (x - y)^\top\nabla c(y)$ from Section 3.
Then it is easy to see that $m_c(x; y) = b_c(x; \nabla c^*(y))$, by using the facts that $\nabla c(\nabla c^*(y)) = y$ and $c^*(y) =
y^\top\nabla c^*(y) - c(\nabla c^*(y))$. Hence, $m_c$ and $b_c$ are equivalent up to the monotonic transform $\nabla c^*$ on y. The
minimum $b_c(x; y) = 0$ is achieved when $y = x$, while $m_c(x; y) = 0$ is achieved when $\nabla c^*(y) = x$.
Therefore, (23) is equivalent to

$$\inf_f\ \int_0^1 \mathbb{E}\Big[b_c\big(\dot X_t;\ g_t(X_t)\big)\Big]\, dt, \quad \text{with } g_t = \nabla c^* \circ \nabla f_t. \qquad (24)$$
Moreover, the generalized Pythagorean theorem for Bregman divergences (e.g., [BMD+05]) gives

$$\mathbb{E}\Big[b_c\big(\dot X_t;\ g_t(X_t)\big) \,\Big|\, X_t\Big] = b_c\Big(\mathbb{E}[\dot X_t \mid X_t];\ g_t(X_t)\Big) + \mathbb{E}\Big[b_c\big(\dot X_t;\ \mathbb{E}[\dot X_t \mid X_t]\big) \,\Big|\, X_t\Big]. \qquad (25)$$

Because $v_t^X(X_t) = \mathbb{E}[\dot X_t \mid X_t]$ and the last term of (25) is independent of $g_t$, we can further reframe (23)
into

$$\min_f\ \int_0^1 \mathbb{E}\Big[b_c\big(v_t^X(X_t);\ g_t(X_t)\big)\Big]\, dt, \quad \text{with } g_t = \nabla c^* \circ \nabla f_t, \qquad (26)$$
which can be viewed as projecting the expected velocity $v_t^X$ onto the set of velocity fields of the form
$g_t = \nabla c^* \circ \nabla f_t$, w.r.t. the Bregman divergence. This yields an orthogonal decomposition of $v_t^X$:

$$v_t^X = \nabla c^*\big(\nabla f_t^{X,c}\big) + r_t^{X,c}, \qquad (27)$$

where $r_t^{X,c} := v_t^X - \nabla c^* \circ \nabla f_t^{X,c}$ is the residual term. The key result below shows that $r^{X,c}$ is X-
marginal-preserving, which ensures that the c-rectified flow preserves the marginals of X.
Definition 5.1. We say that X is c-rectifiable if v X exists, the minimum of (23) exists and is attained by a
locally bounded function f X,c , and the solution of Equation (22) exists and is unique.
Theorem 5.2. Assume that X is c-rectifiable, and that $c^*(x) := \sup_y\{x^\top y - c(y)\}$ satisfies $c^* \in C^1(\mathbb{R}^d)$. We have
i) $v^X - g^{X,c}$ is X-marginal-preserving, where $g_t^{X,c} = \nabla c^* \circ \nabla f_t^{X,c}$.
ii) Z = c-Rectify(X) preserves the marginal laws of X, that is, Law(Zt ) = Law(Xt ), ∀t ∈ [0, 1].
Proof. i) By $v_t^X(z) = \mathbb{E}[\dot X_t \mid X_t = z]$, the loss function in (23) is equivalent to

$$\begin{aligned}
L_{X,c}(f) &= \int_0^1 \mathbb{E}\Big[c^*(\nabla f_t(X_t)) - \mathbb{E}[\dot X_t \mid X_t]^\top \nabla f_t(X_t) + c(\dot X_t)\Big]\, dt \\
&= \int_0^1 \mathbb{E}\Big[c^*(\nabla f_t(X_t)) - v_t^X(X_t)^\top \nabla f_t(X_t) + c(\dot X_t)\Big]\, dt.
\end{aligned}$$
By the Euler-Lagrange equation, we have

$$\int_0^1 \mathbb{E}\Big[\big(\nabla c^*(\nabla f_s^{X,c}(X_s)) - v_s^X(X_s)\big)^\top \nabla g_s(X_s)\Big]\, ds = 0, \quad \forall g : g_s \in C_c^1(\mathbb{R}^d).$$

Taking $g_s = h$ for $s < t$ and $g_s = 0$ for $s > t$ yields that $r_s^{X,c} = v_s^X - \nabla c^* \circ \nabla f_s^{X,c}$ is
X-marginal-preserving following (18).
ii) Note that Z is rectifiable if X is c-rectifiable. Applying Lemma 4.2 yields the result.
For the quadratic cost $c(x) = c^*(x) = \frac12\|x\|^2$, $\nabla c^*$ is the identity mapping, and (27) reduces to the
Helmholtz decomposition, which represents a velocity field as the sum of a gradient field and a divergence-
free field. Hence, (27) yields a generalization of the Helmholtz decomposition, in which a monotonic transform
$\nabla c^*$ is applied to the gradient field component. We call (27) a Bregman Helmholtz decomposition.
Remark: score matching In some special cases, v X may already be a gradient field, and hence the
rectified flow and c-rectified flow coincide for $c(x) = \frac12\|x\|^2$. One example of this is when $X_t = \alpha_t X_1 + \beta_t \xi$
for some time-differentiable functions αt and βt , and ξ ∼ N (0, I), satisfying α1 = 1, β1 = 0, and X0 =
$\alpha_0 X_1 + \beta_0 \xi$. In this case, one can show that

$$v_t^X(z) = \mathbb{E}\big[\dot\alpha_t X_1 + \dot\beta_t \xi \,\big|\, X_t = z\big] = \nabla f_t(z), \quad \text{with}\quad f_t(z) = \eta_t \log \varrho_t(z) + \frac{\zeta_t}{2}\|z\|^2,$$

where $\varrho_t$ is the density function of $X_t$ with $\varrho_t(z) \propto \int \phi\big(\frac{z - \alpha_t x_1}{\beta_t}\big)\, d\pi_1(x_1)$ and $\phi(z) = \exp(-\|z\|^2/2)$,
and $\eta_t = \beta_t^2(\dot\alpha_t/\alpha_t - \dot\beta_t/\beta_t)$ and $\zeta_t = \dot\alpha_t/\alpha_t$. This case covers the probability flow ODEs [SSDK+20] and
denoising diffusion implicit models (DDIM) [SME20] with different choices of $\alpha_t$ and $\beta_t$, as suggested in
[LGL22]. When $\zeta_t = 0$, as in the case of [SE19], $v_t^X$ is proportional to $\nabla\log\varrho_t$, the score function of $\varrho_t$, and
the least squares loss $L_X(v)$ in (12) reduces to a time-integrated score matching loss [HD05, Vin11].
However, $v_t^X$ is generally not a score function or a gradient field, especially in complicated cases where
the coupling $(X_0, X_1)$ is induced from a previous rectified flow, as we iteratively apply the rectification
procedure. In these cases, it is necessary to impose the gradient form, as we do in c-rectified flow.
c-Rectified flow solves Problem (17) We are ready to show that the c-rectified flow solves the optimization
problem in (17). Further, (23) forms a dual problem of (17).
Theorem 5.3. Under the conditions in Theorem 5.2, we have
i) Z = c-Rectify(X) attains the minimum of (17).
ii) Problems (17) and (23) satisfy a strong duality:

$$\inf_f\ L_{X,c}(f) = F_c(X) - \inf_Y\Big\{F_c(Y)\ :\ \mathrm{Law}(Y_t) = \mathrm{Law}(X_t),\ \forall t \in [0,1]\Big\}.$$

As the optima above are achieved by $f^{X,c}$ and Z, we have $L_{X,c}(f^{X,c}) = F_c(X) - F_c(Z)$.
Proof. Write $R_{X,c}(Y) = F_c(X) - F_c(Y)$. First, we show that $L_{X,c}(f) \ge R_{X,c}(Y)$ for any f and Y that
satisfy $\mathrm{Law}(Y_t) = \mathrm{Law}(X_t)$, $\forall t$:

$$\begin{aligned}
R_{X,c}(Y) &= \int_0^1 \mathbb{E}\big[c(\dot X_t) - c(\dot Y_t)\big]\, dt \\
&\overset{(1)}{\le} \int_0^1 \mathbb{E}\big[c(\dot X_t) + c^*(\nabla f_t(Y_t)) - \dot Y_t^\top \nabla f_t(Y_t)\big]\, dt && \text{(Fenchel-Young inequality: } c(y) \ge x^\top y - c^*(x)) \\
&= \int_0^1 \mathbb{E}\big[c(\dot X_t) + c^*(\nabla f_t(Y_t)) - v_t^Y(Y_t)^\top \nabla f_t(Y_t)\big]\, dt && (v_t^Y(Y_t) = \mathbb{E}[\dot Y_t \mid Y_t]) \\
&= \int_0^1 \mathbb{E}\big[c(\dot X_t) + c^*(\nabla f_t(X_t)) - v_t^Y(X_t)^\top \nabla f_t(X_t)\big]\, dt && (\mathrm{Law}(X_t) = \mathrm{Law}(Y_t)) \\
&= \int_0^1 \mathbb{E}\big[c(\dot X_t) + c^*(\nabla f_t(X_t)) - v_t^X(X_t)^\top \nabla f_t(X_t)\big]\, dt && (v^X - v^Y \text{ is X-marginal-preserving}) \\
&= L_{X,c}(f).
\end{aligned}$$

Moreover, if we take $Y = Z$ and $f = f^{X,c}$, then the inequality in $\overset{(1)}{\le}$ is tight because $\dot Z_t = \nabla c^*(\nabla f_t^{X,c}(Z_t))$
holds t-almost surely. Therefore, $R_{X,c}(Z) = L_{X,c}(f^{X,c}) \ge R_{X,c}(Y)$, which shows that Z attains the
maximum of $R_{X,c}$ (under the marginal constraints) and that the strong duality holds.
which establishes that (Z0 , Z1 ) yields no larger transport cost than (X0 , X1 ).
Theorem 5.5. Assume that c is convex with conjugate $c^* \in C^1(\mathbb{R}^d)$, and that the conditions in Definition 5.4
hold. Then Equation (28) holds and $\mathbb{E}[c(Z_1 - Z_0)] \le \mathbb{E}[c(X_1 - X_0)]$.
Compared with the regular Rectify mapping, the key difference here is that the monotonicity of c-Rectify
only holds for the specific c that it employs, rather than for all convex cost functions. More importantly, as
we show below, recursively applying c-Rectify yields optimal couplings w.r.t. c, a key property that the
regular rectified flow misses.
Theorem 5.6. Assume that c is convex with conjugate c∗ , and c, c∗ ∈ C 1 (Rd ) and X is the linear interpo-
lation process of (X0 , X1 ). Assume that (X0 , X1 ) is a c-rectifiable coupling, and f X,c ∈ C 2,1 (Rd × [0, 1]).
Then the following statements are equivalent:
i) (X0 , X1 ) is a fixed point of c-Rectify, that is, (X0 , X1 ) = c-Rectify(X0 , X1 ).
ii) ℓ∗X,c := inf f LX,c (f ) = LX,c (f X,c ) = 0, for LX,c in (23).
iii) (X0 , X1 ) is a c-optimal coupling.
Proof. i) → ii). If (Z0 , Z1 ) = (X0 , X1 ), we have Sc (Z) = 0 and LX,c (f X,c ) = 0 following (28).
iii) → ii). If (X0 , X1 ) is c-optimal, we have E[c(X1 − X0 )] ≤ E[c(Z1 − Z0 )], which again implies that
LX,c (f X,c ) = 0 following (28).
ii) → i) Note that

$$L_{X,c}(f^{X,c}) = \int_0^1 \mathbb{E}\Big[b_c\big(\dot X_t;\ g_t^{X,c}(X_t)\big)\Big]\, dt \ \ge\ 0.$$
Therefore, LX,c (f X,c ) = 0 implies that Ẋt = gtX,c (Xt ) t-almost surely. Because Zt satisfies the same
equation, whose solution is assumed to be unique, we have Z = X and hence (Z0 , Z1 ) = (X0 , X1 ).
ii) → iii) Because X is the linear interpolation, we have Xt = tX1 + (1 − t)X0 , and it simultaneously
satisfies the ODE dXt = gtX,c (Xt )dt. Using Lemma 5.9 shows that (X0 , X1 ) is c-optimal.
Knowing that $L_{X,c}(f^{X,c})$ is an indicator of c-optimality, we show below that it is guaranteed to converge
to zero under recursive c-Rectify updates.
Corollary 5.7. Let $Z^k$ be the k-th c-rectified flow of $(X_0, X_1)$, satisfying $Z^{k+1} = \texttt{c-Rectflow}((Z_0^k, Z_1^k))$
and $(Z_0^0, Z_1^0) = (X_0, X_1)$. Assume each $(Z_0^k, Z_1^k)$ is c-rectifiable for $k = 0, \ldots, K$. Then

$$\sum_{k=0}^{K}\Big[L_{Z^k,c}\big(f^{Z^k,c}\big) + S_c\big(Z^{k+1}\big)\Big] \ \le\ \mathbb{E}[c(X_1 - X_0)].$$

Therefore, if $\mathbb{E}[c(X_1 - X_0)] < +\infty$, we have $\min_{k\le K}\big\{L_{Z^k,c}(f^{Z^k,c}) + S_c(Z^{k+1})\big\} = O(1/K)$.
Proof. Applying (28) to each coupling $(Z_0^k, Z_1^k)$ and summing over $k = 0, \ldots, K$, we obtain

$$\begin{aligned}
\sum_{k=0}^{K}\Big[L_{Z^k,c}\big(f^{Z^k,c}\big) + S_c\big(Z^{k+1}\big)\Big]
&= \sum_{k=0}^{K}\Big(\mathbb{E}\big[c(Z_1^k - Z_0^k)\big] - \mathbb{E}\big[c(Z_1^{k+1} - Z_0^{k+1})\big]\Big) \\
&= \mathbb{E}\big[c(Z_1^0 - Z_0^0)\big] - \mathbb{E}\big[c(Z_1^{K+1} - Z_0^{K+1})\big] \\
&\le \mathbb{E}[c(X_1 - X_0)].
\end{aligned}$$
c-Rectified flow as alternating direction descent on (8) The mapping $Z^{k+1} = \texttt{c-Rectflow}(Z^k)$ can be
interpreted as an alternating direction descent procedure for the dynamic OT problem (8):

$$X^k = \arg\min_Y\ \Big\{F_c(Y)\ \ \text{s.t.}\ \ (Y_0, Y_1) = (Z_0^k, Z_1^k)\Big\}, \qquad (29)$$

$$Z^{k+1} = \arg\min_Y\ \Big\{F_c(Y)\ \ \text{s.t.}\ \ \mathrm{Law}(Y_t) = \mathrm{Law}(X_t^k),\ \forall t \in [0,1]\Big\}. \qquad (30)$$
Here in (29), we minimize $F_c(Y)$ in the set of processes whose start-end pair $(Y_0, Y_1)$ equals the coupling
$(Z_0^k, Z_1^k)$ from $Z^k$, which simply yields the linear interpolation $X_t^k = tZ_1^k + (1-t)Z_0^k$ by Jensen's inequality.
In (30), we minimize Fc (Y ) given the path-wise marginal constraint of Law(Yt ) = Law(Xtk ) for all time
t ∈ [0, 1], which yields the c-rectified flow following Theorem 5.3. Note that the updates in both (29) and
(30) keep the start-end marginal laws Law(Y0 ) and Law(Y1 ) unchanged, and hence the algorithm stays
inside the feasible set {Y : Law(Y0 ) = π0 , Law(Y1 ) = π1 } in (8) once it is initialized to be so.
The updates in (29)-(30) highlight a key difference between our method and the Benamou-Brenier ap-
proach (9)-(10): the key idea of Benamou-Brenier is to restrict the optimization domain to the set of
deterministic, ODE-induced processes (a.k.a. flows), but our updates alternate between the deterministic
c-rectified flow $Z^k$ and the linear interpolation process $X^k$, which is not deterministic or ODE-inducible
unless the fixed point is achieved.
c-Rectified flow as an MM algorithm The majorize-minimization (MM) algorithm [HL04] is a general
optimization recipe that works by finding a surrogate function that majorizes the objective function. Let
$F(X)$ be the objective function to be minimized. An MM algorithm consists of iterative updates of the
form $X^{k+1} \in \arg\min_Y F^+(Y \mid X^k)$, where $F^+$ is a majorization function of F that satisfies

$$F(X^{k+1}) \le F^+(X^{k+1} \mid X^k) \le F^+(X^k \mid X^k) = F(X^k).$$

One can also view MM as conducting coordinate descent on $(X, Y)$ for solving $\min_{X,Y} F^+(Y \mid X)$.
In the following, we show that $(Z_0^{k+1}, Z_1^{k+1}) = \texttt{c-Rectify}((Z_0^k, Z_1^k))$ can be interpreted as an MM algo-
rithm for the static OT problem (1) of minimizing $\mathbb{E}[c(X_1 - X_0)]$ in the set of couplings of $\pi_0$ and $\pi_1$. The
majorization function corresponding to c-Rectify can be shown to be

$$F_c^+\big((Y_0, Y_1) \mid (X_0, X_1)\big) = \inf_{\tilde Y}\ \Big\{F_c(\tilde Y)\ \ \text{s.t.}\ \ (\tilde Y_0, \tilde Y_1) = (Y_0, Y_1),\ \ \tilde Y \in \mathcal{M}_X\Big\},$$
$$\text{with}\quad \mathcal{M}_X = \big\{Y : \mathrm{Law}(Y_t) = \mathrm{Law}(tX_1 + (1-t)X_0),\ \forall t \in [0,1]\big\},$$

where $F_c^+((Y_0, Y_1) \mid (X_0, X_1))$ denotes the minimum value of $F_c(\tilde Y)$ over processes $\tilde Y$ whose start-end points equal
$(Y_0, Y_1)$ and whose marginal laws equal those of the linear interpolation process of $(X_0, X_1)$.
Proposition 5.8. i) Fc+ yields a majorization function of the c-transport cost E[c(Y1 − Y0 )] in the sense that
and the minimum is attained by (X0 , X1 ) = (Y0 , Y1 ), where Π0,1 denotes the set of couplings of π0 and π1 .
ii) c-Rectify yields the MM update related to $F^+$ in that
where the inequality holds because we remove the constraint $\tilde Y \in \mathcal{M}_X$. In addition, it is obvious that the
inequality above becomes an equality when $(X_0, X_1) = (Y_0, Y_1)$.
ii) Note that

$$\inf_{(Y_0,Y_1)}\ F_c^+\big((Y_0, Y_1) \mid (X_0, X_1)\big) = \inf_Y\ \big\{F_c(Y)\ \ \text{s.t.}\ \ Y \in \mathcal{M}_X\big\},$$

where the minimum of the right side is attained by $Y = \texttt{c-Rectflow}((X_0, X_1))$ following Theorem 5.3.
Hence, the minimum of the left side is attained by $(Y_0, Y_1) = \texttt{c-Rectify}((X_0, X_1))$.
5.5 Hamilton-Jacobi Equation and Optimal Transport
The proof of Theorem 5.6 relies on a key lemma showing that if the trajectories of an ODE of the form $dX_t =
\nabla c^*(\nabla f_t(X_t))\,dt$ are geodesic in that $X_t = tX_1 + (1-t)X_0$, then the induced coupling $(X_0, X_1)$ is a c-
optimal coupling of its marginals. The proof of this lemma relies on the Hamilton-Jacobi (HJ) equation, which
provides a characterization of f for an ODE $dX_t = \nabla c^*(\nabla f_t(X_t))\,dt$ whose trajectories are geodesic. The
connection between the HJ equation and optimal transport is a classic result and can be found in, for
example, [Vil21, Vil09].
Lemma 5.9. Let $v_t(x) = \nabla c^*(\nabla f_t(x))$, where $c^* \in C^1(\mathbb{R}^d)$ is the convex conjugate of a convex function c,
$f \in C^{2,1}(\mathbb{R}^d \times [0,1])$, and $\nabla c^*$ is an injective mapping. Assume all trajectories of $dx_t = v_t(x_t)\,dt$ are
geodesic paths in that $x_t = tx_1 + (1-t)x_0$. Then we have:

i) There exists $\tilde f_t$ with $\nabla\tilde f_t = \nabla f_t$ (and hence we can replace f with $\tilde f$ in the assumption), such that the
following Hamilton-Jacobi (HJ) equation holds:

$$\partial_t \tilde f_t(x) + c^*\big(\nabla \tilde f_t(x)\big) = 0.$$

ii) f satisfies

$$f_t(y_t) = \inf_{y_0\in\mathbb{R}^d}\Big\{t\, c\Big(\frac{y_t - y_0}{t}\Big) + f_0(y_0)\Big\}, \quad \forall t \in [0,1],\ y_t \in \mathbb{R}^d, \qquad \text{(Hopf-Lax formula)}$$

where the minimum is attained if $\{y_t\}$ follows the ODE $dy_t = v_t(y_t)\,dt$.
iii) Assume a coupling (X0 , X1 ) of π0 , π1 satisfies dXt = vt (Xt )dt. Then (X0 , X1 ) is a c-optimal coupling.
Proof. i) Starting from any point $x_t = x \in \mathbb{R}^d$ at time t, because the trajectories of $dx_t = v_t(x_t)\,dt$ are
geodesic, we have $\dot x_t = v_t(x_t) = \mathrm{const}$ along the trajectory. Because $v_t(x) = \nabla c^*(\nabla f_t(x))$ and $\nabla c^*$
is injective, we have $\nabla f_t(x_t) = \mathrm{const}$ as well. Hence, we have

$$0 = \frac{d}{dt}\nabla f_t(x_t) = \partial_t\nabla f_t(x_t) + \nabla^2 f_t(x_t)\,\dot x_t = \partial_t\nabla f_t(x_t) + \nabla^2 f_t(x_t)\,\nabla c^*(\nabla f_t(x_t)).$$

On the other hand, define $h_t(x) = \partial_t f_t(x) + c^*(\nabla f_t(x))$. Then we have

$$\nabla h_t(x_t) = \partial_t\nabla f_t(x_t) + \nabla^2 f_t(x_t)\,\nabla c^*(\nabla f_t(x_t)) = 0.$$

This suggests that $\nabla_x h_t(x) = 0$ everywhere and hence $h_t(x)$ does not depend on x. Define $\tilde f_t(x) =
f_t(x) - \int_0^t h_s(x_0)\,ds$, where $x_0$ is any fixed point in $\mathbb{R}^d$. Then $\nabla\tilde f_t = \nabla f_t$ and
$\partial_t\tilde f_t(x) + c^*(\nabla\tilde f_t(x)) = h_t(x) - h_t(x_0) = 0$, so $\tilde f$ satisfies the HJ equation.
ii) Take any $y_0, y_1$ in $\mathbb{R}^d$ and let $y_t = ty_1 + (1-t)y_0$ be their linear interpolation. We have

$$\begin{aligned}
f_1(y_1) - f_0(y_0) &= \int_0^1\big(\partial_t f_t(y_t) + \nabla f_t(y_t)^\top(y_1 - y_0)\big)\, dt \\
&= \int_0^1\big(\nabla f_t(y_t)^\top(y_1 - y_0) - c^*(\nabla f_t(y_t))\big)\, dt && (h_t = \partial_t f_t + c^*(\nabla f_t) = 0) \\
&\overset{(1)}{\le} \int_0^1 c(y_1 - y_0)\, dt && (c(x) + c^*(y) \ge x^\top y) \\
&= c(y_1 - y_0).
\end{aligned}$$

The inequality in $\overset{(1)}{\le}$ is attained as an equality if $y_t$ follows the geodesic ODE $dy_t = v_t(y_t)\,dt$, as we have
$y_1 - y_0 = \nabla c^*(\nabla f_t(y_t))$ for all t in this case. A similar derivation holds for $f_t$ with general t.
iii) Note that the derivation in ii) gives $c(y_1 - y_0) \ge f_1(y_1) - f_0(y_0)$ for all $y_0, y_1$. For any coupling $(Y_0, Y_1)$ of $\pi_0, \pi_1$, we have

$$\mathbb{E}[c(Y_1 - Y_0)] \ge \mathbb{E}[f_1(Y_1) - f_0(Y_0)] = \mathbb{E}[f_1(X_1) - f_0(X_0)] = \mathbb{E}[c(X_1 - X_0)].$$

Hence, $(X_0, X_1)$ is a c-optimal coupling.
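For concreteness (a standard specialization, not spelled out in the text), take the quadratic cost $c(x) = c^*(x) = \frac12\|x\|^2$, so that $v_t = \nabla f_t$. The HJ equation in i) becomes

$$\partial_t \tilde f_t(x) + \tfrac12\big\|\nabla \tilde f_t(x)\big\|^2 = 0,$$

and the Hopf-Lax formula in ii) becomes the inf-convolution

$$f_t(y) = \inf_{y_0\in\mathbb{R}^d}\Big\{\frac{\|y - y_0\|^2}{2t} + f_0(y_0)\Big\},$$

which is the classical viscosity-solution formula for this HJ equation.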
Connection to Benamou-Brenier Formula The results in Lemma 5.9 can also be formally derived from the
Benamou-Brenier problem (10), as shown in the seminal work of [BB00]. By introducing a Lagrange
multiplier $\lambda : \mathbb{R}^d \times [0,1] \to \mathbb{R}$ for the constraint $\dot\varrho_t + \nabla\cdot(v_t\varrho_t) = 0$, the problem in (10) can be framed
as a minimax problem:

$$\inf_{v,\varrho}\ \sup_\lambda\ \mathcal{L}(v, \varrho, \lambda) := \int_0^1\!\!\int \Big(c(v_t)\,\varrho_t + \lambda_t\big(\dot\varrho_t + \nabla\cdot(v_t\varrho_t)\big)\Big)\, dx\, dt \quad \text{s.t.}\quad \varrho \in \Gamma_{0,1},$$

where $\mathcal{L}(v, \varrho, \lambda)$ is the Lagrangian function, and $\Gamma_{0,1}$ denotes the set of density functions $\{\varrho_t\}_t$ satisfying
$\varrho_0 = d\pi_0/dx$, $\varrho_1 = d\pi_1/dx$. Note the following integration-by-parts formulas:

$$\int \big(\lambda_t\,\nabla\cdot(v_t\varrho_t) + \nabla\lambda_t^\top v_t\,\varrho_t\big)\, dx = 0, \qquad \int_0^1 \big(\lambda_t\,\dot\varrho_t + \dot\lambda_t\,\varrho_t\big)\, dt = \lambda_1\varrho_1 - \lambda_0\varrho_0.$$
6 Discussion and Open Questions
1. Corollary 5.7 only bounds the surrogate measure $\ell^*_{Z^k,c}$. Can we directly bound the optimality gap in
the c-transport cost, $e_k^* = \mathbb{E}[c(Z_1^k - Z_0^k)] - \inf_{(Z_0,Z_1)} \mathbb{E}[c(Z_1 - Z_0)]$? Can we find a certain type of
strong-convexity-like condition under which $e_k^*$ decays exponentially with k?
2. For machine learning (ML) tasks such as generative models and domain transfer, the transport cost
is not necessarily the direct object of interest. In these cases, as suggested in [LGL22], rectified flow
might be preferred because it is simpler and does not require specifying a particular cost c. Question:
for such ML tasks, when would it be preferred to use OT with a specific c, and how to choose c
optimally?
3. In practice, recursively applying the (c-)rectification accumulates errors because the training optimization
for the drift field and the simulation of the ODE cannot be conducted perfectly. How can we avoid
the error accumulation at each step? Assume $\{x_{1,i}\}_i \sim \pi_1$, and $\{(z_{0,i}^k, z_{1,i}^k)\}_i$ is obtained by solving the
ODE of the k-th c-rectified flow starting from $z_{0,i}^k \sim \pi_0$. As we increase k, $\{z_{1,i}^k\}_i$ may yield an increasingly
bad approximation of $\pi_1$ due to the error accumulation. One way to fix this is to adjust $\{z_{1,i}^k\}_i$
to make it closer to $\{x_{1,i}\}_i$ at each step. This can be done by reweighting/transporting $\{z_{1,i}^k\}_i$ towards
$\{x_{1,i}\}_i$ by minimizing a certain discrepancy measure, or by replacing each $z_{1,i}^k$ with $x_{1,\sigma(i)}$, where $\sigma$ is a
permutation that yields a one-to-one matching between $\{z_{1,i}^k\}_i$ and $\{x_{1,i}\}_i$ (see the short sketch after this
list). The key and challenging part is to do the adjustment in a good and fast way, ideally with a (near)
linear time complexity.
4. With or without the adjustment step, develop a complete theoretical analysis of the statistical error of
the method.
5. In what precise sense is rectified flow solving a multi-objective variant of optimal transport?
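Below is a minimal sketch of the matching-based adjustment mentioned in item 3 above; it uses a generic assignment solver with an assumed squared-distance cost, and, being $O(n^3)$, it does not address the near-linear time requirement raised there.

```python
# Illustrative matching adjustment: snap each simulated endpoint z1_i to a real sample
# x1_{sigma(i)} via a one-to-one assignment minimizing the total squared distance.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_endpoints(z1, x1):
    cost = ((z1[:, None, :] - x1[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    rows, cols = linear_sum_assignment(cost)                  # optimal permutation sigma
    return x1[cols[np.argsort(rows)]]                         # x1_{sigma(i)} aligned with z1_i
```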
References
[ABS21] Luigi Ambrosio, Elia Brué, and Daniele Semola. Lectures on optimal transport. Springer,
2021.
[ACB17] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial net-
works. In International conference on machine learning, pages 214–223. PMLR, 2017.
[AHW95] Peter Auer, Mark Herbster, and Manfred KK Warmuth. Exponentially many local minima for
single neurons. Advances in neural information processing systems, 8, 1995.
[BB00] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the
monge-kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.
[BMD+ 05] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, Joydeep Ghosh, and John Lafferty.
Clustering with bregman divergences. Journal of machine learning research, 6(10), 2005.
[CFT14] Nicolas Courty, Rémi Flamary, and Devis Tuia. Domain adaptation with regularized optimal
transport. In Joint European Conference on Machine Learning and Knowledge Discovery in
Databases, pages 274–289. Springer, 2014.
[EMM12] Tarek A El Moselhy and Youssef M Marzouk. Bayesian inference with optimal maps. Journal
of Computational Physics, 231(23):7815–7850, 2012.
[FG21] Alessio Figalli and Federico Glaudo. An Invitation to Optimal Transport, Wasserstein Dis-
tances, and Gradient Flows. 2021.
[HCTC20] Chin-Wei Huang, Ricky TQ Chen, Christos Tsirigotis, and Aaron Courville. Convex poten-
tial flows: Universal probability distributions with optimal transport and convex optimization.
arXiv preprint arXiv:2012.05942, 2020.
[HD05] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score
matching. Journal of Machine Learning Research, 6(4), 2005.
[HL04] David R Hunter and Kenneth Lange. A tutorial on mm algorithms. The American Statistician,
58(1):30–37, 2004.
[KLG+ 21] Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M Solomon, Alexander Filippov, and
Evgeny Burnaev. Do neural optimal transport solvers work? a continuous wasserstein-2 bench-
mark. Advances in Neural Information Processing Systems, 34:14593–14605, 2021.
[KSB22] Alexander Korotin, Daniil Selikhanovych, and Evgeny Burnaev. Neural optimal transport.
arXiv preprint arXiv:2201.12220, 2022.
[Kur11] Thomas G Kurtz. Equivalence of stochastic equations and martingale problems. In Stochastic
analysis 2010, pages 113–130. Springer, 2011.
[LGL22] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate
and transfer data with rectified flow. preprint, 2022.
[McC97] Robert J McCann. A convexity principle for interacting gases. Advances in mathematics,
128(1):153–179, 1997.
[MMPS16] Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini. An introduction to
sampling via measure transport. arXiv e-prints, pages arXiv–1602, 2016.
[MTOL20] Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport
mapping via input convex neural networks. In International Conference on Machine Learning,
pages 6672–6681. PMLR, 2020.
[OPV14] Yann Ollivier, Hervé Pajot, and Cedric Villani. Optimal Transportation: Theory and Applica-
tions. Number 413. Cambridge University Press, 2014.
[PC+ 19] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data
science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
[San15] Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-
63):94, 2015.
[SE19] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data dis-
tribution. Advances in Neural Information Processing Systems, 32, 2019.
[SME20] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In
International Conference on Learning Representations, 2020.
[SRGB14] Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Wasserstein propa-
gation for semi-supervised learning. In International Conference on Machine Learning, pages
306–314. PMLR, 2014.
[SSDK+ 20] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. In
International Conference on Learning Representations, 2020.
[TT16] Giulio Trigila and Esteban G Tabak. Data-driven optimal transport. Communications on Pure
and Applied Mathematics, 69(4):613–648, 2016.
[Vil09] Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.
[Vil21] Cédric Villani. Topics in optimal transportation, volume 58. American Mathematical Soc.,
2021.
[Vin11] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural
computation, 23(7):1661–1674, 2011.
A Proofs
Proof of Example 3.5. i) The fact that $A^\top A = I$ and $\pi_0 = \pi_1 = N(0, I)$ ensures that $AX_0 \sim \pi_1$, and
hence $(X_0, AX_0)$ is a coupling of $\pi_0$ and $\pi_1$. Let $X_t = tAX_0 + (1-t)X_0$ be the linear interpolation of the
coupling, so that $\dot X_t = AX_0 - X_0$. Canceling $X_0$ yields that

$$\dot X_t = (A - I)\big(tA + (1-t)I\big)^{-1} X_t, \qquad (32)$$

where we use the fact that $tA + (1-t)I$ is invertible for $t \in [0,1]$, which we prove as follows: if
$tA + (1-t)I$ is not invertible, then A must have $\lambda = -\frac{1-t}{t}$ as one of its eigenvalues; but as a rotation
matrix, all eigenvalues of A must have norm 1, which means that we must have $t = 0.5$ and $\lambda = -1$.
This, however, is excluded by the assumption that A is non-reflecting (and hence $\lambda \ne -1$).

Equation (32) shows that $\dot X_t$ is uniquely determined by $X_t$. Hence, we have $\int_0^1 \mathbb{E}\big[\mathrm{var}(\dot X_t \mid X_t)\big]\, dt = 0$,
which implies that $(X_0, AX_0)$ is a straight coupling by Theorem 3.6 of [LGL22].
ii) Let c be a second-order differentiable convex function whose Hessian matrix $\nabla^2 c(x)$ is invertible every-
where. Let $c^*$ be the convex conjugate of c; then $c^*$ is also second-order differentiable, with $\nabla c(\nabla c^*(x)) = x$
and $\nabla^2 c^*(x) = \nabla^2 c(x)^{-1}$.

If $(X_0, AX_0)$ is a c-optimal coupling, there must exist a function $\phi : \mathbb{R}^d \to \mathbb{R}$ such that

$$AX_0 - X_0 = \nabla c^*\big(\nabla\phi(X_0)\big), \qquad (33)$$

where $c^*$ is the convex conjugate of c. Equation (33) is equivalent to $\nabla c(Ax - x) = \nabla\phi(x)$, which means
that $\nabla\phi$ is continuously differentiable. Taking the gradient on both sides of (33) gives

$$A - I = \nabla^2 c^*\big(\nabla\phi(x)\big)\,\nabla^2\phi(x) =: H_x B_x, \qquad (34)$$

where $H_x, B_x$ are both symmetric and $H_x$ is positive definite. Hence $H_x B_x$ is diagonalizable (all its
eigenvalues are real) by Lemma A.1. However, $A - I$ is not diagonalizable because A must have
complex eigenvalues as a non-reflecting, non-identity rotation matrix. Hence, (34) cannot hold.
Lemma A.1. Assume that A, B are two real symmetric matrices and A is positive definite. Then AB is
diagonalizable (on the real domain), that is, there exists an invertible matrix P such that $P^{-1}ABP$ is a
diagonal matrix.

Proof. This is a standard result in linear algebra. Because A is positive definite, there exists an invertible
symmetric matrix C such that $CC = A$. Then $AB = CCB$ is similar to $C^{-1}(AB)C = CBC$, which is
symmetric and hence diagonalizable.