Wasserstein Propagation for Semi-Supervised Learning

Abstract

Probability distributions and histograms are natural representations for product ratings, traffic measurements, and other data considered in many machine learning applications. Thus, this paper introduces a technique for graph-based semi-supervised learning of histograms, derived from the theory of optimal transportation. Our method has several properties making it suitable for this application; in particular, its behavior can be characterized by the moments and shapes of the histograms at the labeled nodes. In addition, it can be used for histograms on non-standard domains like circles, revealing a strategy for manifold-valued semi-supervised learning. We also extend this technique to related problems such as smoothing distributions on graph nodes.

1. Introduction

Graph-based semi-supervised learning is an effective approach for learning problems involving a limited amount of labeled data (Singh et al., 2008). Methods in this class typically propagate labels from a subset of nodes of a graph to the rest of the nodes. Usually each node is associated with a real number, but in many applications labels are more naturally expressed as histograms or probability distributions. For instance, the traffic density at a given location can be seen as a histogram over the 24-hour cycle; these densities may be known only where a service has cameras installed but need to be propagated to the entire map. Product ratings, climatic measurements, and other data sources exhibit similar structure.

While methods for numerical labels, such as Belkin & Niyogi (2001); Zhu et al. (2003); Belkin et al. (2006); Zhou & Belkin (2011); Ji et al. (2012) (also see the survey by Zhu (2008) and references therein), can be applied bin-by-bin to propagate normalized frequency counts, this strategy does not model interactions between histogram bins. As a result, a fundamental aspect of this type of data is ignored, leading to artifacts even when propagating Gaussian distributions.

Among the first works directly addressing semi-supervised learning of probability distributions is Subramanya & Bilmes (2011), which propagates distributions representing class memberships. Their loss function, however, is based on Kullback-Leibler divergence, which cannot capture interactions between histogram bins. Talukdar & Crammer (2009) allow interactions between bins by essentially modifying the underlying graph to its tensor product with a prescribed bin interaction graph; this approach loses probabilistic structure and tends to oversmooth. Similar issues have been encountered in the mathematical literature (McCann, 1997; Agueh & Carlier, 2011) and in vision/graphics applications (Bonneel et al., 2011; Rabin et al., 2012) involving interpolation of probability distributions. Their solutions attempt to find weighted barycenters of distributions, which is insufficient for propagating distributions along graphs.

The goal of our work is to provide an efficient and theoretically sound approach to graph-based semi-supervised learning of probability distributions. Our strategy uses the machinery of optimal transportation (Villani, 2003). Inspired by Solomon et al. (2013), we employ the two-Wasserstein distance between distributions to construct a regularizer measuring the “smoothness” of an assignment of a probability distribution to each graph node. The final assignment is produced by optimizing this energy while fitting the histogram predictions at labeled nodes.

Our technique has many notable properties. As certainty in the known distributions increases, it reduces to the method of label propagation via harmonic functions (Zhu et al., 2003). Also, the moments and other characteristics of the propagated distributions can be characterized in terms of the histograms at the labeled nodes (see Proposition 3).
we replace the square distance between scalar function values appearing in the classical Dirichlet energy (namely the quantity |f_v − f_w|²) with an appropriate distance between the distributions ρ_v and ρ_w. Rather than using the bin-by-bin KL divergence, however, we use the Wasserstein distance with quadratic cost between probability distributions with finite second moment on R. This distance is defined as

\[ W_2(\rho_v, \rho_w) := \Big( \inf_{\pi \in \Pi(\rho_v, \rho_w)} \iint_{\mathbb{R}^2} |x - y|^2 \, d\pi(x, y) \Big)^{1/2}, \]

where Π(ρ_v, ρ_w) is the set of joint probability measures on R² with marginals ρ_v and ρ_w. In terms of the inverse CDFs F_0^{-1} and F_1^{-1} of two distributions ρ_0, ρ_1 ∈ Prob(R), this distance admits the closed form (Proposition 1)

\[ W_2^2(\rho_0, \rho_1) = \int_0^1 \big( F_1^{-1}(s) - F_0^{-1}(s) \big)^2 \, ds. \qquad (2) \]

By applying (2) to the minimization problem (1), we obtain a linear strategy for our propagation problem.

Proposition 2. Wasserstein propagation can be characterized in the following way. For each v ∈ V_0 let F_v be the CDF of the distribution ρ_v. Now suppose that for each s ∈ [0, 1] we determine g_s : V → R as the solution of the classical Dirichlet problem

\[ \Delta g_s(v) = 0 \;\; \forall v \in V \setminus V_0, \qquad g_s(v) = F_v^{-1}(s) \;\; \forall v \in V_0. \qquad (3) \]

Then the map ρ minimizing (1) subject to the fixed boundary constraints satisfies F_v^{-1}(s) = g_s(v) for all v ∈ V.
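As a concrete illustration of the closed form (2), the following sketch (not from the paper; a minimal NumPy example with hypothetical helper names such as `inverse_cdf_samples`) estimates W_2^2 between two histograms on a common set of bins by sampling their inverse CDFs on an evenly-spaced grid in [0, 1], the same discretization used later in §5.

```python
import numpy as np

def inverse_cdf_samples(hist, bin_centers, s_grid):
    """Sample the inverse CDF of a 1-D histogram at the values in s_grid.

    The histogram is treated as a sum of point masses at bin_centers,
    so its CDF is a step function and the inverse CDF is piecewise constant.
    """
    p = np.asarray(hist, dtype=float)
    p = p / p.sum()                      # normalize to a probability distribution
    centers = np.asarray(bin_centers, dtype=float)
    cdf = np.cumsum(p)
    # For each s, return the first bin center whose CDF value reaches s.
    idx = np.searchsorted(cdf, np.clip(s_grid, 0.0, cdf[-1]))
    return centers[np.minimum(idx, len(centers) - 1)]

def wasserstein2_squared(hist_a, hist_b, bin_centers, n_samples=200):
    """Approximate W_2^2 between two histograms on the same bins via equation (2)."""
    s = (np.arange(n_samples) + 0.5) / n_samples   # midpoint quadrature on [0, 1]
    Fa_inv = inverse_cdf_samples(hist_a, bin_centers, s)
    Fb_inv = inverse_cdf_samples(hist_b, bin_centers, s)
    return np.mean((Fa_inv - Fb_inv) ** 2)

# Example: two unit point masses at 0.2 and 0.7 have W_2^2 = (0.7 - 0.2)^2 = 0.25.
centers = np.linspace(0.0, 1.0, 11)
a = np.zeros(11)
a[2] = 1.0    # point mass at centers[2] = 0.2
b = np.zeros(11)
b[7] = 1.0    # point mass at centers[7] = 0.7
print(wasserstein2_squared(a, b, centers))   # ~0.25
```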
Distribution-valued maps ρ : V → Prob(R) propagated by optimizing (1) satisfy many analogs of functions extended using the classical Dirichlet problem. Two results of this kind concern the mean m(v) and the standard deviation σ(v) of the distributions ρ_v as functions of v ∈ V. These are defined as

\[ m(v) := \int_{-\infty}^{\infty} x\, \rho_v(x) \, dx, \qquad \sigma^2(v) := \int_{-\infty}^{\infty} (x - m(v))^2 \rho_v(x) \, dx. \]

Proposition 3. Suppose the distribution-valued map ρ : V → Prob(R) is obtained using Wasserstein propagation. Then for all v ∈ V the following estimates hold.

• inf_{v_0 ∈ V_0} m(v_0) ≤ m(v) ≤ sup_{v_0 ∈ V_0} m(v_0).
• 0 ≤ σ(v) ≤ sup_{v_0 ∈ V_0} σ(v_0).

Proof. Both estimates can be derived from the following formula. Let ρ ∈ Prob(R) and let φ : R → R be any integrable function. If we apply the change of variables s = F(x), where F is the CDF of ρ, in the integral defining the expectation of φ with respect to ρ, we get

\[ \int_{-\infty}^{\infty} \phi(x)\, \rho(x) \, dx = \int_0^1 \phi\big(F^{-1}(s)\big) \, ds. \]

Thus m(v) = ∫_0^1 F_v^{-1}(s) ds and σ²(v) = ∫_0^1 (F_v^{-1}(s) − m(v))² ds, where F_v is the CDF of ρ_v for each v ∈ V.

Assume ρ minimizes (1) with fixed boundary constraints on V_0. By Proposition 2, we then have ∆F_v^{-1} = 0 for all v ∈ V. Therefore ∆m(v) = ∫_0^1 ∆F_v^{-1}(s) ds = 0, so m is a harmonic function on V. The estimates for m follow by the maximum principle for harmonic functions. Also,

\[ \Delta[\sigma^2(v)] = \int_0^1 \Delta\big(F_v^{-1}(s) - m(v)\big)^2 \, ds = \sum_{(v, v') \in E} \int_0^1 \big(a(v, s) - a(v', s)\big)^2 \, ds \ge 0, \]

where a(v, s) := F_v^{-1}(s) − m(v), since ∆F_v^{-1}(s) = ∆m(v) = 0. Thus σ² is a subharmonic function on V, and the upper bound for σ² follows by the maximum principle for subharmonic functions.
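To make the change-of-variables formula in this proof concrete, the short sketch below (illustrative only; it assumes evenly-spaced quantile samples such as those produced by the hypothetical `inverse_cdf_samples` helper above) estimates m(v) and σ²(v) directly from samples of F_v^{-1}.

```python
import numpy as np

def moments_from_inverse_cdf(F_inv_samples):
    """Estimate the mean and variance of a distribution from samples
    F_inv_samples[i] ~ F^{-1}(s_i) taken at evenly-spaced s_i in (0, 1).

    These are midpoint-rule approximations of
        m = integral_0^1 F^{-1}(s) ds   and   sigma^2 = integral_0^1 (F^{-1}(s) - m)^2 ds.
    """
    F_inv = np.asarray(F_inv_samples, dtype=float)
    mean = F_inv.mean()
    variance = np.mean((F_inv - mean) ** 2)
    return mean, variance

# Example: for the uniform distribution on [0, 1], F^{-1}(s) = s, so the
# estimates approach m = 1/2 and sigma^2 = 1/12 as the number of samples grows.
s = (np.arange(1000) + 0.5) / 1000
print(moments_from_inverse_cdf(s))   # ~(0.5, 0.0833)
```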
Finally, we check that if we encode a classical interpolation problem using Dirac delta distributions, we recover the classical solution. The essence of this result is that if the boundary data for Wasserstein propagation has zero variance, then the solution must also have zero variance.

Proposition 4. Suppose that there exists u : V_0 → R such that ρ_v(x) = δ(x − u(v)) for all v ∈ V_0. Then the solutions of the classical Dirichlet problem and the Wasserstein propagation problem coincide in the following way. Suppose that f : V → R satisfies the classical Dirichlet problem with boundary data u. Then ρ_v(x) := δ(x − f(v)) minimizes (1) subject to the fixed boundary constraints.

Proof. The boundary data for ρ given here yields the boundary data g_s(v) = u(v) for all v ∈ V_0 and s ∈ [0, 1) in the Dirichlet problem (3). The solution of this Dirichlet problem is thus constant in s, say g_s(v) = f(v) for all s ∈ [0, 1) and v ∈ V. The only distributions whose inverse CDFs are of this form are δ-distributions; hence ρ_v(x) = δ(x − f(v)), as desired.

3.2. Application to Smoothing

Using the connection to the classical Dirichlet problem in Proposition 2, we can extend our treatment to other differential equations. There is a large space of differential equations that have been adapted to graphs via the discrete Laplacian ∆; here we focus on the heat equation, considered e.g. in Chung et al. (2007).

The heat equation for scalar functions is applied to smoothing problems; for example, in R^n solving the heat equation is equivalent to Gaussian convolution. Just as the Dirichlet equation on F^{-1} is equivalent to Wasserstein propagation, heat diffusion on F^{-1} is equivalent to gradient flow of the energy E_D in (1), providing a straightforward way to understand and implement such a diffusive process.

Proposition 5. Let ρ : V → Prob(R) be a distribution-valued map, and let F_v^{-1} : [0, 1] → R be the inverse CDF of ρ_v for each v ∈ V. Then these two procedures are equivalent:

• Mass-preserving flow of ρ in the direction of steepest descent of the Dirichlet energy.
• Heat flow of the inverse CDFs.

Proof. A mass-preserving flow of ρ is a family of distribution-valued maps ρ_ε : V → Prob(R) with ε ∈ (−ε_0, ε_0) that satisfies the equations

\[ \frac{\partial \rho_{v,\varepsilon}(t)}{\partial \varepsilon} + \frac{\partial}{\partial t}\big[ Y_v(\varepsilon, t)\, \rho_{v,\varepsilon}(t) \big] = 0 \;\; \forall v \in V, \qquad \rho_{v,0}(t) = \rho_v(t), \]

where Y_v : (−ε_0, ε_0) × R → R is an arbitrary function that governs the flow. By applying the change of variables t = F_{v,ε}^{-1}(s) using the inverse CDFs of the ρ_{v,ε}, we find that this flow is equivalent to the equations

\[ \frac{\partial F_{v,\varepsilon}^{-1}(s)}{\partial \varepsilon} = Y_v\big(\varepsilon, F_{v,\varepsilon}^{-1}(s)\big) \;\; \forall v \in V, \qquad F_{v,0}^{-1}(s) = F_v^{-1}(s). \]
A short calculation starting from (1) now leads to the derivative of the Dirichlet energy under such a flow, namely

\[ \frac{d E_D(\rho_\varepsilon)}{d\varepsilon} = -2 \sum_{v \in V} \int_0^1 \Delta\big(F_{v,\varepsilon}^{-1}(s)\big) \cdot Y_v\big(\varepsilon, F_{v,\varepsilon}^{-1}(s)\big) \, ds. \]

Thus, steepest descent for the Dirichlet energy is achieved by choosing Y_v(ε, F_{v,ε}^{-1}(s)) := ∆(F_{v,ε}^{-1}(s)) for each v, ε, s. As a result, the equation for the evolution of F_{v,ε}^{-1} becomes

\[ \frac{\partial F_{v,\varepsilon}^{-1}(s)}{\partial \varepsilon} = \Delta\big(F_{v,\varepsilon}^{-1}(s)\big) \;\; \forall v \in V, \qquad F_{v,0}^{-1}(s) = F_v^{-1}(s), \]

which is exactly heat flow of F_{v,ε}^{-1}.
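In the discrete setting this flow is straightforward to simulate: sample each inverse CDF on an evenly-spaced grid and evolve the samples with the graph Laplacian. The sketch below is a minimal illustration rather than the authors' implementation; it assumes the sign convention ∆ = W − D (so that heat flow smooths), and the `implicit` branch mirrors the (I − t∆)^{-1} stepping described in §5.

```python
import numpy as np

def graph_delta(W):
    """Graph operator with the sign convention Delta = W - D, so that
    harmonic functions average their neighbors and heat flow is smoothing."""
    return W - np.diag(W.sum(axis=1))

def heat_flow_inverse_cdfs(F_inv, W, step=0.1, n_steps=50, implicit=True):
    """Smooth a field of inverse CDFs by heat flow on the graph.

    F_inv : (n_vertices, m) array; row v holds samples of F_v^{-1} on a grid of s values.
    W     : (n_vertices, n_vertices) symmetric nonnegative weight matrix.
    """
    delta = graph_delta(W)
    F = np.array(F_inv, dtype=float)
    if implicit:
        # Implicit Euler: solve (I - step * Delta) F_new = F at every iteration.
        A = np.eye(len(W)) - step * delta
        for _ in range(n_steps):
            F = np.linalg.solve(A, F)
    else:
        # Explicit Euler gradient step on the Dirichlet energy.
        for _ in range(n_steps):
            F = F + step * (delta @ F)
    # With implicit stepping (or a small enough explicit step), each row stays
    # non-decreasing in s, so the evolved samples remain valid inverse CDFs.
    return F
```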
4. Generalization

Our preceding discussion involves distribution-valued maps into Prob(R), but in a more general setting we might wish to replace Prob(R) with Prob(Γ) for an alternative domain Γ carrying a distance metric d. Our original formulation of Wasserstein propagation easily handles such an extension by replacing |x − y|² with d(x, y)² in the definition of W_2. Furthermore, although proofs in this case are considerably more involved, some key properties proved above for Prob(R) extend naturally.

In this case, we can no longer rely on the computational benefits of Propositions 2 and 5, but we can solve the propagation problem directly. If Γ is discrete, then Wasserstein distances between the ρ_v's can be computed using a linear program. Suppose we represent two histograms as {a_1, . . . , a_m} and {b_1, . . . , b_m} with a_i, b_i ≥ 0 for all i and Σ_i a_i = Σ_i b_i = 1. Then, the definition of W_2 yields the optimization

\[ W_2^2(\{a_i\}, \{b_j\}) = \min_{x} \sum_{ij} d_{ij}^2 x_{ij} \quad \text{s.t.} \quad \sum_j x_{ij} = a_i \;\forall i, \quad \sum_i x_{ij} = b_j \;\forall j, \quad x_{ij} \ge 0 \;\forall i, j. \qquad (4) \]

Here d_{ij} is the distance from bin i to bin j, which need not be proportional to |i − j|.

From this viewpoint, the energy E_D from (1) remains convex in ρ and can be optimized using a linear program simply by summing terms of the form (4) above:

\[
\begin{aligned}
\min_{\rho, x} \; & \sum_{e \in E} \sum_{ij} d_{ij}^2 x_{ij}^{(e)} \\
\text{s.t.} \; & \sum_j x_{ij}^{(e)} = \rho_{vi} \quad \forall e = (v, w) \in E,\; i \in S \\
& \sum_i x_{ij}^{(e)} = \rho_{wj} \quad \forall e = (v, w) \in E,\; j \in S \\
& \sum_i \rho_{vi} = 1 \;\; \forall v \in V, \qquad \rho_{vi} \text{ fixed } \forall v \in V_0 \\
& \rho_{vi} \ge 0 \;\; \forall v \in V,\; i \in S, \qquad x_{ij}^{(e)} \ge 0 \;\; \forall i, j \in S,
\end{aligned}
\]

where S = {1, . . . , m}.
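As a sanity check of formulation (4), here is a direct translation into a generic LP solver (an illustrative sketch, not the authors' implementation; it uses SciPy's `linprog`, and the equality constraints are exactly the two marginal conditions above):

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein2_squared_lp(a, b, D):
    """Solve problem (4): W_2^2 between histograms a and b on a discrete domain.

    a, b : length-m nonnegative vectors summing to 1.
    D    : (m, m) matrix of pairwise ground distances d_ij.
    """
    m = len(a)
    cost = (D ** 2).reshape(-1)          # objective sum_ij d_ij^2 x_ij, x flattened row-major

    # Row-marginal constraints: sum_j x_ij = a_i for each i.
    A_rows = np.zeros((m, m * m))
    for i in range(m):
        A_rows[i, i * m:(i + 1) * m] = 1.0
    # Column-marginal constraints: sum_i x_ij = b_j for each j.
    A_cols = np.zeros((m, m * m))
    for j in range(m):
        A_cols[j, j::m] = 1.0

    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([a, b])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Example on the unit circle: d_ij is the arc length between m evenly-spaced bins,
# which is not proportional to |i - j|.
m = 8
theta = 2 * np.pi * np.arange(m) / m
diff = np.abs(theta[:, None] - theta[None, :])
D = np.minimum(diff, 2 * np.pi - diff)   # circular ground distance
a = np.full(m, 1.0 / m)                  # uniform histogram
b = np.zeros(m)
b[0] = 1.0                               # point mass at angle 0
print(wasserstein2_squared_lp(a, b, D))
```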
5. Algorithm Details

We handle the general case from §4 by optimizing the linear programming formulation directly. Given the size of these linear programs, we use large-scale barrier method solvers.

The characterizations in Propositions 2 and 5, however, suggest a straightforward discretization and accompanying set of optimization algorithms in the linear case. In fact, we can recover propagated distributions by inverting the graph Laplacian ∆ via a sparse linear solve, leading to near-real-time results for moderately-sized graphs G.

For a given graph G = (V, E) and subset V_0 ⊆ V, we discretize the domain [0, 1] of F_v^{-1} for each v using a set of evenly-spaced samples s_0 = 0, s_1, . . . , s_m = 1. This representation supports any ρ_v provided it is possible to sample the inverse CDF from Proposition 1 at each s_i. In particular, when the underlying distributions are histograms, we model ρ_v using δ functions at evenly-spaced bin centers, which have piecewise constant CDFs; we model continuous ρ_v using piecewise linear interpolation. Regardless, in the end we obtain a non-decreasing set of samples (F^{-1})_v^1, . . . , (F^{-1})_v^m with (F^{-1})_v^1 = 0 and (F^{-1})_v^m = 1.

Now that we have sampled F_v^{-1} for each v ∈ V_0, we can propagate to the remainder V \ V_0. For each i ∈ {1, . . . , m}, we solve the system from (3):

\[ \Delta g = 0 \;\; \forall v \in V \setminus V_0, \qquad g(v) = (F^{-1})_v^i \;\; \forall v \in V_0. \qquad (5) \]

In the diffusion case, we replace this system with implicit time stepping for the heat equation, iteratively applying (I − t∆)^{-1} to g for diffusion time step t. In either case, the linear solve is sparse, symmetric, and positive definite; we apply Cholesky factorization to solve the systems directly.

This process propagates F^{-1} to the entire graph, yielding samples (F^{-1})_v^i for all v ∈ V. We invert once again to yield samples ρ_v^i for all v ∈ V. Of course, each inversion incurs some potential for sampling and discretization error, but in practice we are able to oversample sufficiently to overcome most potential issues. When the inputs ρ_v are discrete histograms, we return to this discrete representation by integrating the resulting ρ_v ∈ Prob([0, 1]) over the width of the bin about the center defined above.

This algorithm is efficient even on large graphs and is easily parallelizable. For instance, the initial sampling steps for obtaining F^{-1} from ρ are parallelizable over v ∈ V_0, and the linear solve (5) can be parallelized over samples i. Direct solvers can be replaced with iterative solvers for particularly large graphs.
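Putting the steps of this section together, a minimal dense-matrix version of the linear-case pipeline could look like the following (a sketch under assumptions rather than the authors' code: helper names are hypothetical, the dense solve stands in for the sparse Cholesky factorization mentioned above, and the oversampling safeguards are omitted).

```python
import numpy as np

def propagate_inverse_cdfs(W, labeled, F_inv_labeled):
    """Wasserstein propagation in the linear (Prob(R)) case.

    W              : (n, n) symmetric weight matrix of the graph.
    labeled        : indices of the vertices in V_0.
    F_inv_labeled  : (len(labeled), m) inverse-CDF samples at the labeled vertices.
    Returns an (n, m) array of propagated inverse-CDF samples.
    """
    n = len(W)
    m = F_inv_labeled.shape[1]
    L = np.diag(W.sum(axis=1)) - W              # combinatorial Laplacian (= -Delta here)
    labeled = np.asarray(labeled)
    free = np.setdiff1d(np.arange(n), labeled)  # vertices in V \ V_0

    F_inv = np.zeros((n, m))
    F_inv[labeled] = F_inv_labeled

    # System (5): harmonic extension of each sample index i. All m right-hand
    # sides are solved at once; L restricted to the free vertices is symmetric
    # positive definite for a connected graph with nonempty V_0.
    rhs = -L[np.ix_(free, labeled)] @ F_inv_labeled
    F_inv[free] = np.linalg.solve(L[np.ix_(free, free)], rhs)
    return F_inv

def histogram_from_inverse_cdf(F_inv_row, bin_edges):
    """Invert a row of inverse-CDF samples back to a histogram by counting how
    many quantile samples fall in each bin (a discrete stand-in for integrating
    the recovered distribution over the bin widths)."""
    counts, _ = np.histogram(F_inv_row, bins=bin_edges)
    return counts / counts.sum()
```

For large graphs the dense solve would be replaced by the sparse Cholesky factorization or iterative solvers mentioned above.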
Boundary Value Problems   Figure 3 illustrates our algorithm on a less trivial graph G. To mimic a typical test case for classical Dirichlet problems, our graph is a mesh of the

Alternative Target Domain   Figure 5 shows an example in which the target is Prob(S¹), where S¹ is the unit circle, rather than Prob([0, 1]). We optimize E_D using the
Figure 6. We propagate histograms of temperatures collected over time to a map of the United States: (a) Average error at propagated sites
as a function of the number of nodes with labeled distributions; (b) means of the histograms at the propagated sites from a typical trial in
(a); (c) standard deviations at the propagated sites. Vertices with prescribed distributions are shown in blue and comprise ∼ 2% of V .
Figure 7. (a) Interpolating histograms of wind directions using the PDF and Wasserstein propagation methods, illustrated using the same
scheme as Figure 5; (b) entropy values from the same distributions.