
Wasserstein Propagation for Semi-Supervised Learning

Justin Solomon  justin.solomon@stanford.edu
Raif M. Rustamov  rustamov@stanford.edu
Leonidas Guibas  guibas@cs.stanford.edu
Department of Computer Science, Stanford University, 353 Serra Mall, Stanford, California 94305 USA

Adrian Butscher  adrian.butscher@gmail.com
Max Planck Center for Visual Computing and Communication, Campus E1 4, 66123 Saarbrücken, Germany

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

Probability distributions and histograms are natural representations for product ratings, traffic measurements, and other data considered in many machine learning applications. Thus, this paper introduces a technique for graph-based semi-supervised learning of histograms, derived from the theory of optimal transportation. Our method has several properties making it suitable for this application; in particular, its behavior can be characterized by the moments and shapes of the histograms at the labeled nodes. In addition, it can be used for histograms on non-standard domains like circles, revealing a strategy for manifold-valued semi-supervised learning. We also extend this technique to related problems such as smoothing distributions on graph nodes.

1. Introduction

Graph-based semi-supervised learning is an effective approach for learning problems involving a limited amount of labeled data (Singh et al., 2008). Methods in this class typically propagate labels from a subset of nodes of a graph to the rest of the nodes. Usually each node is associated with a real number, but in many applications labels are more naturally expressed as histograms or probability distributions. For instance, the traffic density at a given location can be seen as a histogram over the 24-hour cycle; these densities may be known only where a service has cameras installed but need to be propagated to the entire map. Product ratings, climatic measurements, and other data sources exhibit similar structure.

While methods for numerical labels, such as Belkin & Niyogi (2001); Zhu et al. (2003); Belkin et al. (2006); Zhou & Belkin (2011); Ji et al. (2012) (also see the survey by Zhu (2008) and references therein), can be applied bin-by-bin to propagate normalized frequency counts, this strategy does not model interactions between histogram bins. As a result, a fundamental aspect of this type of data is ignored, leading to artifacts even when propagating Gaussian distributions.

Among the first works directly addressing semi-supervised learning of probability distributions is Subramanya & Bilmes (2011), which propagates distributions representing class memberships. Their loss function, however, is based on Kullback-Leibler divergence, which cannot capture interactions between histogram bins. Talukdar & Crammer (2009) allow interactions between bins by essentially modifying the underlying graph to its tensor product with a prescribed bin interaction graph; this approach loses probabilistic structure and tends to oversmooth. Similar issues have been encountered in the mathematical literature (McCann, 1997; Agueh & Carlier, 2011) and in vision/graphics applications (Bonneel et al., 2011; Rabin et al., 2012) involving interpolating probability distributions. Their solutions attempt to find weighted barycenters of distributions, which is insufficient for propagating distributions along graphs.

The goal of our work is to provide an efficient and theoretically sound approach to graph-based semi-supervised learning of probability distributions. Our strategy uses the machinery of optimal transportation (Villani, 2003). Inspired by Solomon et al. (2013), we employ the two-Wasserstein distance between distributions to construct a regularizer measuring the "smoothness" of an assignment of a probability distribution to each graph node. The final assignment is produced by optimizing this energy while fitting the histogram predictions at labeled nodes.

Our technique has many notable properties. As certainty in the known distributions increases, it reduces to the method of label propagation via harmonic functions (Zhu et al., 2003).
Also, the moments and other characteristics of the propagated distributions are well-characterized by those of the labeled nodes at minima of our smoothness energy. Our approach does not restrict the class of the distributions provided at labeled nodes, allowing for bi-modality and other non-Gaussian properties. Finally, we prove that under an appropriate change of variables our objective can be minimized using a fast linear solve.

Overview. We first motivate the problem of propagating distributions along graphs and show why naïve techniques are ineffective (§2). Given this setup, we develop the Wasserstein propagation technique (§3) and discuss its theoretical properties (§3.1). We also show how it can be used to smooth distribution-valued maps from graphs (§3.2) and extend it to more general domains (§4). Finally, after providing algorithmic details (§5) we demonstrate our techniques on both synthetic (§6.1) and real-world (§6.2) data.

2. Preliminaries and Motivation

2.1. Label Propagation on Graphs

We consider a generalization of the problem of label propagation on a graph G = (V, E). Suppose a label function f is known on a subset of vertices V_0 ⊆ V, and we wish to extend f to the remainder V\V_0. The classical approach of Zhu et al. (2003) minimizes the Dirichlet energy

E_D[f] := Σ_{(v,w)∈E} ω_e (f_v − f_w)²

over the space of functions taking the prescribed values on V_0. Here ω_e is the weight associated to the edge e = (v, w). E_D is a measure of smoothness; therefore the minimizer matches the prescribed labels with minimal variation in between. Minimizing this quadratic objective is equivalent to solving ∆f = 0 on V\V_0 for an appropriate positive definite Laplacian matrix ∆ (Chung & Yau, 2000). Solutions of this system are well known to enjoy many regularity properties, making it a sound choice for smooth label propagation.

2.2. Propagating Probability Distributions

Suppose, however, that each vertex in V_0 is decorated with a probability distribution rather than a real number. That is, for each v ∈ V_0, we are given a probability distribution ρ_v ∈ Prob(R). Our goal now is to propagate these distributions to the remaining vertices, generating a distribution-valued map ρ : v ∈ V ↦ ρ_v ∈ Prob(R) associating a probability distribution with every vertex v ∈ V. It must satisfy ρ_v(x) ≥ 0 for all x ∈ R and ∫_R ρ_v(x) dx = 1. In §4 we consider the generalized case ρ : V → Prob(Γ) for alternative domains Γ including subsets of R^n; most of the statements we prove about maps into Prob(R) extend naturally to this setting with suitable technical adjustments.

In the applications we consider, such a propagation process should satisfy a number of properties:

• The spread of the propagated distributions should be related to the spread of the prescribed distributions.

• As the prescribed distributions in V_0 become peaked (concentrated around the mean), the propagated distributions should become peaked around the values obtained by propagating means of prescribed distributions via label propagation (e.g. Zhu et al. (2003)).

• The computational complexity of distribution propagation should be similar to that of scalar propagation.

Figure 1. Propagating prescribed probability distributions (in red) to interior nodes of a path graph identified with the interval [0, 1]: (a) naive approach; (b) statistical approach; (c) desirable output.

The simplest method for propagating probability distributions is to extend Zhu et al. (2003) naïvely. For each x ∈ R, we can view ρ_v(x) as a label at v ∈ V and solve the Dirichlet problem ∆ρ_v(x) = 0 with ρ_{v_0}(x) prescribed for all v_0 ∈ V_0. The resulting functions ρ_v(x) are distributions because the maximum principle guarantees ρ_v(x) ≥ 0 for all x and ∫_R ρ_v(x) dx = 1 for all v ∈ V, since these properties hold at the boundary (Chung et al., 2007).

It is easy to see, however, that this method has shortcomings. For instance, consider the case where G is a path graph representing the segment [0, 1] and the labeled vertices are the endpoints, V_0 = {0, 1}. In this case, the naïve approach results in the linear interpolation ρ_t(x) := (1 − t)ρ_0(x) + tρ_1(x) at all intermediate graph vertices for t ∈ (0, 1). The propagated distributions are thus bimodal as in Figure 1a. Given our criteria, however, we would prefer an interpolation result closer to Figure 1c, which causes the peak in the boundary data simply to slide from left to right without introducing variance as t changes.

An alternative strategy for propagating probability distributions over V given boundary data on V_0 is to use a statistical approach. We could repeatedly draw an independent sample from each distribution in {ρ_v : v ∈ V_0} and propagate the resulting scalars using a classical approach; binning the results of these repeated experiments provides a histogram-style distribution at each vertex in V. This strategy has similar shortcomings to the naïve approach above. For instance, in the path graph example, the interpolated distribution is trimodal as in Figure 1b, with nonzero probability at both endpoints and for some v in the interior of V.
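Both baselines reduce to the classical harmonic solve of §2.1 applied per bin or per sample. As a point of reference for the comparisons below, here is a minimal sketch of the bin-by-bin (PDF) propagation on the path graph of Figure 1; it is our own illustration in Python with dense linear algebra and hypothetical helper names, not code from the paper.

import numpy as np

def graph_laplacian(n, edges):
    """Unnormalized graph Laplacian L = D - W of an undirected, unit-weight graph."""
    W = np.zeros((n, n))
    for u, v in edges:
        W[u, v] = W[v, u] = 1.0
    return np.diag(W.sum(axis=1)) - W

def dirichlet_solve(L, labeled, values):
    """Harmonic interpolation: solve (L f)(v) = 0 at unlabeled vertices with f fixed
    on `labeled`. Columns of `values` are solved independently, so one call propagates
    a scalar label (one column) or every histogram bin at once (many columns)."""
    n = L.shape[0]
    free = np.setdiff1d(np.arange(n), labeled)
    f = np.zeros((n, values.shape[1]))
    f[labeled] = values
    f[free] = np.linalg.solve(L[np.ix_(free, free)], -L[np.ix_(free, labeled)] @ values)
    return f

# Naive bin-by-bin propagation on a path graph with labeled endpoints (Figure 1a):
n, m = 10, 50
L = graph_laplacian(n, [(i, i + 1) for i in range(n - 1)])
x = np.linspace(0.0, 1.0, m)
rho0 = np.exp(-0.5 * ((x - 0.2) / 0.05) ** 2); rho0 /= rho0.sum()   # peak near 0.2
rho1 = np.exp(-0.5 * ((x - 0.8) / 0.05) ** 2); rho1 /= rho1.sum()   # peak near 0.8
rho = dirichlet_solve(L, np.array([0, n - 1]), np.vstack([rho0, rho1]))
# Each interior row of `rho` is a convex combination of rho0 and rho1: the bimodal
# interpolation criticized above, rather than a single peak sliding from 0.2 to 0.8.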
Of course, the desiderata above are application-specific. One key assumption is that the spread of the distributions is preserved, which differs from existing approaches, which tend to blur the distributions. While this property is not intrinsically superior, in a way the experiments in §6 validate not only the algorithmic effectiveness of our technique but also this assumption about probabilistic data on graphs.

3. Wasserstein Propagation

Ad hoc methods for propagating distributions based on methods for scalar functions tend to have a number of drawbacks. Therefore, we tackle this problem using a technique designed explicitly for the probabilistic setting. To this end, we formulate the semi-supervised problem at hand as the optimization of a Dirichlet energy for distribution-valued maps generalizing the classical Dirichlet energy.

Similar to the construction in (Subramanya & Bilmes, 2011), we replace the square distance between scalar function values appearing in the classical Dirichlet energy (namely the quantity |f_v − f_w|²) with an appropriate distance between the distributions ρ_v and ρ_w. Rather than using the bin-by-bin KL divergence, however, we use the Wasserstein distance with quadratic cost between probability distributions with finite second moment on R. This distance is defined as

W_2(ρ_v, ρ_w) := inf_{π ∈ Π(ρ_v, ρ_w)} ( ∬_{R²} |x − y|² dπ(x, y) )^{1/2}

where Π(ρ_v, ρ_w) ⊆ Prob(R²) is the set of probability distributions π on R² satisfying the marginal constraints

∫ π(x, y) dx = ρ_w(y)   and   ∫ π(x, y) dy = ρ_v(x) .

The Wasserstein distance is a well-known distance metric for probability distributions, sometimes called the quadratic Earth Mover's Distance, and is studied in the field of optimal transportation. It measures the optimal cost of transporting one distribution to another, given that the cost of transporting a unit amount of mass from x to y is |x − y|². W_2(ρ_v, ρ_w) takes into account not only the values of ρ_v and ρ_w but also the ground distance in the sample space R. It has already shown promise for search and clustering techniques (Irpino et al., 2011; Applegate et al., 2011) and interpolation problems in graphics and vision (Bonneel et al., 2011).

With these ideas in place, we define a Dirichlet energy for a distribution-valued map from a graph into Prob(R) by

E_D[ρ] := Σ_{(v,w)∈E} W_2²(ρ_v, ρ_w) ,   (1)

along with the notion of Wasserstein propagation of distribution-valued maps given prescribed boundary data.

WASSERSTEIN PROPAGATION
Minimize E_D[ρ] in the space of distribution-valued maps with prescribed distributions at all v ∈ V_0.

3.1. Theoretical Properties

Solutions of the Wasserstein propagation problem satisfy many desirable properties that we will establish below. Before proceeding, however, we recall a fact about the Wasserstein distance. Let ρ ∈ Prob(R) be a probability distribution. Then its cumulative distribution function (CDF) is given by F(x) := ∫_{−∞}^{x} ρ(y) dy, and the generalized inverse of its CDF is given by F^{−1}(s) := inf{x ∈ R : F(x) > s}. Then the following result holds.

Proposition 1 (Villani (2003), Theorem 2.18). Let ρ_0, ρ_1 ∈ Prob(R) with CDFs F_0, F_1. Then

W_2²(ρ_0, ρ_1) = ∫_0^1 (F_1^{−1}(s) − F_0^{−1}(s))² ds .   (2)

By applying (2) to the minimization problem (1), we obtain a linear strategy for our propagation problem.

Proposition 2. Wasserstein propagation can be characterized in the following way. For each v ∈ V_0 let F_v be the CDF of the distribution ρ_v. Now suppose that for each s ∈ [0, 1] we determine g_s : V → R as the solution of the classical Dirichlet problem

∆g_s(v) = 0   ∀ v ∈ V \ V_0
g_s(v) = F_v^{−1}(s)   ∀ v ∈ V_0 .   (3)

Then for each v, the function s ↦ g_s(v) is the inverse CDF of a probability distribution ρ_v. Moreover, the distribution-valued map v ↦ ρ_v minimizes the Dirichlet energy (1).

Proof. Let X be the set of functions g : V × [0, 1] → R satisfying the constraints g_s(v) = F_v^{−1}(s) for all s ∈ [0, 1] and all v ∈ V_0. Consider the minimization problem

min_{g∈X} Ê_D(g) := Σ_{(u,v)∈E} ∫_0^1 (g_s(u) − g_s(v))² ds .

The solution of this optimization for each s is exactly a solution of the classical Dirichlet problem (3) on G. Moreover, the maximum principle implies that the relation g_s(v) ≤ g_{s′}(v) whenever s < s′, which holds by definition for all v ∈ V_0, extends to all v ∈ V (Chung et al., 2007). Hence g_s(v) can be interpreted as an inverse CDF for each v ∈ V, from which we can define a distribution-valued map ρ : v ↦ ρ_v. Since Ê_D takes on its minimum value in the subset of X consisting of inverse CDFs, and Ê_D coincides with E_D on this set, ρ is a solution of the Wasserstein propagation problem. ∎
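On R, Proposition 1 makes W_2 essentially free to evaluate numerically: sample both inverse CDFs on a grid of s values and average the squared differences. The sketch below (our own helper names; histograms assumed to live on shared, evenly spaced bin centers in [0, 1]) approximates (2) with a midpoint rule.

import numpy as np

def inverse_cdf(hist, bin_centers, s):
    """Generalized inverse CDF F^{-1}(s) = inf{x : F(x) > s} of a histogram
    supported on `bin_centers`, evaluated at the values in `s`."""
    F = np.cumsum(hist) / hist.sum()
    idx = np.searchsorted(F, s, side="right")          # first bin with F > s
    return bin_centers[np.minimum(idx, len(bin_centers) - 1)]

def w2_squared(hist_a, hist_b, bin_centers, num_samples=1000):
    """Approximate W_2^2(hist_a, hist_b) via equation (2)."""
    s = (np.arange(num_samples) + 0.5) / num_samples   # midpoint rule on [0, 1]
    qa = inverse_cdf(hist_a, bin_centers, s)
    qb = inverse_cdf(hist_b, bin_centers, s)
    return np.mean((qa - qb) ** 2)

# Two spikes a distance 0.5 apart: all mass moves 0.5, so W_2^2 should be 0.25.
centers = np.linspace(0.0, 1.0, 101)
a = np.zeros(101); a[25] = 1.0
b = np.zeros(101); b[75] = 1.0
print(w2_squared(a, b, centers))                       # ~0.25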
Distribution-valued maps ρ : V → Prob(R) propagated by optimizing (1) satisfy many analogs of functions extended using the classical Dirichlet problem. Two results of this kind concern the mean m(v) and the variance σ²(v) of the distributions ρ_v as functions of v ∈ V. These are defined as

m(v) := ∫_{−∞}^{∞} x ρ_v(x) dx
σ²(v) := ∫_{−∞}^{∞} (x − m(v))² ρ_v(x) dx .

Proposition 3. Suppose the distribution-valued map ρ : V → Prob(R) is obtained using Wasserstein propagation. Then for all v ∈ V the following estimates hold.

• inf_{v_0∈V_0} m(v_0) ≤ m(v) ≤ sup_{v_0∈V_0} m(v_0).

• 0 ≤ σ(v) ≤ sup_{v_0∈V_0} σ(v_0).

Proof. Both estimates can be derived from the following formula. Let ρ ∈ Prob(R) and let φ : R → R be any integrable function. If we apply the change of variables s = F(x), where F is the CDF of ρ, in the integral defining the expectation value of φ with respect to ρ, we get

∫_{−∞}^{∞} φ(x) ρ(x) dx = ∫_0^1 φ(F^{−1}(s)) ds .

Thus m(v) = ∫_0^1 F_v^{−1}(s) ds and σ²(v) = ∫_0^1 (F_v^{−1}(s) − m(v))² ds, where F_v is the CDF of ρ_v for each v ∈ V. Assume ρ minimizes (1) with fixed boundary constraints on V_0. By Proposition 2, we then have ∆F_v^{−1} = 0 for all v ∈ V. Therefore ∆m(v) = ∫_0^1 ∆F_v^{−1}(s) ds = 0, so m is a harmonic function on V. The estimates for m follow by the maximum principle for harmonic functions. Also,

∆[σ²(v)] = ∫_0^1 ∆[(F_v^{−1}(s) − m(v))²] ds
         = Σ_{(v,v′)∈E} ∫_0^1 (a(v, s) − a(v′, s))² ds ≥ 0,   where a(v, s) := F_v^{−1}(s) − m(v),

since ∆F_v^{−1}(s) = ∆m(v) = 0. Thus σ² is a subharmonic function and the upper bound for σ² follows by the maximum principle for subharmonic functions. ∎

Finally, we check that if we encode a classical interpolation problem using Dirac delta distributions, we recover the classical solution. The essence of this result is that if the boundary data for Wasserstein propagation has zero variance, then the solution must also have zero variance.

Proposition 4. Suppose that there exists u : V_0 → R such that ρ_v(x) = δ(x − u(v)) for all v ∈ V_0. Then, the solutions of the classical Dirichlet problem and the Wasserstein propagation problem coincide in the following way. Suppose that f : V → R satisfies the classical Dirichlet problem with boundary data u. Then ρ_v(x) := δ(x − f(v)) minimizes (1) subject to the fixed boundary constraints.

Proof. The boundary data for ρ given here yields the boundary data g_s(v) = u(v) for all v ∈ V_0 and s ∈ [0, 1) in the Dirichlet problem (3). The solution of this Dirichlet problem is thus also constant in s, let us say g_s(v) = f(v) for all s ∈ [0, 1) and v ∈ V. The only distributions whose inverse CDFs are of this form are δ-distributions; hence ρ_v(x) = δ(x − f(v)) as desired. ∎

3.2. Application to Smoothing

Using the connection to the classical Dirichlet problem in Proposition 2 we can extend our treatment to other differential equations. There is a large space of differential equations that have been adapted to graphs via the discrete Laplacian ∆; here we focus on the heat equation, considered e.g. in Chung et al. (2007).

The heat equation for scalar functions is applied to smoothing problems; for example, in R^n solving the heat equation is equivalent to Gaussian convolution. Just as the Dirichlet equation on F^{−1} is equivalent to Wasserstein propagation, heat diffusion on F^{−1} is equivalent to gradient flow of the energy E_D in (1), providing a straightforward way to understand and implement such a diffusive process.

Proposition 5. Let ρ : V → Prob(R) be a distribution-valued map and let F_v be the CDF of ρ_v for each v ∈ V. Then these two procedures are equivalent:

• Mass-preserving flow of ρ in the direction of steepest descent of the Dirichlet energy.

• Heat flow of the inverse CDFs.

Proof. A mass-preserving flow of ρ is a family of distribution-valued maps ρ_ε : V → Prob(R) with ε ∈ (−ε_0, ε_0) that satisfies the equations

∂ρ_{v,ε}(t)/∂ε + ∂/∂t [Y_v(ε, t) ρ_{v,ε}(t)] = 0   ∀ v ∈ V
ρ_{v,0}(t) = ρ_v(t)

where Y_v : (−ε_0, ε_0) × R → R is an arbitrary function that governs the flow. By applying the change of variables t = F_{v,ε}^{−1}(s) using the inverse CDFs of the ρ_{v,ε}, we find that this flow is equivalent to the equations

∂F_{v,ε}^{−1}(s)/∂ε = Y_v(ε, F_{v,ε}^{−1}(s))   ∀ v ∈ V
F_{v,0}^{−1}(s) = F_v^{−1}(s) .

A short calculation starting from (1) now leads to the derivative of the Dirichlet energy under such a flow, namely

dE_D(ρ_ε)/dε = −2 Σ_{v∈V} ∫_0^1 ∆(F_{v,ε}^{−1}) · Y_v(ε, F_{v,ε}^{−1}(s)) ds .

Thus, steepest descent for the Dirichlet energy is achieved by choosing Y_v(ε, F_{v,ε}^{−1}(s)) := ∆(F_{v,ε}^{−1}(s)) for each v, ε, s. As a result, the equation for the evolution of F_{v,ε}^{−1} becomes

∂F_{v,ε}^{−1}(s)/∂ε = ∆(F_{v,ε}^{−1}(s))   ∀ v ∈ V
F_{v,0}^{−1}(s) = F_v^{−1}(s) ,

which is exactly heat flow of F_{v,ε}^{−1}. ∎
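Proposition 5 gives a direct recipe for implementing Wasserstein diffusion: apply heat flow to the inverse CDFs, one s-sample at a time. The following is a minimal sketch of our own, anticipating the implicit time stepping described in §5; we write the Laplacian as L = D − W (positive semidefinite), so one implicit Euler step is (I + tL)^{-1}, which plays the role of the paper's (I − t∆)^{-1} under the opposite sign convention. The function name and array layout are our own choices.

import numpy as np

def wasserstein_diffuse(L, Finv, t, num_steps):
    """Heat flow of inverse CDFs (Proposition 5).

    L    -- (n, n) graph Laplacian D - W
    Finv -- (n, m) array; row v holds samples of F_v^{-1} at m evenly spaced s values
    t    -- implicit Euler time step
    """
    A = np.eye(L.shape[0]) + t * L
    for _ in range(num_steps):
        Finv = np.linalg.solve(A, Finv)   # one implicit heat step, all s columns at once
    return Finv

Because (I + tL)^{-1} is entrywise nonnegative (the inverse of an M-matrix), the rows stay nondecreasing in s, so they remain valid inverse CDFs and can be converted back to distributions exactly as in the propagation algorithm of §5.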

4. Generalization

Our preceding discussion involves distribution-valued maps into Prob(R), but in a more general setting we might wish to replace Prob(R) with Prob(Γ) for an alternative domain Γ carrying a distance metric d. Our original formulation of Wasserstein propagation easily handles such an extension by replacing |x − y|² with d(x, y)² in the definition of W_2. Furthermore, although proofs in this case are considerably more involved, some key properties proved above for Prob(R) extend naturally.

In this case, we can no longer rely on the computational benefits of Propositions 2 and 5 but can solve the propagation problem directly. If Γ is discrete, then Wasserstein distances between ρ_v's can be computed using a linear program. Suppose we represent two histograms as {a_1, . . ., a_m} and {b_1, . . ., b_m} with a_i, b_i ≥ 0 ∀i and Σ_i a_i = Σ_i b_i = 1. Then, the definition of W_2 yields the optimization:

W_2²({a_i}, {b_j}) = min Σ_{ij} d_{ij}² x_{ij}   (4)
   s.t.  Σ_j x_{ij} = a_i ∀i,   Σ_i x_{ij} = b_j ∀j,   x_{ij} ≥ 0 ∀i, j.

Here d_{ij} is the distance from bin i to bin j, which need not be proportional to |i − j|.

From this viewpoint, the energy E_D from (1) remains convex in ρ and can be optimized using a linear program simply by summing terms of the form (4) above:

min_{ρ,x} Σ_{e∈E} Σ_{ij} d_{ij}² x_{ij}^{(e)}
   s.t.  Σ_j x_{ij}^{(e)} = ρ_{vi}   ∀ e = (v, w) ∈ E, i ∈ S
         Σ_i x_{ij}^{(e)} = ρ_{wj}   ∀ e = (v, w) ∈ E, j ∈ S
         Σ_i ρ_{vi} = 1   ∀ v ∈ V,   ρ_{vi} fixed   ∀ v ∈ V_0
         ρ_{vi} ≥ 0   ∀ v ∈ V, i ∈ S,   x_{ij}^{(e)} ≥ 0   ∀ i, j ∈ S

where S = {1, . . ., m}.

5. Algorithm Details

We handle the general case from §4 by optimizing the linear programming formulation directly. Given the size of these linear programs, we use large-scale barrier method solvers.

The characterizations in Propositions 2 and 5, however, suggest a straightforward discretization and accompanying set of optimization algorithms in the linear case. In fact, we can recover propagated distributions by inverting the graph Laplacian ∆ via a sparse linear solve, leading to near-real-time results for moderately-sized graphs G.

For a given graph G = (V, E) and subset V_0 ⊆ V, we discretize the domain [0, 1] of F_v^{−1} for each v using a set of evenly-spaced samples s_0 = 0, s_1, . . ., s_m = 1. This representation supports any ρ_v provided it is possible to sample the inverse CDF from Proposition 1 at each s_i. In particular, when the underlying distributions are histograms, we model ρ_v using δ functions at evenly-spaced bin centers, which have piecewise constant CDFs; we model continuous ρ_v using piecewise linear interpolation. Regardless, in the end we obtain a non-decreasing set of samples (F^{−1})_v^1, . . ., (F^{−1})_v^m with (F^{−1})_v^1 = 0 and (F^{−1})_v^m = 1.

Now that we have sampled F_v^{−1} for each v ∈ V_0, we can propagate to the remainder V\V_0. For each i ∈ {1, . . ., m}, we solve the system from (3):

∆g(v) = 0   ∀ v ∈ V \ V_0
g(v) = (F^{−1})_v^i   ∀ v ∈ V_0 .   (5)

In the diffusion case, we replace this system with implicit time stepping for the heat equation, iteratively applying (I − t∆)^{−1} to g for diffusion time step t. In either case, the linear solve is sparse, symmetric, and positive definite; we apply Cholesky factorization to solve the systems directly.

This process propagates F^{−1} to the entire graph, yielding samples (F^{−1})_v^i for all v ∈ V. We invert once again to yield samples ρ_v^i for all v ∈ V. Of course, each inversion incurs some potential for sampling and discretization error, but in practice we are able to oversample sufficiently to overcome most potential issues. When the inputs ρ_v are discrete histograms, we return to this discrete representation by integrating the resulting ρ_v ∈ Prob([0, 1]) over the width of the bin about the center defined above.
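As a concrete illustration of the general case, the pairwise building block (4) can be handed to any off-the-shelf LP solver; the full propagation LP of §4 couples one such transport plan per edge, which we omit here. The following sketch is our own code, using scipy.optimize.linprog, and computes W_2² between two histograms over an arbitrary ground distance d_ij, for example geodesic distance on the unit circle.

import numpy as np
from scipy.optimize import linprog

def w2_squared_lp(a, b, D):
    """Solve the optimal transport LP (4): min sum_ij d_ij^2 x_ij subject to the
    marginal constraints sum_j x_ij = a_i, sum_i x_ij = b_j, and x_ij >= 0."""
    m = len(a)
    cost = (D ** 2).ravel()                  # x is flattened so that x[i*m + j] = x_ij
    A_eq = np.zeros((2 * m, m * m))
    for k in range(m):
        A_eq[k, k * m:(k + 1) * m] = 1.0     # row marginal:    sum_j x_kj = a_k
        A_eq[m + k, k::m] = 1.0              # column marginal: sum_i x_ik = b_k
    res = linprog(cost, A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun

# Ground distance on the unit circle S^1: d_ij is the geodesic (arc-length) distance.
m = 36
theta = 2 * np.pi * np.arange(m) / m
diff = np.abs(theta[:, None] - theta[None, :])
D = np.minimum(diff, 2 * np.pi - diff)
a = np.zeros(m); a[0] = 1.0                  # spike at angle 0
b = np.zeros(m); b[18] = 1.0                 # spike at the antipodal angle pi
print(w2_squared_lp(a, b, D))                # ~pi^2, i.e. ~9.87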
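For the linear case, the pipeline of §5 (sample the inverse CDFs at the labeled vertices, solve (5) once per sample index while reusing a single sparse factorization, and invert back to histograms by binning) can be sketched as follows. This is our own illustration; the function names, the use of scipy, and the simple binning-based inversion are our choices, not the authors' implementation.

import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import factorized

def wasserstein_propagate(n, edges, labeled, labeled_hists, bin_centers, m_samples=200):
    """Wasserstein propagation of histograms on a graph with n vertices.

    edges         -- list of (u, v) pairs (unit weights for simplicity)
    labeled       -- array of vertex indices in V_0, ordered as in labeled_hists
    labeled_hists -- (|V_0|, B) histograms prescribed at the labeled vertices
    bin_centers   -- (B,) evenly spaced bin centers in [0, 1]
    Returns an (n, B) array of propagated histograms.
    """
    # Sparse graph Laplacian L = D - W.
    r = [u for u, v in edges] + [v for u, v in edges]
    c = [v for u, v in edges] + [u for u, v in edges]
    W = csr_matrix((np.ones(len(r)), (r, c)), shape=(n, n))
    L = (diags(np.asarray(W.sum(axis=1)).ravel()) - W).tocsr()

    # 1. Sample the inverse CDFs of the labeled histograms at evenly spaced s values.
    s = (np.arange(m_samples) + 0.5) / m_samples
    F = np.cumsum(labeled_hists, axis=1) / labeled_hists.sum(axis=1, keepdims=True)
    Finv0 = np.array([bin_centers[np.minimum(np.searchsorted(Fv, s, side="right"),
                                             len(bin_centers) - 1)] for Fv in F])

    # 2. Solve the Dirichlet problem (5) for every sample index, reusing one sparse
    #    factorization of the free-free block of L.
    free = np.setdiff1d(np.arange(n), labeled)
    solve = factorized(L[free][:, free].tocsc())
    rhs = -np.asarray(L[free][:, labeled] @ Finv0)        # shape (|free|, m_samples)
    Finv = np.zeros((n, m_samples))
    Finv[labeled] = Finv0
    Finv[free] = np.column_stack([solve(rhs[:, i]) for i in range(m_samples)])

    # 3. Invert: bin the samples of each propagated inverse CDF back into a histogram.
    B = len(bin_centers)
    bnds = np.linspace(0.0, 1.0, B + 1)
    hists = np.array([np.histogram(Finv[v], bins=bnds)[0] for v in range(n)], dtype=float)
    return hists / hists.sum(axis=1, keepdims=True)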

This algorithm is efficient even on large graphs and is easily parallelizable. For instance, the initial sampling steps for obtaining F^{−1} from ρ are parallelizable over v ∈ V_0, and the linear solve (5) can be parallelized over samples i. Direct solvers can be replaced with iterative solvers for particularly large graphs G; regardless, the structure of such a solve is well-understood and studied, e.g. in Krishnan et al. (2013).

6. Experiments

We run our scheme through a number of tests demonstrating its strengths and weaknesses compared to other potential methods for propagation. We compare Wasserstein propagation with the strategy of propagating probability distribution functions (PDFs) directly, as described in §2.2.

Figure 2. Comparison of propagation strategies on a linear graph (coarse version on left); each horizontal slice represents a vertex v ∈ V, and the colors from left to right in a slice show ρ_v. (Subramanya & Bilmes, 2011) (KL) is shown only in one example because it has qualitatively similar behavior to the PDF strategy.

6.1. Synthetic Tests

We begin by considering the behavior of our technique on synthetic data designed to illustrate its various properties.

One-Dimensional Examples. Figure 2 shows "displacement interpolation" properties inherited by our propagation technique from the theory of optimal transportation. The underlying graph is a line as in Figure 1, along the vertical axis. Horizontally, each image is colored by values in ρ_v. The bottom and top vertices v_0 and v_1 have fixed distributions ρ_{v_0} and ρ_{v_1}, and the remaining vertices receive ρ_v via one of two propagation techniques. The left of each pair propagates distributions by solving a classical Dirichlet problem independently for each bin of the probability distribution function (PDF) ρ_v, whereas the right of each pair propagates inverse CDFs using our method in §5.

By examining the propagation behavior from the bottom to the top of this figure, it is easy to see how the naïve PDF method varies from Wasserstein propagation. For instance, in the leftmost example both ρ_{v_0} and ρ_{v_1} are unimodal, yet when propagating PDFs all the intermediate vertices have bimodal distributions; furthermore, no relationship is determined between the two peaks. Contrastingly, our technique identifies the modes of ρ_{v_0} and ρ_{v_1}, linearly moving the peak from one side to the other.

Figure 3. PDF (b) and Wasserstein (c) propagation on a meshed circle with prescribed boundary distributions (a). The underlying graph is shown in grey, and probability distributions at vertices v ∈ V are shown as vertical bars colored by the density ρ_v; we invert the color scheme of Figures 2 and 4 to improve contrast. Propagated distributions in (b) and (c) are computed for all vertices but for clarity are shown at representative slices of the circle.

Boundary Value Problems. Figure 3 illustrates our algorithm on a less trivial graph G. To mimic a typical test case for classical Dirichlet problems, our graph is a mesh of the unit circle, and we propagate ρ_v from fixed distributions on the boundary. Unlike the classical case, however, our prescribed boundary distributions ρ_v are multimodal. Once again, Wasserstein propagation recovers a smoothly-varying set of distributions whose peaks behave like solutions to the classical Dirichlet problem. Propagating PDFs rather than inverse CDFs yields somewhat similar modes, but with much higher entropy and variance, especially at the center of the circle.

Figure 4. Comparison of PDF diffusion (a) and Wasserstein diffusion (b); in both cases the leftmost distribution comprises the initial conditions, and several time steps of diffusion are shown left-to-right. The underlying graph G is the circle on the left.

Diffusion. Figure 4 illustrates the behavior of Wasserstein diffusion compared with simply diffusing distribution values directly. When PDF values are diffused directly, as time t increases the distributions simply become more and more smooth until they are uniform not only along G but also as distributions on Prob([0, 1]). Contrastingly, Wasserstein diffusion preserves the uncertainty from the initial distributions but does not increase it as time progresses.
Alternative Target Domain. Figure 5 shows an example in which the target is Prob(S^1), where S^1 is the unit circle, rather than Prob([0, 1]). We optimize E_D using the linear program in §4 rather than the linear algorithm for Prob([0, 1]). Conclusions from this example are similar to those from Figure 3: Wasserstein propagation identifies peaks from different prescribed boundary distributions without introducing variance, while PDF propagation exhibits much higher variance in the interpolated distributions and does not "move" peaks from one location to another.

Figure 5. Interpolation of distributions on S^1 via (a) PDF propagation and (b) Wasserstein propagation; in these figures the vertices with valence 1 have prescribed distributions ρ_v and the remaining vertices have distributions from propagation.

6.2. Real-World Data

We now evaluate our techniques on real-world input. To evaluate the quality of our approach relative to ground truth, we will use the one-Wasserstein distance, or Earth Mover's Distance (Rubner et al., 2000), formulated by removing the square in the formula for W_2². We use this distance, given on Prob(R) by the L1 distance between (non-inverted) CDFs, because it does not favor the W_2 distance used in Wasserstein propagation while taking into account the ground distances. We consider weather station coordinates as defining a point cloud on the plane and compute the point cloud Laplacian using the approach of Coifman & Lafon (2006).

Temperature Data. Figure 6 illustrates the results of a series of experiments on weather data on a map of the United States.¹ Here, we have |V| = 1113 sites each collecting daily temperature measurements, which we classify into 100 bins at each vertex. In each experiment, we choose a subset V_0 ⊆ V of vertices, propagate the histograms from these vertices to the remainder of V, and measure the error between the propagated and ground-truth histograms.

Figure 6a shows quantitative results of this experiment. Here we show the average histogram error per vertex as a function of the percent of nodes in V with fixed labels; the fixed vertices are chosen randomly, and errors are averaged over 20 trials for each percentage. The Wasserstein strategy consistently outperforms naïve PDF interpolation with respect to our error metric and approaches relatively small error with as few as 5% of the labels fixed.

Figures 6b and 6c show results for a single trial. We color the vertices v ∈ V by the mean (b) and standard deviation (c) of ρ_v from PDF and Wasserstein propagation. Both yield similar mean temperatures on V\V_0, which agree with the means of the ground truth data. The standard deviations, however, better illustrate differences between the approaches. In particular, the standard deviations of the Wasserstein-propagated distributions approximately follow those of the ground truth histograms, whereas the PDF strategy yields high standard deviations nearly everywhere on the map due to undesirable smoothing effects.

Figure 6. We propagate histograms of temperatures collected over time to a map of the United States: (a) average error at propagated sites as a function of the number of nodes with labeled distributions; (b) means of the histograms at the propagated sites from a typical trial in (a); (c) standard deviations at the propagated sites. Vertices with prescribed distributions are shown in blue and comprise ∼2% of V.

¹ National Climatic Data Center
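The error figures above use the one-Wasserstein distance introduced at the start of §6.2. For histograms on a shared, evenly spaced grid in R it is simply an L1 difference of CDFs; a minimal version (our own helper, with the bin width supplying the ground-distance unit) is:

import numpy as np

def w1_error(hist_pred, hist_true, bin_width):
    """One-Wasserstein (Earth Mover's) distance between two histograms on the same
    evenly spaced grid: the L1 distance between their (non-inverted) CDFs."""
    F_pred = np.cumsum(hist_pred) / hist_pred.sum()
    F_true = np.cumsum(hist_true) / hist_true.sum()
    return bin_width * np.abs(F_pred - F_true).sum()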
Wind Directions. We apply the general formulation in §4 to propagating distributions on the unit circle S^1 by considering histograms of wind directions collected over time by nodes on the ocean outside of Australia.²

In this experiment, we keep approximately 4% of the data points and propagate to the remaining vertices. Both the PDF and Wasserstein propagation strategies score similarly with respect to our error metric; in the experiment shown, Wasserstein propagation exhibits 6.6% average error per node and PDF propagation exhibits 6.1% average error per node. Propagation results are illustrated in Figure 7a.

The nature of the error from the two strategies, however, is quite different. In particular, Figure 7b shows the same map colored by the entropy of the propagated distributions. PDF propagation exhibits high entropy away from the prescribed vertices, reflecting the fact that the propagated distributions at these points approach uniformity. Wasserstein propagation, on the other hand, has a more similar pattern of entropy to that of the ground truth data, reflecting structure like that demonstrated in Proposition 3.

Figure 7. (a) Interpolating histograms of wind directions using the PDF and Wasserstein propagation methods, illustrated using the same scheme as Figure 5; (b) entropy values from the same distributions. (Panels in each group, left to right: ground truth, PDF, Wasserstein.)

² WindSat Remote Sensing Systems

Non-Euclidean Interpolation. Proposition 4 suggests an application outside histogram propagation. In particular, if the vertices of V_0 have prescribed distributions that are δ functions encoding individual points as mapping targets, all propagated distributions also will be δ functions. Thus, one strategy for interpolation is to encode the problem probabilistically using δ distributions, interpolate using Wasserstein propagation, and then extract peaks of the propagated distributions. Experimentally we find that optima of the linear program in §4 with peaked prescribed distributions yield peaked distributions ρ_v for all v ∈ V even when the target is not Prob(R); we leave a proof for future work.

In Figure 8, we apply this strategy to interpolating angles on S^1 from a single day of wind data on a map of Europe.³ Classical Dirichlet interpolation fails to capture the identification of angles 0 and 2π. Contrastingly, if we encode the boundary conditions as peaked distributions on Prob(S^1), we can interpolate using Wasserstein propagation without losing structure. The resulting distributions are peaked about a single maximum, so we extract a direction field as the mode of each ρ_v. Despite noise in the dataset we achieve 15% error rather than the 19% error obtained by classical Dirichlet interpolation of angles disregarding periodicity.

Figure 8. Learning wind directions on the unit circle S^1. (Panels: ground truth; PDF, 19% error; Wasserstein, 15% error.)

³ Carbon Dioxide Information Analysis Center

7. Conclusion

It is easy to formulate strategies for histogram propagation by applying methods for propagating scalar functions bin-by-bin. Here, however, we have shown that propagating inverse CDFs instead has deep connections to the theory of optimal transportation and provides superior results, making it a strong yet still efficient choice. This basic connection gives our method theoretical and practical soundness that is difficult to guarantee otherwise.

While our algorithms show promise as practical techniques, we leave many avenues for future study. Most prominently, the generalization in §4 can be applied to many problems, such as the surface mapping problem in Solomon et al. (2013). Such an optimization, however, has O(m²|E|) variables, which is intractable for dense or large graphs. An open theoretical problem might be to reduce the number of variables asymptotically. Some simplifications may also be afforded using approximations like (Pele & Werman, 2009), which simplify the form of d_{ij} at the cost of complicating theoretical analysis and understanding of optimal distributions ρ_v. Alternatively, work such as (Rabin et al., 2011) suggests the potential to formulate efficient algorithms when replacing Prob([0, 1]) with Prob(S^1) or other domains with special structure.

In the end, our proposed algorithms are as lightweight as less principled alternatives, while exhibiting practical performance, theoretical soundness, and the possibility of extension into several alternative domains.

Acknowledgments. The authors gratefully acknowledge the support of NSF grants CCF 1161480 and DMS 1228304, AFOSR grant FA9550-12-1-0372, a Google research award, the Max Planck Center for Visual Computing and Communications, the National Defense Science and Engineering Graduate Fellowship, the Hertz Foundation Fellowship, and the NSF GRF program.
References

Agueh, M. and Carlier, G. Barycenters in the Wasserstein space. J. Math. Anal., 43(2):904–924, 2011.

Applegate, David, Dasu, Tamraparni, Krishnan, Shankar, and Urbanek, Simon. Unsupervised clustering of multidimensional distributions using earth mover distance. In KDD, pp. 636–644, 2011.

Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, pp. 585–591, 2001.

Belkin, Mikhail, Niyogi, Partha, and Sindhwani, Vikas. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7:2399–2434, December 2006.

Bonneel, Nicolas, van de Panne, Michiel, Paris, Sylvain, and Heidrich, Wolfgang. Displacement interpolation using Lagrangian mass transport. Trans. Graph., 30(6):158:1–158:12, December 2011.

Chung, Fan and Yau, S.-T. Discrete Green's functions. J. Combinatorial Theory, 91(1–2):191–214, 2000.

Chung, Soon-Yeong, Chung, Yun-Sung, and Kim, Jong-Ho. Diffusion and elastic equations on networks. Pub. RIMS, 43(3):699–726, 2007.

Coifman, Ronald R. and Lafon, Stéphane. Diffusion maps. Applied and Computational Harmonic Anal., 21(1):5–30, 2006.

Irpino, Antonio, Verde, Rosanna, and de A.T. de Carvalho, Francisco. Dynamic clustering of histogram data based on adaptive squared Wasserstein distances. CoRR, abs/1110.1462, 2011.

Ji, Ming, Yang, Tianbao, Lin, Binbin, Jin, Rong, and Han, Jiawei. A simple algorithm for semi-supervised learning with improved generalization error bound. In ICML, 2012.

Krishnan, Dilip, Fattal, Raanan, and Szeliski, Richard. Efficient preconditioning of Laplacian matrices for computer graphics. Trans. Graph., 32(4):142:1–142:15, July 2013.

McCann, Robert J. A convexity principle for interacting gases. Advances in Math., 128(1):153–179, 1997.

Pele, O. and Werman, M. Fast and robust earth mover's distances. In ICCV, pp. 460–467, 2009.

Rabin, Julien, Delon, Julie, and Gousseau, Yann. Transportation distances on the circle. J. Math. Imaging Vis., 41(1–2):147–167, September 2011.

Rabin, Julien, Peyre, Gabriel, Delon, Julie, and Bernot, Marc. Wasserstein barycenter and its application to texture mixing. Volume 6667 of LNCS, pp. 435–446. Springer, 2012.

Rubner, Yossi, Tomasi, Carlo, and Guibas, Leonidas. The earth mover's distance as a metric for image retrieval. IJCV, 40(2):99–121, November 2000.

Singh, Aarti, Nowak, Robert D., and Zhu, Xiaojin. Unlabeled data: Now it helps, now it doesn't. In NIPS, pp. 1513–1520, 2008.

Solomon, Justin, Guibas, Leonidas, and Butscher, Adrian. Dirichlet energy for analysis and synthesis of soft maps. Comp. Graph. Forum, 32(5):197–206, 2013.

Subramanya, Amarnag and Bilmes, Jeff. Semi-supervised learning with measure propagation. JMLR, 12:3311–3370, 2011.

Talukdar, Partha Pratim and Crammer, Koby. New regularized algorithms for transductive learning. ECML-PKDD, 5782:442–457, 2009.

Villani, Cédric. Topics in Optimal Transportation. Graduate Studies in Mathematics. AMS, 2003.

Zhou, Xueyuan and Belkin, Mikhail. Semi-supervised learning by higher order regularization. ICML, 15:892–900, 2011.

Zhu, Xiaojin. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2008.

Zhu, Xiaojin, Ghahramani, Zoubin, and Lafferty, John D. Semi-supervised learning using Gaussian fields and harmonic functions. pp. 912–919, 2003.
