VARIATIONAL BAYESIAN PHYLOGENETIC INFERENCE
ABSTRACT
Bayesian phylogenetic inference is currently done via Markov chain Monte Carlo
with simple mechanisms for proposing new states, which hinders exploration effi-
ciency and often requires long runs to deliver accurate posterior estimates. In this
paper we present an alternative approach: a variational framework for Bayesian
phylogenetic analysis. We approximate the true posterior using an expressive
graphical model for tree distributions, called a subsplit Bayesian network, together
with appropriate branch length distributions. We train the variational approxima-
tion via stochastic gradient ascent and adopt multi-sample based gradient estima-
tors for different latent variables separately to handle the composite latent space
of phylogenetic models. We show that our structured variational approximations
are flexible enough to provide comparable posterior estimation to MCMC, while
requiring less computation due to a more efficient tree exploration mechanism en-
abled by variational inference. Moreover, the variational approximations can be
readily used for further statistical analysis such as marginal likelihood estimation
for model comparison via importance sampling. Experiments on both synthetic
data and real data Bayesian phylogenetic inference problems demonstrate the ef-
fectiveness and efficiency of our methods.
1 INTRODUCTION
Bayesian phylogenetic inference is an essential tool in modern evolutionary biology. Given an align-
ment of nucleotide or amino acid sequences and appropriate prior distributions, Bayesian methods
provide principled ways to assess the phylogenetic uncertainty by positing and approximating a
posterior distribution on phylogenetic trees (Huelsenbeck et al., 2001). In addition to uncertainty
quantification, Bayesian methods enable integrating out tree uncertainty in order to get more con-
fident estimates of parameters of interest, such as factors in the transmission of Ebolavirus (Dudas
et al., 2017). Bayesian methods also allow complex substitution models (Lartillot & Philippe, 2004),
which are important in elucidating deep phylogenetic relationships (Feuda et al., 2017).
Ever since its introduction to the phylogenetic community in the 1990s, Bayesian phylogenetic infer-
ence has been dominated by random-walk Markov chain Monte Carlo (MCMC) approaches (Yang
& Rannala, 1997; Mau et al., 1999; Huelsenbeck & Ronquist, 2001). However, this approach is
fundamentally limited by the complexities of tree space. A typical MCMC method for phylogenetic
inference involves two steps in each iteration: first, a new tree is proposed by randomly perturbing
the current tree, and second, the tree is accepted or rejected according to the Metropolis-Hastings
acceptance probability. Any such random walk algorithm faces obstacles in the phylogenetic case,
in which the high-posterior trees are a tiny fraction of the combinatorially exploding number of
trees. Thus, major modifications of trees are likely to be rejected, restricting MCMC tree move-
ment to local modifications that may have difficulty moving between multiple peaks in the posterior
distribution (Whidden & Matsen IV, 2015). Although recent MCMC methods for distributions
on Euclidean space use intelligent proposal mechanisms such as Hamiltonian Monte Carlo (Neal,
2011), it is not straightforward to extend such algorithms to the composite structure of tree space,
which includes both tree topology (discrete object) and branch lengths (continuous positive vector)
(Dinh et al., 2017).
Variational inference (VI) is an alternative approximate inference method for Bayesian analysis
which is gaining in popularity (Jordan et al., 1999; Wainwright & Jordan, 2008; Blei et al., 2017).
Unlike MCMC methods that sample from the posterior, VI selects the best candidate from a family
of tractable distributions to minimize a statistical distance measure to the target posterior, usually
the Kullback-Leibler (KL) divergence. By reformulating the inference problem into an optimization
problem, VI tends to be faster and easier to scale to large data (via stochastic gradient descent)
(Blei et al., 2017). However, VI can also introduce a large bias if the variational distribution is
insufficiently flexible. The success of variational methods, therefore, relies on having appropriate
tractable variational distributions and efficient training procedures.
To our knowledge, there have been no previous variational formulations of Bayesian phylogenetic
inference. This has been due to the lack of an appropriate family of approximating distributions on
phylogenetic trees. However the prospects for variational inference have changed recently with the
introduction of subsplit Bayesian networks (SBNs) (Zhang & Matsen IV, 2018), which provide a
family of flexible distributions on tree topologies (i.e. trees without branch lengths). SBNs build on
previous work (Höhna & Drummond, 2012; Larget, 2013), but in contrast to these previous efforts,
SBNs are sufficiently flexible for real Bayesian phylogenetic posteriors (Zhang & Matsen IV, 2018).
In this paper, we develop a general variational inference framework for Bayesian phylogenetics. We
show that SBNs, when combined with appropriate approximations for the branch length distribu-
tion, can provide flexible variational approximations over the joint latent space of phylogenetic trees
with branch lengths. We use recently-proposed unbiased gradient estimators for the discrete and
continuous components separately to enable efficient stochastic gradient ascent. We also leverage
the similarity of local structures among trees to reduce the complexity of the variational parameteri-
zation for the branch length distributions and provide an extension to better capture the between-tree
variation. Finally, we demonstrate the effectiveness and efficiency of our methods on both synthetic
data and a benchmark of challenging real data Bayesian phylogenetic inference problems.
2 BACKGROUND
Phylogenetic Posterior A phylogenetic tree is described by a tree topology τ and associated non-
negative branch lengths q. The tree topology τ represents the evolutionary diversification of the
species. It is a bifurcating tree with N leaves, each of which has a label corresponding to one of
the observed species. The internal nodes of τ represent the unobserved characters (e.g. DNA bases)
of the ancestral species. A continuous-time Markov model is often used to describe the transition
probabilities of the characters along the branches of the tree. Let Y = {Y_1, Y_2, ..., Y_M} ∈ Ω^{N×M} be the observed sequences (with characters in Ω) of length M over N species. The probability of each site observation Y_i is defined as the marginal distribution over the leaves

p(Y_i | τ, q) = \sum_{a^i} η(a^i_ρ) \prod_{(u,v) ∈ E(τ)} P_{a^i_u a^i_v}(q_{uv})    (1)
where ρ is the root node (or any internal node if the tree is unrooted and the Markov model is
time reversible), ai ranges over all extensions of Yi to the internal nodes with aiu being the assigned
character of node u, E(τ ) denotes the set of edges of τ , Pij (t) denotes the transition probability from
character i to character j across an edge of length t and η is the stationary distribution of the Markov
model. Assuming different sites are identically distributed and evolve independently, the likelihood of observing the entire sequence set Y is p(Y|τ, q) = \prod_{i=1}^{M} p(Y_i|τ, q). The phylogenetic likelihood
for each site in equation 1 can be evaluated efficiently through the pruning algorithm (Felsenstein,
2003), also known as the sum-product algorithm in probabilistic graphical models (Strimmer &
Moulton, 2000; Koller & Friedman, 2009; Höhna et al., 2014). Given a proper prior distribution with
density p(τ, q) imposed on the tree topologies and the branch lengths, the phylogenetic posterior
p(τ, q|Y ) is proportional to the joint density
p(τ, q|Y) = \frac{p(Y|τ, q)\, p(τ, q)}{p(Y)} ∝ p(Y|τ, q)\, p(τ, q)
where p(Y ) is the intractable normalizing constant.
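To make the recursion behind the pruning algorithm concrete, here is a minimal sketch of the sum-product computation for a single site under the Jukes–Cantor model; the nested-list tree encoding, function names, and example tree are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of Felsenstein's pruning algorithm for one site under the
# Jukes-Cantor (JC69) model. The tree encoding and names are illustrative only.
import numpy as np

STATES = "ACGT"

def jc69_transition(t):
    """4x4 transition matrix P(t) under JC69."""
    p_same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
    p_diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
    return np.full((4, 4), p_diff) + np.eye(4) * (p_same - p_diff)

def partial_likelihood(node):
    """Vector L_u(a) = P(observed leaves below u | state a at node u)."""
    if isinstance(node, str):                 # leaf: observed character at this site
        vec = np.zeros(4)
        vec[STATES.index(node)] = 1.0
        return vec
    vec = np.ones(4)
    for child, branch_length in node:         # internal node: list of (child, branch length)
        child_vec = partial_likelihood(child)
        vec *= jc69_transition(branch_length) @ child_vec
    return vec

def site_likelihood(root):
    eta = np.full(4, 0.25)                    # JC69 stationary distribution
    return float(eta @ partial_likelihood(root))

# Rooted tree ((A:0.1, C:0.1):0.2, (T:0.15, T:0.05):0.2), one observed site per leaf.
tree = [([("A", 0.1), ("C", 0.1)], 0.2), ([("T", 0.15), ("T", 0.05)], 0.2)]
print(site_likelihood(tree))
```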
Subsplit Bayesian Networks We now review subsplit Bayesian networks (Zhang & Matsen IV,
2018) and the flexible distributions on tree topologies they provide. Let X be the set of leaf labels.
Figure 1: A simple subsplit Bayesian network for a leaf set that contains 4 species. Left: A leaf label
set X of 4 species, each label corresponds to a DNA sequence. Middle (left): Examples of (rooted)
phylogenetic trees that are hypothesized to model the evolutionary history of the species. Middle
(right): The corresponding SBN assignments for the trees. For ease of illustration, subsplit (W, Z)
is represented as W Z in the graph. The dashed gray subgraphs represent fake splitting processes
where splits are deterministically assigned, and are used purely to complement the networks such
that the overall network has a fixed structure. Right: The SBN for these examples.
We call a nonempty subset of X a clade. Let ≻ be a total order on clades (e.g., lexicographical order). A subsplit (W, Z) of a clade X is an ordered pair of disjoint subclades of X such that W ∪ Z = X and W ≻ Z. A subsplit Bayesian network B_X on a leaf set X of size N is a Bayesian network whose
nodes take on subsplit or singleton clade values that represent the local topological structure of trees
(Figure 1). Following the splitting processes (see the solid dark subgraphs in Figure 1, middle
right), rooted trees have unique subsplit decompositions and hence can be uniquely represented as
compatible SBN assignments. Given the subsplit decomposition of a rooted tree τ = {s_1, s_2, ...}, where s_1 is the root subsplit and {s_i}_{i>1} are the other subsplits, the SBN tree probability is

p_{sbn}(T = τ) = p(S_1 = s_1) \prod_{i>1} p(S_i = s_i | S_{π_i} = s_{π_i})
where S_i denotes the subsplit- or singleton-clade-valued random variable at node i and π_i is the index set of the parents of S_i. The Bayesian network formulation of SBNs enjoys many benefits: i)
flexibility. The expressiveness of SBNs is freely adjustable by changing the dependency structures
between nodes, allowing for a wide range of flexible distributions; ii) normality. SBN-induced
distributions are all naturally normalized if the associated conditional probability tables (CPTs) are
consistent, which is a common property of Bayesian networks. The SBN framework also generalizes
to unrooted trees, which are the most common type of trees in phylogenetics. Concretely, unrooted
trees can be viewed as rooted trees with unobserved roots. Marginalizing out the unobserved root
node S_1, we have the SBN probability estimates for unrooted trees

p_{sbn}(T^u = τ) = \sum_{s_1 ∼ τ} p(S_1 = s_1) \prod_{i>1} p(S_i = s_i | S_{π_i} = s_{π_i})
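As a toy illustration of the product form above, the following sketch scores a rooted four-taxon tree under an SBN; the string encoding of subsplits and the CPT values are illustrative assumptions, not the paper's data structures.

```python
# Toy sketch of the SBN probability of a rooted tree from its subsplit decomposition.
import numpy as np

# Rooted tree ((A,B),(C,D)): root subsplit AB|CD, then AB -> A|B and CD -> C|D.
root_subsplit = "AB|CD"
parent_child_pairs = [("A|B", "AB|CD"), ("C|D", "AB|CD")]

p_root = {"AB|CD": 0.6, "A|BCD": 0.1, "ABC|D": 0.3}          # p(S_1 = s_1), toy values
p_cond = {("A|B", "AB|CD"): 1.0, ("C|D", "AB|CD"): 1.0}      # p(S_i = s_i | S_{pi_i} = s_{pi_i})

def sbn_rooted_log_prob(root_subsplit, parent_child_pairs):
    log_p = np.log(p_root[root_subsplit])
    for child, parent in parent_child_pairs:
        log_p += np.log(p_cond[(child, parent)])
    return log_p

print(np.exp(sbn_rooted_log_prob(root_subsplit, parent_child_pairs)))   # 0.6 for this toy CPT
```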
variational parameters. We then combine these distributions and use the product
Q_{φ,ψ}(τ, q) = Q_φ(τ) Q_ψ(q|τ)
as our variational approximation. Inference now amounts to finding the member of this family that
minimizes the Kullback-Leibler (KL) divergence to the exact posterior,
φ*, ψ* = \arg\min_{φ,ψ} D_{KL}(Q_{φ,ψ}(τ, q) ‖ p(τ, q|Y))    (2)
The CPTs in SBNs are, in general, associated with all possible parent-child subsplit pairs. Therefore,
in principle a full parameterization requires an exponentially increasing number of parameters. In
practice, however, we can find a sufficiently large subsplit support of CPTs (i.e. where the associated
conditional probabilities are allowed to be nonzero) that covers favorable subsplit pairs from trees in
the high-probability areas of the true posterior. In this paper, we will mostly focus on the variational
approach and assume the support of CPTs is available, although in our experiments we find that
a simple bootstrap-based approach does provide a reasonable CPT support estimate for real data.
We leave the development of more sophisticated methods for finding the support of CPTs to future
work.
Now denote the set of root subsplits in the support as S_r and the set of parent-child subsplit pairs in the support as S_{ch|pa}. The CPTs are defined according to the following equations

p(S_1 = s_1) = \frac{\exp(φ_{s_1})}{\sum_{s_r ∈ S_r} \exp(φ_{s_r})},    p(S_i = s | S_{π_i} = t) = \frac{\exp(φ_{s|t})}{\sum_{s ∈ S_{·|t}} \exp(φ_{s|t})}

where S_{·|t} denotes the set of child subsplits for parent subsplit t.
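A small sketch of this softmax parameterization over a fixed support follows; the subsplit labels and supports are illustrative placeholders rather than an estimated CPT support.

```python
# Sketch of softmax-parameterized SBN conditional probability tables over a fixed support.
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

root_support = ["AB|CD", "A|BCD", "ABC|D"]                                  # S_r (placeholder)
child_support = {"ABCD|E": ["AB|CD", "A|BCD", "ABC|D"], "AB|CD": ["A|B"]}   # S_{.|t} (placeholder)

phi_root = {s: 0.0 for s in root_support}                                   # unconstrained parameters
phi_cond = {(s, t): 0.0 for t, children in child_support.items() for s in children}

def p_root(s1):
    probs = softmax([phi_root[s] for s in root_support])
    return probs[root_support.index(s1)]

def p_child_given_parent(s, t):
    support = child_support[t]
    probs = softmax([phi_cond[(c, t)] for c in support])
    return probs[support.index(s)]

print(p_root("AB|CD"), p_child_given_parent("A|BCD", "ABCD|E"))             # uniform at initialization
```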
We use the Log-normal distribution Lognormal(µ, σ²) as our variational approximation for branch
lengths to accommodate their non-negative nature in phylogenetic models. Instead of a naive param-
eterization for each edge on each tree (which would require a large number of parameters when the
high-probability areas of the posterior are diffuse), we use an amortized set of parameters over the
shared local structures among trees. A simple choice of such local structures is the split, a bipartition
(X1 , X2 ) of the leaf labels X (i.e. X1 ∪ X2 = X , X1 ∩ X2 = ∅), and each edge of a phylogenetic
tree naturally corresponds to a split, the bipartition that consists of the leaf labels from both sides of
the edge. Note that a split can be viewed as a root subsplit. We then assign µ(·, ·), σ(·, ·) for each
split (·, ·) in Sr . We denote the corresponding split of edge e of tree τ as e/τ .
A Simple Independent Approximation Given a phylogenetic tree τ , we start with a simple model
that assumes the branch lengths for the edges of the tree are independently distributed. The approx-
imate density Qψ (q|τ ), therefore, has the form
Q_ψ(q|τ) = \prod_{e ∈ E(τ)} p_{Lognormal}(q_e | µ(e, τ), σ(e, τ)),    µ(e, τ) = ψ^µ_{e/τ},    σ(e, τ) = ψ^σ_{e/τ}.    (4)
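A minimal sketch of this split-based amortization is given below, assuming a dictionary from splits to variational parameters; storing log σ to keep the scale positive is an implementation choice, and the split labels and values are illustrative, not the paper's code.

```python
# Sketch of the amortized split-based Lognormal branch length approximation of equation 4.
import numpy as np

psi_mu = {"AB|CD": -2.0, "A|BCD": -2.5, "B|ACD": -2.5, "C|ABD": -2.5, "D|ABC": -2.5}
psi_log_sigma = {s: np.log(0.5) for s in psi_mu}         # log sigma keeps sigma > 0

def lognormal_logpdf(q, mu, sigma):
    return (-np.log(q) - np.log(sigma) - 0.5 * np.log(2 * np.pi)
            - (np.log(q) - mu) ** 2 / (2 * sigma ** 2))

def branch_log_density(tree_splits, branch_lengths):
    """log Q_psi(q | tau) under the independent split-based approximation."""
    log_q = 0.0
    for split, q_e in zip(tree_splits, branch_lengths):
        mu, sigma = psi_mu[split], np.exp(psi_log_sigma[split])
        log_q += lognormal_logpdf(q_e, mu, sigma)
    return log_q

def sample_branch_lengths(tree_splits, rng):
    return [rng.lognormal(psi_mu[s], np.exp(psi_log_sigma[s])) for s in tree_splits]

rng = np.random.default_rng(0)
splits = ["A|BCD", "B|ACD", "C|ABD", "D|ABC", "AB|CD"]    # the five edges of the quartet tree AB|CD
q = sample_branch_lengths(splits, rng)
print(q, branch_log_density(splits, q))
```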
The above approximation equation 4 implicitly assumes that the branch lengths in different trees have the same distribution if they correspond to the same split, which fails to account for between-tree variation. To capture this variation, one can use a more sophisticated parameterization that allows other tree-dependent terms for the variational parameters µ and σ. Specifically, we use additional local structure associated with each edge as follows:

Definition 1 (primary subsplit pair) Let e be an edge of a phylogenetic tree τ which represents a split e/τ = (W, Z). Assume that at least one of W or Z, say W, contains more than one leaf label and denote its subsplit as (W_1, W_2). We call the parent-child subsplit pair (W_1, W_2)|(W, Z) a primary subsplit pair.

Figure 2: Branch length parameterization using primary subsplit pairs, which is the sum of parameters for a split and its neighboring subsplit pairs. Edge e represents a split (W, Z). Parameterization for the variance is the same as for the mean.
We assign additional parameters for each primary subsplit pair. Denoting the primary subsplit pair(s)
of edge e in tree τ as e//τ , we then simply sum all variational parameters associated with e to form
the mean and variance parameters for the corresponding branch length (Figure 2):
µ(e, τ) = ψ^µ_{e/τ} + \sum_{s ∈ e//τ} ψ^µ_s,    σ(e, τ) = ψ^σ_{e/τ} + \sum_{s ∈ e//τ} ψ^σ_s.
This modifies the density in equation 4 by adding contributions from primary subsplit pairs and
hence allows for more flexible between-tree approximations. Note that the above structured param-
eterizations of branch length distributions also enable joint learning across tree topologies.
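A small sketch of the PSP parameterization, with illustrative split and primary-subsplit-pair parameter values (not estimated ones):

```python
# Sketch of the primary subsplit pair (PSP) parameterization of the branch length mean.
psi_mu_split = {"AB|CD": -2.0}
psi_mu_psp = {("A|B", "AB|CD"): 0.3, ("C|D", "AB|CD"): -0.1}

def mu_edge(split, primary_pairs):
    """mu(e, tau) = psi^mu_{e/tau} + sum over the primary subsplit pairs (variance analogous)."""
    return psi_mu_split[split] + sum(psi_mu_psp[p] for p in primary_pairs)

# In the tree ((A,B),(C,D)) the internal edge has split AB|CD; both sides contain more
# than one leaf, so the edge has two primary subsplit pairs.
print(mu_edge("AB|CD", [("A|B", "AB|CD"), ("C|D", "AB|CD")]))   # -1.8
```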
In practice, the lower bound is usually maximized via stochastic gradient ascent (SGA). However,
the naive stochastic gradient estimator obtained by differentiating the lower bound has very large
variance and is impractical for our purpose. Fortunately, various variance reduction techniques have
been introduced in recent years including the control variate (Paisley et al., 2012; Ranganath et al.,
2014; Mnih & Gregor, 2014; Mnih & Rezende, 2016) for general latent variables and the reparam-
eterization trick (Kingma & Welling, 2014) for continuous latent variables. In the following, we
apply these techniques to the different components of our latent variables and derive efficient gradient estimators with much lower variance for each. In addition, we also consider a stable gradient
estimator based on an alternative variational objective. See Appendix A for derivations.
The VIMCO Estimator   Let f_{φ,ψ}(τ, q) = \frac{p(Y|τ, q)\, p(τ, q)}{Q_φ(τ)\, Q_ψ(q|τ)}. The stochastic lower bound with K samples is

L̂^K(φ, ψ) = \log\left(\frac{1}{K} \sum_{i=1}^{K} f_{φ,ψ}(τ^i, q^i)\right).

Mnih & Rezende (2016) propose a localized
learning signal strategy that significantly reduces the variance of the naive gradient estimator by
utilizing the independence between the multiple samples and the regularity of the learning signal,
which estimates the gradient as follows
∇_φ L^K(φ, ψ) = E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \sum_{j=1}^{K} \left(L̂^K_{j|-j}(φ, ψ) − w̃^j\right) ∇_φ \log Q_φ(τ^j)    (5)
where

L̂^K_{j|-j}(φ, ψ) := L̂^K(φ, ψ) − \log \frac{1}{K}\left(\sum_{i \neq j} f_{φ,ψ}(τ^i, q^i) + f̂_{φ,ψ}(τ^{-j}, q^{-j})\right)

is the per-sample local learning signal, with f̂_{φ,ψ}(τ^{-j}, q^{-j}) being some estimate of f_{φ,ψ}(τ^j, q^j) for sample j using the rest of the samples (e.g., the geometric mean), and w̃^j = f_{φ,ψ}(τ^j, q^j) / \sum_{i=1}^{K} f_{φ,ψ}(τ^i, q^i) is the self-normalized importance weight. This gives the following VIMCO estimator
∇_φ L^K(φ, ψ) ≈ \sum_{j=1}^{K} \left(L̂^K_{j|-j}(φ, ψ) − w̃^j\right) ∇_φ \log Q_φ(τ^j),    τ^j, q^j \overset{iid}{\sim} Q_{φ,ψ}(τ, q).    (6)
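The following sketch assembles the VIMCO learning signals of equations 5–6 from K sampled values of log f, using the leave-one-out geometric mean as the per-sample estimate f̂; the inputs are dummy numbers and the gradient of log Q_φ is left symbolic, so this is an illustration of the estimator, not the paper's code.

```python
# Sketch of the VIMCO per-sample learning signals, computed in log space for stability.
import numpy as np

def vimco_signals(log_f):
    """Return (learning_signal_j, w_tilde_j) for each of the K samples."""
    K = len(log_f)
    log_f_hat = np.logaddexp.reduce(log_f) - np.log(K)        # log (1/K sum_i f_i)
    # Leave-one-out geometric mean of the other samples replaces f_j in the baseline.
    loo_mean_log = (log_f.sum() - log_f) / (K - 1)
    baselines = np.empty(K)
    for j in range(K):
        replaced = log_f.copy()
        replaced[j] = loo_mean_log[j]
        baselines[j] = np.logaddexp.reduce(replaced) - np.log(K)
    w_tilde = np.exp(log_f - np.logaddexp.reduce(log_f))       # self-normalized weights
    return (log_f_hat - baselines) - w_tilde, w_tilde

log_f = np.array([-10.2, -9.7, -11.5, -10.0])                  # dummy log f(tau^j, q^j), K = 4
signals, w_tilde = vimco_signals(log_f)
# Gradient estimate: sum_j signals[j] * grad_phi log Q_phi(tau^j).
print(signals, w_tilde)
```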
The Reparameterization Trick The VIMCO estimator also works for the branch length gradient.
However, as branch lengths are continuous latent variables, we can use the reparameterization trick
to estimate the gradient. Because the Log-normal distribution has a simple reparameterization, q ∼ Lognormal(µ, σ²) ⇔ q = exp(µ + σε), ε ∼ N(0, 1), we can rewrite the lower bound:

L^K(φ, ψ) = E_{Q_{φ,ε}(τ^{1:K}, ε^{1:K})} \log\left(\frac{1}{K} \sum_{j=1}^{K} \frac{p(Y|τ^j, g_ψ(ε^j|τ^j))\, p(τ^j, g_ψ(ε^j|τ^j))}{Q_φ(τ^j)\, Q_ψ(g_ψ(ε^j|τ^j)|τ^j)}\right)
where g_ψ(ε|τ) = exp(µ_{ψ,τ} + σ_{ψ,τ} ⊙ ε). Then the gradient of the lower bound w.r.t. ψ is

∇_ψ L^K(φ, ψ) = E_{Q_{φ,ε}(τ^{1:K}, ε^{1:K})} \sum_{j=1}^{K} w̃^j ∇_ψ \log f_{φ,ψ}(τ^j, g_ψ(ε^j|τ^j))    (7)
where w̃^j = f_{φ,ψ}(τ^j, g_ψ(ε^j|τ^j)) / \sum_{i=1}^{K} f_{φ,ψ}(τ^i, g_ψ(ε^i|τ^i)) is the same self-normalized importance weight as in equation 5. Therefore, we can form the Monte Carlo estimator of the gradient

∇_ψ L^K(φ, ψ) ≈ \sum_{j=1}^{K} w̃^j ∇_ψ \log f_{φ,ψ}(τ^j, g_ψ(ε^j|τ^j)),    τ^j \overset{iid}{\sim} Q_φ(τ), \; ε^j \overset{iid}{\sim} N(0, I).    (8)
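To illustrate why the reparameterized estimator is attractive, the sketch below contrasts the score-function and pathwise gradients of an expectation under a single Lognormal branch length; the integrand h is a stand-in for the log joint term, and the whole example is a toy illustration of the trick rather than the multi-sample estimator of equation 8.

```python
# Toy comparison of score-function vs. reparameterization (pathwise) gradient estimates
# of d/d mu E_{q ~ Lognormal(mu, sigma^2)}[h(q)].
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = -2.0, 0.5

def h(q):                 # toy integrand, e.g. an Exp(10) log prior on the branch length
    return np.log(10.0) - 10.0 * q

def h_prime(q):
    return -10.0 * np.ones_like(q)

eps = rng.standard_normal(10_000)
q = np.exp(mu + sigma * eps)                       # reparameterized samples

# Pathwise estimator: E[h'(q) * dq/dmu], with dq/dmu = q.
grad_reparam = np.mean(h_prime(q) * q)

# Score-function estimator of the same gradient: E[h(q) * d/dmu log Lognormal(q; mu, sigma)].
score = (np.log(q) - mu) / sigma**2
grad_score = np.mean(h(q) * score)

print(grad_reparam, grad_score)    # both estimate ~ -1.53; the pathwise estimate has lower variance
```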
We can use an importance sampling estimator to compute the gradient of the objective

∇_φ L̃(φ, ψ) = E_{p(τ,q|Y)} ∇_φ \log Q_{φ,ψ}(τ, q) = \frac{1}{p(Y)} E_{Q_{φ,ψ}(τ,q)} \frac{p(Y|τ, q)\, p(τ, q)}{Q_φ(τ)\, Q_ψ(q|τ)} ∇_φ \log Q_φ(τ)

≈ \sum_{j=1}^{K} w̃^j ∇_φ \log Q_φ(τ^j),    τ^j, q^j \overset{iid}{\sim} Q_{φ,ψ}(τ, q)    (10)
with the same importance weights w̃j as in equation 5. This can be viewed as a multi-sample gen-
eralization of the wake-sleep algorithm (Hinton et al., 1995) and was first used in the reweighted
wake-sleep algorithm (Bornschein & Bengio, 2015) for training deep generative models. We there-
fore call the gradient estimator in equation 10 the RWS estimator. Like the VIMCO estimator, the
RWS estimator also provides gradients for branch lengths. However, we find in practice that equation 8, which uses the reparameterization trick, is more useful and often leads to faster convergence,
although it uses a different optimization objective. A better understanding of this phenomenon
would be an interesting subject of future research.
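For contrast, a minimal sketch of the RWS-style update of equation 10, which simply reweights the score of the tree distribution by self-normalized importance weights; the inputs below are dummy values for illustration.

```python
# Sketch of the RWS gradient of equation 10 with dummy inputs.
import numpy as np

log_f = np.array([-10.2, -9.7, -11.5, -10.0])                               # dummy log f(tau^j, q^j), K = 4
grad_log_Q = np.array([[0.3, -1.0], [0.1, 0.4], [-0.2, 0.8], [0.5, 0.0]])   # dummy grad_phi log Q(tau^j)

w_tilde = np.exp(log_f - np.logaddexp.reduce(log_f))          # self-normalized importance weights
grad_phi = (w_tilde[:, None] * grad_log_Q).sum(axis=0)        # sum_j w_j * grad_phi log Q(tau^j)
print(grad_phi)
```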
All stochastic gradient estimators introduced above can be used in conjunction with stochastic op-
timization methods such as SGA or some of its adaptive variants (e.g., Adam; Kingma & Ba, 2015)
to maximize the lower bounds. See algorithm 1 in Appendix B for a basic variational Bayesian
phylogenetic inference (VBPI) approach.
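A skeleton of such a training loop (playing the role of Algorithm 1, which is not reproduced in this extract) might look as follows; all functions, dimensions, and the plain SGA update are placeholders standing in for the estimators above, where the paper uses Adam.

```python
# Skeleton of the VBPI stochastic optimization loop with placeholder gradient functions.
import numpy as np

rng = np.random.default_rng(0)
phi = np.zeros(5)            # CPT parameters (placeholder dimensionality)
psi = np.zeros(8)            # branch length parameters (placeholder dimensionality)
K, lr = 20, 0.001

def sample_trees_and_eps(phi, K):              # tau^j ~ Q_phi, eps^j ~ N(0, I)
    return [None] * K, rng.standard_normal((K, psi.size))

def vimco_grad_phi(phi, psi, trees, eps):      # placeholder for the estimator of equation 6
    return rng.standard_normal(phi.shape) * 0.01

def reparam_grad_psi(phi, psi, trees, eps):    # placeholder for the estimator of equation 8
    return rng.standard_normal(psi.shape) * 0.01

for step in range(100):
    trees, eps = sample_trees_and_eps(phi, K)
    phi += lr * vimco_grad_phi(phi, psi, trees, eps)    # plain SGA here; the paper uses Adam
    psi += lr * reparam_grad_psi(phi, psi, trees, eps)
```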
4 EXPERIMENTS
Throughout this section we evaluate the effectiveness and efficiency of our variational framework
for inference over phylogenetic trees. The simplest SBN (the one with a full and complete binary
tree structure) is used for the phylogenetic tree topology variational distribution; we have found it to
provide sufficiently accurate approximation. For real datasets, we estimate the CPT supports from
ultrafast maximum likelihood phylogenetic bootstrap trees using UFBoot (Minh et al., 2013), which
is a fast approximate bootstrap method based on efficient heuristics. We compare the performance
of the VIMCO estimator and the RWS estimator with different variational parameterizations for
the branch length distributions, while varying the number of samples in the training objective to
see how these affect the quality of the variational approximations.

[Figure 3: evidence lower bound (left) and KL divergence to the ground truth (middle) over thousands of training iterations, and variational approximation versus ground truth per-tree probabilities (right, for VIMCO(50) and RWS(50)), comparing VIMCO and RWS with 20 and 50 samples.]

For VIMCO, we use Adam
for stochastic gradient ascent with learning rate 0.001 (Kingma & Ba, 2015). For RWS, we also
use AMSGrad (Sashank et al., 2018), a recent variant of Adam, when Adam is unstable. Results
were collected after 200,000 parameter updates. The KL divergences reported are over the discrete
collection of phylogenetic tree structures, from the trained SBN distribution to the ground truth, and a
low KL divergence means a high quality approximation of the distribution of trees.
with the exact evidence being log(1) = 0. We then use both the VIMCO and RWS estimators to
optimize the above lower bound based on 20 and 50 samples (K). We use a slightly larger learning
rate (0.002) in AMSGrad for RWS.
Figure 3 shows the empirical performance of different methods. From the left plot, we see that the
lower bounds converge rapidly and the gaps between lower bounds and the exact model evidence
are close to zero, demonstrating the expressive power of SBNs on phylogenetic tree probability es-
timations. The evolution of KL divergences (middle plot) is consistent with the lower bounds. All
methods benefit from using more samples, with VIMCO performing better in the end and RWS
learning slightly faster at the beginning.1 The slower start of VIMCO is partly due to the regular-
ization term in the lower bounds, which turns out to be beneficial for the overall performance since
the regularization encourages the diversity of the variational approximation and leads to more thorough exploration in the starting phase, similar to the exploring starts (ES) strategy in reinforcement
learning (Sutton & Barto, 1998). The right plot compares the variational approximations obtained
by VIMCO and RWS, both with 50 samples, to the ground truth p_0(τ). We see that VIMCO slightly underestimates trees in high-probability areas as a result of the regularization effect. While RWS provides better approximations for trees in high-probability areas, it tends to underestimate trees in low-probability areas, which deteriorates the overall performance. We expect the biases in both approaches to be alleviated with more samples.
1 Although we use larger learning rates for RWS in this experiment, we found RWS generally learns slightly faster than VIMCO at the beginning. See Figure 4 for the real data phylogenetic inference problems in section 4.2 where we use Adam with learning rate 0.001 for both methods.
Figure 4: Performance on DS1. Left: KL divergence for methods that use the simple split-based
parameterization for the branch length distributions. Middle: KL divergence for methods that use
PSP. Right: Per-tree marginal likelihood estimation (in nats): VBPI vs GSS. The number in brackets
specifies the number of samples used in the training objective. MCMC results are averaged over 10
independent runs. The results for VBPI were obtained using 1000 samples and the error bar shows
one standard deviation over 100 independent runs.
In the second set of experiments we evaluate the proposed variational Bayesian phylogenetic in-
ference (VBPI) algorithms at estimating unrooted phylogenetic tree posteriors on 8 real datasets
commonly used to benchmark phylogenetic MCMC methods (Lakner et al., 2008; Höhna & Drum-
mond, 2012; Larget, 2013; Whidden & Matsen IV, 2015) (Table 1). We concentrate on the most
challenging part of the phylogenetic model: joint learning of the tree topologies and the branch
lengths. We assume a uniform prior on the tree topology, an i.i.d. exponential prior (Exp(10))
for the branch lengths and the simple Jukes & Cantor (1969) substitution model. We consider two
different variational parameterizations for the branch length distributions as introduced in section
3.1. In the first case, we use the simple split-based parameterization that assigns parameters to the
splits associated with the edges of the trees. In the second case, we assign additional parameters for
the primary subsplit pairs (PSP) to better capture the between-tree variation. We form our ground
truth posterior from an extremely long MCMC run of 10 billion iterations (sampled each 1000 itera-
tions with the first 25% discarded as burn-in) using MrBayes (Ronquist et al., 2012), and gather the
support of CPTs from 10 replicates of 10000 ultrafast maximum likelihood bootstrap trees (Minh
et al., 2013). Following Rezende & Mohamed (2015), we use a simple annealed version of the lower
bound which was found to provide better results. The modified bound is:
L^K_{β_t}(φ, ψ) = E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \log\left(\frac{1}{K} \sum_{i=1}^{K} \frac{[p(Y|τ^i, q^i)]^{β_t}\, p(τ^i, q^i)}{Q_φ(τ^i)\, Q_ψ(q^i|τ^i)}\right)
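A sketch of how such an annealed weight β_t might enter the objective is shown below; the linear warm-up schedule is an assumption for illustration only, not the schedule used in the paper.

```python
# Sketch of annealing the likelihood term inside the multi-sample bound.
def beta(t, warmup_steps=100_000, beta0=0.001):
    """Inverse-temperature on the likelihood, rising linearly to 1 (assumed schedule)."""
    return min(1.0, beta0 + (1.0 - beta0) * t / warmup_steps)

def annealed_log_weight(log_lik, log_prior, log_q, t):
    # log of one summand of L^K_{beta_t}: beta_t * log p(Y|tau,q) + log p(tau,q) - log Q(tau,q)
    return beta(t) * log_lik + log_prior - log_q

print(beta(0), beta(50_000), beta(200_000))
```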
Table 1: Data sets used for variational phylogenetic posterior estimation, and marginal likelihood
estimates of different methods across datasets. The marginal likelihood estimates of all variational
methods are obtained by importance sampling using 1000 samples. We run stepping-stone in Mr-
Bayes using default settings with 4 chains for 10,000,000 iterations and sampled every 100 iterations.
The results are averaged over 10 independent runs with standard deviation in brackets.
as burn-in. For a relatively fair comparison (in terms of the number of likelihood evaluations),
we compare 10 (i.e. 2·20/4) times the number of MCMC iterations with the number of 20-sample
objective VBPI iterations.2 Although MCMC converges faster at the start, we see that VBPI methods
(especially those with PSP) quickly surpass MCMC and arrive at good approximations with much
less computation. This is because VBPI iteratively updates the approximate distribution of trees
(e.g., SBNs) which in turn allows guided exploration in the tree topology space. VBPI also provides
the same majority-rule consensus tree as the ground truth MCMC run (Figure 5 in Appendix D).
The variational approximations provided by VBPI can be readily used to perform importance sam-
pling for phylogenetic inference (more details in Appendix C). The right plot of Figure 4 compares
VBPI using VIMCO with 20 samples and PSP to the state-of-the-art generalized stepping-stone
(GSS) (Fan et al., 2011) algorithm for estimating the marginal likelihood of trees in the 95% credi-
ble set of DS1. For GSS, we use 50 power posteriors and for each power posterior we run 1,000,000
MCMC iterations, sampling every 1000 iterations with the first 10% discarded as burn-in. The ref-
erence distribution for GSS was obtained from an independent Gamma approximation using the
maximum a posteriori estimate. Table 1 shows the estimates of the marginal likelihood of the data
(i.e., model evidence) using different VIMCO approximations and one of the state-of-the-art meth-
ods, the stepping-stone (SS) algorithm (Xie et al., 2011). For each data set, all methods provide
estimates for the same marginal likelihood, with better approximation leading to lower variance.
We see that VBPI using 1000 samples is already competitive with SS using 100000 samples and
provides estimates with much less variance (hence more reproducible and reliable). Again, the extra flexibility enabled by PSP alleviates the demand for a larger number of samples in the training objective, making it possible to achieve high-quality variational approximations with fewer samples.
5 DISCUSSION
In this work we introduced VBPI, a general variational framework for Bayesian phylogenetic in-
ference. By combining subsplit Bayesian networks, a recent framework that provides flexible dis-
tributions of trees, and efficient structured parameterizations for branch length distributions, VBPI
exhibits guided exploration (enabled by SBNs) in tree space and provides competitive performance
to MCMC methods with less computation. Moreover, variational approximations provided by VBPI
can be readily used for further statistical analysis such as marginal likelihood estimation for model
comparison via importance sampling, which, compared to MCMC based methods, dramatically re-
duces the cost at test time. We report promising numerical results demonstrating the effectiveness
and efficiency of VBPI on a benchmark of real data Bayesian phylogenetic inference problems.
When the data are weak and posteriors are diffuse, support estimation of CPTs becomes challenging.
However, compared to classical MCMC approaches in phylogenetics that need to traverse the enor-
mous support of posteriors on complete trees to accurately evaluate the posterior probabilities, the
SBN parameterization in VBPI has a natural advantage in that it alleviates this issue by factorizing
the uncertainty of complete tree topologies into local structures.
2 The extra factor of 2/4 is because the likelihood and the gradient can be computed together in twice the time of a likelihood (Schadt et al., 1998) and we run MCMC with 4 chains.
Many topics remain for future work: constructing more flexible approximations for the branch length
distributions (e.g., using normalizing flow (Rezende & Mohamed, 2015) for within-tree approxima-
tion and deep networks for the modeling of between-tree variation), deeper investigation of support
estimation approaches in different data regimes, and efficient training algorithms for general varia-
tional inference on discrete / structured latent variables.
ACKNOWLEDGMENTS
This work was supported by National Science Foundation grant CISE-1564137, as well as National
Institutes of Health grant R01-GM113246. The research of Frederick Matsen was supported in part
by a Faculty Scholar grant from the Howard Hughes Medical Institute and the Simons Foundation.
REFERENCES
D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians.
Journal of the American Statistical Association, 112(518):859–877, 2017.
Jörg Bornschein and Yoshua Bengio. Reweighted wake-sleep. In Proceedings of the International
Conference on Learning Representations (ICLR), 2015.
Vu Dinh, Arman Bilge, Cheng Zhang, and Frederick A Matsen IV. Probabilistic Path Hamiltonian
Monte Carlo. In Proceedings of the 34th International Conference on Machine Learning, pp.
1009–1018, July 2017. URL http://proceedings.mlr.press/v70/dinh17a.html.
Gytis Dudas, Luiz Max Carvalho, Trevor Bedford, Andrew J Tatem, Guy Baele, Nuno R Faria,
Daniel J Park, Jason T Ladner, Armando Arias, Danny Asogun, Filip Bielejec, Sarah L Caddy,
Matthew Cotten, Jonathan D’Ambrozio, Simon Dellicour, Antonino Di Caro, Joseph W Diclaro,
Sophie Duraffour, Michael J Elmore, Lawrence S Fakoli, Ousmane Faye, Merle L Gilbert, Sahr M
Gevao, Stephen Gire, Adrianne Gladden-Young, Andreas Gnirke, Augustine Goba, Donald S
Grant, Bart L Haagmans, Julian A Hiscox, Umaru Jah, Jeffrey R Kugelman, Di Liu, Jia Lu, Chris-
tine M Malboeuf, Suzanne Mate, David A Matthews, Christian B Matranga, Luke W Meredith,
James Qu, Joshua Quick, Suzan D Pas, My V T Phan, Georgios Pollakis, Chantal B Reusken,
Mariano Sanchez-Lockhart, Stephen F Schaffner, John S Schieffelin, Rachel S Sealfon, Etienne
Simon-Loriere, Saskia L Smits, Kilian Stoecker, Lucy Thorne, Ekaete Alice Tobin, Mohamed A
Vandi, Simon J Watson, Kendra West, Shannon Whitmer, Michael R Wiley, Sarah M Winnicki,
Shirlee Wohl, Roman Wölfel, Nathan L Yozwiak, Kristian G Andersen, Sylvia O Blyden, Fa-
torma Bolay, Miles W Carroll, Bernice Dahn, Boubacar Diallo, Pierre Formenty, Christophe
Fraser, George F Gao, Robert F Garry, Ian Goodfellow, Stephan Günther, Christian T Happi, Ed-
ward C Holmes, Brima Kargbo, Sakoba Keı̈ta, Paul Kellam, Marion P G Koopmans, Jens H Kuhn,
Nicholas J Loman, N’faly Magassouba, Dhamari Naidoo, Stuart T Nichol, Tolbert Nyenswah,
Gustavo Palacios, Oliver G Pybus, Pardis C Sabeti, Amadou Sall, Ute Ströher, Isatta Wurie,
Marc A Suchard, Philippe Lemey, and Andrew Rambaut. Virus genomes reveal factors that
spread and sustained the ebola epidemic. Nature, April 2017. ISSN 0028-0836, 1476-4687. doi:
10.1038/nature22040. URL http://dx.doi.org/10.1038/nature22040.
Y. Fan, R. Wu, M.-H. Chen, L. Kuo, and P. O. Lewis. Choosing among partition models in Bayesian
phylogenetics. Mol. Biol. Evol., 28(1):523–532, 2011.
Roberto Feuda, Martin Dohrmann, Walker Pett, Hervé Philippe, Omar Rota-Stabelli, Nicolas Lar-
tillot, Gert Wörheide, and Davide Pisani. Improved modeling of compositional heterogene-
ity supports sponges as sister to all other animals. Curr. Biol., 27(24):3864–3870.e4, De-
cember 2017. ISSN 0960-9822, 1879-0445. doi: 10.1016/j.cub.2017.11.008. URL http:
//dx.doi.org/10.1016/j.cub.2017.11.008.
S. B. Hedges, K. D. Moberg, and L. R. Maxson. Tetrapod phylogeny inferred from 18S and 28S
ribosomal RNA sequences and review of the evidence for amniote relationships. Mol. Biol. Evol.,
7:607–633, 1990.
D. A. Henk, A. Weir, and M. Blackwell. Laboulbeniopsis termitarius, an ectoparasite of termites
newly recognized as a member of the Laboulbeniomycetes. Mycologia, 95:561–564, 2003.
G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The wake-sleep algorithm for unsupervised
neural networks. Science, 268:1158–1161, 1995.
S. Höhna, T. A. Heath, B. Boussau, M. J. Landis, F. Ronquist, and J. P. Huelsenbeck. Probabilistic
graphical model representation in phylogenetics. Syst. Biol., 63:753–771, 2014.
Sebastian Höhna and Alexei J. Drummond. Guided tree topology proposals for Bayesian phyloge-
netic inference. Syst. Biol., 61(1):1–11, January 2012. ISSN 1063-5157. doi: 10.1093/sysbio/
syr074. URL http://dx.doi.org/10.1093/sysbio/syr074.
J. P. Huelsenbeck and F. Ronquist. MrBayes: Bayesian inference of phylogeny. Bioinformatics, 17:
754–755, 2001.
J. P. Huelsenbeck, F. Ronquist, R. Nielsen, and J. P. Bollback. Bayesian inference of phylogeny and
its impact on evolutionary biology. Science, 294:2310–2314, 2001.
Thibaut Jombart, Michelle Kendall, Jacob Almagro-Garcia, and Caroline Colijn. treespace: Statisti-
cal exploration of landscapes of phylogenetic trees. Molecular Ecology Resources, 17:1385–1392,
2017. URL https://doi.org/10.1111/1755-0998.12676.
M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for
graphical models. Machine Learning, 37:183–233, 1999.
T. H. Jukes and C. R. Cantor. Evolution of protein molecules. In H. N. Munro (ed.), Mammalian
protein metabolism, III, pp. 21–132, New York, 1969. Academic Press.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT
Press, 2009.
C. Lakner, P. van der Mark, J. P. Huelsenbeck, B. Larget, and F. Ronquist. Efficiency of Markov
chain Monte Carlo tree proposals in Bayesian phylogenetics. Syst. Biol., 57:86–103, 2008.
Bret Larget. The estimation of tree posterior probabilities using conditional clade probability dis-
tributions. Syst. Biol., 62(4):501–511, July 2013. ISSN 1063-5157. doi: 10.1093/sysbio/syt014.
URL http://dx.doi.org/10.1093/sysbio/syt014.
Nicolas Lartillot and Hervé Philippe. A Bayesian mixture model for across-site heterogeneities in
the amino-acid replacement process. Mol. Biol. Evol., 21(6):1095–1109, June 2004. ISSN 0737-
4038. doi: 10.1093/molbev/msh112. URL http://dx.doi.org/10.1093/molbev/
msh112.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
B. Mau, M. Newton, and B. Larget. Bayesian phylogenetic inference via Markov chain Monte Carlo
methods. Biometrics, 55:1–12, 1999.
B. Q. Minh, M. A. T. Nguyen, and A. von Haeseler. Ultrafast approximation for phylogenetic
bootstrap. Mol. Biol. Evol., 30:1188–1195, 2013.
A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In Proceedings
of The 31th International Conference on Machine Learning, pp. 1791–1799, 2014.
Andriy Mnih and Danilo Rezende. Variational inference for monte carlo objectives. In Proceedings
of the 33rd International Conference on Machine Learning, pp. 1791–1799, 2016.
Radford Neal. MCMC using hamiltonian dynamics. In S Brooks, A Gelman, G Jones, and XL Meng
(eds.), Handbook of Markov Chain Monte Carlo, Chapman & Hall/CRC Handbooks of Modern
Statistical Methods. Taylor & Francis, 2011. ISBN 9781420079425. URL http://books.
google.com/books?id=qfRsAIKZ4rIC.
J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational bayesian inference with stochastic search. In
Proceedings of the 29th International Conference on Machine Learning ICML, 2012.
R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In AISTATS, pp. 814–822,
2014.
D. Rezende and S. Mohamed. Variational inference with normalizing flow. In Proceedings of The
32nd International Conference on Machine Learning, pp. 1530–1538, 2015.
F. Ronquist, M. Teslenko, P. van der Mark, D. L. Ayres, A. Darling, S. Höhna, B. Larget, L. Liu, M. A. Suchard, and J. P. Huelsenbeck. MrBayes 3.2: efficient Bayesian phylogenetic inference
and model choice across a large model space. Syst. Biol., 61:539–542, 2012.
A. Y. Rossman, J. M. Mckemy, R. A. Pardo-Schultheiss, and H. J. Schroers. Molecular studies of
the Bionectriaceae using large subunit rDNA sequences. Mycologia, 93:100–110, 2001.
J. R. Sashank, K. Satyen, and K. Sanjiv. On the convergence of adam and beyond. In ICLR, 2018.
Eric E. Schadt, Janet S. Sinsheimer, and Kenneth Lange. Computational advances in maximum
likelihood methods for molecular phylogeny. Genome Res., 8:222–233, 1998. doi: 10.1101/gr.8.
3.222.
K. Strimmer and V. Moulton. Likelihood analysis of phylogenetic networks using directed graphical
models. Molecular biology and evolution, 17:875–881, 2000.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational infer-
ence. Foundations and Trends in Maching Learning, 1(1-2):1–305, 2008.
Chris Whidden and Frederick A Matsen IV. Quantifying MCMC exploration of phylogenetic tree
space. Syst. Biol., 64(3):472–491, May 2015. ISSN 1063-5157, 1076-836X. doi: 10.1093/sysbio/
syv006. URL http://dx.doi.org/10.1093/sysbio/syv006.
W. Xie, P. O. Lewis, Y. Fan, L. Kuo, and M.-H. Chen. Improving marginal likelihood estimation for
Bayesian phylogenetic model selection. Syst. Biol., 60:150–160, 2011.
Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: a Markov chain
Monte Carlo method. Mol. Biol. Evol., 14:717–724, 1997.
Z. Yang and A. D. Yoder. Comparison of likelihood and Bayesian methods for estimating divergence
times using multiple gene loci and calibration points, with application to a radiation of cute-
looking mouse lemur species. Syst. Biol., 52:705–716, 2003.
A. D. Yoder and Z. Yang. Divergence dates for Malagasy lemurs estimated from multiple gene loci:
geological and evolutionary context. Mol. Ecol., 13:757–773, 2004.
Cheng Zhang and Frederick A Matsen IV. Generalizing tree probability estimation via
bayesian networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-
Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31,
pp. 1451–1460. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/
7418-generalizing-tree-probability-estimation-via-bayesian-networks.
pdf.
N. Zhang and M. Blackwell. Molecular phylogeny of dogwood anthracnose fungus (Discula de-
structiva) and the Diaporthales. Mycologia, 93:355–365, 2001.
In this section we will derive the gradient for the multi-sample objectives introduced in section 3.
We start with the lower bound
L^K(φ, ψ) = E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \log\left(\frac{1}{K} \sum_{j=1}^{K} \frac{p(Y|τ^j, q^j)\, p(τ^j, q^j)}{Q_φ(τ^j)\, Q_ψ(q^j|τ^j)}\right) = E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \log\left(\frac{1}{K} \sum_{j=1}^{K} f_{φ,ψ}(τ^j, q^j)\right).
Using the product rule and noting that ∇_φ \log f_{φ,ψ}(τ^j, q^j) = −∇_φ \log Q_φ(τ^j),

∇_φ L^K(φ, ψ) = E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} ∇_φ \log\left(\frac{1}{K} \sum_{j=1}^{K} f_{φ,ψ}(τ^j, q^j)\right) + E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \sum_{j=1}^{K} \frac{∇_φ Q_φ(τ^j)}{Q_φ(τ^j)} \log\left(\frac{1}{K} \sum_{i=1}^{K} f_{φ,ψ}(τ^i, q^i)\right)

= E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \sum_{j=1}^{K} \frac{f_{φ,ψ}(τ^j, q^j)}{\sum_{i=1}^{K} f_{φ,ψ}(τ^i, q^i)} ∇_φ \log f_{φ,ψ}(τ^j, q^j) + E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \sum_{j=1}^{K} \log\left(\frac{1}{K} \sum_{i=1}^{K} f_{φ,ψ}(τ^i, q^i)\right) ∇_φ \log Q_φ(τ^j)

= E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \sum_{j=1}^{K} \left(L̂^K(φ, ψ) − w̃^j\right) ∇_φ \log Q_φ(τ^j).
Since ψ is not involved in the distribution with respect to which we take the expectation,

∇_ψ L^K(φ, ψ) = E_{Q_{φ,ε}(τ^{1:K}, ε^{1:K})} ∇_ψ \log\left(\frac{1}{K} \sum_{j=1}^{K} f_{φ,ψ}(τ^j, g_ψ(ε^j|τ^j))\right)

= E_{Q_{φ,ε}(τ^{1:K}, ε^{1:K})} \sum_{j=1}^{K} \frac{f_{φ,ψ}(τ^j, g_ψ(ε^j|τ^j))}{\sum_{i=1}^{K} f_{φ,ψ}(τ^i, g_ψ(ε^i|τ^i))} ∇_ψ \log f_{φ,ψ}(τ^j, g_ψ(ε^j|τ^j))

= E_{Q_{φ,ε}(τ^{1:K}, ε^{1:K})} \sum_{j=1}^{K} w̃^j ∇_ψ \log f_{φ,ψ}(τ^j, g_ψ(ε^j|τ^j)).
Next, we derive the gradient of the multi-sample likelihood objective used in RWS
The second to last step uses self-normalized importance sampling with K samples. ∇ψ L̃(φ, ψ) can
be computed in a similar way.
In our experiments, we use K = 1000. When taking a log transformation, the above Monte Carlo
estimate is no longer unbiased (for the evidence log p(Y )). Instead, it can be viewed as one sample
Monte Carlo estimate of the lower bound
L^K(φ, ψ) = E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \log\left(\frac{1}{K} \sum_{i=1}^{K} \frac{p(Y|τ^i, q^i)\, p(τ^i, q^i)}{Q_φ(τ^i)\, Q_ψ(q^i|τ^i)}\right) ≤ \log p(Y)    (11)
whose tightness improves as the number of samples K increases. Therefore, with a sufficiently large
K, we can use the lower bound estimate as a proxy for Bayesian model selection.
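A sketch of this estimator: draw K samples from the variational approximation, compute the log importance ratios, and average them in log space; the log f values below are dummy numbers standing in for log[p(Y|τ,q)p(τ,q)/Q_{φ,ψ}(τ,q)].

```python
# Sketch of the importance-sampling evidence estimate used at test time.
import numpy as np

def log_marginal_likelihood_estimate(log_f):
    """log of the average importance ratio; its expectation lower-bounds log p(Y)."""
    return np.logaddexp.reduce(log_f) - np.log(len(log_f))

rng = np.random.default_rng(0)
log_f = -7040.0 + rng.standard_normal(1000)     # dummy log importance ratios for K = 1000 samples
print(log_marginal_likelihood_estimate(log_f))
```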
Figure 5: A comparison of majority-rule consensus trees obtained from VBPI and ground truth
MCMC run on DS1. Left: Ground truth MCMC. Right: VBPI (10000 sampled trees). The plot is
created using the treespace (Jombart et al., 2017) R package.