VARIATIONAL BAYESIAN PHYLOGENETIC INFERENCE
ABSTRACT
Bayesian phylogenetic inference is currently done via Markov chain Monte Carlo
with simple mechanisms for proposing new states, which hinders exploration effi-
ciency and often requires long runs to deliver accurate posterior estimates. In this
paper we present an alternative approach: a variational framework for Bayesian
phylogenetic analysis. We approximate the true posterior using an expressive
graphical model for tree distributions, called a subsplit Bayesian network, together
with appropriate branch length distributions. We train the variational approxima-
tion via stochastic gradient ascent and adopt multi-sample based gradient estima-
tors for different latent variables separately to handle the composite latent space
of phylogenetic models. We show that our structured variational approximations
are flexible enough to provide comparable posterior estimation to MCMC, while
requiring less computation due to a more efficient tree exploration mechanism en-
abled by variational inference. Moreover, the variational approximations can be
readily used for further statistical analysis such as marginal likelihood estimation
for model comparison via importance sampling. Experiments on both synthetic
data and real data Bayesian phylogenetic inference problems demonstrate the ef-
fectiveness and efficiency of our methods.
1 INTRODUCTION
Bayesian phylogenetic inference is an essential tool in modern evolutionary biology. Given an align-
ment of nucleotide or amino acid sequences and appropriate prior distributions, Bayesian methods
provide principled ways to assess the phylogenetic uncertainty by positing and approximating a
posterior distribution on phylogenetic trees (Huelsenbeck et al., 2001). In addition to uncertainty
quantification, Bayesian methods enable integrating out tree uncertainty in order to get more con-
fident estimates of parameters of interest, such as factors in the transmission of Ebolavirus (Dudas
et al., 2017). Bayesian methods also allow complex substitution models (Lartillot & Philippe, 2004),
which are important in elucidating deep phylogenetic relationships (Feuda et al., 2017).
Ever since its introduction to the phylogenetic community in the 1990s, Bayesian phylogenetic infer-
ence has been dominated by random-walk Markov chain Monte Carlo (MCMC) approaches (Yang
& Rannala, 1997; Mau et al., 1999; Huelsenbeck & Ronquist, 2001). However, this approach is
fundamentally limited by the complexities of tree space. A typical MCMC method for phylogenetic
inference involves two steps in each iteration: first, a new tree is proposed by randomly perturbing
the current tree, and second, the tree is accepted or rejected according to the Metropolis-Hastings
acceptance probability. Any such random walk algorithm faces obstacles in the phylogenetic case,
in which the high-posterior trees are a tiny fraction of the combinatorially exploding number of
trees. Thus, major modifications of trees are likely to be rejected, restricting MCMC tree move-
ment to local modifications that may have difficulty moving between multiple peaks in the posterior
distribution (Whidden & Matsen IV, 2015). Although recent MCMC methods for distributions
on Euclidean space use intelligent proposal mechanisms such as Hamiltonian Monte Carlo (Neal,
2011), it is not straightforward to extend such algorithms to the composite structure of tree space,
which includes both tree topology (discrete object) and branch lengths (continuous positive vector)
(Dinh et al., 2017).
Variational inference (VI) is an alternative approximate inference method for Bayesian analysis
which is gaining in popularity (Jordan et al., 1999; Wainwright & Jordan, 2008; Blei et al., 2017).
Unlike MCMC methods that sample from the posterior, VI selects the best candidate from a family
of tractable distributions to minimize a statistical distance measure to the target posterior, usually
the Kullback-Leibler (KL) divergence. By reformulating the inference problem into an optimization
problem, VI tends to be faster and easier to scale to large data (via stochastic gradient descent)
(Blei et al., 2017). However, VI can also introduce a large bias if the variational distribution is
insufficiently flexible. The success of variational methods, therefore, relies on having appropriate
tractable variational distributions and efficient training procedures.
To our knowledge, there have been no previous variational formulations of Bayesian phylogenetic
inference. This has been due to the lack of an appropriate family of approximating distributions on
phylogenetic trees. However the prospects for variational inference have changed recently with the
introduction of subsplit Bayesian networks (SBNs) (Zhang & Matsen IV, 2018), which provide a
family of flexible distributions on tree topologies (i.e. trees without branch lengths). SBNs build on
previous work (Höhna & Drummond, 2012; Larget, 2013), but in contrast to these previous efforts,
SBNs are sufficiently flexible for real Bayesian phylogenetic posteriors (Zhang & Matsen IV, 2018).
In this paper, we develop a general variational inference framework for Bayesian phylogenetics. We
show that SBNs, when combined with appropriate approximations for the branch length distribu-
tion, can provide flexible variational approximations over the joint latent space of phylogenetic trees
with branch lengths. We use recently-proposed unbiased gradient estimators for the discrete and
continuous components separately to enable efficient stochastic gradient ascent. We also leverage
the similarity of local structures among trees to reduce the complexity of the variational parameteri-
zation for the branch length distributions and provide an extension to better capture the between-tree
variation. Finally, we demonstrate the effectiveness and efficiency of our methods on both synthetic
data and a benchmark of challenging real data Bayesian phylogenetic inference problems.
2 BACKGROUND
Phylogenetic Posterior A phylogenetic tree is described by a tree topology τ and associated non-
negative branch lengths q. The tree topology τ represents the evolutionary diversification of the
species. It is a bifurcating tree with N leaves, each of which has a label corresponding to one of
the observed species. The internal nodes of τ represent the unobserved characters (e.g. DNA bases)
of the ancestral species. A continuous-time Markov model is often used to describe the transition
probabilities of the characters along the branches of the tree. Let Y = {Y_1, Y_2, ..., Y_M} ∈ Ω^{N×M} be the observed sequences (with characters in Ω) of length M over N species. The probability of each site observation Y_i is defined as the marginal distribution over the leaves

p(Y_i | τ, q) = \sum_{a^i} η(a^i_ρ) \prod_{(u,v) ∈ E(τ)} P_{a^i_u a^i_v}(q_{uv})    (1)
where ρ is the root node (or any internal node if the tree is unrooted and the Markov model is
time reversible), ai ranges over all extensions of Yi to the internal nodes with aiu being the assigned
character of node u, E(τ ) denotes the set of edges of τ , Pij (t) denotes the transition probability from
character i to character j across an edge of length t and η is the stationary distribution of the Markov
model. Assuming different sites are identically distributed and evolve independently, the likelihood of observing the entire sequence set Y is p(Y|τ, q) = \prod_{i=1}^{M} p(Y_i|τ, q). The phylogenetic likelihood
for each site in equation 1 can be evaluated efficiently through the pruning algorithm (Felsenstein,
2003), also known as the sum-product algorithm in probabilistic graphical models (Strimmer &
Moulton, 2000; Koller & Friedman, 2009; Höhna et al., 2014). Given a proper prior distribution with
density p(τ, q) imposed on the tree topologies and the branch lengths, the phylogenetic posterior
p(τ, q|Y ) is proportional to the joint density
p(τ, q|Y) = \frac{p(Y|τ, q)\, p(τ, q)}{p(Y)} ∝ p(Y|τ, q)\, p(τ, q)
where p(Y ) is the intractable normalizing constant.
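To make the recursion behind the pruning algorithm concrete, here is a minimal sketch of the sum-product computation for a single site under the Jukes–Cantor model; the nested-list tree encoding, function names, and example tree are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of Felsenstein's pruning algorithm for one site under the
# Jukes-Cantor (JC69) model. The tree encoding and names are illustrative only.
import numpy as np

STATES = "ACGT"

def jc69_transition(t):
    """4x4 transition matrix P(t) under JC69."""
    p_same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
    p_diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
    return np.full((4, 4), p_diff) + np.eye(4) * (p_same - p_diff)

def partial_likelihood(node):
    """Vector L_u(a) = P(observed leaves below u | state a at node u)."""
    if isinstance(node, str):                 # leaf: observed character at this site
        vec = np.zeros(4)
        vec[STATES.index(node)] = 1.0
        return vec
    vec = np.ones(4)
    for child, branch_length in node:         # internal node: list of (child, branch length)
        child_vec = partial_likelihood(child)
        vec *= jc69_transition(branch_length) @ child_vec
    return vec

def site_likelihood(root):
    eta = np.full(4, 0.25)                    # JC69 stationary distribution
    return float(eta @ partial_likelihood(root))

# Rooted tree ((A:0.1, C:0.1):0.2, (T:0.15, T:0.05):0.2), one observed site per leaf.
tree = [([("A", 0.1), ("C", 0.1)], 0.2), ([("T", 0.15), ("T", 0.05)], 0.2)]
print(site_likelihood(tree))
```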
Subsplit Bayesian Networks We now review subsplit Bayesian networks (Zhang & Matsen IV,
2018) and the flexible distributions on tree topologies they provide. Let X be the set of leaf labels.
Figure 1: A simple subsplit Bayesian network for a leaf set that contains 4 species. Left: A leaf label
set X of 4 species, each label corresponds to a DNA sequence. Middle (left): Examples of (rooted)
phylogenetic trees that are hypothesized to model the evolutionary history of the species. Middle
(right): The corresponding SBN assignments for the trees. For ease of illustration, subsplit (W, Z)
is represented as W Z in the graph. The dashed gray subgraphs represent fake splitting processes
where splits are deterministically assigned, and are used purely to complement the networks such
that the overall network has a fixed structure. Right: The SBN for these examples.
We call a nonempty subset of X a clade. Let ≻ be a total order on clades (e.g., lexicographical order). A subsplit (W, Z) of a clade X is an ordered pair of disjoint subclades of X such that W ∪ Z = X and W ≻ Z. A subsplit Bayesian network B_X on a leaf set X of size N is a Bayesian network whose
nodes take on subsplit or singleton clade values that represent the local topological structure of trees
(Figure 1). Following the splitting processes (see the solid dark subgraphs in Figure 1, middle
right), rooted trees have unique subsplit decompositions and hence can be uniquely represented as
compatible SBN assignments. Given the subsplit decomposition of a rooted tree τ = {s_1, s_2, ...}, where s_1 is the root subsplit and {s_i}_{i>1} are the other subsplits, the SBN tree probability is

p_{sbn}(T = τ) = p(S_1 = s_1) \prod_{i>1} p(S_i = s_i | S_{π_i} = s_{π_i})
where S_i denotes the subsplit- or singleton-clade-valued random variable at node i and π_i is the index set of the parents of S_i. The Bayesian network formulation of SBNs enjoys many benefits: i)
flexibility. The expressiveness of SBNs is freely adjustable by changing the dependency structures
between nodes, allowing for a wide range of flexible distributions; ii) normality. SBN-induced
distributions are all naturally normalized if the associated conditional probability tables (CPTs) are
consistent, which is a common property of Bayesian networks. The SBN framework also generalizes
to unrooted trees, which are the most common type of trees in phylogenetics. Concretely, unrooted
trees can be viewed as rooted trees with unobserved roots. Marginalizing out the unobserved root
node S_1, we have the SBN probability estimates for unrooted trees

p_{sbn}(T^u = τ) = \sum_{s_1 ∼ τ} p(S_1 = s_1) \prod_{i>1} p(S_i = s_i | S_{π_i} = s_{π_i})
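As a toy illustration of the product form above, the following sketch scores a rooted four-taxon tree under an SBN; the string encoding of subsplits and the CPT values are illustrative assumptions, not the paper's data structures.

```python
# Toy sketch of the SBN probability of a rooted tree from its subsplit decomposition.
import numpy as np

# Rooted tree ((A,B),(C,D)): root subsplit AB|CD, then AB -> A|B and CD -> C|D.
root_subsplit = "AB|CD"
parent_child_pairs = [("A|B", "AB|CD"), ("C|D", "AB|CD")]

p_root = {"AB|CD": 0.6, "A|BCD": 0.1, "ABC|D": 0.3}          # p(S_1 = s_1), toy values
p_cond = {("A|B", "AB|CD"): 1.0, ("C|D", "AB|CD"): 1.0}      # p(S_i = s_i | S_{pi_i} = s_{pi_i})

def sbn_rooted_log_prob(root_subsplit, parent_child_pairs):
    log_p = np.log(p_root[root_subsplit])
    for child, parent in parent_child_pairs:
        log_p += np.log(p_cond[(child, parent)])
    return log_p

print(np.exp(sbn_rooted_log_prob(root_subsplit, parent_child_pairs)))   # 0.6 for this toy CPT
```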
variational parameters. We then combine these distributions and use the product
Q_{φ,ψ}(τ, q) = Q_φ(τ) Q_ψ(q|τ)
as our variational approximation. Inference now amounts to finding the member of this family that
minimizes the Kullback-Leibler (KL) divergence to the exact posterior,
φ*, ψ* = \arg\min_{φ,ψ} D_{KL}(Q_{φ,ψ}(τ, q) ‖ p(τ, q|Y))    (2)
The CPTs in SBNs are, in general, associated with all possible parent-child subsplit pairs. Therefore,
in principle a full parameterization requires an exponentially increasing number of parameters. In
practice, however, we can find a sufficiently large subsplit support of CPTs (i.e. where the associated
conditional probabilities are allowed to be nonzero) that covers favorable subsplit pairs from trees in
the high-probability areas of the true posterior. In this paper, we will mostly focus on the variational
approach and assume the support of CPTs is available, although in our experiments we find that
a simple bootstrap-based approach does provide a reasonable CPT support estimate for real data.
We leave the development of more sophisticated methods for finding the support of CPTs to future
work.
Now denote the set of root subsplits in the support as S_r and the set of parent-child subsplit pairs in the support as S_{ch|pa}. The CPTs are defined according to the following equations

p(S_1 = s_1) = \frac{\exp(φ_{s_1})}{\sum_{s_r ∈ S_r} \exp(φ_{s_r})},    p(S_i = s | S_{π_i} = t) = \frac{\exp(φ_{s|t})}{\sum_{s ∈ S_{·|t}} \exp(φ_{s|t})}

where S_{·|t} denotes the set of child subsplits for parent subsplit t.
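A small sketch of this softmax parameterization over a fixed support follows; the subsplit labels and supports are illustrative placeholders rather than an estimated CPT support.

```python
# Sketch of softmax-parameterized SBN conditional probability tables over a fixed support.
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

root_support = ["AB|CD", "A|BCD", "ABC|D"]                                  # S_r (placeholder)
child_support = {"ABCD|E": ["AB|CD", "A|BCD", "ABC|D"], "AB|CD": ["A|B"]}   # S_{.|t} (placeholder)

phi_root = {s: 0.0 for s in root_support}                                   # unconstrained parameters
phi_cond = {(s, t): 0.0 for t, children in child_support.items() for s in children}

def p_root(s1):
    probs = softmax([phi_root[s] for s in root_support])
    return probs[root_support.index(s1)]

def p_child_given_parent(s, t):
    support = child_support[t]
    probs = softmax([phi_cond[(c, t)] for c in support])
    return probs[support.index(s)]

print(p_root("AB|CD"), p_child_given_parent("A|BCD", "ABCD|E"))             # uniform at initialization
```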
We use the Log-normal distribution Lognormal(µ, σ²) as our variational approximation for branch
lengths to accommodate their non-negative nature in phylogenetic models. Instead of a naive param-
eterization for each edge on each tree (which would require a large number of parameters when the
high-probability areas of the posterior are diffuse), we use an amortized set of parameters over the
shared local structures among trees. A simple choice of such local structures is the split, a bipartition
(X1 , X2 ) of the leaf labels X (i.e. X1 ∪ X2 = X , X1 ∩ X2 = ∅), and each edge of a phylogenetic
tree naturally corresponds to a split, the bipartition that consists of the leaf labels from both sides of
the edge. Note that a split can be viewed as a root subsplit. We then assign µ(·, ·), σ(·, ·) for each
split (·, ·) in Sr . We denote the corresponding split of edge e of tree τ as e/τ .
A Simple Independent Approximation Given a phylogenetic tree τ , we start with a simple model
that assumes the branch lengths for the edges of the tree are independently distributed. The approx-
imate density Qψ (q|τ ), therefore, has the form
Q_ψ(q|τ) = \prod_{e ∈ E(τ)} p_{Lognormal}(q_e | µ(e, τ), σ(e, τ)),    µ(e, τ) = ψ^µ_{e/τ},    σ(e, τ) = ψ^σ_{e/τ}.    (4)
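A minimal sketch of this split-based amortization is given below, assuming a dictionary from splits to variational parameters; storing log σ to keep the scale positive is an implementation choice, and the split labels and values are illustrative, not the paper's code.

```python
# Sketch of the amortized split-based Lognormal branch length approximation of equation 4.
import numpy as np

psi_mu = {"AB|CD": -2.0, "A|BCD": -2.5, "B|ACD": -2.5, "C|ABD": -2.5, "D|ABC": -2.5}
psi_log_sigma = {s: np.log(0.5) for s in psi_mu}         # log sigma keeps sigma > 0

def lognormal_logpdf(q, mu, sigma):
    return (-np.log(q) - np.log(sigma) - 0.5 * np.log(2 * np.pi)
            - (np.log(q) - mu) ** 2 / (2 * sigma ** 2))

def branch_log_density(tree_splits, branch_lengths):
    """log Q_psi(q | tau) under the independent split-based approximation."""
    log_q = 0.0
    for split, q_e in zip(tree_splits, branch_lengths):
        mu, sigma = psi_mu[split], np.exp(psi_log_sigma[split])
        log_q += lognormal_logpdf(q_e, mu, sigma)
    return log_q

def sample_branch_lengths(tree_splits, rng):
    return [rng.lognormal(psi_mu[s], np.exp(psi_log_sigma[s])) for s in tree_splits]

rng = np.random.default_rng(0)
splits = ["A|BCD", "B|ACD", "C|ABD", "D|ABC", "AB|CD"]    # the five edges of the quartet tree AB|CD
q = sample_branch_lengths(splits, rng)
print(q, branch_log_density(splits, q))
```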
The above approximation equation 4 implicitly assumes that the branch lengths in different trees have the same distribution if they correspond to the same split, which fails to account for between-tree variation. To capture this variation, one can use a more sophisticated parameterization that allows other tree-dependent terms for the variational parameters µ and σ. Specifically, we use additional local structure associated with each edge as follows:

Definition 1 (primary subsplit pair) Let e be an edge of a phylogenetic tree τ which represents a split e/τ = (W, Z). Assume that at least one of W or Z, say W, contains more than one leaf label and denote its subsplit as (W_1, W_2). We call the parent-child subsplit pair (W_1, W_2)|(W, Z) a primary subsplit pair.

Figure 2: Branch length parameterization using primary subsplit pairs, which is the sum of parameters for a split and its neighboring subsplit pairs. Edge e represents a split (W, Z). Parameterization for the variance is the same as for the mean.
We assign additional parameters for each primary subsplit pair. Denoting the primary subsplit pair(s)
of edge e in tree τ as e//τ , we then simply sum all variational parameters associated with e to form
the mean and variance parameters for the corresponding branch length (Figure 2):
µ(e, τ) = ψ^µ_{e/τ} + \sum_{s ∈ e//τ} ψ^µ_s,    σ(e, τ) = ψ^σ_{e/τ} + \sum_{s ∈ e//τ} ψ^σ_s.
This modifies the density in equation 4 by adding contributions from primary subsplit pairs and
hence allows for more flexible between-tree approximations. Note that the above structured param-
eterizations of branch length distributions also enable joint learning across tree topologies.
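A small sketch of the PSP parameterization, with illustrative split and primary-subsplit-pair parameter values (not estimated ones):

```python
# Sketch of the primary subsplit pair (PSP) parameterization of the branch length mean.
psi_mu_split = {"AB|CD": -2.0}
psi_mu_psp = {("A|B", "AB|CD"): 0.3, ("C|D", "AB|CD"): -0.1}

def mu_edge(split, primary_pairs):
    """mu(e, tau) = psi^mu_{e/tau} + sum over the primary subsplit pairs (variance analogous)."""
    return psi_mu_split[split] + sum(psi_mu_psp[p] for p in primary_pairs)

# In the tree ((A,B),(C,D)) the internal edge has split AB|CD; both sides contain more
# than one leaf, so the edge has two primary subsplit pairs.
print(mu_edge("AB|CD", [("A|B", "AB|CD"), ("C|D", "AB|CD")]))   # -1.8
```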
In practice, the lower bound is usually maximized via stochastic gradient ascent (SGA). However,
the naive stochastic gradient estimator obtained by differentiating the lower bound has very large
variance and is impractical for our purpose. Fortunately, various variance reduction techniques have
been introduced in recent years including the control variate (Paisley et al., 2012; Ranganath et al.,
2014; Mnih & Gregor, 2014; Mnih & Rezende, 2016) for general latent variables and the reparam-
eterization trick (Kingma & Welling, 2014) for continuous latent variables. In the following, we
apply these techniques to the different components of our latent variables and derive efficient gradient estimators with much lower variance for each. In addition, we also consider a stable gradient
estimator based on an alternative variational objective. See Appendix A for derivations.
The VIMCO Estimator   Let f_{φ,ψ}(τ, q) = \frac{p(Y|τ, q)\, p(τ, q)}{Q_φ(τ)\, Q_ψ(q|τ)}. The stochastic lower bound with K samples is

L̂^K(φ, ψ) = \log\left(\frac{1}{K} \sum_{i=1}^{K} f_{φ,ψ}(τ^i, q^i)\right).

Mnih & Rezende (2016) propose a localized
learning signal strategy that significantly reduces the variance of the naive gradient estimator by
utilizing the independence between the multiple samples and the regularity of the learning signal,
which estimates the gradient as follows
∇_φ L^K(φ, ψ) = E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \sum_{j=1}^{K} \left(L̂^K_{j|-j}(φ, ψ) − w̃^j\right) ∇_φ \log Q_φ(τ^j)    (5)
where

L̂^K_{j|-j}(φ, ψ) := L̂^K(φ, ψ) − \log \frac{1}{K}\left(\sum_{i \neq j} f_{φ,ψ}(τ^i, q^i) + f̂_{φ,ψ}(τ^{-j}, q^{-j})\right)

is the per-sample local learning signal, with f̂_{φ,ψ}(τ^{-j}, q^{-j}) being some estimate of f_{φ,ψ}(τ^j, q^j) for sample j using the rest of the samples (e.g., the geometric mean), and w̃^j = f_{φ,ψ}(τ^j, q^j) / \sum_{i=1}^{K} f_{φ,ψ}(τ^i, q^i) is the self-normalized importance weight. This gives the following VIMCO estimator
∇_φ L^K(φ, ψ) ≈ \sum_{j=1}^{K} \left(L̂^K_{j|-j}(φ, ψ) − w̃^j\right) ∇_φ \log Q_φ(τ^j),    τ^j, q^j \overset{iid}{\sim} Q_{φ,ψ}(τ, q).    (6)
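The following sketch assembles the VIMCO learning signals of equations 5–6 from K sampled values of log f, using the leave-one-out geometric mean as the per-sample estimate f̂; the inputs are dummy numbers and the gradient of log Q_φ is left symbolic, so this is an illustration of the estimator, not the paper's code.

```python
# Sketch of the VIMCO per-sample learning signals, computed in log space for stability.
import numpy as np

def vimco_signals(log_f):
    """Return (learning_signal_j, w_tilde_j) for each of the K samples."""
    K = len(log_f)
    log_f_hat = np.logaddexp.reduce(log_f) - np.log(K)        # log (1/K sum_i f_i)
    # Leave-one-out geometric mean of the other samples replaces f_j in the baseline.
    loo_mean_log = (log_f.sum() - log_f) / (K - 1)
    baselines = np.empty(K)
    for j in range(K):
        replaced = log_f.copy()
        replaced[j] = loo_mean_log[j]
        baselines[j] = np.logaddexp.reduce(replaced) - np.log(K)
    w_tilde = np.exp(log_f - np.logaddexp.reduce(log_f))       # self-normalized weights
    return (log_f_hat - baselines) - w_tilde, w_tilde

log_f = np.array([-10.2, -9.7, -11.5, -10.0])                  # dummy log f(tau^j, q^j), K = 4
signals, w_tilde = vimco_signals(log_f)
# Gradient estimate: sum_j signals[j] * grad_phi log Q_phi(tau^j).
print(signals, w_tilde)
```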
The Reparameterization Trick The VIMCO estimator also works for the branch length gradient.
However, as branch lengths are continuous latent variables, we can use the reparameterization trick
to estimate the gradient. Because the Log-normal distribution has a simple reparameterization, q ∼ Lognormal(µ, σ²) ⇔ q = exp(µ + σε), ε ∼ N(0, 1), we can rewrite the lower bound:

L^K(φ, ψ) = E_{Q_{φ,ε}(τ^{1:K}, ε^{1:K})} \log\left(\frac{1}{K} \sum_{j=1}^{K} \frac{p(Y|τ^j, g_ψ(ε^j|τ^j))\, p(τ^j, g_ψ(ε^j|τ^j))}{Q_φ(τ^j)\, Q_ψ(g_ψ(ε^j|τ^j)|τ^j)}\right)
where g_ψ(ε|τ) = exp(µ_{ψ,τ} + σ_{ψ,τ} ⊙ ε). Then the gradient of the lower bound w.r.t. ψ is

∇_ψ L^K(φ, ψ) = E_{Q_{φ,ε}(τ^{1:K}, ε^{1:K})} \sum_{j=1}^{K} w̃^j ∇_ψ \log f_{φ,ψ}(τ^j, g_ψ(ε^j|τ^j))    (7)
where w̃^j = f_{φ,ψ}(τ^j, g_ψ(ε^j|τ^j)) / \sum_{i=1}^{K} f_{φ,ψ}(τ^i, g_ψ(ε^i|τ^i)) is the same self-normalized importance weight as in equation 5. Therefore, we can form the Monte Carlo estimator of the gradient

∇_ψ L^K(φ, ψ) ≈ \sum_{j=1}^{K} w̃^j ∇_ψ \log f_{φ,ψ}(τ^j, g_ψ(ε^j|τ^j)),    τ^j \overset{iid}{\sim} Q_φ(τ), \; ε^j \overset{iid}{\sim} N(0, I).    (8)
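To illustrate why the reparameterized estimator is attractive, the sketch below contrasts the score-function and pathwise gradients of an expectation under a single Lognormal branch length; the integrand h is a stand-in for the log joint term, and the whole example is a toy illustration of the trick rather than the multi-sample estimator of equation 8.

```python
# Toy comparison of score-function vs. reparameterization (pathwise) gradient estimates
# of d/d mu E_{q ~ Lognormal(mu, sigma^2)}[h(q)].
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = -2.0, 0.5

def h(q):                 # toy integrand, e.g. an Exp(10) log prior on the branch length
    return np.log(10.0) - 10.0 * q

def h_prime(q):
    return -10.0 * np.ones_like(q)

eps = rng.standard_normal(10_000)
q = np.exp(mu + sigma * eps)                       # reparameterized samples

# Pathwise estimator: E[h'(q) * dq/dmu], with dq/dmu = q.
grad_reparam = np.mean(h_prime(q) * q)

# Score-function estimator of the same gradient: E[h(q) * d/dmu log Lognormal(q; mu, sigma)].
score = (np.log(q) - mu) / sigma**2
grad_score = np.mean(h(q) * score)

print(grad_reparam, grad_score)    # both estimate ~ -1.53; the pathwise estimate has lower variance
```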
We can use an importance sampling estimator to compute the gradient of the objective

∇_φ L̃(φ, ψ) = E_{p(τ,q|Y)} ∇_φ \log Q_{φ,ψ}(τ, q) = \frac{1}{p(Y)} E_{Q_{φ,ψ}(τ,q)} \frac{p(Y|τ, q)\, p(τ, q)}{Q_φ(τ)\, Q_ψ(q|τ)} ∇_φ \log Q_φ(τ)

≈ \sum_{j=1}^{K} w̃^j ∇_φ \log Q_φ(τ^j),    τ^j, q^j \overset{iid}{\sim} Q_{φ,ψ}(τ, q)    (10)
with the same importance weights w̃j as in equation 5. This can be viewed as a multi-sample gen-
eralization of the wake-sleep algorithm (Hinton et al., 1995) and was first used in the reweighted
wake-sleep algorithm (Bornschein & Bengio, 2015) for training deep generative models. We there-
fore call the gradient estimator in equation 10 the RWS estimator. Like the VIMCO estimator, the
RWS estimator also provides gradients for branch lengths. However, we find in practice that equation 8, which uses the reparameterization trick, is more useful and often leads to faster convergence,
although it uses a different optimization objective. A better understanding of this phenomenon
would be an interesting subject of future research.
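For contrast, a minimal sketch of the RWS-style update of equation 10, which simply reweights the score of the tree distribution by self-normalized importance weights; the inputs below are dummy values for illustration.

```python
# Sketch of the RWS gradient of equation 10 with dummy inputs.
import numpy as np

log_f = np.array([-10.2, -9.7, -11.5, -10.0])                               # dummy log f(tau^j, q^j), K = 4
grad_log_Q = np.array([[0.3, -1.0], [0.1, 0.4], [-0.2, 0.8], [0.5, 0.0]])   # dummy grad_phi log Q(tau^j)

w_tilde = np.exp(log_f - np.logaddexp.reduce(log_f))          # self-normalized importance weights
grad_phi = (w_tilde[:, None] * grad_log_Q).sum(axis=0)        # sum_j w_j * grad_phi log Q(tau^j)
print(grad_phi)
```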
All stochastic gradient estimators introduced above can be used in conjunction with stochastic op-
timization methods such as SGA or some of its adaptive variants (e.g., Adam; Kingma & Ba, 2015)
to maximize the lower bounds. See algorithm 1 in Appendix B for a basic variational Bayesian
phylogenetic inference (VBPI) approach.
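A skeleton of such a training loop (playing the role of Algorithm 1, which is not reproduced in this extract) might look as follows; all functions, dimensions, and the plain SGA update are placeholders standing in for the estimators above, where the paper uses Adam.

```python
# Skeleton of the VBPI stochastic optimization loop with placeholder gradient functions.
import numpy as np

rng = np.random.default_rng(0)
phi = np.zeros(5)            # CPT parameters (placeholder dimensionality)
psi = np.zeros(8)            # branch length parameters (placeholder dimensionality)
K, lr = 20, 0.001

def sample_trees_and_eps(phi, K):              # tau^j ~ Q_phi, eps^j ~ N(0, I)
    return [None] * K, rng.standard_normal((K, psi.size))

def vimco_grad_phi(phi, psi, trees, eps):      # placeholder for the estimator of equation 6
    return rng.standard_normal(phi.shape) * 0.01

def reparam_grad_psi(phi, psi, trees, eps):    # placeholder for the estimator of equation 8
    return rng.standard_normal(psi.shape) * 0.01

for step in range(100):
    trees, eps = sample_trees_and_eps(phi, K)
    phi += lr * vimco_grad_phi(phi, psi, trees, eps)    # plain SGA here; the paper uses Adam
    psi += lr * reparam_grad_psi(phi, psi, trees, eps)
```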
4 EXPERIMENTS
Throughout this section we evaluate the effectiveness and efficiency of our variational framework
for inference over phylogenetic trees. The simplest SBN (the one with a full and complete binary
tree structure) is used for the phylogenetic tree topology variational distribution; we have found it to
provide sufficiently accurate approximation. For real datasets, we estimate the CPT supports from
ultrafast maximum likelihood phylogenetic bootstrap trees using UFBoot (Minh et al., 2013), which
is a fast approximate bootstrap method based on efficient heuristics. We compare the performance
of the VIMCO estimator and the RWS estimator with different variational parameterizations for
the branch length distributions, while varying the number of samples in the training objective to
see how these affect the quality of the variational approximations.

[Figure 3: evidence lower bound (left) and KL divergence to the ground truth (middle) over thousands of training iterations, and variational approximation versus ground truth per-tree probabilities (right, for VIMCO(50) and RWS(50)), comparing VIMCO and RWS with 20 and 50 samples.]

For VIMCO, we use Adam
for stochastic gradient ascent with learning rate 0.001 (Kingma & Ba, 2015). For RWS, we also
use AMSGrad (Sashank et al., 2018), a recent variant of Adam, when Adam is unstable. Results
were collected after 200,000 parameter updates. The KL divergences reported are over the discrete
collection of phylogenetic tree structures, from the trained SBN distribution to the ground truth, and a
low KL divergence means a high quality approximation of the distribution of trees.
with the exact evidence being log(1) = 0. We then use both the VIMCO and RWS estimators to
optimize the above lower bound based on 20 and 50 samples (K). We use a slightly larger learning
rate (0.002) in AMSGrad for RWS.
Figure 3 shows the empirical performance of different methods. From the left plot, we see that the
lower bounds converge rapidly and the gaps between lower bounds and the exact model evidence
are close to zero, demonstrating the expressive power of SBNs on phylogenetic tree probability es-
timations. The evolution of KL divergences (middle plot) is consistent with the lower bounds. All
methods benefit from using more samples, with VIMCO performing better in the end and RWS
learning slightly faster at the beginning.1 The slower start of VIMCO is partly due to the regular-
ization term in the lower bounds, which turns out to be beneficial for the overall performance since
the regularization encourages the diversity of the variational approximation and leads to more thorough exploration in the starting phase, similar to the exploring starts (ES) strategy in reinforcement
learning (Sutton & Barto, 1998). The right plot compares the variational approximations obtained
by VIMCO and RWS, both with 50 samples, to the ground truth p_0(τ). We see that VIMCO slightly underestimates trees in high-probability areas as a result of the regularization effect. While RWS provides better approximations for trees in high-probability areas, it tends to underestimate trees in low-probability areas, which deteriorates the overall performance. We expect the biases in both approaches to be alleviated with more samples.
1 Although we use larger learning rates for RWS in this experiment, we found RWS generally learns slightly faster than VIMCO at the beginning. See Figure 4 for the real data phylogenetic inference problems in section 4.2 where we use Adam with learning rate 0.001 for both methods.
Figure 4: Performance on DS1. Left: KL divergence for methods that use the simple split-based
parameterization for the branch length distributions. Middle: KL divergence for methods that use
PSP. Right: Per-tree marginal likelihood estimation (in nats): VBPI vs GSS. The number in brackets
specifies the number of samples used in the training objective. MCMC results are averaged over 10
independent runs. The results for VBPI were obtained using 1000 samples and the error bar shows
one standard deviation over 100 independent runs.
In the second set of experiments we evaluate the proposed variational Bayesian phylogenetic in-
ference (VBPI) algorithms at estimating unrooted phylogenetic tree posteriors on 8 real datasets
commonly used to benchmark phylogenetic MCMC methods (Lakner et al., 2008; Höhna & Drum-
mond, 2012; Larget, 2013; Whidden & Matsen IV, 2015) (Table 1). We concentrate on the most
challenging part of the phylogenetic model: joint learning of the tree topologies and the branch
lengths. We assume a uniform prior on the tree topology, an i.i.d. exponential prior (Exp(10))
for the branch lengths and the simple Jukes & Cantor (1969) substitution model. We consider two
different variational parameterizations for the branch length distributions as introduced in section
3.1. In the first case, we use the simple split-based parameterization that assigns parameters to the
splits associated with the edges of the trees. In the second case, we assign additional parameters for
the primary subsplit pairs (PSP) to better capture the between-tree variation. We form our ground
truth posterior from an extremely long MCMC run of 10 billion iterations (sampled each 1000 itera-
tions with the first 25% discarded as burn-in) using MrBayes (Ronquist et al., 2012), and gather the
support of CPTs from 10 replicates of 10000 ultrafast maximum likelihood bootstrap trees (Minh
et al., 2013). Following Rezende & Mohamed (2015), we use a simple annealed version of the lower
bound which was found to provide better results. The modified bound is:
L^K_{β_t}(φ, ψ) = E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \log\left(\frac{1}{K} \sum_{i=1}^{K} \frac{[p(Y|τ^i, q^i)]^{β_t}\, p(τ^i, q^i)}{Q_φ(τ^i)\, Q_ψ(q^i|τ^i)}\right)
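A sketch of how such an annealed weight β_t might enter the objective is shown below; the linear warm-up schedule is an assumption for illustration only, not the schedule used in the paper.

```python
# Sketch of annealing the likelihood term inside the multi-sample bound.
def beta(t, warmup_steps=100_000, beta0=0.001):
    """Inverse-temperature on the likelihood, rising linearly to 1 (assumed schedule)."""
    return min(1.0, beta0 + (1.0 - beta0) * t / warmup_steps)

def annealed_log_weight(log_lik, log_prior, log_q, t):
    # log of one summand of L^K_{beta_t}: beta_t * log p(Y|tau,q) + log p(tau,q) - log Q(tau,q)
    return beta(t) * log_lik + log_prior - log_q

print(beta(0), beta(50_000), beta(200_000))
```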
Table 1: Data sets used for variational phylogenetic posterior estimation, and marginal likelihood
estimates of different methods across datasets. The marginal likelihood estimates of all variational
methods are obtained by importance sampling using 1000 samples. We run stepping-stone in Mr-
Bayes using default settings with 4 chains for 10,000,000 iterations and sampled every 100 iterations.
The results are averaged over 10 independent runs with standard deviation in brackets.
as burn-in. For a relatively fair comparison (in terms of the number of likelihood evaluations),
we compare 10 (i.e. 2·20/4) times the number of MCMC iterations with the number of 20-sample
objective VBPI iterations.2 Although MCMC converges faster at the start, we see that VBPI methods
(especially those with PSP) quickly surpass MCMC and arrive at good approximations with much
less computation. This is because VBPI iteratively updates the approximate distribution of trees
(e.g., SBNs) which in turn allows guided exploration in the tree topology space. VBPI also provides
the same majority-rule consensus tree as the ground truth MCMC run (Figure 5 in Appendix D).
The variational approximations provided by VBPI can be readily used to perform importance sam-
pling for phylogenetic inference (more details in Appendix C). The right plot of Figure 4 compares
VBPI using VIMCO with 20 samples and PSP to the state-of-the-art generalized stepping-stone
(GSS) (Fan et al., 2011) algorithm for estimating the marginal likelihood of trees in the 95% credi-
ble set of DS1. For GSS, we use 50 power posteriors and for each power posterior we run 1,000,000
MCMC iterations, sampling every 1000 iterations with the first 10% discarded as burn-in. The ref-
erence distribution for GSS was obtained from an independent Gamma approximation using the
maximum a posteriori estimate. Table 1 shows the estimates of the marginal likelihood of the data
(i.e., model evidence) using different VIMCO approximations and one of the state-of-the-art meth-
ods, the stepping-stone (SS) algorithm (Xie et al., 2011). For each data set, all methods provide
estimates for the same marginal likelihood, with better approximation leading to lower variance.
We see that VBPI using 1000 samples is already competitive with SS using 100000 samples and
provides estimates with much less variance (hence more reproducible and reliable). Again, the extra flexibility enabled by PSP alleviates the demand for a larger number of samples in the training objective, making it possible to achieve high-quality variational approximations with fewer samples.
5 DISCUSSION
In this work we introduced VBPI, a general variational framework for Bayesian phylogenetic in-
ference. By combining subsplit Bayesian networks, a recent framework that provides flexible dis-
tributions of trees, and efficient structured parameterizations for branch length distributions, VBPI
exhibits guided exploration (enabled by SBNs) in tree space and provides competitive performance
to MCMC methods with less computation. Moreover, variational approximations provided by VBPI
can be readily used for further statistical analysis such as marginal likelihood estimation for model
comparison via importance sampling, which, compared to MCMC based methods, dramatically re-
duces the cost at test time. We report promising numerical results demonstrating the effectiveness
and efficiency of VBPI on a benchmark of real data Bayesian phylogenetic inference problems.
When the data are weak and posteriors are diffuse, support estimation of CPTs becomes challenging.
However, compared to classical MCMC approaches in phylogenetics that need to traverse the enor-
mous support of posteriors on complete trees to accurately evaluate the posterior probabilities, the
SBN parameterization in VBPI has a natural advantage in that it alleviates this issue by factorizing
the uncertainty of complete tree topologies into local structures.
2 The extra factor of 2/4 is because the likelihood and the gradient can be computed together in twice the time of a likelihood (Schadt et al., 1998) and we run MCMC with 4 chains.
Many topics remain for future work: constructing more flexible approximations for the branch length
distributions (e.g., using normalizing flow (Rezende & Mohamed, 2015) for within-tree approxima-
tion and deep networks for the modeling of between-tree variation), deeper investigation of support
estimation approaches in different data regimes, and efficient training algorithms for general varia-
tional inference on discrete / structured latent variables.
ACKNOWLEDGMENTS
This work was supported by National Science Foundation grant CISE-1564137, as well as National
Institutes of Health grant R01-GM113246. The research of Frederick Matsen was supported in part
by a Faculty Scholar grant from the Howard Hughes Medical Institute and the Simons Foundation.
REFERENCES
D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians.
Journal of the American Statistical Association, 112(518):859–877, 2017.
Jörg Bornschein and Yoshua Bengio. Reweighted wake-sleep. In Proceedings of the International
Conference on Learning Representations (ICLR), 2015.
Vu Dinh, Arman Bilge, Cheng Zhang, and Frederick A Matsen IV. Probabilistic Path Hamiltonian
Monte Carlo. In Proceedings of the 34th International Conference on Machine Learning, pp.
1009–1018, July 2017. URL http://proceedings.mlr.press/v70/dinh17a.html.
Gytis Dudas, Luiz Max Carvalho, Trevor Bedford, Andrew J Tatem, Guy Baele, Nuno R Faria,
Daniel J Park, Jason T Ladner, Armando Arias, Danny Asogun, Filip Bielejec, Sarah L Caddy,
Matthew Cotten, Jonathan D’Ambrozio, Simon Dellicour, Antonino Di Caro, Joseph W Diclaro,
Sophie Duraffour, Michael J Elmore, Lawrence S Fakoli, Ousmane Faye, Merle L Gilbert, Sahr M
Gevao, Stephen Gire, Adrianne Gladden-Young, Andreas Gnirke, Augustine Goba, Donald S
Grant, Bart L Haagmans, Julian A Hiscox, Umaru Jah, Jeffrey R Kugelman, Di Liu, Jia Lu, Chris-
tine M Malboeuf, Suzanne Mate, David A Matthews, Christian B Matranga, Luke W Meredith,
James Qu, Joshua Quick, Suzan D Pas, My V T Phan, Georgios Pollakis, Chantal B Reusken,
Mariano Sanchez-Lockhart, Stephen F Schaffner, John S Schieffelin, Rachel S Sealfon, Etienne
Simon-Loriere, Saskia L Smits, Kilian Stoecker, Lucy Thorne, Ekaete Alice Tobin, Mohamed A
Vandi, Simon J Watson, Kendra West, Shannon Whitmer, Michael R Wiley, Sarah M Winnicki,
Shirlee Wohl, Roman Wölfel, Nathan L Yozwiak, Kristian G Andersen, Sylvia O Blyden, Fa-
torma Bolay, Miles W Carroll, Bernice Dahn, Boubacar Diallo, Pierre Formenty, Christophe
Fraser, George F Gao, Robert F Garry, Ian Goodfellow, Stephan Günther, Christian T Happi, Ed-
ward C Holmes, Brima Kargbo, Sakoba Keı̈ta, Paul Kellam, Marion P G Koopmans, Jens H Kuhn,
Nicholas J Loman, N’faly Magassouba, Dhamari Naidoo, Stuart T Nichol, Tolbert Nyenswah,
Gustavo Palacios, Oliver G Pybus, Pardis C Sabeti, Amadou Sall, Ute Ströher, Isatta Wurie,
Marc A Suchard, Philippe Lemey, and Andrew Rambaut. Virus genomes reveal factors that
spread and sustained the ebola epidemic. Nature, April 2017. ISSN 0028-0836, 1476-4687. doi:
10.1038/nature22040. URL http://dx.doi.org/10.1038/nature22040.
Y. Fan, R. Wu, M.-H. Chen, L. Kuo, and P. O. Lewis. Choosing among partition models in Bayesian
phylogenetics. Mol. Biol. Evol., 28(1):523–532, 2011.
Roberto Feuda, Martin Dohrmann, Walker Pett, Hervé Philippe, Omar Rota-Stabelli, Nicolas Lar-
tillot, Gert Wörheide, and Davide Pisani. Improved modeling of compositional heterogene-
ity supports sponges as sister to all other animals. Curr. Biol., 27(24):3864–3870.e4, De-
cember 2017. ISSN 0960-9822, 1879-0445. doi: 10.1016/j.cub.2017.11.008. URL http:
//dx.doi.org/10.1016/j.cub.2017.11.008.
S. B. Hedges, K. D. Moberg, and L. R. Maxson. Tetrapod phylogeny inferred from 18S and 28S
ribosomal RNA sequences and review of the evidence for amniote relationships. Mol. Biol. Evol.,
7:607–633, 1990.
D. A. Henk, A. Weir, and M. Blackwell. Laboulbeniopsis termitarius, an ectoparasite of termites
newly recognized as a member of the Laboulbeniomycetes. Mycologia, 95:561–564, 2003.
G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The wake-sleep algorithm for unsupervised
neural networks. Science, 268:1158–1161, 1995.
S. Höhna, T. A. Heath, B. Boussau, M. J. Landis, F. Ronquist, and J. P. Huelsenbeck. Probabilistic
graphical model representation in phylogenetics. Syst. Biol., 63:753–771, 2014.
Sebastian Höhna and Alexei J. Drummond. Guided tree topology proposals for Bayesian phyloge-
netic inference. Syst. Biol., 61(1):1–11, January 2012. ISSN 1063-5157. doi: 10.1093/sysbio/
syr074. URL http://dx.doi.org/10.1093/sysbio/syr074.
J. P. Huelsenbeck and F. Ronquist. MrBayes: Bayesian inference of phylogeny. Bioinformatics, 17:
754–755, 2001.
J. P. Huelsenbeck, F. Ronquist, R. Nielsen, and J. P. Bollback. Bayesian inference of phylogeny and
its impact on evolutionary biology. Science, 294:2310–2314, 2001.
Thibaut Jombart, Michelle Kendall, Jacob Almagro-Garcia, and Caroline Colijn. treespace: Statisti-
cal exploration of landscapes of phylogenetic trees. Molecular Ecology Resources, 17:1385–1392,
2017. URL https://doi.org/10.1111/1755-0998.12676.
M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for
graphical models. Machine Learning, 37:183–233, 1999.
T. H. Jukes and C. R. Cantor. Evolution of protein molecules. In H. N. Munro (ed.), Mammalian
protein metabolism, III, pp. 21–132, New York, 1969. Academic Press.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT
Press, 2009.
C. Lakner, P. van der Mark, J. P. Huelsenbeck, B. Larget, and F. Ronquist. Efficiency of Markov
chain Monte Carlo tree proposals in Bayesian phylogenetics. Syst. Biol., 57:86–103, 2008.
Bret Larget. The estimation of tree posterior probabilities using conditional clade probability dis-
tributions. Syst. Biol., 62(4):501–511, July 2013. ISSN 1063-5157. doi: 10.1093/sysbio/syt014.
URL http://dx.doi.org/10.1093/sysbio/syt014.
Nicolas Lartillot and Hervé Philippe. A Bayesian mixture model for across-site heterogeneities in
the amino-acid replacement process. Mol. Biol. Evol., 21(6):1095–1109, June 2004. ISSN 0737-
4038. doi: 10.1093/molbev/msh112. URL http://dx.doi.org/10.1093/molbev/
msh112.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
B. Mau, M. Newton, and B. Larget. Bayesian phylogenetic inference via Markov chain Monte Carlo
methods. Biometrics, 55:1–12, 1999.
B. Q. Minh, M. A. T. Nguyen, and A. von Haeseler. Ultrafast approximation for phylogenetic
bootstrap. Mol. Biol. Evol., 30:1188–1195, 2013.
A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In Proceedings
of The 31th International Conference on Machine Learning, pp. 1791–1799, 2014.
Andriy Mnih and Danilo Rezende. Variational inference for monte carlo objectives. In Proceedings
of the 33rd International Conference on Machine Learning, pp. 1791–1799, 2016.
Radford Neal. MCMC using hamiltonian dynamics. In S Brooks, A Gelman, G Jones, and XL Meng
(eds.), Handbook of Markov Chain Monte Carlo, Chapman & Hall/CRC Handbooks of Modern
Statistical Methods. Taylor & Francis, 2011. ISBN 9781420079425. URL http://books.
google.com/books?id=qfRsAIKZ4rIC.
J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational bayesian inference with stochastic search. In
Proceedings of the 29th International Conference on Machine Learning ICML, 2012.
R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In AISTATS, pp. 814–822,
2014.
D. Rezende and S. Mohamed. Variational inference with normalizing flow. In Proceedings of The
32nd International Conference on Machine Learning, pp. 1530–1538, 2015.
F. Ronquist, M. Teslenko, P. van der Mark, D. L. Ayres, A. Darling, S. Höhna, B. Larget, L. Liu, M. A. Suchard, and J. P. Huelsenbeck. MrBayes 3.2: efficient Bayesian phylogenetic inference
and model choice across a large model space. Syst. Biol., 61:539–542, 2012.
A. Y. Rossman, J. M. Mckemy, R. A. Pardo-Schultheiss, and H. J. Schroers. Molecular studies of
the Bionectriaceae using large subunit rDNA sequences. Mycologia, 93:100–110, 2001.
J. R. Sashank, K. Satyen, and K. Sanjiv. On the convergence of adam and beyond. In ICLR, 2018.
Eric E. Schadt, Janet S. Sinsheimer, and Kenneth Lange. Computational advances in maximum
likelihood methods for molecular phylogeny. Genome Res., 8:222–233, 1998. doi: 10.1101/gr.8.
3.222.
K. Strimmer and V. Moulton. Likelihood analysis of phylogenetic networks using directed graphical
models. Molecular biology and evolution, 17:875–881, 2000.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational infer-
ence. Foundations and Trends in Maching Learning, 1(1-2):1–305, 2008.
Chris Whidden and Frederick A Matsen IV. Quantifying MCMC exploration of phylogenetic tree
space. Syst. Biol., 64(3):472–491, May 2015. ISSN 1063-5157, 1076-836X. doi: 10.1093/sysbio/
syv006. URL http://dx.doi.org/10.1093/sysbio/syv006.
W. Xie, P. O. Lewis, Y. Fan, L. Kuo, and M.-H. Chen. Improving marginal likelihood estimation for
Bayesian phylogenetic model selection. Syst. Biol., 60:150–160, 2011.
Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: a Markov chain
Monte Carlo method. Mol. Biol. Evol., 14:717–724, 1997.
Z. Yang and A. D. Yoder. Comparison of likelihood and Bayesian methods for estimating divergence
times using multiple gene loci and calibration points, with application to a radiation of cute-
looking mouse lemur species. Syst. Biol., 52:705–716, 2003.
A. D. Yoder and Z. Yang. Divergence dates for Malagasy lemurs estimated from multiple gene loci:
geological and evolutionary context. Mol. Ecol., 13:757–773, 2004.
Cheng Zhang and Frederick A Matsen IV. Generalizing tree probability estimation via
bayesian networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-
Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31,
pp. 1451–1460. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/
7418-generalizing-tree-probability-estimation-via-bayesian-networks.
pdf.
N. Zhang and M. Blackwell. Molecular phylogeny of dogwood anthracnose fungus (Discula de-
structiva) and the Diaporthales. Mycologia, 93:355–365, 2001.
In this section we will derive the gradient for the multi-sample objectives introduced in section 3.
We start with the lower bound
L^K(φ, ψ) = E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \log\left(\frac{1}{K} \sum_{j=1}^{K} \frac{p(Y|τ^j, q^j)\, p(τ^j, q^j)}{Q_φ(τ^j)\, Q_ψ(q^j|τ^j)}\right) = E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \log\left(\frac{1}{K} \sum_{j=1}^{K} f_{φ,ψ}(τ^j, q^j)\right).
Using the product rule and noting that ∇_φ \log f_{φ,ψ}(τ^j, q^j) = −∇_φ \log Q_φ(τ^j),

∇_φ L^K(φ, ψ) = E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} ∇_φ \log\left(\frac{1}{K} \sum_{j=1}^{K} f_{φ,ψ}(τ^j, q^j)\right) + E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \sum_{j=1}^{K} \frac{∇_φ Q_φ(τ^j)}{Q_φ(τ^j)} \log\left(\frac{1}{K} \sum_{i=1}^{K} f_{φ,ψ}(τ^i, q^i)\right)

= E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \sum_{j=1}^{K} \frac{f_{φ,ψ}(τ^j, q^j)}{\sum_{i=1}^{K} f_{φ,ψ}(τ^i, q^i)} ∇_φ \log f_{φ,ψ}(τ^j, q^j) + E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \sum_{j=1}^{K} \log\left(\frac{1}{K} \sum_{i=1}^{K} f_{φ,ψ}(τ^i, q^i)\right) ∇_φ \log Q_φ(τ^j)

= E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \sum_{j=1}^{K} \left(L̂^K(φ, ψ) − w̃^j\right) ∇_φ \log Q_φ(τ^j).
Since ψ is not involved in the distribution with respect to which we take the expectation,

∇_ψ L^K(φ, ψ) = E_{Q_{φ,ε}(τ^{1:K}, ε^{1:K})} ∇_ψ \log\left(\frac{1}{K} \sum_{j=1}^{K} f_{φ,ψ}(τ^j, g_ψ(ε^j|τ^j))\right)

= E_{Q_{φ,ε}(τ^{1:K}, ε^{1:K})} \sum_{j=1}^{K} \frac{f_{φ,ψ}(τ^j, g_ψ(ε^j|τ^j))}{\sum_{i=1}^{K} f_{φ,ψ}(τ^i, g_ψ(ε^i|τ^i))} ∇_ψ \log f_{φ,ψ}(τ^j, g_ψ(ε^j|τ^j))

= E_{Q_{φ,ε}(τ^{1:K}, ε^{1:K})} \sum_{j=1}^{K} w̃^j ∇_ψ \log f_{φ,ψ}(τ^j, g_ψ(ε^j|τ^j)).
Next, we derive the gradient of the multi-sample likelihood objective used in RWS
The second to last step uses self-normalized importance sampling with K samples. ∇ψ L̃(φ, ψ) can
be computed in a similar way.
In our experiments, we use K = 1000. When taking a log transformation, the above Monte Carlo
estimate is no longer unbiased (for the evidence log p(Y )). Instead, it can be viewed as one sample
Monte Carlo estimate of the lower bound
L^K(φ, ψ) = E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} \log\left(\frac{1}{K} \sum_{i=1}^{K} \frac{p(Y|τ^i, q^i)\, p(τ^i, q^i)}{Q_φ(τ^i)\, Q_ψ(q^i|τ^i)}\right) ≤ \log p(Y)    (11)
whose tightness improves as the number of samples K increases. Therefore, with a sufficiently large
K, we can use the lower bound estimate as a proxy for Bayesian model selection.
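A sketch of this estimator: draw K samples from the variational approximation, compute the log importance ratios, and average them in log space; the log f values below are dummy numbers standing in for log[p(Y|τ,q)p(τ,q)/Q_{φ,ψ}(τ,q)].

```python
# Sketch of the importance-sampling evidence estimate used at test time.
import numpy as np

def log_marginal_likelihood_estimate(log_f):
    """log of the average importance ratio; its expectation lower-bounds log p(Y)."""
    return np.logaddexp.reduce(log_f) - np.log(len(log_f))

rng = np.random.default_rng(0)
log_f = -7040.0 + rng.standard_normal(1000)     # dummy log importance ratios for K = 1000 samples
print(log_marginal_likelihood_estimate(log_f))
```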
Figure 5: A comparison of majority-rule consensus trees obtained from VBPI and ground truth
MCMC run on DS1. Left: Ground truth MCMC. Right: VBPI (10000 sampled trees). The plot is
created using the treespace (Jombart et al., 2017) R package.