Abstract—Variational Bayes (VB), also known as independent mean-field approximation, has become a popular method for Bayesian network inference in recent years. Its application is vast, e.g. in neural network, compressed sensing, clustering, etc.

complexity of posterior estimate $\hat{\theta}(x)$ often grows exponentially with arriving data x and, hence, yields the curse of dimensionality [7]. For tractable computation, as shown in this paper, the VB algorithm iteratively projects the originally
will see that CVB, and its special case VB, iteratively projects the original distribution to a fixed copula constraint space until convergence. Then, similar to the fact that the mean is the point of minimum total distance to data, an augmented CVB approximation will also be designed as a distribution of minimum total Bregman divergence to the original distribution in this paper.

Three popular special cases of VB will also be revisited in this paper, namely Expectation-Maximization (EM) [21], [22], Iterated Conditional Mode (ICM) [23], [24] and k-means algorithms [25], [26]. In literature, the well-known EM algorithm was shown to be a special case of VB [1], [2], in which one of VB's marginals is restricted to a point estimate via Dirac delta function. In this paper, the EM algorithm will be shown not only to reach a local minimum of KL divergence, but also possibly to return a local maximum-a-posteriori (MAP) point estimate of the true marginal distribution. This justifies the superiority of the EM algorithm over VB in some cases of MAP estimation, since the peaks in VB marginals might not be the same as those of the true marginals.

If all VB marginals are restricted to Dirac delta space, the iterative VB algorithm will become the ICM algorithm, which returns a locally joint MAP estimate of the original distribution. Also, for standard Normal mixture clustering, the ICM algorithm is equivalent to the well-known k-means algorithm, as shown in this paper. The k-means algorithm is also equivalent to the Lloyd-Max algorithm [25], which has been widely used in quantization context [27].

For illustration, the CVB and its special cases mentioned above will be applied to two canonical models in this paper, namely bivariate Gaussian distribution and Gaussian mixture clustering. By tuning the correlation in these two models, the performance of CVB will be shown to be superior to that of state-of-the-art mean-field methods like VB, EM and k-means algorithms. An augmented CVB form for a generic Bayesian network will also be studied and applied to this Gaussian mixture model.

A. Related works

Although some generalized forms of VB have been proposed in literature, most of them are merely variants of mean-field approximations and, hence, still confined within the independent class. For example, in [28], [29], the so-called Conditionally Variational algorithm is an application of traditional VB to a joint conditional distribution $\tilde f(\theta|\xi) = \prod_{k=1}^{K} \tilde f_k(\theta_k|\xi)$, given a latent variable $\xi$. Hence, different to CVB above, the approximated marginal $\tilde f_\xi$ was not updated in their scheme. In [30], the so-called generalized mean-field algorithm merely applies the traditional VB method to the independent class of a set of variables, i.e. each $\theta_k$ consists of a set of variables. In [31], the so-called Structured Variational inference is the same as the generalized mean-field, except that the dependent structure inside the set $\theta_k$ is also specified. In summary, they are different ways of applying traditional VB, without changing the VB's updating formula. In contrast, the CVB in this paper involves new tractable formulae and a broader copula constraint class.

The closest form to the CVB of this paper is the so-called Copula Variational inference in [32], which fixes the form of approximated distribution $\tilde f_{\theta|\xi} = \tilde c_{\theta|\xi} \prod_{i=1}^{N} \tilde f_{\theta_i|\xi}$ and applies a gradient descent method upon the latent variable $\xi$ in order to find a local minimum of KL divergence. In contrast, the CVB in this paper is a free-form approximation, i.e. it does not impose any particular form initially, and provides higher-order moment estimates rather than a mere point estimate. Hence, the fixed-form constraint class in their Copula Variational inference is much more restricted than the free-form copula constraint class of CVB in this paper. Also, the iterative computation for CVB will be given in closed form with low complexity, rather than relying on point estimates of gradient descent methods.

B. Contributions and organization

The contributions of this paper are summarized as follows:
• A novel copula VB (CVB) algorithm, which extends the independent constraint class of traditional VB to a copula constraint class, will be given. The convergence of CVB will be proved via three methods: Lagrange multiplier method in calculus of variation, Jensen's inequality and the Bregman projection in information geometry. The two former methods have been used in literature for proof of convergence of traditional VB, while the third method is new and provides a unified scheme for the former two methods.
• The EM, ICM and k-means algorithms will be shown to be special cases of the traditional VB, i.e. they all locally minimize the KL divergence under a fixed-form independent constraint class.
• An augmented form of CVB, namely hierarchical CVB approximation, with linear computational complexity for a generic Bayesian network will also be provided.
• In simulations, the CVB algorithm for Gaussian mixture clustering will be illustrated. The classification performance of CVB will be shown to be far superior to that of VB, EM and k-means algorithms for this model.

The paper is organized as follows: since the Bregman projection in information geometry is insightful and plays a central role in the VB method, it will be presented first in section II. The definition and property of copula will then be introduced in section III. The novel copula VB (CVB) method and its special cases will be presented in section IV. The computational flow of CVB for a Bayesian network is studied in section V and will be applied to simulations in section VI. The paper is then concluded in section VII.

Note that, for notational simplicity, the notion of probability density function (p.d.f.) for continuous random variable (r.v.) in this paper will be implicitly understood as the probability mass function (p.m.f.) in the case of discrete r.v., when the context is clear.

II. INFORMATION GEOMETRY

In this section, we will revisit a geometric interpretation of one of the fundamental measures in information theory, namely Kullback-Leibler (KL) divergence, which is also the central
of $\gamma$ onto $\mathcal{X}$ and defined as follows:

$$\beta_{\mathcal{X}} \triangleq \arg\min_{\alpha \in \mathcal{X}} D(\alpha\|\gamma). \qquad (4)$$

From three-point property (2), we can see that the Bregman pythagorean inequality in (3) becomes equality for all $\alpha \in \mathcal{X}$ if and only if $\mathcal{X}$ is an affine set (i.e. the triple points $\{\alpha, \beta_{\mathcal{X}}, \gamma\}$ are Bregman orthogonal at $\beta_{\mathcal{X}}$, $\forall \alpha \in \mathcal{X}$).

Proof: Note that $\beta_{\mathcal{X}}$, as defined in (4), is not necessarily unique if $\mathcal{X}$ is not convex [20]. The uniqueness of $\beta_{\mathcal{X}}$ in (4) for convex set $\mathcal{X}$ can be proved either via contradiction [34] or via convexity of $\mathcal{X}$ in three-point property (2), c.f. [20], [37].

Theorem 5. (Bregman variance theorem - Jensen's inequality)
Let $x \in \mathbb{R}^K$ be a r.v. with mean $E[x]$ and variance $\mathrm{Var}[x]$. The Bregman variance $\mathrm{Var}_\phi[x]$ is defined as follows:

$$\mathrm{Var}_\phi[x] \triangleq E[D_\phi(x\|E[x])] = E[\phi(x)] - \phi(E[x]) \geq 0. \qquad (5)$$

Equivalently, we have:

$$\mathrm{Var}_\phi[x] \triangleq E[D_\phi(x\|E[x])] = E[D_\phi(x\|\tilde x)] - D_\phi(E[x]\|\tilde x) \geq 0 \qquad (6)$$

for any fixed point $\tilde x \in \mathbb{R}^K$. The right hand side (r.h.s.) of (5) is called Jensen's inequality in literature, i.e. $E(\phi(x)) \geq \phi(E(x))$, for any convex function $\phi$ [38]. Also, from (6), we have:

$$x_0 \triangleq E[x] = \arg\min_{\tilde x} E[D(x\|\tilde x)]. \qquad (7)$$

Proof: Let us show the proof in reverse order. Firstly, the mean property (7) is a consequence of (6), i.e. we have: $E[D(x\|\tilde x)] = E[D(x\|E[x])] + D(E[x]\|\tilde x)$ and $D(E[x]\|\tilde x) = 0 \Leftrightarrow \tilde x = E[x]$. Secondly, by replacing $\phi(x)$ in (5) with $\tilde\phi(x) = D_\phi(x\|\tilde x)$, the form (6) is equivalent to (5), owing to the affine equivalence property in Proposition 2. Lastly, the form (5) is a direct derivation from Bregman definition (1), with $\alpha = x$ and $\beta = E[x]$, as follows: $D(x\|E[x]) = \phi(x) - \phi(E[x]) - \langle x - E[x], \nabla\phi(E[x])\rangle$ and, hence, $E[D(x\|E[x])] = E[\phi(x)] - E[\phi(E[x])] - \langle E[x] - E[x], \nabla\phi(E[x])\rangle$, in which the last inner-product term is zero.

Figure 3. Illustration of equivalence between Jensen's inequality (left) and Bregman variance theorem (right). Similar to Fig. 1 and Fig. 2, the dashed contours on the right represent the convexity of $D_\phi(x\|\tilde x)$ over $x$, which, in turn, can be regarded as another convex function $\tilde\phi$ for Jensen's inequality on the left.

Remark 6. Although we have $\mathrm{Var}[x] \neq \mathrm{Var}_\phi[x]$ in general, the mean $E[x]$ is the same minimum point for any expected Bregman divergence, as shown in (7). This notable property of the mean has been exploited extensively for Bregman k-means algorithms in literature [34], [35].

A list of Bregman divergences, corresponding to different functional forms of $\phi(x)$, can be found in literature, e.g. in [20], [39]. Let us recall the two most popular forms below.

1) Euclidean distance: A special case of Bregman divergence is squared Euclidean distance [35]:

$$D_{\phi_E}(\alpha\|\beta) = \|\alpha - \beta\|^2, \quad \text{with } \phi_E(x) \triangleq \|x\|^2, \qquad (8)$$

where $\|\cdot\|$ denotes $L_2$-norm for elements of a vector or matrix. In this case, the Bregman pythagorean theorem (3) becomes the traditional Pythagorean theorem and the Bregman variance (5) becomes the traditional variance theorem, i.e. $\mathrm{Var}_{\phi_E}[x] = \mathrm{Var}[x] = E[\|x\|^2] - \|E[x]\|^2$.

2) Kullback-Leibler (KL) divergence: Another popular case of Bregman divergence is the KL divergence [35]:

$$KL(\alpha\|\beta) \triangleq D_{KL}(\alpha\|\beta) = \sum_{k=1}^{K} \alpha_k \log\frac{\alpha_k}{\beta_k} - \sum_{k=1}^{K} \alpha_k + \sum_{k=1}^{K} \beta_k,$$

where $\phi_{KL}(f(\theta)) \triangleq H(\theta) = E_{f(\theta)} \log f(\theta)$ is the continuous entropy and $D_{KL}(\tilde f\|f)$ is the Bregman divergence between two density distributions $\tilde f(\theta)$ and $f(\theta)$, as presented below.

B. Bregman divergence for functional space

In the calculus of variations, the Bregman divergence for vector space is a special case of the Bregman divergence for functional space, defined as follows:

Definition 7. (Bregman divergence for functional space) [33]
Let $\phi: L^p(\theta) \to \mathbb{R}$ be a strictly convex and twice Fréchet-differentiable functional over $L^p$-normed space. The Bregman divergence $D: L^p(\theta) \times L^p(\theta) \to \mathbb{R}^+$ between two functions $f, g \in L^p(\theta)$ is defined as follows:

$$D(f\|g) \triangleq \phi(f) - \phi(g) - \delta\phi(f - g;\, g), \qquad (10)$$

where $\delta\phi(\cdot\,; g)$ is the Fréchet derivative of $\phi$ at $g$.

Apart from gradient form, all well-known properties of Bregman divergence in Proposition 2 are also valid for functional space [33], [40]. Hence, we can derive the Bregman variance theorem for probabilistic functional space, as follows:

Proposition 8. (Bregman variance theorem for functions)
Let functional point $f(\theta)$ be a r.v. drawn from the functional
space $L^p(\theta)$ with functional mean $E[f] \triangleq E[f(\theta)]$ and functional variance $\mathrm{Var}[f] \triangleq E[\|f(\theta) - E[f]\|^2]$. Then we have:

$$\mathrm{Var}_\phi[f] \triangleq E[D(f\|E(f))] = E[\phi(f)] - \phi(E[f]) \geq 0.$$

Equivalently, we have:

$$\mathrm{Var}_\phi[f] \triangleq E[D(f\|E(f))] = E[D(f\|\tilde f)] - D(E[f]\|\tilde f) \geq 0,$$

for any functional point $\tilde f \triangleq \tilde f(\theta) \in L^p(\theta)$ and:

$$f_0 \triangleq E[f] = \arg\min_{\tilde f} E[D(f\|\tilde f)]. \qquad (11)$$

Proof: Because the Fréchet derivative in (10) is a linear operator like gradient in (1), we can derive the above results in the same manner of the proof of Theorem 5.

Figure 4. Application of Bregman variance theorem (13) to KL divergence in distribution space $f \in \mathcal{L}$, with the same convention in Fig. 3. As an example, the mixture $f_0(\theta) \triangleq E[f] = \sum_{i=1}^{5} p_i f_i(\theta)$ in (12) must lie inside the polytope $\mathcal{L} = \{f_1(\theta), \ldots, f_5(\theta)\}$. In middle sub-figure, $H(f)$ denotes the continuous entropy over p.d.f. $f$. The mixture $\tilde f = f_0 = E[f]$ is then the minimum functional point of $E[KL(f\|\tilde f)]$, which is also an upper bound of $KL(E[f]\|\tilde f)$ over $\tilde f \in \mathcal{L}$, as shown in (13-14).

Remark 9. From Proposition 8, we have $\mathrm{Var}[f] = \mathrm{Var}_{\phi_E}[f]$ for Euclidean case $\phi_E(f) = \|f(\theta) - E[f]\|^2$, but $\mathrm{Var}[f] \neq \mathrm{Var}_\phi[f]$ in general. The functional mean $f_0 \triangleq E[f]$ is also the same minimum function for any expected Bregman divergence, similarly to Remark 6.
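As a numerical sanity check of Theorem 5, Proposition 8 and Remark 9, the following minimal Python sketch treats a few discrete distributions as the random functional points and evaluates the three expressions of the Bregman variance for both the Euclidean and the KL case. The distributions `points`, the weights `probs` and the reference point `g` are illustrative assumptions, not values taken from the paper.

```python
import math

def phi_euclid(f):
    """phi_E(f) = ||f||^2."""
    return sum(x * x for x in f)

def phi_kl(f):
    """phi_KL(f) = sum_i f_i log f_i (sign convention used in the text)."""
    return sum(x * math.log(x) for x in f if x > 0.0)

def d_euclid(f, g):
    """Squared Euclidean distance, the Bregman divergence of phi_E."""
    return sum((a - b) ** 2 for a, b in zip(f, g))

def d_kl(f, g):
    """Generalised KL divergence; equals the usual KL for normalised f, g."""
    core = sum(a * math.log(a / b) for a, b in zip(f, g) if a > 0.0)
    return core - sum(f) + sum(g)

def expectation(values, probs):
    return sum(p * v for p, v in zip(probs, values))

def mean_point(points, probs):
    """Functional mean E[f], i.e. the mixture of the points."""
    dim = len(points[0])
    return [expectation([f[i] for f in points], probs) for i in range(dim)]

def bregman_variance_forms(points, probs, g, phi, div):
    """Return the three expressions that should coincide:
    E[phi(f)] - phi(E[f]),  E[D(f||E[f])],  E[D(f||g)] - D(E[f]||g)."""
    f0 = mean_point(points, probs)
    jensen_gap = expectation([phi(f) for f in points], probs) - phi(f0)
    var_phi = expectation([div(f, f0) for f in points], probs)
    shifted = expectation([div(f, g) for f in points], probs) - div(f0, g)
    return jensen_gap, var_phi, shifted

# Hypothetical example data (three distributions over three states).
points = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.25, 0.25, 0.5]]
probs = [0.5, 0.3, 0.2]
g = [0.3, 0.3, 0.4]

euclid_forms = bregman_variance_forms(points, probs, g, phi_euclid, d_euclid)
kl_forms = bregman_variance_forms(points, probs, g, phi_kl, d_kl)
```

Under these assumptions the three numbers in `euclid_forms` agree with each other, as do the three numbers in `kl_forms`, while the Euclidean and KL variances themselves differ, in line with Remark 9.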
For later use, let us apply Proposition 8 and show here the Bregman variance for a probabilistic mixture:

Corollary 10. (Bregman variance theorem for mixture)
Let functional point $f(\theta)$ be a r.v. drawn from a functional set $\mathbf{f} \triangleq \{f_1(\theta), \ldots, f_N(\theta)\}$ of $N$ distributions over $\theta$, with probabilities $p_i \in \mathbb{I} \triangleq [0, 1]$, $\sum_{i=1}^{N} p_i = 1$. The functional mean (11) is then regarded as a mixture, as follows:

$$f_0(\theta) \triangleq E[f] = \sum_{i=1}^{N} p_i f_i(\theta), \qquad (12)$$

with variance $\mathrm{Var}[f] = \sum_{i=1}^{N} p_i \|f_i(\theta) - f_0(\theta)\|^2$. The Bregman variance is then:

$$\mathrm{Var}_\phi[f] = \sum_{i=1}^{N} p_i D(f_i\|f_0) = \sum_{i=1}^{N} p_i D(f_i\|\tilde f) - D(f_0\|\tilde f) \geq 0, \qquad (13)$$

for any distribution $\tilde f \triangleq \tilde f(\theta)$ and, hence, from (11-12), we have:

$$f_0(\theta) = E[f] = \sum_{i=1}^{N} p_i f_i(\theta) = \arg\min_{\tilde f} \sum_{i=1}^{N} p_i D(f_i\|\tilde f). \qquad (14)$$

Proof: This case is a consequence of Proposition 8.

The case of KL divergence, which is a special case of Bregman variance with $\phi = \phi_{KL}$ in (13), is illustrated in Fig. 4.

Remark 11. The computation of KL variance via (13) for a mixture is often more feasible than the computation of Euclidean variance in practice. Indeed, the KL form corresponds to geometric mean [39], which can yield linear computational complexity over exponential coordinates (particularly for exponential family [20], [39]), while the Euclidean form corresponds to arithmetic mean, which would yield exponential computational complexity for exponential family distributions over Euclidean coordinates in general, as shown in section IV-B3.

Figure 5. Illustration of variable transformation from x to s in the case of continuous c.d.f. [43] (left), together with pseudo-inverse $F^{\leftarrow}(u)$ of a non-continuous c.d.f. $F(x)$ and their concatenations $F^{\leftarrow} \circ F(x)$, $F \circ F^{\leftarrow}(u)$ [42] (right). We can see that the uniqueness of copula requires the continuous property of c.d.f., since non-continuous c.d.f. does not preserve the inverse transformation.

The copula concept was first defined in [13], although it was also defined under different names such as “uniform representation” or “dependence function” [14]. The copula has been studied intensively for many decades in statistics, particularly for finance [41], [42]. Yet the application of copula in information theory is still limited at the moment. In this section, we will review the basic concept of copula and its direct connection to mutual information of a system. The KL divergence for copula, which is the core of CVB approximation in the next section, will be provided at the end of this section.

A. Sklar's theorem

Because Sklar's paper [13] is the beginning of copula's history, let us recall Sklar's theorem first.

Definition 12. (Pseudo-inverse function)
Let $F: \mathbb{R} \to \mathbb{I}$ be a cumulative distribution function (c.d.f.) of a r.v. $\theta \in \mathbb{R}$. Since $F(\theta)$ is not strictly increasing in general, as illustrated in Fig. 5, a pseudo-inverse function (also called quantile function) $F^{\leftarrow}: \mathbb{I} \to \mathbb{R}$ is defined as follows:

$$F^{\leftarrow}(u) \triangleq \inf\{\theta \in \mathbb{R} : F(\theta) \geq u\}, \quad u \in \mathbb{I}.$$

Note that the quasi-inverse $F^{\leftarrow}$ coincides with the inverse function $F^{-1}$ if $F(\theta)$ is continuous and strictly increasing, as illustrated in Fig. 5.

Theorem 13. (Sklar's theorem) [13], [16]
For any r.v. $\theta = [\theta_1, \theta_2, \ldots, \theta_K]^T \in \mathbb{R}^K$ with joint c.d.f. $F(\theta)$
Intuitively, the copula's rank-invariant property is merely a consequence of natural rank-invariant property of marginal c.d.f. under increasing transformation, as implied by definition of copula (15) and illustrated in Fig. 5.

3) Copula's marginal transformation: For later use, let us emphasize a very special case of rank-invariant property, namely marginal transformation. By definition (17), we can see that copula separates the dependence part of joint p.d.f. from its marginals. Hence, we can freely replace any marginal $F_k$ with new marginal $\tilde F_k$, $\forall k \in \{1, 2, \ldots, K\}$, without changing the form of copula, as shown below:

Corollary 18. (Copula's marginal-invariant property)
Let $\tilde\theta(\theta) \triangleq [\theta_1, \ldots, \tilde\theta_k(\theta_k), \ldots, \theta_K]^T \in \mathbb{R}^K$, in which r.v. $\theta_k$ in $\theta$ is replaced by a continuously transformed r.v. $\tilde\theta_k(\theta_k) \triangleq \tilde F_k^{\leftarrow}(F_k(\theta_k))$, for any $k \in \{1, 2, \ldots, K\}$. Then the density copulas $\tilde c_k$ and $c$ of $\tilde\theta(\theta)$ and $\theta$, respectively, have the same form, i.e. $\tilde c_k(u) = c(u)$, $\forall u \in \mathbb{I}^K$.

Proof: This corollary is a direct consequence of the copula's rank-invariant property, since the continuous c.d.f. functions $\tilde F_k^{\leftarrow}$ and $F_k$ are both strictly increasing functions for continuous variables.

The marginal-invariant property shows that when we replace the marginal distribution $f_k(\theta_k)$ of joint p.d.f. $f(\theta)$ in (17) by another marginal distribution $\tilde f_k(\theta_k)$, the resulted joint distribution $\tilde f(\theta)$ does not change its original copula form, i.e.:

$$\left.\begin{aligned} f(\theta) &= f(\theta_{\backslash k}|\theta_k)\, f_k(\theta_k) \\ \tilde f(\theta) &= f(\theta_{\backslash k}|\theta_k)\, \tilde f_k(\theta_k) \end{aligned}\right\} \Rightarrow \tilde c(u) = c(u), \ \forall u \in \mathbb{I}^K$$

Indeed, by Corollary 18, we have $f(\theta) = \tilde f(\tilde\theta(\theta))$, i.e. the distribution $\tilde f(\theta)$ is merely a marginally rescaling form of $f(\theta)$ and, hence, does not change the form of copula.

C. Copula's divergence

Because the copula is essentially a distribution itself, the KL divergence (9) can be applied directly to any two copulas. Let us show the relationship between joint p.d.f. and its copula via KL divergence in this subsection.

1) Mutual information: Because all dependencies in a joint p.d.f. $f$ in (17) are captured by its copula, it is natural that the mutual information of joint p.d.f. $f$ can also be computed via its copula form $c$ in (17), as shown below.

Proposition 19. (Mutual information)
For continuous copula $c$ in (17), the mutual information $I(\theta)$ of joint p.d.f. $f(\theta)$ is equal to continuous entropy $H$ of copula density $c(u(\theta))$, as follows:

$$I(\theta) = H(c(u)). \qquad (20)$$

Proof: The proof is straight-forward from definition of KL divergence (9) and copula density (17), as follows: $I(\theta) \triangleq KL(f(\theta)\|\prod_{k=1}^{K} f_k(\theta_k)) = E_{f(\theta)} \log \frac{f(\theta)}{\prod_{k=1}^{K} f_k(\theta_k)} = E_{f(\theta)} \log c(u(\theta)) = E_{c(u)} \log c(u) = H(c(u))$, in which $\theta$ was transformed to $u$ via rescaling property (18). For a special case of bivariate copula density, another proof was given in [17].

2) KL divergence (KLD): In literature, the below copula-based KL divergence for a joint p.d.f. was already given for a special case of conditional structure [44]. For later use, let us recall their proof here in a slightly more general form, via pseudo-inverse (16) and rescaling property (18).

Proposition 20. (Copula's divergence) [44]
The KLD of two joint p.d.f. $f$, $\tilde f$ in (17) is the sum of KLD of their copulas $c$, $\tilde c$ and KLDs of their marginals $f_k$, $\tilde f_k$, as follows:

$$KL_{\tilde f\|f} = KL(\tilde c(u)\|c(F(\tilde F^{\leftarrow}(u)))) + \sum_{k=1}^{K} KL_{\tilde f_k\|f_k} \qquad (21)$$

in which the copula $\tilde c$ of $\tilde f$ was rescaled back to marginal coordinates of $f$, i.e. $\tilde c(\tilde F(F^{\leftarrow}(u))) \triangleq \tilde c(\tilde F_1(F_1^{\leftarrow}(u_1)), \ldots, \tilde F_K(F_K^{\leftarrow}(u_K)))$.

Proof: By definition of KLD (9) and copula density (17), we have:

$$KL(f(\theta)\|\tilde f(\theta)) = E_{f(\theta)} \log\frac{f(\theta)}{\tilde f(\theta)} = E_{f(\theta)} \log\frac{c(u(\theta))}{\tilde c(\tilde u(\theta))} + \sum_{k=1}^{K} E_{f(\theta)} \log\frac{f_k(\theta_k)}{\tilde f_k(\theta_k)},$$

of which the second term in r.h.s. is actually KLDs of marginals, i.e. $E_{f(\theta)} \log\frac{f_k(\theta_k)}{\tilde f_k(\theta_k)} = E_{f_k(\theta_k)} \log\frac{f_k(\theta_k)}{\tilde f_k(\theta_k)} = KL(f_k(\theta_k)\|\tilde f_k(\theta_k))$, and the first term in r.h.s. is actually KLD of copulas, via rescaling property (18), as follows:

$$E_{f(\theta)} \log\frac{c(u(\theta))}{\tilde c(\tilde u(\theta))} = E_{f(\theta)} \log\frac{c(F_1(\theta_1), \ldots, F_K(\theta_K))}{\tilde c(\tilde F_1(\theta_1), \ldots, \tilde F_K(\theta_K))} = E_{c(u)} \log\frac{c(u)}{\tilde c(\tilde F(F^{\leftarrow}(u)))} = KL(c(u)\|\tilde c(\tilde F(F^{\leftarrow}(u)))).$$

Note that, by copula's marginal- and rank-invariant properties in section III-B, we can see that the marginal rescaling form $\tilde c(\tilde F(F^{\leftarrow}(u)))$ of $\tilde c$ in (21) does not change the original form of copula $\tilde c$.

Remark 21. If all $\tilde F_k$ are exact marginals of $F(\theta)$, i.e. $\tilde F_k = F_k$ in (21), $\forall k \in \{1, 2, \ldots, K\}$, we have $KL(f(\theta)\|\tilde f(\theta)) = KL(c(u)\|\tilde c(u))$. Furthermore, if $\tilde c(u)$ is also an independent copula, as noted in Remark 15, the KL divergence in (21) will be equal to mutual information $I(\theta)$ in (20).

As shown in (21), the KL divergence between any two distributions can always be factorized as the sum of KL divergence of their copulas and KL divergences of their marginals. Exploiting this property, we will design a novel iterative copula VB (CVB) algorithm in this section, such that the CVB distribution is closest to the true distribution in terms of KL divergence, under constraint of initially approximated copula's form. The mean-field approximations, which are special cases of CVB, will also be revisited later in this section.

A. Motivation of marginal approximation

Let us now consider a joint p.d.f. $f(\theta)$, of which the true marginals $f_k(\theta_k) = \int_{\theta_{\backslash k}} f(\theta)\, d\theta_{\backslash k}$, $k \in \{1, 2, \ldots, K\}$, are either unknown or too complicated to compute. A natural approximation of $f_k(\theta_k)$ is then to seek a closed form distribution $\tilde f_k(\theta_k)$ such that their KL divergences $\sum_{k=1}^{K} KL_{f_k\|\tilde f_k}$ in (21) is minimized. This direct approach is, however, not
feasible if the integration for true marginal $f_k(\theta_k)$ is very hard to compute at the beginning.

A popular approach in literature is to find an approximation $\tilde f(\theta)$ of the joint distribution $f(\theta)$ such that their KL divergence $KL_{\tilde f\|f} \triangleq KL(\tilde f(\theta)\|f(\theta))$ can be minimized. This indirect approach is more feasible since it circumvents the explicit form of $f_k(\theta_k)$. Also, since $KL_{\tilde f\|f}$ is the upper bound of $\sum_{k=1}^{K} KL_{\tilde f_k\|f_k}$, as shown in (21), it would yield good approximated marginals $\tilde f_k(\theta_k)$ if $KL_{\tilde f\|f}$ could be set low enough. This is the objective of CVB algorithm in this section.

Remark 22. Another approximation approach is to find $\tilde f(\theta)$ such that the copula's KL divergence $KL(\tilde c(u)\|c(F(\tilde F^{\leftarrow}(u))))$ in (21) is as close as possible to $KL(c(u)\|\tilde c(u))$, which is equivalent to the exact case $\tilde f_k = f_k$, $\forall k \in \{1, 2, \ldots, K\}$. This copula's analysis approach is promising, since the original copula form can be extracted from mutual dependence part of the original $f(\theta)$, without the need of marginal's normalization, as shown in [44] for a simple case of a Gaussian copula function. However, this approach would generally involve copula's explicit analysis, which is not a focus of this paper and will be left for future work.

B. Copula Variational approximation

Since the CVB algorithm is actually an iterative procedure of many Conditionally Variational approximation (CVA) steps, let us define the CVA step first, which is also illustrated in Fig. 7.

Figure 7. Illustration of Conditionally Variational approximation (CVA), as defined in (23). The lower KL divergence, the better approximation. Given initially a conditional form $\tilde f^*_{\theta_{\backslash k}|\theta_k}$ for $\tilde f_\theta = \tilde f^*_{\theta_{\backslash k}|\theta_k} \tilde f_{\theta_k}$, the optimally approximated marginal $\tilde f^*_{\theta_k}$ minimizing $KL(\tilde f_\theta\|f_\theta)$ is proportional to the true marginal $f_{\theta_k}$ in $f_\theta = f_{\theta_{\backslash k}|\theta_k} f_{\theta_k}$ by a fraction of normalized conditional divergence $\zeta_k \exp KL(\tilde f^*_{\theta_{\backslash k}|\theta_k}\|f_{\theta_{\backslash k}|\theta_k})$, where $\zeta_k$ is the normalizing constant. In traditional VB approximation (29), we simply set $\tilde f^*_{\theta_{\backslash k}|\theta_k} = \tilde f^*_{\theta_{\backslash k}}$, which is independent of $\theta_k$.

1) Conditionally Variational approximation (CVA): For a good approximation $\tilde f_k$ of $f_k$, let us initially pick a closed form p.d.f. $\tilde f(\theta) = \tilde f^*(\theta_{\backslash k}|\theta_k)\tilde f_k(\theta_k)$, in which the conditional distribution $\tilde f_{\backslash k|k} \triangleq \tilde f^*(\theta_{\backslash k}|\theta_k)$ is fixed and given. The optimal approximation $\tilde f_k^* \triangleq \tilde f_k^*(\theta_k)$ is then found by the following Theorem, which is also the foundational idea of this paper:

Theorem 23. (Conditionally Variational approximation)
Let $\tilde f = \tilde f^*_{\backslash k|k}\tilde f_k$ be a family of distributions with fixed-form conditional $\tilde f_{\backslash k|k}$. Then $\tilde f$ is convex over marginals $\tilde f_k$, which yields:

$$KL_{\tilde f\|f} = KL_{\tilde f\|\tilde f^*} + KL_{\tilde f^*\|f} \geq KL_{\tilde f^*\|f} = \log\frac{1}{\zeta_k} \qquad (22)$$

owing to Bregman pythagorean property (3) for functional space (9-10). The distribution $\tilde f = \tilde f^*$ minimizing $KL_{\tilde f\|f}$ is $\tilde f^* = \tilde f^*_{\backslash k|k}\tilde f_k^*$, with marginal:

$$\tilde f_k^*(\theta_k) = \frac{f_k(\theta_k)}{\zeta_k \exp(KL_{\tilde f^*\|f_{\backslash k|k}})} = \frac{1}{\zeta_k} \exp E_{\tilde f^*(\theta_{\backslash k}|\theta_k)} \log\frac{f(\theta)}{\tilde f^*(\theta_{\backslash k}|\theta_k)} \qquad (23)$$

in which $\zeta_k$ is the normalizing constant of $\tilde f_k^*$ in (23) and $KL_{\tilde f^*\|f_{\backslash k|k}} \triangleq KL(\tilde f^*(\theta_{\backslash k}|\theta_k)\|f(\theta_{\backslash k}|\theta_k))$.

Note that, if the marginal $\tilde f_k = \tilde f_k^*$ is initially fixed instead, $\tilde f$ is then convex over $\tilde f_{\backslash k|k}$ and, hence, the conditional $\tilde f^*_{\backslash k|k}$ minimizing $KL_{\tilde f\|f}$ in (22) is the true conditional distribution $f_{\backslash k|k}$, i.e. $\tilde f^*_{\backslash k|k} = f_{\backslash k|k}$.

Proof: Firstly, we note that, for any mixture $\tilde f_k(\theta_k) = p_1 \tilde f_1(\theta_k) + p_2 \tilde f_2(\theta_k)$, we always have $\tilde f(\theta) = p_1 \tilde f_1(\theta) + p_2 \tilde f_2(\theta)$. Hence, $\tilde f$ is convex over $\tilde f_k$ with fixed $\tilde f_{\backslash k|k}$ and satisfies the Bregman pythagorean equality (22), since KL divergence is a special case of Bregman divergence (9). We can also verify the pythagorean equality (22) directly, similarly to the proof of copula's KL divergence (21), as follows:

$$\begin{aligned} KL_{\tilde f\|f} &= E_{\tilde f_k} KL_{\tilde f^*\|f_{\backslash k|k}} + KL_{\tilde f_k\|f_k} \\ &= E_{\tilde f_k} \log\frac{\tilde f_k}{\frac{1}{\zeta_k}\frac{f_k}{\exp(KL_{\tilde f^*\|f_{\backslash k|k}})}} + E_{\tilde f_k} \log\frac{1}{\zeta_k} \\ &= \underbrace{KL_{\tilde f_k\|\tilde f_k^*}}_{KL_{\tilde f\|\tilde f^*}} + \underbrace{\log\frac{1}{\zeta_k}}_{KL_{\tilde f^*\|f}} \end{aligned} \qquad (24)$$

in which the form $\tilde f_k^*$ is defined in (23) and $\zeta_k$ is independent of $\theta_k$. Also, we have $KL_{\tilde f\|\tilde f^*} = KL_{\tilde f_k\|\tilde f_k^*}$ in the first term of r.h.s. of (24) since $\tilde f$ and $\tilde f^*$ only differ in marginals $\tilde f_k$, $\tilde f_k^*$. For the second term, by definition (23), we have $KL_{\tilde f^*\|f_{\backslash k|k}} = \log\frac{1}{\zeta_k}\frac{f_k(\theta_k)}{\tilde f_k^*(\theta_k)}$, which yields: $E_{\tilde f_k^*} KL_{\tilde f^*_{\backslash k|k}\|f_{\backslash k|k}} = \log\frac{1}{\zeta_k} - KL_{\tilde f_k^*\|f_k} \Leftrightarrow KL_{\tilde f^*\|f} = \log\frac{1}{\zeta_k}$ in (22) and (24). Lastly, the second equality in (23) is given as follows: $f_k(\theta_k)/\exp KL_{\tilde f^*\|f_{\backslash k|k}} = f_k(\theta_k) \exp E_{\tilde f^*_{\backslash k|k}} \log\frac{f_{\backslash k|k}}{\tilde f^*_{\backslash k|k}} = \exp E_{\tilde f^*_{\backslash k|k}} \log\frac{f(\theta)}{\tilde f^*_{\backslash k|k}}$.

If $\tilde f_k = \tilde f_k^*$ is fixed instead, $\tilde f$ is then convex over a mixture of $\tilde f_{\backslash k|k}$ as shown above. Then, $KL_{\tilde f\|f}$ in (24) is minimum at $\tilde f^*_{\backslash k|k} = f_{\backslash k|k}$, since the term $KL_{\tilde f_k\|f_k} = KL_{\tilde f_k^*\|f_k}$ in (24) is now fixed and the term $E_{\tilde f_k^*} KL_{\tilde f^*_{\backslash k|k}\|f_{\backslash k|k}}$ is minimum at zero with $\tilde f_{\backslash k|k} = f_{\backslash k|k}$.

In Theorem 23, the conditional $\tilde f_{\backslash k|k}$ is fixed beforehand and $\tilde f_k^*$ is found in a free-form variational space, hence the name
or Conditional mean-field [29] in literature, which are merely applications of mean-field approximations to a conditionally independent structure, i.e. $\tilde f(\theta|\xi) = \prod_{k=1}^{K} \tilde f_k(\theta_k|\xi)$, given a latent variable $\xi$ in this case.

2) Copula Variational algorithm: In CVA form above, we can only find one approximated marginal $\tilde f_k^*(\theta_k)$, given conditional form $\tilde f_{\backslash k|k}(\theta_{\backslash k}|\theta_k)$. In the iterative form below, we will iteratively multiply $\tilde f_k^*(\theta_k)$ back to $\tilde f_{\backslash k|k}(\theta_{\backslash k}|\theta_k)$ in order to find the reverse conditional $\tilde f_{k|\backslash k}(\theta_k|\theta_{\backslash k})$ for the next $\tilde f^*_{\backslash k}(\theta_{\backslash k})$ via (23). At convergence, we can find a set of approximations $\tilde f_k$, $\forall k \in \{1, 2, \ldots, K\}$, such that the $KL_{\tilde f\|f}$ is locally minimized, as follows:

Corollary 25. (Copula Variational approximation)
Let $\tilde f^{[0]} = \tilde f^{[0]}_{\backslash k|k}\tilde f^{[0]}_k$ be the initial approximation with initial form $\tilde f^{[0]}_{\backslash k|k}$. At iteration $\nu \in \{1, 2, \ldots, \nu_c\}$, the approximation $\tilde f^{[\nu]} = \tilde f^{[\nu-1]}_{\backslash k|k}\tilde f^{[\nu]}_k = \tilde f^{[\nu]}_{k|\backslash k}\tilde f^{[\nu]}_{\backslash k}$ is given by (23), as follows:

$$\tilde f^{[\nu]}_k(\theta_k) = \frac{f_k(\theta_k)}{\zeta^{[\nu]}_k \exp(KL_{\tilde f^{[\nu-1]}_{\backslash k|k}\|f_{\backslash k|k}})} \qquad (25)$$

in which the reverse conditional is $\tilde f^{[\nu]}_{k|\backslash k} = \frac{\tilde f^{[\nu-1]}_{\backslash k|k}\tilde f^{[\nu]}_k}{\tilde f^{[\nu]}_{\backslash k}}$ and updated $\tilde f^{[\nu]}_{\backslash k} \triangleq \int_{\theta_k} \tilde f^{[\nu-1]}_{\backslash k|k}\tilde f^{[\nu]}_k$, $\forall k \in \{1, 2, \ldots, K\}$. Then, the value $KL_{\tilde f^{[\nu]}\|f} = \log\frac{1}{\zeta^{[\nu]}_k}$ in (22), where $\zeta^{[\nu]}_k$ is the normalizing constant of marginal $\tilde f^{[\nu]}_k$, monotonically decreases to a local minimum at convergence $\nu = \nu_c$, as illustrated in Fig. 8.

Note that, by copula's marginal-invariant property (19), the copula's form of the iterative joint distribution $\tilde f^{[\nu]}(\theta)$ is invariant with any updated marginals $\tilde f^{[\nu]}_k(\theta_k)$, $\forall k$, hence the name Copula Variational approximation.

Proof: Since the calculation of reverse form $\tilde f_{k|\backslash k}$ does not change $\tilde f^{[\nu]}(\theta)$, the value $KL_{\tilde f^{[\nu]}\|f}$ only decreases with marginal update $\tilde f_k$ via (22-23) and, hence, converges monotonically.

Figure 8. Venn diagram for iterative Copula Variational approximation (CVA), given in (25). The dashed contours represent the convexity of $KL(\tilde f_\theta\|f_\theta)$ over distributional points $\tilde f_\theta$. The set $\mathcal{C}$, possibly nonconvex, denotes a class of distributions with the same copula form. Given initial form $\tilde f_{\theta_{\backslash k}|\theta_k}$, the joint distributions $\tilde f^{[0]}_\theta$ and $\tilde f^{[1]}_\theta$ belong to the same convex set $\mathcal{V}^{[1]} \subseteq \mathcal{C}$, by Theorem 23 and Corollary 25. The CVA $\tilde f^{[1]}_\theta$ is the Bregman projection of the true distribution $f_\theta$ onto $\mathcal{V}^{[1]}$, with $\tilde f^{[1]}_\theta = \arg\min_{\tilde f_{\theta_k} \in \mathcal{V}^{[1]}} KL(\tilde f_\theta\|f_\theta)$, as shown in (22) and illustrated in Fig. 2. By interchanging the role of $\theta_{\backslash k}$ and $\theta_k$, the $KL(\tilde f^{[\nu]}_\theta\|f_\theta)$ never increases over iterations $\nu$ and, hence, converges to a local minimum inside copula set $\mathcal{C}$. In traditional VB algorithm, we set $\tilde f^{[\nu]}_{\theta_{\backslash k}|\theta_k} = \tilde f^{[\nu]}_{\theta_{\backslash k}}$, which belongs to the independent copula class at all iterations $\nu$.

If the initial form $\tilde f_{\backslash k|k}$ belongs to the independent space, i.e. $\tilde f^{[0]}_{\backslash k|k} = \tilde f^{[0]}_{\backslash k}$, the copula of the joint $\tilde f^{[0]}_\theta$ will have independent form, as noted in Remark 15, and cannot leave this independence space via dual iterations of (25). Hence, for a binary partition $\theta = \{\theta_{\backslash k}, \theta_k\}$, an initially independent copula will lead to a mean-field approximation.

Nonetheless, this is not true in general for ternary partition $\theta = \{\theta_k, \theta_j, \theta_m\}$ or for a generic network of parameters, since the iterative CVA (25) can be implemented with different partitions of a network at any iteration, without changing the joint network's copula or increasing the joint KL divergence $KL_{\tilde f^{[\nu]}\|f}$.

For example, in ternary partition, even if we initially set $\tilde f^{[0]}_{k|\backslash k} = \tilde f^{[0]}_k$ independent of $\theta_{\backslash k} = \{\theta_j, \theta_m\}$ and yield the $\tilde f^{[1]}_{\backslash k} = \tilde f^{[1]}(\theta_j, \theta_m)$ for $\tilde f^{[1]} = \tilde f^{[0]}_k \tilde f^{[1]}_{\backslash k} = \tilde f^{[1]}_{m|j}\tilde f^{[1]}_{\backslash m}$ via (25), the reverse form $\tilde f^{[1]}_{m|j}$ yields $\tilde f^{[2]}_{\backslash m} \triangleq \tilde f^{[2]}(\theta_k, \theta_j)$ via (25) again and, hence, $\tilde f^{[2]}_{k|\backslash k} = \tilde f^{[2]}(\theta_k|\theta_j)$ dependent on $\theta_j$ again, which does not yield a mean-field approximation in subsequent iterations of (25). This ternary partition scheme will be implemented in (59) and clarified further in Remark 42.

3) Conditionally exponential family (CEF) approximation: The computation in above approximations will be linearly tractable, if the true joint $f(\theta)$ and the approximated conditional $\tilde f_{\backslash k|k}$ can be linearly factorized with respect to log-operator in (23) and (25). The distributions satisfying this property belong to a special class of distributions, namely CEF, defined as follows:

Definition 26. (Conditionally Exponential Family)
A joint distribution $f(\theta)$ is a member of CEF if it has the following form:

$$f(\theta) \propto \exp\langle g_k(\theta_k), g_{\backslash k}(\theta_{\backslash k})\rangle \qquad (26)$$

where $g_k$, $g_{\backslash k}$ are vectors dependent on $\theta_k$, $\theta_{\backslash k}$ element-wise, respectively. Note that, the form (26) is similar to the well-known Exponential Family in literature [2], [45], hence the name CEF.

From (26), the marginal of a joint CEF distribution is:

$$f(\theta_k) \propto \int_{\theta_{\backslash k}} \exp\langle g_k(\theta_k), g_{\backslash k}(\theta_{\backslash k})\rangle\, d\theta_{\backslash k} \qquad (27)$$
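To make Definition 26 and the marginal (27) concrete, the sketch below writes a zero-mean, unit-variance bivariate Gaussian with correlation `rho` in the inner-product form (26) and evaluates the integral (27) numerically. The particular vectors returned by `g_k` and `g_rest` are one possible factorisation chosen here for illustration (an assumption, not the paper's notation), and the trapezoidal grid is likewise arbitrary.

```python
import math

def g_k(t1):
    """Vector depending on theta_k only (hypothetical choice)."""
    return [t1 * t1, t1, 1.0]

def g_rest(t2, lam11, lam12, lam22):
    """Vector depending on theta_\\k only; lam** are precision-matrix entries."""
    return [-0.5 * lam11, -lam12 * t2, -0.5 * lam22 * t2 * t2]

def log_joint_unnorm(t1, t2, lam):
    """<g_k, g_rest> = -0.5 * [t1, t2] Lambda [t1, t2]^T, i.e. form (26)."""
    return sum(a * b for a, b in zip(g_k(t1), g_rest(t2, *lam)))

def marginal_unnorm(t1, lam, lo=-12.0, hi=12.0, n=4001):
    """Equation (27): integrate exp<g_k, g_rest> over theta_\\k (trapezoid rule)."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        t2 = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * math.exp(log_joint_unnorm(t1, t2, lam))
    return total * h

rho = 0.8
scale = 1.0 / (1.0 - rho * rho)
lam = (scale, -rho * scale, scale)      # precision entries (lam11, lam12, lam22)

# Ratio of the numerical marginal to its value at 0, compared with exp(-t^2/2),
# the shape of the known standard Normal marginal.
ratios = {t: marginal_unnorm(t, lam) / marginal_unnorm(0.0, lam) for t in (0.5, 1.0, 2.0)}
```

Here the integral has to be evaluated on a grid for every value of $\theta_k$; for a high-dimensional $\theta_{\backslash k}$ such a grid grows exponentially with the dimension, which is the cost that the text attributes to the arithmetic-mean form (27).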
Let $\tilde f = \tilde f^*_{\backslash k|k}\tilde f_k$ be a distribution with $\tilde f^*_{\backslash k|k}$ […] true distribution $f(\theta)$ also takes the CEF form (26), the approximation $\tilde f^*$ minimizing $KL_{\tilde f\|f}$ in (22), as given by (23), also belongs to CEF, as follows:

$$\tilde f_k^*(\theta_k) \propto \exp\langle \eta_k(\theta_k), \eta^*_{\backslash k}(\theta_k)\rangle \qquad (28)$$

where $\eta^*_{\backslash k}(\theta_k) \triangleq E_{\tilde f^*_{\backslash k|k}}\, \eta_{\backslash k}(\theta_{\backslash k})$, with $\eta_k \triangleq g_k - h_k$ and $\eta_{\backslash k} \triangleq g_{\backslash k} - h_{\backslash k}$.

Proof: The form (28) is a direct consequence of (23), since both $\tilde f^*_{\backslash k|k}$ and $f(\theta)$ in (23) now have CEF form (26).

Figure 9. Illustration of Expectation-Maximization (EM) algorithm (30) as a special case of VB approximation. The lower KL divergence, the better approximation. Given restricted form $\tilde f^{[\nu]}_{\theta_{\backslash k}} = f(\theta_{\backslash k}|\tilde\theta^{[\nu-1]}_k)$ at iteration $\nu$, the approximated $\tilde f^{[\nu]}_{\theta_k}$ minimizing $KL(\tilde f^{[\nu]}_{\theta_{\backslash k}}\tilde f^{[\nu]}_{\theta_k}\|f_\theta)$ is proportional to the true marginal $f_{\theta_k}$ by a fraction of conditional divergence, similar to Fig. 7. Note that, $\tilde\theta_k$ might fail to converge to a local mode $\hat\theta_k$ of the true marginal $f_{\theta_k}$, if the peak $\beta$ is lower than point $\alpha$. For ICM algorithm (31), we further restrict $\tilde f^{[\nu]}_{\theta_{\backslash k}}$ to a Dirac delta distribution concentrating around its mode $\tilde\theta^{[\nu]}_{\backslash k}$ and, hence, $\tilde\theta^{[\nu]} = \{\tilde\theta^{[\nu]}_{\backslash k}, \tilde\theta^{[\nu]}_k\}$ always converges to a joint local mode $\hat\theta$ of the true distribution $f_\theta$.

From (27-28), we can see that the integral in (27) has moved inside the non-linear exp operator in (28) and, hence, become linear and numerically tractable. Then, substituting (28) into iterative CVA (25), we can see that the iterative CVA for CEF is also tractable, since we only have to update the parameters of CEF iteratively in (28) until convergence.
Remark 28. In a nutshell, the key advantage of KL divergence is to approximate the originally intractable arithmetic mean (27) by the tractable geometric mean in exponential domain (28), as noted in Remark 11.

4) Backward KLD and minimum-risk (MR) approximation: In the above approximations, we have used the forward $KL_{\tilde{f}\|f}$ (22) as the approximation criterion, since the Bregman Pythagorean property (3) is only valid for forward $KL_{\tilde{f}\|f}$. Moreover, the approximation via backward $KL_{f\|\tilde{f}}$ is not interesting, since the minimum is only achieved with the true distributions, as shown below:

Corollary 29. (Conditionally minimum-risk approximation)
The approximation $\tilde{f}^* = \tilde{f}_{\backslash k|k}\tilde{f}_k$ minimizing backward $KL_{f\|\tilde{f}}$ is either $\tilde{f}^* = \tilde{f}_{\backslash k|k} f_k$ or $\tilde{f}^* = f_{\backslash k|k}\tilde{f}_k^{MR}$ for fixed $\tilde{f}_{\backslash k|k}$ or fixed $\tilde{f}_k^{MR}$, respectively, where $f_k$ and $f_{\backslash k|k}$ are the true marginal and conditional distributions.

Proof: Similar to the proof of Theorem 23, the backward form is $KL_{f\|\tilde{f}} = E_{f_k} KL_{f_{\backslash k|k}\|\tilde{f}_{\backslash k|k}} + KL_{f_k\|\tilde{f}_k}$. Hence, $KL_{f\|\tilde{f}^*}$ is minimum at $\tilde{f}_k^{MR} = f_k$ for fixed $KL_{f_{\backslash k|k}\|\tilde{f}^{MR}_{\backslash k|k}}$ and minimum at $\tilde{f}_{\backslash k|k} = f_{\backslash k|k}$ for fixed $KL_{f_k\|\tilde{f}_k^{MR}}$.

Remark 30. Corollary 29 is the generalized form of the minimum-risk approximation in [2], which minimizes backward KL divergence in the context of VB approximation in mean-field theory. The name "minimum-risk" refers to the fact that the true distribution always yields minimum-risk estimation in Bayesian theory (c.f. Appendix A).

C. Mean-field approximations
If we confine the conditional form $\tilde{f} = \tilde{f}_{\backslash k|k}\tilde{f}_k$ in the above approximations to the independent form, i.e. $\tilde{f} = \tilde{f}_{\backslash k}\tilde{f}_k$, we will recover the so-called mean-field approximations in literature. Four cases of them, namely VB, EM, ICM and k-means algorithms, will be presented below.

1) Variational Bayes (VB) algorithm: From CVA (23), the VB algorithm is given as follows:

Corollary 31. (VB approximation)
The independent distribution $\tilde{f}^* = \tilde{f}^*_{\backslash k}\tilde{f}^*_k$ minimizing $KL_{\tilde{f}\|f}$ in (22) is given by (23), as follows:
$$\tilde{f}_k^*(\theta_k) \propto \frac{f_k(\theta_k)}{\exp(KL_{\tilde{f}^*_{\backslash k}\|f_{\backslash k|k}})} \propto \exp E_{\tilde{f}^*_{\backslash k}(\theta_{\backslash k})} \log f(\theta), \qquad (29)$$
$\forall k \in \{1, 2, \ldots, K\}$, as illustrated in Fig. 7.

Proof: Since $\tilde{f}_{\backslash k|k} = \tilde{f}_{\backslash k}$ does not depend on $\theta_k$ in this case, substituting $\tilde{f}_{\backslash k|k} = \tilde{f}_{\backslash k}$ into (23) yields (29).

Since there is no conditional form $\tilde{f}_{\backslash k|k}$ to be updated, the iterative VB algorithm simply updates (29) iteratively for all marginals $\tilde{f}_k$ and $\tilde{f}_{\backslash k}$, similar to (25), until convergence. Hence, the VB algorithm is a special case of the Copula Variational algorithm in Corollary 25, in which the approximated copula is of independent form, as noted in Remark 15.

2) Expectation-Maximization (EM) algorithm: If we restrict the independent form $\tilde{f} = \tilde{f}_{\backslash k}\tilde{f}_k$ in the VB algorithm to the Dirac delta form $\tilde{f}_{EM} \triangleq \tilde{f}_{\backslash k}\tilde{\delta}_k$, where $\tilde{\delta}_k \triangleq \delta(\theta_k - \tilde{\theta}_k)$, we will recover the EM algorithm, as follows:

Corollary 32. (EM algorithm)
At iteration $\nu \in \{1, 2, \ldots, \nu_c\}$, the EM approximation of $f(\theta)$ is $\tilde{f}^{[\nu]}_{EM} \triangleq \tilde{f}^{[\nu]}_{\backslash k}\tilde{\delta}^{[\nu]}_k$, in which $\tilde{f}^{[\nu]}_{\backslash k} = f(\theta_{\backslash k}|\tilde{\theta}^{[\nu]}_k)$ and $\tilde{\delta}^{[\nu]}_k \triangleq \delta(\theta_k - \tilde{\theta}^{[\nu]}_k)$, as given by (29):
[Figure 10 diagram: DAGs over the nodes $\theta_1, \ldots, \theta_9$ for the original $f(\Theta)$ and for the simpler structures $\tilde{f}_1(\Theta), \tilde{f}_2(\Theta), \ldots, \tilde{f}_N(\Theta)$, which are weighted by $\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_N$ and summed into $\tilde{f}_0(\Theta) = \sum_{i=1}^N \tilde{p}_i \tilde{f}_i(\Theta)$; each structure is compared with the original via $KL(\tilde{f}_i(\Theta)\|f(\Theta))$.]
Figure 10. Augmented CVB approximation $\tilde{f}_0(\theta)$ for a complicated joint distribution $f(\theta)$, illustrated via directed acyclic graphs (DAG). Each $\tilde{f}_i(\theta)$ is a converged CVB approximation of $f(\theta)$ with simpler structure. The weight vector $\tilde{p} \triangleq [\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_N]^T$, with $\sum_{i=1}^N \tilde{p}_i = 1$, is then calculated via (38) and yields the optimal mixture $\tilde{f}_0(\theta) \triangleq \sum_{i=1}^N \tilde{p}_i \tilde{f}_i(\theta)$ minimizing the upper bounds (39-40) of $KL_{\tilde{f}_0\|f}$. Since $KL_{\tilde{f}_0\|f}$ is convex over $\tilde{f}_0$, the mixture $\tilde{f}_0$ would be close to the original $f$, if we can design a set of $\tilde{f}_i$ such that $f$ stays inside a polytope bounded by vertices $\tilde{f}_i$, as illustrated in Fig. 4. Hence, a good choice of $\tilde{f}_i$ might be a set of overlapped sectors of the original network $f$, such that their mixture would have a structure similar to that of $f$, as illustrated in the above DAGs.
V. HIERARCHICAL CVB FOR BAYESIAN NETWORK

In this section, let us apply the CVB approximation to a joint posterior $f(\theta|x)$ of a generic Bayesian network. Since the network structure of $f(\theta|x)$ is often complicated in practice, an intuitive approach is to approximate $f(\theta|x)$ with a simpler CEF structure $\tilde{f}(\theta|x)$, such that $KL_{\tilde{f}\|f}$ can be locally minimized via the iterative CVB algorithm.

Nevertheless, since the CVB approximation $\tilde{f}^{[\nu]}(\theta|x)$ in (32) cannot change its copula form at any iteration $\nu$, a natural approach is to design initially a set of simple network structures $\tilde{f}_i$, $i \in \{1, 2, \ldots, N\}$, and then combine them into a more complex structure with lowest $KL_{\tilde{f}^{[\nu_c]}\|f}$, or equivalently, highest ELBO (33) at convergence $\nu = \nu_c$. An augmented hierarchy method for merging potential CVB structures, as illustrated in Fig. 10, will be studied below.

For simplicity, let us consider the case of joint distribution $f(\theta)$ first, before applying the augmented approach to joint posterior $f(\theta|x)$.

A. Augmented CVB for mixture model
Let us firstly consider a mixture model, which is the simplest structure of a hierarchical network. The traditional mixture $f(\theta|p) = \sum_{i=1}^N p_i f_i(\theta) = \sum_l f(\theta, l|p)$ and its approximation $\tilde{f}(\theta|\tilde{p}) = \sum_{i=1}^N \tilde{p}_i \tilde{f}_i(\theta) = \sum_l \tilde{f}(\theta, l|\tilde{p})$ can be written in augmented form via a boolean label vector $l \triangleq [l_1, l_2, \ldots, l_N]^T \in \mathbb{I}^N$, as follows:
$$f(\theta, l|p) = f(\theta|l) f(l|p) = \prod_{i=1}^N f_i^{l_i}(\theta)\, p_i^{l_i}, \qquad (36)$$
$$\tilde{f}(\theta, l|\tilde{p}) = \tilde{f}(\theta|l) \tilde{f}(l|\tilde{p}) = \prod_{i=1}^N \tilde{f}_i^{l_i}(\theta)\, \tilde{p}_i^{l_i},$$
where $l \in \{\boldsymbol{\epsilon}_1, \boldsymbol{\epsilon}_2, \ldots, \boldsymbol{\epsilon}_N\}$ and $\boldsymbol{\epsilon}_i \triangleq [0, \ldots, 1, \ldots, 0]^T$ is an $N \times 1$ vector with all zero elements except the unit value at the i-th position, $\forall i \in \{1, 2, \ldots, N\}$. Each $\tilde{f}_i$ is then assumed to be the converged CVB approximation of each original component $f_i$.

Ideally, our aim is to pick the weight vector $\tilde{p} \triangleq [\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_N]^T$ such that $KL(\tilde{f}(\theta|\tilde{p})\|f(\theta|p))$ is minimized. Nevertheless, it is not feasible to directly factorize the mixture forms $f(\theta|p)$ and $\tilde{f}(\theta|\tilde{p})$ via the non-linear form of KL divergence. Instead, let us minimize the KL divergence of their augmented forms in (36), as follows:
$$\tilde{p}^* \triangleq \arg\min_{\tilde{p}} KL(\tilde{f}(\theta, l|\tilde{p})\|f(\theta, l|p)), \qquad (37)$$
which is also an upper bound of $KL(\tilde{f}(\theta|\tilde{p})\|f(\theta|p))$, as shown in (21). The solution for (37) can be found via CVA (23), as follows:

Corollary 38. (CVA for mixture model)
Applying CVA (23) to (37), we can compute the optimal weight $\tilde{p}^* \triangleq [\tilde{p}^*_1, \tilde{p}^*_2, \ldots, \tilde{p}^*_N]^T$ minimizing (37), as follows:
$$\tilde{p}^*_i \propto \frac{p_i}{\exp(KL_{\tilde{f}_i\|f_i})}, \quad \forall i \in \{1, 2, \ldots, N\}. \qquad (38)$$
From (24), the minimum value of (37) is then:
$$KL_{\tilde{p}^*} \triangleq \sum_{i=1}^N \tilde{p}^*_i KL_{\tilde{f}_i\|f_i} + \sum_{i=1}^N \tilde{p}^*_i \log\frac{\tilde{p}^*_i}{p_i} \qquad (39)$$

Proof: From CVA (23), the marginal $\tilde{f}(l|\tilde{p})$ minimizing (37) is $\tilde{f}(l|\tilde{p}) \propto f(l|p)/\exp(KL(\tilde{f}(\theta|l)\|f(\theta|l)))$, which yields (38), since $KL(\tilde{f}(\theta|l)\|f(\theta|l)) = \sum_{i=1}^N l_i KL(\tilde{f}_i(\theta)\|f_i(\theta))$.

B. Augmented CVB for Bayesian network
Let us now apply the above approach to a generic network $f(\theta)$. In (36), let us set $f_i(\theta) = f(\theta)$, $\forall i$, together with uniform weight $p = \bar{p} \triangleq [\bar{p}_1, \bar{p}_2, \ldots, \bar{p}_N]^T = [\frac{1}{N}, \ldots, \frac{1}{N}]^T$. Each $\tilde{f}_i$ in (38) is now a CVB approximation, with possibly simpler structure, of the same original network $f(\theta)$, as illustrated in Fig. 10.
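The weight rule (38) and the bound (39) can be checked numerically. The following is a minimal sketch, not the paper's implementation: the function name `cva_mixture_weights` and the example KL values are hypothetical, and the KL divergences of the individual structures are assumed to be already available (e.g. from converged CVB runs).

```python
import numpy as np

def cva_mixture_weights(prior_weights, kl_values):
    """Weights p*_i proportional to p_i / exp(KL_i), as in (38),
    together with the value of the bound (39) at those weights."""
    p = np.asarray(prior_weights, dtype=float)
    kl = np.asarray(kl_values, dtype=float)
    # Shift by the smallest KL before exponentiating; the shift cancels
    # in the normalization and only improves numerical stability.
    unnormalized = p * np.exp(-(kl - kl.min()))
    weights = unnormalized / unnormalized.sum()
    # Bound (39): sum_i w_i KL_i + sum_i w_i log(w_i / p_i),
    # skipping zero-weight terms (0 * log 0 is taken as 0).
    mask = weights > 0
    bound = float(np.sum(weights[mask] * kl[mask])
                  + np.sum(weights[mask] * np.log(weights[mask] / p[mask])))
    return weights, bound

# Hypothetical example: three candidate structures with uniform prior weight.
example_prior = np.array([1 / 3, 1 / 3, 1 / 3])
example_kl = np.array([0.2, 1.0, 3.0])
weights, bound = cva_mixture_weights(example_prior, example_kl)
print(weights, bound)
```

Substituting (38) into (39) collapses the bound to $-\log \sum_i p_i \exp(-KL_{\tilde{f}_i\|f_i})$, which gives a convenient consistency check for the sketch.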
Owing to Bregman's property 4 in Proposition 2, $KL_{\tilde{f}\|f}$ is convex over $\tilde{f}$. Hence, there exists a linear mixture $\tilde{f}_0(\theta|\tilde{p}) = \sum_{i=1}^N \tilde{p}_i \tilde{f}_i(\theta)$, such that:
$$KL_{\tilde{f}_0\|f} \le KL_{i^*} \triangleq \min_{i \in \{1,2,\ldots,N\}} KL_{\tilde{f}_i\|f} \qquad (40)$$
in which the equality is reached if we set $\tilde{p} = \boldsymbol{\epsilon}_{i^*}$, with $i^* \triangleq \arg\min_i KL_{\tilde{f}_i\|f}$.

Since minimizing $KL_{\tilde{f}_0\|f}$ directly is not feasible, as explained above, we can firstly minimize $KL_{\tilde{f}_i\|f}$ in (40) via the iterative CVB algorithm for each approximated structure $\tilde{f}_i$. We then compute the optimal weights $\tilde{p}^*$ in (37, 38) for the minimum upper bound $KL_{\tilde{p}^*}$ of $KL_{\tilde{f}_0\|f}$. Note that $KL_{\tilde{p}^*}$ in (39) and $KL_{i^*}$ in (40) are two different upper bounds of $KL_{\tilde{f}_0\|f}$ and may not yield the global minimum solution for $KL_{\tilde{f}_0\|f}$ in general. The choice $\tilde{p} = \boldsymbol{\epsilon}_{i^*}$ might yield lower $KL_{\tilde{f}_0\|f}$ than $\tilde{p} = \tilde{p}^*$, even when we have $KL_{i^*} > KL_{\tilde{p}^*}$.

Although we can only find the minimum upper bound solution for the mixture $\tilde{f}_0$ in this paper, the key advantage of the mixture form is that the moments of $\tilde{f}_0$ are simply a mixture of moments of $\tilde{f}_i$, i.e.:
$$\hat{\theta}_0 = E_{\tilde{f}_0}(\theta) = \sum_{i=1}^N \tilde{p}_i E_{\tilde{f}_i}(\theta) = \sum_{i=1}^N \tilde{p}_i \hat{\theta}_i. \qquad (41)$$
In this way, the true moments $\hat{\theta}$ of a complicated network $f(\theta)$ can be approximated by a mixture of moments $\hat{\theta}_i$ of simpler CVB network structures $\tilde{f}_i(\theta)$.

Another advantage of the mixture form is that the optimal weight vector $\tilde{p}$ can be evaluated tractably, without the need for the normalizing constant of $f(\theta|x)$ in the Bayesian context. Indeed, for a posterior Bayesian network $f(\theta|x)$, we can simply replace the value $KL_{\tilde{f}_i\|f}$ in (38-40) by the ELBO value in (33), since the evidence $f(x)$ is a constant.

C. Hierarchical CVB approximation
In principle, if we keep augmenting the above augmented CVB mixture, it is possible to establish an m-order hierarchical CVB approximation $\tilde{f}^{\{m\}}(\theta)$ for a complicated network $f(\theta)$, $\forall m \in \{0, 1, \ldots, M\}$. For example, each zero-order mixture $\tilde{f}_i(\theta|\tilde{p}^*_i) = \sum_{m=1} \tilde{p}^*_{i,m} \tilde{f}_{i,m}(\theta) = \sum_{l_i} \tilde{f}(\theta, l_i|\tilde{p}^*_i)$, $\forall i \in \{1, 2, \ldots, N\}$, can be considered as a component of the first-order mixture $\tilde{f}_0(\theta|\tilde{q}, \tilde{P}^*) = \sum_{i=1}^N \tilde{q}_i \tilde{f}_i^{\{0\}}(\theta|\tilde{p}^*_i)$, where $\tilde{P}^* \triangleq [\tilde{p}^*_1, \tilde{p}^*_2, \ldots, \tilde{p}^*_N]$ and $\tilde{q} \triangleq [\tilde{q}_1, \tilde{q}_2, \ldots, \tilde{q}_N]^T$.

If $\tilde{f}_{i,m}(\theta)$ are all tractable CVB approximations with simpler and possibly overlapped sectors of the network $f(\theta)$, the optimal vectors $\tilde{p}^*_i$ can be evaluated feasibly via $KL_{\tilde{f}_{i,m}\|f}$ in (38). Nonetheless, the computation of the optimal vector $\tilde{q}^*$ via $KL_{\tilde{f}_i^{\{0\}}\|f}$ in (38) might be intractable in practice, because $KL_{\tilde{f}_i^{\{0\}}\|f}$ is a KL divergence of a mixture of distributions and, hence, it is difficult to evaluate $KL_{\tilde{f}_i^{\{0\}}\|f}$ directly in closed form.

An intuitive solution for this issue might be to apply CVB again to the augmented form $KL(\tilde{f}(\theta, l_i|\tilde{p}_i)\|f(\theta, l_i|\bar{p}))$, similar to (37). In this way, we could avoid the mixture form $\tilde{f}_i(\theta|\tilde{p}_i) = \sum_{l_i} \tilde{f}(\theta, l_i|\tilde{p}^*_i)$ and directly derive a CVB closed form for $\tilde{f}_i(\theta|\tilde{p}_i)$. This hierarchical CVB approach is, however, outside the scope of this paper and will be left for future work.

Remark 39. In literature, the idea of augmented hierarchy was mentioned briefly in [51], [52], in which the potential approximations $\tilde{f}_i$ are confined to a set of mean-field approximations and the prior $\tilde{f}(l|\tilde{p})$ is extended from a mixture to a latent Markovian model. Nevertheless, the ELBO maximization in [51], [52] was implemented via stochastic-gradient descent methods and did not yield an explicit form for the mixture's weights in (38).

VI. CASE STUDY
In this section, let us illustrate the superior performance of CVB to mean-field approximations for two canonical scenarios in practice: the bivariate Gaussian distribution and Gaussian mixture clustering. These two cases belong to CEF class (26) and, hence, their CVB approximation is tractable, as shown below.

A. Bivariate Gaussian distribution
In this subsection, let us approximate a bivariate Gaussian distribution $f(\theta) = N_\theta(0, \Sigma)$ with zero mean and covariance matrix $\Sigma \triangleq \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$. The purpose is then to illustrate the performance of CVB and VB approximations for $f(\theta)$ with different values of correlation coefficient $\rho \in [-1, 1]$.

For simple notation, let us denote the marginal and conditional distributions of $f(\theta)$ by $f_1 = N_{\theta_1}(0, \sigma_1)$ and $f_{2|1} = N_{\theta_2}(\beta_{2|1}\theta_1, \sigma_{2|1})$, respectively, in which $\beta_{2|1} \triangleq \rho\frac{\sigma_2}{\sigma_1}$ and $\sigma_{2|1} \triangleq \sigma_2\sqrt{1 - \rho^2}$.

1) CVB approximation: Since the Gaussian distribution belongs to CEF class (26), the CVB form $\tilde{f}^{[1]}_{CVB} = \tilde{f}^{[0]}_{2|1}\tilde{f}^{[1]}_1 = \tilde{f}^{[1]}_{1|2}\tilde{f}^{[1]}_2$ in (25) is also Gaussian, as shown in (28). Then, given initial values $\tilde{\beta}^{[0]}_{2|1} \triangleq \tilde{\rho}^{[0]}\frac{\tilde{\sigma}^{[0]}_2}{\tilde{\sigma}^{[0]}_1}$ and $\tilde{\sigma}^{[0]}_{2|1} \triangleq \tilde{\sigma}^{[0]}_2\sqrt{1 - (\tilde{\rho}^{[0]})^2}$, we have $\tilde{f}^{[0]}_{2|1} = N_{\theta_2}(\tilde{\beta}^{[0]}_{2|1}\theta_1, \tilde{\sigma}^{[0]}_{2|1})$. At iteration $\nu = 1$, the CVA form (23) yields:
$$\tilde{f}^{[1]}_1 = \frac{1}{\zeta^{[1]}_1}\frac{f_1}{\exp(KL_{\tilde{f}^{[0]}_{2|1}\|f_{2|1}})} = \frac{\frac{1}{\sigma_1\sqrt{2\pi}}\exp\left(-\frac{\theta_1^2}{2\sigma_1^2}\right)}{\zeta^{[1]}_1 \frac{\sigma_{2|1}}{\tilde{\sigma}^{[0]}_{2|1}} \exp\left(\frac{(\tilde{\beta}^{[0]}_{2|1} - \beta_{2|1})^2\theta_1^2 + (\tilde{\sigma}^{[0]}_{2|1})^2}{2\sigma_{2|1}^2} - \frac{1}{2}\right)} = N_{\theta_1}(0, \tilde{\sigma}^{[1]}_1),$$
in which $KL_{\tilde{f}^{[0]}_{2|1}\|f_{2|1}}$ is the KL divergence between Gaussian distributions and:
$$\tilde{\sigma}^{[1]}_1 = \frac{1}{\sqrt{\frac{1}{\sigma_1^2} + \frac{(\tilde{\beta}^{[0]}_{2|1} - \beta_{2|1})^2}{\sigma_2^2(1 - \rho^2)}}}, \quad \zeta^{[1]}_1 = \frac{\tilde{\sigma}^{[1]}_1 \tilde{\sigma}^{[0]}_{2|1}}{\sigma_1 \sigma_{2|1}} \exp\left(\frac{\sigma_{2|1}^2 - (\tilde{\sigma}^{[0]}_{2|1})^2}{2\sigma_{2|1}^2}\right). \qquad (42)$$
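As a sanity check on (42), the sketch below evaluates $f_1/\exp(KL_{\tilde{f}^{[0]}_{2|1}\|f_{2|1}})$ on a grid and compares it with $\zeta^{[1]}_1 N_{\theta_1}(0, \tilde{\sigma}^{[1]}_1)$. It is only an illustration under stated assumptions: the function names are hypothetical, the second argument of $N(\cdot,\cdot)$ is read as a standard deviation, and the numbers follow the setting of Fig. 11 ($\sigma_1^2 = 4$, $\sigma_2^2 = 1$, $\rho = 0.8$, unit initial deviations, $\tilde{\rho}^{[0]} = 0.5$).

```python
import numpy as np

def cvb_marginal_update(sigma1, sigma2, rho, beta0, sigma21_0):
    """Closed-form quantities of (42): updated deviation of theta_1
    and the normalizing constant zeta_1."""
    beta = rho * sigma2 / sigma1              # true regression slope beta_{2|1}
    sigma21 = sigma2 * np.sqrt(1.0 - rho**2)  # true conditional deviation
    sigma1_new = 1.0 / np.sqrt(1.0 / sigma1**2 + (beta0 - beta)**2 / sigma21**2)
    zeta1 = (sigma1_new * sigma21_0) / (sigma1 * sigma21) * np.exp(
        (sigma21**2 - sigma21_0**2) / (2.0 * sigma21**2))
    return sigma1_new, zeta1

def unnormalized_marginal(theta1, sigma1, sigma2, rho, beta0, sigma21_0):
    """f_1(theta_1) / exp(KL(f~_{2|1} || f_{2|1})), evaluated pointwise."""
    beta = rho * sigma2 / sigma1
    sigma21 = sigma2 * np.sqrt(1.0 - rho**2)
    f1 = np.exp(-theta1**2 / (2.0 * sigma1**2)) / (sigma1 * np.sqrt(2.0 * np.pi))
    # KL between two univariate Gaussians with means beta0*theta1 and beta*theta1.
    kl = (np.log(sigma21 / sigma21_0)
          + (sigma21_0**2 + (beta0 - beta)**2 * theta1**2) / (2.0 * sigma21**2)
          - 0.5)
    return f1 * np.exp(-kl)

# Setting of Fig. 11 with initial correlation guess 0.5 (unit initial deviations).
sigma1, sigma2, rho = 2.0, 1.0, 0.8
beta0, sigma21_0 = 0.5, np.sqrt(1.0 - 0.5**2)
sigma1_new, zeta1 = cvb_marginal_update(sigma1, sigma2, rho, beta0, sigma21_0)

grid = np.linspace(-30.0, 30.0, 200001)
step = grid[1] - grid[0]
unnorm = unnormalized_marginal(grid, sigma1, sigma2, rho, beta0, sigma21_0)
numeric_zeta1 = unnorm.sum() * step           # simple Riemann sum
gaussian_pdf = np.exp(-grid**2 / (2.0 * sigma1_new**2)) / (sigma1_new * np.sqrt(2.0 * np.pi))
print(sigma1_new, zeta1, numeric_zeta1)
```

The grid integral should agree with the closed-form $\zeta^{[1]}_1$, and the normalized curve should coincide with the Gaussian density of deviation $\tilde{\sigma}^{[1]}_1$.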
[Figure 11 plots: left and middle panels show the approximations in the $(\theta_1, \theta_2)$ plane; the right panel, titled "KL divergence", is plotted against $\tilde{\rho}^{[0]} \in (-1, 1)$ with legend entries VB (initial), VB (converged), CVB (initial), CVB (converged).]
Figure 11. CVB and VB approximations $\tilde{f}_\theta$ for a zero-mean bivariate Gaussian distribution $f_\theta$, with true variances $\sigma_1^2 = 4$, $\sigma_2^2 = 1$ and correlation coefficient $\rho = 0.8$. The initial guess values for CVB and VB are $\tilde{\sigma}^{[0]}_1 = \tilde{\sigma}^{[0]}_2 = 1$, together with various $\tilde{\rho}^{[0]} \in (-1, 1)$ for CVB. The cases $\tilde{\rho}^{[0]} = 0.5$ and $\tilde{\rho}^{[0]} = -0.5$ are shown on the left and middle panel, respectively. The marginal distributions, which are also Gaussian, are plotted on two axes in these two panels. The lower the KL divergence $KL(\tilde{f}_\theta\|f_\theta)$ on the right panel, the better the approximation, as illustrated in Fig. 7, 9. The CVB will be exact, i.e. $KL(\tilde{f}_\theta\|f_\theta) \approx 0$ at convergence, if the initial guess values $\tilde{\rho}^{[0]}$ are in range $\tilde{\rho}^{[0]} \in [0.6, 0.7]$, which is close to the true value $\rho = 0.8$. If $\tilde{\rho}^{[0]} = 0$, the CVB is equivalent to VB approximation in independent class. The number $\nu_c$ of iterations until convergence for VB and CVB are, respectively, 8 and 11.1 ± 5.2, averaged over all cases of $\tilde{\rho}^{[0]} \in (-1, 1)$ for CVB. Only one marginal is updated per iteration.
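For reference, the converged VB baseline in this Gaussian setting can be reproduced in a few lines. The sketch below is not taken from the paper: it assumes the standard mean-field result that, for a Gaussian target, the converged factor variances are the reciprocals of the diagonal entries of the true precision matrix, and the helper name `kl_zero_mean_gaussians` is hypothetical.

```python
import numpy as np

def kl_zero_mean_gaussians(cov_q, cov_p):
    """KL(q || p) for zero-mean Gaussians q = N(0, cov_q) and p = N(0, cov_p)."""
    dim = cov_p.shape[0]
    precision_p = np.linalg.inv(cov_p)
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (np.trace(precision_p @ cov_q) - dim + logdet_p - logdet_q)

# True distribution of Fig. 11: variances 4 and 1, correlation 0.8.
var1, var2, rho = 4.0, 1.0, 0.8
cross = rho * np.sqrt(var1 * var2)
cov_true = np.array([[var1, cross], [cross, var2]])

# Assumed mean-field fixed point: independent factors whose variances are
# the reciprocals of the diagonal of the true precision matrix.
precision_true = np.linalg.inv(cov_true)
cov_vb = np.diag(1.0 / np.diag(precision_true))

kl_vb = kl_zero_mean_gaussians(cov_vb, cov_true)
kl_exact = kl_zero_mean_gaussians(cov_true, cov_true)
print(kl_vb, kl_exact)
```

Under this assumption the converged VB divergence equals $-\frac{1}{2}\log(1 - \rho^2)$, which grows without bound as $|\rho| \to 1$, whereas an approximation with the correct correlation structure reaches zero divergence.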
f (45-46) since we have:
$$\frac{\sum_{i=1}^N l_{k,i}\|x_i - \mu_k\|^2}{\sum_{i=1}^N l_{k,i}} = \frac{\sum_{i=1}^N l_{k,i}\|x_i - \mu_k(L)\|^2}{\sum_{i=1}^N l_{k,i}} + \|\mu_k - \mu_k(L)\|^2, \qquad (47)$$
in which the form of $\mu_k$ is given in (46), $\hat{l}^{[\nu]}_i \triangleq [\hat{l}^{[\nu]}_{1,i}, \hat{l}^{[\nu]}_{2,i}, \ldots, \hat{l}^{[\nu]}_{K,i}]^T$ and $\hat{l}^{[\nu]}_{k,i} = \delta[k - \hat{k}^{[\nu]}_i]$, with $\delta[\cdot]$ denoting the Kronecker delta function, $\forall i \in \{1, 2, \ldots, N\}$. By convention, we keep $\hat{\mu}^{[\nu]}_k = \hat{\mu}^{[\nu-1]}_k$ unchanged if $\sum_{i=1}^N \hat{l}^{[\nu-1]}_{k,i} = 0$, since no update for the k-th cluster is found in this case.

From (50), we can see that the algorithm starts with K initial mean values $\mu_k$, $\forall k \in \{1, 2, \ldots, K\}$, then assigns categorical labels to clusters via minimum Euclidean distance in (50), which, in turn, yields K new cluster means $\mu^{[1]}_k$, $\forall k \in \{1, 2, \ldots, K\}$, and so forth. Hence it is called the k-means algorithm in literature [25], [26].

At convergence $\nu = \nu_c$, the k-means algorithm returns a locally joint MAP value $\hat{\Theta}^{[\nu_c]} = [\hat{\Upsilon}^{[\nu_c]}, \hat{L}^{[\nu_c]}]$, which depends

where $\hat{\Upsilon}^{[\nu]} \triangleq [\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_K]$, $\tilde{\Upsilon}^{[\nu]} \triangleq [\tilde{\mu}_1, \tilde{\mu}_2, \ldots, \tilde{\mu}_K]$, $\hat{L}^{[\nu]}_{EM_1} \triangleq [\hat{l}^{[\nu]}_1, \hat{l}^{[\nu]}_2, \ldots, \hat{l}^{[\nu]}_N]$ with $\hat{l}^{[\nu]}_i \triangleq [\hat{l}^{[\nu]}_{1,i}, \hat{l}^{[\nu]}_{2,i}, \ldots, \hat{l}^{[\nu]}_{K,i}]^T$ and $\hat{l}^{[\nu]}_{k,i} = \delta[k - \hat{k}^{[\nu]}_i]$, $\tilde{P}^{[\nu]} \triangleq [\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_N]$ with $\sum_{k=1}^K \tilde{p}^{[\nu]}_{k,i} = 1$, $\forall i \in \{1, 2, \ldots, N\}$.

The forms $\mu_k$ and $\sigma_k$ are given in (46). By convention, we keep $\tilde{\mu}^{[\nu]}_k = \tilde{\mu}^{[\nu-1]}_k$ and $\tilde{\sigma}^{[\nu]}_k = \tilde{\sigma}^{[\nu-1]}_k$ unchanged if $\sum_{i=1}^N \hat{l}^{[\nu-1]}_{k,i} = 0$ in (54).

Also, since $f(\Upsilon, L, X)$ is of CEF form (26), we can feasibly evaluate $KL_{\tilde{f}^{[\nu]}\|f(X,\Upsilon,L)}$ directly for $\tilde{f}^{[\nu]}_{EM_1}$ and $\tilde{f}^{[\nu]}_{EM_2}$, as defined in (51). The convergence of $ELBO_{EM}$, as given in
b [ν] = arg max E
Υ feVB (Υ|X) ∝ exp Efe[ν−1] (L|X) log f (X, Υ, L) (56)
EM2 b [ν−1] ,X) log f (X, Υ, L),
f (L|Υ
(53) VB
[ν] [ν]
where f (Υ|L b [ν−1] , X) = QK Nµ (e [ν]
µk , σ
ek I2 ) and = Nµk µek , σ
ek I2 ,
EM1 k=1 k
b [ν−1] , X) = QN M ul (e
f (L|Υ pi
). [ν]
EM2 i=1 i
feVB (L|X) ∝ exp Efe[ν] (Υ|X) log f (X, Υ, L)
Replacing Υ and L in f (X, Υ, L) in (52) and VB
e [ν] = E e[ν]
Replacing L in (46) with P (L), we then have:
[ν] b [ν−1] ), σ
= µk (L
[ν] b [ν−1] ),
ek = σ k (L f (L|X) VB
ek EM1 EM1
[ν] [ν−1] [ν−1]
[ν] Nxi (e
µ k , I2 ) µ
= µk (P
e ), σ
= σ k (P
e ),
ki = arg max
, (54) ek ek
k exp((eσk )2 ) Nxi (e
µk , I2 )
pek,i ∝ , (57)
and: exp((e
σk ) 2 )
[Figure 13 plots: four panels. Recoverable labels: "(initial)", "EM1", "VB", "k-means", "CVB1", "CVB3"; panel title "Mean squared error (MSE) of cluster means"; horizontal axis "Radius" ranging from 0 to 8.]
Figure 13. CVB and mean-field approximations for K = 4 bivariate independent Normal clusters N (µ, I2 ), with mean vectors µ located diagonally and
equally at radius R from the offset point [1, 1]T . The upper left panel shows the convergent results of approximated mean vectors for one Monte Carlo run in
the case R = 4, with true mean vectors located at intersections of four dotted lines. The dashed circles represent contours of true Normal distributions. The
plus signs + are N = 100 random data, generated with equal probability from each Normal cluster. The four smallest circles are the same initial guesses of
true mean vectors for all algorithms. The dash-dot line illustrates the k-means algorithm, from initial to convergent points. The other panels show the Purity,
MSE and ELBO values at convergence with varying radius. The higher Purity, the higher percentage of correct classification of data. The higher ELBO at
each radius, the lower KL divergence and, hence, the better approximation for that case of radius, as shown in (33-34) and illustrated in Fig. 7, 9. The number
of Monte Carlo runs for each radius is $10^4$.
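The Purity and MSE scores reported in Fig. 13 can be sketched as follows. This is a minimal illustration rather than the paper's code: the function names are hypothetical, labels are assumed to be stored as integer cluster indices instead of one-hot vectors, and cluster means are stored as columns of a 2 × K array.

```python
from itertools import permutations

import numpy as np

def purity(estimated_labels, true_labels, num_clusters):
    """Fraction of data falling in the majority true class of each
    estimated cluster, summed over the estimated clusters."""
    estimated = np.asarray(estimated_labels)
    truth = np.asarray(true_labels)
    matched = 0
    for k in range(num_clusters):
        members = truth[estimated == k]
        if members.size:
            # Size of the largest true class inside estimated cluster k.
            matched += np.bincount(members, minlength=num_clusters).max()
    return matched / estimated.size

def permutation_mse(estimated_means, true_means):
    """Squared error between cluster means, minimized over all K!
    orderings of the estimated means and divided by K."""
    estimated = np.asarray(estimated_means, dtype=float)
    truth = np.asarray(true_means, dtype=float)
    num_clusters = truth.shape[1]
    best = min(np.sum((estimated[:, list(order)] - truth) ** 2)
               for order in permutations(range(num_clusters)))
    return best / num_clusters

# Hypothetical toy example with K = 2 clusters and N = 6 data points.
example_purity = purity([0, 0, 1, 1, 1, 0], [1, 1, 0, 0, 1, 1], 2)
example_mse = permutation_mse([[4.1, 0.2], [3.9, -0.1]], [[0.0, 4.0], [0.0, 4.0]])
print(example_purity, example_mse)
```

Enumerating all K! orderings is affordable for the K = 4 clusters considered here; for large K an assignment solver would be the usual substitute.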
in order to generate the data $x_i \in \mathbb{R}^2$, $i \in \{1, 2, \ldots, N\}$, with N = 100, as shown in Fig. 13. The varying radius R then controls the inter-distance between clusters. In order to quantify the algorithm's performance, let us compute the Purity and mean squared error (MSE) for estimates $\hat{l}_i$, $\hat{\Upsilon}$ of categorical labels $l_i$ and mean vectors $\Upsilon$, respectively. The Purity, which is a common measure for the percentage of successful label classification [53], is calculated as follows: $\mathrm{Purity} = \sum_{k=1}^{K} \frac{1}{N} \max_m \sum_{i=1}^N \delta[\hat{l}_{k,i} = l_{m,i}]$ in each Monte Carlo run. The higher the Purity ∈ [0, 1], the better the estimate for labels. The MSE in each Monte Carlo run is calculated as follows: $\mathrm{MSE} = \frac{1}{K} \min_{\phi \in \Phi} \|\phi(\hat{\Upsilon}) - \Upsilon\|^2$, where Φ is the set of all K! possible permutations of K estimated cluster means in $\hat{\Upsilon} \in \mathbb{R}^{2 \times K}$.

For comparison at convergence, the initialization $\tilde{\Upsilon}^{[0]} = \Upsilon_0$ and $\tilde{\sigma} = 1$ are the same for all algorithms. The k-means (50) and EM1 algorithms (54) will converge at iteration $\nu_c$ if there is no update for categorical labels, i.e. $\hat{L}^{[\nu_c]} = \hat{L}^{[\nu_c - 1]} \Leftrightarrow ELBO^{[\nu_c]} = ELBO^{[\nu_c - 1]}$ in this case. The other algorithms are called converged at iteration $\nu_c$ if $0 \le ELBO^{[\nu_c]} - ELBO^{[\nu_c - 1]} \le 0.01$. The averaged values of $\nu_c$ over all cases in Fig. 13 are [16.4, 16.4, 27.2, 27.4, 27.8] ± [5.0, 5.1, 10.4, 10.4, 7.8] for k-means, EM1, EM2, VB and CVB algorithms. Only one approximated marginal is updated per iteration.

We can see that both performance and number of iterations of k-means and EM1 algorithms are almost identical to each other, since they use the same approach with point estimates for categorical labels. Although EM1 (54) takes one extra data-driven step, in comparison with k-means, by using the total number of classified labels in each cluster as an indicator for credibility, EM1 is virtually the same as k-means in estimate accuracy. Likewise, since the point estimates of labels are data-driven and use a hard decision approach, k-means and EM1 yield lower accuracy than the other methods, which are model-driven and use a soft decision approach.

The EM2 (55) and VB (57) also have almost identical performance and number of iterations, even though EM2 does not update the cluster mean's credibility via the total number of classified labels like VB does. Hence, like the case of EM1 versus k-means, this extra step of data-driven update seems insignificant in terms of estimate accuracy. Nevertheless, since both EM2 and VB use the model's probability of each
label as weighted credibility and make a soft decision at each iteration, their performance is significantly better than k-means and EM1 in the range of radius R ∈ [2, 4]. Hence, the model-driven update step seems to exploit more information from the true model than the data-driven update step, when the clusters are close to each other.

For a large radius R > 4, there is not much difference between soft and hard decisions for these standard Normal clusters, since the tail of the Normal distribution is very small in these cases. Hence, given the same initialization at origin, the performances of all mean-field approximations like k-means, EM1, EM2 and VB are very close to each other when the inter-distance between clusters is high. Also, since the computation of soft decisions in VB and EM2 requires almost double the number of iterations, compared with hard decision approaches like k-means and EM1, k-means is more advantageous in this case, owing to its low computational complexity.

The CVB algorithms are the slowest methods overall. Since the CVB in (70) requires nearly the same number of iterations as VB for each structure j ∈ {1, 2, . . . , N}, as illustrated in Fig. 12, the CVB's computation is at least N times slower than the VB method, where N is the number of data. In practice, we may not have to update all N potential CVB structures, since there might be some good candidates out of the exponentially growing number of potential structures. In this paper, however, let us consider the case of N structures in order to illustrate the superior performance of the augmented CVB form in CVB3 (70), in comparison with VB, heuristic CVB1 and hit-or-miss CVB2 approaches.

The heuristic CVB1, which takes a uniform average of mean vectors over all N potential structures, returns a lower MSE than mean-field approximations in all cases. This result seems reasonable, since cluster means are common parameters of all potential CVB structures in Fig. 12. In contrast, CVB1 returns the label estimate $\hat{l}_j$ via the j-th structure only, without considering label estimates from other CVB structures. Hence, the label Purity of CVB1 is only on par with that of mean-field approximations for short radius R ≤ 2 and deteriorates over longer radius R > 2. As illustrated in Fig. 11, CVB might be the worst approximation if the CVB structure is too different from the true posterior structure. In this case, a single j-th structure seems to be a bad CVB candidate for estimating label $l_j$ at time j ∈ {1, 2, . . . , N}.

The hit-or-miss CVB2, which picks the single best structure $\hat{j}$ in terms of KL divergence, yields the worst performance in the range R ∈ [1, 2.5], while in other cases, it is the second-best method. The structure $\hat{j}$, as illustrated in Fig. 12, concentrates on the $\hat{j}$-th label. Hence, the classification accuracy of CVB2 depends on whether the hard decision on the $\hat{j}$-th label serves as a good reference for other labels, as illustrated in Fig. 11. For this reason, CVB2 may be able to achieve a globally optimal approximation, but it may also be worse than mean-field approximations. When R < 3, which is less than three standard deviations of a standard Normal cluster, the clusters' data are likely to overlap with each other. Within this range, the hard decision of CVB2 on $\hat{j}$ destroys the correlated information between clusters and, hence, becomes worse than other methods. For R ≥ 3, the CVB2 becomes better, which indicates that the classification accuracy now relies more on the most significantly correlated structure between labels.

Generalizing both schemes CVB1 and CVB2, CVB3 (70) can return the optimal weights for the mixture of N potential structures and achieve the minimum upper bound of KL divergence (37), as illustrated in Fig. 4. Hence, CVB3 yields the best performance in Fig. 13. When R < 3, CVB3 is on par with VB approximation, since the probabilities computed via the Normal model are high enough for making soft decisions in VB. When R > 3, however, VB has to rely on hard decisions like k-means, since the standard Normal probabilities are too low. CVB3, in contrast, automatically moves the mixture's weights closer to a hard decision on the best structures, like CVB2.

Note that, although the computed ELBO values for CVB2 in Fig. 13 are correct, the computed ELBO values for CVB1 and CVB3 are merely heuristic and not correct values, since their ELBO values are hard to compute in this case. Nonetheless, from their performance in Purity and MSE, we may speculate that the true ELBO values of CVB1 and CVB3 are lower and higher than those of CVB2, respectively. Equivalently, in terms of KL divergence, CVB3 seems to be the best posterior approximation for this independent Normal cluster model, followed by CVB2, CVB1 and mean-field approximations, which yield almost identical ELBO values.

Intuitively, as shown in the case of R = 4 in the upper left panel of Fig. 13, the mean-field approximations like VB, EM and k-means seem not to recognize the correlations between data of the same clusters, but focus more on the inter-distance between clusters as a whole. The CVB approximations, in contrast, exploit the correlations between each label $l_j$ and all other labels, as shown in Fig. 12. Although the heuristic CVB1 becomes worse when R increases, CVB2 and CVB3 are still able to pick the best correlated structures to represent the data. When the inter-distance between clusters is much higher than the cluster's variance, these two CVB methods stabilize and accurately classify 90% of total data on average. The success rate is only about 80% for all other state-of-the-art mean-field approximations.

In this paper, the independent constraint of mean-field approximations like VB, EM and k-means algorithms has been shown to be a special case of a broader conditional constraint class, namely copula. By Sklar's theorem, which guarantees the existence of a copula for any joint distribution, a copula Variational Bayes (CVB) algorithm is then designed in order to minimize the Kullback-Leibler (KL) divergence from the true joint distribution to an approximated copula class. The iterative CVB can converge to the true probability distribution when their copula structures are close to each other. From the perspective of generalized Bregman divergence in information geometry, the CVB algorithm and its special cases in mean-field approximations have been shown to iteratively project the true probability distribution to a conditional constraint class until convergence at a local minimum of KL divergence.
