AdaCluster: Adaptive Clustering for Heterogeneous Data
Mehmet E. Basbug and Barbara E. Engelhardt
Princeton University
Princeton, NJ 08540, USA
Abstract
Clustering algorithms start with a fixed divergence, which captures the possibly asymmetric dis-
tance between a sample and a centroid. In the mixture model setting, the sample distribution plays
the same role. When all attributes have the same topology and dispersion, the data are said to
be homogeneous. If the prior knowledge of the distribution is inaccurate or the set of plausible
distributions is large, an adaptive approach is essential. The motivation is more compelling for
heterogeneous data, where the dispersion or the topology differs among attributes. We propose an
adaptive approach to clustering using classes of parametrized Bregman divergences. We first show
that the density of a steep exponential dispersion model (EDM) can be represented with a Bregman
divergence. We then propose AdaCluster, an expectation-maximization (EM) algorithm to cluster
heterogeneous data using classes of steep EDMs. We compare AdaCluster with EM for a Gaus-
sian mixture model on synthetic data and nine UCI data sets. We also propose an adaptive hard
clustering algorithm based on Generalized Method of Moments. We compare the hard clustering
algorithm with k-means on the UCI data sets. We empirically verify that adaptively learning the
underlying topology yields better clustering of heterogeneous data.
Keywords: Clustering, Mixture Models, Bregman Divergences, Exponential Dispersion Model,
Heterogeneous Data
1. Introduction
Despite the general tendency towards more complex models and algorithms, k-means remains one
of the most popular clustering algorithms (Kulis and Jordan, 2012). The k-means algorithm makes
three simplifying assumptions: i) clusters are disjoint, implying a hard assignment of samples to
clusters; ii) the number of clusters is fixed and specified ahead of time; and iii) the distance be-
tween two samples is measured with Euclidean distance (Bishop et al., 2006). Many extensions to
k-means have been proposed to relax these assumptions. For example, soft k-means, an expectation
maximization (EM) algorithm for a homogeneous Gaussian mixture model, allows for a soft as-
signment of samples to clusters. In fact, the k-means algorithm is recovered from soft k-means in
the limit of asymptotically small variance (Bishop et al., 2006). This probabilistic model
connection opened the door to non-parametric clustering using the DP-means algorithm, removing
the second assumption of static numbers of clusters. DP-means was derived using the Gibbs sam-
pler for a Dirichlet process (DP) Gaussian mixture model (Kulis and Jordan, 2012). Finally, the
Gaussian distribution implicit in the third assumption of Euclidean geometry has been generalized
to regular exponential family distributions (Banerjee et al., 2005). The generalization is established
via the bijection between regular exponential family distributions and regular Bregman divergences.
The resulting algorithm, Bregman hard clustering, also has a soft counterpart, Bregman soft clus-
tering (Banerjee et al., 2005). Furthermore, (Jiang et al., 2012) showed that DP-means may be used
with any regular Bregman divergence, resulting in a nonparametric hard-clustering algorithm for
homogeneous mixture of regular exponential family distributions. After relaxing the three k-means
assumptions, we are still bound to specify a fixed divergence metric to measure the distance from
the centroid. In the case of mixture models, the corresponding choice is for the sample attribute
distributions.
Describing data in its natural habitat can yield tremendous benefits, especially when the errors
are so large that the Gaussian assumption does not hold (Fisher, 1953). The motivation to identify
the true topological properties of the data has led to the study of dispersion models and most notably
to the generalized linear models (Nelder and Baker, 1972). In this paper, we attempt to use disper-
sion models in clustering setting and provide ways of identifying the topology of the data with mild
assumptions on the data distribution.
With a good understanding of the topology of the data, we can create a custom mixture model
by specifying the known distribution of each sample attribute. Under certain regularity and inde-
pendence conditions, we can combine these attribute-specific mixture models into a single mixture
model. This can be a tedious job when the number of attributes is large. A more challenging and
common scenario is when we do not know the underlying distribution of each attribute a priori. The
conventional approach is to cast this problem as model selection. We can select between alternative
models using a criterion such as a Bayes factor, a likelihood ratio, or an information criterion. In
the case of homogeneous data, i.e. when all the attributes have the same topology and dispersion,
the model selection reduces to identifying the single true distribution within the set of plausible
distributions. Model selection in the case of heterogeneous data tends to be more challenging due to
the combinatorial nature of the problem. If we have M_j choices for the j-th attribute, then we need to
explore ∏_j M_j models, which can be overwhelming. We address this problem in three steps. We first
present a large family of distributions, namely steep exponential dispersion models (EDMs), where
the analytic expressions for the mean and dispersion parameter estimators are available. We then
propose parametrized sub-families of steep EDMs for different data types (positive, non-negative,
real, discrete, continuous etc.). Finally, we derive an adaptive clustering algorithm for heteroge-
neous data that can learn the underlying distribution of each attribute given its type.
In Section 2, we introduce convex duality concepts, Bregman divergences, exponential family
distributions, and exponential dispersion models. We show that the density of a steep exponential
dispersion model can be represented in terms of a Bregman divergence. This finding allows us
to parametrize class of steep EDMs by describing the divergence-generating functions in the dual
domain. With the dual formulation, we are able to keep the support of the class consistent and
analyze the variance-mean relationship more clearly.
In Section 3, we introduce parametrized families of steep EDMs for non-negative discrete, con-
tinuous, positive continuous, and non-negative continuous data. Section 3.1 describes a class for
non-negative discrete data that includes Poisson and negative binomial distributions. This class is
primarily suited for count data such as number of hits recorded with a Geiger counter, page views
for a website, patient days spent in a clinic and game scores in a contest (Hilbe, 2011). For contin-
uous data on the real line, we suggest another class including generalized hyperbolic secant (GHS)
and Gaussian distributions in Section 3.2. This class includes asymmetric distributions with ap-
plications to financial data (Fischer, 2013). Furthermore, when data live on a unit interval, the
proposed class can be utilized by transforming data with a logit function first. Most notably, beta
distribution and logit-normal distribution map to GHS and Gaussian distributions under the logit
transform, respectively. The families included in Section 3.1 and 3.2 are sub-families of the Morris
class characterized by the quadratic polynomial variance function (Morris, 1982). In Section 3.3, we
discuss another prominent class of EDMs, the Tweedie class, with a defining property of a power
variance function (Tweedie, 1947). We show that many members of the Tweedie class are steep
EDMs, hence, can be represented with Bregman divergences. We show that the Tweedie class can
be used to analyze positive continuous data as well as for non-negative continuous data depending
on the choice of the hyper-parameter domain.
In Section 4, we write the quasi-log likelihood of heterogeneous data under mixture of steep
EDMs in terms of Bregman divergences using saddle-point approximation. We propose an EM
algorithm with closed-form updates for the mean and dispersion parameters and numerical proce-
dures for learning the hyper-parameters identifying the underlying distribution of each attribute.
The resulting algorithm, AdaCluster, is similar to the Bregman soft-clustering algorithm (Banerjee
et al., 2005) with the added capability of handling heterogenous data. We also consider the Bayesian
treatment of the model and show that the MAP estimates of the mean parameters also have closed
form when we have conjugate priors. Furthermore, we show that inverse-Gamma distribution is the
conjugate prior to the dispersion parameter of steep EDMs under the saddle-point approximation.
In Section 5, we turn our attention to the hard clustering problem for heterogeneous data. We
first note that the AdaCluster algorithm does not have a hard clustering counterpart under the asymp-
totically small variance assumption. We further show that the likelihood approach falls short in
learning the hyper-parameters defining the topology of the data. We then propose an algorithm
based on generalized method of moments (GMoM) using the moment conditions derived from
the parametrized variance functions. Since variance functions and divergence-generating functions
characterize steep EDMs uniquely, the parametrized families introduced in Section 3 can be used
within the hard-clustering setting as well. We derive a k-means like algorithm, GMoM-HC, that
adaptively learns the topology and partitions data into hard clusters.
Finally, in Section 6, we start by examining the performance of EM algorithm for GMM in ho-
mogeneous mixture models with non-Gaussian distributions. We consider the simple task of density
estimation with mixture models and show that the GMM fit to non-Gaussian data fails to capture the
true density as the variance-mean relationship deviates from the Gaussian assumption of indepen-
dence. In another synthetic data analysis, we compare EM for GMM and AdaCluster in clustering
heterogeneous data. We obtain substantial improvements with AdaCluster in terms of normalized
mutual information and likelihood. Finally, we look at nine data sets from the UCI repository with
distinct topology and dispersion characteristics. We compare AdaCluster and GMoM-HC with EM
for GMM and k-means in terms of clustering performance and conclude that AdaCluster clusters
data better than GMM while GMoM-HC and k-means perform comparably.
2. Background
2.1 Convex duality
We begin with a discussion of two fundamental properties of convex functions and their relationship
to one another. The first property is steepness, which is important with respect to the existence and
uniqueness of a maximum likelihood estimator for exponential family distributions (discussed fur-
ther in Section 2.3). The second property is strict convexity, which enables us to generate Bregman
divergences. Bregman divergences are an important class of distortion functions that have favorable
properties such as non-negativity, convexity, and linearity, which we exploit later (Banerjee et al.,
2005). We then state a theorem establishing the connection between steep convex functions and a
particular type of strict convex functions, essentially strict convex functions. For the remainder of
the paper we assume that the convex functions are proper, i.e., they are finite, and their effective
domain is non-empty. We start by formally defining the steepness property.
Definition 1 (Rockafellar, 1970, Section 26) A proper convex function f is essentially smooth if it
is differentiable throughout the nonempty set A = int(dom f)¹ and lim_{i→∞} |∇f(x_i)| = ∞ whenever x_1,
x_2, . . . is a sequence in A converging to a boundary point x ∈ bd A.
As in (Barndorff-Nielsen, 1978), we use the terms essentially smooth and steep interchangeably.
Although the term steep may be used to describe non-convex functions, here we limit our focus to
convex functions.
Strict convexity is another intrinsic property of convex functions. The effective domain of a
convex function is used to determine if the function is strictly convex; however, more refined char-
acterizations of strict convexity exist. One such characterization is essential strict convexity, which
we define as follows.
Definition 2 (Rockafellar, 1970, Section 26) A proper convex function f is called essentially strictly
convex if f is strictly convex on every convex subset of dom ∂f .
To better understand the difference between essentially strictly convex and strictly convex functions,
we state the following theorem.
Theorem 1 (Barndorff-Nielsen, 1978, Theorem 5.22) Suppose f is a convex function. Then
From Theorem 1, it follows that, if f is essentially strictly convex, then f is strictly convex on
ri(dom f ) but not necessarily on dom f . Furthermore, every essentially strictly convex function
with an open domain is also strictly convex.
Example 1 (Rockafellar, 1970, Section 26) An example of an essentially strictly convex but not
strictly convex function is
f(x) = x_2^2/(2x_1) − 2√x_2    if x_1 > 0, x_2 ≥ 0
f(x) = 0                        if x_1 = 0, x_2 = 0.                                   (1)
1. Throughout the paper we use int, ri, bd to denote the interior, relative interior, and boundary of a set, respectively.
For a function f, we denote the domain, sub-differential mapping, gradient, and convex conjugate with dom f, ∂f,
∇f, and f*. We denote the interval [0, ∞) with R_0 and (0, ∞) with R_+. Similarly, we denote the set {0, 1, . . . }
with Z_0 and {1, 2, . . . } with Z_+.
f is strictly convex on dom ∂f = R_+^2 but not on dom f since f is identically zero for x_2 = 0 and
x_1 ≠ 0. Also note that f is strictly convex on ri(dom f) = R_+^2. Incidentally, f is also a steep
function.
In information geometry, strict convexity is a useful property because strictly convex functions
enable the construction of Bregman divergences. Bregman divergences are defined as follows.
Definition 3 (Bregman, 1967) Let f : A → R be a strictly convex function defined on a convex set
A ⊆ R^d such that f is differentiable on nonempty ri(A). The Bregman divergence d_f : A × ri(A) →
R_0 is defined as
d_f(x, y) = f(x) − f(y) − ⟨∇f(y), x − y⟩.
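For concreteness, a minimal Python sketch (NumPy assumed; an illustration rather than part of the formal development) that evaluates a Bregman divergence directly from a generating function and its gradient; φ(x) = ⟨x, x⟩ recovers the squared Euclidean distance and φ(x) = −log x the Itakura-Saito distance:

    import numpy as np

    def bregman(phi, grad_phi, x, y):
        # d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
        x, y = np.asarray(x, float), np.asarray(y, float)
        return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

    # phi(x) = <x, x> gives the squared Euclidean distance: 5.0 here.
    print(bregman(lambda x: np.dot(x, x), lambda x: 2.0 * x,
                  np.array([1.0, 2.0]), np.array([0.0, 0.0])))

    # phi(x) = -log x (scalar case) gives the Itakura-Saito distance x/y - log(x/y) - 1.
    print(bregman(lambda x: -np.log(x), lambda x: -1.0 / x, 2.0, 1.0))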
Bregman divergences include many useful distortion functions such as squared loss, Mahalanobis
distance, KL-divergence, logistic loss, and Itakura-Saito distance. Bregman divergences are non-
negative, convex in the first argument, linear, and, often, non-symmetric. From the statistical per-
spective, Bregman divergences are appealing because, when the density of an exponential family
distribution can be represented in terms of a Bregman divergence, the maximum likelihood estimate
of the natural parameter is unique and can be calculated easily from the sample mean (Banerjee
et al., 2005). This observation hints at a duality between steep convex functions and strictly convex
functions. In general, there is no duality between the two families; however, the connection between
essentially smooth and essentially strictly convex functions can be established with the following
theorem.
Theorem 2 (Rockafellar, 1970, Theorem 26.3) A closed proper convex function is essentially strictly
convex if and only if its conjugate is essentially smooth.
An immediate corollary of Theorem 2 is that the conjugate of an essentially smooth and lower
semi-continuous function is essentially strictly convex. This is due to the bi-conjugate theorem: If
a convex function f is lower semi-continuous, then f ∗∗ = f (Rockafellar, 1970, Theorem 12.2).
Theorem 2 is especially important in characterizing the dual of an exponential family distribution.
We use Theorem 2 to show that the density of a certain subset of exponential family distributions
can be described in terms of Bregman divergences. Before examining the connection between the
exponential family and Bregman divergences, we formally introduce exponential family distribu-
tions.
Table 1: Six common natural exponential family distributions. Bernoulli, geometric, Poisson,
exponential, Rayleigh, and chi-squared distributions with the common density formulation, support (S),
and the corresponding exponential family parametrization with natural parameter (θ), log-partition
function (Ψ), mean-value mapping (τ), and the variance function (υ).
Definition 5 For a non-degenerate exponential family F_(Ψ,Θ), we define the mean-value mapping
τ : int(Θ) → Ω as τ(θ) = ∇Ψ(θ). The range of the mean-value mapping, Ω, is called the mean
domain. In addition, if τ^{-1}(x) is differentiable within Ω, then we define the variance function
υ : Ω → R_+ as υ(x) = (∇τ^{-1}(x))^{-1}.
In Table 1, we show the densities of six natural exponential family distributions with the map-
ping between the common parameters and the natural parameter, the corresponding log-partition
function, mean-value mapping and variance functions. One striking observation is that the variance
function has at most a quadratic term and can be written as a polynomial for these six distributions.
This may seem like a coincidence; however, many of the common exponential family distributions
can be characterized by a quadratic polynomial variance function (Morris, 1982). The parametriza-
tion of variance functions is often employed to characterize families of distributions. We discuss
three different parametrizations of variance functions in Section 3. The variance function also comes
up in the saddle-point approximation of the density, which we discuss in Section 2.4. However,
we first turn our attention to the mean-value mapping and its key role in finding the dual of a density
function.
Steepness is an intrinsic property of an exponential family distribution and is preserved under linear
transformations. Steep exponential family distributions allow us to find analytic solutions for the
maximum likelihood (ML) and maximum a posteriori (MAP) estimates of θ. As we will discuss
shortly, representing the density of a steep exponential family distribution in the dual domain is im-
mensely useful. The dual formulation allows us to parametrize families of distributions, find analytic
estimates of the mean parameter, and use the saddle-point approximation. The following theorem
outlines the key properties of steep exponential families.
Theorem 4 (Barndorff-Nielsen, 1978, Corollary 9.6) Suppose Ψ is steep, which is true in particular
if Ψ is regular. The maximum likelihood estimate exists if and only if T (x) ∈ int(C), and then it is
unique. Furthermore, Ω = int(C) and C \ Ω = bd C. Finally, the maximum likelihood estimator
θ_ML is the one-to-one mapping from Ω onto int(Θ) whose inverse is τ.
The relationship between the ML estimator and the mean-value mapping hints at a convex du-
ality. More concretely, (int(Θ), Ψ) and (Ω, Ψ̄∗ ) are Legendre duals of each other, and duality is
established through the mean-value mapping (Barndorff-Nielsen, 1978, Section 7). We note that Ψ̄∗
is the restriction of Ψ∗ to Ω and τ is a one-to-one mapping and has an inverse for steep exponential
family distributions. The Legendre duality can be summarized with the following equations:
∇Ψ = τ,    ∇Ψ̄^* = τ^{-1},    Ψ̄^*(µ) = ⟨τ^{-1}(µ), µ⟩ − Ψ(τ^{-1}(µ))    for µ ∈ Ω.
The goal of the dual formulation is to express the density in terms of divergence from the mean.
Legendre dual of the log-partition function is the first step; however, we need to make sure that
the divergence from mean to each point in the support—more generally the convex support since
the support may vary with the natural parameter—is well-defined. Notice that the Legendre dual
is defined on Ω and not C; hence, we can only talk about strict convexity on Ω. To achieve dual
formulation of density as a divergence from mean, we need to investigate the convex conjugate of
(Θ, Ψ) and not just the Legendre dual of (int(Θ), Ψ).
If we have a regular exponential family, then Θ is an open set (bd Θ = ∅) and therefore
(Θ = int(Θ), Ψ) is of Legendre type. The Legendre dual of (Θ, Ψ) is (Ω, Ψ̄∗ ); however, this does
not imply that the convex support is also an open set. The convex support and the convex conjugate
should be calculated with the closure construction of Ψ̄∗ . As noted in (Rockafellar, 1970, Theorem
7.5), taking limits of Ψ̄∗ at the boundaries of C can also be used to find the convex conjugate of Ψ.
We note that for a continuous steep exponential family distribution, the probability mass on
bd Ω is zero; therefore, the closure construction may not be as important (Jorgensen, 1997). For
steep discrete distributions, however, C \ Ω plays an important role. A curious example highlighting
this point and showing the interplay of Θ, Ω, S, and C is the Bernoulli distribution.
Example 2 For a Bernoulli distribution (see Table 1), we have Ψ(θ) = log(1 + e^θ) with Θ = R.
Since Θ is open, we have a regular (and thus also steep) exponential family distribution. The mean-
value mapping is τ(θ) = e^θ/(1 + e^θ) with range Ω = (0, 1). The Legendre dual of (Θ, Ψ) is Ψ̄^*(t) =
t log(t) + (1 − t) log(1 − t) with dom(Ψ̄^*) = (0, 1). The support is S = {0, 1}; hence, the convex
support is C = [0, 1]. We have Ω = ri(C) and C \ Ω = bd Ω. Note that the support, S, is
disjoint from the mean domain, Ω. Also note that all probability mass is on C \ Ω and the Legendre
dual of Ψ is not defined on C \ Ω. We should find the convex conjugate of Ψ and investigate its
behavior on C. The convex conjugate can be calculated by taking limits: Ψ^*(t) = Ψ̄^*(t) ∀t ∈ Ω,
Ψ^*(0) = lim_{t→0+} Ψ̄^*(t) = 0 and Ψ^*(1) = lim_{t→1−} Ψ̄^*(t) = 0. In this particular case, Ψ^* turns
out to be strictly convex on C.
Next, we describe the concluding theorem of this section, where we show that the density of
a steep natural exponential family can be written in terms of a Bregman divergence. A similar
connection was explored in previous work (Banerjee et al., 2005) for the regular exponential family,
which is a subset of the steep exponential family. In general, not every steep exponential family
distribution has a corresponding Bregman divergence.
Theorem 5 Suppose we have a steep natural exponential family distribution p_Ψ(x | θ). Then the
convex conjugate of the log-partition function, φ := Ψ^*, is a strictly convex function, and dom φ = C.
Furthermore, the density can be expressed as
p_Ψ(x | θ) = p_φ(x | µ) = exp(−d_φ(x, µ)) g_φ(x),
where d_φ(x, y) is the Bregman divergence generated from φ, µ = τ(θ) is the mean parameter,
g_φ(x) = exp(φ(x)) h(x) is the base measure, and p_φ(x | µ) is the dual formulation of the
density.
Figure 1: Duality in steep EDMs with support on the positive real numbers. Probability density
functions and the corresponding Bregman divergence functions for three EDMs: (a-b) gamma, (c-d)
inverse-Gaussian, and (e-f) stable. In all cases, the mean parameter (µ) is fixed to 1.0 and the
dispersion parameter (κ) is set to 0.5 (blue), 1.0 (green), and 2.0 (red).
Notice that the mean-value mapping—hence the mean parameter corresponding to a fixed natural
parameter—does not change after the scaling. The divergence from the mean, however, is scaled
by 1/κ, which stretches the geometry around the mean. Finally, we write the density of a
steep EDM as
p_Ψ(x | θ, κ) = p_φ(x | µ, κ) = exp(−d_φ(x, µ)/κ) g_φ(x, κ)
where dφ (x, µ) is the Bregman divergence generated from φ. In the dual form, the density is
parametrized with the mean parameter, µ, and the dispersion parameter, κ. We note that the mean
parameter only appears in the Bregman divergence term, dφ (x, µ). One important property of Breg-
man divergences is that the ML estimate of µ is independent of the divergence (Banerjee et al.,
2005). In fact, the ML estimate of the mean is simply the population mean. The following theorem
summarizes the results for the ML and MAP estimators of the mean parameter and its dual, the
natural parameter.
Theorem 6 Suppose we have N i.i.d. samples {x_i}_{i=1}^N from a steep EDM p_Ψ(x | θ, κ). If the
sample mean x̄ = (1/N) ∑_{i=1}^N x_i ∈ int(C), then the unique ML estimate of the mean parameter is
µ_ML = x̄ and the unique ML estimate of the natural parameter is θ_ML = τ^{-1}(µ_ML) ∈ int(Θ).
Suppose we have a conjugate prior p_Ψ(θ | a, b) with parameters a ∈ int(C) and b > 0 as in Eq. 4.
Then the unique MAP estimate of the mean parameter is µ_MAP = (abκ + N x̄)/(bκ + N) ∈ int(C), and the
unique MAP estimate of the natural parameter is θ_MAP = τ^{-1}(µ_MAP) ∈ int(Θ).
As opposed to the mean parameter, the dispersion parameter appears in the exponent and in
gφ (x, κ); in general, there is no closed-form expression for gφ (x, κ). Fortunately, asymptotic theory
suggests that, for x ∈ Ω, we can approximate the density as
p_φ(x | µ, κ) = (1/√(2πκυ(x))) exp(−d_φ(x, µ)/κ)    as κ → 0.                          (9)
This approximation is known as the saddle-point approximation (Daniels, 1954). One limitation of
the saddle-point approximation is that the sample point must be in Ω.
Example 3 The inverse-Gaussian distribution has the unit log-partition function Ψ(θ) = −√(−2θ)
with Θ = (−∞, 0]. Since Θ is closed and lim_{θ→0} Ψ'(θ) = ∞, we have a steep but not regular
exponential family distribution. The mean-value mapping is τ(θ) = 1/√(−2θ) with range Ω =
(0, ∞). The convex conjugate of Ψ is φ(t) = 1/(2t), and the corresponding Bregman divergence is
d_φ(x, y) = (x − y)^2/(2xy^2). The saddle-point approximation is then given by
p_φ(x | µ, κ) = (1/√(2πκx^3)) exp(−(x − µ)^2/(2κxµ^2)).                                (10)
Coincidentally, this is the exact density function for an inverse Gaussian with mean µ and shape pa-
rameter 1/κ. Also note that the support S = Ω = (0, ∞).
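This exactness can be checked numerically; the following sketch assumes SciPy's parametrization of the inverse-Gaussian, in which a distribution with mean µ and shape 1/κ is invgauss(µκ, scale=1/κ):

    import numpy as np
    from scipy.stats import invgauss

    def ig_saddlepoint(x, mu, kappa):
        # Eq. 10, which coincides with the exact inverse-Gaussian density.
        return np.exp(-(x - mu) ** 2 / (2 * kappa * x * mu ** 2)) / np.sqrt(2 * np.pi * kappa * x ** 3)

    x = np.linspace(0.1, 5.0, 50)
    mu, kappa = 1.0, 0.5
    assert np.allclose(ig_saddlepoint(x, mu, kappa), invgauss(mu * kappa, scale=1.0 / kappa).pdf(x))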
In the case of steep continuous EDMs, we know that C \ Ω = bd Ω and the probability mass on
bd Ω is zero (Jorgensen, 1997). Therefore, the x ∈ Ω condition is not limiting for such distributions.
Recall that the Bernoulli distribution in Example 2 has all of its probability mass on bd Ω. Although
the Bernoulli distribution itself is not an EDM, there exist discrete EDMs where the probability mass
on bd Ω is not zero. In particular, when a steep discrete EDM has support at zero, the saddle-point
approximation must be modified to accommodate the point mass at bd Ω as
p_φ(x | µ, κ) = √(κ/(2πυ(κx + κc))) exp(−d_φ(κx, κµ)/κ)    as κ → 0                    (11)
where c is a small positive constant (usually set to 1/3) (McCullagh and Nelder, 1989; Jorgensen,
1997). When the support does not include 0, we set c = 0.
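For instance, for the Poisson distribution (κ = 1, υ(x) = x, d_φ(x, µ) = x log(x/µ) − x + µ), the modified approximation with c = 1/3 can be compared against the exact probability mass function; a short sketch (Python with SciPy assumed):

    import numpy as np
    from scipy.stats import poisson

    def poisson_div(x, mu):
        # Bregman divergence generated by phi(t) = t log t - t, with the convention 0 log 0 = 0.
        return np.where(x > 0, x * np.log(np.maximum(x, 1e-300) / mu) - x + mu, mu)

    def saddlepoint_pmf(x, mu, c=1.0 / 3.0):
        # Eq. 11 with kappa = 1 and variance function v(t) = t.
        return np.exp(-poisson_div(x, mu)) / np.sqrt(2 * np.pi * (x + c))

    x = np.arange(0, 30)
    print(np.max(np.abs(saddlepoint_pmf(x, 10.0) - poisson.pmf(x, 10.0))))  # on the order of 1e-3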
Table 2: Summary of the family of steep EDMs for different data types. The hyper-parameter
domain, A, the variance function, υ(x | α), and the Bregman divergence, dφ (x, µ | α), used in
AdaCluster for different data types.
Since Θ is open, we get a regular exponential family. The mean-value mapping is given by
τ(θ | α) = e^θ/(1 − αe^θ).
The range of the mean-value mapping is then Ω = (0, ∞). The convex conjugate of Ψ is given by
φ(t | α) = t log t − ((αt + 1)/α) log(αt + 1)    if α > 0
φ(t | α) = t log t − t                            if α = 0
with support S = Z0 and convex support C = R0 . From Theorem 5, we know that φ generates
a Bregman divergence. The parametrized Bregman divergence, d_φ(x, y | α) : R_0 × R_+ → R_0, is
given by
d_φ(x, y | α) = (1/α) log(αy + 1)                                       if α > 0, x = 0
                (1/α + x) log((αy + 1)/(αx + 1)) + x log(x/y)           if α > 0, x ≠ 0          (12)
                y                                                        if α = 0, x = 0
                y − x + x log(x/y)                                       if α = 0, x ≠ 0.
Figure 2: Density of EDMs with support on the natural numbers. Probability mass function of
four members of the proposed EDM class for non-negative discrete data with µ = 10 and κ = 1
when the hyper-parameter is set to a) α = 0 (Poisson), b) α = 0.5, c) α = 1 (negative binomial), d)
α = 2.
Two members of this class are the Poisson (α = 0) and the negative binomial (α = 1) distributions.
Having specified the variance function and the corresponding Bregman divergence, we can use the
saddle-point approximation (Eq. 11) to express the density in the dual domain. The probability mass
functions for four members of this class, including Poisson and negative binomial, are depicted in
Fig. 2. Recall that the variance of a given steep EDM, pφ (x | µ, κ, α), is κµ + ακµ2 (see Eq. 8).
We see that, as α increases, the variance increases monotonically. The level of over-dispersion is
therefore captured by the hyper-parameter α. Distributions with higher α are useful for the analysis
of over-dispersed count data.
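A direct transcription of Eq. 12 as a Python sketch (the vectorization and the guard against log 0 are our own choices, not part of the derivation):

    import numpy as np

    def count_divergence(x, y, alpha):
        # Bregman divergence of Eq. 12 for the count-data class (x >= 0, y > 0, alpha >= 0).
        x, y = np.asarray(x, float), np.asarray(y, float)
        if alpha == 0.0:  # Poisson case: generalized KL divergence.
            return np.where(x == 0, y, y - x + x * np.log(np.maximum(x, 1e-300) / y))
        zero = np.log(alpha * y + 1) / alpha
        nonzero = (1.0 / alpha + x) * np.log((alpha * y + 1) / (alpha * x + 1)) \
                  + x * np.log(np.maximum(x, 1e-300) / y)
        return np.where(x == 0, zero, nonzero)

    print(count_divergence([0, 1, 5], 2.0, alpha=0.0))  # Poisson
    print(count_divergence([0, 1, 5], 2.0, alpha=1.0))  # negative binomial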
Figure 3: Density of EDMs with support on the real line. Probability density functions for
four members of the proposed EDM class for continuous data on R with µ = 10 and κ = 1 when
the hyper-parameter is set to a) α = 0 (Gaussian), b) α = 0.5, c) α = 1 (GHS), d) α = 2.
support on the unit interval (Jorgensen, 1997, Chapter 3). GHS allows us to model data on the
unit interval through the logit transform. Similarly, the logit transform of a logit-normal distributed
random variable has a Gaussian distribution.
In this section, we propose a class of EDMs that includes Gaussian and hyperbolic secant dis-
tributions as members. We start with the quadratic variance function υ(x | α) = 1 + αx^2 with
α ∈ R_0. Solving the implied second-order differential equation, Ψ''(θ) = 1 + α(Ψ'(θ))^2, we get the
following log-partition function
Ψ(θ | α) = −(1/α) log(cos(√α θ))    if α > 0
Ψ(θ | α) = θ^2/2                     if α = 0.
Since Θ is open, we get a regular (and thus steep) exponential family distribution for every α ∈ R0 .
The mean-value mapping is given by
τ(θ | α) = (1/√α) tan(√α θ)    if α > 0
τ(θ | α) = θ                    if α = 0.
The range of the mean-value mapping is then Ω = R. The convex conjugate of Ψ is given by
φ(t | α) = (t/√α) arctan(√α t) − (1/(2α)) log(1 + αt^2)    if α > 0
φ(t | α) = t^2/2                                            if α = 0.
Two noteworthy members of this class are the Gaussian (α = 0) and generalized hyperbolic secant
(α = 1) distributions. We can use the saddle-point approximation (Eq. 9) with the variance function
and the Bregman divergence above to express the density of each member of the class. In Fig. 3, the
probability distribution functions for several members of this class, including Gaussian and GHS,
are shown. The variance of a given steep EDM, pφ (x | µ, κ, α), in this class is κ + ακµ2 . As α
increases, we see that the asymmetric shape becomes more apparent. Similar to the case for the
over-dispersed count data, increasing α implies a higher variance for a fixed mean parameter.
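As a sketch (Python; φ and its gradient τ^{-1} are transcribed from the derivation above, i.e. from the variance function 1 + αx^2), the divergence of this class can be evaluated directly, with α = 0 recovering the squared-error divergence (x − y)^2/2:

    import numpy as np

    def ghs_phi(t, alpha):
        if alpha == 0.0:
            return 0.5 * t ** 2
        s = np.sqrt(alpha)
        return (t / s) * np.arctan(s * t) - np.log1p(alpha * t ** 2) / (2 * alpha)

    def ghs_divergence(x, y, alpha):
        # Bregman divergence generated by phi(. | alpha); grad phi(y) = arctan(sqrt(alpha) y) / sqrt(alpha).
        grad = y if alpha == 0.0 else np.arctan(np.sqrt(alpha) * y) / np.sqrt(alpha)
        return ghs_phi(x, alpha) - ghs_phi(y, alpha) - grad * (x - y)

    print(ghs_divergence(2.0, -1.0, alpha=0.0))  # 4.5, i.e. (x - y)^2 / 2 (Gaussian case)
    print(ghs_divergence(2.0, -1.0, alpha=1.0))  # GHS divergence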
Suppose we have a power variance function υ(x | α) = x^{2−α} with α ∈ R. Later we'll see
that not every choice of α yields a steep EDM; therefore, we will choose a smaller set for A. After
solving Ψ''(θ) = (Ψ'(θ))^{2−α}, we obtain the following log-partition function
Ψ(θ | α) = (1/α)(((α − 1)θ + 1)^{α/(α−1)} − 1)    if α ∈ R \ {0, 1}
Ψ(θ | α) = e^θ − 1                                 if α = 1                             (14)
Ψ(θ | α) = −log(1 − θ)                             if α = 0
with Θ given by
Θ = (−∞, 1/(1 − α)]    if α ∈ (−∞, 0)
Θ = (−∞, 1/(1 − α))    if α ∈ [0, 1)                                                   (15)
Θ = [1/(1 − α), ∞)     if α ∈ (1, ∞) \ {2}
Θ = R                   if α ∈ {1, 2}.
Note that Ψ is continuous in α, including the limiting cases for α = 0 and α = 1. When α ∈
[0, 1] ∪ {2}, the corresponding distributions belong to the regular exponential family. Similarly, for
α ∈ (−∞, 1] ∪ {2}, the corresponding distributions are steep. The mean-value mapping is given by
τ(θ | α) = ((α − 1)θ + 1)^{1/(α−1)}    if α ≠ 1
τ(θ | α) = e^θ                          if α = 1
with range Ω = (0, ∞), except when α = 2 where Ω = R. The convex conjugate is given by
φ(t | α) = (t^α − αt + α − 1)/(α(α − 1))    if α ∈ R \ {0, 1}
φ(t | α) = t log t − t + 1                   if α = 1                                   (16)
φ(t | α) = t − log t − 1                     if α = 0
with convex support C = [0, ∞) when α ∈ (0, ∞) \ {2}, C = (−∞, ∞) when α = 2, and
C = (0, ∞) when α ∈ (−∞, 0]. When 1 < α < 2, we have Ψ''(1/(1 − α)) = 0, leading to a degenerate
distribution. At the expense of having a non-full exponential family, we limit the convex support to
be C = (0, ∞) when 1 < α < 2. With this choice, the domain of Ψ is limited to Θ = (1/(1 − α), ∞)
when 1 < α < 2. The corresponding divergence is given by
d_φ(x, y | α) = (x^α + (α − 1)y^α − αxy^{α−1})/(α(α − 1))    if α ∈ R \ {0, 1}
                x(log x − log y) + (y − x)                     if α = 1                 (17)
                x/y − log(x/y) − 1                              if α = 0.
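A sketch of Eq. 17 in Python; numerically, the α → 1 and α → 0 limits of the generic branch approach the generalized KL and Itakura-Saito divergences, and α = 2 gives the halved squared error:

    import numpy as np

    def tweedie_divergence(x, y, alpha):
        # Bregman divergence of Eq. 17 (a beta-divergence up to parametrization).
        x, y = np.asarray(x, float), np.asarray(y, float)
        if alpha == 1.0:
            return x * (np.log(x) - np.log(y)) + (y - x)
        if alpha == 0.0:
            return x / y - np.log(x / y) - 1.0
        return (x ** alpha + (alpha - 1) * y ** alpha - alpha * x * y ** (alpha - 1)) / (alpha * (alpha - 1))

    x, y = 3.0, 2.0
    print(tweedie_divergence(x, y, 1e-6), tweedie_divergence(x, y, 0.0))      # Itakura-Saito limit
    print(tweedie_divergence(x, y, 1 + 1e-6), tweedie_divergence(x, y, 1.0))  # generalized KL limit
    print(tweedie_divergence(x, y, 2.0))                                      # (x - y)^2 / 2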
β-divergences was established in (Hennequin et al., 2011). The connection between β-divergences
and the Tweedie class has been explored in (Yilmaz and Cemgil, 2012). We note that the definitions
of β-divergences are slightly different in these two works.
When the support is S = (0, ∞), we choose A = (−∞, 0], which includes the gamma (α = 0)
and inverse-Gaussian (α = −1) distributions. In fact, it is possible to extend A to (−∞, 2], assum-
ing the probability mass at 0 is small for 0 < α < 2 and the probability of the interval (−∞, 0]
is small for α = 2. We note that A = (−∞, 2] includes Poisson (α = 1) and Gaussian (α = 2)
distributions as well. At first, using the Poisson distribution to model positive continuous data looks
unusual; however, the saddle-point approximation in Eq. 9 requires only that the set-of-interest is
a subset of the convex support and the variance function is positive. Both of these conditions are
satisfied since R+ ⊂ C = R0 and υ(x | α = 1) = x > 0 ∀x ∈ R+ .
Non-negative continuous data. Zero-inflated distributions are often modeled with compound-
Poisson distributions (Basbug and Engelhardt, 2016). Within the Tweedie class, compound-Poisson-
Gamma distributions correspond to the hyper-parameter range 0 < α < 1. Similar to positive
continuous data, the divergence corresponding to the Poisson distribution can be used for non-
negative continuous data, which makes the feasible hyper-parameter domain A = (0, 1]. One
important caveat is that, when using the saddle-point approximation for the point mass at zero, we
must use Eq. 11 instead of Eq. 9.
Figure 4: Homogeneous and heterogeneous mixture models. Contour plots for various mixture
models with fixed centroids located at (10,15), (25,20) and (15,30). Homogeneous mixture model
with fixed dispersion across dimensions and mixture components and fixed Gaussian topology in
both dimensions (a) and gamma topology in both dimensions (b); heterogeneous mixture model
with dispersion varying in the first and second dimensions but fixed across mixture components
and with gamma topology in the first dimension and Gaussian topology in the second dimension
(c), Gaussian topology in both dimensions (d) and gamma topology in both dimensions (e); het-
erogeneous mixture model with dispersion varying in the first and second dimension and across the
three mixture components and gamma topology in the first dimension and Gaussian topology in the
second dimension (f). Bregman soft clustering can be used for (a) and (b), AdaCluster can handle
(a-e). The most arbitrary form of heterogeneous mixture models (f) cannot be captured by either of
the algorithms.
Bregman soft clustering—consisting of closed-form updates to fit any homogeneous mixture model
that can be represented in terms of a Bregman divergence. Based on the discussion in Section 2.4, we
know that the density of steep EDMs can be represented in terms of Bregman divergences; hence,
we conclude that the homogeneous mixture of steep EDMs can be fit to data with the Bregman
soft-clustering algorithm.
We start by describing the closed-form updates of the Bregman soft-clustering algorithm when
used to fit a mixture of steep EDMs. We write the log-likelihood of a set of observations X =
{x_i}_{i=1}^N with J attributes under the homogeneous mixture of steep EDMs, p_φ(x | µ, κ), as
log p_φ(X | π, µ, κ) = ∑_{i=1}^N log ∑_{h=1}^K π_h exp( ∑_{j=1}^J [ −d_φ(x_ij, µ_hj)/κ + log g_φ(x_ij, κ) ] )    (18)
where π = {π_h}_{h=1}^K are the mixture proportions, and d_φ(x, y) is the Bregman divergence generated
from a fixed strictly convex function φ. The underlying distribution pφ (x | µ, κ) is fixed and
specified a priori.
In the E-step of the Bregman soft-clustering algorithm, the weight of each mixture component
is updated using:
p_φ(h | x_i, π, µ, κ) ← π_h exp(−(1/κ) ∑_{j=1}^J d_φ(x_ij, µ_hj)) / ∑_{h'=1}^K π_{h'} exp(−(1/κ) ∑_{j=1}^J d_φ(x_ij, µ_{h'j}))    (19)
for i = 1, . . . , N and h = 1, . . . , K. Note that the base measure terms log gφ (xij , κ) cancel. It
is often the case that the base measure does not have a closed form. The homogeneity assumption
simplifies the EM algorithm, allowing us to specify the log-partition function Ψ or its conjugate φ,
and not the full density.
In the M-step of Bregman soft clustering, the maximum likelihood estimates of π and µ are
given by
π_h ← (1/N) ∑_{i=1}^N p_φ(h | x_i, π, µ, κ)                                             (20)
µ_hj ← ∑_{i=1}^N p_φ(h | x_i, π, µ, κ) x_ij / ∑_{i=1}^N p_φ(h | x_i, π, µ, κ)           (21)
for h = 1, . . . , K and j = 1, . . . , J. The base measure terms do not appear in the closed-form
updates.
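A compact sketch of one E-step/M-step pass (Eqs. 19-21) in Python, with the divergence d_φ supplied as a vectorized function; this is an illustration, not the authors' reference implementation:

    import numpy as np

    def em_step(X, pi, mu, kappa, d_phi):
        # One Bregman soft-clustering iteration. X: (N, J); pi: (K,); mu: (K, J).
        D = np.stack([d_phi(X, m).sum(axis=1) for m in mu], axis=1)   # (N, K) total divergences
        logw = np.log(pi) - D / kappa                                 # E-step (Eq. 19); base measures cancel
        logw -= logw.max(axis=1, keepdims=True)                       # numerical stability
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)
        pi = w.mean(axis=0)                                           # M-step (Eq. 20)
        mu = (w.T @ X) / w.sum(axis=0)[:, None]                       # M-step (Eq. 21)
        return pi, mu, w

    # Example with the squared Euclidean divergence (the Gaussian case).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    pi, mu = np.full(2, 0.5), X[rng.choice(len(X), 2, replace=False)]
    for _ in range(20):
        pi, mu, w = em_step(X, pi, mu, 1.0, lambda x, m: 0.5 * (x - m) ** 2)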
So far, we have only described the Bregman soft-clustering algorithm updates for a homogeneous
mixture of steep EDMs and did not modify the existing algorithm of (Banerjee et al., 2005). To
be able to employ Bregman soft clustering, we need to specify the topology using the divergence-
generating function φ, the variance function υ or the log-partition function Ψ. If we have multiple
candidates for the underlying distribution, we can optimize the objective function in Eq. 18 although
it requires specification of the full density including the base-measure terms (gφ (·, κ)) which often
do not have closed-form expressions.
The homogeneous mixture of steep EDMs, however, can fall short in capturing the true geo-
metric properties of the data. In data sets collected from multiple complex sources, the dispersion
parameters tend to be different across attributes even when the topology is the same (Fig. 4d and
Fig. 4e). For an arbitrary EDM, normalizing data can be burdensome since we would have to fit
an individual mixture model to each attribute. Inference gets even more complicated when the
topology is different across attributes (Fig. 4c). Census data is a good example, where we have
non-negative discrete attributes such as age, positive continuous attributes such as height, possibly
negative continuous attributes such as income, and binary attributes such as gender. When the num-
ber of attributes is small, a custom mixture model can be built; however, as the dimensionality of
the data increases, creating a custom model for each attribute is tedious. To address this problem,
we develop heterogeneous mixture models, where the topology and the dispersion characteristics
are attribute-specific and shared across mixture components.
Suppose we have a family of steep EDMs {ΦA }j to model the underlying distribution of the
j th attribute. More precisely, we assume that each distribution in that family has the same support
and can be specified with the mother function φj (· | α) and hyper-parameter domain Aj . We can
write the quasi-log-likelihood of data under the heterogeneous mixture model using the saddle-point
approximation as
L̂ = log p̂_φ(X | π, µ, κ, α) = ∑_{i=1}^N log ∑_{h=1}^K π_h exp(Υ(x_i, µ_h, κ, α)),    (22)
where Υ(x, µ, κ, α) := −∑_{j=1}^J [ d_φj(x_j, µ_j | α_j)/κ_j + (1/2) log(2πκ_j υ_j(x_j | α_j)) ] is an auxiliary
function. In practice, we algorithmically detect the type of each attribute, and we set {ΦA }j ac-
cordingly. Table 2 summarizes the properties of the EDM families for different data types. We
present the AdaCluster algorithm, which maximizes the quasi-log-likelihood L̂ using expectation-
maximization.
In the E-step of AdaCluster, the weight of each mixture component is given by
p_φ(h | x_i, µ, κ, α, π) ← π_h exp(Υ(x_i, µ_h, κ, α)) / ∑_{h'=1}^K π_{h'} exp(Υ(x_i, µ_{h'}, κ, α)).    (23)
Using Eq. 24, we calculate the updates for µ, κ and α (Algorithm 1). We first note that the MLE
of κ has a closed form. Since we assume the underlying distribution is a steep EDM, the maximum
likelihood estimate of µ also has a closed form (Theorem 6). For the hyper-parameter vector α,
numerical optimization techniques may be used to find the value of α that maximizes the quasi-log-
likelihood L̂.
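A per-attribute sketch of this M-step in Python: µ_hj is the responsibility-weighted mean as in Eq. 21, α_j is found by bounded one-dimensional optimization of the quasi-log-likelihood over A_j, and the κ_j update written below is the closed form implied by the saddle-point quasi-likelihood (it matches Eq. 26 under a flat prior); a sketch under these assumptions, not the authors' code:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def adacluster_mstep_j(xj, muj, w, kappa_j, d_phi_j, v_j, bounds):
        # M-step for one attribute j. xj: (N,); muj: (K,); w: (N, K) responsibilities.
        def neg_quasi_ll(alpha):
            div = d_phi_j(xj[:, None], muj[None, :], alpha)               # (N, K)
            pen = 0.5 * np.log(2 * np.pi * kappa_j * v_j(xj, alpha))      # (N,)
            return np.sum(w * (div / kappa_j + pen[:, None]))
        alpha_j = minimize_scalar(neg_quasi_ll, bounds=bounds, method="bounded").x
        div = d_phi_j(xj[:, None], muj[None, :], alpha_j)
        kappa_j = 2.0 * np.sum(w * div) / np.sum(w)                       # closed-form dispersion update
        return alpha_j, kappa_j

    # Example on one positive-valued attribute with the Tweedie family (1 < alpha < 2).
    rng = np.random.default_rng(1)
    xj, muj = rng.gamma(5.0, 1.0, size=200), np.array([3.0, 8.0])
    w = rng.dirichlet([1.0, 1.0], size=200)
    d = lambda x, m, a: (x ** a + (a - 1) * m ** a - a * x * m ** (a - 1)) / (a * (a - 1))
    v = lambda x, a: x ** (2 - a)
    print(adacluster_mstep_j(xj, muj, w, 1.0, d, v, bounds=(1.01, 1.99)))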
To incorporate prior knowledge about the data, we discuss the Bayesian treatment of the het-
erogeneous mixture of steep EDMs where we have conjugate priors on µ and κ. For instance, in
census data, we may fit a mixture model to the census data of all U.S. states and use the estimated
parameters as priors on the mixture model for the state-level census data. Apart from such hierar-
chical applications, the Bayesian approach also offers a way to regularize the dispersion parameters.
In a Gaussian mixture model, a prior on κ is equivalent to regularizing the covariance matrix estimate
with a diagonal matrix.
Suppose we have a conjugate prior on θ_hj = τ^{-1}(µ_hj) with parameters a_hj^µ and b_hj^µ, as in Eq. 4.
Then, the MAP estimate of µ_hj is given by
µ_hj^MAP ← ( a_hj^µ b_hj^µ κ_j + ∑_{i=1}^N p_φ(h | x_i, µ, κ, α, π) x_ij ) / ( b_hj^µ κ_j + ∑_{i=1}^N p_φ(h | x_i, µ, κ, α, π) )    (25)
for h = 1, . . . , K and j = 1, . . . , J. In the mixture model setting, a_hj^µ reflects our prior knowledge
about the location of the h-th centroid along the j-th dimension, and b_hj^µ represents the effective
sample size of the prior. The MAP estimate reduces to the ML estimate when b_hj^µ is set to 0. It
is also possible to set b_hj^µ = η/κ_j, where η is the adjusted effective sample size of the prior; η
represents how much weight is given to the prior relative to the effective cluster size
∑_{i=1}^N p_φ(h | x_i, µ, κ, α, π).
In describing the MAP estimate for κ, we first note that the inverse-gamma (IG) distribution is
the conjugate prior for the dispersion parameter of a steep EDM under the saddle-point approxima-
tion. Suppose we have an independent IG(a_j^κ, b_j^κ) prior on each {κ_j}_{j=1}^J. Then, the MAP estimate
for κ_j is given by
κ_j^MAP ← ( b_j^κ + ∑_{i=1}^N ∑_{h=1}^K p_φ(h | x_i, µ, κ, α, π) d_φj(x_ij, µ_hj | α_j) ) / ( a_j^κ + (1/2) ∑_{i=1}^N ∑_{h=1}^K p_φ(h | x_i, µ, κ, α, π) )    (26)
for j = 1, . . . , J. For the Bayesian version of AdaCluster, we replace the µ_hj and κ_j updates with
MAP estimates, and we adjust the gradient ∂L̂/∂α_j accordingly.
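A transcription of Eqs. 25-26 as reconstructed above (a Python sketch; d_phi_j denotes the attribute-specific divergence, and the prior parameters a_mu and b_mu are per-cluster arrays):

    import numpy as np

    def map_updates(xj, w, kappa_j, alpha_j, d_phi_j, a_mu, b_mu, a_kappa, b_kappa):
        # MAP updates for attribute j. xj: (N,); w: (N, K) responsibilities; a_mu, b_mu: (K,).
        n_h = w.sum(axis=0)                                                       # effective cluster sizes
        mu_map = (a_mu * b_mu * kappa_j + w.T @ xj) / (b_mu * kappa_j + n_h)      # Eq. 25
        div = d_phi_j(xj[:, None], mu_map[None, :], alpha_j)                      # (N, K)
        kappa_map = (b_kappa + np.sum(w * div)) / (a_kappa + 0.5 * w.sum())       # Eq. 26
        return mu_map, kappa_map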
where C(X) = {X_h}_{h=1}^K is the partition of the data set and d_φ(x, y) is the Bregman divergence
corresponding to the underlying steep EDM.
In the case of a heterogeneous mixture of steep EDMs, Bregman soft-clustering does not reduce
to Bregman hard-clustering in general. The only exception is when the dispersion parameter is
shared across attributes, which may be possible if the data are normalized within each attribute.
In this case, we can combine the Bregman divergences specific to each attribute into a single
multi-dimensional Bregman divergence using the linearity property. We can show that Bregman
soft-clustering reduces to Bregman hard-clustering under the small-variance assumption, with the
objective
L_BHC(C(X) | µ, α) = ∑_{h=1}^K ∑_{x_i ∈ X_h} ∑_{j=1}^J d_φj(x_ij, µ_hj | α_j).    (28)
We note that the topological properties of the attributes need to be specified ahead of time and
the Bregman hard-clustering algorithm cannot adaptively infer the topology. One might attempt to
learn the parameters α by minimizing Eq. 27 directly. This naive approach fails in practice; similar
accounts have been described for the Tweedie class (Dunn and Smyth, 2005, 2008).
We explore alternatives to the likelihood approach to adaptively learn the parameters of a general
heterogeneous mixture of steep EDMs in the setting of hard clustering. Suppose we have N i.i.d.
samples {x_i}_{i=1}^N from a steep EDM p_φ(x | µ, κ, α) such that (1/N) ∑_{i=1}^N x_i ∈ int(C). Then we can
estimate µ, κ, and α with the method of moments using the first three sample moments. Estimating
parameters using higher-order moments suffers from statistical efficiency problems. Fortunately, the
hard clustering problem has an additional property that we can exploit. The parameters κ and α are
shared across mixture components, and each cluster is assumed to be disjoint. Therefore, the first
two population moments for K clusters allow us to write down 2JK moment conditions for J(K +
2) parameters. For K = 2, we have exact identification, i.e., the number of moment conditions is
the same as the number of parameters to estimate. When K > 2, we have more moment conditions
than parameters, leading to an overidentified system using the first two population moments for
each cluster. This is the ideal setting for Generalized Method of Moments (GMoM) (Hansen, 1982).
GMoM is a method of estimating the set of parameters of a model using moment conditions and
arbitrary functions of random variables and parameters that are zero in expectation. Given a vector
of moment conditions, GMoM minimizes the quadratic form defined using an arbitrary positive
definite weight matrix (Hansen, 1982). The choice of the weight matrix determines the efficiency
of the estimator.
For the remainder of this section, we denote the vector of parameters in a heterogeneous mixture
of steep EDMs as λ = {µ, κ, α}. We denote the feasible set for λ with Λ. We construct the set
Λ using {Ω_j}_{j=1}^J for the mean parameters µ, R_+^J for the dispersion parameters κ, and {A_j}_{j=1}^J for
the hyper-parameters α.
Using the parametrized variance function of steep EDMs (Eq. 8), we write the first two moment
conditions for the j-th attribute and the h-th cluster as
m_hj(x; λ) = ( x_j − µ_hj ,  (x_j − µ_hj)^2 − κ_j υ_j(µ_hj | α_j) )^T.
Since E[mhj (x; λ)] = 0 for each attribute j = {1, . . . , J} and each cluster h = {1, . . . , K},
they can be used as moment conditions in GMoM. Having described the moment conditions, we
need to specify a positive-definite weight matrix to formulate the GMoM objective. In Hansen’s
iterative GMoM method, the weight matrix is constructed by taking the inverse of the residual ma-
trix (Hansen, 1982). This construction yields asymptotically efficient estimates for the parameters.
A variant of the iterative GMoM is the Continuously Updating Generalized Method of Moments
(CUGMoM), which is preferred when analytical solutions in each iteration of the original method
are not available (Hansen et al., 1996). Instead of iterating between the update for the parameter
estimates and the weight matrix estimate, CUGMoM optimizes for the parameters directly using
computational techniques.
The CUGMoM objective is given by
L_GMoM(C(X); λ) = ∑_{h=1}^K ∑_{j=1}^J m̄_hj^T W_hj m̄_hj    (29)
where
m̄_hj(C(X); λ) = (1/|X_h|) ∑_{x_i ∈ X_h} m_hj(x_i; λ)    (30)
W_hj(C(X); λ) = [ (1/|X_h|) ∑_{x_i ∈ X_h} m_hj(x_i; λ) m_hj(x_i; λ)^T ]^{-1}.    (31)
The CUGMoM parameter estimates are λ̂ = argmin_{λ∈Λ} L_GMoM(C(X); λ). To solve this op-
timization problem, we used the limited-memory Broyden-Fletcher-Goldfarb-Shanno with bound-
aries (L-BFGS-B) algorithm (Byrd et al., 1995). The computational bottleneck for the GMoM
routine is usually the inversion of the weight matrix. In the hard clustering problem, clusters are
disjoint; hence, the residual matrix and the weight matrix are block diagonal. The inversion of the
weight matrix then has complexity O(KJ) rather than O(K^3 J).
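A sketch of the per-block computation in Python (the small ridge term added before inversion is our own numerical safeguard, not part of the method):

    import numpy as np

    def cugmom_block(xj, mu, kappa, alpha, v):
        # CUGMoM contribution of one (cluster, attribute) block, Eqs. 29-31.
        m = np.stack([xj - mu, (xj - mu) ** 2 - kappa * v(mu, alpha)], axis=1)   # (n, 2) moment conditions
        m_bar = m.mean(axis=0)                                                   # Eq. 30
        W = np.linalg.inv(m.T @ m / len(xj) + 1e-9 * np.eye(2))                  # Eq. 31 (regularized)
        return m_bar @ W @ m_bar

    # The full objective (Eq. 29) sums such blocks over clusters and attributes; the resulting
    # function of lambda = (mu, kappa, alpha) can then be minimized with L-BFGS-B over Lambda.
    xj = np.random.default_rng(2).gamma(4.0, 1.0, 100)
    print(cugmom_block(xj, mu=4.0, kappa=1.0, alpha=0.0, v=lambda m, a: m ** (2 - a)))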
GMoM gives us a way to estimate the parameters of the model given the partition of the data
into clusters, C(X ). This is somewhat equivalent to the M-step of AdaCluster. To mimic the E-step
of AdaCluster, we use the weight matrix of each cluster to calculate the quadratic distance, and then
we assign each sample to the cluster with minimum distance. We refer to this algorithm (Alg. 2)
as GMoM Hard Clustering (GMoM-HC). Convergence can be assessed either through the objective
or through a stable partition. A lighter version of this algorithm can also be derived by setting the
mean estimates to the maximum likelihood estimates. In this case, the only moment conditions are
the second population moments; then, the inverse weight matrix becomes a diagonal matrix with
the mean squared error for the moment conditions as the diagonal entries. The analogous Bayesian
algorithm can be derived by fixing b_h^µ many pseudo-samples at location a_h^µ to the h-th cluster. Recall
that the geometric interpretation of the conjugate prior to the mean parameter is having b many pseudo-
samples at a location a (Agarwal and Daumé III, 2010).
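A sketch of the assignment step (Python; the block weight matrices are assumed to be stored as a (K, J, 2, 2) array Ws computed as in Eq. 31):

    import numpy as np

    def gmom_assign(x, mus, kappas, alphas, Ws, v):
        # Assign one sample x (length J) to the cluster with minimum quadratic GMoM distance.
        K, J = mus.shape
        dist = np.zeros(K)
        for h in range(K):
            for j in range(J):
                m = np.array([x[j] - mus[h, j],
                              (x[j] - mus[h, j]) ** 2 - kappas[j] * v(mus[h, j], alphas[j])])
                dist[h] += m @ Ws[h, j] @ m
        return int(np.argmin(dist))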
6. Results
6.1 Setup
We used normalized mutual information (NMI) to quantify the results from a clustering algorithm
with respect to ground truth. NMI is a metric that takes values between 0 and 1, where 0 corresponds
to random cluster assignments.
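For example, with scikit-learn (an implementation detail we assume; the paper does not name a particular library):

    from sklearn.metrics import normalized_mutual_info_score
    print(normalized_mutual_info_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: relabeled but identical clustering
    print(normalized_mutual_info_score([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0: assignments carry no information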
In addition to AdaCluster and GMoM-HC, we used k-means and the EM algorithm for a Gaus-
sian mixture model (GMM) for comparison. We force the covariance matrix of the GMM to be
diagonal and shared across mixture components. We note that AdaCluster reduces to a GMM when
ΦA is a singleton set with only Gaussian members. We also note that in such a setting κ = {κ_j}_{j=1}^J
corresponds to the diagonal entries of the GMM covariance matrix and is shared across clusters.
To set up AdaCluster and GMoM-HC, we categorized each attribute based on whether we
found all of the values to be positive, non-negative, or on the real line, and whether the values
were discrete or continuous. Based on the different categories of attribute values, we selected the
parameterization of the EDM, ΦA , as outlined in Table 2.
To initialize all of the algorithms in this comparison, we used k-means++ (Arthur and Vassil-
vitskii, 2007). In AdaCluster and GMoM-HC, we used the k-means++ centroids to set the hyper-
parameters a_h^µ while fixing b_h^µ = 1. We set the hyper-parameters of κ as a^κ = 1.0 and b^κ = 10^{-9} to
Figure 5: Simulated density estimation results. Quantile-quantile plots of the samples generated
from homogeneous steep EDM mixture models with an underlying distribution of a) Gaussian, b)
Poisson, c) gamma, d) negative binomial, and e) inverse-Gaussian against samples generated from
the Gaussian mixture model fitted to the synthetic data. Each color corresponds to a synthetic data set
with distinct parameters, and the diagonal black line marks a perfect fit.
ensure numerical stability. Similarly, we add 10^{-9} to the diagonal entries of the covariance matrix
in the GMM.
We ran each algorithm for a maximum of 1,000 iterations and terminated early if the hard
cluster assignments did not change in two consecutive iterations. We restarted each algorithm 1,000
times and selected the results from these 1,000 runs with the highest likelihood for the GMM and
AdaCluster, and with the lowest inertia for k-means and GMoM-HC.
To illustrate the benefits of our approach to clustering on homogeneous data, we considered the
problem of density estimation with mixture models. For each distribution in {Gaussian, gamma,
inverse-Gaussian, Poisson, negative binomial}, we generated six single dimensional data sets of
size N = 1000 from a mixture model with K = 4 mixture components. We drew the dispersion
parameter κ from an inverse-gamma with shape parameter 1.01 and scale parameter 1.0. We chose
(a) (b)
Figure 6: Synthetic heterogeneous data results. Comparing AdaCluster with the GMM on syn-
thetic heterogeneous data in terms of a) normalized mutual information (NMI) and b) per sample
log-likelihood. Each red point corresponds to a different data set, and the black diagonal line indi-
cates equivalent performance.
the centroids such that p_φ(µ_{h'} | µ_h, κ, α) < 0.01 for every h = 1, . . . , K and h' ≠ h. We then fit a
Gaussian mixture model to the data and generated 1, 000 samples from this model.
The quantile-quantile plots between the data and the posterior samples generated from the fitted
GMM show that the GMM recovers the true density accurately when the underlying distribution is
Gaussian (Fig. 5). However, the GMM fails in the case of gamma, inverse-Gaussian, Poisson, and
negative binomial. Recall that the variance function, υ(x), of the Gaussian is constant (1), linear
for Poisson (x), quadratic for gamma (x^2) and negative binomial (x^2 + x), and finally cubic for
inverse-Gaussian (x^3). Hence, the GMM performance progressively degrades as the variance-mean
relationship deviates from constant (Fig. 5a–5e).
To compare AdaCluster and GMM performance in clustering heterogenous data, we gener-
ated 100 data sets of size N = 1000 with J = 10 dimensions and K = 4 mixture compo-
nents. We started by drawing cluster proportions from a Dirichlet distribution with equal con-
centration parameters. For each dimension j = 1, . . . , J, we first drew a dispersion parameter κj
from an inverse-gamma with shape parameter 1.01 and scale parameter 1.0. We then randomly
selected the underlying distribution from {Gaussian, gamma, inverse-Gaussian, Poisson, negative
binomial} and set α_j and φ_j(· | α_j) accordingly. Finally, we selected the cluster centroids such that
p_φj(µ_{h'j} | µ_hj, κ_j, α_j) < 0.01 for every h = 1, . . . , K and h' ≠ h. We then ran AdaCluster and
EM for the GMM as described in the previous section.
AdaCluster outperforms the GMM in every synthetic data set except for one in terms of NMI
(Fig. 6). More interestingly, AdaCluster often achieves perfect NMI score whereas the GMM per-
formance is below 0.2 in roughly half of the data sets. In fact, GMM either struggles to cluster the
data (top left quadrant of Fig. 6a) or achieves a reasonable NMI value (top right quadrant of Fig. 6a).
In terms of the quasi-log-likelihood, AdaCluster outperforms GMM in every data set (Fig. 6b), sug-
gesting that AdaCluster not only yields a better clustering of the data than the GMM but also provides a
better fit. We note that the quasi-log-likelihood corresponds to the exact log-likelihood in the case
of the GMM since the saddle-point approximation is exact for the Gaussian distribution.

Table 3: Comparison of NMI values for EM for the Gaussian mixture model (GMM), Ada-
Cluster (AdaC.), k-means, and GMoM hard clustering (GMoM-HC) on nine UCI data sets.
N is the number of samples; J is the number of attributes; K is the number of clusters; S is the
support.
that the dispersion is roughly the same across attributes. In seeds and libras, AdaCluster and k-
means have similar performance, and the GMM is inferior to both. The common characteristic
of these two data sets is a relatively small dispersion across attributes. We verified this hypothesis
by examining the parameters fitted by AdaCluster. Small dispersion might explain why k-means
outperformed the GMM. Similarly, it can be argued that the Gaussian distribution approximates the
underlying distribution when the dispersion is small. Finally, the GMM beats AdaCluster on the leaf data
set, which has a small number of samples per cluster, leading to noisy estimates of α and degrad-
ing the performance of the adaptive algorithms. This is especially true for GMoM-HC, where the
performance relies on having good α estimates within each cluster.
measured with normalized mutual information and log-likelihood. We also compared AdaCluster
and GMM on nine data sets from the UCI repository with distinct topology and dispersion charac-
teristics. We found that AdaCluster clusters heterogeneous data consistently better than EM for
the GMM.
Acknowledgments
We would like to acknowledge support for this project for MEB from the Princeton Innovation J.
Insley Blair Pyne Fund Award, and, for BEE, from NIH R00 HG006265, NIH R01 MH101822,
NIH U01 HG007900, and a Sloan Faculty Fellowship.
References
Arvind Agarwal and Hal Daumé III. A geometric view of conjugate priors. Machine learning, 81
(1):99–113, 2010.
Shun-ichi Amari. Differential-geometrical methods in statistics, volume 28. Springer Science &
Business Media, 2012.
David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceed-
ings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035.
Society for Industrial and Applied Mathematics, 2007.
Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Breg-
man divergences. The Journal of Machine Learning Research, 6:1705–1749, 2005.
Shaul K Bar-Lev, Peter Enis, et al. Reproducibility and natural exponential families with power
variance functions. The Annals of Statistics, 14(4):1507–1522, 1986.
Ole Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. John Wiley &
Sons, 1978.
Mehmet E Basbug and Barbara E Engelhardt. Hierarchical compound Poisson factorization. Pro-
ceedings of the International Conference on Machine Learning, pages 1795–1803, July 2016. URL
http://arxiv.org/abs/1604.03853.
Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by
minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
Christopher M Bishop et al. Pattern recognition and machine learning, volume 4. Springer New
York, 2006.
Chester Ittner Bliss and Ronald A Fisher. Fitting the negative binomial distribution to biological
data. Biometrics, 9(2):176–200, 1953.
Lev M Bregman. The relaxation method of finding the common point of convex sets and its ap-
plication to the solution of problems in convex programming. USSR computational mathematics
and mathematical physics, 7(3):200–217, 1967.
Richard H Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for
bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
Andrzej Cichocki, Rafal Zdunek, and Shun-ichi Amari. Csiszár's divergences for non-negative ma-
trix factorization: Family of new algorithms. In Independent Component Analysis and Blind
Signal Separation, pages 32–39. Springer, 2006.
Peter K Dunn and Gordon K Smyth. Series evaluation of Tweedie exponential dispersion model
densities. Statistics and Computing, 15(4):267–280, 2005.
Peter K Dunn and Gordon K Smyth. Evaluation of Tweedie exponential dispersion model densities
by Fourier inversion. Statistics and Computing, 18(1):73–86, 2008.
Bradley Efron. Defining the curvature of a statistical problem (with applications to second order
efficiency). The Annals of Statistics, pages 1189–1242, 1975.
Ronald Fisher. Dispersion on a sphere. In Proceedings of the Royal Society of London A: Math-
ematical, Physical and Engineering Sciences, volume 217, pages 295–305. The Royal Society,
1953.
Ronald Aylmer Fisher. Two new properties of mathematical likelihood. Proceedings of the Royal
Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 144:
285–307, 1934.
Lars Peter Hansen. Large sample properties of generalized method of moments estimators. Econo-
metrica: Journal of the Econometric Society, pages 1029–1054, 1982.
Lars Peter Hansen, John Heaton, and Amir Yaron. Finite-sample properties of some alternative
GMM estimators. Journal of Business & Economic Statistics, 14(3):262–280, 1996.
WL Harkness and ML Harkness. Generalized hyperbolic secant distributions. Journal of the Amer-
ican Statistical Association, 63(321):329–337, 1968.
Romain Hennequin, Bertrand David, and Roland Badeau. Beta-divergence as a subclass of Bregman
divergence. IEEE Signal Processing Letters, 18(2):83–86, 2011.
Ke Jiang, Brian Kulis, and Michael I Jordan. Small-variance asymptotics for exponential family
Dirichlet process mixture models. In Advances in Neural Information Processing Systems, pages
3158–3166, 2012.
Brian Kulis and Michael I Jordan. Revisiting k-means: New algorithms via Bayesian nonpara-
metrics. In Proceedings of the 29th International Conference on Machine Learning (ICML-12),
pages 513–520, 2012.
Erich Leo Lehmann and George Casella. Theory of point estimation. Springer Science & Business
Media, 2006.
Peter McCullagh and John A Nelder. Generalized linear models, volume 37. CRC press, 1989.
Carl N Morris. Natural exponential families with quadratic variance functions. The Annals of
Statistics, pages 65–80, 1982.
John A Nelder and R Jacob Baker. Generalized linear models. Encyclopedia of statistical sciences,
1972.
MCK Tweedie. Functions of a statistical variate with given means, with special reference to Lapla-
cian distributions. In Proceedings of the Cambridge Philosophical Society, volume 43, page 100.
Cambridge Univ Press, 1947.
Y Kenan Yilmaz and A Taylan Cemgil. Alpha/beta divergences and Tweedie models. arXiv preprint
arXiv:1209.4280, 2012.