AdaCluster: Adaptive Clustering for Heterogeneous Data
Mehmet E. Basbug and Barbara E. Engelhardt
Princeton University
Princeton, NJ 08540, USA
Abstract
Clustering algorithms start with a fixed divergence, which captures the possibly asymmetric dis-
tance between a sample and a centroid. In the mixture model setting, the sample distribution plays
the same role. When all attributes have the same topology and dispersion, the data are said to
be homogeneous. If the prior knowledge of the distribution is inaccurate or the set of plausible
distributions is large, an adaptive approach is essential. The motivation is more compelling for
heterogeneous data, where the dispersion or the topology differs among attributes. We propose an
adaptive approach to clustering using classes of parametrized Bregman divergences. We first show
that the density of a steep exponential dispersion model (EDM) can be represented with a Bregman
divergence. We then propose AdaCluster, an expectation-maximization (EM) algorithm to cluster
heterogeneous data using classes of steep EDMs. We compare AdaCluster with EM for a Gaus-
sian mixture model on synthetic data and nine UCI data sets. We also propose an adaptive hard
clustering algorithm based on Generalized Method of Moments. We compare the hard clustering
algorithm with k-means on the UCI data sets. We empirically verify that adaptively learning the
underlying topology yields better clustering of heterogeneous data.
Keywords: Clustering, Mixture Models, Bregman Divergences, Exponential Dispersion Model,
Heterogeneous Data
1. Introduction
Despite the general tendency towards more complex models and algorithms, k-means remains one
of the most popular clustering algorithms (Kulis and Jordan, 2012). The k-means algorithm makes
three simplifying assumptions: i) clusters are disjoint, implying a hard assignment of samples to
clusters; ii) the number of clusters is fixed and specified ahead of time; and iii) the distance be-
tween two samples is measured with Euclidean distance (Bishop et al., 2006). Many extensions to
k-means have been proposed to relax these assumptions. For example, soft k-means, an expectation
maximization (EM) algorithm for a homogeneous Gaussian mixture model, allows for a soft as-
signment of samples to clusters. In fact, the k-means algorithm is recovered from soft k-means in
the limit of asymptotically small variance (Bishop et al., 2006). This probabilistic model
connection opened the door to non-parametric clustering using the DP-means algorithm, removing
the second assumption of static numbers of clusters. DP-means was derived using the Gibbs sam-
pler for a Dirichlet process (DP) Gaussian mixture model (Kulis and Jordan, 2012). Finally, the
Gaussian distribution implicit in the third assumption of Euclidean geometry has been generalized
to regular exponential family distributions (Banerjee et al., 2005). The generalization is established
via the bijection between regular exponential family distributions and regular Bregman divergences.
The resulting algorithm, Bregman hard clustering, also has a soft counterpart, Bregman soft clus-
tering (Banerjee et al., 2005). Furthermore, (Jiang et al., 2012) showed that DP-means may be used
with any regular Bregman divergence, resulting in a nonparametric hard-clustering algorithm for
homogeneous mixture of regular exponential family distributions. After relaxing the three k-means
assumptions, we are still bound to specify a fixed divergence metric to measure the distance from
the centroid. In the case of mixture models, the corresponding choice is for the sample attribute
distributions.
Describing data in its natural habitat can yield tremendous benefits, especially when the errors
are so large that the Gaussian assumption does not hold (Fisher, 1953). The motivation to identify
the true topological properties of the data has led to the study of dispersion models and most notably
to the generalized linear models (Nelder and Baker, 1972). In this paper, we attempt to use disper-
sion models in clustering setting and provide ways of identifying the topology of the data with mild
assumptions on the data distribution.
With a good understanding of the topology of the data, we can create a custom mixture model
by specifying the known distribution of each sample attribute. Under certain regularity and inde-
pendence conditions, we can combine these attribute-specific mixture models into a single mixture
model. This can be a tedious job when the number of attributes is large. A more challenging and
common scenario is when we do not know the underlying distribution of each attribute a priori. The
conventional approach is to cast this problem as model selection. We can select between alternative
models using a criterion such as a Bayes factor, a likelihood ratio, or an information criterion. In
the case of homogeneous data, i.e. when all the attributes have the same topology and dispersion,
the model selection reduces to identifying the single true distribution within the set of plausible
distributions. Model selection in the case of heterogeneous data tends to be more challenging due to
the combinatorial nature of the problem. If we have M_j choices for the j-th attribute, then we need to
explore ∏_j M_j models, which can be overwhelming. We address this problem in three steps. We first
present a large family of distributions, namely steep exponential dispersion models (EDMs), where
the analytic expressions for the mean and dispersion parameter estimators are available. We then
propose parametrized sub-families of steep EDMs for different data types (positive, non-negative,
real, discrete, continuous etc.). Finally, we derive an adaptive clustering algorithm for heteroge-
neous data that can learn the underlying distribution of each attribute given its type.
In Section 2, we introduce convex duality concepts, Bregman divergences, exponential family
distributions, and exponential dispersion models. We show that the density of a steep exponential
dispersion model can be represented in terms of a Bregman divergence. This finding allows us
to parametrize class of steep EDMs by describing the divergence-generating functions in the dual
domain. With the dual formulation, we are able to keep the support of the class consistent and
analyze the variance-mean relationship more clearly.
In Section 3, we introduce parametrized families of steep EDMs for non-negative discrete, con-
tinuous, positive continuous, and non-negative continuous data. Section 3.1 describes a class for
non-negative discrete data that includes Poisson and negative binomial distributions. This class is
primarily suited for count data such as number of hits recorded with a Geiger counter, page views
for a website, patient days spent in a clinic and game scores in a contest (Hilbe, 2011). For contin-
uous data on the real line, we suggest another class including generalized hyperbolic secant (GHS)
and Gaussian distributions in Section 3.2. This class includes asymmetric distributions with ap-
plications to financial data (Fischer, 2013). Furthermore, when data live on a unit interval, the
proposed class can be utilized by transforming data with a logit function first. Most notably, beta
distribution and logit-normal distribution map to GHS and Gaussian distributions under the logit
transform, respectively. The families included in Section 3.1 and 3.2 are sub-families of the Morris
class characterized by the quadratic polynomial variance function (Morris, 1982). In Section 3.3, we
discuss another prominent class of EDMs, the Tweedie class, with a defining property of a power
variance function (Tweedie, 1947). We show that many members of the Tweedie class are steep
EDMs, hence, can be represented with Bregman divergences. We show that the Tweedie class can
be used to analyze positive continuous data as well as for non-negative continuous data depending
on the choice of the hyper-parameter domain.
In Section 4, we write the quasi-log likelihood of heterogeneous data under mixture of steep
EDMs in terms of Bregman divergences using saddle-point approximation. We propose an EM
algorithm with closed-form updates for the mean and dispersion parameters and numerical proce-
dures for learning the hyper-parameters identifying the underlying distribution of each attribute.
The resulting algorithm, AdaCluster, is similar to the Bregman soft-clustering algorithm (Banerjee
et al., 2005) with the added capability of handling heterogenous data. We also consider the Bayesian
treatment of the model and show that the MAP estimates of the mean parameters also have closed
form when we have conjugate priors. Furthermore, we show that inverse-Gamma distribution is the
conjugate prior to the dispersion parameter of steep EDMs under the saddle-point approximation.
In Section 5, we turn our attention to the hard clustering problem for heterogeneous data. We
first note that the AdaCluster algorithm does not have a hard clustering counterpart under the asymp-
totically small variance assumption. We further show that the likelihood approach falls short in
learning the hyper-parameters defining the topology of the data. We then propose an algorithm
based on generalized method of moments (GMoM) using the moment conditions derived from
the parametrized variance functions. Since variance functions and divergence-generating functions
characterize steep EDMs uniquely, the parametrized families introduced in Section 3 can be used
within the hard-clustering setting as well. We derive a k-means like algorithm, GMoM-HC, that
adaptively learns the topology and partitions data into hard clusters.
Finally, in Section 6, we start by examining the performance of EM algorithm for GMM in ho-
mogeneous mixture models with non-Gaussian distributions. We consider the simple task of density
estimation with mixture models and show that the GMM fit to non-Gaussian data fails to capture the
true density as the variance-mean relationship deviates from the Gaussian assumption of indepen-
dence. In another synthetic data analysis, we compare EM for GMM and AdaCluster in clustering
heterogeneous data. We obtain substantial improvements with AdaCluster in terms of normalized
mutual information and likelihood. Finally, we look at nine data sets from the UCI repository with
distinct topology and dispersion characteristics. We compare AdaCluster and GMoM-HC with EM
for GMM and k-means in terms of clustering performance and conclude that AdaCluster clusters
data better than GMM while GMoM-HC and k-means perform comparably.
2. Background
2.1 Convex duality
We begin with a discussion of two fundamental properties of convex functions and their relationship
to one another. The first property is steepness, which is important with respect to the existence and
uniqueness of a maximum likelihood estimator for exponential family distributions (discussed fur-
ther in Section 2.3). The second property is strict convexity, which enables us to generate Bregman
divergences. Bregman divergences are an important class of distortion functions that have favorable
properties such as non-negativity, convexity, and linearity, which we exploit later (Banerjee et al.,
2005). We then state a theorem establishing the connection between steep convex functions and a
particular type of strict convex functions, essentially strict convex functions. For the remainder of
the paper we assume that the convex functions are proper, i.e., they are finite, and their effective
domain is non-empty. We start by formally defining the steepness property.
Definition 1 (Rockafellar, 1970, Section 26) A proper convex function f is essentially smooth if it
is differentiable throughout the nonempty set A = int(dom f)¹ and lim_{i→∞} |∇f(x_i)| = ∞ whenever x_1,
x_2, . . . is a sequence in A converging to a boundary point x ∈ bd A.
As in (Barndorff-Nielsen, 1978), we use the terms essentially smooth and steep interchangeably.
Although the term steep may be used to describe non-convex functions, here we limit our focus to
convex functions.
Strict convexity is another intrinsic property of convex functions. The effective domain of a
convex function is used to determine if the function is strictly convex; however, more refined char-
acterizations of strict convexity exist. One such characterization is essential strict convexity, which
we define as follows.
Definition 2 (Rockafellar, 1970, Section 26) A proper convex function f is called essentially strictly
convex if f is strictly convex on every convex subset of dom ∂f .
To better understand the difference between essentially strictly convex and strictly convex functions,
we state the following theorem.
Theorem 1 (Barndorff-Nielsen, 1978, Theorem 5.22) Suppose f is a convex function. Then
From Theorem 1, it follows that, if f is essentially strictly convex, then f is strictly convex on
ri(dom f ) but not necessarily on dom f . Furthermore, every essentially strictly convex function
with an open domain is also strictly convex.
Example 1 (Rockafellar, 1970, Section 26) An example of an essentially strictly convex but not
strictly convex function is
f(x) = x_2^2/(2x_1) − 2√x_2    if x_1 > 0, x_2 ≥ 0
f(x) = 0                        if x_1 = 0, x_2 = 0.                                   (1)
1. Throughout the paper we use int, ri, bd to denote the interior, relative interior, and boundary of a set, respectively.
For a function f, we denote the domain, sub-differential mapping, gradient, and convex conjugate with dom f, ∂f,
∇f, and f*. We denote the interval [0, ∞) with R_0 and (0, ∞) with R_+. Similarly, we denote the set {0, 1, . . . }
with Z_0 and {1, 2, . . . } with Z_+.
f is strictly convex on dom ∂f = R_+^2 but not on dom f since f is identically zero for x_2 = 0 and
x_1 ≠ 0. Also note that f is strictly convex on ri(dom f) = R_+^2. Incidentally, f is also a steep
function.
In information geometry, strict convexity is a useful property because strictly convex functions
enable the construction of Bregman divergences. Bregman divergences are defined as follows.
Definition 3 (Bregman, 1967) Let f : A → R be a strictly convex function defined on a convex set
A ⊆ R^d such that f is differentiable on nonempty ri(A). The Bregman divergence d_f : A × ri(A) →
R_0 is defined as
d_f(x, y) = f(x) − f(y) − ⟨∇f(y), x − y⟩.
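For concreteness, a minimal Python sketch (NumPy assumed; an illustration rather than part of the formal development) that evaluates a Bregman divergence directly from a generating function and its gradient; φ(x) = ⟨x, x⟩ recovers the squared Euclidean distance and φ(x) = −log x the Itakura-Saito distance:

    import numpy as np

    def bregman(phi, grad_phi, x, y):
        # d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
        x, y = np.asarray(x, float), np.asarray(y, float)
        return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

    # phi(x) = <x, x> gives the squared Euclidean distance: 5.0 here.
    print(bregman(lambda x: np.dot(x, x), lambda x: 2.0 * x,
                  np.array([1.0, 2.0]), np.array([0.0, 0.0])))

    # phi(x) = -log x (scalar case) gives the Itakura-Saito distance x/y - log(x/y) - 1.
    print(bregman(lambda x: -np.log(x), lambda x: -1.0 / x, 2.0, 1.0))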
Bregman divergences include many useful distortion functions such as squared loss, Mahalanobis
distance, KL-divergence, logistic loss, and Itakura-Saito distance. Bregman divergences are non-
negative, convex in the first argument, linear, and, often, non-symmetric. From the statistical per-
spective, Bregman divergences are appealing because, when the density of an exponential family
distribution can be represented in terms of a Bregman divergence, the maximum likelihood estimate
of the natural parameter is unique and can be calculated easily from the sample mean (Banerjee
et al., 2005). This observation hints at a duality between steep convex functions and strictly convex
functions. In general, there is no duality between the two families; however, the connection between
essentially smooth and essentially strictly convex functions can be established with the following
theorem.
Theorem 2 (Rockafellar, 1970, Theorem 26.3) A closed proper convex function is essentially strictly
convex if and only if its conjugate is essentially smooth.
An immediate corollary of Theorem 2 is that the conjugate of an essentially smooth and lower
semi-continuous function is essentially strictly convex. This is due to the bi-conjugate theorem: If
a convex function f is lower semi-continuous, then f ∗∗ = f (Rockafellar, 1970, Theorem 12.2).
Theorem 2 is especially important in characterizing the dual of an exponential family distribution.
We use Theorem 2 to show that the density of a certain subset of exponential family distributions
can be described in terms of Bregman divergences. Before examining the connection between the
exponential family and Bregman divergences, we formally introduce exponential family distribu-
tions.
Table 1: Six common natural exponential family distributions. Bernoulli, geometric, Poisson,
exponential, Rayleigh, and chi-squared distributions with the common density formulation, support (S),
and the corresponding exponential family parametrization with natural parameter (θ), log-partition
function (Ψ), mean-value mapping (τ), and the variance function (υ).
Definition 5 For a non-degenerate exponential family F_(Ψ,Θ), we define the mean-value mapping
τ : int(Θ) → Ω as τ(θ) = ∇Ψ(θ). The range of the mean-value mapping, Ω, is called the mean
domain. In addition, if τ^{-1}(x) is differentiable within Ω, then we define the variance function
υ : Ω → R_+ as υ(x) = (∇τ^{-1}(x))^{-1}.
In Table 1, we show the densities of six natural exponential family distributions with the map-
ping between the common parameters and the natural parameter, the corresponding log-partition
function, mean-value mapping and variance functions. One striking observation is that the variance
function has at most a quadratic term and can be written as a polynomial for these six distributions.
This may seem like a coincidence; however, many of the common exponential family distributions
can be characterized by a quadratic polynomial variance function (Morris, 1982). The parametriza-
tion of variance functions is often employed to characterize families of distributions. We discuss
three different parametrizations of variance functions in Section 3. The variance function also comes
up in the saddle-point approximation of the density, which we discuss in Section 2.4. However,
we first turn our attention to the mean-value mapping and its key role in finding the dual of a density
function.
Steepness is an intrinsic property of an exponential family distribution and is preserved under linear
transformations. Steep exponential family distributions allow us to find analytic solutions for the
maximum likelihood (ML) and maximum a posteriori (MAP) estimates of θ. As we will discuss
shortly, representing the density of a steep exponential family distribution in the dual domain is im-
mensely useful. The dual formulation allows us to parametrize families of distributions, find analytic
estimates of the mean parameter, and use the saddle-point approximation. The following theorem
outlines the key properties of steep exponential families.
Theorem 4 (Barndorff-Nielsen, 1978, Corollary 9.6) Suppose Ψ is steep, which is true in particular
if Ψ is regular. The maximum likelihood estimate exists if and only if T (x) ∈ int(C), and then it is
unique. Furthermore, Ω = int(C) and C \ Ω = bd C. Finally, the maximum likelihood estimator
θ_ML is the one-to-one mapping from Ω onto int(Θ) whose inverse is τ.
The relationship between the ML estimator and the mean-value mapping hints at a convex du-
ality. More concretely, (int(Θ), Ψ) and (Ω, Ψ̄∗ ) are Legendre duals of each other, and duality is
established through the mean-value mapping (Barndorff-Nielsen, 1978, Section 7). We note that Ψ̄∗
is the restriction of Ψ∗ to Ω and τ is a one-to-one mapping and has an inverse for steep exponential
family distributions. The Legendre duality can be summarized with the following equations:
∇Ψ = τ,    ∇Ψ̄^* = τ^{-1},    Ψ̄^*(µ) = ⟨τ^{-1}(µ), µ⟩ − Ψ(τ^{-1}(µ))    for µ ∈ Ω.
The goal of the dual formulation is to express the density in terms of divergence from the mean.
Legendre dual of the log-partition function is the first step; however, we need to make sure that
the divergence from mean to each point in the support—more generally the convex support since
the support may vary with the natural parameter—is well-defined. Notice that the Legendre dual
is defined on Ω and not C; hence, we can only talk about strict convexity on Ω. To achieve dual
formulation of density as a divergence from mean, we need to investigate the convex conjugate of
(Θ, Ψ) and not just the Legendre dual of (int(Θ), Ψ).
If we have a regular exponential family, then Θ is an open set (bd Θ = ∅) and therefore
(Θ = int(Θ), Ψ) is of Legendre type. The Legendre dual of (Θ, Ψ) is (Ω, Ψ̄∗ ); however, this does
not imply that the convex support is also an open set. The convex support and the convex conjugate
should be calculated with the closure construction of Ψ̄∗ . As noted in (Rockafellar, 1970, Theorem
7.5), taking limits of Ψ̄∗ at the boundaries of C can also be used to find the convex conjugate of Ψ.
We note that for a continuous steep exponential family distribution, the probability mass on
bd Ω is zero; therefore, the closure construction may not be as important (Jorgensen, 1997). For
steep discrete distributions, however, C \ Ω plays an important role. A curious example highlighting
this point and showing the interplay of Θ, Ω, S, and C is the Bernoulli distribution.
Example 2 For a Bernoulli distribution (see Table 1), we have Ψ(θ) = log(1 + e^θ) with Θ = R.
Since Θ is open, we have a regular (and thus also steep) exponential family distribution. The mean-
value mapping is τ(θ) = e^θ/(1 + e^θ) with range Ω = (0, 1). The Legendre dual of (Θ, Ψ) is Ψ̄^*(t) =
t log(t) + (1 − t) log(1 − t) with dom(Ψ̄^*) = (0, 1). The support is S = {0, 1}; hence, the convex
support is C = [0, 1]. We have Ω = ri(C) and C \ Ω = bd Ω. Note that the support, S, is
disjoint from the mean domain, Ω. Also note that all probability mass is on C \ Ω and the Legendre
dual of Ψ is not defined on C \ Ω. We should find the convex conjugate of Ψ and investigate its
behavior on C. The convex conjugate can be calculated by taking limits: Ψ^*(t) = Ψ̄^*(t) ∀t ∈ Ω,
Ψ^*(0) = lim_{t→0+} Ψ̄^*(t) = 0 and Ψ^*(1) = lim_{t→1−} Ψ̄^*(t) = 0. In this particular case, Ψ^* turns
out to be strictly convex on C.
Next, we describe the concluding theorem of this section, where we show that the density of
a steep natural exponential family can be written in terms of a Bregman divergence. A similar
connection was explored in previous work (Banerjee et al., 2005) for the regular exponential family,
which is a subset of the steep exponential family. In general, not every steep exponential family
distribution has a corresponding Bregman divergence.
Theorem 5 Suppose we have a steep natural exponential family distribution p_Ψ(x | θ). Then the
convex conjugate of the log-partition function, φ := Ψ^*, is a strictly convex function, and dom φ = C.
Furthermore, the density can be expressed as
p_Ψ(x | θ) = p_φ(x | µ) = exp(−d_φ(x, µ)) g_φ(x),
where d_φ(x, y) is the Bregman divergence generated from φ, µ = τ(θ) is the mean parameter,
g_φ(x) = exp(φ(x)) h(x) is the base measure, and p_φ(x | µ) is the dual formulation of the
density.
Figure 1: Duality in steep EDMs with support on the positive real numbers. Probability density
functions and the corresponding Bregman divergence functions for three EDMs: (a-b) gamma, (c-d)
inverse-Gaussian, and (e-f) stable. In all cases, the mean parameter (µ) is fixed to 1.0 and the
dispersion parameter (κ) is set to 0.5 (blue), 1.0 (green), and 2.0 (red).
Notice that the mean-value mapping—hence the mean parameter corresponding to a fixed natural
parameter—does not change after the scaling. The divergence from the mean, however, is scaled
by 1/κ, which stretches the geometry around the mean. Finally, we write the density of a
steep EDM as
p_Ψ(x | θ, κ) = p_φ(x | µ, κ) = exp(−d_φ(x, µ)/κ) g_φ(x, κ)
where dφ (x, µ) is the Bregman divergence generated from φ. In the dual form, the density is
parametrized with the mean parameter, µ, and the dispersion parameter, κ. We note that the mean
parameter only appears in the Bregman divergence term, dφ (x, µ). One important property of Breg-
man divergences is that the ML estimate of µ is independent of the divergence (Banerjee et al.,
2005). In fact, the ML estimate of the mean is simply the population mean. The following theorem
summarizes the results for the ML and MAP estimators of the mean parameter and its dual, the
natural parameter.
Theorem 6 Suppose we have N i.i.d. samples {x_i}_{i=1}^N from a steep EDM p_Ψ(x | θ, κ). If the
sample mean x̄ = (1/N) ∑_{i=1}^N x_i ∈ int(C), then the unique ML estimate of the mean parameter is
µ_ML = x̄ and the unique ML estimate of the natural parameter is θ_ML = τ^{-1}(µ_ML) ∈ int(Θ).
Suppose we have a conjugate prior p_Ψ(θ | a, b) with parameters a ∈ int(C) and b > 0 as in Eq. 4.
Then the unique MAP estimate of the mean parameter is µ_MAP = (abκ + N x̄)/(bκ + N) ∈ int(C), and the
unique MAP estimate of the natural parameter is θ_MAP = τ^{-1}(µ_MAP) ∈ int(Θ).
As opposed to the mean parameter, the dispersion parameter appears in the exponent and in
gφ (x, κ); in general, there is no closed-form expression for gφ (x, κ). Fortunately, asymptotic theory
suggests that, for x ∈ Ω, we can approximate the density as
p_φ(x | µ, κ) = (1/√(2πκυ(x))) exp(−d_φ(x, µ)/κ)    as κ → 0.                          (9)
This approximation is known as the saddle-point approximation (Daniels, 1954). One limitation of
the saddle-point approximation is that the sample point must be in Ω.
Example 3 The inverse-Gaussian distribution has the unit log-partition function Ψ(θ) = −√(−2θ)
with Θ = (−∞, 0]. Since Θ is closed and lim_{θ→0} Ψ'(θ) = ∞, we have a steep but not regular
exponential family distribution. The mean-value mapping is τ(θ) = 1/√(−2θ) with range Ω =
(0, ∞). The convex conjugate of Ψ is φ(t) = 1/(2t), and the corresponding Bregman divergence is
d_φ(x, y) = (x − y)^2/(2xy^2). The saddle-point approximation is then given by
p_φ(x | µ, κ) = (1/√(2πκx^3)) exp(−(x − µ)^2/(2κxµ^2)).                                (10)
Coincidentally, this is the exact density function for an inverse Gaussian with mean µ and shape pa-
rameter 1/κ. Also note that the support S = Ω = (0, ∞).
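This exactness can be checked numerically; the following sketch assumes SciPy's parametrization of the inverse-Gaussian, in which a distribution with mean µ and shape 1/κ is invgauss(µκ, scale=1/κ):

    import numpy as np
    from scipy.stats import invgauss

    def ig_saddlepoint(x, mu, kappa):
        # Eq. 10, which coincides with the exact inverse-Gaussian density.
        return np.exp(-(x - mu) ** 2 / (2 * kappa * x * mu ** 2)) / np.sqrt(2 * np.pi * kappa * x ** 3)

    x = np.linspace(0.1, 5.0, 50)
    mu, kappa = 1.0, 0.5
    assert np.allclose(ig_saddlepoint(x, mu, kappa), invgauss(mu * kappa, scale=1.0 / kappa).pdf(x))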
In the case of steep continuous EDMs, we know that C \ Ω = bd Ω and the probability mass on
bd Ω is zero (Jorgensen, 1997). Therefore, the x ∈ Ω condition is not limiting for such distributions.
Recall that the Bernoulli distribution in Example 2 has all of its probability mass on bd Ω. Although
the Bernoulli distribution itself is not an EDM, there exist discrete EDMs where the probability mass
on bd Ω is not zero. In particular, when a steep discrete EDM has support at zero, the saddle-point
approximation must be modified to accommodate the point mass at bd Ω as
p_φ(x | µ, κ) = √(κ/(2πυ(κx + κc))) exp(−d_φ(κx, κµ)/κ)    as κ → 0                    (11)
where c is a small positive constant (usually set to 1/3) (McCullagh and Nelder, 1989; Jorgensen,
1997). When the support does not include 0, we set c = 0.
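For instance, for the Poisson distribution (κ = 1, υ(x) = x, d_φ(x, µ) = x log(x/µ) − x + µ), the modified approximation with c = 1/3 can be compared against the exact probability mass function; a short sketch (Python with SciPy assumed):

    import numpy as np
    from scipy.stats import poisson

    def poisson_div(x, mu):
        # Bregman divergence generated by phi(t) = t log t - t, with the convention 0 log 0 = 0.
        return np.where(x > 0, x * np.log(np.maximum(x, 1e-300) / mu) - x + mu, mu)

    def saddlepoint_pmf(x, mu, c=1.0 / 3.0):
        # Eq. 11 with kappa = 1 and variance function v(t) = t.
        return np.exp(-poisson_div(x, mu)) / np.sqrt(2 * np.pi * (x + c))

    x = np.arange(0, 30)
    print(np.max(np.abs(saddlepoint_pmf(x, 10.0) - poisson.pmf(x, 10.0))))  # on the order of 1e-3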
Table 2: Summary of the family of steep EDMs for different data types. The hyper-parameter
domain, A, the variance function, υ(x | α), and the Bregman divergence, dφ (x, µ | α), used in
AdaCluster for different data types.
Since Θ is open, we get a regular exponential family. The mean-value mapping is given by
τ(θ | α) = e^θ/(1 − αe^θ).
The range of the mean-value mapping is then Ω = (0, ∞). The convex conjugate of Ψ is given by
φ(t | α) = t log t − ((αt + 1)/α) log(αt + 1)    if α > 0
φ(t | α) = t log t − t                            if α = 0
with support S = Z0 and convex support C = R0 . From Theorem 5, we know that φ generates
a Bregman divergence. The parametrized Bregman divergence, d_φ(x, y | α) : R_0 × R_+ → R_0, is
given by
d_φ(x, y | α) = (1/α) log(αy + 1)                                       if α > 0, x = 0
                (1/α + x) log((αy + 1)/(αx + 1)) + x log(x/y)           if α > 0, x ≠ 0          (12)
                y                                                        if α = 0, x = 0
                y − x + x log(x/y)                                       if α = 0, x ≠ 0.
Figure 2: Density of EDMs with support on the natural numbers. Probability mass function of
four members of the proposed EDM class for non-negative discrete data with µ = 10 and κ = 1
when the hyper-parameter is set to a) α = 0 (Poisson), b) α = 0.5, c) α = 1 (negative binomial), d)
α = 2.
Two members of this class are the Poisson (α = 0) and the negative binomial (α = 1) distributions.
Having specified the variance function and the corresponding Bregman divergence, we can use the
saddle-point approximation (Eq. 11) to express the density in the dual domain. The probability mass
functions for four members of this class, including Poisson and negative binomial, are depicted in
Fig. 2. Recall that the variance of a given steep EDM, pφ (x | µ, κ, α), is κµ + ακµ2 (see Eq. 8).
We see that, as α increases, the variance increases monotonically. The level of over-dispersion is
therefore captured by the hyper-parameter α. Distributions with higher α are useful for the analysis
of over-dispersed count data.
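A direct transcription of Eq. 12 as a Python sketch (the vectorization and the guard against log 0 are our own choices, not part of the derivation):

    import numpy as np

    def count_divergence(x, y, alpha):
        # Bregman divergence of Eq. 12 for the count-data class (x >= 0, y > 0, alpha >= 0).
        x, y = np.asarray(x, float), np.asarray(y, float)
        if alpha == 0.0:  # Poisson case: generalized KL divergence.
            return np.where(x == 0, y, y - x + x * np.log(np.maximum(x, 1e-300) / y))
        zero = np.log(alpha * y + 1) / alpha
        nonzero = (1.0 / alpha + x) * np.log((alpha * y + 1) / (alpha * x + 1)) \
                  + x * np.log(np.maximum(x, 1e-300) / y)
        return np.where(x == 0, zero, nonzero)

    print(count_divergence([0, 1, 5], 2.0, alpha=0.0))  # Poisson
    print(count_divergence([0, 1, 5], 2.0, alpha=1.0))  # negative binomial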
Figure 3: Density of EDMs with support on the real line. Probability density functions for
four members of the proposed EDM class for continuous data on R with µ = 10 and κ = 1 when
the hyper-parameter is set to a) α = 0 (Gaussian), b) α = 0.5, c) α = 1 (GHS), d) α = 2.
support on the unit interval (Jorgensen, 1997, Chapter 3). GHS allows us to model data on the
unit interval through the logit transform. Similarly, the logit transform of a logit-normal distributed
random variable has a Gaussian distribution.
In this section, we propose a class of EDMs that includes Gaussian and hyperbolic secant dis-
tributions as members. We start with the quadratic variance function υ(x | α) = 1 + αx^2 with
α ∈ R_0. Solving the implied second-order differential equation, Ψ''(θ) = 1 + α(Ψ'(θ))^2, we get the
following log-partition function
Ψ(θ | α) = −(1/α) log(cos(√α θ))    if α > 0
Ψ(θ | α) = θ^2/2                     if α = 0.
Since Θ is open, we get a regular (and thus steep) exponential family distribution for every α ∈ R0 .
The mean-value mapping is given by
τ(θ | α) = (1/√α) tan(√α θ)    if α > 0
τ(θ | α) = θ                    if α = 0.
The range of the mean-value mapping is then Ω = R. The convex conjugate of Ψ is given by
φ(t | α) = (t/√α) arctan(√α t) − (1/(2α)) log(1 + αt^2)    if α > 0
φ(t | α) = t^2/2                                            if α = 0.
Two noteworthy members of this class are the Gaussian (α = 0) and generalized hyperbolic secant
(α = 1) distributions. We can use the saddle-point approximation (Eq. 9) with the variance function
and the Bregman divergence above to express the density of each member of the class. In Fig. 3, the
probability distribution functions for several members of this class, including Gaussian and GHS,
are shown. The variance of a given steep EDM, pφ (x | µ, κ, α), in this class is κ + ακµ2 . As α
increases, we see that the asymmetric shape becomes more apparent. Similar to the case for the
over-dispersed count data, increasing α implies a higher variance for a fixed mean parameter.
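As a sketch (Python; φ and its gradient τ^{-1} are transcribed from the derivation above, i.e. from the variance function 1 + αx^2), the divergence of this class can be evaluated directly, with α = 0 recovering the squared-error divergence (x − y)^2/2:

    import numpy as np

    def ghs_phi(t, alpha):
        if alpha == 0.0:
            return 0.5 * t ** 2
        s = np.sqrt(alpha)
        return (t / s) * np.arctan(s * t) - np.log1p(alpha * t ** 2) / (2 * alpha)

    def ghs_divergence(x, y, alpha):
        # Bregman divergence generated by phi(. | alpha); grad phi(y) = arctan(sqrt(alpha) y) / sqrt(alpha).
        grad = y if alpha == 0.0 else np.arctan(np.sqrt(alpha) * y) / np.sqrt(alpha)
        return ghs_phi(x, alpha) - ghs_phi(y, alpha) - grad * (x - y)

    print(ghs_divergence(2.0, -1.0, alpha=0.0))  # 4.5, i.e. (x - y)^2 / 2 (Gaussian case)
    print(ghs_divergence(2.0, -1.0, alpha=1.0))  # GHS divergence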
Suppose we have a power variance function υ(x | α) = x^{2−α} with α ∈ R. Later we'll see
that not every choice of α yields a steep EDM; therefore, we will choose a smaller set for A. After
solving Ψ''(θ) = (Ψ'(θ))^{2−α}, we obtain the following log-partition function
Ψ(θ | α) = (1/α)(((α − 1)θ + 1)^{α/(α−1)} − 1)    if α ∈ R \ {0, 1}
Ψ(θ | α) = e^θ − 1                                 if α = 1                             (14)
Ψ(θ | α) = −log(1 − θ)                             if α = 0
with Θ given by
Θ = (−∞, 1/(1 − α)]    if α ∈ (−∞, 0)
Θ = (−∞, 1/(1 − α))    if α ∈ [0, 1)                                                   (15)
Θ = [1/(1 − α), ∞)     if α ∈ (1, ∞) \ {2}
Θ = R                   if α ∈ {1, 2}.
Note that Ψ is continuous in α, including the limiting cases for α = 0 and α = 1. When α ∈
[0, 1] ∪ {2}, the corresponding distributions belong to the regular exponential family. Similarly, for
α ∈ (−∞, 1] ∪ {2}, the corresponding distributions are steep. The mean-value mapping is given by
τ(θ | α) = ((α − 1)θ + 1)^{1/(α−1)}    if α ≠ 1
τ(θ | α) = e^θ                          if α = 1
with range Ω = (0, ∞), except when α = 2 where Ω = R. The convex conjugate is given by
φ(t | α) = (t^α − αt + α − 1)/(α(α − 1))    if α ∈ R \ {0, 1}
φ(t | α) = t log t − t + 1                   if α = 1                                   (16)
φ(t | α) = t − log t − 1                     if α = 0
with convex support C = [0, ∞) when α ∈ (0, ∞) \ {2}, C = (−∞, ∞) when α = 2, and
C = (0, ∞) when α ∈ (−∞, 0]. When 1 < α < 2, we have Ψ''(1/(1 − α)) = 0, leading to a degenerate
distribution. At the expense of having a non-full exponential family, we limit the convex support to
be C = (0, ∞) when 1 < α < 2. With this choice, the domain of Ψ is limited to Θ = (1/(1 − α), ∞)
when 1 < α < 2. The corresponding divergence is given by
d_φ(x, y | α) = (x^α + (α − 1)y^α − αxy^{α−1})/(α(α − 1))    if α ∈ R \ {0, 1}
                x(log x − log y) + (y − x)                     if α = 1                 (17)
                x/y − log(x/y) − 1                              if α = 0.
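A sketch of Eq. 17 in Python; numerically, the α → 1 and α → 0 limits of the generic branch approach the generalized KL and Itakura-Saito divergences, and α = 2 gives the halved squared error:

    import numpy as np

    def tweedie_divergence(x, y, alpha):
        # Bregman divergence of Eq. 17 (a beta-divergence up to parametrization).
        x, y = np.asarray(x, float), np.asarray(y, float)
        if alpha == 1.0:
            return x * (np.log(x) - np.log(y)) + (y - x)
        if alpha == 0.0:
            return x / y - np.log(x / y) - 1.0
        return (x ** alpha + (alpha - 1) * y ** alpha - alpha * x * y ** (alpha - 1)) / (alpha * (alpha - 1))

    x, y = 3.0, 2.0
    print(tweedie_divergence(x, y, 1e-6), tweedie_divergence(x, y, 0.0))      # Itakura-Saito limit
    print(tweedie_divergence(x, y, 1 + 1e-6), tweedie_divergence(x, y, 1.0))  # generalized KL limit
    print(tweedie_divergence(x, y, 2.0))                                      # (x - y)^2 / 2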
β-divergences was established in (Hennequin et al., 2011). The connection between β-divergences
and the Tweedie class has been explored in (Yilmaz and Cemgil, 2012). We note that the definitions
of β-divergences are slightly different in these two works.
When the support is S = (0, ∞), we choose A = (−∞, 0], which includes the gamma (α = 0)
and inverse-Gaussian (α = −1) distributions. In fact, it is possible to extend A to (−∞, 2], assum-
ing the probability mass at 0 is small for 0 < α < 2 and the probability of the interval (−∞, 0]
is small for α = 2. We note that A = (−∞, 2] includes Poisson (α = 1) and Gaussian (α = 2)
distributions as well. At first, using the Poisson distribution to model positive continuous data looks
unusual; however, the saddle-point approximation in Eq. 9 requires only that the set-of-interest is
a subset of the convex support and the variance function is positive. Both of these conditions are
satisfied since R+ ⊂ C = R0 and υ(x | α = 1) = x > 0 ∀x ∈ R+ .
Non-negative continuous data. Zero-inflated distributions are often modeled with compound-
Poisson distributions (Basbug and Engelhardt, 2016). Within the Tweedie class, compound-Poisson-
Gamma distributions correspond to the hyper-parameter range 0 < α < 1. Similar to positive
continuous data, the divergence corresponding to the Poisson distribution can be used for non-
negative continuous data, which makes the feasible hyper-parameter domain A = (0, 1]. One
important caveat is that, when using the saddle-point approximation for the point mass at zero, we
must use Eq. 11 instead of Eq. 9.
Figure 4: Homogeneous and heterogeneous mixture models. Contour plots for various mixture
models with fixed centroids located at (10,15), (25,20) and (15,30). Homogeneous mixture model
with fixed dispersion across dimensions and mixture components and fixed Gaussian topology in
both dimensions (a) and gamma topology in both dimensions (b); heterogeneous mixture model
with dispersion varying in the first and second dimensions but fixed across mixture components
and with gamma topology in the first dimension and Gaussian topology in the second dimension
(c), Gaussian topology in both dimensions (d) and gamma topology in both dimensions (e); het-
erogeneous mixture model with dispersion varying in the first and second dimension and across the
three mixture components and gamma topology in the first dimension and Gaussian topology in the
second dimension (f). Bregman soft clustering can be used for (a) and (b), AdaCluster can handle
(a-e). The most arbitrary form of heterogeneous mixture models (f) cannot be captured by either of
the algorithms.
Bregman soft clustering—consisting of closed-form updates to fit any homogeneous mixture model
that can be represented in terms of a Bregman divergence. Based on the discussion in Section 2.4, we
know that the density of steep EDMs can be represented in terms of Bregman divergences; hence,
we conclude that the homogeneous mixture of steep EDMs can be fit to data with the Bregman
soft-clustering algorithm.
We start by describing the closed-form updates of the Bregman soft-clustering algorithm when
used to fit a mixture of steep EDMs. We write the log-likelihood of a set of observations X =
{x_i}_{i=1}^N with J attributes under the homogeneous mixture of steep EDMs, p_φ(x | µ, κ), as
log p_φ(X | π, µ, κ) = ∑_{i=1}^N log ∑_{h=1}^K π_h exp( ∑_{j=1}^J [ −d_φ(x_ij, µ_hj)/κ + log g_φ(x_ij, κ) ] )    (18)
where π = {π_h}_{h=1}^K are the mixture proportions, and d_φ(x, y) is the Bregman divergence generated
from a fixed strictly convex function φ. The underlying distribution pφ (x | µ, κ) is fixed and
specified a priori.
In the E-step of the Bregman soft-clustering algorithm, the weight of each mixture component
is updated using:
p_φ(h | x_i, π, µ, κ) ← π_h exp(−(1/κ) ∑_{j=1}^J d_φ(x_ij, µ_hj)) / ∑_{h'=1}^K π_{h'} exp(−(1/κ) ∑_{j=1}^J d_φ(x_ij, µ_{h'j}))    (19)
for i = 1, . . . , N and h = 1, . . . , K. Note that the base measure terms log gφ (xij , κ) cancel. It
is often the case that the base measure does not have a closed form. The homogeneity assumption
simplifies the EM algorithm, allowing us to specify the log-partition function Ψ or its conjugate φ,
and not the full density.
In the M-step of Bregman soft clustering, the maximum likelihood estimates of π and µ are
given by
π_h ← (1/N) ∑_{i=1}^N p_φ(h | x_i, π, µ, κ)                                             (20)
µ_hj ← ∑_{i=1}^N p_φ(h | x_i, π, µ, κ) x_ij / ∑_{i=1}^N p_φ(h | x_i, π, µ, κ)           (21)
for h = 1, . . . , K and j = 1, . . . , J. The base measure terms do not appear in the closed-form
updates.
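A compact sketch of one E-step/M-step pass (Eqs. 19-21) in Python, with the divergence d_φ supplied as a vectorized function; this is an illustration, not the authors' reference implementation:

    import numpy as np

    def em_step(X, pi, mu, kappa, d_phi):
        # One Bregman soft-clustering iteration. X: (N, J); pi: (K,); mu: (K, J).
        D = np.stack([d_phi(X, m).sum(axis=1) for m in mu], axis=1)   # (N, K) total divergences
        logw = np.log(pi) - D / kappa                                 # E-step (Eq. 19); base measures cancel
        logw -= logw.max(axis=1, keepdims=True)                       # numerical stability
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)
        pi = w.mean(axis=0)                                           # M-step (Eq. 20)
        mu = (w.T @ X) / w.sum(axis=0)[:, None]                       # M-step (Eq. 21)
        return pi, mu, w

    # Example with the squared Euclidean divergence (the Gaussian case).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    pi, mu = np.full(2, 0.5), X[rng.choice(len(X), 2, replace=False)]
    for _ in range(20):
        pi, mu, w = em_step(X, pi, mu, 1.0, lambda x, m: 0.5 * (x - m) ** 2)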
So far, we have only described the Bregman soft-clustering algorithm updates for a homogeneous
mixture of steep EDMs and did not modify the existing algorithm of (Banerjee et al., 2005). To
be able to employ Bregman soft clustering, we need to specify the topology using the divergence-
generating function φ, the variance function υ or the log-partition function Ψ. If we have multiple
candidates for the underlying distribution, we can optimize the objective function in Eq. 18 although
it requires specification of the full density including the base-measure terms (gφ (·, κ)) which often
do not have closed-form expressions.
The homogeneous mixture of steep EDMs, however, can fall short in capturing the true geo-
metric properties of the data. In data sets collected from multiple complex sources, the dispersion
parameters tend to be different across attributes even when the topology is the same (Fig. 4d and
Fig. 4e). For an arbitrary EDM, normalizing data can be burdensome since we would have to fit
an individual mixture model to each attribute. Inference gets even more complicated when the
topology is different across attributes (Fig. 4c). Census data is a good example, where we have
non-negative discrete attributes such as age, positive continuous attributes such as height, possibly
negative continuous attributes such as income, and binary attributes such as gender. When the num-
ber of attributes is small, a custom mixture model can be built; however, as the dimensionality of
the data increases, creating a custom model for each attribute is tedious. To address this problem,
we develop heterogeneous mixture models, where the topology and the dispersion characteristics
are attribute-specific and shared across mixture components.
Suppose we have a family of steep EDMs {ΦA }j to model the underlying distribution of the
j th attribute. More precisely, we assume that each distribution in that family has the same support
and can be specified with the mother function φj (· | α) and hyper-parameter domain Aj . We can
write the quasi-log-likelihood of data under the heterogeneous mixture model using the saddle-point
approximation as
L̂ = log p̂_φ(X | π, µ, κ, α) = ∑_{i=1}^N log ∑_{h=1}^K π_h exp(Υ(x_i, µ_h, κ, α)),    (22)
where Υ(x, µ, κ, α) := −∑_{j=1}^J [ d_φj(x_j, µ_j | α_j)/κ_j + (1/2) log(2πκ_j υ_j(x_j | α_j)) ] is an auxiliary
function. In practice, we algorithmically detect the type of each attribute, and we set {ΦA }j ac-
cordingly. Table 2 summarizes the properties of the EDM families for different data types. We
present the AdaCluster algorithm, which maximizes the quasi-log-likelihood L̂ using expectation-
maximization.
In the E-step of AdaCluster, the weight of each mixture component is given by
p_φ(h | x_i, µ, κ, α, π) ← π_h exp(Υ(x_i, µ_h, κ, α)) / ∑_{h'=1}^K π_{h'} exp(Υ(x_i, µ_{h'}, κ, α)).    (23)
Using Eq. 24, we calculate the updates for µ, κ and α (Algorithm 1). We first note that the MLE
of κ has a closed form. Since we assume the underlying distribution is a steep EDM, the maximum
likelihood estimate of µ also has a closed form (Theorem 6). For the hyper-parameter vector α,
numerical optimization techniques may be used to find the value of α that maximizes the quasi-log-
likelihood L̂.
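A per-attribute sketch of this M-step in Python: µ_hj is the responsibility-weighted mean as in Eq. 21, α_j is found by bounded one-dimensional optimization of the quasi-log-likelihood over A_j, and the κ_j update written below is the closed form implied by the saddle-point quasi-likelihood (it matches Eq. 26 under a flat prior); a sketch under these assumptions, not the authors' code:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def adacluster_mstep_j(xj, muj, w, kappa_j, d_phi_j, v_j, bounds):
        # M-step for one attribute j. xj: (N,); muj: (K,); w: (N, K) responsibilities.
        def neg_quasi_ll(alpha):
            div = d_phi_j(xj[:, None], muj[None, :], alpha)               # (N, K)
            pen = 0.5 * np.log(2 * np.pi * kappa_j * v_j(xj, alpha))      # (N,)
            return np.sum(w * (div / kappa_j + pen[:, None]))
        alpha_j = minimize_scalar(neg_quasi_ll, bounds=bounds, method="bounded").x
        div = d_phi_j(xj[:, None], muj[None, :], alpha_j)
        kappa_j = 2.0 * np.sum(w * div) / np.sum(w)                       # closed-form dispersion update
        return alpha_j, kappa_j

    # Example on one positive-valued attribute with the Tweedie family (1 < alpha < 2).
    rng = np.random.default_rng(1)
    xj, muj = rng.gamma(5.0, 1.0, size=200), np.array([3.0, 8.0])
    w = rng.dirichlet([1.0, 1.0], size=200)
    d = lambda x, m, a: (x ** a + (a - 1) * m ** a - a * x * m ** (a - 1)) / (a * (a - 1))
    v = lambda x, a: x ** (2 - a)
    print(adacluster_mstep_j(xj, muj, w, 1.0, d, v, bounds=(1.01, 1.99)))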
To incorporate prior knowledge about the data, we discuss the Bayesian treatment of the het-
erogeneous mixture of steep EDMs where we have conjugate priors on µ and κ. For instance, in
census data, we may fit a mixture model to the census data of all U.S. states and use the estimated
parameters as priors on the mixture model for the state-level census data. Apart from such hierar-
chical applications, the Bayesian approach also offers a way to regularize the dispersion parameters.
In a Gaussian mixture model, a prior on κ is equivalent to regularizing the covariance matrix estimate
with a diagonal matrix.
Suppose we have a conjugate prior on θ_hj = τ^{-1}(µ_hj) with parameters a_hj^µ and b_hj^µ, as in Eq. 4.
Then, the MAP estimate of µ_hj is given by
µ_hj^MAP ← ( a_hj^µ b_hj^µ κ_j + ∑_{i=1}^N p_φ(h | x_i, µ, κ, α, π) x_ij ) / ( b_hj^µ κ_j + ∑_{i=1}^N p_φ(h | x_i, µ, κ, α, π) )    (25)
for h = 1, . . . , K and j = 1, . . . , J. In the mixture model setting, a_hj^µ reflects our prior knowledge
about the location of the h-th centroid along the j-th dimension, and b_hj^µ represents the effective
sample size of the prior. The MAP estimate reduces to the ML estimate when b_hj^µ is set to 0. It
is also possible to set b_hj^µ = η/κ_j, where η is the adjusted effective sample size of the prior; η
represents how much weight is given to the prior relative to the effective cluster size
∑_{i=1}^N p_φ(h | x_i, µ, κ, α, π).
In describing the MAP estimate for κ, we first note that the inverse-gamma (IG) distribution is
the conjugate prior for the dispersion parameter of a steep EDM under the saddle-point approxima-
tion. Suppose we have an independent IG(a_j^κ, b_j^κ) prior on each {κ_j}_{j=1}^J. Then, the MAP estimate
for κ_j is given by
κ_j^MAP ← ( b_j^κ + ∑_{i=1}^N ∑_{h=1}^K p_φ(h | x_i, µ, κ, α, π) d_φj(x_ij, µ_hj | α_j) ) / ( a_j^κ + (1/2) ∑_{i=1}^N ∑_{h=1}^K p_φ(h | x_i, µ, κ, α, π) )    (26)
for j = 1, . . . , J. For the Bayesian version of AdaCluster, we replace the µ_hj and κ_j updates with
MAP estimates, and we adjust the gradient ∂L̂/∂α_j accordingly.
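A transcription of Eqs. 25-26 as reconstructed above (a Python sketch; d_phi_j denotes the attribute-specific divergence, and the prior parameters a_mu and b_mu are per-cluster arrays):

    import numpy as np

    def map_updates(xj, w, kappa_j, alpha_j, d_phi_j, a_mu, b_mu, a_kappa, b_kappa):
        # MAP updates for attribute j. xj: (N,); w: (N, K) responsibilities; a_mu, b_mu: (K,).
        n_h = w.sum(axis=0)                                                       # effective cluster sizes
        mu_map = (a_mu * b_mu * kappa_j + w.T @ xj) / (b_mu * kappa_j + n_h)      # Eq. 25
        div = d_phi_j(xj[:, None], mu_map[None, :], alpha_j)                      # (N, K)
        kappa_map = (b_kappa + np.sum(w * div)) / (a_kappa + 0.5 * w.sum())       # Eq. 26
        return mu_map, kappa_map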
where C(X) = {X_h}_{h=1}^K is the partition of the data set and d_φ(x, y) is the Bregman divergence
corresponding to the underlying steep EDM.
In the case of a heterogeneous mixture of steep EDMs, Bregman soft-clustering does not reduce
to Bregman hard-clustering in general. The only exception is when the dispersion parameter is
shared across attributes, which may be possible if the data are normalized within each attribute.
In this case, we can combine the Bregman divergences specific to each attribute into a single
multi-dimensional Bregman divergence using the linearity property. We can show that Bregman
soft-clustering reduces to Bregman hard-clustering under the small-variance assumption, with the
objective
L_BHC(C(X) | µ, α) = ∑_{h=1}^K ∑_{x_i ∈ X_h} ∑_{j=1}^J d_φj(x_ij, µ_hj | α_j).    (28)
We note that the topological properties of the attributes need to be specified ahead of time and
the Bregman hard-clustering algorithm cannot adaptively infer the topology. One might attempt to
learn the parameters α by minimizing Eq. 27 directly. This naive approach fails in practice; similar
accounts have been described for the Tweedie class (Dunn and Smyth, 2005, 2008).
We explore alternatives to the likelihood approach to adaptively learn the parameters of a general
heterogeneous mixture of steep EDMs in the setting of hard clustering. Suppose we have N i.i.d.
samples {x_i}_{i=1}^N from a steep EDM p_φ(x | µ, κ, α) such that (1/N) ∑_{i=1}^N x_i ∈ int(C). Then we can
estimate µ, κ, and α with the method of moments using the first three sample moments. Estimating
parameters using higher-order moments suffers from statistical efficiency problems. Fortunately, the
hard clustering problem has an additional property that we can exploit. The parameters κ and α are
shared across mixture components, and each cluster is assumed to be disjoint. Therefore, the first
two population moments for K clusters allow us to write down 2JK moment conditions for J(K +
2) parameters. For K = 2, we have exact identification, i.e., the number of moment conditions is
the same as the number of parameters to estimate. When K > 2, we have more moment conditions
than parameters, leading to an overidentified system using the first two population moments for
each cluster. This is the ideal setting for Generalized Method of Moments (GMoM) (Hansen, 1982).
GMoM is a method of estimating the set of parameters of a model using moment conditions and
arbitrary functions of random variables and parameters that are zero in expectation. Given a vector
of moment conditions, GMoM minimizes the quadratic form defined using an arbitrary positive
definite weight matrix (Hansen, 1982). The choice of the weight matrix determines the efficiency
of the estimator.
For the remainder of this section, we denote the vector of parameters in a heterogeneous mixture
of steep EDMs as λ = {µ, κ, α}. We denote the feasible set for λ with Λ. We construct the set
Λ using {Ω_j}_{j=1}^J for the mean parameters µ, R_+^J for the dispersion parameters κ, and {A_j}_{j=1}^J for
the hyper-parameters α.
Using the parametrized variance function of steep EDMs (Eq. 8), we write the first two moment
conditions for the j-th attribute and the h-th cluster as
m_hj(x; λ) = ( x_j − µ_hj ,  (x_j − µ_hj)^2 − κ_j υ_j(µ_hj | α_j) )^T.
Since E[mhj (x; λ)] = 0 for each attribute j = {1, . . . , J} and each cluster h = {1, . . . , K},
they can be used as moment conditions in GMoM. Having described the moment conditions, we
need to specify a positive-definite weight matrix to formulate the GMoM objective. In Hansen’s
iterative GMoM method, the weight matrix is constructed by taking the inverse of the residual ma-
trix (Hansen, 1982). This construction yields asymptotically efficient estimates for the parameters.
A variant of the iterative GMoM is the Continuously Updating Generalized Method of Moments
(CUGMoM), which is preferred when analytical solutions in each iteration of the original method
are not available (Hansen et al., 1996). Instead of iterating between the update for the parameter
estimates and the weight matrix estimate, CUGMoM optimizes for the parameters directly using
computational techniques.
The CUGMoM objective is given by
L_GMoM(C(X); λ) = ∑_{h=1}^K ∑_{j=1}^J m̄_hj^T W_hj m̄_hj    (29)
where
m̄_hj(C(X); λ) = (1/|X_h|) ∑_{x_i ∈ X_h} m_hj(x_i; λ)    (30)
W_hj(C(X); λ) = [ (1/|X_h|) ∑_{x_i ∈ X_h} m_hj(x_i; λ) m_hj(x_i; λ)^T ]^{-1}.    (31)
The CUGMoM parameter estimates are λ̂ = argmin_{λ∈Λ} L_GMoM(C(X); λ). To solve this op-
timization problem, we used the limited-memory Broyden-Fletcher-Goldfarb-Shanno with bound-
aries (L-BFGS-B) algorithm (Byrd et al., 1995). The computational bottleneck for the GMoM
routine is usually the inversion of the weight matrix. In the hard clustering problem, clusters are
disjoint; hence, the residual matrix and the weight matrix are block diagonal. The inversion of the
weight matrix then has complexity O(KJ) rather than O(K^3 J).
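A sketch of the per-block computation in Python (the small ridge term added before inversion is our own numerical safeguard, not part of the method):

    import numpy as np

    def cugmom_block(xj, mu, kappa, alpha, v):
        # CUGMoM contribution of one (cluster, attribute) block, Eqs. 29-31.
        m = np.stack([xj - mu, (xj - mu) ** 2 - kappa * v(mu, alpha)], axis=1)   # (n, 2) moment conditions
        m_bar = m.mean(axis=0)                                                   # Eq. 30
        W = np.linalg.inv(m.T @ m / len(xj) + 1e-9 * np.eye(2))                  # Eq. 31 (regularized)
        return m_bar @ W @ m_bar

    # The full objective (Eq. 29) sums such blocks over clusters and attributes; the resulting
    # function of lambda = (mu, kappa, alpha) can then be minimized with L-BFGS-B over Lambda.
    xj = np.random.default_rng(2).gamma(4.0, 1.0, 100)
    print(cugmom_block(xj, mu=4.0, kappa=1.0, alpha=0.0, v=lambda m, a: m ** (2 - a)))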
GMoM gives us a way to estimate the parameters of the model given the partition of the data
into clusters, C(X ). This is somewhat equivalent to the M-step of AdaCluster. To mimic the E-step
of AdaCluster, we use the weight matrix of each cluster to calculate the quadratic distance, and then
we assign each sample to the cluster with minimum distance. We refer to this algorithm (Alg. 2)
as GMoM Hard Clustering (GMoM-HC). Convergence can be assessed either through the objective
or through a stable partition. A lighter version of this algorithm can also be derived by setting the
mean estimates to the maximum likelihood estimates. In this case, the only moment conditions are
the second population moments; then, the inverse weight matrix becomes a diagonal matrix with
the mean squared error for the moment conditions as the diagonal entries. The analogous Bayesian
algorithm can be derived by fixing b_h^µ many pseudo-samples at location a_h^µ to the h-th cluster. Recall
that the geometric interpretation of the conjugate prior to the mean parameter is having b many pseudo-
samples at a location a (Agarwal and Daumé III, 2010).
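A sketch of the assignment step (Python; the block weight matrices are assumed to be stored as a (K, J, 2, 2) array Ws computed as in Eq. 31):

    import numpy as np

    def gmom_assign(x, mus, kappas, alphas, Ws, v):
        # Assign one sample x (length J) to the cluster with minimum quadratic GMoM distance.
        K, J = mus.shape
        dist = np.zeros(K)
        for h in range(K):
            for j in range(J):
                m = np.array([x[j] - mus[h, j],
                              (x[j] - mus[h, j]) ** 2 - kappas[j] * v(mus[h, j], alphas[j])])
                dist[h] += m @ Ws[h, j] @ m
        return int(np.argmin(dist))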
6. Results
6.1 Setup
We used normalized mutual information (NMI) to quantify the results from a clustering algorithm
with respect to ground truth. NMI is a metric that takes values between 0 and 1, where 0 corresponds
to random cluster assignments.
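For example, with scikit-learn (an implementation detail we assume; the paper does not name a particular library):

    from sklearn.metrics import normalized_mutual_info_score
    print(normalized_mutual_info_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: relabeled but identical clustering
    print(normalized_mutual_info_score([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0: assignments carry no information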
In addition to AdaCluster and GMoM-HC, we used k-means and the EM algorithm for a Gaus-
sian mixture model (GMM) for comparison. We force the covariance matrix of the GMM to be
diagonal and shared across mixture components. We note that AdaCluster reduces to a GMM when
ΦA is a singleton set with only Gaussian members. We also note that in such a setting κ = {κ_j}_{j=1}^J
corresponds to the diagonal entries of the GMM covariance matrix and is shared across clusters.
To set up AdaCluster and GMoM-HC, we categorized each attribute based on whether we
found all of the values to be positive, non-negative, or on the real line, and whether the values
were discrete or continuous. Based on the different categories of attribute values, we selected the
parameterization of the EDM, ΦA , as outlined in Table 2.
To initialize all of the algorithms in this comparison, we used k-means++ (Arthur and Vassil-
vitskii, 2007). In AdaCluster and GMoM-HC, we used the k-means++ centroids to set the hyper-
parameters a_h^µ while fixing b_h^µ = 1. We set the hyper-parameters of κ as a^κ = 1.0 and b^κ = 10^{-9} to
Figure 5: Simulated density estimation results. Quantile-quantile plots of the samples generated
from homogeneous steep EDM mixture models with an underlying distribution of a) Gaussian, b)
Poisson, c) gamma, d) negative binomial, and e) inverse-Gaussian against samples generated from
the Gaussian mixture model fitted to the synthetic data. Each color corresponds to a synthetic data set
with distinct parameters, and the diagonal black line marks a perfect fit.
ensure numerical stability. Similarly, we add 10^{-9} to the diagonal entries of the covariance matrix
in the GMM.
We ran each algorithm for a maximum of 1,000 iterations and terminated early if the hard
cluster assignments did not change in two consecutive iterations. We restarted each algorithm 1,000
times and selected the results from these 1,000 runs with the highest likelihood for the GMM and
AdaCluster, and with the lowest inertia for k-means and GMoM-HC.
To illustrate the benefits of our approach to clustering on homogeneous data, we considered the
problem of density estimation with mixture models. For each distribution in {Gaussian, gamma,
inverse-Gaussian, Poisson, negative binomial}, we generated six single dimensional data sets of
size N = 1000 from a mixture model with K = 4 mixture components. We drew the dispersion
parameter κ from an inverse-gamma with shape parameter 1.01 and scale parameter 1.0. We chose
(a) (b)
Figure 6: Synthetic heterogeneous data results. Comparing AdaCluster with the GMM on syn-
thetic heterogeneous data in terms of a) normalized mutual information (NMI) and b) per sample
log-likelihood. Each red point corresponds to a different data set, and the black diagonal line indi-
cates equivalent performance.
the centroids such that p_φ(µ_{h'} | µ_h, κ, α) < 0.01 for every h = 1, . . . , K and h' ≠ h. We then fit a
Gaussian mixture model to the data and generated 1, 000 samples from this model.
The quantile-quantile plots between the data and the posterior samples generated from the fitted
GMM show that the GMM recovers the true density accurately when the underlying distribution is
Gaussian (Fig. 5). However, the GMM fails in the case of gamma, inverse-Gaussian, Poisson, and
negative binomial. Recall that the variance function, υ(x), of the Gaussian is constant (1), linear
for Poisson (x), quadratic for gamma (x^2) and negative binomial (x^2 + x), and finally cubic for
inverse-Gaussian (x^3). Hence, the GMM performance progressively degrades as the variance-mean
relationship deviates from constant (Fig. 5a–5e).
To compare AdaCluster and GMM performance in clustering heterogenous data, we gener-
ated 100 data sets of size N = 1000 with J = 10 dimensions and K = 4 mixture compo-
nents. We started by drawing cluster proportions from a Dirichlet distribution with equal con-
centration parameters. For each dimension j = 1, . . . , J, we first drew a dispersion parameter κj
from an inverse-gamma with shape parameter 1.01 and scale parameter 1.0. We then randomly
selected the underlying distribution from {Gaussian, gamma, inverse-Gaussian, Poisson, negative
binomial} and set α_j and φ_j(· | α_j) accordingly. Finally, we selected the cluster centroids such that
p_φj(µ_{h'j} | µ_hj, κ_j, α_j) < 0.01 for every h = 1, . . . , K and h' ≠ h. We then ran AdaCluster and
EM for the GMM as described in the previous section.
AdaCluster outperforms the GMM in every synthetic data set except for one in terms of NMI
(Fig. 6). More interestingly, AdaCluster often achieves perfect NMI score whereas the GMM per-
formance is below 0.2 in roughly half of the data sets. In fact, GMM either struggles to cluster the
data (top left quadrant of Fig. 6a) or achieves a reasonable NMI value (top right quadrant of Fig. 6a).
In terms of the quasi-log-likelihood, AdaCluster outperforms GMM in every data set (Fig. 6b), sug-
gesting that AdaCluster not only yields a better clustering of the data than the GMM but also provides a
better fit. We note that the quasi-log-likelihood corresponds to the exact log-likelihood in the case
of the GMM since the saddle-point approximation is exact for the Gaussian distribution.

Table 3: Comparison of NMI values for EM for the Gaussian mixture model (GMM), Ada-
Cluster (AdaC.), k-means, and GMoM hard clustering (GMoM-HC) on nine UCI data sets.
N is the number of samples; J is the number of attributes; K is the number of clusters; S is the
support.
that the dispersion is roughly the same across attributes. In seeds and libras, AdaCluster and k-
means have similar performance, and the GMM is inferior to both. The common characteristic
of these two data sets is a relatively small dispersion across attributes. We verified this hypothesis
by examining the parameters fitted by AdaCluster. Small dispersion might explain why k-means
outperformed the GMM. Similarly, it can be argued that the Gaussian distribution approximates the
underlying distribution when the dispersion is small. Finally, the GMM beats AdaCluster on the leaf data
set, which has a small number of samples per cluster, leading to noisy estimates of α and degrad-
ing the performance of the adaptive algorithms. This is especially true for GMoM-HC, where the
performance relies on having good α estimates within each cluster.
measured with normalized mutual information and log-likelihood. We also compared AdaCluster
and GMM on nine data sets from the UCI repository with distinct topology and dispersion charac-
teristics. We found that AdaCluster clusters heterogeneous data consistently better than EM for
the GMM.
Acknowledgments
We would like to acknowledge support for this project for MEB from the Princeton Innovation J.
Insley Blair Pyne Fund Award, and, for BEE, from NIH R00 HG006265, NIH R01 MH101822,
NIH U01 HG007900, and a Sloan Faculty Fellowship.
References
Arvind Agarwal and Hal Daumé III. A geometric view of conjugate priors. Machine learning, 81
(1):99–113, 2010.
Shun-ichi Amari. Differential-geometrical methods in statistics, volume 28. Springer Science &
Business Media, 2012.
David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceed-
ings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035.
Society for Industrial and Applied Mathematics, 2007.
Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Breg-
man divergences. The Journal of Machine Learning Research, 6:1705–1749, 2005.
Shaul K Bar-Lev, Peter Enis, et al. Reproducibility and natural exponential families with power
variance functions. The Annals of Statistics, 14(4):1507–1522, 1986.
Ole Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. John Wiley &
Sons, 1978.
Mehmet E Basbug and Barbara E Engelhardt. Hierarchical compound Poisson factorization. Pro-
ceedings of the International Conference on Machine Learning, pages 1795–1803, July 2016. URL
http://arxiv.org/abs/1604.03853.
Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by
minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
Christopher M Bishop et al. Pattern recognition and machine learning, volume 4. Springer New
York, 2006.
Chester Ittner Bliss and Ronald A Fisher. Fitting the negative binomial distribution to biological
data. Biometrics, 9(2):176–200, 1953.
Lev M Bregman. The relaxation method of finding the common point of convex sets and its ap-
plication to the solution of problems in convex programming. USSR computational mathematics
and mathematical physics, 7(3):200–217, 1967.
Richard H Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for
bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
Andrzej Cichocki, Rafal Zdunek, and Shun-ichi Amari. Csiszár's divergences for non-negative ma-
trix factorization: Family of new algorithms. In Independent Component Analysis and Blind
Signal Separation, pages 32–39. Springer, 2006.
Peter K Dunn and Gordon K Smyth. Series evaluation of Tweedie exponential dispersion model
densities. Statistics and Computing, 15(4):267–280, 2005.
Peter K Dunn and Gordon K Smyth. Evaluation of Tweedie exponential dispersion model densities
by Fourier inversion. Statistics and Computing, 18(1):73–86, 2008.
Bradley Efron. Defining the curvature of a statistical problem (with applications to second order
efficiency). The Annals of Statistics, pages 1189–1242, 1975.
Ronald Fisher. Dispersion on a sphere. In Proceedings of the Royal Society of London A: Math-
ematical, Physical and Engineering Sciences, volume 217, pages 295–305. The Royal Society,
1953.
Ronald Aylmer Fisher. Two new properties of mathematical likelihood. Proceedings of the Royal
Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 144:
285–307, 1934.
Lars Peter Hansen. Large sample properties of generalized method of moments estimators. Econo-
metrica: Journal of the Econometric Society, pages 1029–1054, 1982.
Lars Peter Hansen, John Heaton, and Amir Yaron. Finite-sample properties of some alternative
GMM estimators. Journal of Business & Economic Statistics, 14(3):262–280, 1996.
WL Harkness and ML Harkness. Generalized hyperbolic secant distributions. Journal of the Amer-
ican Statistical Association, 63(321):329–337, 1968.
Romain Hennequin, Bertrand David, and Roland Badeau. Beta-divergence as a subclass of Bregman
divergence. IEEE Signal Processing Letters, 18(2):83–86, 2011.
Ke Jiang, Brian Kulis, and Michael I Jordan. Small-variance asymptotics for exponential family
Dirichlet process mixture models. In Advances in Neural Information Processing Systems, pages
3158–3166, 2012.
Brian Kulis and Michael I Jordan. Revisiting k-means: New algorithms via Bayesian nonpara-
metrics. In Proceedings of the 29th International Conference on Machine Learning (ICML-12),
pages 513–520, 2012.
Erich Leo Lehmann and George Casella. Theory of point estimation. Springer Science & Business
Media, 2006.
Peter McCullagh and John A Nelder. Generalized linear models, volume 37. CRC press, 1989.
Carl N Morris. Natural exponential families with quadratic variance functions. The Annals of
Statistics, pages 65–80, 1982.
John A Nelder and R Jacob Baker. Generalized linear models. Encyclopedia of statistical sciences,
1972.
MCK Tweedie. Functions of a statistical variate with given means, with special reference to Lapla-
cian distributions. In Proceedings of the Cambridge Philosophical Society, volume 43, page 100.
Cambridge Univ Press, 1947.
Y Kenan Yilmaz and A Taylan Cemgil. Alpha/beta divergences and Tweedie models. arXiv preprint
arXiv:1209.4280, 2012.