Clustering Methods For Spherical Data: An Overview and A New Generalization
Sungsu Kim
Department of Mathematics, University of Louisiana at Lafayette, U.S.A.
e-mail: [email protected]

Ashis SenGupta
Applied Statistics Unit, Indian Statistical Institute, Kolkata, India.
e-mail: [email protected]
1 Introduction
Cluster analysis refers to the finding of natural groups (clusters) in a data set when little or nothing is known about the category structure. It divides data into groups (clusters) that are meaningful, useful, or both. Data clustering belongs to the core methods of data mining, which focuses on large data sets with unknown underlying structure. Clustering approaches can be broadly categorized as either model-based (parametric) or distance-based (non-parametric or prototype-based).
In this big data era, cluster analysis is a fundamental and crucial step in exploring structures and patterns in massive data sets, where the objects to be clustered are represented as vectors. Often such high-dimensional vectors are normalized so that they lie on the surface of the unit hypersphere, transforming them into spherical data. In spherical (directional) clustering, i.e. the clustering of spherical data, a set of data vectors is partitioned into groups, where the distance used to group the vectors is the angle between them. That is, data vectors are grouped according to the direction in which they point, while the overall vector length does not influence the clustering result. The goal of spherical clustering is thus to find a partition in which each cluster consists of vectors that point in roughly the same direction. Distance-based methods mostly use cosine similarity, which measures the cosine of the angle formed by two vectors, instead of Euclidean distance. For model-based methods, popular mixture models such as a mixture of multivariate Gaussian distributions are inadequate, and the use of a spherical distribution in the mixture model is required.
Two main applications of spherical clustering are found in text mining and gene expression analysis. In document clustering (text mining), text documents are grouped based on their features, often described by frequencies (counts) of words after stop-word removal and word stemming. Using words as features, text documents are often represented as high-dimensional and sparse vectors; a few thousand dimensions and a sparsity of 95 to 99% are typical. In order to remove the biases induced by different document lengths, the data are normalized to have unit length, ignoring the overall lengths of the documents. In other words, documents with a similar composition but different lengths will be grouped together (Dhillon and Modha, 2001).
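As a quick illustration of this normalization step, the following sketch (Python with NumPy; the toy term-frequency matrix is hypothetical) L2-normalizes each document so that it becomes a unit vector; documents with the same composition but different lengths become identical:

```python
import numpy as np

# Toy term-frequency matrix: rows are documents, columns are words.
X = np.array([[2., 0., 1., 3.],
              [4., 0., 2., 6.],   # same composition as document 1, twice as long
              [0., 5., 1., 0.]])

# L2-normalize each row so that every document lies on the unit hypersphere.
Y = X / np.linalg.norm(X, axis=1, keepdims=True)

# Documents 1 and 2 now coincide: their cosine similarity <y1, y2> equals 1.
print(Y @ Y.T)
```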
Gene expression profile data are usually represented by a matrix of expression levels, with rows corresponding to genes and columns to conditions, experiments or time points. Each row vector is the expression pattern of a particular gene across all conditions. Since the goal of gene expression clustering is to detect groups of genes that exhibit similar expression patterns, the data are standardized so that each gene has mean zero and variance one, removing the effect of the magnitude of the expression level (Banerjee et al., 2005).
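A minimal sketch of this standardization (Python with NumPy; the toy matrix is hypothetical). Note that a row with mean zero and variance one has squared norm equal to the number of columns, so standardized genes can be rescaled to lie on the unit hypersphere:

```python
import numpy as np

# Toy expression matrix: rows are genes, columns are conditions.
E = np.random.default_rng(0).normal(size=(5, 8))

# Standardize each gene (row) to mean zero and variance one.
Z = (E - E.mean(axis=1, keepdims=True)) / E.std(axis=1, keepdims=True)

# A standardized row has squared norm equal to the number of columns,
# so dividing by sqrt(8) places every gene on the unit hypersphere.
Y = Z / np.sqrt(E.shape[1])
print(np.linalg.norm(Y, axis=1))  # all ones
```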
As a special case of spherical clustering, spatial clustering has been used for agricultural insurance claims and earthquake occurrences (SenGupta, 2016). Other applications of spherical clustering found in the literature include
• fMRI, white matter supervoxel segmentation and brain imaging in Biomedicine
• spatial fading and blind speech segregation in Signal processing
• exoplanet data clustering in Astrophysics
• environmental pollution data in Environmental sciences.
2 Spherical Clustering Methods

When data lie on the unit circle, the circular distance between two objects is given by 1 − cos(α1 − α2), where α1 and α2 are the corresponding angles (Jammalamadaka and SenGupta, 2001). Generalizing the circular distance to the unit hypersphere, the cosine similarity between two unit vectors, say y1 and y2, is defined to be their inner product, denoted ⟨y1, y2⟩. Suppose n spherical data points are to be classified into K groups. The spherical K-means algorithm maximizes ∑_{k=1}^{K} ∑_{i=1}^{n} µ_{ki} ⟨y_i, p_k⟩, where µ_{ki} = 1 if y_i belongs to cluster k (and µ_{ki} = 0 otherwise), and p_k denotes the prototype (cluster center) vector of cluster k. The optimization process consists of alternating updates of the memberships and the cluster centers. Given a set of data objects and a pre-specified number of clusters K, K data objects are chosen initially, each one serving as the centroid of a cluster. The remaining objects are then assigned to the cluster represented by the nearest, or most similar, centroid; this is the membership update step. Next, new centroids are computed for each cluster, and in turn all data objects are re-assigned based on the new centroids; this is the cluster center update step. These two steps iterate until a fixed solution is reached, in which all data objects remain in the same cluster after an update of the centroids (Hornik et al., 2012), as sketched below.
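The following is a minimal sketch of this procedure (Python with NumPy; the function name and the random initialization are illustrative choices, not the implementation of Hornik et al., 2012):

```python
import numpy as np

def spherical_kmeans(Y, K, n_iter=100, seed=0):
    """Cluster unit vectors Y (n, p) into K groups by cosine similarity."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    # Initialize the K prototypes as randomly chosen data points.
    centers = Y[rng.choice(n, size=K, replace=False)].copy()
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Membership update: assign each point to its most similar prototype.
        new_labels = (Y @ centers.T).argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break                       # converged: no membership changed
        labels = new_labels
        # Prototype update: mean direction of each cluster, renormalized.
        for k in range(K):
            members = Y[labels == k]
            if members.size:
                m = members.sum(axis=0)
                centers[k] = m / np.linalg.norm(m)
    return labels, centers
```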
It is known that for complex data sets containing overlapping clusters, fuzzy partitions model the data better than their crisp counterparts. In the fuzzy C-means clustering algorithm for spherical data, each data point belongs to more than one cluster, with different membership values (Kesemen et al., 2016). The fuzzy C-means algorithm for spherical data uses the criterion
min_{B,M} ∑_{k=1}^{K} ∑_{i=1}^{n} ν_{ki}^m d_{ki}²,    (1)

where ν_{ki} denotes the membership of the ith object in cluster k, m > 1 is the fuzzifier, and d_{ki} is the distance between the ith object and the kth cluster center.
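A minimal sketch of the resulting alternating updates (Python with NumPy; the angular distance d_{ki} = arccos⟨y_i, p_k⟩ and the update formulas below are the standard fuzzy C-means choices, stated here as an assumption rather than the exact algorithm of Kesemen et al., 2016):

```python
import numpy as np

def fuzzy_cmeans_sphere(Y, K, m=2.0, n_iter=100, seed=0, eps=1e-10):
    """Fuzzy C-means for unit vectors Y (n, p) with angular distance."""
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Angular distances d_ki, clipped for numerical safety.
        D = np.arccos(np.clip(Y @ centers.T, -1.0, 1.0)) + eps   # (n, K)
        # Membership update: nu_ki proportional to d_ki^{-2/(m-1)}.
        W = D ** (-2.0 / (m - 1.0))
        nu = W / W.sum(axis=1, keepdims=True)
        # Center update: membership-weighted mean direction, renormalized.
        C = (nu ** m).T @ Y                                      # (K, p)
        centers = C / np.linalg.norm(C, axis=1, keepdims=True)
    return nu, centers
```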
Suppose a data set consists of n spherical objects (i.e. unit vectors) y1, y2, . . . , yn ∈ S^{p−1} that one wants to divide into K homogeneous groups. Denoting by g(·) a probability density function of Y, the mixture model is
g(y) = ∑_{k=1}^{K} π_k f(y; θ_k),    (2)
where π_k (with ∑_{k=1}^{K} π_k = 1), f and θ_k represent the mixing proportion, the spherical density function and the parameter vector of the kth mixture component, respectively. Inference for a mixture model cannot be carried out directly through maximization of the likelihood, since the group labels {z1, z2, . . . , zn} of the n objects are unknown. The set of pairs {(y_i, z_i)}_{i=1}^{n} is usually referred to as the complete data set. The E-M algorithm iteratively maximizes the conditional expectation of the complete log-likelihood, beginning with an initial value θ^(0). Each expectation (E) step computes the expectation of the complete log-likelihood conditional on the current value θ^(q). The maximization (M) step then maximizes this expected complete log-likelihood over θ to provide an update θ^(q+1). Computations with high-dimensional data or a large number of components can be quite demanding. In such cases, Bayesian approaches can lead to significant computational savings and have been quite popular.
The most widely used mixture model is a mixture of von Mises-Fisher (vMF) distributions. The probability density function of the vMF distribution is defined by

f(y | µ, κ) = c_d(κ) exp(κ µ′y),    (3)

where ‖µ‖ = 1 is the mean direction, κ ≥ 0 is the concentration parameter, and c_d(κ) denotes the normalizing constant.
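To make the E-M recipe above concrete, here is a compact sketch for a mixture of vMF distributions (Python with NumPy/SciPy). The concentration update uses the approximation κ̂ ≈ (r̄d − r̄³)/(1 − r̄²) of Banerjee et al. (2005); the rest is the generic E-M scheme, a sketch rather than a production implementation:

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function

def log_cd(kappa, d):
    """Log normalizing constant c_d(kappa) of the vMF density on S^{d-1}."""
    v = d / 2.0 - 1.0
    # log I_v(kappa) = log(ive(v, kappa)) + kappa, computed stably.
    return v * np.log(kappa) - (d / 2.0) * np.log(2 * np.pi) \
        - (np.log(ive(v, kappa)) + kappa)

def em_movmf(Y, K, n_iter=50, seed=0):
    """E-M for a K-component vMF mixture; Y is an (n, d) array of unit vectors."""
    n, d = Y.shape
    rng = np.random.default_rng(seed)
    mu = Y[rng.choice(n, size=K, replace=False)].copy()   # mean directions
    kappa = np.ones(K)                                    # concentrations
    pi = np.full(K, 1.0 / K)                              # mixing proportions
    for _ in range(n_iter):
        # E step: posterior responsibilities, computed on the log scale.
        logp = Y @ (mu * kappa[:, None]).T + log_cd(kappa, d) + np.log(pi)
        logp -= logp.max(axis=1, keepdims=True)
        z = np.exp(logp)
        z /= z.sum(axis=1, keepdims=True)
        # M step: update mixing proportions, mean directions, concentrations.
        Nk = z.sum(axis=0)
        pi = Nk / n
        R = z.T @ Y                                       # (K, d) resultants
        r = np.linalg.norm(R, axis=1)
        mu = R / r[:, None]
        rbar = np.clip(r / Nk, 1e-6, 1 - 1e-6)
        kappa = (rbar * d - rbar ** 3) / (1 - rbar ** 2)  # Banerjee et al. approx.
    return pi, mu, kappa, z
```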
While the E-M algorithm is the most widely used approach in mixture modeling for spherical clustering, it requires an approximation of the normalizing constant of the spherical probability distribution, for example c_d(κ) in the case of the vMF distribution. Alternatively, the score matching algorithm (Hyvärinen, 2005) can be employed, which does not require any knowledge of the normalizing constant. Let f(y; π) ∝ exp(π′t(y)) be a spherical density with natural parameter vector π and sufficient statistic t(y); the score matching estimator of π is
π̂ = W_n^{−1} d_n,    (4)

where W_n and d_n are sample averages over the n data points of w_{ab} = ⟨t_a, t_b⟩(y) and d_c = −∆_M t_c(y), with ⟨·, ·⟩ denoting the gradient inner product and ∆_M the Laplace-Beltrami operator on the sphere.
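Schematically, once the gradient inner products and the Laplace-Beltrami images of the sufficient statistics can be evaluated at each data point, (4) reduces to a single linear solve. A minimal sketch (Python with NumPy; grad_inner and laplace_beltrami are hypothetical placeholders for the model-specific formulas):

```python
import numpy as np

def score_matching_estimate(Y, grad_inner, laplace_beltrami):
    """Score matching estimator pi_hat = W_n^{-1} d_n from (4).

    grad_inner(y)       -> (q, q) matrix of gradient inner products <t_a, t_b>(y)
    laplace_beltrami(y) -> (q,) vector of Laplace-Beltrami values Delta_M t_c(y)
    """
    W_n = np.mean([grad_inner(y) for y in Y], axis=0)
    d_n = -np.mean([laplace_beltrami(y) for y in Y], axis=0)
    return np.linalg.solve(W_n, d_n)
```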
3 Alternative Spherical Probability Models

In this section, alternative spherical probability models are discussed that are suitable for modeling non-symmetric cluster shapes.
The probability density function of the Kent distribution (Kent, 1982) is defined by

f(y) = C_κ exp{κ ζ1′y + β[(ζ2′y)² − (ζ3′y)²]},    (5)

where ζ1, ζ2, ζ3 are the mean direction, major axis and minor axis vectors, respectively, κ, β are shape parameters, and C_κ denotes the normalizing constant. The density has ellipse-like contours of constant probability density on the spherical surface. In high dimensions, maximum likelihood estimation is problematic, and moment estimators are available (Peel et al., 2001).
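The exponent of (5) is straightforward to evaluate; a small sketch (Python with NumPy; the normalizing constant C_κ is omitted, so this returns the log-density only up to an additive constant):

```python
import numpy as np

def kent_log_kernel(y, zeta1, zeta2, zeta3, kappa, beta):
    """Exponent of the Kent density (5) for unit vectors y (p,) or (n, p).

    zeta1, zeta2, zeta3 are the (orthonormal) orientation vectors; the
    normalizing constant C_kappa is omitted.
    """
    y = np.atleast_2d(y)
    return (kappa * (y @ zeta1)
            + beta * ((y @ zeta2) ** 2 - (y @ zeta3) ** 2))
```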
By construction, the mixture of inverse stereographic projections of multivariate normal distributions has isodensity lines that are inverse stereographic mappings of ellipsoids, which allows asymmetric contour shapes. A necessary and sufficient condition for the density to be unimodal is that the greatest eigenvalue of the variance-covariance matrix is smaller than 1/(2(p − 1)), where p denotes the dimension of the multivariate normal distribution used in the projection. There is no closed form solution for the maximum likelihood estimator of µ (Dortet-Bernadet and Wicker, 2008).
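For illustration, here is one common convention for the inverse stereographic projection, mapping R^p onto the unit sphere S^p (Python with NumPy); Dortet-Bernadet and Wicker's exact parameterization may differ, so treat this as a sketch:

```python
import numpy as np

def inverse_stereographic(x):
    """Map points x (n, p) in R^p onto the unit sphere S^p in R^{p+1}.

    Uses the convention s = (2x, ||x||^2 - 1) / (||x||^2 + 1), which sends
    the origin to the south pole and far-away points toward the north pole.
    """
    x = np.atleast_2d(x)
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    return np.hstack([2 * x, sq - 1]) / (sq + 1)

# Sample from a projected normal model: x ~ N(mu, Sigma) in R^2, then project.
# The covariance eigenvalues (0.05) are below 1/(2(p-1)) = 0.5, so by the
# condition above the projected density is unimodal.
rng = np.random.default_rng(0)
x = rng.multivariate_normal(mean=[0.3, -0.2], cov=0.05 * np.eye(2), size=5)
print(np.linalg.norm(inverse_stereographic(x), axis=1))  # all ones
```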
While mixture models using the Kent distribution or the inverse stereographic projection of normal distributions are suitable for elongated clusters, owing to their elliptic contours, they will not perform well for clusters with shifted centers or for non-convex clusters. The spherical generalizations of two asymmetric circular distributions presented below provide more flexible model-based spherical clustering.
When data lie on the unit circle, the generalized von Mises (GvM) density is given by

f(θ) = exp{κ1 cos(θ − µ1) + κ2 cos 2(θ − µ2)} / ∫_{−π}^{π} exp{κ1 cos(θ − µ1) + κ2 cos 2(θ − µ2)} dθ,    (6)

where µ1, µ2 ∈ (−π, π] are location parameters, and κ1, κ2 > 0 are shape parameters. The GvM distribution is suitable for modeling asymmetric and bimodal circular data, and extends the von Mises (vM) distribution.
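The normalizing integral in (6) has no simple closed form, so it is convenient to evaluate the density by numerical quadrature; a short sketch (Python with NumPy/SciPy):

```python
import numpy as np
from scipy.integrate import quad

def gvm_pdf(theta, mu1, mu2, kappa1, kappa2):
    """GvM density (6), normalized by numerical quadrature."""
    def kernel(t):
        return np.exp(kappa1 * np.cos(t - mu1) + kappa2 * np.cos(2 * (t - mu2)))
    const, _ = quad(kernel, -np.pi, np.pi)   # normalizing integral in (6)
    return kernel(theta) / const

# Bimodal example: well-separated location parameters.
print(gvm_pdf(np.array([0.0, np.pi / 2]), 0.0, np.pi / 2, 1.0, 1.5))
```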
A spherical generalization of the GvM distribution has the density given by

f(y) = C_κ exp{κ ζ1′y + β[(ζ2′y)² − (ζ3′y)²]},    (7)

where the ζ's are orientation vectors that are not constrained to be orthogonal, κ, β are shape parameters, and C_κ denotes the normalizing constant.
Various contour shapes shown in Figure 2 suggest that a mixture model based on the spherical generalization of the GvM distribution is appropriate for non-convex symmetric or asymmetric cluster shapes, as well as for circular or elliptic symmetric cluster shapes. The Kent distribution is the special case of (7) in which ζ1, ζ2 and ζ3 are constrained to be orthogonal.
[Figure 2: contour plots of the spherical GvM density for various parameter values, illustrating circular, elliptic, non-convex and asymmetric shapes.]
A three-parameter generalized von Mises (GvM3) distribution (Kim and SenGupta, 2012) involves a location parameter µ ∈ (−π, π], a concentration parameter κ1 > 0 and a skewness parameter κ2 ∈ [−1, 1]. The GvM3 distribution has an advantage over the GvM distribution, having one less parameter and an easier interpretation of its parameters.
A spherical generalization of the GvM3 distribution is defined analogously, in terms of orientation vectors ζ, shape parameters κ and β, and a normalizing constant C_κ.
Various contour shapes shown in Figure 3 suggest that a mixture model based on the spherical generalization of the GvM3 distribution is appropriate for clusters with shifted centers or clusters with a daughter cluster, as well as for symmetric cluster shapes.
[Figure 3: contour plots of the spherical GvM3 density for various parameter values, illustrating shifted-center and daughter-cluster shapes.]
4 Concluding Remarks
In this chapter, an overview of spherical clustering was presented, and more flexible alternative model-based methods were discussed. We suggest that readers consider the alternative model-based methods presented in this chapter when the cluster shapes in a data set appear to arise from populations with neither circular nor elliptic contours. On the other hand, more flexible alternative distance-based methods for asymmetric cluster shapes can be obtained by developing suitable similarity measures.
References
1. Banerjee, A., Dhillon, I.S., Ghosh, J., Sra, S.: Clustering on the unit hypersphere using von
Mises-Fisher Distributions. Journal of Machine Learning Research, 6, 1345–1382 (2005)
2. Dhillon, I.S., Modha, D.S.: Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 143–175 (2001)
3. Dortet-Bernadet, J-N., Wicker, N.: Model-based clustering on the unit sphere with an illus-
tration using gene expression profiles. Biostatistics, 9, 66–80 (2008)
4. Everitt, B.S., Hand, D.J.: Finite Mixture Distributions. Chapman and Hall, London (1981)
5. Hornik, K., Feinerer, I., Kober, M., Buchta, C.: Spherical K-means Clustering. Journal of
Statistical Software, 50, 1–22 (2012)
6. Jammalamadaka, S.R., SenGupta, A.: Topics in Circular Statistics. World Scientific, Singapore
(2001)
7. Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis. Pearson, New York
(2008)
8. Kent, J.T.: The Fisher-Bingham distribution on the sphere. Journal of the Royal Statistical
Society Series B, 44, 71–80 (1982)
9. Kesemen, O., Tezel, Ö., Özkul, E.: Fuzzy c-means clustering algorithm for directional data
(FCM4DD). Expert Systems with Applications, 58, 76–82 (2016)
10. Kim, S., SenGupta, A.: A three-parameter generalized von Mises distribution. Statistical Pa-
pers, 54, 685–693 (2012)
11. Peel, D., Whiten, W.J., McLachlan, G.J.: Fitting mixtures of Kent distributions to aid in joint
set identification. Journal of the American Statistical Association, 96, 56–63 (2001)
12. Hyvärinen, A.: Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6, 695–709 (2005)
13. SenGupta, A.: High volatility, multimodal distributions and directional statistics. Special Invited Paper, Platinum Jubilee International Conference on Applications of Statistics, Calcutta University, December 21-23, 2016