
Gaussian Mixture Model Clustering with Incomplete Data

YI ZHANG, School of Computer, NUDT


MIAOMIAO LI, Changsha University; School of Computer, NUDT
SIWEI WANG, SISI DAI, LEI LUO, and EN ZHU, School of Computer, NUDT
HUIYING XU, College of Mathematics and Computer Science, Zhejiang Normal University; Department
of Computer Science, City University of Hong Kong
XINZHONG ZHU, College of Mathematics and Computer Science, Zhejiang Normal University
CHAOYUN YAO, State Key Laboratory of Complex Electromagnetic Environment Effects on Electronics
and Information System, NUDT
HAORAN ZHOU, Chongqing University of Technology

Gaussian mixture model (GMM) clustering has been extensively studied due to its effectiveness and efficiency. Though demonstrating promising performance in various applications, it cannot effectively handle absent features among the data, which are not uncommon in practical applications. In this article, different from existing approaches that first impute the absent features and then perform GMM clustering on the imputed data, we propose to integrate the imputation and GMM clustering into a unified learning procedure. Specifically, the missing data is filled by the result of GMM clustering, and the imputed data is then taken for GMM clustering. These two steps alternately negotiate with each other to reach an optimum. In this way, the imputed data can best serve GMM clustering. A two-step alternating algorithm with proven convergence is carefully designed to solve the resultant optimization problem. Extensive experiments have been conducted on eight UCI benchmark datasets, and the results have validated the effectiveness of the proposed algorithm.
CCS Concepts: • Computing methodologies → Machine learning algorithms; • Theory of computation → Unsupervised learning and clustering;
Additional Key Words and Phrases: GMM, clustering, EM, incomplete data
ACM Reference format:
Yi Zhang, Miaomiao Li, Siwei Wang, Sisi Dai, Lei Luo, En Zhu, Huiying Xu, Xinzhong Zhu, Chaoyun Yao,
and Haoran Zhou. 2021. Gaussian Mixture Model Clustering with Incomplete Data. ACM Trans. Multimedia
Comput. Commun. Appl. 17, 1s, Article 6 (March 2021), 14 pages.
https://doi.org/10.1145/3408318

This work was supported by the National Natural Science Foundation of China (project no. 61906020, 61701451, 61872377, and 61922088).
Authors’ addresses: Y. Zhang, S. Wang, S. Dai, L. Luo, and E. Zhu, School of Computer, NUDT, China; M. Li (corresponding
author), Changsha University; School of Computer, NUDT, China; email: [email protected]; H. Xu, College of
Mathematics and Computer Science, Zhejiang Normal University; Department of Computer Science, City University of
Hong Kong, China; X. Zhu, College of Mathematics and Computer Science, Zhejiang Normal University, China; C. Yao,
State Key Laboratory of Complex Electromagnetic Environment Effects on Electronics and Information System, NUDT,
China; H. Zhou, Chongqing University of Technology, China.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2021 Association for Computing Machinery.
1551-6857/2021/03-ART6 $15.00
https://doi.org/10.1145/3408318


1 INTRODUCTION
Clustering has been widely studied in the unsupervised learning community and used in various
real-world applications. Plenty of excellent clustering algorithms have been proposed in the past
few decades, such as K-Means, Fuzzy C-Means, DBSCAN, hierarchical clustering, and many more
[13, 29]. These clustering algorithms show good clustering performance and are widely applied in
simple data analysis situations; however, they cannot process complicated multi-modal data well,
resulting in poor clustering performance [8, 22].
The Gaussian mixture model (GMM) clustering algorithm, which achieves better clustering performance on complicated multi-modal data, has been proposed to address this issue. GMM clustering has been extensively studied due to its effectiveness and efficiency [4] and is widely applied in various fields, such as image segmentation problems in medicine and transportation [27].
Although the existing GMM clustering and its improved algorithms have achieved great success
and demonstrated promising performance in various applications [19], these clustering algorithms
all assume that the dataset is completely observable. However, in many practical applications, due
to the absence of partial features among samples, this assumption may no longer hold, and thus
the above algorithms cannot be effectively applied to cope with the absent features among data.
Many efforts have been devoted to addressing the problem of clustering incomplete data. Their main idea is to first impute the incomplete features through filling algorithms, such as zero filling, mean filling, k-nearest-neighbor filling, expectation-maximization filling [10], and other improved algorithms [2, 7, 11, 15]. These methods try to reduce the negative impact of missing data by pre-processing and then apply standard GMM clustering algorithms to the imputed data matrices. However, the processes of data imputation and clustering are performed separately, so the imputation of missing features cannot serve the clustering task.
Therefore, different from existing approaches that first impute the absent features and then perform GMM clustering on the imputed data, we propose to integrate the imputation and GMM clustering into a unified learning procedure. Specifically, the missing data is filled by the result of GMM clustering, and the imputed data is then taken for GMM clustering. These two steps alternately negotiate with each other to reach an optimum. In this way, the imputed data can best serve GMM clustering. Further, a two-step alternating algorithm with proven convergence is carefully designed to solve the resultant optimization problem. Extensive experiments have been conducted on eight UCI benchmark datasets, and the results have validated the effectiveness of the proposed algorithm.
The contributions of this article are summarized as follows:

• Different from existing algorithms that treat imputation as preprocessing and clustering as a separate step, we unify these two processes into one optimization objective. Missing features are dynamically imputed to better serve the clustering task, while the observed entries are guaranteed to remain unchanged.
• To the best of our knowledge, this is the latest work to extend the Gaussian mixture model to cope with clustering tasks on incomplete data. We present the formulation that realizes the aforementioned idea. In addition, we design an alternating optimization algorithm to solve the resulting optimization problem of clustering with incomplete data.
• Extensive experiments are conducted on eight UCI benchmark datasets. Our algorithm consistently achieves state-of-the-art performance compared to other filling methods, and the experimental results validate the effectiveness of the proposed algorithm.


2 RELATED WORK
2.1 Gaussian Mixture Model
Given a data matrix X = {x_1, x_2, ..., x_n}, the maximum likelihood estimation method can be applied to solve its model parameters [4], and the objective formulation is represented as follows:

$$\max_{\alpha,\,\mu,\,\Sigma}\ \sum_{j=1}^{n} \ln\left( \sum_{i=1}^{k} \frac{\alpha_i}{(2\pi)^{n/2}\,|\Sigma_i|^{1/2}}\, e^{-\frac{1}{2}(x_j-\mu_i)^{T}\Sigma_i^{-1}(x_j-\mu_i)} \right), \tag{1}$$

where μ_i and Σ_i are the parameters of the ith Gaussian mixture component, and α_i is the corresponding mixture coefficient, which satisfies $\sum_{i=1}^{k}\alpha_i = 1$.
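To make the objective in Equation (1) concrete, the following is a minimal NumPy/SciPy sketch of evaluating the GMM log-likelihood for a candidate set of parameters. The function name and the array layout (`alphas`, `mus`, `Sigmas`) are illustrative choices for this example, not taken from the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, alphas, mus, Sigmas):
    """X: (n, d) data; alphas: (k,) mixture weights; mus: (k, d); Sigmas: (k, d, d)."""
    # log alpha_i + log N(x_j | mu_i, Sigma_i) for every sample/component pair
    log_terms = np.stack([
        np.log(alphas[i]) + multivariate_normal.logpdf(X, mus[i], Sigmas[i])
        for i in range(len(alphas))
    ], axis=1)                                   # shape (n, k)
    return logsumexp(log_terms, axis=1).sum()    # sum_j ln sum_i alpha_i p(x_j | mu_i, Sigma_i)
```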
Although the existing GMM clustering and its improved algorithms have achieved great success
and demonstrated promising performance in various applications, these clustering algorithms all
assume that the dataset is completely observable. However, in many practical applications, due to
the absence of partial features among samples, the above algorithms cannot be effectively applied
to cope with the absent features among data.
Many efforts have been devoted to addressing the problem of clustering incomplete data. These
methods try to reduce the negative impact of missing data by pre-processing, and then apply
standard GMM clustering algorithms to these imputed data matrices.

2.2 Imputation Methods


2.2.1 Basic Statistical Methods. The statistical methods seek to extract more useful information from the missing data. Most of them impute the absent values through statistical properties rather than discarding incomplete information. They fill the incomplete entries with constants to obtain a complete data sample and apply it to learning tasks.
For example, the simplest and most common methods fill with values such as zero, the conditional mean, or the median of the corresponding dimension. In particular, the KNN-filling method has been proposed to impute the missing entries with the mean feature of the K closest neighbors in that dimension [9]. However, it does not work well when the missing ratio is relatively large, and thus we do not use it as a comparison algorithm.
Different from the aforementioned methods, another statistical class deals with incomplete features within Bayesian frameworks. These frameworks are often formulated in a maximum-likelihood manner, which imputes the missing values with the most likely estimates. The most popular such method is the expectation-maximization (EM) algorithm [5].
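As a small illustration of these basic filling strategies, here is a sketch assuming missing entries are marked with NaN (a convention chosen for this example; the paper does not prescribe a data format):

```python
import numpy as np

def zero_fill(X):
    """Replace every missing (NaN) entry with zero."""
    return np.where(np.isnan(X), 0.0, X)

def mean_fill(X):
    """Replace every missing (NaN) entry with the mean of its dimension."""
    col_means = np.nanmean(X, axis=0)     # per-dimension mean over the observed entries
    return np.where(np.isnan(X), col_means, X)
```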
2.2.2 Dynamically K-means Filling. Given a data matrix X = {x_i}_{i=1}^n and the number of clusters k, we have three variables to be optimized: the data matrix X, the assignment matrix H, and the cluster centers μ_c (1 ≤ c ≤ k) [23]. By imposing the constraint on X, the formulation can be expressed as follows:

$$\min_{H,\,\{\mu_c\}_{c=1}^{k},\,X}\ \sum_{i=1}^{n}\sum_{c=1}^{k} H_{ic}\,\|x_i-\mu_c\|^2 \quad \text{s.t.}\ \forall i,\ \sum_{c=1}^{k} H_{ic}=1,\ \ x_i(o_i)=x_{o_i}, \tag{2}$$

where $n_c=\sum_{i=1}^{n}H_{ic}$ and $\mu_c=\frac{1}{n_c}\sum_{i=1}^{n}H_{ic}\,x_i$ are the sample number and the centroid of the cth (1 ≤ c ≤ k) cluster, and x_i(o_i) represents the observable elements of the ith sample x_i. Moreover, we also impose constraints on the observable part x_i(o_i) of the data matrix to ensure that its values remain unchanged during the optimization process.
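The following sketch illustrates the alternating idea behind Equation (2): assign samples to centroids, then re-impute the missing entries from the assigned centroid while keeping the observed entries fixed. It is written from the formulation above rather than from the reference implementation of [23], and all names are illustrative.

```python
import numpy as np

def dynamic_kmeans_fill(X, k, n_iter=50, seed=0):
    """X: (n, d) with NaN marking missing entries. Returns the filled matrix and labels."""
    rng = np.random.default_rng(seed)
    mask = np.isnan(X)                                # True where a feature is missing
    Xf = np.where(mask, np.nanmean(X, axis=0), X)     # warm start with mean filling
    centers = Xf[rng.choice(len(Xf), k, replace=False)].copy()
    labels = np.zeros(len(Xf), dtype=int)
    for _ in range(n_iter):
        dist = ((Xf[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(axis=1)                  # assignment step (H in Eq. (2))
        for c in range(k):
            if np.any(labels == c):
                centers[c] = Xf[labels == c].mean(axis=0)
        Xf[mask] = centers[labels][mask]              # re-impute missing entries; x_i(o_i) is kept
    return Xf, labels
```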


2.3 Other Clustering Algorithms


2.3.1 Multi-view Clustering. Multi-view clustering (MVC) integrates the available multi-view information to categorize data items with similar structures or patterns into the same group, aiming to optimally integrate features from different views to improve clustering performance [3]. It has been intensively studied and widely applied to various applications during the last few decades [6, 14, 16, 30].
The work in [24] presented a novel approach to subspace clustering over multi-view data, which achieves data correlation consensus across views through an angular-based regularizer. The work in [25] proposed to learn a clustered low-rank representation via structured matrix factorization for multi-view spectral clustering. The work in [16] proposed a multiple kernel k-means clustering algorithm with matrix-induced regularization to reduce the redundancy of the pre-defined kernels. A multiple kernel algorithm was proposed to allow the optimal kernel to reside in the neighborhood of the combinational kernels [31]. The works in [17, 18] proposed simple yet effective algorithms to address incomplete kernel matrices. These newly proposed methods greatly improve the practical applicability of multi-view clustering, and the study of multi-view clustering remains a direction worthy of continued exploration and research.
2.3.2 Deep Clustering. Deep learning can automatically extract abstract, non-linear, and cluster-friendly features from complex data structures, thereby improving the performance of the algorithm. In recent years, it has made great progress in various application fields. Clustering algorithms are supplemented by deep neural network feature learning in order to capture the data itself or its internal structure and thus better separate the clusters [1, 20, 21].
The work in [26] further proposed a cycle-consistency loss on top of adversarial training in order to strengthen the correlation between the input and the corresponding output. It provides good inspiration for research on deep neural network clustering.
The works in [12, 28] studied and improved deep embedded clustering (DEC). DEC is one of the most representative deep clustering methods and has gained a lot of attention in the field. Using an autoencoder as the network architecture, clustering is performed by minimizing the reconstruction loss of the autoencoder and the KL divergence between the label distribution and an auxiliary distribution. It has achieved encouraging results and has become a common reference point for the performance of newer deep clustering algorithms.

3 GMM CLUSTERING WITH INCOMPLETE DATA


In the above methods, the processes of data imputation and clustering are performed separately, so the imputation of missing features cannot serve the clustering task. Therefore, different from existing approaches that first impute the absent features and then perform GMM clustering on the imputed data, we propose to integrate the imputation and GMM clustering into a unified learning procedure. Specifically, the missing data is filled by the result of GMM clustering, and the imputed data is then taken for GMM clustering. These two steps alternately negotiate with each other to reach an optimum. In this way, the imputed data can best serve GMM clustering.
For incomplete data, each sample x_j (1 ≤ j ≤ n) can be divided into two parts: the observable features x_j(o_j) and the missing features x_j(m_j). When we optimize the prior of the k components, the mean μ_k, and the covariance Σ_k of the kth Gaussian basis, we propose to optimize the missing parts of the data x_j(m_j) while keeping the observable features x_j(o_j) unchanged during the optimization process.


3.1 The Proposed Formulation


Based on the aforementioned discussion, we redefine the objective of the Gaussian mixture model to cope with incomplete data clustering. Given the data matrix X and the number of clusters k, we have four variables to be optimized: the data matrix X, the prior of the k components, the mean μ_k, and the covariance Σ_k. By imposing the constraint on the observable features of the data, we need to optimize the model parameters {α, μ, Σ} and the missing data x_j(m_j). Thus, we formulate our Gaussian mixture model clustering with incomplete data as follows:

$$\max_{\alpha,\,\mu,\,\Sigma,\,X}\ \sum_{j=1}^{n} \ln\left(\sum_{i=1}^{k}\alpha_i\, p(x_j \mid \mu_i,\Sigma_i)\right) \quad \text{s.t.}\ x_j(o_j)=x_{o_j}, \tag{3}$$

where Σ_i is a symmetric positive definite matrix, x_j(o_j) represents the observable elements of the jth sample x_j, and p(x_j | μ_i, Σ_i) represents the probability density of the jth sample under the ith Gaussian component, which can be expressed as follows:

$$p(x_j \mid \mu_i,\Sigma_i) = \frac{1}{(2\pi)^{n/2}\,|\Sigma_i|^{1/2}}\, e^{-\frac{1}{2}(x_j-\mu_i)^{T}\Sigma_i^{-1}(x_j-\mu_i)}. \tag{4}$$

Moreover, we also impose constraints on the observable part x_j(o_j) of the data matrix to ensure that its values remain unchanged during the optimization process, while the missing part x_j(m_j) of the data matrix is continuously optimized through the EM algorithm during the clustering process.
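As a quick numerical check of Equation (4), the component density can be evaluated directly with SciPy; the values below are illustrative only.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Density of one sample x_j under the i-th component (Eq. (4)), here a 2-D standard normal.
x_j = np.array([0.3, -1.2])
mu_i = np.zeros(2)
Sigma_i = np.eye(2)
p_ji = multivariate_normal.pdf(x_j, mean=mu_i, cov=Sigma_i)
print(p_ji)   # about 0.074 for this example
```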

3.2 Alternative Optimization


As seen, the additional constraint x_i(o_i) = x_{o_i} makes the whole optimization problem difficult to solve. To solve it, we design a two-step alternating optimization algorithm with a fast convergence rate, where the first step can be easily solved by applying existing off-the-shelf packages and the second step can be solved by the matrix derivative method. To solve the model parameters {(α_i, μ_i, Σ_i) | 1 ≤ i ≤ k}, we can use maximum likelihood estimation, which maximizes the log-likelihood as follows:

$$LL(X) = \sum_{j=1}^{n} \ln\left(\sum_{i=1}^{k}\alpha_i\, p(x_j \mid \mu_i,\Sigma_i)\right). \tag{5}$$

We use the EM algorithm for iterative optimization.

3.2.1 Expectation Step. For the dataset X = {x_1, x_2, ..., x_n}, let the random variable z_j ∈ {1, 2, ..., k} denote the Gaussian mixture component to which the sample x_j belongs; its value is unknown. Obviously, the prior probability of z_j corresponds to α. According to Bayes' theorem, the posterior distribution of z_j corresponds to

$$\gamma_{ji} = \frac{\alpha_i\, p(x_j \mid \mu_i,\Sigma_i)}{\sum_{l=1}^{k}\alpha_l\, p(x_j \mid \mu_l,\Sigma_l)}. \tag{6}$$

In each iteration, we first calculate the posterior probability γ_ji, i.e., the probability that each sample belongs to each Gaussian mixture component under the current parameters; this is the Expectation step (E-step).
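A minimal sketch of this E-step, computing the responsibilities of Equation (6) in log space for numerical stability (variable names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def e_step(X, alphas, mus, Sigmas):
    """Return gamma of shape (n, k); gamma[j, i] is the posterior of Eq. (6)."""
    log_r = np.stack([
        np.log(alphas[i]) + multivariate_normal.logpdf(X, mus[i], Sigmas[i])
        for i in range(len(alphas))
    ], axis=1)                                                      # (n, k)
    return np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))  # rows sum to one
```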


3.2.2 Maximization Step.


ALGORITHM 1: Gaussian Mixture Model Clustering with Incomplete Data
Input: incomplete data matrix X = {x_1, x_2, ..., x_n}, number of clusters k, convergence tolerance ϵ, and maximum number of iterations maxIter
Parameters: mixing coefficients α, mean vectors μ, covariance matrices Σ, and data matrix X
Output: complete data matrix X and cluster division C

1: Initialize the missing values with a basic imputation method.
2: Initialize the parameters of the Gaussian mixture model from the data matrix X.
3: repeat
4:   Calculate the posterior probability Γ that each sample is generated by each Gaussian mixture component by solving Equation (6).
5:   Update μ_i, Σ_i, α_i (1 ≤ i ≤ k) by solving Equations (9)–(11), with the other GMM parameters and the data matrix X fixed.
6:   Update X by solving Equation (14) with the mixing coefficients α, mean vectors μ, and covariance matrices Σ fixed.
7:   Increment the iteration counter: t = t + 1.
8: until (obj^(t) − obj^(t−1))/obj^(t) ≤ ϵ or t ≥ maxIter
9: C_i = ∅ (1 ≤ i ≤ k).
10: for j = 1, 2, ..., n do
11:   Determine the cluster label of sample x_j: λ_j = arg max_{i ∈ {1,2,...,k}} γ_{ji}.
12:   Assign x_j to the corresponding cluster: C_{λ_j} = C_{λ_j} ∪ {x_j}.
13: end for

3.2.3 Optimizing Parameters α_i, μ_i, and Σ_i with Fixed X. First, the optimization problem in Equation (3) can be divided into k sub-problems, one for each component of the Gaussian mixture model:

$$\max_{\alpha,\,\mu,\,\Sigma}\ \sum_{j=1}^{n} \ln\left(\alpha_i\, p(x_j \mid \mu_i,\Sigma_i)\right) \quad \text{s.t.}\ x_j(o_j)=x_{o_j}. \tag{7}$$

Optimizing μ_i with fixed α_i, Σ_i, and X: With the mixture coefficients α_i and covariance matrices Σ_i fixed, we set the partial derivative of Equation (7) with respect to μ_i to zero and obtain:

$$\sum_{j=1}^{n} \frac{\alpha_i\, p(x_j \mid \mu_i,\Sigma_i)}{\sum_{l=1}^{k}\alpha_l\, p(x_j \mid \mu_l,\Sigma_l)}\,(x_j-\mu_i)=0, \tag{8}$$

$$\mu_i = \frac{\sum_{j=1}^{n}\gamma_{ji}\,x_j}{\sum_{j=1}^{n}\gamma_{ji}}. \tag{9}$$
Optimizing Σ_i with fixed α_i, μ_i, and X: Similar to the previous step, we set the partial derivative of Equation (7) with respect to Σ_i to zero and obtain:

$$\Sigma_i = \frac{\sum_{j=1}^{n}\gamma_{ji}\,(x_j-\mu_i)(x_j-\mu_i)^{T}}{\sum_{j=1}^{n}\gamma_{ji}}. \tag{10}$$

Optimizing α_i with fixed μ_i, Σ_i, and X: For the mixture coefficients α_i, since α_i ≥ 0 and $\sum_{i=1}^{k}\alpha_i=1$, consider the Lagrangian form $LL(X)+\lambda\,(\sum_{i=1}^{k}\alpha_i-1)$, where λ is a Lagrange multiplier. Setting the derivative with respect to α_i to zero yields the new value of α_i:

$$\alpha_i = \frac{1}{n}\sum_{j=1}^{n}\gamma_{ji}. \tag{11}$$
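The closed-form updates of Equations (9)–(11) can be written compactly as below. This is an illustrative sketch that takes the responsibilities `gamma` from the E-step; the small diagonal ridge added to each covariance is a common numerical safeguard and is not part of the formulation above.

```python
import numpy as np

def m_step_params(X, gamma, reg=1e-6):
    """gamma: (n, k) responsibilities. Returns updated (alphas, mus, Sigmas)."""
    n, d = X.shape
    Nk = gamma.sum(axis=0)                            # effective sample count per component
    alphas = Nk / n                                   # Eq. (11)
    mus = (gamma.T @ X) / Nk[:, None]                 # Eq. (9)
    Sigmas = np.empty((len(Nk), d, d))
    for i in range(len(Nk)):
        diff = X - mus[i]
        Sigmas[i] = (gamma[:, i][:, None] * diff).T @ diff / Nk[i]   # Eq. (10)
        Sigmas[i] += reg * np.eye(d)                  # ridge for numerical stability
    return alphas, mus, Sigmas
```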

3.2.4 Optimizing the Data Matrix X with Fixed α, μ, and Σ. The problem in Equation (3) can be divided into n individual sub-problems, one for each sample x_j, and the optimization can be equivalently rewritten as follows:

$$\max_{X}\ \sum_{i=1}^{k}\alpha_i\, p(x_j \mid \mu_i,\Sigma_i) \quad \text{s.t.}\ x_j(o_j)=x_{o_j}, \tag{12}$$

where x_j(o_j) represents the observable elements of the jth sample x_j. Constraints are imposed on the observable part x_j(o_j) of the data matrix to ensure that its values remain unchanged during the optimization process.
To solve this problem, we compute the partial derivative of Equation (12) with respect to x_m. We partition the mean μ_i and the inverse covariance matrix Σ_i^{-1}, like the data, into observable and missing parts, as follows:

$$x_j = \begin{bmatrix} x_{jo} \\ x_{jm} \end{bmatrix}, \quad \mu_i = \begin{bmatrix} \mu_{io} \\ \mu_{im} \end{bmatrix}, \quad \Sigma_i^{-1} = \begin{bmatrix} \Sigma_{ioo}^{-1} & \Sigma_{iom}^{-1} \\ \Sigma_{imo}^{-1} & \Sigma_{imm}^{-1} \end{bmatrix}. \tag{13}$$

Setting the partial derivative of Equation (12) with respect to x_m to zero, we finally obtain the analytic solution

$$X_m = \left(\sum_{i=1}^{k}\alpha_i P_i\,\Sigma_{imm}^{-1}\right)^{-1} \sum_{i=1}^{k}\alpha_i P_i \left(\Sigma_{imo}^{-1}\mu_{io} + \Sigma_{imm}^{-1}\mu_{im} - \Sigma_{imo}^{-1} x_{jo}\right), \tag{14}$$

where P_i = p(x_j | μ_i, Σ_i). As seen from Equation (14), the missing elements of each sample x_j are imputed from the corresponding dimensions of the Gaussian mixture components and the observable features of the sample.
According to the posterior probability γ_ji calculated in the E-step, the model parameters and the missing data values are updated by maximum likelihood estimation, which constitutes the Maximization step (M-step).
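Below is an illustrative sketch of the missing-feature update of Equations (13)–(14) for a single sample, together with a small driver loop in the spirit of Algorithm 1 that alternates it with the `e_step`, `m_step_params`, and `gmm_log_likelihood` helpers sketched earlier. All names are illustrative, and this is a sketch of the described procedure, not the authors' MATLAB implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def update_missing(x, miss, alphas, mus, Sigmas):
    """Update the missing entries of one sample x via Eq. (14); miss marks missing dims."""
    obs = ~miss
    A = np.zeros((miss.sum(), miss.sum()))    # accumulates sum_i alpha_i P_i Sigma_imm^{-1}
    b = np.zeros(miss.sum())                  # accumulates the right-hand side of Eq. (14)
    for alpha, mu, Sigma in zip(alphas, mus, Sigmas):
        P = multivariate_normal.pdf(x, mu, Sigma)     # P_i evaluated at the current filled x
        inv = np.linalg.inv(Sigma)                    # block-partitioned as in Eq. (13)
        inv_mm, inv_mo = inv[np.ix_(miss, miss)], inv[np.ix_(miss, obs)]
        A += alpha * P * inv_mm
        b += alpha * P * (inv_mo @ mu[obs] + inv_mm @ mu[miss] - inv_mo @ x[obs])
    x_new = x.copy()
    x_new[miss] = np.linalg.solve(A, b)               # observed entries stay unchanged
    return x_new

def gmm_incomplete(X, k, n_iter=100, tol=1e-6, seed=0):
    """Driver loop following Algorithm 1; reuses e_step / m_step_params / gmm_log_likelihood."""
    rng = np.random.default_rng(seed)
    mask = np.isnan(X)
    Xf = np.where(mask, np.nanmean(X, axis=0), X)     # step 1: basic imputation
    mus = Xf[rng.choice(len(Xf), k, replace=False)].copy()   # step 2: crude initialization
    alphas = np.full(k, 1.0 / k)
    Sigmas = np.stack([np.cov(Xf.T) + 1e-6 * np.eye(X.shape[1])] * k)
    prev = -np.inf
    for _ in range(n_iter):
        gamma = e_step(Xf, alphas, mus, Sigmas)                      # Eq. (6)
        alphas, mus, Sigmas = m_step_params(Xf, gamma)               # Eqs. (9)-(11)
        for j in np.where(mask.any(axis=1))[0]:                      # Eq. (14), sample by sample
            Xf[j] = update_missing(Xf[j], mask[j], alphas, mus, Sigmas)
        obj = gmm_log_likelihood(Xf, alphas, mus, Sigmas)            # Eq. (5)
        if np.abs(obj - prev) <= tol * np.abs(obj):
            break
        prev = obj
    return Xf, gamma.argmax(axis=1)
```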

3.3 Discussion and Extension


Computational complexity: Compared to the basic GMM clustering algorithm, our Algorithm 1 treats the data matrix X as an additional variable to be optimized. In Equation (14), we update the missing values by maximizing the likelihood objective with the GMM parameters, and this update requires inverting the covariance matrices, which costs O(d^3). Therefore, the time complexity of our algorithm is O(tknd^3), where t, k, n, and d denote the numbers of iterations, clusters, samples, and dimensions, respectively.
Extensions: The algorithm in this work can be extended in the following directions. First, the initialization of the GMM parameters can readily be obtained through other methods to serve clustering better; for example, one can consider the density of the initial data distribution and the impact of outliers. Moreover, the idea of joint imputation and clustering is natural enough to generalize to other learning tasks such as classification and feature selection/extraction. The one-stage framework can also be extended to handle noise through dynamic clustering.


Table 1. Datasets Used in Our Experiments

Dataset #Samples #Dimensions #Clusters


Iris 150 4 3
AlcoholQCM 125 10 5
Seeds 210 7 3
Wine 178 13 3
Segment 2,310 18 7
ElectricalGrid 10,000 13 2
Avila 20,871 10 12
Letter 20,000 16 26

4 EXPERIMENT
4.1 Experiment Setting
The proposed algorithm is experimentally evaluated on eight widely used UCI datasets, including
Iris, AlcoholQCM, Seeds, Wine, Segment, ElectricalGrid, Avila, and Letter. Table 1 shows the details
of the datasets.
The incompleteness is randomly generated from the original complete data, which can be publicly downloaded from the UCI Machine Learning Repository. The number of samples varies from over 100 to more than 20,000, and the number of dimensions from 4 to 18.
Iris: It is perhaps the best-known database to be found in the pattern recognition literature. The
dataset contains three types of iris plant with a total of 150 instances. The sample features are four
real numeric attributes of sepal length, sepal width, petal length, and petal width.
AlcoholQCM: For different types of alcohol, QCM gas sensors with two channels are used to
measure gas with five different air-gas mixture ratios. There are 125 samples, and each sample
has 10 real numeric attributes. There are five types of alcohol: 1-octanol, 1-propanol, 2-butanol,
2-propanol, and 1-isobutanol.
Seeds: It is derived from the grains of three different wheat varieties, combine-harvested from experimental fields. A soft X-ray technique and the GRAINS package were used to construct all seven real-valued attributes. There are a total of 210 samples.
Wine: These data are the results of a chemical analysis of wines grown in the same region
in Italy but derived from three different cultivars. The analysis determined the quantities of
13 constituents found in each of the three types of wines. There are a total of 178 samples.
Segment: The instances were drawn randomly from a database of outdoor images. The images
were hand-segmented to create a classification for every pixel. Each sample is a 3x3 region, and
there are a total of 2,310 samples with 19 real numeric attributes.
ElectricalGrid: It contains data from the local stability analysis of a four-node star system implementing the decentral smart grid control concept. It has a total of 10,000 samples, which are divided into two clusters. The sample features are 13 real numeric attributes.
Avila: It has been extracted from 800 images of the “Avila Bible,” a twelfth-century giant Latin
copy of the Bible. There are a total of 20,871 samples, which are divided into 12 clusters. The sample
features are 10 real numeric attributes such as spacing and margin.
Letter: It consists of character images with a total of 20,000 samples. The character images were based on 20 different fonts, and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes, which were then scaled to fit into a range of integer values from 0 through 15. The labels are the 26 capital letters of the English alphabet.

Fig. 1. ACC comparison with the variation of missing ratios on eight benchmark datasets.

We compare the proposed Dynamically GMM Clustering algorithm with several commonly used imputation methods, including mean filling (MF), zero filling (ZF), and EM filling (EM). In addition, we compare with the recently proposed dynamically K-means filling (DK) combined with the previous three methods.
For all the datasets, it is assumed that the true number of clusters k is known, and it is set to the number of classes. We randomly generate the incompleteness from the original complete data matrix with missing ratios of 10%, 20%, 30%, 40%, 50%, 60%, and 70%, which affect the performance of the algorithms to varying degrees. To examine this effect in depth, we compare the algorithms with respect to the missing ratio.
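For illustration, random incompleteness at a given missing ratio could be generated from a complete matrix as in the sketch below; the paper's exact generation protocol may differ (e.g., in whether it guarantees at least one observed feature per sample).

```python
import numpy as np

def make_incomplete(X, missing_ratio, seed=0):
    """Randomly hide a fraction of entries by setting them to NaN."""
    rng = np.random.default_rng(seed)
    X_miss = X.astype(float).copy()
    drop = rng.random(X.shape) < missing_ratio    # True -> this entry becomes missing
    X_miss[drop] = np.nan
    return X_miss
```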
The widely used clustering accuracy (ACC), normalized mutual information (NMI), F-score, and
purity (PUR) are applied to evaluate the clustering performance of each algorithm.
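Computing ACC requires matching the predicted cluster labels to the ground-truth classes; a common recipe, sketched below, uses the Hungarian algorithm, while NMI is available directly in scikit-learn. This is a standard evaluation recipe, not the authors' code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    """ACC: best one-to-one matching between predicted clusters and classes."""
    k = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                          # co-occurrence counts
    row, col = linear_sum_assignment(-count)      # maximize the matched sample count
    return count[row, col].sum() / len(y_true)

# NMI, for comparison:
# nmi = normalized_mutual_info_score(y_true, y_pred)
```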
For all algorithms, we repeat each experiment 50 times with random initialization to reduce the effect of the random initial value selection of the K-means and GMM clustering algorithms and report the average result. Meanwhile, we randomly generate the incompleteness patterns 20 times in the above-mentioned way and report the statistical results.
The aggregated ACC, NMI, F-score, and PUR are used to evaluate the performance of the algorithms in comparison. Taking the aggregated ACC, for example, it is obtained by averaging the
averaged ACC achieved by an algorithm over different missing ratios. Our implementation is based
on MATLAB and is available at https://github.com/Zhangyi1231/GMM-with-Incomplete-Data.

4.2 Clustering Performance


Figures 1, 2, 3, and 4 present the ACC, NMI, PUR, and F-score, respectively, in comparison with the
above algorithms with different missing ratios on all datasets. We have the following observations:
(1) The proposed algorithm significantly and consistently outperforms existing two-stage imputation methods. For example, it exceeds the best two-stage imputation method (EM) by 0%, 21.2%,
6.0%, 16.7%, 14.8%, 17.3%, 20.6%, and 20.2% in terms of ACC, and 0%, 22.2%, 5.0%, 19.0%, 15.1%,
21.6%, 26.3%, and 22.4% in terms of NMI, with the variation of missing ratios in [0.1, 0.2, . . . , 0.7]
on the Seeds dataset. From Figures 3 and 4, we can observe that the trends of purity and F-score
indicators are also the same. These results validate the effectiveness of unifying the imputation
and clustering.
(2) Although the recently proposed algorithm, dynamically K-means filling clustering, achieves
fairly good performance, Gaussian mixture model clustering is able to handle more complex multi-
modal data and achieves better performance. For example, the proposed algorithm improves the


Fig. 2. NMI comparison with the variation of missing ratios on eight benchmark datasets.

Fig. 3. PUR comparison with the variation of missing ratios on eight benchmark datasets.

Fig. 4. F-score comparison with the variation of missing ratios on eight benchmark datasets.


Table 2. Aggregated ACC, NMI, F-score, and PUR Comparison (mean±std) of Different Algorithms on
Eight Benchmark Datasets

Dataset Mean Zero EM DK+Mean DK+Zero DK+EM Ours


ACC (%)
Iris 61.3 ± 1.5 67.3 ± 1.5 76.0 ± 2.5 68.6 ± 1.0 70.7 ± 1.2 76.0 ± 0.8 84.4 ± 2.2
AlcoholQCM 38.0 ± 0.3 40.1 ± 0.5 46.1 ± 0.4 39.2 ± 0.3 41.8 ± 0.4 44.8 ± 0.4 46.8 ± 0.4
Seeds 56.6 ± 3.6 53.9 ± 1.3 64.7 ± 1.5 69.2 ± 0.8 61.3 ± 1.0 67.9 ± 0.6 79.3 ± 1.0
Wine 58.0 ± 1.6 74.8 ± 0.8 81.8 ± 0.6 56.8 ± 0.9 70.0 ± 0.6 75.2 ± 0.7 87.0 ± 0.7
Segment 39.2 ± 0.5 40.2 ± 0.5 44.7 ± 0.7 41.0 ± 0.2 45.5 ± 0.4 49.9 ± 0.3 50.8 ± 0.3
ElectricalGrid 58.8 ± 0.4 68.2 ± 2.0 64.8 ± 2.0 58.0 ± 0.5 75.2 ± 0.7 77.5 ± 0.4 79.7 ± 0.6
Avila 25.5 ± 0.2 25.1 ± 0.3 23.9 ± 0.3 24.3 ± 0.2 24.1 ± 0.2 23.0 ± 0.3 29.1 ± 0.1
Letter 16.5 ± 0.2 16.4 ± 0.1 16.7 ± 0.2 19.0 ± 0.1 18.8 ± 0.1 18.9 ± 0.1 22.1 ± 0.1
NMI (%)
Iris 41.3 ± 1.2 47.2± 1.3 58.8 ± 2.7 43.0± 1.0 48.1 ± 1.6 58.0 ± 0.8 66.3 ± 2.6
AlcoholQCM 18.1 ± 0.4 22.8 ± 0.6 28.0 ± 0.5 20.4 ± 0.3 24.5 ± 0.5 28.5 ± 0.3 29.0 ± 0.3
Seeds 27.4 ± 4.9 21.1 ± 1.5 38.7 ± 1.7 38.6 ± 0.9 27.1 ± 1.2 36.6 ± 1.0 55.1 ± 1.3
Wine 29.9 ± 1.8 49.3 ± 1.3 59.8 ± 1.0 24.2 ± 1.1 41.4 ± 1.0 50.3 ± 0.8 65.5 ± 1.5
Segment 29.7 ± 0.6 31.8 ± 0.6 37.4 ± 1.0 31.6 ± 0.2 39.1 ± 0.3 45.7 ± 0.3 46.7 ± 0.5
ElectricalGrid 1.76 ± 0.2 14.6 ± 2.3 12.5 ± 2.6 1.76 ± 0.2 27.0 ± 0.5 31.0 ± 0.2 27.5 ± 0.7
Avila 8.78 ± 0.1 8.93 ± 0.2 8.71 ± 0.2 7.82 ± 0.1 7.61 ± 0.1 7.69 ± 0.1 10.92 ± 0.1
Letter 18.2 ± 0.2 18.4 ± 0.2 19.7 ± 0.3 21.6 ± 0.1 21.6 ± 0.1 22.6 ± 0.1 28.1± 0.1
F-Score (%)
Iris 65.8 ± 1.0 70.5 ± 1.1 78.0 ± 2.2 70.6 ± 0.7 73.1 ± 1.0 78.7 ± 0.5 84.8 ± 2.0
AlcoholQCM 40.3 ± 0.3 43.4 ± 0.5 46.6 ± 0.3 42.1 ± 0.3 44.4 ± 0.3 46.6 ± 0.3 47.2 ± 0.4
Seeds 59.7 ± 3.3 55.8 ± 1.1 67.0 ± 1.5 69.7 ± 0.7 62.5 ± 0.9 69.1 ± 0.7 80.3 ± 0.7
Wine 60.6 ± 1.2 75.8 ± 0.7 81.8 ± 0.7 58.2 ± 0.8 71.4 ± 0.6 76.4 ± 0.6 86.9 ± 0.7
Segment 43.8 ± 0.5 45.7 ± 0.5 49.8 ± 1.0 44.8 ± 0.2 51.5 ± 0.3 55.5 ± 0.3 57.8 ± 0.4
ElectricalGrid 60.6 ± 0.4 68.8 ± 1.6 65.9 ± 1.5 60.5 ± 0.3 75.8 ± 0.5 77.9 ± 0.3 79.9 ± 0.5
Avila 28.7 ± 0.1 28.3 ± 0.3 27.6 ± 0.3 28.5 ± 0.2 28.3 ± 0.2 26.9 ± 0.3 33.8 ± 0.1
Letter 20.4 ± 0.2 20.5 ± 0.1 20.8 ± 0.3 22.8 ± 0.1 22.6 ± 0.1 22.4 ± 0.1 27.0 ± 0.1
PUR (%)
Iris 62.1 ± 1.5 68.6 ± 1.3 76.4 ± 2.4 70.0 ± 0.8 72.3 ± 1.1 78.1 ± 0.6 84.6 ± 2.1
AlcoholQCM 39.9 ± 0.3 42.3 ± 0.5 46.8 ± 0.4 41.2 ± 0.3 44.0 ± 0.4 46.8 ± 0.3 47.4 ± 0.4
Seeds 57.0 ± 3.6 54.4 ± 1.3 65.4 ± 1.4 69.7 ± 0.7 62.0 ± 1.0 68.7 ± 0.7 79.9 ± 0.9
Wine 69.6 ± 1.3 78.3 ± 0.5 83.6 ± 0.5 70.7 ± 0.5 75.2 ± 0.5 77.8 ± 0.6 87.4 ± 0.7
Segment 40.5 ± 0.4 41.3 ± 0.5 45.7 ± 0.7 43.8 ± 0.2 48.2 ± 0.3 52.5 ± 0.2 51.8 ± 0.3
ElectricalGrid 64.1 ± 0.1 71.9 ± 1.3 70.9 ± 1.4 63.9 ± 0.1 76.5 ± 0.5 78.2 ± 0.3 79.7 ± 0.5
Avila 42.9 ± 0.1 43.0 ± 0.1 42.7 ± 0.1 43.1 ± 0.1 43.0 ± 0.1 43.0 ± 0.1 43.7 ± 0.0
Letter 17.7 ± 0.2 17.6 ± 0.1 17.9 ± 0.2 20.5 ± 0.1 20.1 ± 0.1 20.5 ± 0.1 23.6 ± 0.1

second-best method (DK+Mean) by 3.3%, 6.1%, 7.8%, 13.0%, 16.1%, 15.8%, 9.9%, and 9.3% in terms of
ACC, and 5.8%, 13.6%, 17.6%, 23.0%, 26.3%, 22.0%, 14.1%, and 10.1% in terms of NMI, with the variation of missing ratios in [0.1, 0.2, . . . , 0.7] on the Seeds dataset. It can be observed from Figures 3
and 4 that the trends of purity and F-score indicators are also the same. These results verify the
effectiveness of the Gaussian mixture model rather than the K-means method.
(3) When the missing ratio exceeds 40%, the effect of the existing two-stage imputation methods
drops significantly. However, the proposed algorithm has the best robustness compared to other


algorithms, and can maintain the best performance under the condition of an increasing missing
ratio.
(4) We also found a phenomenon worth considering: some of the proposed algorithm's clustering results are better when the missing ratio is larger than when it is lower, or even zero. For example, the ACC on the Segment dataset is about 4% higher when the missing rate reaches 30% than when it is zero, and the trends of the other indicators are the same. It seems that the algorithm performs better with incomplete data than with complete data, which is counter-intuitive. We repeated the experiment many times, verified that the code is correct, and confirmed the authenticity of the result. After investigation, we believe that the data themselves contain considerable noise and some natural outliers; treating some entries as missing and re-imputing them may weaken the influence of these noisy points, so the clustering result can be better than in the complete-data case. Based on this observation, in future research we can consider how to exploit this effect to deal with the noise problem in large-scale data.
We also report the aggregated performance metrics (ACC, NMI, F-score, and PUR) and the
standard deviation in Table 2, where the best results are shown in bold. Again, we observe that
the proposed algorithm almost always achieves the state of the art on each performance metric
on all eight benchmark datasets. For example, the proposed algorithm exceeds the second-best
one (DK+EM) by 8.4%, 11.4%, 11.8%, and 6.1% in terms of ACC, and by 8.3%, 18.5%, 15.2%, and 3.2%
in terms of NMI on the Iris, Seeds, Wine, and Avila datasets, respectively. Compared with the
traditional Gaussian mixture model clustering algorithm, the proposed algorithm's clustering performance also has huge advantages. For example, the proposed algorithm exceeds the GMM with
EM imputation (EM) by 8.4%, 14.6%, 5.2%, 6.1%, 14.9%, and 5.4% in terms of ACC; 6.8%, 13.3%, 5.1%,
8.0%, 14.0%, and 6.2% in terms of F-score; and by 8.2%, 14.5%, 3.8%, 6.1%, 8.8%, and 5.7% in terms of
PUR on the Iris, Seeds, Wine, ElectricalGrid, and Avila datasets, respectively. These results are consistent with our observations in Figures 1, 2, 3, and 4 and validate the effectiveness of the proposed
algorithm.

4.3 Advantages of Dynamically GMM Clustering


(1) The proposed algorithm adopts a Gaussian mixture model as the basic clustering model, which can handle complicated multi-modal data better than alternatives such as K-Means, Fuzzy C-Means, DBSCAN, and hierarchical clustering.
(2) The proposed algorithm unifies the imputation and clustering into a single optimization, and these two steps alternately negotiate with each other to reach an optimum. In this way, the proposed algorithm makes full use of the information in the missing entries, and the imputed data can best serve GMM clustering, resulting in better clustering performance.

4.4 Convergence
Our algorithm is outlined in Algorithm 1, where obj^(t) denotes the objective value at the tth iteration. At each iteration, the objective value is monotonically increased when optimizing one
variable with others fixed. At the same time, the EM algorithm is proven to converge when the
likelihood function is monotonically increased. As a result, the proposed algorithm is theoretically
guaranteed to converge to a local maximum. We also record the objective values with iterations
and plot them in Figure 5. As observed, the objective value does monotonically increase at each
iteration and it usually converges in fewer than 100 iterations.


Fig. 5. The objective value with iterations performed on Seeds and Letter datasets.

5 CONCLUSION
While the recently proposed dynamically K-means filling is able to handle incomplete-data clustering, its inability to handle complicated multi-modal data keeps it from many practical clustering tasks. This article proposed a dynamic filling algorithm, GMM clustering with incomplete data, which integrates the imputation and GMM clustering into a unified learning procedure in order to make full use of the information in the missing entries and better serve the clustering task. The proposed alternating algorithm effectively and efficiently solves the resultant optimization problem, and the significantly improved clustering performance is demonstrated via extensive experiments on UCI benchmark datasets. In the future, we plan to apply the one-stage framework to other clustering tasks and extend our algorithm to a more general framework, as well as to one that can handle noise through dynamic clustering.

REFERENCES
[1] Elie Aljalbout, Vladimir Golkov, Yawar Siddiqui, and Daniel Cremers. 2018. Clustering with deep learning: Taxonomy
and new methods. arXiv: Learning (2018).
[2] Marco Aste, Massimo Boninsegna, Antonino Freno, and Edmondo Trentin. 2015. Techniques for dealing with incomplete data: A tutorial and survey. Pattern Analysis and Applications 18, 1 (2015), 1–29.
[3] Steffen Bickel and Tobias Scheffer. 2004. Multi-view clustering. In Fourth IEEE International Conference on Data Mining
(ICDM’04). 19–26.
[4] Charles A. Bouman, Michael Shapiro, G. W. Cook, C. Brian Atkins, and Hui Cheng. 1997. Cluster: An Unsupervised
Algorithm for Modeling Gaussian Mixtures. Technical Report.
[5] Magalie Celton, Alain Malpertuy, Gaëlle Lelandais, and Alexandre G. De Brevern. 2010. Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. BMC Genomics
11, 1 (2010), 15.
[6] Liang Du, Peng Zhou, Lei Shi, Hanmo Wang, Mingyu Fan, Wenjian Wang, and Yi-Dong Shen. 2015. Robust multiple kernel k-means clustering using ℓ2,1-norm. In International Joint Conference on Artificial Intelligence (IJCAI'15). 3476–3482.
[7] Eduard Eiben, Robert Ganian, Iyad Kanj, Sebastian Ordyniak, and Stefan Szeider. 2019. On clustering incomplete data.
arXiv preprint arXiv:1911.01465 (2019).
[8] Payam Ezatpoor, Justin Zhan, Jimmy Ming-Tai Wu, and Carter Chiu. 2018. Finding top-k dominance on incomplete
big data using Mapreduce framework. IEEE Access 6 (2018), 7872–7887.
[9] Pedro J. García-Laencina, José-Luis Sancho-Gómez, Aníbal R. Figueiras-Vidal, and Michel Verleysen. 2009. K nearest
neighbours with mutual information for simultaneous classification and missing data imputation. Neurocomputing
72, 7–9 (2009), 1483–1493.
[10] Zoubin Ghahramani and Michael I. Jordan. 1994. Supervised learning from incomplete data via an EM approach. In
Advances in Neural Information Processing Systems. 120–127.
[11] Iffat A. Gheyas and Leslie S. Smith. 2010. A neural network-based framework for the reconstruction of incomplete
data sets. Neurocomputing 73, 16–18 (2010), 3039–3065.


[12] Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. 2017. Improved deep embedded clustering with local structure
preservation. In International Joint Conference on Artificial Intelligence (IJCAI’17). 1753–1759.
[13] John A. Hartigan. 1975. Clustering Algorithms. John Wiley & Sons, Inc.
[14] Shao-Yuan Li, Yuan Jiang, and Zhi-Hua Zhou. 2014. Partial multi-view clustering. In Association for the Advance of
Artificial Intelligence (AAAI’14). 1968–1974.
[15] Tianhao Li, Liyong Zhang, Wei Lu, Hui Hou, Xiaodong Liu, Witold Pedrycz, and Chongquan Zhong. 2017. Interval
kernel fuzzy C-means clustering of incomplete data. Neurocomputing 237 (2017), 316–331.
[16] Xinwang Liu, Yong Dou, Jianping Yin, Lei Wang, and En Zhu. 2016. Multiple kernel k-means clustering with matrix-
induced regularization. In Association for the Advance of Artificial Intelligence (AAAI’16). 1888–1894.
[17] Xinwang Liu, Wen Gao, Xinzhong Zhu, Miaomiao Li, Lei Wang, En Zhu, Tongliang Liu, Marius Kloft, Dinggang Shen,
and Jianping Yin. 2020. Multiple kernel k-means with incomplete kernels. IEEE Transactions on Pattern Analysis and
Machine Intelligence 42, 5 (2020), 1191–1204.
[18] Xinwang Liu, Lei Wang, Xinzhong Zhu, Miaomiao Li, En Zhu, Tongliang Liu, Li Liu, Yong Dou, and Jianping Yin.
2020. Absent multiple kernel learning algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 42,
6 (2020), 1303–1316.
[19] Volodymyr Melnykov and Ranjan Maitra. 2010. Finite mixture models and model-based clustering. Statistics Surveys
4 (2010), 80–116.
[20] Erxue Min, Xifeng Guo, Qiang Liu, Gen Zhang, Jianjing Cui, and Jun Long. 2018. A survey of clustering with deep
learning: From the perspective of network architecture. IEEE Access 6 (2018), 39501–39514.
[21] Chunfeng Song, Feng Liu, Yongzhen Huang, Liang Wang, and Tieniu Tan. 2013. Auto-encoder based data clustering.
In Iberoamerican Congress on Pattern Recognition (CIARP’13). 117–124.
[22] Anusua Trivedi, Piyush Rai, Hal Daumé III, and Scott L. DuVall. 2010. Multiview clustering with incomplete views.
In NIPS Workshop, Vol. 224.
[23] Siwei Wang, Miaomiao Li, Ning Hu, En Zhu, Jingtao Hu, Xinwang Liu, and Jianping Yin. 2019. K-means clustering
with incomplete data. IEEE Access 7 (2019), 69162–69171.
[24] Yang Wang, Xuemin Lin, Lin Wu, Wenjie Zhang, Qing Zhang, and Xiaodi Huang. 2015. Robust subspace clustering for
multi-view data by exploiting correlation consensus. IEEE Transactions on Image Processing 24, 11 (2015), 3939–3949.
[25] Y. Wang, L. Wu, X. Lin, and J. Gao. 2018. Multiview spectral clustering via structured low-rank matrix factorization.
IEEE Transactions on Neural Networks and Learning Systems 29, 10 (2018), 4833–4843.
[26] Lin Wu, Yang Wang, and Ling Shao. 2019. Cycle-consistent deep generative hashing for cross-modal retrieval. IEEE
Transactions on Image Processing 28, 4 (2019), 1602–1612.
[27] Cong-Hua Xie, Jin-Yi Chang, and Yong-Jun Liu. 2013. Estimating the number of components in Gaussian mixture
models adaptively for medical image. Optik 124, 23 (2013), 6216–6221.
[28] Junyuan Xie, Ross Girshick, and Ali Farhadi. 2016. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning (ICML'16). 478–487.
[29] Rui Xu and Donald Wunsch. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks 16, 3 (2005),
645–678.
[30] Shi Yu, Léon-Charles Tranchevent, Xinhai Liu, Wolfgang Glänzel, Johan A. K. Suykens, Bart De Moor, and Yves Moreau. 2012. Optimized data fusion for kernel k-means clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 5 (2012), 1031–1039.
[31] En Zhu, Sihang Zhou, Yueqing Wang, Jianping Yin, Miaomiao Li, Xinwang Liu, and Yong Dou. 2017. Optimal neighborhood kernel clustering with multiple kernels. In Association for the Advance of Artificial Intelligence (AAAI'17). 2266–2272.

Received April 2020; revised May 2020; accepted June 2020

