0% found this document useful (0 votes)
11 views13 pages

1-Learning Deep Generative Clustering Via Mutual Information Maximization

This document presents a novel approach to deep generative clustering called Information Maximization Deep Generative Clustering (IMDGC), which integrates mutual information maximization to enhance the separation of different clusters in data space. The authors theoretically justify their method and propose techniques to stabilize and improve performance, demonstrating that IMDGC significantly outperforms existing generative models on benchmark datasets. The research highlights the importance of combining generative and discriminative mechanisms to address the limitations of traditional clustering methods.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views13 pages

1-Learning Deep Generative Clustering Via Mutual Information Maximization

This document presents a novel approach to deep generative clustering called Information Maximization Deep Generative Clustering (IMDGC), which integrates mutual information maximization to enhance the separation of different clusters in data space. The authors theoretically justify their method and propose techniques to stabilize and improve performance, demonstrating that IMDGC significantly outperforms existing generative models on benchmark datasets. The research highlights the importance of combining generative and discriminative mechanisms to address the limitations of traditional clustering methods.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO.

9, SEPTEMBER 2023 6263

Learning Deep Generative Clustering via Mutual


Information Maximization
Xiaojiang Yang , Junchi Yan , Senior Member, IEEE, Yu Cheng , Member, IEEE, and Yizhe Zhang

Abstract— Deep clustering refers to joint representation learn- and (2) discriminative methods, such as maximum margin
ing and clustering using deep neural networks. Existing methods clustering (MMC) [5] and regularized information maximiza-
can be mainly categorized into two types: discriminative and gen- tion (RIM) [6], which group data directly using some objec-
erative methods. The former learns representations for clustering
with discriminative mechanisms directly, and the latter estimate tives [7]. These approaches have achieved some success in
the latent distribution of each cluster for generating data points different real-world tasks, but it is commonly believed that
and then infers cluster assignments. Although generative methods they have difficulty in learning hidden, hierarchical dependen-
have the advantage of estimating the latent distributions of cies in latent spaces [8], as they are based on local relations
clusters, their performances still significantly fall behind discrim- in the data space.
inative methods. In this work, we argue that this performance gap
might be partly due to the overlap of data distribution of different Consequently, recent efforts for clustering tend to move
clusters. In fact, there is little guarantee of generative methods to toward performing clustering on latent space, namely, deep
separate the distributions of different clusters in the data space. clustering. New algorithms have been proposed and achieved
To tackle these problems, we theoretically prove that mutual promising results. In the line of deep discriminative clustering,
information maximization promotes the separation of different deep embedded clustering (DEC) [9] iteratively encodes data
clusters in the data space, which provides a theoretical justifi-
cation for deep generative clustering with mutual information into feature space and improves the confidence for assign-
maximization. Our theoretical analysis directly leads to a model ments. For deep generative clustering, variants of variational
which integrates a hierarchical generative adversarial network autoencoder (VAE) [10], [11], such as Gaussian mixture vari-
and mutual information maximization. Moreover, we further ational autoencoder (GMVAE) [8] and VaDE [12], are devised
propose three techniques and empirically show their effects to to estimate the latent distribution by a mixture of Gaussians.
stabilize and enhance the model. The proposed approach notably
outperforms other generative models for deep clustering on Although generative methods for deep clustering have the
public benchmarks. advantage of estimating the latent distributions of clusters,
their performances are still notably worse than the state-of-
Index Terms— Deep generative clustering, generative adversar-
ial networks (GANs), mutual information maximization, varia- the-art deep discriminative methods [13], [14]. In this article,
tional autoencoders (VAEs). we highlight that such a performance gap might be partly
caused by the reason that there is no sufficient discriminative
I. I NTRODUCTION power for these generative methods to separate the data distri-
bution of different clusters. Furthermore, this induces another

C OMPARED with classification in supervised learn-


ing [1]–[3], clustering aims to group data points into
different categories in an unsupervised manner. Existing works
vital problem: there is little guarantee of generative models to
separate the distributions of different clusters in the data space.
To tackle these problems, we propose to integrate discrim-
can be generally divided into two main categories: (1) genera- inative mechanisms into deep generative clustering, which
tive methods, such as Gaussian mixture model (GMM) [4], explores the direction for combining generative and discrimi-
which tries to estimate the data distribution with several native methods in deep clustering. The hope is that introducing
Gaussians and then groups data by the mixture of Gaussians a discriminative mechanism can enhance the discriminative
Manuscript received 16 July 2020; revised 27 December 2020, 7 July power of deep generative methods. Motivated by this, we argue
2021, and 12 October 2021; accepted 11 December 2021. Date of publication that mutual information maximization between data and clus-
4 January 2022; date of current version 1 September 2023. The work of Xiao- ter code promotes the separation of different clusters in the
jiang Yang and Junchi Yan was supported in part by the National Key Research
and Development Program of China under Grant 2018AAA0100704; in part data space, which provides a theoretical justification for deep
by the Shanghai Municipal Science and Technology Major Project under Grant generative clustering with mutual information maximization.
2021SHZDZX0102; and in part by NSFC under Grant 61972250 and Grant Based on the theoretical analysis, we propose a novel
72061127003. (Corresponding author: Junchi Yan.)
Xiaojiang Yang and Junchi Yan are with the Department of Computer Sci- method for deep clustering, named Information Maximization
ence and Engineering, and the MoE Key Laboratory of Artificial Intelligence, Deep Generative Clustering (IMDGC). In this method, a hier-
AI Institute, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: archical generative adversarial network (GAN) [15] with flex-
[email protected]; [email protected]).
Yu Cheng and Yizhe Zhang are with the Microsoft Research Laboratory, ible prior in latent space is utilized for synthesizing data,
Redmond, WA 98052 USA (e-mail: [email protected]; yizhe.zhang@ and a hierarchical inference model is used to estimate the
microsoft.com). posterior of cluster code for mutual information maximization
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/TNNLS.2021.3135375. and cluster assignments. Specifically, the flexible prior is
Digital Object Identifier 10.1109/TNNLS.2021.3135375 a trainable mixture of Gaussians, and each Gaussian is
2162-237X © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on August 28,2024 at 16:51:42 UTC from IEEE Xplore. Restrictions apply.
6264 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 9, SEPTEMBER 2023

generated by a value of cluster code. As hierarchical GAN is other classical discriminative methods [23], [27] are limited
often hard to train, some techniques are devised in this article: in the data space and, hence, are unable to capture the hidden
matching the prior of the generative model with the estimated dependency. DEC [9] opens the avenue of deep clustering.
latent distribution of true data with another GAN, maximizing It iteratively encodes data into feature space and improves
the mutual information between latent variables and cluster confidence for assignment. Information maximizing self aug-
code, and concatenating continuous codes to cluster code in mented training (IMSAT) [13] maximizes the information-
a generative model. These techniques show empirical success theoretic dependency between data and their assignments in an
in stabilizing and enhancing our methods on deep clustering, end-to-end model. DEPICT [14] stacks a multinomial logistic
as shown in our experiments. regression function on an autoencoder to predict assignment
We evaluate our method on three public benchmarks as and jointly minimizes the relative entropy.
widely used in deep generative clustering: Mixed National
Institute of Standards and Technology database (MNIST) [16], B. Deep Generative Clustering
Fashion-MNIST [17], and self-taught learning image database
Some deep generative methods are proposed to model the
(STL) [18], and compare it with the state-of-the-art deep
latent distribution of true data for deep clustering, including
generative clustering methods. Our method notably outper-
GMVAE [8], VaDE [12], and ClusterGAN [28]. GMVAE and
forms them in terms of accuracy (ACC) and normalized
VaDE exploit a Gaussian mixture distribution to estimate the
mutual information (NMI). The contributions of this article
latent distribution within VAE [10] framework. ClusterGAN
are summarized as follows.
decomposes the latent space into a categorical one and a
1) We theoretically show that mutual information maxi- continuous one to learn the assignments directly within a
mization between data and cluster code promotes the bidirectional GAN. These methods employ no discriminative
separation of different clusters in the data space, which mechanism to improve clustering. Another method closely
provides a theoretical justification to deep generative related to our model is clustering with adversarial autoen-
clustering with mutual information maximization. coders [29] or BiGAN [30], [31], which encodes data into
2) Based on the theoretical analysis, we propose to inte- a fixed Gaussian mixture prior. Compared with this, in our
grate discriminative mechanism into deep generative model, the prior is more flexible, and the optimal prior is
clustering, which can enhance the discrimination of obtained by learning.
latent distribution in the generative model for deep In addition, there are (generative) models tailored for clus-
clustering. Motivated by this, we devise a novel model, tering of other forms of data, e.g., event sequences [32] and
namely IMDGC. graph [33], which are beyond the scope of this article.
3) We propose three techniques to stabilize and enhance
the proposed model and empirically demonstrate their
mechanisms and effects. C. Differences to InfoGAN
4) Experimental results show that IMDGC outperforms One closely related work to ours is InfoGAN [25], which
the state-of-the-art deep generative clustering models can be viewed as a method for integrating mutual infor-
on popular datasets for generative clustering, including mation maximization into generative clustering. Specifically,
MNIST, Fashion-MNIST, and STL. an InfoGAN estimates the data distribution with a GAN, and
This article is organized as follows. Section II introduces meanwhile, maximizes the mutual information between data
preliminary background on related works. Theoretical analysis and codes (both cluster code and continuous codes), as shown
is given in Section III. Section IV presents the proposed in Fig. 1(a). The main difference between InfoGAN and
approach and the empirical study is performed in Section V. our method is the generating process. In InfoGAN, data are
A further discussion is given in Section VI. Section VII synthesized directly from codes. In our method, however, the
concludes this article. generative process is hierarchical: flexible priors of clusters are
first generated from cluster code, and then data are generated
II. R ELATED W ORK from the prior. This hierarchical structure enables our method
to achieve clustering in a natural way. Specifically, it can
As a long-standing problem, there exist multiple lines of
separate different clusters by low-density regions in the data
works for clustering, e.g., subspace clustering [19], adversarial
space due to the combined action of its hierarchical structure
clustering [20], [21], discriminative clustering [22], [23], gen-
and mutual information maximization (see Section III). As it
erative clustering [4], [7], [20], [24], and so on. As our method
is believed that different classes of data are separated by low-
lies on improving deep generative clustering with mutual
density regions in the data space [34], our method probably
information maximization, we review related works about
provides a natural solution to clustering. However, InfoGAN
discriminative clustering by mutual information maximization,
has no such desirable property, especially when it simply
deep generative clustering, and specifically a closely related
maximizes the mutual information between data and cluster
model InfoGAN [25].
code (see Section V-C.1). Specifically, even when the mutual
information term of InfoGAN is well maximized, different
A. Discriminative Clustering clusters can be connected in the data space. Consider a case
In this part, we mainly focus on those relevant to mutual of InfoGAN on MNIST, for example, when its cluster code
information maximization. RIM [6], Infomax [22], [26], and corresponds to the width of digits, the mutual information

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on August 28,2024 at 16:51:42 UTC from IEEE Xplore. Restrictions apply.
YANG et al.: LEARNING DEEP GENERATIVE CLUSTERING VIA MUTUAL INFORMATION MAXIMIZATION 6265

c, latent variables z and data x form a Markov chain. Based on


this, we characterize the generative model for deep clustering
by a joint density as follows:
p(c, z, x) = p(x|z) p(z|c) p(c) (1)
where p(c) is fixed and c is a 1-D discrete code in
{c(1) , . . . , c(K ) }.
For clarity, next, we introduce the concepts of mutual
information. The mutual information between two stochastic
variables can be interpreted as the reduction of uncertainty in
one variable when the other is given
I (c; x) = −H (c|x) + H (c) (2)
where the entropy H (c) = −E p(c) [log p(c)] and the con-
ditional entropy H (c|x) = −E p(c,x) [log p(c|x)] measure the
uncertainty of the corresponding stochastic variables.
Based on the concepts earlier, we can bridge the change of
prior p(z|c(i) ) of cluster c(i) with the change of the mutual
information I (c; x) via mathematical variational method. This
result is based on an assumption: p(x) is fixed. This assump-
tion is reasonable. As in generative models, the generated
distribution p(x) is pushed to match the true data distribution,
and we can assume the p(x) has perfectly matched the true
data distribution, and hence, is fixed.
Proposition 1 (Variation of MI): Suppose (c, z, x) ∼
p(x|z) p(z|c) p(c), where c is a discrete variable in
{c(1) , . . . , c(K ) }, and p(c) and p(x) are fixed. If p(z|c(i) )
changes to pδ (z|c(i) ) = p(z|c(i) ) + δp(z|c(i)), where δp(z|c(i) )
is a perturbative function which satisfies δp(z|c(i) )dz = 0,
then the variation of I (c; x) [see (2)] is given by
Fig. 1. Structures of (a) InfoGAN [25], (b) vanilla IMDGC, and (c) proposed
techniques. {G z , G x }, {E x , E z }, and {Dx , Dz } are generators, encoders, and
  
δ I (c; x) p z|c(i)
discriminators, respectively. Vx and Vz the adversarial objectives, and I , Ix ,         
and Iz represents the mutual information terms I (c, c ; x), I (c; x), and I (c; z), ∝ DKL pδ x|c(i)  p(x) − DKL pδ x|c(i)  p x|c(i) (3)
respectively. Random variables c, c , z, and x are cluster code, continuous 
codes, latent variables, and data, respectively. In (c), thick lines, darker round where pδ (x|c(i) ) = p(x|z) pδ (z|c(i) )dz.
nodes, and darker square node represent the additional processes, outputs
(Iz and Vz ) and network (Dz ) for the proposed techniques. Compare the
The proof details are given in Appendix A.
structures of InfoGAN [see (a)] and IMDGC [see (b) or (c)], we can see that Remark: This proposition has an intuitive interpretation.
there are some vital differences between them: 1) InfoGAN maximizes the We begin our analysis by considering the meanings of the
information terms about both cluster code and continuous codes I (c, c ; x),
while IMDGC simply maximizes the information term about cluster code
two KL terms on the right-hand side of (3). Note that the
I (c; x). 2) Compared with InfoGAN, IMDGC is a hierarchical model, which first KL  term measures the divergence between pδ (x|c(i) ) and
brings it some theoretical justifications and flexibility to be integrated with p(x) = Kj=1 p(x|c( j )) p(c( j )), and the second one measures
other useful techniques, as shown in Sections III and IV.
the divergence between pδ (x|c(i) ) and p(x|c(i) ). Therefore, the
subtraction of the two KL terms is an approximation of
the divergence between pδ (x|c(i) ) with distributions of other
between cluster code and data can be still well maximized, but ( j) ( j)
different clusters are obviously connected in the data space.
clusters j  =i p(x|c ) p(c ). This means that maximizing
I (c; x) in terms of priors of clusters is equivalent to the
In addition, since our method includes flexible prior, many
maximization of divergence between the distribution of each
techniques can be involved to improve its performance. These
cluster with the distributions of other clusters in the data
ideas will be discussed in detail in Section IV.
space.
Therefore, mutual information maximization in deep gen-
III. T HEORETICAL A NALYSIS erative methods between data and cluster code promotes the
In this section, we investigate the effect of mutual infor- separation of different clusters in the data space. Furthermore,
mation maximization between data and cluster code in a if the supports of data distributions of different clusters have
hierarchical model. Specifically, we mainly theoretically show overlapped, then different clusters must be separated by low-
that it promotes the separation of different clusters in the data density regions in the data space. This conclusion provides
space. a theoretical justification for deep generative clustering and
First, we make a moderate assumption, which has been motivates us to integrate mutual information maximization
widely adopted in the literature [9], [28], that the cluster code with a generative method for deep clustering.

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on August 28,2024 at 16:51:42 UTC from IEEE Xplore. Restrictions apply.
6266 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 9, SEPTEMBER 2023

IV. P ROPOSED A PPROACH With the approximate posterior q(c|x), the mutual informa-
In this section, we propose IMDGC, by first introducing tion I (c; x) can be estimated as [25]
 
its vanilla model structure and objective. Then, we present I (c; x) ≈ E(c, x̃)∼ p(c) p(x|c) log q(c| x̃) + H (c) (11)
three techniques to stabilize and enhance the proposed model:
1) matching the prior in the generative model with the where H (c) is a constant, as p(c) is given in this work.
estimated latent distribution of true data with another GAN; Combining the adversarial objective with the mutual infor-
2) maximizing the mutual information between latent variables mation term, the objective of IMDGC is
and cluster code; and 3) concatenating continuous codes to min max VIMDGC = Vx − λ1 I (c; x) (12)
cluster code in a generative model. These tricks are easy G,E Dx

to be integrated into our method due to the flexibility prior where λ1 is the balancing hyperparammeter of I (c; x), and
introduced in our method. G = {G z , G x }, E = {E x , E z }.

A. Information Maximization Deep Generative Clustering B. Matching the Prior With Latent Distribution of True Data
The basic structure of the proposed model (IMDGC) is It has been reported that without constraints on the gener-
shown in Fig. 1(b). Specifically, IMDGC adopts a hierarchical ating process, a hierarchical GAN is hard to train and its gen-
generative model p(c, z, x) = p(c) p(z|c) p(x|z) to synthe- erated images are of low quality [35]. We empirically observe
size data and a hierarchical inference model q(c, z|x) = that this also occurs on the vanilla IMDGC. This phenomenon
q(z|x)q(c|z) for estimating I (c; x). The specific generative might be harmful to the clustering performance of IMDGC,
and inference models are described as follows. as a powerful generative clustering model should successfully
In the generative model, we denote samples from p(c, z, x) generate images of each cluster. Otherwise, the generative
as (c, z, x̃), and set p(z|c) = N (z|μ(c), diag(σ 2 (c))) and clustering model cannot learn the true data distribution well,
p(x|z) = δ(x − G x (z)), where G x (·) denotes the generator and hence, cannot reasonably separate the true distribution for
for x̃. Specifically clustering.
To deal with the problem earlier, we propose to constrain
c ∼ p(c) = Cat(c|K ) (4) the generating process of IMDGC, by matching the prior with
 
μ(c), σ 2 (c) = G z (c) (5) the latent distribution of true data. Specifically, we introduce
   another GAN to achieve this in three steps.
z ∼ p(z|c) = N z|μ(c), diag σ 2 (c) (6)
1) Encode true data x by E x (·) into latent variables ẑ.
x̃ = G x (z) (7)
2) Train a discriminator Dz (·) to distinguish z and ẑ.
where Cat(c|K ) denotes the uniform categorical distribution 3) Train the generators and encoders to match z with ẑ.
over K categories. In (6), z is sampled through reparameter- The adversarial objective here is
ization [10] for back propagation, i.e., z = μ(c) + σ (c)  ε,    
Vz = Ez∼ p(z) log Dz (z) + E ẑ∼q(z) log(1 − Dz (ẑ)) (13)
ε ∼ N (ε|0, I ), where  is pointwise product. Note that the 
parameters of each Gaussian, μ(·) and σ (·), are outputs of where q(z) = pd (x)q(z|x)dx is the latent distribution of true
generator G z (·), and hence, the priors of clusters are flexible. data in the inference model.
The generative model has trained adversarially with a Therefore, the objective of IMDGC becomes
discriminator Dx (·) to synthesize samples. The adversarial
min max VIMDGC = Vx + Vz − λ1 I (c; x) (14)
objective is denoted as Vx . In this article, we simply use G,E D
the objective of vanilla GAN [15] without loss of generality where D = {Dx , Dz }.
   
Vx = E x∼ pd (x) log Dx (x) + E x̃∼ p(x) log(1 − Dx ( x̃)) (8)
C. Maximizing Mutual Information Between Latent
where pd (x) denotes the true data distribution. Variables and Cluster Code
To estimate the mutual information I (c; x), the posterior
We empirically observe that the maximization of mutual
p(c|x) is necessary. Following InfoGAN [25], we introduce
information between data and cluster code I (c; x) is unstable
an auxiliary inference model q(c|x) to approximate p(c|x).
in experiments. This might be caused by the vanishing gradient
Different withInfoGAN, here, we adopt a hierarchical struc-
problem of the mutual information term I (c; x), as the network
ture q(c|x) = q(c|z)q(z|x)dz, and set q(z|x) = δ(z−E x (x)),
to estimate I (c; x) includes two generators and two encoders,
where E x (·) is the encoder to encode x. This setting seems
which is too deep for the maximization of I (c; x).
redundant, but it brings flexibility to constrain the prior, as will
To stabilize the maximization of I (c; x), we propose to
be shown in Sections IV-B and IV-C.
meanwhile maximize the mutual information between cluster
Given a synthesized sample x̃, the inference process is
code and latent variables I (c; z). This can be viewed as a way
z̃ = E x ( x̃) (9) to “guide” the maximization of I (c; x). To prove this, we give
q(c| x̃) = E z (z̃) (10) another proposition as follows, and the proof details are given
in Appendix A.
where E z (·) is the encoder to encode z, and its output unit is Proposition 2: Suppose (c, z) ∼ p(c, z), where c is a
Softmax, and hence, its output is a distribution of c. discrete variable in {c(1) , . . . , c(K ) }. If p(z|c(i) ) changes to

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on August 28,2024 at 16:51:42 UTC from IEEE Xplore. Restrictions apply.
YANG et al.: LEARNING DEEP GENERATIVE CLUSTERING VIA MUTUAL INFORMATION MAXIMIZATION 6267

pδ (z|c(i) ) = p(z|c(i) ) + δp(z|c(i) ), where δp(z|c(i) ) is a per- In other words, although the continuous codes are utilized
turbative function which satisfies δp(z|c(i) )dz = 0, then the to improve the flexibility of prior, their reconstructions are
variation of I (c; z) is given by not inferred in the inference process. Therefore, involving
   continuous codes merely increases a few parameters of G z (·),
δ I (c; z) p z|c(i) and hence, has ignorable cost in IMDGC.
        
∝ DKL pδ z|c(i)  p(z) − DKL pδ z|c(i)  p z|c(i) . (15)
V. E XPERIMENTS
Similar to the discussion about Proposition 1, we can conclude
In this section, we first introduce the protocol of our exper-
that the maximization of I (c; z) promotes separation of the
iments, including the choice and preprocessing of datasets,
priors of different clusters, which intuitively helps separate dif-
the metrics of clustering, and the architectures of networks
ferent clusters in the data space. Note that maximizing I (c; z)
and hyperparameters. Then, we report the results of verifying
cannot directly separate the distributions of different clusters,
experiments, aiming to verify the effects of the three proposed
and hence, it cannot replace the maximization of I (c; x).
techniques. After that, we report the main part of this section:
For simplicity and efficiency, we use the encoder E z (·) to
the clustering performance of IMDGC compared with other
approximate the posterior p(c|z) to estimate I (c; z), without
generative methods of deep clustering. Moreover, to show
involving an auxiliary network
that each component of IMDGC contributes to the clustering
 
I (c; z) ≈ E(c,z)∼ p(c) p(z|c) log q(c|z) + H (c). (16) performance, we present the results of ablation experiments on
clustering. Finally, we show the results of sensitivity analysis
Finally, the objective of IMDGC is of three vital hyperparameters in IMDGC: the weight λ1 , λ2 ,
min max VIMDGC = Vx + Vz − λ1 I (c; x) − λ2 I (c; z) (17) and the number of continuous codes J .
G,E D

where λ2 is the balancing weight of I (c; z), which is usually A. Protocol


set smaller than λ1 in our experiments as it is merely an 1) Datasets: We evaluate our method on MNIST [16],
auxiliary tool for the maximization of I (c; x). Fashion-MNIST [17], and STL [18]. For MNIST and Fashion,
we use the default split for training and testing.
These datasets have been widely used in deep generative
D. Involving Continuous Codes
clustering [9], [12], [28]. Note that deep generative clustering
Although setting the prior as a mixture of Gaussian has been models currently have inherent difficulty on those more chal-
widely employed in generative methods of deep clustering lenging datasets, e.g., CIFAR-10 [37] with a complex back-
[4], [8], [12], a more flexible prior might further improve the ground, especially for limited computing resources. As they
clustering performance. In fact, our theoretical analysis can are trained for image generation meanwhile, not clustering
be applied to a much more complex prior than a mixture of alone.
Gaussians, and hence, using a much more flexible prior still 1) MNIST: A dataset of handwritten digits ranging
coincides with our theory. Motivated by these considerations, from 0 to 9, containing 60 000 training and 10 000 testing
we propose to incorporate continuous codes to improve the samples, each being a 28 × 28 gray-scale image.
flexibility of prior. 2) Fashion-MNIST: A dataset consists of ten types of
First, we introduce a set of continuous codes c ∈ R J , and fashion products with the same number and same shape
set their densities as follows: of images as MNIST, but it is much more complicated.
cj ∼ U(−1, 1) ∀ j ∈ [ J ] (18) 3) STL: A dataset of ten classes of 96 × 96 color Images.
Following [9], [12], and [13], for avoiding the dis-
where U(−1, 1) is a uniform density on (−1, 1). Then, turbance of background in raw images, we apply a
we concatenate them with the cluster code c as the input of pretrained VGG16 [38] network to extract 4096-dim
G z (·). Hence, the prior of each cluster becomes features and then use them for clustering. Note that the
      VGG16 is pretrained for classification on ImageNet [39],
p z|c(i) = p z|c(i) , c p c dc ∀i ∈ [K ] (19) and therefore, the extracted features are easier for
clustering.
where p(z|c(i) , c ) = N (z|μ(c, c ), diag(σ 2 (c, c ))), and 2) Metrics of Clustering: We use two popular clustering
(μ(c, c ), diag(σ 2 (c, c ))) = G z (c, c ). In such a setting, metrics: ACC [9] and NMI [40], and we explain their defini-
a sample from the prior of cluster c(i) is generated by tions as follows.
z = μ(c(i) , c ) + σ (c(i) , c )  ε, where c ∼ U(−1, 1) J and The ACC metric is to find the best one-to-one mapping
ε ∼ N (ε|0, I). between cluster assignments and ground-truth labels, and then
After involving continuous codes, the prior of each cluster calculate the rate of right assignments in the best mapping
p(z|c(i) ) becomes a universal approximator, which can approx- N (n)
 
imate any distribution of independent latent variables with n=1 1 t = m c(n)
ACC = max (20)
arbitrary precision [36]. Therefore, theoretically, involving m N
continuous code improves the flexibility of prior. where t (n) is ground truth label and c(n) is cluster assignment
Finally, we highlight that in IMDGC, the mutual infor- of x (n) , and m is one-to-one mapping. The best mapping can
mation terms about continuous codes I (c ; x) are not used. be computed by the Hungarian algorithm [41].

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on August 28,2024 at 16:51:42 UTC from IEEE Xplore. Restrictions apply.
6268 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 9, SEPTEMBER 2023

The NMI metric is to calculate the mutual information


between cluster assignment variable and ground truth label
variable, and then the results are normalized with the arith-
metic mean of entropy

2I (t; c)
NMI = (21)
H (t) + H (c)
where t is the ground truth label variable and c is the cluster
assignment variable.
3) Architectures and Hyperparameters: In all experiments,
the hyperparameters of the optimizer are the same. For opti-
mization, we use Adam [42] with a learning rate of 0.0002,
and β1 = 0.5 and β1 = 0.999. The architectures of networks
used in our experiments are shown in Appendix tables. As for
the architectures of networks, an architecture is employed on
both MNIST and Fashion-MNIST, while another with spectral
normalization is employed on STL. The specific architectures
are reported in Appendix B. Other hyperparameters, including
λ1 , λ2 , and J , are reported before showing experimental results
in Sections V-B–V-D. In addition, the batch size is set as 100,
the number of clusters is set as K = 10, and the number of
latent variables is set as d = 64 on both MNIST and Fashion-
MNIST and d = 128 on STL.
Fig. 2. Generated images on MNIST and Fashion-MNIST with Dz (·) or not.
Each row represents generated images with fixed cluster code and random
sampled continuous code. (a) MNIST, without Dz . (b) MNIST, with Dz .
B. Verifying Experiments and Results (c) Fashion-MNIST, without Dz . (d) Fashion-MNIST, with Dz .

In this section, we introduce the experiments to verify the


effects of the three proposed techniques: matching the prior
with the latent distribution of true data, maximizing the mutual 2) Effects of Involving I (c; z): Here, we aim at verifying our
information between latent variables and cluster code, and claim for involving I (c; z), i.e., it can stabilize maximizing
involving continuous code. For simplicity, here, these three I (c; x) and help reducing overlap among priors of different
tricks are briefly named involving Dz (·), involving I (c; z), clusters. We set λ1 = 1.0, λ2 = 0.1, 0.01 or 0, and J = 3.
and involving c , respectively. In this section, the models are First, we introduce our estimation of I (c; x) and overlap in
trained by 100 epochs on MNIST. experiments. To estimate I (c; x), we simply use its variational
1) Effects of Involving Dz ( · ): We have claimed that involv- approximation introduced by (11). Specifically, for efficiency,
ing Dz (·) can improve the quality of generated images in we sample 100 cluster codes and then concatenate them with
Section IV. To prove this, we contrast the generated images by 100 randomly sampled continuous codes. We use this batch
IMDGC with and without Dz (·) on both MNIST and Fashion. of codes to synthesize data, and finally, encode them into
In experiments, we set λ1 = 1, λ2 = 0.1, and J = 3. distributions of cluster code by encoders. In this way, we can
As shown in Fig. 2, some generated images by IMDGC obtain the estimation of negative conditional entropy via
without Dz (·) are weird, e.g., the fifth and the ninth images Monte Carlo sampling. Note that in (11), H (c) ≈ 2.3 when
in the first row of Fig. 2(a), and the second and the seventh K = 10, and hence, there is no need to estimate this term.
images in the first column of Fig. 2(c). On the contrary, all the To estimate the overlap among priors, we calculate the inner
generated images of IMDGC with Dz (·) have high perceptual product of each pair of Gaussians and then normalize it by
quality. These results show that involving Dz (·) does improve logarithm and finally sum it up. This metric has a closed form
the quality of generated images by IMDGC. for finite Gaussians
Furthermore, we also observe that for IMDGC without K
 
Dz (·), many images generated from the same cluster are log N (z|μk ; k )N z|μ j ;  j dz
visually from two or more distinct classes [e.g., the third row k=1 j  =k
of Fig. 2(a) contains digits “8” and “5,” and the fourth row of
1   −1  
K
Fig. 2(c) contains “bag” and “trouser”], which suggests that = − μk − μ j k +  j μk − μ j
the model cannot assign the images from these classes into dif- k=1 j  =k
2
ferent clusters correctly. For IMDGC with Dz (·), however, the  1/2 d
images generated from the same cluster are visually similar. − log det k +  j − log(2π) (22)
2
This phenomenon indicates that involving Dz (·) also improves
the clustering performance of IMDGC, which coincides with where d is the dimensionality of z. The specific estimating
our discussion in Section IV-B. process of overlap is similar with I (c; x). The difference is

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on August 28,2024 at 16:51:42 UTC from IEEE Xplore. Restrictions apply.
YANG et al.: LEARNING DEEP GENERATIVE CLUSTERING VIA MUTUAL INFORMATION MAXIMIZATION 6269

Fig. 3. Mutual information term I (c; x) and overlap among priors of different
clusters on MNIST over training epochs by λ2 = 0.1, 0.01, 0. The lines are
the mean of five trials, and the error bands represent the standard deviations.
(a) I (c; x). (b) Overlap among priors.

Fig. 5. Generated images on MNIST and Fashion-MNIST, under different J .


Each row represents generated images with fixed cluster code and random
sampled continuous code. (a) MNIST, J = 0. (b) MNIST, J = 3. (c) Fashion-
MNIST, J = 0. (d) Fashion-MNIST, J = 3.

Fig. 4. Non-Gaussianity of priors of clusters as calculated by (23) on MNIST cluster code c and then average the non-Gaussianities of ten
over training epochs as the number of continuous codes J varies. The lines clusters. As shown in Fig. 4, in the training process, the
are the average values of five trials, and the error bands represent the standard non-Gaussianity of IMDGC with continuous codes overall
deviations.
increases. This result demonstrates that the model indeed
utilizes the flexibility brought by continuous codes, which
that the batch size here is set as 10, and each cluster code has coincides with our expectation of this technique.
only one sample. Besides, we empirically observe another merit of involving
As shown in Fig. 3, the maximization of I (c; x) becomes continuous codes. As shown in Fig. 5(c), for the generated
stable and efficient when maximizing I (c; z) in IMDGC images on Fashion-MNIST by IMDGC without continuous
meanwhile. Otherwise, it is extremely unstable and inefficient. codes, we encounter the mode collapse, where some classes
Besides, maximizing I (c; z) also contributes to reducing over- of images like “bags” are missing, and the generated images
lap among priors of different clusters. These results testify to from the same cluster are almost identical. Compared with this,
our motivation for maximizing I (c; z) in IMDGC. the generated images by IMDGC with continuous codes are
3) Effects of Involving c : To verify if the model can utilize diverse, as shown in Fig. 5(b) and (d). These results reveal
the flexibility brought by continuous codes c , we design that involving continuous codes can prevent IMDGC from
a metric to measure the non-Gaussianity of prior of each mode collapse. This is reasonable, as involving continuous
cluster. We believe the non-Gaussianity can reflect the usage codes improves the non-Gaussianity of prior of each cluster
of flexibility of prior. For given cluster code c, the non- according to the discussion earlier, and hence, it can prevent
Gaussianity is obtained by the following optimization: the prior from mode collapse.

      
min DKL p z| c, c N z|μ, diag σ 2 d c (23)
μ,σ C. Unsupervised Clustering
where p(z|c, c ) is a tractable Gaussian in IMDGC, and the In this section, we compare the clustering performance of
integral can be estimated by Monte Carlo sampling. In the IMDGC with other deep generative methods and InfoGAN to
experiment, we sample ten c for each cluster code c to show that IMDGC can improve the clustering performance of
estimate the integral in (23) and average the non-Gaussianities deep generative methods. Besides, we also show the generated
of ten clusters. images on MNIST and Fashion-MNIST by IMDGC when
In experiments, we train on MNIST and Fashion-MNIST, the cluster code is fixed. Finally, we report the results of
and set λ1 = 1.0, λ2 = 0.1, and J = 0, 1, 2, or 3. ablation experiments on MNIST, to show that each component
To estimate the integral in (23), we sample ten c for each of IMDGC indeed improves its clustering performance.

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on August 28,2024 at 16:51:42 UTC from IEEE Xplore. Restrictions apply.
6270 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 9, SEPTEMBER 2023

TABLE I
AVERAGE AND S TANDARD D EVIATION OF A CCURACY (ACC) (%) AND NMI OF F IVE T RIALS .
S TANDARD D EVIATIONS A RE R EPORTED IN THE ROUND B RACKETS

TABLE II
B EST A CCURACY (ACC) (%) AND NMI OF F IVE T RIALS

First, we introduce our experimental settings. We train the TABLE III


models on the training set, and then test them on the testing set AVERAGE A CCURACY (ACC) (%) AND NMI OF F IVE T RIALS ON
MNIST. S TANDARD D EVIATION I S G IVEN IN B RACKET
and report the test accuracy. Every model is trained and tested
on the datasets in five trials. The hyperparameters of IMDGC
in clustering experiments are set as λ1 = 1.0, λ2 = 0.1, and
J = 3 on MNIST and Fashion-MNIST, and λ1 = λ2 = 1.0,
and J = 10 on STL. These hyperparameters are selected by
fivefold cross validation.
Then, we demonstrate the process of cluster assignment.
We assign the testing data to the cluster code with the highest
probability. Specifically, the testing sample x is encoded into
ẑ and then is mapped to the distribution of cluster code q(c|ẑ).
Finally, we assign the sample to arg maxc q(c|ẑ). of mutual information maximization. While compared with
1) Clustering Performance: We compare IMDGC with InfoGAN, IMDGC utilizes a flexible prior. Therefore, IMDGC
several deep generative clustering methods, including can combine the advantages of deep generative clustering
GMVAE [8], VaDE [12], and ClusterGAN [28]. For a more and mutual information maximization, and hence, it is not
comprehensive comparison, InfoGAN [25] with maximization unsurprising that IMDGC outperforms these methods.
of mutual information between data and continuous codes Furthermore, Tables I and II show that when removing the
I (c ; x) is also included. Moreover, to more comprehensively maximization of I (c ; x), InfoGAN cannot perform clustering.
compare IMDGC with InfoGAN, a variant of InfoGAN This is totally different from IMDGC, as IMDGC does not
without maximizing I (c ; x) is also evaluated. depend on I (c ; x). This indicates that IMDGC and InfoGAN
For a fair comparison, these models earlier adopt the same are fundamentally different, and the success of IMDGC is
hyperparameters of optimizer and architectures of encoder essentially due to the use of a flexible prior.
E x (·) and generator G x (·) in IMDGC. For VaDE/GMVAE, 2) Ablation Study: This part is to show that the key
another output layer of encoder E x (·) will be included to components in IMDGC indeed contribute to its clustering
output the variances of latent variables. We report the aver- performance. To this end, several key components of IMDGC
aged ACC and NMI (and the standard deviations in round are chosen for the ablation study on MNIST. These compo-
brackets) of five trials in Table I and the best ACC and NMI nents include Dz (·), maximization of I (c; x) and I (c; z) and
in Table II. continuous codes c . We set the default hyperparameters of
As shown in Tables I and II, IMDGC obtains a remarkable standard IMDGC as λ1 = 1.0, λ2 = 0.1, and J = 0.
improvement on both averaged and best clustering perfor- As shown in Table III, removing any component earlier will
mance in terms of ACC and NMI, compared with other lead to a significant decline of clustering performance in terms
deep generative methods and InfoGAN. Note that compared of ACC and NMI. This reveals that all components are useful
with these deep generative methods, IMDGC takes advantage for the clustering performance of IMDGC.

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on August 28,2024 at 16:51:42 UTC from IEEE Xplore. Restrictions apply.
YANG et al.: LEARNING DEEP GENERATIVE CLUSTERING VIA MUTUAL INFORMATION MAXIMIZATION 6271

decline when one of these two hyperparameters is too large.


This indicates that too strong discriminative power is harmful
to deep clustering, at least on the studied benchmarks. More-
over, the clustering performance of IMDGC is also sensitive to
the number of continuous codes J . This reveals that although
involving continuous codes improves the flexibility of prior
and remarkably lifts clustering performance, it can also do
harm to clustering when the number of continuous codes is
too large. We conjecture that this might be due to the too
high non-Gaussianity of each cluster’s prior, as in this case
Fig. 6. Clustering performance of IMDGC on MNIST over λ1 in terms the prior of each cluster is easy to generate more than one
of ACC (%) and NMI. Length of error bars represents standard deviations. class of images.
(a) ACC. (b) NMI.
VI. S UMMARY AND F URTHER D ISCUSSION
Based on our experimental results, we further summarize
and discuss our proposed approach.
We have proven that in a hierarchical generative model
c → z → x, maximizing the mutual information term I (c; x)
promotes separation of different clusters in the data space.
This provides a theoretical justification for deep generative
clustering with mutual information maximization. Directly
based on our theoretical result, we propose a deep generative
model for clustering, called IMDGC, which naturally inte-
grates hierarchical GAN and mutual information maximiza-
tion. However, it is well known that a hierarchical GAN is
Fig. 7. Clustering performance of IMDGC on MNIST over λ2 in terms of
ACC (%) and NMI. (a) ACC. (b) NMI. difficult to train [35]. To tackle this problem, we propose
three techniques, which are closely related to our theoretical
analysis.
The first technique, involving Dz (·), is aimed at constraining
the generative process of IMDGC to enhance its generative
ability. This is the precondition of our method, as our the-
oretical analysis is based on an assumption: the generative
distribution p(x) has perfectly matched the true data distrib-
ution (see Section III). Fortunately, as shown in Section V,
such a simple technique successfully improves the generative
ability of IMDGC and significantly improves its clustering
performance. These results also indicate that the generative
ability of the deep generative model is vital for clustering.
Fig. 8. Clustering performance of IMDGC on MNIST over J in terms of As for involving I (c; z), it is an auxiliary tool to the
ACC (%) and NMI. (a) ACC. (b) NMI.
maximization of I (c; x). Our theoretical analysis focuses
on separating different clusters in the data space, which is
obviously equivalent to separating different clusters in latent
D. Sensitivity Analysis
space when the mapping between the two spaces is a dif-
In this section, we report the sensitivity analysis of the feomorphism. Motivated by this, we prove that maximizing
three hyperparameters of IMDGC on MNIST, namely, λ1 , I (c; z) promotes separating different clusters in latent space
λ2 , and J . In experiments, we manipulate the three hyper- (see Proposition 2 in Appendix), which means that maximizing
parameters of standard IMDGC (λ1 = 1.0, λ2 = 0.1, and I (c; z) enhances the maximization of I (c; x). However, since
J = 3). The ranges of λ1 , λ2 , and J are {0.01, 0.1, 1.0, 10.0}, our goal is to separate different clusters in the data space and
{0.001, 0.01, 0.1, 1.0}, and {1, 2, 3, 4}, respectively. maximizing I (c; z) cannot directly achieve this, this technique
The clustering results in terms of ACC and NMI are shown cannot fully replace the maximization of I (c; x). Experimen-
in Figs. 6–8, for the three hyperparameters, respectively. From tal results show that this technique effectively stabilizes the
Figs. 6–8, we can see that the clustering performance of maximization of I (c; x) and reduces the overlap of different
IMDGC varies when the three hyperparameters change, and clusters in latent space. These results suggest that maximizing
the standard IMDGC is configured with the optimal setting I (c; z) is an excellent enhancement for maximizing I (c; x) in
of hyperparameters. Specifically, the clustering performance a hierarchical process.
of IMDGC is not very sensitive when λ1 and λ2 decrease The third technique is a simple yet effective improvement,
from 1.0 and 0.1, respectively. But there is a significant namely, involving continuous codes c . Note that in our

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on August 28,2024 at 16:51:42 UTC from IEEE Xplore. Restrictions apply.
6272 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 9, SEPTEMBER 2023


theoretical analysis, the prior of each cluster is absolutely satisfies δp(z|c(i) )dz = 0, then the variation of I (c; x) is
flexible without any constraint, but in vanilla IMDGC the prior given by
of each cluster is set as a Gaussian. This loss of flexibility   
might be harmful to clustering. A simple solution to fill this δ I (c; x) p z|c(i)
        
gap is involving continuous codes c , which enables the prior ∝ DKL pδ x|c(i)  p(x) − DKL pδ x|c(i)  p x|c(i)
of each cluster to be a universal approximator. We empirically 
observe some merits of this technique: improving the non- where pδ (x|c(i) ) = p(x|z) pδ (z|c(i) )dz.
Gaussianity of each cluster’s prior, preventing mode collapse, Proof: Due to p(c, z, x) = p(x|z) p(z|c) p(c), we have
and improving clustering performance. We also observe that K
   
involving too many continuous codes leads to the decline I (c; x) = p c(k) p(x|z) p z|c(k)
of clustering performance, which indicates that too much k=1
flexibility of prior might be harmful to clustering.   
p(x|z) p z|c(k) dz
× log dxdz.
p(x)
VII. C ONCLUSION
(i) (i)
We have proposed to learn deep generative clustering by  of generality, let δp(z|c ) = h(z|c ), where
Without loss
integrating discriminative mechanisms. Specifically, based on  1 and h(z|c(i) )dz = 0. This notation is useful: when
the theoretical analysis we have shown that integrating mutual p(z|c(i) ) shift to p(z|c(i) ) + δp(z|c(i)), the mutual information
information maximization into deep generative clustering pro- I (c; x) can be expanded with respect to 
motes the separation of different clusters in the data space,     
I (c; x) p z|c(i) + δp z|c(i)
which gives a theoretical justification to deep generative     
clustering. Based on this, we have proposed a novel deep gen- = I (c; x) p z|c(i) + h z|c(i)
erative clustering model with mutual information maximiza-    
= p c(k) p(x|z) p z|c(k)
tion. The proposed model, called IMDGC, jointly performs k =i
deep generative clustering and reduces overlap among priors   
p(x|z) p z|c(k) dz
of different clusters via mutual information maximization. × log dxdz
We have evaluated our model on unsupervised clustering tasks p(x)
      
using MNIST, Fashion-MNIST, and STL-10 w.r.t. ACC and + p c(i) p(x|z) p z|c(i) + h z|c(i)
NMI. Experimental results show that IMDGC can significantly      
improve the performance of deep generative clustering. p(x|z) p z|c(i) + h z|c(i) dz
log dxdz
Thanks to the hierarchical structure of IMDGC, three p(x)
techniques can be easily integrated to enhance and stabi-    
lize it, including matching the prior with latent distribu- = p c(k) p(x|z) p z|c(k)
tions of true data, maximizing mutual information between k =i
  
latent variables and cluster code, and involving continuous p(x|z) p z|c(k) dz
× log dxdz
codes. We have shown in experiments that these techniques p(x)
can improve the quality of generated images, stabilize and    
enhance mutual information maximization, and improve the + p c(i) p(x|z) p z|c(i)
flexibility of prior and meanwhile prevent mode collapse,      
p(x|z) p z|c(i) + h z|c(i) dz
respectively. log dxdz
p(x)
There are many possible extensions. The solution of using
   
a flexible complex prior and performing mutual information + p c(i) p(x|z)h z|c(i)
maximization can be applied in VAEs [10]. Applying our      
method in supervised and semisupervised settings is another p(x|z) p z|c(i) + h z|c(i) dz
log dxdz
potential extension. Finally, integrating our techniques into p(x)
other deep models is also a possible future work. K
   
= p c(k) p(x|z) p z|c(k)
k=1
A PPENDIX   
p(x|z) p z|c(k) dz
In this appendix, we give the proof details of our proposed × log dxdz
p(x)
proposition. In addition, we provide the details of the archi-   
tectures of our models.  (i)   (i)  p(x|z)h z|c(i) dz
+ p c p(x|z) p z|c    dxdz
p(x|z) p z|c(i) dz
   
A. Proof of Propositions + p c(i) p(x|z)h z|c(i)
Proposition 1: Suppose (c, z, x) ∼ p(x|z) p(z|c) p(c), where   
c is a discrete variable in {c(1) , . . . , c(K ) }, and p(c) and p(x) p(x|z) p z|c(i) dz
× log dxdz
are fixed. If p(z|c(i) ) changes to pδ (z|c(i) ) = p(z|c(i) ) +  2
p(x)
δp(z|c(i) ), where δp(z|c(i) ) is a perturbative function which +O 

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on August 28,2024 at 16:51:42 UTC from IEEE Xplore. Restrictions apply.
YANG et al.: LEARNING DEEP GENERATIVE CLUSTERING VIA MUTUAL INFORMATION MAXIMIZATION 6273

K
    TABLE IV
= p c(k) p(x|z) p z|c(k) G ENERATORS FOR MNIST AND FASHION -MNIST
k=1   
p(x|z) p z|c(k) dz
× log dxdz
p(x)
   
+ p c(i) p(x|z)h z|c(i) dxdz
   
+ p c(i) p(x|z)h z|c(i)
  
p(x|z) p z|c(i) dz  
× log dxdz + O  2 .
p(x) TABLE V
  E NCODERS FOR MNIST AND FASHION -MNIST
As p(x|z)h(z|c )dxdz = h(z|c(i) )dz = 0, and collecting
(i)

the first-order terms of , we have


    
I (c; x) p z|c(i) + δp z|c(i)
  
≈ I (c; x) p z|c(i)
   
+ p c(i) p(x|z)h z|c(i)
  
p(x|z) p z|c(i) dz
× log dxdz
  (i)p(x)

= I (c; x) p z|c Proof: The mutual information I (c; z) is
     
+ p c(i) p(x|z)δp z|c(i) K
 (k)   (k)  p z|c(k)
   p c p z|c log  K     dz.
( j ) p c( j )
p(x|z) p z|c(i) dz k=1 j =1 p z|c
× log dxdz.
p(x) Similar to the proof of Proposition 1, we next calculate the
Therefore, the first-order variation of I (c; x) with respect to first-order variation of I (c; z) with respect to p(z|c(i) ). For the
p(z|c(i) ) is sake of brevity, we omit the expansion of I (c; z) and directly
   give its first-order variation as follows:
δ I (c; x) p z|c(i)   
        δ I (c; z) p z|c(i)
= I (c; x) p z|c(i) + δp z|c(i) − I (c; x) p z|c(i)  
 (i)  
    p x|c  (i)   (i)  p z|c(i)
= p c(i) p(x|z)δp z|c(i) log dxdz = p c δp z|c log dz
p(x) p(z)
          
 (i)   (i)  p x|c(i) δp z|c(i) p z|c(i) δp z|c(i) p c(i)
= p c p(x|z) pδ z|c log dxdz + p(z) − dz
p(x) p(z) p(z)2
        
 (i)  p x|c(i)  k p z|c(k) δp z|c(i) p c(i)
− p(x|z) p z|c log dxdz + p c p(z) − dz
p(x) k =i
p(z)2
   
 (i)   (i)  pδ x|c(i)  (i)   (i)  p z|c(i)
= p c p(x|z) pδ z|c log dxdz = p c δp z|c log dz
p(x) p(z)
 (i) 
  pδ x|c   
− p(x|z) pδ z|c(i) log  (i)  dxdz δp z|c(i)   (i)   (i) 
p x|c + p(z) − p z|c p c dz
 (i)   p(z)
 (i)  p x|c  
− p(x|z) p z|c log dxdz  k  δp z|c(i)   (k)   (i) 
p(x) − p c p z|c p c dz.
 (i)    (i)   p(z)
= p c D pδ x|c  p(x) k =i
 KL (i)        
−DKL pδ x|c  p x|c(i) − DKL p x|c(i)  p(x) Note that k=i p(z|c(k) ) p(c(k) ) = p(z) − p(z|c(i)) p(c(i ), then
  (i)     (i)   (i) 
∝ DKL pδ x|c  p(x) − DKL pδ x|c  p x|c . the third term earlier can be rewritten as
 
 k  δp z|c(i)   (k)   (i) 
Proposition 2: Suppose (c, z) ∼ p(c, z), where c is a p c p z|c p c dz
discrete variable in {c(1) , . . . , c(K ) }. If p(z|c(i) ) changes to k =i
p(z)
pδ (z|c(i) ) = p(z|c(i) ) + δp(z|c(i) ), where δp(z|c(i) ) is a per-  ⎛ ⎞
turbative function which satisfies δp(z|c(i) )dz = 0, then the  i  δp z|c(i)  (k)   (k) 
= p c ⎝ p z|c p c ⎠dz
variation of I (c; z) is given by p(z) k =i
    (i) 
δ I (c; z) p z|c(i)   δp z|c     
  (i)     
 p(z) − DKL pδ z|c(i)  p z|c(i) .
 = p ci p(z) − p z|c(i) p c(i) dz.
∝ DKL pδ z|c p(z)

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on August 28,2024 at 16:51:42 UTC from IEEE Xplore. Restrictions apply.
6274 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 9, SEPTEMBER 2023


TABLE VI approximation of DKL ( pδ (z|c(i) ) ( j) ( j)
j  =i p(c ) p(z|c )),
D ISCRIMINATORS FOR MNIST AND FASHION -MNIST which promotes separating the prior of different clusters.

B. Architectures Details
The architectures of generators and encoders for MNIST,
Fashion-MNIST, and STL-10 are, respectively, shown in
Tables IV–IX. Conv(k, s, c) or DeConv(k, s, c) denotes con-
volution or transposed convolution with kernel size k, stride s,
and number of channels c, respectively. BN represents batch
normalization, SN denotes spectral normalization, and LReLU
is LeakyReLU with slope 0.2.

TABLE VII ACKNOWLEDGMENT


G ENERATORS FOR STL-10 The authors would like to thank the valuable comments and
suggestions provided by the reviewers.

R EFERENCES
[1] T. C. Silva and L. Zhao, “Network-based high level data classification,”
IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 6, pp. 954–970,
Jun. 2012.
[2] Y. Wang, S. Chen, and Z.-H. Zhou, “New semi-supervised classification
method based on modified cluster assumption,” IEEE Trans. Neural
Netw. Learn. Syst., vol. 23, no. 5, pp. 689–702, May 2012.
[3] J. Gui, T. Liu, D. Tao, Z. Sun, and T. Tan, “Representative vector
TABLE VIII machines: A unified framework for classical classifiers,” IEEE Trans.
E NCODERS FOR STL-10 Cybern., vol. 46, no. 8, pp. 1877–1888, Aug. 2016.
[4] C. M. Bishop, Pattern Recognition and Machine Learning. New York,
NY, USA: Springer, 2006.
[5] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, “Maximum
margin clustering,” in Proc. Adv. Neural Inf. Process. Syst., 2004,
pp. 1537–1544.
[6] A. Krause, P. Perona, and R. G. Gomes, “Discriminative clustering
by regularized information maximization,” in Proc. Adv. Neural Inf.
Process. Syst., 2010, pp. 775–783.
[7] J. T. Springenberg, “Unsupervised and semi-supervised learning with
categorical generative adversarial networks,” in Proc. Conf. Learn.
Represent., 2016, pp. 1–20.
[8] N. Dilokthanakul et al., “Deep unsupervised clustering with Gaussian
mixture variational autoencoders,” in Proc. Conf. Learn. Represent.,
TABLE IX 2017, pp. 1–12.
D ISCRIMINATORS FOR STL-10 [9] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embed-
ding for clustering analysis,” in Proc. Int. Conf. Mach. Learn., 2016,
pp. 478–487.
[10] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in
Proc. Int. Conf. Learn. Represent., 2014, pp. 1–14.
[11] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropa-
gation and approximate inference in deep generative models,” in Proc.
Int. Conf. Mach. Learn., 2014, pp. 1278–1286.
[12] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou, “Variational deep
embedding: An unsupervised and generative approach to clustering,” in
Proc. 26h Int. Joint Conf. Artif. Intell., Aug. 2017, pp. 1965–1972.
[13] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama, “Learning
discrete representations via information maximizing self-augmented
training,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 1558–1567.
Therefore, we have [14] K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang, “Deep
  clustering via joint convolutional autoencoder embedding and relative
δ I (c; z) p(z|c(i) ) entropy minimization,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV),
Oct. 2017, pp. 5747–5756.
p(z|c(i) ) [15] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural
= p(c(i) ) δp(z|c(i) ) log dz Inf. Process. Syst., 2014, pp. 2672–2680.
p(z)
 [16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
= p(c(i) )(DKL ( pδ (z|c(i) ) p(z)) learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11,
 
− DKL ( pδ (z|c(i) ) p(z|c(i) )) − DKL ( p(z|c(i) ) p(z)))
pp. 2278–2324, Nov. 1998.
  [17] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A novel
∝ DKL ( pδ (z|c(i) ) p(z)) − DKL ( pδ (z|c(i) ) p(z|c(i) )). image dataset for benchmarking machine learning algorithms,” 2017,
arXiv:1708.07747.
[18] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks
Similar with the discussion of Proposition 1, we can see in unsupervised feature learning,” in Proc. 14th Int. Conf. Artif. Intell.
that maximizing I (c; z) is equivalent to maximizing the Statist., 2011, pp. 215–223.

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on August 28,2024 at 16:51:42 UTC from IEEE Xplore. Restrictions apply.
YANG et al.: LEARNING DEEP GENERATIVE CLUSTERING VIA MUTUAL INFORMATION MAXIMIZATION 6275

[19] H. Jia and Y.-M. Cheung, “Subspace clustering of categorical and Xiaojiang Yang received the B.E. degree in physics
numerical data with an unknown number of clusters,” IEEE Trans. from Nankai University, Tianjin, China, in 2018.
Neural Netw. Learn. Syst., vol. 29, no. 8, pp. 3308–3325, Aug. 2017. He is currently pursuing the Ph.D. degree with the
[20] X. Yang, C. Deng, K. Wei, J. Yan, and W. Liu, “Adversarial learning for Department of Computer Science and Engineering,
robust deep clustering,” in Proc. Adv. Neural Inf. Process. Syst., 2020, Shanghai Jiao Tong University, Shanghai, China.
pp. 9098–9108. His research interests include machine learning
[21] P. Zhou, Y. Hou, and J. Feng, “Deep adversarial subspace clustering,” and computer vision, with emphasis on generative
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, models and representation learning.
pp. 1596–1604.
[22] D. Barber and F. Agakov, “Kernelized infomax clustering,” in Proc. Adv.
Neural Inf. Process. Syst., 2005, pp. 17–24.
[23] N. Slonim, G. S. Atwal, G. Tkačik, and W. Bialek, “Information-
based clustering,” Proc. Nat. Acad. Sci. USA, vol. 102, no. 51,
pp. 18297–18302, 2005.
[24] X. Yang, C. Deng, F. Zheng, J. Yan, and W. Liu, “Deep spectral
clustering using dual autoencoder network,” in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4066–4075. Junchi Yan (Senior Member, IEEE) received
[25] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and the Ph.D. degree from the Department of Elec-
P. Abbeel, “InfoGAN: Interpretable representation learning by informa- tronic Engineering, Shanghai Jiao Tong University,
tion maximizing generative adversarial nets,” in Proc. Adv. Neural Inf. Shanghai, China, in 2015.
Process. Syst., 2016, pp. 2180–2188. He was a Senior Research Staff Member and a
[26] D. Calandriello, G. Niu, and M. Sugiyama, “Semi-supervised Principal Scientist for industrial vision with IBM
information-maximization clustering,” Neural Netw., vol. 57, T. J. Watson Research Center, Yorktown Heights,
pp. 103–111, Sep. 2014. NY, USA, where he started his career in 2011.
[27] T. S. Jaakkola, M. Meila, and T. Jebara, “Maximum entropy discrimi- He is currently an Associate Professor with the
nation,” in Proc. Adv. Neural Inf. Process. Syst., 1999, pp. 470–476. Department of Computer Science and Engineering,
[28] S. Mukherjee, H. Asnani, E. Lin, and S. Kannan, “ClusterGAN: Latent Shanghai Jiao Tong University. His research interests
space clustering in generative adversarial networks,” in Proc. AAAI Conf. are machine learning.
Artif. Intell., 2019, pp. 4610–4617.
[29] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversar-
ial autoencoders,” in Proc. Int. Conf. Learn. Represent., 2016, pp. 1–16.
[30] J. Donahue, P. Krahenbuhl, and T. Darrell, “Adversarial feature learn-
ing,” in Proc. Int. Conf. Learn. Represent., 2017, pp. 1–18.
[31] V. Dumoulin et al., “Adversarially learned inference,” in Proc. Int. Conf.
Learn. Represent., 2017, pp. 1–18. Yu Cheng (Member, IEEE) received the bachelor’s
[32] W. Wu, J. Yan, X. Yang, and H. Zha, “Discovering temporal degree from Tsinghua University, Beijing, China,
patterns for event sequence clustering via policy mixture model,” in 2010, and the Ph.D. degree from Northwestern
IEEE Trans. Knowl. Data Eng., early access, Apr. 10, 2020, doi: University, Xi’an, China, in 2015.
10.1109/TKDE.2020.2986206. He was a Research Staff Member with the IBM
[33] R. Wang, J. Yan, and X. Yang, “Graduated assignment for joint multi- T. J. Watson Research Center, Yorktown Heights,
graph matching and clustering with application to unsupervised graph NY, USA. He is a Principal Researcher at Microsoft
matching network learning,” in Proc. Adv. Neural Inf. Process. Syst., Research Laboratory, Redmond, WA, USA. His
2020, pp. 19908–19919. research interest includes deep learning, with empha-
[34] S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Müller, “The man- sis on the deep generative model, model compres-
ifold tangent classifier,” in Proc. Adv. Neural Inf. Process. Syst., 2011, sion, and adversarial learning.
pp. 2294–2302. Dr. Cheng regularly serves on the program committees of top-tier AI
[35] H. Zhang et al., “StackGAN: Text to photo-realistic image synthesis conferences, such as Neural Information Processing Systems (NIPS), Interna-
with stacked generative adversarial networks,” in Proc. IEEE Int. Conf. tional Conference on Machine Learning (ICML), International Conference on
Comput. Vis. (ICCV), Oct. 2017, pp. 5907–5915. Learning Representation (ICLR), IEEE/CVF Conference on Computer Vision
[36] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, and Pattern Recognition (CVPR), and Annual Meeting of the Association for
MA, USA: MIT Press, 2016. Computational Linguistics (ACL).
[37] A. Krizhevsky et al., “Learning multiple layers of features from tiny
images,” M.S. thesis, Dept. Comput. Sci., Univ. Tronto, Toronto, ON,
Canada, 2009.
[38] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” in Proc. Int. Conf. Learn. Represent.,
2015, pp. 1–14.
[39] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: Yizhe Zhang received the B.S. degree in physics
A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. from Nanjing University, Nanjing, China, in 2011,
Vis. Pattern Recognit., Jun. 2009, pp. 248–255. and the M.S. degree in statistical science and
[40] N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures the Ph.D. degree from Duke University, Durham,
for clusterings comparison: Variants, properties, normalization and cor- NC, USA, both in 2018, under the supervision of
rection for chance,” J. Mach. Learn. Res., vol. 11, pp. 2837–2854, Prof. Lawrence Carin.
Jan. 2010. He is currently a Senior Researcher with the
[41] H. W. Kuhn, “The Hungarian method for the assignment problem,” Microsoft Research Laboratory, Redmond, WA,
Naval Res. Logistics Quart., vol. 2, nos. 1–2, pp. 83–97, Mar. 1955. USA, where he is primarily working on natural
[42] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” language generation, deep generative models, gen-
in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–15. erative pretraining, and text representation learning.

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on August 28,2024 at 16:51:42 UTC from IEEE Xplore. Restrictions apply.

You might also like