Learning Deep Generative Clustering Via Mutual Information Maximization
Abstract— Deep clustering refers to joint representation learning and clustering using deep neural networks. Existing methods can be mainly categorized into two types: discriminative and generative methods. The former learns representations for clustering with discriminative mechanisms directly, and the latter estimates the latent distribution of each cluster for generating data points and then infers cluster assignments. Although generative methods have the advantage of estimating the latent distributions of clusters, their performances still significantly fall behind discriminative methods. In this work, we argue that this performance gap might be partly due to the overlap of the data distributions of different clusters. In fact, there is little guarantee that generative methods separate the distributions of different clusters in the data space. To tackle these problems, we theoretically prove that mutual information maximization promotes the separation of different clusters in the data space, which provides a theoretical justification for deep generative clustering with mutual information maximization. Our theoretical analysis directly leads to a model which integrates a hierarchical generative adversarial network and mutual information maximization. Moreover, we further propose three techniques and empirically show their effects to stabilize and enhance the model. The proposed approach notably outperforms other generative models for deep clustering on public benchmarks.

Index Terms— Deep generative clustering, generative adversarial networks (GANs), mutual information maximization, variational autoencoders (VAEs).

I. INTRODUCTION

and (2) discriminative methods, such as maximum margin clustering (MMC) [5] and regularized information maximization (RIM) [6], which group data directly using some objectives [7]. These approaches have achieved some success in different real-world tasks, but it is commonly believed that they have difficulty in learning hidden, hierarchical dependencies in latent spaces [8], as they are based on local relations in the data space.

Consequently, recent efforts for clustering tend to move toward performing clustering on the latent space, namely, deep clustering. New algorithms have been proposed and achieved promising results. In the line of deep discriminative clustering, deep embedded clustering (DEC) [9] iteratively encodes data into feature space and improves the confidence for assignments. For deep generative clustering, variants of the variational autoencoder (VAE) [10], [11], such as the Gaussian mixture variational autoencoder (GMVAE) [8] and VaDE [12], are devised to estimate the latent distribution by a mixture of Gaussians.

Although generative methods for deep clustering have the advantage of estimating the latent distributions of clusters, their performances are still notably worse than the state-of-the-art deep discriminative methods [13], [14]. In this article, we highlight that such a performance gap might be partly caused by the fact that there is no sufficient discriminative power for these generative methods to separate the data distributions of different clusters. Furthermore, this induces another
generated by a value of cluster code. As a hierarchical GAN is often hard to train, some techniques are devised in this article: matching the prior of the generative model with the estimated latent distribution of true data with another GAN, maximizing the mutual information between latent variables and cluster code, and concatenating continuous codes to the cluster code in the generative model. These techniques show empirical success in stabilizing and enhancing our method on deep clustering, as shown in our experiments.

We evaluate our method on three public benchmarks widely used in deep generative clustering: the Mixed National Institute of Standards and Technology database (MNIST) [16], Fashion-MNIST [17], and the self-taught learning image database (STL) [18], and compare it with the state-of-the-art deep generative clustering methods. Our method notably outperforms them in terms of accuracy (ACC) and normalized mutual information (NMI). The contributions of this article are summarized as follows.

1) We theoretically show that mutual information maximization between data and cluster code promotes the separation of different clusters in the data space, which provides a theoretical justification to deep generative clustering with mutual information maximization.
2) Based on the theoretical analysis, we propose to integrate a discriminative mechanism into deep generative clustering, which can enhance the discrimination of the latent distribution in the generative model for deep clustering. Motivated by this, we devise a novel model, namely IMDGC.
3) We propose three techniques to stabilize and enhance the proposed model and empirically demonstrate their mechanisms and effects.
4) Experimental results show that IMDGC outperforms the state-of-the-art deep generative clustering models on popular datasets for generative clustering, including MNIST, Fashion-MNIST, and STL.

This article is organized as follows. Section II introduces preliminary background on related works. Theoretical analysis is given in Section III. Section IV presents the proposed approach, and the empirical study is performed in Section V. A further discussion is given in Section VI. Section VII concludes this article.

II. RELATED WORK

As a long-standing problem, there exist multiple lines of work for clustering, e.g., subspace clustering [19], adversarial clustering [20], [21], discriminative clustering [22], [23], generative clustering [4], [7], [20], [24], and so on. As our method lies in improving deep generative clustering with mutual information maximization, we review related works about discriminative clustering by mutual information maximization, deep generative clustering, and specifically a closely related model, InfoGAN [25].

A. Discriminative Clustering

In this part, we mainly focus on those relevant to mutual information maximization. RIM [6], Infomax [22], [26], and other classical discriminative methods [23], [27] are limited to the data space and, hence, are unable to capture the hidden dependency. DEC [9] opens the avenue of deep clustering. It iteratively encodes data into feature space and improves confidence for assignment. Information maximizing self-augmented training (IMSAT) [13] maximizes the information-theoretic dependency between data and their assignments in an end-to-end model. DEPICT [14] stacks a multinomial logistic regression function on an autoencoder to predict assignments and jointly minimizes the relative entropy.

B. Deep Generative Clustering

Some deep generative methods have been proposed to model the latent distribution of true data for deep clustering, including GMVAE [8], VaDE [12], and ClusterGAN [28]. GMVAE and VaDE exploit a Gaussian mixture distribution to estimate the latent distribution within the VAE [10] framework. ClusterGAN decomposes the latent space into a categorical part and a continuous part to learn the assignments directly within a bidirectional GAN. These methods employ no discriminative mechanism to improve clustering. Another method closely related to our model is clustering with adversarial autoencoders [29] or BiGAN [30], [31], which encodes data into a fixed Gaussian mixture prior. Compared with this, in our model, the prior is more flexible, and the optimal prior is obtained by learning.

In addition, there are (generative) models tailored for clustering of other forms of data, e.g., event sequences [32] and graphs [33], which are beyond the scope of this article.

C. Differences to InfoGAN

One closely related work to ours is InfoGAN [25], which can be viewed as a method for integrating mutual information maximization into generative clustering. Specifically, InfoGAN estimates the data distribution with a GAN and, meanwhile, maximizes the mutual information between data and codes (both cluster code and continuous codes), as shown in Fig. 1(a). The main difference between InfoGAN and our method is the generating process. In InfoGAN, data are synthesized directly from codes. In our method, however, the generative process is hierarchical: flexible priors of clusters are first generated from the cluster code, and then data are generated from the prior. This hierarchical structure enables our method to achieve clustering in a natural way. Specifically, it can separate different clusters by low-density regions in the data space due to the combined action of its hierarchical structure and mutual information maximization (see Section III). As it is believed that different classes of data are separated by low-density regions in the data space [34], our method probably provides a natural solution to clustering. However, InfoGAN has no such desirable property, especially when it simply maximizes the mutual information between data and cluster code (see Section V-C.1). Specifically, even when the mutual information term of InfoGAN is well maximized, different clusters can be connected in the data space. Consider a case of InfoGAN on MNIST, for example, when its cluster code corresponds to the width of digits, the mutual information
IV. PROPOSED APPROACH

In this section, we propose IMDGC, by first introducing its vanilla model structure and objective. Then, we present three techniques to stabilize and enhance the proposed model: 1) matching the prior in the generative model with the estimated latent distribution of true data with another GAN; 2) maximizing the mutual information between latent variables and cluster code; and 3) concatenating continuous codes to the cluster code in the generative model. These techniques are easy to integrate into our method due to the flexible prior introduced in our method.

A. Information Maximization Deep Generative Clustering

The basic structure of the proposed model (IMDGC) is shown in Fig. 1(b). Specifically, IMDGC adopts a hierarchical generative model p(c, z, x) = p(c)p(z|c)p(x|z) to synthesize data and a hierarchical inference model q(c, z|x) = q(z|x)q(c|z) for estimating I(c; x). The specific generative and inference models are described as follows.

In the generative model, we denote samples from p(c, z, x) as (c, z, x̃), and set p(z|c) = N(z | μ(c), diag(σ²(c))) and p(x|z) = δ(x − G_x(z)), where G_x(·) denotes the generator for x̃. Specifically,

$$c \sim p(c) = \mathrm{Cat}(c \mid K) \quad (4)$$
$$\big(\mu(c), \sigma^2(c)\big) = G_z(c) \quad (5)$$
$$z \sim p(z \mid c) = \mathcal{N}\big(z \mid \mu(c), \mathrm{diag}(\sigma^2(c))\big) \quad (6)$$
$$\tilde{x} = G_x(z) \quad (7)$$

where Cat(c|K) denotes the uniform categorical distribution over K categories. In (6), z is sampled through reparameterization [10] for back propagation, i.e., z = μ(c) + σ(c) ⊙ ε, ε ∼ N(ε | 0, I), where ⊙ is the pointwise product. Note that the parameters of each Gaussian, μ(·) and σ(·), are outputs of the generator G_z(·), and hence, the priors of clusters are flexible.

The generative model is trained adversarially with a discriminator D_x(·) to synthesize samples. The adversarial objective is denoted as V_x. In this article, we simply use the objective of the vanilla GAN [15] without loss of generality:

$$V_x = \mathbb{E}_{x \sim p_d(x)} \log D_x(x) + \mathbb{E}_{\tilde{x} \sim p(x)} \log\big(1 - D_x(\tilde{x})\big) \quad (8)$$

where p_d(x) denotes the true data distribution.

To estimate the mutual information I(c; x), the posterior p(c|x) is necessary. Following InfoGAN [25], we introduce an auxiliary inference model q(c|x) to approximate p(c|x). Different from InfoGAN, here we adopt a hierarchical structure q(c|x) = ∫ q(c|z)q(z|x) dz, and set q(z|x) = δ(z − E_x(x)), where E_x(·) is the encoder to encode x. This setting seems redundant, but it brings flexibility to constrain the prior, as will be shown in Sections IV-B and IV-C.

Given a synthesized sample x̃, the inference process is

$$\tilde{z} = E_x(\tilde{x}) \quad (9)$$
$$q(c \mid \tilde{x}) = E_z(\tilde{z}) \quad (10)$$

where E_z(·) is the encoder to encode z; its output unit is Softmax, and hence, its output is a distribution over c.

With the approximate posterior q(c|x), the mutual information I(c; x) can be estimated as [25]

$$I(c; x) \approx \mathbb{E}_{(c, \tilde{x}) \sim p(c)p(x|c)} \log q(c \mid \tilde{x}) + H(c) \quad (11)$$

where H(c) is a constant, as p(c) is given in this work. Combining the adversarial objective with the mutual information term, the objective of IMDGC is

$$\min_{G, E} \max_{D_x} V_{\mathrm{IMDGC}} = V_x - \lambda_1 I(c; x) \quad (12)$$

where λ₁ is the balancing hyperparameter of I(c; x), G = {G_z, G_x}, and E = {E_x, E_z}.
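As an illustration of the generative and inference paths above, the following is a minimal PyTorch sketch of the sampling chain (4)–(7) and the estimate (11). The module names, layer widths, and MLP structures are illustrative assumptions, not the architectures reported in Appendix B; the encoders E_x and E_z are assumed to be defined analogously, with E_z returning K class logits (the numerically stable equivalent of a Softmax output).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, D_Z, D_X = 10, 64, 784  # clusters, latent size, flattened data size (MNIST-like); illustrative only

class GzPrior(nn.Module):
    """G_z(c): maps a one-hot cluster code to the mean and log-variance of p(z|c), eq. (5)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(K, 128), nn.ReLU(), nn.Linear(128, 2 * D_Z))
    def forward(self, c_onehot):
        mu, logvar = self.net(c_onehot).chunk(2, dim=1)
        return mu, logvar

class GxGenerator(nn.Module):
    """G_x(z): deterministic generator, p(x|z) = delta(x - G_x(z)); output in [-1, 1] (an assumption)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D_Z, 256), nn.ReLU(), nn.Linear(256, D_X), nn.Tanh())
    def forward(self, z):
        return self.net(z)

def sample_hierarchy(gz, gx, batch):
    """Eqs. (4)-(7): c ~ Cat(K), z ~ N(mu(c), diag sigma^2(c)) via reparameterization, x_tilde = G_x(z)."""
    c = torch.randint(0, K, (batch,))                  # (4) uniform categorical cluster code
    c_onehot = F.one_hot(c, K).float()
    mu, logvar = gz(c_onehot)                          # (5) flexible per-cluster prior parameters
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps             # (6) reparameterization trick
    x_fake = gx(z)                                     # (7)
    return c, z, x_fake

def mi_cx_lower_bound(ex, ez, x_fake, c):
    """Eq. (11) up to the constant H(c): E log q(c | x_tilde), with q(c | x_tilde) = E_z(E_x(x_tilde))."""
    z_tilde = ex(x_fake)                 # eq. (9), assumed encoder E_x
    logits = ez(z_tilde)                 # eq. (10), assumed encoder E_z returning K logits
    return -F.cross_entropy(logits, c)   # cross-entropy = -E log q(c | x_tilde)

# Tiny smoke test of the sampling path (encoders omitted here).
gz, gx = GzPrior(), GxGenerator()
c, z, x_fake = sample_hierarchy(gz, gx, batch=8)
```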
B. Matching the Prior With Latent Distribution of True Data

It has been reported that without constraints on the generating process, a hierarchical GAN is hard to train and its generated images are of low quality [35]. We empirically observe that this also occurs with the vanilla IMDGC. This phenomenon might be harmful to the clustering performance of IMDGC, as a powerful generative clustering model should successfully generate images of each cluster. Otherwise, the generative clustering model cannot learn the true data distribution well and, hence, cannot reasonably separate the true distribution for clustering.

To deal with the problem above, we propose to constrain the generating process of IMDGC by matching the prior with the latent distribution of true data. Specifically, we introduce another GAN to achieve this in three steps.
1) Encode true data x by E_x(·) into latent variables ẑ.
2) Train a discriminator D_z(·) to distinguish z and ẑ.
3) Train the generators and encoders to match z with ẑ.
The adversarial objective here is

$$V_z = \mathbb{E}_{z \sim p(z)} \log D_z(z) + \mathbb{E}_{\hat{z} \sim q(z)} \log\big(1 - D_z(\hat{z})\big) \quad (13)$$

where q(z) = ∫ p_d(x) q(z|x) dx is the latent distribution of true data in the inference model.

Therefore, the objective of IMDGC becomes

$$\min_{G, E} \max_{D} V_{\mathrm{IMDGC}} = V_x + V_z - \lambda_1 I(c; x) \quad (14)$$

where D = {D_x, D_z}.
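A hedged sketch of this latent-space matching step, again with assumed module names and sizes: prior samples z ∼ p(z|c) are treated as "real" and encodings ẑ = E_x(x) as "fake" for D_z, and the two losses below correspond to the two sides of the minimax game on V_z in (13), with the common non-saturating surrogate on the encoder side. Which of z and ẑ carries gradients in each update is a design choice of this sketch, not something fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Dz(nn.Module):
    """Latent discriminator D_z: separates prior samples z ~ p(z|c) from encodings z_hat = E_x(x)."""
    def __init__(self, d_z=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_z, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
    def forward(self, z):
        return self.net(z)  # a logit; sigmoid(logit) plays the role of D_z(z) in eq. (13)

def dz_step(dz, z_prior, z_hat):
    """Maximize V_z over D_z: 'real' = prior samples, 'fake' = encoded true data (both detached)."""
    real_logit = dz(z_prior.detach())
    fake_logit = dz(z_hat.detach())
    return (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
            + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))

def match_step(dz, z_hat):
    """Minimize V_z over the encoder (non-saturating form): push E_x(x) toward the learned prior."""
    fake_logit = dz(z_hat)
    return F.binary_cross_entropy_with_logits(fake_logit, torch.ones_like(fake_logit))

# Example usage with random tensors standing in for prior samples and encodings.
dz = Dz()
z_prior, z_hat = torch.randn(8, 64), torch.randn(8, 64)
loss_d = dz_step(dz, z_prior, z_hat)
```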
C. Maximizing Mutual Information Between Latent Variables and Cluster Code

We empirically observe that the maximization of the mutual information between data and cluster code, I(c; x), is unstable in experiments. This might be caused by the vanishing gradient problem of the mutual information term I(c; x), as the network to estimate I(c; x) includes two generators and two encoders, which is too deep for the maximization of I(c; x).

To stabilize the maximization of I(c; x), we propose to meanwhile maximize the mutual information between cluster code and latent variables, I(c; z). This can be viewed as a way to "guide" the maximization of I(c; x). To prove this, we give another proposition as follows, and the proof details are given in Appendix A.

Proposition 2: Suppose (c, z) ∼ p(c, z), where c is a discrete variable in {c^(1), ..., c^(K)}. If p(z|c^(i)) changes to p_δ(z|c^(i)) = p(z|c^(i)) + δp(z|c^(i)), where δp(z|c^(i)) is a perturbative function which satisfies ∫ δp(z|c^(i)) dz = 0, then the variation of I(c; z) is given by

$$\delta I(c; z)\big[p(z \mid c^{(i)})\big] \propto D_{\mathrm{KL}}\big(p_\delta(z \mid c^{(i)}) \,\|\, p(z)\big) - D_{\mathrm{KL}}\big(p_\delta(z \mid c^{(i)}) \,\|\, p(z \mid c^{(i)})\big). \quad (15)$$

Similar to the discussion about Proposition 1, we can conclude that the maximization of I(c; z) promotes separation of the priors of different clusters, which intuitively helps separate different clusters in the data space. Note that maximizing I(c; z) cannot directly separate the distributions of different clusters, and hence, it cannot replace the maximization of I(c; x).

For simplicity and efficiency, we use the encoder E_z(·) to approximate the posterior p(c|z) to estimate I(c; z), without involving an auxiliary network:

$$I(c; z) \approx \mathbb{E}_{(c, z) \sim p(c)p(z|c)} \log q(c \mid z) + H(c). \quad (16)$$

Finally, the objective of IMDGC is

$$\min_{G, E} \max_{D} V_{\mathrm{IMDGC}} = V_x + V_z - \lambda_1 I(c; x) - \lambda_2 I(c; z). \quad (17)$$
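Putting the pieces together, the following is a minimal sketch of one way the generator/encoder side of objective (17) could be assembled. The non-saturating surrogates for V_x and V_z, the reuse of the encoders for both mutual-information terms, and the default λ values (taken from the experimental settings reported later) are assumptions of this sketch rather than a prescription of the exact training split between the minimax players.

```python
import torch
import torch.nn.functional as F

def imdgc_update_objective(dx, dz, ex, ez, gx, c, z_prior, x_real,
                           lambda1=1.0, lambda2=0.1):
    """One way to assemble the generator/encoder side of eq. (17):
       V_x + V_z - lambda1 * I(c;x) - lambda2 * I(c;z), up to the constant H(c).

    dx, dz: data-space and latent-space discriminators returning logits.
    ex, ez: encoders E_x and E_z (ez returns K class logits approximating q(c|.)).
    gx:     generator G_x; z_prior: samples from p(z|c); c: integer cluster codes; x_real: a data batch.
    """
    x_fake = gx(z_prior)                          # x_tilde = G_x(z), eq. (7)
    z_hat = ex(x_real)                            # z_hat = E_x(x), used for prior matching
    # Non-saturating surrogates of the adversarial terms V_x (8) and V_z (13).
    logits_x = dx(x_fake)
    logits_z = dz(z_hat)
    adv_x = F.binary_cross_entropy_with_logits(logits_x, torch.ones_like(logits_x))
    adv_z = F.binary_cross_entropy_with_logits(logits_z, torch.ones_like(logits_z))
    # I(c;x) estimate, eq. (11): log q(c|x_tilde) with q(c|x_tilde) = E_z(E_x(x_tilde)).
    mi_cx = -F.cross_entropy(ez(ex(x_fake)), c)
    # I(c;z) estimate, eq. (16): the same encoder E_z applied directly to the prior sample.
    mi_cz = -F.cross_entropy(ez(z_prior), c)
    return adv_x + adv_z - lambda1 * mi_cx - lambda2 * mi_cz
```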
In other words, although the continuous codes are utilized to improve the flexibility of the prior, their reconstructions are not inferred in the inference process. Therefore, involving continuous codes merely increases a few parameters of G_z(·) and, hence, has negligible cost in IMDGC.

V. EXPERIMENTS

In this section, we first introduce the protocol of our experiments, including the choice and preprocessing of datasets, the metrics of clustering, and the architectures of networks and hyperparameters. Then, we report the results of verifying experiments, aiming to verify the effects of the three proposed techniques. After that, we report the main part of this section: the clustering performance of IMDGC compared with other generative methods of deep clustering. Moreover, to show that each component of IMDGC contributes to the clustering performance, we present the results of ablation experiments on clustering. Finally, we show the results of a sensitivity analysis of three vital hyperparameters in IMDGC: the weights λ₁ and λ₂, and the number of continuous codes J.
$$\mathrm{NMI} = \frac{2\, I(t; c)}{H(t) + H(c)} \quad (21)$$

where t is the ground-truth label variable and c is the cluster assignment variable.
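For concreteness, a small NumPy sketch of (21), computed from the joint distribution of labels and assignments; it is only an illustration of the definition and should agree, up to numerical details, with standard library implementations.

```python
import numpy as np

def nmi(t, c):
    """Normalized mutual information, eq. (21): NMI = 2 I(t;c) / (H(t) + H(c)).
    t, c: integer arrays of ground-truth labels and cluster assignments."""
    t = np.asarray(t); c = np.asarray(c)
    n = len(t)
    # Joint distribution p(t, c) from the contingency table.
    t_vals, t_idx = np.unique(t, return_inverse=True)
    c_vals, c_idx = np.unique(c, return_inverse=True)
    joint = np.zeros((len(t_vals), len(c_vals)))
    np.add.at(joint, (t_idx, c_idx), 1.0)
    joint /= n
    pt = joint.sum(axis=1, keepdims=True)
    pc = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / (pt @ pc)[nz]))
    h_t = -np.sum(pt[pt > 0] * np.log(pt[pt > 0]))
    h_c = -np.sum(pc[pc > 0] * np.log(pc[pc > 0]))
    return 2.0 * mi / (h_t + h_c)

# A perfect clustering up to label permutation gives NMI = 1.
print(nmi([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))
```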
3) Architectures and Hyperparameters: In all experiments, the hyperparameters of the optimizer are the same. For optimization, we use Adam [42] with a learning rate of 0.0002, β₁ = 0.5, and β₂ = 0.999. As for the architectures of the networks, one architecture is employed on both MNIST and Fashion-MNIST, while another with spectral normalization is employed on STL. The specific architectures are reported in Appendix B. Other hyperparameters, including λ₁, λ₂, and J, are reported before the corresponding experimental results in Sections V-B–V-D. In addition, the batch size is set as 100, the number of clusters is set as K = 10, and the number of latent variables is set as d = 64 on both MNIST and Fashion-MNIST and d = 128 on STL.
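As a concrete illustration of these optimizer and size settings only; the dummy modules below merely stand in for the IMDGC networks.

```python
import torch
import torch.nn as nn

# Settings quoted above: Adam with lr = 2e-4, beta1 = 0.5, beta2 = 0.999; batch size 100; K = 10.
K, d = 10, 64                      # clusters; latent size (128 on STL)
generator = nn.Linear(d, 784)      # placeholder for G = {G_z, G_x}
encoder = nn.Linear(784, d)        # placeholder for E = {E_x, E_z}
params = list(generator.parameters()) + list(encoder.parameters())
optimizer = torch.optim.Adam(params, lr=2e-4, betas=(0.5, 0.999))
batch_size = 100
```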
Fig. 2. Generated images on MNIST and Fashion-MNIST with D_z(·) or not. Each row represents generated images with a fixed cluster code and randomly sampled continuous codes. (a) MNIST, without D_z. (b) MNIST, with D_z. (c) Fashion-MNIST, without D_z. (d) Fashion-MNIST, with D_z.

B. Verifying Experiments and Results
Fig. 3. Mutual information term I(c; x) and overlap among priors of different clusters on MNIST over training epochs for λ₂ = 0.1, 0.01, 0. The lines are the mean of five trials, and the error bands represent the standard deviations. (a) I(c; x). (b) Overlap among priors.

Fig. 4. Non-Gaussianity of the priors of clusters as calculated by (23) on MNIST over training epochs as the number of continuous codes J varies. The lines are the average values of five trials, and the error bands represent the standard deviations.

that the batch size here is set as 10, and each cluster code has only one sample.

As shown in Fig. 3, the maximization of I(c; x) becomes stable and efficient when I(c; z) is maximized in IMDGC meanwhile. Otherwise, it is extremely unstable and inefficient. Besides, maximizing I(c; z) also contributes to reducing the overlap among the priors of different clusters. These results testify to our motivation for maximizing I(c; z) in IMDGC.

3) Effects of Involving c′: To verify whether the model can utilize the flexibility brought by continuous codes c′, we design a metric to measure the non-Gaussianity of the prior of each cluster. We believe the non-Gaussianity can reflect the usage of the flexibility of the prior. For a given cluster code c, the non-Gaussianity is obtained by the following optimization:

$$\min_{\mu, \sigma} \int D_{\mathrm{KL}}\Big(p(z \mid c, c') \,\Big\|\, \mathcal{N}\big(z \mid \mu, \mathrm{diag}(\sigma^2)\big)\Big)\, dc' \quad (23)$$

where p(z|c, c′) is a tractable Gaussian in IMDGC, and the integral can be estimated by Monte Carlo sampling. In the experiment, we sample ten c′ for each cluster code c to estimate the integral in (23) and average the non-Gaussianities of the ten clusters.
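One way to evaluate (23) in code is sketched below. Beyond what the text states, it assumes that the reference Gaussian is diagonal and that, for a Gaussian reference, minimizing the Monte Carlo average of the KL terms reduces to moment-matching the mixture of the sampled per-c′ Gaussians; the array shapes and the synthetic example values are placeholders.

```python
import numpy as np

def gaussian_kl_diag(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag var_p) || N(mu_q, diag var_q) ) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def non_gaussianity(mus, varis):
    """Monte Carlo estimate of eq. (23) for one cluster code c.
    mus, varis: arrays of shape (S, d) with the means and variances of p(z | c, c'_s)
    for S sampled continuous codes c' (S = 10 in the experiments).
    For a Gaussian reference, the minimizer of the average KL is the moment-matched
    Gaussian of the mixture, so the inner optimization over (mu, sigma) has a closed form."""
    mu_bar = mus.mean(axis=0)
    var_bar = (varis + (mus - mu_bar) ** 2).mean(axis=0)   # per-dimension variance of the mixture
    kls = [gaussian_kl_diag(m, v, mu_bar, var_bar) for m, v in zip(mus, varis)]
    return float(np.mean(kls))

# Example with synthetic prior parameters for S = 10 hypothetical c' samples in d = 64 dimensions.
rng = np.random.default_rng(0)
mus = rng.normal(size=(10, 64))
varis = np.exp(rng.normal(size=(10, 64)))
print(non_gaussianity(mus, varis))
```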
In experiments, we train on MNIST and Fashion-MNIST and set λ₁ = 1.0, λ₂ = 0.1, and J = 0, 1, 2, or 3. To estimate the integral in (23), we sample ten c′ for each cluster code c and then average the non-Gaussianities of the ten clusters. As shown in Fig. 4, in the training process, the non-Gaussianity of IMDGC with continuous codes overall increases. This result demonstrates that the model indeed utilizes the flexibility brought by continuous codes, which coincides with our expectation of this technique.

Besides, we empirically observe another merit of involving continuous codes. As shown in Fig. 5(c), for the generated images on Fashion-MNIST by IMDGC without continuous codes, we encounter mode collapse, where some classes of images like "bags" are missing, and the generated images from the same cluster are almost identical. Compared with this, the generated images by IMDGC with continuous codes are diverse, as shown in Fig. 5(b) and (d). These results reveal that involving continuous codes can prevent IMDGC from mode collapse. This is reasonable, as involving continuous codes improves the non-Gaussianity of the prior of each cluster according to the discussion earlier, and hence, it can prevent the prior from mode collapse.

C. Unsupervised Clustering

In this section, we compare the clustering performance of IMDGC with other deep generative methods and InfoGAN to show that IMDGC can improve the clustering performance of deep generative methods. Besides, we also show the generated images on MNIST and Fashion-MNIST by IMDGC when the cluster code is fixed. Finally, we report the results of ablation experiments on MNIST, to show that each component of IMDGC indeed improves its clustering performance.
TABLE I: Average and Standard Deviation of Accuracy (ACC) (%) and NMI of Five Trials. Standard Deviations Are Reported in the Round Brackets.

TABLE II: Best Accuracy (ACC) (%) and NMI of Five Trials.
theoretical analysis, the prior of each cluster is absolutely flexible without any constraint, but in the vanilla IMDGC the prior of each cluster is set as a Gaussian. This loss of flexibility might be harmful to clustering. A simple solution to fill this gap is involving continuous codes c′, which enables the prior of each cluster to be a universal approximator. We empirically observe some merits of this technique: improving the non-Gaussianity of each cluster's prior, preventing mode collapse, and improving clustering performance. We also observe that involving too many continuous codes leads to a decline of clustering performance, which indicates that too much flexibility of the prior might be harmful to clustering.

VII. CONCLUSION

We have proposed to learn deep generative clustering by integrating discriminative mechanisms. Specifically, based on the theoretical analysis, we have shown that integrating mutual information maximization into deep generative clustering promotes the separation of different clusters in the data space, which gives a theoretical justification to deep generative clustering. Based on this, we have proposed a novel deep generative clustering model with mutual information maximization. The proposed model, called IMDGC, jointly performs deep generative clustering and reduces the overlap among the priors of different clusters via mutual information maximization. We have evaluated our model on unsupervised clustering tasks using MNIST, Fashion-MNIST, and STL-10 w.r.t. ACC and NMI. Experimental results show that IMDGC can significantly improve the performance of deep generative clustering.

Thanks to the hierarchical structure of IMDGC, three techniques can be easily integrated to enhance and stabilize it, including matching the prior with the latent distribution of true data, maximizing the mutual information between latent variables and cluster code, and involving continuous codes. We have shown in experiments that these techniques can improve the quality of generated images, stabilize and enhance mutual information maximization, and improve the flexibility of the prior and meanwhile prevent mode collapse, respectively.

There are many possible extensions. The solution of using a flexible complex prior and performing mutual information maximization can be applied in VAEs [10]. Applying our method in supervised and semisupervised settings is another potential extension. Finally, integrating our techniques into other deep models is also a possible future work.

APPENDIX

In this appendix, we give the proof details of our proposed propositions. In addition, we provide the details of the architectures of our models.

A. Proof of Propositions

Proposition 1: Suppose (c, z, x) ∼ p(x|z)p(z|c)p(c), where c is a discrete variable in {c^(1), ..., c^(K)}, and p(c) and p(x) are fixed. If p(z|c^(i)) changes to p_δ(z|c^(i)) = p(z|c^(i)) + δp(z|c^(i)), where δp(z|c^(i)) is a perturbative function which satisfies ∫ δp(z|c^(i)) dz = 0, then the variation of I(c; x) is given by

$$\delta I(c; x)\big[p(z \mid c^{(i)})\big] \propto D_{\mathrm{KL}}\big(p_\delta(x \mid c^{(i)}) \,\|\, p(x)\big) - D_{\mathrm{KL}}\big(p_\delta(x \mid c^{(i)}) \,\|\, p(x \mid c^{(i)})\big)$$

where p_δ(x|c^(i)) = ∫ p(x|z) p_δ(z|c^(i)) dz.

Proof: Due to p(c, z, x) = p(x|z)p(z|c)p(c), we have

$$I(c; x) = \sum_{k=1}^{K} p\big(c^{(k)}\big) \iint p(x \mid z)\, p\big(z \mid c^{(k)}\big) \log \frac{\int p(x \mid z')\, p\big(z' \mid c^{(k)}\big)\, dz'}{p(x)}\, dx\, dz.$$

Without loss of generality, let δp(z|c^(i)) = ε h(z|c^(i)), where ε ≪ 1 and ∫ h(z|c^(i)) dz = 0. This notation is useful: when p(z|c^(i)) shifts to p(z|c^(i)) + δp(z|c^(i)), the mutual information I(c; x) can be expanded with respect to ε:

$$\begin{aligned}
& I(c; x)\big[p(z \mid c^{(i)}) + \delta p(z \mid c^{(i)})\big] \\
&= I(c; x)\big[p(z \mid c^{(i)}) + \varepsilon h(z \mid c^{(i)})\big] \\
&= \sum_{k \neq i} p\big(c^{(k)}\big) \iint p(x \mid z)\, p\big(z \mid c^{(k)}\big) \log \frac{\int p(x \mid z')\, p\big(z' \mid c^{(k)}\big)\, dz'}{p(x)}\, dx\, dz \\
&\quad + p\big(c^{(i)}\big) \iint p(x \mid z)\, \big[p\big(z \mid c^{(i)}\big) + \varepsilon h\big(z \mid c^{(i)}\big)\big] \log \frac{\int p(x \mid z')\, \big[p\big(z' \mid c^{(i)}\big) + \varepsilon h\big(z' \mid c^{(i)}\big)\big]\, dz'}{p(x)}\, dx\, dz \\
&= \sum_{k \neq i} p\big(c^{(k)}\big) \iint p(x \mid z)\, p\big(z \mid c^{(k)}\big) \log \frac{\int p(x \mid z')\, p\big(z' \mid c^{(k)}\big)\, dz'}{p(x)}\, dx\, dz \\
&\quad + p\big(c^{(i)}\big) \iint p(x \mid z)\, p\big(z \mid c^{(i)}\big) \log \frac{\int p(x \mid z')\, \big[p\big(z' \mid c^{(i)}\big) + \varepsilon h\big(z' \mid c^{(i)}\big)\big]\, dz'}{p(x)}\, dx\, dz \\
&\quad + \varepsilon\, p\big(c^{(i)}\big) \iint p(x \mid z)\, h\big(z \mid c^{(i)}\big) \log \frac{\int p(x \mid z')\, \big[p\big(z' \mid c^{(i)}\big) + \varepsilon h\big(z' \mid c^{(i)}\big)\big]\, dz'}{p(x)}\, dx\, dz \\
&= \sum_{k=1}^{K} p\big(c^{(k)}\big) \iint p(x \mid z)\, p\big(z \mid c^{(k)}\big) \log \frac{\int p(x \mid z')\, p\big(z' \mid c^{(k)}\big)\, dz'}{p(x)}\, dx\, dz \\
&\quad + \varepsilon\, p\big(c^{(i)}\big) \iint p(x \mid z)\, p\big(z \mid c^{(i)}\big)\, \frac{\int p(x \mid z')\, h\big(z' \mid c^{(i)}\big)\, dz'}{\int p(x \mid z')\, p\big(z' \mid c^{(i)}\big)\, dz'}\, dx\, dz \\
&\quad + \varepsilon\, p\big(c^{(i)}\big) \iint p(x \mid z)\, h\big(z \mid c^{(i)}\big) \log \frac{\int p(x \mid z')\, p\big(z' \mid c^{(i)}\big)\, dz'}{p(x)}\, dx\, dz + O(\varepsilon^2)
\end{aligned}$$
$$\begin{aligned}
&= \sum_{k=1}^{K} p\big(c^{(k)}\big) \iint p(x \mid z)\, p\big(z \mid c^{(k)}\big) \log \frac{\int p(x \mid z')\, p\big(z' \mid c^{(k)}\big)\, dz'}{p(x)}\, dx\, dz \\
&\quad + \varepsilon\, p\big(c^{(i)}\big) \iint p(x \mid z)\, h\big(z \mid c^{(i)}\big)\, dx\, dz \\
&\quad + \varepsilon\, p\big(c^{(i)}\big) \iint p(x \mid z)\, h\big(z \mid c^{(i)}\big) \log \frac{\int p(x \mid z')\, p\big(z' \mid c^{(i)}\big)\, dz'}{p(x)}\, dx\, dz + O(\varepsilon^2).
\end{aligned}$$

As ∫∫ p(x|z) h(z|c^(i)) dx dz = ∫ h(z|c^(i)) dz = 0, and collecting
Therefore, we have

$$\begin{aligned}
\delta I(c; z)\big[p(z \mid c^{(i)})\big]
&= p\big(c^{(i)}\big) \int \delta p\big(z \mid c^{(i)}\big) \log \frac{p\big(z \mid c^{(i)}\big)}{p(z)}\, dz \\
&= p\big(c^{(i)}\big) \Big( D_{\mathrm{KL}}\big(p_\delta(z \mid c^{(i)}) \,\|\, p(z)\big) - D_{\mathrm{KL}}\big(p_\delta(z \mid c^{(i)}) \,\|\, p(z \mid c^{(i)})\big) - D_{\mathrm{KL}}\big(p(z \mid c^{(i)}) \,\|\, p(z)\big) \Big) \\
&\propto D_{\mathrm{KL}}\big(p_\delta(z \mid c^{(i)}) \,\|\, p(z)\big) - D_{\mathrm{KL}}\big(p_\delta(z \mid c^{(i)}) \,\|\, p(z \mid c^{(i)})\big).
\end{aligned}$$

Similar to the discussion of Proposition 1, we can see that maximizing I(c; z) is equivalent to maximizing the approximation of $D_{\mathrm{KL}}\big(p_\delta(z \mid c^{(i)}) \,\|\, \sum_{j \neq i} p(c^{(j)})\, p(z \mid c^{(j)})\big)$, which promotes separating the priors of different clusters.

B. Architectures Details

The architectures of the generators, encoders, and discriminators for MNIST, Fashion-MNIST, and STL-10 are, respectively, shown in Tables IV–IX. Conv(k, s, c) or DeConv(k, s, c) denotes convolution or transposed convolution with kernel size k, stride s, and number of channels c, respectively. BN represents batch normalization, SN denotes spectral normalization, and LReLU is LeakyReLU with slope 0.2.
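To illustrate this notation only (the actual layer configurations are given in the tables, which are not reproduced here; the kernel size, stride, and channel numbers below are placeholders), a Conv(k, s, c) + BN + LReLU block and its spectrally normalized variant might be written in PyTorch as follows; DeConv(k, s, c) would use nn.ConvTranspose2d in the same way.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def conv_bn_lrelu(in_ch, k, s, c):
    """Conv(k, s, c) followed by BN and LReLU (LeakyReLU with slope 0.2)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, c, kernel_size=k, stride=s, padding=k // 2),
        nn.BatchNorm2d(c),
        nn.LeakyReLU(0.2),
    )

def sn_conv_lrelu(in_ch, k, s, c):
    """SN-Conv(k, s, c) + LReLU, in the style of the STL-10 networks with spectral normalization."""
    return nn.Sequential(
        spectral_norm(nn.Conv2d(in_ch, c, kernel_size=k, stride=s, padding=k // 2)),
        nn.LeakyReLU(0.2),
    )

# Placeholder example: a 4x4, stride-2 convolution producing 64 channels from a 1-channel input.
block = conv_bn_lrelu(1, k=4, s=2, c=64)
```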
TABLE IV: Generators for MNIST and Fashion-MNIST.
TABLE V: Encoders for MNIST and Fashion-MNIST.
TABLE VI: Discriminators for MNIST and Fashion-MNIST.
TABLE VIII: Encoders for STL-10.
TABLE IX: Discriminators for STL-10.

REFERENCES

[1] T. C. Silva and L. Zhao, "Network-based high level data classification," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 6, pp. 954–970, Jun. 2012.
[2] Y. Wang, S. Chen, and Z.-H. Zhou, "New semi-supervised classification method based on modified cluster assumption," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 5, pp. 689–702, May 2012.
[3] J. Gui, T. Liu, D. Tao, Z. Sun, and T. Tan, "Representative vector machines: A unified framework for classical classifiers," IEEE Trans. Cybern., vol. 46, no. 8, pp. 1877–1888, Aug. 2016.
[4] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer, 2006.
[5] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, "Maximum margin clustering," in Proc. Adv. Neural Inf. Process. Syst., 2004, pp. 1537–1544.
[6] A. Krause, P. Perona, and R. G. Gomes, "Discriminative clustering by regularized information maximization," in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 775–783.
[7] J. T. Springenberg, "Unsupervised and semi-supervised learning with categorical generative adversarial networks," in Proc. Int. Conf. Learn. Represent., 2016, pp. 1–20.
[8] N. Dilokthanakul et al., "Deep unsupervised clustering with Gaussian mixture variational autoencoders," in Proc. Int. Conf. Learn. Represent., 2017, pp. 1–12.
[9] J. Xie, R. Girshick, and A. Farhadi, "Unsupervised deep embedding for clustering analysis," in Proc. Int. Conf. Mach. Learn., 2016, pp. 478–487.
[10] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. Int. Conf. Learn. Represent., 2014, pp. 1–14.
[11] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," in Proc. Int. Conf. Mach. Learn., 2014, pp. 1278–1286.
[12] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou, "Variational deep embedding: An unsupervised and generative approach to clustering," in Proc. 26th Int. Joint Conf. Artif. Intell., Aug. 2017, pp. 1965–1972.
[13] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama, "Learning discrete representations via information maximizing self-augmented training," in Proc. Int. Conf. Mach. Learn., 2017, pp. 1558–1567.
[14] K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang, "Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5747–5756.
[15] I. Goodfellow et al., "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[17] H. Xiao, K. Rasul, and R. Vollgraf, "Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms," 2017, arXiv:1708.07747.
[18] A. Coates, A. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in Proc. 14th Int. Conf. Artif. Intell. Statist., 2011, pp. 215–223.
[19] H. Jia and Y.-M. Cheung, "Subspace clustering of categorical and numerical data with an unknown number of clusters," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 8, pp. 3308–3325, Aug. 2017.
[20] X. Yang, C. Deng, K. Wei, J. Yan, and W. Liu, "Adversarial learning for robust deep clustering," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 9098–9108.
[21] P. Zhou, Y. Hou, and J. Feng, "Deep adversarial subspace clustering," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1596–1604.
[22] D. Barber and F. Agakov, "Kernelized infomax clustering," in Proc. Adv. Neural Inf. Process. Syst., 2005, pp. 17–24.
[23] N. Slonim, G. S. Atwal, G. Tkačik, and W. Bialek, "Information-based clustering," Proc. Nat. Acad. Sci. USA, vol. 102, no. 51, pp. 18297–18302, 2005.
[24] X. Yang, C. Deng, F. Zheng, J. Yan, and W. Liu, "Deep spectral clustering using dual autoencoder network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4066–4075.
[25] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2180–2188.
[26] D. Calandriello, G. Niu, and M. Sugiyama, "Semi-supervised information-maximization clustering," Neural Netw., vol. 57, pp. 103–111, Sep. 2014.
[27] T. S. Jaakkola, M. Meila, and T. Jebara, "Maximum entropy discrimination," in Proc. Adv. Neural Inf. Process. Syst., 1999, pp. 470–476.
[28] S. Mukherjee, H. Asnani, E. Lin, and S. Kannan, "ClusterGAN: Latent space clustering in generative adversarial networks," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 4610–4617.
[29] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, "Adversarial autoencoders," in Proc. Int. Conf. Learn. Represent., 2016, pp. 1–16.
[30] J. Donahue, P. Krahenbuhl, and T. Darrell, "Adversarial feature learning," in Proc. Int. Conf. Learn. Represent., 2017, pp. 1–18.
[31] V. Dumoulin et al., "Adversarially learned inference," in Proc. Int. Conf. Learn. Represent., 2017, pp. 1–18.
[32] W. Wu, J. Yan, X. Yang, and H. Zha, "Discovering temporal patterns for event sequence clustering via policy mixture model," IEEE Trans. Knowl. Data Eng., early access, Apr. 10, 2020, doi: 10.1109/TKDE.2020.2986206.
[33] R. Wang, J. Yan, and X. Yang, "Graduated assignment for joint multigraph matching and clustering with application to unsupervised graph matching network learning," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 19908–19919.
[34] S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Müller, "The manifold tangent classifier," in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 2294–2302.
[35] H. Zhang et al., "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5907–5915.
[36] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[37] A. Krizhevsky et al., "Learning multiple layers of features from tiny images," M.S. thesis, Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, 2009.
[38] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–14.
[39] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[40] N. X. Vinh, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance," J. Mach. Learn. Res., vol. 11, pp. 2837–2854, Jan. 2010.
[41] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Res. Logistics Quart., vol. 2, nos. 1–2, pp. 83–97, Mar. 1955.
[42] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–15.

Xiaojiang Yang received the B.E. degree in physics from Nankai University, Tianjin, China, in 2018. He is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. His research interests include machine learning and computer vision, with emphasis on generative models and representation learning.

Junchi Yan (Senior Member, IEEE) received the Ph.D. degree from the Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China, in 2015. He was a Senior Research Staff Member and a Principal Scientist for industrial vision with IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, where he started his career in 2011. He is currently an Associate Professor with the Department of Computer Science and Engineering, Shanghai Jiao Tong University. His research interests are machine learning.

Yu Cheng (Member, IEEE) received the bachelor's degree from Tsinghua University, Beijing, China, in 2010, and the Ph.D. degree from Northwestern University, Xi'an, China, in 2015. He was a Research Staff Member with the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA. He is a Principal Researcher at the Microsoft Research Laboratory, Redmond, WA, USA. His research interest includes deep learning, with emphasis on the deep generative model, model compression, and adversarial learning. Dr. Cheng regularly serves on the program committees of top-tier AI conferences, such as Neural Information Processing Systems (NIPS), International Conference on Machine Learning (ICML), International Conference on Learning Representation (ICLR), IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), and Annual Meeting of the Association for Computational Linguistics (ACL).

Yizhe Zhang received the B.S. degree in physics from Nanjing University, Nanjing, China, in 2011, and the M.S. degree in statistical science and the Ph.D. degree from Duke University, Durham, NC, USA, both in 2018, under the supervision of Prof. Lawrence Carin. He is currently a Senior Researcher with the Microsoft Research Laboratory, Redmond, WA, USA, where he is primarily working on natural language generation, deep generative models, generative pretraining, and text representation learning.