0% found this document useful (0 votes)
80 views21 pages

ML (CSE-531) Assignment

This document is an assignment submitted by 5 students to their Machine Learning professor at East West University on the topic of "Clustering Ensemble Algorithms". It provides an abstract discussing cluster ensemble methods which generate multiple clusterings from a dataset and combine them to improve quality over individual clusterings. The assignment then covers key areas like properties of good cluster ensemble algorithms, challenges with high-dimensional data that deep learning can help with, and provides a taxonomy and overview of clustering with deep learning approaches.

Uploaded by

Ankur Mallick
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
80 views21 pages

ML (CSE-531) Assignment

This document is an assignment submitted by 5 students to their Machine Learning professor at East West University on the topic of "Clustering Ensemble Algorithms". It provides an abstract discussing cluster ensemble methods which generate multiple clusterings from a dataset and combine them to improve quality over individual clusterings. The assignment then covers key areas like properties of good cluster ensemble algorithms, challenges with high-dimensional data that deep learning can help with, and provides a taxonomy and overview of clustering with deep learning approaches.

Uploaded by

Ankur Mallick
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

Assignment

on
“Clustering Ensemble Algorithms”
Course Title: Machine Learning
Course Code: CSE-531
Submitted To
Dr. Mohammad Shafiul Alam
Associate Professor

Submitted By

Kazi Sayeed Hasan 2018-2-96-004

Afia Anjum 2018-2-96-005

Ankur Mallik 2018-2-96-008

Md. Nurul Huda 2018-3-96-008

Saeed Zaki Yamani 2018-3-96-009

EAST WEST UNIVERSITY


Jahurul Islam City Gate, A/2 Jahurul Islam Ave, Dhaka 1212

Assignment-CSE-531/East West University


Clustering Ensemble Algorithms,Research
Advance Of Clustering Ensemble Algorithm
and Clustering With Deep Learning
Saeed Zaki Yamani Afia Anjum Ankur Mallik
2018-3-96-009 2018-2-96-005 2018-2-96-008
East West University East West University East West University
Dhaka, Bangladesh Dhaka, Bangladesh Dhaka, Bangladesh

Kazi Sayeed Hasan Md. Nurul Huda


2018-2-96-004 2018-3-96-008
East West University East West University
Dhaka, Bangladesh Dhaka, Bangladesh

Abstract— Cluster ensemble is always a relatively active and lower similarity exist among data in different classes or
challenging research direction in the field of machine learning clusters.
and pattern recognition.The Clustering algorithm has good
theoretical basis which is widely used in machine learing,data A large variety of clustering algorithms has been
mining,pattern recognition,image analysis and other areas.It proposed: k-Means, EM (Expectation Maximization), based
becomes an important research in the field of cluster on spectral graph theory, hierarchical clustering algorithms
analysis,taking more more researches’ attention.However,there like Single-Link, Fuzzy c-Means, etc.As one kind of
is rather deficient reference materialon clustering research in clustering algorithm,the spectral clustering algorithm takes a
our country.Cluster ensemble has proved to be a good large number of researchers‟ attention because it is based on
alternative when facing cluster analysis problems. It consists of spectrograms and easy to achieve.The spectral clustering
generating a set of clusterings from the same dataset and algothim has a better non-convex recognition ability and can
combining them into final clustering. The goal of this avoid trapping in local optimum.However, as it is known,
combination process is to improve the quality of individual there is no clustering method capable of correctly finding the
data clusterings. Due to the increasing appearance of new underlying structure for all data sets.
methods, their promising results and the great number of
applications, it is necessary to make a critical analysis of the When we apply a clustering algorithm to a set of objects,
existing techniques and future projections.Clustering is a it imposes an organization to the data following an internal
fundamental problem in many data-driven application criterion, the characteristics of the used dissimilarity function
domains, and clustering performance highly depends on the and the dataset. Hence, if we have two different clustering
quality of data representation. Hence, linear or non-linear algorithms and we apply them to the same dataset, we can
feature transformations have been extensively used to learn a obtain very different results. But, which is the correct one?
better data representation for clustering. In recent years, a lot How can we evaluate the results? In clustering analysis, the
of works focused on using deep neural networks to learn a evaluation of results is associated to the use of cluster
clustering-friendly representation, resulting in a significant validity indexes (CVI), which are used to measure the quality
increase of clustering performance. Here, a systematic survey of clustering results. Nevertheless, the use of the CVIs is not
of clustering with deep learning in views of architecture in
the definite solution. There is no CVI that impartially
given. Specifically, we first introduce the preliminary
evaluates the results of any clustering algorithm. Thus, we
knowledge for better understanding of this field. Then, a
taxonomy of clustering with deep learning is proposed and can say that different solutions obtained by different
some representative methods are introduced. Finally, propose clustering algorithms can be equally plausible, if there is no
some interesting future opportunities of clustering with deep previous knowledge about the best way to evaluate the
learning and give some conclusion remarks. results. Roughly, we can assure that for any clustering
algorithm there is a CVI that will evaluate satisfactorily its
Keywords— Clustering; Cluster ensemble; Cluster member; results.
Cluster analysis; Consensus partition, Deep learning, Data
The idea of combining different clustering results (cluster
representation, Network architecture.
ensemble or clustering aggregation) emerged as an
I. INTRODUCTION alternative approach for improving the quality of the results
of clustering algorithms. It is based on the success of the
Cluster analysis is an essential technique in any field of combination of supervised classifiers. Given a set of objects,
research which involves analyzing or processing multivariate a cluster ensemble method consists of two principal steps:
data, such as: data mining, taxonomy, document retrieval, Generation, which is about the creation of a set of partitionsa
image segmentation, pattern classification, etc. Its goal is to of these objects, and Consensus Function, where a new
attend the underlying structure of a dataset following several partition, which is the integration of all partitions obtained in
clustering criteria, specific properties in the data and the generation step, is computed.
different ways of data comparison. Cluster refers to dividing
data in the dataset into multiple classes or clusters by some In different articles about clustering ensemble, authors
methods and making the data in each classes or clusters have have tried to define a set of properties that endorses the use
higher degree of similarity.On the contray,great differences of clustering ensemble methods, such as by Fred [29] and
Jain and Topchy [30] et al.Cluster ensemble refers to make a data into one cluster based on some similarity measures (e.g.,
plurality of cluster results objects to merge into a final Euclidean distance). Although a large number of data
partition which can satisfy the specified task.However, which clustering methods have been proposed,conventional
are the properties that should fulfll a clustering ensemble clustering methods usually have poor performance on high-
algorithm? There is no agreement about this unanswered dimensional data, due to the inefficiency of similarity
question. On top of it, the verification of any of these measures used in these methods. Furthermore, these meth-
properties in practice is very dificult due to the unsupervised ods generally suffer from high computational complexity on
nature of the clustering ensemble process. Some of them are: large-scale datasets. For this reason, dimensionality reduc-
tion and feature transformation methods have been exten-
 Robustness: The combination process must have sively studied to map the raw data into a new feature space,
better average performance than the single where the generated data are easier to be separated by exist-
clustering algorithms. ing classifiers. Generally speaking, existing data transfor-
 Consistency: The result of the combination should mation methods include linear transformation like Principal
be somehow, very similar to all combined single component analysis (PCA) and non-linear transforma- tion
clustering algorithm results. such as kernel methods and spectral methods. Nevertheless, a
highly complex latent structure of data is still challenging the
 Novelty: Cluster ensembles must allow finding effectiveness of existing clustering methods. Owing to the
solutions unattainable by single clustering development of deep learning, deep neural networks (DNNs)
algorithms. can be used to transform the data into more clustering-
friendly representations due to its inherent property of highly
 Stability: Results with lower sensitivity to noise
non-linear transformation. For the sim- plicity of description,
and outliers.
we call clustering methods with deep learning as deep
Properties like these are expected to be present in the clustering.
results of a clustering ensemble process. However, the
Basically, previous work mainly focuses on feature trans-
natural organization of data or the ground-truth cannot be
formation or clustering independently. Data are usually
expected as the best result. Moreover, it cannot be said that
mapped into a feature space and then directly fed into a
the clustering results obtained by a cluster ensemble method
clustering algorithm. In recent years, deep embedding clus-
is better than those which were combined. It can only be
tering (DEC) was proposed and followed by other novel
ensured that the new clustering is a consensus of all the
methods , making deep clustering become a popular research
previous ones, and we can use it instead of any other
field. Recently, an overview of deep clustering was proposed
clustering assuming as a fact that: the process of fusion could
in to review most remarkable algorithms in this field.
compensate for possible errors in a single clustering
Specifically, it presented some key elements of deep
algorithm, and the decision of a group must be more reliable
clustering and introduce related methods. However, this
than any individual one.
paper mainly focuses on methods based on autoencoder , and
II. THEORY OF THE CLUSTERING ENSEMBLE it was incapable of generalizing many other important
ALGORITHM methods, e.g., clustering based on deep generative model.
What is worse, some up-to-date progress is also missing.
This assumption is endorsed by an increasing number of Therefore, it is meaningful to conduct a more systematic
applications of the clustering ensemble methods in different survey covering the advanced methods in deep clustering.
areas.Over the past years, many clustering ensemble Classical clustering methods are usually categorized as
techniques have been proposed, resulting in new ways to partition-based methods, density-based methods hierarchical
face the problem together with new fields of application for methods and so on. However, since the essence of deep
these techniques. Despite the large number of clustering clustering is to learning a clustering-oriented representation,
ensemble methods, there are only a few papers with the it is not suitable to classify methods according to the
purpose of giving a summary of some of the existing clustering loss, instead, we should focus on the network
clustering ensemble techniques, e.g. by Ghaemi et al. and Li architecture used for clustering.
[31] et al. However, we think that a more general and
complete study of the clustering ensemble methods is still Fred [2] took the lead in the preliminay study of
necessary. Besides the presentation of the main methods, the clustering ensemble.However, its concept was official
introduction of a taxonomy of the different tendencies and proposed bt Strehl and Ghosh in literature [3] until
critical comparisons among the methods is really important 2002.Strehl and Ghosh thought that combined clustering
in order to give a practical application to a survey. Thus, due objects in the same group with different features, but did not
to the importance that clustering ensembles have gained use their original features.A detailed description as follows:
facing cluster analysis problems and the amount of articles Assumes that the data set D contains n data points,
published on this topic, we have made a critical study of the
{ } M times cluster the data set D,get a M
different approaches and the existing methods. This can be
cluster members containing set ∏=
very useful for the community of clustering practitioners
since showing the advantages and disadvantages of each are the clustering
members by cluster times i. ki represents the number of
method, their implicit assumptions, can help in the selection
of the appropriate clustering ensemble algorithm to solve a members in the clustering set. After then, using an
problem on hand. appropriate consensus function F is used to integrate M
clustering members to get an optimal consistent clustering
Data clustering is a basic problem in many areas, such as results Whether the clustering ensemble algorithm can
machine learning, pattern recognition, computer vision, data achieve the optimal clustering results,the most critical factor
compression. The goal of clustering is to categorize similar is the consensus function can be used to integrate for
multiple clustering results with different identities sets parallel, merger effectively upper results and the results
effectively.Therefore,the problem of clustering ensemble of distributed data source or features.
algorithm can be transformed into the following two
problems: V. CLUSTERING ENSEMBLES
The basic idea of clustering ensemble is based on
(1) Construct differentiated clustering members.
classification ensemble, and its purpose is to cluster the
(2) Design consensus function to integrate clustering original data with several independent classifiers, then
members. combine these clustering results which finally obtain the best
consensus division. The process of cluster ensemble is as
III. GENERATE DIFFERENTIATED CLUSTERING MEMBERS follows:
Firstly, generate differentiated clustering members. The
independent classifier is an important tool to generate
Firstly, the clustering ensemble algorithm should have a original clustering results, yet produce a number of different
clustering collectively. Generally, the differences of clustering results that have a certain differentiation.
clustering members in clustering collectivity are considered
to be one of the key factors influencing the integration results Secondly, design consensus function to obtain
[4]. The problem of members clustering in clustered collective relations. In this progress, we need to design
collectivity is also called ensemble or consistency problem. consensus function and take advantage of consensus function
The methods of generating differentiated clustering members to integrate a clustering results from the cluster members.
are: Finally, an optimal clustering results is determined by
(1) Cluster the same dataset by using different the integration relation as the final clustering results.
clustering algorithms to generate multiple clustering results
which will be used to construct by using these clustering
results.
(2) Use the same clustering algorithm but different
initialization parameters or random initial clustering center
to generate multiple clustering members for constructing
clustering collectivity. This method has the advantage of low
algorithm complexity and easy operation. K-means is its
typical algorithm.
(3) Based on the original data set by resampling
method to generate multiple data sets which will be
clustered by the same clustering method to get multiple
clustering members.
(4) Generate a clustering collectivity by clustering
different feature subsets of the data or different subspace
projection of datasets.
Fig 1. The Progress of cluster ensemble
IV. FEATURES OF CLUSTERING ENSEMBLE ALGORITHM
The progress of clustering is shown in Fig.1. The
clustering ensemble algorithm makes full use of the idea of
Compared with a single clustering algorithm, clustering ensemble learning,and integrates multiple clustering results
ensemble algorithm has some obvious advantages and through a consensus function to improve the
characteristics, mainly reflected as follows: robustness,novelty and stability.
(1) The average performance of the results divided by Every clustering ensemble method is made up of two
cluster ensemble is more prominent. This reflects the steps: Generation and Consensus Function (see Fig. 2). The
comprehensive performance of several clustering algorithms different ways of generation are described in Sec. A and in
and improve the quality of clustering results. Sec. B the principal consensus function methods are
(2) Clustering ensemble algorithm can obtain the discussed.
clustering results which is difficult to be obtained by a single
A. Generation mechanisms
clustering algorithm and its clustering results can be repeated
and fully utilized. Generation is the first step in clustering ensemble
(3) Clustering ensemble algorithm can detect and deal methods, in this step the set of clusterings that will be
with noise and outliers, and its results are not only less combined is generated. In a particular problem, it is very
affected by noise, isolated points and sampling methods and important to apply an appropriate generation process,
other factors, but also partially overcome the clustering because the final result will be conditioned by the initial
algorithm itself caused by the problem of sensitive clusterings obtained in this step. There are clustering
parameters. ensemble methods like the voting-k-means that demand a
(4) Clustering ensemble algorithm can evaluate the well-determined generation process, in this case, all the
partitions should be obtained by applying the k-Means
uncertainty of clustering from the integrated distribution.
algorithm with different initializations for the number of
(5) Clustering ensemble algorithm can cluster data
clusters parameters. This method uses a big k value (the function is available this diversity can be obtained by using
number of clusters), in order to obtain complex structure in the different generation mechanism presented in fig 3.
the consensus partition, from the combination of small
hyper-spherical structures in the single partitions. B. Consensus functions
The consensus function is the main step in any clustering
ensemble algorithm. Precisely, the great challenge in
clustering ensemble is the definition of an appropriate
consensus function, capable of improving the results of
single clustering algorithms. In this step, the final data
partition or consensus partition P , which is the result of any
clustering ensemble algorithm, is obtained. However, the
consensus among a set of clusterings is not obtained in the
same way in all cases.There are two main consensus function

Fig.2. Diagram of general process of cluster example

Fig.3. Diagram of the principle clustering ensemble approaches. objects co-occurrence and median partition.
generation mechanisms
In the first approach, the idea is to determine which must
However, in a general way, in the generation step there be the cluster label associated to each object in the consensus
are no constraints about how the partitions must be obtained. partition. To do that, it is analyzed how many times an object
Therefore, in the generation process different clustering belongs to one cluster or how many times two objects belong
algorithms or the same algorithm with different parameters together to the same cluster. The consensus is obtained
initialization can be applied. Even different objects through a voting process among the objects. Somehow, each
representations, different subsets of objects or projections of object should vote for the cluster to which it will belong in
the objects on different subspaces could be used (see Fig. 3). the consensus partition. This is the case, for example, of
Relabeling and Voting (Sec. B.1) and Co-association Matrix
In the generation step the weak clustering algorithms are (Sec. B.2) based methods.
also used. These algorithms make up a set of clusterings
using very simple and fast procedures. Despite the simplicity In the second consensus function approach, the consensus
of this kind of algorithms, Topchy et al. showed that weak partition is obtained by the solution of an optimization
clustering algorithms are capable of producing high quality problem, the problem of finding the median partition with
consensus clusterings in conjunction with a proper consensus respect to the cluster ensemble. Formally, the median
function. partition is defined as:
In a general way, in the generation step, it is advisable to
use those clustering algorithms that can yield more
information about the data. It can often be very difficult to
know a priori which clustering algorithm will be appropriate …..(Fig.4)
for a specific problem. The expert's experience of the
The first mathematical treatment of the median partition
problem area could be very useful in these cases. Besides, if
problem (1) was presented by Regnier. From this moment
there is no information about the problem, making a diverse
on, several studies about the median partition problem have
cluster ensemble is recommended, since the more varied the
been made. However, the main theoretical results have been
set of partitions is, the more information for the consensus
obtained for the particular case when is the symmetric
difference distance (or the Mirkin distance).63 Krivanek and Entropy[8] Normalized Mutual Information, Utility
Moravek and also Wakabayashi proved by different ways Function, Variation of Information and V-measure.
that the median partition problem (1) with the Mirkin
distance is NP-hard. This proof was given for the case where  Kernel measures: These measures are defined
there is a variable number of partitions m in the cluster speci fically for the median partition problem and
ensemble. However, it is not known whether it is a NP-hard are proven to be positive semide nite kernels. Some
problem for any particular m value.25 For m = 1 or m = 2 the of them are the Graph Kernel based measure and
solution of the problem is trivial, but for m > 2 nothing is the Subset Significance based measure.
known about the computational complexity.
The median partition problem with other dissimilarity Consensus functions based on the median partition
measures has not been properly studied. The complexity of approach (1) have been theoretically more studied than the
the general problem is dependent on the (dis)similarity ones based on the objects co-occurrence approach. The
measure used in its definition. Despite the fact that the median partition approach allows facing the consensus
median partition problem has been proved to be NP-hard problem in a more rigorous way. In spite of that, Topchy et
when it is defined with the Mirkin distance, we can find a al.79 give theoretical arguments about the validity of both
(dis)similarity measure for which the problem can be solved approaches. They showed that the consensus solution
in polynomial time. For example, defining the median converges to the underlying clustering solution as the
partition problem with the following similarity measure number of partitions in the ensemble increases. However, in
both approaches there are problems without de finite
solution, e.g. in the object co-occurrence approach generally,
the application of a clustering algorithm as annual step to
find the consensus is necessary, but the questions are: Which
clustering algorithm should be used? Which are the correct
In the above example, the median partition can be parameters?In the median partition approach, a dissimilarity
obtained in polynomial time because one of the partitions in measure between clusterings is necessary, but which is the
P is the solution. Indeed, if all partitions in P are different, correct dissimilarity measure? Besides, the consensus
the solution can be found in O(1), since any partition in P is partition is usually depended as the optimum of an
the solution to the problem. However, the similarity function exponential optimization problem; however, which is the
does not have practical relevance, because it is a very best heuristic to solve the problem or to come close to the
weak similarity measure between partitions. Hence, the solution?
following question comes up. Is there any strong similarity
measure between partitions, so it allows solving the median A lot of clustering ensemble methods have been
partition problem in polynomial time? To the extent of the proposed in recent years trying to answer questions like the
authors knowledge, this is an unanswered question. We think previous ones. The consensus problem has been faced by
that this question has not been deeply studied and a positive using several mathematical and computational tools.
answer may lead to a promising clustering ensemble Methods based on Relabeling and Voting, Co-association
technique. Matrix, Graph and Hypergraph partitioning, Mirkin distance,
Information Theory, Finite Mixture Models, Genetic
Besides the Mirkin distance, there are a lot of Algorithms, Locally Adaptive Clustering Algorithms (LAC),
(dis)similarity measures between partitions that can be used Kernel methods, Non-Negative Matrix Factorization (NMF)
in the definition of the median partition problem. Deep and Fuzzy techniques can be found. In Fig. 5, a taxonomy of
analyses of the different dissimilarity measures between the main consensus functions is presented.
partitions can be found in Refs. 5, 62 and 63. However, these
analyses were motivated by an interest in finding the best In Fig. 5, besides the taxonomy based on the
external cluster validity index.41 Therefore, the properties of mathematical or computation tool used in each kind of
these measures are not studied from the perspective of how clustering ensemble technique, a correspondence between
they can be suitable for the median partition problem. each kind of technique and one of the two consensus
function approaches defined above (object co-occurrence or
Among the main dissimilarity measures between median partition) is presented. In principle, this corre-
partitions we can find: spondence between these two taxonomies of consensus
functions does not have to be unique, e.g. there could be two
 Counting pairs measures: These measures count
consensus clustering methods based on genetic algorithms,
the pairs of objects on which two partitions agree or
one following the co-occurrence approach and the other, the
disagree. Some of them are the Rand index,68
median partition approach. However, we made explicit this
Fowlkes-Mallows index,28 the Jaccard coefcient,9
correspondence since it actually holds in practice. On the
the Mirkin distance63 and some adjusted versions
other hand, some consensus functions present the peculiarity
of these measures.
that they are defined through the median partition problem,
 Set matching measures:These measures are based but in practice, the consensus partition is obtained by means
on set cardinality comparisons. Some of them are of a mechanism related with the object co-occurrence
the Purity and Inverse Purity the F measure and approach. These are the cases, for instance, of the Graph and
Dongen measure. Hypergraph based methods (Sec. B.3) and Information
Theory based methods (Sec. B.5). We put these algorithms in
 Information Theory based measures: These the object co-occurrence classification in Fig. 5.
measures quantify the information shared between
two partitions. Some of them are the Class In the next sections, we will present an analysis of each
kind of clustering ensemble methods. In this analysis, we
will explain the most popular clustering ensemble Dudoit and Fridlyand[22] and Fischer and Buhmann
techniques. Also, for each kind of method, we will talk about presented a voting consensus algorithm similar to plurality
its strength and weakness for defining the clustering voting used in supervised classifiers ensembles. In this
ensemble problem, as well as their advantages and method, it is assumed that the number of clusters in each
drawbacks for obtaining the consensus partition. partition is the same and equal to the final number of clusters
in the consensus partition. The labeling correspondence
1. Relabeling and Voting based methods problem is solved through a maximum-likelihood problem
The Relabeling and Voting methods are based on solving using the Hungarian53 method. After that, a plurality voting
as first step the labeling correspondence problem and after procedure is applied to obtain the winner cluster for each
that, in a voting process, the consensus partition is obtained. object.
The labeling correspondence problem consists of the The Voting Active Clusters method provides an adaptive
following: the label associated to each object in a partition is voting method where the votes are updated in order to
symbolic; there is no relation between the set of labels given maximize an overall quality measure. This method allows
by a clustering algorithm and the set of labels given by the combination of clusterings from different locations, i.e.
another one. The label correspondence is one of the main all the data does not have to be collected in one central work
issues that makes unsupervised combination difficult. The
station. The idea is to make different clusterings from
different clustering ensemble methods based on relabeling different portions of the original data in separate processing
try to solve this problem using different heuristics such as centers. Afterwards, the consensus clustering is obtained
bipartite matching and cumulative voting. A general through a voting mechanism.
formulation for the voting problem as a multi-response
regression problem was recently presented by Ayad and If there exists a relation among the labels associated for
Kamel.6 Among the relabeling based methods Plurality each clustering algorithm, the voting definition of the
Voting (PV),[26] Voting-Merging (V-M),[18] Voting for clustering ensemble problem would be the most appropriate.
fuzzy clusterings,[19] voting Active Clusters (VAC), However, the labeling correspondence problem is what
Cumulative Voting (CV)7 and the methods proposed by makes the combination of clusterings difficult. This
Zhou and Tang1[03] and Gordon and Vichi are found. correspondence problem can only be solved, with certain
accuracy, if all partitions have the same number of clusters.

Fig. 5. Diagram of the principal consensus functions techniques. Consensus functions based on object co-occurrence
approach are represented by a rectangle (left) and the ones based on the median partition approach are represented by a
rounded rectangle (right).
We consider this to be a strong restriction to the cluster In the co-association matrix (2), δ(a b) takes only the
ensemble problem. Then, in general, they are not values 0 or 1. That way, the new similarity between objects
recommended when the number of clusters in all partitions in is computed only by taking into account whether the two
the ensemble is not the same. Besides, very frequently, they objects belong to the same cluster or not. We think that a
could have high computational cost since the Hungarian representation, which uses additional information to make
algorithm to solve the label correspondence problem is the similarity measure should be more expressive about the
O(k3), where k is the number of clusters in the consensus real relationship between the objects.
partition. On the other hand, these kinds of algorithms are
usually easy to understand and implement. In this direction, two similarity matrixes: Connected-
Triple Based Similarity (CTS) and SimRank Based
Similarity (SRS) are proposed by Iam-on et al.46 The CTS
works on the basis that if two objects share a link with a third
2. Co-association matrix based methods object, then this is indicative of similarity between those two
The idea of co-association is used to avoid the label objects. The SRS reflects the underlying assumption that
correspondence problem. Co-association methods (see Ref. neighbors are similar if their neighbors are similar as well.
30), map the partitions in the cluster ensemble into an Also, Vega-Pons and Ruiz-Shulcloper87 presented the
intermediate representation: the co-association matrix. Each Weighted Co-Association Matrix, which computes the
cell in the matrix has the following value: similarity between objects using the size of the cluster, the
number of clusters in each partition and the original
similarity values between the objects. Besides, Wang et al.[9]
introduced the Probability accumulation matrix, which is
…….(Fig.6) conformed taking into account the size of clusters, as well as
where Pt(xi) represents the associated label of the object the number of features in the object representation. These
xi in the partition Pt, and (a ,b) is 1, if a = b, and 0 otherwise. matrixes take into account more information than the
Then, the value in each position ði; jÞ of this matrix is a traditional co-association (2) and they can measure the pair-
measure about how many times the objects xi and xj are in wise correlation between objects in higher resolution.
the same cluster for all partitions in P. This matrix can be All the co-association methods are based on the
viewed as a new similarity measure between the set of construction of a new similarity measure between objects
objects X. The more objects xi and xj appear in the same from the clustering ensemble. Also, a clustering algorithm to
clusters, the more similar they are. Using the co-association obtain the final partition is necessary. Hence, the consensus
matrix CA as the similarity measure between objects, the clustering will be conditioned by the way that the similarity
consensus partition is obtained by applying a clus- tering is created and the particular algorithm applied (and its
algorithm. parameters initialization). Besides, this kind of algorithms
In Ref. 29, a fixed threshold equal to 0.5 is used to have a computational complexity of O(n2), and cannot be
generate the final consensus partition. It is obtained by applied to large datasets. However, they are very easy to
joining in the same cluster, objects with a co-association implement and understand.
value greater than 0:5. 3. Graph and hypergraph based methods
Fred and Jain[30] proposed a modification where an This kind of clustering ensemble methods transform the
algorithm is applied to nd a minimum spanning tree after combination problem into a graph or hypergraph partitioning
obtaining the co-association matrix, i.e. seeing the co- problem. The difference among these methods lies on the
association matrix as an adjacency matrix of a graph, a tree way the (hyper)graph is built from the set of clusterings and
that contains all the nodes of the graph and the minimum how the cuts on the graph are defined in order to obtain the
weights in their edges are searched. Then, the weak links consensus partition.
between nodes are cut using a threshold r. This is equivalent
to cutting the dendrogram produced by the Single Link Strehl and Ghosh[3] defined the consensus partition as
(SL)[47] algorithm using the threshold r. This threshold is the partition that most information shares with all partitions
obtained by using a simple but effective heuristic called in the cluster ensemble. To measure the information shared
highest lifetime criterion. In Ref. 30 the k-cluster lifetime is by two partitions, the Normalized Mutual Information (NMI)
defined as the range of threshold values on the dendrogram is used based on the Mutual Information concept from
to obtain k clusters. After computing the life- time value of Information Theory.
each level, the one with the highest value is selected as the The graph and hypergraph based methods are among the
final partition of the data. Besides, the Complete-Link (CL), most popular methods. They are easy to understand and
Average-Link (AL) and other hierarchical clustering implement. Moreover, in most cases they have low
algorithms can be used as variants of this method. computational complexity (less than quadratic in the number
Li et al. introduced a new hierarchical clustering of objects), for example, HGPA (O(k.n.m)), MCLA (O(k2
algorithm that is applied to the co-association matrix to .n.m2)) and HBGF (O(k.n.m)), where n is the number of
improve the quality of the consensus partition. This objects, m the number of partitions and k the number of
algorithm is based on the development of the concept of clusters in the consensus partition. Only the CSPA method
normalized edges to measure similarity between clusters. has a computational and storage complexity of O(k .n2.m),
which is quadratic in the number of objects. We put more
attention in the complexity respect to the number of objects
n, because in practice, m, n and k almost always takes 8. Fuzzy clustering based methods
relatively small values.
There are some clustering ensemble methods that work
4. Mirkin distance based methods with fuzzy clusterings. There are very popular clustering
algorithms like EM and fuzzy-c-means that naturally output
Given two partitions Pa and Pb of the same dataset X the fuzzy partitions of data. If the results obtained by these
following four categories are de ned: methods are forcibly hardening i.e. convert fuzzy partition in
 n00: The number of pairs of objects that were hard partitions of the data, valuable information for the
clustered in separate clusters in Pa and also in Pb. combination process could be lost. Thus, to combine the
fuzzy partitions directly may be more appropriate than
 n01: The number of pairs of objects that were hardening first and after that using a hard clustering
clustered in different clusters in Pa, but in the same ensemble method. The consensus partition obtained by soft
cluster in Pb. clustering ensemble methods could be hard or soft. In this
 n10: The number of pairs of objects that section we only refer to the methods that output hard nal
were co-clustered in the same cluster in Pa, but not clusterings since they can be used for the same purpose as all
in Pb. n11: The number of pairs of objects that previous clustering ensemble methods: given a set of objects,
were co-clustered in both partitions. obtaining a hard partitioning of them. The fuzzy clusterings
of data are only used in internal steps of the methods.
The symmetric difference distance or Mirkin distance M
is defined as M(Pa,Pb) = n01 + n10, which represents the
number of disagreements between the two partitions. The VI. CLUSTERING DISCRIMINATION TECHNIQUES
median partition problem defined with this similarity
measure (4) was proven to be a NP-complete problem. The general methodology in a clustering ensemble
algorithm, is made up of two steps: Generation and
Consensus. Most of the clustering ensemble algorithms use
5. Finite mixture models based methods in the consensus step all the partitions obtained in the
generation step. Besides, they combine all partitions giving
Topchy et al.76 proposed a new consensus function, to each one the same significance. Therefore, a simple
where the consensus partition is obtained as the solution of a average of all clusterings does not have to be the best choice.
maximum likelihood estimation problem. The problem of
maximum likelihood is solved by using the EM algorithm. However, in particular situations, all clusterings in the
cluster ensemble may not have the same quality, i.e. the
In this method, the data is modeled as random variables information that each one contributes may not be the same.
and it is assumed that they are independent and identically
distributed which are three restrictions to the general In this direction, two different approaches appear. The
problem. esides the number of clusters in the consensus idea of both approaches is to inspect the generated partitions
partition must be xed because it is necessary to know the and make a decision that assists the combination process.
number of components in the mixture model. However, this The rst one consists in selecting a subset of clustering to
method has a low computational complexity O(k.n.m) create an ensemble committee, which will be combined to
comparable with the k-means algorithm. obtain the final solution. The other approach consists in
setting a weight to each partition in order to give a value
6. Genetic algorithms based methods according to its significance in the clustering ensemble.
These methods use the search capability of genetic
algorithms to obtain the con- sensus clustering. Generally,
the initial population is created with the partitions in the VII. APPLICATIONS
cluster ensemble and a fitness function is applied to The recent progress in clustering ensemble techniques
determine which chromo- somes (partitions of the set of is to a big extent endorsed by its application to several fields
object) are closer to the clustering than it is searching for. of investigation. There is a large variety of problems in
After that, crossover and mutation steps are applied to obtain which the clustering ensemble algorithms can be applied. In
new offsprings and renovate the population. uring this
principle, as clustering ensembles try to improve the quality
process if any termination criterion is achieved the partition
with the highest tness value is selected as the consensus of clustering results, they can be directly used in almost all
partition. cluster analysis problems, e.g. image segmentation,
bioinformatics, document retrieval and data mining.
7. Kernel based methods In particular, Gionis [32] et al. showed how
Vega-Pons et al. proposed the Weighted Partition clustering ensemble algorithms can be useful for improving
Consensus via Kernels (WPCK) algorithm. This algorithm the clustering robustness, clustering categorical data and
incorporates an intermediate step, called Partition Rel- heterogeneous data, identifying the correct number of
evance Analysis (see Sec. C), in the traditional methodology clusters and detecting outliers.
of the clustering ensemble algorithms with the goal of Clustering ensemble methods developed for a specific
estimating the importance of each partition before the application purpose should take into account the
combination process. In this intermediate step, to each peculiarities of the problem at hand. The kind of clustering
partition Pi is assigned a weight value Wi which represents ensemble algorithm should be selected according to the
the relevance of the partition in the cluster ensemble. specific requirements of each application. For instance, in
image segmentation problems, graph representation of
images are very convenient since neighboring relations
among pixels can easily be taken into account by this
structure. Besides, in image segmentation the compu-
tational cost is an important issue because images usually
have a large number of pixels. Hence, graph based
clustering ensemble methods could be an appropriate choice
in the segmentation ensemble context.

VIII.PRELIMINARIES
Here introduced some preliminary knowledge of deep
clustering. It includes the related network architectures for
feature representation, loss functions of standard clustering
methods, and the performance evaluation metrics for deep
clustering.
A. NEURAL NETWORK ARCHITECTURE FOR
DEEP CLUSTERING
In this part, we introduce some neural network Fig.7. Deep Belief Network
architectures, which have been extensively used to
transform inputs to a new feature representation.
To understand Deep belief networks we need to understand
1. FEEDFORWARD FULLY-CONNECTED
two important caveat of DBN
NEURAL NETWORK
A fully-connected network (FCN) consists of multiple
layers of neurons, each neuron is connected to every neuron 1.Belief Net
in the previous layer, and each connection has its own
weight. The FCN is also known as multi-layer perceptron
(MLP). It is a totally general purpose connection pattern and 2. RBM : Restricted Boltzmann Machine
makes no assumptions about the features in the data. It is
usually used in supervised learning when labels are The two most significant properties of deep belief nets are:
provided. However, for clustering, a good initialization of
parameters of network is necessary because a naive FC • There is an efficient, layer-by-layer procedure for
network tends to obtain a trivial solution when all data learning the top-down, generative weights that
points are simply mapped to tight clusters, which will lead determine how the variables in one layer depend on the
to a small value of clustering loss, but be far from being variables in the layer.
desired .
2. FEEDFORWARD CONVOLUTIONAL • After learning, the values of the latent variables in every
NEURAL NETWORK layer can be inferred by a single, bottom-up pass that
Convolutional neural networks (CNNs) were inspired by starts with an observed data vector in the bottom layer
biological process, in which the connectivity pattern and uses the generative weights in the reverse direction.
between neurons is inspired by the organization of the
animal visual cortex. Likewise, each neuron in a  Construction of Deep Belief Networks:
convolutional layer is only connected to a few nearby Restricted Boltzmann Machine, unsupervised learning,
neurons in the previous layer, and the same set of weights is has the advantage of fitting the feature of the samples. So
used for every neuron. It is widely applied to image datasets when we have an output of the hidden layer in a RBM, we
when locality and shift-invariance of feature extraction are can use it as the visible layer‟s input of another R M. This
required. It can be trained with a specific clustering loss process can be regard as further feature extraction from the
directly without any requirements on initialization, and a extracted feature of our samples. With this kind of thought,
good initialization would significantly boost the clustering Hinton raised Deep Belief Network (DBN) in 2006, which
performance. To the best of our knowledge, no theoretical is based on RBM.As the Figure shows, by using the output
explanation is given in any existing papers, but extensive of the upper R M‟s hidden layer as the input of the lower
work shows its feasibility for clustering. R M‟s visible layer, we get a Deep Belief Network. This
DBN is stacked by three RBMs.
3. DEEP BELIEF NETWORK
Deep Belief Networks (DBNs) are generative graphical
models which learn to extract a deep hierarchical represen-
tation of the input data. A DBN is composed of several
stacked Restricted Boltzmann machines (RBMs). The
greedy layer-wise unsupervised training is applied to DBNs
with RBMs as the building blocks for each layer. Then, all
(or part) of the parameters of DBN are fine-tuned with
respect to certain criterion (loss function), e.g., a proxy for
the DBN log-likelihood, a supervised training criterion, or a
clustering loss. Fig.8. A DBN stacked by three RBMs
However, RBMs learning process is unsupervised learning. developed for both GAN and VAE. Moreover, they have
So the Deep Belief Network can only works without also been applied to handle clustering tasks.
supervising. If we want to use it as a classification, we must
add a new network of supervised learning, which can B. LOSS FUNCTIONS RELATED TO
classify the samples based on the features extracted by CLUSTERING
DBN. Its simple idea is that DBM can extract the features of This part introduces some clustering loss functions, which
samples well. This will make the classifier work better than guides the networks to learn clustering-friendly represen-
that without DBN. tations. Generally, there are two kinds of clustering loss. We
name them as principal clustering loss and auxiliary
 Training Process: clustering loss.
If we train a whole DBN at a time without some strategy, its  Principal Clustering Loss: This category of
many layers will lead to the low efficiency of learning. To clustering loss functions contain the cluster
solve this problem, Hinton put forward a useful method centroids and cluster assignments of samples. In
including two steps as follows: other words, after the training of network guided
by the clustering loss, the clusters can be obtained
(1) Layer-wise Unsupervised Learning: this step is an directly. It includes k-means loss, cluster
unsupervised learning process. First we train the first RBM assignment hardening loss, agglomerative
by inputting the original data and fixing up the parameters clustering loss, nonparametric maximum margin
of this RBM. Then we use these output as the input of the clustering [30] and so on.
second RBM and the rest can be done in the same manner.  Auxiliary Clustering Loss: The second category
At last we get a DBN with several layers, whose parameters solely plays the role of guiding the network to learn
are suitable to extract the features of this kinds of data. a more feasible representation for clustering, but
(2) Fine-Turning: add a suitable classifier to the end of this cannot out- put clusters straightforwardly. It means
DBN, such as Back Propagation Network. We use gradient- deep clustering methods with merely auxiliary
descent algorithm to revise the weight matrix of the whole clustering loss require to run a clustering method
network. However, the parameters of RBMs are slightly after the training of net- work to obtain the clusters.
changed as the error is propagated in the opposite direction. There are many auxiliary clustering losses used in
In this way the trained DBN will not be easily damaged. deep clustering, such as locality-preserving loss,
which enforces the net- work to preserve the local
 How DBN Works (Basic Learning) : property of data embedding; group sparsity loss,
Now that we have understood the basics of Belief which exploits block diagonal similarity matrix for
Net and RBM , lets try to understand how DBN actually representation learning; sparse subspace clustering
learns. As DBN is multi-layer belief networks. where each loss , which aims at learning a sparse code of data.
layer is Restricted Boltzmann Machine stacked against each
other to for the Deep belief Network . The first step of C. PERFORMANCE EVALUATION
training DBN is to learn a layer of features from the visible METRICS FOR DEEP CLUSTERING
units, using Contrastive Divergence (CD) algorithm. Then,
the next step is to treat the activations of previously trained Two standard unsupervised evaluation metrics are
features as visible unites and learn features of features in a extensively used in many deep clustering papers. For all
second hidden layer. Finally, the whole DBN is trained when algorithms, the number of clusters are set to the number of
the learning for the final hidden layer is achieved. ground-truth categories. The first metric is unsupervised
clustering accuracy (ACC):
4. AUTOENCODER
Autoencoder (AE) is one of the most significant algorithms …….(Fig.9.)
in unsupervised representation learning. It is a powerful where yi is the ground-truth label, ci is the cluster
method to train a mapping function, which ensures the assignment generated by the algorithm, and m is a mapping
minimum reconstruction error between coder layer and data function which ranges over all possible one-to-one
layer. Since the hidden layer usually has smaller mappings between assignments and labels. It is obvious that
dimensionality than the data layer, it can help find the most this metric finds the best matching between cluster
salient features of data. Although autoencoder is mostly assignments from a clustering method and the ground truth.
applied to find a better initialization for parameters in The optimal mapping function can be efficiently computed
supervised learning, it is also natural to combine it with by Hungarian algorithm.The second one is Normalized
unsupervised clustering. Mutual Information (NMI) :

5. GAN&VAE
Generative Adversarial Network (GAN) and Variational ……(Fig.10.)
Autoencoder (VAE) are the most powerful frameworks for where Y denotes the ground-truth labels, C denotes the
deep generative learning. GAN aims to achieve an equilib- clusters labels, I is the mutual information metric and H is
rium between a generator and a discriminator, while VAE entropy.
attempts to maximizing a lower bound of the data log-
likelihood. A series of model extensions have been
IX. TAXONOMY OF DEEP CLUSTERING sake of handling data with spatial invariance, e.g.,
image data, convolutional and pooling layers can
be used to construct a convolutional autoencoder
Deep clustering is a family of clustering methods that (CAE).
adopt deep neural networks to learn clustering-friendly 2) Robustness: To avoid overfitting and to
representations. The loss function (optimizing objective) of improve robustness, it is natural to add noise to the
deep clustering methods are typically composed of two parts: input. Denoising autoencoder attempts to
network loss Ln and clustering loss Lc, thus the loss function reconstruct x from x which is a corrupted version
can be formulated as follows:
of x through some form of noise. Additionally,
noise can also be added to the inputs of each layer .
…(Fig.11)
where λ [0 1] is a hype-parameter to balance Ln and Lc. 3) Restrictions on latent features: Under-
The network loss Ln is used to learn feasible features and complete autoencoder constrains the dimension of
avoid trivial solutions, and the clustering loss Lc encourages latent coder z lower than that of input x, enforcing
the feature points to form groups or become more discrim- the encoder to extract the most salient features
inative. The network loss can be the reconstruction loss of an from original space. Other restrictions can also be
autoencoder (AE), the variational loss of a variational adopted, e.g., sparse autoencoder imposes a
encoder (VAE) or the adversarial loss of a generative adver- sparsity constraint on latent coder to obtain a sparse
sarial network (GAN).The clustering loss can be k-means representation.
loss, agglomerative clustering loss, locality-preserving loss 4) Reconstruction loss: Commonly the
and so on. For deep clustering methods based on AE reconstruction loss of an autoencoder consists of
network, the network loss is essential. But some other work only the discrepancy between input and output
designs a specific clustering loss to guide the optimization of layer, but the reconstruction losses of all layers can
networks, in which case the network loss can be removed. As also be optimized jointly.
The optimizing objective of AE-based deep clustering is thus formulated as follows:

L = λ Lrec + (1 − λ) Lc

The reconstruction loss enforces the network to learn a feasible representation and avoid trivial solutions. The general architecture of AE-based deep clustering algorithms and some representative methods are introduced as follows:

• Deep Clustering Network (DCN): DCN is one of the most remarkable methods in this field, which combines the autoencoder with the k-means algorithm. In the first step, it pre-trains an autoencoder. Then, it jointly optimizes the reconstruction loss and the k-means loss (a minimal code sketch of this joint objective is given at the end of this subsection). Since k-means uses discrete cluster assignments, the method requires an alternating optimization algorithm. The objective of DCN is simple compared with other methods and the computational complexity is relatively low.

• Deep Embedding Network (DEN): DEN proposes a deep embedding network to extract effective representations for clustering. It first utilizes a deep autoencoder to learn a reduced representation from the raw data. Secondly, in order to preserve the local structure property of the original data, a locality-preserving constraint is applied. Furthermore, it also incorporates a group sparsity constraint to diagonalize the affinity of representations. Together with the reconstruction loss, the three losses are jointly optimized to fine-tune the network for a clustering-oriented representation. The locality-preserving and group sparsity constraints serve as the auxiliary clustering loss, thus, as the last step, k-means is required to cluster the learned representations.

• Deep Subspace Clustering Networks (DSC-Nets): DSC-Nets introduces a novel autoencoder architecture to learn an explicit non-linear mapping that is friendly to subspace clustering [38]. The key contribution is introducing a novel self-expressive layer, which is a fully connected layer without bias and non-linear activation, inserted at the junction between the encoder and the decoder. This layer aims at encoding the self-expressiveness property of data drawn from a union of subspaces. Mathematically, its optimizing objective is a subspace clustering loss combined with a reconstruction loss. Although it has superior performance on several small-scale datasets, it is really memory-consuming and time-consuming and thus cannot be applied to large-scale datasets. The reason is that its parameter number is O(n²) for n samples, and it can only be optimized by gradient descent.

• Deep Multi-Manifold Clustering (DMC): DMC is a deep-learning-based framework for multi-manifold clustering (MMC). It optimizes a joint loss function comprised of two parts: the locality-preserving objective and the clustering-oriented objective. The first part makes the learned representations meaningful and embedded into their intrinsic manifold; it includes the autoencoder reconstruction loss and the locality-preserving loss. The second part penalizes representations based on their proximity to each cluster centroid, making the representation cluster-friendly and discriminative. Experimental results show that DMC has a better performance than the state-of-the-art multi-manifold clustering methods.

• Deep Embedded Regularized Clustering (DEPICT): DEPICT is a sophisticated method consisting of multiple striking tricks. It consists of a softmax layer stacked on top of a multi-layer convolutional autoencoder. It minimizes a relative entropy loss function with a regularization term for clustering. The regularization term encourages balanced cluster assignments and avoids allocating clusters to outlier samples. Furthermore, the reconstruction loss of the autoencoder is also employed to prevent a corrupted feature representation. Note that each layer in both the encoder and the decoder contributes to the reconstruction loss, rather than only the input and output layers. Another highlight of this method is that it employs a noisy encoder to enhance the robustness of the algorithm. Experimental results show that DEPICT achieves superior clustering performance while having a high computational efficiency.

• Deep Continuous Clustering (DCC): DCC is also an AE-based deep clustering algorithm. It aims at solving two limitations of deep clustering. Since most deep clustering algorithms are based on classical center-based, divergence-based or hierarchical clustering formulations, they have some inherent limitations. For one thing, they require setting the number of clusters a priori. For another, the optimization procedures of these methods involve discrete reconfigurations of the objective, which require updating the clustering parameters and the network parameters alternately. DCC is rooted in Robust Continuous Clustering (RCC), a formulation having a clear continuous objective and no prior knowledge of the number of clusters. Similar to many other methods, the representation learning and the clustering are optimized jointly.
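To make the AE-based objective L = λ Lrec + (1 − λ) Lc concrete, the sketch below jointly optimizes a reconstruction loss and a k-means-style clustering loss in the spirit of DCN. It is only an illustrative reading of that idea, not the authors' released code; the architecture, the random toy data and the value of the balancing weight are arbitrary assumptions.

    import torch
    import torch.nn as nn

    class AE(nn.Module):
        def __init__(self, d_in=784, d_latent=10):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, d_latent))
            self.decoder = nn.Sequential(nn.Linear(d_latent, 256), nn.ReLU(), nn.Linear(256, d_in))
        def forward(self, x):
            z = self.encoder(x)
            return z, self.decoder(z)

    def joint_loss(x, z, x_rec, centroids, assignments, lam=0.5):
        # L = lam * L_rec + (1 - lam) * L_cluster, with a k-means distance in latent space
        l_rec = ((x - x_rec) ** 2).mean()
        l_clu = ((z - centroids[assignments]) ** 2).sum(dim=1).mean()
        return lam * l_rec + (1 - lam) * l_clu

    # toy usage: random data, k = 3 clusters, alternating network / k-means updates
    torch.manual_seed(0)
    x = torch.rand(128, 784)
    model, k = AE(), 3
    centroids = torch.randn(k, 10)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step in range(10):
        z, x_rec = model(x)
        # discrete assignment step (kept fixed while the network is updated)
        assignments = torch.cdist(z.detach(), centroids).argmin(dim=1)
        loss = joint_loss(x, z, x_rec, centroids, assignments)
        opt.zero_grad(); loss.backward(); opt.step()
        # centroid update step (plain k-means mean update on detached features)
        with torch.no_grad():
            for j in range(k):
                mask = assignments == j
                if mask.any():
                    centroids[j] = z.detach()[mask].mean(dim=0)

The alternating structure (fix assignments, update the network, then update the centroids) mirrors the alternating optimization that the discrete k-means assignments require.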
B. CDNN-BASED DEEP CLUSTERING

CDNN-based algorithms only use the clustering loss to train the network, where the network can be an FCN, a CNN or a DBN. The optimizing objective of CDNN-based algorithms can be formulated as follows:

L = Lc (1)

Fig. 14. Architecture of CDNN-based deep clustering algorithms. The network is only adjusted by the clustering loss. The network architecture can be FCN, CNN, DBN and so on.

Without the reconstruction loss, CDNN-based algorithms suffer from the risk of obtaining a corrupted feature space, in which all data points are simply mapped to tight clusters, resulting in a small value of the clustering loss but a meaningless clustering. Consequently, the clustering loss should be designed carefully, and network initialization is important for certain clustering losses. For this reason, we divide CDNN-based deep clustering algorithms into three categories according to the way of network initialization, i.e., unsupervised pre-trained, supervised pre-trained and randomly initialized (non-pre-trained).
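To illustrate what a carefully designed clustering loss can look like, here is a small sketch of the cluster assignment hardening loss popularized by DEC (soft assignments from a Student's t kernel, sharpened into an auxiliary target distribution and matched with a KL divergence). It is a simplified toy illustration with arbitrary data, not the reference implementation.

    import numpy as np

    def soft_assignments(z, centroids, alpha=1.0):
        # Student's t kernel between embedded points z (n, d) and cluster centroids (k, d)
        dist2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
        return q / q.sum(axis=1, keepdims=True)

    def target_distribution(q):
        # sharpen q: emphasize confident assignments, normalize by cluster frequency
        w = (q ** 2) / q.sum(axis=0)
        return w / w.sum(axis=1, keepdims=True)

    def assignment_hardening_loss(q, p):
        # KL(P || Q), the clustering loss used to fine-tune the encoder
        return np.sum(p * np.log(p / q))

    rng = np.random.default_rng(0)
    z = rng.normal(size=(100, 10))          # embedded features from a (pre-trained) encoder
    centroids = rng.normal(size=(3, 10))    # k = 3 cluster centres
    q = soft_assignments(z, centroids)
    p = target_distribution(q)
    print(assignment_hardening_loss(q, p))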
1. UNSUPERVISED PRE-TRAINED NETWORK
RBMs and autoencoders have been applied to CDNN-based clustering. These algorithms firstly train an RBM or an autoencoder in an unsupervised manner, and then fine-tune the network (only the encoder part in the case of the autoencoder) by the clustering loss. Several representative algorithms are introduced below.

• Deep Nonparametric Clustering (DNC): DNC leverages unsupervised feature learning with a DBN for clustering analysis. It first trains a DBN to map the original training data into embedding codes. Then, it runs the nonparametric maximum margin clustering (NMMC) algorithm to obtain the number of clusters and the labels for all training data. After that, it takes a fine-tuning process to refine the parameters of the top layer of the DBN. The experimental results show advantages over classical clustering algorithms.

• Deep Embedded Clustering (DEC): DEC is one of the most representative methods of deep clustering and attracts lots of attention to this field. It uses an autoencoder as the network architecture and uses the cluster assignment hardening loss as a regularization. It first trains an autoencoder by using the reconstruction loss and then drops the decoder part. The features extracted by the encoder network serve as the input of the clustering module. After that, the network is fine-tuned using the cluster assignment hardening loss. Meanwhile, the clusters are iteratively refined by minimizing the KL-divergence between the distribution of soft labels and the auxiliary target distribution. As a result, the algorithm obtains a good result and has become a reference to compare the performances of new deep clustering algorithms.

• Discriminatively Boosted Clustering (DBC): DBC has almost the same architecture as DEC and the only improvement is that it uses a convolutional autoencoder. In other words, it also first pre-trains an autoencoder and then uses the cluster assignment hardening loss to fine-tune the network, along with refining the clustering parameters. It outperforms DEC on image datasets on account of the use of the convolutional network.

2. SUPERVISED PRE-TRAINED NETWORK
Although unsupervised pre-training provides a better initialization of networks, it is still challenging to extract feasible features from complex image data. Guérin et al. conduct extensive experiments by testing the performance of combinations of different popular CNN architectures pre-trained on ImageNet and different classical clustering algorithms. The experimental results show that features extracted from deep CNNs trained on large and diverse labeled datasets, combined with classical clustering algorithms, can outperform the state-of-the-art image clustering methods. To this effect, when the clustering objective is complex image data, it is natural to make use of the most popular network architectures like VGG, ResNet or Inception models, which are pre-trained on large-scale image datasets like ImageNet, to speed up the convergence of iterations and to boost the clustering quality. The most remarkable method of this type is introduced as follows:

• Clustering Convolutional Neural Network (CCNN): CCNN is an efficient and reliable deep clustering algorithm which can deal with large-scale image datasets. It proposes a CNN-based framework to solve clustering and representation learning iteratively. It first randomly picks k samples and uses an initial model pre-trained on the ImageNet dataset to extract their features as the initial cluster centroids. In each step, mini-batch k-means is performed to update the assignments of samples and the cluster centroids, while stochastic gradient descent is used to update the parameters of the proposed CNN (the mini-batch update is sketched right after this item). The mini-batch k-means significantly reduces computation and memory costs, enabling CCNN to be adapted to large-scale datasets. Moreover, it also includes a novel iterative centroid updating method that avoids the drift error induced by the feature inconsistency between two successive iterations. At the same time, only the top-km samples with the smallest distances to their corresponding centroids are chosen to update the network parameters, in order to enhance the reliability of updates. All these techniques improve the clustering performance. To the best of our knowledge, it is the only deep clustering method which can deal with the task of clustering millions of images.
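The mini-batch k-means step that CCNN relies on can be sketched in a few lines. The following is the generic Sculley-style update on toy data, not CCNN itself; the batch size, iteration count and data are arbitrary assumptions.

    import numpy as np

    def minibatch_kmeans(X, k, n_iters=100, batch_size=32, seed=0):
        # Update assignments and centroids from small random batches, using a
        # per-centre learning rate 1/count, which keeps memory and compute low.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        counts = np.zeros(k)
        for _ in range(n_iters):
            batch = X[rng.choice(len(X), size=batch_size, replace=False)]
            d = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = d.argmin(axis=1)
            for x, j in zip(batch, assign):
                counts[j] += 1
                eta = 1.0 / counts[j]                 # per-centre learning rate
                centers[j] = (1 - eta) * centers[j] + eta * x
        return centers

    # toy usage on random 2-D data with k = 3
    X = np.random.default_rng(1).normal(size=(1000, 2))
    print(minibatch_kmeans(X, k=3))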
3. NON-PRE-TRAINED NETWORK
Despite the fact that a pre-trained network can significantly boost the clustering performance, under the guidance of a well-designed clustering loss, the networks can also be trained from scratch to extract discriminative features.

• Information Maximizing Self-Augmented Training (IMSAT): IMSAT is an unsupervised discrete representation learning algorithm, the task of which is to obtain a function mapping data into discrete representations; clustering is a special case of this task. It combines an FCN and Regularized Information Maximization (RIM) [50], which learns a probabilistic classifier such that the mutual information between inputs and cluster assignments is maximized, while the complexity of the classifier is regularized (a small sketch of the underlying mutual-information objective is given after this list). At the same time, a flexible and useful regularization objective termed Self-Augmented Training (SAT) is proposed to impose the intended invariance on the data representations. This data augmentation technique significantly improves the performance of standard deep RIM. IMSAT shows state-of-the-art results on the MNIST and REUTERS datasets.

• Joint Unsupervised Learning (JULE): JULE is proposed to learn feature representations and cluster images jointly. A convolutional neural network is used for representation learning and hierarchical clustering (to be specific, agglomerative clustering) is used for clustering. It optimizes the objective iteratively in a recurrent process. Hierarchical image clustering is performed in the forward pass while the feature representation is learned in the backward pass. In the forward pass, the representations of images are regarded as initial samples, and then label information is generated from an undirected affinity matrix based on the deep representations of images. After that, two clusters are merged according to a predefined loss metric. In the backward pass, the network parameters are iteratively updated towards obtaining a better feature representation by optimizing the already merged clusters. In experiments, the method shows excellent results on image datasets and indicates that the learned representations can be transferred across different datasets. Nevertheless, the computational cost and memory complexity are extremely high when the dataset is large, as it requires constructing an undirected affinity matrix. What is worse, this cost can hardly be reduced since the affinity matrix is dense.

• Deep Adaptive Image Clustering (DAC): DAC is a single-stage convolutional-network-based method to cluster images. The method is motivated by a basic assumption that the relationship between pairwise images is binary, and its optimizing objective is the binary pairwise-classification problem. The images are represented by label features extracted by a convolutional neural network, and the pairwise similarities are measured by the cosine distance between label features. Furthermore, DAC introduces a constraint to make the learned label features tend to be one-hot vectors. Moreover, since the ground-truth similarities are unknown, it adopts an adaptive learning algorithm [52], an alternating iterative method, to optimize the model. In each iteration, pairwise images with estimated similarities are selected based on the fixed network, and then the network is trained by the selected labeled samples. DAC converges when all instances are used for training and the objective cannot be improved further. Finally, images are clustered according to the largest response of the label features. DAC achieves superior performance on five challenging datasets.
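As referenced in the IMSAT item above, the information-maximization part of RIM reduces to two entropy terms computed from the classifier's softmax outputs. The following is a minimal toy illustration of those terms, not the IMSAT code.

    import numpy as np

    def mutual_information(p_y_given_x, eps=1e-12):
        # p_y_given_x: (n, k) softmax outputs of a classifier for n inputs and k clusters.
        # I(X; Y) = H(mean_x p(y|x)) - mean_x H(p(y|x)):
        # large when the marginal is balanced but each individual prediction is confident.
        p_marginal = p_y_given_x.mean(axis=0)
        h_marginal = -np.sum(p_marginal * np.log(p_marginal + eps))
        h_conditional = -np.sum(p_y_given_x * np.log(p_y_given_x + eps), axis=1).mean()
        return h_marginal - h_conditional

    # toy check: confident, balanced predictions give high mutual information
    p = np.array([[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]])
    print(mutual_information(p))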
C. VAE-BASED DEEP CLUSTERING
AE-based and CDNN-based deep clustering have made impressive improvements compared to classical clustering methods. However, they are designed specifically for clustering and fail to uncover the real underlying structure of the data, which prevents them from being extended to other tasks beyond clustering, e.g., generating samples. Worse still, the assumptions underlying the dimensionality reduction techniques are generally independent of the assumptions of the clustering techniques, thus there is no theoretical guarantee that the network would learn feasible representations. In recent years, the Variational Autoencoder (VAE), a kind of deep generative model, has attracted extensive attention and motivated a large number of variants. In this section, we introduce the deep clustering algorithms based on VAE. VAE can be considered as a generative variant of AE, as it enforces the latent code of AE to follow a predefined distribution. VAE combines variational Bayesian methods with the flexibility and scalability of neural networks. It introduces neural networks to fit the conditional posterior and thus can optimize the variational inference objective via stochastic gradient descent and standard backpropagation. To be specific, it uses the reparameterization of the variational lower bound to yield a simple differentiable unbiased estimator of the lower bound. This estimator can be used for efficient approximate posterior inference in almost any model with continuous latent variables.
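The reparameterization mentioned above can be shown in a few lines; this is the standard trick in its generic form, with toy tensor shapes chosen arbitrarily.

    import torch

    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I), so gradients can
    # flow through mu and log_var while the sampling remains stochastic.
    mu = torch.zeros(8, 10, requires_grad=True)       # toy encoder outputs
    log_var = torch.zeros(8, 10, requires_grad=True)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps           # differentiable sample of the latent code

    # closed-form KL term of the ELBO against a standard normal prior
    kl = -0.5 * torch.sum(1 + log_var - mu ** 2 - log_var.exp())
    print(z.shape, kl.item())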
D. GAN-BASED DEEP CLUSTERING
The Generative Adversarial Network (GAN) is another popular deep generative model of recent years. The GAN framework establishes a min-max adversarial game between two neural networks: a generative network, G, and a discriminative network, D. The generative network tries to map a sample z from a prior distribution p(z) to the data space, while the discriminative network tries to compute the probability that an input is a real sample from the data distribution, rather than a sample generated by the generative network. GAN-based deep clustering algorithms share the general problems of GANs, e.g., being hard to converge and mode collapse. The noticeable works are presented as follows:

• Deep Adversarial Clustering (DAC): DAC is a generative model specific to clustering. It applies the adversarial autoencoder (AAE) to clustering. AAE is similar to VAE: while VAE uses a KL divergence penalty to impose a prior distribution on the latent representation, AAE uses an adversarial training procedure to match the aggregated posterior of the latent representation with the prior distribution. Inspired by the success of VaDE, Harchaoui et al. match the aggregated posterior of the latent representation with a Gaussian Mixture Distribution. Its optimizing objective is comprised of three terms: the traditional autoencoder reconstruction objective, the Gaussian mixture model likelihood, and the adversarial objective, where the reconstruction objective can be considered as the network loss, and the other two terms as the clustering loss. Experiments illustrate that it has a comparable result with VaDE on the MNIST dataset.

• Categorical Generative Adversarial Network (CatGAN): CatGAN generalizes the GAN framework to multiple classes. It considers the problem of unsupervisedly learning a discriminative classifier D from a dataset, which classifies the data points into an a priori chosen number of categories instead of only two categories (fake or real). CatGAN introduces a new two-player game based on the GAN framework: instead of requiring D to predict the probability of x belonging to the real dataset, it enforces D to classify all data points into k classes, while being uncertain of the class assignments for samples generated by G. On the other hand, it requires G to generate samples belonging to precisely one out of the k classes, instead of generating samples belonging to the whole dataset.

TABLE 3. Comparison of different categories of deep clustering algorithms.
E. SUMMARY OF DEEP CLUSTERING ALGORITHMS
In this part, we present an overall scope of deep clustering algorithms. Specifically, we compare the four categories of algorithms in terms of loss function, advantages, disadvantages and computational complexity, as shown in Table 3.

In regard to the loss function, apart from CDNN-based algorithms, the other three categories of algorithms jointly optimize both the clustering loss Lc and the network loss Ln. The difference is that the network loss of AE-based algorithms is explicitly the reconstruction loss, while the two losses of VAE-based and GAN-based algorithms are usually incorporated together.

AE-based DC algorithms are the most common, as the autoencoder can be combined with almost all clustering algorithms. The reconstruction loss of the autoencoder ensures that the network learns a feasible representation and avoids obtaining trivial solutions. However, due to the symmetric architecture, the network depth is limited for computational feasibility. Besides, the hyper-parameter to balance the two losses requires extra fine-tuning. In contrast to AE-based DC algorithms, CDNN-based DC algorithms only optimize the clustering loss. Therefore, the depth of the network is unlimited and supervisedly pre-trained architectures can be used to extract more discriminative features, thus they are capable of clustering large-scale image datasets. However, without the reconstruction loss, they have the risk of learning a corrupted feature representation, thus the clustering loss should be well-designed. VAE-based and GAN-based DC algorithms are generative DC techniques, as they are capable of generating samples from the finally obtained clusters. VAE-based algorithms have a good theoretical guarantee because they minimize the variational lower bound on the marginal likelihood of the data, but they suffer from high computational complexity. GAN-based algorithms impose a multi-class prior on the general GAN framework. They are more flexible and diverse than VAE-based ones. Some of them aim at learning interpretable representations and just take the clustering task as a specific case. The shortcomings of GAN-based algorithms are similar to those of GANs, e.g., mode collapse and slow convergence.

The computational complexity of deep clustering varies a lot. For AE-based and CDNN-based algorithms, the computational cost is highly related to the clustering loss. For example, the k-means loss results in a relatively low overhead, while the cost of agglomerative clustering is extremely high. At the same time, the network architecture also influences the computational complexity significantly, as a deep CNN requires a long time to train. For VAE and GAN, due to the difficulty of optimization, they usually have a higher computational complexity than efficient methods in the AE-based and CDNN-based categories, e.g., DEC, DCN, DEPICT and so on.

X. RESEARCH ADVANCE OF CLUSTERING ENSEMBLE ALGORITHM

The clustering algorithm is a relatively active and challenging research area in the field of machine learning and pattern recognition. With the development of clustering algorithm research in recent years, a number of classical clustering methods have been generated, such as the density-based method, the distortion-based method, the model-based method and so on. Although these methods achieve good clustering results on some data sets, there are still shortcomings, such as poor scalability and sensitivity to the initialization parameters as well as to the concave or convex shape of the spatial distribution of the data set.

In order to overcome the shortcomings of some traditional clustering algorithms and improve the robustness, stability and scalability of clustering results, researchers proposed the clustering ensemble algorithm.
In 2001, Fred first researched cluster ensembles through the method of the associative matrix and proposed the concept of the co-occurrence matrix, which serves as a similarity matrix to divide the data.

In 2002, based on this research, Strehl and Ghosh [3] first officially put forward the concept of the cluster ensemble.

In 2002, Fred et al. put forward the clustering ensemble algorithm based on evidence accumulation (a short sketch of the resulting co-association matrix is given at the end of this section).

In 2004, Fern et al. [13] improved the hyper-graph partition algorithm put forward by Strehl and Ghosh and proposed the Hybrid Bipartite Graph Formulation based on examples and clusters.

In 2004, Topchy et al. [3] summarized and pointed out the advantages of the clustering ensemble algorithm compared with a single clustering algorithm.

In 2004, Minaei-Bidgoli et al. [14] summarized the research methods of cluster ensembles at present, such as the correlation matrix, voting methods, information theory, hypergraph partitioning, the mixed model method and so on.

Since then, many researchers have joined the research on clustering ensemble algorithms, such as Topchy [15], Dudoit and Fridlyand [16], Fischer and Buhmann [17], Wang, Yang and Zhou [10], etc.

The clustering ensemble algorithm has become one of the popular clustering methods, and there is a large number of research results on algorithm extension and theoretical analysis.

He et al. [18,19] (2005, 2008) analyzed the similarities and differences between clustering ensembles and categorical data integration and pointed out their essential consistency; after that, they regarded NMI as the objective function to solve the problem of cluster analysis on data sets with categorical attributes, and put forward the K-ANMI algorithm.

Based on the Borda voting method, using the Bordafuse data fusion technology, Sevillano et al. [20] (2007) put forward a consensus function for soft clustering ensembles which can be used in the field of information retrieval.

Tumer et al. [21] (2008) considered the clustering ensemble as a dynamic optimization problem, regarded the maximum value of the ANMI as the objective function, and put forward adaptable Voting Active Clusters (VACs).

Hore et al. [22] (2009) researched the scalability problem of clustering ensembles in terms of time and space complexity and proposed two algorithms: the Bipartite Merger algorithm and the Metis Merger algorithm.

Luo Huilan et al. [4] (2007) considered the diversity of the clustering ensemble to be a key element limiting ensemble learning. They studied seven different kinds of diversity measures for clustering ensembles, and put forward that there is no monotonic relationship between the diversity measure and the clustering ensemble performance. However, if the size of the clustering ensemble is moderate and the clusters are evenly distributed in the data set, their correlation is relatively high.

Wang Junhong et al. [23] (2009) put forward a model of the Latent Variable Cluster Ensemble (LVCE).

Zhou Lin et al. [24] (2012) studied a clustering ensemble algorithm with superior characteristics. They constructed diverse clustering members by using the intrinsic characteristics of the spectral clustering algorithm, calculated the similarity matrix based on the three-tuple algorithm, and obtained the final results by using the spectral clustering algorithm.

Since a single clustering algorithm cannot give a proper processing scheme according to the different characteristics of different data sets, Hou Yong et al. [2] (2013) proposed an enhanced clustering ensemble algorithm based on the characteristics of the data set.

Based on the constraint- and metric-based semi-supervised clustering method, combined with the clustering ensemble algorithm, Wei Siting et al. [25] proposed a semi-supervised cluster ensemble method combining constraints and metrics.

Aiming at the problem of estimating and weighting the reliability of clustering memberships, Huang Dong et al. [26] (2016) proposed a clustering ensemble algorithm based on a bipartite graph model and a decision weighting mechanism.
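As promised earlier in this section, the co-association (evidence accumulation) idea introduced by Fred can be sketched very compactly: each pair of objects is scored by how often the base clusterings group them together, and the resulting matrix is treated as a similarity matrix for a final consensus clustering. The snippet below is a generic toy illustration, not any particular published implementation.

    import numpy as np

    def co_association(labelings):
        # labelings: list of label vectors, one per base clustering of the same n objects.
        # Entry (i, j) is the fraction of base clusterings that put i and j in one cluster.
        n = len(labelings[0])
        C = np.zeros((n, n))
        for labels in labelings:
            labels = np.asarray(labels)
            C += (labels[:, None] == labels[None, :]).astype(float)
        return C / len(labelings)

    # toy ensemble of three base partitions of five objects
    partitions = [[0, 0, 1, 1, 2], [0, 0, 0, 1, 1], [1, 1, 2, 2, 2]]
    sim = co_association(partitions)
    print(sim)  # this similarity matrix can then be cut with any standard clustering method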
XI. FUTURE OPPORTUNITIES

A. FUTURE OPPORTUNITIES OF DEEP CLUSTERING
Based on the aforementioned literature review and analysis, we argue that the following perspectives of deep clustering are worth being studied further:

1) Theoretical exploration: Although jointly optimizing networks and clustering models significantly boosts the clustering performance, there is no theoretical analysis explaining why it works and how to further improve the performance. Therefore, it is meaningful to explore the theoretical guarantees of deep clustering, in order to guide further research in this area.

2) Other network architectures: Existing deep clustering algorithms mostly focus on image datasets, while few attempts have been made on sequential data, e.g., documents. To this effect, it is recommended to explore the feasibility of combining other network architectures with clustering, e.g., the recurrent neural network [61].

3) Tricks in deep learning: It is viable to introduce tricks or techniques used in supervised deep learning to deep clustering, e.g., data augmentation and specific regularizations. A concrete example is augmenting the data with noise to improve the robustness of clustering methods.

4) Other clustering tasks: Combining deep neural networks with diverse clustering tasks, e.g., multi-task clustering and self-taught clustering (transfer clustering) [62], is another interesting research direction. To the best of our knowledge, these tasks have not yet exploited the powerful non-linear transformation of neural networks.

XII. CONCLUSION
Clustering ensembles have become a leading technique when facing cluster analysis problems, due to their capacity for improving the results of simple clustering algorithms. The combination process integrates information from all partitions in the ensemble, where possible errors of simple clustering algorithms can be compensated. In that way, the consensus clustering, obtained from a set of clusterings of the same dataset, represents an appropriate solution.

The clustering ensemble algorithm has some outstanding advantages in theory and application. In recent years, it has attracted more and more attention, which also extends the research of cluster analysis. Applying the clustering ensemble algorithm to more practical problems has become an important emerging branch of cluster analysis. Here we have given a systematic exposition of the clustering ensemble algorithm and its research advances, covering its origin, development, basic theory, main algorithms and other aspects. We have also systematically introduced the generation of the clustering ensemble algorithm, its development, theoretical research, evaluation methods and recent research advances. Readers can thus get a preliminary understanding of the clustering ensemble algorithm and apply this method in scientific research and engineering to solve practical problems.

Due to the unsupervised nature of these techniques, it is not adequate to talk about the best clustering ensemble method. Nevertheless, we can still establish a comparison among these methods and determine, for specific conditions, which one may be the most appropriate. We made a critical analysis and comparison of the methods, taking into account different parameters. The main advantages and disadvantages of each method can help users to select the most convenient method to solve their particular problem.

As deep learning is widely used in many practical applications for its powerful ability of feature extraction, it is natural to combine clustering algorithms with deep learning for better clustering results. In this paper, we give a systematic survey of deep clustering, which has been a popular research field of clustering in recent years. A taxonomy of deep clustering is proposed from the perspective of network architectures, and the representative algorithms are presented in detail. The taxonomy explicitly shows the characteristics, advantages and disadvantages of different deep clustering algorithms. Furthermore, we provide several interesting future directions of deep clustering. We hope this work can serve as a valuable reference for researchers who are interested in both deep learning and clustering.
REFERENCES

[1] Yong Hou, Xuefeng Zheng, "Enhanced clustering ensemble algorithm based on characteristics of data sets", Journal of Computer Applications, Vol. 33, No. 8, pp. 2207-2249, 2013.
[2] Fred A., "Finding consistent clusters in data partitions", Proceedings of the 2nd International Workshop on Multiple Classifier Systems, Springer, pp. 309-318, 2001.
[3] Strehl A., Ghosh J., Cardie C., "Cluster ensembles: A knowledge reuse framework for combining multiple partitions", Journal of Machine Learning Research, No. 3, pp. 583-617, 2002.
[4] Huilan Luo, Fansheng Kong, Yixiao Li, "An analysis of diversity measures in clustering ensembles", Chinese Journal of Computers, Vol. 30, No. 8, pp. 1315-1324, 2007.
[5] Suqin Ji, Hongno Shi, "K-means clustering ensemble based on MapReduce", Computer Engineering, Vol. 39, No. 9, pp. 84-87, 2013.
[6] Topchy A., Jain A. K., Punch W., "Clustering Ensembles: Models of Consensus and Weak Partitions", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 12, pp. 1866-1881, 2005.
[7] Karypis G., Kumar V., "A fast and high quality multilevel scheme for partitioning irregular graphs", SIAM Journal on Scientific Computing, Vol. 20, No. 1, p. 359, 1999.
[8] Topchy A., Jain A. K., Punch W., "A mixture model for clustering ensembles", Proceedings of the 4th SIAM International Conference on Data Mining, pp. 379-390, 2004.
[9] Fred A., Jain A. K., "Data clustering using evidence accumulation", Proceedings of the 16th International Conference on Pattern Recognition (ICPR'02), pp. 276-280, 2002.
[10] Xi Wang, Chunyu Yang, Jie Zhou, "Clustering aggregation by probability accumulation", Pattern Recognition, Vol. 42, No. 5, pp. 668-675, 2009.
[11] Meila M., "Comparing clusterings by the variation of information", Proceedings of the Conference on Learning Theory, No. 2777, pp. 173-187, 2003.
[12] Sevillano X., Cobo G., Alias F., Socoro J. C., "Feature diversity in cluster ensembles for robust document clustering", SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 697-698, 2006.
[13] Fern X. Z., Brodley C. E., "Solving cluster ensemble problems by bipartite graph partitioning", Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.
[14] Minaei-Bidgoli B., Topchy A., Punch W. F., "A comparison of resampling methods for clustering ensembles", International Conference on Machine Learning, Models, Technologies and Applications (MLMTA 2004), pp. 939-945, 2004.
[15] Law M., Topchy A., Jain A., "Multiobjective data clustering", Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, No. 2, pp. 424-430, 2004.
[16] Dudoit S., Fridlyand J., "Bagging to improve the accuracy of a clustering procedure", Bioinformatics, Vol. 19, No. 9, p. 1090, 2001.
[17] Fischer B., Buhmann J. M., "Bagging for path-based clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 11, pp. 1411-1415, 2003.
[18] Zengyou He, Xiaofei Xu, Shengchun Deng, "A cluster ensemble method for clustering categorical data", Information Fusion, Vol. 6, No. 2, pp. 143-151, 2005.
[19] Zengyou He, Xiaofei Xu, Shengchun Deng, "K-ANMI: A mutual information based clustering algorithm for categorical data", Information Fusion, Vol. 9, No. 2, pp. 223-233, 2008.
[20] Sevillano X., Alias F., Socoro J. C., "Borda Consensus: A new consensus function for soft cluster ensembles", Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 743-744, 2007.
[21] Tumer K., Agogino A., "Ensemble clustering with voting active clusters", Pattern Recognition Letters, Vol. 29, No. 4, pp. 1947-1953, 2008.
[22] Hore P., Hall L. O., Goldgof D. B., "A scalable framework for cluster ensembles", Pattern Recognition, Vol. 42, No. 5, pp. 676-688, 2009.
[23] Hongjun Wang, Zhishu Li, Yang Cheng, Peng Zhou, Wei Zhou, "A Latent Variable Model for Cluster Ensemble", Journal of Software, Vol. 20, No. 4, pp. 825-833, 2009.
[24] Lin Zhou, Xijian Ping, Sen Xu, Tao Zhang, "Cluster Ensemble Based on Spectral Clustering", Acta Automatica Sinica, Vol. 38, No. 8, pp. 1335-1342, 2012.
[25] Siting Wei, Zhixin Li, Canlong Zhang, "Semi-supervised clustering ensemble combining constraints and metric", Computer Engineering and Design, Vol. 37, No. 9, pp. 2440-2453, 2016.
[26] Dong Huang, Changdong Wang, Jianhuang Lai, Yun Liang, Shan Bian, Yu Chen, "Clustering ensemble by decision weighting", Transactions on Intelligent Systems, Vol. 11, No. 3, pp. 418-425, 2016.
[27] V. Filkov and S. Skiena, "Integrating microarray data by consensus clustering", Int. J. Artif. Intell. Tools, 13(4) (2004) 863-880.
[28] B. Fischer and J. Buhmann, "Bagging for path-based clustering", IEEE Trans. Patt. Anal. Mach. Intell., 25(11) (2003) 1411-1415.
[29] A. Fred, "Finding consistent clusters in data partitions", 3rd Int. Workshop on Multiple Classifier Systems (2001), pp. 309-318.
[30] A. L. N. Fred and A. K. Jain, "Combining multiple clusterings using evidence accumulation", IEEE Trans. Patt. Anal. Mach. Intell., 27 (2005) 835-850.
[31] R. Ghaemi, M. N. Sulaiman, H. Ibrahim and N. Mustapha, "A survey: clustering ensembles techniques", Proc. World Acad. Sci. Engin. Technol., 38 (2009) 644-653.
[32] A. Gionis, H. Mannila and P. Tsaparas, "Clustering aggregation", ACM Trans. Knowl. Discov. Data, 1(1) (2007) 341-352.
[33] M. Gluck and J. Corter, "Information, uncertainty, and the utility of categories", Proc. Seventh Annual Conf. of the Cognitive Science Society (Lawrence Erlbaum: Hillsdale, NJ, 1985), pp. 283-287.
[34] A. Goder and V. Filkov, "Consensus Clustering Algorithms: Comparison and Refinement", eds. J. I. Munro and D. Wagner (ALENEX, SIAM, 2008), pp. 109-117.
[35] E. Gonzalez and J. Turmo, "Comparing non-parametric ensemble methods for document clustering", Natural Language and Information Systems, LNCS, Vol. 5039 (2008), pp. 245-256.
[36] A. Gordon and M. Vichi, "Fuzzy partition models for fitting a set of partitions", Psychometrika, 66(2) (2001) 229-248.
[37] L. Grady, "Random walks for image segmentation", IEEE Trans. Patt. Anal. Mach. Intell., 28(11) (2006) 1768-1783.
[38] D. Greene and P. Cunningham, "Efficient ensemble method for document clustering", Tech. Rep., Department of Computer Science, Trinity College Dublin (2006).
[39] T. Kohonen, "The self-organizing map", Neurocomputing, vol. 21, nos. 1-3, pp. 1-6, 1998.
[40] D. Reynolds, "Gaussian mixture models", in Encyclopedia of Biometrics, Springer, 2015, pp. 827-832.
[41] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", in Proc. KDD, 1996, pp. 226-231.
[42] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding", in Proc. 18th Annu. ACM-SIAM Symp. Discrete Algorithms, 2007, pp. 1027-1035.
[43] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm", J. Roy. Stat. Soc. C, Appl. Stat., vol. 28, no. 1, pp. 100-108, 1979.
[44] S. Wold, K. Esbensen, and P. Geladi, "Principal component analysis", Chemometrics Intell. Lab. Syst., vol. 2, nos. 1-3, pp. 37-52, 1987.
[45] T. Hofmann, B. Schölkopf, and A. J. Smola, "Kernel methods in machine learning", Ann. Stat., vol. 36, no. 3, pp. 1171-1220, 2008.
[46] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm", in Proc. Adv. Neural Inf. Process. Syst., 2002, pp. 849-856.
[47] J. Schmidhuber, "Deep learning in neural networks: An overview", Neural Netw., vol. 61, pp. 85-117, Jan. 2015. [10] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering:
[48] Discriminative embeddings for segmentation and separation", in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2016, pp. 31-35.
[49] J. Xie, R. Girshick, and A. Farhadi, "Unsupervised deep embedding for clustering analysis", in Proc. Int. Conf. Mach. Learn., 2016, pp. 478-487. [12] F. Li, H. Qiao, B. Zhang, and X. Xi. (2017). "Discriminatively boosted image clustering with fully convolutional auto-encoders." [Online]. Avail-
[50] able: https://arxiv.org/abs/1703.07980 [13] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong. (2016). "Towards
[51] K-means-friendly spaces: Simultaneous deep learning and clustering."
[52] [Online]. Available: https://arxiv.org/abs/1610.04794 [14] K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang, "Deep clus-
[53] tering via joint convolutional autoencoder embedding and relative entropy minimization", in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5747-5756.
[54] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. (2016). "Variational deep embedding: An unsupervised and generative approach to clustering." [Online]. Available: https://arxiv.org/abs/1611.05148
[55] J. Yang, D. Parikh, and D. Batra, "Joint unsupervised learning of deep representations and image clusters", in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 5147-5156.
[56] C.-C. Hsu and C.-W. Lin, "CNN-based joint clustering and representation learning with feature drift compensation for large-scale image data", IEEE Trans. Multimedia, vol. 20, no. 2, pp. 421-429, Feb. 2018.
[57] Z. Wang, S. Chang, J. Zhou, M. Wang, and T. S. Huang, "Learning a task-specific deep architecture for clustering", in Proc. SIAM Int. Conf. Data Mining, 2016, pp. 369-377.
[58] E. Aljalbout, V. Golkov, Y. Siddiqui, and D. Cremers. (2018). "Clustering with deep learning: Taxonomy and new methods." [Online]. Available: https://arxiv.org/abs/1801.07648
[59] A. Ng, "Sparse autoencoder", CS294A Lecture Notes, 2011. [21] M. Boomija and M. Phil, "Comparison of partition based clustering
[60] algorithms", J. Comput. Appl., vol. 1, no. 4, pp. 18-21, 2008. [22] H.-P. Kriegel, P. Kröger, J. Sander, and A. Zimek, "Density-based clustering", Wiley Interdiscipl. Rev., Data Mining Knowl. Discovery, vol. 1, no. 3,
[61] pp. 231-240, 2011. [23] S. C. Johnson, "Hierarchical clustering schemes", Psychometrika, vol. 32, no. 3, pp. 241-254, 1967. [24] I. Goodfellow et al., "Generative
[62] adversarial nets", in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672-2680. [25] D. P. Kingma and
[63] M. Welling. (2013). "Auto-encoding variational Bayes." [Online]. Available: https://arxiv.org/abs/1312.6114 [26] A.
[64] Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classifica-
[65] tion with deep convolutional neural networks", in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097-1105.