International Journal of Scientific and Technical Advancements
ISSN: 2454-1532
A Review of Clustering Algorithms for Gene
Expression Data
Ritika Manhas1, Ayushman Koul2, Bhawna Sharma3, Sheetal Gandotra4
1 Department of Microbiology, Panjab University, Chandigarh, India-160014
2,3,4 Department of Computer Engineering, GCET Jammu, J&K, India-181121
1,#
[email protected],
[email protected],
[email protected],
[email protected]
Abstract- In biological studies, DNA microarray technology has provided various ways to understand gene expression in different species and to study the effect of the environment on it. It is used to identify groups of genes with similar expression patterns that work together to produce proteins. Analyzing the huge volumes of data obtained from genome sequencing and extracting meaningful information from them is a tedious task. Therefore, the first major step is the clustering of genes, which helps in identifying similar gene expression and in understanding the function of each gene. In this paper we discuss DNA microarray technology and compare the various clustering techniques used in cluster analysis of gene expression data.
Keywords - DNA Microarray Technology; Clustering Algorithms; Gene Expression Data.
I. INTRODUCTION
For genomic research in the field of biology, a number of traditional approaches have been used to collect and analyse data for studying gene expression. Analysis of gene expression data allows us to identify differences in the levels of gene expression and in expression profiles. Studying gene expression data is of utmost importance, as it provides insight into the biological processes occurring inside cells in relation to their environment and into different genes working together to ultimately produce a similar effect. A slight variation in the expression level could indicate a possible stress condition being experienced by the organism or cell, and could even indicate a mutation in the genome.
II. DNA MICROARRAY TECHNOLOGY
The DNA microarray technology is amongst the leading techniques used to study gene expression. The technique uses DNA probes of known sequences. These probes are spotted on microarray chips, which makes it possible to monitor thousands of genes simultaneously. The mRNA of the sample is reverse-transcribed into cDNA, which is then labelled. The technique compares the gene expression levels of the test sample and the control sample by hybridizing the cDNA with the probes. The intensity signals obtained by scanning after the hybridization reaction, under different conditions such as time points or biological processes, form the intensity matrix. The data thus obtained is called the gene expression data. The intensity levels of the test and the control are then compared: an increase in the expression levels indicates co-expression of genes, while increases and decreases in the intensity levels over time could indicate other relations amongst the genes under study. To filter meaningful data out of the intensity matrix, various clustering algorithms are used to cluster together genes that are co-regulated, co-expressed and co-functional. Clustering the data also gives us insight into the mechanism of transcription regulation. These algorithms group the gene expression data into several clusters based on the level of similarity amongst the expression levels of different genes. Genes in separate clusters are more dissimilar than those in the same cluster. The co-expressed genes are grouped in the same clusters, indicating co-regulation and co-function.
III. CLUSTERING
Clustering is a type of unsupervised classification. It is the process of forming disjoint sets, called clusters, of data objects such that the objects within a cluster are highly similar to each other while the similarity between objects in different clusters is minimal. Clustering is broadly used in many applications such as data analysis, image processing, gene clustering, market research and pattern recognition.
Fig. 1. Clustering procedure
Ritika Manhas, Ayushman Koul, Bhawna Sharma, and Sheetal Gandotra, “A Review on Clustering Algorithms for Gene Expression Data,”
International Journal of Scientific and Technical Advancements, Volume 5, Issue 1, pp. 137-140, 2019.
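The definition of clustering in Section III (high similarity within a cluster, minimal similarity between clusters) can be checked numerically. The following is only a minimal sketch: the function name `mean_pairwise_distances`, the toy expression profiles and the cluster labels are all invented for illustration, and Euclidean distance is assumed as the (dis)similarity measure.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def mean_pairwise_distances(profiles, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    intra, inter = [], []
    for i, j in combinations(range(len(profiles)), 2):
        d = dist(profiles[i], profiles[j])
        # Pairs sharing a label lie within one cluster; all other pairs span two clusters.
        (intra if labels[i] == labels[j] else inter).append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Toy expression profiles (rows = genes, columns = conditions); values are invented.
profiles = [
    (1.0, 1.2, 0.9),
    (1.1, 1.0, 1.0),
    (5.0, 5.2, 4.9),
    (5.1, 4.8, 5.0),
]
labels = [0, 0, 1, 1]  # hypothetical cluster assignment

intra, inter = mean_pairwise_distances(profiles, labels)
print(f"mean intra-cluster distance: {intra:.3f}")
print(f"mean inter-cluster distance: {inter:.3f}")
```

For a reasonable partition the intra-cluster value should be clearly smaller than the inter-cluster value; a partition in which the two are comparable does not satisfy the definition above.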
IV. TYPES OF CLUSTERING METHODS
In this section, we describe the different clustering methods used for gene-based clustering, whose aim is to identify groups of co-expressed genes.
A. K-means Algorithm
The K-means algorithm is the most popular method used for clustering. It is a typical partition-based algorithm using a centroid approach, where a cluster is represented by its centre of gravity. It is used to classify data objects into a predefined number of clusters. It starts from K initial cluster centres and iteratively reduces the distance between the data objects and the centre of the cluster they are assigned to. The distance between data objects is measured with the Euclidean distance, which for two data points $X = (x_1, x_2, \ldots, x_m)$ and $Y = (y_1, y_2, \ldots, y_m)$ is calculated as follows:
$$D(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_m - y_m)^2}$$
The K-means algorithm aims to minimize the sum of squared distances between all points and their cluster centres [1]. The algorithm consists of the steps given below:

Algorithm: K-means clustering algorithm [2]
Require: D = {d1, d2, ..., dn} // set of n data points
         K // number of desired clusters
Ensure: A set of K clusters.
Steps:
1. Arbitrarily choose K data points from D as initial centroids;
2. Repeat
      Assign each point di to the cluster which has the closest centroid;
      Calculate the new mean for each cluster;
   Until the convergence criterion is met.

The K-means algorithm is fast and performs well compared with newer clustering algorithms, but it has a number of limitations. First, the number of gene clusters in a gene expression data set is usually unknown. To detect the optimal number of clusters, users usually run the algorithm with different values of K, and for a large expression data set containing thousands of genes this extensive parameter fine-tuning may not be practical [3]. Second, gene expression data typically contain a huge amount of noise; however, the K-means algorithm forces each gene into a cluster, which may make the algorithm sensitive to noise [3]. Various applications of the K-means algorithm to clustering gene expression data are also discussed in the literature [4, 5, 6, 7, 8].
B. Hierarchical Clustering
Hierarchical clustering is the hierarchical decomposition of the data based on group similarities. It is extensively used in gene expression data analysis. This type of clustering is divided into two methods, agglomerative and divisive. Agglomerative clustering uses a bottom-up approach, in which each data point starts in its own cluster. These clusters are then joined by repeatedly taking the two most similar clusters and merging them. Divisive clustering uses a top-down approach, in which all data points start in the same cluster; a parametric clustering algorithm such as K-means then divides the cluster into two. Each resulting cluster is further divided into two until the desired number of clusters is obtained.
Clustering Using Representatives (CURE) employs an algorithmic model that combines the centroid-based and all-points approaches. A constant number of scattered points in a cluster act as its representatives. At each step, the clusters with the closest pair of representative points are merged. It uses random sampling and partitioning to reduce the size of large input data sets [9].
The CHAMELEON algorithm generates a sparse graph in which nodes represent data items and weighted edges represent similarities among the data items. It uses a graph partitioning algorithm to cluster the data items into a large number of relatively small sub-clusters, and then uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters [10].
ROCK is also an agglomerative hierarchical clustering algorithm. It uses links to measure the similarity between a pair of data points, and merges clusters on the basis of these links [11].
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is suitable for large databases [12].
Eisen's method is favored by many biologists and has become the most widely used tool in gene expression data analysis [13].
However, the conventional agglomerative approach suffers from a lack of robustness [3], i.e., a small perturbation of the data set may greatly change the structure of the hierarchical dendrogram. Another drawback of the hierarchical approach is its high computational complexity [3].
C. Model-based Clustering
Clustering algorithms can also be developed based on probability models. In the family of model-based clustering algorithms, one assumes certain models for the clusters and tries to optimize the fit between the data and the models. The Expectation Maximization (EM) algorithm [2] determines good values for the model parameters iteratively. It is able to handle clusters of different shapes, but it requires many iterations, which makes the algorithm costly. Model-based clustering approaches for gene expression data are discussed in [14, 15].
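To make the iterative nature of EM concrete, the sketch below fits a mixture model to one-dimensional data. It is an illustration under stated assumptions rather than the method of [2] or [14, 15]: the text does not fix a particular probability model, so a univariate Gaussian mixture is assumed here, and the function names `gaussian_pdf` and `em_gmm_1d`, the quantile-based initialization, the fixed iteration count and the synthetic data are all hypothetical choices.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, k, n_iter=100):
    """Fit a k-component 1-D Gaussian mixture with plain EM iterations."""
    n = len(data)
    ordered = sorted(data)
    # Assumed initialization: means at evenly spaced quantiles of the data.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p])

        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # guard against a collapsing component

    # Hard cluster labels taken from the responsibilities of the last E-step.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels


# Synthetic stand-in for expression values: two groups centred at 0 and 5.
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.0) for _ in range(200)]

weights, means, variances, labels = em_gmm_1d(data, k=2)
print("weights:", [round(w, 2) for w in weights])
print("means:  ", [round(m, 2) for m in means])
```

Every iteration touches each data point once per component, which is where the cost mentioned above comes from. A practical implementation would normally stop when the log-likelihood no longer improves instead of running a fixed number of iterations.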
Self Organizing Map (SOM) by Teuvo Kohonen provides
a data visualization technique which helps to understand high
dimensional data by reducing the dimensions of the data to a map. SOM also embodies the clustering concept by grouping similar data together. It can therefore be said that SOM reduces data dimensions and displays similarities among data. The Self-Organizing Map (SOM) [16] is easy to implement, fast, and scalable to large gene expression data sets. It is based on a single-layered neural network. The data points are presented at the input, and the output neurons are organized in a simple neighborhood structure such as a two-dimensional m*n grid. A reference vector is attached to each neuron, and each data point is mapped to the neuron with the nearest reference vector. Each data point acts as a training sample that moves the reference vectors towards the denser areas of the input space, so that the reference vectors come to fit the distribution of the input data set. Clusters are identified by mapping all data points to the output neurons after the training process is complete. The self-organizing map clustering algorithm starts with the initialization of the reference vectors, followed by the random selection of a data point. Then the reference vector nearest to the current data point is determined, and finally that reference vector and the neighboring reference vectors are updated. SOM is a very efficient method for gene expression data clustering and is also discussed in the literature [2, 16].

TABLE I. Computational complexity of clustering algorithms
S. No.  Clustering Algorithm     Computational Complexity                                     Capable of handling high dimensional data
1       K-means                  O(NKd) (time), O(N+K) (space)                                No
2       Hierarchical Clustering  O(N^2) (time), O(N^2) (space)                                No
3       ROCK                     O(n^3)                                                       No
4       CHAMELEON                O(m^2 log m)                                                 No
5       BIRCH                    O(N) (time)                                                  Yes
6       CURE                     O(N_sample^2 log N_sample) (time), O(N_sample) (space)       Yes
7       SOM                      O(N_sample^2) (time)                                         Yes

V. CONCLUSION
In this review, we discussed three types of traditional clustering methods, i.e. K-means, hierarchical clustering and model-based clustering. A good clustering algorithm should be able to generate clusters of arbitrary shape, handle large volumes of input data, be unaffected by the order of the input data, handle noise in the input data, and produce the desired, correct results with higher accuracy in less time. In the field of biology, clustering algorithms have made it possible to identify genes that perform the same function, and have also helped to study the effect of the environment on genes and to predict diseases before they actually show symptoms. Examples of such applications of clustering algorithms include the use of hierarchical clustering on DNA microarray data by Alizadeh et al. in 2000, which led to the discovery of two distinct subtypes of diffuse large B-cell lymphoma (DLBCL) [17]. The K-means algorithm was found to be efficient for clustering a lung cancer data set in Attribute Relation File Format (ARFF) [18], and the K-means algorithm was applied to images from the Mammography Image Analysis Society (MIAS) to determine the stage of malignant breast cancer [19]. Traditional clustering algorithms, despite proving useful, still have scope for improvement to curtail the significant drawbacks that they pose for gene expression data analysis. To tackle these issues, new algorithms have recently been implemented to address the drawbacks of traditional approaches and achieve better accuracy. The Enhanced Automatic Generation of Merge Factor for ISODATA (EAGMFI) algorithm for clustering microarray data, based on the K-means and AGMFI clustering algorithms, was implemented in [1] to overcome the random selection of the initial seed points of the desired clusters. Similarly, as discussed, methods such as BIRCH and CURE, which are based on hierarchical approaches, perform better when applied to large databases, whereas model-based approaches are costly compared to these as they require many iterations.

REFERENCES
[1] T. Chandrasekhar, K. Thangavel and E. Elayaraja, “Performance Analysis of Enhanced Clustering Algorithm for Gene Expression Data,” International Journal of Computer Science Issues, Vol. 8, Issue 6, No 3, 2011
[2] Dempster AP, Laird NM, Rubin DB, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society. Series B (Methodological), pages 1-38, 1977
[3] Daxin Jiang, Chun Tang, Aidong Zhang, “Cluster analysis for gene expression data: a survey”, IEEE Transactions on Knowledge and Data Engineering, vol. 16, issue 11, pages 1370-1386, 2004
[4] J. H. Do and D.-K. Choi, “Clustering Approaches to Identifying Gene Expression Patterns from DNA Microarray Data,” Molecular Cells, vol. 25, no. 2, 2007
[5] Do JH, Choi D, “Clustering approaches to identifying gene expression patterns from DNA microarray data”, Molecules and Cells, pages 242-279, 2008
[6] Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC, “Evaluation and comparison of gene clustering methods in microarray analysis”, Bioinformatics, pages 2405-12, 2006
[7] Costa IG, de Carvalho FD, de Souto MC, “Comparative analysis of clustering methods for gene expression time course data”, Genetics and Molecular Biology, pages 623-631, 2004
[8] Borg A, Lavesson N, Boeva V, “Comparison of Clustering Approaches for Gene Expression Data”, In SCAI, pages 55-64, 2013
[9] Guha S, Rastogi R, Shim K, “CURE: an efficient clustering algorithm for large databases”, In ACM SIGMOD, vol. 27, pages 73-84, 1998
[10] Karypis G, Han EH, Kumar V, “Chameleon: Hierarchical clustering using dynamic modeling”, Computer, pages 68-75, 1999
[11] Guha S, Rastogi R, Shim K, “ROCK: A robust clustering algorithm for categorical attributes”, in Data Engineering, Proceedings, 15th International Conference, pp. 512-521, 1999
[12] Zhang T, Ramakrishnan R, Livny M, “BIRCH: an efficient data clustering method for very large databases”, in ACM SIGMOD, vol. 25, No. 2, pp. 103-114, 1996
[13] Eisen, Michael B., Spellman, Paul T., Brown, Patrick O. and Botstein, David, “Cluster analysis and display of genome-wide expression patterns”, Proc. Natl. Acad. Sci. USA, pages 14863–14868, 1998
[14] Datta S, Datta S, “Comparisons and validation of statistical clustering techniques for microarray gene expression data”, Bioinformatics, pages 459-66, 2003
[15] Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR, “Interpreting patterns of gene expression with
self-organizing maps: methods and application to hematopoietic differentiation”, Proceedings of the National Academy of Sciences, pages 2907-12, 1999
[16] Tomida S, Hanai T, Honda H, Kobayashi T, “Analysis of expression profile using fuzzy adaptive resonance theory”, Bioinformatics, pages 1073-83, 2002
[17] Alizadeh AA, Eisen MB, Davis RE, et al. “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling”, Nature, pages 503-11, 2000
[18] Dharmarajan A, Velmurugan T, “Lung cancer data analysis by k-means and farthest first clustering algorithms”, Indian J Sci Technol, 2015
[19] Karmilasari SW, Hermita M, Agustiyani NP, Hanum Y, Lussiana ETP, “Sample K-Means Clustering Method for Determining the Stage of Breast Cancer Malignancy Based on Cancer Size on Mammogram Image Basis”, (IJACSA) Int J Adv Comput Sci Appl, pages 86-90, 2014