A Survey On Clustering Algorithm For Microarray Gene Expression Data
Volume: 4 Issue: 3
ISSN: 2321-8169
335 - 341
_______________________________________________________________________________________
R. Porkodi
Assistant Professor
Department of CS
Bharathiar University
Coimbatore, India
Porkodi_r76buc.edu.in
Abstract- DNA microarray data are huge and multidimensional, containing simultaneous gene expression measurements produced with microarray chip technology, and handling these data is cumbersome. The microarray technique is used to measure the expression levels of tens of thousands of genes under different conditions, such as a time series during a biological process. Clustering is an unsupervised learning process which partitions a given data set into groups such that objects in the same group are similar and objects in different groups are dissimilar. The aim of this paper is to analyze the accuracy of different clustering algorithms on microarray data and to identify a suitable algorithm for further research.
Keywords- Microarray technology, Clustering techniques, Partition algorithm, Fuzzy c-means Algorithm, Hierarchical clustering, Model based
clustering.
__________________________________________________*****________________________________________________
I. INTRODUCTION
Data mining is often defined as finding hidden information in, or extracting meaningful information from, a large database. This extraction of meaningful information is also known as knowledge discovery. Clustering is the task of grouping a set of objects in such a way that objects in the same group, called a cluster, are more similar to each other than to objects in other groups [1]. A good clustering method produces high-quality clusters in which the intra-class similarity is high and the inter-class similarity is low. Clustering can be applied in many fields, including marketing, biology, libraries, insurance, city planning, the World Wide Web and many more. In data mining, data are mined using one of two learning approaches: supervised learning or unsupervised learning. Classification is a supervised learning problem that works on a collection of labeled data. Such models are called predictive, and each data object has explanatory variables and one or more dependent variables. Clustering is an unsupervised learning problem: it deals with finding structure in a collection of unlabeled data. Such models are sometimes called descriptive. Clustering is one of the most common unsupervised data mining methods for exploring the hidden structures embedded in a dataset; here the data objects are not split into dependent and explanatory variables [2]. Microarray technology measures the copy number of molecules in a mixture on a small slide. Thousands or millions of different kinds of molecules can be measured [3], creating large volumes of data per biological sample. The molecules can be DNA, RNA or protein.
Gene expression data can be clustered in two ways: sample-based clustering and gene-based clustering.
Group samples:
Group together tissues that are similarly affected by a disease.
Group together patients that are similarly affected by a disease.
Group genes:
Group together genes that are similarly affected by a disease.
Group together genes that respond similarly to an experimental condition.
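The gene expression matrix has one row per gene and one column per sample:
$$W = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1m} \\ w_{21} & w_{22} & \cdots & w_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nm} \end{pmatrix}$$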
involves placing thousands of gene sequences in known locations on a glass slide called a gene chip. A sample containing DNA or RNA is placed in contact with the gene chip [6]. DNA microarrays (also called gene chips) are devices not much larger than postage stamps. They are printed on a glass substrate containing as many as 400,000 tiny cells, each holding a microscopic spot of DNA. Each spot holds a short, synthetic, single-stranded DNA sequence from a different human gene [7]. This makes it possible to carry out a very large number of genetic tests on a sample at one time. An array is an orderly arrangement of samples in which known and unknown DNA samples are matched according to base-pairing rules. An array experiment makes use of common assay systems such as microplates or standard blotting membranes. The sample spots are typically less than 200 microns in diameter, and an array usually contains thousands of spots. Thousands of spotted samples of known identity, called probes, are immobilized on a solid support such as a microscope glass slide, a silicon chip or a nylon membrane [8]. The spots can be DNA, cDNA or oligonucleotides. They are used to determine complementary binding of the unknown sequences, allowing parallel analysis for gene expression and gene discovery.
The paper is organized as follows: Section II reviews the literature, Section III describes the various clustering algorithms, Section IV describes cluster validation, Section V compares the clustering algorithms, and Section VI concludes the paper.
II. LITERATURE REVIEW
In [9], A. Dharmarajan and T. Velmurugan presented the performance of a partitioning-based algorithm. The algorithm was analyzed using three attributes selected from the total set of attributes. The LC.arff and LC.csv datasets were used. The datasets consist entirely of numeric values, for which the k-means algorithm gave better results.
In [10], Jyrki Joutsensalo and Antti Miettinen, and Tamayo, P., reported results of the SOM algorithm on the rat CNS dataset. The algorithm achieved high accuracy when compared with other algorithms. The paper described the expression levels of genes during rat central nervous system development over 9 time points. Mapping to 2-D space and statistical averaging are used in this technique, which gave better results than the average method.
In [11], Nikhil R. Pal, Kuhu Pal, James M. Keller and James C. Bezdek presented work on the FCM and EFC clustering algorithms. The quality of the clusters was assessed and the computational time was measured. Fuzzy reasoning algorithms were developed using the best set of clusters thus obtained.
In [12], Jacob Goldberger and Tamir Tassa discussed hierarchical clustering in which attributes are added automatically according to their priority. A human cancer dataset was used, and the algorithm gave the best result and time complexity.
technique commonly used, which is simple and fast. It is easy to implement and needs a small number of iterations [19]. The K-means algorithm is a typical partition-based clustering method. Given a pre-specified number K, the algorithm partitions the data set into K disjoint subsets which optimize the following objective function:
$$E = \sum_{i=1}^{K} \sum_{O \in C_i} |O - \mu_i|^2$$
Here O is a data object in cluster $C_i$ and $\mu_i$ is the centroid (mean of objects) of $C_i$. Thus, the objective function E tries to minimize the sum of the squared distances of objects from their cluster centers. The time complexity of K-means is O(i*k*n), where i is the number of iterations, k is the number of clusters and n is the number of data objects. However, the K-means algorithm forces each gene into a cluster, which may cause the algorithm to be sensitive to noise.
Typically the square error criterion is used, defined as
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$
where E is the sum of the square error for all objects in the data set, p is the point in space representing a given object and $m_i$ is the mean of cluster $C_i$. For each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible.
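The K-means procedure and the square error criterion E can be illustrated with a short Python sketch. It is a minimal illustration, not the implementation used in the surveyed papers: the function names, the random seeding of centroids from the data and the stopping rule (stable assignments) are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance |a - b|^2 between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of profiles."""
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(data, k, max_iter=100, seed=0):
    """Return (labels, centroids, E), where E is the square error criterion."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k objects drawn from the data.
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: every object is forced into its nearest cluster.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so E can no longer decrease
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(data, labels) if label == j]
            if members:  # keep the old centroid if a cluster became empty
                centroids[j] = centroid(members)
    error = sum(squared_distance(p, centroids[label])
                for p, label in zip(data, labels))
    return labels, centroids, error
```

Calling `kmeans(profiles, k=2)` on a list of expression profiles returns one cluster label per profile together with the final centroids and the value of E, so runs with different K or different seeds can be compared by their E values.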
CLARANS was one of the first clustering algorithms developed specifically for spatial data mining. CLARANS itself grew out of two clustering algorithms, PAM and CLARA, that were developed in the field of statistics.
PAM (Partitioning Around Medoids) is a K-medoid based clustering algorithm that attempts to cluster a set of m points into K clusters.
CLARA (Clustering LARge Applications) is an adaptation of PAM for handling larger data sets. It works by repeatedly sampling a set of data points, calculating the medoids of the sample, and evaluating the cost of the configuration that consists of these sample-derived medoids and the entire data set.
B. SELF ORGANIZING MAP
A self-organizing map (SOM) is used for visualization and analysis of high-dimensional datasets. A SOM maps a high-dimensional dataset onto a lower-dimensional representation, usually 1-D, 2-D or 3-D. It is an unsupervised learning algorithm and does not require a target vector, since it learns to classify data without supervision. A SOM is formed from a grid of nodes or units to which the input data are presented. Every node is connected to the input, and there is no connection between the nodes [20]. The SOM was developed by Kohonen on the basis of a single-layered neural network. Self-organizing feature maps (SOFMs) were developed by observing how neurons work in the brain and in ANNs: the firing of a neuron affects the firing of other neurons near it.
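The idea that a node influences the nodes near it on the grid can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the Gaussian neighborhood function, the linearly decaying learning rate and radius, and the names `train_som` and `best_matching_unit` are choices made for this example and are not taken from the surveyed papers.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid coordinate of the node whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Every node is connected to the input through its own weight vector.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0  # assumed initial neighborhood radius
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # present inputs in random order
            decay = 1.0 - step / total_steps
            lr = lr0 * decay
            sigma = max(sigma0 * decay, 0.1)
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Nodes close to the winner on the grid are pulled more strongly.
                grid_dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist2 / (2.0 * sigma ** 2))
                for t in range(dim):
                    w[t] += lr * influence * (x[t] - w[t])
            step += 1
    return weights
```

After training, `best_matching_unit(weights, profile)` gives the grid cell of a gene expression profile, which is how high-dimensional profiles end up on a 2-D map.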
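The cluster centers $c_j$ and the memberships $u_{ij}$ are updated as
$$c_j = \frac{\sum_{i=1}^{n} (u_{ij})^m x_i}{\sum_{i=1}^{n} (u_{ij})^m}, \qquad u_{ij} = \frac{\left( \frac{1}{\|x_i - c_j\|^2} \right)^{\frac{1}{m-1}}}{\sum_{k=1}^{c} \left( \frac{1}{\|x_i - c_k\|^2} \right)^{\frac{1}{m-1}}}$$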
IJRITCC | March 2016, Available @ http://www.ijritcc.org
If $\|U^{(k)} - U^{(k-1)}\| < \varepsilon$, the iteration stops; otherwise the update is repeated. A membership cutoff is then determined: each data point $g_i$ is assigned to cluster $cl_j$ if $u_{ij}$ of $U^{(k)}$ exceeds the cutoff. The algorithm allows a data point to be in multiple clusters, which represents the behavior of genes well, since genes are usually involved in multiple functions. Its drawbacks are that the number of clusters c must be defined, a membership cutoff value must be determined, and the clusters are sensitive to the initial assignment of centroids.
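The fuzzy c-means iteration and the cutoff-based assignment can be sketched in Python as follows. This is a hedged illustration: the random initialization of the membership matrix, the maximum absolute change used as the stopping norm, the default fuzzifier m = 2 and all function names are assumptions made for this example.

```python
import math
import random


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def update_centers(data, U, m):
    """Each center is the mean of the data weighted by u_ij^m."""
    dim = len(data[0])
    centers = []
    for j in range(len(U[0])):
        w = [row[j] ** m for row in U]
        total = sum(w)
        centers.append([sum(wi * x[t] for wi, x in zip(w, data)) / total
                        for t in range(dim)])
    return centers


def update_memberships(data, centers, m):
    """u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)); each row sums to 1."""
    power = 2.0 / (m - 1.0)
    U = []
    for x in data:
        d = [distance(x, cj) for cj in centers]
        zeros = [j for j, dj in enumerate(d) if dj == 0.0]
        if zeros:  # x coincides with a center: full membership there
            U.append([1.0 / len(zeros) if j in zeros else 0.0
                      for j in range(len(centers))])
        else:
            U.append([1.0 / sum((dj / dk) ** power for dk in d) for dj in d])
    return U


def fuzzy_c_means(data, c, m=2.0, eps=1e-6, max_iter=200, seed=0):
    rng = random.Random(seed)
    U = []
    for _ in data:  # assumed start: random memberships normalized per point
        row = [rng.random() + 1e-6 for _ in range(c)]
        U.append([v / sum(row) for v in row])
    for _ in range(max_iter):
        centers = update_centers(data, U, m)
        new_U = update_memberships(data, centers, m)
        change = max(abs(a - b) for ra, rb in zip(new_U, U)
                     for a, b in zip(ra, rb))
        U = new_U
        if change < eps:  # stop when the membership matrix barely changes
            break
    return centers, U


def assign_with_cutoff(U, cutoff):
    """A point joins every cluster whose membership exceeds the cutoff."""
    return [[j for j, u in enumerate(row) if u > cutoff] for row in U]
```

With a cutoff below 1/c a gene can be reported in several clusters at once, which is the behavior the text describes for genes involved in multiple functions.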
D. HIERARCHICAL CLUSTERING
Hierarchical clustering builds a cluster hierarchy, in other words a tree of clusters, also known as a dendrogram [23]. Elements are organized into a tree in which the leaves represent genes and the lengths of the paths between leaves represent the distances between genes. Similar genes lie within the same subtrees. The branches of a dendrogram not only record the formation of the clusters but also indicate the similarity between the clusters. By cutting the dendrogram at some level, we can obtain a specified number of clusters. By reordering the objects such that the branches of the corresponding dendrogram do not cross, the data set can be arranged with similar objects placed together. There are two methods for hierarchical clustering, distinguished by how the dendrogram is formed.
Agglomerative (bottom-up approach): start with every element in its own cluster, and iteratively join clusters together.
Divisive (top-down approach): start with one cluster and iteratively divide it into smaller clusters.
UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is an agglomerative algorithm that has been applied to gene expression data together with a method to graphically represent the clustered data set. In this method, each cell of the gene expression matrix is colored on the basis of the measured fluorescence ratio, and the rows of the matrix are re-ordered based on the hierarchical dendrogram structure and a consistent node-ordering rule. After clustering, the original gene expression matrix is represented by a colored table (a cluster image) in which large contiguous patches of color represent groups of genes that share similar expression patterns over multiple conditions.
Genes have also been split through a divisive approach called the deterministic-annealing algorithm [24]. First, two initial cluster centroids $C_j$, $j = 1, 2$, were randomly defined. The expression pattern of gene k was represented by a vector $g_k$, and the probability of gene k belonging to cluster j was assigned according to a two-component Gaussian model:
$$P_j(g_k) = \frac{\exp(-\beta |g_k - C_j|^2)}{\sum_j \exp(-\beta |g_k - C_j|^2)}$$
The cluster centroids were recalculated by
$$C_j = \frac{\sum_k g_k P_j(g_k)}{\sum_k P_j(g_k)}$$
An iterative process (the EM algorithm) was then applied to solve for $P_j$ and $C_j$. For $\beta = 0$ there was only one cluster, $C_1 = C_2$. When $\beta$ was increased in small steps until a threshold was reached, two distinct, converged centroids emerged. The whole data set was recursively split until each cluster contained only one gene.
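The agglomerative (bottom-up) procedure with average linkage, as used by UPGMA, can be sketched in Python as follows. This is a minimal illustration that assumes Euclidean distance between expression profiles; stopping when `num_clusters` clusters remain corresponds to cutting the dendrogram at one level. The function names are chosen for this example.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, cluster_a, cluster_b):
    """Mean of all pairwise distances between members of two clusters."""
    total = sum(euclidean(points[i], points[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    Returns (clusters, merges): clusters are lists of point indices and
    merges records (left, right, distance) for every join, which is the
    information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]  # every element starts alone
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (a, b)] + [merged]
    return clusters, merges
```

The exhaustive pair search keeps the sketch short; it is quadratic in the number of clusters at every merge and is only meant for small examples.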
$$L_{mix} = \prod_{r=1}^{n} \sum_{i=1}^{k} z_i^r f_i(x_r \mid \theta_i)$$
where n is the number of data objects, k is the number of components, $x_r$ is a data object (i.e., a gene expression pattern), $f_i(x_r \mid \theta_i)$ is the density function of $x_r$ in component $C_i$ with some unknown set of parameters $\theta_i$ (model parameters), and $z_i^r$ (hidden parameters) represents the probability that $x_r$ belongs to $C_i$. Usually the parameters $\theta$ and z are estimated by the EM algorithm. The EM algorithm iterates between Expectation (E) steps and Maximization (M) steps. In the E step, the hidden parameters z are conditionally estimated from the data with the current estimate of $\theta$. In the M step, the model parameters $\theta$ are estimated so as to maximize the likelihood of the complete data given the estimated hidden parameters [26]. When the EM algorithm converges, each data object is assigned to the component (cluster) with the maximum conditional probability. An important advantage of model-based approaches is that they provide an estimated probability that a data object will belong to a given cluster. Because gene expression data are typically highly connected, there may be instances in which a single gene has a high correlation with two different clusters. Thus, the probabilistic feature of model-based clustering is particularly suitable for gene expression data.
However, model-based clustering relies on the assumption that the data set fits a specific distribution [27]. The modeling of gene expression data sets, in particular, is an ongoing effort by many researchers, and, to the best of our knowledge, there is currently no well-established model to represent gene expression data. When commonly used data transformations were applied and the degree to which three gene expression data sets fit the multivariate Gaussian model assumption was assessed, the raw values from all three data sets fit the Gaussian model poorly, and there is no uniform rule to indicate which transformation would best improve this fit.
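The alternation of E and M steps in model-based clustering can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not in the text: the data are one-dimensional, every component is a Gaussian with its own mean and variance, the initial means are taken from evenly spaced positions in the sorted data, and a fixed number of iterations replaces a convergence test.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gaussian_mixture(xs, k, iters=100, min_var=1e-6):
    """Fit a k-component 1-D Gaussian mixture with EM.

    Returns (means, variances, weights, resp, labels), where resp[r][i] is
    the estimated probability that object r belongs to component i.
    """
    n = len(xs)
    ordered = sorted(xs)
    # Assumed initialization: means spread over the sorted data.
    means = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    overall = sum(xs) / n
    start_var = max(sum((x - overall) ** 2 for x in xs) / n, min_var)
    variances = [start_var] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E step: estimate the hidden memberships from the current parameters.
        resp = []
        for x in xs:
            dens = [weights[i] * gaussian_pdf(x, means[i], variances[i])
                    for i in range(k)]
            total = sum(dens)
            if total == 0.0:  # numerical underflow: fall back to uniform
                resp.append([1.0 / k] * k)
            else:
                resp.append([d / total for d in dens])
        # M step: re-estimate the parameters given the memberships.
        for i in range(k):
            ri = sum(r[i] for r in resp)
            weights[i] = ri / n
            means[i] = sum(r[i] * x for r, x in zip(resp, xs)) / ri
            variances[i] = max(
                sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, xs)) / ri,
                min_var,
            )
    # Each object goes to the component with the largest probability.
    labels = [max(range(k), key=lambda i: r[i]) for r in resp]
    return means, variances, weights, resp, labels
```

The `resp` matrix is the probabilistic output that makes the approach attractive for genes correlated with more than one cluster; the hard `labels` are only derived from it at the end.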
IV. CLUSTER VALIDATION
Clustering algorithms partition the dataset according to different clustering criteria. For gene expression data, clustering results in groups of genes or groups of samples. However, different clustering algorithms, or even a single clustering algorithm using different parameters, generally result in different sets of clusters. Cluster validation is the process of assessing the quality and reliability of the cluster sets derived from various clustering processes. Here, cluster validity is assessed with six measures: Hubert's statistic, the Dunn index, the simple matching coefficient, the NMI measure, purity and the silhouette coefficient.
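A. Hubert's Statistic
Let us consider two n×n proximity matrices X(i, j) and Y(i, j) on the same n genes. X(i, j) denotes the observed distance between genes i and j, and Y(i, j) is defined as follows [28]:
$$Y(i, j) = \begin{cases} 1 & \text{if genes } i \text{ and } j \text{ are in the same cluster} \\ 0 & \text{otherwise} \end{cases}$$
Hubert's statistic is then
$$\Gamma = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i, j)\, Y(i, j), \qquad M = \frac{n(n-1)}{2}$$
B. Dunn Index
$$D = \min_{1 \le i \le k} \; \min_{i < j \le k} \frac{\delta(C_i, C_j)}{\max_{1 \le l \le k} \Delta(C_l)}$$
where $\delta(C_i, C_j)$ is the distance between clusters $C_i$ and $C_j$ and $\Delta(C_l)$ is the diameter of cluster $C_l$.
C. Simple Matching Coefficient
Two binary matrices are compared entry by entry, and the agreements and disagreements are counted:

              Matrix Q
              1     0
        1     a     b
        0     c     d

The simple matching coefficient is
$$SMC = \frac{a + d}{a + b + c + d}$$
and the Jaccard coefficient is
$$J = \frac{a}{a + b + c}$$
Here, the negative matches d are not considered.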
D. NMI Measure
NMI stands for Normalized Mutual Information. The NMI of two labelings of the same objects can be measured as [31]
$$NMI(X, Y) = \frac{2\, I(X, Y)}{H(X) + H(Y)}$$
where I(X, Y) is the mutual information between two random variables X and Y and H(X) denotes the entropy of X. X is the consensus clustering, while Y is the true label.
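The NMI of a clustering against the true labels can be computed with a few lines of Python. The sketch assumes the commonly used normalization 2·I(X, Y) / (H(X) + H(Y)) with natural logarithms; the function names and the convention of returning 1.0 when both labelings consist of a single group are choices made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """H(X) of a labeling, using natural logarithms."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """I(X, Y) estimated from the joint label counts."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum((c / n) * math.log((c * n) / (px[a] * py[b]))
               for (a, b), c in joint.items())


def nmi(x, y):
    """Normalized mutual information, assuming 2 * I / (H(X) + H(Y))."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0.0:
        return 1.0  # assumed convention: both labelings are a single group
    return 2.0 * mutual_information(x, y) / (hx + hy)
```

The measure does not depend on how the clusters are numbered: a clustering that reproduces the true groups under different label names still scores 1, while a clustering that is statistically independent of the true labels scores 0.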
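E. Purity
Purity is very similar to entropy. To calculate the purity of a set of clusters, we first calculate the purity of each cluster.
F. Silhouette Coefficient
The silhouette coefficient SC is the average of the silhouette values $s(x_i)$ of the N objects:
$$SC = \frac{1}{N} \sum_{i=1}^{N} s(x_i)$$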
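Purity and the silhouette coefficient can both be sketched in Python. The definitions used here are the commonly used ones and are assumptions with respect to this paper: purity is the fraction of objects that carry the majority true label of their cluster, and the silhouette value of an object is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of another cluster.

```python
import math
from collections import Counter


def purity(cluster_labels, true_labels):
    """Fraction of objects that carry the majority true label of their cluster."""
    members = {}
    for c, t in zip(cluster_labels, true_labels):
        members.setdefault(c, []).append(t)
    majority = sum(Counter(ts).most_common(1)[0][1] for ts in members.values())
    return majority / len(true_labels)


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all objects (Euclidean distance assumed)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            scores.append(0.0)  # singleton cluster or a single cluster overall
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Purity needs the true labels, so it is an external measure, whereas the silhouette coefficient is computed from the data and the clustering alone.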
VI. CONCLUSION
Clustering algorithms are useful for identifying biologically relevant groups of genes and samples. Clustering techniques are essential in the data mining process to reveal natural structure and identify patterns in data sets. From the above survey, it is identified that the k-means algorithm is widely used for clustering microarray data. The quality of clustered data is measured using validation measures such as Hubert's statistic, the Dunn index, the simple matching coefficient, the NMI measure, purity and the silhouette coefficient; these measures are used extensively in most recent studies on gene expression data. Future research will be directed towards gene ontology (GO) terms for microarray gene expression data and towards hybridizing the k-means clustering algorithm with other algorithms to achieve quality and accuracy.
AUTHOR NAME | ALGORITHM | DATASET | OUTCOME | DEMERITS
A. Dharmarajan, T. Velmurugan [9] | Partition method: K-means | LC.arff | Time complexity is high. | Difficult to compare the quality of clusters.
- | SOM | Rat CNS | High accuracy. | Computational process is high.
- | Fuzzy c-means | Blood cancer data | High time complexity with best result. | Iteration process is expensive.
Jacob Goldberger, Tamir Tassa [12] | Hierarchical clustering: Divisive | Human cancer data | Accurate result. | -
K. Sasirekha, P. Baby [13] | Hierarchical clustering: Agglomerative | - | - | -
- | Model based clustering | Breast cancer data | - | Iteration process is high.
REFERENCES
[1] Mann, A.K. and Kaur, N. Survey paper on clustering techniques. IJSETR, 2(4):803-6, April 2013.
[2] Kaufman, L. and Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, 1990.
[3] Brazma, Alvis and Vilo, Jaak. Minireview: Gene expression data analysis. Federation of European Biochemical Societies, 480:17-24, June 2000.
[4] Siedow, J.N. Meeting report: Making sense of microarrays. Genome Biology, 2(2):reports4003.1-4003.2, 2001.
[5] DeRisi, J.L., Iyer, V.R. and Brown, P.O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, pages 680-686, 1997.
[6] Smet, Frank De, Mathys, Janick, Marchal, Kathleen, Thijs, Gert, Moor, Bart De and Moreau, Yves. Adaptive quality-based clustering of gene expression profiles. Bioinformatics, 18:735-746, 2002.
[7] R. Sharan, R. Elkon, R. Shamir. Cluster Analysis and its Applications to Gene Expression Data.
[8] Smet, Frank De, Mathys, Janick, Marchal, Kathleen, Thijs, Gert, Moor, Bart De and Moreau, Yves. Adaptive quality-based clustering of gene expression profiles. Bioinformatics, 18:735-746, 2002.
[9] A. Dharmarajan, T. Velmurugan. k-means algorithm.
[10] Jyrki Joutsensalo and Antti Miettinen; Tamayo, P. and others. Interpreting patterns of gene expression with self-organizing maps. PNAS, 96:2907-2912, 1999.
[11] Nikhil R. Pal, Kuhu Pal, James M. Keller and James C. Bezdek. Fuzzy c-means algorithm. Springer, Berlin, Heidelberg, 1995.
[12] Jacob Goldberger, Tamir Tassa. Hierarchical clustering algorithm. 2nd edition, Springer-Verlag, 1998.
[13] K. Sasirekha, P. Baby. Agglomerative algorithm. 2nd edition, Springer-Verlag, 1998.
[14] McLachlan, G.J., Bean, R.W. and Peel, D. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18:413-422, 2002.
[15] Daxin Jiang, Chun Tang, Aidong Zhang et al. Application of data mining in bioinformatics (book).
[16] T. Deepika, R. Porkodi. Cluster analysis and microarray technology. Bioinformatics.
[17] L. Boopathi, D. Vijaybabu. Cluster Analysis and its Applications to Gene Expression Data.
[18] Khalid Raza. Application of data mining in bioinformatics (book).
[19] Hartigan, J.A. Clustering Algorithms. Wiley, New York. Hartigan, J.A. and Wong, M.A. (1979) A k-means clustering.