Clustering

The document discusses clustering techniques for analyzing gene expression data from microarray experiments. It describes how clustering can help analyze high-dimensional microarray data by finding patterns among genes with similar expression profiles. Clustering techniques are useful for generating hypotheses about potential gene functions and protein interactions. Common clustering algorithms discussed include hierarchical clustering, k-means clustering, and fuzzy c-means clustering. The document explores how these different algorithms work and their advantages and disadvantages.

Uploaded by

Arul Kumar Venugopal

Clustering

Luis Tari
Motivation
 One of the important goals in the post-
genomic era is to discover the functions of
genes.
 High-throughput technologies allow us to
speed up the process of finding the functions
of genes.
 But there are tens of thousands of genes
involved in a microarray experiment.
 Questions:
 How do we analyze the data?
 Which genes should we start exploring?
Why clustering?
 Let’s look at the problem from a different angle
 The issue here is dealing with high-dimensional data
 How do people deal with high-dimensional data?
 Start by finding interesting patterns in the data
 Clustering is a well-known pattern-finding technique that has
been applied successfully across many domains
 Some successes in applying clustering to
microarray data
 Golub et al. (1999) used clustering techniques to discover
subclasses of AML and ALL from microarray data
 Eisen et al. (1998) used clustering techniques to group
genes of similar function together.
 But what is clustering?
Introduction
 The goal of clustering is to
 group data points that are close (or similar) to each other
 identify such groupings (or clusters) in an unsupervised
manner
 Unsupervised: no information is provided to the algorithm
on which data points belong to which clusters
 Example: what should the clusters be for these data points?
(Figure: a scatter of unlabeled data points, each marked x.)
What can we do with
clustering?
 One of the major applications of clustering in
bioinformatics is on microarray data to cluster similar
genes
 Hypotheses:
 Genes with similar expression patterns are coexpressed
 Coexpressed genes can imply that
 they are involved in similar functions
 they are somehow related, for instance because their proteins
directly or indirectly interact with each other
 It is widely believed that coexpressed genes are
involved in similar functions
 But still, what can we really gain from doing
clustering?
Purpose of clustering on
microarray data
 Suppose genes A and B are grouped in the
same cluster; then we hypothesize that genes
A and B are involved in similar functions.
 If we know the role of gene A is apoptosis
 but we do not know if gene B is involved in
apoptosis
 we can do experiments to confirm if gene B
indeed is involved in apoptosis.
Purpose of clustering on
microarray data
 Suppose genes A and B are grouped in the
same cluster, then we hypothesize that
proteins A and B might interact with each
other.
 So we can do experiments to confirm if such
interaction exists.
 So clustering microarray data in a way helps
us make hypotheses about:
 potential functions of genes
 potential protein-protein interactions
Does clustering always work?
 Does coexpression always imply that genes
have similar functions?
 Not necessarily
 housekeeping genes
 genes that are always expressed or never expressed
regardless of the conditions
 there can be noise in microarray data
 But clustering is useful in:
 visualization of data
 hypothesis generation
Overview of clustering

 From the paper “Data clustering: a review” (Jain et al., 1999)

 Feature Selection
 identifying the most effective subset of the original features to
use in clustering
 Feature Extraction
 transformations of the input features to produce new salient
features.
 Interpattern Similarity
 measured by a distance function defined on pairs of patterns.
 Grouping
 methods to group similar patterns in the same cluster
Outline of discussion
 Various clustering algorithms
 hierarchical
 k-means
 k-medoid
 fuzzy c-means
 Different ways of measuring similarity
 Measure validity of clusters
 How can we tell the generated clusters are good?
 How can we judge if the clusters are biologically
meaningful?
Hierarchical clustering
 Modified from Dr. Seungchan Kim’s slides
 Given the input set S, the goal is to produce a
hierarchy (dendrogram) in which nodes represent
subsets of S.
 Features of the tree obtained:
 The root is the whole input set S.
 The leaves are the individual elements of S.
 The internal nodes are defined as the union of their
children.
 Each level of the tree represents a partition of the
input data into several (nested) clusters or groups.
Hierarchical clustering
 There are two styles of hierarchical clustering
algorithms to build a tree from the input set S:
 Agglomerative (bottom-up):
 Beginning with singletons (sets with 1 element)
 Merging them until S is achieved as the root.
 It is the most common approach.
 Divisive (top-down):
 Recursively partitioning S until singleton sets are
reached.
Hierarchical clustering
 Input: a pairwise distance matrix over all instances in S
 Algorithm
1. Place each instance of S in its own cluster (singleton),
creating the list of clusters L (initially, the leaves of T):
L= S1, S2, S3, ..., Sn-1, Sn.
2. Compute a merging cost function between every pair of
elements in L to find the two closest clusters {Si, Sj} which
will be the cheapest couple to merge.
3. Remove Si and Sj from L.
4. Merge Si and Sj to create a new internal node Sij in T,
which becomes the parent of Si and Sj in the resulting tree, and add Sij to L.
5. Go to Step 2 until there is only one set remaining.
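The five steps above can be sketched in a few lines of Python. This is a naive illustration, not an efficient implementation: the function name `agglomerative`, the `linkage` parameter, and the 4-point distance matrix `D` are hypothetical and chosen only for this example.

```python
from itertools import combinations

def agglomerative(dist, linkage=min):
    """Naive agglomerative clustering on a pairwise distance matrix.

    dist    -- symmetric n x n matrix (list of lists) of pairwise distances
    linkage -- reduces member-to-member distances to one merging cost
               (min gives single linkage, max gives complete linkage)
    Returns the merge history as (members_a, members_b, cost) tuples.
    """
    # Step 1: every instance starts in its own singleton cluster.
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Step 2: find the cheapest pair of clusters to merge.
        def cost(pair):
            a, b = pair
            return linkage(dist[i][j] for i in a for j in b)
        a, b = min(combinations(clusters, 2), key=cost)
        merges.append((sorted(a), sorted(b), cost((a, b))))
        # Steps 3-4: remove the pair and add their union as the new node.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        # Step 5: loop until a single cluster (the root) remains.
    return merges

# Hypothetical data: points 0 and 1 are close, and so are points 2 and 3.
D = [[0, 1, 6, 7],
     [1, 0, 5, 6],
     [6, 5, 0, 2],
     [7, 6, 2, 0]]
history = agglomerative(D)
for members_a, members_b, merge_cost in history:
    print(members_a, members_b, merge_cost)
```

The merge history is one way to record the tree T: each tuple is an internal node whose children are the two merged clusters.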
Hierarchical clustering
 Step 2 can be done in different ways, which is what distinguishes
single-linkage from complete-linkage and average-linkage
clustering.
 In single-linkage clustering (also called the connectedness or
minimum method): we consider the distance between one cluster
and another cluster to be equal to the shortest distance from any
member of one cluster to any member of the other cluster.
 In complete-linkage clustering (also called the diameter or
maximum method), we consider the distance between one
cluster and another cluster to be equal to the greatest distance
from any member of one cluster to any member of the other
cluster.
 In average-linkage clustering, we consider the distance between
one cluster and another cluster to be equal to the average
distance from any member of one cluster to any member of the
other cluster.
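As a small illustration of the three linkage rules, the sketch below computes each cluster-to-cluster distance for two made-up clusters `A` and `B` on a line. The function names and the data are assumptions for this example only, and Euclidean distance is used as the underlying point-to-point distance.

```python
from math import dist
from statistics import mean

def pairwise(a, b):
    """All member-to-member Euclidean distances between clusters a and b."""
    return [dist(p, q) for p in a for q in b]

def single_linkage(a, b):
    return min(pairwise(a, b))    # shortest member-to-member distance

def complete_linkage(a, b):
    return max(pairwise(a, b))    # greatest member-to-member distance

def average_linkage(a, b):
    return mean(pairwise(a, b))   # average member-to-member distance

# Hypothetical clusters: the pairwise distances are 4, 6, 3 and 5.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]
print(single_linkage(A, B))    # 3.0
print(complete_linkage(A, B))  # 6.0
print(average_linkage(A, B))   # 4.5
```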
Hierarchical clustering:
example
Hierarchical clustering:
example using single linkage
Hierarchical clustering:
forming clusters
 Forming clusters from dendrograms
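One common way to form flat clusters is to cut the dendrogram at a chosen height and keep only the merges below it. The sketch below assumes the tree is stored as a list of (members_a, members_b, cost) merges with non-decreasing cost, which holds for single, complete and average linkage. The function name `cut_dendrogram` and the example merge list are hypothetical.

```python
def cut_dendrogram(n, merges, height):
    """Return the flat clusters obtained by cutting the tree at `height`.

    n      -- number of leaves, labeled 0..n-1
    merges -- (members_a, members_b, cost) tuples in merge order
    height -- merges with a cost above this value are ignored
    """
    cluster_of = {i: {i} for i in range(n)}   # leaf -> set it belongs to
    for members_a, members_b, cost in merges:
        if cost > height:
            continue
        merged = cluster_of[members_a[0]] | cluster_of[members_b[0]]
        for leaf in merged:
            cluster_of[leaf] = merged
    unique = {frozenset(c) for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique)

# Hypothetical merge history for four leaves.
merges = [([0], [1], 1), ([2], [3], 2), ([0, 1], [2, 3], 5)]
print(cut_dendrogram(4, merges, height=1.5))  # [[0, 1], [2], [3]]
print(cut_dendrogram(4, merges, height=3))    # [[0, 1], [2, 3]]
print(cut_dendrogram(4, merges, height=10))   # [[0, 1, 2, 3]]
```

Different cut heights give different numbers of clusters from the same tree.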
Hierarchical clustering
 Advantages
 Dendrograms are great for visualization
 Provides hierarchical relations between clusters
 Shown to be able to capture concentric clusters
 Disadvantages
 Not easy to define levels for clusters
 Experiments showed that other clustering
techniques outperform hierarchical clustering
K-means
 Input: n objects (or points) and a number k
 Algorithm
1. Randomly place K points into the space represented by
the objects that are being clustered. These points
represent initial group centroids.
2. Assign each object to the group that has the closest
centroid.
3. When all objects have been assigned, recalculate the
positions of the K centroids.
4. Repeat Steps 2 and 3 until a stopping criterion is met.
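A minimal sketch of these four steps follows. It makes two assumptions that are not in the slides: the initial centroids are sampled from the objects themselves (a common variant of placing K random points in the space), and the loop stops when no object changes cluster. The function name `kmeans`, its parameters, and the six example points are hypothetical.

```python
import random
from math import dist

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1 (assumed variant): pick k of the objects as initial centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        new_assignment = [min(range(k), key=lambda j: dist(p, centroids[j]))
                          for p in points]
        # Step 4: stop when no object changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return assignment, centroids

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

Which group receives label 0 depends on the random initialization, so only the grouping, not the label values, is meaningful.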
K-means
 Stopping criteria:
 No change in the members of all clusters
 The squared error is less than some small threshold
value α
 Squared error se:
$$se = \sum_{i=1}^{k} \sum_{p \in c_i} \lVert p - m_i \rVert^2$$
 where $m_i$ is the mean of all instances in cluster $c_i$
 se(j) < α
 Properties of k-means
 Guaranteed to converge
 Guaranteed to reach a local optimum, not necessarily the
global optimum.
 Example:
http://www.kdnuggets.com/dmcourse/data_mining_course/
mod-13-clustering.ppt
K-means
 Pros:
 Low complexity
 complexity is O(nkt), where t = #iterations
 Cons:
 Necessity of specifying k
 Sensitive to noise and outlier data points
 Outliers: a small number of such data points can
substantially influence the mean value
 Clusters are sensitive to initial assignment of centroids
 K-means is not a deterministic algorithm
 Clusters can be inconsistent from one run to another
Fuzzy c-means
 An extension of k-means
 Hierarchical clustering and k-means generate partitions
 each data point is assigned to exactly one
cluster
 Fuzzy c-means allows data points to be
assigned to more than one cluster
 each data point has a degree of membership (or
probability) of belonging to each cluster
Fuzzy c-means algorithm
 Let xi be a vector of values for data point gi.
1. Initialize the membership matrix U(0) = [ uij ] randomly, where uij is
the membership of data point gi in cluster clj
2. At the k-th step, compute the fuzzy centroid C(k) =
[ cj ] for j = 1, .., nc, where nc is the number of
clusters, using
$$c_j = \frac{\sum_{i=1}^{n} (u_{ij})^m \, x_i}{\sum_{i=1}^{n} (u_{ij})^m}$$
where m is the fuzzy parameter and n is the number of data
points.
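The centroid update in step 2 is a weighted mean of the data points, with weights (uij)^m. A small sketch, using a hypothetical helper `fuzzy_centroids` and made-up data `X` and memberships `U`:

```python
def fuzzy_centroids(X, U, m=2.0):
    """Compute c_j = sum_i (u_ij)^m x_i / sum_i (u_ij)^m for every cluster j.

    X -- list of n data points (tuples of equal length)
    U -- n x nc membership matrix, U[i][j] = membership of point i in cluster j
    m -- fuzzy parameter
    """
    n_clusters = len(U[0])
    dims = len(X[0])
    centroids = []
    for j in range(n_clusters):
        weights = [U[i][j] ** m for i in range(len(X))]
        total = sum(weights)
        centroids.append(tuple(
            sum(w * x[d] for w, x in zip(weights, X)) / total
            for d in range(dims)))
    return centroids

# Hypothetical example: the middle point belongs half to each cluster.
X = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
U = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0]]
centroids = fuzzy_centroids(X, U, m=2.0)
print(centroids)  # approximately [(0.4, 0.0), (8.4, 0.0)]
```

With m = 2 the half-member contributes weight 0.25 to each cluster, so it pulls each centroid only slightly toward itself.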
Fuzzy c-means algorithm
3. Update the fuzzy membership U(k) = [ uij ], using
$$u_{ij} = \frac{\left( \frac{1}{\lVert x_i - c_j \rVert} \right)^{\frac{1}{m-1}}}{\sum_{l=1}^{n_c} \left( \frac{1}{\lVert x_i - c_l \rVert} \right)^{\frac{1}{m-1}}}$$
4. If ||U(k) - U(k-1)|| < ε, then STOP, else return to step 2.
5. Determine membership cutoff
 For each data point gi, assign gi to cluster clj if uij of U(k) > α
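The full loop (steps 1 to 5) can be sketched as below. Several details are assumptions rather than part of the slides: the norm in step 4 is taken as the largest absolute change in any membership, the distance is Euclidean and unsquared with exponent 1/(m-1), and the names `fuzzy_c_means`, `eps` and `alpha` are hypothetical. Many published formulations apply the exponent to the squared distance instead, which is equivalent to using 2/(m-1) here.

```python
import random
from math import dist

def fuzzy_c_means(X, nc, m=2.0, eps=1e-5, alpha=0.5, max_iter=200, seed=0):
    rng = random.Random(seed)
    n, dims = len(X), len(X[0])
    # Step 1: random memberships, each row normalized to sum to 1.
    U = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(nc)]
        total = sum(row)
        U.append([u / total for u in row])
    for _ in range(max_iter):
        # Step 2: fuzzy centroids as membership-weighted means.
        C = []
        for j in range(nc):
            w = [U[i][j] ** m for i in range(n)]
            C.append(tuple(sum(wi * x[d] for wi, x in zip(w, X)) / sum(w)
                           for d in range(dims)))
        # Step 3: memberships from inverse distances to the centroids.
        new_U = []
        for x in X:
            inv = [(1.0 / max(dist(x, c), 1e-12)) ** (1.0 / (m - 1.0))
                   for c in C]
            total = sum(inv)
            new_U.append([v / total for v in inv])
        # Step 4: stop when the largest membership change is below eps.
        change = max(abs(a - b)
                     for old, new in zip(U, new_U) for a, b in zip(old, new))
        U = new_U
        if change < eps:
            break
    # Step 5: a point joins every cluster where its membership exceeds alpha.
    clusters = [[i for i in range(n) if U[i][j] > alpha] for j in range(nc)]
    return U, C, clusters

# Hypothetical data: two well-separated groups of three points each.
X = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
     (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
U, C, clusters = fuzzy_c_means(X, nc=2)
print(clusters)
```

A lower cutoff `alpha` lets a point appear in several clusters; with `alpha` at 0.5 and two clusters, each point can land in at most one.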
Fuzzy c-means
 Pros:
 Allows a data point to be in multiple clusters
 A more natural representation of the behavior of
genes
 genes usually are involved in multiple functions
 Cons:
 Need to define c, the number of clusters
 Need to determine membership cutoff value
 Clusters are sensitive to initial assignment of
centroids
 Fuzzy c-means is not a deterministic algorithm
Similarity measures
 How to determine similarity between data
points
 using various distance metrics
 Let x = (x1,…,xn) and y = (y1,…,yn) be the
n-dimensional vectors of data points for
objects g1 and g2
 g1, g2 can be two different genes in microarray
data
 n can be the number of samples
Distance measure
 Euclidean distance
$$d(g_1, g_2) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
 Manhattan distance
$$d(g_1, g_2) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert$$
 Minkowski distance
$$d(g_1, g_2) = \sqrt[m]{\sum_{i=1}^{n} \lvert x_i - y_i \rvert^m}$$
Correlation distance
 Correlation distance
$$r_{xy} = \frac{Cov(X, Y)}{\sqrt{Var(X) \, Var(Y)}}$$
 Cov(X,Y) stands for the covariance of X and Y
 the degree to which two different variables are
related
 Var(X) stands for the variance of X
 how far the measurements of a sample deviate from their
mean
Correlation distance
 Variance
$$Var(X) = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$$
 Covariance
$$Cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n - 1}$$

 Positive covariance
 two variables vary in the same way
 Negative covariance
 one variable might increase when the other decreases
 Covariance is only suitable for heterogeneous pairs
Correlation distance
 Correlation
$$r_{xy} = \frac{Cov(X, Y)}{\sqrt{Var(X) \, Var(Y)}}$$
 maximum value of 1 if X and Y are perfectly
correlated
 minimum value of -1 if X and Y are exactly
opposite
 d(X,Y) = 1 - rxy
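The definitions of variance, covariance, correlation and correlation distance can be combined as in the sketch below. The function names and the two sample vectors are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean

def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def variance(x):
    """Sample variance, which is the covariance of x with itself."""
    return covariance(x, x)

def correlation(x, y):
    return covariance(x, y) / (variance(x) * variance(y)) ** 0.5

def correlation_distance(x, y):
    return 1.0 - correlation(x, y)

# Hypothetical vectors: Cov = 11/3, Var(x) = 20/3, Var(y) = 8.75/3.
x = (2, 4, 6, 8)
y = (1, 3, 2, 5)
r = correlation(x, y)
print(r)                           # about 0.83
print(correlation_distance(x, y))  # about 0.17
```

The distance ranges from 0 for perfectly correlated vectors to 2 for exactly opposite ones.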
Summary of similarity
measures
 Using different measures for clustering can yield
different clusters
 Euclidean distance and correlation distance are the
most common choices of similarity measure for
microarray data
 Euclidean vs Correlation Example
 g1 = (1,2,3,4,5)
 g2 = (100,200,300,400,500)
 g3 = (5,4,3,2,1)
 Which genes are similar according to the two different
measures?
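The question can be checked numerically. The sketch below uses the three vectors from the slide. The helper `correlation_distance` is written for this example and drops the 1/(n-1) factors because they cancel in the correlation.

```python
from math import dist
from statistics import mean

def correlation_distance(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    # The 1/(n-1) factors of Cov and Var cancel in the ratio.
    return 1.0 - cov / (var_x * var_y) ** 0.5

g1 = (1, 2, 3, 4, 5)
g2 = (100, 200, 300, 400, 500)
g3 = (5, 4, 3, 2, 1)

# Euclidean distance: g1 is far closer to g3 than to g2.
print(dist(g1, g3))  # roughly 6.3
print(dist(g1, g2))  # roughly 734

# Correlation distance: g1 and g2 have the same shape, g3 is the opposite.
print(correlation_distance(g1, g2))  # 0.0
print(correlation_distance(g1, g3))  # 2.0
```

Euclidean distance groups g1 with g3 because their magnitudes are similar, while correlation distance groups g1 with g2 because their expression profiles rise together.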
Validity of clusters
 Why validity of clusters?
 Given some data, any clustering algorithm
generates clusters
 So we need to make sure the clustering results
are valid and meaningful.
 Measuring the validity of clustering results
usually involves
 Optimality of clusters
 Verification of biological meaning of clusters
Optimality of clusters
 Optimal clusters should
 minimize distance within clusters (intracluster)
 maximize distance between clusters (intercluster)
 Example of intracluster measure
 Squared error se
$$se = \sum_{i=1}^{k} \sum_{p \in c_i} \lVert p - m_i \rVert^2$$
where $m_i$ is the mean of all instances in cluster $c_i$
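As an illustration of the squared error as an intracluster measure, the sketch below compares two ways of splitting the same four points into two clusters. The function name `squared_error` and both partitions are hypothetical.

```python
def squared_error(clusters):
    """Sum over clusters of squared distances from members to the cluster mean."""
    total = 0.0
    for members in clusters:
        dims = len(members[0])
        m = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        total += sum((p[d] - m[d]) ** 2 for p in members for d in range(dims))
    return total

# The same four points, grouped by nearness (tight) or across the gap (loose).
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
loose = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
print(squared_error(tight))  # 4.0
print(squared_error(loose))  # 100.0
```

The partition with the smaller squared error has the more compact clusters.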

Biological meaning of clusters
 Manually verify the clusters using the literature
 Can utilize the biological process ontology of the
Gene Ontology to do the verification
 FD Gibbons and FP Roth. Judging the quality of gene
expression-based clustering methods using gene
annotation, Genome Research 12(10): 1574 - 1581 (2002).
 GoMiner: A Resource for Biological Interpretation of
Genomic and Proteomic Data. Barry R. Zeeberg, Weimin
Feng, Geoffrey Wang, May D. Wang, Anthony T. Fojo,
Margot Sunshine, Sudarshan Narasimhan, David W. Kane,
William C. Reinhold, Samir Lababidi, Kimberly J. Bussey,
Joseph Riss, J. Carl Barrett, and John N. Weinstein.
Genome Biology, 2003 4(4):R28
References
 A. K. Jain, M. N. Murty and P. J. Flynn, Data
clustering: a review, ACM Computing Surveys, 31:3,
pp. 264-323, 1999.
 T. R. Golub et al., Molecular Classification of
Cancer: Class Discovery and Class Prediction by
Gene Expression Monitoring, Science, 286:5439,
pp. 531-537, 1999.
 A. P. Gasch and M. B. Eisen, Exploring the
conditional coregulation of yeast gene expression
through fuzzy k-means clustering, Genome Biology, 3,
pp. 1-22, 2002.
 M. Eisen et al., Cluster Analysis and Display of
Genome-Wide Expression Patterns, Proc Natl Acad
Sci U S A, 95, pp. 14863-8, 1998.
