
Clustering

Lecture 1: Basics

Jing Gao
SUNY Buffalo

1
Class Structure

• Topics
– Clustering, Classification
– Network mining
– Anomaly detection
• Expectation
– Sign-in
– Take quiz in class
– Two more projects on clustering and classification
– One more homework on network mining or anomaly detection
• Website
– http://www.cse.buffalo.edu/~jing/cse601/fa12/

2
Outline
• Basics
– Motivation, definition, evaluation
• Methods
– Partitional
– Hierarchical
– Density-based
– Mixture model
– Spectral methods
• Advanced topics
– Clustering ensemble
– Clustering in MapReduce
– Semi-supervised clustering, subspace clustering, co-clustering,
etc.

3
Readings

• Tan, Steinbach, Kumar. Introduction to Data Mining. Chapters 8 and 9.


• Han, Kamber, Pei. Data Mining: Concepts and Techniques.
Chapters 10 and 11.
• Additional readings posted on website

4
Clustering Basics

• Definition and Motivation


• Data Preprocessing and Similarity Computation
• Objective of Clustering
• Clustering Evaluation

5
Clustering
• Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups

Intra-cluster distances are minimized; inter-cluster distances are maximized.

6
Application Examples

• A stand-alone tool: explore data distribution


• A preprocessing step for other algorithms
• Pattern recognition, spatial data analysis, image processing,
market research, WWW, …
– Cluster documents
– Cluster web log data to discover groups of similar access
patterns

7
Clustering Co-expressed Genes
Gene Expression Data Matrix Gene Expression Patterns

Co-expressed Genes

Why look for co-expressed genes?

– Co-expression indicates co-function
– Co-expression also indicates co-regulation

8
Gene-based Clustering
[Three plots of expression value versus time point, each showing a group of genes with a coherent expression pattern]

Examples of co-expressed genes and coherent patterns in gene expression data (Iyer's data [2])

[2] Iyer, V.R. et al. The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87, 1999.
9
Other Applications

• Land use: Identification of areas of similar land use in an earth observation database
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• City-planning: Identifying groups of houses according to their house type, value, and geographical location
• Climate: Understanding earth climate by finding patterns in atmospheric and ocean data

10
Two Important Aspects

• Properties of input data


– Define the similarity or dissimilarity between points
• Requirement of clustering
– Define the objective and methodology

11
Clustering Basics

• Definition and Motivation


• Data Preprocessing and Distance computation
• Objective of Clustering
• Clustering Evaluation

12
Data Representation

• Data: Collection of data objects and their attributes

• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Attribute is also known as dimension, variable, field, characteristic, or feature

• A collection of attributes describes an object
  – Object is also known as record, point, case, sample, entity, or instance

Example (rows are objects, columns are attributes):

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

13
Data Matrix

• Represents n objects with p variables


– An n by p matrix
– x_if is the value of the i-th object on the f-th attribute (rows are objects, columns are attributes)

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

14
Gene Expression Data

          sample 1  sample 2  sample 3  sample 4  …
gene 1      0.13      0.72      0.10      0.57
gene 2      0.34      1.58      1.05      1.15
gene 3      0.43      1.10      0.97      1.00
gene 4      1.22      0.97      1.00      0.85
gene 5     -0.89      1.21      1.29      1.08
gene 6      1.10      1.45      1.44      1.12
gene 7      0.83      1.15      1.10      1.00
gene 8      0.87      1.32      1.35      1.13
gene 9     -0.33      1.01      1.38      1.21
gene 10     0.10      0.85      1.03      1.00
…

• Clustering genes
  – Genes are objects
  – Experiment conditions are attributes
  – Find genes with similar function

15
Similarity and Dissimilarity

• Similarity
– Numerical measure of how alike two data objects are
– Is higher when objects are more alike
– Often falls in the range [0,1]
• Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies

16
Distance Matrix

• Represents pairwise distance in n objects


– An n by n matrix
– d(i,j): distance or dissimilarity between objects i and j
– Nonnegative
– Close to 0: similar

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
17
Data Matrix -> Distance Matrix

Original Data Matrix:

       s1     s2    s3    s4   …
g1     0.13   0.72  0.10  0.57
g2     0.34   1.58  1.05  1.15
g3     0.43   1.10  0.97  1.00
g4     1.22   0.97  1.00  0.85
g5    -0.89   1.21  1.29  1.08
g6     1.10   1.45  1.44  1.12
g7     0.83   1.15  1.10  1.00
g8     0.87   1.32  1.35  1.13
g9    -0.33   1.01  1.38  1.21
g10    0.10   0.85  1.03  1.00
…

Distance Matrix:

      g1   g2      g3      g4      …
g1    0    d(1,2)  d(1,3)  d(1,4)
g2         0       d(2,3)  d(2,4)
g3                 0       d(3,4)
g4                         0
…

18
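As a quick illustration (not from the slides; NumPy and SciPy are assumed to be available), the sketch below builds a Euclidean distance matrix from the first four genes and samples of the data matrix above. pdist returns the n(n-1)/2 pairwise distances and squareform expands them into the symmetric n-by-n matrix with zeros on the diagonal.

```python
# Sketch: from a data matrix to a pairwise distance matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([
    [0.13, 0.72, 0.10, 0.57],   # g1
    [0.34, 1.58, 1.05, 1.15],   # g2
    [0.43, 1.10, 0.97, 1.00],   # g3
    [1.22, 0.97, 1.00, 0.85],   # g4
])

# n(n-1)/2 pairwise Euclidean distances, expanded into an n-by-n matrix.
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 3))
```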
Types of Attributes

• Discrete
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Note: binary attributes are a special case of discrete attributes
• Ordinal
– Has only a finite or countably infinite set of values
– Order of values is important
– Examples: rankings (e.g., pain level 1-10), grades (A, B, C, D)
• Continuous
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Continuous attributes are typically represented as floating-point
variables

19
Similarity/Dissimilarity for Simple Attributes

p and q are the attribute values for two data objects.

Dissimilarity and similarity between p and q depend on the attribute type:

• Discrete (nominal): d = 0 if p = q, d = 1 if p ≠ q; s = 1 − d
• Ordinal: map the values to ranks 0, 1, …, n−1 and scale to [0, 1]; then d = |p − q| / (n − 1) and s = 1 − d
• Continuous: d = |p − q|; s = −d, or s = 1 / (1 + d)
20
Minkowski Distance—Continuous Attribute
• Minkowski distance: a generalization
$$
d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q} \quad (q > 0)
$$

• If q = 2, d is Euclidean distance
• If q = 1, d is Manhattan distance

Example: for x_i = (1, 7) and x_j = (7, 1), the Manhattan distance (q = 1) is 12 and the Euclidean distance (q = 2) is 8.48.
21
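A minimal sketch of the Minkowski distance above (NumPy assumed), reproducing the example points x_i = (1, 7) and x_j = (7, 1):

```python
# Sketch: Minkowski distance d(x, y) = (sum_k |x_k - y_k|^q)^(1/q), q > 0.
import numpy as np

def minkowski(x, y, q):
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** q) ** (1.0 / q)

xi, xj = (1, 7), (7, 1)
print(minkowski(xi, xj, q=1))   # Manhattan distance: 12.0
print(minkowski(xi, xj, q=2))   # Euclidean distance: ~8.48
```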
Standardization

• Calculate the mean absolute deviation


$$
m_f = \tfrac{1}{n}\,(x_{1f} + x_{2f} + \cdots + x_{nf})
$$

$$
s_f = \tfrac{1}{n}\,(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)
$$

• Calculate the standardized measurement (z-score)

$$
z_{if} = \frac{x_{if} - m_f}{s_f}
$$

22
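A small sketch of this standardization (NumPy assumed; the data matrix here is made up for illustration): compute the per-attribute mean m_f, the mean absolute deviation s_f, and the z-scores column by column.

```python
# Sketch: z-score standardization with the mean absolute deviation.
import numpy as np

def standardize(X):
    m = X.mean(axis=0)                 # m_f: per-attribute mean
    s = np.abs(X - m).mean(axis=0)     # s_f: mean absolute deviation
    return (X - m) / s                 # z_if = (x_if - m_f) / s_f

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])
print(standardize(X))
```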
Mahalanobis Distance
1
d ( p, q)  ( p  q)  ( p  q) T

 is the covariance matrix of the


input data X

1 n
 j ,k   ( X ij  X j )( X ik  X k )
n  1 i 1

Belongs to the family of bregman


divergence

For red points, the Euclidean distance is 14.7, Mahalanobis distance is 6.


23
Mahalanobis Distance
Covariance matrix:

$$
\Sigma = \begin{bmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{bmatrix}
$$

Points: A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4

24
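These numbers can be checked directly. The sketch below (NumPy assumed) implements d(p, q) = (p − q) Σ⁻¹ (p − q)ᵀ exactly as defined above (without the square root that some libraries apply) and reproduces Mahal(A, B) = 5 and Mahal(A, C) = 4.

```python
# Sketch: Mahalanobis distance as defined on the slide.
import numpy as np

def mahalanobis(p, q, cov):
    diff = np.asarray(p) - np.asarray(q)
    return diff @ np.linalg.inv(cov) @ diff

cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahalanobis(A, B, cov))   # 5.0
print(mahalanobis(A, C, cov))   # 4.0
```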
Common Properties of a Distance

• Distances, such as the Euclidean distance, have


some well known properties
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

• A distance that satisfies these properties is a


metric

25
Similarity for Binary Attributes
• Common situation is that objects, p and q, have only
binary attributes
• Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / total number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of matches / number of not-both-zero attributes values


= (M11) / (M01 + M10 + M11)

26
SMC versus Jaccard: Example

p= 1000000000
q= 0000001001

M01 = 2 (the number of attributes where p was 0 and q was 1)


M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7

J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0

27
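The same counts and coefficients can be reproduced with a few lines (NumPy assumed):

```python
# Sketch: SMC and Jaccard for the binary vectors p and q from the example.
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = np.sum((p == 1) & (q == 1))
m00 = np.sum((p == 0) & (q == 0))
m10 = np.sum((p == 1) & (q == 0))
m01 = np.sum((p == 0) & (q == 1))

smc = (m11 + m00) / (m01 + m10 + m11 + m00)   # 0.7
jaccard = m11 / (m01 + m10 + m11)             # 0.0
print(smc, jaccard)
```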
Document Data
• Each document becomes a `term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the
corresponding term occurs in the document.

              team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0      5     0     2      6     0    2      0        2
Document 2     0     7      0     2     1      0     0    3      0        0
Document 3     0     1      0     0     1      2     2    0      3        0

28
Cosine Similarity
• If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d.

• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150

29
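A sketch reproducing the cosine computation above (NumPy assumed):

```python
# Sketch: cosine similarity of the two document vectors in the example.
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315
```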
Correlation

• Correlation measures the linear relationship between objects


• To compute correlation, we standardize data objects, p and q,
and then take their dot product (continuous attributes)

$$
p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p)
$$

$$
q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q)
$$

$$
s(p, q) = p' \bullet q'
$$
30
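A sketch of this computation (NumPy assumed; the vectors p and q are made up for illustration). Note that for the standardized dot product to equal the Pearson correlation exactly, it must be divided by n − 1 when the sample standard deviation is used; np.corrcoef returns the same value.

```python
# Sketch: correlation as a dot product of standardized objects.
import numpy as np

p = np.array([3.0, 2.0, 0.0, 5.0])
q = np.array([1.0, 0.0, 0.0, 2.0])

ps = (p - p.mean()) / p.std(ddof=1)
qs = (q - q.mean()) / q.std(ddof=1)
corr = ps @ qs / (len(p) - 1)   # dividing by n-1 gives the Pearson correlation
print(round(corr, 4), round(np.corrcoef(p, q)[0, 1], 4))
```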
Common Properties of a Similarity

• Similarities also have some well-known properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.

2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data


objects), p and q.

31
Characteristics of the Input Data Are Important

• Sparseness
• Attribute type
• Type of Data
• Dimensionality
• Noise and Outliers
• Type of Distribution
• => Conduct preprocessing and select the appropriate
dissimilarity or similarity measure
• => Determine the objective of clustering and choose
the appropriate method

32
Clustering Basics

• Definition and Motivation


• Data Preprocessing and Distance computation
• Objective of Clustering
• Clustering Evaluation

33
Considerations for Cluster Analysis
• Partitioning criteria
– Single level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable)
• Separation of clusters
– Exclusive (e.g., one customer belongs to only one region) vs. overlapping
(e.g., one document may belong to more than one topic)

• Hard versus fuzzy


– In fuzzy clustering, a point belongs to every cluster with some weight
between 0 and 1
– Weights must sum to 1
– Probabilistic clustering has similar characteristics

• Similarity measure and data types


• Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities
34
Requirements of Clustering

• Scalability
• Ability to deal with different types of attributes
• Minimal requirements for domain knowledge to determine
input parameters
• Able to deal with noise and outliers
• Discovery of clusters with arbitrary shape
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
• What clustering results do we want to get?
35
Notion of a Cluster can be Ambiguous

How many clusters?

[The same set of points grouped into two clusters, four clusters, and six clusters]

36
Partitional Clustering

[Figure: the input data and a partitional clustering of the same points]

37
Hierarchical Clustering

[Figure: two hierarchical clusterings of points p1–p4, each shown as nested groupings with the corresponding dendrogram: Clustering Solution 1 and Clustering Solution 2]
38
Types of Clusters: Center-Based

• Center-based
– A cluster is a set of objects such that an object in a cluster is closer
(more similar) to the “center” of a cluster, than to the center of any
other cluster
– The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative” point
of a cluster

4 center-based clusters

39
Types of Clusters: Density-Based

• Density-based
– A cluster is a dense region of points, which is separated from other
regions of high density by low-density regions.
– Used when the clusters are irregular or intertwined, and when noise
and outliers are present.

6 density-based clusters

40
Clustering Basics

• Definition and Motivation


• Data Preprocessing and Distance computation
• Objective of Clustering
• Clustering Evaluation

41
Cluster Validation

• Cluster validation
– Quality: “goodness” of clusters
– Assess the quality and reliability of clustering
results

• Why validation?
– To avoid finding clusters formed by chance
– To compare clustering algorithms
– To choose clustering parameters
• e.g., the number of clusters

42
Aspects of Cluster Validation

• Comparing the clustering results to ground truth


(externally known results)
– External Index
• Evaluating the quality of clusters without reference
to external information
– Use only the data
– Internal Index
• Determining the reliability of clusters
– To what confidence level the clusters are not formed by chance
– Statistical framework

43
Comparing to Ground Truth

• Notation
– N: number of objects in the data set
– P={P1,…,Ps}: the set of “ground truth” clusters
– C={C1,…,Ct}: the set of clusters reported by a clustering
algorithm
• The “incidence matrix”
– N × N (both rows and columns correspond to objects)
– Pij = 1 if Oi and Oj belong to the same “ground truth” cluster
in P; Pij=0 otherwise
– Cij = 1 if Oi and Oj belong to the same cluster in C; Cij=0
otherwise

44
Rand Index and Jaccard Coefficient

• A pair of data object (Oi,Oj) falls into one of the


following categories
– SS: Cij=1 and Pij=1; (agree)
– DD: Cij=0 and Pij=0; (agree)
– SD: Cij=1 and Pij=0; (disagree)
– DS: Cij=0 and Pij=1; (disagree)

• Rand index

$$
\text{Rand} = \frac{|SS| + |DD|}{|\text{Agree}| + |\text{Disagree}|} = \frac{|SS| + |DD|}{|SS| + |SD| + |DS| + |DD|}
$$

– may be dominated by DD

• Jaccard coefficient

$$
\text{Jaccard} = \frac{|SS|}{|SS| + |SD| + |DS|}
$$

45
Clustering (incidence matrix C):

      g1  g2  g3  g4  g5
g1     1   1   1   0   0
g2     1   1   1   0   0
g3     1   1   1   0   0
g4     0   0   0   1   1
g5     0   0   0   1   1

Groundtruth (incidence matrix P):

      g1  g2  g3  g4  g5
g1     1   1   0   0   0
g2     1   1   0   0   0
g3     0   0   1   1   1
g4     0   0   1   1   1
g5     0   0   1   1   1

Counts over all N × N entries:

                          Clustering
                          Same cluster   Different cluster
Ground truth  Same             9                4
              Different        4                8

$$
\text{Rand} = \frac{|SS| + |DD|}{|SS| + |SD| + |DS| + |DD|} = \frac{17}{25}
\qquad
\text{Jaccard} = \frac{|SS|}{|SS| + |SD| + |DS|} = \frac{9}{17}
$$
46
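The example can be reproduced by building the two incidence matrices from the cluster labels and counting agreements over all N × N entries, as the slide does (NumPy assumed):

```python
# Sketch: Rand index and Jaccard coefficient from incidence matrices.
import numpy as np

clustering = np.array([0, 0, 0, 1, 1])   # cluster labels of g1..g5
truth      = np.array([0, 0, 1, 1, 1])   # ground-truth labels of g1..g5

C = (clustering[:, None] == clustering[None, :]).astype(int)   # C_ij
P = (truth[:, None] == truth[None, :]).astype(int)             # P_ij

ss = np.sum((C == 1) & (P == 1))
dd = np.sum((C == 0) & (P == 0))
sd = np.sum((C == 1) & (P == 0))
ds = np.sum((C == 0) & (P == 1))

rand = (ss + dd) / (ss + sd + ds + dd)   # 17/25 = 0.68
jaccard = ss / (ss + sd + ds)            # 9/17 ≈ 0.529
print(ss, dd, sd, ds, rand, jaccard)
```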
Entropy and Purity

• Notation
– |C_k ∩ P_j|: the number of objects in both the k-th cluster of the clustering solution and the j-th cluster of the groundtruth
– |C_k|: the number of objects in the k-th cluster of the clustering solution
– |P_j|: the number of objects in the j-th cluster of the groundtruth

• Purity

$$
\text{Purity} = \frac{1}{N}\sum_{k} \max_{j} |C_k \cap P_j|
$$

• Normalized Mutual Information

$$
NMI = \frac{I(C, P)}{\sqrt{H(C)\,H(P)}}, \qquad
I(C, P) = \sum_{k}\sum_{j} \frac{|C_k \cap P_j|}{N}\,\log\frac{N\,|C_k \cap P_j|}{|C_k|\,|P_j|}
$$

$$
H(C) = -\sum_{k} \frac{|C_k|}{N}\,\log\frac{|C_k|}{N}, \qquad
H(P) = -\sum_{j} \frac{|P_j|}{N}\,\log\frac{|P_j|}{N}
$$
47
Example
        P1    P2    P3    P4    P5    P6   Total
C1        3     5    40   506    96    27    677
C2        4     7   280    29    39     2    361
C3        1     1     1     7     4   671    685
C4       10   162     3   119    73     2    369
C5      331    22     5    70    13    23    464
C6        5   358    12   212    48    13    648
Total   354   555   341   943   273   738   3204

$$
\text{Purity} = \frac{1}{N}\sum_{k}\max_j |C_k \cap P_j|
= \frac{506 + 280 + 671 + 162 + 331 + 358}{3204} = 0.7203
$$

NMI is computed from the same table using the formulas on the previous slide:

$$
NMI = \frac{I(C, P)}{\sqrt{H(C)\,H(P)}}, \qquad
I(C, P) = \sum_{k}\sum_{j}\frac{|C_k \cap P_j|}{N}\log\frac{N\,|C_k \cap P_j|}{|C_k|\,|P_j|}
$$
48
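A sketch (NumPy assumed) that computes purity and NMI directly from the contingency table above; purity comes out to 0.7203 as on the slide, and NMI uses the I(C, P), H(C), H(P) definitions from the previous slide with the square-root normalization assumed there.

```python
# Sketch: purity and NMI from a contingency table (rows C1..C6, columns P1..P6).
import numpy as np

table = np.array([
    [  3,   5,  40, 506,  96,  27],
    [  4,   7, 280,  29,  39,   2],
    [  1,   1,   1,   7,   4, 671],
    [ 10, 162,   3, 119,  73,   2],
    [331,  22,   5,  70,  13,  23],
    [  5, 358,  12, 212,  48,  13],
], dtype=float)
N = table.sum()                                # 3204

purity = table.max(axis=1).sum() / N           # 0.7203

pc = table.sum(axis=1) / N                     # |C_k| / N
pp = table.sum(axis=0) / N                     # |P_j| / N
pcp = table / N                                # |C_k ∩ P_j| / N
with np.errstate(divide="ignore", invalid="ignore"):
    terms = pcp * np.log(pcp / np.outer(pc, pp))
mi = np.nansum(terms)                          # I(C, P)
h_c = -np.sum(pc * np.log(pc))                 # H(C)
h_p = -np.sum(pp * np.log(pp))                 # H(P)
nmi = mi / np.sqrt(h_c * h_p)
print(round(purity, 4), round(nmi, 4))
```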
Internal Index

• “Ground truth” may be unavailable


• Use only the data to measure cluster quality
– Measure the “cohesion” and “separation” of clusters
– Calculate the correlation between clustering results
and distance matrix

49
Cohesion and Separation
• Cohesion is measured by the within cluster sum of squares
$$
WSS = \sum_{i}\sum_{x \in C_i} (x - m_i)^2
$$

• Separation is measured by the between cluster sum of squares

$$
BSS = \sum_{i} |C_i|\,(m - m_i)^2
$$

where |C_i| is the size of cluster i, m_i is the centroid of cluster i, and m is the centroid of the whole data set

• BSS + WSS = constant


• WSS (Cohesion) measure is called Sum of Squared Error (SSE)—a
commonly used measure
• A larger number of clusters tends to result in a smaller SSE

50
Example

Points: 1, 2, 4, 5 on a line; overall centroid m = 3; for K = 2, cluster centroids m1 = 1.5 and m2 = 4.5.

K = 1:
$$
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10, \qquad
BSS = 4 \times (3-3)^2 = 0, \qquad \text{Total} = 10
$$

K = 2:
$$
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1, \qquad
BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9, \qquad \text{Total} = 10
$$

K = 4:
$$
WSS = (1-1)^2 + (2-2)^2 + (4-4)^2 + (5-5)^2 = 0, \qquad
BSS = 1\times(1-3)^2 + 1\times(2-3)^2 + 1\times(4-3)^2 + 1\times(5-3)^2 = 10, \qquad \text{Total} = 10
$$

51
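A sketch reproducing the three cases above for the points {1, 2, 4, 5} (NumPy assumed); WSS + BSS stays at 10 for every K.

```python
# Sketch: within-cluster (WSS) and between-cluster (BSS) sums of squares.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()                               # overall centroid, 3.0

def wss_bss(labels):
    wss = bss = 0.0
    for k in np.unique(labels):
        pts = x[labels == k]
        mk = pts.mean()                    # cluster centroid
        wss += np.sum((pts - mk) ** 2)
        bss += len(pts) * (m - mk) ** 2
    return wss, bss

print(wss_bss(np.array([0, 0, 0, 0])))     # K=1: (10.0, 0.0)
print(wss_bss(np.array([0, 0, 1, 1])))     # K=2: (1.0, 9.0)
print(wss_bss(np.array([0, 1, 2, 3])))     # K=4: (0.0, 10.0)
```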
Silhouette Coefficient
• Silhouette Coefficient combines ideas of both cohesion and separation

• For an individual point, i


– Calculate a = average distance of i to the points in its cluster
– Calculate b = min (average distance of i to points in another cluster)
– The silhouette coefficient for a point is then given by

s = 1 − a/b if a < b (s = b/a − 1 if a ≥ b, not the usual case)

– Typically between 0 and 1
– The closer to 1 the better

• Can calculate the Average Silhouette width for a cluster or a clustering

52
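A minimal sketch of the per-point silhouette definition above, applied to a small made-up 1-D example with two clusters (NumPy assumed):

```python
# Sketch: silhouette coefficient of each point under the a/b definition.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
labels = np.array([0, 0, 1, 1])

def silhouette(i):
    same = (labels == labels[i]) & (np.arange(len(x)) != i)
    a = np.mean(np.abs(x[same] - x[i]))                      # avg. distance within own cluster
    b = min(np.mean(np.abs(x[labels == k] - x[i]))
            for k in np.unique(labels) if k != labels[i])    # min avg. distance to another cluster
    return 1 - a / b if a < b else b / a - 1

print([round(silhouette(i), 3) for i in range(len(x))])
```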
Correlation with Distance Matrix

• Distance Matrix
– Dij is the distance between objects Oi and Oj
• Incidence Matrix
– Cij=1 if Oi and Oj belong to the same cluster, Cij=0
otherwise
• Compute the correlation between the two
matrices
– Only n(n-1)/2 entries need to be calculated
• High correlation indicates good clustering

53
Correlation with Distance Matrix

• Given Distance Matrix D = {d11,d12, …, dnn } and Incidence


Matrix C= { c11, c12,…, cnn } .

• Correlation r between D and C is given by

$$
r = \frac{\displaystyle\sum_{i,j} (d_{ij} - \bar{d})(c_{ij} - \bar{c})}
        {\sqrt{\displaystyle\sum_{i,j} (d_{ij} - \bar{d})^2}\;\sqrt{\displaystyle\sum_{i,j} (c_{ij} - \bar{c})^2}}
$$

54
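A sketch (NumPy and SciPy assumed; the four 2-D points are made up) that computes r between the distance matrix and the incidence matrix using only the n(n−1)/2 upper-triangular entries; for a good clustering the correlation is strongly negative, as in the examples on the following slides.

```python
# Sketch: correlation between a distance matrix and an incidence matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.1, 0.2], [0.15, 0.22], [0.9, 0.8], [0.95, 0.85]])
labels = np.array([0, 0, 1, 1])

D = squareform(pdist(X))                                   # distance matrix
C = (labels[:, None] == labels[None, :]).astype(float)     # incidence matrix

iu = np.triu_indices(len(X), k=1)                          # n(n-1)/2 entries
r = np.corrcoef(D[iu], C[iu])[0, 1]
print(round(r, 3))   # strongly negative: small distances go with same-cluster pairs
```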
Are There Clusters in the Data?
[Four scatter plots of the same set of random points in the unit square: Random Points, DBSCAN, K-means, Complete Link; each clustering algorithm reports clusters even though the data are random]
55
Measuring Cluster Validity Via Correlation

• Correlation of incidence and distance matrices for the K-


means clusterings of the following two data sets

[Two data sets in the unit square with their K-means clusterings and the corresponding correlations between incidence and distance matrices: Corr = -0.9235 for the data with well-separated clusters, Corr = -0.5810 for the random data]

56
Using Similarity Matrix for Cluster Validation

• Order the similarity matrix with respect to cluster


labels and inspect visually.
[Left: scatter plot of three well-separated clusters. Right: the 100 × 100 similarity matrix with points ordered by cluster label; bright blocks appear along the diagonal]

57
Using Similarity Matrix for Cluster Validation

• Clusters in random data are not so crisp


[Left: scatter plot of random points. Right: the corresponding similarity matrix ordered by cluster label; the diagonal blocks are much less distinct]

58
Reliability of Clusters

• Need a framework to interpret any measure

– For example, if our measure of evaluation has the value 10, is that good, fair, or poor?

• Statistics provide a framework for cluster validity


– The more “atypical” a clustering result is, the more
likely it represents valid structure in the data

59
Statistical Framework for SSE
• Example
– Compare SSE of 0.005 against three clusters in random data
– SSE Histogram of 500 sets of random data points of size 100—
lowest SSE is 0.0173
[Left: scatter plot of the three clusters whose clustering gives SSE = 0.005. Right: histogram of the SSE values obtained on 500 random data sets of 100 points, ranging from about 0.016 to 0.034; an SSE of 0.005 is well below anything produced by chance]
60
Determine the Number of Clusters Using SSE

• SSE curve
[Left: clustering of the input data. Right: SSE plotted against the number of clusters K (K = 2 to 30); SSE decreases as K grows, and the knee of the curve suggests the natural number of clusters]

Clustering of Input Data / SSE wrt K

61
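A sketch of the SSE-versus-K curve (assuming scikit-learn is available; the three-blob data set is made up): KMeans exposes the within-cluster SSE as inertia_, and printing it for increasing K shows the steep drop followed by a flat tail that marks the knee.

```python
# Sketch: tracing SSE (k-means inertia) as a function of K.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs.
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 2))
               for c in [(0, 0), (2, 2), (4, 0)]])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))   # inertia_ is the within-cluster SSE
```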
Take-away Message

• What’s clustering?
• Why clustering is important?
• How to preprocess data and compute
dissimilarity/similarity from data?
• What’s a good clustering solution?
• How to evaluate the clustering results?

62
