
10701

Machine Learning

Clustering
What is Clustering?
• Organizing data into clusters such that there is
  – high intra-cluster similarity
  – low inter-cluster similarity
• Informally, finding natural groupings among objects.
• Why do we want to do that?
• Any REAL application?
Example: Clusty (a search engine that clusters its results)
Example: clustering genes

• Microarrays measure the activities of all genes in different conditions
• Clustering genes can help determine new functions for unknown genes
• An early “killer application” in this area
  – The most cited (11,591 citations) paper in PNAS!
Why clustering?
• Organizing data into clusters provides information
about the internal structure of the data
– Ex. Clusty and clustering genes above
• Sometimes the partitioning is the goal
– Ex. Image segmentation
• Knowledge discovery in data
– Ex. Underlying rules, recurring patterns, topics, etc.
Unsupervised learning

• Clustering methods are unsupervised learning techniques
  – We do not have a teacher that provides examples with their labels
• We will also discuss dimensionality reduction, another unsupervised learning method, later in the course
Outline
•Motivation
•Distance functions
•Hierarchical clustering
•Partitional clustering
– K-means
– Gaussian Mixture Models
•Number of clusters
What is a natural grouping among these objects?

Clustering is subjective

[Figure: the same cartoon characters grouped in different ways: Simpson's family, school employees, females, males]
What is Similarity?
“The quality or state of being similar; likeness; resemblance; as, a similarity of features.” (Webster's Dictionary)

Similarity is hard to define, but “we know it when we see it.”

The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
Defining Distance Measures
Definition: Let O1 and O2 be two objects from the
universe of possible objects. The distance (dissimilarity)
between O1 and O2 is a real number denoted by D(O1,O2)

[Figure: gene1 and gene2 are fed into a black box that outputs a single number, e.g., 0.23, 3, or 342.7]

Inside these black boxes: some function on two variables (might be simple or very complex). For example, edit distance:

  d('', '') = 0
  d(s, '') = d('', s) = |s|   -- i.e., the length of s
  d(s1+ch1, s2+ch2) = min( d(s1, s2) + (if ch1 = ch2 then 0 else 1),
                           d(s1+ch1, s2) + 1,
                           d(s1, s2+ch2) + 1 )
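The recurrence above can be turned into a short dynamic-programming routine. A minimal Python sketch (assuming the standard Levenshtein costs of 1 for insertion, deletion, and substitution):

def edit_distance(s1: str, s2: str) -> int:
    """Edit (Levenshtein) distance, computed bottom-up from the recurrence above."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]   # d[i][j] = distance between s1[:i] and s2[:j]
    for i in range(m + 1):
        d[i][0] = i                             # d(s, '') = |s|
    for j in range(n + 1):
        d[0][j] = j                             # d('', s) = |s|
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitute (or match)
                          d[i - 1][j] + 1,         # delete from s1
                          d[i][j - 1] + 1)         # insert into s1
    return d[m][n]

print(edit_distance("kitten", "sitting"))   # 3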

A few examples:

• Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$

• Correlation coefficient: $s(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y}$
  – a similarity rather than a distance
  – can determine similar trends
Outline
•Motivation
•Distance measure
•Hierarchical clustering
•Partitional clustering
– K-means
– Gaussian Mixture Models
•Number of clusters
Desirable Properties of a Clustering Algorithm

• Scalability (in terms of both time and space)


• Ability to deal with different data types
• Minimal requirements for domain knowledge to
determine input parameters
• Interpretability and usability
Optional
- Incorporation of user-specified constraints
Two Types of Clustering
• Partitional algorithms: Construct various partitions and then
evaluate them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of
the set of objects using some criterion (focus of this class)
[Figure: Hierarchical clustering (bottom up or top down) vs. Partitional clustering (top down) applied to the same objects]
(How-to) Hierarchical Clustering
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

The number of dendrograms with n leaves = (2n - 3)! / [2^(n-2) (n - 2)!]

  Number of Leaves    Number of Possible Dendrograms
  2                   1
  3                   3
  4                   15
  5                   105
  ...                 ...
  10                  34,459,425
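The counts in the table can be reproduced directly from the formula; a small Python check using exact integer arithmetic:

from math import factorial

def n_dendrograms(n: int) -> int:
    """Number of rooted binary dendrograms over n labeled leaves:
    (2n - 3)! / (2^(n - 2) * (n - 2)!), i.e., the double factorial (2n - 3)!!"""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, n_dendrograms(n))   # 1, 3, 15, 105, 34459425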
We begin with a distance matrix which contains the distances between every pair of objects in our database.

        o1   o2   o3   o4   o5
  o1     0    8    8    7    7
  o2          0    2    4    4
  o3               0    3    3
  o4                    0    1
  o5                         0

For example, D(o1, o2) = 8 and D(o4, o5) = 1.
Bottom-Up (agglomerative):
Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

At each step, consider all possible merges and choose the best one.

[Figure: successive merge steps; at every step, all possible merges are considered and the best is chosen]

But how do we compute distances between clusters rather than objects?
Computing distance between clusters: Single Link
• cluster distance = distance between the two closest members, one from each cluster

– Potentially produces long and skinny clusters
Example: single link

Initial distance matrix:

        1    2    3    4    5
  1     0
  2     2    0
  3     6    3    0
  4    10    9    7    0
  5     9    8    5    4    0

Step 1: the closest pair is (1, 2) with distance 2, so merge them. Under single link, the distance from the new cluster to each remaining object is the minimum over its members:

  d((1,2), 3) = min{d(1,3), d(2,3)} = min{6, 3} = 3
  d((1,2), 4) = min{d(1,4), d(2,4)} = min{10, 9} = 9
  d((1,2), 5) = min{d(1,5), d(2,5)} = min{9, 8} = 8

          (1,2)   3    4    5
  (1,2)     0
    3       3     0
    4       9     7    0
    5       8     5    4    0

Step 2: the closest pair is now ((1,2), 3) with distance 3, so merge them:

  d((1,2,3), 4) = min{d((1,2),4), d(3,4)} = min{9, 7} = 7
  d((1,2,3), 5) = min{d((1,2),5), d(3,5)} = min{8, 5} = 5

            (1,2,3)   4    5
  (1,2,3)      0
     4         7      0
     5         5      4    0

Step 3: the closest pair is (4, 5) with distance 4, so merge them. Finally:

  d((1,2,3), (4,5)) = min{d((1,2,3),4), d((1,2,3),5)} = min{7, 5} = 5

[Figure: the resulting dendrogram over objects 1-5, with merges at heights 2, 3, 4, and 5]
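The same example can be reproduced with SciPy's hierarchical clustering routines; a hedged sketch (assumes SciPy is installed; linkage() returns the merge order and heights rather than a drawn dendrogram):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# The symmetric distance matrix for objects 1..5 from the worked example.
D = np.array([
    [ 0,  2,  6, 10,  9],
    [ 2,  0,  3,  9,  8],
    [ 6,  3,  0,  7,  5],
    [10,  9,  7,  0,  4],
    [ 9,  8,  5,  4,  0],
], dtype=float)

# linkage() expects a condensed (upper-triangular) distance vector.
Z = linkage(squareform(D), method="single")
print(Z)   # each row: [cluster a, cluster b, merge distance, new cluster size]
# The merge distances come out as 2, 3, 4, 5, matching the worked example.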
Computing distance between clusters: Complete Link
• cluster distance = distance between the two farthest members

+ tight clusters

Computing distance between clusters: Average Link
• cluster distance = average distance over all pairs

+ the most widely used measure
+ robust against noise
[Figure: single-linkage vs. average-linkage dendrograms over the same 30 objects; the height of each merge represents the distance between the objects / clusters being joined]
Summary of Hierarchical Clustering Methods
• No need to specify the number of clusters in advance.
• The hierarchical structure maps nicely onto human intuition for some domains.
• They do not scale well: time complexity of at least O(n^2), where n is the total number of objects.
• Like any heuristic search algorithm, local optima are a problem.
• Interpretation of results is (very) subjective.
But what are the clusters?
In some cases we can determine the “correct” number of clusters from the dendrogram. However, things are rarely this clear cut, unfortunately.

One potential use of a dendrogram is to detect outliers: a single isolated branch is suggestive of a data point that is very different from all others.

[Figure: dendrogram with one isolated branch labeled “Outlier”]
Example: clustering genes
• Microarrays measure the activities of all genes in different conditions
• Clustering genes can help determine new functions for unknown genes
Partitional Clustering
• Nonhierarchical, each instance is placed in
exactly one of K non-overlapping clusters.
• Since the output is only one set of clusters, the user has to specify the desired number of clusters K.
K-means Clustering: Initialization
Decide K, and initialize K centers (randomly)
[Figure: 2-D data with K = 3 randomly placed centers k1, k2, k3]
K-means Clustering: Iteration 1
Assign all objects to the nearest center.
Move a center to the mean of its members.
[Figure: objects assigned to their nearest center; each center k1, k2, k3 moves to the mean of its members]
K-means Clustering: Iteration 2
After moving the centers, re-assign the objects to the nearest centers, then move each center to the mean of its new members.

[Figure: updated assignments and positions of centers k1, k2, k3]
K-means Clustering: Finished!
Re-assign objects and move centers until no objects change membership.

[Figure: final clusters around centers k1, k2, k3; x-axis: expression in condition 1, y-axis: expression in condition 2]
Algorithm k-means
1. Decide on a value for K, the number of clusters.
2. Initialize the K cluster centers (randomly, if
necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the K cluster centers, by assuming the
memberships found above are correct.
5. Repeat 3 and 4 until none of the N objects changed
membership in the last iteration.
Algorithm k-means (annotated)
The same five steps, with two notes: in step 3, use one of the distance / similarity functions we discussed earlier; in step 4, each cluster center is re-estimated as the average (or median) of its class members.
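A minimal NumPy sketch of these five steps (assuming Euclidean distance and initialization by sampling K data points; the toy data at the end are synthetic):

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Steps 1-5 above: X is an (N, d) array; returns labels and centers."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    centers = X[rng.choice(N, size=K, replace=False)].copy()   # step 2: initialize centers
    labels = None
    for _ in range(max_iters):
        # Step 3: assign each object to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # (N, K)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                               # step 5: no membership changed
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its current members.
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, K=2)
print(centers)   # approximately [0, 0] and [5, 5]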
Why K-means Works
• What is a good partition?
• High intra-cluster similarity
• K-means optimizes the average squared distance to members of the same cluster,
$$\sum_{k=1}^{K} \frac{1}{n_k} \sum_{i=1}^{n_k} \sum_{j=1}^{n_k} \lVert x_{ki} - x_{kj} \rVert^2$$
• which is twice the total squared distance to the centers, also called the squared error:
$$se = \sum_{k=1}^{K} \sum_{i=1}^{n_k} \lVert x_{ki} - \mu_k \rVert^2$$
Summary: K-Means
• Strength
– Simple, easy to implement and debug
– Intuitive objective function: optimizes intra-cluster similarity
– Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
• Weakness
– Applicable only when mean is defined, what about categorical
data?
– Often terminates at a local optimum. Initialization is important.
– Need to specify K, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
• Summary
– Assign members based on current centers
– Re-estimate centers based on current assignment
Outline
• Motivation
• Distance measure
• Hierarchical clustering
• Partitional clustering
– K-means
– Gaussian Mixture Models
– Number of clusters
Gaussian Mixture Models
• Gaussian:
$$P(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
  – e.g., the height of one population
• Gaussian Mixture: a generative modeling framework:
$$P(C = i) = w_i, \qquad P(x \mid C = i) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}$$
$$P(x \mid \Theta) = \sum_i P(C = i, x \mid \Theta) = \sum_i P(x \mid C = i, \Theta)\, P(C = i \mid \Theta) = \sum_i w_i\, \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}$$
  – the likelihood of a data point given the model
Gaussian Mixture Models
• Mixture of multivariate Gaussians:
$$P(C = i) = w_i, \qquad P(x \mid C = i) = N(x; \mu_i, \Sigma_i)$$
  – e.g., the y-axis is blood pressure and the x-axis is age
GMM: A generative model
• Assuming we know the number of components (k), their weights ($w_i$, with $\sum_i w_i = 1$), and their parameters ($\mu_i$, $\Sigma_i$), we can generate new instances from a GMM in the following way (see the sketch below):
  – Pick one component at random, choosing component i with probability $w_i$
  – Sample a point x from $N(\mu_i, \Sigma_i)$

[Figure: two Gaussian components with weights $w_1$, $w_2$ and parameters $(\mu_1, \Sigma_1)$, $(\mu_2, \Sigma_2)$]
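A short NumPy sketch of this generative process for a toy two-component mixture (the weights, means, and covariances below are made-up illustrative values, not parameters from the lecture):

import numpy as np

rng = np.random.default_rng(0)

# Assumed 2-D GMM parameters for illustration.
weights = np.array([0.3, 0.7])                                   # w_i, sum to 1
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]             # mu_i
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]           # Sigma_i

def sample_gmm(n):
    """Pick a component with probability w_i, then sample from N(mu_i, Sigma_i)."""
    comps = rng.choice(len(weights), size=n, p=weights)
    points = np.array([rng.multivariate_normal(means[c], covs[c]) for c in comps])
    return points, comps

X, comps = sample_gmm(500)
print(X.shape, np.bincount(comps))   # (500, 2) and roughly [150, 350]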


Estimating model parameters
• We have weight, mean, and covariance parameters for each class
• As usual, we can write the likelihood function for our model:
$$p(x_1 \ldots x_n \mid \Theta) = \prod_{j=1}^{n} \sum_{i=1}^{k} p(x_j \mid C = i)\, w_i$$
GMM + EM = “Soft K-means”
• Decide the number of clusters, K
• Initialize the parameters (randomly)
• E-step: assign a probabilistic membership to every input sample j, one value for each cluster i:
$$p_{i,j} = p(C = i \mid x_j) = \frac{p(x_j \mid C = i)\, p(C = i)}{\sum_{k'} p(x_j \mid C = k')\, p(C = k')}, \qquad p_i = \sum_j p_{i,j}$$
• M-step: re-estimate the parameters based on the probabilistic memberships:
$$\mu_i = \frac{\sum_j p_{i,j}\, x_j}{p_i}, \qquad \Sigma_i = \frac{\sum_j p_{i,j}\,(x_j - \mu_i)(x_j - \mu_i)^T}{p_i}, \qquad w_i = \frac{p_i}{\sum_{i'} p_{i'}}$$
• Repeat until the change in the parameters is smaller than a threshold (see the code sketch below)
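A compact NumPy/SciPy EM sketch following the E- and M-steps above (a teaching sketch under simplifying assumptions: no log-space computations, and a small ridge added to the covariances for numerical stability):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Fit a K-component GMM to X (N, d) with EM; returns weights, means, covariances, memberships."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: p[i, j] = p(C = i | x_j)
        dens = np.array([multivariate_normal.pdf(X, mean=mu[i], cov=cov[i]) for i in range(K)])  # (K, N)
        p = w[:, None] * dens
        p /= p.sum(axis=0, keepdims=True)
        p_i = p.sum(axis=1)                                    # soft cluster sizes
        # M-step: re-estimate parameters from the soft memberships.
        mu = (p @ X) / p_i[:, None]
        for i in range(K):
            Xc = X - mu[i]
            cov[i] = (p[i][:, None] * Xc).T @ Xc / p_i[i] + 1e-6 * np.eye(d)
        w = p_i / N
    return w, mu, cov, p

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
w, mu, cov, p = em_gmm(X, K=2)
print(w, mu)   # weights near [0.5, 0.5]; means near [0, 0] and [5, 5]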


[Figure: EM iterations 1, 2, 5, and 25 on a 2-D dataset; at iteration 1 the cluster means are randomly assigned, and the fit improves over subsequent iterations]
Strength of Gaussian Mixture Models
• Interpretability: learns a generative model of each cluster
– you can generate new data based on the learned model
• Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
• Intuitive (?) objective function: optimizes data likelihood
Weakness of Gaussian Mixture Models
• Often terminates at a local optimum. Initialization
is important.
• Need to specify K, the number of clusters, in
advance
• Not suitable to discover clusters with non-convex
shapes

• Summary
– To learn a Gaussian mixture, assign probabilistic memberships based on the current parameters, and re-estimate the parameters based on the current memberships
Algorithm: K-means and GMM
1. Decide on a value for K, the number of clusters.
2. Initialize the K cluster centers / parameters (randomly).

K-means:
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the K cluster centers, by assuming the memberships found above are correct.

GMM:
3. E-step: assign probabilistic memberships.
4. M-step: re-estimate the parameters based on the probabilistic memberships.

5. Repeat 3 and 4 until the parameters do not change.


Clustering methods: Comparison
• Running time: Hierarchical: naively O(N^3); K-means: fastest (each iteration is linear); GMM: fast (each iteration is linear)
• Assumptions: Hierarchical: requires a similarity / distance measure; K-means: strong assumptions; GMM: strongest assumptions
• Input parameters: Hierarchical: none; K-means: K (number of clusters); GMM: K (number of clusters)
• Clusters: Hierarchical: subjective (only a tree is returned); K-means: exactly K clusters; GMM: exactly K clusters
Outline
• Motivation
• Distance measure
• Hierarchical clustering
• Partitional clustering
– K-means
– Gaussian Mixture Models
– Number of clusters
How can we tell the right number of clusters?

In general, this is an unsolved problem. However, there are many approximate methods. In the next few slides we will see an example.

[Figure: a 2-D toy dataset plotted on a 10 x 10 grid]
When k = 1, the objective function is 873.0
When k = 2, the objective function is 173.1
When k = 3, the objective function is 133.6
We can plot the objective function values for k = 1 to 6…

The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as “knee finding” or “elbow finding”.

[Figure: objective function value plotted against k = 1 to 6, with a sharp bend (“elbow”) at k = 2]
Note that the results are not always as clear cut as in this toy example
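A hedged sketch of elbow finding with scikit-learn (assumes scikit-learn is installed; the synthetic blobs stand in for the toy data above, and KMeans.inertia_ is the k-means objective, i.e., the sum of squared distances to the nearest center):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the toy dataset: two well-separated blobs.
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
# The objective drops sharply from k = 1 to k = 2 and then flattens: the elbow is at k = 2.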
Cross validation
• We can also use cross validation to determine the correct number of clusters
• Recall that a GMM is a generative model. We can compute the likelihood of the held-out data to determine which model (number of clusters) is more accurate:
$$p(x_1 \ldots x_n \mid \Theta) = \prod_{j=1}^{n} \sum_{i=1}^{k} p(x_j \mid C = i)\, w_i$$
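A short sketch of this idea with scikit-learn's GaussianMixture (hedged: score() returns the average per-sample log-likelihood of the held-out data, which plays the role of the likelihood above; the blobs are synthetic):

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)
X_train, X_val = train_test_split(X, test_size=0.3, random_state=0)

# Fit GMMs with different numbers of components and score the held-out data.
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_train)
    print(k, round(gmm.score(X_val), 3))   # average held-out log-likelihood
# The held-out log-likelihood typically rises until the true number of clusters and then flattens.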
Cluster validation
• We wish to determine whether the clusters are real, or to compare different clustering methods
  – internal validation (stability, coherence)
  – external validation (match to known categories)
Internal validation: Coherence
• A simple method is to compare clustering algorithms based on the coherence of their results
• We compute the average inter-cluster similarity and the average intra-cluster similarity
• This requires the definition of a similarity / distance metric
Internal validation: Stability
• If the clusters capture real structure in the data they should
be stable to minor perturbation (e.g., subsampling) of the
data.
• To characterize stability we need a measure of similarity
between any two k-clusterings.
• For any set of clusters C we define L(C) as the matrix of 0/1 labels such that L(C)_ij = 1 if objects i and j belong to the same cluster, and zero otherwise.
• We can compare any two k clusterings C and C' by
comparing the corresponding label matrices L(C) and
L(C').
Validation by subsampling
• C is the set of k clusters based on all the objects
• C' denotes the set of k clusters resulting from a randomly
chosen subset (80-90%) of objects
• We have high confidence in the original clustering if
Sim(L(C),L(C')) approaches 1 with high probability, where
the comparison is done over the objects common to both
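An illustrative sketch of this subsampling check, comparing two k-clusterings through their co-membership label matrices L(C). The similarity used here, the fraction of object pairs on which the two clusterings agree, is one reasonable choice rather than the specific measure from the lecture:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def label_matrix(labels):
    """L(C)_ij = 1 if objects i and j are in the same cluster, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

def clustering_similarity(labels_a, labels_b):
    """Fraction of distinct object pairs on which the two clusterings agree."""
    La, Lb = label_matrix(labels_a), label_matrix(labels_b)
    iu = np.triu_indices(len(labels_a), k=1)
    return np.mean(La[iu] == Lb[iu])

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
full = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Re-cluster a random 85% subsample and compare on the common objects.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=int(0.85 * len(X)), replace=False)
sub = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X[idx])
print(clustering_similarity(full[idx], sub))   # close to 1.0 for stable clusters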
External validation
• For this we need an external source that contains related, but
usually not identical information.
• For example, assume we are clustering web pages based on
the car pictures they contain.
• We have independently grouped these pages based on the
text description they contain.
• Can we use the text based grouping to determine how well
our clustering works?
External validation
• Suppose we have generated k clusters C1,…,Ck. How do we
assess the significance of their relation to m known
(potentially overlapping) categories G1,…,Gm?
• Let's start by comparing a single cluster Ci with a single category Gj. The p-value for such a match is based on the hyper-geometric distribution.
• Board.
• This is the probability that |Ci| elements chosen at random out of n would have l elements in common with Gj.
P-value (cont.)
• If the observed overlap between the sets (cluster and category) is l̂ elements (genes), then the p-value is
$$p = \mathrm{prob}(l \geq \hat{l}) = \sum_{j=\hat{l}}^{\min(|C_i|, |G_j|)} \mathrm{prob}(\text{exactly } j \text{ matches})$$
• Since the categories G1,…,Gm typically overlap, we cannot assume that each cluster-category pair represents an independent comparison
• In addition, we have to account for the multiple hypotheses we are testing.
• Solution ?
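Independent of the multiple-testing question, the single cluster-category p-value above is a one-liner with SciPy's hypergeometric distribution (the sizes below are made-up; hypergeom.sf(l - 1, n, |Gj|, |Ci|) gives P(overlap >= l)):

from scipy.stats import hypergeom

def overlap_pvalue(n, cluster_size, category_size, observed_overlap):
    """P(overlap >= observed_overlap) when cluster_size genes are drawn at random
    from n genes, of which category_size belong to category Gj."""
    # hypergeom parameters: (M = population size, n = successes in population, N = sample size)
    return hypergeom.sf(observed_overlap - 1, n, category_size, cluster_size)

# Made-up example: 6000 genes, a cluster of 100, a category of 150, an overlap of 12.
print(overlap_pvalue(6000, 100, 150, 12))   # a very small p-value (the expected overlap is 2.5)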
External validation: Example

[Figure: scatter plot of -log p-values for the profile-based clustering (y-axis) vs. -log p-values for K-means (x-axis), with categories such as “response to stimulus”, “transferase activity”, and “cell death” highlighted]
What you should know
• Why is clustering useful
• What are the different types of clustering
algorithms
• What are the assumptions we are making
for each, and what can we get from them
• Unsolved issues: number of clusters,
initialization, etc.
