Module 12.02: Unsupervised Learning

Statistical Learning

Clustering

• Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set.
• We seek a partition of the data into distinct groups so that the observations within each group are quite similar to each other.
PCA vs Clustering

• PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance.
• Clustering looks for homogeneous subgroups among the observations.
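As a rough illustration of the contrast, the sketch below (assuming NumPy and scikit-learn are installed; the data matrix X is simulated purely for illustration, not taken from the slides) reduces the same data to two dimensions with PCA and, separately, partitions it into three groups with K-means.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))          # 150 observations, 10 features

# PCA: a low-dimensional representation explaining much of the variance
scores = PCA(n_components=2).fit_transform(X)   # shape (150, 2)

# Clustering: homogeneous subgroups among the observations
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # shape (150,)
```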
Two clustering methods

• In K-means clustering, observations are partitioned into a pre-specified number of clusters.
• In hierarchical clustering, the number of clusters is not known beforehand.
• A tree-like visual representation of the observations, called a dendrogram, is created to view at once the clusterings obtained for each possible number of clusters, from 1 to n.
K-means clustering

[Figure: three panels showing K-means results for K = 2, K = 3, and K = 4.]

A simulated data set with 150 observations in 2-dimensional space. The panels show the results of applying K-means clustering with different values of K, the number of clusters. The color of each observation indicates the cluster to which it was assigned using the K-means clustering algorithm. Note that there is no ordering of the clusters, so the cluster coloring is arbitrary. These cluster labels were not used in clustering; instead, they are the output of the clustering procedure.
Details of K-means clustering

Let $C_1, \ldots, C_K$ denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:

1. $C_1 \cup C_2 \cup \cdots \cup C_K = \{1, \ldots, n\}$. In other words, each observation belongs to at least one of the K clusters.
2. $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster.

For instance, if the ith observation is in the kth cluster, then $i \in C_k$.

• The within-cluster variation for cluster $C_k$ is a measure $\mathrm{WCV}(C_k)$ of the amount by which the observations within a cluster differ from each other.
• Hence K-means clustering solves the optimization problem

$$\underset{C_1, \ldots, C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} \mathrm{WCV}(C_k) \right\} \qquad (2)$$

• In words, this formula says: partition the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is as small as possible.
How to define within-cluster variation?

• Typically squared Euclidean distance is used:

$$\mathrm{WCV}(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \qquad (3)$$

where $|C_k|$ denotes the number of observations in the kth cluster.

• Combining (2) and (3) gives the optimization problem that defines K-means clustering:

$$\underset{C_1, \ldots, C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right\} \qquad (4)$$
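A minimal NumPy sketch of the objective in (4), computing the total within-cluster variation for a given assignment of observations to clusters. The function name is illustrative, not part of the slides.

```python
import numpy as np

def total_within_cluster_variation(X, labels):
    """Sum of WCV(C_k) over all clusters, as in equations (3) and (4)."""
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]                      # observations in cluster k
        diffs = Xk[:, None, :] - Xk[None, :, :]  # all pairwise differences
        total += (diffs ** 2).sum() / len(Xk)    # (1/|C_k|) * sum of squared distances
    return total
```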
K-Means Clustering Algorithm

1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
2. Iterate until the cluster assignments stop changing:
   (a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
   (b) Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
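The two steps can be written out directly. The following is a rough NumPy sketch of this procedure, not the slides' own code; for brevity it ignores edge cases such as a cluster becoming empty during the iterations.

```python
import numpy as np

def kmeans(X, K, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=X.shape[0])      # step 1: random initial assignments
    for _ in range(max_iter):
        # step 2(a): each centroid is the vector of feature means for its cluster
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # step 2(b): reassign each observation to the closest centroid (Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):         # stop when assignments stop changing
            break
        labels = new_labels
    return labels, centroids
```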
Properties of the Algorithm

• This algorithm is guaranteed to decrease the value of the objective (4) at each step. Why? Note that

$$\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2,$$

where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean for feature j in cluster $C_k$.

• However, the algorithm is not guaranteed to give the global minimum.
• This is why K-means clustering should be run from a number of different random initial assignments (see the sketch below).
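One common way to guard against a poor local minimum is sketched below using scikit-learn (assumed available); n_init is scikit-learn's parameter for the number of random starts, not something defined in these slides, and the data are simulated for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(150, 2))   # illustrative data

# Run the algorithm from 20 different random initial assignments and keep
# the solution with the smallest objective (scikit-learn does this internally).
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(km.inertia_)   # within-cluster sum of squares of the best of the 20 runs
```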
Hierarchical Clustering

• K-means clustering requires pre-specification of the number of clusters K.
• Hierarchical clustering is an alternative approach which does not require that we commit to a particular choice of K.
• Hierarchical clustering also provides a tree-like visualization of the observations, called a dendrogram.
Hierarchical Clustering: the idea
Builds a hierarchy in a “bottom-up” fashion...

[Figure: points A, B, C, D are merged step by step, with the closest pair of clusters fused at each stage.]
Hierarchical Clustering Algorithm
The approach in words:
• Start with each point in its own cluster.
• Identify the closest two clusters and merge them.
• Repeat.
• Ends when all points are in a single cluster.

Dendrogram

[Figure: dendrogram for the points A–E; the vertical axis (height 0–4) gives the dissimilarity at which each pair of clusters is fused.]
Types of Linkage

Complete: Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

Single: Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.

Average: Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
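A sketch of how these linkages might be computed with SciPy (assumed available); scipy.cluster.hierarchy.linkage implements the complete, single, average, and centroid rules described above, and the data are simulated for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(45, 2))    # illustrative data

Z_complete = linkage(X, method="complete", metric="euclidean")
Z_single   = linkage(X, method="single",   metric="euclidean")
Z_average  = linkage(X, method="average",  metric="euclidean")
Z_centroid = linkage(X, method="centroid")           # centroid linkage assumes Euclidean distance

# dendrogram(Z_complete) would draw the corresponding tree (requires matplotlib)
```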
An Example

[Figure: scatterplot of the observations, with X1 on the horizontal axis and X2 on the vertical axis.]

45 observations generated in 2-dimensional space. In reality there are three distinct classes, shown in separate colors. However, we will treat these class labels as unknown and will seek to cluster the observations in order to discover the classes from the data.
Application of hierarchical clustering

[Figure: three dendrograms of the 45 observations; the vertical axis (height 0–10) gives the dissimilarity at which clusters are fused.]
Details of previous figure
• Left: Dendrogram obtained from hierarchically clustering the data from the previous slide, with complete linkage and Euclidean distance.
• Center: The dendrogram from the left-hand panel, cut at a height of 9 (indicated by the dashed line). This cut results in two distinct clusters, shown in different colors.
• Right: The dendrogram from the left-hand panel, now cut at a height of 5. This cut results in three distinct clusters, shown in different colors. Note that the colors were not used in clustering, but are simply used for display purposes in this figure.
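Cutting a dendrogram at a given height can be done with SciPy's fcluster; the sketch below runs on simulated data, and the specific heights 9 and 5 belong to the figure above and would generally differ for other data sets.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(45, 2))    # illustrative data
Z = linkage(X, method="complete", metric="euclidean")

two_clusters   = fcluster(Z, t=9, criterion="distance")   # cut at height 9
three_clusters = fcluster(Z, t=5, criterion="distance")   # cut at height 5
```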
Choice of Dissimilarity Measure

• So far we have used Euclidean distance.
• An alternative is correlation-based distance, which considers two observations to be similar if their features are highly correlated.
• Here the correlation is computed between the observation profiles for each pair of observations.
• Correlation-based distance cares more about the shapes of the observation profiles than about their levels.
[Figure: profiles of Observations 1, 2, and 3 plotted against Variable Index (20 variables), illustrating how correlation-based distance depends on the shape of the profiles rather than on their levels.]
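A sketch of correlation-based distance using SciPy (assumed available): pdist with metric="correlation" computes one minus the Pearson correlation between each pair of observation profiles, so highly correlated profiles are "close". The three simulated observations are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(3, 20))    # 3 observations, 20 variables
D = squareform(pdist(X, metric="correlation"))        # 1 - correlation, as a 3x3 matrix
Z = linkage(pdist(X, metric="correlation"), method="average")   # correlation-based hierarchical clustering
```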
Practical Issues for Clustering

1. Scaling of the features matters.
2. In some cases, standardization may be useful (a sketch follows this list).
3. What dissimilarity measure and linkage should be used (for hierarchical clustering)?
4. Choice of K for K-means clustering.
5. Which features should be used to drive the clustering?
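A sketch (scikit-learn assumed) of standardizing the features before K-means so that variables measured on large scales do not dominate the Euclidean distances; the scale factors used to simulate the data are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 5)) * [1.0, 10.0, 100.0, 1.0, 1.0]  # mixed scales
X_std = StandardScaler().fit_transform(X)             # each feature: mean 0, standard deviation 1
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)
```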
Example

• Gene expression measurements for 8,000 genes, from samples collected from 88 women with breast cancer.
• Average linkage, correlation metric.
• A subset of 500 intrinsic genes was studied, before and after chemotherapy (which genes were varying, by how much, within women and between women).
Heatmap

[Figure: heatmap of the gene expression data; based on the gene expression, the samples were clustered.]

[Figure: survival curves for the different groups.]