Clustering
Most of the topics are based on the textbook Introduction to Data Mining by Tan et al. and the
video lectures by Andrew Ng
Clustering is an unsupervised learning technique
Clusters are potential classes and cluster analysis is the study of techniques for automatically
finding classes.
Dividing objects into groups is clustering, and assigning particular objects to these groups is
called classification
Ex: human genome analysis, social network analysis, market segmentation and astronomical data
analysis
We can represent each object by the index of the prototype associated with it. This type of
compression is called vector quantization
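The idea of vector quantization can be sketched in a few lines of Python (an illustrative sketch, not from the notes; the prototypes and points are made-up data):

```python
import math

# Hypothetical codebook of prototypes and some points to compress
prototypes = [(0.0, 0.0), (5.0, 5.0)]
points = [(0.2, 0.1), (4.8, 5.1), (0.1, 0.4)]

def nearest_prototype(p, prototypes):
    """Return the index of the prototype closest to point p."""
    return min(range(len(prototypes)),
               key=lambda i: math.dist(p, prototypes[i]))

# Each object is replaced by the index of its nearest prototype
codes = [nearest_prototype(p, prototypes) for p in points]
print(codes)  # [0, 1, 0]
```

Each point is now stored as a single small integer, which is what makes this a form of compression.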
Cluster validity – methods for evaluating the goodness of the clusters produced by a clustering
algorithm
The greater the similarity within a group and the greater the difference between groups, the better
or more distinct the clustering
The definition of cluster is imprecise and the best definition depends on the nature of data and
the desired results
Understanding:
Clustering is used in biology, information retrieval, climate science, business etc. to understand
various patterns
Utility: here cluster analysis is only a starting point for other purposes, such as:
Summarization
Compression
Fuzzy: Every object belongs to every cluster with a membership weight that is between 0 and 1
(all the weights must sum to 1)
Probabilistic: clustering computes the probability with which each point belongs to each cluster (all
the probabilities must sum to 1)
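The constraint that each point's weights sum to 1 can be illustrated with a toy fuzzy membership computation (an assumed scheme for illustration, not the notes' method: weights inversely proportional to distance, then normalized):

```python
import math

centroids = [(0.0, 0.0), (4.0, 0.0)]  # made-up centroids
point = (1.0, 0.0)

# Inverse-distance weights, normalized so memberships sum to 1
inv = [1.0 / math.dist(point, c) for c in centroids]
weights = [w / sum(inv) for w in inv]
print(weights)  # closer centroid gets the larger weight
```

Here the point is 1 unit from the first centroid and 3 units from the second, so its memberships come out to 0.75 and 0.25.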
Complete and partial clusterings: a complete clustering assigns every object to a cluster, whereas a
partial clustering does not.
Types of cluster:
Well separated
Density based
Shared property
There are three major algorithms used in clustering:
1) k-means
2) DBSCAN
3) agglomerative hierarchical clustering
In this exercise we are mainly going to concentrate on K-means clustering, which is prototype-based
partitional clustering
Algorithm:
1: Select K points as initial centroids
2: repeat
3: Form K clusters by assigning each point to its closest centroid
4: Recompute the centroid of each cluster
5: until centroids do not change
Step 5 is often replaced by a weaker condition, e.g. repeat until only 1% of points change clusters
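A minimal Python sketch of the basic K-means loop (illustrative only; the data and names are assumptions, not from the notes):

```python
import math
import random

def kmeans(points, k, max_iter=100):
    centroids = random.sample(points, k)      # pick K points as initial centroids
    clusters = []
    for _ in range(max_iter):                 # repeat ...
        clusters = [[] for _ in range(k)]
        for p in points:                      # assign each point to its closest centroid
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        new_centroids = [                     # recompute the centroid of each cluster
            tuple(sum(v) / len(v) for v in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:        # ... until centroids do not change
            break
        centroids = new_centroids
    return centroids, clusters

random.seed(0)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # the two obvious groups of 3 points each
```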
K-means uses a proximity measure to assign each point to its closest center, where the proximity
measure characterizes the similarity or dissimilarity between objects
Objective: Minimise the SSE (sum of squared errors) from each point to its centroid

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2
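The SSE objective translates directly into code (a sketch with made-up centroids and clusters, only for illustration):

```python
import math

def sse(centroids, clusters):
    """Sum of squared distances from each point to its cluster's centroid."""
    return sum(math.dist(c, p) ** 2
               for c, cluster in zip(centroids, clusters)
               for p in cluster)

centroids = [(0.0, 0.0), (10.0, 10.0)]
clusters = [[(0.0, 1.0), (1.0, 0.0)], [(10.0, 11.0)]]
print(sse(centroids, clusters))  # 1 + 1 + 1 = 3.0
```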
If we have the true output labels, we can calculate the accuracy of objects allocated to their
respective clusters
Process in kmeans
Programming in R:
Prototype:
Cluster Evaluation:
The evaluation measures that are applied to judge various aspects of cluster validity are
traditionally classified into three types
Unsupervised:
Measures the goodness of cluster structure without respect to external information. An
example of this is SSE.
Supervised:
Measures the extent to which the clustering structure discovered by a clustering algorithm
matches some external structure. Here external structure can be externally provided class
labels
Relative:
Compares different clusterings or clusters. A relative cluster evaluation measure is a
supervised or unsupervised evaluation measure that is used for the purpose of comparison
Issues:
Choosing initial clusters
Choosing number of clusters
Handling empty clusters
Outliers
K-means can converge to different solutions depending on the centroid initialisation (so random
centroid initialisation is important)
Ideally the algorithm should reach the global optimum; the clusters should not get stuck at a local
optimum
To address this, try multiple random initialisations and keep the result whose cost function value is
lowest
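The restart strategy can be sketched as follows (a self-contained illustration; the `kmeans`/`sse` helpers and the data are assumptions, not from the notes):

```python
import math
import random

def kmeans(points, k, max_iter=100):
    centroids = random.sample(points, k)  # random centroid initialisation
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: math.dist(p, centroids[j]))].append(p)
        new = [tuple(sum(v) / len(v) for v in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

def sse(centroids, clusters):
    """Cost function: sum of squared point-to-centroid distances."""
    return sum(math.dist(c, p) ** 2
               for c, cl in zip(centroids, clusters) for p in cl)

def best_of_n(points, k, n=10):
    """Run k-means n times and keep the run with the lowest SSE."""
    runs = [kmeans(points, k) for _ in range(n)]
    return min(runs, key=lambda run: sse(*run))

random.seed(1)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = best_of_n(points, 2)
```

With several restarts, at least one run is very likely to start from a good initialisation, so the lowest-SSE run approximates the global optimum.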
Choose the number of clusters at the elbow point (before the elbow, distortion goes down rapidly;
after the elbow, distortion goes down slowly)
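A tiny sketch of picking the elbow from an SSE-versus-k curve (the SSE values here are made up for illustration, and the drop-ratio rule is one simple heuristic, not the only way):

```python
# Hypothetical SSE for k = 1..5: drops rapidly, then flattens
sse_by_k = {1: 100.0, 2: 40.0, 3: 12.0, 4: 10.0, 5: 9.0}

# How much SSE falls when k increases by one
drops = {k: sse_by_k[k - 1] - sse_by_k[k] for k in range(2, 6)}

# Elbow heuristic: the k where the drop flattens most sharply afterwards
elbow = max(range(2, 5), key=lambda k: drops[k] / drops[k + 1])
print(elbow)  # 3: the drop goes from 28 down to 2 after k = 3
```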
Questions: