
CLUSTERING

DATA SCIENCE & ANALYTICS (IT3080)


OVERVIEW
Supervised and unsupervised learning
What is a cluster and cluster analysis?
Applications of cluster analysis
Methods of clustering
K-means algorithm
Hierarchical clustering
LEARNING OUTCOMES
 Compare and contrast supervised and unsupervised learning
 Explain what cluster analysis is
 Identify applications of cluster analysis
 Apply the k-means algorithm for cluster analysis
 Apply agglomerative hierarchical clustering
SUPERVISED LEARNING
 Supervised learning is learning in which we teach or train the machine using well-labeled data, that is, data that is already tagged with the correct answer.
 Ex: In the email spam filter problem, we have a dataset of emails with all the text within each email.
 We also know which of these emails are spam and which are not (the so-called labels).
 These labels are very valuable in helping the supervised learner separate the spam emails from the rest.
 Classification and regression are common examples of supervised learning.
UNSUPERVISED LEARNING
 In unsupervised learning, labels are not available.
 Ex: Consider the email spam filter problem, this time without labels.
 To identify a spam email, the underlying structure of the emails now has to be understood, and the emails separated into groups such that emails within a group are similar to each other but different from emails in other groups.
 Clustering is the most common type of unsupervised learning.
WHAT IS CLUSTER ANALYSIS?
 Cluster analysis or simply clustering is the process of
partitioning a set of data objects (or observations) into subsets.
 Each subset is a cluster, such that objects in a cluster are
similar to one another, yet dissimilar to objects in other clusters.
 The goal of clustering is to maximize the similarity of
observations within a cluster and maximize the dissimilarity
between clusters.
WHAT IS CLUSTER ANALYSIS? (CONTD.)
 Cluster analysis does not use any labels.
 When cluster analysis is done, the analyst does not know in advance how many clusters exist, whether the clusters found are correct, or whether they are useful.
 Labelling the outputs is up to the analyst or other stakeholders.
AN EXAMPLE
APPLICATIONS OF CLUSTER ANALYSIS
 Information retrieval/organization: topic-based news grouping
 Land use: identification of areas of similar land use in an earth observation database
 Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
 Social network mining: automatic discovery of special-interest groups
CLUSTERING METHODS

Partitioning Method
Hierarchical Method
Density-based Method
Fuzzy clustering
Model-Based Method
CLUSTERING METHODS – PARTITIONING METHODS
 Given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n.
 That is, it divides the data into k groups such that each group contains at least one object.
 Most partitioning methods are distance-based.
CLUSTERING METHODS – PARTITIONING METHODS

Typical methods: K-means, K-medoids, CLARANS, ……


CLUSTERING METHODS – HIERARCHICAL METHODS

 Hierarchical method creates a hierarchical decomposition of the given set of data


objects.
 A hierarchical method can be classified as being either agglomerative or divisive,
based on how the hierarchical decomposition is formed.
 The agglomerative approach, also called the bottom-up approach, starts with each object
forming a separate group.
 It successively merges the objects or groups close to one another, until all the groups are merged into one
 The divisive approach, also called the top-down approach, starts with all the objects in the
same cluster.
 In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in one
cluster, or a termination condition holds.
CLUSTERING METHODS – HIERARCHICAL METHODS
 Typical methods: Agglomerative, DIANA, AGNES, BIRCH, ROCK
CLUSTERING METHODS – DENSITY-BASED METHODS

 Most partitioning methods cluster objects based on the distance between objects.
 Such methods can find only spherical-shaped clusters and encounter difficulty in discovering
clusters of arbitrary shapes.
 Other clustering methods have been developed based on the notion of density.
 Their general idea is to continue growing a given cluster as long as the density (number of
objects or data points) in the “neighborhood” exceeds some threshold.
 For example, for each data point within a given cluster, the neighborhood of a given radius has
to contain at least a minimum number of points.
Such a method can be used to filter out noise or outliers and discover clusters of arbitrary shape.
CLUSTERING METHODS – DENSITY-BASED METHODS
 Typical methods: DBSCAN, OPTICS, DenClue
CLUSTERING PROCEDURE- WHAT IS TO CONSIDER?

Choosing variables
Similarity and dissimilarity measurement
Standardization
Weights and thresholds
CHOOSING VARIABLES
 Select relevant variables.
 Ex: identifying which types of drivers are at high risk of insurance claims
 Relevant variables: age, penalties, marital status
 Irrelevant: height, weight of the vehicle
 Inclusion of a variable such as the height or weight of an automobile may adversely affect the outcome of the categorization because it is not relevant to the problem.
 The fewer the variables, the better, as long as they adequately address the problem.
SIMILARITY AND DISSIMILARITY MEASUREMENT

Similarity or dissimilarity refers to the likeness of two objects.


A proximity measure can be used to describe similarity or
dissimilarity.
There are several techniques in widespread use to determine the
proximity of one object in relation to another.
Ex: Euclidean distance
SIMILARITY AND DISSIMILARITY MEASUREMENT (CONTD.)
 Euclidean distance
 How can the distance between two points in a 2D space be calculated? The Pythagorean theorem can be used.
 A general form, for two points A = (a1, a2) and B = (b1, b2) in 2D:
   $d(A, B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$
 The same idea extends to a 3D space.
 In an N-dimensional space, if the coordinates of A are (a1, a2, a3, …, an) and those of B are (b1, b2, b3, …, bn):
   $d(A, B) = d(B, A) = \sqrt{\sum_{i=1}^{n}(a_i - b_i)^2}$
 In cluster analysis, the distance between two points in the same cluster is known as a within-cluster distance.
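As a quick illustration (not part of the slides), a minimal Python sketch of the N-dimensional Euclidean distance:

```python
import numpy as np

def euclidean_distance(a, b):
    """N-dimensional Euclidean distance between points a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# Example: distance between (1, 1) and (0, 2)
print(euclidean_distance([1, 1], [0, 2]))  # ≈ 1.414
```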
STANDARDIZATION
When different variables are often represented in different
dimensions (units) standardization of variables might be
required.
The standardization of an attribute involves two steps:
 calculate the difference between the value of the attribute and the mean
of all samples involving the attribute, and
 divide the difference by its standard deviation
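A minimal sketch of these two steps (z-score standardization) in Python, assuming the data is a NumPy array with samples in rows and attributes in columns:

```python
import numpy as np

def standardize(X):
    """Z-score each attribute: subtract the column mean, divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Placeholder data: e.g. height (cm) and income, which are on very different scales
X = np.array([[170.0, 55000.0],
              [160.0, 72000.0],
              [180.0, 61000.0]])
print(standardize(X))  # each column now has mean 0 and standard deviation 1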
K-MEANS ALGORITHM
 With an input of k, which denotes the number of expected clusters, k
centers or centroids will be defined that will facilitate defining the k
partitions.
 The centroid is (typically) the mean of the points in the cluster.
 ‘Closeness’ is measured by Euclidean distance, cosine similarity,
correlation, etc.
 Initial centroids are often chosen randomly.
WHAT IS A CENTROID?
A centroid is the mean position of a group of points
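In symbols (a standard formulation, stated here for clarity), the centroid of a cluster $C_j$ containing $|C_j|$ points is the coordinate-wise mean:

$$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$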
K-MEANS ALGORITHM
 Based on these centers (centroids), the algorithm identifies the members of each cluster and thus builds a partition, followed by the re-computation of the new centers based on the identified members.
 This process is repeated until the partition stabilizes, i.e., the cluster assignments (and hence the centroids) no longer change, or another termination condition holds.
 Hence, the accuracy of the centroids is key for a partition-based clustering algorithm to succeed.
HOW THE CLUSTERS ARE COMPUTED
[Figure: k-means on a 2-D data set (x from -2 to 2, y from 0 to 3), showing cluster assignments and centroids over iterations 1–6.]
K-MEANS ALGORITHM
Input: S (instance set), K (number of cluster)
Output: clusters
1: Initialize K cluster centers.
2: while termination condition is not satisfied do
3: Assign instances to the closest cluster center.
4: Update cluster centers based on the assignment.
5: end while
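A minimal NumPy sketch of this loop (illustrative; the function name and random initialization are my own choices, and it assumes no cluster becomes empty):

```python
import numpy as np

def kmeans(S, K, max_iter=100, seed=0):
    """Basic k-means on S, an (n_instances, n_features) array, with K clusters."""
    rng = np.random.default_rng(seed)
    centers = S[rng.choice(len(S), size=K, replace=False)]    # 1: initialize K cluster centers
    for _ in range(max_iter):                                 # 2: loop until termination
        dists = np.linalg.norm(S[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                         # 3: assign instances to the closest center
        new_centers = np.array([S[labels == k].mean(axis=0)   # 4: recompute centers
                                for k in range(K)])           #    (assumes no cluster becomes empty)
        if np.allclose(new_centers, centers):                 # stop once the centers no longer move
            break
        centers = new_centers
    return labels, centers

# Example usage on random 2-D data (placeholder data, not from the lecture)
X = np.random.default_rng(1).normal(size=(100, 2))
labels, centers = kmeans(X, K=3)
```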
DEMO
EXERCISE
 Consider the following observations. Assuming k=2 and initial centroids are A
and C.
 Identify the observations belonging to each cluster after the first epoch
 Calculate the new centroid.
      X    Y
  A   1    1
  B   1    0
  C   0    2
  D   2    4
  E   3    5
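If you want to check your working, a small Python sketch (helper names are mine) that carries out the first epoch with A and C as the initial centroids:

```python
import numpy as np

points = {"A": (1, 1), "B": (1, 0), "C": (0, 2), "D": (2, 4), "E": (3, 5)}
centroids = [np.array(points["A"], float), np.array(points["C"], float)]  # initial centroids

# Epoch 1: assign each observation to its closest centroid (Euclidean distance)
clusters = {0: [], 1: []}
for name, p in points.items():
    p = np.array(p, float)
    k = int(np.argmin([np.linalg.norm(p - c) for c in centroids]))
    clusters[k].append(name)

# Recompute each centroid as the mean of its assigned points
new_centroids = [np.mean([points[n] for n in clusters[k]], axis=0) for k in (0, 1)]
print(clusters, new_centroids)
```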
LIMITATIONS OF K-MEANS
 K must be chosen in advance.
 Sensitive to the initialization of the centroids.
 Issues in clustering data of varying sizes and densities.
 Sensitive to outliers.
 Produces spherical solutions, since Euclidean distance from the centroids is used.
LIMITATIONS OF K-MEANS (CONTD.)
[Figures: original points vs. k-means results (3 clusters, 3 clusters, and 2 clusters), illustrating the limitations above.]
THE OPTIMAL NUMBER OF CLUSTERS
 Minimizing WCSS (the within-cluster sum of squares) would seem to lead to the perfect clustering solution.
 However, WCSS = 0 when there is one point in every cluster, which is useless.
 Ideally, we want a small WCSS while also having a small number of clusters.
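For reference, WCSS is the sum of squared distances from each point to the centroid of its cluster (standard definition, stated here for clarity):

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$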
ELBOW METHOD
 When WCSS is plotted against the number of clusters, the resulting graph looks like an elbow.
 At the beginning, WCSS declines sharply as clusters are added.
 But once the curve reaches the elbow, it declines much more slowly.
 The largest number of clusters for which there is still a significant decrease in WCSS is the best candidate for the number of clusters.
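A minimal sketch of the elbow method using scikit-learn (the dataset below is a random placeholder; KMeans exposes the WCSS as its inertia_ attribute):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data; use your own dataset

wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ = within-cluster sum of squares (WCSS)

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```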
HIERARCHICAL CLUSTERING
 Hierarchical clustering mainly involves transforming a proximity matrix into a sequence of nested partitions.
 The sequence can be represented with a tree-like dendrogram in which each cluster is nested into an enclosing cluster.
 Hierarchical algorithms can be further categorized into two kinds: agglomerative and divisive.
AN EXAMPLE
[Figure: six points (1–6) in the plane and the corresponding dendrogram, with merge heights between 0 and 0.2 and leaf order 1, 3, 2, 5, 4, 6.]
DENDROGRAM: HIERARCHICAL CLUSTERING
 A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
AGGLOMERATIVE HIERARCHICAL CLUSTERING
 An agglomerative algorithm starts with a disjoint clustering, which places each of the n objects in a cluster by itself, and then merges clusters based on their similarities.
 The merging continues until all the individual objects are grouped into a single cluster.
 Whenever a merger occurs, the number of clusters is reduced by one.
 The similarities between the new merged cluster and each of the other clusters need to be recalculated.
AGGLOMERATIVE HIERARCHICAL CLUSTERING
 The basic algorithm is straightforward:
 1. Compute the proximity matrix
 2. Let each data point be a cluster
 3. Repeat
 4. Merge the two closest clusters
 5. Update the proximity matrix
 6. Until only a single cluster remains
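As an illustration, the same steps can be run with SciPy's hierarchical clustering utilities (the sample data below is a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(1).normal(size=(6, 2))  # six sample points; use your own data

# linkage() computes the proximity matrix internally and repeatedly merges the
# two closest clusters until a single cluster remains (steps 1-6 above).
Z = linkage(X, method="single", metric="euclidean")  # also: "complete", "average", "centroid"

dendrogram(Z, labels=[str(i + 1) for i in range(len(X))])
plt.ylabel("Merge distance")
plt.show()

# Cut the dendrogram to obtain, say, two clusters
print(fcluster(Z, t=2, criterion="maxclust"))
```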
SIMILARITY MEASUREMENT
The similarity measurement with the Euclidean distance can be
determined by minimum, maximum, average, or centroid
distance between two clusters to be merged.
There are four hierarchical clustering methods corresponding to
each criterion.
They are called single-link, complete-link, group-average, and
centroid clustering methods, respectively
SIMILARITY MEASUREMENT (CONTD.)
 The single-link and complete-link methods use, respectively, the minimum and the maximum distance between individual objects in the two clusters.
 The group-average method uses the average of the distances between all pairs of objects in the two clusters.
 In the centroid method, the distance between two clusters is defined as the (squared) Euclidean distance between their centroids.
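A small sketch of how the four criteria differ on two toy clusters (the data and variable names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[0.0, 0.0], [0.0, 1.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 1.0]])

D = cdist(cluster_a, cluster_b)          # all pairwise Euclidean distances
print("single-link   :", D.min())        # minimum pairwise distance
print("complete-link :", D.max())        # maximum pairwise distance
print("group-average :", D.mean())       # average of all pairwise distances
# Euclidean distance between centroids (squared in some formulations)
print("centroid      :", np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0)))
```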
SIMILARITY MEASUREMENT (CONTD.)
[Figure: illustrations of the single-link, complete-link, group-average, and centroid clustering criteria between two clusters.]
EXAMPLE
EXAMPLE - SINGLE-LINK
[Figure series: single-link merging of points 1–6, shown step by step as nested clusters, with the corresponding dendrogram (leaf order 3, 6, 2, 5, 4, 1; merge heights up to about 0.2).]
EXAMPLE - COMPLETE-LINK
[Figure series: complete-link merging of points 1–6, shown step by step as nested clusters, with the corresponding dendrogram (leaf order 3, 6, 4, 1, 2, 5; merge heights up to about 0.4).]
HIERARCHICAL CLUSTERING: COMPLETE-LINK
[Figure: final complete-link clustering of points 1–6 and the full dendrogram (leaf order 3, 6, 4, 1, 2, 5).]
DEMO
THANK YOU
