
CLUSTERING

DATA SCIENCE & ANALYTICS (IT3080)


OVERVIEW
Supervised and unsupervised learning
What is a cluster and cluster analysis?
Applications of cluster analysis
Methods of clustering
K-means algorithm
Hierarchical clustering
LEARNING OUTCOMES
 Compare and contrast supervised and unsupervised learning
 Explain what cluster analysis is
 Identify applications of cluster analysis
 Apply the k-means algorithm for cluster analysis
 Apply agglomerative hierarchical clustering
SUPERVISED LEARNING
 Supervised learning is learning in which we teach or train the machine using well-labeled data, that is, data that is already tagged with the correct answer.
 Ex: In the email spam filter problem, we have a dataset of emails with all the text within each email.
 We also know which of these emails are spam and which are not (the so-called labels).
 These labels are very valuable in helping the supervised learner separate the spam emails from the rest.
 Classification and regression are common examples of supervised learning.
UNSUPERVISED LEARNING
 In unsupervised learning, labels are not available.
 Ex: Consider the email spam filter problem, this time without labels.
 To identify a spam email, the underlying structure of the emails now has to be understood, and the emails separated into groups such that emails within a group are similar to each other but different from emails in other groups.
 Clustering is the most common type of unsupervised learning.
WHAT IS CLUSTER ANALYSIS?
 Cluster analysis or simply clustering is the process of
partitioning a set of data objects (or observations) into subsets.
 Each subset is a cluster, such that objects in a cluster are
similar to one another, yet dissimilar to objects in other clusters.
 The goal of clustering is to maximize the similarity of
observations within a cluster and maximize the dissimilarity
between clusters.
WHAT IS CLUSTER ANALYSIS? (CONTD.)
 Cluster analysis does not use any labels.
 When cluster analysis is done, the analyst does not know in advance how many clusters exist, whether the clusters found are correct, or whether they are useful.
 Labelling the outputs is up to the analyst or other stakeholders.
AN EXAMPLE
APPLICATIONS OF CLUSTER ANALYSIS
 Information retrieval/organization: topic-based news grouping
 Land use: identification of areas of similar land use in an earth observation database
 Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
 Social network mining: automatic discovery of special-interest groups
CLUSTERING METHODS

Partitioning Method
Hierarchical Method
Density-based Method
Fuzzy clustering
Model-Based Method
CLUSTERING METHODS – PARTITIONING METHODS
 Given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n.
 That is, it divides the data into k groups such that each group contains at least one object.
 Most partitioning methods are distance-based.
CLUSTERING METHODS – PARTITIONING METHODS

Typical methods: K-means, K-medoids, CLARANS, ……


CLUSTERING METHODS – HIERARCHICAL METHODS

 Hierarchical method creates a hierarchical decomposition of the given set of data


objects.
 A hierarchical method can be classified as being either agglomerative or divisive,
based on how the hierarchical decomposition is formed.
 The agglomerative approach, also called the bottom-up approach, starts with each object
forming a separate group.
 It successively merges the objects or groups close to one another, until all the groups are merged into one
 The divisive approach, also called the top-down approach, starts with all the objects in the
same cluster.
 In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in one
cluster, or a termination condition holds.
CLUSTERING METHODS – HIERARCHICAL METHODS
 Typical methods: Agglomerative, DIANA, AGNES, BIRCH, ROCK
CLUSTERING METHODS – DENSITY-BASED METHODS

 Most partitioning methods cluster objects based on the distance between objects.
 Such methods can find only spherical-shaped clusters and encounter difficulty in discovering
clusters of arbitrary shapes.
 Other clustering methods have been developed based on the notion of density.
 Their general idea is to continue growing a given cluster as long as the density (number of
objects or data points) in the “neighborhood” exceeds some threshold.
 For example, for each data point within a given cluster, the neighborhood of a given radius has
to contain at least a minimum number of points.
Such a method can be used to filter out noise or outliers and discover clusters of arbitrary shape.
CLUSTERING METHODS – DENSITY-BASED METHODS
 Typical methods: DBSCAN, OPTICS, DenClue
CLUSTERING PROCEDURE- WHAT IS TO CONSIDER?

Choosing variables
Similarity and dissimilarity measurement
Standardization
Weights and thresholds
CHOOSING VARIABLES
 Select relevant variables.
 Ex: identifying which types of drivers are at high risk of insurance claims
 Relevant variables: age, penalties, marital status
 Irrelevant: height, weight of the vehicle
 Inclusion of a variable such as the height or weight of an automobile may adversely affect the outcome of the categorization because it is not relevant to the problem.
 The fewer the variables, the better, as long as they adequately address the problem.
SIMILARITY AND DISSIMILARITY MEASUREMENT

Similarity or dissimilarity refers to the likeness of two objects.


A proximity measure can be used to describe similarity or
dissimilarity.
There are several techniques in widespread use to determine the
proximity of one object in relation to another.
Ex: Euclidean distance
SIMILARITY AND DISSIMILARITY MEASUREMENT (CONTD.)
 Euclidean distance
 How can the distance between two points in a 2D space be calculated? The Pythagorean theorem can be used.
 A general form, for two points A = (a1, a2) and B = (b1, b2) in 2D:
   $d(A, B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$
 The same idea extends to a 3D space.
 In an N-dimensional space, if the coordinates of A are (a1, a2, a3, …, an) and those of B are (b1, b2, b3, …, bn):
   $d(A, B) = d(B, A) = \sqrt{\sum_{i=1}^{n}(a_i - b_i)^2}$
 In cluster analysis, the distance between two points in the same cluster is known as a within-cluster distance.
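As a quick illustration (not part of the slides), a minimal Python sketch of the N-dimensional Euclidean distance:

```python
import numpy as np

def euclidean_distance(a, b):
    """N-dimensional Euclidean distance between points a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# Example: distance between (1, 1) and (0, 2)
print(euclidean_distance([1, 1], [0, 2]))  # ≈ 1.414
```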
STANDARDIZATION
When different variables are often represented in different
dimensions (units) standardization of variables might be
required.
The standardization of an attribute involves two steps:
 calculate the difference between the value of the attribute and the mean
of all samples involving the attribute, and
 divide the difference by its standard deviation
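A minimal sketch of these two steps (z-score standardization) in Python, assuming the data is a NumPy array with samples in rows and attributes in columns:

```python
import numpy as np

def standardize(X):
    """Z-score each attribute: subtract the column mean, divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Placeholder data: e.g. height (cm) and income, which are on very different scales
X = np.array([[170.0, 55000.0],
              [160.0, 72000.0],
              [180.0, 61000.0]])
print(standardize(X))  # each column now has mean 0 and standard deviation 1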
K-MEANS ALGORITHM
 With an input of k, which denotes the number of expected clusters, k
centers or centroids will be defined that will facilitate defining the k
partitions.
 The centroid is (typically) the mean of the points in the cluster.
 ‘Closeness’ is measured by Euclidean distance, cosine similarity,
correlation, etc.
 Initial centroids are often chosen randomly.
WHAT IS A CENTROID?
A centroid is the mean position of a group of points
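In symbols (a standard formulation, stated here for clarity), the centroid of a cluster $C_j$ containing $|C_j|$ points is the coordinate-wise mean:

$$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$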
K-MEANS ALGORITHM
 Based on these centers (centroids), the algorithm identifies the members of each cluster and thus builds a partition, followed by the re-computation of the new centers based on the identified members.
 This process is repeated until the partition stabilizes, i.e., the cluster assignments (and hence the centroids) no longer change, or another termination condition holds.
 Hence, the accuracy of the centroids is key for a partition-based clustering algorithm to succeed.
HOW THE CLUSTERS ARE COMPUTED
[Figure: k-means on a 2-D data set (x from -2 to 2, y from 0 to 3), showing cluster assignments and centroids over iterations 1–6.]
K-MEANS ALGORITHM
Input: S (instance set), K (number of cluster)
Output: clusters
1: Initialize K cluster centers.
2: while termination condition is not satisfied do
3: Assign instances to the closest cluster center.
4: Update cluster centers based on the assignment.
5: end while
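A minimal NumPy sketch of this loop (illustrative; the function name and random initialization are my own choices, and it assumes no cluster becomes empty):

```python
import numpy as np

def kmeans(S, K, max_iter=100, seed=0):
    """Basic k-means on S, an (n_instances, n_features) array, with K clusters."""
    rng = np.random.default_rng(seed)
    centers = S[rng.choice(len(S), size=K, replace=False)]    # 1: initialize K cluster centers
    for _ in range(max_iter):                                 # 2: loop until termination
        dists = np.linalg.norm(S[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                         # 3: assign instances to the closest center
        new_centers = np.array([S[labels == k].mean(axis=0)   # 4: recompute centers
                                for k in range(K)])           #    (assumes no cluster becomes empty)
        if np.allclose(new_centers, centers):                 # stop once the centers no longer move
            break
        centers = new_centers
    return labels, centers

# Example usage on random 2-D data (placeholder data, not from the lecture)
X = np.random.default_rng(1).normal(size=(100, 2))
labels, centers = kmeans(X, K=3)
```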
DEMO
EXERCISE
 Consider the following observations. Assuming k=2 and initial centroids are A
and C.
 Identify the observations belonging to each cluster after the first epoch
 Calculate the new centroid.
      X    Y
  A   1    1
  B   1    0
  C   0    2
  D   2    4
  E   3    5
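If you want to check your working, a small Python sketch (helper names are mine) that carries out the first epoch with A and C as the initial centroids:

```python
import numpy as np

points = {"A": (1, 1), "B": (1, 0), "C": (0, 2), "D": (2, 4), "E": (3, 5)}
centroids = [np.array(points["A"], float), np.array(points["C"], float)]  # initial centroids

# Epoch 1: assign each observation to its closest centroid (Euclidean distance)
clusters = {0: [], 1: []}
for name, p in points.items():
    p = np.array(p, float)
    k = int(np.argmin([np.linalg.norm(p - c) for c in centroids]))
    clusters[k].append(name)

# Recompute each centroid as the mean of its assigned points
new_centroids = [np.mean([points[n] for n in clusters[k]], axis=0) for k in (0, 1)]
print(clusters, new_centroids)
```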
LIMITATIONS OF K-MEANS
 K must be chosen in advance.
 Sensitive to the initialization of the centroids.
 Issues in clustering data of varying sizes and densities.
 Sensitive to outliers.
 Produces spherical solutions, since Euclidean distance from the centroids is used.
LIMITATIONS OF K-MEANS (CONTD.)
[Figures: original points vs. k-means results (3 clusters, 3 clusters, and 2 clusters), illustrating the limitations above.]
THE OPTIMAL NUMBER OF CLUSTERS
 Minimizing WCSS (the within-cluster sum of squares) would seem to lead to the perfect clustering solution.
 However, WCSS = 0 when there is one point in every cluster, which is useless.
 Ideally, we want a small WCSS while also having a small number of clusters.
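For reference, WCSS is the sum of squared distances from each point to the centroid of its cluster (standard definition, stated here for clarity):

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$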
ELBOW METHOD
 When WCSS is plotted against the number of clusters, the resulting graph looks like an elbow.
 At the beginning, WCSS declines sharply as clusters are added.
 But once the curve reaches the elbow, it declines much more slowly.
 The largest number of clusters for which there is still a significant decrease in WCSS is the best candidate for the number of clusters.
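A minimal sketch of the elbow method using scikit-learn (the dataset below is a random placeholder; KMeans exposes the WCSS as its inertia_ attribute):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data; use your own dataset

wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ = within-cluster sum of squares (WCSS)

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```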
HIERARCHICAL CLUSTERING
 Hierarchical clustering mainly involves transforming a proximity matrix into a sequence of nested partitions.
 The sequence can be represented with a tree-like dendrogram in which each cluster is nested into an enclosing cluster.
 Hierarchical algorithms can be further categorized into two kinds: agglomerative and divisive.
AN EXAMPLE
[Figure: six points (1–6) in the plane and the corresponding dendrogram, with merge heights between 0 and 0.2 and leaf order 1, 3, 2, 5, 4, 6.]
DENDROGRAM: HIERARCHICAL CLUSTERING
 A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
AGGLOMERATIVE HIERARCHICAL CLUSTERING
 An agglomerative algorithm starts with a disjoint clustering, which places each of the n objects in a cluster by itself, and then merges clusters based on their similarities.
 The merging continues until all the individual objects are grouped into a single cluster.
 Whenever a merger occurs, the number of clusters is reduced by one.
 The similarities between the new merged cluster and each of the other clusters need to be recalculated.
AGGLOMERATIVE HIERARCHICAL CLUSTERING
 The basic algorithm is straightforward:
 1. Compute the proximity matrix
 2. Let each data point be a cluster
 3. Repeat
 4. Merge the two closest clusters
 5. Update the proximity matrix
 6. Until only a single cluster remains
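As an illustration, the same steps can be run with SciPy's hierarchical clustering utilities (the sample data below is a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(1).normal(size=(6, 2))  # six sample points; use your own data

# linkage() computes the proximity matrix internally and repeatedly merges the
# two closest clusters until a single cluster remains (steps 1-6 above).
Z = linkage(X, method="single", metric="euclidean")  # also: "complete", "average", "centroid"

dendrogram(Z, labels=[str(i + 1) for i in range(len(X))])
plt.ylabel("Merge distance")
plt.show()

# Cut the dendrogram to obtain, say, two clusters
print(fcluster(Z, t=2, criterion="maxclust"))
```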
SIMILARITY MEASUREMENT
The similarity measurement with the Euclidean distance can be
determined by minimum, maximum, average, or centroid
distance between two clusters to be merged.
There are four hierarchical clustering methods corresponding to
each criterion.
They are called single-link, complete-link, group-average, and
centroid clustering methods, respectively
SIMILARITY MEASUREMENT (CONTD.)
 The single-link and complete-link methods use, respectively, the minimum and the maximum distance between individual objects in the two clusters.
 The group-average method uses the average of the distances between all pairs of objects in the two clusters.
 In the centroid method, the distance between two clusters is defined as the (squared) Euclidean distance between their centroids.
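A small sketch of how the four criteria differ on two toy clusters (the data and variable names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[0.0, 0.0], [0.0, 1.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 1.0]])

D = cdist(cluster_a, cluster_b)          # all pairwise Euclidean distances
print("single-link   :", D.min())        # minimum pairwise distance
print("complete-link :", D.max())        # maximum pairwise distance
print("group-average :", D.mean())       # average of all pairwise distances
# Euclidean distance between centroids (squared in some formulations)
print("centroid      :", np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0)))
```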
SIMILARITY MEASUREMENT (CONTD.)
[Figure: illustrations of the single-link, complete-link, group-average, and centroid clustering criteria between two clusters.]
EXAMPLE
EXAMPLE - SINGLE-LINK
[Figure series: single-link merging of points 1–6, shown step by step as nested clusters, with the corresponding dendrogram (leaf order 3, 6, 2, 5, 4, 1; merge heights up to about 0.2).]
EXAMPLE - COMPLETE-LINK
[Figure series: complete-link merging of points 1–6, shown step by step as nested clusters, with the corresponding dendrogram (leaf order 3, 6, 4, 1, 2, 5; merge heights up to about 0.4).]
HIERARCHICAL CLUSTERING: COMPLETE-LINK
[Figure: final complete-link clustering of points 1–6 and the full dendrogram (leaf order 3, 6, 4, 1, 2, 5).]
DEMO
THANK YOU
