
UNSUPERVISED LEARNING - CLUSTERING

CO-3
AIM

To familiarize students with the concepts of unsupervised machine learning, its differences from supervised machine learning, and the use of unsupervised learning, particularly clustering.

INSTRUCTIONAL OBJECTIVES

This session is designed to:


1. Introduce unsupervised learning
2. Explain the K-means clustering algorithm
3. Describe hierarchical clustering and its types

LEARNING OUTCOMES

At the end of this session, you should be able to:


1. Differentiate between supervised and unsupervised learning
2. Apply the K-means clustering algorithm
3. Explain hierarchical clustering and its types
4. Summarize the key points
5. Attempt the self-assessment questions
Supervised learning vs. unsupervised learning

• Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute.
 These patterns are then utilized to predict the values of the target attribute in future data instances.

• Unsupervised learning: the data have no target attribute.
 We want to explore the data to find some intrinsic structures in them.
Unsupervised learning - Clustering

• Clustering is a technique for finding similarity groups in data, called clusters. That is,
 it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters.

• Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given, as they are in supervised learning.

• For historical reasons, clustering is often considered synonymous with unsupervised learning.
 In fact, association rule mining is also unsupervised.


An illustration

• The data set has three natural groups of data points, i.e., 3 natural
clusters.
What is clustering for?

• Let us see some real-life examples

• Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
 Tailor-made for each person: too expensive.
 One-size-fits-all: does not fit all.

• Example 2: In marketing, segment customers according to their similarities,
 to do targeted marketing.
Aspects of clustering

• A clustering algorithm
 Partitional clustering
 Hierarchical clustering
 …

• A distance (similarity, or dissimilarity) function

• Clustering quality
 Inter-cluster distance → maximized
 Intra-cluster distance → minimized

• The quality of a clustering result depends on the algorithm, the distance function, and the application.
K-means clustering

• K-means is a partitional clustering algorithm.

• Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.

• The k-means algorithm partitions the given data into k clusters.
 Each cluster has a cluster center, called the centroid.
 k is specified by the user.
K-means algorithm

Given k, the k-means algorithm works as follows:

1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
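The following is a minimal Python/NumPy sketch of these four steps. The function name kmeans, the random seeding, and the fixed iteration cap are illustrative assumptions, not part of the original slides.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, r) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute each centroid as the mean of its current members.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

For example, centroids, labels = kmeans(X, k=3) partitions a data matrix X into three clusters.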


K-means algorithm – (cont.…)
Stopping/convergence criterion

• no (or minimum) re-assignments of data points to different clusters,

• no (or minimum) change of centroids, or

• minimum decrease in the sum of squared error (SSE),

  SSE = Σ (j = 1 to k) Σ (x ∈ Cj) dist(x, mj)²

 where Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data points in Cj), and dist(x, mj) is the distance between data point x and centroid mj.
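As a quick illustration, the SSE can be computed from the output of the kmeans sketch shown earlier; the helper name sse is an assumption.

import numpy as np

def sse(X, centroids, labels):
    """Sum of squared distances of each point to its assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

# e.g. centroids, labels = kmeans(X, k=3); print(sse(X, centroids, labels))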
An example / An example (cont.…) / An example distance function
(These slides illustrate the K-means iterations and an example distance function with figures only.)
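A minimal sketch of the Euclidean distance commonly used as the K-means distance function; the function name euclidean is an assumption.

import numpy as np

def euclidean(x, m):
    """dist(x, m): Euclidean distance between a data point x and a centroid m."""
    return float(np.sqrt(((x - m) ** 2).sum()))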
A disk version of k-means

• K-means can be implemented with data on disk.
 In each iteration, it scans the data once,
 as the centroids can be computed incrementally.

• It can be used to cluster large datasets that do not fit in main memory.

• We need to control the number of iterations.
 In practice, a limit is set (e.g., fewer than 50 iterations).

• It is not the best method. There are other scale-up algorithms, e.g., BIRCH.
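The incremental centroid computation keeps only a running sum and count per cluster while streaming over the data. The sketch below is one way a single disk-based iteration could look; the function name one_disk_pass and the chunked data source are assumptions.

import numpy as np

def one_disk_pass(data_chunks, centroids):
    """One k-means iteration over data read from disk in chunks.
    Only per-cluster running sums and counts are kept in memory."""
    k, r = centroids.shape
    sums = np.zeros((k, r))
    counts = np.zeros(k, dtype=int)
    for chunk in data_chunks:  # e.g. (m, r) arrays loaded from disk one at a time
        d = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = chunk[labels == j]
            sums[j] += members.sum(axis=0)
            counts[j] += len(members)
    # New centroid = running sum / count (keep the old centroid if a cluster is empty).
    return np.where(counts[:, None] > 0, sums / np.maximum(counts, 1)[:, None], centroids)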
Strengths of k-means

• Strengths:
 Simple: easy to understand and to implement.
 Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
 Since both k and t are usually small, k-means is considered a linear algorithm.

• K-means is the most popular clustering algorithm.

• Note that it terminates at a local optimum if SSE is used. The global optimum is hard to find due to complexity.
Weaknesses of k-means

• The algorithm is only applicable if the mean is defined.
 For categorical data, the k-modes variant can be used: the centroid is represented by the most frequent value of each attribute (see the sketch after this list).

• The user needs to specify k.

• The algorithm is sensitive to outliers.
 Outliers are data points that are very far away from other data points.
 Outliers could be errors in the data recording, or special data points with very different values.
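A minimal sketch of the k-modes style centroid for categorical data; the helper name mode_centroid and the use of collections.Counter are assumptions.

from collections import Counter

def mode_centroid(rows):
    """Centroid of a cluster of categorical records: the most frequent value per attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

# e.g. mode_centroid([("red", "S"), ("red", "M"), ("blue", "M")]) -> ("red", "M")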
Weaknesses of k-means: Problems with outliers
Hierarchical Clustering
• A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the following steps:
1. Identify the two clusters that are closest together, and
2. Merge these two most similar (closest) clusters.
These steps continue until all the clusters are merged together.
• In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters.
• A dendrogram is a tree-like diagram that records the sequence of merges or splits.
Hierarchical Clustering

Produces a nested sequence of clusters (a tree), also called a dendrogram.
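A minimal sketch of producing such a dendrogram with SciPy; the toy data, the average-linkage choice, and the use of matplotlib are assumptions for illustration.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.5]])
Z = linkage(X, method="average")  # agglomerative merges, recorded bottom-up
dendrogram(Z)                     # tree-like diagram of the merge sequence
plt.show()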


Types of hierarchical clustering

• Agglomerative (bottom-up) clustering: builds the dendrogram (tree) from the bottom level, and
 merges the most similar (or nearest) pair of clusters,
 stops when all the data points are merged into a single cluster (i.e., the root cluster).

• Divisive (top-down) clustering: starts with all data points in one cluster, the root.
 Splits the root into a set of child clusters; each child cluster is recursively divided further.
 Stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.
AGGLOMERATIVE CLUSTERING

• Agglomerative clustering is one of the most common types of hierarchical clustering, used to group similar objects into clusters.
• Agglomerative clustering is also known as AGNES (Agglomerative Nesting). In agglomerative clustering, each data point acts as an individual cluster, and at each step, data objects are grouped in a bottom-up manner.
• Initially, each data object is in its own cluster. At each iteration, clusters are combined with other clusters until one cluster is formed.
AGGLOMERATIVE CLUSTERING

• The algorithm for agglomerative hierarchical clustering is:

1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are most similar (closest) to each other.
4. Recalculate the proximity matrix for the new set of clusters.
5. Repeat Steps 3 and 4 until only a single cluster remains.

A minimal sketch of this procedure is shown below.
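This from-scratch sketch uses Euclidean distance and single linkage (distance between the closest pair of points) as the cluster-to-cluster proximity; the function name agglomerative and the linkage choice are assumptions for illustration.

import numpy as np

def agglomerative(X, num_clusters=1):
    """Bottom-up merging: start with one cluster per point, repeatedly merge the closest pair."""
    clusters = [[i] for i in range(len(X))]  # Step 1: every data point is its own cluster
    while len(clusters) > num_clusters:
        best = None
        # Step 2: proximity between every pair of clusters (single linkage).
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Steps 3-4: merge the closest pair; proximities are recomputed on the next pass.
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters  # lists of data-point indices, one list per cluster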
DIVISIVE HIERARCHICAL CLUSTERING

• Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering.
• In divisive hierarchical clustering, all the data points start out in a single cluster, and in every iteration, the data points that are not similar are separated from the cluster.
• The separated data points are treated as individual clusters. Finally, we are left with N clusters.
DIVISIVE CLUSTERING

• This approach starts with all of the objects in the same cluster.
• In each iteration, a cluster is split up into smaller clusters.
• This is done until each object is in its own cluster or the termination condition holds.
• This method is rigid, i.e., once a merging or splitting is done, it can never be undone.

A minimal splitting sketch is shown below.
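One common way to realize the splitting step is to bisect a cluster with 2-means. This sketch reuses the kmeans function from the earlier K-means sketch; the function name divisive and the choice to always split the largest cluster are assumptions.

import numpy as np

def divisive(X, num_clusters):
    """Top-down splitting: repeatedly bisect the largest cluster with 2-means."""
    clusters = [np.arange(len(X))]  # start with all data points in one cluster (the root)
    while len(clusters) < num_clusters:
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))  # pick the largest cluster
        members = clusters.pop(idx)
        _, labels = kmeans(X[members], k=2)  # split it into two child clusters
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters  # arrays of data-point indices, one per cluster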
APPLICATIONS

• Cluster analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer base, and characterize those groups based on their purchasing patterns.
• With hierarchical clustering, we don't have to pre-specify any particular number of clusters.
• It is easy to decide the number of clusters by merely looking at the dendrogram.
Summary

• Use the centroid of each cluster to represent the cluster.
 Compute the radius and standard deviation of the cluster to determine its spread in each dimension.
 The centroid representation alone works well if the clusters are of hyper-spherical shape.
 If clusters are elongated or of other shapes, centroids are not sufficient.
Summary

• Hierarchical clustering is a popular method for grouping objects.
• It creates groups so that objects within a group are similar to each other and different from objects in other groups.
• Types of hierarchical clustering:
 Agglomerative clustering is one of the most common types of hierarchical clustering, used to group similar objects into clusters.
 In divisive hierarchical clustering, all the data points start out in a single cluster, and in every iteration, the data points that are not similar are separated from the cluster.
Self-Assessment Questions

1. What are the two types of hierarchical clustering?

(a) Top-down clustering (Divisive)
(b) Bottom-up clustering (Agglomerative)
(c) Both a and b
(d) Dendrogram

2. Hierarchical clustering should be mainly used for exploration.

(a) TRUE
(b) FALSE

Self-Assessment Questions

3. Which of the following is not a clustering method?

(a) DBSCAN
(b) Hierarchical
(c) Grid-based
(d) Project-based

4. In __________ clustering, the clusters formed make up a tree-type structure based on the hierarchy.

(a) DBSCAN
(b) Hierarchical
(c) Grid-based
(d) Project-based
THANK YOU
