
UNSUPERVISED LEARNING - CLUSTERING

CO-3
AIM

To familiarize students with the concepts of unsupervised machine learning, its differences from supervised machine learning, and the use of unsupervised learning, particularly clustering.

INSTRUCTIONAL OBJECTIVES

This session is designed to:


1. Introduce unsupervised learning
2. Explain the K-means clustering algorithm
3. Describe hierarchical clustering and its types

LEARNING OUTCOMES

At the end of this session, you should be able to:


1. Differentiate between supervised and unsupervised learning
2. Apply the K-means clustering algorithm
3. Explain hierarchical clustering and its types
4. Summarize the key points
5. Attempt the self-assessment questions
Supervised learning vs. unsupervised learning

• Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute.
 These patterns are then utilized to predict the values of the target attribute in future data instances.

• Unsupervised learning: the data have no target attribute.
 We want to explore the data to find some intrinsic structures in them.
Unsupervised learning - Clustering

• Clustering is a technique for finding similarity groups in data, called clusters. That is,
 it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters.

• Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given, as they are in supervised learning.

• For historical reasons, clustering is often considered synonymous with unsupervised learning.
 In fact, association rule mining is also unsupervised.


An illustration

• The data set has three natural groups of data points, i.e., 3 natural
clusters.
What is clustering for?

• Let us see some real-life examples

• Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
 Tailor-made for each person: too expensive.
 One-size-fits-all: does not fit all.

• Example 2: In marketing, segment customers according to their similarities,
 to do targeted marketing.
Aspects of clustering

• A clustering algorithm
 Partitional clustering
 Hierarchical clustering
 …

• A distance (similarity, or dissimilarity) function

• Clustering quality
 Inter-cluster distance → maximized
 Intra-cluster distance → minimized

• The quality of a clustering result depends on the algorithm, the distance function, and the application.
K-means clustering

• K-means is a partitional clustering algorithm.

• Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.

• The k-means algorithm partitions the given data into k clusters.
 Each cluster has a cluster center, called the centroid.
 k is specified by the user.
K-means algorithm

Given k, the k-means algorithm works as follows:

1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
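The following is a minimal Python/NumPy sketch of these four steps. The function name kmeans, the random seeding, and the fixed iteration cap are illustrative assumptions, not part of the original slides.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, r) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute each centroid as the mean of its current members.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

For example, centroids, labels = kmeans(X, k=3) partitions a data matrix X into three clusters.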


K-means algorithm – (cont.…)
Stopping/convergence criterion

• no (or minimum) re-assignments of data points to different clusters,

• no (or minimum) change of centroids, or

• minimum decrease in the sum of squared error (SSE),

  SSE = Σ (j = 1 to k) Σ (x ∈ Cj) dist(x, mj)²

 where Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data points in Cj), and dist(x, mj) is the distance between data point x and centroid mj.
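As a quick illustration, the SSE can be computed from the output of the kmeans sketch shown earlier; the helper name sse is an assumption.

import numpy as np

def sse(X, centroids, labels):
    """Sum of squared distances of each point to its assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

# e.g. centroids, labels = kmeans(X, k=3); print(sse(X, centroids, labels))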
An example / An example (cont.…) / An example distance function
(These slides illustrate the K-means iterations and an example distance function with figures only.)
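A minimal sketch of the Euclidean distance commonly used as the K-means distance function; the function name euclidean is an assumption.

import numpy as np

def euclidean(x, m):
    """dist(x, m): Euclidean distance between a data point x and a centroid m."""
    return float(np.sqrt(((x - m) ** 2).sum()))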
A disk version of k-means

• K-means can be implemented with data on disk.
 In each iteration, it scans the data once,
 as the centroids can be computed incrementally.

• It can be used to cluster large datasets that do not fit in main memory.

• We need to control the number of iterations.
 In practice, a limit is set (e.g., fewer than 50 iterations).

• It is not the best method. There are other scale-up algorithms, e.g., BIRCH.
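The incremental centroid computation keeps only a running sum and count per cluster while streaming over the data. The sketch below is one way a single disk-based iteration could look; the function name one_disk_pass and the chunked data source are assumptions.

import numpy as np

def one_disk_pass(data_chunks, centroids):
    """One k-means iteration over data read from disk in chunks.
    Only per-cluster running sums and counts are kept in memory."""
    k, r = centroids.shape
    sums = np.zeros((k, r))
    counts = np.zeros(k, dtype=int)
    for chunk in data_chunks:  # e.g. (m, r) arrays loaded from disk one at a time
        d = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = chunk[labels == j]
            sums[j] += members.sum(axis=0)
            counts[j] += len(members)
    # New centroid = running sum / count (keep the old centroid if a cluster is empty).
    return np.where(counts[:, None] > 0, sums / np.maximum(counts, 1)[:, None], centroids)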
Strengths of k-means

• Strengths:
 Simple: easy to understand and to implement.
 Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
 Since both k and t are usually small, k-means is considered a linear algorithm.

• K-means is the most popular clustering algorithm.

• Note that it terminates at a local optimum if SSE is used. The global optimum is hard to find due to complexity.
Weaknesses of k-means

• The algorithm is only applicable if the mean is defined.
 For categorical data, the k-modes variant can be used: the centroid is represented by the most frequent value of each attribute (see the sketch after this list).

• The user needs to specify k.

• The algorithm is sensitive to outliers.
 Outliers are data points that are very far away from other data points.
 Outliers could be errors in the data recording, or special data points with very different values.
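A minimal sketch of the k-modes style centroid for categorical data; the helper name mode_centroid and the use of collections.Counter are assumptions.

from collections import Counter

def mode_centroid(rows):
    """Centroid of a cluster of categorical records: the most frequent value per attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

# e.g. mode_centroid([("red", "S"), ("red", "M"), ("blue", "M")]) -> ("red", "M")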
Weaknesses of k-means: Problems with outliers
Hierarchical Clustering
• A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the following steps:
1. Identify the two clusters that are closest together, and
2. Merge these two most similar (closest) clusters.
These steps continue until all the clusters are merged together.
• In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters.
• A dendrogram is a tree-like diagram that records the sequence of merges or splits.
Hierarchical Clustering

Produces a nested sequence of clusters (a tree), also called a dendrogram.
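A minimal sketch of producing such a dendrogram with SciPy; the toy data, the average-linkage choice, and the use of matplotlib are assumptions for illustration.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.5]])
Z = linkage(X, method="average")  # agglomerative merges, recorded bottom-up
dendrogram(Z)                     # tree-like diagram of the merge sequence
plt.show()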


Types of hierarchical clustering

• Agglomerative (bottom-up) clustering: builds the dendrogram (tree) from the bottom level, and
 merges the most similar (or nearest) pair of clusters,
 stops when all the data points are merged into a single cluster (i.e., the root cluster).

• Divisive (top-down) clustering: starts with all data points in one cluster, the root.
 Splits the root into a set of child clusters; each child cluster is recursively divided further.
 Stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.
AGGLOMERATIVE CLUSTERING

• Agglomerative clustering is one of the most common types of hierarchical clustering, used to group similar objects into clusters.
• Agglomerative clustering is also known as AGNES (Agglomerative Nesting). In agglomerative clustering, each data point acts as an individual cluster, and at each step, data objects are grouped in a bottom-up manner.
• Initially, each data object is in its own cluster. At each iteration, clusters are combined with other clusters until one cluster is formed.
AGGLOMERATIVE CLUSTERING

• The algorithm for agglomerative hierarchical clustering is:

1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are most similar (closest) to each other.
4. Recalculate the proximity matrix for the new set of clusters.
5. Repeat Steps 3 and 4 until only a single cluster remains.

A minimal sketch of this procedure is shown below.
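This from-scratch sketch uses Euclidean distance and single linkage (distance between the closest pair of points) as the cluster-to-cluster proximity; the function name agglomerative and the linkage choice are assumptions for illustration.

import numpy as np

def agglomerative(X, num_clusters=1):
    """Bottom-up merging: start with one cluster per point, repeatedly merge the closest pair."""
    clusters = [[i] for i in range(len(X))]  # Step 1: every data point is its own cluster
    while len(clusters) > num_clusters:
        best = None
        # Step 2: proximity between every pair of clusters (single linkage).
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Steps 3-4: merge the closest pair; proximities are recomputed on the next pass.
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters  # lists of data-point indices, one list per cluster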
DIVISIVE HIERARCHICAL CLUSTERING

• Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering.
• In divisive hierarchical clustering, all the data points start out in a single cluster, and in every iteration, the data points that are not similar are separated from the cluster.
• The separated data points are treated as individual clusters. Finally, we are left with N clusters.
DIVISIVE CLUSTERING

• This approach starts with all of the objects in the same cluster.
• In each iteration, a cluster is split up into smaller clusters.
• This is done until each object is in its own cluster or the termination condition holds.
• This method is rigid, i.e., once a merging or splitting is done, it can never be undone.

A minimal splitting sketch is shown below.
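One common way to realize the splitting step is to bisect a cluster with 2-means. This sketch reuses the kmeans function from the earlier K-means sketch; the function name divisive and the choice to always split the largest cluster are assumptions.

import numpy as np

def divisive(X, num_clusters):
    """Top-down splitting: repeatedly bisect the largest cluster with 2-means."""
    clusters = [np.arange(len(X))]  # start with all data points in one cluster (the root)
    while len(clusters) < num_clusters:
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))  # pick the largest cluster
        members = clusters.pop(idx)
        _, labels = kmeans(X[members], k=2)  # split it into two child clusters
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters  # arrays of data-point indices, one per cluster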
APPLICATIONS

• Cluster analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer base, and characterize those groups based on their purchasing patterns.
• With hierarchical clustering, we don't have to pre-specify any particular number of clusters.
• It is easy to decide the number of clusters by merely looking at the dendrogram.
Summary

• Use the centroid of each cluster to represent the cluster.
 Compute the radius and standard deviation of the cluster to determine its spread in each dimension.
 The centroid representation alone works well if the clusters are of hyper-spherical shape.
 If clusters are elongated or of other shapes, centroids are not sufficient.
Summary

• Hierarchical clustering is a popular method for grouping objects.
• It creates groups so that objects within a group are similar to each other and different from objects in other groups.
• Types of hierarchical clustering:
 Agglomerative clustering is one of the most common types of hierarchical clustering, used to group similar objects into clusters.
 In divisive hierarchical clustering, all the data points start out in a single cluster, and in every iteration, the data points that are not similar are separated from the cluster.
Self-Assessment Questions

1. What are the two types of hierarchical clustering?

(a) Top-down clustering (Divisive)
(b) Bottom-up clustering (Agglomerative)
(c) Both a and b
(d) Dendrogram

2. Hierarchical clustering should be mainly used for exploration.

(a) TRUE
(b) FALSE

Self-Assessment Questions

3. Which of the following is not a clustering method?

(a) DBSCAN
(b) Hierarchical
(c) Grid-based
(d) Project-based

4. In __________ clustering, the clusters formed make up a tree-type structure based on the hierarchy.

(a) DBSCAN
(b) Hierarchical
(c) Grid-based
(d) Project-based
THANK YOU
