CSE3506 - Essentials of Data Analytics: Facilitator: Dr Sathiya Narayanan S
Email: [email protected]
Handphone No.: +91-9944226963
Module 3: Clustering
Introduction to Clustering
K-Means Clustering
K-Medoids Clustering
Hierarchical Clustering
Applications of Clustering
Introduction to Clustering
K-Means Clustering
The cluster centres are updated iteratively as
$$z_i(n+1) = \frac{1}{N_i} \sum_{x \in G_i(n)} x,$$
where $G_i(n)$ is the set of data points assigned to cluster $i$ at iteration $n$ and $N_i$ is the number of points in $G_i(n)$.
Question 3.1
Apply K-means clustering to cluster the following samples/data points:
(0,0), (0,1), (1,0), (3,3), (5,6), (8,9), (9,8) and (9,9).
Fix K = 2 and choose (0,0) and (5,6) as the initial cluster centres.
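A minimal NumPy sketch of the K-means iteration for this question is given below. The data points and initial centres are taken from the question itself; the function and variable names are illustrative, and the sketch assumes no cluster ever becomes empty.

```python
import numpy as np

def kmeans(points, centres, max_iter=100):
    """Plain K-means: alternate nearest-centre assignment and mean update."""
    for _ in range(max_iter):
        # Assign each point to the nearest centre (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centre to the mean of its assigned points.
        new_centres = np.array([points[labels == k].mean(axis=0)
                                for k in range(len(centres))])
        if np.allclose(new_centres, centres):
            break  # centres stopped moving: converged
        centres = new_centres
    return centres, labels

points = np.array([(0, 0), (0, 1), (1, 0), (3, 3),
                   (5, 6), (8, 9), (9, 8), (9, 9)], dtype=float)
init = np.array([(0, 0), (5, 6)], dtype=float)  # K = 2, as specified
centres, labels = kmeans(points, init)
print(centres)  # final cluster centres
print(labels)   # cluster index of each point
```

Running this sketch, (3,3) first joins the (5,6) cluster and then moves to the origin cluster, giving final centres (1, 1) and (7.75, 8).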
K-Medoids Clustering
In K-medoids clustering, each cluster is represented by a cluster
medoid, which is one of the data points in the cluster.
The medoid of a cluster is defined as the data point in the cluster
whose average dissimilarity to all the other data points in the cluster
is minimal. Since the medoid is the most centrally located point in the
cluster, the cluster representatives are easier to interpret than the
centroids produced by K-means.
K-medoids clustering can use arbitrary dissimilarity measures, whereas
K-means generally requires Euclidean distance for good
performance. In practice, K-medoids typically uses Manhattan distance
and minimizes the sum of pairwise dissimilarities.
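A minimal alternating K-medoids sketch using Manhattan distance is shown below; it illustrates the idea rather than the full PAM algorithm, and all names in it are assumptions of this sketch.

```python
import numpy as np

def manhattan(a, b):
    # Pairwise Manhattan (L1) distances between rows of a and rows of b.
    return np.abs(a[:, None, :] - b[None, :, :]).sum(axis=2)

def kmedoids(points, medoid_idx, max_iter=100):
    """Alternate two steps: assign each point to its nearest medoid,
    then re-pick each medoid as the cluster member whose total
    dissimilarity to the other members is minimal."""
    medoid_idx = np.asarray(medoid_idx)
    for _ in range(max_iter):
        labels = manhattan(points, points[medoid_idx]).argmin(axis=1)
        new_idx = medoid_idx.copy()
        for k in range(len(medoid_idx)):
            members = np.where(labels == k)[0]
            # Total dissimilarity from each member to all other members.
            costs = manhattan(points[members], points[members]).sum(axis=1)
            new_idx[k] = members[costs.argmin()]
        if np.array_equal(new_idx, medoid_idx):
            break  # medoids stopped changing: converged
        medoid_idx = new_idx
    return points[medoid_idx], labels

points = np.array([(0, 0), (0, 1), (1, 0), (3, 3),
                   (5, 6), (8, 9), (9, 8), (9, 9)], dtype=float)
medoids, labels = kmedoids(points, [0, 4])  # start from (0,0) and (5,6)
```

Unlike the K-means mean update, the representative always remains an actual data point, which is what makes the medoid interpretable.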
As in the case of K-means, the value of K needs to be specified
beforehand. A heuristic approach, the 'silhouette method', can be
used to determine the optimal value of K.
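A sketch of the silhouette method using scikit-learn's silhouette_score is given below. K-means is used here only because scikit-learn ships it; the same K-selection loop applies unchanged to any K-medoids implementation (silhouette_score also accepts metric='manhattan').

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

points = np.array([(0, 0), (0, 1), (1, 0), (3, 3),
                   (5, 6), (8, 9), (9, 8), (9, 9)], dtype=float)

# Silhouette scores lie in [-1, 1]; pick the K with the highest score.
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    print(k, silhouette_score(points, labels))
```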
Hierarchical Clustering
The dendrogram obtained at the end of hierarchical clustering shows
the hierarchical relationship between the clusters.
In agglomerative hierarchical clustering, each data point starts in its
own cluster and the two most similar clusters are merged at every step.
After each merging step, the similarity matrix must be updated. The
update can be based on (i) the two most similar points across the
clusters (single-linkage), (ii) the two least similar points across the
clusters (complete-linkage), or (iii) the centres of the clusters
(mean or average-linkage). Refer to Figure 2.
The choice of similarity or distance metric and the choice of linkage
criteria are always application-dependent.
Hierarchical clustering can also be performed by initially treating all
data points as one cluster and then successively splitting it. This
approach is called divisive hierarchical clustering.
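The three linkage criteria above map directly onto SciPy's agglomerative routines; the sketch below (toy data, illustrative names) draws one dendrogram per linkage.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.random.default_rng(0).random((10, 2))  # toy 2-D data

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, method in zip(axes, ["single", "complete", "average"]):
    Z = linkage(points, method=method)  # agglomerative merge history
    dendrogram(Z, ax=ax)                # hierarchy of merges
    ax.set_title(f"{method}-linkage")
plt.show()
```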
Question 3.2
Consider the similarity matrix given below.
Applications of Clustering
Module-3 Summary