Text Analytics Unit-3
1) Centroid-based Clustering (K-Means):
K-means clustering lets us group data into different clusters, discovering the
categories in an unlabeled dataset on its own, without any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between each data point and the
centroid of its cluster.
The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until it finds the best clusters.
The value of k must be chosen in advance in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
Determines the best positions for the K centroids through an iterative process.
Assigns each data point to its closest centroid.
Hence each cluster has data points with some commonalities, and it is far away from
the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as initial centroids. (They need not come from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat Step 3, i.e., reassign each data point to the new
closest centroid.
Step-6: If any reassignment occurred, go back to Step-4; otherwise, FINISH.
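The steps above can be sketched with NumPy (a minimal illustration, not a production implementation; the function name and parameters are our own):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random points from the dataset as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop once no centroid moves (no reassignments change)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

This sketch omits edge cases (e.g., a cluster losing all its points), which library implementations handle.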
Example 1
Here is a simple example to understand how k-means works. In this example, we first
generate a 2D dataset containing 4 different blobs and then apply the k-means
algorithm to see the result.
Generate the sample data, make an object of KMeans along with providing the number
of clusters, train the model, and do the prediction as follows −
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
With the help of the following code we can plot and visualize the cluster centers picked by the
k-means estimator −
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='summer')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=100, alpha=0.9)
plt.show()
Advantages
The following are some advantages of K-Means clustering algorithms −
It is very easy to understand and implement.
If we have a large number of variables, K-means is faster than
hierarchical clustering.
On re-computation of centroids, an instance can change its cluster.
K-means forms tighter clusters than hierarchical
clustering.
Disadvantages
The following are some disadvantages of K-Means clustering algorithms −
It is difficult to predict the number of clusters, i.e., the value of k.
The output is strongly impacted by the initial inputs, such as the number of clusters (the value of
k) and the initial centroids.
The order of the data can have a strong impact on the final output.
It is very sensitive to rescaling: if we rescale the data by
normalization or standardization, the output can change completely.
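Since choosing k is the first difficulty listed above, a common heuristic is the elbow method: run K-means for several values of k and plot the inertia (the sum of squared distances of points to their centroids), then look for the "elbow" where the curve flattens. A minimal sketch, assuming scikit-learn and a synthetic make_blobs dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Synthetic data with 4 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = []
ks = range(1, 9)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to centroids

# The curve drops sharply up to k=4, then flattens: the "elbow"
plt.plot(list(ks), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()
```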
Applications of K-Means Clustering Algorithm
Market segmentation
Document Clustering
Image segmentation
Image compression
Customer segmentation
Analyzing trends in dynamic data
2) Agglomerative Hierarchical Clustering
The agglomerative hierarchical clustering algorithm is a popular example of HCA. To
group the data points into clusters, it follows a bottom-up approach: the
algorithm treats each data point as a single cluster at the beginning and then
repeatedly merges the closest pair of clusters. It does this until all the clusters are
merged into a single cluster that contains the whole dataset.
Step-1: Treat each data point as a single cluster. If there are N data points,
the number of clusters will also be N.
Step-2: Take the two closest data points or clusters and merge them into one cluster.
There will now be N-1 clusters.
Step-3: Again, take the two closest clusters and merge them to form one
cluster. There will be N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following
clusters. Consider the below images:
Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
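Steps 1–5 can be sketched with SciPy, which builds the full bottom-up merge history and can draw the dendrogram (the points are synthetic, and the cut at 3 clusters is an arbitrary choice for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Six synthetic 2D points forming three obvious pairs
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)

# Steps 1-4: the full merge history (N-1 merges for N points)
Z = linkage(X, method='ward')

# Step 5: draw the dendrogram of the merge history
dendrogram(Z)
plt.show()

# "Divide the clusters as per the problem": here, cut the tree into 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
```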
Linkage methods measure the distance between two clusters:
1. Single Linkage: the shortest distance between the closest points of two
different clusters.
2. Complete Linkage: the farthest distance between two points of two
different clusters. It is one of the popular linkage methods, as it forms tighter
clusters than single linkage.
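scikit-learn's AgglomerativeClustering exposes this linkage choice directly; a minimal sketch using complete linkage on synthetic points:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Six synthetic 2D points forming three obvious pairs
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]])

# 'complete' uses the farthest-pair distance between clusters
agg = AgglomerativeClustering(n_clusters=3, linkage='complete')
labels = agg.fit_predict(X)
```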
ADVANTAGES
1. No need to know in advance how many clusters are required.
APPLICATIONS
Image Segmentation
Customer Segmentation
Document Clustering
3) Hierarchical Clustering in Machine Learning
Hierarchical clustering is another unsupervised machine learning algorithm, which is
used to group unlabeled datasets into clusters; it is also known as hierarchical
cluster analysis or HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look
similar, but the two differ in how they work; for example, hierarchical clustering has
no requirement to predetermine the number of clusters as the K-Means algorithm does.
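As a sketch of that difference: scikit-learn's AgglomerativeClustering can cut the hierarchy by a distance threshold instead of a preset number of clusters (the threshold value here is arbitrary and the data synthetic):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Four synthetic points forming two tight pairs far apart
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)

# n_clusters=None + distance_threshold: merges stop once the next
# merge would exceed the threshold, so k is discovered, not preset
model = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0)
labels = model.fit_predict(X)
print(model.n_clusters_)  # number of clusters found by the cut
```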
Properties of hierarchical clustering:
Easy to implement
Top-down approach (divisive) or bottom-up approach (agglomerative)
Complete clustering
Computationally expensive
HIERARCHICAL CLUSTERING APPLICATIONS:
Image Segmentation
Environmental Science