Text Analytics Unit-3

The document discusses different types of clustering algorithms including centroid-based clustering, k-means clustering, agglomerative hierarchical clustering, and divisive hierarchical clustering. It provides details on how each algorithm works, examples to illustrate the algorithms, and discusses advantages and applications of each type of clustering.

UNIT 3

1) Centroid-based Clustering:

Centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to hierarchical clustering, defined below.

k-means is the most widely used centroid-based clustering algorithm: efficient, effective, and simple. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers.

(Figure: example of centroid-based clustering)

What is the K-Means Algorithm?

K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K defines the number of predefined clusters to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories present in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of distances between each data point and the centroid of its cluster.
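In standard notation, for K clusters C_1, ..., C_K with centroids \mu_1, ..., \mu_K, this objective (the within-cluster sum of squares) is commonly written as:

$$J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where each data point x contributes the squared distance to the centroid \mu_i of the cluster C_i it is assigned to.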

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until the cluster assignments no longer change. The value of k must be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:

o Determines the best positions for the K center points (centroids) through an iterative process.
o Assigns each data point to its closest centroid. The data points nearest to a particular centroid form a cluster.

Hence each cluster contains data points with some commonalities and is well separated from the other clusters.

(Figure: working of the K-Means clustering algorithm)

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the value of K to decide the number of clusters.

Step-2: Select K random points as the initial centroids. (These need not be points from the input dataset.)

Step-3: Assign each data point to its closest centroid, which forms the K predefined clusters.

Step-4: Calculate the mean of the points in each cluster and place a new centroid there.

Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid.

Step-6: If any reassignment occurred, go back to Step-4; otherwise go to FINISH.

Step-7: The model is ready.
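To make these steps concrete, here is a minimal NumPy sketch of the loop above (the function name kmeans and the use of Euclidean distance are our own illustrative choices, not fixed by the notes):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: choose K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iters):
        # Step-3/Step-5: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step-6: if no reassignment occurred, the model is ready
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step-4: recompute each centroid as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids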

Example 1
This is a simple example to understand how k-means works. We first generate a 2D dataset containing 4 different blobs and then apply the k-means algorithm to see the result.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # samples_generator was removed in newer scikit-learn versions

# Generate a 2D dataset containing 4 blobs
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=20)
plt.show()

Make a KMeans object, providing the number of clusters, then train the model and do the prediction as follows −
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

With the help of the following code, we can plot the clustered data and visualize the cluster centers picked by the k-means estimator −
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='summer')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=100, alpha=0.9);
plt.show()
Advantages
The following are some advantages of K-Means clustering algorithms −
 It is very easy to understand and implement.
 If we have a large number of variables, K-means is faster than hierarchical clustering.
 On re-computation of centroids, an instance can change its cluster.
 K-means forms tighter clusters than hierarchical clustering.
Disadvantages
The following are some disadvantages of K-Means clustering algorithms −
 It is somewhat difficult to predict the number of clusters, i.e. the value of k.
 The output is strongly impacted by the initial inputs, such as the number of clusters (the value of k).
 The order of the data can have a strong impact on the final output.
 It is very sensitive to rescaling: if we rescale the data by normalization or standardization, the output can change completely (see the sketch after this list).
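A quick sketch of the rescaling sensitivity (the dataset, StandardScaler, and adjusted_rand_score here are our own illustrative choices):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
X[:, 1] *= 100  # exaggerate the scale of one feature

labels_raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_scaled = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

# adjusted_rand_score is permutation-invariant; a value well below 1.0
# means the two clusterings disagree
print(adjusted_rand_score(labels_raw, labels_scaled))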
Applications of K-Means Clustering Algorithm
 Market segmentation
 Document Clustering
 Image segmentation
 Image compression
 Customer segmentation
 Analyzing trends in dynamic data

2) Agglomerative Hierarchical Clustering
The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the data points into clusters, it follows a bottom-up approach: the algorithm starts by considering each data point as a single cluster and then merges the closest pair of clusters. It does this until all the clusters are merged into a single cluster that contains the whole dataset.

This hierarchy of clusters is represented in the form of a dendrogram.

How does Agglomerative Hierarchical Clustering Work?
The working of the AHC algorithm can be explained using the below steps:

o Step-1: Create each data point as a single cluster. If there are N data points, the number of clusters will also be N.

o Step-2: Take the two closest data points or clusters and merge them to form one cluster. There will now be N-1 clusters.

o Step-3: Again take the two closest clusters and merge them together to form one cluster. There will be N-2 clusters.

o Step-4: Repeat Step 3 until only one cluster is left.

o Step-5: Once all the clusters are combined into one big cluster, cut the dendrogram to divide the data into clusters as the problem requires (see the sketch below).
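A short sketch of these steps using SciPy's linkage and dendrogram functions (the dataset and the choice of Ward linkage are our own assumptions):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# Small 2D dataset so the dendrogram stays readable
X, _ = make_blobs(n_samples=30, centers=3, cluster_std=0.8, random_state=0)

# Steps 1-4: repeatedly merge the closest pair of clusters until one is left
Z = linkage(X, method='ward')

# Step 5: draw the dendrogram; cutting it at a chosen height yields the clusters
dendrogram(Z)
plt.xlabel('data point index')
plt.ylabel('merge distance')
plt.show()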

Measure for the distance between two clusters


The measure of distance between two clusters is crucial for hierarchical clustering. There are various ways to calculate the distance between two clusters, and these choices determine the rule for clustering. Such measures are called linkage methods. Some of the popular linkage methods are given below (a scikit-learn sketch follows the list):
1. Single Linkage: the shortest distance between the closest points of the two clusters.

2. Complete Linkage: the farthest distance between two points of two different clusters. It is one of the popular linkage methods, as it forms tighter clusters than single linkage.

3. Average Linkage: the linkage method in which the distances between all pairs of points (one from each cluster) are added up and then divided by the total number of pairs to calculate the average distance between the two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: the linkage method in which the distance between the centroids of the two clusters is calculated.
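In scikit-learn, these rules are selected via the linkage parameter of AgglomerativeClustering; a brief sketch follows ('single', 'complete', and 'average' match the methods above; centroid linkage is available in SciPy's linkage function rather than in scikit-learn):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# Compare cluster assignments under different linkage rules
for method in ['single', 'complete', 'average']:
    model = AgglomerativeClustering(n_clusters=3, linkage=method)
    labels = model.fit_predict(X)
    print(method, labels[:10])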

STEPS AND EXAMPLE (refer to notes)


Advantages

1. No need to specify the number of clusters in advance.

2. Easy to use and implement


Disadvantages

1. We cannot take a step back in this algorithm: once clusters are merged, the merge cannot be undone.

2. The time complexity is high, at least O(n^2 log n).

APPLICATIONS

Image Segmentation

Social Network Analysis

Customer Segmentation

Document Clustering

Gene Expression Analysis

3) Hierarchical Clustering in Machine Learning
Hierarchical clustering is another unsupervised machine learning algorithm, used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis, or HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two differ in how they work. In particular, hierarchical clustering has no requirement to predetermine the number of clusters, as we did in the K-means algorithm.

The hierarchical clustering technique has two approaches:

1. Agglomerative: a bottom-up approach, in which the algorithm starts by taking all data points as single clusters and merges them until one cluster is left.
2. Divisive: the reverse of the agglomerative algorithm, as it follows a top-down approach. Divisive clustering also does not require the number of clusters to be prespecified. It starts from one cluster containing the whole dataset, requires a method for splitting a cluster, and proceeds by splitting clusters recursively until each data point has been split into a singleton cluster (a toy sketch follows this list).
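Since the notes do not fix a splitting method, here is a toy sketch of divisive (top-down) clustering that uses 2-means to split each cluster; the function name and the stopping rule are our own assumptions:

import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, max_leaf_size=10):
    # Start with one cluster holding the indices of all points
    pending = [np.arange(len(X))]
    final_clusters = []
    while pending:
        idx = pending.pop()
        if len(idx) <= max_leaf_size:
            final_clusters.append(idx)  # small enough: keep as a final cluster
            continue
        # Split the cluster in two with k-means (k=2)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        pending.append(idx[labels == 0])
        pending.append(idx[labels == 1])
    return final_clusters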

ADVANTAGES OF DIVISIVE CLUSTERING:

Easy to implement

Top-down approach

Complete clustering

DISADVANTAGES OF DIVISIVE CLUSTERING

Computationally expensive

Sensitive to noise and outliers

Biased towards globular clusters

APPLICATIONS OF DIVISIVE CLUSTERING


Biology
Marketing

Image Segmentation

Customer churn analysis

Social network analysis

APPLICATIONS OF HIERARCHICAL CLUSTERING:

Biological Data Analysis

Marketing and Customer Segmentation

Image and Text Analysis

Social Network Analysis

Environmental Science
