Text Analytics Unit-3
1) Centroid-based Clustering (K-Means):
K-means clustering lets us group data into different clusters, discovering the
categories in an unlabeled dataset on its own, without any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between each data point and the
centroid of its cluster.
The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until it finds the best clusters.
The value of k must be chosen in advance in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
Determines the best positions for the K centroids through an iterative process.
Assigns each data point to its closest centroid.
Hence each cluster has data points with some commonalities, and it is far away from
the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as initial centroids. (They need not come from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat Step 3, i.e., reassign each data point to the new
closest centroid.
Step-6: If any reassignment occurred, go back to Step-4; otherwise, FINISH.
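The steps above can be sketched with NumPy (a minimal illustration, not a production implementation; the function name and parameters are our own):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random points from the dataset as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop once no centroid moves (no reassignments change)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

This sketch omits edge cases (e.g., a cluster losing all its points), which library implementations handle.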
Example 1
Here is a simple example to understand how k-means works. In this example, we first
generate a 2D dataset containing 4 different blobs and then apply the k-means
algorithm to see the result.
Generate the sample data, make an object of KMeans along with providing the number
of clusters, train the model, and do the prediction as follows −
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
With the help of the following code we can plot and visualize the cluster centers picked by the
k-means estimator −
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='summer')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=100, alpha=0.9)
plt.show()
Advantages
The following are some advantages of K-Means clustering algorithms −
It is very easy to understand and implement.
If we have a large number of variables, K-means is faster than
hierarchical clustering.
On re-computation of centroids, an instance can change its cluster.
K-means forms tighter clusters than hierarchical
clustering.
Disadvantages
The following are some disadvantages of K-Means clustering algorithms −
It is difficult to predict the number of clusters, i.e., the value of k.
The output is strongly impacted by the initial inputs, such as the number of clusters (the value of
k) and the initial centroids.
The order of the data can have a strong impact on the final output.
It is very sensitive to rescaling: if we rescale the data by
normalization or standardization, the output can change completely.
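Since choosing k is the first difficulty listed above, a common heuristic is the elbow method: run K-means for several values of k and plot the inertia (the sum of squared distances of points to their centroids), then look for the "elbow" where the curve flattens. A minimal sketch, assuming scikit-learn and a synthetic make_blobs dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Synthetic data with 4 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = []
ks = range(1, 9)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to centroids

# The curve drops sharply up to k=4, then flattens: the "elbow"
plt.plot(list(ks), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()
```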
Applications of K-Means Clustering Algorithm
Market segmentation
Document Clustering
Image segmentation
Image compression
Customer segmentation
Analyzing trends in dynamic data
2) Agglomerative Hierarchical Clustering
The agglomerative hierarchical clustering algorithm is a popular example of HCA. To
group the data points into clusters, it follows a bottom-up approach: the
algorithm treats each data point as a single cluster at the beginning and then
repeatedly merges the closest pair of clusters. It does this until all the clusters are
merged into a single cluster that contains the whole dataset.
Step-1: Treat each data point as a single cluster. If there are N data points,
the number of clusters will also be N.
Step-2: Take the two closest data points or clusters and merge them into one cluster.
There will now be N-1 clusters.
Step-3: Again, take the two closest clusters and merge them to form one
cluster. There will be N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following
clusters. Consider the below images:
Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
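Steps 1–5 can be sketched with SciPy, which builds the full bottom-up merge history and can draw the dendrogram (the points are synthetic, and the cut at 3 clusters is an arbitrary choice for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Six synthetic 2D points forming three obvious pairs
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)

# Steps 1-4: the full merge history (N-1 merges for N points)
Z = linkage(X, method='ward')

# Step 5: draw the dendrogram of the merge history
dendrogram(Z)
plt.show()

# "Divide the clusters as per the problem": here, cut the tree into 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
```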
Linkage methods measure the distance between two clusters:
1. Single Linkage: the shortest distance between the closest points of two
different clusters.
2. Complete Linkage: the farthest distance between two points of two
different clusters. It is one of the popular linkage methods, as it forms tighter
clusters than single linkage.
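scikit-learn's AgglomerativeClustering exposes this linkage choice directly; a minimal sketch using complete linkage on synthetic points:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Six synthetic 2D points forming three obvious pairs
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]])

# 'complete' uses the farthest-pair distance between clusters
agg = AgglomerativeClustering(n_clusters=3, linkage='complete')
labels = agg.fit_predict(X)
```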
ADVANTAGES
1. No need to know in advance how many clusters are required.
APPLICATIONS
Image Segmentation
Customer Segmentation
Document Clustering
3) Hierarchical Clustering in Machine Learning
Hierarchical clustering is another unsupervised machine learning algorithm, which is
used to group unlabeled datasets into clusters; it is also known as hierarchical
cluster analysis or HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look
similar, but the two differ in how they work; for example, hierarchical clustering has
no requirement to predetermine the number of clusters as the K-Means algorithm does.
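As a sketch of that difference: scikit-learn's AgglomerativeClustering can cut the hierarchy by a distance threshold instead of a preset number of clusters (the threshold value here is arbitrary and the data synthetic):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Four synthetic points forming two tight pairs far apart
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)

# n_clusters=None + distance_threshold: merges stop once the next
# merge would exceed the threshold, so k is discovered, not preset
model = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0)
labels = model.fit_predict(X)
print(model.n_clusters_)  # number of clusters found by the cut
```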
Properties of hierarchical clustering:
Easy to implement
Top-down approach (divisive) or bottom-up approach (agglomerative)
Complete clustering
Computationally expensive
HIERARCHICAL CLUSTERING APPLICATIONS:
Image Segmentation
Environmental Science