Machine Learning & Data Mining: Understanding
6. Clustering Methodology
The purpose of cluster analysis is to place objects into groups, or clusters, suggested by the data rather than defined a priori, such that objects in a given cluster tend to be similar to each other in some sense, while objects in different clusters tend to be dissimilar. In other words, clustering finds groups of objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
7. Types of Clustering
A clustering is a set of clusters. Several types of clustering methods are in use; the most important are:
Partitioning clustering
Hierarchical clustering
Fuzzy clustering
Density-based clustering, and
Distribution Model-based clustering.
Partitioning clustering
Partitioning clustering divides the data set into a fixed number of groups: a division of the data objects into non-overlapping subsets (clusters) such that each data object belongs to exactly one subset.
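As an illustration only, here is a minimal sketch of partitioning clustering with k-means, assuming scikit-learn and NumPy are available; the synthetic 2-D data and the choice of k = 3 are assumptions made for this example.

```python
# Minimal k-means sketch (scikit-learn assumed); data is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs of 50 points each (illustrative data).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

# The number of clusters k must be chosen up front, which is the
# defining trait of partitioning methods.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # each point falls in exactly one cluster
print(labels[:10])
print(kmeans.cluster_centers_.round(2))
```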
Hierarchical clustering
Hierarchical clustering divides the data set into clusters without requiring the user to specify the number of clusters before training the model. The result is a set of nested clusters organized as a hierarchical tree.
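A minimal sketch of hierarchical (agglomerative) clustering follows, assuming SciPy is available; the synthetic data and the average-linkage choice are assumptions for illustration. No cluster count is fixed up front; clusters are read off the tree afterwards.

```python
# Minimal agglomerative clustering sketch (SciPy assumed); data is synthetic.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(20, 2))
               for c in ((0, 0), (4, 4))])

Z = linkage(X, method="average")   # builds the full nested tree (dendrogram)
# Clusters are obtained afterwards, e.g. by cutting the tree at a distance.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)
```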
Density-Based Clustering
In this technique, clusters are formed by separating regions of different density in the data. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is the most widely used algorithm of this type.
Its main idea is that, for each point in a cluster, the neighborhood of a given radius must contain at least a minimum number of points.
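A minimal DBSCAN sketch follows, assuming scikit-learn; the eps radius, min_samples threshold, and synthetic data are illustrative assumptions that mirror the neighborhood idea described above.

```python
# Minimal DBSCAN sketch (scikit-learn assumed); data is synthetic.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
dense = rng.normal(loc=(0, 0), scale=0.3, size=(100, 2))  # one dense region
noise = rng.uniform(low=-4, high=4, size=(20, 2))          # scattered noise
X = np.vstack([dense, noise])

# eps is the neighborhood radius, min_samples the minimum point count in it.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))   # label -1 marks noise points that belong to no cluster
```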
Distribution Model-Based Clustering
In this technique, clusters are formed by identifying the probability that all the data points in a cluster come from the same distribution (e.g., Normal/Gaussian). The most popular algorithm of this type is Expectation-Maximization (EM) clustering using Gaussian Mixture Models (GMM).
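The following is a minimal sketch of distribution model-based clustering, fitting a Gaussian Mixture Model by EM with scikit-learn; the number of components and the synthetic data are assumptions made for illustration.

```python
# Minimal GMM/EM clustering sketch (scikit-learn assumed); data is synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(80, 2))
               for c in ((0, 0), (5, 1))])

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)
hard_labels = gmm.predict(X)        # most likely component for each point
soft_labels = gmm.predict_proba(X)  # membership probabilities per component
print(hard_labels[:5])
print(soft_labels[:5].round(3))
```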
9. Clustering Algorithms
1. K-means and its variants
2. Hierarchical clustering
3. Density-based clustering
K-means is a computationally expensive algorithm, since at each iteration it computes the distance from every data point to the centroid of every cluster. This makes it difficult to apply to very large data sets.
K-means will converge for the common similarity measures mentioned above.
Most of the convergence happens in the first few iterations.
Often the stopping condition is relaxed to 'until relatively few points change clusters'.
Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
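To make the O(n * K * I * d) cost concrete, here is a minimal NumPy sketch of k-means (not a reference implementation); the data shape, K, and iteration count are illustrative assumptions. Each iteration builds an n x K distance matrix over d attributes, which is where the per-iteration n * K * d work comes from.

```python
# Minimal k-means sketch in NumPy, illustrating the O(n * K * I * d) cost.
import numpy as np

def kmeans(X, K, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(iterations):                       # I iterations
        # n x K distance matrix: roughly n * K * d work per iteration
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                  # assign each point
        # Recompute centroids; keep the old one if a cluster emptied out.
        centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
    return labels, centroids

X = np.random.default_rng(4).normal(size=(300, 2))     # n = 300 points, d = 2
labels, centroids = kmeans(X, K=3)
print(np.bincount(labels), centroids.round(2))
```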
How to define inter-cluster similarity in hierarchical clustering:
MIN (single link)
MAX (complete link)
Group Average: need to use average connectivity for scalability, since total proximity favors large clusters
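A short SciPy sketch comparing the three definitions is given below; single, complete, and average linkage correspond to MIN, MAX, and Group Average. The random data and the three-cluster cut are illustrative assumptions.

```python
# Compare MIN / MAX / Group Average linkages with SciPy (data is synthetic).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(5).normal(size=(30, 2))
for method in ("single", "complete", "average"):      # MIN, MAX, group average
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, np.bincount(labels))
```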