Clustering
When you first encounter an unsupervised learning problem, it can be confusing: you aren't looking for a specific prediction but rather for structure in the data. The process of finding groups of similar entities within a dataset is known as clustering, or cluster analysis, and it is one of the most popular techniques in data science.
Entities within a group are more similar to each other than to entities in other groups. In this article, I will take you through the types of clustering, different
clustering algorithms, and a comparison between two of the most commonly used
methods of clustering in machine learning.
Clustering is the task of dividing unlabeled data points into clusters such that similar data points fall in the same cluster while dissimilar points fall in different clusters. In simple words, the aim of the clustering process is to segregate groups with similar traits and assign them into clusters.
Let’s understand this with an example. Suppose you are the head of a rental store and wish
to understand the preferences of your customers to scale up your business. Is it possible
for you to look at the details of each customer and devise a unique business strategy for
each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for customers in each of these 10 groups. And this is what we call clustering.
Now that we understand what clustering is, let's take a look at its different types.
1. Hard Clustering: Each data point either fully belongs to a cluster or does not. For instance, in the example above, every customer is assigned to exactly one of the ten groups.
2. Soft Clustering: Rather than assigning each data point to a single cluster, the algorithm assigns a probability, or likelihood, of the point belonging to each cluster. In the same scenario, each customer receives a probability of belonging to each of the ten retail store clusters.
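The difference between the two types can be illustrated in a few lines. The sketch below is a minimal example, assuming scikit-learn is available (the article does not name a library): k-means produces hard labels, while a Gaussian mixture model produces per-cluster probabilities. The customer spending values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Toy 1-D "spending" values for six hypothetical customers (illustrative only)
X = np.array([[1.0], [1.2], [0.8], [8.0], [8.3], [7.9]])

# Hard clustering: each customer gets exactly one cluster label
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each customer gets a probability per cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_probs = gmm.predict_proba(X)  # shape (6, 2); each row sums to 1
```

Note that the soft probabilities can be turned into hard labels by taking the most likely cluster per row, but the reverse is not possible: hard labels discard the uncertainty information.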
Since the task of clustering is subjective, there are many means of achieving it. Every methodology follows a different set of rules for defining 'similarity' among data points. In fact, more than 100 clustering algorithms are known, but only a few are widely used. Let's look at them in detail:
1. Connectivity Models
As the name suggests, these models are based on the notion that the data points closer in
data space exhibit more similarity to each other than the data points lying farther away.
These models can follow two approaches. In the first (agglomerative) approach, every data point starts in its own cluster, and the closest clusters are merged as the allowed distance increases. In the second (divisive) approach, all data points start in a single cluster, which is then split into smaller clusters. The choice of distance function is also subjective. These
models are very easy to interpret but lack scalability for handling big datasets. Examples of
these models are the hierarchical clustering algorithms and their variants.
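The agglomerative approach can be sketched in a few lines. This is a minimal example assuming SciPy is available (an assumption; the article does not name a library): `linkage` builds the merge tree bottom-up, and `fcluster` cuts it into a chosen number of flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated pairs of points (toy data for illustration)
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# Agglomerative approach: start with each point as its own cluster,
# then repeatedly merge the closest clusters
Z = linkage(X, method="average")  # the linkage/distance choice is subjective

# Cut the merge tree to obtain two flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
```

Because the full merge tree is built, hierarchical clustering lets you inspect groupings at every distance level, but that same tree construction is what makes it expensive on large datasets.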
2. Centroid Models
These clustering algorithms derive similarity from the proximity of a data point to the centroid, or cluster center. The popular k-means algorithm falls into this category. These models require the number of clusters to be specified beforehand, which demands some prior knowledge of the dataset. They run iteratively and converge to local optima.
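A minimal k-means sketch, again assuming scikit-learn (the data is invented for illustration). Note that the number of clusters, k, is passed in up front, and `n_init` reruns the algorithm from several random starts to reduce the risk of a poor local optimum.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of points (toy data)
X = np.array([[1.0, 1.0], [1.5, 1.0], [8.0, 8.0],
              [8.0, 8.5], [1.0, 1.2], [7.5, 8.0]])

# k must be chosen beforehand; n_init restarts mitigate local optima
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = km.labels_              # hard cluster assignment per point
centroids = km.cluster_centers_  # one center per cluster, shape (2, 2)
```

Each iteration alternates between assigning points to their nearest centroid and recomputing each centroid as the mean of its assigned points, which is why the result is a local rather than global optimum.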
3. Distribution Models
These clustering models are based on the notion of how probable it is that all data points in a cluster belong to the same distribution (for example, a Gaussian, i.e. normal, distribution). These models often suffer from overfitting. A popular example is the expectation-maximization (EM) algorithm, which fits a mixture of multivariate normal distributions.
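A minimal sketch of the EM-based approach, assuming scikit-learn's `GaussianMixture` (which runs EM internally); the two-blob dataset is generated for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Sample 50 points from each of two well-separated Gaussians (toy data)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(5.0, 0.5, (50, 2))])

# EM alternates between estimating point-to-component responsibilities
# (E-step) and re-fitting each Gaussian's mean/covariance (M-step)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)
```

Because each cluster is a full probability distribution, the model can also score how likely a new point is under the fitted mixture, which is one reason it is prone to overfitting when the number of components is too high.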
4. Density Models
These models search the data space for regions where data points are densely packed. They isolate the different dense regions and assign the points within each region to the same cluster. Popular examples of density models are DBSCAN and OPTICS. These models are particularly useful for identifying clusters of arbitrary shape and detecting outliers, as they can separate points located in sparse regions of the data space from points that belong to dense regions.
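The outlier behaviour can be seen in a short DBSCAN sketch, assuming scikit-learn; the two dense blobs and the single isolated point are invented for illustration. DBSCAN labels points in sparse regions as noise (label -1) rather than forcing them into a cluster.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # dense region A
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # dense region B
              [10.0, 10.0]])                         # isolated point

# eps and min_samples define what counts as a "dense" neighbourhood
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
labels = db.labels_  # noise points receive the label -1
```

Note that, unlike k-means, DBSCAN does not need the number of clusters up front; it is implied by the density parameters.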
Applications of Clustering
Clustering has a large number of applications spread across various domains.
Some of the most popular applications of clustering are recommendation engines, market
segmentation, social network analysis, search result grouping, medical imaging, image
segmentation, and anomaly detection.
Key Takeaways
1. Clustering helps to identify patterns in data and is useful for exploratory data analysis, customer segmentation, anomaly detection, pattern recognition, and image segmentation.
2. It is a powerful tool for understanding data and can help to reveal insights that may not be apparent through other methods of analysis.
3. The choice of clustering algorithm and the number of clusters to use depend on the nature of the data and the specific problem at hand.