Clustering

INTRODUCTION

Clustering techniques, which will be discussed in this chapter, are an example of the second class of unsupervised learning models. The goal of clustering methods is the identification of homogeneous groupings of records, known as clusters, obtained by specifying suitable metrics and the induced notions of distance and similarity between pairs of observations. The observations included in each cluster must be close to one another and remote from those found in other clusters, according to the chosen distance. We discuss the key characteristics of clustering models at the beginning of this chapter. Following that, we demonstrate the most common ways to measure the distance between pairs of observations, in relation to the characteristics of the dataset's attributes. Then, partitioning techniques are discussed, with an emphasis on the K-means and K-medoids algorithms. Finally, we illustrate both agglomerative and divisive hierarchical techniques, in relation to the key metrics that describe the inhomogeneity among different clusters. We also go through several metrics for measuring the effectiveness of clustering methods.

CLUSTERING METHODS
The goal of clustering models is to partition the records of a dataset into clusters: homogeneous groups of observations that are similar to one another and different from the observations contained in other groups. The human brain frequently uses a method of reasoning called affinity grouping to organize objects, and for this reason clustering models have long been used in a variety of fields, including the social sciences, biology, astronomy, statistics, image recognition, digital data processing, marketing, and data mining.

There are several uses for clustering models. In some applications of interest, the clusters produced may offer a useful understanding of the phenomenon under study. For instance, categorizing clients based on their purchasing patterns may identify a cluster that corresponds to a market niche where it may be appropriate to focus marketing efforts for promotional purposes. In addition, grouping data into clusters can constitute the preliminary phase of a data mining project, to be followed by the application of different approaches inside each cluster. In a retention study, for example, a preliminary partition into clusters may be followed by the creation of separate classification models, with the goal of better identifying the clients with a high churn likelihood. Grouping data into clusters may also be useful during exploratory data analysis, both to highlight outliers and to find observations that can stand in for an entire cluster on their own, thereby reducing the size of the dataset.

The following general criteria must be met by clustering methods:

Flexibility: Some clustering techniques can be applied only to numerical attributes, for which the distances between observations are calculated using Euclidean metrics. A flexible clustering technique, however, should also be able to analyze datasets with categorical features. Moreover, algorithms based on Euclidean metrics frequently produce spherical clusters and struggle to recognize more intricate geometrical patterns.

Robustness: An algorithm's robustness is indicated by the stability of the clusters produced with respect to small variations in the attribute values of each observation. This characteristic ensures that any noise present in the data does not significantly impair the clustering procedure. Additionally, the clusters produced must be stable with respect to the order in which the dataset's observations are presented.

Efficiency: Because some applications involve a large number of observations, clustering algorithms must produce clusters quickly in order to ensure reasonable computation times for complex problems. When dealing with large datasets, one may also extract smaller samples in order to create clusters more quickly; this method, however, inherently implies lower robustness for the resulting clusters. Clustering algorithms must also remain effective as the number of attributes in the dataset grows.

Clustering in Business Analytics

Clustering is a technique in Business Analytics used for grouping a set of objects or data points
in such a way that objects in the same group (called a cluster) are more similar to each other than
to those in other groups. It is a form of unsupervised learning, where the goal is to identify
patterns or structures in data without predefined labels or outcomes.

Applications in Business Analytics:

 Customer Segmentation: Grouping customers based on purchasing behavior, demographics, or interests.
 Market Segmentation: Identifying distinct market segments for targeted marketing
campaigns.
 Product Categorization: Classifying products based on features, sales performance, or
user ratings.
 Anomaly Detection: Identifying outliers or unusual patterns in data, which could
indicate fraud or unusual behavior.
 Inventory Management: Grouping products based on demand patterns for optimized
stock management.

Criteria for Applying Clustering Methods

To ensure effective clustering, several criteria should be met:

1. Similarity or Distance Measure:
o A clear measure of similarity or dissimilarity (such as Euclidean distance, Manhattan distance, or cosine similarity) must be chosen to compare data points. This is crucial, as it directly impacts cluster formation; a short sketch of these measures follows this list.
2. Scalability:
o The clustering method should handle large datasets efficiently, especially if
dealing with big data. Scalability ensures that clustering is computationally
feasible for large volumes of data.
3. Cluster Interpretability:
o The resulting clusters should be interpretable, meaning the characteristics of each
cluster should be clear and distinct to draw actionable insights.
4. Cluster Density:
o Clusters should be dense (i.e., points within a cluster should be close to each
other) and well-separated from other clusters to avoid overlapping, which could
lead to ambiguity.
5. Handling of Noise and Outliers:
o The clustering method should be able to handle noisy data or outliers effectively
without distorting the clusters. Robust clustering methods should accommodate
some level of data anomalies.
6. Dimensionality:
o The method should be adaptable to the data’s dimensionality. High-dimensional
data may require specific methods or dimensionality reduction before clustering.
7. Cluster Shape:
o Clusters should be of flexible shape (not necessarily spherical) to capture complex
patterns in data, especially in business scenarios where natural groupings may not
conform to regular shapes.
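
To make the first criterion concrete, here is a minimal Python sketch (using NumPy; the two feature vectors are invented for illustration) that computes the three measures named above:

import numpy as np

# Two hypothetical observations (feature vectors).
x = np.array([2.0, 4.0, 1.0])
y = np.array([1.0, 3.0, 5.0])

# Euclidean distance: square root of the sum of squared differences.
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance: sum of absolute coordinate differences.
manhattan = np.sum(np.abs(x - y))

# Cosine similarity: cosine of the angle between the vectors;
# 1 - similarity is often used as the corresponding dissimilarity.
cosine_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(euclidean, manhattan, 1 - cosine_sim)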

Types of Clustering Methods Based on Cluster Creation

Different clustering methods have unique approaches for creating clusters. Here are the primary
types:

1. Partitioning Methods

 Description: These methods divide the dataset into a set of non-overlapping clusters by
optimizing a given criterion, like minimizing the distance of points from their cluster
centers.
 Examples:
o K-Means: Divides the dataset into K clusters by assigning each data point to the cluster with the nearest mean. The cluster centers (centroids) are iteratively updated.
o K-Medoids (PAM): Similar to K-Means but uses medoids (actual data points) as cluster
centers, making it less sensitive to outliers.

 Application: Useful when the number of clusters K is known or can be estimated.
 Advantages:
o Easy to implement and computationally efficient.
o Works well with large datasets.

 Limitations:
o Sensitive to the choice of initial centroids.
o Assumes clusters are spherical and similar in size.
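
As an illustration of the partitioning approach, here is a minimal K-Means sketch using scikit-learn; the customer-style feature values are invented, and the n_init restarts are one common way to mitigate the initialization sensitivity noted above.

from sklearn.cluster import KMeans
import numpy as np

# Hypothetical customers: annual spend and number of purchases.
X = np.array([[500, 5], [520, 6], [80, 1], [90, 2], [1000, 12], [980, 11]])

# Fit K-Means with K = 3; n_init restarts reduce the sensitivity
# to the initial choice of centroids.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.labels_)           # cluster assignment of each observation
print(kmeans.cluster_centers_)  # final centroids
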
2. Hierarchical Methods

 Description: These methods create a hierarchy of clusters using a tree-like structure called a dendrogram, which can be visualized to determine the optimal number of clusters.
 Types:
o Agglomerative: Starts with individual data points as clusters and merges them step-by-
step based on similarity until a single cluster is formed.
o Divisive: Starts with the entire dataset as a single cluster and splits it into smaller
clusters iteratively.

 Examples:
o Agglomerative Hierarchical Clustering: Commonly uses linkage criteria like single-
linkage (minimum distance), complete-linkage (maximum distance), or average-linkage.

 Application: Useful when the data has a natural hierarchy or when the number of clusters
is unknown.
 Advantages:
o Does not require specifying the number of clusters in advance.
o Can capture complex nested structures.

 Limitations:
o Computationally intensive for large datasets.
o Merging or splitting decisions are irreversible.
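
For the agglomerative case, here is a minimal sketch using SciPy; the points are invented, and average linkage is chosen arbitrarily from the criteria listed above.

from scipy.cluster.hierarchy import linkage, fcluster
import numpy as np

# Hypothetical two-dimensional observations.
X = np.array([[1, 2], [1, 3], [8, 8], [9, 8], [25, 30]])

# Build the full merge hierarchy (the dendrogram) with average linkage.
Z = linkage(X, method="average")

# Cut the tree to obtain a flat partition into 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)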

3. Density-Based Methods

 Description: These methods create clusters based on the density of data points in a
region. A cluster is formed when data points are closely packed together, and regions of
low density are treated as noise or outliers.
 Examples:
o DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clusters based
on a neighborhood radius (ε) and minimum number of points. It can identify arbitrarily
shaped clusters and outliers.
o OPTICS (Ordering Points to Identify the Clustering Structure): Similar to DBSCAN but
provides a more detailed cluster analysis.

 Application: Effective for datasets with noise, outliers, or when clusters are of arbitrary
shape.
 Advantages:
o Can identify clusters of various shapes.
o Handles noise and outliers well.

 Limitations:
o Requires careful tuning of parameters like ε and minimum points.
o May struggle with varying densities.
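
A minimal DBSCAN sketch with scikit-learn follows; the points and parameter values are invented, and the isolated point receives the label -1, i.e., noise.

from sklearn.cluster import DBSCAN
import numpy as np

# Hypothetical data: two dense groups plus one isolated point.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],
              [25.0, 25.0]])

# eps is the neighborhood radius (epsilon); min_samples is the minimum
# number of points needed to form a dense region. Both require tuning.
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

# Label -1 marks points treated as noise/outliers.
print(db.labels_)
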
4. Model-Based Methods

 Description: These methods assume that the data is generated by a mixture of underlying
probability distributions (e.g., Gaussian). Each cluster is treated as a component of a
mixture model, and the goal is to estimate the parameters of these distributions.
 Examples:
o Gaussian Mixture Model (GMM): Assumes that data points are generated from a
mixture of Gaussian distributions with different means and variances.
o Expectation-Maximization (EM): Iteratively estimates the parameters of the model to
maximize the likelihood of the observed data.

 Application: Useful for probabilistic clustering or when clusters have overlapping boundaries.
 Advantages:
o Provides a probabilistic assignment of data points to clusters.
o Can model complex cluster shapes.

 Limitations:
o Sensitive to initialization.
o Computationally intensive with a large number of clusters.
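
Here is a minimal Gaussian Mixture sketch with scikit-learn, on synthetic data generated for the example; predict_proba exposes the probabilistic (soft) assignment described above.

from sklearn.mixture import GaussianMixture
import numpy as np

# Synthetic data drawn around two centers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# Fit a mixture of two Gaussians; EM runs internally to estimate
# the component means, covariances, and mixing weights.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.predict_proba(X[:3]))  # soft cluster memberships
print(gmm.means_)                # estimated component means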

5. Grid-Based Methods

 Description: The data space is divided into a finite number of grid cells, and clustering is
performed on these cells instead of individual data points.
 Examples:
o CLIQUE (Clustering In QUEst): A combination of grid-based and density-based
approaches.
o STING (Statistical Information Grid): Uses a hierarchical grid structure for clustering.

 Application: Suitable for large datasets with a spatial context.


 Advantages:
o Efficient for high-dimensional data.
o Computationally fast.

 Limitations:
o Resolution of clustering depends on the grid size.
o May not handle arbitrary shapes well.
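
Neither CLIQUE nor STING ships with common Python libraries, so the following is only a toy NumPy sketch of the core grid idea, with invented points and an arbitrary density threshold: divide the data space into cells, count points per cell, and keep the dense cells.

import numpy as np

# Hypothetical two-dimensional points in the unit square.
X = np.array([[0.1, 0.2], [0.15, 0.25], [0.2, 0.1],
              [0.8, 0.9], [0.85, 0.95], [0.5, 0.5]])

# Divide the unit square into a 4x4 grid and count points per cell.
counts, xedges, yedges = np.histogram2d(
    X[:, 0], X[:, 1], bins=4, range=[[0, 1], [0, 1]])

# Treat a cell as "dense" if it holds at least 2 points; a full method
# would then merge adjacent dense cells into clusters.
dense_cells = np.argwhere(counts >= 2)
print(dense_cells)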

Conclusion

Clustering is a powerful tool in Business Analytics for uncovering hidden patterns and
segmenting data into meaningful groups. The choice of clustering method depends on factors
like the shape and density of the data, the presence of noise, and the need for interpretability. The
application of clustering requires careful consideration of criteria such as distance measures,
scalability, and the desired outcome. Each method has its strengths and limitations, making it
essential to match the technique to the specific characteristics of the dataset and the business
problem at hand.
