
UNIT-III

UNSUPERVISED LEARNING: Clustering - Introduction - Mixture Densities - k-Means Clustering -
Expectation-Maximization Algorithm - Mixtures of Latent Variable Models - Supervised Learning
after Clustering - Hierarchical Clustering.

UNSUPERVISED LEARNING:
Introduction:
 Unsupervised learning is a type of machine learning where the algorithm learns from
unlabelled data. This means that the data provided to the algorithm doesn't have any pre-
existing labels or classifications.
 The goal of unsupervised learning is to discover hidden patterns, structures, and
relationships within the data itself.
Working of Unsupervised Learning
The working of unsupervised learning can be understood from the diagram below:

Here, we take unlabeled input data, meaning it is not categorized and corresponding outputs are
not given. This unlabeled input data is fed to the machine learning model in order to train it.
First, the model interprets the raw data to find hidden patterns, and then it applies a suitable
algorithm such as k-means clustering or hierarchical clustering.
Once a suitable algorithm is applied, it divides the data objects into groups according to the
similarities and differences between the objects.
Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no similarities
with the objects of another group. Cluster analysis finds the commonalities between
data objects and categorizes them according to the presence or absence of those
commonalities.
o Centroid-based clustering (e.g., K-means)
o Distribution model-based clustering
o Hierarchical clustering
o Density-based clustering (e.g., DBSCAN)
o Association: An association rule is an unsupervised learning method used for
finding relationships between variables in large databases. It determines the sets
of items that occur together in the dataset. Association rules make marketing
strategies more effective; for example, people who buy item X (say, bread) also
tend to purchase item Y (butter or jam). A typical example of association rule
mining is Market Basket Analysis.
o Apriori Algorithm
o FP-Growth (Frequent Pattern Growth) Algorithm

o Dimensionality Reduction: Reduces the number of features while preserving
essential information (a short code sketch follows this list).
o Examples:
 Principal Component Analysis (PCA)
 t-SNE
 Autoencoders
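As a brief illustration of dimensionality reduction, here is a minimal PCA sketch. This is a
sketch only, assuming scikit-learn and NumPy are installed; the data values are hypothetical.

    import numpy as np
    from sklearn.decomposition import PCA

    # Toy data: 6 samples with 4 features each (hypothetical values).
    X = np.array([[2.5, 2.4, 0.5, 0.7],
                  [0.5, 0.7, 2.2, 2.9],
                  [2.2, 2.9, 0.3, 0.6],
                  [1.9, 2.2, 2.0, 2.6],
                  [3.1, 3.0, 0.1, 0.4],
                  [2.3, 2.7, 1.5, 1.1]])

    # Keep only the 2 directions of highest variance.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (6, 2): same samples, fewer features
    print(pca.explained_variance_ratio_)  # fraction of variance kept per component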

Applications:
 Customer Segmentation: Grouping customers for targeted marketing.
 Recommendation Systems: Suggesting products or content based on user behaviour.
 Anomaly Detection: Identifying fraud, network intrusions, or equipment failures.
 Image and Video Analysis: Grouping similar images or detecting patterns in video.

Unsupervised Learning algorithms:


Below is a list of some popular unsupervised learning algorithms:
o K-means clustering
o DBSCAN (density-based clustering)
o Hierarchical clustering
o Anomaly detection
o Neural networks (e.g., autoencoders)
o Principal Component Analysis
o Independent Component Analysis

Clustering:

 Grouping data points based on their similarity with each other is called clustering
or cluster analysis. This method falls under the branch of unsupervised learning,
which aims at gaining insights from unlabelled data points; that is, unlike
supervised learning, we don't have a target variable.
 Clustering aims at forming groups of homogeneous data points from a
heterogeneous dataset. It evaluates similarity based on a metric like Euclidean
distance, cosine similarity, Manhattan distance, etc., and then groups the points
with the highest similarity scores together (a small code sketch follows).
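To make these metrics concrete, here is a small sketch (assuming NumPy and SciPy are
available) computing the Euclidean distance, Manhattan distance, and cosine similarity
between two hypothetical points:

    import numpy as np
    from scipy.spatial import distance

    # Two hypothetical data points.
    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 0.0, 3.0])

    print(distance.euclidean(a, b))   # sqrt(3^2 + 2^2 + 0^2) ≈ 3.606
    print(distance.cityblock(a, b))   # |1-4| + |2-0| + |3-3| = 5 (Manhattan)
    print(1 - distance.cosine(a, b))  # cosine similarity ≈ 0.695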

For example, in the graph given below, we can clearly see that there are 3 circular clusters
forming on the basis of distance.
 It is not necessary that the clusters formed are circular in shape; the shape of
clusters can be arbitrary, and there are many algorithms that work well at detecting
arbitrarily shaped clusters.
 For example, in the graph given below, we can see that the clusters formed are not circular
in shape.

Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group similar
data points:
Hard Clustering: In this type of clustering, each data point either belongs to a cluster
completely or not. For example, let's say there are 4 data points and we have to cluster
them into 2 clusters; each data point will then belong to either cluster 1 or cluster 2.

Soft Clustering: In this type of clustering, instead of assigning each data point to a
separate cluster, a probability or likelihood of that point belonging to each cluster is
evaluated. For example, let's say there are 4 data points and we have to cluster them into
2 clusters. We then evaluate, for every data point, the probability of it belonging to each
of the two clusters, as in the table below (a code sketch of such soft assignments follows
the table).

Data Point    Probability of C1    Probability of C2
A             0.91                 0.09
B             0.30                 0.70
C             0.17                 0.83
D             1.00                 0.00
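As an illustration of soft assignments, the following minimal sketch (assuming
scikit-learn's GaussianMixture, discussed later under distribution model-based clustering)
prints a probability table like the one above; the data values are hypothetical:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Four 2-D points (hypothetical values) to be softly assigned to 2 clusters.
    X = np.array([[1.0, 1.2], [7.8, 8.1], [8.0, 7.9], [0.9, 1.1]])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

    # Each row gives P(C1) and P(C2) for one data point; rows sum to 1.
    print(np.round(gmm.predict_proba(X), 2))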
Clustering also simplifies working with large datasets: each cluster is given a cluster ID
once clustering is complete, so an entire feature set can be condensed into its cluster ID.
Representing a complicated case with a straightforward cluster ID in this way makes complex
datasets simpler to work with.

Types of Clustering methods


The main clustering methods in Machine learning are:
 Centroid-based Clustering (Partitioning methods)
 Density-based Clustering
 Distribution-based Clustering (Model-based methods)
 Connectivity-based Clustering (Hierarchical clustering)
 Fuzzy Clustering

Centroid-based Clustering (Partitioning methods)

Partitional clustering is a method that divides a dataset into a predetermined number of
non-overlapping clusters, where each data point belongs to only one cluster, aiming to
optimize a specific objective function such as minimizing intra-cluster distance.

Definition:
Partitional clustering algorithms aim to partition a dataset into a set of disjoint clusters,
meaning each data point belongs to only one cluster. It is a type of clustering that divides the
data into non-hierarchical groups. It is also known as the centroid-based method. The most
common example of partitioning clustering is the K-Means Clustering algorithm.

Process:
These algorithms require the analyst to specify the number of clusters (K) beforehand. The
algorithm then iteratively refines the cluster assignments to minimize the distance between
data points and their respective cluster centroids.
Objective Function:
The goal is to find the optimal partitioning of the data that minimizes the within-cluster
variance or maximizes the between-cluster variance.
Popular Algorithms:
K-means: A widely used algorithm that assigns data points to the nearest cluster centroid,
iteratively updating the centroids until convergence.
K-medoids: Similar to K-means, but instead of using centroids, it uses medoids
(representative data points) to define the clusters.
Mini-batch K-means: An efficient variant of K-means that uses mini-batches of data points
to speed up the clustering process.
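As a quick illustration, here is a minimal K-means sketch (assuming scikit-learn; the data
points are hypothetical). Note that the number of clusters K must be specified in advance:

    import numpy as np
    from sklearn.cluster import KMeans

    # Two obvious groups of 2-D points (hypothetical values).
    X = np.array([[1, 2], [1, 4], [1, 0],
                  [10, 2], [10, 4], [10, 0]])

    # K must be chosen in advance; here K = 2.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    print(kmeans.labels_)           # cluster assignment for each point
    print(kmeans.cluster_centers_)  # final centroids after convergence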

Advantages:
Computational Efficiency: Partitional algorithms are generally computationally efficient
and easy to implement.
Suitable for Large Datasets: They can handle large datasets effectively.
Good for Clusters of Similar Shapes and Sizes: They perform well when clusters have
similar shapes and sizes.

Disadvantages:

Requires Predefined Number of Clusters: The analyst needs to specify the number of
clusters (K) in advance, which can be challenging for complex datasets.
Struggles with Clusters of Varying Shapes and Sizes: They may struggle with clusters that
have irregular shapes or sizes.
In this type, the dataset is divided into a set of K groups, where K defines the number of
pre-defined groups. The cluster centers are created in such a way that each data point is
closer to its own cluster centroid than to the centroid of any other cluster.
Density-Based Clustering

The density-based clustering method connects highly dense areas into clusters, so
arbitrarily shaped distributions can be formed as long as the dense regions can be connected.
The algorithm identifies clusters as dense regions in the data space that are separated from
each other by sparser areas.

These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.

DBSCAN is a density-based clustering algorithm that groups data points that are closely
packed together and marks outliers as noise based on their density in the feature space. It
identifies clusters as dense regions in the data space, separated by areas of lower density.

Unlike K-Means or hierarchical clustering, which assume clusters are compact and spherical,
DBSCAN excels in handling real-world data irregularities such as:

Arbitrary-Shaped Clusters: Clusters can take any shape, not just circular or convex.

Noise and Outliers: It effectively identifies and handles noise points without assigning them
to any cluster.
Fig. Density-based clustering
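To show the contrast in practice, here is a minimal DBSCAN sketch (assuming scikit-learn;
the data points are hypothetical). No cluster count is specified, and the label -1 marks
noise points:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two dense groups plus one far-away outlier (hypothetical values).
    X = np.array([[1, 2], [2, 2], [2, 3],
                  [8, 7], [8, 8], [7, 8],
                  [25, 80]])

    # eps: neighborhood radius; min_samples: points needed for a dense region.
    db = DBSCAN(eps=3, min_samples=2).fit(X)

    print(db.labels_)  # e.g., [0 0 0 1 1 1 -1]; -1 means the point is noise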

Distribution Model-Based Clustering

In the distribution model-based clustering method, the data is divided based on the
probability that a data point belongs to a particular distribution. The grouping is done by
assuming certain distributions, most commonly the Gaussian distribution.

The example of this type is the Expectation-Maximization Clustering algorithm that uses
Gaussian Mixture Models (GMM).
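Below is a minimal sketch of this approach (assuming scikit-learn's GaussianMixture, whose
fit() method runs the EM algorithm internally; the generated data are synthetic):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Sample points from two different Gaussians (hypothetical parameters).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(50, 2)),
                   rng.normal(6, 1, size=(50, 2))])

    # fit() runs Expectation-Maximization under the hood.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

    print(gmm.means_)          # estimated mean of each Gaussian component
    print(gmm.predict(X[:5]))  # hard labels derived from soft responsibilities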

Hierarchical Clustering

Hierarchical clustering can be used as an alternative to partitional clustering, as there is
no requirement to pre-specify the number of clusters to be created. In this technique, the
dataset is divided into clusters to create a tree-like structure, which is also called
a dendrogram. Any desired number of clusters can then be obtained by cutting the tree at
the appropriate level. The most common example of this method is the Agglomerative
Hierarchical algorithm.

Fuzzy Clustering

Fuzzy clustering is a type of soft clustering in which a data object may belong to more than
one group or cluster. Each data point has a set of membership coefficients, which depend on
its degree of membership in each cluster. The Fuzzy C-means algorithm is the example of this
type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.

Clustering Algorithms

 The Clustering algorithms can be divided based on their models. There are different
types of clustering algorithms published, but only a few are commonly used.

 The choice of clustering algorithm depends on the kind of data we are using. For
example, some algorithms require the number of clusters in the given dataset to be
specified, whereas others work by finding the minimum distance between observations
of the dataset.

Here we are discussing mainly popular Clustering algorithms that are widely used in machine
learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It classifies the dataset by dividing the samples into different clusters of
equal variances. The number of clusters must be specified in this algorithm. It is fast
with fewer computations required, with the linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the
smooth density of data points. It is an example of a centroid-based model, that
works on updating the candidates for centroid to be the center of the points within a
given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model similar to the
mean-shift, but with some remarkable advantages. In this algorithm, the areas of high
density are separated by the areas of low density. Because of this, the clusters can be
found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as
an alternative to the k-means algorithm, or for those cases where K-means can
fail. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm
performs the bottom-up hierarchical clustering. In this, each data point is treated as a
single cluster at the outset and then successively merged. The cluster hierarchy can be
represented as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms as it does not
require the number of clusters to be specified. In this, pairs of data points exchange
messages until convergence. It has O(N²T) time complexity (for N points and T
iterations), which is the main drawback of this algorithm.

Applications of Clustering

Clustering is a powerful unsupervised machine learning technique with a wide range of
applications across various fields. Here's a breakdown of some key areas where clustering is
utilized:

1. Marketing and Customer Segmentation:

 Identifying customer groups: Businesses use clustering to group customers with
similar purchasing habits, demographics, or behaviors. This allows for targeted
marketing campaigns and personalized experiences.
 Market basket analysis: Analyzing which products are frequently purchased
together to optimize product placement and create targeted promotions.

2. Biology and Medicine:

 Gene sequencing: Clustering helps identify similarities and differences in genetic
data, aiding in disease diagnosis and drug discovery.
 Medical imaging: Clustering can be used to segment medical images, such as MRI or
CT scans, to identify anomalies or diseased areas.
 Patient grouping: Grouping patients with similar symptoms or treatment responses
to improve healthcare delivery.

3. Image Processing:

 Image segmentation: Clustering is used to divide an image into distinct regions
based on color, texture, or other features.
 Object recognition: Clustering can help identify and group similar objects within an
image.

4. Social Network Analysis:

 Identifying communities: Clustering algorithms can discover groups of users with
similar interests or connections within social networks.
 Analyzing social media data: Clustering can help identify trends and patterns in
social media conversations.

5. Anomaly Detection:

 Fraud detection: Clustering can identify unusual patterns in financial transactions,
indicating potential fraud.
 Network intrusion detection: Clustering can detect anomalies in network traffic,
signaling potential cyberattacks.

6. Information Retrieval:
 Search result grouping: Search engines use clustering to group similar search
results, making it easier for users to find relevant information.
 Document clustering: Organizing large collections of documents into thematic
groups.

7. Other Applications:

 Spatial data analysis: Clustering can identify patterns in geographic data, such as
identifying areas with similar climate or population density.
 Data compression: Clustering can reduce the amount of data needed to represent a
dataset by replacing similar data points with a representative value.
 Identifying Fake News: Clustering algorithms can be used to identify patterns in the
spread of misinformation.

Hierarchical clustering:

Hierarchical clustering is a technique used to group similar data points together based on
their similarity, creating a hierarchy or tree-like structure. The key idea is to begin with
each data point as its own separate cluster and then progressively merge or split clusters
based on their similarity.

A dendrogram is like a family tree for clusters. It shows how individual data points or
groups of data merge together. The bottom shows each data point as its own group, and as
you move up, similar groups are combined. The lower the merge point, the more similar the
groups are. It helps you see how things are grouped step by step.

The working of the dendrogram can be explained using the below diagram:
Types of Hierarchical Clustering

Now that we understand the basics of hierarchical clustering, let’s explore the two main types
of hierarchical clustering.

1. Agglomerative Clustering

2. Divisive clustering

Hierarchical Agglomerative Clustering

It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC).
Unlike flat clustering, hierarchical clustering provides a structured way to group data.
This clustering algorithm does not require us to prespecify the number of clusters.
Bottom-up algorithms treat each data point as a singleton cluster at the outset and then
successively agglomerate pairs of clusters until all clusters have been merged into a
single cluster that contains all the data.

Workflow for Hierarchical Agglomerative clustering


1. Start with individual points: Each data point is its own cluster. For example, if
you have 5 data points, you start with 5 clusters, each containing just one data point.

2. Calculate distances between clusters: Calculate the distance between every pair of
clusters. Initially, since each cluster has one point, this is the distance between
the two data points.

3. Merge the closest clusters: Identify the two clusters with the smallest distance and
merge them into a single cluster.

4. Update the distance matrix: After merging, you now have one less cluster. Recalculate
the distances between the new cluster and the remaining clusters.

5. Repeat steps 3 and 4: Keep merging the closest clusters and updating the distance
matrix until only one cluster is left.

6. Create a dendrogram: As the process continues, you can visualize the merging of
clusters using a tree-like diagram called a dendrogram. It shows the hierarchy of how
clusters are merged (a code sketch of this workflow follows the list).
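Here is a minimal sketch of this workflow (assuming SciPy and Matplotlib are available; the
data points are hypothetical). linkage() performs the iterative merging and dendrogram()
draws the tree:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Five 2-D data points (hypothetical values); each starts as its own cluster.
    X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 9]])

    # 'single' linkage repeatedly merges the two clusters with the smallest
    # minimum pairwise distance; Z records the sequence of merges.
    Z = linkage(X, method='single')

    dendrogram(Z)  # tree-like diagram of the successive merges
    plt.xlabel('Data point index')
    plt.ylabel('Merge distance')
    plt.show()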

Hierarchical Divisive clustering

It is also known as the top-down approach. This algorithm also does not require us to
prespecify the number of clusters. Top-down clustering requires a method for splitting a
cluster that contains the whole data, and it proceeds by splitting clusters recursively
until individual data points have been split into singleton clusters.

Workflow for Hierarchical Divisive clustering :

1. Start with all data points in one cluster: Treat the entire dataset as a single large
cluster.

2. Split the cluster: Divide the cluster into two smaller clusters. The division is typically
done by finding the two most dissimilar points in the cluster and using them to
separate the data into two parts.

3. Repeat the process: For each of the new clusters, repeat the splitting process:

1. Choose the cluster with the most dissimilar points.


2. Split it again into two smaller clusters.

4. Stop when each data point is in its own cluster: Continue this process until every
data point is its own cluster, or the stopping condition (such as a predefined number of
clusters) is met.
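Divisive clustering is less common in standard libraries; one practical approximation of
this top-down splitting is bisecting K-means. The following is a sketch only, assuming
scikit-learn version 1.1 or later (which provides BisectingKMeans); the data are
hypothetical:

    import numpy as np
    from sklearn.cluster import BisectingKMeans

    X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 9], [9.5, 9]])

    # Starts with all points in one cluster and recursively splits a chosen
    # cluster in two until n_clusters clusters remain.
    model = BisectingKMeans(n_clusters=3, random_state=0).fit(X)
    print(model.labels_)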

The various types of linkages describe distinct methods for measuring the distance between
two sub-clusters of data points, influencing the overall clustering outcome.

1. Single Linkage:

For two clusters R and S, the single linkage returns the minimum distance between two points
i and j such that i belongs to R and j belongs to S.
2. Complete Linkage:

For two clusters R and S, the complete linkage returns the maximum distance between two
points i and j such that i belongs to R and j belongs to S.

3. Average Linkage:

For two clusters R and S, first the distance between each data point i in R and each data
point j in S is computed, and then the arithmetic mean of these distances is calculated.
Average Linkage returns this value of the arithmetic mean.
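In symbols: single linkage uses d(R, S) = min over i in R, j in S of d(i, j); complete
linkage uses the corresponding max; and average linkage uses the mean of all pairwise
distances. As a sketch (assuming SciPy, with the same hypothetical points as before), these
criteria map directly onto the method argument of linkage():

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 9]])

    # The 'method' argument selects the linkage criterion described above.
    Z_single   = linkage(X, method='single')    # minimum pairwise distance
    Z_complete = linkage(X, method='complete')  # maximum pairwise distance
    Z_average  = linkage(X, method='average')   # mean of all pairwise distances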
