Module 4
UNSUPERVISED LEARNING
Principal Components Analysis
K-means Clustering
Hierarchical Clustering
Gaussian Mixture Models
Expectation Maximization (EM) Algorithm
Principal Components Analysis
Principal Component Analysis is abbreviated as PCA.
PCA falls under the unsupervised machine learning category.
The main goal of PCA is to reduce the number of variables in a dataset.
In machine learning, principal component analysis is mainly used for
dimensionality reduction and for selecting important features.
Working with high-dimensional data can cause overfitting issues.
The principal components PCA produces are uncorrelated (independent of
each other).
Dimensionality Reduction in a Real-Time Application:
Assume a survey contains 50 questions. The following three are among
them; respondents rate each on a scale of 1 to 5:
1. I feel comfortable around people
2. I easily make friends
3. I like going out
These three questions all measure whether a person is an introvert or an
extrovert, so their answers are highly correlated. Rather than keeping
all three as separate variables, PCA can combine them into a single
"extroversion" component, reducing the dimensionality of the survey data,
as the sketch below shows.
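A minimal sketch of this reduction using scikit-learn's PCA; the ratings matrix is hypothetical data invented for illustration.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical 1-5 ratings from six respondents on the three questions above;
# the columns are correlated because they all reflect extroversion.
X = np.array([
    [5, 4, 5],
    [4, 5, 4],
    [2, 1, 2],
    [1, 2, 1],
    [5, 5, 4],
    [2, 2, 1],
])

# Standardize first: PCA is sensitive to the scale of each feature.
X_scaled = StandardScaler().fit_transform(X)

# Reduce the three correlated questions to a single principal component.
pca = PCA(n_components=1)
scores = pca.fit_transform(X_scaled)

print(scores.ravel())                 # one "extroversion" score per respondent
print(pca.explained_variance_ratio_)  # fraction of variance the component keeps

Because the three questions move together, the first component typically captures most of their variance, so one derived variable can stand in for all three.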
Intuition behind PCA:
Find the tallest person: to pick out the tallest person in a group, you
only need one measurement, height; every other attribute can be ignored.
PCA generalizes this idea by finding the directions along which the data
varies the most and keeping only those.
K-means Clustering
Consider a bank that wants to tailor its services to its customers. The
bank can potentially have millions of customers. Does it make sense to
look at the details of each customer separately and then make a decision?
Certainly not! That would be a manual process and would take a huge
amount of time.
So what can the bank do? One option is to segment its customers into
different groups. For instance, the bank can group the customers based
on their income:
• Clustering is the process of dividing the entire data into groups
(also known as clusters) based on the patterns in the data.
Applications:
1. Document Classification
2. Customer Segmentation
3. Cyber Profiling
4. Image Segmentation
5. Fraud detection in banking and insurance
Advantages of K-means
1. Simple and easy to implement: The k-means algorithm is easy to
understand and implement, making it a popular choice for clustering tasks.
2. Fast and efficient: K-means is computationally efficient and can handle
large datasets with high dimensionality.
3. Scalability: K-means can handle large datasets with many data points and
can be easily scaled to handle even larger datasets.
4. Flexibility: K-means can be easily adapted to different applications and
can be used with different distance metrics and initialization methods.
Disadvantages of K-means
1. Sensitivity to initial centroids: K-means is sensitive to the initial
selection of centroids and can converge to a suboptimal solution (the
sketch after this list mitigates this with k-means++ seeding).
2. Requires specifying the number of clusters: The number of clusters k
needs to be specified before running the algorithm, which can be
challenging in some applications.
3. Sensitive to outliers: K-means is sensitive to outliers, which can have
a significant impact on the resulting clusters.
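A minimal sketch of k-means on the bank example using scikit-learn; the customer records are hypothetical values invented for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [annual income (thousands), age].
X = np.array([
    [15, 25], [16, 27], [18, 24],     # low income
    [60, 40], [62, 42], [58, 38],     # medium income
    [120, 50], [125, 55], [118, 48],  # high income
])

# k must be chosen up front (a k-means limitation). k-means++ seeding
# reduces sensitivity to the initial centroids, and n_init reruns the
# algorithm several times and keeps the best result.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster assignment for each customer
print(kmeans.cluster_centers_)  # centroid of each income group

In practice the features should be scaled first (income and age are on very different scales); the raw values are kept here only to keep the sketch short.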
Hierarchical Clustering
Hierarchical clustering is a technique used to group similar data points
together based on their similarity, creating a hierarchy or tree-like
structure.
The key idea is to begin with each data point as its own separate cluster
and then progressively merge or split them based on their similarity.
E.g.:
Imagine you have four fruits with different weights: an apple (100 g), a
banana (120 g), a cherry (50 g), and a grape (30 g). Hierarchical
clustering starts by treating each fruit as its own group, then merges
the closest pairs by weight (grape with cherry, and apple with banana,
each 20 g apart), and finally joins the two groups into one.
Getting Started with Dendrograms
• A dendrogram is like a family tree for clusters.
• It shows how individual data points or groups of data merge together.
• The bottom shows each data point as its own group, and as you move
up, similar groups are combined.
• It helps you see how things are grouped step by step.
• At the bottom of the dendrogram, the points P, Q, R, S, and T are all
separate.
• As you move up, the closest points are merged into a single group.
• The lines connecting the points show how they are progressively merged
based on similarity.
• The height at which they are connected shows how similar the points are
to each other; the shorter the line, the more similar they are.
There are two types of hierarchical clustering:
1. Agglomerative clustering: bottom-up approach
2. Divisive clustering: top-down approach
Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative
clustering (HAC).
Workflow for Hierarchical Agglomerative Clustering:
1. Start with individual points: Each data point is its own cluster. For
example, if you have 5 data points, you start with 5 clusters, each
containing just one data point.
2. Calculate distances between clusters: Compute the distance between every
pair of clusters. Initially, since each cluster has one point, this is the
distance between two data points.
3. Merge the closest clusters: Identify the two clusters with the smallest
distance and merge them into a single cluster.
4. Update the distance matrix: After merging, you now have one less cluster.
Recalculate the distances between the new cluster and the remaining clusters.
5. Repeat steps 3 and 4: Keep merging the closest clusters and updating the
distance matrix until only one cluster is left.
6. Create a dendrogram: As the process continues, you can visualize the
merging of clusters using a tree-like diagram called a dendrogram. It shows
the hierarchy of how clusters are merged (see the sketch after this list).
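A minimal sketch of this workflow on the fruit-weight example, using SciPy's agglomerative (single-linkage) clustering and dendrogram plotting.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# The fruit-weight example from above: each fruit starts as its own cluster.
weights = np.array([[100], [120], [50], [30]])  # apple, banana, cherry, grape
names = ["apple", "banana", "cherry", "grape"]

# Single linkage repeatedly merges the two closest clusters (steps 1-5);
# Z records each merge and the distance at which it happened.
Z = linkage(weights, method="single", metric="euclidean")

# Step 6: visualize the merge order as a dendrogram.
dendrogram(Z, labels=names)
plt.ylabel("Merge distance (g)")
plt.show()

Grape and cherry (20 g apart) and apple and banana (also 20 g apart) are merged first, and the two resulting groups are joined last, exactly the hierarchy described above.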
Gaussian Mixture Models
K-means clustering vs. GMM (probabilistic approach):
1. K-means assigns each data point to exactly one cluster, while a GMM
assigns data points soft memberships across a mixture of multiple clusters
(Gaussian distributions).
2. K-means can only detect spherical or circular cluster shapes, while a
GMM can group data within elliptical or oval shapes.