
MODULE-4

UNSUPERVISED LEARNING
 Principal Components Analysis
 K-means Clustering
 Hierarchical Clustering
 Gaussian Mixture Models
 Expectation Maximization (EM) algorithm
Principal Components Analysis
 Principal Component Analysis can be abbreviated as PCA
 PCA comes under the Unsupervised Machine Learning category
 The main goal of PCA is to reduce the number of variables in a data
collection.
 Principal component analysis in machine learning can be mainly used
for Dimensionality Reduction and important feature selection.
 Working with high-dimensional data can cause overfitting issues.
 After PCA, the resulting principal components are uncorrelated (independent of each other).
Dimensionality Reduction Work in Real-Time Application:
 Assume a survey contains 50 questions, three of which are shown below.
Respondents rate each statement on a scale from 1 to 5:
1. I feel comfortable around people
2. I easily make friends
3. I like going out
These three questions measure the same underlying trait, namely whether a person
is an introvert or an extrovert, so they could be reduced to a single variable.
Intuition behind PCA:
A mind game: find the tallest person in each table.

Table 1 (answer: C)        Table 2 (answer: ?)
Person   Height            Person   Height
A        145                A        172
B        160                B        173
C        185                C        171

In Table 1 the heights vary widely, so the answer is obvious; in Table 2 they barely
differ, so this dimension tells us almost nothing. PCA keeps the directions along
which the data varies the most and discards the rest.
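This intuition can be checked numerically; a minimal sketch (not from the original slides) comparing the variance of the two height columns:

import numpy as np

# Heights from the two tables above
table1 = np.array([145, 160, 185])   # large spread: the tallest person is obvious
table2 = np.array([172, 173, 171])   # tiny spread: the tallest person is unclear

print(np.var(table1))   # ~272.2 -> high variance, this dimension carries information
print(np.var(table2))   # ~0.67  -> low variance, this dimension carries little information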
Basic Terminologies of PCA in Machine Learning:
 Variance: measures how much the data varies along each dimension of the data space.
 Covariance: measures the dependency and relationship between pairs of features.
 Standardizing data: scaling the dataset (mean 0, variance 1) so that no single
feature dominates the result.
Covariance matrix: captures the interdependencies between the features or variables;
PCA uses it to identify and remove redundant dependencies and improve performance.
Eigenvalues and Eigenvectors: the eigenvectors of the covariance matrix point along the
directions of largest variance in the dataset and are used to calculate the Principal Components.
 Each eigenvalue gives the magnitude (the amount of variance) associated with its eigenvector.
 The eigenvalue indicates how much variance lies in a particular direction, whereas the
eigenvector is that direction itself: the underlying transformation only stretches or
shrinks it without altering its direction.
EigenValues and EigenVectors: CONTD……
PCA, Eigenvalues:
 In this shear mapping, the blue arrow changes direction, whereas the
pink arrow does not. In this instance, the pink arrow is an eigenvector
because of its constant orientation. The length of this arrow is also
unaltered, and its eigenvalue is 1.
 PC is a straight line that captures the data’s maximum variance
(information).
 PC shows direction and magnitude.
 PCs are perpendicular to each other.
 Dimensionality Reduction: project the data onto the selected eigenvectors, i.e.
multiply the transpose of the derived feature (eigenvector) matrix by the transpose of
the standardized data.
 This reduces the number of features while losing as little information as possible.
Advantages of PCA in ML:
1. Used for Dimensionality Reduction
2. PCA helps eliminate correlated (related) features, a problem sometimes called
multi-collinearity.
3. The time required to train your model is substantially shorter because PCA
has reduced the number of features.
4. PCA aids in overcoming overfitting by eliminating the extraneous features
from your dataset.
The steps involved for PCA in ML:
1. Original Data
2. Normalize the original data (mean =0, variance =1)
3. Calculating covariance matrix
4. Calculating Eigen values, Eigen vectors, and normalized Eigenvectors
5. Calculating Principal Component (PC)
6. Plot the graph for orthogonality between PCs
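A minimal NumPy sketch of these six steps (an illustrative from-scratch version, not code from the original slides; the toy data set is made up):

import numpy as np

# 1. Original data: rows are samples, columns are features
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 2. Normalize: mean 0, variance 1 per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 3. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# 4. Eigenvalues and (already normalized) eigenvectors, sorted by decreasing variance
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Principal components: project the data onto the top-k eigenvectors
k = 1
X_pca = X_std @ eigvecs[:, :k]

# 6. eigvals shows how much variance each PC captures; the eigenvectors are
#    orthogonal to each other, which is what the orthogonality plot illustrates
print(eigvals)
print(X_pca)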
Disadvantages of PCA in ML:
1. Useful for quantitative data but not effective with qualitative data.
2. Principal components are difficult to interpret in terms of the original features.
Application for Principal Component Analysis (PCA):
1. Computer Vision
2. Bio-informatics application
3. For compressed images or resizing of the image
4. Discovering patterns from high-dimensional data
5. Reduction of dimensions
6. Multidimensional Data – Visualization
K-means Clustering
• K-means is a centroid-based (distance-based) algorithm.
• K-means clustering is a popular unsupervised machine learning
algorithm.
• It is used for partitioning a dataset into a pre-defined number of clusters.
• The goal is to group similar data points together and discover
underlying patterns or structures within the data.
• The main objective of the K-Means algorithm is to minimize the sum of
distances between the points and their respective cluster centroid.
• Optimization plays a crucial role in the k-means clustering algorithm.
• The goal of the optimization process is to find the best set of centroids
that minimizes the sum of squared distances between each data point
and its closest centroid.
How K-Means Clustering Works?
1.Initialization: Select K centroids from data points randomly.
2.Assignment: (i) For each data point in the dataset, calculate the distance
between that point and each of the K centroids.
(ii) Assign the data point to the cluster whose centroid is closest to it.
3. Update centroids: Calculate the centroids of the clusters by taking the
mean of all data points assigned to each cluster.
CONTD…
4. Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs
when the centroids no longer change significantly or when a specified
number of iterations is reached.
5. Final Result: Once convergence is achieved, the algorithm gives the final
cluster centroids.
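A compact NumPy sketch of these steps (an illustrative from-scratch version using Euclidean distance; the function name and toy data are made up):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k data points at random as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update centroids: mean of the points assigned to each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until convergence: stop when the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 5. Final result: cluster centroids and the assignment of each point
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)

In practice, scikit-learn's KMeans adds refinements such as k-means++ initialization, which mitigates the sensitivity to initial centroids discussed later.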
• Objective of k means Clustering
• The main objective of k-means clustering is to partition your data into a
specific number (k) of groups.
• The data points within each group are similar to one another and dissimilar to points in
other groups.
• It achieves this by minimizing the distance between data points and their
centroid.

• Objective of k means Clustering CONTD….
•Grouping similar data points: K-means aims to identify patterns in your data
by grouping data points that share similar characteristics together. This allows
you to discover underlying structures within the data.
•Minimizing within-cluster distance: The algorithm strives to make sure data
points within a cluster are as close as possible to each other, as measured by a
distance metric (usually Euclidean distance).
•Maximizing between-cluster distance: Conversely, k-means also tries to
maximize the separation between clusters.
Eg:
A bank wants to give credit card offers to its customers. Currently, they
look at the details of each customer and, based on this information,
decide which offer should be given to which customer.

Now, the bank can potentially have millions of customers. Does it make
sense to look at the details of each customer separately and then make a
decision? Certainly not! It is a manual process and will take a huge
amount of time.

So what can the bank do? One option is to segment its customers into
different groups. For instance, the bank can group the customers based
on their income:
• Clustering is the process of dividing the entire data into groups
(also known as clusters) based on the patterns in the data.
Applications:
1. Document Classification
2. Customer Segmentation
3. Cyber Profiling
4. Image Segmentation
5. Fraud detection in banking and insurance
Advantages of K-means
1.Simple and easy to implement: The k-means algorithm is easy to
understand and implement, making it a popular choice for clustering tasks.
2.Fast and efficient: K-means is computationally efficient and can handle
large datasets with high dimensionality.
3.Scalability: K-means can handle large datasets with many data points and
can be easily scaled to handle even larger datasets.
4.Flexibility: K-means can be easily adapted to different applications and
can be used with varying metrics of distance and initialization methods.
Disadvantages of K-Means
1.Sensitivity to initial centroids: K-means is sensitive to the initial
selection of centroids and can converge to a suboptimal solution.
2.Requires specifying the number of clusters: The number of clusters k
needs to be specified before running the algorithm, which can be
challenging in some applications.
3.Sensitive to outliers: K-means is sensitive to outliers, which can have
a significant impact on the resulting clusters.
Hierarchical Clustering
 Hierarchical clustering is a technique used to group similar data points
together based on their similarity, creating a hierarchy or tree-like
structure.
 The key idea is to begin with each data point as its own separate cluster
and then progressively merge or split them based on their similarity.
 Eg:
 Imagine you have four fruits with different weights: an apple (100g), a
banana (120g), a cherry (50g), and a grape (30g). Hierarchical
clustering starts by treating each fruit as its own group.
Getting Started with Dendrograms
• A dendrogram is like a family tree for clusters.
• It shows how individual data points or groups of data merge together.
• The bottom shows each data point as its own group, and as you move
up, similar groups are combined.
• It helps you see how things are grouped step by step.
•At the bottom of the dendrogram, the points P, Q, R, S, and T are all
separate.
•As you move up, the closest points are merged into a single group.
•The lines connecting the points show how they are progressively merged
based on similarity.
•The height at which they are connected shows how similar the points are
to each other; the shorter the line, the more similar they are
Types of Hierarchical Clustering:
1.Agglomerative Clustering: bottom-up approach
2.Divisive Clustering: top-down approach
Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative
clustering (HAC).
Workflow for Hierarchical Agglomerative clustering :
1.Start with individual points: Each data point is its own cluster. For example,
if you have 5 data points, you start with 5 clusters, each containing just one data
point.
2.Calculate distances between clusters: Calculate the distance between every
pair of clusters. Initially, since each cluster has only one point, this is simply the
distance between the two data points.
3.Merge the closest clusters: Identify the two clusters with the smallest
distance and merge them into a single cluster.
Workflow for Hierarchical Agglomerative clustering CONTD…
4.Update distance matrix: After merging, you now have one fewer cluster.
Recalculate the distances between the new cluster and the remaining clusters.
5.Repeat steps 3 and 4: Keep merging the closest clusters and updating the
distance matrix until you have only one cluster left.
6.Create a dendrogram: As the process continues you can visualize the
merging of clusters using a tree-like diagram called a dendrogram. It shows the
hierarchy of how clusters are merged.
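In practice this workflow is usually run with SciPy's hierarchical-clustering utilities; a small sketch reusing the four fruit weights from the earlier example (the linkage method and cluster count are illustrative choices):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Weights of the four fruits from the earlier example
weights = np.array([[100.0], [120.0], [50.0], [30.0]])   # apple, banana, cherry, grape

# Agglomerative clustering: 'single' linkage repeatedly merges the closest clusters
Z = linkage(weights, method='single', metric='euclidean')

# Cut the tree into 2 clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)    # e.g. [2 2 1 1]: {apple, banana} vs {cherry, grape}

# Dendrogram: the tree-like diagram showing the order of merges
dendrogram(Z, labels=['apple', 'banana', 'cherry', 'grape'])
plt.show()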
Gaussian Mixture Models
K-means clustering vs GMM (probabilistic approach):
1. K-means clustering assigns each data point to exactly one cluster; GMM assigns data
to a mixture of multiple clusters (Gaussian distributions).
2. K-means can only detect spherical or circular cluster shapes; GMM can group data
within elliptical or oval shapes.
3. K-means cannot handle complex data distributions; GMM can handle highly complex
data distributions and provides probabilities of cluster membership.
4. K-means performs hard clustering; GMM performs soft clustering.
Gaussian Mixture Model vs K-Means
•A Gaussian mixture model is a soft clustering technique used in
unsupervised learning to determine the probability that a given data point
belongs to a cluster.
•It’s composed of several Gaussians, each identified by k ∈ {1,…, K},
where K is the number of clusters in the data set; each Gaussian is described by
its own set of parameters (mean, covariance, and mixing probability, described later).
•K-means is a clustering algorithm that assigns each data point to one cluster
based on the closest centroid.
Gaussian Mixture Model vs K-Means
•K-means is a hard clustering method, meaning each point belongs to only one
cluster with no uncertainty.
On the other hand, Gaussian Mixture Models (GMM) use soft clustering,
where data points can belong to multiple clusters with a certain probability.
This provides a more flexible and nuanced way to handle clusters, especially
when points are close to multiple centroids.
How Gaussian Mixture Models Work?
 Unlike K-Means, where the clustering process relies solely on the centroid
and assigns each data point to one cluster, GMM uses a probabilistic
approach.
 GMM performs clustering:
1. Multiple Gaussians (Clusters): Each cluster is represented by a Gaussian
distribution, and the data points are assigned probabilities of belonging to
different clusters based on their distance from each Gaussian.
GMM performs clustering:
2. Parameters of a Gaussian: The core of GMM is made up of three main
parameters for each Gaussian:
i) Mean (μ): The center of the Gaussian distribution.
ii) Covariance (Σ): Describes the spread or shape of the cluster.
iii) Mixing Probability (π): Determines how dominant or likely each cluster
is in the data.
The Expectation-Maximization (EM) Algorithm
 To fit a Gaussian Mixture Model to the data, we use the Expectation-
Maximization (EM) algorithm.
 It is an iterative method that optimizes the parameters of the Gaussian
distributions (mean, covariance, and mixing coefficients).
 It works in two main steps:
1. Expectation Step (E-step):
2. Maximization Step (M-step):
Expectation Step (E-step): (probability calculation)
In this step, the algorithm calculates the probability that each data point
belongs to each cluster based on the current parameter estimates (mean,
covariance, mixing coefficients).
2.Maximization Step (M-step): (updating the parameters)
After estimating the probabilities, the algorithm updates the parameters
(mean, covariance, and mixing coefficients) to better fit the data.
These two steps are repeated until the model converges, meaning the
parameters no longer change significantly between iterations.
How GMM Works
1.Initialization: Start with initial guesses for the means, covariances, and
mixing coefficients of each Gaussian distribution.
2. E-step: For each data point, calculate the probability of it belonging to each
Gaussian distribution (cluster).
3. M-step: Update the parameters (means, covariances, mixing coefficients)
using the probabilities calculated in the E-step.
4. Repeat: Continue alternating between the E-step and M-step until the log-
likelihood of the data (a measure of how well the model fits the data) converges.
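A short usage sketch with scikit-learn's GaussianMixture, which runs this E-step/M-step loop internally (the toy data and parameter choices are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

# Two overlapping blobs of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
               rng.normal(loc=3.0, scale=1.5, size=(100, 2))])

# Fit a 2-component GMM; fit() alternates E-steps and M-steps until convergence
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(X)

print(gmm.means_)        # mean (mu) of each Gaussian
print(gmm.covariances_)  # covariance (Sigma) of each Gaussian
print(gmm.weights_)      # mixing probabilities (pi), which sum to 1

# Soft clustering: probability that the first point belongs to each cluster
print(gmm.predict_proba(X[:1]))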
GMM formula: P(x) = Σ_{i=1}^{K} πi N(x | μi, Σi)

where:
K is the number of Gaussian components.
πi is the mixing probability of the i-th component, with Σ_{i=1}^{K} πi = 1 and 0 ≤ πi ≤ 1.
N(x | μi, Σi) is the probability density function (PDF) of the i-th Gaussian component.
μi is the mean vector of the i-th Gaussian.
Σi is the covariance matrix of the i-th Gaussian.
d is the dimensionality of the data.
The E-step computes the probabilities that each data point belongs to each
Gaussian, while the M-step updates the parameters μi, Σi, and πi based on
these probabilities.
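The mixture density above can be evaluated directly; a minimal SciPy sketch (the component parameters below are made up purely for illustration):

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters of a 2-component, 2-dimensional mixture
pis    = [0.6, 0.4]                                    # mixing probabilities, sum to 1
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # means
sigmas = [np.eye(2), 2.0 * np.eye(2)]                  # covariance matrices

def mixture_pdf(x):
    # P(x) = sum over i of pi_i * N(x | mu_i, Sigma_i)
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=sigma)
               for pi, mu, sigma in zip(pis, mus, sigmas))

print(mixture_pdf(np.array([1.0, 1.0])))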
Advantages of Gaussian Mixture Models (GMM)
1. Flexible Cluster Shapes: Unlike K-Means, which assumes spherical
clusters, GMM can model clusters with arbitrary shapes.
2. Soft Assignment: GMM assigns a probability for each data point to belong
to each cluster, while K-Means assigns each point to exactly one cluster.
3. Handles Overlapping Data: GMM performs well when clusters overlap or
have varying densities. Since it uses probability distributions, it can assign a
point to multiple clusters with different probabilities
Limitations of GMM
1. Computational Complexity: GMM tends to be computationally
expensive, particularly with large datasets.
2. Choosing the Number of Clusters: Like other clustering methods, GMM
requires you to specify the number of clusters beforehand.
Expectation Maximization Algorithm
• The Expectation-Maximization (EM) algorithm is an iterative method
used in unsupervised machine learning to estimate unknown parameters in
statistical models.
• It helps find the best values for unknown parameters, especially when some
data is missing or hidden.
• It works in two steps:
• E-step (Expectation Step): Estimates missing or hidden values using
current parameter estimates.
• M-step (Maximization Step): Updates model parameters to maximize the
likelihood based on the estimated values from the E-step.
• This process repeats until the model reaches a stable solution, improving
accuracy with each iteration.
• EM is widely used in clustering (e.g., Gaussian Mixture Models) and
handling missing data
• APPLICATIONS:
• Machine Learning, Computer Vision, and Natural Language
Processing.
Key Terms in the Expectation-Maximization (EM) Algorithm:
1. Latent Variables: These are hidden or unmeasured variables that affect
what we can observe in the data. We can’t directly see them, but we can
make educated guesses about them based on the data we can see.
2. Likelihood: The probability of the observed data given the model parameters;
the EM algorithm tries to find the parameters that make the data most likely.
3. Log-Likelihood: This is just the natural log of the likelihood function. It’s
used to make calculations easier. The EM algorithm tries to maximize the
log-likelihood to improve the model fit.
4. Maximum Likelihood Estimation (MLE): This is a technique for
estimating the parameters of a model. It does this by finding the parameter
values that make the observed data most likely (maximizing the likelihood).
5. Posterior Probability: In Bayesian methods, this is the probability of the
parameters, given both prior knowledge and the observed data. In EM, it
helps estimate the “best” parameters when there’s uncertainty about the data.
6. Expectation (E) Step: In this step, the algorithm estimates the missing or
hidden information (latent variables) based on the observed data and current
parameters. It calculates probabilities for the hidden values given what we
can see.
7. Maximization (M) Step: This step updates the parameters by finding the
values that maximize the likelihood, based on the estimates from the E-step.
It often involves running optimization methods to get the best parameters.
8. Convergence: Convergence happens when the algorithm has reached a
stable point. This is checked by seeing if the changes in the model’s
parameters or the log-likelihood are small enough to stop the process.
Working flow of EM algorithm:
EM Algorithm Flowchart:
1.Initialization:
The algorithm starts with initial parameter values and assumes the observed
data comes from a specific model.
•E-Step (Expectation Step):
• Estimate the missing or hidden data based on the current parameters.
• Calculate the posterior probability (responsibility) of each latent
variable given the observed data.
• Compute the log-likelihood of the observed data using the current
parameter estimates.
EM Algorithm Flowchart: CONTD…
•M-Step (Maximization Step):
• Update the model parameters by maximizing the log-likelihood computed
in the E-step.
• This involves solving an optimization problem to find parameter values
that improve the model fit.
•Convergence:
• Check if the model parameters are stable (converging).
• If the changes in log-likelihood or parameters are below a set threshold,
stop. If not, repeat the E-step and M-step until convergence is reached
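A from-scratch sketch of this loop for a one-dimensional, two-component Gaussian mixture (a minimal illustration of the flowchart, not production code; the initial guesses and convergence threshold are arbitrary):

import numpy as np
from scipy.stats import norm

# Observed 1-D data drawn from two hidden groups
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.5, 300)])

# Initialization: rough guesses for the means, standard deviations and mixing weights
mu, sigma, pi = np.array([0.0, 3.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

prev_ll = -np.inf
for _ in range(200):
    # E-step: responsibility (posterior probability) of each component for each point
    dens = pi * norm.pdf(x[:, None], mu, sigma)          # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: update the parameters to maximize the expected log-likelihood
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)

    # Convergence check: stop once the log-likelihood barely changes
    ll = np.log(dens.sum(axis=1)).sum()
    if abs(ll - prev_ll) < 1e-6:
        break
    prev_ll = ll

print(mu, sigma, pi)   # should recover roughly means (0, 5), stds (1, 1.5), weights (0.4, 0.6)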
Advantages of EM algorithm:
1. Always improves results – With each step, the algorithm improves the
likelihood (chances) of finding a good solution.
2. Simple to implement – The two steps (E-step and M-step) are often
easy to code for many problems.
3. Quick math solutions – In many cases, the M-step has a direct
mathematical solution (closed-form), making it efficient
Disadvantages of EM algorithm:
1. Takes time to finish – It converges slowly, meaning it may take many
iterations to reach the best solution.
2. Gets stuck in local best – Instead of finding the absolute best solution, it
might settle for a “good enough” one.
3. Needs extra probabilities – Unlike some optimization methods that only
need forward probability, EM requires both forward and backward
probabilities, making it slightly more complex.
Conclusion
The EM algorithm iteratively estimates missing data and updates
model parameters to improve accuracy. By alternating between the E-
step and M-step, it refines the model until it converges, making it a
powerful tool for handling hidden or incomplete data in machine
learning.
THANK YOU
