Unit5 CSM ML
Clustering:
Clustering is like sorting a bunch of similar items into different groups based on their
characteristics. In data mining and machine learning, it’s a powerful technique used to group
similar data points together, making it easier to find patterns or understand large datasets.
Essentially, clustering helps identify natural groupings in your data.
Broadly speaking, there are 2 types of clustering that can be performed to group similar
data points:
Hard Clustering: In this type of clustering, each data point belongs to a cluster completely or not at all. For example, let's say there are 4 data points and we have to cluster them into 2 clusters. So each data point will either belong to cluster 1 or cluster 2.
Data Point    Cluster
A             C1
B             C2
C             C2
D             C1
Soft Clustering: In this type of clustering, instead of assigning each data point to a single cluster, a probability or likelihood of that point belonging to each cluster is evaluated. For example, let's say there are 4 data points and we have to cluster them into 2 clusters. So we will evaluate the probability of each data point belonging to each of the two clusters. This probability is calculated for all data points.
Data Point    Probability of C1    Probability of C2
A             0.91                 0.09
B             0.3                  0.7
C             0.17                 0.83
D             1                    0
The main clustering approaches and their representative algorithms are:
Hierarchical clustering: Agglomerative clustering, Divisive clustering
Partitioning clustering: K-Means clustering
Density-based clustering: DBSCAN
At the surface level, clustering helps in the analysis of unstructured data. Graphing,
the shortest distance, and the density of the data points are a few of the elements that
influence cluster formation. Clustering is the process of determining how related the
objects are based on a metric called the similarity measure.
Similarity measures are easier to define when the number of features is small and become harder to define as the number of features increases. Depending on the type of clustering algorithm being utilized, several techniques are employed to group the data from the datasets. In this part, the clustering techniques are described. The main types of clustering algorithms are:
Partitional clustering divides data into distinct non-overlapping subsets or clusters. The most
common algorithm is K-Means, which assigns each data point to the nearest cluster center,
iteratively updating centers to minimize intra-cluster variance.
To see how K-Means works, let's say we want to have 2 clusters, so k is equal to 2 here. We first randomly select a centroid for each cluster:
Here, the red and green circles represent the centroids for these clusters.
Here you can see that the points closer to the red point are assigned to the red cluster,
whereas the points closer to the green point are assigned to the green cluster.
Now, once we have assigned all of the points to either cluster, the next step is to compute the
centroids of newly formed clusters:
Here, the red and green crosses are the new centroids.
The step of computing the centroid and assigning all the points to the cluster based on their
distance from the centroid is a single iteration. But wait – when should we stop this process?
It can’t run till eternity, right?
There are essentially three stopping criteria that can be adopted to stop the K-means
algorithm:
1. Centroids of newly formed clusters do not change
2. Points remain in the same cluster
3. Maximum number of iterations is reached
We can stop the algorithm if the centroids of newly formed clusters are not changing. Even
after multiple iterations, if we are getting the same centroids for all the clusters, we can say
that the algorithm is not learning any new pattern, and it is a sign to stop the training.
EXAMPLE:
Solution: Iteratively assign data points to the closest cluster and recalculate cluster centers until convergence.
Step 1. Assign data points to the closest cluster.
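To make the procedure concrete, here is a minimal sketch using scikit-learn's KMeans on small synthetic 2-D data; the dataset and the choice of k = 2 are illustrative assumptions, not the data from the example above.

```python
# Minimal K-Means sketch on synthetic 2-D data (illustrative, not the example above).
import numpy as np
from sklearn.cluster import KMeans

# Two loose blobs of points around (1, 1) and (6, 6).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=1.0, scale=0.5, size=(20, 2)),
               rng.normal(loc=6.0, scale=0.5, size=(20, 2))])

# k = 2: assign each point to the nearest centroid, recompute the centroids,
# and repeat until the centroids stop moving (or max_iter is reached).
kmeans = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)
print("Final centroids:\n", kmeans.cluster_centers_)
```

With n_init=10 the algorithm is restarted from ten random centroid initializations and the run with the lowest within-cluster variance is kept, which reduces sensitivity to a bad initial choice of centroids.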
Hierarchical Clustering
Hierarchical clustering is a technique used to group similar data points together based on their similarity, creating a hierarchy or tree-like structure. The key idea is to begin with each data point as its own separate cluster and then progressively merge or split clusters based on their similarity.
Dendrogram
A dendrogram is like a family tree for clusters. It shows how individual data points or groups
of data merge together. The bottom shows each data point as its own group, and as you move
up, similar groups are combined. The lower the merge point, the more similar the groups are.
It helps you see how things are grouped step by step.
The working of the dendrogram can be explained using the below diagram:
In this image, on the left side, there are five points labeled P, Q, R, S, and T. These
represent individual data points that are being clustered. On the right side, there’s
a dendrogram, which shows how these points are grouped together step by step.
At the bottom of the dendrogram, the points P, Q, R, S, and T are all separate.
As you move up, the closest points are merged into a single group.
The lines connecting the points show how they are progressively merged based on
similarity.
The height at which they are connected shows how similar the points are to each other;
the shorter the line, the more similar they are.
Types of Hierarchical Clustering
Now that we understand the basics of hierarchical clustering, let's explore the two main
types of hierarchical clustering.
1. Agglomerative Clustering
2. Divisive clustering
1. Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). Unlike flat clustering, hierarchical clustering provides a structured way to group data. This clustering algorithm does not require us to prespecify the number of clusters. Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been merged into a single cluster that contains all the data.
2.Hierarchical Divisive clustering
It is also known as a top-down approach. This algorithm also does not require to prespecify
the number of clusters. Top-down clustering requires a method for splitting a cluster that
contains the whole data and proceeds by splitting clusters recursively until individual data
have been split into singleton clusters.
Advantages:
No predefined clusters: Hierarchical clustering does not require specifying the number of clusters in advance.
Dendrogram for visualization: It allows the creation of dendrograms, providing a visual representation of clusters at various levels.
Cluster hierarchy: The hierarchical nature helps reveal nested clusters.
Limitations:
Computational cost: Computing and storing pairwise distances scales poorly (roughly quadratically or worse) with the number of data points, so it is slow on large datasets.
No reassignment: Once a merge or split is made it cannot be undone, so an early poor decision propagates through the rest of the hierarchy.
Sensitivity to noise and outliers: Outliers can distort the hierarchy and the resulting clusters.
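As a quick sketch of agglomerative clustering and its dendrogram, the snippet below uses SciPy; the five 2-D points are illustrative stand-ins for P, Q, R, S and T.

```python
# Agglomerative (bottom-up) clustering with SciPy on five illustrative points.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Illustrative 2-D points standing in for P, Q, R, S, T.
X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

# 'ward' merges, at every step, the pair of clusters whose union gives the
# smallest increase in within-cluster variance.
Z = linkage(X, method="ward")

# Cut the tree into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print("Flat cluster labels:", labels)

# The dendrogram structure (leaf order, merge heights) without plotting;
# with matplotlib installed, dropping no_plot=True draws the tree.
info = dendrogram(Z, labels=["P", "Q", "R", "S", "T"], no_plot=True)
print("Leaf order in the dendrogram:", info["ivl"])
```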
Density-Based Clustering
Density-based clustering algorithms, like DBSCAN, identify clusters as areas of high density
separated by areas of low density. They are effective in discovering clusters of arbitrary
shape and in identifying noise or outliers in the data.
DBSCAN is a density-based clustering algorithm that groups data points that are
closely packed together and marks outliers as noise based on their density in the
feature space.
It identifies clusters as dense regions in the data space separated by areas of lower density.
Unlike K-Means or hierarchical clustering, which assume clusters are compact and spherical, DBSCAN performs well in handling real-world data irregularities such as:
Arbitrary-Shaped Clusters: Clusters can take any shape not just circular or convex.
Noise and Outliers: It effectively identifies and handles noise points without assigning
them to any cluster.
In short, K-Means and hierarchical clustering work best on compact, spherical clusters and have limited tolerance to noise, while DBSCAN naturally handles arbitrary-shaped clusters and noise. DBSCAN depends on two key parameters:
1. eps: This defines the radius of the neighborhood around a data point. If the distance between two points is less than or equal to eps, they are considered neighbors. A common method to determine eps is by analyzing the k-distance graph. Choosing the right eps is important:
If eps is too small, most points will be classified as noise.
If eps is too large, clusters may merge and the algorithm may fail to distinguish between them.
2. MinPts: This is the minimum number of points required within the eps radius to form a dense region. A general rule of thumb is to set MinPts >= D+1, where D is the number of dimensions in the dataset.
The algorithm then proceeds as follows (a code sketch follows the steps):
1. Identify Core Points: For each point in the dataset, count the number of points within its eps neighborhood. If the count meets or exceeds MinPts, mark the point as a core point.
2. Form Clusters: For each core point that is not already assigned to a cluster, create a new cluster. Recursively find all density-connected points, i.e., points within the eps radius of the core point, and add them to the cluster.
3. Density Connectivity: Two points a and b are density-connected if there exists a chain of points where each point is within the eps radius of the next and at least one point in the chain is a core point. This chaining process ensures that all points in a cluster are connected through a series of dense regions.
4. Label Noise Points: After processing all points, any point that does not belong to a cluster is labeled as noise.
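Here is the promised sketch using scikit-learn's DBSCAN; the synthetic data and the values of eps and min_samples (MinPts) are illustrative assumptions.

```python
# Minimal DBSCAN sketch with scikit-learn; eps and min_samples are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus two far-away points that should be flagged as noise.
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(5.0, 0.3, size=(30, 2)),
               np.array([[10.0, 10.0], [-8.0, 7.0]])])

# eps is the neighborhood radius, min_samples plays the role of MinPts.
db = DBSCAN(eps=0.8, min_samples=5).fit(X)

# Labels 0, 1, ... identify clusters; -1 marks noise points.
print("Labels:", db.labels_)
print("Number of noise points:", int(np.sum(db.labels_ == -1)))
```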
Partitioning of Data
Hard Partitioning:
In hard partitioning, each data point belongs to exactly one cluster. There is a clear and exclusive assignment, with no overlaps and no ambiguity.
Key Characteristics: Membership values are binary (a point either belongs to a cluster or it does not), and cluster boundaries are crisp. K-Means, hierarchical clustering and DBSCAN produce hard partitions.
Advantages: Simple to implement and interpret, computationally efficient, and the resulting assignments are unambiguous.
Soft Partitioning:
In soft partitioning, a data point can belong to multiple clusters with a degree of membership (between 0 and 1). The sum of the memberships across all clusters is usually 1.
Key Characteristics: Each point receives a membership degree for every cluster, so boundaries between clusters are gradual rather than crisp. Fuzzy C-Means and Gaussian Mixture Models produce soft partitions.
Advantages: Represents overlapping clusters and ambiguous points more realistically, and the membership degrees convey how certain each assignment is.
Matrix Factorization
Matrix factorization is the process of decomposing a matrix into a product of two (or more) smaller matrices. The core idea is to represent your data matrix X (e.g., rows are data points, columns are features) as a product of two lower-rank matrices, W and H:
X ≈ W H
Here:
X is the original data matrix (e.g., m×n, where m is the number of data points and n is the number of features).
W (e.g., m×k) is the coefficient matrix or cluster indicator matrix. Each row of W indicates the degree to which the corresponding data point belongs to each of the k clusters (latent factors).
H (e.g., k×n) is the basis matrix or cluster centroid matrix. Each row of H can be interpreted as a latent feature or a cluster prototype expressed in the original feature space.
The non-negativity constraint (especially in NMF) allows for a "part-based" representation, where parts add up to form the whole, which is highly interpretable for clustering.
Role in Clustering
Other MF techniques like Singular Value Decomposition (SVD) can also be used, often
followed by another clustering step (e.g., K-means on the reduced dimensions). However,
SVD doesn't impose non-negativity, which can make the interpretation of the factors less
direct for clustering.
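To illustrate, the sketch below uses scikit-learn's NMF on a small non-negative matrix and turns the factor weights into hard cluster labels; the data and the choice of k = 2 are assumptions for the example. In scikit-learn, fit_transform returns W (m×k) and components_ holds H (k×n), matching the X ≈ WH convention above.

```python
# Sketch: clustering via Non-negative Matrix Factorization (NMF).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Illustrative non-negative data: two groups of rows with different feature profiles.
X = np.vstack([rng.poisson(lam=[5, 5, 1, 1], size=(10, 4)),
               rng.poisson(lam=[1, 1, 5, 5], size=(10, 4))]).astype(float)

k = 2
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)   # m x k: per-point weights on each latent factor
H = model.components_        # k x n: factor (cluster prototype) feature profiles

# Hard cluster assignment: the latent factor with the largest weight for each row.
labels = np.argmax(W, axis=1)
print("Cluster labels:", labels)
```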
Limitations:
Choosing the Number of Latent Factors (k): Determining the optimal number of clusters (k) is often a challenge, similar to other clustering algorithms.
Computational Cost: For very large datasets, the factorization process can still be
computationally expensive, though often more scalable than some high-dimensional
clustering methods.
Non-convex Optimization (for NMF): NMF's optimization problem is non-convex,
meaning there's no guarantee of finding the global minimum. The results can be
sensitive to initialization.
Cold-Start Problem (in recommender systems): For entirely new users or items
with no historical data, MF models struggle to generate embeddings or assign them to
clusters.
Sensitivity to Noise and Outliers: The performance can be affected by significant
noise or outliers in the data.
Fuzzy C-Means (FCM) Clustering
Fuzzy C-Means (FCM) clustering is a soft clustering algorithm that allows data points to belong to multiple clusters simultaneously, with varying degrees of membership. Unlike "hard" clustering methods like K-Means, where each data point is assigned to exactly one cluster, FCM provides a more nuanced representation of data, especially when clusters overlap or when a data point naturally exhibits characteristics of more than one group.
The core idea of FCM is to minimize an objective function that measures the weighted sum
of squared distances between data points and cluster centroids. The "fuzzy" aspect comes
from the membership values, which are real numbers between 0 and 1 (inclusive), indicating
the degree to which a data point belongs to a particular cluster. The sum of membership
values for any given data point across all clusters must equal 1.
1. Initialization:
o Choose the number of clusters (C): This is a user-defined parameter, similar
to k in K-means.
o Initialize cluster centroids: Randomly select C data points from the dataset
as initial cluster centers, or randomly assign initial membership values to data
points.
o Choose the fuzziness parameter (m): This parameter controls the degree of
fuzziness or overlap between clusters. m>1 (typically m=2).
If m→1, FCM approaches hard clustering (like K-means).
If m→∞, the cluster centers move towards the center of the entire
dataset, and memberships tend to be equal for all clusters.
2. Iterative Optimization: The algorithm iteratively updates the cluster centroids and
membership values until convergence (i.e., when the change in membership values or
the objective function falls below a certain threshold, or a maximum number of
iterations is reached).
o Calculate/Update Membership Values (μij): For each data point xj and each cluster i, the membership value μij is calculated based on the distance between xj and the cluster centroid ci. The closer xj is to ci, the higher its membership degree in cluster i. The formula is:
μij = 1 / Σ(k=1 to C) ( ||xj − ci|| / ||xj − ck|| )^(2/(m−1))
o Calculate/Update Cluster Centroids (ci): After updating the membership values, the cluster centroids are recalculated using the weighted mean of all data points, where the weights are the fuzzy membership values:
ci = Σ(j=1 to N) (μij)^m · xj / Σ(j=1 to N) (μij)^m
3. Convergence: The iterations continue until the change in the membership matrix (or
the objective function value) between consecutive iterations is less than a predefined
threshold, or a maximum number of iterations is reached.
Objective Function
FCM minimizes the objective function
Jm = Σ(i=1 to C) Σ(j=1 to N) (μij)^m · ||xj − ci||²
Minimizing this function effectively finds cluster centers that are central to the data points,
with the influence of each data point on a cluster center being proportional to its fuzzy
membership in that cluster.
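For concreteness, here is a minimal NumPy sketch of the standard FCM updates described above (random membership initialization, fuzziness m = 2, illustrative data); a production implementation would typically come from a library.

```python
# Minimal NumPy sketch of Fuzzy C-Means using the standard update rules.
import numpy as np

def fuzzy_c_means(X, C=2, m=2.0, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Random initial membership matrix U (N x C); each row sums to 1.
    U = rng.random((N, C))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Centroid update: weighted mean of the data with weights mu_ij^m.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances from every point j to every centroid i (N x C).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-10)  # avoid division by zero
        # Membership update: U[j, i] = 1 / sum_k (d[j, i] / d[j, k])^(2/(m-1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.linalg.norm(U_new - U) < tol:  # convergence check
            U = U_new
            break
        U = U_new
    return centers, U

# Illustrative data: two overlapping groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (25, 2)), rng.normal(3, 1, (25, 2))])
centers, U = fuzzy_c_means(X, C=2)
print("Centroids:\n", centers.round(2))
print("Memberships of the first 3 points:\n", U[:3].round(2))
```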
Advantages of FCM:
Soft Clustering: Allows data points to belong to multiple clusters, providing a more realistic representation of overlapping clusters or ambiguous data points.
Flexibility: Can handle situations where cluster boundaries are not clear-cut.
Interpretability: The membership values offer insight into the degree of association between
data points and clusters.
Robustness to Noise (to some extent): Fuzzy memberships can provide some resilience to
noisy data points compared to hard clustering.
FCM is widely applied in various fields, especially where data exhibits ambiguity or overlap:
Image Segmentation: Dividing images into regions based on pixel intensity, color, or texture,
where pixels might belong to multiple regions (e.g., boundaries).
Medical Image Analysis: Identifying different tissue types in MRI or CT scans, where
boundaries between tissues might be fuzzy.
Pattern Recognition: Classifying patterns where there's inherent uncertainty or overlap
between categories.
Bioinformatics: Clustering gene expression data, where genes might be involved in multiple
biological pathways.
Customer Segmentation: Grouping customers with overlapping interests or purchasing
behaviors for targeted marketing.
Geographical Information Systems (GIS): Analyzing spatial data to identify regions with
mixed characteristics.
Fault Diagnosis: Identifying system faults based on sensor data, where a fault might have
multiple contributing factors.
Rough Clustering
Rough Clustering is an approach to clustering that leverages Rough Set Theory (RST).
Unlike traditional clustering methods that aim to assign each data point definitively to a
single cluster (hard clustering) or to multiple clusters with degrees of membership (fuzzy
clustering), Rough Clustering addresses the inherent vagueness and uncertainty in data by
defining clusters using lower and upper approximations.
At its core, Rough Set Theory, introduced by Zdzisław Pawlak in 1982, is a mathematical
framework for dealing with imprecise, uncertain, or incomplete information. It operates
on the principle of indiscernibility, meaning that objects that have the same attribute values
are considered indiscernible from each other given the available information.
Several algorithms have been developed based on rough set theory for clustering, the most common being Rough K-Means, which is described below. Key advantages of the rough set approach include:
Handles Uncertainty Natively: Its primary strength is its ability to directly deal with
vagueness, imprecision, and incompleteness in data without requiring probability
distributions or membership functions (like fuzzy sets).
No Prior Assumptions on Data Distribution: Unlike some statistical clustering
methods, RST does not make assumptions about the underlying data distribution.
Identifies Redundant Features: RST's core concept of reducts makes it effective for
feature selection, which can simplify the clustering problem and improve efficiency.
Interpretable Results: The lower and upper approximations provide a clear and
interpretable way to understand the certainty (or uncertainty) of an object's belonging
to a cluster.
Works with Categorical Data: RST is particularly well-suited for categorical
(nominal) data, where traditional distance metrics might be problematic.
Rough K-Means follows the same overall loop as standard K-Means; the fundamental difference lies in how data points are assigned to clusters and how cluster centroids are updated:
1. Initialization:
o Choose the number of clusters, K.
o Randomly select K initial cluster centroids (similar to standard K-Means).
o Define two weighting parameters, wL and wU, for the lower and upper
approximations, respectively. Typically, wL≥wU, and wL+wU=1. These
weights reflect the importance given to points in the lower vs. upper
approximations during centroid calculation.
2. Assignment Step (Approximation Formation): For each data point xj:
o Calculate the distance from xj to all K cluster centroids (d(xj,ci)).
o Identify the two closest centroids, cp and cq (where cp is the closest and cq is
the second closest).
o Assign to Lower Approximation:
If the distance to the closest centroid (d(xj,cp)) is significantly smaller
than the distance to the second closest centroid (d(xj,cq)) by a certain
ratio or threshold (e.g., d(xj,cp)/d(xj,cq) < threshold), then xj is assigned to the lower approximation of cluster Cp. These are the "certain" members.
o Assign to Upper Approximation:
If xj is not in the lower approximation of any cluster (i.e., it's
ambiguous), it is assigned to the upper approximations of all clusters
whose centroids are "close enough" (e.g., within a certain range or a
certain ratio of distances). More commonly, an object xj is added to the
upper approximation Ci if ci is among the "few" closest centroids to xj.
In many formulations, Ci simply includes all points that are not
definitely in the lower approximation of another cluster.
3. Centroid Update Step: Recalculate the cluster centroids based on the new assignments, typically using a weighted average:
ci = wL · mean(lower(Ci)) + wU · mean(boundary(Ci)), where boundary(Ci) = upper(Ci) − lower(Ci)
(if the boundary region is empty, ci is simply the mean of the lower approximation).
This formula gives more weight (wL) to the points in the lower approximation (definitely belonging) and less weight (wU) to the points in the boundary region (possibly belonging).
4. Convergence Check: Repeat steps 2 and 3 until the cluster centroids no longer
change significantly, or a maximum number of iterations is reached.
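A compact NumPy sketch of this rough K-Means loop is given below. It assumes that a cluster's boundary region consists of the ambiguous points whose two closest centroids include that cluster, and the ratio threshold, the weights wL and wU, and the toy data are illustrative choices.

```python
# Rough K-Means sketch (lower/upper approximations) in NumPy.
import numpy as np

def rough_kmeans(X, K=2, wL=0.7, wU=0.3, ratio=0.8, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # N x K
        order = np.argsort(d, axis=1)
        nearest, second = order[:, 0], order[:, 1]
        rows = np.arange(len(X))
        # Lower approximation: clearly closer to one centroid than to the next one.
        certain = d[rows, nearest] / (d[rows, second] + 1e-12) < ratio
        new_centers = np.empty_like(centers)
        for k in range(K):
            lower = X[certain & (nearest == k)]
            # Boundary: ambiguous points whose two closest centroids include k.
            boundary = X[~certain & ((nearest == k) | (second == k))]
            if len(lower) and len(boundary):
                new_centers[k] = wL * lower.mean(axis=0) + wU * boundary.mean(axis=0)
            elif len(lower):
                new_centers[k] = lower.mean(axis=0)
            elif len(boundary):
                new_centers[k] = boundary.mean(axis=0)
            else:
                new_centers[k] = centers[k]
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
print("Rough K-Means centroids:\n", rough_kmeans(X, K=2).round(2))
```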
Gaussian Mixture Models (GMM)
A Gaussian Mixture Model is a probabilistic model that assumes all data points are generated from a mixture of several Gaussian distributions with unknown parameters. Unlike hard clustering methods such as K-Means, which assign each data point to one cluster based on the closest centroid (an assumption that often doesn't match the complexity of real-world data), GMMs perform soft clustering, meaning each data point belongs to multiple clusters with certain probabilities. Each Gaussian in the mixture is defined by:
Mean (μ): The center of the distribution.
Covariance (Σ): Describes the spread and orientation.
Mixing coefficient (π): Represents the proportion of each Gaussian in the mixture.
The parameters are estimated with the Expectation-Maximization (EM) algorithm:
1. Initialization: Start with initial guesses for the means, covariances and mixing coefficients of each Gaussian distribution.
2. E-step: For each data point, calculate the probability of it belonging to each Gaussian
distribution (cluster).
3. M-step: Update the parameters (means, covariances, mixing coefficients) using the
probabilities calculated in the E-step.
4. Repeat: Continue alternating between the E-step and M-step until the log-likelihood of
the data (a measure of how well the model fits the data) converges.
Formula: In the E-step, the responsibility (posterior probability) of Gaussian k for data point xn is
γ(znk) = πk N(xn | μk, Σk) / Σj πj N(xn | μj, Σj)
The E-step computes these probabilities that each data point belongs to each Gaussian, while the M-step updates the parameters μk, Σk and πk as responsibility-weighted averages based on these probabilities.
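As a short sketch, scikit-learn's GaussianMixture runs this EM loop internally; the synthetic data below are illustrative.

```python
# Soft clustering with a Gaussian Mixture Model fitted by EM (scikit-learn).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs.
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([3, 3], 1.0, size=(50, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                   # alternates E- and M-steps until the log-likelihood converges

resp = gmm.predict_proba(X)  # E-step responsibilities = soft cluster memberships
print("Mixing coefficients pi_k:", gmm.weights_.round(2))
print("Means mu_k:\n", gmm.means_.round(2))
print("Soft memberships of the first 3 points:\n", resp[:3].round(2))
```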
Advantages of EM algorithm
Always improves results – With each step, the algorithm improves the likelihood
(chances) of finding a good solution.
Simple to implement – The two steps (E-step and M-step) are often easy to code for
many problems.
Quick math solutions – In many cases, the M-step has a direct mathematical solution
(closed-form), making it efficient
Disadvantages of EM algorithm
Takes time to finish: It converges slowly, meaning it may take many iterations to reach the best solution.
Gets stuck in a local best: Instead of finding the absolute best solution, it might settle for a "good enough" one.
Needs extra probabilities: Unlike some optimization methods that only need forward probability, EM requires both forward and backward probabilities, making it slightly more complex.
Spectral Clustering
Spectral Clustering is a powerful and flexible unsupervised learning algorithm that uses
the eigenvalues (spectrum) of a similarity matrix to reduce dimensions before applying a
standard clustering algorithm like K-Means.
It works in three broad steps:
1. Build a similarity (affinity) matrix that measures similarity between data points (e.g., using a Gaussian RBF kernel).
2. Compute the graph Laplacian of this matrix and take its leading eigenvectors, which gives a low-dimensional embedding of the data.
3. Run a standard clustering algorithm (typically K-Means) on the rows of the eigenvector matrix.
Advantages:
Makes no assumption that clusters are compact or spherical, so it can recover arbitrary, non-convex cluster shapes.
Works directly from a similarity matrix, so it can be applied to data that is not naturally represented as vectors.
Disadvantages:
Computing the eigenvectors of the similarity matrix is expensive for large datasets.
Results are sensitive to the choice of similarity measure and its parameters, and the number of clusters must still be specified.
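A brief sketch with scikit-learn's SpectralClustering on two concentric circles, a shape K-Means alone cannot separate; the nearest-neighbour affinity and the parameter values are illustrative choices (the default affinity is an RBF kernel).

```python
# Spectral Clustering sketch on two concentric circles (scikit-learn).
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

# A k-nearest-neighbour affinity graph is used here because it separates the two
# rings reliably; affinity="rbf" (the Gaussian kernel) is the default alternative.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors", n_neighbors=10,
                        assign_labels="kmeans", random_state=0)
labels = sc.fit_predict(X)
print("Labels of the first 10 points:", labels[:10])
```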
Reinforcement Learning
Reinforcement Learning (RL) is a branch of machine learning that focuses on how agents
can learn to make decisions through trial and error to maximize cumulative rewards. RL
allows machines to learn by interacting with an environment and receiving feedback based
on their actions. This feedback comes in the form of rewards or penalties.
Reinforcement Learning revolves around the idea that an agent (the learner or
decision-maker) interacts with an environment to achieve a goal. The agent performs
actions and receives feedback to optimize its decision-making over time.
Agent: The decision-maker that performs actions.
Example: consider a robot learning to find its way out of a maze. The robot learns by exploring different paths in the maze. By trying various moves, it evaluates the rewards and penalties for each path. Over time, the robot determines the best route by selecting the actions that lead to the highest cumulative reward.
Types of Reinforcements in RL
1. Positive Reinforcement
Positive Reinforcement occurs when an event, arising as a consequence of a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
Advantages: Maximizes performance, helps sustain change over time.
Disadvantages: Overuse of rewards can lead to an overload of states, which may reduce effectiveness.
2. Negative Reinforcement
Negative Reinforcement is defined as the strengthening of a behavior because a negative condition is stopped or avoided.
Advantages: Increases behavior frequency, ensures a minimum performance standard.
Disadvantages: It may only encourage just enough action to avoid penalties.
A Reinforcement Learning framework refers to the formal structure that defines how an
RL problem is modeled and solved. The RL framework consists of states, actions, rewards,
and policies. The most common framework used is the Markov Decision Process (MDP).
The agent takes actions, observes the environment's state, and receives a reward. The goal is
to learn a policy that maximizes the expected return over time.
RL framework:
1. Agent: The decision-maker that interacts with the environment.
2. Environment: The context in which the agent operates, providing feedback and rewards.
3. Actions: The choices the agent can make within the environment.
4. States: The current situation or condition of the environment at a given time.
5. Rewards: Feedback provided by the environment to the agent, indicating the quality of the
action taken.
The RL Loop:
1. The agent observes the current state of the environment.
2. The agent takes an action based on its current policy.
3. The environment transitions to a new state based on the action.
4. The agent receives a reward from the environment.
5. The agent updates its policy based on the experience, aiming to maximize the expected
cumulative reward.
In essence, the RL framework provides a structured approach for designing algorithms that
can learn to make optimal decisions in complex environments through trial and error and
feedback.
Solution Methods in RL
A. Value-Based Methods
These methods estimate value functions (how good a state or action is).
Key Idea: Learn the expected reward of being in a state (or of performing an action in a state). A minimal tabular Q-Learning sketch is shown after the algorithm list below.
Common Algorithms:
Q-Learning (off-policy)
SARSA (on-policy)
Deep Q-Networks (DQN) – uses neural networks to approximate Q-values
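Below is the tabular Q-Learning sketch referred to above, on a tiny made-up corridor environment; the environment, rewards and hyperparameters are illustrative assumptions.

```python
# Tabular Q-Learning on a tiny 1-D corridor: states 0..4, the goal is state 4.
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

def step(state, action):
    """Move left/right; reaching state 4 gives reward +1 and ends the episode."""
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Off-policy Q-Learning update: bootstrap from the best next action.
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print("Learned Q-table (rows = states, columns = [left, right]):\n", Q.round(2))
```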
B. Policy-Based Methods
Directly learn the policy (mapping from states to actions), without estimating value functions.
Common Algorithms:
REINFORCE
Policy Gradient
Actor-Critic Methods
C. Model-Based Methods
Learn a model of the environment (i.e., transition probabilities and rewards) and use it to plan
actions.
Key Idea: Simulate outcomes and plan ahead (like in classical planning).
Examples:
Dyna-Q
Monte Carlo Tree Search (MCTS)
Applications of Reinforcement Learning:
A. Robotics
B. Healthcare
D. Autonomous Driving
E. Finance
Portfolio optimization
Trading strategy development
Risk management
F. Recommendation Systems
G. Energy Systems
Multi-class Classification
Multi-class classification involves categorizing instances into one of three or more classes.
Techniques include extending binary classifiers using strategies like one-vs-rest or one-vs-
one, and using algorithms inherently capable of multi-class classification, such as decision
trees and neural networks.
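As a short sketch, the snippet below compares an explicit one-vs-rest strategy with a natively multi-class decision tree on the three-class Iris dataset using scikit-learn.

```python
# Multi-class classification on the 3-class Iris dataset (scikit-learn).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One-vs-rest: one binary logistic-regression classifier per class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

# A decision tree handles the three classes natively.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("One-vs-rest accuracy:", round(ovr.score(X_test, y_test), 3))
print("Decision-tree accuracy:", round(tree.score(X_test, y_test), 3))
```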