MODULE 5
CHAPTER 13
CLUSTERING ALGORITHMS
• Clustering: the process of grouping a set of objects into classes of similar objects
• Finding similarities between data according to the characteristics found in the data and
grouping similar data objects into clusters.
• Unsupervised learning: no predefined classes.
• Example: The figure below shows data points with two features, indicated by differently shaded samples.
If the examples have only a few features, grouping can be done manually; but when the examples have many features, manual grouping is not feasible, so automatic clustering is required.
Applications of Clustering
1. Grouping based on customer buying patterns
2. Profiling of customers based on lifestyle
3. In information retrieval applications (like retrieval of a document from a collection
of documents)
4. Identifying the groups of genes that influence a disease
5. Identification of organs that are similar in physiological function
6. Taxonomy of animals, plants in Biology
7. Clustering based on purchasing behaviour and demography
8. Document indexing
9. Data compression by grouping similar objects and finding duplicate objects
PROXIMITY MEASURES
Clustering algorithms need a measure to find the similarity or dissimilarity among the
objects to group them. Similarity and Dissimilarity are collectively known as proximity
measures. This is used by a number of data mining techniques, such as clustering, nearest
neighbour classification, and anomaly detection.
Distance measures are known as dissimilarity measures, as these indicate how one object
is different from another.
Measures like cosine similarity indicate the similarity among objects.
Distance measures and similarity measures are two sides of the same coin: a larger distance indicates less similarity, and vice-versa.
A distance measure is called a metric if it satisfies non-negativity, identity (d(x, y) = 0 if and only if x = y), symmetry and the triangle inequality.
Some of the proximity measures are:
1. Quantitative variables
a) Euclidean Distance: Known as the L2 norm; the square root of the sum of the squared differences between the corresponding attribute values of two objects.
Advantage: The distance between two objects does not change with the addition of new objects.
Disadvantages: i) If the unit of measurement changes, the resulting Euclidean or squared Euclidean distance changes drastically.
ii) Computational complexity is high, because it involves squaring and a square root.
b) City Block Distance: Known as Manhattan Distance or L1 norm.
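As an illustration, a minimal sketch of these two quantitative distance measures in Python (the function names euclidean and manhattan are illustrative, not from the text):

```python
import math

def euclidean(x, y):
    # L2 norm: square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # L1 norm (city block): sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

# Example using two of the sample points from the worked examples later in this chapter
print(euclidean([3, 5], [7, 8]))   # 5.0
print(manhattan([3, 5], [7, 8]))   # 7
```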
Binary Attributes: Binary attributes have only two values. The distance measures discussed above cannot be applied to find the distance between objects that have binary attributes. For finding the distance among objects with binary attributes, a contingency table is used.
Hamming Distance: Hamming distance is a metric for comparing two binary data strings.
While comparing two binary strings of equal length, Hamming distance is the number of bit
positions in which the two bits are different. It is used for error detection or error correction
when data is transmitted over computer networks.
Example
Suppose there are two strings 1101 1001 and 1001 1101.
11011001 ⊕ 10011101 = 01000100. Since, this contains two 1s, the Hamming distance,
d(11011001, 10011101) = 2.
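A small Python sketch of the same computation, counting differing bit positions directly rather than XOR-ing the strings:

```python
def hamming(s1, s2):
    # Number of bit positions at which the two equal-length strings differ
    assert len(s1) == len(s2), "strings must be of equal length"
    return sum(b1 != b2 for b1, b2 in zip(s1, s2))

print(hamming("11011001", "10011101"))  # 2
```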
Categorical Variables
In many cases, categorical values are used. It is just a code or symbol to represent the values. For
example, for the attribute Gender, a code 1 can be given to female and 0 can be given to male. To
calculate the distance between two objects represented by variables, we need to find only whether
they are equal or not. This is given as:
d(x, y) = 0 if the two values are equal, and d(x, y) = 1 otherwise.
Ordinal Variables
Ordinal variables are like categorical values but with an inherent order. For example, designation is an
ordinal variable. If job designation is 1 or 2 or 3, it means code 1 is higher than 2 and code 2 is higher than
3. It is ranked as 1 >2>3.
Cosine Similarity
Cosine similarity is a metric used to measure how similar the documents are
irrespective of their size.
It measures the cosine of the angle between two vectors projected in a multi-
dimensional space.
The cosine similarity is advantageous because even if the two similar documents are
far apart by the Euclidean distance (due to the size of the document), chances are
they may still be oriented closer together.
The smaller the angle, the higher the cosine similarity.
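A minimal sketch of cosine similarity in Python, assuming the documents have already been turned into numeric term-frequency vectors (the vectors below are made up for illustration):

```python
import math

def cosine_similarity(x, y):
    # cos(theta) = (x . y) / (||x|| * ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

doc1 = [3, 2, 0, 5]
doc2 = [6, 4, 0, 10]   # same orientation as doc1, just a "longer" document
print(cosine_similarity(doc1, doc2))  # ~1.0: maximally similar despite different sizes
```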
In hierarchical (agglomerative) clustering, the following three methods differ in how the distance between two clusters is measured.
1. Single Linkage
2. Average Linkage
3. Complete Linkage
Single Linkage or MIN algorithm
In single linkage hierarchical clustering, the distance between two clusters is defined
as the shortest distance between two points in each cluster. For example, the distance
between clusters “r” and “s” to the left is equal to the length of the arrow between
their two closest points.
Complete Linkage : In complete linkage hierarchical clustering, the distance between two
clusters is defined as the longest distance between two points in each cluster. For example,
the distance between clusters “r” and “s” to the left is equal to the length of the arrow
between their two furthest points.
Average Linkage : In average linkage hierarchical clustering, the distance between two
clusters is defined as the average distance between each point in one cluster to every point in
the other cluster. For example, the distance between clusters “r” and “s” to the left is equal to
the average length of the arrows connecting the points of one cluster to the points of the other.
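The three linkage criteria can be tried out with SciPy's hierarchical clustering routines. This is a sketch, assuming scipy and numpy are available; the five points are the ones used in the worked example that follows:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[3, 5], [7, 8], [12, 5], [16, 9], [20, 8]])

# 'single' = MIN, 'complete' = MAX, 'average' = average linkage
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
    print(method, labels)
```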
Mean-Shift Algorithm
Consider the data shown in the following table and cluster it using the single linkage (MIN) and complete linkage (MAX) algorithms.
Table: Sample Data
SNO X Y
1. 3 5
2. 7 8
3. 12 5
4. 16 9
5. 20 8
Solution
The Euclidean distance between every pair of objects is computed and shown in the following proximity matrix.
Table: Proximity Matrix
Objects 1 2 3 4 5
1 - 5 9 9.85 17.26
2 - 5.83 9.49 13
3 - 5.66 8.94
4 - 4.12
5 -
The minimum distance is 4.12. Therefore, items 1 and 4 are clustered together. The resultant distances are shown in the following table.
Table After Iteration 1
Clusters {1,4} 2 3 5
{1,4} - 5 5.66 4.12
2 - 5.83 13
3 - 8.94
5 -
The distances between the group {1, 4} and items 2, 3 and 5 are computed using the single linkage (minimum distance) rule.
The distance between {1,4} and {2} is:
min{ dist(1,2), dist(4,2) } = min{5, 9.49} = 5
The distance between {1,4} and {3} is:
min{ dist(1,3), dist(4,3) } = min{9, 5.66} = 5.66
The distance between {1,4} and {5} is:
min{ dist(1,5), dist(4,5) } = min{17.26, 4.12} = 4.12
The minimum distance in the above table is 4.12. Therefore, {1,4} and {5} are combined. This results in the following table.
Table After Iteration 2
Clusters {1,4,5} 2 3
{1,4,5} - 5 5.66
2 - 5.83
3 -
The minimum is 5. Therefore, {1,4,5} and {2} are combined, and finally the result is combined with {3}.
Therefore, the order of clustering is {1,4}, then {5}, then {2} and finally {3}.
Complete Linkage or MAX or Clique
Here, as in the first iteration above, the minimum entry of the proximity matrix is taken and {1,4} is combined. The distances between clusters are then recomputed using the maximum distance between their points. The minimum of the resulting table is 8.94, so {3,5} is combined. The next minimum is 9.49, so {2} is merged with {1,4}. The order of clustering is therefore {1,4}, then {3,5}, then {1,4} with {2}, and finally the two remaining clusters are merged.
Hint: The same is used for average link algorithm where the average distance of all pairs of
points across the clusters is used to form clusters.
Consider the data shown in the following table. Use the k-means algorithm with k = 2 and show the result.
Table Sample Data
SNO X Y
1. 3 5
2. 7 8
3. 12 5
4. 16 9
Solution
Let us assume the seed points are (3,5) and (16,9). These are shown in the following table as the starting clusters.
Table Initial Cluster Table
Cluster 1 Cluster 2
(3,5) (16,9)
Iteration 1: Compare all the data points (samples) with the two centroids and assign each to the nearest centroid.
Take sample object 2 and compare it with the two centroids as follows:
Dist(2, centroid 1) = √((7 − 3)² + (8 − 5)²) = 5
Dist(2, centroid 2) = √((7 − 16)² + (8 − 9)²) ≈ 9.06
Object 2 is closer to the centroid of cluster 1 and hence is assigned to cluster 1. For object 3:
Dist(3, centroid 1) = √((12 − 3)² + (5 − 5)²) = 9
Dist(3, centroid 2) = √((12 − 16)² + (5 − 9)²) ≈ 5.66
Object 3 is closer to the centroid of cluster 2 and hence is assigned to cluster 2. The cluster assignments after this iteration are shown in the following table.
Table: Clusters After Iteration 1
Cluster 1 Cluster 2
(3,5) (12,5)
(7,8) (16,9)
Iteration 2: The new centroids are the means of the current clusters: centroid 1 = (5, 6.5) and centroid 2 = (14, 7). Take sample object 1 and compare it with the two centroids:
Dist(1, centroid 1) = √((3 − 5)² + (5 − 6.5)²) = 2.5
Dist(1, centroid 2) = √((3 − 14)² + (5 − 7)²) ≈ 11.18
Object 1 is closer to the centroid of cluster 1 and hence remains in the same cluster. For object 2:
Dist(2, centroid 1) = √((7 − 5)² + (8 − 6.5)²) = 2.5
Dist(2, centroid 2) = √((7 − 14)² + (8 − 7)²) ≈ 7.07
Object 2 also remains in cluster 1. Repeating the computation for objects 3 and 4 shows that they remain in cluster 2. Since no object changes its cluster, the algorithm terminates with the clusters {1, 2} and {3, 4}.
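The same worked example can be reproduced with scikit-learn (assumed to be installed); the seed points are passed explicitly so that the run matches the hand computation above:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[3, 5], [7, 8], [12, 5], [16, 9]], dtype=float)
seeds = np.array([[3, 5], [16, 9]], dtype=float)   # the assumed seed points

km = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)
print(km.labels_)           # cluster index assigned to each object
print(km.cluster_centers_)  # final centroids, e.g. (5, 6.5) and (14, 7)
```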
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-Means
Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k is the pre-defined number of groups. The cluster centres are chosen so that each data point is closer to the centroid of its own cluster than to the centroid of any other cluster.
K-means can be viewed as a greedy algorithm, as it partitions the 'n' samples into k clusters so as to minimize the Sum of Squared Errors (SSE). SSE is an error measure that gives the sum of the squared Euclidean distances of each data point to its closest centroid.
$$\text{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \operatorname{dist}(c_i, x)^2$$
Here, ci = centroid of the i-th cluster Ci
x = a sample data point assigned to cluster Ci
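A small NumPy sketch of this computation, using the clusters and centroids obtained in the worked example above (the function name sse is illustrative):

```python
import numpy as np

def sse(X, labels, centroids):
    # Sum over clusters of the squared Euclidean distances of points to their centroid
    return sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centroids))

X = np.array([[3, 5], [7, 8], [12, 5], [16, 9]], dtype=float)
labels = np.array([0, 0, 1, 1])
centroids = np.array([[5, 6.5], [14, 7]])
print(sse(X, labels, centroids))  # 12.5 + 16.0 = 28.5
```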
Advantages
1. Simple
2. Easy to implement
Disadvantages
1. It is sensitive to initialization process as change of initial points leads to different clusters.
2. If the samples are large, then the algorithm takes a lot of time.
Complexity
The complexity of the k-means algorithm depends on n, the number of samples; k, the number of clusters; I, the number of iterations; and d, the number of attributes. Its time complexity is O(nkId).
Density-Based Clustering
A cluster is a dense region of points, which is separated by low-density regions, from other regions
of high density.
Used when the clusters are irregular or intertwined, and when noise and outliers are present.
Density-Based Clustering refers to unsupervised learning methods that identify distinctive
groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region of
high point density, separated from other such clusters by contiguous regions of low point density.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a base algorithm for density-based clustering. It can discover clusters of different shapes and sizes from large amounts of data that contain noise and outliers.
There are three types of points after the DBSCAN clustering is complete:
• Core — This is a point that has at least m points within distance n from itself.
• Border — This is a point that is not a core point but has at least one core point within distance n.
• Noise — This is a point that is neither a core nor a border point. It has fewer than m points within distance n from itself.
1. Directly density-reachable – a point X is directly density-reachable from a core point Y if X lies within distance n of Y.
2. Density-reachable – X is density-reachable from Y if there is a chain of points from Y to X in which each point is directly density-reachable from the previous one.
3. Density-connected – X and Y are density-connected if there is a core point Z such that both X and Y are density-reachable from Z.
Advantages
1. No need for specifying the number of clusters beforehand
2. The algorithm can detect clusters of any shapes
3. Robust to noise
4. Few parameters are needed
The complexity of this algorithm is O(nlogn).
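A minimal DBSCAN sketch using scikit-learn (assumed available); eps and min_samples here play the roles of the distance n and the count m described above, and the data is made up for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two small dense blobs plus one isolated outlier
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
              [20.0, 20.0]])

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)   # e.g. [0 0 0 1 1 1 -1]; the label -1 marks noise points
```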
Grid-Based Approaches
The grid-based clustering method takes a space-driven approach by partitioning the embedding space into cells, independent of the distribution of the input objects.
The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the
object space into a finite number of cells that form a grid structure on which all of the operations
for clustering are performed.
The main advantage of the approach is its fast processing time, which is typically independent
of the number of data objects, yet dependent on only the number of cells.
There are three important concepts that need to be mastered for understanding the grid-based schemes. They are:
1. Subspace clustering
2. The concept of dense cells
3. The monotonicity property
Subspace Clustering
Grid-based algorithms are useful for clustering high-dimensional data, that is, data with many attributes.
Some data like gene data may have millions of attributes. Every attribute is called a dimension. But all the
attributes are not needed, as in many applications one may not require all the attributes. For example, an
employee's address may not be required for profiling his diseases. Age may be required in that case. So,
one can conclude that only a subset of features is required.
CLIQUE is a density-based and grid-based subspace clustering algorithm, useful for finding clusters in subspaces of high-dimensional data.
Concept of Dense cell
CLIQUE partitions each dimension into several intervals and thereby divides the data space into cells. The algorithm then determines whether each cell is dense or sparse. A cell is considered dense if its density exceeds a threshold value.
The density of a cell is defined as the ratio of the number of points in it to the volume of the region. In one pass, the algorithm finds the number of points in each cell and then combines the dense cells. For that, the algorithm uses contiguous intervals and the set of dense cells.
MONOTONICITY Property
CLIQUE uses the anti-monotonicity (Apriori) property. It means that all the subsets of a frequent itemset are frequent; similarly, if a subset is infrequent, then all of its supersets are infrequent.
Advantages of CLIQUE
1. Insensitive to input order of objects
2. No assumptions of underlying data distributions
3. Finds subspace of higher dimensions such that high-density clusters exist in those
subspaces
Disadvantage
The disadvantage of CLIQUE is that tuning the grid parameters, such as the grid size, and finding an optimal threshold for deciding whether a cell is dense or not, is a challenge.
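The dense-cell idea behind CLIQUE can be illustrated (this is not the full CLIQUE algorithm) by gridding 2-D data and counting points per cell; bins and threshold below are illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic clusters of 50 points each
X = np.vstack([rng.normal([2, 2], 0.3, (50, 2)),
               rng.normal([8, 8], 0.3, (50, 2))])

bins = 5          # number of intervals per dimension (grid size parameter)
threshold = 10    # a cell is "dense" if it holds more than this many points

counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=bins)
dense_cells = np.argwhere(counts > threshold)
print(dense_cells)   # (row, column) indices of the dense grid cells
```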
Reinforcement Learning
Characteristics of RL
1. Sequential Decision Making
o The goal is achieved through a sequence of decisions.
o One wrong decision can lead to failure.
o RL involves learning a proper sequence of steps to reach the target.
2. Delayed Feedback
o Rewards or feedback are not given immediately.
o It may take several steps or moves before knowing whether the action was good
or bad.
3. Interdependent Actions
o Each action affects future actions.
o A wrong move now can cause problems later.
4. Time-related Actions
o All actions are linked to time.
o Actions are naturally ordered in a timeline.
Reinforcement Learning | Supervised Learning
No supervisor and no labelled dataset initially | Presence of a supervisor and labelled data
Decisions are dependent and made sequentially | Decisions are independent and based on the input given in training
Feedback is not instantaneous; it is delayed in time | Feedback is usually instantaneous once the model is created
The agent's action affects the next input data | Depends on the initial input or the input given at the start
No target values, only goal-oriented | Target class is predefined by the problem
Example: Chess, Go, Atari games | Example: Classifiers
Reinforcement Problems
There are two types of problems in RL:
1. Learning Problems
o The environment is unknown.
o The agent learns through trial and error.
o It interacts with the environment to improve its policy (behavior strategy).
2. Planning Problems
o The environment is known.
o The agent uses the known model to compute and improve the policy.
Environment and Agent
• Environment:
o The external system where all actions happen.
o It includes input, output, and reward definitions.
o It describes the state or state variables; the state at the start is called the initial state.
o Example: In a self-driving car system, maps, rules, and road obstacles are part
of the environment.
• Agent:
o An autonomous entity that observes the environment and performs actions.
o It could be a robot, chatbot, or software that learns from the environment.
States
• A state is the current situation or position of the agent (e.g., location in a maze or city).
• Example states: A, B, C, D, E, F, G, H, I
• A = Starting state, G = Goal/Target state
• Notations:
o S – general state
o s – specific state
o sₜ – state at time t
🔹 Types of Nodes
1. Goal Node (also called Terminal or Absorbing State)
o Final destination with the highest reward.
2. Non-terminal Nodes
o Intermediate states before reaching the goal.
3. Start Node
o The initial position of the agent.
🔹 Actions
A common problem in reinforcement learning is the multi-arm (or N-arm) bandit problem.
Imagine a 5-armed slot machine in a casino (Figure 14.7).
Each lever, when pulled, gives a random reward between $1 and $10.
There can be N levers, each with unknown reward behavior.
The goal is to select levers wisely to maximize total money earned in a limited number of
tries.
This function is called the action-value function or Q function. It indicates the value of taking a particular action 'a'.
The best action (highest Q value) is the action that returns the highest average return and is the indicator of the action quality.
Selection Procedure
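A common selection procedure is the ε-greedy strategy (also used later in Q-learning): with a small probability the agent explores a random lever, otherwise it exploits the lever with the highest current Q estimate. A minimal sketch for the N-arm bandit, with simulated reward distributions that are purely illustrative:

```python
import random

true_means = [3.0, 5.0, 7.0, 4.0, 6.0]   # hidden mean reward of each lever (unknown to the agent)
N = len(true_means)
Q = [0.0] * N          # action-value (Q) estimates, one per lever
counts = [0] * N       # how many times each lever has been pulled
epsilon = 0.1          # probability of exploring a random lever

for t in range(1000):
    if random.random() < epsilon:
        a = random.randrange(N)                 # explore
    else:
        a = max(range(N), key=lambda i: Q[i])   # exploit the best estimate so far
    reward = random.gauss(true_means[a], 1.0)   # pull lever a
    counts[a] += 1
    Q[a] += (reward - Q[a]) / counts[a]         # incremental average of returns

print([round(q, 2) for q in Q])   # the Q estimates approach the true mean rewards
```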
Reinforcement algorithm
There are two main algorithms for solving reinforcement learning problems using conventional methods:
1. Value Iteration 2. Policy Iteration
Policy Improvement
The policy improvement process is performed as follows:
1. Evaluate the current policy using policy evaluation.
2. Solve the Bellman equation for the current policy to obtain v(s).
3. Improve the policy by applying the greedy approach to maximize expected
rewards.
4. Repeat the process until the policy converges to the optimal policy.
Algorithm
1. Start with an arbitrary policy π.
2. Perform policy evaluation using Bellman’s equation.
3. Improve the policy greedily.
4. Repeat until convergence.
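A compact sketch of policy iteration on a tiny, made-up chain MDP (the states, rewards and transitions below are illustrative assumptions, not taken from the text):

```python
import numpy as np

# States 0..3 on a line; state 3 is the goal (terminal).
# Actions: 0 = move left, 1 = move right. Reward: +10 on reaching the goal, -1 otherwise.
N_STATES, GOAL, GAMMA = 4, 3, 0.9
ACTIONS = (0, 1)

def step(s, a):
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, GOAL)
    return s2, (10 if s2 == GOAL else -1)

def evaluate(policy, theta=1e-6):
    # Iterative policy evaluation: solve Bellman's equation for the current policy
    v = np.zeros(N_STATES)
    while True:
        delta = 0.0
        for s in range(GOAL):                    # the goal state keeps value 0
            s2, r = step(s, policy[s])
            new_v = r + GAMMA * v[s2]
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

policy = [0] * N_STATES                          # start with an arbitrary policy
while True:
    v = evaluate(policy)
    # Greedy improvement: pick the action with the best one-step lookahead value
    new_policy = [max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * v[step(s, a)[0]])
                  for s in range(N_STATES)]
    if new_policy == policy:
        break
    policy = new_policy

print(policy, v)   # converges to a policy that moves right in every non-terminal state
```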
Working of MC
• Rewards are collected only at the end of an episode.
• These are used to calculate the maximum expected future reward (called return).
• Empirical return is used instead of expected return.
• The return is averaged over multiple episodes for each state.
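A short sketch of this return computation (first-visit Monte Carlo averaging); the episodes below are made-up lists of (state, reward) pairs rather than the output of a real environment:

```python
from collections import defaultdict

GAMMA = 0.9

def first_visit_mc(episodes):
    # Average the empirical return G following the first visit to each state
    returns = defaultdict(list)
    for episode in episodes:                  # episode = [(state, reward), ...]
        g = 0.0
        first_return = {}
        # Walk backwards so g accumulates the discounted future reward;
        # the last overwrite corresponds to the FIRST visit of the state.
        for state, reward in reversed(episode):
            g = reward + GAMMA * g
            first_return[state] = g
        for state, g_first in first_return.items():
            returns[state].append(g_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

episodes = [[("A", -1), ("B", -1), ("G", 10)],
            [("A", -1), ("G", 10)]]
print(first_visit_mc(episodes))   # averaged return per state
```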
Q-Learning
Q-Learning Algorithm
1. Initialize Q-table:
Create a table Q(s,a) with states s and actions a.
Initialize Q-values with random or zero values.
2. Set parameters:
Learning rate α (typically between 0 and 1).
Discount factor γ (typically close to 1).
Exploration–exploitation trade-off strategy (e.g., ε-greedy policy).
3. Repeat for each episode:
Start from an initial state s.
Repeat until reaching a terminal state:
Choose an action a (e.g., ε-greedy with respect to Q(s, ·)).
Take action a; observe the reward r and the next state s'.
Update Q(s, a) ← Q(s, a) + α [ r + γ max over a' of Q(s', a') − Q(s, a) ].
Set s ← s'.
A sketch of this loop is given below.
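A minimal sketch of this loop in Python. The environment object and its reset()/step() methods are illustrative assumptions (not a specific library API), used only to show where the Q-table update happens:

```python
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    # Tabular Q-learning; env is assumed to expose n_states, n_actions,
    # reset() -> state and step(action) -> (next_state, reward, done).
    Q = [[0.0] * env.n_actions for _ in range(env.n_states)]
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(env.n_actions)
            else:
                a = max(range(env.n_actions), key=lambda i: Q[s][i])
            s2, r, done = env.step(a)
            # Off-policy update: bootstrap from the best action in the next state
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```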
SARSA Learning
SARSA Algorithm (State-Action-Reward-State-Action)