
GR20A3123
UNIT – V
UNSUPERVISED LEARNING
Unsupervised Learning:
• Clustering: K-Means
• K-Modes
• K-Prototypes
• Gaussian Mixture Models
• Expectation-Maximization

Reinforcement Learning:
• Exploration and exploitation trade-offs
• Non-associative learning
• Markov decision processes
• Q-learning
K-Means Clustering

• K-Means clustering is an unsupervised iterative clustering technique.

• It partitions the given data set into k predefined distinct clusters.

• A cluster is defined as a collection of data points exhibiting certain similarities.


It partitions the data set such that-
• Each data point belongs to the cluster with the nearest mean.
• Data points belonging to one cluster have a high degree of similarity.
• Data points belonging to different clusters have a high degree of dissimilarity.
K-Means Clustering Algorithm-

K-Means Clustering Algorithm involves the following steps


Step-01:
Choose the number of clusters K.
Step-02:
Randomly select any K data points as cluster centers.
Select the cluster centers in such a way that they are as far apart from each
other as possible.
Step-03:
Calculate the distance between each data point and each cluster center.
The distance may be calculated either by using a given distance function or by
using the Euclidean distance formula.
Step-04:

Assign each data point to some cluster.


A data point is assigned to that cluster whose center is nearest to that data point.

Step-05:

Re-compute the center of newly formed clusters.


The center of a cluster is computed by taking the mean of all the data points contained in that
cluster.
Step-06:

Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping
criteria is met-
• Centers of newly formed clusters do not change
• Data points remain in the same cluster
• The maximum number of iterations is reached
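The steps above can be summarized in a short Python sketch. This is a minimal illustration, not code from the slides; the function name kmeans and its parameters are chosen here only for demonstration.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means: Euclidean distance, mean-based centers."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step-02: randomly select K data points as the initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step-03/04: distance of every point to every center, assign to the nearest
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-05: re-compute each center as the mean of the points assigned to it
        new_centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step-06: stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```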
PRACTICE PROBLEMS BASED ON K-MEANS CLUSTERING ALGORITHM-

Problem-01:

Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)

Initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|
Use K-Means Algorithm to find the three cluster centers after the second iteration.
Solution-
We follow the above discussed K-Means Clustering Algorithm-
Iteration-01:

We calculate the distance of each point from each of the centers of the three clusters.
The distance is calculated by using the given distance function.

The following illustration shows the calculation of the distance between point
A1(2, 10) and each of the centers of the three clusters-
Calculating Distance Between A1(2, 10) and C1(2, 10)-

Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0

Calculating Distance Between A1(2, 10) and C2(5, 8)-

Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
=3+2
=5
Calculating Distance Between A1(2, 10) and C3(1, 2)-

Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
=1+8
=9

In a similar manner, we calculate the distances of the other points from each of the centers
of the three clusters.

We draw a table showing all the results.


Using the table, we decide which point belongs to which cluster.
The given point belongs to that cluster whose center is nearest to it.
Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (5, 8) of Cluster-02 | Distance from center (1, 2) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 5 | 9 | C1
A2(2, 5) | 5 | 6 | 4 | C3
A3(8, 4) | 12 | 7 | 9 | C2
A4(5, 8) | 5 | 0 | 10 | C2
A5(7, 5) | 10 | 5 | 9 | C2
A6(6, 4) | 10 | 5 | 7 | C2
A7(1, 2) | 9 | 10 | 0 | C3
A8(4, 9) | 3 | 2 | 10 | C2
Cluster-01:

First cluster contains points- A1(2, 10)

Cluster-02:

Second cluster contains points-


A3(8, 4)
A4(5, 8)
A5(7, 5)
A6(6, 4)
A8(4, 9)

Cluster-03:

Third cluster contains points- A2(2, 5) A7(1, 2)


Now,
We re-compute the cluster centers.
The new cluster center is computed by taking the mean of all the points
contained in that cluster.

For Cluster-01:

We have only one point A1(2, 10) in Cluster-01.


So, cluster center remains the same.

For Cluster-02:

Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
For Cluster-03:

Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)

This completes Iteration-01.

Iteration-02:

We calculate the distance of each point from each of the centers of the three clusters.
The distance is calculated by using the given distance function.
The following illustration shows the calculation of the distance between point A1(2, 10)
and each of the centers of the three clusters-

Calculating Distance Between A1(2, 10) and C1(2, 10)-

Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|= 0

Calculating Distance Between A1(2, 10) and C2(6, 6)-

Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |6 – 2| + |6 – 10|
= 4 + 4= 8
Calculating Distance Between A1(2, 10) and C3(1.5, 3.5)-
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1.5 – 2| + |3.5 – 10|
= 0.5 + 6.5 = 7
In a similar manner, we calculate the distances of the other points from each of
the centers of the three clusters.
Next,
We draw a table showing all the results.
Using the table, we decide which point belongs to which cluster.
The given point belongs to that cluster whose center is nearest to it.
Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (6, 6) of Cluster-02 | Distance from center (1.5, 3.5) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 8 | 7 | C1
A2(2, 5) | 5 | 5 | 2 | C3
A3(8, 4) | 12 | 4 | 7 | C2
A4(5, 8) | 5 | 3 | 8 | C2
A5(7, 5) | 10 | 2 | 7 | C2
A6(6, 4) | 10 | 2 | 5 | C2
A7(1, 2) | 9 | 9 | 2 | C3
A8(4, 9) | 3 | 5 | 8 | C1
From here, New clusters are-
Cluster-01:
First cluster contains points-
A1(2, 10) A8(4, 9)
Cluster-02:
Second cluster contains points-
A3(8, 4) A4(5, 8) A5(7, 5) A6(6, 4)

Cluster-03:

Third cluster contains points-


A2(2, 5)
A7(1, 2)
Now,
We re-compute the new cluster centers.
The new cluster center is computed by taking the mean of all the points contained in that
cluster.
For Cluster-01:
Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)

For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)

For Cluster-03:

Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
This completes Iteration-02.
After the second iteration, the centers of the three clusters are-
C1(3, 9.5)
C2(6.5, 5.25)
C3(1.5, 3.5)
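As a cross-check (not part of the original solution), the two iterations above can be reproduced with a few lines of Python, reusing the Manhattan distance ρ(a, b) = |x2 – x1| + |y2 – y1| given in the problem and the fixed initial centers A1, A4, A7:

```python
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)  # A1..A8
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)        # A1, A4, A7

for it in range(2):  # the problem asks for two iterations
    # Manhattan distance of every point to every center (no ties at the minimum here)
    dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    # new center = mean of the points assigned to each cluster
    centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])
    print(f"Centers after iteration {it + 1}: {centers.round(2).tolist()}")
# Expected output after iteration 2: [[3.0, 9.5], [6.5, 5.25], [1.5, 3.5]]
```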
Problem-02:

Use K-Means Algorithm to create two clusters from the following points:
A(2, 2), B(3, 2), C(1, 1), D(3, 1), E(1.5, 0.5)

Solution-
We follow the above discussed K-Means Clustering Algorithm.
Assume A(2, 2) and C(1, 1) are centers of the two clusters.
Iteration-01:

We calculate the distance of each point from each of the centers of the
two clusters.
The distance is calculated by using the Euclidean distance formula.
The following illustration shows the calculation of the distance between point A(2,
2) and each of the centers of the two clusters-

Calculating Distance Between A(2, 2) and C1(2, 2)-

Ρ(A, C1)
= sqrt[ (x2 – x1)² + (y2 – y1)² ]
= sqrt[ (2 – 2)² + (2 – 2)² ] = sqrt[ 0 + 0 ] = 0

Calculating Distance Between A(2, 2) and C2(1, 1)-

Ρ(A, C2)
= sqrt[ (x2 – x1)² + (y2 – y1)² ]
= sqrt[ (1 – 2)² + (1 – 2)² ] = sqrt[ 1 + 1 ] = sqrt[ 2 ] ≈ 1.41

In a similar manner, we calculate the distances of the other points from each of the
centers of the two clusters.
• We draw a table showing all the results.
• Using the table, we decide which point belongs to which cluster.
• The given point belongs to that cluster whose center is nearest to it.
Given Points | Distance from center (2, 2) of Cluster-01 | Distance from center (1, 1) of Cluster-02 | Point belongs to Cluster
A(2, 2) | 0 | 1.41 | C1
B(3, 2) | 1 | 2.24 | C1
C(1, 1) | 1.41 | 0 | C2
D(3, 1) | 1.41 | 2 | C1
E(1.5, 0.5) | 1.58 | 0.71 | C2
From here, New clusters are-

Cluster-01:

First cluster contains points-
A(2, 2), B(3, 2), D(3, 1)

Cluster-02:
Second cluster contains points-
C(1, 1) E(1.5, 0.5)

Now,
We re-compute the new cluster centers.
The new cluster center is computed by taking the mean of all the
points contained in that cluster.
For Cluster-01:

Center of Cluster-01
= ((2 + 3 + 3)/3, (2 + 2 + 1)/3) = (2.67, 1.67)

For Cluster-02:

Center of Cluster-02
= ((1 + 1.5)/2, (1 + 0.5)/2) = (1.25, 0.75)

This completes Iteration-01.


Next, we go to iteration-02, iteration-03 and so on until the
centers no longer change.
K-MODES CLUSTERING ALGORITHM FOR CATEGORICAL DATA
K-MODES
• K-Modes clustering is one of the unsupervised machine learning
algorithms used to cluster categorical variables.

• It groups objects into clusters based on the similarity and dissimilarity
between them.

• Unlike traditional clustering algorithms that use distance metrics, KModes


works by identifying the modes or most frequent values within each cluster to
determine its centroid.

• KModes is ideal for clustering categorical data such as customer demographics,


market segments, or survey responses.
K-MODES VS K-MEANS
• K-Means uses a mathematical measure (distance) to cluster continuous data.
The smaller the distance, the more similar the data points are.

• Centroids are updated by taking the mean.

• But for categorical data points, we cannot calculate a meaningful distance. So we
use the K-Modes algorithm.

• It uses the dissimilarities (total mismatches) between the data points.

• The fewer the dissimilarities, the more similar the data points are. It uses
modes instead of means.
HOW THE K-MODES ALGORITHM WORKS

• Step 1 :Pick K observations at random and use them as leaders/clusters

• Step 2 : Calculate the dissimilarities and assign each observation to its closest

cluster

• Step 3: Define new modes for the clusters

• Step 4 : Repeat steps 2–3 until no re-assignment is required
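A minimal Python sketch of these four steps follows. It assumes each record is a tuple of categorical values; the helper names dissimilarity and kmodes are illustrative and not from the slides.

```python
import random
from collections import Counter

def dissimilarity(a, b):
    """Step 2 measure: number of mismatching positions between two records."""
    return sum(x != y for x, y in zip(a, b))

def kmodes(records, k, max_iter=100, seed=0):
    random.seed(seed)
    # Step 1: pick K observations at random as the initial leaders
    leaders = [list(r) for r in random.sample(records, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each observation to the leader with the least dissimilarity
        new_labels = [min(range(k), key=lambda j: dissimilarity(r, leaders[j]))
                      for r in records]
        if new_labels == labels:          # Step 4: stop when nothing is re-assigned
            break
        labels = new_labels
        # Step 3: new leader = per-feature mode of the records in each cluster
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                leaders[j] = [Counter(col).most_common(1)[0][0]
                              for col in zip(*members)]
    return leaders, labels
```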


EXAMPLE :
EXAMPLE

• Alright, we have the sample data


now. Let us proceed by defining
the number of clusters(K)=3
• Step 1: Pick K observations at
random and use them as
leaders/clusters
EXAMPLE

• Step 2: Calculate the


dissimilarities(no. of mismatches)
and assign each observation to its
closest cluster
• Iteratively compare the cluster data
points to each of the observations.
Similar data points give 0, dissimilar
data points give 1.
• Comparing leader/Cluster P1 to the
observation P1 gives 0 dissimilarities.
EXAMPLE

• Comparing leader/cluster P1 to
the observation P2 gives
3(1+1+1) dissimilarities.
EXAMPLE

• Likewise, calculate all the

dissimilarities and put them in a

matrix as shown below and assign

the observations to their closest

cluster(cluster that has the least

dissimilarity)
EXAMPLE

• After step 2, the observations P1, P2, P5 are assigned to cluster 1; P3, P7 are

assigned to Cluster 2; and P4, P6, P8 are assigned to cluster 3.

• Note: If all the clusters have the same dissimilarity with an observation, assign

to any cluster randomly. In our case, the observation P2 has 3 dissimilarities

with all the leaders. I randomly assigned it to Cluster 1.


EXAMPLE

• Step 3: Define new modes for the


clusters

• Mode is simply the most observed


value.

• Mark the observations according to the


cluster they belong to. Observations of
Cluster 1 are marked in Yellow, Cluster 2
are marked in Brick red, and Cluster 3
are marked in Purple.
EXAMPLE

• Considering one cluster at a time, for each


feature, look for the Mode and update the
new leaders.
• Explanation: Cluster 1 observations(P1,
P2, P5) has brunette as the most observed
hair color, amber as the most observed eye
color, and fair as the most observed skin
color.
• Note: If you observe the same number of occurrences of values, take the mode
randomly. In our case, the observations of Cluster 2 (P3, P7) have only one
occurrence each of their feature values, so the mode is picked randomly among them.
EXAMPLE

• Repeat steps 2–4

• After obtaining the new leaders, again

calculate the dissimilarities between the

observations and the newly obtained leaders.

• Comparing Cluster 1 to the observation P1

gives 1 dissimilarity.
EXAMPLE

• Comparing Cluster 1 to the

observation P2 gives 2

dissimilarities.
EXAMPLE
• Likewise, calculate all the dissimilarities
and put them in a matrix. Assign each
observation to its closest cluster.
• The observations P1, P2, P5 are assigned
to Cluster 1; P3, P7 are assigned to
Cluster 2; and P4, P6, P8 are assigned to
Cluster 3.
• We stop here as we see there is no
change in the assignment of
observations.
K-PROTOTYPES CLUSTERING ALGORITHM
WHAT IS K-PROTOTYPES CLUSTERING?
• K-Prototypes clustering was created to handle clustering with mixed
data types (numerical and categorical variables).
• K-Prototypes is a clustering method based on partitioning.
• Its algorithm is an improvement of the K-Means and K-Modes clustering
algorithms to handle clustering with mixed data types.
• In k-prototypes clustering, we select k-prototypes randomly at the start.
• After that, we calculate the distance between each data point and the
prototypes. Accordingly, all the data points are assigned to clustering
associated with different prototypes.
DISTANCE MEASURES IN K-PROTOTYPES CLUSTERING :
• As the k-prototypes clustering algorithm deals with data having numerical as well as
categorical attributes, it uses different measures for both data types.
• For numerical data, the k-prototypes clustering uses squared euclidean distance as the
distance measure.
• For instance, if we are given two data points (1,2,3) and (4, 3, 3), the distance between these
two data points will be calculated as (1-4)^2+(2-3)^2+(3-3)^2 which is equal to 10.
• For categorical attributes, the k-prototypes clustering algorithm follows matching dissimilarity.
If you have two records (A, B, C, D) and (A, D, C, C) with categorical attributes, the matching
dissimilarity is the number of different values at each position in the records.
• In the given records, values are different at the two positions only. Hence, the matching
dissimilarity between the records will be 2.
DISTANCE MEASURES IN K-PROTOTYPES CLUSTERING :
• For records having mixed attributes, we calculate the distance between
categorical and numerical attributes separately.
• After that, we use the sum of the dissimilarity scores as the distance between
two records. For instance, consider that we have two records [‘A’, ‘B’, ‘F’, 155,
53] and [‘A’, ‘A’, ‘M’, 174, 70].
• To find the distance between these two records, we will first find the
dissimilarity score between [‘A’, ‘B’, ‘F’] and [‘A’, ‘A’, ‘M’]. The score is 2 as
two attributes out of three have different values.
• Next, we will calculate the square Euclidean distance between [155, 53] and
[174, 70]. Here, (155-174)^2 + (53-70)^2 which is equal to 650.
DISTANCE MEASURES IN K-PROTOTYPES CLUSTERING :
• Now, we can directly calculate the total dissimilarity score as the sum of the
dissimilarity score between categorical attributes and the square Euclidean
distance of numerical attributes. Here, the sum will be equal to 650+2=652.
• Observe that the matching dissimilarity score of categorical attributes is almost
negligible compared to the square Euclidean distance between numerical
attributes. Hence, the categorical attributes will have little or no effect on
clustering.
• To solve this problem, we can scale the values in numeric attributes within a
range of say 0 to 5.
• Alternatively, we can take a weighted sum of the matching dissimilarity scores
and the square Euclidean distance.
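A small sketch of this mixed distance in Python; gamma is the optional weight on the categorical part mentioned in the last bullet (with gamma = 1 it reproduces the 650 + 2 = 652 example). The function name is illustrative.

```python
def mixed_distance(a_cat, a_num, b_cat, b_num, gamma=1.0):
    """K-Prototypes distance: squared Euclidean distance on the numeric part
    plus gamma times the matching dissimilarity on the categorical part."""
    numeric = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    categorical = sum(x != y for x, y in zip(a_cat, b_cat))
    return numeric + gamma * categorical

# The records from the slides: the categorical part differs in 2 positions,
# the numeric part gives 650, so the total is 652 (with gamma = 1).
print(mixed_distance(['A', 'B', 'F'], [155, 53], ['A', 'A', 'M'], [174, 70]))
```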
CHOICE OF NEW PROTOTYPES IN K-PROTOTYPES CLUSTERING :
• Once a cluster is formed, we need to calculate a new prototype for the cluster
using the data points in the current cluster.
• To calculate the new prototype for any given cluster, we will take mode of
categorical attributes of the data points in the cluster.
• For numerical attributes, we will use the mean of the values to calculate new
prototype for the cluster.
• For example, suppose that we have the following data points in a cluster.
EXAMPLE :
• In the above cluster, we will take the mode of values in the EQ Rating, IQ
Rating, and Gender attributes.
• For the attributes Height and Weight, we will take the mean of the values to
calculate the new prototype.
• Hence, the prototype for the above cluster will be [B, A , F, 4.71978,3.999999 ].
• Here, B is the mode of values in the EQ Rating column. A is the mode of the
values in the IQ Rating column, and F is the mode of the values in
the Gender column.
• Similarly, 4.71978 and 3.999999 are mean of
the Height and Weight attributes respectively.
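A short sketch of this prototype update, assuming each cluster's records are given as separate lists of categorical and numeric values; the function name is illustrative.

```python
from collections import Counter

def new_prototype(cat_records, num_records):
    """Prototype = per-column mode of the categorical attributes followed by
    the per-column mean of the numeric attributes."""
    modes = [Counter(col).most_common(1)[0][0] for col in zip(*cat_records)]
    means = [sum(col) / len(col) for col in zip(*num_records)]
    return modes + means
```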
K–PROTOTYPES CLUSTERING ALGORITHM :

• The steps in the k-prototypes clustering algorithm are discussed below.

1. First, we select K data points from the input dataset as initial prototypes.

2. We then find the distance of each data point from the current prototypes. The
distances are calculated as discussed in the previous sections.

3. After finding the distance of each data point from the prototypes, we assign
data points to clusters. Here, each data point is assigned to the cluster with the
prototype nearest to the data point.
K-PROTOTYPES CLUSTERING
ALGORITHM:
4. After assigning data points to the clusters, we calculate new prototypes for
each cluster. To calculate the prototypes, we take the mean of numeric attributes
and the mode of categorical attributes as discussed previously.
5. If the new prototypes are the same as the previous prototypes, we say that
the algorithm has converged. Hence, the current clusters are finalized.
Otherwise, we go to 2.
NUMERICAL EXAMPLE :

• The above table isn't normalized. Because of this, the dissimilarity score of
categorical attributes is negligible compared to the differences between
numerical attributes, and the clustering will be biased on the basis of height
and weight. To avoid this bias, we will normalize the dataset as shown in the
following table.
NUMERICAL EXAMPLE :

• In the above table, we have normalized the Height and Weight attributes in the

range 0 to 5. Now, the distance between the numeric attributes will not be very

large compared to the dissimilarity between the categorical attributes.

• Let us now work through the numerical example for k-prototypes clustering using the
normalized dataset having mixed data types.


NUMERICAL EXAMPLE ITERATION 1 :
• Suppose that we want to classify the input dataset into 3 clusters. Hence, we will
select three data points as initial prototypes. Here, we will select Student 5, Student 9,
and Student 13 as initial prototypes. Hence,
• prototype1= [‘B’, ‘A’, ‘M’, 4.175824, 4.470588]
• prototype2= [‘A’, ‘C’, ‘M’, 4.945055, 5.0]
• prototype3=[‘A’, ‘B’, ‘M’, 4.395604, 3.6470589]
• Now, let us calculate the distance of each data point with the prototypes. The distance
is calculated using the matching dissimilarity and squared distance measure
discussed in the previous sections. The distance has been calculated and tabulated
below.
NUMERICAL EXAMPLE ITERATION 1 :
NUMERICAL EXAMPLE ITERATION 1 :
• In the above table, we have first calculated the distance of each data point from
the initial prototypes. Then, we have assigned the data points to the nearest prototype.
• After creating the clusters, we will calculate the new prototypes for each cluster
using the mode of the categorical attributes and the mean of the numerical attributes.
NUMERICAL EXAMPLE 1 :
NUMERICAL EXAMPLE ITERATION 1 :
NUMERICAL EXAMPLE ITERATION 1 :
• Hence, the prototype for cluster 3 is [A, B, F, 4.580062, 3.773109].
• After iteration 1, we have the following prototypes.
• prototype1= [B, A, F, 4.656593, 3.970588]
• prototype2= [A, C, M, 4.615384, 4.411764]
• prototype3=[A, B, F, 4.580062, 3.773109]
• You can observe that the current prototypes are not the same as the initial
prototypes. Hence, we will calculate the distance of the data points in the
dataset to these prototypes and reassign the points to the clusters.
NUMERICAL EXAMPLE ITERATION 2 :
NUMERICAL EXAMPLE ITERATION 2:

• In the above table, we have


reassigned the data points to the
clusters based on their distance from
the prototypes that were calculated in
iteration 1. Now, we will again
calculate the prototype for each
cluster.
• In cluster 1, we have the following
points.
NUMERICAL EXAMPLE ITERATION 2 :
NUMERICAL EXAMPLE ITERATION 2 :
NUMERICAL EXAMPLE ITERATION 2 :
• The prototype for cluster 3 is [A, B, F, 4.461538, 3.635294].
• After iteration 2, we have the following prototypes.
• prototype1= [B, A , F, 4.71978,3.999999].
• prototype2= [A, C, M, 4.648351, 4.3529412].
• prototype3=[A, B, F, 4.461538, 3.635294].
• You can observe that the current prototypes are not the same as the prototypes
calculated after iteration 1. Hence, we will calculate the distance of the data
points in the dataset to these prototypes and reassign the points to the
clusters.
CONTINUED …..

• Likewise, we keep calculating the new prototypes for each cluster until the
data points in the clusters no longer change.
• In this example it takes 4 iterations to converge.
• The adjacent table shows the data points and clusters formed after
iteration 4.
ADVANTAGES :
• The first advantage of k-prototypes clustering is that it can be used for
datasets having mixed data types. As we can cluster datasets with numerical
and categorical attributes, the usability of the k-prototypes clustering algorithm
becomes high in real-life situations.
• K-Prototypes clustering algorithm is easy to implement. It uses simple
mathematical constructs and is an iterable algorithm. Hence, it becomes easy
to understand and implement this algorithm.
• K-prototypes clustering is a scalable clustering algorithm. You can use it for
smaller as well as large datasets.
• The K-Prototypes clustering algorithm is guaranteed to converge. Hence, it is
sure that we will get the output clusters if we execute the algorithm.
ADVANTAGES :
• The K-Prototypes clustering algorithm isn’t restricted to a particular domain. If
we have structured datasets with numeric and categorical attributes, we can
use the k-prototypes clustering algorithm in any domain.

• The number of iterations and the shape of clusters in K-Prototypes clustering


are highly dependent on the initial prototypes. The algorithm allows us to
warm-start the choice of clusters. Hence after data preprocessing and
exploratory analysis, you can choose suitable prototypes initially so that the
number of iterations can be minimized while executing the algorithm.
DISADVANTAGES :
• The K-prototypes clustering algorithm, just like the k-means and k-modes clustering algorithm, faces
the curse of dimensionality. With an increasing number of attributes in the input dataset, the
dissimilarity between the data points starts to become almost the same. In such a case, we need to
define a mechanism to specify which cluster a data point is assigned to if the data points have the
same dissimilarity with two prototypes.
• In K-prototypes clustering, we don’t know the optimal number of clusters. Due to this, we need to
try different numbers of clusters to find the optimal k for clustering data into k clusters.
• The number of iterations in the k-prototypes clustering for convergence depends on the choice of
initial prototypes. Due to this, if the prototypes aren’t selected in an efficient way, the runtime for
the algorithm will be longer.
• The shape of the clusters in k-prototypes clustering depends on the initial prototypes. Hence, the k-
prototypes algorithm might give different results for each run.
GAUSSIAN MIXTURE MODELS
EM ALGORITHM
Algorithm:

1.Given a set of incomplete data, consider a set of


starting parameters.
2. Expectation step (E – step): Using the observed
available data of the dataset, estimate (guess) the
values of the missing data.
3. Maximization step (M – step): Complete data
generated after the expectation (E) step is used to
update the parameters.
4. Repeat step 2 and step 3 until convergence.
1. Initially, a set of initial values of the parameters are
considered.
2. The next step is known as “Expectation” – step or E-step. In this
step, we use the observed data to estimate or guess the values of
the missing or incomplete data.
3.The next step is known as “Maximization”-step or M-step. In
this step, we use the complete data generated in the preceding
“Expectation” – step to update the values of the parameters.
4. Now, in the fourth step, we check whether the values are converging
or not; if yes, then stop, otherwise repeat step-2 and step-3.
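For a Gaussian Mixture Model these E- and M-steps take a concrete form. Below is a minimal one-dimensional, NumPy-only sketch (illustrative initialization, no log-space tricks or convergence test), not code from the slides.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    """EM for a one-dimensional Gaussian mixture with k components."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Step 1: initial guesses for the parameters
    means = rng.choice(x, size=k, replace=False)
    stds = np.full(k, x.std() + 1e-6)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point
        dens = weights * np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) \
               / (stds * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and standard deviations from the responsibilities
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk) + 1e-6
    return weights, means, stds
```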
Flow chart for EM algorithm:
Usage of EM algorithm –
•It can be used to fill the missing data in a sample.
•It can be used as the basis of unsupervised learning of
clusters.
•It can be used for discovering the values of latent variables.

Advantages of EM algorithm –
• It is always guaranteed that likelihood will increase with
each iteration.
• The E-step and M-step are often easy for many problems in
terms of implementation.
• Solutions to the M-steps often exist in the closed form.
Reinforcement Learning
What is Reinforcement Learning?
Unlike supervised and unsupervised learning, reinforcement
learning is a feedback-based approach in which an agent
learns by performing actions and observing their outcomes.
Based on the status of an action (good or bad), the agent gets positive
or negative feedback: for each positive feedback the agent is
rewarded, whereas for each negative feedback it is penalized.
Key points in Reinforcement Learning

Reinforcement learning does not require any labeled data for the
learning process. It learns through the feedback of the actions
performed by the agent. Moreover, in reinforcement learning,
agents also learn from past experiences.
Reinforcement learning methods are used to solve tasks where
decision-making is sequential and the goal is long-term, e.g.,
robotics, online chess, etc.
Reinforcement learning aims to get the maximum positive feedback
so that the agent can improve its performance.
Reinforcement learning involves various actions, which include
taking an action, changing or keeping the state, and getting feedback.
Based on these actions, agents learn and explore the environment.
Exploitation in Reinforcement Learning:

Exploitation is defined as a greedy approach in which
agents try to get more rewards by using the estimated value rather
than the actual value. So, in this technique, agents make the
best decision based on current information.
Exploration in Reinforcement Learning:

Unlike exploitation, in exploration techniques, agents


primarily focus on improving their knowledge about each
action instead of getting more rewards so that they can get
long-term benefits. So, in this technique, agents work on
gathering more information to make the best overall
decision.
Examples of Exploitation and Exploration in Machine
Learning Let's understand exploitation and exploration with
some interesting real-world examples.
Coal mining:
Let's suppose people A and B are digging in a coal mine in
the hope of getting a diamond inside it. Person B got success
in finding the diamond before person A and walks off happily.
After seeing him, person A gets a bit greedy and thinks he too
might get success in finding diamond at the same place where
person B was digging coal. This action performed by person
A is called greedy action, and this policy is known as a
greedy policy. But person A was unknown because a bigger
diamond was buried in that place where he was initially
digging the coal, and this greedy policy would fail in this
situation.
In this example, person A only got knowledge of the place
where person B was digging but had no knowledge of what
lies beyond that depth.
But in the actual scenario, the diamond could be buried in
the same place where he was digging initially or in some
completely different place.
Hence, with this partial knowledge about getting more
rewards, our reinforcement learning agent will be in a
dilemma on whether to exploit the partial knowledge to
receive some rewards or it should explore unknown actions
which could result in many rewards.
However, both these techniques cannot be used
simultaneously, but this issue can be resolved by using the
Epsilon Greedy Policy (explained below).
There are a few other examples of Exploitation and
Exploration in Machine Learning as follows:
Example 1: Let's say we have a scenario of online
restaurant selection for food orders, where you have two
options to select the restaurant.
In the first option, you can choose your favorite
restaurant from where you ordered food in the past; this
is called exploitation because here, you only know
information about a specific restaurant.
And for other options, you can try a new restaurant to
explore new varieties and tastes of food, and it is called
exploration. However, food quality might be better in the
first option, but it is also possible that it is more
delicious in another restaurant.
Example 2: Suppose there is a game-playing platform
where you can play chess with robots.
To win this game, you have two choices either play the
move that you believe is best, and for the other choice,
you can play an experimental move.
However, while you may be playing the best known move,
who knows, the new move might be more strategic for winning this
game.
Here, the first choice is called exploitation, where you
know about your game strategy, and the second choice is
called exploration, where you are expanding your
knowledge by playing a new move to win the game.
Epsilon Greedy Policy
Epsilon greedy policy is defined as a technique to maintain
a balance between exploitation and exploration. However,
to choose between exploration and exploitation, a very
simple method is to select randomly. This can be done by
choosing exploitation most of the time with a little
exploration.
In the greedy epsilon strategy, an exploration rate or
epsilon (denoted as ε) is initially set to 1. This exploration
rate defines the probability of exploring the
environment by the agent rather than exploiting it. It
also ensures that the agent will start by exploring the
environment with ε=1.
As the agent starts and learns more about the environment,
epsilon decreases at a defined rate, so the likelihood of
exploration becomes less and less probable as the agent
learns more and more about the environment. In such a case,
the agent becomes greedy and exploits the environment.
To determine whether the agent will select exploration or exploitation at
each step, we generate a random number between 0 and 1
and compare it to epsilon.

If this random number is greater than ε, then the next


action would be decided by the exploitation method. Else
it must be exploration.

In the case of exploitation, the agent will take the action with the
highest Q-value for the current state.
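The selection rule just described is only a few lines of code. The sketch below assumes the Q-table is stored as a 2-D NumPy array indexed as (state, action); that layout and the function name are illustrative, not from the slides.

```python
import numpy as np

def choose_action(q_table, state, epsilon, rng=None):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:                       # exploration: random action
        return int(rng.integers(q_table.shape[1]))
    return int(np.argmax(q_table[state]))            # exploitation: highest Q-value
```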
Notion of Regret:
Whenever we do something and don't get the proper
outcome, we regret our decision, as in the previously
discussed example of exploitation and exploration for
choosing a restaurant. For that example, if we choose a
new restaurant instead of our favorite one, but the food quality
and overall experience are poor, then we will regret our
decision and will consider what we paid for as a complete
loss. Moreover, if we order food from the same restaurant
again, the regret level increases along with the number of
losses. However, reinforcement learning methods can
reduce the amount of loss and the level of regret.
Regret in Reinforcement Learning
Before understanding the regret in reinforcement learning,
we must know the optimal action 'a*', which is the action
that gives the highest rewards. It is given as follows:

Hence, the regret in reinforcement learning can be defined as
the difference between the expected reward of the optimal
action a* multiplied by T and the sum, from t = 1 to T, of the
expected reward of each action actually taken. It can be expressed as follows:

Regret: L_T = T · E[r | a*] − Σ_{t=1..T} E[r | a_t]


What is Associative Learning
Associative learning is a type of learning that happens
when two unrelated elements get connected in our brains
through the process of conditioning. Our brains usually
do not recall information in isolation; we generally group
information together with our associative memory. This is
something we do quite naturally. Associative learning is a
form of conditioning that involves a stimulus and a
response. For example, when you eat a certain type of
food, you may experience stomachaches. After you make
the association between the food and your health, you will
learn to avoid this food altogether. Similarly, as a child,
you may have noticed that good grades get you praise and
rewards, so you would have tried your best to achieve
good grades.
The concept of associative learning is very important in
the field of education. Classical conditioning and operant
conditioning are two classical types of associative
learning.
What is Non-associative Learning:

Non-associative learning is a type of learning where there


is no association between a stimulus and a behavior. It’s a
very simple form of learning. In non-associative learning,
an organism’s behavior toward a certain stimulus changes
over time in the absence of any evident association with
consequences or other stimuli that would induce such
change. Moreover, habituation and sensitization are the
two basic non-associative learning methods.
Habituation is a decrease in an innate response to a
frequently repeated stimulus. For instance, if you are
working with a radio playing in the background, the noise
will distract you at first. But after a while, you will
gradually tune out the noise and focus on your work.
Sensitization, on the other hand, is the increased reaction
to a stimulus after repeated exposure to that stimulus. In
this instance, you become more sensitive to the stimulus as
time goes on. Here, frequent exposure to a stimulus
increases the strength of the reaction to a stimulus.
HIDDEN MARKOV MODELS :
• Hidden Markov Models (HMMs) are a type of probabilistic model that are commonly used in
machine learning for tasks such as speech recognition, natural language processing, and
bioinformatics.
• A Hidden Markov Model (HMM) is a probabilistic model that consists of a sequence of hidden states, each
of which generates an observation.
• The hidden states are usually not directly observable, and the goal of HMM is to estimate the sequence of
hidden states based on a sequence of observations. An HMM is defined by the following components:
HIDDEN MARKOV MODELS :
o A set of N hidden states, S = {s1, s2, ..., sN}.

o A set of M observations, O = {o1, o2, ..., oM}.

o An initial state probability distribution, π = {π1, π2, ..., πN}, which specifies the probability of starting in each
hidden state.

o A transition probability matrix, A = [aij], defines the probability of moving from one hidden state to another.

o An emission probability matrix, B = [bjk], defines the probability of emitting an observation from a given
hidden state.

o The basic idea behind an HMM is that the hidden states generate the observations, and the observed data is
used to estimate the hidden state sequence. This estimation is often carried out with the forward-backward algorithm.
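Given the components above (π, A, B), the probability of an observation sequence can be computed with the forward pass. This is a minimal NumPy sketch with illustrative variable names, not a full forward-backward implementation.

```python
import numpy as np

def hmm_forward(pi, A, B, obs):
    """Forward algorithm: probability of an observation sequence given an HMM
    with initial distribution pi (N,), transition matrix A (N, N) and
    emission matrix B (N, M); obs is a list of observation indices."""
    alpha = pi * B[:, obs[0]]              # initialise with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # propagate one step and emit
    return alpha.sum()                     # total probability of the sequence
```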
APPLICATIONS :
• Speech Recognition :
• One of the most well-known applications of HMMs is speech recognition. In this field, HMMs are
used to model the different sounds and phones that make up speech. The hidden states, in this case,
correspond to the different sounds or phones, and the observations are the acoustic signals that are
generated by the speech. The goal is to estimate the hidden state sequence, which corresponds to
the transcription of the speech, based on the observed acoustic signals. HMMs are particularly well-
suited for speech recognition because they can effectively capture the underlying structure of the
speech, even when the data is noisy or incomplete. In speech recognition systems, the HMMs are
usually trained on large datasets of speech signals, and the estimated parameters of the HMMs are
used to transcribe speech in real time.
APPLICATIONS :
• Natural Language Processing :

Another important application of HMMs is natural language processing. In this field, HMMs
are used for tasks such as part-of-speech tagging, named entity recognition, and text
classification. In these applications, the hidden states are typically associated with the underlying
grammar or structure of the text, while the observations are the words in the text. The goal is to
estimate the hidden state sequence, which corresponds to the structure or meaning of the text, based
on the observed words. HMMs are useful in natural language processing because they can effectively
capture the underlying structure of the text, even when the data is noisy or ambiguous.
APPLICATIONS:
• Bio Informatics :
HMMs are also widely used in bioinformatics, where they are used to model sequences of
DNA, RNA, and proteins. The hidden states, in this case, correspond to the different types of
residues, while the observations are the sequences of residues. The goal is to estimate the hidden state
sequence, which corresponds to the underlying structure of the molecule, based on the observed
sequences of residues. HMMs are useful in bioinformatics because they can effectively capture the
underlying structure of the molecule, even when the data is noisy or incomplete.
APPLICATIONS :
Finance :

Finally, HMMs have also been used in finance, where they are used to model stock prices,
interest rates, and currency exchange rates. In these applications, the hidden states correspond to
different economic states, such as bull and bear markets, while the observations are the stock prices,
interest rates, or exchange rates. The goal is to estimate the hidden state sequence, which
corresponds to the underlying economic state, based on the observed prices, rates, or exchange
rates.
LIMITATIONS OF HIDDEN MARKOV MODELS :
• Now, we will explore some of the key limitations of HMMs and discuss how they can impact the accuracy and performance
of HMM-based systems.
• Limited Modeling Capabilities:

One of the key limitations of HMMs is that they are relatively limited in their modelling capabilities. HMMs are designed to
model sequences of data, where the underlying structure of the data is represented by a set of hidden states. However, the
structure of the data can be quite complex, and the simple structure of HMMs may not be enough to accurately capture all the
details. For example, in speech recognition, the complex relationship between the speech sounds and the corresponding acoustic
signals may not be fully captured by the simple structure of an HMM.
LIMITATIONS OF HIDDEN MARKOV MODELS :
Overfitting :

Another limitation of HMMs is that they can be prone to overfitting, especially when the number of
hidden states is large, or the amount of training data is limited. Overfitting occurs when the model fits
the training data too well and is unable to generalize to new data. This can lead to poor performance
when the model is applied to real-world data and can result in high error rates. To avoid overfitting, it
is important to carefully choose the number of hidden states and to use appropriate regularization
techniques.
LIMITATIONS OF HIDDEN MARKOV MODELS :
• Lack of Robustness:
HMMs are also limited in their robustness to noise and variability in the data. For
example, in speech recognition, the acoustic signals generated by speech can be subjected to a
variety of distortions and noise, which can make it difficult for the HMM to accurately estimate the
underlying structure of the data. In some cases, these distortions and noise can cause the HMM to
make incorrect decisions, which can result in poor performance. To address these limitations, it is
often necessary to use additional processing and filtering techniques, such as noise reduction and
normalization, to pre-process the data before it is fed into the HMM.
LIMITATIONS OF HIDDEN MARKOV MODELS :
• Computational Complexity:
Finally, HMMs can also be limited by their computational complexity, especially when
dealing with large amounts of data or when using complex models. The computational complexity
of HMMs is due to the need to estimate the parameters of the model and to compute the likelihood
of the data given in the model.
Q Learning
Q Learning :
Let’s say that a robot has to cross a maze and reach the end point.
There are mines, and the robot can only move one tile at a time. If
the robot steps onto a mine, the robot is dead. The robot has to
reach the end point in the shortest time possible.
The scoring/reward system is as below:
1. The robot loses 1 point at each step. This is done so that the robot takes
the shortest path and reaches the goal as fast as possible.
2. If the robot steps on a mine, the point loss is 100 and the game ends.
3. If the robot gets power ⚡️, it gains 1 point.
4. If the robot reaches the end goal, the robot gets 100 points.
Now, the obvious question is: How do we train a robot to reach the end
goal with the shortest path without stepping on a mine?
Introducing the Q-Table
Q-Table is just a fancy name for a simple lookup
table where we calculate the maximum expected
future rewards for action at each state. Basically,
this table will guide us to the best action at each
state.
There are four possible actions at each non-edge tile. When the robot is
in a state, it can move up, down, right, or left.

So, let’s model this environment in our Q-Table.

In the Q-Table, the columns are the actions, and the rows are the states.
Each Q-table score will be the maximum expected future reward that
the robot will get if it takes that action at that state. This is an iterative
process, as we need to improve the Q-Table at each iteration.
But the questions are:
 How do we calculate the values of the Q-table?
 Are the values available or predefined?
 To learn each value of the Q-table, we use the Q-Learning algorithm.
 Mathematics: the Q-Learning algorithm
 Q-function
 The Q-function uses the Bellman equation and takes two inputs: state (s)
and action (a).
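The equation itself is not reproduced in these notes; in its standard Q-learning form, with learning rate α and discount factor γ, the Bellman update reads:

Q(s, a) ← Q(s, a) + α · [ R(s, a) + γ · max over a' of Q(s', a') − Q(s, a) ]

where s' is the state reached after taking action a in state s.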

Using the above function, we get the values of Q for the cells in
the table.
When we start, all the values in the Q-table are zeros.
There is an iterative process of updating the values. As we start to explore the
environment, the Q-function gives us better and better approximations by
continuously updating the Q-values in the table.
Now, let’s understand how the updating takes place.

Introducing the Q-learning algorithm process


Each of the colored boxes is one step. Let’s understand each of these steps in
detail.

Step 1: initialize the Q-Table


We will first build a Q-table. There are n columns, where n= number of
actions. There are m rows, where m= number of states. We will initialise the
values at 0.
In our robot example, we have four actions (a=4) and five states (s=5).
So we will build a table with four columns and five rows.
Steps 2 and 3: choose and perform an action
This combination of steps is done for an undefined amount of time.
This means that this step runs until the time we stop the training, or
the training loop stops as defined in the code.

We will choose an action (a) in the state (s) based on the Q-Table.
But, as mentioned earlier, when the episode initially starts, every Q-
value is 0.

So now the concept of exploration and exploitation trade-off comes


into play.
For the robot example, there are four actions to choose from: up,
down, left, and right. We are starting the training now — our robot
knows nothing about the environment. So the robot chooses a random
action, say right.
We can now update the Q-values for being at the start and moving right using
the Bellman equation.

Steps 4 and 5: evaluate

Now we have taken an action and observed an outcome and reward. We need
to update the function Q(s,a).
In the case of the robot game, to reiterate, the scoring/reward structure is:
 power = +1
 mine = -100
 end = +100
We will repeat this again and again until the learning is stopped.
In this way the Q-Table will be updated.
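A compact sketch of this whole training loop in Python follows. The Q-table is a NumPy array of shape (states, actions), and the environment interface env.reset()/env.step() is assumed for illustration only; it is not defined in the slides.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))            # Step 1: initialise the Q-table to 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Steps 2-3: epsilon-greedy choice of an action, then perform it
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, done = env.step(action)
            # Steps 4-5: Bellman update of Q(s, a)
            q[state, action] += alpha * (reward + gamma * np.max(q[next_state])
                                         - q[state, action])
            state = next_state
    return q
```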
