K-Means Clustering is an unsupervised learning algorithm that partitions an unlabeled dataset into K predefined clusters based on similarity. The algorithm iteratively assigns data points to the nearest centroid and recalculates centroids until convergence is achieved. The optimal number of clusters (K) can be determined using methods like the Elbow Method, which analyzes the Within Cluster Sum of Squares (WCSS) for different K values.


What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that each data point belongs to only one group of points with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into K clusters, and repeats the process until it finds the best clusters. The value of K should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
Determines the best value for the K center points or centroids by an iterative process.
Assigns each data point to its closest k-center. The data points which are near a particular k-center create a cluster.
Hence each cluster has data points with some commonalities, and it is away from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is
explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step: reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
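To make these steps concrete, here is a minimal sketch using scikit-learn's KMeans class; the data points and the choice K=2 below are made-up values for illustration, and it assumes scikit-learn is installed:

```python
# A minimal sketch of the steps above using scikit-learn.
# The data points and K=2 are made-up values for illustration only.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# n_clusters is the predetermined K; fit() runs the assign/recompute loop
# until the centroids stop moving (convergence).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("Cluster labels :", kmeans.labels_)         # cluster index of each point
print("Final centroids:", kmeans.cluster_centers_)
```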
Let's understand the above steps by
considering the visual plots:
Suppose we have two variables M1 and M2.
The x-y axis scatter plot of these two
variables is given below:

Let's take the number of clusters K=2, to identify the dataset and to put the points into different clusters. It means here we will try to group this dataset into two different clusters.
We need to choose some random K points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as K points, which are not part of our dataset. Consider the below image:

Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute it by calculating the distance between each point and the two centroids. So, we will draw a median line between both the centroids. Consider the below image:

From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity (mean) of the points in each cluster, and will find the new centroids as below:

Next, we will reassign each data point to the new centroids. For this, we will repeat the same process of finding a median line. The new median will be as in the below image:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.
We will repeat the process by finding the center of gravity of the points in each cluster, so the new centroids will be as shown in the below image:

As we have got the new centroids, we will again draw the median line and reassign the data points. So, the image will be:

We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
How to choose the value of "K number of
clusters" in K-means Clustering?
The performance of the K-means clustering algorithm depends on the quality of the clusters it forms, and choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters, but here we discuss the most appropriate method to find the number of clusters, or value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular
ways to find the optimal number of clusters.
This method uses the concept of WCSS
value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations
within a cluster. The formula to calculate the
value of WCSS (for 3 clusters) is given below:
WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²

In the above formula of WCSS,

∑Pi in Cluster1 distance(Pi, C1)²: It is the sum of the squares of the distances between each data point and its centroid within Cluster1, and the same for the other two terms.
To measure the distance between data points
and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the
elbow method follows the below steps:
It executes the K-means clustering on a given
dataset for different K values (ranges from 1-
10).
For each value of K, calculates the WCSS
value.
Plots a curve between calculated WCSS
values and the number of clusters K.
The sharp point of bend in the plot, where the curve looks like an arm's elbow, is considered the best value of K.
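As a rough illustration of these steps, the sketch below runs K-Means for K = 1 to 10 on a made-up dataset and plots the WCSS curve; it assumes scikit-learn and matplotlib are available (scikit-learn exposes the WCSS of a fitted model as inertia_):

```python
# A minimal sketch of the Elbow Method; the dataset X is made up for illustration.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

wcss = []
for k in range(1, 11):                       # K values from 1 to 10
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)                 # inertia_ is the WCSS of the fit

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()                                   # the bend ("elbow") suggests the best K
```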

K-Means Clustering Algorithm | Examples


K-Means Clustering-
K-Means clustering is an unsupervised
iterative clustering technique.
It partitions the given data set into k
predefined distinct clusters.
A cluster is defined as a collection of data
points exhibiting certain similarities.

It partitions the data set such that-
Each data point belongs to the cluster with the nearest mean.
Data points belonging to one cluster have a high degree of similarity.
Data points belonging to different clusters have a high degree of dissimilarity.

K-Means Clustering Algorithm-

K-Means Clustering Algorithm involves the following steps-

Step-01:

Choose the number of clusters K.

Step-02:

Randomly select any K data points as cluster centers.
Select cluster centers in such a way that they are as far apart from each other as possible.
Step-03:

Calculate the distance between each data point and each cluster center.
The distance may be calculated either by using a given distance function or by using the Euclidean distance formula.

Step-04:
Assign each data point to some cluster.
A data point is assigned to that cluster whose
center is nearest to that data point.

Step-05:

Re-compute the center of the newly formed clusters.
The center of a cluster is computed by taking the mean of all the data points contained in that cluster.
Step-06:

Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping criteria is met-
Centers of newly formed clusters do not change
Data points remain in the same cluster
Maximum number of iterations is reached
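The steps above translate almost directly into code. Below is a minimal from-scratch sketch in Python (NumPy) that accepts an arbitrary distance function, so the same loop works with a given distance function such as the Manhattan distance used in Problem-01 below, or with the Euclidean distance; the example data and K are made up:

```python
# A from-scratch sketch of the steps above; data and K are illustrative only.
import numpy as np

def manhattan(p, q):
    return np.abs(p - q).sum()

def kmeans(points, centers, distance, max_iters=100):
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iters):                       # Step-06: repeat
        # Step-03/04: assign each point to its nearest center
        labels = np.array([np.argmin([distance(p, c) for c in centers])
                           for p in points])
        # Step-05: recompute each center as the mean of its points
        new_centers = np.array([points[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(len(centers))])
        if np.allclose(new_centers, centers):        # stopping criterion: no change
            break
        centers = new_centers
    return centers, labels

# Example: K = 2 on a tiny made-up dataset
pts = [(1, 1), (1.5, 2), (5, 7), (6, 8)]
centers, labels = kmeans(pts, centers=[(1, 1), (5, 7)], distance=manhattan)
print(centers, labels)
```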

Advantages-

K-Means Clustering Algorithm offers the following advantages-

Point-01:

It is relatively efficient, with time complexity O(nkt), where-
n = number of instances
k = number of clusters
t = number of iterations

Point-02:

It often terminates at a local optimum.
Techniques such as Simulated Annealing or Genetic Algorithms may be used to find the global optimum.

Disadvantages-
K-Means Clustering Algorithm has the following disadvantages-
It requires the number of clusters (k) to be specified in advance.
It cannot handle noisy data and outliers.
It is not suitable for identifying clusters with non-convex shapes.
PRACTICE PROBLEMS BASED ON K-MEANS
CLUSTERING ALGORITHM-

Problem-01:

Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)

Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a =
(x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|

Use the K-Means Algorithm to find the three cluster centers after the second iteration.
Solution-

We follow the above discussed K-Means Clustering Algorithm-

Iteration-01:

We calculate the distance of each point from each of the centers of the three clusters.
The distance is calculated by using the given distance function.

The following illustration shows the calculation of the distance between point A1(2, 10) and each of the centers of the three clusters-

Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0

Calculating Distance Between A1(2, 10) and C2(5, 8)-

Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
=3+2
=5

Calculating Distance Between A1(2, 10) and C3(1, 2)-

Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
=1+8
=9
In a similar manner, we calculate the distances of the other points from each of the centers of the three clusters.

Next,
We draw a table showing all the results.
Using the table, we decide which point
belongs to which cluster.
The given point belongs to that cluster whose
center is nearest to it.

Given       Distance from     Distance from     Distance from     Point
Points      center (2, 10)    center (5, 8)     center (1, 2)     belongs to
            of Cluster-01     of Cluster-02     of Cluster-03     Cluster
A1(2, 10)   0                 5                 9                 C1
A2(2, 5)    5                 6                 4                 C3
A3(8, 4)    12                7                 9                 C2
A4(5, 8)    5                 0                 10                C2
A5(7, 5)    10                5                 9                 C2
A6(6, 4)    10                5                 7                 C2
A7(1, 2)    9                 10                0                 C3
A8(4, 9)    3                 2                 10                C2

From here, the new clusters are-
Cluster-01:
First cluster contains points-
A1(2, 10)

Cluster-02:

Second cluster contains points-
A3(8, 4)
A4(5, 8)
A5(7, 5)
A6(6, 4)
A8(4, 9)

Cluster-03:

Third cluster contains points-
A2(2, 5)
A7(1, 2)

Now,
We re-compute the new cluster centers.
The new cluster center is computed by taking the mean of all the points contained in that cluster.

For Cluster-01:

We have only one point A1(2, 10) in Cluster-01.
So, the cluster center remains the same.

For Cluster-02:

Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 +
9)/5)
= (6, 6)

For Cluster-03:

Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
This completes Iteration-01.

Iteration-02:

We calculate the distance of each point from each of the centers of the three clusters.
The distance is calculated by using the given distance function.

The following illustration shows the calculation of the distance between point A1(2, 10) and each of the centers of the three clusters-
Calculating Distance Between A1(2, 10) and C1(2, 10)-

Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0

Calculating Distance Between A1(2, 10) and C2(6, 6)-

Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |6 – 2| + |6 – 10|
=4+4
=8
Calculating Distance Between A1(2, 10) and C3(1.5,
3.5)-

Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1.5 – 2| + |3.5 – 10|
= 0.5 + 6.5
=7

In a similar manner, we calculate the distances of the other points from each of the centers of the three clusters.

Next,
We draw a table showing all the results.
Using the table, we decide which point
belongs to which cluster.
The given point belongs to that cluster whose
center is nearest to it.
Given       Distance from     Distance from     Distance from       Point
Points      center (2, 10)    center (6, 6)     center (1.5, 3.5)   belongs to
            of Cluster-01     of Cluster-02     of Cluster-03       Cluster
A1(2, 10)   0                 8                 7                   C1
A2(2, 5)    5                 5                 2                   C3
A3(8, 4)    12                4                 7                   C2
A4(5, 8)    5                 3                 8                   C2
A5(7, 5)    10                2                 7                   C2
A6(6, 4)    10                2                 5                   C2
A7(1, 2)    9                 9                 2                   C3
A8(4, 9)    3                 5                 8                   C1

From here, the new clusters are-

Cluster-01:

First cluster contains points-
A1(2, 10)
A8(4, 9)

Cluster-02:

Second cluster contains points-
A3(8, 4)
A4(5, 8)
A5(7, 5)
A6(6, 4)

Cluster-03:

Third cluster contains points-
A2(2, 5)
A7(1, 2)

Now,
We re-compute the new cluster centers.
The new cluster center is computed by taking the mean of all the points contained in that cluster.

For Cluster-01:

Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)
= (3, 9.5)

For Cluster-02:

Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)
= (6.5, 5.25)

For Cluster-03:

Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)

This completes Iteration-02.

After the second iteration, the centers of the three clusters are-
C1(3, 9.5)
C2(6.5, 5.25)
C3(1.5, 3.5)
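The arithmetic of the two iterations above can be checked with a short NumPy sketch (this is only a verification aid, not part of the original solution):

```python
# Verification sketch for Problem-01 (Manhattan distance, two iterations).
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)  # A1, A4, A7

for iteration in range(2):
    # Manhattan distance of every point to every center
    dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    centers = np.array([points[labels == k].mean(axis=0) for k in range(3)])
    print(f"Centers after iteration {iteration + 1}:\n{centers}")
# Expected after iteration 2: (3, 9.5), (6.5, 5.25), (1.5, 3.5)
```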

Problem-02:

Use the K-Means Algorithm to create two clusters from the points A(2, 2), B(3, 2), C(1, 1), D(3, 1) and E(1.5, 0.5).

Solution-
We follow the above discussed K-Means Clustering Algorithm.
Assume A(2, 2) and C(1, 1) are the centers of the two clusters.

Iteration-01:

We calculate the distance of each point from each of the centers of the two clusters.
The distance is calculated by using the Euclidean distance formula.

The following illustration shows the calculation of the distance between point A(2, 2) and each of the centers of the two clusters-

Calculating Distance Between A(2, 2) and C1(2, 2)-

Ρ(A, C1)
= sqrt [ (x2 – x1)² + (y2 – y1)² ]
= sqrt [ (2 – 2)² + (2 – 2)² ]
= sqrt [ 0 + 0 ]
= 0

Calculating Distance Between A(2, 2) and C2(1, 1)-

Ρ(A, C2)
= sqrt [ (x2 – x1)² + (y2 – y1)² ]
= sqrt [ (1 – 2)² + (1 – 2)² ]
= sqrt [ 1 + 1 ]
= sqrt [ 2 ]
= 1.41

In a similar manner, we calculate the distances of the other points from each of the centers of the two clusters.

Next,
We draw a table showing all the results.
Using the table, we decide which point
belongs to which cluster.
The given point belongs to that cluster whose
center is nearest to it.

Given         Distance from       Distance from       Point
Points        center (2, 2) of    center (1, 1) of    belongs to
              Cluster-01          Cluster-02          Cluster
A(2, 2)       0                   1.41                C1
B(3, 2)       1                   2.24                C1
C(1, 1)       1.41                0                   C2
D(3, 1)       1.41                2                   C1
E(1.5, 0.5)   1.58                0.71                C2

From here, the new clusters are-

Cluster-01:

First cluster contains points-
A(2, 2)
B(3, 2)
D(3, 1)

Cluster-02:

Second cluster contains points-
C(1, 1)
E(1.5, 0.5)

Now,
We re-compute the new cluster centers.
The new cluster center is computed by taking the mean of all the points contained in that cluster.

For Cluster-01:

Center of Cluster-01
= ((2 + 3 + 3)/3, (2 + 2 + 1)/3)
= (2.67, 1.67)

For Cluster-02:

Center of Cluster-02
= ((1 + 1.5)/2, (1 + 0.5)/2)
= (1.25, 0.75)

This completes Iteration-01.

Next, we go to Iteration-02, Iteration-03, and so on until the centers do not change anymore.
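Because this problem uses the Euclidean distance, the remaining iterations can be run to convergence with scikit-learn's KMeans, seeding it with the given initial centers A and C; this is just an illustrative check, assuming scikit-learn is available:

```python
# Sketch: run Problem-02 to convergence with scikit-learn (Euclidean distance).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])
init_centers = np.array([[2, 2], [1, 1]])        # A and C as initial centers

km = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(X)
print("Labels       :", km.labels_)
print("Final centers:", km.cluster_centers_)
```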

To gain a better understanding of the K-Means Clustering Algorithm, see:

https://www.youtube.com/watch?v=CLKW6uWJtTc

https://www.youtube.com/watch?v=5FpsGnkbEpM
SARSA Algorithm
The algorithm for SARSA is a little bit different
from Q-learning.
In the SARSA algorithm, the Q-value is
updated taking into account the action, A1,
performed in the state, S1. In Q-learning, the
action with the highest Q-value in the next
state, S1, is used to update the Q-table.
https://youtu.be/FhSaHuC0u2M
How Does the SARSA Algorithm Work?
The SARSA algorithm works by carrying out
actions based on rewards received from
previous actions. To do this, SARSA stores a
table of state (S)-action (A) estimate pairs for
each Q-value. This table is known as a Q-
table, while the state-action pairs are denoted
as Q(S, A).
The SARSA process starts by initializing Q(S,
A) to arbitrary values. In this step, the initial
current state (S) is set, and the initial action
(A) is selected by using an epsilon-greedy
algorithm policy based on current Q-values.
An epsilon-greedy policy balances the use of
exploitation and exploration methods in the
learning process to select the action with the
highest estimated reward.
Exploitation involves using already known,
estimated values to get more previously
earned rewards in the learning process.
Exploration involves attempting to find new
knowledge on actions, which may result in
short-term, sub-optimal actions during
learning but may yield long-term benefits to
find the best possible action and reward.
From here, the selected action is taken, and
the reward (R) and next state (S1) are
observed. Q(S, A) is then updated, and the
next action (A1) is selected based on the
updated Q-values. Action-value estimates of a
state are also updated for each current
action-state pair present, which estimates the
value of receiving a reward for taking a given
action.
The above steps of R through A1 are repeated
until the algorithm’s given episode ends,
which describes the sequence of states,
actions and rewards taken until the final
(terminal) state is reached. State, action and
reward experiences in the SARSA process are
used to update Q(S, A) values for each
iteration.
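As a concrete picture of the Q-table and the update just described, here is a minimal sketch; the state/action counts, hyperparameters, and the sample transition are made-up values for illustration:

```python
# Sketch of a Q-table and one SARSA update step; all numbers are illustrative.
import numpy as np

n_states, n_actions = 48, 4          # e.g. a small grid world
Q = np.zeros((n_states, n_actions))  # Q(S, A) initialized arbitrarily (here: 0)

alpha, gamma = 0.1, 0.99             # learning rate and discount factor

# Suppose the agent was in state S, took action A, received reward R,
# landed in state S1, and has already chosen its next action A1.
S, A, R, S1, A1 = 0, 1, -1.0, 12, 2
Q[S, A] += alpha * (R + gamma * Q[S1, A1] - Q[S, A])   # the SARSA update
```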
SARSA vs. Q-learning
The main difference between SARSA and Q-
learning is that SARSA is an on-policy learning
algorithm, while Q-learning is an off-policy
learning algorithm.
In reinforcement learning, two different
policies are also used for active agents: a
behavior policy and a target policy. A
behavior policy is used to decide actions in a
given state (what behavior the agent is
currently using to interact with its
environment), while a target policy is used to
learn about desired actions and what rewards
are received (the ideal policy the agent seeks
to use to interact with its environment).
If an algorithm’s behavior policy matches its
target policy, this means it is an on-policy
algorithm. If these policies in an algorithm
don’t match, then it is an off-policy algorithm.
SARSA operates by choosing an action
following the current epsilon-greedy policy
and updates its Q-values accordingly. On-
policy algorithms like SARSA select random
actions where non-greedy actions have some
probability of being selected, providing a
balance between exploitation and exploration
techniques. Since SARSA Q-values are
generally learned using the same epsilon-
greedy policy for behavior and target, it
classifies as on-policy.
Q-learning, unlike SARSA, tends to choose the
greedy action in sequence. A greedy action is
one that gives the maximum Q-value for the
state, that is, it follows an optimal policy. Off-
policy algorithms like Q-learning learn a
target policy regardless of what actions are
selected from exploration. Since Q-learning
uses greedy actions, and can evaluate one
behavior policy while following a separate
target policy, it classifies as off-policy.
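The on-policy / off-policy distinction shows up as a one-line difference in the update target. The sketch below places the two targets side by side; the Q-table and the transition values are illustrative only:

```python
# Illustrative comparison of the SARSA and Q-learning update targets.
import numpy as np

Q = np.zeros((48, 4))                  # Q-table, as in the sketch above
S, A, R, S1, A1 = 0, 1, -1.0, 12, 2    # illustrative transition
alpha, gamma = 0.1, 0.99

# SARSA (on-policy): the target uses the action A1 actually selected next.
sarsa_target = R + gamma * Q[S1, A1]

# Q-learning (off-policy): the target uses the greedy (max-value) action in S1.
q_learning_target = R + gamma * Q[S1].max()

Q[S, A] += alpha * (sarsa_target - Q[S, A])   # swap in q_learning_target for Q-learning
```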
Step 1 — Define SARSA
The SARSA (abbreviation of State-Action-Reward-State-Action) algorithm is an on-policy reinforcement learning algorithm used for solving the control problem: specifically, for estimating the action-value function Q that approximates the optimal action-value function q*.
There are a few things in this definition that
need to be explained, such as on-policy
algorithm, control problem, optimal action-
value function.
Policy
In the context of reinforcement learning (RL),
a policy is a fundamental concept defining the
behaviour of an agent in an environment (in
our case Cliff Walking environment).
Essentially, it represents a strategy that the agent follows to decide on actions based on the current state of the environment. The objective of a policy is to maximize an agent's cumulative reward over time. In our simple example, the policy is simply a look-up table, hence the name tabular method (tabular → table look-up).
On-Policy algorithm
Any algorithm that evaluates and improves the policy that is actually being used to make decisions, including the exploration steps. In contrast to "off-policy" methods, which learn the value of the best possible policy while following another (usually more exploratory) policy, on-policy methods make no distinction between the policy for learning and the policy for action selection.
Control Problem
SARSA is called a control algorithm because it focuses on finding the optimal policy for controlling the agent's behaviour in an environment, in our case the Cliff Walking environment. In RL we have two main types of problems: prediction and control.
Prediction Problem — involves estimating the value function of a given policy, which means estimating how good it is to follow a certain policy from a given state. The goal is not to change the policy but to evaluate it.
Control Problem — involves finding the
optimal policy itself, not just evaluating a
given policy. The goal is to learn a policy that
maximizes the expected return from any
initial state. This means determining the best
action to take in each state.
Action-Value Function
We know that an episode represents an alternating sequence of state-action pairs, which we can visualize like this:
Diagram of an alternating sequence of state-action pairs, as described in 'Reinforcement Learning: An Introduction' by Richard S. Sutton.
Of course, in the case of SARSA it would be good to visualize the exact sequence of interest on which the reasoning will happen:

The S A R S A sequence that gave the name to the algorithm we discuss here, by Geek on Software and Money.
SARSA addresses the control problem by iteratively updating the action-value function Q(s,a) based on the equation:
Q(S, A) ← Q(S, A) + α[R + γQ(S′, A′) − Q(S, A)]
(the action-value update rule as given in 'Reinforcement Learning: An Introduction' by Richard S. Sutton).
This update rule is applied after every step
within an episode, using the experiences
(state, action, reward, next state, next action)
gathered by following a certain policy. This is
typically ε-greedy, which balances exploration
(trying out new actions) and exploitation
(choosing the best-known action), and is
derived from the current action-value
function Q.
The key aspect that classifies SARSA as a
control algorithm is its focus on updating the
policy towards optimality. By continuously
adjusting the Q values based on the agent’s
experiences and using these Q values to
make decisions, SARSA learns an optimal or
near-optimal policy that maximizes the
expected rewards over time.
Furthermore, Sarsa is an on-policy learning
method, meaning it learns the value of the
policy being carried out by the agent,
including the exploration steps. This is in
contrast to off-policy methods like Q-learning,
which learn the value of the optimal policy
regardless of the agent’s actions. Both
approaches aim to solve the control problem
but do so in slightly different ways, with Sarsa
directly learning the policy it follows.
Step 2 — Explain SARSA
SARSA Algorithm setup

The visual represents a detailed screenshot of the SARSA algorithm, directly excerpted from 'Reinforcement Learning: An Introduction' by Richard S. Sutton. This image encapsulates the core mechanics of SARSA, a cornerstone algorithm in reinforcement learning, highlighting its methodical process for learning optimal policies through the evaluation of action-value pairs in an interactive environment.
Let’s follow through this algorithm line by line
and ensure we really understand what’s
happening when SARSA is applied:
Line 1 — Algorithm parameters: step size α ∈
(0, 1], small ε > 0
α (alpha) is the learning rate, determining the
extent to which the newly acquired
information will override the old information.
A value of 0 makes the agent not learn
anything, while a value closer to 1 makes it
consider only the most recent information.
ε (epsilon) is used for the ε-greedy policy,
determining how often the agent will explore
(choose a random action) versus exploit
(choose the best-known action).

Line 2 — Initialize Q(s,a), for all s ∈ S+, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0:
This step initializes the action-value
function Q for all state-action pairs. The
values can be initialized to any arbitrary
values, but the value for terminal states is
always initialized to 0, as no future rewards
can be obtained once a terminal state is
reached.
Line 3 — Loop for Each Episode
Line 4 — Initialize S
Each episode starts with an initial state S. This
is typically done by resetting the environment
to a starting state.
Line 5 — Choose A from S using policy derived
from Q (e.g., ε-greedy)
An action A is selected using the ε-greedy
policy based on the current Q-values. With
probability ε, a random action is chosen
(exploration), and with probability 1−ε, the
action with the highest Q-value for the current
state is chosen (exploitation).
Line 6 — Loop for Each Step of the Episode
Line 7 — Take action A, observe R, S′
The agent takes the action A in the current
state S, then observes the immediate
reward R and the next state S′.
Line 8 — Choose A′ from S′ using policy
derived from Q (e.g., ε-greedy)
For the next state S′, an action A′ is chosen
using the same ε-greedy policy based on the
current Q-values.
Line 9 — Q(S,A) ← Q(S,A) + α[R + γQ(S′,A′) − Q(S,A)]
This line updates the Q-value for the state-
action pair (S,A) based on the observed
reward R, the estimated value of the next
state-action pair (S′,A′), and the current Q-
value of (S,A). The discount factor γ weights
the importance of future rewards.
Line 10 — S←S′;A←A′
The current state S is updated to the next
state S′, and the current action A is updated
to the next action A′.
Line 11 — until S is terminal
These steps are repeated for each step of the
episode until a terminal state is reached, at
which point the episode ends.
In summary, SARSA iteratively updates the Q-
values based on the observed transitions and
rewards, adjusting the policy towards optimal
as it learns from the experiences of each
episode.
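Putting the lines above together, here is a minimal SARSA training loop in Python. It assumes the Gymnasium implementation of the Cliff Walking environment ("CliffWalking-v0") mentioned earlier; the hyperparameter values and episode count are illustrative:

```python
# Minimal SARSA sketch for the Cliff Walking environment (illustrative values).
import numpy as np
import gymnasium as gym

env = gym.make("CliffWalking-v0")
n_states, n_actions = env.observation_space.n, env.action_space.n

alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))            # Line 2: initialize Q arbitrarily

def epsilon_greedy(state):
    # Lines 5/8: explore with probability epsilon, otherwise act greedily
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(Q[state]))

for episode in range(500):                     # Line 3: loop over episodes
    state, _ = env.reset()                     # Line 4: initialize S
    action = epsilon_greedy(state)             # Line 5: choose A from S
    done = False
    while not done:                            # Line 6: loop over steps
        next_state, reward, terminated, truncated, _ = env.step(action)  # Line 7
        next_action = epsilon_greedy(next_state)                         # Line 8
        # Line 9: SARSA update
        Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action]
                                     - Q[state, action])
        state, action = next_state, next_action                          # Line 10
        done = terminated or truncated         # Line 11: until S is terminal
```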
Step 3 — Implement SARSA
I already described how the Cliff Walking
environment works and how to interact with
it, but for those who haven’t had the chance
to read my previous post, please check it
here:

Bellman Equation

What is the Bellman equation?

The Bellman equation, named after Richard E. Bellman, relates the reward for the current action to the value of the next state. The reinforcement agent will aim to proceed with the actions producing the maximum reward. Bellman's method considers the current action's reward and the predicted reward for future actions, and it can be illustrated by the formulation:

According to the Bellman Equation, the long-term reward for a given action is equal to the reward from the current action combined with the expected reward from the future actions taken at the following times. Let's try to understand it first.
Let’s take an example:
Here we have a maze which is our environment, and the sole goal of our agent is to reach the trophy state (R = 1), i.e., get a Good reward, and to avoid the fire state, because that would be a failure (R = -1), i.e., a Bad reward.
Fig: Without Bellman Equation
What happens without Bellman Equation?
Initially, we will give our agent some time to explore the environment and let it figure out a path to the goal. As soon as it reaches its goal, it will trace its steps back to its starting position and mark the values of all the states which eventually lead towards the goal as V = 1.
The agent will face no problem until we change its starting position, as then it will not be able to find a path towards the trophy state, since the value of all the marked states is equal to 1. So, to solve this problem we should use the Bellman Equation:
V(s) = maxₐ [ R(s, a) + γV(s′) ]
State(s): current state where the agent is in the
environment
Next State(s’): After taking action(a) at state(s)
the agent reaches s’
Value(V): Numeric representation of a state
which helps the agent to find its
path. V(s) here means the value of the state s.
Reward(R): the treat which the agent gets after performing an action (a).
R(s): reward for being in the state s
R(s, a): reward for being in the state s and performing an action a
R(s, a, s′): reward for being in a state s, taking an action a and ending up in s′
e.g. a Good reward can be +1, a Bad reward can be -1, No reward can be 0.
Action(a): set of possible actions that can be
taken by the agent in the state(s). e.g.
(LEFT, RIGHT, UP, DOWN)
Discount factor(γ): determines how much the agent cares about rewards in the distant future relative to those in the immediate future. It has a value between 0 and 1. A lower value encourages short-term rewards, while a higher value promises long-term rewards.

Fig: Using Bellman Equation

The max denotes the most optimal action among all the actions that the agent can take in a particular state, the one which leads towards the reward when this process is repeated at every consecutive step.
For example:
The state to the left of the fire state (V = 0.9) can go UP, DOWN or RIGHT, but NOT LEFT, because that is a wall (not accessible). Among all the available actions, the maximum value for that state comes from the UP action.
The current starting state of our agent can
choose any random action UP or RIGHT since
both lead towards the reward with the same
number of steps.
By using the Bellman equation, our agent will calculate the value of every state except for the trophy and the fire state (V = 0); they cannot have values since they are the end of the maze.
So, after making such a plan our agent can
easily accomplish its goal by just following
the increasing values.
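To make the backup concrete, here is a minimal value-iteration sketch of V(s) = maxₐ [ R(s, a) + γV(s′) ] on a small made-up grid with one trophy state (R = 1) and one fire state (R = -1), using γ = 0.9 and the R(s, a, s′) form of the reward (the reward for ending up in s′); the grid layout is illustrative, not the exact maze from the figures:

```python
# Minimal sketch of the Bellman backup V(s) = max_a [ R(s,a) + gamma * V(s') ]
# on a tiny made-up 3x4 grid. Rewards and layout are illustrative only.
import numpy as np

rows, cols, gamma = 3, 4, 0.9
trophy, fire = (0, 3), (1, 3)                  # terminal states
V = np.zeros((rows, cols))                     # V = 0 everywhere, incl. terminals

def reward(s):
    return 1.0 if s == trophy else (-1.0 if s == fire else 0.0)

moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # UP, DOWN, LEFT, RIGHT

for _ in range(50):                            # repeat the backup until values settle
    new_V = V.copy()
    for r in range(rows):
        for c in range(cols):
            if (r, c) in (trophy, fire):
                continue                       # terminal states keep V = 0
            candidates = []
            for dr, dc in moves:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:   # walls are not accessible
                    candidates.append(reward((nr, nc)) + gamma * V[nr, nc])
            new_V[r, c] = max(candidates)
    V = new_V

# A state adjacent to the trophy ends up with V = 1.0, and the state to the
# left of the fire with V = 0.9, matching the pattern described above.
print(np.round(V, 2))
```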
