Step-01:
Choose the number of clusters K.
Step-02:
Randomly select any K data points as the initial cluster centers.
Step-03:
Calculate the distance between each data point and each cluster center.
Step-04:
Assign each data point to some cluster.
A data point is assigned to the cluster whose center is nearest to it.
Step-05:
Re-compute the center of each newly formed cluster by taking the mean of all the points in that cluster, and repeat from Step-03 until the cluster centers stop changing.
Advantages-
K-Means Clustering Algorithm has the following advantages-
It is relatively simple to implement.
It scales well to large data sets.
Disadvantages-
K-Means Clustering Algorithm has the following disadvantages-
It requires the number of clusters (k) to be specified in advance.
It cannot handle noisy data and outliers.
It is not suitable for identifying clusters with non-convex shapes.
PRACTICE PROBLEMS BASED ON K-MEANS
CLUSTERING ALGORITHM-
Problem-01:
Iteration-01:
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
= 0
Calculating Distance Between A1(2, 10) and C2(5, 8)-
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
= 3 + 2
= 5
Calculating Distance Between A1(2, 10) and C3(1, 2)-
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
= 1 + 8
= 9
In a similar manner, we calculate the distances of the other points from each of the three cluster centers.
Next,
We draw a table showing all the results.
Using the table, we decide which point
belongs to which cluster.
The given point belongs to that cluster whose
center is nearest to it.
Cluster-01: A1(2, 10)
Cluster-02: A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A8(4, 9)
Cluster-03: A2(2, 5), A7(1, 2)
Now,
We re-compute the new cluster centers.
The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
We have only one point A1(2, 10) in Cluster-01.
So, the cluster center remains the same, (2, 10).
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
This completes Iteration-01.
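The assignment and recomputation steps of Iteration-01 can be sketched in Python. This is a minimal illustration using the data points and initial centers from the worked example above, not a general-purpose implementation:

```python
# One K-Means iteration with Manhattan (L1) distance, using the
# data points and initial centers C1, C2, C3 from the example.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centers = [(2, 10), (5, 8), (1, 2)]  # C1, C2, C3

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# Assignment step: each point goes to the nearest center.
clusters = {i: [] for i in range(len(centers))}
for name, p in points.items():
    nearest = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
    clusters[nearest].append(p)

# Update step: each new center is the mean of its cluster's points.
new_centers = [
    (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
    for pts in clusters.values()
]
print(new_centers)  # [(2.0, 10.0), (6.0, 6.0), (1.5, 3.5)]
```

The printed centers match the ones computed by hand above.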
Iteration-02:
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
= 0
Calculating Distance Between A1(2, 10) and C2(6, 6)-
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |6 – 2| + |6 – 10|
= 4 + 4
= 8
Calculating Distance Between A1(2, 10) and C3(1.5, 3.5)-
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1.5 – 2| + |3.5 – 10|
= 0.5 + 6.5
= 7
Next,
We draw a table showing all the results.
Using the table, we decide which point
belongs to which cluster.
The given point belongs to that cluster whose
center is nearest to it.
Given Points   Distance from        Distance from       Distance from          Point belongs
               center (2, 10)       center (6, 6)       center (1.5, 3.5)      to Cluster
               of Cluster-01        of Cluster-02       of Cluster-03
A1(2, 10)      0                    8                   7                      C1
A2(2, 5)       5                    5                   2                      C3
A3(8, 4)       12                   4                   7                      C2
A4(5, 8)       5                    3                   8                      C2
A5(7, 5)       10                   2                   7                      C2
A6(6, 4)       10                   2                   5                      C2
A7(1, 2)       9                    9                   2                      C3
A8(4, 9)       3                    5                   8                      C1
Cluster-01: A1(2, 10), A8(4, 9)
Cluster-02: A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4)
Cluster-03: A2(2, 5), A7(1, 2)
Now,
We re-compute the new cluster centers.
The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)
= (3, 9.5)
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)
= (6.5, 5.25)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
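The two iterations above generalize to a loop: K-Means simply repeats the assign and recompute steps until no center moves. A compact sketch on the same data, assuming Manhattan distance throughout as in the worked example:

```python
# K-Means until convergence, with the points and initial centers from Problem-01.
points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
centers = [(2, 10), (5, 8), (1, 2)]  # initial C1, C2, C3

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

while True:
    # Assignment step: group each point with its nearest center.
    clusters = [[] for _ in centers]
    for p in points:
        best = min(range(len(centers)), key=lambda j: manhattan(p, centers[j]))
        clusters[best].append(p)
    # Update step: new center = mean of the cluster's points.
    new_centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                   for c in clusters]
    if new_centers == centers:  # converged: no center moved
        break
    centers = new_centers

print([(round(x, 2), round(y, 2)) for x, y in centers])
```

The first two passes of this loop reproduce Iteration-01 and Iteration-02 above; further passes run until the centers stabilize.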
Problem-02:
Solution-
We follow the K-Means Clustering Algorithm discussed above.
Assume A(2, 2) and C(1, 1) are the centers of the two clusters.
Iteration-01:
Calculating Distance Between A(2, 2) and C1(2, 2)-
Ρ(A, C1)
= sqrt[ (x2 – x1)² + (y2 – y1)² ]
= sqrt[ (2 – 2)² + (2 – 2)² ]
= sqrt[ 0 + 0 ]
= 0
Calculating Distance Between A(2, 2) and C2(1, 1)-
Ρ(A, C2)
= sqrt[ (x2 – x1)² + (y2 – y1)² ]
= sqrt[ (1 – 2)² + (1 – 2)² ]
= sqrt[ 1 + 1 ]
= sqrt[ 2 ]
= 1.41
Next,
We draw a table showing all the results.
Using the table, we decide which point
belongs to which cluster.
The given point belongs to that cluster whose
center is nearest to it.
Given Points   Distance from center (2, 2)   Distance from center (1, 1)   Point belongs
               of Cluster-01                 of Cluster-02                 to Cluster
A(2, 2)        0                             1.41                          C1
B(3, 2)        1                             2.24                          C1
C(1, 1)        1.41                          0                             C2
D(3, 1)        1.41                          2                             C1
E(1.5, 0.5)    1.58                          0.71                          C2
Cluster-01: A(2, 2), B(3, 2), D(3, 1)
Cluster-02: C(1, 1), E(1.5, 0.5)
Now,
We re-compute the new cluster centers.
The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
Center of Cluster-01
= ((2 + 3 + 3)/3, (2 + 2 + 1)/3)
= (2.67, 1.67)
For Cluster-02:
Center of Cluster-02
= ((1 + 1.5)/2, (1 + 0.5)/2)
= (1.25, 0.75)
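The same computation with Euclidean distance can be checked in Python (math.dist computes Euclidean distance; the point labels follow the problem):

```python
from math import dist  # Euclidean distance, Python 3.8+

points = {"A": (2, 2), "B": (3, 2), "C": (1, 1), "D": (3, 1), "E": (1.5, 0.5)}
centers = [(2, 2), (1, 1)]  # C1 = A, C2 = C

# Assignment step: each point goes to the nearest center (Euclidean).
clusters = [[] for _ in centers]
for name, p in points.items():
    best = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
    clusters[best].append(p)

# Update step: each new center is the mean of its cluster's points.
new_centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]
print([(round(x, 2), round(y, 2)) for x, y in new_centers])
# [(2.67, 1.67), (1.25, 0.75)]
```

The printed centers agree with the hand computation for Iteration-01 of Problem-02.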
https://www.youtube.com/watch?v=CLKW6uWJtTc
https://www.youtube.com/watch?v=5FpsGnkbEpM
SARSA Algorithm
The algorithm for SARSA is a little bit different
from Q-learning.
In the SARSA algorithm, the Q-value is
updated taking into account the action, A1,
performed in the state, S1. In Q-learning, the
action with the highest Q-value in the next
state, S1, is used to update the Q-table.
https://youtu.be/FhSaHuC0u2M
How Does the SARSA Algorithm Work?
The SARSA algorithm works by carrying out
actions based on rewards received from
previous actions. To do this, SARSA stores a Q-value estimate for each state (S)-action (A) pair. This table is known as the Q-table, and its entries are denoted as Q(S, A).
The SARSA process starts by initializing Q(S, A) to arbitrary values. In this step, the initial current state (S) is set, and the initial action (A) is selected using an epsilon-greedy policy based on the current Q-values. An epsilon-greedy policy balances exploitation and exploration in the learning process: most of the time it selects the action with the highest estimated reward, but with a small probability (epsilon) it selects a random action instead.
Exploitation involves using already known,
estimated values to get more previously
earned rewards in the learning process.
Exploration involves trying out actions to gain new knowledge; this may result in short-term, sub-optimal actions during learning but can yield long-term benefits by discovering the best possible action and reward.
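The epsilon-greedy selection described above can be sketched as a small helper (the Q-table layout and the epsilon value here are illustrative assumptions):

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """Pick a random action with probability epsilon (exploration),
    otherwise the action with the highest Q-value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

# Tiny illustrative Q-table: one state (0) with two actions.
Q = {(0, 0): 0.5, (0, 1): 1.2}
action = epsilon_greedy(Q, state=0, n_actions=2, epsilon=0.0)  # pure exploitation
print(action)  # 1
```

With epsilon set to 0 the helper always exploits; a typical value such as 0.1 leaves a 10% chance per step of exploring a random action.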
From here, the selected action is taken, and
the reward (R) and next state (S1) are
observed. Q(S, A) is then updated, and the
next action (A1) is selected based on the
updated Q-values. In this way, the action-value estimate of each state-action pair encountered is refined to reflect the reward received for taking that action.
The steps above, from observing R through selecting A1, are repeated until the episode ends; an episode is the sequence of states, actions and rewards from the start until the final (terminal) state is reached. The state, action and reward experiences gathered in the SARSA process are used to update the Q(S, A) values at each iteration.
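The update step just described can be written as a single line of Python; the values of alpha (learning rate) and gamma (discount factor) below are assumed for illustration:

```python
# One SARSA update for the transition S --A--> (R, S1), with next action A1.
alpha, gamma = 0.5, 0.9                   # learning rate and discount factor (assumed)
Q = {("S", "A"): 1.0, ("S1", "A1"): 2.0}  # toy Q-table entries

R = 1.0                                   # observed reward
Q[("S", "A")] += alpha * (R + gamma * Q[("S1", "A1")] - Q[("S", "A")])
print(round(Q[("S", "A")], 4))  # 1.0 + 0.5 * (1.0 + 0.9*2.0 - 1.0) = 1.9
```

Note that the update uses Q(S1, A1) for the action actually selected next, which is what makes SARSA on-policy.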
SARSA vs. Q-learning
The main difference between SARSA and Q-
learning is that SARSA is an on-policy learning
algorithm, while Q-learning is an off-policy
learning algorithm.
In reinforcement learning, two different
policies are also used for active agents: a
behavior policy and a target policy. A
behavior policy is used to decide actions in a
given state (what behavior the agent is
currently using to interact with its
environment), while a target policy is used to
learn about desired actions and what rewards
are received (the ideal policy the agent seeks
to use to interact with its environment).
If an algorithm’s behavior policy matches its
target policy, this means it is an on-policy
algorithm. If these policies in an algorithm
don’t match, then it is an off-policy algorithm.
SARSA operates by choosing an action following the current epsilon-greedy policy and updates its Q-values accordingly. On-policy algorithms like SARSA select actions through a policy in which non-greedy actions have some probability of being selected, providing a balance between exploitation and exploration. Since SARSA's Q-values are learned using the same epsilon-greedy policy for both behavior and target, it is classified as on-policy.
Q-learning, unlike SARSA, always updates toward the greedy action in the next state. A greedy action is one that gives the maximum Q-value for the state; that is, it follows an optimal policy. Off-policy algorithms like Q-learning learn a target policy regardless of which actions are selected during exploration. Since Q-learning updates using greedy actions, and can evaluate one behavior policy while following a separate target policy, it is classified as off-policy.
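In code, the only difference between the two updates is how the next-state term of the target is formed (the toy Q-values below are illustrative):

```python
# Same transition (S, A, R, S1); A1 is the action actually chosen by the
# behavior policy in S1. gamma is the discount factor.
gamma = 0.9
Q = {("S1", "left"): 0.2, ("S1", "right"): 1.5}
R, A1 = 1.0, "left"  # suppose the epsilon-greedy policy explored and chose "left"

sarsa_target = R + gamma * Q[("S1", A1)]                                     # action taken
qlearning_target = R + gamma * max(Q[("S1", a)] for a in ("left", "right"))  # greedy action
print(round(sarsa_target, 2), round(qlearning_target, 2))  # 1.18 2.35
```

When the behavior policy explores, the two targets diverge: SARSA's target reflects the exploratory action, while Q-learning's ignores it.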
Step 1 — Define SARSA
The SARSA (short for State-Action-Reward-State-Action) algorithm is an on-policy reinforcement learning algorithm used for solving the control problem — specifically, for estimating the action-value function Q that approximates the optimal action-value function q*.
There are a few things in this definition that need to be explained: on-policy algorithm, control problem, and optimal action-value function.
Policy
In the context of reinforcement learning (RL),
a policy is a fundamental concept defining the
behaviour of an agent in an environment (in
our case Cliff Walking environment).
Essentially, it represents a strategy that the
agent follows to decide on actions based on
the current state of the environment. The
objective of a policy is to maximize an agent’s
cummulative reward over time. In our simple
example, the policy is a look-up table, mainly,
thus the name of tabular method (tabular →
table look-up).
On-Policy algorithm
Any algorithm that evaluates and improves the policy that is actually being used to make decisions, including the exploration steps. In contrast to "off-policy" methods, which learn the value of the best possible policy while following another (usually more exploratory) policy, on-policy methods make no distinction between the policy used for learning and the policy used for action selection.
Control Problem
SARSA is called a control problem because it
focuses on finding the optimal policy for
controlling the agent’s behaviour in an
environment, in our case a Cliff Walking
environment. In RL we have two main types of
problems: prediction and control.
Prediction Problem — involves estimating the value function of a given policy, that is, how good it is to follow a certain policy from a given state. The goal is not to change the policy but to evaluate it.
Control Problem — involves finding the
optimal policy itself, not just evaluating a
given policy. The goal is to learn a policy that
maximizes the expected return from any
initial state. This means determining the best
action to take in each state.
Action-Value Function
We know that an episode represents an
alternating sequence of state-action pairs,
which we can visualize like this:
[Diagram: an alternating sequence of state-action pairs, as described in 'Reinforcement Learning: An Introduction' by Richard S. Sutton.]
Of course, in the case of SARSA it helps to visualize the exact sequence of interest on which the reasoning will happen: S, A, R, S′, A′, the quintuple that gives the algorithm its name.
Line 2 — Initialize Q(S, A) arbitrarily for all state-action pairs, such that Q(terminal, ·) = 0
This step initializes the action-value
function Q for all state-action pairs. The
values can be initialized to any arbitrary
values, but the value for terminal states is
always initialized to 0, as no future rewards
can be obtained once a terminal state is
reached.
Line 3 — Loop for Each Episode
Line 4 — Loop for each episode: Initialize S
Each episode starts with an initial state S. This
is typically done by resetting the environment
to a starting state.
Line 5 — Choose A from S using policy derived
from Q (e.g., ε-greedy)
An action A is selected using the ε-greedy
policy based on the current Q-values. With
probability ε, a random action is chosen
(exploration), and with probability 1−ε, the
action with the highest Q-value for the current
state is chosen (exploitation).
Line 6 — Loop for Each Step of the Episode
Line 7 — Take action A, observe R, S′
The agent takes the action A in the current
state S, then observes the immediate
reward R and the next state S′.
Line 8 — Choose A′ from S′ using policy
derived from Q (e.g., ε-greedy)
For the next state S′, an action A′ is chosen
using the same ε-greedy policy based on the
current Q-values.
Line 9 — Q(S, A) ← Q(S, A) + α[R + γQ(S′, A′) − Q(S, A)]
This line updates the Q-value for the state-
action pair (S,A) based on the observed
reward R, the estimated value of the next
state-action pair (S′,A′), and the current Q-
value of (S,A). The discount factor γ weights
the importance of future rewards.
Line 10 — S←S′;A←A′
The current state S is updated to the next
state S′, and the current action A is updated
to the next action A′.
Line 11 — until S is terminal
These steps are repeated for each step of the
episode until a terminal state is reached, at
which point the episode ends.
In summary, SARSA iteratively updates the Q-
values based on the observed transitions and
rewards, adjusting the policy towards optimal
as it learns from the experiences of each
episode.
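Putting the walkthrough together, the lines above can be sketched as a minimal SARSA loop on a hypothetical one-dimensional corridor environment (the environment, rewards and hyperparameters here are invented for illustration; this is not the Cliff Walking setup):

```python
import random

random.seed(0)

N_STATES, GOAL = 5, 4            # corridor of states 0..4, state 4 is terminal
ACTIONS = [-1, +1]               # move left / move right
alpha, gamma, epsilon = 0.5, 0.9, 0.1

# Line 2: initialize Q(S, A) arbitrarily; terminal-state values stay 0.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in range(len(ACTIONS))}

def step(s, a):
    """Toy environment: returns (reward, next_state, done)."""
    s2 = min(max(s + ACTIONS[a], 0), GOAL)
    return (1.0, s2, True) if s2 == GOAL else (0.0, s2, False)

def choose(s):
    """Lines 5 and 8: epsilon-greedy action selection from Q."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: Q[(s, a)])

for episode in range(200):       # Line 3: loop over episodes
    s = 0                        # Line 4: initialize S
    a = choose(s)                # Line 5: choose A from S
    done = False
    while not done:              # Line 6: loop over steps, until S is terminal
        r, s2, done = step(s, a)                   # Line 7: take A, observe R, S'
        a2 = choose(s2)                            # Line 8: choose A' from S'
        target = r if done else r + gamma * Q[(s2, a2)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # Line 9: SARSA update
        s, a = s2, a2                              # Line 10: S <- S', A <- A'

# After training, the greedy action in each non-terminal state should be "right".
print([max(range(len(ACTIONS)), key=lambda a: Q[(s, a)]) for s in range(GOAL)])
```

The structure maps one-to-one onto lines 2 through 11 of the pseudocode walked through above; only the environment is a stand-in.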
Step 3 — Implement SARSA
I already described how the Cliff Walking
environment works and how to interact with
it, but for those who haven’t had the chance
to read my previous post, please check it
here:
Bellman Equation