Unit-5 (AI)
Reinforcement Learning
What is Reinforcement Learning?
Reinforcement Learning is a feedback-based machine learning technique in which an
agent learns to behave in an environment by performing actions and seeing the
results of those actions. For each good action, the agent gets positive feedback, and for
each bad action, the agent gets negative feedback or a penalty.
In Reinforcement Learning, the agent learns automatically from this feedback without any
labeled data, unlike supervised learning.
Since there is no labeled data, the agent is bound to learn from its experience alone.
RL solves a specific type of problem where decision making is sequential, and the goal
is long-term, such as game-playing, robotics, etc.
The agent interacts with the environment and explores it by itself. The primary goal of
an agent in reinforcement learning is to improve its performance by collecting the
maximum positive reward.
The agent learns by trial and error, and based on this experience, it learns
to perform the task in a better way. Hence, we can say that "Reinforcement learning
is a type of machine learning method where an intelligent agent (computer
program) interacts with the environment and learns to act within it." How a
robotic dog learns the movement of its arms is an example of reinforcement
learning.
For example, in the Mario video game, if a character takes a random action (e.g.,
moving left), it may receive a reward based on that action. After taking the action, the
agent (Mario) is in a new state, and the process repeats until the game character
reaches the end of the stage or dies. This episode is repeated many times until
Mario learns to navigate the environment by maximizing the rewards.
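To make this interaction loop concrete, below is a minimal Python sketch of an agent acting in an environment, receiving rewards, and moving between states. The SimpleEnv class and its reward scheme are made up for illustration; no learning happens yet, the agent simply acts at random. Each pass through the while loop is one take-action / observe-reward / move-to-new-state step.

import random

# SimpleEnv is a hypothetical stand-in for a game such as Mario:
# step() returns the next state, a reward, and whether the episode ended.
class SimpleEnv:
    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        self.position += 1 if action == "right" else -1
        done = abs(self.position) >= 5
        reward = 1 if self.position >= 5 else 0
        return self.position, reward, done

env = SimpleEnv()
for episode in range(3):                              # each episode is one "life"
    state, done, total = env.reset(), False, 0
    while not done:
        action = random.choice(["left", "right"])     # no learning yet: act at random
        state, reward, done = env.step(action)        # the environment gives feedback
        total += reward
    print(f"episode {episode}: total reward = {total}")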
Example:
Suppose there is an AI agent within a maze environment, and its goal is to find
the diamond. The agent interacts with the environment by performing actions, and
based on those actions, the state of the agent changes, and it also receives a reward
or penalty as feedback.
The agent continues doing these three things (take an action, change state or remain in
the same state, and get feedback), and by repeating these actions, it learns and explores
the environment.
The agent learns which actions lead to positive feedback (rewards) and which actions
lead to negative feedback (penalties). As a positive reward, the agent gets a positive
point, and as a penalty, it gets a negative point.
Approaches to implement Reinforcement Learning: There are mainly three ways to implement reinforcement learning, which are:
1. Value-based: The value-based approach is about finding the optimal value function, which
gives the maximum value at a state under any policy. Therefore, the agent expects the long-term
return at any state s under policy π.
2. Policy-based: The policy-based approach is to find the optimal policy for the maximum future
rewards without using the value function. In this approach, the agent tries to apply a
policy such that the action performed at each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
Deterministic: The same action is produced by the policy (π) at any state.
Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no particular
solution or algorithm for this approach because the model representation is different for each
environment.
Elements of Reinforcement Learning: There are four main elements of Reinforcement
Learning, which are given below:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment
1) Policy: A policy defines the way an agent behaves at a given time. It
maps the perceived states of the environment to the actions to be taken in those states. A
policy is the core element of RL, as it alone can define the behavior of the agent.
For a deterministic policy: a = π(s)
For a stochastic policy: π(a | s) = P[A_t = a | S_t = s]
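As a rough Python sketch of the two kinds of policy (the states, actions, and probabilities below are assumptions for illustration only):

import random

# Deterministic policy: a = π(s), the same action for a given state.
def deterministic_policy(state):
    return "right" if state < 3 else "left"

# Stochastic policy: π(a|s) = P[A_t = a | S_t = s], the action is drawn from a
# probability distribution that depends on the state.
def stochastic_policy(state):
    probs = {"right": 0.8, "left": 0.2} if state < 3 else {"right": 0.2, "left": 0.8}
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights)[0]

print(deterministic_policy(1))   # always "right" for state 1
print(stochastic_policy(1))      # "right" about 80% of the time for state 1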
2) Reward Signal: The goal of reinforcement learning is defined by the reward signal.
At each step, the environment sends an immediate signal to the learning agent, and this
signal is known as a reward signal. These rewards are given according to the good and
bad actions taken by the agent. The agent's main objective is to maximize the total
reward it receives for good actions. The reward signal can change the policy; for example, if
an action selected by the agent leads to a low reward, the policy may change to select
other actions in the future.
3) Value Function: The value function gives information about how good a situation
and action are and how much reward an agent can expect. A reward is the immediate
signal received for each good or bad action, whereas the value function specifies which
states and actions are good in the long run. The value function depends on the reward:
without rewards, there could be no values. The goal of estimating values is to obtain more
reward.
4) Model: The last element of reinforcement learning is the model, which mimics the
behavior of the environment. With the help of the model, one can make inferences about
how the environment will behave. For example, if a state and an action are given, then the
model can predict the next state and reward.
The model is used for planning, which means it provides a way to choose a course of action
by considering all future situations before actually experiencing those situations.
The approaches that solve RL problems with the help of a model are
termed model-based approaches. Comparatively, an approach without
using a model is called a model-free approach.
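A minimal Python sketch of such a model, assuming a small made-up environment: given a (state, action) pair, the model predicts the next state and the reward, so the agent can plan without actually acting.

# A tiny hand-written transition table: (state, action) -> (next state, reward).
# The states, actions, and rewards are invented for illustration.
model = {
    ("start", "right"):  ("middle", 0),
    ("middle", "right"): ("goal", +1),    # reaching the goal is rewarded
    ("middle", "down"):  ("pit", -1),     # falling into the pit is penalized
}

def predict(state, action):
    # Use the model to infer the next state and reward without acting.
    # Unknown (state, action) pairs are assumed to leave the agent in place.
    return model.get((state, action), (state, 0))

print(predict("middle", "right"))   # ('goal', 1)
print(predict("start", "up"))       # ('start', 0) -- not in the model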
Consider the maze from the example above. The agent starts at the very first block of the
maze (S9). The maze contains an S6 block, which is a wall, an S8 block, which is a fire pit,
and an S4 block, which holds the diamond.
The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block,
it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward point. It can take
four actions: move up, move down, move left, and move right.
The agent can take any path to reach the final point, but it needs to do so in as few steps
as possible. Suppose the agent follows the path S9-S5-S1-S2-S3; then it will get the +1
reward point. The agent will try to remember the preceding steps it has taken to reach
the final step. To memorize the steps, it assigns a value of 1 to each previous block.
Now, the agent has successfully stored the previous steps by assigning the value 1 to each
previous block. But what will the agent do if it starts moving from a block that has
a block with value 1 on both sides?
It will be difficult for the agent to decide whether it should go up or down, as each
block has the same value. So, the above approach is not suitable for the agent to reach the
destination. Hence, to solve the problem, we will use the Bellman equation, which is the
main concept behind reinforcement learning.
Now, the agent has three options to move: if it moves to the blue box (the wall), it will feel a
bump; if it moves to the fire pit, it will get the -1 reward. But here we are considering
only positive rewards, so it will move upwards only. The complete block
values are calculated using the Bellman equation:
V(s) = max [R(s, a) + γV(s′)]
where V(s) is the value of the current state, R(s, a) is the reward obtained for taking action a
in state s, γ is the discount factor, V(s′) is the value of the next state, and the maximum is
taken over all possible actions a.
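The following Python sketch shows how block values could be filled in by repeatedly applying this equation (a simple form of value iteration). The 3x3 layout and the positions of the wall, fire pit, and diamond are assumptions for illustration; the point is only that values propagate backwards from the diamond.

# Value sweep over an assumed 3x3 grid: the diamond, fire pit, and wall
# positions are invented for illustration.
gamma = 0.9
rows, cols = 3, 3
diamond, fire, wall = (0, 2), (2, 2), (1, 1)

V = {(r, c): 0.0 for r in range(rows) for c in range(cols)}
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right

for _ in range(50):                             # sweep until the values settle
    for s in V:
        if s in (diamond, fire, wall):
            continue                            # terminal / blocked cells keep value 0
        best = float("-inf")
        for dr, dc in moves:
            nxt = (s[0] + dr, s[1] + dc)
            if nxt not in V or nxt == wall:
                nxt = s                         # bumping into a wall or edge: stay put
            reward = 1 if nxt == diamond else (-1 if nxt == fire else 0)
            best = max(best, reward + gamma * V[nxt])
        V[s] = best

print(V)   # cells closer to the diamond end up with higher values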
Markov Decision Process (MDP): A Markov Decision Process is used to formalize the
reinforcement learning problem in terms of states, actions, rewards, and transitions. As per
the Markov property, the current state transition does not depend on any past action or
state; the next state depends only on the current state and the action taken. Hence, an MDP
is an RL problem that satisfies the Markov property. For example, in a chess game, the
players only focus on the current board position and do not need to remember past actions
or states.
Finite MDP:
A finite MDP is when there are finite states, finite rewards, and finite actions. In RL, we
consider only the finite MDP.
Markov Process: Markov Process is a memoryless process with a sequence of
random states S1, S2, ....., St that uses the Markov Property. Markov process is also
known as Markov chain, which is a tuple (S, P) on state S and transition function P.
These two components (S and P) can define the dynamics of the system.
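A small Python sketch of a Markov chain given as the tuple (S, P); the states and transition probabilities are made up for illustration. Note that the next state is sampled using only the current state, which is exactly the memoryless (Markov) property.

import random

# A Markov chain as the tuple (S, P); the states and probabilities are made up.
S = ["sunny", "rainy"]
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(s):
    # Sample the next state using only the current state (memoryless property).
    states, probs = zip(*P[s].items())
    return random.choices(states, weights=probs)[0]

state = "sunny"
chain = [state]
for _ in range(5):
    state = next_state(state)     # history beyond the current state is irrelevant
    chain.append(state)
print(chain)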
Reinforcement Learning vs. Supervised Learning: Reinforcement learning and supervised
learning are both part of machine learning, but the two types of learning are quite different
from each other. RL agents interact with the environment, explore it, take actions, and get
rewarded, whereas supervised learning algorithms learn from a labeled dataset and, on the
basis of that training, predict the output.
Q-Learning
What is Q-Learning?
Q-learning is a popular model-free reinforcement learning algorithm used in machine
learning and artificial intelligence applications. Q-learning is based on the Bellman
equation. It falls under the category of temporal difference learning techniques, in
which an agent picks up new information by observing results, interacting with the
environment, and getting feedback in the form of rewards. Q-learning is a model-free,
value-based, off-policy algorithm that finds the best series of actions based on the
agent's current state. The “Q” stands for quality: it represents how valuable an
action is in maximizing future rewards. The Q-values can be derived from
the Bellman equation.
The Q-values are estimated with the following update rule, which is applied at every time
step of the agent's interaction with the environment:
Q(S, A) ← Q(S, A) + α [R + γ Q(S′, A′) − Q(S, A)]
The terms used are explained below:
S – Current State of the agent.
A – Current Action Picked according to some policy.
S’ – Next State where the agent ends up.
A’ – Next best action to be picked using current Q-value estimation, i.e. pick
the action with the maximum Q-value in the next state.
R – Current Reward observed from the environment in response to the current
action.
γ (>0 and <=1) : Discounting Factor for Future Rewards. Future rewards are
less valuable than current rewards so they must be discounted. Since Q-value
is an estimation of expected rewards from a state, discounting rule applies
here as well.
α: Learning rate, i.e. the step size taken when updating the estimate of Q(S, A).
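A minimal Python sketch of this update rule, assuming the Q-values are stored in a dictionary keyed by (state, action) pairs; the state and action names are placeholders.

# Q is assumed to be a dictionary keyed by (state, action); missing entries
# default to 0.  `actions` lists the moves available in the next state.
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)   # Q(S', A') with A' = best action
    old = Q.get((s, a), 0.0)
    # Q(S,A) <- Q(S,A) + alpha * [R + gamma * Q(S',A') - Q(S,A)]
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q

Q = {}
Q = q_update(Q, s="s0", a="right", r=1, s_next="s1", actions=["left", "right"])
print(Q)   # {('s0', 'right'): 0.1}: one step of learning from a +1 reward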
Selecting the course of action with the ϵ-greedy policy: A simple method for
selecting an action based on the current estimates of the Q-values is the ϵ-greedy
policy: with probability ϵ the agent picks a random action (exploration), and with
probability 1 − ϵ it picks the action with the highest estimated Q-value (exploitation).
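A small Python sketch of ϵ-greedy action selection over the current Q-value estimates (the Q-values shown are made up):

import random

# Q is assumed to be a dictionary keyed by (state, action).
def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

Q = {("s0", "right"): 0.5, ("s0", "left"): 0.1}
print(epsilon_greedy(Q, "s0", ["left", "right"]))   # "right" about 95% of the time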
There are two methods for determining Q-values:
1. Temporal Difference: Calculated by comparing the current state and action
values with the previous ones.
2. Bellman’s Equation: A recursive formula introduced by Richard Bellman in 1957
for calculating the value of a given state in a Markov Decision Process (MDP) and
determining the optimal course of action. It is particularly influential in the context
of Q-learning and optimal decision-making.
Q-Table
The agent will use a Q-table to take the best possible action based on the expected reward
for each state in the environment. In simple words, a Q-table is a data structure of sets of
actions and states, and we use the Q-learning algorithm to update the values in the table.
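As a sketch, a Q-table can be represented as a NumPy array with one row per state and one column per action; the 9 states and 4 actions below follow the maze example, and all entries start at zero before learning.

import numpy as np

# One row per state, one column per action (up, down, left, right).
n_states, n_actions = 9, 4
q_table = np.zeros((n_states, n_actions))     # all estimates start at zero

state = 0
best_action = int(np.argmax(q_table[state]))  # action with the highest Q-value
print(q_table.shape, best_action)             # (9, 4) 0  -- ties broken by lowest index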
Q-Function
The Q-function uses the Bellman equation and takes the state (s) and action (a) as input. The
equation simplifies the calculation of state values and state-action values.
The following methods estimate the utility (value) of states from the agent's experience:
1. Direct Utility Estimation: In this method, the agent executes a sequence of trials or
runs (sequences of state-action transitions that continue until the agent reaches the
terminal state). Each trial gives a sample value, and the agent estimates the utility of a
state based on these sample values, computed as the running average of the samples
observed for that state.
The main drawback is that this method wrongly assumes that state utilities are
independent, while in reality they are Markovian (the utility of a state depends on the
utilities of its successor states). It is also slow to converge.
Suppose we have a 4x3 grid as the environment, in which the agent can move Left,
Right, Up or Down (the set of available actions). An example of a run is one such
sequence of state-action transitions from the start state to a terminal state, together with
the rewards received along the way.
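A small Python sketch of direct utility estimation; the two trials below are made-up (state, reward) sequences, and the utility of each state is taken as the running average of the returns observed from it.

from collections import defaultdict

# Each trial is a made-up list of (state, reward) pairs ending in a terminal state.
gamma = 1.0
trials = [
    [("s1", -0.04), ("s2", -0.04), ("s3", 1.0)],
    [("s1", -0.04), ("s4", -0.04), ("s3", 1.0)],
]

totals, counts = defaultdict(float), defaultdict(int)
for trial in trials:
    for i, (state, _) in enumerate(trial):
        # Sample value: discounted sum of rewards from this state to the end of the trial.
        sample = sum((gamma ** k) * r for k, (_, r) in enumerate(trial[i:]))
        totals[state] += sample
        counts[state] += 1

utilities = {s: totals[s] / counts[s] for s in totals}
print(utilities)   # running averages of the sample values per state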
2. Adaptive Dynamic Programming (ADP): In this method, the agent learns the transition
model of the environment and uses it to estimate the utility of each state under the policy π:
Uπ(s) = R(s) + γ Σ P(s′|s, π(s)) Uπ(s′)
where R(s) = reward for being in state s, P(s′|s, π(s)) = transition model, γ = discount
factor, and Uπ(s′) = utility of being in state s′ (the sum runs over all possible next states s′).
This set of equations can be solved using the value-iteration algorithm. The algorithm
converges fast but can become quite costly to compute for large state spaces. ADP is a
model-based approach and requires the transition model of the environment. A model-free
alternative is Temporal Difference Learning.
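Before moving on, here is a rough Python sketch of how ADP evaluates a fixed policy once the transition model is known; the three-state model below is an assumption for illustration.

# Assumed model: R gives the reward of each state, and P[s] gives the
# distribution over next states when the policy's action is taken in s.
gamma = 0.9
R = {"a": 0.0, "b": 0.0, "goal": 1.0}
P = {
    "a":    {"a": 0.2, "b": 0.8},
    "b":    {"b": 0.2, "goal": 0.8},
    "goal": {"goal": 1.0},
}

# Repeatedly apply U(s) = R(s) + gamma * sum_s' P(s'|s) * U(s') until it settles.
U = {s: 0.0 for s in R}
for _ in range(100):
    U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in P[s].items()) for s in R}
print(U)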
3. Temporal Difference Learning (TD): TD learning does not require the agent to learn
the transition model. The update occurs between successive states, and the agent only
updates the states that are directly affected:
Uπ(s) ← Uπ(s) + α (R(s) + γ Uπ(s′) − Uπ(s))
To balance exploration and exploitation, an exploration function f(u, n) can be used when
choosing actions, where f(u, n) increases with the expected value u and decreases with the
number of tries n, so that rarely visited states look more attractive.
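A minimal Python sketch of this TD update; the state names are placeholders, and the utilities are stored in a dictionary.

# U is assumed to be a dictionary of state utilities; missing entries default to 0.
def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    u_s, u_next = U.get(s, 0.0), U.get(s_next, 0.0)
    U[s] = u_s + alpha * (r + gamma * u_next - u_s)   # move U(s) toward the TD target
    return U

U = {}
U = td_update(U, s="s0", r=1.0, s_next="s1")
print(U)   # {'s0': 0.1}: one observed transition nudges the estimate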
4. Q-Learning: Q-learning is a TD learning method which does not require the agent to learn
the transition model; instead, it learns the Q-value function Q(s, a). The state value is related
to the Q-values through the Bellman equation:
V(s) = max [R(s, a) + γV(s′)]
where the maximum is taken over the available actions a.
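Putting the pieces together, here is a compact Python sketch of tabular Q-learning with an ϵ-greedy policy on a made-up five-state corridor, where reaching the last state gives a +1 reward; every name and parameter here is an assumption for illustration.

import random

# Five states in a row (0..4); moving right from state 3 reaches the goal (+1).
n_states, actions = 5, [-1, +1]                  # move left / move right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    s = 0
    while s != 4:                                # the episode ends at the goal
        if random.random() < epsilon:
            a = random.choice(actions)           # explore
        else:
            best = max(Q[(s, x)] for x in actions)
            a = random.choice([x for x in actions if Q[(s, x)] == best])  # greedy, random ties
        s_next = min(max(s + a, 0), n_states - 1)        # stay inside the corridor
        r = 1.0 if s_next == 4 else 0.0
        best_next = max(Q[(s_next, x)] for x in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# After training, the greedy action in every non-goal state is typically +1 (move right).
print({s: max(actions, key=lambda x: Q[(s, x)]) for s in range(4)})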