

Reinforcement Learning
Dr. Thomas Abraham J V
Associate Professor Senior, School of Computing and Engineering Science, VIT Chennai.
Introduction
Reinforcement Learning (RL) can be defined as the study of making optimal decisions using experience.
It is mainly intended to solve problems in which decision making is sequential and the goal or objective is long-term; this includes robotics, game playing, and even logistics and resource management.
Reinforcement learning is a machine learning technique that teaches software to make decisions by using a reward-and-punishment system.
Reinforcement learning is the process of training a program to attain a goal through trial and error by incentivising it with a combination of rewards and penalties.
RL Applications
RL Terminologies
Agent: The agent in RL can be defined as the entity that acts as the learner and decision-maker. It is empowered to interact continually, select its own actions, and respond to those actions.
Environment: It is the abstract world through which the agent moves. The environment takes the current state and action of the agent as input and returns its next state and an appropriate reward as output.
States: The specific situation in which an agent finds itself is called a state. This can be the agent's current situation in the environment or any future situation.
Actions: This defines the set of all possible moves an agent can make within an environment.
Reward or Penalty: This is the feedback by which the success or failure of an agent's action in a given state is measured. Rewards are used to evaluate an agent's actions effectively.
Policy or Strategy: It maps states to actions. The agent uses a strategy to determine the next best action based on its current state.
Reinforcement Learning Procedure
1. The Agent perceives the environment and observes the current state.
2. According to the current state, the Agent takes an action in the environment using its strategy/policy.
3. The Agent receives a reward from the environment and updates its strategy/policy.
4. After the action is taken, the environment updates and transitions to the next state.
5. Repeat steps 1–4.
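To make the order of these steps concrete, here is a minimal sketch of the interaction loop in Python; the CoinEnvironment and Agent classes below are toy placeholders invented for illustration, not part of the slides.

```python
import random

# A hedged, self-contained sketch of the perceive -> act -> reward -> update loop.
# The environment and agent below are toy placeholders invented for illustration.

class CoinEnvironment:
    """Two states; action 1 always pays +1, action 0 pays -1."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        reward = 1 if action == 1 else -1
        self.state = (self.state + 1) % 2        # environment moves to its next state
        return self.state, reward

class Agent:
    """Keeps a running value estimate per action and mostly picks the better one."""
    def __init__(self):
        self.values = [0.0, 0.0]
    def act(self, state):
        if random.random() < 0.2:                # occasionally explore
            return random.randint(0, 1)
        return self.values.index(max(self.values))
    def update(self, action, reward):            # refine the strategy from feedback
        self.values[action] += 0.1 * (reward - self.values[action])

env, agent = CoinEnvironment(), Agent()
state = env.state                                # agent perceives the current state
for t in range(100):
    action = agent.act(state)                    # act according to the current policy
    state, reward = env.step(action)             # environment returns next state and reward
    agent.update(action, reward)                 # update the policy using the reward
print(agent.values)                              # the agent learns to prefer action 1
```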
Passive vs Active RL
A policy is a strategy or set of rules that an agent follows to make decisions in an environment. It defines the mapping from states of the world to the actions the agent should take. Essentially, a policy guides the agent on what action to choose when it encounters a particular state.

Both passive and active reinforcement learning are types of RL. In passive RL, the agent's policy is fixed, which means that it is told what to do.

In contrast, in active RL, an agent needs to decide what to do, as there is no fixed policy it can act on. Therefore, the goal of a passive RL agent is to execute a fixed policy (sequence of actions) and evaluate it, while that of an active RL agent is to act and learn an optimal policy.
Types of RL
Value-Based Reinforcement Learning

Value-based reinforcement learning focuses on finding the optimal value function that
measures how good it is for an agent to be in a given state (or take a given action). The goal is
to maximize the value function, which represents the long-term cumulative reward.

Example: Q-Learning, Deep Q-Learning

Policy-Based Reinforcement Learning

A policy is a strategy or set of rules that an agent follows to make decisions in an environment. It defines the mapping from states of the world to the actions the agent should take. Essentially, a policy guides the agent on what action to choose when it encounters a particular state.

Unlike value-based methods, policy-based RL methods aim to directly learn the optimal policy
π(a∣s), which maps states to probabilities of selecting actions. These methods can be effective
for environments with high-dimensional or continuous action spaces, where value-based
methods struggle.
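To make the contrast concrete, the sketch below shows what policy-based methods learn directly: a parameterised policy π(a∣s) that turns per-state preferences into action probabilities. The table sizes and variable names are illustrative assumptions, not from the slides.

```python
import numpy as np

# Illustrative only: a policy pi(a|s) stored directly as parameters and turned into
# action probabilities with a softmax.
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))          # policy parameters, one row per state

def policy(state):
    """Return pi(.|state): a probability for every action in this state."""
    prefs = theta[state]
    e = np.exp(prefs - prefs.max())               # numerically stable softmax
    return e / e.sum()

probs = policy(0)                                 # e.g. [0.5, 0.5] before any learning
action = np.random.choice(n_actions, p=probs)     # sample an action from the policy
```

A value-based method would instead learn Q(s, a) and derive its actions from those values.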
Types of RL
On-Policy vs Off-policy Learning

On-policy methods are about learning from what you are currently doing. The policy directs the
agent's actions in every state, including the decision-making process while learning. The agent
evaluates the outcomes of its present actions, refining its strategy incrementally.

Off-policy methods, on the other hand, are like learning from someone else's experience. They involve learning the value of the optimal policy independently of the agent's actions. These methods enable the agent to learn about the optimal policy from observations, even when that policy is not the one being followed. This is useful for learning from a fixed dataset or a teaching policy.
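As a hedged illustration of this distinction (the array shapes and hyperparameter values below are assumptions), the two tabular update rules differ only in which next-state value they bootstrap from: SARSA, the on-policy method, uses the action the agent actually takes next, while Q-learning, the off-policy method, uses the greedy action.

```python
import numpy as np

alpha, gamma = 0.1, 0.9            # learning rate and discount factor (illustrative values)

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy: bootstrap from the action the agent will actually take next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next):
    """Off-policy: bootstrap from the greedy (maximal) next action, whatever is executed."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

Q = np.zeros((3, 2))               # illustrative 3-state, 2-action Q-table
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```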
Types of RL
Model-based reinforcement learning
In model-based algorithms, the agent builds an internal model of the
environment. This model represents the dynamics of the environment,
including state transitions and reward probabilities. The agent can then use
this model to plan and evaluate different actions before taking them in the
real environment.
This approach has the advantage of being more sample-efficient,
especially in complex environments.
The disadvantage is that building an accurate model can be challenging,
especially for complex environments. The model may not reflect the real
environment accurately, leading to suboptimal behaviour.
Types of RL
Model-free reinforcement learning
This approach focuses on learning directly from interaction with the
environment without explicitly building an internal model. The agent learns
the value of states and actions or the optimal strategy through trial and
error.
Model-free RL offers a simpler approach in environments where building an accurate model is challenging. For Bob, this means he doesn't need to create a complex mental map of the room; he can learn through scratching and experiencing the consequences.
Model-free RL excels in dynamic environments where the rules might change. However, learning only through trial and error can be less sample-efficient.
Q-Learning: The algorithm learns a Q-value for each state-action pair. The Q-value
represents the expected future reward of taking a specific action in a particular
state. The agent can then choose the action with the highest Q-value to maximize
its long-term reward (we’ll explain this in more detail in the next section).
SARSA (State-Action-Reward-State-Action): This is similar to Q-learning, but it is on-policy: it updates the value of a state-action pair using the reward received and the next state-action pair that the agent actually takes.
Policy gradient methods: These algorithms directly learn the policy function, which maps states to actions. They use gradients to update the policy in the direction expected to lead to higher rewards. Examples include REINFORCE and Proximal Policy Optimization (PPO); a minimal REINFORCE sketch is shown after this list.
Deep Q-Networks (DQN): This algorithm combines Q-learning with deep neural
networks to handle high-dimensional state spaces, often encountered in complex
environments like video games.
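The following is a minimal, hedged sketch of the policy gradient idea: REINFORCE on a two-armed bandit, where the policy parameters are nudged along the gradient of the log-probability of the chosen action, weighted by the reward. The bandit, learning rate, and reward distribution are invented for illustration.

```python
import numpy as np

theta = np.zeros(2)                      # action preferences (policy parameters)
lr = 0.1                                 # illustrative learning rate

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for t in range(1000):
    probs = softmax(theta)                               # pi(a) for the single state
    a = np.random.choice(2, p=probs)                     # sample an action
    r = np.random.normal(1.0 if a == 1 else 0.0, 1.0)    # arm 1 pays more on average
    grad_log_pi = np.eye(2)[a] - probs                   # gradient of log pi(a) w.r.t. theta
    theta += lr * r * grad_log_pi                        # REINFORCE update
print(softmax(theta))                                    # probability mass shifts to arm 1
```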
Bellman Equation for State Value Function V(s)

This is used for policy evaluation under a fixed policy π. It expresses the
expected return from a state following the policy:

V(s)=Eπ[R(s,a)+γ⋅V(s′)]

If the policy is deterministic, it simplifies to:

V(s)=R(s,a)+γ⋅V(s′)

States: S1, S2, S3, S4

The agent follows a fixed policy and always transitions:

S1 → S2, reward = +1; S2 → S3, reward = 0; S3 → S4, reward = +2; S4 → S1, reward = −1

Discount factor: γ = 0.9


Bellman Equation for State Value Function V(s)

The Bellman Equations

V(S1) = 1 + 0.9⋅V(S2)

V(S2) = 0 + 0.9⋅V(S3)

V(S3) = 2 + 0.9⋅V(S4)

V(S4) = −1 + 0.9⋅V(S1)

Solve the system of equations and back-substitute to get the other values:

V(S1) ≈ 5.5, V(S4) = −1 + 0.9⋅5.5 ≈ 3.95,
V(S3) = 2 + 0.9⋅3.95 ≈ 5.55, V(S2) = 0.9⋅5.55 ≈ 5.0
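As a quick numerical check (a hedged sketch; the variable names are mine), the same four equations can be solved as the linear system V = R + γPV, which reproduces the values above.

```python
import numpy as np

# The fixed-policy chain S1 -> S2 -> S3 -> S4 -> S1 with rewards +1, 0, +2, -1.
gamma = 0.9
R = np.array([1.0, 0.0, 2.0, -1.0])          # reward received on leaving each state
P = np.array([[0, 1, 0, 0],                   # deterministic transition matrix
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)

# Solve (I - gamma * P) V = R for the state values V(S1)..V(S4)
V = np.linalg.solve(np.eye(4) - gamma * P, R)
print(V)   # approximately [5.50, 5.00, 5.55, 3.95]
```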
Q-learning
Q-learning is a model-free, value-based, off-policy
algorithm that will find the best series of actions based on
the agent's current state. The “Q” stands for quality. Quality
represents how valuable the action is in maximising future
rewards.

1. Initialise your Q-table
2. Choose an action using the Epsilon-Greedy Exploration Strategy
3. Update the Q-table using the Bellman Equation
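Step 2 can be sketched as follows; the Q-table shape and epsilon value are illustrative assumptions. With probability ε the agent explores a random action, otherwise it exploits the action with the highest Q-value for the current state.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    n_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)      # explore
    return int(np.argmax(Q[state]))              # exploit

# Example: a Q-table with 5 states and 4 actions, initialised to zeros
Q = np.zeros((5, 4))
action = epsilon_greedy(Q, state=0, epsilon=0.1)
```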
Initialize your Q-table
Bellman Equation

S = the state or observation
A = the action the agent takes
R = the reward from taking an action
t = the time step
α = the learning rate
γ = the discount factor, which causes rewards to lose their value over time so that more immediate rewards are valued more highly
Bellman Equation
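These symbols plug into the Q-learning form of the Bellman equation, Q(St, At) ← Q(St, At) + α⋅[R + γ⋅max over a of Q(St+1, a) − Q(St, At)]. A minimal sketch of that single update step is shown below; the array names and default values are assumptions for illustration.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q[s, a] a fraction alpha toward the TD target."""
    td_target = r + gamma * np.max(Q[s_next])   # reward plus discounted best next value
    Q[s, a] += alpha * (td_target - Q[s, a])    # shrink the temporal-difference error
    return Q
```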
Key Terminologies in Q-learning

States(s): the current position of the agent in the environment.

Action(a): a step taken by the agent in a particular state.

Rewards: for every action, the agent receives a reward or a penalty.
Key Terminologies in Q-learning (contd)

Episodes: the end of the stage, where the agent can't take a new action. It happens when the agent has achieved the goal or has failed.

Q(St+1, a): the expected optimal Q-value of taking an action in the next state.

Q(St, At): the current estimate for the present state-action pair, which is updated toward the reward plus the discounted Q(St+1, a).

Q-Table: the agent maintains a Q-table of sets of states and actions.

Temporal Difference (TD): used to estimate the expected value of Q(St+1, a) by using the current state and action and the previous state and action.
How Does Q-Learning Work?
Q-Table

The agent will use a Q-table to take the best possible action based on the expected reward for each state in the environment. In simple words, a Q-table is a data structure of sets of actions and states, and we use the Q-learning algorithm to update the values in the table.

Q-Function

The Q-function uses the Bellman equation and takes state (s) and action (a) as input. The equation simplifies the calculation of state values and state-action values.
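Putting the pieces together, here is a minimal, hedged sketch of tabular Q-learning on a toy chain environment. The environment, its size, and all hyperparameter values are illustrative assumptions rather than anything from the slides: the Q-table starts at zero, actions are chosen epsilon-greedily, and every transition updates the table with the temporal-difference rule.

```python
import numpy as np

# Toy chain environment (an illustrative assumption): states 0..4, action 1 moves right,
# action 0 moves left, and reaching state 4 gives reward +1 and ends the episode.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((n_states, n_actions))          # 1. initialise the Q-table

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for episode in range(500):
    s, done = 0, False
    for t in range(100):                      # cap episode length
        # 2. epsilon-greedy action selection (ties broken randomly)
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = int(np.random.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s_next, r, done = step(s, a)
        # 3. Bellman / temporal-difference update of the Q-table
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
        if done:
            break

print(Q)      # values for "move right" dominate, pointing toward the goal state
```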
Understanding Q-learning
