RL Framework and Application

Dr. Ch. Balaram Murthy
The goal of an RL algorithm

For instance, imagine putting your little brother in front of a video game he has never played, giving him a controller, and leaving him alone.

Your brother will interact with the environment (the video game) by pressing the right button (action). He gets a coin: that’s a +1 reward. It’s positive, so he understands that in this game he must collect the coins. But then he presses the right button again and touches an enemy. He just died, so that’s a -1 reward.

By interacting with his environment through trial and error, your little brother understands that he needs to get coins in this environment but avoid the enemies.

Without any supervision, the child will get better and better at playing the game.

That’s how humans and animals learn: through interaction. Reinforcement Learning is just a computational approach to learning from actions.
A formal definition

Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.

But how does Reinforcement Learning work?


The Reinforcement Learning Framework

The RL Process

The RL Process: a loop of state, action, reward and next state.

To understand the RL process, let’s imagine an agent learning to play a platform game:
• Our Agent receives state S0 from the Environment — we receive the first frame of our game (Environment).

• Based on that state S0, the Agent takes action A0 — our Agent will move to the right.

• The Environment goes to a new state S1 — a new frame.

• The Environment gives some reward R1 to the Agent — we’re not dead (positive reward +1).

• This RL loop outputs a sequence of state, action, reward and next state.
The agent’s goal is to maximize its cumulative reward, called the
expected return.
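To make the loop concrete, here is a minimal sketch of the state → action → reward → next state cycle in Python. It assumes the Gymnasium library and its CartPole-v1 environment are available, and it uses a random placeholder policy; these are illustrative choices, not part of the slides.

import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()                     # the Agent receives the first state S0
cumulative_reward = 0.0
done = False

while not done:
    action = env.action_space.sample()        # placeholder: a random action A_t
    next_state, reward, terminated, truncated, info = env.step(action)
    cumulative_reward += reward               # summing the rewards gives the return
    state = next_state                        # the loop continues from S1, S2, ...
    done = terminated or truncated

print("Cumulative reward for this episode:", cumulative_reward)
env.close()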
The reward hypothesis: the central idea of Reinforcement Learning

⇒ Why is the goal of the agent to maximize the expected return?

Because RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected return (i.e. the expected cumulative reward).

That’s why in Reinforcement Learning, to have the best behavior, we aim to learn to take actions that maximize the expected cumulative reward.
Markov Property

The RL process is also called a Markov Decision Process (MDP).

The Markov Property implies that our agent needs only the current state to decide what action to take, not the history of all the states and actions it took before.
Observations/States Space

Observations/States are the information our agent gets from the environment.

In the case of a video game, it can be a frame (a screenshot). In the case of a trading agent, it can be the value of a certain stock, etc.

There is, however, a differentiation to make between observation and state:

State s: a complete description of the state of the world (there is no hidden information). It comes from a fully observed environment.

In a chess game, we have access to the whole board information, so we receive a state from the environment. In other words, the environment is fully observed.

Observation o: a partial description of the state. It comes from a partially observed environment.

In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation rather than a state: the environment is partially observed.
Action Space
The Action space is the set of all possible actions in an
environment.

The actions can come from a discrete or continuous space:

Discrete space: the number of possible actions is finite.

In Super Mario Bros, we have only 4 possible actions: left, right, up (jumping) and down (crouching). The set of actions is finite since we have only these 4 directions, so the action space is discrete.
Continuous space: the number of possible actions is infinite.

A self-driving car agent has an infinite number of possible actions since it can turn left 20°, 21.1°, 21.2°, turn right 20°, and so on.

Taking this information into consideration is crucial because it will matter when choosing an RL algorithm later on.
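As an illustration only (not part of the slides), here is how a discrete and a continuous action space might be declared with the Gymnasium library; the specific bounds are made up.

import numpy as np
from gymnasium import spaces

# Discrete space: a finite set of actions, e.g. Mario's left, right, up, down.
mario_actions = spaces.Discrete(4)

# Continuous space: infinitely many actions, e.g. a steering angle in degrees.
steering_angle = spaces.Box(low=-25.0, high=25.0, shape=(1,), dtype=np.float32)

print(mario_actions.sample())    # an integer in {0, 1, 2, 3}
print(steering_angle.sample())   # a real-valued angle such as [21.1]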
Rewards and the discounting

The reward is fundamental in RL because it’s the only feedback for the agent. Thanks to it, our agent knows if the action taken was good or not.

The cumulative reward at each time step t can be written as:

R(τ) = r_{t+1} + r_{t+2} + r_{t+3} + r_{t+4} + …

The cumulative reward equals the sum of all rewards in the sequence, which is equivalent to:

R(τ) = Σ_{k=0}^{∞} r_{t+k+1}

However, in reality, we can’t just add the rewards like that. The rewards that come sooner (at the beginning of the game) are more likely to happen, since they are more predictable than the long-term future reward.

Let’s say your agent is this tiny mouse that can move one tile each
time step, and your opponent is the cat (that can move too). The
mouse’s goal is to eat the maximum amount of cheese before
being eaten by the cat.

From the figure, it is clear that it’s more probable to eat the cheese near us than the cheese close to the cat (the closer we are to the cat, the more dangerous it is).

Consequently, the reward near the cat, even if it is bigger (more cheese), will be more discounted since we’re not really sure we’ll be able to eat it.
To discount the rewards, we proceed like this:

1. We define a discount rate called gamma. It must be between 0 and 1, most of the time between 0.95 and 0.99.

• The larger the gamma, the smaller the discount. This means our
agent cares more about the long-term reward.

• Conversely, the smaller the gamma, the bigger the discount. This means our agent cares more about the short-term reward (the nearest cheese).
2. Then, each reward will be discounted by gamma raised to the power of the time step. As the time step increases, the cat gets closer to us, so the future reward is less and less likely to happen.

Our discounted expected cumulative reward is:

R(τ) = Σ_{k=0}^{∞} γ^k r_{t+k+1}
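Below is a small worked sketch in Python of this discounted sum, with made-up reward values, just to show how gamma weighs later rewards less.

def discounted_return(rewards, gamma=0.95):
    # R = sum over k of gamma^k * r_{t+k+1}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1, 1, 0, 1, 10]                     # hypothetical rewards along a trajectory
print(discounted_return(rewards, gamma=0.95))  # later rewards still count for a lot
print(discounted_return(rewards, gamma=0.50))  # a smaller gamma discounts them heavily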


Types of tasks

A task is an instance of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuing.

Episodic task
Beginning of a new episode.

In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and new States.

For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends when you’re killed or you reach the end of the level.
Continuing tasks

These are tasks that continue forever (no terminal state). In this
case, the agent must learn how to choose the best actions and
simultaneously interact with the environment.

For instance, consider an agent that does automated stock trading. For this task, there is no starting point or terminal state. The agent keeps running until we decide to stop it.
The Exploration/Exploitation trade-off

Exploration is exploring the environment by trying random actions in order to find more information about the environment.

Exploitation is exploiting known information to maximize the reward.

The goal of our RL agent is to maximize the expected cumulative reward. However, we can fall into a common trap.

Let’s take an example:

In this game, our mouse can have an infinite amount of small cheese (+1 each). But at the top of the maze, there is a gigantic sum of cheese (+1000).

However, if we only focus on exploitation, our agent will never reach the gigantic sum of cheese. Instead, it will only exploit the nearest source of rewards, even if this source is small (exploitation).
But if our agent does a little bit of exploration, it can discover the
big reward (the pile of big cheese).

This is what we call the exploration/exploitation trade-off. We need to balance how much we explore the environment and how much we exploit what we know about the environment.

Therefore, we must define a rule that helps to handle this trade-off. We’ll see the different ways to handle it in the future units.
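As a preview, one common rule is epsilon-greedy: with a small probability we explore (take a random action), otherwise we exploit the best known action. The sketch below is illustrative only; the action-value estimates are hypothetical.

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the best known action.
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

q_values = [0.2, 1.5, -0.3, 0.7]    # hypothetical value estimates for 4 actions
action = epsilon_greedy(q_values, epsilon=0.1)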
If it’s still confusing, think of a real problem: choosing a restaurant:

Exploitation: You go to the same restaurant that you know is good every day, taking the risk of missing another, better restaurant.

Exploration: You try restaurants you have never been to before, with the risk of having a bad experience but the possible opportunity of a fantastic one.
Now that we have learned the RL framework, how do we solve the RL problem?

Three main approaches for solving RL problems

In other words, how do we build an RL agent that can select the actions that maximize its expected cumulative reward?

The Policy π: the agent’s brain

The Policy π is the brain of our Agent: it’s the function that tells us what action to take given the state we are in, so it defines the agent’s behavior at a given time.

This Policy is the function we want to learn. Our goal is to find the optimal policy π*, the policy that maximizes expected return when the agent acts according to it. We find this π* through training.

There are two approaches to train our agent to find this optimal policy π*:

• Directly, by teaching the agent to learn which action to take given the current state: Policy-Based Methods.

• Indirectly, by teaching the agent to learn which state is more valuable and then take the action that leads to the more valuable states: Value-Based Methods.
Policy-Based Methods: Monte Carlo policy gradient (REINFORCE)
and Deterministic Policy Gradient (DPG).
In Policy-Based methods, we learn a policy function directly.

This function will define a mapping from each state to the best
corresponding action. Alternatively, it could define a probability
distribution over the set of possible actions at that state.

As we can see here, the (deterministic) policy directly indicates the action to take for each step.
We have two types of policies:

Deterministic: a policy that, at a given state, will always return the same action.

action = policy(state)
Stochastic: outputs a probability distribution over actions.

policy(actions | state) = probability distribution over the set of actions given the current state

Given an initial state, our stochastic policy will output a probability distribution over the possible actions at that state.
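A minimal sketch contrasting the two policy types in Python; the states, actions and probabilities below are invented purely for illustration.

import random

def deterministic_policy(state):
    # Always returns the same action for a given state.
    mapping = {"s0": "right", "s1": "jump"}
    return mapping[state]

def stochastic_policy(state):
    # Samples an action from a probability distribution over actions.
    distribution = {"left": 0.1, "right": 0.7, "jump": 0.2}
    actions, probs = zip(*distribution.items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy("s0"))   # always "right"
print(stochastic_policy("s0"))      # "right" most of the time, but not always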
Value-Based Methods: SARSA (State-Action-Reward-State-Action) and Q-Learning (an off-policy RL algorithm)
In value-based methods, instead of learning a policy
function, we learn a value function that maps a state to the
expected value of being at that state.

The value of a state is the expected discounted return the agent can
get if it starts in that state, and then acts according to our policy.

“Act according to our policy” just means that our policy is “going to
the state with the highest value”.
Here we see that our value function defines a value for each possible state.

Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5, and so on, until it attains the goal.
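As a hedged sketch of the Q-Learning update named above (not the exact grid from the figure), here is the tabular off-policy update rule in Python; the states, actions and learning parameters are placeholders.

from collections import defaultdict

Q = defaultdict(float)       # Q[(state, action)] -> estimated action value
alpha, gamma = 0.1, 0.95     # learning rate and discount rate

def q_learning_update(state, action, reward, next_state, actions):
    # Off-policy target: reward plus the best estimated value of the next state.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Hypothetical one-step transition on a grid world:
q_learning_update(state=(0, 0), action="right", reward=-1,
                  next_state=(0, 1), actions=["up", "down", "left", "right"])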
Model-based RL algorithms build a model of the environment by
sampling the states, taking actions, and observing the rewards. For
every state and a possible action, the model predicts the expected
reward and the expected future state.
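To make that concrete, here is an illustrative toy "model" that predicts the expected reward and next state for a (state, action) pair; the entries are invented, and a real model would be learned from sampled transitions.

# Toy learned model: (state, action) -> predicted next state and expected reward.
model = {
    ("room_A", "go_east"): ("room_B", 0.0),
    ("room_B", "go_east"): ("goal", 1.0),
}

def predict(state, action):
    next_state, expected_reward = model[(state, action)]
    return next_state, expected_reward

print(predict("room_A", "go_east"))   # ('room_B', 0.0)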
