
Introduction to Reinforcement Learning

Outline
• What is Reinforcement Learning
• RL Formalism
1. Reward
2. The agent
3. The environment
4. Actions
5. Observations
• Markov Decision Process
1. Markov Process
2. Markov Reward Process
3. Markov Decision Process
• Learning Optimal Policies
What is Reinforcement Learning ?
Describe this:
• Mouse
• A maze with walls, food and
electricity
• Mouse can move left, right, up
and down
• Mouse wants the cheese but not
electric shocks
• Mouse can observe the
environment

Lapan, Maxim. Deep Reinforcement Learning Hands-On
What is Reinforcement Learning ?
Describe this:
• Mouse => Agent
• A maze with walls, food and
electricity => Environment
• Mouse can move left, right, up
and down => Actions
• Mouse wants the cheese but not
electric shocks => Rewards
• Mouse can observe the
environment => Observations

Lapan, Maxim. Deep Reinforcement Learning Hands-On
What is Reinforcement Learning ?

Learning to make sequential decisions in an environment so as to maximize some notion of overall rewards acquired along the way.
In simple terms:
The mouse is trying to find as much
food as possible, while avoiding an
electric shock whenever possible.

The mouse could be brave and take an electric shock to get to the place with plenty of food; this is a better result than just standing still and gaining nothing.
What is Reinforcement Learning ?

• Learning to make sequential decisions in an environment so as to maximize some notion of overall rewards acquired along the way.
• Simple Machine Learning problems have a hidden time dimension, which is often overlooked, but it becomes important in a production system.
• Reinforcement Learning incorporates time (or an extra dimension) into learning, which puts it much closer to the human perception of artificial intelligence.
What we don’t want the mouse to do?

• We do not want to hand-code the best action to take in every specific situation. That is too much work and not flexible.

• We want to find some magic set of methods that will allow our mouse to learn on its own how to avoid electricity and gather as much food as possible.

Reinforcement Learning is exactly this magic toolbox


Challenges of RL

A. Observations depend on the agent's actions. If the agent decides to do stupid things, then the observations will tell it nothing about how to improve the outcome (only negative feedback).
B. Agents need to not only exploit the policy they have learned, but also actively explore the environment. In other words, maybe by doing things differently we can significantly improve the outcome. This exploration/exploitation dilemma is one of the open fundamental questions in RL (and in my life).
C. Reward can be delayed from actions. For example, in chess it can be one single strong move in the middle of the game that shifts the balance.
RL formalisms and relations

• Agent
• Environment

Communication
channels:
• Actions,
• Reward, and
• Observations

Lapan, Maxim. Deep Reinforcement Learning Hands-On
Reward
Reward

• A scalar value obtained from the environment
• It can be positive or negative, large or small
• The purpose of the reward is to tell our agent how well it has behaved.

reinforcement = the reward reinforces the behavior

Examples:
– Cheese or electric shock
– Grades: grades are a reward system that gives you feedback about whether you are paying attention to me.
Reward (cont)

All goals can be described by the maximization of some expected cumulative reward.
The agent
The agent

An agent is somebody or something who/which interacts with the environment by executing certain actions, taking observations, and receiving eventual rewards for this.

In most practical RL scenarios, it's our piece of software that is supposed to solve some problem in a more-or-less efficient way.

Example:
You
The environment

Everything outside of an agent.

The universe!

The environment is external to the agent, and communications to and from the agent are limited to rewards, observations, and actions.
Actions

Things an agent can do in the environment.

Can be:
• moves allowed by the rules of play (if it's some game),
• or it can be doing homework (in the case of school).

They can be simple, such as move a pawn one space forward, or complicated, such as fill in the tax form for tomorrow morning.

Actions can be discrete or continuous.
Observations

The second information channel for an agent, the first being the reward.

Why a separate channel? Convenience.
RL within the ML Spectrum
What makes RL different from other ML paradigms?

● No supervision, just a reward signal from the environment
● Feedback is sometimes delayed (example: the time taken for drugs to take effect)
● Time matters: sequential data
● Feedback: the agent's actions affect the subsequent data it receives (not i.i.d.)
Many Faces of Reinforcement Learning
● Defeat a World Champion in
Chess, Go, BackGammon
● Manage an investment
portfolio
● Control a power station
● Control the dynamics of a
humanoid robot locomotion
● Treat patients in the ICU
● Automatically fly stunt manoeuvres in helicopters
Outline
What is Reinforcement Learning
RL Formalism
1. Reward
2. The agent
3. The environment
4. Actions
5. Observations
Markov Decision Process
1. Markov Process
2. Markov Reward Process
3. Markov Decision Process
Learning Optimal Policies
MDP + Formal Definitions
Markov Decision Process

More terminology we need to


learn
• state
• episode
• history
• value
• policy
Markov Process

Example:
System: Weather in Boston.
States: We can observe the current day as sunny or rainy
History: A sequence of observations over time forms a chain of states, such as

[sunny, sunny, rainy, sunny, …]
Markov Process

• For a given system we observe states

• The system changes between states according to some dynamics

• We do not influence the system, we just observe it

• There are only a finite number of states (which could be very large)

• We observe a sequence of states, or a chain => Markov chain

Markov Process (cont)

A system is a Markov Process if it fulfils the Markov property.

The future system dynamics from any state have to depend on this state only.

• Every observable state is self-contained to describe the future of the system.
• Only one state is required to model the future dynamics
of the system, not the whole history or, say, the last N
states.
Markov Process (cont)

Weather example:
The probability of a sunny day being followed by a rainy day is independent of the number of sunny days we've seen in the past.

Notes:
This example is really naïve, but it's important to understand the limitations.
We can, for example, extend the state space to include other factors.
Markov Process (cont)
Transition probabilities are expressed as a transition matrix, which is a square matrix of size N×N, where N is the number of states in our model.

          sunny   rainy
sunny      0.8     0.2
rainy      0.1     0.9
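As a small illustration (not from the original slides), the transition matrix above can be simulated directly. The state names and probabilities come from the table; everything else in this Python sketch is illustrative.

import random

# Transition matrix from the slide: row = current state, column = next state.
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.1, "rainy": 0.9},
}

def sample_chain(start, n_steps):
    """Sample a Markov chain of length n_steps + 1 starting from `start`."""
    chain = [start]
    for _ in range(n_steps):
        nxt = P[chain[-1]]
        chain.append(random.choices(list(nxt), weights=list(nxt.values()))[0])
    return chain

print(sample_chain("sunny", 10))   # e.g. ['sunny', 'sunny', 'rainy', 'rainy', ...]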
Markov Reward Process

Extend the Markov process to include rewards.

Add another square matrix which tells us the reward for going from state i to state j.

Often (but not always) the reward only depends on the landing state, so we only need one number per state.

Note: Reward is just a number: positive or negative, small or large.
Markov Reward Process (cont)

For every time point t, we define the return as the sum of subsequent rewards.

But more distant rewards should not count as much, so we multiply each reward by the discount factor γ raised to the power of the number of steps we are away from the starting point at time t.
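The equation on the original slide is not reproduced in this text; the standard discounted return it describes is

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … = Σ_{k=0..∞} γ^k · R_{t+k+1},   with 0 ≤ γ ≤ 1.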
Markov Reward Process (cont)

The return quantity is not very useful in practice, as it is defined for one specific chain. Since there are probabilities of reaching other states, the return can vary a lot depending on which path we take.

If we take the expectation of the return for any state, we get the quantity called the value of the state:
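Written out in standard notation (matching the description above):

V(s) = E[ G_t | S_t = s ],

i.e. the expected discounted return when the chain is started from state s.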
Markov Decision Process

How do we extend our Markov Reward Process to include actions?

We must add a set of actions (A), which has to be finite. This is our agent's action space.

We condition our transition matrix on the action, which means the transition matrix needs an extra action dimension => it turns into a cube.
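A minimal numpy sketch of this "matrix becomes a cube" idea; the numbers and the second action below are made up, only the shapes matter.

import numpy as np

n_states, n_actions = 3, 2

# Markov process / MRP: transitions form a single N x N matrix.
P_mrp = np.array([[0.8, 0.2, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.2, 0.8]])

# MDP: one transition matrix per action, i.e. an (actions x N x N) cube.
P_mdp = np.stack([P_mrp, np.roll(P_mrp, 1, axis=1)])    # second action is invented
assert P_mdp.shape == (n_actions, n_states, n_states)
assert np.allclose(P_mdp.sum(axis=2), 1.0)               # each slice is still a valid transition matrix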
Markov Decision Process (cont)

Lapan, Maxim. Deep Reinforcement Learning Hands-On
Markov Decision Process (cont)

By choosing an action, the agent can affect the probabilities of target states, which is GREAT to have.

Finally, to turn our MRP into an MDP, we need to add actions to our reward matrix in the same way we did with the transition matrix: our reward matrix will depend not only on the state but also on the action.

In other words, the reward the agent obtains now depends not only on the state it ends up in but also on the action that leads to this state. It's similar to putting effort into something: you usually gain skills and knowledge, even if the result of your efforts wasn't too successful.
Markov Decision Process

More terminology we need to


learn
• state ✓
• episode ✓
• history ✓
• value ✓
• policy
Policy

We are finally ready to introduce the central thing for MDPs and Reinforcement Learning:

policy

The intuitive definition of a policy is that it is some set of rules that controls the agent's behavior.
Policy (cont)

Even for fairly simple environments, we can have a variety of policies:

• Always move forward
• Try to go around obstacles by checking whether the previous forward action failed
• Choose an action randomly
Policy (cont)

Remember: the main objective of the agent in RL is to gather as much return (which was defined as the discounted cumulative reward) as possible.

Different policies can give us different returns, which makes it important to find a good policy. This is why the notion of policy is important, and it's the central thing we're looking for.
Policy (cont)

Formally, a policy is defined as the probability distribution over actions for every possible state:

𝛑(a|s) = P[A_t = a | S_t = s]

An optimal policy 𝛑* is one that maximizes the expected value function:

𝛑* = argmax𝛑 V𝛑(s)
Markov Decision Process

More terminology we need to


learn
• state ✓
• episode ✓
• history ✓
• value ✓
• policy ✓
🙌
Learning Optimal Policies

Dynamic Programming Methods (Value and Policy Iteration)
Bellman equation (deterministic)

Let's start with state S0 and take the action a_i; then the value will be the immediate reward of that action plus the value of the state it leads to (see the equation below).

So, to choose the best possible action, the agent needs to calculate the resulting values for every action and choose the maximum possible outcome. (This is not as greedy as it looks, because the values already account for future rewards.)
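The equation referenced above is not in the extracted text; for a deterministic environment the standard Bellman form is

V(S0) = max over a in A of [ r_a + γ·V_a ],

where r_a is the immediate reward of action a and V_a is the value of the state that action leads to.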
Bellman equation (stochastic)

Bellman optimality equation for the general case:
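The slide's equation image is missing from the extracted text; one standard way to write the Bellman optimality equation is

V(s) = max_a Σ_{s'} p(s'|s, a) · [ r(s, a) + γ·V(s') ].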
Value of Action Q(s,a)

● The expected total reward for taking action a in state s; it can be defined via the value V(s) (see the relation below).
● Provides a convenient form for policy optimization and for learning policies: Q-learning.

Notes:
A. The first action a is not necessarily taken from the optimal policy.
B. The expectation is there because, given the action, the next state is stochastic.
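In symbols (standard form, consistent with the definitions above):

Q(s, a) = E_{s'}[ r(s, a) + γ·V(s') ] = Σ_{s'} p(s'|s, a) · [ r(s, a) + γ·V(s') ],   and   V(s) = max_a Q(s, a).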
Dynamic Programming

● Remember that value functions are recursive.

● Dynamic Programming: break a big problem down into smaller sub-problems, solve the smaller sub-problems, store their values, and backtrack towards the bigger problems.

WORKING BACKWARDS:

(T is terminal state)
Model Based and Model Free Methods

Model Based:
Knowing the transition matrix.

Model Free:
Not knowing the transition matrix.
Model-Based Methods

Value Iteration, Policy Iteration


Value Iteration

1. Start with some arbitrary value assignments V(s)

2. Compute Q(s, a) from the current values, update V(s) = max_a Q(s, a), and repeat until the values stop changing; then read the policy off the final Q-values

INTUITION: Iteratively improve your value estimates using the Q, V relations (see the sketch below).
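A compact Python sketch of value iteration, assuming the model is given as a transition tensor P with shape (actions, states, states) and a reward table R with shape (states, actions); the names and layout are illustrative, not from the slides.

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[a, s, s2] = transition probability, R[s, a] = expected immediate reward."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                        # 1. arbitrary initial values
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')
        Q = R + gamma * np.einsum("ask,k->sa", P, V)
        V_new = Q.max(axis=1)                     # 2. V(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:       # repeat until convergence
            break
        V = V_new
    return V, Q.argmax(axis=1)                    # values and the greedy policy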
Example:
(Diagram: states S0, S1, S2 in a row; actions move left or right. Moving right from S1 to S2 gives reward +3; every other move gives reward -1.)

Actions: a1: R (right) a2: L (left)

Step 0: V(S0) = V(S1) = V(S2) = 0
Step 1 (using γ = 1 in this example):
Q(S0, a1) = R(S0, a1) + V(S1) = -1 + 0 = -1
Q(S0, a2) = R(S0, a2) + V(S0) = -1 + 0 = -1
Q(S1, a1) = R(S1, a1) + V(S2) = 3 + 0 = 3
Q(S1, a2) = R(S1, a2) + V(S0) = -1 + 0 = -1

Example (same diagram as above: S0, S1, S2; right from S1 to S2 gives +3, every other move gives -1):

Step 2:
V(S0) = max(Q(S0, a)) = -1
V(S1) = max(Q(S1, a)) = 3
p(S0) = R
p(S1) = R
Policy Iteration

1. Start with some policy.

2. Compute the value of the states V(s) using the current policy. (Policy Evaluation)

3. (Policy Improvement) Update the policy greedily with respect to these values and repeat until the policy stops changing.

INTUITION: At each step, you modify your policy by picking the action which gives you the highest Q-value (a code sketch follows below).
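A matching Python sketch of policy iteration, with the same assumed P (actions, states, states) and R (states, actions) layout as in the value iteration sketch above.

import numpy as np

def policy_iteration(P, R, gamma=0.5):
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)            # 1. start with some policy
    while True:
        # 2. Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
        P_pi = P[policy, np.arange(n_states)]          # (states, states)
        R_pi = R[np.arange(n_states), policy]          # (states,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # 3. Policy improvement: act greedily with respect to Q.
        Q = R + gamma * np.einsum("ask,k->sa", P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):         # repeat until the policy is stable
            return V, policy
        policy = new_policy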
Example (same diagram as above: S0, S1, S2; right from S1 to S2 gives +3, every other move gives -1):

Actions: a1: R (right) a2: L (left)

Policy: p(S0) = R, p(S1) = L, γ = 0.5

Step 0 (Policy Evaluation):
V(S0; p) = R(S0, a1) + γ V(S1) = -1 + 0.5 V(S1)
V(S1; p) = R(S1, a2) + γ V(S0) = -1 + 0.5 V(S0)

Solving this pair of equations:
V(S0) = -2
V(S1) = -2
Example (same diagram as above):

Step 1 (Policy Improvement):
Q(S0, a1) = -1 + ½ V(S1) = -1 + ½(-2) = -2
Q(S0, a2) = -1 + ½ V(S0) = -1 + ½(-2) = -2
Q(S1, a1) = 3 + ½ V(S2) = 3 + ½(0) = 3
Q(S1, a2) = -1 + ½ V(S0) = -1 + ½(-2) = -2
Update Policy:
Example (same diagram as above):

Update:
V(S0) = max(Q(S0, a)) = -2
V(S1) = max(Q(S1, a)) = 3
p(S0) = R
p(S1) = R
Model-Free Methods

Q-Learning and SARSA


Why Model-Free Methods ?

● Learning or providing a transition model can be hard in several scenarios.
   ○ Autonomous Driving, ICU Treatments, Stock Trading, etc.

What do you have then?

The ability to obtain a set of simulations/trajectories, with each transition in the episodes of the form (s, a, r, s').

E.g. using sensors to understand the robot's new position when it performs an action, recording new patient vitals when a drug is given in a state, etc.
On-Policy vs Off-Policy Learning

● On-Policy Learning
   ○ Evaluate policy 𝛑 when sampling experiences from 𝛑.
   ○ Learn on the job.

● Off-Policy Learning
   ○ Evaluate policy 𝛑 (target policy) while following a different policy Ѱ (behavior policy) in the environment.
   ○ Look over someone's shoulder.

Some domains prohibit on-policy learning. For instance, when treating a patient in the ICU you cannot learn about random actions by testing them out.
Q-Learning

● Start with a random Q-table (S × A). For all transitions collected according to any behavior policy, perform the TD update shown below.

● OVER-OPTIMISTIC: assumes the best things will happen from the next state onwards (hence the max operation over future Q-values); greedy.

● OFF-POLICY: Q directly approximates the optimal action-value function, independently of the policy being followed (max over all actions).
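The TD update itself (the slide's equation is not in the extracted text) is the standard Q-learning rule with learning rate α:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_{a'} Q(s', a') - Q(s, a) ].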
SARSA

● Start with a random Q-table (S × A). For all transitions (collected by acting according to 𝛑, which maximizes the current Q) perform the TD update shown below.

𝛑 = the data collection policy

● ON-POLICY learning: while learning the optimal policy, it uses the current estimate of the optimal policy to generate the behaviour.
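The corresponding SARSA update (standard form, not reproduced on the extracted slide) uses the action a' actually chosen in s' by 𝛑 instead of the max:

Q(s, a) ← Q(s, a) + α · [ r + γ · Q(s', a') - Q(s, a) ].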
Q-Learning and SARSA Algorithm

1. Start with a random Q-table (S × A).

2. Choose one of two actions:

   a. (𝜀-greedy) With probability 𝜀, choose a random action. (EXPLORATION)

   b. With probability 1-𝜀, choose the action that maximizes the Q-value from the current state. (EXPLOITATION)

3. Perform the action and collect the transition (s, a, r, s').

4. Update the Q-table using the corresponding TD update.

5. Repeat steps 2-4 until the Q-values converge across all states. (A sketch of this loop in code follows below.)
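A minimal Python sketch of steps 1-5 for Q-learning. It assumes an environment object with reset() -> s and step(a) -> (s', r, done); adapt these calls to whatever environment you use. Replacing the max in the target with Q[s2, a2] for the action actually chosen in s2 turns this into SARSA.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))               # 1. initial Q-table
    for _ in range(episodes):
        s, done = env.reset(), False                  # assumed interface
        while not done:
            if np.random.rand() < eps:                # 2a. explore
                a = np.random.randint(n_actions)
            else:                                     # 2b. exploit
                a = int(np.argmax(Q[s]))
            s2, r, done = env.step(a)                 # 3. collect (s, a, r, s')
            # 4. TD update (Q-learning target; use Q[s2, a2] for SARSA).
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q                                          # 5. repeat until convergence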
Q-Learning vs SARSA

Demo:
https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/

● Q-Learning converges faster since the Q-values directly try to approximate the optimal value.
● Q-Learning is riskier since it is over-optimistic about what happens in the future, which can be a problem for real-life tasks such as robot navigation.
Parametric Q-Learning

● It is often hard to learn Q-values in tabular form, e.g. with a huge number of states, continuous state spaces, etc.
● Parametrize Q(s,a) using any function approximator f (linear model, neural network, etc.) and do the usual Q-learning.

Q(s,a) = f(s,a;𝜭), where 𝜭 are the model params

Example: image frames in a game - use ConvNets to parametrize Q(s,a)
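A toy sketch of the parametrization idea with a linear f(s, a; 𝜭): states are assumed to already be feature vectors, and 𝜭 holds one weight row per action. With image frames, a ConvNet would replace the linear map; nothing here is from the original slides.

import numpy as np

n_features, n_actions = 8, 4
theta = np.zeros((n_actions, n_features))         # 𝜭: one weight row per action

def q_values(state_features):
    # Q(s, a) = f(s, a; 𝜭): here a linear function of the state features.
    return theta @ state_features                  # shape: (n_actions,)

def td_update(s, a, r, s2, alpha=0.01, gamma=0.99):
    # Semi-gradient Q-learning: adjust the parameters instead of a table cell.
    target = r + gamma * np.max(q_values(s2))
    error = target - q_values(s)[a]
    theta[a] += alpha * error * s                  # gradient of the linear Q w.r.t. theta[a]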
