5.4-Reinforcement Learning-Part2-Learning-Algorithms
A Passive Agent executes a fixed policy and evaluates it. The agent
simply watches the world going by and tries to learn the utilities of
being in various states.
Active learning
In the Planning DP case an agent could conduct off-policy planning, that is,
formulate a policy without needing to interact with the world.
In off-policy methods, the agent's policy used to choose its actions is called the behavior policy, which may be unrelated to the policy that is evaluated and improved, called the estimation policy.
Exploitation actions: prefer past actions that have been found to be effective at producing reward, thereby exploiting what is already known. The Planning DP case lies entirely in this category.
Exploration actions: prefer trying untested actions, to discover new and potentially more reward-producing actions. Exploration is typical of the true reinforcement learning cases.
Managing the trade-off between exploration and exploitation in its policies is a critical issue in
RL algorithms.
• One guideline could be to explore more when knowledge is weak and exploit more when
we have gained more knowledge.
• One method: ε-greedy. With probability 1−ε the agent exploits, i.e. chooses the action currently believed to be best, and with probability ε it explores, i.e. chooses an action at random. ε is a hyper-parameter controlling the balance between exploitation and exploration; a minimal sketch is given below.
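A minimal sketch of ε-greedy action selection in Python (the dictionary Q mapping (state, action) pairs to value estimates, and the list of available actions, are assumptions of this sketch):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Epsilon-greedy action selection.
    With probability epsilon: explore (pick a random action).
    Otherwise: exploit (pick the action with the highest current Q-value)."""
    if random.random() < epsilon:
        return random.choice(actions)                           # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploitation
```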
Model-free versus Model-based Reinforcement Learning
When a complete model of the environment is NOT available to the agent, i.e. the transition model T(s'|s,a) and/or the reward function R(s'|s,a) are unknown, RL offers two different approaches.
Model-based RL algorithms.
One approach is to try to learn an adequate model of the environment and then fall back on the planning techniques used when a complete model is available in order to find a policy.
That is, if the agent is currently in state s, takes action a, and then observes the environment transition to state s' with reward r, that observation can be used to improve its estimates of T(s'|s,a) and R(s'|s,a) through supervised learning techniques.
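A minimal sketch of such model learning by counting (all names here are assumptions of the sketch): T(s'|s,a) is estimated as the relative frequency of observed transitions and R(s'|s,a) as the mean observed reward.

```python
from collections import defaultdict

transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
reward_sums = defaultdict(float)                           # (s, a, s') -> summed rewards
reward_counts = defaultdict(int)                           # (s, a, s') -> number of samples

def record(s, a, s_next, r):
    """Update the empirical model after observing (s, a) -> s_next with reward r."""
    transition_counts[(s, a)][s_next] += 1
    reward_sums[(s, a, s_next)] += r
    reward_counts[(s, a, s_next)] += 1

def T_hat(s, a, s_next):
    """Empirical estimate of T(s'|s,a): relative frequency of s_next after (s, a)."""
    total = sum(transition_counts[(s, a)].values())
    return transition_counts[(s, a)][s_next] / total if total else 0.0

def R_hat(s, a, s_next):
    """Empirical estimate of R(s'|s,a): mean reward observed for (s, a, s')."""
    n = reward_counts[(s, a, s_next)]
    return reward_sums[(s, a, s_next)] / n if n else 0.0
```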
Model-free algorithms.
However, a model of the environment is not necessary for finding a good policy.
One of the classic examples is Q-learning, which directly estimates the optimal so-called Q-values of each action in each state (closely related to the utility of taking that action in that state), from which a policy may be derived by choosing the action with the highest Q-value in the current state.
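A minimal sketch of the Q-learning update rule as described above (the learning rate alpha, discount gamma and the defaultdict representation of Q are assumptions of this sketch):

```python
from collections import defaultdict

Q = defaultdict(float)   # (state, action) -> estimated Q-value
alpha, gamma = 0.1, 0.9  # learning rate and discount factor (assumed values)

def q_update(s, a, r, s_next, actions):
    """One Q-learning update after observing the transition (s, a, r, s_next):
    move Q(s, a) toward the target r + gamma * max_a' Q(s_next, a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

A greedy policy is then derived by picking, in the current state, the action with the highest Q-value.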
Solving a Reinforcement Learning Problem
Model-based approaches
Model-free approaches
The so-called 'reward-to-go' of a state s is the sum of the (discounted) rewards from that state until a terminal state is reached. The estimated value of the state is based on the observed rewards-to-go.
Direct Utility Estimation keeps a running average of the observed rewards-to-go for each state s:
Value(s) = average of the rewards-to-go of s over all episodes e in the sample
As the number of trials goes to infinity, the sample average converges to the true utility for state s.
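A minimal sketch of Direct Utility Estimation under these definitions (representing an episode as a list of (state, reward) pairs is an assumption of this sketch):

```python
from collections import defaultdict

returns = defaultdict(list)   # state -> list of observed rewards-to-go
V = {}                        # state -> current value estimate (running average)

def direct_utility_update(episode, gamma=1.0):
    """episode: list of (state, reward) pairs in visiting order.
    Record the discounted reward-to-go of every visit and re-average."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G            # reward-to-go from this state
        returns[state].append(G)
        V[state] = sum(returns[state]) / len(returns[state])
```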
The strategy in ADP (Adaptive Dynamic Programming) is to first complete the partially known MDP model and then treat it as complete, applying the Dynamic Programming technique as in the planning case with complete knowledge.
V(s) := ∑_{s'} T(s'|s,a) * ( R(s'|s,a) + γ * V(s') )
(The transition model T and the reward function R are the components to be learned.)
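A minimal sketch of one such Dynamic Programming sweep over the estimated model (it re-uses the hypothetical T_hat and R_hat from the model-learning sketch above; states, policy and gamma are assumed inputs):

```python
def adp_sweep(V, states, policy, gamma=0.9):
    """One Bellman-update sweep for a fixed policy, treating the learned
    model estimates T_hat and R_hat as if they were the true model.
    V: dict mapping each state to its current value estimate."""
    for s in states:
        a = policy[s]
        V[s] = sum(T_hat(s, a, s2) * (R_hat(s, a, s2) + gamma * V[s2])
                   for s2 in states)
    return V
```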
Monte Carlo (MC) simulation applies Monte Carlo methods, i.e. estimation from repeated random sampling, to the learning problem.
MC takes the mean return of each state over the sampled episodes. The value estimate of a state is set to the iterative mean of all empirical returns (not expected returns) observed for that state.
First-visit MC: average returns only for the first time s is visited in an episode.
Every-visit MC: average returns for every time s is visited in an episode.
Algorithm for first-visit Monte Carlo
1. Initialize the policy and the state-value function arbitrarily; set Returns(s) ← empty list for every state s.
2. Generate an episode E by following the policy.
3. For each state s in E
Begin
4. If this is the first occurrence of s in E, add the return received from s onwards to Returns(s).
5. Calculate the iterative mean (average) over all returns in Returns(s).
6. Set the value of s to that computed average.
End
Repeat steps 2-6 until convergence.
It is convenient to convert the mean return into an incremental update, so that the mean can be updated after each episode and the progress made with each episode can be followed.
V(S_t) is incrementally updated for each visited state S_t using its return G_t:
V(S_t) ← V(S_t) + (1/N(S_t)) * (G_t − V(S_t))
where N(S_t) is the number of (first) visits to S_t so far.
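A minimal sketch of first-visit MC with this incremental update (the (state, reward) episode representation and the counter N are assumptions of this sketch):

```python
from collections import defaultdict

V = defaultdict(float)   # state -> value estimate
N = defaultdict(int)     # state -> number of first visits counted so far

def first_visit_mc_update(episode, gamma=1.0):
    """episode: list of (state, reward) pairs.
    Apply V(S_t) <- V(S_t) + (1/N(S_t)) * (G_t - V(S_t)) at each first visit."""
    # Compute the return G_t following every time step.
    G, returns_in_episode = 0.0, []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        returns_in_episode.append((state, G))
    returns_in_episode.reverse()
    # Update each state only at its first visit in the episode.
    seen = set()
    for state, G in returns_in_episode:
        if state not in seen:
            seen.add(state)
            N[state] += 1
            V[state] += (G - V[state]) / N[state]
```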
Example of Monte Carlo Simulation
An undiscounted Markov Reward Process with two states A and B
Transition matrix and reward function are unknown
Two sample episodes, E1 and E2, are given, and the values of states A and B are then computed with both first-visit MC and every-visit MC.
Temporal difference (TD) learning is a class of model-free reinforcement learning methods which learn
by bootstrapping from the current estimate of the value function.
TD algorithms:
- sample from the environment, like Monte Carlo simulations and
- perform updates based on current estimates like adaptive dynamic programming methods.
While Monte Carlo methods only adjust their estimates once the final outcome is known, TD methods adjust predictions before the final outcome is known. Typically, TD algorithms adjust the estimated utility value of the current state s based on its immediate reward and the estimated value of the next state s'. The term 'temporal' is motivated by the temporal relation between states s and s'.
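A minimal sketch of the TD(0) update this describes (the learning rate alpha and discount gamma are assumed hyper-parameter values; V is a dictionary of value estimates):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update: move the estimate for s toward the bootstrapped
    target r + gamma * V(s'), before the final outcome of the episode is known."""
    v_s, v_next = V.get(s, 0.0), V.get(s_next, 0.0)
    V[s] = v_s + alpha * (r + gamma * v_next - v_s)
    return V
```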