Reinforcement Learning
Reinforcement Learning
• Introduction
• Passive Reinforcement Learning
• Temporal Difference Learning
• Active Reinforcement Learning
• Applications
• Summary
Example Class
Reinforcement Learning:
Situation → Reward → Situation → Reward → …
Ping-pong:
Reward on each point scored
Animals:
Hunger and pain – negative reward
Food intake – positive reward
Framework: Agent in State Space
Example: XYZ-World (remark: no terminal states)
[Figure: XYZ-World, a state space of 10 states linked by deterministic actions (n, s, e, w, ne, nw, sw) plus a stochastic action x with outcome probabilities 0.3/0.7; rewards: R=+5 in state 3, R=+3 in state 5, R=-9 in state 6, R=+4 in state 8, R=-6 in state 9]
Problem: What actions should an agent choose to maximize its rewards?
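As a rough illustration of this framework, here is a minimal agent-environment loop sketch; the tiny reward table, transition table, and random policy below are hypothetical stand-ins, not the full XYZ-World.

import random

rewards = {1: 0, 2: 0, 3: +5, 5: +3, 6: -9}            # R(s); state 4 defaults to 0 below
transitions = {                                        # deterministic moves (illustrative only)
    1: {"e": 2}, 2: {"e": 3, "s": 5}, 3: {"s": 6},
    4: {"n": 1}, 5: {"w": 4, "ne": 6}, 6: {"w": 5},
}

def run_episode(start_state, policy, steps=20):
    # Follow a policy for a fixed number of steps and sum the collected rewards.
    s, total = start_state, 0
    for _ in range(steps):
        a = policy(s)                                  # agent chooses an action in state s
        s = transitions[s][a]                          # environment moves to the successor state
        total += rewards.get(s, 0)                     # ... and hands out that state's reward
    return total

random_policy = lambda s: random.choice(list(transitions[s].keys()))
print(run_episode(1, random_policy))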
XYZ-World: Discussion Problem 12 (TD for P vs. Bellman)
[Figure: XYZ-World annotated with value pairs (legend "Bellman TD P") at state 3: (3.3, 0.5), state 8: (3.2, -0.5), state 10: (0.6, -0.2)]
I tried hard, but: any better explanations?
Explanation of discrepancies between TD for P and Bellman (a TD sketch follows this list):
• Most significant discrepancies in states 3 and 8; minor in state 10
• P chooses the worst successor of state 8; it should apply operator x instead
• P should apply w in state 6, but does so in only 2/3 of the cases, which affects the utility of state 3
• The low utility value of state 8 in TD seems to lower the utility value of state 10; only a minor discrepancy
P: 1-2-3-6-5-8-6-9-10-8-6-5-7-4-1-2-5-7-4-1
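A minimal sketch of TD policy evaluation for a fixed trajectory such as the one P generates above, using the update convention U(s) ← U(s) + α(R(s) + g*U(s') - U(s)); the learning rate, replay scheme, and number of sweeps are illustrative assumptions.

from collections import defaultdict

rewards = {3: +5, 5: +3, 6: -9, 8: +4, 9: -6}   # R(s) for the rewarding XYZ-World states
trajectory = [1, 2, 3, 6, 5, 8, 6, 9, 10, 8, 6, 5, 7, 4, 1, 2, 5, 7, 4, 1]  # policy P above

def td_policy_evaluation(traj, gamma=0.2, alpha=0.1, sweeps=500):
    U = defaultdict(float)
    for _ in range(sweeps):                      # replay the sampled trajectory repeatedly
        for s, s_next in zip(traj, traj[1:]):
            U[s] += alpha * (rewards.get(s, 0) + gamma * U[s_next] - U[s])
    return dict(U)

print(td_policy_evaluation(trajectory))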
XYZ-World: Discussion Problem 12 --- Bellman Update, g=0.2
[Figure: XYZ-World with the Bellman-update utilities for g=0.2: U(1)=0.145, U(2)=0.72, U(3)=0.58, U(4)=0.03, U(5)=3.63, U(6)=-8.27, U(7)=0.001, U(8)=3.17, U(9)=-5.98, U(10)=0.63]
Discussion on using the Bellman Update for Problem 12 (a sketch of the update follows the list):
• No convergence for g=1.0; utility values seem to run away!
• State 3 has utility 0.58 although it gives a reward of +5, due to the immediate penalty that follows; we were able to detect that.
• Did anybody run the algorithm for other values of g, e.g. 0.4 or 0.6? If yes, did it converge to the same values?
• Speed of convergence seems to depend on the value of g.
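A minimal sketch of the Bellman update with discount g, iterated until the utilities stop changing; the three-state transition model at the bottom is hypothetical, since the full XYZ-World transition table is not reproduced on these slides.

def bellman_update(states, rewards, model, gamma=0.2, eps=1e-6):
    # U(s) <- R(s) + g * max_a sum_s' P(s'|s,a) * U(s'), swept over all states until convergence.
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * U[s2] for s2, p in outcomes) for outcomes in model[s].values())
            new_u = rewards.get(s, 0) + gamma * best
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < eps:        # converged; for g=1.0 this point may never be reached
            return U

# Hypothetical 3-state chain: action "go" moves right with probability 0.7, "stay" loops.
states = [1, 2, 3]
rewards = {3: +5}
model = {
    1: {"stay": [(1, 1.0)], "go": [(2, 0.7), (1, 0.3)]},
    2: {"stay": [(2, 1.0)], "go": [(3, 0.7), (2, 0.3)]},
    3: {"stay": [(3, 1.0)]},
}
print(bellman_update(states, rewards, model))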
XYZ-World: Discussion Problem 12 (TD vs. TD with inverse R)
[Figure: XYZ-World annotated with value pairs (legend "TD, TD inverse R") at state 3: (0.57, -0.65), state 5: (2.98, -2.99), state 8: (-0.50, 0.47), state 10: (-0.18, -0.12)]
Other observations:
• The Bellman update did not converge for g=1.
• The Bellman update converged very fast for g=0.2.
• Did anybody try other values for g (e.g. 0.6)?
• The Bellman update suggests a utility value of 3.6 for state 5; what does this tell us about the optimal policy? E.g., is 1-2-5-7-4-1 optimal?
• TD reversed the utility values quite neatly when the rewards were inverted; x became -x+u with u in [-0.08, 0.08].
• P: 1-2-3-6-5-8-6-9-10-8-6-5-7-4-1-2-5-7-4-1
XYZ-World --- Other Considerations
• R(s) might be known in advance or might have to be learned.
• R(s) might be probabilistic or deterministic.
• R(s) might change over time --- the agent has to adapt.
• The results of actions might be known in advance or might have to be learned; the results of actions can be fixed, or may change over time.
• Introduction
• Passive Reinforcement Learning
• Temporal Difference Learning
• Active Reinforcement Learning
• Applications
• Summary
[Figure: grid world with learned state utilities, e.g. (1,1) U = 0.72, (2,1) U = 0.68, …]
Learn to map states to utilities.
Bellman Equation (written out below)
[Figure: neighboring grid cells (1,3) and (2,3) with U(1,3) = 0.84 and U(2,3) = 0.92]
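A standard form of the Bellman equation named above, as a sketch in the notation used elsewhere in these slides (R(s) the reward of state s, γ the discount factor called g in the discussion, P(s'|s,a) the transition model):

U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a) \, U(s')

For neighboring cells such as (1,3) and (2,3), the equation ties U(1,3) = 0.84 to the utilities of the states reachable from (1,3), such as U(2,3) = 0.92.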
• Introduction
• Passive Reinforcement Learning
• Temporal Difference Learning
• Active Reinforcement Learning
• Applications
• Summary
Q-Learning update:
Q(a,s) ← Q(a,s) + α [ R(s) + γ * max_a' Q(a',s') - Q(a,s) ]
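A minimal Q-learning sketch that implements this update; the environment interface (ChainEnv with actions/reset/step) and the ε-greedy exploration scheme are illustrative assumptions, not part of the slides.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.2, epsilon=0.1):
    Q = defaultdict(float)                           # Q[(a, s)], initialized to 0
    for _ in range(episodes):
        s, r = env.reset()                           # current state s and its reward R(s)
        for _ in range(50):                          # bounded episode length (no terminal states)
            if random.random() < epsilon:            # explore occasionally ...
                a = random.choice(env.actions(s))
            else:                                    # ... otherwise act greedily w.r.t. Q
                a = max(env.actions(s), key=lambda act: Q[(act, s)])
            s2, r2 = env.step(s, a)                  # successor state s' and its reward
            best_next = max(Q[(a2, s2)] for a2 in env.actions(s2))
            Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])   # the update above
            s, r = s2, r2
    return Q

class ChainEnv:
    # Hypothetical 4-state chain: "right" moves toward state 4 (reward +5), "left" moves back.
    def actions(self, s):
        return ["left", "right"]
    def reset(self):
        return 1, 0
    def step(self, s, a):
        s2 = min(s + 1, 4) if a == "right" else max(s - 1, 1)
        return s2, (+5 if s2 == 4 else 0)

Q = q_learning(ChainEnv())
print(Q[("right", 3)], Q[("left", 3)])               # "right" should look better near the reward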
• Introduction
• Passive Reinforcement Learning
• Temporal Difference Learning
• Active Reinforcement Learning
• Applications
• Summary
• Introduction
• Passive Reinforcement Learning
• Temporal Difference Learning
• Active Reinforcement Learning
• Applications
• Summary