
REINFORCEMENT LEARNING
(part 2)

Nguyen Do Van, PhD


Reinforcement Learning

§ Online Learning
§ Value Function Approximation
§ Policy Gradients

2
Reinforcement learning: Recall

§ Making good decisions on new tasks: a fundamental challenge in AI and ML
§ Learn to make a good sequence of decisions
§ Intelligent agents learning and acting
q Learning by trial-and-error, in real time
q Improve with experience
q Inspired by psychology:
• Agents + environment
• Agents select actions to maximize cumulative reward

3
Reinforcement learning: Recall

§ At each step t the agent:


q Executes action At
q Receives observation Ot
q Receives scalar reward Rt
§ The environment:
q Receives action At
q Emits observation Ot+1
q Emits scalar reward Rt+1
§ t increments at environment step
4
Reinforcement learning: Recall

§ Policy - maps current state to action


§ Value function - prediction of value for each state and action
§ Model - agent’s representation of the environment.

5
Markov Decision Process
(Model of the environment)

§ Terminologies:

6
Bellman’s equation

§ State value function (for a fixed policy with discount)

• State-action value function (Q-function)

• When S is a finite set of states, this is a system of linear equations (one per state)
• Bellman's equation in matrix form:
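For reference, the standard forms of these equations are:

v_\pi(s) = E_\pi[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s ]

q_\pi(s,a) = E_\pi[ R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a ]

Matrix form: v_\pi = R^\pi + \gamma P^\pi v_\pi, with direct solution v_\pi = (I - \gamma P^\pi)^{-1} R^\pi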
7
Optimal Value, Q and policy

§ Optimal V: the highest possible value for each s under any possible policy
§ Satisfies the Bellman equation:

§ Optimal Q-function:

§ Optimal policy:
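In standard notation, the optimal quantities are:

v_*(s) = \max_\pi v_\pi(s) = \max_a q_*(s,a)

q_*(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} q_*(s',a')

\pi_*(s) = \arg\max_a q_*(s,a)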

8
Dynamic Programming (DP)

§ Assuming full knowledge of Markov Decision Process


§ It is used for planning in an MDP
§ For prediction
q Input: MDP (S,A,P,R,γ) and policy π
q Output: value function vπ
§ For control
q Input: MDP (S,A,P,R,γ)
q Output: optimal value function v* and optimal policy π*
9
ONLINE LEARNING
Model-free Reinforcement Learning
Partially observable environment, Monte Carlo, TD, Q-Learning

10
Monte-Carlo Reinforcement Learning

§ MC methods learn directly from episodes of experience
§ MC is model-free: no knowledge of MDP transitions / rewards
§ MC learns from complete episodes: no bootstrapping
§ MC uses the simplest possible idea: value = mean return
§ Caveat:
q Can only apply MC to episodic MDPs
q All episodes must terminate

11
Monte-Carlo Policy Evaluation

§ Goal: learn vπ from episodes of experience under policy π

§ Total discounted reward

§ Value function is expected return

§ Monte-Carlo policy evaluation uses empirical mean return instead of expected return
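Written out, the return and the value function are:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

v_\pi(s) = E_\pi[ G_t \mid S_t = s ]

Monte-Carlo estimates this expectation by averaging the returns actually observed from state s.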

12
State Visit Monte-Carlo Policy Evaluation

§ To evaluate state s
§ At each time-step t that state s is visited in an episode
q Count either the first visit or every visit to s
§ Increase counter N(s) = N(s) + 1
§ Increase total return S(s) = S(s) + Gt
§ Value is estimated by mean return V(s) = S(s)/N(s)
§ By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
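A minimal Python sketch of first-visit Monte-Carlo evaluation; the episode generator and its (state, reward) format are assumptions made here for illustration, not part of the original slides:

from collections import defaultdict

def mc_first_visit(sample_episode, num_episodes, gamma=0.99):
    # sample_episode() is assumed to return [(S0, R1), (S1, R2), ...] generated under pi
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # total returns S(s)
    V = defaultdict(float)    # value estimates V(s)
    for _ in range(num_episodes):
        episode = sample_episode()
        # compute returns backwards: G_t = R_{t+1} + gamma * G_{t+1}
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s in seen:          # first-visit: count s only once per episode
                continue
            seen.add(s)
            N[s] += 1
            S[s] += returns[t]
            V[s] = S[s] / N[s]     # V(s) = S(s) / N(s)
    return V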

13
Incremental Monte-Carlo Updates

§ Learning from experience


§ Update V(s) incrementally after each complete episode
§ For each state St, with actual return Gt

§ With learning rate α
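Written out, the incremental update (running mean, then constant step size α) is:

V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} ( G_t - V(S_t) )

V(S_t) \leftarrow V(S_t) + \alpha ( G_t - V(S_t) )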

14
Temporal-Difference Learning

§ Model-free: no knowledge of MDP


§ Does not wait for the final outcome: learns from incomplete episodes by bootstrapping
§ Update value V(St) toward the estimated return
TD Target

TD error
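Written out, the TD(0) update is:

V(S_t) \leftarrow V(S_t) + \alpha ( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) )

where R_{t+1} + \gamma V(S_{t+1}) is the TD target and \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) is the TD error.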
15
Monte-Carlo and Temporal Difference

§ TD can learn before knowing the final outcome


q TD can learn online after every step
q MC must wait until end of episode before return is known
§ TD can learn without the final outcome
q TD can learn from incomplete sequences
q MC can only learn from complete sequences
q TD works in continuing (non-terminating) environments
q MC only works for episodic (terminating) environments

16
Monte-Carlo Backup

17
Temporal-Difference Backup

18
Dynamic Programming Backup

19
Bootstrapping and Sampling

§ Bootstrapping: update involves an estimate
q MC does not bootstrap
q DP bootstraps
q TD bootstraps
§ Sampling: update samples an
expectation
q MC samples
q DP does not sample
q TD samples

20
N-step prediction

§ n-step return

§ Define n-step return

§ n-step temporal-difference learning
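In standard form, the n-step return and the corresponding update are:

G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})

V(S_t) \leftarrow V(S_t) + \alpha ( G_t^{(n)} - V(S_t) )

Note that n = 1 recovers TD(0) and n = ∞ (the full episode) recovers Monte-Carlo.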

21
On-policy Learning

§ Advantage of TD:
q Lower variance
q Online
q Learns from incomplete sequences
§ Sarsa:
q Apply TD to Q(S,A)
q Use policy improvement, e.g. ϵ-greedy
q Update every time-step

22
Sarsa Algorithm
Initialize Q(s,a) arbitrarily and Q(terminal-state, ·) = 0
Repeat (for each episode)
Initialize S
Choose A from S using Q (e.g. ϵ-greedy)
Repeat (for each step of episode)
Take A, observe R, S'
Choose A' from S' using Q (e.g. ϵ-greedy)

Until S is terminal
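The per-step update applied between choosing A' and the next iteration is the standard Sarsa rule, followed by S ← S', A ← A':

Q(S,A) \leftarrow Q(S,A) + \alpha ( R + \gamma Q(S',A') - Q(S,A) )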

23
Off-Policy Learning

§ Evaluate target policy π(a|s) to compute vπ(s) or qπ(s,a) while following behaviour policy μ(a|s):
S1, A1, R2, … , ST ~ μ
§ Advantages:
q Learning from observing human or other agents
q Reuse experience generated from old policies π1 , π2 , π3 , … , πt−1
q Learn about optimal policy while following exploratory policy
q Learn about multiple policies while following one policy

24
Q-Learning
§ Off-policy learning of the action-value function Q(s,a)
§ No importance sampling is required
§ The next action is chosen using the behaviour policy
§ Q-Learning: consider an alternative successor action from the greedy target policy
§ Update Q(St,At) towards the value of the alternative action

§ Improve the policy greedily with respect to Q
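Concretely, with the greedy alternative successor action, the update is:

Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha ( R_{t+1} + \gamma \max_{a'} Q(S_{t+1},a') - Q(S_t,A_t) )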

25
Q-Learning

Update equation

Algorithm

Q-Learning Table version
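A minimal Python sketch of the table version; the env object and its reset()/step()/actions interface are assumptions made for illustration, not part of the original slides:

import random
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    # env is assumed to expose: reset() -> state, step(a) -> (next_state, reward, done),
    # and a list of discrete actions env.actions (hypothetical interface)
    Q = defaultdict(float)                       # Q-table: (state, action) -> value

    def behaviour_policy(state):                 # epsilon-greedy behaviour policy
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = behaviour_policy(state)
            next_state, reward, done = env.step(action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            # off-policy target uses the greedy (max) action, not the action taken next
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q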


26
Visualization and Codes

§ https://cs.stanford.edu/people/karpathy/reinforcejs/index.html

27
VALUE FUNCTION APPROXIMATION
State representation in complex environments
Linear Function Approximation
Gradient Descent and Update rules

28
Function approximation

29
Type of Value Function Approximation

§ Differentiable function
approximation
q Linear combination of features
• Robots: distance from checkpoint, target, dead mark, wall
• Business Intelligence Systems:
Trends in stock market
q Neural Network
• Deep Q Learning
§ Training strategies
30
Value Function by Stochastic Gradient Descent

§ Goal: find parameter vector w minimizing the mean-squared error between the approximate value function and the true state-value function vπ(s)

§ Gradient descent finds a local minimum

§ Stochastic gradient descent samples the gradient
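In symbols, with parameter vector w and approximate value function \hat{v}(S,w):

J(w) = E_\pi[ ( v_\pi(S) - \hat{v}(S,w) )^2 ]

\Delta w = \alpha E_\pi[ ( v_\pi(S) - \hat{v}(S,w) ) \nabla_w \hat{v}(S,w) ]   (full gradient step)

\Delta w = \alpha ( v_\pi(S) - \hat{v}(S,w) ) \nabla_w \hat{v}(S,w)   (sampled / stochastic step)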

31
Linear Value Function Approximation

§ Represent state by a feature vector


§ Feature examples:
q Distance to obstacle by lidar
q Angle to target
q Energy level of robot
§ Represent a value function by a linear combination of features
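With feature vector x(S) = (x_1(S), \ldots, x_n(S))^\top, the linear approximation is:

\hat{v}(S,w) = x(S)^\top w = \sum_{j=1}^{n} x_j(S) w_j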

32
Linear Value Function Approximation

§ Objective function is quadratic in the parameters w

§ Stochastic gradient descent converges to the global optimum


§ Update rule:
Update = step-size × prediction error × feature value
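For the linear case the gradient is simply the feature vector, \nabla_w \hat{v}(S,w) = x(S), so the update rule reads:

\Delta w = \alpha ( v_\pi(S) - \hat{v}(S,w) ) x(S)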

33
Incremental Prediction Algorithms

§ So far the true value function vπ(s) was assumed to be given by a supervisor
§ In reinforcement learning there are only rewards instead
§ In online learning (practice), a target for vπ(s) is used
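Substituting the practical targets gives the standard updates:

MC:    \Delta w = \alpha ( G_t - \hat{v}(S_t,w) ) \nabla_w \hat{v}(S_t,w)

TD(0): \Delta w = \alpha ( R_{t+1} + \gamma \hat{v}(S_{t+1},w) - \hat{v}(S_t,w) ) \nabla_w \hat{v}(S_t,w)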

34
Control with Value Function

§ Policy evaluation: approximate policy evaluation
§ Policy improvement: ϵ-greedy policy improvement

35
Action-Value Function Approximation

§ Approximate the action-value function (Q-value)

§ Minimize the mean-squared error between the approximate action-value function and the true action-value function qπ(s,a)

§ Use stochastic gradient descent to find a local minimum

36
Linear Action-Value Function Approximation

§ Represent state and action by a feature vector


§ Represent the action-value function by a linear combination of features

§ Stochastic gradient descent update

§ Use a target in place of the true action value in practice:

q MC
q TD
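With feature vector x(S,A), the corresponding forms are:

\hat{q}(S,A,w) = x(S,A)^\top w

\Delta w = \alpha ( q_\pi(S,A) - \hat{q}(S,A,w) ) x(S,A)

with q_\pi(S,A) replaced in practice by the MC target G_t or the TD target R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1},w).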

37
POLICY GRADIENT

38
Policy-Based Reinforcement Learning

§ Last part: value (and action-value) functions were approximated by parameterized functions:

§ Generate a policy from the value function, e.g. using ϵ-greedy


§ In this part: directly parameterize the policy

§ Effective in high-dimensional or continuous action spaces


39
Policy Objective Functions

§ Goal: given a policy πθ(s,a) with parameters θ, find the best θ

§ Define an objective function J(θ) to measure the quality of the policy
q In episodic environments, objective function is the start value

q In continuing environments, objective function is average value

q Or average reward per time-step
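In standard notation, with d^{\pi_\theta}(s) the stationary distribution of states under \pi_\theta:

Start value:   J_1(\theta) = V^{\pi_\theta}(s_1)

Average value: J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta}(s)

Average reward per time-step: J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a) R(s,a)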

40
Policy Optimization

§ Policy-based RL is an optimization problem

§ Find θ that maximizes the objective function J(θ)
§ Policy gradient algorithms search for a local maximum of J(θ) by ascending the gradient of the policy
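The gradient-ascent step is:

\Delta\theta = \alpha \nabla_\theta J(\theta)

where \nabla_\theta J(\theta) is the policy gradient and \alpha is a step-size parameter.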

41
Monte-Carlo Policy Gradient (REINFORCE)

§ Update parameters by stochastic gradient ascent

§ Using the policy gradient theorem
§ Using the return vt as an unbiased sample of Qπθ(st, at)
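Combining the two, the REINFORCE update applied at every step t of a sampled episode is:

\nabla_\theta J(\theta) = E_{\pi_\theta}[ \nabla_\theta \log \pi_\theta(s,a) Q^{\pi_\theta}(s,a) ]   (policy gradient theorem)

\Delta\theta_t = \alpha \nabla_\theta \log \pi_\theta(s_t,a_t) v_t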

42
Reducing Variance using a Critic

§ Monte-Carlo policy gradient has high variance

§ A critic is used to estimate the action-value function
§ Actor-critic algorithms maintain two sets of parameters:

q Critic: updates action-value function parameters w
q Actor: updates policy parameters θ, in the direction suggested by the critic
§ Actor-critic algorithms follow an approximate policy gradient
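The approximate policy gradient followed by the actor is:

\nabla_\theta J(\theta) \approx E_{\pi_\theta}[ \nabla_\theta \log \pi_\theta(s,a) Q_w(s,a) ]

\Delta\theta = \alpha \nabla_\theta \log \pi_\theta(s,a) Q_w(s,a)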

43
Action-Value Actor-Critic

§ Simple actor-critic algorithm based on an action-value critic
§ Using linear action-value function approximation Qw(s,a)
§ Critic: update w by linear TD(0)
§ Actor: update θ by policy gradient
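A sketch of the per-step updates, assuming a linear critic Q_w(s,a) = \phi(s,a)^\top w, TD error \delta, and step sizes \alpha (actor) and \beta (critic):

\delta = R + \gamma Q_w(S',A') - Q_w(S,A)

Critic: w \leftarrow w + \beta \delta \phi(S,A)

Actor:  \theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(S,A) Q_w(S,A)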
44
Recap on Reinforcement Learning 02

§ Online Learning
q Model-free Reinforcement Learning
q Partially observable environments
q Monte Carlo
q Temporal Difference
q Q-Learning
§ Value Function Approximation
q State representation in complex environments
q Linear Function Approximation
q Gradient Descent and Update rules
§ Policy Gradient
q Objective Function
q Gradient Ascent
q REINFORCE, Actor-Critic

45
Questions?

THANK YOU!

46
