
REINFORCEMENT LEARNING
(part 2)

Nguyen Do Van, PhD


Reinforcement Learning

§ Online Learning
§ Value Function Approximation
§ Policy Gradients

2
Reinforcement learning: Recall

§ Making good decisions on new tasks: a fundamental challenge in AI and ML
§ Learn to make a good sequence of decisions
§ Intelligent agents learning and acting
q Learning by trial-and-error, in real time
q Improve with experience
q Inspired by psychology:
• Agents + environment
• Agents select actions to maximize cumulative reward

3
Reinforcement learning: Recall

§ At each step t the agent:


q Executes action At
q Receives observation Ot
q Receives scalar reward Rt
§ The environment:
q Receives action At
q Emits observation Ot+1
q Emits scalar reward Rt+1
§ t increments at environment step
4
Reinforcement learning: Recall

§ Policy - maps current state to action


§ Value function - prediction of value for each state and action
§ Model - agent’s representation of the environment.

5
Markov Decision Process
(Model of the environment)

§ Terminologies:

6
Bellman’s equation

§ State value function (for a fixed policy with discount)

• State-action value function (Q-function)

• When S is a finite set of states, this is a system of linear equations (one per state)
• Bellman's equation in matrix form:
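For reference, the standard forms of these equations are:

v_\pi(s) = E_\pi[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s ]

q_\pi(s,a) = E_\pi[ R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a ]

Matrix form: v_\pi = R^\pi + \gamma P^\pi v_\pi, with direct solution v_\pi = (I - \gamma P^\pi)^{-1} R^\pi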
7
Optimal Value, Q and policy

§ Optimal V: the highest possible value for each s under any possible policy
§ Satisfies the Bellman equation:

§ Optimal Q-function:

§ Optimal policy:
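In standard notation, the optimal quantities are:

v_*(s) = \max_\pi v_\pi(s) = \max_a q_*(s,a)

q_*(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} q_*(s',a')

\pi_*(s) = \arg\max_a q_*(s,a)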

8
Dynamic Programming (DP)

§ Assuming full knowledge of Markov Decision Process


§ It is used for planning in an MDP
§ For prediction
q Input: MDP (S,A,P,R,γ) and policy π
q Output: value function vπ
§ For control
q Input: MDP (S,A,P,R,γ)
q Output: optimal value function v* and optimal policy π*
9
ONLINE LEARNING
Model-free Reinforcement Learning
Partially observable environment, Monte Carlo, TD, Q-Learning

10
Monte-Carlo Reinforcement Learning

§ MC methods learn directly from episodes of experience
§ MC is model-free: no knowledge of MDP transitions / rewards
§ MC learns from complete episodes: no bootstrapping
§ MC uses the simplest possible idea: value = mean return
§ Caveat:
q Can only apply MC to episodic MDPs
q All episodes must terminate

11
Monte-Carlo Policy Evaluation

§ Goal: learn vπ from episodes of experience under policy π

§ Total discounted reward

§ Value function is expected return

§ Monte-Carlo policy evaluation uses empirical mean return instead of expected return
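Written out, the return and the value function are:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

v_\pi(s) = E_\pi[ G_t \mid S_t = s ]

Monte-Carlo estimates this expectation by averaging the returns actually observed from state s.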

12
State Visit Monte-Carlo Policy Evaluation

§ To evaluate state s
§ At each time-step t that state s is visited in an episode
q Count either the first visit or every visit to s
§ Increase counter N(s) = N(s) + 1
§ Increase total return S(s) = S(s) + Gt
§ Value is estimated by mean return V(s) = S(s)/N(s)
§ By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
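A minimal Python sketch of first-visit Monte-Carlo evaluation; the episode generator and its (state, reward) format are assumptions made here for illustration, not part of the original slides:

from collections import defaultdict

def mc_first_visit(sample_episode, num_episodes, gamma=0.99):
    # sample_episode() is assumed to return [(S0, R1), (S1, R2), ...] generated under pi
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # total returns S(s)
    V = defaultdict(float)    # value estimates V(s)
    for _ in range(num_episodes):
        episode = sample_episode()
        # compute returns backwards: G_t = R_{t+1} + gamma * G_{t+1}
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s in seen:          # first-visit: count s only once per episode
                continue
            seen.add(s)
            N[s] += 1
            S[s] += returns[t]
            V[s] = S[s] / N[s]     # V(s) = S(s) / N(s)
    return V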

13
Incremental Monte-Carlo Updates

§ Learning from experience


§ Update V(s) incrementally after each complete episode
§ For each state St, with actual return Gt

§ With learning rate α
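Written out, the incremental update (running mean, then constant step size α) is:

V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} ( G_t - V(S_t) )

V(S_t) \leftarrow V(S_t) + \alpha ( G_t - V(S_t) )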

14
Temporal-Difference Learning

§ Model-free: no knowledge of MDP


§ Does not wait for the final outcome: learns from incomplete episodes by bootstrapping
§ Update value V(St) toward the estimated return
TD Target

TD error
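Written out, the TD(0) update is:

V(S_t) \leftarrow V(S_t) + \alpha ( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) )

where R_{t+1} + \gamma V(S_{t+1}) is the TD target and \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) is the TD error.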
15
Monte-Carlo and Temporal Difference

§ TD can learn before knowing the final outcome


q TD can learn online after every step
q MC must wait until end of episode before return is known
§ TD can learn without the final outcome
q TD can learn from incomplete sequences
q MC can only learn from complete sequences
q TD works in continuing (non-terminating) environments
q MC only works for episodic (terminating) environments

16
Monte-Carlo Backup

17
Temporal-Difference Backup

18
Dynamic Programming Backup

19
Bootstrapping and Sampling

§ Bootstrapping: update involves an estimate
q MC does not bootstrap
q DP bootstraps
q TD bootstraps
§ Sampling: update samples an
expectation
q MC samples
q DP does not sample
q TD samples

20
N-step prediction

§ n-step return

§ Define n-step return

§ n-step temporal-difference learning
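In standard form, the n-step return and the corresponding update are:

G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})

V(S_t) \leftarrow V(S_t) + \alpha ( G_t^{(n)} - V(S_t) )

Note that n = 1 recovers TD(0) and n = ∞ (the full episode) recovers Monte-Carlo.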

21
On-policy Learning

§ Advantage of TD:
q Lower variance
q Online
q Learns from incomplete sequences
§ Sarsa:
q Apply TD to Q(S,A)
q Use policy improvement, e.g. ϵ-greedy
q Update every time-step

22
Sarsa Algorithm
Initialize Q(s,a) arbitrarily and Q(terminal-state, ·) = 0
Repeat (for each episode)
Initialize S
Choose A from S using Q (e.g. ϵ-greedy)
Repeat (for each step of episode)
Take A, observe R, S'
Choose A' from S' using Q (e.g. ϵ-greedy)

Until S is terminal
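The per-step update applied between choosing A' and the next iteration is the standard Sarsa rule, followed by S ← S', A ← A':

Q(S,A) \leftarrow Q(S,A) + \alpha ( R + \gamma Q(S',A') - Q(S,A) )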

23
Off-Policy Learning

§ Evaluate target policy π(a|s) to compute vπ(s) or qπ(s,a) while following behaviour policy μ(a|s):
S1, A1, R2, … , ST ~ μ
§ Advantages:
q Learning from observing human or other agents
q Reuse experience generated from old policies π1 , π2 , π3 , … , πt−1
q Learn about optimal policy while following exploratory policy
q Learn about multiple policies while following one policy

24
Q-Learning
§ Off-policy learning of the action-value function Q(s,a)
§ No importance sampling is required
§ The next action is chosen using the behaviour policy
§ Q-Learning: consider an alternative successor action from the greedy target policy
§ Update Q(St,At) towards the value of the alternative action

§ Improve the policy greedily with respect to Q
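Concretely, with the greedy alternative successor action, the update is:

Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha ( R_{t+1} + \gamma \max_{a'} Q(S_{t+1},a') - Q(S_t,A_t) )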

25
Q-Learning

Update equation

Algorithm

Q-Learning Table version
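A minimal Python sketch of the table version; the env object and its reset()/step()/actions interface are assumptions made for illustration, not part of the original slides:

import random
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    # env is assumed to expose: reset() -> state, step(a) -> (next_state, reward, done),
    # and a list of discrete actions env.actions (hypothetical interface)
    Q = defaultdict(float)                       # Q-table: (state, action) -> value

    def behaviour_policy(state):                 # epsilon-greedy behaviour policy
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = behaviour_policy(state)
            next_state, reward, done = env.step(action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            # off-policy target uses the greedy (max) action, not the action taken next
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q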


26
Visualization and Codes

§ https://cs.stanford.edu/people/karpathy/reinforcejs/index.html

27
VALUE FUNCTION APPROXIMATION
State representation in complex environments
Linear Function Approximation
Gradient Descent and Update rules

28
Function approximation

29
Type of Value Function Approximation

§ Differentiable function
approximation
q Linear combination of features
• Robots: distance from checkpoint, target, dead mark, wall
• Business Intelligence Systems:
Trends in stock market
q Neural Network
• Deep Q Learning
§ Training strategies
30
Value Function by Stochastic Gradient Descent

§ Goal: find parameter vector w minimizing the mean-squared error between the approximate value function and the true state-value function vπ(s)

§ Gradient descent finds a local minimum

§ Stochastic gradient descent samples the gradient
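In symbols, with parameter vector w and approximate value function \hat{v}(S,w):

J(w) = E_\pi[ ( v_\pi(S) - \hat{v}(S,w) )^2 ]

\Delta w = \alpha E_\pi[ ( v_\pi(S) - \hat{v}(S,w) ) \nabla_w \hat{v}(S,w) ]   (full gradient step)

\Delta w = \alpha ( v_\pi(S) - \hat{v}(S,w) ) \nabla_w \hat{v}(S,w)   (sampled / stochastic step)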

31
Linear Value Function Approximation

§ Represent state by a feature vector


§ Feature examples:
q Distance to obstacle by lidar
q Angle to target
q Energy level of robot
§ Represent a value function by a linear combination of features
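With feature vector x(S) = (x_1(S), \ldots, x_n(S))^\top, the linear approximation is:

\hat{v}(S,w) = x(S)^\top w = \sum_{j=1}^{n} x_j(S) w_j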

32
Linear Value Function Approximation

§ Objective function is quadratic in the parameters w

§ Stochastic gradient descent converges to the global optimum


§ Update rule:
Update = step-size × prediction error × feature value
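For the linear case the gradient is simply the feature vector, \nabla_w \hat{v}(S,w) = x(S), so the update rule reads:

\Delta w = \alpha ( v_\pi(S) - \hat{v}(S,w) ) x(S)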

33
Incremental Prediction Algorithms

§ So far the true value function vπ(s) was assumed to be given by a supervisor
§ In reinforcement learning there are only rewards instead
§ In online learning (practice), a target for vπ(s) is used
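Substituting the practical targets gives the standard updates:

MC:    \Delta w = \alpha ( G_t - \hat{v}(S_t,w) ) \nabla_w \hat{v}(S_t,w)

TD(0): \Delta w = \alpha ( R_{t+1} + \gamma \hat{v}(S_{t+1},w) - \hat{v}(S_t,w) ) \nabla_w \hat{v}(S_t,w)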

34
Control with Value Function

§ Policy evaluation: approximate policy evaluation
§ Policy improvement: ϵ-greedy policy improvement

35
Action-Value Function Approximation

§ Approximate the action-value function (Q-value)

§ Minimize the mean-squared error between the approximate action-value function and the true action-value function qπ(s,a)

§ Use stochastic gradient descent to find a local minimum

36
Linear Action-Value Function Approximation

§ Represent state and action by a feature vector


§ Represent the action-value function by a linear combination of features

§ Stochastic gradient descent update

§ Use a target in place of the true action value in practice:

q MC
q TD
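With feature vector x(S,A), the corresponding forms are:

\hat{q}(S,A,w) = x(S,A)^\top w

\Delta w = \alpha ( q_\pi(S,A) - \hat{q}(S,A,w) ) x(S,A)

with q_\pi(S,A) replaced in practice by the MC target G_t or the TD target R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1},w).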

37
POLICY GRADIENT

38
Policy-Based Reinforcement Learning

§ Last part: value (and action-value) functions were approximated by parameterized functions:

§ Generate a policy from the value function, e.g. using ϵ-greedy


§ In this part: directly parameterize the policy

§ Effective in high-dimensional or continuous action spaces


39
Policy Objective Functions

§ Goal: given a policy πθ(s,a) with parameters θ, find the best θ

§ Define an objective function J(θ) to measure the quality of the policy
q In episodic environments, objective function is the start value

q In continuing environments, objective function is average value

q Or average reward per time-step
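In standard notation, with d^{\pi_\theta}(s) the stationary distribution of states under \pi_\theta:

Start value:   J_1(\theta) = V^{\pi_\theta}(s_1)

Average value: J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta}(s)

Average reward per time-step: J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a) R(s,a)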

40
Policy Optimization

§ Policy-based RL is an optimization problem

§ Find θ that maximizes the objective function J(θ)
§ Policy gradient algorithms search for a local maximum of J(θ) by ascending the gradient of the policy
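The gradient-ascent step is:

\Delta\theta = \alpha \nabla_\theta J(\theta)

where \nabla_\theta J(\theta) is the policy gradient and \alpha is a step-size parameter.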

41
Monte-Carlo Policy Gradient (REINFORCE)

§ Update parameters by stochastic gradient ascent

§ Using the policy gradient theorem
§ Using the return vt as an unbiased sample of Qπθ(st, at)
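Combining the two, the REINFORCE update applied at every step t of a sampled episode is:

\nabla_\theta J(\theta) = E_{\pi_\theta}[ \nabla_\theta \log \pi_\theta(s,a) Q^{\pi_\theta}(s,a) ]   (policy gradient theorem)

\Delta\theta_t = \alpha \nabla_\theta \log \pi_\theta(s_t,a_t) v_t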

42
Reducing Variance using a Critic

§ Monte-Carlo policy gradient has high variance

§ A critic is used to estimate the action-value function
§ Actor-critic algorithms maintain two sets of parameters:

q Critic: updates action-value function parameters w
q Actor: updates policy parameters θ, in the direction suggested by the critic
§ Actor-critic algorithms follow an approximate policy gradient
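The approximate policy gradient followed by the actor is:

\nabla_\theta J(\theta) \approx E_{\pi_\theta}[ \nabla_\theta \log \pi_\theta(s,a) Q_w(s,a) ]

\Delta\theta = \alpha \nabla_\theta \log \pi_\theta(s,a) Q_w(s,a)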

43
Action-Value Actor-Critic

§ Simple actor-critic algorithm based on an action-value critic
§ Using linear action-value function approximation Qw(s,a)
§ Critic: update w by linear TD(0)
§ Actor: update θ by policy gradient
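A sketch of the per-step updates, assuming a linear critic Q_w(s,a) = \phi(s,a)^\top w, TD error \delta, and step sizes \alpha (actor) and \beta (critic):

\delta = R + \gamma Q_w(S',A') - Q_w(S,A)

Critic: w \leftarrow w + \beta \delta \phi(S,A)

Actor:  \theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(S,A) Q_w(S,A)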
44
Recap on Reinforcement Learning 02

§ Online Learning
q Model-free Reinforcement Learning
q Partially observable environments
q Monte Carlo
q Temporal Difference
q Q-Learning
§ Value Function Approximation
q State representation in complex environments
q Linear Function Approximation
q Gradient Descent and Update rules
§ Policy Gradient
q Objective Function
q Gradient Ascent
q REINFORCE, Actor-Critic

45
Questions?

THANK YOU!

46
