Lec17 ReinforcementLearning

The document provides an introduction to Reinforcement Learning (RL), detailing its principles, including the interaction of agents with environments to maximize rewards through learned policies. It explains the Markov Decision Process (MDP) framework, the importance of value functions, and the distinction between value-based and policy-based methods. Additionally, it discusses Q-learning and Deep Q-learning, highlighting their algorithms and challenges in training.

INTRODUCTION TO MACHINE LEARNING

Reinforcement Learning

Giovanni Iacca

(credits: Elisa Ricci)


Reinforcement
Learning
IDEA
Reinforcement learning
● We discussed supervised and unsupervised learning
● Today we will see Reinforcement Learning (RL)
Reinforcement learning
● Inspired by research in psychology and animal learning
● Problems involving an agent interacting with an environment, which
provides numeric reward signals
● Goal: Learn how to take actions to maximize a reward
Reinforcement learning
● Agent can take actions that affect
the state of the environment and
observe occasional rewards that
depend on the state
● A policy is a mapping from states to
actions
● Goal: Learn a policy to maximize expected reward over time

[Agent-environment loop figure: from state st, the agent takes action at; the environment returns reward rt and next state st+1]
Example – Atari Games
● Objective: Complete the game with the highest score
● State: Raw pixel inputs of the game state
● Action: Game controls, e.g., Left, Right, Up, Down
● Reward: Score increase/decrease at each time step

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, et al., Human-level control through deep reinforcement learning, Nature 2015
Example – Go
● Objective: Win the game
● State: Position of all pieces
● Action: Where to put the next piece down
● Reward: 1 if win, 0 otherwise (“delayed” reward at the end of a game)

https://deepmind.com/research/alphago/
Example – ChatGPT
Markov
Decision
Process
Markov Decision Process
● A Markov Decision Process (MDP) is a framework used to make decisions in a stochastic environment.
● Our goal is to find a policy, i.e., a map that gives the optimal action for each state of the environment.
● To solve MDPs, we use Dynamic Programming (DP), more specifically the Bellman equation.
● DP is a method that divides a problem into simpler sub-problems that are easier to solve.
Markov Decision Process
● Components:
○ States s, beginning with initial state s0
○ Actions a
○ Transition model P(s' | s , a)
■ Markov assumption: the probability of going to s' from s depends
only on s and a and not on any other past actions/states
○ Reward function r(s)
● Policy 𝛑(s): the action that an agent takes in any given state
Markov Decision Process
● An MDP is defined by:
(𝓢, 𝓐, 𝓡, ℙ, 𝛾)
𝓢 : set of possible states
𝓐 : set of possible actions
𝓡 : distribution of reward given (state, action) pair
ℙ : transition probability, i.e., distribution over the next state, given
(state, action) pair
𝛾 : discount factor
Example – Grid World
Objective: reach the diamond terminal state in the least number of actions

Reward: a scalar value that you get for being in a state
Example – Grid World
Transition model: actions are stochastic; the probabilities over the possible resulting moves are 0.1, 0.8, and 0.1
Example – Grid World
Goal: find the optimal policy

A policy is a map that tells the agent which action to take in every state.
The optimal policy is the policy that maximizes the expected reward.
Example – Grid World
The optimal policy depends on the reward function
MDP Loop
● At time step t=0, environment samples initial state s0 ~ p(s0)
● Repeat:
○ Agent selects action at
○ Environment samples reward rt ~ R( . | st , at )
○ Environment samples next state st +1 ~ P( . | st , at )
○ Agent receives reward rt and next state st +1

A policy 𝛑(s) is a function from 𝓢 to 𝓐 that specifies what action to take in each state.
Objective: find policy 𝛑* that maximizes cumulative discounted reward.
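To make this loop concrete, here is a minimal Python sketch; the `env` object with `reset()`/`step()` methods and the `policy` function are illustrative assumptions (a Gym-like interface), not part of the slides.

```python
def run_episode(env, policy, gamma=0.9, max_steps=100):
    """Roll out one episode following `policy`; return the cumulative discounted reward."""
    state = env.reset()                         # environment samples initial state s0
    total_reward, discount = 0.0, 1.0
    for t in range(max_steps):
        action = policy(state)                  # agent selects action a_t = pi(s_t)
        state, reward, done = env.step(action)  # env samples r_t and next state s_{t+1}
        total_reward += discount * reward       # accumulate gamma^t * r_t
        discount *= gamma
        if done:                                # terminal state reached
            break
    return total_reward
```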
Cumulative Discounted Reward
● Suppose that following policy 𝛑, starting in state s0, leads to a sequence s0, s1, s2, ...
● The cumulative reward of the sequence is:
  r(s0) + r(s1) + r(s2) + ...
● State sequences can vary in length or even be infinite
● Typically, we define the cumulative reward as the sum of rewards discounted by a factor 𝛄 (0 < 𝛄 < 1):
  r(s0) + 𝛄 r(s1) + 𝛄² r(s2) + ... = Σt≥0 𝛄^t r(st)
Cumulative Discounted Reward
● The discount factor controls the importance of future rewards versus immediate ones.
● The lower the discount factor, the less important future rewards are, and the agent will tend to focus on actions that yield immediate rewards only.
● The cumulative reward is bounded by rmax / (1 − 𝛄), where rmax is the largest possible reward.
● This helps the algorithm to converge.
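As a quick sanity check of this bound, the snippet below compares a long discounted sum of constant rewards with rmax / (1 − 𝛄); the numbers are illustrative only.

```python
gamma, r_max = 0.9, 1.0

# Discounted sum of a constant reward over a long horizon...
discounted_sum = sum(gamma ** t * r_max for t in range(1000))

# ...stays below the geometric-series bound r_max / (1 - gamma).
bound = r_max / (1 - gamma)
print(round(discounted_sum, 4), bound)   # ~10.0 vs 10.0
```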
RL vs. Supervised Learning
● Supervised Learning loop
○ Get input xi sampled i.i.d. from data distribution
○ Use model with parameters w to predict output y
○ Observe target output yi and loss l(w, xi , yi)
○ Update w to reduce the loss with SGD: w ← w − η ∇w l(w, xi, yi)
RL vs. Supervised Learning
● Reinforcement Learning loop
○ From state s, take action a determined by policy 𝛑(s)
○ Environment selects next state s' based on transition model P(s' | s, a)
○ Observe s' and reward r(s), update policy
RL vs. Supervised Learning
● Supervised Learning
○ Next input does not depend on previous inputs or agent predictions
○ There is a supervision signal at every step
○ Loss is differentiable w.r.t. model parameters

● Reinforcement Learning
○ Agent’s actions affect the environment and help to determine next observation
○ Rewards may be sparse (i.e., not every state may have a reward)
○ Rewards are usually not differentiable w.r.t. model parameters
RL Methods
Two main approaches for RL
● Value-based methods
○ The goal of the agent is to optimize the value function V(s).
○ The value of each state is the total reward an RL agent can expect to collect over
the future from a given state.
● Policy-based approach:
○ We define a policy which we need to optimize directly.
○ The policy defines how the agent behaves.
○ Stochastic policies give a probability distribution over the different actions: 𝛑(a | s)
Value-based
methods
Value Function
● The value function gives the total reward the agent can expect from a particular state, considering all possible states reachable from that state. With the value function, you can find a policy.
● The value function V of a state s w.r.t. policy 𝛑 is the expected cumulative reward obtained by following that policy starting in s:
  V𝛑(s) = E[ Σt≥0 𝛄^t r(st) | s0 = s, 𝛑 ]
● The optimal value of a state is the value achievable by following the best possible policy:
  V*(s) = max𝛑 V𝛑(s)
● Essentially, the value function tells "how good" a state is.
Q-value Function
● It is often more convenient to define the value of a (state, action) pair:
  Q𝛑(s, a) = E[ Σt≥0 𝛄^t r(st) | s0 = s, a0 = a, 𝛑 ]
● The optimal Q-value function tells "how good" a (state, action) pair is:
  Q*(s, a) = max𝛑 Q𝛑(s, a)
● When the optimal Q-value function is found, it is used to compute the optimal policy:
  𝛑*(s) = argmaxa Q*(s, a)
Q-value Function

A table where we store the maximum expected future reward for each action at each state.
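With the Q-table stored as a 2-D array indexed by (state, action), extracting the greedy policy of the previous slide is a row-wise argmax; the 6×6 shape below is just an illustrative placeholder.

```python
import numpy as np

n_states, n_actions = 6, 6
Q = np.zeros((n_states, n_actions))   # Q[s, a]: maximum expected future reward

def greedy_policy(Q, s):
    """pi*(s) = argmax_a Q*(s, a)."""
    return int(np.argmax(Q[s]))
```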
Bellman Equation
● Recursive relationship between the optimal values of successive states and actions:
  Q*(s, a) = E_s' [ r + 𝛄 maxa' Q*(s', a') | s, a ]
● If the optimal state-action values for the next time step, Q*(s', a'), are known, then the optimal strategy is to take the action that maximizes the expected value.
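For a small MDP whose transition model and rewards are known, the Bellman equation can be applied repeatedly as a fixed-point update (Q-value iteration). The two-state model below is invented purely for illustration, with the reward written as r(s, a) for simplicity.

```python
import numpy as np

# Toy model: 2 states, 2 actions. P[s, a, s'] are transition probabilities,
# r[s, a] are rewards; all numbers are made up for the example.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(200):
    # Bellman backup: Q(s,a) <- r(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
    Q = r + gamma * (P * Q.max(axis=1)).sum(axis=2)

print(np.round(Q, 2))   # converged optimal Q-values of the toy MDP
```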
Q-learning
A robot needs to reach room 5.

https://leonardoaraujosantos.gitbook.io/artificial-inteligence/artificial_intelligence/reinforcement_learning/qlearning_simple
Q-learning
● Components
○ Actions: {0,1,2,3,4,5}
○ States: {0,1,2,3,4,5}
○ Rewards: {0,100}
● Goal state: 5

NOTE: in this specific example, the action and state spaces are the same; however, this is not always the case!
Q-learning
● Reward Table (the value -1 indicates that a specific action is not available in that state)
Q-learning
● The whole point of Q-learning is that the matrix R is available only to the environment; the agent needs to learn R by itself through experience.
● What the agent has is a Q matrix that encodes the (state, action) → reward mapping; it is initialized with zeros and, through experience, comes to resemble the matrix R.
● The policy can be obtained from the Q matrix.
Algorithm
● Initialize the Q matrix with zeros
● Select a random initial state
● For each episode (i.e., a sequence of actions that starts in the initial state and ends in the goal state, or when another stop criterion is met, e.g., a maximum number of timesteps):
● While state is not the goal state (or stop criteria are not met)
○ Select a random possible action for the current state
○ Using this possible action, consider going to the next state
○ Get maximum Q value for the next state (on all possible actions on the next state)
○ Q*(s, a)=R(s, a)+𝛾 maxa' [Q*(s', a')]
Algorithm
● Initialize the Q matrix with zeros
● Select a random initial state
● For each episode (i.e., a sequence of actions that starts in the initial state and ends in the goal state, or when another stop criterion is met, e.g., a maximum number of timesteps):
● While state is not the goal state (or stop criteria are not met)
○ Select a random possible action for the current state
○ Using this possible action, consider going to the next state
○ Get maximum Q value for the next state (on all possible actions on the next state)
○ Q*(s, a)=R(s, a)+𝛾 maxa' [Q*(s', a')]

𝛾=0.8 s=1
Algorithm
● Initialize the Q matrix with zeros
● Select a random initial state
● For each episode (i.e., a sequence of actions that starts in the initial state and ends in the goal state, or when another stop criterion is met, e.g., a maximum number of timesteps):
● While state is not the goal state (or stop criteria are not met)
○ Select a random possible action for the current state
○ Using this possible action, consider going to the next state
○ Get maximum Q value for the next state (on all possible actions on the next state)
○ Q*(s, a)=R(s, a)+𝛾 maxa' [Q*(s', a')]

As we start from state s=1 (second row) there are only


𝛾=0.8 s=1 the actions 3 (reward 0) or 5 (reward 100) to be done,
imagine that we choose randomly the action 5.
Algorithm
● Initialize the Q matrix with zeros
● Select a random initial state
● For each episode (i.e., a sequence of actions that starts in the initial state and ends in the goal state, or when another stop criterion is met, e.g., a maximum number of timesteps):
● While state is not the goal state (or stop criteria are not met)
○ Select a random possible action for the current state
○ Using this possible action, consider going to the next state
○ Get maximum Q value for the next state (on all possible actions on the next state)
○ Q*(s, a)=R(s, a)+𝛾 maxa' [Q*(s', a')]

On state 5, there are 3 possible actions. We're just Episode 1


𝛾=0.8 s=1 interested on the action with biggest reward. But,
at this point the Q table is still filled with zeros!
Algorithm
● Initialize the Q matrix with zeros
● Select a random initial state
● For each episode (i.e., a sequence of actions that starts in the initial state and ends in the goal state, or when another stop criterion is met, e.g., a maximum number of timesteps):
● While state is not the goal state (or stop criteria are not met)
○ Select a random possible action for the current state
○ Using this possible action, consider going to the next state
○ Get maximum Q value for the next state (on all possible actions on the next state)
○ Q*(s, a)=R(s, a)+𝛾 maxa' [Q*(s', a')]

As the new state is 5 and this state is the goal Episode 1


𝛾=0.8 s=1 state, we finish our episode. Now at the end of this
episode the we update the Q table.
Algorithm
● Initialize the Q matrix with zeros
● Select a random initial state
● For each episode (i.e., a sequence of actions that starts in the initial state and ends in the goal state, or when another stop criterion is met, e.g., a maximum number of timesteps):
● While state is not the goal state (or stop criteria are not met)
○ Select a random possible action for the current state
○ Using this possible action, consider going to the next state
○ Get maximum Q value for the next state (on all possible actions on the next state)
○ Q*(s, a)=R(s, a)+𝛾 maxa' [Q*(s', a')]

𝛾=0.8 s=1
Algorithm
● Initialize the Q matrix with zeros
● Select a random initial state
● For each episode (i.e., a sequence of actions that starts in the initial state and ends in the goal state, or when another stop criterion is met, e.g., a maximum number of timesteps):
● While state is not the goal state (or stop criteria are not met)
○ Select a random possible action for the current state
○ Using this possible action, consider going to the next state
○ Get maximum Q value for the next state (on all possible actions on the next state)
○ Q*(s, a)=R(s, a)+𝛾 maxa' [Q*(s', a')]

Typically, all the non-zero


elements are divided by
they greatest value to
normalize the Q table.

After many episodes


𝛾=0.8 s=1
Find the Optimal Policy
1. Set current state = initial state.
2. From the current state, find the action with the highest Q-value.
3. Set current state = next state (the state reached by the action chosen in step 2).
4. Repeat steps 2 and 3 until current state = goal state.
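Putting the whole procedure together, here is a minimal sketch of tabular Q-learning for the room-navigation example. The 6×6 reward matrix is the one used in the linked tutorial (treat it as an assumption), with −1 marking unavailable actions; taking action a simply moves the agent to room a.

```python
import numpy as np

# Reward matrix R[s, a]; -1 means action a is not available in state s.
R = np.array([
    [-1, -1, -1, -1,  0, -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1, -1],
    [-1,  0,  0, -1,  0, -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
], dtype=float)

gamma, goal, n_states = 0.8, 5, 6
Q = np.zeros_like(R)
rng = np.random.default_rng(0)

for episode in range(500):
    s = int(rng.integers(n_states))                      # random initial state
    while s != goal:
        a = int(rng.choice(np.flatnonzero(R[s] >= 0)))   # random available action
        s_next = a                                       # taking action a moves to room a
        # Q-learning update: Q(s,a) = R(s,a) + gamma * max_a' Q(s',a')
        Q[s, a] = R[s, a] + gamma * Q[s_next].max()
        s = s_next

print(np.round(100 * Q / Q.max()))                       # normalized Q-table
```

Following the greedy policy (the per-row argmax of the normalized table) from any room should then lead to room 5.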
Deep Q-learning

● The Bellman equation is a constraint on the Q-values of successive states:
  Q*(s, a) = E_s' [ r + 𝛄 maxa' Q*(s', a') | s, a ]
● Problem: the state spaces of interesting problems are huge (e.g., Atari games)
● Solution: approximate the Q-values using a parametric function (w being the parameters):
  Q(s, a; w) ≈ Q*(s, a)
Deep Q-learning

● Train a deep network with parameters w that approximates Q*:

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Human-level control through deep
reinforcement learning, Nature 2015
Deep Q-learning

● Idea: at each iteration i of training, update the model parameters w to push Q close to the target yi

● Loss function (that changes at each iteration):
  Li(wi) = E_{s,a∼ρ(·)} [ (yi − Q(s, a; wi))² ],  with target  yi = E_s' [ r + 𝛄 maxa' Q(s', a'; wi−1) | s, a ]

where ρ is a probability distribution over states s and actions a that we refer to as the behaviour distribution.

https://shivam5.github.io/drl/
Deep Q-learning
● Target:  yi = E_s' [ r + 𝛄 maxa' Q(s', a'; wi−1) | s, a ]
● Loss:  Li(wi) = E_{s,a∼ρ(·)} [ (yi − Q(s, a; wi))² ]
● Gradient update:  ∇wi Li(wi) = E_{s,a∼ρ(·), s'} [ (yi − Q(s, a; wi)) ∇wi Q(s, a; wi) ]

● SGD training: replace the expectations by sampling experiences (s, a, s') using the behaviour distribution and the transition model

https://shivam5.github.io/drl/
Deep Q-learning
● Training is prone to instability
○ Unlike in supervised learning, the targets are “moving”
○ Successive experiences are correlated and depend on the policy
○ Policy may change rapidly with slight changes to parameters, leading to
drastic changes in data distribution
● Solutions
○ “Freeze” target Q-network
○ Use an experience replay buffer to store past experience and sample minibatches from it
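A minimal PyTorch-style sketch of these two fixes (a periodically synchronized target network and an experience replay buffer). The network sizes, buffer size, and hyperparameters are illustrative assumptions, not the exact setup of the Nature paper.

```python
import random
from collections import deque
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q(s, .; w)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # "frozen" copy
target_net.load_state_dict(q_net.state_dict())

replay = deque(maxlen=10_000)   # stores (s, a, r, s_next, done) tensors; a is a LongTensor
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    s, a, r, s_next, done = map(torch.stack, zip(*random.sample(replay, batch_size)))
    with torch.no_grad():                                # target uses the frozen network
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; w) for the taken actions
    loss = nn.functional.mse_loss(q, y)                  # (y_i - Q(s, a; w_i))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every K training steps: target_net.load_state_dict(q_net.state_dict())
```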
Deep Q-learning in Atari

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Human-level control through deep reinforcement
learning, Nature 2015
Deep Q-learning in Atari

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Human-level control through deep reinforcement
learning, Nature 2015
Deep Q-learning in Atari

https://www.youtube.com/watch?v=V1eYniJ0Rnk
Policy
Gradient
methods
Policy gradient methods

● Instead of indirectly representing the policy using Q-values, it can be more efficient to parameterize 𝛑 and learn it directly
● Especially in large or continuous action spaces, the Q-value function can be very complicated
● Example: a robot grasping an object has a very high-dimensional state, and it is hard to learn the exact value of every (state, action) pair
Stochastic Policy Representation
Instead, learn a function giving the probability distribution over actions given the current state:
  𝛑θ(a | s)
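For example, a minimal sketch of such a stochastic policy as a small softmax network over a discrete action set; the input and output sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# pi_theta(a | s): the network outputs the logits of a categorical distribution over actions.
policy_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

def act(state):
    """Sample an action a ~ pi_theta(a | s)."""
    dist = torch.distributions.Categorical(logits=policy_net(state))
    return dist.sample()
```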
Policy gradient methods

Policy gradient for the Pong game:

The basic idea is to use a Machine Learning model that will learn a good policy from playing the game and receiving rewards.
Objective function
Find the best parameters θ (the parameters of the policy) that maximize the expected reward (via gradient ascent, i.e., gradient descent on −J(θ)):
  J(θ) = E_{τ∼𝛑θ} [ r(τ) ],   θ* = argmaxθ J(θ)
Optimization

∇θ J(θ) = E_{τ∼𝛑θ} [ ∇θ log 𝛑θ(τ) · r(τ) ] = E_{τ∼𝛑θ} [ Σt ∇θ log 𝛑θ(at | st) · r(τ) ]

We don't know the transition probability.

NOTE: We do not need to know the environment dynamics p: the log-probability of a trajectory splits into policy terms and transition terms, and the transition terms do not depend on θ, so they vanish from the gradient.
Optimization

Stochastic approximation: sample N trajectories τ1, ..., τN and estimate
  ∇θ J(θ) ≈ (1/N) Σi Σt ∇θ log 𝛑θ(at^i | st^i) · r(τi)


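A minimal sketch of this Monte-Carlo estimate (the REINFORCE update) for a small softmax policy; the trajectory format and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # logits of pi_theta(a|s)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)

def reinforce_update(trajectories):
    """trajectories: list of (states, actions, rewards) tensors for N sampled episodes."""
    loss = torch.tensor(0.0)
    for states, actions, rewards in trajectories:
        log_probs = torch.distributions.Categorical(logits=policy_net(states)).log_prob(actions)
        ret = rewards.sum()                    # r(tau): total reward of the trajectory
        # Gradient ascent on sum_t log pi(a_t|s_t) * r(tau), via minimizing its negative
        loss = loss - log_probs.sum() * ret
    loss = loss / len(trajectories)            # average over the N sampled trajectories
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```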
Reinforcement Loop

R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, 8(3):229-256, 1992
Intuition
If going up the hill of the objective function means higher rewards, we change the model parameters, and thus the policy, to increase the likelihood of trajectories that climb higher during the optimization process.
QUESTIONS?
