
Deep Reinforcement Learning
372.2.5910
Ben-Gurion University of the Negev

Lecture Notes
Written by: Hadar Sharvit
Also available on GitHub
Contact me at: [email protected]
Based on: Lectures given by Gilad Katz
Chapters & book cover by: Rohit Choudhari/Unsplash
Contents

1 Hello world . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1 Terminology 7
1.1.1 State & Observation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.2 Action spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.3 Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.4 Trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.5 Reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.6 The goal of RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.7 Value function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.8 The optimal Q-Function and the optimal action . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.9 Bellman Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.10 Advantage function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Kinds of RL Algorithms 11
1.2.1 Model-Free vs Model-Based RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.2 What do we learn in RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Intro to policy optimization 12

2 RL basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 Motivation 15
2.2 When to use RL? 16
2.3 Markov Decision Processes (MDP) 16
2.3.1 The Markov Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Goals and Rewards 18
2.5 Policies, Value-function and Q-function 18
2.6 The Bellman equation 18
2.7 Policy Iteration 20
2.8 Value iteration 20
2.9 Monte-Carlo 20
2.9.1 Approximating Value-function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.9.2 Approximating policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.10 On/Off-Policy methods 22
2.10.1 Importance sampling for Off-Policy methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.11 ε-Greedy Algorithms 24
2.12 Temporal Difference (TD) Learning 24
2.12.1 On-Policy TD Control: SARSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.12.2 Off-Policy TD Control: Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3 DQN & its derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


3.1 Deep Q-Network 28
3.1.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.2 Training DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Double Deep Q-Network (DDQN) 29
3.3 Dueling network 30
3.3.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4 Policy Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 The Policy-Gradient theorem 33
4.2 The REINFORCE Algorithm 34
4.2.1 REINFORCE with Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Actor-Critic methods 35
4.3.1 One-Step AC algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.2 Asynchronous Advantage Actor-Critic (A3C) . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5 Imitation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1 The Regret 39
5.2 Imitation learning 40
5.2.1 Apprenticeship Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2.2 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2.3 Forward Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2.4 Dataset Aggregation (DAgger) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.5 DAgger with coaching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6 Multi-Arm Bandit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.1 Basic bandit algorithms 45
6.2 Advanced bandits algorithm 46
6.2.1 Gradient bandit algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2.2 Contextual Bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2.3 Thompson Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

7 Advanced RL use case - AlphaGo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49


7.1 Monte Carlo Tree Search (MCTS) 49
7.2 RL for the game of Go 50
7.2.1 Alpha-Go . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.2.2 Alpha-Zero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.2.3 Alpha-Zero in other domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

8 Meta and Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55


8.1 Meta-Learning 55
8.1.1 Memory-Augmented Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
8.1.2 Simple Neural Attentive Meta-Learner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8.1.3 Model Agnostic Meta-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.2 Transfer Learning 60
8.2.1 Training the model for diversity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
1. Hello world

R This chapter is provided as a preliminary, and is not part of the course. It is based on OpenAI’s
Spinning Up docs (For further references see here).

Reinforcement Learning (RL) is the study of agents and how they learn by trial and error. The
two main components of RL are the agent and the environment - the agent interacts with the
environment (also known as taking a "step") by seeing a (sometimes partial) observation of the
environment's state, and then decides which action should be taken. The agent also perceives a reward
from the environment, which is essentially a number that tells the agent how good the state of the
world is, and the agent's goal is to maximize the cumulative reward, called the return.

(Diagram: the agent-environment loop - the agent receives state st and reward rt from the environment and returns an action at.)

1.1 Terminology
Let's introduce some additional terminology.

1.1.1 State & Observation


The complete description of the environment/world’s state is the state s, and an observation o is a
partial description of s. We usually work with an observation, but wrongly denote it as s - we are
going to stick with this convention.

R if o = s we say that the environment is fully observed. Otherwise, it is partially observed.



1.1.2 Action spaces


Is the set of all valid actions in the environment. The action space could be discrete (like the action
to move in one of the directions {↑, ↓, ←, →}) or continuous (like the action to move the motor with
α ∈ R Newtons of force).

1.1.3 Policy
Is a set of rules used by our agent to decide on the next action. It can be either deterministic
or stochastic, at ∼ π(·|st). Under the scope of deep RL, the policy is a parameterized function,
i.e. it is a mapping with parameters θ that should be learned in some optimization process. A
deterministic policy could be implemented, for example, using some basic MLP architecture. For a
stochastic policy, the two most common types are the Categorical policy (for a discrete action space) and
the Diagonal-Gaussian policy (for a continuous action space)
Categorical (stochastic) Policy
Is essentially a classifier, mapping discrete states to discrete actions. For example, you could build
a basic NN that takes in the observation and outputs action probabilities (after applying softmax).
Denoting the last layer as Pθ (s), we can treat the actions as indices so the log-likelihood for action a
is
log πθ (a|s) = log [Pθ (s)]a (1.1)
Given Pθ (s), we can also sample from the distribution (one can use PyTorch Categorical to sample
from a probability vector)
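
A minimal sketch of such a categorical policy with PyTorch, assuming a small MLP and purely illustrative dimensions (this is not the course's reference implementation, just one way to do it):

import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions = 4, 2  # illustrative sizes

# a basic NN that maps an observation to action logits, i.e. P_theta(s) before the softmax
logits_net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, n_actions),
)

def sample_action(obs):
    """Sample a ~ pi_theta(.|s) and return log pi_theta(a|s) as in eq. (1.1)."""
    dist = Categorical(logits=logits_net(obs))  # softmax is applied internally
    action = dist.sample()
    return action, dist.log_prob(action)

obs = torch.randn(obs_dim)
a, logp = sample_action(obs)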
Diagonal-Gaussian (stochastic) Policy
Is a policy that can be implemented using a neural network that maps observations to mean actions,
under the assumption that the action probability space can be represented by some multivariate
Gaussian with diagonal covariance matrix, which can be represented in two ways
• we use log diag(Σ) = log σ which is not a function of the state s (σ is a vector of standalone
parameters)
• we use a NN that maps from s → log σθ (s)
we use log σ and not σ as the log takes any value ∈ (−∞, ∞), unlike σ that only takes values in
[0, ∞), making it harder to train.
Once the mean action µθ(s) and the std σθ(s) are obtained, the action is sampled as a = µθ(s) +
σθ(s) ⊗ z, where z ∼ N(0, I) and ⊗ is element-wise multiplication (this is similar to VAEs).
The log-likelihood of a k-dimensional action a ∈ R^k for a diagonal Gaussian with mean µθ and std
σθ can be simplified if we remember that when Σ is diagonal, the k-dimensional multivariate Gaussian
PDF is equivalent to the product of k one-dimensional Gaussian PDFs, hence

log πθ(a|s) = log [ exp(−(1/2)(a − µ)^T Σ^{−1} (a − µ)) / √((2π)^k |Σ|) ]
            = ... = −(1/2) [ ∑_{i=1}^{k} ( (a_i − µ_i)²/σ_i² + 2 log σ_i ) + k log 2π ]     (1.2)
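
A minimal sketch of this sampling step and of the log-likelihood of eq. 1.2, assuming the first variant (a state-independent log σ) and illustrative dimensions:

import math
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # illustrative sizes

mu_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = nn.Parameter(-0.5 * torch.ones(act_dim))  # standalone parameters, not a function of s

def sample_action(obs):
    mu = mu_net(obs)
    std = torch.exp(log_std)
    z = torch.randn_like(mu)      # z ~ N(0, I)
    a = mu + std * z              # a = mu_theta(s) + sigma_theta(s) (element-wise) z
    # log-likelihood of eq. (1.2)
    logp = -0.5 * (((a - mu) / std) ** 2 + 2 * log_std).sum() - 0.5 * act_dim * math.log(2 * math.pi)
    return a, logp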

1.1.4 Trajectories
We denote the trajectory as τ = (s0 , a0 , s1 , a1 , ...), where the first state s0 is randomly sampled from
some start-state distribution s0 ∼ ρ0 . A new state is obtained from the previous state and action in
either a stochastic or deterministic process.

R τ is also noted as "episode" or "rollout"

1.1.5 Reward
The reward rt is some function of our states and action, and the goal of the agent is to maximize the
cumulative reward over some trajectory τ.
• Finite-horizon undiscounted return: R(τ) = ∑_{t=0}^{T} r_t
• Infinite-horizon discounted return: R(τ) = ∑_{t=0}^{∞} γ^t r_t, with γ ∈ (0, 1)
Without the converging term γ^t the infinite sum may diverge; the discount also manifests the
concept of "reward now > reward later".

1.1.6 The goal of RL


We always wish to find a policy π ∗ which maximizes the expected return when the agent acts
according to it. Under the assumption of stochastic environment and policy, we can write the
probability to obtain some trajectory τ of size T , given a policy π as

P(τ|π) = ρ0(s0) ∏_{t=0}^{T−1} P(s_{t+1}|s_t, a_t) · π(a_t|s_t)     (1.3)

where P(s_{t+1}|s_t, a_t) is the probability to reach s_{t+1} from s_t when applying a_t, and π(a_t|s_t) is
the probability to choose action a_t when in state s_t.

The expected return is by definition the sum of returns given all possible trajectories, weighted by
their probabilities
J(π) = ∫_τ P(τ|π) R(τ) = E_{τ∼π}[R(τ)]     (1.4)

and w.r.t to this objective, we wish to find

π∗ = argmax_{π∈Π} J(π)     (1.5)

1.1.7 Value function


We can think of the expected return given some specific state, or some specific state-action pair as
the "value" of the state, or the state-action pair. Those are simply the expected return conditioned
with some initial state or action
• On-policy Value function V π (s): if you start from s and act according to π, the expected
reward is

V π (s) = Eτ∼π [R(τ)|s0 = s]

• On-policy Action-Value function Q^π(s, a): if you start from s, take an action a (which may or
may not come from π) and only then act according to π, the expected reward is

Qπ (s, a) = Eτ∼π [R(τ)|s0 = s, a0 = a]



when finding a value function or an action-value function that maximizes the expected reward, we
scan various policies and extract V∗(s) = max_{π∈Π} V^π(s) or Q∗(s, a) = max_{π∈Π} Q^π(s, a).
We can also find a relation between V and Q:
V^π(s) = E_{τ∼π}[R(τ)|s0 = s]
       = ∑_{τ∼π} Pr[R(τ)|s0 = s] R(τ)
       = ∑_{τ∼π} ∑_{a∼π} Pr[R(τ), a|s0 = s] R(τ)          [Total prob.]
       = ∑_{a∼π} Pr[a|s0 = s] ∑_{τ∼π} Pr[R(τ)|s0 = s, a0 = a] R(τ)     (1.6)
       = ∑_{a∼π} Pr[a|s0 = s] E_{τ∼π}[R(τ)|s0 = s, a0 = a]
       = ∑_{a∼π} Pr[a|s0 = s] Q^π(s, a)
       = E_{a∼π} Q^π(s, a)
where in the 4th line we used the fact that the joint probability of both R(τ) and a is the same as summing
over all possible a and conditioning the probability Pr[R(τ)] on a. In terms of optimality, notice
that as V∗(s) is the optimal value function for a specific s and any a, and Q∗(s, a) is the optimal
value for a specific s and a, taking the max of Q∗ over all a is exactly V∗(s). Specifically,

V∗(s) = max_a Q∗(s, a)     (1.7)

1.1.8 The optimal Q-Function and the optimal action


Q∗ (s, a) gives the expected return for starting in s and taking action a, and then acting according
to the optimal policy. As the optimal policy will select, when in s, the action that maximizes
the expected return for when the initial state is s, we can obtain the optimal action a∗ by simply
maximizing over all values of Q∗
a∗(s) = argmax_a Q∗(s, a)     (1.8)

We also note that if there are many optimal actions, we may choose one randomly.

1.1.9 Bellman Equations


An important idea for all value functions is that the value of your starting point is the reward you
expected to get from being there + the value of wherever you land next
V^π(s) = E_{a∼π, s′∼P}[ r(s, a) + γ V^π(s′) ]     (1.9)

Q^π(s, a) = E_{s′∼P}[ r(s, a) + γ E_{a′∼π}[Q^π(s′, a′)] ]     (1.10)

and optimality is obtained for

V∗(s) = max_a E_{s′∼P}[ r(s, a) + γ V∗(s′) ]     (1.11)

Q∗(s, a) = E_{s′∼P}[ r(s, a) + γ max_{a′} Q∗(s′, a′) ]     (1.12)

1.1.10 Advantage function


Sometimes we only care if an action is better than others on average, and do not care as much for its
value on its own. The advantage function A^π(s, a) describes how much better it is to take action a (which
may or may not come from π) when in s, compared to selecting some random action a′ ∼ π, assuming
you act according to π afterwards.

A^π(s, a) = Q^π(s, a) − V^π(s)
          = Q^π(s, a) − E_{a∼π}[Q^π(s, a)]     (1.13)

By calculating the advantage, we ask whether the Q-function of some candidate action a (given some
state s) is larger than the average Q-function associated with all other actions taken by our policy.

1.2 Kinds of RL Algorithms


1.2.1 Model-Free vs Model-Based RL
In some cases we have a closed form of how our environment behaves. For example, we may know
the probability space P[s′|s, a], i.e. we know the probability to transition from some s to some other
s′ given an arbitrary action a. It may also be the case that our model can be described using some
equation of motion. Either way, we can use this knowledge to formulate an optimal solution, which
in many cases translates to some greedy approach of scanning various states and thinking ahead.
One problem is that such a model may not even be available (or known); for example, I
do not have a model of how my chess opponent may play. Furthermore, greedy approaches usually
mean brute-forcing all possible solutions.
In Model-Free RL, the model is not available, and we are trying to understand how the environment
behaves by exploring it in some non-exhaustive manner. This means that model-free RL algorithms are
likely to be less sample-efficient, though they are usually easier to implement.

1.2.2 What do we learn in RL


After going through the taxonomy, we can ask whether we wish to learn the Q-function, the value-
function, the policy or the environment model itself.

Under model-free RL
• Policy optimization: we parameterize the policy πθ (a|s) and find optimum w.r.t the return
J(πθ ). Such optimization is usually on-policy, meaning that the data used in the training
process is only data given while acting according to the most recent version of the policy.
In policy optimization we also find an approximator value function Vφ (s) ≈ V π (s). Some
examples are A2C,A3C,PPO.
• Q-Learning: approximate Qθ (s, a) ≈ Q∗ (s, a). Usually the objective is some form of the
bellman equation. Q-Learning is usually off-policy, meaning that we use data from any point
during training. Some examples are DQN, C51.
Compared to Q-Learning, which approximates Q∗, policy optimization finds exactly what we wish for -
how to act optimally in the environment. There are also models that combine the two approaches, such as
DDPG, which learns both a Q-function and an optimal policy.

Under model-based RL
Model-based methods cannot be clustered as easily, though some of the (many) approaches use planning
techniques to select actions that are optimal w.r.t the model.

1.3 Intro to policy optimization


We aim to maximize the expected return J(πθ ) = Eτ∼πθ [R(τ)], and we assume the finite-horizon
undiscounted return (∞-horizon is nearly identical). Our goal is to optimize πθ with a gradient step

θ_{k+1} = θ_k + α ∇_θ J(π_θ)|_{θ_k}     (1.14)

where ∇_θ J(π_θ) is the policy gradient. To do so, we must find a numerical expression for the policy
gradient. As J(π_θ) = E_{τ∼π_θ}[R(τ)] = ∫_τ P(τ|θ) R(τ), we might as well write down a term for the
probability of a trajectory
P(τ|θ) = ρ0(s0) ∏_{t=0}^{T} P(s_{t+1}|s_t, a_t) π_θ(a_t|s_t)     (1.15)

Using the log-derivative trick, d/dx log x = 1/x, meaning that x · d(log x)/dx = 1 = dx/dx. Substituting
x ↔ P(τ|θ) and d/dx ↔ ∇_θ, we have that P(τ|θ) ∇_θ log P(τ|θ) = ∇_θ P(τ|θ). We will use this later.
Now, let's expand the log term
log P(τ|θ) = log ρ0(s0) + ∑_{t=0}^{T} [ log P(s_{t+1}|s_t, a_t) + log π_θ(a_t|s_t) ]     (1.16)

When deriving w.r.t θ , we are only left with the last term (the others only depend on the environment
and not our agent), hence
∇_θ log P(τ|θ) = ∇_θ ∑_{t=0}^{T} log π_θ(a_t|s_t) = ∑_{t=0}^{T} ∇_θ log π_θ(a_t|s_t)     (1.17)

Notice the use of linearity in the second transition. Consequently, we re-write the expected return
using eq 1.17 -
∇_θ J(π_θ) = ∇_θ ∫_τ P(τ|θ) R(τ)
           = ∫_τ ∇_θ P(τ|θ) R(τ)
           = ∫_τ P(τ|θ) ∇_θ log P(τ|θ) R(τ)     (1.18)
           = E_{τ∼π_θ}[ ∇_θ log P(τ|θ) R(τ) ]
           = E_{τ∼π_θ}[ ∑_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) R(τ) ]

In the 3rd transition we used the log-derivative trick, and in the last transition we used the expression
from 1.17.
The last term is an expectation, hence it can be estimated using a sample mean - given a collected set
D = {τ_1, τ_2, ..., τ_N} of trajectories obtained by letting our agent act in the environment using π_θ, we
can write

∇_θ J(π_θ) ≈ (1/|D|) ∑_{τ∈D} ∑_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) R(τ)     (1.19)
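
A minimal sketch of how this estimator is typically implemented with automatic differentiation, assuming logp holds log π_θ(a_t|s_t) for every step in the collected batch and ret holds the R(τ) of the trajectory each step belongs to (both names are illustrative):

import torch

def policy_gradient_loss(logp, ret):
    """'Pseudo-loss' whose gradient equals (the negative of) the estimator in eq. 1.19.

    logp: tensor of log pi_theta(a_t|s_t) for every step in the batch
    ret:  tensor with R(tau) of the trajectory each step belongs to
    """
    # minimizing the negative mean performs gradient ascent on J(pi_theta)
    return -(logp * ret).mean()

# hypothetical usage, assuming an optimizer over the policy's parameters:
# loss = policy_gradient_loss(logp, ret)
# optimizer.zero_grad(); loss.backward(); optimizer.step()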

R It should be stated that this "Loss" term is not really a loss like we know from supervised
learning. First of all, it does not depend on a fixed data distribution - here, the data is sampled
from the most recent policy. More importantly, it does not measure performance! The only thing it
guarantees is that, given the current parameters, it has the negative gradient of performance.
After this first step of gradient descent, there is no further connection to performance. This
means that minimizing the loss has no guarantee of improving the expected return. This should
serve as a warning for when we look at the loss going down and think that all is well - in policy
gradients, this intuition is wrong, and we should only look at the average return.
2. RL basics

2.1 Motivation
Current malware detection platforms often deploy an ensemble of detectors to increase overall
performance. This approach creates a lot of redundancy, as in most cases one detector is enough, and
it is of course computationally expensive and time consuming compared to a single detector.
We can come up with a simple improvement - query a subset of detectors and decide, based on
their classification, if more detectors are needed. If we view this approach through the lens of
classification, it may very well be the case that training a model w.r.t every subset of detectors is
needed, as we cannot evaluate the performance of a subset of detectors without actually learning how
they perform. As this is computationally hard for a large number of detectors, it is not a preferred approach.
Instead, we can use RL:

Suppose we use four detectors, and our agent takes as input the vector [−1, −1, −1, −1] ∈ R^4,
which is considered an initial state. The agent will choose a set of detectors/detector configurations,
and a classification of either "malicious" or "benign" will be produced. The decisions
of the agent will be based on a reward mechanism that uses the values of TP, TN (correctly
classified the content as "malicious" or "benign") and FP, FN (incorrectly classified as "malicious" or
"benign"). We will "punish" using C(t), which is a function that depends on the time it took for the
detectors to run.

Exp. #   TP     TN     FP      FN
1        1      1      -C(t)   -C(t)
2        10     10     -C(t)   -C(t)
3        100    100    -C(t)   -C(t)

Table 2.1: Three suggested reward mechanisms for a malware detection platform

We can see that regardless of how many detectors were used, if we are right the
reward is constant (1, 10, or 100 in the three experiments). On the other hand, if we were wrong, we subtract

C(t), which increases with the time t that has passed. As it is now "painful" to use more detectors, the
reward incentivizes our model to only use more detectors if that addition translates into higher
success rates. As our model efficiently scans through the state space, it is able to outperform (at least
conceptually) the "check-all" classification approach that was previously introduced.

2.2 When to use RL?


Not all cases adhere to the framework of RL. Here are some rules
• our data should be in the form of trajectories - a set of distinguishable states s0 → s1 → · · · → sN
• need to make a sequence of related decisions - if every decision is independent, like classifying
4 images of cats-vs-dogs, don’t use RL.
• the actions we perform result in some feedback - either positive or negative
• Tasks that are more suitable include both learning and planning - we learn our environment
and plan an optimal behaviour
• the data is not provided a priori, and its distribution changes with our action choices. This
means that we must make sure our agent effectively explores the entire data distribution, and does not
settle in some small subspace.

2.3 Markov Decision Processes (MDP)


An MDP consists of the following
• States: that make up the environment. could be either discrete or continuous.
• Actions: by taking an action we transition from one state to another. In a deterministic process,
we have that P(s′ |s, a) = 1
• Reward: taking action a ∈ A from state s ∈ S results in reward R(s, a) ∈ R
In a finite MDP, the sets S, A, and R are all finite.

2.3.1 The Markov Property


The distribution over future states depends only on the present state and action
Pr[s_{t+1} | s_1, a_1, s_2, a_2, ..., s_t, a_t] = Pr[s_{t+1} | s_t, a_t]     (2.1)
In poker, for example, the Markov property does not hold, as a player's current situation depends
on the actions and hands of previous rounds and/or other players. A traffic light, on the other hand, is
completely Markovian, as it is based on deterministic rules.
Using the Markov property, we can define the probability to reach a certain state s′ with a certain
reward r as
Pr[s′, r|s, a] ≡ Pr[s_{t+1} = s′, r_{t+1} = r | s_t = s, a_t = a]     (2.2)
The probabilities induced by all events in S and R make up a probability space, hence
∀s ∈ S, ∀a ∈ A : ∑_{s′∈S} ∑_{r∈R} Pr[s′, r|s, a] = 1     (2.3)

The expected reward for state-action pairs, namely, what should we anticipate (in terms of reward)
when performing the action a from the state s is
r(s, a) ≡ E[r_{t+1} | s_t = s, a_t = a] = ∑_{r∈R} r ∑_{s′∈S} Pr[s′, r|s, a]     (2.4)

where notice that the probability for a specific reward is the sum over all states, given the specific r
(hence the sum over s′ ∈ S).
We can also phrase our reward in terms of state-action-next state triplets, namely, what should we
anticipate (in terms of reward) when performing the action a that takes us from state s to state s′

r(s, a, s′) ≡ E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′] = ∑_{r∈R} r · Pr[s′, r|s, a] / Pr[s′|s, a]     (2.5)

Where we can think of the probability fraction as the number of events that reach s′ (from s after
performing a) and provide reward r, out of all the events that reach s′ (from s after performing a)
with any reward.
MDPs are very flexible
• both states and actions could be either abstract (s="sad", s="happy", a="take a nap") or
well-defined (like s=sensor readings, or a=turn on a switch).
• the time intervals may not be constant (some transitions are slow while others are fast)
• the setting of an MDP does not need to be an exact copy of the real-world model. For example,
a set of sensors may be enough to describe a robotic arm, even though there are many more
aspects that the arm is made up of (that are not as relevant).

R At this point it may be helpful to look at what is known as the "Backup Diagram", which
describes how the states are propagated based on the actions chosen by π and the probabilities
induced by the environment.

(Backup diagram: from state s, the policy selects a1 or a2 with probabilities π(a1|s), π(a2|s); the
environment then transitions to s2 or s3 with Pr(s2|s, a1), Pr(s3|s, a1), or to s4 or s5 with
Pr(s4|s, a2), Pr(s5|s, a2).)

We can write, for example, the probability to transition from s to s5 by performing action
a2 as P(s5|s, a2) = π(a2|s) Pr(s5|s, a2). In general terms, the probability to move to any state
by performing any action is obtained by taking some action a, summing all probabilities of
states s′ reachable from s using a, and finally summing over all such actions

Pr(reach any state using any action | s) = ∑_{a∈A} π(a|s) ∑_{s′∈S} Pr(s′|s, a)     (2.6)

Equivalently, we can write the probability to reach any state using any action and receiving any
reward

Pr(any state, any action, any reward | s) = ∑_{a∈A} π(a|s) ∑_{s′∈S} ∑_{r∈R} Pr(s′, r|s, a)     (2.7)

2.4 Goals and Rewards


The agent’s goal is to maximize the expected return.
For a finite horizon of size T, the undiscounted sum of rewards is

G_t = R_{t+1} + R_{t+2} + ... + R_{t+T} = ∑_{k=0}^{T−1} R_{t+k+1}     (2.8)

For an infinite horizon, we add a discount factor, as otherwise the infinite sum may diverge and the
agent would not really care for the reward mechanism. As previously stated, the intuition here is
reward now > future reward.

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = ∑_{k=0}^{∞} γ^k R_{t+k+1}     (2.9)

where γ ∈ (0, 1)

2.5 Policies, Value-function and Q-function


The policy π defines our strategy, namely, what we choose to do at every step

π(s, a) = Pr[at = a|st = s] (2.10)

The goal of π is to maximize the value function, which is the cumulative expected return of following
π starting from some state s

V_π(s) = E_π[G_t | s_t = s] = E_π[ ∑_{k=0}^{∞} γ^k R_{t+k+1} | s_t = s ]     (2.11)

Notice that the expectation is w.r.t π, meaning that after the initial state st = s, the next states are
fully determined by π. We can think of Vπ as a measurement of "How good is π?", as intuitively, we
can choose the policy that provides us the maximal expected return.
We can also define the Q-function, which is the same as V except for the fact that we start from s and
perform an initial action a (that may or may not be one of π's options)

Q_π(s, a) = E_π[G_t | s_t = s, a_t = a] = E_π[ ∑_{k=0}^{∞} γ^k R_{t+k+1} | s_t = s, a_t = a ]     (2.12)

2.6 The Bellman equation


Theorem 2.6.1 The value function can be written as

V_π(s) = E_π[G_t | s_t = s]
       = ∑_{a∈A} π(a|s) ∑_{s′∈S} ∑_{r∈R} Pr[s′, r|s, a] [r + γ V_π(s′)]     (2.13)
       = E_{a∈A}[ E_{s′,r}[ r + γ V_π(s′) ] ]

Proof.[1] The return at time t can be rewritten as

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...
    = R_{t+1} + γ (R_{t+2} + γ R_{t+3} + ...)     (2.14)
    = R_{t+1} + γ G_{t+1}

Therefore we can re-write the Value-function as

V_π(s) = E_π[R_{t+1} + γ G_{t+1} | s_t = s]
       = E_π[R_{t+1} | s_t = s] + γ E_π[G_{t+1} | s_t = s]     (2.15)

Focusing on the second term, we will use the law of iterated expectation
E[Y |X = x] = E[E[Y |X = x, Z = z]|X = x]
with Y = Gt+1 , X = St , x = s, Z = St+1 and z = s′ , hence

E_π[G_{t+1} | s_t = s] = E[ E[G_{t+1} | S_t = s, S_{t+1} = s′] | S_t = s ]
                       = E[ E[G_{t+1} | S_{t+1} = s′] | S_t = s ]     (2.16)
                       = E[ V_π(S_{t+1} = s′) | S_t = s ]

In the 2nd transition we removed the inner condition for St = s as Gt+1 = Rt+2 + γRt+3 + ... does not
depend on St . This is the case as every reward term Rt+i is only a function of the current state and
action, so since Rt is not present, no term in Gt+1 is related to St (only to St+1 , St+2 ...). In the last
transition we use the fact that the inner E term is nothing but the value function for t ← t + 1.
Substituting all to 2.15 we have

V_π(s) = E_π[R_{t+1} | s_t = s] + γ E_π[G_{t+1} | s_t = s]
       = E_π[R_{t+1} | s_t = s] + γ E[V_π(S_{t+1} = s′) | S_t = s]
       = E_π[R_{t+1} + γ V_π(S_{t+1} = s′) | S_t = s]     (2.17)
       = ∑_{a} π(a|s) ∑_{s′∈S} ∑_{r∈R} Pr(s′, r|s, a) [ r(s, a, s′) + γ V_π(S_{t+1} = s′) ]

R_{t+1} describes the reward obtained when moving from s to s′ using a, so it can be written as r(s, a, s′).
Furthermore, the expectation E_π is w.r.t the states, actions and rewards induced by π (so the
probability associated with every term is the one introduced in 2.7), and by summing over r ∈ R we
also account for the fact that the transition s → s′ could be rewarded with multiple different rewards
(more than one option is plausible). ■

The Bellman optimality equation is the Bellman equation for the optimal Value-function

V∗(s) = max_a E[ r(s, a) + γ V∗(s′) ]     (2.18)

1 https://stats.stackexchange.com/questions/243384/deriving-bellmans-equation-in-reinforcement-learning

Theorem 2.6.2 The Q-function can be written as

Q_π(s, a) = ∑_{s′,r} Pr[s′, r|s, a] [ r + γ ∑_{a′} π(a′|s′) Q_π(s′, a′) ]     (2.19)

Proof. Not included ■

The Bellman optimality equation is the Bellman equation for the optimal Q-function

Q∗(s, a) = E[ r(s, a) + γ max_{a′} Q∗(s′, a′) ]     (2.20)

2.7 Policy Iteration


Is a method of finding an optimal policy by performing two steps - evaluation and improvement.
After a random initialization of both π and V (Step I), we evaluate the Value-function given some
policy π (Step II)2 . We do this by constantly sampling our environment and updating the value
function for every state respectively. The loop stops when the change in value function (for all s) is
smaller than some tolerance ε. We use the notation π(a|s) to indicate the probability Pr(π(s) = a).
Next, in step III, we observe the current Value function and choose an action that maximizes it.
This will be considered our new returned action for every single state.
Finally, we combine policy evaluation and policy improvement in an iterative process. More
specifically, we evaluate V and improve π until a fixed point is reached (for all s, π(s) has not
changed, possibly up to some margin δ )3 .

2.8 Value iteration


In some cases it may not be efficient or even possible to scan all states s ∈ S, but scanning all possible
actions a ∈ A is. In VI, we set the value of some state s to the maximal one-step backup over all a ∈ A,
and once the required tolerance is reached, the returned policy is the action that maximizes the final
V function (a short code sketch is given after Algorithm 2 below).

2.9 Monte-Carlo
In cases where the dynamics Pr[s′|s, a] and the reward G_t are unknown (model-free setting, 1.2.1),
we can use a Monte-Carlo approach to sample the environment and come up with approximations
for the value-function and Q-function. To do so, one must make sure that the episodes are finite (the
number of transitions until termination is finite). Another important question is how we should behave
when encountering the same state more than once, and there are two common variants
• First-Visit-Monte-Carlo (FVMC): estimates the return obtained only after the first visit to s
(ignore future visits to s)
• Every-Visit-Monte-Carlo (EVMC): estimates the average returns obtained after all visits to s
(average all rewards obtained from s in the episode)
2 The convergence of PE is the result of the Policy evaluation convergence theorem. See HERE for more info.
3 I highly recommend checking out THIS implementation, by Denny Britz.

Algorithm 1 Policy Iteration


Require: tolerance ε > 0 and an MDP ⟨S, A, R, Pr : S × R → R, γ⟩
V, π ← Random Initialization ∈ R|S| ▷ Step I: Initialization

while True do ▷ Step II: Policy Evaluation


∆←0
for s ∈ S do
v ← V (s)
V (s) ← ∑_{a∈A} π(a|s) ∑_{s′∈S} ∑_{r∈R} Pr[s′, r|s, a][r + γV (s′)]          ▷ using thrm. 2.6.1
∆ ← max(∆, |v −V (s)|)
end for
if ∆ < ε then break
end if
end while

policy_stable←True ▷ Step III: Policy Improvement


for s ∈ S do
old_π ← π(s)
π(s) ← argmax_{a∈A} ∑_{s′∈S} ∑_{r∈R} Pr[s′, r|s, a][r + γV (s′)]
if old_π ̸= π(s) then policy_stable←False
end if
end for

if policy_stable then return V, π


else go to step II
end if

Algorithm 2 Value Iteration


Require: tolerance ε > 0 and an MDP ⟨S, A, R, Pr : S × R → R, γ⟩
V ← Random Initialization ∈ R|S|
while True do
∆←0
for s ∈ S do
v ← V (s)
V (s) ← max_{a∈A} ∑_{s′∈S} ∑_{r∈R} Pr[s′, r|s, a][r + γV (s′)]
∆ ← max(∆, |v −V (s)|)
end for
if ∆ < ε then break
end if
end while
return π(s) = argmax_{a∈A} ∑_{s′∈S} ∑_{r∈R} Pr[s′, r|s, a][r + γV (s′)] for all s ∈ S
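
A minimal tabular sketch of Algorithm 2 in NumPy, assuming the dynamics are given as arrays P[s, a, s′] of transition probabilities and R[s, a, s′] of rewards (all names and shapes are illustrative, not part of the course material):

import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """P[s, a, s2]: transition probabilities; R[s, a, s2]: rewards."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # one-step backup for every (s, a): sum_{s'} P(s'|s,a) [r + gamma V(s')]
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)  # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            break
        V = V_new
    policy = Q.argmax(axis=1)  # greedy policy w.r.t the final V
    return V, policy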

2.9.1 Approximating Value-function


To approximate the value function V_π of a given policy π, we sample a trajectory and update the
cumulative discounted return given the rewards we have received. In the case of FVMC, we save the
resulting return for a given state iff we did not encounter it previously in the trajectory. The above is
also described in alg. 3.

Algorithm 3 First-Visit MC for Value function approximation


Require: a policy π to be evaluated
V (s) ← arbitrary initialization ∈ R for all s ∈ S
Returns(s) ←empty list [] for all s ∈ S
while True do
τ ← {s0 , a0 , r1 , s1 , a1 , r2 , ..., sT −1 , aT −1 , rT } ▷ generate an episode using π
G←0
for step t = T − 1, T − 2, ..., 0 do
G ← γG + rt+1
if s_t ∉ {s_0, s_1, ..., s_{t−1}} then          ▷ Verifying first-visit
Returns(st ).append(G)
V (st ) ← avg(Returns(st ))
end if
end for
end while

2.9.2 Approximating policies


When a model of the world is available, we have already seen (in Policy Iteration, for example)
that given the state-values alone, one could generate a policy. On the other hand, when the model
is not available, the state values alone do not contain enough information to formulate a policy,
and one must estimate the action values in order to come up with a good policy. This means that
approximating the Q-function could be useful for determining a policy π.
The formalism of FVMC for Q-function approximation is almost identical to alg. 3, except the
fact that in a First-Visit, we make sure that both the state s and the action a were not encountered
yet in the trajectory. It is also important to understand that if π is deterministic, following it we
only observe returns for one specific action from each state. This means that there are no returns to
average, hence we in fact do not allow our estimate to explore the state-action space properly. To
solve this we must enforce exploration by, for example, setting a random state-action starting point
for every episode.
To understand how a policy can be generated, we will go through the following steps:
Consider a Monte-Carlo version of classical policy iteration - given some initial policy π_0, we
approximate Q_π over and over again using MC sampling (with additional starting-point exploration).
From here, for any action-value function Q, the greedy policy is the one that chooses a maximal
action, i.e. π(s) = argmax_a Q(s, a). The above is neatly described in the following pseudo-code

2.10 On/Off-Policy methods


There are two types of policy methods

Algorithm 4 MCFV with Exploring Starts (ES) for estimating π∗


π(s) ∈ A arbitrary for all s ∈ S ▷ Initialization
Q(s, a) ∈ R arbitrary for all s ∈ S and a ∈ A
Returns(s, a) ←empty list [] for all s ∈ S and a ∈ A
while True do
s0 ∈ S and a0 ∈ A randomly chosen s.t all pairs (si , ai ) are reachable from (s0 , a0 ) with prob> 0
τ ← {s0 , a0 , r1 , s1 , a1 , r2 , ..., sT −1 , aT −1 , rT } ▷ generate an episode using π from (s0 , a0 )
G←0
for step t = T − 1, T − 2, ..., 0 do
G ← γG + rt+1
if (s_t, a_t) ∉ {(s_0, a_0), (s_1, a_1), ..., (s_{t−1}, a_{t−1})} then          ▷ Verifying state-action first-visit
Returns(st , at ).append(G)
Q(st , at ) ← avg(Returns(st , at ))
π(st ) ← argmaxa Q(st , a)
end if
end for
end while

• On-policy methods: attempts to evaluate or improve the policy that is being used to make
decisions. As stated before, if π does not attain all state-action pairs with probability> 0, we
will poorly explore the space.
• Off-policy methods: attempt to evaluate or improve a policy other than the one used to generate
the data (the one that selects actions).
We work with two distinct policies:
– The target policy π - the one that we wish to learn
– the behavior policy b - the one used to generate the data
While on-policy methods tend to be more data efficient, they require new samples with each change
of policy. Off-policy methods, on the other hand, are slower but more powerful and general, as they can be
used to learn from various sources (like from a human expert)

2.10.1 Importance sampling for Off-Policy methods

Importance sampling is a technique for estimating expected values under one distribution given
samples from another. It is performed by weighting returns according to the ratio of the probabilities
of a trajectory under the two policies.
Let's assume that the behavior policy b is stochastic and the target policy π is deterministic. This
means that the trajectories in the data (that were chosen by b) may be different from those chosen by
π, which raises the question of how to calculate the expected return. The solution is to weigh each
return based on how likely its trajectory is under the target policy relative to the behavior policy.
Consider the trajectory τ = {st , at , st+1 , at+1 , ..., sT }. The probability to obtain τ given the starting
state st and the actions at:T −1 ∼ π is

Pr[τ | s_t, a_{t:T−1} ∼ π] = ∏_{k=t}^{T−1} π(a_k|s_k) Pr[s_{k+1}|s_k, a_k]     (2.21)

Denoting the importance sampling ratio for the time window [t, t+1, ..., T−1] as ρ_{t:T−1}, we take the
relative probability of the trajectory under the target and behaviour policies

ρ_{t:T−1} ≡ Pr[τ | s_t, a_{t:T−1} ∼ π] / Pr[τ | s_t, a_{t:T−1} ∼ b] = ∏_{k=t}^{T−1} π(a_k|s_k) / b(a_k|s_k)     (2.22)

Notice that even though the probabilities P[s′ |s, a] may be unknown, they cancel out in 2.22. From
here, we can use ρt:T −1 and the return Gt of the behaviour policy b to obtain Vπ , as

Vπ (s) = E[ρt:T −1 Gt |st = s] (2.23)

For example, if some trajectory τ is twice as plausible under b than it is under π, the expected return
for π would be 1/2 (in expectation) the return under b, which can also be seen as ρ = 1/2.
From here, we can take the MC algorithm (that averages returns), provide it with episodes following
b but still estimate Vπ .
Let T(s) be the set of all time steps at which state s was visited (over all episodes), and T(t) be the time
of termination after time t for a given episode. Then {G_t}_{t∈T(s)} is the set of returns associated with s
across all episodes, and {ρ_{t:T(t)−1}}_{t∈T(s)} are the corresponding IS ratios. To estimate V_π we can use

V_π(s) = ∑_{t∈T(s)} ρ_{t:T(t)−1} G_t / |T(s)|     (2.24)
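
A small sketch of this ordinary importance-sampling estimate, assuming each visit to s has been stored as a pair (ρ, G) of its IS ratio and return (the data structure is illustrative):

def ordinary_is_estimate(visits):
    """visits: list of (rho, G) pairs collected for a state s, as in eq. (2.24)."""
    if not visits:
        return 0.0
    # weight each behaviour-policy return by its IS ratio, then average
    return sum(rho * G for rho, G in visits) / len(visits)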

2.11 ε-Greedy Algorithms


In RL we often need to balance between
• Exploration - experimenting with multiple actions to better assess the expected reward
• Exploitation - attempt to maximize the reward by choosing the optimal action. We can do this
by, for example, estimating our Q function with MC as
Q̂_t(a, s) = (1/N_t(a)) ∑_{t=1}^{T} (r_t | a_t = a, s_t = s)

where N_t(a) is the number of times the action a was chosen, and then choose an action

a∗_t = argmax_{a∈A} Q̂_t(a, s)

Notice how these two may collide, as when we explore the environment we may not always
choose the optimal action. In an ε-greedy approach, we explore with probability ε, and in all other
cases we choose the optimal action:

• With probability 1 − ε: select a∗_t = argmax_{a∈A} Q̂_t(a, s)
• With probability ε: choose a random a ∈ A
Intuitively speaking, as we maintain the ability to explore forever, we will eventually find an optimal
policy. We also make sure to not include the randomness in the testing phase.
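
A small sketch of the ε-greedy choice, assuming a tabular Q stored as a NumPy array indexed by [state, action] (an assumption for illustration, not the only possible representation):

import numpy as np

def epsilon_greedy(Q, s, epsilon):
    if np.random.rand() < epsilon:            # with probability epsilon: explore
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))               # otherwise exploit: a* = argmax_a Q(s, a)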

2.12 Temporal Difference (TD) Learning


When discussing MC learning, we showed how sampling the environment without knowing the
dynamics of the system could be enough to learn V, Q, and a policy π directly. On the other hand,
when using MC we must make sure that the episodes are finite, and we could only learn from
complete episodes. In TD learning, on the other hand, learning from both infinite environments and
incomplete sequences is possible, as we can update our approximation after every step (compared to
after every episode in MC). We define the TD-error as the difference between the optimal value
function V_t∗ and the current prediction V_t:

δ_t^TD(s) = V_t∗(s) − V_t(s)
          = ∑_{k=0}^{∞} γ^k r_{t+k+1} − V_t(s)
          = r_{t+1} + ∑_{k=1}^{∞} γ^k r_{t+k+1} − V_t(s)     (2.25)
          = r_{t+1} + γ ∑_{k=0}^{∞} γ^k r_{t+k+2} − V_t(s)
          = r_{t+1} + γ V_{t+1} − V_t(s)

As we do not know V_{t+1}∗, we can approximate it using the predicted V_{t+1}

δ_t^TD ≈ r_{t+1} + γ V_{t+1} − V_t     (2.26)

This error is used as the general update rule, where we also add a learning rate α (much like in
gradient descent, where we can think of V_{t+1} − V_t as the term ∇V_t)

V_t ← V_t + α [r_{t+1} + γ V_{t+1} − V_t]     (2.27)

R notice that we do not use an expectation term (as seen in policy iteration for example), as the
update rule is the result of looking only one step into the future, given some episode rollout.

Algorithm 5 One-Step TD [TD(0)]


Require: π the policy to evaluate, α ∈ R
V (s) ∈ R arbitrary initialized for all s ∈ S
for each episode E do ▷ sampling a trajectory like in MC
for each s ∈ E do
a ← π(s)
r, s′ ←take action a and observe r, s′ ▷ taking one step to the future
V (s) ← V (s) + α(r + γV (s′ ) −V (s)) ▷ using eq. 2.27
s ← s′
end for
end for
return V

2.12.1 On-Policy TD Control: SARSA


SARSA is an on-policy algorithm (remember 2.10) that uses TD(0) in order to approximate the
Q-function. The idea is similar to the value-function update error, and the MC sampling would

Algorithm 6 SARSA
Require: α ∈ R
Q(s, a) ∈ R arbitrary initialized for all s ∈ S and for all a ∈ A
for each episode E do ▷ sampling a trajectory like in MC
for each s ∈ E do
choose a from s given Q ▷ like in ε-greedy (2.11)
r, s′ ←take action a and observe r, s′ ▷ taking one step to the future
choose a′ from s′ given Q ▷ like in ε-greedy (2.11)
Q(s, a) ← Q(s, a) + α(r + γQ(s′ , a′ ) − Q(s, a))
s ← s′ , a ← a′
end for
end for
return Q

consider both the next state and the next action. Do notice that, as always, to approximate Q we
must find its optimal value for every s ∈ S, and a ∈ A - which is computationally difficult.

R the name SARSA stems from the idea that the update rule uses the quintuple ⟨st , at , rt+1 , st+1 , at+1 ⟩

2.12.2 Off-Policy TD Control: Q-Learning


In Q-Learning, we do not look at the next Q(s′, a′), but at the Q-value that is maximal over all possible
actions, max_a Q(s′, a). This means that the corresponding maximal Q-value may be associated with
an action that may or may not be the result of our policy π (hence, it is Off-Policy). More specifically,
Q-learning is based on a greedy approximation of the optimal policy, which plays the role of the target
policy, while the data is still generated by the current behaviour policy (compared to SARSA, which
evaluates the policy it follows). Notice that the current policy still matters, as it determines which
state-action pairs are visited and updated.

Algorithm 7 Q-Learning
Require: α ∈ R
Q(s, a) ∈ R arbitrary initialized for all s ∈ S and for all a ∈ A
for each episode E do ▷ sampling a trajectory like in MC
for each s ∈ E do
choose a from s given Q ▷ like in ε-greedy (2.11)
r, s′ ←take action a and observe r, s′ ▷ taking one step to the future
Q(s, a) ← Q(s, a) + α(r + γ max_a Q(s′, a) − Q(s, a))
s ← s′
end for
end for
return Q

Also notice we did not track the next action a′, as we did not use it (we scanned all a ∈ A
instead). It should also be stated that Q-Learning usually converges quicker, due to the optimal
choice max_a Q(s′, a). Having said that, as we do not take the next action a′ into consideration, using
ε-greedy actions might mean that we take a step into a state with a very bad reward (like falling down
a cliff), simply because we are less conservative (due to exploration and not committing to a next action).
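
A tabular sketch of Algorithm 7, assuming a Gymnasium-style environment with discrete observation and action spaces; the environment and hyper-parameters are illustrative assumptions, not part of the lecture:

import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy (section 2.11)
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # off-policy target: max over all actions in the next state
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not terminated) - Q[s, a])
            s = s2
    return Q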
3. DQN & its derivatives

This section is based on sources I found online (because the lecture was not uploaded). Our new
goal would be to use a parameterized function approximator to represent the state-action Q-function,
instead of representing it with a table. In other words, we wish to find a function Q̂(s, a, θ ) ≈ Q(s, a),
where θ is our function’s parameters and Q(s, a) is the true function/oracle.

Let's start off with the ideal assumption, in which the oracle Q(s, a) is accessible. Our function
approximator could be learned using SGD, that is, by minimizing the squared loss J(θ) w.r.t the
oracle over batches sampled from our environment

J(θ) = E[ (Q(s, a) − Q̂(s, a, θ))² ]     (3.1)

We update our parameter vector θ using θ ← θ + ∆θ, where

∆θ = −(1/2) α ∇_θ J(θ)
   = −(1/2) α ∇_θ E[ (Q(s, a) − Q̂(s, a, θ))² ]
   = −(1/2) α · 2 E[ (Q(s, a) − Q̂(s, a, θ)) · (−∇_θ Q̂(s, a, θ)) ]     (3.2)
   = α E[ (Q(s, a) − Q̂(s, a, θ)) ∇_θ Q̂(s, a, θ) ]

As Q(s, a) is generally unknown, it must be replaced with some approximated target. Recall
that in SARSA for example, our target was based on the temporal difference r + γ Q̂(s′ , a′ , θ ). In
classic Q-learning, on the other hand, we used an off policy approach, in which our target was
r + γ max_{a′∈A} Q̂(s′, a′, θ). As we now proceed to describe DQN, we will follow the update rule

∆θ = α E[ ( r + γ max_{a′∈A} Q̂(s′, a′, θ) − Q̂(s, a, θ) ) ∇_θ Q̂(s, a, θ) ]     (3.3)

3.1 Deep Q-Network


We now ask how we should describe our approximation Q̂(s, a, θ), and the answer is a deep neural
network.

3.1.1 Architecture
In the original paper, the DQN architecture was based on a convolutional neural network. The
network takes in an input of shape 84 × 84 × 4 (a processed batch of images, which is considered the
state)1, and propagates this input via three convolutional layers. To finally come up with an action
value, there are two fully connected layers, where the last one has a single output for every possible action.
Notice how our function approximates a value for every action given a state, hence it can be written as
∀ s ∈ S : Q̂(s, θ): R^{|S|} → R^{|A|}.

3.1.2 Training DQN


Those of us who are familiar with supervised learning might think that the kind of "true label"
introduced in 3.3 is not much of a true label at all, as it is based on the network itself, which is
constantly changing. This fact makes the learning process unstable, therefore a few "tricks" were
introduced in the paper - the Experience Replay, and a separate Target Network. Summarizing the
two in a short sentence, our Q-network is learned by minimizing

J(θ) = E_{s_t,a_t,r_t,s_{t+1}}[ (y_t^DQN − Q̂(s_t, a_t, θ))² ]     (3.4)

where y_t^DQN is the one-step TD target

y_t^DQN = r_t + γ max_{a′∈A} Q̂(s_{t+1}, a′, θ⁻)     (3.5)

Notice how we use both θ and θ⁻, where θ is the original parameter vector of the network and
θ⁻ represents the parameters of the target network. By doing so we update θ based on values from
a previous version of our model. Furthermore, the batches we sample are drawn from all past
transition tuples.

Experience Replay
We store the transition ⟨s_t, a_t, r_t, s_{t+1}⟩ := e_t at each time-step in a fixed-size buffer D_t = {e_1, e_2, ..., e_t},
and during SGD we sample uniformly from D. This is because
• Data efficiency: each experience ei can potentially be used in many updates
• De-correlation: randomly sampling leads to de-correlation between consecutive experiences.
As correlation breaks, expected values will fluctuate less, meaning that the variance in sampling
is reduced, hence stability is increased.
• Smoother divergence: As our training process is based on a large number of experiences,
outliers tend to average out with the rest of the samples, leading to less oscillations in the
training process.
Note that in a more sophisticated experience replay, we might weigh experiences based on their
importance and keep the relevant ones for longer.
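
A minimal sketch of such a (uniform) replay buffer, using a fixed-size deque; the class and method names are illustrative:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform sampling de-correlates consecutive experiences
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)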
1 Originally, images were 210 × 160 × 3, but the preprocessing applied spatial dimension reduction to 84 × 84, extracted
the Y channel from the RGB and concatenated 4 consecutive images as "memory".

Target Network
Another stability improvement comes from the fact that we use two neural nets. Ideally, we would
like to minimize the effect of our targets being Non-Stationary, and we do so by setting a network
that is only updated after C ≫ 1 steps. This means that our non-stationary target will, in a sense, be
stationary for at least C steps, hence stabilizing our learning process even further.

Algorithm 8 Deep Q-Learning (DQN)


Initialize experience replay D with fixed size
Initialize network Q̂ with random weights θ
Initialize target network Q̂T with random weights θ − = θ
for episode m = 1, 2, ..., M do
observe environment to get s_1
take action a_t ← random action with probability ε, otherwise argmax_{a∈A} Q̂(s_t, a, θ)
r_t, s_{t+1}, done_t ← take action a_t in the environment
D ← D ∪ {⟨s_t, a_t, r_t, s_{t+1}, done_t⟩}
B ← sample a random batch {⟨s_i, a_i, r_i, s_{i+1}, done_i⟩}_{i=1}^{N} from D
for every experience ⟨si , ai , ri , si+1 , donei ⟩ ∈ B do
y_i ← r_i if done_i else r_i + γ max_{a′∈A} Q̂_T(s_{i+1}, a′, θ⁻)
end for
perform SGD on J(θ) = (1/N) ∑_{i=1}^{N} (y_i − Q̂(s_i, a_i, θ))² w.r.t θ
every C steps θ − ← θ
end for

3.2 Double Deep Q-Network (DDQN)


We'd like to state that the max operator is prone to overestimation. Consider the following example:2
■ Example 3.1 Say N ≫ 1 people have an equal weight of 80kg, and we'd like to measure all of their
weights using a weighing scale that is off by ±1 kg (equal probability to measure > 80 and < 80).
Let's run two sets of measurements:
• Denote the weight measurement of the i'th person as X_i, and set Y = max_i X_i. We can intuitively
understand that almost surely Y > 80, as almost surely there exists some j for which X_j > 80.
This really tells us that the max operator is prone to overestimation when noise is introduced
to the system.
• As a second experiment, we will measure each person's weight twice and store the values in
X_1^i, X_2^i. To estimate Y, we first calculate n = argmax_i X_1^i. Next, we take the second measurement
Y = X_2^n as our maximal value. Notice that as X_2^n is independent from X_1^n, it is equally likely to
overestimate the real value or underestimate it, hence it is not systematically over-optimistic.
So everyone weighs 80kg and X_1^n > 80 with high probability, but X_2^n is both > 80 and < 80
with even probability.

2 from CIS 522 YouTube channel



With the example above in mind, notice how in the DQN algorithm we both choose an action a_t
using Q̂(s_t, a, θ) and evaluate it (when calculating y_i), which means that we tend to overestimate
the target values. To address this issue we replace the current update of y_i (when not done_i) with

y_i^DoubleDQN = r_i + γ Q̂( s_{i+1}, argmax_{a′∈A} Q̂(s_{i+1}, a′, θ), θ⁻ )     (3.6)

In other words, we choose an action (argmax) using a network with parameters θ (that is currently
being trained), but evaluate the action using a network with parameters θ − (that is not being trained)
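
A sketch of the two targets side by side, assuming q_net (parameters θ) and target_net (parameters θ⁻) are PyTorch modules mapping a batch of states to per-action Q-values, and done is a float mask (all names are illustrative):

import torch

@torch.no_grad()
def dqn_target(r, s_next, done, target_net, gamma=0.99):
    # y = r + gamma * max_a' Q_hat(s', a'; theta^-): the same network selects and evaluates
    q_next = target_net(s_next).max(dim=1).values
    return r + gamma * q_next * (1 - done)

@torch.no_grad()
def double_dqn_target(r, s_next, done, q_net, target_net, gamma=0.99):
    # select the action with the online network (theta) ...
    a_star = q_net(s_next).argmax(dim=1, keepdim=True)
    # ... but evaluate it with the target network (theta^-), as in eq. (3.6)
    q_eval = target_net(s_next).gather(1, a_star).squeeze(1)
    return r + gamma * q_eval * (1 - done)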

3.3 Dueling network


Starting off with a new definition, the Advantage function that is associated with a policy π and a
state-action function Q is given by
Aπ (s, a) = Qπ (s, a) −Vπ (s) (3.7)
We can think of the advantage as a measurement of how important some action is: take
the importance of a given state-action pair (Q(s, a)) and subtract the importance of the state (V(s))
to get the importance of a given action. Note that as V^π(s) = E_{a∼π}[Q^π(s, a)] (see eq. 1.6), we have
that E_{a∼π}[A^π(s, a)] = 0.
A Dueling network approximates the Q-function by first separating the output of the network into
two distinct elements - one is V(s) and the other is A(s, a_1), A(s, a_2), ... - and then extracts from both
the value of Q.

Figure 3.1: (Top) a standard DQN architecture, where the last layer is a vector representing
Q(s, a_1), Q(s, a_2), .... (Bottom) a Dueling DQN, where the upper branch represents the value
V(s), the lower branch represents the values A(s, a_1), A(s, a_2), ..., and the final layer represents
Q(s, a_1), Q(s, a_2), .... As the input/output are the same, Alg. 8 can be applied given the Dueling
architecture as well.

Recall that in regular DQN, for every state we estimated the value of all action choices. This may be
unnecessary in cases where an action has little to no effect on the outcome.
For example - if you have fallen down a cliff, it is really irrelevant what the action values of steering the
wheel are. As we split our architecture into two distinct components, we can learn state values without
having to learn the effect of each action on those states - as that may be an unnecessary computation
for some states.

3.3.1 Implementation
Denoting the output of the upper branch as V (s, θ , θV ) and the output of the lower branch as
A(s, θ , θA ) where θ are shared parameters and θV , θA are distinct parameters for every branch, it may
seem reasonable to add these values to obtain Q(s, a, θ , θV , θA ). The immediate problem that arises
is that given such Q, we cannot recover the exact values of both A and V , as for example it may be
the case that Q = (A + x) + (V − x), and x can be chosen freely.
To distinguish A and V, the following formula was applied:

Q̂(s, a, θ, θ_A, θ_V) = V(s, θ, θ_V) + [ A(s, a, θ, θ_A) − max_{a′∈Actions} A(s, a′, θ, θ_A) ]     (3.8)

This trick forces the Q-value associated with the maximizing action to equal V (as for the maximizing
action, which is the action that is chosen, the square brackets zero out). This means that the upper
stream (V) can be identified as the Q-function value, and the lower stream as the advantage
function.

R Alternatively, one can use

Q̂(s, a, θ, θ_A, θ_V) = V(s, θ, θ_V) + [ A(s, a, θ, θ_A) − (1/|Actions|) ∑_{a′∈Actions} A(s, a′, θ, θ_A) ]     (3.9)

which was shown to improve stability, as was stated in the article
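
A sketch of a dueling head using the mean formulation of eq. 3.9, with a shared feature trunk assumed to come before it (sizes and names are illustrative):

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feat_dim=128, n_actions=4):        # illustrative sizes
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)                # V(s; theta, theta_V)
        self.advantage = nn.Linear(feat_dim, n_actions)    # A(s, a; theta, theta_A)

    def forward(self, features):
        v = self.value(features)                           # shape (B, 1)
        a = self.advantage(features)                       # shape (B, n_actions)
        # eq. (3.9): subtract the mean advantage so V and A are identifiable
        return v + a - a.mean(dim=1, keepdim=True)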


4. Policy Gradients

Instead of approximating the value function or the state-action value function, we can aim to model the policy itself (see also 1.3). Again, we take the path of function approximation, rather than a look-up table approach that assigns an action to every state. Generally speaking, we define a parameterized policy function π_θ and try to optimize it using a gradient ascent optimization process

θt+1 = θt + α∇J(θt ) (4.1)

where J(θ_t) is a performance measure and ∇J(θ_t) is its (estimated) gradient.


π_θ itself may have various kinds of outputs, depending on the problem setup (see 1.1.3 for more info):
• For a discrete action space - π_θ(s) ∈ R^{|A|}, where each entry may represent the probability associated with an action
• For a continuous action space - the output may be the mean and covariance of a Gaussian distribution (from which we sample an action), or a floating-point value as commonly used in regression tasks
Policy-gradient methods can learn an appropriate balance between exploration and exploitation, and hence are often preferred over ε-greedy approaches. They also provide us with smoother action selection, as changes in the final action choice stem from changes to the parameters of the model, rather than from some random value exceeding ε.

4.1 The Policy-Gradient theorem


Let us define the performance measure J(θ ) as the value function associated with the parameterized
policy

J(θ ) = Vπθ (s) (4.2)

The PG theorem states that the gradient of J(θ) is proportional to the gradient of our policy, weighted by the Q function over all states and by the average number of times each state is visited in a trajectory, µ(s).

∇J(θ) ∝ ∑_s µ(s) ∑_a Q_{π_θ}(s, a) ∇π_θ(a|s)    (4.3)

The proof can be seen in S&B2020, page 325.


The PG theorem provides us with an analytic term to the change in performance w.r.t the parameters
of our policy.

4.2 The REINFORCE Algorithm


If we follow the target policy π, the weighted sum over all states in eq. 4.3 can be replaced with an expectation over the policy, since µ(s), the frequency with which each state is reached under π, is exactly the weighting induced by following the policy. In other words, ∇J(θ) = E_π[∑_a Q_{π_θ}(s, a) ∇π_θ(a|s)].
Supposedly, this is enough to formulate a gradient-ascent algorithm, where the expected value is replaced with sampling a set of experiences - θ_{t+1} = θ_t + α ∑_a Q̂(s_t, a, θ_Q) ∇π_θ(a|s_t). Having said that, this approach requires an approximation of Q as well, and the update rule takes into consideration all actions, which is not ideal.
A preferred option would be to articulate a learning scheme that is based only on taking one step in our environment, i.e. taking an action a_t and understanding from it alone how to change the weights:
 
∇J(θ) = E_π[ ∑_a Q_{π_θ}(s_t, a) ∇π_θ(a|s_t) ]
       = E_π[ ∑_a π_θ(a|s_t) Q_{π_θ}(s_t, a) ∇π_θ(a|s_t)/π_θ(a|s_t) ]      [multiply and divide by the same value]
       = E_π[ Q_{π_θ}(s_t, a_t) ∇π_θ(a_t|s_t)/π_θ(a_t|s_t) ]               [replace ∑_a [...] with sampling a_t ∼ π]    (4.4)
       = E_π[ G_t ∇π_θ(a_t|s_t)/π_θ(a_t|s_t) ]                             [Q_π(s_t, a_t) = E_π[G_t|s_t, a_t]]
Hence we can now rewrite the update rule as θ_{t+1} = θ_t + α G_t ∇π_θ(a_t|s_t)/π_θ(a_t|s_t). Notice that this approach is only applicable in the episodic setting, as to calculate the return G_t we need to sum all future rewards until the end of the episode is reached. Intuitively, we can decompose this expression as follows:
• ∇π_θ(a_t|s_t) is a vector pointing in the direction (in parameter space) that most increases the probability of taking a_t, given the state s_t
• the "amplitude" is G_t/π_θ(a_t|s_t), meaning that the update is proportional to the return G_t - actions that yield high returns get their probability pushed up more - and inversely proportional to π_θ(a_t|s_t) - this makes sense as otherwise, actions that are already frequent (high π_θ(a_t|s_t)) would be reinforced even if they do not yield a high return.
One last trick is to notice that ∇ ln x = ∇x/x, so the final update rule is

θt+1 = θt + αGt ∇ ln πθ (at |st ) (4.5)

From here, we can generate an episode from our initialized policy function, calculate the return as a sum of discounted (or undiscounted) rewards (remember that we assume a finite-episode setting), and update the network parameters using the update rule from above.

Algorithm 9 REINFORCE: MC Policy-Gradient (episodic case) for control


Require: a (differentiable), randomly initialized policy πθ (a|s)
Require: learning rate α
while did not converge do
generate an episode {s0 , a0 , r1 , ..., sT −1 , aT −1 , rT } given πθ
for each step t = 0, 1, ..., T − 1 do
G ← ∑Tk=t+1 γ k−t−1 rk
θ ← θ + αγ t G∇ ln πθ (at |st )
end for
end while
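As an illustration only, the following NumPy sketch implements Alg. 9 for a linear-softmax policy; it assumes a Gym-style environment interface (reset/step) and is not tied to any particular task.

```python
import numpy as np

def softmax_policy(theta, s):
    logits = theta @ s                     # theta: (|A|, obs_dim)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """grad_theta ln pi_theta(a|s) for a linear-softmax policy."""
    p = softmax_policy(theta, s)
    grad = -np.outer(p, s)                 # -pi(a'|s) * s for every action a'
    grad[a] += s                           # +s for the action actually taken
    return grad

def reinforce(env, n_actions, obs_dim, alpha=1e-3, gamma=0.99, episodes=500):
    theta = np.zeros((n_actions, obs_dim))
    for _ in range(episodes):
        s, _ = env.reset()
        states, actions, rewards = [], [], []
        done = False
        while not done:                    # generate one episode with pi_theta
            a = np.random.choice(n_actions, p=softmax_policy(theta, s))
            s_next, r, terminated, truncated, _ = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s, done = s_next, terminated or truncated
        for t in range(len(states)):       # MC return and per-step update
            G = sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
            theta += alpha * (gamma ** t) * G * grad_log_pi(theta, states[t], actions[t])
    return theta
```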

4.2.1 REINFORCE with Baseline


Sampling an entire episode may cause high variance in the updates and slow convergence. To mitigate the problem we can include a baseline, which can be any function that we compare our predicted return to. We use this baseline in the update rule as θ_{t+1} = θ_t + α (G_t − b(s_t)) ∇π_θ(a_t|s_t)/π_θ(a_t|s_t). We must also ensure that our baseline does not depend on the actions, as those are what we aim to learn. More specifically, a reasonable baseline is the value function V̂(s_t, θ_V) (recall that V is the mean value of G, starting from s_t and following the policy π). V̂ should be estimated using one of the methods previously discussed.

Algorithm 10 REINFORCE with Baseline: MC Policy-Gradient (episodic case) for control


Require: a (differentiable), randomly initialized policy πθ (a|s)
Require: a (differentiable), randomly initialized value function V̂ (s, θV )
Require: learning rate αV , α
while did not converge do
generate an episode {s0 , a0 , r1 , ..., sT −1 , aT −1 , rT } given πθ
for each step t = 0, 1, ..., T − 1 do
G ← ∑Tk=t+1 γ k−t−1 rk
δ ← G − V̂ (st , θV )
θV ← θV + αV δ ∇V̂ (st , θV )
θ ← θ + αγ t δ ∇ ln πθ (at |st )
end for
end while

Intuitively, G_t − V̂(s_t) tells us how well we did after time step t (G_t) compared to what we expected to achieve from that same time step (V̂(s_t)). If, for example, the performance was better than expected, the log-likelihood term for the action that was used is now weighted by that positive difference, encouraging our agent to continue with that behavior. If we anticipated the return exactly, then the actions we chose were ideal, hence no further update is required (G_t − b(s_t) = 0). Compare this to the previous method with no baseline, where we always multiplied by the full return G_t: unnecessary updates would have been made, which hinders convergence.

4.3 Actor-Critic methods


Let's go over two new definitions:

• Actor-only methods: use a parameterized policy representation + gradient ascent, and rely on sampling the environment (i.e. acting on the environment). Such methods have high variance due to sampling and do not accumulate older information (meaning that once the policy is updated, for example, we start collecting experience over again).
• Critic-only methods: directly solve for a value function / optimize a parameterized value function (i.e. create a value criterion based on states and actions). Theoretically, these methods can yield optimal policies, though this is not guaranteed.
We can think of an actor method as a method that, well, acts on the environment, whereas the critic
provides some measurement of that performance.
The concept of combining the two means that we use our critic to evaluate the state and reward
given by the system, provide a value that will be inserted to the actor, and the actor will eventually
output some action to be performed on the environment. AC methods are the intersection between
Value-based methods and Policy-based methods:

                      The policy is           The Value function is
 Value-based          implicit (ε-greedy)     estimated
 Policy-based         estimated               no use of the Value function
 Actor-Critic based   estimated               estimated

Figure 4.1: Actor-Critic diagram. Notice that, w.r.t the table above, both the actor and the critic are estimated (i.e. parameterized).

The formalism of AC methods is similar to Policy Iteration (recall alg. 1), as we alternate between policy evaluation, where the value function of the current policy is estimated, and policy improvement, where we use that evaluation to improve the policy. More specifically, the actor attempts to improve the policy (using Policy Gradients, for example, or argmax Q), while the critic evaluates the current policy.

4.3.1 One-Step AC algorithm


Recall our TD-update decomposition (see eq. 2.27), V_t ← V_t + α[r_{t+1} + γV_{t+1} − V_t]. We can use this update rule to reformulate REINFORCE with Baseline:

θ_{t+1} = θ_t + α (G_t − V̂(s_t, θ_V)) ∇ ln π_θ(a_t|s_t)
        = θ_t + α (r_{t+1} + γ V̂(s_{t+1}, θ_V) − V̂(s_t, θ_V)) ∇ ln π_θ(a_t|s_t)    (4.6)
        = θ_t + α δ_t^{TD} ∇ ln π_θ(a_t|s_t)

Algorithm 11 One-Step Actor-Critic


Require: a (differentiable), randomly initialized policy πθ (a|s)
Require: a (differentiable), randomly initialized value function V̂ (s, θV )
Require: learning rate αV , α
while did not converge do
sample an initial state s
define γ term I ← 1
while s is not terminal do
take action a ∼ πθ (·|s) and observe s′ , r
calculate TD error δ ← r + γ V̂ (s′ , θV ) − V̂ (s, θV )
update V̂ ’s weights θV ← θV + αV δ ∇θV V̂ (s, θv )
update πθ ’s weights θ ← θ + αθ Iδ ∇θ ln πθ (a|s)
I ← γI, s ← s′
end while
end while

The pseudo-code above summarizes this procedure. Notice how, compared to MC control (as in the original REINFORCE), the update happens after every step, meaning we do not have to simulate an entire episode before updating. This makes the process less variable, allowing for faster convergence.
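A NumPy sketch of Alg. 11 under the same assumptions as the REINFORCE sketch above (linear-softmax actor, linear critic, Gym-style environment); it is meant to illustrate the per-step updates, not to be a tuned implementation.

```python
import numpy as np

def one_step_actor_critic(env, n_actions, obs_dim,
                          alpha_v=1e-2, alpha_pi=1e-3, gamma=0.99, episodes=500):
    theta = np.zeros((n_actions, obs_dim))   # actor (linear-softmax policy)
    w = np.zeros(obs_dim)                    # critic (linear value function)

    def pi(s):
        logits = theta @ s
        p = np.exp(logits - logits.max())
        return p / p.sum()

    for _ in range(episodes):
        s, _ = env.reset()
        I, done = 1.0, False
        while not done:
            probs = pi(s)
            a = np.random.choice(n_actions, p=probs)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD error: bootstrap with V(s') unless s' is terminal
            v_next = 0.0 if done else w @ s_next
            delta = r + gamma * v_next - w @ s
            # critic update: theta_V <- theta_V + alpha_V * delta * grad V
            w += alpha_v * delta * s
            # actor update: theta <- theta + alpha * I * delta * grad ln pi(a|s)
            grad_log = -np.outer(probs, s)
            grad_log[a] += s
            theta += alpha_pi * I * delta * grad_log
            I *= gamma
            s = s_next
    return theta, w
```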

4.3.2 Asynchronous Advantage Actor-Critic (A3C)


Note that we are not bound to a single actor-critic learner: multiple workers, each with its own actor-critic, can operate simultaneously and contribute to a single shared model. Using multiple workers also means that we explore the environment more broadly, as, for example, each worker can be initialized differently. We can think of A3C as a sort of committee method between the workers, but other versions exist (a weighted sum of critics, or even an attention mechanism). Broadly speaking, A3C is implemented as a wrapper around the basic Actor-Critic, where every worker computes its own updates ∆θ, ∆θ_V and applies them (asynchronously) to the global parameters θ, θ_V.
5. Imitation Learning

Imitation learning is a learning process that is based on data provided from an "expert" - a being
from which we understand how to perform some task/movement/etc..
Let us start with a new definition - the Regret, which is the difference between following the optimal
policy and some other policy.

5.1 The Regret

R This part uses slightly different notation, but should hopefully still be understandable.

We define an action value as the expected reward for an action a

Q(a) = E[r|a] (5.1)

Given an optimal policy π ∗ , the optimal action value is

V* = max_a Q(a)    (5.2)

The Regret at time t is the expected loss of all action values w.r.t to the optimal value

ℓt = E[V ∗ − Q(at )] (5.3)

and the total regret up to time T is


" #
T
L=E ∑ (V ∗ − Q(at )) (5.4)
t=1

We sometimes refer to V ∗ − Q(a) as the gap, and denote it with ∆a .


Notice how when we maximize the rewards the action value Q(a) increases, getting closer to V ∗ . In
other words - maximizing the rewards ↔ minimizing the regret. With this intuition in mind, our

goal would be to perform some iterative optimization process in which, at every iteration, we
examine some πi and hope it will be as close to π ∗ as possible.
Next, we define the count Nt (a), which is the number of times action a is selected by the t’th time
step, and use it to reformulate the total regret
" #
T
L=E ∑ (V ∗ − Q(at ))
t=1
= ∑ E[Nt (a)](V ∗ − Q(a)) (5.5)
a∈A
= ∑ E[Nt (a)]∆a
a∈A

Let’s observe the regret value under various algorithms:


• A greedy algorithm that always chooses argmax_a Q̂_t(a): as we do not explore the environment, we may lock onto a sub-optimal action. This means that the per-step gap stays constant, so as T increases we sum over more constant terms (V* − Q(a_t) ≈ const), i.e. the regret increases linearly over time.
• An ε-greedy algorithm: also leads to linear growth of the regret, but with a smaller slope than the purely greedy algorithm, since exploration lets the value estimates improve; the constant exploration probability ε, however, still incurs a constant expected regret at every step.
• Decaying ε-greedy: over time we control the trade-off between exploration and exploitation; with an appropriate decay schedule, the gap between Q(a) and V* is closed quickly enough that the total regret grows only logarithmically in T.

5.2 Imitation learning


In some cases, the task that is to be solved is "too difficult" for a model to learn from scratch, so
we’d like to convert it to a series of prediction problems instead, based on some input provided by
an expect demonstrator. This means that our problem is decomposed to a set of actions that are to
be learned in a supervised fashion, which may simplify the task and improve overall results. For
example, if we are interested to teach a model to fly a helicopter, we will provide it with features that
represent a set of flights performed by a human (videos/control parameters over time/etc..) and expect
our model to learn the observed behaviour. Notice that this approach comes with problems, and the
most prominent of those is the fact that it is difficult to learn failure recovery. More precisely, our
model is only as good as the expect, which means that new states (that the expert did not encounter)
will be difficult to recover from. In a sense, we want to train our model on all possible states, but as
this is not feasible, we must consider which states are more relevant than others.
In the following section, we use these notations:
1. T - the task's horizon
2. d_π^t - the distribution of states at time t, induced by following π for the first t − 1 steps
3. d_π - the average state distribution, (1/T) ∑_{t=1}^T d_π^t
4. C(s, a) - the expected immediate cost of performing action a in state s (can be thought of as the opposite of the reward)
5. C_π(s) - the expected immediate cost of the policy from state s, E_{a∼π(s)}[C(s, a)]
6. J(π) - the total cost of the policy, J(π) = ∑_{t=1}^T E_{s∼d_π^t}[C_π(s)] = T · E_{s∼d_π}[C_π(s)]
7. ℓ(s, π) - the observed surrogate loss, which is the gap of π compared to π* given state s

5.2.1 Apprenticeship Learning


In the most basic form of imitation learning, we can perform this hand-wavy approach:

1. watch an expert perform a task and record state-action trajectories


2. use those trajectories to learn the model dynamics (the transitions matrix P ∈ M |S|×|A| )
3. use some RL approach to find a near-optimal policy
4. if the policy is good enough - finish. otherwise, start over again
Apprenticeship learning is completely greedy, as there is no exploration term - this makes it more suitable for cases where exploration is dangerous or costly. Also notice how we learn the dynamics of the environment (the transition function) jointly with the policy; ideally, we'd prefer to disentangle the two, as such a coupled mechanism is less likely to generalize to new states.

5.2.2 Supervised learning


As a first, naive solution, let's reduce the sequential trajectory to many decoupled supervised learning problems. With this framework in hand, our objective is to find a policy that minimizes the observed surrogate loss under the induced state distribution:

π̂ = argmin_{π∈Π} E_{s∼d_π}[ℓ(s, π)]    (5.6)
As this is a sequential learning problem, we are faced with non-i.i.d. samples (d_π depends on π, and the action taken now may affect the states in the rest of the sequence), hence the optimization is not as easy as it looks.
The problem with eq. 5.6 is that in practice the policy is trained on states sampled from the expert's demonstrations, while at execution time it encounters states induced by its own (non-optimal) policy; the total cost can therefore diverge quadratically from the optimal policy's cost (namely J(π) ≤ J(π*) + T²ε, where ε is the per-step error rate). This is all related to our initial intuition - when the classifier makes a mistake, a new, unseen state is introduced to the agent. From that point on, the agent is likely to keep choosing wrong actions, and the error compounds.

5.2.3 Forward Training


The algorithm trains a non-stationary policy, meaning that at each iteration i we train a new policy π_i in the following manner:
• at the i'th iteration, we sample trajectories generated by the latest non-stationary policy π^{i−1}
• we ask our expert (π*) to provide the correct actions for the states encountered at step i of those trajectories
• we train a new classifier π_i on this data, keeping the policies of all other steps unchanged
• we use the resulting policy π^i to advance to the next iteration
More formally, we perform the following pseudo-code
More formally, we perform the following pseudo-code

Algorithm 12 Forward Training

Require: an initial non-stationary policy {π_i^0}_{i=1}^T
Require: expert policy π*
for i = 1, 2, ..., T do
    sample T-step trajectories by following π^{i−1}
    generate data D = {⟨s_i, π*(s_i)⟩} of states visited at step i and the actions taken by the expert
    train the policy π_i^i = argmin_{π∈Π} E_{s∼D}[e_π(s)]
    keep π_j^i = π_j^{i−1} for all j ≠ i
end for
return {π_i^T}_{i=1}^T

where eπ (s) = Ea∼π(s) [e(s, a)] and e(s, a) = 1[a ̸= π ∗ (s)]. Examining the bounds of the algorithm,
the regret is ≤ O(uT ε), where u is the diversion from the optimal policy (u ≤ T ).

As we train the algorithm on currently visited states, we can make better decisions and recover from
mistakes. But if T is large, the algorithm is generally impractical for real-life applications.

5.2.4 Dataset Aggregation (DAgger)


Compared to learning from a pre-defined set of expert demonstrations, which leads to compounding errors, DAgger aims to mitigate those errors by constantly letting the expert label the new experiences our agent encounters.

Algorithm 13 DAgger
Initialize dataset D ← ∅
Initialize random policy π̂_1 ∈ Π
for i = 1, 2, ..., N do
    sample a T-step trajectory using π̂_i
    get dataset D_i = {(s, π*(s))} of states visited by π̂_i and the actions given by the expert
    aggregate D ← D ∪ D_i
    train π̂_{i+1} on D
end for

Though it was shown to work well for both simple and complex problems (linear regret), one
disadvantage is that the expert must be available throughout the training process, which may not
always be the case. Furthermore, another shortcoming is when the learner’s policy is drastically
different from the expert’s policy. Think for example of a person who has just received his driver’s
license - if that person would observe the driving performance of a Formula 1 racer driving a track,
not much would have been processed and transferred to the new driver, and he will probably not be
able to reconstruct the experienced driver’s patterns. Obviously, to properly teach the new driver the
complicated set of skills, there has to be some learning curve, allowing for an increasing level of
complexity.

R In some versions of the DAgger algorithm, the policy used to collect trajectories is a convex combination of the expert's policy and the currently trained policy, i.e. π_i = β_i π* + (1 − β_i)π̂_i. Furthermore, the β_i term usually decays, so as to rely on the expert less over time. In terms of the pseudo-code in alg. 13, we can add the line π_i ← β_i π* + (1 − β_i)π̂_i right at the beginning of the for loop (and sample the trajectory with π_i).
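A schematic Python sketch of Alg. 13 including the β-mixing from the remark; expert_policy, train_classifier, and the environment interface are placeholder assumptions rather than any specific library's API.

```python
import random

def dagger(env, expert_policy, train_classifier, n_iters=10, horizon=100, beta0=1.0):
    """Schematic DAgger loop. `train_classifier(D)` fits a policy on a list of
    (state, expert_action) pairs and returns a callable state -> action."""
    D = []
    learned = None                                    # pi_hat_1 (no data yet)
    for i in range(1, n_iters + 1):
        beta = beta0 * (0.5 ** (i - 1))               # decaying mixing coefficient
        s, _ = env.reset()
        for _ in range(horizon):
            # mixed policy: follow the expert w.p. beta, otherwise the learner
            if learned is None or random.random() < beta:
                a = expert_policy(s)
            else:
                a = learned(s)
            D.append((s, expert_policy(s)))           # expert labels every visited state
            s, _, terminated, truncated, _ = env.step(a)
            if terminated or truncated:
                break
        learned = train_classifier(D)                 # pi_hat_{i+1} trained on aggregated D
    return learned
```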

5.2.5 DAgger with coaching


Intuitively speaking, we train our policy by presenting it with examples that are increasingly
challenging. Formally we define the hope action

π̃_i(s) = argmax_{a∈A} [λ_i · score_{π_i}(s, a) − L(s, a)]    (5.7)

where
• λ_i ≥ 0 specifies how close the coach is to the expert
• score_{π_i}(s, a) is the likelihood of our agent choosing action a in state s
• L(s, a) is the immediate cost
As the oracle's action choices are minimizers of the cost L, when our policy's actions result in a small L(s, a) for a given s, we understand that the actions that were chosen are close to the oracle's actions. Furthermore, if the score of an action, score_{π_i}(s, a), is high, it is more likely to be chosen by our current policy π_i. We can use this intuition to formulate the following table:

 cost↓ / score→   Low                                   High
 Low              a is unlikely but near-optimal        a is likely and near-optimal
                  (score − L ≈ 0)                       (score − L ≫ 1)
 High             a is unlikely and far from optimal    a is likely and far from optimal
                  (score − L ≪ 1)                       (score − L ≈ 0)

As we wish to maximize the difference (score − L), we tend towards likely actions that are also near-optimal, and we control λ_i to make sure the chosen actions remain likely enough under the current policy.
When adding this functionality to our DAgger implementation (alg. 13), we only need to change the labeling policy from π* to π̃.
6. Multi-Arm Bandit

We’d like to cover the concept of RL in the most simplified setting, which does not involve learning
to act in more than one situation and avoids some complexities introduced in other RL problems.
Imagine a row of K slot machines ("arms) with unknown and variable payoffs. A player must
choose which machines to play given a finite time horizon H to maximize his profits. Our player
must balance exploration and exploitation, he should also understand the behavior of multiple arms,
but also focus on those that provide a better reward (or minimal regret).

R Before moving on, we define the expected cumulative regret E[Reg_n] as the difference between the optimal expected cumulative reward and the expected cumulative reward of our strategy at time n. If the optimal reward at every time step is R*, after n steps we can write

E[Reg_n] = nR* − ∑_{i=1}^n E[r_i]    (6.1)

6.1 Basic bandit algorithms


In the most basic Bandit algorithm, we initialize the value of an action Q(a) and the number of times
we performed an action N(a), and repeat the following process:
• choose an ε-greedy action (random with probability ε, otherwise argmax_a Q(a))
• obtain a reward R by taking action a (pulling the a'th bandit's lever, if you will)
• update N(a) ← N(a) + 1
• update the value Q(a) ← Q(a) + (1/N(a)) [R − Q(a)]
The update rule says that the estimated value of action a should be moved towards the newly observed reward. The weight given to the new observation is inversely proportional to the number of times action a has been taken so far; in other words, the more times action a has been taken, the less a single new reward moves the estimate, and Q(a) converges to the sample mean of the observed rewards. Notice also how the update rule resembles gradient ascent - we take the old estimate of the value and update it in proportion to its difference from the target reward, up to some step size.


We can also weigh the update rule with some generic step size term α, to add more versatility.
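The basic scheme above, including the optional constant step size α, might look as follows (the Bernoulli arm payoffs are a made-up example):

```python
import numpy as np

def epsilon_greedy_bandit(true_probs, steps=10_000, eps=0.1, alpha=None):
    k = len(true_probs)
    Q = np.zeros(k)            # estimated action values
    N = np.zeros(k)            # visit counts
    rng = np.random.default_rng(0)
    for _ in range(steps):
        # epsilon-greedy action selection
        a = rng.integers(k) if rng.random() < eps else int(np.argmax(Q))
        R = float(rng.random() < true_probs[a])      # Bernoulli reward
        N[a] += 1
        step = alpha if alpha is not None else 1.0 / N[a]
        Q[a] += step * (R - Q[a])                    # incremental update
    return Q, N

# example: three arms with unknown payoff probabilities
Q, N = epsilon_greedy_bandit([0.2, 0.5, 0.7])
```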
We can improve on the naive action exploration by exploring actions according to their potential of actually being optimal. Formally, we can define

a_t = argmax_{a∈A} [ Q_t(a) + c √( ln t / N_t(a) ) ]    (6.2)

This means that we do not choose the greedy action based solely on which action maximizes Q_t(a), but also based on the visit rate - as above, if N_t(a) increases, we are less likely to choose that action. We also include a √(ln t) term as a measure of elapsed time: if N_t(a) is large but ln t is as well, the bonus does not shrink as much, indicating that it is reasonable to have chosen an action many times if much time has passed since the beginning.
Eq. 6.2 thus replaces the ε-greedy random choice, which decouples exploration and exploitation, with a single term that governs both in a more elegant way.

R The above can be packaged into an algorithm called the Upper Confidence Bound (UCB), named after the fact that the bracketed term acts as an upper confidence bound on the true action value; with it, the expected number of pulls of sub-optimal arms (and hence the regret) can be bounded from above.
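As a sketch, replacing the ε-greedy choice with the rule of eq. 6.2 amounts to a single selection function:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick an action according to eq. 6.2; untried actions are chosen first."""
    N = np.asarray(N, dtype=float)
    if np.any(N == 0):
        return int(np.argmin(N))            # try every arm at least once
    bonus = c * np.sqrt(np.log(t) / N)
    return int(np.argmax(Q + bonus))
```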

Some problems with UCB and other stationary algorithms:
• Non-stationary problems, where the expected return of every arm can change over time, are not supported, as we cannot really rely on what has already happened.
• If the state/action space is too large, it is infeasible to learn.

6.2 Advanced bandit algorithms


Let’s briefly go over some of the more advanced methods

6.2.1 Gradient bandit algorithms


Again we come to the conclusion that a parameterized approach to action selection is better suited to large state/action spaces and generalizes to more problem setups. Here, instead of estimating action values directly, we learn a preference for each action.
Let us define the numerical preference for action a at time t as H_t(a), where a large H_t(a) means that a is more likely to be taken. Our goal is to learn H_t(a) using iterative stochastic gradient ascent, where for the action a_t that was taken

H_{t+1}(a_t) = H_t(a_t) + α (R_t − R̄_t)(1 − π_t(a_t))    (6.3)

and for every other action a ≠ a_t

H_{t+1}(a) = H_t(a) − α (R_t − R̄_t) π_t(a)

with H_0(a) = 0, R̄_t the average of the rewards observed so far, and

π_t(a) = Pr[a_t = a] = softmax(H_t(a)) = exp(H_t(a)) / ∑_{a_i∈A} exp(H_t(a_i))    (6.4)

6.2.2 Contextual Bandits


We would like to take into account the context in which the action sampling is being made. By doing
so, we are essentially modeling the state in our environment, thus getting closer to the initial setting
of reinforcement learning. We can write a schematic iterative process as follows - in round t:

• Observe user ut and a set A of arms, alongside their features (context) xt,a
• based on reward from previous iterations, choose an arm a ∈ A and receive a reward rt,a
• improve the arm selection strategy with each observation of ⟨xt,a , a, rt,a ⟩. Notice here that
different rewards from the same arm are possible, as many contexts are available.
For example, if we were to implement a recommendation system, we could ideally identify the
context of our users (some tabular traits for example) and use those to provide a selection that is
more specific.
In some cases, we can model the relation between the reward and the context in a linear fashion, which is exactly what the Linear UCB (LinUCB) algorithm does. Let us define the expected reward conditioned on the context as

E[rt,a |xt,a ] = (xt,a )T θ ∗ (6.5)

where θ ∗ is the unknown coefficient vector that we aim to learn. Our goal is to minimize the regret,
which can be defined as the expected difference between the observed reward for the best arms and
the expected reward for the selected arms
" #
T
Rt (T ) = E ∑ rt,a ∗
t
− rt,at (6.6)
t=1

where a_t* = argmax_{a∈A} x_{t,a}^T θ* is the best action at step t according to θ*, and a_t is the action actually selected at step t. We then minimize the regret using a linear-regression-style minimization (the mathematical formalism is omitted here).
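For illustration, a minimal sketch of the disjoint LinUCB selection/update step, where each arm keeps its own ridge-regression statistics; the class and variable names are ours, not taken from the original paper's code.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one linear model per arm, selected by an upper confidence bound."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]      # A_a = I + sum x x^T
        self.b = [np.zeros(dim) for _ in range(n_arms)]    # b_a = sum r x

    def select(self, contexts):
        """contexts: list of feature vectors x_{t,a}, one per arm."""
        scores = []
        for a, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                      # ridge-regression estimate
            ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(ucb)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```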

6.2.3 Thompson Sampling


We assume that each arm's reward is sampled from a fixed but unknown distribution. Over time, we build an estimate of those distributions by sampling the arms.
let’s say that the agent is trying to choose between K different actions, and let’s call the probability
that action k is the best action pk . At each step, the agent will sample a value from the distribution
for each action and choose the action with the highest sample. The distribution for each action is
updated using Bayes’ rule, which states that the posterior probability (the updated probability after
observing new data) is equal to the prior probability (the initial probability before observing new
data) times the likelihood (the probability of observing the new data given the prior probability).
In this case, the prior probability is the current estimate for the probability that action k is the best
action, and the likelihood is the probability of observing the reward that the agent received for taking
action k. So, up to a normalizing constant, the updated probability for action k after observing a reward r is:

p_k ∝ p_k · P(r|p_k)    (6.7)

This process is repeated for each action at each step, with the distribution for each action being
updated based on the rewards that the agent receives. The advantage of using this approach is that it
allows the agent to balance exploration and exploitation, as it will choose actions that have a high
probability of being the best action based on the current information, while also trying out other
actions to learn more about them and improve its estimates.
More specifically, the algorithm assumes a Beta prior over each arm's success probability, in the following manner (a minimal code sketch follows below):
• for each arm i = 1, 2, ..., N set S_i = 0, F_i = 0 (we think of S_i as #successes and F_i as #failures)
• for every time t = 1, 2, ...
    – for each arm i = 1, 2, ..., N: sample θ_i(t) ∼ Beta(S_i + 1, F_i + 1); play the arm i(t) = argmax_i θ_i(t) and observe reward r_t
    – if r_t = 1 then S_{i(t)} += 1, else F_{i(t)} += 1
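A minimal NumPy sketch of the Beta-Bernoulli scheme just described, with made-up arm probabilities:

```python
import numpy as np

def thompson_sampling(true_probs, steps=10_000, seed=0):
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    S = np.zeros(k)                       # successes per arm
    F = np.zeros(k)                       # failures per arm
    for _ in range(steps):
        theta = rng.beta(S + 1, F + 1)    # one sample per arm from Beta(S_i+1, F_i+1)
        i = int(np.argmax(theta))         # play the arm with the largest sample
        r = rng.random() < true_probs[i]  # Bernoulli reward
        if r:
            S[i] += 1
        else:
            F[i] += 1
    return S, F

S, F = thompson_sampling([0.2, 0.5, 0.7])
```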

R As a reminder, the relation between the posterior and the prior is given by Bayes' rule:

posterior = (likelihood × prior) / evidence    ⟷    P(θ|X) = P(X|θ)P(θ) / ∑_{θ'} P(X|θ')P(θ')    (6.8)
7. Advanced RL use case - AlphaGo

Using some of the concepts from the previous chapters, we will explore the story of AlphaGo, a groundbreaking artificial intelligence program developed by Google DeepMind that made history in 2016 when it defeated Lee Sedol, a world-champion Go player, in a highly publicized five-game match. The victory was significant because Go is a complex and challenging board game that has long been considered a pinnacle of human intellect, and it was the first time that a computer program had been able to defeat a professional human player in the game. In this chapter, we will delve deeper into the techniques and strategies used by AlphaGo to achieve this impressive feat.
Before we can fully understand how AlphaGo was able to defeat a human champion at Go, it is
important to first understand the concept of Monte Carlo tree search, which was a key component
of the program’s strategy. In the following sections, we will discuss the basics of Monte Carlo tree
search and how it was used by AlphaGo to make informed decisions on the Go board.

7.1 Monte Carlo Tree Search (MCTS)


MCTS combines sampling and tree search. We use the vertices of a tree to represent states, and the edges to represent actions that take us to the next state. The choice of action at any given vertex is governed by the tree's policy π. At any given iteration, if a leaf node is reached, we expand the tree by sampling the environment. Formally, we can write MCTS in 4 steps:
• Selection: we navigate to a leaf using the upper confidence bound (UCB) formula (similar to eq. 6.2): π_UCB(s) = argmax_a [ Q(s, a) + c √( ln n(s) / n(s, a) ) ], where Q(s, a) is the value of the state-action pair, n(s) is the number of times state s was visited, n(s, a) is the number of times the state-action pair (s, a) was visited, and c is a calibration parameter. Remember that the first term is an exploitation term, whereas the other term is an exploration term.
• Expansion: if a leaf was reached, we either reached the end of the game (and then jump to back-propagation) or we are still mid-game, but in an unknown situation. In the latter case, we use another policy called the rollout policy (which is a simpler policy) to choose which node to move to, and add that node to our tree.

• Simulation (play-out): one simulated game is played from the leaf node reached in the previous step, and a reward is observed at the end of the game. The actions during the simulation step are either chosen randomly or given by the rollout policy.
• Back-up: the results of the game are back-propagated up the tree, updating the value of Q(s, a) for each of the nodes that participated in the game.
We emphasize the MCTS is only aware of the "rules" of the game, and hence might be outperformed
by other approaches that use tailor-made heuristics. Furthermore, the algorithm can be halted at any
given time, therefore training it for a fixed time is plausible.

Figure 7.1: MCTS Diagram

Some improvements were suggested over the years to enhance the basic functionality of MCTS, some of which are:
• Pruning policy - once an expanded node has had a sufficient number of simulations, we use a hand-crafted policy to determine whether it should remain in the tree or be removed
• Improving the value function - we define the value function as a single-layer model Q(s, a) = σ(φ(s, a)^T θ), where φ are binary features, θ are the weights, and σ is some activation/probability function such as the sigmoid or softmax. Furthermore, instead of randomly sampling a state in the expansion step, a few policies were tested:
    – an ε-greedy policy
    – a greedy policy with a noisy value function: π(s, a) = 1 ⟺ a = argmax_{a'∈A} [Q(s, a') + η(s, a')] (otherwise π(s, a) = 0)
    – a smoothed softmax π_τ(s, a) = softmax(Q(s, a)/τ)

7.2 RL for the game of Go


Go is a perfect-information game, meaning that all the information is available at any time step (unlike poker, for example). The optimal value function can in principle be computed recursively over a search tree of roughly b^d sequences, where b is the number of legal moves possible in each turn and d is the length of the game.

R In chess, for example, b ≈ 35 and d ≈ 80. In a game of Go, on the other hand, b ≈ 250 and
d ≈ 150

7.2.1 Alpha-Go
check out this blog post for more information.
The AlphaGo architecture consists of three networks: a supervised-learning (SL) policy network, an RL policy network, and an RL value network. In addition, a linear softmax model was used as a rollout policy. The state consisted of a few features such as stone color, turn, viable stone positions, etc. To make our arguments cleaner, we denote the parameterized policy that outputs an action given a state as p_θ(a|s).
The supervised learning policy network
The goal - recommend good moves by predicting those performed by Go grand-masters (similar to imitation learning). The policy network p_σ was trained on ∼30M positions from ∼160K games, and the data was augmented by rotating the board. The objective was maximizing the log-likelihood of taking the "human" action; more formally, the step we take is

∆σ ∝ (α/m) ∑_{k=1}^m ∂ log p_σ(a_k|s_k) / ∂σ    (7.1)

where m is the batch size, α is a step size, σ are the SL network's parameters, and log p_σ(a_k|s_k) is the log probability of taking the expert action a_k given the state s_k.
The rollout policy
During game simulation (discussed later), we need a fast way to narrow down the move options. Therefore, we create a rollout policy p_π, which is a simple linear softmax classifier. This policy is trained on expert moves in the same supervised fashion, but as it is a much simpler model it is much faster to evaluate.
The RL policy network
In the next phase, we let the policy network play the game of Go against itself, to improve the overall results. To do so, we start off with a duplicate of the SL policy network and call it the RL policy network p_ρ (with parameters ρ). We use REINFORCE with a baseline to improve p_ρ iteratively. The opponent that p_ρ plays against is a previous version of p_ρ itself (chosen at random), and the game goes on until it is finished. We denote the outcome of the game at time t as z_t, where z_t = 1 if p_ρ won and −1 otherwise. The update rule for p_ρ is

∆ρ ∝ z_t ∂ log p_ρ(a_t|s_t) / ∂ρ    (7.2)

Notice that the direction of the update is governed by sign(z_t).
The RL value network
In the last stage, we want to add the grand-master capability of evaluating board positions, so we train a deep network to estimate the value of the current position (which is 1 if that position leads to a win and −1 otherwise), and call it v_θ. To train the value network, which has a similar architecture to the RL policy network except that it outputs a single value, we again perform self-play using the RL policy network. We compute the MSE w.r.t the actual game outcome, meaning that the update rule is

∆θ ∝ (z − v_θ(s)) ∂v_θ(s) / ∂θ    (7.3)

Throughout the games, we collect board positions. More specifically, only one board position is collected from every game, since all board positions in a game lead to the same result (win or lose) and different positions from the same game are highly correlated.
Our last goal is to use the policy network and the value network so that they complement each other.

MCTS
We’d like to search for actions that translate to as many wins as possible, and we do so by examining
the four steps of MCTS:
• Selection: remember that we aim to choose actions greedily but also to explore. In terms of the exploitation part, we take advantage of our value network and define

Q(s, a) = (1/N(s, a)) ∑_{i=1}^n 1(s, a, i) V(s_i)    (7.4)

with V(s_i) = (1 − λ)v_θ(s_i) + λz_i. In words - we set the value of a state-action pair as the average (over its visits) of a convex combination of the value we assign to the state and of whether that state eventually led to a win; the convex combination accounts for the two factors together. In terms of the exploration part, we define

u(s, a) = p_σ(a|s) / (1 + N(s, a))    (7.5)

meaning that we normalize "how good it is to take action a" by the number of visits, as it
makes visited actions less likely to be chosen (which is the meaning of exploration)
Finally, we aggregate the two and choose the action (a small sketch of this selection rule appears right after this list)

a_t = argmax_a [ Q(s_t, a) + u(s_t, a) ]    (7.6)

• Expansion: we add more positions to the tree to reflect the moves we have tried. Every new node is initialized with N(s, a) = Q(s, a) = 0, an associated prior probability p_σ(a|s) (given by the SL policy network), and a value v_θ(s) (given by the RL value network)
• Simulation: we simulate the rest of the game using an MC rollout starting from the current leaf node. More formally, we sample actions from our rollout policy, a ∼ p_π. Recall that p_π is very fast, as many game roll-outs are necessary.
• Back-up: after the roll-out we know whether the game resulted in a win or a loss, so we can update Q. Over time, our Q estimates become good enough to choose good actions.
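The selection rule (eqs. 7.4-7.6) can be sketched as follows; the per-edge statistics dictionary is a schematic stand-in, not the actual AlphaGo data structures.

```python
import math

def select_action(edges, lam=0.5):
    """edges: dict action -> dict with keys 'N', 'W_v', 'W_z', 'prior', where
    W_v / W_z accumulate the value-network evaluations and rollout outcomes."""
    best_a, best_score = None, -math.inf
    for a, e in edges.items():
        if e['N'] > 0:
            v_bar = e['W_v'] / e['N']                  # mean value-network estimate
            z_bar = e['W_z'] / e['N']                  # mean rollout outcome
            Q = (1 - lam) * v_bar + lam * z_bar        # eq. 7.4 with V = (1-lam)v + lam*z
        else:
            Q = 0.0
        u = e['prior'] / (1 + e['N'])                  # eq. 7.5
        score = Q + u                                  # eq. 7.6
        if score > best_score:
            best_a, best_score = a, score
    return best_a
```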

7.2.2 Alpha-Zero
for more info, see HERE
In the next generation of AlphaGo, we only use self-play to learn, taking into consideration nothing
but the rules of the game (no expert demonstrations). Furthermore, the state’s representation is only
the stones on the board (with some history saved as well). Lastly, only one neural net was used. We
start off with a high level description:
• Self-Play: we create a training set by using self-play, where in each move the game state, the
search probabilities (from MCTS) and the winner are saved.
• Network optimization: sample a mini-batch from the training set (of the previous step) and train the current network on these board positions. The loss function has two terms:
    – a value term (how probable winning is from the given board state), which is compared to the actual outcome (win or lose) using MSE
    – a policy term over the action probabilities for each legal move, which is compared to the MCTS search probabilities
More specifically, we write that

ℓ = ∑_t [ (v_θ(s_t) − z_t)² − π_t · log p_θ(s_t) ]    (7.7)

and in some cases a regularization term, λ||θ||², was added to regularize the weights (a small sketch of this loss follows after this list)
• Evaluate network: play 400 games between the latest neural network and the current best neural network, where both networks use MCTS to select their moves. The network that wins 55% of the games or more is declared the new best network.
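A sketch of the loss in eq. 7.7 in NumPy, with the commonly used squared-L2 variant of the optional regularization term; all array names here are illustrative assumptions.

```python
import numpy as np

def alphazero_loss(v_pred, z, p_pred, pi_targets, params=None, lam=1e-4, eps=1e-12):
    """Eq. 7.7: value MSE + policy cross-entropy (+ optional L2 regularization).

    v_pred, z:          shape (batch,)       predicted value / game outcome (+-1)
    p_pred, pi_targets: shape (batch, |A|)   network move probs / MCTS visit distribution
    """
    value_term = np.mean((v_pred - z) ** 2)
    policy_term = -np.mean(np.sum(pi_targets * np.log(p_pred + eps), axis=1))
    reg = lam * sum(np.sum(w ** 2) for w in params) if params is not None else 0.0
    return value_term + policy_term + reg
```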
Similar to AlphaGo, each node contains a value V(s) that represents how likely the player is to win from the current state, and each edge contains the action value Q(s, a), the visit count N(s, a), and the prior probability P(s, a).
Next, we describe the four steps of MCTS:
• Selection: again the exploitation term is Q(s, a), but the exploration term is c · P(s, a) · √( ∑_{a'∈A} N(s, a') ) / (1 + N(s, a)), which is similar to what we had in AlphaGo, up to the factor √( ∑_{a'∈A} N(s, a') )
• Expansion + simulation: when a leaf is reached, all of its possible child states are initialized. Then, a single node is expanded and evaluated using Q(s, a). For this node we also calculate V(s) and immediately return; that is, no rollout is performed.
• Back-up: traverse up the tree, updating the values N += 1, V += v, Q = V/N
After around ∼1600 simulations, we select a move. In the test phase, we choose the move for which N is largest; in the training phase, we choose

π(a|s_0) = N(s_0, a)^{1/τ} / ∑_{a'∈A} N(s_0, a')^{1/τ}    (7.8)

7.2.3 Alpha-Zero in other domains


Some later projects attempted to apply AlphaZero's framework to the field of automatic pipeline generation, the most prominent being AlphaD3M. More specifically, instead of Go/Chess pieces, the basic unit was a pipeline primitive (say, a random forest, a neural net, etc.), the state was the task together with some metadata associated with it, the actions were editing the pipeline (replacing primitives, adding aggregated ones, etc.), and the reward was the pipeline's performance. To encode a pipeline we may perform the following abstract steps:
• encode dataset Di as metadata features f (Di )
• encode the task T j
• encode the current pipeline at time t by a vector St (St is a pipeline sequence)
• encode action fa (St ) so policy π maps ( f (Di ), T j , St ) → ( fa (S1 ), ..., fa (Sn ))
The general learning scheme is similar to Alpha-Zero, where the action choice is given using the same
exploitation + exploration term, and the loss function is a sum of both a cross-entropy term for the
probability function and an MSE term for the value function (with some additional regularization).
Another line of research is known as Neural architecture search, in which the idea is to automatically
learn a more efficient network architecture representation. One approach to do so is to learn a string
representation of the network’s architecture and train an RNN to generate the next architecture block

sequence. In the original paper, they defined the reward as the accuracy result over the validation set,
and used REINFORCE to optimize the objective.
8. Meta and Transfer Learning

In the following chapter, we distinguish between two similar yet different learning tasks
• Meta-learning (sometimes refers to as "Learning to learn"): is the process of learning how
to model our problem and use the generalized result over multiple sets of tasks with similar
setups. For example - we train a robotic arm with two joints to do an arbitrary task and then
use the same agent on a different robotic arm with three joints.
• Transfer learning: the process of learning some knowledge from one task and using it to improve the performance of a model on a different but related task. For example - we can use a pre-trained ResNet to achieve better image classifiers.
The main difference between the two is that meta-learning focuses on learning generalizable knowledge that can be applied to a wide range of tasks, while transfer learning focuses on transferring knowledge from one specific task to another related task.

8.1 Meta-Learning
Let us define a set of tasks {τ_1, ..., τ_n} where each τ_i is episodic (with length H_i) and defined by a set of states {s_t^i}_{t=0}^{H_i}, actions {a_t^i}_{t=0}^{H_i}, a loss L_i, and a transition distribution P_i. A meta-learner with parameters θ models the distribution π(a_t|s_1, ..., s_t; θ) with the objective of minimizing the expected loss over all tasks

min_θ E_{τ_i}[ ∑_{t=0}^{H_i} L_i(s_t, a_t) ]    (8.1)

where s_t ∼ P_i(s_t|s_{t−1}, a_{t−1}) and a_t ∼ π(a_t|s_1, ..., s_t; θ).


There are a few desired properties of a Meta-RL algorithm:
• consistency: the ability to continually improve as we get more data
• expressive: the ability to represent multiple types of tasks
• Structured exploration: the ability to explore the problem space efficiently, utilizing as few
samples as possible
• Efficient and off-policy: this way we can run the Meta algorithm on real-world problems

8.1.1 Memory-Augmented Networks


(Based on the paper Meta-Learning with Memory-Augmented Neural Networks )
In some cases, we wish to perform meta-learning when we are provided with only a small amount of data (or none at all). For example - in one/few-shot learning we are expected to predict a label given only a few tagged samples. To do so, we use two techniques:
• External memory module - we maintain a container of knowledge that is called upon when responding to challenging circumstances
• Label shuffling - labels are presented one time-step after their corresponding sample, which prevents the network from simply memorizing the mapping sample → label without generalizing. More specifically, we feed the network with (x_1, null), (x_2, y_1), ..., (x_{t+1}, y_t)
In addition, samples are shuffled across different datasets (samples from different datasets may appear in the same sequence). This encourages the network to use the memory module and to retrieve the relevant label once the corresponding sample is provided.

Figure 8.1: meta-learning with memory-augmented network diagram. Left (a): the (lagged) episodes
from various datasets are shuffled. Right (b): during the learning process, the sample is first saved to
the external memory and later retrieved when the relevant label is presented

We can also represent the pipeline as a block diagram: an external input is fed to a controller, which reads from and writes to a memory module through a read head and a write head, and produces the external output.

Our main question would be - which memories should we read? We use a similarity measure to generate a weight vector: given an input x_t, the controller (the network) produces a key k_t, which is then either stored in a row of a memory matrix M_t or used to retrieve a particular memory i from the i'th row M_t(i). When retrieving a memory, we use the cosine similarity

K(k_t, M_t(i)) = (k_t · M_t(i)) / (||k_t|| · ||M_t(i)||)    (8.2)
That is, we compare the current key k_t with the key stored in the i'th row of M_t. We use the similarities across all rows to produce the read-weight vector (with superscript r)

w_t^r(i) = softmax(K(k_t, M_t(i))) = exp(K(k_t, M_t(i))) / ∑_j exp(K(k_t, M_t(j)))    (8.3)

The read weights w_t^r(i) are then used to retrieve the read vector from memory:

r_t = ∑_i w_t^r(i) M_t(i)    (8.4)

Over time, new information is either written into rarely-used locations, preserving recently encoded information, or written to the last-used location, which acts as an update of the memory with newer, possibly more relevant information. The choice between these two options is made by interpolating between the previous read weights and weights derived from the usage weights w_t^u, which are updated as

w_t^u ← γ w_{t−1}^u + w_t^r + w_t^w    (8.5)

where γ is a decay parameter, w_t^r is computed as in eq. 8.3, and w_t^w is defined below. The least-used weights w_t^{lu}(i) are then set to 0 if w_t^u(i) > m(w_t^u, n) and to 1 otherwise, where m(w_t^u, n) denotes the n'th smallest element of w_t^u (and we set n to the number of reads to memory). This definition allows us to recursively define the write weights

w_t^w ← σ(α) w_{t−1}^r + (1 − σ(α)) w_{t−1}^{lu}    (8.6)

where σ is the sigmoid function and α is a hyper-parameter.


Eventually, we update the memory using

M_t(i) ← M_{t−1}(i) + w_t^w(i) k_t    (8.7)

for all i.
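Putting eqs. 8.2-8.7 together, a schematic NumPy version of one read/write step; the shapes, the gate parameter α, and the decision to use the previous step's weights are illustrative choices, not the original implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mann_read_write(M, k, w_u_prev, w_r_prev, alpha, gamma=0.99, n_reads=1):
    """One memory step: cosine-similarity read (eqs. 8.2-8.4),
    least-used write (eqs. 8.5-8.7)."""
    # read weights: softmax over cosine similarities between the key and every row
    sims = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-12)
    w_r = np.exp(sims - sims.max()); w_r /= w_r.sum()
    r = w_r @ M                                     # read vector (eq. 8.4)

    # least-used weights: 1 for the n_reads rows with the smallest previous usage
    w_lu = np.zeros_like(w_u_prev)
    w_lu[np.argsort(w_u_prev)[:n_reads]] = 1.0

    # write weights: interpolate previous read weights and least-used weights (eq. 8.6)
    g = sigmoid(alpha)
    w_w = g * w_r_prev + (1 - g) * w_lu

    # write to memory (eq. 8.7) and update the usage weights (eq. 8.5)
    M = M + np.outer(w_w, k)
    w_u = gamma * w_u_prev + w_r + w_w
    return M, r, w_r, w_w, w_u
```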

R we did not specify how to formulate the above as an RL task

8.1.2 Simple Neural Attentive Meta-Learner


The previous approach struggled to identify which historic events are most relevant to the current time step, as it was hard-coded to only account for a one-step lag. In a newer approach, attention layers and dilated temporal convolutions allow exactly that.
Dilated + Causal convolutions
Instead of a regular k × k kernel, we space the kernel's taps apart by a fixed dilation rate, hence increasing the filter's receptive field while using the same number of parameters as the original Conv layer.

Figure 8.2: Dilated Conv layer for various rates (strides)

Attention mechanism (in brief)


Attention mechanisms are a key component in many natural language processing and machine
learning models. They allow the model to focus on specific parts of the input, rather than processing
the entire input equally. Mathematically, an attention mechanism can be represented as a function
Attention(Q, K,V ), where Q, K, and V are matrices representing the query, key, and value, respec-
tively. The attention function returns a weighted sum of the values, where the weight for each value
is determined by the dot product of the corresponding query and key.
Attention(Q, K, V) = ∑_{i=1}^n ( dot(Q, K_i) / √d_k ) V_i    (8.8)

Here, n is the number of values, d_k is the dimension of the keys, and dot(Q, K_i) is the dot product of the query and the i'th key. The division by √d_k is included to scale the dot product, as it can become large for large values of d_k. In practice, the scaled dot products are usually passed through a softmax so that the weights sum to one.
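In code (including the softmax normalization used in practice), a single-head version might look like this:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n, d_k), V: (n, d_v) -> output of shape (n_q, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ V
```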
To add an attention mechanism to an LSTM, for example, we can modify the computation of
the output at each time step to incorporate the attention weights. Let’s consider an LSTM with
outputs o1 , o2 , . . . , oT , where T is the number of time steps. The output at each time step is typically
computed as a function of the previous output, the current input, and the current hidden state:

ot = g(ot−1 , xt , ht )
To incorporate an attention mechanism, we can modify this equation to include the attention
weights:

ot = g(ot−1 , xt , ht , Attention(ot−1 , K,V ))


Here, Attention(o_{t−1}, K, V) is the attention function described above, with Q set to the previous output o_{t−1}. The matrices K and V are typically learned during training.
This modified equation allows the output at each time step to be a weighted sum of the previous
output, the current input, and the current hidden state, with the weights determined by the attention
mechanism. This allows the model to focus on specific parts of the input when computing the output,
rather than processing the entire input equally.
It’s also possible to use the attention mechanism in conjunction with the hidden state of the
LSTM, rather than the output. In this case, the attention function would be applied to the hidden
state at each time step, rather than the output.

Furthermore, we use causal attention that makes sure no future input is used when calculating
the output of the current state (to avoid data leakage). This means that every output is generated by
looking over only previous samples.

The architecture

We use two blocks of temporal convolution layers (orange), interleaved with two causal attention layers (green). In reinforcement-learning settings, the model receives a sequence of observation-action-reward tuples (o_1, null, null), ..., (o_t, a_{t−1}, r_{t−1}). At each time t, it outputs a distribution over actions a_t based on the current observation o_t as well as previous observations, actions, and rewards. Furthermore, the internal state is preserved across episode boundaries, which allows the model to maintain a memory that spans multiple episodes. The observations also contain a binary input that indicates episode termination.
SNAIL achieves state-of-the-art performance by significant margins
on all of the most widely bench-marked meta-learning tasks in both su-
pervised and reinforcement learning, without relying on any application-
specific architectural components or algorithmic priors.

8.1.3 Model Agnostic Meta-Learning

If we write a generic learning process as θ ← θ − α∇_θ L_i^{train}(θ), a generalized approach over many tasks minimizes the objective ∑_{task i} L_i(θ − α∇_θ L_i^{train}(θ)). Intuitively, we take a step in the averaged direction given by the gradients of all tasks' losses, evaluated after a small task-specific adaptation step.
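A schematic NumPy sketch of one meta-update under these assumptions: each task exposes a gradient oracle grad_loss(task, theta), and we use the first-order approximation of MAML (the full version also differentiates through the inner adaptation step).

```python
import numpy as np

def maml_meta_step(theta, tasks, grad_loss, inner_lr=0.01, meta_lr=0.001):
    """First-order MAML: adapt to each task with one gradient step, then move
    theta in the averaged direction of the adapted tasks' gradients."""
    meta_grad = np.zeros_like(theta)
    for task in tasks:
        # inner loop: task-specific adaptation theta_i' = theta - alpha * grad L_i(theta)
        theta_i = theta - inner_lr * grad_loss(task, theta)
        # outer loop contribution: gradient of the task loss at the adapted parameters
        meta_grad += grad_loss(task, theta_i)
    return theta - meta_lr * meta_grad / len(tasks)
```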

Figure 8.3: MAML - update θ w.r.t the expected direction induced by the losses of all tasks

In the RL setting, we sample a batch of tasks, and for each task we sample a set of trajectories from our environment. Once a trajectory has been sampled, a loss gradient is computed based on the expected episodic return, followed by an update of that specific task's parameters. Once all tasks have been iterated through, the final (meta) update is performed w.r.t the mean loss over all tasks. We can compare MAML and SNAIL in terms of their properties:

                            SNAIL                                            MAML
 consistent                 as a heavy-duty model, it is less likely         easily adjusted given new data, as it
                            to improve with only new data                    is purely gradient-based
 expressive                 uses memory, therefore can obtain a              has no memory, so is generally less
                            deeper understanding of tasks                    expressive
 structured exploration     does not enforce a smart exploration             same as SNAIL
                            scheme, though still not very inefficient
 efficient and off-policy   is on-policy                                     same as SNAIL

R [Adaptations in real-life] We briefly discussed Adaptations in real-life, which is a meta-


learning process of adapting an agent to environmental changes, usually encountered in
real-life problems. To adapt our model to real life, we first make the assumption that any new
time step is potentially a new environment, with different dynamics though a common structure.
For each environment, we learn from a trajectory that lasts as long as the environment does and store the recent history in a database. The adaptation proceeds in a cyclic fashion - a model is learned for the current environment, and once a new environment is encountered, it is either acted upon using experience stored in the database or adapted to and then stored.

8.2 Transfer Learning


TL is, again, a process in which a model trained on one problem/dataset is applied to another. TL is mostly used when training multiple models from scratch is hard, and when we'd like to leverage knowledge across problems or domains. A few options are common:
• Classic approach: take a model trained on one task and re-train it on another task; another possibility is to freeze some of the weights.
• Diversity learning: pre-train the model for diversity by learning various solutions to a similar task.

8.2.1 Training the model for diversity


A key point in diversity learning is the fact that a stochastic policy is required, as a deterministic one
cannot easily consider multiple possible action choices.
