Deep Reinforcement Learning: Lecture Notes
372.2.5910
Ben-Gurion University of the Negev
Written by: Hadar Sharvit
Also available on GitHub
Contact me at: [email protected]
Based on: Lectures given by Gilad Katz
Chapters & book cover by: Rohit Choudhari/Unsplash
Contents
1 Hello world
1.1 Terminology
1.1.1 State & Observation
1.1.2 Action spaces
1.1.3 Policy
1.1.4 Trajectories
1.1.5 Reward
1.1.6 The goal of RL
1.1.7 Value function
1.1.8 The optimal Q-Function and the optimal action
1.1.9 Bellman Equations
1.1.10 Advantage function
1.2 Kinds of RL Algorithms
1.2.1 Model-Free vs Model-Based RL
1.2.2 What do we learn in RL
1.3 Intro to policy optimization
2 RL basics
2.1 Motivation
2.2 When to use RL?
2.3 Markov Decision Processes (MDP)
2.3.1 The Markov Property
2.4 Goals and Rewards
2.5 Policies, Value-function and Q-function
2.6 The Bellman equation
2.7 Policy Iteration
2.8 Value iteration
2.9 Monte-Carlo
2.9.1 Approximating Value-function
2.9.2 Approximating policies
2.10 On/Off-Policy methods
2.10.1 Importance sampling for Off-Policy methods
2.11 ε-Greedy Algorithms
2.12 Temporal Difference (TD) Learning
2.12.1 On-Policy TD Control: SARSA
2.12.2 Off-Policy TD Control: Q-Learning
4 Policy Gradients
4.1 The Policy-Gradient theorem
4.2 The REINFORCE Algorithm
4.2.1 REINFORCE with Baseline
4.3 Actor-Critic methods
4.3.1 One-Step AC algorithm
4.3.2 Asynchronous Advantage Actor-Critic (A3C)
5 Imitation Learning
5.1 The Regret
5.2 Imitation learning
5.2.1 Apprenticeship Learning
5.2.2 Supervised learning
5.2.3 Forward Training
5.2.4 Dataset Aggregation (DAgger)
5.2.5 DAgger with coaching
6 Multi-Arm Bandit
6.1 Basic bandit algorithms
6.2 Advanced bandits algorithm
6.2.1 Gradient bandit algorithms
6.2.2 Contextual Bandits
6.2.3 Thompson Sampling
R This chapter is provided as a preliminary, and is not part of the course. It is based on OpenAI’s
Spinning Up docs (For further references see here).
Reinforcement Learning (RL) is the study of agents and how they learn by trial and error. The
two main components of RL are the agent and the environment - The agent interacts with the
environment (also known as taking a "step") by seeing a (sometimes partial) observation of the
environment’s state, and then decides which action should be taken. The agent also perceives a reward
from the environment, which is essentially a number that tells the agent how good the state of the
world is, and the agent’s goal is to maximize the cumulative reward, called return.
(Diagram: the agent receives the state st and reward rt from the environment and sends the action at back to it.)
1.1 Terminology
Let's introduce some additional terminology.
1.1.3 Policy
Is a set of rules used by our agent to decide on the next action. It can be either deterministic
or stochastic, at ∼ π(·|st ). Under the scope of deep RL, the policy is a parameterized function,
i.e., it is a mapping with parameters θ that should be learned in some optimization process. A
deterministic Policy could be implemented, for example, using some basic MLP architecture. For a
stochastic policy, the two most common types are Categorical policy (for discrete action space) and
Diagonal-Gaussian policy (for continuous action space)
Categorical (stochastic) Policy
Is essentially a classifier, mapping discrete states to discrete actions. For example, you could build
a basic NN that takes in the observation and outputs action probabilities (after applying softmax).
Denoting the last layer as Pθ (s), we can treat the actions as indices so the log-likelihood for action a
is
log πθ (a|s) = log [Pθ (s)]a (1.1)
Given Pθ (s), we can also sample from the distribution (one can use PyTorch Categorical to sample
from a probability vector)
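As a concrete illustration, here is a minimal sketch of such a categorical policy in PyTorch; the two-layer MLP and the layer sizes are arbitrary choices for the example, not something prescribed by the notes.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class MLPCategoricalPolicy(nn.Module):
    """Maps an observation to a distribution over discrete actions."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),   # outputs logits; Categorical applies the softmax
        )

    def forward(self, obs):
        return Categorical(logits=self.net(obs))

policy = MLPCategoricalPolicy(obs_dim=4, n_actions=2)
obs = torch.randn(1, 4)           # a dummy observation
dist = policy(obs)
action = dist.sample()            # a ~ pi_theta(.|s)
log_prob = dist.log_prob(action)  # log pi_theta(a|s), as in eq. 1.1
```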
Diagonal-Gaussian (stochastic) Policy
Is a policy that can be implemented using a neural network that maps observations to mean actions,
under the assumption that the action probability space can be represented by some multivariate
Gaussian with diagonal covariance matrix, which can be represented in two ways
• we use log diag(Σ) = log σ which is not a function of the state s (σ is a vector of standalone
parameters)
• we use a NN that maps from s → log σθ (s)
We use log σ and not σ, as the log takes any value in (−∞, ∞), unlike σ which only takes values in
[0, ∞), making it harder to train.
Once the mean action µθ (s) and the std σθ (s) are obtained, the action is sampled as a = µθ (s) +
σθ (s) ⊗ z, where z ∼ N(0, 1) and ⊗ is element-wise multiplication (This is similar to VAEs).
The log-likelihood of a k-dimensional action a ∈ Rk for a diagonal-Gaussian with mean µθ and std
σθ can be simplified if we remember that when Σ is diagonal, a k-dimensional multivariate Gaussian's PDF is
equivalent to the product of k one-dimensional Gaussian PDFs, hence
log πθ (a|s) = log [ exp(− ½ (a − µ)ᵀ Σ⁻¹ (a − µ)) / √((2π)^k |Σ|) ] = ... = − ½ ( ∑_{i=1}^{k} [ (a_i − µ_i )²/σ_i² + 2 log σ_i ] + k log 2π )   (1.2)
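A hedged sketch of the corresponding diagonal-Gaussian policy, using the first option above (log σ as a standalone parameter vector that does not depend on the state); the architecture is again an arbitrary choice.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class MLPGaussianPolicy(nn.Module):
    """Maps an observation to the mean of a diagonal Gaussian over actions."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(-0.5 * torch.ones(act_dim))  # state-independent log(sigma)

    def forward(self, obs):
        mu = self.mu_net(obs)
        std = self.log_std.exp()
        return Normal(mu, std)  # diagonal covariance

policy = MLPGaussianPolicy(obs_dim=3, act_dim=2)
dist = policy(torch.randn(1, 3))
a = dist.rsample()                        # a = mu + sigma * z, as above
log_prob = dist.log_prob(a).sum(dim=-1)   # sum over dimensions, as in eq. 1.2
```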
1.1.4 Trajectories
We denote the trajectory as τ = (s0 , a0 , s1 , a1 , ...) where the first state s0 is randomly sampled from
some start-state distribution s0 ∼ ρ0 . A new state is obtained from the previous state and action in
either a stochastic or deterministic process.
1.1.5 Reward
The reward rt is some function of our states and action, and the goal of the agent is to maximize the
cumulative reward over some trajectory τ.
• Finite-horizon un-discounted return: R(τ) = ∑_{t=0}^{T} r_t
• Infinite-horizon discounted return: R(τ) = ∑_{t=0}^{∞} γ^t r_t , γ ∈ (0, 1)
Adding the discount term γ^t both keeps the infinite sum from diverging and manifests the
concept of "reward now > reward later".
P(τ|π) = ρ0 (s0 ) ∏_{t=0}^{T−1} P(st+1 |st , at ) · π(at |st )   (1.3)
where P(st+1 |st , at ) is the probability to reach st+1 from st when applying at , and π(at |st ) is the probability to choose the action at when in state st .
The expected return is by definition the sum of returns given all possible trajectories, weighted by
their probabilities
J(π) = ∫_τ P(τ|π)R(τ) = Eτ∼π [R(τ)]   (1.4)
• On-policy Action-Value function Qπ (s, a): if you start from s, take an action a (which may or
may not come from π) and only then act according to π, the expected reward is
Qπ (s, a) = Eτ∼π [R(τ)|s0 = s, a0 = a]
When finding a value function or an action-value function that maximizes the expected reward, we
scan various policies and extract V ∗ (s) = max_{π∈Π} V π (s) or Q∗ (s, a) = max_{π∈Π} Qπ (s, a).
We can also find a relation between V and Q:
V π (s) = Eτ∼π [R(τ)|s0 = s]
       = ∑_{τ∼π} Pr[R(τ)|s0 = s] R(τ)
       = ∑_{τ∼π} ∑_{a∼π} Pr[R(τ), a|s0 = s] R(τ)   [Total prob.]
       = ∑_{a∼π} Pr[a|s0 = s] ∑_{τ∼π} Pr[R(τ)|s0 = s, a0 = a] R(τ)   (1.6)
       = ∑_{a∼π} Pr[a|s0 = s] Eτ∼π [R(τ)|s0 = s, a0 = a]
       = ∑_{a∼π} Pr[a|s0 = s] Qπ (s, a)
       = Ea∼π [Qπ (s, a)]
where in the fourth line we used the fact that the joint probability of R(τ) and a is the same as summing
over all possible a and conditioning the probability Pr[R(τ)] on a. In terms of optimality, notice
that V ∗ (s) is the optimal value for a specific s over any a, while Q∗ (s, a) is the optimal
value for a specific s and a, so taking the max of Q∗ over all a is exactly V ∗ (s). Specifically,
V ∗ (s) = max_a Q∗ (s, a)   (1.7)
We also note that if there are many optimal actions, we may choose one randomly
Q∗ (s, a) = Es′ ∼P [ r(s, a) + γ max_{a′} Q∗ (s′ , a′ ) ]   (1.12)
By calculating the advantage, we ask if the Q function of some candidate action a (given some state
s) is larger than the average Q-function associated with the examination of all other actions taken by
our policy.
Under model-free RL
• Policy optimization: we parameterize the policy πθ (a|s) and find optimum w.r.t the return
J(πθ ). Such optimization is usually on-policy, meaning that the data used in the training
process is only data given while acting according to the most recent version of the policy.
In policy optimization we also find an approximator value function Vφ (s) ≈ V π (s). Some
examples are A2C,A3C,PPO.
• Q-Learning: approximate Qθ (s, a) ≈ Q∗ (s, a). Usually the objective is some form of the
bellman equation. Q-Learning is usually off-policy, meaning that we use data from any point
during training. Some examples are DQN, C51.
Compared to Q-Learning, which approximates Q∗ , policy optimization finds exactly what we wish for -
how to act optimally in the environment. Also, there are models that combine the two approaches,
such as DDPG, which learns both a Q function and an optimal policy.
Under model-based RL
Methods cannot be clustered as easily, though some of the (many) approaches include planning
techniques to select actions that are optimal w.r.t. the model.
1.3 Intro to policy optimization
In policy optimization we wish to maximize the expected return J(πθ ) by gradient ascent on θ . To
do so, we must find a numerical expression for the policy gradient. As J(πθ ) = Eτ∼πθ [R(τ)] =
∫_τ P(τ|θ )R(τ), we might as well write down a term for the probability of a trajectory
P(τ|θ ) = ρ0 (s0 ) ∏_{t=0}^{T} P(st+1 |st , at ) πθ (at |st )   (1.15)
Using the log-derivative trick, d/dx log x = 1/x, meaning that x · (d/dx) log x = 1 = (d/dx) x. Substituting
x ↔ P(τ|θ ) and d/dx ↔ ∇θ , we have that P(τ|θ )∇θ log P(τ|θ ) = ∇θ P(τ|θ ). We will use this later.
Now, let's expand the log term
log P(τ|θ ) = log ρ0 (s0 ) + ∑_{t=0}^{T} [log P(st+1 |st , at ) + log πθ (at |st )]   (1.16)
When deriving w.r.t θ , we are only left with the last term (the others only depend on the environment
and not our agent), hence
∇θ log P(τ|θ ) = ∇θ ∑_{t=0}^{T} log πθ (at |st ) = ∑_{t=0}^{T} ∇θ log πθ (at |st )   (1.17)
Notice the use of linearity in the second transition. Consequently, we re-write the expected return
using eq 1.17 -
∇θ J(πθ ) = ∇θ ∫_τ P(τ|θ )R(τ)
          = ∫_τ ∇θ P(τ|θ )R(τ)
          = ∫_τ P(τ|θ )∇θ log P(τ|θ )R(τ)   (1.18)
          = Eτ∼πθ [∇θ log P(τ|θ )R(τ)]
          = Eτ∼πθ [ ∑_{t=0}^{T} ∇θ log πθ (at |st )R(τ) ]
In the 3rd transition we used the log-derivative trick, and in the last transition we used the expression
from 1.17.
The last term is an expectation, hence it can be estimated using a sample mean - given a collected set D =
{τ1 , τ2 , ..., τN } of trajectories obtained by letting our agent act in the environment using πθ we can
write
∇θ J(πθ ) ≈ (1/|D|) ∑_{τ∈D} ∑_{t=0}^{T} ∇θ log πθ (at |st )R(τ)   (1.19)
R It should be stated that this "loss" term is not really a loss like we know from supervised
learning. First of all, it does not depend on a fixed data distribution - here, the data is sampled
from the most recent policy. More importantly, it does not measure performance! The only thing it
guarantees is that, evaluated at the current parameters, it has the negative gradient of the performance.
After this first step of gradient descent, there is no more connection to performance. This
means that minimizing this loss has no guarantee of improving the expected return. This should
serve as a warning for when we look at the loss going down and think that all is well - in policy
gradients, this intuition is wrong, and we should only look at the average return.
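To make the last point concrete, here is a minimal sketch of how the estimator of eq. 1.19 is often implemented: a "pseudo-loss" whose negative gradient is the estimated policy gradient, so that a standard optimizer performs gradient ascent on J(πθ ). The `policy` object is assumed to return a distribution with a `log_prob` method (as in the categorical-policy sketch above), and the trajectory collection itself is not shown.

```python
import torch

def policy_gradient_loss(policy, trajectories):
    """trajectories: list of (obs, acts, ret) tuples, where ret is the trajectory return R(tau).
    Returns a scalar whose negative gradient is the estimator of eq. 1.19."""
    per_traj = []
    for obs, acts, ret in trajectories:
        log_probs = policy(obs).log_prob(acts)     # log pi_theta(a_t|s_t) for every time step
        per_traj.append(-(log_probs.sum() * ret))  # minus sign, since optimizers minimize
    return torch.stack(per_traj).mean()            # average over the |D| collected trajectories
```

As the remark above stresses, the value of this quantity says nothing about how well the agent is doing; only the average return does.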
2. RL basics
2.1 Motivation
Current malware detection platforms often deploy an ensemble of detectors to increase overall
performance. This approach creates a lot of redundancy, as in most cases one detector is enough, and
it is of course computationally expensive and time consuming compared to a single detector.
We can come up with a simple improvement - query a subset of detectors and decide based on
their classification if more detectors are needed. If we view our approach through the lens of
classification, it may very well be the case that training a model w.r.t. every subset of detectors is
needed, as we cannot evaluate the performance of a subset of detectors without actually learning how
they perform. As this is computationally hard for a large number of detectors, this is not a preferred approach.
Instead, we can use RL:
Suppose we use four detectors, and our agent takes as input the vector [−1, −1, −1, −1] ∈ R4 ,
which is considered an initial state. The agent will choose a set of detectors/detector configurations,
and a classification measurement of either "malicious" or "benign" will be taken. The decisions
of the agent will be based on a reward mechanism that uses the values of TP, TN (correctly
classified the content as "malicious" or "benign") and FP, FN (incorrectly classified as "malicious" or
"benign"). We will "punish" using C(t), which is a function that depends on the time it took for the
detectors to run. We can see that regardless of how many detectors were used, if we are right - the
Exp. # TP TN FP FN
1 1 1 -C(t) -C(t)
2 10 10 -C(t) -C(t)
3 100 100 -C(t) -C(t)
Table 2.1: Three suggested reward mechanisms for a malware detection platform
reward is constant (experiment with 1, 10, 100). On the other hand, if we were wrong, we subtract
C(t), which increases with the time t that has passed. As it is now "painful" to use more detectors,
the reward incentivizes our model to only use more detectors if that addition translates to a higher
success rate. As our model efficiently scans through the state space, it is able to outperform (at least
conceptually) the suggested "check-all" classification approach that was previously introduced.
The expected reward for state-action pairs, namely, what should we anticipate (in terms of reward)
when performing the action a from the state s is
r(s, a) ≡ E[rt+1 |st = s, at = a] = ∑_{r∈R} r ∑_{s′ ∈S} Pr[s′ , r|s, a]   (2.4)
where notice that the probability for a specific reward is the sum over all states, given the specific r
(hence the sum over s′ ∈ S).
We can also phrase our reward in terms of state-action-next state triplets, namely, what should we
anticipate (in terms of reward) when performing the action a that takes us from state s to state s′
r(s, a, s′ ) ≡ E[rt+1 |st = s, at = a, st+1 = s′ ] = ∑_{r∈R} r · Pr[s′ , r|s, a] / Pr[s′ |s, a]   (2.5)
Where we can think of the probability fraction as the number of events that reach s′ (from s after
performing a) and provide reward r, out of all the events that reach s′ (from s after performing a)
given any reward.
MDPs are very flexible
• both states and actions could be either abstract (s="sad", s="happy", a="take a nap") or
well-defined (like s=sensor readings, or a=turn on a switch).
• the time intervals may not be constant (some transitions are slow while others are fast)
• the setting of an MDP does not need to be an exact copy of the real-world model. For example,
a set of sensors may be enough to describe a robotic arm, even though there are many more
aspects that the arm is made up of (that are not as relevant).
R At this point it may be helpful to look at what is known as the "Backup Diagram", which
describes how the states are propagated based on the actions chosen by π and the probabilities
induced by the environment.
(Diagram: from state s, π(a1 |s) and π(a2 |s) lead to actions a1 and a2 ; then Pr(s2 |s, a1 ), Pr(s3 |s, a1 ), Pr(s4 |s, a2 ), Pr(s5 |s, a2 ) lead to the next states s2 , s3 , s4 , s5 .)
We can write, for example, the probability to transition from s to s5 by performing the action
a2 as P(s5 |s, a2 ) = π(a2 |s)Pr(s5 |s, a2 ). In general terms, the probability to move to any state
by performing any action is obtained by taking some action a, summing all probabilities of
states s′ reachable from s using a, and finally summing over all such actions
Pr(reach any state using any action|s) = ∑_{a∈A} π(a|s) ∑_{s′ ∈S} Pr(s′ |s, a)   (2.6)
Equivalently, we can write the probability to reach any state using any action and receiving any
reward
For an infinite horizon, we add a discount factor, as otherwise the infinite sum would result in an agent
that does not really care for the reward mechanism. As previously stated, the intuition here is reward
now > future reward.
Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... = ∑_{k=0}^{∞} γ^k Rt+k+1   (2.9)
where γ ∈ (0, 1)
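As a small sketch of eq. 2.9, the discounted return of a recorded finite episode can be computed with the recursion Gt = Rt+1 + γ Gt+1, scanning the rewards backwards:

```python
def discounted_returns(rewards, gamma=0.99):
    """rewards[k] is R_{t+k+1}; returns the list of G_t for every step of the episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for k in reversed(range(len(rewards))):
        g = rewards[k] + gamma * g   # G_t = R_{t+1} + gamma * G_{t+1}
        returns[k] = g
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```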
The goal of π is to maximize the value function, which is the cumulative expected return of following
π starting from some state s
Vπ (s) = Eπ [Gt |st = s] = Eπ [ ∑_{k=0}^{∞} γ^k Rt+k+1 | st = s ]   (2.11)
Notice that the expectation is w.r.t π, meaning that after the initial state st = s, the next states are
fully determined by π. We can think of Vπ as a measurement of "How good is π?", as intuitively, we
can choose the policy that provides us the maximal expected return.
We can also define the Q-function, which is the same as V except the fact that we start from s and
perform an initial action a (that may or may not be one of π’s options)
Qπ (s, a) = Eπ [Gt |st = s, at = a] = Eπ [ ∑_{k=0}^{∞} γ^k Rt+k+1 | st = s, at = a ]   (2.12)
Focusing on the second term, we will use the law of iterated expectation
E[Y |X = x] = E[E[Y |X = x, Z = z]|X = x]
with Y = Gt+1 , X = St , x = s, Z = St+1 and z = s′ , hence
In the 2nd transition we removed the inner condition for St = s as Gt+1 = Rt+2 + γRt+3 + ... does not
depend on St . This is the case as every reward term Rt+i is only a function of the current state and
action, so since Rt is not present, no term in Gt+1 is related to St (only to St+1 , St+2 ...). In the last
transition we use the fact that the inner E term is nothing but the value function for t ← t + 1.
Substituting everything into 2.15, we have
Rt+1 describes the reward obtained when moving from s to s′ using a, so it can be written as r(s, a, s′ ).
Furthermore, the expectation Eπ is w.r.t. the states, actions and rewards induced by π (so the
probability associated with every term is the one introduced in 2.7), and by summing over r ∈ R we
also indicate the fact that the transition s → s′ could be rewarded with multiple different rewards
(more than one option is plausible).
■
The bellman optimality equation is the bellman equation for the optimal Value-function
1 https://stats.stackexchange.com/questions/243384/deriving-bellmans-equation-in-reinforcement-learning
The bellman optimality equation is the bellman equation for the optimal Q-function
2.9 Monte-Carlo
In cases where the dynamics Pr[s′ |s, a] and the reward Gt are unknown (the model-free setting, 1.2.1),
we can use a Monte-Carlo approach to sample the environment and come up with approximations
for the value-function and Q-function. To do so, one must make sure that the episodes are finite (the
number of transitions until termination is < ∞). Another important factor is how we should behave
when encountering the same state more than once, and there are two common variants
• First-Visit-Monte-Carlo (FVMC): estimates the return obtained only after the first visit to s
(ignore future visits to s)
• Every-Visit-Monte-Carlo (EVMC): estimates the average returns obtained after all visits to s
(average all rewards obtained from s in the episode)
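As a rough sketch (assuming an episodic, tabular setting and episodes recorded as (state, reward) pairs obtained by following π), first-visit MC prediction can look as follows; the data-collection step itself is not shown.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    """episodes: list of rollouts [(s_0, r_1), (s_1, r_2), ...] generated by following pi.
    Returns an estimate of V_pi(s) as the average first-visit return."""
    returns = defaultdict(list)
    for episode in episodes:
        # backwards pass: return that follows every time step
        g = 0.0
        future_returns = []
        for (_, r) in reversed(episode):
            g = r + gamma * g
            future_returns.append(g)
        future_returns.reverse()
        seen = set()
        for (s, _), g_t in zip(episode, future_returns):
            if s not in seen:          # first visit only (FVMC)
                seen.add(s)
                returns[s].append(g_t)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```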
2 The convergence of PE is the result of the Policy evaluation convergence theorem. See HERE for more info.
3 I highly recommend checking out THIS implementation, by Denny Britz.
• On-policy methods: attempt to evaluate or improve the policy that is being used to make
decisions. As stated before, if π does not attain all state-action pairs with probability> 0, we
will poorly explore the space.
• Off-policy methods: attempt to evaluate or improve a policy other than the one used to generate
the data (the one that selects actions).
We work with two distinct policies:
– The target policy π - the one that we wish to learn
– the behavior policy b - the one used to generate the data
While On-policy methods tend to be more data efficient, they require new samples with each change
of policy. Off-policy methods, on the other hand, are slower but more powerful and general, as they can be
used to learn from various sources (like from a human expert)
Importance sampling is a technique for estimating expected values under one distribution given
samples from another. It is performed by weighting returns according to the ratio of the probabilities
of a trajectory under the two policies.
Lets assume that the behavior policy b is stochastic and the target policy π is deterministic. This
means that the trajectories in the data (that were chosen by b) may be different than those chosen by
π, which begs the question of how to calculate the expected return? The solution would be to weigh
the return based on how it resembled the actual values returned by the target policy.
Consider the trajectory τ = {st , at , st+1 , at+1 , ..., sT }. The probability to obtain τ given the starting
state st and the actions at:T −1 ∼ π is
Pr[τ|st , at:T −1 ∼ π] = ∏_{k=t}^{T−1} π(ak |sk )Pr[sk+1 |sk , ak ]   (2.21)
Denoting the importance sampling for the time window [t,t + 1, ..., T − 1] as ρt:T −1 , we take the
relative probability of the trajectory for the target and behaviour policy
Notice that even though the probabilities P[s′ |s, a] may be unknown, they cancel out in 2.22. From
here, we can use ρt:T −1 and the return Gt of the behaviour policy b to obtain Vπ , as
For example, if some trajectory τ is twice as plausible under b than it is under π, the expected return
for π would be 1/2 (in expectation) the return under b, which can also be seen as ρ = 1/2.
From here, we can take the MC algorithm (that averages returns), provide it with episodes following
b but still estimate Vπ .
Let T (s) be all time steps at which state s was visited (over all episodes), and T (t) be the time of termination
after time t for a given episode. Then {Gt }t∈T (s) is the set of returns associated with s across all
episodes, and {ρt:T (t)−1 }t∈T (s) are the corresponding IS ratios. To estimate Vπ we can use
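In the ordinary importance-sampling form, this estimate is just the average of the ratio-weighted returns, Vπ (s) ≈ (1/|T (s)|) ∑_{t∈T (s)} ρt:T (t)−1 Gt ; the weighted variant divides by the sum of the ratios instead. A minimal sketch of both, assuming the returns and ratios for a given state have already been collected:

```python
def ordinary_is_estimate(returns, rhos):
    """returns[i] = G_t and rhos[i] = rho_{t:T(t)-1} for every visit t in T(s).
    Ordinary importance-sampling estimate of V_pi(s) from behaviour-policy data."""
    assert len(returns) == len(rhos)
    return sum(r * g for r, g in zip(rhos, returns)) / len(returns)

def weighted_is_estimate(returns, rhos):
    """Weighted variant: lower variance, but biased."""
    denom = sum(rhos)
    return sum(r * g for r, g in zip(rhos, returns)) / denom if denom > 0 else 0.0
```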
where Nt (a) is the number of times the action a was chosen, and then choose an action
Notice how these two may collide, as when we explore the environment we also may not always
choose an optimal action. In an ε-greedy approach, we explore with probability ε, and in all other
cases we choose the optimal action:
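A minimal sketch of this rule, assuming a tabular Q stored as a dictionary keyed by (state, action):

```python
import random

def epsilon_greedy(Q, state, actions, eps=0.1):
    """With probability eps explore uniformly, otherwise act greedily w.r.t. Q."""
    if random.random() < eps:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit
```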
When using MC we must make sure that the episodes are finite, and we can only learn from
complete episodes. In TD learning, on the other hand, both infinite environments and learning from
incomplete sequences are possible, as we can update our approximation after every step (compared to
after every episode in MC). We define the TD-error as the difference between the one-step target Vt∗
and the current prediction Vt :
This error is used as the general update rule, where we also add a learning rate α (much like in
gradient descent, where we can think of Vt+1 −Vt as the term ∇Vt )
R notice that we do not use an expectation term (as seen in policy iteration for example), as the
update rule is the result of looking only one step into the future, given some episode rollout.
Algorithm 6 SARSA
Require: α ∈ R
Q(s, a) ∈ R arbitrarily initialized for all s ∈ S and for all a ∈ A
for each episode E do ▷ sampling a trajectory like in MC
for each s ∈ E do
choose a from s given Q ▷ like in ε-greedy (2.11)
r, s′ ←take action a and observe r, s′ ▷ taking one step to the future
choose a′ from s′ given Q ▷ like in ε-greedy (2.11)
Q(s, a) ← Q(s, a) + α(r + γQ(s′ , a′ ) − Q(s, a))
s ← s′ , a ← a′
end for
end for
return Q
Notice that in the update rule we consider both the next state and the next action. Do notice that, as always, to approximate Q we
must find its optimal value for every s ∈ S, and a ∈ A - which is computationally difficult.
R the name SARSA stems from the idea that the update rule uses the quintuple ⟨st , at , rt+1 , st+1 , at+1 ⟩
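A hedged sketch of the inner loop of Alg. 6 for a tabular Q (e.g. a `collections.defaultdict(float)`); the `env.reset()`/`env.step()` interface is an assumption in the spirit of gym, not something fixed by the notes, and `epsilon_greedy` is the helper sketched in the ε-greedy section.

```python
def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.99, eps=0.1):
    """Runs a single episode of SARSA, updating the tabular Q in place."""
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, eps)
    done = False
    while not done:
        s_next, r, done = env.step(a)                     # take one step into the future
        a_next = epsilon_greedy(Q, s_next, actions, eps)  # commit to the next action as well
        # on-policy TD target: bootstrap from the action that will actually be taken
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next
```

Q-Learning (Alg. 7 below) differs only in the target: it bootstraps from max over a′ of Q[(s_next, a′)] and therefore does not need to commit to a_next in advance.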
Algorithm 7 Q-Learning
Require: α ∈ R
Q(s, a) ∈ R arbitrarily initialized for all s ∈ S and for all a ∈ A
for each episode E do ▷ sampling a trajectory like in MC
for each s ∈ E do
choose a from s given Q ▷ like in ε-greedy (2.11)
r, s′ ←take action a and observe r, s′ ▷ taking one step to the future
Q(s, a) ← Q(s, a) + α(r + γ maxa Q(s′ , a) − Q(s, a))
s ← s′
end for
end for
return Q
Also notice we did not track the next action a′ , as we did not use it (we scanned all a ∈ A
instead). It should also be stated that Q-Learning usually converges more quickly, due to the optimal
choice maxa Q(s′ , a). Having said that, as we do not take into consideration the next action a′ , using
ε-greedy actions might mean that we take a step into a state with very bad reward (like falling down
a cliff), simply because we are less conservative (due to exploration and not committing to a next action).
3. DQN & its derivatives
This section is based on sources I found online (because the lecture was not uploaded). Our new
goal would be to use a parameterized function approximator to represent the state-action Q-function,
instead of representing it with a table. In other words, we wish to find a function Q̂(s, a, θ ) ≈ Q(s, a),
where θ is our function’s parameters and Q(s, a) is the true function/oracle.
Let's start off with the ideal assumption, in which the oracle Q(s, a) is accessible. Our function
approximator could be learned using SGD, that is, by minimizing the squared loss J(θ ) w.r.t. the
oracle over batches sampled from our environment
∆θ = − ½ α∇θ J(θ )
    = − ½ α∇θ E[(Q(s, a) − Q̂(s, a, θ ))²]
    = − ½ α · 2E[(Q(s, a) − Q̂(s, a, θ )) · (−∇θ Q̂(s, a, θ ))]   (3.2)
    = αE[(Q(s, a) − Q̂(s, a, θ )) ∇θ Q̂(s, a, θ )]
As Q(s, a) is generally unknown, it must be replaced with some approximated target. Recall
that in SARSA for example, our target was based on the temporal difference r + γ Q̂(s′ , a′ , θ ). In
classic Q-learning, on the other hand, we used an off policy approach, in which our target was
r + γ maxa′ ∈A Q̂(s′ , a′ , θ ). As we now proceed to describe DQN, we will follow the update rule of
∆θ = αE[(r + γ max_{a′ ∈A} Q̂(s′ , a′ , θ ) − Q̂(s, a, θ )) ∇θ Q̂(s, a, θ )]   (3.3)
3.1.1 Architecture
In the original paper, the DQN architecture was based on a convolutional neural network. The
network takes in an input of shape 84 × 84 × 4 (a processed batch of images, which is considered the
state)1 , and propagates this input via three convolutional layers. To finally come up with an action
value, there are two fully connected layers, where the last one has a single output for every possible
action. Notice how our function approximates a value for every action given a state, hence it can be
written as Q̂(·, θ ) : R^{|S|} → R^{|A|} .
y_t^DQN = rt + γ max_{a′ ∈A} Q̂(st+1 , a′ , θ − )   (3.5)
Notice how we use both θ and θ − , where θ is the original parameters vector of the network but
θ − represents the parameters of the target network. By doing so we update θ based on values from
a previous version of our model. Furthermore, the batches we sample are provided from all past
transition tuples.
Experience Replay
We store the transitions ⟨st , at , rt , st+1 ⟩ := et at each time-step in a fixed buffer Dt = {e1 , e2 , ..., et },
and during SGD, we only sample uniformly from D. This is because
• Data efficiency: each experience ei can potentially be used in many updates
• De-correlation: randomly sampling leads to de-correlation between consecutive experiences.
As correlation breaks, expected values will fluctuate less, meaning that the variance in sampling
is reduced, hence stability is increased.
• Smoother convergence: As our training process is based on a large number of experiences,
outliers tend to average out with the rest of the samples, leading to less oscillations in the
training process.
Note that in a more sophisticated experience replay, we might weigh experiences based on their
importance and keep the relevant ones for longer.
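A minimal sketch of such a (uniform) replay buffer; the capacity and batch size are arbitrary choices:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions e_t = (s, a, r, s_next, done)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experiences are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # uniform, de-correlating sample
        return list(zip(*batch))                         # columns: (s, a, r, s_next, done)

    def __len__(self):
        return len(self.buffer)
```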
1 Originally, images were 210 × 160 × 3, but the preprocessing applied spatial dimension reduction to 84 × 84, extracted
the Y channel from the RGB and concatenated 4 consecutive images as "memory"
Target Network
Another stability improvement comes from the fact that we use two neural nets. Ideally, we would
like to minimize the effect of our targets being Non-Stationary, and we do so by setting a network
that is only updated after C ≫ 1 steps. This means that our non-stationary target will, in a sense, be
stationary for at least C steps, hence stabilizing our learning process even further.
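A hedged sketch of the target computation of eq. 3.5 together with the periodic synchronization of θ −; `q_net` and `target_net` are assumed to be two networks with the same architecture, and the (1 − done) factor is the usual way of not bootstrapping past terminal states.

```python
import torch

def dqn_targets(target_net, rewards, next_states, dones, gamma=0.99):
    """y_t = r_t + gamma * max_a' Q_hat(s_{t+1}, a', theta^-); no bootstrap when the episode ended."""
    with torch.no_grad():                                   # theta^- is never trained directly
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

def sync_target(q_net, target_net):
    """Every C steps, copy the online parameters theta into the target parameters theta^-."""
    target_net.load_state_dict(q_net.state_dict())
```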
3.2 Double Deep Q-Network (DDQN)
Consider, as an example, a group of people who all weigh exactly 80 kg, and suppose we measure their
weights using a weighing scale that is off by ±1 kg (equal probability to measure > 80 and < 80).
Let's run two sets of measurements:
• Denote the weight measurement of the i’th person as Xi , and set Y = maxi Xi . We can intuitively
understand that almost surely Y > 80, as almost surely there exists some j for which X j > 80.
This really tells us that the max operator is prone to overestimation when noise is introduced
to the system.
• As a second experiment, we will measure each person’s weight twice and store the values in
X1i , X2i . To estimate Y , we first calculate n = argmaxi X1i . Next, we take the second measurement
Y = X2n as our maximal value. Notice that as X2n is independent from X1n , it is equally likely to
overestimate the real value or underestimate it, hence it is not systematically over-optimistic.
So everyone weighs 80kg and X1n > 80 with high probability, but X2n is both > 80 and < 80
with even probability.
■
With the example above in mind, notice how in the DQN algorithm we both choose an action at
using Q̂(st , a, θ ) and evaluate it (when calculating yi ), which means that we tend to overestimate
the target values. To address this issue we replace the current update of yi (when not donei ) with
yi^DoubleDQN = ri + γ Q̂( si+1 , argmax_{a′ ∈A} Q̂(si+1 , a′ , θ ), θ − )   (3.6)
In other words, we choose an action (argmax) using a network with parameters θ (that is currently
being trained), but evaluate the action using a network with parameters θ − (that is not being trained)
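A small sketch of eq. 3.6: the online network (θ ) selects the greedy action, while the target network (θ −) evaluates it.

```python
import torch

def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)         # argmax with theta
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)   # evaluate with theta^-
    return rewards + gamma * (1.0 - dones) * next_q
```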
Figure 3.1: (Top) a standard DQN architecture, where the last layer is a vector representing
Q(s, a1 ), Q(s, a2 ), .... (Bottom) a Dueling DQN, where the upper branch represents the value of
V (s), the lower branch represents the values of A(s, a1 ), A(s, a2 ), ... and the final layer represents
Q(s, a1 ), Q(s, a2 ), .... As the input/output are the same, Alg. 8 can be applied given the Dueling
architecture as well.
choices. This may be unnecessary in cases where an action has little to no effect on the outcome.
For example - if you fell down a cliff, it is really irrelevant what are the action values of steering the
wheel. As we split our architecture to two distinct components, we can learn state values without
having to learn the effect of each action on those states - as those may be an unnecessary computation
for some states.
3.3.1 Implementation
Denoting the output of the upper branch as V (s, θ , θV ) and the output of the lower branch as
A(s, θ , θA ) where θ are shared parameters and θV , θA are distinct parameters for every branch, it may
seem reasonable to add these values to obtain Q(s, a, θ , θV , θA ). The immediate problem that arises
is that given such Q, we cannot recover the exact values of both A and V , as for example it may be
the case that Q = (A + x) + (V − x), and x can be chosen freely.
To distinguish A and V , the following formula was applied:
Q̂(s, a, θ , θA , θV ) = V (s, θ , θV ) + [ A(s, a, θ , θA ) − max_{a′ ∈Actions} A(s, a′ , θ , θA ) ]   (3.8)
This trick forces the Q value associated with the maximizing action to equal V (as for the maximizing
action, which is the action that is chosen, the square brackets zero out). This means that the upper
stream (V ) can be identified as the Q-function value of the greedy action, and the lower stream as the advantage
function.
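A sketch of a dueling head implementing eq. 3.8 (the shared torso is an arbitrary one-layer MLP here; the original paper also reports a variant that subtracts the mean advantage instead of the max, but the max form written above is used below):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())  # parameters theta
        self.value = nn.Linear(hidden, 1)                                   # parameters theta_V
        self.advantage = nn.Linear(hidden, n_actions)                       # parameters theta_A

    def forward(self, obs):
        h = self.shared(obs)
        v = self.value(h)          # V(s, theta, theta_V)
        a = self.advantage(h)      # A(s, ., theta, theta_A)
        # eq. 3.8: subtract the max advantage so that Q of the greedy action equals V
        return v + a - a.max(dim=1, keepdim=True).values
```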
4. Policy Gradients
The PG theorem states that the gradient of J(θ ) is proportional to the gradient of our policy,
weighted by the Q function over all states, and by the average number of times each state is visited
in a trajectory µ(s).
From here, we can generate an episode from our initialized policy function, calculate the return as a
sum of discounted (or undiscounted) rewards (remember that we assume a finite-episode setting) and
update the network parameters using the update rule from above.
Intuitively, Gt − V̂ (st ) tells us how good we did after time step t (Gt ) compared to what was
expected to be achieved for that same timestamp (V̂ (st )). If, for example, the performance was better
than expected, the log-likelihood term for the action that was used is now weighted based on that
performance, encouraging our agent to continue with those behaviors. If we anticipated the return
exactly, then the actions we chose are ideal, hence no more updates are required (Gt − b(st ) = 0).
Compare this to the previous method with no baseline, where we always multiplied by the actual
return Gt : unnecessary updates would have happened, which hinders convergence.
Figure 4.1: Actor-Critic diagram. Notice that w.r.t. the table above, both the actor and the critic are
estimated (i.e. parameterized)
The formalism of AC methods is similar to Policy iteration (recall alg. 1), as we alternate
between policy evaluation, where the value function is being estimated, and a policy improvement,
where given the evaluated policy we improve our policy. More specifically, the actor attempts to
improve the policy (using Policy gradients for example or argmax Q), and the critic evaluates the
current policy.
The above can be summarized in the following pseudo-code. Notice how, compared to MC control
(like in the original REINFORCE), the update rule is applied after every step, meaning we do not have
to simulate an entire episode. This makes the process less variable, allowing for faster convergence.
5. Imitation Learning
Imitation learning is a learning process that is based on data provided from an "expert" - a being
from which we understand how to perform some task/movement/etc..
Let us start with a new definition - the Regret, which is the difference between following the optimal
policy and some other policy.
R This part seems to use slightly different notations, but is hopefully still understandable
The Regret at time t is the expected loss of all action values w.r.t. the optimal value. Our
goal would be to perform some iterative optimization process in which, in every iteration, we
examine some πi and hope it will be as close to π ∗ as possible.
Next, we define the count Nt (a), which is the number of times action a is selected by the t’th time
step, and use it to reformulate the total regret
L = E[ ∑_{t=1}^{T} (V ∗ − Q(at )) ] = ∑_{a∈A} E[Nt (a)](V ∗ − Q(a)) = ∑_{a∈A} E[Nt (a)]∆a   (5.5)
where eπ (s) = Ea∼π(s) [e(s, a)] and e(s, a) = 1[a ̸= π ∗ (s)]. Examining the bounds of the algorithm,
the regret is ≤ O(uT ε), where u is the diversion from the optimal policy (u ≤ T ).
As we train the algorithm on currently visited states, we can make better decisions and recover from
mistakes. But if T is large, the algorithm is generally impractical for real-life applications.
Algorithm 13 DAgger
Initialize dataset D ← ∅
Initialize random policy π̂1 ∈ Π
for i = 1, 2, ..., N do
sample a T -step trajectory using πi
get dataset Di = {(s, π ∗ (s))} of states visited by πi and actions given by the expert
aggregate D ← D ∪ Di
train π̂i+1 on D
end for
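A schematic sketch of Alg. 13; `expert`, `rollout`, and `train_classifier` are placeholder callables (the expert policy, the environment interaction, and any supervised learner) that are not specified by the notes.

```python
def dagger(expert, rollout, train_classifier, n_iters=10, T=100):
    """Schematic DAgger loop: aggregate expert labels on states visited by the learner."""
    dataset = []                                      # D <- empty set
    policy = expert                                   # or a randomly initialized policy pi_hat_1
    for _ in range(n_iters):
        states = rollout(policy, T)                   # sample a T-step trajectory with the current policy
        dataset += [(s, expert(s)) for s in states]   # label visited states with expert actions
        policy = train_classifier(dataset)            # train pi_hat_{i+1} on the aggregated set D
    return policy
```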
Though it was shown to work well for both simple and complex problems (linear regret), one
disadvantage is that the expert must be available throughout the training process, which may not
always be the case. Furthermore, another shortcoming is when the learner’s policy is drastically
different from the expert’s policy. Think for example of a person who has just received his driver’s
license - if that person would observe the driving performance of a Formula 1 racer driving a track,
not much would have been processed and transferred to the new driver, and he will probably not be
able to reconstruct the experienced driver’s patterns. Obviously, to properly teach the new driver the
complicated set of skills, there has to be some learning curve, allowing for an increasing level of
complexity.
R In some versions of the DAgger algorithm, the policy used is a convex sum of the expert’s policy
and the current trained policy, i.e. πi = βi π ∗ + (1 − βi )π̂i . Furthermore, the βi term usually
decays, so as to account for the expert less over time. In terms of the pseudo-code in alg. 13, we
can add this line right at the beginning of the for loop.
5.2.5 DAgger with coaching
In DAgger with coaching, the expert’s labels are replaced by those of a "coach" π̃ , which picks, for every
visited state s, the action π̃(s) = argmax_a [λi · scoreπi (s, a) − L(s, a)], where
• λi ≥ 0 specifies how close the coach is to the expert
• scoreπi is the likelihood of our agent to choose an action a in state s
• L(s, a) is the immediate cost
As the oracle’s action choices are minimizers of the cost L, when our policy’s actions result in a
small L(s, a) for a given s, we understand that the actions that were chosen are close to the oracle’s
actions. Furthermore, if the score of an action scoreπi (s, a) is high, it is more likely to be chosen by
our current policy πi . We can use this intuition as follows:
As we wish to maximize the difference (score − L), we tend towards a likely action that is also
optimal, and we control λi to make sure that the actions are likely enough to be chosen.
When adding this functionality to our dagger implementation (alg. 13), we only need to change
the tagging of π ∗ to π̃
6. Multi-Arm Bandit
We’d like to cover the concept of RL in the most simplified setting, which does not involve learning
to act in more than one situation and avoids some complexities introduced in other RL problems.
Imagine a row of K slot machines ("arms") with unknown and variable payoffs. A player must
choose which machines to play, given a finite time horizon H, so as to maximize his profit. Our player
must balance exploration and exploitation: he should understand the behavior of multiple arms,
but also focus on those that provide a better reward (or minimal regret).
R Before moving on, we define the expected cumulative regret E[Regn ] to be the difference
between the optimal expected cumulative reward and the expected cumulative reward of our
strategy at time n. If the optimal reward at every time step is R∗ , after n steps we can write
E[Regn ] = nR∗ − ∑_{i=1}^{n} E[ri ]   (6.1)
This means that we not only choose the greedy action based on the action that maximizes Qt (a) over
all a ∈ A, but also based on the visit rate - as above, if Nt (a) increases, we are less likely to
choose the action. We also add a √(ln t) term, to indicate a measurement of time increment. If Nt (a) is
large but ln t is as well, the term will not be as small, indicating that it is reasonable to choose an action
many times if much time has passed since the beginning.
Eq. 6.2 helps us replace the ε-greedy random choice, which decouples exploration and exploitation,
with one single term that governs both in a more elegant way.
R The above can then be reformulated to an algorithm called The Upper Confidence Bound
(UCB), named after the fact that the expected number of pulls that was required to achieve an
optimal policy was bounded from above
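A minimal sketch of the UCB-style selection rule described above; the exploration constant c is an arbitrary choice, and an arm that has never been pulled is always tried first.

```python
import math

def ucb_select(Q, N, t, c=2.0):
    """Q[a]: current value estimate, N[a]: pull count, t: current time step (t >= 1)."""
    for a in Q:
        if N[a] == 0:
            return a                      # try every arm at least once
    return max(Q, key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))
```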
• Observe user ut and a set A of arms, alongside their features (context) xt,a
• based on reward from previous iterations, choose an arm a ∈ A and receive a reward rt,a
• improve the arm selection strategy with each observation of ⟨xt,a , a, rt,a ⟩. Notice here that
different rewards from the same arm are possible, as many contexts are available.
For example, if we were to implement a recommendation system, we could ideally identify the
context of our users (some tabular traits for example) and use those to provide a selection that is
more specific.
In some cases, we can model the relation between the reward and the context in a linear fashion, and
in The Linear UCB algorithm they did just that. Let us define the expected reward conditioned on
the context as
where θ ∗ is the unknown coefficient vector that we aim to learn. Our goal is to minimize the regret,
which can be defined as the expected difference between the observed reward for the best arms and
the expected reward for the selected arms
Rt (T ) = E[ ∑_{t=1}^{T} ( rt,a∗t − rt,at ) ]   (6.6)
pk = pk · P(r|pk ) (6.7)
This process is repeated for each action at each step, with the distribution for each action being
updated based on the rewards that the agent receives. The advantage of using this approach is that it
allows the agent to balance exploration and exploitation, as it will choose actions that have a high
probability of being the best action based on the current information, while also trying out other
actions to learn more about them and improve its estimates.
More specifically, in the algorithm, they assumed a Beta distribution in the following manner
• for each arm i = 1, 2, ..., N set Si = 0, Fi = 0 (we think of Si as #Success and Fi as #Failures)
• for every time t = 1, 2, ...
– for each arm i = 1, 2, ..., N: sample θi (t) ∼ Beta(Si + 1, Fi + 1), then play the arm i(t) = argmaxi θi (t)
and observe the reward rt .
– if rt = 1 then Si(t) += 1, else Fi(t) += 1
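A minimal sketch of this Beta-Bernoulli Thompson sampling loop; `pull` is an assumed callable that returns a 0/1 reward for the chosen arm.

```python
import random

def thompson_sampling(pull, n_arms, horizon=1000):
    S = [0] * n_arms   # successes per arm
    F = [0] * n_arms   # failures per arm
    for _ in range(horizon):
        # sample theta_i ~ Beta(S_i + 1, F_i + 1) and play the arm with the largest sample
        theta = [random.betavariate(S[i] + 1, F[i] + 1) for i in range(n_arms)]
        i = max(range(n_arms), key=lambda k: theta[k])
        r = pull(i)                 # observe a Bernoulli reward
        if r == 1:
            S[i] += 1
        else:
            F[i] += 1
    return S, F
```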
R as a reminder, we write the relation between the posterior and the prior based on the Bayes rule
7. Advanced RL use case - AlphaGo
Using some of the concepts from previous chapters, we will explore the story of AlphaGo, a
groundbreaking artificial intelligence program developed by Google Deep Mind that made history in
2016 when it defeated Lee Sedol, a world champion Go player, in a highly publicized five-game
match. The victory was significant because Go is a complex and challenging board game that has
long been considered a pinnacle of the human intellect, and it was the first time that a computer
program had been able to defeat a human professional player in the game. In this chapter, we will
delve deeper into the techniques and strategies used by AlphaGo to achieve this impressive feat.
Before we can fully understand how AlphaGo was able to defeat a human champion at Go, it is
important to first understand the concept of Monte Carlo tree search, which was a key component
of the program’s strategy. In the following sections, we will discuss the basics of Monte Carlo tree
search and how it was used by AlphaGo to make informed decisions on the Go board.
• simulation (play-out): one simulated game is played from the leaf node reached in the previous
step, and a reward is observed at the end of the game. The actions during the simulation step
are either randomly chosen or given using the rollout-policy
• back-up: the results of the game are back-propagated in the tree, updating the value of Q(s, a)
for each of the nodes that participated in the game.
We emphasize that MCTS is only aware of the "rules" of the game, and hence might be outperformed
by other approaches that use tailor-made heuristics. Furthermore, the algorithm can be halted at any
given time, therefore running it for a fixed time budget is plausible.
Some improvements were suggested over the years, to enhance the basic functionality of MCTS.
some of which are:
• Pruning policy - once an expanded node has had a sufficient number of simulations, we use a
hand-crafted policy to determine whether it should remain in the tree or be removed
• Improving the Value function - we define the value function as an MLP layer Q(s, a) =
σ (φ (s, a)T θ ) where φ are binary features, θ are the weights and σ is some activation probabil-
ity function like softmax. Furthermore, instead of randomly sampling a state in the expansion
step, a few policies were tested
– an ε-greedy policy
– a greedy policy with a noisy value function: π(s, a) = 1 ⇐⇒ a = argmax_{a′ ∈A} [Q(s, a′ ) + η(s, a′ )]
(otherwise π(s, a) = 0)
– a smoothed softmax πτ (s, a) = softmax(Q(s, a)/τ)
R In chess, for example, b ≈ 35 and d ≈ 80. In a game of Go, on the other hand, b ≈ 250 and
d ≈ 150
7.2.1 Alpha-Go
check out this blog post for more information.
The AlphaGo architecture consists of three networks: supervised learning policy network, RL policy
network, and RL value network. In addition, the linear softmax model was used as a rollout policy.
The state consisted of a few parameters such as stone color, turn, viable stone positions, etc. To make
our arguments cleaner, we denote the parameterized policy that outputs an action given a state as pθ (a|s).
The supervised learning policy network
The goal - recommend good moves by predicting those performed by Go grand-masters (similar to
imitation learning). The policy network was trained on ∼ 30M positions from ∼ 160K games, and
the data was augmented by rotating the board. The objective was maximizing the log-likelihood
of taking the "human" action. More formally, the step we take is
∆σ ∝ ∂ log pσ (a|s) / ∂ σ   (7.1)
where m is a batch size, α is a step size, θSL are the network parameters and log pθ [ak |sk ] is the log
probability of taking an expert action ak given the state sk
The rollout policy
During game simulation (discussed later), we need a fast approach to narrow down the moves options.
Therefore, we create a rollout policy π which is a simple linear softmax classifier. This policy is
trained exactly like the SL policy network, but as it is simpler it is much faster.
The RL policy network
In the next phase, we play the game of Go with the current policy network itself, to improve the
overall results. To do so, we start off with a duplicate of the SL policy network and call it the
RL policy network pρ (with parameters ρ). We use REINFORCE with a baseline to improve pρ
iteratively. The opponent pρ plays against is a previous version of pρ itself (chosen randomly), and
the game goes on until it is finished. We denote the output of the game at time t as zt , where zt = 1 if
pρ won and −1 otherwise. The update rule for pρ is
∆ρ ∝ ( ∂ log pρ (at |st ) / ∂ ρ ) · zt   (7.2)
where notice that the direction we move is governed by sign(zt )
The RL value network
In the last stage, we want to add the capabilities of the grand-master board position evaluation, so we
train a deep network to estimate the value of the current position (which is 1 if that position can be
translated to a win and −1 otherwise), and call it vπ . To train the value network, which has similar
architecture as the RL policy one except that it outputs a single value, we again perform self-play
using the RL policy network. We compute the MSE w.r.t. the actual game outcome, meaning that
the update rule is
∆θ ∝ ( ∂ vθ (s) / ∂ θ ) · (z − vθ (s))   (7.3)
Throughout the games, we collect board positions. More specifically, one board position is
collected for every game (as all board positions in a game lead to the same result - win or lose) as
different positions of the same game are highly correlated.
Our last goal is to use the policy network and the value network to complement each other
MCTS
We’d like to search for actions that translate to as many wins as possible, and we do so by examining
the four steps of MCTS:
• Selection: remember that we aim to choose actions that are greedy but also explore. In
terms of the exploitation part, we take advantage of our value function network and define
Q(s, a) = (1/N(s, a)) ∑_{i=1}^{n} 1(s, a, i)V (si )   (7.4)
with V (si ) = (1 − λ )vθ (si ) + λ zi . In words - we set the value of a state action pair as the
weighted sum (in terms of #visits) of the actual value we give to the state and whether that
state is translated to a win condition. We use a convex sum to account for the two factors
combined. In terms of the exploration part, we define
u(s, a) = pσ (a|s) / (1 + N(s, a))   (7.5)
meaning that we normalize "how good it is to take action a" by the number of visits, as it
makes visited actions less likely to be chosen (which is the meaning of exploration)
Finally, we aggregate the two and choose the action
• Expansion: we add more positions into the tree to reflect what moves we have tried. every new
node is initialized with a predefined value of N(s, a) = Q(s, a) = 0, an associated probability
pσ (a|s) (set by the SL policy network), and a value vπ (s) (set by the RL value network)
• Simulation: we simulate the rest of the game using MC rollout starting from the current leaf
node. More formally, we sample an action from our rollout policy, a ∼ pπ . Recall that pπ is very
fast, as many game roll-outs are necessary.
• back-up: after the roll-out we know if our game resulted in a win or a loss, so we can compute
Q. Over time, our Q estimate will be good enough to choose a good action.
7.2.2 Alpha-Zero
For more info, see HERE.
In the next generation of AlphaGo, we only use self-play to learn, taking into consideration nothing
but the rules of the game (no expert demonstrations). Furthermore, the state’s representation is only
the stones on the board (with some history saved as well). Lastly, only one neural net was used. We
start off with a high level description:
• Self-Play: we create a training set by using self-play, where in each move the game state, the
search probabilities (from MCTS) and the winner are saved.
• Network optimization: sample a mini batch from the training set (of the previous step) and
train the current network on these board positions. The loss function has two terms
– the value function (how probable is winning from the given board states), that is compared
to the actual win condition (win or lose) using MSE
– the action probabilities for each legal state
More specifically, we write that
and in some cases, a regularization term, λ ||θ ||, was added to normalize the weights
• evaluate network: play 400 games between the latest neural network and the current best
neural network, where both networks use MCTS to select their moves. The network that wins
55% or more is declared the new best network.
Similar to AlphaGo, each node contains a value V (s) that represents how likely the player is to win from
the current state, and each edge contains the action value Q(s, a), the visit count N(s, a) and the
probability to visit P(s, a).
Next, we describe the four steps of MCTS:
• Selection: again the exploitation term is Q(s, a), but the exploration term is
c · P(s, a) · √(∑_{a′ ∈A} N(s, a′ )) / (1 + N(s, a)),
which is similar to what we had in Alpha-Go, up to the factor √(∑_{a′ ∈A} N(s, a′ )) over all possible actions.
• Expansion + simulation: when a leaf is reached, all possible states are initialized. Then, a
single node is expanded and evaluated using Q(s, a). For this node we also calculate V (s) and
immediately return, that is no rollout is being performed.
• back-up - traverse up the tree, updating the values N += 1, V += v, Q = V /N
After around ∼ 1600 simulations, we select a move. In the test phase, we choose the node
for which N is largest. For the training phase, we choose
π(a|s0 ) = N(s0 , a)^{1/τ} / ∑_{a′ ∈A} N(s0 , a′ )^{1/τ}   (7.8)
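A small sketch of eq. 7.8, turning the root visit counts into move probabilities with temperature τ:

```python
def move_probabilities(visit_counts, tau=1.0):
    """visit_counts: dict action -> N(s0, a). Returns pi(a|s0) proportional to N^(1/tau)."""
    powered = {a: n ** (1.0 / tau) for a, n in visit_counts.items()}
    total = sum(powered.values())
    return {a: p / total for a, p in powered.items()}
```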
sequence. In the original paper, they defined the reward as the accuracy over the validation set,
and used REINFORCE to optimize the loss term.
8. Meta and Transfer Learning
In the following chapter, we distinguish between two similar yet different learning tasks
• Meta-learning (sometimes referred to as "Learning to learn"): is the process of learning how
to model our problem and use the generalized result over multiple sets of tasks with similar
setups. For example - we train a robotic arm with two joints to do an arbitrary task and then
use the same agent on a different robotic arm with three joints.
• Transfer learning: is the process of learning some knowledge from one task and using it to
improve the performance of a model on a different but related task. For example - we can use
a pre-trained ResNet to achieve better image classifiers.
The main difference between meta-learning and transfer learning is that meta-learning focuses on
learning generalizable knowledge that can be applied to a wide range of tasks, while transfer learning
focuses on transferring knowledge from one specific task to another related task.
8.1 Meta-Learning
Let us define a set of tasks {τ1 , ..., τn } where each τi is episodic (with length Hi ) and is defined by a
set of states {s_t^i}_{t=0}^{H_i} , actions {a_t^i}_{t=0}^{H_i} , a loss Li , and a transition distribution Pi . A meta-learner with
parameters θ models the distribution π(at |s1 , ..., st ; θ ) with the objective of minimizing the expected
loss over all tasks
min_θ Eτi [ ∑_{t=0}^{Hi} L(st , at ) ]   (8.1)
Figure 8.1: meta-learning with memory-augmented network diagram. Left (a): the (lagged) episodes
from various datasets are shuffled. Right (b): during the learning process, the sample is first saved to
the external memory and later retrieved when the relevant label is presented
We can also represent the pipeline using a block diagram in which a controller reads from and writes to an external Memory.
Our main question would be - which memories should we read? We use a similarity measure to
generate a weights vector - given an input xt , the controller (the network) produces a key kt which
is then stored in a row of a matrix Mt or used to retrieve the particular memory i from the i’th row
K(kt , Mt (i)) = (kt · Mt (i)) / (||kt || · ||Mt (i)||)   (8.2)
Meaning that we compare the current key kt with the key stored in the i’th row of Mt . We
use the similarity between all keys to produce the read-weight vector wtr (with superscript r for "read"),
from which the read value is retrieved as a weighted sum over the rows of memory.
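A small sketch of this read step: cosine similarities as in eq. 8.2, a softmax over them as the read weights, and the read value as the weighted sum of memory rows (the softmax and the weighted sum are the standard MANN choices, stated here as assumptions).

```python
import torch
import torch.nn.functional as F

def read_memory(key, memory):
    """key: (d,) controller key k_t; memory: (rows, d) matrix M_t.
    Returns the read weights w_t^r and the retrieved value."""
    sims = F.cosine_similarity(key.unsqueeze(0), memory, dim=1)  # K(k_t, M_t(i)) for every row i
    w_r = torch.softmax(sims, dim=0)                             # read-weight vector
    return w_r, w_r @ memory                                     # weighted sum of memory rows
```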
Over time, new information is written into rarely-used locations, preserving recently encoded
information, or it is written to the last used location, which can function as an update of the memory
with newer, possibly more relevant information. The distinction between these two options is
accomplished with interpolation between the previous read weights and weights scaled according to
usage weights wtu . Those are updated as
wtu ← γ wt−1^u + wtr + wtw   (8.5)
where γ is a decaying parameter, wtr is computed as in eq. 8.3, and wtw will be defined later. The
least-used weights wtlu (i) is then given as 0 if wtu (i) > m(wtu , n) or 1 otherwise, where m(wtu , n)
denotes the n’th smallest item in wtu (and we set n to the number of reads to memory). This definition
allows us to recursively define the write weights
wtw ← σ (α)wt−1^r + (1 − σ (α))wt−1^lu   (8.6)
and the memory itself is written as Mt (i) ← Mt−1 (i) + wtw (i)kt for all i.
ot = g(ot−1 , xt , ht )
To incorporate an attention mechanism, we can modify this equation to include the attention
weights:
Furthermore, we use causal attention that makes sure no future input is used when calculating
the output of the current state (to avoid data leakage). This means that every output is generated by
looking over only previous samples.
The architecture
If we write a generic learning process as θ ← θ − α∇θ Ltrain (θ ), a generalized approach over many
tasks minimizes the objective ∑_{task i} Li (θ − α∇θ Li^train (θ )). Intuitively, we take a step in the averaged
direction based on the gradients of all tasks’ losses.
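A schematic sketch of this objective for a single meta-update (second-order MAML on a flat parameter tensor); `task_loss(task, params)` is an assumed callable returning the loss of a task at the given parameters, and α, β are the inner and outer step sizes.

```python
import torch

def maml_step(theta, tasks, task_loss, alpha=0.01, beta=0.001):
    """theta: flat parameter tensor with requires_grad=True; tasks: list of task identifiers."""
    meta_grads = torch.zeros_like(theta)
    for task in tasks:
        # inner step: adapt theta to the task with one gradient step
        inner_loss = task_loss(task, theta)
        grad = torch.autograd.grad(inner_loss, theta, create_graph=True)[0]
        theta_adapted = theta - alpha * grad
        # outer step: evaluate the adapted parameters and accumulate the meta-gradient
        outer_loss = task_loss(task, theta_adapted)
        meta_grads += torch.autograd.grad(outer_loss, theta)[0]
    with torch.no_grad():
        theta -= beta * meta_grads / len(tasks)   # move in the averaged direction over tasks
    return theta
```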
Figure 8.3: MAML - update θ w.r.t. the expected direction induced by the losses of all tasks
Under the scope of RL, we will sample a batch of tasks, and for each task sample a set of
trajectories from our environment. Once a trajectory is sampled, a loss gradient is computed
based on the expected episodic return, followed by an update rule for the specific task’s parameters.
Once all tasks have been iterated through, the final update rule is performed w.r.t. the mean loss values
over all tasks. We can compare MAML and SNAIL in terms of their properties:
Property | SNAIL | MAML
consistent | As a heavy-duty model, less likely to improve with only new data | easily adjusted given new data as it is only gradient-based
expressive | uses memory, therefore can obtain a deeper understanding of tasks | has no memory, so is generally less expressive
structured exploration | does not enforce a smart exploration scheme, though still not very inefficient | same as SNAIL
efficient and off-policy | is on-policy | same as SNAIL