Lec 12
The way the agent chooses the action in each step is called a
policy.
We’ll consider two types:
Deterministic policy: A_t = π(S_t) for some function π : S → A
Stochastic policy: A_t ∼ π(· | S_t) for some function π : S → P(A). (Here, P(A) is the set of distributions over actions.)
With stochastic policies, the distribution over rollouts, or
trajectories, factorizes:
p(s_1, a_1, . . . , s_T, a_T) = p(s_1) π(a_1 | s_1) P(s_2 | s_1, a_1) π(a_2 | s_2) · · · P(s_T | s_{T−1}, a_{T−1}) π(a_T | s_T)
Note: the fact that policies need consider only the current state is
a powerful consequence of the Markov assumption and full
observability.
If the environment is partially observable, then the policy needs to
depend on the history of observations.
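As a concrete illustration of this factorization, here is a minimal sketch of sampling a rollout from a toy MDP under a stochastic policy (the transition table P, initial distribution, policy, and horizon below are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP: 3 states, 2 actions (all numbers are made up for illustration).
n_states, n_actions = 3, 2
p1 = np.array([1.0, 0.0, 0.0])                                    # initial distribution p(s_1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = distribution over next states
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s] = pi(. | s), a stochastic policy

def sample_rollout(T=5):
    """Sample s_1, a_1, ..., s_T, a_T according to p(s_1) prod_t pi(a_t | s_t) P(s_{t+1} | s_t, a_t)."""
    s = rng.choice(n_states, p=p1)
    trajectory = []
    for _ in range(T):
        a = rng.choice(n_actions, p=pi[s])   # A_t ~ pi(. | S_t): depends only on the current state
        trajectory.append((s, a))
        s = rng.choice(n_states, p=P[s, a])  # S_{t+1} ~ P(. | S_t, A_t)
    return trajectory

print(sample_rollout())
```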
MDPs: Rewards
In each time step, the agent receives a reward from a distribution that depends on the current state and action:
R_t ∼ R(· | S_t, A_t)
For simplicity, we will often treat the reward as a deterministic function of the state and action:
R_t = r(S_t, A_t)
Now that we’ve defined MDPs, let’s see how to find a policy that
achieves a high return.
We can distinguish two situations:
Planning: given a fully specified MDP.
Learning: agent interacts with an environment with unknown
dynamics.
I.e., the environment is a black box that takes in actions and
outputs states and rewards.
Which framework would be most appropriate for chess? Super
Mario?
The value function V^π for a policy π measures the expected return if you start in state s and follow policy π:
V^π(s) ≜ E_π[ G_t | S_t = s ] = E_π[ Σ_{k=0}^∞ γ^k R_{t+k} | S_t = s ].
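To make the definition concrete, here is a minimal Monte Carlo sketch: sample rollouts that start in s, compute the (truncated) discounted return for each, and average. The policy and step functions are hypothetical stand-ins for a policy and environment interface, not anything defined in these slides:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """G = sum_k gamma^k R_k for a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def mc_value_estimate(s, policy, step, gamma=0.9, n_rollouts=1000, horizon=200):
    """Monte Carlo estimate of V^pi(s): average the discounted return over rollouts starting at s.
    policy(state) samples an action and step(state, action) returns (next_state, reward);
    both are hypothetical interfaces assumed for this sketch."""
    returns = []
    for _ in range(n_rollouts):
        state, rewards = s, []
        for _ in range(horizon):             # truncate the infinite sum; gamma**horizon is negligible
            a = policy(state)
            state, r = step(state, a)
            rewards.append(r)
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)
```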
[Figure: Sutton and Barto, Reinforcement Learning: An Introduction]
For the optimal policy π*, acting greedily means π*(s') = arg max_{a'} Q^{π*}(s', a'), so the Bellman equation becomes
Q^{π*}(s, a) = r(s, a) + γ Σ_{s'} p(s' | s, a) max_{a'} Q^{π*}(s', a')
Now Q* = Q^{π*} is the optimal state-action value function, and we can rewrite the optimal Bellman equation without mentioning π*:
Q*(s, a) = r(s, a) + γ Σ_{s'} p(s' | s, a) max_{a'} Q*(s', a')
The right-hand side is the Bellman optimality backup applied to Q*, written (T*Q*)(s, a).
In operator notation, both Bellman equations say that the value function is a fixed point of its backup operator:
T^π Q^π = Q^π
T* Q* = Q*
Idea: keep iterating the backup operator over and over again.
Q ← T^π Q (policy evaluation)
Q ← T* Q (finding the optimal policy)
We're treating Q^π or Q* as a vector with |S| · |A| entries.
This type of algorithm is an instance of dynamic programming.
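A minimal sketch of the Q ← T*Q iteration on a small made-up MDP, treating Q as an |S| × |A| table (all sizes and numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# A made-up MDP: r[s, a] is the reward, P[s, a] a distribution over next states.
r = rng.uniform(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def bellman_optimality_backup(Q):
    """(T* Q)(s, a) = r(s, a) + gamma * sum_{s'} p(s' | s, a) max_{a'} Q(s', a')."""
    return r + gamma * P @ Q.max(axis=1)

# Value iteration: iterate Q <- T* Q; Q is just a table with |S| * |A| entries.
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q_new = bellman_optimality_backup(Q)
    if np.max(np.abs(Q_new - Q)) < 1e-8:     # sup-norm change shrinks by a factor of gamma per step
        Q = Q_new
        break
    Q = Q_new

greedy_policy = Q.argmax(axis=1)             # act greedily with respect to the (near-)optimal Q
print(Q, greedy_policy)
```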
If f is an α-contraction with α < 1 and fixed point x*, then
‖f^(k)(x) − x*‖ ≤ α^k ‖x − x*‖.
Hence, iterated application of f, starting from any x, converges to x*.
Finding the Optimal Value Function: Value Iteration
Putting the steps together,
‖T*Q_1 − T*Q_2‖_∞ ≤ γ ‖Q_1 − Q_2‖_∞,
which is what we wanted to show: T* is a γ-contraction in the ∞-norm.
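A quick numeric sanity check of this contraction claim on random Q tables for a made-up MDP (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.9
r = rng.uniform(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def T_star(Q):
    # (T* Q)(s, a) = r(s, a) + gamma * sum_{s'} p(s' | s, a) max_{a'} Q(s', a')
    return r + gamma * P @ Q.max(axis=1)

Q1 = rng.normal(size=(n_states, n_actions))
Q2 = rng.normal(size=(n_states, n_actions))

lhs = np.max(np.abs(T_star(Q1) - T_star(Q2)))   # ||T* Q1 - T* Q2||_inf
rhs = gamma * np.max(np.abs(Q1 - Q2))           # gamma * ||Q1 - Q2||_inf
print(lhs, rhs, lhs <= rhs + 1e-12)             # the inequality holds for any Q1, Q2
```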
Value Iteration Recap
The ε-greedy policy ensures that most of the time (probability 1 − ε) the agent exploits its incomplete knowledge of the world by choosing the best action (i.e., the one corresponding to the highest action-value), but occasionally (probability ε) it explores other actions.
Without exploration, the agent may never find some good actions.
ε-greedy is one of the simplest, but widely used, methods for trading off exploration and exploitation. The exploration-exploitation tradeoff is an important topic of research.
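A minimal sketch of ε-greedy action selection for one state's row of a tabular Q (the values and ε are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore); otherwise pick argmax_a Q(s, a) (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

q_s = np.array([0.1, 0.5, 0.3])                   # made-up action values for a single state
action = epsilon_greedy(q_s, epsilon=0.1)
```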
Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
[Slide credit: D. Silver]
Q(S, A) ← Q(S, A) + α [ R + γ max_{a'∈A} Q(S', a') − Q(S, A) ].
To understand this better, let us focus on its stochastic equilibrium, i.e., where the expected change in Q(S, A) is zero. We have
E[ R + γ max_{a'∈A} Q(S', a') − Q(S, A) | S, A ] = 0
⟹ (T*Q)(S, A) = Q(S, A)
Notice: this update doesn't mention the policy anywhere. The only thing the policy is used for is to determine which states are visited.
This means we can follow whatever policy we want (e.g. ε-greedy), and it still converges to the optimal Q-function. Algorithms like this are known as off-policy algorithms, and this is an extremely useful property.
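Putting the update and an ε-greedy behaviour policy together, here is a minimal sketch of tabular Q-learning. The env.reset() / env.step(a) interface is an assumed Gym-style environment, not something specified in these slides:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Off-policy tabular Q-learning with an epsilon-greedy behaviour policy.
    Assumes a Gym-style interface: env.reset() -> s and env.step(a) -> (s_next, r, done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # The behaviour policy only determines which states get visited.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```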
Policy gradient (another popular R L algorithm, not covered in this
course) is an on-policy algorithm. Encouraging exploration is
much harder in that case.
With a parametric approximator Q_θ (for instance a linear model or a neural net), the corresponding update computes a bootstrapped target and takes a gradient step:
t ← r(s_t, a_t) + γ max_a Q(s_{t+1}, a)
θ ← θ + α (t − Q(s_t, a_t)) ∇_θ Q(s_t, a_t).
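A minimal sketch of this update with a linear approximator Q_θ(s, a) = θᵀφ(s, a), where the feature map phi is a hypothetical placeholder; as in the update above, the target t is held fixed when taking the gradient:

```python
import numpy as np

def q_update_linear(theta, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.9):
    """One approximate Q-learning step for Q_theta(s, a) = theta . phi(s, a).
    phi is a hypothetical feature map returning a vector of the same length as theta."""
    q_next = max(theta @ phi(s_next, a2) for a2 in actions)
    t = r + gamma * q_next                 # bootstrapped target, held fixed w.r.t. theta
    td_error = t - theta @ phi(s, a)
    grad = phi(s, a)                       # gradient of theta . phi(s, a) with respect to theta
    return theta + alpha * td_error * grad
```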
htt ps://www.youtube.com/watch?v=4MlZncshy1Q
Recap and Other Approaches
All discussed approaches estimate the value function first. They are
called value-based methods.
There are methods that directly optimize the policy, i.e., policy search
methods.
Model-based RL methods estimate the true, but unknown, environment model P with an estimate P̂, and use P̂ in order to plan (a minimal count-based sketch appears after this list).
There are hybrid methods.
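For the model-based bullet above, a minimal count-based sketch of estimating P̂ from observed (s, a, s') transitions (the data format is an assumption made for this sketch):

```python
import numpy as np

def estimate_transition_model(transitions, n_states, n_actions):
    """P_hat[s, a, s'] = count(s, a, s') / count(s, a), from a list of (s, a, s_next) tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Fall back to a uniform distribution for state-action pairs that were never visited.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
```

The resulting P̂ can then be plugged into a planning method such as the value-iteration sketch earlier.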
[Figure: diagram relating value-based, policy-based, and model-based approaches]
Books:
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning:
An Introduction, 2nd edition, 2018.
Csaba Szepesvari, Algorithms for Reinforcement Learning, 2010.
Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien
Ernst, Reinforcement Learning and Dynamic Programming Using
Function Approximators, 2010.
Dimitri P. Bertsekas and John N. Tsitsiklis, Neuro-Dynamic
Programming, 1996.
Courses:
Video lectures by David Silver
CIFAR and Vector Institute's Reinforcement Learning Summer School, 2018.
Deep Reinforcement Learning, CS 294-112 at UC Berkeley
We talked about neural nets as learning feature maps you can use for regression/classification.
More generally, we want to learn a representation of the data such that mathematical operations on the representation are semantically meaningful.
Classic (decades-old) example: representing words as vectors
Measure semantic similarity using the dot product between word
vectors (or dissimilarity using Euclidean distance)
Represent a web page with the average of its word vectors
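A minimal sketch of these operations with made-up word vectors (real systems would learn the embeddings, e.g. with word2vec or GloVe):

```python
import numpy as np

# Made-up 4-dimensional word vectors; in practice these would be learned embeddings.
word_vecs = {
    "cat": np.array([0.9, 0.1, 0.0, 0.2]),
    "dog": np.array([0.8, 0.2, 0.1, 0.1]),
    "car": np.array([0.0, 0.9, 0.8, 0.1]),
}

def similarity(u, v):
    """Semantic similarity as the dot product between word vectors."""
    return float(word_vecs[u] @ word_vecs[v])

def doc_vector(words):
    """Represent a document (e.g. a web page) by the average of its word vectors."""
    return np.mean([word_vecs[w] for w in words if w in word_vecs], axis=0)

print(similarity("cat", "dog"), similarity("cat", "car"))
print(doc_vector(["cat", "dog", "car"]))
```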
Zhu et al., 2017, “Unpaired image-to-image translation using cycle-consistent adversarial networks”
Ghahramani, 2015, “Probabilistic machine learning and artificial intelligence”
Example document from Blei et al. (2003), “Latent Dirichlet Allocation”:
The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. “Our board felt that we had a
real opportunity to make a mark on the future of the performing arts with these grants an act
every bit as important as our traditional areas of support in health, medical research, education
and the social services,” Hearst Foundation President Randolph A. Hearst said Monday in
announcing the grants. Lincoln Center’s share will be $200,000 for its new building, which
will house young artists and provide new public facilities. The Metropolitan Opera Co. and
New York Philharmonic will receive $400,000 each. The Juilliard School, where music and
the performing arts are taught, will get $250,000. The Hearst Foundation, a leading
supporter of the Lincoln Center Consolidated Corporate Fund, will make its usual annual
$100,000 donation, too.
Blei et al., 2003, “Latent Dirichlet Allocation”