Reinforcement Learning
There is no dataset given in advance: you receive data as your agent performs sequential actions in an environment.
A reward signal tells the agent whether its actions were good or bad.
Goal: learn good behaviour in as few trials as possible (minimize the number of trials).
Example (seasoning a dish): the agent adds salt and tastes; if it tastes good, that is a reward; otherwise it adjusts (add salt / less salt) and tries again, choosing actions so as to maximize reward.
Note that data gathered this way is not independent and identically distributed: each observation depends on the actions taken so far.
A small per-step penalty encourages the agent to complete the task as soon as possible.
Formally: states S = {s0, s1, s2, ...}, actions A = {a0, a1, ...}, and transition probabilities P(s' | s, a), e.g. P(s2 | s1, a1) is the probability of landing in s2 after taking action a1 in state s1.
A policy is a mapping from states to actions.
[Figure: a policy drawn on a gridworld whose cells each carry a small step reward of 0.03]
If the reward function changes, the policy will also change: the optimal policy is sensitive to the reward function.
Transitions may also be stochastic: after taking an action, the agent may not actually end up in the intended new state.
[Figure: the same gridworld with a larger per-step reward (0.3), giving a different optimal policy]
Discounted return: G = R_1 + γ R_2 + γ² R_3 + ...
The discount factor γ keeps diluting the contribution of future transitions.
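As a quick sketch (Python, with illustrative reward values of my own), the discounted return can be computed directly from a list of rewards:

    # Discounted return: G = R1 + gamma*R2 + gamma^2*R3 + ...
    def discounted_return(rewards, gamma):
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Illustrative rewards: the later a reward arrives, the more gamma dilutes it.
    print(discounted_return([1, 0, 2], gamma=0.5))   # 1 + 0.5*0 + 0.25*2 = 1.5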
An MDP is specified by (S, A, P, R): states, actions, transition probabilities P : S × A × S → [0, 1], and rewards R.
Mars Rover example
State:    1    2    3    4    5    6
Reward: 100    0    0    0    0   40
Actions: left, right (deterministic; states 1 and 6 are terminal)

Starting at state 4 and always moving left, the visited states are 4 → 3 → 2 → 1 with rewards 0, 0, 0, 100.
γ = 0.9: return = 0 + 0.9·0 + 0.9²·0 + 0.9³·100 = 72.9
γ = 0.5: return = 0 + 0.5·0 + 0.5²·0 + 0.5³·100 = 12.5
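A small Python sketch (assuming the deterministic six-state rover above, with states 1 and 6 terminal) that replays the always-left trajectory from state 4 and reproduces the two returns:

    # Mars Rover: reward 100 at state 1, 40 at state 6, 0 elsewhere.
    REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
    TERMINAL = {1, 6}

    def rollout_left(start):
        """States visited when always moving left until a terminal state."""
        states = [start]
        while states[-1] not in TERMINAL:
            states.append(states[-1] - 1)   # "left" decreases the state index
        return states

    def discounted_return(states, gamma):
        return sum((gamma ** t) * REWARDS[s] for t, s in enumerate(states))

    path = rollout_left(4)                            # [4, 3, 2, 1]
    print(round(discounted_return(path, 0.9), 4))     # 72.9  (0.9**3 * 100)
    print(round(discounted_return(path, 0.5), 4))     # 12.5  (0.5**3 * 100)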
Return
The return is the discounted sum of rewards along a trajectory: G = R_1 + γ R_2 + γ² R_3 + ...
The return depends on the rewards collected, and the rewards depend on the actions taken (and the states visited), so different actions give different returns.
Example (γ = 0.5): from state 5, moving right gives return 0 + 0.5·40 = 20, while always moving left gives only 0.5⁴·100 = 6.25.
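To make "the return depends on the actions" concrete, here is a sketch (Python, γ = 0.5, same rover as above) comparing the return from every starting state under an always-left policy versus an always-right policy:

    REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
    TERMINAL = {1, 6}
    GAMMA = 0.5

    def policy_return(start, step):
        """Discounted return from `start`, repeatedly applying `step` until terminal."""
        s, t, g = start, 0, 0.0
        while True:
            g += (GAMMA ** t) * REWARDS[s]
            if s in TERMINAL:
                return g
            s, t = step(s), t + 1

    for s in range(1, 7):
        left = policy_return(s, lambda x: x - 1)
        right = policy_return(s, lambda x: x + 1)
        print(s, left, right)
    # e.g. from state 5: always-left gives 6.25, always-right gives 20.0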
Optimal policy
A policy π maps each state s to an action a; the optimal policy π* picks, in every state, the action (move) that maximizes the expected return.
State value function and action value function
In a stochastic environment the return R_1 + γ R_2 + γ² R_3 + ... is a random variable, so we work with its expectation, E[X] = Σ_x x·P(x). The value functions are the expected returns obtained when acting according to the optimal policy.
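When transitions are random, a given policy no longer yields a single return, so we look at the expected return. A Monte Carlo sketch (Python; the 10% "misstep" probability below is my own illustrative assumption, not a number from the notes) approximates that expectation by averaging many sampled returns:

    import random

    REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
    TERMINAL = {1, 6}
    GAMMA, MISSTEP = 0.5, 0.1      # MISSTEP: chance the rover slips the opposite way

    def sampled_return(start, intended=-1):
        """One sampled discounted return when always commanding 'left' (intended=-1)."""
        s, t, g = start, 0, 0.0
        while True:
            g += (GAMMA ** t) * REWARDS[s]
            if s in TERMINAL:
                return g
            move = intended if random.random() > MISSTEP else -intended
            s = min(6, max(1, s + move))
            t += 1

    # E[X] = sum_x x * P(x): approximate the expectation by averaging samples.
    samples = [sampled_return(4) for _ in range(100_000)]
    print(sum(samples) / len(samples))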
V(s): state value function, the expected return starting from state s.
Q(s, a): action value function; for every state s and every action a, Q(s, a) is the return obtained by starting in s, taking action a once, and behaving optimally afterwards.
Example (Mars Rover, γ = 0.5, state 4, move left): 0 + 0.5·0 + 0.5²·0 + 0.5³·100 = 12.5.
Mars Rover: Q values (γ = 0.5)
[Figure: Mars Rover state diagram annotated with Q(s, a) for every state and action]
Q(s, a) = R(s) + γ · max_a' Q(s', a')
Q(2, →) = R(2) + 0.5 · max_a' Q(3, a') = 0 + 0.5 · max(25, 6.25) = 12.5
Q(4, ←) = R(4) + 0.5 · max_a' Q(3, a') = 0 + 0.5 · max(25, 6.25) = 12.5
Bellman equation
Q(s, a): start from state s, take action a once, then behave optimally; Q(s, a) is the resulting return.
Bellman equation: Q(s, a) = R(s) + γ · max_a' Q(s', a')
If the environment is stochastic, take the expectation over the next state: Q(s, a) = R(s) + γ · E_s'[ max_a' Q(s', a') ], where E[X] = Σ_x x·P(x).
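Turning the Bellman equation into code is a single update inside a loop. The sketch below (Python, deterministic rover, γ = 0.5) repeatedly applies Q(s, a) ← R(s) + γ·max_a' Q(s', a') until the values settle, and it reproduces the hand-computed numbers above:

    REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
    TERMINAL = {1, 6}
    GAMMA = 0.5
    ACTIONS = {"left": -1, "right": +1}

    def next_state(s, a):
        return min(6, max(1, s + ACTIONS[a]))

    # Start from Q = 0 and sweep the Bellman update until it converges.
    Q = {(s, a): 0.0 for s in REWARDS for a in ACTIONS}
    for _ in range(50):
        for s in REWARDS:
            for a in ACTIONS:
                if s in TERMINAL:
                    Q[(s, a)] = REWARDS[s]
                else:
                    s2 = next_state(s, a)
                    Q[(s, a)] = REWARDS[s] + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)

    print(Q[(2, "right")], Q[(4, "left")])   # 12.5 12.5
    print(Q[(3, "left")], Q[(3, "right")])   # 25.0 6.25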
VALUE AND POLICY ITERATION
VALUE ITERATION
[Worked figure: value iteration on a small example, repeatedly updating the state values until they converge]
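The notes only show the worked figure here, so this is a minimal value-iteration sketch for a generic MDP with known P and R (Python; the tiny two-state MDP at the bottom is an illustrative placeholder, not taken from the notes):

    # Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ]
    def value_iteration(states, actions, P, R, gamma, tol=1e-8):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                best = max(
                    R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                    for a in actions
                )
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                return V

    # Illustrative two-state MDP: "stay" keeps the state, "go" usually switches it.
    states, actions = ["A", "B"], ["stay", "go"]
    P = {"A": {"stay": {"A": 1.0}, "go": {"B": 0.8, "A": 0.2}},
         "B": {"stay": {"B": 1.0}, "go": {"A": 0.8, "B": 0.2}}}
    R = {"A": {"stay": 0.0, "go": 0.0}, "B": {"stay": 1.0, "go": 0.0}}
    print(value_iteration(states, actions, P, R, gamma=0.9))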
POLICY ITERATION
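The notes leave this section without a body, so the following is only a rough sketch of the standard policy-iteration loop (Python, same (states, actions, P, R, gamma) dictionaries as in the value-iteration sketch above): alternate between evaluating the current policy and improving it greedily until it stops changing.

    def policy_iteration(states, actions, P, R, gamma, eval_tol=1e-8):
        policy = {s: actions[0] for s in states}        # arbitrary initial policy
        V = {s: 0.0 for s in states}
        while True:
            # Policy evaluation: V(s) = R(s, pi(s)) + gamma * sum_s' P(s'|s,pi(s)) * V(s')
            while True:
                delta = 0.0
                for s in states:
                    a = policy[s]
                    v = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                    delta = max(delta, abs(v - V[s]))
                    V[s] = v
                if delta < eval_tol:
                    break
            # Policy improvement: act greedily with respect to V.
            new_policy = {
                s: max(actions,
                       key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
                for s in states
            }
            if new_policy == policy:
                return policy, V
            policy = new_policy

It can be called with the same states, actions, P, R dictionaries used in the value-iteration example.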
Temporal Difference and Q-learning
We have seen how to solve an MDP when P and R are known, using value and policy iteration.
When P and R are unknown, we estimate the needed quantities by taking actions in the environment. The update involves an expectation over next states under P; every time we take an action we get to see one sample of that transition, and we treat this single sample as a proxy for the entire distribution.
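A tabular Q-learning sketch along those lines (Python; the rover layout is reused from above, and the misstep probability, learning rate α, and ε-greedy exploration rate are illustrative assumptions). Each observed transition supplies one sampled next state, which stands in for the expectation under P:

    import random

    REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
    TERMINAL = {1, 6}
    GAMMA, ALPHA, EPSILON, MISSTEP = 0.5, 0.1, 0.1, 0.1
    ACTIONS = (-1, +1)                       # left, right

    def step(s, a):
        """Sample one transition: the rover occasionally slips the opposite way."""
        move = a if random.random() > MISSTEP else -a
        return min(6, max(1, s + move))

    Q = {(s, a): 0.0 for s in REWARDS for a in ACTIONS}
    for s in TERMINAL:                       # terminal states just hold their reward
        for a in ACTIONS:
            Q[(s, a)] = REWARDS[s]

    for _ in range(20_000):                  # episodes
        s = random.randint(2, 5)
        while s not in TERMINAL:
            # epsilon-greedy action selection
            if random.random() < EPSILON:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: Q[(s, x)])
            s2 = step(s, a)
            # One sampled s2 replaces the expectation over next states under P:
            # Q(s,a) <- Q(s,a) + alpha * [ R(s) + gamma * max_a' Q(s2,a') - Q(s,a) ]
            target = REWARDS[s] + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s = s2

    print({k: round(v, 2) for k, v in Q.items()})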
References
https://www.youtube.com/playlist?list=PLYgyoWurxA_8ePNUuTLDtMvzyf-YW7im2
http://incompleteideas.net/book/ebook/
https://www.coursera.org/learn/unsupervised-learning-recommenders-reinforcement-learning