Reinforcement Learning

Uploaded by

Lakshit Jain
Copyright: © All Rights Reserved

REINFORCEMENT LEARNING

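No dataset in advance.
You receive data as your agent performs sequential actions in an environment.
Reward: tells the agent whether an action was good or bad.
The agent tries to maximize reward.
Aim: to minimize the number of trials.

Example: the action is "add salt"; if the dish then tastes good, that is the reward.

Contrast: a dataset collected in advance is usually assumed to be independent and identically distributed (i.i.d.); data gathered through sequential actions is not.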

Reinforcement Learning Examples


Robotics, Autonomous Driving, Mario

MARKOV DECISION PROCESS

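Persist penalty: a small penalty for every extra step. Aim: complete the task as soon as possible.

[Figure: State Transition Diagram; edge labels include 0.2, 0.05 and the probabilities 10%, 80%, 10%]

States: $S = \{s_0, s_1, s_2, \dots\}$
Actions: $A = \{a_0, a_1, \dots\}$
Transition probability: $P(s' \mid s, a)$, e.g. $P(s_2 \mid s_1, a_1)$
Reward: $R(s)$

Typically we do not know $P(\cdot)$ and $R(\cdot)$: we do not know the state transition diagram in advance, so we have to learn it through trial and error. In the process of learning through trial and error we will encounter samples from the transition diagram.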
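A minimal sketch of what "encountering samples from the transition diagram" can look like in practice. The outcome names and the 0.8 / 0.1 / 0.1 probabilities are assumptions chosen for illustration; the learner never reads `TRUE_P` directly and only estimates it from repeated trials.

```python
import random
from collections import Counter

# Hypothetical transition distribution for one (state, action) pair.
# These numbers are illustrative assumptions, hidden from the learner.
TRUE_P = {"intended": 0.8, "slip_left": 0.1, "slip_right": 0.1}


def sample_outcome(rng):
    """Draw one outcome of taking the action, according to TRUE_P."""
    outcomes = list(TRUE_P)
    weights = [TRUE_P[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def estimate_p(num_trials, seed=0):
    """Estimate the transition probabilities from observed samples only."""
    rng = random.Random(seed)
    counts = Counter(sample_outcome(rng) for _ in range(num_trials))
    return {o: counts[o] / num_trials for o in TRUE_P}


print(estimate_p(10_000))
```

With more trials the estimated frequencies move closer to the hidden probabilities, which is the sense in which trial and error reveals the transition diagram.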

Why the name Markov? The next state depends only on the current state and action, not on the earlier history.

Policy: a mapping from state to action.

[Figure: policy on the grid world, with 0.03 marked in every non-terminal cell]

If your reward changes, then the policy will also change.

The optimal policy is sensitive to the reward function.

Discounted rewards: successive rewards are weighted by $\gamma, \gamma^2, \gamma^3, \dots$, which keeps diluting the contribution of future transitions.

MDP: $(S, A, P, R)$
Policy: $\pi : S \to A$

Mars Rover example

Terminal states: 1, 6. After reaching a terminal state you stop.

State:    1     2     3     4     5     6
Reward:   100   0     0     0     0     40

Actions: Left, Right

Example path: 4 → 3 → 2 → 1, with rewards 0, 0, 0, 100.

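RETURN (UTILITY): Discounted Reward

Return $= R_1 + \gamma R_2 + \gamma^2 R_3 + \dots$ (until terminal state)

$\gamma = 0.9$, path 4 → 3 → 2 → 1, rewards 0, 0, 0, 100:
Return $= 0 + 0.9 \cdot 0 + 0.9^2 \cdot 0 + 0.9^3 \cdot 100 = 0.729 \cdot 100 = 72.9$

$\gamma = 0.5$, path 4 → 3 → 2 → 1, rewards 0, 0, 0, 100:
Return $= 0 + 0.5 \cdot 0 + 0.5^2 \cdot 0 + 0.5^3 \cdot 100 = 12.5$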
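The return calculation can be checked with a few lines of Python. This is a small sketch; the function name `discounted_return` is an illustrative choice, and the first reward is left undiscounted to match the notes' convention.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] along one path (first reward undiscounted)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Rewards collected on the path 4 -> 3 -> 2 -> 1 of the Mars Rover example.
path_rewards = [0, 0, 0, 100]

print(discounted_return(path_rewards, 0.9))  # approximately 72.9
print(discounted_return(path_rewards, 0.5))  # 12.5
```

A smaller discount factor shrinks the contribution of the distant reward of 100 much faster, which is why the same path is worth 72.9 at $\gamma = 0.9$ but only 12.5 at $\gamma = 0.5$.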
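Examples of returns ($\gamma = 0.5$)

The return depends on the rewards, and the rewards depend on the actions you take.

Always go left, starting from state 4:
Return $= 0 + 0.5 \cdot 0 + 0.5^2 \cdot 0 + 0.5^3 \cdot 100 = 12.5$

Always go right, starting from state 5:
Return $= 0 + 0.5 \cdot 40 = 20$

Always go right, starting from state 4:
Return $= 0 + 0.5 \cdot 0 + 0.5^2 \cdot 40 = 10$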
Optimal Policy

Policy $\pi$: for each state $s$, gives the action $a = \pi(s)$ to take.
aaEem moves
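State value function

A function of the state and of the policy that you are following.

In a stochastic environment the return is random, so we take its expectation: $E[X] = \sum_x x \, P(x)$.

Optimal case: the actions are those output by the optimal policy, and the value is the expected return $R_1 + \gamma R_2 + \gamma^2 R_3 + \dots$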
Action value function

$Q(s, a)$ is the return that you get if you:
o start in state $s$
o take action $a$
o behave optimally after that

Example ($\gamma = 0.5$): $Q(2, \rightarrow) = 0 + 0.5 \cdot 0 + 0.5^2 \cdot 0 + 0.5^3 \cdot 100 = 12.5$

For all states and for all actions, calculate the return $Q(s, a)$.
Mars Rover

$Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')$

$Q(2, \rightarrow) = R(2) + 0.5 \max_{a'} Q(3, a')$
$= 0 + 0.5 \max(25, 6.25)$
$= 0 + 0.5 \cdot 25 = 12.5$

$Q(4, \leftarrow) = R(4) + 0.5 \max_{a'} Q(3, a')$
$= 0 + 0.5 \max(25, 6.25)$
$= 12.5$
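The two hand calculations can be reproduced for every state and action with a short sketch. It assumes the Mars Rover setup from the notes (states 1 to 6, terminal states 1 and 6, rewards 100 and 40, $\gamma = 0.5$, deterministic moves); the names `compute_q` and `next_state` are illustrative.

```python
GAMMA = 0.5
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}
ACTIONS = ("left", "right")


def next_state(state, action):
    """Deterministic move: left decreases the state index, right increases it."""
    return state - 1 if action == "left" else state + 1


def compute_q(num_sweeps=20):
    """Repeatedly apply Q(s,a) = R(s) + gamma * max_a' Q(s',a') until it settles."""
    q = {(s, a): 0.0 for s in REWARDS for a in ACTIONS}
    for _ in range(num_sweeps):
        new_q = {}
        for s in REWARDS:
            for a in ACTIONS:
                if s in TERMINAL:
                    # In a terminal state you stop, so Q is just the reward.
                    new_q[(s, a)] = float(REWARDS[s])
                else:
                    s2 = next_state(s, a)
                    new_q[(s, a)] = REWARDS[s] + GAMMA * max(q[(s2, b)] for b in ACTIONS)
        q = new_q
    return q


q = compute_q()
print(q[(3, "left")], q[(3, "right")])  # 25.0 6.25
print(q[(2, "right")], q[(4, "left")])  # 12.5 12.5
```

Twenty sweeps is more than enough here because the chain has only six states, so values propagate from the terminal states within a handful of sweeps.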

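Bellman equation

$Q(s, a)$: the return if you start from state $s$, take action $a$, and behave optimally after that.

$Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')$

Stochastic case: the next state $s'$ is random, so use the expectation $E[X] = \sum_x x \, P(x)$, for the state value function and the action value function alike:

$Q(s, a) = R(s) + \gamma \, E_{s'}\left[ \max_{a'} Q(s', a') \right]$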
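One way to see where the Bellman equation comes from is to split the return into the first reward and everything after it. In the sketch below, $R_1 = R(s)$ is the reward in the starting state and $s'$ is the state reached after taking action $a$; the deterministic case is assumed.

```latex
\begin{aligned}
Q(s, a) &= R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \dots \\
        &= R_1 + \gamma \left( R_2 + \gamma R_3 + \gamma^2 R_4 + \dots \right) \\
        &= R(s) + \gamma \max_{a'} Q(s', a')
\end{aligned}
```

The bracket in the second line is the return obtained by starting in $s'$ and behaving optimally from there, which is exactly $\max_{a'} Q(s', a')$.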
VALUE AND POLICY ITERATION

Solving MDPs with known P and R: Value Iteration and Policy Iteration

VALUE ITERATION
[Figure: worked value iteration example]
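A minimal sketch of value iteration when P and R are known, applied to the Mars Rover example with $\gamma = 0.5$. The data layout is an assumption made for illustration: `P[(s, a)]` is a list of `(probability, next_state)` pairs and `R[s]` is the reward in state `s`.

```python
def value_iteration(states, actions, P, R, gamma, terminal, tol=1e-9):
    """Iterate V(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) V(s') to convergence."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {}
        for s in states:
            if s in terminal:
                # In a terminal state you stop, so the value is just the reward.
                new_V[s] = float(R[s])
            else:
                new_V[s] = R[s] + gamma * max(
                    sum(p * V[s2] for p, s2 in P[(s, a)]) for a in actions
                )
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            return V


states = [1, 2, 3, 4, 5, 6]
actions = ["left", "right"]
terminal = {1, 6}
R = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}

# Deterministic transitions: each action leads to one neighbour with probability 1.
P = {}
for s in states:
    if s not in terminal:
        P[(s, "left")] = [(1.0, s - 1)]
        P[(s, "right")] = [(1.0, s + 1)]

V = value_iteration(states, actions, P, R, gamma=0.5, terminal=terminal)
print(V)  # {1: 100.0, 2: 50.0, 3: 25.0, 4: 12.5, 5: 20.0, 6: 40.0}
```

A stochastic environment only changes the contents of `P`: each list would hold several `(probability, next_state)` pairs instead of one.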
POLICY ITERATION
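A minimal sketch of policy iteration on the Mars Rover example with $\gamma = 0.5$: evaluate the current policy, then switch any state's action to a strictly better one, and repeat until nothing changes. The helper names and the `P[(s, a)]` list-of-pairs layout are assumptions for illustration.

```python
def evaluate_policy(policy, states, P, R, gamma, terminal, sweeps=100):
    """Approximate V for a fixed policy by repeated backups."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {
            s: float(R[s]) if s in terminal
            else R[s] + gamma * sum(p * V[s2] for p, s2 in P[(s, policy[s])])
            for s in states
        }
    return V


def policy_iteration(states, actions, P, R, gamma, terminal):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: actions[0] for s in states if s not in terminal}
    while True:
        V = evaluate_policy(policy, states, P, R, gamma, terminal)
        changed = False
        for s in policy:
            def backup(a):
                # Expected value of the next state when taking action a in state s.
                return sum(p * V[s2] for p, s2 in P[(s, a)])
            best = max(actions, key=backup)
            # Switch only on a strict improvement, so ties cannot cause oscillation.
            if backup(best) > backup(policy[s]) + 1e-12:
                policy[s] = best
                changed = True
        if not changed:
            return policy, V


states = [1, 2, 3, 4, 5, 6]
actions = ["left", "right"]
terminal = {1, 6}
R = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
P = {}
for s in states:
    if s not in terminal:
        P[(s, "left")] = [(1.0, s - 1)]
        P[(s, "right")] = [(1.0, s + 1)]

policy, V = policy_iteration(states, actions, P, R, gamma=0.5, terminal=terminal)
print(policy)  # {2: 'left', 3: 'left', 4: 'left', 5: 'right'}
```

Starting from "always go left", a single improvement step flips state 5 to "right" (20 beats 6.25), after which the policy is stable.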
Temporal Difference and Q-learning

We have seen how to solve an MDP when P and R are known, using value iteration and policy iteration.

When P and R are not known, we estimate the required quantities by taking actions in the environment. These quantities are expectations under P.

Every time we take some action, we get to see one sample. We treat a single sample as a proxy for the entire distribution.

TD treats every single sample you encounter as representative of the distribution, so you do not have to wait for a large number of interactions with the environment before improving.
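A minimal sketch of tabular Q-learning on the Mars Rover example, to make the "one sample as a proxy" idea concrete. The learner never reads a transition table; it only calls `step` and updates from the single transition it sees. The learning rate `ALPHA`, the uniformly random exploration, the episode count and all function names are assumptions for illustration.

```python
import random

GAMMA, ALPHA = 0.5, 0.5
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}
ACTIONS = ("left", "right")


def step(state, action):
    """The environment: the learner only sees the sampled next state."""
    return state - 1 if action == "left" else state + 1


def q_learning(num_episodes=2000, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in REWARDS if s not in TERMINAL for a in ACTIONS}
    for _ in range(num_episodes):
        s = rng.choice([2, 3, 4, 5])      # random non-terminal start state
        while s not in TERMINAL:
            a = rng.choice(ACTIONS)       # explore uniformly at random
            s2 = step(s, a)               # one sample from the environment
            if s2 in TERMINAL:
                future = REWARDS[s2]      # you stop there and collect its reward
            else:
                future = max(q[(s2, b)] for b in ACTIONS)
            # TD update: move Q(s,a) part of the way toward the one-sample target.
            target = REWARDS[s] + GAMMA * future
            q[(s, a)] += ALPHA * (target - q[(s, a)])
            s = s2
    return q


q = q_learning()
print(round(q[(3, "left")], 3), round(q[(3, "right")], 3))
```

Each update uses exactly one observed transition, so the estimates start improving from the very first step instead of after a full pass over the environment.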
Resources:

https://www.youtube.com/playlist?list=PLYgyoWurxA_8ePNUuTLDtMvzyf-YW7im2
http://incompleteideas.net/book/ebook/
https://www.coursera.org/learn/unsupervised-learning-recommenders-reinforcement-learning
