Reinforcement Learning Notes

The document discusses Monte Carlo methods and how they can be used for prediction, value estimation, and policy iteration. Monte Carlo methods rely on repeated random sampling to estimate values. The document also discusses using Monte Carlo methods for solving a blackjack example and using epsilon-greedy and epsilon-soft policies for control.

Sample-based learning methods,

Monte-Carlo Methods:
A general term for any estimation method that relies on repeated random sampling.

sum of faces | Probability
12           | ?
13           | ?
14           | ?
...          | ...
71           | ?
72           | ?
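A minimal Monte Carlo sketch of this estimate in Python (assumption: the 12-72 range of sums corresponds to rolling 12 six-sided dice; the function and variable names are illustrative, not from the notes):

import random
from collections import Counter

def estimate_sum_distribution(n_dice=12, n_samples=100_000, seed=0):
    """Estimate P(sum of faces) by repeated random sampling (Monte Carlo)."""
    rng = random.Random(seed)
    counts = Counter(
        sum(rng.randint(1, 6) for _ in range(n_dice))
        for _ in range(n_samples)
    )
    # Relative frequency of each sum approximates its probability.
    return {s: counts[s] / n_samples for s in range(n_dice, 6 * n_dice + 1)}

probs = estimate_sum_distribution()
print(probs[12], probs[42], probs[72])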

Value function,

v_π(s) = E_π[ G_t | S_t = s ]

G_t = R_{t+1} + γ·G_{t+1}    (G_t = 0 if t is the final time step)

Ex: Using Monte Carlo for prediction.

Problem formulation,

A state is described by (usable ace, player sum, dealer's showing card).

s                    | Returns(s) | V(s)
A = (No Ace, 20, 10) |            |
B = (No Ace, 13, 10) |            |
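A first-visit Monte Carlo prediction sketch matching this formulation; generate_episode and policy are hypothetical stand-ins for an episode generator that yields (state, action, reward) steps:

from collections import defaultdict

def mc_prediction(generate_episode, policy, num_episodes=10_000, gamma=1.0):
    """First-visit MC prediction: V(s) is the average of returns observed
    after the first visit to s in each episode."""
    returns = defaultdict(list)          # state -> list of observed returns
    V = defaultdict(float)
    for _ in range(num_episodes):
        # episode is a list of (S_t, A_t, R_{t+1}) steps
        episode = generate_episode(policy)
        first_visit = {}
        for t, (s, _, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            G = gamma * G + r            # G_t = R_{t+1} + gamma * G_{t+1}
            if first_visit[s] == t:
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V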

Using Monte Carlo for action values,

q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]

π(s) = argmax_a q_π(s, a)

[Diagram: a state s0 with its available actions and their estimated action values.]

Using Monte Carlo for generalized policy iteration (GPI),

π_0 → π_1 → π_2 → ...    (alternating evaluation and improvement)

Improvement: π_{k+1}(s) = argmax_a q_{π_k}(s, a)
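A small sketch of the improvement step, assuming the action-value estimates are stored as a nested dict Q[state][action] (names are illustrative):

def greedy_improvement(Q):
    """One GPI improvement step: make the policy greedy with respect to the
    current action-value estimates, pi_{k+1}(s) = argmax_a q_{pi_k}(s, a)."""
    return {s: max(action_values, key=action_values.get)
            for s, action_values in Q.items()}

# Example: Q = {"s0": {"a1": 0.4, "a2": 0.9}} -> {"s0": "a2"}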

Solving Blackjack example,

For every (state, action) pair, keep the list Returns(S, A), average it to get Q(S, A), and set π(S) to the greedy action (Hit/Stick):

S, A       | Returns(S, A)              | Q(S, A) | π(S)
(X, Stick) | Returns(X, Stick) = [...]  |         | Stick
(Y, Hit)   | Returns(Y, Hit) = [...]    |         | Hit

Epsilon-soft policies,

ε-greedy policies are stochastic policies.

ε-greedy policies ⊂ ε-soft policies (every ε-greedy policy is ε-soft).

[Figure: probability π(a|s) of each action in a given state under an ε-soft policy.]
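A small sketch of ε-greedy action probabilities, showing why every action keeps probability at least ε/|A(s)| (names are illustrative):

def epsilon_greedy_probs(q_values, actions, epsilon):
    """Probability of each action under an epsilon-greedy (hence epsilon-soft)
    policy: every action gets at least epsilon/|A(s)|, and the greedy action
    gets the remaining 1 - epsilon on top."""
    greedy = max(actions, key=lambda a: q_values[a])
    base = epsilon / len(actions)
    return {a: base + (1.0 - epsilon) * (a == greedy) for a in actions}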

MC control (for ε-soft policies),

Algorithm parameter: small ε > 0

Initialize:
    π ← an arbitrary ε-soft policy
    Q(s, a) ∈ R (arbitrarily), for all s ∈ S, a ∈ A(s)
    Returns(s, a) ← empty list, for all s ∈ S, a ∈ A(s)

Repeat forever (for each episode):
    Generate an episode following π: S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T
    G ← 0
    Loop for each step of episode, t = T-1, T-2, ..., 0:
        G ← γG + R_{t+1}
        Append G to Returns(S_t, A_t)
        Q(S_t, A_t) ← average(Returns(S_t, A_t))
        A* ← argmax_a Q(S_t, a)
        For all a ∈ A(S_t):
            π(a|S_t) ← 1 - ε + ε/|A(S_t)|   if a = A*
            π(a|S_t) ← ε/|A(S_t)|           if a ≠ A*
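A runnable sketch of this loop, assuming a hypothetical Gym-style env with reset()/step() and discrete actions; the step signature and all names are assumptions, not a fixed API:

import random
from collections import defaultdict

def mc_control_epsilon_soft(env, num_episodes=50_000, gamma=1.0, epsilon=0.1, seed=0):
    """On-policy Monte Carlo control with an epsilon-soft (epsilon-greedy) policy."""
    rng = random.Random(seed)
    actions = list(range(env.action_space.n))        # assumes discrete actions
    Q = defaultdict(lambda: defaultdict(float))      # Q[s][a]
    returns = defaultdict(list)                      # returns[(s, a)]

    def choose_action(s):
        # epsilon-soft: explore with probability epsilon, otherwise act greedily.
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: Q[s][a])

    for _ in range(num_episodes):
        episode, s, done = [], env.reset(), False
        while not done:
            a = choose_action(s)
            s_next, r, done = env.step(a)            # hypothetical step signature
            episode.append((s, a, r))
            s = s_next

        G = 0.0
        for t in reversed(range(len(episode))):      # walk the episode backwards
            s, a, r = episode[t]
            G = gamma * G + r                        # G <- gamma*G + R_{t+1}
            returns[(s, a)].append(G)
            Q[s][a] = sum(returns[(s, a)]) / len(returns[(s, a)])

    # Greedy policy read off the final action-value estimates.
    policy = {s: max(actions, key=lambda a: Q[s][a]) for s in Q}
    return Q, policy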

Why off-policy learning matters,

On-policy: improve and evaluate the same policy that is used to select actions (the behavior policy, b(a|s)).

Off-policy: improve and evaluate a different policy (the target policy, π(a|s)) from the one used to select actions.

Importance Sampling,

Goal: estimate E_π[X] when the samples x ~ b come from a different distribution b.

ρ(x) = π(x) / b(x)    (importance sampling ratio)

E_π[X] = Σ_x x·π(x) = Σ_x x·(π(x)/b(x))·b(x) = E_b[ X·ρ(X) ]

       ≈ (1/n) Σ_{i=1}^{n} x_i·ρ(x_i),   x_i ~ b
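A tiny numeric sketch of this estimator, assuming discrete target and behavior distributions given as dicts of probabilities (all names and numbers are illustrative):

import random

def importance_sampling_estimate(pi, b, n=100_000, seed=0):
    """Estimate E_pi[X] from samples drawn under b, reweighted by rho(x) = pi(x)/b(x)."""
    rng = random.Random(seed)
    values, weights = list(b.keys()), list(b.values())
    samples = rng.choices(values, weights=weights, k=n)   # x_i ~ b
    return sum(x * (pi[x] / b[x]) for x in samples) / n

# Example: target pi and behavior b over outcomes 1..3.
pi = {1: 0.7, 2: 0.2, 3: 0.1}
b  = {1: 0.3, 2: 0.3, 3: 0.4}
print(importance_sampling_estimate(pi, b))   # approx E_pi[X] = 0.7*1 + 0.2*2 + 0.1*3 = 1.4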
