Reinforcement Learning Notes

Uploaded by

Mr. RAVI KUMAR I

REINFORCEMENT LEARNING

UNIT - I

Basics of probability and linear algebra, Definition of a stochastic
multi-armed bandit, Definition of regret, Achieving sublinear regret, UCB
algorithm, KL-UCB, Thompson Sampling.

UNIT - II

Markov Decision Problem, policy, and value function, Reward models
(infinite discounted, total, finite horizon, and average), Episodic &
continuing tasks, Bellman's optimality operator, and Value iteration &
policy iteration

UNIT - III

The Reinforcement Learning problem, prediction and control problems,
Model-based algorithm, Monte Carlo methods for prediction, and Online
implementation of Monte Carlo policy evaluation

UNIT - IV

Bootstrapping; TD(0) algorithm; Convergence of Monte Carlo and batch
TD(0) algorithms; Model-free control: Q-learning, Sarsa, Expected Sarsa.

UNIT - V

n-step returns; TD(λ) algorithm; Need for generalization in practice;
Linear function approximation and geometric view; Linear TD(λ). Tile
coding; Control with function approximation; Policy search; Policy
gradient methods; Experience replay; Fitted Q Iteration; Case studies.

prepared by- P Srinivas Rao (ns lectures youtube channel)


KL-UCB Algorithm
KL = Kullback-Leibler

The KL-UCB algorithm is designed to solve multi-armed bandit problems,
where an agent faces a set of actions (arms) and needs to choose (select)
actions over time to maximize the cumulative reward. KL-UCB aims to
balance exploration of uncertain actions (actions that are selected less
frequently or that give less reward) with exploitation of actions that
seem promising (actions that are selected a greater number of times and
that give more reward), based on the available information.

The algorithm uses the KL divergence to measure the uncertainty between
the empirical (observed) distribution of rewards and the unknown true
reward distribution for each action:

    KL(observed reward distribution of action a || actual reward distribution of action a)

It then adjusts the upper confidence bounds for actions based on this
measure.

Kullback-Leibler (KL) divergence (difference) is a mathematical measure
that calculates the difference between two probability distributions.

Given two probability distributions P and Q over the same sample space X
of the experiment, the KL divergence from P to Q is defined as follows:

$$D_{KL}(P \,\|\, Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$

for each x in the sample space X = {x1, x2, ..., xn} (the outcomes).

For continuous distributions, with densities p and q:

$$D_{KL}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$
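
The discrete definition above can be checked numerically. The following is a minimal sketch, not part of the original notes: the function name `kl_divergence` and the two example distributions are assumptions chosen for illustration. It uses the natural logarithm, so the result is in nats, and it shows that the divergence is not symmetric in P and Q.

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) in nats.

    p and q are lists of probabilities over the same outcomes.
    Terms with P(x) = 0 contribute 0; if Q(x) = 0 where P(x) > 0,
    the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must cover the same outcomes")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total

# Hypothetical distributions over two outcomes (illustration only).
p = [0.5, 0.5]
q = [0.9, 0.1]
print(round(kl_divergence(p, q), 4))  # 0.5108
print(round(kl_divergence(q, p), 4))  # 0.3681
```

Because D_KL(P || Q) and D_KL(Q || P) differ, the order of the arguments matters when the divergence is used inside a confidence bound.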


Pseudocode of the KL-UCB algorithm

Initialize, for each arm (action) a in {a1, a2, a3, a4, ...}:
    t_a = 1              (time step for arm a)
    μ_a = sample_reward  (initial estimate of the mean reward of action a)
    N_a = 1              (number of times arm a has been pulled or
                          selected (executed))

Repeat for each time step t:
    For each arm a:
        - calculate the KL divergence KL_a between the current (observed)
          empirical distribution and its updated version (after t_a steps)
        - calculate the upper confidence bound
              U_a = solve_KL(KL_a, log(t) / N_a)
    - choose the arm A_t with the highest upper confidence bound:
          A_t = argmax_a U_a
    - observe the reward R_t after selecting (performing or executing)
      action A_t
    - update the estimates for a = A_t:
          t_a = t_a + 1
          N_a = N_a + 1
          μ_a = ((N_a - 1) × μ_a + R_t) / N_a
      where μ_a on the right-hand side is the previous reward mean (the
      value after N_a - 1 selections), R_t is the reward received at time
      t, and N_a is the number of selections of arm a.
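
The mean update in the last step of the pseudocode can be checked against an ordinary average. This is a small sketch under stated assumptions: the helper name `update_mean` and the reward values are hypothetical and do not come from the notes.

```python
def update_mean(mean, count, reward):
    """Apply one step of the running-mean update.

    Returns the new mean and the new selection count after observing
    one more reward for the same arm.
    """
    count += 1
    mean = ((count - 1) * mean + reward) / count
    return mean, count

# Hypothetical rewards observed for a single arm (illustration only).
rewards = [4.0, 10.0, 8.0, 12.0]

# The first reward initializes the estimate, as in the pseudocode.
mean, count = rewards[0], 1
for r in rewards[1:]:
    mean, count = update_mean(mean, count, r)

print(count, mean)
print(sum(rewards) / len(rewards))  # batch average for comparison
```

The running update needs only the previous mean and the count, so the agent does not have to store every past reward.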
KL-UCB balances exploration and exploitation by adjusting the upper
confidence bounds based on KL divergence. The choice of the function
solve_KL is important and depends on the specific RL problem and the
properties of the reward distribution. KL-UCB is more advanced than UCB
in handling uncertain reward distributions, but it incurs higher
computational costs due to solving for the KL divergences.
THOMPSON SAMPLING

The same action can give different rewards in different selections
(executions) due to the changing nature of the environment.

Ex: reward value distributions (samples of rewards from selecting each
action):
    action a1: P(a1) = {15, 6, 4, 10, …, 8, 12}
    action a2: P(a2) = {20, 15, 8, 15, 2…, 25}
In Thompson sampling, the agent maintains a probability distribution
(usually a Bayesian posterior distribution) for each arm (action),
representing its beliefs about the true reward distribution of that
action. At each time step, the agent samples (draws some reward sample
values) a reward estimate from each action's distribution and then
selects the action with the highest sampled value (estimate).

Thompson sampling is a heuristic for choosing actions that addresses the
exploration-exploitation dilemma in the multi-armed bandit problem. It
consists of choosing the action that maximizes the expected reward with
respect to a randomly drawn belief (a randomly drawn sample).


Consider a set of contexts (states of the agent) X, a set of actions A,
and rewards in R. The aim of the player is to play actions under the
various contexts so as to maximize the cumulative reward.
In each round, the player obtains a context (state) x ∈ X, plays an
action a ∈ A, and receives a reward r ∈ R following a distribution that
depends on the context and the issued action.
The elements of Thompson sampling are as follows:
1) A likelihood function P(r | θ, a, x)
2) A set Θ of parameters θ of the distribution of r
3) A prior distribution P(θ) on these parameters
4) Past observation triplets D = {(x; a; r)}
5) A posterior distribution P(θ | D) ∝ P(D | θ) P(θ), where P(D | θ) is
   the likelihood function

Thompson sampling consists in playing the action a* ∈ A according to the
probability that it maximizes the expected reward. Action a* is chosen
with probability

$$\int \mathbb{I}\left[ E(r \mid a^*, x, \theta) = \max_{a'} E(r \mid a', x, \theta) \right] P(\theta \mid D) \, d\theta$$

where I is the indicator function.

The rule is implemented by sampling. In each round (selection or time
step), parameters θ* are sampled from the posterior P(θ | D), and an
action a* is chosen that maximizes E[r | θ*, a*, x], that is, the
expected reward given the sampled parameters (θ*), the action, and the
current context x. This means that the player (agent) instantiates their
beliefs randomly in each round according to the posterior distribution
and then acts optimally according to them.
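
The sampling rule can be sketched for the simplest case. The example below is an illustration under assumptions that are not in the notes: there is no context x, rewards are Bernoulli (0 or 1), and each arm starts with a uniform Beta(1, 1) prior, so the posterior over each arm's success probability is again a Beta distribution. The function name, the arm probabilities, and the seed are hypothetical.

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling on simulated arms.

    Returns the pull counts and the Beta posterior parameters per arm.
    """
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1.0] * k  # 1 + number of observed rewards equal to 1
    beta = [1.0] * k   # 1 + number of observed rewards equal to 0
    counts = [0] * k

    for _ in range(horizon):
        # Sample parameters theta* from the current posterior of each arm.
        theta = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        # Act optimally with respect to the sampled beliefs.
        a = max(range(k), key=lambda i: theta[i])
        # Observe a simulated Bernoulli reward.
        r = 1 if rng.random() < true_means[a] else 0
        # Posterior update: the Beta prior is conjugate to the Bernoulli likelihood.
        alpha[a] += r
        beta[a] += 1 - r
        counts[a] += 1
    return counts, alpha, beta

# Hypothetical arm success probabilities (illustration only).
counts, alpha, beta = thompson_sampling([0.2, 0.5, 0.8], horizon=2000)
print(counts)
```

Arms with few observations have wide posteriors, so they occasionally produce a high sample and get explored; as evidence accumulates, the samples concentrate and the best arm is chosen most of the time.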
Prior probability represents what is originally believed before new
evidence is introduced.
P(A) = probability of event A occurring (prior probability)
P(B) = probability of event B occurring (evidence or marginal likelihood)
The posterior probability takes this new information into account.
P(A|B) = the probability of event A occurring provided the evidence B
P(A|B) = P(B|A) P(A) / P(B)
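
A short numeric sketch of this rule follows. The probabilities are hypothetical values chosen for illustration, and the evidence P(B) is expanded with the law of total probability, P(B) = P(B|A) P(A) + P(B|not A) P(not A), which the notes do not spell out.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from the prior P(A) and the two likelihoods of B."""
    # Evidence (marginal likelihood) via the law of total probability.
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence

# Hypothetical numbers: P(A) = 0.3, P(B|A) = 0.8, P(B|not A) = 0.2.
print(round(posterior(0.3, 0.8, 0.2), 4))  # 0.6316
```

Observing B raises the belief in A from the prior 0.3 to about 0.63, because B is four times as likely under A as under not A.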