Introduction to RL
[Figure: the agent–environment loop — the agent observes a state, takes an action, and the environment returns the next state and a reward]

• S – set of states
• A – set of actions
• T(s,a,s') = P(s'|s,a) – the probability of transitioning from s to s' given action a
• R(s,a) – the expected reward for taking action a in state s
$$R(s,a) = \sum_{s'} P(s' \mid s, a)\, r(s,a,s') = \sum_{s'} T(s,a,s')\, r(s,a,s')$$

[Figure: a sample trajectory $s_0 \xrightarrow{a_0,\, r_0} s_1 \xrightarrow{a_1,\, r_1} s_2 \xrightarrow{a_2,\, r_2} s_3$]
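A minimal sketch of these definitions in Python: a tiny transition model T and reward function r, and the expected reward R(s,a) computed exactly as in the formula above. The specific states, actions, and probabilities are invented for illustration.

```python
# T[(s, a)] maps each successor state s' to P(s' | s, a)
T = {
    ("s0", "a0"): {"s1": 0.8, "s0": 0.2},
    ("s0", "a1"): {"s1": 0.1, "s0": 0.9},
}

# r[(s, a, s')] is the immediate reward for that transition
r = {
    ("s0", "a0", "s1"): 1.0,
    ("s0", "a0", "s0"): 0.0,
    ("s0", "a1", "s1"): 1.0,
    ("s0", "a1", "s0"): 0.0,
}

def expected_reward(s, a):
    """R(s, a): rewards averaged over successors, weighted by T(s, a, s')."""
    return sum(p * r[(s, a, s2)] for s2, p in T[(s, a)].items())

print(expected_reward("s0", "a0"))  # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
```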
Exploration versus Exploitation
• We want a reinforcement learning agent to earn lots of reward
• The agent must prefer past actions that have been found to be effective at producing reward
• The agent must exploit what it already knows to obtain reward
• The agent must select untested actions to discover reward-producing actions
• The agent must explore actions to make better action selections in the future
• Trade-off between exploration and exploitation (one common selection rule is sketched below)
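One simple way to balance the two sides of this trade-off is epsilon-greedy action selection. The slides name the trade-off but not this particular rule, so treat the sketch below as an illustrative assumption; the action values in q are invented.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, explore (pick a random action);
    otherwise exploit (pick the action with the highest estimated value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: try untested actions
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# Example: estimated values for 3 actions; action 1 looks best so far
q = [0.2, 0.9, 0.5]
actions = [epsilon_greedy(q) for _ in range(1000)]
print(actions.count(1) / len(actions))  # mostly exploits action 1, occasionally explores
```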
Passive Learning versus Active Learning
Passive learning
The agent simply watches the world going by and tries to learn the utilities of being in various states (a sketch of this follows)
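A minimal sketch of passive learning by direct utility estimation: the agent only observes episodes generated under a fixed policy and averages the discounted returns seen from each state. The discount factor and the episodes below are illustrative assumptions, not from the slides.

```python
from collections import defaultdict

GAMMA = 0.9  # assumed discount factor

def estimate_utilities(episodes):
    """episodes: list of [(state, reward), ...] trajectories observed by the agent."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G accumulates the return from each state onward
        for state, reward in reversed(episode):
            G = reward + GAMMA * G
            totals[state] += G
            counts[state] += 1
    # Utility estimate = average observed return per state
    return {s: totals[s] / counts[s] for s in totals}

episodes = [
    [("s0", 0.0), ("s1", 0.0), ("s3", 1.0)],
    [("s0", 0.0), ("s2", 0.0), ("s3", 1.0)],
]
print(estimate_utilities(episodes))  # e.g. s0 -> 0.81, s1/s2 -> 0.9, s3 -> 1.0
```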
Active learning
The agent does not simply watch; it also acts
Passive learning scenario