37 RL
• RL deals with agents that must sense and act upon their environment.
This combines classical AI and machine learning techniques.
It is the most comprehensive problem setting.
• Examples:
• A robot cleaning my room and recharging its battery
• Robot-soccer
• How to invest in shares
• Modeling the economy through rational agents
• Learning how to fly a helicopter
• Scheduling planes to their destinations
• and so on
The Big Picture
Your action influences the state of the world, which in turn determines your reward.
Complications
• The outcome of your actions may be uncertain
• You may not be able to perfectly sense the state of the world
• You may have no clue (model) about how the world responds to your actions.
• You may have no clue (model) of how rewards are being paid off.
• How much time do you need to explore uncharted territory before you
exploit what you have learned?
The Task
• To learn an optimal policy that maps states of the world to actions of the agent.
I.e., if this patch of room is dirty, I clean it. If my battery is empty, I recharge it.
• We assume that we know what the reward will be if we perform action “a” in
state “s”: r = r(s,a).
• We also assume we know what the next state of the world will be if we perform
action “a” in state “s”: s’ = δ(s,a). (A small sketch of such a model follows below.)
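As a concrete illustration (not part of the original slides), here is a minimal Python sketch of such a known, deterministic model; the state names, action names and reward values are purely hypothetical.

# Hypothetical deterministic model for a toy cleaning-robot world.
# rewards[(s, a)] plays the role of r(s, a); transitions[(s, a)] plays the role of delta(s, a).
rewards = {
    ("dirty", "clean"):     1.0,
    ("dirty", "recharge"):  0.0,
    ("clean", "clean"):     0.0,
    ("clean", "recharge"):  0.0,
}
transitions = {
    ("dirty", "clean"):     "clean",
    ("dirty", "recharge"):  "dirty",
    ("clean", "clean"):     "clean",
    ("clean", "recharge"):  "clean",
}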
Example I
• Consider some complicated graph, in which we would like to find the shortest
path from a start node Si to a goal node G.
• Given Q(s,a) it is now trivial to execute the optimal policy, without knowing
r(s,a) and δ(s,a): we simply choose π(s) = argmax_a Q(s,a) (see the sketch below).
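A minimal sketch of executing this greedy policy, assuming (hypothetically) that Q is stored as a dictionary keyed by (state, action) and that actions[s] lists the actions available in state s:

def greedy_policy(Q, actions, s):
    # pi(s) = argmax_a Q(s, a): pick the action with the highest Q-value in state s.
    return max(actions[s], key=lambda a: Q[(s, a)])

# Example usage with made-up values:
# Q = {("dirty", "clean"): 1.0, ("dirty", "recharge"): 0.2}
# greedy_policy(Q, {"dirty": ["clean", "recharge"]}, "dirty")  # -> "clean"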
Example II
• Check that the values satisfy Q(s,a) = r(s,a) + γ max_{a’} Q(δ(s,a), a’).
Q-Learning
• However, imagine the robot is exploring its environment, trying new actions
as it goes.
• At every step it receives some reward “r” and observes the environment change
into a new state s’ after it takes action a.
• How can we use these observations (s, a, s’, r) to learn a model?
• The Q-learning update for an observed transition (s, a, s’, r) is
Q̂(s,a) ← r + γ max_{a’} Q̂(s’,a’),   where s’ = s_{t+1}.
(A short code sketch of this update follows at the end of this list.)
• Note that s’ is closer to the goal, and hence more “reliable”, but still an estimate itself.
• We are learning useful things about explored state-action pairs. These are typically
most useful because they are likely to be encountered again.
• Under suitable conditions, these updates can actually be proved to converge to the
real answer.
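A minimal sketch of this tabular update (for the deterministic case), assuming the same hypothetical dictionary representation as above; the function and variable names are illustrative only:

from collections import defaultdict

gamma = 0.9                     # discount factor
Q = defaultdict(float)          # Q-hat(s, a), initialised to 0 for unseen pairs

def q_update(s, a, r, s_next, actions):
    # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
    best_next = max(Q[(s_next, a2)] for a2 in actions[s_next])
    Q[(s, a)] = r + gamma * best_next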
Example Q-Learning
• It is very important that the agent does not simply follow the current policy
when learning Q (off-policy learning). The reason is that you may get stuck
in a suboptimal solution, i.e. there may be other solutions out there that you
have never seen. (One simple exploration scheme is sketched after this list.)
• One can actively search for state-action pairs for which Q(s,a) is
expected to change a lot (prioritized sweeping).
• One can do updates along the sampled path much further back than just
one step (TD(λ) learning).
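As one common way to keep exploring instead of always following the current policy (not necessarily the scheme used in the original course), here is an ε-greedy action-selection sketch; Q, actions and epsilon are hypothetical names as before:

import random

def epsilon_greedy(Q, actions, s, epsilon=0.1):
    # With probability epsilon try a random action (explore),
    # otherwise act greedily with respect to the current Q estimate (exploit).
    if random.random() < epsilon:
        return random.choice(actions[s])
    return max(actions[s], key=lambda a: Q[(s, a)])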
Extensions
• To deal with stochastic environments, we need to maximize the
expected future discounted reward: E[ Σ_t γ^t r_t ].
• Often the state space is too large to deal with all states. In this case we
need to learn a parameterized function: Q_θ(s,a) = Σ_k θ_k φ_k(s,a).
The features φ_k are fixed measurements of the state (e.g. # stones on the board).
We only learn the parameters θ.
• Update rule (start in state s, take action a, observe reward r and end up in state s’):
θ_k ← θ_k + α [ r + γ max_{a’} Q_θ(s’,a’) − Q_θ(s,a) ] φ_k(s,a),
where the bracketed term is the change in Q (the TD error). A code sketch follows below.
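A minimal sketch of this update with a linear function approximator; the feature map phi(s, a) (assumed to return a fixed-length numpy vector), the actions dictionary and the step size alpha are all hypothetical:

import numpy as np

alpha, gamma = 0.1, 0.9

def q_value(theta, phi, s, a):
    # Q_theta(s, a) = sum_k theta_k * phi_k(s, a)
    return float(np.dot(theta, phi(s, a)))

def td_update(theta, phi, s, a, r, s_next, actions):
    # The "change in Q" (TD error) scales the feature vector phi(s, a).
    best_next = max(q_value(theta, phi, s_next, a2) for a2 in actions[s_next])
    td_error = r + gamma * best_next - q_value(theta, phi, s, a)
    return theta + alpha * td_error * phi(s, a)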
Conclusion
• Reinforcement learning addresses a very broad and relevant question:
How can we learn to survive in our environment?
http://elsy.gdan.pl/index.php?option=com_content&task=view&id=20&Itemid=39