Introduction To Reinforcement Learning
Outline
• What is Reinforcement Learning
• RL Formalism
1. Reward
2. The agent
3. The environment
4. Actions
5. Observations
• Markov Decision Process
1. Markov Process
2. Markov Reward Process
3. Markov Decision Process
• Learning Optimal Policies
What is Reinforcement Learning?
Describe this:
• Mouse
• A maze with walls, food and electricity
• Mouse can move left, right, up and down
• Mouse wants the cheese but not electric shocks
• Mouse can observe the environment
• Find some magic set of methods that will allow our mouse to learn on its own how to avoid electricity and gather as much food as possible.
• Agent
• Environment
Communication channels:
• Actions
• Reward
• Observations
Example:
• You: the agent
• The environment: the universe!
Why? Convenience: the agent-environment boundary is drawn wherever it is most convenient to model.
RL within the ML Spectrum
What makes RL different from other ML paradigms?
Example:
System: the weather in Boston.
States: we can observe the current day as sunny or rainy.
History: a sequence of observations over time forms a chain of states, such as [sunny, sunny, rainy, rainy, …].
Weather example:
The probability of a sunny day being followed by a rainy day is independent of the number of sunny days we've seen in the past.
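This independence from history is the Markov property. Written out in the deck's notation (the formula is my addition; it does not appear on the slide):

P(s_{t+1} | s_t, s_{t-1}, …, s_1) = P(s_{t+1} | s_t)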
Notes:
This example is really naïve, but it's important to understand the limitations.
We can, for example, extend the state space to include other factors.
Markov Process (cont)
Transition probabilities are expressed as a transition matrix: a square matrix of size N×N, where N is the number of states in our model.
           sunny   rainy
  sunny     0.8     0.2
  rainy     0.1     0.9
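To make this concrete, here is a minimal sketch (my code, not from the slides) that simulates the chain by repeatedly sampling the next state from the current state's row of the matrix:

    import random

    STATES = ["sunny", "rainy"]
    # P[s][s2] = probability of moving from state s to state s2 (one row per state)
    P = {"sunny": {"sunny": 0.8, "rainy": 0.2},
         "rainy": {"sunny": 0.1, "rainy": 0.9}}

    def step(state):
        # The next state depends only on the current state (Markov property).
        weights = [P[state][s2] for s2 in STATES]
        return random.choices(STATES, weights=weights)[0]

    chain = ["sunny"]
    for _ in range(9):
        chain.append(step(chain[-1]))
    print(chain)  # e.g. ['sunny', 'sunny', 'sunny', 'rainy', ...]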
Markov Reward Process
In other words, the reward the agent obtains now depends not only on the state it ends up in but also on the action that leads to this state. It's similar to how, when you put effort into something, you usually gain skills and knowledge even if the result of your efforts wasn't too successful.
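For reference, the two standard quantities a Markov reward process adds on top of a Markov process are the discounted return and the state value (standard definitions, stated here as my addition; γ ∈ [0, 1] is the discount factor):

G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + …
V(s) = E[G_t | S_t = s]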
Markov Decision Process
Policy
Intuitively, a policy is some set of rules that controls the agent's behavior. Formally, a policy π(a|s) gives the probability of taking action a in state s.
Policy (cont)
π* = argmax_π V_π(s)
Markov Decision Process (cont)
Notes:
A. The first action is not necessarily taken from the optimal policy.
B. The expectation is needed because, given an action, the next state is stochastic.
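Both notes refer to the action-value function Q(s, a). A standard way to write it (my formulation, assuming a discount factor γ) makes points A and B explicit: the first action a is free, and the expectation averages over the random next state s′:

Q(s, a) = E[r + γ·V(s′) | s, a] = r(s, a) + γ·Σ_{s′} p(s′ | s, a)·V(s′)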
Dynamic Programming
WORKING BACKWARDS (T is the terminal state):
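A standard finite-horizon backup makes "working backwards" concrete (my formulation of what the slide sketches):

V_T(s) = 0 for the terminal state
V_t(s) = max_a [ r(s, a) + Σ_{s′} p(s′ | s, a)·V_{t+1}(s′) ], for t = T-1, …, 0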
Model-Based and Model-Free Methods
Model-Based: knowing the transition matrix.
Model-Free: not knowing the transition matrix; the agent must learn from sampled experience instead.
Model-Based Methods
[Diagram: a three-state chain S0 → S1 → S2. The move S0 → S1 has reward -1, S1 → S2 has reward +3, and S1 → S0 has reward -1; S2 is terminal.]
Step 0: V(S0) = V(S1) = V(S2) = 0
Step 1: Q(S0, a1) = R(S0, a1) + V(S1) = -1 + 0 = -1
Step 2:
V(S0) = max_a Q(S0, a) = -1
V(S1) = max_a Q(S1, a) = 3
π(S0) = Right
π(S1) = Right
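Putting the steps together, here is a minimal value-iteration sketch for this chain (my code, not from the slides; the action names 'L'/'R', deterministic moves, and no discounting are assumptions chosen to match the arithmetic above):

    # Toy chain from the slides: S0 -(R,-1)-> S1 -(R,+3)-> S2, S1 -(L,-1)-> S0.
    # MODEL[(state, action)] = (reward, next_state); S2 is terminal.
    MODEL = {
        ("S0", "R"): (-1, "S1"),
        ("S1", "L"): (-1, "S0"),
        ("S1", "R"): (+3, "S2"),
    }

    V = {"S0": 0.0, "S1": 0.0, "S2": 0.0}    # Step 0: all values start at zero
    for _ in range(10):                      # repeat Bellman backups until stable
        for s in ("S0", "S1"):
            V[s] = max(r + V[s2]
                       for (s_, a), (r, s2) in MODEL.items() if s_ == s)

    # Greedy policy extraction: pick the action with the best one-step lookahead.
    policy = {}
    for s in ("S0", "S1"):
        acts = [a for (s_, a) in MODEL if s_ == s]
        policy[s] = max(acts, key=lambda a: MODEL[(s, a)][0] + V[MODEL[(s, a)][1]])
    print(V, policy)

After the first sweep V(S0) = -1 and V(S1) = 3, matching Step 2 above; further sweeps propagate the +3 backwards and V(S0) converges to 2, with the greedy policy choosing Right in both states.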
Policy Iteration
Policy evaluation of the current policy gives:
V(S0) = -6/5
V(S1) = -8/5
Example (same chain as above):
Step 1:
Q(S0, a1) = -1 + ½(-8/5) = -9/5
Q(S1, a1) = -1 + ½(-6/5) = -8/5
Q(S1, a2) = 3 + ½·0 = 3
Update Policy:
Update:
V(S0) = max_a Q(S0, a) = -9/5
V(S1) = max_a Q(S1, a) = 3
π(S0) = Right
π(S1) = Right
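For comparison, a minimal policy-iteration sketch on the same chain (my code; the discount factor of ½ is an assumption chosen to mirror the ½ factor in the Q-values above, so the intermediate numbers will not exactly match the slide's):

    GAMMA = 0.5
    MODEL = {("S0", "R"): (-1, "S1"),
             ("S1", "L"): (-1, "S0"),
             ("S1", "R"): (+3, "S2")}
    ACTIONS = {"S0": ["R"], "S1": ["L", "R"]}

    policy = {"S0": "R", "S1": "L"}     # start from an arbitrary policy
    while True:
        # 1. Policy evaluation: iterate V to a fixed point under the fixed policy.
        V = {"S0": 0.0, "S1": 0.0, "S2": 0.0}
        for _ in range(50):
            for s, a in policy.items():
                r, s2 = MODEL[(s, a)]
                V[s] = r + GAMMA * V[s2]
        # 2. Policy improvement: act greedily with respect to V.
        new_policy = {}
        for s, acts in ACTIONS.items():
            new_policy[s] = max(acts,
                                key=lambda a: MODEL[(s, a)][0] + GAMMA * V[MODEL[(s, a)][1]])
        if new_policy == policy:        # stable policy => optimal, stop
            break
        policy = new_policy
    print(policy, V)                    # converges to Right in both states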
Model-Free Methods
● On-Policy Learning
● Off-Policy Learning
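The demo linked below compares SARSA and Q-learning; the essential difference fits in two tabular update rules (a sketch with my naming, where Q is a dict of dicts Q[state][action]): SARSA (on-policy) bootstraps from the action actually taken next, while Q-learning (off-policy) bootstraps from the greedy action.

    ALPHA, GAMMA = 0.1, 0.99  # learning rate and discount (typical values)

    def q_learning_update(Q, s, a, r, s2):
        # Off-policy: bootstrap from the greedy action in s2,
        # regardless of which action the behavior policy actually takes.
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2].values()) - Q[s][a])

    def sarsa_update(Q, s, a, r, s2, a2):
        # On-policy: bootstrap from a2, the action actually taken in s2.
        Q[s][a] += ALPHA * (r + GAMMA * Q[s2][a2] - Q[s][a])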
Demo:
https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/