Chapter 18 - Reinforcement Learning
The agent is programmed to seek long-term, maximum overall reward in order to reach an optimal solution.
These long-term goals help prevent the agent from getting stuck on less important, short-term ones.
Over time, the agent learns to avoid negative outcomes and to seek positive ones.
Once in its next state, a new set of rewarding actions becomes available to it.
The cumulative reward is the sum of the rewards the agent receives from the actions it chooses to perform over time.
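As a minimal sketch of this accumulation (the `env` interface and the random placeholder policy are illustrative assumptions in the style of Gymnasium, not anything defined in these slides):

```python
# Minimal sketch: accumulate the cumulative reward over one episode.
# `env` is a hypothetical environment exposing reset(), actions(state),
# and step(action) -> (next_state, reward, done).
import random

def run_episode(env, max_steps=100):
    state = env.reset()
    cumulative_reward = 0.0
    for _ in range(max_steps):
        action = random.choice(env.actions(state))  # placeholder policy
        state, reward, done = env.step(action)
        cumulative_reward += reward  # running sum of rewards received
        if done:
            break
    return cumulative_reward
```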
Typical applications of reinforcement learning:
• Gaming.
• Resource management.
• Personalized recommendations.
For example, if you were to deploy a robot that relied on reinforcement learning to navigate a complex physical environment, it would seek new states and take different actions as it moved. With this type of reinforcement learning problem, however, it is difficult to consistently take the best actions in a real-world environment, because the environment changes so frequently.
The time required to ensure that learning is done properly with this method can limit its usefulness and place heavy demands on computing resources. As the training environment grows more complex, so do the demands on time and compute.
Supervised learning can deliver faster, more efficient results than reinforcement learning if a sufficient amount of labeled data is available, since it can be employed with fewer resources.
Reinforcement learning approaches differ in the strategies their agents use to explore the environment.
MDP

[Figure: example Markov decision process over states S1–S5 with actions a and b; transitions occur with probabilities such as 0.1 and 1, and some transitions carry rewards R=2 and R=0.]
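As a hedged sketch of how such an MDP can be encoded (the figure's exact transition structure is not fully recoverable, so the probabilities and rewards below are illustrative stand-ins):

```python
# Illustrative MDP encoding: transitions[state][action] is a list of
# (probability, next_state, reward) triples. Values are assumptions
# standing in for the figure, not the slide's exact MDP.
transitions = {
    "S1": {"a": [(0.9, "S2", 0.0), (0.1, "S4", 0.0)],
           "b": [(1.0, "S5", 0.0)]},
    "S2": {"a": [(1.0, "S3", 2.0)]},   # transition with reward R=2
    "S3": {},                          # terminal state
    "S4": {"a": [(1.0, "S1", 0.0)]},   # reward R=0
    "S5": {"a": [(1.0, "S1", 0.0)]},
}
```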
Bellman Equation

The Bellman equation expresses the value of a state as the immediate reward plus the discounted value of the next state: $V(s) = r + \gamma\, V(s')$.
If gamma is set to 0, the $V(s')$ term is completely cancelled out and the model cares only about the immediate reward.
If gamma is set to 1, the model weights potential future rewards just as much as it weights immediate rewards.
The optimal value of gamma usually lies somewhere strictly between 0 and 1, so that rewards farther in the future have a diminishing effect.
Unrolling this recursion expresses the value of a state as the discounted sum of the rewards collected from it onward:

$V(s) = r_0 + \gamma\, r_1 + \gamma^2 r_2 + \dots$
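A small numeric sketch of how gamma reweights this sum (the reward sequence is chosen arbitrarily for illustration):

```python
# Discounted return of a reward sequence under different discount factors.
rewards = [1.0, 5.0, 5.0, 5.0]  # illustrative rewards, not from the slides

def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))  # 1.0    -> only the immediate reward
print(discounted_return(rewards, 0.9))  # 13.195 -> future rewards attenuated
print(discounted_return(rewards, 1.0))  # 16.0   -> all rewards weighted equally
```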
Example:

[Figure: state diagram for this example; one state carries reward R=1.]

With $\gamma = 0.9$, the value of state $S_5$ is

$V(S_5) = 0 + 0.9 \cdot 5 + 0.9^2 \cdot 5 + \dots$
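One step the slide leaves implicit: the tail of this sum is a geometric series, so it converges to a closed-form value:

$V(S_5) = 5 \sum_{t=1}^{\infty} 0.9^t = 5 \cdot \frac{0.9}{1 - 0.9} = 45$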
Q-learning: Markov Decision Process + Reinforcement Learning
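Q-learning learns an estimate $Q(s, a)$ of the long-term value of taking action $a$ in state $s$, updating it from experience with the standard rule $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$. A minimal tabular sketch follows; the environment interface is an assumption for illustration:

```python
# Tabular Q-learning sketch. The env interface (reset/actions/step) is a
# hypothetical stand-in; the update rule itself is the standard one.
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: mostly exploit the current estimates,
            # occasionally explore a random action.
            if random.random() < epsilon:
                action = random.choice(env.actions(state))
            else:
                action = max(env.actions(state), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Target: immediate reward plus discounted value of the best
            # action available in the next state (0 if terminal).
            best_next = max((Q[(next_state, a)] for a in env.actions(next_state)),
                            default=0.0)
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = next_state
    return Q
```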
Maze Example: Utility
• Uncertainty:
– We know which state we are in (the environment is fully observable).
– But we are not sure that the commanded action will be executed exactly.
Uncertainty
• No uncertainty:
– An action a deterministically causes a transition from a state s to another state s'.
• With uncertainty:
– An action a causes a transition from a state s to another state s' with some probability T(s, a, s').
– T(s, a, s') is called the transition probability from state s to state s' through action a.
– In general, we need $|S|^2 \times |A|$ numbers to store all the transition probabilities, as the sketch below illustrates.
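A hedged sketch of that storage cost, using a dense array indexed as T[s, a, s'] (the sizes and probabilities are illustrative assumptions):

```python
# Dense storage of transition probabilities: one float per (s, a, s')
# triple, i.e. |S| * |A| * |S| = |S|^2 * |A| numbers in total.
import numpy as np

n_states, n_actions = 5, 2          # illustrative sizes
T = np.zeros((n_states, n_actions, n_states))

# Example entries (made up): from state 0, action 1 reaches state 2 with
# probability 0.9 and state 3 with probability 0.1.
T[0, 1, 2] = 0.9
T[0, 1, 3] = 0.1

# Each (s, a) slice is a probability distribution over next states.
assert np.isclose(T[0, 1].sum(), 1.0)
print(T.size)  # 5 * 2 * 5 = 50 stored numbers
```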