Reinforcement Learning
Source: David Silver (2015), Introduction to reinforcement learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Branches of Machine Learning (ML)
• Supervised Learning
  • Labeled data
  • Direct feedback
  • Predict
• Unsupervised Learning
  • No labels
  • No feedback
  • Find hidden structure
• Reinforcement Learning (RL)
  • Decision process
  • Reward system
  • Learn a series of actions
Source: David Silver (2015), Introduction to reinforcement learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Reinforcement Learning Applications
• RL solves problems that are sequential and have a long-term goal, such as game playing, robotics, resource management, and industrial automation.
• It is not suitable for problems where the solution can be obtained directly and complete labeled data is available for supervised learning, such as object detection or fraud detection.
Elements of Reinforcement Learning
• Agent
• Environment
• Policy
• Reward signal
• Value function
• Model
Elements of Reinforcement Learning
• Policy
• Agent’s behavior
• It is a map from state to action
• Deterministic policy: a = π(s)
• Stochastic policy: π(a|s) = P[A_t = a | S_t = s] (see the sketch after this slide)
• Reward signal
• The goal of a reinforcement learning problem
• Value function
• How good is each state and/or action
• A prediction of future reward
• Model
• Agent’s representation of the environment
Source: Richard S. Sutton & Andrew G. Barto (2018), Reinforcement Learning: An Introduction, 2nd Edition, A Bradford Book.
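As a rough illustration of the policy element, the sketch below contrasts the two kinds of policy in Python; the two states, two actions, and all probabilities are invented purely for illustration.

import random

# Hypothetical states and actions, for illustration only.
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

# Deterministic policy: a = π(s), a direct map from state to action.
deterministic_pi = {"s0": "right", "s1": "left"}

# Stochastic policy: π(a|s) = P[A_t = a | S_t = s], action probabilities per state.
stochastic_pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.6, "right": 0.4},
}

def act(state, stochastic=False):
    """Select an action for the given state under the chosen policy."""
    if not stochastic:
        return deterministic_pi[state]
    probs = stochastic_pi[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(act("s0"))                   # always "right"
print(act("s0", stochastic=True))  # "right" about 80% of the time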
Reinforcement Learning
• Value Based
• No Policy (Implicit)
• Value Function
• Policy Based
• Policy
• No Value Function
• Actor Critic
• Policy
• Value Function
Source: David Silver (2015), Introduction to reinforcement learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Examples of Rewards
• Make a humanoid robot walk
• +ve reward for forward motion
• -ve reward for falling over
• Play many different Atari games better than humans
• +/-ve reward for increasing/decreasing score
• Manage an investment portfolio
• +ve reward for each $ in bank
Source: David Silver (2015), Introduction to reinforcement learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Reinforcement Learning
• Model Free
• Policy and/or Value Function
• No Model
• Model Based
• Policy and/or Value Function
• Model
Source: David Silver (2015), Introduction to reinforcement learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Learning and Planning
• Two fundamental problems in
sequential decision making
• Reinforcement Learning
• The environment is initially unknown
• The agent interacts with environment
• The agent improves its policy
• Planning
• A model of the environment is known
• The agent performs computations with its model
(without any external interaction)
• The agent improves its policy
• a.k.a. deliberation, reasoning, introspection, pondering, thought, search
Source: David Silver (2015), Introduction to reinforcement learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Exploration and Exploitation
• Reinforcement learning is like trial-and-error learning
• The agent should discover a good policy
• From its experiences of the environment
• Without losing too much reward along the way
• Exploration finds more information about the
environment
• Exploitation exploits known information to maximise
reward
• It is usually important to explore as well as exploit
Source: David Silver (2015), Introduction to reinforcement learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Exploration and Exploitation
Examples
• Restaurant Selection
• Exploitation: Go to your favorite restaurant
• Exploration: Try a new restaurant
• Online Banner Advertisements
• Exploitation: Show the most successful advert
• Exploration: Show a different advert
• Oil Drilling
• Exploitation: Drill at the best known location
• Exploration: Drill at a new location
• Game Playing
• Exploitation: Play the move you believe is best
• Exploration: Play an experimental move
Source: David Silver (2015), Introduction to reinforcement learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
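A standard way to balance the two is ε-greedy action selection: exploit the action with the highest estimated value most of the time, and explore a random action with small probability ε. The sketch below is a minimal illustration; the action names and Q-values are made up.

import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))    # explore: pick any action at random
    return max(q_values, key=q_values.get)      # exploit: pick the highest-valued action

# Hypothetical action values for a single state (restaurant example above).
q = {"favorite_restaurant": 8.0, "new_restaurant": 2.0}
picks = [epsilon_greedy(q, epsilon=0.1) for _ in range(1000)]
print(picks.count("new_restaurant"))  # roughly 50 of 1000 picks explore the new option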
Prediction and Control
• Prediction: evaluate the future
• Given a policy
• Control: optimize the future
• Find the best policy
Source: David Silver (2015), Introduction to reinforcement learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Generalized Policy Iteration (GPI)
• Evaluation: V → v_π (estimate the value function of the current policy π)
• Improvement: π → greedy(V) (make the policy greedy with respect to V)
• Alternating evaluation and improvement drives π and V toward the optimal π* and v*
Source: Richard S. Sutton & Andrew G. Barto (2018), Reinforcement Learning: An Introduction, 2nd Edition, A Bradford Book.
Generalized Policy Iteration (GPI)
GPI refers to any interleaving of policy evaluation and policy improvement, independent of their granularity.
• Evaluation: Q → q_π
• Improvement: π → greedy(Q)
Source: Richard S. Sutton & Andrew G. Barto (2018), Reinforcement Learning: An Introduction, 2nd Edition, A Bradford Book.
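As a rough sketch of the GPI loop, the code below alternates one evaluation sweep (V → v_π) with a greedy improvement step (π → greedy(V)). The two-state MDP, its transition table P, and the discount factor are invented purely for illustration.

# Tiny hypothetical MDP: P[s][a] = list of (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma = 0.9
V = {s: 0.0 for s in P}          # value estimate for each state
pi = {s: "stay" for s in P}      # current policy

def backup(s, a):
    """One-step lookahead: expected immediate reward plus discounted successor value."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

for _ in range(100):             # generalized policy iteration loop
    for s in P:                  # evaluation sweep: V -> v_pi (one backup per state)
        V[s] = backup(s, pi[s])
    for s in P:                  # improvement: pi -> greedy(V)
        pi[s] = max(P[s], key=lambda a: backup(s, a))

print(pi)                        # the improved (here optimal) policy
print(V)                         # its estimated state values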
Bellman Equation
The Bellman equation decomposes the value function into two parts: the immediate reward plus the discounted value of the successor state.
This decomposition simplifies the computation of the value function: rather than summing rewards over many future time steps, we can solve a complex problem by breaking it into simpler, recursive subproblems and combining their optimal solutions.
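Written out in standard (Sutton & Barto) notation, the state-value form for a fixed policy π and the optimality form for the action-value function are:

v_π(s) = E_π[ R_{t+1} + γ · v_π(S_{t+1}) | S_t = s ]

q_*(s, a) = E[ R_{t+1} + γ · max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a ]

The second form is the one the Q-learning update below approximates from sampled transitions.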
Q Learning Algorithm
Q Learning Example
The transition rule of Q learning is a very simple formula:
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
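Read directly as code, the rule is a one-line update (a minimal sketch; storing R and Q as dictionaries keyed by (state, action) and passing in the valid actions per state are assumptions for illustration):

GAMMA = 0.8  # discount factor; the value used throughout this example

def q_update(Q, R, state, action, next_state, actions_from):
    """Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]."""
    best_future = max(Q.get((next_state, a), 0.0) for a in actions_from[next_state])
    Q[(state, action)] = R[(state, action)] + GAMMA * best_future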
Example Continued
Look at the second row (state 1) of matrix R. There are two possible actions from the current state 1: go to state 3, or go to state 5. By random selection, we choose to go to state 5 as our action.
Now let’s imagine what would happen if our agent were in state 5. Look at the sixth row of the reward matrix
R (i.e. state 5). It has 3 possible actions: go to states 1, 4, or 5.
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100
Example Continued
For the next episode, we start with a randomly chosen initial state. This time, we have state 3 as our initial
state. Look at the fourth row of matrix R; it has 3 possible actions: go to states 1, 2, or 4. By random
selection, we select to go to state 1 as our action. Now we imagine that we are in state 1. Look at the
second row of reward matrix R (i.e. state 1). It has 2 possible actions: go to state 3 or state 5. Then, we
compute the Q value:
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
Q(3, 1) = R(3, 1) + 0.8 * Max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * Max(0, 100) = 80
We use the matrix Q updated in the last episode: Q(1, 3) = 0 and Q(1, 5) = 100. The result of the computation is Q(3, 1) = 80 because the immediate reward is zero and the discounted future value contributes 0.8 × 100.
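Putting the episodes together, the sketch below reproduces the same two updates in code. The full 6×6 reward matrix R is the standard room-navigation matrix this example is based on (an assumption; only the rows for states 1, 3, and 5 are spelled out above), with -1 marking impossible transitions, 0 for allowed moves, and 100 for moves that reach the goal state 5.

import numpy as np

# Reward matrix R (assumed standard room example; -1 = no edge, 100 = reach goal state 5).
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
GAMMA = 0.8
Q = np.zeros_like(R, dtype=float)

def update(state, action):
    """Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]."""
    next_state = action                          # here an action means "move into that room"
    valid = np.where(R[next_state] >= 0)[0]      # actions available from the next state
    Q[state, action] = R[state, action] + GAMMA * Q[next_state, valid].max()

update(1, 5)        # first episode: from state 1, go to state 5
print(Q[1, 5])      # 100.0, matching Q(1, 5) above

update(3, 1)        # second episode: from state 3, go to state 1
print(Q[3, 1])      # 80.0, matching Q(3, 1) above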
Thank You