
Reinforcement Learning

Raghunath Reddy, IHub Data Foundation


Reinforcement learning
• Reinforcement learning is learning what to do (how to map situations to actions) so as to maximize a numerical reward signal.
• Trial-and-error search
• Delayed reward
• Absence of a model
• Partially observable states
• Large number of states
Source: Richard S. Sutton & Andrew G. Barto (2018), Reinforcement Learning: An Introduction, 2nd Edition, A Bradford Book.
Reinforcement Learning (RL)
Reinforcement learning sits at the intersection of many fields, each with its own name for the same underlying problem:
• Computer Science: Machine Learning
• Engineering: Optimal Control
• Neuroscience: Reward System
• Mathematics: Operations Research
• Psychology: Classical/Operant Conditioning
• Economics: Bounded Rationality
Source: David Silver (2015), Introduction to Reinforcement Learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Branches of Machine Learning (ML)
• Supervised Learning
  • Labeled data
  • Direct feedback
  • Predict an outcome
• Unsupervised Learning
  • No labels
  • No feedback
  • Find hidden structure
• Reinforcement Learning
  • Decision process
  • Reward system
  • Learn a series of actions
Source: David Silver (2015), Introduction to Reinforcement Learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Reinforcement Learning Applications
• RL solves problems that are sequential and whose goal is long-term, such as game playing, robotics, resource management, and industrial automation.
• RL is not suitable for problems where the solution can be obtained directly and complete information is available for supervised learning, such as object detection or fraud detection.
Elements of Reinforcement Learning
• Agent
• Environment
• Policy
• Reward signal
• Value function
• Model
Elements of Reinforcement Learning
• Policy
  • The agent's behavior
  • A map from state to action
  • Deterministic policy: a = π(s)
  • Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
• Reward signal
  • Defines the goal of the reinforcement learning problem
• Value function
  • How good each state and/or action is
  • A prediction of future reward
• Model
  • The agent's representation of the environment
Source: Richard S. Sutton & Andrew G. Barto (2018), Reinforcement Learning: An Introduction, 2nd Edition, A Bradford Book.
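To make the two kinds of policy concrete, here is a minimal Python sketch; the toy states, actions, and probabilities are invented for illustration and are not from the slides.

```python
import random

# Deterministic policy: a = pi(s), a fixed mapping from state to action.
deterministic_pi = {"s0": "left", "s1": "right"}

def act_deterministic(state):
    return deterministic_pi[state]

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s],
# a probability distribution over actions in each state.
stochastic_pi = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}

def act_stochastic(state):
    actions = list(stochastic_pi[state])
    weights = list(stochastic_pi[state].values())
    return random.choices(actions, weights=weights, k=1)[0]

print(act_deterministic("s0"))  # always 'left'
print(act_stochastic("s0"))     # 'left' about 90% of the time
```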
Reinforcement Learning
• Value Based
  • No policy (implicit)
  • Value function
• Policy Based
  • Policy
  • No value function
• Actor Critic
  • Policy
  • Value function
Source: David Silver (2015), Introduction to Reinforcement Learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Examples of Rewards
• Make a humanoid robot walk
  • +ve reward for forward motion
  • -ve reward for falling over
• Play many different Atari games better than humans
  • +/-ve reward for increasing/decreasing score
• Manage an investment portfolio
  • +ve reward for each $ in the bank
Source: David Silver (2015), Introduction to Reinforcement Learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Reinforcement Learning
• Model Free
  • Policy and/or value function
  • No model
• Model Based
  • Policy and/or value function
  • Model
Source: David Silver (2015), Introduction to Reinforcement Learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Learning and Planning
Two fundamental problems in sequential decision making:
• Reinforcement Learning
  • The environment is initially unknown
  • The agent interacts with the environment
  • The agent improves its policy
• Planning
  • A model of the environment is known
  • The agent performs computations with its model (without any external interaction)
  • The agent improves its policy
  • a.k.a. deliberation, reasoning, introspection, pondering, thought, search
Source: David Silver (2015), Introduction to Reinforcement Learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Exploration and Exploitation
• Reinforcement learning is like trial-and-error learning
• The agent should discover a good policy from its experiences of the environment, without losing too much reward along the way
• Exploration finds more information about the environment
• Exploitation exploits known information to maximise reward
• It is usually important to explore as well as exploit
Source: David Silver (2015), Introduction to Reinforcement Learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
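A standard way to trade off the two is ε-greedy action selection: with probability ε the agent explores with a random action, otherwise it exploits the action with the highest estimated value. A minimal sketch (the Q-table and ε below are illustrative assumptions, not from the slides):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action);
    otherwise exploit (the action with the highest Q-value)."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit

# Toy Q-table: one state 's' with two actions.
Q = {("s", "left"): 0.4, ("s", "right"): 0.7}
print(epsilon_greedy(Q, "s", ["left", "right"]))  # 'right' about 95% of the time
```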
Exploration and Exploitation Examples
• Restaurant Selection
  • Exploitation: go to your favorite restaurant
  • Exploration: try a new restaurant
• Online Banner Advertisements
  • Exploitation: show the most successful advert
  • Exploration: show a different advert
• Oil Drilling
  • Exploitation: drill at the best known location
  • Exploration: drill at a new location
• Game Playing
  • Exploitation: play the move you believe is best
  • Exploration: play an experimental move
Source: David Silver (2015), Introduction to Reinforcement Learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Prediction and Control
• Prediction: evaluate the future, given a policy
• Control: optimise the future by finding the best policy
Source: David Silver (2015), Introduction to Reinforcement Learning, https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
Generalized Policy Iteration (GPI)
Policy iteration alternates two interacting processes:
• Evaluation: V → v_π (estimate the value function of the current policy π)
• Improvement: π → greedy(V) (make the policy greedy with respect to the current value estimate)
Iterating the two drives both toward the optimal policy π* and the optimal value function v*.
Source: Richard S. Sutton & Andrew G. Barto (2018), Reinforcement Learning: An Introduction, 2nd Edition, A Bradford Book.
Generalized Policy Iteration (GPI)
GPI is any interleaving of policy evaluation and policy improvement, independent of their granularity. The same scheme applies to action-value functions:
• Evaluation: Q → q_π
• Improvement: π → greedy(Q)
Source: Richard S. Sutton & Andrew G. Barto (2018), Reinforcement Learning: An Introduction, 2nd Edition, A Bradford Book.
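A minimal Python sketch of GPI on a tiny, fully known MDP (the transition model, rewards, and discount below are invented for illustration):

```python
# P[s][a] = list of (probability, next_state, reward) transitions
# for a toy two-state MDP (an illustrative assumption, not from the slides).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 0, 0.0)]},
}
gamma, states, actions = 0.9, [0, 1], ["stay", "go"]

def q_value(V, s, a):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

pi = {s: "stay" for s in states}  # arbitrary initial policy
V = {s: 0.0 for s in states}
stable = False
while not stable:
    # Policy evaluation: sweep V toward v_pi for the current policy.
    for _ in range(100):
        V = {s: q_value(V, s, pi[s]) for s in states}
    # Policy improvement: make pi greedy with respect to V.
    new_pi = {s: max(actions, key=lambda a: q_value(V, s, a)) for s in states}
    stable = new_pi == pi
    pi = new_pi

print(pi, V)  # converges to the optimal policy and value estimates
```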
Bellman Equation
The Bellman equation decomposes the value function into two parts: the immediate reward plus the discounted value of the successor state.
This decomposition simplifies the computation of the value function: rather than summing rewards over many future time steps, we can find the optimal solution of a complex problem by breaking it into simpler, recursive subproblems and solving those.
Bellman Equation: Examples
[Worked numerical examples appeared as figures on the original slides.]
Q-Learning Algorithm
Q-Learning Example
[The example environment and its reward matrix R appeared as figures on the original slides.]
The transition rule of Q-learning is a very simple formula:
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
Example Continued
Look at the second row (state 1) of matrix R. There are two possible actions from the current state 1: go to state 3 or go to state 5. By random selection, we choose going to state 5 as our action.
Now imagine the agent is in state 5. Look at the sixth row of the reward matrix R (i.e. state 5). It has three possible actions: go to state 1, 4, or 5.
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100
Example Continued
For the next episode, we start with a randomly chosen initial state; this time it is state 3. Look at the fourth row of matrix R: there are three possible actions, go to state 1, 2, or 4. By random selection, we choose going to state 1. Now imagine that we are in state 1. Look at the second row of the reward matrix R (i.e. state 1): it has two possible actions, go to state 3 or state 5. We then compute the Q value:
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
Q(3, 1) = R(3, 1) + 0.8 * Max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * Max(0, 100) = 80
We use the matrix Q as updated in the last episode: Q(1, 3) = 0 and Q(1, 5) = 100. The result is Q(3, 1) = 80 because the immediate reward R(3, 1) is zero. The matrix Q becomes:
Example Continued
[The updated Q matrix after each episode, and the final converged Q matrix, appeared as figures on the original slides.]
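For reference, here is a minimal Python sketch of this tabular Q-learning loop. The reward matrix R below is the classic six-state room-navigation example consistent with the values quoted above (R(1, 5) = 100, R(3, 1) = 0, Gamma = 0.8); treat it as an assumption rather than a transcription of the slides.

```python
import random

# Reward matrix: -1 marks an impossible transition, 0 a valid move,
# 100 a move into the goal state 5 (assumed classic example).
R = [[-1, -1, -1, -1,  0,  -1],
     [-1, -1, -1,  0, -1, 100],
     [-1, -1, -1,  0, -1,  -1],
     [-1,  0,  0, -1,  0,  -1],
     [ 0, -1, -1,  0, -1, 100],
     [-1,  0, -1, -1,  0, 100]]
GAMMA, GOAL, EPISODES = 0.8, 5, 500

Q = [[0.0] * 6 for _ in range(6)]

for _ in range(EPISODES):
    state = random.randrange(6)                # random initial state
    while state != GOAL:
        # Valid actions are the next states reachable from 'state'.
        actions = [a for a in range(6) if R[state][a] >= 0]
        action = random.choice(actions)        # pure random exploration
        # Transition rule: Q(s, a) = R(s, a) + Gamma * Max[Q(s', all actions)]
        Q[state][action] = R[state][action] + GAMMA * max(Q[action])
        state = action                         # the chosen action is the next state

# After training, the greedy policy traces a path to the goal.
state = 2
path = [state]
while state != GOAL:
    state = max(range(6), key=lambda a: Q[state][a])
    path.append(state)
print(path)  # e.g. [2, 3, 1, 5]
```

After enough episodes, Q reproduces the hand-computed values above (Q(1, 5) = 100, Q(3, 1) = 80), and following the greedy policy from any state reaches goal state 5.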
Thank You
