RL Introduction
Session #1:
Introduction to the Course
Instructors :
1. Prof. S. P. Vimal ([email protected]),
2. Dr. V Chandra Sekhar ([email protected])
What is Reinforcement Learning?
- Reward-based / feedback-based learning
- Not a type of neural network, nor an alternative to neural networks; rather, it is an approach to learning
- Applications: autonomous driving, gaming
2
Course Objectives
1. Understand
a. the conceptual and mathematical foundations of deep reinforcement learning
b. various classic and state-of-the-art deep reinforcement learning algorithms
2. Implement and evaluate deep reinforcement learning solutions to problems such as planning, control, and decision making in various domains
3. Provide conceptual, mathematical, and practical exposure to DRL
a. to understand recent developments in deep reinforcement learning, and
b. to enable modelling new problems as DRL problems.
3
Learning Outcomes
4
Course Operation
• Instructors
Prof. S. P. Vimal
Dr. V Chandra Sekhar
• Textbooks
1. Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, Second Edition, MIT Press
2. Foundations of Deep Reinforcement Learning: Theory and Practice in Python (Addison-Wesley Data & Analytics Series), 1st Edition, Laura Graesser and Wah Loon Keng
5
Course Operation
• Evaluation
Two quizzes, 5% each; the best of the two counts for 5% in the final grading.
Whatever the quizzes are scored out of, the score will be scaled to 5%.
NO MAKEUP, for any reason. Make sure to attend at least one of the quizzes.
Two assignments (TensorFlow / PyTorch / OpenAI Gym Toolkit) → 25%
Assignment 1: Partly numerical + implementation of classic algorithms - 10%
Assignment 2: Deep-learning-based RL - 15%
Mid-Term Exam - 30% [to be written on A4 pages, scanned, and uploaded]
Comprehensive Exam - 40% [to be written on A4 pages, scanned, and uploaded]
• Webinars/Tutorials
4 tutorials: 2 before the mid-semester exam and 2 after
• Teaching Assistants will be introduced to you in the next class
6
Course Operation
• How to reach us? (for any question on lab aspects, availability of slides on the portal, quiz availability, assignment operations)
• Plagiarism [ Important ]
All submissions for graded components must be the result of your original effort. It is
strictly prohibited to copy and paste verbatim from any sources, whether online or from
your peers. The use of unauthorized sources or materials, as well as collusion or
unauthorized collaboration to gain an unfair advantage, is also strictly prohibited. Please
note that we will not distinguish between the person sharing their resources and the one
receiving them for plagiarism, and the consequences will apply to both parties equally.
Suspicious circumstances, such as identical verbatim answers or a significant overlap of unreasonable similarities across a set of submissions, will be investigated, and severe penalties will be imposed on all those found guilty of plagiarism.
7
Reinforcement Learning
Reinforcement Learning
Reinforcement learning methods are typically used when:
1. A model of the environment is known, but an analytic solution is not available;
2. Only a simulation model of the environment is given (the subject of simulation-based optimization); or
3. The only way to collect information about the environment is to interact with it.
11
What is Reinforcement Learning?
• At every time step t, the agent perceives the state of the environment
• Based on this perception, it chooses an action
• The action causes the agent to receive a numerical reward
• Find a way of choosing actions, called a policy, which maximizes the agent’s long-term expected return
Example: AlphaGo
[Figure (David Silver, 2015): reinforcement learning at the intersection of machine learning, optimal control (engineering), reward systems (neuroscience), operations research (mathematics), classical/operant conditioning (psychology), and bounded rationality (economics).]
Key Features of RL
It is also important to recognize that Reinforcement Learning is considered a branch of Machine Learning. ML generally refers to the broad set of techniques for inferring mathematical models/functions by acquiring (“learning”) knowledge of patterns and properties in the presented data, and Reinforcement Learning does fit this definition. However, unlike the other branches of ML (Supervised Learning and Unsupervised Learning), Reinforcement Learning is a lot more ambitious: it not only learns the patterns and properties of the presented data, it also learns the appropriate behaviors to exercise (the appropriate decisions to make) so as to drive towards the optimization objective.
It is sometimes said that Supervised Learning and Unsupervised Learning are about “minimization” (i.e., they minimize the fitting error of a model to the presented data), while Reinforcement Learning is about “maximization” (i.e., RL identifies the suitable decisions to be made to maximize a well-defined objective).
More importantly, the class of problems RL aims to solve can be described with a simple yet powerful
mathematical framework known as Markov Decision Processes (abbreviated as MDPs).
Key Features of RL
1. As the Figure indicates, at each time step t, the Agent observes an abstract piece of information (which we call State) and a numerical (real
number) quantity that we call Reward.
2. Note that the concept of State is indeed completely abstract in this framework and we can model State to be any data type, as complex or
elaborate as we’d like.
3. Upon observing a State and a Reward at time step t, the Agent responds by taking an Action.
4. Note that the concepts of State and Action are indeed completely abstract.
5. An Action might represent a purchase or sale of a stock in response to market price movements, or a movement of inventory from a warehouse to a store in response to large sales at the store, or a chess move in response to the opponent’s chess move.
Key Features of RL
6. Upon receiving an Action from the Agent at time step t, the Environment responds (with time ticking over to t + 1) by serving up the next time step’s
random State and random Reward.
7. The goal of the Agent at any point in time is to maximize the Expected Sum of all future Rewards by controlling (at each time step) the Action as a
function of the observed State (at that time step).
8. This function from a State to an Action at any time step is known as the Policy function. So we say that the agent’s job is to exercise control by determining the Optimal Policy function. Hence, this is a dynamic (i.e., time-sequenced) control system under uncertainty.
Real-world problems that fit the MDP framework
All kinds of problems in nature and in business (and indeed, in our personal lives) can be modeled as Markov Decision Processes.
[Figure: a neural network maps the backgammon board position s to an estimated state value w (≈ probability of winning); actions are selected by a shallow search over moves.]
Six weeks later it was the best backgammon player in the world.
It originally used expert handcrafted features; the experiment was later repeated with raw board positions.
RL + Deep Learning Performance on Atari Games
Mapping raw screen pixels to predictions of the final score for each of 18 joystick actions; no other input.
Seizure control (Guez et al., 2008; Barreto et al., 2011, 2012):
[Figure: fraction of seizures under learned policies (LSPI, KBSF, T2 variants) versus fixed-frequency stimulation at 0.5 Hz, 1 Hz, and 1.5 Hz.]
Helicopter control (Ng et al., 2008; Barreto, Gehring et al., 2013)
Non-stationarity
33
Types of Learning
Criteria | Supervised ML | Unsupervised ML | Reinforcement ML
Definition | Learns by using labelled data | Trained using unlabelled data without any guidance | Works by interacting with the environment
Type of data | Labelled data | Unlabelled data | No predefined data
34
Characteristics of RL
35
Types of Reinforcement learning
36
Types of Reinforcement Learning
Negative Reinforcement - strengthening of a behavior because a negative condition is stopped or avoided. Properties of negative reinforcement:
• It increases the desired behavior
• It encourages meeting a minimum standard of performance
• However, it tends to provide only enough to meet that minimum standard of behavior
37
Elements of Reinforcement Learning
38
Elements of Reinforcement Learning
Beyond the agent and the environment, one can identify four main sub-elements of a reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a model of the environment.
39
Elements of Reinforcement Learning
•Agent
- an entity that tries to learn the best way to perform a specific task.
- In our example, the child is the agent who learns to ride a bicycle.
•Action (A) -
- what the agent does at each time step.
- In the example of a child learning to walk, the action would be “walking”.
- A is the set of all possible moves.
- In video games, the list might include running right or left, jumping high or low,
crouching or standing still.
40
Elements of Reinforcement Learning
•State (S)
- current situation of the agent.
- After performing an action, the agent can move to different states.
- In the example of a child learning to walk, the child can take the action of taking
a step and move to the next state (position).
•Rewards (R)
- feedback that is given to the agent based on the action of the agent.
- If the agent’s action is good and leads toward a win or another positive outcome, a positive reward is given, and vice versa.
41
Elements of Reinforcement Learning
•Environment
- outside world of an agent or physical world in which the agent operates.
•Discount factor
- The discount factor is multiplied by future rewards as discovered by the agent
in order to dampen these rewards’ effect on the agent’s choice of action. Why?
It is designed to make future rewards worth less than immediate rewards.
Often expressed with the lower-case Greek letter gamma, γ. If γ is 0.8 and there is
a reward of 10 points after 3 time steps, the present value of that reward is
0.8³ × 10.
42
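As a quick check of the arithmetic above, here is a tiny Python sketch (illustrative only; the helper names are made up) that computes the present value of a delayed reward and of a reward sequence:

```python
# Minimal sketch (not from the slides): present value of a delayed reward and
# discounted sum of a sequence of future rewards, for a discount factor gamma.

def discounted_value(reward, steps, gamma=0.8):
    """Present value of a reward received `steps` time steps in the future."""
    return (gamma ** steps) * reward

def discounted_return(rewards, gamma=0.8):
    """Discounted sum of a sequence of future rewards r_1, r_2, ... (r_1 undiscounted)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_value(10, 3))          # 0.8^3 * 10 = 5.12, as in the slide
print(discounted_return([10, 10, 10]))  # 10 + 0.8*10 + 0.64*10 = 24.4
```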
Elements of Reinforcement Learning
Formal Definition - Reinforcement learning (RL) is an area of machine
learning concerned with how intelligent agents ought to take actions in
an environment in order to maximize the notion of cumulative reward.
43
Elements of Reinforcement Learning
•Goal of RL - maximize the total amount of rewards or cumulative rewards that are
received by taking actions in given states.
•Notations –
a set of states as S,
a set of actions as A,
a set of rewards as R.
At each time step t = 0, 1, 2, …, some representation of the environment’s state St
∈ S is received by the agent. According to this state, the agent selects an action At
∈ A which gives us the state-action pair (St , At). In the next time step t+1, the
transition of the environment happens and the new state St+1 ∈ S is achieved. At
this time step t+1, a reward Rt+1 ∈ R is received by the agent for the action At taken
from state St.
44
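To make the notation above concrete, here is a minimal, self-contained Python sketch of the agent-environment loop (the toy environment and the random policy are made up for illustration, not part of the course material):

```python
import random

# Illustrative sketch of the loop: at each step t the agent sees S_t, picks A_t,
# and the environment responds with R_{t+1} and S_{t+1}.

class ToyEnvironment:
    """Agent walks on positions 0..4; reaching 4 ends the episode with reward +1."""
    def reset(self):
        self.position = 0
        return self.position                      # initial state S_0

    def step(self, action):                       # action in {-1, +1}
        self.position = max(0, min(4, self.position + action))
        done = self.position == 4
        reward = 1.0 if done else 0.0             # R_{t+1}
        return self.position, reward, done        # S_{t+1}, R_{t+1}, terminal?

def random_policy(state):
    """A policy maps the observed state to an action (here: uniformly at random)."""
    return random.choice([-1, +1])

env = ToyEnvironment()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random_policy(state)                 # A_t chosen from S_t
    state, reward, done = env.step(action)        # environment ticks over to t+1
    total_reward += reward
print("return of this episode:", total_reward)
```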
Elements of Reinforcement Learning
45
Mathematical Formulation of Reinforcement
Learning
Before we answer our root question, i.e., how we formulate RL problems mathematically (using MDPs), we need to develop our intuition about:
46
Agent-Environment Interface
47
In reinforcement learning (RL), an agent learns through interactions with its environment by a process that involves
experimentation, feedback, and adaptation.
This learning paradigm is fundamentally different from supervised learning, where a model learns from a labeled
dataset.
Instead, RL is about learning what actions to take in various situations to maximize a cumulative reward over time.
48
Markov Decision Processes
54
Markov Decision Processes
• Components of MDPs:
• A set of states: These represent all possible scenarios or configurations
in which the system can exist.
• A set of actions: These are the choices available to the decision-maker
in each state.
• State-transition probabilities: For each state and action, this
component provides the probability of landing in a future state. This is
essential for understanding how actions affect the system's state.
• A reward function: This function assigns a numerical reward to each
transition between states. The goal is usually to maximize cumulative
rewards over time.
• A start state: The initial condition or state of the system.
• A terminal state: Some MDPs have an endpoint or terminal state,
which, once reached, ends the decision process.
55
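To make these components concrete, a small hypothetical MDP can be written down explicitly as plain data; everything below (state names, numbers) is made up for illustration:

```python
# A hypothetical MDP spelled out as data: states, actions, transition
# probabilities P, rewards R, a start state, and a terminal state.

states = ["working", "broken", "scrapped"]        # "scrapped" is terminal
actions = ["use", "repair"]
start_state = "working"
terminal_states = {"scrapped"}

# P[s][a] is a list of (probability, next_state); probabilities sum to 1.
P = {
    "working": {"use":    [(0.9, "working"), (0.1, "broken")],
                "repair": [(1.0, "working")]},
    "broken":  {"use":    [(0.5, "broken"), (0.5, "scrapped")],
                "repair": [(0.8, "working"), (0.2, "broken")]},
}

# R[s][a] is the expected immediate reward for taking action a in state s.
R = {
    "working": {"use": 10.0, "repair": -2.0},
    "broken":  {"use": -5.0, "repair": -8.0},
}

gamma = 0.9   # discount factor
```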
Markov Decision Processes
• Model Dynamics:
• Also referred to as the dynamics of the model, these describe how the
system evolves in response to various actions taken in different states,
according to the transition probabilities.
56
Markov Decision Processes
● An MDP is defined by
○ A set of states S
○ A set of actions A
○ State-transition probabilities P(s' | s, a)
■ Probability of arriving at s' after performing a at s
■ Also called the model dynamics
○ A reward function R(s, a, s')
■ The utility gained from arriving at s' after performing a at s
■ Sometimes just R(s, a) or even R(s)
○ A start state
○ Maybe a terminal state
57
Markov Decision Processes
58
Markov Decision Processes
59
Markov Decision Processes
60
Markov Decision Processes
61
Markov Decision Processes
1. Value Iteration
62
Markov Decision Processes
1. Value Iteration
63
Markov Decision Processes
1. Value Iteration
64
Markov Decision Processes
1. Value Iteration
65
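A compact sketch of value iteration on a tiny made-up MDP is shown below; it simply applies the Bellman optimality backup until the values stop changing (illustrative, not the worked example from the slides):

```python
# Value iteration: repeatedly apply
#   V(s) <- max_a sum_s' P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
# Transitions: P[(s, a)] = [(prob, next_state, reward), ...]
P = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "go"):   [(1.0, "end", 10.0)],
    ("end", "stay"): [(1.0, "end", 0.0)],   # absorbing terminal state
}
states = ["s0", "s1", "end"]
actions = {"s0": ["stay", "go"], "s1": ["stay", "go"], "end": ["stay"]}
gamma = 0.9

def value_iteration(theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                    # stop when values have converged
            return V

print(value_iteration())
```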
Markov Decision Processes
2. Policy Iteration
66
Markov Decision Processes
2. Policy Iteration
67
Markov Decision Processes
2. Policy Iteration
68
Markov Decision Processes
2. Policy Iteration
69
Markov Decision Processes
2. Policy Iteration
70
Markov Decision Processes
2. Policy Iteration
71
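For comparison, here is a compact policy-iteration sketch on the same kind of tabular MDP, alternating policy evaluation with greedy policy improvement (again illustrative; the MDP is made up):

```python
# Policy iteration on a tabular MDP: P[(s, a)] = [(prob, next_state, reward), ...]
P = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "go"):   [(1.0, "end", 10.0)],
    ("end", "stay"): [(1.0, "end", 0.0)],
}
states = ["s0", "s1", "end"]
actions = {"s0": ["stay", "go"], "s1": ["stay", "go"], "end": ["stay"]}
gamma = 0.9

def q_value(s, a, V):
    """One-step lookahead value of taking a in s and then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])

def policy_iteration():
    policy = {s: actions[s][0] for s in states}       # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # 1. Policy evaluation: make V consistent with the current policy.
        while True:
            delta = 0.0
            for s in states:
                v_new = q_value(s, policy[s], V)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < 1e-6:
                break
        # 2. Policy improvement: act greedily with respect to V.
        stable = True
        for s in states:
            best_a = max(actions[s], key=lambda a: q_value(s, a, V))
            if best_a != policy[s]:
                policy[s], stable = best_a, False
        if stable:                                    # no change => optimal
            return policy, V

print(policy_iteration())
```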
Markov Decision Processes
3. Monte Carlo Techniques
Monte Carlo methods are a class of algorithms that rely on repeated random
sampling to obtain numerical results, typically used to solve complex
mathematical and computational problems that are deterministic in principle
but for which analytical solutions are infeasible or unknown. In the context of
reinforcement learning, Monte Carlo (MC) methods are used to estimate value
functions and improve policies based on experience sampled from actual or
simulated interaction with the environment.
72
Markov Decision Processes
3. Monte Carlo Techniques
In Monte Carlo methods for reinforcement learning, specifically in the
context of policy evaluation, the approach is centered around estimating the
value of states based on the returns from episodes of the environment.
73
Markov Decision Processes
3. Monte Carlo Techniques
74
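A minimal first-visit Monte Carlo policy-evaluation sketch is shown below: it estimates the value of each state by averaging the returns observed after the first visit to that state in each sampled episode (the tiny episodic environment and policy are made up for illustration):

```python
import random
from collections import defaultdict

# First-visit Monte Carlo policy evaluation on a made-up episodic MDP.
# Transitions: P[(s, a)] = [(prob, next_state, reward), ...]
P = {
    ("s0", "go"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "go"): [(1.0, "end", 10.0)],
}
gamma = 0.9

def sample_episode(policy):
    """Roll out one episode; returns a list of (state, action, reward) steps."""
    s, episode = "s0", []
    while s != "end":
        a = policy(s)
        draw, cum = random.random(), 0.0
        for p, s2, r in P[(s, a)]:
            cum += p
            if draw <= cum:
                episode.append((s, a, r))
                s = s2
                break
    return episode

def first_visit_mc(policy, n_episodes=5000):
    returns = defaultdict(list)
    for _ in range(n_episodes):
        episode = sample_episode(policy)
        first_visit = {}
        for t, (s, _, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):    # accumulate G_t backwards
            s, _, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:                # record the first visit only
                returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

print(first_visit_mc(lambda s: "go"))              # fixed policy: always "go"
```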
Markov Decision Processes
[Worked-example slides; the graphical content is not reproduced here. Recoverable parameters from one slide: k = 0, Noise = 0.2, Discount = 0.9, Living reward = 0.]
Markov Decision Processes
3. Monte Carlo Techniques
111
Markov Decision Processes
4. Q Learning
112
Markov Decision Processes
4. Q Learning
113
Markov Decision Processes
4. Q Learning
114
Markov Decision Processes
4. Q Learning
115
Markov Decision Processes
4. Q Learning
116
Markov Decision Processes
4. Q Learning
Applications
Q-learning has been successfully applied in various domains, including robotics,
automated control, economics, and gaming, where the decision-making process
must be optimized based on sequential interactions with an environment.
Its ability to learn optimal actions without prior knowledge of the environment's
dynamics makes it a powerful tool in the field of artificial intelligence.
117
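As a concrete reference, here is a minimal tabular Q-learning sketch using the standard update Q(s,a) ← Q(s,a) + α·(r + γ·max_a' Q(s',a') − Q(s,a)); the toy chain environment is made up for illustration:

```python
import random
from collections import defaultdict

class ChainEnv:
    """Toy chain 0..4; actions +1/-1 move the agent; reaching 4 gives +10."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, min(4, self.s + a))
        done = self.s == 4
        return self.s, (10.0 if done else -1.0), done

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                          # Q[(state, action)]
    actions = [-1, +1]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy: explore sometimes, exploit otherwise
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            target = r + (0.0 if done else gamma * max(Q[(s2, act)] for act in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Q-learning update
            s = s2
    return Q

Q = q_learning(ChainEnv())
print({s: max([-1, +1], key=lambda a: Q[(s, a)]) for s in range(4)})  # greedy policy
```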
● State:
○ raw pixels
● Actions:
○ game controls
● Reward:
○ change in score
● State-transition probabilities:
○ defined by stochasticity in game evolution
Ref: “Playing Atari with Deep Reinforcement Learning”, Mnih et al., 2013
118
Elements of Reinforcement Learning
The Agent-Environment Relationship
Agent: Software programs that make intelligent decisions; they are the learners in RL. These agents interact with the environment by taking actions and receive rewards based on their actions.
Environment: It is the demonstration of the problem to be solved. We can have a real-world environment or a simulated environment with which our agent will interact.
State: This is the position of the agent at a specific time step in the environment. Whenever an agent performs an action, the environment gives the agent a reward and the new state the agent reached by performing the action.
Anything that the agent cannot change arbitrarily is considered to be part of the environment. In simple terms, actions can be any decisions we want the agent to learn, and a state can be anything which can be useful in choosing actions.
119
Elements of RL: The Markov Property
Transition : Moving from one state to another is called Transition.
Transition Probability: The probability that the agent will move from one state to another
is called transition probability.
The Markov Property states that: “The future is independent of the past given the present.”
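In symbols (the standard statement of the property, shown here since the original formula image is not reproduced):

$$ \mathbb{P}[\,S_{t+1} \mid S_t\,] \;=\; \mathbb{P}[\,S_{t+1} \mid S_1, S_2, \ldots, S_t\,] $$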
120
Elements of RL: The Markov Property
• S[t] denotes the current state of the agent and S[t+1] denotes the next state.
• What this equation means is that the transition from state S[t] to S[t+1] is entirely independent of the past.
• So, the RHS of the equation means the same as the LHS if the system has the Markov Property.
• Intuitively, this means that our current state already captures the information of the past states.
121
Elements of RL: State Transition Probability
• Now that we know about transition probability, we can define the state transition probability as follows:
• For a Markov state, the probability of transitioning from S[t] to S[t+1] (i.e., to any successor state) is given by the state transition probability.
We can collect these probabilities into a state transition probability matrix:
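The two formulas referred to above are, in their standard form (reproduced here since the original images are not included):

$$ P_{ss'} = \mathbb{P}[\,S_{t+1} = s' \mid S_t = s\,] $$

and, collecting all such entries over the n states,

$$ P = \begin{bmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots & P_{nn} \end{bmatrix} $$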
122
Elements of Reinforcement Learning
Each row in the matrix gives the probabilities of moving from the original (starting) state to each successor state. The entries of each row sum to 1.
123
Elements of RL: Markov Process or Markov Chains
• A Markov Process is a memoryless random process, i.e., a sequence of random states S[1], S[2], …, S[n] with the Markov Property.
• It can be defined using a set of states (S) and a transition probability matrix (P).
• The dynamics of the environment can be fully defined using the states (S) and the transition probability matrix (P).
124
Elements of Reinforcement Learning
Some sample trajectories from the chain:
Hopefully it is now clear why a Markov process is called a random sequence of states.
125
Elements of RL: Reward and Returns
• Rewards are the numerical values that the agent receives on performing some action in some state(s) in the environment.
• The numerical value can be positive or negative based on the actions of the agent.
• In Reinforcement Learning, we care about maximizing the cumulative reward (all the rewards the agent receives from the environment) instead of the reward the agent receives from the current state (also called the immediate reward).
• This total sum of rewards the agent receives from the environment is called the return.
126
Elements of Reinforcement Learning
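The return the following bullets refer to is the (undiscounted) sum of the rewards collected from time step t up to the final time step T, in its standard form:

$$ G_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T $$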
• r[t+1] is the reward received by the agent at time step t[0] while performing an
action(a) to move from one state to another.
• Similarly, r[t+2] is the reward received by the agent at time step t[1] by performing an
action to move to another state.
• And, r[T] is the reward received by the agent at the final time step by performing an action to move to another state.
127
Elements of RL: Episodic and Continuous Tasks
Episodic Tasks: These are tasks that have a terminal state (end state); we can say they have a finite number of states. For example, in racing games, we start the game (start the race) and play it until the game is over (the race ends!).
This is called an episode. Once we restart the game it will start from an initial state, and hence every episode is independent.
Continuous Tasks: These are tasks that have no end, i.e., they don’t have any terminal state. These types of tasks never end. For example, learning how to code!
Now, it’s easy to calculate the returns of episodic tasks, as they eventually end, but what about continuous tasks, which go on and on forever?
The returns would sum up to infinity! So, how do we define returns for continuous tasks?
128
Elements of RL: Discount factor (γ)
• This is where we need the Discount Factor (γ):
• It determines how much importance is to be given to the immediate reward versus future rewards. This basically helps us avoid infinite returns in continuous tasks.
• It has a value between 0 and 1. A value closer to 0 means that more importance is given to the immediate reward, and a value closer to 1 means that more importance is given to future rewards.
• In practice, an agent with a discount factor of 0 will never learn about the future, as it considers only the immediate reward, while a discount factor of 1 keeps counting future rewards, which may lead to infinity. Therefore, the value of the discount factor typically lies between 0.2 and 0.8.
• So, we can define the return using the discount factor as follows (let’s call this equation 1, as we will use it later when deriving the Bellman Equation):
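In its standard form:

$$ G_t \;=\; r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \cdots \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} $$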
129
Elements of RL: Discount factor (γ)
130
Elements of RL: Discount factor (γ)
131
Elements of RL: Discount factor (γ)
132
Elements of RL: Discount factor (γ)
Practical Implications:
Decision Making: In practical scenarios like financial investment, ecological management, or strategic planning,
discounting affects how future benefits are weighed against immediate costs. For instance, in environmental policy, a
lower discount rate might justify significant upfront costs for long-term sustainability benefits.
Policy Evaluation in RL: In RL, the discount factor affects how an agent evaluates its policy. An agent with a high
discount factor will behave more conservatively, trying to ensure good long-term outcomes, while an agent with a low
discount factor may take riskier actions that promise immediate rewards.
In summary, the discount factor 𝛾 in reinforcement learning serves multiple purposes: it reflects time preference for
rewards, aids in algorithmic convergence, affects policy development, and impacts the agent's performance and
decision-making strategy significantly.
Choosing the right value of 𝛾 is crucial for balancing between immediate and future rewards, ensuring both effective
learning and optimal decision-making.
133
Elements of Reinforcement Learning
Let’s understand it with an example. Suppose you live at a place facing water scarcity, and someone tells you that they will give you 100 litres of water (assume, please!) every hour for the next 15 hours, scaled by some parameter (γ).
Let’s look at two possibilities. First, with discount factor (γ) = 0.8:
This means that we should wait until the 15th hour, because the decrease is not very significant, so it’s still worth going until the end.
It also means that we are interested in future rewards: if the discount factor is close to 1, we will make an effort to go to the end, as the future rewards are of significant importance.
134
Elements of Reinforcement Learning
Second, with discount factor (γ) = 0.2:
This means that we are more interested in early rewards, as the rewards get significantly lower each hour. So, we might not want to wait until the end (the 15th hour), as it will be worthless. So, if the discount factor is close to zero, then immediate rewards are more important than future ones.
135
Markov Reward Process
So far we have seen how a Markov chain defines the dynamics of an environment using a set of states (S) and a transition probability matrix (P).
But we know that Reinforcement Learning is all about the goal of maximizing the reward.
So, let’s add rewards to our Markov chain. This gives us a Markov Reward Process.
Markov Reward Process: As the name suggests, MRPs are Markov chains with a value judgement. Basically, we get a value from every state our agent is in.
136
Elements of Reinforcement Learning
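The equation discussed below (the standard reward function of a Markov Reward Process, together with the MRP tuple) is:

$$ R_s = \mathbb{E}[\,r_{t+1} \mid S_t = s\,], \qquad \text{MRP} = \langle S, P, R, \gamma \rangle $$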
What this equation means is how much reward (Rs) we get from a particular state S[t]. It tells us the immediate reward expected from that particular state our agent is in. As we will see later, we maximize these rewards from each state our agent is in; in simple terms, we maximize the cumulative reward we get from each state.
S is a set of states,
P is the Transition Probability Matrix,
R is the reward function we saw earlier,
γ is the discount factor
137
Markov Decision Process
Now, let’s develop our intuition for Bellman Equation and Markov Decision Process.
138
Elements of Reinforcement Learning: Markov Decision Process
Value Function: determines how good it is for the agent to be in a particular state. Of course, how good it is to be in a particular state depends on the actions the agent will take from there (out of the possible set of actions, we have to choose the best one). This is where the policy comes in.
A policy defines what action (or series of actions) to perform in a particular state s.
139
Markov Decision Processes
• Formal Specification and example
• Policies and Optimal Policy
• Intro to Value Iteration
Combining ideas for Stochastic planning
• What is a key limitation of decision networks?
Represent (and optimize) only a fixed number of
decisions
[Diagram: related models — Markov chains, hidden Markov models, and partially observable Markov decision processes (POMDPs).]
The agent moves in the above grid via actions Up, Down, Left, Right.
Each action has:
• 0.8 probability of achieving its intended effect
• 0.1 probability of moving at right angles to the intended direction
• If the agent bumps into a wall, it stays there
How many states?
There are two terminal states, (3,4) and (2,4).
CPSC 422, Lecture 3 Slide 144
Example MDP: Rewards
Can the sequence [Up, Up, Right, Right, Right] take the agent to terminal state (3,4)?
• The agent needs to make the same kind of decision over and over: given the current state, what should I do?
discount factor γ, with 0 ≤ γ ≤ 1
Rmax: a bound on R(s) for every s
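The equation discussed next is the standard definition of the state-value function under a policy π:

$$ v_\pi(s) = \mathbb{E}_\pi[\,G_t \mid S_t = s\,] $$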
This equation gives us the expected return starting from state s and going to successor states thereafter, under the policy π. One thing to note is that the return we get is stochastic, whereas the value of a state is not: it is the expectation of the return from start state s onward. Also note that the value of the terminal state (if there is one) is zero. Let’s look at an example:
MDPs: Policy
The advantage function describes how much better it is to take action a instead of following policy π:
the advantage of choosing action a over the default action.
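In symbols (standard definition):

$$ A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s) $$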
MDPs: Policy
MDPs: Policy
Mathematically, we can define Bellman Equation as :
We want to know the value of state s. The value of state(s) is the reward we got upon leaving that state,
plus the discounted value of the state we landed upon multiplied by the transition probability that we will
move into it.
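In its standard (MRP) form, the equation described above is:

$$ v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'}\, v(s') $$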
Mathematically, we can define Bellman Equation as :
Where v is the value of state we were in, which is equal to the immediate reward plus the discounted
value of the next state multiplied by the probability of moving into that state.
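In matrix form (the form the complexity remark below refers to), the same equation reads v = R + γPv, which can be solved directly by matrix inversion:

$$ v = R + \gamma P v \quad\Longrightarrow\quad v = (I - \gamma P)^{-1} R $$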
The running-time complexity of this computation is O(n³). Therefore, this is clearly not a practical solution for solving larger MRPs (the same holds for MDPs). Later, we will look at more efficient methods such as Dynamic Programming (value iteration and policy iteration), Monte Carlo methods, and TD learning.
What is a Markov Decision Process?
Markov Decision Process: It is a Markov Reward Process with decisions. Everything is the same as in an MRP, but now we have an actual agent that makes decisions or takes actions.
• Till now we have talked about getting a reward (r) when our agent goes through a set of states (s) following a policy π.
• Actually, in a Markov Decision Process (MDP) the policy is the mechanism for taking decisions. So now we have a mechanism that will choose which action to take.
• Policies in an MDP depend on the current state. They do not depend on the history. That’s the Markov Property: the current state we are in characterizes the history.
• We have already seen how good it is for the agent to be in a particular state (the state-value function).
• Now, let’s see how good it is to take a particular action following a policy π from state s (the action-value function).
State-action value function or Q-Function
This function specifies how good it is for the agent to take action (a) in a state (s) with a policy π.
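The equation the following paragraphs refer to is the standard definition of the state-action value function and its recursive (Bellman) decomposition:

$$ q_\pi(s, a) = \mathbb{E}_\pi[\,G_t \mid S_t = s, A_t = a\,] = \mathbb{E}_\pi[\,r_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\,] $$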
The above equation tells us that the value of a particular state is determined by the immediate
reward plus the value of successor states when we are following a certain policy(π).
From the above equation, we can see that the State-Action Value of a state can be
decomposed into the immediate reward we get on performing a certain action in state(s) and
moving to another state(s’) plus the discounted value of the state-action value of the state(s’)
with respect to the some action(a) our agent will take from that state on-wards.
State-action value function or Q-Function
This backup diagram describes the value of being in a particular state. From the state s there is some probability of taking each of the actions. There is a Q-value (state-action value) for each of the actions. We average the Q-values, which tells us how good it is to be in that particular state.
The next backup diagram says that suppose we start off by taking some action (a). Because of the action (a), the agent might be blown to any of these states by the environment. Therefore, we are asking the question: how good is it to take action (a)?
State-action value function or Q-Function
Mathematically, we can define this as follows :
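In its standard form, the averaging described by the backup diagram is:

$$ v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a) $$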
State-action value function or Q-Function
State-action value function or Q-Function
From the above diagram, if our agent is in some state (s), then from that state suppose our agent can take two actions, due to which the environment might take our agent to any of the states (s').
Note that the probability of the action our agent might take from state s is weighted by our policy, and after taking that action, the probability that we land in any of the states (s') is weighted by the environment.
Now our question is: how good is it to be in state (s) after taking some action, landing in another state (s'), and following our policy (π) after that?
It is similar to what we did before: we average the values of the successor states (s'), weighted by the transition probabilities (P) and by our policy.
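Written out (the standard Bellman expectation equation for the state-value function):

$$ v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \Big( R_s^{a} + \gamma \sum_{s' \in S} P_{ss'}^{a}\, v_\pi(s') \Big) $$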
State-action value function or Q-Function
It’s very similar to what we did for the state-value function, just the other way around. This diagram basically says that our agent takes some action (a), because of which the environment might land us in any of the states (s'); then from that state we can choose to take any action (a'), weighted by the probability under our policy (π). Again, we average them together, and that gives us how good it is to take a particular action following a particular policy (π) all along.
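Written out (the standard Bellman expectation equation for the action-value function):

$$ q_\pi(s, a) = R_s^{a} + \gamma \sum_{s' \in S} P_{ss'}^{a} \sum_{a' \in A} \pi(a' \mid s')\, q_\pi(s', a') $$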
State-action value function or Q-Function
Optimal State-Value Function :It is the maximum Value function over all policies.
Optimal State-Action Value Function: It is the maximum action-value function over all policies.
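In symbols:

$$ v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a) $$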
State-action value function or Q-Function
Sketch of ideas to find the optimal policy for an MDP (Value Iteration)
We first need a couple of definitions:
• Vπ(s): the expected value of following policy π in state s
• Qπ(s, a), where a is an action: the expected value of performing a in s, and then following policy π.
We have, by definition:
Qπ(s, a) = R(s) + γ Σ_{s'} P(s' | s, a) Vπ(s')
where R(s) is the reward obtained in s, γ is the discount factor, the sum ranges over the states s' reachable from s by doing a, P(s' | s, a) is the probability of getting to s' from s via a, and Vπ(s') is the expected value of following policy π in s'.
CPSC 422, Lecture 3, Slide 191
Value of a policy and Optimal policy
We can also compute Vπ(s) in terms of Qπ(s, a):
Vπ(s) = Qπ(s, π(s))
where π(s) is the action indicated by π in s: the expected value of following π in s equals the expected value of performing the action indicated by π in s and then following π after that.
For the optimal policy π* we have Q*(s, π*(s)) = R(s) + γ Σ_{s'} P(s' | s, π*(s)) V*(s'), so…
V*(s) = R(s) + γ max_a Σ_{s'} P(s' | s, a) V*(s')
CPSC 422, Lecture 3, Slide 193
Elements of Reinforcement Learning
•Policy (π)
- The policy in RL decides which action the agent will take in the current state.
- It tells us the probability that an agent will select a specific action from a specific state.
- A policy is a function that maps a given state to the probabilities of selecting each possible action from that state.
•If at time t an agent follows policy π, then π(a|s) is the probability that the action at time step t is At = a given that the state at time step t is St = s. In other words, the probability that an agent will take action a in state s at time t under policy π is π(a|s).
205
Elements of Reinforcement Learning
•Value Functions
- a simple measure of how good it is for an agent to be in a given state, or how
good it is for the agent to perform a given action in a given state.
•Two types
- state- value function
- action-value function
206
Elements of Reinforcement Learning
•State-value function
- The state-value function for policy π denoted as vπ determines the goodness of
any given state for an agent who is following policy π.
- This function gives us the value which is the expected return starting from state
s at time step t and following policy π afterward.
207
Elements of Reinforcement Learning
•Action value function
- determines the goodness of the action taken by the agent from a given state for
policy π.
- This function gives the value which is the expected return starting from state s
at time step t, with action a, and following policy π afterward.
- The output of this function is also called as Q-value where q stands for Quality.
Note that in the state-value function, we did not consider the action taken by
the agent.
208
Elements of Reinforcement Learning
209
(Deep) Reinforcement Learning
210
From OpenAI
Advantages of Reinforcement Learning
- can solve very complex problems that cannot be solved by conventional techniques
- can achieve long-term results
- the model can correct errors that occurred during the training process
- in the absence of a training dataset, it is bound to learn from its own experience
- can be useful when the only way to collect information about the environment is to interact with it
- Reinforcement learning algorithms maintain a balance between exploration and exploitation. Exploration is the process of trying different things to see if they are better than what has been tried before. Exploitation is the process of trying the things that have worked best in the past. Other learning algorithms do not maintain this balance.
211
An example scenario - Tic-Tac-Toe
Two players take turns playing on a three-by-three board. One player plays Xs and the other Os, until one player wins by placing three marks in a row, horizontally, vertically, or diagonally.
Assumptions
- playing against an imperfect player, one whose play is sometimes incorrect and
allows you to win
Aim
How might we construct a player that will find the imperfections in its
opponent’s play and learn to maximize its chances of winning?
212
(Deep) Reinforcement Learning: Tic-Tac-Toe
[Sequence of illustrative slides; figures from OpenAI.]
Reinforcement Learning for Tic-Tac-Toe
•We set up a table of numbers, one for each possible state of the game. Each
number will be the latest estimate of the probability of our winning from that state.
•We treat this estimate as the state’s value, and the whole table is the learned
value function.
•State A has higher value than state B, or is considered better than state B, if the
current estimate of the probability of our winning from A is higher than it is from B.
225
Reinforcement Learning for Tic-Tac-Toe
226
Reinforcement Learning for Tic-Tac-Toe
•The solid lines represent the moves
taken during a game; the dashed lines
represent moves that we (our
reinforcement learning player)
considered but did not make.
•Our second move was an exploratory
move, meaning that it was taken even
though another sibling move, the one
leading to e∗, was ranked higher.
227
Reinforcement Learning for Tic-Tac-Toe
● Once a game is started, our agent computes all possible actions it can take in
the current state and the new states which would result from each action.
● The values of these states are collected from a state_value vector, which
contains values for all possible states in the game.
● The agent can then choose the action which leads to the state with the highest value (exploitation), or choose a random action (exploration), depending on the value of epsilon.
228
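A small sketch of the action-selection step just described is given below; the board encoding and helper names (possible_moves, choose_move, state_value) are illustrative assumptions, not from the slides:

```python
import random

# Epsilon-greedy move selection for the tic-tac-toe agent: enumerate the legal
# moves from the current board, look up the value of each resulting state in a
# state_value table, then exploit (best value) or explore (random move).

EMPTY, ME = ".", "X"

def possible_moves(board):
    """Return (cell_index, resulting_board) pairs for every empty cell."""
    moves = []
    for i, cell in enumerate(board):
        if cell == EMPTY:
            nxt = board[:i] + (ME,) + board[i + 1:]
            moves.append((i, nxt))
    return moves

def choose_move(board, state_value, epsilon=0.1):
    moves = possible_moves(board)
    if random.random() < epsilon:                   # exploration
        return random.choice(moves)
    # exploitation: pick the move leading to the highest-valued state
    # (unknown states default to 0.5, i.e. an unknown probability of winning)
    return max(moves, key=lambda m: state_value.get(m[1], 0.5))

board = tuple(EMPTY for _ in range(9))              # empty 3x3 board as a 9-tuple
state_value = {}                                    # learned value table: state -> win probability
print(choose_move(board, state_value))
```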
Thank you
229