
Deep Reinforcement Learning

2022-23 Second Semester, M.Tech (AIML)

Session #1:
Introduction to the Course

Instructors :
1. Prof. S. P. Vimal ([email protected]),
2. Dr. V Chandra Sekhar ([email protected])
1
What is Reinforcement Learning?
- reward-based / feedback-based learning
- not a type of neural network, nor an alternative to one; rather, an approach to learning
- example applications: autonomous driving, gaming

Why Reinforcement Learning?

- goal-oriented learning based on interaction with an environment

2
Course Objectives

1. Understand
a. the conceptual and mathematical foundations of deep reinforcement learning
b. various classic and state-of-the-art deep reinforcement learning algorithms
2. Implement and evaluate deep reinforcement learning solutions to problems such as
planning, control and decision making in various domains
3. Provide conceptual, mathematical and practical exposure to DRL
a. to understand recent developments in deep reinforcement learning and
b. to enable modelling new problems as DRL problems.

3
Learning Outcomes

1. Understand the fundamental concepts and algorithms of reinforcement learning (RL) and apply them to solving problems including control, decision-making, and planning.
2. Implement DRL algorithms and handle training challenges related to stability and convergence.
3. Evaluate the performance of DRL algorithms using metrics such as sample efficiency, robustness and generalization.
4. Understand the challenges and opportunities of applying DRL to real-world problems and model real-life problems as DRL problems.

4
Course Operation

• Instructors
Prof. S.P.Vimal
Dr. V Chandra Sekhar

• Textbooks
1. Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, Second Edition, MIT Press
2. Foundations of Deep Reinforcement Learning: Theory and Practice in Python (Addison-Wesley Data & Analytics Series), 1st Edition, Laura Graesser and Wah Loon Keng
5
Course Operation
• Evaluation
Two quizzes worth 5% each; the best of the two counts for 5% in the final grading.
Whatever the quizzes are scored out of, the score will be scaled to 5%.
NO MAKEUP, for any reason. Be sure to attend at least one of the quizzes.
Two assignments (TensorFlow / PyTorch / OpenAI Gym Toolkit) - 25%
Assignment 1: partially numerical + implementation of classic algorithms - 10%
Assignment 2: deep learning based RL - 15%
Mid-Term Exam - 30% [to be written on A4 pages, scanned and uploaded]
Comprehensive Exam - 40% [to be written on A4 pages, scanned and uploaded]
• Webinars/Tutorials
4 tutorials: 2 before the mid-semester exam and 2 after
• Teaching Assistants will be introduced to you in the next class

6
Course Operation
• How to reach us? (for any question on lab aspects, availability of slides on the portal, quiz availability, or assignment operations)

1.Prof. S. P. Vimal ([email protected]),


2.Dr. V Chandra Sekhar ([email protected])

• Plagiarism [ Important ]
All submissions for graded components must be the result of your original effort. It is
strictly prohibited to copy and paste verbatim from any sources, whether online or from
your peers. The use of unauthorized sources or materials, as well as collusion or
unauthorized collaboration to gain an unfair advantage, is also strictly prohibited. Please
note that we will not distinguish between the person sharing their resources and the one
receiving them for plagiarism, and the consequences will apply to both parties equally.
Cases where suspicious circumstances arise, such as identical verbatim answers or a significant overlap of unreasonable similarities in a set of submissions, will be investigated, and severe penalties will be imposed on all those found guilty of plagiarism.
7
Reinforcement Learning

Reward: food or electric shock (for an animal); positive and negative numbers (for an RL agent)

• Learning by trial-and-error
• Numerical reward is often delayed

A big success story: AlphaGo, the first AI Go player to defeat a human (9 dan) champion (2016)

The computational revolution: roughly the computational power of the human brain expected by about 2030
Reinforcement Learning

Reinforcement learning (RL) is based on rewarding desired behaviors or punishing undesired


ones. Instead of one input producing one output, the algorithm produces a variety of outputs
and is trained to select the right one based on certain variables – Gartner

When to use RL?


RL can be used in large environments in the following situations:

1. A model of the environment is known, but an analytic solution is not available;
2. Only a simulation model of the environment is given (the subject of simulation-based optimization);
3. The only way to collect information about the environment is to interact with it.
11
What is Reinforcement Learning?

Agent-oriented learning—learning by interacting with an


environment to achieve a goal

• more realistic and ambitious than other kinds of machine


learning

Learning by trial and error, with only delayed evaluative feedback


(reward)

• the kind of machine learning most like natural learning

• learning that can tell for itself when it is right or wrong

The beginnings of a science of mind that is neither natural science


nor applications technology
Contrast: Supervised Learning
• Training experience: a set of labeled examples of the form ⟨x1, x2, ..., xn⟩, y, where the xj are values for the input variables and y is the desired output
• This implies the existence of a "teacher" who knows the right answers
• What to learn: a function mapping inputs to outputs which optimizes an objective function
• E.g., face detection and recognition
Contrast: Unsupervised learning
• Training experience: unlabelled data
• What to learn: interesting associations in the data
• E.g., clustering, dimensionality reduction, density estimation
• Often there is no single correct answer
• Very necessary, but significantly more difficult than supervised learning
Computational framework

• At every time step t, the agent perceives the state of the environment
• Based on this perception, it chooses an action
• The action causes the agent to receive a numerical reward
• Find a way of choosing actions, called a policy, which maximizes the agent's long-term expected return (a minimal sketch of this loop follows below)
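A minimal sketch of this interaction loop in code, using the OpenAI Gym toolkit that the course assignments rely on. The environment name and the random "policy" are placeholders, and the classic Gym API is assumed (newer gymnasium releases return extra values from reset() and step()):

```python
import gym

env = gym.make("CartPole-v1")        # placeholder choice of environment
state = env.reset()                  # agent perceives the initial state
episode_return, done = 0.0, False

while not done:
    action = env.action_space.sample()            # a real policy would map state -> action
    state, reward, done, info = env.step(action)  # environment returns next state and reward
    episode_return += reward                      # accumulate the long-term return

print("Episode return:", episode_return)
```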
Example: AlphaGo

• Perceptions: state of the board


• Actions: legal moves
• Reward: +1 or -1 at the end of the game
• Trained by playing games against itself
• Invented new ways of playing which seem superior
The many faces of Reinforcement Learning (David Silver, 2015)

[Figure: RL at the intersection of several disciplines: Machine Learning (Computer Science), Optimal Control (Engineering), Reward Systems (Neuroscience), Classical/Operant Conditioning (Psychology), Bounded Rationality (Economics), and Operations Research (Mathematics).]
Key Features of RL

It is also important to recognize that Reinforcement Learning is considered to be a branch of Machine Learning. ML generally refers to the broad set of techniques that infer mathematical models/functions by acquiring ("learning") knowledge of patterns and properties in the presented data.

In this regard, Reinforcement Learning does fit this definition. However, unlike the other branches of ML
(Supervised Learning and Unsupervised Learning), Reinforcement Learning is a lot more ambitious - it not only
learns the patterns and properties of the presented data, it also learns about the appropriate behaviors to be
exercised (appropriate decisions to be made) so as to drive towards the optimization objective.

It is sometimes said that Supervised Learning and Unsupervised learning are about “minimization” (i.e., they
minimize the fitting error of a model to the presented data),

while Reinforcement Learning is about “maximization” (i.e., RL identifies the suitable decisions to be made to
maximize a well-defined objective).

More importantly, the class of problems RL aims to solve can be described with a simple yet powerful
mathematical framework known as Markov Decision Processes (abbreviated as MDPs).
Key Features of RL

1. As the Figure indicates, at each time step t, the Agent observes an abstract piece of information (which we call State) and a numerical (real
number) quantity that we call Reward.
2. Note that the concept of State is indeed completely abstract in this framework and we can model State to be any data type, as complex or
elaborate as we’d like.
3. Upon observing a State and a Reward at time step t, the Agent responds by taking an Action.
4. Note that the concept of State and Action is indeed completely abstract.
5. An Action might represent the purchase or sale of a stock in response to market price movements, the movement of inventory from a warehouse to a store in response to large sales at the store, or a chess move in response to the opponent's chess move.
Key Features of RL

6. Upon receiving an Action from the Agent at time step t, the Environment responds (with time ticking over to t + 1) by serving up the next time step’s
random State and random Reward.

7. The goal of the Agent at any point in time is to maximize the Expected Sum of all future Rewards by controlling (at each time step) the Action as a
function of the observed State (at that time step).

8. This function from State to Action at any time step is known as the Policy function. So we say that the agent's job is to exercise control by determining the Optimal Policy function. Hence, this is a dynamic (i.e., time-sequenced) control system under uncertainty. A small sketch of a policy function follows below.
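A minimal sketch of a policy as a plain function from observed State to Action, using hypothetical inventory states and actions purely for illustration:

```python
def policy(state):
    """Deterministic policy: decide how much stock to ship given the observed state."""
    warehouse_stock, store_stock = state   # hypothetical state encoding
    if store_stock < 5:                    # the store is running low
        return min(10, warehouse_stock)    # ship up to 10 units
    return 0                               # otherwise do nothing

print(policy((50, 2)))    # -> 10
print(policy((50, 20)))   # -> 0
```

An RL algorithm's job is to find the optimal such mapping rather than hand-coding it as above.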
Real-world problems that fit the MDP framework

All kinds of problems in nature and in business (and indeed, in our personal lives) can be modeled as Markov Decision Processes.

Here is a sample of such problems:

• Self-driving vehicle (Actions constitute speed/steering to optimize safety/time).


• Game of Chess (Actions constitute moves of the pieces to optimize chances of winning the game).
• Complex Logistical Operations, such as those in a Warehouse (Actions constitute inventory movements to optimize
throughput/time).
• Making a humanoid robot walk/run on a difficult terrain (Actions are walking movements to optimize time to destination).
• Manage an investment portfolio (Actions are trades to optimize long-term investment gains).
• Optimal decisions during a football game (Actions are strategic game calls to optimize chances of winning the game).
• Strategy to win an election (Actions constitute political decisions to optimize chances of winning the election).
Some RL
Successes
• Learned the world’s best player of Backgammon (Tesauro 1995)

• Learned acrobatic helicopter autopilots (Ng, Abbeel, Coates et al


2006+)

• Widely used in the placement and selection of advertisements and


pages on the web (e.g., A-B tests)

• Used to make strategic decisions in Jeopardy! (IBM’s Watson 2011)


• Achieved human-level performance on Atari games from pixel-level
visual input, in conjunction with deep learning (Google DeepMind
2015)

• In all these cases, performance was better than could be obtained


by any other method, and was obtained without human instruction
Example: TD-Gammon (Tesauro, 1992-1995)

[Figure: the TD-Gammon network takes the backgammon board position s as input and outputs an estimated state value (≈ probability of winning); action selection is done by a shallow search.]
Start with a random Network

Play millions of games against itself

Learn a value function from this simulated experience

Six weeks later it’s the best player of backgammon in the world
Originally used expert handcrafted features, later repeated with raw board positions
RL + Deep Learning Performance on Atari Games

[Figure: Space Invaders, Breakout, Enduro.]

RL + Deep Learning, applied to classic Atari games
Google DeepMind 2015; Bowling et al. 2012

• Learned to play 49 games for the Atari 2600 game console,


without labels or human input, from self-play and the score alone
[Network: convolution → convolution → fully connected → fully connected, mapping raw screen pixels to predictions of the final score for each of 18 joystick actions, with no hand-engineered input features.]

• Learned to play better than all previous algorithms, and at human level for more than half the games
• The same learning algorithm was applied to all 49 games, without human tuning
Applications: Few trajectories

Epileptic seizure control (our lab: Guez et al., 2008; Barreto et al., 2011, 2012)
[Figure: fraction of seizures vs. fraction of stimulation for LSPI, KBSF and fixed-frequency (0 Hz, 0.5 Hz, 1 Hz, 1.5 Hz) stimulation policies.]

Helicopter control (Ng et al., 2008; our lab: Barreto, Gehring et al., 2013)

Power plant optimization (our lab: Grinberg et al., 2014)
Signature challenges of RL

Evaluative feedback (reward)

Sequentiality, delayed consequences

Need for trial and error, to explore as well as exploit

Non-stationarity

The fleeting nature of time and online data


(Deep) Reinforcement Learning

33
Types of Learning

• Definition: Supervised ML learns using labelled data; Unsupervised ML is trained using unlabelled data without any guidance; Reinforcement ML works by interacting with the environment.
• Type of data: Supervised uses labelled data; Unsupervised uses unlabelled data; Reinforcement has no predefined data.
• Type of problems: Supervised handles regression and classification; Unsupervised handles association and clustering; Reinforcement handles exploitation and exploration.
• Supervision: Supervised needs extra supervision; Unsupervised needs none; Reinforcement needs none.
• Algorithms: Supervised uses linear regression, logistic regression, SVM, KNN, etc.; Unsupervised uses k-means, c-means, Apriori; Reinforcement uses Q-learning, SARSA.
• Aim: Supervised calculates outcomes; Unsupervised discovers underlying patterns; Reinforcement learns a series of actions.
• Applications: Supervised is used for risk evaluation and sales forecasting; Unsupervised for recommendation systems and anomaly detection; Reinforcement for self-driving cars, gaming and healthcare.
34
Characteristics of RL

• No supervision, only a real-valued reward signal
• Decision making is sequential
• Time plays a major role in reinforcement problems
• Feedback is not prompt but delayed
• The data the agent receives next is determined by its own actions

35
Types of Reinforcement learning

• Positive Reinforcement: an event that occurs due to a particular behavior and increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
• Maximizes performance
• Sustains change for a long period of time
• Too much reinforcement can lead to an overload of states, which can diminish the results

36
Types of Reinforcement Learning
Negative Reinforcement: strengthening of behavior because a negative condition is stopped or avoided. Advantages of negative reinforcement learning:
• Increases behavior
• Enforces a minimum standard of performance
• It only provides enough to meet the minimum behavior

37
Elements of Reinforcement Learning

38
Elements of Reinforcement Learning

Beyond the agent and the environment, one can identify four main sub-elements of a reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a model of the environment.

39
Elements of Reinforcement Learning

•Agent
- an entity that tries to learn the best way to perform a specific task.
- In our example, the child is the agent who learns to ride a bicycle.

•Action (A) -
- what the agent does at each time step.
- In the example of a child learning to walk, the action would be “walking”.
- A is the set of all possible moves.
- In video games, the list might include running right or left, jumping high or low,
crouching or standing still.

40
Elements of Reinforcement Learning

•State (S)
- current situation of the agent.
- After performing an action, the agent can move to different states.
- In the example of a child learning to walk, the child can take the action of taking
a step and move to the next state (position).
•Rewards (R)
- feedback that is given to the agent based on its action.
- If the agent's action is good and leads towards winning or a positive outcome, a positive reward is given, and vice versa.

41
Elements of Reinforcement Learning

•Environment
- outside world of an agent or physical world in which the agent operates.
•Discount factor
- The discount factor is multiplied by future rewards as discovered by the agent
in order to dampen these rewards’ effect on the agent’s choice of action. Why?
It is designed to make future rewards worth less than immediate rewards.
Often expressed with the lower-case Greek letter gamma: γ. If γ is .8, and there’s
a reward of 10 points after 3 time steps, the present value of that reward is
0.8³ x 10.

42
Elements of Reinforcement Learning
Formal Definition - Reinforcement learning (RL) is an area of machine
learning concerned with how intelligent agents ought to take actions in
an environment in order to maximize the notion of cumulative reward.

43
Elements of Reinforcement Learning
•Goal of RL - maximize the total amount of rewards or cumulative rewards that are
received by taking actions in given states.
•Notations –
a set of states as S,
a set of actions as A,
a set of rewards as R.
At each time step t = 0, 1, 2, …, some representation of the environment’s state St
∈ S is received by the agent. According to this state, the agent selects an action At
∈ A which gives us the state-action pair (St , At). In the next time step t+1, the
transition of the environment happens and the new state St+1 ∈ S is achieved. At
this time step t+1, a reward Rt+1 ∈ R is received by the agent for the action At taken
from state St.

44
Elements of Reinforcement Learning

45
Mathematical Formulation of Reinforcement
Learning
Before we answer our root question, i.e. how we formulate RL problems mathematically (using MDPs), we need to develop our intuition about:

• The Agent-Environment relationship
• The Markov Property
• Markov Processes and Markov chains
• The Markov Reward Process (MRP)
• The Bellman Equation
• The Markov Decision Process (MDP)
46
Agent-Environment Interface

● Agent - Learner & the decision


maker
● Environment - Everything
outside the agent
● Interaction:
○ The agent performs an action
○ The environment responds by
■ presenting a new situation (change in state)
■ presenting a numerical reward
● Objective (of the interaction):
○ Maximize the return (cumulative reward) over time

Note:
● Interaction occurs in discrete time steps

47
In reinforcement learning (RL), an agent learns through interactions with its environment by a process that involves
experimentation, feedback, and adaptation.

This learning paradigm is fundamentally different from supervised learning, where a model learns from a labeled
dataset.

Instead, RL is about learning what actions to take in various situations to maximize a cumulative reward over time.

48
Markov Decision Processes

• Definition of Markov Decision Processes (MDPs):


• MDPs are mathematical frameworks used in decision-
making situations where outcomes are partly random
and partly under the control of a decision maker.
They are commonly used in areas like robotics,
economics, and game theory.

54
Markov Decision Processes
• Components of MDPs:
• A set of states: These represent all possible scenarios or configurations
in which the system can exist.
• A set of actions: These are the choices available to the decision-maker
in each state.
• State-transition probabilities: For each state and action, this
component provides the probability of landing in a future state. This is
essential for understanding how actions affect the system's state.
• A reward function: This function assigns a numerical reward to each
transition between states. The goal is usually to maximize cumulative
rewards over time.
• A start state: The initial condition or state of the system.
• A terminal state: Some MDPs have an endpoint or terminal state,
which, once reached, ends the decision process.

55
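The components above can be captured directly in code. A tiny sketch of an MDP specification as plain Python data structures; the two-state battery example and all of its numbers are invented purely for illustration:

```python
states = ["high_battery", "low_battery"]
actions = ["search", "wait"]

# State-transition probabilities: P[s][a] = list of (probability, next_state);
# the probabilities for each (s, a) pair sum to 1.
P = {
    "high_battery": {"search": [(0.7, "high_battery"), (0.3, "low_battery")],
                     "wait":   [(1.0, "high_battery")]},
    "low_battery":  {"search": [(0.6, "high_battery"), (0.4, "low_battery")],
                     "wait":   [(1.0, "low_battery")]},
}

# Reward function: R[s][a] = expected immediate reward for taking action a in state s.
R = {
    "high_battery": {"search": 2.0, "wait": 0.5},
    "low_battery":  {"search": -1.0, "wait": 0.5},
}
```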
Markov Decision Processes

• Model Dynamics:
• Also referred to as the dynamics of the model, these describe how the
system evolves in response to various actions taken in different states,
according to the transition probabilities.

56
Markov Decision Processes

● An MDP is defined by
○ A set of states
○ A set of actions
○ State-transition probabilities
■ The probability of arriving in state s' after performing action a in state s
■ Also called the model dynamics
○ A reward function
■ The utility gained from arriving in s' after performing a in s
■ Sometimes just R(s, a) or even R(s)
○ A start state
○ Maybe a terminal state

57
Markov Decision Processes

[Slides 58-61: figures only.]
Markov Decision Processes
1. Value Iteration

• Value Iteration is a fundamental method in reinforcement


learning for solving Markov Decision Processes (MDPs).

• It helps to determine the optimal policy by systematically


improving the estimate of the value function for each state,
which indicates the best possible return from that state.

62
Markov Decision Processes
1. Value Iteration

[Slides 63-65: figures only.]
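A minimal sketch of Value Iteration over the toy MDP dictionaries sketched earlier (states, actions, P, R); the discount factor and stopping threshold are arbitrary illustrative choices, not values prescribed by the course:

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}                      # initial value estimates
    while True:
        delta = 0.0
        for s in states:
            # one-step lookahead: evaluate every action from state s
            q_values = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                        for a in actions]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                               # Bellman optimality backup
        if delta < theta:                             # values have converged
            break
    # extract the greedy (optimal) policy from the converged values
    policy = {s: max(actions,
                     key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
              for s in states}
    return V, policy
```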
Markov Decision Processes
2. Policy Iteration

Policy Iteration is a method used in reinforcement learning and dynamic


programming to find the optimal policy for a Markov Decision Process
(MDP). It alternates between two main phases: policy evaluation and
policy improvement. This iterative process continues until the policy is
stable, meaning it no longer changes, and thus is optimal.

66
Markov Decision Processes
2. Policy Iteration

67
Markov Decision Processes
2. Policy Iteration

68
Markov Decision Processes
2. Policy Iteration

69
Markov Decision Processes
2. Policy Iteration

70
Markov Decision Processes
2. Policy Iteration

Policy evaluation thus involves a systematic computation of the


expected returns for each state under a fixed policy, utilizing the
model's dynamics (probabilities and rewards) and a recursive
updating process based on the Bellman expectation equation. This
step is crucial for assessing the effectiveness of a given policy and
setting the stage for policy improvement.

71
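A minimal sketch of Policy Iteration, alternating iterative policy evaluation (Bellman expectation backups) with greedy policy improvement, again over the toy MDP dictionaries used earlier; the hyperparameters are illustrative:

```python
def policy_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    policy = {s: actions[0] for s in states}          # start from an arbitrary policy
    V = {s: 0.0 for s in states}
    while True:
        # 1. Policy evaluation: estimate V^pi by repeated Bellman expectation backups
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v_new = R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # 2. Policy improvement: act greedily with respect to the evaluated values
        stable = True
        for s in states:
            best_a = max(actions,
                         key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
            if best_a != policy[s]:
                policy[s], stable = best_a, False
        if stable:                                    # the policy no longer changes: optimal
            return V, policy
```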
Markov Decision Processes
3. Monte Carlo Techniques

Monte Carlo methods are a class of algorithms that rely on repeated random
sampling to obtain numerical results, typically used to solve complex
mathematical and computational problems that are deterministic in principle
but for which analytical solutions are infeasible or unknown. In the context of
reinforcement learning, Monte Carlo (MC) methods are used to estimate value
functions and improve policies based on experience sampled from actual or
simulated interaction with the environment.

72
Markov Decision Processes
3. Monte Carlo Techniques
In Monte Carlo methods for reinforcement learning, specifically in the
context of policy evaluation, the approach is centered around estimating the
value of states based on the returns from episodes of the environment.

73
Markov Decision Processes
3. Monte Carlo Techniques

74
Markov Decision Processes

[Slides 75-109: worked examples (figures only). The gridworld example on slide 107 uses k = 0, noise = 0.2, discount = 0.9, and living reward = 0.]
Markov Decision Processes
3. Monte Carlo Techniques

Importance of This Step


This procedure of focusing on the first visit to each state ensures that the estimates of state values are unbiased by the specific paths taken in individual episodes. It is essential for accurately evaluating how good it is to be in a state s under a given policy, especially in stochastic environments where different paths might yield different rewards. The first-visit method isolates the effect of each initial visit, providing a clear and unbiased snapshot of the potential future returns from each state.
110
Markov Decision Processes
3. Monte Carlo Techniques

By accumulating these returns and averaging them, Monte Carlo


methods efficiently use actual experiences to converge to the true
value estimates, thereby enabling effective policy evaluation without
needing a model of the environment's dynamics.

111
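A minimal sketch of first-visit Monte Carlo policy evaluation. The episode-sampling function is a hypothetical stand-in for rolling out the fixed policy in some environment, and the episode format (a list of (state, reward) pairs) is an assumption made for the sketch:

```python
from collections import defaultdict

def first_visit_mc(sample_episode, num_episodes=1000, gamma=0.9):
    """Estimate V(s) for a fixed policy from sampled episodes."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = sample_episode()           # [(s_0, r_1), (s_1, r_2), ...]
        # compute the discounted return G_t following every time step t
        G, returns_at = 0.0, {}
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns_at[t] = G
        # average returns over the FIRST visit to each state only
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state in seen:
                continue
            seen.add(state)
            returns_sum[state] += returns_at[t]
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V
```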
Markov Decision Processes
4. Q Learning

Q-learning is a model-free reinforcement learning algorithm that is used


to learn the value of an action in a particular state without requiring a
model of the environment. It is part of a larger class of reinforcement
learning algorithms that operate on the principle of learning from
delayed rewards to maximize the total expected reward.

112
Markov Decision Processes
4. Q Learning

[Slides 113-116: figures only.]
Markov Decision Processes
4. Q Learning

Applications
Q-learning has been successfully applied in various domains, including robotics,
automated control, economics, and gaming, where the decision-making process
must be optimized based on sequential interactions with an environment.

In summary, Q-learning is a fundamental reinforcement learning algorithm that


builds an optimal policy by iteratively updating Q-values based on the rewards
received through interactions with the environment.

Its ability to learn optimal actions without prior knowledge of the environment's
dynamics makes it a powerful tool in the field of artificial intelligence.

117
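A minimal sketch of tabular Q-learning with an epsilon-greedy behaviour policy. It assumes a Gym-style environment with discrete, hashable states and the classic API (env.reset(), env.step()); the hyperparameter values are illustrative defaults only:

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                   # Q[(state, action)], initialized to 0
    n_actions = env.action_space.n
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            # Q-learning update: bootstrap from the greedy action in the next state
            best_next = max(Q[(next_state, a)] for a in range(n_actions))
            td_target = reward + gamma * best_next * (not done)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state = next_state
    return Q
```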
● State:
○ raw pixels
● Actions:
○ game controls
● Reward:
○ change in score
● State-transition probabilities:
○ defined by stochasticity in game evolution

Ref: "Playing Atari with Deep Reinforcement Learning", Mnih et al., 2013

118
Elements of Reinforcement Learning
The Agent-Environment Relationship

Agent: software programs that make intelligent decisions; they are the learners in RL. These agents interact with the environment through actions and receive rewards based on their actions.

Environment: the demonstration of the problem to be solved. We can have a real-world environment or a simulated environment with which our agent will interact.

State: the position of the agent at a specific time step in the environment. Whenever the agent performs an action, the environment gives the agent a reward and a new state, reached by performing that action.

Anything that the agent cannot change arbitrarily is considered to be part of the environment. In simple terms, actions can be any decisions we want the agent to learn, and a state can be anything which is useful in choosing actions.

119
Elements of RL: The Markov Property
Transition : Moving from one state to another is called Transition.

Transition Probability: The probability that the agent will move from one state to another
is called transition probability.

The Markov Property states that: "The future is independent of the past given the present."

Mathematically we can express this statement as :

120
Elements of RL: The Markov Property

• S[t] : denotes the current state of the agent and s[t+1] denotes the next state.

• What this equation means is that the transition from state S[t] to S[t+1] is entirely independent of
the past.

• So, the RHS of the Equation means the same as LHS if the system has a Markov Property.

• Intuitively meaning that our current state already captures the information of the past states.

121
Elements of RL: State Transition Probability

• Now that we know about transition probability, we can define the state transition probability as follows:

• For a Markov state, the probability of transitioning from S[t] to S[t+1], i.e. to any successor state, is given by

We can arrange the state transition probabilities into a state transition probability matrix:

122
Elements of Reinforcement Learning

Each row in the matrix represents the probability of moving from our original or starting state to any successor state. The sum of each row is equal to 1.

123
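A small sketch of such a matrix in code, using three illustrative states (Sleep, Run, Ice-cream, as in the sample chain shown later); the probabilities are made up, but each row sums to 1 as required:

```python
import numpy as np

states = ["Sleep", "Run", "Ice-cream"]
P = np.array([
    [0.2, 0.6, 0.2],   # transition probabilities from Sleep
    [0.1, 0.6, 0.3],   # from Run
    [0.2, 0.7, 0.1],   # from Ice-cream
])
assert np.allclose(P.sum(axis=1), 1.0)   # every row is a probability distribution
```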
Elements of RL:Markov Process or Markov Chains
• A Markov Process is a memoryless random process, i.e. a sequence of random states S[1], S[2], ..., S[n] with the Markov Property.

• So, it is basically a sequence of states with the Markov Property.

• It can be defined using a set of states (S) and a transition probability matrix (P).

• The dynamics of the environment can be fully defined using the states (S) and the transition probability matrix (P).

124
Elements of Reinforcement Learning
Some samples from the chain :

Sleep — Run — Ice-cream — Sleep


Sleep — Ice-cream — Ice-cream — Run

Hopefully it is now clear why a Markov process is called a random set of sequences.

125
Elements of RL: Reward and Returns
• Rewards are the numerical values that the agent receives on performing some action at some
state(s) in the environment.

• The numerical value can be positive or negative based on the actions of the agent.

• In reinforcement learning, we care about maximizing the cumulative reward (all the rewards the agent receives from the environment) instead of only the reward the agent receives in the current state (the immediate reward).

• This total sum of rewards the agent receives from the environment is called the return.

126
Elements of Reinforcement Learning

• r[t+1] is the reward received by the agent at time step t[0] while performing an action (a) to move from one state to another.

• Similarly, r[t+2] is the reward received by the agent at time step t[1] by performing an action to move to another state.

• And r[T] is the reward received by the agent at the final time step by performing an action to move to another state.

127
Elements of RL: Episodic and Continuous Tasks
Episodic Tasks: tasks that have a terminal state (end state). We can say they have finitely many states. For example, in racing games, we start the game (start the race) and play it until the game is over (the race ends!).

This is called an episode. Once we restart the game it will start from an initial state, and hence every episode is independent.

Continuous Tasks: tasks that have no end, i.e. they have no terminal state. These types of tasks never end. For example, learning how to code!

Now, it is easy to calculate the returns for episodic tasks as they eventually end, but what about continuous tasks, which go on forever?

The return would sum up to infinity! So how do we define returns for continuous tasks?

128
Elements of RL: Discount factor(ɤ).
• This is where we need Discount Factor (ɤ):

• It determines how much importance is given to immediate versus future rewards. This basically helps us avoid an infinite return in continuous tasks.

• It has a value between 0 and 1. A value close to 0 means more importance is given to the immediate reward, and a value close to 1 means more importance is given to future rewards.

• In practice, a discount factor of 0 will never learn beyond immediate rewards, and a discount factor of 1 weighs all future rewards fully, which may lead to an infinite return. Therefore, the discount factor typically lies between 0.2 and 0.8.

• So, we can define the return using the discount factor as follows (let's call this equation 1, as we will use it later to derive the Bellman Equation). A small sketch of this computation follows below.

129
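A minimal sketch of computing the discounted return of equation 1, G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ..., for a finite list of future rewards; the example numbers echo the earlier slide where a reward of 10 discounted three steps is worth 0.8³ x 10:

```python
def discounted_return(rewards, gamma):
    """rewards = [R_{t+1}, R_{t+2}, ...]; returns G_t."""
    G = 0.0
    for k, r in enumerate(rewards):
        G += (gamma ** k) * r     # each later reward is dampened by another factor of gamma
    return G

print(discounted_return([0, 0, 0, 10], gamma=0.8))   # 0.8**3 * 10 = 5.12
```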
Elements of RL: Discount factor (γ)

[Slides 130-132: figures only.]
Elements of RL: Discount factor(ɤ).
Practical Implications:

Decision Making: In practical scenarios like financial investment, ecological management, or strategic planning,
discounting affects how future benefits are weighed against immediate costs. For instance, in environmental policy, a
lower discount rate might justify significant upfront costs for long-term sustainability benefits.

Policy Evaluation in RL: In RL, the discount factor affects how an agent evaluates its policy. An agent with a high
discount factor will behave more conservatively, trying to ensure good long-term outcomes, while an agent with a low
discount factor may take riskier actions that promise immediate rewards.

In summary, the discount factor 𝛾 in reinforcement learning serves multiple purposes: it reflects time preference for
rewards, aids in algorithmic convergence, affects policy development, and impacts the agent's performance and
decision-making strategy significantly.

Choosing the right value of 𝛾 is crucial for balancing between immediate and future rewards, ensuring both effective
learning and optimal decision-making.

133
Elements of Reinforcement Learning
Let's understand it with an example. Suppose you live in a place with water scarcity, and someone offers to give you 100 liters of water (please assume!) every hour for the next 15 hours, discounted each hour by some parameter (γ).

Let's look at two possibilities (call this equation 1, as we will use it later to derive the Bellman Equation). First, with discount factor (γ) 0.8:

This means that we should wait until the 15th hour, because the decrease is not very significant, so it is still worth going until the end.

This means that we are also interested in future rewards. So, if the discount factor is close to 1, we will make an effort to go to the end, as the rewards are of significant importance.

134
Elements of Reinforcement Learning
Second, with discount factor (γ) 0.2:

This means that we are more interested in early rewards, as the rewards get significantly lower each hour. So we might not want to wait until the end (the 15th hour), as it will be worthless. If the discount factor is close to zero, immediate rewards are more important than future ones. A quick sketch of both cases follows below.

135
Markov Reward Process
Till now we have seen how a Markov chain defines the dynamics of an environment using a set of states (S) and a transition probability matrix (P).

But we know that reinforcement learning is all about the goal of maximizing reward.

So, let's add rewards to our Markov chain. This gives us a Markov Reward Process.

Markov Reward Process: As the name suggests, MRPs are Markov chains with a value judgement. Basically, we get a value for every state our agent is in.

Mathematically, we define a Markov Reward Process as:

136
Elements of Reinforcement Learning
What this equation means is how much reward (R_s) we get from a particular state S[t]. It tells us the immediate reward from the particular state our agent is in. As we will see later, we want to maximize these rewards from each state our agent is in; in simple terms, maximize the cumulative reward we get from each state.

We define an MRP as (S, P, R, γ), where:

S is a set of states,
P is the transition probability matrix,
R is the reward function we saw earlier,
γ is the discount factor

137
Markov Decision Process

Now, let’s develop our intuition for Bellman Equation and Markov Decision Process.

Policy Function and Value Function

138
Markov Decision Process
Now, let’s develop our intuition for Bellman Equation and Markov Decision Process.

Policy Function and Value Function

Value Function: determines how good it is for the agent to be in a particular state. Of course, how good it is to be in a particular state depends on the actions the agent will take (a possible set of actions, out of which we have to choose the best one). This is where the policy comes in.

A policy defines what action, or series of actions, to perform in a particular state s.

139
Markov Decision Processes
• Formal Specification and example
• Policies and Optimal Policy
• Intro to Value Iteration
Combining ideas for Stochastic planning
• What is a key limitation of decision networks?
Represent (and optimize) only a fixed number of
decisions

• What is an advantage of Markov models?


The network can extend indefinitely

Goal: represent (and optimize) an indefinite


sequence of decisions
CPSC 422, Lecture 2 Slide 141
Decision Processes
Often an agent needs to go beyond a fixed set of decisions
– Examples?
• Would like to have an ongoing decision process

Infinite horizon problems: process does not stop

Indefinite horizon problem: the agent does not know


when the process may stop

 Finite horizon: the process must end at a given time N

CPSC 422, Lecture 2 Slide 142


Markov Models

[Figure: the family of Markov models: Markov Chains, Hidden Markov Models, Markov Decision Processes (MDPs), and Partially Observable Markov Decision Processes (POMDPs).]
CPSC 422, Lecture 3 Slide 143


Example MDP: Scenario and Actions

Agent moves in the above grid via actions Up, Down, Left, Right
Each action has:
• 0.8 probability to reach its intended effect
• 0.1 probability to move at right angles of the intended direction
• If the agent bumps into a wall, it stays there
How many states?
There are two terminal states (3,4) and (2,4)
CPSC 422, Lecture 3 Slide 144
Example MDP: Rewards

CPSC 422, Lecture 3 Slide 145


Example MDP: Underlying info structures
Four actions Up, Down, Left, Right
Eleven States: {(1,1), (1,2)…… (3,4)}


CPSC 422, Lecture 3 Slide 146


Example MDP: Sequence of actions

Can the sequence [Up, Up, Right, Right, Right ] take the
agent in terminal state (3,4)?

Can the sequence reach the goal in any other way?

CPSC 422, Lecture 3 Slide 147


MDPs: Policy
• The robot needs to know what to do as the decision process unfolds…
• It starts in a state, selects an action, ends up in another state, selects another action, and so on.

• Needs to make the same decision over and over: Given the current
state what should I do?

• So a policy for an MDP is a


single decision function π(s)
that specifies what the agent
should do for each state s

CPSC 422, Lecture 3 Slide 148


How to evaluate a policy
A policy can generate a set of state sequences with different
probabilities

Each state sequence has a corresponding reward. Typically the


(discounted) sum of the rewards for each state in the sequence

CPSC 422, Lecture 3 Slide 149


MDPs: expected value/total reward of a policy
and optimal policy
Each sequence of states (environment history) associated with a
policy has
• a certain probability of occurring
• a given amount of total reward as a function of the rewards of its individual
states
Expected value /total reward

For all the sequences of states generated


by the policy

Optimal policy is the policy that maximizes expected total reward


CPSC 422, Lecture 3 Slide 150
Discounted Reward Function
 Suppose the agent goes through states s1, s2, ..., sk and receives rewards r1, r2, ..., rk

 We will look at the discounted reward to define the reward for this sequence, i.e. its utility for the agent

 Discount factor γ, 0 ≤ γ ≤ 1; Rmax is a bound on R(s) for every s

U[s1, s2, s3, ...] = r1 + γ·r2 + γ²·r3 + ...
                   = Σ_{i=0..∞} γ^i · r_{i+1}
                   ≤ Σ_{i=0..∞} γ^i · Rmax = Rmax / (1 - γ)

CPSC 422, Lecture 3 Slide 151


MDPs: Policy
• A policy is a simple function that defines a probability distribution over actions (a ∈ A) for each state (s ∈ S).
• If an agent at time t follows a policy π, then π(a|s) is the probability that the agent takes action a at time step t when in state s.
• In reinforcement learning, the experience of the agent determines the change in policy. Mathematically, a policy is defined as follows:

CPSC 422, Lecture 3 Slide 152


MDPs: Policy
Now, how do we find the value of a state? The value of state s when the agent is following a policy π, denoted vπ(s), is the expected return starting from s and following the policy π for the next states until we reach the terminal state.

We can formulate this as follows (this function is also called the State-value Function):

This equation gives us the expected return starting from state s and going to successor states thereafter, following policy π. One thing to note: the returns we get are stochastic, whereas the value of a state is not; it is the expectation of the returns from start state s onwards. Also note that the value of the terminal state (if there is any) is zero. Let's look at an example:
MDPs: Policy

Suppose our start state is Class 2, and we move to Class 3, then Pass, then Sleep. In short: Class 2 > Class 3 > Pass > Sleep.
MDPs: Policy

[A sequence of slides with worked examples (figures only).]

The advantage function describes how much better it is to take action a instead of following policy π:
the advantage of choosing action a over the default action.
MDPs: Policy
Mathematically, we can define the Bellman Equation as:

We want to know the value of state s. The value of state s is the reward we get upon leaving that state, plus the discounted value of the state we land on, multiplied by the transition probability that we will move into it.
Mathematically, we can define Bellman Equation as :

Where v is the value of the state we were in, which equals the immediate reward plus the discounted value of the next state multiplied by the probability of moving into that state.

The running time complexity of this direct computation is O(n³), so it is clearly not a practical solution for larger MRPs (the same holds for MDPs). Later we will look at more efficient methods such as Dynamic Programming (value iteration and policy iteration), Monte Carlo methods, and TD learning. A small sketch of the direct O(n³) solution follows below.
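A minimal sketch of that direct O(n³) solution, solving the Bellman equation for an MRP in closed form as v = (I - γP)⁻¹ R; the transition matrix and rewards reuse the illustrative three-state chain from earlier:

```python
import numpy as np

P = np.array([[0.2, 0.6, 0.2],      # transition matrix (rows sum to 1)
              [0.1, 0.6, 0.3],
              [0.2, 0.7, 0.1]])
R = np.array([1.0, -2.0, 10.0])     # made-up immediate reward for each state
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P v  =>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)   # cost is cubic in the number of states
print(v)
```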
What is a Markov Decision Process?
Markov Decision Process: it is a Markov Reward Process with decisions. Everything is the same as in an MRP, but now we have an actual agent that makes decisions or takes actions.

• Till now we have talked about getting a reward (r) when our agent goes through a set of states (s)
following a policy π.

• Actually, in Markov Decision Process(MDP) the policy is the mechanism to take decisions. So now
we have a mechanism that will choose to take an action.

• Policies in an MDP depend on the current state. They do not depend on history. That’s the Markov
Property. So, the current state we are in characterizes history.

• We have already seen how good it is for the agent to be in a particular state(State-value function).

• Now, let’s see how good it is to take a particular action following a policy π from state s (Action-
Value Function).
State-action value function or Q-Function
This function specifies how good it is for the agent to take action (a) in a state (s) with a policy π.

Mathematically, we can define the State-action value function as :

Basically, it tells us the value of performing a certain action(a) in a state(s) with a


policy π.
State-action value function or Q-Function
State-action value function or Q-Function

The above equation tells us that the value of a particular state is determined by the immediate
reward plus the value of successor states when we are following a certain policy(π).

Similarly, we can express our state-action Value function (Q-Function) as follows :

From the above equation, we can see that the State-Action Value of a state can be
decomposed into the immediate reward we get on performing a certain action in state(s) and
moving to another state(s’) plus the discounted value of the state-action value of the state(s’)
with respect to the some action(a) our agent will take from that state on-wards.
State-action value function or Q-Function

This backup diagram describes the value of being in a particular state. From state s there is some probability of taking each of the actions, and there is a Q-value (state-action value) for each action. We average the Q-values, which tells us how good it is to be in that particular state.

This backup diagram says that suppose we start off by taking some action (a). Because of that action, the agent might be taken by the environment to any of these states. Therefore, we are asking the question: how good is it to take action (a)?
State-action value function or Q-Function
Mathematically, we can define this as follows :
State-action value function or Q-Function
State-action value function or Q-Function
From the above diagram: if our agent is in some state s, suppose it can take two actions, after which the environment might take the agent to any of the states s'.

Note that the probability of the action our agent takes from state s is weighted by our policy, and after taking that action, the probability that we land in any of the states s' is weighted by the environment.

Now our question is: how good is it to be in state s after taking some action, landing in another state s', and following our policy π after that?

It is similar to what we have done before: we average the values of the successor states s' with the transition probabilities P, weighted by our policy.
State-action value function or Q-Function

It is very similar to what we did for the state-value function, just the other way around: this diagram says that our agent takes some action a, because of which the environment might land us in any of the states s; then from that state we can choose any action a', weighted by the probability given by our policy π. Again, we average them together, and that tells us how good it is to take a particular action while following a particular policy π all along.
State-action value function or Q-Function

Optimal State-Value Function :It is the maximum Value function over all policies.

Optimal State-Action Value Function: It is the maximum action-value function over all policies.
State-action value function or Q-Function
Sketch of ideas to find the optimal policy for a
MDP (Value Iteration)
We first need a couple of definitions
• Vπ(s): the expected value of following policy π in state s
• Qπ(s, a), where a is an action: the expected value of performing a in s, and then following policy π.
We have, by definition:

Qπ(s, a) = R(s) + γ · Σ_{s'} P(s' | s, a) · Vπ(s')

where R(s) is the reward obtained in s, the sum ranges over the states s' reachable from s by doing a, P(s' | s, a) is the probability of getting to s' from s via a, Vπ(s') is the expected value of following policy π in s', and γ is the discount factor.

CPSC 422, Lecture 3                                   Slide 191
Value of a policy and Optimal policy
We can also compute Vπ(s) in terms of Qπ(s, a):

Vπ(s) = Qπ(s, π(s))

where π(s) is the action indicated by π in s: the expected value of following π in s equals the expected value of performing the action indicated by π in s and following π after that.

For the optimal policy π* we also have

Vπ*(s) = Qπ*(s, π*(s))

CPSC 422, Lecture 3 Slide 192


Value of Optimal policy

Vπ*(s) = Qπ*(s, π*(s))

Remember, for any policy π:

Qπ(s, π(s)) = R(s) + γ · Σ_{s'} P(s' | s, π(s)) · Vπ(s')

But the optimal policy π* is the one that picks, in each state, the action that maximizes the future reward:

Qπ*(s, π*(s)) = R(s) + γ · max_a Σ_{s'} P(s' | s, a) · Vπ*(s')

So:

Vπ*(s) = R(s) + γ · max_a Σ_{s'} P(s' | s, a) · Vπ*(s')

CPSC 422, Lecture 3                                   Slide 193
Elements of Reinforcement Learning

[Slides 196-204: figures only.]
Elements of Reinforcement Learning

•Policy (π)

- The policy in RL decides which action the agent will take in the current state.
- It tells the probability that an agent will select a specific action in a specific state.
- A policy is a function that maps a given state to the probabilities of selecting each possible action from that state.

• If at time t an agent follows policy π, then π(a|s) is the probability that the action at time step t is At = a given that the state at time step t is St = s. In other words, the probability that an agent takes action a in state s at time t under policy π is π(a|s).

205
Elements of Reinforcement Learning

•Value Functions
- a simple measure of how good it is for an agent to be in a given state, or how
good it is for the agent to perform a given action in a given state.
•Two types
- state- value function
- action-value function

206
Elements of Reinforcement Learning
•State-value function
- The state-value function for policy π denoted as vπ determines the goodness of
any given state for an agent who is following policy π.
- This function gives us the value which is the expected return starting from state
s at time step t and following policy π afterward.

207
Elements of Reinforcement Learning
•Action value function
- determines the goodness of the action taken by the agent from a given state for
policy π.
- This function gives the value which is the expected return starting from state s
at time step t, with action a, and following policy π afterward.
- The output of this function is also called the Q-value, where Q stands for Quality. Note that in the state-value function we did not consider the action taken by the agent.

208
Elements of Reinforcement Learning

•Model of the environment


- mimics the behavior of the environment, or more generally, that allows
inferences to be made about how the environment will behave.
- For example, given a state and action, the model might predict the resultant
next state and next reward.
- Models are used for planning

209
(Deep) Reinforcement Learning

210
From OpenAI
Advantages of Reinforcement Learning
- solves very complex problems that cannot be solved by conventional techniques
- achieves long-term results
- the model can correct errors that occurred during the training process
- in the absence of a training dataset, it is bound to learn from its own experience
- can be useful when the only way to collect information about the environment is to interact with it
- Reinforcement learning algorithms maintain a balance between exploration and
exploitation. Exploration is the process of trying different things to see if they
are better than what has been tried before. Exploitation is the process of trying
the things that have worked best in the past. Other learning algorithms do not
perform this balance

211
An example scenario - Tic-Tac-Toe

Two players take turns playing on a three-by-three board. One player plays Xs and
the other Os until one player wins by placing three marks in a row, horizontally,
vertically, or diagonally
Assumptions
- playing against an imperfect player, one whose play is sometimes incorrect and
allows you to win
Aim
How might we construct a player that will find the imperfections in its
opponent’s play and learn to maximize its chances of winning?

212
(Deep) Reinforcement Learning: Tic-Tac-Toe

[Slides 213-224: step-by-step Tic-Tac-Toe illustrations (figures only). From OpenAI.]
Reinforcement Learning for Tic-Tac-Toe

•We set up a table of numbers, one for each possible state of the game. Each
number will be the latest estimate of the probability of our winning from that state.
•We treat this estimate as the state’s value, and the whole table is the learned
value function.
•State A has higher value than state B, or is considered better than state B, if the
current estimate of the probability of our winning from A is higher than it is from B.

225
Reinforcement Learning for Tic-Tac-Toe

● Assuming we always play Xs: a state with three Xs in a row has value (winning probability) 1, a state with three Os in a row has value 0, and every other state is initialized to 0.5.
● Play many games against the opponent. To select our moves we examine the
states that would result from each of our possible moves and look up their
current values in the table.
● Most of the time we move greedily, selecting the move that leads to the state
with greatest value, i.e, with the highest estimated probability of winning.
● Occasionally, however, we select randomly from among the other moves instead.
These are called exploratory moves because they cause us to experience states
that we might otherwise never see.

226
Reinforcement Learning for Tic-Tac-Toe
•The solid lines represent the moves
taken during a game; the dashed lines
represent moves that we (our
reinforcement learning player)
considered but did not make.
•Our second move was an exploratory
move, meaning that it was taken even
though another sibling move, the one
leading to e∗, was ranked higher.

227
Reinforcement Learning for Tic-Tac-Toe

● Once a game is started, our agent computes all possible actions it can take in
the current state and the new states which would result from each action.
● The values of these states are collected from a state_value vector, which
contains values for all possible states in the game.
● The agent can then choose the action which leads to the state with the highest value (exploitation), or choose a random action (exploration), depending on the value of epsilon. A minimal sketch of this move-selection rule follows below.

228
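A minimal sketch of this move-selection rule. The state_value table, the possible_moves(state) helper and the state encoding are hypothetical placeholders for whatever representation the full agent uses:

```python
import random

def choose_move(state, state_value, possible_moves, epsilon=0.1):
    moves = possible_moves(state)        # list of (action, resulting_state) pairs
    if random.random() < epsilon:
        return random.choice(moves)      # exploratory move: try something at random
    # greedy move: pick the action whose resulting state has the highest estimated value
    return max(moves, key=lambda m: state_value.get(m[1], 0.5))   # 0.5 = initial estimate
```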
Thank you

229
