Module 01
Introduction
“Learning from interaction is a foundational idea underlying nearly
all theories of learning and intelligence”.
Source : https://idapgroup.com/blog/types-of-machine-learning-out-there/
Machine Learning (ML) Types
• Supervised Learning:
• In supervised learning, the algorithm is trained on a labeled dataset, which
means the input data is paired with corresponding output labels.
• The goal is for the algorithm to learn a mapping from inputs to outputs,
allowing it to make predictions or decisions on new, unseen data.
• Algorithms : Linear Regression, Logistic Regression, Decision Trees, Random
Forest, Support Vector Machines (SVM), Naive Bayes, K-Nearest Neighbors
(KNN), Neural Networks (Deep Learning), Gradient Boosting Algorithms (e.g.,
XGBoost, LightGBM), Linear Discriminant Analysis (LDA)
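To make the idea concrete, here is a minimal supervised-learning sketch using one of the algorithms listed above (Logistic Regression). It assumes scikit-learn is installed; the data is illustrative, not from the slides.

```python
# Minimal supervised-learning sketch (illustrative data, assumes scikit-learn).
# The inputs X are paired with labels y; the model learns the input-to-output
# mapping and then predicts for new, unseen data.
from sklearn.linear_model import LogisticRegression

X = [[0.0], [1.0], [2.0], [3.0]]   # input features
y = [0, 0, 1, 1]                   # corresponding output labels

model = LogisticRegression().fit(X, y)
print(model.predict([[1.5]]))      # prediction for an unseen input
```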
Machine Learning (ML) Types
• Unsupervised Learning:
• Unsupervised learning involves training an algorithm on an unlabeled dataset,
where the algorithm needs to find patterns, relationships, or structures in the
data without explicit guidance.
• Clustering and dimensionality reduction are common tasks in unsupervised
learning.
• Algorithms : K-Means Clustering, Hierarchical Clustering, DBSCAN (Density-
Based Spatial Clustering of Applications with Noise), Principal Component
Analysis (PCA), Independent Component Analysis (ICA), Autoencoders,
Generative Adversarial Networks (GANs), t-Distributed Stochastic Neighbor
Embedding (t-SNE), Apriori Algorithm, Mean Shift
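As a contrast, a minimal unsupervised sketch with K-Means, one of the algorithms listed above (again assuming scikit-learn; the data is illustrative): no labels are given, and the algorithm finds cluster structure on its own.

```python
# Minimal unsupervised-learning sketch (illustrative data, assumes scikit-learn).
# No labels are provided; K-Means discovers the grouping structure by itself.
from sklearn.cluster import KMeans

X = [[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]]  # unlabeled points

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment found for each point
print(km.cluster_centers_)  # learned cluster centers
```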
Machine Learning (ML) Types
• Reinforcement Learning:
• Reinforcement learning is about training an agent to make
sequential decisions by interacting with an environment.
• The agent receives feedback in the form of rewards or
punishments based on its actions, and it learns to optimize its
behavior over time.
• Algorithms : Q-Learning, Deep Reinforcement Learning
RL Involves
• Optimization
• Delayed consequences
• Exploration
• Generalization
RL Involves
• Optimization
• Optimization in reinforcement learning refers to the process of
finding the best policy or strategy that maximizes the cumulative
reward over time.
• Goal is to find an optimal way to make decisions
• Yielding best outcomes or at least very good outcomes
• Explicit notion of utility of decisions
• Example: finding the minimum-distance route between two cities, given a
network of roads
RL Involves
• Delayed Consequences
• Delayed consequences highlight the challenge in RL where actions may
not immediately lead to rewards, requiring the agent to consider long-
term effects and plan accordingly.
• Decisions now can impact things much later...
• Choosing an elective subject.
• Introduces two challenges
• When planning: decisions involve reasoning about not just immediate benefit of
a decision but also its longer term ramifications
• When learning: temporal credit assignment is hard (what caused later high or
low rewards?)
RL Involves
• Exploration
• Exploration is the agent's strategy of trying out different actions to discover optimal
policies, balancing the need for exploiting known good actions and exploring
potentially better ones.
• Learning about the world by making decisions
• Agent as scientist
• Learn to ride a bike by trying (and failing)
• Censored data
• Only get a reward for decision made
• Decisions impact what we learn about
• If we choose to go to X college instead of Y, we will have different later
experiences.
RL Involves
• Generalization
• Generalization involves the ability of a reinforcement learning
agent to apply learned knowledge from specific experiences to
new, unseen situations, enhancing its adaptability and efficiency.
Comparison
                     | AI Planning | SL | UL | RL | IL
Optimization         |      X      |    |    | X  |
Generalization       |             | X  | X  | X  | X
Delayed consequences |      X      |    |    | X  |
Exploration          |             |    |    | X  |
(SL: Supervised Learning, UL: Unsupervised Learning, RL: Reinforcement Learning, IL: Imitation Learning)
Examples of RL
• A master chess player makes a move.
• The choice is informed both by planning (anticipating possible replies and
counterreplies) and by immediate, intuitive judgments of the desirability of
particular positions and moves.
• An adaptive controller adjusts parameters of a petroleum refinery’s
operation in real time.
• The controller optimizes the yield/cost/quality trade-off on the basis of
specified marginal costs without sticking strictly to the set points originally
suggested by engineers.
Examples of RL
• A gazelle calf struggles to its feet minutes after being born.
• Half an hour later it is running at 20 miles per hour.
• A mobile robot decides whether it should enter a new room in search
of more trash to collect or start trying to find its way back to its
battery recharging station.
• It makes its decision based on the current charge level of its battery and how
quickly and easily it has been able to find the recharger in the past.
• Phil prepares his breakfast.
Examples of RL : Preparing Breakfast
• Closely examined, even this apparently mundane activity reveals a
complex web of conditional behavior and interlocking goal–subgoal
relationships: walking to the cupboard, opening it, selecting a cereal
box, then reaching for, grasping, and retrieving the box.
• Other complex, tuned, interactive sequences of behavior are required
to obtain a bowl, spoon, and milk carton.
• Each step involves a series of eye movements to obtain information
and to guide reaching and locomotion.
Examples of RL : Preparing Breakfast
• Rapid judgments are continually made about how to carry the objects
or whether it is better to ferry some of them to the dining table
before obtaining others.
• Each step is guided by goals, such as grasping a spoon or getting to
the refrigerator, and is in service of other goals, such as having the
spoon to eat with once the cereal is prepared and ultimately
obtaining nourishment.
• Whether he is aware of it or not, Phil is accessing information about
the state of his body that determines his nutritional needs, level of
hunger, and food preferences.
Exploration and Exploitation
Elements of Reinforcement Learning
Elements of Reinforcement Learning
• Policy
• Reward
• Value Function
• Model
Elements of RL : Policy
• A policy defines the learning agent’s way of behaving at a given time.
• Roughly speaking, a policy is a mapping from perceived states of the
environment to actions to be taken when in those states.
• In some cases the policy may be a simple function or lookup table,
whereas in others it may involve extensive computation such as a
search process.
• The policy is the core of a reinforcement learning agent in the sense
that it alone is sufficient to determine behavior.
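A toy illustration of the simplest case mentioned above, a policy as a lookup table mapping perceived states to actions. The states and actions here are hypothetical, loosely based on the trash-collecting robot example used earlier in this module.

```python
# A policy as a lookup table: perceived state -> action to take.
# States and actions are hypothetical, for illustration only.
policy = {
    "low_battery": "return_to_charger",
    "sees_trash":  "collect_trash",
    "room_clean":  "search_next_room",
}

def act(state):
    """Return the action the policy prescribes for the given state."""
    return policy[state]

print(act("low_battery"))  # -> return_to_charger
```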
Elements of RL : Reward
• A reward signal defines the goal of a reinforcement learning problem.
• On each time step, the environment sends to the reinforcement learning
agent a single number called the reward.
• The agent’s sole objective is to maximize the total reward it receives over
the long run.
• The reward signal thus defines what are the good and bad events for the
agent.
• The reward signal is the primary basis for altering the policy; if an action
selected by the policy is followed by low reward, then the policy may be
changed to select some other action in that situation in the future.
Elements of RL : Value function
• Whereas the reward signal indicates what is good in an immediate sense, a value
function specifies what is good in the long run.
• Roughly speaking, the value of a state is the total amount of reward an agent can
expect to accumulate over the future, starting from that state.
• Whereas rewards determine the immediate, intrinsic desirability of
environmental states, values indicate the long-term desirability of states after
taking into account the states that are likely to follow and the rewards available in
those states.
• For example, a state might always yield a low immediate reward but still have a
high value because it is regularly followed by other states that yield high rewards.
Or the reverse could be true.
• To make a human analogy, rewards are somewhat like pleasure (if high) and pain
(if low), whereas values correspond to a more refined and farsighted judgment of
how pleased or displeased we are that our environment is in a particular state.
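A tiny numeric illustration of that last point, with hypothetical rewards and discount factor: state A always yields a low immediate reward, but it leads to state B, which keeps yielding high rewards, so A's long-run value is high despite its low reward.

```python
# Hypothetical two-state chain: A -> B -> B -> ...
# A's immediate reward is low (1), but its value (the discounted sum of
# future rewards) is high because B repeatedly yields reward 10 afterwards.
gamma = 0.9                       # discount factor
reward = {"A": 1.0, "B": 10.0}    # immediate reward received in each state

def value_of_A(horizon=200):
    v, state = 0.0, "A"
    for t in range(horizon):
        v += (gamma ** t) * reward[state]
        state = "B"               # after A, the chain stays in B
    return v

print(round(value_of_A(), 1))     # ~91.0: low reward in A, yet high value
```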
Juxtaposing Rewards and Value Functions
• Rewards are in a sense primary, whereas values, as predictions of rewards,
are secondary.
• Without rewards there could be no values, and the only purpose of
estimating values is to achieve more reward.
• It is values with which we are most concerned when making and evaluating
decisions. Action choices are made based on value judgments.
• We seek actions that bring about states of highest value, not highest
reward, because these actions obtain the greatest amount of reward for us
over the long run.
• Unfortunately, it is much harder to determine values than it is to determine
rewards. Rewards are basically given directly by the environment, but
values must be estimated and re-estimated from the sequences of
observations an agent makes over its entire lifetime.
Elements of RL : Model
• Model is something that mimics the behavior of the environment, or more
generally, that allows inferences to be made about how the environment will
behave.
• For example, given a state and action, the model might predict the resultant next
state and next reward.
• Models are used for planning, by which we mean any way of deciding on a course
of action by considering possible future situations before they are actually
experienced.
• Methods for solving reinforcement learning problems that use models and
planning are called model-based methods, as opposed to simpler model-free
methods that are explicitly trial-and-error learners—viewed as almost the
opposite of planning.
• Modern reinforcement learning spans the spectrum from low-level, trial-and-
error learning to high-level, deliberative planning.
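A toy sketch of a model used for planning (the transitions below are hypothetical, not from the slides): given a state and an action, the model predicts the next state and reward, so a course of action can be evaluated before it is actually experienced.

```python
# A model as a lookup: (state, action) -> predicted (next_state, reward).
# The transitions are hypothetical, for illustration only.
model = {
    ("hallway", "go_left"):  ("kitchen", 0),
    ("hallway", "go_right"): ("charger", 100),
}

def simulate(state, action):
    """Predict the outcome of an action without touching the real environment."""
    return model[(state, action)]

# One-step lookahead "planning": compare predicted rewards before acting.
best = max(["go_left", "go_right"], key=lambda a: simulate("hallway", a)[1])
print(best)  # -> go_right
```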
Types of RL
Reinforcement Learning : Types
• Model-based methods:
• These methods use a model of the environment to predict the next state and
reward.
• Examples include Markov decision processes (MDPs) and dynamic
programming.
• Model-free methods:
• These methods do not use a model of the environment and instead learn
from experience.
• Examples include Q-learning and State-Action-Reward-State-Action (SARSA).
Reinforcement Learning : Types
• Value-based methods:
• These methods learn to estimate the value of each state or state-
action pair.
• Examples include Q-learning and deep Q-networks (DQNs).
• Policy-based methods:
• These methods learn a policy that maps states to actions.
• Examples include policy gradient methods and actor-critic
methods.
RL Algorithms
Reinforcement Learning : Algorithms
Source : https://link.springer.com/chapter/10.1007/978-981-15-4095-0_3
Markov Decision Process
Introduction
Markov Decision Process (MDP)
• MDP is a mathematical framework used to model decision-making
problems where an agent interacts with an environment over a
sequence of discrete time steps.
• It provides a mathematical framework for modeling decision making
in situations where outcomes are partly random and partly under the
control of a decision maker.
• The MDP framework is widely used in the field of reinforcement
learning to formalize problems and algorithms.
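A minimal sketch of how an MDP is commonly represented in code (the structure and numbers below are assumptions for illustration, not from the slides): a set of states and actions, transition probabilities P(s' | s, a), rewards R(s, a), and a discount factor γ.

```python
# Minimal MDP container: states, actions, transition probabilities, rewards,
# and a discount factor. Outcomes are partly random (the probability split)
# and partly under the decision maker's control (the chosen action).
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]
    actions: List[str]
    transitions: Dict[Tuple[str, str], Dict[str, float]]  # (s, a) -> {s': P(s'|s,a)}
    rewards: Dict[Tuple[str, str], float]                 # (s, a) -> expected reward
    gamma: float                                          # discount factor

toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    transitions={("s0", "move"): {"s1": 0.8, "s0": 0.2},
                 ("s0", "stay"): {"s0": 1.0},
                 ("s1", "move"): {"s0": 1.0},
                 ("s1", "stay"): {"s1": 1.0}},
    rewards={("s0", "move"): 0.0, ("s0", "stay"): 0.0,
             ("s1", "move"): 0.0, ("s1", "stay"): 1.0},
    gamma=0.9,
)
print(toy.transitions[("s0", "move")])  # {'s1': 0.8, 's0': 0.2}
```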
Markov Decision Process (MDP)
• Assume α = 1
• What does it signify?
• Q(s_t, a_t) ← r_{t+1} + γ · max_a Q(s_{t+1}, a)
• With α = 1, the general update Q(s_t, a_t) ← Q(s_t, a_t) + α · (r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t)) reduces to the simplified rule above: the old estimate is completely replaced by the new target.
Q-Learning : Example
• Consider rewards as
  • -1 for ‘No Connection’
  • 0 for ‘Non Goal State’
  • 100 for ‘Goal State’
Q-Learning
• Initialize Q matrix to Zero
Q-Learning
• Consider discount factor (γ) = 0.8
• Initial state is 1
• Where can we go from 1?
  • 3 or 5
• How do we select one of them?
Q-Learning
• Assume 5 is selected
• Calculate Q(1,5)
• Which actions are possible from 5?
  • 1, 4 or 5
• Calculations:
  • Q(s_t, a_t) ← r_{t+1} + γ · max_a Q(s_{t+1}, a)
  • Q(1,5) = 100 + 0.8 · max[Q(5,1), Q(5,4), Q(5,5)]
  • Q(1,5) = 100 + 0.8 · max[0, 0, 0]
  • Q(1,5) = 100 + 0.8 · 0
  • Q(1,5) = 100
Q-Learning
• Q(1,5) is calculated as 100
• The next state is 5.
• We can’t go anywhere from 5, since it is the goal state.
• One episode is over.
• For the next episode, randomly choose the starting state.
• Let us say 3
• From 3, where can we go?
  • 1, 2 or 4
• Assume 1 is selected; calculate Q(3,1)
Q-Learning
• Q(3,1)
  = R(3,1) + 0.8 · max[Q(1,3), Q(1,5)]
  = 0 + 0.8 · max[0, 100]
  = 0.8 · 100
  = 80
Q-Learning
• The agent learns more through further episodes.
• It will finally converge to stable values in matrix Q.
Q-Learning
• Once learning has converged, the agent can simply follow the action with the highest Q-value in each state to reach the Goal State (see the sketch below).
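A minimal sketch of the worked example above in code, with α = 1 and γ = 0.8. The reward matrix R is an assumed reconstruction based on the transitions quoted in the slides (1 → {3, 5}, 3 → {1, 2, 4}, 5 → {1, 4, 5}) and the standard six-room layout this example is usually drawn from; treat the exact connectivity as an assumption.

```python
# Tabular Q-learning on the six-state room example (alpha = 1, gamma = 0.8).
# R[s][a]: -1 = no connection, 0 = move to a non-goal state, 100 = move into
# the goal state 5. The matrix is an assumed reconstruction of the slide figure.
import random

R = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]
GOAL, GAMMA = 5, 0.8
Q = [[0.0] * 6 for _ in range(6)]                # initialize Q matrix to zero

for episode in range(1000):
    s = random.randrange(6)                      # random starting state
    while s != GOAL:
        a = random.choice([j for j in range(6) if R[s][j] >= 0])   # explore
        # alpha = 1: the old estimate is fully replaced by the new target
        Q[s][a] = R[s][a] + GAMMA * max(Q[a][j] for j in range(6) if R[a][j] >= 0)
        s = a                                    # the action index is the next state

# After convergence, act greedily on Q to reach the goal, e.g. from state 2.
s, path = 2, [2]
while s != GOAL:
    s = max((j for j in range(6) if R[s][j] >= 0), key=lambda j: Q[s][j])
    path.append(s)
print(path)   # e.g. [2, 3, 1, 5]
```

With this reconstruction the learned values reproduce the calculations above, e.g. Q(1,5) = 100 and Q(3,1) = 80.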
SARSA
State-Action-Reward-State-Action
Q(s_t, a_t) ← Q(s_t, a_t) + α · (r_{t+1} + γ · Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))
SARSA
• SARSA (State-Action-Reward-State-Action) is another reinforcement
learning algorithm, similar to Q-learning, used for learning optimal
policies in Markov Decision Processes (MDPs).
• Like Q-learning, SARSA is a model-free algorithm, meaning it does not
require knowledge of the underlying dynamics of the environment.
SARSA Overview
1. Initialize Q-values:
• Start by initializing the Q-values for all state-action pairs.
• This is typically done by creating a Q-table and initializing its entries with
arbitrary values.
2. Exploration-Exploitation Tradeoff:
• Define an exploration-exploitation strategy, similar to Q-learning, to balance
exploration of new actions and exploitation of learned knowledge.
• Common strategies include ε-greedy, where with probability ε, a random
action is selected, and with probability 1-ε, the action with the highest Q-
value is chosen.
SARSA Overview
3. Learning Process:
• For each time step t, observe the current state s_t.
• Select an action a_t using the exploration-exploitation strategy.
• Execute the selected action and observe the immediate reward r_{t+1} and the new state s_{t+1}.
• Select a new action a_{t+1} using the same exploration-exploitation strategy.
• Update the Q-value of the current state-action pair using the SARSA update rule:
  • Q(s_t, a_t) ← Q(s_t, a_t) + α · (r_{t+1} + γ · Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))
  • α is the learning rate (0 < α ≤ 1), determining the extent to which new information replaces old information.
  • γ is the discount factor (0 ≤ γ ≤ 1), which discounts the importance of future rewards.
4. Repeat:
• Continue the learning process for multiple episodes or until convergence.
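A minimal sketch of these steps, reusing the same six-room reward matrix as the Q-learning sketch above (itself an assumed reconstruction) with an ε-greedy exploration strategy; the values of α, γ and ε are illustrative.

```python
# SARSA with epsilon-greedy exploration on the same six-state room example.
import random

R = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]
GOAL, ALPHA, GAMMA, EPSILON = 5, 0.1, 0.8, 0.2
Q = [[0.0] * 6 for _ in range(6)]                # step 1: initialize Q-values

def epsilon_greedy(s):
    """Step 2: explore a random valid action with prob. epsilon, else exploit."""
    valid = [j for j in range(6) if R[s][j] >= 0]
    if random.random() < EPSILON:
        return random.choice(valid)
    return max(valid, key=lambda j: Q[s][j])

for episode in range(5000):                      # step 4: repeat over episodes
    s = random.randrange(6)
    a = epsilon_greedy(s)                        # choose a_t in s_t
    while s != GOAL:                             # step 3: learning process
        r, s_next = R[s][a], a                   # act, observe r_{t+1} and s_{t+1}
        a_next = epsilon_greedy(s_next)          # choose a_{t+1} with the same policy
        # on-policy update: the target uses Q(s_{t+1}, a_{t+1}), not a max
        Q[s][a] += ALPHA * (r + GAMMA * Q[s_next][a_next] - Q[s][a])
        s, a = s_next, a_next

print([round(v, 1) for v in Q[1]])               # learned Q-values for state 1
```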
SARSA Overview
• SARSA is an on-policy algorithm, meaning it learns the policy that it
follows during training.
• It is suitable for scenarios where the policy needs to be continuously
updated based on the agent's actions.
SARSA vs Q-Learning
• Q-Learning : Q(s_t, a_t) ← (1 − α) · Q(s_t, a_t) + α · (r_{t+1} + γ · max_a Q(s_{t+1}, a))
• SARSA : Q(s_t, a_t) ← Q(s_t, a_t) + α · (r_{t+1} + γ · Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))
• The two ways of writing the update are algebraically equivalent: expanding Q + α · (target − Q) gives (1 − α) · Q + α · target.
• The (1 − α) form makes explicit that the current estimate is scaled down to leave room for the new target, while the incremental form folds the same scaling into the learning rate α applied to the temporal-difference error (r_{t+1} + γ · Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)).
• In short, the presence or absence of (1 − α) is purely a matter of how the update is written; it is not a substantive difference between Q-learning and SARSA. A quick check is shown below.
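A quick numeric check of this equivalence, using arbitrary illustrative numbers:

```python
# The incremental form and the (1 - alpha)-weighted form give the same result.
alpha, q_old, target = 0.5, 2.0, 10.0
incremental = q_old + alpha * (target - q_old)      # Q + alpha * (target - Q)
weighted    = (1 - alpha) * q_old + alpha * target  # (1 - alpha) * Q + alpha * target
print(incremental, weighted)                        # 6.0 6.0
```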
SARSA vs Q-Learning
• SARSA: The update rule considers the Q-value of the next state-action pair based on the policy being followed. It takes into account the action actually taken in the next state.
Q(s_t, a_t) ← Q(s_t, a_t) + α · (r_{t+1} + γ · Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))
• Q-Learning: By contrast, the update uses max_a Q(s_{t+1}, a), the best available action in the next state, regardless of the action actually taken next.
SARSA vs Q-Learning
• Policy Used for Exploration:
• Q-learning: Typically uses an epsilon-greedy strategy for exploration. It
exploits the current best action with high probability and explores randomly
with a small probability.
• SARSA: Also often uses epsilon-greedy, but it explores and updates its policy
simultaneously, as it's an on-policy algorithm.
• Usage:
• Q-learning: Can be more effective in situations where exploration is crucial,
and the learned policy may deviate significantly from the behavior policy.
• SARSA: Tends to be more stable in environments where the agent should
follow a specific policy closely, such as in applications where safety is a
primary concern.
End