
Introduction to

Reinforcement Learning

Dr. Harald Stein


Dec 2023
Agenda
• What is Reinforcement Learning (RL)?
• Basic concepts of RL
• Policy vs. value based approaches
• Q-learning
• Code example: Taxi driver with Q-learning
What is Reinforcement Learning?
And what is it good for?

Definition

Applications

3 types of machine learning


What is Reinforcement Learning?
It is a type of machine learning where agents learn how to behave in an environment.

Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment, interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.
What is Reinforcement Learning?
It is a type of machine learning where agents learn how to behave in an environment.

• Agent: takes actions based on observations, aiming to achieve the best possible outcome.
• Goal Setting: the primary objective of the agent is to maximize the cumulative reward over a period, refining its actions and strategies over time.
• Environment Interaction: each action the agent takes impacts its environment, leading to new situations (states).
• Decision Strategy: through experience, the agent develops a policy, a guideline on which action to choose in a given situation, to achieve its long-term reward maximization goal.
• Feedback Mechanism: after every action, the agent receives feedback in the form of rewards or penalties, indicating the quality of its decision.
• Challenges Faced: an ongoing challenge in reinforcement learning is the balance between exploration (trying new actions) and exploitation (sticking with known beneficial actions).
Applications of RL

• Game Playing (e.g., AlphaGo)
• Robotics (e.g., robot locomotion)
• Recommendation Systems
• Financial Portfolio Management
• Autonomous Vehicles
3 types of machine learning

• Supervised Learning:
• Machine learning paradigm where a model is trained on
labeled data to make predictions or classifications. The
model learns a function that maps input features to output
labels.

• Unsupervised Learning:
• Type of machine learning that deals with unlabeled data
and aims to identify underlying patterns or structures. The
model learns to represent the data without explicit
guidance.

• Reinforcement Learning:
• Type of machine learning where an agent learns to make
decisions by taking actions in an environment to maximize
cumulative rewards over time.
3 types of machine learning

Unsupervised learning
• Input data: unlabeled, no “right answer” specified
• General tasks: discovery of clusters, patterns, relationships
• Solution: finds similarities and differences in input data
• Examples: customer segmentation, product recommendation
• Feedback: no

Supervised learning
• Input data: labeled, the “right answer” is included
• General tasks: classification, regression
• Solution: maps input to output
• Examples: image detection, stock market prediction
• Feedback: yes, the correct set of actions is provided

Reinforcement learning
• Input data: not part of the input; data are collected through trial and error
• General tasks: explorative controlling, i.e. solving reward-based problems by exploration and exploitation
• Solution: finds which states and actions would maximize the total cumulative reward of the agent
• Examples: game playing, robotic vacuum cleaners
• Feedback: yes, through rewards and punishments (positive and negative rewards)
Basic concepts of RL
The core building blocks of an RL problem

Agent-Environment interaction

Reinforcement Learning process & loop

Reward and discounting

State space and action space

Tasks: Episodic vs. continuous

Exploration-Exploitation tradeoff
Agent-environment interaction
… provides a mathematical framework for modeling decision-making in situations where outcomes are partly
random and partly under the control of a decision-maker
Reinforcement Learning process
… is a Markov decision process that provides a mathematical framework for modeling decision-making in situations where
outcomes are partly random and partly under the control of the decision-maker.

The Markov property implies that our agent needs only the current state to decide what action to take, not the
history of all the states and actions it took before.
Example: Video game
… of a Markov Decision Process in an RL context
The RL loop
… outputs a sequence of state, action, reward and next state.

The agent’s goal is to maximize its cumulative reward, called the expected return.
The reward hypothesis
… is that all goals can be described as the maximization of the expected return (expected cumulative reward).
Reward is fundamental in RL because it is the only feedback for the agent. Thanks to it, our agent knows whether the action taken was good or not.
Rewards and discounting

To discount the rewards, we proceed like this:

• We define a discount rate gamma between 0 and 1, most of the time between 0.95 and 0.99. The larger the gamma, the smaller the discount: our agent cares more about the long-term reward. The smaller the gamma, the bigger the discount: our agent cares more about the short-term reward (in the classic cat-and-mouse illustration, the nearest cheese).
• Each reward is discounted by gamma to the exponent of the time step. As the time step increases (the cat gets closer), the future reward is less and less likely to happen (see the sketch below).
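A minimal Python sketch of this discounting (the reward list and gamma value below are illustrative, not from the slides):

```python
# Discounted return: G = r1 + gamma*r2 + gamma^2*r3 + ...
def discounted_return(rewards, gamma=0.95):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Four rewards of 1: later ones count less, so the total is below 4.
print(discounted_return([1, 1, 1, 1], gamma=0.95))  # ~3.71
```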
Observations/States Space
… are the information our agent gets from the environment

Observation Space
• State: a complete description of the state of the world (no hidden information).
• Observation: a partial description of the state of the world.
Action Space
… is the set of all possible actions in an environment

• Discrete: finite number of possible actions.
• Continuous: infinite number of possible actions.

In the case of a video game, it can be a frame (a screenshot); in the case of a trading agent, it can be the value of a certain stock, etc.
Taking this information into consideration is crucial because it will matter when choosing the RL algorithm in the future.
Task
… is an instance of a Reinforcement Learning problem. Two types: episodic and continuing

Types of Task
• Episodic: a starting point and an ending point (a terminal state).
• Continuing: a task that continues forever (no terminal state).

Episodic task
In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and new States. For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends when you are killed or you reach the end of the level.

Continuing task
These are tasks that continue forever (no terminal state). Here the agent must learn how to choose the best actions and simultaneously interact with the environment. For instance, an agent that does automated stock trading: there is no starting point or terminal state, and the agent keeps running until we decide to stop it.
Exploration vs Exploitation
… is the dilemma where an agent must decide between trying new actions (exploration) or sticking with known
beneficial actions (exploitation).

• Exploration: Trying new actions to discover their outcomes.
• Exploitation: Choosing actions known to yield
good rewards.
• Dilemma: Balancing between gaining new
knowledge (exploration) and maximizing
rewards with current knowledge (exploitation).
• Importance: Essential for an agent to adapt to
changing environments and avoid local optima.
• Strategies: Techniques like ε-greedy or UCB
(Upper Confidence Bound) help manage this
trade-off.
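As an illustration of one such strategy, here is a minimal sketch of ε-greedy action selection (function and variable names are our own, not from a specific library):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the best-known one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

action = epsilon_greedy([0.2, 0.5, 0.1])  # usually 1, occasionally a random index
```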
Policy vs. value based approaches
Policy vs. value based approaches
… two ways to train the agent: learn the policy directly (policy-based) or learn a value function and derive the policy from it (value-based)

Policy π : The agent‘s brain

Policy based vs. Value based methods

Bellman equation
Policy π: the agent’s brain
How do we build an RL agent that can select the actions that maximize its expected cumulative reward?

This policy is the function we want to learn. Our goal is to find the optimal policy π*, the policy that maximizes the expected return when the agent acts according to it. We find this π* through training.
There are two approaches to train our agent to find
this optimal policy π*:
• Directly, by teaching the agent to learn which
action to take, given the current state: Policy-
Based Methods.
• Indirectly, by teaching the agent to learn which state
is more valuable and then take the action that
leads to the more valuable states: Value-Based
Methods.
Policy based methods
How do we build an RL agent that can select the actions that maximize its expected cumulative reward?

In policy-based methods, we learn a policy function directly.
This function will define a mapping from each
state to the best corresponding action.
Alternatively, it could define a probability
distribution over the set of possible actions at
that state.
Policy based methods
How do we build an RL agent that can select the actions that maximize its expected cumulative reward?

• Deterministic: a policy at a given state will always return the same action.
  a = π(s)
  Example: State S₀ → π(S₀) → a₀ = Right

• Stochastic: the policy outputs a probability distribution over actions.
  π(a|s) = P[A = a | S = s]
  Example: State S₀ → π(·|S₀) → {Left: 0.1, Right: 0.7, Jump: 0.2}
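The same two policy types, sketched in Python on a toy action set (the probabilities mirror the example above; the function names are illustrative):

```python
import random

ACTIONS = ["Left", "Right", "Jump"]

def deterministic_policy(state):
    # a = pi(s): the same state always yields the same action
    return "Right"

def stochastic_policy(state):
    # pi(a|s): sample from a probability distribution over actions
    probs = [0.1, 0.7, 0.2]  # Left, Right, Jump
    return random.choices(ACTIONS, weights=probs)[0]
```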
Value based methods
How do we build an RL agent that can select the actions that maximize its expected cumulative reward?

In value-based methods, instead of learning a policy function, we learn a value function that maps a state to the expected value of being at that state.
The value of a state is the expected discounted return the agent can get if it starts in that state and then acts according to our policy.
“Act according to our policy” just means that our policy is “going to the state with the highest value”.

Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function (for example 7, then 6, then 5, and so on) until the goal is reached.
Policy vs. value based methods
How do we build an RL agent that can select the actions that maximize its expected cumulative reward?

• Policy-Based Methods: train the agent to learn which action to take given a state.
• Value-Based Methods: train the agent to learn which state is more valuable and take the action that leads to it.

So the difference is:
• In policy-based training, the optimal policy (denoted π*) is found by training the policy directly.
• In value-based training, finding an optimal value function (denoted Q* or V*, we’ll study the difference below) leads to having an optimal policy.
Link between value and policy

Finding an optimal value function leads to having an optimal policy: once we know the optimal action-value function Q*, the optimal policy simply selects, in each state, the action with the highest Q-value, i.e. π*(s) = argmax over a of Q*(s, a).
Two types of Value-based Methods
… the state-value function and the action-value function

We see that the difference is:
• For the state-value function, we calculate the value of a state St.
• For the action-value function, we calculate the value of the state-action pair (St, At), hence the value of taking that action at that state.

State-Value Function: calculates the value of a state.
Action-Value Function: calculates the value of a state-action pair.

For each state, the state-value function outputs the expected return if the agent starts at that state and then follows the policy forever afterward (for all future timesteps, if you prefer).

For each state-action pair, the action-value function outputs the expected return if the agent starts in that state, takes that action, and then follows the policy forever after. The value of taking action a in state s under a policy π is:

Qπ(s, a) = Eπ[ Rt+1 + gamma * Rt+2 + gamma² * Rt+3 + … | St = s, At = a ]
Bellman equation
… simplifies summing up all the rewards an agent can get if it starts at that state

With what we have learned so far, we know that if we calculate V(St) (the value of a state), we need to calculate the return starting at that state and then follow the policy forever after. (The policy we defined in the following example is a greedy policy; for simplification, we don’t discount the reward.)
So to calculate V(St), we need to calculate the sum of the expected rewards. Hence, the Bellman equation:

V(St) = Rt+1 + gamma * V(St+1)

In the interest of simplicity, here we don’t discount, so gamma = 1. But you’ll study an example with gamma = 0.99 in the Q-Learning section of this unit.

• The value of V(St) = immediate reward Rt+1 + discounted value of the next state (gamma * V(St+1)).
• The value of V(St+1) = immediate reward Rt+2 + discounted value of the next state (gamma * V(St+2)).
• And so on.

To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, which is a long process, we calculate the value as the sum of the immediate reward plus the discounted value of the state that follows (see the sketch below).
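A minimal sketch of this idea on a toy, deterministic chain of states (rewards are illustrative; gamma = 1 as in the example above):

```python
# Bellman backup on a chain: V(s) = R(s) + gamma * V(next state)
rewards = [1, 2, 3, 0]       # reward collected when leaving each state
gamma = 1.0                  # no discounting, as in the slide's example
values = [0.0] * len(rewards)

# Work backwards: each value is the immediate reward plus the value that follows.
for s in reversed(range(len(rewards) - 1)):
    values[s] = rewards[s] + gamma * values[s + 1]

print(values)  # [6.0, 5.0, 3.0, 0.0]
```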
Q-Learning
… is an off-policy value-based method that uses a TD approach to train its
action-value function

Temporal-Difference (TD) approach

What is Q-learning?

Q-Function & Q-Table

Example of maze for Q-Table calculation

Off-policy vs. On-policy


Q-Learning
combines ideas from Monte Carlo methods and dynamic programming by updating predictions based on the difference between predicted values at consecutive time steps

• Estimation: it estimates value functions without needing a complete model of the environment.
• Update Rule: The value estimates are updated using the
difference (or "temporal difference") between successive
estimates.
• Combination: TD learning combines the sample efficiency of
Monte Carlo methods (which estimate values based on
complete episodes) with the bootstrapping of dynamic
programming methods (which update values based on current
estimates)
• Popular Methods: TD learning forms the basis for popular
algorithms like Q-learning and SARSA.
Temporal Difference approach

… updates its action-value function at each step instead of at the end of the episode
Q-Learning
… is the algorithm we use to train our Q-function, an action-value function

Q-Learning is an off-policy value-based method that uses a TD approach to train its action-value function:

• Off-policy: we’ll talk about that at the end of this unit.
• Value-based method: finds the optimal policy indirectly by
training a value or action-value function that will tell us the
value of each state or each state-action pair.
• TD approach: updates its action-value function at each step
instead of at the end of the episode.

Q-Learning is the algorithm we use to train our Q-function, an action-value function that determines the value of being at a particular state and taking a specific action at that state.
Q-Function & Q-Table
… our Q-function outputs a state-action value; the Q-table is its memory

The Q comes from “the Quality” (the value) of that action at that state.
Let’s recap the difference between value and reward:

• The value of a state, or a state-action pair, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
• The reward is the feedback I get from the environment after performing an action at a state.

Given a state and an action as input, our Q-function outputs a state-action value (also called Q-value). To train the Q-function, we use the Q-learning algorithm.

Internally, our Q-function is encoded by a Q-table, a table where each cell corresponds to a state-action pair value. Think of this Q-table as the memory or cheat sheet of our Q-function.
The Q-table
… each cell stores the value of one state-action pair

The Q-table is initialized, which is why all values are 0. This table contains, for each state and action, the corresponding state-action value.
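In code, such a Q-table is simply a 2-D array of zeros (a sketch; the sizes here are illustrative, and NumPy is assumed):

```python
import numpy as np

n_states, n_actions = 16, 4
q_table = np.zeros((n_states, n_actions))  # one cell per state-action pair
print(q_table[3, 2])                        # value of action 2 in state 3 -> 0.0
```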
Temporal Difference (TD) Learning
• Concept: update estimates based on current and expected future rewards.
• TD Error: the difference between predicted and actual rewards.
• Role: the core mechanism in Q-learning updates.

Advantages of Q-Learning
• Model-free: doesn’t require knowledge of the environment dynamics.
• Convergence: guaranteed to find the optimal policy (given conditions).
• Flexibility: adaptable to various tasks and environments.

Limitations of Q-Learning
• Scalability: the Q-table size grows exponentially with states/actions.
• Exploration: can get stuck in local optima.
• Continuous spaces: not directly applicable without discretization or function approximation.
The Q-table
Step 4: Update Q(St, At)

Remember that in TD learning, we update our policy or value function (depending on the RL method we choose) after one step of the interaction.
To produce our TD target, we use the immediate reward Rt+1 plus the discounted value of the next state, computed by finding the action that maximizes the current Q-function at the next state. (We call that bootstrapping.)

Therefore, our Q(St, At) update formula goes like this:

Q(St, At) ← Q(St, At) + alpha * [ Rt+1 + gamma * max over a of Q(St+1, a) - Q(St, At) ]
Off-policy vs. On-policy
… the difference lies in which policy is used for acting and which for updating

Off-policy: using a different policy for acting and for updating.

On-policy: using the same policy for acting and for updating.
Cliff Walking Example
is a standard gridworld environment used to illustrate the difference
between on-policy and off-policy methods like SARSA and Q-learning

• The agent can take actions to move in one of the four cardinal directions:
up, down, left, or right.
• Moving into a "Cliff" state incurs a large negative reward (e.g., -100) and
sends the agent back to the "Start" state.
• Each other move typically has a small negative reward (e.g., -1),
incentivizing the agent to reach the goal quickly
Comparing SARSA and Q-Learning:
• SARSA: the agent tends to take a longer but safer route, avoiding the edge adjacent to the cliff, because it considers the future action, which might be exploratory and lead it into the cliff.
• Q-Learning: the agent usually learns the optimal policy and skirts dangerously close to the cliff for the shortest path to the goal, but it may occasionally fall into the cliff during exploration due to the greedy nature of its learning.
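The difference shows up in a single line: the TD target each method bootstraps on. A sketch (variable and function names are illustrative, with q_table as a NumPy array):

```python
def q_learning_target(q_table, reward, next_state, gamma=0.99):
    # Off-policy: bootstrap on the greedy (max-value) action at the next state.
    return reward + gamma * q_table[next_state].max()

def sarsa_target(q_table, reward, next_state, next_action, gamma=0.99):
    # On-policy: bootstrap on the action the behaviour policy actually chose next.
    return reward + gamma * q_table[next_state, next_action]
```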
Use Case: The taxi driver
… a taxi should pick up passengers and drop them off at their destination on a small parking lot

Task of the tutorial use case “Taxi driver”

Problem setting

Training the Q-Table

Test the trained Q-Table


Task of the tutorial use case “Taxi driver”
… a taxi should pick up passengers and drop them off at their destination on a small parking lot
Problem setting
… a 5×5 grid world with 500 states, 6 actions, and a simple reward function

State space
• Grid, number of fields: 25 (5×5)
• Pickup positions: 5 (Y, R, G, B, or in the taxi)
• Possible destinations: 4
→ Number of states: 500

Action space
• Down, Up, Left, Right
• Drop off passenger, Pick up passenger

Reward function
• Move: -1
• Failed drop-off: -10
• Successful drop-off: +10
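This environment is available off the shelf; a sketch assuming the gymnasium package and its Taxi-v3 environment:

```python
import gymnasium as gym  # assumption: pip install gymnasium

env = gym.make("Taxi-v3")
print(env.observation_space.n)  # 500 states
print(env.action_space.n)       # 6 actions: down, up, right, left, pick up, drop off
```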
Training the Q-Table
… iterate: select an action, observe the reward and next state, and update the Q-values

• Initialize Q-Table
• Start with a table of zeros for each state-action pair.
• Environment Interaction
• The agent takes actions in the environment based on the current Q-values, often using a strategy like ε-greedy for exploration.
• Receive Reward
• After taking an action, the agent observes a reward and the new state
from the environment.
• Update Q-Values
• Use the Q-learning update rule to adjust the Q-value of the taken action based on the received reward and the highest Q-value for the new state.
• Iterate
• Repeat the process of action selection, observation, and Q-value updates until a termination condition is met, such as a set number of episodes or convergence of the Q-table (see the training sketch below).
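A compact training sketch tying these steps together (assumes gymnasium's Taxi-v3 as above; the hyperparameters are illustrative):

```python
import numpy as np
import gymnasium as gym

env = gym.make("Taxi-v3")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon, episodes = 0.1, 0.99, 0.1, 5000

for _ in range(episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update: immediate reward + discounted best next value
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state
```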
Test the trained Q-Table
… run the taxi with the greedy policy derived from the learned Q-values (see the sketch below)
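A sketch of the test loop: act greedily (no exploration) with the learned Q-table and check the return per episode (assumes env and q_table from the training sketch above):

```python
for episode in range(3):
    state, _ = env.reset()
    total_reward, done = 0, False
    while not done:
        action = int(np.argmax(q_table[state]))  # always exploit
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        done = terminated or truncated
    print(f"episode {episode}: return {total_reward}")
```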
Links
i.e. sources for self-learning

Overview
• Sutton, Barto: Reinforcement Learning: An Introduction | http://incompleteideas.net/book/bookdraft2017nov5.pdf
• Artificial Intelligence: Reinforcement Learning in Python | https://www.udemy.com/course/artificial-intelligence-reinforcement-learning-in-python/
• Fundamentals of Reinforcement Learning | https://levelup.gitconnected.com/fundamental-of-reinforcement-learning-markov-decision-process-8ba98fa66060
• Reinforcement Learning Series: Overview of Methods | https://www.youtube.com/watch?app=desktop&v=i7q8bISGwMQ
• Reinforcement Learning in Machine Learning | https://pythongeeks.org/reinforcement-learning-in-machine-learning/
• Easy Introduction to Reinforcement Learning | https://www.scribbr.com/ai-tools/reinforcement-learning/
• A (Long) Peek into Reinforcement Learning | https://lilianweng.github.io/posts/2018-02-19-rl-overview/

Basic concepts of RL
• The Exploration/Exploitation trade-off | https://huggingface.co/learn/deep-rl-course/unit1/exp-exp-tradeoff
• Dynamic Programming RL | https://shirsho-12.github.io/blog/rl_dp/

Q-Learning
• Temporal-Difference (TD) Learning | https://towardsdatascience.com/introduction-to-reinforcement-learning-rl-part-6-temporal-difference-td-learning-2a12f0aba9f9
• Reinforcement Learning 6. Temporal Difference Learning | https://www.slideshare.net/SeungJaeLee17/reinforcement-learning-an-introduction-chapter-6
• Diving deeper into Reinforcement Learning with Q-Learning | https://medium.com/free-code-camp/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe
• Fundamentals of Reinforcement Learning: Navigating Cliffworld with SARSA and Q-learning | https://medium.com/gradientcrescent/fundamentals-of-reinforcement-learning-navigating-cliffworld-with-sarsa-and-q-learning-cc3c36eb5830
• Reinforcement Learning: SARSA and Q-Learning | https://arshren.medium.com/reinforcement-learning-sarsa-and-q-learning-e11ebe87dca9
• An introduction to Q-Learning: reinforcement learning | https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/
• Walking Off The Cliff With Off-Policy Reinforcement Learning | https://towardsdatascience.com/walking-off-the-cliff-with-off-policy-reinforcement-learning-7fdbcdfe31ff

Tutorial Applications
• Solving The Taxi Environment With Q-Learning: A Tutorial | https://towardsdatascience.com/solving-the-taxi-environment-with-q-learning-a-tutorial-c76c22fc5d8f
• Text-Flappy Bird | https://aspram.medium.com/learning-flappy-bird-agents-with-reinforcement-learning-d07f31609333
• Practical Reinforcement Learning using Python - 8 AI Agents | https://www.udemy.com/course/practical-reinforcement-learning/

Real world Applications
• 9 awesome real world applications of Reinforcement Learning | https://medium.com/@mlblogging.k/9-awesome-applications-of-reinforcement-learning-e1306ed25c09
• Reinforcement Learning and its Real-Life Applications | https://blogs.skillovilla.com/reinforcement-learning-and-its-real-life-applications/
• Mastering the game of Go with deep neural networks and tree search | https://www.nature.com/articles/nature16961.pdf
