
Introduction to

Reinforcement Learning

Dr. Harald Stein


Dec 2023
Agenda
• What is Reinforcement Learning (RL)?
• Basic concepts of RL
• Policy vs. value based approaches
• Q-learning
• Code example: Taxi driver with Q-learning
What is Reinforcement Learning?
And what is it good for?

Definition

Applications

3 types of machine learning


What is Reinforcement Learning?
It is a type of machine learning where agents learn how to behave in an environment.

Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment, interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.
What is Reinforcement Learning?
It is a type of machine learning where agents learn how to behave in an environment.

• Agent: takes actions based on observations, aiming to achieve the best possible outcome.
• Goal Setting: the primary objective of the agent is to maximize the cumulative reward over a period, refining its actions and strategies over time.
• Environment Interaction: each action the agent takes impacts its environment, leading to new situations (states).
• Decision Strategy: through experience, the agent develops a policy, a guideline on which action to choose in a given situation, to achieve its long-term reward maximization goal.
• Feedback Mechanism: after every action, the agent receives feedback in the form of rewards or penalties, indicating the quality of its decision.
• Challenges Faced: an ongoing challenge in reinforcement learning is the balance between exploration (trying new actions) and exploitation (sticking with known beneficial actions).
Applications of RL

• Game Playing (e.g., AlphaGo)
• Robotics (e.g., robot locomotion)
• Recommendation Systems
• Financial Portfolio Management
• Autonomous Vehicles
3 types of machine learning

• Supervised Learning:
• Machine learning paradigm where a model is trained on
labeled data to make predictions or classifications. The
model learns a function that maps input features to output
labels.

• Unsupervised Learning:
• Type of machine learning that deals with unlabeled data
and aims to identify underlying patterns or structures. The
model learns to represent the data without explicit
guidance.

• Reinforcement Learning:
• Type of machine learning where an agent learns to make
decisions by taking actions in an environment to maximize
cumulative rewards over time.
3 types of machine learning

Unsupervised learning
• Input data: unlabeled, no “right answer” specified
• General tasks: discovery of clusters, patterns, relationships
• Solution: finds similarities and differences in input data
• Examples: customer segmentation, product recommendation
• Feedback: no

Supervised learning
• Input data: labeled, the “right answer” is included
• General tasks: classification, regression
• Solution: maps input to output
• Examples: image detection, stock market prediction
• Feedback: yes, the correct set of actions is provided

Reinforcement learning
• Input data: not part of the input; data are collected through trial and error
• General tasks: explorative controlling, i.e. solving reward-based problems by exploration and exploitation
• Solution: finds which states and actions would maximize the total cumulative reward of the agent
• Examples: game playing, robotic vacuum cleaners
• Feedback: yes, through rewards and punishments (positive and negative rewards)
Basic concepts of RL
The core building blocks of an RL problem

Agent-Environment interaction

Reinforcement Learning process & loop

Reward and discounting

State space and action space

Tasks: Episodic vs. continuous

Exploration-Exploitation tradeoff
Agent-environment interaction
… provides a mathematical framework for modeling decision-making in situations where outcomes are partly
random and partly under the control of a decision-maker
Reinforcement Learning process
… is a Markov decision process that provides a mathematical framework for modeling decision-making in situations where
outcomes are partly random and partly under the control of the decision-maker.

The Markov property implies that our agent needs only the current state to decide what action to take, not the
history of all the states and actions it took before.
Example: Video game
… of a Markov Decision Process in an RL context
The RL loop
… outputs a sequence of state, action, reward and next state.

The agent’s goal is to maximize its cumulative reward, called the expected return.
The reward hypothesis
… is that all goals can be described as the maximization of the expected return (expected cumulative reward).
Reward is fundamental in RL because it is the only feedback for the agent. Thanks to it, our agent knows whether the action taken was good or not.
Rewards and discounting

To discount the rewards, we proceed like this:

• We define a discount rate gamma between 0 and 1, most of the time between 0.95 and 0.99. The larger the gamma, the smaller the discount: our agent cares more about the long-term reward. The smaller the gamma, the bigger the discount: our agent cares more about the short-term reward (in the classic cat-and-mouse illustration, the nearest cheese).
• Each reward is discounted by gamma to the exponent of the time step. As the time step increases (the cat gets closer), the future reward is less and less likely to happen (see the sketch below).
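A minimal Python sketch of this discounting (the reward list and gamma value below are illustrative, not from the slides):

```python
# Discounted return: G = r1 + gamma*r2 + gamma^2*r3 + ...
def discounted_return(rewards, gamma=0.95):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Four rewards of 1: later ones count less, so the total is below 4.
print(discounted_return([1, 1, 1, 1], gamma=0.95))  # ~3.71
```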
Observations/States Space
… are the information our agent gets from the environment

Observation Space
• State: a complete description of the state of the world (no hidden information).
• Observation: a partial description of the state of the world.
Action Space
… is the set of all possible actions in an environment

• Discrete: finite number of possible actions.
• Continuous: infinite number of possible actions.

In the case of a video game, it can be a frame (a screenshot); in the case of a trading agent, it can be the value of a certain stock, etc.
Taking this information into consideration is crucial because it will matter when choosing the RL algorithm in the future.
Task
… is an instance of a Reinforcement Learning problem. Two types: episodic and continuing

Types of Task
• Episodic: a starting point and an ending point (a terminal state).
• Continuing: a task that continues forever (no terminal state).

Episodic task
In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and new States. For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends when you are killed or you reach the end of the level.

Continuing task
These are tasks that continue forever (no terminal state). Here the agent must learn how to choose the best actions and simultaneously interact with the environment. For instance, an agent that does automated stock trading: there is no starting point or terminal state, and the agent keeps running until we decide to stop it.
Exploration vs Exploitation
… is the dilemma where an agent must decide between trying new actions (exploration) or sticking with known
beneficial actions (exploitation).

• Exploration: Trying new actions to discover their outcomes.
• Exploitation: Choosing actions known to yield
good rewards.
• Dilemma: Balancing between gaining new
knowledge (exploration) and maximizing
rewards with current knowledge (exploitation).
• Importance: Essential for an agent to adapt to
changing environments and avoid local optima.
• Strategies: Techniques like ε-greedy or UCB
(Upper Confidence Bound) help manage this
trade-off.
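As an illustration of one such strategy, here is a minimal sketch of ε-greedy action selection (function and variable names are our own, not from a specific library):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the best-known one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

action = epsilon_greedy([0.2, 0.5, 0.1])  # usually 1, occasionally a random index
```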
Policy vs. value based approaches
Policy vs. value based approaches
… two ways to train the agent: learn the policy directly (policy-based) or learn a value function and derive the policy from it (value-based)

Policy π : The agent‘s brain

Policy based vs. Value based methods

Bellman equation
Policy π: the agent’s brain
How do we build an RL agent that can select the actions that maximize its expected cumulative reward?

This policy is the function we want to learn. Our goal is to find the optimal policy π*, the policy that maximizes the expected return when the agent acts according to it. We find this π* through training.
There are two approaches to train our agent to find
this optimal policy π*:
• Directly, by teaching the agent to learn which
action to take, given the current state: Policy-
Based Methods.
• Indirectly, by teaching the agent to learn which state
is more valuable and then take the action that
leads to the more valuable states: Value-Based
Methods.
Policy based methods
How do we build an RL agent that can select the actions that maximize its expected cumulative reward?

In policy-based methods, we learn a policy function directly.
This function will define a mapping from each
state to the best corresponding action.
Alternatively, it could define a probability
distribution over the set of possible actions at
that state.
Policy based methods
How do we build an RL agent that can select the actions that maximize its expected cumulative reward?

• Deterministic: a policy at a given state will always return the same action.
  a = π(s)
  Example: State S₀ → π(S₀) → a₀ = Right

• Stochastic: the policy outputs a probability distribution over actions.
  π(a|s) = P[A = a | S = s]
  Example: State S₀ → π(·|S₀) → {Left: 0.1, Right: 0.7, Jump: 0.2}
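The same two policy types, sketched in Python on a toy action set (the probabilities mirror the example above; the function names are illustrative):

```python
import random

ACTIONS = ["Left", "Right", "Jump"]

def deterministic_policy(state):
    # a = pi(s): the same state always yields the same action
    return "Right"

def stochastic_policy(state):
    # pi(a|s): sample from a probability distribution over actions
    probs = [0.1, 0.7, 0.2]  # Left, Right, Jump
    return random.choices(ACTIONS, weights=probs)[0]
```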
Value based methods
How do we build an RL agent that can select the actions that maximize its expected cumulative reward?

In value-based methods, instead of learning a policy function, we learn a value function that maps a state to the expected value of being at that state.
The value of a state is the expected discounted return the agent can get if it starts in that state and then acts according to our policy.
“Act according to our policy” just means that our policy is “going to the state with the highest value”.

Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function (for example 7, then 6, then 5, and so on) until the goal is reached.
Policy vs. value based methods
How do we build an RL agent that can select the actions that maximize its expected cumulative reward?

• Policy-Based Methods: train the agent to learn which action to take given a state.
• Value-Based Methods: train the agent to learn which state is more valuable and take the action that leads to it.

So the difference is:
• In policy-based training, the optimal policy (denoted π*) is found by training the policy directly.
• In value-based training, finding an optimal value function (denoted Q* or V*, we’ll study the difference below) leads to having an optimal policy.
Link between value and policy

Finding an optimal value function leads to having an optimal policy: once we know the optimal action-value function Q*, the optimal policy simply selects, in each state, the action with the highest Q-value, i.e. π*(s) = argmax over a of Q*(s, a).
Two types of Value-based Methods
… the state-value function and the action-value function

We see that the difference is:
• For the state-value function, we calculate the value of a state St.
• For the action-value function, we calculate the value of the state-action pair (St, At), hence the value of taking that action at that state.

State-Value Function: calculates the value of a state.
Action-Value Function: calculates the value of a state-action pair.

For each state, the state-value function outputs the expected return if the agent starts at that state and then follows the policy forever afterward (for all future timesteps, if you prefer).

For each state-action pair, the action-value function outputs the expected return if the agent starts in that state, takes that action, and then follows the policy forever after. The value of taking action a in state s under a policy π is:

Qπ(s, a) = Eπ[ Rt+1 + gamma * Rt+2 + gamma² * Rt+3 + … | St = s, At = a ]
Bellman equation
… simplifies summing up all the rewards an agent can get if it starts at that state

With what we have learned so far, we know that if we calculate V(St) (the value of a state), we need to calculate the return starting at that state and then follow the policy forever after. (The policy we defined in the following example is a greedy policy; for simplification, we don’t discount the reward.)
So to calculate V(St), we need to calculate the sum of the expected rewards. Hence, the Bellman equation:

V(St) = Rt+1 + gamma * V(St+1)

In the interest of simplicity, here we don’t discount, so gamma = 1. But you’ll study an example with gamma = 0.99 in the Q-Learning section of this unit.

• The value of V(St) = immediate reward Rt+1 + discounted value of the next state (gamma * V(St+1)).
• The value of V(St+1) = immediate reward Rt+2 + discounted value of the next state (gamma * V(St+2)).
• And so on.

To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, which is a long process, we calculate the value as the sum of the immediate reward plus the discounted value of the state that follows (see the sketch below).
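A minimal sketch of this idea on a toy, deterministic chain of states (rewards are illustrative; gamma = 1 as in the example above):

```python
# Bellman backup on a chain: V(s) = R(s) + gamma * V(next state)
rewards = [1, 2, 3, 0]       # reward collected when leaving each state
gamma = 1.0                  # no discounting, as in the slide's example
values = [0.0] * len(rewards)

# Work backwards: each value is the immediate reward plus the value that follows.
for s in reversed(range(len(rewards) - 1)):
    values[s] = rewards[s] + gamma * values[s + 1]

print(values)  # [6.0, 5.0, 3.0, 0.0]
```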
Q-Learning
… is an off-policy value-based method that uses a TD approach to train its
action-value function

Temporal-Difference (TD) approach

What is Q-learning?

Q-Function & Q-Table

Example of maze for Q-Table calculation

Off-policy vs. On-policy


Q-Learning
combines ideas from Monte Carlo methods and dynamic programming by updating predictions based on the difference between predicted values at consecutive time steps

• Estimation: it estimates value functions without needing a complete model of the environment.
• Update Rule: The value estimates are updated using the
difference (or "temporal difference") between successive
estimates.
• Combination: TD learning combines the sample efficiency of
Monte Carlo methods (which estimate values based on
complete episodes) with the bootstrapping of dynamic
programming methods (which update values based on current
estimates)
• Popular Methods: TD learning forms the basis for popular
algorithms like Q-learning and SARSA.
Temporal Difference approach

… updates its action-value function at each step instead of at the end of the episode
Q-Learning
… is the algorithm we use to train our Q-function, an action-value function

Q-Learning is an off-policy value-based method that uses a TD approach to train its action-value function:

• Off-policy: we’ll talk about that at the end of this unit.
• Value-based method: finds the optimal policy indirectly by
training a value or action-value function that will tell us the
value of each state or each state-action pair.
• TD approach: updates its action-value function at each step
instead of at the end of the episode.

Q-Learning is the algorithm we use to train our Q-function, an action-value function that determines the value of being at a particular state and taking a specific action at that state.
Q-Function & Q-Table
… our Q-function outputs a state-action value; the Q-table is its memory

The Q comes from “the Quality” (the value) of that action at that state.
Let’s recap the difference between value and reward:

• The value of a state, or a state-action pair, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
• The reward is the feedback I get from the environment after performing an action at a state.

Given a state and an action as input, our Q-function outputs a state-action value (also called Q-value). To train the Q-function, we use the Q-learning algorithm.

Internally, our Q-function is encoded by a Q-table, a table where each cell corresponds to a state-action pair value. Think of this Q-table as the memory or cheat sheet of our Q-function.
The Q-table
… each cell stores the value of one state-action pair

The Q-table is initialized, which is why all values are 0. This table contains, for each state and action, the corresponding state-action value.
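In code, such a Q-table is simply a 2-D array of zeros (a sketch; the sizes here are illustrative, and NumPy is assumed):

```python
import numpy as np

n_states, n_actions = 16, 4
q_table = np.zeros((n_states, n_actions))  # one cell per state-action pair
print(q_table[3, 2])                        # value of action 2 in state 3 -> 0.0
```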
Temporal Difference (TD) Learning
• Concept: update estimates based on current and expected future rewards.
• TD Error: the difference between predicted and actual rewards.
• Role: the core mechanism in Q-learning updates.

Advantages of Q-Learning
• Model-free: doesn’t require knowledge of the environment dynamics.
• Convergence: guaranteed to find the optimal policy (given conditions).
• Flexibility: adaptable to various tasks and environments.

Limitations of Q-Learning
• Scalability: the Q-table size grows exponentially with states/actions.
• Exploration: can get stuck in local optima.
• Continuous spaces: not directly applicable without discretization or function approximation.
The Q-table
Step 4: Update Q(St, At)

Remember that in TD learning, we update our policy or value function (depending on the RL method we choose) after one step of the interaction.
To produce our TD target, we use the immediate reward Rt+1 plus the discounted value of the next state, computed by finding the action that maximizes the current Q-function at the next state. (We call that bootstrapping.)

Therefore, our Q(St, At) update formula goes like this:

Q(St, At) ← Q(St, At) + alpha * [ Rt+1 + gamma * max over a of Q(St+1, a) - Q(St, At) ]
Off-policy vs. On-policy
… the difference lies in which policy is used for acting and which for updating

Off-policy: using a different policy for acting and for updating.

On-policy: using the same policy for acting and for updating.
Cliff Walking Example
is a standard gridworld environment used to illustrate the difference
between on-policy and off-policy methods like SARSA and Q-learning

• The agent can take actions to move in one of the four cardinal directions:
up, down, left, or right.
• Moving into a "Cliff" state incurs a large negative reward (e.g., -100) and
sends the agent back to the "Start" state.
• Each other move typically has a small negative reward (e.g., -1),
incentivizing the agent to reach the goal quickly
Comparing SARSA and Q-Learning:
• SARSA: the agent tends to take a longer but safer route, avoiding the edge adjacent to the cliff, because it considers the future action, which might be exploratory and lead it into the cliff.
• Q-Learning: the agent usually learns the optimal policy and skirts dangerously close to the cliff for the shortest path to the goal, but it may occasionally fall into the cliff during exploration due to the greedy nature of its learning.
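The difference shows up in a single line: the TD target each method bootstraps on. A sketch (variable and function names are illustrative, with q_table as a NumPy array):

```python
def q_learning_target(q_table, reward, next_state, gamma=0.99):
    # Off-policy: bootstrap on the greedy (max-value) action at the next state.
    return reward + gamma * q_table[next_state].max()

def sarsa_target(q_table, reward, next_state, next_action, gamma=0.99):
    # On-policy: bootstrap on the action the behaviour policy actually chose next.
    return reward + gamma * q_table[next_state, next_action]
```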
Use Case: The taxi driver
… a taxi should pick up passengers and drop them off at their destination on a small parking lot

Task of the tutorial use case “Taxi driver”

Problem setting

Training the Q-Table

Test the trained Q-Table


Task of the tutorial use case “Taxi driver”
… a taxi should pick up passengers and drop them off at their destination on a small parking lot
Problem setting
… a 5×5 grid world with 500 states, 6 actions, and a simple reward function

State space
• Grid, number of fields: 25 (5×5)
• Pickup positions: 5 (Y, R, G, B, or in the taxi)
• Possible destinations: 4
→ Number of states: 500

Action space
• Down, Up, Left, Right
• Drop off passenger, Pick up passenger

Reward function
• Move: -1
• Failed drop-off: -10
• Successful drop-off: +10
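This environment is available off the shelf; a sketch assuming the gymnasium package and its Taxi-v3 environment:

```python
import gymnasium as gym  # assumption: pip install gymnasium

env = gym.make("Taxi-v3")
print(env.observation_space.n)  # 500 states
print(env.action_space.n)       # 6 actions: down, up, right, left, pick up, drop off
```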
Training the Q-Table
… iterate: select an action, observe the reward and next state, and update the Q-values

• Initialize Q-Table
• Start with a table of zeros for each state-action pair.
• Environment Interaction
• The agent takes actions in the environment based on the current Q-values, often using a strategy like ε-greedy for exploration.
• Receive Reward
• After taking an action, the agent observes a reward and the new state
from the environment.
• Update Q-Values
• Use the Q-learning update rule to adjust the Q-value of the taken action based on the received reward and the highest Q-value for the new state.
• Iterate
• Repeat the process of action selection, observation, and Q-value updates until a termination condition is met, such as a set number of episodes or convergence of the Q-table (see the training sketch below).
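A compact training sketch tying these steps together (assumes gymnasium's Taxi-v3 as above; the hyperparameters are illustrative):

```python
import numpy as np
import gymnasium as gym

env = gym.make("Taxi-v3")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon, episodes = 0.1, 0.99, 0.1, 5000

for _ in range(episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update: immediate reward + discounted best next value
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state
```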
Test the trained Q-Table
… run the taxi with the greedy policy derived from the learned Q-values (see the sketch below)
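A sketch of the test loop: act greedily (no exploration) with the learned Q-table and check the return per episode (assumes env and q_table from the training sketch above):

```python
for episode in range(3):
    state, _ = env.reset()
    total_reward, done = 0, False
    while not done:
        action = int(np.argmax(q_table[state]))  # always exploit
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        done = terminated or truncated
    print(f"episode {episode}: return {total_reward}")
```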
Links
i.e. sources for self-learning

Overview
• Sutton, Barto: Reinforcement Learning: An Introduction | http://incompleteideas.net/book/bookdraft2017nov5.pdf
• Artificial Intelligence: Reinforcement Learning in Python | https://www.udemy.com/course/artificial-intelligence-reinforcement-learning-in-python/
• Fundamentals of Reinforcement Learning | https://levelup.gitconnected.com/fundamental-of-reinforcement-learning-markov-decision-process-8ba98fa66060
• Reinforcement Learning Series: Overview of Methods | https://www.youtube.com/watch?app=desktop&v=i7q8bISGwMQ
• Reinforcement Learning in Machine Learning | https://pythongeeks.org/reinforcement-learning-in-machine-learning/
• Easy Introduction to Reinforcement Learning | https://www.scribbr.com/ai-tools/reinforcement-learning/
• A (Long) Peek into Reinforcement Learning | https://lilianweng.github.io/posts/2018-02-19-rl-overview/

Basic concepts of RL
• The Exploration/Exploitation trade-off | https://huggingface.co/learn/deep-rl-course/unit1/exp-exp-tradeoff
• Dynamic Programming RL | https://shirsho-12.github.io/blog/rl_dp/

Q-Learning
• Temporal-Difference (TD) Learning | https://towardsdatascience.com/introduction-to-reinforcement-learning-rl-part-6-temporal-difference-td-learning-2a12f0aba9f9
• Reinforcement Learning 6. Temporal Difference Learning | https://www.slideshare.net/SeungJaeLee17/reinforcement-learning-an-introduction-chapter-6
• Diving deeper into Reinforcement Learning with Q-Learning | https://medium.com/free-code-camp/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe
• Fundamentals of Reinforcement Learning: Navigating Cliffworld with SARSA and Q-learning | https://medium.com/gradientcrescent/fundamentals-of-reinforcement-learning-navigating-cliffworld-with-sarsa-and-q-learning-cc3c36eb5830
• Reinforcement Learning: SARSA and Q-Learning | https://arshren.medium.com/reinforcement-learning-sarsa-and-q-learning-e11ebe87dca9
• An introduction to Q-Learning: reinforcement learning | https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/
• Walking Off The Cliff With Off-Policy Reinforcement Learning | https://towardsdatascience.com/walking-off-the-cliff-with-off-policy-reinforcement-learning-7fdbcdfe31ff

Tutorial Applications
• Solving The Taxi Environment With Q-Learning: A Tutorial | https://towardsdatascience.com/solving-the-taxi-environment-with-q-learning-a-tutorial-c76c22fc5d8f
• Text-Flappy Bird | https://aspram.medium.com/learning-flappy-bird-agents-with-reinforcement-learning-d07f31609333
• Practical Reinforcement Learning using Python - 8 AI Agents | https://www.udemy.com/course/practical-reinforcement-learning/

Real world Applications
• 9 awesome real world applications of Reinforcement Learning | https://medium.com/@mlblogging.k/9-awesome-applications-of-reinforcement-learning-e1306ed25c09
• Reinforcement Learning and its Real-Life Applications | https://blogs.skillovilla.com/reinforcement-learning-and-its-real-life-applications/
• Mastering the game of Go with deep neural networks and tree search | https://www.nature.com/articles/nature16961.pdf
