Introduction To Reinforcement Learning
Reinforcement Learning
Definition
Applications
Reinforcement learning works by building agents that learn from the environment and receive rewards (positive or negative) as unique feedback.
What is Reinforcement Learning?
It is a type of machine learning where agents learn how to behave in an environment by performing actions and receiving rewards as feedback.
It is used in applications such as:
• Game Playing (e.g., AlphaGo)
• Robotics (e.g., robot locomotion)
• Recommendation Systems
• Financial Portfolio Management
• Autonomous Vehicles
The 3 types of machine learning
• Supervised Learning:
• A machine learning paradigm where a model is trained on labeled data to make predictions or classifications. The model learns a function that maps input features to output labels.
• Unsupervised Learning:
• A type of machine learning that deals with unlabeled data and aims to identify underlying patterns or structures. The model learns to represent the data without explicit guidance.
• Reinforcement Learning:
• A type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize cumulative rewards over time.
The 3 types of machine learning: a comparison

Unsupervised learning
• General tasks: discovery of clusters, patterns, relationships
• Examples: customer segmentation, product recommendation
• Feedback: no

Supervised learning
• General tasks: classification, regression
• Examples: image detection, stock market prediction
• Feedback: yes, the correct set of actions (labels) is provided

Reinforcement learning
• General tasks: solution of reward-based problems by exploration and exploitation
• Examples: game playing, robotic vacuum cleaners
• Feedback: yes, through rewards and punishments (positive and negative rewards)
Basic concepts of RL
And what is it good for?
Agent-Environment interaction
Exploration-Exploitation tradeoff
Agent-environment interaction
… is modeled as a Markov decision process, which provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker.
Reinforcement Learning process
… is a Markov decision process (MDP): a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of the decision-maker.
The Markov property implies that our agent needs only the current state to decide what action to take, not the history of all the states and actions it took before.
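In symbols, the Markov property can be written as follows (a standard formulation, not taken verbatim from the slides):

```latex
P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_1, A_1, S_2, A_2, \ldots, S_t, A_t)
```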
Example: Video game
… illustrates a Markov decision process in the RL context.
The RL loop
… outputs a sequence of state, action, reward and next state.
• We define a discount rate gamma between 0 and 1, most of the time between 0.95 and 0.99.
(The larger the gamma, the smaller the discount: our agent cares more about the long-term reward. The smaller the gamma, the bigger the discount: our agent cares more about the short-term reward, i.e. the nearest cheese.)
• Each reward will be discounted by gamma raised to the power of the time step. As the time step increases, the cat gets closer to us, so the future reward is less and less likely to happen.
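As a minimal illustration of the discounting rule above, here is a small Python sketch; the reward list and gamma values are made up for the example.

```python
# Discounted return: each reward is weighted by gamma raised to its time step.
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1, 1, 1, 10]  # illustrative rewards: small step rewards, then a big final reward
print(discounted_return(rewards, gamma=0.99))  # high gamma: the far-away reward still counts
print(discounted_return(rewards, gamma=0.50))  # low gamma: the far-away reward is heavily discounted
```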
Observations/States Space
… are the information our agent gets from the environment
State: a complete description of the state of the world (no hidden information).
Observation: a partial description of the state of the world.
Action Space
… is the set of all possible actions in an environment
In the case of a video game, the observation can be a frame (a screenshot); in the case of a trading agent, it can be the value of a certain stock, etc.
Taking this information (the observation and action spaces) into consideration is crucial, because it matters when choosing the RL algorithm later.
Task
… is an instance of a Reinforcement Learning problem. Two types: episodic and continuing
Episodic task
In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of states, actions, rewards, and new states.
For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends when you are killed or you reach the end of the level.

Types of task
Episodic: a starting point and an ending point (a terminal state)
Continuing: a task that continues forever (no terminal state)
Value-based approaches

Policy vs. value-based approaches
… two ways of building an agent that selects the actions that maximize its expected cumulative reward: learn a policy directly, or learn a value function and derive the policy from it.
Bellman equation
Policy π: the agent’s brain
How do we build an RL agent that can select the actions that maximize its expected cumulative reward?
Deterministic policy: State S₀ → π(S₀) → a₀ = Right
Stochastic policy: State S₀ → π(a|S₀) → {Left: 0.1, Right: 0.7, Jump: 0.2}
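A small sketch contrasting the two kinds of policy; the state name, actions, and probabilities are illustrative, not part of any real environment.

```python
import random

# Deterministic policy: maps a state directly to one action, a = pi(s).
def deterministic_policy(state):
    return "Right"  # e.g. always move right in state S0

# Stochastic policy: maps a state to a probability distribution over actions, pi(a|s).
def stochastic_policy(state):
    return {"Left": 0.1, "Right": 0.7, "Jump": 0.2}

probs = stochastic_policy("S0")
action = random.choices(list(probs), weights=list(probs.values()))[0]  # sample one action
print(deterministic_policy("S0"), action)
```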
Value based methods
How do we build an RL agent that can select the actions that maximize its expected cumulative reward?
In value-based methods, instead of learning a policy function, we learn a value function that maps a state to the expected value of
being at that state.
The value of a state is the expected discounted return the agent can get if it starts in that state, and then acts according to our policy.
“Act according to our policy” just means that our policy is “going to the state with the highest value”.
Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: 7, then 6, then 5, and so on, until it attains the goal.
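In standard notation, the state-value function described above is the expected discounted return when starting in state s and then following the policy π:

```latex
V_{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right]
           = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s \right]
```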
Policy vs. value based methods
How do we build an RL agent that can select the actions that maximize its expected cumulative reward?
To recap, the idea of the Bellman equation is that instead of calculating each value
as the sum of the expected return, which is a long process, we calculate the value
as the sum of immediate reward + the discounted value of the state that follows.
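Written out, this recap corresponds to the standard Bellman expectation equation for the state-value function:

```latex
V_{\pi}(s) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma \, V_{\pi}(S_{t+1}) \mid S_t = s \right]
```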
Q-Learning
… is an off-policy, value-based method that uses a temporal-difference (TD) approach to train its action-value function
What is Q-learning?
… updates its action-value function at each step instead of at the end of the episode
Q-Learning
The Q comes from "the Quality" (the value) of that action at that state.
Let's recap the difference between value and reward: the value of a state is the expected cumulative (discounted) reward the agent gets by starting in that state and then acting according to its policy, whereas the reward is the immediate feedback the environment returns after a single action.
The Q-table is initialized. That’s why all values are = 0. This table contains, for each state and
action, the corresponding state-action values.
The Q-table
… is a table that stores, for each state-action pair, the corresponding state-action value (Q-value).
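A minimal sketch of that initialization in Python; the state and action counts are illustrative (e.g. a small gridworld), not taken from the slides.

```python
import numpy as np

n_states, n_actions = 16, 4                 # illustrative sizes, e.g. a 4x4 gridworld
q_table = np.zeros((n_states, n_actions))   # every state-action value starts at 0
print(q_table.shape)                        # (16, 4)
```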
Training
Off-policy
On-policy
Off-policy vs. on-policy

Off-policy: using a different policy for acting (collecting experience) and for updating (learning)
On-policy: using the same policy for acting and for updating
Cliff Walking Example
is a standard gridworld environment used to illustrate the difference
between on-policy and off-policy methods like SARSA and Q-learning
• The agent can take actions to move in one of the four cardinal directions:
up, down, left, or right.
• Moving into a "Cliff" state incurs a large negative reward (e.g., -100) and
sends the agent back to the "Start" state.
• Each other move typically has a small negative reward (e.g., -1),
incentivizing the agent to reach the goal quickly
Comparing SARSA and Q-Learning:
§SARSA:
The agent tends to take a longer but safer route, avoiding the edge adjacent to the cliff, because its update considers the next action it will actually take, which might be exploratory and lead it into the cliff.
§Q-Learning:
The agent usually learns the optimal policy, which skirts dangerously close to the cliff on the shortest path to the goal, but it may occasionally fall into the cliff during training because it keeps exploring while learning the greedy (maximum-value) policy.
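A sketch of the two update rules behind this difference; the learning rate, discount factor, and variable names are illustrative, and q is assumed to be a NumPy array indexed by [state, action].

```python
import numpy as np

alpha, gamma = 0.1, 0.99  # illustrative learning rate and discount factor

def q_learning_update(q, s, a, r, s_next):
    # Off-policy: bootstrap from the greedy (max) value of the next state,
    # regardless of which action the behaviour policy will actually take there.
    q[s, a] += alpha * (r + gamma * np.max(q[s_next]) - q[s, a])

def sarsa_update(q, s, a, r, s_next, a_next):
    # On-policy: bootstrap from the action actually chosen in the next state
    # (possibly exploratory), so states next to the cliff look more dangerous.
    q[s, a] += alpha * (r + gamma * q[s_next, a_next] - q[s, a])
```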
Use Case: The taxi driver
… a small gridworld problem in which the agent (a taxi) must pick up a passenger and drop them off at the correct destination, learning from rewards by training a Q-table.
Problem setting
State space
§Grid: 25 fields (5 × 5)
§Pickup positions: 5 (Y, R, G, B, or in the taxi)
§Possible destinations: 4
→ Number of states: 500
Action space
§Down, up, left, right
§Drop off, pick up passenger
Reward function
§Move: -1
§Failed drop-off: -10
§Successful drop-off: +10
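A quick check of the state-space size described above and the resulting Q-table shape (the 6 actions are the four moves plus pick up and drop off):

```python
import numpy as np

n_states = 25 * 5 * 4        # 25 taxi positions x 5 passenger locations x 4 destinations = 500
n_actions = 6                # down, up, left, right, pick up, drop off
q_table = np.zeros((n_states, n_actions))
print(n_states, q_table.shape)   # 500 (500, 6)
```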
Training the Q-Table
… proceeds by iterating the following steps (see the code sketch after this list):
• Initialize Q-table
• Start with a table of zeros for each state-action pair.
• Environment interaction
• The agent takes actions in the environment based on the current Q-values, often using a strategy like ε-greedy for exploration.
• Receive reward
• After taking an action, the agent observes a reward and the new state from the environment.
• Update Q-values
• Use the Q-learning update rule to adjust the Q-value of the taken action based on the received reward and the highest Q-value of the new state.
• Iterate
• Repeat the process of action selection, observation, and Q-value updates until a termination condition is met, such as a set number of episodes or convergence of the Q-table.
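A compact training-loop sketch following the steps above, assuming a Gymnasium-style environment with discrete states and actions; Taxi-v3 is used as an example (its exact reward values may differ slightly from the table in the problem setting), and the hyperparameters are illustrative.

```python
import numpy as np
import gymnasium as gym

env = gym.make("Taxi-v3")
q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon, n_episodes = 0.1, 0.99, 0.1, 5000  # illustrative hyperparameters

for _ in range(n_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection: explore with probability epsilon, otherwise exploit.
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        # Q-learning update: immediate reward + discounted best value of the next state.
        q[state, action] += alpha * (reward + gamma * np.max(q[next_state]) - q[state, action])
        state = next_state
        done = terminated or truncated
```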
Test the trained Q-Table
… is done by running episodes with the learned table and a purely greedy policy: in each state, select the action with the highest Q-value, without exploration, and check the rewards the agent collects.
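A short evaluation sketch, continuing the training sketch above (it reuses env, q, and np from that block) and acting greedily with no exploration:

```python
# Run one greedy episode with the trained Q-table (no exploration).
state, _ = env.reset()
total_reward, done = 0, False
while not done:
    action = int(np.argmax(q[state]))                  # always take the best-known action
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated
print("Episode return:", total_reward)
```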
Further reading
• Real-world applications: "9 awesome real world applications of Reinforcement Learning", https://medium.com/@mlblogging.k/9-awesome-applications-of-reinforcement-learning-e1306ed25c09